Sorry, I realize this reads like a book, and as such may not even be the best forum for this, but SO is usually the first and best place I come to for answers to anything.
This question is going to be necessarily high-level and abstract. We're Elasticsearch/NEST newbies, so a lot of the things we're doing might be crazy and terrible (they almost surely are), and that's why I'm looking for some solid answers on optimizing our application.
Please keep in mind when reading this and formulating responses that many choices were driven by engineering/architecture concerns. Since we're using C#, and our application will always be somewhat fluid, we take great pains to design things in a way that will make it tolerant to changes we expect to happen, while also making those changes as easy as possible.
That is to say, we use the constructs of the language to ensure that, e.g., adding a new type to be indexed requires only a few changes in a few places (new implementers of the interfaces that correspond to the type, etc.), or that adding a new field to be searched requires an easily followed set of changes from front-end to back-end.
This entails a heavy reliance on generics, interfaces, and attributes with reflection, and most definitely avoids things like type-checking, if/else/switch statements, and string literals. This means that, for example, we don't do anything to manually build up a query in JSON. Rather, our code looks at an object, looks at the properties on the object, looks at the attribute on each property, uses that attribute to determine what kind of strongly-typed QueryContainer should be created from the property's and object's values, then hands it off to NEST to do whatever it does.
Our goal is to have things fail as soon as possible, hopefully at compile-time, and if they happen to be at runtime, then they happen immediately upon application startup, rather than while exercising a random feature. JavaScript devs may therefore be completely confused by our problems, or the concerns that drove our choices :-)
End Foreword
So, consider the following scenario:
Types: action, document, documentText
An action can be associated with many documents, and a document can be associated with many actions. An action has metadata fields on it that can be searched, and it also contains the document associations (as IDs). A document has many metadata fields on it that can be searched. The document can also have its full text searched. We only want/need to return a subset of data, which exists on the document, as well as highlights, which represent the reason that a particular document was hit.
We have a basic search type, which allows a user to enter a word or words. If that query matches anything in an action, then any document associated with the action should return. Any document that has metadata that is responsive to the query should also be returned. Likewise, if the document text is responsive, then those documents should also be returned.
We also have an advanced search, which allows a user to search on particular action metadata, document metadata, or document text (text searching allows "all words", "any words", "without words", and "exact phrase"). In this scenario, a document is only returned if it matches in every phase of the query, i.e. it must be associated with an action that matches the action query, and the document query and documentText query are both responsive.
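For anyone who finds JSON clearer than prose: the difference between the two modes boils down to which bool clause the sub-queries land in. A hand-written sketch of the advanced case (not our actual generated query; field values are placeholders):

```json
{
  "query": {
    "bool": {
      "must": [
        { "ids":   { "values": ["ids of documents associated with matching actions"] } },
        { "match": { "_all": "document metadata terms" } },
        { "match": { "text": "document text terms" } }
      ]
    }
  }
}
```

In the basic search, the same clauses would sit in a `should` array instead, so a hit on any one of them is enough to return the document.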
Each type is located in its own index. We expect production environments to have hundreds of thousands to millions of documents (with full text).
We currently accomplish this with a 5-pass search:
1) the actions are searched (field-by-field for the advanced search, on _all for the basic search). From: 0 Size: something very large ---> Why? Because we need to find all documents that might be associated with any matching actions. The associated document IDs are extracted from the result set of this query. They are then...
2) fed into the documentText query as an IdsQuery, along with the text query. Source and highlighting are disabled for this pass. The results of this search are then passed to the document metadata in the same manner. If it's a basic search, the IdsQuery becomes a "should"; in an advanced search, it's a "must". From: 0 Size: something very large Ultimately, we're only going to display a page of 10 results. However, we need to find all documents that are responsive to this pass of the query, because there's no guarantee that the first 10 (or even 1,000 or 10,000) will be responsive to the document metadata pass, in which case we risk not getting the full result set of 10 hits that we want. So, again, IDs are...
3) fed into the document metadata pass, same as above. Again, source and highlighting are disabled. From: user-stipulated page Size: 10. These are the ultimate results. Those ten are then...
4) fed into another document metadata pass. Source and highlighting are enabled, so we can return things to display. The same ten ID's are...
5) fed into another document text pass. Only highlighting is enabled
We aggregate highlights in our code, then pass the 10 results back to the user.
The problem with the above is that it is not scaling well. At all. I now have a system with a million documents. I bulk ingested these documents from test data, and 600,000+ of them are identical. If I input a word that matches this document, I get the following timings in our server-side code (using RedGate ANTs performance profiler):
Pass 1: trivial (only because the word doesn't match anything; ultimately, this pass will have to be optimized like the others, e.g. so that not every result is highlighted and not all source fields are returned on the initial pass)
Pass 2: 21 seconds
Pass 3: 11.2 seconds (remember that this one contains an IdsQuery with 600,000+ IDs, so that makes a little sense to me)
Pass 4: 4.8 seconds (utterly confusing to me)
Pass 5: 58 milliseconds (pretty much what I'd expect)
Now, I think I understand why Pass 2 takes so long. It's because it has to search, score, and sort 600,000 documents (yes, we DO want to score these, as we prioritize hits in document text, although we have no idea yet how to score things that are found in different indexes). At least, I thought I understood why. However, if I input something like "pdf", which will be found in some documents' text, but in most documents' metadata (as the indexed filename), then my first page returns lightning quick (a matter of milliseconds, probably), despite having 500,000+ hits.
I know you're probably wondering: why is document text separated from document metadata? Good question. This is something that could potentially go away. We actually just made this change, because we were having BIG problems with indexing. The problem was actually in file I/O, and also in extracting the text from a PDF, such that re-indexing a million documents would have taken 5-7 days, and that's on relatively short test documents, and not the 600+ page documents we expect in production.
Since the schema for the document metadata is expected to be pretty fluid over our first few releases, we determined that it would not be acceptable to have to take the hit for re-indexing text that hasn't changed. While this happens off-line, 5 days is still a ridiculously long time. Also, we use these indexes for other purposes, which are independent of the document text, such that we might want to index or re-index document metadata without losing the text.
This might be able to go away because a) we are now aware of the scroll capability, which we hope will allow us to re-index everything without having to deal with the file I/O, and b) because we swapped out our PDF extraction library, which dramatically sped up the read process, so that even now the file I/O may not be an issue.
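For the record, the scroll flow we're looking at for re-indexing would be roughly the following (untested on our part; index name and batch size are just illustrative). Sorting on `_doc` should be the cheap option, since we don't need scoring for a re-index:

```json
POST /indexdocumenttext/_search?scroll=1m
{
  "size": 500,
  "sort": ["_doc"],
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}
```

Each `_search/scroll` call, as we understand it, returns the next batch plus a fresh `_scroll_id`, and you repeat until the hits come back empty.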
Unfortunately, splitting the document text out into its own index was done at the same time as the file I/O optimizations (which ultimately allowed us to get enough data into the system to reveal our massive scaling problems). That said, we don't have any benchmarks on how long our queries would take on this huge result set if document searching didn't require multiple passes.
So, finally, some questions:
Where will our biggest performance gains be found? Right now, we're stabbing around in the dark.
Should we put document text and document metadata back into the same index and type? This would entail engineering effort to re-design/re-architecture our indexing code and infrastructure, but if it brings us down to reasonable search times, then so be it.
How do we efficiently feed results from one query pass into another? An IdsQuery with 600,000 IDs just seems like it's going to be inherently slow. But even if we do the above for documents, we'll still have the same issue with actions. I know the canonical answer would probably be to denormalize everything, put it all into the same index, and do a single search pass. But we feel this would be untenable, given all the inherent problems of denormalization and of keeping everything in sync.
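One thing we've been reading about, in case it changes anyone's answer: if actions and documents lived in the same index, a parent/child mapping would apparently let a single `has_parent` query replace the feed from pass 1. Sketch only, and note that a child can have exactly one parent, so our many-to-many action/document association wouldn't fit this directly:

```json
PUT /records
{
  "mappings": {
    "action": {},
    "document": {
      "_parent": { "type": "action" }
    }
  }
}

POST /records/document/_search
{
  "query": {
    "has_parent": {
      "parent_type": "action",
      "query": { "match": { "_all": "pornography" } }
    }
  }
}
```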
Is there something we can do to optimize our environment? Right now, we have 3 different indexes, each with its own type. Each index has the default 5 shards. Are there any gains to be had by increasing/reducing the number of shards, putting types into the same index, etc.?
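(Shard count, as we understand it, can only be set at index creation, so experimenting would mean re-creating each index with explicit settings, e.g.:)

```json
PUT /indexdocument
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```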
Is Scroll the sort of thing we should be using for our paging? Our product owner is insistent on returning all results and allowing users to page to anything (even page 63000). We insist that this is totally stupid, and that results that numerous are useless to an end-user. The engineering team would rather limit the number of results. However, there are certain use cases that will require getting all results. So, is the scroll API something that is meant to be used for normal paging in a UI, or something that's really meant for more efficient batch processing (e.g. re-indexing)? We'd rather use the same method/code for getting these results. So we could limit the results that the user sees, and the use case that requires all results could be done in background/off-line processing. Conversely, if scroll allows the user to do deep paging, then we could use that method for both, without limiting the result set. I don't imagine that scroll will help us in traversing to random pages, however.
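As far as we can tell, the cost of deep from/size paging is that every shard has to collect and rank `from + size` candidates, so a request like the one below makes each shard rank 630,000 hits just to return 10 (and newer versions reject anything past `index.max_result_window`, 10,000 by default, unless you raise it):

```json
POST /indexdocument/_search
{
  "from": 629990,
  "size": 10,
  "query": { "match_all": {} }
}
```

Scroll avoids that per-page cost but only moves forward, which matches our suspicion that it can't support jumping to a random page.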
Anything I haven't thought of? As I said, we're newbies, so we're probably not aware of a lot of the capabilities of ElasticSearch. Is there any way to combine these searches on different types in different indexes into a single query, such that ElasticSearch and/or NEST do all the work? Where are we going wrong in our thinking?
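The closest thing we've found to a single combined query is that a search can target several indexes in one request; whether the scores would be comparable across our heterogeneous types is exactly the part we don't understand. E.g.:

```json
POST /indexaction,indexdocument,indexdocumenttext/_search
{
  "size": 10,
  "query": { "match": { "_all": "pornography" } }
}
```

(Though with `IncludeInAll = false` on our `Text` field, an `_all` match wouldn't actually cover document text as currently mapped.)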
All input will be appreciated.
UPDATE
Here are the POCO's. For the most part, the properties that represent the metadata are the only thing that need be considered here. The custom attributes are only relevant for creating the type mappings at index creation. A select few things in here are not searchable, but are indexed to show to the user.
[ElasticsearchType]
public class IndexDocument : GridViewable {
[Column(Analyzer = ElasticConstants.STANDARD_ENGLISH_ANALYZER)]
public string Name { get; set; }
[Column]
public string Markings { get; set; }
[Column(IncludeInAll = false)]
public string SerialNumber { get; set; }
[Column]
public string Classification { get; set; }
[Column]
public string Category { get; set; }
[Column]
public string Series { get; set; }
[Column]
public string CreatedBy { get; set; }
[Column]
public string CheckedOutTo { get; set; }
[Column]
public string Locations { get; set; }
[Nested]
public IList<IndexNeedToKnow> NeedToKnow { get; } = new ListWithDefault<IndexNeedToKnow>(IndexNeedToKnow.EMPTY_NTK);
[Object]
public IList<IndexAuditLog> AuditLog { get; } = new List<IndexAuditLog>();
[SearchDate]
public string DateOfRecord { get; set; }
[String(NullValue = IndexDefaultValue.DefaultNullValue)]
public string Disposition { get; set; }
[SearchDate]
public string DeclassificationDate { get; set; }
[SearchDate]
public string FutureDispositionDate { get; set; }
[String(NullValue = IndexDefaultValue.DefaultNullValue, IncludeInAll = false)]
public string HasPii { get; set; }
[SearchDate]
public string FutureReviewDate { get; set; }
[String(NullValue = IndexDefaultValue.DefaultNullValue)]
public string RecordType { get; set; }
[String(NullValue = IndexDefaultValue.DefaultNullValue)]
public string Saccp { get; set; }
[String(NullValue = IndexDefaultValue.DefaultNullValue)]
public string RecordStatus { get; set; }
[Object(IncludeInAll = false)]
public IList<ActionAssociation> AssociatedActions { get; } = new ListWithDefault<ActionAssociation>(ActionAssociation.NO_ACTION);
[String(IncludeInAll = false)]
public string Mime { get; set; }
[Number(IncludeInAll = false)]
public int Version { get; set; }
}
[ElasticsearchType]
public class IndexAction : GridViewable, ISuggestable<IndexActionSuggestionPayload> {
[Column]
public string Name { get; set; }
[Completion(Analyzer = Searcher.LOWERCASE_KEYWORD_ANALYZER, Payloads = true)]
public SuggestionField<IndexActionSuggestionPayload> Suggestion
{
    get
    {
        return new SuggestionField<IndexActionSuggestionPayload>
        {
            Input = new[] { Name },
            Output = Name,
            Payload = new IndexActionSuggestionPayload
            {
                Id = Id,
            }
        };
    }
}
[Column(IncludeInAll = false)]
public string Program { get; set; }
[Column]
public string ActionOfficer { get; set; }
[Column]
public string Manager { get; set; }
[Column]
[SearchDate]
public string Suspense { get; set; }
[Column]
public List<string> Categories { get; set; }
[Column]
public string FiscalYear { get; set; }
[Column]
public string FiscalQuarter { get; set; }
[Column]
public string State { get; set; }
[Column]
[SearchDate]
public string Close { get; set; }
[Column]
public string CreatedBy { get; set; }
public List<IndexApprovers> Approvers { get; set; } = new List<IndexApprovers>();
[Object]
public List<IndexActionTask> ActionTasks { get; set; } = new List<IndexActionTask>();
public string ActivityCount { get; set; }
[SearchDate]
public string ApprovalDate { get; set; }
public string ApprovalRole { get; set; }
[Nested]
public List<IndexOrganization> Organizations { get; set; } = new List<IndexOrganization>();
[FreeText]
public string Description { get; set; }
public List<IndexExternalRefNumber> ExternalReferenceNumbers { get; set; }
[Number(NumberType.Float)]
public string FinalCost { get; set; }
[String(NullValue = IndexDefaultValue.DefaultNullValue)]
public string IsCovertAction { get; set; }
public string LegalJustification { get; set; }
public List<string> Locations { get; set; }
[FreeText]
public string Notes { get; set; }
public List<string> PointsOfContact { get; set; }
[Number(NumberType.Float)]
public string ProjectedCost { get; set; }
public string ProjectName { get; set; }
[String(NullValue = IndexDefaultValue.DefaultNullValue)]
public string IsStaffingRequired { get; set; }
public string Status { get; set; }
[String(IncludeInAll = false, Index = FieldIndexOption.NotAnalyzed)]
public List<Guid> DocumentIDs { get; set; } = new List<Guid>();
public string Classification { get; set; }
}
[ElasticsearchType]
public class IndexDocumentText : GridViewable {
[FreeText(IncludeInAll = false)]
public string Text { get; set; }
[Nested]
public IList<IndexNeedToKnow> NeedToKnow { get; } = new ListWithDefault<IndexNeedToKnow>(IndexNeedToKnow.EMPTY_NTK);
}
And here's the JSON for the query that is created from a 'basic' search:
Pass 1, on indexaction index
{
"timeout": "120s",
"from": 0,
"size": 10000000,
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"match": {
"_all": {
"query": "pornography",
"operator": "and"
}
}
}
]
}
}
]
}
},
"highlight": {
"pre_tags": [
"<b>"
],
"post_tags": [
"</b>"
],
"number_of_fragments": 10,
"fields": {
"name": {},
"name.highlight": {},
"actionOfficer": {},
"actionOfficer.highlight": {},
"manager": {},
"manager.highlight": {},
"suspense": {},
"suspense.highlight": {},
"categories": {},
"categories.highlight": {},
"fiscalYear": {},
"fiscalYear.highlight": {},
"fiscalQuarter": {},
"fiscalQuarter.highlight": {},
"state": {},
"state.highlight": {},
"close": {},
"close.highlight": {},
"activityCount": {},
"approvalDate": {},
"approvalDate.highlight": {},
"approvalRole": {},
"description": {},
"externalReferenceNumbers.name": {},
"finalCost": {},
"finalCost.highlight": {},
"isCovertAction": {},
"legalJustification": {},
"locations": {},
"notes": {},
"pointsOfContact": {},
"projectedCost": {},
"projectedCost.highlight": {},
"projectName": {},
"isStaffingRequired": {},
"status": {},
"actionTasks.completionDate": {},
"actionTasks.dueDate": {},
"actionTasks.description": {},
"actionTasks.owner": {},
"actionTasks.status": {},
"organizations.name": {}
},
"require_field_match": false
}
}
Pass 2 on the indexdocumenttext index; notice the ids query, which is built from the results of the previous pass (and yes, I just noticed that it's in there twice, which it should not be). The filter at the bottom is part of a security check. Also, notice the _source: { "exclude": ["*"] }. Testing with the Head plugin has shown this to be slower than "_source": false, but NEST inexplicably removed that capability.
{
"timeout": "120s",
"from": 0,
"size": 10000000,
"_source": {
"exclude": [
"*"
]
},
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{
"ids": {
"values": [
"00000000-0000-0000-0000-000000000000",
"00000000-0000-0000-0000-000000000003"
]
}
}
]
}
},
{
"bool": {
"should": [
{
"bool": {
"must": [
{
"match": {
"text": {
"query": "pornography",
"fuzziness": 1,
"operator": "and"
}
}
}
],
"should": [
{
"match": {
"text": {
"query": "pornography",
"fuzziness": 1,
"slop": 20,
"type": "phrase"
}
}
}
]
}
},
{
"ids": {
"values": [
"00000000-0000-0000-0000-000000000000",
"00000000-0000-0000-0000-000000000003"
],
"boost": 2
}
}
]
}
},
{
"bool": {
"filter": [
{
"nested": {
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"match": {
"needToKnow.id": {
"query": "1"
}
}
},
{
"match": {
"needToKnow.type": {
"query": "1"
}
}
}
]
}
},
{
"bool": {
"must": [
{
"match": {
"needToKnow.id": {
"query": "0"
}
}
},
{
"match": {
"needToKnow.type": {
"query": "0"
}
}
}
]
}
}
]
}
},
"path": "needToKnow"
}
}
]
}
}
]
}
}
}
Pass 3 will look much the same, except that the search term is on the _all field, we only need 10 results, and the ids query may be HUGE. An advanced search will look much the same as these, except that there will be more items in the bool query array, each of which will target a specific field with a specific value. Also, the ids query would be a "must" in that case, since we would only want to return results that were responsive to every field.