
I am trying to index a record set of 5 billion rows, or even more, using Lucene. Does indexing time increase exponentially as the record set grows?

My initial indexing of 10 million records happened very quickly, but when I tried to index more than 100 million records, it took far more time than I expected relative to the 10 million record indexing time.

Is it because each new document is indexed against a larger set of existing documents, hence the time increases exponentially? Or what could be the reason behind this behavior, and is there any way to optimize it? (Please note: currently all fields in all the documents are of type StringField; will changing them to IntField help in this direction?)

My second question would be how will the search performance be in case of indexing 5 billion records. Any ideas on that?

Let me know if you need more information from my end on this.

  • Can I scale by splitting my index creation process across multiple JVMs, all reading from different files with the same schema and writing to the same index folder location? Commented Mar 5, 2015 at 14:18
  • There is a hard limit of 2 billion documents in a single Lucene index, so you will have to distribute somehow. Commented Apr 27, 2015 at 20:10

3 Answers


Our current use case seems somewhat similar to yours: 1.6 billion rows, most fields are exact matches, periodic addition of files/rows, regular searching. Our initial indexing is currently not distributed or parallelized in any way and takes around 9 hours. I offer that number only to give you a very vague sense of what your indexing experience may be.

To try and answer your questions:

  1. Our indexing time does not grow exponentially with the number of rows already indexed, though it does slow down very gradually. For us, perhaps 20% slower by the end, though it could also be specific to our data.

    If you are experiencing significant slow-down, I support femtoRgon's suggestion that you profile to see what's eating the time. Lucene has never been the slowest/weakest component in our system.

  2. Yes, you can write to your index in parallel, and you may see improved throughput. Whether it helps or not depends on where your bottlenecks are, of course. Consider using Solr - it may ease your efforts here. (One way to parallelize within a single JVM is sketched after this list.)

  3. We use a mixture of StringField, LongField, and TextField. It seems unlikely that the type of field is causing your slowdown on its own.
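On point 2: IndexWriter is documented as thread-safe, so one way to parallelize within a single JVM is to share one writer across several feeder threads. A minimal sketch, assuming the Lucene 5.x-style API (the thread count, document count, "index-dir" path, and "id" field are placeholders, not recommendations):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index-dir")), config)) {
                // IndexWriter is thread-safe: one shared writer, many feeder threads.
                ExecutorService pool = Executors.newFixedThreadPool(8);
                for (int t = 0; t < 8; t++) {
                    final int threadId = t;
                    pool.submit(() -> {
                        for (int i = 0; i < 100_000; i++) {
                            Document doc = new Document();
                            doc.add(new StringField("id", threadId + "-" + i, Field.Store.YES));
                            writer.addDocument(doc); // concurrent calls are supported
                        }
                        return null;
                    });
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
            }
        }
    }

Note that multiple JVMs writing to the same index directory, as asked in the comments above, is a different matter: Lucene enforces a write lock, so only one IndexWriter can have a given directory open at a time, and cross-process parallelism therefore means writing to separate indexes/shards.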

These answers are all anecdotal, but perhaps they'll be of some use to you.

This page is now quite dated, but if you exhaust all your other options, it may provide hints about which levers you can pull to tweak performance: How to make indexing faster




Have you profiled to see what is actually causing your performance issues? You could find that something unexpected is eating up all that time. When I profiled a similar performance issue that I thought was caused by Lucene, it turned out the problem was mostly string concatenation.
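If a full profiler isn't handy, even crude manual timing can show whether the time goes into building your documents or into Lucene itself. A self-contained sketch along those lines, assuming the Lucene 5.x-style API (the "timing-index" path, "id" field, and row count are made up for illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class TimingCheck {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("timing-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                long buildNanos = 0, indexNanos = 0;
                for (int i = 0; i < 1_000_000; i++) {
                    long t0 = System.nanoTime();
                    Document doc = new Document();           // your document construction
                    doc.add(new StringField("id", "row-" + i, Field.Store.YES));
                    long t1 = System.nanoTime();
                    writer.addDocument(doc);                 // Lucene's share of the work
                    long t2 = System.nanoTime();
                    buildNanos += t1 - t0;
                    indexNanos += t2 - t1;
                }
                System.out.printf("building docs: %d ms, indexing: %d ms%n",
                        buildNanos / 1_000_000, indexNanos / 1_000_000);
            }
        }
    }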

As to whether you should use StringField or IntField (or TextField, or whatever), you should determine that based on what is in the field and how you are going to search it. If you might want to search the field as a range of numeric values, it should be an IntField, not a StringField. Also, StringField indexes the entire value as a single term and skips analysis, so it is the wrong field for full text as well; for that you should use a TextField. Basically, using StringField for everything seems very much like a bad code smell to me, and could cause performance issues at index time, but I would expect the much larger problems to appear when you start trying to search.
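For illustration, here is roughly what that separation looks like with the Lucene 5.x-era field classes discussed in this thread (field names and values are invented; note that IntField and NumericRangeQuery were later superseded by point types such as IntPoint in Lucene 6+):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.IntField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;

    public class FieldTypes {
        public static void main(String[] args) {
            Document doc = new Document();
            // Exact-match identifier: indexed as one unanalyzed term.
            doc.add(new StringField("id", "order-42", Field.Store.YES));
            // Numeric value: trie-encoded so range queries are possible.
            doc.add(new IntField("quantity", 17, Field.Store.NO));
            // Free text: analyzed (tokenized, lowercased) for full-text search.
            doc.add(new TextField("notes", "shipped from the main warehouse", Field.Store.NO));

            // A numeric range query only makes sense against the IntField:
            Query inRange = NumericRangeQuery.newIntRange("quantity", 10, 100, true, true);
            System.out.println(inRange);
        }
    }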

As far as "how will the search performance be with 5 billion values", that's far too vague a question to even attempt to answer. No idea. Try it and see.

1 Comment

Thanks for the reply. Actually, we are trying to replace a legacy system with Lucene. The system takes in data from files at short intervals but searches continuously. The data size is growing and is the bottleneck for RDBMS systems. We wanted to avoid distributed NoSQL databases for the moment, hence this POC with Lucene. Not all my fields will be StringField, and I always search for exact matches. Performance is something I shall post here once I get to it.

Lucene has a hard limit of 2.14 billion documents per shard, so you have to divide your index into shards, or keep multiple indexes, such that no single index/shard contains more than 2.14 billion documents. I have indexes that are approaching that hard limit (I am using Lucene directly, not Elasticsearch/Solr) and haven't seen any significant slowdown in indexing speed. I am using numeric fields (Int, Float, and Long) as well as StringField and TextField types. Also check the RAM buffer size set on your IndexWriter; increasing it might help slightly with speed.
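A sketch of both ideas, assuming the Lucene 5.x-style API (the shard paths and the 256 MB buffer are placeholder choices, not recommendations): raise the IndexWriter RAM buffer when writing, and search several sub-limit shards through a single MultiReader:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class ShardedIndex {
        public static void main(String[] args) throws Exception {
            // Writing: a larger RAM buffer means fewer flushes to disk (default is 16 MB).
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setRAMBufferSizeMB(256.0);
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("shard-0")), config)) {
                // writer.addDocument(...) calls go here
            }

            // Searching: one searcher spanning several sub-2.14-billion-doc shards.
            try (MultiReader reader = new MultiReader(
                    DirectoryReader.open(FSDirectory.open(Paths.get("shard-0"))),
                    DirectoryReader.open(FSDirectory.open(Paths.get("shard-1"))))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // searcher.search(query, n) sees all shards transparently
            }
        }
    }

Closing the MultiReader closes the underlying shard readers as well.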

Comments
