
I am trying to index a record set of 5 billion rows, or even more, using Lucene. Does indexing time increase exponentially as the record set grows?

My initial indexing of 10 million records happened very quickly, but when I tried to index more than 100 million records, it took far more time than I expected relative to the 10 million record indexing time.

Is it because each new document is indexed against a larger set of existing documents, hence the time increases exponentially? Or what could be the reason behind this behavior, and is there any way to optimize it? (Please note: currently all fields in all the documents are of type StringField; will changing them to IntField help in this direction?)

My second question would be how will the search performance be in case of indexing 5 billion records. Any ideas on that?

Let me know if you need more information from my end on this.

  • Can I scale by splitting my index creation process across multiple JVMs, all reading from different files with the same schema and writing to the same index folder location? Commented Mar 5, 2015 at 14:18
  • There is a hard limit of 2 billion documents in a single Lucene index, so you will have to distribute somehow. Commented Apr 27, 2015 at 20:10

3 Answers


Our current use case seems somewhat similar to yours: 1.6 billion rows, most fields are exact matches, periodic addition of files/rows, regular searching. Our initial indexing is currently not distributed or parallelized in any way and takes around 9 hours. I offer that number only to give you a very vague sense of what your indexing experience may be.

To try and answer your questions:

  1. Our indexing time does not grow exponentially with the number of rows already indexed, though it does slow down very gradually. For us, perhaps 20% slower by the end, though it could also be specific to our data.

    If you are experiencing significant slow-down, I support femtoRgon's suggestion that you profile to see what's eating the time. Lucene has never been the slowest/weakest component in our system.

  2. Yes, you can write to your index in parallel, and you may see improved throughput. Whether it helps or not depends on where your bottlenecks are, of course. Consider using Solr - it may ease your efforts here. (One way to parallelize within a single JVM is sketched after this list.)

  3. We use a mixture of StringField, LongField, and TextField. It seems unlikely that the type of field is causing your slowdown on its own.
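On point 2: IndexWriter is documented as thread-safe, so one way to parallelize within a single JVM is to share one writer across several feeder threads. A minimal sketch, assuming the Lucene 5.x-style API (the thread count, document count, "index-dir" path, and "id" field are placeholders, not recommendations):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index-dir")), config)) {
                // IndexWriter is thread-safe: one shared writer, many feeder threads.
                ExecutorService pool = Executors.newFixedThreadPool(8);
                for (int t = 0; t < 8; t++) {
                    final int threadId = t;
                    pool.submit(() -> {
                        for (int i = 0; i < 100_000; i++) {
                            Document doc = new Document();
                            doc.add(new StringField("id", threadId + "-" + i, Field.Store.YES));
                            writer.addDocument(doc); // concurrent calls are supported
                        }
                        return null;
                    });
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
            }
        }
    }

Note that multiple JVMs writing to the same index directory, as asked in the comments above, is a different matter: Lucene enforces a write lock, so only one IndexWriter can have a given directory open at a time, and cross-process parallelism therefore means writing to separate indexes/shards.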

These answers are all anecdotal, but perhaps they'll be of some use to you.

This page is now quite dated, but if you exhaust all your other options, it may provide hints about which levers you can pull to tweak performance: How to make indexing faster




Have you profiled to see what is actually causing your performance issues? You could find that something unexpected is eating up all that time. When I profiled a similar performance issue that I thought was caused by Lucene, it turned out the problem was mostly string concatenation.
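If a full profiler isn't handy, even crude manual timing can show whether the time goes into building your documents or into Lucene itself. A self-contained sketch along those lines, assuming the Lucene 5.x-style API (the "timing-index" path, "id" field, and row count are made up for illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class TimingCheck {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("timing-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                long buildNanos = 0, indexNanos = 0;
                for (int i = 0; i < 1_000_000; i++) {
                    long t0 = System.nanoTime();
                    Document doc = new Document();           // your document construction
                    doc.add(new StringField("id", "row-" + i, Field.Store.YES));
                    long t1 = System.nanoTime();
                    writer.addDocument(doc);                 // Lucene's share of the work
                    long t2 = System.nanoTime();
                    buildNanos += t1 - t0;
                    indexNanos += t2 - t1;
                }
                System.out.printf("building docs: %d ms, indexing: %d ms%n",
                        buildNanos / 1_000_000, indexNanos / 1_000_000);
            }
        }
    }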

As to whether you should use StringField or IntField (or TextField, or whatever), you should determine that based on what is in the field and how you are going to search it. If you might want to search the field as a range of numeric values, it should be an IntField, not a StringField. Also, StringField indexes the entire value as a single term and skips analysis, so it is the wrong field for full text as well; for that you should use a TextField. Basically, using StringField for everything seems very much like a bad code smell to me, and could cause performance issues at index time, but I would expect the much larger problems to appear when you start trying to search.
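For illustration, here is roughly what that separation looks like with the Lucene 5.x-era field classes discussed in this thread (field names and values are invented; note that IntField and NumericRangeQuery were later superseded by point types such as IntPoint in Lucene 6+):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.IntField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;

    public class FieldTypes {
        public static void main(String[] args) {
            Document doc = new Document();
            // Exact-match identifier: indexed as one unanalyzed term.
            doc.add(new StringField("id", "order-42", Field.Store.YES));
            // Numeric value: trie-encoded so range queries are possible.
            doc.add(new IntField("quantity", 17, Field.Store.NO));
            // Free text: analyzed (tokenized, lowercased) for full-text search.
            doc.add(new TextField("notes", "shipped from the main warehouse", Field.Store.NO));

            // A numeric range query only makes sense against the IntField:
            Query inRange = NumericRangeQuery.newIntRange("quantity", 10, 100, true, true);
            System.out.println(inRange);
        }
    }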

As far as "how will the search performance be with 5 billion values", that's far too vague a question to even attempt to answer. No idea. Try it and see.

1 Comment

Thanks for the reply. Actually, we are trying to replace a legacy system with Lucene. The system takes in data from files at short intervals but searches continuously. The data size is growing and is the bottleneck for RDBMS systems. We wanted to avoid distributed NoSQL databases for the moment, hence this POC with Lucene. Not all my fields will be StringField, and I always search for exact matches. Performance is something I shall post here once I get to it.

Lucene has a hard limit of 2.14 billion documents per shard, so you have to divide your index into shards, or keep multiple indexes, such that no single index/shard contains more than 2.14 billion documents. I have indexes that are approaching that hard limit (I am using Lucene directly, not Elasticsearch/Solr) and haven't seen any significant slowdown in indexing speed. I am using numeric fields (Int, Float, and Long) as well as StringField and TextField types. Also check the RAM buffer size set on your IndexWriter; increasing it might help slightly with speed.
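A sketch of both ideas, assuming the Lucene 5.x-style API (the shard paths and the 256 MB buffer are placeholder choices, not recommendations): raise the IndexWriter RAM buffer when writing, and search several sub-limit shards through a single MultiReader:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class ShardedIndex {
        public static void main(String[] args) throws Exception {
            // Writing: a larger RAM buffer means fewer flushes to disk (default is 16 MB).
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setRAMBufferSizeMB(256.0);
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("shard-0")), config)) {
                // writer.addDocument(...) calls go here
            }

            // Searching: one searcher spanning several sub-2.14-billion-doc shards.
            try (MultiReader reader = new MultiReader(
                    DirectoryReader.open(FSDirectory.open(Paths.get("shard-0"))),
                    DirectoryReader.open(FSDirectory.open(Paths.get("shard-1"))))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // searcher.search(query, n) sees all shards transparently
            }
        }
    }

Closing the MultiReader closes the underlying shard readers as well.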

Comments
