8

I need to store a large dictionary of natural language words -- up to 120,000, depending on the language. These need to be kept in memory as profiling has shown that the algorithm which utilises the array is the time bottleneck in the system. (It's essentially a spellchecking/autocorrect algorithm, though the details don't matter.) On Android devices with 16MB memory, the memory overhead associated with Java Strings is causing us to run out of space. Note that each String has a 38 byte overhead associated with it, which gives up to a 5MB overhead.

At first sight, one option is to substitute char[] for String. (Or even byte[], as UTF-8 is more compact in this case.) But again, the memory overhead is an issue: each Java array has a 32 byte overhead.

One alternative to ArrayList<String>, etc. is to create an class with much the same interface that internally concatenates all the strings into one gigantic string, e.g. represented as a single byte[], and then store offsets into that huge string. Each offset would take up 4 bytes, giving a much more space-efficient solution.

My questions are a) are there any other solutions to the problem with similarly low overheads* and b) is any solution available off-the-shelf? Searching through the Guava, trove and PCJ collection libraries yields nothing.

*I know one can get the overhead down below 4 bytes, but there are diminishing returns.

NB. Support for Compressed Strings being Dropped in HotSpot JVM? suggests that the JVM option -XX:+UseCompressedStrings isn't going to help here.

19
  • An array can only have 2^31-1 = 2.1G entries, maybe too small for you? Commented Jun 29, 2015 at 17:45
  • 1
    No... a word typically takes up ~10 bytes, so the whole structure will fit in ~1MB. (~1.5MB inc. overhead.) Commented Jun 29, 2015 at 17:48
  • Do you really need to keep all the strings in memory? Probably you can keep some index and effectively load the necessary part from the file? What's your original task? How do you use these strings? Commented Jun 29, 2015 at 17:48
  • Have you thought about using an in-memory database, such as H2? Commented Jun 29, 2015 at 17:49
  • Btw, compressed strings may come back in OpenJDK 9 or 10: see this JEP. Commented Jun 29, 2015 at 17:50

1 Answer 1

0

I had to develop a word dictionary for a class project. we ended up using a trie as a data structure. Not sure the size difference between an arrraylist and a trie, but the performance is a lot better.

Here are some resources that could be helpful.

https://en.wikipedia.org/wiki/Trie

https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.