
I have a large text file (1.5 GB) containing 100 million strings (no duplicates), one per line. I want to build a web application in Java so that when a user enters a keyword (a substring), they get the count of all the strings in the file that contain that keyword. I already know one technique, Lucene. Is there any other way to do this? I want the result within 3-4 seconds. My system has 4 GB of RAM and a dual-core processor, and this needs to be done in Java only.

  • How much RAM do you have? How much preprocessing can you do? Commented Jan 31, 2013 at 19:10
  • I should mention that all the strings are arranged line by line in the file. Commented Jan 31, 2013 at 19:10
  • @LouisWasserman I have 4 GB of RAM and a dual-core processor. Commented Jan 31, 2013 at 19:11
  • Preprocessing is practically mandatory; it will take longer than 3-4 seconds to read this file from disk! Commented Jan 31, 2013 at 19:12
  • I would count how often each word appears in advance; looking up such a table would then take a few milliseconds. Commented Jan 31, 2013 at 19:15

4 Answers

0

Try using hash tables. Another option is an approach similar to MapReduce. What I mean is that you can try an inverted index; Google uses the same technique. You can also create a file of stopwords containing words that can be ignored, e.g. I, am, the, a, an, in, on.

This is the only approach I think is possible. I also read somewhere that you can use arrays for searching.
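
A minimal sketch of the inverted-index idea, assuming the strings can be split into word-like tokens. The class name `InvertedIndexSketch`, the tokenizing regex, and the stopword set are illustrative assumptions, not something from the question. Note that it only answers whole-token lookups; it does not handle arbitrary substrings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class InvertedIndexSketch {
    // Hypothetical stopword list, taken from the examples in the answer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("i", "am", "the", "a", "an", "in", "on"));

    // Maps each lowercase token to the number of lines that contain it.
    static Map<String, Integer> build(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Set<String> seenInLine = new HashSet<>();
            for (String token : line.toLowerCase().split("\\W+")) {
                if (token.isEmpty() || STOPWORDS.contains(token)) {
                    continue;
                }
                // Count each token at most once per line.
                if (seenInLine.add(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "a lazy fox", "quick quick dog");
        Map<String, Integer> index = build(lines);
        System.out.println("fox: " + index.getOrDefault("fox", 0));
        System.out.println("the: " + index.getOrDefault("the", 0));
    }
}
```

Once built, a lookup is a single hash-map access, which is why the preprocessing cost pays off for repeated queries.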


0

Is there expected to be a lot of overlap in your keywords? If so, you might be able to store a hash map from keyword (String) to file locations (ArrayList). You cannot store all the lines in memory, though, because of the object overhead.

Once you have the file location, you can seek to that location in the text file and then look nearby to get the enclosing newline characters, returning the line. That will definitely be less than 4 seconds. Here is a little info on that. If this is just for a little exercise, that would work fine.
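
The seek-and-expand step could look roughly like this. It is only a sketch: the class and method names (`LineAtOffset`, `lineAt`) are made up for illustration, and it assumes UTF-8 text with `\n` line endings and a byte offset obtained from some index built beforehand.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class LineAtOffset {
    // Returns the full line that contains the byte at position `offset`.
    static String lineAt(RandomAccessFile file, long offset) throws IOException {
        long start = offset;
        // Walk backwards until the previous byte is a newline or the file start is reached.
        while (start > 0) {
            file.seek(start - 1);
            if (file.read() == '\n') {
                break;
            }
            start--;
        }
        // Read forwards from the line start up to the next newline or end of file.
        file.seek(start);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int b;
        while ((b = file.read()) != -1 && b != '\n') {
            buffer.write(b);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LineAtOffset <file> <byteOffset>
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            System.out.println(lineAt(file, Long.parseLong(args[1])));
        }
    }
}
```

Reading one byte at a time is slow per call but only touches a single line, so each lookup stays cheap; a buffered read around the offset would be the obvious refinement.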

A better solution, though, would be a two-tiered index: one mapping keywords to line numbers, and another mapping line numbers to line text. This will not fit in memory on your machine. There are great disk-based key-value stores that would work well, though. If this is anything beyond a toy problem, go the Redis route.


0

You could build a directory structure based on the first few letters of each word. For example:

/A
/A/AA
/A/AB
/A/AC
...
/Z/ZU

Under that structure, you can keep a file containing all the strings whose first characters match the folder name. The first characters of your search term will narrow the selection down to a folder holding a small fraction of your overall list. From there, you can do a full search of just that file. If it's too slow, increase the depth of your directory tree to cover more letters.

2 Comments

It will not help. The substring or keyword can be anywhere in the string, and here we are matching only the first letters. Suppose there is a string myfunnyindia in our file. According to your algorithm, it would be put in a folder like /m//my/f... Now if the user needs to find india, how will it find the string myfunnyindia?
You're right. If you're looking for ANY substring, like matching "yfu", then it's almost pointless to index the data this way. Are you allowed to run an external process from your Java app? It might be fastest to pass some parameters to grep and process the results from stdout.
0

Since you have more RAM than the size of the file, you might be able to store the entire data set as a structure in RAM and search it very quickly. A trie might be a good data structure to use; it offers fast prefix lookups, but I am not sure how it performs for substrings.
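
One way a trie can serve substring queries is to insert every suffix of every string, so that a substring lookup becomes a prefix lookup. The sketch below is a swapped-in variant (a suffix trie, not a plain trie), and the class name `SuffixTrieCounter` is hypothetical. Its memory use grows with the square of the string lengths, so treat it as a small-scale illustration; it is very unlikely to fit 100 million strings in 4 GB.

```java
import java.util.HashMap;
import java.util.Map;

class SuffixTrieCounter {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count;        // number of distinct strings containing the path to this node
        int lastId = -1;  // id of the last string that passed through this node
    }

    private final Node root = new Node();
    private int nextId = 0;

    // Inserts every suffix of s; each node is counted once per string.
    void add(String s) {
        int id = nextId++;
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int i = start; i < s.length(); i++) {
                node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
                if (node.lastId != id) {
                    node.lastId = id;
                    node.count++;
                }
            }
        }
    }

    // Returns how many added strings contain the keyword as a substring.
    int countContaining(String keyword) {
        Node node = root;
        for (int i = 0; i < keyword.length(); i++) {
            node = node.children.get(keyword.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        // An empty keyword is contained in every string.
        return node == root ? nextId : node.count;
    }

    public static void main(String[] args) {
        SuffixTrieCounter counter = new SuffixTrieCounter();
        counter.add("myfunnyindia");
        counter.add("india");
        counter.add("funny");
        System.out.println(counter.countContaining("india"));
    }
}
```

The `lastId` check is what keeps the count per string rather than per occurrence, so a keyword that appears twice in one string is still counted once.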
