7

I have a file.txt containing about 200,000 records.

The format of each record is 123456-99-Text. The 123456 are unique account numbers, the 99 is a location code that I need (it changes from 01 to 99), and the text is irrelevant. These account numbers are sorted in order and with a line break in the file per ac(111111, 111112, 111113, etc).

I made a visual studio textbox and search button to have someone search for the account number. The account number is actually 11 digits long but only the first 6 matter. I wrote this as string actnum = textbox1.text.substring(0,6)

I wrote a foreach (string x in file.readline('file.txt')) with an if (x.contains(actnum)) then string code = x.substring(8,2)) statement.

The program works well, but because there are so many records if someone searches an account number that doesnt exist, or a number at the bottom of the list, the program locks up for a good 10 seconds before going to the "number not found" else statement, or taking forever to find that last record.

My Question:

Reading about binary searches I have attempted to try one without much success. I cannot seem to get the array or file to act like a legitimate binary search. Is there a way to take the 6 digit actnum from textbox1, compare it to an array substring of the 6 digit account number, then grab the substring 99 code from that specific line?

A binary search would help greatly! I could take 555-555 and compare it to the top or bottom half of the record file, then keep searching until i fine the line i need, grab the entire line, then substring the 99 out. The problem I have is I cant seem to get a proper integer conversion of the file because it contains both numbers AND text, and therefore I cant properly use <, >, = signs.

Any help on this would be greatly appreciated. The program I currently have actually works but is incredibly slow at times.

7
  • Go ahead and paste in your code. If you do it wrong someone will edit the post to fix it. In short make a blank line, then indent the code block four spaces. Commented Dec 15, 2014 at 20:27
  • Have you tried loading the file into memory and possibly making an object representation out of it? Commented Dec 15, 2014 at 20:27
  • @Brandon that sounds like a rather large list to have put all into memory, and if it continues to get larger (as is suggested by having an 11 digit account number) the OP may run out of resources Commented Dec 15, 2014 at 20:28
  • Actually the two answers just posted sum up what I'd suggested. Commented Dec 15, 2014 at 20:29
  • 200,000 shouldn't be a problem, unless that string you don't care about is huge. Commented Dec 15, 2014 at 20:31

4 Answers 4

5

As one possible solution (not necessarily the best) you can add your record IDs to a Dictionary<string, int> (or even a Dictionary<long, int> if all record IDs are numeric) where each key is the ID of one line and each value is the line index. When you need to look up a particular record, just look in the dictionary (it'll do an efficient lookup for you) and gives you the line number. If the item is not there (non-existent ID), you won't find it in the dictionary.

At this point, if the record ID exists in the file, you have a line number - you can either load the entire file into memory (if it's not too big) or just seek to the right line and read in the line with the data.

For this to work, you have to go through the file at least once and collect all the record IDs from all lines and add them to the dictionary. You won't have to implement the binary search - the dictionary will internally perform the lookup for you.

Edit:

If you don't need all the data from a particular line, just one bit (like the location code you mentioned), you don't even need to store the line number (since you won't need to go back to the line in the file) - just store the location data as the value in the dictionary.

I personally would still store the line index because, in my experience, such projects start out small but end up collecting features and there'll be a point where you'll have to have everything from the file. If you expect this to be the case over time, just parse data from each line into a data structure and store that in the dictionary - it'll make your future life simpler. If you're very sure you'll never need more data than the one bit of information, you can just stash the data itself in the dictionary.

Here's a simple example (assuming that your record IDs can be parsed into a long):

public class LineData
{
    public int LineIndex { get; set; }

    public string LocationCode { get; set; }

    // other data from the line that you need
}

// ...

// declare your map
private Dictionary<long, LineData> _dataMap = new Dictionary<long, LineData> ();

// ...
// Read file, parse lines into LineData objects and put them in dictionary
// ...

To see if a record ID exists, you just call TryGetValue():

LineData lineData;

if ( _dataMap.TryGetValue ( recordID, out lineData ) )
{
    // record ID was found
}

This approach essentially keeps the entire file in memory but all data is parsed only once (at the beginning, during building the dictionary). If this approach uses too much memory, just store the line index in the dictionary and then go back to the file if you find a record and parse the line on the fly.

Sign up to request clarification or add additional context in comments.

4 Comments

Why bother storing the line index when you could just save the location code when you scan the file the first time?
@Kevin I only suggested that so that the entire line can be accessed from the file, if necessary, not just one particular bit of data. If the file is not too big (and is expected to stay at a similar size in the future), I'd just load the whole thing and parse it into a dictionary of objects, then discard the string data.
Thank you very much! I am still new, and do not fully understand the dictionary commands. However, you explained very well what it can do, and I believe the best option would be to load that dictionary at the program's start (the file itself is only around 10MB, not too large for memory I should think).
@ShoTo Dictionary is easy to use. It's a lookup table - you Add items to it with a unique key and then you can look up items using TryGetValue, ContainsKey, etc. The two types in the declaration are for the types for the keys and the values.
2

You cannot really do a binary search against file.ReadLine because you have to be able to access the lines in different order. Instead you should read the whole file into memory (file.ReadAllLines would be an option)

Assuming your file is sorted by the substring, you can create a new class that implements IComparer

public class SubstringComparer : IComparer<string>
    {
        public int Compare(string x, string y)
        {
            return x.Substring(0, 6).CompareTo(y.Substring(0, 6));
        }
    }

and then your binary search would look like:

int returnedValue = foundStrings.BinarySearch(searchValue, new SubstringComparer());

Comments

1

Assuming the file doesn't change often, then you can simply load the entire file into memory using a structure that handles the searching in faster time. If the file can change then you will need to decide on a mechanism for reloading the file, be it restarting the program or a more complex process.

It looks like you are looking for exact matches (searching for 123456 yields only one record which is labelled 123456). If that is the case then you can use a Dictionary. Note that to use a Dictionary you need to define key and value types. It looks like in your case they would both be string.

Comments

1

While I did not find a way to do a better type of search, I did manage to learn about embedded resources which considerably sped up the program. Scanning the entire file takes a fraction of a second now, instead of 5-10 seconds. Posting the following code:

   string searchfor = textBox1.Text
    Assembly assm = Assembly.GetExecutingAssembly();
    using (Stream datastream = assm.GetManifestResourceStream("WindowsFormsApplication2.Resources.file1.txt"))
    using (StreamReader reader = new StreamReader(datastream))
    {
        string lines;
        while ((lines = reader.ReadLine()) != null)
        {
            if (lines.StartsWith(searchfor))
            {
                label1.Text = "Found";
                break;
            }
            else
            {
                label1.Text = "Not found";
            }
        }
    }

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.