Searching an array string with a binary search sub string

Question

I have a file.txt containing about 200,000 records.

The format of each record is 123456-99-Text. The 123456 are unique account numbers, the 99 is a location code that I need (it changes from 01 to 99), and the text is irrelevant. These account numbers are sorted in order and with a line break in the file per ac(111111, 111112, 111113, etc).

I made a visual studio textbox and search button to have someone search for the account number. The account number is actually 11 digits long but only the first 6 matter. I wrote this as string actnum = textbox1.text.substring(0,6)

I wrote a foreach (string x in file.readline('file.txt')) with an if (x.contains(actnum)) then string code = x.substring(8,2)) statement.

The program works well, but because there are so many records if someone searches an account number that doesnt exist, or a number at the bottom of the list, the program locks up for a good 10 seconds before going to the "number not found" else statement, or taking forever to find that last record.

My Question:

Reading about binary searches I have attempted to try one without much success. I cannot seem to get the array or file to act like a legitimate binary search. Is there a way to take the 6 digit actnum from textbox1, compare it to an array substring of the 6 digit account number, then grab the substring 99 code from that specific line?

A binary search would help greatly! I could take 555-555 and compare it to the top or bottom half of the record file, then keep searching until i fine the line i need, grab the entire line, then substring the 99 out. The problem I have is I cant seem to get a proper integer conversion of the file because it contains both numbers AND text, and therefore I cant properly use <, >, = signs.

Any help on this would be greatly appreciated. The program I currently have actually works but is incredibly slow at times.

Go ahead and paste in your code. If you do it wrong someone will edit the post to fix it. In short make a blank line, then indent the code block four spaces. — Denise Skidmore
– Denise Skidmore, Commented Dec 15, 2014 at 20:27
Have you tried loading the file into memory and possibly making an object representation out of it? — Brandon
– Brandon, Commented Dec 15, 2014 at 20:27
@Brandon that sounds like a rather large list to have put all into memory, and if it continues to get larger (as is suggested by having an 11 digit account number) the OP may run out of resources — Robert Snyder
– Robert Snyder, Commented Dec 15, 2014 at 20:28
Actually the two answers just posted sum up what I'd suggested. — Brandon
– Brandon, Commented Dec 15, 2014 at 20:29
200,000 shouldn't be a problem, unless that string you don't care about is huge. — hatchet - done with SOverflow
– hatchet - done with SOverflow, Commented Dec 15, 2014 at 20:31

xxbbcc · Accepted Answer · 2014-12-15 20:50:27Z

5

As one possible solution (not necessarily the best) you can add your record IDs to a Dictionary<string, int> (or even a Dictionary<long, int> if all record IDs are numeric) where each key is the ID of one line and each value is the line index. When you need to look up a particular record, just look in the dictionary (it'll do an efficient lookup for you) and gives you the line number. If the item is not there (non-existent ID), you won't find it in the dictionary.

At this point, if the record ID exists in the file, you have a line number - you can either load the entire file into memory (if it's not too big) or just seek to the right line and read in the line with the data.

For this to work, you have to go through the file at least once and collect all the record IDs from all lines and add them to the dictionary. You won't have to implement the binary search - the dictionary will internally perform the lookup for you.

Edit:

If you don't need all the data from a particular line, just one bit (like the location code you mentioned), you don't even need to store the line number (since you won't need to go back to the line in the file) - just store the location data as the value in the dictionary.

I personally would still store the line index because, in my experience, such projects start out small but end up collecting features and there'll be a point where you'll have to have everything from the file. If you expect this to be the case over time, just parse data from each line into a data structure and store that in the dictionary - it'll make your future life simpler. If you're very sure you'll never need more data than the one bit of information, you can just stash the data itself in the dictionary.

Here's a simple example (assuming that your record IDs can be parsed into a long):

public class LineData
{
    public int LineIndex { get; set; }

    public string LocationCode { get; set; }

    // other data from the line that you need
}

// ...

// declare your map
private Dictionary<long, LineData> _dataMap = new Dictionary<long, LineData> ();

// ...
// Read file, parse lines into LineData objects and put them in dictionary
// ...

To see if a record ID exists, you just call TryGetValue():

LineData lineData;

if ( _dataMap.TryGetValue ( recordID, out lineData ) )
{
    // record ID was found
}

This approach essentially keeps the entire file in memory but all data is parsed only once (at the beginning, during building the dictionary). If this approach uses too much memory, just store the line index in the dictionary and then go back to the file if you find a record and parse the line on the fly.

edited Dec 15, 2014 at 20:50

answered Dec 15, 2014 at 20:28

xxbbcc

17.5k5 gold badges54 silver badges90 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Kevin Over a year ago

Why bother storing the line index when you could just save the location code when you scan the file the first time?

xxbbcc Over a year ago

@Kevin I only suggested that so that the entire line can be accessed from the file, if necessary, not just one particular bit of data. If the file is not too big (and is expected to stay at a similar size in the future), I'd just load the whole thing and parse it into a dictionary of objects, then discard the string data.

ShoTo Over a year ago

Thank you very much! I am still new, and do not fully understand the dictionary commands. However, you explained very well what it can do, and I believe the best option would be to load that dictionary at the program's start (the file itself is only around 10MB, not too large for memory I should think).

xxbbcc Over a year ago

@ShoTo Dictionary is easy to use. It's a lookup table - you Add items to it with a unique key and then you can look up items using TryGetValue, ContainsKey, etc. The two types in the declaration are for the types for the keys and the values.

jtm001 · Accepted Answer · 2014-12-15 21:10:02Z

2

You cannot really do a binary search against file.ReadLine because you have to be able to access the lines in different order. Instead you should read the whole file into memory (file.ReadAllLines would be an option)

Assuming your file is sorted by the substring, you can create a new class that implements IComparer

public class SubstringComparer : IComparer<string>
    {
        public int Compare(string x, string y)
        {
            return x.Substring(0, 6).CompareTo(y.Substring(0, 6));
        }
    }

and then your binary search would look like:

int returnedValue = foundStrings.BinarySearch(searchValue, new SubstringComparer());

answered Dec 15, 2014 at 21:10

jtm001

3712 silver badges3 bronze badges

Comments

Guvante · Accepted Answer · 2014-12-15 20:28:33Z

1

Assuming the file doesn't change often, then you can simply load the entire file into memory using a structure that handles the searching in faster time. If the file can change then you will need to decide on a mechanism for reloading the file, be it restarting the program or a more complex process.

It looks like you are looking for exact matches (searching for 123456 yields only one record which is labelled 123456). If that is the case then you can use a Dictionary. Note that to use a Dictionary you need to define key and value types. It looks like in your case they would both be string.

answered Dec 15, 2014 at 20:28

Guvante

19.3k1 gold badge38 silver badges64 bronze badges

Comments

ShoTo · Accepted Answer · 2015-01-06 18:23:49Z

While I did not find a way to do a better type of search, I did manage to learn about embedded resources which considerably sped up the program. Scanning the entire file takes a fraction of a second now, instead of 5-10 seconds. Posting the following code:

   string searchfor = textBox1.Text
    Assembly assm = Assembly.GetExecutingAssembly();
    using (Stream datastream = assm.GetManifestResourceStream("WindowsFormsApplication2.Resources.file1.txt"))
    using (StreamReader reader = new StreamReader(datastream))
    {
        string lines;
        while ((lines = reader.ReadLine()) != null)
        {
            if (lines.StartsWith(searchfor))
            {
                label1.Text = "Found";
                break;
            }
            else
            {
                label1.Text = "Not found";
            }
        }
    }

Collectives™ on Stack Overflow

Searching an array string with a binary search sub string

4 Answers 4

4 Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

4 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related