
I am comparing two lists of data that were generated from a binary file. I have a good idea of why it's running slow: when there is a significant number of records, it does unneeded, redundant work.

For example, if a1 == a1, the condition is true. Since a2 != a1, why even bother checking a1 again? I need to eliminate a1 from being checked a second time; if I don't, the inner loop will still re-check the first record when it goes to check the 400,000th. I thought about making the second for loop a foreach, but I can't remove a1 while iterating through the nested loop.

The number of items in either loop can vary. I don't think a single for loop using 'i' will work, since the match can be anywhere. I'm reading from a binary file.

This is my current code. The program has been running for over an hour and it's still going. I removed a lot of my iterating code for readability.

    for (int i = 0; i < origItemList.Count; i++)
    {
        int modFoundIndex = 0;
        bool foundIt = false;
        for (int g = 0; g < modItemList.Count; g++)
        {
            if ((origItemList[i].X == modItemList[g].X)
                && (origItemList[i].Y == modItemList[g].Y)
                && (origItemList[i].Z == modItemList[g].Z)
                && (origItemList[i].M == modItemList[g].M))
            {
                foundIt = true;
                modFoundIndex = g;
                break;
            }
        }
        if (foundIt)                         
        {
            /*
             * This runs assuming it found an x,y,z,m
             * coordinate. It then checks the database file.
             */
            //grab the rows where the coordinates match 
            DataRow origRow = origDbfFile.dataset.Tables[0].Rows[i];
            DataRow modRow = modDbfFile.dataset.Tables[0].Rows[modFoundIndex];

            //number matched indicates how many columns were matched
            int numberMatched = 0;

            //get the number of columns to match in order to detect all changes
            int numOfColumnsToMatch = origDbfFile.datatable.Columns.Count;

            List<String> mismatchedColumns = new List<String>();

            //check each column name for a change
            foreach (String columnName in columnNames)
            {
                //this grabs whatever value is in that field                            
                String origRowValue = "" + origRow.Field<Object>(columnName);
                String modRowValue = "" + modRow.Field<Object>(columnName);

                //check if they are the same
                if (origRowValue.Equals(modRowValue))
                {
                    //if they are the same, increase the number matched by one
                    numberMatched++;
                }
                else
                {
                    //add the column to the list of columns that don't match
                    mismatchedColumns.Add(columnName);
                }

            }
            /* In the event it matches 15/16 columns, show the change */
            if (numberMatched != numOfColumnsToMatch)
            {
                //Grab the shapeFile in question
                Item differentAttrShpFile = origItemList[i];
                //start blue highlighting
                result += "<div class='turnBlue'>";
                //show where the change was made at
                result += "Change Detected at<br/> point X: " +
                differentAttrShpFile.X + ",<br/> point Y: " +
                    differentAttrShpFile.Y + ",<br/>";
                result += "</div>"; //end turnblue div
                foreach (String mismatchedColumn in mismatchedColumns)
                {
                    //iterate changes here

                }

            }

        }

    }
  • How many records are you dealing with? Commented Jun 3, 2013 at 18:57
  • Is the modItemList significantly smaller than the outer list? You could then iterate over only the smaller list, reducing a lot of work. You might also be able to turn the combination of X, Y, Z and M into a kind of key, to do a set-style lookup instead of a linear search through the whole list? Commented Jun 3, 2013 at 18:58
  • Also, can you try using a foreach loop instead of accessing by indexer every time? Commented Jun 3, 2013 at 19:00
  • Hi guys, they will both usually have around the same number of records. They range in size from as few as 20 to 250,000. Ideally, large files shouldn't take any more than a few minutes. Commented Jun 3, 2013 at 19:02
  • I would sort each list, O(N log N), followed by a merge-order loop, O(N). I'm gun-shy of hashes, though in principle they could take you down to O(N) overall. Commented Jun 3, 2013 at 22:23
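A minimal sketch of that sort-then-merge idea, assuming the question's Item type with double X/Y/Z/M fields (the method names here are illustrative). One caveat: sorting reorders the lists, so the question's lookup of DataRows by list index would need each Item to carry its original row index:

    //one total order over (X, Y, Z, M), used to sort both lists
    static int CompareItems(Item a, Item b)
    {
        int c = a.X.CompareTo(b.X);
        if (c != 0) return c;
        c = a.Y.CompareTo(b.Y);
        if (c != 0) return c;
        c = a.Z.CompareTo(b.Z);
        if (c != 0) return c;
        return a.M.CompareTo(b.M);
    }

    static void FindMatches(List<Item> origItemList, List<Item> modItemList)
    {
        //two O(n log n) sorts, then one O(n) merge-order pass
        origItemList.Sort(CompareItems);
        modItemList.Sort(CompareItems);

        int i = 0, g = 0;
        while (i < origItemList.Count && g < modItemList.Count)
        {
            int c = CompareItems(origItemList[i], modItemList[g]);
            if (c == 0)     { /* match found: compare the DBF rows here */ i++; g++; }
            else if (c < 0) { i++; } //origItemList[i] has no partner
            else            { g++; } //modItemList[g] has no partner
        }
    }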

2 Answers


You're coming at this in totally the wrong way. The loop you have is O(n^2); breaking when you find the match will on average cut the time in half for a hit, but that's not enough. If you have a quarter million items in each list then this loop executes 62 billion times, and even if the compiler optimizes out the extra array lookups you're still looking at at least a trillion instructions. You don't do O(n^2) for large n if you can possibly help it!

What you need to do is get rid of the O(n^2) aspect of this. My suggestion:

1) Define a hashing function that looks at the x, y, z & m values and comes up with an integer value; my inclination would be to use one that's the word size of your target platform.

2) Iterate over both lists, compute hashes for everything.

3) Build an index over one of the tables, mapping hash to object. I suspect a dictionary is the best data structure here, but a simple sorted array would also do.

4) Iterate over the list you didn't build the index over, comparing the hashes to the entries in the index. With a hash table that's an O(n) task; with a sorted array it's O(n log n).

5) When the hashes match, do a full comparison to confirm the hit is real; you will get the occasional collision even with a good 64-bit hash, and you'll get a decent number of them if your hashes are 32-bit.
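A rough sketch of steps 2-5 using a Dictionary, assuming the origItemList and modItemList from the question (names reused for illustration). Tuple<> supplies structural GetHashCode and Equals, so the dictionary's key comparison doubles as the full confirmation in step 5:

    //steps 2-3: hash one list into an index of coordinates -> list position
    var index = new Dictionary<Tuple<double, double, double, double>, int>();
    for (int g = 0; g < modItemList.Count; g++)
    {
        var key = Tuple.Create(modItemList[g].X, modItemList[g].Y,
                               modItemList[g].Z, modItemList[g].M);
        index[key] = g; //last duplicate wins; handle duplicates as needed
    }

    //steps 4-5: probe the index while iterating over the other list
    for (int i = 0; i < origItemList.Count; i++)
    {
        var key = Tuple.Create(origItemList[i].X, origItemList[i].Y,
                               origItemList[i].Z, origItemList[i].M);
        int modFoundIndex;
        if (index.TryGetValue(key, out modFoundIndex))
        {
            //same row-comparison logic as in the question, but reached
            //in O(1) average time instead of an O(n) inner scan
        }
    }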


5 Comments

When I find a match (and in the case of 250,000 records, most will be matches), what is the most efficient way to do a full comparison? If I do a full foreach loop again, I've barely gained any speed if only 2 records changed in all 250,000.
I implemented if (set2.Contains(t1)) { } as seen in the above example, I just want to avoid doing another foreach loop.
@EvanParsons I'm not following why you have to start over, although even if you do this will perform much faster. The hash or dictionary based approach will reduce it from tens of billions to hundreds of thousands of times around the loop.
I'm having a hard time understanding step 3. I have an index key on my custom object; the key in that object can be used to do a row lookup. However, I can't index a HashSet. I tried making a List of DataRows but ran into the same performance penalty as before. I was under the impression that Dictionaries are for text? Right now I'm experimenting with populating a Dictionary with DataRows.
@EvanParsons The key point here is to avoid using a loop to search for the match. The minimum acceptable performance is a SORTED list that you can do a binary search on. The ideal solution is a Dictionary or HashSet, both of which are hash tables internally.

This is similar to what Loren said, but below it is in the language of .NET :)
1. Override the GetHashCode method to return a combination of X, Y, Z and M. Override the Equals method to check that combination.
2. Iterate over modItemList (a List) and create a HashSet from it before the loop.
3. In the inner loop, first check whether origItemList[i] exists in the HashSet using YourModHashSet.Contains(myObject).
4. If .Contains returns false, carry on; there is no match.
5. If .Contains returns true, iterate through the entire modItemList and apply your current logic of checking X, Y, Z and M. Note that you should still use the List here, because the hash table may map many different objects to the same hash code.

Also, I would use foreach instead of for, because I've seen foreach give slightly better results (5 to 30% faster) in such cases.

Update:

I created MyObject class like below:

public class MyObject
{
    public int X, Y, Z, M;

    public override int GetHashCode()
    {
        return X * 10000 + Y * 100 + Z * 10 + M;
    }

    public override bool Equals(object obj)
    {
        return (obj.GetHashCode() == this.GetHashCode());
    }
}

The GetHashCode method is important here. We don't want many false positives. A false positive occurs when the hash matches for some other combination of X, Y, Z and M. The best way to prevent false positives is to multiply each member so that each one affects its own decimal place in the hash code. Note that you should take care not to exceed int.MaxValue. If the expected values of X, Y, Z and M are small, you should be good.
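Since X, Y, Z and M are doubles in the asker's data (per the comments below), and the decimal-place scheme can overflow or collide for large values, a common alternative is a multiply-by-prime combination; a minimal sketch, which also makes Equals compare the fields themselves so a hash collision cannot report a false match:

    public class MyObject
    {
        public double X, Y, Z, M;

        public override int GetHashCode()
        {
            //multiply-by-prime combination; unchecked lets the
            //arithmetic wrap instead of throwing on overflow
            unchecked
            {
                int hash = 17;
                hash = hash * 31 + X.GetHashCode();
                hash = hash * 31 + Y.GetHashCode();
                hash = hash * 31 + Z.GetHashCode();
                hash = hash * 31 + M.GetHashCode();
                return hash;
            }
        }

        public override bool Equals(object obj)
        {
            //compare the actual fields, not the hash codes, so a
            //collision cannot be mistaken for a real match
            MyObject other = obj as MyObject;
            return other != null && X == other.X && Y == other.Y
                                 && Z == other.Z && M == other.M;
        }
    }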

set2.Clear();
s1 = DateTime.Now;
MyObject matchingElement;
totalmatch = 0;

foreach (MyObject elem in list2)
    set2.Add(elem);

foreach (MyObject t1 in list1)
{
    if (set2.Contains(t1))
    {
        matchingElement = null;
        foreach (MyObject t2 in list2)
        {
            if (t1.X == t2.X && t1.Y == t2.Y && t1.Z == t2.Z && t1.M == t2.M)
            {
                totalmatch++;
                matchingElement = t2;
                break;
            }
        }
        //Do something with matchingElement if not null
    }
}
Console.WriteLine("set foreach with contains: " + (DateTime.Now - s1).TotalSeconds + "\t Total Match: " + totalmatch);

Above is the sample code I was trying to describe in my answer. It should run very fast if few matches are expected.
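A possible refinement, since the comments on the first answer ask how to avoid re-scanning the list on a hit: storing the elements in a Dictionary instead of a HashSet makes the lookup itself return the matching element. A sketch reusing list1 and list2 from the sample above, and assuming a field-based Equals (as in the previous sketch) rather than the hash-comparing one:

    //build the lookup once: O(n)
    var lookup = new Dictionary<MyObject, MyObject>();
    foreach (MyObject elem in list2)
        lookup[elem] = elem;

    int totalMatch = 0;
    foreach (MyObject t1 in list1)
    {
        MyObject matchingElement;
        //TryGetValue hashes t1 and returns the stored element on a hit,
        //so no second pass over list2 is needed
        if (lookup.TryGetValue(t1, out matchingElement))
        {
            totalMatch++;
            //do something with matchingElement
        }
    }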

11 Comments

You don't need Equals to check the hash code. The HashSet checks that first for you.
Doesn't the HashSet.Contains method check Equals, and not GetHashCode?
Nope, it checks both, first GetHashCode to see which internal bucket to check, then it calls Equals on each element in the bucket to see which one is the correct value.
@ScottChamberlain So we should override Equals as I said. We don't want to check for reference equality in my logic; it would never match, as the data for both lists is read from different binary files (mentioned in the question).
X, Y, Z and M are doubles, not integers. I can't override it to return a double. Would parsing to an integer still keep it accurate? Also, it tells me there is no Equals method to override.
