List comparison algorithm

Question

I am attempting to create an efficient algorithm to pull all of the similar elements between two lists. The problem is two fold. First, I can not seem to find any similar algorithms online. Second, there should be a more efficient way.

By 'similar', I mean a predesigned way to find all similar elements between two lists in a timely fashion.

Currently, I am taking a greedy approach by:

Sorting the lists that are being compared,
Comparing each element in the shorter list to each element in the larger list,
Since the largeList and smallList are sorted we can save the last index that was visited,
Continue from the previous index (largeIndex).

Currently, the run-time seems to be average of O(nlog(n)). This can be seen by running the test cases listed after this block of code.

This is being run on Python 3.3.

Right now, my code looks as such:

  def compare(small,large,largeStart,largeEnd):
        for i in range(largeStart, largeEnd):
              if small==large[i]:
                    return [1,i]
              if small<large[i]:
                    if i!=0:
                          return [0,i-1]
                    else:
                          return [0, i]
        return [0,largeStart]

  def determineLongerList(aList, bList):
    if len(aList)>len(bList):
        return (aList, bList)
    elif len(aList)<len(bList):
        return (bList, aList)
    else:
        return (aList, bList)

  def compareElementsInLists(aList, bList):
        import time
        startTime   = time.time()
        holder      = determineLongerList(aList, bList)
        sameItems   = []
        iterations  = 0
        ##########################################
        smallList   = sorted(holder[1])
        smallLength = len(smallList)
        smallIndex  = 0
        largeList   = sorted(holder[0])
        largeLength = len(largeList)
        largeIndex  = 0
        while (smallIndex<smallLength):
              boolean = compare(smallList[smallIndex],largeList,largeIndex,largeLength)
              if boolean[0]==1:
                    #`compare` returns 1 as True
                    sameItems.append(smallList[smallIndex])
                    oldIndex    = largeIndex
                    largeIndex  = boolean[1]
              else:
                    #else no match and possible new index
                    oldIndex    = largeIndex
                    largeIndex  = boolean[1]
              smallIndex+=1
              iterations =largeIndex-oldIndex+iterations+1
        print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
  print('RATIO:\t\t'+str(iterations/(smallLength*largeLength))+'\n')
  return sameItems

, and here are some test cases:

  def testLargest():
        import time
        from random import randint
        print('\n\n******************************************\n')
        start_time  = time.time()
        lis   = []
        for i in range(0,1000000):
              ran   = randint(0,1000000)
              lis.append(ran)
        lis2  = []
        for i in range(0,1000000):
              ran   = randint(0,1000000)
              lis2.append(ran)
        timeTaken   = time.time()-start_time     
        print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
        print('\n******************************************')
        start_time  = time.time()
        c           = compareElementsInLists(lis, lis2)
        timeTaken   = time.time()-start_time     
        print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
        print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
        print('\n******************************************')

  #testLargest()

  '''
  One rendition of testLargest:
        ******************************************

        CREATING LISTS TOOK:        21.009342908859253

        ******************************************
        RAN 999998 OUT OF 1000000000000 POSSIBLE
        RATIO:      9.99998e-07

        COMPARING LISTS TOOK:       13.99990701675415
        NUMBER OF SAME ITEMS:       632328

        ******************************************
  '''

  def testLarge():
        import time
        from random import randint
        print('\n\n******************************************\n')
        start_time  = time.time()
        lis   = []
        for i in range(0,1000000):
              ran   = randint(0,100)
              lis.append(ran)
        lis2  = []
        for i in range(0,1000000):
              ran   = randint(0,100)
              lis2.append(ran)
        timeTaken   = time.time()-start_time     
        print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
        print('\n******************************************')
        start_time  = time.time()
        c           = compareElementsInLists(lis, lis2)
        timeTaken   = time.time()-start_time     
        print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
        print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
        print('\n******************************************')

  testLarge()

Why don't you convert both lists to sets and then form the intersection? — Tesseract
– Tesseract, Commented Apr 6, 2015 at 0:43
@spiderpig I was looking into this, and with the randomly generated lists in my test cases, it takes a significant amount of more time for set once the data becomes large enough. Also, set does not return he entire collection, but just similar elements. I am looking into collections.Counter now... — T.Woody
– T.Woody, Commented Apr 6, 2015 at 1:07
It's not 100% clear what data you want as output here. Do you simply want to return all the elements that are common to both lists? — Yuushi
– Yuushi, Commented Apr 6, 2015 at 8:24
I also don't understand what you are trying to do here. Some examples might help. — Gareth Rees
– Gareth Rees, Commented Apr 6, 2015 at 20:59

user1016274 · Accepted Answer · 2015-04-06 21:10:38Z

After running extensive tests with varying list lengths I've found the set.intersection() method to be the fastest by far. I coded it like this:

    start_time = time.time()
    if len(lis) > len(lis2):
        lis, lis2 = lis2, lis
    c = list(set(lis).intersection(lis2))
    c.sort()
    timeTaken = time.time() - start_time
    print('COMPARING LISTS TOOK:\t\t' + str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t' + str(len(c)))
    print c[:20]

First, the smaller of both lists will be converted to a set. The conversion does take some time, as noted before, and this will minimize the effort. Second, only one list needs to be cast into a set. The intersection will be done with the second, unaltered list. It would take longer to convert the second list as well which is unnecessary. The final sorting can or cannot be omitted depending on your needs.

With random lists of lengths 60.000 and 40.000 resp. OP's algorithm took 20.6 s and set.intersection() took 0.031 s on my (old) notebook (664:1). Both lists of equal length, 60.000, 35.4 s vs. 0.031 s = 1141:1. For identical lists of length 60.000, 23.8 to 0.016 s = 1487:1. With these ratios I don't see any way to optimize the original algorithms to gain 3 orders of magnitude in speed.

Stack Exchange Network

List comparison algorithm

1 Answer 1

You must log in to answer this question.

Hot Network Questions

List comparison algorithm

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions