
I need to load (deserialize) a pre-computed list of integers from a file in a Python script (into a Python list). The list is large (up to millions of items), and I can choose the format I store it in, as long as loading is fastest.

Which is the fastest method, and why?

  1. Using import on a .py file that just contains the list assigned to a variable
  2. Using cPickle's load
  3. Some other method (perhaps numpy?)

Also, how can one benchmark such things reliably?

Addendum: measuring this reliably is difficult, because import is cached, so it can't be executed multiple times in a test. Loading with pickle also gets faster after the first time, probably because of page caching by the OS. Loading 1 million numbers with cPickle takes 1.1 sec on the first run of the script, and 0.2 sec on subsequent executions.

Intuitively I feel cPickle should be faster, but I'd appreciate numbers (this is quite a challenge to measure, I think).

And yes, it's important for me that this performs quickly.
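
For concreteness, the cPickle measurement mentioned in the addendum is essentially the following (the file name here is just an example):

import cPickle
import time

# Roughly what produces the 1.1 s (cold) / 0.2 s (warm) numbers above;
# 'ints.pkl' is a placeholder for the pre-written pickle file.
start = time.time()
with open('ints.pkl', 'rb') as f:
    data = cPickle.load(f)
print "Loaded %d ints in %.2f sec" % (len(data), time.time() - start)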

Thanks

  • Is this really the slow part of your code? How often are you going to load up the file? Commented Feb 17, 2009 at 13:23
  • Have you tried any of these? What metrics do you have right now? Commented Feb 17, 2009 at 13:30
  • For what it is worth, you can avoid import problems by using "execfile()"... Commented Feb 19, 2009 at 5:59

6 Answers


I would guess cPickle will be fastest if you really need the thing in a list.

If you can use an array (from the standard-library array module) instead of a list, I timed this at a quarter of a second for 1 million integers:

from array import array
from datetime import datetime

def WriteInts(theArray, filename):
    # Dump the raw machine-level values straight to disk.
    with open(filename, "wb") as f:
        theArray.tofile(f)

def ReadInts(filename):
    d = datetime.utcnow()
    theArray = array('i')
    with open(filename, "rb") as f:
        try:
            # Ask for more items than the file can hold; fromfile stops
            # at EOF and raises EOFError, which we deliberately ignore.
            theArray.fromfile(f, 1000000000)
        except EOFError:
            pass
    print "Read %d ints in %s" % (len(theArray), datetime.utcnow() - d)
    return theArray

if __name__ == "__main__":
    a = array('i')
    a.extend(range(1000000))
    filename = "a_million_ints.dat"
    WriteInts(a, filename)
    r = ReadInts(filename)
    print "The 5th element is %d" % (r[4])

3 Comments

'Read 1000000 ints in 0:00:03.500000', and it took a quarter of a second for you?
However, you're right: array.fromfile is much faster than cPickle!
@eliben - you might want to pick this as the best answer. The lessons on using the timeit module are popular, but they don't answer your question directly!

For benchmarking, see the timeit module in the Python standard library. To find the fastest way, implement all the approaches you can think of and measure each one with timeit.

Random thought: depending on what you're doing exactly, you may find it fastest to store "sets of integers" in the style used in .newsrc files:

1, 3-1024, 11000-1200000

If you need to check whether something is in that set, then loading and matching with such a representation should be among the fastest ways. This assumes your sets of integers are reasonably dense, with long consecutive sequences of adjacent values.
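
A hedged sketch of that idea (the parser and membership test below are just illustrative helpers, not anything standard):

from bisect import bisect_right

def parse_ranges(text):
    """Parse a .newsrc-style string like '1, 3-1024, 11000-1200000'
    into a sorted list of (start, end) pairs."""
    ranges = []
    for part in text.split(','):
        part = part.strip()
        if '-' in part:
            lo, hi = part.split('-')
            ranges.append((int(lo), int(hi)))
        else:
            ranges.append((int(part), int(part)))
    return sorted(ranges)

def contains(ranges, n):
    # Find the last range starting at or before n, then check its end.
    i = bisect_right(ranges, (n, float('inf'))) - 1
    return i >= 0 and ranges[i][0] <= n <= ranges[i][1]

ranges = parse_ranges("1, 3-1024, 11000-1200000")
print contains(ranges, 500), contains(ranges, 2000)   # True False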



To help you with timing, the Python Library provides the timeit module:

This module provides a simple way to time small bits of Python code. It has both command line as well as callable interfaces. It avoids a number of common traps for measuring execution times.

An example (from the manual) that compares the cost of using hasattr() vs. try/except to test for missing and present object attributes:

% timeit.py 'try:' '  str.__nonzero__' 'except AttributeError:' '  pass'
100000 loops, best of 3: 15.7 usec per loop
% timeit.py 'if hasattr(str, "__nonzero__"): pass'
100000 loops, best of 3: 4.26 usec per loop
% timeit.py 'try:' '  int.__nonzero__' 'except AttributeError:' '  pass'
1000000 loops, best of 3: 1.43 usec per loop
% timeit.py 'if hasattr(int, "__nonzero__"): pass'
100000 loops, best of 3: 2.23 usec per loop
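
For this question the callable interface is probably more convenient than the command line. A rough sketch, assuming the list has already been dumped to a file called 'ints.pkl' (the name is just an example):

import timeit

# number=1 keeps each measurement to a single load; repeat gives several
# samples.  Runs after the first will still benefit from the OS page cache.
t = timeit.Timer(stmt="cPickle.load(open('ints.pkl', 'rb'))",
                 setup="import cPickle")
print t.repeat(repeat=5, number=1)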



"how can one benchmark such things reliably?"

I don't get the question.

You write a bunch of little functions to create and save your list in various forms.

You write a bunch of little functions to load your lists in their various forms.

You write a little timer function that records the start time, runs the load procedure a few dozen times, and averages the results, so that OS scheduling noise doesn't dominate your measurements.

You summarize your data in a little report.

What's unreliable about this?
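
A bare-bones version of that timer function might look like this (the loader names are hypothetical):

import time

def time_loader(load_func, repetitions=30):
    # Average over a few dozen runs so that scheduling noise averages out.
    start = time.time()
    for _ in xrange(repetitions):
        load_func()
    return (time.time() - start) / repetitions

# Usage, with hypothetical loader functions, one per file format:
# print "cPickle:", time_loader(load_with_cpickle)
# print "array  :", time_loader(load_with_array)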

Here are some questions on other topics that show how to measure and compare performance.

Convert list of ints to one number?

String concatenation vs. string substitution in Python

2 Comments

How can I execute "import <filename>" several times in a loop if import is cached?
If your data set is big enough, one measurement may be all you need. If not, you can run from the command line in a shell loop and time that instead. Also, look at imp.load_module.

Do you always need to load the whole file? If not, struct.unpack_from() might be the best solution. Suppose you have 1,000,000 integers but want to load just the ones from index 50000 to 50099; you'd do:

import mmap
import struct

intSize = struct.calcsize('i')  # constant for a given architecture
# unpack_from needs a buffer, not a file object, so memory-map the file.
intFile = open('/your/file.of.integers', 'rb')
buf = mmap.mmap(intFile.fileno(), 0, access=mmap.ACCESS_READ)
intTuple5K100 = struct.unpack_from('i' * 100, buf, 50000 * intSize)



cPickle will be the fastest since it saves the data in binary form and no real Python code has to be parsed.

Other advantages are that it is more secure (since it does not execute commands) and you have no problems with setting $PYTHONPATH correctly.
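
One caveat: the binary form only applies when a binary pickle protocol is chosen explicitly; the default protocol 0 is text-based. A minimal round-trip sketch (the file name is just an example):

import cPickle

data = range(1000000)

# Protocol 2 (the highest in Python 2) is the compact binary format;
# the default protocol 0 is actually a text-based format.
with open('ints.pkl', 'wb') as f:
    cPickle.dump(data, f, cPickle.HIGHEST_PROTOCOL)

with open('ints.pkl', 'rb') as f:
    data_back = cPickle.load(f)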

