
I need to load (deserialize) a pre-computed list of integers from a file in a Python script (into a Python list). The list is large (up to millions of items), and I can choose the format I store it in, as long as loading is fastest.

Which is the fastest method, and why?

  1. Using import on a .py file that just contains the list assigned to a variable
  2. Using cPickle's load
  3. Some other method (perhaps numpy?)

Also, how can one benchmark such things reliably?

Addendum: measuring this reliably is difficult, because import is cached, so it can't be executed multiple times in a test. Loading with pickle also gets faster after the first time, probably because of page caching by the OS. Loading 1 million numbers with cPickle takes 1.1 sec on the first run of the script, and 0.2 sec on subsequent executions.

Intuitively I feel cPickle should be faster, but I'd appreciate numbers (this is quite a challenge to measure, I think).

And yes, it's important for me that this performs quickly.
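
For concreteness, the cPickle measurement mentioned in the addendum is essentially the following (the file name here is just an example):

import cPickle
import time

# Roughly what produces the 1.1 s (cold) / 0.2 s (warm) numbers above;
# 'ints.pkl' is a placeholder for the pre-written pickle file.
start = time.time()
with open('ints.pkl', 'rb') as f:
    data = cPickle.load(f)
print "Loaded %d ints in %.2f sec" % (len(data), time.time() - start)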

Thanks

  • Is this really the slow part of your code? How often are you going to load up the file? Commented Feb 17, 2009 at 13:23
  • Have you tried any of these? What metrics do you have right now? Commented Feb 17, 2009 at 13:30
  • For what it is worth, you can avoid import problems by using "execfile()"... Commented Feb 19, 2009 at 5:59

6 Answers


I would guess cPickle will be fastest if you really need the thing in a list.

If you can use an array (from the standard-library array module) instead of a list, I timed this at a quarter of a second for 1 million integers:

from array import array
from datetime import datetime

def WriteInts(theArray, filename):
    # Dump the raw machine-level values straight to disk.
    with open(filename, "wb") as f:
        theArray.tofile(f)

def ReadInts(filename):
    d = datetime.utcnow()
    theArray = array('i')
    with open(filename, "rb") as f:
        try:
            # Ask for more items than the file can hold; fromfile stops
            # at EOF and raises EOFError, which we deliberately ignore.
            theArray.fromfile(f, 1000000000)
        except EOFError:
            pass
    print "Read %d ints in %s" % (len(theArray), datetime.utcnow() - d)
    return theArray

if __name__ == "__main__":
    a = array('i')
    a.extend(range(1000000))
    filename = "a_million_ints.dat"
    WriteInts(a, filename)
    r = ReadInts(filename)
    print "The 5th element is %d" % (r[4])

3 Comments

'Read 1000000 ints in 0:00:03.500000', and it took a quarter of a second for you?
However, you're right: array.fromfile is much faster than cPickle!
@eliben - you might want to pick this as the best answer. The lessons on using the timeit module are popular, but they don't answer your question directly!

For benchmarking, see the timeit module in the Python standard library. To find the fastest way, implement all the approaches you can think of and measure each one with timeit.

Random thought: depending on what you're doing exactly, you may find it fastest to store "sets of integers" in the style used in .newsrc files:

1, 3-1024, 11000-1200000

If you need to check whether something is in that set, then loading and matching with such a representation should be among the fastest ways. This assumes your sets of integers are reasonably dense, with long consecutive sequences of adjacent values.
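
A hedged sketch of that idea (the parser and membership test below are just illustrative helpers, not anything standard):

from bisect import bisect_right

def parse_ranges(text):
    """Parse a .newsrc-style string like '1, 3-1024, 11000-1200000'
    into a sorted list of (start, end) pairs."""
    ranges = []
    for part in text.split(','):
        part = part.strip()
        if '-' in part:
            lo, hi = part.split('-')
            ranges.append((int(lo), int(hi)))
        else:
            ranges.append((int(part), int(part)))
    return sorted(ranges)

def contains(ranges, n):
    # Find the last range starting at or before n, then check its end.
    i = bisect_right(ranges, (n, float('inf'))) - 1
    return i >= 0 and ranges[i][0] <= n <= ranges[i][1]

ranges = parse_ranges("1, 3-1024, 11000-1200000")
print contains(ranges, 500), contains(ranges, 2000)   # True False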



To help you with timing, the Python Library provides the timeit module:

This module provides a simple way to time small bits of Python code. It has both command line as well as callable interfaces. It avoids a number of common traps for measuring execution times.

An example (from the manual) that compares the cost of using hasattr() vs. try/except to test for missing and present object attributes:

% timeit.py 'try:' '  str.__nonzero__' 'except AttributeError:' '  pass'
100000 loops, best of 3: 15.7 usec per loop
% timeit.py 'if hasattr(str, "__nonzero__"): pass'
100000 loops, best of 3: 4.26 usec per loop
% timeit.py 'try:' '  int.__nonzero__' 'except AttributeError:' '  pass'
1000000 loops, best of 3: 1.43 usec per loop
% timeit.py 'if hasattr(int, "__nonzero__"): pass'
100000 loops, best of 3: 2.23 usec per loop
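
For this question the callable interface is probably more convenient than the command line. A rough sketch, assuming the list has already been dumped to a file called 'ints.pkl' (the name is just an example):

import timeit

# number=1 keeps each measurement to a single load; repeat gives several
# samples.  Runs after the first will still benefit from the OS page cache.
t = timeit.Timer(stmt="cPickle.load(open('ints.pkl', 'rb'))",
                 setup="import cPickle")
print t.repeat(repeat=5, number=1)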



"how can one benchmark such things reliably?"

I don't get the question.

You write a bunch of little functions to create and save your list in various forms.

You write a bunch of little functions to load your lists in their various forms.

You write a little timer function that records the start time, runs the load procedure a few dozen times, and averages the results, so that OS scheduling noise doesn't dominate your measurements.

You summarize your data in a little report.

What's unreliable about this?
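
A bare-bones version of that timer function might look like this (the loader names are hypothetical):

import time

def time_loader(load_func, repetitions=30):
    # Average over a few dozen runs so that scheduling noise averages out.
    start = time.time()
    for _ in xrange(repetitions):
        load_func()
    return (time.time() - start) / repetitions

# Usage, with hypothetical loader functions, one per file format:
# print "cPickle:", time_loader(load_with_cpickle)
# print "array  :", time_loader(load_with_array)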

Here are some questions on other topics that show how to measure and compare performance.

Convert list of ints to one number?

String concatenation vs. string substitution in Python

2 Comments

How can I execute "import <filename>" several times in a loop if import is cached?
If your data set is big enough, one measurement may be all you need. If not, you can run from the command line in a shell loop and time that instead. Also, look at imp.load_module.

Do you always need to load the whole file? If not, struct.unpack_from() might be the best solution. Suppose you have 1,000,000 integers but want to load just the ones from index 50000 to 50099; you'd do:

import mmap
import struct

intSize = struct.calcsize('i')  # constant for a given architecture
# unpack_from needs a buffer, not a file object, so memory-map the file.
intFile = open('/your/file.of.integers', 'rb')
buf = mmap.mmap(intFile.fileno(), 0, access=mmap.ACCESS_READ)
intTuple5K100 = struct.unpack_from('i' * 100, buf, 50000 * intSize)



cPickle will be the fastest since it saves the data in binary form and no real Python code has to be parsed.

Other advantages are that it is more secure (since it does not execute commands) and you have no problems with setting $PYTHONPATH correctly.
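
One caveat: the binary form only applies when a binary pickle protocol is chosen explicitly; the default protocol 0 is text-based. A minimal round-trip sketch (the file name is just an example):

import cPickle

data = range(1000000)

# Protocol 2 (the highest in Python 2) is the compact binary format;
# the default protocol 0 is actually a text-based format.
with open('ints.pkl', 'wb') as f:
    cPickle.dump(data, f, cPickle.HIGHEST_PROTOCOL)

with open('ints.pkl', 'rb') as f:
    data_back = cPickle.load(f)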

