11

I want to read a huge text file that contains a list of lists of integers. Now I'm doing the following:

G = []
with open("test.txt", 'r') as f:
    for line in f:
        G.append(list(map(int,line.split())))

However, it takes about 17 secs (via timeit). Is there any way to reduce this time? Maybe there is a way not to use map.

4 Comments

  • Try list-comprehension. Commented Feb 26, 2013 at 18:17
  • Is there some reason not to use numpy here? Commented Feb 26, 2013 at 18:18
  • Define "huge". Also, does each line have the same number of integers? Commented Feb 26, 2013 at 18:37
  • @WarrenWeckesser actually in this example there is the same number of integers, two elements. Huge > 5M lines. Commented Feb 26, 2013 at 18:47

6 Answers

25

numpy has the functions loadtxt and genfromtxt, but neither is particularly fast. One of the fastest text readers available in a widely distributed library is the read_csv function in pandas (http://pandas.pydata.org/). On my computer, reading 5 million lines containing two integers per line takes about 46 seconds with numpy.loadtxt, 26 seconds with numpy.genfromtxt, and a little over 1 second with pandas.read_csv.

Here's the session showing the result. (This is on Linux, Ubuntu 12.04 64 bit. You can't see it here, but after each reading of the file, the disk cache was cleared by running sync; echo 3 > /proc/sys/vm/drop_caches in a separate shell.)

In [1]: import pandas as pd

In [2]: %timeit -n1 -r1 loadtxt('junk.dat')
1 loops, best of 1: 46.4 s per loop

In [3]: %timeit -n1 -r1 genfromtxt('junk.dat')
1 loops, best of 1: 26 s per loop

In [4]: %timeit -n1 -r1 pd.read_csv('junk.dat', sep=' ', header=None)
1 loops, best of 1: 1.12 s per loop
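
If you need the result as a plain Python list of lists, as in the question, the DataFrame can be converted after the fast parse. A minimal sketch (the file name and two-column layout are taken from the question; the conversion step is an assumption, not part of the original answer):

import pandas as pd

# Parse the whitespace-separated integers with the fast C parser ...
df = pd.read_csv('junk.dat', sep=' ', header=None, dtype=int)

# ... then convert to a list of lists only if you really need that structure.
G = df.values.tolist()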

7 Comments

+1, I didn't see your answer while I was preparing mine. I just benchmarked the OP's version too, which takes about 16 s on my machine. I also noted that loadtxt is slow. I'm not sure why; I would expect it to be faster (and it should be faster than genfromtxt). Do you also use numpy 1.7?
@bmu: Yes, I used numpy 1.7.
I opened a numpy issue: github.com/numpy/numpy/issues/3019. I cannot imagine that it is normal for loadtxt to be this slow.
@BranAlgue: Christoph Gohlke provides a tremendous service to the Python community by preparing and hosting binary builds of NumPy (and many other packages) for Windows. Take a look: lfd.uci.edu/~gohlke/pythonlibs/#numpy
Hey @WarrenWeckesser, it helped. It read the file but took about a minute to do so, and the numbers are float type, which is not right. Unfortunately, there is no pandas for Python 3.3. Maybe I should reinstall on 3.2?
5

pandas, which is based on numpy, has a C-based file parser which is very fast:

# generate some integer data (5 M rows, two cols) and write it to file
In [24]: data = np.random.randint(1000, size=(5 * 10**6, 2))

In [25]: np.savetxt('testfile.txt', data, delimiter=' ', fmt='%d')

# your way
In [26]: def your_way(filename):
   ...:     G = []
   ...:     with open(filename, 'r') as f:
   ...:         for line in f:
   ...:             G.append(list(map(int, line.split())))
   ...:     return G        
   ...: 

In [26]: %timeit your_way('testfile.txt')
1 loops, best of 3: 16.2 s per loop

In [27]: %timeit pd.read_csv('testfile.txt', delimiter=' ', dtype=int)
1 loops, best of 3: 1.57 s per loop

So pandas.read_csv takes about one and a half seconds to read your data and is about 10 times faster than your method.

Comments

0

The easiest speedup would be to run the code under PyPy: http://pypy.org/

The next step would be to not read the whole file into memory at all (if possible). Instead, process it as a stream, as sketched below.
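
A minimal sketch of the streaming idea, assuming the rows can be consumed one at a time (e.g. aggregated or filtered on the fly) rather than being needed all at once; the file name is the one from the question:

def iter_rows(filename):
    # Yield one parsed row at a time instead of materializing the whole list.
    with open(filename) as f:
        for line in f:
            yield [int(x) for x in line.split()]

# Example: aggregate the first column without holding all 5M rows in memory.
total = sum(row[0] for row in iter_rows('test.txt'))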

Comments

0

You might also try to bring the data into a database via bulk-insert, then processing your records with set operations. Depending on what you have to do, that may be faster, as bulk-insert software is optimized for this type of task.
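
A minimal sketch of the bulk-insert idea using the standard-library sqlite3 module (the database file, table name, and two-integer-column layout are assumptions based on the example data in the question):

import sqlite3

conn = sqlite3.connect('numbers.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS data (a INTEGER, b INTEGER)')

with open('test.txt') as f:
    # A generator of tuples lets executemany stream the rows into the table.
    rows = (tuple(int(x) for x in line.split()) for line in f)
    conn.executemany('INSERT INTO data VALUES (?, ?)', rows)

conn.commit()
conn.close()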

Comments

0

As a general rule of thumb (for just about any language), using read() to read in the entire file is going to be quicker than reading one line at a time. If you're not constrained by memory, read the whole file at once and then split the data on newlines, then iterate over the list of lines.
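
A minimal sketch of this approach, using the file name from the question; it assumes the whole file fits comfortably in memory:

with open('test.txt') as f:
    data = f.read()  # one read() call instead of millions of line reads

G = [[int(x) for x in line.split()] for line in data.splitlines()]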

Comments

0

List comprehensions are often faster.

G = [[int(item) for item in line.split()] for line in f]

Beyond that, try PyPy, Cython, and numpy.
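
Since each line has a fixed number of columns (two integers, per the question), a numpy-based variant is to parse the whole file in one call and reshape. This is only a sketch of the "try numpy" suggestion, not code from the answer:

import numpy as np

with open('test.txt') as f:
    # Parse every whitespace-separated integer in one call, then reshape to
    # two columns; fromstring's text mode is legacy but works for this sketch.
    arr = np.fromstring(f.read(), dtype=int, sep=' ').reshape(-1, 2)

G = arr.tolist()  # only if a plain list of lists is actually required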

9 Comments

G = [map(int, line.split()) for line in f] is faster.
@StevenRumbalski This line produces map objects: [<map object at 0x0000000002D28898>, <map object at 0x0000000002D28908>, <map object at 0x0000000002D289B0>, ...]. But @forivall's line works.
@BranAlgue. Aha! You are using Python 3. So change that to G = [list(map(int, line.split())) for line in f]. It is still faster than the nested list comprehension.
It's strange, @StevenRumbalski, because your line runs more slowly. Timing each with timeit.timeit(stmt, number=1) on "SCC.txt": G = [list(map(int, line.split())) for line in f] takes 16.29 s, while G = [[int(item) for item in line.split()] for line in f] takes 11.39 s.
It's possible that Python 3 improved the performance of list comprehensions. Old question outlining this: stackoverflow.com/questions/1247486/…
