11

I want to read a huge text file that contains a list of lists of integers. Now I'm doing the following:

G = []
with open("test.txt", 'r') as f:
    for line in f:
        G.append(list(map(int,line.split())))

However, it takes about 17 secs (via timeit). Is there any way to reduce this time? Maybe there is a way not to use map.

4 Comments

  • Try list-comprehension. Commented Feb 26, 2013 at 18:17
  • Is there some reason not to use numpy here? Commented Feb 26, 2013 at 18:18
  • Define "huge". Also, does each line have the same number of integers? Commented Feb 26, 2013 at 18:37
  • @WarrenWeckesser actually in this example there is the same number of integers, two elements. Huge > 5M lines. Commented Feb 26, 2013 at 18:47

6 Answers

25

numpy has the functions loadtxt and genfromtxt, but neither is particularly fast. One of the fastest text readers available in a widely distributed library is the read_csv function in pandas (http://pandas.pydata.org/). On my computer, reading 5 million lines containing two integers per line takes about 46 seconds with numpy.loadtxt, 26 seconds with numpy.genfromtxt, and a little over 1 second with pandas.read_csv.

Here's the session showing the result. (This is on Linux, Ubuntu 12.04 64 bit. You can't see it here, but after each reading of the file, the disk cache was cleared by running sync; echo 3 > /proc/sys/vm/drop_caches in a separate shell.)

In [1]: import pandas as pd

In [2]: %timeit -n1 -r1 loadtxt('junk.dat')
1 loops, best of 1: 46.4 s per loop

In [3]: %timeit -n1 -r1 genfromtxt('junk.dat')
1 loops, best of 1: 26 s per loop

In [4]: %timeit -n1 -r1 pd.read_csv('junk.dat', sep=' ', header=None)
1 loops, best of 1: 1.12 s per loop
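
If you need the result as a plain Python list of lists, as in the question, the DataFrame can be converted after the fast parse. A minimal sketch (the file name and two-column layout are taken from the question; the conversion step is an assumption, not part of the original answer):

import pandas as pd

# Parse the whitespace-separated integers with the fast C parser ...
df = pd.read_csv('junk.dat', sep=' ', header=None, dtype=int)

# ... then convert to a list of lists only if you really need that structure.
G = df.values.tolist()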

7 Comments

+1, I didn't see your answer while I was preparing mine. I just benchmarked the OP's version too, which takes about 16 s on my machine. I also noted that loadtxt is slow. I'm not sure why; I would expect it to be faster (and it should be faster than genfromtxt). Do you also use numpy 1.7?
@bmu: Yes, I used numpy 1.7.
I opened a numpy issue: github.com/numpy/numpy/issues/3019. I cannot imagine that it is normal for loadtxt to be this slow.
@BranAlgue: Christoph Gohlke provides a tremendous service to the Python community by preparing and hosting binary builds of NumPy (and many other packages) for Windows. Take a look: lfd.uci.edu/~gohlke/pythonlibs/#numpy
Hey @WarrenWeckesser, it helped. It read the file but took about a minute to do so, and the numbers are float type, which is not right. Unfortunately, there is no pandas for Python 3.3. Maybe I should reinstall on 3.2?
5

pandas, which is based on numpy, has a C-based file parser which is very fast:

# generate some integer data (5 M rows, two cols) and write it to file
In [24]: data = np.random.randint(1000, size=(5 * 10**6, 2))

In [25]: np.savetxt('testfile.txt', data, delimiter=' ', fmt='%d')

# your way
In [26]: def your_way(filename):
   ...:     G = []
   ...:     with open(filename, 'r') as f:
   ...:         for line in f:
   ...:             G.append(list(map(int, line.split())))
   ...:     return G        
   ...: 

In [26]: %timeit your_way('testfile.txt')
1 loops, best of 3: 16.2 s per loop

In [27]: %timeit pd.read_csv('testfile.txt', delimiter=' ', dtype=int)
1 loops, best of 3: 1.57 s per loop

So pandas.read_csv takes about one and a half seconds to read your data and is about 10 times faster than your method.

Comments

0

The easiest speedup would be to run the code under PyPy: http://pypy.org/

The next step would be to not read the whole file into memory at all (if possible). Instead, process it as a stream, as sketched below.
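
A minimal sketch of the streaming idea, assuming the rows can be consumed one at a time (e.g. aggregated or filtered on the fly) rather than being needed all at once; the file name is the one from the question:

def iter_rows(filename):
    # Yield one parsed row at a time instead of materializing the whole list.
    with open(filename) as f:
        for line in f:
            yield [int(x) for x in line.split()]

# Example: aggregate the first column without holding all 5M rows in memory.
total = sum(row[0] for row in iter_rows('test.txt'))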

Comments

0

You might also try to bring the data into a database via bulk-insert, then processing your records with set operations. Depending on what you have to do, that may be faster, as bulk-insert software is optimized for this type of task.
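
A minimal sketch of the bulk-insert idea using the standard-library sqlite3 module (the database file, table name, and two-integer-column layout are assumptions based on the example data in the question):

import sqlite3

conn = sqlite3.connect('numbers.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS data (a INTEGER, b INTEGER)')

with open('test.txt') as f:
    # A generator of tuples lets executemany stream the rows into the table.
    rows = (tuple(int(x) for x in line.split()) for line in f)
    conn.executemany('INSERT INTO data VALUES (?, ?)', rows)

conn.commit()
conn.close()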

Comments

0

As a general rule of thumb (for just about any language), using read() to read in the entire file is going to be quicker than reading one line at a time. If you're not constrained by memory, read the whole file at once and then split the data on newlines, then iterate over the list of lines.
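
A minimal sketch of this approach, using the file name from the question; it assumes the whole file fits comfortably in memory:

with open('test.txt') as f:
    data = f.read()  # one read() call instead of millions of line reads

G = [[int(x) for x in line.split()] for line in data.splitlines()]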

Comments

0

List comprehensions are often faster.

G = [[int(item) for item in line.split()] for line in f]

Beyond that, try PyPy, Cython, and numpy.
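
Since each line has a fixed number of columns (two integers, per the question), a numpy-based variant is to parse the whole file in one call and reshape. This is only a sketch of the "try numpy" suggestion, not code from the answer:

import numpy as np

with open('test.txt') as f:
    # Parse every whitespace-separated integer in one call, then reshape to
    # two columns; fromstring's text mode is legacy but works for this sketch.
    arr = np.fromstring(f.read(), dtype=int, sep=' ').reshape(-1, 2)

G = arr.tolist()  # only if a plain list of lists is actually required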

9 Comments

G = [map(int, line.split()) for line in f] is faster.
@StevenRumbalski This line produces map objects: [<map object at 0x0000000002D28898>, <map object at 0x0000000002D28908>, <map object at 0x0000000002D289B0>, ...]. But @forivall's line works.
@BranAlgue. Aha! You are using Python 3. So change that to G = [list(map(int, line.split())) for line in f]. It is still faster than the nested list comprehension.
It's strange, @StevenRumbalski, because your line runs more slowly. Timing each with timeit.timeit(stmt, number=1) on "SCC.txt": G = [list(map(int, line.split())) for line in f] takes 16.29 s, while G = [[int(item) for item in line.split()] for line in f] takes 11.39 s.
It's possible that Python 3 improved the performance of list comprehensions. Old question outlining this: stackoverflow.com/questions/1247486/…
