
I posted this question because I was wondering whether I did something terribly wrong to get this result.

I have a medium-size CSV file that I tried to load with numpy. For illustration, I generated the file with Python:

import timeit
import numpy as np

my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')

Then I tried two methods, numpy.genfromtxt and numpy.loadtxt:

setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.loadtxt('./test.csv', delimiter=',')
"""

t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)

The results were t1 = 32.159652940464184 and t2 = 52.00093725634724.
However, when I tried MATLAB:

tic
for i = 1:3
    my_data = dlmread('./test.csv');
end
toc

The result shows: Elapsed time is 3.196465 seconds.

I understand that there may be some differences in the loading speed, but:

  1. The difference is much larger than I expected;
  2. Shouldn't np.loadtxt be faster than np.genfromtxt?
  3. I haven't tried the Python csv module yet, because loading CSV files is something I do very often, and with the csv module the code gets a bit verbose... but I'd be happy to try it if that's the only way. For now I'm mostly concerned with whether I'm doing something wrong.

Any input would be appreciated. Thanks a lot in advance!

5 Answers


Yeah, reading CSV files into numpy is pretty slow. There's a lot of pure Python along the code path. These days, even when I'm using pure numpy I still use pandas for IO:

>>> import numpy as np, pandas as pd
>>> %time d = np.genfromtxt("./test.csv", delimiter=",")
CPU times: user 14.5 s, sys: 396 ms, total: 14.9 s
Wall time: 14.9 s
>>> %time d = np.loadtxt("./test.csv", delimiter=",")
CPU times: user 25.7 s, sys: 28 ms, total: 25.8 s
Wall time: 25.8 s
>>> %time d = pd.read_csv("./test.csv", delimiter=",").values
CPU times: user 740 ms, sys: 36 ms, total: 776 ms
Wall time: 780 ms
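
One caveat worth noting: without header=None, read_csv treats the first row of this headerless file as column names and silently drops it from the data. A minimal sketch of the corrected call (to_numpy() is the documented accessor in recent pandas; .values works the same here):

import pandas as pd

# header=None keeps the first data row; without it, read_csv would
# treat that row as column names and drop it from the result.
d = pd.read_csv("./test.csv", delimiter=",", header=None).to_numpy()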

Alternatively, in a simple enough case like this one, you could use something like what Joe Kington wrote here:
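
For reference, a rough sketch of that generator-based approach (my paraphrase of the idea; the original may differ in details):

import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    # Stream the file item by item; numpy.fromiter builds the array
    # without the per-line overhead of loadtxt/genfromtxt.
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Column count of the last line, used to restore the 2-D shape.
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    return data.reshape((-1, iter_loadtxt.rowlength))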

>>> %time data = iter_loadtxt("test.csv")
CPU times: user 2.84 s, sys: 24 ms, total: 2.86 s
Wall time: 2.86 s

There's also Warren Weckesser's textreader library, in case pandas is too heavy a dependency:

>>> import textreader
>>> %time d = textreader.readrows("test.csv", float, ",")
readrows: numrows = 1500000
CPU times: user 1.3 s, sys: 40 ms, total: 1.34 s
Wall time: 1.34 s

2 Comments

Thank you very much! pd.read_csv works great for me - in fact it finished in only half the time MATLAB took! Thanks also for the two other informative, lighter-weight methods.
Speed is not the only thing to care about. For me, both np.genfromtxt and pd.read_csv need more RAM than I have to read a 1,209,836,036-byte text file: the former hangs the system regardless, while the latter at least throws an error. np.fromfile is almost 4 times quicker than np.loadtxt, and neither of those takes much memory to run.

I've performance-tested the suggested solutions with perfplot (a small project of mine) and found that

pandas.read_csv(filename)

is indeed the fastest solution (once more than about 2000 entries are read; below that, everything is in the millisecond range). It outperforms numpy's variants by a factor of about 10. (numpy.fromfile is included only for comparison; it cannot read actual CSV files.)

[perfplot benchmark: runtime vs. number of entries for each reader; pandas.read_csv is fastest beyond ~2000 entries]

Code to reproduce the plot:

import numpy
import pandas
import perfplot

numpy.random.seed(0)
filename = "a.txt"


def setup(n):
    a = numpy.random.rand(n)
    numpy.savetxt(filename, a)
    return None


def numpy_genfromtxt(data):
    return numpy.genfromtxt(filename)


def numpy_loadtxt(data):
    return numpy.loadtxt(filename)


def numpy_fromfile(data):
    out = numpy.fromfile(filename, sep=" ")
    return out


def pandas_readcsv(data):
    return pandas.read_csv(filename, header=None).values.flatten()


def kington(data):
    # Generator-based reader in the style of Joe Kington's iter_loadtxt:
    # stream the file item by item and let numpy.fromiter build the array.
    delimiter = " "
    skiprows = 0
    dtype = float

    def iter_func():
        with open(filename, "r") as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Column count of the last line, in case a caller wants to reshape.
        kington.rowlength = len(line)

    data = numpy.fromiter(iter_func(), dtype=dtype).flatten()
    return data


b = perfplot.bench(
    setup=setup,
    kernels=[numpy_genfromtxt, numpy_loadtxt, numpy_fromfile, pandas_readcsv, kington],
    n_range=[2 ** k for k in range(23)],
)
b.save("out.png")



If you just want to save and re-read a numpy array, it's much better to save it as binary, or compressed binary depending on size:

import timeit
import numpy as np

my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')
np.save('./testy', my_data)
np.savez('./testz', my_data)
del my_data

setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.load('./testy.npy')
"""
stmt3 = """\
my_data = np.load('./testz.npz')['arr_0']
"""

t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)
t3 = timeit.timeit(stmt=stmt3, setup=setup_stmt, number=3)

genfromtxt 39.717250824
save 0.0667860507965
savez 0.268463134766
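
For actual compression (savez stores the arrays in an uncompressed archive), numpy also provides savez_compressed; a minimal sketch:

import numpy as np

my_data = np.random.rand(1500000, 3)*10

# savez_compressed applies zlib compression, trading save/load time
# for a smaller file on disk.
np.savez_compressed('./testc', my_data)

with np.load('./testc.npz') as archive:
    my_data = archive['arr_0']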

2 Comments

Thank you Ophion! This is a great answer, and really useful - I have been using cPickle but now realize that np.savez is faster and more compact than cPickle, as long as only ndarrays are involved. I did not mark it accepted because in this question I was trying to read experimental data saved by LabVIEW. But still, thank you so much!
I believe this should be selected as the correct answer! Thank you @Ophion

FWIW the built-in csv module works great and really is not that verbose.

csv module:

%%timeit
with open('test.csv', 'r') as f:
    np.array([l for l in csv.reader(f)])


1 loop, best of 3: 1.62 s per loop

np.loadtxt:

%timeit np.loadtxt('test.csv', delimiter=',')

1 loop, best of 3: 16.6 s per loop

pd.read_csv:

%timeit pd.read_csv('test.csv', header=None).values

1 loop, best of 3: 663 ms per loop

Personally I like using pandas read_csv but the csv module is nice when I'm using pure numpy.
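
One caveat on the csv-module snippet above: csv.reader yields strings, so the array it builds has a string dtype, not float. To get numbers comparable to what loadtxt returns, convert explicitly; a minimal sketch (assuming a purely numeric file like test.csv):

import csv
import numpy as np

# dtype=float converts the string fields that csv.reader yields.
with open('test.csv', 'r') as f:
    my_data = np.array(list(csv.reader(f)), dtype=float)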

1 Comment

I know this is an old question, but if you are using pure numpy, you can still use pandas for IO and then use `pd.DataFrame.values` to extract the numpy array.

Perhaps it's better to rig up a simple C program that converts the data to binary and have numpy read the binary file. I have a 20 GB CSV file to read, with the data being a mixture of int, double, and str. Reading it into a numpy array of structs takes more than an hour, while dumping to binary took about 2 minutes and loading into numpy takes less than 2 seconds!

My specific code, for example, is available here.
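
As an illustrative sketch of the reading side (not the linked code; the record layout below is hypothetical and must match whatever the converter actually writes):

import numpy as np

# Hypothetical fixed-width record: a 4-byte int, an 8-byte double,
# and a 16-byte string per row. Adjust to your actual binary format.
record = np.dtype([('id', '<i4'), ('value', '<f8'), ('label', 'S16')])

data = np.fromfile('data.bin', dtype=record)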

1 Comment

Good results. Consider sharing sample code for others.
