
It is well known [1][2] that numpy.loadtxt is not particularly fast at loading simple text files containing numbers.

I have been googling around for alternatives, and of course I stumbled across pandas.read_csv and astropy's io.ascii. However, these readers don't appear to be easy to decouple from their libraries, and I'd like to avoid adding a 200 MB, 5-seconds-import-time gorilla just for reading some ASCII files.

The files I usually read are simple: no missing data, no malformed rows, no NaNs, floating point only, space- or comma-separated. But I need NumPy arrays as output.

Does anyone know if any of the parsers above can be used standalone or about any other quick parser I could use?

Thank you in advance.

[1] Numpy loading csv TOO slow compared to Matlab

[2] http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/

[Edit 1]

For the sake of clarity and to reduce background noise: as I stated at the beginning, my ASCII files contain simple floats. No scientific notation, no Fortran-specific data, no funny stuff, nothing but simple floats.

Sample:

```python
import numpy as np

arr = np.random.rand(1000, 100)
np.savetxt('float.csv', arr)
```

  • Similar current question, stackoverflow.com/questions/52232559/…. Not a duplicate since it doesn't have an answer either. Commented Sep 8, 2018 at 19:32
  • Typically what's the shape of the loaded array? Commented Sep 8, 2018 at 19:49
  • Please provide some sample lines. Commented Sep 8, 2018 at 20:12
  • If import times are an issue, I'm wondering if you could save some by just pulling in the relevant parts of pandas.io to avoid grabbing the full API. Commented Sep 8, 2018 at 22:07
  • @hjpauli, it varies wildly, I have a few files containing data that is around 30x3, many others up to 10,000x9. Commented Sep 9, 2018 at 5:03

1 Answer


Personally I just use pandas and astropy for this. Yes, they are big and slow to import, but they are very widely available, and on my machine they import in under a second, so they aren't so bad. I haven't tried it, but I would assume that extracting the CSV reader from pandas or astropy and getting it to build and run standalone isn't easy; probably not a good way to go.

Is writing your own CSV-to-NumPy-array reader an option? If the CSV is simple, it should be doable in ~100 lines of e.g. C or Cython, and if you know your CSV format you can get performance and package size that a generic solution can't beat.
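Even before reaching for C or Cython, a rough illustration of how small such a reader can be in pure Python plus NumPy (the function name fast_loadtxt is made up here, and it assumes exactly the clean, whitespace-separated float files the question describes, like the np.savetxt sample above):

```python
import numpy as np

def fast_loadtxt(filename):
    """Minimal reader for clean, whitespace-separated float files.

    Assumes no missing data, no comments, and every row the same
    width -- exactly the kind of file np.savetxt produces.
    """
    with open(filename) as f:
        text = f.read()
    # The number of columns comes from the first line.
    ncols = len(text[:text.index('\n')].split())
    # One big split plus one array construction avoids the per-line
    # Python overhead that makes np.loadtxt slow.
    return np.array(text.split(), dtype=float).reshape(-1, ncols)
```

For comma-separated files you would first do text.replace(',', ' '); anything messier (missing fields, comments) is where the generic readers earn their weight.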

Another option you could look at is https://odo.readthedocs.io/ . I don't have experience with it, and from a quick look I didn't see a direct CSV -> NumPy path. But it does make fast CSV -> database conversion simple, and there are fast database -> NumPy array options. So it might be possible to get a fast CSV -> in-memory SQLite -> NumPy array pipeline via odo and possibly a second package.
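As a sketch of the database -> NumPy leg of that idea, using only the stdlib sqlite3 module in place of odo (the table name t and the two-column schema are invented for illustration):

```python
import sqlite3
import numpy as np

# Small in-memory table standing in for the CSV -> SQLite step
# that odo would perform.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a REAL, b REAL)')
conn.executemany('INSERT INTO t VALUES (?, ?)',
                 [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])

# fetchall() returns a list of row tuples, which np.array turns
# straight into a 2-D float array.
arr = np.array(conn.execute('SELECT a, b FROM t').fetchall())
```

Whether the round trip through a database is actually faster than parsing the text directly would need measuring; for files in the 10,000 x 9 range mentioned in the comments it may not pay off.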


1 Comment

Thank you for the suggestions. It seems odo uses pandas under the hood, so back to square one...
