
I need to read in data which is stored in many files of the same format but varying length, i.e. identical columns but a varying number of rows. Furthermore, I need each column of the data to be stored in one array (preferably a numpy array, but a list is also acceptable).

For now, I read in every file in a loop with numpy.loadtxt() and then concatenate the resulting arrays. Say the data consists of 3 columns and is stored in the two files "foo" and "bar":

import numpy as np
filenames = ["foo", "bar"]
col1_all = 0  # data will be stored in these 3 arrays
col2_all = 0
col3_all = 0
for f in filenames:
    col1, col2, col3 = np.loadtxt(f, unpack=True)
    if col1.shape[0] > 0:  # I can't guarantee a file won't be empty
        if type(col1_all) == int:
            # no data read in yet, just copy the arrays
            col1_all = col1[:]
            col2_all = col2[:]
            col3_all = col3[:]
        else:
            col1_all = np.concatenate((col1_all, col1))
            col2_all = np.concatenate((col2_all, col2))
            col3_all = np.concatenate((col3_all, col3))

My question is: Is there a better/faster way to do this? I need this to be as quick as possible, as I need to read in hundreds of files.

I could imagine, for example, that first finding out how many rows I will have in total, "allocating" an array big enough to fit all the data, and then copying the read-in data into that array might perform better, as it circumvents the concatenations. I don't know the total number of rows in advance, so the counting would have to be done in Python too.
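A minimal sketch of that two-pass idea, assuming plain whitespace-delimited files with no comment or blank lines (the line-counting loop and the hard-coded 3 columns are assumptions of mine):

import numpy as np

filenames = ["foo", "bar"]

# first pass: count the data rows in each file
# (assumes no comment or blank lines; adjust the count if your files have them)
counts = []
for f in filenames:
    with open(f) as fh:
        counts.append(sum(1 for line in fh if line.strip()))

data = np.empty((sum(counts), 3))  # allocate once; 3 columns assumed

# second pass: read each file directly into its slice of the big array
start = 0
for f, n in zip(filenames, counts):
    if n > 0:  # skip empty files
        data[start:start + n] = np.loadtxt(f, ndmin=2)
        start += n

col1_all, col2_all, col3_all = data.T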

Another idea would be to first read in all the data, store each file's read-in result separately, and concatenate them at the end. (Or, as this essentially gives me the total number of rows, allocate an array that fits all the data and then copy the data into it.)
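A sketch of that accumulate-then-concatenate variant, with an empty-file check added (ndmin=2 keeps a one-row file two-dimensional; the rest is standard numpy):

import numpy as np

filenames = ["foo", "bar"]

# collect the per-file arrays in a plain list (appends are cheap),
# then pay for a single concatenation at the end
pieces = []
for f in filenames:
    arr = np.loadtxt(f, ndmin=2)  # ndmin=2: a one-row file stays 2-D
    if arr.size > 0:              # skip empty files
        pieces.append(arr)

data = np.concatenate(pieces)
col1_all, col2_all, col3_all = data.T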

Does anyone have experience on what works best?

  • As a general rule, concatenating as you go along is slowest, because it makes a new array each time. Appending to a list is relatively fast, because a list just collects pointers. Insertion into a preallocated array is also good. Commented Apr 21, 2018 at 21:26
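If you want to verify that rule of thumb on your own machine, a rough timeit sketch (the array sizes and repeat counts are arbitrary) could look like this:

import timeit
import numpy as np

chunks = [np.random.rand(1000) for _ in range(200)]

def grow_by_concatenate():
    out = chunks[0]
    for c in chunks[1:]:
        out = np.concatenate((out, c))  # reallocates on every iteration
    return out

def collect_then_concatenate():
    return np.concatenate(chunks)  # a single allocation at the end

print(timeit.timeit(grow_by_concatenate, number=100))
print(timeit.timeit(collect_then_concatenate, number=100))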

1 Answer


Don't concatenate each file with the rest as you go; read everything in, and build the result at the end:

import numpy as np
filenames = ["foo", "bar"]
# ndmin=2 keeps one-row files two-dimensional so the pieces concatenate cleanly
data = np.concatenate([np.loadtxt(f, ndmin=2) for f in filenames])

If you like, you can split data into columns, but most of the time this is not a good idea.
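For completeness, a one-line way to get the three per-column arrays from data (under the 3-column assumption from the question):

# data has shape (total_rows, 3), so its transpose unpacks per column;
# each colN_all is a view into data, not a copy
col1_all, col2_all, col3_all = data.T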
