
I need to read in data which is stored in many files of the same format but varying length, i.e. identical columns but a varying number of rows. Furthermore, I need each column of the data to be stored in one array (preferably a numpy array, but a list is also acceptable).

For now, I read in every file in a loop with numpy.loadtxt() and then concatenate the resulting arrays. Say the data consists of 3 columns and is stored in the two files "foo" and "bar":

import numpy as np
filenames = ["foo", "bar"]
col1_all = 0  # data will be stored in these 3 arrays
col2_all = 0
col3_all = 0
for f in filenames:
    col1, col2, col3 = np.loadtxt(f, unpack=True)
    if col1.shape[0] > 0:  # I can't guarantee a file won't be empty
        if type(col1_all) == int:
            # no data read in yet, just copy the arrays
            col1_all = col1[:]
            col2_all = col2[:]
            col3_all = col3[:]
        else:
            col1_all = np.concatenate((col1_all, col1))
            col2_all = np.concatenate((col2_all, col2))
            col3_all = np.concatenate((col3_all, col3))

My question is: Is there a better/faster way to do this? I need this to be as quick as possible, as I need to read in hundreds of files.

I could imagine, for example, that first finding out how many rows I will have in total, "allocating" an array big enough to fit all the data, and then copying the read-in data into that array might perform better, as it circumvents the concatenations. I don't know the total number of rows in advance, so the counting would have to be done in Python too.
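A minimal sketch of that two-pass idea, assuming plain whitespace-delimited files with no comment or blank lines (the line-counting loop and the hard-coded 3 columns are assumptions of mine):

import numpy as np

filenames = ["foo", "bar"]

# first pass: count the data rows in each file
# (assumes no comment or blank lines; adjust the count if your files have them)
counts = []
for f in filenames:
    with open(f) as fh:
        counts.append(sum(1 for line in fh if line.strip()))

data = np.empty((sum(counts), 3))  # allocate once; 3 columns assumed

# second pass: read each file directly into its slice of the big array
start = 0
for f, n in zip(filenames, counts):
    if n > 0:  # skip empty files
        data[start:start + n] = np.loadtxt(f, ndmin=2)
        start += n

col1_all, col2_all, col3_all = data.T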

Another idea would be to first read in all the data, store each file's read-in result separately, and concatenate them at the end. (Or, as this essentially gives me the total number of rows, allocate an array that fits all the data and then copy the data into it.)
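A sketch of that accumulate-then-concatenate variant, with an empty-file check added (ndmin=2 keeps a one-row file two-dimensional; the rest is standard numpy):

import numpy as np

filenames = ["foo", "bar"]

# collect the per-file arrays in a plain list (appends are cheap),
# then pay for a single concatenation at the end
pieces = []
for f in filenames:
    arr = np.loadtxt(f, ndmin=2)  # ndmin=2: a one-row file stays 2-D
    if arr.size > 0:              # skip empty files
        pieces.append(arr)

data = np.concatenate(pieces)
col1_all, col2_all, col3_all = data.T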

Does anyone have experience on what works best?

  • As a general rule, concatenating as you go along is slowest, because it makes a new array each time. Appending to a list is relatively fast, because a list just collects pointers. Insertion into a preallocated array is also good. Commented Apr 21, 2018 at 21:26
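If you want to verify that rule of thumb on your own machine, a rough timeit sketch (the array sizes and repeat counts are arbitrary) could look like this:

import timeit
import numpy as np

chunks = [np.random.rand(1000) for _ in range(200)]

def grow_by_concatenate():
    out = chunks[0]
    for c in chunks[1:]:
        out = np.concatenate((out, c))  # reallocates on every iteration
    return out

def collect_then_concatenate():
    return np.concatenate(chunks)  # a single allocation at the end

print(timeit.timeit(grow_by_concatenate, number=100))
print(timeit.timeit(collect_then_concatenate, number=100))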

1 Answer


Don't concatenate each file with the rest as you go; read everything in, and build the result at the end:

import numpy as np
filenames = ["foo", "bar"]
# ndmin=2 keeps one-row files two-dimensional so the pieces concatenate cleanly
data = np.concatenate([np.loadtxt(f, ndmin=2) for f in filenames])

If you like, you can split data into columns, but most of the time this is not a good idea.
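For completeness, a one-line way to get the three per-column arrays from data (under the 3-column assumption from the question):

# data has shape (total_rows, 3), so its transpose unpacks per column;
# each colN_all is a view into data, not a copy
col1_all, col2_all, col3_all = data.T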
