Efficiently convert numpy array of arrays to pandas series of arrays

Question

How would I efficiently convert a numpy array of arrays into a list of arrays? Ultimately, I want to make a pandas Series of arrays to use as a column in a dataframe. If there is a better way to get there directly, that would also be good.

The following reproducible code solves the issue with list() or .tolist(), but either is much too slow to implement on my actual data set. I am looking for something much faster.

import numpy as np 
import pandas as pd

a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7])])

s = pd.Series(a.tolist())

s = pd.Series(list(a))

This results in the shape going from a.shape = (2,4) to s.values.shape = (2,).

a is a 2d array, (2,4), not an array of arrays (unless you went to the extra work of constructing a (2,) shape object array first). That should map onto a 4 column DataFrame. Or do you really want a Series where each element is an array (and object dtype)? I don't think that will be an efficient Series. It isn't an efficient array.
– hpaulj, Commented Aug 5, 2018 at 4:31
@hpaulj - Yes I "want a Series where each element is an array."
– Clay, Commented Aug 5, 2018 at 9:23
@miradulo that results in a separate column for each element in the nested arrays. I want the resulting data frame to have one column where each row has one of the nested arrays of a.
– Clay, Commented Aug 5, 2018 at 10:47
Do you know how to make a 1d array that contains arrays, with object dtype? Your example a doesn't qualify. Try varying the subarray length, or include a None.
– hpaulj, Commented Aug 5, 2018 at 12:04

hpaulj · Accepted Answer · 2018-08-05 17:56:40Z

Your a:

In [2]: a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7])])
   ...:

a is a (2,4) numeric array; we could have just written a = np.array([[0,1,2,3],[4,5,6,7]]). Creating a (2,) array of arrays requires a different construction.

As others wrote, making a dataframe from this is trivial:

In [3]: pd.DataFrame(a)     # dtypes int64
Out[3]: 
   0  1  2  3
0  0  1  2  3
1  4  5  6  7

But making a series from it raises an error:

In [4]: pd.Series(a)
---------------------------------------------------------------------------
...
Exception: Data must be 1-dimensional

Your question would have been clearer if it had shown this error and explained why you then tried the list inputs:

In [5]: pd.Series(a.tolist())
Out[5]: 
0    [0, 1, 2, 3]
1    [4, 5, 6, 7]
dtype: object
In [6]: pd.Series(list(a))
Out[6]: 
0    [0, 1, 2, 3]
1    [4, 5, 6, 7]
dtype: object

On the surface these are the same, but when we look at actual elements of the Series, we see that one contains lists, the other arrays. That's because tolist and list() create different lists from the array.

In [8]: Out[5][0]
Out[8]: [0, 1, 2, 3]
In [9]: Out[6][0]
Out[9]: array([0, 1, 2, 3])

My experience is that a.tolist() is quite fast. list(a) is equivalent to [i for i in a]; in effect it iterates on the first dimension of a, returning (in this case) a 1d array (row) each time.
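The difference between the two conversions can be checked directly. A minimal sketch, using the same small 2d a written as a nested list:

```python
import numpy as np

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

as_lists = a.tolist()   # recursive conversion: nested Python lists of Python ints
as_arrays = list(a)     # iterates on the first dimension only: a list of 1d arrays

print(type(as_lists[0]), type(as_lists[0][0]))
print(type(as_arrays[0]), as_arrays[0].shape)
```

So tolist() converts all the way down to Python scalars, while list() stops after one level and leaves each row as an ndarray.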

Let's change a so it is a 1d object dtype array:

In [14]: a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7]), np.array([1]), None], dtype=object)   # NumPy >= 1.24 requires dtype=object for ragged input
In [15]: a
Out[15]: 
array([array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([1]), None],
      dtype=object)

Now we can make a Series from it:

In [16]: pd.Series(a)
Out[16]: 
0    [0, 1, 2, 3]
1    [4, 5, 6, 7]
2             [1]
3            None
dtype: object
In [17]: Out[16][0]
Out[17]: array([0, 1, 2, 3])

In fact we could make a series from a slice of a, the one containing just the original 2 rows:

In [18]: pd.Series(a[:2])
Out[18]: 
0    [0, 1, 2, 3]
1    [4, 5, 6, 7]
dtype: object

The tricks for constructing 1d object dtype arrays have been discussed in depth in other SO questions.
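One such construction is to allocate an empty object array and then fill it. A minimal sketch (the helper name rows_to_object_array is made up for illustration, not a NumPy function):

```python
import numpy as np

def rows_to_object_array(arr2d):
    """Return a (n,) object array whose elements are the rows of arr2d."""
    # Preallocating fixes the shape at (n,), so numpy cannot rebuild a 2d array.
    out = np.empty(len(arr2d), dtype=object)
    # Each slot receives one 1d row of arr2d.
    out[:] = list(arr2d)
    return out

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
obj = rows_to_object_array(a)
print(obj.shape, obj.dtype)
print(type(obj[0]))
```

Calling np.array(list(a)) or np.array(list(a), dtype=object) would not do the job here: with equal-length rows numpy builds a (2,4) array again.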

Beware that a Series like this does not behave like a multicolumn DataFrame. I've seen attempts to write csv files, where elements like this get saved as quoted strings.
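The CSV issue is easy to reproduce. A small sketch, assuming an in-memory buffer (io.StringIO) stands in for a real file and using a made-up column name "a":

```python
import io

import numpy as np
import pandas as pd

a1 = np.empty(2, dtype=object)
a1[:] = [np.array([0, 1, 2, 3]), np.array([4, 5, 6, 7])]
df = pd.DataFrame({"a": pd.Series(a1)})

# Write to CSV and read it straight back.
buf = io.StringIO()
df.to_csv(buf, index=False)
back = pd.read_csv(io.StringIO(buf.getvalue()))

print(type(df["a"][0]))    # an ndarray before the round trip
print(type(back["a"][0]))  # a plain string afterwards
```

Each array is written using its string form, so read_csv hands back text, not arrays. Recovering the numbers takes an explicit parsing step.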

Let's compare some construction times:

Make larger arrays of the 2 types:

In [25]: a0 = np.ones([1000,4],int)
In [26]: a1 = np.empty(1000, object)
In [27]: a1[:] = [np.ones(4,int) for _ in range(1000)]
# a1[:] = list(a0)   # faster

First make a DataFrame:

In [28]: timeit pd.DataFrame(a0)
136 µs ± 919 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is the same time as for Out[3]; apparently just the overhead of making a DataFrame with a 2d array (any size) as values.

Making a series as you did:

In [29]: timeit pd.Series(list(a0))
434 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [30]: timeit pd.Series(a0.tolist())
315 µs ± 5.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Both of these take longer than for the small a, reflecting the iterative nature of the creation.

And with the 1d object array:

In [31]: timeit pd.Series(a1)
103 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is the same as for the small 1d array. As with In[28] I think there's just the overhead of creating a Series object, and then assigning it an unchanged values array.

Constructing the a1 array itself, however, is the slower step.
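Outside IPython, the same comparison can be scripted with the standard-library timeit module. This is only a sketch: the helper series_via_object_array is a made-up name, and the numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

a0 = np.ones((1000, 4), dtype=int)
a1 = np.empty(1000, dtype=object)
a1[:] = list(a0)

def series_via_object_array(arr2d):
    # Includes the cost of building the object array, not just wrapping it.
    obj = np.empty(len(arr2d), dtype=object)
    obj[:] = list(arr2d)
    return pd.Series(obj)

candidates = {
    "pd.DataFrame(a0)": lambda: pd.DataFrame(a0),
    "pd.Series(list(a0))": lambda: pd.Series(list(a0)),
    "pd.Series(a0.tolist())": lambda: pd.Series(a0.tolist()),
    "pd.Series(a1)": lambda: pd.Series(a1),
    "build a1 + pd.Series": lambda: series_via_object_array(a0),
}

for label, func in candidates.items():
    # Best of 5 repeats of 100 calls each, reported per call in microseconds.
    best = min(timeit.repeat(func, number=100, repeat=5)) / 100
    print(f"{label:24s} {best * 1e6:8.1f} µs")
```

The last entry is the fair one to compare against the list-based versions when the starting point is a 2d array, since it counts the conversion as well.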

An object array like a1 is in many ways just like a list: it contains pointers to objects elsewhere in memory. It can be useful if the elements differ in type (e.g. include strings or None), but computationally it is not the equivalent of a 2d array.
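A short sketch of that difference, reusing the a0 and a1 construction from the timing test (the variable names total_2d, total_obj and restored are only for illustration):

```python
import numpy as np

a0 = np.ones((1000, 4), dtype=int)
a1 = np.empty(1000, dtype=object)
a1[:] = list(a0)

# One vectorized reduction over a contiguous numeric block.
total_2d = a0.sum(axis=0)

# On an object array the reduction adds the element arrays one at a time.
total_obj = a1.sum()

# When all elements have the same shape, np.stack rebuilds the numeric 2d array.
restored = np.stack(a1)

print(total_2d, total_obj)
print(restored.shape, restored.dtype)
```

Both sums give the same values, but only the 2d version runs as a single compiled loop over numeric data.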

In sum, if the source array really is a 1d object dtype array, you can quickly create a Series from it. If it is really a 2d array, you'll need, in some way or other, to convert it to a list or 1d object array first.
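Put together for the original goal, a single DataFrame column whose rows are arrays, the two routes look roughly like this (the column name "arrays" is arbitrary):

```python
import numpy as np
import pandas as pd

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

# Route 1: a list of row arrays.
s_list = pd.Series(list(a))

# Route 2: a 1d object array holding the same rows.
obj = np.empty(len(a), dtype=object)
obj[:] = list(a)
s_obj = pd.Series(obj)

# Either Series can serve as a single DataFrame column of arrays.
df = pd.DataFrame({"arrays": s_obj})
print(df.shape)
print(type(df["arrays"].iloc[0]))
```

Both routes iterate over the rows once; the object array route pays off mainly when the object array already exists or gets reused.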

Bal Krishna Jha · Accepted Answer · 2018-08-05 09:44:01Z

You can make a DataFrame from a dict of equal-length lists or from a list of lists. In the former case pandas uses the keys as column names and the lists as column values; in the latter case each list is treated as a row.

import numpy as np 
import pandas as pd

a = np.array([np.array([0,1,2,3]), np.array([4,5,6,7])])
df = pd.DataFrame()
df['a'] = a.tolist()
df

Output:

    a
0   [0, 1, 2, 3]
1   [4, 5, 6, 7]
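For reference, the two constructors mentioned at the start of this answer behave as follows. A minimal sketch with made-up column names "x" and "y":

```python
import pandas as pd

# Dict of equal-length lists: keys become column names, lists become column values.
df_cols = pd.DataFrame({"x": [0, 1, 2], "y": [3, 4, 5]})

# List of lists: each inner list becomes one row; columns get default integer labels.
df_rows = pd.DataFrame([[0, 1, 2], [3, 4, 5]])

print(df_cols.shape, list(df_cols.columns))
print(df_rows.shape, list(df_rows.columns))
```

Assigning a.tolist() to a single column, as above, is the dict-style case with one key, which is why each nested list ends up in one cell.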

edited Aug 5, 2018 at 9:44

answered Aug 5, 2018 at 3:58

Bal Krishna Jha

7 Comments

Clay Over a year ago

Thanks @krishna, but I need each row of one column of the data frame to contain each sub-array of a.

Bal Krishna Jha Over a year ago

@Clay row 1 should be [0,4] and row 2 [1,5]?

Clay Over a year ago

No, row 1 col 1 should be array([0,1,2,3]), row 2 col 1 should be array([4,5,6,7]). If you can first create a data frame from a, then convert each row to an array in a new column without using a for loop, that should work.

Clay Over a year ago

That is exactly the solution I gave in the original question, but it is way too slow for a large data set.

Bal Krishna Jha Over a year ago

pd.DataFrame({'a':a.tolist()})?
