Pandas dataframe conversion to numpy array returns unexpected shape

Question

I do not understand why pandas/numpy returns a certain array shape when I convert a DataFrame column to a numpy array. I expected a shape of (1440, 130, 13), but when I call .to_numpy() on the column I get an np.array of list objects (I have no idea why), so the shape is (1440,).

At first I thought the file format in which I stored the dataframe might be the problem (I tried JSON and CSV before), but I had the same issue with all of them.

Many thanks in advance!

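import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm

# SAMPLE_RATE is defined elsewhere; its value is not shown here

def extract_features(data_df):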

    mfcc_list = []

    for i in tqdm(range(len(data_df))):
        signal, sr = librosa.load(data_df.path[i], sr=SAMPLE_RATE, duration=3)

        # pass the audio as y=...; newer librosa versions require keyword arguments here
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)
        mfcc = mfcc.T

        mfcc_list.append(mfcc.tolist()) # I make sure that everything is in list form

    data_df['mfcc'] = mfcc_list
    return data_df

data_df = extract_features(data_df=data_df)
data_df.to_pickle('path/to/file')

df = pd.read_pickle('path/to/file')

a = df["mfcc"].to_numpy() # I would expect a shape of (1440, 130, 13)

b = np.array(df.iloc[0]["mfcc"])

print(a)
# output in a shape like this:
# [list([[], ..., []]), ..., list([[], ..., []])]

print(type(a))    # output: <class 'numpy.ndarray'>
print(type(a[0])) # output: <class 'list'>

print(type(b))    # output: <class 'numpy.ndarray'>
print(type(b[0])) # output: <class 'numpy.ndarray'>

df.info()
# output:
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 1440 entries, 0 to 1439
# Data columns (total 9 columns):
#  #   Column      Non-Null Count  Dtype 
# ---  ------      --------------  ----- 
#  0   path        1440 non-null   object
#  1   source      1440 non-null   object
#  2   actor       1440 non-null   object
#  3   gender      1440 non-null   object
#  4   statement   1440 non-null   object
#  5   repetition  1440 non-null   object
#  6   intensity   1440 non-null   object
#  7   emotion     1440 non-null   object
#  8   mfcc        1440 non-null   object
# dtypes: object(9)
# memory usage: 112.5+ KB

print(df.shape) # output: (1440, 9)

df["mfcc"]
# output:
# 0       [[-857.3094533443688, 0.0, 0.0, 0.0, 0.0, 0.0,...
# 1       [[-864.8902862773604, 0.0, 0.0, 0.0, 0.0, 0.0,...
# 2       [[-849.4454325616318, 9.397479238778757, 9.257...
# 3       [[-832.7343966188961, 11.492822043371124, 0.14...
# 4       [[-902.4064116162402, 6.517241898027468, 6.427...
#                               ...                        
# 1435    [[-764.9126134873547, 0.0, 0.0, 0.0, 0.0, 0.0,...
# 1436    [[-732.3714481202685, 0.0, 0.0, 0.0, 0.0, 0.0,...
# 1437    [[-741.4161339882342, 0.0, 0.0, 0.0, 0.0, 0.0,...
# 1438    [[-713.4635562123195, 0.0, 0.0, 0.0, 0.0, 0.0,...
# 1439    [[-718.5457158330038, 0.0, 0.0, 0.0, 0.0, 0.0,...
# Name: mfcc, Length: 1440, dtype: object

Show some df info - dtypes, shape, info.

– hpaulj, Commented Jun 13, 2020 at 23:57
@hpaulj I added some additional info

– itsmartinhi, Commented Jun 14, 2020 at 12:33

hpaulj · Accepted Answer · 2020-06-14 02:17:51Z

A dataframe is a 2d object. Even if I make one like this, it is 2d, with lists in each 'cell':

In [39]: df = pd.DataFrame({'a':[[1,2,3],[4,5,6],[7,8,9]]})                                                     
In [40]: df                                                                                                     
Out[40]: 
           a
0  [1, 2, 3]
1  [4, 5, 6]
2  [7, 8, 9]
In [41]: df.to_numpy()                                                                                          
Out[41]: 
array([[list([1, 2, 3])],
       [list([4, 5, 6])],
       [list([7, 8, 9])]], dtype=object)

This is a (3,1) array containing lists (as objects). If I select a column, I get a pandas Series, and converting that to numpy gives:

In [42]: df['a'].to_numpy()                                                                                     
Out[42]: array([list([1, 2, 3]), list([4, 5, 6]), list([7, 8, 9])], dtype=object)
In [43]: print(_)                                                                                               
[list([1, 2, 3]) list([4, 5, 6]) list([7, 8, 9])]

This is a 1d (3,) shape array, again containing lists.

If the lists match in shape, it is possible to stack them into one array:

In [44]: np.stack(df['a'].to_numpy())                                                                           
Out[44]: 
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
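
The "match in shape" condition matters: if even one row has a different length, np.stack raises a ValueError. A minimal sketch of that failure and a quick length check, using a made-up ragged column (the names `ragged`, `stacked_ok` and `lengths` are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical ragged column: the second row has fewer elements than the first
ragged = pd.DataFrame({"a": [[1, 2, 3], [4, 5]]})

try:
    np.stack(ragged["a"].to_numpy())
    stacked_ok = True
except ValueError:
    # np.stack requires every input to have the same shape
    stacked_ok = False

# Checking the per-row lengths first shows which rows differ
lengths = ragged["a"].map(len)

print(stacked_ok)
print(lengths.tolist())
```

For audio features this could happen if some clips are shorter than the requested duration, so checking the lengths before stacking is probably worth it.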

===

A '3d' example:

In [45]: df = pd.DataFrame({'a':[[[1,2],[3,4]],[[4,5],[6,7]]]})                                                 
In [46]: df                                                                                                     
Out[46]: 
                  a
0  [[1, 2], [3, 4]]
1  [[4, 5], [6, 7]]
In [47]: df['a'].to_numpy()                                                                                     
Out[47]: array([list([[1, 2], [3, 4]]), list([[4, 5], [6, 7]])], dtype=object)
In [48]: np.stack(df['a'].to_numpy())                                                                           
Out[48]: 
array([[[1, 2],
        [3, 4]],

       [[4, 5],
        [6, 7]]])
In [49]: _.shape                                                                                                
Out[49]: (2, 2, 2)
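
Applied to a column shaped like the one in the question, the same idea would look roughly like this. This is only a sketch on synthetic data: I use 4 rows instead of 1440, random numbers instead of real MFCCs, and made-up emotion labels; `demo_df`, `features` and `labels` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data: each cell holds a
# (130, 13) nested list, like mfcc.T.tolist() would produce.
rng = np.random.default_rng(0)
n_rows, n_frames, n_mfcc = 4, 130, 13
demo_df = pd.DataFrame({
    "mfcc": [rng.normal(size=(n_frames, n_mfcc)).tolist() for _ in range(n_rows)],
    "emotion": ["happy", "sad", "angry", "calm"],  # made-up labels
})

a = demo_df["mfcc"].to_numpy()          # 1d object array of lists
features = np.stack(a)                  # one numeric 3d array
labels = demo_df["emotion"].to_numpy()  # one label per row

print(a.shape)
print(features.shape)
```

With the real frame, and assuming every row really holds a 130 x 13 list, `np.stack(df["mfcc"].to_numpy())` should give the (1440, 130, 13) array the question expected.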

OK, I think that explains it for me. I think I was just using it wrong then. My intention was to extract all "mfccs" and every "emotion" as lists and then split them so that I could train an ML model. I'm building a model for speech emotion recognition. Thank you very much for the explicit answer!
