3

Having a 4-D numpy.ndarray, e.g.

myarr = np.random.rand(10,4,3,2)
dims = {'time': range(1, 11), 'sub': range(1, 5), 'cond': ['A','B','C'], 'measure': ['meas1','meas2']}

The array may also have more dimensions. How can I create a pandas.DataFrame with a MultiIndex by just passing the dimensions as indexes, without further manual adjustments (reshaping the ndarray into a 2-D shape)?

I can't wrap my head around the reshaping, not even really in 3 dimensions quite yet, so I'm searching for an 'automatic' method if possible.

Is there a function to which I can pass the row/column indexes to create a DataFrame? Something like:

df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])

And end up with something like:

              meas1             meas2
              A     B     C     A    B    C
sub   time
  1      1
         2
         3
         .
         .
  2      1
         2
 ...

If it is not possible/feasible to automate this, an explanation that is less terse than the MultiIndex manual would be appreciated.

I can't even get it right when I don't care about the order of the dimensions, e.g. I would expect this to work:

a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])

pd.DataFrame(a.reshape(2*3*1,2*2),index=index)

gives:

ValueError: Shape of passed values is (4, 6), indices imply (4, 24)

3 Answers

5

You're getting the error because you've reshaped the ndarray as 6x4 and are applying an index intended to capture all dimensions in a single series: the index has 24 entries, but the reshaped array has only 6 rows. The following setup gets the toy example working:

a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])

pd.DataFrame(a.reshape(24, 1),index=index)
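
As a small sketch of where one could go from this long format (the `unstack` step and the names `long_df`, `wide` and `'value'` are my own additions, not part of the setup above), two of the index levels can be moved into the columns without reshaping the array by hand:

```python
import numpy as np
import pandas as pd

a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
index = pd.MultiIndex.from_product(iterables, names=['time', 'sub', 'meas', 'cond'])

# The product index has one entry per array element, so the data is one column.
assert len(index) == a.size
long_df = pd.DataFrame(a.reshape(-1, 1), index=index, columns=['value'])

# Move the 'meas' and 'cond' levels from the rows to the columns.
wide = long_df['value'].unstack(['meas', 'cond'])
print(wide)
```

This relies on the index levels being listed in the same order as the array axes, because `reshape(-1, 1)` flattens the array with the last axis varying fastest, which is also the order `from_product` generates.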

Solution

Here's a generic DataFrame creator that should get the job done:

def produce_df(rows, columns, row_names=None, column_names=None):
    """rows is a list of lists that will be used to build a MultiIndex
    columns is a list of lists that will be used to build a MultiIndex"""
    row_index = pd.MultiIndex.from_product(rows, names=row_names)
    col_index = pd.MultiIndex.from_product(columns, names=column_names)
    return pd.DataFrame(index=row_index, columns=col_index)

Demonstration

Without named index levels

produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])

       1         2     
       3    4    3    4
a c  NaN  NaN  NaN  NaN
  d  NaN  NaN  NaN  NaN
b c  NaN  NaN  NaN  NaN
  d  NaN  NaN  NaN  NaN

With named index levels

produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
           row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])

number1          1         2     
number2          3    4    3    4
alpha1 alpha2                    
a      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
b      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN

1 Comment

Now I see the mistake about the extra dimension, thanks. nifty little function!
2

From the structure of your data,

names=['sub','time','measure','cond']  #ind1,ind2,col1,col2
labels=[[1,2,3],[1,2],['meas1','meas2'],list('ABC')]

A straightforward way to your goal:

index = pd.MultiIndex.from_product(labels,names=names)
data=np.arange(index.size) # or myarr.flatten()

df=pd.DataFrame(data,index=index)
df22=df.reset_index().pivot_table(values=0,index=names[:2],columns=names[2:])


"""
measure  meas1         meas2        
cond         A   B   C     A   B   C
sub time                            
1   1        0   1   2     3   4   5
    2        6   7   8     9  10  11
2   1       12  13  14    15  16  17
    2       18  19  20    21  22  23
3   1       24  25  26    27  28  29
    2       30  31  32    33  34  35

"""

3 Comments

still a little terse and off from the concrete problem, but also helpful, thanks
I have adapted for a more useful and clear (?) method.
Cool, didn't know about the pivot_table method!
0

I still don't know how to do it directly, but here is an easy-to-follow, step-by-step way:

# Create 4D-array
a=np.arange(24).reshape((3,2,2,2))
# Set only one row index
rowiter=[[1,2,3]]
row_ind=pd.MultiIndex.from_product(rowiter, names=[u'time'])
# put the rest of the dimensions into columns
coliter=[[1,2],['m1','m2'],['A','B']]
col_ind=pd.MultiIndex.from_product(coliter, names=[u'sub',u'meas',u'cond'])
ncols=np.prod([len(labels) for labels in coliter])
b=pd.DataFrame(a.reshape(len(rowiter[0]),ncols),index=row_ind,columns=col_ind)
print(b)
# Reshape columns to rows as pleased:
b=b.stack('sub')
# switch levels and order in rows (level goes from inner to outer):
c=b.swaplevel(0,1,axis=0).sort_index(level=0,axis=0)

To check the correct assignment of dimensions:

print(a[:,0,0,0])
[ 0  8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]

print(b)
meas      m1      m2    
cond       A   B   A   B
time sub                
1    1     0   1   2   3
     2     4   5   6   7
2    1     8   9  10  11
     2    12  13  14  15
3    1    16  17  18  19
     2    20  21  22  23

print(c)
meas      m1      m2    
cond       A   B   A   B
sub time                
1   1      0   1   2   3
    2      8   9  10  11
    3     16  17  18  19
2   1      4   5   6   7
    2     12  13  14  15
    3     20  21  22  23
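
The spot checks above can be extended to every element. The sketch below is only an illustration: it rebuilds the same layout as `c` with a transpose instead of `stack`/`swaplevel` (my assumption of an equivalent construction), and the `labels` list is introduced just for the check.

```python
import itertools
import numpy as np
import pandas as pd

a = np.arange(24).reshape((3, 2, 2, 2))
labels = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]  # time, sub, meas, cond

# Assumed alternative construction of `c`: move 'sub' in front of 'time',
# then flatten to (sub x time) rows by (meas x cond) columns.
rows = pd.MultiIndex.from_product([labels[1], labels[0]], names=['sub', 'time'])
cols = pd.MultiIndex.from_product([labels[2], labels[3]], names=['meas', 'cond'])
c = pd.DataFrame(a.transpose(1, 0, 2, 3).reshape(len(rows), len(cols)),
                 index=rows, columns=cols)

# Compare every cell with the array element it should hold.
for pos in itertools.product(*(range(n) for n in a.shape)):
    time, sub, meas, cond = (labels[axis][i] for axis, i in enumerate(pos))
    assert c.loc[(sub, time), (meas, cond)] == a[pos]
```

If any dimension had been assigned to the wrong index level, one of the assertions would fail.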
