What is the easiest way to create a DataFrame with hierarchical columns?

I am currently creating a DataFrame from a dict of names -> Series using:

df = pd.DataFrame(data=serieses)

I would like to use the same column names but add an additional level of hierarchy on the columns. For the time being I want the additional level to have the same value for all columns, let's say "Estimates".

I am trying the following but that does not seem to work:

pd.DataFrame(data=serieses,columns=pd.MultiIndex.from_tuples([(x, "Estimates") for x in serieses.keys()]))

All I get is a DataFrame with all NaNs.

For example, what I am looking for is roughly:

l1               Estimates    
l2  one  two  one  two  one  two  one  two
r1   1    2    3    4    5    6    7    8
r2   1.1  2    3    4    5    6    71   8.2

where l1 and l2 are the names of the MultiIndex levels.

4 Answers

16

This appears to work:

import pandas as pd

data = {'a': [1,2,3,4], 'b': [10,20,30,40],'c': [100,200,300,400]}

df = pd.concat({"Estimates": pd.DataFrame(data)}, axis=1, names=["l1", "l2"])

l1  Estimates         
l2          a   b    c
0           1  10  100
1           2  20  200
2           3  30  300
3           4  40  400
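
Applied to the question's own setup, the same idea can be sketched as follows. The contents of `serieses` are a hypothetical stand-in, since the question does not show the actual Series; only the `pd.concat` call is taken from the answer above.

```python
import pandas as pd

# Hypothetical stand-in for the question's dict of name -> Series.
serieses = {
    "one": pd.Series([1.0, 1.1], index=["r1", "r2"]),
    "two": pd.Series([2.0, 2.0], index=["r1", "r2"]),
}

# Build the flat frame first, then wrap it under a single top-level key.
flat = pd.DataFrame(serieses)
df = pd.concat({"Estimates": flat}, axis=1, names=["l1", "l2"])

print(df)
print(df["Estimates"]["one"])  # select through both levels
```

Adding further keys to the dict passed to `pd.concat` should give additional top-level groups in the same way.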

1 Comment

That's very readable, I like it. Ultimately it might be best for pandas to have better 'level' management, like a simple df.add_level(axis=1).
13

I know the question is really old but for pandas version 0.19.1 one can use direct dict-initialization:

import pandas as pd

# Tuple keys become the levels of the column MultiIndex
d = {('a','b'):[1,2,3,4], ('a','c'):[5,6,7,8]}
df = pd.DataFrame(d, index=['r1','r2','r3','r4'])
df.columns.names = ('l1','l2')
print(df)

l1  a   
l2  b  c
r1  1  5
r2  2  6
r3  3  7
r4  4  8
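
A quick way to confirm that the tuple keys really produced hierarchical columns, rather than a flat index of tuples, is to check the column type and select by level. This is a minimal sketch reusing the data from this answer; the variable names `sub` and `col` are only illustrative.

```python
import pandas as pd

d = {("a", "b"): [1, 2, 3, 4], ("a", "c"): [5, 6, 7, 8]}
df = pd.DataFrame(d, index=["r1", "r2", "r3", "r4"])
df.columns.names = ["l1", "l2"]

# Tuple keys should have produced a two-level MultiIndex, not flat tuples.
print(isinstance(df.columns, pd.MultiIndex))

sub = df["a"]          # all columns under the top-level label "a"
col = df[("a", "c")]   # one column, addressed by its full tuple key
print(sub)
print(col.tolist())
```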

2 Comments

Does this still work? I tried direct dict initialization, but the columns are just tuples.
@zkytony, I've just checked with a not-so-old 1.2.0 version and it still holds, at least on my machine. Have you tried upgrading your pandas installation? P.S. The same goes for the latest 1.3.3.
2

I'm not sure, but I think a dict as input for your DataFrame and a MultiIndex don't play well together. Using an array as input instead makes it work.

I often prefer dicts as input though, one way is to set the columns after creating the df:

import numpy as np
import pandas as pd

data = {'a': [1,2,3,4], 'b': [10,20,30,40],'c': [100,200,300,400]}
# list(...) is needed on Python 3, where dict.values() returns a view
df = pd.DataFrame(np.array(list(data.values())).T, index=['r1','r2','r3','r4'])

# pair the constant top-level label with each dict key
tups = list(zip(['Estimates']*len(data), data.keys()))

df.columns = pd.MultiIndex.from_tuples(tups, names=['l1','l2'])

l1  Estimates         
l2          a   b    c
r1          1  10  100
r2          2  20  200
r3          3  30  300
r4          4  40  400

Or when using an array as input for the df:

data_arr = np.array([[1,2,3,4],[10,20,30,40],[100,200,300,400]])

# one (top-level, name) tuple per row of data_arr
tups = list(zip(['Estimates']*data_arr.shape[0], ['a','b','c']))
df = pd.DataFrame(data_arr.T, index=['r1','r2','r3','r4'], columns=pd.MultiIndex.from_tuples(tups, names=['l1','l2']))

Which gives the same result.
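
As an alternative to building the tuples by hand, one could let pandas construct the frame from the dict and then derive the new columns from `df.columns` itself. The sketch below swaps in `pd.MultiIndex.from_product`, which is not part of the original answer, and assumes a single constant top-level label.

```python
import pandas as pd

data = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]}

# Let pandas build the frame from the dict so labels and values stay paired.
df = pd.DataFrame(data, index=["r1", "r2", "r3", "r4"])

# Derive the new columns from df.columns itself, so the order cannot drift.
df.columns = pd.MultiIndex.from_product(
    [["Estimates"], df.columns], names=["l1", "l2"]
)
print(df)
```

Because the second level is taken from the existing columns, this version does not depend on the dict's iteration order and needs no transposition.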

5 Comments

Is there a risk that the column ordering will be messed up in the dict example? In other words, when pandas makes the DataFrame from a dict, it must pull the keys/values out of the dict, which will happen in arbitrary order. I think you assume the same order in the tuple/list comprehension statement. This seems unsafe in the long term. I believe that when the columns keyword is set in DataFrame construction, pandas attempts to ensure some sort of alignment.
Good point, you want to avoid that indeed. Using np.array(data.values()).T together with data.keys() should be fine, I guess.
According to docs, docs.python.org/2/library/stdtypes.html#dict.items, that new proposal does in fact seem safe.
Is there any concern with calling transpose? For example, are there any cases in which dtypes get messed up?
Do you think that it would make sense to allow creating this by creating a DataFrame of DataFrames? For example: pd.DataFrame({"Estimates":pd.DataFrame(data)}) ?
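
On the dtype question raised in the comments above: one case where going through a NumPy array does change dtypes is mixed integer and float data, because a single array holds a single dtype. A minimal sketch with hypothetical data (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type data: one integer column, one float column.
data = {"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]}

# Route 1: through a single NumPy array, as in the answer above.
via_array = pd.DataFrame(np.array(list(data.values())).T, columns=list(data))

# Route 2: straight from the dict, so each column keeps its own dtype.
via_dict = pd.DataFrame(data)

print(via_array.dtypes)  # both columns share the array's common dtype
print(via_dict.dtypes)   # per-column dtypes are preserved
```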
2

The solution by Rutger Kassies worked in my case, but I have more than one column in the "upper level" of the column hierarchy. Just want to provide what worked for me as an example since it is a more general case.

First, I have data that looks like this:

> df
         (A, a)    (A, b)       (B, a)    (B, b) 
0         0.00     9.75         0.00       0.00
1         8.85     8.86         35.75      35.50
2         8.51     9.60         66.67      50.70
3         0.03     508.99       56.00      8.58

I would like it to look like this:

> df
                A                    B
           a        b            a          b
0         0.00     9.75         0.00       0.00
1         8.85     8.86         35.75      35.50
...

The solution is:

tuples = df.transpose().index
new_columns = pd.MultiIndex.from_tuples(tuples, names=['Upper', 'Lower'])
df.columns = new_columns

This is counter-intuitive because in order to create columns, I have to do it through index.

1 Comment

You could also do: new_columns = pd.MultiIndex.from_tuples(df.columns, names=['Upper', 'Lower']); df.columns = new_columns
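
A self-contained sketch of this conversion, using the first two rows of the example data. The flat, tuple-labelled columns are forced here with `pd.Index(..., tupleize_cols=False)`, which is only an assumption about how such a frame might arise; the name `flat_cols` is illustrative.

```python
import pandas as pd

# Force flat, tuple-labelled columns; tupleize_cols=False stops pandas
# from building a MultiIndex automatically.
flat_cols = pd.Index(
    [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")], tupleize_cols=False
)
df = pd.DataFrame(
    [[0.00, 9.75, 0.00, 0.00], [8.85, 8.86, 35.75, 35.50]],
    columns=flat_cols,
)

# Convert the tuple labels into a proper two-level MultiIndex.
df.columns = pd.MultiIndex.from_tuples(df.columns, names=["Upper", "Lower"])
print(df["A"])  # the sub-frame under the upper label "A"
```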
