Python/Pandas - TypeError when concatenating MultiIndex DataFrames

Question

I am having trouble concatenating a list of DataFrames whose columns are a two-level MultiIndex while adding a third level to distinguish them.

As an example, I have the following input data.

import pandas as pd
import numpy as np

# Input data

start = '2020-01-01 00:00+00:00'
end = '2020-01-01 02:00+00:00'
pr1h = pd.period_range(start=start, end=end, freq='1h')

midx1 = pd.MultiIndex.from_tuples([('Sup',1),('Sup',2),('Inf',1),('Inf',2)], names=['Data','Position'])
df1 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
df3 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)

midx2 = pd.MultiIndex.from_tuples([('Sup',3),('Inf',3)], names=['Data','Position'])
df2 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
df4 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)

So df1 and df2 hold data for the same tag, 1h. They have the same column names at the Data level, but different column names at the Position level.

df1
Data                   Sup                 Inf          
Position                 1         2         1         2
2020-01-01 00:00  0.660795  0.538452  0.861801  0.502479
2020-01-01 01:00  0.205806  0.847124  0.474861  0.906546
2020-01-01 02:00  0.681480  0.479512  0.631771  0.961844

df2
Data                   Sup       Inf
Position                 3         3
2020-01-01 00:00  0.758533  0.672899
2020-01-01 01:00  0.096463  0.304843
2020-01-01 02:00  0.080504  0.990310

Now, df3 and df4 follow the same logic and have the same column names. To distinguish them from df1 and df2, I want to use a different tag, 2h for instance.

I want to add this third level, named Period, during the call to pd.concat. For this, I am trying to use the keys parameter of pd.concat(). I tried the following code.

df_list = [df1, df2, df3, df4]
period_list = ['1h', '1h', '2h', '2h']
concatenated = pd.concat(df_list, keys=period_list, names=('Period', 'Data', 'Position'), axis=1)

But this raises the following error.

TypeError: int() argument must be a string, a bytes-like object or a number, not 'slice'

Does anyone know what the correct call is for this?

Thank you for your help.

EDIT 05/05

As requested, here is the desired result (copied directly from the answer given; the result obtained from that answer is the one I am looking for).

Period                  1h                                                    \
Data                   Sup                 Inf                 Sup       Inf   
Position                 1         2         1         2         3         3   
2020-01-01 00:00  0.309778  0.597582  0.872392  0.983021  0.659965  0.214953   
2020-01-01 01:00  0.467403  0.875744  0.296069  0.131291  0.203047  0.382865   
2020-01-01 02:00  0.842818  0.659036  0.595440  0.436354  0.224873  0.114649   

Period                  2h                                                    
Data                   Sup                 Inf                 Sup       Inf  
Position                 1         2         1         2         3         3  
2020-01-01 00:00  0.356250  0.587131  0.149471  0.171239  0.583017  0.232641  
2020-01-01 01:00  0.397165  0.637952  0.372520  0.002407  0.556518  0.523811  
2020-01-01 02:00  0.548816  0.126972  0.079793  0.235039  0.350958  0.705332

The problem is not really the MultiIndex itself; it is that the same value appears twice in period_list. Without a MultiIndex, the error would point more explicitly at the problem: InvalidIndexError: Reindexing only valid with uniquely valued Index objects
– Ben.T, Commented May 5, 2020 at 21:21
Setting period_list = ['1h', '2h', '3h', '4h'] works. Otherwise, please post the desired result.
– Parfait, Commented May 5, 2020 at 21:37
@Parfait Hi, I added the expected result as requested. df1 and df2 have to share the same Period, and df3 and df4 also have to share the same Period.
– pierre_j, Commented May 5, 2020 at 21:50
Thanks, I have subscribed to this issue. If it is solved, I will modify the code you propose. Thanks again!
– pierre_j, Commented May 6, 2020 at 5:03
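
The first comment above attributes the error to the repeated values in period_list. A minimal sketch of how one might check a key list for repeats before handing it to pd.concat; the names keys and repeated are illustrative and do not come from the original post:

```python
import pandas as pd

period_list = ['1h', '1h', '2h', '2h']
keys = pd.Index(period_list)

# False here, because '1h' and '2h' each appear twice.
print(keys.is_unique)

# Labels that occur more than once, listed once each.
repeated = keys[keys.duplicated()].unique().tolist()
print(repeated)  # ['1h', '2h']
```

If the list is not unique, the workarounds in the answers below apply.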

Ben.T · Accepted Answer · 2020-05-06 11:54:13Z

A quick fix would be to use different names in period_list and rename just after the concat. Something like:

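df_list = [df1, df2, df3, df4]
# Temporary unique keys: the suffix after '_' only serves to make each key distinct.
period_list = ['1h_a', '1h_b', '2h_a', '2h_b']
concatenated = (
    pd.concat(df_list,
              keys=period_list,
              names=('Period', 'Data', 'Position'),
              axis=1)
    # Strip the suffix so that '1h_a' and '1h_b' both become '1h'.
    .rename(columns={col: col.split('_')[0] for col in period_list},
            level='Period')
)

print(concatenated)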
Period                  1h                                                    \
Data                   Sup                 Inf                 Sup       Inf   
Position                 1         2         1         2         3         3   
2020-01-01 00:00  0.309778  0.597582  0.872392  0.983021  0.659965  0.214953   
2020-01-01 01:00  0.467403  0.875744  0.296069  0.131291  0.203047  0.382865   
2020-01-01 02:00  0.842818  0.659036  0.595440  0.436354  0.224873  0.114649   

Period                  2h                                                    
Data                   Sup                 Inf                 Sup       Inf  
Position                 1         2         1         2         3         3  
2020-01-01 00:00  0.356250  0.587131  0.149471  0.171239  0.583017  0.232641  
2020-01-01 01:00  0.397165  0.637952  0.372520  0.002407  0.556518  0.523811  
2020-01-01 02:00  0.548816  0.126972  0.079793  0.235039  0.350958  0.705332

Edit: since speed is a concern and rename seems to be slow, you can rebuild the columns directly instead:

concatenated = pd.concat(df_list,
                         keys=period_list,
                         axis=1)
# Rebuild the column MultiIndex, stripping the '_a'/'_b' suffix from the first level.
concatenated.columns = pd.MultiIndex.from_tuples(
    [(col[0].split('_')[0], col[1], col[2]) for col in concatenated.columns],
    names=('Period', 'Data', 'Position'))
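
The same idea can be wrapped in a small function so that the suffixed keys never have to be written by hand. This is only a sketch under stated assumptions: concat_with_repeated_keys is a hypothetical helper, not a pandas function, and it uses list positions as the temporary unique keys instead of the '_a'/'_b' suffixes. The sample frames mimic the question's data with a timezone-naive index.

```python
import numpy as np
import pandas as pd


def concat_with_repeated_keys(frames, keys, names):
    """Hypothetical helper: column-wise concat that tolerates repeated keys."""
    # Concatenate under temporary keys 0, 1, 2, ... which are always unique.
    out = pd.concat(frames, keys=list(range(len(frames))), axis=1)
    # Replace each positional key with its real label and rebuild the columns.
    out.columns = pd.MultiIndex.from_tuples(
        [(keys[pos], *rest) for pos, *rest in out.columns],
        names=names,
    )
    return out


# Small frames shaped like those in the question.
idx = pd.period_range('2020-01-01 00:00', periods=3, freq='h')
midx1 = pd.MultiIndex.from_tuples(
    [('Sup', 1), ('Sup', 2), ('Inf', 1), ('Inf', 2)], names=['Data', 'Position'])
midx2 = pd.MultiIndex.from_tuples(
    [('Sup', 3), ('Inf', 3)], names=['Data', 'Position'])
df1 = pd.DataFrame(np.random.rand(3, 4), index=idx, columns=midx1)
df2 = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=midx2)
df3 = pd.DataFrame(np.random.rand(3, 4), index=idx, columns=midx1)
df4 = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=midx2)

result = concat_with_repeated_keys(
    [df1, df2, df3, df4],
    ['1h', '1h', '2h', '2h'],
    ('Period', 'Data', 'Position'),
)
print(result.columns.get_level_values('Period').unique().tolist())  # ['1h', '2h']
```

Selecting result['1h'] then returns the six columns coming from df1 and df2.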

edited May 6, 2020 at 11:54

answered May 5, 2020 at 21:44

Ben.T

5 Comments

pierre_j Over a year ago

Thanks @Ben.T. I am surprised: is there no "direct" way to get the right tags? Your fix is nice, but it still adds some complexity. Thanks a lot for your contribution.

Ben.T Over a year ago

@pierre_j I'm not sure why it is not possible, and Parfait's idea is another way around it. But the fact that it is not possible even without a MultiIndex makes me think that there is no direct way.

pierre_j Over a year ago

Thanks for your reply. As you seem to be quite an expert in pandas, I would like to mention that I compared both methods in terms of speed, and @Parfait's method appears to be 30% faster. I have read in several places that repeated use of pd.concat leads to performance issues, so I was certainly not expecting this result compared to a "simple" rename(). Do you have any idea what may cause such a result (pd.concat being faster than rename())?

Ben.T Over a year ago

@pierre_j I know that rename is not especially fast, but it being slower than 2 inner concat calls is a bit surprising. I just tested and I see similar performance over a 4-year period at 1h intervals. If speed is a concern, see my edit: you can call MultiIndex.from_tuples, which is much faster on my side.

pierre_j Over a year ago

Thanks a lot for your constant support! You are of great help!
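
The timing comparison discussed in these comments could be reproduced with something like the sketch below. It assumes four frames with the question's column layout and one year of hourly rows; the function names (make_frames, via_rename, via_from_tuples, via_nested_concat) are made up for illustration. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

NAMES = ('Period', 'Data', 'Position')
UNIQUE_KEYS = ['1h_a', '1h_b', '2h_a', '2h_b']


def make_frames(n_rows):
    """Build four frames laid out like df1..df4 in the question."""
    idx = pd.period_range('2020-01-01 00:00', periods=n_rows, freq='h')
    wide = pd.MultiIndex.from_tuples(
        [('Sup', 1), ('Sup', 2), ('Inf', 1), ('Inf', 2)], names=['Data', 'Position'])
    narrow = pd.MultiIndex.from_tuples(
        [('Sup', 3), ('Inf', 3)], names=['Data', 'Position'])
    return [pd.DataFrame(np.random.rand(n_rows, len(cols)), index=idx, columns=cols)
            for cols in (wide, narrow, wide, narrow)]


def via_rename(frames):
    # Unique keys, then strip the suffix with rename on the 'Period' level.
    out = pd.concat(frames, keys=UNIQUE_KEYS, names=NAMES, axis=1)
    return out.rename(columns={k: k.split('_')[0] for k in UNIQUE_KEYS},
                      level='Period')


def via_from_tuples(frames):
    # Unique keys, then rebuild the column MultiIndex in one go.
    out = pd.concat(frames, keys=UNIQUE_KEYS, axis=1)
    out.columns = pd.MultiIndex.from_tuples(
        [(c[0].split('_')[0], c[1], c[2]) for c in out.columns], names=NAMES)
    return out


def via_nested_concat(frames):
    # One inner concat per period, then an outer concat with unique keys.
    return pd.concat([pd.concat(frames[:2], axis=1), pd.concat(frames[2:], axis=1)],
                     keys=['1h', '2h'], names=NAMES, axis=1)


frames = make_frames(24 * 365)  # one year of hourly rows; adjust as needed
repeats = 20
for func in (via_rename, via_from_tuples, via_nested_concat):
    seconds = timeit.timeit(lambda: func(frames), number=repeats)
    print(f'{func.__name__}: {seconds / repeats * 1e3:.2f} ms per call')
```

All three functions are meant to return the same frame, so only the elapsed time should differ between them.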

Parfait · Accepted Answer · 2020-05-05 22:35:13Z

Consider an inner concat on similar data frames then run a final concat to bind all together:

concatenated = pd.concat([pd.concat([df1, df2], axis=1),
                          pd.concat([df3, df4], axis=1)],
                         keys=['1h', '2h'],
                         names=('Period', 'Data', 'Position'),
                         axis=1)

print(concatenated)

Period                  1h                                                    \
Data                   Sup                 Inf                 Sup       Inf   
Position                 1         2         1         2         3         3   
2020-01-01 00:00  0.189802  0.675083  0.624484  0.781774  0.453101  0.224525   
2020-01-01 01:00  0.249818  0.829180  0.190488  0.923107  0.495873  0.278201   
2020-01-01 02:00  0.602634  0.494915  0.612672  0.903609  0.426809  0.248981   

Period                  2h                                                    
Data                   Sup                 Inf                 Sup       Inf  
Position                 1         2         1         2         3         3  
2020-01-01 00:00  0.746499  0.385714  0.008561  0.961152  0.988231  0.897454  
2020-01-01 01:00  0.643730  0.365023  0.812249  0.291733  0.045417  0.414968  
2020-01-01 02:00  0.887567  0.680102  0.978388  0.018501  0.695866  0.679730

answered May 5, 2020 at 22:35

Parfait

3 Comments

pierre_j Over a year ago

@Parfait Thanks, but I am actually trying to eliminate pd.concat calls, keeping as few of them as possible for performance reasons. I expect rename() to be faster than 2 concat() calls, right? The data I have given only shows 4 DataFrames, but I generate them in loops, and there can be hundreds of them, with much more data. Thanks nonetheless!

pierre_j Over a year ago

Wow, am I missing something? Your solution is actually 33% faster than the rename(). Is that to be expected? (I worked with these 4 DataFrames, but increased their length to cover a 4-year period...)

Parfait Over a year ago

Glad to help. Not sure why you see timing differences. The solution here can be integrated into loops, or even a list comprehension that builds lists of data frames for concat.
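
One possible way to integrate the nested concat into a loop, as suggested in the last comment, is to group the frames by period first. This is a sketch only: concat_by_period is a hypothetical helper, and it assumes that frames and periods are parallel lists, as df_list and period_list are in the question.

```python
import numpy as np
import pandas as pd


def concat_by_period(frames, periods, names=('Period', 'Data', 'Position')):
    """Hypothetical helper: inner concat per period, then one outer concat."""
    # Collect the frames that share a period, keeping first-seen order.
    grouped = {}
    for frame, period in zip(frames, periods):
        grouped.setdefault(period, []).append(frame)
    # One inner concat per distinct period.
    inner = [pd.concat(group, axis=1) for group in grouped.values()]
    # The outer keys are the distinct periods, so they are unique.
    return pd.concat(inner, keys=list(grouped), names=names, axis=1)


# Small frames shaped like those in the question (tz-naive index for brevity).
idx = pd.period_range('2020-01-01 00:00', periods=3, freq='h')
midx1 = pd.MultiIndex.from_tuples(
    [('Sup', 1), ('Sup', 2), ('Inf', 1), ('Inf', 2)], names=['Data', 'Position'])
midx2 = pd.MultiIndex.from_tuples(
    [('Sup', 3), ('Inf', 3)], names=['Data', 'Position'])
df1 = pd.DataFrame(np.random.rand(3, 4), index=idx, columns=midx1)
df2 = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=midx2)
df3 = pd.DataFrame(np.random.rand(3, 4), index=idx, columns=midx1)
df4 = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=midx2)

result = concat_by_period([df1, df2, df3, df4], ['1h', '1h', '2h', '2h'])
print(result.columns.get_level_values('Period').unique().tolist())  # ['1h', '2h']
```

With this layout the number of pd.concat calls is the number of distinct periods plus one, regardless of how many frames the loop produces.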
