
I am having trouble concatenating a list of DataFrames whose columns are a two-level MultiIndex while adding a third level to distinguish them.

As an example, I have the following input data.

import pandas as pd
import numpy as np

# Input data

start = '2020-01-01 00:00+00:00'
end = '2020-01-01 02:00+00:00'
pr1h = pd.period_range(start=start, end=end, freq='1h')

midx1 = pd.MultiIndex.from_tuples([('Sup',1),('Sup',2),('Inf',1),('Inf',2)], names=['Data','Position'])
df1 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
df3 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)

midx2 = pd.MultiIndex.from_tuples([('Sup',3),('Inf',3)], names=['Data','Position'])
df2 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
df4 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)

So df1 and df2 hold data for the same tag, 1h. They share the same column names at the Data level, but not at the Position level.

df1
Data                   Sup                 Inf          
Position                 1         2         1         2
2020-01-01 00:00  0.660795  0.538452  0.861801  0.502479
2020-01-01 01:00  0.205806  0.847124  0.474861  0.906546
2020-01-01 02:00  0.681480  0.479512  0.631771  0.961844

df2
Data                   Sup       Inf
Position                 3         3
2020-01-01 00:00  0.758533  0.672899
2020-01-01 01:00  0.096463  0.304843
2020-01-01 02:00  0.080504  0.990310

Now, df3 and df4 follow the same logic and use the same column names. To distinguish them from df1 and df2, I want to use a different tag, 2h for instance.

I want to add this third level, named Period, during the call to pd.concat. For this, I am trying to use the keys parameter of pd.concat(). I tried the following code.

df_list = [df1, df2, df3, df4]
period_list = ['1h', '1h', '2h', '2h']
concatenated = pd.concat(df_list, keys=period_list, names=('Period', 'Data', 'Position'), axis=1)

But this raises the following error.

TypeError: int() argument must be a string, a bytes-like object or a number, not 'slice'

Any idea what the correct call is?

Thank you for your help. Best,

EDIT 05/05

As requested, here is the desired result (copied directly from the answer given; the result obtained from that answer is the one I am looking for).

Period                  1h                                                    \
Data                   Sup                 Inf                 Sup       Inf   
Position                 1         2         1         2         3         3   
2020-01-01 00:00  0.309778  0.597582  0.872392  0.983021  0.659965  0.214953   
2020-01-01 01:00  0.467403  0.875744  0.296069  0.131291  0.203047  0.382865   
2020-01-01 02:00  0.842818  0.659036  0.595440  0.436354  0.224873  0.114649   

Period                  2h                                                    
Data                   Sup                 Inf                 Sup       Inf  
Position                 1         2         1         2         3         3  
2020-01-01 00:00  0.356250  0.587131  0.149471  0.171239  0.583017  0.232641  
2020-01-01 01:00  0.397165  0.637952  0.372520  0.002407  0.556518  0.523811  
2020-01-01 02:00  0.548816  0.126972  0.079793  0.235039  0.350958  0.705332
  • The problem is not really that you have a MultiIndex to begin with; it is rather that the same value appears twice in period_list. Without the MultiIndex, the error would be more explicit about the problem: InvalidIndexError: Reindexing only valid with uniquely valued Index objects Commented May 5, 2020 at 21:21
  • Setting period_list = ['1h', '2h', '3h', '4h'] works. Otherwise, please post the desired result. Commented May 5, 2020 at 21:37
  • @Parfait Hi, I added the expected result as requested. df1 and df2 have to share the same Period, and df3 and df4 also have to share the same Period. Commented May 5, 2020 at 21:50
  • It is actually similar to this open issue on GitHub Commented May 6, 2020 at 1:10
  • Thanks, I have subscribed to this issue. If it is solved, I will modify the code you propose. Thanks again! Commented May 6, 2020 at 5:03
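
The diagnosis in the first comment can be checked before calling pd.concat. Below is a minimal sketch, assuming a hypothetical helper named duplicated_keys (it is not a pandas function) that reports which labels in a keys list are repeated:

```python
import pandas as pd


def duplicated_keys(keys):
    """Return the labels that occur more than once in keys (hypothetical helper)."""
    idx = pd.Index(keys)
    # duplicated(keep="first") flags every occurrence after the first one.
    return idx[idx.duplicated(keep="first")].unique().tolist()


print(duplicated_keys(['1h', '1h', '2h', '2h']))  # ['1h', '2h']
print(duplicated_keys(['1h', '2h', '3h', '4h']))  # []
```

A non-empty result means the keys are not unique, which is the situation described in the comment.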

2 Answers

A quick fix would be to use different names in period_list and rename just after the concat. Something like:

df_list = [df1, df2, df3, df4]
period_list = ['1h_a', '1h_b', '2h_a', '2h_b']
# Concatenate with unique keys, then strip the '_a'/'_b' suffix from the Period level
concatenated = pd.concat(df_list,
                         keys=period_list,
                         names=('Period', 'Data', 'Position'),
                         axis=1)\
                 .rename(columns={col: col.split('_')[0] for col in period_list},
                         level='Period')

print(concatenated)
Period                  1h                                                    \
Data                   Sup                 Inf                 Sup       Inf   
Position                 1         2         1         2         3         3   
2020-01-01 00:00  0.309778  0.597582  0.872392  0.983021  0.659965  0.214953   
2020-01-01 01:00  0.467403  0.875744  0.296069  0.131291  0.203047  0.382865   
2020-01-01 02:00  0.842818  0.659036  0.595440  0.436354  0.224873  0.114649   

Period                  2h                                                    
Data                   Sup                 Inf                 Sup       Inf  
Position                 1         2         1         2         3         3  
2020-01-01 00:00  0.356250  0.587131  0.149471  0.171239  0.583017  0.232641  
2020-01-01 01:00  0.397165  0.637952  0.372520  0.002407  0.556518  0.523811  
2020-01-01 02:00  0.548816  0.126972  0.079793  0.235039  0.350958  0.705332 

Edit: as speed is a concern, it seems that rename is slow, so you can do:

# Concatenate with the unique keys, then rebuild the column MultiIndex in one pass
concatenated = pd.concat(df_list,
                         keys=period_list,
                         axis=1)
concatenated.columns = pd.MultiIndex.from_tuples([(col[0].split('_')[0], col[1], col[2])
                                                  for col in concatenated.columns],
                                                 names=('Period', 'Data', 'Position'))
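
The same idea can be written without string suffixes, which avoids trouble if a period label itself contains an underscore. Below is a minimal sketch, assuming a hypothetical helper concat_with_period (not part of pandas). It uses each frame's position in the list as a temporary unique key, then maps that key back to the period label. The sample frames only mirror the shape of the question's input:

```python
import numpy as np
import pandas as pd


def concat_with_period(frames, periods, names=("Period", "Data", "Position")):
    """Concatenate frames column-wise and label each one with its period.

    Hypothetical helper: the frame positions serve as temporary unique keys,
    so repeated period labels never reach pd.concat.
    """
    out = pd.concat(frames, keys=list(range(len(frames))), axis=1)
    # Swap the temporary integer key (first tuple element) for the period label.
    out.columns = pd.MultiIndex.from_tuples(
        [(periods[col[0]], *col[1:]) for col in out.columns],
        names=names,
    )
    return out


# Sample data shaped like the question's input (values are random).
idx = pd.period_range(start="2020-01-01 00:00", periods=3, freq="1h")
midx1 = pd.MultiIndex.from_tuples(
    [("Sup", 1), ("Sup", 2), ("Inf", 1), ("Inf", 2)], names=["Data", "Position"]
)
midx2 = pd.MultiIndex.from_tuples([("Sup", 3), ("Inf", 3)], names=["Data", "Position"])
df1 = pd.DataFrame(np.random.rand(3, 4), index=idx, columns=midx1)
df2 = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=midx2)
df3 = pd.DataFrame(np.random.rand(3, 4), index=idx, columns=midx1)
df4 = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=midx2)

concatenated = concat_with_period([df1, df2, df3, df4], ["1h", "1h", "2h", "2h"])
print(concatenated.columns.get_level_values("Period").tolist())
```

Because the temporary keys are positions, the helper works for any number of frames generated in a loop, as long as periods has one label per frame.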

5 Comments

Thanks @Ben.T. I am surprised: is there no "direct" way to get the right tags? Your fix is nice, but it still adds some complexity. Thanks a lot for what you bring.
@pierre_j I'm not sure why it is not possible, and Parfait's idea is another way around it. But the fact that it is not possible even without a MultiIndex makes me think there will be no direct way.
Thanks for your reply. As you seem to be rather expert in pandas, I would like to mention that I have compared both methods in terms of speed, and @Parfait's method appears to be 30% faster. I have read in several places that repeated use of pd.concat leads to performance issues, so I was certainly not expecting this result compared to a "simple" rename(). Do you have any idea what may cause such a result (pd.concat being faster than rename())?
@pierre_j I know that rename is not especially fast, but being slower than 2 inner concat calls is a bit surprising. I just tested and I get similar performance over a 4-year period at a 1h interval. If speed is a concern, see my edit: you can call MultiIndex.from_tuples, which is way faster on my side.
Thanks a lot for your constant support! You are of great help!

Consider an inner concat on similar data frames then run a final concat to bind all together:

concatenated = pd.concat([pd.concat([df1, df2], axis=1),
                          pd.concat([df3, df4], axis=1)],
                         keys = ['1h', '2h'],
                         names=('Period', 'Data', 'Position'),
                         axis=1)

print(concatenated)  

Period                  1h                                                    \
Data                   Sup                 Inf                 Sup       Inf   
Position                 1         2         1         2         3         3   
2020-01-01 00:00  0.189802  0.675083  0.624484  0.781774  0.453101  0.224525   
2020-01-01 01:00  0.249818  0.829180  0.190488  0.923107  0.495873  0.278201   
2020-01-01 02:00  0.602634  0.494915  0.612672  0.903609  0.426809  0.248981   

Period                  2h                                                    
Data                   Sup                 Inf                 Sup       Inf  
Position                 1         2         1         2         3         3  
2020-01-01 00:00  0.746499  0.385714  0.008561  0.961152  0.988231  0.897454  
2020-01-01 01:00  0.643730  0.365023  0.812249  0.291733  0.045417  0.414968  
2020-01-01 02:00  0.887567  0.680102  0.978388  0.018501  0.695866  0.679730
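
When the frames are produced in a loop, the inner concat can be driven by first grouping the frames per period. Below is a minimal sketch under that assumption. The names grouped and inner and the sample frames are illustrative and not taken from the answer:

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Sample data shaped like the question's input (values are random).
idx = pd.period_range(start="2020-01-01 00:00", periods=3, freq="1h")
midx1 = pd.MultiIndex.from_tuples(
    [("Sup", 1), ("Sup", 2), ("Inf", 1), ("Inf", 2)], names=["Data", "Position"]
)
midx2 = pd.MultiIndex.from_tuples([("Sup", 3), ("Inf", 3)], names=["Data", "Position"])
df_list = [
    pd.DataFrame(np.random.rand(3, 4), index=idx, columns=midx1),
    pd.DataFrame(np.random.rand(3, 2), index=idx, columns=midx2),
    pd.DataFrame(np.random.rand(3, 4), index=idx, columns=midx1),
    pd.DataFrame(np.random.rand(3, 2), index=idx, columns=midx2),
]
period_list = ["1h", "1h", "2h", "2h"]

# Collect the frames that share a period label (insertion order is kept).
grouped = defaultdict(list)
for period, frame in zip(period_list, df_list):
    grouped[period].append(frame)

# One inner concat per period, then a single outer concat with unique keys.
inner = [pd.concat(frames, axis=1) for frames in grouped.values()]
concatenated = pd.concat(
    inner,
    keys=list(grouped.keys()),
    names=("Period", "Data", "Position"),
    axis=1,
)
print(concatenated.columns.get_level_values("Period").unique().tolist())
```

The outer concat only ever sees one key per period, so the duplicate-key problem from the question does not arise.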

3 Comments

@Parfait. Thanks, but I am actually trying to keep the number of pd.concat calls as low as possible for performance reasons. I expect rename() to be faster than 2 concat() calls, right? The data I have given only shows 4 DataFrames, but I generate them in loops, and there can be hundreds of them, with much more data. Thanks nonetheless!
Wow, am I missing something? Your solution is actually 33% faster than the rename(). Is that to be expected? (I worked with these 4 DataFrames, but increased their length to cover a 4-year period...)
Glad to help. Not sure why you see timing differences. The solution here can be integrated into loops or even a list comprehension that builds the lists of data frames for concat.
