Python Pandas - Issue concat multi-indexed Dataframes

Question

I am trying to merge two MultiIndexed DataFrames. My code is below. The issue, as you can see in the output, is that the "DATE" index is repeated, whereas I'd like all the values (OPEN_INT, PX_LAST) to sit on the same date index. Any ideas? I've tried both append and concat, but both give me similar results.

    if df.empty:
        # First piece: fetch one field for one ticker, indexed by DATE
        df = bbg_historicaldata(t, f, startDate, endDate)
        print(df)
        datesArray = list(df.index)
        tArray = [t for i in range(len(datesArray))]
        arrays = [tArray, datesArray]
        tuples = list(zip(*arrays))
        # Build a (TICKER, DATE) MultiIndex for this piece
        index = pd.MultiIndex.from_tuples(tuples, names=['TICKER', 'DATE'])
        df = pd.DataFrame({f: df[f].values}, index=index)
    else:
        # Later pieces: same steps, then combine with what we have so far
        temp = bbg_historicaldata(t, f, startDate, endDate)
        print(temp)
        datesArray = list(temp.index)
        tArray = [t for i in range(len(datesArray))]
        arrays = [tArray, datesArray]
        tuples = list(zip(*arrays))
        index = pd.MultiIndex.from_tuples(tuples, names=['TICKER', 'DATE'])
        temp = pd.DataFrame({f: temp[f].values}, index=index)

        # df = df.append(temp, ignore_index=True)
        # sort_index() replaces sortlevel(), which was removed in pandas 1.0
        df = pd.concat([df, temp], axis=1).sort_index()

Essentially, I want no NaNs!

                        PX_LAST   OPEN_INT  PX_LAST  OPEN_INT  PX_LAST  \
TICKER      DATE                                                         
EDH8 COMDTY 2017-02-01   98.365  1008044.0      NaN       NaN      NaN   
            2017-02-02   98.370  1009994.0      NaN       NaN      NaN   
            2017-02-03   98.360  1019181.0      NaN       NaN      NaN   
            2017-02-06   98.405  1023863.0      NaN       NaN      NaN   
            2017-02-07   98.410  1024609.0      NaN       NaN      NaN   
            2017-02-08   98.435  1046258.0      NaN       NaN      NaN   
            2017-02-09   98.395  1050291.0      NaN       NaN      NaN   
EDM8 COMDTY 2017-02-01      NaN        NaN   98.245  726739.0      NaN   
            2017-02-02      NaN        NaN   98.250  715081.0      NaN   
            2017-02-03      NaN        NaN   98.235  723936.0      NaN   
            2017-02-06      NaN        NaN   98.285  729324.0      NaN   
            2017-02-07      NaN        NaN   98.295  728673.0      NaN   
            2017-02-08      NaN        NaN   98.325  728520.0      NaN   
            2017-02-09      NaN        NaN   98.280  741840.0      NaN   
EDU8 COMDTY 2017-02-01      NaN        NaN      NaN       NaN   98.130   
            2017-02-02      NaN        NaN      NaN       NaN   98.135   
            2017-02-03      NaN        NaN      NaN       NaN   98.120   
            2017-02-06      NaN        NaN      NaN       NaN   98.180   
            2017-02-07      NaN        NaN      NaN       NaN   98.190   
            2017-02-08      NaN        NaN      NaN       NaN   98.225   
            2017-02-09      NaN        NaN      NaN       NaN   98.175

EDIT: Using axis=0 gives the following. I'd like it to collapse the duplicated dates (i.e., each date should appear once per ticker, with no duplicated days or NaNs).

                         OPEN_INT  PX_LAST
TICKER      DATE                          
EDH8 COMDTY 2017-02-01        NaN   98.365
            2017-02-01  1008044.0      NaN
            2017-02-02        NaN   98.370
            2017-02-02  1009994.0      NaN
            2017-02-03        NaN   98.360
            2017-02-03  1019181.0      NaN
            2017-02-06        NaN   98.405
            2017-02-06  1023863.0      NaN
            2017-02-07        NaN   98.410
            2017-02-07  1024609.0      NaN
            2017-02-08        NaN   98.435
            2017-02-08  1046258.0      NaN
            2017-02-09        NaN   98.395
            2017-02-09  1050291.0      NaN
EDM8 COMDTY 2017-02-01        NaN   98.245
            2017-02-01   726739.0      NaN
            2017-02-02        NaN   98.250
            2017-02-02   715081.0      NaN
            2017-02-03        NaN   98.235
            2017-02-03   723936.0      NaN
            2017-02-06        NaN   98.285
            2017-02-06   729324.0      NaN
            2017-02-07        NaN   98.295
            2017-02-07   728673.0      NaN
            2017-02-08        NaN   98.325
            2017-02-08   728520.0      NaN
            2017-02-09        NaN   98.280
            2017-02-09   741840.0      NaN

Here is the input data printed. I've added print(df) and print(temp) to the code above. They're all DataFrames with DATE as the index. The TICKER index value comes from the variable "t", and the column name comes from the variable "f" in the loop "for f in fields:".

            PX_LAST
DATE               
2017-02-01   98.365
2017-02-02   98.370
2017-02-03   98.360
2017-02-06   98.405
2017-02-07   98.410
2017-02-08   98.435
2017-02-09   98.395
             OPEN_INT
DATE                 
2017-02-01  1008044.0
2017-02-02  1009994.0
2017-02-03  1019181.0
2017-02-06  1023863.0
2017-02-07  1024609.0
2017-02-08  1046258.0
2017-02-09  1050291.0
            PX_LAST
DATE               
2017-02-01   98.245
2017-02-02   98.250
2017-02-03   98.235
2017-02-06   98.285
2017-02-07   98.295
2017-02-08   98.325
2017-02-09   98.280
            OPEN_INT
DATE                
2017-02-01  726739.0
2017-02-02  715081.0
2017-02-03  723936.0
2017-02-06  729324.0
2017-02-07  728673.0
2017-02-08  728520.0
2017-02-09  741840.0
            PX_LAST
DATE               
2017-02-01   98.130
2017-02-02   98.135
2017-02-03   98.120
2017-02-06   98.180
2017-02-07   98.190
2017-02-08   98.225
2017-02-09   98.175
            OPEN_INT
DATE                
2017-02-01  584448.0
2017-02-02  574246.0
2017-02-03  581897.0
2017-02-06  585169.0
2017-02-07  590248.0
2017-02-08  598478.0
2017-02-09  595884.0

The TICKER index values are different. Do you want to ignore/drop that index level? What is the desired result?
– unutbu, Commented Feb 11, 2017 at 20:55
So basically I'm hoping to have a multi-indexed dataframe. First index is the TICKER. The next index is the date. Followed then by the columns PX_LAST and OPEN_INT.
– keynesiancross, Commented Feb 11, 2017 at 21:00
For each ticker, there is going to be time series data, but all the tickers are going to share the same columns.
– keynesiancross, Commented Feb 11, 2017 at 21:02
Instead of just showing the output, it'd be easier if you showed what you were starting from, so people can experiment. I suspect you're making this much harder than it needs to be.
– DSM, Commented Feb 11, 2017 at 21:10

DSM · Accepted Answer · 2017-02-11 22:27:06Z


Your logic is a little hard to follow (for example, it's hard to see why your data call sometimes returns different columns). As far as I can tell, though, you really just want to combine all the frames that share a ticker (a join if the index is set to TICKER, DATE, or a merge if TICKER and DATE are columns) and then concatenate the results. Trying to do both in one step is what causes the problem.
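As a rough sketch of that two-step idea (not what I'll use below): the dict `raw` is a hypothetical stand-in for whatever your data call returns, with values taken from the question. The field frames of one ticker are first placed side by side on DATE, and only then are the tickers stacked.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2017-02-01", "2017-02-02", "2017-02-03"], name="DATE")

# Hypothetical stand-in for the per-ticker, per-field frames from the data call.
raw = {
    "EDH8 COMDTY": {
        "PX_LAST": pd.DataFrame({"PX_LAST": [98.365, 98.370, 98.360]}, index=dates),
        "OPEN_INT": pd.DataFrame({"OPEN_INT": [1008044.0, 1009994.0, 1019181.0]}, index=dates),
    },
    "EDM8 COMDTY": {
        "PX_LAST": pd.DataFrame({"PX_LAST": [98.245, 98.250, 98.235]}, index=dates),
        "OPEN_INT": pd.DataFrame({"OPEN_INT": [726739.0, 715081.0, 723936.0]}, index=dates),
    },
}

per_ticker = {}
for ticker, frames in raw.items():
    # Step 1: align the field frames of ONE ticker on DATE (columns side by side).
    per_ticker[ticker] = pd.concat(list(frames.values()), axis=1)

# Step 2: stack the tickers; the dict keys become the outer TICKER level.
result = pd.concat(per_ticker, names=["TICKER", "DATE"])
print(result)
```

Because each step only combines frames that line up (same dates in step 1, same columns in step 2), no padding NaNs should appear for data shaped like this.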

Alternatively, we can just concat the whole thing and then pivot, which is what I'll do here because it's easier to show.

(As an aside, repeatedly concatenating within a loop can be a performance problem, because a lot of data needs to be copied each time, and should generally be avoided. Build a collection of what you want to concatenate first, and then concatenate it once.)

Assuming that each of your frames starts looking like the following (where the column might be different):

In [532]: df
Out[532]: 
            PX_LAST
DATE               
2017-02-01   98.365
2017-02-02   98.370
2017-02-03   98.360
2017-02-06   98.405
2017-02-07   98.410
2017-02-08   98.435
2017-02-09   98.395

then instead of what you're doing now I'd just add the ticker to the frame and reset the index:

In [549]: df = df.assign(TICKER=t).reset_index()   # t holds the ticker

In [550]: df
Out[550]: 
         DATE  PX_LAST       TICKER
0  2017-02-01   98.365  EDH8 COMDTY
1  2017-02-02   98.370  EDH8 COMDTY
2  2017-02-03   98.360  EDH8 COMDTY
3  2017-02-06   98.405  EDH8 COMDTY
4  2017-02-07   98.410  EDH8 COMDTY
5  2017-02-08   98.435  EDH8 COMDTY
6  2017-02-09   98.395  EDH8 COMDTY

To make the concatenation more memory-friendly, let's melt this:

In [579]: pd.melt(df, id_vars=["TICKER", "DATE"])
Out[579]: 
        TICKER        DATE variable   value
0  EDH8 COMDTY  2017-02-01  PX_LAST  98.365
1  EDH8 COMDTY  2017-02-02  PX_LAST  98.370
2  EDH8 COMDTY  2017-02-03  PX_LAST  98.360
3  EDH8 COMDTY  2017-02-06  PX_LAST  98.405
4  EDH8 COMDTY  2017-02-07  PX_LAST  98.410
5  EDH8 COMDTY  2017-02-08  PX_LAST  98.435
6  EDH8 COMDTY  2017-02-09  PX_LAST  98.395

and append this to a list dfs. Now the partial frames will combine nicely, because they all have the same columns, and we can pivot to get our desired output:

In [589]: pd.concat(dfs).pivot_table(index=["TICKER", "DATE"], columns="variable", values="value")
Out[589]: 
variable                 OPEN_INT  PX_LAST
TICKER      DATE                          
EDH8 COMDTY 2017-02-01  1008044.0   98.365
            2017-02-02  1009994.0   98.370
            2017-02-03  1019181.0   98.360
            2017-02-06  1023863.0   98.405
[...]
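Putting the steps together, here is a minimal end-to-end sketch. The `fetch` function and the `samples` dict are hypothetical stand-ins for your real data call (values copied from the question); only the assign/reset_index, melt, concat and pivot_table calls are the actual technique.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2017-02-01", "2017-02-02", "2017-02-03"], name="DATE")

# Hypothetical sample data keyed by (ticker, field).
samples = {
    ("EDH8 COMDTY", "PX_LAST"): [98.365, 98.370, 98.360],
    ("EDH8 COMDTY", "OPEN_INT"): [1008044.0, 1009994.0, 1019181.0],
    ("EDM8 COMDTY", "PX_LAST"): [98.245, 98.250, 98.235],
    ("EDM8 COMDTY", "OPEN_INT"): [726739.0, 715081.0, 723936.0],
}

def fetch(ticker, field):
    """Stand-in for the real data call: a DATE-indexed frame with one column."""
    return pd.DataFrame({field: samples[(ticker, field)]}, index=dates)

dfs = []
for t in ["EDH8 COMDTY", "EDM8 COMDTY"]:
    for f in ["PX_LAST", "OPEN_INT"]:
        # Add the ticker, turn DATE into a column, then melt to long format.
        piece = fetch(t, f).assign(TICKER=t).reset_index()
        dfs.append(pd.melt(piece, id_vars=["TICKER", "DATE"]))

# Concatenate once, outside the loop, then pivot back to one column per field.
result = pd.concat(dfs, ignore_index=True).pivot_table(
    index=["TICKER", "DATE"], columns="variable", values="value"
)
print(result)
```

Every melted piece has the same four columns (TICKER, DATE, variable, value), so the single concat has nothing to pad.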

This avoids having all those intermediate NaNs. The concatenation+pivot approach works even if you don't melt, so at first I skipped the melting. On second thought, keeping those intermediate NaNs is a bad idea even though it works, because the intermediate memory requirements could grow to be prohibitive.
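To make the difference concrete, here is a small sketch comparing the intermediate frame with and without melting. The two input frames are illustrative (one ticker, values from the question), and the variable names are my own.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2017-02-01", "2017-02-02", "2017-02-03"], name="DATE")
px = pd.DataFrame({"PX_LAST": [98.365, 98.370, 98.360]}, index=dates)
oi = pd.DataFrame({"OPEN_INT": [1008044.0, 1009994.0, 1019181.0]}, index=dates)

# Wide pieces: each has only one value column, so concat pads the other with NaN.
wide = [f.assign(TICKER="EDH8 COMDTY").reset_index() for f in (px, oi)]
wide_concat = pd.concat(wide, ignore_index=True)

# Long pieces: identical columns in every piece, so nothing is padded.
long_concat = pd.concat(
    [pd.melt(f, id_vars=["TICKER", "DATE"]) for f in wide], ignore_index=True
)

print("NaNs without melt:", wide_concat.isna().sum().sum())
print("NaNs with melt:   ", long_concat.isna().sum().sum())

# Both intermediates pivot to the same final table.
wide_pivot = wide_concat.pivot_table(index=["TICKER", "DATE"])
long_pivot = long_concat.pivot_table(
    index=["TICKER", "DATE"], columns="variable", values="value"
)
```

In the wide intermediate, the share of NaN cells grows with the number of distinct fields, which is where the memory concern comes from.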

edited Feb 11, 2017 at 22:27

answered Feb 11, 2017 at 21:54


4 Comments

keynesiancross Over a year ago

you're a legend, thanks DSM! Out of curiosity, why does having a large number of columns change the way that you would approach this problem?

DSM Over a year ago

@keynesiancross: the more columns (or rows, for that matter) you have, the more of your intermediate pre-pivoted dataframe is just NaN. This means you can have an enormous intermediate frame in memory even though the final version will be many times smaller. In fact, I'm going to switch the order of my recommendations, so as not to lead anyone else down that path.

keynesiancross Over a year ago

Actually, now that I'm thinking about it, will you end up with all those NaNs if you do the concat(dfs).pivot_table() outside of the loop? I.e., you're going to build a list of dfs that have only dates and fully filled-in column data. The NaNs were only a byproduct of my original code.

DSM Over a year ago

@keynesiancross: yeah, they show up, basically because the concat still tries to combine subframes which only have OPEN_INT with ones which only have PX_LAST.
