
I have a Pandas DataFrame of the following form

[Image: a DataFrame with columns ID, Year, Max Temp, Min Temp and Rain, where each Max Temp / Min Temp / Rain cell holds an array of daily values]

There is one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain each cell contains an array of values corresponding to a day in that year, i.e. for the frame above

  • frame3.iloc[0]['Max Temp'][0] is the value for January 1st 2011
  • frame3.iloc[0]['Max Temp'][364] is the value for December 31st 2011.

I'm aware this is badly structured, but this is the data I have to deal with. It is stored in MongoDB in this way (where one of these rows equates to a document in Mongo).
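To make the structure concrete, here is a minimal mock-up of the frame (hypothetical IDs and values; the real arrays hold 365 or 366 entries each):

import pandas as pd

frame3 = pd.DataFrame({
    'ID': [1, 1],
    'Year': [2011, 2012],
    'Max Temp': [[7.2, 8.1, 6.5], [5.0, 4.8, 6.1]],   # really one value per day
    'Min Temp': [[1.0, 0.4, 2.2], [-1.3, 0.0, 1.1]],
    'Rain': [[0.0, 3.2, 0.0], [1.5, 0.0, 0.7]],
})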

I want to split these nested arrays so that, instead of one row per ID per year, I have one row per ID per day. While splitting the arrays, however, I would also like to create a new column capturing the day of the year, based on the current array index. I would then use this day, plus the Year column, to create a DatetimeIndex:

[Image: the desired result, one row per ID per day, indexed by ID and date]

I searched here for relevant answers, but only found this one, which doesn't really help me.

2 Comments
  • Are these inner arrays represented as strings or real arrays? Commented Jul 15, 2016 at 1:54
  • They are lists of floats Commented Jul 15, 2016 at 9:20

1 Answer


You can run .apply(pd.Series) for each of your columns, then stack and concatenate the results.

For a series

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

s
Out[103]: 
2011       [0, 1]
2012    [2, 3, 4]
dtype: object

it works as follows

s.apply(pd.Series).stack()
Out[104]: 
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64

The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate frame, i.e. the output of apply(pd.Series) before stack, contained a NaN value, which stack then dropped.
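For reference, that intermediate frame looks like this:

s.apply(pd.Series)

        0    1    2
2011  0.0  1.0  NaN
2012  2.0  3.0  4.0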

Now, let's take a frame:

a = list(range(14))
b = list(range(20, 34))

df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

df
Out[108]: 
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012

Then we can run:

# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)

and get:

result
Out[115]: 
                 A     B
ID    Year              
11111 2011 0   0.0  20.0
           1   1.0  21.0
           2   2.0  22.0
      2012 0   3.0  23.0
           1   4.0  24.0
           2   5.0  25.0
           3   6.0  26.0
11112 2011 0   7.0  27.0
           1   8.0  28.0
           2   9.0  29.0
      2012 0  10.0  30.0
           1  11.0  31.0
           2  12.0  32.0
           3  13.0  33.0
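If you also want the 0-based day-of-year as an explicit column (as mentioned in the question), one option is to name the stacked level and reset it into a column; this returns a copy and leaves result untouched for the steps below:

# name the index levels and move the day position into a column
result_with_day = result.rename_axis(['ID', 'Year', 'Day']).reset_index('Day')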

The rest (building the DatetimeIndex) is more or less straightforward. For example:

# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str))
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in pandas), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')

# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'

new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])

result.index = new_index

result
Out[130]: 
                     A     B
ID    Date                  
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
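As an aside, on newer pandas versions (1.3+, an assumption about what you have installed) DataFrame.explode accepts several columns at once, which shortens the unnesting step. A rough sketch on the same toy frame:

# explode all array columns at once (the lists in a row must share a length)
df2 = df.set_index(['ID', 'Year'])
exploded = df2.explode(['A', 'B'])
# recover the 0-based day position within each (ID, Year) group
exploded['Day'] = exploded.groupby(level=['ID', 'Year']).cumcount()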

2 Comments

Excellent answer, thank you. You were right that days = pd.to_timedelta(result.index.get_level_values(2), unit='D') doesn't work; I needed the alternative you provided: days = result.index.get_level_values(2).astype('timedelta64[D]')
Glad I could help. The bug that makes to_timedelta break will be fixed in the next pandas release.
