Create DataArray from Dict of 2D DataFrames/Arrays

Question

I'm trying to transition from Pandas into Xarray for N-Dimensional DataArrays to expand my repertoire.

Realistically, I'm going to have a bunch of different pd.DataFrames (in this case row=month, col=attribute) along a particular axis (patients in the mock example below) that I would like to merge (without using Panels or a MultiIndex :), thank you). I want to convert them to xr.DataArrays so I can build dimensions upon them. I made a mock dataset to give the gist of what I'm talking about.

For this made-up dataset, imagine 100 patients, 12 months, 10000 attributes, and 3 replicates per attribute, which would be a typical 4D dataset. I condense the 3 replicates per attribute by taking their mean, so I end up with a 2D pd.DataFrame (row=months, col=attributes). This DataFrame is the value in my dictionary, and the patient it came from is the key (i.e. patient_x : DataFrame_X).

I'm also including the roundabout way I did it with an np.ndarray placeholder, but it would be really convenient if I could generate an N-dimensional DataArray from a dictionary whose keys are patient_x and whose values are DataFrame_X.

How can I create an N-dimensional DataArray using Xarray from a dictionary of Pandas DataFrames?

import xarray as xr
import numpy as np
import pandas as pd

np.random.seed(1618033)

#Set dimensions
a,b,c,d = 100,12,10000,3 #100 patients, 12 months, 10000 attributes, 3 replicates

#Create labels
patients = ["patient_%d" % i for i in range(a)]
months = [j for j in range(b)]
attributes = ["attr_%d" % k for k in range(c)]
replicates = [l for l in range(d)]

coords = [patients,months,attributes]
dims = ["Patients","Months","Attributes"]

#Dict of DataFrames
D_patient_DF = dict()

for i, patient in enumerate(patients):
    A_placeholder = np.zeros((b,c))
    for j, month in enumerate(months):
        #Attribute x Replicates
        A_attrReplicates = np.random.random((c,d))
        #Collapse into 1D Vector
        V_attrExp = A_attrReplicates.mean(axis=1)
        #Fill array with row
        A_placeholder[j,:] = V_attrExp
    #Assign dataframe for every patient
    DF_data = pd.DataFrame(A_placeholder, index = months, columns = attributes)
    D_patient_DF[patient] = DF_data

xr.DataArray(D_patient_DF).dims
#() it's empty

D_patient_DF
#{'patient_0':       attr_0    attr_1    attr_2    attr_3    attr_4    attr_5    attr_6  \
# 0   0.445446  0.422018  0.343454  0.140700  0.567435  0.362194  0.563799   
# 1   0.440010  0.548535  0.810903  0.482867  0.469542  0.591939  0.579344   
# 2   0.645719  0.450773  0.386939  0.418496  0.508290  0.431033  0.622270   
# 3   0.555855  0.633393  0.555197  0.556342  0.489865  0.204200  0.823043   
# 4   0.916768  0.590534  0.597989  0.592359  0.484624  0.478347  0.507789   
# 5   0.847069  0.634923  0.591008  0.249107  0.655182  0.394640  0.579700   
# 6   0.700385  0.505331  0.377745  0.651936  0.334216  0.489728  0.282544   
# 7   0.777810  0.423889  0.414316  0.389318  0.565144  0.394320  0.511034   
# 8   0.440633  0.069643  0.675037  0.365963  0.647660  0.520047  0.539253   
# 9   0.333213  0.328315  0.662203  0.594030  0.790758  0.754032  0.602375   
# 10  0.470330  0.419496  0.171292  0.677439  0.683759  0.646363  0.465788   
# 11  0.758556  0.674664  0.801860  0.612087  0.567770  0.801514  0.179939

shoyer · Accepted Answer · 2016-05-01 21:20:23Z

6

From a dictionary of DataFrames, you might convert each value into a DataArray (adding dimension labels), load the results into a Dataset and then convert into a DataArray:

variables = {k: xr.DataArray(v, dims=['month', 'attribute'])
             for k, v in D_patient_DF.items()}
combined = xr.Dataset(variables).to_array(dim='patient')
print(combined)

However, beware that the result will not necessarily be in sorted order; it follows the order of dictionary iteration, which is arbitrary before Python 3.7. If you want a specific order, you should use an OrderedDict instead (insert after setting variables above):

import collections
variables = collections.OrderedDict((k, variables[k]) for k in patients)

This outputs:

<xarray.DataArray (patient: 100, month: 12, attribute: 10000)>
array([[[ 0.61176399,  0.26172557,  0.74657302, ...,  0.43742111,
          0.47503291,  0.37263983],
        [ 0.34970732,  0.81527751,  0.53612895, ...,  0.68971198,
          0.68962168,  0.75103198],
        [ 0.71282751,  0.23143891,  0.28481889, ...,  0.52612376,
          0.56992843,  0.3483683 ],
        ...,
        [ 0.84627257,  0.5033482 ,  0.44116194, ...,  0.55020168,
          0.48151353,  0.36374339],
        [ 0.53336826,  0.59566147,  0.45269417, ...,  0.41951078,
          0.46815364,  0.44630235],
        [ 0.25720899,  0.18738289,  0.66639783, ...,  0.36149276,
          0.58865823,  0.33918553]],

       ...,

       [[ 0.42933273,  0.58642504,  0.38716496, ...,  0.45667285,
          0.72684589,  0.52335464],
        [ 0.34946576,  0.35821339,  0.33097093, ...,  0.59037927,
          0.30233665,  0.6515749 ],
        [ 0.63673498,  0.31022272,  0.65788374, ...,  0.47881873,
          0.67825066,  0.58704331],
        ...,
        [ 0.44822441,  0.502429  ,  0.50677081, ...,  0.4843405 ,
          0.84396521,  0.45460029],
        [ 0.61336348,  0.46338301,  0.60715273, ...,  0.48322379,
          0.66530209,  0.52204897],
        [ 0.47520639,  0.43490559,  0.27309414, ...,  0.35280585,
          0.30280485,  0.77537204]]])
Coordinates:
  * month      (month) int64 0 1 2 3 4 5 6 7 8 9 10 11
  * patient    (patient) <U10 'patient_80' 'patient_73' 'patient_79' ...
  * attribute  (attribute) object 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
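
As a rough, self-contained sketch of the Dataset route described above, here is a scaled-down version. The sizes (3 patients, 4 months, 5 attributes) and the name `frames` are made up for illustration and are not part of the original example. The final `.sel` call is just one way to pin the patient order explicitly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical miniature data: 3 patients, 4 months, 5 attributes
patients = ["patient_%d" % i for i in range(3)]
months = list(range(4))
attributes = ["attr_%d" % k for k in range(5)]

rng = np.random.default_rng(0)
frames = {
    p: pd.DataFrame(rng.random((4, 5)), index=months, columns=attributes)
    for p in patients
}

# One 2D DataArray per patient; the index and columns become coordinates
variables = {
    p: xr.DataArray(frames[p], dims=["month", "attribute"]) for p in patients
}

# Stack the Dataset's variables along a new "patient" dimension
combined = xr.Dataset(variables).to_array(dim="patient")

# Select by label list to make the patient order explicit
combined = combined.sel(patient=patients)

print(combined.dims)
print(combined.shape)
```

Selecting a single patient with `combined.sel(patient="patient_1")` should then give back that patient's 2D month-by-attribute array.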

Alternatively, you could create a list of 2D DataArrays and then use concat:

patient_list = []
for i, patient in enumerate(patients):
    df = ...  # the 2D DataFrame for this patient (rows=months, columns=attributes)
    array = xr.DataArray(df, dims=['month', 'attribute'])
    patient_list.append(array)
combined = xr.concat(patient_list, dim=pd.Index(patients, name='patient'))

This would give the same result, and is probably the cleanest code.
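
A minimal runnable sketch of the concat approach, assuming the DataFrames are looked up from a dictionary keyed by patient as in the question. The small sizes here are invented so the example runs quickly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical miniature data: 3 patients, 4 months, 5 attributes
patients = ["patient_%d" % i for i in range(3)]
months = list(range(4))
attributes = ["attr_%d" % k for k in range(5)]

rng = np.random.default_rng(0)
D_patient_DF = {
    p: pd.DataFrame(rng.random((4, 5)), index=months, columns=attributes)
    for p in patients
}

# Build the list in the order of `patients` so it lines up with the index below
patient_list = [
    xr.DataArray(D_patient_DF[p], dims=["month", "attribute"]) for p in patients
]

# Passing a named pd.Index as `dim` creates the new dimension and labels it
combined = xr.concat(patient_list, dim=pd.Index(patients, name="patient"))

print(combined.dims)
print(combined.coords["patient"].values)
```

The key point is that the list order and the `pd.Index` order must match, since concat pairs them positionally.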

edited May 1, 2016 at 21:20

answered Apr 29, 2016 at 23:11

shoyer


4 Comments

O.rka Over a year ago

Hey @Stephan, thanks for the response. With the first section you wrote, I tried variables = {(k, xr.DataArray(v, dims=['month', 'attribute'])) for k, v in list(D_patient_DF.items())} and got the following error: TypeError: unhashable type: 'DataArray'. I'm using Python 3.5, so I changed D_patient_DF.items() to list(D_patient_DF.items()).

O.rka Over a year ago

I like your last example. I ended up tweaking it to bypass the DataFrame and go straight to a DataArray: D_patient_DA[patient] = xr.DataArray(A_placeholder, coords = [months, attributes], dims = ["Months","Attributes"]). Then I do DA_data = xr.concat(list(D_patient_DA.values()), dim="Patients"), but I can't assign the labels to the patients (or coords).

shoyer Over a year ago

@O.rka good catch, I made a mistake when editing my code -- I fixed the dictionary comprehension in the first example.

O.rka Over a year ago

Thanks a lot @Stephan. I didn't know one could concat with 2D DataArrays along a new dimension!
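
On the labeling question raised in the comments, one possible approach (a sketch, not something confirmed in the thread) is to pass a named pd.Index built from the dictionary keys as `dim`, or to attach the labels afterwards with `assign_coords`. The small array sizes and the names `labels` and `DA_alt` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical miniature data: 3 patients, 4 months, 5 attributes
months = list(range(4))
attributes = ["attr_%d" % k for k in range(5)]
rng = np.random.default_rng(0)

# Dict of 2D DataArrays built straight from NumPy arrays
D_patient_DA = {}
for i in range(3):
    D_patient_DA["patient_%d" % i] = xr.DataArray(
        rng.random((4, 5)),
        coords=[months, attributes],
        dims=["Months", "Attributes"],
    )

# Option 1: a named pd.Index both names the new dimension and labels it
labels = pd.Index(list(D_patient_DA.keys()), name="Patients")
DA_data = xr.concat(list(D_patient_DA.values()), dim=labels)

# Option 2: concat along an unlabeled dimension, then attach the labels
DA_alt = xr.concat(list(D_patient_DA.values()), dim="Patients")
DA_alt = DA_alt.assign_coords(Patients=list(D_patient_DA.keys()))

print(DA_data.coords["Patients"].values)
```

Both options rely on the dictionary's keys and values iterating in the same order, which holds for a single unmodified dict.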
