
I'm trying to transition from Pandas into Xarray for N-Dimensional DataArrays to expand my repertoire.

Realistically, I'm going to have a bunch of different pd.DataFrames (in this case row=month, col=attribute) along a particular axis (patients in the mock example below) that I would like to merge (without using Panels or a MultiIndex :), thank you). I want to convert them to xr.DataArrays so I can build dimensions upon them. I made a mock dataset to give the gist of what I'm talking about.

For this made-up dataset, imagine 100 patients, 12 months, 10000 attributes, and 3 replicates (per attribute), which would be a typical 4D dataset. Basically, I'm condensing the 3 replicates per attribute by taking the mean, so I end up with a 2D pd.DataFrame (row=months, col=attributes). This DataFrame is the value in my dictionary and the patient it came from is the key (i.e. {patient_x: DataFrame_X}).

I'm also including the roundabout way I did it with an np.ndarray placeholder, but it would be really convenient if I could generate an N-dimensional DataArray from a dictionary whose key is patient_x and whose value is DataFrame_X.

How can I create an N-dimensional DataArray using xarray from a dictionary of pandas DataFrames?

import xarray as xr
import numpy as np
import pandas as pd

np.random.seed(1618033)

#Set dimensions
a,b,c,d = 100,12,10000,3 #100 patients, 12 months, 10000 attributes, 3 replicates

#Create labels
patients = ["patient_%d" % i for i in range(a)]
months = [j for j in range(b)]
attributes = ["attr_%d" % k for k in range(c)]
replicates = [l for l in range(d)]

coords = [patients,months,attributes]
dims = ["Patients","Months","Attributes"]

#Dict of DataFrames
D_patient_DF = dict()

for i, patient in enumerate(patients):
    A_placeholder = np.zeros((b,c))
    for j, month in enumerate(months):
        #Attribute x Replicates
        A_attrReplicates = np.random.random((c,d))
        #Collapse into 1D Vector
        V_attrExp = A_attrReplicates.mean(axis=1)
        #Fill array with row
        A_placeholder[j,:] = V_attrExp
    #Assign dataframe for every patient
    DF_data = pd.DataFrame(A_placeholder, index = months, columns = attributes)
    D_patient_DF[patient] = DF_data

xr.DataArray(D_patient_DF).dims
#() it's empty

D_patient_DF
#{'patient_0':       attr_0    attr_1    attr_2    attr_3    attr_4    attr_5    attr_6  \
# 0   0.445446  0.422018  0.343454  0.140700  0.567435  0.362194  0.563799   
# 1   0.440010  0.548535  0.810903  0.482867  0.469542  0.591939  0.579344   
# 2   0.645719  0.450773  0.386939  0.418496  0.508290  0.431033  0.622270   
# 3   0.555855  0.633393  0.555197  0.556342  0.489865  0.204200  0.823043   
# 4   0.916768  0.590534  0.597989  0.592359  0.484624  0.478347  0.507789   
# 5   0.847069  0.634923  0.591008  0.249107  0.655182  0.394640  0.579700   
# 6   0.700385  0.505331  0.377745  0.651936  0.334216  0.489728  0.282544   
# 7   0.777810  0.423889  0.414316  0.389318  0.565144  0.394320  0.511034   
# 8   0.440633  0.069643  0.675037  0.365963  0.647660  0.520047  0.539253   
# 9   0.333213  0.328315  0.662203  0.594030  0.790758  0.754032  0.602375   
# 10  0.470330  0.419496  0.171292  0.677439  0.683759  0.646363  0.465788   
# 11  0.758556  0.674664  0.801860  0.612087  0.567770  0.801514  0.179939   

1 Answer

From a dictionary of DataFrames, you can convert each value into a DataArray (adding dimension labels), load the results into a Dataset, and then convert that into a single DataArray:

variables = {k: xr.DataArray(v, dims=['month', 'attribute'])
             for k, v in D_patient_DF.items()}
combined = xr.Dataset(variables).to_array(dim='patient')
print(combined)

However, beware that the patients will not necessarily come out in sorted order; they follow the order of dictionary iteration, which is arbitrary in Python versions before 3.7. If you want a specific order, rebuild the dictionary as an OrderedDict (insert this after defining variables above):

import collections

variables = collections.OrderedDict((k, variables[k]) for k in patients)

This outputs:

<xarray.DataArray (patient: 100, month: 12, attribute: 10000)>
array([[[ 0.61176399,  0.26172557,  0.74657302, ...,  0.43742111,
          0.47503291,  0.37263983],
        [ 0.34970732,  0.81527751,  0.53612895, ...,  0.68971198,
          0.68962168,  0.75103198],
        [ 0.71282751,  0.23143891,  0.28481889, ...,  0.52612376,
          0.56992843,  0.3483683 ],
        ...,
        [ 0.84627257,  0.5033482 ,  0.44116194, ...,  0.55020168,
          0.48151353,  0.36374339],
        [ 0.53336826,  0.59566147,  0.45269417, ...,  0.41951078,
          0.46815364,  0.44630235],
        [ 0.25720899,  0.18738289,  0.66639783, ...,  0.36149276,
          0.58865823,  0.33918553]],

       ...,

       [[ 0.42933273,  0.58642504,  0.38716496, ...,  0.45667285,
          0.72684589,  0.52335464],
        [ 0.34946576,  0.35821339,  0.33097093, ...,  0.59037927,
          0.30233665,  0.6515749 ],
        [ 0.63673498,  0.31022272,  0.65788374, ...,  0.47881873,
          0.67825066,  0.58704331],
        ...,
        [ 0.44822441,  0.502429  ,  0.50677081, ...,  0.4843405 ,
          0.84396521,  0.45460029],
        [ 0.61336348,  0.46338301,  0.60715273, ...,  0.48322379,
          0.66530209,  0.52204897],
        [ 0.47520639,  0.43490559,  0.27309414, ...,  0.35280585,
          0.30280485,  0.77537204]]])
Coordinates:
  * month      (month) int64 0 1 2 3 4 5 6 7 8 9 10 11
  * patient    (patient) <U10 'patient_80' 'patient_73' 'patient_79' ...
  * attribute  (attribute) object 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
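
As a quick check at a smaller scale, here is a minimal sketch of the same Dataset route. The sizes (3 patients, 4 months, 5 attributes) and the `frames` name are made up for illustration and are not from the question. Instead of rebuilding the dictionary, it forces the patient order by selecting with the list of labels via `.sel`, which is just one possible way to do it:

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)

# Illustrative sizes, much smaller than the question's 100 x 12 x 10000
patients = ["patient_%d" % i for i in range(3)]
months = list(range(4))
attributes = ["attr_%d" % k for k in range(5)]

# One (months x attributes) DataFrame per patient, keyed by patient label
# (`frames` is a hypothetical stand-in for D_patient_DF)
frames = {
    p: pd.DataFrame(rng.random((len(months), len(attributes))),
                    index=months, columns=attributes)
    for p in patients
}

# Label the two DataFrame axes, then stack the variables along "patient"
variables = {k: xr.DataArray(v, dims=["month", "attribute"])
             for k, v in frames.items()}
combined = xr.Dataset(variables).to_array(dim="patient")

# Reorder explicitly by label instead of relying on dict iteration order
combined = combined.sel(patient=patients)

print(combined.dims)
print(combined.shape)
```

Selecting a single patient by label afterwards, e.g. `combined.sel(patient="patient_1")`, should give back that patient's 2D month-by-attribute slice.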

Alternatively, you could create a list of 2D DataArrays and then use concat:

patient_list = []
for i, patient in enumerate(patients):
    df = ...  # the 2D DataFrame (months x attributes) for this patient
    array = xr.DataArray(df, dims=['month', 'attribute'])
    patient_list.append(array)
combined = xr.concat(patient_list, dim=pd.Index(patients, name='patient'))

This would give the same result, and is probably the cleanest code.
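
For a version of the concat approach that runs end to end, here is a small sketch under made-up assumptions: the `frames` dictionary, its sizes, and the `labels` name are hypothetical stand-ins for your real data. The point it illustrates is that passing a `pd.Index` as `dim` both names the new dimension and attaches the patient labels as its coordinate:

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(1)

# Hypothetical small dictionary of DataFrames: 3 patients, 4 months, 5 attributes
months = list(range(4))
attributes = ["attr_%d" % k for k in range(5)]
frames = {
    "patient_%d" % i: pd.DataFrame(rng.random((4, 5)),
                                   index=months, columns=attributes)
    for i in range(3)
}

# Fix the order once, so the arrays and their labels stay aligned
labels = sorted(frames)

# One 2D DataArray per patient; coordinates come from the DataFrame axes
arrays = [xr.DataArray(frames[p], dims=["month", "attribute"]) for p in labels]

# The pd.Index supplies both the new dimension name and its labels
combined = xr.concat(arrays, dim=pd.Index(labels, name="patient"))

print(combined.dims)
print(combined["patient"].values)
```

Building `arrays` and the index from the same `labels` list is a deliberate choice here: it avoids depending on the dictionary's iteration order to keep data and labels matched.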


4 Comments

Hey @Stephan thanks for the response. with the first section you wrote I tried variables = {(k, xr.DataArray(v, dims=['month', 'attribute'])) for k, v in list(D_patient_DF.items())} and got the following error: TypeError: unhashable type: 'DataArray' . I'm using Python 3.5 so I changed the D_patient_DF.items() to list(D_patient_DF.items())
I like your last example. I ended up tweaking it to bypass the dataframe, and just go straight to dataarray D_patient_DA[patient] = xr.DataArray(A_placeholder, coords = [months, attributes], dims = ["Months","Attributes"]) then I do DA_data = xr.concat(list(D_patient_DA.values()), dim="Patients") but I can't assign the labels to the patients (or coords).
@O.rka good catch, I made a mistake when editing my code -- I fixed the dictionary comprehension in the first example.
Thanks a lot @Stephan. I didn't know one could concat with 2D DataArrays along a new dimension!
