3

I have a dictionary of dataframes where each dictionary key corresponds to sample name and the dataframe itself has a "time" column and some measurement columns (temperature, concentration,etc). The time column is not consistent among samples (both start and finish times are different for different samples, although I think all time points between start and finish are measured/have same dT).

I want to merge all of the data into a single xarray where one axis is the time, the other axis is the measurement type, and the third axis is the sample name. Since not all times are measured for all samples the missing data should be replaced with nan.

I have little experience with xarray but doing a simple merger (construct xarray from dict of xarrays) I couldn't figure out how to make "time" one of the axis (instead each just concatenated all of the samples with time being one of the data columns).

Thank you for your help!

Edit:

Here is code I have with dummy data

import pandas as pd
import xarray as xr

#make fake data
dfs = {'sample1':pd.DataFrame([[1,0,0],[2,0,0],[3,0,0]],columns = ["Time","ColA","ColB"]), 
    'sample2':pd.DataFrame([[2,1,1],[3,1,1],[4,1,1]],columns = ["Time","ColA","ColB"])}

#code I use for real data
xrs = {k: xr.DataArray(v) for k, v in dfs.items()}
merged = xr.Dataset(variables).to_array(dim="samples")

print(merged)

Output is:

<xarray.DataArray (samples: 2, dim_0: 3, dim_1: 3)>
array([[[1, 0, 0],
        [2, 0, 0],
        [3, 0, 0]],

       [[2, 1, 1],
        [3, 1, 1],
        [4, 1, 1]]], dtype=int64)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2
  * dim_1    (dim_1) object 'Time' 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'

Desired output:

<xarray.DataArray (samples: 2, Time: 4, dim_1: 2)>
array([[[0, 0],
    [0, 0],
    [0, 0],
    [nan, nan]],

   [[nan, nan]
    [1, 1],
    [1, 1],
    [1, 1]]], dtype=int64)
Coordinates:
  * Time     (Time) int64 1 2 3 4
  * dim_1    (dim_1) object 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'
2
  • You know in this busy world data science people dont have patience to go through a big explanation without sample data. Better add a sample input dataframe and expected output Commented Oct 21, 2017 at 16:56
  • Good point, done Commented Oct 21, 2017 at 17:12

1 Answer 1

1

Xarray supports two basic formats for converting from pandas:

  1. DataFrames representing an 2D matrix that you want to convert into a xarray.DataArray, where the index and columns already correctly label the axes.
  2. DataFrames representing flattened "tidy data", where MultiIndex levels in the index correspond to the desired axes on the result and columns correspond to desired variables in an xarray.Dataset.

Your format is closest to the former, so let's try that. It turns out all you were missing is a call to .set_index('Time') to indicate that that column should be used for labeling the axis:

In [23]: da = xr.Dataset({k: v.set_index('Time') for k, v in dfs.items()}).to_array(dim='samples')

In [24]: da
Out[24]:
<xarray.DataArray (samples: 2, Time: 4, dim_1: 2)>
array([[[  0.,   0.],
        [  0.,   0.],
        [  0.,   0.],
        [ nan,  nan]],

       [[ nan,  nan],
        [  1.,   1.],
        [  1.,   1.],
        [  1.,   1.]]])
Coordinates:
  * Time     (Time) int64 1 2 3 4
  * dim_1    (dim_1) object 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'

The second approach is also worth thinking about, because making your data tidy will also make it easier to work with as a DataFrame (e.g., to easily plot your data in Seaborn).

Once you have the xarray.DataArray, you can convert it back into pandas using to_series():

In [22]: da.to_series()
Out[22]:
samples  Time  dim_1
sample1  1     ColA     0.0
               ColB     0.0
         2     ColA     0.0
               ColB     0.0
         3     ColA     0.0
               ColB     0.0
         4     ColA     NaN
               ColB     NaN
sample2  1     ColA     NaN
               ColB     NaN
         2     ColA     1.0
               ColB     1.0
         3     ColA     1.0
               ColB     1.0
         4     ColA     1.0
               ColB     1.0
dtype: float64

Or sticking with DataFrames, you can use pd.melt to "melt" your wide data into long, tidy form, e.g.,

In [26]: melted = pd.melt(dfs['sample1'], id_vars=['Time'], var_name='Measurement')

In [27]: melted
Out[27]:
   Time Measurement  value
0     1        ColA      0
1     2        ColA      0
2     3        ColA      0
3     1        ColB      0
4     2        ColB      0
5     3        ColB      0

Then converting from xarray is just a matter of setting a MultiIndex and calling .to_xarray():

In [28]: melted.set_index(['Time', 'Measurement']).to_xarray()
Out[28]:
<xarray.Dataset>
Dimensions:      (Measurement: 2, Time: 3)
Coordinates:
  * Time         (Time) int64 1 2 3
  * Measurement  (Measurement) object 'ColA' 'ColB'
Data variables:
    value        (Time, Measurement) int64 0 0 0 0 0 0
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.