Using xarray to merge pandas dataframes based on a column

Question

I have a dictionary of dataframes where each dictionary key corresponds to sample name and the dataframe itself has a "time" column and some measurement columns (temperature, concentration,etc). The time column is not consistent among samples (both start and finish times are different for different samples, although I think all time points between start and finish are measured/have same dT).

I want to merge all of the data into a single xarray where one axis is the time, the other axis is the measurement type, and the third axis is the sample name. Since not all times are measured for all samples the missing data should be replaced with nan.

I have little experience with xarray but doing a simple merger (construct xarray from dict of xarrays) I couldn't figure out how to make "time" one of the axis (instead each just concatenated all of the samples with time being one of the data columns).

Thank you for your help!

Edit:

Here is code I have with dummy data

import pandas as pd
import xarray as xr

#make fake data
dfs = {'sample1':pd.DataFrame([[1,0,0],[2,0,0],[3,0,0]],columns = ["Time","ColA","ColB"]), 
    'sample2':pd.DataFrame([[2,1,1],[3,1,1],[4,1,1]],columns = ["Time","ColA","ColB"])}

#code I use for real data
xrs = {k: xr.DataArray(v) for k, v in dfs.items()}
merged = xr.Dataset(variables).to_array(dim="samples")

print(merged)

Output is:

<xarray.DataArray (samples: 2, dim_0: 3, dim_1: 3)>
array([[[1, 0, 0],
        [2, 0, 0],
        [3, 0, 0]],

       [[2, 1, 1],
        [3, 1, 1],
        [4, 1, 1]]], dtype=int64)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2
  * dim_1    (dim_1) object 'Time' 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'

Desired output:

<xarray.DataArray (samples: 2, Time: 4, dim_1: 2)>
array([[[0, 0],
    [0, 0],
    [0, 0],
    [nan, nan]],

   [[nan, nan]
    [1, 1],
    [1, 1],
    [1, 1]]], dtype=int64)
Coordinates:
  * Time     (Time) int64 1 2 3 4
  * dim_1    (dim_1) object 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'

You know in this busy world data science people dont have patience to go through a big explanation without sample data. Better add a sample input dataframe and expected output — Bharath M Shetty
– Bharath M Shetty, Commented Oct 21, 2017 at 16:56

shoyer · Accepted Answer · 2017-10-21 18:36:51Z

Xarray supports two basic formats for converting from pandas:

DataFrames representing an 2D matrix that you want to convert into a xarray.DataArray, where the index and columns already correctly label the axes.
DataFrames representing flattened "tidy data", where MultiIndex levels in the index correspond to the desired axes on the result and columns correspond to desired variables in an xarray.Dataset.

Your format is closest to the former, so let's try that. It turns out all you were missing is a call to .set_index('Time') to indicate that that column should be used for labeling the axis:

In [23]: da = xr.Dataset({k: v.set_index('Time') for k, v in dfs.items()}).to_array(dim='samples')

In [24]: da
Out[24]:
<xarray.DataArray (samples: 2, Time: 4, dim_1: 2)>
array([[[  0.,   0.],
        [  0.,   0.],
        [  0.,   0.],
        [ nan,  nan]],

       [[ nan,  nan],
        [  1.,   1.],
        [  1.,   1.],
        [  1.,   1.]]])
Coordinates:
  * Time     (Time) int64 1 2 3 4
  * dim_1    (dim_1) object 'ColA' 'ColB'
  * samples  (samples) <U7 'sample1' 'sample2'

The second approach is also worth thinking about, because making your data tidy will also make it easier to work with as a DataFrame (e.g., to easily plot your data in Seaborn).

Once you have the xarray.DataArray, you can convert it back into pandas using to_series():

In [22]: da.to_series()
Out[22]:
samples  Time  dim_1
sample1  1     ColA     0.0
               ColB     0.0
         2     ColA     0.0
               ColB     0.0
         3     ColA     0.0
               ColB     0.0
         4     ColA     NaN
               ColB     NaN
sample2  1     ColA     NaN
               ColB     NaN
         2     ColA     1.0
               ColB     1.0
         3     ColA     1.0
               ColB     1.0
         4     ColA     1.0
               ColB     1.0
dtype: float64

Or sticking with DataFrames, you can use pd.melt to "melt" your wide data into long, tidy form, e.g.,

In [26]: melted = pd.melt(dfs['sample1'], id_vars=['Time'], var_name='Measurement')

In [27]: melted
Out[27]:
   Time Measurement  value
0     1        ColA      0
1     2        ColA      0
2     3        ColA      0
3     1        ColB      0
4     2        ColB      0
5     3        ColB      0

Then converting from xarray is just a matter of setting a MultiIndex and calling .to_xarray():

In [28]: melted.set_index(['Time', 'Measurement']).to_xarray()
Out[28]:
<xarray.Dataset>
Dimensions:      (Measurement: 2, Time: 3)
Coordinates:
  * Time         (Time) int64 1 2 3
  * Measurement  (Measurement) object 'ColA' 'ColB'
Data variables:
    value        (Time, Measurement) int64 0 0 0 0 0 0

Collectives™ on Stack Overflow

Using xarray to merge pandas dataframes based on a column

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related