19

I'm hoping to use pandas as the main Trace (series of points in parameter space from MCMC) object.

I have a list of dicts of string->array which I would like to store in pandas. The keys in the dicts are always the same, and for each key the shape of the numpy array is always the same, but the shape may be different for different keys and could have a different number of dimensions.

I had been using self.append(dict_list, ignore_index = True) which seems to work well for 1d values, but for nd>1 values pandas stores the values as objects which doesn't allow for nice plotting and other nice things. Any suggestions on how to get better behavior?

Sample data

point = {'x': array(-0.47652306228698005),
         'y': array([[-0.41809043],
                     [ 0.48407823]])}

points = 10 * [ point]

I'd like to be able to do something like

df = DataFrame(points)

or

df = DataFrame()
df.append(points, ignore_index=True)

and have

>> df['x'][1].shape
()
>> df['y'][1].shape 
(2,1)
5
  • 1
    Have you looked into panel datastructure? Not sure it helps with your use case though... Commented Apr 4, 2013 at 9:10
  • 1
    Can we have sample data for your problem? Commented Apr 12, 2013 at 13:20
  • Certainly, I've added some above. Does that help? Or would you like something more? Commented Apr 14, 2013 at 9:24
  • Try MultiIndex: stackoverflow.com/a/37742328/911945 Commented May 23, 2017 at 15:08
  • pandas.Panel is now deprecated and it is recommended to use either pandas.MultiIndex or the xarray package (formerly xray) Commented Jan 30, 2021 at 9:31

3 Answers 3

12

The relatively-new library xray[1] has Dataset and DataArray structures that do exactly what you ask.

Here it is my take on your problem, written as an IPython session:

>>> import numpy as np
>>> import xray

>>> ## Prepare data:
>>> #
>>> point = {'x': np.array(-0.47652306228698005),
...          'y': np.array([[-0.41809043],
...                      [ 0.48407823]])}
>>> points = 10 * [point]

>>> ## Convert to Xray DataArrays:
>>> #
>>> list_x = [p['x'] for p in points]
>>> list_y = [p['y'] for p in points]
>>> da_x = xray.DataArray(list_x, [('x', range(len(list_x)))])
>>> da_y = xray.DataArray(list_y, [
...     ('x', range(len(list_y))),
...     ('y0', range(2)), 
...     ('y1', [0]), 
... ])

These are the two DataArray instances we built so far:

>>> print(da_x)
<xray.DataArray (x: 10)>
array([-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306,
       -0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9


>>> print(da_y.T) ## Transposed, to save lines.
<xray.DataArray (y1: 1, y0: 2, x: 10)>
array([[[-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043,
         -0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043],
        [ 0.48407823,  0.48407823,  0.48407823,  0.48407823,  0.48407823,
          0.48407823,  0.48407823,  0.48407823,  0.48407823,  0.48407823]]])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y0       (y0) int32 0 1
  * y1       (y1) int32 0

We can now merge these two DataArray on their common x dimension into a DataSet:

>>> ds = xray.Dataset({'X':da_x, 'Y':da_y})
>>> print(ds)
<xray.Dataset>
Dimensions:  (x: 10, y0: 2, y1: 1)
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y0       (y0) int32 0 1
  * y1       (y1) int32 0
Data variables:
    X        (x) float64 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 ...
    Y        (x, y0, y1) float64 -0.4181 0.4841 -0.4181 0.4841 -0.4181 0.4841 -0.4181 ...

And we can finally access and aggregate data the way you wanted:

>>> ds['X'].sum()
<xray.DataArray 'X' ()>
array(-4.765230622869801)


>>> ds['Y'].sum()
<xray.DataArray 'Y' ()>
array(0.659878)


>>> ds['Y'].sum(axis=1)
<xray.DataArray 'Y' (x: 10, y1: 1)>
array([[ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878]])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y1       (y1) int32 0

>>> np.all(ds['Y'].sum(axis=1) == ds['Y'].sum(dim='y0'))
True

>>>> ds['X'].sum(dim='y0')
Traceback (most recent call last):
ValueError: 'y0' not found in array dimensions ('x',)

[1] A library for handling N-dimensional data with labels, like pandas does for 2D: http://xray.readthedocs.org/en/stable/data-structures.html#dataset

Sign up to request clarification or add additional context in comments.

Comments

11

Combining @Eike's answer and @JohnSalvatier's comment seems pretty Pandasonic:

>>> import pandas as pd
>>> np = pandas.np
>>> point = {'x': np.array(-0.47652306228698005),
...          'y': np.array([[-0.41809043],
...                         [ 0.48407823]])}
>>> points = 10 * [point]  # this creates a list of 10 point dicts
>>> df = pd.DataFrame().append(points)
>>> df.x
# 0    -0.476523062287
#   ...
# 9    -0.476523062287
# Name: x, dtype: object
>>> df.y
# 0    [[-0.41809043], [0.48407823]]
#   ...
# 9    [[-0.41809043], [0.48407823]]
# Name: y, dtype: object
>>> df.y[0]
# array([[-0.41809043],
#        [ 0.48407823]])
>>> df.y[0].shape
# (2, 1)

To plot (and do all the other cool 2-D Pandas things) you still have to manually convert the column of arrays back to a DataFrame:

>>> dfy = pd.DataFrame([row.T[0] for row in df2.y])
>>> dfy += np.matrix([[0] * 10, range(10)]).T
>>> dfy *= np.matrix([range(10), range(10)]).T
>>> dfy.plot()

example 2-D plot

To store this on disk, use to_pickle:

>>> df.to_pickle('/tmp/sotest.pickle')
>>> df2 = pd.read_pickle('/tmp/sotest.pickle')
>>> df.y[0].shape
# (2, 1)

If you use to_csv your np.arrays become strings:

>>> df.to_csv('/tmp/sotest.csv')
>>> df2 = pd.DataFrame.from_csv('/tmp/sotest.csv')
>>> df2.y[0]
# '[[-0.41809043]\n [ 0.48407823]]'

Comments

5

It goes a bit against Pandas' philosophy, which seems to see Series as a one-dimensional data structure. Therefore you have to create the Series by hand, tell them that they have data type "object". This means don't apply any automatic data conversions.

You can do it like this (reordered Ipython session):

In [9]: import pandas as pd

In [1]: point = {'x': array(-0.47652306228698005),
   ...:          'y': array([[-0.41809043],
   ...:                      [ 0.48407823]])}

In [2]: points = 10 * [ point]

In [5]: lx = [p["x"] for p in points]

In [7]: ly = [p["y"] for p in points]

In [40]: sx = pd.Series(lx, dtype=numpy.dtype("object"))

In [38]: sy = pd.Series(ly, dtype=numpy.dtype("object"))

In [43]: df = pd.DataFrame({"x":sx, "y":sy})

In [45]: df['x'][1].shape
Out[45]: ()

In [46]: df['y'][1].shape
Out[46]: (2, 1)

1 Comment

It's good to know that this is unpandas. I think the df.append(points) method will do basically this.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.