
I have Avro data with the following keys: 'id', 'label', and 'features'. 'id' and 'label' are strings, while 'features' is a buffer of floats.

from functools import partial

import dask.bag as db
import numpy as np

avros = db.read_avro('data.avro')
df = avros.to_dataframe()
# decode each row's raw byte buffer into a numpy array of floats
convert = partial(np.frombuffer, dtype='float64')
X = df.assign(features=lambda x: x.features.apply(convert, meta='float64'))

This eventually gives me the following MCVE:

  label id         features
0  good  a  [1.0, 0.0, 0.0]
1   bad  b  [1.0, 0.0, 0.0]
2  good  c  [0.0, 0.0, 0.0]
3   bad  d  [1.0, 0.0, 1.0]
4  good  e  [0.0, 0.0, 0.0]

My desired output would be:

  label id   f1   f2   f3
0  good  a  1.0  0.0  0.0
1   bad  b  1.0  0.0  0.0
2  good  c  0.0  0.0  0.0
3   bad  d  1.0  0.0  1.0
4  good  e  0.0  0.0  0.0

I tried some pandas-style approaches; for example, df[['f1','f2','f3']] = df.features.apply(pd.Series) does not work in dask the way it does in pandas.

I can traverse with a loop like

for i in range(len(features)):
    # bind i at definition time: dask evaluates lazily, so a bare
    # closure over i would use its final value for every column
    df[f'f{i}'] = df.features.map(lambda x, i=i: x[i])

but in the real use-case I have thousands of features, and this traverses the dataset thousands of times.

What would be the best way to achieve the desired outcome?

  • Possible duplicate of Dask Dataframe split column of list into multiple columns – Commented Oct 15, 2019 at 12:45
  • In the suggested solution, it seems the series is parsed once per feature. That is not too bad in the MCVE, but in the real world I have thousands of features, which sounds computationally expensive. – Commented Oct 15, 2019 at 13:15
  • Actually, there is a newer answer on the topic: link – Commented Oct 15, 2019 at 13:21
  • That is close, but it works on strings. My object is already a list. – Commented Oct 15, 2019 at 14:46

1 Answer

In [68]: import string
    ...: import numpy as np
    ...: import pandas as pd

In [69]: M, N = 100, 100
    ...: labels = np.random.choice(['good', 'bad'], size=M)
    ...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
    ...: features = np.empty((M,), dtype=object)
    ...: features[:] = list(map(list, np.random.randn(M, N)))
    ...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
    ...: df1 = df.copy()

In [70]: %%time
    ...: columns = [f"f{i:04d}" for i in range(N)]
    ...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 13.9 ms

In [71]: M, N = 1000, 1000
    ...: labels = np.random.choice(['good', 'bad'], size=M)
    ...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
    ...: features = np.empty((M,), dtype=object)
    ...: features[:] = list(map(list, np.random.randn(M, N)))
    ...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
    ...: df1 = df.copy()

In [72]: %%time
    ...: columns = [f"f{i:04d}" for i in range(N)]
    ...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 627 ms

In [73]: df1.shape
Out[73]: (1000, 1002)

Edit: The approach above is about 2x faster than the original column-by-column loop:

In [79]: df2 = df.copy()

In [80]: %%time
    ...: features = df2.pop('features')
    ...: for i in range(N):
    ...:     df2[f'f{i:04d}'] = features.map(lambda x: x[i])
    ...:     
Wall time: 1.46 s

In [81]: df1.equals(df2)
Out[81]: True

Edit: A faster way of constructing the DataFrame gives an 8x improvement over the original:

In [22]: df1 = df.copy()

In [23]: %%time
    ...: features = pd.DataFrame({f"f{i:04d}": np.asarray(row) for i, row in enumerate(df1.pop('features').to_numpy())})
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 165 ms
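Since the original question is about dask rather than plain pandas, here is a hedged adaptation: the single-pass expansion above can run once per partition via `map_partitions`. The sketch below shows only the per-partition pandas function (the name `expand_features` and the fixed feature width `N` are illustrative assumptions, not from the original answer) and demonstrates it on a small pandas frame matching the question's MCVE.

```python
import numpy as np
import pandas as pd

N = 3  # assumed fixed feature width, known up front

def expand_features(part: pd.DataFrame) -> pd.DataFrame:
    """Expand the list-valued 'features' column into f0..f{N-1} in one pass."""
    # stack the per-row lists into a single (rows, N) float matrix
    mat = np.stack(part['features'].to_numpy())
    cols = pd.DataFrame(mat, index=part.index,
                        columns=[f"f{i}" for i in range(N)])
    # drop() returns a copy, so the input partition is not mutated
    return pd.concat([part.drop(columns='features'), cols], axis=1)

pdf = pd.DataFrame({
    'label': ['good', 'bad'],
    'id': ['a', 'b'],
    'features': [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0]],
})
out = expand_features(pdf)
print(list(out.columns))  # ['label', 'id', 'f0', 'f1', 'f2']
```

With dask, the same function would be applied once per partition, e.g. `ddf.map_partitions(expand_features)` with an explicit `meta` (buildable by running `expand_features` on an empty-but-typed sample); that dask wiring is an untested assumption here, but it avoids traversing the dataset once per feature.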
