
I have a 33620x160 pandas DataFrame in which one column contains lists of numbers. Each list contains 30 elements.

df['dlrs_col']

0        [0.048142470608688, 0.047021138711858, 0.04573...
1        [0.048142470608688, 0.047021138711858, 0.04573...
2        [0.048142470608688, 0.047021138711858, 0.04573...
3        [0.048142470608688, 0.047021138711858, 0.04573...
4        [0.048142470608688, 0.047021138711858, 0.04573...
5        [0.048142470608688, 0.047021138711858, 0.04573...
6        [0.048142470608688, 0.047021138711858, 0.04573...
7        [0.048142470608688, 0.047021138711858, 0.04573...
8        [0.048142470608688, 0.047021138711858, 0.04573...
9        [0.048142470608688, 0.047021138711858, 0.04573...
10       [0.048142470608688, 0.047021138711858, 0.04573...

I'm creating a 33620x30 array whose entries are the unpacked values from that single DataFrame column. I'm currently doing it like this:

np.array(df['dlrs_col'].tolist(), dtype = 'float64')

This works, but it takes a significant amount of time, especially since I do a similar conversion for 6 additional columns of lists. Any ideas on how I can speed this up?

2 Answers


You can do it this way, expanding each list into its own columns with apply(pd.Series) and then taking .values:

In [140]: df
Out[140]:
                                          dlrs_col
0  [0.048142470608688, 0.047021138711858, 0.04573]
1  [0.048142470608688, 0.047021138711858, 0.04573]
2  [0.048142470608688, 0.047021138711858, 0.04573]
3  [0.048142470608688, 0.047021138711858, 0.04573]
4  [0.048142470608688, 0.047021138711858, 0.04573]
5  [0.048142470608688, 0.047021138711858, 0.04573]
6  [0.048142470608688, 0.047021138711858, 0.04573]
7  [0.048142470608688, 0.047021138711858, 0.04573]
8  [0.048142470608688, 0.047021138711858, 0.04573]
9  [0.048142470608688, 0.047021138711858, 0.04573]

In [141]: df.dlrs_col.apply(pd.Series)
Out[141]:
          0         1        2
0  0.048142  0.047021  0.04573
1  0.048142  0.047021  0.04573
2  0.048142  0.047021  0.04573
3  0.048142  0.047021  0.04573
4  0.048142  0.047021  0.04573
5  0.048142  0.047021  0.04573
6  0.048142  0.047021  0.04573
7  0.048142  0.047021  0.04573
8  0.048142  0.047021  0.04573
9  0.048142  0.047021  0.04573

In [142]: df.dlrs_col.apply(pd.Series).values
Out[142]:
array([[ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ],
       [ 0.04814247,  0.04702114,  0.04573   ]])

1 Comment

I appreciate the response, but in my quick testing this actually took nearly twice as long as my previous method.

You can first convert the column to a NumPy array with .values and then call tolist() on that array:

import numpy as np
import pandas as pd

df = pd.DataFrame({'dlrs_col':[
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573],
[0.048142470608688, 0.047021138711858, 0.04573]]})

print (df)
                                          dlrs_col
0  [0.048142470608688, 0.047021138711858, 0.04573]
1  [0.048142470608688, 0.047021138711858, 0.04573]
2  [0.048142470608688, 0.047021138711858, 0.04573]
3  [0.048142470608688, 0.047021138711858, 0.04573]
4  [0.048142470608688, 0.047021138711858, 0.04573]
5  [0.048142470608688, 0.047021138711858, 0.04573]
6  [0.048142470608688, 0.047021138711858, 0.04573]
7  [0.048142470608688, 0.047021138711858, 0.04573]

print (np.array(df['dlrs_col'].values.tolist(), dtype = 'float64'))
[[ 0.04814247  0.04702114  0.04573   ]
 [ 0.04814247  0.04702114  0.04573   ]
 [ 0.04814247  0.04702114  0.04573   ]
 [ 0.04814247  0.04702114  0.04573   ]
 [ 0.04814247  0.04702114  0.04573   ]
 [ 0.04814247  0.04702114  0.04573   ]
 [ 0.04814247  0.04702114  0.04573   ]
 [ 0.04814247  0.04702114  0.04573   ]]
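Since the question mentions six more list columns, the same conversion can be wrapped in a small helper and applied per column. This is only a sketch: the helper name list_columns_to_arrays, the frame df_demo, and the column other_col are made up for illustration, and it assumes every list in a column has the same length.

```python
import numpy as np
import pandas as pd


def list_columns_to_arrays(frame, columns, dtype="float64"):
    """Return {column name: 2-D array} for columns whose cells are equal-length lists."""
    # Same approach as above: go through .values before calling tolist().
    return {col: np.array(frame[col].values.tolist(), dtype=dtype) for col in columns}


# Hypothetical frame with two list columns (the names are placeholders).
rows = [[0.048142470608688, 0.047021138711858, 0.04573]] * 8
df_demo = pd.DataFrame({"dlrs_col": rows, "other_col": rows})

arrays = list_columns_to_arrays(df_demo, ["dlrs_col", "other_col"])
print(arrays["dlrs_col"].shape)
```

Each resulting array has one row per DataFrame row and one column per list element.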

Timings:

In [56]: %timeit (np.array(df['dlrs_col'].values.tolist(), dtype = 'float64'))
The slowest run took 9.76 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.1 µs per loop

In [57]: %timeit (np.array(df['dlrs_col'].tolist(), dtype = 'float64'))
The slowest run took 9.33 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 28.4 µs per loop
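Note that these timings were taken on the 8-row example frame, so they may not reflect behaviour at the size in the question. Below is a rough sketch for repeating the comparison on a 33620x30 column of random data. The frame big and the function names are assumptions for illustration, and the np.fromiter variant is an extra approach not proposed in the answers above; whether it is actually faster depends on your data and library versions, so measure before relying on it.

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 33620 rows, each cell a list of 30 floats.
n_rows, n_cols = 33620, 30
rng = np.random.default_rng(0)
big = pd.DataFrame({"dlrs_col": rng.random((n_rows, n_cols)).tolist()})


def via_series_tolist():
    # Method from the question.
    return np.array(big["dlrs_col"].tolist(), dtype="float64")


def via_values_tolist():
    # Method from this answer.
    return np.array(big["dlrs_col"].values.tolist(), dtype="float64")


def via_fromiter():
    # Alternative: flatten all lists into one stream, then reshape.
    flat = chain.from_iterable(big["dlrs_col"].values)
    out = np.fromiter(flat, dtype="float64", count=n_rows * n_cols)
    return out.reshape(n_rows, n_cols)


for fn in (via_series_tolist, via_values_tolist, via_fromiter):
    best = min(timeit.repeat(fn, number=1, repeat=5))
    print(f"{fn.__name__}: {best * 1e3:.1f} ms")
```

All three functions should return the same 33620x30 float64 array, so only the timings differ.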
