I am using pyodbc to query a large SQL Server database. Because SQL Server cannot store native numpy arrays, the data is kept as plain strings of space-separated values. When I load this data into a DataFrame, it looks as follows:
Id      Array1                           Array2
1       -21.722315 11.017685 -23.340452  2754.642 481.94247 21.728323
...
149001  1.611342 1.526262 -35.415166     6252.124 61.51516 852.15167
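For reference, this is roughly how I load the data (a minimal sketch; the connection string, query, and table name below are placeholders, not the real ones):

import pyodbc
import pandas as pd

# Placeholder connection details and table name
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;')

# Array1 and Array2 come back as plain space-separated strings
df = pd.read_sql('SELECT Id, Array1, Array2 FROM MyTable', conn)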
However, I then want to perform operations on Array1 and Array2, so I need to convert them into actual numpy arrays. My current approach is to apply np.fromstring to each column:
import numpy as np

df['Array1'] = df['Array1'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep=' '))
df['Array2'] = df['Array2'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep=' '))
# Elapsed Time: 9.524s
Result:
Id      Array1                                Array2
1       [-21.722315, 11.017685, -23.340452]   [2754.642, 481.94247, 21.728323]
...
149001  [1.611342, 1.526262, -35.415166]      [6252.124, 61.51516, 852.15167]
While this code works, I don't believe it is efficient or scalable. Is there a more computationally efficient way to convert this large amount of string data into numpy arrays?