I am using pyodbc to query a large SQL Server database. Because SQL Server cannot store native numpy arrays, the data is kept as plain strings of space-separated values. When I load this data into a DataFrame, it looks as follows:
Id      Array1                           Array2
1       -21.722315 11.017685 -23.340452  2754.642 481.94247 21.728323
...
149001  1.611342 1.526262 -35.415166     6252.124 61.51516 852.15167
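For reference, this is roughly how I load the data (a minimal sketch; the connection string, query, and table name below are placeholders, not the real ones):

import pyodbc
import pandas as pd

# Placeholder connection details and table name
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;')

# Array1 and Array2 come back as plain space-separated strings
df = pd.read_sql('SELECT Id, Array1, Array2 FROM MyTable', conn)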
However, I then want to perform operations on Array1 and Array2, so I need to convert them into actual numpy arrays. My current approach is to apply np.fromstring to each column:
import numpy as np

df['Array1'] = df['Array1'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep=' '))
df['Array2'] = df['Array2'].apply(lambda x: np.fromstring(x, dtype=np.float32, sep=' '))
# Elapsed Time: 9.524s
Result:
Id      Array1                                Array2
1       [-21.722315, 11.017685, -23.340452]   [2754.642, 481.94247, 21.728323]
...
149001  [1.611342, 1.526262, -35.415166]      [6252.124, 61.51516, 852.15167]
While this code works, I don't believe it is efficient or scalable. Is there a more computationally efficient way to convert this large amount of string data into numpy arrays?