Following the suggestions I got on my previous question here, I'm converting a Pandas DataFrame to a numeric NumPy array. To do this I'm using numpy.asarray.

My data frame:

DataFrame
----------
       label                                             vector
0         0   1:0.0033524514 2:-0.021896651 3:0.05087798 4:...
1         0   1:0.02134219 2:-0.007388343 3:0.06835007 4:0....
2         0   1:0.030515702 2:-0.0037591448 3:0.066626 4:0....
3         0   1:0.0069114454 2:-0.0149497045 3:0.020777626 ...
4         1   1:0.003118149 2:-0.015105667 3:0.040879637 4:...
...     ...                                                ...
19779     0   1:0.0042634667 2:-0.0044222944 3:-0.012995412...
19780     1   1:0.013818732 2:-0.010984628 3:0.060777966 4:...
19781     0   1:0.00019213723 2:-0.010443398 3:0.01679976 4...
19782     0   1:0.010373874 2:0.0043582567 3:-0.0078354385 ...
19783     1   1:0.0016790542 2:-0.028346825 3:0.03908631 4:...

[19784 rows x 2 columns]

DataFrame datatypes :
 label     object
vector    object
dtype: object

To convert it into a NumPy array I'm using this script:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import matplotlib.pyplot as plt

r_filenameTSV = 'TSV/A19784.tsv'

tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame(tsv_read)

df = pd.DataFrame(df.vector.str.split(' ', n=1).tolist(),
                                   columns = ['label','vector'])


print('DataFrame\n----------\n', df)
print('\nDataFrame datatypes :\n', df.dtypes)

arr = np.asarray(df, dtype=np.float64)

print('\nNumpy Array\n----------\n', arr)
print('\nNumpy Array Datatype :', arr.dtype)

I get this error at line 22, arr = np.asarray(df, dtype=np.float64):

ValueError: could not convert string to float: ' 1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 5:-0.013740167 6:-0.0014883851 7:0.02230502 8:0.0053563705 9:0.00465044 10:-0.0030826542 11:0.010156203 12:-0.021754289 13:-0.03744049 14:0.011198468 15:-0.021201309 16:-0.0006497681 17:0.009229079 18:0.04218278 19:0.020572046 20:0.0021593391 ...

How can I solve this issue?

Regards and thanks for your time

2 Answers

1

Use a list comprehension that builds one dictionary per row (splitting each index:value pair on ':'), and pass the result to the DataFrame constructor:

df = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print (df)
              1              2            3    4
0  0.0033524514   -0.021896651   0.05087798    0
1    0.02134219   -0.007388343   0.06835007    0
2   0.030515702  -0.0037591448     0.066626    0
3  0.0069114454  -0.0149497045  0.020777626    0
4   0.003118149   -0.015105667  0.040879637  0.4

Then convert the values to floats and the DataFrame to a NumPy array:

print (df.astype(float).to_numpy())
[[ 0.00335245 -0.02189665  0.05087798  0.        ]
 [ 0.02134219 -0.00738834  0.06835007  0.        ]
 [ 0.0305157  -0.00375914  0.066626    0.        ]
 [ 0.00691145 -0.0149497   0.02077763  0.        ]
 [ 0.00311815 -0.01510567  0.04087964  0.4       ]]
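
The snippet above assumes every token in a row is an index:value pair. In the question's file each line appears to start with a label before the first space, so a variant that splits off the label first might look like the sketch below. The helper name parse_rows and the two sample lines are made up for illustration, and filling missing indices with 0.0 is an assumption about the format, not something the question states.

```python
import numpy as np
import pandas as pd

def parse_rows(lines):
    """Split each 'label idx:val idx:val ...' line into a label and a feature dict."""
    labels, features = [], []
    for line in lines:
        # Everything before the first space is the label, the rest is the vector.
        label, _, rest = line.strip().partition(' ')
        labels.append(int(label))
        features.append({int(k): float(v)
                         for k, v in (pair.split(':') for pair in rest.split())})
    # Indices missing from a row become NaN; they are filled with 0.0 here (assumption).
    X = (pd.DataFrame(features)
           .sort_index(axis=1)
           .fillna(0.0)
           .to_numpy(dtype=np.float64))
    y = np.asarray(labels, dtype=np.int64)
    return X, y

# Hypothetical sample lines shaped like the rows shown in the question.
sample = [
    "0 1:0.0033524514 2:-0.021896651 3:0.05087798",
    "1 1:0.003118149 2:-0.015105667 3:0.040879637",
]
X, y = parse_rows(sample)
print(X.shape, X.dtype)
print(y)
```

For the real file, the lines could be read with open(r_filenameTSV) and passed to the helper in the same way.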

0

It seems at least one of your columns holds strings rather than numbers. Either remove that column or encode it numerically before converting the DataFrame to an array.
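
One way to see which columns cannot be converted as they are is to run pd.to_numeric over each column and look for failures. This is only a rough sketch on a made-up two-row frame shaped like the one in the question, not the asker's actual data:

```python
import pandas as pd

# Hypothetical miniature of the question's DataFrame (both columns are strings).
df = pd.DataFrame({
    "label": ["0", "1"],
    "vector": ["1:0.0033 2:-0.0218", "1:0.0031 2:-0.0151"],
})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
converted = df.apply(pd.to_numeric, errors="coerce")
bad_columns = converted.columns[converted.isna().any()].tolist()
print(bad_columns)

# A column that parses cleanly can be converted directly.
labels = pd.to_numeric(df["label"]).to_numpy()
print(labels)
```

Here the vector column is flagged because each cell holds many index:value pairs in one string, so it has to be parsed into separate numeric columns before any conversion to a float array can succeed.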
