I have a data frame (df) with two columns (date, text) that is read from an Excel spreadsheet into Python/Pandas.
import pandas as pd
xl = pd.ExcelFile(dir+"file.xlsx")
df = xl.parse(xl.sheet_names[0])  # parse the first sheet
         date               text
0  2013-08-06                NaN
1  2013-08-06  Text with unicode
2  ...
The text contains unwanted unicode characters, which I normally strip out using:
df['text'] = df['text'].apply(lambda sentence: ''.join(word for word in sentence if ord(word) < 128))
However, since the text in the first row is NaN, it appears that Pandas is typing the column as float, and the above command fails because it only operates on strings. I can't find a way to cast the column to string either, since it contains unicode characters:
df['text'] = df['text'].astype(str)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-12: ordinal not in range(128)
It feels like a chicken-and-egg dilemma.
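For reference, one way around the problem might be to skip non-string values inside the function passed to apply, so the NaN is never iterated over. This is only a minimal sketch, assuming Python 3 (where text is str; on Python 2 the check would need basestring instead). The sample frame and the helper name strip_non_ascii are made up for illustration and are not from the spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data read from the spreadsheet
df = pd.DataFrame({
    "date": ["2013-08-06", "2013-08-06"],
    "text": [np.nan, "Text with unicode \u2013 caf\u00e9"],
})

def strip_non_ascii(value):
    """Drop characters outside the ASCII range; pass non-strings through."""
    # NaN is a float, so it is returned unchanged instead of being iterated
    if not isinstance(value, str):
        return value
    return "".join(ch for ch in value if ord(ch) < 128)

df["text"] = df["text"].apply(strip_non_ascii)
```

The missing value stays NaN, so it can still be dropped or filled afterwards with dropna or fillna if that is what is wanted.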
dropna or you want to replace the NaN with some value?
object which is a string I don't understand how you can have a dtype like that. Still after dropping the NaNs you should be able to cast it using astype(float)