I have a data frame (df) with two columns (date, text) that is read from an Excel spreadsheet into Python/pandas.

xl = pd.ExcelFile(dir+"file.xlsx")
df = xl.parse(xl.sheet_names[0])

    date        text                
0   2013-08-06  NaN                 
1   2013-08-06  Text with unicode
2   ...

The text contains unwanted unicode characters which I normally strip out using

df['text'] = df['text'].apply(lambda sentence: ''.join(word for word in sentence if ord(word) < 128))

However, since the text in the first row is NaN, it appears that pandas is typing the column as float, and the above command fails because it only operates on strings. I can't find a way to cast the column to string, since it contains unicode characters:

df['text'] = df['text'].astype(str)   

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-12: ordinal not in range(128) 

It feels like a chicken-and-egg problem.

  • Can you include the line you use to read the spreadsheet? Commented Sep 4, 2014 at 13:56
  • Can you not just call dropna, or do you want to replace the NaN with some value? Commented Sep 4, 2014 at 13:58
  • @chrisaycock: I have added the line for reading the spreadsheet. Commented Sep 4, 2014 at 14:03
  • @EdChum: If I call dropna, I assume pandas still treats the column as float. I still can't convert it to string, since it contains unicode. Commented Sep 4, 2014 at 14:07
  • I don't get the same as you; mine is object, which is a string. I don't understand how you can have a dtype like that. Still, after dropping the NaNs you should be able to cast it using astype(float) Commented Sep 4, 2014 at 14:15

1 Answer

Your whole column is not typed as float; otherwise it wouldn't be able to hold strings at all. Only the NaN values are causing your method to throw an exception.
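
To make that concrete, here is a minimal sketch that checks the column dtype and the type of each cell. The sample data is a hypothetical stand-in for the spreadsheet, and `dtype=object` is set explicitly as an assumption, to mirror the mixed column described here (newer pandas versions may infer a dedicated string dtype instead).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet data in the question.
# dtype=object is forced so the column matches the mixed NaN/string case.
df = pd.DataFrame({
    "date": ["2013-08-06", "2013-08-06"],
    "text": pd.Series([np.nan, "Text with unicode \u2013 \u00e9"], dtype=object),
})

# The column as a whole uses the generic object dtype, not float.
column_dtype = df["text"].dtype

# Each cell keeps its own Python type: the NaN is a float, the text is a str.
cell_types = [type(value).__name__ for value in df["text"]]

# Applying the character filter to the NaN cell is what fails,
# because a float cannot be iterated over.
try:
    "".join(ch for ch in df["text"].iloc[0] if ord(ch) < 128)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
```

Inspecting `column_dtype`, `cell_types`, and `error_name` on your own data should show whether the NaN cells are the only floats in the column.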

So you have to deal with the NaNs. How would you want your code to convert them? To the string 'NaN'?

That rather defeats the purpose of NaN as a special value. If you don't want NaN values, you can use dropna. If you want some other value instead (or the string 'NaN'), you can use .fillna('NaN'). If you want to keep the NaNs for future use (which seems like the way to go to me), add a special case for them in your lambda, which will keep them as NaNs:

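from pandas import isnull
# Leave NaN cells untouched; strip non-ASCII characters from real strings.
strip_non_ascii = lambda sentence: sentence if isnull(sentence) else \
                          ''.join(char for char in sentence if ord(char) < 128)
df['text'] = df['text'].apply(strip_non_ascii)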
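
The three options above can be compared side by side. This is only a sketch under assumptions: the sample series and the helper name `strip_non_ascii` are made up for illustration, and `dtype=object` is set explicitly to reproduce a mixed NaN/string column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value and one string with non-ASCII characters.
text = pd.Series([np.nan, "caf\u00e9 \u2013 ok"], dtype=object)

def strip_non_ascii(sentence):
    """Return NaN unchanged; otherwise drop characters with code point >= 128."""
    if pd.isnull(sentence):
        return sentence
    return "".join(ch for ch in sentence if ord(ch) < 128)

# Option 1: keep the NaNs as missing values.
kept = text.apply(strip_non_ascii)

# Option 2: drop the rows with NaN before cleaning.
dropped = text.dropna().apply(strip_non_ascii)

# Option 3: replace NaN with a placeholder string, then clean.
filled = text.fillna("NaN").apply(strip_non_ascii)
```

Which variant fits depends on whether missing text should stay distinguishable from real text later on; only the first one preserves that distinction.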

2 Comments

As stated in the post, the text is currently typed as "float" and needs to be converted to type "string" first. However, I can't convert the text to string because of the unwanted unicode in it.
@slaw How about you post some real data in the question.
