
I have a very large dataframe (thousands × thousands); only a 5 × 3 slice is shown here. time is the index:

                           col1     col2     col3
time
05/04/2018 05:14:52 AM   +unend   +unend        0
05/04/2018 05:14:57 AM        0        0        0
05/04/2018 05:15:02 AM   30.691    0.000    0.121
05/04/2018 05:15:07 AM   30.691  n. def.    0.108
05/04/2018 05:15:12 AM   30.715    0.000    0.105

As these values come from another device (the dataframe is produced by pd.read_csv(filename)), the dataframe, instead of being entirely float, ends up containing unwanted strings like +unend and n. def.. These are not the classical +infinity or NaN that df.fillna() could take care of. I would like to replace these strings with 0.0. I saw the answers Pandas replace type issue and replace string in pandas dataframe, which attempt the same thing but work column- or row-wise rather than elementwise. However, the comments there had some good hints for the general case as well.

If I try

mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] = 0.0

I get error: nothing to repeat (the unescaped + is interpreted as a regex quantifier with nothing before it).

If I do

mask = df.apply(lambda x: ('n. def.' in str(x)) or ('unend' in str(x)))
df[mask] = 0.0

I get a Series with one True/False per column (apply passes each whole column to the lambda, and str(x) stringifies the entire column) rather than an elementwise mask, and therefore the error TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.

The following

mask = df.applymap(lambda x: ('n. def.' in str(x)) or ('unend' in str(x)))
df[mask.values] = 0.0

does give me the intended result, replacing all the unwanted strings with 0.0. However, it is slow (unpythonic?), and I am also not sure whether I can use a regex for the check rather than in, especially when I know there are mixed datatypes. Is there an efficient, fast, robust, but also elementwise general way to do this?

Comment: In this case why not just convert it to numeric? df.apply(lambda x: pd.to_numeric(x, errors='coerce')) (May 25, 2018 at 8:05)

3 Answers


These are not the classical +infinity or NaN that df.fillna() could take care of

You can specify a list of strings to consider as NA when reading the csv file.

df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])

And then fill the NA values with fillna.
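For example (continuing the snippet above, with 0.0 as the fill value the question asks for):

# the sentinel strings were parsed as NaN, so one fillna converts them all
df = df.fillna(0.0)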


5 Comments

This should be the accepted answer, if the na_values are well-defined strings. Reading the data in correctly avoids the expense of replacing/converting later.
Yes, this is actually a great answer ... and so much faster.
@jpp - it depends; if the values appear only as substrings, it is not possible to use this :( (see the sketch after these comments)
@Vipluv - but if they are whole strings, it is better.
@jezrael I agree with your point. For very general use (i.e. converting the dataframe once it is already there), if you already have a dataframe and don't have control over how it was read in, the accepted answer lets you do that.
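To illustrate @jezrael's caveat (a hypothetical sketch: na_values matches whole field values, so a sentinel that appears only as part of a larger string is not caught):

import io
import pandas as pd

# hypothetical sample: one exact sentinel field and one field
# where the sentinel is only a substring
csv = io.StringIO("col1\n+unend\n+unend W\n30.691\n")
df = pd.read_csv(csv, na_values=['+unend'])
print(df)
#        col1
# 0       NaN       <- exact field matched and converted to NaN
# 1  +unend W       <- substring occurrence NOT matched
# 2    30.691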

As pointed out by Edchum, if you need to replace all non-numeric values with 0: first, to_numeric with errors='coerce' creates NaNs for unparseable values, and then fillna converts them to 0:

df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)

If the values are exact matches rather than substrings, use DataFrame.isin, or see the very nice answer by Haleemur Ali:

df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)

For substrings, match with str.contains. + and . are special regex characters, so they need to be escaped with \:

df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'\+unend|n\. def\.')), 0).astype(float)

Or use applymap for an elementwise check:

df = df.mask(df.applymap(lambda x: ('n. def.' in str(x)) or ('unend' in str(x))), 0).astype(float)

print(df)
                          col1  col2   col3
time                                       
05/04/2018 05:14:52 AM   0.000   0.0  0.000
05/04/2018 05:14:57 AM   0.000   0.0  0.000
05/04/2018 05:15:02 AM  30.691   0.0  0.121
05/04/2018 05:15:07 AM  30.691   0.0  0.108
05/04/2018 05:15:12 AM  30.715   0.0  0.105



Do not use pd.Series.str.contains or pd.Series.isin

A more efficient solution to this problem is to use pd.to_numeric to try and convert all the data to numeric.

Use errors='coerce' to default to NaN, which you can then use with pd.Series.fillna.

cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce').fillna(0)

