
I have a very large dataframe (thousands × thousands); only a 5 × 3 slice is shown here. time is the index:

                           col1     col2     col3
time
05/04/2018 05:14:52 AM   +unend   +unend        0
05/04/2018 05:14:57 AM        0        0        0
05/04/2018 05:15:02 AM   30.691    0.000    0.121
05/04/2018 05:15:07 AM   30.691  n. def.    0.108
05/04/2018 05:15:12 AM   30.715    0.000    0.105

As these values come from another device (the dataframe is produced by pd.read_csv(filename)), the dataframe, instead of being entirely float, ends up containing unwanted strings like +unend and n. def.. These are not the classical +infinity or NaN that df.fillna() could take care of. I would like to replace these strings with 0.0. I saw the answers Pandas replace type issue and replace string in pandas dataframe, which attempt the same thing but work column- or row-wise rather than elementwise. However, the comments there had some good hints for the general case as well.

If I try

mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] = 0.0

I get error: nothing to repeat (the unescaped + is interpreted as a regex quantifier with nothing before it).

If I do

mask = df.apply(lambda x: ('n. def.' in str(x)) or ('unend' in str(x)))
df[mask] = 0.0

I get a Series with one True/False per column (apply passes each whole column to the lambda, and str(x) stringifies the entire column) rather than an elementwise mask, and therefore the error TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.

The following

mask = df.applymap(lambda x: ('n. def.' in str(x)) or ('unend' in str(x)))
df[mask.values] = 0.0

does give me the intended result, replacing all the unwanted strings with 0.0. However, it is slow (unpythonic?), and I am also not sure whether I can use a regex for the check rather than in, especially when I know there are mixed datatypes. Is there an efficient, fast, robust, but also elementwise general way to do this?

Comment: In this case why not just convert it to numeric? df.apply(lambda x: pd.to_numeric(x, errors='coerce')) (May 25, 2018 at 8:05)

3 Answers


These are not the classical +infinity or NaN that df.fillna() could take care of

You can specify a list of strings to consider as NA when reading the csv file.

df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])

And then fill the NA values with fillna.
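For example (continuing the snippet above, with 0.0 as the fill value the question asks for):

# the sentinel strings were parsed as NaN, so one fillna converts them all
df = df.fillna(0.0)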


5 Comments

This should be the accepted answer, if the na_values are well-defined strings. Reading the data in correctly avoids the expense of replacing/converting later.
Yes, this is actually a great answer ... and so much faster.
@jpp - it depends; if the values appear only as substrings, it is not possible to use this :( (see the sketch after these comments)
@Vipluv - but if they are whole strings, it is better.
@jezrael I agree with your point. For very general use (i.e. converting the dataframe once it is already there), if you already have a dataframe and don't have control over how it was read in, the accepted answer lets you do that.
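To illustrate @jezrael's caveat (a hypothetical sketch: na_values matches whole field values, so a sentinel that appears only as part of a larger string is not caught):

import io
import pandas as pd

# hypothetical sample: one exact sentinel field and one field
# where the sentinel is only a substring
csv = io.StringIO("col1\n+unend\n+unend W\n30.691\n")
df = pd.read_csv(csv, na_values=['+unend'])
print(df)
#        col1
# 0       NaN       <- exact field matched and converted to NaN
# 1  +unend W       <- substring occurrence NOT matched
# 2    30.691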

As pointed out by Edchum, if you need to replace all non-numeric values with 0: first, to_numeric with errors='coerce' creates NaNs for unparseable values, and then fillna converts them to 0:

df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)

If the values are exact matches rather than substrings, use DataFrame.isin, or see the very nice answer by Haleemur Ali:

df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)

For substrings, match with str.contains. + and . are special regex characters, so they need to be escaped with \:

df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'\+unend|n\. def\.')), 0).astype(float)

Or use applymap for an elementwise check:

df = df.mask(df.applymap(lambda x: ('n. def.' in str(x)) or ('unend' in str(x))), 0).astype(float)

print(df)
                          col1  col2   col3
time                                       
05/04/2018 05:14:52 AM   0.000   0.0  0.000
05/04/2018 05:14:57 AM   0.000   0.0  0.000
05/04/2018 05:15:02 AM  30.691   0.0  0.121
05/04/2018 05:15:07 AM  30.691   0.0  0.108
05/04/2018 05:15:12 AM  30.715   0.0  0.105



Do not use pd.Series.str.contains or pd.Series.isin

A more efficient solution to this problem is to use pd.to_numeric to try and convert all the data to numeric.

Use errors='coerce' to default to NaN, which you can then use with pd.Series.fillna.

cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce').fillna(0)

