How to delete bad rows from a DataFrame when astype raises ValueError

Question

I use pandas' astype function to parse strings into datetime64[ns] values, but some malformed values in the original data cause the program to fail.

I want to get the index of the bad rows from the ValueError exception and delete those rows, rather than have the program interrupted by the ValueError. Or is there another way to achieve my goal?

When parsing datetimes with astype, I got the following error:

  File "/home/xiaopeng/anaconda3/envs/tensorflow/lib/python3.5/site-packages/pandas/core/dtypes/cast.py", line 636, in astype_nansafe
    return arr.astype(dtype)
ValueError: Error parsing datetime string "2017-06-01VERSION=1.0" at position 10

The code is as follows. The function reads data from a text file and parses it:

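import os

import pandas as pd

# `columns` and `columns_use` are lists of column names defined elsewhere.

def file_to_df(file):
    print('converting file:%r(%r MB)' % (file, os.path.getsize(file) / (1024 * 1024)))

    df = pd.read_csv(file, sep='\t', header=None, names=columns)

    # Each field looks like KEY=value; keep only the part after the first '='.
    for k in df.columns:
        df[k] = df[k].astype(str).str.split('=', n=1).str[1]

    df = df[columns_use]

    # Drop rows whose PASSTIME does not start with '20'.
    df = df[df['PASSTIME'].astype(str).str.startswith("20")]

    print('Log: Get %r number of data' % len(df))

    # Turn "2017-06-01 11:01:46 625" into "2017-06-01 11:01:46.625":
    # protect the first space, replace the second with '.', then restore the first.
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace(' ', '?', n=1)
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace(' ', '.', n=1)
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace('?', ' ', n=1)

    # This line raises ValueError on a malformed string.
    df['PASSTIME'] = df['PASSTIME'].astype('datetime64[ns]')

    return df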

The data that triggers the parsing error looks like this (the second line is malformed):

VERSION=1.0 PASSTIME=2017-06-01 11:01:46 625    CARSTATE=1  ...
VERSION=1.0 PASSTIME=2017-06-01VERSION=1.0  PASSTIME=2017-06-01 11:04:02 618    CARSTATE=1  ...
VERSION=1.0 PASSTIME=2017-06-01 11:04:49 595    CARSTATE=1  ...

Can you add a data sample? The main point is that you need df['PASSTIME'] = pd.to_datetime(df['PASSTIME'], errors='coerce') to convert the strings to datetimes. If some values are bad, e.g. 2017-06-01VERSION=1.0, the function returns NaT for them. So you first need to clean the data and then parse it.
– jezrael, Commented Oct 23, 2017 at 9:16
It works well with the to_datetime function. Many thanks. The modified code is: df['PASSTIME'] = pd.to_datetime(df['PASSTIME'], errors='coerce') followed by df = df[~df['PASSTIME'].isin([pd.NaT])]. Do I no longer need df['PASSTIME'] = df['PASSTIME'].astype('datetime64[ns]'), since to_datetime has already changed the PASSTIME column to a datetime type?
– hall, Commented Oct 23, 2017 at 9:56
No, df['PASSTIME'] = df['PASSTIME'].astype('datetime64[ns]') is not necessary, because to_datetime already returns a datetime column. astype has no option to coerce unparseable strings, which is why it raised the ValueError.
– jezrael, Commented Oct 23, 2017 at 10:05
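
To illustrate the point of this exchange, here is a minimal sketch (the sample strings are made up in the style of the question's data) checking that the result of to_datetime is already a datetime column, so a further astype call changes no values:

```python
import pandas as pd

# Hypothetical, well-formed sample values in the question's cleaned format.
raw = pd.Series(["2017-06-01 11:01:46.625", "2017-06-01 11:04:49.595"])

parsed = pd.to_datetime(raw, errors="coerce")

# True when the column already has a datetime64 dtype.
already_datetime = pd.api.types.is_datetime64_any_dtype(parsed)

# Casting again leaves every value unchanged.
recast = parsed.astype("datetime64[ns]")
values_unchanged = bool((recast == parsed).all())
```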

jezrael · Accepted Answer · 2017-10-23 10:04:04Z

I think you need to_datetime with errors='coerce', followed by dropna to remove the NaT rows:

df['PASSTIME'] = pd.to_datetime(df['PASSTIME'], errors='coerce')
df = df.dropna(subset=['PASSTIME'])
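
The question also asks how to find which rows were bad. A minimal sketch of that, assuming a small made-up frame in the style of the question's data (the CARSTATE values and the `bad_index` name are illustrative only): parse with errors='coerce', read the index labels where the result is NaT, then drop those rows. dropna is called with subset= because its first positional parameter is the axis, not a column name.

```python
import pandas as pd

# Hypothetical sample; the middle PASSTIME value mimics the malformed row.
df = pd.DataFrame({
    "PASSTIME": [
        "2017-06-01 11:01:46.625",
        "2017-06-01VERSION=1.0",
        "2017-06-01 11:04:49.595",
    ],
    "CARSTATE": [1, 1, 1],
})

# Unparseable strings become NaT instead of raising ValueError.
parsed = pd.to_datetime(df["PASSTIME"], errors="coerce")

# Index labels of the rows that failed to parse.
# Note: values that were already missing before parsing would show up here too.
bad_index = df.index[parsed.isna()].tolist()

df["PASSTIME"] = parsed
df = df.dropna(subset=["PASSTIME"])
```

Keeping `bad_index` around makes it possible to log or inspect the rejected rows before they are discarded.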


3 Comments

hall Over a year ago

I use df = df[~df['PASSTIME'].isin([pd.NaT])]. Is the performance of df.dropna better?

jezrael Over a year ago

Really interesting question, see this for timings. If you need a faster solution, use df = df[df['PASSTIME'].notnull()], but it is only a bit faster.
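
As a rough sketch of the three filters discussed in these comments, the following applies each one to the same made-up frame so the results can be compared. It makes no claim about timings, which depend on the data size and the pandas version.

```python
import pandas as pd

# Hypothetical sample; the middle value fails to parse and becomes NaT.
df = pd.DataFrame({
    "PASSTIME": pd.to_datetime(
        ["2017-06-01 11:01:46.625", "2017-06-01VERSION=1.0", "2017-06-01 11:04:49.595"],
        errors="coerce",
    ),
    "CARSTATE": [1, 1, 1],
})

via_isin = df[~df["PASSTIME"].isin([pd.NaT])]     # filter from the question's comment
via_dropna = df.dropna(subset=["PASSTIME"])       # dropna restricted to one column
via_notnull = df[df["PASSTIME"].notnull()]        # boolean mask on non-missing values
```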

hall Over a year ago

Thank you very much. This is my first question on Stack Overflow. Because of your answer, I love this site even more.
