I use pandas' astype function to parse strings into datetime64[ns] values, but some malformed values in the original data cause the program to fail.

I want to get the index of the bad rows from the ValueError exception and delete those rows, rather than have the program stop because of the ValueError. Or is there another way to achieve this?

When parsing datetimes with astype, I get the following error:

  File "/home/xiaopeng/anaconda3/envs/tensorflow/lib/python3.5/site-packages/pandas/core/dtypes/cast.py", line 636, in astype_nansafe
    return arr.astype(dtype)
ValueError: Error parsing datetime string "2017-06-01VERSION=1.0" at position 10

The code is as follows. The function reads data from a text file and parses it:

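import os

import pandas as pd

# `columns` and `columns_use` are lists of column names (not shown in the question).
def file_to_df(file):
    print('converting file:%r(%r MB)' % (file, os.path.getsize(file) / (1024 * 1024)))

    df = pd.read_csv(file, sep='\t', header=None, names=columns)

    # Each field looks like KEY=value; keep only the part after the first '='.
    for k in df.columns:
        df[k] = df[k].astype(str).str.split('=', n=1).str[1]

    df = df[columns_use]

    # Drop rows whose PASSTIME does not start with '20'.
    df = df[df['PASSTIME'].astype(str).str.startswith("20")]

    print('Log: Get %r number of data' % len(df))

    # Turn "2017-06-01 11:01:46 625" into "2017-06-01 11:01:46.625":
    # protect the first space with '?', turn the second space into '.',
    # then restore the first space.
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace(' ', '?', n=1)
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace(' ', '.', n=1)
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace('?', ' ', n=1)

    # This is the line that raises the ValueError on malformed values.
    df['PASSTIME'] = df['PASSTIME'].astype('datetime64[ns]')

    return df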

The data that causes the parsing error looks like this (the second line is malformed):

VERSION=1.0 PASSTIME=2017-06-01 11:01:46 625    CARSTATE=1  ...
VERSION=1.0 PASSTIME=2017-06-01VERSION=1.0  PASSTIME=2017-06-01 11:04:02 618    CARSTATE=1  ...
VERSION=1.0 PASSTIME=2017-06-01 11:04:49 595    CARSTATE=1  ...
  • Can you add a data sample? The main point is that you need df['PASSTIME'] = pd.to_datetime(df['PASSTIME'], errors='coerce') to convert the dates to datetimes. If some values are bad, e.g. 2017-06-01VERSION=1.0, the function returns NaT for them. So you first need to clean the data and then parse it. Commented Oct 23, 2017 at 9:16
  • It works well with the to_datetime function. Many thanks. The modified code is: df['PASSTIME'] = pd.to_datetime(df['PASSTIME'], errors='coerce') followed by df = df[~df['PASSTIME'].isin([pd.NaT])]. Do I no longer need df['PASSTIME'] = df['PASSTIME'].astype('datetime64[ns]'), because to_datetime has already changed the PASSTIME column type to datetime64[ns]? Commented Oct 23, 2017 at 9:56
  • No, df['PASSTIME'] = df['PASSTIME'].astype('datetime64[ns]') is not necessary, because to_datetime already converts the column to datetimes. astype has no way to turn unparseable values into NaT, which is why it raised the error. Commented Oct 23, 2017 at 10:05

1 Answer

I think you need to_datetime plus dropna to remove the NaT rows:

df['PASSTIME'] = pd.to_datetime(df['PASSTIME'], errors='coerce')
df = df.dropna(subset=['PASSTIME'])
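
To see which rows get dropped, and to recover their index labels (the original goal in the question), the same idea can be sketched on a small made-up frame. The sample values are adapted from the question's data, and the names bad_index and clean are illustrative only, not part of the original code.

```python
import pandas as pd

# Made-up sample: the middle value is the malformed one from the question.
df = pd.DataFrame({
    'PASSTIME': [
        '2017-06-01 11:01:46.625',
        '2017-06-01VERSION=1.0',
        '2017-06-01 11:04:49.595',
    ]
})

# Unparseable strings become NaT instead of raising ValueError.
parsed = pd.to_datetime(df['PASSTIME'], errors='coerce')

# Index labels of the rows that failed to parse (hypothetical name).
bad_index = df.index[parsed.isna()]

# Store the parsed column and keep only the rows that parsed successfully.
df['PASSTIME'] = parsed
clean = df.dropna(subset=['PASSTIME'])

print(list(bad_index))
print(clean)
```

Collecting bad_index before dropping the rows makes it possible to log or inspect the rejected records, which a try/except around astype cannot do because the exception only reports the first offending string.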

3 Comments

I use df = df[~df['PASSTIME'].isin([pd.NaT])]. Is the performance of df.dropna better?
Really interesting question, see this for timings. If you need a faster solution, use df = df[df['PASSTIME'].notnull()], but it is only a bit faster.
Thank you very much. This is my first question on Stack Overflow. Because of your answer, I love this site even more.
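
The three filters mentioned in these comments should select the same rows. A minimal sketch on made-up data, which checks only that the results agree and does not measure speed (the variable names are illustrative):

```python
import pandas as pd

# Made-up column with one unparseable value, coerced to NaT.
s = pd.to_datetime(
    pd.Series(['2017-06-01 11:01:46.625', 'not a date', '2017-06-01 11:04:49.595']),
    errors='coerce',
)
df = pd.DataFrame({'PASSTIME': s})

# Three ways to drop the NaT rows.
by_isin = df[~df['PASSTIME'].isin([pd.NaT])]
by_dropna = df.dropna(subset=['PASSTIME'])
by_notnull = df[df['PASSTIME'].notnull()]

print(by_isin.equals(by_dropna) and by_dropna.equals(by_notnull))
```

To compare speed on real data, each of the three expressions could be wrapped in timeit; the relative timings are likely to depend on the pandas version and the size of the frame.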
