I use pandas's astype to parse a string column into datetime64[ns], but the original data contains some malformed values, which makes the conversion fail.
I want to get the index of the bad rows from the ValueError exception and delete those rows, rather than have the program abort on the ValueError. Or is there another way to achieve this?
When parsing the datetimes with astype, I get the following error:
File "/home/xiaopeng/anaconda3/envs/tensorflow/lib/python3.5/site-packages/pandas/core/dtypes/cast.py", line 636, in astype_nansafe
return arr.astype(dtype)
ValueError: Error parsing datetime string "2017-06-01VERSION=1.0" at position 10
The code is as follows. The function reads data from a text file and parses it:
def file_to_df(file):
    print('converting file:%r(%r MB)' %(file,(os.path.getsize(file)/(1024*1024))))
    df = pd.read_csv(file, sep='\t', header=None, names=columns)
    for k in df.columns:
        _, df[k] = df[k].astype(str).str.split('=',1).str
    df = df[columns_use]
    # startswith() ,delete the wrong data when startswith is not '20'
    df = df[df['PASSTIME'].astype(str).str.startswith("20")]
    print('Log: Get %r number of data' % len(df))
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace(' ', '?', n=1)
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace(' ', '.', n=1)
    df['PASSTIME'] = df['PASSTIME'].astype(str).str.replace('?', ' ', n=1)
    df['PASSTIME'] = df['PASSTIME'].astype('datetime64[ns]')
    return df
The data that triggers the parsing error looks like this (the second line is the malformed one):
VERSION=1.0 PASSTIME=2017-06-01 11:01:46 625 CARSTATE=1 ...
VERSION=1.0 PASSTIME=2017-06-01VERSION=1.0 PASSTIME=2017-06-01 11:04:02 618 CARSTATE=1 ...
VERSION=1.0 PASSTIME=2017-06-01 11:04:49 595 CARSTATE=1 ...
Use df['PASSTIME'] = pd.to_datetime(df['PASSTIME'], errors='coerce') to convert the dates to datetimes. If some values are bad, e.g. 2017-06-01VERSION=1.0, the function returns NaT for them instead of raising. You can then find and drop the NaT rows, so the data is cleaned as part of parsing. The call df['PASSTIME'] = df['PASSTIME'].astype('datetime64[ns]') is then not necessary: astype cannot coerce unparseable values to NaT and raises a ValueError on the first bad one.
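
As a rough illustration of that approach, here is a minimal sketch. It builds a small hypothetical DataFrame that mimics the question's PASSTIME column after the space-to-dot replacement. The sample values, the explicit format string, and the names parsed, bad_index and clean are assumptions made for the example, not part of the original code.

```python
import pandas as pd

# Hypothetical sample mirroring the question's data; the second value is malformed.
df = pd.DataFrame({
    "PASSTIME": [
        "2017-06-01 11:01:46.625",
        "2017-06-01VERSION=1.0",
        "2017-06-01 11:04:49.595",
    ]
})

# errors='coerce' turns values that cannot be parsed into NaT instead of raising.
# The explicit format is an assumption about how the cleaned strings look.
parsed = pd.to_datetime(
    df["PASSTIME"], format="%Y-%m-%d %H:%M:%S.%f", errors="coerce"
)

# Index labels of the rows that failed to parse.
bad_index = parsed[parsed.isna()].index

# Keep only the rows that parsed, and store the datetime values.
clean = df.loc[parsed.notna()].copy()
clean["PASSTIME"] = parsed[parsed.notna()]

print("bad rows:", list(bad_index))
print(clean)
```

This gives you the indices of the bad rows without catching the ValueError at all, which is probably simpler than trying to recover a row position from the exception message.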