26

I have a column in my pandas dataframe:

Start Date
1/7/13
1/7/13
1/7/13
16/7/13
16/7/13

When I convert it into a date format, I am getting the following error:

data['Start Date']= pd.to_datetime(data['Start Date'],dayfirst=True)
...
...
/Library/Python/2.7/site-packages/pandas/tseries/tools.pyc in _convert_listlike(arg, box, format, name)
    381                 return DatetimeIndex._simple_new(values, name=name, tz=tz)
    382             except (ValueError, TypeError):
--> 383                 raise e
    384 
    385     if arg is None:

ValueError: Unknown string format

What am I missing here?

4
  • 2
    what is the exact data type of the elements in the column? Commented Dec 29, 2015 at 7:20
  • 4
    You can pass errors='coerce' (or coerce=True in earlier versions), and then see which item is converted to NaT (and so could not be converted to a datetime) Commented Dec 29, 2015 at 8:49
  • The Type is object. By that I guess Its considered to be string? Commented Dec 29, 2015 at 11:19
  • This is probably indeed string (but can also be mixed). Did you try my suggestion? Commented Dec 29, 2015 at 16:00

2 Answers 2

22

I think the problem is in data - a problematic string exists. So you can try check length of the string in column Start Date:

import pandas as pd
import io

temp=u"""Start Date
1/7/13
1/7/1
1/7/13 12 17
16/7/13
16/7/13"""

data = pd.read_csv(io.StringIO(temp), sep=";", parse_dates=False)

#data['Start Date']= pd.to_datetime(data['Start Date'],dayfirst=True)
print data

     Start Date
0        1/7/13
1         1/7/1
2  1/7/13 12 17
3       16/7/13
4       16/7/13

#check, if length is more as 7
print data[data['Start Date'].str.len() > 7]

     Start Date
2  1/7/13 12 17

Or you can try to find these problematic row different way e.g. read only part of the datetime and check parsing datetime:

#read first 3 rows
data= data.iloc[:3]

data['Start Date']= pd.to_datetime(data['Start Date'],dayfirst=True)

But this is only tips.

EDIT:

Thanks joris for suggestion add parameter errors ='coerce' to to_datetime:

temp=u"""Start Date
1/7/13
1/7/1
1/7/13 12 17
16/7/13
16/7/13 12 04"""

data = pd.read_csv(io.StringIO(temp), sep=";")
#add parameter errors coerce
data['Start Date']= pd.to_datetime(data['Start Date'], dayfirst=True, errors='coerce')
print data

  Start Date
0 2013-07-01
1 2001-07-01
2        NaT
3 2013-07-16
4        NaT

#index of data with null - NaT to variable idx
idx = data[data['Start Date'].isnull()].index
print idx

Int64Index([2, 4], dtype='int64')

#read csv again
data = pd.read_csv(io.StringIO(temp), sep=";")

#find problematic rows, where datetime is not parsed
print data.iloc[idx]

      Start Date
2   1/7/13 12 17
4  16/7/13 12 04
Sign up to request clarification or add additional context in comments.

4 Comments

The data is from a CSV file. When I had a look at the data on Excel, it seems fine, like a regular date format. The length of the dates are not unusual.
Upon drilling down each record. I realised there were some encoding whitespaces issues, which needed trimming. Turns out the 'length' way of verification helped this case. Thanks
how do you trim unusual whitespaces. I guess my whole column has an extra whitespace.
@deadcode - use data['Start Date']= pd.to_datetime(data['Start Date'].str.strip(), dayfirst=True)
1

Recent pandas versions (e.g. 1.5.0) throws a helpful error that shows where the unparsable string is in the column. That said, a common way to solve this error is to pass the correct format= especially if the dates have some "exotic" format (for a list of all possible formats, check https://strftime.org/) such as having underscores (_), em dash (), unexpected white space as separators.

df = pd.DataFrame({'date': ['19_Mar_2017']})
pd.to_datetime(df['date'])                      # error
pd.to_datetime(df['date'], format='%d_%b_%Y')   # OK

df = pd.DataFrame({'date': ['19.Mar. 2017']})
pd.to_datetime(df['date'])                      # error
pd.to_datetime(df['date'], format='%d.%b. %Y')  # OK

If the datetimes have mixed formats, use the errors='coerce' argument to pass different formats and combine all using fillna(). For example,

df = pd.DataFrame({'date': ['19.Mar. 2017', '20.Mar.2017']})  # <--- has mixed format
first = pd.to_datetime(df['date'], format='%d.%b. %Y', errors='coerce')
second = pd.to_datetime(df['date'], format='%d.%b.%Y', errors='coerce')
df['date'] = first.fillna(second)

As an aside, if the data is imported from a file, pandas functions that read such files into dataframes have parse_dates= parameter that can be used to parse the dates.

# from csv
df = pd.read_csv('foo.csv', parse_dates=True)

# from Excel
df = pd.read_excel('foo.xlsx', parse_dates=True)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.