15

Python 2.7.10
Tried pandas 0.17.1 -- function read_excel
Tried pyexcel 0.1.7 + pyexcel-xlsx 0.0.7 -- function get_records()

When using pandas in Python is it possible to read excel files (formats: xls, xlsx) and leave columns containing date or date + time values as strings rather than auto-converting to datetime.datetime or timestamp types?

If this is not possible using pandas can someone suggest an alternate method/library to read xls, xlsx files and leave date column values as strings?

For the pandas solution attempts the df.info() and resulting date column types are shown below:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 117 entries, 0 to 116
Columns: 176 entries, Mine to Index
dtypes: datetime64[ns](2), float64(145), int64(26), object(3)
memory usage: 161.8+ KB
>>> type(df['Start Date'][0])
Out[6]: pandas.tslib.Timestamp
>>> type(df['End Date'][0])
Out[7]: pandas.tslib.Timestamp

Attempt/Approach 1:

def read_as_dataframe(filename, ext):
   import pandas as pd
   if ext in ('xls', 'xlsx'):
      # problem: date columns auto converted to datetime.datetime or timestamp!
      df = pd.read_excel(filename) # unwanted - date columns converted!

   return df, name, ext

Attempt/Approach 2:

import pandas as pd
# import datetime as datetime
# parse_date = lambda x: datetime.strptime(x, '%Y%m%d %H')
parse_date = lambda x: x
elif ext in ('xls', 'xlsx', ):
    df = pd.read_excel(filename, parse_dates=False)
    date_cols = [df.columns.get_loc(c) for c in df.columns if c in ('Start Date', 'End Date')]
    # problem: date columns auto converted to datetime.datetime or timestamp!
    df = pd.read_excel(filename, parse_dates=date_cols, date_parser=parse_date)

And have also tried pyexcel library but it does the same auto-magic convert behavior:

Attempt/Approach 3:

import pyexcel as pe
import pyexcel.ext.xls
import pyexcel.ext.xlsx

t0 = time.time()
if ext == 'xlsx':
    records = pe.get_records(file_name=filename)
    for record in records:
        print("start date = %s (type=%s), end date = %s (type=%s)" %
              (record['Start Date'],
               str(type(record['Start Date'])),
               record['End Date'],
               str(type(record['End Date'])))
              )
11
  • I don't understand your question, if you don't pass date_cols or a date_parser then it won't try to parse the date strings Commented Dec 8, 2015 at 13:15
  • 1
    As you can see from my code snippets I am not passing date_cols or date_parser for first style approach above and pandas is auto-magically (and silently) converting the columns to datetime values. I need to find a way to stop that and leave any column containing a date value as a string. Commented Dec 8, 2015 at 13:19
  • Are you sure, please post df.info() after running your code Commented Dec 8, 2015 at 13:25
  • 1
    Thanks @EdChum will explore csv approach further. The spreadsheets themselves are out of my control (format + content) so I need to work with formats provided to me and I am trying to develop a robust custom method for trying different date formats (time zones) and presenting user with best fit. Commented Dec 8, 2015 at 14:06
  • 2
    I think @EdChum is correct. The type you get depends on how the data is typed in the excel file itself. If the column is formatted as datetime data there, you will get datetime data from read_excel. Commented Dec 8, 2015 at 14:31

5 Answers 5

9

I ran into an identical problem, except pandas was oddly converting only some cells into datetimes. I ended up manually converting each cell into a string like so:

def undate(x):
    if pd.isnull(x):
        return x
    try:
        return x.strftime('%d/%m/%Y')
    except AttributeError:
        return x
    except Exception:
        raise

for i in list_of_possible_date_columns:
    df[i] = df[i].apply(undate)
Sign up to request clarification or add additional context in comments.

Comments

6
  • Using converters{'Date': str} option inside the pandas.read_excel which helps. pandas.read_excel(xlsx, sheet, converters={'Date': str})
  • you can try convert your timestamp back to the original format
    df['Date'][0].strftime('%Y/%m/%d')

Comments

6

I tried saving the file in a CSV UTF-8 format (manually) and used pd.read_csv() and worked fine.

I tried a bunch of things to figure the same thing with read_excel. Did not work anything for me. So, I am guessing read_excel is probably updating your string in a datetime object which you can not control.

1 Comment

I get why he didn't accept this as the answer but this works for me and is really easy. I don't think there's another method that pandas allows which is weird to me
1

I have had the same issue when extracting Dates from excel. My columns had the MM/DD/YYYY format, but reading it in Pyhton and converting to CSV the format was then converted to MM/DD/YYYY 00:00:00. Fortunately figured out a solution. Using

excel = pd.read_excel(file, dtype=object)

Instead of

excel = pd.read_excel(file, dtype=str)

Works.

I have no idea why and would appreciate some experienced python programmer to explain.

2 Comments

to make this a useful answer, could you please add some more details? for which specific input (excel file) does this give what desired output?
While it would be good to know more about why this solution works, it did work for me and gave the results expected.
0

The docs of pd.read_excel say under the parameter parse_dates:

If you don't want to parse some cells as date just change their type in Excel to "Text".

This means one could preprocess the cells in Excel to the "Text" number format before using pd.read_excel. However, switching to "Text" number format alone changes the dates to numbers in Excel, e.g., 1900-12-31 becomes 366 (the cryptic number is the number of days since 1900-01-01 (it is 366 here because Excel falsely presumes that 1900 was a leap year)). This means that the instruction of just changing the number format to "Text" isn’t useful.

Instead, one has to iterate over the date columns and set the Excel VBA .Text as value and change the number format to "Text" to return the string how it displays, which can be done with xlwings:

from pathlib import Path
import xlwings as xw
import pandas as pd

path = Path(r"foo/test.xlsx")

with xw.App(visible=False):
    wb = xw.Book(path)
    ws = wb.sheets[0]
    for cell in ws.used_range:
        temp_str = cell.api.Text
        cell.number_format = "@"  # @ is the number format code of Excel for the "Text" number format.
        cell.value = temp_str
    wb.save(path.with_stem(f"{path.stem}_interim"))
    # Then the Excel file can be read with xlwings.
    df = ws.used_range.options(pd.DataFrame, index=False).value
    wb.close()

# Or it can be read with pandas.
df = pd.read_excel(path.with_stem(f"{path.stem}_interim"))

print(df)

Note the prerequisites for the installation of xlwings.
Unfortunately the .Text property can only be used on a range of one cell, which is the reason why a loop over the range of cells in question is required. Note that I used used_range in this example, which means that all cells with contents on the worksheet are formatted to the "Text" number format.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.