10

I have the following pandas DataFrame:

df = pd.DataFrame({'a': ['2020-01-02', '2020-01-02']})

Obviously, column 'a' holds strings. I want to convert it to a date type, and here is what I did:

df['a'] = df['a'].apply(pd.to_datetime).dt.date

It works, but my real DataFrame has 500,000+ rows, and this seems very inefficient. Is there a more direct and efficient way to convert a string column to a date column?

3 Answers

21

pandas.Series.apply (which is what df['a'].apply calls) is essentially a native Python for loop.

pandas.to_datetime is a vectorized function, meaning it is meant to operate on whole sequences (lists, arrays, Series) and does the inner loop in C.

If we start with a larger dataframe:

import pandas
df = pandas.DataFrame({'a': ['2020-01-02', '2020-01-02'] * 5000})

And then run (in a Jupyter notebook):

%%timeit
df['a'].apply(pandas.to_datetime).dt.date

We get a pretty slow result:

1.03 s ± 48.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But if we rearrange just slightly to pass the entire column:

%%timeit
pandas.to_datetime(df['a']).dt.date

We get a much faster result:

6.07 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
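
As a minimal sketch (not part of the original answer), the snippet below checks that the row-by-row and whole-column spellings give the same values, and shows what .dt.date actually returns. The variable names are illustrative, and the .dt.normalize() line is an added alternative that assumes a datetime64 column set to midnight is acceptable in place of Python date objects.

```python
import datetime
import pandas as pd

df = pd.DataFrame({'a': ['2020-01-02', '2020-01-03'] * 5})

# Row-by-row parsing (slow) and whole-column parsing (fast) give the same values.
slow = df['a'].apply(pd.to_datetime).dt.date
fast = pd.to_datetime(df['a']).dt.date
same = bool((slow == fast).all())

# .dt.date yields Python datetime.date objects rather than a datetime64 column.
first = fast.iloc[0]

# Alternative (assumption: datetime64 at midnight is acceptable): keep the native dtype.
normalized = pd.to_datetime(df['a']).dt.normalize()
```

Keeping the native datetime64 dtype generally leaves later operations such as filtering by date range or resampling vectorized, whereas a column of date objects falls back to slower Python-level handling.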

1 Comment

Thanks a lot for the quick response! Also I combined the other answer and here it becomes even faster: %timeit pd.to_datetime(df['a'], infer_datetime_format=True).dt.date
1

Use df['a'] = pd.to_datetime(df['a'], format='%Y-%m-%d')

Specify the format if you know that all values follow the same format.
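
A short sketch of what this looks like in practice. Note that pd.to_datetime returns a datetime64 column whether or not format is given, so .dt.date is still needed if Python date objects are the goal. The sample values and the errors='coerce' line are illustrative additions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['2020-01-02', '2020-01-03']})

# An explicit format skips format inference for the column.
parsed = pd.to_datetime(df['a'], format='%Y-%m-%d')

# The result is still datetime64; take .dt.date for datetime.date objects.
dates = parsed.dt.date

# Illustrative: errors='coerce' turns non-matching strings into NaT instead of raising.
messy = pd.Series(['2020-01-02', 'not a date'])
coerced = pd.to_datetime(messy, format='%Y-%m-%d', errors='coerce')
```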

1 Comment

This does not answer the question. The question is not about input format, which is what the format argument governs, but about the output format: is it a date, or a datetime? -1.
1

Below is code for the fastest function I know of for converting strings to dates, specialised for application on a series where the same dates are repeated - e.g. sub-day granularity financial time series data. If you are working with 1 minute bars and have individual date and time columns, that's a lot of repeated date strings.

def str_to_date(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """

    # Create a dictionary with unique dates as keys and their corresponding
    # parsed datetime objects as values
    dates = {date: pd.to_datetime(date,
                              format="%Y-%m-%d") for date in s.unique()}

    # Map the original dates to their parsed values using the lookup dictionary
    return s.map(dates).dt.date

Now let's re-run all the timings with the same large DataFrame used in this answer: https://stackoverflow.com/a/66862336/3253628. First create the DataFrame:

import pandas as pd
df = pd.DataFrame({'a': ['2020-01-02', '2020-01-02'] * 5000})

I will pass the date format to each of the earlier approaches to make the comparison fair.

Time the first approach:

%%timeit
df['a'].apply(pd.to_datetime, format="%Y-%m-%d").dt.date

For which we get:

621 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time the second approach:

%%timeit
pd.to_datetime(df['a'], format="%Y-%m-%d").dt.date

For which we get:

4.02 ms ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And now the new approach using the function defined above:

%%timeit
str_to_date(df['a'])

For which we get:

2.66 ms ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So that is a pretty good speed up.
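
As a sanity check, here is a sketch confirming that the lookup-based function returns exactly what the direct whole-column call returns. The function is repeated so that the snippet runs on its own, and the two-date sample is illustrative. One caveat: pd.to_datetime has a cache parameter, on by default, that also deduplicates repeated strings once there are at least 50 values, so the size of the gap may vary with pandas version and data.

```python
import pandas as pd

def str_to_date(s):
    # Parse each distinct string once, then look every row up in the dict.
    dates = {d: pd.to_datetime(d, format="%Y-%m-%d") for d in s.unique()}
    return s.map(dates).dt.date

df = pd.DataFrame({'a': ['2020-01-02', '2020-01-03'] * 5000})

via_lookup = str_to_date(df['a'])
direct = pd.to_datetime(df['a'], format="%Y-%m-%d").dt.date

# Both routes should agree on every row.
matches = bool((via_lookup == direct).all())

# Only this many strings were actually parsed by the lookup approach.
n_parsed = df['a'].nunique()
```

The benefit depends on the ratio of unique to total values: with mostly distinct date strings, the dictionary build parses nearly every row in a Python loop and is likely to be slower than the direct call.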

The same logic can also be applied when there are a lot of repeated time strings and you wish to convert them to time deltas.

def str_to_time(s):
    """
    This is an extremely fast approach to time parsing.
    For large data, the same times are often repeated. Rather than
    re-parse these, we store all unique times, parse them, and
    use a lookup to convert all times.
    """
    # Create a dictionary with unique times as keys and their corresponding
    # parsed timedelta objects as values
    times = {time: pd.to_timedelta(time) for time in s.unique()}

    # Map the original times to their parsed values using the lookup dictionary
    return s.map(times)
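
To show how the two pieces might fit together for the 1-minute-bar case mentioned earlier, here is a sketch that combines separate date and time columns into full timestamps. The frame `bars` and its column names 'date' and 'time' are hypothetical, and the function is repeated so that the snippet runs on its own.

```python
import pandas as pd

def str_to_time(s):
    # Parse each distinct time string once, then look every row up in the dict.
    times = {t: pd.to_timedelta(t) for t in s.unique()}
    return s.map(times)

# Hypothetical bar data with separate date and time string columns.
bars = pd.DataFrame({
    'date': ['2020-01-02', '2020-01-02', '2020-01-03'],
    'time': ['09:30:00', '09:31:00', '09:30:00'],
})

# Time-of-day as an offset from midnight.
offsets = str_to_time(bars['time'])

# Midnight of each date plus the offset gives the full timestamp.
timestamps = pd.to_datetime(bars['date'], format='%Y-%m-%d') + offsets
```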
