
I have a column of datetimes and I want to get the difference between values in terms of years, months, etc, instead of timedeltas that only provide days. How do I do this in Pandas?

Pandas provides DateOffset for relative deltas, but the docs say "the positional argument form of relativedelta is not supported", and that's the form that calculates a relative delta (as opposed to specifying a relative delta).

For this example, I'm only dealing with the min and max of the column to get the span, but I eventually want to apply this to the whole column.

min_max = df_most_watched['time'].agg(['min', 'max'])
min   2019-06-18 18:22:05.991000+00:00
max   2021-02-15 18:03:02.893000+00:00
Name: time, dtype: datetime64[ns, UTC]

min_max.diff():

min                        NaT
max   607 days 23:40:56.902000
Name: time, dtype: timedelta64[ns]

The output should be 1 year, 7 months, 27 days, 23:40:56.902000.

Attempted

Just to confirm, I tried pd.DateOffset(low, high) and got TypeError: `n` argument must be an integer, got <class 'pandas._libs.tslibs.timestamps.Timestamp'>

I tried .pct_change() on a whim hoping it would have a special case for datetimes, but no dice. TypeError: cannot perform __truediv__ with this index type: DatetimeArray

I checked if .diff() had some sort of setting like relative=True, but no.

Research

In the User Guide, the Time series page doesn't have anything relevant when I Ctrl+F for "relative" and the Time deltas page doesn't mention "relative" at all.

I checked if DateOffset might have any alternate constructors that could take two timestamps, but the docs don't mention any methods starting with from or anything else.

Setup

min_max = pd.Series(
    {'min': pd.Timestamp('2019-06-18 18:22:05.991', tz='UTC'),
     'max': pd.Timestamp('2021-02-15 18:03:02.893', tz='UTC')},
    name='time')
  • Kindly provide a reproducible input data frame, with the expected output data frame Commented Oct 11 at 22:37
  • @sammywemmy Added desired output and setup code. I'm not using a whole df in this example. Commented Oct 11 at 23:41

2 Answers


Workaround

It doesn't seem to be possible in pandas itself, so as a workaround, use relativedelta from dateutil directly.

Here's basic usage with a single delta. I'll cover Series below.

from dateutil.relativedelta import relativedelta as Rd

# relativedelta(dt1, dt2) computes dt1 - dt2, so the later timestamp goes first
span = Rd(min_max['max'], min_max['min'])
relativedelta(years=+1, months=+7, days=+27, hours=+23, minutes=+40, seconds=+56, microseconds=+902000)

Series

relativedelta can't handle nulls, so we'll have to handle those specially.

For the intermediate value, I'm using a dataframe. This isn't strictly necessary; it just provides richer debugging if needed. Alternatively, you could loop over the values and keep track of the previous one.

rdiff = (
    min_max.to_frame().join(
        min_max.shift().rename('time_prev'))
    .apply(
        lambda row: Rd(row['time'], row['time_prev'])
            if pd.notna(row['time_prev']) else pd.NaT,
        axis=1)
)
min    NaT
max    relativedelta(years=+1, months=+7, days=+27, hours=+23, minutes=+40, seconds=+56, microseconds=+902000)
dtype: object
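
The loop alternative mentioned above could be sketched roughly like this. Note that relative_diff is a hypothetical helper name (not a pandas function), and the sketch assumes the Series holds Timestamps or nulls, already in the order you want to diff:

```python
import pandas as pd
from dateutil.relativedelta import relativedelta


def relative_diff(series):
    """Row-to-row relativedelta, analogous to Series.diff().

    Hypothetical helper: the first row, and any row where the current
    or previous value is null, gets NaT.
    """
    out = []
    prev = pd.NaT
    for value in series:
        if pd.isna(value) or pd.isna(prev):
            out.append(pd.NaT)
        else:
            # relativedelta(dt1, dt2) computes dt1 - dt2
            out.append(relativedelta(value, prev))
        prev = value
    # dtype=object because relativedelta is not a native pandas dtype
    return pd.Series(out, index=series.index, dtype=object, name=series.name)


min_max = pd.Series(
    {'min': pd.Timestamp('2019-06-18 18:22:05.991', tz='UTC'),
     'max': pd.Timestamp('2021-02-15 18:03:02.893', tz='UTC')},
    name='time')

rdiff = relative_diff(min_max)
print(rdiff)
```

Like the apply version, this yields an object-dtype Series, so vectorized datetime operations won't work on the result.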

BTW: Convert to DateOffset

You can convert a relativedelta to DateOffset by selecting the attributes listed above:

pd.DateOffset(**{k: getattr(span, k) for k in [
    'years',
    'months',
    'days',
    'hours',
    'minutes',
    'seconds',
    'microseconds',
]})
<DateOffset: days=27, hours=23, microseconds=902000, minutes=40, months=7, seconds=56, years=1>
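
One way to sanity-check the conversion is to add the resulting DateOffset back to the earlier timestamp and compare with the later one. This is only a sketch using the timestamps from the question; FIELDS, low, high and roundtrip are names I made up for illustration:

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Relative (plural) fields of relativedelta that DateOffset also accepts
FIELDS = ['years', 'months', 'days', 'hours',
          'minutes', 'seconds', 'microseconds']

low = pd.Timestamp('2019-06-18 18:22:05.991', tz='UTC')
high = pd.Timestamp('2021-02-15 18:03:02.893', tz='UTC')

# Later timestamp first, since relativedelta(dt1, dt2) is dt1 - dt2
span = relativedelta(high, low)
offset = pd.DateOffset(**{k: getattr(span, k) for k in FIELDS})

# Adding the offset to the start should land back on the end
roundtrip = low + offset
print(roundtrip == high)
```

Keep in mind that relativedelta only goes down to microseconds, so any nanosecond component in the timestamps would not survive the round trip.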

2 Comments

P.S. If anyone knows how to convert a relative delta to a human-readable form (string), I'm all ears :)
P.S.2. I might publish a package with some more thorough code for doing this. LMK if there's interest.

You're absolutely right that pd.Timedelta only gives you differences in days, seconds, etc., and DateOffset doesn't support the positional form of relativedelta. To get differences in years, months, days, etc., you can use the dateutil.relativedelta module directly.

from dateutil.relativedelta import relativedelta
from datetime import datetime


def format_relativedelta(rd):
    units = ['years', 'months', 'days', 'hours', 'minutes', 'seconds']
    return ', '.join(
        f"{getattr(rd, unit)} {unit.rstrip('s') if getattr(rd, unit) == 1 else unit}"
        for unit in units if getattr(rd, unit)
    )


start = datetime(2019, 6, 18, 18, 22, 5)
end = datetime(2021, 2, 15, 18, 3, 2)

delta = relativedelta(end, start)
print(format_relativedelta(delta))
1 year, 7 months, 27 days, 23 hours, 40 minutes, 57 seconds
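
If you want the sub-second part kept and the exact layout asked for in the question (years, months and days in words, then a clock-style remainder), a variant might look like the following. format_span is a hypothetical name, and the sketch assumes a non-negative delta; negative fields would need extra handling:

```python
from dateutil.relativedelta import relativedelta


def format_span(rd):
    """Format a non-negative relativedelta as e.g.
    '1 year, 7 months, 27 days, 23:40:56.902000'.

    Hypothetical helper; negative deltas are not handled.
    """
    rd = rd.normalized()  # make sure every relative field is an integer
    parts = []
    for unit in ('years', 'months', 'days'):
        n = getattr(rd, unit)
        if n:
            # Drop the trailing 's' for singular values
            label = unit[:-1] if n == 1 else unit
            parts.append(f"{n} {label}")
    # Time-of-day remainder, including microseconds
    parts.append(
        f"{rd.hours:02d}:{rd.minutes:02d}:{rd.seconds:02d}.{rd.microseconds:06d}")
    return ', '.join(parts)


span = relativedelta(years=1, months=7, days=27,
                     hours=23, minutes=40, seconds=56, microseconds=902000)
print(format_span(span))
# 1 year, 7 months, 27 days, 23:40:56.902000
```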

4 Comments

What's this formatting code? That's not relevant to the question.
But also, you missed microseconds.
Pandas uses pd.Timestamp, not datetime.datetime. It works the same, but I'm just not sure why you decided to add an extra, unnecessary import.
So beyond the unnecessary formatting code and unnecessary datetime import, does this really add anything my answer doesn't? Sorry if that sounds rude, but I'm genuinely asking if there's anything I missed.
