I am working on a data pipeline in Airflow and keep running into `ValueError: cannot reindex from a duplicate axis`, which I've been beating my head against for days.

Here is the function that is messing up:

def fill_missing_dates(df):
    df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])
    dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
    masdiv = df['MASDIV'].unique()
    station = df['STATION'].unique()
    idx = pd.MultiIndex.from_product((dates, masdiv, station), names=['TUNING_EVNT_START_DT', 'MASDIV', 'STATION'])
    df = df.set_index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']).reindex(idx, fill_value=0).reset_index()

    return df

Here is the error output from AWS Cloudwatch logs:

16:31:40
dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 821, in asfreq
return self._upsample("asfreq", fill_value=fill_value)
File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 1125, in _upsample
res_index, method=method, limit=limit, fill_value=fill_value
File "/usr/local/lib64/python3.7/site-packages/pandas/util/_decorators.py", line 221, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3976, in reindex
return super().reindex(**kwargs)
File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4514, in reindex
axes, level, limit, tolerance, method, fill_value, copy
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3864, in _reindex_axes
index, method, copy, level, fill_value, limit, tolerance
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3886, in _reindex_index
allow_dups=False,
File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4577, in _reindex_with_indexers
copy=copy,
File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/managers.py", line 1251, in reindex_indexer
self.axes[axis]._can_reindex(indexer)
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/base.py", line 3362, in _can_reindex
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/scripts/anomaly_detection_model.py", line 275, in <module>
runner(path_prefix, model_name, execution_id, table)
File "/tmp/scripts/anomaly_detection_model.py", line 230, in runner
df = multiprocessing(PROCESSORS, df)
File "/tmp/scripts/anomaly_detection_model.py", line 121, in multiprocessing
x = pool.map(iforest, (df.loc[df['MASDIV'] == masdiv] for masdiv in args))
File "/usr/lib64/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib64/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
ValueError: cannot reindex from a duplicate axis

I added some logging to inspect the dataframe at that step, but I don't see anything that points to the issue:

20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.index(): RangeIndex(start=0, stop=93, step=1)
20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.columns: Index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION', 'DOW', 'MOY',
       'TRANSACTIONS', 'DOW_INT', 'MOY_INT', 'DT_NBR'],
      dtype='object')

I have tried everything in these posts, but to no avail:

Pandas error: cannot reindex from a duplicate axis

What does `ValueError: cannot reindex from a duplicate axis` mean?

I am also not entirely sure I understand why this is occurring. Any suggestions are much appreciated.

1 Answer

Without example data I cannot reproduce your error. However, based on the function's name "fill_missing_dates" I think this alternative solution may accomplish what you are trying to achieve.

import pandas as pd

df = pd.DataFrame({
    'date': ["2020-01-01 00:01:00", "2020-01-01 00:02:00", "2020-01-01 01:00:00", "2020-01-01 02:00:00",
             "2020-01-01 00:04:00", "2020-01-01 00:05:00",
             "2020-01-03 00:01:00", "2020-01-03 00:02:00", "2020-01-03 01:00:00", "2020-01-03 02:00:00",
             "2020-01-03 00:04:00", "2020-01-03 00:05:00",
            ],
    'station': ["a","a","a","a","b", "b", "a", "a", "a", "a", "b", "b"],
    'data': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

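def resampler(x):
    # Sum one station's rows into daily buckets; days with no rows sum to 0.
    return x.set_index('date').resample('D').sum()

df['date'] = pd.to_datetime(df['date'])
multipass = pd.MultiIndex.from_frame(df[["date", "station"]])
df = df.set_index(["date", "station"])
df = df.reindex(multipass)
# Move 'date' back to a column, then resample each station separately.
result = df.reset_index(level=0).groupby(level=0).apply(resampler)
print(result)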

The result fills in the missing dates with 0s:

                    data
station date
a       2020-01-01    10
        2020-01-02     0
        2020-01-03    34
b       2020-01-01    11
        2020-01-02     0
        2020-01-03    23
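
As for the likely cause: the traceback shows the error being raised on the resample('D').asfreq() line, before the MultiIndex reindex is ever reached. asfreq() reindexes onto a regular daily index, and that only works when the index labels are unique. If several MASDIV/STATION rows share the same TUNING_EVNT_START_DT, the date index has repeats. Below is a minimal sketch of the original function that builds the calendar with pd.date_range instead. The sample frame is made up for illustration, and the sketch assumes the column holds plain dates (no time component) and that each (date, MASDIV, STATION) combination occurs at most once.

```python
import pandas as pd

def fill_missing_dates(df):
    """Sketch: build the daily calendar with pd.date_range instead of resample().asfreq()."""
    df = df.copy()
    df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])
    # date_range only needs the first and last day, so repeated dates are harmless here.
    # Assumption: the values are plain dates with no time-of-day component.
    dates = pd.date_range(df['TUNING_EVNT_START_DT'].min(),
                          df['TUNING_EVNT_START_DT'].max(), freq='D')
    masdiv = df['MASDIV'].unique()
    station = df['STATION'].unique()
    keys = ['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']
    idx = pd.MultiIndex.from_product((dates, masdiv, station), names=keys)
    # Assumption: each (date, MASDIV, STATION) combination occurs at most once.
    # If it can repeat, aggregate the rows first, or this reindex fails the same way.
    return df.set_index(keys).reindex(idx, fill_value=0).reset_index()


# Hypothetical sample: two stations share each date, and 2020-01-02 is missing.
sample = pd.DataFrame({
    'TUNING_EVNT_START_DT': ['2020-01-01', '2020-01-01', '2020-01-03', '2020-01-03'],
    'MASDIV': ['m1', 'm1', 'm1', 'm1'],
    'STATION': ['a', 'b', 'a', 'b'],
    'TRANSACTIONS': [5, 7, 2, 4],
})

# The date column on its own contains repeats, which is what asfreq() cannot reindex.
has_repeated_dates = pd.to_datetime(sample['TUNING_EVNT_START_DT']).duplicated().any()

filled = fill_missing_dates(sample)
print(has_repeated_dates)
print(filled)
```

Running the same duplicated().any() check on your real TUNING_EVNT_START_DT column should confirm whether repeated dates are what trips up asfreq() in your pipeline.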