Problem Description

I'm working on a time series dataset of sensor readings collected every 5 minutes over 3 months. The dataset has approximately 15% missing values scattered throughout, but some gaps are as large as 2-3 hours.

Current Approach

I'm currently using forward fill (ffill()) to handle missing values:

python

import pandas as pd

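# Forward-fill every gap, regardless of its length.
# (fillna(method='ffill') still works on pandas 2.0.3 but is deprecated in later releases.)
df['sensor_reading'] = df['sensor_reading'].ffill()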

The Issue

Forward filling works for small gaps (5-10 minutes), but for larger gaps (1-3 hours), it creates unrealistic flat patterns that don't represent actual sensor behavior. This is affecting my downstream analysis and predictions.
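To put numbers on "small" versus "large" gaps, the length of each run of consecutive NaN values can be measured. This is a minimal sketch, assuming a Series on a regular 5-minute grid; the helper name `gap_lengths` and the toy readings are illustrative, not taken from my dataset.

```python
import numpy as np
import pandas as pd


def gap_lengths(series: pd.Series) -> pd.Series:
    """Return the length (in samples) of each run of consecutive NaN values."""
    is_missing = series.isna()
    # A new run starts wherever the missing/non-missing state changes.
    run_id = (is_missing != is_missing.shift()).cumsum()
    # Keep only the samples that are missing and count them per run.
    return is_missing[is_missing].groupby(run_id[is_missing]).size()


# Toy data (hypothetical values) on a 5-minute grid.
index = pd.date_range("2024-01-01", periods=10, freq="5min")
readings = pd.Series(
    [23.5, np.nan, 23.9, np.nan, np.nan, np.nan, 24.4, 24.5, np.nan, 24.8],
    index=index,
)

lengths = gap_lengths(readings)
gap_minutes = lengths * 5  # assumes the 5-minute sampling interval
print(gap_minutes.describe())
```

A histogram of `gap_minutes` would show how much of the 15% missing data sits in short gaps and how much in the multi-hour ones.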

What I've Tried

  1. Interpolation - Works better but still struggles with large gaps

  2. Forward fill with limit - Leaves NaN values for large gaps

  3. Mean/median imputation - Doesn't capture temporal patterns
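For reference, the three attempts look roughly like the sketch below. The exact parameters (time-based interpolation, a limit of 2 samples) and the toy values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values): one 1-sample gap and one 4-sample gap.
index = pd.date_range("2024-01-01", periods=8, freq="5min")
s = pd.Series([23.5, np.nan, 23.9, np.nan, np.nan, np.nan, np.nan, 25.1], index=index)

# 1. Interpolation: draws a straight line across every gap, however long.
interpolated = s.interpolate(method="time")

# 2. Forward fill with a limit: fills at most 2 consecutive samples,
#    so the tail of a longer gap stays NaN.
ffilled = s.ffill(limit=2)

# 3. Median imputation: one constant for every gap, ignoring time order.
median_filled = s.fillna(s.median())

print(pd.DataFrame({"interp": interpolated, "ffill2": ffilled, "median": median_filled}))
```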

Sample Data

timestamp           sensor_reading
2024-01-01 00:00:00    23.5
2024-01-01 00:05:00    23.7
2024-01-01 00:10:00    NaN
2024-01-01 00:15:00    NaN
2024-01-01 00:20:00    NaN
... (19 more missing values)
2024-01-01 02:00:00    25.1
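A small frame with the same shape as the sample can be rebuilt for experiments. This sketch assumes a perfectly regular 5-minute grid and uses only the three readings shown above; the variable name `sample` is arbitrary.

```python
import numpy as np
import pandas as pd

# Regular 5-minute grid covering the sample window (00:00 through 02:00).
index = pd.date_range(
    "2024-01-01 00:00", "2024-01-01 02:00", freq="5min", name="timestamp"
)

# Start with every reading missing, then set the three known values.
sample = pd.DataFrame({"sensor_reading": np.nan}, index=index)
sample.loc[pd.Timestamp("2024-01-01 00:00"), "sensor_reading"] = 23.5
sample.loc[pd.Timestamp("2024-01-01 00:05"), "sensor_reading"] = 23.7
sample.loc[pd.Timestamp("2024-01-01 02:00"), "sensor_reading"] = 25.1

print(sample["sensor_reading"].isna().sum(), "missing of", len(sample), "rows")
```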

Question

What are the best practices for handling large gaps in time series sensor data? Should I:

  • Use different imputation methods based on gap size?

  • Flag large gaps and exclude them from analysis?

  • Use a predictive model (ARIMA/LSTM) to fill gaps?

  • Consider the data quality too poor and collect new data?
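To make the first two options concrete, here is one possible way to combine them: interpolate only gaps up to a threshold and leave longer gaps as NaN with a flag column. It is a sketch under stated assumptions, not a recommendation; the function name `fill_short_gaps`, the `max_gap` threshold, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(series: pd.Series, max_gap: int) -> pd.DataFrame:
    """Interpolate gaps of at most `max_gap` samples; keep longer gaps NaN and flag them."""
    is_missing = series.isna()
    # Label each run of consecutive missing / non-missing samples.
    run_id = (is_missing != is_missing.shift()).cumsum()
    # Length of the run that each sample belongs to.
    run_length = run_id.map(run_id.value_counts())
    in_long_gap = is_missing & (run_length > max_gap)

    # Interpolate everything between valid readings, then undo it inside long gaps.
    filled = series.interpolate(method="time", limit_area="inside")
    filled[in_long_gap] = np.nan

    return pd.DataFrame(
        {
            "sensor_reading": filled,
            "in_long_gap": in_long_gap,
            "was_imputed": is_missing & filled.notna(),
        }
    )


# Toy data (hypothetical values): two 1-sample gaps and one 6-sample gap.
index = pd.date_range("2024-01-01", periods=13, freq="5min")
values = [23.5, np.nan, 23.9] + [np.nan] * 6 + [25.1, 25.0, np.nan, 24.8]
result = fill_short_gaps(pd.Series(values, index=index), max_gap=2)
print(result)
```

Masking by run length is used here instead of `interpolate(limit=...)` alone, because `limit` fills the first few samples of a long gap rather than skipping the gap entirely. The flag columns keep imputed and untouched samples distinguishable downstream.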

Environment

  • Python 3.10

  • Pandas 2.0.3

  • Dataset size: ~26,000 rows

Any advice or references to research papers/best practices would be appreciated!

  • There is no generally correct way to handle missing data. It typically depends on the assumptions you make about the data and on what is acceptable for what you're trying to achieve. Commented Nov 15 at 9:20
  • Could you share the dataset, so that its properties can be analysed? Commented Nov 15 at 10:17
  • Fair feedback! My use case: anomaly detection where gaps might signal problems. Should I flag large gaps instead of imputing? @mqqz Commented Nov 17 at 4:53
  • @jlandercy I'd love your input! The dataset is confidential, but I can: (1) share summary statistics and gap distribution plots, (2) run specific analyses you suggest and share the results, or (3) create a similar synthetic dataset. Which would be most useful for your analysis? Commented Nov 17 at 4:55
  • Please don't use generative AI to write or rewrite your Stack Overflow posts. It's not allowed by site policy. Commented Nov 17 at 5:22
