## Problem Description
I'm working on a time series dataset of sensor readings collected every 5 minutes over 3 months. The dataset has approximately 15% missing values scattered throughout, but some gaps are as large as 2-3 hours.
## Current Approach
I'm currently using forward fill to handle missing values:

```python
import pandas as pd

# Note: fillna(method='ffill') is deprecated in recent pandas; use ffill() directly.
df['sensor_reading'] = df['sensor_reading'].ffill()
```
## The Issue
Forward filling works for small gaps (5-10 minutes), but for larger gaps (1-3 hours), it creates unrealistic flat patterns that don't represent actual sensor behavior. This is affecting my downstream analysis and predictions.
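To quantify what counts as a "large" gap, I group consecutive NaNs into runs and measure each run's length. Here's a minimal sketch on toy data (a 5-minute grid with one 15-minute gap; the values are made up, not from my real dataset):

```python
import pandas as pd
import numpy as np

# Toy series on a 5-minute grid with one run of 3 consecutive NaNs.
idx = pd.date_range("2024-01-01", periods=8, freq="5min")
s = pd.Series([23.5, 23.7, np.nan, np.nan, np.nan, 25.1, 25.0, 24.8], index=idx)

# Assign an id to each run of consecutive NaNs, then count run lengths.
is_na = s.isna()
gap_id = (is_na != is_na.shift()).cumsum()[is_na]
gap_sizes = gap_id.value_counts() * 5  # run length in samples -> minutes

print(gap_sizes.tolist())  # [15]
```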
## What I've Tried
- **Interpolation**: works better but still struggles with large gaps
- **Forward fill with a limit**: leaves NaN values for large gaps
- **Mean/median imputation**: doesn't capture temporal patterns
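For reference, these are roughly the variants I tried, on the same toy series as above (the `limit=2` cutoff of 10 minutes is just an example value):

```python
import pandas as pd
import numpy as np

idx = pd.date_range("2024-01-01", periods=8, freq="5min")
s = pd.Series([23.5, 23.7, np.nan, np.nan, np.nan, 25.1, 25.0, 24.8], index=idx)

# 1. Time-based interpolation: smooth, but just a straight line across long gaps.
interp = s.interpolate(method="time")

# 2. Forward fill with a limit: anything beyond 2 samples (10 min) stays NaN.
limited = s.ffill(limit=2)

# 3. Mean imputation: fills everything but ignores temporal structure entirely.
mean_filled = s.fillna(s.mean())
```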
## Sample Data
```text
timestamp              sensor_reading
2024-01-01 00:00:00    23.5
2024-01-01 00:05:00    23.7
2024-01-01 00:10:00    NaN
2024-01-01 00:15:00    NaN
2024-01-01 00:20:00    NaN
...                    (20 more missing values)
2024-01-01 02:00:00    25.1
```
## Question
What are the best practices for handling large gaps in time series sensor data? Should I:
1. Use different imputation methods based on gap size?
2. Flag large gaps and exclude them from analysis?
3. Use a predictive model (ARIMA/LSTM) to fill gaps?
4. Consider the data quality too poor and collect new data?
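To make the first option concrete, here's the gap-size-aware sketch I'm considering: interpolate runs up to `MAX_GAP` samples, leave longer runs as NaN, and keep a flag column so downstream analysis can exclude them. The `MAX_GAP = 2` threshold (10 minutes) is an arbitrary placeholder:

```python
import pandas as pd
import numpy as np

idx = pd.date_range("2024-01-01", periods=10, freq="5min")
s = pd.Series([23.5, 23.7, np.nan, 23.9, np.nan, np.nan,
               np.nan, np.nan, 25.1, 25.0], index=idx)

MAX_GAP = 2  # fill gaps up to 2 samples (10 min); flag anything longer

# Size of the NaN run each missing point belongs to.
is_na = s.isna()
run_id = (is_na != is_na.shift()).cumsum()
run_size = run_id.map(run_id[is_na].value_counts())

# Interpolate small gaps only; large gaps stay NaN and get a flag.
small_gap = is_na & (run_size <= MAX_GAP)
filled = s.interpolate(method="time").where(small_gap | ~is_na)
large_gap_flag = is_na & ~small_gap
```

Here the single-NaN gap gets interpolated while the 4-sample (20-minute) run stays NaN with `large_gap_flag` set, so it can be masked out later.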
## Environment
- Python 3.10
- Pandas 2.0.3
- Dataset size: ~26,000 rows
Any advice or references to research papers/best practices would be appreciated!