Scikit-Learn Linear Regression using Datetime Values and forecasting

Question

Below is a sample of the dataset.

row_id	datetime	energy
1	2008-03-01 00:00:00	1259.985563
2	2008-03-01 01:00:00	1095.541500
3	2008-03-01 02:00:00	1056.247500
4	2008-03-01 03:00:00	1034.742000
5	2008-03-01 04:00:00	1026.334500

The dataset has a datetime column (dtype object) and an energy column holding the consumption for that hour (dtype float64). I want to predict the energy using the datetime column as the single feature.

I used the following code:

train['datetime'] = pd.to_datetime(train['datetime'])
X = train.iloc[:,0]
y = train.iloc[:,-1]

I could not pass the single feature to fit as a Series; I got the following error:

ValueError: Expected 2D array, got 1D array instead:
array=['2008-03-01T00:00:00.000000000' '2008-03-01T01:00:00.000000000'
 '2008-03-01T02:00:00.000000000' ... '2018-12-31T21:00:00.000000000'
 '2018-12-31T22:00:00.000000000' '2018-12-31T23:00:00.000000000'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or  
array.reshape(1, -1) if it contains a single sample.

So I converted their shapes as suggested.

X = np.array(X).reshape(-1,1)
y = np.array(y).reshape(-1,1)

from sklearn.linear_model import LinearRegression
model_1 = LinearRegression()
model_1.fit(X,y)

test = pd.to_datetime(test['datetime'])
test = np.array(test).reshape(-1,1)

predictions = model_1.predict(test)

The LinearRegression object fitted the feature X and target y without raising any error. But when I passed the test data to the predict method, it threw the following error.

TypeError: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[float64]'>. 
This means that no common DType exists for the given inputs. 
For example they cannot be stored in a single array unless the dtype is `object`. 
The full list of DTypes is: (<class 'numpy.dtype[datetime64]'>, <class 'numpy.dtype[float64]'>)

I can't wrap my head around this error. How can I use the datetime values as a single feature, apply simple linear regression to predict the target value, and do time series forecasting? What am I doing wrong?

You cannot train on a datetime format. If you want the model to learn datetime features, consider splitting it into day, month, weekday, weekofyear, hour, etc. to learn patterns with seasonality.
– Azhar Khan, Commented Nov 18, 2022 at 9:56

Azhar Khan · Accepted Answer · 2022-11-19 03:20:51Z

You cannot train on a datetime format. If you want the model to learn datetime features, consider splitting it into day, month, weekday, weekofyear, hour, etc. to learn patterns with seasonality:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.DataFrame(data=[["2008-03-01 00:00:00",1259.985563],["2008-03-01 01:00:00",1095.541500],["2008-03-01 02:00:00",1056.247500],["2008-03-01 03:00:00",1034.742000],["2008-03-01 04:00:00",1026.334500]], columns=["datetime","energy"])
df["datetime"] = pd.to_datetime(df["datetime"])
features = ["year", "month", "day", "hour", "weekday", "weekofyear", "quarter"]
df[features] = df.apply(lambda row: pd.Series({"year":row.datetime.year, "month":row.datetime.month, "day":row.datetime.day, "hour":row.datetime.hour, "weekday":row.datetime.weekday(), "weekofyear":row.datetime.weekofyear, "quarter":row.datetime.quarter }), axis=1)

X = df[features]
y = df[["energy"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(mean_squared_error(y_test, y_pred))
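
If a single numeric time feature is still wanted, as in the question, one option is to convert each timestamp into the number of hours elapsed since a reference point. The following is only a minimal sketch: the `hours_since` helper, the choice of the first training timestamp as `origin`, and the two future timestamps are illustrative assumptions, not part of the original data or answer. A model built this way can only capture a straight-line trend over time, not daily or weekly seasonality.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame(
    data=[
        ["2008-03-01 00:00:00", 1259.985563],
        ["2008-03-01 01:00:00", 1095.541500],
        ["2008-03-01 02:00:00", 1056.247500],
        ["2008-03-01 03:00:00", 1034.742000],
        ["2008-03-01 04:00:00", 1026.334500],
    ],
    columns=["datetime", "energy"],
)
train["datetime"] = pd.to_datetime(train["datetime"])

# Reference point for the elapsed-time feature
# (assumption: the first training timestamp)
origin = train["datetime"].min()


def hours_since(timestamps, origin):
    """Hypothetical helper: datetime Series -> one-column float DataFrame
    holding the hours elapsed since `origin`."""
    hours = (timestamps - origin).dt.total_seconds() / 3600.0
    return hours.to_frame(name="hours_elapsed")


X = hours_since(train["datetime"], origin)  # 2D, numeric
y = train["energy"]

model = LinearRegression()
model.fit(X, y)

# Illustrative future timestamps (not from the original dataset)
future = pd.Series(pd.to_datetime(["2008-03-01 05:00:00", "2008-03-01 06:00:00"]))

# The same origin must be reused so train and test share one time scale
predictions = model.predict(hours_since(future, origin))
print(predictions)
```

Reusing the same `origin` for the training and prediction data matters here, because otherwise the two sets would be measured on different time scales.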

Gael Varoquaux · Accepted Answer · 2025-03-15 07:42:50Z

Skrub comes with a handy transformer to do this, the DatetimeEncoder:

# %%
# Our data
import pandas as pd
df = pd.DataFrame(data=[["2008-03-01 00:00:00",1259.985563],["2008-03-01 01:00:00",1095.541500],["2008-03-01 02:00:00",1056.247500],["2008-03-01 03:00:00",1034.742000],["2008-03-01 04:00:00",1026.334500]], columns=["datetime","energy"])
df["datetime"] = pd.to_datetime(df["datetime"])

# %%
# Turn it to a numerical matrix
from skrub import DatetimeEncoder
dt_encoder = DatetimeEncoder()
X = dt_encoder.fit_transform(df["datetime"])
y = df[["energy"]]

# %%
# Do machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))

The TableVectorizer helps do this in a more automated way, and can handle dataframes with multiple columns of different types:

from skrub import TableVectorizer
from sklearn.pipeline import make_pipeline

# Turn the LinearRegression into a model that can readily fit dataframes
df_model = make_pipeline(TableVectorizer(), LinearRegression())

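# Remove the target from df:
X_df = df.drop(columns=["energy"])

# Split the raw dataframe so the pipeline is scored on held-out rows
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2)
df_model.fit(X_train, y_train)
y_pred = df_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))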
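
The snippets above split the rows at random with `train_test_split`. For forecasting, it may be more representative to keep the rows in time order, so the model is only evaluated on timestamps later than the ones it was trained on. Below is a minimal sketch using scikit-learn's `shuffle=False` option and `TimeSeriesSplit`. The synthetic hourly series and the `hour` and `dayofweek` features are illustrative assumptions, not the asker's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Synthetic hourly series (illustrative only): 10 days with a daily cycle
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"datetime": pd.date_range("2008-03-01", periods=240, freq=pd.Timedelta(hours=1))}
)
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek
df["energy"] = (
    1000 + 100 * np.sin(2 * np.pi * df["hour"] / 24) + rng.normal(0, 10, len(df))
)

X = df[["hour", "dayofweek"]]
y = df["energy"]

# shuffle=False keeps the most recent rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))

# TimeSeriesSplit yields several folds in which the test rows
# always come after the training rows
fold_errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    fold_model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_pred = fold_model.predict(X.iloc[test_idx])
    fold_errors.append(mean_squared_error(y.iloc[test_idx], fold_pred))
print(fold_errors)
```

This assumes the dataframe is already sorted by `datetime`; if it is not, sorting it first would be needed for the split to respect time order.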
