Below is a sample of the dataset.

row_id datetime energy
1 2008-03-01 00:00:00 1259.985563
2 2008-03-01 01:00:00 1095.541500
3 2008-03-01 02:00:00 1056.247500
4 2008-03-01 03:00:00 1034.742000
5 2008-03-01 04:00:00 1026.334500

The dataset holds hourly energy consumption. The datetime column is stored as object dtype and the energy column as float64. I want to predict the energy using the datetime column as the single feature.

I used the following code

train['datetime'] = pd.to_datetime(train['datetime'])
X = train.iloc[:,0]
y = train.iloc[:,-1]

I could not pass the single feature as a Series to the fit method; I got the following error.

ValueError: Expected 2D array, got 1D array instead:
array=['2008-03-01T00:00:00.000000000' '2008-03-01T01:00:00.000000000'
 '2008-03-01T02:00:00.000000000' ... '2018-12-31T21:00:00.000000000'
 '2018-12-31T22:00:00.000000000' '2018-12-31T23:00:00.000000000'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or  
array.reshape(1, -1) if it contains a single sample.

So I converted their shapes as suggested.

 X = np.array(X).reshape(-1,1)
 y = np.array(y).reshape(-1,1)
 
 from sklearn.linear_model import LinearRegression
 model_1 = LinearRegression()
 model_1.fit(X,y)
 
 test = pd.to_datetime(test['datetime'])
 test = np.array(test).reshape(-1,1)
 
 predictions = model_1.predict(test)

The LinearRegression object fitted the feature X and target y without raising any error. But when I passed the test data to the predict method, it threw the following error.

TypeError: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[float64]'>. 
This means that no common DType exists for the given inputs. 
For example they cannot be stored in a single array unless the dtype is `object`. 
The full list of DTypes is: (<class 'numpy.dtype[datetime64]'>, <class 'numpy.dtype[float64]'>)

I can't make sense of this error. How can I use the datetime values as a single feature, apply simple linear regression to predict the target, and do time series forecasting? What am I doing wrong?

    You can not train on a datetime format. If you want the model to learn datetime features then consider splitting it into day, month, weekday, weekofyear, hour etc to learn patterns with seasonality. Commented Nov 18, 2022 at 9:56

2 Answers

You cannot train directly on a datetime column; scikit-learn estimators expect numeric features. If you want the model to learn from the datetime, split it into components such as year, month, day, weekday, week of year, and hour so it can pick up seasonal patterns:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sample data from the question
df = pd.DataFrame(data=[["2008-03-01 00:00:00",1259.985563],["2008-03-01 01:00:00",1095.541500],["2008-03-01 02:00:00",1056.247500],["2008-03-01 03:00:00",1034.742000],["2008-03-01 04:00:00",1026.334500]], columns=["datetime","energy"])
df["datetime"] = pd.to_datetime(df["datetime"])

# Derive numeric calendar features with the vectorized .dt accessor
dt = df["datetime"].dt
df["year"] = dt.year
df["month"] = dt.month
df["day"] = dt.day
df["hour"] = dt.hour
df["weekday"] = dt.weekday
df["weekofyear"] = dt.isocalendar().week.astype(int)
df["quarter"] = dt.quarter
features = ["year", "month", "day", "hour", "weekday", "weekofyear", "quarter"]

X = df[features]
y = df[["energy"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(mean_squared_error(y_test, y_pred))
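
If a single numeric feature is still wanted, as the question asks, one option is to convert each timestamp to the number of hours elapsed since a fixed origin. The sketch below is only an illustration under assumptions: the helper `to_hours`, the choice of the first training timestamp as `origin`, and the two test timestamps are made up for the example. Note that a single linear trend like this cannot capture daily or weekly seasonality, which is why the calendar features above are usually more useful.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample data from the question
train = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2008-03-01 00:00:00", "2008-03-01 01:00:00", "2008-03-01 02:00:00",
        "2008-03-01 03:00:00", "2008-03-01 04:00:00",
    ]),
    "energy": [1259.985563, 1095.541500, 1056.247500, 1034.742000, 1026.334500],
})

# Assumption: use the first training timestamp as the reference point
origin = train["datetime"].min()

def to_hours(dt: pd.Series) -> np.ndarray:
    """Hours elapsed since `origin`, as a float array of shape (n_samples, 1)."""
    return ((dt - origin) / pd.Timedelta(hours=1)).to_numpy().reshape(-1, 1)

X_train = to_hours(train["datetime"])
model = LinearRegression().fit(X_train, train["energy"])

# Hypothetical test timestamps; they must be converted with the same origin
test = pd.DataFrame({
    "datetime": pd.to_datetime(["2008-03-01 05:00:00", "2008-03-01 06:00:00"]),
})
predictions = model.predict(to_hours(test["datetime"]))
print(predictions)
```

The same `origin` has to be reused for the test data; otherwise the feature values for train and test would not be on the same scale.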

Skrub comes with a handy transformer to do this, the DatetimeEncoder:

# %%
# Our data
import pandas as pd
df = pd.DataFrame(data=[["2008-03-01 00:00:00",1259.985563],["2008-03-01 01:00:00",1095.541500],["2008-03-01 02:00:00",1056.247500],["2008-03-01 03:00:00",1034.742000],["2008-03-01 04:00:00",1026.334500]], columns=["datetime","energy"])
df["datetime"] = pd.to_datetime(df["datetime"])

# %%
# Turn it to a numerical matrix
from skrub import DatetimeEncoder
dt_encoder = DatetimeEncoder()
X = dt_encoder.fit_transform(df["datetime"])
y = df[["energy"]]

# %%
# Do machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(mean_squared_error(y_test, y_pred))

The TableVectorizer helps do this in a more automated way, and can handle dataframes with multiple columns of different types:

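from skrub import TableVectorizer
from sklearn.pipeline import make_pipeline

# Turn the LinearRegression into a model that can readily fit dataframes
df_model = make_pipeline(TableVectorizer(), LinearRegression())

# Remove the target from df, then split the raw dataframe
X_df = df.drop(columns=["energy"])
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2)

# Fit and predict with the pipeline itself, not the earlier bare model
df_model.fit(X_train, y_train)
y_pred = df_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))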
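
Since the goal is forecasting, it may also be worth keeping the split chronological: `train_test_split` shuffles rows by default, so the model can end up being evaluated on hours that precede its training data. A minimal sketch, assuming just two illustrative calendar features (`hour` and `weekday`) and passing `shuffle=False` so the last rows form the test set:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Sample data from the question, already in time order
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2008-03-01 00:00:00", "2008-03-01 01:00:00", "2008-03-01 02:00:00",
        "2008-03-01 03:00:00", "2008-03-01 04:00:00",
    ]),
    "energy": [1259.985563, 1095.541500, 1056.247500, 1034.742000, 1026.334500],
})

# Illustrative feature set; any numeric encoding of the datetime would do
X = pd.DataFrame({
    "hour": df["datetime"].dt.hour,
    "weekday": df["datetime"].dt.weekday,
})
y = df["energy"]

# shuffle=False keeps time order: earliest 80% for training, latest 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(mse)
```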
