
I was trying to build a linear regression model to predict house prices as a first machine learning project, but I came across negative score values when using cross validation in this code:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
x = df.drop(['MedHouseVal'], axis=1)
y = df['MedHouseVal']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(x_train, y_train)
model.score(x_test, y_test)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, x, y, cv=100)
plt.plot(scores)

I noticed that as I increased cv, the average score decreased. So I plotted the scores and realized that the score takes on negative values at some points. But how can (true predictions)/(sample size) be negative? Is it calculated as (TP + TN - FP - FN)/(sample size)?

[plot of the cross-validation scores, dipping below zero for several folds]

Comments:

  • Think about what metric is being used; it is in the sklearn documentation (search for "score" or "scoring"). For a regressor it is the R^2 score, which can be negative. Commented Jul 8, 2024 at 17:59
  • Maybe you should ask on similar portals: DataScience, CrossValidated, Artificial Intelligence. Commented Jul 8, 2024 at 20:55

1 Answer


Scikit-learn regression scores are calculated using R², and this can be negative if you have a poor fit (see the scikit-learn documentation for details).
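To see how that can happen, here is a toy sketch (made-up numbers, not the housing data): R² is 1 - SS_res/SS_tot, so whenever the model's squared errors exceed those of simply predicting the mean of y, the score drops below zero.

from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [4.0, 3.0, 2.0, 1.0]   # worse than just predicting the mean (2.5)

# R^2 = 1 - SS_res/SS_tot; here SS_res (20.0) > SS_tot (5.0), so R^2 = -3.0
print(r2_score(y_true, y_pred))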

In your code, you initially held out 20% of the data, and model.score(x_test, y_test) gives a score of about 0.59 when I run it.

The function cross_val_score does this for you. By default it uses 5-fold cross validation. The idea of cross validation is to use all of the data for training while still getting an independent check of the results: the data is split into 5 sections, the model is trained with one section (20% of the data) held out, scored on that held-out 20%, and this is repeated for each section. The average across the folds should give a reasonable assessment of the score. Indeed it gives [0.54866323, 0.46820691, 0.55078434, 0.53698703, 0.66051406], which has a mean/std of 0.553 +/- 0.062, in line with the single split mentioned above.
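For reference, this is the call that produced those numbers (a sketch assuming the model, x and y already defined in your code; the exact values may differ slightly on your machine):

from sklearn.model_selection import cross_val_score

# Default 5-fold CV: fit on 4 folds, score (R^2) on the held-out fold, repeat 5 times
scores = cross_val_score(model, x, y, cv=5)
print(scores)                                          # five R^2 values, one per fold
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")   # about 0.553 +/- 0.062 here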

As the number of cv sections increases, each fold gets smaller, so there is a greater risk of a non-representative sample dominating the score for that fold. This leads to more variance in the score results. For example, with cv=100 the result is -0.095 +/- 0.918, which is still statistically in line with the scores above but has far too wide a variance to be helpful in analyzing the data. I suggest keeping the number of folds low enough that the score remains statistically meaningful.
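One way to see the effect directly is to sweep the number of folds and watch the spread grow (a sketch using the same model, x and y; the exact numbers will differ from run to run):

from sklearn.model_selection import cross_val_score

# Compare how the spread of per-fold R^2 scores grows as the number of folds increases
for k in (5, 10, 20, 50, 100):
    s = cross_val_score(model, x, y, cv=k)
    print(f"cv={k:3d}: mean={s.mean():.3f}  std={s.std():.3f}")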

P.S. To reproduce this, I needed to add the following lines at the top of your code:

from sklearn.datasets import fetch_california_housing
import pandas
import matplotlib.pyplot as plt
import numpy as np

# Load the California housing data as the DataFrame df used in your code
california_housing = fetch_california_housing(as_frame=True)
df = california_housing.frame