I am using Leave-One-Out Cross-Validation on a Linear Regression model with 8869 observations. As a result of the following:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()

list_Rs = cross_val_score(reg, X_34_const, y_34,
                          cv = len(y_34),
                          scoring = 'r2')

I should obtain a NumPy array of 8869 values between 0 and 1, each with 8 decimal places. The problem is that, when producing the result, Python automatically rounds all such values to 0.0:

array([0., 0., 0., ..., 0., 0., 0.])

whereas if, for instance, I use a 2-fold cross-validation (which makes list_Rs a NumPy array with 2 values), it returns the correct, unrounded values:

list_Rs = cross_val_score(reg, X_34_const, y_34,
                      cv = 2, 
                      scoring = 'r2')

which, printed, is:

array([0.16496198, 0.18115719])

This is not simply a printing-representation problem, since, for instance:

print(list_Rs[3] == 0)

returns True. This is a major problem for me since, in my computations, I will then need to put the values of list_Rs in the denominator of a fraction!

How can I solve the problem so that the values in my 8869-dimensional array are not automatically rounded?

Many thanks and I look forward to hearing from you.

1 Answer

Neither Python nor NumPy is doing any rounding here: scikit-learn's r2_score scoring function (which is invoked under the hood when calling cross_val_score with scoring='r2') is returning actual zeros.

That's because with leave-one-out, each validation set consists of a single sample. So for each fold of your cross-validation, r2_score is called with a single observed value and a single predicted value for that observation, and in that situation it produces zero. For example:

>>> from sklearn.metrics import r2_score
>>> import numpy as np
>>> y_true = np.array([2.3])
>>> y_pred = np.array([2.1])
>>> r2_score(y_true, y_pred)
0.0

Here's the portion of the implementation where r2_score ends up (somewhat arbitrarily) returning zero when evaluated on a single data point, assuming that the predicted value isn't an exact match for the observed value.

Arguably, r2_score should be either raising an exception or producing negative infinity rather than zero here: the coefficient of determination uses the variance of the observed data as a normalising factor, and when there's only a single observation, that variance is zero, so the formula for the R2 score involves a division by zero. There's some discussion of this in a scikit-learn bug report.
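
To see why the denominator vanishes: the R^2 score is 1 - SS_res / SS_tot, where SS_tot is the sum of squared deviations of the observed values from their mean. As a minimal sketch, reusing the y_true array from the example above:

>>> # With a single observation, the mean equals the observation itself,
>>> # so the total sum of squares (the R^2 denominator) is exactly zero.
>>> print(np.sum((y_true - y_true.mean()) ** 2))
0.0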


1 Comment

Dear Mark, many thanks for your answer. You're indeed right: with one single observation the total sum of squares is clearly 0, which should then yield a negative-infinity R^2 or, more generally, an error. Apart from the fact that even in regular situations R^2 is already a bad and widely misused measure, I think I will simply consider the MSE of each observation, or the R^2 of the entire regression, as a proxy for goodness of fit in my subsequent computations.
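
For reference, a minimal sketch of that MSE-based alternative, assuming the same reg, X_34_const and y_34 from the question (scikit-learn exposes MSE as a negated score, so the sign has to be flipped back):

# Per-fold (here: per-observation) squared error under leave-one-out CV.
# scikit-learn returns the negated MSE, so negate it to recover the MSE.
neg_mse = cross_val_score(reg, X_34_const, y_34,
                          cv = len(y_34),
                          scoring = 'neg_mean_squared_error')
per_observation_mse = -neg_mse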
