I am using Leave-One-Out Cross-Validation on a Linear Regression model with 8869 observations. As a result of the following:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()

list_Rs = cross_val_score(reg, X_34_const, y_34,
                          cv = len(y_34),
                          scoring = 'r2')

I should obtain a NumPy array of 8869 values between 0 and 1, each with 8 decimal places. The problem is that, when producing the result, Python automatically rounds all such values to 0.0:

array([0., 0., 0., ..., 0., 0., 0.])

whereas if, for instance, I use a 2-fold cross-validation (which makes list_Rs a NumPy array with 2 values), it returns the correct, unrounded values:

list_Rs = cross_val_score(reg, X_34_const, y_34,
                      cv = 2, 
                      scoring = 'r2')

which, printed, is:

array([0.16496198, 0.18115719])

This is not simply a printing-representation problem, since, for instance:

print(list_Rs[3] == 0)

returns True. This is a major problem for me since, in my computations, I will then need to put the values of list_Rs in the denominator of a fraction!

How can I solve the problem so that the values in my 8869-dimensional array are not automatically rounded?

Many thanks and I look forward to hearing from you.

1 Answer

Neither Python nor NumPy is doing any rounding here: scikit-learn's r2_score scoring function (which is invoked under the hood when calling cross_val_score with scoring='r2') is returning actual zeros.

That's because with leave-one-out, each validation set consists of a single sample. So for each fold of your cross-validation, r2_score is called with a single observed value and a single predicted value for that observation, and in that situation it produces zero. For example:

>>> from sklearn.metrics import r2_score
>>> import numpy as np
>>> y_true = np.array([2.3])
>>> y_pred = np.array([2.1])
>>> r2_score(y_true, y_pred)
0.0

Here's the portion of the implementation where r2_score ends up (somewhat arbitrarily) returning zero when evaluated on a single data point, assuming that the predicted value isn't an exact match for the observed value.

Arguably, r2_score should be either raising an exception or producing negative infinity rather than zero here: the coefficient of determination uses the variance of the observed data as a normalising factor, and when there's only a single observation, that variance is zero, so the formula for the R2 score involves a division by zero. There's some discussion of this in a scikit-learn bug report.
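
To see why the denominator vanishes: the R^2 score is 1 - SS_res / SS_tot, where SS_tot is the sum of squared deviations of the observed values from their mean. As a minimal sketch, reusing the y_true array from the example above:

>>> # With a single observation, the mean equals the observation itself,
>>> # so the total sum of squares (the R^2 denominator) is exactly zero.
>>> print(np.sum((y_true - y_true.mean()) ** 2))
0.0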


1 Comment

Dear Mark, many thanks for your answer. You're indeed right: with one single observation the total sum of squares is clearly 0, which should then yield a negative-infinity R^2 or, more generally, an error. Apart from the fact that even in regular situations R^2 is already a bad and widely misused measure, I think I will simply consider the MSE of each observation, or the R^2 of the entire regression, as a proxy for goodness of fit in my subsequent computations.
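
For reference, a minimal sketch of that MSE-based alternative, assuming the same reg, X_34_const and y_34 from the question (scikit-learn exposes MSE as a negated score, so the sign has to be flipped back):

# Per-fold (here: per-observation) squared error under leave-one-out CV.
# scikit-learn returns the negated MSE, so negate it to recover the MSE.
neg_mse = cross_val_score(reg, X_34_const, y_34,
                          cv = len(y_34),
                          scoring = 'neg_mean_squared_error')
per_observation_mse = -neg_mse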
