
I start with the example given for ROC Curve with Visualization API:

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
y = y == 2  # binarize: class 2 vs. the rest

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(X_train, y_train)
ax = plt.gca()
# from_estimator computes the curve from the fitted model's continuous scores
rfc_disp = RocCurveDisplay.from_estimator(rfc, X_test, y_test, ax=ax, alpha=0.8)
print(rfc_disp.roc_auc)

with the answer 0.9823232323232323.

Following this immediately with

from sklearn.metrics import roc_auc_score

y_pred = rfc.predict(X_test)  # hard 0/1 class labels, not probabilities
auc = roc_auc_score(y_test, y_pred)
print(auc)

I obtain 0.928030303030303, which is manifestly different.

Interestingly, I obtain the same result with the ROC Curve Visualization API if I use the predicted labels:

rfc_disp1 = RocCurveDisplay.from_predictions(y_test, y_pred)  # y_pred holds hard labels
print(rfc_disp1.roc_auc)

However, the area under the first curve does match the former result when I compute it with trapezoid integration:

import numpy as np

# trapezoid rule over the curve stored on the display object
I = np.sum(np.diff(rfc_disp.fpr) * (rfc_disp.tpr[1:] + rfc_disp.tpr[:-1]) / 2.0)
print(I)
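
(If I am not mistaken, NumPy's built-in trapezoid rule gives the same number; note np.trapz was renamed np.trapezoid in NumPy 2.0:)

# same integral via NumPy's built-in trapezoid rule
print(np.trapz(rfc_disp.tpr, rfc_disp.fpr))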

What is the reason for this discrepancy? I assume it is related to how the two functions calculate the AUC (perhaps a different way of smoothing the curve?). This brings me to a more general question: how is the ROC curve obtained for a random forest in sklearn? What parameter/threshold is varied to obtain different predictions? Are these just the scores of the separate trees in the forest?

1 Answer


You should use predict_proba for the AUC: predict returns hard 0/1 class labels, which collapses the ROC curve to a single operating point.

Try this:

from sklearn.metrics import roc_auc_score

# use the probability of the positive class ([:, 1]), not hard labels
auc = roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])
print(auc)
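
As a quick check (a sketch reusing rfc and rfc_disp from the question): RocCurveDisplay.from_estimator scores the test set with the estimator's predict_proba (falling back to decision_function when probabilities are unavailable), so the probability-based AUC reproduces the display's value:

# the probability-based AUC matches the value stored on the display
proba_auc = roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])
print(proba_auc)         # ~0.9823
print(rfc_disp.roc_auc)  # same number as printed in the question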

Comments

Thanks, that seems correct. Any pointers to how the calculation of the ROC curve is implemented in sklearn?
You're welcome. It is calculated from sensitivity and 1 - specificity; sklearn does not apply a different method. Sensitivity and 1 - specificity are computed and plotted for a range of thresholds. As the threshold changes, the model's predicted class probability is what matters, which is why we look at predict_proba. As its name implies, the AUC score is the area under that curve.
What I don't understand is which parameter serves as the threshold in the case of a random forest...
I couldn't see a threshold parameter in your code. If you mean the alpha parameter, it isn't related to the threshold; I believe it is just used for plotting (line transparency).
There is no such parameter in my code. But the ROC curve is a parametric dependence of TPR and FPR on some threshold; if all parameters are fixed, TPR and FPR are single numbers. In other words, a ROC curve for a random forest seems meaningless...
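
To make the comment thread concrete, here is a sketch of where the thresholds come from (it reuses rfc, X_test and y_test from the question). sklearn's roc_curve sweeps a threshold over the continuous scores; for a random forest the score is predict_proba, i.e. the average of the per-tree class probabilities, so the thresholds are simply the distinct probability values the forest produces. No model parameter is changed:

from sklearn.metrics import roc_curve

# continuous scores: averaged class-1 probabilities over the trees
scores = rfc.predict_proba(X_test)[:, 1]

# roc_curve tries each distinct score value as a cutoff:
# predict positive where scores >= t, giving one (FPR, TPR) point per t
fpr, tpr, thresholds = roc_curve(y_test, scores)
for t, fp, tp in zip(thresholds, fpr, tpr):
    print(f"threshold={t:.2f}  FPR={fp:.3f}  TPR={tp:.3f}")

# hard predict() corresponds to the single cutoff 0.5,
# so roc_auc_score(y_test, y_pred) sees only one operating point

With n_estimators=10 and fully grown trees, the averaged probabilities are typically multiples of 0.1, which is why the plotted curve has only a handful of points.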