
I start with the example given for ROC Curve with Visualization API:

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
y = y == 2  # binarize: class 2 vs. the rest

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(X_train, y_train)
ax = plt.gca()
# from_estimator computes the curve from the fitted model's continuous scores
rfc_disp = RocCurveDisplay.from_estimator(rfc, X_test, y_test, ax=ax, alpha=0.8)
print(rfc_disp.roc_auc)

with the answer 0.9823232323232323.

Following this immediately with

from sklearn.metrics import roc_auc_score

y_pred = rfc.predict(X_test)  # hard 0/1 class labels, not probabilities
auc = roc_auc_score(y_test, y_pred)
print(auc)

I obtain 0.928030303030303, which is manifestly different.

Interestingly, I obtain the same result with the ROC Curve Visualization API if I use the predicted labels:

rfc_disp1 = RocCurveDisplay.from_predictions(y_test, y_pred)  # y_pred holds hard labels
print(rfc_disp1.roc_auc)

However, the area under the first curve does match the former result when I compute it with trapezoid integration:

import numpy as np

# trapezoid rule over the curve stored on the display object
I = np.sum(np.diff(rfc_disp.fpr) * (rfc_disp.tpr[1:] + rfc_disp.tpr[:-1]) / 2.0)
print(I)
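
(If I am not mistaken, NumPy's built-in trapezoid rule gives the same number; note np.trapz was renamed np.trapezoid in NumPy 2.0:)

# same integral via NumPy's built-in trapezoid rule
print(np.trapz(rfc_disp.tpr, rfc_disp.fpr))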

What is the reason for this discrepancy? I assume it is related to how the two functions calculate the AUC (perhaps a different way of smoothing the curve?). This brings me to a more general question: how is the ROC curve obtained for a random forest in sklearn? What parameter/threshold is varied to obtain different predictions? Are these just the scores of the separate trees in the forest?

1 Answer


You should use predict_proba for the AUC: predict returns hard 0/1 class labels, which collapses the ROC curve to a single operating point.

Try this:

from sklearn.metrics import roc_auc_score

# use the probability of the positive class ([:, 1]), not hard labels
auc = roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])
print(auc)
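
As a quick check (a sketch reusing rfc and rfc_disp from the question): RocCurveDisplay.from_estimator scores the test set with the estimator's predict_proba (falling back to decision_function when probabilities are unavailable), so the probability-based AUC reproduces the display's value:

# the probability-based AUC matches the value stored on the display
proba_auc = roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])
print(proba_auc)         # ~0.9823
print(rfc_disp.roc_auc)  # same number as printed in the question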

Comments

Thanks, that seems correct. Any pointers to how the calculation of the ROC curve is implemented in sklearn?
You're welcome. It is calculated from sensitivity and 1 - specificity; sklearn does not apply a different method. Sensitivity and 1 - specificity are computed and plotted for a range of thresholds. As the threshold changes, the model's predicted class probability is what matters, which is why we look at predict_proba. As its name implies, the AUC score is the area under that curve.
What I don't understand is which parameter serves as the threshold in the case of a random forest...
I couldn't see a threshold parameter in your code. If you mean the alpha parameter, it isn't related to the threshold; I believe it is just used for plotting (line transparency).
There is no such parameter in my code. But the ROC curve is a parametric dependence of TPR and FPR on some threshold; if all parameters are fixed, TPR and FPR are single numbers. In other words, a ROC curve for a random forest seems meaningless...
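
To make the comment thread concrete, here is a sketch of where the thresholds come from (it reuses rfc, X_test and y_test from the question). sklearn's roc_curve sweeps a threshold over the continuous scores; for a random forest the score is predict_proba, i.e. the average of the per-tree class probabilities, so the thresholds are simply the distinct probability values the forest produces. No model parameter is changed:

from sklearn.metrics import roc_curve

# continuous scores: averaged class-1 probabilities over the trees
scores = rfc.predict_proba(X_test)[:, 1]

# roc_curve tries each distinct score value as a cutoff:
# predict positive where scores >= t, giving one (FPR, TPR) point per t
fpr, tpr, thresholds = roc_curve(y_test, scores)
for t, fp, tp in zip(thresholds, fpr, tpr):
    print(f"threshold={t:.2f}  FPR={fp:.3f}  TPR={tp:.3f}")

# hard predict() corresponds to the single cutoff 0.5,
# so roc_auc_score(y_test, y_pred) sees only one operating point

With n_estimators=10 and fully grown trees, the averaged probabilities are typically multiples of 0.1, which is why the plotted curve has only a handful of points.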