sklearn calibrated classifier with random forest

Question

Scikit has a very useful classifier wrappers called CalibratedClassifer and CalibratedClassifierCV, which try to make sure that the predict_proba function of a classifier really predicts a probability and not just an arbitrary number (albeit perhaps well-ranked) between zero and one.

However, when using random forests it is customary to use oob_decision_function_ to determine the performance on the training data, but this is no longer available when using the the calibrated models. The calibration should therefore work well for new data but not for the training data. How can we evaluate performance on the training data to determine, e.g., overfitting?

pipefish · Accepted Answer · 2018-07-10 06:29:39Z

Apparently there really was no solution to this, and so I made a pull request to scikit-learn.

The problem was that the out-of-bag predictions are created during learning. Therefore, in the CalibratedClassifierCV each of the sub-classifiers does have its own oob decision function. However, this decision function is calculated on a fold of the data. Therefore, it is necessary to store each oob prediction (keeping nan values for samples that are not in the fold), then convert all the predictions using the calibration transformation, and then average the calibrated oob predictions to create an updated oob prediction.

As mentioned, I created a pull request at https://github.com/scikit-learn/scikit-learn/pull/11175. It will probably be a while before it is merged into the package, though, so if anyone really needs to use it then feel free to use my fork of scikit-learn at https://github.com/yishaishimoni/scikit-learn.

Collectives™ on Stack Overflow

sklearn calibrated classifier with random forest

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related