I am designing a binary random forest classifier in Python with scikit-learn, and I would like to retrieve the probability of each test sample belonging to one of the two labels. To my understanding, predict_proba(xtest) will give me the following result:
Number of trees that voted for the class / Total number of trees
I find this too imprecise: some splits may have separated my (non-deterministic) samples into quite pure leaves (100 class A, 0 class B), while others produce impure leaves (5 class A, 3 class B). I would like a notion of 'probability' that takes the total number of training samples across all n output leaves as the denominator, and the total number of samples of the predicted class in those leaves as the numerator (counting every tree's output leaf, even leaves that chose the class most trees didn't).
For example (simplified), with 2 trees, each trained on 10 samples:

Tree 1 (10 samples at the root):
--- Output leaf: 5 Class A, 0 Class B (chosen: the test sample lands here, so the tree votes Class A)
--- Other leaf: 2 Class A, 3 Class B (unchosen)

Tree 2 (10 samples at the root):
--- Output leaf: 3 Class A, 2 Class B (chosen: the tree votes Class A)
--- Other leaf: 0 Class A, 5 Class B (unchosen)
predict_proba result:
Number of trees that chose Class A (2) / Number of trees (2) = 1.0
Desired result:
Number of Class A samples in the output leaves (5 + 3 = 8) / Total number of samples in the output leaves (10) = 0.8
Does anyone know how to do this, or have an implementation they are using?
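To make the computation concrete, here is a sketch of what I have in mind, built only on public scikit-learn API (apply() gives the leaf id each sample reaches in each tree; tree_.value and tree_.n_node_samples describe the leaves). The function name leaf_pooled_proba is mine, not anything built in:

import numpy as np

def leaf_pooled_proba(forest, X):
    # `forest` is a fitted RandomForestClassifier, X is a 2-D feature matrix.
    # apply() returns, per sample, the leaf id reached in each tree:
    # shape (n_samples, n_estimators).
    leaves = forest.apply(X)
    pooled = np.zeros((leaves.shape[0], len(forest.classes_)))
    for t, est in enumerate(forest.estimators_):
        tree = est.tree_
        ids = leaves[:, t]
        # tree.value[node, 0, :] holds per-class values at each node; depending
        # on the scikit-learn version these are raw counts or fractions, so
        # rebuild counts from n_node_samples to be safe. Note: with
        # bootstrap=True these are in-bag counts, not full-training-set counts.
        v = tree.value[ids, 0, :]
        frac = v / v.sum(axis=1, keepdims=True)
        pooled += frac * tree.n_node_samples[ids][:, None]
    # Divide pooled class counts by the pooled total (the desired ratio).
    return pooled / pooled.sum(axis=1, keepdims=True)

For the two-tree example above, this would pool [5, 0] and [3, 2] into [8, 2] and return 0.8 for Class A (columns ordered as in forest.classes_).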
An idea I had was to iterate through every tree, retrieve its probabilities, and average them. However, averaging weights every tree equally, so output leaves with fewer samples get disproportionate influence (electoral-college style).
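For reference, that averaging idea is only a couple of lines (a sketch; rf stands for my fitted forest). Depending on your scikit-learn version, this kind of per-tree averaging may in fact be what predict_proba already computes, rather than the hard vote described above:

import numpy as np

# Each tree contributes its output leaf's class fractions; every tree is
# weighted equally, however few samples its output leaf contains.
avg_proba = np.mean([est.predict_proba(xtest) for est in rf.estimators_], axis=0)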
How can I directly access the number of samples, and their class breakdown, in the output leaf a specific sample reaches in a decision tree (or even just the leaf's index, and go from there)? And, for a random forest, how can I sum and average these across trees?
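For a single tree, the minimal lookup I have pieced together from the Tree attributes looks like this (a sketch, assuming a fitted DecisionTreeClassifier dtc):

leaf_ids = dtc.apply(xtest)                       # index of the leaf each sample reaches
values = dtc.tree_.value[leaf_ids, 0, :]          # per-class values stored at those leaves
leaf_sizes = dtc.tree_.n_node_samples[leaf_ids]   # training samples per output leaf

forest.apply(xtest) does the same for every tree at once (one column per estimator), which is what the pooled sketch above builds on.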
Failing that, should I switch platforms/libraries entirely? Or just crank up the number of trees (not optimal)?
Some potentially helpful documentation:

dtc.tree_.n_node_samples

The Tree attributes are flat NumPy arrays indexed by node id, so the per-node lookup would be dtc.tree_.n_node_samples[node_index] (not dtc.tree_[node_index].n_node_samples).