I am designing a binary random forest classifier in Python with scikit-learn, and I would like to retrieve the probability of each test sample belonging to one of the two labels. To my understanding, predict_proba(xtest) will give me the following result:
Number of trees that voted for the class / Total number of trees
I find this too imprecise: some splits may have separated my (non-deterministic) samples into quite pure leaves (100 class A, 0 class B), while others produce impure leaves (5 class A, 3 class B). I would like a notion of 'probability' that takes the total number of training samples across all n output leaves as the denominator, and the total number of samples of the predicted class in those leaves as the numerator (counting every tree's output leaf, even leaves that chose the class most trees didn't).
For example (simplified), with 2 trees, each trained on 10 samples:

Tree 1 (10 samples at the root):
--- Output leaf: 5 Class A, 0 Class B (chosen: the test sample lands here, so the tree votes Class A)
--- Other leaf: 2 Class A, 3 Class B (unchosen)

Tree 2 (10 samples at the root):
--- Output leaf: 3 Class A, 2 Class B (chosen: the tree votes Class A)
--- Other leaf: 0 Class A, 5 Class B (unchosen)
predict_proba result:
Number of trees that chose Class A (2) / Number of trees (2) = 1.0
Desired result:
Number of Class A samples in the output leaves (5 + 3 = 8) / Total number of samples in the output leaves (10) = 0.8
Does anyone know how to do this, or have an implementation they are using?
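To make the computation concrete, here is a sketch of what I have in mind, built only on public scikit-learn API (apply() gives the leaf id each sample reaches in each tree; tree_.value and tree_.n_node_samples describe the leaves). The function name leaf_pooled_proba is mine, not anything built in:

import numpy as np

def leaf_pooled_proba(forest, X):
    # `forest` is a fitted RandomForestClassifier, X is a 2-D feature matrix.
    # apply() returns, per sample, the leaf id reached in each tree:
    # shape (n_samples, n_estimators).
    leaves = forest.apply(X)
    pooled = np.zeros((leaves.shape[0], len(forest.classes_)))
    for t, est in enumerate(forest.estimators_):
        tree = est.tree_
        ids = leaves[:, t]
        # tree.value[node, 0, :] holds per-class values at each node; depending
        # on the scikit-learn version these are raw counts or fractions, so
        # rebuild counts from n_node_samples to be safe. Note: with
        # bootstrap=True these are in-bag counts, not full-training-set counts.
        v = tree.value[ids, 0, :]
        frac = v / v.sum(axis=1, keepdims=True)
        pooled += frac * tree.n_node_samples[ids][:, None]
    # Divide pooled class counts by the pooled total (the desired ratio).
    return pooled / pooled.sum(axis=1, keepdims=True)

For the two-tree example above, this would pool [5, 0] and [3, 2] into [8, 2] and return 0.8 for Class A (columns ordered as in forest.classes_).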
An idea I had was to iterate through every tree, retrieve its probabilities, and average them. However, averaging weights every tree equally, so output leaves with fewer samples get disproportionate influence (electoral-college style).
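For reference, that averaging idea is only a couple of lines (a sketch; rf stands for my fitted forest). Depending on your scikit-learn version, this kind of per-tree averaging may in fact be what predict_proba already computes, rather than the hard vote described above:

import numpy as np

# Each tree contributes its output leaf's class fractions; every tree is
# weighted equally, however few samples its output leaf contains.
avg_proba = np.mean([est.predict_proba(xtest) for est in rf.estimators_], axis=0)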
How can I directly access the number of samples, and their class breakdown, in the output leaf a specific sample reaches in a decision tree (or even just the leaf's index, and go from there)? And, for a random forest, how can I sum and average these across trees?
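For a single tree, the minimal lookup I have pieced together from the Tree attributes looks like this (a sketch, assuming a fitted DecisionTreeClassifier dtc):

leaf_ids = dtc.apply(xtest)                       # index of the leaf each sample reaches
values = dtc.tree_.value[leaf_ids, 0, :]          # per-class values stored at those leaves
leaf_sizes = dtc.tree_.n_node_samples[leaf_ids]   # training samples per output leaf

forest.apply(xtest) does the same for every tree at once (one column per estimator), which is what the pooled sketch above builds on.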
Failing that, should I switch platforms/libraries entirely? Or just crank up the number of trees (not optimal)?
Some potentially helpful documentation:

dtc.tree_.n_node_samples

The Tree attributes are flat NumPy arrays indexed by node id, so the per-node lookup would be dtc.tree_.n_node_samples[node_index] (not dtc.tree_[node_index].n_node_samples).