scikit-learn RandomForestClassifier - How to interpret tree output?

Question

I have the code below, but I don't understand how to interpret the tree output from the RandomForestClassifier: how the gini was calculated given the samples, and how the totals in the 'value' lists can be higher than the initial samples of 3.

I am comparing this output to a DecisionTreeClassifier, which I can understand and interpret.

Any help is appreciated, thanks!

from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
import numpy as np
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
import pydot 

# Data
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
Y = np.array([0, 1, 1, 0])

# Create object classifiers
clf = RandomForestClassifier()
clf_tree = tree.DecisionTreeClassifier()

# Fit data
clf_tree.fit(X,Y)
clf.fit(X, Y)

# Save data
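dot_data = StringIO()
tree.export_graphviz(clf_tree, out_file=dot_data)
# Recent pydot versions return a list of graphs; older ones return a single graph
graphs = pydot.graph_from_dot_data(dot_data.getvalue())
graph = graphs[0] if isinstance(graphs, list) else graphs
graph.write_pdf("orig_tree.pdf")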

# Export every tree of the forest to its own PDF
for i_tree, tree_in_forest in enumerate(clf.estimators_):
    dot_data = StringIO()
    tree.export_graphviz(tree_in_forest, out_file=dot_data)
    graphs = pydot.graph_from_dot_data(dot_data.getvalue())
    graph = graphs[0] if isinstance(graphs, list) else graphs
    graph.write_pdf('tree_' + str(i_tree) + '.pdf')

The decision tree: https://i.sstatic.net/XZ7vU.png

A tree from the RandomForestClassifier: https://i.sstatic.net/Bb5t9.png

I've had a look at the link, and that's why I included a normal decision tree as a comparison; it is the same class that is used inside the RandomForestClassifier. In my example above, the tree from the RandomForestClassifier has few nodes and the input sample is small, so it should be easy to work out how the numbers in the tree are derived, as with the DecisionTreeClassifier. Have a look at the picture links and you'll see where I am coming from.
– Reactor, Commented Jun 3, 2015 at 22:20

Arnaud Joly · Accepted Answer · 2015-06-04 09:32:30Z


How was the gini calculated given the samples?

The gini is computed in exactly the same way for a random forest and for a decision tree. The Gini value, which can be read as a sum of per-class variances, measures the impurity of the node.

How can the totals in the 'value' lists be higher than the initial samples of 3?

In the case of classification, the value attribute holds the number of samples of each class reaching the node.
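To see where such totals come from, one can read the root node of each fitted tree directly instead of going through the PDF. The sketch below reuses the data from the question; `n_estimators=5` and `random_state=0` are illustrative choices of mine, and the reading of the two counters assumes that scikit-learn applies the bootstrap through per-sample weights, which is how I understand current releases to work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same toy data as in the question
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([0, 1, 1, 0])

# n_estimators and random_state are illustrative choices, not from the question
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, Y)

root_stats = []
for i, estimator in enumerate(forest.estimators_):
    fitted = estimator.tree_
    # Number of training rows with non-zero weight at the root (node 0)
    n_distinct = int(fitted.n_node_samples[0])
    # Sum of the sample weights at the root, i.e. the number of bootstrap draws
    n_weighted = float(fitted.weighted_n_node_samples[0])
    root_stats.append((n_distinct, n_weighted))
    print(f"tree {i}: samples={n_distinct}, weighted samples={n_weighted}, "
          f"gini={fitted.impurity[0]:.3f}")
```

The `tree_.value` array is deliberately not printed here: depending on the scikit-learn version it may hold weighted counts or weighted fractions per class, so the two sample counters are the safer thing to compare.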

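In the case of a random forest, the samples are bootstrapped: the training set is resampled with replacement, so each tree sees on average about 2/3 of the distinct original samples, some of them more than once, while the total number of drawn samples stays equal to the size of the original data set. This is why the 'value' totals, which count repeated draws, can exceed the 'samples' figure, which counts only the distinct samples.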
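The bootstrap effect can be reproduced without scikit-learn. The following is a minimal sketch using only the standard library; the helper name `bootstrap_counts` and the seed are hypothetical, picked for illustration. It draws as many indices as there are samples, with replacement, and compares the number of distinct indices with the total number of draws. The long-run fraction of distinct samples is about 63%, which is the "about 2/3" figure.

```python
import random
from collections import Counter


def bootstrap_counts(n_samples, rng):
    """Draw n_samples indices with replacement; return how often each was drawn."""
    drawn = rng.choices(range(n_samples), k=n_samples)
    return Counter(drawn)


rng = random.Random(0)  # arbitrary seed, only for repeatability

# One bootstrap sample of the 4-row data set from the question
counts = bootstrap_counts(4, rng)
n_distinct = len(counts)        # rows that appear at least once
n_total = sum(counts.values())  # total draws, always equal to the data set size
print(f"distinct rows: {n_distinct}, total draws: {n_total}")

# Average fraction of distinct rows on a larger data set
n, trials = 1000, 200
avg_fraction = sum(len(bootstrap_counts(n, rng)) for _ in range(trials)) / (n * trials)
print(f"average fraction of distinct rows: {avg_fraction:.3f}")
```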


2 Comments

Reactor Over a year ago

Apologies, but can you show how the gini is 0.375 in the first node, from the tree in the RandomForestClassifier?

Arnaud Joly Over a year ago

For the first node, given the tree structure and the bootstrap sampling, you have 3 positive samples and 1 negative sample. Thus the gini is gini = (3/4) * (1/4) + (1/4) * (3/4) = 0.375, which is the sum of the variances of each class.
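The same arithmetic as a small helper, in case it helps to check other nodes. This is a sketch, not scikit-learn's own code; `gini_impurity` is a hypothetical name, and it takes the per-class counts of a node as input.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: sum over classes of p * (1 - p)."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("a node must contain at least one sample")
    return sum((c / total) * (1 - c / total) for c in class_counts)


# Root node discussed above: 1 negative and 3 positive samples
root_gini = gini_impurity([1, 3])
print(root_gini)  # 0.375
```

A pure node such as `[4, 0]` gives 0.0, and an even two-class split such as `[2, 2]` gives 0.5, the maximum for two classes.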
