
I have the code below, but I don't understand how to interpret the tree output from the RandomForestClassifier: how the gini was calculated given the samples, and how the totals in the 'value' lists can be higher than the initial 'samples' count of 3.

I am comparing this output to a DecisionTreeClassifier, which I can understand and interpret.

Any help is appreciated, thanks!

from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
import numpy as np
from io import StringIO  # sklearn.externals.six is no longer shipped with current scikit-learn
import pydot 

# Data
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
Y = np.array([0, 1, 1, 0])

# Create object classifiers
clf = RandomForestClassifier()
clf_tree = tree.DecisionTreeClassifier()

# Fit data
clf_tree.fit(X,Y)
clf.fit(X, Y)

# Save the single decision tree
dot_data = StringIO()
tree.export_graphviz(clf_tree, out_file=dot_data)
# Recent pydot releases return a list of graphs, so take the first one
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
graph.write_pdf("orig_tree.pdf")

# Save every tree in the forest, one PDF per tree
for i_tree, tree_in_forest in enumerate(clf.estimators_):
    dot_data = StringIO()
    tree.export_graphviz(tree_in_forest, out_file=dot_data)
    # Recent pydot releases return a list of graphs, so take the first one
    graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]
    graph.write_pdf('tree_' + str(i_tree) + '.pdf')

The decision tree: https://i.sstatic.net/XZ7vU.png

A tree from the RandomForestClassifier: https://i.sstatic.net/Bb5t9.png

  • I've had a look at the link and that's why I have included a normal decision tree as a comparison, which is the same class that is used in the RandomForestClassifier. In my example above there are not that many nodes in the tree from the RandomForestClassifier and the input sample is small, so it should have been easy to work out how the numbers are derived in the tree, like in the DecisionTreeClassifier. Have a look at the picture links and you'll see where I am coming from. Commented Jun 3, 2015 at 22:20

1 Answer

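How was the gini calculated, given the samples?

The gini is computed in exactly the same way for a random forest and for a decision tree. It measures the impurity of a node: it is the sum over classes of p * (1 - p), where p is the fraction of the node's samples that belong to that class. In other words, it is the sum of the per-class variances.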
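As a minimal sketch of that definition (the function name `gini_impurity` is my own, not part of scikit-learn), the impurity can be recomputed by hand from the per-class counts shown in a node's 'value' list:

```python
def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    # 1 - sum(p^2) is the same quantity as sum(p * (1 - p)), since the p's sum to 1
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


print(gini_impurity([2, 2]))  # evenly mixed node, e.g. the root of the plain decision tree
print(gini_impurity([3, 1]))  # 3 samples of one class, 1 of the other
print(gini_impurity([4, 0]))  # pure node
```

Feeding in the 'value' list of any node from the exported trees should reproduce the gini printed on that node.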

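How can the totals in the 'value' lists be higher than the initial samples of 3?

In the case of classification, the value attribute holds the number of training samples of each class that reach a node.

In a random forest, each tree is fit on a bootstrap sample: as many rows as the original data, drawn with replacement. On average only about 2/3 of the distinct original samples end up in a given tree, and some of them are drawn more than once, but the overall number of samples hasn't changed. That is why the 'value' total, which counts repeated draws, can be higher than 'samples', which counts distinct rows. In your tree it looks like 3 distinct rows were drawn 4 times in total.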
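The effect can be imitated in a few lines of plain Python. This is only a sketch of the idea, assuming the bootstrap is realised as "how many times was each row drawn"; it does not call scikit-learn, and the helper name `bootstrap_counts` is made up for illustration:

```python
import random
from collections import Counter


def bootstrap_counts(n_samples, rng):
    """Draw n_samples row indices with replacement; return draws per row."""
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    return Counter(drawn)


rng = random.Random(0)   # fixed seed only so the run is repeatable
labels = [0, 1, 1, 0]    # Y from the question

counts = bootstrap_counts(len(labels), rng)

distinct_rows = len(counts)          # plays the role of 'samples' at the root
total_draws = sum(counts.values())   # always equals len(labels)

# Per-class totals at the root, counting repeated draws (plays the role of 'value')
value = [0, 0]
for row, times in counts.items():
    value[labels[row]] += times

print("distinct rows:", distinct_rows)
print("value:", value, "total:", sum(value))
```

Whenever a row is drawn twice, `distinct_rows` drops below 4 while `sum(value)` stays at 4, which is the same mismatch seen in the forest's tree.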


2 Comments

Apologies, but can you show how the gini is 0.375 in the first node, from the tree in the RandomForestClassifier?
For the first node, given the tree structure and the bootstrap sampling, you have 3 positive samples and 1 negative sample. Thus gini = 3/4 * 1/4 + 1/4 * 3/4 = 0.375, which is the sum of the variances of each class.
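Written out as an equation, with p_k the fraction of the node's samples in class k (the symbols G and p_k are just notation introduced here), the same arithmetic can also be expressed in the equivalent form 1 minus the sum of squared fractions:

```latex
G = \sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2
  = 1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2
  = 1 - \tfrac{10}{16} = \tfrac{6}{16} = 0.375
```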
