
I'm trying to figure out how I can go about interpreting the trees from my random forest. My data contains around 29,000 observations and 35 features. I pasted the first 22 observations with the first 11 features, as well as the feature that I am trying to predict (HighLowMobility).

birthcohort countyfipscode  county_name cty_pop2000 statename   state_id    stateabbrv  perm_res_p25_kr24   perm_res_p75_kr24   perm_res_p25_c1823  perm_res_p75_c1823  HighLowMobility
1980    1001    Autauga 43671   Alabama 1   AL  45.2994 60.7061         Low
1981    1001    Autauga 43671   Alabama 1   AL  42.6184 63.2107 29.7232 75.266  Low
1982    1001    Autauga 43671   Alabama 1   AL  48.2699 62.3438 38.0642 72.2544 Low
1983    1001    Autauga 43671   Alabama 1   AL  42.6337 56.4204 38.2588 80.4664 Low
1984    1001    Autauga 43671   Alabama 1   AL  44.0163 62.2799 38.1238 73.747  Low
1985    1001    Autauga 43671   Alabama 1   AL  45.7178 61.3187 40.9339 83.0661 Low
1986    1001    Autauga 43671   Alabama 1   AL  47.9204 59.6553 47.4841 72.491  Low
1987    1001    Autauga 43671   Alabama 1   AL  48.3108 54.042  53.199  84.5379 Low
1988    1001    Autauga 43671   Alabama 1   AL  47.9855 59.42   52.8927 85.2844 Low
1980    1003    Baldwin 140415  Alabama 1   AL  42.4611 51.4142         Low
1981    1003    Baldwin 140415  Alabama 1   AL  43.0029 55.1014 35.5923 76.9857 Low
1982    1003    Baldwin 140415  Alabama 1   AL  46.2496 56.0045 38.679  77.038  Low
1983    1003    Baldwin 140415  Alabama 1   AL  44.3001 54.5173 38.7106 81.0388 Low
1984    1003    Baldwin 140415  Alabama 1   AL  46.4349 55.5245 42.4422 80.3047 Low
1985    1003    Baldwin 140415  Alabama 1   AL  47.1544 52.8189 42.7994 79.0835 Low
1986    1003    Baldwin 140415  Alabama 1   AL  47.553  54.934  42.0653 78.4398 Low
1987    1003    Baldwin 140415  Alabama 1   AL  48.9752 54.3541 39.96   79.4915 Low
1988    1003    Baldwin 140415  Alabama 1   AL  48.6887 55.3087 43.8557 79.387  Low
1980    1005    Barbour 29038   Alabama 1   AL                  Low
1981    1005    Barbour 29038   Alabama 1   AL  37.5338 54.3618 34.8771 75.1904 Low
1982    1005    Barbour 29038   Alabama 1   AL  37.028  57.2471 36.5392 90.3262 Low
1983    1005    Barbour 29038   Alabama 1   AL                  Low

Here is my random forest:

    import pandas as pd
    import numpy as np

    # Load the data into a data frame
    X = pd.read_csv('raw_data_for_edits.csv')

    # Drop the categorical columns first (the median is only defined
    # for the numeric columns)
    X = X.drop(['county_name', 'statename', 'stateabbrv'], axis=1)

    # Collect the output in the y variable, then remove it from the features
    y = X['HighLowMobility']
    X = X.drop(['HighLowMobility'], axis=1)

    # Impute the missing values with the column medians
    X = X.fillna(X.median())


    from sklearn.preprocessing import LabelEncoder

    # Encode the output labels: 'Low' -> 0, everything else -> 1
    def preprocess_labels(y):
        yp = []
        for label in y:
            if str(label) == 'Low':
                yp.append(0)
            else:
                yp.append(1)
        return yp  # return after the loop, not inside it



    #y = LabelEncoder().fit_transform(y)  # equivalent alternative
    yp = np.array(preprocess_labels(y))

    # sklearn.cross_validation was removed; use sklearn.model_selection
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, yp, test_size=0.25, random_state=42)
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    X_test = np.array(X_test)
    y_test = np.array(y_test)
    dims = X_train.shape[1]

    if __name__ == '__main__':
        nn = Neural_Network([dims, 10, 5, 1], learning_rate=1, C=1, opt=False,
                            check_gradients=True, batch_size=200, epochs=100)
        nn.fit(X_train, y_train)
        weights = nn.final_weights()
        testlabels_out = nn.predict(X_test)
        print(testlabels_out)
        print("Neural Net Accuracy is " + str(np.round(nn.score(X_test, y_test), 2)))


    '''
    RANDOM FOREST AND LOGISTIC REGRESSION
    '''
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    clf1 = LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True)
    # min_samples_split must be >= 2
    clf2 = RandomForestClassifier(n_estimators=100, max_depth=None,
                                  min_samples_split=2, random_state=0)

    for clf, label in zip([clf1, clf2], ['Logistic Regression', 'Random Forest']):
        scores = cross_val_score(clf, X, yp, cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

How would I interpret my trees? For example, perm_res_p25_c1823 is a feature that gives the college attendance rate at ages 18-23 for a child born at the 25th income percentile, perm_res_p75_c1823 represents the same for the 75th percentile, and the HighLowMobility feature states whether there is high or low upward income mobility. So how would I show the following: "If the person comes from the 25th percentile and lives in Autauga, Alabama, then they will probably have lower upward mobility"?

  • What is the 25th percentile here? I would think that it's just the number that equates to 25% of all college attendees (ages 18-23), but the way you phrase it ("If the person comes from the 25th percentile") suggests something different.

2 Answers


You cannot really interpret an RF in such terms, because a random forest does not work this way. It builds a highly randomized ensemble of trees, each of which can have different decision rules. Once you go from a single decision tree, which is fully interpretable, to an RF, you lose this aspect of the classifier. RFs are black boxes. You can compute many different approximations and estimates, but they effectively ignore or alter what your RF actually does.
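For contrast, any individual tree inside the forest is still fully printable as rules. A minimal sketch with scikit-learn's `export_text` on hypothetical toy data (the feature names `f0`..`f3` are placeholders, not your columns):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Toy stand-in for the mobility data: 4 numeric features, binary target
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0).fit(X, y)

# Any one of the 100 trees is a readable set of if/else rules...
print(export_text(rf.estimators_[0], feature_names=[f"f{i}" for i in range(4)]))
# ...but each tree encodes different rules, so no single printout describes the forest
```

The printed rules apply to one estimator only; a different `rf.estimators_[i]` will generally split on different features and thresholds.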


4 Comments

My professor said he wants me to interpret the trees from the RF. He specifically said "a short tree can be converted into a statement like 'If the person comes from income bracket X and lives in Y, then they will probably have greater/lower upward mobility'"
A single tree, yes. A forest of a hundred, no (though you can say something like "in x% of trees there is such a rule").
Okay, thank you, that's what I was thinking. I was confused given that it's a forest, not a single tree. How would I be able to determine that "in x% of trees there is such a rule", or perhaps produce a visualization of some sort via Python? So far I only have the predictive accuracy of the algorithm, which is far from intuitive.
@M3105 Partial Prediction Plots (Partial Dependence Plots) can be used with non-parametric learning methods for that end.
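The "in x% of trees" idea from the comments can be checked directly, since each fitted tree exposes its structure through the `tree_` attribute. A minimal sketch on hypothetical toy data that counts how often each feature is the root split (with 100 trees, the counts are percentages):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# tree_.feature[0] is the feature index used at the root node of each tree
root_features = [t.tree_.feature[0] for t in rf.estimators_]
counts = np.bincount(root_features, minlength=X.shape[1])
for i, c in enumerate(counts):
    print(f"feature {i} is the root split in {c}% of trees")
```

The same walk over `tree_.feature` and `tree_.threshold` can be extended to deeper nodes to count full rules, though the bookkeeping grows quickly.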

Explainability is a hot research area, and newer tools have recently been developed to explain tree-ensemble models using a handful of human-understandable rules. Here are a few options for explaining tree-ensemble models that you can try:

You can use TE2Rules (Tree Ensembles to Rules) to extract human understandable rules to explain a scikit tree ensemble (like GradientBoostingClassifier). It provides levers to control interpretability, fidelity and run time budget to extract useful explanations. Rules extracted by TE2Rules are guaranteed to closely approximate the tree ensemble, by considering the joint interactions of multiple trees in the ensemble.

Another, alternative is SkopeRules, which is a part of scikit-contrib. SkopeRules extract rules from individual trees in the ensemble and filters good rules with high precision/recall across the whole ensemble. This is often quick, but may not represent the ensemble well enough.

For developers who work in R, InTrees package is a good option.
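Beyond these packages, a library-free approximation in the same spirit is a global surrogate: fit one shallow decision tree to the ensemble's own predictions and read its rules. This is only a rough sketch on hypothetical toy data, not equivalent to the rule-extraction tools above, and the surrogate's fidelity to the forest should always be checked:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Train a shallow surrogate tree on the forest's predictions, not the true labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, rf.predict(X))

# Fidelity: how often the surrogate agrees with the forest it approximates
fidelity = (surrogate.predict(X) == rf.predict(X)).mean()
print(f"surrogate agrees with the forest on {fidelity:.0%} of samples")
print(export_text(surrogate, feature_names=[f"f{i}" for i in range(5)]))
```

High fidelity means the surrogate's printed rules are a reasonable summary of the forest's behavior; low fidelity means the forest is doing something the shallow tree cannot express.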

References:

TE2Rules: code at https://github.com/linkedin/TE2Rules and documentation at https://te2rules.readthedocs.io/en/latest/

SkopeRules: code at https://github.com/scikit-learn-contrib/skope-rules

Intrees: https://cran.r-project.org/web/packages/inTrees/index.html

Disclosure: I'm one of the core developers of TE2Rules.

