
I am trying to validate the output of XGBoost's booster.predict() for logistic regression (binary:logistic) against my own calculation from the dumped trees. I see a constant difference of about -1.58 in all my results. Below is the code I used for the validation. I am clearly missing something here, so I would appreciate help understanding what it is.

import xgboost as xgb
import pandas as pd
import numpy as np
import math

np.random.seed(1)

# 100 rows, 4 monotonically increasing features, and a random binary label
data = pd.DataFrame(np.arange(100 * 4).reshape((100, 4)), columns=['a', 'b', 'c', 'd'])
label = pd.DataFrame(np.random.randint(2, size=(100, 1)))

dtrain = xgb.DMatrix(data, label=label)
param = {"max_depth": 2, "base_score": 0.2, "objective": "binary:logistic"}
clf1 = xgb.train(param, dtrain, 2)  # 2 boosting rounds -> 2 trees
clf1.dump_model("base_score1.txt")

# Row 0 falls into leaf 0.617647052 of booster[0] and leaf 0.325955093 of booster[1]
e = math.exp(-(0.617647052 + 0.325955093 + 0.2))
print(clf1.predict(dtrain)[0], 1 / (1 + e))
## 0.39109966 0.7583403831446165
## For the two values to match, e would have to be 1.5568930331924702, while here e is 0.31866905973448423

Here are the trees generated (from base_score1.txt):

booster[0]:
0:[a<126] yes=1,no=2,missing=1
    1:[a<58] yes=3,no=4,missing=3
        3:leaf=0.617647052
        4:leaf=0.0483870991
    2:leaf=0.691919208
booster[1]:
0:leaf=0.325955093
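
To double-check which leaves row 0 actually ends up in, I also ran predict() with pred_leaf=True (a small sketch, reusing clf1 and dtrain from the code above; the printed ids are the node numbers from the dump):

leaf_idx = clf1.predict(dtrain, pred_leaf=True)[0]  # one leaf id per tree for row 0
print(leaf_idx)
## e.g. [3 0] -> leaf 3 of booster[0] and leaf 0 (the root leaf) of booster[1]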

So my understanding is that clf1.predict() outputs the sigmoid applied to the sum of the leaf values and base_score, i.e. 1/(1+math.exp(-sum)) where sum = base_score + sum_of_leaf_values (one leaf value per tree).
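
To make the mismatch concrete, the raw (pre-sigmoid) score can also be pulled out with output_margin=True. This is just a diagnostic sketch, reusing clf1, dtrain and math from above, with the leaf values hard-coded from the dump:

raw = clf1.predict(dtrain, output_margin=True)[0]         # score before the sigmoid is applied
leaf_sum = 0.617647052 + 0.325955093                      # booster[0] leaf 3 + booster[1] leaf 0

print(raw, leaf_sum + 0.2)                                # these two differ by roughly 1.58
print(clf1.predict(dtrain)[0], 1 / (1 + math.exp(-raw)))  # sigmoid(raw) does reproduce predict()

So sigmoid(raw margin) matches predict(), but the raw margin itself is not leaf_sum + base_score; the gap is exactly the ~-1.58 difference mentioned above.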

What am I doing wrong?

This might be related, but I am not sure exactly how: weight calculation of individual tree in XGBoost when using "binary:logistic".

  • As is, it is not clear how many trees (or boosting iterations) you use; are you sure it is just the 2 trees you show? Please update your code to show explicitly that, in your model, n_estimators=2. Commented Feb 11 at 20:23
  • So xgb.train(param, dtrain, 2) implies it has 2 trees, if that is what you mean? @desertnaut (see also the tree-count check I added below) Commented Feb 12 at 7:48
  • Yes, I am utterly surprised that this even works; I would expect that all parameters would be required to be named ones. For diagnostic purposes, I suggest you re-run it as I suggested (explicitly with n_estimators=2); alternatively, with the existing form, check to confirm that you have only 2 trees indeed (i.e. that booster[2] does not even exist). Commented Feb 12 at 11:02
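
As suggested in the comments, the number of trees in the trained booster can be confirmed directly from the model object (a small sketch reusing clf1; get_dump() returns one string per tree):

dumps = clf1.get_dump()
print(len(dumps))  # should be 2 if only booster[0] and booster[1] exist
for i, tree in enumerate(dumps):
    print(f"booster[{i}]:\n{tree}")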
