
I'm learning to use xgboost, and I have read through the documentation. However, I don't understand why my script's output falls between 0 and 2. At first I thought it should be either 0 or 1, since this is binary classification; then I read that it comes out as the probability of class 0 or 1. But some outputs are above 1.5 (at least in the CSV), which doesn't make sense to me!

I'm unsure whether the problem is in the xgboost parameters or in the CSV creation. In this line, np.expm1(preds), I'm not sure np.expm1 is right, but I don't know what I could change it to.
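For reference, here is what np.expm1 (which computes exp(x) - 1) does to values in the [0, 1] range, which matches the 0~2 range I'm seeing:

```python
import numpy as np

# expm1(x) = exp(x) - 1, so a probability in [0, 1] maps to [0, e - 1] ≈ [0, 1.718]
probs = np.array([0.0, 0.5, 1.0])
print(np.expm1(probs))  # values range from 0.0 up to about 1.718
```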

In conclusion, my question is:

Why is the output not 0 or 1, but instead values like 0.0xxx and 1.xxx?

Here is my script:

import numpy as np
import xgboost as xgb
import pandas as pd

train = pd.read_csv('../dataset/train.csv')
train = train.drop('ID', axis=1)

y = train['TARGET']

train = train.drop('TARGET', axis=1)
x = train

dtrain = xgb.DMatrix(x.values, label=y.values)

test = pd.read_csv('../dataset/test.csv')

test = test.drop('ID', axis=1)
dtest = xgb.DMatrix(test.values)


# XGBoost params:
def get_params():
    #
    params = {}
    params["objective"] = "binary:logistic"
    params["booster"] = "gbtree"
    params["eval_metric"] = "auc"
    params["eta"] = 0.3  #
    params["subsample"] = 0.50
    params["colsample_bytree"] = 1.0
    params["max_depth"] = 20
    params["nthread"] = 4
    plst = list(params.items())
    #
    return plst


bst = xgb.train(get_params(), dtrain, 1000)

preds = bst.predict(dtest)

print(np.max(preds))
print(np.min(preds))
print(np.average(preds))

# Make Submission
test_aux = pd.read_csv('../dataset/test.csv')
result = pd.DataFrame({"Id": test_aux["ID"], 'TARGET': np.expm1(preds)})  # <-- the np.expm1 I'm unsure about

result.to_csv("xgboost_submission.csv", index=False)

2 Answers


You just need to do this:

from xgboost import XGBClassifier

Call predict and the output will be 0 or 1; if you call predict_proba, the output will be the probabilities of the classes.

Sorry for my English.


1 Comment

Thank you! This didn't exactly answer the original question, but it solved a related problem of mine.

When you run an xgb model with the objective binary:logistic, you get probabilities for each sample. Each probability is the chance that the sample belongs to class i.
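For binary:logistic specifically, the Booster's predict returns a single probability per sample, so a common way to get hard 0/1 labels is to threshold at 0.5 (a sketch with made-up numbers):

```python
import numpy as np

# Hypothetical binary:logistic output: one P(class 1) per sample
preds = np.array([0.03, 0.91, 0.47])
labels = (preds > 0.5).astype(int)  # threshold at 0.5 for hard 0/1 labels
print(labels)  # [0 1 0]
```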

Let's say you have 3 classes [A, B, C]. An output like [0.2, 0.6, 0.2] for sample y indicates that this sample most probably belongs to class B.

If you want just the most probable class, take the index of the maximum element of the probability array, for example with numpy's argmax function.
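For instance, with a hypothetical multi-class output of one row of class probabilities per sample:

```python
import numpy as np

# Made-up example: two samples, three class probabilities each
preds = np.array([[0.2, 0.6, 0.2],
                  [0.7, 0.1, 0.2]])
labels = np.argmax(preds, axis=1)  # index of the most probable class per row
print(labels)  # [1 0]
```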

You can find more info in the xgb package's parameter documentation.

7 Comments

Like this? result = pd.DataFrame({"Id": test_aux["ID"], 'TARGET': np.argmax(preds)})
Notice np.argmax can take an axis argument. If you want the label prediction per sample, try np.argmax(preds, axis=1).
It didn't work: axis=1 fails because there is just one axis, and with axis=0 it just fills everything with 66390.
What is the shape of your preds variable?
Can you give us a sample of your preds array?
