I want to write code for a MultiOutputClassifier in Python using scikit-learn. My features are text values, so I used CountVectorizer(). To find the best parameters for my model I used GridSearchCV and model.best_params_. I want the best parameters both for the decision tree and for the MultiOutputClassifier.

I get the following error and do not know how to fix it, even after searching everywhere:

ValueError: Invalid parameter criterion for estimator MultiOutputClassifier(estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
           n_jobs=None). Check the list of available parameters with `estimator.get_params().keys()`.

How can I fix this error? This is the full code:

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score

df = pd.DataFrame({"first":["yes", "no", "yes", "yes", "no"],
                  "second":["yes", "no", "no", "yes", "yes"],
                  "third":["true","true", "false", "true", "false"]})

#print(df)

features = df.iloc[:,-1]
results = df.iloc[:,:-1]

cv = CountVectorizer()  
features = cv.fit_transform(features)

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

tuned_tree = {'criterion':['entropy','gini'], 'random_state':[1,2,3,4,5,6,7,8,9,10,11,12,13]}

cls = GridSearchCV(MultiOutputClassifier(tree.DecisionTreeClassifier()), tuned_tree)
model = cls.fit(features_train, result_train)

acc_prediction  = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)

print(accuracy_test, model.best_params_)

3 Answers

1

You need to set the parameters of the estimator wrapped by MultiOutputClassifier using the estimator__ prefix.

Try this

{'estimator__criterion':['entropy','gini']}
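As the error message suggests, the valid parameter names can be listed with get_params(). The following is a minimal sketch of that check; the variable names (wrapped, names, own_params) are only illustrative and do not come from the question.

```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Wrap a decision tree the same way the question does.
wrapped = MultiOutputClassifier(DecisionTreeClassifier())

# All parameter names that GridSearchCV will accept for this estimator.
names = set(wrapped.get_params().keys())

# Parameters of the inner tree are only reachable through the prefix.
print("criterion" in names)             # plain name is not a valid key
print("estimator__criterion" in names)  # prefixed name is a valid key

# Parameters that belong to MultiOutputClassifier itself (no "__" in the name).
own_params = {name for name in names if "__" not in name}
print(sorted(own_params))
```

Any key that appears in names can be used in the grid passed to GridSearchCV.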

Note: You should not tune random_state for any reason. Just fix it to a constant for reproducibility.

You need to binarize the labels (target variable) to compute metrics in a multi-label setting.

For the multi-label format, stratified train-test splitting is not defined in sklearn. Hence, you have to split into train and test sets randomly and then apply binarization.

sklearn provides many metrics for multi-label tasks; see the model evaluation section of its user guide.

import pandas as pd  

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn import preprocessing


df = pd.DataFrame({"first":["yes", "no", "yes", "yes", "no"],
                  "second":["yes", "no", "no", "yes", "yes"],
                  "third":["true","true", "false", "true", "false"]})

train, test = train_test_split(
    df, test_size = 0.3, random_state = 42)

# vectorization
cv = CountVectorizer()  
# always fit the vectorizer on the train data alone
# fitting on complete data leads to data leakage

features_train_vect = cv.fit_transform(train.iloc[:,-1])

# label binarization
mlb = preprocessing.MultiLabelBinarizer()
result_train = mlb.fit_transform(train.iloc[:,:-1].values) 

# applying the transform in test data
result_test = mlb.transform(test.iloc[:,:-1].values)
features_test_vect = cv.transform(test.iloc[:,-1])


params_range = {'estimator__criterion':['entropy','gini']}


cls = GridSearchCV(MultiOutputClassifier(tree.DecisionTreeClassifier(random_state=1)),
                   params_range, cv=3)
model = cls.fit(features_train_vect, result_train)

# f1_score expects the true labels first, then the predictions
f1_score(result_test, cls.predict(features_test_vect), average='weighted')
# 0.6666666666666666
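
One caveat about the binarization step, offered as a hedged aside rather than as part of the original answer: MultiLabelBinarizer treats each row as an unordered set of labels, so it does not record which column a "yes" or "no" came from. If each target column should stay a separate output, a per-column encoding may be closer to what the question intends. The sketch below uses a small hypothetical two-row frame (labels) to show the difference.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical two-row example with the same column names as the question.
labels = pd.DataFrame({"first": ["yes", "no"], "second": ["no", "yes"]})

# MultiLabelBinarizer sees each row as the set {"yes", "no"},
# so both rows get the same encoding and the column identity is lost.
mlb = MultiLabelBinarizer()
as_sets = mlb.fit_transform(labels.values)
print(list(mlb.classes_))
print(as_sets.tolist())

# Per-column encoding: one 0/1 indicator column per target column.
per_column = (labels == "yes").astype(int)
print(per_column.values.tolist())
```

The per-column result is still a binary indicator matrix, so it can be passed to MultiOutputClassifier and to the multi-label metrics in the same way.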

5 Comments

OK, but do I need to use preprocessing.MultiLabelBinarizer()? I just wanted to find the best parameters for my decision tree. I used GridSearchCV(tree.DecisionTreeClassifier(), tuned_parameters) when my code did not use MultiOutputClassifier.
MultiLabelBinarizer is required when you have a multi-label problem. GridSearchCV can work without MultiLabelBinarizer only when there is a single target variable.
@ai_learning OK, but I want to get the best parameters for my decision tree model and for my multi-output classifier. I want to use model.best_params_, which is why I used tuned_tree. I see that you set the parameters for the decision tree yourself.
The code that I suggested tunes the parameters of the decision tree only. MultiOutputClassifier has no parameters of its own to tune (apart from n_jobs, which only controls parallelism); it is just an extension of the estimator to multiple target variables.
I have just changed your variable name from tuned_tree to params_range for better understanding.
0

You're passing the DecisionTreeClassifier() constructor function to the MultiOutputClassifier. Try instantiating a decision tree estimator object and passing that to the function:

dtc = tree.DecisionTreeClassifier()
cls = GridSearchCV(MultiOutputClassifier(dtc), tuned_tree)

1 Comment

Still the same error. Also, I do not think that this way will give me the parameters for the decision tree.
0

The dictionary passed should be like

tuned_tree = {'estimator__criterion':['entropy','gini'], 'estimator__random_state':[1,2,3,4,5,6,7,8,9,10,11,12,13]}

The estimator__ prefix is required for all parameters of the wrapped decision tree.
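
To illustrate the prefix rule end to end, here is a minimal sketch on synthetic data. The data (X, Y), the max_depth grid and the variable names are assumptions made for the example, not part of the question. random_state is fixed rather than tuned, following the advice in the first answer.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 40 samples, 3 binary features, 2 binary targets.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(40, 3))
Y = np.column_stack([X[:, 0], X[:, 1] ^ X[:, 2]])

# Every key addresses a parameter of the wrapped tree via "estimator__".
param_grid = {
    "estimator__criterion": ["entropy", "gini"],
    "estimator__max_depth": [1, 2, 3],
}

search = GridSearchCV(
    MultiOutputClassifier(DecisionTreeClassifier(random_state=1)),
    param_grid,
    cv=3,
)
search.fit(X, Y)

# best_params_ reports the chosen tree parameters under the prefixed names.
print(search.best_params_)
```

The keys in best_params_ are the same prefixed names used in the grid, so they describe the decision tree inside the MultiOutputClassifier.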
