Using custom Pipeline for Cross Validation scikit-learn

Question

I would like to use GridSearchCV to determine the parameters of a classifier, and using pipelines seems like a good option.

The application is image classification using Bag-of-Words features, but the issue is that the logical pipeline differs depending on whether training or test examples are being processed.

For each training set, KMeans must be run to produce a vocabulary, which is then applied to the test data; no KMeans fitting is run on the test data itself.

I cannot see how it is possible to specify this difference in behavior for a pipeline.

ogrisel · Accepted Answer · 2012-10-26 14:23:09Z


You probably need to derive from the KMeans class and override the following methods to use your vocabulary logic:

fit_transform is called only on the training data
transform is called on the test data
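
To check this train/test split in behavior, here is a minimal sketch that records which methods a Pipeline calls on its first step. RecordingTransformer, the calls list, and the toy data are hypothetical names made up for illustration; only Pipeline, BaseEstimator, TransformerMixin and LogisticRegression come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Module-level log, so the record survives even if the estimator is cloned.
calls = []


class RecordingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through transformer that logs which method was called."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def transform(self, X):
        calls.append("transform")
        return X

    def fit_transform(self, X, y=None):
        calls.append("fit_transform")
        return X


# Toy data, purely illustrative.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5], [2.5]])

pipe = Pipeline([("record", RecordingTransformer()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)   # training path
pipe.predict(X_test)         # test path

print(calls)
```

With a recent scikit-learn, the log should contain fit_transform for the call to fit and transform for the call to predict, which is the hook the answer relies on.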

Maybe class derivation is not always the best option. You can also write your own transformer class that wraps calls to an embedded KMeans model and provides the fit / fit_transform / transform API that the Pipeline class expects from its first stages.
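
A rough sketch of such a wrapper follows. It is not the asker's actual code: BagOfWordsTransformer, its n_words parameter, the synthetic "images" (lists of random descriptor arrays) and the choice of LogisticRegression as the final classifier are all assumptions made for illustration. The vocabulary is learned in fit, so GridSearchCV refits it on each training fold, while transform only assigns descriptors to the already learned words.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class BagOfWordsTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper around an embedded KMeans vocabulary."""

    def __init__(self, n_words=8, random_state=0):
        self.n_words = n_words
        self.random_state = random_state

    def fit(self, X, y=None):
        # Assumption: X is a sequence of (n_descriptors_i, n_dims) arrays,
        # one array of local descriptors per image.
        self.kmeans_ = KMeans(
            n_clusters=self.n_words, n_init=10, random_state=self.random_state
        )
        self.kmeans_.fit(np.vstack(X))  # vocabulary learned from training data only
        return self

    def transform(self, X):
        # No clustering here: descriptors are only assigned to existing words.
        histograms = []
        for descriptors in X:
            words = self.kmeans_.predict(descriptors)
            counts = np.bincount(words, minlength=self.n_words)
            histograms.append(counts / counts.sum())
        return np.array(histograms)


# Synthetic stand-in for image descriptors: two classes with shifted means.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 20)
X = [
    rng.normal(loc=3.0 * label, scale=1.0, size=(int(rng.integers(20, 40)), 5))
    for label in y
]

pipe = Pipeline([
    ("bow", BagOfWordsTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"bow__n_words": [4, 8], "clf__C": [0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

TransformerMixin supplies a default fit_transform that calls fit followed by transform, so it only needs to be overridden if the training path must differ beyond fitting the vocabulary. Exposing n_words as a constructor parameter lets the vocabulary size be tuned in the same grid as the classifier parameters.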

edited Oct 26, 2012 at 14:23

answered Oct 24, 2012 at 20:53


1 Comment

phil0stine Over a year ago

Ah, I think that might be the piece I was missing. I knew there had to be a way to get different behavior depending on test/train. Thanks
