I'm training a Gaussian Process regressor on a mix of categorical and numerical features. For most categorical features the amount of data I have is fine, but some categories are really sparse, and I think that makes the model perform poorly when it's tested on those categories.
Is there a way to weight the categorical features (and the categorical features only, because the numerical features span a continuous range from 0 to 100 and are not directly related to the categorical ones) so that the categories which appear less often affect the score less?
I've seen the sample_weight parameter of the r2_score function, but I don't think that will cut it for me, as it seems to apply the weights across every single column, and I don't want that.
I've also seen this excellent question and answer about the sample_weight and class_weight parameters, but it doesn't state whether it's possible to assign weights to certain features only.
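To make the kind of weighting I mean concrete, here is a rough sketch (the column name "cat" and the data are made up) of how per-sample weights could be derived from a categorical feature's frequency, so that rows from rarer categories count less:

```python
import pandas as pd

# Hypothetical data: "cat" is a sparse categorical feature
df = pd.DataFrame({"cat": ["a", "a", "a", "b"], "x": [10.0, 20.0, 30.0, 40.0]})

# Weight each row by the relative frequency of its category,
# so rows from rare categories contribute less to the score
freq = df["cat"].value_counts(normalize=True)
sample_weight = df["cat"].map(freq).to_numpy()
# "a" rows get weight 0.75, the lone "b" row gets 0.25
```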
I've been trying different things, and I found that you can set the weights for any scoring function by doing something like this:
from sklearn.metrics import make_scorer, r2_score

def weighted_r2(y_true, y_pred, sample_weight):
    return r2_score(y_true, y_pred, sample_weight=sample_weight)

weighted_r2_scorer = make_scorer(
    weighted_r2,
    greater_is_better=True,
    needs_proba=False,
    needs_threshold=False,
    sample_weight=x.index,
)
That can be fed to GridSearchCV's scoring parameter and it should work. The only problem is that sample_weight has to be the same length as the set of samples being scored.
I can hear you saying that they are the same length, but they aren't: because of cross-validation, the sample space gets chopped up into parts (5 by default), which changes the number of items in each scored partition (but not in the weights...). As proof, here is the error that gets thrown about a thousand times when I run this:
UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\model_selection\_validation.py", line 767, in _score
scores = scorer(estimator, X_test, y_test)
File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\metrics\_scorer.py", line 220, in __call__
return self._score(
File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\metrics\_scorer.py", line 268, in _score
return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
File "D:\InSilicoOP-FUAM\INSILICO-OP\src\bayesian_model.py", line 58, in weighted_r2
return r2_score(y_true, y_pred, sample_weight=sample_weight)
File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\metrics\_regression.py", line 914, in r2_score
check_consistent_length(y_true, y_pred, sample_weight)
File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\utils\validation.py", line 397, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [263, 263, 329]
You can do the math: 263 is indeed roughly 4/5 of 329.
Any help would be welcome.
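One thing I've been sketching to get around this (assuming X stays a pandas DataFrame, so each CV fold keeps its original index) is to skip make_scorer entirely and pass GridSearchCV a callable with the (estimator, X, y) signature, slicing a full-length weight Series down to whichever fold it receives:

```python
import pandas as pd
from sklearn.metrics import r2_score

def make_index_aligned_r2(weights: pd.Series):
    """Build an (estimator, X, y) scorer from a weight Series that is
    indexed like the full DataFrame, not like any single CV fold."""
    def scorer(estimator, X, y):
        # X is only the current fold, so pick out the matching weights
        # by index; the lengths then always agree
        w = weights.loc[X.index]
        return r2_score(y, estimator.predict(X), sample_weight=w)
    return scorer
```

This could then be passed as scoring=make_index_aligned_r2(my_weights) to GridSearchCV (my_weights being a hypothetical Series built, say, from category frequencies); the index lookup keeps the weight vector the same length as each scored partition.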