
I'm training a Gaussian Process regressor on a mix of categorical and numerical features. For most categorical features the amount of data I have is fine, but some categorical features are really sparse, and I think that makes the model perform poorly when it's tested against those features.

Is there a way to weight the categorical features (and only the categorical features, since the numerical features cover a continuous range from 0 to 100 and are not directly related to the categorical ones) so that the categories which appear less often affect the score less?

I've seen the sample_weight parameter of the r2_score function, but I don't think that will cut it for me, as it seems to apply the weights to every single column, and I don't want that.

I've also seen this excellent question and answer about the sample_weights and class_weights parameters, but they don't state whether it's possible to assign weights to certain features only.

I've been trying different things, and I found that you can set the weights for any scoring function by doing something like this:

from sklearn.metrics import make_scorer, r2_score

def weighted_r2(y_true, y_pred, sample_weight):
    return r2_score(y_true, y_pred, sample_weight=sample_weight)

# x is my feature DataFrame; its index is what gets passed as the per-sample weights
weighted_r2_scorer = make_scorer(weighted_r2, greater_is_better=True, needs_proba=False, needs_threshold=False, sample_weight=x.index)

That scorer can be fed to GridSearchCV and it should work. The only problem is that the sample_weight parameter has to be the same length as the samples being scored.

I can hear you saying that they are the same length, but they aren't: because of cross-validation, the sample space gets chopped into parts (5 by default), which changes the number of samples being scored (but not the number of weights...). As proof, here is the error that gets thrown about a thousand times when I run this:

UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\model_selection\_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\metrics\_scorer.py", line 220, in __call__
    return self._score(
  File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\metrics\_scorer.py", line 268, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "D:\InSilicoOP-FUAM\INSILICO-OP\src\bayesian_model.py", line 58, in weighted_r2
    return r2_score(y_true, y_pred, sample_weight=sample_weight)
  File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\metrics\_regression.py", line 914, in r2_score
    check_consistent_length(y_true, y_pred, sample_weight)
  File "D:\InSilicoOP-FUAM\INSILICO-OP\.conda\lib\site-packages\sklearn\utils\validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [263, 263, 329]

You can do the math: 263 is indeed roughly 4/5 of 329.
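
For completeness, here is a minimal, self-contained sketch that reproduces the same mismatch; the random toy data, the all-ones weights and the GaussianProcessRegressor alpha grid are just placeholders for my real setup:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(329, 5))    # 329 samples, like my data
y = X @ rng.normal(size=5) + rng.normal(size=329)
weights = np.ones(329)           # full-length weights, frozen when the scorer is built

def weighted_r2(y_true, y_pred, sample_weight):
    return r2_score(y_true, y_pred, sample_weight=sample_weight)

# make_scorer stores sample_weight as a fixed kwarg and forwards it on every call
scorer = make_scorer(weighted_r2, greater_is_better=True, sample_weight=weights)

grid = GridSearchCV(
    GaussianProcessRegressor(),
    param_grid={"alpha": [1e-10, 1e-2]},
    scoring=scorer,
    cv=5,
    error_score="raise",  # raise instead of warning and setting the score to nan
)

# The scorer only ever sees one CV partition of X/y, while `weights` keeps its
# 329 entries, so r2_score's check_consistent_length raises the ValueError above.
grid.fit(X, y)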

Any help will be welcome.

  • Idea: train a separate model using only those weak features and treat its output as an input to the main model. Commented Apr 26, 2023 at 15:47
  • I will keep that in mind, and I've already proposed it, but for now I have to try to train the model as is; I'm not allowed to separate categorical and numerical features. Commented Apr 26, 2023 at 15:50
  • I've tried separating categorical and numerical features, and the data I have is so sparse that it's really hard to get anything out of it, even when I consider only one category inside the data. Commented Apr 28, 2023 at 6:59
  • That's a small number of data points; you need to use some expert knowledge to remove weak features. Otherwise some features might get "lucky" and trick the model into thinking they are good. In case you don't know it, search for the term "curse of dimensionality" on Google. Commented Apr 28, 2023 at 12:26
  • And note that even seemingly very good features (from an expert's point of view) might lower the performance if they are not well represented in the data. Commented Apr 28, 2023 at 12:27
