How can I preprocess a feature that contains a list of number codes?

I have to preprocess a feature that is a list of number codes encoded as a string, and I want to encode it so that the output is an array of frequencies of each of these numbers. Missing values in the feature should also be imputed.

Here is what I did:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
s_data = pd.Series(['123 342 789', '12 34 56', np.nan, '1 2 3 123'])
s_data = str_data.str.split(" ")
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))
    ('mlb', MultiLabelBinarizer())
])
encoded_data = pipeline.fit_transform(s_data)
encoded_df = pd.DataFrame(encoded_data, columns=mlb.classes_)

The output that I'm expecting is something like this:

   1  12  123  2  3  34  342  56  789
0  0   0    1  0  0   0    1   0    1
1  0   1    0  0  0   1    0   1    0
2  1   0    1  1  1   0    0   0    0

However, SimpleImputer would not accept the input, saying the input contains lists. When I tried changing the input to a NumPy array, it was rejected by MultiLabelBinarizer, which said it expects only 2 inputs but 3 were given.

edited Jul 28, 2024 at 13:50

asked Jul 27, 2024 at 15:30

AKHIL GOPIKUMAR

What is the input that you want to turn into the expected output?

– gtomer
Commented Jul 27, 2024 at 19:37

The input has been given in the code. It is: s_data = pd.Series(['123 342 789', '12 34 56', np.nan, '1 2 3 123'])

– AKHIL GOPIKUMAR
Commented Jul 28, 2024 at 13:40

Welcome to SO; please post a minimal reproducible example.

– desertnaut
Commented Jul 29, 2024 at 18:47

The "expects 2 inputs" issue arises because the binarizer is not compatible with a pipeline: it is designed to be used on targets, not features, so its fit and fit_transform methods accept only y, while a Pipeline calls them with both X and y. And as @desertnaut said, please make sure the code you give actually runs and produces the error you are experiencing (e.g., str_data is not defined, you need to import numpy, etc.).

– Matt Hall
Commented Aug 5, 2024 at 9:56
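
One possible workaround, following the point made in the last comment, is to wrap MultiLabelBinarizer in a small transformer that exposes the fit(X, y) signature a Pipeline expects. The sketch below is only an illustration under some assumptions: the class name CodeListEncoder and its attributes fill_value_ and mlb_ are made up for this example, missing entries are filled with the most frequent string seen during fit (using pandas instead of SimpleImputer), and the result is a 0/1 presence indicator per code, which matches the expected table in the question but is not a true frequency count.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer


class CodeListEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: impute, split, then binarize space-separated codes."""

    def fit(self, X, y=None):
        X = pd.Series(X)
        # Most frequent non-missing string; ties are resolved by sort order.
        self.fill_value_ = X.mode().iloc[0]
        self.mlb_ = MultiLabelBinarizer().fit(self._split(X))
        return self

    def transform(self, X):
        return self.mlb_.transform(self._split(pd.Series(X)))

    def _split(self, X):
        # Fill missing rows, then turn each string into a list of codes.
        return X.fillna(self.fill_value_).str.split().tolist()


s_data = pd.Series(['123 342 789', '12 34 56', np.nan, '1 2 3 123'])

pipeline = Pipeline([('encode', CodeListEncoder())])
encoded = pipeline.fit_transform(s_data)

# The fitted binarizer lives on the wrapper, so the column names come from there.
classes = pipeline.named_steps['encode'].mlb_.classes_
encoded_df = pd.DataFrame(encoded, columns=classes)
print(encoded_df)
```

Because the fill value is learned in fit, the same imputation is reused when transform is called on new data. Codes that were not seen during fit are ignored by MultiLabelBinarizer with a warning.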
