I have to preprocess a feature that is basically a list of numeric codes encoded as a string, and I want to encode it so that the output is an array of frequencies of each of these numbers. The feature should also be preprocessed by imputing missing values.
Here is what I did:
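import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Each entry is a space-separated string of codes; NaN marks a missing value
s_data = pd.Series(['123 342 789', '12 34 56', np.nan, '1 2 3 123'])
s_data = s_data.str.split(" ")  # split each string into a list of codes

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('mlb', MultiLabelBinarizer())
])
encoded_data = pipeline.fit_transform(s_data)  # this is the call that fails
encoded_df = pd.DataFrame(encoded_data, columns=pipeline.named_steps['mlb'].classes_)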
The output that I'm expecting is something like this:
   1  12  123  2  3  34  342  56  789
0  0   0    1  0  0   0    1   0    1
1  0   1    0  0  0   1    0   1    0
2  1   0    1  1  1   0    0   0    0
However, SimpleImputer wouldn't accept the input, saying the input contains lists. When I tried to change the input to a NumPy array, it was rejected by MultiLabelBinarizer, saying it expects only 2 arguments but 3 were given.
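One possible workaround, sketched under some assumptions: impute on the raw strings before splitting them, and call MultiLabelBinarizer directly instead of placing it inside a Pipeline. As far as I can tell, the "3 were given" error comes from Pipeline passing both X and y to fit_transform, while MultiLabelBinarizer.fit_transform takes a single argument because it is designed for targets. The names most_frequent, filled and code_lists below are only illustrative, and filling with pandas' mode() is an assumed stand-in for SimpleImputer(strategy='most_frequent').

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

s_data = pd.Series(['123 342 789', '12 34 56', np.nan, '1 2 3 123'])

# Impute on the raw strings, before they become lists.
# mode() ignores NaN; on ties it returns all candidates sorted, so take the first.
most_frequent = s_data.mode().iloc[0]
filled = s_data.fillna(most_frequent)

# Split into lists of codes only after the missing values are filled
code_lists = filled.str.split(" ")

# Use the binarizer on its own rather than as a Pipeline step
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(code_lists)
encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=s_data.index)
print(encoded_df)
```

Note that this produces 0/1 presence indicators like the table above rather than true counts, and the imputed row simply copies whichever string is picked as most frequent.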