pandas to numpy array for sklearn pipeline

Question

I have a transformer that calculates the percentage of values per group. I initially wrote it with pandas because that is what I started with and column names are easier to handle. Now I need to integrate it into an sklearn pipeline.

How can I convert my transformer to support numpy arrays from an sklearn pipeline instead of pandas data frames? The problem is that self.colname can't be used with numpy arrays, and I think the grouping needs to be performed differently.

How can I implement persistence for such a transformer? The fitted weights need to be loadable from disk in order to deploy the transformer in a pipeline.

import pandas as pd
from sklearn.base import TransformerMixin


class PercentageTransformer(TransformerMixin):
    def __init__(self, colname, typePercentage='totalTarget', _target='TARGET', _dropOriginal=True):
        self.colname = colname
        self._target = _target
        self._dropOriginal = _dropOriginal
        self.typePercentage = typePercentage

    def fit(self, X, y, *_):
        # y must be a Series (or single-column DataFrame) named self._target
        original = pd.concat([y, X], axis=1)
        # count rows per (category, target) combination
        grouped = original.groupby([self.colname, self._target]).size()
        if self.typePercentage == 'totalTarget':
            # divide by the total number of positive targets (assumes a 0/1 target)
            df = grouped / original[self._target].sum()
            nameCol = "pre_" + self.colname
        else:
            # divide by the number of rows in each category
            df = grouped / grouped.groupby(level=0).sum()
            nameCol = "pre2_" + self.colname
        self.nameCol = nameCol
        grouped = df.reset_index(name=nameCol)
        # keep only the rows for the positive class
        groupedOnly = grouped[grouped[self._target] == 1]
        groupedOnly = groupedOnly.drop(columns=self._target)

        self.result = groupedOnly
        return self

    def transform(self, dataF):
        mergedThing = pd.merge(dataF, self.result, on=self.colname, how='left')
        # categories not seen during fit get a percentage of 0
        mergedThing[self.nameCol] = mergedThing[self.nameCol].fillna(0)
        if self._dropOriginal:
            mergedThing = mergedThing.drop(columns=self.colname)
        return mergedThing

It would be used in a pipeline like this:

pipeline =  Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('extract', ColumnExtractor(CONTINUOUS_FIELDS)),
        ])),
        ('factors', Pipeline([
            ('extract', ColumnExtractor(FACTOR_FIELDS)),
            # using labelencoding and all bias
            ('bias',  PercentageAllTransformer(FACTOR_FIELDS, _dropOriginal=True, typePercentage='totalTarget')),
        ]))
    ], n_jobs=-1)),
    ('estimator', estimator)
])

The pipeline will be fitted with X and y, both of which are data frames. I am unsure whether X.as_matrix would help.

pandas objects are wrappers around numpy objects. There is no pandas array; I believe you mean Series? Anyway, maybe your problem would be solved simply by returning self.values instead of self.
– juanpa.arrivillaga, Commented Oct 23, 2016 at 17:30
As for persistence, there are several ways to go about it. Generally, object serialization in Python will use the pickle module.
– juanpa.arrivillaga, Commented Oct 23, 2016 at 17:32
Indeed, I meant pandas data frames. If I understand it correctly, the point is that the result of original.groupby([self.colname, self._target]) is no longer a data frame but a numpy array, i.e. the column names no longer work, so self.values does not seem to be enough.
– Georg Heiler, Commented Oct 23, 2016 at 17:33
No, groupby returns a groupby object, which is usually used to generate a new DataFrame. You can't access self.colname and self._target as you normally would because, by default, these are used as the index of the new DataFrame. Pass as_index=False to groupby to retain your grouping columns as columns.
– juanpa.arrivillaga, Commented Oct 23, 2016 at 17:39
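
To illustrate the as_index=False suggestion from the last comment, here is a minimal sketch. The toy data and the column name "color" are made up for illustration, and it assumes pandas 1.1 or later, where size() combined with as_index=False returns a DataFrame with a "size" column.

```python
import pandas as pd

# Hypothetical toy data; "color" stands in for self.colname.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "blue"],
    "TARGET": [1, 0, 1, 1, 0],
})

# Default behaviour: the grouping keys end up in a MultiIndex on a Series.
as_index = df.groupby(["color", "TARGET"]).size()
print(as_index.index.names)

# as_index=False: the grouping keys stay ordinary columns,
# and the counts land in a column called "size".
as_columns = df.groupby(["color", "TARGET"], as_index=False).size()
print(as_columns.columns.tolist())
```

With the keys kept as columns, the later merge on the category column works without a reset_index step.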

yikes.gov · Accepted Answer · 2016-10-23 17:48:01Z

3

Converting Things Back and Forth

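pandas has a .to_records() method and, as you mentioned, an .as_matrix() method. Note that .as_matrix() was later deprecated and removed in pandas 1.0; .to_numpy() (or .values) is the replacement. The .to_records() method keeps your column names, because NumPy supports named fields through structured (record) arrays. See the NumPy documentation on structured arrays.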
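
As a rough sketch of the two conversions mentioned above: the toy frame and its column names are made up for illustration, and .to_numpy() is used in place of .as_matrix(), which recent pandas versions no longer provide.

```python
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
df = pd.DataFrame({"color": ["red", "blue", "red"], "TARGET": [1, 0, 1]})

# Plain 2-D array: the column names are lost.
plain = df.to_numpy()
print(plain.shape)

# Record array: the field names mirror the DataFrame columns,
# so individual columns can still be addressed by name.
records = df.to_records(index=False)
print(records.dtype.names)
print(records["TARGET"].sum())

# Round trip back to a DataFrame when column names are needed again.
restored = pd.DataFrame.from_records(records)
print(restored.columns.tolist())
```

Whether a downstream sklearn step accepts a record array is a separate question; most estimators expect a plain 2-D numeric array, so the record form is mainly useful inside your own transformers.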

Persistence

pandas has a pandas.to_pickle(obj, filename) function, which takes a pandas object and pickles it. There is a corresponding pandas.read_pickle(filename) function to load it again.

NumPy has numpy.save and numpy.load functions as well.
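
A minimal sketch of these persistence options follows. The lookup table, file names, and the "state" dict are hypothetical stand-ins for whatever the fitted transformer stores (self.result in the question), not a prescribed layout.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the fitted lookup table (self.result in the question).
lookup = pd.DataFrame({"color": ["blue", "red"], "pre_color": [0.25, 0.75]})

tmpdir = tempfile.mkdtemp()

# pandas: pickle the DataFrame and read it back.
frame_path = os.path.join(tmpdir, "lookup.pkl")
lookup.to_pickle(frame_path)
lookup_loaded = pd.read_pickle(frame_path)

# NumPy: save only the numeric weights as a .npy file.
weights_path = os.path.join(tmpdir, "weights.npy")
np.save(weights_path, lookup["pre_color"].to_numpy())
weights_loaded = np.load(weights_path)

# Plain pickle handles arbitrary Python objects, e.g. a dict holding
# the column name together with the table.
state_path = os.path.join(tmpdir, "state.pkl")
with open(state_path, "wb") as fh:
    pickle.dump({"colname": "color", "table": lookup}, fh)
with open(state_path, "rb") as fh:
    state = pickle.load(fh)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.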

