Pandas: apply a function that returns multiple new columns over a Pandas DataFrame

Question

I am trying to speed up the code below, as I am working with a dataframe with 40k columns and need to apply the following function to every column of the dataframe.

def differencing(col,per=1):
    df[f'{col}_d{per}'] = df[col].diff(periods = per)
    df[f'{col}_d{per}'].fillna(0,inplace=True)
    df[f'{col}_d{per}_ind'] = np.where(df[f'{col}_d{per}'] > 0 , 1, np.where(df[f'{col}_d{per}'] < 0, -1,0)) # 3 classes


for col in df.columns:
    differencing(col,per=1)

I only know how to apply this function column by column with a for loop. How can I speed this up? The problem with apply is that the function adds 2 new columns to the existing dataframe. This is where I am stuck.

Cimbali · Accepted Answer · 2021-08-05 12:29:57Z

2

Pretty much all you do can be done directly on the dataframe, instead of per-series and iterating on the columns:

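import numpy as np

def differencing(df, per=1):
    # Difference all columns at once; the first `per` rows have no predecessor, so fill them with 0
    dif = df.diff(periods=per).fillna(0).add_suffix(f'_d{per}')  # same column names as in the question
    # Sign of each difference (-1, 0 or 1), cast to int as in the question
    ind = np.sign(dif).astype(int).add_suffix('_ind')
    # Returns a new dataframe; the input is left unchanged
    return df.join([dif, ind])

df = differencing(df)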

That's roughly a 50% reduction in duration on a dataframe with 5 columns and 10,000 rows. On a dataframe with 5,000 columns and 10 rows, it reduced the time from 24 seconds to 0.016 seconds (caveat: both were measured on my machine, which runs a lot of other things simultaneously).
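
To check the comparison on your own machine, something like the sketch below could be used. It is only an illustration: the function names `differencing_loop` and `differencing_vectorized`, the random test data and the 10 x 40 shape are assumptions made for this example, and the `_d{per}` suffix follows the question's column naming. It runs the column-by-column logic and the whole-frame logic on the same data, times each once, and checks that they produce the same values. No timings are quoted here because they depend heavily on the machine and on the shape of the data.

```python
import time

import numpy as np
import pandas as pd


def differencing_loop(df, per=1):
    """Column-by-column version, following the logic in the question."""
    out = df.copy()
    for col in df.columns:
        d = df[col].diff(periods=per).fillna(0)
        out[f"{col}_d{per}"] = d
        # Nested np.where as in the question: 1, -1 or 0
        out[f"{col}_d{per}_ind"] = np.where(d > 0, 1, np.where(d < 0, -1, 0))
    return out


def differencing_vectorized(df, per=1):
    """Whole-frame version, following the approach in this answer."""
    dif = df.diff(periods=per).fillna(0).add_suffix(f"_d{per}")
    ind = np.sign(dif).astype(int).add_suffix("_ind")
    return df.join([dif, ind])


# Hypothetical test data: 10 rows, 40 columns of random floats.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 40)),
                  columns=[f"c{i}" for i in range(40)])

start = time.perf_counter()
looped = differencing_loop(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized = differencing_vectorized(df)
vectorized_seconds = time.perf_counter() - start

# The two versions order their columns differently, so align before comparing.
same_values = np.array_equal(looped[vectorized.columns].to_numpy(),
                             vectorized.to_numpy())
print(f"loop: {loop_seconds:.4f}s  vectorized: {vectorized_seconds:.4f}s  "
      f"same values: {same_values}")
```

A single `time.perf_counter()` measurement is noisy; for anything more than a rough check, `timeit` with several repetitions would be a better choice.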

edited Aug 5, 2021 at 12:29

answered Aug 5, 2021 at 11:12

Cimbali


4 Comments

user1769197 Over a year ago

instead of (dif.gt(0).astype(int) - dif.lt(0).astype(int)), wouldn't np.sign(dif).astype('int').add_suffix('_ind') be faster ?

Cimbali Over a year ago

That is indeed even faster @user1769197, I’ll update it.

SeaBean Over a year ago

@user1769197 You were telling another solution provider to follow my solution idea of using np.sign. Would you give credit to my original idea by upvoting my answer?

SeaBean Over a year ago

@user1769197 It's my solution that you neither accepted nor upvoted. The one below. Not this one.

SeaBean · Answer · 2021-08-05 11:49:08Z

2

You can fine-tune your code to speed it up by:

Replacing the nested np.where calls that determine the sign of each value with the numpy.sign() function
Combining the first 2 statements into one

def differencing(col,per=1):
    df[f'{col}_d{per}'] = df[col].diff(periods = per).fillna(0)
    df[f'{col}_d{per}_ind'] = np.sign(df[f'{col}_d{per}']).astype(int)

for col in df.columns:
    differencing(col,per=1)
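
As a quick sanity check that np.sign gives the same three classes as the nested np.where, here is a small sketch. The sample values in `d` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical differences covering all three classes: positive, negative, zero.
d = pd.Series([0.0, 2.5, -1.0, 0.0, 3.0])

# Original approach from the question: nested np.where
nested = np.where(d > 0, 1, np.where(d < 0, -1, 0))

# Replacement: np.sign, cast to int so the result is -1, 0 or 1 as integers
signed = np.sign(d).astype(int)

print(nested.tolist())  # [0, 1, -1, 0, 1]
print(signed.tolist())  # [0, 1, -1, 0, 1]
```

Note that np.sign returns NaN for NaN input, and casting NaN to int fails, so the fillna(0) has to stay in front of the astype(int) call.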

answered Aug 5, 2021 at 11:49

SeaBean
