I am trying to speed up the code below. I am working with a dataframe that has 40k columns, and I need to apply the following function to every column.

def differencing(col,per=1):
    df[f'{col}_d{per}'] = df[col].diff(periods = per)
    df[f'{col}_d{per}'].fillna(0,inplace=True)
    df[f'{col}_d{per}_ind'] = np.where(df[f'{col}_d{per}'] > 0 , 1, np.where(df[f'{col}_d{per}'] < 0, -1,0)) # 3 classes


for col in df.columns:
    differencing(col,per=1)   

I only know how to use a for loop to apply this function column by column. How can I speed this up? The problem with apply is that the function adds two new columns to the existing dataframe. This is where I am stuck.

2 Answers

Pretty much everything you do can be done directly on the whole dataframe, instead of per series while iterating over the columns:

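import numpy as np

def differencing(df, per=1):
    # Difference all columns at once; rows without a predecessor become 0.
    dif = df.diff(periods=per).fillna(0).add_suffix(f'_d{per}')
    # Sign of each difference as an integer: 1, -1 or 0.
    ind = np.sign(dif).astype(int).add_suffix('_ind')
    return df.join([dif, ind])

# join returns a new dataframe, so assign the result back.
df = differencing(df)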

That is roughly a 50% reduction in run time on a dataframe with 5 columns and 10,000 rows. On a dataframe with 5,000 columns and 10 rows, the time dropped from 24 seconds to 0.016 seconds (caveat: both were measured on my machine, which was running a lot of other things at the same time).
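To check timings like these on your own machine, here is a minimal sketch that runs a column-by-column version and a whole-dataframe version on the same random data and confirms that they produce the same values. The names differencing_loop, differencing_frame and demo, the 10 x 500 shape, and the '_d1' column suffix (taken from the question's naming) are assumptions for illustration only; absolute timings will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd


def differencing_loop(df, per=1):
    # Column-by-column baseline, in the spirit of the loop in the question.
    out = df.copy()
    for col in df.columns:
        d = df[col].diff(periods=per).fillna(0)
        out[f'{col}_d{per}'] = d
        out[f'{col}_d{per}_ind'] = np.sign(d).astype(int)
    return out


def differencing_frame(df, per=1):
    # Whole-dataframe version: one diff, one sign, one join.
    dif = df.diff(periods=per).fillna(0).add_suffix(f'_d{per}')
    ind = np.sign(dif).astype(int).add_suffix('_ind')
    return df.join([dif, ind])


# Hypothetical test data: 10 rows, 500 columns of random floats.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(10, 500)),
                    columns=[f'c{i}' for i in range(500)])

t0 = time.perf_counter()
looped = differencing_loop(demo)
t1 = time.perf_counter()
joined = differencing_frame(demo)
t2 = time.perf_counter()

print(f'loop: {t1 - t0:.4f}s, whole-frame: {t2 - t1:.4f}s')

# Both versions hold the same values; only the column order differs.
pd.testing.assert_frame_equal(looped[joined.columns], joined)
```

The loop version may emit a pandas PerformanceWarning about a fragmented dataframe, because it inserts columns one at a time. That repeated insertion is a large part of why the loop gets slow as the column count grows.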

4 Comments

Instead of (dif.gt(0).astype(int) - dif.lt(0).astype(int)), wouldn't np.sign(dif).astype('int').add_suffix('_ind') be faster?
That is indeed even faster @user1769197, I’ll update it.
@user1769197 You were telling another solution provider to follow my solution idea of using np.sign. Would you give credit to my original idea by upvoting my answer?
@user1769197 It's my solution that you neither accepted nor upvoted. The one below. Not this one.
2

You can fine-tune your code to speed it up by:

  1. Replacing the nested np.where calls that determine the sign of each value with the numpy.sign() function
  2. Combining the first two statements into one

def differencing(col,per=1):
    df[f'{col}_d{per}'] = df[col].diff(periods = per).fillna(0)
    df[f'{col}_d{per}_ind'] = np.sign(df[f'{col}_d{per}']).astype(int)

for col in df.columns:
    differencing(col,per=1)   
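As a quick sanity check that np.sign gives the same three classes as the nested np.where, here is a small sketch on a made-up Series (the values in s are an assumption chosen to produce a rise, a tie and a fall):

```python
import numpy as np
import pandas as pd

# Hypothetical sample values: up, flat, down, up.
s = pd.Series([3.0, 5.0, 5.0, 2.0, 4.0])
d = s.diff(periods=1).fillna(0)  # 0.0, 2.0, 0.0, -3.0, 2.0

# Nested np.where from the question versus np.sign from this answer.
nested = np.where(d > 0, 1, np.where(d < 0, -1, 0))
signed = np.sign(d).astype(int)

print(nested.tolist())  # [0, 1, 0, -1, 1]
print(signed.tolist())  # [0, 1, 0, -1, 1]
```

Note that fillna(0) has to come before astype(int): np.sign leaves NaN as NaN, and NaN cannot be converted to an integer.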
