
I have a bunch of dataframes and the same number of arrays representing intervals (break points) in the price column of these dataframes.

I need to assign a new column called description_contrib based on these intervals. For example, if the price is 16 USD and the interval array is [0, 10], then description_contrib for this row will be 2, because 16 is greater than 0 and also greater than 10.

I came up with this code:

def description_contribution(df_cat):
    # For each dataframe, mark each row with the index of the last
    # interval bound its price still meets or exceeds
    for i in range(len(df_cat)):
        for j in range(len(intervals[i])):
            df_cat[i].loc[df_cat[i]['price'] >= intervals[i][j], 'description_contrib'] = j

But it runs slowly, and there is probably a more robust solution for this.

How can I improve this?

UPD: The data looks like this:

train_id    item_condition_id   brand_name  price   shipping    description_contrib
5644        1                   Unknown     15.0    1           6
12506       1                   Unknown     8.0     1           3
26141       1                   Unknown     20.0    1           8

And the intervals for this dataframe are:

[0.0, 0.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 31.0]
  • Can we see some data please? Commented Dec 11, 2017 at 10:50
  • @cᴏʟᴅsᴘᴇᴇᴅ Added Commented Dec 11, 2017 at 11:06
  • Does this give you what you're looking for? (df.price.values[:, None] > intervals).sum(1) It should be pretty damn fast. Commented Dec 11, 2017 at 11:43
  • @cᴏʟᴅsᴘᴇᴇᴅ it is fast) thanks Commented Dec 11, 2017 at 22:09
  • Faster than the current answer? If you would like, I can write an answer you can accept. Otherwise, it's alright. Commented Dec 12, 2017 at 4:58

2 Answers


You can perform a broadcasted comparison with the numpy arrays -

v = (df.price.values[:, None] > intervals).sum(1)

This can be assigned back to df -

df['description_contrib'] = v

The caveat with this is the memory usage, especially for larger data; a fair tradeoff for the speed.
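For reference, here is a minimal runnable check of this approach against the toy example from the question (price 16 with bounds [0, 10] should give 2); the other prices are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data from the question: interval bounds [0, 10]; the 5.0 and 12.0
# prices are added for illustration.
intervals = np.array([0, 10])
df = pd.DataFrame({'price': [16.0, 5.0, 12.0]})

# Broadcast each price against every bound, then count how many bounds
# each price exceeds.
df['description_contrib'] = (df.price.values[:, None] > intervals).sum(1)
print(df['description_contrib'].tolist())  # → [2, 1, 2]
```

The intermediate boolean matrix has shape (n_rows, n_bounds), which is where the memory cost mentioned above comes from.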


Most of the time, the first option to speed things up is to replace loops with a vectorized operation. For example, you can make your code faster and more readable this way:

import pandas as pd

intervals = [0, 10]
df_cat = pd.DataFrame({'price': range(100)})
df_cat['description_contrib'] = sum(df_cat['price'] > v for v in intervals)

Assuming that df_cat has many rows and there are few intervals, this will give you good performance. Still, faster ways may exist.
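As an aside (not from either answer above): since the interval bounds in the question are sorted, np.searchsorted can compute the same count with a binary search per row, which helps when there are many bounds. A sketch with illustrative data:

```python
import numpy as np
import pandas as pd

# Sorted interval bounds and a toy dataframe (illustrative values;
# the 16.0 / [0, 10] pair is the example from the question).
intervals = np.array([0.0, 10.0])
df_cat = pd.DataFrame({'price': [16.0, 5.0, 12.0]})

# side='left' returns, for each price, the number of bounds strictly
# below it -- the same count as (price > bounds).sum() per row,
# without materializing a boolean matrix.
df_cat['description_contrib'] = np.searchsorted(
    intervals, df_cat['price'].values, side='left'
)
print(df_cat['description_contrib'].tolist())  # → [2, 1, 2]
```

This avoids the (n_rows, n_bounds) temporary of the broadcasted comparison, at the cost of requiring the bounds to be sorted.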
