
I have a bunch of dataframes and the same number of arrays representing intervals (break points) in the price column of these dataframes.

I need to assign a new column called description_contrib based on these intervals. For example, if the price is 16 USD and the interval array is [0, 10], then description_contrib for this row will be 2, because 16 is greater than 0 and also greater than 10.

I came up with this code:

def description_contribution(df_cat):
    # For each dataframe, mark each row with the index of the last
    # interval bound its price still meets or exceeds
    for i in range(len(df_cat)):
        for j in range(len(intervals[i])):
            df_cat[i].loc[df_cat[i]['price'] >= intervals[i][j], 'description_contrib'] = j

But it runs slowly, and there is probably a more robust solution for this.

How can I improve this?

UPD: The data looks like this:

train_id    item_condition_id   brand_name  price   shipping    description_contrib
5644        1                   Unknown     15.0    1           6
12506       1                   Unknown     8.0     1           3
26141       1                   Unknown     20.0    1           8

And the intervals for this dataframe are:

[0.0, 0.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 31.0]
  • Can we see some data please? Commented Dec 11, 2017 at 10:50
  • @cᴏʟᴅsᴘᴇᴇᴅ Added Commented Dec 11, 2017 at 11:06
  • Does this give you what you're looking for? (df.price.values[:, None] > intervals).sum(1) It should be pretty damn fast. Commented Dec 11, 2017 at 11:43
  • @cᴏʟᴅsᴘᴇᴇᴅ it is fast) thanks Commented Dec 11, 2017 at 22:09
  • Faster than the current answer? If you would like, I can write an answer you can accept. Otherwise, it's alright. Commented Dec 12, 2017 at 4:58

2 Answers


You can perform a broadcasted comparison with the numpy arrays -

v = (df.price.values[:, None] > intervals).sum(1)

This can be assigned back to df -

df['description_contrib'] = v

The caveat with this is the memory usage, especially for larger data; a fair tradeoff for the speed.
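For reference, here is a minimal runnable check of this approach against the toy example from the question (price 16 with bounds [0, 10] should give 2); the other prices are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data from the question: interval bounds [0, 10]; the 5.0 and 12.0
# prices are added for illustration.
intervals = np.array([0, 10])
df = pd.DataFrame({'price': [16.0, 5.0, 12.0]})

# Broadcast each price against every bound, then count how many bounds
# each price exceeds.
df['description_contrib'] = (df.price.values[:, None] > intervals).sum(1)
print(df['description_contrib'].tolist())  # → [2, 1, 2]
```

The intermediate boolean matrix has shape (n_rows, n_bounds), which is where the memory cost mentioned above comes from.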


Most of the time, the first option to speed things up is to replace loops with a vectorized operation. For example, you can make your code faster and more readable this way:

import pandas as pd

intervals = [0, 10]
df_cat = pd.DataFrame({'price': range(100)})
df_cat['description_contrib'] = sum(df_cat['price'] > v for v in intervals)

Assuming that df_cat has many rows and there are few intervals, this will give you good performance. Still, faster ways may exist.
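As an aside (not from either answer above): since the interval bounds in the question are sorted, np.searchsorted can compute the same count with a binary search per row, which helps when there are many bounds. A sketch with illustrative data:

```python
import numpy as np
import pandas as pd

# Sorted interval bounds and a toy dataframe (illustrative values;
# the 16.0 / [0, 10] pair is the example from the question).
intervals = np.array([0.0, 10.0])
df_cat = pd.DataFrame({'price': [16.0, 5.0, 12.0]})

# side='left' returns, for each price, the number of bounds strictly
# below it -- the same count as (price > bounds).sum() per row,
# without materializing a boolean matrix.
df_cat['description_contrib'] = np.searchsorted(
    intervals, df_cat['price'].values, side='left'
)
print(df_cat['description_contrib'].tolist())  # → [2, 1, 2]
```

This avoids the (n_rows, n_bounds) temporary of the broadcasted comparison, at the cost of requiring the bounds to be sorted.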
