Pandas: How to sum columns on data frame based on value of another data frame

Question

I am new to Pandas and I am trying to do the following:

I have a dataframe called comms with columns articleID and commentScore (among others)
I have another dataframe called arts with column articleID

I need to create in arts a new column called articleScore. Each article must have the articleScore which is the sum of all commentScores related to that article (same articleID), divided by sqrt(n_comms + 1), where n_comms is the number of comments with that specific ID.

I already managed to do this, but in a very inefficient way (shown below):

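import math

scores = []
for _, article in arts.iterrows():
    n, tempScore = 0, 0
    # scan every comment and keep only those belonging to this article
    for i, value in comms.iterrows():
        if value['articleID'] == article['articleID']:
            tempScore += value['commentScore']
            n += 1
    scores.append(tempScore / math.sqrt(n + 1))
arts['articleScore'] = scores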

Edit: Here's an example of what I would like to happen:

comms:
__________________________
| # | artID | commScore  |
| 0 | 1x5w  |     2      |
| 1 | 77k3  |     1      |
| 2 | 77k3  |    -1      |
| 3 | 3612  |     5      |
| 4 | 1x5w  |     3      |
--------------------------

arts:
___________________________
| # | artID | artScore (?) |
| 0 | 1x5w  |    2.89      |
| 1 | 77k3  |     0        |
| 2 | 3612  |    3.54      |
-------------------------

I need to (create and) fill the artScore column. Each artScore is the sum of the commScores of only those comments that have the same artID as the article, divided by sqrt(n+1), where n is the number of those comments.

Can anybody help me? Thanks a lot!

Andrea

Well, can you add a sample dataframe and your expected dataframe?
– dibery, Commented Apr 17, 2021 at 15:05
The result that you wish to get given the sample input. Because it's a sample you can fill the result by hand. In this way, people can help you better.
– dibery, Commented Apr 17, 2021 at 15:19
That's very good! But it would be better if you could calculate the sample result and fill it in on the DF. Also, it's better to post the DF in plain text (in code block format), rather than as an image.
– dibery, Commented Apr 17, 2021 at 15:39

gofvonx · Accepted Answer · 2021-04-17 16:45:32Z

1

I think you can use groupby followed by a merge on 'artID':

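import numpy as np
grpd = comms.groupby('artID')['commScore']
# per article: sum of scores divided by sqrt(number of comments + 1)
to_merge = (grpd.sum() / np.sqrt(grpd.count() + 1)).rename('artScore').reset_index()
# merge returns a new frame, so assign the result back
arts = arts.merge(to_merge, on='artID')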
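For anyone who wants to try this end to end, here is a minimal sketch of the groupby-then-merge idea on the sample data from the question. Two things are my assumptions and not part of the answer above: the extra article 'zz00' (a hypothetical article with no comments) and the use of how='left' with fillna, which keeps such articles and scores them as 0 instead of dropping them.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question
comms = pd.DataFrame({
    'artID': ['1x5w', '77k3', '77k3', '3612', '1x5w'],
    'commScore': [2, 1, -1, 5, 3],
})
# 'zz00' is a hypothetical article with no comments, added for illustration
arts = pd.DataFrame({'artID': ['1x5w', '77k3', '3612', 'zz00']})

# Per article: sum of comment scores divided by sqrt(number of comments + 1)
grpd = comms.groupby('artID')['commScore']
scores = (grpd.sum() / np.sqrt(grpd.count() + 1)).rename('artScore').reset_index()

# how='left' keeps every row of arts; articles without comments come back as NaN,
# which the formula would score as 0 / sqrt(0 + 1) = 0
result = arts.merge(scores, on='artID', how='left').fillna({'artScore': 0.0})
print(result)
```

With the default inner merge, an article that has no comments would disappear from the result, so whether the left merge is preferable depends on whether every article is guaranteed to have at least one comment.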

edited Apr 17, 2021 at 16:45

answered Apr 17, 2021 at 16:22

3 Comments

Sala Over a year ago

Hi @gofvonx, thanks for your answer. I did what you recommended, but now on my arts data frame there are a lot of unwanted columns. Plus, I don't see in your code snippet where you assign a name to the new column articleScore

gofvonx Over a year ago

@Sala I have included explicit renaming of the column and also selected only the required sub-frame to merge. (Your sample data doesn't have any additional columns.) Does that work?

Sala Over a year ago

Yes. Thanks a lot!

PieCot · Accepted Answer · 2021-04-17 18:21:15Z

1

You can use groupby with agg and a custom lambda function to apply to each group:

comms.groupby('artID').agg(
    {'commScore': lambda x: x.sum() / np.sqrt(len(x) + 1)}
).reset_index().rename(columns={'commScore': 'artScore'})

Result:

  artID  artScore
0  1x5w  2.886751
1  3612  3.535534
2  77k3  0.000000
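The snippet above produces a separate frame of scores. To write them into arts as the question asks, one possible sketch is to keep the scores as a Series indexed by artID and look them up with map. The sample frames are copied from the question; the fillna(0.0) step is my assumption for articles that have no comments.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question
comms = pd.DataFrame({
    'artID': ['1x5w', '77k3', '77k3', '3612', '1x5w'],
    'commScore': [2, 1, -1, 5, 3],
})
arts = pd.DataFrame({'artID': ['1x5w', '77k3', '3612']})

# Series indexed by artID: sum / sqrt(count + 1) for each group of comments
scores = comms.groupby('artID')['commScore'].agg(
    lambda x: x.sum() / np.sqrt(len(x) + 1)
)

# map looks up each article's artID in the index of scores;
# fillna(0.0) covers articles that have no comments at all (an assumption)
arts['artScore'] = arts['artID'].map(scores).fillna(0.0)
print(arts)
```

This keeps the row order of arts unchanged and adds only the one new column.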

edited Apr 17, 2021 at 18:21

answered Apr 17, 2021 at 18:14

Yasir · Accepted Answer · 2021-04-18 09:15:47Z

1

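import numpy as np

# per-article sum and count of comment scores (df is the comms data frame)
df = df.groupby('artID').agg(['sum', 'count'])

# create the new column using your formula; np.sqrt works element-wise on a Series, math.sqrt does not
df['artScore'] = df['commScore']['sum'] / np.sqrt(df['commScore']['count'] + 1)


    commScore   artScore
       sum  count   
artID           
1x5w    5   2   2.886751
3612    5   1   3.535534
77k3    0   2   0.000000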
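A variant of the same sum-and-count idea is to use named aggregation, which gives flat column names instead of a two-level column header. This is only a sketch on the sample data from the question; the column names total and n_comms are hypothetical names I picked for illustration.

```python
import numpy as np
import pandas as pd

# Sample comments copied from the question
comms = pd.DataFrame({
    'artID': ['1x5w', '77k3', '77k3', '3612', '1x5w'],
    'commScore': [2, 1, -1, 5, 3],
})

# Named aggregation: each keyword becomes a flat output column
# ('total' and 'n_comms' are illustrative names, not from the answer)
stats = comms.groupby('artID').agg(
    total=('commScore', 'sum'),
    n_comms=('commScore', 'count'),
)

# np.sqrt is applied element-wise to the whole column
stats['artScore'] = stats['total'] / np.sqrt(stats['n_comms'] + 1)
print(stats)
```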

edited Apr 18, 2021 at 9:15

answered Apr 17, 2021 at 17:05

2 Comments

gofvonx Over a year ago

I believe this code is not correct as it does not take the square-root.

Yasir Over a year ago

You are right. I fixed it. Thanks, @gofvonx.
