Create new columns in a DataFrame using existing columns in pandas

Question

I have a DataFrame loaded from a CSV file:

           time       m_srcaddr log_type  m_type  m_fwd_bytes  m_rev_bytes
1    1441590784  172.19.139.165   closed      10          295          146
11   1441590785  172.19.139.174   closed      10          441          183
65   1441590792  172.19.139.166   closed      10          441          200
68   1441590792  172.19.139.166   closed      10         3423          461
73   1441590792  172.19.139.172   closed      10          441          379
76   1441590792  172.19.139.172   closed      10         3423          789
77   1441590792  172.19.139.166   closed      10          441          463
81   1441590792  172.19.139.166   closed      10         3423          963
82   1441590793  172.19.139.173   closed      10          295          168
85   1441590793  172.19.139.172   closed      10         4929          542
89   1441590793  172.19.139.166   closed      10         5135          799
93   1441590793  172.19.139.166   closed      10         4929          510
96   1441590794  172.19.139.166   closed      10            0          198
98   1441590794  172.19.139.167   closed      10            0          455
100  1441590794  172.19.139.166   closed      10         4945          495

I am trying to group by m_srcaddr, sum m_fwd_bytes and m_rev_bytes within each group, divide the sums by 1000, and store the results in new columns called total_fwd_size and total_rev_size:

subdata['total_fwd_size'] = subdata.groupby('m_srcaddr').sum().reset_index()['m_fwd_bytes']/1000

subdata['total_rev_size'] = subdata.groupby('m_srcaddr').sum().reset_index()['m_rev_bytes']/1000

This is not working: the newly created columns contain NaN. Is there a better way to do this?

You are assigning a whole column from a groupby result whose length does not match?
– Anzel, commented Sep 9, 2015 at 19:26

Anzel · Accepted Answer · 2015-09-09 20:35:16Z

1

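You are assigning a column from a groupby result that has fewer rows than the DataFrame (one row per m_srcaddr). After reset_index() its index is 0, 1, 2, ..., which does not match the DataFrame's index, so pandas cannot align most rows and fills them with NaN.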
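The alignment problem can be reproduced on a small frame. This is only a sketch: the data below is made up for illustration (it is not the asker's CSV), and the addresses are shortened to "a" and "b".

```python
import pandas as pd

# Hypothetical data with a non-default index, similar in shape to the question
df = pd.DataFrame(
    {
        "m_srcaddr": ["a", "b", "b", "a"],
        "m_fwd_bytes": [1000, 2000, 3000, 5000],
    },
    index=[1, 11, 65, 68],
)

# One row per group; reset_index() relabels the rows 0, 1, ...
per_group = (
    df.groupby("m_srcaddr")["m_fwd_bytes"].sum().reset_index()["m_fwd_bytes"] / 1000
)

# Column assignment aligns on index labels, not on position.
# Only label 1 exists on both sides, so every other row becomes NaN.
df["total_fwd_size"] = per_group
print(df)
```

Note that the single row that does receive a value (label 1) gets the total of group "b", although that row belongs to group "a". The values that are not NaN are therefore not trustworthy either.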

Instead, keep the groupby result in a separate object indexed by the groupby field, and use it as a lookup table to fill the original DataFrame based on each row's m_srcaddr.

Here is how I would do it:

In [7]: df.head(5)
Out[7]:
          time       m_srcaddr log_type  m_type  m_fwd_bytes  m_rev_bytes
1   1441590784  172.19.139.165   closed      10          295          146
11  1441590785  172.19.139.174   closed      10          441          183
65  1441590792  172.19.139.166   closed      10          441          200
68  1441590792  172.19.139.166   closed      10         3423          461
73  1441590792  172.19.139.172   closed      10          441          379

# Hold the groupby results, indexed by the groupby field (m_srcaddr).
# These act as lookup tables.
In [8]: total_fwd_size = df.groupby('m_srcaddr')['m_fwd_bytes'].sum() / 1000

In [9]: total_rev_size = df.groupby('m_srcaddr')['m_rev_bytes'].sum() / 1000

# Apply to the original DataFrame, looking up each value based on m_srcaddr
In [10]: df["total_fwd_size"] = df["m_srcaddr"].apply(lambda x: total_fwd_size.loc[x])

In [11]: df["total_rev_size"] = df["m_srcaddr"].apply(lambda x: total_rev_size.loc[x])

Results:

In [12]: df.head(5)
Out[12]:
          time       m_srcaddr log_type  m_type  m_fwd_bytes  m_rev_bytes  \
1   1441590784  172.19.139.165   closed      10          295          146
11  1441590785  172.19.139.174   closed      10          441          183
65  1441590792  172.19.139.166   closed      10          441          200
68  1441590792  172.19.139.166   closed      10         3423          461
73  1441590792  172.19.139.172   closed      10          441          379

    total_fwd_size  total_rev_size
1            0.295           0.146
11           0.441           0.183
65          22.737           4.089
68          22.737           4.089
73           8.793           1.710
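The same per-row lookup can probably be written without a lambda by passing the lookup Series to Series.map, which matches each m_srcaddr value against the Series index. A minimal sketch, assuming a three-row subset of the data above:

```python
import pandas as pd

# Three rows taken from the question's data, with a default index
df = pd.DataFrame(
    {
        "m_srcaddr": ["172.19.139.165", "172.19.139.166", "172.19.139.166"],
        "m_fwd_bytes": [295, 441, 3423],
        "m_rev_bytes": [146, 200, 461],
    }
)

# Lookup tables: one total per source address, indexed by m_srcaddr
total_fwd_size = df.groupby("m_srcaddr")["m_fwd_bytes"].sum() / 1000
total_rev_size = df.groupby("m_srcaddr")["m_rev_bytes"].sum() / 1000

# map() looks up each address in the index of the lookup Series
df["total_fwd_size"] = df["m_srcaddr"].map(total_fwd_size)
df["total_rev_size"] = df["m_srcaddr"].map(total_rev_size)
print(df)
```

Unlike a .loc lookup inside apply, map returns NaN for an address that is missing from the lookup table instead of raising a KeyError.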

You can also combine the two lookups into a single DataFrame:

lookup = df.groupby('m_srcaddr')[['m_rev_bytes', 'm_fwd_bytes']].sum() / 1000

Then apply and perform the lookup:

df['total_fwd_size'] = df['m_srcaddr'].apply(lambda x: lookup.loc[x, 'm_fwd_bytes'])
df['total_rev_size'] = df['m_srcaddr'].apply(lambda x: lookup.loc[x, 'm_rev_bytes'])
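As an alternative to building a lookup table, groupby(...).transform("sum") returns a result with the same index as the original DataFrame, so it can be assigned directly. This is a swapped-in technique rather than the approach above; the sketch assumes a three-row subset of the question's data.

```python
import pandas as pd

# Three rows taken from the question's data, with a default index
df = pd.DataFrame(
    {
        "m_srcaddr": ["172.19.139.165", "172.19.139.166", "172.19.139.166"],
        "m_fwd_bytes": [295, 441, 3423],
        "m_rev_bytes": [146, 200, 461],
    }
)

# transform("sum") broadcasts each group's total back onto every row of that group
totals = df.groupby("m_srcaddr")[["m_fwd_bytes", "m_rev_bytes"]].transform("sum") / 1000

df["total_fwd_size"] = totals["m_fwd_bytes"]
df["total_rev_size"] = totals["m_rev_bytes"]
print(df)
```

Because the transformed frame keeps the original row labels, the assignment aligns row by row and no NaN values appear.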

Hope this helps.

edited Sep 9, 2015 at 20:35

answered Sep 9, 2015 at 19:32

Anzel


2 Comments

Sanjay Mishra Over a year ago

Thanks a lot! That is what I was looking for. In the final df I can groupby('m_srcaddr'), since I am looking for the size per srcaddr.

Anzel Over a year ago

@SanjayMishra, glad it helps. If you think this resolves your question, please accept my answer by clicking the "tick" at the top left of my answer :-)
