I have transactional data for users as follows:

userid accountid weeknumber amount_spent
1      a         1          100
1      a         2          200
1      a         4          500
1      b         1          500
...
9      z         1          350

The data only captures weeks in which the user had transactions. I need to go through the data and add zero-spend rows for the weeks where the user didn't spend any money.

E.g. I need to add rows like:

userid accountid weeknumber amount_spent
1      a         3          0

Currently I do it as follows:

# iterate over all (userid, accountid) pairs
for userid, acctid in df.groupby(['userid', 'accountid']).groups.keys():

    # get the weeks already recorded for this user-account pair
    weeks_recorded = df.xs((userid, acctid), axis=0, level=[0, 1],
                           drop_level=True).index.values

    for i in range(1, MAX_WEEK_NUMBER):
        if i not in weeks_recorded:
            # add a row for the week without transactions
            df.loc[(userid, acctid, i), 'amount_spent'] = 0

# move the MultiIndex levels back into columns
df = df.reset_index()

This is incredibly slow when I run it on a dataset with ~90,000 rows. I suspect there is a high cost to looking up a row in a MultiIndex when the row doesn't exist yet.

Are there more efficient ways to do this, or perhaps built-in functionality that achieves what I'm trying to do?

  • Why would you do that? Effectively, the data you want to add is already there, by the fact that it is absent.

2 Answers


Personally, I would forget about groupby and iterating through the DataFrame. I would just create a DataFrame that looks like the empty rows you want, then merge in the populated data.

import numpy as np
import pandas as pd

# create your existing data
df = pd.DataFrame({'userId'    : [1, 1, 1, 1, 2],
                   'accountId' : ['a', 'a', 'a', 'b', 'z'],
                   'week'      : [1, 2, 4, 1, 1],
                   'amount'    : [100, 200, 500, 500, 350]})

# create unique (userId, accountId) pairs
unique_ids = set(zip(df['userId'], df['accountId']))

# build a frame with every pair repeated for each of weeks 1-5
new_df = pd.DataFrame({'userId'    : np.repeat([val[0] for val in unique_ids], 5),
                       'accountId' : np.repeat([val[1] for val in unique_ids], 5),
                       'week'      : np.tile(list(range(1, 6)), len(unique_ids))})

# outer-merge so the missing weeks appear, then zero-fill their amounts
pd.merge(df, new_df, how='outer').sort_values(['accountId', 'userId', 'week']).fillna(0)

This is for a 5-week period. The result is:

   accountId  amount  userId  week
0          a   100.0       1     1
1          a   200.0       1     2
5          a     0.0       1     3
2          a   500.0       1     4
6          a     0.0       1     5
3          b   500.0       1     1
11         b     0.0       1     2
12         b     0.0       1     3
13         b     0.0       1     4
14         b     0.0       1     5
4          z   350.0       2     1
7          z     0.0       2     2
8          z     0.0       2     3
9          z     0.0       2     4
10         z     0.0       2     5
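
If you'd rather keep the question's original column names and avoid building the grid index-by-index, a similar built-in route is to cross-join the unique pairs with the full week range and left-merge the observed spend. This is a minimal sketch, not the answer above verbatim; MAX_WEEK_NUMBER is an assumed known horizon, and how='cross' needs pandas >= 1.2:

import pandas as pd

MAX_WEEK_NUMBER = 5  # assumed horizon; set this to your real last week

# every observed (userid, accountid) pair, crossed with every week
pairs = df[['userid', 'accountid']].drop_duplicates()
weeks = pd.DataFrame({'weeknumber': range(1, MAX_WEEK_NUMBER + 1)})
full = pairs.merge(weeks, how='cross')

# left-merge the recorded spend onto the full grid; gaps become NaN, then 0
out = full.merge(df, on=['userid', 'accountid', 'weeknumber'], how='left')
out['amount_spent'] = out['amount_spent'].fillna(0)

This does one vectorized merge instead of a per-row .loc insert, which is where the original approach spends its time.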


Here is my solution:

# group by (userid, accountid) and take each pair's max recorded week
for row in df.groupby(['userid', 'accountid']).max().itertuples():
    # full range of weeks up to this pair's last recorded week
    r = list(range(1, row.weeknumber + 1))
    # find all weeks already recorded for this pair
    unique = df[(df.accountid == row.Index[1]) & (df.userid == row.Index[0])].weeknumber.unique()
    # subtract the ones that are already there
    missing = [e for e in r if e not in unique]
    # append the missing weeks to the dataframe
    for m in missing:
        line = pd.DataFrame({'userid': [row.Index[0]], 'weeknumber': [m],
                             'amount_spent': [0], 'accountid': [row.Index[1]]})
        df = pd.concat([df, line], ignore_index=True)

This should be faster, but on a small sample it's hard to tell.
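
One caveat: concatenating inside the loop still copies the whole DataFrame on every iteration. A variation on the same idea (a sketch, not part of the answer above) collects the missing rows in a plain list and concatenates once at the end:

import pandas as pd

rows = []  # collect missing-week rows here instead of appending one by one
for row in df.groupby(['userid', 'accountid']).max().itertuples():
    recorded = df[(df.userid == row.Index[0]) &
                  (df.accountid == row.Index[1])].weeknumber.unique()
    for week in range(1, row.weeknumber + 1):
        if week not in recorded:
            rows.append({'userid': row.Index[0], 'accountid': row.Index[1],
                         'weeknumber': week, 'amount_spent': 0})

if rows:  # nothing to add if every week was recorded
    df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)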

