I have transactional data for users as follows:
userid accountid weeknumber amount_spent
1 a 1 100
1 a 2 200
1 a 4 500
1 b 1 500
...
9 z 1 350
The data only captures weeks in which the user had transactions. I need to go through the data and add rows for the weeks in which the user didn't spend any money.
E.g. I need to add rows like:
userid accountid weeknumber amount_spent
1 a 3 0
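For a concrete, reproducible example, a frame matching the first four rows shown above (the rows behind the "..." are omitted) can be built like this:

```python
import pandas as pd

# Sample data: only the first four rows shown above
df = pd.DataFrame(
    {
        "userid": [1, 1, 1, 1],
        "accountid": ["a", "a", "a", "b"],
        "weeknumber": [1, 2, 4, 1],
        "amount_spent": [100, 200, 500, 500],
    }
)
```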
Currently I do it as follows (df is indexed by ['userid', 'accountid', 'weeknumber']):

# iterate over all user-account pairs
for userid, acctid in df.groupby(['userid', 'accountid']).groups.keys():
    # get the weeks that we have recorded for this user-account pair
    weeks_recorded = df.xs((userid, acctid), axis=0, level=[0, 1],
                           drop_level=True).index.values
    for i in range(1, MAX_WEEK_NUMBER):
        if i not in weeks_recorded:
            # add a row for the week without transactions
            df.loc[(userid, acctid, i), 'amount_spent'] = 0
# move the index levels back into columns
df = df.reset_index()
This is incredibly slow when I run it on a dataset with ~90,000 rows. I suspect there is a high cost to looking up a row in a multilevel index when the row doesn't exist yet, since each such lookup triggers an index insertion.
Are there more efficient ways to do this, or perhaps built in functionalities to achieve what I'm trying to do?
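For reference, this is the vectorized shape I was hoping a built-in would give me: build every (userid, accountid, weeknumber) combination with a cross merge (requires pandas >= 1.2), then left-join the recorded spend back in and fill the gaps with 0. The sample frame and MAX_WEEK_NUMBER value below are stand-ins for my real data:

```python
import pandas as pd

MAX_WEEK_NUMBER = 5  # stand-in value; weeks run 1..4 below

# Stand-in for my real data (first four rows of the sample above)
df = pd.DataFrame(
    {
        "userid": [1, 1, 1, 1],
        "accountid": ["a", "a", "a", "b"],
        "weeknumber": [1, 2, 4, 1],
        "amount_spent": [100, 200, 500, 500],
    }
)

# every observed (userid, accountid) pair crossed with every week
pairs = df[["userid", "accountid"]].drop_duplicates()
weeks = pd.DataFrame({"weeknumber": range(1, MAX_WEEK_NUMBER)})
full = pairs.merge(weeks, how="cross")

# left-join the transactions; weeks with no spend come back NaN, then become 0
out = (
    full.merge(df, on=["userid", "accountid", "weeknumber"], how="left")
        .fillna({"amount_spent": 0})
)
```

With the sample data this produces 8 rows: weeks 1-4 for both the (1, a) and (1, b) pairs, with amount_spent of 0 for the missing weeks such as (1, a, 3).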