I have transactional data for users as follows:

userid accountid weeknumber amount_spent
1      a         1          100
1      a         2          200
1      a         4          500
1      b         1          500
...
9      z         1          350

The data only captures weeks in which the user had transactions. I need to go through the data and add zero-spend rows for the weeks where the user didn't spend any money.

E.g. I need to add rows like:

userid accountid weeknumber amount_spent
1      a         3          0

Currently I do it as follows:

# iterate over all (userid, accountid) pairs
for userid, acctid in df.groupby(['userid', 'accountid']).groups.keys():

    # get the weeks already recorded for this user-account pair
    weeks_recorded = df.xs((userid, acctid), axis=0, level=[0, 1],
                           drop_level=True).index.values

    for i in range(1, MAX_WEEK_NUMBER):
        if i not in weeks_recorded:
            # add a row for the week without transactions
            df.loc[(userid, acctid, i), 'amount_spent'] = 0

# move the MultiIndex levels back into columns
df = df.reset_index()

This is incredibly slow when I run it on a dataset with ~90,000 rows. I suspect there is a high cost to looking up a row in a MultiIndex when the row doesn't exist yet.

Are there more efficient ways to do this, or perhaps built-in functionality that achieves what I'm trying to do?

  • Why would you do that? Effectively, the data you want to add is already there, by the fact that it is absent.

2 Answers


Personally, I would forget about groupby and iterating through the DataFrame. I would just create a DataFrame that looks like the empty rows you want, then merge in the populated data.

import numpy as np
import pandas as pd

# create your existing data
df = pd.DataFrame({'userId'    : [1, 1, 1, 1, 2],
                   'accountId' : ['a', 'a', 'a', 'b', 'z'],
                   'week'      : [1, 2, 4, 1, 1],
                   'amount'    : [100, 200, 500, 500, 350]})

# create unique (userId, accountId) pairs
unique_ids = set(zip(df['userId'], df['accountId']))

# build a frame with every pair repeated for each of weeks 1-5
new_df = pd.DataFrame({'userId'    : np.repeat([val[0] for val in unique_ids], 5),
                       'accountId' : np.repeat([val[1] for val in unique_ids], 5),
                       'week'      : np.tile(list(range(1, 6)), len(unique_ids))})

# outer-merge so the missing weeks appear, then zero-fill their amounts
pd.merge(df, new_df, how='outer').sort_values(['accountId', 'userId', 'week']).fillna(0)

This is for a 5-week period. The result is:

   accountId  amount  userId  week
0          a   100.0       1     1
1          a   200.0       1     2
5          a     0.0       1     3
2          a   500.0       1     4
6          a     0.0       1     5
3          b   500.0       1     1
11         b     0.0       1     2
12         b     0.0       1     3
13         b     0.0       1     4
14         b     0.0       1     5
4          z   350.0       2     1
7          z     0.0       2     2
8          z     0.0       2     3
9          z     0.0       2     4
10         z     0.0       2     5
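
If you'd rather keep the question's original column names and avoid building the grid index-by-index, a similar built-in route is to cross-join the unique pairs with the full week range and left-merge the observed spend. This is a minimal sketch, not the answer above verbatim; MAX_WEEK_NUMBER is an assumed known horizon, and how='cross' needs pandas >= 1.2:

import pandas as pd

MAX_WEEK_NUMBER = 5  # assumed horizon; set this to your real last week

# every observed (userid, accountid) pair, crossed with every week
pairs = df[['userid', 'accountid']].drop_duplicates()
weeks = pd.DataFrame({'weeknumber': range(1, MAX_WEEK_NUMBER + 1)})
full = pairs.merge(weeks, how='cross')

# left-merge the recorded spend onto the full grid; gaps become NaN, then 0
out = full.merge(df, on=['userid', 'accountid', 'weeknumber'], how='left')
out['amount_spent'] = out['amount_spent'].fillna(0)

This does one vectorized merge instead of a per-row .loc insert, which is where the original approach spends its time.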


Here is my solution:

# group by (userid, accountid) and take each pair's max recorded week
for row in df.groupby(['userid', 'accountid']).max().itertuples():
    # full range of weeks up to this pair's last recorded week
    r = list(range(1, row.weeknumber + 1))
    # find all weeks already recorded for this pair
    unique = df[(df.accountid == row.Index[1]) & (df.userid == row.Index[0])].weeknumber.unique()
    # subtract the ones that are already there
    missing = [e for e in r if e not in unique]
    # append the missing weeks to the dataframe
    for m in missing:
        line = pd.DataFrame({'userid': [row.Index[0]], 'weeknumber': [m],
                             'amount_spent': [0], 'accountid': [row.Index[1]]})
        df = pd.concat([df, line], ignore_index=True)

This should be faster, but on a small sample it's hard to tell.
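
One caveat: concatenating inside the loop still copies the whole DataFrame on every iteration. A variation on the same idea (a sketch, not part of the answer above) collects the missing rows in a plain list and concatenates once at the end:

import pandas as pd

rows = []  # collect missing-week rows here instead of appending one by one
for row in df.groupby(['userid', 'accountid']).max().itertuples():
    recorded = df[(df.userid == row.Index[0]) &
                  (df.accountid == row.Index[1])].weeknumber.unique()
    for week in range(1, row.weeknumber + 1):
        if week not in recorded:
            rows.append({'userid': row.Index[0], 'accountid': row.Index[1],
                         'weeknumber': week, 'amount_spent': 0})

if rows:  # nothing to add if every week was recorded
    df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)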

