3

I have a dataframe with entries in this format:

user_id,item_list
0,3569 6530 4416 5494 6404 6289 10227 5285 3601 3509 5553 14879 5951 4802 15104 5338 3604 2345 9048 8627
1,16148 8470 7671 8984 9795 6811 3851 3611 7662 5034 5301 6948 5840 345 14652 10729 8429 7295 4949 16144
...

Note that user_id is a regular column, not the index of the dataframe.

I want to transform the dataframe into one that looks like this:

user_id,item_id
0,3569
0,6530
0,4416
0,5494
...
1,4949
1,16144
...

Right now I am trying this but it is wildly inefficient:

import numpy as np
import pandas as pd

df = pd.read_csv("20recs.csv")
numberOfRows = 28107*20
# Pre-allocate the target frame, then fill it one row at a time
df2 = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('user', 'item'))
i = 0  # renamed from `iter` so the built-in iter() is not shadowed
for index, row in df.iterrows():
    user = row['user_id']
    itemList = row['item_list']
    items = itemList.split(' ')
    for item in items:
        df2.loc[i] = [user, item]
        i = i + 1

As you can see, I even tried pre-allocating the memory for the dataframe but it doesn't seem to help much.

So there must be a much better way to do this. Can anyone help me?

3 Answers

1

Use split to turn each space-separated string into an actual list, then explode to give every list element its own row. Requires pandas >= 0.25.0.

>>> df = pd.DataFrame({'user_id': [0,1], 'item_list': ['1 2 3', '4 5 6']})
>>> df

   user_id item_list
0        0     1 2 3
1        1     4 5 6

>>> (df.assign(item_id=df.item_list.apply(lambda x: x.split(' ')))
       .explode('item_id')[['user_id', 'item_id']])

   user_id   item_id
0        0         1
0        0         2
0        0         3
1        1         4
1        1         5
1        1         6
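Note that the result above keeps the original row labels (0, 0, 0, 1, 1, 1). A minimal sketch, assuming you would rather have a fresh 0..n-1 index; the helper name explode_items is hypothetical and not part of pandas:

```python
import pandas as pd


def explode_items(df: pd.DataFrame) -> pd.DataFrame:
    """Return one (user_id, item_id) row per item in the space-separated item_list."""
    return (
        df.assign(item_id=df["item_list"].str.split(" "))  # strings -> lists
          .explode("item_id")                               # one row per list element
          [["user_id", "item_id"]]
          .reset_index(drop=True)                           # replace repeated labels with 0..n-1
    )


df = pd.DataFrame({"user_id": [0, 1], "item_list": ["1 2 3", "4 5 6"]})
result = explode_items(df)
```

The item_id values are still strings at this point, because split produces lists of strings.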


1

First, your item_list column should hold actual lists. The values in the question are separated by spaces, so split on a space:

df['item_id_list'] = df['item_list'].str.split(' ')
df['item_id_list_int'] = [[int(i) for i in x] for x in df['item_id_list']]

Then explode it:

df_exp = df.explode('item_id_list_int')
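The integer conversion can also be done after exploding, which avoids the nested list comprehension and the helper columns. A rough sketch, assuming every token in item_list is a valid integer; the two sample rows are shortened from the question's data:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [0, 1], "item_list": ["3569 6530", "16148 8470"]})

# str.split() with no separator splits on any run of whitespace
df["item_id"] = df["item_list"].str.split()

# One row per item, then drop the original string column
df_exp = df.explode("item_id").drop(columns="item_list")

# explode leaves object dtype; cast the whole column in one step
df_exp = df_exp.astype({"item_id": "int64"})
```

Here df_exp has only the user_id and item_id columns, with item_id stored as int64.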


1

Try this:

df.set_index('user_id').item_list.apply(lambda x: x.split(' ')).explode().to_frame()

Output:

        item_list
user_id          
0            3569
0            6530
0            4416
0            5494
0            6404
0            6289
0           10227
0            5285
0            3601
0            3509
0            5553
0           14879
0            5951
0            4802
0           15104
0            5338
0            3604
0            2345
0            9048
0            8627
1           16148
1            8470
1            7671
1            8984
1            9795
1            6811
1            3851
1            3611
1            7662
1            5034
1            5301
1            6948
1            5840
1             345
1           14652
1           10729
1            8429
1            7295
1            4949
1           16144

or, if you want user_id as a regular column with a default integer index:

df.set_index('user_id').item_list.apply(lambda x: x.split(' ')).explode().reset_index()
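Both variants leave the exploded column named item_list, while the question asks for item_id. A small sketch of one way to get that name, assuming the same column layout as in the question (the two sample rows are shortened):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [0, 1], "item_list": ["3569 6530", "16148 8470"]})

long_df = (
    df.set_index("user_id")["item_list"]
      .str.split(" ")      # vectorised equivalent of apply(lambda x: x.split(' '))
      .explode()           # one row per item, user_id carried along as the index
      .rename("item_id")   # name the Series so reset_index yields an item_id column
      .reset_index()       # user_id becomes a regular column again
)
```

The values in item_id are still strings here; cast them with astype(int) if you need numbers.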
