I have several data sources that i'm trying to work with - i asked a related question a couple of days ago (click here!
so i have 3 dataframes each with a 'user_id' column that is common across all 3 data frames, but not all dataframes are exactly the same size.
I didn't realize it at first and used pd.concat combine them but they aren't lined up by user_id, and i'm not sure how to accomplish that.
here is some sample data from each, and sample data from the resulting concat (perhaps that is helpful?)
df1:
user_id duration
0 1000 116.830000
1 1001 328.092000
2 1002 259.043333
3 1003 1041.000000
4 1004 327.368750
5 1005 470.220000
6 1006 32.055000
7 1007 496.830000
8 1008 491.103333
9 1009 698.710000
df2:
user_id mb_used
0 1000 1902.000000
1 1001 16088.200000
2 1002 13432.000000
3 1003 27045.000000
4 1004 19544.500000
5 1005 17141.000000
6 1006 17094.000000
7 1007 28770.800000
8 1008 18491.333333
9 1009 23405.125000
df3:
user_id id
0 1000 11.000000
1 1001 41.400000
2 1002 29.333333
3 1003 50.000000
4 1004 22.125000
5 1005 11.000000
6 1006 77.000000
7 1007 51.000000
8 1008 28.000000
9 1011 53.000000
df 4 = pd.concat([df1,df2,df3],axis=1)
df4 result:
user_id duration user_id mb_used user_id id
0 1000.0 116.830000 1000 1902.000000 1000.0 11.000000
1 1001.0 328.092000 1001 16088.200000 1001.0 41.400000
2 1002.0 259.043333 1002 13432.000000 1002.0 29.333333
3 1003.0 1041.000000 1003 27045.000000 1003.0 50.000000
4 1004.0 327.368750 1004 19544.500000 1004.0 22.125000
5 1005.0 470.220000 1005 17141.000000 1005.0 11.000000
6 1006.0 32.055000 1006 17094.000000 1006.0 77.000000
7 1007.0 496.830000 1007 28770.800000 1007.0 51.000000
8 1008.0 491.103333 1008 18491.333333 1008.0 28.000000
**9 1009.0 698.710000 1009 23405.125000 1011.0 53.000000**
is there something i've done wrong or could add to line them by that shared user_id or should i be using a different method? i'll be honest - i started with pd.merge but quickly realized i was in over my head in trying to structure that, but if that is the only way (or the best way) i'll take another crack at it.
thanks in advance for your time, and i apologize for what is likely a lack of proper terminology, i am quite new at python (and programming in general)
from functools import reducedfs = [df1, df2, df3]df_final = reduce(lambda left,right: pd.merge(left,right,on='user_id'), dfs)user_idunique in each of these dataframes?user_id mb_used. Can the same user id be in there many times, e.g., mb_used on several different media? In that case, there isn't a one-to-one correspondence between rows in these dataframes. In this example for instance, perhaps rows need to be summed to make user_id unique in the dataframe.