python - combined 3 data frames, but need to realign data by values in 1 column [duplicate]

Question

I have several data sources that i'm trying to work with - i asked a related question a couple of days ago (click here!

so i have 3 dataframes each with a 'user_id' column that is common across all 3 data frames, but not all dataframes are exactly the same size.

I didn't realize it at first and used pd.concat combine them but they aren't lined up by user_id, and i'm not sure how to accomplish that.

here is some sample data from each, and sample data from the resulting concat (perhaps that is helpful?)

df1:

        user_id duration
    0   1000    116.830000
    1   1001    328.092000
    2   1002    259.043333
    3   1003    1041.000000
    4   1004    327.368750
    5   1005    470.220000
    6   1006    32.055000
    7   1007    496.830000
    8   1008    491.103333
    9   1009    698.710000

df2:

user_id mb_used
0   1000    1902.000000
1   1001    16088.200000
2   1002    13432.000000
3   1003    27045.000000
4   1004    19544.500000
5   1005    17141.000000
6   1006    17094.000000
7   1007    28770.800000
8   1008    18491.333333
9   1009    23405.125000

df3:

    user_id id
0   1000    11.000000
1   1001    41.400000
2   1002    29.333333
3   1003    50.000000
4   1004    22.125000
5   1005    11.000000
6   1006    77.000000
7   1007    51.000000
8   1008    28.000000
9   1011    53.000000

df 4 = pd.concat([df1,df2,df3],axis=1)

df4 result:

   user_id     duration  user_id       mb_used  user_id         id
0   1000.0   116.830000     1000   1902.000000   1000.0  11.000000
1   1001.0   328.092000     1001  16088.200000   1001.0  41.400000
2   1002.0   259.043333     1002  13432.000000   1002.0  29.333333
3   1003.0  1041.000000     1003  27045.000000   1003.0  50.000000
4   1004.0   327.368750     1004  19544.500000   1004.0  22.125000
5   1005.0   470.220000     1005  17141.000000   1005.0  11.000000
6   1006.0    32.055000     1006  17094.000000   1006.0  77.000000
7   1007.0   496.830000     1007  28770.800000   1007.0  51.000000
8   1008.0   491.103333     1008  18491.333333   1008.0  28.000000
**9   1009.0   698.710000     1009  23405.125000   1011.0  53.000000**

is there something i've done wrong or could add to line them by that shared user_id or should i be using a different method? i'll be honest - i started with pd.merge but quickly realized i was in over my head in trying to structure that, but if that is the only way (or the best way) i'll take another crack at it.

thanks in advance for your time, and i apologize for what is likely a lack of proper terminology, i am quite new at python (and programming in general)

from functools import reduce dfs = [df1, df2, df3] df_final = reduce(lambda left,right: pd.merge(left,right,on='user_id'), dfs) — David Erickson
– David Erickson, Commented Jan 6, 2021 at 1:22
@tdelaney the user_id is the value across all three dataframes, they are separate usages for the same user. — seuadr
– seuadr, Commented Jan 6, 2021 at 15:18
But that is not what I asked. Take user_id mb_used. Can the same user id be in there many times, e.g., mb_used on several different media? In that case, there isn't a one-to-one correspondence between rows in these dataframes. In this example for instance, perhaps rows need to be summed to make user_id unique in the dataframe. — tdelaney
– tdelaney, Commented Jan 6, 2021 at 16:15
@tdelaney i apologize, i misunderstood. no, user_id's do not appear more than once, but not all user IDs are present in all dataframes. This created some NaNs when i used the initially suggested merge — seuadr
– seuadr, Commented Jan 6, 2021 at 16:33

ZealousWeb · Accepted Answer · 2021-01-06 05:26:18Z

1

If you want to keep user_id of all 3 dataframes, this can be accomplished using merge and outer join as follows:

df4 = pd.merge(df1, df2, how="outer", on="user_id").merge(df3, how="outer", on="user_id")

answered Jan 6, 2021 at 5:26

ZealousWeb

1,7791 gold badge11 silver badges14 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

seuadr Over a year ago

This worked great, thanks! i know this is a pretty basic question, so i am thankful of your time and effort.

seuadr Over a year ago

just for anyone else that might be looking at this - Ultimately, i used ZealousWeb's solution because it contains an outer join and i discovered that not all user_ids appeared in all dataframes, creating some NaNs. including how="outer" kept these records from being dropped so was, in my opinion, the best solution.

Pablo C · Accepted Answer · 2021-01-06 01:23:03Z

1

Use merge instead:

df4 = df1.merge(df2, on = "user_id").merge(df3, on = "user_id")

answered Jan 6, 2021 at 1:23

Pablo C

4,7612 gold badges10 silver badges26 bronze badges

1 Comment

seuadr Over a year ago

worked perfectly, thanks! i know this was probably a really basic question so i appreciate your time.

tdelaney · Accepted Answer · 2021-01-06 02:05:53Z

1

Reindex the dataframes by user_id and then concat on the column axis. Concatenation will happen on the row label (that is, the index you set) with unmatched user ids getting NaN.

df4 = pd.concat([d.set_index('user_id') for d in [df1, df2, df3]], axis=1)

answered Jan 6, 2021 at 2:05

tdelaney

77.9k6 gold badges91 silver badges129 bronze badges

Collectives™ on Stack Overflow

python - combined 3 data frames, but need to realign data by values in 1 column [duplicate]

3 Answers 3

2 Comments

1 Comment

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

1 Comment

Comments

Linked

Related