I have the following two DataFrames:
The first dataframe contains a bus timetable with bus numbers, stop ids and stop names.
1. df_time:
   bus_nr  stop_id stop_name
0       1        1         a
1       1        2         b
2       1        3         c
3       1        4         d
4       2        1         k
5       2        2         l
6       2        3         m
7       2        4         n
8       2        5         o
The second dataframe contains some measurements of where the bus has been, but some stops are missing. The frame contains the bus_nr, the stop name, an id for the trip and other information:
2. df_measure:
   bus_nr  trip_id stop_name other
0       1        1         a     x
1       1        1         b     x
2       1        1         d     x
3       1        2         c     x
4       1        2         d     x
5       2        3         k     x
6       2        3         m     x
7       2        3         n     x
Now I want to add the stops that are missing from the measurements, so that every timetable stop of a bus appears for each of its trips, with NaN in other where nothing was measured:
    bus_nr  trip_id  stop_id stop_name other
0        1        1        1         a     x
1        1        1        2         b     x
2        1        1        3         c   NaN
3        1        1        4         d     x
4        1        2        1         a   NaN
5        1        2        2         b   NaN
6        1        2        3         c     x
7        1        2        4         d     x
8        2        3        1         k     x
9        2        3        2         l   NaN
10       2        3        3         m     x
11       2        3        4         n     x
12       2        3        5         o   NaN
So for every bus_nr I want to take all the stops from df_time and insert them into df_measure for each trip_id of that bus. Any ideas?
Code for creating the Dataframes:
import pandas as pd
df_time = pd.DataFrame()
df_time['bus_nr'] = [1, 1, 1, 1, 2, 2, 2, 2, 2]
df_time['stop_id'] = [1, 2, 3, 4, 1, 2, 3, 4, 5]
df_time['stop_name'] = ['a', 'b', 'c', 'd', 'k', 'l', 'm', 'n', 'o']
df_measure = pd.DataFrame()
df_measure['bus_nr'] = [1, 1, 1, 1, 1, 2, 2, 2]
df_measure['trip_id'] = [1, 1, 1, 2, 2, 3, 3, 3]
df_measure['stop_name'] = ['a', 'b', 'd', 'c', 'd', 'k', 'm', 'n']
df_measure['other'] = ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']
Solution:
With the help of Sagar Dawda I found a solution that works:
1. Create a DataFrame with all combinations of bus_nr and trip_id that occur in the measurements
df_combi = df_measure[['bus_nr', 'trip_id']].drop_duplicates()
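2. Use the solution of Sagar Dawda: merge the timetable with the measurements of each trip_id separately (right_by groups df_measure by trip_id before merging)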
out = pd.merge_ordered(df_time, df_measure, right_by='trip_id', how='outer')
out = out.loc[:, ['bus_nr', 'trip_id', 'stop_id', 'stop_name', 'other']]
3. Merge with df_combi to keep only the bus_nr/trip_id combinations that actually occur
out = out.merge(df_combi)
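For comparison, here is a minimal sketch of another way to get the same table with ordinary merges only. It is an alternative to the steps above, not the approach Sagar Dawda suggested: first expand every measured trip to the full list of timetable stops for its bus, then left-join the measurements onto that. The names trips, full and result are made up for this illustration.

```python
import pandas as pd

# Same sample data as in the question.
df_time = pd.DataFrame({
    "bus_nr": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "stop_id": [1, 2, 3, 4, 1, 2, 3, 4, 5],
    "stop_name": ["a", "b", "c", "d", "k", "l", "m", "n", "o"],
})
df_measure = pd.DataFrame({
    "bus_nr": [1, 1, 1, 1, 1, 2, 2, 2],
    "trip_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "stop_name": ["a", "b", "d", "c", "d", "k", "m", "n"],
    "other": ["x"] * 8,
})

# One row per (bus_nr, trip_id) pair that was actually measured.
trips = df_measure[["bus_nr", "trip_id"]].drop_duplicates()

# Expand every trip to all timetable stops of its bus.
full = trips.merge(df_time, on="bus_nr", how="left")

# Attach the measurements; stops without a measurement keep NaN in 'other'.
result = full.merge(
    df_measure, on=["bus_nr", "trip_id", "stop_name"], how="left"
)
result = result[["bus_nr", "trip_id", "stop_id", "stop_name", "other"]]
print(result)
```

This assumes that stop_name is unique within a bus_nr in df_time and that each stop is measured at most once per trip; otherwise the second merge would duplicate rows.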