I have the following two DataFrames.
The first DataFrame contains a bus timetable with bus numbers, stop ids and stop names.

1. df_time:

     bus_nr   stop_id   stop_name
0      1         1          a
1      1         2          b
2      1         3          c
3      1         4          d
4      2         1          k
5      2         2          l
6      2         3          m
7      2         4          n
8      2         5          o

The second dataframe contains some measurements of where the bus has been, but some stops are missing. The frame contains the bus_nr, the stop name, an id for the trip and other information:

2. df_measure:

     bus_nr   trip_id   stop_name   other
0      1         1          a         x
1      1         1          b         x
2      1         1          d         x
3      1         2          c         x
4      1         2          d         x
5      2         3          k         x
6      2         3          m         x
7      2         3          n         x

Now I want to add the missing stops from the timetable to the measurements, so that every trip contains all timetable stops of its bus. Values that were not measured should be NaN:

     bus_nr   trip_id   stop_id   stop_name   other
0      1         1         1          a         x
1      1         1         2          b         x
2      1         1         3          c         NaN
3      1         1         4          d         x
4      1         2         1          a         NaN
5      1         2         2          b         NaN
6      1         2         3          c         x
7      1         2         4          d         x
8      2         3         1          k         x
9      2         3         2          l         NaN
10     2         3         3          m         x
11     2         3         4          n         x
12     2         3         5          o         NaN

So for every bus_nr I want to use all the information from df_time and insert it into df_measure. Any ideas?

Code for creating the Dataframes:

import pandas as pd

df_time = pd.DataFrame()
df_time['bus_nr'] = [1, 1, 1, 1, 2, 2, 2, 2, 2]
df_time['stop_id'] = [1, 2, 3, 4, 1, 2, 3, 4, 5]
df_time['stop_name'] = ['a', 'b', 'c', 'd', 'k', 'l', 'm', 'n', 'o']

df_measure = pd.DataFrame()
df_measure['bus_nr'] = [1, 1, 1, 1, 1, 2, 2, 2]
df_measure['trip_id'] = [1, 1, 1, 2, 2, 3, 3, 3]
df_measure['stop_name'] = ['a', 'b', 'd', 'c', 'd', 'k', 'm', 'n']
df_measure['other'] = ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']

Solution:

With the help of Sagar Dawda I found a solution that works:
1. Create a dataframe with all combinations of bus_nr and trip_id that were actually measured

df_combi = df_measure[['bus_nr', 'trip_id']].drop_duplicates()

2. Use the solution of Sagar Dawda

out = pd.merge_ordered(df_time, df_measure, right_by='trip_id', how='outer')
out = out.loc[:, ['bus_nr', 'trip_id', 'stop_id', 'stop_name', 'other']]

3. Merge

# keep only the measured (bus_nr, trip_id) pairs
out = out.merge(df_combi, on=['bus_nr', 'trip_id'])
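
For comparison, the same result can probably also be reached with two plain merge calls and no merge_ordered. This is only a sketch of an alternative, not the approach from the answer; the names trips and expected are made up for illustration. The idea is to first give every measured trip the full timetable of its bus and then attach the measurements with a left join.

```python
import pandas as pd

df_time = pd.DataFrame({
    'bus_nr': [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'stop_id': [1, 2, 3, 4, 1, 2, 3, 4, 5],
    'stop_name': ['a', 'b', 'c', 'd', 'k', 'l', 'm', 'n', 'o'],
})
df_measure = pd.DataFrame({
    'bus_nr': [1, 1, 1, 1, 1, 2, 2, 2],
    'trip_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'stop_name': ['a', 'b', 'd', 'c', 'd', 'k', 'm', 'n'],
    'other': ['x'] * 8,
})

# Unique (bus_nr, trip_id) pairs that were actually measured.
trips = df_measure[['bus_nr', 'trip_id']].drop_duplicates()

# Every trip gets the complete timetable of its own bus.
expected = trips.merge(df_time, on='bus_nr')

# Attach the measurements; stops without a measurement keep NaN in 'other'.
out = expected.merge(df_measure, on=['bus_nr', 'trip_id', 'stop_name'], how='left')
out = (out.sort_values(['bus_nr', 'trip_id', 'stop_id'])
          .reset_index(drop=True)
          [['bus_nr', 'trip_id', 'stop_id', 'stop_name', 'other']])
print(out)
```

Because the timetable is joined per bus_nr, combinations such as bus_nr 2 with trip_id 1 never appear, so no filtering step is needed afterwards.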

2 Answers

1
out = pd.merge_ordered(df_time, df_measure, right_by='trip_id', how='outer')
out = out.loc[:, ['bus_nr', 'trip_id', 'stop_id', 'stop_name', 'other']]
out.sort_values(['bus_nr', 'trip_id'], inplace=True)

out

     bus_nr   trip_id   stop_id   stop_name   other
0      1         1         1          a         x
1      1         1         2          b         x
2      1         1         3          c         NaN
3      1         1         4          d         x
9      1         2         1          a         NaN
10     1         2         2          b         NaN
11     1         2         3          c         x
12     1         2         4          d         x
18     1         3         1          a         NaN
19     1         3         2          b         NaN
20     1         3         3          c         NaN
21     1         3         4          d         NaN
4      2         1         1          k         NaN
5      2         1         2          l         NaN
6      2         1         3          m         NaN
7      2         1         4          n         NaN
8      2         1         5          o         NaN
13     2         2         1          k         NaN
14     2         2         2          l         NaN
15     2         2         3          m         NaN
16     2         2         4          n         NaN
17     2         2         5          o         NaN
22     2         3         1          k         x
23     2         3         2          l         NaN
24     2         3         3          m         x
25     2         3         4          n         x
26     2         3         5          o         NaN

Hope this helps

4 Comments

Thanks! The problem now is that the out dataframe also contains combinations such as trip_id 3 with bus_nr 1, although trip_id 3 was only measured for bus_nr 2. The same problem occurs with trip_id 1 and 2, which belong to bus_nr 1 and not to bus_nr 2.
OK, I found a way out. Thanks!
Great. So what solution did you come up with?
I added the solution to the initial post.
0

Assuming that the bus_nr and stop_name uniquely identify the rows, you can just merge on those columns:

df_measure = pd.merge(df_time, df_measure, on=['bus_nr', 'stop_name'])
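
To see what such a merge does on the example data, here is a small sketch. The validate argument is an addition for checking the uniqueness assumption on the timetable side, and the names inner and left are only illustrative. It suggests that the default inner join adds stop_id to the measured rows but does not bring in the missing stops, while a left join keeps the unmeasured stops without being able to assign them to a trip.

```python
import pandas as pd

df_time = pd.DataFrame({
    'bus_nr': [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'stop_id': [1, 2, 3, 4, 1, 2, 3, 4, 5],
    'stop_name': ['a', 'b', 'c', 'd', 'k', 'l', 'm', 'n', 'o'],
})
df_measure = pd.DataFrame({
    'bus_nr': [1, 1, 1, 1, 1, 2, 2, 2],
    'trip_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'stop_name': ['a', 'b', 'd', 'c', 'd', 'k', 'm', 'n'],
    'other': ['x'] * 8,
})

keys = ['bus_nr', 'stop_name']

# Default how='inner': only stops that occur in both frames survive.
# validate raises MergeError if (bus_nr, stop_name) is not unique in df_time.
inner = pd.merge(df_time, df_measure, on=keys, validate='one_to_many')

# how='left': every timetable stop is kept, but stops that were never
# measured (l and o here) have no trip_id to attach to.
left = pd.merge(df_time, df_measure, on=keys, how='left', validate='one_to_many')

print(len(inner), len(left))
print(left[left['trip_id'].isna()])
```

So for the per-trip output asked for in the question, the (bus_nr, trip_id) pairs still have to be brought in separately.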
