Merge two Dataframes without dropping any columns from the left depending on a column value

Question

I have the following two Dataframes:
The first dataframe contains a bus timetable with bus numbers, stop ids and stop names.

1. df_time:

     bus_nr   stop_id   stop_name
0      1         1          a
1      1         2          b
2      1         3          c
3      1         4          d
4      2         1          k
5      2         2          l
6      2         3          m
7      2         4          n
8      2         5          o

The second dataframe contains some measurements of where the bus has been, but some stops are missing. The frame contains the bus_nr, the stop name, an id for the trip and other information:

2. df_measure:

     bus_nr   trip_id   stop_name   other
0      1         1          a         x
1      1         1          b         x
2      1         1          d         x
3      1         2          c         x
4      1         2          d         x
5      2         3          k         x
6      2         3          m         x
7      2         3          n         x

Now I want to join the missing values from the timetable to the measured stops, so that all timetable stops occur in the measurement:

     bus_nr   trip_id   stop_id   stop_name   other
0      1         1         1          a         x
1      1         1         2          b         x
2      1         1         3          c         NaN
3      1         1         4          d         x
4      1         2         1          a         NaN
5      1         2         2          b         NaN
6      1         2         3          c         x
7      1         2         4          d         x
8      2         3         1          k         x
9      2         3         2          l         NaN
10     2         3         3          m         x
11     2         3         4          n         x
12     2         3         5          o         NaN

So for every bus_nr I want to use all the information from df_time and insert it into df_measure. Any ideas?

Code for creating the Dataframes:

import pandas as pd

df_time = pd.DataFrame()
df_time['bus_nr'] = [1, 1, 1, 1, 2, 2, 2, 2, 2]
df_time['stop_id'] = [1, 2, 3, 4, 1, 2, 3, 4, 5]
df_time['stop_name'] = ['a', 'b', 'c', 'd', 'k', 'l', 'm', 'n', 'o']

df_measure = pd.DataFrame()
df_measure['bus_nr'] = [1, 1, 1, 1, 1, 2, 2, 2]
df_measure['trip_id'] = [1, 1, 1, 2, 2, 3, 3, 3]
df_measure['stop_name'] = ['a', 'b', 'd', 'c', 'd', 'k', 'm', 'n']
df_measure['other'] = ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']

Solution:

With the help of Sagar Dawda I found a solution that works:
1. Create a dataframe with all measured combinations of bus_nr and trip_id

# Keep the first occurrence of each (bus_nr, trip_id) pair
df_combi = df_measure[['bus_nr', 'trip_id']].drop_duplicates()

2. Use the solution of Sagar Dawda

out = pd.merge_ordered(df_time, df_measure, right_by='trip_id', how='outer')
out = out.loc[:, ['bus_nr', 'trip_id', 'stop_id', 'stop_name', 'other']]

3. Merge with df_combi so that only the measured bus_nr/trip_id combinations remain

out = out.merge(df_combi)
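
A variant that never creates the unwanted bus/trip combinations can be sketched as follows. It is not the pd.merge_ordered approach from the answer: it first expands each measured (bus_nr, trip_id) pair to the full timetable of its bus and then left-merges the measurements onto that. It assumes, as in the sample data, that (bus_nr, stop_name) is unique in df_time. The names trips, scaffold and result are made up for this illustration.

```python
import pandas as pd

# Sample data from the question
df_time = pd.DataFrame({
    "bus_nr": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "stop_id": [1, 2, 3, 4, 1, 2, 3, 4, 5],
    "stop_name": list("abcdklmno"),
})
df_measure = pd.DataFrame({
    "bus_nr": [1, 1, 1, 1, 1, 2, 2, 2],
    "trip_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "stop_name": list("abdcdkmn"),
    "other": ["x"] * 8,
})

# One row per measured (bus_nr, trip_id) pair
trips = df_measure[["bus_nr", "trip_id"]].drop_duplicates()

# Every timetable stop of the bus, repeated for each of its trips
scaffold = trips.merge(df_time, on="bus_nr", how="left")

# Attach the measurements; stops without a measurement get NaN in 'other'
result = scaffold.merge(
    df_measure, on=["bus_nr", "trip_id", "stop_name"], how="left"
)
result = result[["bus_nr", "trip_id", "stop_id", "stop_name", "other"]]
print(result)
```

On the sample data this gives 13 rows, one per timetable stop and trip, with NaN in other for the five stops that were not measured.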

Sagar Dawda · Accepted Answer · 2018-05-16 12:36:01Z

out = pd.merge_ordered(df_time, df_measure, right_by='trip_id', how='outer')
out = out.loc[:, ['bus_nr', 'trip_id', 'stop_id', 'stop_name', 'other']]
out.sort_values(['bus_nr', 'trip_id'], inplace=True)

out
     bus_nr   trip_id   stop_id   stop_name   other
0      1         1         1          a         x
1      1         1         2          b         x
2      1         1         3          c         NaN
3      1         1         4          d         x
9      1         2         1          a         NaN
10     1         2         2          b         NaN
11     1         2         3          c         x
12     1         2         4          d         x
18     1         3         1          a         NaN
19     1         3         2          b         NaN
20     1         3         3          c         NaN
21     1         3         4          d         NaN
4      2         1         1          k         NaN
5      2         1         2          l         NaN
6      2         1         3          m         NaN
7      2         1         4          n         NaN
8      2         1         5          o         NaN
13     2         2         1          k         NaN
14     2         2         2          l         NaN
15     2         2         3          m         NaN
16     2         2         4          n         NaN
17     2         2         5          o         NaN
22     2         3         1          k         x
23     2         3         2          l         NaN
24     2         3         3          m         x
25     2         3         4          n         x
26     2         3         5          o         NaN

Hope this helps

Thanks! The problem now is that the out dataframe also contains combinations such as trip_id 3 with bus_nr 1, although trip_id 3 was only measured for bus_nr 2. The same problem occurs with trip_id 1 and 2, which belong to bus_nr 1 and not to bus_nr 2.

mfvas · Accepted Answer · 2018-05-16 10:51:48Z

Assuming that the bus_nr and stop_name uniquely identify the rows, you can just merge on those columns:

df_measure = pd.merge(df_time, df_measure, on=['bus_nr', 'stop_name'])
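
As a quick check of what this merge returns on the sample data from the question, here is a small sketch. With the default how='inner' only the measured stops survive, and with how='left' each missing stop appears once rather than once per trip, so its trip_id stays NaN. The variable names inner and left are only for illustration.

```python
import pandas as pd

# Sample data from the question
df_time = pd.DataFrame({
    "bus_nr": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "stop_id": [1, 2, 3, 4, 1, 2, 3, 4, 5],
    "stop_name": list("abcdklmno"),
})
df_measure = pd.DataFrame({
    "bus_nr": [1, 1, 1, 1, 1, 2, 2, 2],
    "trip_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "stop_name": list("abdcdkmn"),
    "other": ["x"] * 8,
})

keys = ["bus_nr", "stop_name"]

# Default inner join: only stops that were actually measured
inner = pd.merge(df_time, df_measure, on=keys)

# Left join: unmeasured timetable stops are kept, but without a trip_id
left = pd.merge(df_time, df_measure, on=keys, how="left")

print(len(inner))                          # 8 rows
print(len(left))                           # 10 rows
print(int(left["trip_id"].isna().sum()))   # 2 rows without a trip_id
```

Neither variant fills in the missing stops per trip, which is why the per-trip expansion is needed for the output requested in the question.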
