Pandas compare 2 dataframes by specific rows in all columns

Question

I have the following Pandas dataframe of some raw numbers:

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)

col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']

df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)

It looks like this:

        Tr_id       07_08_19 #1       07_08_19 #2     07_08_19 #2.1       11_31_19 #1     11_31_19 #1.1     11_31_19 #1.3       12_15_20 #1       12_15_20 #2     12_15_20 #2.1     12_15_20 #2.2
0   Quantity1                 1                 1                 2                 1                 2                 3                 1                 1                 2                 3
1   Quantity2                75                11               0.9                70                60                17                 6                 3                 5                 7
2   Quantity3                 9                20                17                 4                74                12                34                43                92                11
3   Quantity4                 7               -17               102                75                41               -89               496                12                17                62
4   Quantity5                -4                12                56               0.8               -36                30               -84               -23                64               -11
5   Quantity6               0.4               0.8               0.6               0.4               0.3               0.1               0.5               0.5               0.5               0.5
6   TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45
7         NaN                 1                 6                65                 4                 2                22                 1                34                51                12
8         NaN                 2                 5                 3                 3                 3                 1                 6                37                20                71
9         NaN                 3                 3                 6                 6                11                 6                 3                46                 1                34
10        NaN                 4                 7                78                 8                55                32                 2               918                34                94
11        NaN                 8                 3                 9                 3                 3                 5                 6                 0                12                 1
12        NaN                 4                23                 2                 5                 7                 6                55                37                59                73
13        NaN                 3                27                45                66                33                 4                22                91                78                46
14        NaN                 8                 3                 6                32                65                 3                 6                12                 6                51
15        NaN                 7                11                 7                84                34               898                23                68               101                21

I have a separate dataframe of a processed version of these numbers where:

some of the header rows from above have been deleted,
the column names have been changed

Here is the second dataframe:

df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
df_processed = df_processed.iloc[:, [3,4,9,7,0,2,1,6,8,5]]  # reorder columns by position

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

Common parts of each dataframe:

For each column, rows 8 onwards of the raw dataframe are the same as row 1 onwards from the processed dataframe. The order of columns in both dataframes is not the same.

Output combination:

I am looking to compare rows 8-16 in columns 1-10 of the raw dataframe df_raw to the processed dataframe df_processed. If two columns match each other, then I would like to extract rows 1-7 of df_raw and the column header from df_processed.

Example:

The values in column c_1trial only match the values in rows 8-16 of the column 07_08_19 #1. I would like two steps: (1) find some way to determine that these 2 columns match each other, and (2) if 2 columns do match each other, select the rows from the matching columns as shown in the sample output.

Here is the output I am looking to get:

    Tr_id       07_08_19 #1       07_08_19 #2     07_08_19 #2.1       11_31_19 #1     11_31_19 #1.1     11_31_19 #1.3       12_15_20 #1       12_15_20 #2     12_15_20 #2.1     12_15_20 #2.2
Quantity1                 1                 1                 2                 1                 2                 3                 1                 1                 2                 3
Quantity2                75                11               0.9                70                60                17                 6                 3                 5                 7
Quantity3                 9                20                17                 4                74                12                34                43                92                11
Proc_Name          c_1trial              14_1              14_2               8_1               8_2               8_3              28_1              24_1              24_2              24_3
Quantity4                 7               -17               102                75                41               -89               496                12                17                62
Quantity5                -4                12                56               0.8               -36                30               -84               -23                64               -11
Quantity6               0.4               0.8               0.6               0.4               0.3               0.1               0.5               0.5               0.5               0.5
TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45

My attempts are giving trouble:

print((df_raw.iloc[7:,1:] == df_processed).all(axis=1))

gives

ValueError: Can only compare identically-labeled DataFrame objects

and

print(df_raw.iloc[7:].values == df_processed.values) # gives False

gives

False

The problem with my second attempt is that I am not selecting .all(axis=1). When I make a comparison I want to do this across all rows of every column, not just one row.
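The ValueError in the first attempt comes from label alignment: `==` between two DataFrames requires identical index and column labels. A minimal sketch on toy frames (the names `left`, `right`, `x`, `y`, `p`, `q` are hypothetical, not from the data above) shows the error and one way to compare two columns by value only:

```python
import pandas as pd

# Two small frames with the same values but different row and column labels.
left = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=[7, 8])
right = pd.DataFrame({"q": [3, 4], "p": [1, 2]})

# DataFrame == DataFrame insists on identical labels and raises otherwise.
try:
    left == right
    message = ""
except ValueError as exc:
    message = str(exc)

# Comparing the underlying arrays of two columns ignores the labels entirely.
same = bool((left["x"].to_numpy() == right["p"].to_numpy()).all())
```

Here `same` reduces over all rows of one column pair, which is the per-column check described above.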

Question:

Is there a way to select out the output I showed above from these 2 dataframes?

In your output sample, you said you wanted the columns headers from df_processed but I think you posted those from df_raw?
– Alex Petralia, Commented Jun 6, 2016 at 18:13
Sorry, actually in the sample output, I showed column headers only from df_raw but I also included the column names from df_processed. I should have been more clear about this. Thanks.
– edesz, Commented Jun 6, 2016 at 19:59

Alex Petralia · Accepted Answer · 2016-06-06 20:47:52Z

Does this look like the output you're looking for?

Raw dataframe df:

        Tr_id    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1  
0   Quantity1           1           1           2           1           2   
1   Quantity2          75          11         0.9          70          60   
2   Quantity3           9          20          17           4          74   
3   Quantity4           7         -17         102          75          41   
4   Quantity5          -4          12          56         0.8         -36   
5   Quantity6         0.4         0.8         0.6         0.4         0.3   
6   TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019   
7         NaN           1           6          65           4           2   
8         NaN           2           5           3           3           3   
9         NaN           3           3           6           6          11   
10        NaN           4           7          78           8          55   
11        NaN           8           3           9           3           3   
12        NaN           4          23           2           5           7   
13        NaN           3          27          45          66          33   
14        NaN           8           3           6          32          65   
15        NaN           7          11           7          84          34   

    11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3  
0            3           1           1           2           3  
1           17           6           3           5           7  
2           12          34          43          92          11  
3          -89         496          12          17          62  
4           30         -84         -23          64         -11  
5          0.1         0.5         0.5         0.5         0.5  
6   11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020  
7           22           1          34          51          12  
8            1           6          37          20          71  
9            6           3          46           1          34  
10          32           2         918          34          94  
11           5           6           0          12           1  
12           6          55          37          59          73  
13           4          22          91          78          46  
14           3           6          12           6          51  
15         898          23          68         101          21

Processed dataframe dfp:

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

Code:

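import pandas as pd

df = pd.read_csv('raw_df.csv') # raw dataframe
dfp = pd.read_csv('processed_df.csv') # processed dataframe
dfr = df.drop('Tr_id', axis=1) # drop the label column before casting to int

x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        # compare the underlying arrays so that differing index labels are ignored
        if (dfr.tail(9).astype(int)[col_raw].values == dfp[col_p].values).all():
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw) # keep the raw column name as an extra last row
            x[col_p] = series

x = pd.concat([df['Tr_id'].head(7), x], axis=1)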

Output:

       Tr_id    c_1trial        14_1        14_2         8_1         8_2  
0  Quantity1           1           1           2           1           2   
1  Quantity2          75          11         0.9          70          60   
2  Quantity3           9          20          17           4          74   
3  Quantity4           7         -17         102          75          41   
4  Quantity5          -4          12          56         0.8         -36   
5  Quantity6         0.4         0.8         0.6         0.4         0.3   
6  TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019   
7        NaN    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1   

          8_3        28_1        24_1        24_2        24_3  
0           3           1           1           2           3  
1          17           6           3           5           7  
2          12          34          43          92          11  
3         -89         496          12          17          62  
4          30         -84         -23          64         -11  
5         0.1         0.5         0.5         0.5         0.5  
6  11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020  
7  11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3

I think the code could be more concise but maybe this does the job.
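One possibly more concise variant, sketched on toy data rather than the CSV files above, replaces the inner loop with a dictionary lookup keyed on each processed column's values. The frame names, column names and `n_header` are hypothetical stand-ins, and the sketch assumes no two processed columns hold identical values (otherwise the last one wins).

```python
import pandas as pd

# Toy stand-ins for the raw and processed frames (hypothetical data).
raw = pd.DataFrame({
    "Tr_id": ["Quantity1", "TimeStamp", None, None, None],
    "run_a": [1, "07/08/2019", 1, 2, 3],
    "run_b": [2, "07/09/2019", 6, 5, 3],
})
processed = pd.DataFrame({"p_2": [6, 5, 3], "p_1": [1, 2, 3]})

n_header = 2  # number of header rows at the top of the raw frame
body = raw.drop(columns="Tr_id").iloc[n_header:]

# Map each processed column's values (as a tuple) to that column's name.
lookup = {tuple(processed[c].tolist()): c for c in processed.columns}

# For every raw column, look up the processed column holding the same values.
matches = {}
for c in body.columns:
    key = tuple(int(v) for v in body[c])
    if key in lookup:
        matches[c] = lookup[key]

# Keep the header rows and relabel the matched columns with processed names.
result = raw.iloc[:n_header][["Tr_id", *matches]].rename(columns=matches)
print(matches)
```

The `matches` dictionary keeps the link between raw and processed names, so the raw name can also be written into an extra row if needed.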

Hey, this is very very close. Is it also possible to have column names from df_raw in the output columns? eg. in your first column c_1trial, can one of the rows be 07_08_19 #1? And similarly for the other columns?
I'm getting ValueError: cannot convert float NaN to integer. Could this be due to a Python 2.7-specific problem? It's pointing to this line: df.tail(9).astype(int)[col_raw].
It's because the raw DataFrame has the column Tr_id. I dropped it using df = df.drop('Tr_id', axis=1)

MaxU - stand with Ukraine · Accepted Answer · 2016-08-22 11:07:42Z

An alternative solution, using the DataFrame.isin() method:

In [171]: df1
Out[171]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
3  0  3  3
4  0  4  4

In [172]: df2
Out[172]:
   a  b  c
0  0  3  3
1  1  1  1
2  0  3  4
3  4  2  3
4  0  4  4

In [173]: common = pd.merge(df1, df2)

In [174]: common
Out[174]:
   a  b  c
0  0  3  3
1  0  4  4

In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
   a  b  c
3  0  3  3
4  0  4  4

Or, if you want to subtract the second data set from the first one, i.e. the Pandas equivalent of SQL's:

select col1, .., colN from tableA
minus
select col1, .., colN from tableB

in Pandas:

In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
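Note that `isin` with a dict tests each column independently, so a row such as (0, 3, 4) would pass the filter even though it is not one of the common rows. If exact row matching matters, one option is a left merge with `indicator=True`; the following is a sketch using the same df1 and df2 as above, with `probe` as a hypothetical extra row added only to show the difference:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 0, 4, 0, 0], "b": [1, 2, 2, 3, 4], "c": [3, 4, 2, 3, 4]})
df2 = pd.DataFrame({"a": [0, 1, 0, 4, 0], "b": [3, 1, 3, 2, 4], "c": [3, 1, 4, 3, 4]})

# Left merge on all shared columns; "_merge" records where each row was found.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

in_both = merged.loc[merged["_merge"] == "both", ["a", "b", "c"]]
only_df1 = merged.loc[merged["_merge"] == "left_only", ["a", "b", "c"]]  # SQL MINUS

# A row that is not common, but whose values each appear in some common row.
common = pd.merge(df1, df2)
probe = pd.DataFrame({"a": [0], "b": [3], "c": [4]})
probe_passes_isin = bool(probe.isin(common.to_dict("list")).all(axis=1).iloc[0])
```

On this data both approaches agree, but `probe_passes_isin` shows where the column-wise check would over-match.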

edesz · Accepted Answer · 2016-06-06 17:31:54Z

I came up with this using loops. It is very disappointing:

holder = []
for pp in df_processed.columns:
    list1 = df_processed[pp].tolist()
    for rr in df_raw.columns:
        list2 = df_raw.loc[7:,rr].tolist()  # rows 7 onwards hold the processed values
        if list1==list2:
            holder.append([rr,pp])  # [raw column name, processed column name]

df_intermediate = pd.DataFrame(holder,columns=['A','B'])
df_c = df_raw.loc[:6,df_intermediate.iloc[:,0].tolist()]
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:,1].tolist()
df_c.insert(0,list(df_raw)[0],df_raw[list(df_raw)[0]])
df_c.iloc[-1,0]='Proc_Name'
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)

Output:

       Tr_id       11_31_19 #1     11_31_19 #1.1     12_15_20 #2.2       12_15_20 #2       07_08_19 #1     07_08_19 #2.1       07_08_19 #2       12_15_20 #1     12_15_20 #2.1     11_31_19 #1.3
0  Quantity1                 1                 2                 3                 1                 1                 2                 1                 1                 2                 3
1  Quantity2                70                60                 7                 3                75               0.9                11                 6                 5                17
2  Quantity3                 4                74                11                43                 9                17                20                34                92                12
3  Proc_Name               8_1               8_2              24_3              24_1          c_1trial              14_2              14_1              28_1              24_2               8_3
4  Quantity4                75                41                62                12                 7               102               -17               496                17               -89
5  Quantity5               0.8               -36               -11               -23                -4                56                12               -84                64                30
6  Quantity6               0.4               0.3               0.5               0.5               0.4               0.6               0.8               0.5               0.5               0.1
7  TimeStamp  11/31/2019 11:15  11/31/2019 16:50  12/15/2020 21:45  12/15/2020 07:01  07/08/2019 05:11  07/08/2019 21:04  07/08/2019 10:54  12/15/2020 01:36  12/15/2020 11:15  11/31/2019 21:33

The order of the columns is different than what I required, but that is a minor problem.

The real problem with this approach is using loops. I wish there was a better way to do this using some built-in Pandas functionality. If you have a better solution, please post it. Thank you.
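One loop-free possibility is to let NumPy broadcasting compare every raw column against every processed column at once. This is only a sketch on toy data: `raw_body` stands in for rows 7 onwards of df_raw without Tr_id, already cast to integers, and all names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: the numeric tail of the raw frame and the processed frame.
raw_body = pd.DataFrame({"run_a": [1, 2, 3], "run_b": [6, 5, 3], "run_c": [9, 9, 9]})
processed = pd.DataFrame({"p_2": [6, 5, 3], "p_1": [1, 2, 3]})

r = raw_body.to_numpy()   # shape (rows, n_raw)
p = processed.to_numpy()  # shape (rows, n_proc)

# equal[i, j] is True when raw column i equals processed column j in every row.
equal = (r[:, :, None] == p[:, None, :]).all(axis=0)

# Turn the boolean matrix into a raw-name -> processed-name mapping.
raw_idx, proc_idx = np.nonzero(equal)
pairs = pd.Series(processed.columns[proc_idx], index=raw_body.columns[raw_idx])
print(pairs)
```

The intermediate array has rows × n_raw × n_proc elements, so this trades memory for the absence of Python-level loops; for a handful of columns either approach should be fine.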

answered Jun 6, 2016 at 17:26, edited Jun 6, 2016 at 17:31 – edesz

4 Comments

Nikign Over a year ago

Isn't this post what you want to do? stackoverflow.com/questions/30291032/…

edesz Over a year ago

Thanks. I actually tried that - see the OP. I am getting an error ValueError: Can only compare identically-labeled DataFrame objects. Somehow, it refuses to compare columns with different names.

Nikign Over a year ago

I think the answer has already stated to rename your columns first. Is it not possible in your case?

edesz Over a year ago

As long as it is possible to easily change these names, then I would not have a problem. In the accepted answer the columns were renamed, but it seems clear which columns from df_processed the new columns are referring to. I like this approach because I have a quick connection between the two of them.
