I have the following Pandas dataframe of some raw numbers:

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)

col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']

df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)

It looks like this:

        Tr_id       07_08_19 #1       07_08_19 #2     07_08_19 #2.1       11_31_19 #1     11_31_19 #1.1     11_31_19 #1.3       12_15_20 #1       12_15_20 #2     12_15_20 #2.1     12_15_20 #2.2
0   Quantity1                 1                 1                 2                 1                 2                 3                 1                 1                 2                 3
1   Quantity2                75                11               0.9                70                60                17                 6                 3                 5                 7
2   Quantity3                 9                20                17                 4                74                12                34                43                92                11
3   Quantity4                 7               -17               102                75                41               -89               496                12                17                62
4   Quantity5                -4                12                56               0.8               -36                30               -84               -23                64               -11
5   Quantity6               0.4               0.8               0.6               0.4               0.3               0.1               0.5               0.5               0.5               0.5
6   TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45
7         NaN                 1                 6                65                 4                 2                22                 1                34                51                12
8         NaN                 2                 5                 3                 3                 3                 1                 6                37                20                71
9         NaN                 3                 3                 6                 6                11                 6                 3                46                 1                34
10        NaN                 4                 7                78                 8                55                32                 2               918                34                94
11        NaN                 8                 3                 9                 3                 3                 5                 6                 0                12                 1
12        NaN                 4                23                 2                 5                 7                 6                55                37                59                73
13        NaN                 3                27                45                66                33                 4                22                91                78                46
14        NaN                 8                 3                 6                32                65                 3                 6                12                 6                51
15        NaN                 7                11                 7                84                34               898                23                68               101                21

I have a separate dataframe of a processed version of these numbers where:

  1. some of the header rows from above have been deleted,
  2. the column names have been changed

Here is the second dataframe:

df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
df_processed = df_processed.iloc[:, [3,4,9,7,0,2,1,6,8,5]]  # reorder columns by position; the labels are strings

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

Common parts of each dataframe:

For each column, rows 8 onwards of the raw dataframe are the same as row 1 onwards from the processed dataframe. The order of columns in both dataframes is not the same.

Output combination:

I am looking to compare rows 8-16 in columns 1-10 of the raw dataframe df_raw to the processed dataframe df_processed. If two columns match each other, then I would like to extract rows 1-7 of df_raw together with the column header from df_processed.

Example:

The values in column c_1trial match only the values in rows 8-16 of the column 07_08_19 #1. I would like to do this in 2 steps: (1) find some way to determine that these 2 columns match each other, and (2) if 2 columns do match each other, then, as in the sample output, select rows from the matching columns.

Here is the output I am looking to get:

    Tr_id       07_08_19 #1       07_08_19 #2     07_08_19 #2.1       11_31_19 #1     11_31_19 #1.1     11_31_19 #1.3       12_15_20 #1       12_15_20 #2     12_15_20 #2.1     12_15_20 #2.2
Quantity1                 1                 1                 2                 1                 2                 3                 1                 1                 2                 3
Quantity2                75                11               0.9                70                60                17                 6                 3                 5                 7
Quantity3                 9                20                17                 4                74                12                34                43                92                11
Proc_Name          c_1trial              14_1              14_2               8_1               8_2               8_3              28_1              24_1              24_2              24_3
Quantity4                 7               -17               102                75                41               -89               496                12                17                62
Quantity5                -4                12                56               0.8               -36                30               -84               -23                64               -11
Quantity6               0.4               0.8               0.6               0.4               0.3               0.1               0.5               0.5               0.5               0.5
TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45

My attempts are giving trouble:

print((df_raw.iloc[7:,1:] == df_processed).all(axis=1))

gives

ValueError: Can only compare identically-labeled DataFrame objects

and

print (df_raw.ix[7:].values == df_processed.values) #gives False

gives

False

The problem with my second attempt is that I am not selecting .all(axis=1). When I make a comparison I want to do this across all rows of every column, not just one row.
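
The ValueError in the first attempt arises because `==` between two DataFrames requires identical row and column labels, and here both the index (7-15 vs 0-8) and the column names differ. A minimal sketch of a purely positional comparison for a single pair of columns follows. The frames `raw` and `proc` and the variable `n_header` are small hypothetical stand-ins, not the data from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_raw (header rows + data rows) and df_processed.
raw = pd.DataFrame({"07_08_19 #1": [1, 75, "07/08/2019 05:11", 1, 2, 3],
                    "07_08_19 #2": [1, 11, "07/08/2019 10:54", 6, 5, 3]})
proc = pd.DataFrame({"14_1": [6, 5, 3], "c_1trial": [1, 2, 3]})

n_header = 3  # number of header rows at the top of `raw` (assumed known)

# to_numpy() discards the labels, so the comparison is purely positional.
raw_col = raw["07_08_19 #1"].iloc[n_header:].to_numpy()
match = np.array_equal(raw_col, proc["c_1trial"].to_numpy())
no_match = np.array_equal(raw_col, proc["14_1"].to_numpy())
print(match, no_match)
```

Comparing one column pair at a time sidesteps the label check, at the cost of having to loop over (or otherwise enumerate) the candidate pairs.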

Question:

Is there a way to select out the output I showed above from these 2 dataframes?

  • Do you mean 7 and onward (inclusive) in the raw DataFrame? Commented Jun 6, 2016 at 17:32
  • I meant row 8. I think you meant index 7 right? Commented Jun 6, 2016 at 17:33
  • Ah that's correct. Commented Jun 6, 2016 at 17:34
  • In your output sample, you said you wanted the columns headers from df_processed but I think you posted those from df_raw? Commented Jun 6, 2016 at 18:13
  • Sorry, actually in the sample output, I showed both column headers only from df_raw but I also included the column names from df_processed. I should have been more clear about this. Thanks. Commented Jun 6, 2016 at 19:59

3 Answers

Does this look like the output you're looking for?

Raw dataframe df:

        Tr_id    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1  
0   Quantity1           1           1           2           1           2   
1   Quantity2          75          11         0.9          70          60   
2   Quantity3           9          20          17           4          74   
3   Quantity4           7         -17         102          75          41   
4   Quantity5          -4          12          56         0.8         -36   
5   Quantity6         0.4         0.8         0.6         0.4         0.3   
6   TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019   
7         NaN           1           6          65           4           2   
8         NaN           2           5           3           3           3   
9         NaN           3           3           6           6          11   
10        NaN           4           7          78           8          55   
11        NaN           8           3           9           3           3   
12        NaN           4          23           2           5           7   
13        NaN           3          27          45          66          33   
14        NaN           8           3           6          32          65   
15        NaN           7          11           7          84          34   

    11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3  
0            3           1           1           2           3  
1           17           6           3           5           7  
2           12          34          43          92          11  
3          -89         496          12          17          62  
4           30         -84         -23          64         -11  
5          0.1         0.5         0.5         0.5         0.5  
6   11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020  
7           22           1          34          51          12  
8            1           6          37          20          71  
9            6           3          46           1          34  
10          32           2         918          34          94  
11           5           6           0          12           1  
12           6          55          37          59          73  
13           4          22          91          78          46  
14           3           6          12           6          51  
15         898          23          68         101          21

Processed dataframe dfp:

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

Code:

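df = pd.read_csv('raw_df.csv') # raw dataframe
dfp = pd.read_csv('processed_df.csv') # processed dataframe
dfr = df.drop('Tr_id', axis=1) # Tr_id is NaN in the data rows, so drop it before casting to int

tail = dfr.tail(9).astype(int) # the 9 data rows of the raw dataframe

x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        # compare by position (.values): the two frames have different index labels
        if (tail[col_raw].values == dfp[col_p].values).all():
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw) # keep the raw column name as an extra row
            x[col_p] = series

x = pd.concat([df['Tr_id'].head(7), x], axis=1)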

Output:

       Tr_id    c_1trial        14_1        14_2         8_1         8_2  
0  Quantity1           1           1           2           1           2   
1  Quantity2          75          11         0.9          70          60   
2  Quantity3           9          20          17           4          74   
3  Quantity4           7         -17         102          75          41   
4  Quantity5          -4          12          56         0.8         -36   
5  Quantity6         0.4         0.8         0.6         0.4         0.3   
6  TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019   
7        NaN    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1   

          8_3        28_1        24_1        24_2        24_3  
0           3           1           1           2           3  
1          17           6           3           5           7  
2          12          34          43          92          11  
3         -89         496          12          17          62  
4          30         -84         -23          64         -11  
5         0.1         0.5         0.5         0.5         0.5  
6  11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020  
7  11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3 

I think the code could be more concise but maybe this does the job.
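
One possibly more concise variant, offered as a sketch rather than a drop-in replacement: key a dictionary on each processed column's values so that every raw column needs only one lookup instead of an inner loop. The tiny frames below and `n_header` are hypothetical stand-ins for the question's data, and the approach assumes no two processed columns hold identical values:

```python
import pandas as pd

# Hypothetical stand-ins: 2 header rows followed by 3 data rows.
df_raw = pd.DataFrame({
    "Tr_id": ["Quantity1", "TimeStamp", None, None, None],
    "07_08_19 #1": [1, "07/08/2019 05:11", 1, 2, 3],
    "07_08_19 #2": [1, "07/08/2019 10:54", 6, 5, 3],
})
df_processed = pd.DataFrame({"14_1": [6, 5, 3], "c_1trial": [1, 2, 3]})

n_header = 2  # assumed number of header rows in df_raw

# Key: a processed column's values as a tuple; value: that column's name.
by_values = {tuple(df_processed[c].tolist()): c for c in df_processed.columns}

data_cols = df_raw.columns[1:]  # skip Tr_id
# Raw column name -> matching processed column name (None if nothing matches).
mapping = {c: by_values.get(tuple(df_raw[c].iloc[n_header:].tolist()))
           for c in data_cols}

# Header rows plus one extra row naming the matching processed column.
out = df_raw.iloc[:n_header].copy()
out.loc[n_header] = ["Proc_Name"] + [mapping[c] for c in data_cols]
print(out)
```

Because `out` is built from `df_raw`, the columns stay in the raw frame's order and keep the raw column names, with the processed names carried in the `Proc_Name` row.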

3 Comments

Hey, this is very very close. Is it also possible to have column names from df_raw in the output columns? eg. in your first column c_1trial, can one of the rows be 07_08_19 #1? And similarly for the other columns?
I'm getting ValueError: cannot convert float NaN to integer. Could this be due to a Python 2.7 specific problem? It's pointing to this line: df.tail(9).astype(int)[col_raw].
It's because the raw DataFrame has the column Tr_id. I dropped it using df = df.drop('Tr_id', axis=1)

An alternative solution, using the DataFrame.isin() method:

In [171]: df1
Out[171]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
3  0  3  3
4  0  4  4

In [172]: df2
Out[172]:
   a  b  c
0  0  3  3
1  1  1  1
2  0  3  4
3  4  2  3
4  0  4  4

In [173]: common = pd.merge(df1, df2)

In [174]: common
Out[174]:
   a  b  c
0  0  3  3
1  0  4  4

In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
   a  b  c
3  0  3  3
4  0  4  4

Or, if you want to subtract the second data set from the first one, i.e. the Pandas equivalent of SQL's:

select col1, .., colN from tableA
minus
select col1, .., colN from tableB

in Pandas:

In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2


I came up with this using loops. It is very disappointing:

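# collect [raw column, processed column] pairs whose data rows are identical
holder = []
for pp in df_processed.columns:
    list1 = df_processed[pp].tolist()
    for rr in df_raw.columns:
        list2 = df_raw.loc[7:,rr].tolist()
        if list1==list2:
            holder.append([rr,pp])

df_intermediate = pd.DataFrame(holder,columns=['A','B'])
# header rows (labels 0-6) of the matched raw columns; copy so the rows added below do not touch df_raw
df_c = df_raw.loc[:6,df_intermediate.iloc[:,0].tolist()].copy()
# append the processed column names as a new last row
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:,1].tolist()
df_c.insert(0,list(df_raw)[0],df_raw[list(df_raw)[0]])
df_c.iloc[-1,0]='Proc_Name'
# move the Proc_Name row to sit between Quantity3 and Quantity4
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)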

Output:

       Tr_id       11_31_19 #1     11_31_19 #1.1     12_15_20 #2.2       12_15_20 #2       07_08_19 #1     07_08_19 #2.1       07_08_19 #2       12_15_20 #1     12_15_20 #2.1     11_31_19 #1.3
0  Quantity1                 1                 2                 3                 1                 1                 2                 1                 1                 2                 3
1  Quantity2                70                60                 7                 3                75               0.9                11                 6                 5                17
2  Quantity3                 4                74                11                43                 9                17                20                34                92                12
3  Proc_Name               8_1               8_2              24_3              24_1          c_1trial              14_2              14_1              28_1              24_2               8_3
4  Quantity4                75                41                62                12                 7               102               -17               496                17               -89
5  Quantity5               0.8               -36               -11               -23                -4                56                12               -84                64                30
6  Quantity6               0.4               0.3               0.5               0.5               0.4               0.6               0.8               0.5               0.5               0.1
7  TimeStamp  11/31/2019 11:15  11/31/2019 16:50  12/15/2020 21:45  12/15/2020 07:01  07/08/2019 05:11  07/08/2019 21:04  07/08/2019 10:54  12/15/2020 01:36  12/15/2020 11:15  11/31/2019 21:33

The order of the columns is different than what I required, but that is a minor problem.

The real problem with this approach is using loops. I wish there was a better way to do this using some built-in Pandas functionality. If you have a better solution, please post it. Thank you.
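
For what it's worth, one loop-free possibility is to let NumPy broadcasting compare every raw column against every processed column in a single step. This is only a sketch on small hypothetical stand-in frames (`n_header` is an assumed name too), and it assumes the data rows are numeric and that every raw column has exactly one match:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 1 header row followed by 3 data rows.
df_raw = pd.DataFrame({
    "07_08_19 #1": ["07/08/2019 05:11", 1, 2, 3],
    "07_08_19 #2": ["07/08/2019 10:54", 6, 5, 3],
})
df_processed = pd.DataFrame({"14_1": [6, 5, 3], "c_1trial": [1, 2, 3]})

n_header = 1  # assumed number of header rows in df_raw

raw_vals = df_raw.iloc[n_header:].to_numpy(dtype=float)  # shape (rows, n_raw)
proc_vals = df_processed.to_numpy(dtype=float)           # shape (rows, n_proc)

# Broadcast to (rows, n_raw, n_proc), then require equality on every row.
# same[i, j] is True when raw column i equals processed column j.
same = (raw_vals[:, :, None] == proc_vals[:, None, :]).all(axis=0)

# argmax picks the first True in each row of `same` (assumes a match exists).
proc_name = pd.Series(df_processed.columns[same.argmax(axis=1)],
                      index=df_raw.columns)
print(proc_name)
```

On the frames in the question, the slice would also need to skip Tr_id and start at row label 7, e.g. `df_raw.iloc[7:, 1:]`. The intermediate array has rows × n_raw × n_proc elements, which is fine for 10 columns but grows quickly for wide frames.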

4 Comments

isn't this post what you want to do? stackoverflow.com/questions/30291032/…
Thanks. I actually tried that - see the OP. I am getting an error ValueError: Can only compare identically-labeled DataFrame objects. Somehow, it refuses to compare columns with different names.
I think the answer has already stated to rename your columns first. Is it not possible in your case?
As long as it is possible to easily change these names, then I would not have a problem. In the accepted answer the columns were renamed, but it seems clear which columns from df_processed the new columns are referring to. I like this approach because I have a quick connection between the two of them.
