I have 2 csv files (csv1, csv2). In csv2 there might be new column or row added in csv2. I need to verify if csv1 is subset of csv2. For being a subset whole row should be present in both the files and elements from new coulmn or row should be ignored.
csv1:
c1,c2,c3
A,A,6
D,A,A
A,1,A
csv2:
c1,c2,c3,c4
A,A,6,L
A,changed,A,L
D,A,A,L
Z,1,A,L
Added,Anew,line,L
I am trying is :
df1 = pd.read_csv(csv1_file)
df2 = pd.read_csv(csv2_file)
matching_cols=df1.columns.intersection(df2.columns).tolist()
sorted_df1 = df1.sort_values(by=list(matching_cols)).reset_index(drop=True)
sorted_df2 = df2.sort_values(by=list(matching_cols)).reset_index(drop=True)
print("truth data>>>\n",sorted_df1)
print("Test data>>>\n",sorted_df2)
df1_mask = sorted_df1[matching_cols].eq(sorted_df2[matching_cols])
# print(df1_mask)
print("compared data>>>\n",sorted_df1[df1_mask])
It gives the out put as :
truth data>>>
c1 c2 c3
0 A 1 A
1 A A 6
2 D A A
Test data>>>
c1 c2 c3 c4
0 A A 6 L
1 A changed A L
2 Added Anew line L
3 D A A L
4 Z 1 A L
compared data>>>
c1 c2 c3
0 A NaN NaN
1 A NaN NaN
2 NaN NaN NaN
What i want is :
compared data>>>
c1 c2 c3
0 Nan NaN NaN
1 A A 6
2 D A A
Please help.
Thanks
A,A,6in the first row, why should it return Nan, can you check