2

I have two DataFrames, named Child and Parent. Child acts like a metadata table, and I need to validate the Data of Parent dataframe.

Child:

Cust_id     Description   Detail
1           Good          Regular
34          Excellent     Normal
45          Bulk          Buyer

Parent:

Name    Cust_id order   date    Payment
xyz     1       ice     01-02-2019  online
abc     45      bread   01-02-2019  offline
mno     56      Butter  01-02-2019  offline
pqr     67      cookies 01-02-2019  online
rst     34      Rice    01-02-2019  online
ert     1       egg     01-02-2019  online

I need to validate Cust_id present in the Parent dataframe or not?

I am selecting on all record Cust_id of Parent table using distinct, and using a loop to check if all data present in the Parent table, are present in the Child table or not.

How can this be accomplished without iteration, using pandas methods?

0

1 Answer 1

2
  • Use pandas Boolean Indexing the determine if 'Cust_id' of parent, is in 'Cust_id' of 'child'.
  • Use .isin on a list of unique 'Cust_id' from 'child'.
    • Indexing with isin
    • child.Cust_id.unique() creates an array of all the unique values in 'Cust_id'
import pandas as pd

child = pd.DataFrame({'Cust_id': [1, 34, 45], 'Description': ['Good', 'Excellent', 'Bulk'], 'Detail': ['Regular', 'Normal', 'Buyer']})

parent = pd.DataFrame({'Name': ['xyz', 'abc', 'mno', 'pqr', 'rst', 'ert'], 'Cust_id': [1, 45, 56, 67, 34, 1], 'order': ['ice', 'bread', 'Butter', 'cookies', 'Rice', 'egg'],
                       'date': ['01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019'], 'Payment': ['online', 'offline', 'offline', 'online', 'online', 'online']})

# mask using isin
mask = parent.Cust_id.isin(child.Cust_id.unique())

# return only the data from parent, where parent Cust_id isin child Cust_id
parent[mask]

# add a column to the parent dataframe
parent['in_child'] = mask

# display(parent)
  Name  Cust_id    order        date  Payment  in_child
0  xyz        1      ice  01-02-2019   online      True
1  abc       45    bread  01-02-2019  offline      True
2  mno       56   Butter  01-02-2019  offline     False
3  pqr       67  cookies  01-02-2019   online     False
4  rst       34     Rice  01-02-2019   online      True
5  ert        1      egg  01-02-2019   online      True
  • pandas.DataFrame.merge can be used in various ways as well.
  • The following solution uses an 'outer' merge with indicator=True
    • The '_merge' column indicates which dataframe the 'Cust_id' is in.
      • 'left_only' is the parent dataframe.
  • .merge combines the information from both dataframes, and I'm not sure if that's the desired output.
merged = parent.merge(child, on='Cust_id', how='outer', indicator=True)

# display(merged)
  Name  Cust_id    order        date  Payment Description   Detail     _merge
0  xyz        1      ice  01-02-2019   online        Good  Regular       both
1  ert        1      egg  01-02-2019   online        Good  Regular       both
2  abc       45    bread  01-02-2019  offline        Bulk    Buyer       both
3  mno       56   Butter  01-02-2019  offline         NaN      NaN  left_only
4  pqr       67  cookies  01-02-2019   online         NaN      NaN  left_only
5  rst       34     Rice  01-02-2019   online   Excellent   Normal       both
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.