2

My goal is to join two dataframes from different sources in Python using Pandas and then fill null values in columns with corresponding values in the same column.

The dataframes have similar columns, but some text/object columns may have different values due to variations in the data sources. For instance, the "Name" column in one dataframe might contain "Nick M." while in the other it's "Nick Maison". However, certain columns such as "Date" (formatted as YYYY-MM-DD), "Order ID" (numeric), and "Employee ID" (numeric) have consistent values across both dataframes (we join dataframes based on them). Worth mentioning, some columns may not even exist in one or another dataframe, but should also be filled.

import pandas as pd

# Create DataFrame df1

df1_data = {

'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19"],
'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
'Value (df1)': [25, 37, 18, 24, 56, 77, 25]

}

df1 = pd.DataFrame(df1_data)

# Create DataFrame df2

df2_data = {

'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19", "2024-03-19", "2024-03-19"],
'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],  
'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],  
'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],  
'Value (df2)': [25, 37, 19, 22, 17, 9, 76]  

}

df2 = pd.DataFrame(df2_data)

# Combine DataFrames

outer_joined_df = pd.merge(

                            df1,
                            df2,
                            how = 'outer',
                            left_on = ['Date (df1)', 'Employee Id (df1)', "Order Id (df1)"],
                            right_on = ['Date (df2)', 'Employee Id (df2)', "Order Id (df2)"]

                        )

# Display the result

outer_joined_df

Here is the output of joined dataframes. Null values colored in yellow should be filled.

enter image description here

I tried below code and it works for Date, Order Id and Employee Id columns as expected (because they are the same across two dataframes and we join based on them), but not for other, because they may have different values. Basically, the logic in this code is if Null, then fill with values from the same row in specified column. However, since values may be different, filled column becomes messy, because it has multiple variations of the same value.

outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])

outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])

outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])

outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])

outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])

Here is the output:

enter image description here

As you can see, it populated the data, but not the way I want.

Output I need:

enter image description here

4
  • What is your goal after the data is joined? You may just be able to concat both frames together, groupby the date, order id, employee id and preform some agg function on the group. Commented Mar 21, 2024 at 23:17
  • Original dfs have also start and end times. And goal is to compare line by line the discrepancies between each row from each dataframes. I need to fill values as I mentioned, because the output is in excel file and sent out to other team to go through it. If nulls left and person filters by the column with less instances, then some of the data will be filtered out. So I decided to code it the way, so no matter what column used for filtering, all data will be presented correctly. Hope that makes sense. Commented Mar 21, 2024 at 23:52
  • For rows that match (index 0, 1 and 2 in your example), is it true that everything matches from both DataFrames, i.e., the start time, end time and value? Commented Mar 22, 2024 at 0:03
  • Can be different. Imagine 2 different platforms that track employee time. 1 platform for billing purposes, 1 for payroll (Don't ask why it is like that :D. It is stupid, I know). So ideally there should not be discrepancies at all. They should match 100% of the time. But there are times (human error), when someone forget to register (if I can say it this way) 1 session. The report I'm building should catch such things. So it should compare line by line start time, end time, value, etc. for each date and employee. And if something is off, flag it as an exception to take a closer look. Commented Mar 22, 2024 at 0:13

1 Answer 1

1
# a list with all column names, minus `(dfx)`
columns = ["Date", "Order Id", "Employee Id", "Name", "Region", "Value"]

# create a dict with a relation between values in df1 and df2, both ways
value_relations = {}
for col in columns:
    relations = (
        outer_joined_df[[f"{col} (df1)", f"{col} (df2)"]]
        .drop_duplicates()
        .dropna()
        .to_dict("tight")
        .get("data")
    )
    value_relations[col] = {k: v for k, v in relations}
    value_relations[col].update({v: k for k, v in relations})

    # fill values of df1 with the related value of df2
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"].map(value_relations[col])
    )
    # fill values of df2 with the related value of df1
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"].map(value_relations[col])
    )
   Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
3  2024-03-18             4.0              825.0    Nick M.           SD  ...             NaN             825.0  Nick Mason     San Diego         NaN
4  2024-03-18             5.0              825.0    Nick M.           SD  ...             NaN             825.0  Nick Mason     San Diego         NaN
5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
8  2024-03-19             NaN              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0

If you want to fill the remaining null values, add this at the end of each loop:

    # fill remaining null values of df1
    outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
        outer_joined_df[f"{col} (df2)"]
    )
    # fill remaining null values of df2
    outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
        outer_joined_df[f"{col} (df1)"]
    )
   Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
3  2024-03-18             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        24.0
4  2024-03-18             5.0              825.0    Nick M.           SD  ...             5.0             825.0  Nick Mason     San Diego        56.0
5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
8  2024-03-19             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.