I have two dataframes, DF1 and DF2, with id as the unique column. DF2 may contain new records as well as updated values for existing records of DF1. When the two dataframes are merged, the result should include the new records and the updated values, and the remaining DF1 records should come through unchanged.

Input example:

id   name
10   abc
20   tuv
30   xyz

and

id   name
10   abc
20   pqr
40   lmn

When I merge these two dataframes, I want the result as:

id   name
10   abc
20   pqr
30   xyz
40   lmn

2 Answers


Use an outer join followed by a coalesce. In Scala:

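import org.apache.spark.sql.functions.coalesce
import spark.implicits._  // needed for toDF and the $"..." column syntax

val df1 = Seq((10, "abc"), (20, "tuv"), (30, "xyz")).toDF("id", "name")
val df2 = Seq((10, "abc"), (20, "pqr"), (40, "lmn")).toDF("id", "name")

// Rename df1's name so it does not clash with df2's, then prefer df2's value.
df1.select($"id", $"name".as("old_name"))
  .join(df2, Seq("id"), "outer")
  .withColumn("name", coalesce($"name", $"old_name"))
  .drop("old_name")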

coalesce returns the first non-null value among its arguments, so the name from df2 is used when it exists and old_name from df1 is used otherwise. In this case the result is:

+---+----+
| id|name|
+---+----+
| 20| pqr|
| 40| lmn|
| 10| abc|
| 30| xyz|
+---+----+
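
To make the role of coalesce easier to see, here is a minimal sketch that also shows the intermediate result of the outer join before the columns are combined. It assumes a local SparkSession created just for illustration (in spark-shell, `spark` already exists); the names `joined`, `merged`, and `rows` are my own and not part of the answer above. The final orderBy is only there to make the row order predictable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

// Assumption: a throwaway local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("merge-coalesce-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((10, "abc"), (20, "tuv"), (30, "xyz")).toDF("id", "name")
val df2 = Seq((10, "abc"), (20, "pqr"), (40, "lmn")).toDF("id", "name")

// After the outer join every id appears once, with old_name (from df1) and name (from df2).
// The side that lacks the id is null: id 30 has a null name, id 40 has a null old_name.
val joined = df1.select($"id", $"name".as("old_name")).join(df2, Seq("id"), "outer")
joined.orderBy("id").show()

// coalesce takes name when it is non-null and falls back to old_name otherwise.
val merged = joined
  .withColumn("name", coalesce($"name", $"old_name"))
  .drop("old_name")
  .orderBy("id")

// Collect to the driver so the result can be inspected as plain Scala values.
val rows = merged.collect().map(r => (r.getInt(0), r.getString(1))).toSeq
println(rows)
```

Note that this approach treats a null name in df2 as "no update", so it cannot be used to overwrite an existing value with null.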

df1.join(df2, Seq("id"), "leftanti").union(df2).show

+---+----+
| id|name|
+---+----+
| 30| xyz|
| 10| abc|
| 20| pqr|
| 40| lmn|
+---+----+
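
The one-liner above can be unpacked into two steps, which may make the logic easier to follow. This is only a sketch under a few assumptions: it creates its own local SparkSession, the names `onlyInDf1`, `merged`, and `rows` are mine, and it uses unionByName (available from Spark 2.3) in place of union so that columns are matched by name instead of by position. With the two-column frames from the question both variants should give the same rows.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("merge-antijoin-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((10, "abc"), (20, "tuv"), (30, "xyz")).toDF("id", "name")
val df2 = Seq((10, "abc"), (20, "pqr"), (40, "lmn")).toDF("id", "name")

// Step 1: keep only the df1 rows whose id does not occur in df2 (here just id 30).
val onlyInDf1 = df1.join(df2, Seq("id"), "leftanti")

// Step 2: append all of df2, which already holds the new and the updated rows.
val merged = onlyInDf1.unionByName(df2).orderBy("id")

// Collect to the driver so the result can be inspected as plain Scala values.
val rows = merged.collect().map(r => (r.getInt(0), r.getString(1))).toSeq
println(rows)
```

Because every row of df2 is taken as-is, a null name in df2 replaces the old value here, which differs from the coalesce approach. Without the orderBy, the row order of the result is not guaranteed.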

2 Comments

This is not an answer; it is a comment on @Shaido's answer. You have only changed his last statement.
df1.join(df2, Seq("id"), "leftanti").union(df2) is my answer. The user already has the two dataframes df1 and df2 defined in the question, so I don't have to redefine them. The table at the end of @Shaido's answer is the output of his answer; my table is the output of mine. If you look more closely, the tables are not exactly the same. Both are right.
