0

When I join two dataframes using left join like this:

df1.join(broadcast(df2), $"id" === $"id1" || $"id2" === $"id3", "left")

Without $"id2" === $"id3", it executes very quickly but when both conditions are present, it becomes BroadcastNestedLoopJoin and becomes very very slow.

Any ideas how I can improve this?

2 Answers 2

6

BroadcastNestedLoopJoin means nested for-loops to join your data-frames. It will always give a degraded performance.

Can you try below solution:

val resultPart1 = df1.join(broadcast(df2), $"id" === $"id1", "left")
val resultPart2 = df1.join(broadcast(df2),  $"id2" === $"id3", "left")

val resultDF = resultPart1.unionByName(resultPart2)

Union causes zero shufflings of data across executors.Hence produces faster results

Sign up to request clarification or add additional context in comments.

1 Comment

What about records not matching in resultPart1 & resultPart2 ? You would end up with undesired records.
0

Take for example Spark 3.2.

Spark will delegate to one of the 5 available join strategies. In your case, only if you have an equi-join it can use Broadcast Hash Join(BHJ).

A join condition using disjunctive predicates like $"id" === $"id1" || $"id2" === $"id3" is a non-equi join.

A join condition using conjunctive predicates predicate1 AND predicate2 AND .. is an equi join.

In order to get that BHJ you can delegate to multiple equi-joins.

df1
.join(broadcast(df2), $"id" === $"id1", "left")
.join(broadcast(df2),  $"id2" === $"id3", "left")

Then validate in the query plan that both joins refer to the same broadcasted data and use a broadcast hash join.

Yes, you will see multiple joins but with no shuffle. Try implementing your joins in terms of equi-joins otherwise you're stuck with BNLJ.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.