
I recently came across a talk about dealing with skew in Spark SQL by using "Iterative" Broadcast Joins to improve query performance when joining a large table with another, not-so-small table. Unfortunately, the talk doesn't go deep enough for me to understand the implementation.

I was hoping someone could shed some light on how to implement an iterative broadcast join in Spark SQL, with a few examples. How do I implement it using Spark SQL queries with the SQL API?

Note: I am using Spark 2.4.

Any help is appreciated. Thanks

1 Answer


Iterative Broadcast Join: when the smaller table is still too large to broadcast in one shot, it might be worth considering the approach of iteratively taking slices of your smaller (but not that small) table, broadcasting each slice, joining it with the larger table, and then unioning the results.
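The talk and the linked article don't show code, so here is a minimal sketch of the idea, assuming a large table `big` and a medium table `med` joined on a column `key` (all table and column names here are hypothetical). It builds one SQL query per slice, using `pmod(hash(key), n)` to cut `med` into disjoint slices and the `BROADCAST` hint (available in the Spark 2.x SQL API) on each slice, then unions the per-slice joins:

```python
def iterative_broadcast_sql(n_slices):
    """Build a Spark SQL query that joins `big` against `med` one slice at a
    time, broadcasting each slice, then unions the per-slice results.
    Slices are defined by: pmod(hash(key), n_slices) = i."""
    slice_joins = [
        "SELECT /*+ BROADCAST(s) */ b.key, b.value, s.info\n"
        "FROM big b\n"
        "JOIN (SELECT * FROM med WHERE pmod(hash(key), {n}) = {i}) s\n"
        "  ON b.key = s.key".format(n=n_slices, i=i)
        for i in range(n_slices)
    ]
    # Each slice holds a disjoint subset of med's keys, so UNION ALL of the
    # per-slice joins reproduces the full join result exactly once per match.
    return "\nUNION ALL\n".join(slice_joins)

print(iterative_broadcast_sql(3))
```

You would run the generated string through `spark.sql(...)`. Because each broadcast only ships one slice of `med`, no single broadcast has to fit the whole table in memory.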

Another technique for tackling skew is:

i) Salting: add a random number to the join key so the data is distributed evenly across executors. Let's see this through an example:


Suppose we are joining a large table with a small table, and the data is divided across three executors. Because of the skew, all rows with key X land on one executor, all Y rows on another, and all Z rows on a third. Since the Y and Z partitions are relatively small, those executors finish quickly and then sit idle waiting for the X executor, which takes much longer. To improve performance, we want the X rows distributed evenly across all executors. Since the data is stuck on one executor, we add a random number (a "salt") to the key, on both the large and the small table, and then run the join.

Adding the random number: on the small table, key = explode(key, range(1, 3)), which yields key_1, key_2, key_3; on the large table, each row's key gets one random salt from the same range appended, so matching salted keys still line up.


Now the data is evenly distributed across executors, which gives faster performance.
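The mechanics of the salted join can be illustrated in plain Python (the table contents, key names, and number of salts below are made up; in Spark you would do the same with `rand()` on the large side and `explode` on the small side):

```python
import random

N_SALTS = 3

# A skewed "large" table: key X dominates.
large = [("X", v) for v in range(9)] + [("Y", 100), ("Z", 200)]
# A "small" lookup table with exactly one row per key.
small = [("X", "x-info"), ("Y", "y-info"), ("Z", "z-info")]

# Large side: append one random salt per row, so X's rows spread over
# X_0, X_1, X_2 instead of all landing in a single partition.
salted_large = [("{}_{}".format(k, random.randrange(N_SALTS)), v)
                for k, v in large]

# Small side: explode every row into all possible salted keys, so each
# salted large-table key still finds its match.
salted_small = {"{}_{}".format(k, s): info
                for k, info in small for s in range(N_SALTS)}

# Join on the salted key, then strip the salt to recover the original key.
joined = [(k.rsplit("_", 1)[0], v, salted_small[k])
          for k, v in salted_large if k in salted_small]
```

Because every small-table key exists under every salt, the salted join returns exactly the same rows as the unsalted one; this also shows why (as noted in the comments below) plain salting assumes the small side has unique keys, since exploding a side with duplicate keys would multiply the matches.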

If you need more help,please see this video :

https://www.youtube.com/watch?v=d41_X78ojCg&ab_channel=TechIsland

and this link: https://dzone.com/articles/improving-the-performance-of-your-spark-job-on-ske#:~:text=Iterative%20(Chunked)%20Broadcast%20Join,table%2C%20then%20unioning%20the%20result.


3 Comments

Thank you so much for your reply. Since you indicated in your answer that the table could be iteratively broken into slices and broadcast, could you please share an example of how that could be implemented with SQL queries?
See the example I have shown above; in the back end, you just need to add a random number to the key.
It will work only when one of the tables has unique key values. What if I have more than one record for a single key in both tables?
