
I recently came across a talk about dealing with skew in Spark SQL by using "Iterative" Broadcast Joins to improve query performance when joining a large table with another, not-so-small table. Unfortunately, the talk doesn't go deep enough for me to understand the implementation.

I was hoping someone could shed some light on how to implement an iterative broadcast join in Spark SQL, with a few examples. How do I implement it using Spark SQL queries with the SQL API?

Note: I am using Spark 2.4.

Any help is appreciated. Thanks

1 Answer


Iterative Broadcast Join: when the smaller table is still too large to broadcast in one shot, it might be worth considering the approach of iteratively taking slices of your smaller (but not that small) table, broadcasting each slice, joining it with the larger table, and then unioning the results.
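The talk and the linked article don't show code, so here is a minimal sketch of the idea, assuming a large table `big` and a medium table `med` joined on a column `key` (all table and column names here are hypothetical). It builds one SQL query per slice, using `pmod(hash(key), n)` to cut `med` into disjoint slices and the `BROADCAST` hint (available in the Spark 2.x SQL API) on each slice, then unions the per-slice joins:

```python
def iterative_broadcast_sql(n_slices):
    """Build a Spark SQL query that joins `big` against `med` one slice at a
    time, broadcasting each slice, then unions the per-slice results.
    Slices are defined by: pmod(hash(key), n_slices) = i."""
    slice_joins = [
        "SELECT /*+ BROADCAST(s) */ b.key, b.value, s.info\n"
        "FROM big b\n"
        "JOIN (SELECT * FROM med WHERE pmod(hash(key), {n}) = {i}) s\n"
        "  ON b.key = s.key".format(n=n_slices, i=i)
        for i in range(n_slices)
    ]
    # Each slice holds a disjoint subset of med's keys, so UNION ALL of the
    # per-slice joins reproduces the full join result exactly once per match.
    return "\nUNION ALL\n".join(slice_joins)

print(iterative_broadcast_sql(3))
```

You would run the generated string through `spark.sql(...)`. Because each broadcast only ships one slice of `med`, no single broadcast has to fit the whole table in memory.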

Another technique for tackling skew is:

i) Salting: add a random number to the join key so the data is distributed evenly across executors. Let's see this through an example:


Suppose we are joining a large table with a small table, and the data is divided across three executors. Because of the skew, all rows with key X land on one executor, all Y rows on another, and all Z rows on a third. Since the Y and Z partitions are relatively small, those executors finish quickly and then sit idle waiting for the X executor, which takes much longer. To improve performance, we want the X rows distributed evenly across all executors. Since the data is stuck on one executor, we add a random number (a "salt") to the key, on both the large and the small table, and then run the join.

Adding the random number: on the small table, key = explode(key, range(1, 3)), which yields key_1, key_2, key_3; on the large table, each row's key gets one random salt from the same range appended, so matching salted keys still line up.


Now the data is evenly distributed across executors, which gives faster performance.
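The mechanics of the salted join can be illustrated in plain Python (the table contents, key names, and number of salts below are made up; in Spark you would do the same with `rand()` on the large side and `explode` on the small side):

```python
import random

N_SALTS = 3

# A skewed "large" table: key X dominates.
large = [("X", v) for v in range(9)] + [("Y", 100), ("Z", 200)]
# A "small" lookup table with exactly one row per key.
small = [("X", "x-info"), ("Y", "y-info"), ("Z", "z-info")]

# Large side: append one random salt per row, so X's rows spread over
# X_0, X_1, X_2 instead of all landing in a single partition.
salted_large = [("{}_{}".format(k, random.randrange(N_SALTS)), v)
                for k, v in large]

# Small side: explode every row into all possible salted keys, so each
# salted large-table key still finds its match.
salted_small = {"{}_{}".format(k, s): info
                for k, info in small for s in range(N_SALTS)}

# Join on the salted key, then strip the salt to recover the original key.
joined = [(k.rsplit("_", 1)[0], v, salted_small[k])
          for k, v in salted_large if k in salted_small]
```

Because every small-table key exists under every salt, the salted join returns exactly the same rows as the unsalted one; this also shows why (as noted in the comments below) plain salting assumes the small side has unique keys, since exploding a side with duplicate keys would multiply the matches.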

If you need more help,please see this video :

https://www.youtube.com/watch?v=d41_X78ojCg&ab_channel=TechIsland

and this link: https://dzone.com/articles/improving-the-performance-of-your-spark-job-on-ske#:~:text=Iterative%20(Chunked)%20Broadcast%20Join,table%2C%20then%20unioning%20the%20result.


3 Comments

Thank you so much for your reply. Since you indicated in your answer that the table could be iteratively broken into slices and broadcast, could you please share an example of how that could be implemented with SQL queries?
See the example I have shown above; in the back end, you just need to add a random number to the key.
It will work only when one of the tables has unique key values. What if I have more than one record for a single key in both tables?
