When running a Spark application, I noticed that the fetchBlocks function is called on both NettyBlockTransferService and my custom ShuffleClient (which implements BlockStoreClient).

I know what my ShuffleClient does (it fetches remote data blocks), but I don't understand what NettyBlockTransferService does. According to the docs, it is also responsible for fetching data blocks, so why do we need both?

I tried printing the list of blocks passed to NettyBlockTransferService and to my ShuffleClient. In my ShuffleClient I see the expected list of blocks, i.e. tuples of (shuffle_id, map_id, reduce_id), but in NettyBlockTransferService the list looks like this:

fetchBlocks: broadcast_0_piece0
fetchBlocks: broadcast_1_piece0
fetchBlocks: rdd_1_2342
fetchBlocks: broadcast_2_piece0

...
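The IDs printed above can be grouped by their name prefix, which makes it easier to see what kind of block each service is asked for. Below is a minimal sketch in plain Java with no Spark dependency. The class and method names are hypothetical. The `broadcast_` and `rdd_` prefixes are taken from the output above, and the `shuffle_` prefix is an assumption based on the (shuffle_id, map_id, reduce_id) tuples.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: classifies block ID strings by prefix.
final class BlockIdKinds {

    // Return a coarse kind for a block ID based only on its name prefix.
    static String classify(String blockId) {
        if (blockId.startsWith("shuffle_")) return "shuffle";     // assumed prefix
        if (blockId.startsWith("broadcast_")) return "broadcast"; // seen in the log
        if (blockId.startsWith("rdd_")) return "rdd";             // seen in the log
        return "other";
    }

    // Count how many block IDs fall into each kind, in first-seen order.
    static Map<String, Integer> countByKind(List<String> blockIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : blockIds) {
            counts.merge(classify(id), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The four IDs printed from NettyBlockTransferService.fetchBlocks above.
        List<String> seen = List.of(
            "broadcast_0_piece0",
            "broadcast_1_piece0",
            "rdd_1_2342",
            "broadcast_2_piece0");
        System.out.println(countByKind(seen));
    }
}
```

Run against the list above, this reports only broadcast and rdd blocks and no shuffle blocks, which matches what I see in NettyBlockTransferService.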

So I'm not sure I understand what NettyBlockTransferService does exactly.

Edit: I might have found an answer after discussing it with a colleague. Spark sometimes needs to broadcast data to all the executors. For example, if you join a small table with a very large one, the Driver can just send the small table to all the executors so they can perform the join. It uses a broadcast for this, and it looks like the broadcast goes through NettyBlockTransferService.

I'd appreciate confirmation of this explanation.
