
I was given a small exercise where I had to use MPI to implement a Monte Carlo algorithm that estimates the total volume of n spheres in 3 dimensions, given the coordinates of their centers and their radii. Even though we must use MPI, we launch all the processes on our local machine, so there is no network overhead. I implemented two versions of this exercise:

One using MPI_Send and MPI_Recv, where the rank 0 process only waits for the partial results from the others and performs the final sum: http://pastebin.com/AV41hJqn

The other using MPI_Reduce; here too, the rank 0 process waits for the partial results: http://pastebin.com/8b0czv6a
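
In case the pastebin links go stale, the collection step of the two versions looks roughly like this (a simplified sketch with a placeholder partial result, not the exact code from the pastebins):

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of the two collection patterns (placeholder computation,
 * not the exact code from the pastebin links). */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_sum = 1.0;   /* stand-in for each rank's partial volume */

    /* --- Version 1: explicit MPI_Send / MPI_Recv; rank 0 only collects --- */
    if (rank == 0) {
        double total = 0.0, partial;
        for (int src = 1; src < size; ++src) {
            MPI_Recv(&partial, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += partial;
        }
        printf("Send/Recv total: %f\n", total);
    } else {
        MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    /* --- Version 2: MPI_Reduce; every rank (including 0) contributes --- */
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Reduce total:    %f\n", total);

    MPI_Finalize();
    return 0;
}
```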

I expected both programs to take the same time to finish, but the one using MPI_Reduce is consistently faster. Why is that? Where does the difference come from?

  • Your MPI_Reduce example makes all ranks compute, including rank 0. Commented Jan 7, 2014 at 16:01
  • I don't think that's the issue. With Send/Recv I launch, for example, 5 processes, of which 4 do the work and the first one just waits for the results. With Reduce I launch 4 processes and they all work. So in both cases there are always 4 processes working. Commented Jan 7, 2014 at 19:45
  • Can you provide some benchmarks, i.e. timings for both algorithms and a set of input files? If you run your processes on the same machine, there should be no noticeable difference in the speed between the two algorithms, though shared memory communication also has non-zero latency. Commented Jan 7, 2014 at 20:23

1 Answer


There could be many reasons, depending on which MPI implementation you're using, what kind of hardware you're running on, and how well the implementation is optimized to take advantage of it. This Google Scholar search gives some idea of the variety of work done on this topic. To give you a few ideas of what it could be:

  • Since a reduction can be computed in intermediate steps, it may be possible to use a different topology than the basic rank-0 collect-from-all approach, with tradeoffs in latency and bandwidth (see the first sketch after this list).
  • Within a compute node (or on your desktop or laptop if you're trying this with a toy problem), it may be possible to exploit locality within cores, between cores on a CPU socket, or between sockets, and order the computations and communication in a way that's more efficient for the hardware. Judging from the abstract, this paper from IBM may give some concrete details about some of these design decisions. Alternatively, the implementation might choose a cache-oblivious scheme for better performance within a general compute node.
  • Persistent communication (MPI_Send_init and MPI_Recv_init) can be used under the hood in the MPI_Reduce implementation. These routines can perform better than their blocking and non-blocking counterparts because they give the MPI implementation and the hardware extra details about how the program groups its communications (the second sketch after this list shows the API).
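
To make the first point more concrete, here is a minimal sketch of a hand-rolled binomial-tree sum onto rank 0. It is only an illustration of the kind of schedule an MPI library might use internally for MPI_Reduce (real implementations are considerably more sophisticated), but it shows how a reduction can finish in roughly log2(P) communication rounds instead of the P-1 sequential receives of the rank-0 collect-from-all loop:

```c
#include <mpi.h>
#include <stdio.h>

/* Hand-rolled binomial-tree sum of one double per rank onto rank 0.
 * Illustrative only: real MPI_Reduce implementations choose among
 * several algorithms depending on message size and process count. */
double tree_sum(double local, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double acc = local;
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == 0) {
            int partner = rank + step;
            if (partner < size) {
                double incoming;
                MPI_Recv(&incoming, 1, MPI_DOUBLE, partner, 0, comm,
                         MPI_STATUS_IGNORE);
                acc += incoming;   /* combine partial sums on the way up */
            }
        } else {
            MPI_Send(&acc, 1, MPI_DOUBLE, rank - step, 0, comm);
            break;                 /* this rank is done after sending */
        }
    }
    return acc;                    /* meaningful only on rank 0 */
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double total = tree_sum((double)(rank + 1), MPI_COMM_WORLD);
    if (rank == 0)
        printf("tree sum = %f\n", total);   /* 1 + 2 + ... + P */
    MPI_Finalize();
    return 0;
}
```

With P processes the critical path is about ceil(log2 P) rounds instead of P-1 receives at rank 0. For a handful of processes on one machine the difference is small, but it is one of the levers the library has, and it grows with the process count.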
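
And for the last point, this is roughly what persistent requests look like when used directly. Whether a particular MPI_Reduce implementation actually uses them internally is implementation-dependent, so treat this only as an illustration of the MPI_Send_init / MPI_Recv_init API:

```c
#include <mpi.h>
#include <stdio.h>

/* Illustration of persistent requests: the communication is set up once
 * and restarted every iteration with MPI_Start.  Purely an API example;
 * it says nothing about what a given MPI_Reduce does internally. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_iters = 10;            /* arbitrary number of rounds */
    double payload = 0.0, incoming = 0.0;
    MPI_Request req;

    if (rank == 0 && size > 1) {
        /* set up a receive from rank 1 once, restart it each iteration */
        MPI_Recv_init(&incoming, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < n_iters; ++i) {
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Request_free(&req);
        printf("last value received: %f\n", incoming);
    } else if (rank == 1) {
        /* set up the matching send once */
        MPI_Send_init(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < n_iters; ++i) {
            payload = (double)i;       /* new payload each round */
            MPI_Start(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Request_free(&req);
    }

    MPI_Finalize();
    return 0;
}
```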

This is not a comprehensive list, but hopefully it gets you started and provides some ideas for how to search out more details if you're interested.
