
I am working on a C++ application, where I use the MPI C bindings to send and receive data over a network. I understand that sending

const int VECTOR_SIZE = 1e6;
std::vector<int> vector(VECTOR_SIZE, 0);

via

// Version A
MPI_Send(vector.data(), static_cast<int>(vector.size()), MPI_INT, 1, 0, MPI_COMM_WORLD);

is much more efficient than

// Version B
for (const auto &element : vector)
    MPI_Send(const_cast<int *>(&element), 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

due to the per-call latency of MPI_Send. However, if I want to send a data structure that is not contiguous in memory (a std::list<int>, for instance), I cannot use Version A. I either have to resort to Version B, or copy the list's contents into a contiguous container (such as a std::vector<int>) first and then use Version A. Since I want to avoid that extra copy, I wonder whether there are any options/other functions in MPI that allow an efficient use of Version B (or at least a similar, loop-like construct) without incurring the latency on every MPI_Send call.

    You might look at Boost MPI, since it supports STL containers. Commented Jan 17, 2016 at 18:11
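For reference, here is a minimal sketch of what the Boost.MPI route from the comment above could look like. The ranks, tag, and payload are made up for illustration; note that Boost.MPI serializes the std::list behind the scenes, which still involves internal buffering rather than a zero-copy send.

// Hypothetical Boost.MPI sketch: sending a std::list<int> directly.
#include <boost/mpi.hpp>
#include <boost/serialization/list.hpp>  // enables serialization of std::list
#include <list>

namespace mpi = boost::mpi;

int main(int argc, char **argv) {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    if (world.rank() == 0) {
        std::list<int> data(1000000, 42);   // example payload
        world.send(1, 0, data);             // Boost.MPI serializes the list for us
    } else if (world.rank() == 1) {
        std::list<int> data;
        world.recv(0, 0, data);             // deserialized back into a std::list
    }
    return 0;
}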

1 Answer


Stepping through your std::list and sending the elements one by one would indeed cause significant communication overhead.

The MPI specification is designed to be language independent, which is why it uses language-agnostic MPI datatypes. As a consequence, it can only send from contiguous buffers (a feature most languages offer) and not from more complex data structures such as linked lists.

To avoid the communication overhead of sending one by one, there are two alternatives:

  • copy all the list elements into a std::vector and send the vector. However, this creates a memory overhead AND makes the sending completely sequential (and during that time some MPI nodes could be idle).

  • or iterate through your list, building smaller vectors/buffers, and send these smaller chunks (possibly dispatching them to several destination nodes). This approach better hides I/O latency and exploits parallelism through a pipelining effect. You will, however, have to experiment a bit to find the optimal size of the intermediate chunks. A sketch of this chunked approach is given below the list.
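To make the second alternative concrete, here is a minimal sketch of a chunked send between two ranks. The chunk size of 1024, the rank assignment, and the end-of-data convention (a final, shorter chunk marks the end of the transfer) are assumptions for illustration, not part of the original answer, and the chunk size should be tuned experimentally.

#include <mpi.h>
#include <list>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int CHUNK_SIZE = 1024;  // tuning parameter

    if (rank == 0) {
        std::list<int> data(1000000, 42);  // example payload

        // Fill a small reusable buffer from the list and send it as soon as it is full.
        std::vector<int> chunk;
        chunk.reserve(CHUNK_SIZE);
        for (int element : data) {
            chunk.push_back(element);
            if (chunk.size() == static_cast<std::size_t>(CHUNK_SIZE)) {
                MPI_Send(chunk.data(), CHUNK_SIZE, MPI_INT, 1, 0, MPI_COMM_WORLD);
                chunk.clear();
            }
        }
        // Final (possibly empty) chunk; its smaller size tells the receiver to stop.
        MPI_Send(chunk.data(), static_cast<int>(chunk.size()), MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        std::list<int> received;
        std::vector<int> chunk(CHUNK_SIZE);
        while (true) {
            MPI_Status status;
            MPI_Recv(chunk.data(), CHUNK_SIZE, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            int count;
            MPI_Get_count(&status, MPI_INT, &count);
            received.insert(received.end(), chunk.begin(), chunk.begin() + count);
            if (count < CHUNK_SIZE)  // last chunk was shorter: transfer complete
                break;
        }
    }

    MPI_Finalize();
    return 0;
}

With non-blocking sends (MPI_Isend) the sender could already refill the next chunk while the previous one is in flight, which is where the pipelining effect mentioned above comes from.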
