
When I try to call MPI_Send or MPI_Recv multiple times in the program, the executable hangs on both the nodes and the root; i.e., when it tries to execute the second MPI_Send or MPI_Recv, the communication blocks. At the same time, the binaries run at 100% CPU on the machines.

When I ran this code on Windows 7 64-bit with OpenMPI 1.6.3 64-bit, it ran successfully. But the same code does not work on Linux, i.e., CentOS 6.3 x86_64 with OpenMPI 1.6.3 64-bit. What have I done wrong?

I'm posting the code below:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI::Init();
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();
    char name[256] = { };
    int len = 0;
    MPI::Get_processor_name(name, len);

    printf("Hi I'm %s:%d\n", name, rank);

    if (rank == 0)
    {
        while (size >= 1)
        {
            int val, stat = 1;
            MPI::Status status;
            MPI::COMM_WORLD.Recv(&val, 1, MPI::INT, 1, 0, status);
            int source = status.Get_source();
            printf("%s:%d received %d from %d\n", name, rank, val, source);

            MPI::COMM_WORLD.Send(&stat, 1, MPI::INT, 1, 2);
            printf("%s:%d sent status %d\n", name, rank, stat);

            size--;
        }
    }
    else
    {
        int val = rank + 10;
        int stat = 0;
        printf("%s:%d sending %d...\n", name, rank, val);
        MPI::COMM_WORLD.Send(&val, 1, MPI::INT, 0, 0);
        printf("%s:%d sent %d\n", name, rank, val);

        MPI::Status status;
        MPI::COMM_WORLD.Recv(&stat, 1, MPI::INT, 0, 2, status);
        int source = status.Get_source();
        printf("%s:%d received status %d from %d\n", name, rank, stat, source);
    }

    size = MPI::COMM_WORLD.Get_size();
    if (rank == 0)
    {
        while (size >= 1)
        {
            int val, stat = 1;
            MPI::Status status;

            MPI::COMM_WORLD.Recv(&val, 1, MPI::INT, 1, 1, status);
            int source = status.Get_source();
            printf("%s:0 received %d from %d\n", name, val, source);

            size--;
        }

        printf("all workers checked in!\n");
    }
    else
    {
        int val = rank + 10 + 5;
        printf("%s:%d sending %d...\n", name, rank, val);
        MPI::COMM_WORLD.Send(&val, 1, MPI::INT, 0, 1);
        printf("%s:%d sent %d\n", name, rank, val);
    }
    MPI::Finalize();

    return 0;
}

Hi Hristo, I have changed the source as you said, and I'm posting the code again:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) 
{
    int iNumProcess = 0, iRank = 0, iNameLen = 0, n;
    char szNodeName[MPI_MAX_PROCESSOR_NAME] = {};
    MPI_Status stMPIStatus;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &iNumProcess);
    MPI_Comm_rank(MPI_COMM_WORLD, &iRank);
    MPI_Get_processor_name(szNodeName, &iNameLen);

    printf("Hi I'm %s:%d\n", szNodeName, iRank);

    if (iRank == 0) 
    {
        int iNode = 1;
        while (iNumProcess > 1) 
        {
            int iVal = 0, iStat = 1;
            MPI_Recv(&iVal, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &stMPIStatus);
            printf("%s:%d received %d\n", szNodeName, iRank, iVal);

            MPI_Send(&iStat, 1, MPI_INT, iNode, 1, MPI_COMM_WORLD);
            printf("%s:%d sent Status %d\n", szNodeName, iRank, iStat);

            MPI_Recv(&iVal, 1, MPI_INT, MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, &stMPIStatus);
            printf("%s:%d received %d\n", szNodeName, iRank, iVal);

            iNumProcess--;
            iNode++;
        }

        printf("all workers checked in!\n");
    }
    else 
    {
        int iVal = iRank + 10;
        int iStat = 0;
        printf("%s:%d sending %d...\n", szNodeName, iRank, iVal);
        MPI_Send(&iVal, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        printf("%s:%d sent %d\n", szNodeName, iRank, iVal);

        MPI_Recv(&iStat, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &stMPIStatus);
        printf("%s:%d received status %d\n", szNodeName, iRank, iVal);

        iVal = 20;
        printf("%s:%d sending %d...\n", szNodeName, iRank, iVal);
        MPI_Send(&iVal, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
        printf("%s:%d sent %d\n", szNodeName, iRank, iVal);

    }

    MPI_Finalize();

    return 0;
}

I got the output as follows; i.e., after the second send/receive, the root waits indefinitely and the nodes run at 100% CPU utilization. The output is given below:

Hi I'm N1433:1
N1433:1 sending 11...
Hi I'm N1425:0
N1425:0 received 11
N1425:0 sent Status 1
N1433:1 sent 11
N1433:1 received status 11
N1433:1 sending 20...

Here N1433 and N1425 are machine names. Please help.

2 Answers

The code for the master is wrong: it always sends to and awaits messages from the same rank, rank 1. Thus the program would only function correctly if run as mpiexec -np 2 .... What you probably wanted is to use MPI_ANY_SOURCE as the source rank and then use the reported source rank as the destination in the send operation. You also shouldn't use while (size >= 1), since rank 0 does not talk to itself and the number of communications is expected to be one less than size.

if (rank == 0) 
{
    while (size > 1)
    //     ^^^^^^^^
    {
        int val, stat = 1;
        MPI::Status status;
        MPI::COMM_WORLD.Recv(&val, 1, MPI::INT, MPI_ANY_SOURCE, 0, status);
        // Use wildcard source here ------------^^^^^^^^^^^^^^
        int source = status.Get_source();
        printf("%s:%d received %d from %d\n", name, rank, val, source);

        MPI::COMM_WORLD.Send(&stat, 1, MPI::INT, source, 2);
        // Send back to the same process --------^^^^^^
        printf("%s:%d sent status %d\n", name, rank, stat);

        size--;
    }
} else

Doing something like this in the worker is pointless:

MPI::Status status;
MPI::COMM_WORLD.Recv(&stat, 1, MPI::INT, 0, 2, status);
// Source rank is fixed here ------------^
int source = status.Get_source();
printf("%s:%d received status %d from %d\n", name, rank, stat, source);

You have already specified rank 0 as the source in the receive operation so it would only be able to receive messages from rank 0. There is no way that status.Get_source() would return any value other than 0, unless some communication error had occurred, in which case an exception would get thrown by MPI::COMM_WORLD.Recv().

The same is also true for the second loop in your code.

By the way, you are using what used to be the official standard C++ bindings. They were deprecated in MPI-2.2, and the latest version of the standard (MPI-3.0) removed them completely, as they are no longer supported by the MPI Forum. You should use the C bindings instead, or rely on third-party C++ interfaces such as Boost.MPI.


1 Comment

Hi Hristo, I have changed the source and am posting it below.

After installing MPICH2 instead of OpenMPI, it worked successfully. I think there is some problem with OpenMPI 1.6.3 on my cluster machines.
