
EDIT: My question is similar to C, Open MPI: segmentation fault from call to MPI_Finalize(). The segfault does not always happen, especially with low numbers of processes, so if you answer that one instead, that would be great either way . . .

I was hoping to get some help debugging the following code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int comm_sz, my_rank;

/* create_Keys is defined elsewhere in the full program; Odd_Even_Tsort is below */
void create_Keys(long** my_local, int my_rank, int comm_sz, long n, long* s, long* f);
void Odd_Even_Tsort(long** my_local, int my_rank, long my_size, int comm_sz);

int main(){
    long* my_local;
    long n, s, f;
    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if(my_rank == 0){
        /*  Get size n from user                            */
        printf("Total processes: %d\n", comm_sz);
        printf("Number of keys to be sorted?  ");
        fflush(stdout);
        scanf("%ld", &n);

        /*  Broadcast size n to other processes             */
        MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);

        /*  Create n/comm_sz keys
            NOTE! some processes will have 1 extra key if
            n%comm_sz != 0                                  */
        create_Keys(&my_local, my_rank, comm_sz, n, &s, &f);
    }

    if(my_rank != 0){
        /*  Receive n from process 0                        */
        MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);

        /*  Create n/comm_sz keys                           */
        create_Keys(&my_local, my_rank, comm_sz, n, &s, &f);
    }

    /* The offending function; f is a long holding the number of elements of my_local */
    Odd_Even_Tsort(&my_local, my_rank, f, comm_sz);

    printf("Process %d completed the function\n", my_rank);
    MPI_Finalize();
    return 0;
}
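
create_Keys isn't shown here, but a minimal sketch consistent with the comments above (assuming s is the starting global index and f the local element count; the body is illustrative, not the exact code) would be:

void create_Keys(long** my_local, int my_rank, int comm_sz, long n,
                 long* s, long* f)
{
    long base  = n / comm_sz;      /* keys every process gets        */
    long extra = n % comm_sz;      /* leftovers go to the low ranks  */

    *f = base + (my_rank < extra ? 1 : 0);                     /* local count  */
    *s = my_rank * base + (my_rank < extra ? my_rank : extra); /* global start */

    *my_local = malloc(*f * sizeof(long));
    for (long i = 0; i < *f; i++)
        (*my_local)[i] = rand();   /* arbitrary unsorted keys */
}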

void Odd_Even_Tsort(long** my_local, int my_rank, long my_size, int comm_sz)
{
    long nochange = 1;
    long phase = 0;
    long complete = 1;
    MPI_Status Stat;
    long your_size = 1;

    long* recv_buf = malloc(sizeof(long)*(my_size+1));
    printf("rank %d has size %ld\n", my_rank, my_size);

    while (complete!=0){
        if((phase%2)==0){
            if( ((my_rank%2)==0) && my_rank < comm_sz-1){
                /*  Send right                              */
                MPI_Send(&my_size, 1, MPI_LONG, my_rank+1, 0, MPI_COMM_WORLD);
                MPI_Send(*my_local, my_size, MPI_LONG, my_rank+1, 0, MPI_COMM_WORLD);
                MPI_Recv(&your_size, 1, MPI_LONG, my_rank+1, 0, MPI_COMM_WORLD, &Stat);
                MPI_Recv(&recv_buf, your_size, MPI_LONG, my_rank+1, 0, MPI_COMM_WORLD, &Stat);
            }
            if( ((my_rank%2)==1) && my_rank < comm_sz){
                /*  Send left                               */
                MPI_Recv(&your_size, 1, MPI_LONG, my_rank-1, 0, MPI_COMM_WORLD, &Stat);
                MPI_Recv(&recv_buf, your_size, MPI_LONG, my_rank-1, 0, MPI_COMM_WORLD, &Stat);
                MPI_Send(&my_size, 1, MPI_LONG, my_rank-1, 0, MPI_COMM_WORLD);
                MPI_Send(*my_local, my_size, MPI_LONG, my_rank-1, 0, MPI_COMM_WORLD);
            }
        }
        phase++;
        complete = 0;
    }

    printf("Done!\n");
    fflush(stdout);
}

And the error I'm getting is:

[ubuntu:04968] *** Process received signal ***
[ubuntu:04968] Signal: Segmentation fault (11)
[ubuntu:04968] Signal code: Address not mapped (1)
[ubuntu:04968] Failing at address: 0xb
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 4968 on node ubuntu exited on signal 11 (Segmentation fault).

The reason I'm baffled is that the print statements after the function are still displayed, but if I comment out the function, no errors. So, where the heap am I getting a Segmentation fault?? I'm getting the error with mpiexec -n 2 ./a.out and an 'n' size bigger than 9.

If you actually want the entire runnable code, let me know. Really, I was hoping not so much for the precise answer as for how to use gdb/valgrind to debug this problem and others like it (and how to read their output).

(And yes, I realize the 'sort' function isn't sorting yet).

  • "the print statements after the function are still displayed, but if I comment out the function, no errors". This suggests that your function does something wrong, which becomes an error later on. Try placing barriers and prints at the significant stages of your code, and see how far execution gets. Commented May 2, 2012 at 10:42

1 Answer


The problem here is simple, yet difficult to see unless you use a debugger or print out exhaustive debugging information:

Look at the code where MPI_Recv is called: recv_buf should be supplied as the argument, not &recv_buf.

  MPI_Recv(recv_buf, your_size, MPI_LONG, my_rank-1, 0, MPI_COMM_WORLD, &Stat);

The rest seems ok.
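
To make the types concrete, here is the corrected receive together with why the original call corrupts memory (same variables as the question's code):

long* recv_buf = malloc(sizeof(long)*(my_size+1));

/* recv_buf  has type long*  : the address of the first array element.
   &recv_buf has type long** : the address of the pointer variable itself,
   so the original call made MPI write the received longs over the pointer
   (and whatever follows it on the stack) instead of into the array. */
MPI_Recv(recv_buf, your_size, MPI_LONG, my_rank-1, 0, MPI_COMM_WORLD, &Stat);

As for the tooling part of the question: valgrind can often localize memory corruption like this, and with Open MPI you can typically run every rank under it with mpiexec -n 2 valgrind ./a.out, or attach one debugger per rank with mpiexec -n 2 xterm -e gdb ./a.out (exact flags vary with your setup).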


3 Comments

Aha, thank you! Does that mean the root of the problem is that recv_buf is the address of the array itself, while &recv_buf is the address of the pointer recv_buf? My C experience is somewhat limited, so I tend to mix up which is which too often.
@NickO Yes, that's exactly the relationship.
@NickO When you are debugging, you can add a printf after each MPI call. It's simple, but it helps.
