Spawn a single thread using OpenMP

Question

I am implementing a parallel quicksort:

Partition the input range,
Conquer the left subrange in this thread.
Conquer the right subrange in a new (single) thread while (2) is running.

In OpenMP, how can I do (3)?

My code broken code follows:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <omp.h>

#define N 50000000

using std::swap;
using std::cout;

typedef unsigned long long data_t;

static std::chrono::milliseconds get_millis() {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::system_clock::now().time_since_epoch());
}

static int partition(data_t* data, int lo, int hi) {
    data_t pivot = data[lo];
    int i = lo - 1;
    int j = hi + 1;

    while (true) {
        do i++; while (data[i] < pivot);
        do j--; while (data[j] > pivot);

        if (i >= j) return j;

        data_t temp = data[i];
        data[i] = data[j];
        data[j] = temp;
    }
}

static void sortImpl(data_t* data,
                     int lo,
                     int hi,
                     int threads) {
    if (threads <= 1 || hi - lo < 1000) {
        std::sort(data + lo, data + hi + 1);
        return;
    }

    int pivot = partition(data, lo, hi);
    int left_threads = threads / 2;
    int right_threads = threads - left_threads;

    #pragma omp parallel shared(data) num_threads(1)
    {
        sortImpl(data,
                 lo,
                 pivot,
                 left_threads);
    }

    sortImpl(data,
             pivot + 1,
             hi,
             right_threads);
}

void psort(int n, data_t *data) {
    sortImpl(data,
             0,
             n - 1,
             omp_get_max_threads());
}

static void ssort(data_t* data, int lo, int hi) {
    if (lo < hi) {
        int pivot = partition(data, lo, hi);
        ssort(data, lo, pivot);
        ssort(data, pivot + 1, hi);
    }
}

void ssort(int n, data_t* data) {
    ssort(data, 0, n - 1);
}

int main() {
    std::cout << "Starting benchmark.\n";

    data_t* arr = new data_t[N];
    data_t* arr2 = new data_t[N];

    for (int i = 0; i < N; i++) {
        arr[i] = rand();
        arr2[i] = arr[i];
    }

    auto st = get_millis();
    psort(N, arr);
    auto et = get_millis();

    std::cout << "My parallel quicksort duration: " << et - st << "\n";

    st = get_millis();
    std::sort(arr2, arr2 + N);
    et = get_millis();

    std::cout << "std::sort duration: " << et - st << "\n";

    for (int i = 0; i < N; i++) {
        if (arr[i] != arr2[i]) {
            std::cout << arr[i] << " != " << arr2[i] << " at " << i << "\n";
            break;
        }
    }

    std::cout << "Equals: " << std::boolalpha
              << std::equal(arr2, arr2 + N, arr) << "\n";

    return 0;
}

The output is:

Starting benchmark.
My parallel quicksort duration: 11482ms
std::sort duration: 9651ms
Equals: true

Basically, the logic of the parallel quicksort is the same as in the sequential std::sort, yet it is slower by around 19 percent.

The standard way to do that is to spawn tasks, not threads. AFAIK there are example of this in the doc but you can do #pragma omp parallel then #pragma omp master then run your parallel quick-sort, and finally do #pragma omp task (be careful about variable sharing though). — Jérôme Richard
– Jérôme Richard, Commented Aug 30, 2024 at 12:23
Be aware that Quick-sort do not scale well so results will likely be disappointing. In the worst case, quick-sort will not even be parallel and may create a chain of tasks. A similar issue happens in sequential, and this is why people generally do not implement just a quick sort bu an introsort. Besides, the partitioning part tends to be memory-bound few only few threads (and memory-bound code does not scale well). — Jérôme Richard
– Jérôme Richard, Commented Aug 30, 2024 at 12:25
Also please note that C++ brings parallel STL so you can easily run std::sort in parallel. It may or may not scale well regarding the target implementation — Jérôme Richard
– Jérôme Richard, Commented Aug 30, 2024 at 12:27
In his answer, @VictorEijkhout adviced using tasks, but in your code you are not using tasks at all. Instead you are recursively creating new parallel regions, which is different. In particular, OpenMP will wait for the completion of your first call to sortImpl() before exiting the parallel region and execute the other call. So nothing is executed concurently. — PierU
– PierU, Commented Aug 31, 2024 at 10:43
I am telling you that in the code you have posted nothing is executed concurrently. In contrast there is effective concurrency in the approach posted by Victor. But he also warned you that it wouldn't be efficient anyway, which is a also a lesson to learn: OpenMP, or any other tool for parallelism, is not a magic wand that makes any code faster. Some algorithms are difficult, and sometimes impossible, to be parallelized efficiently. — PierU
– PierU, Commented Aug 31, 2024 at 11:43

Victor Eijkhout · Accepted Answer · 2024-08-30 12:35:16Z

1

The simple answer is to use tasks:

#pragma omp parallel
#pragma omp single
{
// split the array
#pragma omp task
// recursive call first half
#pragma omp task
// recursive call second half
}

This will work fine but will not be efficient.

it will keep generating tasks, even for very small arrays. This can be fixed.
Splitting the array is now not parallel. You can do this with a prefix operation, but this is much harder.

answered Aug 30, 2024 at 12:35

Victor Eijkhout

6,0002 gold badges29 silver badges31 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

coderodde Over a year ago

Not at my PC now. I will try your suggestion asap.

coderodde Over a year ago

My parallel quicksort resorts to std::sort on small (1000) ranges. Also, it will not exceed the number of threads over the number of logical processors.

Jérôme Richard Over a year ago

Note that the second task is optional though it may significantly reduce the parallelism on some specific runtime (it can delay the execution of the first task). Reducing the number of task by 2 can also improve performance when the number of task is rather big.

coderodde Over a year ago

Tried. It does not work for me.

PierU Over a year ago

@coderodde Post your code by editing your question, and make clearer what does "not work".

|

Collectives™ on Stack Overflow

Spawn a single thread using OpenMP

1 Answer 1

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related