
I want to store partial reduction results in an array.

Say I have data[8] = {10,20,30,40,50,60,70,80}.
If I divide the data into chunks of size chunk_size = 2, the chunks are {10,20}, {30,40}, ..., {70,80}.

If the target operation is summation, the full reduction gives 360, but what I want is an array partial_sums = {30, 70, 110, 150} holding the sum of each chunk.

So far, what I have in mind is to construct a strided iterator strided_iterator that accesses the 0th, 2nd, ... elements of data[8] = {10,20,30,40,50,60,70,80}, so that something like

thrust::reduce(strided_iterator, strided_iterator + 2,
               partial_sums.begin(),
               thrust::plus<int>());

would give the desired result, but I have no idea how this could be done efficiently.

For strided access, thrust/examples/strided_range.cu offers a solution, but it does not seem applicable to storing the results of a segmented reduction.

Of course I can brute-force it with a loop like this:

for (int i = 0; i<4; i++) {
  partial_sums[i] = thrust::reduce(data+2*i, data+2*i+2, 0, thrust::plus<int>());
}

But this kind of per-chunk looping is exactly what CUDA Thrust tries to avoid, right? Somehow I should be able to do it all in a single Thrust call.

  • The solution is thrust::reduce_by_key. Commented May 14, 2023 at 17:21
  • Although in this special case of a very small chunk size it might be more performant to zip together multiple strided ranges with different offsets and perform a transform for the summation. Commented May 14, 2023 at 18:09
  • Is chunk_size variable or always 2 ? Commented May 14, 2023 at 18:51
  • @paleonix, yes indeed this question was the exact duplicate of stackoverflow.com/questions/42260493/…. I was not able to find it since I was searching via 'chunk', 'partial sum', and so on. I didn't know it was called 'segmented reduction'. Thanks for the keyword! Commented May 15, 2023 at 0:29
  • Then you could just feed reduce_by_key with a transform iterator that gives zero for the deactivated ones. For performance, try to do kernel fusion where possible. E.g. the keys can be generated on the fly using a transform iterator on a counting iterator. Commented May 15, 2023 at 11:37

1 Answer


Based on the helpful answer to Reduce multiple blocks of equal length that are arranged in a big vector Using CUDA, here is what I have come up with so far.

In fact I wanted the min (or max) value of each chunk, so the code below uses thrust::minimum.

// headers needed for a standalone build (compile with nvcc)
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <limits>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

using namespace thrust::placeholders;

int main(int argc, char **argv) {

  int N = atoi(argv[1]);
  int K = atoi(argv[2]);

  std::cout << "N " << N << " K " << K << std::endl;

  typedef int mytype;

  thrust::device_vector<mytype> data(N*K);
  thrust::device_vector<mytype> sums(N);

  thrust::sequence(data.begin(),data.end());

  // method 1: one reduce_by_key call; the chunk keys i/K are generated
  // on the fly from a counting iterator, so no key array is materialized
  thrust::reduce_by_key(
      thrust::device,
      thrust::make_transform_iterator(thrust::counting_iterator<int>(0),   _1 / K),
      thrust::make_transform_iterator(thrust::counting_iterator<int>(N*K), _1 / K),
      data.begin(),
      thrust::discard_iterator<int>(),   // the keys themselves are not needed
      sums.begin(),
      thrust::equal_to<int>(),
      thrust::minimum<mytype>());        // per-chunk min; use thrust::plus for sums

  // method 2 (bad): one separate device reduction per chunk
  for (int i = 0; i < N; i++) {
    mytype res = thrust::reduce(data.begin() + K*i, data.begin() + K*i + K,
                                std::numeric_limits<mytype>::max(),
                                thrust::minimum<mytype>());
    (void)res;  // result discarded; this loop exists only for timing
  }

  // just print the first 10 results
  thrust::copy_n(sums.begin(),10,std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl;

  return 0;
}

The results are shown below, with runtimes measured via sdkTimer. As can be seen, method 1 with reduce_by_key is much faster than method 2 with the for loop.

[sangjun@newmaster01 05_thrust]$ ./exe 1000 100
N 1000 K 100
 - Elapsed time: 0.00008 sec 
 - Elapsed time: 0.02266 sec 
0,100,200,300,400,500,600,700,800,900,
[sangjun@newmaster01 05_thrust]$ ./exe 1000 256
N 1000 K 256
 - Elapsed time: 0.00008 sec 
 - Elapsed time: 0.02084 sec 
0,256,512,768,1024,1280,1536,1792,2048,2304,
[sangjun@newmaster01 05_thrust]$ ./exe 100000 100
N 100000 K 100
 - Elapsed time: 0.00016 sec 
 - Elapsed time: 1.98978 sec 
0,100,200,300,400,500,600,700,800,900,
[sangjun@newmaster01 05_thrust]$ ./exe 100000 256
N 100000 K 256
 - Elapsed time: 0.00027 sec 
 - Elapsed time: 1.92896 sec 
0,256,512,768,1024,1280,1536,1792,2048,2304,
