I want to store partial reduction results in an array.
Say I have data[8] = {10,20,30,40,50,60,70,80}.
If I divide the data into chunks with a chunk_size of 2, the chunks will be {10,20}, {30,40}, ... , {70,80}.
With summation as the operation, the total reduction is 360, but I want an array partial_sums = {30,70,110,150} that stores the partial sum of each chunk.
So far, my idea is to construct an iterator, strided_iterator, that accesses indices 0, 2, ... of data[8] = {10,20,30,40,50,60,70,80}, and then call something like
thrust::reduce(strided_iterator, strided_iterator + 2,
               partial_sums.begin(),
               thrust::plus<int>());
to get the desired result, but I have no idea how this could be done efficiently.
For strided access, thrust/examples/strided_range.cu has a solution, but it does not seem applicable to storing segmented reductions.
Of course, I can brute-force it with a loop like this:
for (int i = 0; i < 4; i++) {
    partial_sums[i] = thrust::reduce(data + 2*i, data + 2*i + 2, 0, thrust::plus<int>());
}
But this kind of practice is exactly what Thrust tries to avoid as much as possible, right? Somehow I should be able to put it all in a single Thrust call.
Use thrust::reduce_by_key. It can be combined with a transform iterator on the values that gives zero for any deactivated elements. For performance, try to fuse kernels where possible. For example, the keys can be generated on the fly using a transform iterator on a counting iterator.
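As a rough illustration of the reduce_by_key suggestion, here is a minimal sketch that assumes equal-sized consecutive chunks and int data. The names ChunkId and chunk_sums are hypothetical helpers made up for this example, not part of Thrust. The key for element i is computed as i / chunk_size by a transform iterator over a counting iterator, so no key array is ever stored, and the output keys are thrown away with a discard iterator.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: maps element index i to its chunk id, i / chunk_size.
struct ChunkId {
    int chunk_size;
    __host__ __device__ int operator()(int i) const { return i / chunk_size; }
};

// Hypothetical helper: returns one sum per consecutive chunk of chunk_size
// elements. Assumes chunk_size > 0; a shorter final chunk is summed as-is.
std::vector<int> chunk_sums(const std::vector<int>& host_data, int chunk_size) {
    thrust::device_vector<int> data(host_data.begin(), host_data.end());
    const int n = static_cast<int>(data.size());
    const int num_chunks = (n + chunk_size - 1) / chunk_size;
    thrust::device_vector<int> partial_sums(num_chunks);

    // Keys 0,0,1,1,2,2,... for chunk_size == 2, generated on the fly.
    auto keys = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), ChunkId{chunk_size});

    // A single Thrust call: consecutive elements with equal keys are summed.
    thrust::reduce_by_key(keys, keys + n,
                          data.begin(),
                          thrust::make_discard_iterator(),  // output keys not needed
                          partial_sums.begin(),
                          thrust::equal_to<int>(),
                          thrust::plus<int>());

    // Copy the per-chunk sums back to the host.
    std::vector<int> result(num_chunks);
    thrust::copy(partial_sums.begin(), partial_sums.end(), result.begin());
    return result;
}

int main() {
    const std::vector<int> data = {10, 20, 30, 40, 50, 60, 70, 80};
    for (int s : chunk_sums(data, 2)) std::printf("%d ", s);  // 30 70 110 150
    std::printf("\n");
    return 0;
}
```

Compared with the loop over thrust::reduce, this issues one reduction over the whole array instead of one per chunk, so the number of Thrust calls no longer grows with the number of chunks.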