What's the best way to provide PREDEFINED_CONSTS?
If it were me, I would simply put what you have in your first example in your CUDA kernel and go with that. That is very likely the best way to do it. Later on, if you feel you have a performance problem with your code, you can use a profiler to steer you toward what needs to be addressed; I doubt it would be these constant loads. For constants, there really are only two possibilities:
- Load them from some kind of memory
- Load them as part of the instruction stream.
You've already indicated you are aware of this; you can simply benchmark both if you're really worried. A meaningful benchmark would require more than what you have shown here, might be inconclusive, and may also depend on other factors, such as how many times and in what way you are loading these constants.
As you have already indicated, __constant__ doesn't seem to be a sensible choice, because the load pattern is clearly non-uniform across the warp.
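For completeness, a __constant__ version would look something like the sketch below; the kernel name, host-side values, and launch configuration are my own placeholders, not anything from your code. The constant cache is built to broadcast when all threads in a warp read the same address; with a per-thread index like yours, the reads get serialized, which is why I wouldn't expect this to help:

#include <cstdio>
__constant__ float PREDEFINED_CONSTS[16];  // resides in the constant memory bank
__global__ void k3(int l){
    float sum = 0.0;
    for (int i = 0; i < l; i++)
        sum += PREDEFINED_CONSTS[(threadIdx.x+i) & 15];  // non-uniform index across the warp -> serialized reads
    if (sum == 0.0) printf("%f\n", sum);
}
int main(){
    float vals[16] = {1.1f, 2.2f, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
    cudaMemcpyToSymbol(PREDEFINED_CONSTS, vals, sizeof(vals));  // __constant__ data is set from the host
    k3<<<1,16>>>(1048576);
    cudaDeviceSynchronize();
}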
If I define them as above, will PREDEFINED_CONSTS be stored in global memory?
Yes, the constants in your first method will be stored in global memory. This can be confirmed with careful study, and with compilation using -Xptxas -v. Your second method has the potential (at least) to load the constants via the instruction stream. Since the second method is quite ugly from a coding perspective, and also very inflexible compared to the first (what if I need different constants per thread, in different places in my code?), it's not what I would choose.
This strikes me as premature optimization. The first method is clearly preferred from a code flexibility and conciseness standpoint, and there is no real reason to think that simply because you are loading from memory that it is a problem. The second method is ugly, inflexible, and may not be any better from a performance perspective. Even if the data is part of the instruction stream, it still has to be loaded from memory.
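If you want to see for yourself where the constants end up, beyond what -Xptxas -v reports, one option (not shown in the question, just a suggestion) is to disassemble the machine code of the built binary (t97 below):

$ cuobjdump -sass t97

Constants that travel with the instruction stream appear as immediate operands embedded in the arithmetic instructions, whereas memory-resident constants appear as explicit loads (e.g. LDG from global memory, LDC from the constant banks).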
Here's an example test case suggesting to me that the first method is preferred. If you come up with a different kind of test case, you may end up with a different observation:
$ cat t97.cu
#include <cstdio>
const float VAL0 = 1.1;
const float VAL1 = 2.2;
const float VAL2 = 3;
const float VAL3 = 4;
const float VAL4 = 5;
const float VAL5 = 6;
const float VAL6 = 7;
const float VAL7 = 8;
const float VAL8 = 9;
const float VAL9 = 10;
const float VAL10 = 11;
const float VAL11 = 12;
const float VAL12 = 13;
const float VAL13 = 14;
const float VAL14 = 15;
const float VAL15 = 16;
__global__ void k1(int l){
    // first method: constants in a static array, which ends up in global memory
    static const float PREDEFINED_CONSTS[16] = {VAL0,  VAL1,  VAL2,  VAL3,
                                                VAL4,  VAL5,  VAL6,  VAL7,
                                                VAL8,  VAL9,  VAL10, VAL11,
                                                VAL12, VAL13, VAL14, VAL15};
    float sum = 0.0;
    for (int i = 0; i < l; i++)
        sum += PREDEFINED_CONSTS[(threadIdx.x+i) & 15]; // per-thread rotating index
    if (sum == 0.0) printf("%f\n", sum); // keeps the compiler from optimizing the loop away
}
__device__ float get_const(int i){
    // second method: select the constant through control flow, so the values
    // can become immediates in the instruction stream
    float c = VAL15;
    unsigned t = (threadIdx.x+i) & 15;
    if      (t == 0)  c = VAL0;
    else if (t == 1)  c = VAL1;
    else if (t == 2)  c = VAL2;
    else if (t == 3)  c = VAL3;
    else if (t == 4)  c = VAL4;
    else if (t == 5)  c = VAL5;
    else if (t == 6)  c = VAL6;
    else if (t == 7)  c = VAL7;
    else if (t == 8)  c = VAL8;
    else if (t == 9)  c = VAL9;
    else if (t == 10) c = VAL10;
    else if (t == 11) c = VAL11;
    else if (t == 12) c = VAL12;
    else if (t == 13) c = VAL13;
    else if (t == 14) c = VAL14;
    return c;
}
__global__ void k2(int l){
    float sum = 0.0;
    for (int i = 0; i < l; i++)
        sum += get_const(i);
    if (sum == 0.0) printf("%f\n", sum);
}
int main(){
    int l = 1048576;
    k1<<<1,16>>>(l);
    k2<<<1,16>>>(l);
    cudaDeviceSynchronize();
}
$ nvcc -o t97 t97.cu -Xptxas -v
ptxas info    : 68 bytes gmem
ptxas info    : Compiling entry function '_Z2k2i' for 'sm_52'
ptxas info    : Function properties for _Z2k2i
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 324 bytes cmem[0], 8 bytes cmem[2]
ptxas info    : Compiling entry function '_Z2k1i' for 'sm_52'
ptxas info    : Function properties for _Z2k1i
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 324 bytes cmem[0]
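As an aside, the 68 bytes gmem reported above is consistent with the 64-byte PREDEFINED_CONSTS array (16 floats) plus, presumably, the 4-byte printf format string, confirming that the first method's constants reside in global memory.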
$ nvprof ./t97
==22848== NVPROF is profiling process 22848, command: ./t97
==22848== Profiling application: ./t97
==22848== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   91.76%  239.39ms         1  239.39ms  239.39ms  239.39ms  k2(int)
                    8.24%  21.508ms         1  21.508ms  21.508ms  21.508ms  k1(int)
      API calls:   62.34%  260.89ms         1  260.89ms  260.89ms  260.89ms  cudaDeviceSynchronize
                   37.48%  156.85ms         2  78.427ms  10.319us  156.84ms  cudaLaunchKernel
                    0.13%  542.39us       202  2.6850us     192ns  117.71us  cuDeviceGetAttribute
                    0.04%  156.19us         2  78.094us  58.411us  97.777us  cuDeviceTotalMem
                    0.01%  59.150us         2  29.575us  26.891us  32.259us  cuDeviceGetName
                    0.00%  10.845us         2  5.4220us  1.7280us  9.1170us  cuDeviceGetPCIBusId
                    0.00%  1.6860us         4     421ns     216ns     957ns  cuDeviceGet
                    0.00%  1.5850us         3     528ns     283ns     904ns  cuDeviceGetCount
                    0.00%     667ns         2     333ns     296ns     371ns  cuDeviceGetUuid
$
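In this particular run, k1 (the global-memory array) completes in about 21.5ms, more than 10x faster than k2 (the if/else chain) at roughly 239ms, which is why this test steers me toward the first method.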