What's the best way to provide PREDEFINED_CONSTS?
If it were me, I would simply put what you have in your first example in your CUDA kernel and go with that. That is very likely the best way to do it. Later on, if you feel you have a performance problem with your code, you can use a profiler to steer you toward what needs to be addressed; I doubt it would be these constant loads. For constants, there really are only two possibilities:
- Load them from some kind of memory
- Load them as part of the instruction stream.
You've already indicated you are aware of this; you can simply benchmark both if you're really worried. A meaningful benchmark would require more than what you have shown here, might be inconclusive, and may also depend on other factors, such as how many times and in what way you are loading these constants.
As you have already indicated, __constant__ doesn't seem to be a sensible choice, because the load pattern is clearly non-uniform across the warp.
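For completeness, a __constant__ version would look something like the sketch below; the kernel name, host-side values, and launch configuration are my own placeholders, not anything from your code. The constant cache is built to broadcast when all threads in a warp read the same address; with a per-thread index like yours, the reads get serialized, which is why I wouldn't expect this to help:

#include <cstdio>
__constant__ float PREDEFINED_CONSTS[16];  // resides in the constant memory bank
__global__ void k3(int l){
    float sum = 0.0;
    for (int i = 0; i < l; i++)
        sum += PREDEFINED_CONSTS[(threadIdx.x+i) & 15];  // non-uniform index across the warp -> serialized reads
    if (sum == 0.0) printf("%f\n", sum);
}
int main(){
    float vals[16] = {1.1f, 2.2f, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
    cudaMemcpyToSymbol(PREDEFINED_CONSTS, vals, sizeof(vals));  // __constant__ data is set from the host
    k3<<<1,16>>>(1048576);
    cudaDeviceSynchronize();
}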
If I define them as above, will PREDEFINED_CONSTS be stored in global memory?
Yes, the constants in your first method will be stored in global memory. This can be confirmed with careful study, and with compilation using -Xptxas -v. Your second method has the potential (at least) to load the constants via the instruction stream. Since the second method is quite ugly from a coding perspective, and also very inflexible compared to the first (what if I need different constants per thread, in different places in my code?), it's not what I would choose.
This strikes me as premature optimization. The first method is clearly preferred from a code flexibility and conciseness standpoint, and there is no real reason to think that simply because you are loading from memory that it is a problem. The second method is ugly, inflexible, and may not be any better from a performance perspective. Even if the data is part of the instruction stream, it still has to be loaded from memory.
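If you want to see for yourself where the constants end up, beyond what -Xptxas -v reports, one option (not shown in the question, just a suggestion) is to disassemble the machine code of the built binary (t97 below):

$ cuobjdump -sass t97

Constants that travel with the instruction stream appear as immediate operands embedded in the arithmetic instructions, whereas memory-resident constants appear as explicit loads (e.g. LDG from global memory, LDC from the constant banks).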
Here's an example test case suggesting to me that the first method is preferred. If you come up with a different kind of test case, you may end up with a different observation:
$ cat t97.cu
#include <cstdio>
const float VAL0 = 1.1;
const float VAL1 = 2.2;
const float VAL2 = 3;
const float VAL3 = 4;
const float VAL4 = 5;
const float VAL5 = 6;
const float VAL6 = 7;
const float VAL7 = 8;
const float VAL8 = 9;
const float VAL9 = 10;
const float VAL10 = 11;
const float VAL11 = 12;
const float VAL12 = 13;
const float VAL13 = 14;
const float VAL14 = 15;
const float VAL15 = 16;
__global__ void k1(int l){
    // first method: constants in a static array, which ends up in global memory
    static const float PREDEFINED_CONSTS[16] = {VAL0,  VAL1,  VAL2,  VAL3,
                                                VAL4,  VAL5,  VAL6,  VAL7,
                                                VAL8,  VAL9,  VAL10, VAL11,
                                                VAL12, VAL13, VAL14, VAL15};
    float sum = 0.0;
    for (int i = 0; i < l; i++)
        sum += PREDEFINED_CONSTS[(threadIdx.x+i) & 15]; // per-thread rotating index
    if (sum == 0.0) printf("%f\n", sum); // keeps the compiler from optimizing the loop away
}
__device__ float get_const(int i){
    // second method: select the constant through control flow, so the values
    // can become immediates in the instruction stream
    float c = VAL15;
    unsigned t = (threadIdx.x+i) & 15;
    if      (t == 0)  c = VAL0;
    else if (t == 1)  c = VAL1;
    else if (t == 2)  c = VAL2;
    else if (t == 3)  c = VAL3;
    else if (t == 4)  c = VAL4;
    else if (t == 5)  c = VAL5;
    else if (t == 6)  c = VAL6;
    else if (t == 7)  c = VAL7;
    else if (t == 8)  c = VAL8;
    else if (t == 9)  c = VAL9;
    else if (t == 10) c = VAL10;
    else if (t == 11) c = VAL11;
    else if (t == 12) c = VAL12;
    else if (t == 13) c = VAL13;
    else if (t == 14) c = VAL14;
    return c;
}
__global__ void k2(int l){
    float sum = 0.0;
    for (int i = 0; i < l; i++)
        sum += get_const(i);
    if (sum == 0.0) printf("%f\n", sum);
}
int main(){
    int l = 1048576;
    k1<<<1,16>>>(l);
    k2<<<1,16>>>(l);
    cudaDeviceSynchronize();
}
$ nvcc -o t97 t97.cu -Xptxas -v
ptxas info    : 68 bytes gmem
ptxas info    : Compiling entry function '_Z2k2i' for 'sm_52'
ptxas info    : Function properties for _Z2k2i
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 324 bytes cmem[0], 8 bytes cmem[2]
ptxas info    : Compiling entry function '_Z2k1i' for 'sm_52'
ptxas info    : Function properties for _Z2k1i
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 324 bytes cmem[0]
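As an aside, the 68 bytes gmem reported above is consistent with the 64-byte PREDEFINED_CONSTS array (16 floats) plus, presumably, the 4-byte printf format string, confirming that the first method's constants reside in global memory.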
$ nvprof ./t97
==22848== NVPROF is profiling process 22848, command: ./t97
==22848== Profiling application: ./t97
==22848== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   91.76%  239.39ms         1  239.39ms  239.39ms  239.39ms  k2(int)
                    8.24%  21.508ms         1  21.508ms  21.508ms  21.508ms  k1(int)
      API calls:   62.34%  260.89ms         1  260.89ms  260.89ms  260.89ms  cudaDeviceSynchronize
                   37.48%  156.85ms         2  78.427ms  10.319us  156.84ms  cudaLaunchKernel
                    0.13%  542.39us       202  2.6850us     192ns  117.71us  cuDeviceGetAttribute
                    0.04%  156.19us         2  78.094us  58.411us  97.777us  cuDeviceTotalMem
                    0.01%  59.150us         2  29.575us  26.891us  32.259us  cuDeviceGetName
                    0.00%  10.845us         2  5.4220us  1.7280us  9.1170us  cuDeviceGetPCIBusId
                    0.00%  1.6860us         4     421ns     216ns     957ns  cuDeviceGet
                    0.00%  1.5850us         3     528ns     283ns     904ns  cuDeviceGetCount
                    0.00%     667ns         2     333ns     296ns     371ns  cuDeviceGetUuid
$
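In this particular run, k1 (the global-memory array) completes in about 21.5ms, more than 10x faster than k2 (the if/else chain) at roughly 239ms, which is why this test steers me toward the first method.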