
I recently needed to implement a Sobol sequence generator in CUDA.

Now, to pre-store the Sobol table, I can either use a native C++ constexpr array like below:

constexpr unsigned int sobol_table[NUM_DIMENSIONS][32] = {
            {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
            {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295},
            // skipped for higher dim...
};

Or use a CUDA __constant__ array, like below:

__constant__ unsigned int sobol_table[NUM_DIMENSIONS][32] = {
    {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
    {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295},
    // skipped for higher dim...
};

I wonder what the tradeoffs are. Is there anything I need to worry about beyond increased binary size with constexpr, GPU constant memory usage, etc.?

N.B. I am aware that cuRAND has native support for it. The reason I need to reinvent the wheel is to conform to the previous behavior of the tool, which may have used a possibly mutated Sobol sequence over its life cycle...

  • I'm pretty sure you can't use __constant__ like that; initialization as your snippet shows isn't permitted. Commented Jun 4 at 12:16
  • @talonmies, but it compiles with MSVC and CUDA 12.6; did I miss or misunderstand something? Commented Jun 4 at 12:29
  • constexpr means that it is known at compile time, while __constant__ means that it is constant/read-only during the execution of the kernel but not known at compile time. I.e., it is more flexible but needs to be actually loaded from memory at runtime (there are special caches for constant memory). Constant memory is optimized for uniform access/broadcast, i.e. when all lanes of a warp access the same address. Commented Jun 4 at 13:02
  • @talonmies While it is hard to find documentation on that, CUDA sample 2_Concepts_and_Techniques/dct8x8 contains initialization like that, so it should generally be fine. Commented Jun 4 at 14:11
  • A problem with using __constant__ arrays is that having different threads in the same warp access different elements of your __constant__ array slows things down a lot due to serialization, unless all threads in a warp access the same element. Better to load it into a __shared__ array on startup and then look up your data from there (see the sketch below). Commented Jun 4 at 14:14
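A minimal sketch of the staging pattern that last comment describes, assuming the __constant__ table from the question. The kernel name, the n parameter, and the dimension-interleaved output layout are illustrative, and the table must be small enough to fit in shared memory:

__global__ void sobol_kernel_staged(float *out, unsigned int n)
{
    // Pad each row by one word so threads reading different dimensions at the
    // same bit position hit different shared memory banks.
    __shared__ unsigned int s_table[NUM_DIMENSIONS][32 + 1];

    // Cooperative copy: the threads of the block stride over the flattened table.
    for (int k = threadIdx.x; k < NUM_DIMENSIONS * 32; k += blockDim.x)
        s_table[k / 32][k % 32] = sobol_table[k / 32][k % 32];
    __syncthreads();

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int dim  = i % NUM_DIMENSIONS;   // non-uniform across the warp
    unsigned int idx  = i / NUM_DIMENSIONS;   // sample index within the dimension
    unsigned int gray = idx ^ (idx >> 1);     // Gray code of the sample index
    unsigned int x = 0;
    for (int bit = 0; bit < 32; ++bit)
        if (gray & (1u << bit))
            x ^= s_table[dim][bit];           // shared memory: no broadcast restriction
    out[i] = x * (1.0f / 4294967296.0f);      // map the 32-bit integer to [0, 1)
}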

1 Answer

I think the key thing not to forget in your question is that in CUDA, __constant__ is not just a language qualifier like const or constexpr in C++; it is a memory space specifier. It means that the data for that variable is stored in a specific reserved bank of DRAM on the GPU, and the device code compiler will emit uniform memory access PTX instructions whenever that variable is read in code, which access the memory through a dedicated read-only cache. This is very efficient when every thread in a warp needs the same data values in lock-step (for example, flexible finite difference stencils, or quadrature rules where loops within every thread consume coefficients or weights in a uniform way). But it still entails DRAM access on a cache miss, which is inherently slower than an inline constant in code. And if the access pattern within a warp is not uniform, the reads are serialized, one address at a time, so the cost depends on how your code actually indexes the table. This requires use-dependent analysis.
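To make the uniform-access case concrete, here is a hedged sketch (the kernel name and the n and dim parameters are illustrative) in which the load address depends only on values that are uniform across the warp, so each read of the __constant__ table is a single broadcast:

__global__ void sobol_kernel(float *out, unsigned int n, int dim)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int gray = i ^ (i >> 1);            // Gray code of the sample index
    unsigned int x = 0;
    for (int bit = 0; bit < 32; ++bit) {
        unsigned int v = sobol_table[dim][bit];  // same address for every thread
        if (gray & (1u << bit))                  // the per-thread bit masks the
            x ^= v;                              // XOR, not the load
    }
    out[i] = x * (1.0f / 4294967296.0f);         // map the integer to [0, 1)
}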

On the other hand, in the constexpr case things are somewhat simpler: you are relying on all the C++ language-lawyer goodness to signal to the compiler that only compile-time constant expressions are involved, and letting it go ahead and (potentially) give you some optimizations in the emitted code.
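One concrete consequence (a small sketch, assuming the two rows shown in the question): a constexpr table participates in constant expressions, so entries read with compile-time indices cost nothing at runtime and can even be verified during compilation. Note that indexing a host constexpr array with runtime values from device code is toolchain-dependent (nvcc's --expt-relaxed-constexpr flag is related); marking the table __device__ constexpr or falling back to __constant__ are common workarounds.

// Checked at compile time; no memory traffic is implied by these reads.
static_assert(sobol_table[0][5] == 32u,
              "dimension 0 holds successive powers of two");
static_assert(sobol_table[1][5] == 63u,
              "dimension 1 holds 2^(k+1) - 1");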

For the specific use case of Sobol quasi-random sequence generators, I can't comment on which approach is "better", because I have never bothered to study their internals and just use known "good" libraries (paraphrasing von Neumann, random number generation is too important to be left to chance).

In the end, the only way to be sure would be to look at the PTX emitted by the compiler for either example (nvcc -ptx, or cuobjdump -ptx on an existing binary) and see what suits your tastes better, then benchmark, and then make a choice based on whatever purist or pragmatist criteria you wish to apply.


The final triviality is that, traditionally, this:

__constant__ unsigned int sobol_table[NUM_DIMENSIONS][32] = {
    {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
    {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295},
    // skipped for higher dim...
};

was actually illegal in CUDA, because you are initializing device memory (in this case __constant__ memory) from host code. The usual way to do this would be:

__constant__ unsigned int sobol_table[NUM_DIMENSIONS][32];
const unsigned int sobol_table_data[NUM_DIMENSIONS][32] = {
    {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
    {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295},
    // skipped for higher dim...
};

cudaMemcpyToSymbol(sobol_table, &sobol_table_data[0][0], sizeof(sobol_table_data));
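As a usage note (ordinary runtime API practice, not specific to this question; the function name is illustrative), the copy returns a cudaError_t that is worth checking once at startup:

#include <cstdio>
#include <cuda_runtime.h>

// Upload the table once at startup and check the result.
void init_sobol_table()
{
    cudaError_t err = cudaMemcpyToSymbol(sobol_table, &sobol_table_data[0][0],
                                         sizeof(sobol_table_data));
    if (err != cudaSuccess)
        fprintf(stderr, "sobol_table upload failed: %s\n",
                cudaGetErrorString(err));
}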

It seems that at some point (perhaps as part of the CUDA 10 release cycle, although I couldn't find any mention of it in either documentation or changelogs) the invisible host boilerplate generation performed by the toolchain was improved, so that there is now an internal function __nv_cudaEntityRegisterCallback which calls the undocumented __cudaRegisterVar runtime API function for each device variable, to automagically do the setup and initialization of the variable from the internal fat binary data when the runtime API context is lazily established. For about the first ten years of CUDA, code like the above would probably have emitted a compiler error, or a warning, and just not worked if it did compile.
