I would like to write a CUDA kernel that uses two different dynamically allocated shared memory arrays, in the following form:

__global__ void myKernel()
{
    extern __shared__ int   localSum1[];
    extern __shared__ float localSum2[];
    ...
}

I read in the answer to this question that we can only declare dynamic shared memory once.

  1. Is there still a way to use dynamic shared memory for different dtypes?

  2. In this example, float and int have the same size. How do I do this for arbitrary combinations of differently sized types, e.g. (int8, int32, int64) and (float16, float32, float64)?

  • Use a structure? Commented Oct 3, 2024 at 18:53
  • You could lift the code I wrote here: stackoverflow.com/questions/70765553/… A struct may cause additional bank conflicts, depending on your use case Commented Oct 3, 2024 at 19:16
  • Also the code in that question itself shows how to do it normally, if you know what the alignment order needs to be. Commented Oct 3, 2024 at 19:27
  • docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared Commented Oct 3, 2024 at 20:52
  • @RobertCrovella Thank you for that link; it discusses how to use the same memory for the same byte size. Can it be used for different sizes e.g. (float16, int32)? Commented Oct 4, 2024 at 20:05

2 Answers

Can it be used for different sizes e.g. (float16, int32)?

In case it needs more spelling out, here is the basic approach:

  1. Order the types in descending order of alignment. This can be surprising with structs or vector types (e.g. float3 has the same alignment as float, but float2 has a higher alignment). If unsure, pepper the code with static_assert(alignof(type1) >= alignof(type2)); it is free of runtime cost and prevents other developers from reordering the arrays without thinking.

  2. Compute the size requirement without any extra padding: sizeof(type1) * N + sizeof(type2) * M.

  3. Use pointer arithmetic to determine where one sub-array ends and the next begins:

__global__ void kernel(
      unsigned number_of_floats,
      unsigned number_of_shorts,
      unsigned number_of_bytes)
{
    // A single allocation, declared with the most strictly aligned type
    extern __shared__ float shared_floats[];
    static_assert(alignof(float) >= alignof(short), "order by alignment");
    // Each sub-array starts exactly where the previous one ends
    short* shared_shorts = (short*) (shared_floats + number_of_floats);
    static_assert(alignof(short) >= alignof(char), "order by alignment");
    char* shared_bytes = (char*) (shared_shorts + number_of_shorts);
}

void call_kernel(
      unsigned number_of_floats,
      unsigned number_of_shorts,
      unsigned number_of_bytes,
      dim3 blocks, dim3 threads)
{
    // Tightly packed: with the descending-alignment order no padding is needed
    size_t shared_size = sizeof(float) * number_of_floats
          + sizeof(short) * number_of_shorts
          + sizeof(char) * number_of_bytes;
    kernel<<<blocks, threads, shared_size>>>(
          number_of_floats, number_of_shorts, number_of_bytes);
}
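
Applied to the exact pair asked about, int32 and float16, the same recipe works. A minimal sketch, assuming __half from <cuda_fp16.h> as the float16 type (kernel_fp16 is an illustrative name, not code from the question):

#include <cuda_fp16.h>  // __half: CUDA's 16-bit float type, alignment 2

__global__ void kernel_fp16(unsigned number_of_ints, unsigned number_of_halves)
{
    // int32 has alignment 4 and float16 alignment 2, so the ints come first
    extern __shared__ int shared_ints[];
    static_assert(alignof(int) >= alignof(__half), "order by alignment");
    __half* shared_halves = (__half*)(shared_ints + number_of_ints);
}

The launch-time size is again the plain sum: sizeof(int) * number_of_ints + sizeof(__half) * number_of_halves.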

Any more sophisticated solution, like the one I outlined in CUDA : Shared memory alignement in documentation, is only really necessary if you write generic code (e.g. a template) that may not know the correct alignment order a priori.
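
The linked answer is not reproduced here, but its gist is to align each pointer up at runtime instead of relying on declaration order. A minimal sketch of that idea (align_up, carve, and generic_kernel are illustrative names, not the linked code):

#include <cstddef>
#include <cstdint>

// Round p up to the next multiple of align (align must be a power of two).
__device__ inline std::uintptr_t align_up(std::uintptr_t p, std::size_t align)
{
    return (p + align - 1) & ~std::uintptr_t(align - 1);
}

// Carve `count` elements of T out of the raw buffer, inserting padding in
// front when T's alignment requires it, and advance the cursor past them.
template <typename T>
__device__ T* carve(char*& cursor, std::size_t count)
{
    T* result = reinterpret_cast<T*>(
        align_up(reinterpret_cast<std::uintptr_t>(cursor), alignof(T)));
    cursor = reinterpret_cast<char*>(result + count);
    return result;
}

__global__ void generic_kernel(unsigned n_ints, unsigned n_doubles)
{
    extern __shared__ char shared_raw[];
    char* cursor = shared_raw;
    // Declaration order no longer affects correctness; at worst a few
    // padding bytes are wasted.
    int*    sub_ints    = carve<int>(cursor, n_ints);
    double* sub_doubles = carve<double>(cursor, n_doubles);
}

The price is that the matching host-side size computation has to over-allocate by up to alignof(T) - 1 bytes per sub-array to leave room for the padding.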

Why not pass the struct type as a template parameter?

#include <stdio.h>
#include <cuda.h>

template <int size>
struct intfloat_t {
    int a[size];
    float b[size];
};

static constexpr auto elementsize = sizeof(intfloat_t<256>) / 256; // combined size of one (int, float) pair; adjust for padding/alignment if needed

template <typename realtype, int count = sizeof(realtype) / elementsize>
__global__ void test() {
    const auto PI = 22.0f/7.0f;
    extern __shared__ realtype bigstruct[];
    for (auto i = 0; i < count; i++) {
        //Never mind the silly access pattern
        bigstruct[0].a[i] = i;
        bigstruct[0].b[i] = i * PI;
        printf("a: %i, b: %.2f\n", bigstruct[0].a[i], bigstruct[0].b[i]); 
    }
}


int main() {
    constexpr auto count = 10;
    const auto size = sizeof(intfloat_t<count>);
    test<intfloat_t<count>><<<1, 32, size>>>();
    return cudaDeviceSynchronize();
}

You can play with the code on Godbolt: https://cuda.godbolt.org/z/83dxrGMPn

4 Comments

  • There is no reason to make the struct extern or an array. You can simply declare __shared__ realtype bigstruct; inside the kernel. That avoids fiddling with dynamic shared memory (its size is a compile-time constant anyway), and you don't have to write bigstruct[0] or bigstruct->, which would work the same.
  • Also, if the size is fixed at compile time, you can simply declare two __shared__ arrays with the size given by the template argument. Like this: cuda.godbolt.org/z/9ebM7vW4T (see also the sketch after these comments).
  • There may be reasons for the extern: it allows you to access more than 48 KiB of shared memory in a kernel.
  • Hmm, good point, but I don't think it's worth complicating the common case for an edge case like that. However, it's your answer …
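
For reference, a rough sketch of the statically sized variant suggested in these comments (in the spirit of, but not copied from, the Godbolt links above):

template <int count>
__global__ void test_static() {
    // Two statically sized shared arrays: no third launch parameter, no
    // pointer arithmetic, and the compiler handles alignment.
    __shared__ int   a[count];
    __shared__ float b[count];
    for (int i = threadIdx.x; i < count; i += blockDim.x) {
        a[i] = i;
        b[i] = i * (22.0f / 7.0f);
    }
}

This is launched as test_static<10><<<1, 32>>>(), with no third launch parameter. As the third comment notes, the extern form is still required to opt into more than 48 KiB of shared memory per block, via cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes).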
