I would like to write a CUDA kernel that uses two different dynamically allocated shared memory arrays, in the following form:

__global__ void myKernel()
{
    extern __shared__ int   localSum1[];
    extern __shared__ float localSum2[];
    ...
}

I read in the answer to this question that we can only declare dynamic shared memory once.

  1. Is there still a way to use dynamic shared memory for different dtypes?

  2. In this example, float and int have the same size. How do I do this for arbitrary combinations of differently sized types, e.g. (int8, int32, int64) and (float16, float32, float64)?

  • Use a structure? Commented Oct 3, 2024 at 18:53
  • You could lift the code I wrote here: stackoverflow.com/questions/70765553/… A struct may cause additional bank conflicts, depending on your use case Commented Oct 3, 2024 at 19:16
  • Also the code in that question itself shows how to do it normally, if you know what the alignment order needs to be. Commented Oct 3, 2024 at 19:27
  • docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared Commented Oct 3, 2024 at 20:52
  • @RobertCrovella Thank you for that link; it discusses how to use the same memory for the same byte size. Can it be used for different sizes e.g. (float16, int32)? Commented Oct 4, 2024 at 20:05

2 Answers

Can it be used for different sizes e.g. (float16, int32)?

In case it needs more spelling out, here is the basic approach:

  1. Order the types in descending order of alignment. This can be surprising with structs or vector types (e.g. float3 has the same alignment as float, but float2 has a higher alignment). If unsure, pepper the code with static_assert(alignof(type1) >= alignof(type2)); it is free of runtime cost and prevents other developers from reordering the arrays without thinking.

  2. Compute the size requirement without any extra padding: sizeof(type1) * N + sizeof(type2) * M.

  3. Use pointer arithmetic to determine where one sub-array ends and the next begins:

__global__ void kernel(
      unsigned number_of_floats,
      unsigned number_of_shorts,
      unsigned number_of_bytes)
{
    // A single allocation, declared with the most strictly aligned type
    extern __shared__ float shared_floats[];
    static_assert(alignof(float) >= alignof(short), "order by alignment");
    // Each sub-array starts exactly where the previous one ends
    short* shared_shorts = (short*) (shared_floats + number_of_floats);
    static_assert(alignof(short) >= alignof(char), "order by alignment");
    char* shared_bytes = (char*) (shared_shorts + number_of_shorts);
}

void call_kernel(
      unsigned number_of_floats,
      unsigned number_of_shorts,
      unsigned number_of_bytes,
      dim3 blocks, dim3 threads)
{
    // Tightly packed: with the descending-alignment order no padding is needed
    size_t shared_size = sizeof(float) * number_of_floats
          + sizeof(short) * number_of_shorts
          + sizeof(char) * number_of_bytes;
    kernel<<<blocks, threads, shared_size>>>(
          number_of_floats, number_of_shorts, number_of_bytes);
}
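
Applied to the exact pair asked about, int32 and float16, the same recipe works. A minimal sketch, assuming __half from <cuda_fp16.h> as the float16 type (kernel_fp16 is an illustrative name, not code from the question):

#include <cuda_fp16.h>  // __half: CUDA's 16-bit float type, alignment 2

__global__ void kernel_fp16(unsigned number_of_ints, unsigned number_of_halves)
{
    // int32 has alignment 4 and float16 alignment 2, so the ints come first
    extern __shared__ int shared_ints[];
    static_assert(alignof(int) >= alignof(__half), "order by alignment");
    __half* shared_halves = (__half*)(shared_ints + number_of_ints);
}

The launch-time size is again the plain sum: sizeof(int) * number_of_ints + sizeof(__half) * number_of_halves.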

Any more sophisticated solution, like the one I outlined in CUDA : Shared memory alignement in documentation, is only really necessary if you write generic code (e.g. a template) that may not know the correct alignment order a priori.
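
The linked answer is not reproduced here, but its gist is to align each pointer up at runtime instead of relying on declaration order. A minimal sketch of that idea (align_up, carve, and generic_kernel are illustrative names, not the linked code):

#include <cstddef>
#include <cstdint>

// Round p up to the next multiple of align (align must be a power of two).
__device__ inline std::uintptr_t align_up(std::uintptr_t p, std::size_t align)
{
    return (p + align - 1) & ~std::uintptr_t(align - 1);
}

// Carve `count` elements of T out of the raw buffer, inserting padding in
// front when T's alignment requires it, and advance the cursor past them.
template <typename T>
__device__ T* carve(char*& cursor, std::size_t count)
{
    T* result = reinterpret_cast<T*>(
        align_up(reinterpret_cast<std::uintptr_t>(cursor), alignof(T)));
    cursor = reinterpret_cast<char*>(result + count);
    return result;
}

__global__ void generic_kernel(unsigned n_ints, unsigned n_doubles)
{
    extern __shared__ char shared_raw[];
    char* cursor = shared_raw;
    // Declaration order no longer affects correctness; at worst a few
    // padding bytes are wasted.
    int*    sub_ints    = carve<int>(cursor, n_ints);
    double* sub_doubles = carve<double>(cursor, n_doubles);
}

The price is that the matching host-side size computation has to over-allocate by up to alignof(T) - 1 bytes per sub-array to leave room for the padding.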

Why not pass the struct type as a template parameter?

#include <stdio.h>
#include <cuda.h>

template <int size>
struct intfloat_t {
    int a[size];
    float b[size];
};

static constexpr auto elementsize = sizeof(intfloat_t<256>) / 256; // combined size of one (int, float) pair; adjust for padding/alignment if needed

template <typename realtype, int count = sizeof(realtype) / elementsize>
__global__ void test() {
    const auto PI = 22.0f/7.0f;
    extern __shared__ realtype bigstruct[];
    for (auto i = 0; i < count; i++) {
        //Never mind the silly access pattern
        bigstruct[0].a[i] = i;
        bigstruct[0].b[i] = i * PI;
        printf("a: %i, b: %.2f\n", bigstruct[0].a[i], bigstruct[0].b[i]); 
    }
}


int main() {
    constexpr auto count = 10;
    const auto size = sizeof(intfloat_t<count>);
    test<intfloat_t<count>><<<1, 32, size>>>();
    return cudaDeviceSynchronize();
}

You can play with the code on Godbolt: https://cuda.godbolt.org/z/83dxrGMPn

4 Comments

  • There is no reason to make the struct extern or an array. You can simply declare __shared__ realtype bigstruct; inside the kernel. That avoids fiddling with dynamic shared memory (its size is a compile-time constant anyway), and you don't have to write bigstruct[0] or bigstruct->, which would work the same.
  • Also, if the size is fixed at compile time, you can simply declare two __shared__ arrays with the size given by the template argument. Like this: cuda.godbolt.org/z/9ebM7vW4T (see also the sketch after these comments).
  • There may be reasons for the extern: it allows you to access more than 48 KiB of shared memory in a kernel.
  • Hmm, good point, but I don't think it's worth complicating the common case for an edge case like that. However, it's your answer …
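
For reference, a rough sketch of the statically sized variant suggested in these comments (in the spirit of, but not copied from, the Godbolt links above):

template <int count>
__global__ void test_static() {
    // Two statically sized shared arrays: no third launch parameter, no
    // pointer arithmetic, and the compiler handles alignment.
    __shared__ int   a[count];
    __shared__ float b[count];
    for (int i = threadIdx.x; i < count; i += blockDim.x) {
        a[i] = i;
        b[i] = i * (22.0f / 7.0f);
    }
}

This is launched as test_static<10><<<1, 32>>>(), with no third launch parameter. As the third comment notes, the extern form is still required to opt into more than 48 KiB of shared memory per block, via cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes).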
