2

I am trying to pass a float4 as argument to my cuda kernel (by value) using PyCUDA’s make_float4(). But there seems to be some misalignment when the data is transferred to the kernel. If I read the output for an input (1,2,3,4) I instead get (3,4,0,0). This happens with int4 as well, but int3 and float3 work just fine.

Minimal code to reproduce error in Google Colab:

# --- Minimal PyCUDA Test ---
import pycuda.driver as drv
import pycuda.compiler
import pycuda.gpuarray as gpa
import numpy as np
import pycuda.autoinit

minimal_kernel_code = """
__global__ void write_constant(
    int* output,
    const int4 test
    ) {
    output[0] = test.x;
    output[1] = test.y;
    output[2] = test.z;
    output[3] = test.w;
    }
"""

module_test = pycuda.compiler.SourceModule(minimal_kernel_code)
write_constant_kernel = module_test.get_function("write_constant")

test_gpu_mem = drv.mem_alloc(4 * np.int32().nbytes)

write_constant_kernel(
    test_gpu_mem,
    gpa.vec.make_int4(1,2,3,4), # Constant value to write
    block=(1, 1, 1),
    grid=(1, 1)
)

test_cpu_mem = np.empty(4, dtype=np.int32)
drv.memcpy_dtoh(test_cpu_mem, test_gpu_mem)

print(test_cpu_mem)

The expected output would be [1,2,3,4] but it is [3,4,0,0].

2
  • 1
    This bug seems to be ancient github.com/inducer/pycuda/issues/143 Commented Aug 7 at 22:13
  • Yeah, I was just coming here to answer my own question... found out it's an alignment problem. I did not find the issue you mentioned in my searches though, many thanks. Commented Aug 8 at 1:05

1 Answer 1

0

It's an alignment issue, int/float4 requires different alignment than int/float3. In my example the output pointer is passed as the first argument, therefore the second one starts with an offset of 4 bytes. That works for int3/float3, but a four element vector would be "cut in half", yielding the last two elements and two undefined ones as a result.

Sign up to request clarification or add additional context in comments.

2 Comments

This explanation makes no sense whatsoever
If I understand the bug report correctly, pycuda messes up the serialization of arguments when constructing the call. CUDA expects struct { int* first; int4 second; } with 16 byte alignment for the int4, thus 8 byte padding between the arguments. Pycuda doesn't do that, instead putting the first two vector entries into the padding space. int3 only has 4 byte alignment, thus not causing the issue.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.