How to correctly pass float4 vector to kernel using PyCUDA?

Question

I am trying to pass a float4 as an argument to my CUDA kernel (by value) using PyCUDA’s make_float4(), but the data seems to be misaligned when it reaches the kernel. For an input of (1,2,3,4), the kernel reads (3,4,0,0) instead. The same happens with int4, but int3 and float3 work just fine.

Minimal code to reproduce error in Google Colab:

# --- Minimal PyCUDA Test ---
import pycuda.driver as drv
import pycuda.compiler
import pycuda.gpuarray as gpa
import numpy as np
import pycuda.autoinit

minimal_kernel_code = """
__global__ void write_constant(
    int* output,
    const int4 test
    ) {
    output[0] = test.x;
    output[1] = test.y;
    output[2] = test.z;
    output[3] = test.w;
    }
"""

module_test = pycuda.compiler.SourceModule(minimal_kernel_code)
write_constant_kernel = module_test.get_function("write_constant")

test_gpu_mem = drv.mem_alloc(4 * np.int32().nbytes)

write_constant_kernel(
    test_gpu_mem,
    gpa.vec.make_int4(1,2,3,4), # Constant value to write
    block=(1, 1, 1),
    grid=(1, 1)
)

test_cpu_mem = np.empty(4, dtype=np.int32)
drv.memcpy_dtoh(test_cpu_mem, test_gpu_mem)

print(test_cpu_mem)

The expected output would be [1,2,3,4] but it is [3,4,0,0].

This bug seems to be ancient: github.com/inducer/pycuda/issues/143
– paleonix, Commented Aug 7 at 22:13
Yeah, I was just coming here to answer my own question... found out it's an alignment problem. I did not find the issue you mentioned in my searches though, many thanks.
– Dodilei, Commented Aug 8 at 1:05

Dodilei · Accepted Answer · 2025-08-08 01:08:58Z

It's an alignment issue: int4/float4 require 16-byte alignment, whereas int3/float3 only require 4-byte alignment. In my example the output pointer is passed as the first argument and occupies 8 bytes, so the second argument is written at an offset of 8 bytes. That works for int3/float3, but the kernel expects a four-element vector at offset 16, so the vector gets "cut in half": the kernel reads the last two elements followed by two undefined ones.


2 Comments

talonmies Aug 8 at 5:48

This explanation makes no sense whatsoever

Homer512 Aug 8 at 6:47

If I understand the bug report correctly, pycuda messes up the serialization of arguments when constructing the call. CUDA expects struct { int* first; int4 second; } with 16 byte alignment for the int4, thus 8 byte padding between the arguments. Pycuda doesn't do that, instead putting the first two vector entries into the padding space. int3 only has 4 byte alignment, thus not causing the issue.
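
The layout described in this comment can be reproduced on the CPU. The following is a minimal sketch, not PyCUDA's actual serialization code: it assumes 64-bit pointers, uses a made-up pointer value, and the helper names (align_up, pack_args, read_as_kernel) are hypothetical. It packs an (int*, int4) argument pair with and without padding, then reads the int4 from the offset the kernel expects.

```python
import struct

POINTER_SIZE = 8      # assumption: 64-bit device pointers
INT4_ALIGNMENT = 16   # alignment CUDA requires for int4/float4


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def pack_args(pointer, vec, pad_to_alignment):
    """Serialize (int*, int4) into one byte buffer, with or without padding."""
    buf = struct.pack("<Q", pointer)
    if pad_to_alignment:
        # Insert the padding CUDA expects between the pointer and the int4.
        buf += b"\x00" * (align_up(len(buf), INT4_ALIGNMENT) - len(buf))
    return buf + struct.pack("<4i", *vec)


def read_as_kernel(buf):
    """Read the int4 from the first 16-byte-aligned offset after the pointer."""
    offset = align_up(POINTER_SIZE, INT4_ALIGNMENT)  # 16
    # Zero-fill past the end of the buffer; on a real device these bytes are undefined.
    padded = buf.ljust(offset + 16, b"\x00")
    return list(struct.unpack_from("<4i", padded, offset))


# 0xDEADBEEF is a placeholder pointer value, not a real device address.
packed = pack_args(0xDEADBEEF, (1, 2, 3, 4), pad_to_alignment=False)
aligned = pack_args(0xDEADBEEF, (1, 2, 3, 4), pad_to_alignment=True)

print(read_as_kernel(packed))   # [3, 4, 0, 0]
print(read_as_kernel(aligned))  # [1, 2, 3, 4]
```

Without padding the vector starts at byte 8, so the aligned read at byte 16 picks up only its last two elements, which matches the (3,4,0,0) reported in the question. A three-element vector with 4-byte alignment is expected at byte 8 in both layouts, which would explain why int3 and float3 are unaffected.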
