
I have a pretty complicated OpenCL program that I've been optimizing. Without going through all the code: one kernel writes some values to global memory.

Then a second kernel fires and does billions of computations on that data, all in local memory. I've optimized the code over and over, getting the kernel's run time down to about 275 ms.

The final part of the kernel loops over an array of data in local memory and searches for a matching string. Obviously, if it finds a match it needs to let the host program know. I accomplished this by setting global_array[0].x to 999 and global_array[0].y to the found result.

After the kernel finishes, the host reads the first element of global_array and checks whether .x == 999; if so, we know we found our target.

In the process of doing more optimizing, I found that if I commented out the global_array[0] = lines, the kernel ran 4x as fast, at about 62 ms. Knowing global memory is slow, I started testing various things. I thought, hey, maybe if I changed the LOCAL array instead, then at the very end did an async_work_group_copy back to global memory, I'd get a bit of a speed increase.

But no... I don't. And it's confusing as heck. If, at the end of the kernel, I write anything to seemingly any position in global or local memory, my kernel runs at 270 ms. If I write the same data to a private variable, or just do other unrelated code, it's 62 ms.

I need to return a result from the kernel somehow. But for some reason, writing to a local variable, something the kernel already does 50 times before it reaches the end without any slowdown, slows it down like crazy when the write is at the end.

Can anyone explain why this would happen? I'm stumped.


2 Answers


When you don't write out to global memory, the JIT compiler is most likely detecting most of your code as dead code, and eliminating it.


2 Comments

Aah... so by merely calculating, and never actually returning something or "showing" that I've done anything with the work, the compiler is basically skipping the work? So when I put in any kind of return value, it sees its work is being used, and actually does it? Makes sense. No reason to do the work if nothing is done with the result. Can anyone confirm this is the case? (Before walking away, I'd like to know it IS, rather than 'most likely is'.)
You can confirm this yourself: leave in the write to global memory, but replace the complex calculations with a constant. Try it for both local and global memory if you want to see whether there is a difference.

To verify, we'd need to see the code. One way you can check it yourself is to leave in the write, but guard it with a condition that you know will never be met but that the compiler cannot prove false (e.g. a certain global address containing a specific value). You have to be careful here too, because the compiler could use code motion to check the condition earlier in the thread and still skip the work. So the condition (that will never be met) needs to depend on the outcome of the actual work you do (or at least have the potential to be influenced by it, or be convoluted enough that the compiler thinks there could be a dependency). BTW, I used to write OpenCL micro-benchmarks for measuring GPGPU performance, and this is one of the things you learn early on: it's a constant battle trying to trick the compiler into not optimizing away the work you're trying to measure.

