clip() and discard() won't get you any performance increase, nor are they intended to. To understand why, consider how the GPU actually executes things.
> This must mean that the GPU thread groups are by default allocated in a tiled fashion on the screen, ...
Correct. Not only is this the "default," it is mandated by the underlying hardware in the fixed-function rasterizer stage.
> ...and that all threads of a group must be done computing before the (other) resources of the thread group can be used again.
Shaders are executed in waves of 64 (AMD) or 32 (NVIDIA) threads. These threads execute in SIMD fashion, meaning that they execute in lock-step.
The wave will execute at the rate of the slowest thread. If some threads clip() or return, say, halfway through the shader, they will still stall until all threads in the wave have completed. They aren't like CPU threads. (I consider the term "thread" to be something of a misnomer in this case but, hey, it's the world we live in.)
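To make that concrete, here's a rough sketch of the kind of pixel shader we're talking about (the resource names are made up, and the loop just stands in for whatever expensive work you're actually doing):

```hlsl
// Hypothetical pixel shader (made-up resource names) that discards a checkerboard.
Texture2D    gAlbedo  : register(t0);
SamplerState gSampler : register(s0);

float4 MainPS(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    // Discard every other pixel in a checkerboard pattern.
    clip(((uint(pos.x) + uint(pos.y)) & 1) ? -1.0 : 1.0);

    // The discarded lanes are only masked off: the wave still runs everything
    // below for the surviving lanes, so the execution slots the discarded
    // lanes occupy are not freed any earlier.
    float4 color = gAlbedo.Sample(gSampler, uv);

    [loop]                              // stand-in for the expensive part
    for (int i = 0; i < 64; ++i)
        color += 0.001 * sin(color.yzwx + i);

    return color;
}
```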
Now, with Shader Model 6 on the horizon, or with Vulkan right now, you can use wave ballot intrinsics to early-out and get some of that execution time back. However, even with those tools, the wave will still take as long as the slowest thread. So, turning off every other pixel won't get you anything (as adjacent pixels are typically assigned to the same wave).
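For example, HLSL's wave-op family includes WaveActiveAllTrue(), and a hypothetical early-out could look like this (just a sketch; the checkerboard test and the dummy loop are stand-ins):

```hlsl
// Hypothetical SM 6.0 sketch: the wave can only skip work when every lane agrees.
float4 MainPS(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    bool wantsDiscard = ((uint(pos.x) + uint(pos.y)) & 1) == 1;

    // Wave-uniform early-out: only taken if ALL lanes in the wave want out.
    // With a checkerboard pattern, adjacent lanes disagree, so this branch
    // never fires, which is exactly why killing every other pixel buys nothing.
    if (WaveActiveAllTrue(wantsDiscard))
    {
        clip(-1.0);
        return 0;
    }

    // Per-lane discard: this lane is masked off, but the wave still runs the
    // expensive part below for the lanes that survive.
    if (wantsDiscard)
        clip(-1.0);

    // Stand-in for the expensive shading you were hoping to skip.
    float4 acc = 0;
    [loop]
    for (int i = 0; i < 64; ++i)
        acc += 0.01 * sin(float4(uv, 0.0, 1.0) * i);
    return acc;
}
```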
> Does anyone know of any magic that can force a fragment shader to only allocate resources for, say, half the pixels of the screen? (but keep the render target size)
You may be able to use a stencil to mask off every other pixel, but this wouldn't give you back 100% of the time you're looking to save as there's overhead for the stencil comparison. Also, that just seems like a kludge.
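If you do try the stencil route anyway, the shader side is nothing more than a cheap pre-pass that discards the pixels you want masked; the actual masking (stencil write in the pre-pass, stencil test in the main pass) lives in your depth-stencil state on the API side. A purely illustrative sketch:

```hlsl
// Hypothetical stencil-mask pre-pass, drawn as a full-screen pass with stencil
// writes enabled in the depth-stencil state (e.g. ref = 1, pass op = REPLACE).
// The main pass then tests EQUAL against ref = 1, so only half the pixels get shaded.
float4 StencilMaskPS(float4 pos : SV_Position) : SV_Target
{
    // Discard every other pixel so only a checkerboard reaches the stencil write.
    clip(((uint(pos.x) + uint(pos.y)) & 1) ? -1.0 : 1.0);
    return 0;
}
```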
The best and easiest way to shade only a fraction of the screen's pixels is to render to a reduced-resolution render target (for example 1/4 of the pixels, i.e. 1/2 on each side), then draw it scaled up onto a screen-aligned quad that covers the screen. You can play with your sampler configuration when drawing that quad to get the "grainy," pixelated image, if that's what you want.
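The upscale pass is then just a textured full-screen quad, something along these lines (made-up resource names):

```hlsl
// Hypothetical upscale pass: the scene was rendered into gLowResScene at a
// reduced resolution, and this pixel shader runs on a full-screen quad.
Texture2D    gLowResScene : register(t0);
SamplerState gPointClamp  : register(s0); // point filtering = blocky "grainy" look;
                                          // a linear sampler smooths it out instead
float4 UpscalePS(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    return gLowResScene.Sample(gPointClamp, uv);
}
```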
You COULD do this in compute, but you'd be handling rasterization yourself. Then again, that may be a more appropriate place to implement raytracing. I've written raycasting code in both CUDA and HLSL compute, and it works pretty well. If you don't need the rasterizer, compute is a perfectly reasonable way to go.
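For completeness, the compute route looks roughly like this: one thread per pixel writing straight into a UAV. Everything below (names, the ray setup) is made up and heavily simplified:

```hlsl
// Hypothetical compute-shader raycasting skeleton: one thread per output pixel.
RWTexture2D<float4> gOutput : register(u0);

cbuffer CameraCB : register(b0)
{
    float4x4 gInvViewProj;   // made-up constants used to reconstruct rays
    float3   gCameraPos;
    float2   gOutputSize;
};

[numthreads(8, 8, 1)]
void RaycastCS(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= (uint)gOutputSize.x || id.y >= (uint)gOutputSize.y)
        return;

    // Build a ray through this pixel (placeholder math).
    float2 ndc = (float2(id.xy) + 0.5) / gOutputSize * 2.0 - 1.0;
    float4 farPt = mul(float4(ndc.x, -ndc.y, 1.0, 1.0), gInvViewProj);
    float3 rayDir = normalize(farPt.xyz / farPt.w - gCameraPos);

    // The actual march/trace against your scene would go here; for now just
    // visualize the ray direction so the skeleton produces something visible.
    gOutput[id.xy] = float4(rayDir * 0.5 + 0.5, 1.0);
}
```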