22/09/2013 'Rasterization patterns' of Haswell GPU

Thanks to the release of an OpenGL 4.2 implementation for Intel GPUs, I can finally use atomic counters to have a look at Haswell's 'rasterization patterns'.

How to read these images? A fragment shader is launched for each pixel. An atomic counter is incremented for each invocation to produce a unique color. The counter value can be seen as going through a reinterpret_cast to a color, so that once the red channel reaches 255, the next invocation gets 1 in the green channel.
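As a rough illustration of that color encoding, here is a minimal CPU-side sketch in Python. It is not the shader used for these images: the function name `counter_to_rgba8` is hypothetical, and it assumes an 8-bit-per-channel RGBA target with the lowest byte of the counter landing in the red channel.

```python
import struct

def counter_to_rgba8(counter):
    """Reinterpret a 32-bit counter value as four 8-bit channels (R, G, B, A)."""
    # Assumption: little-endian packing, so the lowest byte becomes red.
    return tuple(struct.pack("<I", counter & 0xFFFFFFFF))

print(counter_to_rgba8(255))  # (255, 0, 0, 0)
print(counter_to_rgba8(256))  # (0, 1, 0, 0)
```

Under these assumptions, invocations that ran close together in time get nearly identical colors, and the green and blue channels show the coarser ordering.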

Haswell's 'rasterization pattern'

On Haswell, rasterization is performed by scanning the primitives top to bottom, going from right to left and then from left to right, in coarse-grained tiles of 16 by 16 pixels. Surprisingly, this is the same behavior as Kepler. In comparison, Southern Islands is very different: it scans 512 by 32 pixel bands in which 8 by 8 pixel tiles are scheduled in Z-order as long as the assigned CUs/ALUs are available.
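The coarse traversal described above can be sketched as a serpentine walk over 16 by 16 tiles. This is only a model of what the images suggest, not a description of the hardware: the function name `serpentine_tile_order` is hypothetical, and it assumes a primitive covering the whole target, with the first tile row going right to left and the direction flipping on every row.

```python
TILE = 16  # coarse tile edge in pixels, as observed in the images

def serpentine_tile_order(width, height, tile=TILE):
    """Return tile (column, row) pairs in top-to-bottom serpentine order."""
    cols = (width + tile - 1) // tile
    rows = (height + tile - 1) // tile
    order = []
    for row in range(rows):
        # Assumption: even rows run right to left, odd rows left to right.
        xs = range(cols - 1, -1, -1) if row % 2 == 0 else range(cols)
        order.extend((x, row) for x in xs)
    return order

print(serpentine_tile_order(48, 32))
# [(2, 0), (1, 0), (0, 0), (0, 1), (1, 1), (2, 1)]
```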

At fine granularity, the Haswell GPU works within 16 by 16 pixel tiles on 4 by 2 pixel horizontal blocks that might be scheduled to different Execution Units (EU) (CU on Southern Islands; SMX on Kepler). Hence, the wavefront/warp size is 8 invocations. In comparison, Kepler works on 4 by 8 pixel vertical blocks (warp: 32 invocations) and Southern Islands uses 8 by 8 pixel blocks (wavefront: 64 invocations) in which the pixels are executed in Z-order.
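To make the block arithmetic concrete, here is a small sketch that maps a pixel to its Haswell tile and 4 by 2 block, and decodes a Z-order index inside an 8 by 8 Southern Islands block. The helper names are hypothetical, and the Z-order bit convention (x in the even bits, y in the odd bits) is an assumption, since the images alone don't settle it.

```python
def invocations(block_w, block_h):
    """Number of fragment shader invocations covered by one block."""
    return block_w * block_h

def haswell_block_of(x, y, tile=16, block_w=4, block_h=2):
    """Map a pixel to (tile_x, tile_y, block_x, block_y)."""
    return (x // tile, y // tile, (x % tile) // block_w, (y % tile) // block_h)

def morton_decode(index):
    """Decode a Z-order index into (x, y) inside an 8 by 8 block."""
    x = y = 0
    for bit in range(3):  # 3 bits per axis cover 8 pixels
        x |= ((index >> (2 * bit)) & 1) << bit      # even bits go to x
        y |= ((index >> (2 * bit + 1)) & 1) << bit  # odd bits go to y
    return x, y

print(invocations(4, 2), invocations(4, 8), invocations(8, 8))  # 8 32 64
print(haswell_block_of(21, 5))                # (1, 0, 1, 2)
print([morton_decode(i) for i in range(4)])   # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

With these sizes, a 16 by 16 tile holds 4 x 8 = 32 blocks of 4 by 2 pixels.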

Finally, Haswell seems to rely on a synchronous architecture, at both coarse and fine granularity. The rendering doesn't show any long delay for specific tiles. Either the architecture is so well synchronized that it never has to wait, or it doesn't defer the execution of blocks and systematically waits for atomic results. My guess is that it waits, which in my opinion means pretty bad atomic counter performance. On Kepler, the architecture is capable of handling very long latencies between coarse-grained blocks. With Fermi, it was more extreme: we could even expect the first coarse tile to be completed last! On Southern Islands, this latency appears shorter, but that might be because the Southern Islands architecture has a Global Data Share (GDS), so the synchronization of atomics doesn't have to go all the way down to memory.

Kepler's 'rasterization pattern'
Southern Islands' 'rasterization pattern'

Copyright © Christophe Riccio 2002-2016 all rights reserved