
Defer all the things!
I’m sometimes asked about which direction to take a renderer – should I use some form of forward lighting, or some sort of deferred lighting? For some, the decision boils down to which box to tick in Unreal or Unity, while for people developing their own rendering code, it’s which method to spend 10s (100s?) of hours perfecting.
There are pros and cons to both – forward lighting used to have a severe limit on the number of lights, while deferred lighting suffers from bandwidth concerns due to the requirement to keep all of the lighting model parameters in a set of G-Buffer render targets, along with the issue of what to do about transparent objects. In recent years, it seems that the advent of methods such as Forward+ (and its various rendering relatives), which allow a large number of lights to be considered by splitting the viewport up into a grid (or, in some variations, a 3D cluster set) and determining light visibility in compute shaders, has made deferred rendering a thing of the past.
It’s not as clear cut as that, though, when other parts of the rendering process are taken into consideration. Evaluating lighting models is a fair amount of work to be doing per-fragment, particularly considering cases where those fragments are then covered over by other geometry drawn later. The traditional solution to this is to have a depth-only ‘pre pass’, where the geometry to be rendered to the screen is actually rendered twice – once to a depth-only target (much like with shadow mapping), and then again with the depth test set to only allow fragments that have the exact depth stored within this depth target to pass. This ensures that expensive fragment shaders are only evaluated once per fragment, but at the expense of now having to issue twice as many draw calls, and render twice as much geometry.
It seems that the ideal solution to rendering would be a hybrid of the two techniques – we want to avoid a depth pre-pass, but we also don’t want a G-Buffer taking up bandwidth and limiting our lighting model. A nice solution to this is presented in GPU Zen 1, as part of the chapter Deferred+: Next-Gen Culling and Rendering for the Dawn Engine, by Hawar Doghramachi and Jean-Normand Bucci. Based around rendering work done during the development of the excellent Deus Ex: Mankind Divided, it describes a system whereby the rendering of geometry and the shading of materials are completely separated from each other. It also has some nice work on visibility determination and compute-based object culling that I won’t be going into detail on here (buy the book on Kindle!), but I’ve recently been inspired by it to build my own twist on Deferred+ that I thought I’d write a little bit about.
The Deferred+ method works by splitting the drawing of the scene up into two passes – a very lightweight G-Buffer fill pass, and then a single full-screen quad pass for each visible material. At first glance this second stage sounds like a lot of GPU work, but there are some tricks to making it a surprisingly efficient process.
The First Stage
In the first pass, all of the camera-visible geometry is rendered to a G-Buffer. Unlike with typical Deferred rendering, though, it’s a pretty lightweight buffer. Instead of storing the albedo, normal, light properties, and so on in a set of textures, we have the following:
- Texture coordinates, along with their partial derivatives
- The surface normal and tangent, in an octahedral-encoded form
- A 16 bit material type ID, and a 16 bit material instance ID
It might seem strange at first to store texture coordinates, but that’s because at this stage we don’t sample any material properties yet! No texture sampling occurs at all; we just write out the geometry’s texture coordinates to use later on. This does cause a little issue, though – to reconstruct the values needed for correct mipmapped and anisotropic sampling, we also need the partial derivatives of the texture coordinates to be stored, too.
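As a rough sketch, the UV part of the G-Buffer fill fragment shader could look something like the following – the render target layout and names here are my own assumptions for illustration:

```glsl
#version 430

in vec2 texCoord;                               // interpolated from the vertex shader

layout(location = 0) out vec2 outUV;            // e.g. an RG16F target
layout(location = 1) out vec4 outUVDerivatives; // e.g. an RGBA16F target

void main() {
    outUV = texCoord;
    // dFdx/dFdy capture how quickly the UVs change across the screen; the
    // material pass later feeds these into textureGrad to recover correct
    // mipmapped / anisotropic sampling.
    outUVDerivatives = vec4(dFdx(texCoord), dFdy(texCoord));
}
```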
We also store the normal and tangent of the surface, to allow for reconstruction of the matrix that will transform a tangent-space bumpmap into the correct space later on when performing lighting. To store these I use octahedral normal encoding, allowing a quick and efficient conversion of 3D direction vectors into a 2D form, which means I can store both the normal and tangent in a single texture.
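One widely used formulation of octahedral encoding looks roughly like this (a sketch of the general technique, not necessarily the exact variant in my renderer) – a unit vector is projected onto an octahedron and folded out into the [0,1]² square, so normal and tangent fit comfortably into one four-channel texture:

```glsl
// Fold the lower hemisphere of the octahedron out into the corners of the square.
vec2 OctWrap(vec2 v) {
    return (1.0 - abs(v.yx)) * vec2(v.x >= 0.0 ? 1.0 : -1.0,
                                    v.y >= 0.0 ? 1.0 : -1.0);
}

// Map a unit direction vector to a 2D value in [0,1]^2.
vec2 OctEncode(vec3 n) {
    n /= abs(n.x) + abs(n.y) + abs(n.z);         // project onto the octahedron
    vec2 e = (n.z >= 0.0) ? n.xy : OctWrap(n.xy);
    return e * 0.5 + 0.5;                        // remap from [-1,1] to [0,1]
}
```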
Along with these, I store two 16 bit integers – one to indicate the material type, and one for the material instance. For each material type I have a buffer that stores any per-instance information required for correct rendering (a colour, or other uniform parameter, for instance); the instance value serves as an index into this buffer.
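In the fill shader, that can be as simple as writing the two IDs out to an integer render target – the RG16UI format and uniform names below are assumptions of mine (a single packed 32 bit channel would work just as well):

```glsl
#version 430
// Material ID outputs from the G-Buffer fill pass (illustrative layout).
layout(location = 2) out uvec2 outMaterial;     // x = material type, y = material instance

uniform uint materialType;                      // which material shader will draw this surface
uniform uint materialInstance;                  // index into that material's per-instance buffer

void main() {
    outMaterial = uvec2(materialType & 0xFFFFu, materialInstance & 0xFFFFu);
}
```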
The Second Stage
Once the G-Buffer stage has completed, we can start drawing some materials! As mentioned earlier, this is done by binding any per-material data (that buffer I just mentioned), and drawing a full-screen quad. To make this drawing efficient, there’s an extra step taken before drawing these quads: a compute shader is run that reads the material type G-Buffer texture, and writes it out to a depth buffer texture.
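A sketch of that compute pass might look like the following. Since depth formats can’t be written directly with image stores, I’ve assumed here that the normalised material ID is written to an r32f image which is then copied into (or otherwise bound as) the depth target used when drawing the material quads:

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;

layout(binding = 0) uniform usampler2D gBufferMaterial;            // x = material type from the G-Buffer
layout(binding = 1, r32f) uniform writeonly image2D materialDepth; // later used as the quads' depth buffer

void main() {
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(pixel, textureSize(gBufferMaterial, 0)))) {
        return;
    }
    uint materialType = texelFetch(gBufferMaterial, pixel, 0).x;
    // Normalise the 16 bit material ID into the [0,1] range a depth buffer expects.
    imageStore(materialDepth, pixel, vec4(float(materialType) / 65535.0));
}
```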
This depth texture is then used for depth-testing when drawing the material quads. Each is drawn with the depth test set to ‘equals’, and rendered with a vertex shader that sets the depth to a specific material ID. This means that only fragments that actually have the material being rendered end up passing the depth test; with modern GPU hierarchical depth testing, fragments that match a specific material can be quickly found and shaded.
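The vertex shader for each material quad then just has to output a depth that maps back to the same normalised material ID. A minimal sketch, assuming the quad’s vertices are already in NDC and the default OpenGL [-1,1] depth range:

```glsl
#version 430
layout(location = 0) in vec2 position;   // full-screen quad vertices in NDC

uniform uint materialType;               // the material this quad will shade

void main() {
    // Must produce exactly the same depth value the compute pass wrote, so that
    // a depth test of 'equals' (glDepthFunc(GL_EQUAL)) only lets matching fragments through.
    float materialDepth = float(materialType) / 65535.0;
    gl_Position = vec4(position, materialDepth * 2.0 - 1.0, 1.0);
}
```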
It’s only at this point in the rendering process that any textures are sampled – each fragment shader execution reconstructs the texture coordinate and its derivatives from the G-Buffer, allowing mipmapped, anisotropic sampling to be performed. There’s no sampling wastage at all – every texture sample will make it to the screen, as we’ve deferred it until we know that the fragment is visible. This makes the texture sampling very efficient, and reasonably cache friendly, too!

Every fragment can read the material instance being rendered to access any per-instance material data, and can read another bound buffer for any per-material data; I use a bindless texturing system, so this buffer would be expected to contain the handles of the albedo and normal textures, along with the textures for any lighting parameters such as metallic and roughness. Now that such values are kept in per-material textures rather than spread across a number of G-Buffer channels, reading them can be efficient, and again hopefully quite cache friendly. Have a material with different properties? Just write a different material rendering shader, and have a different per-material buffer setup! This gives us all of the benefits of ‘forward’ style rendering, and none of the restrictive drawbacks of deferred.
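Putting that together, a material pass fragment shader might look a bit like the sketch below – the bindings, buffer layout, and parameter names are all assumptions on my part, and the lighting itself is left to the next section:

```glsl
#version 460
#extension GL_ARB_bindless_texture : require

layout(binding = 0) uniform sampler2D  gBufferUV;          // stored texture coordinates
layout(binding = 1) uniform sampler2D  gBufferUVDerivs;    // xy = ddx(uv), zw = ddy(uv)
layout(binding = 2) uniform usampler2D gBufferMaterial;    // x = material type, y = material instance

// Per-instance data for this material type; bindless handles are stored
// directly in the buffer and turned into samplers in the shader.
struct MaterialInstance {
    uvec2 albedoHandle;
    uvec2 normalHandle;
    uvec2 metallicRoughnessHandle;
    vec4  tint;
};
layout(std430, binding = 0) readonly buffer InstanceData {
    MaterialInstance instances[];
};

out vec4 fragColour;

void main() {
    ivec2 pixel  = ivec2(gl_FragCoord.xy);
    vec2  uv     = texelFetch(gBufferUV, pixel, 0).xy;
    vec4  derivs = texelFetch(gBufferUVDerivs, pixel, 0);
    uint  inst   = texelFetch(gBufferMaterial, pixel, 0).y;

    MaterialInstance m = instances[inst];
    // textureGrad uses the stored derivatives, restoring correct mip selection
    // and anisotropic filtering even though we're drawing a full-screen quad.
    vec4 albedo = textureGrad(sampler2D(m.albedoHandle), uv, derivs.xy, derivs.zw) * m.tint;

    fragColour = albedo; // lighting accumulation would go here (see the lighting section below)
}
```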
Once the full-screen quads have been rendered, we have a final image that looks much like that of any other rendering method, just built up in a very different way.
Lighting
So far, I’ve not mentioned much about lighting. While the geometry is now fully decoupled from the material rendering, the lighting can’t be done in the ‘standard’ deferred way of rendering a light volume mesh for each light, as the required values aren’t in the G-Buffer in this solution. Instead, I perform lighting as in Forward+, with the screen split up into a grid of cells. Before rendering the materials, I run some compute shaders to work out A) which lights are within the view frustum, and then B) which of these lights affect each grid cell. I have a maximum number of lights per cell, so the indices of the visible lights are stored in a big integer array of size gridCountX * gridCountY * maxLightsPerCell. When the materials are rendered, each fragment works out which cell it is part of, and iterates through that cell’s lights, accumulating the result. This has the benefit that each fragment only runs through the light list once (without needing a depth pre-pass to achieve it), and can support any BRDF you like – the values come from the material buffers, not a G-Buffer, so new material lighting types (hair, brushed metal, etc.) don’t incur any additional cost.
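The per-fragment half of that looks something like the sketch below, intended to sit inside the material pass fragment shader – the buffer layouts, the per-cell count buffer, and the simple diffuse/attenuation maths are assumptions for illustration, standing in for whatever BRDF the material actually wants:

```glsl
// Assumed tiled-lighting inputs for the material pass.
const uint MAX_LIGHTS_PER_CELL = 64u;

uniform uvec2 gridCount;     // number of cells across and down the screen
uniform vec2  screenSize;    // viewport size in pixels

struct Light {
    vec4 positionRadius;     // xyz = world position, w = radius
    vec4 colour;
};
layout(std430, binding = 1) readonly buffer LightData    { Light lights[]; };
layout(std430, binding = 2) readonly buffer LightIndices { uint  lightIndices[]; }; // gridCountX * gridCountY * MAX_LIGHTS_PER_CELL
layout(std430, binding = 3) readonly buffer LightCounts  { uint  lightCounts[];  }; // lights actually written per cell

// worldPos would be reconstructed elsewhere, e.g. from the scene depth buffer.
vec3 AccumulateLighting(vec3 worldPos, vec3 normal, vec3 albedo) {
    // Work out which grid cell this fragment falls into.
    uvec2 cell      = uvec2(gl_FragCoord.xy / (screenSize / vec2(gridCount)));
    uint  cellIndex = cell.y * gridCount.x + cell.x;
    uint  baseIndex = cellIndex * MAX_LIGHTS_PER_CELL;

    vec3 result = vec3(0.0);
    for (uint i = 0u; i < lightCounts[cellIndex]; ++i) {
        Light light   = lights[lightIndices[baseIndex + i]];
        vec3  toLight = light.positionRadius.xyz - worldPos;
        float atten   = clamp(1.0 - length(toLight) / light.positionRadius.w, 0.0, 1.0);
        float ndotl   = max(dot(normal, normalize(toLight)), 0.0);
        result       += albedo * light.colour.rgb * atten * ndotl; // simple diffuse term in place of a real BRDF
    }
    return result;
}
```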
Final Thoughts
One problem does remain with this Deferred+ solution, though – the age-old problem of transparent objects. As there’s only one material stored per fragment, we don’t have the information to build up a final fragment colour from multiple overlapping objects. Unless we want to start creating per-fragment linked lists for transparency, the solution is the same as with a traditional deferred renderer – draw transparent objects afterwards, using ‘traditional’ back-to-front forward rendering. This isn’t too bad, as such objects still benefit from the light lists built for the material pass, so it becomes very much like a Forward+ rendering system for them.