What good are hundreds of lights without hundreds of shadows?


Modern dynamic light rendering techniques, such as Clustered shading, are very impressive if you want thousands of lights, and after implementing my own method for rendering scenes with hundreds of lights, I got to thinking about shadows…

What good are all these lights unless many or all of them are casting shadows?

Eye Candy

In this scene I have about 120 lights. Each light is omni-directional, so each one uses 6 views to render a cube shadow map, so that is 600+ shadow maps being rendered. All objects are casting shadows. The shadow map resolution is scaled with distance from the camera in order to fit in memory.
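The post doesn't give the exact scaling curve, but the idea of distance-scaled shadow map resolution can be sketched like this (the base resolution, minimum resolution, and falloff distance below are made-up placeholders, not the engine's values):

```python
def shadow_map_resolution(distance, base_res=1024, min_res=64, falloff=10.0):
    """Hypothetical distance-based resolution picker: halve the cube-face
    resolution each time the light's distance doubles past `falloff`."""
    res = base_res
    d = distance
    while d > falloff and res > min_res:
        res //= 2
        d /= 2.0
    return res
```

A scheme like this keeps the total atlas footprint bounded, since distant lights cost a small fraction of the memory of nearby ones.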

You can see the shadow maps being rendered in the bottom left.

So… I wanted hundreds of shadows and I set out to solve the problem. I was very impressed by earlier attempts such as Imperfect Shadow Maps and the ManyLods Algorithm but both suffer from holes when rendering large views, and hence, are not suited for rendering high resolution shadow maps. ManyLods can also end up with very deep trees because of the number of points required to represent large triangles.

This led me to a small eureka moment when I realized that there is no point in using a point representation: You can have a BVH of triangles and still get the benefits that the ManyLods algorithm brings – being able to render a node when its projected area is small enough that its children can effectively be ignored, allowing the algorithm to skip large sections of costly BVH traversal.

With a triangle BVH you still have a hierarchical representation that can be used to splat points into the view buffers if the projected area of the BVH node becomes smaller than a pixel. The advantage is that when the traversal gets down to the triangle level you can stop traversal and render the triangle directly to the view buffer – even if the node still has a large view-space projected area. This is where ManyLods would carry on traversing down the tree, potentially to the highest detail level points/surfels, as it requires all nodes that are drawn to either be leaf nodes or to be smaller than some metric in view space.

Another disadvantage of pure point based methods is that on modern hardware, rendering points is generally more expensive than rendering triangles, so you only really want to render points if they are going to save you from rendering many small sub-pixel sized triangles.

I call the following algorithm “Many Perfect Shadow Maps” or ManyPSMs for short, as a homage to the Imperfect Shadow Maps and ManyLods algorithms.

BVH GPU Structure

The scene and object instance BVHs are all packed into a single GPU Buffer. Object vertex data is stored in a separate buffer.

After all objects’ BVHs are loaded (they are generated offline), they are packed, one after the other, into the GPU Buffer, keeping track of their root nodes’ child indices. Leaf nodes store triangle indices into the vertex buffer. This part of the buffer is static.

During run-time, the dynamic scene BVH is updated and packed into the dynamic part of the same buffer. While packing the scene BVH, if an object instance is encountered, its world transform is packed into an instance buffer, and the index into that buffer is stored in the packed scene BVH node, along with the child node indices obtained from the children of the object instance's static BVH root node. This allows a single, simple traversal kernel to traverse both the static and dynamic parts of the tree.

Tree Structure.png

The Algorithm

So the algorithm is very similar to ManyLods, iteratively traversing the scene BVH for each view in parallel in a compute shader. This is done using a similar Node-View pattern (except it is extended to handle object instances), so each Node-View stores a Node Index, a View Index, and an Object Instance Index. During each iteration, or compute dispatch, one level of the tree is processed, filling buffers and generating indirect dispatch arguments for the next iteration.

First, the initial Node-Views for the scene are created in a GPU Buffer: one initial Node-View is created for each view in the scene, containing that view's index (1 to n) and the index of the scene BVH's root node (for example, 21 in the image above). The Object Instance Index is initially set to 0xffffffff.

A compute shader is then dispatched, once for each level of the tree. Each thread loads a Node-View, and performs one of five tasks:

  1. Apply Object Instance Transform
    If a Node contains an object instance, the compute kernel needs to load the instance transform and apply it to the view transform. The object instance index is also propagated to any subsequent Node-View traversal steps.
  2. Cull
    Nodes are culled if they are outside of a cone wrapped around the View’s frustum.
  3. Draw Triangle
    If a node contains a triangle, the triangle indices are pushed into an index buffer for later rendering to a triangle splat buffer.
  4. Draw Point
    If a node’s estimated projected area is less than or equal to one pixel, it is pushed into a list of points for later rendering to a point splat buffer.
  5. Traverse
    If a node can’t be drawn or culled, its children are pushed into a list of Node-Views for processing in the next iteration.
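As a CPU-side sketch of the per-level processing above (culling and the instance transform application are omitted for brevity, the projected-area metric is a toy one, and the helper names and node layout are illustrative, not the engine's):

```python
import math

NO_INSTANCE = 0xFFFFFFFF  # sentinel matching the initial Object Instance Index

def projected_area(node, view):
    """Toy projected-area metric: area of the bounding sphere's screen disc."""
    d = math.dist(node["center"], view["origin"])
    if d <= node["radius"]:
        return float("inf")  # camera inside the node: always traverse
    r_px = node["radius"] / d * view["res"]
    return math.pi * r_px * r_px

def process_levels(nodes, views, root_index):
    """One `while` iteration models one compute dispatch over one tree level."""
    current = [(root_index, v, NO_INSTANCE) for v in range(len(views))]
    triangles, points = [], []
    while current:
        next_level = []
        for node_index, view_index, instance in current:
            node, view = nodes[node_index], views[view_index]
            if node.get("instance") is not None:
                instance = node["instance"]          # task 1: pick up instance
            if "triangle" in node:
                triangles.append((node["triangle"], view_index))   # task 3
            elif projected_area(node, view) <= 1.0:
                points.append((node_index, view_index))            # task 4
            else:
                for c in node["children"]:                         # task 5
                    next_level.append((c, view_index, instance))
        current = next_level
    return triangles, points
```

Note how a triangle node ends traversal immediately (task 3), while only sub-pixel internal nodes become points (task 4); everything else keeps descending.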


The shadow maps are splatted/rasterized simultaneously: a single draw call splats all triangles into all views, and a single dispatch splats all points into all views. The splatting happens on a large texture atlas containing all the shadow maps.

A compute shader is dispatched to splat all the points into views on the point splat buffer. Then a draw call is made to splat all of the triangles into all views in the triangle splat buffer. Triangle view-clipping is currently done in the pixel shader, but it would be better to move it to a pre-process or geometry shader, as clipping in the pixel shader is inefficient.

After this, a compute shader is used to merge the results and copy them into cube map arrays for the lights to address while rendering.

Here is another video before the shadow map filtering was fixed:

Future Improvements

  1. Triangle splatting: triangles need to be clipped to each view. This is currently done in the pixel shader, but should really be done in a geometry shader or as a pre-process.
  2. Shadow map filtering: shadows are currently unfiltered. I would like to try a tri-axis cube map blur.
  3. Better shadow map representation: I'm currently using VSM; I would like to try MSM or something else.
  4. Optimizations: There is a lot of optimization that could be done to the shaders and render code to improve performance.
  5. Blend between shadow resolutions using mip-maps.


ManyLods & Imperfect Shadow Maps

If you were interested in these, here are videos for the ManyLods and Imperfect shadow maps algorithms.


Update 1, 27 March 2019: Triangle View-Clipping in the atlas is now done with SV_ClipDistance

Update 2, 27 March 2019: Added shadow map filtering with tri-axis cube map blur.


So I’ve been working on this game engine for a while…


As the title says, I have this game engine, that I call GameRAT (Rapid Architecture Tool, yeah, whatever ;o),  which I’ve been working on for a while… Quite a while… Years… But on and off over the years, and only in my spare time.

Over those years it has become quite mature and fun and easy to use. It has many features, so I'm not going to list them, but basically it's built on top of a C++ reflection system that uses a Clang parser to generate reflection data. The editor, written in C# with WPF, loads the reflection data that forms the basis of most of the editor's functionality.

I’ve tried a few different windowing APIs over the years, starting with my own, then onto Win32, then Winforms, and finally WPF which is great.

Here are some videos showing a few of the features…



Better IBL BRDF convolution

I’ve fixed a few bugs in the BRDF convolution tool, and improved the quality of the convolved IBLs. It can now handle a much wider dynamic range without any sampling noise, it’s a lot faster, and the results are much more accurate.

I reduced the sampling noise by doing a few things. First I started using a good sampling pattern, which is essential; something like the Hammersley 2D pattern works well. Next I switched to a better importance sampling method for the GGX BRDF. And finally, to get rid of the last of the sampling noise, I select a mipmap level based on the GGX probability function, so areas where fewer samples are taken use a lower mipmap resolution.
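For reference, the two sampling pieces mentioned above can be sketched as follows: the Hammersley 2D pattern built from the base-2 radical inverse, and the standard inversion-method GGX half-vector sample (returned in tangent space; the mip-selection step is not shown):

```python
import math

def radical_inverse_vdc(i):
    """Van der Corput radical inverse in base 2 (32-bit bit reversal)."""
    i = (i << 16 | i >> 16) & 0xFFFFFFFF
    i = ((i & 0x55555555) << 1 | (i & 0xAAAAAAAA) >> 1) & 0xFFFFFFFF
    i = ((i & 0x33333333) << 2 | (i & 0xCCCCCCCC) >> 2) & 0xFFFFFFFF
    i = ((i & 0x0F0F0F0F) << 4 | (i & 0xF0F0F0F0) >> 4) & 0xFFFFFFFF
    i = ((i & 0x00FF00FF) << 8 | (i & 0xFF00FF00) >> 8) & 0xFFFFFFFF
    return i * 2.3283064365386963e-10  # divide by 2^32

def hammersley(i, n):
    """i-th point of an n-point Hammersley 2D sequence in [0,1)^2."""
    return (i / n, radical_inverse_vdc(i))

def importance_sample_ggx(u1, u2, alpha):
    """GGX half-vector sample (tangent space) via the standard inversion."""
    phi = 2.0 * math.pi * u1
    cos_theta = math.sqrt((1.0 - u2) / (1.0 + (alpha * alpha - 1.0) * u2))
    sin_theta = math.sqrt(max(0.0, 1.0 - cos_theta * cos_theta))
    return (sin_theta * math.cos(phi), sin_theta * math.sin(phi), cos_theta)
```

In a shader these live in HLSL, of course; the Python here just makes the math easy to check.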

The results are perfectly noise free convolved IBLs, ready to render some really nice materials. Source code is available here.



Quick and Simple IBL Convolution

A TD at work was having problems with the tools he was using to convolve his image-based light (IBL) sources with the GGX BRDF, so I offered to throw something together quickly to do the job.

What he needed was to have the IBLs convolved with GGX and stored as a series of mipmaps in a dds texture. Each mipmap level needed to incrementally change the GGX BRDF’s alpha parameter (from specular to Lambert) so that the IBL could be used to approximate the roughness of the material. The output needed to be encoded as RGBM or some other 8-bit HDR format.
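RGBM packs HDR colour into 8 bits per channel by storing a shared multiplier in the alpha channel. A sketch of one common variant (the multiplier range of 6.0 here is an assumption; pipelines pick their own range):

```python
import math

def rgbm_encode(rgb, m_range=6.0):
    """Encode linear HDR RGB into 8-bit RGBM; M is a shared multiplier
    in [0, m_range]. Values above m_range are clamped."""
    m = max(max(rgb), 1e-6)
    m = min(m, m_range)
    m_enc = math.ceil(m / m_range * 255.0) / 255.0  # quantise M upward
    scale = 1.0 / (m_enc * m_range)
    return tuple(round(min(c * scale, 1.0) * 255.0) for c in rgb) + \
           (round(m_enc * 255.0),)

def rgbm_decode(rgbm, m_range=6.0):
    """Inverse of rgbm_encode: colour * M * range."""
    r, g, b, m = (c / 255.0 for c in rgbm)
    k = m * m_range
    return (r * k, g * k, b * k)
```

Quantising M upward before dividing it back out keeps the round trip from overshooting the representable range.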

I wrote a quick application in C# using SharpDX. It generates samples for the BRDF you want, in this case GGX, and uses a shader to re-sample and render the image for each mipmap level. The mipmaps are saved to a series of tiff files, and then to a single dds file. The code is pretty nasty, and SharpDX and C# are both new to me. It also crashes a lot, but after a try or two it gets the job done.

Here is the source code. The rendering of the model is a bit broken because the normals are incorrect, but the convolved IBLs it generates are good. Here is an example:



Code Generator and Other Fun Stuff

Last week I set out to write yet another code generator, this time for C++, but because of my past experiences (I’ve written one for C# and one for HLSL in the past), I was not really looking forward to the task! My previous code generators were explorations into a deep abyss of hard-coded string concatenations nested between tangles of branches and loops.

It became clear soon after starting down this route again (within half an hour I was already lost in the same mess of loops, branches and concatenations) that my third try at a code generator would need to be drastically different! Going down this path a third time would just result in more unmaintainable, messy code that no one, not even I, would be able to understand in six months' time.

Introducing Jinja2
After some searching and asking around, I was shown down the path of Jinja2. This is a great little templating engine, and pretty much exactly what I needed for my code generator. The only problem is that I’m not a python guy, and I wasn’t up to the task of learning and integrating python into my build process. This is when I came across a tiny, lightweight C++ implementation of Jinja2 by Hugh Perkins on GitHub.

The problem was that it wasn’t quite fully-featured enough! So there was only one option: rip it apart and add the features I needed:

  • Firstly I wanted it to be able to index directly into XML data structures as if they were Jinja variables, using this syntax:
    1. To access an attribute, using:
      {{ XMLElement['AttributeName'] }}
    2. To access a child element:
      {{ XMLElement.ChildElement['AttributeName'] }}
    3. To loop over all child objects with a given name:
      {% for x in XMLObject.Child %} {% endfor %}
    4. To select a child element with a given Attribute value:
      {{ XMLObject.ChildElement['Name'='SomeName'] }}
  • The next thing I needed to do was implement better for loops, if/else, statements, macros, a better expression parser, and a couple of other bits and pieces.

All in all, this gives you a very powerful tool that can be used almost any time you need to generate text data. For example, if you set up an example XML structure like this:


You can use a template like this:

{% for person in XMLData.Person %}
    {% if person['Wife'];'';!= %}
This is {{ person['Name'] }}, he is married to {{ person['Wife'] }}.
On Valentines day, {{ person['Name'] }} buys {{ XMLData.Person['Name'=person['Wife']]['Likes'] }} for {{ person['Wife'] }}
    {% endif %}
    {% if person['Husband'];'';!= %}
This is {{ person['Name'] }}, she is married to {{ person['Husband'] }}.
On Valentines day, {{ person['Name'] }} buys {{ XMLData.Person['Name'=person['Husband']]['Likes'] }} for {{ person['Husband'] }}
    {% endif %}

{% endfor %}

To generate data like this:

This is Sam, he is married to Julie.
On Valentines day, Sam buys Candies for Julie

This is John, he is married to Brenda.
On Valentines day, John buys Flowers for Brenda

This is Brenda, she is married to John.
On Valentines day, Brenda buys Computers for John

This is Julie, she is married to Sam.
On Valentines day, Julie buys Tools for Sam

This is a pretty useless example, but it should illustrate the power of using templates: without changing a single line of code, I can write a template to generate C++, using XML to define my header files. I can generate the headers with all their classes, members, methods, serialization functions, reflection metadata, etc. I can then also write a template to generate C#/Python classes from the same XML, which would allow me to serialize/deserialize objects between C++/C#/Python and vice versa. All without having to write a single line of code!

Another idea (which is what my next post is going to be about) is to use this type of code generation to build a pretty awesome next-gen render material system.

Things to improve
There are still a few features to add, and improvements to make:

  • Add array objects
  • Add number support (doubles only?)
  • Improve expression parsing
    • Replace postfix parser with something better that can handle control statements
    • Replace “special control section parser” with tokenizer and expression parser
    • Use this for evaluating the entire template
  • Add import/include control
  • Add variable set control
  • Try to be more Jinja2 compliant with the syntax
  • Error handling!! Currently the parser is very poor at handling errors.

Source Code + download
Here is the source code for what I’ve done so far; there are a few examples of code generation in there. The code for the Jinja2 parser could be vastly improved, and hopefully that will happen over time as I start using this template generator for more things.

Simple Entropy Coder

I wanted to share my implementation of a simple entropy coder, and to do this, I have created a simple lossy wavelet based data compressor. There are 3 basic steps in the compression algorithm:

  1. Wavelet transform the data
    The wavelet transform code was taken from Parashar Krishnamachari on flipcode (http://www.flipcode.com/archives/1-D_Generalized_Wavelet_Class.shtml). This post is more about the entropy coder than wavelets, so I don’t go into detail here. The basic idea is that you perform a wavelet decomposition on your data samples to obtain your wavelet coefficients. The number of coefficients is the same as the number of samples.

    The purpose of this is to reduce the correlation of the samples in your dataset, which gives you an exponential probability distribution around 0. This is a nice distribution to feed into an entropy coder.

  2. Quantize the wavelet coefficients
    This is the lossy step in the compression algorithm – you have to scale your coefficients to fit into a given range (which alters your compression ratio) and convert them to integers. The entropy coder expects positive integers only, so at the same time as quantization, I convert all of the samples to positive numbers with the formula:
    q = c > 0 ? 2 * c : -2 * c + 1
  3. Entropy code the quantized coefficients
    This is the part of the algorithm that actually does the compressing, and it is a completely lossless process. The entropy coder I have implemented is a run-length Pod bitplane encoder. Bitplane encoders encode the dataset one “bitplane” at a time, starting at the most significant bit of each sample, and proceeding to the next most significant bit once the whole bitplane has been encoded.

    I chose to encode the run-length of 0 bits in the bitplanes using Pod numbers, as this is a pretty simple method and has given me good results compared to other methods I have tried.

    Pod numbers (http://www.firstpr.com.au/audiocomp/lossless/) are similar to Elias gamma codes or Rice/Golomb codes: variable-length prefix codes that assign fewer bits in the output stream to samples of small value, and more bits to samples of large value. This works perfectly with the exponential distribution around 0 that we have for the wavelet coefficients, as many of them will be close to or exactly 0.

    The encoder could be more complex for a higher compression ratio, but this would cause its run-time performance to suffer.
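The step-2 sign mapping and a Pod-like code can be sketched as follows; note that Elias gamma is shown here as a close relative of Pod numbers, not the exact code the compressor uses:

```python
def make_positive(c):
    """Map a signed coefficient to a positive integer, as in step 2:
    q = 2*c for c > 0, and -2*c + 1 otherwise (so 0 -> 1, 1 -> 2, -1 -> 3)."""
    return 2 * c if c > 0 else -2 * c + 1

def elias_gamma(n):
    """Elias gamma code for n >= 1, returned as a bit string: the number
    of leading zeros equals the number of bits after the first '1'."""
    assert n >= 1
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary
```

Small values (which dominate after the wavelet transform) get one- or three-bit codes, which is where the compression comes from.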

I created a simple test pattern image and compressed it to 8.34% of its original size. Here is the uncompressed original on the left, and the result on the right:

Download Code
The code has implementations for compressing 1D and 2D data sets. A 2D wavelet transform can be performed by applying the 1D transform along the horizontal and then the vertical direction. https://www.dropbox.com/s/e92p4xg9kfqzo3h/WaveletCompression.zip?dl=1


Simple Alternative to Clustered Shading for Thousands of Lights

I recently spent some time experimenting with GPU ray traced shadows, and it gave me an idea for a way to handle thousands of dynamic lights in a very simple and easy to implement manner. I couldn’t find any references to anyone doing something similar, so I call it BVH Accelerated Shading. Like Clustered Shading, BVH Accelerated Shading will work with both forward and deferred renderers.

Compared to clustered shading, the idea is simple, both in concept and implementation. It should be very easy to add BVH Accelerated Shading to your engine’s renderer, as it does not require any additional passes or compute shaders. My main contribution is the observation that with the right data-structure you can forgo the costly clustering and light assignment steps, and cheaply cull the lights affecting each pixel directly in the fragment program while shading.

BVH Accelerated Shading can broadly be broken down into three parts, the first of which is already present in most engines:

  1. Build and maintain a spatial hierarchy of lights. I have all my scene objects, including lights, in a single BVH – storing them in a separate tree might be a better idea.
  2. Pack the visible lights from your scene into a BVH of spheres for stackless traversal on the GPU. I use the same tree topology as my scene BVH, but prune branches with no visible lights.
  3. For every pixel, using its world position, traverse the BVH and gather the lighting.

1. Build and maintain a spatial hierarchy of lights
For building and maintaining the BVH of scene objects in my engine, I have extended the ideas from [1] to handle N-ary trees with a maximum branching factor of 8. This was done by modifying the direct insertion cost of nodes with fewer than 8 children.

In my implementation, the BVH is built and maintained on the CPU. This is fast enough in most cases, as most game engines are already doing this.

2. Pack the visible lights from your scene into a “Stackless BVH”
The following packing is also done on the CPU, as it is also generally very fast.

To pack the lights into an appropriate format for fast GPU traversal, I have implemented the ideas from [2]. I start by traversing the scene’s BVH, ignoring any branches that are invisible or contain no lights. For each node encountered during traversal, I pack its data into a structured buffer for GPU traversal in a fragment program. The topology of the stackless BVH is identical to that of my scene’s BVH, barring any pruned branches. I use spheres for the stackless BVH, as this simplifies the GPU traversal algorithm.

The structure of the BVH nodes on the GPU looks like this:

struct sPackedBVHLight
{
    float4 mPositionRadius;
    uint mMissPointer;
    uint mColor;
};

StructuredBuffer<sPackedBVHLight> gLightBVHNodeData;

Note that it might be more efficient to store your lights separately from your other scene objects, as this could lead to a better quality tree for GPU traversal.

Going with the ideas from [2], I bake the Miss-Index (the index of the next node in the traversal if this node is outside of the query region) for each node into the packed BVH, but have noticed that the Hit-Index (the index of the next node in the traversal if this node is inside of the query region) is always the current index plus one, so this is not baked, but rather calculated in the fragment shader during traversal. Baking the Hit and Miss pointers into the packed node data simplifies traversal and removes the need for a stack.
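Baking the miss pointers amounts to a preorder flattening of the tree, where each node's miss pointer is its next sibling, or its parent's miss pointer when it is the last child. A CPU-side sketch (the terminator value mirrors the shader's sentinel; leaf-flagging details are left out):

```python
def bake_pointers(children, root=0):
    """Flatten a tree into preorder with baked miss pointers.

    `children[i]` lists node i's children. Because nodes are emitted in
    preorder, every node's hit pointer is implicitly its packed index + 1,
    so only the miss pointer needs storing. Returns (node, miss) pairs
    in packed order.
    """
    TERMINATOR = 0x0000FFFF
    order, miss = [], {}

    def emit(node, miss_target):
        order.append(node)
        miss[node] = miss_target
        kids = children[node]
        for i, kid in enumerate(kids):
            # a child's miss target is its next sibling, or the parent's
            # miss target when it is the last child
            nxt = kids[i + 1] if i + 1 < len(kids) else miss_target
            emit(kid, nxt)

    emit(root, None)
    packed_index = {n: i for i, n in enumerate(order)}
    resolve = lambda t: TERMINATOR if t is None else packed_index[t]
    return [(n, resolve(miss[n])) for n in order]
```

Traversal then never needs a stack: on a hit you fall through to index + 1, on a miss you jump straight past the whole subtree.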

3. For every pixel, traverse the BVH and gather the lighting.
After retrieving the world position of the pixel, I traverse the BVH, summing up the reflected light (based on material properties) for each light encountered.

The fragment shader traversal follows this structure:

struct sUnpackedBVHLight
{
    uint mHitPointer;
    uint mMissPointer;
    float4 mPositionRadius;
};

sUnpackedBVHLight UnPackLightNode(uint Index)
{
    sPackedBVHLight node = gLightBVHNodeData[Index];
    sUnpackedBVHLight ret = (sUnpackedBVHLight)0;

    // The hit pointer is always the next node in the packed (preorder) layout
    ret.mHitPointer = Index + 1;
    ret.mMissPointer = node.mMissPointer;
    ret.mPositionRadius = node.mPositionRadius;

    return ret;
}

float3 UnPackLightColor(uint Index)
{
    sPackedBVHLight node = gLightBVHNodeData[Index];
    float3 ret = (float3)0;

    ret.r = (node.mColor & 0x00ff0000) >> 16;
    ret.g = (node.mColor & 0x0000ff00) >> 8;
    ret.b = (node.mColor & 0x000000ff);

    return ret / 255.0;
}

float3 GatherLighting(float3 Position, float3 Normal)
{
    float3 lighting = (float3)0;
    uint index = 0;
    do
    {
        sUnpackedBVHLight node = UnPackLightNode(index);

        float3 V = node.mPositionRadius.xyz - Position;
        float d = dot(V, V);

        // Check if the Position is inside the bounding volume
        if (d < node.mPositionRadius.w * node.mPositionRadius.w)
        {
            // Check if this is a leaf node (an actual light)
            if (node.mHitPointer == node.mMissPointer)
                lighting += BRDF(
                    Position, Normal,
                    node.mPositionRadius.xyz, UnPackLightColor(index)); // light position and colour; the full BRDF argument list is engine-specific
            index = node.mHitPointer;
        }
        else
        {
            index = node.mMissPointer;
        }
    } while (index != 0x0000ffff);

    return lighting;
}

Scene with 2700 dynamic lights, most of them visible here.

Scene with 3240 dynamic lights, most of them visible here.

So, I’ve implemented BVH Accelerated Shading in my little engine. It hasn’t got a very efficient renderer, especially when it comes to drawing text and lines (there is no batching either), so I think there are a few bottlenecks in various places. My test scene has 3240 dynamic lights and about 450 objects. The lighting pass also does IBL and uses GGX for the specular reflection on the lights.

In the above screenshot, rendered at 1280×720, most of the lights are visible. I’m using an ATI Radeon R9 280X. Given all that, according to AMD PerfStudio, it takes about 22ms to fill my g-buffers (terrible, I know!), and about 5ms to calculate the lighting. Here you can download the .pro file from AMD PerfStudio that I captured just after taking the above screenshot, or see the screen-grabs here:

perf_view1 perf_view2 perf_view3

SB4 is filling the g-buffers and SB5 is doing the full screen lighting pass. I’m no expert, but that seems pretty fast to me, for very little effort in terms of implementation.

I know it’s not a direct comparison, but the results from the Necropolis test scene with 2500 lights using Clustered Shading, reports timings of 2.3ms for clustering view samples, 1.6ms for light assignment, and 5.6ms on shading. I’m not sure what GPU or shading model they are using.

You can download the demo here. The demo requires DirectX 11 and is compiled for Shader Model 4. Run AppLoader_Win32.exe, the controls are WASD and right mouse button to rotate the view. The scroll wheel changes your movement speed. I apologize for it taking so long to load (about 30s), I’m working on that and will upload a fix soon.

Pros of BVH Accelerated Lighting:

  • Very easy to implement, minimally invasive on your render pipeline
  • Doesn’t suffer from inherent depth complexity issues as in 2d/3d tiled methods
  • Doesn’t require a compute shader or any additional passes
  • Can be used with Forward and Deferred renderers


Cons of BVH Accelerated Lighting:

  • Requires tree traversal, which has logarithmic complexity in the number of lights, whereas clustered shading with a 3D texture has more or less constant lookup time.

Ideas and Extensions:
The first thing that comes to mind: how do you handle other light types, like a spot light? The simplest way would be to use the minimum bounding sphere of whatever light bounding volume you have – so the minimum bounding sphere of the sphere-capped cone of the spotlight.
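The minimum bounding sphere of a sphere-capped cone has a standard closed form, sketched here (the direction is assumed normalised; this is an illustration, not code from the engine):

```python
import math

def cone_bounding_sphere(apex, direction, range_, half_angle):
    """Minimum bounding sphere of a sphere-capped spot-light cone.

    For half-angles over 45 degrees the sphere is centred on the axis at
    range*cos(angle) and passes through the cap rim; for narrower cones it
    passes through both the apex and the rim.
    """
    if half_angle > math.pi / 4.0:
        center_dist = range_ * math.cos(half_angle)
        radius = range_ * math.sin(half_angle)
    else:
        center_dist = range_ / (2.0 * math.cos(half_angle))
        radius = center_dist
    center = tuple(a + center_dist * d for a, d in zip(apex, direction))
    return center, radius
```

The two cases agree exactly at 45 degrees, and the narrow-cone case always contains the apex, so the sphere is a conservative stand-in for the cone during traversal.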

Other bounding volume types could also be used, or even other trees such as octrees or kd-trees. There are quite a few unused bits in the packed light node structure, so one could use them as flags for the node’s BV type.

You could use separate structures of arrays rather than arrays of structures to better pack the data when using multiple light types. You could have a separate array for each light type, using some flag bits to select one.

If you’re generating virtual point lights on the GPU, then you could also build your BVH on the GPU, there are many fast methods to do that.

[1] Jirí Bittner, Michal Hapala and Vlastimil Havran – Fast Insertion-Based Optimization of Bounding Volume Hierarchies.
[2] Charles Lohr – GPU-Based Parallel Stackless BVH Traversal for Animated Distributed Ray Tracing.