Simple Alternative to Clustered Shading for Thousands of Lights

Introduction
I recently spent some time experimenting with GPU ray-traced shadows, and it gave me an idea for a way to handle thousands of dynamic lights in a very simple, easy-to-implement manner. I couldn’t find any references to anyone doing something similar, so I call it BVH Accelerated Shading. Like Clustered Shading, BVH Accelerated Shading works with both forward and deferred renderers.

Compared to clustered shading, the idea is simple, both in concept and implementation. It should be very easy to add BVH Accelerated Shading to your engine’s renderer, as it does not require any additional passes or compute shaders. My main contribution is the observation that, with the right data structure, you can forgo the costly clustering and light-assignment steps and cheaply cull the lights affecting each pixel directly in the fragment program while shading.

BVH Accelerated Shading can broadly be broken down into three parts, the first of which is already present in most engines:

  1. Build and maintain a spatial hierarchy of lights. I have all my scene objects, including lights, in a single BVH – storing them in a separate tree might be a better idea.
  2. Pack the visible lights from your scene into a BVH of spheres for stackless traversal on the GPU. I use the same tree topology as my scene BVH, but prune branches with no visible lights.
  3. For every pixel, using its world position, traverse the BVH and gather the lighting.

1. Build and maintain a spatial hierarchy of lights
For building and maintaining the BVH of scene objects in my engine, I have extended the ideas from [1] to handle N-ary trees with a maximum branching factor of 8. This was done by modifying the direct insertion cost of nodes with fewer than 8 children.

In my implementation, the BVH is built and maintained on the CPU. This is fast enough in most cases – most game engines already maintain a similar structure on the CPU for culling.
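As a rough illustration of the insertion idea (a much-simplified sketch, not the exact algorithm from [1]; all names here are hypothetical), a greedy insertion can descend the tree by picking the child whose bounds would grow the least:

```python
# Hypothetical sketch of greedy BVH insertion for lights, loosely inspired
# by the insertion-based optimization in [1] (much simplified).

class Node:
    def __init__(self, lo, hi, children=None):
        self.lo, self.hi = lo, hi            # AABB corners
        self.children = children or []       # up to 8 children (N-ary tree)

def surface_area(lo, hi):
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def union(a_lo, a_hi, b_lo, b_hi):
    return ([min(a_lo[i], b_lo[i]) for i in range(3)],
            [max(a_hi[i], b_hi[i]) for i in range(3)])

def cheapest_child(node, lo, hi):
    """Pick the child whose surface area grows the least if it absorbs
    the new light's bounds (lo, hi)."""
    def growth(child):
        u_lo, u_hi = union(child.lo, child.hi, lo, hi)
        return surface_area(u_lo, u_hi) - surface_area(child.lo, child.hi)
    return min(node.children, key=growth)
```

A full implementation would also split nodes that exceed the branching factor of 8 and refit ancestor bounds after each insertion.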

2. Pack the visible lights from your scene into a “Stackless BVH”
The following packing is also done on the CPU, as it is generally very fast.

To pack the lights into a format suitable for fast GPU traversal, I have implemented the ideas from [2]. I start by traversing the scene’s BVH, ignoring any branches that are invisible or contain no lights. For each node encountered during traversal, I pack its data into a structured buffer for GPU traversal in a fragment program. The topology of the stackless BVH is identical to that of my scene’s BVH, barring any pruned branches. I use spheres for the stackless BVH, as this simplifies the GPU traversal algorithm.

The structure of the BVH nodes on the GPU looks like this:

struct sPackedBVHLight
{
    float4 mPositionRadius;
    uint mMissPointer;
    uint mColor;
};

StructuredBuffer<sPackedBVHLight> gLightBVHNodeData;

Note that it might be more efficient to store your lights separately from your other scene objects, as this could lead to a better quality tree for GPU traversal.

Following the ideas from [2], I bake the Miss-Index (the index of the next node in the traversal if the query point lies outside this node’s bounding volume) for each node into the packed BVH. I have noticed that the Hit-Index (the index of the next node if the query point lies inside the bounding volume) is always the current index plus one – a consequence of packing the nodes in depth-first order – so it is not baked, but rather computed in the fragment shader during traversal. Baking the Hit and Miss indices into the packed node data simplifies traversal and removes the need for a stack.
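A minimal CPU-side sketch of this packing step (hypothetical names; the tree is a nested dict, and 0xFFFF is the end-of-traversal sentinel): because nodes are appended in depth-first order, each node’s miss pointer is either its next sibling’s index or its parent’s own miss target.

```python
# Hypothetical sketch: flatten a light BVH into depth-first order, baking
# each node's miss pointer. The hit pointer is implicitly index + 1.

END_SENTINEL = 0xFFFF  # terminates the GPU traversal loop

def subtree_size(node):
    return 1 + sum(subtree_size(c) for c in node.get("children", []))

def flatten(node, out, miss=END_SENTINEL):
    """Append `node` and its subtree to `out`; `miss` is where the
    traversal jumps when the query point lies outside this node's sphere."""
    out.append({"sphere": node["sphere"], "miss": miss})
    children = node.get("children", [])
    next_index = len(out)
    for i, child in enumerate(children):
        next_index += subtree_size(child)
        # Missing a child skips to its next sibling; after the last
        # sibling, fall through to this node's own miss target.
        child_miss = next_index if i + 1 < len(children) else miss
        flatten(child, out, child_miss)
    return out
```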

3. For every pixel, traverse the BVH and gather the lighting.
After retrieving the world position of the pixel, I traverse the BVH, summing up the reflected light (based on material properties) for each light encountered.

The fragment shader traversal follows this structure:

struct sUnpackedBVHLight
{
    uint mHitPointer;
    uint mMissPointer;
    float4 mPositionRadius;
};

sUnpackedBVHLight UnPackLightNode(uint Index)
{
    sPackedBVHLight node = gLightBVHNodeData[Index];
    sUnpackedBVHLight ret = (sUnpackedBVHLight)0;

    ret.mHitPointer = Index + 1;
    ret.mMissPointer = node.mMissPointer;
    ret.mPositionRadius = node.mPositionRadius;

    return ret;
}

float3 UnPackLightColor(uint Index)
{
    sPackedBVHLight node = gLightBVHNodeData[Index];
    float3 ret = (float3)0;

    ret.r = (node.mColor >> 16) & 0xff;
    ret.g = (node.mColor >> 8) & 0xff;
    ret.b = node.mColor & 0xff;

    return ret / 255.0;
}

float3 GatherLighting(float3 Position, float3 Normal)
{
    float3 lighting = (float3)0;
    uint index = 0;
    do
    {
        sUnpackedBVHLight node = UnPackLightNode(index);

        float3 V = node.mPositionRadius.xyz - Position;
        float d = dot(V, V);

        // Check if the Position is inside the bounding volume
        if (d < node.mPositionRadius.w*node.mPositionRadius.w)
        {
            // Check if this is a leaf node
            if (node.mHitPointer == node.mMissPointer)
            {
                lighting += BRDF(
                    Position, Normal,
                    node.mPositionRadius,
                    UnPackLightColor(index)
                );
            }
            index = node.mHitPointer;
        }
        else
        {
            index = node.mMissPointer;
        }

    } while (index != 0x0000ffff);

    return lighting;
}
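To make the control flow concrete, the same stackless loop can be emulated on the CPU. In this Python sketch (illustrative only; shading is reduced to collecting light indices), the traversal ends when the index runs past the buffer rather than at the 0xFFFF sentinel, which also keeps the hit == miss leaf test valid for the final node:

```python
# CPU emulation of the stackless traversal loop (illustrative only).
# Each packed node is (cx, cy, cz, radius, miss); hit is implicitly index + 1.

def gather_lights(nodes, p):
    """Return indices of leaf lights whose bounding sphere contains point p."""
    hits, index, end = [], 0, len(nodes)
    while index != end:
        cx, cy, cz, r, miss = nodes[index]
        dx, dy, dz = cx - p[0], cy - p[1], cz - p[2]
        if dx * dx + dy * dy + dz * dz < r * r:   # point inside the sphere
            if index + 1 == miss:                 # leaf: hit and miss coincide
                hits.append(index)                # a real shader runs the BRDF here
            index += 1                            # hit: descend to first child
        else:
            index = miss                          # miss: skip this whole subtree
    return hits
```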

Results:

Scene with 2700 dynamic lights, most of them visible here.

Scene with 3240 dynamic lights, most of them visible here.

So, I’ve implemented BVH Accelerated Shading in my little engine. Its renderer isn’t very efficient, especially when it comes to drawing text and lines (there is no batching either), so there are likely a few bottlenecks in various places. My test scene has 3240 dynamic lights and about 450 objects. The lighting pass also does IBL and uses GGX for the specular reflection of the lights.

In the above screenshot, rendered at 1280×720, most of the lights are visible. I’m using an AMD Radeon R9 280X. Given all that, according to AMD PerfStudio, it takes about 22ms to fill my g-buffers (terrible, I know!) and about 5ms to calculate the lighting. Here you can download the .pro file from AMD PerfStudio that I captured just after taking the above screenshot, or see the screen-grabs here:

perf_view1 perf_view2 perf_view3

SB4 is filling the g-buffers and SB5 is doing the full screen lighting pass. I’m no expert, but that seems pretty fast to me, for very little effort in terms of implementation.

I know it’s not a direct comparison, but the results from the Necropolis test scene with 2500 lights using Clustered Shading report timings of 2.3ms for clustering view samples, 1.6ms for light assignment, and 5.6ms for shading. I’m not sure what GPU or shading model they used.

Demo:
You can download the demo here. The demo requires DirectX 11 and is compiled for Shader Model 4. Run AppLoader_Win32.exe; the controls are WASD, with the right mouse button to rotate the view. The scroll wheel changes your movement speed. I apologize for it taking so long to load (about 30s) – I’m working on that and will upload a fix soon.

Pros of BVH Accelerated Shading:

  • Very easy to implement, minimally invasive on your render pipeline
  • Doesn’t suffer from the inherent depth-complexity issues of 2D/3D tiled methods
  • Doesn’t require a compute shader or any additional passes
  • Can be used with Forward and Deferred renderers

Cons:

  • Requires a tree traversal, which has logarithmic complexity in the number of lights, whereas clustered shading with a 3D texture is more or less a constant-time lookup.

Ideas and Extensions:
The first thing that comes to mind is how to handle other light types, such as spotlights. The simplest way would be to use the minimum bounding sphere of whatever bounding volume the light has – so, for a spotlight, the minimum bounding sphere of its sphere-capped cone.
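For a spotlight specifically, a common construction for that minimum bounding sphere looks like this (a sketch under the assumption of a sphere-capped cone with a unit-length axis; the parameter names are mine):

```python
import math

def spot_bounding_sphere(apex, direction, light_range, half_angle):
    """Minimum bounding sphere of a sphere-capped spotlight cone.
    `direction` must be unit length; `half_angle` is in radians."""
    if half_angle > math.pi / 4.0:
        # Wide cone: center the sphere on the plane of the cap rim.
        center_dist = light_range * math.cos(half_angle)
        radius = light_range * math.sin(half_angle)
    else:
        # Narrow cone: the sphere passes through the apex and the cap rim.
        center_dist = light_range / (2.0 * math.cos(half_angle))
        radius = center_dist
    center = tuple(a + center_dist * d for a, d in zip(apex, direction))
    return center, radius
```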

Other bounding volume types could also be used, or even other trees such as octrees or k-d trees. There are quite a few unused bits in the packed light node structure, so one could use them as flags for the node’s BV type.

You could use separate structures of arrays rather than arrays of structures to better pack the data when using multiple light types. You could have a separate array for each light type, using some flag bits to select one.
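For instance (a hypothetical layout): since the traversal sentinel is 0xFFFF, the miss index only needs the low 16 bits of mMissPointer, leaving the upper bits free to carry a light-type flag:

```python
# Hypothetical bit layout: miss index in the low 16 bits (the 0xFFFF
# sentinel fits there), light-type flag in the bits above.

TYPE_SHIFT = 16
TYPE_MASK = 0xF                      # room for 16 light types
LIGHT_POINT, LIGHT_SPOT, LIGHT_AREA = 0, 1, 2

def pack_miss_and_type(miss_index, light_type):
    assert miss_index <= 0xFFFF and light_type <= TYPE_MASK
    return (light_type << TYPE_SHIFT) | miss_index

def unpack_miss(packed):
    return packed & 0xFFFF

def unpack_type(packed):
    return (packed >> TYPE_SHIFT) & TYPE_MASK
```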

If you’re generating virtual point lights on the GPU, then you could also build your BVH on the GPU, there are many fast methods to do that.

References:
[1] Jirí Bittner, Michal Hapala and Vlastimil Havran – Fast Insertion-Based Optimization of Bounding Volume Hierarchies.
[2] Charles Lohr – GPU-Based Parallel Stackless BVH Traversal for Animated Distributed RayTracing.

 


25 thoughts on “Simple Alternative to Clustered Shading for Thousands of Lights”

  1. Yuriy O'Donnell says:

    Neat! Great job. Elegant and simple idea. I’m surprised that it hasn’t come up before in real-time rendering context. It would be nice to spend some additional time and do an in-depth performance analysis and comparison between tile-based deferred, tile base forward+ and clustered lighting.


    • wolrdoffries says:

      Thanks for the comment!
      I’m also surprised about that, I was thinking of putting as one of the cons: “Why hasn’t anyone done this before?”
      As for the performance comparison, I’ve been wanting to, but I don’t have a lot of free time. Implementing one of the other tile-based methods would take a bit longer than BVH Accelerated Shading, so I’m thinking of downloading Ola Olsson’s demo code and integrating it into that.


  2. Very interesting. 🙂

    Re: clustered shading – we were using a combination of OpenGL and CUDA on a NVIDIA GTX 480 for the timings in the paper. We were running at 1920×1080 IIRC.

    Regarding the cons. The (original) clustered shading method also does a tree traversal, albeit instead of traversing single pixels directly in the pixel shader, we traverse groups of pixels (the clusters) against the tree in an intermediate pass and store the results in an easily accessed texture. (Our light-BVH probably isn’t very high quality, though, it’s more one of those quick but dirty things.)


  3. Hi. great to see other people exploring this area, and thanks for sharing your results!

    I have a few comments and a question.

    The question is what you mean by “Doesn’t suffer from inherent depth complexity issues as in 2d/3d tiled methods”. I understand your technique does not, but if by 3D tiled you mean clustered then that seems a bit strange as that is pretty much the core motivation behind it.

    Our performance figures as quoted in the film are measured on a GTX 480 and in 1080p. Anyway, it’s very hard to compare these things since the main factor will be how many lights affect each pixel. “Funnily” enough, it appears we managed to not put the 1080p figure in the paper AND get that through peer review! (Embarrassed!)

    A more up to date performance comparison would probably be against the implementation by Emil Persson at Avalanche, check out his slides from our course at SIGGRAPH Asia:

    http://www.cse.chalmers.se/~olaolss/main_frame.php?contents=publication&id=efficient_shading_sa2014

    Now about the “why not before”:
    Our 2012 paper actually does use a light hierarchy, but we traverse it per cluster rather than per pixel. The code is also in the demo, in the CUDA portion. In more recent work we’ve thrown out the hierarchy as maintaining it for dynamic lights is simply more expensive than other methods like what Emil describes, or the fallback method in the demo, or just brute forcing light assignment.

    A simple heuristic is that when you have many more pixels than lights, a hierarchy over the lights is probably not going to be the best option.

    That said, I think it is interesting to consider solving the light assignment in world space. For static lights this has obvious benefits. I still suspect that a straight BVH is never going to win, simply because the proportion of work and bandwidth performing the lookup is probably going to be much larger than the actual shading (unless that is very expensive). But another structure with cheaper look-up costs could work quite well.

    Also: My demo code is NOT built for speed, so go easy with the performance comparisons there 🙂


    • Luke Mamacos says:

      Hi, Thanks for the reply!

      What I meant by “Doesn’t suffer from inherent depth complexity issues…” applies more to 2D tiled than to 3D/clustering methods, but even so, clustering is trying to solve the depth complexity issue; with my method, there is no issue to solve. But maybe the comment shouldn’t have included 3D/clustering methods 🙂

      I know your implementation and Emil’s would probably handle a lot more fast moving dynamic lights than mine, but my motivation was more for simplicity with at least thousands of lights. I can’t imagine games that will need more lights unless they’re doing VPL/VAL GI or something like that. The benefit of doing it this way is also that, since the cost of updating the BVH is roughly a function of the number of moving lights and their velocity, and since most lights in games are static or moving slowly, the cost of maintaining the BVH should be pretty reasonable. I have read Emil’s slides and there were not enough details about the performance results for me to make a judgement against his method.

      I’m not sure I agree with this though: “A simple heuristic is that when you have many more pixels than lights, a hierarchy over the lights is probably not going to be the best option.”
      With my method, there is no constant setup time (no cost finding clusters for instance), the cost essentially decreases to 0 as the number of lights decreases to 0. If there is 1 light in the scene, the traversal will traverse 1 BVH node, and then shade the pixel with the data in the node. Maybe I’m missing something here!

      I do agree with you that a straight BVH will never win for a ridiculously large number of lights, but if I were to speculate, from what I’ve seen, my guess is that my method is going to be faster than clustering for up to a few thousand lights (depending on hardware etc.), and then clustering will win out for more lights than that.

      I did try running your demo code, but I’m not sure it supports ATI?

      -Luke


      • Hi again,

        Let me start with explaining the heuristic (or rule of thumb) a bit: If we have L lights and P pixels. We could put a hierarchy over the lights arriving at log(L) * P, or over the pixels L * log(P). So, if P >> L (i.e., a lot larger) then log(L)*P >> L*log(P). In other words it is probably more beneficial grouping the pixels when there are a few million of these and just a few thousand lights.

        Regarding performance, the original clustered shading is pretty much the same as what you describe with the exception that we invest some time up-front to cut the P number by about two orders of magnitude. And then there is the up-front binning method (championed by Emil) which is usually much faster, when you don’t have huge numbers of lights (and when you don’t need the occupied clusters for something else).

        So I have a somewhat hard time seeing that the direct BVH is going to be faster except for very few lights. The only way to be sure where the crossover point is is to measure, a lot! But I’m not too optimistic.

        The demo did once upon a time work on ATI (first it didn’t and then it did), but I have not had any AMD hardware to test with for a long time, so it was never really properly debugged. The code falls back to pure OpenGL and CPU if there is no NVIDIA card handy, so it ought to work, but OpenGL drivers are not what they should be…


      • Luke Mamacos says:

        Ah yeah, I see what you mean. But I think there is still quite a high setup cost for clustered shading which there isn’t for BVH Accelerated Shading. With that, along with the tighter culling of lights (you’re still looping over lights that might not affect the pixel, incurring bandwidth penalties I guess), the crossover point might be higher than expected.

        I got your demo code running last night, so I will spend some time tonight trying to get BVH accelerated shading working in there. Soon we shall have a better comparison, I will upload the code for everyone who is interested.


      • As to the setup cost, we got it down to under one ms for a full hierarchy of clusters with explicit bounding boxes starting at 8×8 pixels on a Titan in this paper:
        http://www.cse.chalmers.se/~olaolss/main_frame.php?contents=publication&id=clustered_ptsv_2014

        And if you use the full grid up-front method, it needs not be done at all. Then you are just paying for light assignment.

        And: “incurring bandwidth penalties I guess”. Not really, this is a design point in the algorithm. Since the lists of lights are shared between many neighboring samples the incurred bandwidth overhead is tiny (because of coherency, caches and whatnot). What is wasted is compute power, testing/shading lights that do not contribute. The point of this is that modern GPUs have far more compute power than bandwidth, and more importantly this difference is increasing with each generation.

        Great that you got the demo running! Did you have to do any fixes? If so, I’d love to put them into the demo so it runs on AMD out of the box.


      • Luke Mamacos says:

        I didn’t have to fix any code; I think what broke it was the upgrade from vs2008 (?) to vs2013 – some of the project properties needed to be fixed up.

        There is a small bug in the demo though: when you go close to the object, certain lights will disappear. It might be a good idea to get a more realistic lighting model in there too, which has a better chance of pushing the compute power a bit harder. Though, as you say, compute power is not an issue now, it might become one if the shading were more complex and you’re shading a lot of lights that don’t contribute to the pixel.


  4. John van der Burg says:

    I did implement some clustered shading in my mini engine. I use a bounding sphere around all light types, then find the bounds in clip space and update the grid cells for clustering like that. All of that is done on the GPU (using compute shaders). Maybe we can compare against that, though my implementation isn’t going to be the best (for performance) either, as it handles several light types (point, spot, area rect, area tube, area spherical). But maybe if we use the same test scene and point-light positions/animation we can compare it a bit too? I don’t cull the lights upfront with the frustum etc. though.


  5. Nice!
    Some questions:
    1. What’s your worst case number of lights per shaded fragment in a scene like that?
    2. How much time does it take to rebuild the BVH?

    Here’s a quick implementation inspired by Emil’s clustered shading I did for a course: http://www.kevinortegren.com/portfolio/implicit-clustered-deferred-shading/

    Note that some of the details in that paper have been changed (as they were wrong/bad).
    It’s more or less a brute force CPU light assignment implementation where all lights are tested against iterated x and y planes to find the clusters for each light. It’s far from optimal, but it is really simple to implement and understand.

    I also ran 3000 point lights to check the performance (GTX 970, i7-4790K, 16GB RAM, 1536×768 resolution, only ambient, diffuse and linear attenuation for lights).

    All lights move around randomly at all times. Timings in ms are in the pictures (basically worst-case times).

    Sorry for the jpeg compression (thanks, imgur) and the badly over-lit scene (it’s a tight space).

    If I have time I will clean up my source some and host it on github this week.


  6. Ignitron says:

    Without sample source code I can’t follow up on what the real advantages of this are. The demo binary doesn’t work without a redist; I don’t know exactly which one, maybe the vs2013 C++ redistributable?


  7. Phil says:

    Hi Luke, I just stumbled across this article this evening and find it quite interesting as I am just getting up to speed on cluster shading.
    Any chance we will see the demo code and more comparison with cluster shading?


  8. Hi, just came across this article while looking for more info on clustered lighting. I couldn’t help but notice that there is a small error in your UnPackLightColor function. The result should be divided by:

    ret /= float3( 256 * 256 * 255, 256 * 255, 255 );


    Thank you for this post! I tried the technique and now I can render quite a lot of lights pretty fast. I can’t really use something like Forward+ (it’s just a mod), so it’s certainly much better than just going through a list and checking bounding spheres for each pixel.


    • Luke Mamacos says:

      Hi! Thanks for the interest. I can probably help. Do you have some specific questions about the implementation details?


  10. Jonathan Sandusky says:

    I like using this for VR rendering currently. Being tied to world-space is incredibly valuable there as tiled/clustered require doubling all of the work for sets of per-eye data or scores of false positives everywhere in any sharing-cells-between-eyes scheme.

    I opted to do the BVH walk in either compute or vertex shader (CS if compute skinning / particles / etc, VS for everything else, doing manual pulling I test the whole triangle) to pump a Drobot-style light-bitmask into the vertex data instead of doing the tree walk in pixel-shader.

    I found spotlights to be sufficiently rare that false positives are insignificant. It was overkill, but I split them into 2 spheres, since I’m building a bitmask and OR’ing in the same bit twice is harmless.


    • Luke Mamacos says:

      It’s nice to see someone making use of this idea! I really like the optimization of filling out light bit-masks in the vertex shader, that sounds like a massive performance improvement!

