Finding the Bottleneck

The absolute best tool to have in your repertoire for optimizing your rendering is finding out why your rendering is slow.

GPUs are designed as a pipeline. Each stage in the pipeline is functionally independent from the others. A vertex shader can be computing some number of vertices, while the clipping and rasterization are working on other triangles, while the fragment shader is working on fragments generated by still other triangles.

However, a vertex generated by the vertex shader cannot pass to the rasterizer if the rasterizer is busy. Similarly, the rasterizer cannot generate more fragments if all of the fragment shaders are in use. Therefore, the overall performance of the GPU can only be the performance of the slowest step in the pipeline.
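The "slowest step wins" rule can be illustrated with a toy model. This is only a sketch: the `Stage` struct, the `findBottleneck` function, and the idea of expressing every stage's rate in triangles per millisecond are assumptions made for illustration, and real GPUs do not expose such numbers directly.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One pipeline stage and the rate it could sustain if its neighbours
// never stalled it. The rates are hypothetical, not measured values.
struct Stage {
    std::string name;
    double trianglesPerMs;
};

// In steady state a pipeline moves work only as fast as its slowest
// stage, so the bottleneck is the stage with the minimum rate.
// Precondition: `stages` is not empty.
const Stage& findBottleneck(const std::vector<Stage>& stages) {
    return *std::min_element(
        stages.begin(), stages.end(),
        [](const Stage& a, const Stage& b) {
            return a.trianglesPerMs < b.trianglesPerMs;
        });
}
```

In this model, speeding up any stage other than the one returned by `findBottleneck` leaves the overall rate unchanged.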

This means that, in order to actually make the GPU faster, you must find the particular stage of the pipeline that is the slowest. This stage is referred to as the bottleneck. Until you know what the bottleneck is, the most you can do is guess at why things are slower than you think they should be. And making major code changes based purely on a guess is probably not something you should do. At least, not until you have a lot of experience with the GPU(s) in question.

It should also be noted that bottlenecks are not consistent throughout the rendering of a single frame. Some parts of it can be CPU bound, others can be fragment shader bound, etc. Thus, attempt to find particular sections of rendering that likely have the same problem before trying to find the bottleneck.

Measuring Performance

The performance statistic most people quote is frames per second (FPS). While this is useful when talking to the lay person, a graphics programmer does not use FPS as their standard performance metric. It is the overall goal, but when measuring the actual performance of a piece of rendering code, the more useful metric is simply time. This is usually measured in milliseconds (ms).

If you are attempting to maintain 60fps, that translates to having 16.67 milliseconds to spend performing all rendering tasks.

One thing that confounds performance metrics is the fact that the GPU is both pipelined and asynchronous. When running regular code, if you call a function, you're usually assured that the actions the function took have all completed when it returns. When you issue a rendering call (any glDraw* function), not only is it likely that rendering has not completed by the time it has returned, it is very likely that rendering has not even started. Not even doing a buffer swap will ensure that the GPU has finished, as GPUs can wait to actually perform the buffer swap until later.

If you specifically want to time the GPU, then you must force the GPU to finish its work. To do that in OpenGL, you call a function cleverly titled glFinish. It will return sometime after the GPU finishes. Note that it does not guarantee that it returns immediately after, only at some point after the GPU has finished all of its commands. So it is a good idea to give the GPU a healthy workload before calling glFinish, to minimize the difference between the time you measure and the time the GPU actually took.
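A minimal timing sketch along these lines follows. The name `timeRenderingMs` is hypothetical, and the function takes two callables rather than calling OpenGL directly so that it can be compiled without a GL context. In a real program, `finish` would be a lambda that calls glFinish and `submit` would issue your glDraw* calls.

```cpp
#include <chrono>

// Runs `submit` (which issues rendering commands), then `finish` (which
// must block until the GPU is done, e.g. by calling glFinish), and
// returns the elapsed wall-clock time in milliseconds.
template <typename Submit, typename Finish>
double timeRenderingMs(Submit&& submit, Finish&& finish) {
    using Clock = std::chrono::steady_clock;

    // Drain earlier work first so it is not billed to this measurement.
    // This extra call is a design choice of the sketch.
    finish();

    const auto start = Clock::now();
    submit();
    finish();
    const auto end = Clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

The measured time includes the CPU cost of issuing the commands plus whatever delay there is between the GPU finishing and `finish` returning, which is why a large workload gives a more trustworthy number.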

You will also want to turn vertical synchronization, or vsync, off. There is a certain point during which a graphics chip is able to swap the front and back framebuffers with a guarantee of not causing half of the displayed image to be from one buffer and half from another. The latter eventuality is called tearing, and having vsync enabled avoids that. However, you do not care about tearing; you want to know about performance. So you need to turn off any form of vsync.

Vsync is controlled by the window-system specific extensions GLX_EXT_swap_control and WGL_EXT_swap_control. They both do the same thing and have similar APIs. The wglSwapIntervalEXT and glXSwapIntervalEXT functions both accept an integer interval that tells how many vsyncs to wait between swaps. If you pass 0, the swap happens immediately, without waiting for vsync.

Possible Bottlenecks

There are several potential bottlenecks that a section of rendering code can have. We will list them, along with ways of determining whether each one is the bottleneck. You should test these in the order presented below.

Fragment Processing

This is probably the easiest to find. The quantity of fragment processing you have depends entirely on the number of fragments the various triangles are rasterized to. Therefore, simply increase the resolution. If you double the number of pixels (by doubling either the width or the height) and the time to render also doubles, then you are fragment processing bound.

Note that rendering time will go up when you increase the resolution. What you are interested in is whether it goes up linearly with the number of fragments rendered. If the rendering time only goes up by 1.2x with a 2x increase in number of fragments, then the code was not entirely fragment processing bound.

Vertex Processing

If you are not fragment processing bound, then there's a good chance you are vertex processing bound. After ruling out fragment processing, simply turn off all fragment processing. If this does not increase your performance significantly (there will generally be some change), then you were vertex processing bound.

To turn off fragment processing, simply call glEnable(GL_RASTERIZER_DISCARD). This causes all primitives to be discarded before rasterization, so no fragments are ever generated. Obviously, nothing will be rendered, but all of the steps before rasterization will still be executed. Therefore, your performance timings will be for vertex processing alone.
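Once you have timed the same section with and without rasterizer discard, the comparison can be written down as follows. Both function names are hypothetical, and the 0.8 cutoff for "does not increase your performance significantly" is an arbitrary illustrative value, not something the test itself prescribes.

```cpp
// Share of the baseline time that remains when rasterization is
// discarded, i.e. the cost of everything before the rasterizer.
double preRasterShare(double baselineMs, double discardMs) {
    return discardMs / baselineMs;
}

// Applies the test described above: if disabling fragment processing
// barely helps, the section looks vertex processing bound.
// `minShare` is an ASSUMED cutoff for "barely helps".
bool looksVertexBound(double baselineMs, double discardMs,
                      double minShare = 0.8) {
    return preRasterShare(baselineMs, discardMs) >= minShare;
}
```

This check is only meaningful after fragment processing has already been ruled out by the resolution test.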

CPU

A CPU bottleneck means that the GPU is being starved; it is consuming data faster than the CPU is providing it. You do not really test for CPU bottlenecks per se; they are discovered by process of elimination. If nothing else is bottlenecking the GPU, then the CPU clearly is not giving it enough work to do.

Unfixable Bottlenecks

It is entirely possible that you cannot fix a bottleneck. Maybe there's simply no way to avoid a vertex-processing heavy section of your renderer. Perhaps you need all of that fragment processing in a certain area of rendering.

If there is some bottleneck that cannot be optimized away, then turn it to your advantage by increasing the complexity of the other stages in the pipeline. If you have an unfixable CPU bottleneck, then render more detailed models. If you have a vertex-shader bottleneck, improve your lighting by adding some fragment-shader complexity. And so forth. Just make sure that you do not increase complexity to the point where you move the bottleneck and make things slower.
