Object Optimizations

These optimizations all have to do with the concept of objects. An object, for the purpose of this discussion, is a combination of a mesh, program, uniform data, and set of textures used to render some specific thing in the world.

Object Culling

A virtual world consists of many objects. The more objects we draw, the longer rendering takes.

One major optimization is also a very simple one: render only what must be rendered. There is no point in drawing an object in the world that is not actually visible. Thus, the task is to determine, for each object, whether it would be visible; if it would not, then it is not rendered. This process is called visibility culling or object culling.

As a first pass, we can say that objects that are not within the view frustum are not visible. This is called frustum culling, for obvious reasons. Determining that an object is off screen is generally a CPU task. Each object must be represented by a simple volume, such as a sphere or camera-space box. These volumes are used because they are relatively easy to test against the view frustum; if a volume is at least partially within the frustum, then the corresponding object is considered visible.
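The sphere test described above can be sketched in a few lines. This is a minimal illustration, not code from this book: the Plane, Sphere, and Frustum types and the makeBoxFrustum helper are hypothetical, and the planes are assumed to be normalized with their normals pointing into the frustum.

```cpp
#include <array>

// A plane in the form dot(n, p) + d = 0. Assumption: the normal is unit
// length and points toward the inside of the frustum.
struct Plane { float nx, ny, nz, d; };

// Bounding sphere for an object, in the same space as the planes.
struct Sphere { float x, y, z, radius; };

using Frustum = std::array<Plane, 6>;

// Signed distance from the sphere's center to the plane
// (positive means the center is on the inside).
inline float signedDistance(const Plane& p, const Sphere& s) {
    return p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
}

// Returns false only when the sphere lies entirely behind some plane.
// The test is conservative: a sphere just outside a frustum corner may
// still be reported as visible, which costs time but is never incorrect.
inline bool sphereMaybeVisible(const Frustum& f, const Sphere& s) {
    for (const Plane& p : f) {
        if (signedDistance(p, s) < -s.radius) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper for demonstration: a box-shaped "frustum" that
// spans [-halfSize, halfSize] on every axis. A real view frustum would
// have its planes derived from the camera and projection.
inline Frustum makeBoxFrustum(float halfSize) {
    return Frustum{{
        Plane{ 1.0f,  0.0f,  0.0f, halfSize},
        Plane{-1.0f,  0.0f,  0.0f, halfSize},
        Plane{ 0.0f,  1.0f,  0.0f, halfSize},
        Plane{ 0.0f, -1.0f,  0.0f, halfSize},
        Plane{ 0.0f,  0.0f,  1.0f, halfSize},
        Plane{ 0.0f,  0.0f, -1.0f, halfSize},
    }};
}
```

Only objects for which sphereMaybeVisible returns true would be submitted for rendering. The cost per object is six dot products, which is why simple volumes are preferred over testing the actual mesh.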

Of course, this only boils the scene down to the objects in front of the camera. Objects that are entirely occluded by other objects will still be rendered. There are a number of techniques for detecting whether objects obstruct the view of other objects. Portals, BSPs, and a variety of other techniques involve preprocessing certain static terrain to determine visibility sets. With these sets, it is known ahead of time that, when the camera is in a certain region of the world, objects in certain other regions cannot be visible even if they are within the view frustum.

A more fine-grained solution involves using a hardware feature called occlusion queries. This is a way to render an object and then ask how many fragments of that object actually passed the depth test. If even one fragment passed the depth test (assuming all possible occluding surfaces have been rendered), then the object is visible and must be rendered.

It is generally preferred to render simple test objects, such that if any part of the test object is visible, then the real object will be visible. Drawing a test object is much faster than drawing a complex hierarchical model with specialized skinning vertex shaders. Write masks (set with glColorMask and glDepthMask) are used to prevent writing the fragment shader outputs of the test object to the framebuffer. Thus, the test object is only tested against the depth buffer, not actually rendered.
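To make the idea concrete, here is a CPU-side analogy of what such a test measures. It is not OpenGL code and not how the hardware is implemented; the DepthBuffer and Rect types are hypothetical, test objects are reduced to screen-space rectangles at a constant depth, and smaller depth values are assumed to be closer to the camera.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A tiny software depth buffer. Every sample starts at the far value 1.0.
struct DepthBuffer {
    int width;
    int height;
    std::vector<float> depth; // row-major storage

    DepthBuffer(int w, int h)
        : width(w), height(h),
          depth(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), 1.0f) {}

    std::size_t index(int x, int y) const {
        return static_cast<std::size_t>(y) * static_cast<std::size_t>(width)
             + static_cast<std::size_t>(x);
    }
};

// Screen-space rectangle, half-open: covers [x0, x1) by [y0, y1).
struct Rect { int x0, y0, x1, y1; };

inline Rect clampToBuffer(const DepthBuffer& db, Rect r) {
    r.x0 = std::max(r.x0, 0);
    r.y0 = std::max(r.y0, 0);
    r.x1 = std::min(r.x1, db.width);
    r.y1 = std::min(r.y1, db.height);
    return r;
}

// Draws an occluder: samples that pass the depth test overwrite the buffer.
inline void drawOccluder(DepthBuffer& db, Rect r, float z) {
    r = clampToBuffer(db, r);
    for (int y = r.y0; y < r.y1; ++y) {
        for (int x = r.x0; x < r.x1; ++x) {
            float& stored = db.depth[db.index(x, y)];
            if (z < stored) stored = z;
        }
    }
}

// Tests a proxy object: samples are compared against the depth buffer and
// counted, but nothing is written, mirroring the effect of the write masks.
inline int countSamplesPassed(const DepthBuffer& db, Rect r, float z) {
    r = clampToBuffer(db, r);
    int passed = 0;
    for (int y = r.y0; y < r.y1; ++y) {
        for (int x = r.x0; x < r.x1; ++x) {
            if (z < db.depth[db.index(x, y)]) ++passed;
        }
    }
    return passed;
}
```

A result of zero means the test object was completely hidden, so the real object can be skipped. Any non-zero count means the real object has to be drawn.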

Occlusion queries in OpenGL are objects that have state. They are created with the glGenQueries function. To start rendering a test object for occlusion queries, the object generated from glGenQueries is passed to the glBeginQuery function, along with the mode of GL_SAMPLES_PASSED. All rendering commands between glBeginQuery and the corresponding glEndQuery are part of the test object. If all of the fragments of the object were discarded (via depth buffer or something else), then the query failed. If even one fragment was rendered, then it passed.

This can be used with a concept called conditional rendering. This is exactly what it says: rendering an object conditionally. A series of rendering commands, bracketed by the glBeginConditionalRender and glEndConditionalRender functions, is executed or skipped based on the status of an occlusion query object. If the occlusion query passed, then the rendering commands will be executed. If it did not, then they will not be.

Of course, conditional rendering can cause pipeline stalls; OpenGL still requires that operations execute in-order, even conditional ones. So all later operations will be held up if a conditional render is waiting for its occlusion query to finish. To avoid this, you can specify GL_QUERY_NO_WAIT when beginning the conditional render. This will cause OpenGL to render if the query has not completed before this conditional render is ready to be rendered. To gain the maximum benefit from this, it is best to render the conditional objects well after the test objects they are conditioned on.

Model LOD

When a model is far away, it does not need to look as detailed, since most of the details will be lost due to lack of resolution. Therefore, one can substitute less detailed models for more detailed ones. This is commonly referred to as Level of Detail (LOD).

Of course in modern rendering, detail means more than just the number of polygons in a mesh. It can often mean what shader to use, what textures to use with it, etc. So while meshes will often have LODs, so will shaders. Textures have their own built-in LODing mechanism in mip-mapping. But it is often the case that low-LOD shaders (those used from far away) do not need as many textures as the closer LOD shaders. You might be able to get away with per-vertex lighting for distant models, while you need per-fragment lighting for those close up.

The problem with this visually is how to deal with the transitions between LOD levels. If you change them too close to the camera, then the user will notice a pop. If you do them too far away, you lose much of the performance gain from rendering a low-detail mesh far away. Finding a good middle-ground is key.
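One simple way to pick a level is a table of switch distances. The sketch below assumes LOD 0 is the most detailed level and that the distances are sorted in ascending order; the selectLod function and the distances used with it are illustrative only, and tuning those distances is exactly the middle-ground problem described above.

```cpp
#include <cstddef>
#include <vector>

// Returns the LOD index to use for an object at the given distance.
// switchDistances[i] is the distance at which rendering changes from
// LOD i to LOD i + 1, so N distances select among N + 1 levels.
// Assumption: switchDistances is sorted in ascending order.
inline std::size_t selectLod(const std::vector<float>& switchDistances,
                             float distance) {
    std::size_t lod = 0;
    while (lod < switchDistances.size() && distance >= switchDistances[lod]) {
        ++lod;
    }
    return lod;
}
```

With switch distances of 20 and 60 units, an object 5 units away uses LOD 0, one 40 units away uses LOD 1, and anything 60 or more units away uses LOD 2. The same index can select a mesh, a program, and a texture set together.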

State Changes

OpenGL has three kinds of functions: those that actually do rendering, those that retrieve information from OpenGL, and those that modify some information stored in OpenGL. The vast majority of OpenGL functions are of the last kind. OpenGL's information is generally called state, and needlessly changing state can be expensive.

Therefore, the rule for this optimization is to minimize the number of state changes as far as possible. For simple scenes, this can be trivial. But in a complicated, data-driven environment, this can be exceedingly complex.

The general idea is to gather up a list of all objects that need to be rendered (after culling non-visible objects and performing any LOD work), then sort them based on their shared state. Objects that use the same program share program state, for example. By doing this, if you render the objects in state order, you will minimize the number of changes to OpenGL state.

The three most important pieces of state to sort by are the ones that change most frequently: programs (and their associated uniforms), textures, and VAO state. Global state, such as face culling and blending, is less of a concern because it does not change as often. Generally, all meshes use the same culling parameters, viewport settings, depth comparison state, and so forth.
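A state sort can be as simple as sorting the draw list by a tuple of object names. In this sketch the DrawItem struct and its meshId field are hypothetical, and the priority order (program first, then texture, then VAO) is an assumption made for illustration; which state is most expensive to change depends on the hardware and driver.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// One object to render, described by the OpenGL object names it needs.
struct DrawItem {
    std::uint32_t program; // program object name
    std::uint32_t texture; // texture object name
    std::uint32_t vao;     // vertex array object name
    int meshId;            // hypothetical handle identifying what to draw
};

// Orders the list so that items sharing state end up next to each other.
inline void sortByState(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b) {
            return std::tie(a.program, a.texture, a.vao)
                 < std::tie(b.program, b.texture, b.vao);
        });
}

// Counts how many program binds a renderer would issue if it walked the
// list in order and only rebound when the program actually changed.
inline int countProgramBinds(const std::vector<DrawItem>& items) {
    int binds = 0;
    bool haveCurrent = false;
    std::uint32_t current = 0;
    for (const DrawItem& item : items) {
        if (!haveCurrent || item.program != current) {
            ++binds;
            current = item.program;
            haveCurrent = true;
        }
    }
    return binds;
}
```

For a list whose programs alternate 1, 2, 1, 2, 1, walking the unsorted list costs five program binds, while the sorted list costs two.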

Minimizing vertex array state changes generally requires more than just sorting; it requires changing how mesh data is stored. This book usually gives every mesh its own VAO, which represents its own separate state. This is certainly very convenient, but it can work against performance if the CPU is a bottleneck.

To avoid this, try to group meshes that have the same vertex data formats in the same buffer objects and VAOs. This makes it possible to render several objects, with several different glDraw* commands, all using the same VAO state. glDrawElementsBaseVertex is very useful for this purpose when rendering with indexed data. The fewer VAO binds, the better.
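When several meshes share one vertex buffer and one index buffer, each mesh needs to know where its data starts. The following sketch computes those offsets on the CPU. The MeshSizes and MeshRange structs and the packMeshes function are hypothetical names, and 16-bit indices are assumed by default.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Size of one mesh before it is packed into the shared buffers.
struct MeshSizes {
    std::uint32_t vertexCount;
    std::uint32_t indexCount;
};

// Where a mesh ended up inside the shared buffers.
struct MeshRange {
    std::int32_t baseVertex;     // added to every index when drawing
    std::size_t indexByteOffset; // byte offset into the shared index buffer
    std::uint32_t indexCount;    // number of indices to draw
};

// Lays the meshes out back to back, in the order given.
inline std::vector<MeshRange> packMeshes(
        const std::vector<MeshSizes>& meshes,
        std::size_t bytesPerIndex = sizeof(std::uint16_t)) {
    std::vector<MeshRange> ranges;
    ranges.reserve(meshes.size());
    std::int32_t nextVertex = 0;
    std::size_t nextIndexByte = 0;
    for (const MeshSizes& m : meshes) {
        ranges.push_back(MeshRange{nextVertex, nextIndexByte, m.indexCount});
        nextVertex += static_cast<std::int32_t>(m.vertexCount);
        nextIndexByte += static_cast<std::size_t>(m.indexCount) * bytesPerIndex;
    }
    return ranges;
}

// With the shared VAO bound once, each mesh could then be drawn with a
// call along these lines (shown as a comment because it needs a GL context):
//   glDrawElementsBaseVertex(GL_TRIANGLES, range.indexCount,
//       GL_UNSIGNED_SHORT, (void*)range.indexByteOffset, range.baseVertex);
```

Because the base vertex is added at draw time, each mesh's indices can stay relative to its own first vertex. They do not need to be rewritten when the mesh is packed, and 16-bit indices remain usable even when the shared buffer holds more than 65536 vertices in total.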

There is less information on how harmful uniform state changes are to performance, or the performance difference between changing in-program uniforms and buffer-based uniforms.

Be advised that state sorting cannot help when dealing with blending, because blending correctness requires sorting based on depth. Thus, blended objects have to be left out of the state sort and ordered by depth instead.
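One common arrangement, sketched here under stated assumptions, is to split the draw list in two: opaque objects are sorted by state, and blended objects are drawn afterwards in depth order. The Item struct is hypothetical, viewDepth is assumed to be the distance from the camera, and the farthest-first order assumes conventional back-to-front blending.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Item {
    std::uint32_t program; // stands in for the full state sort key
    float viewDepth;       // distance from the camera; larger is farther
    bool blended;          // true if the object must be drawn with blending
    int id;                // hypothetical handle identifying the object
};

// Reorders the list: opaque items first, sorted by state; blended items
// last, sorted from farthest to nearest.
inline void orderForRendering(std::vector<Item>& items) {
    auto firstBlended = std::stable_partition(
        items.begin(), items.end(),
        [](const Item& i) { return !i.blended; });

    std::stable_sort(items.begin(), firstBlended,
        [](const Item& a, const Item& b) { return a.program < b.program; });

    std::stable_sort(firstBlended, items.end(),
        [](const Item& a, const Item& b) { return a.viewDepth > b.viewDepth; });
}
```

The blended portion of the list may change state on every draw, which is the cost of correct blending. Keeping that portion small limits how much of the frame loses the benefit of state sorting.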

There are also certain tricky states that can hurt, depending on hardware. For example, it is best to avoid changing the direction of the depth test once you have cleared the depth buffer and started rendering to it. This is for reasons having to do with specific hardware optimizations of depth buffering.
