Appendix A. Basic Optimization

Table of Contents

Vertex Format
Object Optimizations
Finding the Bottleneck
Vertex Caching
Shaders and Performance

Optimization is far too large a subject to cover adequately in a mere appendix. Optimizations tend to be specific to particular algorithms, and they usually involve tradeoffs with memory; that is, one can often make something run faster by using more memory. Even then, optimizations should only be made once proper profiling has shown where performance is lacking.

This appendix will instead cover the most basic optimizations. These are not guaranteed to improve performance in any particular program, but they almost never hurt. They are also things you can implement relatively easily. Think of these as the default standard practice you should start with before performing real optimizations. For the sake of clarity, most of the code in this book did not use these practices, so many of them will be new.

Do as I say, not as I do.

Vertex Format

Interleave vertex attribute arrays for objects where possible. Obviously, if you need to overwrite certain attributes frequently while other attributes remain static, then you will need to separate that data. But unless you have some specific need to do so, interleave your vertex data.
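As a minimal sketch of what interleaving looks like on the CPU side, assume a hypothetical InterleavedVertex struct holding a position, a normal and a color. One buffer object holds an array of these structs, every attribute shares the same stride (the size of the struct), and each attribute's byte offset within the struct is what would be passed, cast to a pointer, as the last argument of glVertexAttribPointer. The struct, its members and the constant names are illustrative, not taken from this book's code.

```cpp
#include <cstddef>   // offsetof, std::size_t

// Hypothetical vertex layout: all attributes of one vertex sit side by side.
struct InterleavedVertex
{
    float position[3];
    float normal[3];
    unsigned char color[4];
};

// Every attribute uses the same stride: the size of one whole vertex.
constexpr std::size_t kStride = sizeof(InterleavedVertex);

// Byte offset of each attribute from the start of a vertex.
constexpr std::size_t kPositionOffset = offsetof(InterleavedVertex, position);
constexpr std::size_t kNormalOffset   = offsetof(InterleavedVertex, normal);
constexpr std::size_t kColorOffset    = offsetof(InterleavedVertex, color);
```

With the buffer bound, the normal attribute would then be set up with a stride of kStride and a pointer argument of (void*)kNormalOffset.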

Equally importantly, try to use the smallest vertex data possible. Small data means that GPU caches are more efficient; they store more vertex attributes per cache line. This means fewer direct memory accesses, which in turn means vertex shaders receive their attributes faster. In this book, the vertex data was almost always 32-bit floats. You should only use 32-bit floats when you absolutely need that much precision.

The biggest key to this is the use of normalized integer values for attributes. As a reminder of how this works, here is the definition of glVertexAttribPointer:

void glVertexAttribPointer(GLuint index,
                           GLint size,
                           GLenum type,
                           GLboolean normalized,
                           GLsizei stride,
                           const GLvoid *pointer);

If type is an integer attribute, like GL_UNSIGNED_BYTE, then setting normalized to GL_TRUE will mean that OpenGL interprets the integer value as normalized. It will automatically convert the integer 255 to 1.0, and so forth. If the normalization flag is false instead, then it will convert the integers directly to floats: 255 becomes 255.0, etc. Signed values can be normalized as well; GL_BYTE with normalization will map 127 to 1.0, -128 to -1.0, etc.
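The conversions described above can be sketched as plain C++ functions. This only imitates on the CPU what OpenGL does when it fetches the attribute; the function names are hypothetical. The signed case assumes the clamped mapping max(c / 127, -1); the exact signed formula has differed between OpenGL versions, so treat that part as an assumption.

```cpp
#include <cstdint>

// Unsigned normalized byte: 0 maps to 0.0 and 255 maps to 1.0.
constexpr float unormByteToFloat(std::uint8_t value)
{
    return static_cast<float>(value) / 255.0f;
}

// Signed normalized byte, assuming the clamped mapping:
// 127 maps to 1.0, and both -127 and -128 map to -1.0.
constexpr float snormByteToFloat(std::int8_t value)
{
    return (value < -127) ? -1.0f : static_cast<float>(value) / 127.0f;
}

// Normalization flag false: a plain integer-to-float conversion.
constexpr float byteToFloat(std::uint8_t value)
{
    return static_cast<float>(value);
}
```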

Colors.  Color values are commonly stored as 4 unsigned normalized bytes. This is far smaller than using 4 32-bit floats, but the loss of precision is almost always negligible. To send 4 unsigned normalized bytes, use:

glVertexAttribPointer(#, 4, GL_UNSIGNED_BYTE, GL_TRUE, ...);

The best part is that all of this is free; it costs no actual performance. Note however that 32-bit integers cannot be normalized.
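On the data-preparation side, floating-point colors have to be converted to bytes before they go into the buffer object. The following is a minimal sketch of that conversion, assuming input components in [0, 1] (values outside are clamped) and round-to-nearest; ColorRGBA8, floatToUnorm8 and packColor are hypothetical names.

```cpp
#include <cstdint>

// Four unsigned normalized bytes: 4 bytes instead of 16 for four floats.
struct ColorRGBA8
{
    std::uint8_t r, g, b, a;
};

// Clamp to [0, 1], scale to [0, 255] and round to the nearest integer.
constexpr std::uint8_t floatToUnorm8(float c)
{
    return static_cast<std::uint8_t>(
        (c <= 0.0f ? 0.0f : (c >= 1.0f ? 1.0f : c)) * 255.0f + 0.5f);
}

constexpr ColorRGBA8 packColor(float r, float g, float b, float a)
{
    return ColorRGBA8{ floatToUnorm8(r), floatToUnorm8(g),
                       floatToUnorm8(b), floatToUnorm8(a) };
}
```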

Sometimes, color values need more than 8 bits of precision, but less than 16. If a color is in the linear RGB colorspace, it is often desirable to give it greater than 8-bit precision. If the alpha of the color is negligible or non-existent, then a special type can be used. This type is GL_UNSIGNED_INT_2_10_10_10_REV. It takes 32-bit unsigned normalized integers and pulls the four components of the attribute out of each integer. This type can only be used with normalization:

glVertexAttribPointer(#, 4, GL_UNSIGNED_INT_2_10_10_10_REV, GL_TRUE, ...);

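The most significant 2 bits of each integer are the Alpha. The next 10 bits are the Blue, then Green, and finally Red. Make note of the fact that the order is reversed. Conceptually, it corresponds to this bitfield struct in C, read from the most significant bits down (the actual layout of bitfields is implementation-defined, so the struct illustrates the format rather than guaranteeing it):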

struct RGB10_A2
{
  unsigned int alpha    : 2;
  unsigned int blue     : 10;
  unsigned int green    : 10;
  unsigned int red      : 10;
};
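Because compilers are free to lay out bitfields as they see fit, explicit shifts are a more portable way to build these integers. The sketch below packs four float components into the bit positions described above; floatToUnorm and packRGB10A2 are hypothetical helper names, and inputs are assumed to be in [0, 1] (values outside are clamped).

```cpp
#include <cstdint>

// Convert a float in [0, 1] to an unsigned normalized integer of `bits` bits.
constexpr std::uint32_t floatToUnorm(float c, int bits)
{
    const float maxValue = static_cast<float>((1u << bits) - 1u);
    const float clamped = c <= 0.0f ? 0.0f : (c >= 1.0f ? 1.0f : c);
    return static_cast<std::uint32_t>(clamped * maxValue + 0.5f);
}

// Alpha in bits 30-31, blue in bits 20-29, green in bits 10-19, red in bits 0-9.
constexpr std::uint32_t packRGB10A2(float r, float g, float b, float a)
{
    return (floatToUnorm(a, 2)  << 30) |
           (floatToUnorm(b, 10) << 20) |
           (floatToUnorm(g, 10) << 10) |
            floatToUnorm(r, 10);
}
```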

Normals.  Another attribute where precision isn't of paramount importance is the normal. If the normals are normalized, and they always should be, their coordinates will always be in the [-1, 1] range. So signed normalized integers are appropriate here. 8 bits of precision are sometimes enough, but 10-bit precision is going to be an improvement. 16-bit precision, GL_SHORT, may be overkill, so stick with GL_INT_2_10_10_10_REV (the signed version of the above). Because this format provides 4 values, you will need to use 4 as the size of the attribute, but you can still use vec3 in the shader as the normal's input variable.
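Packing a normal into this format works much like the color case, except that each 10-bit field holds a two's complement value. A minimal sketch, assuming unit-length input, a symmetric integer range of [-511, 511], and an unused W field left at zero; floatToSnorm10 and packNormal are hypothetical names.

```cpp
#include <cstdint>

// Convert a float in [-1, 1] to a signed normalized 10-bit value in
// [-511, 511], returned as its 10-bit two's complement bit pattern.
constexpr std::uint32_t floatToSnorm10(float c)
{
    const float clamped = c <= -1.0f ? -1.0f : (c >= 1.0f ? 1.0f : c);
    const float scaled = clamped * 511.0f;
    // Round to nearest; the cast truncates toward zero.
    const std::int32_t rounded =
        static_cast<std::int32_t>(scaled < 0.0f ? scaled - 0.5f : scaled + 0.5f);
    return static_cast<std::uint32_t>(rounded) & 0x3FFu;
}

// X in bits 0-9, Y in bits 10-19, Z in bits 20-29; the 2-bit W field stays 0.
constexpr std::uint32_t packNormal(float x, float y, float z)
{
    return floatToSnorm10(x) |
           (floatToSnorm10(y) << 10) |
           (floatToSnorm10(z) << 20);
}
```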

Texture Coordinates.  Two-dimensional texture coordinates do not typically need 32-bits of precision. 8 and 10-bit precision are usually not good enough, but 16-bit unsigned normalized integers are often sufficient. If texture coordinates range outside of [0, 1], then normalization will not be sufficient. In these cases, there is an alternative to 32-bit floats: 16-bit floats.
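For coordinates that do stay within [0, 1], the conversion to 16-bit unsigned normalized integers can be sketched as follows. TexCoord16 and floatToUnorm16 are hypothetical names; out-of-range inputs are clamped, which is exactly the case where this format stops being suitable.

```cpp
#include <cstdint>

// Two 16-bit components: 4 bytes instead of 8 for two floats.
struct TexCoord16
{
    std::uint16_t u, v;
};

// Clamp to [0, 1], scale to [0, 65535] and round to the nearest integer.
constexpr std::uint16_t floatToUnorm16(float c)
{
    return static_cast<std::uint16_t>(
        (c <= 0.0f ? 0.0f : (c >= 1.0f ? 1.0f : c)) * 65535.0f + 0.5f);
}

constexpr TexCoord16 packTexCoord(float u, float v)
{
    return TexCoord16{ floatToUnorm16(u), floatToUnorm16(v) };
}
```

The step between adjacent values is 1/65535, so even across a 4096-texel-wide texture there are roughly 16 representable coordinates per texel.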

The hardest part of dealing with 16-bit floats is that C/C++ does not deal with them very well. Unlike virtually every other attribute type, there is no native 16-bit float type. Even the 10-bit format can be built using bit selectors in structs, as above. Generating a 16-bit float from a 32-bit float requires care, as well as an understanding of how floating-point values work.

This is where the GLM math library comes in handy. It provides glm::thalf, a type that represents a 16-bit floating-point value. It has overloaded operators, so that it can be used like a regular float. GLM also provides glm::hvec and glm::hmat types for vectors and matrices, respectively.
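To give a sense of what such a conversion involves, here is a simplified sketch of turning a 32-bit float into the bits of a 16-bit float. It is not GLM's implementation; floatToHalf is a hypothetical name. It assumes IEEE 754 floats, truncates the mantissa instead of rounding, flushes values too small for a normal half to zero, and turns values too large into infinity.

```cpp
#include <cstdint>
#include <cstring>   // std::memcpy

// Simplified float -> half conversion (truncating, no denormal support).
std::uint16_t floatToHalf(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit floats");

    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // reinterpret the float's bits

    const std::uint32_t sign = (bits >> 16) & 0x8000u;   // sign moves to bit 15
    const std::uint32_t rawExponent = (bits >> 23) & 0xFFu;
    const std::uint32_t mantissa = bits & 0x7FFFFFu;

    if (rawExponent == 0xFFu)   // infinity or NaN
        return static_cast<std::uint16_t>(sign | 0x7C00u | (mantissa ? 0x200u : 0u));

    // Re-bias the exponent from 127 (float) to 15 (half).
    const std::int32_t exponent = static_cast<std::int32_t>(rawExponent) - 127 + 15;

    if (exponent >= 31)   // too large for a half: becomes infinity
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    if (exponent <= 0)    // too small for a normal half: flush to zero
        return static_cast<std::uint16_t>(sign);

    // Keep the top 10 of the 23 mantissa bits.
    return static_cast<std::uint16_t>(
        sign | (static_cast<std::uint32_t>(exponent) << 10) | (mantissa >> 13));
}
```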

Positions.  In general, positions are the least likely attribute to be easily optimized without consequence. 16-bit floats can be used, but these are restricted to a range of approximately [-65504, 65504]. They also have limited precision, which may not be enough depending on the size and detail of the object in model space.

If 16-bit floats are insufficient, a certain form of compression can be used. The process is as follows:

  1. When loading the mesh data, find the bounding volume of the mesh in model space. To do this, find the maximum and minimum values in the X, Y and Z directions independently. This represents an axis-aligned box in model space that contains all of the vertices. This box is defined by two 3D vectors: the maximum vector (containing the max X, Y and Z values), and the minimum vector. These are named max and min.

  2. Compute the center point of this region:

    glm::vec3 center = (max + min) / 2.0f;

  3. Compute half of the size (width, height, depth) of the region:

    glm::vec3 halfSize = (max - min) / 2.0f;

  4. For each position in the mesh, compute a normalized version by subtracting the center from it, then dividing it by half the size. As follows:

    glm::vec3 newPosition = (position - center) / halfSize;

  5. For each new position, convert it to a signed, normalized integer by multiplying it by 32767:

    short normX = (short)(newPosition.x * 32767.0f);
    short normY = (short)(newPosition.y * 32767.0f);
    short normZ = (short)(newPosition.z * 32767.0f);

    These three coordinates are then stored as the new position data in the buffer object.

  6. Keep the center and halfSize variables stored with your mesh data. When computing the model-space to camera-space matrix for that mesh, add one final transform to the top of the matrix stack, so that it is the first one applied to the stored positions. This transform performs the inverse of the operation used to compute the normalized values: it scales by halfSize, then translates by center.

    This final transform should not be applied to the normal matrix. Compute the normal matrix before applying the final step above. So if you were not using a separate matrix for normals (because you did not have non-uniform scales in your model-to-camera matrix), you will need to use one now. This may make your data bigger or make your shader run slightly slower.
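The whole process can be checked with a small round trip. The sketch below uses a hypothetical Vec3 struct in place of glm::vec3 so that it stands alone, and it writes the final transform of step 6 as plain arithmetic rather than as a matrix. It assumes halfSize is non-zero on every axis, and it rounds to nearest in step 5 where the listing above simply truncates.

```cpp
#include <cstdint>

struct Vec3 { float x, y, z; };                  // stand-in for glm::vec3
struct PackedPosition { std::int16_t x, y, z; }; // what goes in the buffer

// Map a float in [-1, 1] to a signed normalized 16-bit integer.
constexpr std::int16_t floatToSnorm16(float c)
{
    const float scaled = c * 32767.0f;
    return static_cast<std::int16_t>(scaled < 0.0f ? scaled - 0.5f : scaled + 0.5f);
}

// Steps 4 and 5: normalize relative to the bounding box, then quantize.
constexpr PackedPosition compress(Vec3 p, Vec3 center, Vec3 halfSize)
{
    return PackedPosition{
        floatToSnorm16((p.x - center.x) / halfSize.x),
        floatToSnorm16((p.y - center.y) / halfSize.y),
        floatToSnorm16((p.z - center.z) / halfSize.z) };
}

// Step 6 as arithmetic: attribute normalization (divide by 32767),
// then scale by halfSize and translate by center.
constexpr Vec3 decompress(PackedPosition q, Vec3 center, Vec3 halfSize)
{
    return Vec3{
        (q.x / 32767.0f) * halfSize.x + center.x,
        (q.y / 32767.0f) * halfSize.y + center.y,
        (q.z / 32767.0f) * halfSize.z + center.z };
}
```

The corners of the bounding box map to plus or minus 32767 and back again exactly; positions in between come back with an error of at most about halfSize / 65534 per axis.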

Alignment.  One additional rule you should always follow is this: make sure that all attributes begin on a 4-byte boundary. This is true even for attributes that are smaller than 4 bytes, such as a 3-vector of 8-bit values. While OpenGL will allow you to use arbitrary alignments, hardware may have problems making it work. So if you make your 3D position data 16-bit floats or 16-bit signed normalized integers, you will still waste 2 bytes for every position. You may want to try making your position values 4-dimensional values and putting something useful in the W component.
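One way to keep this rule from being broken by accident is to describe the vertex as a struct and let the compiler check the offsets. The sketch below combines the compact formats from this section into a hypothetical CompactVertex; the member names and the particular choice of formats are illustrative only.

```cpp
#include <cstddef>   // offsetof
#include <cstdint>

// Hypothetical compact vertex: 20 bytes in total.
struct CompactVertex
{
    std::int16_t  position[4];  // 16-bit normalized X, Y, Z; W pads to 8 bytes
    std::uint32_t normal;       // packed as GL_INT_2_10_10_10_REV
    std::uint8_t  color[4];     // 4 unsigned normalized bytes
    std::uint16_t texCoord[2];  // 16-bit unsigned normalized U, V
};

// Every attribute must begin on a 4-byte boundary.
static_assert(offsetof(CompactVertex, position) % 4 == 0, "position misaligned");
static_assert(offsetof(CompactVertex, normal)   % 4 == 0, "normal misaligned");
static_assert(offsetof(CompactVertex, color)    % 4 == 0, "color misaligned");
static_assert(offsetof(CompactVertex, texCoord) % 4 == 0, "texCoord misaligned");

// The stride must also be a multiple of 4 so that later vertices stay aligned.
static_assert(sizeof(CompactVertex) % 4 == 0, "stride breaks alignment");
```

If position were declared with only 3 components, the compiler would insert 2 bytes of padding before normal anyway, which is the waste described above.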
