3D Graphics With OpenGL
Basic Theory
1. Computer Graphics Hardware
All modern displays are raster-based. A raster is a 2D rectangular grid of pixels (or picture
elements). A pixel has two properties: a color and a position. Color is expressed in RGB (Red-
Green-Blue) components - typically 8 bits per component or 24 bits per pixel (or true color). The
position is expressed in terms of (x, y) coordinates. The origin (0, 0) is located at the top-left
corner, with x-axis pointing right and y-axis pointing down. This is different from the
conventional 2D Cartesian coordinates, where y-axis is pointing upwards.
The number of color-bits per pixel is called the depth (or precision) of the display. The number of
columns by rows (width x height) of the rectangular grid is called the resolution of the display,
which can range from 640x480 (VGA), 800x600 (SVGA), 1024x768 (XGA) to 1920x1080 (FHD), or
even higher.
Frame Buffer and Refresh Rate
The color values of the pixels are stored in a special part of graphics memory called the frame
buffer. The GPU writes the color values into the frame buffer. The display reads the color values
from the frame buffer row-by-row, from left-to-right, top-to-bottom, and puts each of the values
onto the screen. This is known as raster-scan. The display refreshes its screen several dozen times
per second, typically 60Hz for LCD monitors and higher for CRTs. This is known as the refresh
rate.
A complete screen image is called a frame.
Double Buffering and VSync
While the display is reading from the frame buffer to display the current frame, we might be
updating its contents for the next frame (not necessarily in raster-scan manner). This would result
in the so-called tearing, in which the screen shows parts of the old frame and parts of the new
frame.
This could be resolved by using so-called double buffering. Instead of using a single frame buffer,
modern GPUs use two of them: a front buffer and a back buffer. The display reads from the front
buffer, while we write the next frame to the back buffer. When we finish, we signal the GPU to
swap the front and back buffers (known as buffer swap or page flip).
Double buffering alone does not solve the entire problem, as the buffer swap might occur at an
inappropriate time, for example, while the display is in the middle of displaying the old frame.
This is resolved via the so-called vertical synchronization (or VSync) at the end of the raster-scan.
When we signal to the GPU to do a buffer swap, the GPU will wait till the next VSync to perform
the actual swap, after the entire current frame is displayed.
The most important point is: when the VSync buffer-swap is enabled, you cannot refresh the
display faster than the refresh rate of the display! For LCD/LED displays, the refresh rate is
typically locked at 60Hz, i.e., 60 frames per second, or 16.7 milliseconds per frame.
Furthermore, if your application refreshes at a fixed rate, the resultant refresh rate is likely to be
an integral factor of the display's refresh rate, i.e., 1/2, 1/3, 1/4, etc.
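As a minimal sketch (assuming the GLUT framework used in the examples below), double
buffering is requested at initialization, and a buffer swap is signaled at the end of each frame:

// In the main program:
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);  // GLUT_DOUBLE requests double buffering

// In the display callback:
void display() {
   glClear(GL_COLOR_BUFFER_BIT);  // draw the next frame into the back buffer
   // ... render the scene ...
   glutSwapBuffers();             // signal the GPU to swap the buffers (performed at VSync)
}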
2. 3D Graphics Rendering Pipeline
A pipeline, in computing terminology, refers to a series of processing stages in which the output
from one stage is fed as the input of the next stage, similar to a factory assembly line or a
water/oil pipe. With massive parallelism, a pipeline can greatly improve the overall throughput.
In computer graphics, rendering is the process of producing an image on the display from a
model description.
The 3D Graphics Rendering Pipeline accepts descriptions of 3D objects in terms of vertices of
primitives (such as triangles, points, lines and quads), and produces the color values for the pixels
on the display. The vertex-processing and fragment-processing stages of the pipeline are
programmable. On the other hand, the rasterization and output-merging stages are not
programmable, but configurable - via configuration commands issued to the GPU.
3. Vertices, Primitives, Fragments and Pixels
As an example, the following OpenGL code segment specifies a color cube, centered at the
origin. To create a geometric object or model, we use a pair
of glBegin(primitiveType) and glEnd() to enclose the vertices that form the model.
For a primitiveType that ends with 'S' (e.g., GL_QUADS), we can define multiple shapes of the
same type.
Each of the 6 faces is a primitive quad (GL_QUADS). We first set the color
via glColor3f(red, green, blue). This color is applied to all subsequent vertices until it
is overridden. The 4 vertices of each quad are specified via glVertex3f(x, y, z), in counter-
clockwise order such that the surface-normal is pointing outwards, indicating its front face. All
four vertices have this surface-normal as their vertex-normal.
glBegin(GL_QUADS);                 // of the color cube
   // Top-face
   glColor3f(0.0f, 1.0f, 0.0f);    // green
   glVertex3f( 1.0f, 1.0f, -1.0f);
   glVertex3f(-1.0f, 1.0f, -1.0f);
   glVertex3f(-1.0f, 1.0f,  1.0f);
   glVertex3f( 1.0f, 1.0f,  1.0f);
   // Bottom-face
   glColor3f(1.0f, 0.5f, 0.0f);    // orange
   glVertex3f( 1.0f, -1.0f,  1.0f);
   glVertex3f(-1.0f, -1.0f,  1.0f);
   glVertex3f(-1.0f, -1.0f, -1.0f);
   glVertex3f( 1.0f, -1.0f, -1.0f);
   // Front-face
   glColor3f(1.0f, 0.0f, 0.0f);    // red
   glVertex3f( 1.0f,  1.0f, 1.0f);
   glVertex3f(-1.0f,  1.0f, 1.0f);
   glVertex3f(-1.0f, -1.0f, 1.0f);
   glVertex3f( 1.0f, -1.0f, 1.0f);
   // Back-face
   glColor3f(1.0f, 1.0f, 0.0f);    // yellow
   glVertex3f( 1.0f, -1.0f, -1.0f);
   glVertex3f(-1.0f, -1.0f, -1.0f);
   glVertex3f(-1.0f,  1.0f, -1.0f);
   glVertex3f( 1.0f,  1.0f, -1.0f);
   // Left-face
   glColor3f(0.0f, 0.0f, 1.0f);    // blue
   glVertex3f(-1.0f,  1.0f,  1.0f);
   glVertex3f(-1.0f,  1.0f, -1.0f);
   glVertex3f(-1.0f, -1.0f, -1.0f);
   glVertex3f(-1.0f, -1.0f,  1.0f);
   // Right-face
   glColor3f(1.0f, 0.0f, 1.0f);    // magenta
   glVertex3f(1.0f,  1.0f, -1.0f);
   glVertex3f(1.0f,  1.0f,  1.0f);
   glVertex3f(1.0f, -1.0f,  1.0f);
   glVertex3f(1.0f, -1.0f, -1.0f);
glEnd();   // of the color cube
Indexed Vertices
Primitives often share vertices. Instead of repeatedly specifying the vertices, it is more efficient to
create an index list of vertices, and use the indexes in specifying the primitives.
For example, the following code fragment specifies a pyramid formed by 5 vertices. We first
define the 5 vertices in a vertex array, followed by their respective colors in a color array. For
each of the 5 faces, we simply reference the vertices (and their colors) by index.
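A minimal sketch of such a fragment (the array names, colors and coordinates here are
illustrative, not the original listing):

GLfloat vertices[][3] = {  // the 5 shared vertices
   {-1.0f, -1.0f, -1.0f},  // 0: left-back of base
   { 1.0f, -1.0f, -1.0f},  // 1: right-back of base
   { 1.0f, -1.0f,  1.0f},  // 2: right-front of base
   {-1.0f, -1.0f,  1.0f},  // 3: left-front of base
   { 0.0f,  1.0f,  0.0f}   // 4: apex
};
GLfloat colors[][3] = {    // one color per vertex
   {0.0f, 0.0f, 1.0f}, {0.0f, 1.0f, 0.0f}, {0.0f, 0.0f, 1.0f},
   {0.0f, 1.0f, 0.0f}, {1.0f, 0.0f, 0.0f}
};

glBegin(GL_TRIANGLES);
   // Front-face: vertices 3, 2, 4, in counter-clockwise order
   glColor3fv(colors[3]); glVertex3fv(vertices[3]);
   glColor3fv(colors[2]); glVertex3fv(vertices[2]);
   glColor3fv(colors[4]); glVertex3fv(vertices[4]);
   // Right-face: vertices 2, 1, 4
   glColor3fv(colors[2]); glVertex3fv(vertices[2]);
   glColor3fv(colors[1]); glVertex3fv(vertices[1]);
   glColor3fv(colors[4]); glVertex3fv(vertices[4]);
   // The back-face (1, 0, 4) and left-face (0, 3, 4) follow the same pattern.
glEnd();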
In order to produce the grid-aligned pixels for the display, the rasterizer of the graphics
rendering pipeline, as its name implies, takes each input primitive and performs a raster-scan to
produce a set of grid-aligned fragments enclosed within the primitive. A fragment is 3-
dimensional, with an (x, y, z) position. The (x, y) are aligned with the 2D pixel-grid. The z-value
(not grid-aligned) denotes its depth. The z-values are needed to capture the relative depth of the
various primitives, so that occluded objects can be discarded (or the alpha channel of transparent
objects processed) in the output-merging stage.
Fragments are produced via interpolation of the vertices. Hence, a fragment has all the vertex
attributes, such as color, fragment-normal and texture coordinates.
In modern GPUs, vertex processing and fragment processing are programmable. The programs
are called the vertex shader and the fragment shader.
4. Vertex Processing
4.1 Coordinates Transformation
The process used to produce a 3D scene on the display in Computer Graphics is like taking a
photograph with a camera. It involves four transformations:
1. Arrange the objects (or models, or avatars) in the world (Model
Transformation or World Transformation).
2. Position and orient the camera (View Transformation).
3. Select a camera lens (wide angle, normal or telescopic), adjust the focal length and
zoom factor to set the camera's field of view (Projection Transformation).
4. Print the photo on a selected area of the paper (Viewport Transformation) - in the
rasterization stage.
A transform converts a vertex V from one space (or coordinate system) to another space, giving
V'. In computer graphics, a transform is carried out by multiplying the vector with a
transformation matrix, i.e., V' = M V.
4.2 Model Transform (or Local Transform, or World Transform)
Each object (or model or avatar) in a 3D scene is typically drawn in its own coordinate system,
known as its model space (or local space, or object space). As we assemble the objects, we need to
transform the vertices from their local spaces to the world space, which is common to all the
objects. This is known as the world transform. The world transform consists of a series of scaling
(scale the object to match the dimensions of the world), rotation (align the axes), and translation
(move the origin).
Rotation and scaling belong to a class of transformation called linear transformation (by
definition, a linear transformation preserves vector addition and scalar multiplication). Linear
transform and translation form the so-called affine transformation. Under an affine
transformation, a straight line remains a straight line and ratios of distances between points are
preserved.
In OpenGL, a vertex V at (x, y, z) is represented as a 3x1 column vector:

    V = | x |
        | y |
        | z |
Rotation
3D rotation operates about an axis of rotation (2D rotation operates about a center of rotation).
3D rotations about the x, y and z axes by an angle θ (measured in counter-clockwise manner)
can be represented by the following 3x3 matrices:

        | 1    0      0    |          |  cosθ   0   sinθ |          | cosθ  -sinθ   0 |
Rx(θ) = | 0   cosθ  -sinθ  |  Ry(θ) = |   0     1    0   |  Rz(θ) = | sinθ   cosθ   0 |
        | 0   sinθ   cosθ  |          | -sinθ   0   cosθ |          |  0      0     1 |
The rotational angles about x, y and z axes, denoted as θx, θy and θz, are known as Euler angles,
which can be used to specify any arbitrary orientation of an object. The combined transform is
called Euler transform.
Translation, however, is not a linear transformation and cannot be expressed as a 3x3 matrix
multiplication. Fortunately, we can represent translation using a 4x4 matrix and obtain the
transformed result via matrix multiplication, if the vertices are represented in the so-called 4-
component homogeneous coordinates (x, y, z, 1), with an additional fourth w-component of 1.
We shall describe the significance of the w-component later, in the projection transform. In
general, if the w-component is not equal to 1, then (x, y, z, w) corresponds to Cartesian
coordinates of (x/w, y/w, z/w). If w=0, it represents a vector, instead of a point (or vertex).
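For example, a translation by (tx, ty, tz) can then be written as a single matrix multiplication in
homogeneous coordinates:

    | x' |   | 1  0  0  tx |   | x |
    | y' | = | 0  1  0  ty |   | y |
    | z' |   | 0  0  1  tz |   | z |
    | 1  |   | 0  0  0  1  |   | 1 |

which gives x' = x + tx, y' = y + ty, z' = z + tz.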
4.3 View Transform
After the world transform, all the objects are assembled into the world space. We shall now place
the camera to capture the view. The camera is specified by its position (EYE), the point it aims at
(AT), and its upward orientation (UP) - 9 values in total.
Notice that these 9 values actually provide only 6 degrees of freedom to position and orient the
camera, i.e., 3 of them are not independent.
OpenGL
In OpenGL, we can use the GLU function gluLookAt() to position the camera:
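void gluLookAt(GLdouble eyeX, GLdouble eyeY, GLdouble eyeZ,
               GLdouble atX,  GLdouble atY,  GLdouble atZ,
               GLdouble upX,  GLdouble upY,  GLdouble upZ)
// (eyeX, eyeY, eyeZ) is the camera position (EYE);
// (atX, atY, atZ) is the point the camera aims at (AT);
// (upX, upY, upZ) is the camera's up-vector (UP).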
The view transform consists of two operations: a translation (to move EYE to the origin),
followed by a rotation (to align the axes).
Model-View Transform
In Computer Graphics, moving the objects relative to a fixed camera (Model transform), and
moving the camera relative to a fixed object (View transform) produce the same image, and
therefore are equivalent. OpenGL, therefore, manages the Model transform and View transform
in the same manner on a so-called Model-View matrix. Projection transformation (in the next
section) is managed via a Projection matrix.
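As a minimal sketch (the numeric values are illustrative), a typical display routine selects the
model-view matrix, applies the view transform via gluLookAt(), and then applies the model
transform:

glMatrixMode(GL_MODELVIEW);          // select the model-view matrix
glLoadIdentity();                    // reset it to the identity matrix
gluLookAt(0.0, 0.0, 10.0,            // view transform: EYE at (0, 0, 10),
          0.0, 0.0, 0.0,             //   aimed at the origin (AT),
          0.0, 1.0, 0.0);            //   with UP along the +y axis
glTranslatef(2.0f, 0.0f, 0.0f);      // model transform: translate, ...
glRotatef(45.0f, 0.0f, 1.0f, 0.0f);  //   then rotate 45 degrees about the y-axis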
4.4 Projection Transform - Perspective Projection
Once the camera is positioned and oriented, we need to decide what it can see (analogous to
choosing the camera's field of view by adjusting the focus length and zoom factor), and how the
objects are projected onto the screen. This is done by selecting a projection mode (perspective or
orthographic) and specifying a viewing volume or clipping volume. Objects outside the clipping
volume are clipped out of the scene and cannot be seen.
View Frustum in Perspective View
The camera has a limited field of view, which takes the shape of a view frustum (truncated
pyramid), and is specified by four parameters: fovy, aspect, zNear and zFar.
1. fovy: specifies the total vertical angle of view in degrees.
2. aspect: the ratio of width to height. For a particular z, we can get the height from
the fovy, and then get the width from the aspect.
3. zNear: the near plane.
4. zFar: the far plane.
The camera space (xc, yc, zc) is renamed to the familiar (x, y, z) for convenience.
The projection with the view frustum is known as perspective projection, where objects nearer to
the COP (Center of Projection) appear larger than objects of the same size farther from the COP.
An object outside the view frustum is not visible to the camera. It does not contribute to the final
image and shall be discarded to improve the performance. This is known as view-frustum culling.
If an object partially overlaps with the view frustum, it will be clipped in the later stage.
OpenGL
In OpenGL, there are two functions for choosing the perspective projection and setting its
clipping volume:
1. The more commonly-used GLU function gluPerspective():

   void gluPerspective(GLdouble fovy, GLdouble aspectRatio, GLdouble zNear, GLdouble zFar)
   // fovy is the angle between the bottom and top of the projectors;
   // aspectRatio is the ratio of width and height of the front (and also back) clipping plane;
   // zNear and zFar specify the front and back clipping planes.

2. The core GL function glFrustum():

   void glFrustum(GLdouble left, GLdouble right, GLdouble bottom, GLdouble top,
                  GLdouble zNear, GLdouble zFar)
   // left, right, bottom and top specify the extent of the front clipping plane;
   // zNear and zFar specify the positions of the front and back clipping planes.
Clipping-Volume Cuboid
Next, we shall apply a so-called projection matrix to transform the view frustum into an axis-
aligned cuboid clipping volume of 2x2x1 centered on the near plane, as illustrated. The near
plane has z=0, whereas the far plane has z=-1. The planes have a dimension of 2x2, with range
from -1 to +1.
Take note that the last row of the matrix is no longer [0 0 0 1]. With input vertex of (x, y, z, 1), the
resultant w-component would not be 1. We need to normalize the resultant homogeneous
coordinates (x, y, z, w) to (x/w, y/w, z/w, 1) to obtain position in 3D space. (It is amazing that
homogeneous coordinates can be used for translation, as well as the perspective projection.)
After this flip of the z-axis in the projection, the coordinate system is no longer a Right-Hand
System (RHS), but becomes a Left-Hand System (LHS).
We can reset the currently selected matrix to the identity matrix via:
void glLoadIdentity()
We can save the value of the currently selected matrix onto the stack and restore it back via:
void glPushMatrix()
void glPopMatrix()
Push and pop use a stack and operate in a last-in-first-out manner, and can be nested.
OpenGL
In OpenGL, we can use the glOrtho() function to choose the orthographic projection mode and
specify its clipping volume:
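void glOrtho(GLdouble left, GLdouble right, GLdouble bottom, GLdouble top,
             GLdouble zNear, GLdouble zFar)
// left, right, bottom and top specify the extent of the clipping volume;
// zNear and zFar specify the positions of the front and back clipping planes.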
The default 3D projection in OpenGL is the orthographic (instead of perspective) with parameters
(-1.0, 1.0, -1.0, 1.0, -1.0, 1.0), i.e., a cube with sides of 2.0, centered at origin.
The vertex processing stage transforms individual vertices. The relationships between vertices
(i.e., primitives) are not considered in this stage.
5. Rasterization
In the previous vertex processing stage, the vertices, which are usually represented with
floating-point values, are not necessarily aligned with the pixel-grid of the display. The
relationships between vertices, in terms of primitives, are also not considered.
In this rasterization stage, each primitive (such as triangle, quad, point and line), which is defined
by one or more vertices, is raster-scanned to obtain a set of fragments enclosed within the
primitive. Fragments can be treated as 3D pixels, which are aligned with the pixel-grid. The 2D
pixels have a position and an RGB color value. The 3D fragments, which are interpolated from the
vertices, have the same set of attributes as the vertices, such as position, color, normal and
texture coordinates.
The substages of rasterization include viewport transform, clipping, perspective division, back-
face culling, and scan conversion. The rasterizer is not programmable, but is configurable via
directives.
5.1 Viewport Transform
Viewport
A viewport is a rectangular display area on the application window, which is measured in screen
coordinates (in pixels, with the origin at the top-left corner). A viewport defines the size and
shape of the display area that maps the projected scene captured by the camera onto the
application window. It may or may not occupy the entire screen.
Viewport Transform
Our final transform, viewport transform, maps the clipping-volume (2x2x1 cuboid) to the 3D
viewport, as illustrated.
The viewport transform is made up of a series of reflection (of the y-axis), scaling (of the x, y and
z axes), and translation (of the origin from the center of the near plane of the clipping volume to
the top-left corner of the 3D viewport). In OpenGL, the viewport is set via glViewport():
// Set the viewport (display area on the window) to cover the whole application window
glViewport(0, 0, width, height);
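In practice (a sketch assuming the GLUT framework; the handler name reshape and the numeric
arguments are illustrative), the viewport is typically set in the window's reshape callback,
together with a projection of matching aspect ratio to avoid distortion:

void reshape(GLsizei width, GLsizei height) {
   if (height == 0) height = 1;               // prevent a divide-by-zero
   GLfloat aspect = (GLfloat)width / (GLfloat)height;
   glViewport(0, 0, width, height);           // viewport covers the whole window
   glMatrixMode(GL_PROJECTION);               // select the projection matrix
   glLoadIdentity();
   gluPerspective(45.0, aspect, 0.1, 100.0);  // match the viewport's aspect ratio
}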
5.2 Back-Face Culling
While view-frustum culling discards objects outside the view frustum, back-face culling discards
primitives that are not facing the camera.
A back-face can be detected based on its surface-normal vector and the vector connecting the
surface and the camera.
Back-face culling shall not be enabled if the object is transparent and alpha blending is enabled.
OpenGL
In OpenGL, face culling is disabled by default, and both front-faces and back-faces are rendered.
We can use the function glCullFace() to specify whether the back-face (GL_BACK), front-face
(GL_FRONT) or both (GL_FRONT_AND_BACK) shall be culled.
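For example, a minimal sketch:

glEnable(GL_CULL_FACE);  // face culling is disabled by default; enable it
glCullFace(GL_BACK);     // cull the back-faces (GL_BACK is also the default mode)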
6. Fragment Processing
After rasterization, we have a set of fragments for each primitive. A fragment has a position,
which is aligned to the pixel-grid. It has a depth, color, normal and texture coordinates, which are
interpolated from the vertices.
Fragment processing focuses on texture and lighting, which have the greatest impact on the
quality of the final image. We shall discuss texture and lighting in detail in later sections.
7. Output Merging
7.1 Z-Buffer and Hidden-Surface Removal
A z-buffer (or depth-buffer) can be used to remove hidden surfaces (surfaces blocked by other
surfaces that cannot be seen from the camera). The z-buffer of the screen is initialized to 1
(farthest) and the color-buffer is initialized to the background color. For each fragment (of each
primitive) processed, its z-value is checked against the buffered value. If its z-value is smaller
than the buffered z-value, its color and z-value are copied into the buffer. Otherwise, this
fragment is occluded by another object and is discarded. In this algorithm, the fragments can be
processed in any order.
OpenGL
In OpenGL, to use z-buffer for hidden-surface removal via depth testing, we need to:
1. Request a z-buffer via glutInitDisplayMode():

   glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE | GLUT_DEPTH); // GLUT_DEPTH to request a depth-buffer

2. Enable depth testing:

   glEnable(GL_DEPTH_TEST);

3. Clear the z-buffer (to 1, denoting the farthest) and the color buffer (to the
background color):

   glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
7.2 Alpha-Blending
Hidden-surface removal works only if the front object is totally opaque. In computer graphics, a
fragment is not necessarily opaque, and could contain an alpha value specifying its degree of
transparency. The alpha value is typically normalized to the range [0, 1], with 0 denoting totally
transparent and 1 denoting totally opaque. If the fragment is not totally opaque, then part of its
background object could show through, which is known as alpha blending. Alpha-blending and
hidden-surface removal are mutually exclusive.
In a blending equation, the color of the incoming fragment (the source) is combined with the
color already in the frame buffer (the destination), each weighted by a blending factor. For this
blending equation, the order of placing the fragments is important: they must be sorted from
back to front, with the largest z-value processed first. Also, the destination alpha value is not
used.
if (blendingEnabled) {
glEnable(GL_BLEND); // Enable blending
glDisable(GL_DEPTH_TEST); // Need to disable depth testing
} else {
glDisable(GL_BLEND);
glEnable(GL_DEPTH_TEST);
}
There are many choices of the blending factors. For example, a popular choice is:
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
where each component of the source is weighted by source's alpha value (As), and each
component of the destination is weighted by 1-As. In this case, if the original color component's
value is within [0.0, 1.0], the resultant value is guaranteed to be within this range. The drawback is
that the final color depends on the order of rendering if many surfaces are added one after
another (because the destination alpha value is not considered).
Another choice is:
glBlendFunc(GL_SRC_ALPHA, GL_ONE);
where each component of the source is weighted by the source's alpha value (As), and each
component of the destination is weighted by 1. The values may overflow or underflow, but the
final color does not depend on the order of rendering when many objects are added.
Other values for the blending factors
include GL_ZERO, GL_ONE, GL_SRC_COLOR, GL_ONE_MINUS_SRC_COLOR, GL_DST_COLOR,
GL_ONE_MINUS_DST_COLOR, GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, GL_DST_ALPHA,
GL_ONE_MINUS_DST_ALPHA, GL_CONSTANT_COLOR, GL_ONE_MINUS_CONSTANT_COLOR,
GL_CONSTANT_ALPHA, and GL_ONE_MINUS_CONSTANT_ALPHA.
The default for source blending factor is GL_ONE, and the default for destination blending factor
is GL_ZERO. That is, opaque (totally non-transparent) surfaces.
The computations also explain why depth-testing shall be disabled when alpha-blending is
enabled. This is because the final color will be determined by blending between source and
destination colors for translucent surfaces, instead of relative depth (the color of the nearer
surface) for opaque surfaces.
8. Lighting
Lighting refers to the handling of interactions between the light sources and the objects in the 3D
scene. Lighting is one of the most important factors in producing a realistic scene.
The color that we see in the real world is the result of the interaction between the light sources
and the color material surfaces. In other words, three parties are involved: viewer, light sources,
and the material. When light (of a certain spectrum) from a light source strikes a surface, some
gets absorbed, some is reflected or scattered. The angle of reflection depends on the angle of
incidence and the surface normal. The amount of scattering depends on the smoothness and
the material of the surface. The reflected light also spans a certain color spectrum, which depends
on the color spectrum of the incident light and the absorption property of the material. The
strength of the reflected light depends on the position and distance of the light source and the
viewer, as well as the material. The reflected light may strike other surfaces, and some is absorbed
and some is reflected again. The color that we perceive for a surface is the reflected light
hitting our eye. In a 2D photograph or painting, objects appear to be three-dimensional due to
small variations in colors, known as shades.
The default material has a gray surface (under white light), with a small amount of ambient
reflection (0.2, 0.2, 0.2, 1.0), high diffuse reflection (0.8, 0.8, 0.8, 1.0), and no specular reflection
(0.0, 0.0, 0.0, 1.0).
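For instance (a minimal sketch whose values mirror the defaults quoted above), material
properties can be set via glMaterialfv():

GLfloat ambient[]  = {0.2f, 0.2f, 0.2f, 1.0f};   // small ambient reflection
GLfloat diffuse[]  = {0.8f, 0.8f, 0.8f, 1.0f};   // high diffuse reflection
GLfloat specular[] = {0.0f, 0.0f, 0.0f, 1.0f};   // no specular reflection
glMaterialfv(GL_FRONT, GL_AMBIENT,  ambient);
glMaterialfv(GL_FRONT, GL_DIFFUSE,  diffuse);
glMaterialfv(GL_FRONT, GL_SPECULAR, specular);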
9. Texture
In computer graphics, we often overlay (or paste or wrap) images, called textures, over the
graphical objects to make them realistic.
A texture is typically a 2D image. Each element of the texture is called a texel (texture element),
similar to a pixel (picture element). The 2D texture coordinates (s, t) are typically normalized to
[0.0, 1.0], with the s-axis pointing right and the t-axis pointing up; in OpenGL, the origin (0, 0) is
at the bottom-left corner of the texture image.
9.1 Texture Wrapping
Although the 2D texture coordinates are normalized to [0.0, 1.0], we can configure the behavior
when the coordinates fall outside this range.
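For example (a minimal sketch; GL_REPEAT is the default wrap mode):

glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);  // tile the texture along s
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP);   // clamp t to [0.0, 1.0]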
9.2 Texture Filtering
In general, the resolution of the texture image differs from that of the display area. If the
resolution of the texture image is smaller, we need to perform so-called magnification to
magnify the texture image to match the display. On the other hand, if the resolution of the
texture image is larger, we perform minification.
Magnification
The commonly used methods are:
1. Nearest Point Filtering: the texture color-value of the fragment is taken from the
nearest texel. This filter leads to "blockiness" as many fragments are using the
same texel.
2. Bilinear Interpolation: the texture color-value of the fragment is formed via bilinear
interpolation of the four nearest texels. This yields smoother result.
Minification
Minification is needed if the resolution of the texture image is larger than that required by the
fragments. Again, you can use the "nearest-point sampling" or "bilinear interpolation" methods.
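In OpenGL, the filters can be selected via glTexParameteri() (a minimal sketch):

glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);   // bilinear magnification
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);  // nearest-point minification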
However, these sampling methods often lead to the so-called "aliasing artifact", due to the low
sampling frequency compared with that of the signal. For example, a far-away object in
perspective projection will look strange due to its high signal frequency.
Furthermore, in perspective projection, the fast texture interpolation scheme may not handle the
distortion caused by the perspective projection. The following command can be used to ask the
renderer to produce a better texture image, at the expense of processing speed:
glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_NICEST);