Perspective Projection Matrix


Introduction

This lesson is one of the most popular on Scratchapixel, which indicates a few things. First, there seem to be few places on the Internet that explain what these matrices are, how they work, and how they are built, and the sites that do address these topics are not always complete or accurate. Secondly, the general consensus seems to be that "projection" is a critical part of the graphics pipeline that must be understood in order to solve the mystery of how images are created with computers. Yet the perspective projection matrix is actually not used by ray tracers. The popularity of this topic and lesson likely comes from the fact that it is used in OpenGL, which is probably the most common rendering API in the world. So due to its popularity and importance, we felt compelled to review this lesson a second time. As promised, we have added some information about the OpenGL perspective projection matrix, as it is usually not properly explained, particularly when it comes to showing how this matrix can be derived. Please keep in mind that this lesson is part of the advanced section. For background information and a gentle introduction to the topics of perspective projection and cameras, we recommend the lessons in the Basic Section on the Introduction to Ray Tracing and Cameras. In this chapter, we will present a quick refresher on perspective projection and show how a simple perspective matrix can be built. Some formulas will not be explained in this chapter; to fully understand how they are found, you will need to read Chapter Two, in which we show how to derive the perspective projection matrix used by OpenGL. In Chapter Three, we will also examine the orthographic projection matrix, although its use is not as widespread as that of the perspective projection matrix.

Some History

The mathematics behind perspective projection started to be understood and mastered by artists towards the end of the fourteenth century and the beginning of the fifteenth century. Artists greatly contributed to the education of others in the mathematical basis of perspective drawing through books that they would write and illustrate themselves. A notable example is "The Painter's Manual", published by Albrecht Dürer in 1538 (the illustration below comes from this book). Perspective drawing is largely characterized by two concepts: that objects appear smaller as their distance to the viewer increases, and that of foreshortening. Foreshortening describes the impression, or optical illusion, that an object or a distance is smaller than it really is, due to being angled towards the viewer. Another rule related to foreshortening states that vertical lines are parallel, while nonvertical lines converge to a perspective point, thereby appearing shorter than they really are. These effects give a sense of depth, which is useful in evaluating the distance of objects from the viewer. Today, the same mathematical principles are used in computer graphics to create a perspective view of a 3D scene.

What is the Perspective Projection Matrix Used for?

The perspective projection matrix is actually quite fundamental to the process of creating 2D images from 3D models. Strangely enough, however, very little information about it can be found in books and on the Internet, probably because using ray tracing to render an image eliminates the need for a perspective matrix. In ray tracing, images are created by casting rays, which are built in the following manner: for each pixel of the frame, we trace a ray starting at the camera's origin and passing through the center of the current pixel. This ray is then cast into the scene and tested for an intersection with the scene geometry. If an intersection is found, the color of the current pixel is set to the object's color at the intersection point. So where does the perspective projection matrix come in? The perspective projection matrix makes the reverse process possible. Take, for example, the vertices making up a polygonal mesh. We take the points, whose color we know, in 3D space and transform them using the perspective projection matrix to find their position on the 2D image plane. Once we have transformed a point's 3D world position into a raster position, i.e. into its pixel coordinates, we can set the color of the pixel to the color of the point.

Even if your renderer is a ray tracer, the perspective matrix can be useful in certain situations, such as projecting a bounding box onto the screen (projecting a bounding box onto the screen is a quick way to find out whether the geometry enclosed in this bounding box is visible to the camera). The perspective projection matrix, however, is more appropriate for renderers and 3D APIs, like OpenGL, that use depth-based, hidden-surface algorithms.

Projecting Points onto the Screen

Before we study how to create the perspective matrix, we will first learn how to project 3D points onto the screen. Usually, 3D points being projected onto the image plane are first transformed into the camera coordinate system. In this coordinate system, the eye position corresponds to the origin, the x- and y-axes define a plane parallel to the image plane, and the z-axis is perpendicular to that xy plane. In our setup, the image plane will be located exactly one unit away from the origin of the camera coordinate system, i.e. the eye. You might be confused by this convention if you are used to a system in which the distance to the image plane is not equal to one unit, as is the case with OpenGL. Readers interested in the OpenGL case will find a full explanation in Chapter Two; for now, keep in mind that this setup simplifies the demonstration.

Note that Scratchapixel uses a right-hand coordinate system, as do many other commercial applications, such as Maya. To learn more about right- and left-hand coordinate systems, check out the lesson on Geometry in the Basic Section. Because we use a right-hand coordinate system, the camera will be pointing in a direction opposite to the z-axis. This is due to the fact that when we project points onto the image plane, we want the x-axis to point to the right. Mathematically speaking, all points visible to the camera have a negative z-component, when the points are expressed in the camera coordinate system.

Let us imagine that we want to project point P onto the screen. If we draw a line from P to the eye position, we can see that P is projected onto the screen at Ps. How do we compute Ps?

Figure 1: To project P onto the image plane (at Ps), we divide the x- and y-coordinates of P by the z-coordinate of P.

In Figure 1, you can see that the green (\(\scriptsize \Delta ABC\)) and red (\( \scriptsize \Delta DEF\)) triangles have the same shapes and are thus said to be similar. In other words, the red triangle is a scaled-down version of the green triangle. Similar triangles have a useful property: the ratio between their respective sides is constant. In other words:

$$ \scriptsize {{BC} \over {EF}} = {{AB} \over {DE}}$$

and since we are interested in the side BC, i.e. the position of Ps in the image plane, we can write:

$$\scriptsize BC = {{AB * EF} \over {DE}}$$

And given that B lies on the image plane, which is one unit away from A (AB = 1), we have our final formula for calculating the length of BC:

$$\scriptsize BC = {{EF} \over {DE}}$$

From this equation, we can find the x- and y-coordinates of Ps. Because we already know that its z-coordinate is equal to 1 (the point lies on the image plane, which is one unit away from the eye position or camera origin), all we need to do is divide the x- and y-coordinates of P by the z-coordinate of P. In mathematical form, we can write (Equation 1):

$$\scriptsize \begin{array}{l} Ps_x=\frac{P_x}{-P_z} \\ Ps_y=\frac{P_y}{-P_z} \\ Ps_z=\frac{-P_z}{-P_z}=1\end{array}$$

Note that we have divided by -Pz and not Pz, because the z-component of any point visible to the camera is always negative when the point is expressed in the camera coordinate system. Computing the coordinates of Ps, which is the projection of P onto the image plane, is that simple. Note that Figure 1 only shows the projection of the y-coordinate of P onto the image plane. If you rotate Figure 1 by ninety degrees clockwise and replace the y-axis with the x-axis, you get a top view representing the projection of the x-coordinate of P onto the image plane.
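To make this concrete, here is a minimal C++ sketch of Equation 1 (the Vec3 struct and project function are hypothetical helpers written for this illustration; they are not part of the lesson's source code):

```cpp
#include <cstdio>

// a minimal point type for this sketch (hypothetical)
struct Vec3 { float x, y, z; };

// project a camera-space point onto the image plane one unit away,
// as in Equation 1: divide x and y by -z (z is negative in front of
// a right-handed camera)
Vec3 project(const Vec3 &p)
{
    return { p.x / -p.z, p.y / -p.z, 1 };
}

int main()
{
    Vec3 p = { 1, 2, -5 };  // a point 5 units in front of the camera
    Vec3 ps = project(p);
    printf("Ps = (%f, %f)\n", ps.x, ps.y);  // prints Ps = (0.2, 0.4)
}
```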

Homogeneous Coordinates

Now you may think that there is nothing particularly complicated about perspective projection, and you would be right. The principle itself is rather simple. The story, however, does not stop here. What we really want is to encode this projection process into a matrix, so that projecting a point onto the image plane can be obtained via a basic point-matrix multiplication. If you remember what we said in the lesson on linear algebra, two matrices can be multiplied with each other only when the number of columns of the left matrix equals the number of rows of the right matrix (in other words, when the inner dimensions match):

$$\scriptsize \begin{array}{ll} {\color{red} {\text{no:}}} & [n \: m] * [q \: n] \\ {\color{green} {\text{yes:}}} & [m \: n] * [n \: q] \\ \end{array} $$

Now remember that a point can be represented by a one-row matrix (some people prefer the one-column notation, but Scratchapixel uses the one-row notation). But then our point is a 1x3 matrix (1 row, 3 columns) and therefore cannot be multiplied by a 4x4 matrix, which is what we use in CG to represent transformations. So what can we do? To solve this problem, we employ a trick that consists of representing the point using homogeneous coordinates. Points expressed in homogeneous coordinates do not have three but rather four coordinates and can therefore be represented in the form of a 1x4 matrix. The fourth coordinate of a point in homogeneous form is denoted by the letter w. When we convert a point from Cartesian coordinates to homogeneous coordinates, w is set to 1. \(\scriptsize P_c\) (a point in Cartesian coordinates) and \(\scriptsize P_h\) (a point in homogeneous coordinates) are interchangeable, as long as w = 1. When w is different from 1, we must divide all four coordinates of the point [x y z w] by w in order to normalize w back to 1. The following example illustrates this normalization (Equation 2):

$$ \scriptsize \begin{array}{l} [x \: y \: z] \neq [x \: y \: z \: w=1.2] \\ x=\frac{x}{w}, y=\frac{y}{w}, z=\frac{z}{w}, w=\frac{w}{w}=1 \\ {[x \: y \: z] = [x \: y \: z \: w = 1]} \end{array}$$
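As a quick illustration, here is what this normalization step might look like in code (a sketch only; the Vec4 struct and normalizeW function are hypothetical helpers, not part of the lesson's source code):

```cpp
#include <cstdio>

// a homogeneous point [x y z w] (hypothetical struct for this example)
struct Vec4 { float x, y, z, w; };

// convert a homogeneous point back to w = 1 by dividing through by w (Equation 2)
Vec4 normalizeW(Vec4 p)
{
    if (p.w != 1 && p.w != 0) {
        p.x /= p.w; p.y /= p.w; p.z /= p.w; p.w = 1;
    }
    return p;
}

int main()
{
    Vec4 p = { 1.2f, 2.4f, 3.6f, 1.2f };
    Vec4 c = normalizeW(p);
    printf("(%f, %f, %f, %f)\n", c.x, c.y, c.z, c.w);  // prints (1, 2, 3, 1)
}
```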

To summarize: we know that we can represent a point using four coordinates, as long as the fourth coordinate w is equal to 1, which makes point-matrix multiplication possible. Next we will study how to construct this matrix so that the multiplication results in the projection of the point onto the image plane.

A Simple Perspective Matrix

Recall from the lesson on linear algebra that the multiplication of a point by a matrix is as follows:

$$\scriptsize \begin{equation} \begin{bmatrix} x & y & z & w\end{bmatrix} * \begin{bmatrix} m_{00} & m_{01} & m_{02} & m_{03}\\ m_{10} & m_{11} & m_{12} & m_{13}\\ m_{20} & m_{21} & m_{22} & m_{23}\\ m_{30} & m_{31} & m_{32} & m_{33} \end{bmatrix} \end{equation}$$

$$\scriptsize \begin{array}{l} x' = x * m_{00} + y * m_{10} + z * m_{20} + w * m_{30}\\ y' = x * m_{01} + y * m_{11} + z * m_{21} + w * m_{31}\\ z' = x * m_{02} + y * m_{12} + z * m_{22} + w * m_{32}\\ w' = x * m_{03} + y * m_{13} + z * m_{23} + w * m_{33}\end{array}$$

Also remember from the beginning of this lesson that point Ps, i.e. the projection of P onto the image plane, can be computed by dividing the x- and y-coordinates of P by its z-coordinate (see Equation 1). So how do we compute Ps using point-matrix multiplication? First, we set x', y' and z' (the coordinates of Ps) to x, y and z (the coordinates of P). This is easy enough: set the matrix to the identity matrix (the identity matrix is the matrix whose pivot coefficients, i.e. the coefficients along the diagonal, equal 1, while all other coefficients equal 0). Then we need to divide x' and y' by -z. But how can a matrix perform this division? We explained in the previous section that a point expressed in the homogeneous coordinate system (instead of in the Cartesian coordinate system) has a w-coordinate that equals 1, and that when the value of w is different from 1, we must divide the x-, y-, z- and w-coordinates of the point by w to reset it back to 1 (Equation 2). The trick of the perspective projection matrix thus consists of making the w'-coordinate of Ps different from 1, so that we have to divide x', y', and z' by w'. If we set w' to -z, then dividing by w' performs exactly the perspective divide of Equation 1. This operation is usually known in the literature as the z-divide or perspective divide. To set w' to -z, we set the coefficients of the fourth column of the matrix to 0, 0, -1, 0 (note that we also set the third coefficient of the third column to -1, so that z' = -z and z'/w' = 1, consistent with Equation 1).

$$\scriptsize \left[ \begin{array}{rrrr}x & y & z & 1\end{array} \right] * \left[ \begin{array}{rrrr}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & -1 & -1\\ 0 & 0 & 0 & 0\end{array} \right] $$

$$\scriptsize \begin{array}{ll} \mbox{line 1:}&x' = x * 1 + y * 0 + z * 0 + 1 * 0\\ \mbox{line 2:}&y' = x * 0 + y * 1 + z * 0 + 1 * 0\\ \mbox{line 3:}&z' = x * 0 + y * 0 + z * -1 + 1 * 0\\ \mbox{line 4:}&w' = x * 0 + y * 0 + z * -1 + 1 * 0\end{array}$$

$$\scriptsize x' = \frac{x}{-z}, \quad y' = \frac{y}{-z}, \quad z' = \frac{-z}{-z} = 1$$
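The following sketch multiplies a point (in row-vector form, with w = 1) by this simple matrix and performs the divide by w'; the point is the same arbitrary test value as in the earlier sketch:

```cpp
#include <cstdio>

int main()
{
    // the simple perspective matrix derived above
    float m[4][4] = {
        { 1, 0,  0,  0 },
        { 0, 1,  0,  0 },
        { 0, 0, -1, -1 },
        { 0, 0,  0,  0 }
    };
    float p[4] = { 1, 2, -5, 1 };   // a point 5 units in front of the camera
    float out[4] = { 0, 0, 0, 0 };
    // row vector times matrix
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            out[col] += p[row] * m[row][col];
    // perspective divide by w' = -z = 5
    float w = out[3];
    for (int i = 0; i < 4; ++i) out[i] /= w;
    printf("Ps = (%f, %f, %f)\n", out[0], out[1], out[2]);  // (0.2, 0.4, 1)
}
```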

At this point in the lesson, we now have a basic perspective projection matrix which can be used to compute Ps. Now let us refine our model.

The Clipping Planes

Another goal of the perspective projection matrix is to normalize the z-coordinate of P, that is, to scale its value between 0 and 1. To do so, we will use the near and far clipping planes, which should be passed to the renderer as parameters of the camera (if you are unsure about what these parameters do, see the lesson on Cameras in the Basic Section). To achieve this goal, we will change the third and fourth coefficients of the third column of the matrix (the column used to compute z') so that two conditions are fulfilled: when P lies on the near clipping plane, z' is equal to 0 after the z-divide, and when P lies on the far clipping plane, z' is equal to 1 after the z-divide. This remap operation is obtained by setting these coefficients to:

$$\scriptsize -\frac{f}{(f-n)}$$

and

$$\scriptsize -\frac{f*n}{(f-n)}$$

respectively, where n stands for the near clipping plane and f for the far clipping plane (see the next chapter to learn how these formulas are derived). To convince you that this works, let's look at the value of z' when P lies on the near and far clipping planes:

$$\scriptsize \text{P on the near plane } (z = -n): \quad \dfrac{\dfrac{-z f - f n}{f-n}}{-z} = \dfrac{\dfrac{n f - f n}{f-n}}{n} = 0$$ $$\scriptsize \text{P on the far plane } (z = -f): \quad \dfrac{\dfrac{-z f - f n}{f-n}}{-z} = \dfrac{\dfrac{f^2 - f n}{f-n}}{f} = \dfrac{\dfrac{f(f-n)}{f-n}}{f} = \dfrac{f}{f} = 1$$

In the first line, z equals -n (the point lies on the near clipping plane), so the numerator is equal to 0, and therefore the result of the equation is 0. In the second line, we have replaced z with -f, the far clipping plane. By rearranging the terms, we can see that the (f-n) terms of the numerator cancel out, and we are left with f divided by itself, which equals 1.
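For readers who prefer to verify this numerically, here is a small sketch (n = 1 and f = 20 are arbitrary test values):

```cpp
#include <cstdio>

// remap the z-coordinate of a camera-space point using the third-column
// coefficients derived above, then apply the perspective divide by w' = -z
float remapZ(float z, float n, float f)
{
    float zp = z * (-f / (f - n)) - f * n / (f - n);
    return zp / -z;
}

int main()
{
    float n = 1, f = 20;
    printf("z' at the near plane = %f\n", remapZ(-n, n, f));  // prints 0
    printf("z' at the far plane  = %f\n", remapZ(-f, n, f));  // prints 1
}
```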

Question from a reader: "You give the solution for remapping z to 0 to 1, but how did you come up with these formulas?". We will explain how to derive these formulas in the next chapter.

Our modified perspective projection matrix that projects P to Ps and remaps the z'-coordinate of Ps from 0 to 1 now looks like this:

$$\scriptsize \left[\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -\frac{f}{(f-n)} & -1\\ 0 & 0 & -\frac{f*n}{(f-n)}& 0\\ \end{array}\right]$$

Taking the Field of View into Account

All we need to do to get a complete perspective projection matrix is to account for the field of view (FOV) of the camera. We know that by changing the focal length of a zoom lens on a real camera, we can change how much is seen in the picture. And in that sense, we want our CG camera to work in the same way.

Figure 2: Changing the focal length makes it possible to see more or less of the scene we photograph. As can be seen in this illustration, though, it normally changes the size of the screen window.

The size of the projection window is [-1:1] in each dimension. In other words, once a point is projected, it is visible when its x- and y-coordinates are within the range [-1:1]. Points whose projected coordinates are not contained in this range are invisible and are therefore not drawn.

Figure 3: The field of view or FOV controls how much of the scene is viewed.

Note that in our system, the screen window maximum and minimum values do not change. They are always in the range [-1:1] regardless of the value used for the FOV. And the distance to the screen window from the eye position does not change either. When the FOV changes, however, we have just shown that the screen window should accordingly become larger or smaller (see Figures 2 and 5). How do we reconcile this contradiction? Since we want the screen window to be fixed, what we will change instead are the projected coordinates. We will scale them up or down and test them against the fixed borders of the screen window. Let's work through a few examples.

Figure 4: To account for the field of view effect while keeping the size of the screen window the same (in the range [-1:1]), we need to scale the points up or down, depending on the FOV value.

Imagine a point whose projected x-y coordinates are (1.2, 1.3). These coordinates are outside the range [-1:1], and the point is therefore not visible. If we scale them down by multiplying them by 0.7, the new, scaled coordinates of the point become (0.84, 0.91). This point is now visible, since both coordinates are in the range [-1:1]. This action corresponds to the physical action of zooming out. Zooming out means decreasing the focal length on a zoom lens or increasing the FOV. For the opposite effect, multiply by a value greater than 1. For example, imagine a point whose projected coordinates are (-0.5, 0.3). If you multiply these numbers by 2.1, the new, scaled coordinates are (-1.05, 0.63). The y-coordinate is still contained within the range [-1:1], but now the x-coordinate is lower than -1 and thus too far to the left. The point which was originally visible becomes invisible after scaling. What happened? You zoomed in.

Figure 5: Zooming in or out normally changes the size of the screen window. See how it becomes bigger or smaller as the field of view increases or decreases.

To scale the projected coordinates up or down, we will use the field of view of the camera. The field of view (or angle of view) intuitively controls how much of the scene is visible to the camera. See the lesson on Camera in the Basic Section for more information.

The FOV can be either the horizontal or the vertical angle. If the screen window is square, the choice does not matter, as the horizontal and vertical angles are the same. If the frame aspect ratio is different from 1, however, the choice of which angle the FOV refers to matters (check the lesson on cameras in the Basic Section). In OpenGL (GLUT, more precisely), the FOV corresponds to the vertical angle. In this lesson, the FOV is considered to be the horizontal angle.

The value of the FOV, however, is not used directly; rather, we use the tangent of the angle. In the CG literature, the FOV can be defined as either the angle or half of the angle that is subtended by the viewing cone. We believe it is more intuitive to see the FOV as the angular extent of the visible scene rather than as half of this angle (as represented in Figures 3 and 5). To find a value that can be used to scale the projected coordinates, however, we need to divide the FOV angle by two, which explains why the FOV is sometimes expressed as the half-angle. Why do we divide the angle in half? What is of interest to us is the right triangle inscribed in the cone. The angle between the hypotenuse and the adjacent side of the triangle (the FOV half-angle) controls the length of the triangle's opposite side; by increasing or decreasing this angle, we can scale the border of the image window up or down. And since we need a value that is centered around 1, we will take the tangent of this angle to scale our projected coordinates. Note that when the FOV half-angle is 45 degrees (the FOV is then 90 degrees), the tangent of this angle is equal to 1; therefore, when we multiply the projected coordinates by 1, the coordinates do not change. For FOV values less than 90 degrees, the tangent of the half-angle gives values smaller than 1, and for values greater than 90 degrees, it gives values greater than 1. But the opposite effect is needed: recall that zooming in corresponds to a decrease in FOV, so we need to multiply the projected point coordinates by a value greater than 1, while zooming out corresponds to an increase in FOV, so we need to multiply these coordinates by a value less than 1. Thus, we will use the reciprocal of the tangent, or in other words, one over the tangent of the FOV half-angle.
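A short sketch makes the behavior of this scale factor obvious (the FOV values below are arbitrary test values chosen for illustration):

```cpp
#include <cstdio>
#include <cmath>

int main()
{
    const float pi = 3.14159265f;
    // scale factor S = 1 / tan(fov / 2) for a few FOV values, in degrees
    float fovs[3] = { 60, 90, 120 };
    for (int i = 0; i < 3; ++i) {
        float S = 1 / tanf(fovs[i] * 0.5f * pi / 180);
        printf("fov = %3.0f degrees -> S = %f\n", fovs[i], S);
    }
    // fov = 60 -> S ~ 1.73 (zoom in), fov = 90 -> S = 1,
    // fov = 120 -> S ~ 0.58 (zoom out)
}
```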

The remapping of the z-coordinate from 0 to 1 is not a linear process. If we plot the value of z' with the near and far clipping planes set to 1 and 20, respectively, we see that the curve is steep for values in the interval [1:3] and quite flat for values greater than 7. This means that the precision of z' is high in the proximity of the near clipping plane and low as we get closer to the far clipping plane. If the range [near:far] is too large, depth precision problems known as "z-fighting" can arise in depth-based hidden-surface renderers. It is therefore important to make this interval as small as possible in order to minimize the depth buffer precision problem.
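A quick sketch makes this easy to see; sampling z' between the near and far clipping planes (1 and 20, as above) shows how quickly the curve flattens:

```cpp
#include <cstdio>

int main()
{
    const float n = 1, f = 20;
    // a point at distance z in front of the camera has camera-space depth -z,
    // so z' = (z * f / (f - n) - f * n / (f - n)) / z after the divide by w' = z
    for (float z = n; z <= f; z += 1) {
        float zp = (z * f / (f - n) - f * n / (f - n)) / z;
        printf("z = %5.1f -> z' = %f\n", z, zp);
    }
    // z' already reaches ~0.7 at z = 3 and ~0.9 at z = 7: most of the [0:1]
    // range is used up close to the near clipping plane
}
```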

Here is the final equation to compute the value used to scale the coordinates of the projected point:

$$\scriptsize S = \dfrac{1}{\tan\left(\dfrac{fov}{2} \cdot \dfrac{\pi}{180}\right)}$$

And thus we have the final version of the perspective projection matrix (Equation 4):

$$ \scriptsize \left[\begin{array}{cccc} S & 0 & 0 & 0 \\ 0 & S & 0 & 0 \\ 0 & 0 & -\frac{f}{(f-n)} & -1\\ 0 & 0 & -\frac{f*n}{(f-n)}& 0\\ \end{array}\right]$$

Are There Different Ways of Building this Matrix?

Yes and no. Some renderers may have a different implementation of the perspective projection matrix, as is the case with OpenGL. OpenGL uses a call to glFrustum to create a perspective projection matrix. This call takes as arguments the left, right, bottom, and top screen coordinates, in addition to the near and far clipping planes. Unlike our system, OpenGL assumes that the points in the scene are projected onto the near clipping plane, rather than onto a plane that lies one unit away from the camera position. The matrix itself might also look slightly different. Be careful about the convention used for vectors and matrices: the projected point can be represented as either a row or a column vector. Check also whether the renderer uses a left- or right-handed coordinate system, as that can change the sign of some of the matrix coefficients. Despite these differences, the underlying principle of the perspective projection matrix is the same for all renderers: they always divide the x- and y-coordinates of the point by its z-coordinate. In the end, all these matrices project the same points to the same pixel coordinates, regardless of the conventions used. We will study the construction of the OpenGL matrix in the next chapter.
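As a quick note on the row- versus column-vector issue: the two conventions are simply transposes of each other, so a matrix written for one convention can be converted to the other by transposing it:

$$\scriptsize P' = P M \quad \Longleftrightarrow \quad P'^T = M^T P^T$$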

Test Program

To test the perspective projection matrix, we have written a small program to project the vertices of a polygonal object (Newell's teapot) onto the image plane. The program itself, which is available for download, is simple in its implementation. A function is used to build the perspective projection matrix. Its arguments are the camera's near and far clipping planes, as well as its field of view in degrees. The function directly implements Equation 4:

```cpp
// requires <cmath> for tan(); Matrix44 and degtorad() are defined elsewhere
// in the program
template<typename T>
void setperspectivepmat(const T &near, const T &far, const T &fov, Matrix44<T> &mat)
{
    // scale the projected x- and y-coordinates using the FOV (see Equation 4)
    T scale = T(1) / tan(degtorad(fov * 0.5));
    mat[0][0] = mat[1][1] = scale;
    // remap z to the range [0:1] and set w' to -z for the perspective divide
    mat[2][2] = -far / (far - near);
    mat[3][2] = -far * near / (far - near);
    mat[2][3] = -1;
    mat[3][3] = 0;
}
```

The vertices of the teapot are stored in an array. Each point is then projected onto the image plane using a simple point-matrix multiplication (the call verts[i] * mat in the loop below), where the matrix is the perspective projection matrix.

```cpp
#include <algorithm>
#include <cstring>
#include <fstream>

template<typename T>
void projectverts(const Matrix44<T> &mat, const Point3<T> *verts, const unsigned &nverts)
{
    // create an image
    unsigned width = 640;
    unsigned char *buffer = new unsigned char [width * width];
    memset(buffer, 0x0, width * width);
    for (unsigned i = 0; i < nverts; ++i) {
        // project the vertex onto the image plane
        Point3<T> ps = verts[i] * mat;
        // the projected point is only visible if its coordinates are in [-1:1]
        if (ps.x < -1 || ps.x > 1 || ps.y < -1 || ps.y > 1) continue;
        // convert projected point coordinates to pixel coordinates
        unsigned px = std::min(unsigned((ps.x + 1) * 0.5 * width), width - 1);
        unsigned py = std::min(unsigned((1 - (ps.y + 1) * 0.5) * width), width - 1);
        buffer[py * width + px] = 255;
    }
    // save to ppm
    std::ofstream ofs;
    ofs.open("./untitled.ppm");
    ofs << "P5\n" << width << " " << width << "\n255\n";
    ofs.write((char*)buffer, width * width);
    ofs.close();
    delete [] buffer;
}
```

The following code computes the product of a point with a matrix. To do so, the point must be treated as if it were expressed in homogeneous coordinates, with an implicit fourth coordinate equal to 1. Note how we compute the new point's fourth coordinate, w, and divide the new point's coordinates by w only when w is different from 1. This is where and when the z- or perspective divide occurs:

```cpp
// member function of the Point3<T> class
Point3<T> operator * (const Matrix44<T> &mat) const
{
    // the input point is treated as if it had a fourth coordinate w = 1
    Point3<T> pt(
        x * mat[0][0] + y * mat[1][0] + z * mat[2][0] + mat[3][0],
        x * mat[0][1] + y * mat[1][1] + z * mat[2][1] + mat[3][1],
        x * mat[0][2] + y * mat[1][2] + z * mat[2][2] + mat[3][2]);
    // compute w' and, if it differs from 1, perform the perspective divide
    T w = x * mat[0][3] + y * mat[1][3] + z * mat[2][3] + mat[3][3];
    return (w != 1) ? pt / w : pt;
}
```

In this example, the image width and height are the same; therefore, the frame aspect ratio is equal to 1. Thus, the point is only visible if its projected x- and y-coordinates are contained within the interval [-1:1]; otherwise, the point lies outside the boundaries of the camera's film plane. If the point is contained within this interval, we need to remap its coordinates to raster space, i.e. pixel coordinates. This operation is simple: we remap the coordinates from [-1:1] to [0:1], multiply by the image size, and convert the resulting floating-point value to an integer, as pixel coordinates must be integers. To test our program, we have rendered an image of the teapot in a commercial renderer using the same camera settings and combined it with the image produced by our code. They match, as expected.

What's Next?

In the next chapter, we will learn how to construct the perspective projection matrix used in OpenGL. The principles are the same, but instead of mapping the points to an image plane one unit away from the camera position, it projects them onto the near clipping plane. This results in a slightly different matrix. In Chapter Three, we will learn about constructing the orthographic projection matrix.
