Perspective Projection Matrix
This lesson is being written and changes regularly.
A Basic Perspective Projection Matrix
In this lesson we will explain how to build a simple perspective projection matrix. In the second chapter we will show how to construct the complete perspective projection matrix (which is used in 3D APIs such as OpenGL). In the third chapter we will learn about the orthographic projection matrix.
What is it used for?
The perspective projection matrix is fundamental to the process of creating a 2D image from 3D models. Strangely enough, however, very little information about it can be found in books and on the internet. If you use ray tracing to render an image, it is unlikely that you will need the perspective matrix. In a ray tracer, rays created by sampling the image plane are shot from the camera into the scene, and each camera ray is tested for intersection with the scene geometry. The perspective matrix makes the reverse process possible: we take points in 3D space and transform them using the perspective projection matrix to find their position on the image plane. Even if your renderer is a ray tracer, the perspective matrix can sometimes be useful, for instance to project a bounding box onto the screen (which is helpful to quickly find out if some geometry enclosed in this bounding box is visible to the camera). The perspective matrix is most useful, though, in renderers and 3D APIs using depth-based hidden-surface algorithms (such as OpenGL).
Projecting Points onto the Screen
Before we study how to create the perspective matrix, we will first learn how to project 3D points onto the screen. Usually, 3D points being projected onto the image plane are first transformed into the camera coordinate system, where the eye position corresponds to the origin of the coordinate system, the x and y axes define a plane parallel to the image plane, and the z axis is perpendicular to that plane. In our setup, the image plane is located exactly one unit away from the origin of the camera coordinate system (the eye).

Note that Scratchapixel uses a right-hand coordinate system, like many commercial applications such as Maya. To learn more about right and left-hand coordinate systems, check the lesson on Algebra in the basic section. Because we use a right-hand coordinate system, the camera points in the direction opposite to the z-axis (when we project points on the image plane, we want the x-axis to point to the right). Practically, this means that all points visible to the camera have a negative z component (when the points are expressed in the camera coordinate system).
Let's imagine we want to project point P onto the screen. If we draw a line from P to the eye position, we can see that P is projected onto the screen at Ps. How do we compute Ps?

Figure 1: to project P on the image plane (at Ps) we divide the xy coordinates of P by the z coordinate of P.
In figure 1, you can see that the green and red triangles have the same shape: they are said to be similar. In other words, the red triangle is a scaled-up version of the green triangle, and to find the xy coordinates of Ps (we already know that its z coordinate is 1 since the point lies on the image plane, which is 1 unit away from the eye position or camera origin), all we need to do is divide the xy coordinates of P by the z coordinate of P. This trick only works because these triangles are similar. In mathematical form we can write (equation 1):
$$\begin{array}{l} Ps_x=\frac{P_x}{-P_z} \\ Ps_y=\frac{P_y}{-P_z} \\ Ps_z=\frac{P_z}{P_z}=1\end{array}$$
Note that we have divided by -Pz (and not Pz) because the z-component of the points visible by the camera (when expressed in the coordinate system of the camera) is always negative. Computing the coordinates of Ps (which is the projection of P on the image plane) is that simple. Note that in figure 1, we represent Psy. To show Psx, we would have to make another drawing from the top but the principle is exactly the same.
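Equation 1 can be sketched in a few lines of C++. This is only a minimal illustration: the `Vec2` and `Vec3` types and the function name `projectToImagePlane` are hypothetical, and the point is assumed to be already expressed in the camera coordinate system and in front of the camera (negative z).

```cpp
struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Hypothetical helper: projects a camera-space point onto the image plane
// located one unit away from the eye (equation 1).
// Assumes p.z < 0, i.e. the point is in front of the camera.
Vec2 projectToImagePlane(const Vec3& p)
{
    // Visible points have a negative z, so we divide by -z.
    const float invDepth = 1.0f / -p.z;
    Vec2 ps;
    ps.x = p.x * invDepth;
    ps.y = p.y * invDepth;
    return ps;
}
```

For instance, a point at (2, 1, -4) would land at (0.5, 0.25) on the image plane.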
The concept of perspective took centuries to be well understood by artists. It was only in the 15th century that techniques became available to accurately reproduce the way in which objects appear to the eye. One of the main effects of perspective is known as foreshortening. It is an optical illusion which makes objects in the distance appear smaller than they are. It gives a sense of depth which is useful to evaluate how far objects are from the viewer (combined with stereoscopic vision or stereopsis).
Homogeneous Coordinates
Now you may think that there is really nothing complicated about the perspective projection, and you are right. The principle is very simple. However, the story doesn't stop here. What we want is to encode this projection process into a matrix, so that projecting a point onto the image plane can be obtained by a basic point-matrix multiplication. If you remember what we said in the lesson on Linear Algebra, two matrices can be multiplied with each other if the numbers on either side of the multiplication sign are the same (the number of columns of the left matrix equals the number of rows of the right matrix).
$$\mbox{no: }\begin{bmatrix}n & m\end{bmatrix}*\begin{bmatrix}q & n\end{bmatrix}$$ $$\mbox{yes: }\begin{bmatrix}m & n\end{bmatrix}*\begin{bmatrix}n & q\end{bmatrix}$$
Remember too that a point can be represented by a one-row matrix (some people prefer the one-column notation, but Scratchapixel uses the one-row form). However, a point is a [1x3] matrix (1 row, 3 columns) and therefore cannot be multiplied by a [4x4] matrix, which is what we use in CG to represent transformations. So what do we do? To solve this problem, we employ a trick which consists of representing the point using homogeneous coordinates. Points in homogeneous coordinates have four coordinates rather than three and can therefore be represented in the form of a [1x4] matrix. The fourth coordinate of a point in homogeneous form is denoted with the letter w. When we convert a point from cartesian coordinates to homogeneous coordinates we just set w to 1. In other words, we can represent P in homogeneous coordinates by setting its w coordinate to 1, and Pc (cartesian coordinates) and Ph (homogeneous coordinates) are interchangeable as long as w keeps the value 1. When w is different than 1, we need to divide all four coordinates of the point (xyzw) by w to set its value back to 1. As you can see in the following example, when w is not equal to 1, the homogeneous point and its cartesian counterpart are not interchangeable, and to fix this situation, we need to divide all its coordinates by w (equation 2):
$$\begin{bmatrix}x,& y,& z\end{bmatrix} \neq \begin{bmatrix}x,& y,& z,& w=1.2\end{bmatrix}$$ $$x=\frac{x}{w}, y=\frac{y}{w}, z=\frac{z}{w}, w=\frac{w}{w}=1$$ $$\begin{bmatrix}x,& y,& z\end{bmatrix} = \begin{bmatrix}x,& y,& z,& w=1\end{bmatrix}$$
At this point we know that we can represent a point using four coordinates (as long as w stays 1) which makes a point-matrix multiplication technically possible. Next we will study how we will construct this matrix so that the result of this multiplication is the projection of the point on the image plane.
A Simple Perspective Matrix
Recall from the lesson on linear algebra that the multiplication of a point by a matrix looks like this (equation 3):
$$\begin{equation} \begin{bmatrix} x & y & z & w\end{bmatrix} * \begin{bmatrix} m_{00} & m_{01} & m_{02} & m_{03}\\ m_{10} & m_{11} & m_{12} & m_{13}\\ m_{20} & m_{21} & m_{22} & m_{23}\\ m_{30} & m_{31} & m_{32} & m_{33} \end{bmatrix}\end{equation}$$
$$\begin{array}{l} x' = x * m_{00} + y * m_{10} + z * m_{20} + w * m_{30}\\ y' = x * m_{01} + y * m_{11} + z * m_{21} + w * m_{31}\\ z' = x * m_{02} + y * m_{12} + z * m_{22} + w * m_{32}\\ w' = x * m_{03} + y * m_{13} + z * m_{23} + w * m_{33}\end{array}$$
Remember too from the beginning of this lesson that the projection of P on the image plane (Ps) can be computed by dividing the xy coordinates of P by the z coordinate of P (equation 1). How can we achieve this with a point-matrix multiplication? First we need to set x', y' and z' (Ps coordinates) to x, y and z (the coordinates of P). Then we need to divide x', y' and z' by -z. Setting x', y' and z' to x, y and z is easy enough: all we need to do is start from the identity matrix (the pivot coefficients, the coefficients along the diagonal of the matrix, are set to 1 and all the other coefficients are set to 0). In the matrix below, the third pivot coefficient is set to -1 rather than 1, so that z' = -z and the result of the divide is 1. However, how can we now divide x', y' and z' by -z? We explained in the previous section that the w coordinate of a homogeneous point needs to be 1 for the point to be used in place of a cartesian point. When its value is different than 1, we need to divide the xyzw coordinates of the point by w to reset it back to 1 (equation 2). The trick of the perspective projection matrix consists of setting w' to -z, so that whenever w' is different than 1 we have to divide x', y' and z' by w' (which is equal to -z). This division causes the resulting x' and y' coordinates to be the projection of P on the image plane, which is what we want. In the literature this operation is usually called the z divide or perspective divide. To set w' to -z, we set the coefficients of the fourth column of the matrix to 0 0 -1 0.
$$\begin{equation} \begin{bmatrix} x & y & z & 1\end{bmatrix} * \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & -1 & -1\\ 0 & 0 & 0 & 0 \end{bmatrix}\end{equation}$$
$$\begin{array}{ll} \mbox{line 1:}&x' = x * 1 + y * 0 + z * 0 + 1 * 0\\ \mbox{line 2:}&y' = x * 0 + y * 1 + z * 0 + 1 * 0\\ \mbox{line 3:}&z' = x * 0 + y * 0 + z * -1 + 1 * 0\\ \mbox{line 4:}&w' = x * 0 + y * 0 + z * -1 + 1 * 0\end{array}$$
$$ x' = \frac{x'=x}{w'=-z}, y' = \frac{y'=y}{w'=-z}, z' = \frac{z'=-z}{w'=-z} = 1$$
At this point in the lesson, we now have a basic perspective projection matrix which can be used to compute Ps. However there is more to it.
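As a rough illustration of the steps above, the following C++ sketch multiplies a homogeneous point by the basic perspective matrix and then performs the perspective divide. The names `Vec4`, `multiply`, `kSimplePerspective` and `perspectiveDivide` are made up for this example; the matrix is indexed as m[row][column] and points are row vectors, as in equation 3.

```cpp
struct Vec4 { float x, y, z, w; };

// Row-vector times 4x4 matrix, following equation 3 (m[row][column]).
Vec4 multiply(const Vec4& p, const float m[4][4])
{
    Vec4 r;
    r.x = p.x * m[0][0] + p.y * m[1][0] + p.z * m[2][0] + p.w * m[3][0];
    r.y = p.x * m[0][1] + p.y * m[1][1] + p.z * m[2][1] + p.w * m[3][1];
    r.z = p.x * m[0][2] + p.y * m[1][2] + p.z * m[2][2] + p.w * m[3][2];
    r.w = p.x * m[0][3] + p.y * m[1][3] + p.z * m[2][3] + p.w * m[3][3];
    return r;
}

// The basic perspective matrix from the text: the fourth column
// (0, 0, -1, 0) copies -z into w'.
const float kSimplePerspective[4][4] = {
    {1.0f, 0.0f,  0.0f,  0.0f},
    {0.0f, 1.0f,  0.0f,  0.0f},
    {0.0f, 0.0f, -1.0f, -1.0f},
    {0.0f, 0.0f,  0.0f,  0.0f}
};

// Perspective divide: brings w back to 1 (equation 2).
// Assumes p.w is not 0.
Vec4 perspectiveDivide(const Vec4& p)
{
    Vec4 r = { p.x / p.w, p.y / p.w, p.z / p.w, 1.0f };
    return r;
}
```

With the point (2, 1, -4, 1), the multiplication gives w' = 4, and the divide yields (0.5, 0.25, 1, 1), which agrees with dividing x and y by -z directly.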
The Clipping Planes
Another goal of the perspective projection matrix is to normalise the z coordinate of P, that is, to scale its value between 0 and 1. To do so, we will use the near and far clipping planes, which should be passed to the renderer as parameters of the camera (if you are unsure about what these parameters do, check the lesson on Camera in the basic section). To achieve this goal, we will set the coefficients of the matrix used to calculate z' (line 3 of equation 3) to special values. We will change the third and fourth coefficients of the third column so that when P lies on the near clipping plane, z' is equal to 0 after the z divide, and when P lies on the far clipping plane, z' is equal to 1 after the z divide. This remapping can be obtained by setting these coefficients to:
$$-\frac{f}{(f-n)}$$
and
$$-\frac{f*n}{(f-n)}$$
respectively, where n stands for the near clipping plane and f for the far clipping plane (check the next chapter to learn how to derive these formulas). To convince you that this works, let's look at the result of z' when P lies on the near and far clipping planes:
$$\frac{\frac{-(z'=z=-n)*f-f*n}{(f-n)}}{(w'=-1*z=n)}= \frac{\frac{n*f-f*n}{(f-n)}}{(w'=-1*z=n)}=0$$ $$\frac{\frac{-(z'=z=-f)*f-f*n}{(f-n)}}{(w'=-1*z=f)}= \frac{\frac{f*f-f*n}{(f-n)}}{(w'=-1*z=f)}=$$ $$\frac{\frac{f*(f-n)}{(f-n)}}{(w'=-1*z=f)}=\frac{f}{f}=1$$
When z equals -n (P lies on the near clipping plane) you can see (first line of equations) that the numerator (what's above the z divide) is equal to 0. Therefore the result of the equation is 0. In the second line, we have replaced z with -f (P lies on the far clipping plane). By rearranging the terms, we can see that the (f-n) terms of the numerator cancel out, and we are left with f divided by itself, which equals 1.
"You give the solution for remapping z to the range 0 to 1, but how did you come up with these formulas?" We explain how to derive these formulas in the next chapter.
Our modified perspective projection matrix (it projects P to Ps and remaps the z coordinate of Ps from 0 to 1) now looks like this:
$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -\frac{f}{(f-n)} & -1\\ 0 & 0 & -\frac{f*n}{(f-n)}& 0\\ \end{bmatrix}$$
Taking the Field of View into Account
All we need to do to get a complete perspective projection matrix is to account for the field of view (or FOV) of the camera. We know that by changing the focal length of a zoom lens on a real camera we can see more or less of the scene in the picture. We want our CG camera to work the same way.

Figure 2: changing the focal length makes it possible to see more or less of the scene we photograph. However, as you can see in this illustration, it normally changes the screen window.
The size of the projection window is [-1:1] in each dimension. In other words, any point whose x and y coordinates, once projected, are within the range [-1:1] is visible (points whose projected coordinates are not contained in this range are invisible: they are not drawn).

Figure 3: the field of view or FOV controls how much of the scene is viewed.
Note that in our system, the maximum and minimum extents of the screen window don't change (they are always in the range [-1:1] no matter what value we use for the FOV), and the distance from the eye position to the screen window doesn't change either. However, we have just shown that when the FOV changes, the screen window should become larger or smaller (see figures 2 and 5). How do we reconcile this contradiction? Since we want the screen window to be fixed, what we will change instead are the projected coordinates. We will scale them up or down and test them against the fixed borders of the screen window. Let's have a look at a few examples.

Figure 4: to account for the field of view effect while keeping the size of the screen window the same (in the range [-1:1]), we need to scale the points up or down (depending on the FOV value).
Imagine a point whose projected xy coordinates are 1.2 and 1.3. These coordinates are outside the range [-1:1] and the point is therefore not visible. However, if we scale them down by, let's say, 0.7, the new scaled coordinates of the point become 0.84 and 0.91, and this point is then visible (since both coordinates are in the range [-1:1]). This action is similar to zooming out (you zoom out if you decrease the focal length on a zoom lens or increase the FOV). To get the opposite effect, you need to apply a scale with a value greater than 1. Imagine a point whose projected coordinates are -0.5 and 0.3. If you multiply these numbers by 2.1, the new point coordinates are -1.05 and 0.63. The y coordinate is still contained in the range [-1:1], but the x coordinate is too far to the left (it is lower than -1). Therefore this point, which was visible before the scale was applied, is not visible anymore. You zoomed in.
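The two examples above can be written as a small visibility test. This is only a sketch: the function name `isVisible` and its parameters are hypothetical, and the scale factor is passed in directly rather than derived from a field of view.

```cpp
// Hypothetical helper: scales the projected coordinates and tests them
// against the fixed screen window [-1:1] in each dimension.
bool isVisible(float xProjected, float yProjected, float scale)
{
    const float x = xProjected * scale;
    const float y = yProjected * scale;
    return x >= -1.0f && x <= 1.0f && y >= -1.0f && y <= 1.0f;
}
```

With this function, the point (1.2, 1.3) is rejected with a scale of 1 but accepted with a scale of 0.7 (zooming out), while the point (-0.5, 0.3) is accepted with a scale of 1 but rejected with a scale of 2.1 (zooming in).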

Figure 5: zooming in or out normally changes the size of the screen window. See how it becomes bigger or smaller when the field of view increases or decreases.
To scale the projected coordinates up or down we will use the field of view of the camera. The field of view (or angle of view) intuitively controls how much of the scene is visible by the camera (check the lesson on Camera in the basic section for more information).
The FOV can be either the horizontal or the vertical angle. If the screen window is square, this detail doesn't matter (the angle is the same). If the frame aspect ratio is different than 1, then it matters. In OpenGL (GLU's gluPerspective, more precisely), the FOV corresponds to the vertical angle. In this lesson, the FOV corresponds to the horizontal angle, like in most applications.
However, we can't use this value directly. We need to use the tangent of the value instead. In the CG literature, the field of view can be reported either as the angle or as the half-angle subtended by the viewing cone (see figure xx). We believe it is more intuitive to see the FOV as the angular extent of the visible scene instead of half of this angle. To find a value that can be used to scale the projected coordinates, however, we need to divide this angle by two (which explains why the FOV is sometimes expressed as the half-angle). Why? Because for the maths to work, what we are interested in is the right triangle inscribed in the cone (see figure xx). The angle between the hypotenuse and the adjacent side of the triangle (the FOV half-angle) controls the size of the triangle's opposite side. By increasing or decreasing this angle we can scale the border of the image window up or down. And since we need a value that is centered around 1, we take the tangent of this angle to scale our projected coordinates. Note that when the FOV half-angle is 45 degrees (when the FOV is 90 degrees), the tangent of this angle is 1, so when we multiply the projected coordinates by 1, the coordinates do not change. For values of the FOV lower than 90 degrees, the tangent of the half-angle gives values smaller than 1, and for values greater than 90 degrees, it gives values greater than 1. But we want the opposite: to zoom in, when the FOV decreases, we need to multiply the projected point coordinates by a value greater than 1, and to zoom out, when the FOV increases, we need to multiply these coordinates by a value lower than 1. We therefore use the inverse (1 divided by the number) of the tangent of the FOV half-angle.

The remapping of the z coordinate from 0 to 1 is not a linear process. In the accompanying image we have plotted the result of z' with the near and far clipping planes set to 1 and 20 respectively. As you can see, the curve is very steep for values in the range [1:3], and quite flat for values greater than 7. It means that the precision of z' is high in the proximity of the near clipping plane and low as we get closer to the far clipping plane. If the range [near:far] is too large, it can cause depth precision problems (called z-fighting) in depth-based hidden surface renderers. It is therefore important to make this range as small as possible in order to minimize the depth buffer precision problem.
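A few values make this non-linearity concrete. Writing d = -z for the distance of the point along the viewing direction (d is a symbol introduced here for convenience, it is not part of the matrix), the depth after the z divide simplifies to:

$$z' = \frac{\frac{d*f-f*n}{(f-n)}}{d} = \frac{f*(d-n)}{d*(f-n)}$$

With n = 1 and f = 20 this gives:

$$\begin{array}{l} d=1: z' = 0 \\ d=3: z' = \frac{40}{57} \approx 0.70 \\ d=7: z' = \frac{120}{133} \approx 0.90 \\ d=20: z' = 1\end{array}$$

In other words, roughly 70% of the [0:1] range is used by the first two units behind the near clipping plane, while depths between 7 and 20 share only about 10% of it.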
Here is the final equation to compute the value that is used to scale the projected points coordinates:
$$\begin{equation}S = \frac{1}{tan(fov*0.5*\frac{PI}{180})}\end{equation}$$
And finally, here is the final version of the perspective projection matrix:
$$ \begin{bmatrix} S & 0 & 0 & 0 \\ 0 & S & 0 & 0 \\ 0 & 0 & -\frac{f}{(f-n)} & -1\\ 0 & 0 & -\frac{f*n}{(f-n)}& 0\\ \end{bmatrix}$$ $$\begin{equation}S = \frac{1}{tan(fov*0.5*\frac{PI}{180})}\end{equation}$$
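The final matrix can be filled in with a short C++ function. The sketch below assumes a row-vector convention with the matrix indexed as m[row][column], a field of view given in degrees, and a hypothetical function name `setProjectionMatrix`; it is one possible way to write it, not necessarily the one used by the test program.

```cpp
#include <cmath>

// Hypothetical helper: fills m with the perspective projection matrix of
// this lesson. fovDegrees is the field of view in degrees, n and f are the
// near and far clipping planes (assumes f > n > 0 and 0 < fovDegrees < 180).
void setProjectionMatrix(float fovDegrees, float n, float f, float m[4][4])
{
    const float kPi = 3.14159265358979f;
    // Scale factor S: inverse of the tangent of the FOV half-angle.
    const float s = 1.0f / std::tan(fovDegrees * 0.5f * kPi / 180.0f);

    // Start with all coefficients set to 0.
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            m[row][col] = 0.0f;

    m[0][0] = s;                  // scales the projected x coordinate
    m[1][1] = s;                  // scales the projected y coordinate
    m[2][2] = -f / (f - n);       // remaps z to [0:1] (with m[3][2])
    m[2][3] = -1.0f;              // sets w' to -z
    m[3][2] = -f * n / (f - n);   // remaps z to [0:1] (with m[2][2])
}
```

With a field of view of 90 degrees, S is 1 and the result is the matrix of the previous section.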
Are There Different Ways of Building this Matrix?
Yes and no. Some renderers may use the perspective projection matrix in a slightly different way, which is the case of OpenGL for instance. OpenGL uses the call glFrustum to create a perspective projection matrix. This call takes as arguments left, right, bottom and top coordinates as well as near and far clipping planes. OpenGL assumes that the points in the scene are projected on the near clipping plane and not on a plane which is 1 unit away from the camera position, as is the case in our system. The principle of the perspective projection matrix, though, is the same for all renderers. The matrix might look slightly different (be careful about the convention used for vectors and matrices: it can be row or column major. Check also whether the renderer uses a left or right-handed coordinate system, which might change the sign of the matrix coefficients). But in the end, all matrices should project the same points to the same pixel coordinates, no matter what convention and matrix are used.
We will study the construction of the OpenGL matrix in the next chapter.
Test Program
To test the perspective projection matrix, we have written a small program that projects the vertices of a polygonal object onto the image plane (the object is the Utah teapot). The program itself (available for download) is very simple. The vertices of the teapot are stored in an array. Each point is then projected onto the image plane using a simple point-matrix multiplication, where the matrix is the perspective projection matrix.
Note, in the code which computes the product of a point with a matrix, how we create a fourth component, w, and divide the new point's coordinates by w only if w is different than 1.
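A point-matrix product of this kind might be sketched as follows. The original listing is not reproduced here, so the `Vec3` type, the function name `multPointMatrix` and its signature are assumptions made for illustration; the matrix is indexed as m[row][column] and the point is treated as a row vector with an implicit w of 1.

```cpp
struct Vec3 { float x, y, z; };

// Hypothetical sketch of a point-matrix product for row vectors.
// The input point is implicitly promoted to homogeneous coordinates (w = 1).
Vec3 multPointMatrix(const Vec3& in, const float m[4][4])
{
    Vec3 out;
    out.x = in.x * m[0][0] + in.y * m[1][0] + in.z * m[2][0] + m[3][0];
    out.y = in.x * m[0][1] + in.y * m[1][1] + in.z * m[2][1] + m[3][1];
    out.z = in.x * m[0][2] + in.y * m[1][2] + in.z * m[2][2] + m[3][2];
    // Fourth component of the transformed point.
    const float w = in.x * m[0][3] + in.y * m[1][3] + in.z * m[2][3] + m[3][3];

    // Convert back to cartesian coordinates only when needed.
    // Assumes w is not 0 (the point does not lie in the plane of the eye).
    if (w != 1.0f) {
        out.x /= w;
        out.y /= w;
        out.z /= w;
    }
    return out;
}
```

Skipping the division when w equals 1 leaves ordinary transformations (translation, rotation, scale) untouched, so the same function can serve for both regular and projection matrices.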
To test our program, we rendered an image of the teapot in a commercial renderer (using the same camera settings) and combined it with the image produced by our code. They match (as they should).

What's Next?
In the next chapter we will learn how to construct the perspective projection matrix used in OpenGL. The principles are exactly the same but instead of mapping the points to an image plane which is 1 unit away from the camera position, it projects the point on the near clipping plane. This results in a slightly different matrix and learning how to build this matrix can be useful at times. In the third chapter we will learn about constructing the orthographic projection matrix as well.