A method for Camera Calibration
1. Introduction
2. Description of the Calibration Problem
2.1 Extrinsic Parameters
2.2 Intrinsic Parameters
2.3 Radial Distortion
3. Initial Approach
4. Final Method
1. Introduction
Camera calibration is a necessary step for extracting 3-D information from 2-D images. In the past few years, a number of methods have been proposed for solving the camera calibration problem. These methods can generally be divided into two groups:
· Photogrammetric calibration: The calibration is done by observing an object whose 3-D geometry is known with good precision. The best-known method of this kind was used in TSAI, a project that automatically calibrates a camera given two planes with a particular pattern (16 rectangles) drawn on them.
· Self-calibration: The calibration is done by moving a camera in a static scene. The rigidity of the scene can be used to produce two constraints on the intrinsic and extrinsic parameters of the camera. Therefore, by taking pictures with the camera from different positions, we can estimate its intrinsic and extrinsic parameters.
Here is how we will proceed. In section 2 we give a brief description of the calibration problem and analyze the parameters that need to be estimated during camera calibration. In section 3 we present our initial approach which, although unsuccessful, helps illustrate the difficulty of the problem. Section 4 describes the method we ultimately used, and how we overcame the various problems.
2. Description of the Calibration Problem
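In this section we assume that the reader has a fairly good understanding of the pinhole model. According to the pinhole model, a world point M(X,Y,Z) is projected onto the image plane at the point m(x,y), where

$$ s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (1) $$

In the equation above, f is the focal length and s is a scalar.
2.1 Extrinsic Parameters
Equation (1) assumes that the 3-D world points are expressed in the camera coordinate system. Generally this is not the case. A 3-D point whose world coordinates are Mw has camera coordinates Mc, such that Mc = R Mw + t. Therefore, equation (1) now becomes:

$$ s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \qquad (2) $$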
The matrix R and the vector t describe the orientation and position of the camera with respect to the world coordinate system. They are called the extrinsic parameters of the camera. The rotation matrix R is a 3-by-3 matrix, but it has only 3 degrees of freedom, since R must satisfy:

$$ R^{\top} R = I, \qquad \det(R) = 1 $$

These 2 equations give 6 constraints on R. Therefore, in the process of camera calibration, we have to estimate 6 extrinsic parameters: 3 for the rotation and 3 for the translation.
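The change of coordinates Mc = R Mw + t followed by the pinhole projection can be sketched in a few lines of Python. This is only an illustration: the axis-angle (Rodrigues) parameterization of the 3 rotation parameters and the function names are assumptions made here, not choices stated in the text.

```python
import math

def rotation_from_axis_angle(rx, ry, rz):
    """Build a 3x3 rotation matrix from 3 parameters (Rodrigues formula).

    The vector (rx, ry, rz) points along the rotation axis and its length
    is the rotation angle in radians. This parameterization is an assumption.
    """
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = rx / theta, ry / theta, rz / theta
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v,      kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v,      ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def project(Mw, R, t, f):
    """Map a world point to image-plane coordinates (x, y).

    First apply Mc = R Mw + t, then the pinhole projection
    x = f * Xc / Zc, y = f * Yc / Zc (so the scalar s equals Zc).
    """
    Xc, Yc, Zc = (
        sum(R[i][j] * Mw[j] for j in range(3)) + t[i] for i in range(3)
    )
    return f * Xc / Zc, f * Yc / Zc
```

The six numbers (rx, ry, rz) and t are then exactly the 6 extrinsic unknowns counted above.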
2.2 Intrinsic Parameters
Equation (2) shows how a world point M is projected onto the image plane. However, the coordinates of the point m in the image plane are expressed in the same units as the world coordinates. What we are really interested in is the transformation of point m into the image coordinate system of the camera (expressed in pixels).

The axes of the camera image should ideally be perpendicular, but they usually form an angle θ that is slightly different from 90°. This is a parameter that needs to be estimated. Furthermore, we also have to estimate the point (u0, v0), the pixel coordinates of the intersection of the optical axis with the image plane, since the origin of the pixel coordinate system does not generally coincide with it. Finally, the units of the pixel coordinates are not the same as those of point m. If the scale factors along the u- and v-axes are ku and kv relative to the units used in the (c,x,y) coordinate system, then the pixel coordinates are:
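In the usual skewed-axes convention, this relation can be written as shown below. It is a sketch rather than a definitive formula: it assumes that α, β and γ absorb the focal length f of equation (2), and other texts use the opposite sign for γ.

```latex
\begin{aligned}
u &= u_0 + k_u\,x - k_u \cot\theta \; y, \\
v &= v_0 + \frac{k_v}{\sin\theta}\; y,
\end{aligned}
\qquad\text{so that}\qquad
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= \begin{bmatrix} \alpha & \gamma & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} R & t \end{bmatrix}
\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix},
\qquad
\alpha = f k_u,\quad
\beta = \frac{f k_v}{\sin\theta},\quad
\gamma = -f k_u \cot\theta .
```

When θ = 90°, γ vanishes and the matrix reduces to two scale factors and an offset.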
The five parameters θ, ku, kv, u0, v0 do not depend on the position and orientation of the camera, and are thus called the intrinsic parameters of the camera. However, as we can see from the above equation, we do not have to estimate the intrinsic parameters directly; we just need to estimate the values of α, β, γ, u0, v0.
2.3 Radial Distortion
Up to now, we have not considered the lens distortion of a camera. Lenses are not flat surfaces. World points are therefore not projected onto a plane, but rather onto a surface that can be considered spherical. As a result, straight lines are mapped to parabolas in the image. For example, the grid of Figure 1b is actually seen as shown in Figure 1a.
According to the literature, the distortion function is dominated by the radial components, and especially by the first term. If (u,v) are the ideal image coordinates of a point, (u’,v’) are the distorted image coordinates, and (x,y) are the ideal normalized image coordinates (coordinates of the point m), then:
· u’ = u + k (u − u0)(x² + y²)
· v’ = v + k (v − v0)(x² + y²)
The center of distortion is the same as the principal point (u0, v0). Therefore, we also have to estimate the value of k in the process of camera calibration.
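The one-term radial model above translates directly into code. The following is a minimal sketch; the function name and argument order are chosen here for illustration only.

```python
def distort(u, v, x, y, k, u0, v0):
    """Apply the one-term radial distortion model.

    (u, v)   ideal pixel coordinates
    (x, y)   ideal normalized image coordinates of the same point
    k        radial distortion coefficient
    (u0, v0) principal point, which is also the center of distortion
    Returns the distorted pixel coordinates (u', v').
    """
    r2 = x * x + y * y  # squared distance from the optical axis
    u_d = u + k * (u - u0) * r2
    v_d = v + k * (v - v0) * r2
    return u_d, v_d
```

A point at the principal point is left unchanged for any k. With this formula a negative k pulls points toward the principal point, and the displacement grows with the distance from the center.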
3. Initial Approach
Our initial approach was to first estimate the principal point (u0, v0) and the radial distortion term (k), and then estimate the intrinsic and extrinsic parameters of the cameras. To this end, we used the cameras to take pictures of a grid. Because of the radial distortion, the image would look similar to Figure 1a, and the spacing between the lines would not be uniform. Since we knew that the actual spacing between the horizontal and the vertical lines of the grid is constant, we could estimate the radial distortion term from this variation. The principal point was also estimated from the grid image. In particular, we used the "barrel" shape (the grid squares look like barrels) to see where the lines started curving in different directions. In this way we obtained our estimates for the radial distortion term.
We then used a cylinder with a grid wrapped around it. We marked the grid in such a way that, for each intersection of grid lines, we could determine the world coordinates of that point. We assumed that by knowing the world coordinates of many points and their corresponding image coordinates, we would have enough constraints to estimate the intrinsic and extrinsic parameters of the cameras. We tried to use iterative methods to minimize the error between the measured image coordinates and the ones obtained by mapping each world point to an image point with our current estimates of the parameters.
Unfortunately, this method did not work. To understand why, the reader is encouraged to try to visualize a 13-dimensional space. Twelve of the dimensions are our 12 unknowns (6 extrinsic parameters, 5 intrinsic ones, and the radial distortion term), and the 13th dimension is the average error, over all the points we have, between the real image coordinates and the estimated ones. Ideally, the minimization function should end up in the valley containing the point of smallest value (smallest average error). What actually happened was that the minimization function ended up trapped in valleys where the average error was quite large (with a few exceptions). To make matters worse, any attempt to restart the minimization from these valleys failed, since there was no convergence.
Furthermore, we had one more problem, which seems unimportant compared to the one just presented: the focus of the cameras was different from the one used in the grid images and, therefore, the principal point was different as well. As a result, we could not unwarp the images correctly (remove the radial distortion).
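The effect of being trapped in a valley can be reproduced on a much smaller problem. The sketch below is a toy, one-dimensional stand-in for the 12-parameter error surface: the function, the step size and the starting points are invented here for illustration and have nothing to do with the real calibration error.

```python
def toy_error(x):
    """Toy error curve with two valleys: a deep one near x = -1.30
    and a shallower one near x = 1.13."""
    return x ** 4 - 3 * x ** 2 + x

def toy_gradient(x):
    """Derivative of toy_error."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(start, step=0.01, iterations=2000):
    """Plain fixed-step gradient descent from the given starting point."""
    x = start
    for _ in range(iterations):
        x -= step * toy_gradient(x)
    return x

# Same function, same method, different starting points:
trapped = gradient_descent(2.0)    # slides into the shallow valley
lucky = gradient_descent(-2.0)     # slides into the deep valley
```

Both runs converge, but only the second reaches the lowest point; the first stops in a valley whose error is clearly larger. With 12 unknowns instead of one, there are many more such valleys to get stuck in.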
4. Final Method
The method we ultimately used was motivated by the following observation: no camera is isolated from the others. We have 43 cameras, and when calibrating one camera we can use constraints imposed by the other cameras as well. We used a stick with an LED attached to one of its ends, viewed by a set of 10 cameras. The LED gave us points that could be located with good precision and were seen by many cameras. Our method consists of the following steps:
i) Calibrate the 10 cameras that were used to view the points, given the LED points.
ii) Given these initial estimates for the calibration, use the projection matrices to redefine the world points. This process can correct some of the errors made while acquiring the coordinates of the LED points (the LED appeared as a sphere with a radius of about 0.5 to 1 pixel, and we always used its center, which is not necessarily accurate). It was done by shooting out rays from the cameras (for the LED points seen by each camera) and selecting the points where these rays came closest to each other.
iii) Use the new world points to calibrate these 10 cameras once again.
iv) Reconstruct the world points, and extend the calibration to all the other cameras.
v) Reconstruct the world points and recalibrate the cameras viewing the head in our movie sequences (for extra precision).
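The ray-based reconstruction in step ii can be sketched as a small least-squares problem. The text only says that the point where the rays come closest is selected; the formulation below, which minimizes the sum of squared distances to all rays, is one common way to do that and is an assumption here, as are the function names.

```python
import math

def det3(M):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def solve3(A, b):
    """Solve the 3x3 system A p = b with Cramer's rule."""
    D = det3(A)
    solution = []
    for k in range(3):
        Ak = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        solution.append(det3(Ak) / D)
    return solution

def closest_point_to_rays(origins, directions):
    """Point minimizing the sum of squared distances to a set of rays.

    Each ray starts at a camera center (origin) and points along the
    viewing direction of the observed LED. For a unit direction d, the
    matrix (I - d d^T) projects onto the plane perpendicular to the ray,
    which gives the normal equations  sum(I - d d^T) p = sum(I - d d^T) o.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m
                b[i] += m * o[j]
    return solve3(A, b)
```

At least two non-parallel rays are needed for the system to have a unique solution; using more rays averages out the error of each individual camera.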
The basic steps are the ones described above. Some crucial details are the following:
i) Each time we reconstruct the world points, we only use cameras that have been calibrated well in previous steps (average pixel error smaller than 0.4 pixels). This is done so that the reconstructed world points will be accurate.
ii) Every time we try to calibrate a camera, we run multiple optimization functions. We use a nonlinear least squares method and two other minimization functions (the MATLAB functions fminu and fminunc). We also use many starting points for each optimization function. Besides a quick estimate obtained from the LED points, we use values of the calibration parameters acquired both in earlier iterations of the current run and in previous runs of the program. Finally, we also consider these previous values of the calibration parameters as they are, without running a new optimization function (in case the optimization functions diverge).
iii) When reconstructing the world points, we use only points that can be viewed by many cameras, so that the reconstruction is accurate. Through many experiments, we found that at least 5 cameras are needed to define a world point well, but most of the time we required 7 cameras. Notice that we cannot demand that too many cameras view the point, because then we would have too few qualifying points, and the calibration would fail.
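The selection rules in these details can be sketched as follows. This is an illustration only: the toy optimizer stands in for the real minimization functions, and all names and data structures are assumptions made here. The thresholds 0.4 and 5 are the ones quoted above.

```python
import math

def pattern_search(error_fn, start, step=0.5, tol=1e-6):
    """Toy local optimizer (a stand-in, not one of the MATLAB functions):
    move one coordinate at a time by +/- step, halving the step when stuck."""
    p = list(start)
    best = error_fn(p)
    while step > tol:
        improved = False
        for i in range(len(p)):
            for delta in (step, -step):
                q = p[:]
                q[i] += delta
                e = error_fn(q)
                if e < best:
                    p, best, improved = q, e, True
        if not improved:
            step /= 2
    return p

def best_of_many(error_fn, optimizers, starts, previous_estimates=()):
    """Run every optimizer from every start and keep the best result.

    Previous estimates are also candidates as they are, without being
    re-optimized, which protects against optimizers that diverge.
    """
    candidates = [list(p) for p in previous_estimates]
    for optimize in optimizers:
        for start in starts:
            candidates.append(optimize(error_fn, start))
    finite = [p for p in candidates if math.isfinite(error_fn(p))]
    return min(finite, key=error_fn)

def usable_points(views, camera_error, max_error=0.4, min_views=5):
    """Keep the points seen by at least min_views well-calibrated cameras.

    views:        {point_id: [camera_id, ...]}
    camera_error: {camera_id: average pixel error of that camera}
    """
    good = {cam for cam, err in camera_error.items() if err < max_error}
    return [pt for pt, cams in views.items()
            if len(good.intersection(cams)) >= min_views]
```

Raising min_views makes each reconstructed point more reliable but shrinks the list returned by usable_points, which is the trade-off described in detail iii.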
The results were even better than we had anticipated. Of the 43 cameras, only 5 exhibited large average pixel errors (1-2 pixels on average). Many cameras were calibrated with an average pixel error of less than 0.2 pixels, while the average pixel error over all the cameras was about 0.4 pixels. These results were better than the results we have found in published papers on camera calibration.
The tests that we later ran verified the precision of the calibration.