The first thing to do was to pick a file type relevant to the project. To start with, JPEG files were chosen because they are smaller and more efficient for storing many frames at full PAL resolution. Unfortunately the lossy compression characteristic of JPEG files was not acceptable where accuracy was concerned. Eventually a TIFF library was acquired, which meant that the precision could be maintained. With this in place, the initial steps of learning how to manipulate image files efficiently in 'C' could be started.
B. TRACKING POINTS
(i) VISUAL POINT TRACKING BY AVERAGE
Code
averageTrack.cpp
Example Video - Dot on a wall
If the link doesn't work, use dot.mov
Example Video - Red track point on dot
If the link doesn't work, use average.mov
The first system employed to try to ascertain the positions of visual landmarks on the screen was an algorithm based on the averages of blocks of pixels. The user specifies the position, in screen co-ordinates, of the point that they want to track. The program then reads in the colour values of all the pixels around it, averages them and then advances to the next frame - systematically searching the area around the point given by the user for an average which matches the previous result. This is then repeated using the newly located position as the centre of the area in which to search for the following best average match.
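A minimal sketch of this averaging search is given below. The function names and the frame layout (a width x height array of RGB bytes) are assumptions made for the example rather than the layout used in averageTrack.cpp.

    // Sketch of the block-average matching step. The frame is assumed to be
    // held in memory as width * height * 3 unsigned chars (RGB).
    #include <cmath>

    struct Frame { int width, height; unsigned char *rgb; };

    // Average of the R, G and B values over a (2*radius+1) square block
    // centred on (x, y), ignoring pixels that fall outside the frame.
    static double blockAverage(const Frame &f, int x, int y, int radius)
    {
        double sum = 0.0;
        int count = 0;
        for (int j = y - radius; j <= y + radius; ++j)
            for (int i = x - radius; i <= x + radius; ++i) {
                if (i < 0 || j < 0 || i >= f.width || j >= f.height) continue;
                const unsigned char *p = f.rgb + 3 * (j * f.width + i);
                sum += p[0] + p[1] + p[2];
                count += 3;
            }
        return count ? sum / count : 0.0;
    }

    // Search the area around (x, y) in the next frame for the block whose
    // average is closest to the average recorded from the previous frame.
    static void trackByAverage(const Frame &next, double prevAverage,
                               int &x, int &y, int radius, int searchRadius)
    {
        double bestError = -1.0;
        int bestX = x, bestY = y;
        for (int j = y - searchRadius; j <= y + searchRadius; ++j)
            for (int i = x - searchRadius; i <= x + searchRadius; ++i) {
                double err = std::fabs(blockAverage(next, i, j, radius) - prevAverage);
                if (bestError < 0.0 || err < bestError) {
                    bestError = err; bestX = i; bestY = j;
                }
            }
        x = bestX;
        y = bestY;
    }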
(ii) VISUAL POINT TRACKING BY PATTERN
Executable
pointTrack.exe
Code
pointTrack.cpp
scoreSys.h
searchMatch.h
showPoint.h
Example Video - Point on tree tracked
If the link doesn't work, use tree.mov
The user input is the same as before, but the method of finding where the point moves to in the next frame is based on a pattern searching system.
The program reads the user input, specifying an (x,y) screen co-ordinate, and stores the RGB values of the pixels surrounding it; in this case an initial test area of 5 by 5 pixels was chosen (where the user input corresponds to the centre, (3,3), of this test area). This stored pixel map can now be used to find the position of the pattern in the subsequent frames.
An area is defined on the second frame, whose centre is in the same position as the pixel map and whose size is greater than that of the pixel map. This serves as an area within which to search for the stored pattern. (The search area was set to around 15 by 15 pixels as an initial test size. Motions that carry the point outside this area would not be located, so the size of the search area should be altered to compensate for an object's speed or its distance from the camera.)
The pattern from the first frame is systematically compared with all the possible positions in the search area in the second frame. For each candidate position, the sum of the absolute differences in RGB values over the whole pattern is noted as an error value. The position with the smallest error value is then picked as the best match, and that position is stored as the starting position for the next track. The frames are incremented and the process is repeated for all the frames in the image sequence.
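A minimal sketch of this matching loop is given below (the real implementation is in pointTrack.cpp and the headers listed above; the frame layout and helper names here are assumptions made for the example).

    // Sum-of-absolute-differences match of a 5x5 RGB pattern within a
    // 15x15 search area. The frame is stored as width * height * 3 bytes.
    #include <cstdlib>

    struct Frame { int width, height; unsigned char *rgb; };

    const int PR = 2;   // pattern radius: 5x5 pattern
    const int SR = 7;   // search radius: 15x15 search area

    // Error between the stored 5x5x3 pattern and the block centred on (cx, cy).
    static long patternError(const Frame &f, const unsigned char *pattern,
                             int cx, int cy)
    {
        long error = 0;
        int k = 0;
        for (int j = -PR; j <= PR; ++j)
            for (int i = -PR; i <= PR; ++i) {
                const unsigned char *p = f.rgb + 3 * ((cy + j) * f.width + (cx + i));
                for (int c = 0; c < 3; ++c)
                    error += std::labs((long)p[c] - (long)pattern[k++]);
            }
        return error;
    }

    // Find the best match for 'pattern' in 'next', starting the search from
    // the previous position (x, y); the result is written back into x and y.
    static void trackByPattern(const Frame &next, const unsigned char *pattern,
                               int &x, int &y)
    {
        long bestError = -1;
        int bestX = x, bestY = y;
        for (int j = y - SR; j <= y + SR; ++j)
            for (int i = x - SR; i <= x + SR; ++i) {
                if (i - PR < 0 || j - PR < 0 ||
                    i + PR >= next.width || j + PR >= next.height) continue;
                long err = patternError(next, pattern, i, j);
                if (bestError < 0 || err < bestError) {
                    bestError = err; bestX = i; bestY = j;
                }
            }
        x = bestX;
        y = bestY;
    }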
This system is only fully effective if the image sequence is not interlaced: it will still work on interlaced footage, but the results will be jittery. To resolve the problem of interlacing, the odd and even fields can be treated as two separate frames. This is equivalent to an image sequence moving at 50 frames per second as opposed to 25 frames per second. Such accuracy is not needed at this level, so in this case the tracking of the point was worked out using the odd fields only. To do this, the y values of the search area and of the pixel map store need to progress in steps of two rather than one when the program is implemented.
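An equivalent way to get the same effect of sampling only the odd lines is to copy the odd field out into a half-height frame before tracking; a minimal sketch, assuming the same frame layout as above:

    // Copy only the odd lines of an interlaced frame into a half-height
    // frame, so that the tracker only ever compares pixels from one field.
    #include <cstring>

    struct Frame { int width, height; unsigned char *rgb; };

    static Frame extractOddField(const Frame &src)
    {
        Frame field;
        field.width  = src.width;
        field.height = src.height / 2;
        field.rgb    = new unsigned char[field.width * field.height * 3];
        for (int j = 0; j < field.height; ++j)
            std::memcpy(field.rgb + 3 * j * field.width,
                        src.rgb  + 3 * (2 * j + 1) * src.width,
                        3 * field.width);
        return field;
    }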
C. ORIENTATION RECOVERY OF A FIXED POINT CAMERA
Code
tripod.cpp
Example Video - Initial footage with track point on window
If the link doesn't work, use podpan.mov
Example Video - Same footage with 3D poster on wall
If the link doesn't work, use poster.mov
To start with, the problem was simplified: rather than trying to find both the camera position (Vpx, Vpy, Vpz) and the camera orientation (Rx, Ry, Rz), the position is clamped to (0,0,0), from which the orientation is more easily calculated. To hold the position, the camera was mounted on a tripod. This also locks the Rz or 'roll' value, leaving two unknowns, Rx and Ry. Working in the yz plane, an equation can be formulated for the Rx rotation (the same approach can later be applied in the xz plane to get the Ry rotation).
The z vector of world space is assumed to point into the screen, and the image plane (screen) is set at a distance Vd from the fixed-point camera. The vertical distance of a tracked point P from the screen's centre is called H. Rx is the angle between a ray cast from the camera through the centre of the screen and a ray cast from the camera to the tracked point P; the change in Rx from frame 1 to frame 2 is denoted ΔRx. The subscripts denote the frame number (in this case just frames 1 and 2). To begin with only ΔRx needs to be worked out (and eventually ΔRy from the xz plane).
In frame 1, using basic trigonometry, it can be seen that

    tan(Rx1) = H1 / Vd    [1]

Applying this to the second frame, it can also be seen that

    tan(Rx2) = H2 / Vd    [2]

These can be rearranged to give

    [1] ->  Rx1 = arctan(H1 / Vd)    [3]

    [2] ->  Rx2 = arctan(H2 / Vd)    [4]

hence

    ΔRx = Rx2 - Rx1 = arctan(H2 / Vd) - arctan(H1 / Vd)    [5]

Applying the same principle in the xz plane, with W as the horizontal distance of the tracked point from the screen's centre, results in

    ΔRy = Ry2 - Ry1 = arctan(W2 / Vd) - arctan(W1 / Vd)    [6]
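A minimal stand-alone sketch of equations [5] and [6] is given below. The numbers are illustrative, and the sign that finally gets applied to the camera depends on the axis conventions of the 3D package; the real code is in tripod.cpp.

    // Change in pitch and yaw between two frames for a tripod-mounted
    // camera, from equations [5] and [6]. H is the tracked point's vertical
    // distance from the screen centre, W its horizontal distance, and Vd
    // the distance of the image plane from the camera (all in pixels here).
    #include <cmath>
    #include <cstdio>

    int main()
    {
        double Vd = 500.0;              // assumed image-plane distance
        double H1 = 40.0, W1 = -12.0;   // tracked point, frame 1
        double H2 = 55.0, W2 = -30.0;   // tracked point, frame 2

        double dRx = std::atan2(H2, Vd) - std::atan2(H1, Vd);   // equation [5]
        double dRy = std::atan2(W2, Vd) - std::atan2(W1, Vd);   // equation [6]

        std::printf("dRx = %f rad, dRy = %f rad\n", dRx, dRy);
        return 0;
    }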
The program that calculates this then has to communicate the information to the 3D application that controls the camera motion, in this case Maya. The 'C' application outputs the rotational information into a bit-stream file as a series of floats.
Maya's scripting language, MEL, can then read this bit-stream and apply the transformations to the camera as a change in current angle. (In fact Maya doesn't read in the 'C' floats, so doubles are used instead, but this does not affect the final result.)
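A sketch of this output stage is shown below; the file name "rotations.bin" is an illustrative choice, not necessarily the name used by the project.

    // Write the per-frame rotation changes to a binary file as doubles,
    // which is the form the MEL script reads back.
    #include <cstdio>

    int main()
    {
        const int numFrames = 3;                        // illustrative data
        double dRx[numFrames] = {  0.010,  0.012,  0.009 };
        double dRy[numFrames] = { -0.020, -0.018, -0.021 };

        std::FILE *out = std::fopen("rotations.bin", "wb");
        if (!out) return 1;
        for (int f = 0; f < numFrames; ++f) {
            std::fwrite(&dRx[f], sizeof(double), 1, out);
            std::fwrite(&dRy[f], sizeof(double), 1, out);
        }
        std::fclose(out);
        return 0;
    }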
D. POSITION AND ROTATION RECOVERY FOR AN ARBITRARY CAMERA
(i) GAINING THE THIRD CO-ORDINATE
The inherent problem of working out the camera position in 3D co-ordinates from 2D information is that there is not enough information to extract any kind of result. The way that humans perceive depth and position from a monocular image is based on the recognition of shapes and/or the parallax effect produced by motion.
These two methods can both be brought into the computer system. The recognition of shapes can be implemented by giving the computer information about the shape of an object (i.e. the relationship between a set of points), and the parallax effect can be presented to the computer by getting it to compare information about points that are tracked over two or more frames in an image sequence.
To implement the object recognition system it is necessary to work out the x, y, z values for the position of the camera in 3D space and its rotation around the x, y, z axes. The position can be given purely as a displacement or as a point value, and the same is true for the rotation (except that an aim point (x, y, z) could be worked out in Maya if the situation requires it). From this it is necessary to work out 6 equations to solve the 6 unknowns.
(ii) INVERTING THE 3D PROJECTION ALGORITHM
The most obvious method is trying to reverse the algorithm which is used to project 3D data onto a 2D screen. The principle of the projection of 3D data onto a 2D image plane begins with converting the world space co-ordinates of the object into eye-space co-ordinates which relate to the position and orientation of the camera. Then the 3D data can be projected onto the image plane.
So, working in reverse order, the projection is first

    Ppx = Vd * Pex / Pez
    Ppy = Vd * Pey / Pez
    Ppz = Vd
The unknowns in this projection are the values of Pex, Pey and Pez.
Ppx and Ppy are the values measured on the image plane, and Ppz is the distance of that image plane from the camera, Vd. The other equations, obtained from the conversion to eye-space, are of the form

    Pe = R (Pw - Vp)

where Vp = (Vpx, Vpy, Vpz) is the camera position and R is the combined rotation about the x, y and z axes. These can be expanded to give Pex, Pey and Pez in terms of Pwx, Pwy, Pwz, the camera position and the sines and cosines of Rx, Ry and Rz.
The world origin, Ow, can be assumed to be (0, 0, 0).
These can then be substituted into the above equations for Ppx, Ppy and Ppz, which then give two useful equations per tracked point (because Ppz = Vd) to help solve the 6 unknowns. These equations define the fact that a point in 3D space, its projection and the camera point all lie on a straight line.
If this is applied to 3 points on the image plane, then that gives 6 equations to work from. By solving these equations simultaneously, it should be possible to calculate the solution.
(iii) MOTION ANALYSIS
For this system, information is taken purely from the screen and no measurements need to be taken from the real world. The method is described in full detail in the paper "Recovery of Ego-Motion Using Region Alignment" by M. Irani, B. Rousso and S. Peleg.
The initial principle is based on eliminating the rotational element of the camera's motion before then calculating the translational motion of the camera. This is done using region alignment.
Take the first and second frames of an image sequence and note the 2D positions of three points on a surface (such as the corners of a window) in both frames. Then work out the 2D transform which warps the points from frame one onto frame two. Now take a fourth point which does not lie on that plane and apply the transform to its position in frame one. When this warped frame is overlaid onto frame two it can be seen that there is a discrepancy between the warped fourth point and the fourth point in frame two - but, of course, the points which lie on the surface match up perfectly. The magnitude of the vector between these two un-matched points gives an indication of the distance of the fourth point from the surface in 3D space: as the point is moved closer to the surface, the magnitude of the vector between its warped projection from frame one and its projection in frame two decreases. Doing this for many points on the screen would give a depth map of the scene.
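The warp-and-compare test can be sketched as below. Three correspondences determine an affine transform, which is the simplest choice for an example; the paper itself uses a more general 2D parametric motion model, and all the point values here are illustrative.

    // Fit the 2D affine warp that maps three surface points from frame 1
    // onto frame 2, apply it to a fourth (off-surface) point, and measure
    // how far the warped point lands from the point tracked in frame 2.
    #include <cmath>
    #include <cstdio>

    struct Pt { double x, y; };

    // Solve the 3x3 linear system A*v = b by Cramer's rule.
    static void solve3(const double A[3][3], const double b[3], double v[3])
    {
        double det = A[0][0]*(A[1][1]*A[2][2] - A[1][2]*A[2][1])
                   - A[0][1]*(A[1][0]*A[2][2] - A[1][2]*A[2][0])
                   + A[0][2]*(A[1][0]*A[2][1] - A[1][1]*A[2][0]);
        for (int k = 0; k < 3; ++k) {
            double M[3][3];
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 3; ++c)
                    M[r][c] = (c == k) ? b[r] : A[r][c];
            double detk = M[0][0]*(M[1][1]*M[2][2] - M[1][2]*M[2][1])
                        - M[0][1]*(M[1][0]*M[2][2] - M[1][2]*M[2][0])
                        + M[0][2]*(M[1][0]*M[2][1] - M[1][1]*M[2][0]);
            v[k] = detk / det;
        }
    }

    int main()
    {
        // Three tracked points on the planar surface, in frame 1 and frame 2.
        Pt p1[3] = { {100, 100}, {200, 110}, {120, 210} };
        Pt p2[3] = { {105, 102}, {207, 113}, {126, 213} };

        // Affine warp: x' = a*x + b*y + c,  y' = d*x + e*y + f.
        double A[3][3], bx[3], by[3], abc[3], def[3];
        for (int i = 0; i < 3; ++i) {
            A[i][0] = p1[i].x;  A[i][1] = p1[i].y;  A[i][2] = 1.0;
            bx[i] = p2[i].x;    by[i] = p2[i].y;
        }
        solve3(A, bx, abc);
        solve3(A, by, def);

        // A fourth tracked point which does not lie on the surface.
        Pt q1 = { 160, 150 }, q2 = { 170, 149 };
        Pt warped = { abc[0]*q1.x + abc[1]*q1.y + abc[2],
                      def[0]*q1.x + def[1]*q1.y + def[2] };

        // The length of the residual vector indicates how far the fourth
        // point is from the plane of the surface.
        double residual = std::sqrt((warped.x - q2.x)*(warped.x - q2.x) +
                                    (warped.y - q2.y)*(warped.y - q2.y));
        std::printf("residual = %f pixels\n", residual);
        return 0;
    }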
By looking at the motion vectors created by further points which are not located on the surface, it is possible to trace them back to the focus of expansion [as explained in the paper]. Given that and the camera calibration information, the three-dimensional translation element of the camera's motion can be recovered.
A set of simple linear motion equations can be solved with the information gained from above. These give the rotational elements of the camera's motion.
(iv) PLANAR MAPPING
Code
egomotion.cpp
moveobj.mel
moveplane.mel
movecamera.mel
This is similar to the universe system, but the camera is fixed to begin with. By working out what transformation has been applied to a planar surface in 3D space, we can invert that transformation and apply it to the camera, which would give us the new position of the camera. Assume the camera to be fixed at the point (0,0,0), with the image plane lying at a distance Vd in front of it. Let y be the vertical axis, x be the horizontal axis and z be an axis perpendicular to x and y, going into the picture. Let points in three dimensional space be denoted as Pw and points on the image plane, or screen, be referred to as Pe. Both Pw and Pe can be extended to represent the co-ordinates of three points in space and their projections onto the image plane, such that there exist Pw0, Pw1, Pw2, Pe0, Pe1 and Pe2. Each of these points has (x,y,z) components associated with it.
By using the parametric equation of a line it is possible to start working out a relationship between all the information known and the unknown positions in 3D space.
This introduces the parameter t into the equation, which will have to be removed later on. Hence

    Pw0x = t0 * Pe0x    [7]

This describes the x component of a line going through the point (0,0,0), the camera's centre, and through the point Pe0. From that the rest of the equations can be derived:

    Pw0y = t0 * Pe0y    [8]
    Pw0z = t0 * Pe0z    [9]

    Pw1x = t1 * Pe1x    [10]
    Pw1y = t1 * Pe1y    [11]
    Pw1z = t1 * Pe1z    [12]

    Pw2x = t2 * Pe2x    [13]
    Pw2y = t2 * Pe2y    [14]
    Pw2z = t2 * Pe2z    [15]
These are the lines on which the 3D points must lie. To further refine the calculations some more equations need to be introduced.
If the lengths of the lines Pw0-Pw1 and Pw0-Pw2 are known, then Pythagoras' theorem can be used to work out another relationship between the points:

    (Pw1x - Pw0x)^2 + (Pw1y - Pw0y)^2 + (Pw1z - Pw0z)^2 = d1^2    [16]
    (Pw2x - Pw0x)^2 + (Pw2y - Pw0y)^2 + (Pw2z - Pw0z)^2 = d2^2    [17]

where d1 is the length of the line Pw0-Pw1 and d2 is the length of the line Pw0-Pw2.
If equations [7] to [12] are substituted into [16] and the result is rearranged in terms of t1, the only unknowns left in the equation are t0 and t1.
If equations [7] to [9] and [13] to [15] are substituted into equation [17], the result likewise contains two unknowns, t0 and t2. So one more equation is required to link t0, t1 and t2 together.
The final link between these points in space is that (for this experiment) the lines Pw0-Pw1 and Pw0-Pw2 are perpendicular and meet at the point Pw0. So the projection of one of these lines onto the other will return the result 0. This can be expressed using the dot product of the two lines, hence

    (Pw1 - Pw0) . (Pw2 - Pw0) = 0    [18]
By substituting all the past equations into this one, the Pw0, Pw1, Pw2, t1 and t2 terms are eliminated. This leaves an equation which only has the unknown t0 in it. Once t0 has been worked out, this value can be substituted back into equations [16] and [17] to give t1 and t2, and hence it is possible to solve equations [7] to [15].
This then gives the positions of the points in 3D space, and doing this for subsequent frames will give a 3D motion. It must be noted, though, that equations [16] and [17] each give two possible solutions, as they involve square roots. So there are four possible answers to this problem:
. Pw1 is behind Pw0
. Pw1 is in front of Pw0
and in both of those cases
. Pw2 is behind Pw0
. Pw2 is in front of Pw0
The positions of these points need to be converted into a rotational and translational value so that the inverse of these motions can be applied to the camera to give its motion.
Example Video - Shows point match of surface
If the link doesn't work, use spheres.mov
To convert the result into rotational and translational values, the changes in the x, y and z co-ordinates are assigned to a plane [whose top right-hand corner is positioned at (0,0,0)] and are used to work out an angle and a directed movement.
The movement is recovered purely from the position of the point Pw0 (using Pw0 as the translation vector).
To recover the rotation a similar idea was followed. The rotation about the z axis was recovered from an analysis of the xy plane. Using tan it is known that

    tan(Rz) = (change in y) / (change in x)

where the changes are measured along an edge of the plane. The same idea can be used to get the rotations about the y and x axes from the xz and yz planes:

    tan(Ry) = (change in z) / (change in x)
    tan(Rx) = (change in z) / (change in y)

Using the inverse tan function the values of Rx, Ry and Rz can then be calculated.
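A minimal sketch of this recovery step is shown below. The choice of which plane edge feeds which rotation, and the sign conventions, are assumptions made for the example and must match the axes used in the 3D package; the values are illustrative.

    // Recover rotation angles with inverse tan from the displaced corner
    // points of the plane, and take the translation directly from Pw0.
    #include <cmath>
    #include <cstdio>

    struct Vec3 { double x, y, z; };

    int main()
    {
        // Recovered 3D positions: the plane's corner (Pw0) and the ends of
        // its horizontal (Pw1) and vertical (Pw2) edges - illustrative values.
        Vec3 Pw0 = { 1.0, 2.0, 10.0 };
        Vec3 Pw1 = { 3.0, 2.2, 10.5 };
        Vec3 Pw2 = { 1.1, 0.5, 10.2 };

        double Rz = std::atan2(Pw1.y - Pw0.y, Pw1.x - Pw0.x);  // from the xy plane
        double Ry = std::atan2(Pw1.z - Pw0.z, Pw1.x - Pw0.x);  // from the xz plane
        double Rx = std::atan2(Pw2.z - Pw0.z, Pw2.y - Pw0.y);  // from the yz plane

        std::printf("T  = (%f, %f, %f)\n", Pw0.x, Pw0.y, Pw0.z);
        std::printf("Rx = %f  Ry = %f  Rz = %f (radians)\n", Rx, Ry, Rz);
        return 0;
    }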
Example Video - Shows planar match of surface
If the link doesn't work, use match.mov
The only thing left to do is to apply the inverse of the above transformations to the camera in relation to the feature which is being tracked. The way of implementing this will depend on which software package is being used. Whichever platform is being worked on, the idea behind this inversion is to apply the negative value of the rotation to the camera, with the centre of rotation at the top left-hand corner of the plane, then negate each value of the translation and apply that to the camera.
Example Video - Track of image plane on door [see appendix]
If the link doesn't work, use door.mov
Example Video - Adjusted track with a cone set to position of floor [see appendix]
If the link doesn't work, use cone.mov
E. PROBLEMS, ADVANTAGES AND SOLUTIONS
(i) VISUAL POINT TRACKING BY AVERAGE
This gave results which work only in some conditions: the point being tracked is not reliable because the average of its component colours might easily be the same as that of many different areas on the screen. It only works properly for tracking contrasting points on plain backgrounds. The camera position would also have to be matched for the first frame, as only the motion of the camera is detected using this method.
(ii) VISUAL POINT TRACKING BY PATTERN
The initial problem with this system was the interlacing issue mentioned in section B(ii). Having found the solution, the results were better and more reliable, yet the tracking point still had a small amount of jitter to it. This is probably caused by small amounts of noise on the image being tracked (the image quality will never be perfect). It could also be caused by subtle changes in the lighting conditions on the point being tracked, or by the system accidentally matching the pattern it is looking for to pixels which have motion blur artifacts or which are slightly out of focus.
One way to smooth out the motion is to adjust the results of the track so that each match co-ordinate is made up, proportionally, of itself and the average of the points either side of it.
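A minimal sketch of this blending, with a hypothetical weighting of one half:

    // Blend each tracked co-ordinate with the average of its two
    // neighbours; 'weight' is how much of the original value is kept.
    #include <cstdio>
    #include <vector>

    static std::vector<double> smoothTrack(const std::vector<double> &v, double weight)
    {
        std::vector<double> out(v);
        for (size_t i = 1; i + 1 < v.size(); ++i)
            out[i] = weight * v[i] + (1.0 - weight) * 0.5 * (v[i - 1] + v[i + 1]);
        return out;
    }

    int main()
    {
        double raw[] = { 10, 11, 13, 12, 14, 15 };       // e.g. the x track
        std::vector<double> x(raw, raw + 6);
        std::vector<double> sx = smoothTrack(x, 0.5);
        for (size_t i = 0; i < x.size(); ++i)
            std::printf("frame %d: %.2f -> %.2f\n", (int)i, x[i], sx[i]);
        return 0;
    }

The same function would be run separately on the x and y tracks.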
The alternative would be to analyse all the results of the track and to create a line of best fit (a line which runs as close to all of the tracked points as possible). This would also allow the motion to be interpreted at a sub-pixel level, giving results which can be finer than a pixel.
(iii) INVERTING THE 3D PROJECTION ALGORITHM
Although this system seems the most relevant, the mathematics behind the final result is immense. The simultaneous equations keep building up until there are so many variables that it would be impossible to solve them by hand within the time constraints provided (an allowance has to be made for human error in the calculations, so they would have to be repeated a number of times to ensure they are correct). This is why the Planar Mapping system was considered, as it breaks the equation down into more manageable sections, with little room for error.
(iv) MOTION ANALYSIS
This is a more robust system than the other two and seems to be used as a basic principle in many robotic systems. The paper it comes from suggests using translational information from each pixel on the screen, which is very processor intensive, though the system would still work if the number of tracking points were reduced (it would just be less accurate). The 3D camera would also have to be pre-matched to the first frame of the sequence, as this method only deals with relative changes in motion.
(v) PLANAR MAPPING
Although the Planar Mapping system involves lighter equations than the inverted 3D projection algorithm, it still contains too many terms for it to be calculated manually. This is where access to external mathematical programs helped reduce the time and increase the accuracy of combining and re-arranging the equations. This was initially very useful, but it soon became apparent that not even this could cope sensibly with the large number of terms that needed to be dealt with. One machine was left running for four days without returning an answer.
To collect such a large number of terms onto one side of an equation involves not just manipulating quadratic or quartic equations, but manipulating an equation with more than fifty terms, which needs to be reduced down to just one. The nature of the processes the computer has to go through to simplify the equation practically doubles the computation time for each term it deals with. Therefore a process that might take 0.5 secs to work out would take about 0.5 * 2^50 secs (of the order of ten million years) to calculate fifty terms. An alternative method therefore has to be adopted.
The solution to this problem lies in using an iterative process to solve simpler equations.
By substituting equations [7] to [15] into [16] and [17], equations involving t0, t1 and t2 can be derived. By iteratively guessing the value of t0, the numeric values of t1 and t2 can be evaluated.
These values are then used to evaluate the coordinates for Pw0, Pw1 and Pw2.
The values are inserted into equation [18] to see if the lines calculated are perpendicular. If the equation returns a value of 0 then the 3D points are in the correct position. Should the answer be greater than 0 then t0 should be increased, and if it is less than 0 then t0 should be decreased.
The best way to understand this method is to imagine that there are lines coming out from the viewpoint, which go through the projections on the image plane, like rails, and to visualise the feature being tracked as two bars of set length being moved down the rails until they are at right angles to one another.
The iteration starts at the maximum possible distance from the viewpoint and then repeatedly decides which side of the current t0 the solution lies on, moves half the current distance towards that side, and repeats this until a suitable result is reached (i.e. when the dot product = 0).
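The real solver is in egomotion.cpp; the sketch below is a minimal stand-alone illustration of the idea, written as a bracketed bisection on t0. The projected points (taken with Vd = 1), the edge lengths d1 and d2, the 1000-unit upper bound and the choice of the 'in front' root are all assumptions made for the example.

    // For a guessed t0, equations [16] and [17] each reduce to a quadratic
    // in a single unknown (t1 or t2); the dot product [18] then decides
    // whether t0 should be increased or decreased, and the step is halved.
    #include <cmath>
    #include <cstdio>

    struct Vec3 { double x, y, z; };
    static double dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static Vec3 scale(Vec3 a, double s) { Vec3 r = { a.x*s, a.y*s, a.z*s }; return r; }
    static Vec3 sub(Vec3 a, Vec3 b) { Vec3 r = { a.x-b.x, a.y-b.y, a.z-b.z }; return r; }

    // Solve |t*Pe - t0*Pe0|^2 = d^2 for t, taking the "in front" root.
    // Returns a negative value if no solution exists (t0 is too far away).
    static double solveEdge(Vec3 Pe, Vec3 Pe0, double t0, double d)
    {
        double a = dot(Pe, Pe);
        double b = -2.0 * t0 * dot(Pe, Pe0);
        double c = t0 * t0 * dot(Pe0, Pe0) - d * d;
        double disc = b * b - 4.0 * a * c;
        if (disc < 0.0) return -1.0;
        return (-b + std::sqrt(disc)) / (2.0 * a);
    }

    int main()
    {
        // Projections of the plane's corner (Pe0) and of the ends of its two
        // edges (Pe1, Pe2), with the image plane at Vd = 1, plus the measured
        // real-world edge lengths d1 and d2.
        Vec3 Pe0 = { 0.10, 0.10, 1.0 };
        Vec3 Pe1 = { 0.30, 0.10, 1.0 };
        Vec3 Pe2 = { 0.10, -0.05, 1.0 };
        double d1 = 2.0, d2 = 1.5;

        double lo = 0.0, hi = 1000.0;          // assumed search range for t0
        double t0 = 0.0, t1 = 0.0, t2 = 0.0;
        for (int i = 0; i < 60; ++i) {
            t0 = 0.5 * (lo + hi);
            t1 = solveEdge(Pe1, Pe0, t0, d1);
            t2 = solveEdge(Pe2, Pe0, t0, d2);
            double err = -1.0;                 // no solution counts as "too far"
            if (t1 > 0.0 && t2 > 0.0) {
                Vec3 Pw0 = scale(Pe0, t0);
                err = dot(sub(scale(Pe1, t1), Pw0),   // equation [18]
                          sub(scale(Pe2, t2), Pw0));
            }
            if (err > 0.0) lo = t0; else hi = t0;  // >0: move away, <0: move closer
        }
        Vec3 Pw0 = scale(Pe0, t0), Pw1 = scale(Pe1, t1), Pw2 = scale(Pe2, t2);
        std::printf("Pw0 = (%f %f %f)\n", Pw0.x, Pw0.y, Pw0.z);
        std::printf("Pw1 = (%f %f %f)\n", Pw1.x, Pw1.y, Pw1.z);
        std::printf("Pw2 = (%f %f %f)\n", Pw2.x, Pw2.y, Pw2.z);
        return 0;
    }

With these example values the search converges on t0 = t1 = t2 = 10, i.e. a corner point sitting 10 units along the Pe0 ray.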
The use of this system has to be limited to a certain number of decimal places and/or iteration steps to save time on finding points in space to a ludicrous degree of accuracy.