
Computer Vision: CM30080

Peter Hall
pmh@cs.bath.ac.uk
Contents

1 Introduction to CM30080
  1.1 The course
    1.1.1 Assessment
    1.1.2 Books
  1.2 Computer Vision, and related disciplines
  1.3 Applications and Trends

2 Basics
  2.1 Ways to Think About Images
    2.1.1 Images as evidence, functions, frequencies, and points
    2.1.2 Windows, neighbourhoods, regions, and connectivity
  2.2 Cameras
    2.2.1 A simple perspective camera
    2.2.2 A simple affine camera

3 Low level vision
  3.1 Linear Filtering
    3.1.1 Blurring images suppresses noise
  3.2 Differentiating images detects edges
  3.3 Frequency domain filtering
  3.4 Corner and feature detection
  3.5 Histogram, Morphological and other transforms
    3.5.1 Histogram Transforms
    3.5.2 Distance Transforms
    3.5.3 Morphological Transforms

4 Cameras, stereopsis, and reconstruction
  4.1 A single camera
    4.1.1 Camera calibration
  4.2 Two cameras
    4.2.1 Epipolar geometry
    4.2.2 Essential Matrix and Fundamental Matrix
    4.2.3 Determining scene geometry
  4.3 Multiple Images
  4.4 Matching Points Across Images
  4.5 Many-image Applications
    4.5.1 Mosaicing
    4.5.2 Reconstruction

5 Segmentation
  5.0.3 Simple segmentation
  5.0.4 Merge and Split
  5.0.5 Segmentation as clustering

6 Tracking
  6.0.6 A simple tracker
  6.0.7 Motion models
  6.0.8 Kalman Filtering

7 Model Based Vision
  7.1 Simple Recognition of 3D objects?
  7.2 Pictures as models
    7.2.1 Eigenfaces

A Mathematical background
  A.1 Linear algebra: vectors and matrices
    A.1.1 Singular Value Decomposition
  A.2 Fourier Transforms and convolution
  A.3 Statistics

Chapter 1

Introduction to CM30080

1.1 The course


The aim of this course is to equip students with a broad appreciation of Computer Vision. The course
is delivered via lectures, labs, books and exercises. Lectures tell you what to learn; books fill in the
important details. Labs give invaluable practical experience; exercises force the use of theory. You are
expected to spend 40 hours working on this course; assessment (75% exam, 25% coursework) will assume
you have.
At its most abstract, Computer Vision belongs to a class of mathematical problems called inverse
problems. This means inferring models of some kind, given data.
As we will discover, Computer Vision is a vast subject, composed of different areas. Three
basic examples are edge detection, shape from stereo, and segmentation. Other areas, recognition for
example, depend on one or more of these. The course will optionally discuss areas such as recognition.
There are general methods for solving Computer Vision problems too. Some methods are strategies,
such as bottom-up and top-down, that determine one's basic approach, in particular how much is assumed
about the problem in hand. Other methods are tactics: for example, the use of singular value decomposition
as a method for estimating an optimal model from data. The role of experimental verification
is important too: Computer Vision is an engineering discipline.
To do computer vision, you will need to know and be able to use some mathematics. Principally
you need vector and matrix analysis, some Fourier theory, and some statistics. Practical Computer
Vision requires the ability to program too. Most people use C (execution speed) or MATLAB (ease of
development).

1.1.1 Assessment
Assessment is based on a formal examination. The examination will be based on material delivered in
lectures, informative exercises, and in recommended texts. The examination is set assuming you have
worked 40 hours on this course. The examination lasts two hours (and does not count toward your 40
hours total time).

1.1.2 Books
The most important book for this course — the one which is closest in spirit — is

D.A. Forsyth and J. Ponce (2003)


Computer Vision: A Modern Approach
Prentice-Hall, ISBN 0-13-085198-1

Other valuable books are:


R. Jain, R. Kasturi, and B.G. Schunk (1995)


Machine Vision
McGraw-Hill, ISBN 0-07-113407-7

1.2 Computer Vision, and related disciplines


Computer Vision sets out to make machines "see". Note the quote marks: "seeing" here makes no
claim that machines see as you or I or any other animal does. Computer Vision is not like psychophysics, which
tries to explain how humans see.
Computer Vision is about "getting information out" of images, which is why it is an inverse problem.
Typically the information to be "got out" is a model of some kind. These models include but are not
limited to three-dimensional models of objects. This contrasts with Computer Graphics, which is about
making images from models and is a forward problem.
As the name suggests, Image Processing processes images to make other images. For example, images
may be enhanced to make them easier for people to see the information in them. Image processing
can also concern itself with compressing and transmitting images. Computer Vision uses some image
processing, which in this context might also be called low-level vision (which references the fact that few
assumptions are made) or early vision (which draws an analogy with the biological processing of eyes).
Computer Vision comprises three main strategies. Low-level vision overlaps very much with
Image Processing. Mid-level vision tries to make "general" models from images and uses more assumptions
than low-level vision. High-level vision tries to make "specific" models, and uses strong assumptions. To
date, no one has produced a Computer Vision system that competently performs in all three classes, as
the human visual system can.
Some examples help distinguish between the three classes named above. Imagine a photograph of a
street with cars. The photograph is blurred. A low-level vision program might "unblur" the photograph
so humans could read the number plates. The unblurred photograph might be further processed to
produce an "edge map", which is just an image that is white wherever there is a sharp colour change,
and black elsewhere. Note this edge map is just an image. Think of each white pixel in the edge map
as housing a very small edge. The edges could be used by a mid-level vision program that
links them into polygons. Referring to Figure 1.1, the "model of classes of things" in this case is the set
of all polygons. A high-level vision program might match the polygons to examples in a database, and
in that way read the number plates automatically. Clearly, this requires a specific model, in this case
polygons that represent the outline of numbers and letters as they appear in number plates.
Notice that numbers and letters are special cases of polygons, and these are special cases of linked
edges. Also notice that in moving from low-level to mid-level and then to high-level vision we are
forced to make more assumptions. First we assumed a sudden change was an edge; this would work
for pretty much any picture at all. Next we assumed that edges link into polygons. This will
not always be the case, sometimes for practical reasons (the edge detector somehow fails) but also as a
matter of principle: we cannot guarantee edges will always link into polygons, since they might not form
closed loops, or may not link at all. We made the assumption about linking because it is helpful and valid in many
cases. Finally we assumed that some polygons represent numbers and letters as they appear in number
plates. This may not always be the case: some plates may be written in a fancy type-face, or be from
outside the UK. Again our assumption is helpful, it makes reading plates easier, but it does not cover
all cases. This is typical: the more we assume about a particular problem the easier it is to solve,
but the less general any solution built on our assumption will be. On the other hand, if we make few
assumptions, then progress can be very difficult indeed.

1.3 Applications and Trends


Computer vision has many applications. These include security, medicine, astronomy, physics, augmented
reality, entertainment and film.
“Active vision” is an important current trend, which tries to process images just enough to get a
particular task done, such as moving a robot about a room. This contrasts with the traditional approach

(Figure: a layered diagram with Images at the bottom, Models of classes of things in the middle, and Models of particular things at the top. Image Processing / Low-level Computer Vision, Mid-level Computer Vision, and High-level Computer Vision move up through the layers, while Computer Graphics runs in the opposite direction.)

Figure 1.1: A diagram of Computer Vision and its relationship to related disciplines. The modern
disciplines overlap much more than this diagram suggests, but the general pattern is captured.

which held that building three-dimensional models was the goal of computer vision.
Another trend is to include Machine Learning techniques. It turns out that it is often easier to build
a system that learns to see, rather than build a system that sees from the start. Usually the learning is
to get a model into the computer vision system, and later the model is used to help processing. As an
example, it may be easier to learn the characteristics of hand-written text by looking at it and measuring
things about it, than it is to think up rules that describe hand-written text (so they can be recognised).
The "convergence" area is currently very active. The term refers to the fact that much progress
can be made by combining computer vision and computer graphics. Obvious examples arise in the film
industry, where special effects require image analysis (Computer Vision) and image synthesis (Computer
Graphics). Less obvious are medical applications, where the output of an ultra-sound scanner, say,
may be processed and combined with computer graphics to "paint" the internal organs over the real
image of a patient.

(Figure: three panels, left to right. Low-level vision: a small piece of image showing edges in pixels; the edges have a direction. Mid-level vision: the edges have been linked to make a polygon, but edges may not link into polygons. High-level vision: the polygon is matched to others in a database, but not all polygons can be matched.)

Figure 1.2: The progress of processing a number plate for reading; see the text for an account. Note that stronger
assumptions are made as processing moves from left to right, and the general applicability falls from left
to right.
Chapter 2

Basics

This chapter outlines the background material you are expected to know for this course. It is
limited to ideas directly related to Computer Vision; see Appendix A for mathematical background.

2.1 Ways to Think About Images


It is natural to think that an image is a picture of something, but computer vision practitioners think of
images in many different ways, depending on context. To begin with, the images we deal with come not
just from standard cameras, but in a vast array of different types: x-rays, depth images, infra-red;
and from an even wider range of devices: MRI machines, submarines, planetary landers, cars, ordinary
cameras. Here we consider images in a variety of ways.

2.1.1 Images as evidence, functions, frequencies, and points


From a philosophical point of view, images are best regarded as evidence for something, rather than
a picture of something. After all, any one image may be caused by many things (are we looking at a
photograph of the Prime Minister, or a photograph of a waxwork?) Fundamentally, this is (I think) the
right way to look at images, because it allows for uncertainty. This view has led to the rise of statistics
in computer vision. Some examples of difficult images are shown in Figure 2.1.

Figure 2.1: These pictures show that images can only ever be evidence for something, rather than a
picture of something. Is the left image a rabbit or a duck? What every-day object is in the middle
photograph — and would a different view be easier to understand? What is in the processed photograph
on the right?

Technically, images are typically represented by an array of colours or intensities. Intensity images
(monochrome images) usually have 8 bits per pixel. Full colour images have three channels, for red,
green, and blue intensities. It is common to think of intensities as ranging between zero and one. This
view, seen in Figure 2.2, is needed when we want to store, input, and output images.
When we want to mathematically manipulate images (which is often) we think of them as functions.
For monochrome images this is intensity as a function of two spatial variables, v = f (x, y), but for colour
images we need to think of a multi-valued function, (r, g, b) = c(x, y). It sometimes helps to think of an
intensity image as a height field, that is as a surface, bright pixels being higher than darker ones, say.


(Figure: a stack of image planes, 8 bits per pixel per plane; monochrome images have one plane, colour images have three. Each pixel in each plane has integer values 0...255, often considered in the range 0...1.)

Figure 2.2: A schematic view of monochrome and colour images.

Another way to think about images mathematically is as the sum of some other functions. The most
common example of this is the frequency domain view, in which the image is built by summing sinusoids.
This view of the image is rather abstract, but is very useful in understanding filtering processes, and is
also used in applications such as compression.
Perhaps the most obscure way to think of images is as points. To help see how this works, think of
an image of one pixel, whose intensity varies between 0 and 1. Given such a "picture", we can plot it
on the [0, 1] interval. Now consider a picture of two pixels. In this case we have two axes, one for
each pixel. Each point in the square $[0,1]^2$ is a possible picture. Similarly, pictures with three pixels are
points in the cube $[0,1]^3$. Finally, pictures with N pixels are points in the hyper-cube $[0,1]^N$: there is
an axis for every pixel. This kind of representation can, perhaps surprisingly, be useful.

2.1.2 Windows, neighbourhoods, regions, and connectivity


A window on an image is a small group of pixels contained within a geometric shape of some kind,
usually square, sometimes circular, and occasionally other shapes.
A neighbourhood of a pixel comprises surrounding pixels — its neighbours. We can think of the
4-neighbours (north, east, south, and west), or the 8-neighbours (also include north-east, north-west,
south-east, and south-west). More complex definitions are possible (see Gonzales and Woods [3]).
A region is an arbitrarily shaped collection of pixels. We can think of pixels in the region as "in",
and all other pixels as "out". Two pixels A and B in a region are connected if we can move from A to
one of its "in" neighbours, from there to some other "in" neighbour, and so on, eventually arriving at B
having moved only over "in" pixels. Clearly, some paths may be allowed using
8-neighbours that are not possible using 4-neighbours, so we talk of 4-connectivity and 8-connectivity.

2.2 Cameras
It is obvious that computer vision relies on cameras. Different applications require different information,
and make different assumptions, so it is not surprising that many models of cameras have been developed.
Forsyth and Ponce [1] give details of some of these, and mention real-world effects such as lens distortion
that are sometimes important.
The pin-hole camera is a very simple camera, is easy to model, and often suffices as a good first
approximation to reality. In the pin-hole camera rays of light travel in straight lines from points in the scene
through a focal point (the pin hole). The light falls onto a flat surface behind the focus, and may be
recorded on film or digital media. From a mathematical point of view the flat surface could just as well
be in front of the pin-hole, as the diagram below shows.

(Figure: a pin-hole camera, showing a light ray travelling from an object point through the focus to an image on a window, which may lie behind or in front of the focus. The simple idea behind perspective projection is similar triangles: h = wH/W.)

The pin-hole camera is the simplest example of a perspective camera, so called because it produces
perspective images of scenes. Computer vision recognises other forms of perspective too; see Forsyth
and Ponce [1] for details. We will now make a mathematical model of a pin-hole camera: this means we
will model perspective projection.

2.2.1 A simple perspective camera

We begin modelling perspective projection using similar triangles. The underlying assumption is that
a ray of light travels in a straight line. Consider a ray of light leaving some real-world point and travelling
toward the focal point. At some point along its path the light ray passes through a window, and in
doing so makes an image of the real-world point on the window. Now consider the line from the focal
point in a direction perpendicular to the window plane. Measure the distance from the focal point to
the window plane along this line, and call this distance w. Also measure the perpendicular distance of the
real-world point to the line, call this H; and the distance from the focal point to the foot of the point's
perpendicular, W. The distance from the image in the window to the reference line is h = Hw/W,
obtained using similar triangles. This is easy to see in the diagram above, which has been drawn
using a convenient "side on" view.
Computer vision practice (and, in fact computer graphics practice) is to represent projection using
homogeneous coordinates, which allows a matrix representation of the camera. We will see the utility of
this approach when we study reconstruction via stereopsis (Chapter 4). Here we note that if we fix the
camera focus to the origin and choose the z-axis as the direction of view, then we can easily represent

projection using a simple matrix:


 
$$
P_{\rm easy} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\qquad (2.1)
$$

so that if (x, y, z)^T is a real-world point, arbitrary up to z ≠ 0, then

$$
\begin{bmatrix} x \\ y \\ 0 \\ z \end{bmatrix} = P_{\rm easy} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
$$

gives the projected point in homogeneous coordinates; [x/z, y/z, 0]^T is the image of the real-world point.
Typically the z-coordinate is dropped from further consideration. Notice the arbitrary object point
[x, y, z]^T is projected to a plane, in this case the xy-plane, provided z ≠ 0. If z = 0, the corresponding
homogeneous point is [x, y, 0, 1]^T, which under P_easy transforms to [x, y, 0, 0]^T. The corresponding non-homogeneous
point is [x/0, y/0] = [∞, ∞].
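To make the homogeneous-coordinate machinery concrete, here is a minimal NumPy sketch (not part of the original notes) that applies P_easy of Equation 2.1 to a world point and performs the divide-by-z step; the function name is illustrative only.

import numpy as np

# P_easy from Equation 2.1: focus at the origin, viewing along the z-axis.
P_easy = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0],
                   [0, 0, 1, 0]], dtype=float)

def project_easy(x, y, z):
    """Project a real-world point (x, y, z), z != 0, through P_easy."""
    X = np.array([x, y, z, 1.0])      # homogeneous world point
    h = P_easy @ X                    # homogeneous image point [x, y, 0, z]
    return h[:3] / h[3]               # divide through by the homogeneous coordinate

print(project_easy(2.0, 4.0, 2.0))    # -> [1. 2. 0.]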
An alternative form for projection is the (3 × 4) matrix
 
$$
P_{\rm easy} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\qquad (2.2)
$$

which is the form used by some texts because it naturally "drops" the projected z-coordinate.
Notice that both matrices are singular: they have no inverse. Both are rank 3 matrices (the rank
of a matrix is the number of linearly independent rows or columns). This feature is characteristic of all
projection matrices and reflects the fact that three dimensions are being "compressed" into two; depth
information is lost.
In Section 4.1 we will build a more sophisticated camera model; here we introduce an affine
camera.

2.2.2 A simple affine camera


Affine cameras assume parallel light rays, and are readily expressed using a (2 × 4) matrix
 
$$
P_{\rm affine} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\qquad (2.3)
$$

in which the projected point is just [x, y]^T. In some texts (typically computer graphics texts) this is
called orthogonal or orthographic projection.
Affine cameras preserve the length of vector components that are
parallel to the window plane, because the rays of light are all assumed to be orthogonal to it (hence the
alternative names). A simple affine camera is shown in Figure 2.3.

(Figure: an object line viewed side-on. The component of the object line parallel to the window plane projects to an image of the same length as that component; the component perpendicular to the window plane is lost.)

Figure 2.3: A side view of a basic affine camera.

Affine camera matrices have rank two.


Chapter 3

Low level vision

Low-level vision (also called early vision) is all about getting information out of images using a minimal
set of assumptions. Low-level vision is a prerequisite for any Vision system, be it top-down or bottom-up,
simply because any Vision system must be grounded in basic information available to it, which we take
to be pixels.
Low-level vision takes pixels in and gives pixels out. We will consider two classes of processing: linear
filtering and morphological processing. We will learn how linear filtering an image can be used to blur
away noise, to enhance edges, and even to describe textures. We will learn, too, that by using non-linear
filters (morphological operators) we can shrink or grow shapes, find their borders, or fill them in.

3.1 Linear Filtering


Linear filters comprise a very important class of image processing techniques. The basic idea behind all
linear filters is identical, and very simple: we form sums of weighted pixel values from a "source" image,
and store the result in a "target" image.
To do this we input a source image and choose a window. The window does not have to be small
and square, but usually is, so we'll consider that case for now. The important things are that we can
place the window anywhere on the image, and the window pixels and image pixels will always match
up. All we do now is place the window somewhere on the source image. Next we multiply each pixel
value in the source by the corresponding pixel value in the window. The sum of these products gives a
result that is stored in a pixel of some target image. The window is then moved to cover the next-door set
of pixels, and the process is repeated with the result now stored in the neighbouring pixel of the target.
This continues until all the pixels in the target have been given a value. (This description overlooks
several difficulties, but the simple scheme remains intact.)
So let's consider the case where we have a (2N+1) × (2N+1) window; this is a square with an odd
number of pixels on each side. The indices of each pixel will serve as its location. It is mathematically
convenient to place the centre of the window at the origin, so we will allow negative indices. The weight
at pixel (i, j) is w(i, j). The window is placed over pixel (x, y); that is, the central window pixel is
aligned with the pixel at (x, y). The image value at this pixel is f(x, y). The weighted sum is given by

$$
g(x, y) = \sum_{j=-N}^{N} \sum_{i=-N}^{N} f(x - i,\, y - j)\, w(i, j)
\qquad (3.1)
$$

where g(x, y) is a pixel in the target image. Notice that this sum “flips” the window; convince yourself
this is true. Also, bear in mind the shape of the window does not matter. After all, we could effect a
window of arbitrary shape by setting some weights to be zero. This scheme is illustrated in Figure 3.1.
The above process is simple, but surprisingly powerful. It can be used to blur images, to "emboss"
images, and it underlies edge detection. Exactly the same weighted-sum process is used in each case, only
the weights differ.


(Figure: a window of weights w(i, j), indexed from (−1, −1) to (1, 1) with w(0, 0) at the centre, placed over a source image; the weighted sum is written to the corresponding pixel of the target image.)

Figure 3.1: Linear filtering scheme. A window of weights is placed over a source image. A weighted sum
of source pixels is stored in a target pixel.

Figure 3.2: Some effects of linear filtering. From left to right: original photograph; blurred version;
vertical edges; horizontal edges. These effects are made just by changing the weight values in the
window (some weights need to be negative).

This weighted-sum process we have just described is more formally called convolution. More properly,
it is called discrete convolution — because it takes place at discrete locations. The continuous form of
convolution uses integration rather than summation and is defined in 2D as
$$
g(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(u, v)\, h(x - u,\, y - v)\, du\, dv
\qquad (3.2)
$$

in which f() is the image and h() the convolution kernel, which plays the role of the window. This is
often written as f ∗ h, as a short-hand for the integration, so

g = f ∗ h

means the same thing.


You may have spotted that the mathematical definition differs from the weighted sum given above:
the weighted sum uses kernel coordinates rather than image coordinates. The reason for this is purely
pragmatic: it is possible to write a program that performs convolution "properly", but we would have
to have a huge window, most of which was zero. It is sensible to write code that loops only over the
non-zero heart of the window. In fact, it is easy to convert the above linear sum to read as a standard
definition of discrete convolution; we need only change variables so that u = x − i and v = y − j, then

the above sum is written


$$
g(x, y) = \sum_{v=y-N}^{y+N} \sum_{u=x-N}^{x+N} f(u, v)\, w(x - u,\, y - v)
\qquad (3.3)
$$

which is a valid discrete convolution.


Figure 3.3 gives pseudo-code; this pseudo-code is incomplete in several ways, in particular problems
arise when the window “overlaps” the border of the image. How to handle border conditions is left as
an exercise.
INPUT a source image and a window of weights
FOR each pixel in the source
Locate window at source pixel, (x,y)
sum = 0
FOR each pixel (i,j) in window
sum = sum + pixel(x-i,y-j)*weight(i,j)
END
target pixel(x,y) = sum
END
OUTPUT a target image
Figure 3.3: Pseudo-code for convolution.
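The pseudo-code translates almost line for line into a real program. The sketch below is one possible reading in Python with NumPy (not part of the original notes); it assumes a 2D array for the source and handles the border by zero-padding, which is only one of the possible choices left as an exercise above. Names such as filter_image are invented for this example.

import numpy as np

def filter_image(source, window):
    """Weighted-sum (convolution-style) filtering of a 2D array.

    `window` is a (2N+1) x (2N+1) array of weights w(i, j); the border is
    handled by zero-padding the source, which is only one of several options.
    """
    N = window.shape[0] // 2
    padded = np.pad(source, N, mode='constant')     # zero border
    target = np.zeros_like(source, dtype=float)
    rows, cols = source.shape
    for y in range(rows):
        for x in range(cols):
            total = 0.0
            for j in range(-N, N + 1):
                for i in range(-N, N + 1):
                    # pixel(x-i, y-j) * weight(i, j), as in the pseudo-code
                    total += padded[y - j + N, x - i + N] * window[j + N, i + N]
            target[y, x] = total
    return target

# Example use: a 3x3 flat blur of a random image.
# out = filter_image(np.random.rand(16, 16), np.full((3, 3), 1.0 / 9.0))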

As mentioned, the effect of convolution depends on the kernel being used. We now discuss some
common kernels and their uses.

3.1.1 Blurring images suppresses noise


All real images come with noise, which is a term with a very broad meaning. Some people might say that
any object in a scene which is of no interest is noise, but such objects are more often called clutter.
Noise, here, means variations in the colour values. There are two basic ways to model the variation:
multiplicative noise and additive noise. If v is a pixel value, then multiplicative noise would give sv for
some s, whereas additive noise gives v + d for some d. Noise of either kind can be systematic or random.
Coloured lighting could be considered as systematic noise because the real colours of objects are all changed
in the same kind of way. Digital cameras introduce random noise because the photo-sensors depend on
quantum effects and can fire at any time (in fact they can fire when there is no light, producing dark
current).
We'll consider additive, random noise. This means the amount of noise d is drawn from a probability
distribution of some kind: some values of d are more likely than others. We want a value of 0 to be
the most likely, and for there to be a low likelihood for large values of d (negative or positive). We'll
consider only a Gaussian distribution, because the central limit theorem shows that the sum of many small,
independent, unbiased random errors tends to a Gaussian distribution, no matter what the source of the data
might be. The likelihood that a variation of d occurs is

$$
p(d) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-0.5\, \frac{d^2}{\sigma^2}\right) \Delta
$$

where ∆ is a small interval over the "d" axis. This is stationary additive Gaussian noise. Of course, the
variation d is different for every pixel. If we consider the variations for every pixel in an image suffering
Gaussian noise we will find the mean variation is 0 (stationary) and the standard deviation is σ. From
here on we'll just call it "noise".
One way to suppress noise is to blur the image. The hope is that the variations “average out” in
a window of pixels. Figure 3.4 was made by adding Gaussian noise to a signal, and blurring the noisy
signal — the filtered signal is encouragingly similar to the original. The same figure shows a histogram
of the noise added, and a Gaussian that was fitted to the distribution.

(Figure: two plots. Left, "Suppressing noise by blurring": a signal, the signal plus noise, and the filtered signal. Right, a histogram of the noise distribution with a fitted Gaussian.)

Figure 3.4: Left: a signal with added noise is blurred. Right: the distribution of the noise is Gaussian.

Blurring kernels average pixel values over a window. Flat kernels are easy to make, but Gaussian
kernels are to be preferred for their analytic properties. Flat and Gaussian kernels are visualised in
Figure 3.5.

(Figure: surface plots of the kernels.)

Figure 3.5: Top row, flat kernels of increasing sizes. Bottom row, Gaussian kernels of increasing size.

The flat blurring kernel of width w is defined by


$$
h(x, y) = \frac{1}{w^2}
\qquad (3.4)
$$

for all integer (x, y) in the square [1, w]^2. The Gaussian kernel is

$$
h(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{1}{2}\, \frac{x^2 + y^2}{\sigma^2}\right)
\qquad (3.5)
$$

The scale parameter σ is somewhat arbitrary, but typically could be set as σ = (−r² / (2 log(y)))^{1/2},
which forces the (unnormalised) Gaussian to have height y at distance r.
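As a small illustration (not from the notes), the following NumPy sketch builds the flat kernel of Equation 3.4 and a sampled Gaussian kernel of Equation 3.5, and chooses σ using the rule just given; the function names are invented for this example.

import numpy as np

def flat_kernel(w):
    """Flat blurring kernel of width w (Equation 3.4): every weight is 1/w^2."""
    return np.full((w, w), 1.0 / (w * w))

def gaussian_kernel(radius, sigma):
    """Sampled Gaussian kernel (Equation 3.5) on [-radius, radius]^2."""
    ax = np.arange(-radius, radius + 1, dtype=float)
    xx, yy = np.meshgrid(ax, ax)
    h = np.exp(-0.5 * (xx**2 + yy**2) / sigma**2) / (2.0 * np.pi * sigma**2)
    return h / h.sum()   # renormalise: the truncated samples do not sum exactly to one

# sigma chosen so the (unnormalised) Gaussian has height 0.01 at radius r = 3 pixels
r, height = 3.0, 0.01
sigma = np.sqrt(-r**2 / (2.0 * np.log(height)))
print(flat_kernel(3).sum(), gaussian_kernel(3, sigma).sum())   # both ~1.0

Either kernel can be passed to a filtering routine such as the one sketched in Section 3.1 to blur an image.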

3.2 Differentiating images detects edges


Edges are discontinuities in the image, typically at the boundary of objects where there is a sharp change
of colour. A monochrome image is a scalar function f(x, y); the gradient

$$
\nabla f(x, y) = \begin{bmatrix} \dfrac{\partial f(x, y)}{\partial x} \\[1ex] \dfrac{\partial f(x, y)}{\partial y} \end{bmatrix}
\qquad (3.6)
$$

is a vector that points in the direction of greatest change of this function; that is, the gradient points
across the boundary of an edge. The partial derivatives can be estimated using finite differences; central
differences, for example, give

$$
\frac{\partial f(x, y)}{\partial x} \approx \frac{f(x + 1, y) - f(x - 1, y)}{2}
\qquad (3.7)
$$
$$
\frac{\partial f(x, y)}{\partial y} \approx \frac{f(x, y + 1) - f(x, y - 1)}{2}
\qquad (3.8)
$$

Central differences can be computed via convolution: the row kernel [−1, 0, 1] suffices for ∂f/∂x, for
example, and the equivalent column kernel for ∂f/∂y; but note, kernels are not matrices. Other kernels
exist, and the reader should consult texts to learn about the Sobel, Roberts, and Prewitt operators
(kernels).
Edges have strength, s, and orientation, θ:

$$
s(x, y) = \left( \left(\frac{\partial f(x, y)}{\partial x}\right)^2 + \left(\frac{\partial f(x, y)}{\partial y}\right)^2 \right)^{1/2}
\qquad (3.9)
$$
$$
\theta(x, y) = \arctan\!\left( \frac{\partial f(x, y)/\partial y}{\partial f(x, y)/\partial x} \right)
\qquad (3.10)
$$

The strength measures the contrast at the edge: greater contrast means a stronger edge. The orientation
gives the angle of the gradient to the horizontal.
The convolution theorem can be used to show that convolving with a differential operator (kernel)
is equivalent to differentiating the function. Here we focus on the derivative of Gaussian for several
reasons:
• it combines noise-suppression with edge detection
• it has a natural scale parameter, σ
• it is steerable
The directional derivatives of the Gaussian in Equation 3.5 are

$$
\frac{\partial h(x, y)}{\partial x} = \frac{-x}{\sigma^2}\, h(x, y)
\qquad (3.11)
$$
$$
\frac{\partial h(x, y)}{\partial y} = \frac{-y}{\sigma^2}\, h(x, y)
\qquad (3.12)
$$

used to estimate ∂f/∂x and ∂f/∂y respectively; example kernels are shown in Figure 3.6.
The "scale" of an edge is the spatial distance between its darkest and brightest pixels. A blurred
edge will typically have a larger scale than a "sharp" edge. The Gaussian derivatives will "look for" edges
of a scale commensurate with 2σ, because the peaks in the derivative are separated by 2σ (the reader
should verify this for themselves).
A kernel can be steered if it can be aligned to any direction; Gaussian derivatives can be. The total
derivative in the direction [dx, dy]^T = [cos(θ), sin(θ)]^T is

$$
dh(x, y) = \frac{\partial h(x, y)}{\partial x}\, dx + \frac{\partial h(x, y)}{\partial y}\, dy
\qquad (3.13)
$$

(Figure: surface plots of the three kernels.)

Figure 3.6: Left to right: a Gaussian h, the "x" derivative ∂h/∂x, and the "y" derivative ∂h/∂y. The
vertical axis is not to scale.

$$
dh(x, y) = \frac{-x}{\sigma^2}\, h(x, y)\cos(\theta) + \frac{-y}{\sigma^2}\, h(x, y)\sin(\theta)
\qquad (3.14)
$$
$$
= -\frac{x\cos(\theta) + y\sin(\theta)}{\sigma^2}\, h(x, y)
\qquad (3.15)
$$

which is precisely the differential in the given direction. Examples of steered kernels can be seen in
Figure 3.7.
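A sketch of a steered kernel built directly from Equation 3.15, assuming the same Gaussian convention as Equation 3.5; the code and its names are illustrative only.

import numpy as np

def steered_gaussian_derivative(radius, sigma, theta):
    """Derivative-of-Gaussian kernel steered to angle theta (Equation 3.15)."""
    ax = np.arange(-radius, radius + 1, dtype=float)
    xx, yy = np.meshgrid(ax, ax)
    h = np.exp(-0.5 * (xx**2 + yy**2) / sigma**2) / (2.0 * np.pi * sigma**2)
    return -(xx * np.cos(theta) + yy * np.sin(theta)) / sigma**2 * h

# Steering to 0 degrees recovers dh/dx; 90 degrees recovers dh/dy.
k0  = steered_gaussian_derivative(5, 2.0, 0.0)
k90 = steered_gaussian_derivative(5, 2.0, np.pi / 2)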
(Figure: plan and oblique surface views of the steered kernels.)

Figure 3.7: A Gaussian derivative steered to 0 (top left), 30 (top right), 60 (bottom left), and 90 (bottom
right) degrees. The picture on the left shows a plan view, that on the right shows an oblique view of the
same kernels.

3.3 Frequency domain filtering


To better understand the effects of linear filtering we must consider the frequency domain representation
of images. This will also help us understand aliasing (and anti-aliasing) in computer graphics, as well
as give meaning to terms such as low-pass and high-pass filtering. Filtering in the frequency domain can
often be faster than spatial-domain filtering. Finally, an understanding of the frequency domain serves
as a starting point for understanding wavelets (an advanced topic we will not discuss).
The idea is very simple. First think of a monochrome image as a surface. Next realise this surface can
be made by summing many corrugated surfaces. If we choose one set of corrugates we get a particular
gray scale picture. If we choose another set of corrugates, we get a different picture. In fact, each picture
has a unique set of corrugates, and each set of corrugates gives a unique picture.
Here is an analogy. Think about the surface of a pond, which can be made to undulate in complicated
ways by throwing in stones: the waves from each stone add together to make the pond's surface. This is

called wave superposition. The gray level surface of a picture can be thought of as a pond surface captured
at an instant in time, and the corrugate surfaces are the waves made by the different stones.
Corrugates vary in frequency, amplitude, phase, and (in two dimensions) angle. We can easily get a new picture by varying any
of these properties. The first three of these properties are shown in Figure 3.8 for a one-dimensional
corrugate, which is a sinusoid; we cannot show "angle" for one-dimensional corrugates.

(Figure: plots of a·sin(πx + φ) for (frequency, amplitude, phase) = (1, 1, 0), (2, 1, 0), (1, 2, 0), and (1, 1, 1.57).)

Figure 3.8: Properties of a one-dimensional sinusoid, the base wave shown in black with variations in
colours. (Produced with wave1D.m.)

We can express this mathematically using


$$
f(x, y) = \sum_{i=1}^{N} a_i \sin\!\big(2\pi(x u_i + y v_i) + \phi_i\big)
$$

where f(x, y) is the value of the image at the pixel (x, y), made by summing N corrugates. The ith
corrugate has amplitude a_i and phase φ_i. The frequency in the x-direction is u_i and in the y-direction is
v_i; taken together these give a corrugate of frequency (u_i^2 + v_i^2)^{1/2} that travels at an angle tan^{-1}(v_i/u_i)
to the x-axis (compare this to the vector dot product).
It turns out that each and every image has exactly one set of corrugates which, when added, make
the image. One way to specify the corrugates we are using is to use a matrix. The position of each
matrix element is considered as the vector [u, v]^T, which fixes direction and frequency. So the matrix
represents the uv-plane, as seen in Figure 3.9. We want two values in the matrix element, one to give
the amplitude and the other the phase. This is done using complex numbers written as a·exp(iφ).
The Fourier transform decomposes an image f(x, y) into its corrugates. It gives a new function
F(u, v), which is the continuous analogue of the matrix we've just discussed. The [u, v]^T position
specifies a corrugate, just as before, and the value at each point is complex.
$$
F(u, v) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(x, y) \exp\!\big(-i 2\pi(x u + y v)\big)\, dx\, dy
$$

The inverse Fourier transform reverses the process: it takes the corrugates and adds them to make
a picture. In general this picture will have complex values too, but in practice pictures are real valued
(the imaginary component is zero).

(Figure: points in the uv-plane, such as [1,0], [2,0], [3,0], [0,3], 3[cos(30), sin(30)], and 3[cos(60), sin(60)], each shown with the corrugate it represents as an image of a surface. The origin gives a flat corrugate; moving away from the origin in one direction raises the frequency of the corrugate; moving around the origin at constant distance from it rotates the corrugate.)

Figure 3.9: A diagram to show how different corrugates relate to the uv-plane. Here all amplitudes are
set to unity, and all phases to zero. Notice how the dots in the uv-plane make a picture.

The inverse Fourier transform is

$$
f(x, y) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} F(u, v) \exp\!\big(i 2\pi(x u + y v)\big)\, du\, dv
$$

The Fourier transforms of several pictures are shown in Figure 3.10.

Figure 3.10: A set of pictures (left column) and their frequency-domain equivalents: amplitude (middle)
and phase (right). The "Mondrian" type pictures show a dominance of corrugates that are aligned to
the strongest edges; as the rows show, the frequency domain representation rotates with the image.

As it turns out, any linear filtering we can perform in the spatial domain using convolution, we can
also perform in the frequency domain. In fact, convolution and the Fourier transform are related via
the convolution theorem, which is this: suppose f and g are functions with Fourier transforms F and G
respectively, written as f ↔ F and g ↔ G; then

f ∗ g ↔ F G     (3.16)

Otherwise said: the Fourier transform of the convolution of f and g is the product of the Fourier
transforms of the individual functions. Note the product of the functions is not matrix multiplication;
rather, corresponding (complex) values of the functions are multiplied.
So, if we want to convolve with some window we can, if we want, apply the Fourier transform to the
image and to the window (kernel), and take the inverse Fourier transform of their product. This may seem
a "long way around", and so it is, but it can often take less time than standard convolution.
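The convolution theorem is easy to check numerically. The sketch below is an illustration, not part of the notes: it compares a brute-force circular convolution with multiplication of discrete Fourier transforms. The discrete version of the theorem holds exactly for circular (wrap-around) convolution, which is why the indices wrap rather than being zero-padded.

import numpy as np

def circular_convolve(f, h):
    """Brute-force 2D circular convolution of two arrays of equal shape."""
    R, C = f.shape
    g = np.zeros((R, C))
    for y in range(R):
        for x in range(C):
            for v in range(R):
                for u in range(C):
                    g[y, x] += f[v, u] * h[(y - v) % R, (x - u) % C]
    return g

rng = np.random.default_rng(0)
f = rng.random((8, 8))
h = rng.random((8, 8))

spatial  = circular_convolve(f, h)
spectral = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h)))
print(np.allclose(spatial, spectral))   # True: f * h <-> F G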

3.4 Corner and feature detection


So-called Harris features are commonly used to detect interest points in images. A Harris corner uses the Hessian at a
pixel (Figure 3.11 shows detected Harris features in a photograph). The Hessian is a matrix of second order derivatives. If we think of ∂f/∂x as an
image and differentiate it, then we get a vector of second order partials. Similarly we get another vector
by differentiating ∂f/∂y. We call this matrix A, which we write as

$$
A = \begin{bmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{bmatrix}
\qquad (3.17)
$$

Now a symmetric matrix such as A can be written as a product like this:

$$
A = U L U^T
\qquad (3.18)
$$

in which U is orthonormal and L is diagonal; these hold the eigenvectors and eigenvalues of A, respectively.
But these are expensive to compute, and need not be computed, because

$$
M_c = L_1 L_2 - \kappa (L_1 + L_2)^2
\qquad (3.19)
$$
$$
= \det(A) - \kappa\, \mathrm{trace}^2(A)
\qquad (3.20)
$$

is a number that gives a numerical measure of the "cornerness" at the point. The determinant and trace
are easy to compute; in the specific case above

$$
\det(A) = f_{xx} f_{yy} - f_{xy} f_{yx}
\qquad (3.21)
$$
$$
\mathrm{trace}(A) = f_{xx} + f_{yy}
\qquad (3.22)
$$

The parameter κ is chosen by the user, usually between 0.04 and 0.15.
Harris corners are used in many applications, especially where matching is required. For example,
the three dimensional shape of objects can be reconstructed from a pair of pictures, if points in them
can be corresponded (matched). By using Harris corners the matching process is made a little easier.
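A sketch of the cornerness measure of Equations 3.19–3.22, following the Hessian-based formulation given above; many practical Harris implementations instead smooth outer products of first derivatives, so treat this as an illustration of the text's description rather than a standard detector. The function name is invented.

import numpy as np

def cornerness(f, kappa=0.06):
    """Corner measure Mc = det(A) - kappa * trace(A)^2 (Equations 3.19-3.20),
    with A built from second-order central differences as described above."""
    f = np.asarray(f, dtype=float)
    fy, fx = np.gradient(f)           # first-order partials (rows are y, columns are x)
    fyy, fyx = np.gradient(fy)        # second-order partials
    fxy, fxx = np.gradient(fx)
    det   = fxx * fyy - fxy * fyx
    trace = fxx + fyy
    return det - kappa * trace**2

# Corners can then be reported wherever the response is a local maximum above a threshold.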

Figure 3.11: A photograph with detected Harris features.

3.5 Histogram, Morphological and other transforms


So far in this chapter we have considered linear transforms, so called because convolution is a linear
operation:

$$
g(x, y) = \sum_{ij} f(x - i,\, y - j)\, h(i, j)
$$

can be written as an inner product of a vector of pixel values with a vector of kernel weights. The fact
that convolution is linear brings many advantages, the convolution theorem being amongst the most notable.
However, linear filtering has disadvantages too: intensity peaks and troughs move in their spatial
location as scale changes, for example. Additionally there are many useful operations that cannot be
properly described using convolution; thresholding is an obvious example:

$$
b(x, y) = \begin{cases} 1 & \text{if } f(x, y) > t \\ 0 & \text{otherwise} \end{cases}
$$
In this section we shall take a look at image transforms that cannot be computed using convolution,
specifically:

• Histogram Transforms

• Distance Transforms

• Morphological Transforms

We begin with histogram transforms.

3.5.1 Histogram Transforms


The histogram of a picture f (x, y) is just
$$
h(v) = \sum_{(x, y) \in f} \big[\, f(x, y) = v \,\big]
$$

that is, the number of pixels with gray level v, for all possible gray level values v. This is easily converted
to a probability:

$$
p(v) = \frac{h(v)}{\sum_v h(v)}
$$

Many algorithms exist to analyse histograms. For example, if there are two peaks in the histogram then
setting the threshold to lie at the minimum between them often yields good results. The number of
peaks can be used to estimate the number of colours in the image, so that multiple thresholds can be
set.
Other than thresholding, perhaps the simplest histogram transform (that is, an algorithm that
changes the histogram of an image) is normalisation. This stretches the gray levels in an image to lie
in a specific range, typically [0, 1]. If an image has minimum and maximum gray levels v_min and v_max,
the transform to [0, 1] is just

$$
v_{\rm out} = \frac{v_{\rm in} - v_{\rm min}}{v_{\rm max} - v_{\rm min}}
$$

This is an example of a transfer function.
Histogram equalisation is more complicated. It tries to modify the histogram of an image so that
all gray levels are equally used. The transfer function w(v) in this case is defined via an integration: all
the gray levels in an interval dv are transferred to an interval dw, and we know p_b(w) is constant, so
that

$$
p_a(v)\, dv = p_b(w)\, dw
$$

giving

$$
dw = \frac{p_a(v)}{p_b(w)}\, dv
$$

so the function w(v) must be differentiable. It can be shown this has the (approximate) discrete solution

$$
w(v) = (2^B - 1) \sum_{u=0}^{v} p(u)
$$

where B is the number of bits per pixel.
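A compact sketch of normalisation and equalisation for an 8-bit image (so 2^B − 1 = 255), assuming a NumPy array of integer gray levels; the function names are invented for this example.

import numpy as np

def normalise(f):
    """Stretch gray levels to the range [0, 1]."""
    f = f.astype(float)
    return (f - f.min()) / (f.max() - f.min())

def equalise(f, B=8):
    """Histogram equalisation via the cumulative distribution:
    w(v) = (2^B - 1) * sum_{u <= v} p(u)."""
    levels = 2**B
    h, _ = np.histogram(f, bins=levels, range=(0, levels))
    p = h / h.sum()
    w = np.round((levels - 1) * np.cumsum(p)).astype(np.uint8)   # transfer function
    return w[f]                                                  # apply as a look-up table

img = np.clip(np.random.normal(100, 15, (64, 64)), 0, 255).astype(np.uint8)
out = equalise(img)   # gray levels spread more evenly over 0...255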

3.5.2 Distance Transforms


It is often convenient to know the distance of a point (x, y) from an object, which is a collection of points
{(u, v)}. The points on an object, O, can be functionally defined (as on a straight line), or simply listed,
as is the case in a binary image in which

$$
f(x, y) = \begin{cases} 1 & \text{if } (x, y) \in O \\ 0 & \text{otherwise} \end{cases}
$$
The "distance" of a given point x = (x, y) to the object is just the distance to its nearest point
u = (u, v).
The generalised distance is given by a norm, which in general is

$$
d(\mathbf{x}, \mathbf{u}) = \Big( \sum_i |x_i - u_i|^n \Big)^{1/n}
$$

This is denoted L_n. We set n = 2 to get the Euclidean distance, which is the L_2 norm. We can set n to
other values to get other norms. L_1 is just the sum of the two sides of a right triangle, and is sometimes
called the Manhattan distance. The infinity norm, denoted L_∞, is the maximum of the two sides of the
right triangle.
Provided n ≥ 1, L_n is a metric, meaning that
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)
all hold true. It is often the case that some distance definition we make up fails to be a metric, usually
because the triangular inequality condition (the last one) fails.
Let us define the distance between a point x and an object O as

$$
d(\mathbf{x}, O) = \min_{\mathbf{u} \in O} d(\mathbf{x}, \mathbf{u})
$$

This needs the norm to be specified (by the programmer). The distance transform (given a norm) of an
image is

$$
g(\mathbf{x}) = d(\mathbf{x}, O)
$$

for all x in the image. Clearly this is 0 for all points inside the object.
The distance between two objects is now easy to find. Just take the distance transform of any one
of them, g1 , say. Then mask the distance transform with the other, f2 say. The distance between the
objects is just the smallest value that remains in the masked distance transform.
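A brute-force sketch of the distance transform under the L_2 norm, followed by the two-object recipe just described. This is illustrative only; practical implementations use fast two-pass chamfer algorithms rather than this exhaustive search, and the names are invented.

import numpy as np

def distance_transform(binary):
    """Distance of every pixel to the nearest object ('1') pixel, L2 norm."""
    ys, xs = np.nonzero(binary)                      # object points {(u, v)}
    obj = np.stack([ys, xs], axis=1).astype(float)
    out = np.zeros(binary.shape)
    for y in range(binary.shape[0]):
        for x in range(binary.shape[1]):
            d = np.sqrt(((obj - [y, x])**2).sum(axis=1))
            out[y, x] = d.min()                      # 0 for pixels inside the object
    return out

# Distance between two objects: transform one, mask with the other, take the minimum.
obj1 = np.zeros((32, 32), dtype=int); obj1[5:10, 5:10] = 1
obj2 = np.zeros((32, 32), dtype=int); obj2[20:25, 20:25] = 1
d12 = distance_transform(obj1)[obj2 == 1].min()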

3.5.3 Morphological Transforms


Morphological transforms arise from the study of shape, but appear to us as set-based operations. There
are many morphological operations; most involve passing a window over an image and performing some
computation at each position. This is rather like convolution, except the computation is non-linear. An
example would be to take the maximum value in the window; minimum values or median values can
also be used.
Often a structuring element is used, especially on binary images. A structuring element is then a
binary mask. The output could be determined by some computation using the intersection (for example)
of the structuring element with the image.
Or we can simply move the whole image up, down, left and right, say, and then take the union of all
these moved images. This is called dilation because it makes the object a little bigger all round. Taking
the intersection instead gives erosion, which makes the object a little smaller all round. The structuring element can be used
to specify the translations of the image, and so generalise this idea.
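The shift-and-combine view of dilation and erosion can be sketched as follows, using shifts of one pixel in the eight compass directions plus the unshifted image (equivalent to a 3 × 3 square structuring element). This is an illustration, not the notes' own code; note that np.roll wraps at the image border, which is fine away from the edges.

import numpy as np

def shifts(img):
    """All one-pixel translations of a binary image (including no shift)."""
    out = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out.append(np.roll(np.roll(img, dy, axis=0), dx, axis=1))
    return out

def dilate(img):
    """Union of the shifted images: the object grows by one pixel all round."""
    return np.logical_or.reduce(shifts(img))

def erode(img):
    """Intersection of the shifted images: the object shrinks by one pixel all round."""
    return np.logical_and.reduce(shifts(img))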
Morphological operations can be used to find the skeleton of an object: one progressively removes
pixels from the object's border without ever disconnecting the pixel set. The skeleton is a compact
representation of an object, but does suffer from noise; even so the skeleton is useful in recognition
tasks (e.g. character recognition).
Chapter 4

Cameras, stereopsis, and reconstruction

4.1 A single camera


In Section 2.2 we introduced simple models for perspective and affine cameras. Here we will develop a
more sophisticated model for a perspective camera. We begin with a geometric approach by considering
the intersection of a plane by a line: the plane represents a window and the line a ray of light that passes
through some real world point and a focal point.
Suppose the focal point of the camera is at f and the window plane passes through some point c
and is oriented by a unit normal w. Suppose further that two vectors lie in the window plane, u and
v. We suppose c is the origin of the basis defined by u, v, and w. This geometry is seen in Figure 4.1,
along with a real-world object point x and its image y on the window plane. The interpretation
of "right", "up", and "forward", given to the vectors u, v, and w respectively, is useful in helping us
imagine the scenario. The distances a and b in the diagram are the coordinates of the image as seen by
the camera, and it is our aim to compute them from the information given so far.

(Figure: the focus f, window centre c, basis vectors u (right), v (up), and w (forward), an object point x, and its image y at coordinates (a, b) on the window plane.)

Figure 4.1: The geometry of a single camera; all vectors and points are in some world frame.

The parametric equation of the line through a general point x and the focus f is p(t) = f + t(x − f).
A point q is in the window plane if (and only if) w^T(q − c) = 0. We are interested in the case when
p(t) lies in the window plane, and it is easy to show that t must have the value

$$
t = \frac{\mathbf{w}^T(\mathbf{c} - \mathbf{f})}{\mathbf{w}^T(\mathbf{x} - \mathbf{f})}
\qquad (4.1)
$$


It follows from substitution into the equation of the line that

$$
\mathbf{y} = \mathbf{f} + \frac{\mathbf{w}^T(\mathbf{c} - \mathbf{f})}{\mathbf{w}^T(\mathbf{x} - \mathbf{f})}\, (\mathbf{x} - \mathbf{f})
\qquad (4.2)
$$

We need only transform y from the world basis into the camera basis:

$$
\mathbf{z} = [\mathbf{u}\ \mathbf{v}\ \mathbf{w}]^T (\mathbf{y} - \mathbf{c})
\qquad (4.3)
$$

which upon substitution yields

$$
\mathbf{z} = [\mathbf{u}\ \mathbf{v}\ \mathbf{w}]^T \left( (\mathbf{f} - \mathbf{c}) + \frac{\mathbf{w}^T(\mathbf{c} - \mathbf{f})}{\mathbf{w}^T(\mathbf{x} - \mathbf{f})}\, (\mathbf{x} - \mathbf{f}) \right)
\qquad (4.4)
$$

We must have w^T(y − c) = 0, since by construction y lies in the window plane defined by c and w. Hence
the projected point is of the form z = [a, b, 0]^T, and we have computed the coordinates we seek as a and
b.
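A small numerical sketch of Equations 4.1–4.4 (all values are invented for illustration): place a camera, project a world point, and read off the image coordinates a and b.

import numpy as np

# An invented camera: focus f, window origin c, basis u (right), v (up), w (forward).
f = np.array([0.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 1.0])          # window one unit in front of the focus
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
w = np.array([0.0, 0.0, 1.0])

x = np.array([2.0, 4.0, 2.0])          # a world point

t = (w @ (c - f)) / (w @ (x - f))      # Equation 4.1
y = f + t * (x - f)                    # Equation 4.2: image point on the window plane
z = np.stack([u, v, w]) @ (y - c)      # Equation 4.3: coordinates in the camera basis

print(z)                               # [1. 2. 0.]  ->  a = 1, b = 2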
This geometric construction is perfectly general. It does not assume the basis vectors are orthogonal,
nor does it assume they are unit vectors. This allows some optical distortions to be modelled. The
camera can be placed anywhere. Object points that lie in the plane parallel to the window but which
runs through the focus cannot be projected. This is reflected here because w^T(x − f) = 0 for such points:
the projected images of such points lie at infinity.
We can put this projection into matrix form:
$$
P = \begin{bmatrix} R^T & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix}
\begin{bmatrix} I & -(\mathbf{c} - \mathbf{f}) \\ \mathbf{0}^T & 1 \end{bmatrix}
\begin{bmatrix} S & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix}
\begin{bmatrix} I & \mathbf{0} \\ \mathbf{w}^T & 0 \end{bmatrix}
\begin{bmatrix} I & -\mathbf{f} \\ \mathbf{0}^T & 1 \end{bmatrix}
\qquad (4.5)
$$

in which R = [u v w] is an affine transform matrix, I is the (3 × 3) identity, S = (w^T(c − f)) I, and
0 = [0, 0, 0]^T is the world origin.
These matrices are applied to a point from right to left, and they shift, project, scale, shift again,
and finally affine transform the point. The projection step uses a matrix which is singular and
of rank three. It follows that P is singular and of rank three. Multiplying these matrices we get

$$
P = \begin{bmatrix} R^T\big(S - (\mathbf{c} - \mathbf{f})\mathbf{w}^T\big) & R^T\big(-S\mathbf{f} + (\mathbf{c} - \mathbf{f})\mathbf{w}^T \mathbf{f}\big) \\ \mathbf{w}^T & -\mathbf{w}^T \mathbf{f} \end{bmatrix}
\qquad (4.6)
$$

It is easy to verify this matrix conforms to the projection in Equation 4.4. Given a homogeneous point
x* = [x, 1]^T we find

$$
P\mathbf{x}^* = \begin{bmatrix} R^T\big(S - (\mathbf{c} - \mathbf{f})\mathbf{w}^T\big) & R^T\big(-S\mathbf{f} + (\mathbf{c} - \mathbf{f})\mathbf{w}^T \mathbf{f}\big) \\ \mathbf{w}^T & -\mathbf{w}^T \mathbf{f} \end{bmatrix}
\begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}
\qquad (4.7)
$$
$$
= \begin{bmatrix} R^T S(\mathbf{x} - \mathbf{f}) - R^T(\mathbf{c} - \mathbf{f})\mathbf{w}^T(\mathbf{x} - \mathbf{f}) \\ \mathbf{w}^T(\mathbf{x} - \mathbf{f}) \end{bmatrix}
\qquad (4.8)
$$

Using the fact that S = (w^T(c − f)) I we can write the corresponding ordinary point as

$$
\mathbf{z}' = R^T \left( \frac{\mathbf{w}^T(\mathbf{c} - \mathbf{f})}{\mathbf{w}^T(\mathbf{x} - \mathbf{f})}\, (\mathbf{x} - \mathbf{f}) - (\mathbf{c} - \mathbf{f}) \right)
\qquad (4.9)
$$

which is identically the point in Equation 4.4.
Although this model is general, it is not necessarily particularly useful. In particular we want to
separate out the intrinsic and extrinsic parameters. The intrinsic parameters control focal length and
any distortions, while the extrinsic parameters locate and orient the camera in the world. The idea
behind separating these parameters is to build a simple model of a camera, which gives the intrinsic
parameters, and then move the camera into position to get the extrinsic parameters.
We consider intrinsic parameters first, for which we want a simple camera model. We set the focus at
the origin, f = [0, 0, 0]T , and look along the z-axis towards a window that is parallel to the xy-plane, so

w = [0, 0, 1]^T. The window is at a distance s from the focus, which is the focal length of the camera. The
origin of the window is c = (c_1, c_2, s)^T. The "right" and "up" vectors lie in the window plane, so they are
orthogonal to w but not necessarily to one another. These conditions give

$$
P_{\rm int} = \begin{bmatrix}
s u_1 & s u_2 & -(c_1 u_1 + c_2 u_2) & 0 \\
s v_1 & s v_2 & -(c_1 v_1 + c_2 v_2) & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\qquad (4.10)
$$

Hence we need seven numbers to specify this camera: s, u_1, u_2, v_1, v_2, c_1, c_2. The third row contributes nothing
to the projected point, so we can write the camera model as a (3 × 4) matrix:

$$
P_{\rm int} = \begin{bmatrix}
s u_1 & s u_2 & -(c_1 u_1 + c_2 u_2) & 0 \\
s v_1 & s v_2 & -(c_1 v_1 + c_2 v_2) & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\qquad (4.11)
$$
This is not the only possible camera model. Furthermore, if we make certain reasonable assumptions then
we can reduce the number of parameters from seven to five, as we show next.
We can make use of focal length, aspect ratio, skew, and centre of interest. The focal length scales
the projected image, as seen in the above matrix. The aspect ratio is the ratio of the width to height
of a pixel; pixels are not necessarily square. The skew of a camera deforms a rectangular pixel into a
parallelogram, and is caused by the x-axis and y-axis of the camera not being orthogonal. The centre
of interest allows for a shift of origin. Taken together these terms can be used to construct a transform
from the window plane into the frame buffer.
As Forsyth and Ponce [1] show (pages 29–30), suppose a pixel is α′ units wide and β′ units high. Pixels
appear to the camera to be of size α = sα′ and β = sβ′ respectively. Skew is encoded by the angle
θ, and the centre of interest by a_1 and a_2, giving

$$
P_{\rm int} = \begin{bmatrix}
\alpha & -\alpha\cot(\theta) & a_1 & 0 \\
0 & \beta/\sin(\theta) & a_2 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\qquad (4.12)
$$
It is common to drop the third row, which contributes nothing to the projection, to get

$$
P_{\rm int} = \begin{bmatrix}
\alpha & -\alpha\cot(\theta) & a_1 & 0 \\
0 & \beta/\sin(\theta) & a_2 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\qquad (4.13)
$$
$$
= [\,K\ \ \mathbf{0}\,]
\qquad (4.14)
$$

where K is the (3 × 3) matrix on the left, which maps a real-world point into camera coordinates.
Note we must divide by the z-coordinate to recover the ordinary coordinates, so we can write

$$
\mathbf{z} = \frac{1}{z}\, K \mathbf{x}
\qquad (4.15)
$$

It is not possible to recover x from z and K.
We can now locate and orient our camera P in the world. We need six parameters: three to locate
the focus and three to orient the direction of view. These are used to construct a matrix

$$
P_{\rm ext} = \begin{bmatrix} A & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix}
$$

which is right-multiplied with P_int to obtain a complete projection. Forsyth and Ponce [1] give this as

$$
P = P_{\rm int} P_{\rm ext}
\qquad (4.16)
$$
$$
= \begin{bmatrix}
\alpha \mathbf{r}_1^T - \alpha\cot(\theta)\, \mathbf{r}_2^T + a_1 \mathbf{r}_3^T & \alpha t_1 - \alpha\cot(\theta)\, t_2 + a_1 t_3 \\[0.5ex]
\dfrac{\beta}{\sin(\theta)}\, \mathbf{r}_2^T + a_2 \mathbf{r}_3^T & \dfrac{\beta}{\sin(\theta)}\, t_2 + a_2 t_3 \\[0.5ex]
\mathbf{r}_3^T & t_3
\end{bmatrix}
\qquad (4.17)
$$

in which the r_i^T are the rows of A, and t_i the components of t. The reader is directed to the characterisation
of projection matrices (pages 31–32), and the development of affine cameras (pages 32–35), in the same
text.

4.1.1 Camera calibration


For many computer vision tasks it is important to calibrate the camera, that is, to determine its internal
parameters. This requires an estimate of the projection matrix P using real world data. To do this it is
very convenient to use a calibration rig, which is just a prop of known size and shape; the inner corner
of a cube is often used.
Now we can write the general perspective transform as three rows

$$
P = \begin{bmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \mathbf{p}_3 \end{bmatrix}
\qquad (4.18)
$$

in which case an object point x is projected to


$$
u = \frac{\mathbf{p}_1 \mathbf{x}}{\mathbf{p}_3 \mathbf{x}}
\qquad (4.19)
$$
$$
v = \frac{\mathbf{p}_2 \mathbf{x}}{\mathbf{p}_3 \mathbf{x}}
\qquad (4.20)
$$

which can be re-arranged to read

$$
\mathbf{p}_1 \mathbf{x} - u\, \mathbf{p}_3 \mathbf{x} = 0
\qquad (4.21)
$$
$$
\mathbf{p}_2 \mathbf{x} - v\, \mathbf{p}_3 \mathbf{x} = 0
\qquad (4.22)
$$

This can be written in matrix form as

$$
\begin{bmatrix} \mathbf{x}^T & \mathbf{0}^T & -u\mathbf{x}^T \\ \mathbf{0}^T & \mathbf{x}^T & -v\mathbf{x}^T \end{bmatrix}
\begin{bmatrix} \mathbf{p}_1^T \\ \mathbf{p}_2^T \\ \mathbf{p}_3^T \end{bmatrix}
= \mathbf{0}
\qquad (4.23)
$$

which usefully separates the unknowns into a column vector of 12 elements. The above is for a single
object point x and its image [u, v], which clearly is not sufficient to determine the twelve unknowns of
the perspective transform matrix (now written as a column vector). If we use at least six object points
we can estimate P. Each new object point and its image generate a new pair of rows in the matrix
above, giving

$$
\begin{bmatrix}
\mathbf{x}_1^T & \mathbf{0}^T & -u_1\mathbf{x}_1^T \\
\mathbf{0}^T & \mathbf{x}_1^T & -v_1\mathbf{x}_1^T \\
\mathbf{x}_2^T & \mathbf{0}^T & -u_2\mathbf{x}_2^T \\
\mathbf{0}^T & \mathbf{x}_2^T & -v_2\mathbf{x}_2^T \\
\vdots & \vdots & \vdots \\
\mathbf{x}_n^T & \mathbf{0}^T & -u_n\mathbf{x}_n^T \\
\mathbf{0}^T & \mathbf{x}_n^T & -v_n\mathbf{x}_n^T
\end{bmatrix}
\begin{bmatrix} \mathbf{p}_1^T \\ \mathbf{p}_2^T \\ \mathbf{p}_3^T \end{bmatrix}
= \mathbf{0}
\qquad (4.24)
$$

where n is the number of object point / image point pairs we have used. We write the final system as

XP = 0 (4.25)

This is a homogeneous linear system. Provided n >= 6 we can find a solution in the least squares sense.
The easy way to do this is to use singular value decomposition (SVD). Any matrix X can be written as
a product:

\[
X = U S V^T \tag{4.26}
\]

in which U and V are orthonormal, their columns are called the left and right singular vectors, and S is diagonal, comprising the singular values. The ith singular value s_ii is associated with the ith left singular vector and the ith right singular vector. The size of s_ii indicates the magnitude of its corresponding vector, and in particular

\[
XV = US \tag{4.27}
\]
\[
X[\mathbf{v}_1\; \mathbf{v}_2 \ldots \mathbf{v}_n] = [s_{11}\mathbf{u}_1\; s_{22}\mathbf{u}_2 \ldots s_{nn}\mathbf{u}_n] \tag{4.28}
\]

which means Xv_i = s_ii u_i. Clearly, s_ii denotes the magnitude of the product, and if s_ii = 0 we will have found a (non-trivial) solution to XP = 0. Usually, though, no singular value is zero; in that case we find the smallest s_ii and use the corresponding v_i as a solution. This is, in fact, the optimal solution. It corresponds to fitting a hyper-plane through the multidimensional points that are the rows of X. If these points lie exactly on a hyper-plane we obtain an exact solution, so s_ii = 0; otherwise we minimise the sum of squared distances from each point to the fitted plane, that is, we optimise in the least squares sense.
Forsyth and Ponce [1] discuss least-squares fitting (pages 39–42). Press et al [4] discuss SVD and some of its uses, including but not limited to least-squares fitting. Golub and van Loan [2] also discuss SVD, and their book is well worth reading. These issues are also taken up in Section A.1 of these notes.
Having obtained a P we need to separate out the intrinsic and extrinsic parameters. In doing so we remember that, because we have solved a homogeneous system, we can know P only up to a scale factor, ρ. Now, writing each row of P as [q^T z], we have
\[
\rho \begin{bmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \mathbf{p}_3 \end{bmatrix}
= \rho \begin{bmatrix} \mathbf{q}_1^T & z_1 \\ \mathbf{q}_2^T & z_2 \\ \mathbf{q}_3^T & z_3 \end{bmatrix}
= \begin{bmatrix}
\alpha \mathbf{r}_1^T - \alpha\cot(\theta)\mathbf{r}_2^T + a_1\mathbf{r}_3^T & \alpha t_1 - \alpha\cot(\theta)t_2 + a_1 t_3 \\
\frac{\beta}{\sin(\theta)}\mathbf{r}_2^T + a_2\mathbf{r}_3^T & \frac{\beta}{\sin(\theta)} t_2 + a_2 t_3 \\
\mathbf{r}_3^T & t_3
\end{bmatrix}
\]

And we see
\[
\rho = \frac{\pm 1}{|\mathbf{q}_3|} \tag{4.29}
\]
We obtain the intrinsic parameters as
We obtain the intrinsic parameters as

\[
a_1 = \rho^2 \mathbf{q}_1^T \mathbf{q}_3 \tag{4.30}
\]
\[
a_2 = \rho^2 \mathbf{q}_2^T \mathbf{q}_3 \tag{4.31}
\]
\[
\cos(\theta) = -\frac{(\mathbf{q}_1 \times \mathbf{q}_3)^T (\mathbf{q}_2 \times \mathbf{q}_3)}{|\mathbf{q}_1 \times \mathbf{q}_3|\,|\mathbf{q}_2 \times \mathbf{q}_3|} \tag{4.32}
\]
\[
\alpha = \rho^2 |\mathbf{q}_1 \times \mathbf{q}_3| \sin(\theta) \tag{4.33}
\]
\[
\beta = \rho^2 |\mathbf{q}_2 \times \mathbf{q}_3| \sin(\theta) \tag{4.34}
\]

The extrinsic parameters are

\[
\mathbf{r}_3 = \rho \mathbf{q}_3 \tag{4.35}
\]
\[
\mathbf{r}_1 = \frac{\mathbf{q}_2 \times \mathbf{q}_3}{|\mathbf{q}_2 \times \mathbf{q}_3|} \tag{4.36}
\]
\[
\mathbf{r}_2 = \mathbf{r}_3 \times \mathbf{r}_1 \tag{4.37}
\]
\[
\mathbf{t} = \rho K^{-1} \mathbf{z} \tag{4.38}
\]

where z = [z1 z2 z3]^T and K is the newly acquired (3 × 3) matrix that characterises the intrinsic parameterisation of the camera.
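The decomposition above is mechanical enough to code directly. The sketch below implements Equations 4.29–4.38 with numpy; choosing the positive sign for ρ and inverting the intrinsic (3 × 3) block K in the last step are assumptions of this sketch, and the function name is illustrative.

import numpy as np

def decompose_projection(P):
    """Split a 3x4 projection matrix into intrinsic and extrinsic
    parameters, following Equations 4.29-4.38.  Returns (K, R, t)."""
    Q = P[:, :3]                     # rows q1^T, q2^T, q3^T
    z = P[:, 3]
    q1, q2, q3 = Q
    rho = 1.0 / np.linalg.norm(q3)   # Eq. 4.29 (positive sign chosen)

    a1 = rho**2 * (q1 @ q3)          # Eq. 4.30
    a2 = rho**2 * (q2 @ q3)          # Eq. 4.31
    c13 = np.cross(q1, q3)
    c23 = np.cross(q2, q3)
    cos_theta = -(c13 @ c23) / (np.linalg.norm(c13) * np.linalg.norm(c23))  # Eq. 4.32
    sin_theta = np.sqrt(1.0 - cos_theta**2)
    alpha = rho**2 * np.linalg.norm(c13) * sin_theta    # Eq. 4.33
    beta = rho**2 * np.linalg.norm(c23) * sin_theta     # Eq. 4.34

    r3 = rho * q3                    # Eq. 4.35
    r1 = c23 / np.linalg.norm(c23)   # Eq. 4.36
    r2 = np.cross(r3, r1)            # Eq. 4.37
    R = np.vstack([r1, r2, r3])

    K = np.array([[alpha, -alpha * cos_theta / sin_theta, a1],
                  [0.0, beta / sin_theta, a2],
                  [0.0, 0.0, 1.0]])
    t = rho * np.linalg.solve(K, z)  # Eq. 4.38, using the 3x3 intrinsic block
    return K, R, t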

4.2 Two cameras


We now consider the case when there are two cameras present in a scene. Such a set-up raises the possibility of reconstructing the objects in the scene in three dimensions. For this, and indeed for other

applications, we want to match image points in one camera with image points in the other; that is, we want to identify points in each image that correspond to the same real-world point. If all correspondences are known (that is, we have a solution to the correspondence problem) and we know all details of our cameras, then a reconstruction is feasible. This situation is shown in Figure 4.2.


Figure 4.2: Given cameras in 3D space, and corresponding points also in 3D, we simply intersect the
light rays to recover the object point in 3D. This is the basic idea underlying reconstruction.

This situation allows us to make a “Euclidean reconstruction” in which the reconstructed model has
correct length ratios and angles. If we know less — if our cameras are not calibrated, say — we can still
make a reconstruction, but now may only get length ratios correct, a so-called “affine reconstruction”.
Weaker still is a “perspective reconstruction” in which our models have neither correct length ratios
nor correct angles, but they are connected up in the right kind of way. It is easy to see why this is so. Suppose we did not know the focal length of the camera. We can still make a reconstruction by presuming a value for it, but the model we get depends on the presumption. Another way to think about it is to imagine photographing an object that can change shape in a devious way: as you change focal length, so it changes shape to look like a perfect cube.
Even if we know all there is to know about our cameras, both intrinsic and extrinsic parameters, we still have to solve the correspondence problem. We will consider this later, but for now we will study the geometry that underlies the two-camera case, the so-called epipolar geometry. This will help us understand the different kinds of reconstruction available to us.

4.2.1 Epipolar geometry


Epipolar geometry is the name given to the necessary geometric arrangements that exist in the two-camera case; Figure 4.3 illustrates.
Figure 4.3: Epipolar geometry: the two foci, the object, and its two images must all lie in the same
plane.

The epipolar plane is defined by the two foci of the cameras and the object point, and is of infinite extent. Each window plane (these too are of infinite extent, the window being a small region within the plane) intersects the epipolar plane to create an epipolar line. The image of the point in the window must lie

on the epipolar line. The line joining the foci is the baseline. This line intersects each window plane to
create an epipole.
The epipolar constraints imposed by the geometry mean this: given the focal point of each camera and the image of some object point in one camera, the corresponding image in the other camera must lie on the epipolar line in the other camera's window. We may not be able to say exactly where (this depends on the distance to the object from the "known" camera), but we can say it definitely does lie on the corresponding epipolar line. This is shown in Figure 4.4.
Note that a different object point that is not on the "current" epipolar plane generates a whole new epipolar plane, and these planes intersect in a line, which is the baseline. In fact, we can think of the baseline as an axis about which the epipolar planes rotate. Furthermore, the set of epipolar lines in a window will rotate about that window's epipole. This too is shown in Figure 4.4.
[Figure 4.4 annotations: if an object moves along the line from a focus to its image, it traces out the epipolar line in the other camera's window; so given just an image in one camera we cannot tell depth, but we do know the corresponding image lies on the corresponding epipolar line. As an object moves off its "current" epipolar plane it generates a "pencil" of epipolar planes that rotate about the baseline; as seen from a camera, this makes a set of epipolar lines that rotate about the epipole.]

Figure 4.4: The effect of different points in the same line of view (left), and different points at arbitrary
locations in 3D (right).

The epipolar geometry imposes constraints: the baseline vector $\vec{f_1 f_2}$, and the lines to each of the images, $\vec{f_1 y_1}$ and $\vec{f_2 y_2}$, are coplanar (they lie in the epipolar plane, in fact). The normal to the epipolar plane is $\vec{f_1 f_2} \times \vec{f_2 y_2}$, and this must be orthogonal to $\vec{f_1 y_1}$, so
\[
\vec{f_1 y_1}^{\,T} (\vec{f_1 f_2} \times \vec{f_2 y_2}) = 0
\]

which is independent of any coordinate system. This relation is put to use in constructing both the
essential matrix and the fundamental matrix.

4.2.2 Essential Matrix and Fundamental Matrix


The essential matrix is the name given to a matrix that enforces the epipolar constraint in the case of calibrated cameras.
In practice we observe points in images: that is, we can measure the coordinates of a point in an
image, which are the z of Equation 4.15, for example. We would like the coordinates of this point in the
’real’ world. We remind ourselves that
\[
\mathbf{z} = \frac{1}{s} K \mathbf{y}
\]
where K is the (3 × 3) calibration matrix, and y is where the light from an object x to the focus pierces
the window plane, at a distance s from the camera, and z is the image point as seen by the camera. So, if
we are to place this image point on the window plane we need the calibration parameters of the camera,
that is we need K and the focal length s. How to determine the calibration matrix for a real camera has
already been discussed in Subsection 4.1.1, but this does not yield s. Fortunately this does not matter, because we can always scale the camera to an equivalent one (it takes exactly the same photos) which has unit focal length. Hence we can write

\[
\mathbf{y}' = K^{-1} \mathbf{z}
\]

to obtain the point in 3D that is equivalent to the image point in the camera.
We must bear in mind that this newly computed y′ is in a reference frame rigidly attached to the camera. This means that y1′ is as seen from camera 1: it is just scaled, sheared, and shifted when compared to z. Similar remarks apply to y2′. We must transform y1′ and y2′ into a common coordinate system before the epipolar constraint is used. We can choose any coordinate system we like; that attached to camera 1 is convenient. In this case we set y1 = y1′ and y2 = Ry2′ − t. The translation is just the distance between the foci, which defines the baseline, so t = f2 − f1. The rotation, R, will bring the viewing direction of camera 2 into line with that of camera 1: it maps from the camera 2 system into the camera 1 system.
Now, putting all this together, the epipolar constraint can be written:

\[
\mathbf{y}_1^T (\mathbf{t} \times \mathbf{y}_2) = 0
\]
\[
\mathbf{y}_1^T (\mathbf{t} \times (R\mathbf{y}_2' - \mathbf{t})) = 0 \quad \text{after substitution to get local coordinates}
\]
\[
\mathbf{y}_1^T (\mathbf{t} \times R\mathbf{y}_2') = 0 \quad \text{because } \mathbf{t} \times \mathbf{t} = 0
\]
\[
\mathbf{z}_1^T K_1^{-T} (\mathbf{t} \times R K_2^{-1} \mathbf{z}_2) = 0
\]

We would like this in matrix form. We can achieve our aim by noticing that any cross-product can
be written as a skew-symmetric matrix; for example x × y
  
\[
\mathbf{x} \times \mathbf{y} = \begin{bmatrix} 0 & -x_3 & x_2 \\ x_3 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
\]

the matrix used in the cross-product is often denoted [x× ]. We take advantage of this to write the
epipolar constraint as

\[
\mathbf{z}_1^T K_1^{-T} E K_2^{-1} \mathbf{z}_2 = 0 \tag{4.39}
\]

in which E = [t× ]R is the essential matrix. If the calibration matrices are unknown we write

\[
\mathbf{z}_1^T F \mathbf{z}_2 = 0 \tag{4.40}
\]
where F = K_1^{-T} E K_2^{-1} is the fundamental matrix.
Both equations can be given the same geometrical interpretation; we use the fundamental matrix to illustrate. The equation of a line in the plane is ax + by + c = 0, which can be written as [x, y, 1][a, b, c]^T = 0. If we set [a, b, c]^T = Fz_2 then we see that Fz_2 contains line parameters a, b, and c. The line is, in fact, the epipolar line corresponding to z_2. Similarly F^T z_1 is the epipolar line in the window of the second camera.
Notice that each of the epipoles makes a degenerate line in the other camera (i.e. they appear to be a point). To see this we observe F^T e_2 = R[t_×]e_2 = 0, since t and e_2 are parallel. Similarly Fe_1 = 0. Since e ≠ 0 it follows that e is in the null space of F, and hence F must be singular.
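The geometric reading of F is easy to use in code. Given F and a homogeneous image point in the second camera, Fz2 is the epipolar line in the first image, and the residual z1^T F z2 measures how badly a putative match violates Equation 4.40. A minimal sketch, with illustrative function names:

import numpy as np

def epipolar_line(F, z2):
    """Given the fundamental matrix F and a homogeneous point z2 = [u', v', 1]
    in the second image, return the epipolar line [a, b, c] in the first
    image, so that a*u + b*v + c = 0 for any matching point [u, v]."""
    return F @ z2

def epipolar_residual(F, z1, z2):
    """How far a candidate match is from satisfying z1^T F z2 = 0."""
    return float(z1 @ F @ z2)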

4.2.3 Determining scene geometry


It is possible to determine the fundamental matrix from a pair of images, provided at least eight points can be corresponded. This process is known as weak calibration. Let the ith pair of corresponding points be [u_i, v_i, 1]^T and [u_i′, v_i′, 1]^T. The epipolar constraint is expressed through the fundamental matrix as:
  ′ 
\[
[u_i \;\; v_i \;\; 1] \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix} \begin{bmatrix} u_i' \\ v_i' \\ 1 \end{bmatrix} = 0
\]

we want to determine the fjk , and so are motivated to write the above with these components as a
vector:
 
\[
[u_i u_i' \;\; u_i v_i' \;\; u_i \;\; v_i u_i' \;\; v_i v_i' \;\; v_i \;\; u_i' \;\; v_i' \;\; 1]
\begin{bmatrix} f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \\ f_{33} \end{bmatrix} = 0
\]

Now, this is a homogeneous equation, and as a consequence the values of fjk can be scaled arbitrarily.
Therefore, choosing a scale in which f33 = 1 is a perfectly acceptable solution. This gives
 
\[
[u_i u_i' \;\; u_i v_i' \;\; u_i \;\; v_i u_i' \;\; v_i v_i' \;\; v_i \;\; u_i' \;\; v_i']
\begin{bmatrix} f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \end{bmatrix} = -1
\]

This allows us to express the epipolar constraint for exactly eight point correspondences as

\[
\begin{bmatrix}
u_1 u_1' & u_1 v_1' & u_1 & v_1 u_1' & v_1 v_1' & v_1 & u_1' & v_1' \\
u_2 u_2' & u_2 v_2' & u_2 & v_2 u_2' & v_2 v_2' & v_2 & u_2' & v_2' \\
u_3 u_3' & u_3 v_3' & u_3 & v_3 u_3' & v_3 v_3' & v_3 & u_3' & v_3' \\
u_4 u_4' & u_4 v_4' & u_4 & v_4 u_4' & v_4 v_4' & v_4 & u_4' & v_4' \\
u_5 u_5' & u_5 v_5' & u_5 & v_5 u_5' & v_5 v_5' & v_5 & u_5' & v_5' \\
u_6 u_6' & u_6 v_6' & u_6 & v_6 u_6' & v_6 v_6' & v_6 & u_6' & v_6' \\
u_7 u_7' & u_7 v_7' & u_7 & v_7 u_7' & v_7 v_7' & v_7 & u_7' & v_7' \\
u_8 u_8' & u_8 v_8' & u_8 & v_8 u_8' & v_8 v_8' & v_8 & u_8' & v_8'
\end{bmatrix}
\begin{bmatrix} f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \end{bmatrix}
= - \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \tag{4.41}
\]

which is not homogeneous and can be written succinctly as Uf = −1, hence f = −U^{-1} 1, provided U is invertible (which is the case unless the eight points have a particular geometric relation). This is a non-homogeneous set of equations, and when it is used to estimate the fundamental matrix the algorithm is called the eight-point algorithm.
The eight-point algorithm has been criticised because it tends not to be very accurate. Forsyth and Ponce [1] (pages 219–221) discuss two common alternatives. A third, very useful alternative is provided by Torr and Fitzgibbon [5].
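For completeness, here is a minimal numpy sketch of the eight-point algorithm as set out in Equation 4.41. It takes exactly eight correspondences as (8, 2) arrays and fixes f33 = 1; the function name is an assumption of this sketch, and, as noted above, a practical implementation would normalise the coordinates and prefer one of the cited alternatives.

import numpy as np

def eight_point(pts1, pts2):
    """Estimate the fundamental matrix with the eight-point algorithm
    (Equation 4.41).  pts1 and pts2 are (8, 2) arrays of corresponding
    image points [u, v] and [u', v'] respectively; f33 is fixed to 1."""
    U = np.zeros((8, 8))
    for i, ((u, v), (up, vp)) in enumerate(zip(pts1, pts2)):
        U[i] = [u * up, u * vp, u, v * up, v * vp, v, up, vp]
    f = np.linalg.solve(U, -np.ones(8))      # f = -U^{-1} 1
    return np.append(f, 1.0).reshape(3, 3)   # put f33 = 1 back in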

4.3 Multiple Images


The geometry associated with more than two images is an advanced topic that we will omit from this
course.

4.4 Matching Points Across Images


Estimating the fundamental matrix depends on being able to match points in two images. In fact, matching points is a basic operation that supports many other applications. There are several ways to match points, and user-interactive systems must not be overlooked, but we will look in detail at just one automatic method.

The method we will study relies on being able to detect features in images. The features are assumed to exist at points (so corners are useful example features), and we will try to match features in one image with those in another. This method belongs to the landmark class of matching methods, which contrasts with the correlation class, where one image is treated as a template that is to be transformed onto the other. Bear in mind that no matter what method is used, the objective is always to produce a mapping between the images so that, given a point in one image, we can use the mapping to locate its corresponding point in the other.
Typically the Harris detector is used to find features (see Section 3.4 for details). The Harris detector can and will detect features where there are none (false positives) and fail to detect features where they exist (false negatives). The consequence of this for matching features across images is that a "feature" in one image may not exist in the other. Furthermore, it is easy to make false matches.
We can make progress by assuming the cameras that acquired the images are separated by a small baseline and look in more-or-less the same direction: that is, we assume a stereo pair. This means that a real feature in one image will not have moved very far in the second image. Another way to say this is that the pictures look pretty much alike. The advantage of this is that if we pick a feature in image A, then we need only look in a small region of image B to find its corresponding point. Suppose a feature is at [x, y]^T in image A; we typically limit the search to an 11 × 11 window centred on [x, y]^T in image B. The distance between the matching points is called the disparity, and can be used as a basis for three-dimensional reconstruction.
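A minimal sketch of this window-limited matching, assuming Harris-style corner locations and grey-level numpy images are already available. The function name, the SSD patch comparison, and the particular window sizes are illustrative choices of this example, not prescribed by the notes.

import numpy as np

def match_features(feats_a, feats_b, image_a, image_b, search_radius=5, patch=5):
    """Match corner features between a stereo pair by searching a small
    window around each feature and comparing grey-level patches (SSD).
    feats_a, feats_b : lists of (x, y) feature locations.
    image_a, image_b : 2D grey-level arrays."""
    matches = []
    size = 2 * patch + 1
    for (xa, ya) in feats_a:
        pa = image_a[ya - patch:ya + patch + 1, xa - patch:xa + patch + 1]
        if pa.shape != (size, size):            # feature too close to the border
            continue
        best, best_ssd = None, np.inf
        for (xb, yb) in feats_b:
            # stereo assumption: the match cannot have moved far
            if abs(xb - xa) > search_radius or abs(yb - ya) > search_radius:
                continue
            pb = image_b[yb - patch:yb + patch + 1, xb - patch:xb + patch + 1]
            if pb.shape != pa.shape:
                continue
            ssd = np.sum((pa - pb) ** 2)
            if ssd < best_ssd:
                best, best_ssd = (xb, yb), ssd
        if best is not None:
            matches.append(((xa, ya), best))
    return matches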
The above matching process can and will produce false matches, so we need a way to decide which matches are good and which are bad. We use an iterative process called RANSAC (random sample consensus), which has the following algorithmic form:
1. for a pre-specified number of iterations
(a) randomly choose four matches assumed to be good,
(b) use these matches to compute a mapping between the images,
(c) measure the quality of the mapping
2. use the highest quality mapping
We already know how to pick out features in images. Step 1a says just choose 4 feature pairs, at
random. In practice this is not quite random — the stereo assumption limits the number of possible
choices to a manageable number. Four matches of the form (ai , bi ), for i = 1 . . . 4, are enough to compute
an homography between the images. An homography is a map (transform) that allows one image to be
translated, scaled, rotated, skewed, and subject to perspective-like distortions. We need homogeneous
coordinates to apply a homography, and so write
    
\[
\begin{bmatrix} b_{1i} \\ b_{2i} \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} a_{1i} \\ a_{2i} \\ 1 \end{bmatrix}
\]
The upper-left (2 × 2) matrix will shear, rotate, and scale. The column vector [h13 h23]^T is a translation. The bottom row vector [h31 h32] introduces a perspective-like distortion after the homogeneous point is converted to an ordinary point. If this vector is zero, then h33 will scale (in inverse fashion).
Under this homography the point bi can be computed from the ai as
\[
b_{1i} = \frac{h_{11} a_{1i} + h_{12} a_{2i} + h_{13}}{h_{31} a_{1i} + h_{32} a_{2i} + 1}
\]
\[
b_{2i} = \frac{h_{21} a_{1i} + h_{22} a_{2i} + h_{23}}{h_{31} a_{1i} + h_{32} a_{2i} + 1}
\]
and these can be re-arranged to give
\[
(h_{11} a_{1i} + h_{12} a_{2i} + h_{13}) - b_{1i}(h_{31} a_{1i} + h_{32} a_{2i} + 1) = 0
\]
\[
(h_{21} a_{1i} + h_{22} a_{2i} + h_{23}) - b_{2i}(h_{31} a_{1i} + h_{32} a_{2i} + 1) = 0
\]

from there we can separate out the unknowns, the hjk , into a column vector and so write
 
\[
\begin{bmatrix}
a_{1i} & a_{2i} & 1 & 0 & 0 & 0 & -b_{1i} a_{1i} & -b_{1i} a_{2i} & -b_{1i} \\
0 & 0 & 0 & a_{1i} & a_{2i} & 1 & -b_{2i} a_{1i} & -b_{2i} a_{2i} & -b_{2i}
\end{bmatrix}
\begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \\ h_{33} \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]

Given we have i = 1 . . . 4 this represents a small part of the larger system:
\[
\begin{bmatrix}
a_{11} & a_{21} & 1 & 0 & 0 & 0 & -b_{11} a_{11} & -b_{11} a_{21} & -b_{11} \\
0 & 0 & 0 & a_{11} & a_{21} & 1 & -b_{21} a_{11} & -b_{21} a_{21} & -b_{21} \\
a_{12} & a_{22} & 1 & 0 & 0 & 0 & -b_{12} a_{12} & -b_{12} a_{22} & -b_{12} \\
0 & 0 & 0 & a_{12} & a_{22} & 1 & -b_{22} a_{12} & -b_{22} a_{22} & -b_{22} \\
a_{13} & a_{23} & 1 & 0 & 0 & 0 & -b_{13} a_{13} & -b_{13} a_{23} & -b_{13} \\
0 & 0 & 0 & a_{13} & a_{23} & 1 & -b_{23} a_{13} & -b_{23} a_{23} & -b_{23} \\
a_{14} & a_{24} & 1 & 0 & 0 & 0 & -b_{14} a_{14} & -b_{14} a_{24} & -b_{14} \\
0 & 0 & 0 & a_{14} & a_{24} & 1 & -b_{24} a_{14} & -b_{24} a_{24} & -b_{24}
\end{bmatrix}
\begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \\ h_{33} \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \tag{4.42}
\]

or, more succinctly, Ah = 0. The right singular vector of A with the smallest singular value is the best solution in the least-squares sense, and this can easily be scaled so that h33 = 1.
Now that we have a mapping we have to measure its quality. For this we apply the mapping to every feature in image A and find which feature in image B it ends up closest to. That is, for each point a we measure
\[
d(\mathbf{a}; H) = \min_{\mathbf{b} \in B} |\mathbf{b} - H\mathbf{a}|
\]
A quick way to do this is to use a distance transform (see the earlier section on distance transforms). The quality of the mapping is then taken to be
\[
D(H) = \sum_{\mathbf{a} \in A} d(\mathbf{a}; H) \tag{4.43}
\]

Clearly, we choose the H with the smallest measure.


In practice a better set of matchings can be produced by using the inverse homography to compare points in B with those in A.
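Putting the pieces together, here is a sketch of the RANSAC loop, with the homography fitted by the smallest-singular-vector method above and the quality measure of Equation 4.43 computed by brute force rather than with a distance transform. Function names, the iteration count, and the random seed are assumptions of this example.

import numpy as np

def fit_homography(pairs):
    """Direct linear solution for a homography from four (a, b) point pairs,
    as in Equation 4.42; the smallest right singular vector is used."""
    A = []
    for (a1, a2), (b1, b2) in pairs:
        A.append([a1, a2, 1, 0, 0, 0, -b1 * a1, -b1 * a2, -b1])
        A.append([0, 0, 0, a1, a2, 1, -b2 * a1, -b2 * a2, -b2])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                        # scale so that h33 = 1

def apply_homography(H, p):
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def ransac_homography(matches, n_iter=500, rng=np.random.default_rng(0)):
    """RANSAC over putative matches [((a1,a2),(b1,b2)), ...]: repeatedly fit a
    homography to four random matches and keep the one with the lowest total
    transfer error (Equation 4.43).  A sketch only."""
    best_H, best_score = None, np.inf
    a_pts = [m[0] for m in matches]
    b_pts = np.array([m[1] for m in matches], dtype=float)
    for _ in range(n_iter):
        sample = [matches[i] for i in rng.choice(len(matches), 4, replace=False)]
        H = fit_homography(sample)
        # quality: sum over features in A of the distance to the nearest feature in B
        score = 0.0
        for a in a_pts:
            pa = apply_homography(H, a)
            score += np.min(np.linalg.norm(b_pts - pa, axis=1))
        if score < best_score:
            best_H, best_score = H, score
    return best_H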

4.5 Many-image Applications


The techniques mentioned above allow for a wide range of applications; here we outline two of them.

4.5.1 Mosaicing
A mosaic is a single image that has been created by "stitching" together many images. To mosaic a pair of images we can assume that the perspective images are close approximations to those which would be obtained from an affine camera; this means objects are far from the camera and/or the scene depth is small compared to the camera-scene distance.
We can now compute a homography that carries one image into another, and use texture-mapping techniques from computer graphics to generate a final image. Some care must be taken where the images overlap, especially since the colours in the images can vary quite a lot; even the colour of the same object can vary. This can be taken care of using luminance compensation techniques, which we will not discuss.


Figure 4.5: Reconstruction of a 3D point; in practice the rays will not quite intersect.

Mosaicing two images is quite easy, even routine. Mosaicing several images is rather more difficult. This is because every image must be brought into a common reference frame. If we choose to form homographies between successive (in time, say) image pairs HAB, HBC, etc., then errors accumulate, so the discrepancy between the directly estimated HAC and the composition HAB HBC can be large. This is especially noticeable if the camera loops around, so that the first and last images overlap a lot. In fact, if the first and last images were somehow arranged to be identical then we could measure the average accumulated error (how?).
Methods do exist for coping with N images simultaneously, rather than two at a time, but we will
not study these.

4.5.2 Reconstruction
Reconstruction in three dimensions is a very important application. The geometric way to reconstruct is to compute the intersection of two rays, one for each camera, each defined by the focus and the image point. In practice, the rays will usually not intersect, because of noise, rounding errors, and so on. It is therefore better to choose the point mid-way between the rays. This point lies on the shortest possible line segment joining the two rays, as seen in Figure 4.5. Computing this point is left as an exercise.
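For those who want to check their answer to the exercise, the following numpy sketch computes the midpoint of the shortest segment between the two rays; parameterising each ray by a focus and a direction vector is an assumption of this sketch, and the function name is illustrative.

import numpy as np

def triangulate_midpoint(f1, d1, f2, d2):
    """Reconstruct a 3D point as the midpoint of the shortest segment
    between two viewing rays f1 + s*d1 and f2 + t*d2 (Figure 4.5).
    f1, f2 are the camera foci; d1, d2 the ray directions through the
    image points."""
    r = f2 - f1
    # Setting the derivative of |f1 + s*d1 - f2 - t*d2|^2 to zero gives a
    # 2x2 linear system in s and t.
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([r @ d1, r @ d2])
    s, t = np.linalg.solve(A, b)          # fails only if the rays are parallel
    p1 = f1 + s * d1                      # closest point on ray 1
    p2 = f2 + t * d2                      # closest point on ray 2
    return 0.5 * (p1 + p2)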
Chapter 5

Segmentation

To segment an image is to break it into parts. Ideally the parts have semantic meaning of some kind. Figure 5.1 shows an example of an image that has been segmented by hand. It is of course true that different people would probably choose a different segmentation of the picture, but intuition suggests that hand-made segmentations are plausible; it would be an unusual person who associated a cow with background foliage. The aim of segmentation algorithms is to segment images in a plausible way.

Figure 5.1: An image segmented by hand.

Segmentation algorithms rely on coherence: they assume that pixels in a segment all have more-or-less the same property. Colour is the easiest property to use, and it is no surprise to learn that many segmentation algorithms are based on colour coherence. Clearly, colour coherence is useless in textured areas such as leaves, hair, patterned carpets or clothing, and so on; but if we put a window around each pixel then coherence can be taken to mean "windows look more-or-less the same".
It is no surprise to learn that algorithms for automatic segmentation do not work too well; even the most sophisticated ones tend to produce implausible results, and some kind of human intervention is often required in most practical applications. However, good segmentations can assist users a great deal, and may be of direct use in some particular applications.

5.0.3 Simple segmentation


A very simple way to segment an image is to say that a set of pixels makes a segment if the set is connected and the colour difference between any two pixels in the segment is no greater than some predefined threshold, t. That is, if p(x, y) and p(u, v) are pixels such that
\[
\| p(x, y) - p(u, v) \| \le t
\]
and there exists a path of pixels, made from neighbours, from (x, y) to (u, v) that exhibit the same property, then (x, y) and (u, v) both belong to some segment S.


An algorithm enforcing this condition appears in Figure 5.2, and a result using it in Figure 5.3. The algorithm is not only inefficient in time but also over-segments, meaning that it produces too many segments to be plausible. Variations, such as specifying only that neighbouring pixels differ in colour by no more than a threshold, suffer along similar lines. A reasonable response is to try to merge segmented regions. This naturally leads to the more general idea of merging and splitting regions. Another is to allow the use of textures rather than just colour.
input a picture p
make a label picture, L, of the same size, initialised to zero
label <- 0
WHILE some L(x,y) = 0
    label <- label + 1
    S <- empty set
    RegionGrow( x, y : p, L, S, label )
END

RegionGrow( x, y : p, L, S, label )
    IF (x,y) is on the image AND L(x,y) = 0 AND
       ||p(x,y) - p(u,v)|| <= t for all (u,v) in S
    THEN
        L(x,y) <- label
        S <- S + (x,y)
        RegionGrow( x+1, y : p, L, S, label )
        RegionGrow( x-1, y : p, L, S, label )
        RegionGrow( x, y+1 : p, L, S, label )
        RegionGrow( x, y-1 : p, L, S, label )
    END

Figure 5.2: Pseudo-code for simple segmentation; a result is shown in Figure 5.3

Figure 5.3: A simple segmentation of the ”cows”, made using the algorithm in Figure 5.2

5.0.4 Merge and Split


The idea of split and merge is to refine an initial segmentation by merging regions to make a larger one that is somehow coherent, and splitting other regions that violate some coherence condition. Many possible criteria exist on which to base merge and split decisions, such as:
• split a region where an edge is strong
• merge regions separated by a weak edge
• use semantic information (prior knowledge)

Only regions that touch should be merged. Hence it is useful to think of each region as a node in a graph. Edges in the graph indicate that two nodes touch. The graph formed is called a region adjacency graph. The algorithm for merging and splitting is shown in Figure 5.4.
Form an initial segmentation
Make up a Region Adjacency Graph (RAG)
WHILE regions may be merged or split
    FOR each region
        IF it should be merged with a neighbour
            modify the RAG as regions merge
        ELSE IF it should be split
            modify the RAG to split the region
        END
    END
END

Figure 5.4: A simple approach to merge and split

Statistical criteria can be used to decide both merging and splitting. Suppose S is a region of pixels. The mean colour p̄ and colour covariance C are given by
\[
\bar{\mathbf{p}} = \frac{1}{|S|} \sum_{\mathbf{p} \in S} \mathbf{p}
\]
\[
C = \frac{1}{|S| - 1} \sum_{\mathbf{p} \in S} (\mathbf{p} - \bar{\mathbf{p}})(\mathbf{p} - \bar{\mathbf{p}})^T
\]

The covariance is the multidimensional equivalent of variance, but with direction in space taken into
account. The Mahalanobis distance of some arbitrary colour q from the region measures the squared
distance between the arbitrary colour and the mean in terms of the number of standard deviations. It
is given by
\[
d(\mathbf{q}; S) = \left( (\mathbf{q} - \bar{\mathbf{p}})^T C^{-1} (\mathbf{q} - \bar{\mathbf{p}}) \right)^{1/2}
\]
Now, it turns out that about 97% of pixels that belong to the region have a colour with a Mahalanobis distance of less than 3. So, one criterion to merge two regions is this: merge them if the mean colour of each region has a Mahalanobis distance of less than three with respect to the other region. A bit more precisely, let S1 and S2 be two regions, with means and covariances (p̄1, C1) and (p̄2, C2) respectively. We might merge these regions if
\[
(d(\bar{\mathbf{p}}_1; S_2) < 3) \wedge (d(\bar{\mathbf{p}}_2; S_1) < 3)
\]
We might split a region if any eigenvalue of the covariance matrix becomes too large.
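A small numpy sketch of this merge test, assuming each region is handed over as an (m, 3) array of RGB values; the helper names and the use of numpy's covariance routine are choices of this example.

import numpy as np

def region_stats(pixels):
    """Mean colour and colour covariance of a region (pixels is (m, 3))."""
    mean = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)      # uses the 1/(m-1) normalisation
    return mean, cov

def mahalanobis(q, mean, cov):
    d = q - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def should_merge(region1, region2, threshold=3.0):
    """Merge test from the text: each region's mean colour must lie within
    `threshold` Mahalanobis distances of the other region."""
    m1, c1 = region_stats(region1)
    m2, c2 = region_stats(region2)
    return mahalanobis(m1, m2, c2) < threshold and mahalanobis(m2, m1, c1) < threshold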
There are other ways to merge and split that are more principled — arbitrary thresholds are an
unwelcome introduction into algorithms because they necessarily introduce an element of contrivance.
For example we might appeal to clustering methods that have been developed in statistics and machine
learning, as well as computer vision and elsewhere.

5.0.5 Segmentation as clustering


Let's think about the colours in a coherent region for a moment; in fact, let's think about the colours as points in an RGB cube. Each pixel has a particular colour, (r, g, b), which is a point in the RGB cube. If the colours in a segment are similar, then their RGB points will huddle together to form a cluster. Colours from other regions will likewise form clusters. The process of segmenting an image can therefore be seen as the process of clustering colour values.
In fact this idea of clustering works not just with colours. Think of a colour as a vector with three values in it. We know we can cluster colours and segment images. But we can associate many values with a pixel; the partial derivatives computed for edge detection are an example where we associate a vector

of two numbers with each pixel. Of course, we can compute derivatives of different order, and at different
scales so building up a vector of high dimension at each pixel. We can then imagine clustering these
high-dimensional vectors and so segmenting images — this is a step toward segmenting on the basis of
texture.
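The notes do not commit to a particular clustering algorithm; as one common illustrative choice, here is a plain k-means sketch that clusters per-pixel feature vectors (colours, or longer vectors built from derivatives at several scales) and returns a segment label for every pixel. The number of clusters and the other parameters are arbitrary.

import numpy as np

def kmeans_segment(features, k=4, n_iter=20, rng=np.random.default_rng(0)):
    """Cluster per-pixel feature vectors with plain k-means.
    features : (n_pixels, n_dims) array, one feature vector per pixel.
    Returns an integer cluster label per pixel."""
    centres = features[rng.choice(len(features), k, replace=False)].astype(float)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        # assign each pixel to its nearest centre
        dists = np.linalg.norm(features[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of its pixels
        for j in range(k):
            if np.any(labels == j):
                centres[j] = features[labels == j].mean(axis=0)
    return labels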
Now consider the use of statistics in the previous section to control merge and split. We computed the
mean and covariance of a region. This is, in fact, a semi-parametric way to describe the cluster of colours.
The mean colour lies at the centre of the cluster, and the covariance matrix contains information about
the size, shape and orientation of the cluster. In fact the statistical approach we introduced assumes
that the colour points in RGB are Gaussian distributed (also called a Normal distribution. This means
that the denisty of points in the cluster is assumed to be given by the Gaussian
 
1 1 T −1
exp − (q − p̄) C (q − p̄) dq
(2π)n/2 |C|1/2 2

which when integrated over all space gives a unit mass.


Chapter 6

Tracking

To track is to follow an object in an image sequence (video). Tracking is an inference problem, because the motion of the object has to be inferred from a set of stills, much as 3D reconstruction is an inference problem because the 3D shape has to be inferred from a set of 2D images. Because tracking is so important there are many ways to track. Historically, the most important method has been Kalman filtering, but more recently Condensation filtering has proven advantages. Both methods are called filtering because both filter the input data as they try to infer object motion. We'll look at Kalman filtering first, mention its problems, and see how Condensation filtering addresses these problems.

6.0.6 A simple tracker


Consider a simple template tracker. A template tracker has a template of the object it is tracking. The
template will typically be a small window that has been masked out from a video frame to show the
object. The template will correlate windows that look like itself: correlation being defined here as
\[
c(u, v) = \left( \sum_x \sum_y \left| I(x - u, y - v) - T(x, y) \right|^2 \right)^{1/2} \tag{6.1}
\]

which assumes the template has been normalised so that [0, 0]^T is its origin. We can imagine using correlation to locate the object by looking for the pixel [u, v]^T with the smallest correlation score. Further, we can imagine doing this for successive frames in a video, and so find the trajectory of the object.
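A brute-force sketch of such a template tracker, assuming grey-level numpy arrays. The offset convention differs slightly from Equation 6.1, but the idea (pick the window with the smallest score) is the same, and the function name is an assumption of this example.

import numpy as np

def track_template(frame, template):
    """Exhaustive template search: slide the template over the frame and
    return the top-left offset (u, v) with the smallest score, in the
    spirit of Equation 6.1."""
    H, W = frame.shape
    h, w = template.shape
    best, best_score = (0, 0), np.inf
    for v in range(H - h + 1):
        for u in range(W - w + 1):
            window = frame[v:v + h, u:u + w]
            score = np.sqrt(np.sum((window - template) ** 2))
            if score < best_score:
                best, best_score = (u, v), score
    return best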
If only things were that simple. The object may get bigger (smaller) as it moves toward (away from) us. It may change appearance as it spins. It may become occluded, or partially occluded, as it passes behind foreground objects. There may be other objects just like it in the video, or at least sufficiently similar objects. All of these problems make tracking very difficult indeed. The first step to a better tracker is to use a motion model.

6.0.7 Motion models


Physics provides us with very strong models of motion. Newton’s second law is that rate of change of
momentum is equal to the force applied. This is expressed in the differential equation:
\[
\mathbf{F} = m \frac{d\mathbf{v}}{dt} \tag{6.2}
\]
dt
assuming that mass, m (the constant of inertia) does not change. Acceleration is rate of change of
velocity, and velocity is the rate of change of displacement:
\[
\mathbf{a} = \frac{d\mathbf{v}}{dt} \tag{6.3}
\]
\[
\mathbf{v} = \frac{d\mathbf{x}}{dt} \tag{6.4}
\]


hence we can also write a = d²x/dt². These equations are integrated to give the well known equations of motion
\[
x = ut + \frac{1}{2} a t^2 \tag{6.5}
\]
in which x is the distance travelled in time t. To get the actual location we must, of course, add on the initial location:
\[
x = p + ut + \frac{1}{2} a t^2 \tag{6.6}
\]
This is a very common motion model.
Perhaps a better starting point is the Taylor expansion of a function:

\[
f(t + \delta) = f(t) + \frac{df(t)}{dt}\delta + \frac{1}{2}\frac{d^2 f(t)}{dt^2}\delta^2 + \frac{1}{6}\frac{d^3 f(t)}{dt^3}\delta^3 + E[\delta^4] \tag{6.7}
\]
\[
= \sum_{k=0}^{N} \frac{1}{k!}\frac{d^k f(t)}{dt^k}\delta^k + E[\delta^{N+1}] \tag{6.8}
\]
which says that the value of a function f() at a distance δ from a known point t can be approximated to zeroth order by f(t), to first order by adding the sloping line df/dt, to second order by adding a quadratic term d²f/dt², and so on, with the correction terms of increasing order and decreasing magnitude until an error E[·] remains, which is of order δ^(N+1). You should compare the Taylor expansion with Newton's equations.
If a computer program is given the initial position, initial velocity, acceleration, and any other terms, it can compute the path a particle takes using numerical integration. Many methods exist but the easiest (yet least stable, numerically) is called Euler integration:

Initialise position, velocity, acceleration
decide a time interval
FOR each time instant
    position = position + velocity*interval
    velocity = velocity + acceleration*interval
END

This simply updates the position and velocity at each time instant.
If we choose to do so (and we will choose to do so) we can express this integration using matrix methods. First we define the "state" of the particle as being a combination of its position, velocity, and acceleration, which is all that is needed to predict its future position. Next we define a "system matrix" that maps the state at one time instant to the state at the very next time instant. In general, then

Initialise state
decide a time interval
FOR each time instant
    state = A*state
    position = H*state
END

where A is the system matrix (state transition matrix), and H is a projection matrix that grabs the data
we are interested in from the state. In the specific case of motion under Newton’s laws we define a state
vector by concatenating location, velocity, and acceleration
 
\[
\mathbf{s} = \begin{bmatrix} \mathbf{x}(0) \\ \mathbf{v}(0) \\ \mathbf{a} \end{bmatrix}
\]

and define a system matrix for the interval δ
\[
A = \begin{bmatrix} \mathbf{1}^T & \delta \mathbf{1}^T & (\delta^2/2)\mathbf{1}^T \\ \mathbf{0}^T & \mathbf{1}^T & \delta \mathbf{1}^T \\ \mathbf{0}^T & \mathbf{0}^T & \mathbf{1}^T \end{bmatrix}
\]

in which 1 is a vector with 1 in every location, and 0 is a vector with 0 in every location; each of these is the same size as the position vector (so if the position is 2D, the vectors are 2-dimensional, and so on). The observation matrix is
\[
H = \begin{bmatrix} \mathbf{1}^T & \mathbf{0}^T & \mathbf{0}^T \end{bmatrix}
\]

that is, a row vector whose length is the same as that of the state vector, and which in this case picks
out the location of the particle.
In practice we would not, of course, have direct access to the velocity and acceleration of a particle, but we might hope to be able to observe the particle in the first three frames, and this is enough to compute the remaining positions. This requires no alteration at all to the general scheme just proposed, but does require new contents for the state, and new contents for the system matrix (in general the observation matrix must change too, but here it is the same). The state in the third frame (time instant) is
\[
\mathbf{s} = \begin{bmatrix} \mathbf{x}(2\delta) \\ \mathbf{x}(1\delta) \\ \mathbf{x}(0\delta) \end{bmatrix}
\]
and the system matrix is
\[
A = \begin{bmatrix} 3\cdot\mathbf{1}^T & -3\cdot\mathbf{1}^T & \mathbf{1}^T \\ \mathbf{1}^T & \mathbf{0}^T & \mathbf{0}^T \\ \mathbf{0}^T & \mathbf{1}^T & \mathbf{0}^T \end{bmatrix}
\]
Notice how the first line predicts the new location as a linear combination of the previous three locations,
and that this “history” window of three points rolls along in time.
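The following numpy sketch builds A and H for the position–velocity–acceleration state of a 2D particle and steps it forward; identity blocks play the role of the "1 in every location" vectors used in the shorthand above, and the numbers in the usage example are arbitrary.

import numpy as np

def constant_acceleration_model(delta, dims=2):
    """System and observation matrices for the state s = [x, v, a]
    (position, velocity, acceleration), each of dimension `dims`."""
    I, Z = np.eye(dims), np.zeros((dims, dims))
    A = np.block([[I, delta * I, 0.5 * delta**2 * I],
                  [Z, I, delta * I],
                  [Z, Z, I]])
    H = np.block([I, Z, Z])        # picks the position out of the state
    return A, H

# a tiny usage example: simulate a particle for five steps
A, H = constant_acceleration_model(delta=0.1)
state = np.concatenate([[0.0, 0.0],      # initial position
                        [1.0, 0.5],      # initial velocity
                        [0.0, -9.8]])    # constant acceleration
for k in range(5):
    state = A @ state
    print("position at step", k + 1, "=", H @ state)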
All of this would work very well, except it does not account for noise at all. Unfortunately the
presence of noise in the observations can mess things up considerably. Fortunately we have a remedy —
the Kalman filter.

6.0.8 Kalman Filtering


The Kalman filter is a predictor-corrector: the Kalman filter "watches" an object as it moves; that is to say, it gathers information about the object being tracked, at least over a short period of time. It then uses the information it has gathered to predict where the object will be in the next frame; this prediction requires a model of motion of the kind just discussed. Next, the Kalman filter looks for the object in the next frame, and corrects the model of motion using the difference between the observation and the prediction.
But the Kalman filter does more. It recognises two kinds of noise
• First, the particle may not move under conventional laws of physics, for example the flight of a
bird can be rather erratic. This is called process noise.
• Second, the particle has to be observed with a measuring instrument that will always introduce
errors of its own, so called measurement noise.
The presence of noise means we have to correct our predictions somehow. What is more, we should not only make a prediction, but in addition be able to state a level of confidence in our prediction. The Kalman filter does these things for us too.
Let us consider noise, for the moment. One way to think about noise is that it “jitters” a true
point to a new point, one we end up seeing. Suppose for the moment these points are two-dimensional.
We allow the true point to move under a motion model, and at each instant jitter it into a new point.
What is of interest is the error e = xseen − xtrue . We can imagine plotting these errors as points in
the plane. The way the points get spread out — their distribution — indicates any biasing in the error

measurement, and also indicates the level of confidence we can have. If the points are widely spread we would be inclined to have less confidence than if they were tightly clustered. We would hope, and in fact assume, that the points are Gaussian distributed. This just means that the number of points in any patch of the plane is directly proportional to the volume of a 2D Gaussian under that patch. A Gaussian has a single peak at
\[
\bar{\mathbf{e}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{e}_i \tag{6.9}
\]

where N is the number of points being considered. This is the mean and it is the most likely value for
the error. The rate at which the distribution decays, and whether this rate is faster in one direction
than another, is captured by the covariance matrix.
\[
C = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{e}_i - \bar{\mathbf{e}})(\mathbf{e}_i - \bar{\mathbf{e}})^T \tag{6.10}
\]

which assumes a Gaussian distribution of points.


The Kalman filter maintains a prediction for the most likely next location, and also maintains a
covariance matrix that is used to indicate a confidence in the predicted value. The equations that
underpin Kalman filtering are
\[
\mathbf{s}_{k+1} = A\mathbf{s}_k + \mathbf{n}_k \tag{6.11}
\]
\[
\mathbf{x}_k = H\mathbf{s}_k + \mathbf{m}_k \tag{6.12}
\]
in which s is a state vector (the subscripts indicate a given frame); x is the observable; A is the system update matrix, and H is the observation matrix. In fact, these equations are exactly those we used in modelling motion, except for the presence of the noise vectors n and m, which model process noise and measurement noise, respectively.
It is important to recognise that the x in the above is a real measurement (observation), and that
Hsk is the prediction of this observation. In the case of no noise these two should be identical, but where
there is noise the error is exactly m;
mk = xk − Hsk (6.13)
We assume the expectation value of this error is zero, and compute its covariance as
\[
P = \frac{1}{N} \sum_{k=1}^{N} \mathbf{m}_k \mathbf{m}_k^T \tag{6.14}
\]

What the Kalman filter tries to do is to make this covariance as small as possible, which in practical
terms means keeping the spread of errors as tight as possible. (The “volume” of the covariance — the
spread of points — can be measured either by the trace of the matrix or as the product of its singular
values).
The Kalman filter corrects the state using
\[
\mathbf{s}_{k+1} = A\mathbf{s}_k + K\mathbf{m} \tag{6.15}
\]
where K is the gain,
\[
K = P H^T \left( H P H^T + R \right)^{-1} \tag{6.16}
\]
where R is the covariance of the measurement, i.e. the measurement noise. This measures the confidence in the measurement; it is not the covariance of the prediction error m (that covariance we have already called P). For example, reconsider the template based tracker: we might use the values from such a tracker to generate an R matrix. The gain K minimises the error covariance P, which is also corrected:
\[
P_{k+1} = (I - KH)\left( A P_k A^T + Q \right) \tag{6.17}
\]
where now Q is the process noise covariance; that is the noise intrinsic to the system.
Putting all this together the Kalman filter is:

Initialise
* state
* the system matrix A
* the observation matrix H
* process covariance Q
* measurement covariance R
* error covariance P
FOR each time instant, k

/* predict */
newstate = A*state;  // advance predicted state
newP = A*P*A' + Q;   // advance predicted state covariance

/* correct */
K = newP * H' * inv( H * newP * H' + R );    // Kalman gain
state = newstate + K*( x[k] - H*newstate );  // correct state
P = (I - K*H) * newP;                        // correct covariance

/* record observation */
z[k+1] = H*state;
END
Well, not quite. This is a simplified version of the Kalman filter; it does not allow for any control to
be exerted over the system. Even so, it is sufficient for many tracking purposes. Bear in mind that the
state can be anything we want it to be, and can include terms to dictate shape, say, so that objects and
not just points can be tracked.
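For readers who prefer runnable code, here is a numpy transliteration of the pseudo-code above; the argument names follow the symbols in the text, and, like the pseudo-code, it omits any control input.

import numpy as np

def kalman_track(measurements, A, H, Q, R, s0, P0):
    """Simplified Kalman filter following the pseudo-code above.
    measurements : list of observation vectors x_k."""
    s, P = s0.copy(), P0.copy()
    I = np.eye(len(s0))
    estimates = []
    for x in measurements:
        # predict
        s_pred = A @ s
        P_pred = A @ P @ A.T + Q
        # correct
        K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
        s = s_pred + K @ (x - H @ s_pred)
        P = (I - K @ H) @ P_pred
        estimates.append(H @ s)
    return estimates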
The major limitation of the Kalman filter is that it assumes a Gaussian distribution over the prediction error; this is implicit in using a covariance matrix. But consider the case where someone's face is being tracked, and they happen to be walking; walking is of course not an easy motion to model, especially by a polynomial, and if the person happens to be in a crowd things get much worse. It is very easy to imagine that there are many possible places for the next position, and the Kalman filter just breaks down. A more modern filter, the Condensation filter, is able to cope with such situations. It does so by allowing many possible next locations, not just one.
Chapter 7

Model Based Vision

Given an image, even a perfectly segmented image, there is no way a computer can give a name to objects in the scene. Computers have no information about "cars", "cows", "people" and so on. Computers cannot recognise (i.e. label) these things.
Model based vision is so called because it uses models to help interpret image content. Models can
be used to improve segmentation and to recognise particular things.
Of course, in some sense all vision is model based; that is, if by "model" we refer to the assumptions used, the constants used, and so on. The models in "model based vision", though, act at a "high level". This means they make strong assumptions, so they tend to be very useful but only for a strictly limited class of objects.
In this chapter we will develop our ideas about model based vision. We will explore a few of the many varieties of models available to Computer Vision. We will see how the models can help computers to interpret images. And in answering "where do we get the models from?", we will understand something of the role that Machine Learning has in contemporary Computer Vision.

7.1 Simple Recognition of 3D objects?


Let's begin with what appears to be a simple example. Let's imagine a model of, say, a Rubik's cube. The idea is to match the model of a Rubik's cube to its photograph. For the moment, let's assume the model and the photograph show the cube in its "home" position, so that each face is just one colour.
Now we can use Computer Graphics to make a picture of the model cube. All we need do, it seems, is to move the cube around in space until the Computer Graphics picture makes a good match to the real picture. This raises several questions:

• how should the picture be matched? (What if the real picture shows the Rubik’s cube amongst
other objects, some of which might be other Rubik’s cubes.)

• How should the model be moved? In this case we might allow rotations and translations. But the focal length and other internal parameters of the camera may well need to be set too. The number of free variables we allow to change is the number of degrees of freedom (DoF) of the problem. Problem complexity scales exponentially with the DoF.

The standard approach to recognition using models is to look for projective invariants. These are quantities that remain unchanged under projection. The cross-ratio of line lengths is an obvious example.
Invariants merely mitigate rather than solve the problems alluded to above. Further, projective invariants are of little or no use in cases where the object changes shape and/or appearance. Rather than continue trying to fit a 3D model to a 2D picture, we will proceed by looking at ways to use pictures as models.


7.2 Pictures as models


Using pictures as models is less surprising than it sounds. In fact, templates as used in tracking, say, are nothing more than small pictures that are used to model appearance. Here, though, we will think about whole pictures as models. The idea now is like having a mug-shot of someone. We can then use their mug-shot to find them in other photos. Of course, we may need to resize the picture, rotate it, and so on. This means the above two questions, or at least versions of them, remain with us. But now we can look for objects we cannot easily model in 3D, and, as we will see, it allows the possibility of changes of shape and appearance.
Now, the exact form of the model we use depends on the task we want to perform. To illustrate here
are two tasks:

• We have a single mug-shot and a collection of holiday snaps, each of which we know contains the
person. The task is to locate the person in each snap.

• We have a single mug-shot and a collection of other mug-shots, each of a different person. The
task is to decide whether the person’s mug shot is a member of the collection.

The first of these is very hard: holiday snaps are unconstrained, so any process has a lot to take care of. The second is very difficult, but at least we might be able to restrict the way in which the mug-shots in the collection are presented. We might make sure each is the same size, shows a frontal view, is under standard lighting, and so on. All this makes automatic recognition easier, which (I presume) is why the passport office insists on a set of what seem to be unnecessary constraints on passport photographs.
So, let's start by making life easy. We will consider just a collection of passport-like mug-shots; this is the library. We want to know whether or not a new mug-shot, the probe, belongs to the library.
A simple-minded approach: use the root mean squared (RMS) difference as a measure of dissimilarity. Now compute this measure between every picture in the library and the probe. When (if) we encounter an RMS error that is small enough, we can stop.
This can be criticised on two grounds: RMS might not give the right answer, and there may be thousands or even millions of pictures. This is especially important if the probe is not in the library, because in that case every picture in the library will have to be processed. Let's assume RMS is OK. The problem is then to organise the pictures in the library so that a decision can be made quickly.

7.2.1 Eigenfaces

Figure 7.1: Three pixels from a mug-shot make up a picture of 3 pixels. This mini-picture is a point in [0, 1]³. As we add more pixels, so the number of dimensions grows.

Eigenfaces is a very modern, common method to organise pictures of mug-shots. Here we will look
at a basic method.
The first thing to do is think of an image as a point in a space. The space is very high dimensional:
there is one dimension for every pixel. To see how this works think first of a picture of just one pixel.
The pixel can take on a gray level value between 0 and 1, say. The point representing this single-pixel
picture rests on this line, at the pixel’s value, v1 , say. Now we introduce a second pixel, with value v2 to
give a two-pixel picture (v1 , v2 ). This is just a point inside the unit square [0, 1]2 . If we consider a three
pixel picture the point is at (v1 , v2 , v3 ) inside the unit cube [0, 1]3 .
Real pictures have N pixels. Provided we always order these pixels in the same way we can represent a picture as a vector of gray levels (v1, v2, . . . , vN), which is inside the hyper-cube [0, 1]^N. This is an N-dimensional space, and typically N >> 1, of course. This idea is shown in Figure 7.1. In this case the pixels are ordered column by column, which is very common.
Now we can think of any picture as a point in [0, 1]N . In fact, every picture (for a picture with a
given number of pixels) is inside [0, 1]N : we have a universe of pictures. Of course, this includes random
noise pictures, pictures of cows, cars, faces and so on. Clearly, [0, 1]N ⊂ ℜN , so is in N -dimensional
space.

Eigenfaces lie on an M-manifold embedded in N-space

But what makes a library of face images interesting is that their points in the universe of pictures crowd together. More than that, they will lie on a manifold of much lower dimension than N, which you will recall is the number of pixels. We say that this M < N dimensional manifold is embedded in N-dimensional space.
You can think of a manifold as a surface. A plane is a two dimensional manifold that can be embedded in a three dimensional space. An easy physical analogy is a sheet of paper, which we can move about in 3D. Actually the paper does not have to be planar: even if we bend the paper it is still a 2D surface (manifold) embedded in a 3D space. A curve is also a manifold, a 1D manifold. Curves, like string, can be embedded in 3D. Interestingly, the curve may be confined to a plane, which is embedded in 3D, or it may genuinely curve in 3D, like the track of a big-dipper. Figure 7.2 illustrates.
It is much more difficult, impossible even, to think about an M-dimensional manifold in N-dimensional space. But if we hang on to the paper and string analogies we should be OK, because the maths we need is the same no matter what the dimension is.

Figure 7.2: Two manifolds in a 3D space, one a surface, the other a curve.




Why should eigenfaces lie on a manifold? Let's change one of the pixel values just a bit. This makes a new picture, whose point is shifted a corresponding amount along the axis corresponding to the pixel that has been changed. If we continued changing this pixel, then we would get a straight line in [0, 1]^N, and we know lines are 1-D manifolds. Now we change two pixels: we will get a plane in [0, 1]^N, and planes are manifolds.
But now suppose we change two pixels in a correlated way. For example, if we increase the value of one we might decrease the value of the other. This correlation does not have to be linear; we might increase one exponentially and decrease the other linearly. The important point is that the two pixel values are linked together somehow. If this is true, then instead of getting a plane as we vary the two pixels we would get a curve, and this curve will be embedded in the plane defined by their uncorrelated variations.
Now, it is clear that the pixel values in mug-shots are correlated. People's noses, eyes, mouths, hair, etc. all tend to lie in about the same place and all tend to look about the same. And, because they are correlated, the points in [0, 1]^N that represent a face will lie on a manifold. At this instant it does not matter that we do not know how face images are correlated; all that matters is that we know they are correlated. So we know that face images must be restricted to some M-dimensional manifold embedded in N-dimensional space.
More than that, we can guess it is likely that the images in our library of mug-shots do not occupy all of the manifold, but rather cluster into a patch upon it. To begin with, the manifold is infinitely large, and the library must be confined to its intersection with the universe of pictures, [0, 1]^N. But also, ordinary monochrome images contain very few extreme-valued pixels (0 or 1), so the images can be expected to crowd together, on an M-dimensional manifold, in more-or-less the middle of [0, 1]^N, or at least not too close to its boundary.
Now we have an intuitive feel for how a library of images, in this case faces, looks inside the universe
of pictures. Next we will add some mathematics to back up our intuition. Fortunately, the mathematics
required is not too hard.

A mathematical representation of the manifold


We are now representing an image by a point in N-dimensional space; so we write x ∈ ℜ^N for the image. In fact, we have a library of, say, n images. We expect these images to be correlated in some way, that is, to form a cluster. We want the mean of this cluster and its covariance matrix. The mean is just
\[
\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \tag{7.1}
\]

and the covariance matrix is
\[
C = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T \tag{7.2}
\]

Notice that C is N × N, which is huge: so big that it imposes the significant practical difficulty that it cannot be computed on most computers! We shall address this later and see how to overcome the problem. For now, though, we continue as if we have access to C.
The eigendecomposition of C gives us a set of eigenvectors and eigenvalues

\[
C = U L U^T \tag{7.3}
\]

There is one eigenvector and one eigenvalue for each of the N dimensions of the universe of images. But we expect the images in our library to be confined to a manifold of much lower dimension. If we assume the images are spread out over a hyper-plane, which is a manifold, then all we need do is look at the eigenvalues. Many of them will be zero, or at least so small that they can be neglected. This means the library images do not spread out in the direction of the corresponding eigenvectors. These eigenvectors can be discarded, leaving us with just M eigenvectors and eigenvalues.
(Why must we have M ≤ n?)

A hyper-plane, by the way, is just a generalised sort of plane. A line is a 1D hyper-plane, an ordinary plane is a 2D hyper-plane. It is hard to think about flat objects having a dimension greater than this, but that is what a hyper-plane is.
The remaining M eigenvectors make up the central axes of the hyper-plane. So another way to think about the hyper-plane is just as an M-dimensional vector subspace embedded in an N-dimensional vector space. This way might be easier to think about, because now a cube, say, can stand for a 3D subspace in a 4D space, just as a square stands for a 2D subspace in a 3D space.
In any case, the fact is that we need only M of the eigenvectors. We will write these as U_{NM}, which is an N × M matrix. Now any image at all can be projected into this subspace:
\[
\mathbf{y} = U_{NM}^T (\mathbf{x} - \bar{\mathbf{x}}) \tag{7.4}
\]

But, like all projections, information is lost. To see this, consider the orthogonal projection of a point in 3D onto a 2D plane. If the point starts out in the plane we lose nothing, but if the point is above or below the plane, then we lose its depth. The above equation is exactly an orthogonal projection, but from N dimensions down to M. If a point (an image) is in the hyper-plane (subspace, manifold), then we lose nothing. If the point is outside the manifold, then we incur a loss. This loss is given by the residue vector
\[
\mathbf{h} = U_{NM}\mathbf{y} + \bar{\mathbf{x}} - \mathbf{x} \tag{7.5}
\]

The length |h| is 0 if and only if x belongs to the library's manifold.


If |h| < ε for some small ε, then y is a low-dimensional representation of the image x. In fact, M can be very low, typically M = 50. This means we are representing a complete picture with just 50 numbers! This fact has significant implications for storage, transmission, and other applications.
Now, these eigenvectors are called eigenimages. Why? Simply because each one is an image! And since we started with faces, we can be more specific: we have a set of so-called eigenfaces.

Does a given picture belong to the library?


Just because a picture belongs to the manifold does not mean it belongs to the library. Recall that the library images are expected to crowd together on the manifold. We need to check not only the size of the residue, but also whether a given image belongs to the crowd. Fortunately we already have everything we need to do this.
The squared Mahalanobis distance of a point x from the mean x̄, given a covariance matrix C, is
\[
m^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})^T C^{-1} (\mathbf{x} - \bar{\mathbf{x}}) \tag{7.6}
\]

We do not have C; it is far too large. But we know
\[
C \approx U_{NM} L_{MM} U_{NM}^T \tag{7.7}
\]
and so can write the squared Mahalanobis distance as
\[
m^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})^T U_{NM} L_{MM}^{-1} U_{NM}^T (\mathbf{x} - \bar{\mathbf{x}}) \tag{7.8}
\]

which is just
\[
m^2(\mathbf{x}) = \sum_{i=1}^{M} y_i^2 / L_{ii} \tag{7.9}
\]

Now, this all assumes that the library images are Gaussian distributed, in which case about 97% of the library population is within 3 Mahalanobis distances (standard deviations) of the mean. We conclude that x belongs to the library if m(x) < 3.

Computing the eigenimages in practice


We observed that C is huge, too big to store and compute with. We need another way to get to the eigenimages. Singular value decomposition comes to our rescue.
First, we arrange our n library images in a matrix:
\[
X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \tag{7.10}
\]
and we subtract the mean, which we can compute:
\[
Z = X - \bar{\mathbf{x}} \mathbf{1} \tag{7.11}
\]

The 1 is just a row of n 1's, which acts simply to repeat the column vector x̄ n times. Now that we have two N × n matrices we can subtract one from the other to get Z. Now we can write C like this:
\[
C = \frac{1}{n} Z Z^T \tag{7.12}
\]
which is an outer product of Z.
Let's take the SVD of Z:
\[
Z_{Nn} = U_{NN} S_{Nn} V_{nn}^T \tag{7.13}
\]

The outer product is
\[
Z_{Nn} Z_{Nn}^T = (U_{NN} S_{Nn} V_{nn}^T)(U_{NN} S_{Nn} V_{nn}^T)^T \tag{7.14}
\]
\[
= U_{NN} S_{Nn} S_{Nn}^T U_{NN}^T \tag{7.15}
\]
\[
= U_{NN} L_{NN} U_{NN}^T \tag{7.16}
\]

which is exactly the eigendecomposition, with
\[
L_{NN} = S_{Nn} S_{Nn}^T \tag{7.17}
\]
\[
= S_{NN}^2 \tag{7.18}
\]

So the squares of the singular values give the eigenvalues, and the left singular vectors and eigenvectors are the
same.
Recall that many of the eigenvalues, now singular values, are zero. In fact, at most n of the singular values
can be non-zero, because the n images cannot possibly occupy more than n dimensions.
Now consider the inner product instead:

    Z_{Nn}^T Z_{Nn} = V_{nn} S_{nn}² V_{nn}^T    (7.20)

Since typically n ≪ N we can compute this, even on quite modest machinery. This gives us the
eigenvalues and a set of right singular vectors. As before, we can throw away vectors with small singular
values. Of course, we must discard the corresponding left singular vectors too, and although we do not yet know
what these are, we can still write our library with just M singular values:
    Z_{Nn} = U_{NM} S_{MM} V_{nM}^T    (7.21)

From which it is easy to obtain the left singular vectors we actually want:

    U_{NM} = Z_{Nn} V_{nM} S_{MM}^{-1}    (7.22)

In practice, it is this method which is used to get the eigenimages.
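
The recipe above maps directly onto an economy-size SVD, which works on the small (n-sized) side of the problem and so never forms C. The sketch below is one possible implementation, assuming X is an (N × n) numpy array whose columns are the flattened library images; the function name and the choice of M are illustrative only.

    import numpy as np

    def eigenimages(X, M=50):
        """Compute the mean image and the first M eigenimages from a library X."""
        x_bar = X.mean(axis=1, keepdims=True)
        Z = X - x_bar                                      # eq (7.11)
        # Economy-size SVD: equivalent to finding V from the small inner product
        # and then recovering U = Z V S^-1 as in eq (7.22).
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # Z = U S V^T, eq (7.13)
        return x_bar[:, 0], U[:, :M], s[:M]**2             # squared singular values are the eigenvalues of ZZ^T, eq (7.17)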


Appendix A

Mathematical background

Much of the mathematics needed is covered by Mathematical Foundations of Computer Graphics: CM20001,
the exception being statistics.

A.1 Linear algebra: vectors and matrices


An n-dimensional vector is an element of ℜn . If x = (x1 , . . . , xn ) and y = (y1 , . . . , yn ) are vectors, and
s is a scalar, then:

• x + y = (x1 + y1 , . . . , xn + yn )
is the addition of x and y.

• sx = (sx1 , . . . , sxn )
is a scaled version of x.

• |x| = (Σ_{i=1}^n x_i²)^{1/2}
is the length of the vector x.

• x̂ = x/|x|
is the unit vector.

• x ⊙ y = Σ_{i=1}^n x_i y_i
is the inner product, also called dot product and scalar product.

A basis (also called reference frame) of ℜ^n comprises a set of n vectors such that any other vector in
ℜ^n is a linear combination of the basis vectors. That is, for a basis set b_i,

    x = Σ_{i=1}^n c_i b_i

holds for every vector x ∈ ℜ^n. The set of numbers (c_1, . . . , c_n) are called the coordinates of the vector
x. If the basis vectors are mutually orthogonal and each of unit length, then the basis is an orthonormal
basis.
The above can be written more compactly by writing the basis vectors as columns in a matrix,
B = [b_1 . . . b_n], and the coordinates as a column vector c:

    x = Bc

For an orthonormal basis the coordinates are recovered by c = B^T x, where B^T is the matrix transpose of B.


A matrix is an (n × m) array of numbers, having n rows and m columns; a_ij is the element in the ith
row and jth column. If A is an (n × m) matrix and B is a (p × q) matrix, then


• cij = aji
is the transpose of A.
• cij = aij + bij
is the addition of the matrices, and exists only if n = p and m = q.
• cij = saij
is the scaled version of matrix A.
• c_ij = Σ_{k=1}^m a_ik b_kj
is the matrix product of A and B, and exists only if m = p. The product C is (n × q).
Matrix multiplication does not commute in general, so AB ≠ BA. A^{-1} is an inverse of A if and only
if AA^{-1} = A^{-1}A = I, in which I is the (n × n) identity matrix. A matrix is called orthonormal if
A^{-1} = A^T.
The determinant of a square matrix A is defined recursively in terms of the determinants of its sub-matrices:

    det(A) = Σ_{i=1}^n (−1)^{i+1} a_{i1} det(A_{i1})

where A_{i1} is the matrix formed from A by removing the ith row and the first column. A matrix A is singular if
det(A) = 0.
The rank of a matrix is the number of linearly independent rows (or columns). If the rank of a matrix
is less than its size, the matrix is said to be rank deficient. All rank deficient matrices are singular;
consequently all non-square matrices are singular.
A matrix is Hermitian if it is square and equal to the transpose of its complex conjugate, so
A = (A*)^T. If the matrix is real (has only real entries), then this is equivalent to A being symmetric.

A.1.1 Singular Value Decomposition


We will not prove it, but any matrix A can be written as the product of three other matrices:

    A = USV^T

with U and V both orthonormal, and S diagonal. If A is (M × N) then U is (M × M), V is (N × N), and
S is (M × N). If N ≠ M, then the statement that S is diagonal means the diagonal elements just continue
until they hit the right or lower edge of the matrix, whichever comes sooner. The diagonal elements of S are
all zero or larger.
The columns of the matrix U are called left singular vectors, those in V are right singular vectors,
and the diagonal elements of S are singular values. The ith singular value is associated with both the
ith left singular vector ui and the ith right singular vector vi . The process overall is called singular
value decomposition, see Press et al [4] for an algorithm.
One way to convince yourself that SVD works for any matrix is to interpret the product geometrically.
Consider the right singular vectors as points on a unit sphere (more correctly, a unit hyper-sphere). The
diagonal matrix squashes the sphere into an ellipsoid (like a rugby ball). The left singular matrix then
rotates the points in place to give the values in A. This interpretation works equally well the other way,
treating the left singular vectors as points. If N ≠ M then S will orthogonally project as well as scale.
For example:

    [ x11  x12  x13 ]   [ u11  u12 ] [ s11   0    0 ] [ v11  v12  v13 ]^T
    [ x21  x22  x23 ] = [ u21  u22 ] [  0   s22   0 ] [ v21  v22  v23 ]
                                                      [ v31  v32  v33 ]
We can write our example as

    [ x11  x12  x13 ]             [ s11   0    0 ]
    [ x21  x22  x23 ] = [u1  u2 ] [  0   s22   0 ] [v1  v2  v3 ]^T
                      = s11 u1 v1^T + s22 u2 v2^T

Each u_i v_i^T is a (2 × 3) matrix.
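
A quick numerical check of this rank-one expansion, using numpy's SVD on an arbitrary (2 × 3) matrix; the numbers are made up for illustration.

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])        # a (2 x 3) matrix, as in the example
    U, s, Vt = np.linalg.svd(A)            # U is (2 x 2), s holds s11 and s22, Vt is (3 x 3)
    # Rebuild A as s11*u1*v1^T + s22*u2*v2^T
    A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
    assert np.allclose(A, A_rebuilt)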


Although singular matrices have no inverse, they do have a pseudo inverse. This is computed from
the three matrices in the decomposition, as follows. First we “deflate” the three matrices we have. This
means throwing away all vectors that have either no singular value, or a zero singular value; this does
not change their product, as the example above illustrates. Now we can write
A = ÛŜV̂T
where the deflated matrices have hats on. In the example, we can deflate by removing v3, which
contributes nothing to the sum, so

    [ x11  x12  x13 ]   [ u11  u12 ] [ s11   0  ] [ v11  v12 ]^T
    [ x21  x22  x23 ] = [ u21  u22 ] [  0   s22 ] [ v21  v22 ]
                                                  [ v31  v32 ]
Now

    A⁺ = V̂ Ŝ⁻¹ Û^T

is called the pseudo-inverse of A, and is easy to compute since Ŝ⁻¹ is just a diagonal matrix with terms
1/s_ii. The general properties characterising a pseudo-inverse are:

    rank(A) = rank(A⁺)
    A⁺ A A⁺ = A⁺
    A A⁺ A = A
    A A⁺ and A⁺ A are Hermitian
Now consider the linear system

    Ax = b

which is a general problem to be solved for x, given A and b. This matrix transform can be
considered as a mapping from one space to another, and if A is singular then a projection is involved. This
is because the columns of A are basis vectors, and if any column can be written as a linear combination
of the others then it lies in the plane defined by those vectors in the combination.
For example, take [1 0 0]^T and [0 1 0]^T, which lie in the xy-plane, and rotate them about the y-axis by φ and
then about the x-axis by θ. Stacking the rotated vectors gives a transform into this plane:

    B = [ cos(φ)   sin(θ) sin(φ)   −cos(θ) sin(φ) ]
        [   0          cos(θ)          sin(θ)     ]

which has two linearly independent columns. Moreover the vector z = [sin(φ), −sin(θ) cos(φ), cos(θ) cos(φ)]^T
is such that Bz = 0, meaning that z is perpendicular to the plane.
In general, if Ax = 0 then x is said to be in the null space of A. The null space has as many
dimensions as it has linearly independent vectors. So, in general, any matrix partitions space into two:
the subspace determined by its basis vectors, called its range, and the null space that comprises points
that cannot be reached by any linear combination of basis vectors.
Returning now to the solution of Ax = b using the SVD. The left singular vectors that have a non-zero
singular value form a spanning set over the range of the matrix A. The right singular vectors with zero
singular values span the null space of A. Now suppose b lies in the range of A; in the example above this means it is a
point on the plane. We can now solve for x using the pseudo-inverse:

    x = A⁺ b

If A is singular this gives many solutions, in fact, for the null vectors can be added to this solution
in any linear combination. Suppose z is any vector in the null space of A; then
A(x + z) = Ax + Az
= Ax
because Az = 0.
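
The following sketch puts these pieces together with numpy; the matrix and right-hand side are made-up examples, and numpy's pinv builds V̂ Ŝ⁻¹ Û^T for us, discarding zero singular values.

    import numpy as np

    A = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])        # rank 2, so it has a one-dimensional null space
    b = np.array([2.0, 3.0])               # lies in the range of A
    x = np.linalg.pinv(A) @ b              # one particular solution, x = A+ b
    assert np.allclose(A @ x, b)

    # Any null-space vector can be added without changing Ax.
    _, _, Vt = np.linalg.svd(A)
    z = Vt[-1]                             # right singular vector with (implicitly) zero singular value
    assert np.allclose(A @ (x + 2.5 * z), b)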

A.2 Fourier Transforms and convolution


The Fourier Transform decomposes an image into its frequency components: sinusoids that sum to make
the original image. Linear filtering (see Chapter 3) can be performed by convolution over the original
image, or by multiplication over the frequency components.
If f(x) is a function, then

    F(u) = ∫_{−∞}^{∞} f(x) exp(−i2πxu) dx

is its Fourier Transform (FT). The inverse FT retrieves the original function:

    f(x) = (1/2π) ∫_{−∞}^{∞} F(u) exp(i2πxu) du

The function f (x) is said to be defined in the “spatial domain”, or “time domain”, because x typically
represents distance travelled or time elapsed. The function F (u) is said to be defined in the “frequency
domain”, also called the “Fourier domain”. The Inverse FT sums all sinusoids, u specifies the frequency
of a particular one, F (u) is its amplitude. Since exp(i2πxu) = cos(2πxu) + i sin(2πxu) it represents two
sinusoids in quadrature.
The FT is just an alternative way to represent the same information. An example by analogy: the
statement "the fence posts were separated by 1/3 of a metre" is in the spatial domain, whereas the
statement "there are 3 posts per metre" is in the frequency domain.
In two dimensions we have the pair:

    F(u, v) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) exp(−i2π(xu + yv)) dx dy               (A.1)

    f(x, y) = (1/(2π)²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} F(u, v) exp(i2π(xu + yv)) du dv      (A.2)

Here (u, v) specifies a sinusoid spread over two dimensions, like a corrugated roof, heading in a particular
direction.
Convolution in one dimension is defined by

    g = f ∗ h
      = ∫_{−∞}^{∞} f(x) h(u − x) dx

In two dimensions:

    g = f ∗ h
      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) h(u − x, v − y) dx dy

Convolution commutes and associates. This makes it possible to combine the effects of several convolu-
tion filters into one.
The convolution theorem states that the FT of a convolution is the product of the individual FTs. If F
and H are the FTs of functions f and h, respectively, then

f ∗ h ↔ FH (A.3)

This equivalence is important in helping us understand the effects of convolution filters. It is practically
useful when using large convolution kernels — the Fast Fourier Transform [4] makes it more efficient to
convolve in the frequency domain.
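
A small numerical check of the convolution theorem on discrete signals, assuming circular (periodic) convolution so that the product of the discrete FTs matches the direct sum exactly; the signals are made up.

    import numpy as np

    f = np.array([1.0, 2.0, 3.0, 4.0])
    h = np.array([0.25, 0.5, 0.25, 0.0])

    # Circular convolution computed directly from the definition...
    direct = np.array([sum(f[k] * h[(n - k) % len(f)] for k in range(len(f)))
                       for n in range(len(f))])

    # ...and via the frequency domain: multiply the FTs, then invert (eq A.3).
    via_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(h)))
    assert np.allclose(direct, via_fft)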

A.3 Statistics
Statistics are playing an increasingly vital role in modern computer vision. Statistics are used to model
the variations in objects and to help make algorithms more robust to noise. See Webb [6] for a more
in-depth discussion of the issues raised here.
You are expected to know that the probability of observing an "event" x conventionally lies between
zero (no chance) and one (certainty), and is often written p(x). It may be that the state of "x" affects the probable
states of some other event "y" (for example, observing clouds in the sky affects estimates of the probability
of rain). This is called the conditional probability (of y upon x) and written p(y|x). In a hypothetical
computer vision application we might observe an image x and want to estimate the probability that it
represents some object or scene y, though other interpretations are possible.
We cannot estimate p(y|x) directly, but (given a model) we can estimate p(x|y). Maintaining our
hypothetical application, we interpret this as the probability of observing image x given that object/scene y is
the truth. Note that the "object" y can represent a class of objects, such as faces, which clearly vary in
appearance. In this case the interpretation of p(x|y) is yet more specific, being the probability of observing the
image x given that the object is a face. This may seem odd, because we can instantly recognise faces. But remember humans are wired
up to see faces especially well, and even then we can get caught out, sometimes seeing faces in shadows,
in rock formations on Mars, and so on.
We use Bayes theorem to estimate p(y|x) given p(x|y):

p(x|y)p(y)
p(y|x) = (A.4)
p(x)

where p(y) is the probability of observing y at all, and p(x) is the probability of observing x. We can
obtain p(x) by summing over all models y:
    p(x) = Σ_{z∈Y} p(x|z) p(z)    (A.5)

where Y is the set of all models.
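
A toy numerical illustration of equations (A.4) and (A.5); the two models and all of the probabilities are invented purely for the example.

    # p(y): prior probabilities of the two hypothetical models
    prior = {"face": 0.1, "not_face": 0.9}
    # p(x | y): likelihood of the observed image under each model, assumed known
    likelihood = {"face": 0.7, "not_face": 0.05}

    evidence = sum(likelihood[z] * prior[z] for z in prior)               # p(x), eq (A.5)
    posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}   # p(y | x), eq (A.4)
    print(posterior)    # p(face | x) is roughly 0.61 with these numbers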


It is very unusual to reason about whole images. It is much more usual to make measurements from
images and so associate many values with each pixel. For example, the intensity, the intensity gradient,
and the curvature of any edge through the pixel might be considered as important. Such values are
considered an element in a vector of measures associated with the pixel. We need multivariate statistics
to handle such data. Note that all the above continues to apply.
Suppose x_i ∈ ℜ^n is a column vector of measures from the ith pixel. Imagine the data x_i as points
scattered in n-dimensional space; they make a cloud of some kind. We would like to know the average
value of these measures, and the multivariate equivalent of standard deviation:
    x̄ = (1/N) Σ_{i=1}^{N} x_i                          (A.6)

    C = (1/(N − 1)) Σ_{i=1}^{N} (x_i − x̄)(x_i − x̄)^T    (A.7)

where N is the number of pixels. The average value x̄ is at the centre of the cloud. (At least, we hope
it is; the cloud may be doughnut shaped, so the average can sometimes be far away from every data
point!) The (n × n) matrix C is the covariance matrix. It relates the ith measure to the jth measure: if
|c_ij| is large, then the two measures are correlated, and maybe they share the same underlying cause.
In terms of the "cloud" of points, large values mean the cloud is well spread out, at least in some
directions (like a rugby ball). Correlations are positive or negative, depending on the sign of c_ij;
these indicate the orientation of the cloud.
It is a remarkable fact that the matrix C can be decomposed into a matrix product:

C = ULUT (A.8)

each of which is n × n. The matrix U is orthonormal; its columns are called eigenvectors. The
matrix L is diagonal; its elements are called eigenvalues. The ith eigenvalue is associated with
the ith eigenvector: it gives the variance of the data when projected onto that eigenvector.
The eigenvectors represent a set of basis vectors through the cloud of points. In fact, the basis set in
U is in some sense a "natural" basis set. The "largest" eigenvector (the one with the largest eigenvalue) points in
the direction of greatest spread through the point cloud. The next largest points along the second largest spread, and
so on. In practice it is common to deflate the system by discarding eigenvectors whose eigenvalues are very small.
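
The sketch below runs equations (A.6) to (A.8) on a synthetic, elongated cloud of 3-dimensional points; the data and dimensions are invented, and numpy's eigh performs the decomposition of the symmetric matrix C.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.1])   # one long axis, one very thin axis
    x_bar = data.mean(axis=0)                                     # eq (A.6)
    Z = data - x_bar
    C = Z.T @ Z / (len(data) - 1)                                 # eq (A.7)
    eigvals, U = np.linalg.eigh(C)                                # C = U L U^T, eq (A.8)
    order = np.argsort(eigvals)[::-1]                             # sort so the largest spread comes first
    eigvals, U = eigvals[order], U[:, order]
    # eigvals[0] is roughly 9 here: the variance along the long axis of the cloud.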
Bibliography

[1] David A. Forsyth and Jean Ponce. Computer Vision: A Modern Approach. Prentice Hall, New Jersey,
2003.
[2] Gene H. Golub and Charles F. Van Loan. Matrix computations. Johns Hopkins, 1983.
[3] R.C. Gonzalez and R.E. Woods. Digital Image Processing. Prentice-Hall, 1992.
[4] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art
of Scientific Computing. Cambridge University Press, 1998.
[5] P.H.S. Torr and A.W. Fitzgibbon. Invariant fitting of two view geometry or "in defiance of the 8
point algorithm". In R. Harvey and J.A. Bangham, editors, British Machine Vision Conference,
pages 83–92. BMVC, 2003.
[6] Andrew Webb. Statistical Pattern Recognition. Newnes, Oxford, 1999.

