Abstract
Visual Navigation is the key to enabling useful mobile robots to work autonomously
in unmapped dynamic environments. It involves positioning a robot by tracking the
world as it moves past a camera.
Visual Navigation is not yet widely used in the real world, as current implementations
have limitations including drift (accumulated errors), poor performance in featureless,
self-similar or dynamic environments or during rapid motion, and the need for
significant amounts of processing power. In my PhD I propose to address some of
these issues, including improving the accuracy and robustness of feature-based relative
positioning, and reducing drift without harming reliability by adapting emerging
loop closure techniques. I aim to implement a 3d Visual Navigation system to
demonstrate improvements in robustness and accuracy over existing technology.
Contents
1 Introduction
2 Current Research and Technology
2.1 Accumulated Errors
2.1.1 Bundle Adjustment
2.1.2 Sensor Integration
2.1.2.1 Inertial Navigation Systems
2.1.3 Loop Closure
2.1.3.1 Scene Recognition
2.2 Motion Models, Frame Rates and Robustness
3 PhD Plan:
3.1 Progress to date:
3.1.1 Developing techniques to reduce errors in the initial solution used to start Bundle Adjustment
3.1.1.1 Translation extraction
3.1.1.2 Unbiased Estimator of stereo point position
3.1.1.3 Future development work
3.1.2 Loop Closure
3.1.3 Demonstrating improved Visual Navigation
3.2 Future development work
3.3 Proposed research timeline and targets
4 Bibliography
1. Introduction
Visual Navigation (VN) involves working out the position of and path taken by a
human, robot or vehicle from a series of photographs taken en-route. This is
potentially a useful navigation tool in environments without satellite positioning, for
example indoors, in rugged terrain or cities, underwater, on other planets or near the
poles. It could also be used to augment satellite or inertial navigation sensors where
more precision is required, or to provide a backup navigation system.
VN is a major component of Visual Simultaneous Localisation and Mapping (VSLAM). SLAM is the process by which a mobile robot can autonomously map and
position itself within an unknown environment. Solving this problem is a major
challenge in robotics and is heavily researched. It will enable robots to perform tasks
that require navigating unknown environments (for example driving, delivering items,
domestic tasks, military operations, or working in hazardous environments).
VN is likely to become a widely used navigation solution once these problems are
solved, because cameras (and microprocessors) are already common on robots and
portable computers (e.g. phones). Cameras have many additional uses, are relatively
cheap, don't interfere with other sensors (they are passive if additional lighting is not
needed), are not easily misled, and require no additional infrastructure (such as base
stations or satellites). In theory VN will work in any environment with enough light
and texture for static features to be identified. Humans manage to navigate using
predominantly stereo vision whenever there is enough light to see our surroundings
(although we also integrate acceleration information from our ears, feedback from our
limbs and our understanding of what we are seeing: for example we know the scale of
objects in our environment and can identify those that are moving with respect to
others).
These applications usually require a near-real time response to what has been seen.
Ideally we need each photo or video frame to be processed rapidly enough that the
robot does not have to stop while calculation of its position catches up, and so it can
react to its change in position (to stop moving, having reached a goal, or to re-plan its
path given new information on its environment). To avoid losing our position the
frame rate must be high enough that there is significant overlap between consecutive
frames. Real-time VN systems have been demonstrated in controlled environments
(mainly two-dimensional and indoors), which limits their usefulness. The aim of my
research will be to improve current VN technology so that it can be used in less
constrained environments, with the aim of developing a real-time VN system robust
enough for real-world applications.
2. Current Research and Technology
the error in position was about 100m, before the error was greatly reduced by loop
closure. The algorithm did not quite run in real-time and did not use bundle
adjustment.
A video of the work in [3] shows visual navigation and re-localisation indoors using a
mono handheld camera. This works in real-time in 3d but can only track 6-10 features
and regularly loses its position (having tracked fewer than 3 points between two
frames) until it recognises features it has seen before.
[write about Oxford V-SLAM video]
The V-SLAM and VN implementations described above work like this:
1. Take photograph(s)
2. Extract 2d or 3d features from mono image, or stereo image pair
3. Find correspondences between these features with features in previous
frame(s)
4. If this image is different to recent ones, and sufficiently distinctive, attempt
Loop Closure:
a. Search for similar scenes seen in the past
b. Test whether they are likely to be the same (could we be in the same
position?)
5. Estimate displacement from previous frame(s)
Stage 2 usually uses Harris corners or similar point features.
Stage 3 either tracks features or uses a projective transform-invariant feature
descriptor (SIFT [4] or SURF [5]) to find correspondences.
Stage 5 involves a point-alignment algorithm (some form of Procrustes alignment, or
the 3- or n-point algorithm), or a motion model, to estimate the translation and
rotation. RANSAC may be used to remove outliers, and Bundle Adjustment [6] may
be used to refine the position estimate. Measurements from an odometer or INS may
be incorporated (usually using an Extended Kalman Filter).
Repeating this algorithm for every frame gives us our position.
An alternative approach, from Structure from Motion research, is to use Optical Flow
[7]. This requires a high frame rate and depends on tracking many features across an
image. Optical flow has been combined with feature tracking to make use of distant
features by [8], and for mobile robot navigation [9, 10], but it is generally used
because it is fast to compute rather than for its accuracy. However, when stereo depth
disparities are an issue it may outperform feature-based algorithms, as depth is not
used.
[Neural nets???]
2.1 Accumulated Errors
VN implementations suffer badly from small errors accumulated over many steps.
There are several ways of reducing this error: we can use Bundle Adjustment to refine
our position estimate, we can integrate a complementary sensor (such as an inertial
sensor), and/or we can recognise and use the position of places we've seen before
(loop closure).
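The scale of the problem can be demonstrated with a toy random-walk simulation (an illustrative Python sketch with assumed noise levels, not taken from any system discussed here): integrating many slightly noisy relative displacement estimates gives a position error that grows with the number of steps, roughly as their square root for independent zero-mean errors.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_drift(n_steps, sigma=0.01, n_trials=200):
    """Mean final position error after integrating n_steps noisy unit steps."""
    # True path: n_steps unit steps along x. Each estimated relative
    # displacement has independent zero-mean Gaussian noise in x and y.
    noise = rng.normal(0.0, sigma, size=(n_trials, n_steps, 2))
    error = noise.sum(axis=1)               # accumulated estimation error
    return np.linalg.norm(error, axis=1).mean()

e10, e1000 = mean_drift(10), mean_drift(1000)
print(e10, e1000)  # error grows roughly as sqrt(n_steps)
```

Even with very accurate relative estimates the absolute error grows without bound, which is why the correction mechanisms below matter.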
2.1.1 Bundle Adjustment
As a first approximation we estimate the transformation from the previous frame.
There are closed form (i.e. constant time) linear methods that give us a best estimate
(in some sense) of the transformation between two sets of features. These include the
N-point algorithm [11] and point pattern (Procrustes) alignment [12]. However we
can improve on this estimate if we have matched features across more than two
frames. Bundle Adjustment (BA) [6] is the technique used in photogrammetry to
estimate the scene structure and extrinsic camera parameters (position and orientation)
given a sequence of frames. BA was shown to improve VN accuracy by [13].
BA estimates the scene structure and extrinsic camera parameters that minimise a
function of the re-projection error. Re-projection error is the distance within the image
between features we observed and corresponding features projected from our
estimated structure into the image. We can explicitly calculate this error whereas we
cannot explicitly calculate error in our structure/camera position.
BA is an iterative algorithm: each step consists of estimating the local gradient
(Jacobian) of the objective function, and hence choosing a vector in the direction of
this gradient that we can add to our previous solution to reduce the value of the
objective function. The slow part of this is inverting a function of the Jacobian. The
objective function is a function of all points and camera locations being estimated.
Note that most of these are uncorrelated (e.g. if a camera doesn't see a point then the
point's position has no effect on our estimated camera position), so the Jacobian is
sparse.
Bad correspondences have a large effect on the error function if it assumes Gaussian or
similar errors (a reasonable assumption otherwise). We can deal with a few
mismatched points as outliers by choosing a "Gaussian plus outliers" (i.e. heavy-tailed)
error model, but we are better off removing mismatches first. A RANSAC
method is normally used to choose a large set of inliers.
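A hypothetical sketch of that RANSAC step (simplified to a pure 2d translation, so the minimal sample is a single correspondence; all parameters here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correspondences: 40 inliers related by a pure translation
# plus slight measurement noise, and 10 gross mismatches.
true_t = np.array([2.0, -1.0])
src = rng.uniform(0.0, 10.0, size=(50, 2))
dst = src + true_t + rng.normal(0.0, 0.02, size=(50, 2))
dst[40:] = rng.uniform(0.0, 10.0, size=(10, 2))    # bad correspondences

def ransac_translation(src, dst, iters=100, tol=0.1):
    """Estimate a 2d translation between matched point sets, rejecting outliers."""
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]                        # model from a minimal sample
        inliers = np.linalg.norm(dst - (src + t), axis=1) < tol
        if inliers.sum() > best.sum():
            best = inliers
    t = (dst[best] - src[best]).mean(axis=0)       # refine on the consensus set
    return t, best

t_est, inliers = ransac_translation(src, dst)
print(t_est, inliers.sum())
```

The same pattern applies with rotation included, where the minimal sample becomes three correspondences (as in the 3-point algorithm discussed later).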
BA is usually used in mapping/structure-from-motion calculations with large
numbers of overlapping frames and large numbers of features. Applying BA to all
frames and correspondences at once takes much too long for real-time applications
[14]. For a real-time implementation (i.e. each frame is processed in constant time) we
must limit the number of frames. In theory this is a very good approximation to BA
over the entire sequence for VN, as we are unlikely to have correspondences over
more than a few consecutive frames (e.g. 2-10) when moving; therefore (if we sort the
camera positions and points appropriately, e.g. by order of appearance) the whole
Jacobian matrix will have a band-diagonal structure.
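This sparsity argument can be checked on a toy one-dimensional analogue (an illustrative Python sketch, not real bundle adjustment): scalar "cameras" and "points", each point tracked over a few consecutive frames, with residuals r = p_i − c_j. Each residual depends on exactly one camera and one point, so the Jacobian is overwhelmingly sparse.

```python
import numpy as np

n_cams, pts_per_cam, track_len = 20, 5, 3

# Toy 1d model: residual r = p_i - c_j depends only on point i and camera j.
obs = []                                   # list of (point index, camera index)
pid = 0
for j in range(n_cams):
    for _ in range(pts_per_cam):           # points first seen in frame j...
        for k in range(min(track_len, n_cams - j)):
            obs.append((pid, j + k))       # ...tracked over consecutive frames
        pid += 1
n_pts = pid

# Variables: [c_0..c_19, p_0..p_99]; each row of J has exactly two nonzeros.
J = np.zeros((len(obs), n_cams + n_pts))
for row, (i, j) in enumerate(obs):
    J[row, j] = -1.0                       # d r / d c_j
    J[row, n_cams + i] = 1.0               # d r / d p_i

sparsity = (J == 0.0).mean()
print(sparsity)
```

In a real system each camera and point contributes a small block rather than a scalar, but the zero pattern, and hence the cost saving when inverting the normal equations, is the same.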
uncertainty. This appears to work well when the robot position is reasonably well
known (e.g. using a wheeled robot with an odometer). However it is most useful to be
able to re-localise ourselves when there is the greatest uncertainty in our position.
Therefore it would be more useful to recognise places we've been before without
needing to know the approximate position of that place, as a lost person does when
they recognise somewhere familiar. It is very easy to lose our position entirely with
vision (for example if all we can see for a frame is a blank wall, or a light is switched
off), so re-localisation is particularly useful once we can see again (although our
orientation may be lost entirely, a bound on our velocity gives a rapidly growing
bound on our position).
If we perform loop closure for VN using scene similarity, we don't need to maintain a
large map of features. Updating the map is the main computational cost in SLAM.
Maps have advantages for VN though: a small local map can help us keep track of
nearby features that may re-enter our field of view [20] (which helps reduce problems
caused by occlusion), and the positions of many previous scenes can be refined after
closing multiple loops. A database of scenes and their positions is essentially the same
as a map, although updating the positions of many scenes, as we would using a
Kalman Filter in SLAM on loop closure, would be less straightforward.
2.1.3.1 Scene Recognition
Object Recognition is a key problem in Artificial Intelligence. We are concerned with
the related problem of recognising two scenes that are identical, possibly from
different viewpoints (i.e. identical up to a perspective transformation and occlusion)
and distinctive.
The basic method used for scene recognition is to encode some property of the scene
as a descriptor that can be added to a database. These should (ideally) be invariant to
occlusion and perspective transformations. We recognise a scene when we find a
scene in the database with a descriptor that is close enough to the current image by
some metric.
SIFT feature descriptors [21], words on signs read using OCR [22], and 2d
stereo/laser scan profiles [18] have been used as descriptors for navigation.
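The database lookup itself can be sketched as follows (hypothetical Python; the toy descriptor here is just a normalised intensity histogram, far less distinctive than the SIFT or scan-profile descriptors above, and the threshold is an assumed value):

```python
import numpy as np

rng = np.random.default_rng(2)

def descriptor(image, bins=8):
    """Normalised intensity histogram: a crude whole-scene descriptor."""
    h, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return h / h.sum()

def match(desc, database, threshold=0.5):
    """Index of the closest stored scene if within an L1 threshold, else None."""
    if not database:
        return None
    dists = [np.abs(desc - d).sum() for d in database]
    best = int(np.argmin(dists))
    return best if dists[best] < threshold else None

# Two distinct "scenes" as synthetic intensity images.
scene_a = rng.uniform(0.0, 0.5, size=(32, 32))   # a dark scene
scene_b = rng.uniform(0.5, 1.0, size=(32, 32))   # a bright scene
db = [descriptor(scene_a), descriptor(scene_b)]

# Revisiting scene A with slightly different measurements.
revisit = np.clip(scene_a + rng.normal(0.0, 0.01, scene_a.shape), 0.0, 1.0)
print(match(descriptor(revisit), db))
```

The choice of threshold embodies the trade-off discussed next: too loose and common scenes are falsely recognised, too tight and genuine revisits are missed.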
Loop closure can also cause us to lose our position if we recognise a scene incorrectly.
Therefore it is important that what we see is distinctive and discriminative, so we
don't incorrectly recognise scenes that are common in our environment. We would
normally do this by clustering descriptors; descriptors in a large cluster are less
distinctive than those in a small cluster [23]. The Bag of Words algorithm has been
applied to SIFT descriptors to identify discriminative combinations of descriptors [24,
25]. For example when navigating indoors, window corners are common so are not
good features to identify scenes with. Features found on posters or signs are much
better, although even these may be repeated elsewhere.
Scenes with very little detail are more likely to be falsely recognised than those with
more detail (which are less likely to happen to have the same property that we are
recognising).
[26] demonstrates real-time loop detection using a hand-held mono camera, using
SIFT features and histograms (of intensity and hue) combined using a Bag of Words
approach.
[27] also demonstrated real-time loop closure outdoors using SIFT features and laser
scan profiles. Much work to remove visually ambiguous scenes was needed, and
more complex profiles were preferred to provide more discriminative features.
The geometry of scenes has not often been used to recognise them, even though it is
likely that people use scene geometry. This is probably because of a lack of geometric
properties that are invariant under perspective transformations. In a 2d projection
ratios of distances, angles, and ordering along lines are not preserved. The epipolar
constraint does define an invariant, but it is determined by seven feature matches (up
to a small number of possibilities), so eight or more points are needed for it to
validate or invalidate a possible correspondence.
Simple geometric constraints have been used to eliminate triples of bad
correspondences [28], but the key assumption here is that points lie on a convex
surface, i.e. there is no occlusion, which is not a good assumption for real world
navigation applications.
If suitable descriptors can be found then geometric constraints would be very useful
for identifying distinctive scenes in an environment made up of different
arrangements of similar components.
2.2 Motion Models, Frame Rates and Robustness
3 PhD Plan:
I will develop ways of speeding up VN that uses BA to refine position estimates, by
identifying existing algorithms, and developing new ones, that lead to more accurate
initial solutions and robust outlier rejection (reducing the number of iterations
required and the probability of finding false minima). I will compare different
approaches using simulated data to determine the best one.
I will investigate ways of improving the reliability of fast loop closure algorithms. I
would like to incorporate image geometry into algorithms that match feature points,
either as part of a descriptor or to validate matches. I will also investigate
conditioning.
I will implement VN software that I hope will provide real-world verification of the
algorithms I have developed. Real-time navigation systems and loop closure have
been demonstrated before, but always with severe restrictions, e.g. on the number of
points tracked or maximum angular velocities, or restricted to 2d or relying on a level
ground plane.
3.1 Progress to date:
3.1.1 Developing techniques to reduce errors in the initial solution used to start
Bundle Adjustment
1. [12] derived expressions for the translation and rotation (and scale) between a
set of point correspondences that minimise the square of the error.
2. [29] takes the Singular Value Decomposition of a matrix formed from matrix
products of points and discards the diagonal factor to give an orthogonal
rotation matrix.
An alternative approach would be to use the SVD to give a matrix that is the best
least-squares transformation matrix between two point sets. Then we can use either the
method in (2) (taking a second SVD of this 3x3 matrix), or the algorithm of [30]
to find the closest rotation to this matrix. After getting identical simulated results
when comparing these methods, I have proved that they are equivalent.
I have implemented a MATLAB program to simulate noisy stereo image data
(projecting points onto images, adding Gaussian measurement noise, calculating 3d
structure). I can compare these approaches with each other. Initial results show that
the first method is slightly better (mean error approximately 5% lower) than the
second given 5-15 points with mean re-projection error 0.01 radians (approximately 2
pixels for a typical camera), but the second is significantly better given more exact
data (mean re-projection error 0.002 radians).
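A Python analogue of the SVD approach in (2) is sketched below (hypothetical illustrative code, not my MATLAB program): the rotation is recovered from the SVD of the cross-covariance of two centred point sets, discarding the diagonal factor, and the translation then follows from the centroids.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_rotation(P, Q):
    """Least-squares rotation R with Q ~ P @ R.T, for centred point sets,
    from the SVD of the cross-covariance (discarding the diagonal factor)."""
    U, _, Vt = np.linalg.svd(Q.T @ P)
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# Ground-truth motion: a rotation about z plus a translation.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.2, 1.0])

P = rng.normal(size=(10, 3))                                  # 3d points, frame 1
Q = P @ R_true.T + t_true + rng.normal(0.0, 0.001, (10, 3))   # frame 2, noisy

R_est = fit_rotation(P - P.mean(axis=0), Q - Q.mean(axis=0))
t_est = Q.mean(axis=0) - P.mean(axis=0) @ R_est.T   # translation from centroids
print(np.abs(R_est - R_true).max())
```

The final line computing t_est is the Procrustes translation of method (3) in the next section: the difference of centroids after the rotation has been applied.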
3.1.1.1 Translation extraction
The three widely used methods of extracting translation are:
1. The 3-point algorithm [31]: Three non-collinear point correspondences
determine a small finite set of possible new camera positions. We can solve
this exactly, then disambiguate between solutions using a fourth
correspondence.
2. The N-point algorithm [11]. This is the generalisation of (1) to n > 3 points,
making use of the fact that the problem is overdetermined. These algorithms
can position a camera relative to known 3d structure given one 2d image
alone. The only information about the second point set used is the angles
between points. If we have a stereo pair we can use angles between 3d points
(this is effectively an average of the angles from the two images, so we would
expect a slightly reduced error), and also improve on distance estimates
between point pairs (by taking some sort of average again).
3. Procrustes Alignment (difference of centroids of point sets) between 3d point
sets after the rotation. Unlike (1) and (2) this method is directly affected by the
accuracy of rotation estimates.
My simulations can currently use (1) or (3) and I am working on incorporating (2).
3.1.1.2 Unbiased Estimator of stereo point position
3d point positions are normally calculated from stereo correspondences by finding the
intersection of rays projected through each point in the image, as described by [32].
As there is noise in image measurements these lines do not generally intersect at a
point, so instead we take the midpoint of the points where the lines are closest
together. In general this is a good approximation, and it is an unbiased estimator of the
position if the rays are perpendicular. However for more distant points the PDF of the
true position is highly asymmetric and is not centred on this position. The proofs that
algorithms for doing Procrustes alignment are optimal make the assumption that
errors are independent Gaussian RVs with zero mean. However this does not appear
to be a good assumption when points come from stereo.
Independence is a reasonable assumption if calibration errors are small, so errors
come mainly from extracting points from pixels in a photograph. Algorithms to
extract points are generally symmetric and extract individual points independently. If
we assume these errors are Gaussian then the error in the plane perpendicular to the
direction the stereo rig is pointing will be approximately Gaussian and will have zero
mean (it is determined by the point where these rays intersect this plane at whatever
depth we have calculated for our point).
If we adjust the depth calculated by the procedure above so that the adjusted point is
at the expected value of its depth it will be at its expected true position, so will be an
unbiased estimator of this position. The distribution of possible positions about the
adjusted position is now closer in some sense to a Gaussian distribution, so it is
intuitive that the algorithms described above that assume this will perform better
given adjusted points.
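This bias is easy to exhibit by Monte Carlo (a hypothetical Python sketch using the simpler depth-from-disparity model z = f·b/d with assumed camera parameters, rather than the angular analysis below): zero-mean Gaussian disparity noise produces a mean reconstructed depth greater than the true depth, and the bias grows rapidly as depth increases relative to the baseline.

```python
import numpy as np

rng = np.random.default_rng(4)

f, b = 500.0, 0.1     # assumed focal length (pixels) and baseline (metres)
sigma = 0.5           # assumed disparity noise (pixels)

def mean_reconstructed_depth(true_depth, n=200_000):
    """Monte Carlo mean of depths reconstructed from noisy disparities."""
    disp_true = f * b / true_depth                  # true disparity (pixels)
    disp = disp_true + rng.normal(0.0, sigma, n)    # zero-mean pixel noise
    disp = disp[disp > 0.5]                         # drop unusable measurements
    return (f * b / disp).mean()                    # depth = f*b/disparity

near_bias = mean_reconstructed_depth(2.0) - 2.0     # disparity 25 px
far_bias = mean_reconstructed_depth(20.0) - 20.0    # disparity 2.5 px
print(near_bias, far_bias)
```

Because 1/x is convex, averaging over symmetric disparity noise inflates the expected depth; the adjustment derived below corrects for exactly this asymmetry.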
I have calculated the expected depth of points given their reconstructed positions:
d_adj = ∫_l^∞ [ b·d_m / (b + 2e·d_m) ] φ(e) de,    with l > −b / (2·d_m),
where b is the stereo baseline, d_m is the measured depth, e is the image
measurement error (assumed zero-mean Gaussian) with density φ(e), and l is a lower
limit high enough that we are not considering points that we couldn't possibly see as
they are behind the camera.
[TODO: should that be root 2? Sum of 2 NDs]
This can be pre-computed and stored in a lookup table, so is relatively fast. Several
small-angle assumptions are made, so estimates derived from simulated data might be
more accurate. I will try this approach.
3d points can be shifted by this amount in the direction the stereo rig is pointing so
that the distribution of their true position is centred on the adjusted point. Preliminary
results suggest a small improvement in accuracy that is insignificant (about 2.5%)
unless very noisy data is used, or the point depth is more than ten times the stereo
baseline, when it gives a significant improvement in accuracy. More simulations are
necessary to establish when this is useful.
3.1.1.3 Future development work
1) I will use and extend my MATLAB program to compare different first
approximations to transformations between images.
2) I will investigate incorporating Bundle Adjustment into my MATLAB program to
test whether accurate starting positions really do save time and reduce the likelihood
of finding false minima.
3) I will investigate whether the reconstructed structure given by BA also suffers from
skewed stereo distributions. It clearly does in the case of points occurring in only two
images, as the reconstructed point described above is the one that minimises the
re-projection error. A function of the re-projection error that is minimised at the
optimal point would have to decrease with increasing total re-projection error (close
to where the optimal reconstruction would be with the current approach), so this is
unlikely to be a suitable approach. Post-processing the results may be possible
instead.
3.1.2 Loop Closure
I will investigate using the geometry of scenes to either search for matching scenes, or
to validate matches based on distinctive point combinations. I have already started
investigating ways of validating matches without aligning points by examining the
direction of vectors between correspondences.
I will attempt to analyse projective-invariant features of point sets in order to describe
scenes in terms of their geometry. The aim is to find a descriptor (such as a set of
orderings) that will discriminate between different scenes.
I will investigate conditioning based on either descriptor frequencies or descriptor
combination frequencies, and ways of partitioning distinctive descriptor sets. This
may involve finding existing techniques that are suitable for loop closure, or
developing new techniques.
3.1.3 Demonstrating improved Visual Navigation
I have started to implement a navigation algorithm using C++ and the OpenCV library
and a stereo webcam pair.
At the moment features are detected and good correspondences are found within
stereo pairs. The epipolar constraint and match condition numbers are used to speed
up matching and eliminate bad matches. 3d positions are calculated but are not yet
working correctly.
Features are matched with the previous frame. A RANSAC/Procrustes alignment
algorithm for relative motion is implemented but untested due to lack of good 3d
points.
3.2 Future development work
I will fix and extend my navigation software to give motion estimates using the
navigation algorithm identified by simulation. I will aim to show that this gives a
good enough initial estimate to allow BA to refine the position estimate in real-time.
Fast BA code is available in the sba library for this purpose. Other potential areas to
research include the trade-off between a high frame rate (features are tracked over
many frames but not much time is available to process each frame) and spending
longer refining estimates from less frequent, possibly higher resolution, frames.
Generic relative-positioning techniques across multiple frames will enable me to
incorporate loop closure into this software, by positioning relative to frames from the
same position in the past. The most accurate algorithms for navigation at the moment
do not incorporate BA. Hopefully by identifying and refining the most appropriate
algorithms it will be possible to do this in real-time.
Learning
I will need to learn more Bayesian statistics and statistical geometry. This will allow
me to understand and select existing methods and to develop new solutions for the
Loop Closure problem. I will do this primarily through reading.
I will attend Mathematics undergraduate lecture courses on optimisation (MATH412-07S1),
geometry (MATH407-07S1) and calculus (MATH264-07S1) to extend my
pure mathematical knowledge to more applied fields. This will help me understand
the concepts underlying BA and other geometric algorithms, and to develop and adapt
them. For example I am currently investigating whether it is beneficial for VN to
adapt BA to minimise reconstruction errors rather than re-projection errors, which
requires an understanding of robust optimisation.
As the field of VN is moving rapidly and significant advances are likely I will
continue my literature review, paying particular attention to forthcoming conference
proceedings and the activities of groups working on VN, V-SLAM and image
recognition, including the following:
Key computer vision conferences:
3.3 Proposed research timeline and targets
2016
September: Complete investigation of approximate transformation extraction
techniques from point sets.
October: Learn sufficient Bayesian statistics to be able to adapt categorisation and
discrimination (conditioning) techniques to the problems of VN.
4 Bibliography
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
2004.
26. Filliat, D. A visual bag of words method for interactive qualitative localization
and mapping. in International Conference on Robotics and Automation (ICRA). 2007.
27. Ho, K.L. and P. Newman, Detecting loop closure with scene sequences.
International Journal of Computer Vision, 2007. 74(3): p. 261-286.
28. Xiaoping, H. and N. Ahuja, Matching point features with ordered geometric,
rigidity, and disparity constraints. Pattern Analysis and Machine Intelligence,
IEEE Transactions on, 1994. 16(10): p. 1041-1049.
29. Shinji, U., Least-Squares Estimation of Transformation Parameters Between
Two Point Patterns. IEEE Trans. Pattern Anal. Mach. Intell., 1991. 13(4): p. 376-380.
30. Bar-Itzhack, I.Y., New Method for Extracting the Quaternion from a Rotation
Matrix. Journal of Guidance, Control, and Dynamics, 2000. 23(6).
31. Haralick, R.M., et al., Review and Analysis of Solutions of the 3-Point
Perspective Pose Estimation Problem. International Journal of Computer
Vision, 1994. 13(3): p. 331-356.
32. Hartley, R. and A. Zisserman, Multiple View Geometry in Computer Vision.
Second ed. 2004: Cambridge University Press.