
PhD Research proposal

Enabling robust visual navigation

Abstract
Visual Navigation is the key to enabling useful mobile robots to work autonomously
in unmapped dynamic environments. It involves positioning a robot by tracking the
world as it moves past a camera.
Visual Navigation is not widely used in the real world yet as current implementations
have limitations including drift (accumulated errors), poor performance in featureless,
self-similar or dynamic environments, or during rapid motion, and the need for
significant amounts of processing power. In my PhD I propose to address some of
these issues, including improving the accuracy and robustness of feature-based relative
positioning, and reducing drift without harming reliability by adapting loop closure
techniques as they are developed. I will aim to implement a 3d Visual Navigation system to
demonstrate improvements in robustness and accuracy over existing technology.

Contents
1 Introduction
2 Current Research and Technology
2.1 Accumulated Errors
2.1.1 Bundle Adjustment
2.1.2 Sensor Integration
2.1.2.1 Inertial Navigation Systems
2.1.3 Loop Closure
2.1.3.1 Scene Recognition
2.2 Motion Models, Frame Rates and Robustness
3 PhD Plan
3.1 Progress to date
3.1.1 Developing techniques to reduce errors in the initial solution used to start Bundle Adjustment
3.1.1.1 Translation extraction
3.1.1.2 Unbiased Estimator of stereo point position
3.1.1.3 Future development work
3.1.2 Loop Closure
3.1.3 Demonstrating improved Visual Navigation
3.2 Future development work
3.3 Proposed research timeline and targets
4 Bibliography

1 Introduction
Visual Navigation (VN) involves working out the position of and path taken by a
human, robot or vehicle from a series of photographs taken en-route. This is
potentially a useful navigation tool in environments without satellite positioning, for
example indoors, in rugged terrain or cities, underwater, on other planets or near the
poles. It could also be used to augment satellite or inertial navigation sensors where
more precision is required, or to provide a backup navigation system.
VN is a major component of Visual Simultaneous Localisation and Mapping (V-SLAM). SLAM is the process by which a mobile robot can autonomously map and
position itself within an unknown environment. Solving this problem is a major
challenge in robotics and is heavily researched. It will enable robots to perform tasks
requiring them to navigate unknown environments (for example so they can drive or
deliver items, for domestic tasks, in military operations, in hazardous environments).
VN is likely to become a widely used navigation solution once these problems are
solved, because cameras (and microprocessors) are already common on robots and
portable computers (e.g. phones). Cameras have many other uses, are relatively cheap,
don't interfere with other sensors (they are passive if additional lighting is not needed),
are not easily misled, and require no additional infrastructure (such as base stations or
satellites). In theory VN will work in any environment where there is enough light
and texture that static features can be identified. Humans manage to navigate using
predominantly stereo vision whenever there is enough light to see our surroundings
(although we also integrate acceleration information from our ears, feedback from our
limbs and an understanding of what we are seeing; for example, we know the scale of
objects in our environment and can identify those that are moving with respect to
others).
These applications usually require a near-real time response to what has been seen.
Ideally we need each photo or video frame to be processed rapidly enough that the
robot does not have to stop while calculation of its position catches up, and so it can
react to its change in position (to stop moving, having reached a goal, or to re-plan its
path given new information on its environment). To avoid losing our position the
frame rate must be high enough that there is significant overlap between consecutive
frames. Real-time VN systems have been demonstrated in controlled environments
(mainly two-dimensional and indoors), which limits their usefulness. The aim of my
research will be to improve current VN technology so that it can be used in less
constrained environments, with the aim of developing a real-time VN system robust
enough for real-world applications.

2 Current Research and Technology
The best existing VN implementations (without an inertial sensor) can position a
vehicle in 2d in a stationary outdoor environment with an accuracy of about 3% [1].
This system ran at 5-13 fps and moved slowly enough that sufficient features moved
less than 30% of the frame-width in every frame (typically 3-10%).
These results are substantially better than [2], who used V-SLAM with a single
camera to position a person in, and map, a 3d urban environment. After about 300m
the error in position was about 100m, before the error was greatly reduced by loop
closure. The algorithm did not quite run in real-time and did not use bundle
adjustment.
A video of the work in [3] shows visual navigation and re-localisation indoors using a
mono handheld camera. This works in real-time in 3d but can only track 6-10 features
and regularly loses its position (having tracked fewer than 3 points between two
frames) until it recognises features it has seen before.
[write about Oxford V-SLAM video]
The V-SLAM and VN implementations described above work like this:
1. Take photograph(s)
2. Extract 2d or 3d features from mono image, or stereo image pair
3. Find correspondences between these features with features in previous
frame(s)
4. If this image is different to recent ones, and sufficiently distinctive, attempt
Loop Closure:
a. Search for similar scenes seen in the past
b. Test whether they are likely to be the same (could we be in the same
position?)
5. Estimate displacement from previous frame(s)
Stage 2 usually uses Harris corners or similar point features.
Stage 3 either tracks features or uses a projective transform-invariant feature
descriptor (SIFT [4] or SURF [5]) to find correspondences.
Stage 5 involves a point-alignment algorithm (some form of Procrustes alignment, or
the 3- or n-point algorithm), or a motion model, to estimate the translation and
rotation. RANSAC may be used to remove outliers, and Bundle Adjustment [6] may
be used to refine the position estimate. Measurements from an odometer or INS may
be incorporated (usually using an Extended Kalman Filter).
Repeating this algorithm for every frame gives us our position.
An alternative approach, from Structure from Motion research, is to use Optical Flow
[7]. This requires a high frame rate and depends on tracking many features across
an image. Optical flow has been combined with feature tracking to make use of
distant features [8], and used for mobile robot navigation [9, 10], but it is generally
chosen because it is fast to compute rather than for its accuracy. However, when stereo
depth disparities are an issue it may outperform feature-based algorithms, as depth is
not used.
[Neural nets???]

2.1 Accumulated Errors

VN implementations suffer badly from small errors accumulated over many steps.
There are several ways of reducing this error: we can use Bundle Adjustment to refine
our position estimate, we can integrate a complementary sensor (such as an inertial
sensor), and/or we can recognise and use the position of places we've seen before
(loop closure).
2.1.1 Bundle Adjustment
As a first approximation we estimate the transformation from the previous frame.
There are closed form (i.e. constant time) linear methods that give us a best estimate
(in some sense) of the transformation between two sets of features. These include the
N-point algorithm [11] and point pattern (Procrustes) alignment [12]. However we
can improve on this estimate if we have matched features across more than two
frames. Bundle Adjustment (BA) [6] is the technique used in photogrammetry to
estimate the scene structure and extrinsic camera parameters (position and orientation)
given a sequence of frames. BA was shown to improve VN accuracy by [13].
BA estimates the scene structure and extrinsic camera parameters that minimise a
function of the re-projection error. Re-projection error is the distance within the image
between features we observed and corresponding features projected from our
estimated structure into the image. We can explicitly calculate this error whereas we
cannot explicitly calculate error in our structure/camera position.
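In the usual formulation (generic notation rather than that of [6]), BA solves

\min_{\{X_i\},\{R_j,t_j\}} \sum_{i,j} \rho\left( \left\| x_{ij} - \pi( R_j X_i + t_j ) \right\|^2 \right),

where X_i are the estimated 3d points, (R_j, t_j) are the camera orientations and positions, x_{ij} is the observed image position of point i in frame j, \pi is the camera projection function, and \rho is the error model (squared error for Gaussian noise, or a heavy-tailed robust cost as discussed below).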
BA is an iterative algorithm: each step consists of estimating the local gradient
(Jacobian) of the objective function, and hence choosing a vector in the direction of
this gradient that we can add to our previous solution to reduce the value of the
objective function. The slow part of this is inverting a function of the Jacobian. The
objective function is a function of all points and camera locations being estimated.
Note that most of these are uncorrelated (e.g. if a camera doesn't see a point then the
point's position has no effect on our estimated camera position), so the Jacobian is
sparse.
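Concretely, each iteration solves a damped normal-equation system (the Gauss-Newton/Levenberg-Marquardt form normally used for BA; the damping term \lambda is a standard addition not mentioned above):

(J^\top J + \lambda I)\,\delta = -J^\top r, \qquad \theta \leftarrow \theta + \delta,

where r is the stacked vector of re-projection residuals, \theta the stacked point and camera parameters, and J = \partial r / \partial \theta. The "function of the Jacobian" that has to be inverted is J^\top J (plus damping), and its sparsity mirrors that of J.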
Bad correspondences have a big effect on the error function if it assumes Gaussian or
similar errors (a reasonable assumption otherwise). We can deal with a few
mismatched points as outliers by choosing a "Gaussian plus outliers" (i.e. heavy-tailed) error model, but we are better off removing mismatches first. A RANSAC
method is normally used to choose a large set of inliers.
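As an illustration, here is a minimal MATLAB sketch of RANSAC inlier selection over 3d-3d correspondences; the minimal-set size of three points, the threshold and the iteration count are illustrative choices, and the rigid-fit helper is the SVD/Procrustes alignment discussed in section 3.1.1 rather than any particular published implementation.

% RANSAC inlier selection for 3d-3d correspondences (synthetic data).
rng(0);
N = 200;
P = 2*randn(3, N);                        % 3d points seen in the first frame
[Rtrue, ~] = qr(randn(3));                % random ground-truth rotation
if det(Rtrue) < 0, Rtrue(:,1) = -Rtrue(:,1); end
ttrue = [0.5; -0.2; 1.0];
Q = Rtrue*P + ttrue + 0.01*randn(3, N);   % same points in the second frame, with noise
bad = rand(1, N) < 0.3;                   % 30% gross mismatches
Q(:, bad) = Q(:, bad) + 2*randn(3, nnz(bad));

best = false(1, N);  thresh = 0.05;
for it = 1:500
    s = randperm(N, 3);                   % minimal sample: three correspondences
    [R, t] = fit_rigid(P(:, s), Q(:, s));
    resid = sqrt(sum((Q - (R*P + t)).^2, 1));
    inliers = resid < thresh;
    if nnz(inliers) > nnz(best), best = inliers; end
end
[R, t] = fit_rigid(P(:, best), Q(:, best));   % refit using all inliers
fprintf('inliers %d/%d, rotation error %.2g\n', nnz(best), N, norm(R - Rtrue, 'fro'));

function [R, t] = fit_rigid(P, Q)
% Least-squares rigid alignment Q ~ R*P + t (SVD/Procrustes, see section 3.1.1).
cP = mean(P, 2);  cQ = mean(Q, 2);
[U, ~, V] = svd((Q - cQ)*(P - cP)');
R = U*diag([1 1 det(U*V')])*V';
t = cQ - R*cP;
end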
BA is usually used in mapping/structure from motion calculations with large
numbers of overlapping frames and large numbers of features. This takes much too
long for real-time applications, as would applying BA to all frames and
correspondences at once [14]. For a real-time implementation (i.e. each frame is
processed in constant time) we must limit the number of frames. For VN this is in theory a
very good approximation to BA over the entire sequence, as we are unlikely to
have correspondences over more than a few consecutive frames (e.g. 2-10) when
moving; therefore (if we sort the camera positions and points appropriately, e.g. by
order of appearance) the whole Jacobian matrix will have a band-diagonal structure.

Using BA on short sequences of frames corresponds to inverting short sections of this
band diagonal at a time.
Errors will still accumulate as we gain and lose features over the series of images in
the BA; however, in some ways BA gives the best motion estimate we can make from
a sequence of images of tracked points.
To start BA it is desirable to have a reasonably good approximation to the actual
solution to minimise the number of iterations needed to get close enough to the
minimum, and to reduce the probability of converging to false minima (which are
common, especially if there are mismatched points).
2.1.2 Sensor Integration
A different navigation sensor can be integrated with VN. Normally two position
estimates would be combined using a Kalman filter. This gives a best estimate of the
new position given the previous position and multiple motion estimates (assuming
Gaussian errors with known covariance), and keeps track of the estimated
accumulated error. Kalman filters are widely used for maintaining SLAM position and
map estimates, and incorporating an additional sensor is straightforward. A good
implementation would cope with position estimate failures (e.g. we can't track enough
points between two images, or we lose our GNSS signal). Particle filters [15] are
another, less popular, option that can be better at dealing with erroneous input, or with data
that is not well approximated by a Normal distribution. It appears to be hard, however, to make
real-time particle filter implementations that provide the accuracy of Kalman Filter SLAM [16].
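As a toy illustration of this fusion step, here is a minimal MATLAB sketch of a 1d constant-velocity Kalman filter that folds in two independent position measurements per step; the noise levels and motion are illustrative and not taken from any cited system.

% 1d constant-velocity Kalman filter fusing two position measurements per step.
dt = 0.1;  T = 200;
F = [1 dt; 0 1];                          % state transition (position, velocity)
Qn = diag([1e-4 1e-3]);                   % process noise covariance
H = [1 0; 1 0];                           % both sensors observe position only
Rn = diag([0.05^2 0.2^2]);                % sensor 1 (e.g. VN) more accurate than sensor 2
x = [0; 0];  Pcov = eye(2);
xtrue = [0; 0.5];
for k = 1:T
    xtrue = F*xtrue;                                  % simulate true motion
    z = H*xtrue + sqrt(diag(Rn)).*randn(2, 1);        % two noisy position readings
    x = F*x;  Pcov = F*Pcov*F' + Qn;                  % predict
    K = Pcov*H'/(H*Pcov*H' + Rn);                     % Kalman gain
    x = x + K*(z - H*x);  Pcov = (eye(2) - K*H)*Pcov; % update with both measurements
end
fprintf('final position error: %.3f (true %.2f, est %.2f)\n', ...
        abs(xtrue(1) - x(1)), xtrue(1), x(1));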
Odometry (for example on the Mars Rover [17]), GNSS (for example to navigate
missiles), laser scanners (to identify profiles as recognisable descriptors [18]) and
inertial navigation sensors have been integrated with VN.
2.1.2.1 Inertial Navigation Systems
INS is the most popular positioning system used to augment vision. Like vision it
suffers from drift over time; currently it is much more accurate than VN. As vision
measures motion and an INS measures acceleration, vision can be used to measure
and correct INS drift if it is detected that the camera is stationary (or moving very
slowly); this is one time when VN is highly accurate compared to inertial
navigation.
Various teams have shown that inertial position estimates can be improved using
vision, including [19], who showed integrating vision with a low quality IMU reduced
errors to those of an expensive higher quality tactical IMU. The position error is still
unbounded in the long term.
2.1.3 Loop Closure
A useful way of bounding position errors (within a finite region) is to recognise places
we have been before; we then know our position relative to a previous position (and
if we are using an INS we can correct for the drift over the previous loop). This is known
as Loop Closure, and is the key to building accurate maps in SLAM.
The problem is usually tackled in SLAM by looking (visually, or with range-finders)
for landmarks that we might be able to see given our current pose and pose
uncertainty. This appears to work well when the robot position is reasonably well
known (e.g. using a wheeled robot with an odometer). However, it is most useful to be
able to re-localise ourselves when there is the greatest uncertainty in our position.
Therefore it would be more useful to recognise places we've been before without
needing to know the approximate position of that place, as a lost person does when
they recognise somewhere familiar. It is very easy to lose our orientation entirely with
vision (for example if all we can see for a frame is a blank wall, or a light is switched
off), so re-localisation is particularly useful when we can see again (although we can
easily lose our orientation entirely, a bound on our velocity can give a rapidly growing
bound on our position).
If we perform loop closure for VN using scene similarity, we don't need to maintain a
large map of features. Updating the map is the main computational cost in SLAM.
Maps have advantages for VN though: a small local map can help us keep track of
nearby features that may re-enter our field of view [20] (which helps reduce problems
caused by occlusion), and the positions of many previous scenes can be refined after
closing multiple loops. A database of scenes and their positions is essentially the same
as a map, although updating the positions of many scenes, as we would using a
Kalman Filter in SLAM on loop closure, would be less straightforward.
2.1.3.1 Scene Recognition
Object Recognition is a key problem in Artificial Intelligence. We are concerned with
the related problem of recognising two scenes that are identical, possibly from
different viewpoints (i.e. identical up to a perspective transformation and occlusion)
and distinctive.
The basic method used for scene recognition is to encode some property of the scene
as a descriptor that can be added to a database. These should (ideally) be invariant to
occlusion and perspective transformations. We recognise a scene when we find a
scene in the database whose descriptor is close enough to that of the current image by
some metric.
SIFT feature descriptors [21], words on signs read using OCR [22], and 2d
stereo/laser scan profiles [18] have been used as descriptors for navigation.
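A minimal MATLAB sketch of the database lookup described above, using random unit vectors in place of real descriptors and Euclidean distance as the metric (the descriptor length and threshold are illustrative choices):

% Recognise a scene by finding the nearest stored descriptor in a database.
rng(2);
D = 128;                                  % descriptor length (SIFT-sized, for illustration)
db = randn(500, D);                       % database: one descriptor per stored scene
db = db ./ sqrt(sum(db.^2, 2));           % normalise to unit length
query = db(137, :) + 0.05*randn(1, D);    % a noisy re-observation of scene 137
query = query / norm(query);

dists = sqrt(sum((db - query).^2, 2));    % Euclidean distance to every stored scene
[dmin, idx] = min(dists);
threshold = 0.5;                          % "close enough" cut-off
if dmin < threshold
    fprintf('recognised scene %d (distance %.3f)\n', idx, dmin);
else
    fprintf('no match (closest distance %.3f)\n', dmin);
end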
Loop closure can also cause us to lose our position if we recognise a scene incorrectly.
Therefore it is important that what we see is distinctive and discriminative, so we
don't incorrectly recognise scenes that are common in our environment. We would
normally do this by clustering descriptors; descriptors in a large cluster are less
distinctive than those in a small cluster [23]. The Bag of Words algorithm has been
applied to SIFT descriptors to identify discriminative combinations of descriptors [24,
25]. For example when navigating indoors, window corners are common so are not
good features to identify scenes with. Features found on posters or signs are much
better, although even these may be repeated elsewhere.
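A minimal sketch of the Bag of Words idea on synthetic descriptors: cluster training descriptors into "visual words" with k-means, represent each scene as a word histogram, and compare histograms by cosine similarity. The vocabulary size and data are illustrative, and kmeans and knnsearch come from the Statistics and Machine Learning Toolbox.

% Bag of Words: quantise descriptors into words, compare word histograms.
rng(3);
D = 32;  K = 50;                                    % descriptor length, vocabulary size
train = randn(2000, D);                             % descriptors from many training images
[~, vocab] = kmeans(train, K);                      % visual word centres

sceneA = randn(120, D);                             % descriptors from two scenes
sceneB = [sceneA(1:80, :) + 0.05*randn(80, D); randn(40, D)];  % partly overlapping view

hist_of = @(desc) accumarray(knnsearch(vocab, desc), 1, [K 1]) / size(desc, 1);
hA = hist_of(sceneA);  hB = hist_of(sceneB);

similarity = hA'*hB / (norm(hA)*norm(hB));          % cosine similarity of word histograms
fprintf('histogram similarity: %.2f\n', similarity);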
Scenes with very little detail are more likely to be falsely recognised than those with
more detail (which are less likely to happen to have the same property that we are
recognising).

[26] demonstrates real-time loop detection using a hand-held mono camera, using
SIFT features and histograms (of intensity and hue) combined using a Bag of Words
approach.
[27] also demonstrated real-time loop closure outdoors using SIFT features and laser
scan profiles. Much work to remove visually ambiguous scenes was needed, and
more complex profiles were preferred to provide more discriminative features.
The geometry of scenes has not often been used to recognise them, even though it is
likely that people use scene geometry. This is probably because of a lack of geometric
properties that are invariant under perspective transformations. In a 2d projection
ratios of distances, angles, and ordering along lines are not preserved. The epipolar
constraint does define an invariant feature, but this is defined by seven feature
matches (up to a small number of possibilities) so eight or more points are needed for
this to validate or invalidate a possible correspondence.
Simple geometric constraints have been used to eliminate triples of bad
correspondences [28], but the key assumption here is that points lie on a convex
surface, i.e. there is no occlusion, which is not a good assumption for real world
navigation applications.
If suitable descriptors can be found then geometric constraints would be very useful
for identifying distinctive scenes in an environment made up of different
arrangements of similar components.

2.2 Motion Models, Frame Rates and Robustness

A high frame-rate is desirable for VN so that there is always overlap between
consecutive frames (and preferably sequences of frames) so that feature correspondences
can be found between frames. For example, a camera held by a human may be swung
through 180 degrees in approximately a second; if the frame is 45 degrees wide then
this requires a frame rate higher than eight frames per second for there to be any
overlap between frames at all, and higher than 16 frames per second if any features
are to be tracked across more than two images.
Omni-directional cameras may partially solve this, although they provide a larger but
less detailed view of the scene, and are more expensive and less generic.
Often the camera motion is modelled and used as an initial estimate of the translation
between frames. This works well in many applications (such as UAVs) where the
acceleration in the interval between frames is small. However, for cameras attached to
humans (or even fast robots, or robots travelling through difficult terrain) jerky
movement and rapid rotation are likely, and the algorithm must be able to cope reliably
with unpredictable motion (in other words, it should be able to predict motion from
what it sees alone). This is also the time when errors accumulate most rapidly;
conversely, it is less critical to keep the frame rate high at the times when the motion
model is most effective at speeding up the algorithm.

3 PhD Plan
I will develop ways of speeding up VN that uses BA to refine position estimates, by
identifying existing algorithms and developing new ones that lead to more
accurate initial solutions and robust outlier rejection (to reduce the number of
iterations and the probability of finding false minima). I will compare different
approaches using simulated data to determine the best one.
I will investigate ways of improving the reliability of fast loop closure algorithms. I
would like to incorporate image geometry into algorithms that match feature points,
either as part of a descriptor or to validate matches. I will also investigate
conditioning on descriptor and descriptor-combination frequencies (described further in section 3.1.2).
I will implement VN software that I hope will provide real-world verification of the
algorithms I have developed. Real-time navigation and loop closure systems have been
demonstrated before, but always with severe restrictions, e.g. on the number of points
tracked or the maximum angular velocity, or restricted to 2d or relying on a level
ground plane.

3.1 Progress to date

3.1.1 Developing techniques to reduce errors in the initial solution used to start Bundle Adjustment
It is advantageous to start Bundle Adjustment from a good approximation to the actual
motion, so that false minima are more likely to be avoided and fewer iterations are
needed to reach a good enough solution. To determine relative motion
from a set of correspondences between 3d points we first recover the rotation between
the point sets, then the translation; this gives us the camera motion. This process is
known as Procrustes Alignment and is related to Point-Pattern Matching.
Sometimes the initial solution used is either the solution obtained by Bundle Adjustment on
previous frames, or the assumption that the motion between frames is approximately
zero. This assumes that the frame rate is high enough that motion or acceleration is
small between frames, but it is a bad approximation when there is substantial motion
between frames. This is precisely when the motion estimate most needs to be
(relatively) accurate, as smaller relative errors in estimating smaller movements
contribute proportionally less to the global error.
Various methods have been proposed for extracting the rotation:
1. [12] derived expressions for the translation and rotation (and scale) between a set of point correspondences that minimise the square of the error.
2. [29] take the Singular Value Decomposition of a matrix formed from matrix products of points and discard the diagonal factor to give an orthogonal rotation matrix.

An alternative approach would be to use the SVD to give the matrix that is the best
least-squares transformation matrix between the two point sets. Then we can use either the
method in (2) (taking a second SVD of this 3x3 matrix), or the algorithm of [30]
to find the closest rotation to this matrix. After getting identical simulated results from
comparing these methods I have proved that they are equivalent.
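A minimal MATLAB sketch of the kind of numerical comparison described, on synthetic data; whether the two routes agree is the result reported above, not something the sketch assumes.

% Compare two routes to the rotation between centred 3d point sets:
%  (a) SVD of the cross-covariance matrix, as in [29];
%  (b) best least-squares 3x3 transform, then the nearest rotation to it.
rng(1);
N = 10;
P = randn(3, N);
[Rtrue, ~] = qr(randn(3));
if det(Rtrue) < 0, Rtrue(:,1) = -Rtrue(:,1); end
Q = Rtrue*P + 0.01*randn(3, N);          % rotated points with measurement noise
Pc = P - mean(P, 2);  Qc = Q - mean(Q, 2);

% (a) cross-covariance route
[U, ~, V] = svd(Qc*Pc');
Ra = U*diag([1 1 det(U*V')])*V';

% (b) least-squares transform, projected onto the nearest rotation
A = Qc*pinv(Pc);                         % best 3x3 linear map Qc ~ A*Pc
[Ua, ~, Va] = svd(A);
Rb = Ua*diag([1 1 det(Ua*Va')])*Va';

ang = @(R1, R2) acos(max(-1, min(1, (trace(R1'*R2) - 1)/2)));   % angle between rotations
fprintf('error (a): %.4f rad, error (b): %.4f rad, (a) vs (b): %.2g rad\n', ...
        ang(Ra, Rtrue), ang(Rb, Rtrue), ang(Ra, Rb));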
I have implemented a MATLAB program to simulate noisy stereo image data
(projecting points onto images, adding Gaussian measurement noise, calculating 3d
structure). I can compare these approaches with each other. Initial results show that
the first method is slightly better (mean error approximately 5% less) than the second
given 5-15 points with a re-projection error with mean 0.01 radians (approximately 2
pixels for a typical camera), but the second is significantly better given more exact
data (mean re-projection error 0.002 radians).
3.1.1.1 Translation extraction
The three widely used methods of extracting translation are:
1. The 3-point algorithm [31]: Three non-collinear point correspondences
determine a small finite set of possible new camera positions. We can solve
this exactly, then disambiguate between solutions using a fourth
correspondence.
2. The N-point algorithm [11]. This is the generalisation of (1) to n > 3 points,
making use of the fact that the problem is overdetermined. These algorithms
can position a camera relative to known 3d structure given one 2d image
alone. The only information about the second point set used is the angles
between points. If we have a stereo pair we can use angles between 3d points
(this is effectively an average of the angles from the two images, so we would
expect a slightly reduced error), and also improve on distance estimates
between point pairs (by taking some sort of average again).
3. Procrustes Alignment (difference of centroids of point sets) between 3d point
sets after the rotation. Unlike (1) and (2) this method is directly affected by the
accuracy of rotation estimates.
My simulations can currently use (1) or (3) and I am working on incorporating (2).
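A minimal sketch of method (3), showing how an error in the rotation estimate propagates directly into the recovered translation (the perturbation size is illustrative):

% Translation by Procrustes alignment: difference of centroids after rotation.
rng(4);
P = randn(3, 50) + [0; 0; 5];            % 3d points, offset in front of the camera
[Rtrue, ~] = qr(randn(3));
if det(Rtrue) < 0, Rtrue(:,1) = -Rtrue(:,1); end
ttrue = [0.3; -0.1; 0.2];
Q = Rtrue*P + ttrue;

t_exact = mean(Q, 2) - Rtrue*mean(P, 2);             % exact rotation: exact translation
theta = 0.02;                                        % 0.02 rad rotation error about the z axis
Rerr = [cos(theta) -sin(theta) 0; sin(theta) cos(theta) 0; 0 0 1]*Rtrue;
t_perturbed = mean(Q, 2) - Rerr*mean(P, 2);          % rotation error leaks into translation
fprintf('translation error with exact R: %.2g, with perturbed R: %.2g\n', ...
        norm(t_exact - ttrue), norm(t_perturbed - ttrue));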
3.1.1.2 Unbiased Estimator of stereo point position
3d point positions are normally calculated from stereo correspondences by finding the
intersection of rays projected through each point in the image, as described by [32].
As there is noise in image measurements these lines do not generally intersect at a
point, so instead we take the midpoint of the points where the lines are closest
together. In general this is a good approximation, and it is an unbiased estimator of the
position if the rays are perpendicular. However for more distant points the PDF of the
true position is highly asymmetric and is not centred on this position. The proofs that
algorithms for doing Procrustes alignment are optimal make the assumption that
errors are independent Gaussian RVs with zero mean. However this does not appear
to be a good assumption when points come from stereo.
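A minimal MATLAB sketch of the midpoint construction for two noisy rays; the baseline, noise level and point are illustrative, and the closest points on the two rays are found by solving a small 2x2 linear system.

% Midpoint triangulation from a stereo pair: each camera defines a ray
% through the observed image point; the 3d estimate is the midpoint of the
% closest points on the two (generally skew) rays.
rng(5);
b = 0.12;                                 % stereo baseline (metres)
c1 = [-b/2; 0; 0];  c2 = [b/2; 0; 0];     % camera centres
X = [0.3; 0.2; 4.0];                      % true 3d point
noise = 0.002;                            % angular noise (radians), roughly pixel-level
d1 = (X - c1);  d1 = d1/norm(d1);
d2 = (X - c2);  d2 = d2/norm(d2);
d1 = d1 + noise*randn(3, 1);  d1 = d1/norm(d1);   % perturb ray directions by ~noise rad
d2 = d2 + noise*randn(3, 1);  d2 = d2/norm(d2);

A = [d1'*d1, -d1'*d2; d1'*d2, -d2'*d2];   % closest points: solve for the two ray parameters
rhs = [(c2 - c1)'*d1; (c2 - c1)'*d2];
st = A\rhs;
p1 = c1 + st(1)*d1;  p2 = c2 + st(2)*d2;
Xhat = (p1 + p2)/2;                       % midpoint estimate of the 3d point
fprintf('true depth %.2f m, estimated depth %.2f m\n', X(3), Xhat(3));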
Independence is a reasonable assumption if calibration errors are small, so errors
come mainly from extracting points from pixels in a photograph. Algorithms to
extract points are generally symmetric and extract individual points independently. If
we assume these errors are Gaussian then the error in the plane perpendicular to the
direction the stereo rig is pointing will be approximately Gaussian and will have zero

mean (it is determined by the point where these rays intersect this plane at whatever
depth we have calculated for our point).
If we adjust the depth calculated by the procedure above so that the adjusted point is
at the expected value of its depth it will be at its expected true position, so will be an
unbiased estimator of this position. The distribution of possible positions about the
adjusted position is now closer in some sense to a Gaussian distribution, so it is
intuitive that the algorithms described above that assume this will perform better
given adjusted points.
I have calculated the expected depth of points given their reconstructed positions:

\bar{d} = \int_{l}^{\infty} \frac{b\, d_m}{b + 2 e\, d_m} \, \mathcal{N}_{0,\sigma}(e) \, \mathrm{d}e,

where b is the baseline length, d_m is the measured depth, e is the error in measuring one pixel, \mathcal{N}_{0,\sigma} is the normal PDF with standard deviation \sigma (the image noise in radians), and l > -b/(2 d_m) is a lower limit high enough that we are not considering points that we couldn't possibly see because they are behind the camera.
[TODO: should that be root 2? Sum of 2 NDs]
This can be pre-computed and stored in a lookup table, so is relatively fast. Several
small-angle assumptions are made so possibly simulated data would give better
estimates. I will try this approach.
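A minimal MATLAB sketch of pre-computing such a lookup table by numerical integration, assuming the form of the integral above; the baseline, noise level and depth grid are illustrative, and the lower integration limit is clamped so that the integrand stays well behaved.

% Pre-compute expected true depth as a function of measured depth by
% numerically integrating over the Gaussian image-noise distribution.
b = 0.12;                                  % stereo baseline (metres)
sigma = 0.002;                             % image noise (radians)
dm = linspace(0.5, 4, 100);                % measured depths to tabulate
gauss = @(e) exp(-e.^2/(2*sigma^2)) / (sigma*sqrt(2*pi));
dbar = zeros(size(dm));
for k = 1:numel(dm)
    lo = max(-6*sigma, -b/(2*dm(k)) + 1e-4);   % lower limit: keep points in front of the camera
    e = linspace(lo, 6*sigma, 4001);           % integration grid over the noise
    f = b*dm(k) ./ (b + 2*e*dm(k)) .* gauss(e);
    dbar(k) = trapz(e, f);                     % expected true depth given measured depth dm(k)
end
correction = dbar - dm;                        % adjustment to apply along the viewing direction
fprintf('dm = %.2f m -> expected depth %.3f m (shift %+.3f m)\n', dm(end), dbar(end), correction(end));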
3d points can be shifted by this amount in the direction of the stereo rig so the
distribution of their true position is centred on the adjusted point. Preliminary results
suggest there is a small improvement in accuracy that is insignificant (about 2.5%)
until very noisy data is used, or the stereo baseline is less than ten times the point
depth, when it gives a significant improvement in accuracy. More simulations are
necessary to see when this is useful.
3.1.1.3 Future development work
1) I will use and extend my MATLAB program to compare different first
approximations to transformations between images.
2) I will investigate incorporating Bundle Adjustment into my MATLAB program to
test whether accurate starting positions really do save time and reduce the likelihood
of finding false minima.
3) I will investigate whether the reconstructed structure given by BA also suffers from
skewed stereo distributions. It clearly does in the case of points occurring in only two
images, as the same reconstructed point described above is the one that minimises
re-projection error. A function of the re-projection error that is minimised when the
optimal point is found would have to be decreasing with increasing total re-projection
error (close to where the optimal reconstruction would be under the current approach),
so this is unlikely to be a suitable approach. Post-processing the results may be possible.
3.1.2 Loop Closure
I will investigate using the geometry of scenes to either search for matching scenes, or
to validate matches based on distinctive point combinations. I have already started
investigating ways of validating matches without aligning points by examining the
direction of vectors between correspondences.
I will attempt to analyse projective-invariant features of point sets in order to describe
scenes in terms of their geometry. The aim is to find a descriptor (such as a set of
orderings) that will discriminate between different scenes.
I will investigate conditioning based on either descriptor frequencies or descriptor
combination frequencies, and ways of partitioning distinctive descriptor sets. This
may involve finding existing techniques that are suitable for loop closure, or
developing new techniques.
3.1.3 Demonstrating improved Visual Navigation
I have started to implement a navigation algorithm using C++ and the OpenCV library
and a stereo webcam pair.
At the moment features are detected and good correspondences are found within
stereo pairs. The epipolar constraint and match condition numbers are used to speed
up matching and eliminate bad matches. 3d positions are calculated, but these are not
yet working correctly.
Features are matched with the previous frame. A RANSAC/Procrustes alignment
algorithm for relative motion is implemented but untested due to lack of good 3d
points.

3.2 Future development work

I will fix and extend my navigation software to give motion estimates using the
navigation algorithm identified by simulation. I will aim to show that this gives a
good enough initial estimate to allow BA to refine the position estimate in real-time.
Fast BA code is available in the sba library for this purpose. Other potential areas to
research include the trade-off between a high frame rate (features are tracked over
many frames but not much time is available to process each frame) and spending
longer refining estimates from less frequent, possibly higher resolution, frames.
Generic relative-positioning techniques across multiple frames will enable me to
incorporate loop closure into this software, by positioning relative to frames from the
same position in the past. The most accurate algorithms for navigation at the moment
do not incorporate BA. Hopefully by identifying and refining the most appropriate
algorithms it will be possible to do this in real-time.

Learning
I will need to learn more Bayesian statistics and statistical geometry. This will allow
me to understand and select existing methods and to develop new solutions for the
Loop Closure problem. I will do this primarily through reading.
I will attend Mathematics undergraduate lecture courses on optimisation (MATH412-07S1), geometry (MATH407-07S1) and calculus (MATH264-07S1) to extend my
pure mathematical knowledge to more applied fields. This will help me understand
the concepts underlying BA and other geometric algorithms, and to develop and adapt
them. For example, I am currently investigating whether it is beneficial for VN to
adapt BA to minimise reconstruction errors rather than re-projection errors, which
requires an understanding of robust optimisation.
As the field of VN is moving rapidly and significant advances are likely I will
continue my literature review, paying particular attention to forthcoming conference
proceedings and the activities of groups working on VN, V-SLAM and image
recognition, including the following:
Key computer vision conferences:
International Conference on Robotics and Automation 2016
European Conference on Computer Vision 2016
International Conference on Computer Vision 2016
Computer Vision and Pattern Recognition 2016
SLAM Summer School 2016 (unconfirmed at the moment)

Leading research groups in the field:
ROBOTVIS: Computer Vision and Robotics, INRIA, Grenoble (Localisation, VN, SLAM)
Robotics Research Group, Oxford University (Loop closure, SLAM, Image Registration)
Centre for Visualization & Virtual Environments, University of Kentucky (VN, photogrammetry applied to CV)
The Australian Centre for Field Robotics, University of Sydney (SLAM and UAVs)

3.3 Proposed research timeline and targets

2016
September: Complete investigation of approximate transformation extraction techniques from point sets.
October: Learn sufficient Bayesian statistics to be able to adapt categorisation and discrimination (conditioning) techniques to the problems of VN.
November: Complete analysis of and publish results from transformation extraction experiments.
December: Develop VN software to a stage where it can infer reasonable motion estimates from point sets. Aim for a real-time implementation.
2017
February: Complete incorporation of BA into VN software to refine position. Aim for a near real-time implementation, identifying bottlenecks to help guide future work.
May: Complete preliminary research into suitable registration algorithms and map formats for loop closure.
July: Decide whether to extend research to monocular vision or to stay with stereo.
September: Complete addition of mapping to VN software (either a database-of-descriptors approach or a traditional SLAM landmark map).
December: Decide whether to focus research efforts on sensor integration involving VN, or on localisation.
2018
January: Complete research into registration/recognition in loop closure.
June: Exhibit VN system for 3d indoor or outdoor positioning.
October: Publish details of any improvements of my VN implementation over existing technology.
November: Start writing thesis.
2019
April: Complete experimental work.
July: Submit PhD thesis. Write papers based on thesis.

4 Bibliography
1. Yang, Q., et al. Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation and Occlusion Handling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
2. Clemente, L.A., A.J. Davison, I. Reid, J. Neira and J.D. Tardós. Mapping Large Loops with a Single Hand-Held Camera. In Robotics: Science and Systems (RSS), 2007.
3. Castle, R.O., D.J. Gawley, G. Klein and D.W. Murray. Towards simultaneous recognition, localization and mapping for hand-held and wearable cameras. In International Conference on Robotics and Automation (ICRA), Rome, 2007.
4. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision (ICCV), 1999.
5. Tuytelaars, T. and L. Van Gool. Matching Widely Separated Views Based on Affine Invariant Regions. International Journal of Computer Vision, 2004. 59(1): p. 61-85.
6. Triggs, B., P.F. McLauchlan, R.I. Hartley and A.W. Fitzgibbon. Bundle Adjustment: A Modern Synthesis. In Vision Algorithms: Theory and Practice (International Workshop on Vision Algorithms, Corfu, Greece, September 1999), LNCS 1883, 2000.
7. Tomasi, C. and T. Kanade. Shape and Motion from Image Streams under Orthography: a Factorization Method. International Journal of Computer Vision, 1992. 9(2): p. 137-154.
8. Agrawal, M., K. Konolige and R.C. Bolles. Localization and Mapping for Autonomous Navigation in Outdoor Terrains: A Stereo Vision Approach. In IEEE Workshop on Applications of Computer Vision (WACV), 2007.
9. Lee, S.Y. and J.B. Song. Mobile robot localization using optical flow sensors. International Journal of Control, Automation and Systems, 2004. 2(4): p. 485-493.
10. Davison, A.J. Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), 2003.
11. Quan, L. and Z.-D. Lan. Linear N-Point Camera Pose Determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999.
12. Horn, B.K.P. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 1987. 4(4): p. 629.
13. Sünderhauf, N. and P. Protzel. Towards Using Sparse Bundle Adjustment for Robust Stereo Odometry in Outdoor Terrain. In Proceedings of Towards Autonomous Robotic Systems (TAROS), 2006.
14. Zhang, Z. and Y. Shan. Incremental motion estimation through modified bundle adjustment. In International Conference on Image Processing (ICIP), 2003.
15. Kwok, N.M. and A.B. Rad. A Modified Particle Filter for Simultaneous Localization and Mapping. Journal of Intelligent and Robotic Systems, 2006. 46(4): p. 365-382.
16. Dailey, M.N. and M. Parnichkun. Simultaneous Localization and Mapping with Stereo Vision. In 9th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2006.
17. Maimone, M., Y. Cheng and L. Matthies. Two years of Visual Odometry on the Mars Exploration Rovers. Journal of Field Robotics, 2007. 24(3): p. 169-186.
18. Ho, K. and P. Newman. Combining Visual and Spatial Appearance for Loop Closure Detection in SLAM. In 2nd European Conference on Mobile Robots, Ancona, Italy, 2005.
19. Veth, M.J. and J.R. Raquet. Fusion of Low-Cost Inertial Systems for Precision Navigation. In Proceedings of the ION GNSS, 2006.
20. Segvic, S., et al. Large scale vision-based navigation without an accurate global reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
21. Felzenszwalb, P.F. and D.P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 2005. 61(1): p. 55-79.
22. Taylor, B. and L. Vincent. Database assisted OCR for street scenes and other images. U.S. Patent Office, 2007.
23. Neira, J. and J.D. Tardós. Data association in stochastic mapping using the joint compatibility test. IEEE Transactions on Robotics and Automation, 2001. 17(6): p. 890-897.
24. Fei-Fei, L. and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
25. Csurka, G., C. Dance, L. Fan, J. Willamowski and C. Bray. Visual categorization with bags of keypoints. In ECCV 2004 Workshop on Statistical Learning in Computer Vision, 2004.
26. Filliat, D. A visual bag of words method for interactive qualitative localization and mapping. In International Conference on Robotics and Automation (ICRA), 2007.
27. Ho, K.L. and P. Newman. Detecting loop closure with scene sequences. International Journal of Computer Vision, 2007. 74(3): p. 261-286.
28. Hu, X. and N. Ahuja. Matching point features with ordered geometric, rigidity, and disparity constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994. 16(10): p. 1041-1049.
29. Umeyama, S. Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991. 13(4): p. 376-380.
30. Bar-Itzhack, I.Y. New Method for Extracting the Quaternion from a Rotation Matrix. Journal of Guidance, Control, and Dynamics, 2000. 23(6).
31. Haralick, R.M., et al. Review and Analysis of Solutions of the 3-Point Perspective Pose Estimation Problem. International Journal of Computer Vision, 1994. 13(3): p. 331-356.
32. Hartley, R. and A. Zisserman. Multiple View Geometry in Computer Vision. Second ed. Cambridge University Press, 2004.
