Yves Albers-Schoenberg
Master Thesis
Robotics and Perception Lab
University of Zurich
Supervision
Dr. Andras Majdik
Prof. Dr. Davide Scaramuzza
November 2013
Contents
Abstract

1 Introduction
    1.1 Goal
    1.2 Motivation
    1.3 Autonomous Flight in Urban Environments
        1.3.1 Above-Rooftop Flight
        1.3.2 Street-Level Flight
    1.4 Legal Framework
    1.5 Literature Review
4 Experimental Setup
    4.1 Platform
    4.2 Test Area
5 Results and Discussion
    5.1 Visual-Inspection
    5.2 Uncertainty Quantification
    5.3 Virtual-views and Iterative Refinements
    5.4 GPS Comparison
6 Conclusion and Outlook
A Appendix
    A.1 OpenCV EPnP + Ransac
    A.2 ICRA Submission
Abstract
This thesis presents a proof-of-concept of a purely vision-based global positioning system for a Micro Aerial Vehicle (MAV) acting in an urban environment.
The overall goal is to contribute to the advance of autonomously acting aerial
service robots in city-like areas. It is shown that the increasing availability
of textured 3D city models can be used to localize a MAV in the case where
satellite-based GPS is not, or only partially available. Textured urban scenes are
created by overlaying Google Street View images on a georeferenced cadastral
3D city model of Zurich. For a particular MAV image, the most similar Street View image is then identified by an image search algorithm, and the global camera position of the MAV is derived. An extensive test dataset containing aerial recordings of a 2 km long trajectory in the city of Zurich is used to verify and evaluate the proposed approach. It is concluded that the suggested vision-based positioning algorithm can be used as a complement or an alternative to satellite-based GPS with comparable results in terms of localization accuracy. Finally, suggestions are presented on how to improve the introduced vision-based positioning approach and implement it in a future real-life application.
Results of this thesis have been used in the ICRA submission Micro Air Vehicle Localization and Position Tracking from Textured 3D Cadastral Models 1 .
1 International Conference on Robotics and Automation (ICRA 2014), Micro Air Vehicle Localization and Position Tracking from Textured 3D Cadastral Models (under review), Andras L. Majdik, Damiano Verda, Yves Albers-Schoenberg, Davide Scaramuzza
Chapter 1
Introduction
This chapter describes the goal of this master thesis and gives an overview of autonomous flight of Micro Aerial Vehicles (MAVs) in urban environments. Challenges are highlighted and a motivation for the suggested vision-based positioning approach is provided. Moreover, a literature review is conducted summarizing the current state of the art.
1.1 Goal
1.2 Motivation
With the rapid advance of low-cost Micro Aerial Vehicles, new applications such as airborne goods delivery 1 , inspection 2 , traffic surveillance or first-aid delivery in case of accidents are starting to emerge. Moreover, it is conceivable that tomorrow's small-sized aerial service robots will increasingly carry out tasks autonomously, i.e. without any direct human intervention.
Accurate localization is indispensable for any autonomously acting robot and is
a prerequisite for the successful completion of tasks in a real-life environment.
1 As
2 As
Figure 1.1: On the left: There is no direct line of sight for the GPS signal to
the red satellites due to an urban canyon. On the right: The GPS signals are
reflected by the surrounding buildings
Satellite-based global positioning systems like GPS, Glonass, Galileo or Compass work based on the principle of triangulation and have become the state of the art for global outdoor positioning, forming a crucial component of many modern technological systems. Everyday applications like smart-phones, driving-assistance or fleet tracking heavily rely on the availability of satellite-based signals for positioning. While standard consumer-grade GPS receivers have a typical accuracy between 3 and 15 meters 95% of the time, augmentation techniques like differential GPS (DGPS) or Wide Area Augmentation Systems (WAAS) to support aircraft navigation can reach a typical accuracy of 1 to 3 meters [23]. The accuracy and reliability of a standard GPS sensing device fundamentally depends on the number of visible satellites which are in the line of sight of the receiver. In urban areas, the availability of satellite-based GPS signals is often reduced compared to unobstructed terrain, or even completely unavailable in case of restricted sky view. So-called urban canyons tend to shadow the GPS signals, and building facades reflect the signals, violating the underlying triangulation assumption that signals travel along a direct line of sight between the satellite and the receiver. Several approaches have been suggested in the literature to deal with these drawbacks, such as using additional ground stations or fusing the GPS measurements together with data from Inertial Measurement Units (IMUs) for dead-reckoning. This thesis aims to provide a vision-based alternative to satellite-based global positioning in urban environments by taking advantage of 3D city models together with geotagged image databases such as Google Street View 3 or Flickr 4 . The motivation is to develop novel approaches for MAV positioning, paving the way for tomorrow's aerial robotics applications in urban environments.
3 https://maps.google.ch/
4 http://www.flickr.com/
1.3 Autonomous Flight in Urban Environments
In the context of this work, the term autonomous flight in urban environments refers to the capability of a MAV to independently, i.e. without any human piloting, execute the following directive:
Fly from Address A to Address B
This capability is a basic requirement for any autonomously acting aerial robot fulfilling tasks in city-like environments. Fig. 1.2 shows a simplified reference control scheme of an autonomous robot in the style of [31]. As framed by the red dashed line, an autonomously flying MAV will carry out all four major building blocks of navigation, namely localization, path planning, motion control and perception, in an automated way.
Figure 1.2: The mission commands fly from Address A to Address B are given by the operator. To execute the mission commands, the MAV needs to iteratively carry out the following steps until the goal is reached: firstly, localize and determine the current position; secondly, plan the next step (path) to reach the target; thirdly, generate motor commands to execute the planned path and interact with the environment; fourthly, extract information from the environment to get an update on the current state.
Localization and Map Building This work focuses on the localization step in the above control scheme. It is explicitly assumed that the MAV has access to a given map of the environment, i.e. the 3D city model, and does not need to simultaneously localize and map the environment (SLAM). SLAM systems like [17] have been successfully applied to localize MAVs in indoor environments where no map is available [7].
Hereafter, the term global localization refers to the positioning of the robot with respect to a global coordinate system such as the World Geodetic System 1984 (WGS84). Besides positioning (e.g. determination of latitude and longitude), localization usually also includes information on the robot's attitude (i.e. yaw, roll and pitch).
Path-Planning As defined in [31], path-planning involves identifying a trajectory that will cause the robot to reach the goal location when executed. This is a strategic problem-solving competence that requires the robot to plan how to achieve its long-term goals. Path-planning usually involves the determination of intermediate way-points between the current position and the goal. Even though path-planning is a long-term process, it can change when new information on the environment becomes available or the mission control commands are changed. A crucial competence of any autonomous robot acting in human environments is the capability of short-term obstacle avoidance. Especially in urban areas, where the robot's workspace is shared with pedestrians, cars and public transport, a robust obstacle avoidance system is a basic prerequisite for any safe robot operation.
Motion Control Motion control is the process of generating suitable motor commands so that the robot executes the planned path. In the case of a quadrocopter, motion control regulates the rotary speed of the four rotors to move the MAV to the desired position and attitude. Generally, one differentiates between open-loop control, where the robot's position is not fed back to the kinematic controller to regulate the velocity or the position, and closed-loop control, where the robot's system state (velocity, position) is fed back as an input to the kinematic controller. The most widely used closed-loop control mechanism is a Proportional Integral Derivative (PID) controller, which minimizes the error between a measured system variable and its desired set-point.
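To make the closed-loop idea concrete, the following minimal sketch shows a textbook PID controller in Python; the gains, the altitude-hold scenario and all variable names are illustrative assumptions and are not taken from the platform used in this thesis.

import time


class PID:
    # Minimal textbook PID controller (illustrative, not tuned for a real MAV).
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        # Error between the desired set-point and the measured system variable
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        # Control output, e.g. a thrust correction for a simple altitude hold
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical usage: hold an altitude of 5 m, updating at 50 Hz
altitude_hold = PID(kp=0.8, ki=0.05, kd=0.3, setpoint=5.0)
thrust_correction = altitude_hold.update(measurement=4.2, dt=0.02)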
Perception Perception refers to the process of information extraction from the robot's environment. During sensing, raw data is collected depending on the robot's specific sensor configuration. Various types of sensors are used in robotics, such as laser scanners or ultrasonic sensors for range sensing, IMUs for attitude estimation or cameras for positioning and motion detection. Generally, one differentiates between active sensors, which release energy and measure the environmental response to that energy, and passive sensors, which detect ambient energy changes without releasing energy to the environment. Moreover, one differentiates between exteroceptive sensors, which measure environmental properties such as the temperature, and interoceptive sensors, which measure the robot's internal state such as the actuator positions. A detailed overview of different sensing technologies can be found in [31]. The meaningful interpretation of raw sensor data is referred to as information extraction and is a key process in the perception phase. In this work, the main sensor used is a monocular camera producing a continuous image stream.
Safety Reliable safety measures are a core requirement for any autonomous
mobile system acting in a real-world environment. Especially in urban areas
where the robot shares its workspace with human beings, well-tested safety
measures such as obstacle avoidance are crucial for any robotic application.
Based on the context-specific requirements for the above functions, two scenarios
for autonomous flight in urban environments are defined and explained below.
1.3.1 Above-Rooftop Flight
In this scenario, the MAVs are flying above the buildings as illustrated in Figures 1.3 and 1.4. Depending on the city-specific urban structure (e.g. a mega city in an emerging country with skyscrapers vs. an ancient city with historic buildings in Western Europe), a minimum flying height will be defined such that the MAV is always flying above the rooftops of the buildings. The main advantage of this scenario is the absence of obstacles in the form of man-made structures and humans. Therefore, trajectory planning is drastically simplified, resulting in a faster and safer system. Moreover, MAV localization can be robustly carried out based on satellite-based GPS as no buildings obstruct the direct line of sight to the satellites. Recent research has dealt with and largely solved GPS-based autonomous flight. Low-cost autopilots such as the PX4 5 can be used together with open source software such as qgroundcontrol 6 or Paparazzi 7 to control a GPS-based flight mission. To demonstrate the practice and the limitations of this approach, an autonomous test flight has been conducted using a Parrot AR.Drone 2.0 together with Qgroundcontrol. A video presentation summarizing the results of this flight can be found attached to this thesis. It is clearly shown that the GPS way-point following works well in principle. However, it is also demonstrated that the position accuracy, i.e. the MAV's ability to follow the designated path, is not sufficient to use GPS-based flight in the street-level flight scenario described below.
5 https://pixhawk.ethz.ch/px4/en/start
6 http://qgroundcontrol.org/
7 http://paparazzi.enac.fr
1.3.2 Street-Level Flight
In this scenario, the MAV flies at street level, i.e. between the building facades, as illustrated in Figures 1.5 and 1.6. Depending on the city-specific characteristics and the local obstacle scenario (e.g. streets with cars, public transport, pedestrians), there will be a minimum height of approximately 4-5 meters for safety reasons. The city-specific positions of overhead contact wires and crossovers will moreover determine an acceptable range for a safe flying altitude. The main challenges associated with autonomous flight in this scenario are obstacle avoidance and trajectory planning. Flying from address A to address B requires a path-planning strategy which takes into account the local scene structure and the prevailing traffic situation. However, accurate positioning also becomes more challenging than in the above-rooftop scenario, as the satellite-based GPS signals can be shadowed by the surrounding buildings as illustrated in Figure 1.1.
In a realistic application, the two scenarios are likely to be combined. Take-off/landing and short-distance flights will be carried out at street level while long-distance flights could be conducted above the rooftops. This work aims to contribute to solving the problem of localizing the MAV in the outlined street-level flight scenario.
1.4 Legal Framework
This section provides a brief overview of the legal environment concerning the operation of MAVs in urban areas 8 . In this context there are two main legal aspects to be considered: a) the rules governing the operation of unmanned aerial vehicles (UAVs) and b) the protection of data and the private sphere of individuals.
a) The rules regulating the operation of UAVs
The operation of UAVs is governed by the ordinance on special categories of aircraft (Ordinance) issued by the Federal Department of the Environment, Transport, Energy and Communications (DETEC) 9 .
The Ordinance distinguishes between UAVs weighing more than 30 kilograms and those weighing up to 30 kilograms.
The most significant of the Ordinance's rules governing UAVs weighing up to 30 kilograms, which are of relevance for our purposes, are the following:
According to art. 14 of the Ordinance, the operation of UAVs with a total weight of up to 30 kilograms does not require an authorization of the Swiss Federal Office of Civil Aviation (FOCA);
Constant and direct eye contact with the UAV has to be maintained at all times (art. 17 para. 1, Ordinance);
Autonomous operation of UAVs (through cameras or GPS) within the eye contact area of the pilot is allowed provided that the pilot is always in a position to intervene on the UAV; otherwise the authorization of FOCA is required;
8 This
1.5 Literature Review
In recent years, several research papers have addressed the development of autonomous Unmanned Ground Vehicles (UGVs), leading to striking new technologies like self-driving cars. These can map and react in highly uncertain street environments while partially using [6] or completely neglecting [34] GPS systems. In the coming years, a similar burst in the development of autonomously acting Micro Aerial Vehicles is expected. Several recent papers have addressed visual localization and navigation in indoor environments using low-cost MAVs [5, 37], or [40] which tackles the problem of safely navigating a MAV through a corridor using optical flow. Most of these approaches are based on Simultaneous Localization and Mapping (SLAM) systems such as [17] using a monocular camera. Other approaches rely on stereo vision or laser odometry as described in [1].
Several papers have addressed vision-based localization in city environments. In [36] the authors present a method for estimating the geospatial trajectory of a moving camera with unknown intrinsic parameters. A similar approach is discussed in [14], which aims to localize a mobile camera device by performing a database search using a wide-baseline matching algorithm. [10] introduces a SIFT-based approach [20] to detect buildings with mobile imagery. In [39] the authors propose an image-based localization system using GPS-tagged images. The camera position of the query view is therein triangulated with respect to the most similar database image. Note that most of these approaches address the localization of ground-level imagery with respect to geo-referenced ground-level image databases. However, this thesis explicitly focuses on vision-based aerial localization for MAVs. An interesting paper addressing vision-based MAV localization in urban canyons based on optical flow is given by [15]. Moreover, probably the most similar work to the approach presented in this thesis is given in [38], in which the authors make use of metric, geo-referenced visual landmarks based on images taken by a consumer camera on the ground to localize the MAV. However, in contrast, the approach presented in this thesis is completely based on publicly available 3D city models and image databases. A short literature overview on textured 3D models is presented at the beginning of the next chapter.
10 Art. 3 (g) FDPIC: Processing: any processing of data, irrespective of the means used, in particular the collection, storage, use, modification, communication, archiving and deletion of data.
11 Art. 3 (a) FDPIC: Personal data (data): all information of an identified or identifiable person.
12 BGE 138 II 346.
13 For further information on the Google Street View case: http://www.edoeb.admin.ch/datenschutz/00683/00690/00694/01109/index.html?lang=en
Chapter 2
1 http://www.flickr.com/
2 https://picasaweb.google.com/
3 https://www.google.com/maps
2.1 3D Cadastral Models
Accurate 3D city models based on administrative cadastral measurements are becoming increasingly available to the public all over the world. In Switzerland, the municipal authorities of Basel 4 , Bern 5 and Zurich 6 provide access to their cadastral 3D data. The city model of Zurich used in this work was acquired from the urban administration and claims to have an average lateral position error of ±10 cm and an average error in height of ±50 cm. The city model is referenced in the Swiss Coordinate System CH1903, which is described in detail in [26]. An online conversion calculator between CH1903 and WGS84 is available 7 . Please note that this model does not contain any texture information. As specified in [35], the model is available in several current Computer-Aided Design (CAD) file formats and comes in three different Levels-of-Detail (LODs).
Digital Terrain Model (LOD 0): The digital terrain model is available as a Triangulated Irregular Network (TIN) or in the format of interpolated contour lines, cf. Fig. 2.1 (a).
3D Block Model (LOD 1): The 3D block model represents the buildings and their height in the form of blocks (prisms), cf. Fig. 2.1 (b).
3D Rooftop Model (LOD 2): The 3D rooftop model represents the facades and the rooftops of the buildings in more detail and also models walls and bridges, cf. Fig. 2.1 (c).
Figure 2.1: The figures show the different Levels-of-Detail (LODs) in which the
cadastral 3D model is available. The images in this Figure belong to the city of
Zurich.
In this work, the LOD 2 model is used to get the highest level of accuracy
available. However, as shown in Fig. 2.2, the LOD 2 model is a simplification
of the reality. Balconies (as shown in yellow), windows (as shown in green) and
special structures (as shown in red) are usually not modelled. It is evident that
4 http://www.gva-bs.ch/produkte_3d-stadtmodelle.cfm
5 http://www.geobern.ch/3d_home.asp
6 http://www.stadt-zuerich.ch/ted/de/index/geoz/3d_stadtmodell.html
7 http://www.swisstopo.admin.ch/internet/swisstopo/de/home/apps/calc/navref.html
2.2
8 https://developers.google.com/maps/documentation/streetview/
9 https://developers.google.com/maps/documentation/javascript/streetview?hl=en
10 http://www.python.org
INPUT
- A text file containing a list L_download of WGS84 referenced GPS coordinates (latitude, longitude) gps_1, ..., gps_j, ..., gps_m derived from Google Maps, for which the closest (in terms of Euclidean distance) available panoramic image should be downloaded.
- The panoramic zoom level z_zoom defining the panoramic image size P_height x P_width.
OUTPUT
- A folder containing a set I_panos of panoramic images p_1, ..., p_j, ..., p_M, one for every GPS coordinate gps_j ∈ L_download.
- A list L_geo containing the geotags geo_1, ..., geo_j, ..., geo_m for the downloaded panoramic images. Every geotag is given by the latitude, longitude, yaw, roll and pitch of the panoramic camera position.
FUNCTIONAL REQUIREMENTS
- Download, for every GPS coordinate gps_j ∈ L_download, the tiles which together make up the closest panoramic image. Stitch the tiles together and save the panoramic image p_j in I_panos.
- For every gps_j ∈ L_download, get the geotag geo_j of the closest panoramic image and save it in L_geo.
Figure 2.3: Functional setup of the Street View script used to download Street View panoramas.
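To make the functional setup of Figure 2.3 more tangible, the sketch below outlines how such a download script could be organized in Python. The tile-fetching and closest-panorama lookup steps are deliberately left as stubs because the exact endpoints and parameters are not specified here; the tile grid size, file names and helper names are illustrative assumptions, not the actual script used in this work.

import os
from PIL import Image  # Pillow, used only for stitching the tiles


def fetch_tile(pano_id, zoom, x, y):
    # Placeholder: return one panorama tile as a PIL image. The actual HTTP
    # request against the Street View tile endpoint is intentionally omitted.
    raise NotImplementedError


def closest_panorama(lat, lon):
    # Placeholder: return (pano_id, geotag) of the panorama closest to (lat, lon).
    raise NotImplementedError


def stitch_panorama(pano_id, zoom, tile_size=512, tiles_x=4, tiles_y=2):
    # Paste the individual tiles into one equirectangular panorama image.
    pano = Image.new("RGB", (tiles_x * tile_size, tiles_y * tile_size))
    for x in range(tiles_x):
        for y in range(tiles_y):
            pano.paste(fetch_tile(pano_id, zoom, x, y), (x * tile_size, y * tile_size))
    return pano


def download_panoramas(gps_list, zoom, out_dir="panos"):
    # For every GPS coordinate, save the closest panorama and collect its geotag.
    os.makedirs(out_dir, exist_ok=True)
    geotags = []
    for j, (lat, lon) in enumerate(gps_list):
        pano_id, geotag = closest_panorama(lat, lon)
        stitch_panorama(pano_id, zoom).save(os.path.join(out_dir, "pano_%04d.jpg" % j))
        geotags.append(geotag)
    return geotags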
(c) This figure shows a panoramic Street View image (equirectangular projection) stitched together by using the dynamic Street View API. The yaw spans from 0 to 360 degrees (x-axis along the image width) whereas the pitch extends from 0 to 180 degrees (y-axis along the image height).
Figure 2.4: Figures (d)-(l) show different perspective cutouts from the panoramic image in (c) using different cutout parameters as described in chapter 2.3.
2.3
As shown later in chapter 3.2, a perspective cutout of the Street View panoramas, i.e. an image which meets the underlying assumptions of a perspective camera model as described in [13], needs to be generated. This is done following the procedure outlined in [27]. The functional setup of the cutout function is described in Figure 2.5.
FUNCTION: perspective cutout
DESCRIPTION: Generate a perspective cutout of a panoramic Street View image.
INPUT
- Panoramic Street View image p_j.
- Panoramic image size P_size given by the image width P_width and the image height P_height.
- Desired image size C_size of the perspective cutout given by C_width and C_height.
- Horizontal field of view hfov for the desired cutout.
- Image center of the desired perspective cutout specified by yaw and pitch in the panoramic projection.
OUTPUT
- Perspective view c_k according to the input specifications.
FUNCTIONAL REQUIREMENTS
- Transform the equirectangular projection to a perspective view.
Figure 2.5: Functional setup of the perspective cutout function.
Based on the input parameters, the internal camera matrix K_street for the generated perspective cutout can be calculated as follows:

    c_x = C_height / 2,    c_y = C_width / 2        (2.1)

where c_x and c_y represent the optical camera center. The camera focal lengths f_x, f_y are given by:

    f_y = f_x = C_width / (2 tan(hfov/2))        (2.2)

    K_street = [ f_x   0    c_x ]
               [  0   f_y   c_y ]        (2.3)
               [  0    0     1  ]
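The following numpy/OpenCV sketch illustrates Equations (2.1)-(2.3) together with the equirectangular-to-perspective mapping described in Figure 2.5. The axis and sign conventions (x right, y down, z forward; panorama pitch of 90 degrees at the horizon) are assumptions made for this illustration and may differ from the conventions of the original implementation.

import numpy as np
import cv2


def cutout_intrinsics(c_width, c_height, hfov_deg):
    # Internal camera matrix of the cutout, Eq. (2.1)-(2.3)
    f = c_width / (2.0 * np.tan(np.radians(hfov_deg) / 2.0))
    cx, cy = c_height / 2.0, c_width / 2.0     # ordering as written in Eq. (2.1)
    return np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])


def perspective_cutout(pano, c_width, c_height, hfov_deg, yaw_deg, pitch_deg):
    # Sample a pinhole-camera view from an equirectangular panorama.
    pano_h, pano_w = pano.shape[:2]
    f = c_width / (2.0 * np.tan(np.radians(hfov_deg) / 2.0))   # Eq. (2.2)

    # Ray direction for every cutout pixel in an (x right, y down, z forward) frame
    u, v = np.meshgrid(np.arange(c_width), np.arange(c_height))
    rays = np.stack([u - c_width / 2.0,
                     v - c_height / 2.0,
                     np.full(u.shape, f)], axis=-1)

    # Rotate the rays towards the requested viewing direction
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg - 90.0)  # 90 deg = horizon
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    d = rays @ (Ry @ Rx).T
    d /= np.linalg.norm(d, axis=-1, keepdims=True)

    # Convert ray directions to panorama pixel coordinates (yaw 0..360, pitch 0..180)
    lon = np.arctan2(d[..., 0], d[..., 2]) % (2 * np.pi)
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0)) + np.pi / 2
    map_x = (lon / (2 * np.pi) * (pano_w - 1)).astype(np.float32)
    map_y = (lat / np.pi * (pano_h - 1)).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, interpolation=cv2.INTER_LINEAR)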
2.4
The provided geotags geo_j ∈ L_geo (cf. Figure 2.3) for the Google Street View imagery are not exact. As shown in [32], where 1400 images were used for an analysis, the average error of the camera positions is 3.7 meters and the average error of the camera orientation is 1.9 degrees. In the same work, an algorithm is proposed to improve the precision of the Street View image poses. This algorithm uses the described cadastral 3D city model of Zurich to detect the outlines of the buildings by rendering out 3D panorama views as illustrated in Fig. 2.6 (a)-(b). Accordingly, the outlines of the buildings are also computed for the Street View panoramas using the image segmentation technique described in [19]. Finally, the refined pose is computed by an iterative optimization, namely by minimizing the offset between the segmented outlines from the Street View panoramas and the outlines of the rendered-out panorama view from the 3D cadastral model. For this work, the described refinement algorithm was applied to correct the Google Street View geotags used in the experimental setup, cf. chapter 4. Fig. 2.6 shows the difference when overlaying rendered-out panoramas of the cadastral 3D model before and after applying the correction algorithm. It is clearly evident that the match quality, i.e. the accuracy when overlaying the 3D city model with Street View images, drastically increases after the application of the described refinement algorithm.
Figure 2.6: Figure (a) shows the rendered-out building outlines based on the original geotag of a panoramic Street View image. Figure (b) shows the rendered-out building outlines based on the refined geotag. Figure (c) overlays the panoramic image with the outlines based on the original geotag. Figure (d) overlays the panoramic image with the outlines based on the refined geotag. It is clearly shown that the overlay in Figure (d) is much more precise than in Figure (c).
Note that the refinement algorithm was run by the authors of [32] as the code had not been published at the time of writing. A functional setup of the refinement algorithm is, however, provided in Figure 2.7.
INPUT
- A text file containing a list L_geo of Street View geotags (latitude, longitude, yaw, pitch, roll) geo_1, ..., geo_j, ..., geo_m derived from the function download panoramas (cf. Figure 2.3), which describe the Street View camera locations for a set of panoramic images p_1, ..., p_j, ..., p_M.
- A set I_panos of panoramic images p_1, ..., p_j, ..., p_M.
- The 3D cadastral model for the locations in L_geo.
OUTPUT
- A list L_refined containing the refined panoramic camera locations xyz_1, ..., xyz_j, ..., xyz_m referenced in the 3D model coordinate frame CH1903.
- For every original geotag geo_j related to the panoramic image p_j, the refined external camera matrix RT_j given by the refined rotation matrix R_j, which describes the rotation of the Street View camera with respect to the model origin, and the translation vector T_j, which describes the translation with respect to the model origin. Note that xyz_j = -inv(R_j) T_j.
FUNCTIONAL REQUIREMENTS
- Segment building outlines in the panoramic images p_1, ..., p_j, ..., p_M.
- Render out panoramic building outlines from the 3D cadastral model for the Street View camera locations in L_geo.
- Overlay the segmented building outlines with the panoramic renderings and measure the offset.
- Iteratively refine the Street View camera locations by running an optimization to minimize the offset.
Figure 2.7: The functional setup of the refinement algorithm proposed by [32] to correct the panoramic geotags of the Street View images.
2.5
A given perspective cutout of a downloaded Street View panorama can be backprojected onto the 3D cadastral model taking into account the refined position as illustrated in Fig. 2.8 (a)-(d). This is done with the open-source 3D modelling software Blender 11 . Some sample files showing textured 3D model scenes are added to this thesis. Note that the quality of the backprojection largely depends on the accuracy of the refined position estimates (i.e. the refined geotags) of the Street View camera and on the modelling accuracy of the 3D cadastral model. The main goal of the backprojection is to assign the texture, i.e. the Street View images, to their corresponding 3D geometries in the cadastral model. An alternative approach to map the 2D pixel coordinates of the Street View cutouts to their global 3D coordinates in the city model is to add the Street View camera perspective to the 3D model and subsequently render out the global 3D coordinates for all the pixels. This process is illustrated in Figure 2.8 (e)-(f).
Figure 2.8: Figures (e) -(g) illustrate the rendered out global 3D model coordinates in the style of a heat map.
11 http://www.blender.org
Moreover, the functional setup for rendering out 3D coordinates for pixels in the Street View images is described in Figure 2.9. Note that the 3D coordinates for the pixels can either be rendered out in the global coordinate system or in the local camera coordinate system. If the global reference frame is used, every pixel in the Street View image can be directly linked to its absolute global coordinates in the city model reference system. Alternatively, the depth values can be rendered out for every pixel and then be converted to the local camera coordinate frame. Remember that the global 3D coordinates are referenced in the Swiss coordinate system CH1903 as outlined in chapter 2.1.
FUNCTION: get 3D coordinates
DESCRIPTION: Render out the global 3D coordinates and/or depth for the Street View pixels.
INPUT
- The Street View camera location RT_k specifying the external camera parameters of a specific perspective cutout c_k, where R_k is the rotation matrix of the Street View camera with respect to the model origin and T_k gives the translation vector with respect to the model origin.
- The internal camera parameters K_street of the perspective cutout as given in chapter 2.3.
- The cadastral 3D model which contains the location RT_k.
OUTPUT
- For every pixel p_kuv in c_k, the global 3D coordinates to which the pixel corresponds, i.e. X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv). p_kuv stands for the pixel in cutout k, row u and column v.
- Alternatively, for every pixel p_kuv in c_k, the depth D(p_kuv) which corresponds to the pixel. If desired, also the 3D coordinates in the local camera frame can be extracted, i.e. X_local(p_kuv), Y_local(p_kuv), Z_local(p_kuv).
FUNCTIONAL REQUIREMENTS
- Create a perspective camera in the 3D model according to the external parameters RT_k and the internal parameters K_street.
- Render out coordinate paths, i.e. save the corresponding 3D model coordinates X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv) for every pixel p_kuv in the image plane of the perspective cutout c_k.
Figure 2.9: This figure outlines the process of linking the Street View cutout pixels to their corresponding 3D coordinates.
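To complement the functional description in Figure 2.9, the following small numpy sketch shows the underlying geometry: a pixel with known depth is first lifted into the local camera frame with the inverse internal camera matrix and then transformed into the global model frame using the external parameters. The pixel convention (x = column, y = row), the depth definition (measured along the camera z-axis) and all numeric values are illustrative assumptions.

import numpy as np


def pixel_to_global(x, y, depth, K, R, T):
    # Back-project pixel (x = column, y = row) with known depth into the local
    # camera frame: X_local = depth * K^-1 [x, y, 1]^T
    x_local = depth * (np.linalg.inv(K) @ np.array([x, y, 1.0]))
    # Transform to the global model frame. With the convention
    # X_local = R X_global + T it follows that X_global = R^-1 (X_local - T).
    x_global = np.linalg.inv(R) @ (x_local - T)
    return x_local, x_global


# Illustrative usage with made-up camera parameters
K = np.array([[185.0, 0.0, 320.0],
              [0.0, 185.0, 180.0],
              [0.0, 0.0, 1.0]])
x_local, x_global = pixel_to_global(x=250, y=100, depth=12.0, K=K, R=np.eye(3), T=np.zeros(3))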
Chapter 3
Vision-based Global Positioning
This chapter presents the vision-based global positioning approach. The functional requirements are derived and the main steps are explained in detail.
The underlying idea of the vision-based global positioning approach is straightforward and illustrated in Figure 3.1: First (a), in the preprocessing phase a 3D
referenced image database containing perspective Street View cutouts is generated. In this context, 3D referenced means that we can link every pixel in the
image database to the corresponding global 3D point which resulted in the 2D
image projection. Important steps include: download the Street View panoramas, create perspective cutouts, refine the Street View geotags and finally render
out the 3D path of every cutout. Second, the MAV image (b) which we want
to localize is searched in the Street View cutout database. This is done using
the so-called air-ground algorithm which outputs 2D-2D match points that link
corresponding feature points between the MAV and the Street View image (c).
Third, the resulting 2D-2D matches can be converted into 2D-3D matches which
link the MAV image feature points to their global 3D counterparts i.e. the 3D
points which result in the projection of the 2D feature points. This is done with
the help of the 3D referenced image database established in the preprocessing
phase. Finally, a so-called PnP algorithm can be used to estimate the MAV's
external camera parameters (d) which describe the global location and attitude
of the MAV with respect to the global reference frame (e).
3.1 Preprocessing
FUNCTION: Preprocessing
DESCRIPTION: Steps required to generate a 3D referenced image database.
INPUT
- Flight area gps_1, ..., gps_j, ..., gps_m ∈ A_flight where the MAV will operate. This is a list of WGS84 referenced GPS coordinates.
OUTPUT
- Geo-referenced image database I_cutout containing N perspective cutouts c_1, ..., c_k, ..., c_N along the flying route, as described in Figure 2.3.
- Internal camera matrix K_street specifying the focal lengths f_x, f_y and the optical centers c_x, c_y of the perspective Street View cutouts.
- A mapping which links every pixel p_kuv of cutout c_k ∈ I_cutout to its global 3D model coordinates X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv), where p_kuv stands for the pixel in cutout k in row u and column v, cf. function get 3D coordinates in Figure 2.9.
FUNCTIONAL REQUIREMENTS
- Download the Street View panoramas for every GPS coordinate gps_j ∈ A_flight and store them in a panorama image database I_panos using the function download panoramas, cf. Figure 2.3.
- Process the panoramas p_1, ..., p_j, ..., p_M ∈ I_panos and generate the perspective cutouts c_1, ..., c_k, ..., c_N ∈ I_cutout with the function perspective cutout, cf. Figure 2.5.
- Refine the GPS coordinates gps_1, ..., gps_j, ..., gps_m ∈ A_flight using the algorithm of Figure 2.7 and store the refined positions, referenced in the 3D model coordinate frame, as xyz_1, ..., xyz_j, ..., xyz_m ∈ A_ref.
- Based on A_ref and the perspective cutout inputs, derive the rotation matrix R_k and the translation vector T_k specifying the external camera parameters RT_k of the cutouts c_1, ..., c_k, ..., c_N in the 3D model coordinate frame.
- Calculate the internal camera parameters K_street based on the perspective cutout inputs, cf. chapter 2.3.
- Create a mapping between the pixels p_kuv and their global 3D model coordinates by rendering out the X_global, Y_global, Z_global path from the 3D model for every cutout c_k ∈ I_cutout using RT_k and K_street, cf. Figure 2.8.
Figure 3.2: Steps required to prepare the 3D referenced image database which is used in the vision-based positioning algorithm.
3.2 Air-ground Algorithm
The air-ground algorithm was introduced in [21] and partially resulted from the author's semester thesis 1 . The main goal of this algorithm is to find the most similar Street View image for a given MAV image by finding corresponding feature points. In the said thesis, it is shown that state-of-the-art image search techniques usually fail to robustly identify correct feature matches between street-level aerial images recorded by a MAV and perspective Google Street View images. The reasons for this are significant viewpoint changes between the two images, image noise, environmental changes and different illumination. The air-ground algorithm introduces a novel technique to simulate artificial views according to the air-ground geometry of the system and hence manages to significantly increase the number of matched feature points. Moreover, a state-of-the-art outlier rejection technique using virtual line descriptors (KVLD) [4] is used to reduce the number of wrong correspondences. Please refer to the cited papers for details on the air-ground algorithm. Figure 3.3 shows an example of the matches found by the air-ground algorithm between the MAV image on the left side and its corresponding Street View image on the right side. The green lines illustrate the corresponding feature points whereas the magenta lines describe the virtual lines as described in [4].
Figure 3.3: Match points found between the MAV image (left) and the Street View cutout (right) with the air-ground algorithm. Note that there are still some outliers.
Note that the output of the original air-ground algorithm is essentially a set of 2D-2D image correspondences between the MAV image and the most similar Street View image. As described in [21], by identifying the most similar Street View image, one can localize the MAV image in the sense of a topological map. However, no metric localization, i.e. the exact global position in a metric map, can be derived based solely on the 2D-2D correspondences. As described in this thesis, the 2D-3D correspondences between the MAV image coordinates of the feature points and the 3D coordinates referenced in a global coordinate frame can be established using the cadastral 3D city model. Based on these correspondences, the global position of the MAV can be inferred as shown in the next section. The functional setup of the air-ground algorithm is illustrated in Figure 3.4.
1 Micro
INPUT
- A geotagged image database I_cutout containing a set of perspective Street View cutouts c_1, ..., c_k, ..., c_N.
- MAV image d_j for which we want to identify the most similar Street View image in the set I_cutout.
OUTPUT
- The most similar Street View cutout c_j which corresponds to the MAV image d_j, i.e. the Street View cutout with the highest number of corresponding feature points.
- A list u_MAV containing N_matches x 2 entries, where the first column refers to the u coordinate and the second column to the v coordinate of a corresponding feature point in the MAV image plane. The image pixel coordinate system is the standard used by OpenCV (a).
- A list u_STREET containing N_matches x 2 entries, where the first column refers to the u coordinate and the second column to the v coordinate of a corresponding feature point in the Street View image plane.
- Note that the feature point in the first row of u_MAV corresponds to the feature point in the first row of u_STREET, and so on. N_matches stands for the total number of feature correspondences found between the two images.
FUNCTIONAL REQUIREMENTS
- Generate artificial views of the images to be compared by means of an affine transformation.
- Identify salient feature points in the artificial views.
- Backproject the feature points of the artificial views to the original images.
- Find corresponding feature points by means of an approximate nearest neighbor search.
(a) http://docs.opencv.org/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html
Figure 3.4: Functional setup of the air-ground algorithm. Please refer to [21] for details.
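The air-ground algorithm itself is described in [21]; the simplified OpenCV sketch below only illustrates the basic idea behind the functional requirements above, i.e. matching against affinely warped virtual views and backprojecting the matched keypoints to the original image. It is not the actual implementation: the tilt and rotation sampling, the ratio-test threshold and the use of SIFT (cv2.SIFT_create requires a recent OpenCV build) are assumptions, and the KVLD outlier rejection step is omitted.

import cv2
import numpy as np


def affine_views(img, tilts=(1.0, 2.0), rotations=(0, 45, 90, 135)):
    # Generate affinely warped virtual views together with the 3x3 matrix that
    # maps coordinates in the warped view back to the original image.
    h, w = img.shape[:2]
    views = []
    for t in tilts:
        for phi in rotations:
            A = np.vstack([cv2.getRotationMatrix2D((w / 2, h / 2), phi, 1.0), [0, 0, 1]])
            M = np.diag([1.0, 1.0 / t, 1.0]) @ A      # rotation followed by a tilt
            warped = cv2.warpPerspective(img, M, (w, h))
            views.append((warped, np.linalg.inv(M)))
    return views


def match_air_ground(mav_img, street_img, ratio=0.8):
    # Match a MAV image against virtual views of a Street View cutout with SIFT.
    if mav_img.ndim == 3:
        mav_img = cv2.cvtColor(mav_img, cv2.COLOR_BGR2GRAY)
    if street_img.ndim == 3:
        street_img = cv2.cvtColor(street_img, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    kp_m, des_m = sift.detectAndCompute(mav_img, None)
    matches_uv = []
    for warped, back in affine_views(street_img):
        kp_w, des_w = sift.detectAndCompute(warped, None)
        if des_m is None or des_w is None:
            continue
        for pair in matcher.knnMatch(des_m, des_w, k=2):
            if len(pair) < 2:
                continue
            m, n = pair
            if m.distance < ratio * n.distance:        # Lowe's ratio test
                u, v = kp_w[m.trainIdx].pt
                p = back @ np.array([u, v, 1.0])       # backproject to the original cutout
                matches_uv.append((kp_m[m.queryIdx].pt, (p[0] / p[2], p[1] / p[2])))
    return matches_uv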
3.3
The goal of this section is to calculate the MAV's external camera parameters, which are given by the 3 x 3 rotation matrix R_MAV and the 3 x 1 translation vector T_MAV, or alternatively by the 3 x 4 matrix RT_MAV, as follows:

    R_MAV = [ r11  r12  r13 ]            [ t1 ]
            [ r21  r22  r23 ] ,  T_MAV = [ t2 ] ,
            [ r31  r32  r33 ]            [ t3 ]

    RT_MAV = [ r11  r12  r13  t1 ]
             [ r21  r22  r23  t2 ]        (3.1)
             [ r31  r32  r33  t3 ]

Basically, the camera's external parameters define the camera's heading and location in the world reference frame. In other words, they define the coordinate transformation from the global 3D coordinate frame to the camera's local 3D coordinate frame. Note that T_MAV specifies the position of the origin of the global coordinate system expressed in the coordinates of the local camera-centred coordinate system [13]. The global camera position X_MAV in the world reference frame is given by:

    X_MAV = -R_MAV^(-1) T_MAV        (3.2)
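Equation (3.2) is used repeatedly in the following chapters; as a small numerical sanity check (variable names are illustrative):

import numpy as np


def camera_position(R_mav, T_mav):
    # Eq. (3.2): camera centre in world coordinates. For a proper rotation
    # matrix the inverse equals the transpose.
    return -np.linalg.inv(R_mav) @ T_mav

# The returned position X satisfies R_mav @ X + T_mav = 0, i.e. the camera
# centre is mapped to the origin of the camera frame.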
Several approaches have been proposed in the literature to estimate the external camera parameters based on 3D points and their 2D projections by a perspective camera. In [8], the term perspective-n-point (PnP) problem was introduced and different solutions were described to retrieve the absolute camera pose given n 3D-2D correspondences. The authors in [18] addressed the PnP problem for the minimal case where n equals 3 points and introduced a novel parametrization to compute the absolute camera position and orientation. In this thesis, the Efficient Perspective-n-Point Camera Pose Estimation (EPnP) algorithm [9] is used to estimate the MAV camera position and orientation with respect to the global reference frame. In their paper, the authors present a novel technique to determine the position and orientation of a camera given its intrinsic parameters and a set of n correspondences between 3D points and their 2D projections. The advantage of EPnP with respect to other state-of-the-art non-iterative PnP techniques is that it has a much lower computational complexity. The computational complexity grows linearly with the number of points supplied. Moreover, EPnP has proven to be more robust than other non-iterative techniques in terms of noise in the 2D location. An alternative to non-iterative approaches are iterative techniques, which optimize the pose estimation by minimizing a specific criterion. These techniques have been shown to achieve a very high accuracy if the optimization is properly initialized and successfully converges to a stable solution. However, convergence is not guaranteed and iterative techniques are computationally much more expensive than non-iterative techniques. Moreover, it was shown by the authors of [9] that EPnP achieves almost the same accuracy as state-of-the-art iterative techniques. To summarize, EPnP was used in this thesis because of its speed, robustness to noise and its simple implementation. Note that any other PnP technique could be used at this point to estimate the external camera parameters R_MAV and T_MAV of the MAV. The minimal number of correspondences required for EPnP is n = 4.
Given that the output of our air-ground matching algorithm may still contain outlier correspondences, EPnP is combined with the Random Sample Consensus (Ransac) scheme. Figure 3.5 illustrates the idea of Ransac with the classic example of fitting a line to sample data that contains outliers: First, Ransac randomly selects two points from the sample data and fits a line. Second, the number of inlier points, i.e. the points which are close enough to the fitted line according to a certain threshold, is determined. This procedure is repeated a certain number of times and the model parameters which have the highest number of inliers are selected to be the best ones. Ransac works robustly as long as the outlier percentage of the model data is below 50 percent. As specified in [21], the number of iterations N needed to select at least one random sample set free of outliers with a given confidence level p, e.g. p = 0.95, can be computed as

    N = log(1 - p) / log(1 - (1 - y)^s)        (3.3)

where y is the outlier ratio of the underlying model data and s the number of model parameters needed to estimate the model. In the case of the line example s = 2. In the case of EPnP the minimal set is at least equal to s = 4.
Figure 3.5: Left side: sample data for fitting a line containing outlier points. Right side: fitted line by applying Ransac. Images taken from Wikipedia.
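Equation (3.3) is straightforward to evaluate; a small helper function (the values in the example call are arbitrary):

import math


def ransac_iterations(p, outlier_ratio, s):
    # Eq. (3.3): iterations needed to draw, with confidence p, at least one
    # sample of s correspondences that is free of outliers.
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - (1.0 - outlier_ratio) ** s))


# e.g. 95% confidence, 30% outliers, minimal EPnP sample of s = 4 -> 11 iterations
print(ransac_iterations(p=0.95, outlier_ratio=0.30, s=4))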
The procedure used by Ransac in the case of EPnP to test whether a given correspondence point is an inlier is as follows: Firstly, Ransac randomly selects s points from the 3D-2D correspondences and supplies them to EPnP, which calculates R_MAV and T_MAV. The remaining 3D points are then reprojected to the 2D image plane based on R_MAV and T_MAV according to the following equation:

    z_c [ x_repr ]                    [ x_global ]
        [ y_repr ]  =  K_MAV RT_MAV   [ y_global ]        (3.4)
        [    1   ]                    [ z_global ]
                                      [    1     ]

where z_c is the projective scale factor.
2 http://docs.opencv.org/master/modules/calib3d/doc/
INPUT
- A set of corresponding 3D-2D points, i.e. a set of 3D points X_global and their 2D projections X_camera in the MAV's camera frame.
- The internal camera parameters K_MAV of the MAV camera.
- The Ransac parameters, i.e. the allowed reprojection error threshold repr_thresh in pixels, the confidence level p_confidence, and the number of matches s supplied to EPnP, which must be at least s = 4.
OUTPUT
- The external camera parameters R_MAV and T_MAV describing the MAV camera position with respect to the global reference frame.
FUNCTIONAL REQUIREMENTS
- Randomly select a subset of s 3D-2D match points.
- Calculate R_MAV and T_MAV based on EPnP.
- Reproject the 3D points and calculate the reprojection error e_reprojection.
- Consider a 3D point to be an inlier if e_reprojection < repr_thresh. Otherwise consider the match to be an outlier.
- Repeat this procedure according to the confidence level p_confidence and Equation 3.3.
- Take the iteration which resulted in the highest number of inliers and recalculate the final R_MAV and T_MAV based on these inliers using EPnP.
Figure 3.6: Functional setup of EPnP + Ransac. Please refer to [9] for details.
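Since the appendix refers to the OpenCV implementation of EPnP + Ransac, the following sketch shows how the functional setup of Figure 3.6 maps onto cv2.solvePnPRansac in a recent OpenCV Python binding; the threshold and confidence values are illustrative and not the exact settings used for the experiments.

import cv2
import numpy as np


def estimate_mav_pose(pts_3d, pts_2d, K_mav, repr_thresh=8.0, confidence=0.95):
    # pts_3d: Nx3 global model points, pts_2d: Nx2 pixel coordinates (OpenCV convention)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts_3d, dtype=np.float32),
        np.asarray(pts_2d, dtype=np.float32),
        K_mav, None,
        reprojectionError=repr_thresh,
        confidence=confidence,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R_mav, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 rotation matrix
    T_mav = tvec.reshape(3)
    X_mav = -R_mav.T @ T_mav              # Eq. (3.2): global camera position
    return R_mav, T_mav, X_mav, inliers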
3.4 Vision-based Positioning
Based on the previous steps, the vision-based positioning algorithm can now be
easily formulated. Algorithm 1 shows the basic setup of the system. Note that
in a realistic application, not the entire 3D referenced image database will be
searched. Based on a so-called position prior pprior , it will make sense to narrow
down the search space to make it as small as possible and hence speed up the
whole algorithm. Such a position prior can be given by the latest satellite-based
GPS estimate, IMU-based dead-reckoning or the previous vision-based estimate.
Moreover, if there is a magnetometer available, the heading measurements could
also be used to reduce the search space.
Algorithm 1: Vision-based Positioning
Data: 3D referenced image database I_cutout
Result: Global MAV position X_MAV
1 initialization, cf. Preprocessing in Figure 3.2;
2 for every MAV image d_j do
3     if position prior p_prior is available then
4         Reduce search space to I_reduced ⊂ I_cutout;
5         Set I_search = I_reduced;
6     else
7         Set I_search = I_cutout;
8     Find the most similar Street View cutout c_j ∈ I_search and the corresponding 2D-2D matches u_MAV, u_STREET with the air-ground algorithm, cf. Figure 3.4;
9     Convert the 2D-2D matches into 2D-3D matches using the 3D referenced image database, cf. Figure 2.9;
10    Estimate the external camera parameters R_MAV and T_MAV with EPnP + Ransac, cf. Figure 3.6;
11    Compute the global MAV position X_MAV according to Equation 3.2;
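Putting the pieces together, the following illustrative glue code shows how Algorithm 1 could be organized around the sketches from the previous sections (match_air_ground from chapter 3.2 and estimate_mav_pose from chapter 3.3); the database layout, the prior-based search radius and all names are assumptions made only for this sketch.

import numpy as np


def localize_mav_image(mav_img, database, K_mav, prior=None, radius_m=100.0):
    # database: list of dicts with keys 'cutout' (image), 'position' (3D array) and
    # 'pixel_to_3d' (callable mapping a cutout pixel to its global 3D point).
    # Narrow the search space around the position prior, if one is available.
    candidates = [e for e in database
                  if prior is None or np.linalg.norm(e["position"] - prior) < radius_m]

    # Air-ground matching: keep the cutout with the most correspondences.
    best_entry, best_matches = None, []
    for entry in candidates:
        matches = match_air_ground(mav_img, entry["cutout"])
        if len(matches) > len(best_matches):
            best_entry, best_matches = entry, matches
    if best_entry is None or len(best_matches) < 4:      # EPnP needs at least 4 points
        return None

    # Convert 2D-2D matches into 2D-3D matches and run EPnP + Ransac.
    pts_2d = [uv_mav for uv_mav, _ in best_matches]
    pts_3d = [best_entry["pixel_to_3d"](uv_street) for _, uv_street in best_matches]
    return estimate_mav_pose(pts_3d, pts_2d, K_mav)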
Chapter 4
Experimental Setup
This chapter describes the experimental setup which is used to test the vision-based positioning system. Two datasets are used to verify the performance of the introduced approach: firstly, the big dataset which was already presented in [21] and secondly, the small dataset which was newly recorded and also contains GPS data for comparison.
4.1 Platform
Figure 4.1: On the left: Distorted MAV image, on the right: Undistorted MAV
image
The recorded imagery was therefore undistorted using the OpenCV library as
outlined in [33]. The OpenCV drone distortion parameters and the camera matrix were determined accordingly; the resulting internal camera matrix of the MAV camera is

    K_MAV = [ 558.265829       0.0       328.999406 ]
            [     0.0      558.605079    178.924958 ]        (4.1)
            [     0.0          0.0           1.0    ]

For the Street View cutouts, the focal length follows from Equation (2.2) with a cutout width of C_width = 640 pixels:

    f_x = f_y = C_width / (2 tan(hfov/2)) = 640 / (2 tan(60°)) = 184.7521        (4.4)

which yields the internal camera matrix of the Street View cutouts

    K_STREET = [ 184.7521      0       180 ]
               [     0      184.7521   320 ]        (4.5)
               [     0         0        1  ]
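For completeness, undistorting a recorded frame with the calibrated camera matrix is a single OpenCV call; the file name is hypothetical and the zero distortion vector below is only a placeholder for the calibrated coefficients, which are not reproduced here.

import cv2
import numpy as np

# Camera matrix from Eq. (4.1); replace the zeros with the calibrated
# radial/tangential distortion coefficients of the AR.Drone camera.
K_MAV = np.array([[558.265829, 0.0, 328.999406],
                  [0.0, 558.605079, 178.924958],
                  [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)

frame = cv2.imread("mav_frame.png")                 # hypothetical input frame
undistorted = cv2.undistort(frame, K_MAV, dist_coeffs)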
4.2 Test Area
To test the vision-based positioning approach, two datasets were recorded by piloting the AR.Drone manually through the streets of Zurich filming the building facades, i.e. the front-looking camera was turned by 90 degrees with respect to the flying direction. The first dataset, called ETH Zurich big, covers a trajectory of roughly 2 kilometres in the neighbourhood of the ETH Zurich and has already been used in [21]. Figure 4.2 (a) shows the map of the recorded flying route together with some sample images. The average flying altitude is roughly 7 meters. This dataset has been recorded using the ROS 1 ardrone autonomy software package. The drone was controlled with a wireless joystick and the images were streamed down to a MacBook Pro over a wireless connection. In total, the ETH Zurich big dataset contains 40599 images. For computational reasons, to test the vision-based global positioning system, the dataset has been sub-sampled using every 10th image, resulting in a total of 4059 images. The trajectory of the big dataset corresponds to 113 Street View panoramic images which are roughly 10 - 15 meters apart from each other. As we were flying with
1 http://www.ros.org
the MAV camera facing the building facades, i.e. turned by 90 degrees with respect to the driving direction of the Street View car, only 113 perspective cutouts are stored in the image database I_cutout. In other words, the yaw parameter in the function perspective cutout was always set to 90 degrees. In a more realistic scenario, every panoramic image would be sampled according to different yaw parameters, e.g. yaw = [0, 90, 180, 270], as we cannot explicitly assume to know in which direction the MAV is looking if it is flying completely autonomously. However, to test the vision-based positioning approach, the above setting seems to be reasonable in terms of computational resources.
Besides the side-looking dataset, another dataset for the same trajectory was recorded with a front-looking camera facing the flying direction. This front-looking dataset contains only 22363 images but covers the same area. The reduced number of images for the front-looking dataset is due to a much easier manual control when flying in the viewing direction of the camera and hence a higher flying speed. However, the front-looking dataset has not been used in this work.
The second dataset, called ETH Zurich small, is a subpath of the path gathered in the big dataset (cf. blue line in Figure 4.2 (a)) and has been recorded together with satellite-based GPS in order to have a comparison to the proposed vision-based approach. For every recorded frame, this dataset also contains the recorded GPS coordinates, i.e. latitude, longitude and altitude according to WGS84, from where the image has been taken. The temporal synchronization of the GPS tags and the image frames is done on the software level using the open source package cvdrone 2 , which combines the OpenCV image library with the AR.Drone 2.0 API.
To calculate the vision-based position estimates, every 10th image of the dataset
ETH Zurich big is processed using algorithm 1 outlined in section 3.4. Every
MAV image is compared to the eight nearest Street View images in order to
find the correct match i.e. the Street View image which corresponds to the
observed scene. By comparing every MAV image only to the eight nearest Street
View images, the search space and hence the computational requirements are
drastically reduced for the air-ground algorithm described in section 3.2. This
corresponds to the so-called position prior pprior described in algorithm 1. In a
real flight scenario, it is realistic to have a position prior of at least 100 meters
accuracy based on other on-board sensors such as satellite-based GPS or an
IMU used for dead-reckoning. It therefore seems reasonable to only compare
the MAV images to the nearest Street View images instead of searching the
whole Street View database. However, the proposed approach also works for
a bigger search space as demonstrated in [21] at the expense of an increased
computational complexity.
The following list summarizes the parameters used to achieve the results presented in the next chapter:
- Number of Street View cutouts stored in I_cutout: N_cutout = 113.
- Number of processed MAV images used to test the vision-based positioning approach: N_MAV = 4059.
- Image size of the cutouts: C_width x C_height = 640 x 360 pixels.
2 https://github.com/puku0x/cvdrone
(a) Recorded Test Area: The red line describes the MAV flying path of the ETH Zurich big dataset. The blue-white line designates the ETH Zurich small dataset which is a subset of the big dataset and was recorded together with satellite-based GPS.
Figure 4.2: (a) shows an aerial map of the recorded datasets. Figures (b)-(d)
show MAV example images (top row) together with corresponding Street View
images (bottom row). Note that there are significant differences in terms of viewpoint, illumination, scale and environmental setting between the MAV images
and the corresponding Street View images which makes a correct classification
highly challenging.
Chapter 5
Results and Discussion
5.1 Visual-Inspection
Figure 5.2 shows the top view of the ETH Zurich big dataset. The red dots represent the vision-based global position estimates for each camera frame for which a corresponding Street View image was found according to Algorithm 1. It can be seen that almost the whole flying route is covered by position estimates. Some of the streets are covered very densely with position estimates, meaning that many correct Street View correspondences are found, while other areas are rather sparsely covered. There are four areas, designated with the numbers 1-4 in Figure 5.2 (a), where no position estimates are available. The reason why no or not enough Street View correspondences are found in areas 1, 2 and 3 of Figure 5.2 (a) is vegetation which occludes the buildings, as illustrated in Figure 5.1. The reason why there are no Street View correspondences found for area number 4 is not so obvious. Possible reasons could be the relatively high flying speed in that area, resulting in motion-blurred images and a reduced number of MAV frames per Street View image to be matched. Other possible reasons for
(a) MAV image from area 1 (b) MAV image from area 2 (c) MAV image from area 3
Figure 5.1: No Street View matches were found for these areas, cf. Figure 5.2, as the buildings are occluded by vegetation.
By close examination of Figures 5.2 (a) and (b), the reader will realize that the vision-based position estimates, i.e. the red dots, are not exactly the same for the two plots. The reason for this is that in Figure 5.2 (a) the minimal set of s = 4 match points was used to run EPnP and Ransac as described in chapter 3.3, whereas in Figure 5.2 (b) a non-minimal set of s = 8 match points was used. To illustrate the difference between the two approaches, Figure 5.3 shows some close-ups of the whole map. The first row shows close-ups when using the minimal set whereas the second row shows close-ups when using the non-minimal set. By closely comparing the first two rows, one can conclude that the position estimates derived by using the non-minimal set seem to be more plausible than by using only the minimal set. This is illustrated by the fact that the non-minimal position estimates tend to be more organized along a smooth trajectory which is realistic with respect to the real MAV flying path. Stated differently, the position estimates derived by using the minimal set tend to jump around more than by using the non-minimal set, i.e. they are more widely spread and less spatially consistent. The reason for the more robust results in the non-minimal case is that the position estimates derived by EPnP are less affected by outliers and degenerate point configurations. However, in both approaches, the minimal and the non-minimal, a few extreme outliers occur which are clearly not along the flying path, as highlighted by the yellow boxes in Figure 5.3. One possible cause for these outliers are wrong point correspondences between the Street View images and the MAV images given by the air-ground algorithm described in chapter 3.2. Another potential explanation are inaccurate 3D point coordinates supplied to EPnP resulting from inaccuracies when overlaying the Street View images with the cadastral city model as illustrated in chapter 2.5. The last row of Figure 5.3 shows the same close-ups after filtering the non-minimal estimates (second row) based on the standard deviations calculated in the next section. It is shown that the outliers (yellow boxes) can be successfully discarded by limiting the allowed standard deviations. The next section explains step by step how to get a measure for the uncertainty of the derived vision-based position estimates based on a Monte Carlo approach.
(a) Top View Vision-based Position Estimates: EPnP + Ransac. For this plot, Ransac
uses a minimal set of s=4 points for the EPnP algorithm.
(b) Top View Vision-based Position Estimates: EPnP + Ransac. For this plot, Ransac was required to use a non-minimal set of s = 8 match points for the EPnP algorithm, cf. chapter 3.3. The visual difference between using s = 4 or s = 8 points for EPnP + Ransac is shown in more detail in Figure 5.3.
Figure 5.2: The red dots represent the vision-based position estimates whereas
the black line in Figure (b) illustrates the flight path. The numbers 1-4 in Figure
(a) show areas where no matches were found cf. Figure 5.1.
Figure 5.3: Close ups of Figure 5.2 (a) and (b): First row: Vision-based position
estimates (red points) using a minimal set of s=4 points for EPnP and Ransac.
Second row: Vision-based estimates using a non-minimal set of s=8 points for
EPnP and Ransac. By comparing the first and the second row, one can see
that the estimates illustrated in the second row for the non-minimal set tend
to be more aligned along the street. However, in both cases there are some
clearly wrong estimates (highlighted in yellow) which are not along the flying
path. To get rid of those estimates, it is suggested to filter the vision-based
estimates based on the standard deviations as shown in the last row. Please
refer to chapter 5.2 for more details.
5.2 Uncertainty Quantification
To quantify the uncertainties related to the vision-based global positioning estimates with respect to the underlying data, a Monte Carlo type approach is
used to calculate the covariances for each estimate. The procedure is outlined
below:
Algorithm 2: Uncertainty Quantification
Data: A set of N_matches 2D-3D match points u_MAV and X_global for a specific MAV image
Result: Covariance and standard deviations related to the vision-based position estimation
1:  for a specific MAV image - Street View image match pair do
2:    Initialize: calculate the vision-based position estimate X_MAV according to Algorithm 1;
3:    initialize counter: j = 1;
4:    for it = 1:1000 do
5:      1.) Randomly select a subset of s 3D-2D match points out of the total number of N_matches 3D-2D match points and calculate the rotation matrix R_MAV,it and the translation vector T_MAV,it with EPnP;
6:      2.) Calculate the number of inliers N_inliers by reprojecting all the N_matches 3D points X_global to the image plane with a suitable pixel threshold t based on Equations 3.4 and 3.5;
7:      if N_inliers > s, i.e. if the number of inliers is bigger than the number of sample points s, then
8:        save R_MAV,it and T_MAV,it as R_MAV,j and T_MAV,j and store the global localization estimate X_MAV,j = -R_MAV,j^(-1) T_MAV,j and the heading yaw_j (which can be extracted from R_MAV,j) in the list L_estimates;
9:        increase counter: j = j + 1;
10:     else
11:       do not save R_MAV,it and T_MAV,it;
12:   Calculate the covariance and the standard deviations over all estimates stored in L_estimates. (a)

(a) The covariances and the standard deviations can be easily calculated using the Matlab functions cov and std, cf. http://www.mathworks.ch/ch/help/matlab/ref/cov.html
The Monte Carlo approach used is straightforward: First, the algorithm randomly selects a subset of all the found 2D-3D match points. Second, it calculates the vision-based position estimates R_MAV,j and T_MAV,j using EPnP. Third, it reprojects all the 3D points to the image plane as explained by Equation 3.4 and checks if they are valid inliers, i.e. if their reprojection error is smaller than the allowed reprojection error threshold. Finally, if enough inliers are found, the position estimates R_MAV,j and T_MAV,j are saved. This procedure is repeated 1000 times. Based on all the saved position estimates, the covariances quantifying the uncertainty related to the initial vision-based position estimate X_MAV for a certain MAV - Street View image match pair can be calculated. The described procedure is illustrated in Figure 5.4.
Figure 5.4: This Figure illustrates the described procedure to calculate the covariances for two example MAV - Street View match pairs. Left side: top view of MAV - Street View match pair Nr. 4730-31 from the small dataset. Right side: top view of MAV - Street View match pair Nr. 3500-32 from the small dataset. The red dots correspond to the Monte Carlo position estimates given by R_MAV,j and T_MAV,j. The green squares represent the positions of the Street View cameras. The blue ellipses limit a 95-percent confidence interval based on the calculated covariance. The black strokes represent the direction of the mean yaw. The yellow points represent the mean position estimates based on all the Monte Carlo samples. Finally, the magenta crosses represent the positions of the 3D feature points X_global which are found on the building facades. The blue ellipses can be used to identify how reliable a certain vision-based position estimate is, based on a certain probability interval. Based on the blue ellipses, we can say that we are 95 percent sure to be in the area bordered by the blue ellipse. As shown in the right image, the Monte Carlo estimates can be highly clustered, resulting in a relatively narrow ellipse, meaning that the vision-based estimate can be considered to be reliable. On the other hand, as shown in the left image, the Monte Carlo estimates can also be dispersed, resulting in a less concentrated confidence area.
Please note the following: one drawback of the described Monte Carlo approach is that it depends on the number of valid Monte Carlo estimates (represented by the red dots in Figure 5.4). This number is given by the counter j at the end of Algorithm 2. Remember that an estimate is considered to be valid if more than s inliers are found (where s stands for the number of randomly sampled 2D-3D points supplied to EPnP). However, this number can differ strongly between MAV - Street View match pairs. If the total number of 3D-2D match points N_matches is high (e.g. N_matches = 200), the number of valid Monte Carlo estimates is usually also high (e.g. j = 500 out of the 1000 iterations result in a valid estimate). In contrast, if the total number of 3D-2D match points N_matches is low (e.g. N_matches = 10), the number of valid Monte Carlo estimates will also be low (e.g. only j = 3 out of the 1000 iterations result in a valid estimate). If the number of valid Monte Carlo estimates is too small (i.e. less than 20), it is reasonable not to use those estimates at all to calculate covariances, as they might be highly unreliable. The same holds when calculating the standard deviations of the position and the yaw for the MAV - Street View match pairs based on the Monte Carlo estimates: if the
number of Monte Carlo estimates is too low (e.g. if only j = 3 out of the 1000
iterations result in a valid estimate), the calculated standard deviations may be
either very small or very high. Therefore, only uncertainty estimates which are
based on more than j = 20 Monte Carlo estimates are considered to be reliable
in this thesis. This is the reason why Figure 5.5 only shows standard deviations
for about 800 MAV - Street View match pairs out of the 4059 maximum possible
(i.e. if every MAV image could have been correctly classified). The fundamental
problem here is that for certain MAV - Street View match pairs, not enough
correct correspondences can be found by the air-ground algorithm. Suggestions
on how to improve that are given in chapter 6.
(a) This Figure shows a top view (global X-Y coordinates) of a subset of the big
dataset which corresponds to the corner illustrated on the left side of Figure 5.3. The
blue ellipses show the 95-percent confidence intervals of the vision-based position
estimates calculated based on the outlined Monte Carlo approach. The green boxes
correspond to the Street View camera positions. The magenta crosses show the
positions of the matched 3D feature points on the building facades. It is shown that
the vision-based estimates are usually found near the Street View camera positions.
Moreover, it can be seen that most of the confidence intervals border a reasonably small area, meaning that the accuracy of the vision-based positioning approach seems to be practical for accurately localizing a MAV in an urban environment.
Figure 5.5 shows the Monte Carlo-based standard deviations calculated for the big dataset for the global X, Y, Z-coordinates and the camera yaw. Based on the calculated standard deviations, one can define a simple filter rule to discard vision-based position estimates whose standard deviation is too high. For example, one could only consider vision-based estimates with a standard deviation of less than 2.0 meters for the X, Y and Z-coordinates. The result of such an approach is illustrated in the last row of Figure 5.3. It clearly demonstrates that such a strategy eliminates the extreme outliers. However, the price to pay is that the total number of available vision-based estimates is reduced. Of course, in a real-life application, more sophisticated rules could be applied to discard outliers, e.g. by using IMU data.
Figure 5.5: This Figure shows the standard deviations for the MAV - Street View match pairs. The y-axis shows the standard deviation in meters / degrees, the x-axis stands for a certain MAV - Street View match pair from the big dataset. The blue curve shows the standard deviation of the global vision-based X-coordinate [m]. The green curve shows the standard deviation of the global vision-based Y-coordinate [m]. The red curve shows the standard deviation of the global vision-based Z-coordinate [m]. The magenta curve shows the standard deviation of the estimated yaw [degrees]. The mean standard deviations for the X-coordinate and the Y-coordinate are 1.16 meters and 1.56 meters, respectively. The mean standard deviation for the Z-coordinate (which is the height) is slightly larger at 2.20 meters. The mean standard deviation for the yaw is 7.86 degrees.
5.3 Virtual-views and Iterative Refinements
The quality of the vision-based position estimates can also be assessed visually by rendering so-called virtual views from the estimated camera positions and comparing them to the original MAV images. The procedure is as follows: based
on the estimated MAV camera position, a camera is added to the 3D city model.
The internal parameters of the MAV camera have been described in chapter 4.1.
Texture is then applied to the scene by backprojecting the Street View image to
the model as explained in chapter 2.5 and a rendered view from the perspective
of the estimated camera position is generated. If the estimated camera position
is identical with the true camera position of the MAV, the original MAV image
should cover exactly the same scene as the artificially generated virtual-view.
Stated differently, the higher the visual similarity between the two images, the better the vision-based position estimate. Figure 5.6
shows some representative examples of original MAV images and the virtual
views generated according to their vision-based position estimates. Note that
there are some artefacts and inaccuracies in the virtual views resulting from
failures related to the backprojection of the Street View images to the 3D city
model. The examples clearly show that the vision-based position estimates are
in the close neighbourhood of the true camera positions as there is a substantial
overlap between the two depicted scenes. However, it is also evident that the
precision of the vision-based position estimates is in the order of a couple of
meters rather than a few centimetres. This is in accordance with the calculated
position covariances described in the previous section.
Another interesting application of the virtual views is to use them to refine
the global position estimates. The idea is as follows: After calculating the position estimate based on the vision-based global positioning algorithm described
in chapter 3.4, the algorithm is applied a second time. However, this time,
the air-ground matching step is carried out between the virtual-view and the
original MAV image. The other parts of the algorithm remain the same. The
procedure is illustrated in Figures 5.7 and 5.8. The first row shows the matches
found between the MAV image and its corresponding Street View image as a
result of the air-ground matching algorithm described in chapter 3.4. Based on these
matches, the MAV camera position is estimated using EPnP and Ransac and
a virtual view is rendered out from the textured model. The rendered view is
shown in the second row on the right side. The air-ground algorithm is then
applied again between the MAV image and the rendered-out virtual view. The
resulting matches are shown in the third row. Finally, the MAV camera position is again estimated by applying EPnP and Ransac and a new virtual view is
generated based on this refined position estimate. In Figure 5.7 it can be clearly
seen that the precision of the vision-based position estimate improves by applying this procedure. This is shown by the fact that there is more visual overlap
between the second virtual view and the original MAV image than between the first
virtual view and the MAV image. However, if the same procedure is applied
to the virtual view in Figure 5.8, no significant improvement can be observed.
The following observation which was also confirmed in other examples may give
an explanation for this discrepancy: if the first vision-based position estimate
is comparatively bad (especially in terms of the camera rotation) as shown in
the first row of Figure 5.7, a second iteration using the air-ground algorithm
will improve the position estimate. However, if the first vision-based position
estimate is already relatively good (especially in terms of the camera rotation),
a second iteration using the air-ground algorithm will not significantly improve
the position estimate. An interesting future application would be to iteratively
refine the position estimates by minimizing the photometric error between the MAV image and the generated virtual views, as explained in chapter 6.
Figure 5.6: The left column shows the original MAV images, the right column
shows the rendered-out virtual views. The more similar the two images are, the
better is the vision-based global position estimate.
Figure 5.7: Image (a) shows the matches between the original MAV image and
its corresponding Street View image as a result of the air-ground algorithm. The
second row shows the original MAV image (b) on the left side and the virtual
view (c) generated based on the position estimate after EPnP and Ransac on the
right side. The third row shows the matches found between the original MAV
image and the virtual view. The last row shows the original MAV image (e) on
the left side and the virtual view (f) generated based on the second iteration of
EPnP and Ransac. It can be clearly seen that the position estimate is improved
i.e. that the visual overlap is increased between the original MAV image (e)
and the virtual view after the second iteration (f).
Figure 5.8: Image (a) shows the matches between the original MAV image and
its corresponding Street View image as a result of the air-ground algorithm. The
second row shows the original MAV image (b) on the left side and the virtual
view (c) generated based on the position estimate after EPnP and Ransac on
the right side. The third row shows the matches found between the original
MAV image and the virtual view. The last row shows the original MAV image
(e) on the left side and the virtual view (f) generated based on the second
iteration of EPnP and Ransac. In contrast to Figure 5.7, there is no significant
improvement of the position estimate after the second iteration i.e. the visual
overlap between the original MAV image and the virtual views is not visibly
increased.
5.4 GPS Comparison
In Figure 5.9, the black path is based on pure visual odometry whereas the red path is based on visual odometry in combination with the proposed vision-based global updates.
Figure 5.9: Top view: The green dots show the path given by the GPS. The
black dots represent the path estimated purely based on visual odometry. The
red dots represent the visual odometry together with the updates given by the
vision-based global positioning system. Note that the beginnings of the latter two paths are the same (both in black). After the first global position update, the drift is corrected and the red path starts. The jumps in the red path
are caused by the vision-based position updates which are used to correct the
drift of the visual odometry.
Figure 5.11 shows the global X,Y,Z coordinates in the 3D model coordinate
system for each image frame for the satellite-based GPS readings (blue) and the
vision-based position estimates (green) 3 . It is shown that the position estimates
for the global X-coordinates are relatively similar for the two (cf. Figure 5.11
(a)). Their mean deviation is 2.78 meters. The Y-coordinates clearly move in accordance with each other, however, there is a visible offset between the two, which is also reflected by the relatively high mean deviation of 9.26 meters (cf. Figure 5.11 (b)). Moreover, this can also be seen in Figure 5.9, where the first GPS measurement starts more to the right than the first vision-based estimate. The GPS-based Z-coordinate, which corresponds to the flying height, is rather erratic and clearly overestimated (cf. Figure 5.11 (c)). This is also illustrated in Figure 5.10, which shows some screenshots from the 3D model. The GPS path (green) is clearly too high. Based on the altitude sensor and the flight parameters, we know that the flying height has never been more than 6 meters above ground. However, the GPS path is sometimes even above the buildings.
As mentioned in the beginning, without any ground-truth it is not possible
to finally conclude whether the GPS-based flight path or the visual odometry
together with the vision-based global positioning approach is more realistic.
However, based on the plots in this section we can draw the following conclusions:
3 The GPS measurements in WGS84 have been converted to the Swiss Coordinate System CH1903 as described in chapter 2.1.
Figure 5.10: 3D screenshots from the estimated paths: it can be seen that the
flying height is overestimated by the GPS measurements (green). Moreover,
it is shown that the pure visual odometry (black) eventually crashes into the buildings if no global vision-based update is applied to correct the drift.
Pure visual odometry without any update to correct the drift will eventually lead to significant positioning errors, as shown in Figure 5.9 where the visual odometry hits the buildings. The proposed vision-based global positioning update successfully manages to correct for this drift. Figures 5.11 (a) and (b) show that the vision-based path is rather similar to the GPS path. In particular, the relative movements between the two are highly correlated. The reason for the visible gap in the absolute position in Figure 5.11 (b) cannot be clearly attributed to either of the two estimates. However, from visual inspection, it seems likely that the GPS estimate is slightly biased to the right side, as shown in Figure 5.9. In terms of altitude, the satellite-based GPS measurements are clearly too high. This is not surprising as GPS has a much better lateral than vertical precision. The proposed vision-based approach seems to be more precise in terms of altitude estimation. Based on the flight recordings, the estimated vision-based path seems to be plausible as the flight was carried out in the middle of the street. From the results presented it may therefore be concluded that the proposed vision-based tracking and positioning system can offer a viable alternative or extension to a purely satellite-based global positioning system.
Figure 5.11: Global X, Y and Z-coordinates per MAV image ID for the satellite-based GPS readings and for the visual odometry with vision-based global updates (legend: GPS; Visual Odometry + Mag + Update).
(a) X-coordinates: The GPS path and the vision-based path are highly correlated. Correlation coefficient = 0.99.
(b) Y-coordinates: The GPS path and the vision-based path are highly correlated. Correlation coefficient = 0.98.
(c) Z-coordinates: The GPS path and the vision-based path deviate significantly. GPS is clearly overestimating the height and is inconsistent.
Chapter 6
Conclusion and Outlook
Figure 6.1: Inaccuracies resulting from the backprojection of the Street View images onto the city model are directly related to the positioning inaccuracies. Especially at the borders and the tops of the buildings, the backprojection errors can be seen (a-b). Wrong correspondences after the air-ground algorithm (c) are also a source of error.
To increase the number of inliers after EPnP + Ransac and hence increase
the accuracy and robustness of the vision-based position estimates, there are
basically two approaches: (1) to get a more accurate textured 3D city model
and/or (2) to increase the number of air-ground matches. The following steps
could be carried out to achieve this:
Use multiple cameras with a bigger field of view. This is the most straightforward and practical solution to increase the number of MAV - Street View image correspondences. By having multiple cameras, the overlap between the Street View image and the MAV images can be drastically increased. For example, by having two side-looking cameras (facing the buildings), the number of found matches can be doubled. Moreover, the field of view of the cameras should be as large as possible. Figure 6.3 illustrates the effect of a larger field of view on the number of found matches.
In a real system, in terms of reliability, it will make sense to use all sensory information which is available on the MAV. It is therefore suggested to fuse satellite-based GPS measurements, IMU data and the suggested vision-based approach to obtain the highest possible positioning accuracy and reliability.
Figure 6.3: Top row: aerial image recorded with a GoPro camera (horizontal field of view of 170 degrees) and the resulting air-ground matches with the most similar Street View image. Bottom row: aerial image of a nearby scene taken with the standard AR.Drone 2.0 camera used in this work (horizontal field of view of 92 degrees). It is clearly shown that the number of matches can be significantly increased by using a camera with a bigger field of view. Note that the GoPro camera on the Fotokite is installed with a downwards-looking angle, i.e. the camera has a negative pitch of 45 degrees. This is the reason why a big part of the image depicts the street and not the facades. In a real system, the pitch angle of the installed cameras should be chosen such that the amount of visual overlap between the MAV image and the Street View images is maximized.
Other suggestions for future work and final remarks
In terms of hardware and software requirements, for a realistic system, the
following suggestions are made:
Use an open-source hardware platform which can openly be configured according to specific user requirements. In this work, the standard interface of the AR.Drone 2.0 has been used. While this is an easy-to-work-with, robust and low-cost MAV platform, the on-board software cannot be directly manipulated. Moreover, it would be very cumbersome, if not impossible, to add additional hardware like cameras or on-board computers to it. It is therefore suggested to build one's own platform, e.g. using the PX4 autopilot by ETH 4, which builds on the AR.Drone 2.0 platform.
Figure 6.4: Figure (a) shows a depth map generated directly from Google Street
View. Note that the 3D geometry has been recorded with a laser-range scanner
and synchronized with the Google Street View camera as described in [3]. The
scene is modelled with the help of planes. Figure (b) shows the Street View
panoramic image for the particular scene. Figure (c) shows the panoramic depth
map generated by the cadastral 3D city model as used in this thesis. Note that
the 3D scene is modelled in more detail than in the depth map from Google in (a). However, the Street View camera geotags are not perfectly synchronized with the cadastral 3D city model, resulting in the errors described in 6.1. Please note that Figure (a) has not been generated using the official Google Street View API. It is therefore recommended to consult Google's Terms and Conditions before making excessive use of the script described before.
Figure 6.5: This Figure illustrates the principle of the described active camera
ranging approach. The main idea is to let the MAV turn around if not enough
correspondences can be found for the current view to accurately localize. On
the top left side, a Street View panoramic image is displayed. On the bottom,
a stitched user-generated (e.g. generated by the MAV) panoramic image for a
nearby scene is displayed, which results when the camera is turned around by 360 degrees. The green boxes show parts of the two panoramas which can be
successfully matched with the air-ground algorithm as shown on the right side.
The black boxes represent parts of the panoramas for which no matches were
found. It is evident that if the MAV camera is currently stuck in the state of a black box, i.e. no matches are found for the current MAV image, it will make sense to turn around and look for a perspective which can be matched with the available Street View panorama.
Also use a satellite-based GPS receiver. This thesis proves that pure
vision-based global localization is possible. However, a GPS receiver will
be very useful in combination with the suggested vision-based approach.
Especially in the above-rooftop flight scenario described in chapter 1.3.1, satellite-based GPS will be the method of choice.
Use MORSE and ROS for a realistic software implementation.
The algorithms and functions in this thesis have been partially programmed
in Python, Matlab and C++. For the work with the 3D cadastral city
model, the open-source 3D modelling software Blender 5 has been used, which offers a direct interface to Python. However, for future applications, it is suggested to use MORSE 6, which is a generic simulator for academic robotics. MORSE can now be easily interfaced with the Robot Operating System (ROS) and the Blender game engine, which allows for a much more efficient workflow.
Finally, to conclude this thesis, the author would like to share some personal
remarks concerning the advance of aerial robotic applications in urban areas.
Based on the experience gathered during this work, it is concluded that the
biggest challenges on the way to autonomous aerial robots will not be of a technical nature (these can be solved) but will more likely be public concerns resulting in a restrictive regulatory environment. When recording the datasets used in
this work, several interactions with the public showed that many people perceive camera-equipped Micro Aerial Vehicles as a threat to their privacy. These
public concerns should be taken seriously by the scientific community. As mentioned in chapter 1.4, the questions of privacy and liability law in the context
of MAV applications are not entirely solved yet. The regulators in Switzerland
and elsewhere will be forced to clarify these pending legal issues in the near future. For the successful advance of autonomous MAV applications, like aerial parcel delivery or first-aid response systems, it will therefore be crucial that the involved scientists and engineers pro-actively engage in the coming public discussion by showing the opportunities of this relatively new technological trend,
and at the same time, by being aware of the possible threats related to it.
4 https://pixhawk.ethz.ch/
5 http://www.blender.org/
6 http://www.openrobots.org/morse/doc/latest/what_is_morse.html
Appendix A
Appendix
A.1 OpenCV EPnP + Ransac
/*
 * PnP_Drone.cpp
 *
 * Created on: Oct 24, 2013
 * Author: yves.albers@gmail.com
 *
 * This file is used to calculate the EPnP + Ransac position estimates
 *
 * Input:  InputYML.yml file containing 2D-3D matches for a
 *         specific MAV - Street View match pair;
 *         Thesis notation:
 *         -> x3d_h refers to X_{global}, -> x2d_drone refers to u_{MAV}
 *         -> x2d_street refers to u_{STREET}, K_Drone refers to K_{MAV}
 *
 * Output: OutputYML.yml file containing the vision-based position estimate:
 *         Thesis notation:
 *         -> tvec refers to T_{MAV}, R_Mat refers to R_{MAV}
 *         -> T_Vec refers to X_{MAV}
 *
 * Note that instead of using EPnP also other PnP approaches can be tested
 *   PnPType:
 *   1 = Iterative PnP
 *   2 = P3P by Gao et al.
 *   3 = EPnP
 *
 * Note: this file has been written so that it can be used in batch mode from the
 * command line by the python script batch_run_pnp.py
 * The output yml files can be read by Matlab
 */

#include <cv.h>
#include <highgui.h>
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
A.2 ICRA Submission
The paper on the following pages was submitted to the IEEE International Conference on Robotics and Automation (ICRA 2014) and is currently under review.
I. INTRODUCTION
Our motivation is to create vision-driven localization
methods for Micro Aerial Vehicles (MAV) flying in urban
environments, where the satellite GPS signal is often shadowed by the presence of the buildings, or not available. Accurate localization is indispensable to safely operate small-sized aerial service-robots to perform everyday tasks, e.g.,
goods delivery,1 inspection and monitoring,2 first-response
and telepresence in case of accidents.
In this paper, we tackle the problem of localizing MAVs
in urban streets, with respect to the surrounding buildings.
We propose the use of textured 3D city models to solve the
localization problem of a camera equipped MAV. A graphical
illustration of the problem addressed in this work is shown
in Fig. 1.
In our previous work [1], we described an algorithm
to search airborne MAV images within a geo-registered,
street-level image database. Namely, we localized airborne
images in a topological map, where each node of the map is
represented by a Street View image. 3
In this paper, we advance our earlier work [1] by backprojecting the geo-referenced images onto the 3D cadastral model of the city to obtain the depth of the scene. Therefore, the algorithm described in this paper computes the 3D position of the corresponding image points between the airborne
Fig. 2: Comparison between ground-level Street View (top row) and airborne
MAV (row 2 and 3) images used in this work. Note the significant changes
in terms of viewpoint, over-season variation, and scene between the database
((a); respectively (d)) and query images ((b), (c); respectively (e), (f)) that obstruct their visual recognition.
3 For details please watch: http://youtu.be/
Fig. 3: Number of inlier feature points matched between the MAV and
ground images versus the distance to the closest Street View image.
of the system, rather than the real-time, efficient implementation. Though, for the sake of completeness, we present
in Fig. 4 the effective processing time of the Air-ground
image matching algorithm, using a commercially available
laptop with an 8-core, 2.40 GHz clock architecture. The
Air-ground matching algorithm is computed in five major
steps: (1) virtual view generation and feature extraction; (2)
approximate nearest-neighbor search within the full Street
View database; (3) putative correspondences selection; (4)
approximate nearest-neighbor search among the features
extracted from the aerial MAV image with respect to the
selected ground level image; (5) acceptance of good matches
(kVLD inlier detection). In Fig. 4 we used more than 400
airborne MAV images. All the images were searched within
the entire Street View images that could be found along the
2km trajectory. Notice that the longest computation time is
the approximate nearest-neighbor search in the entire Street
View database for the feature descriptors found in the MAV
image. However, for position tracking, this step is completely
neglected (Section III) since, in this case, the MAV image
is compared only with the neighboring Street View images
(usually up to 4 or 8, computed in parallel on different cores,
depending on the road configuration). Finally, notice that
the histogram voting (Fig. 4) takes only 0.01 seconds. On
average, steps (1), (4), and (5) are computed in 3.2 seconds.
Therefore, if the MAV flies roughly with a speed of 2 m/s,
its position would be updated every 6.5 meters (subsection
IV).
B. Textured 3D cadastral models
The 3D cadastral model of Zurich used in this work was
acquired from the city administration and claims to have an
average lateral position error of 10 cm and an average error in height of 50 cm. The city model is referenced
in the Swiss Coordinate System CH1903 [12]. Note that this
model does not contain any textures.
The geo-location information of the Google Street View
dataset is not exact. The geo-tags of the Street View images provide only approximate information about where the
images were recorded by the vehicle. Indeed, according to
Fig. 5: (a) perspective view of the cadastral 3D city model; (b) the ground-level Street View image overlaid on the model; (c) the back-projected texture
onto the cadastral 3D city model; (d) estimated MAV camera positions matched with one Street View image; (e) the synthesized view from one estimated
camera position corresponding to actual MAV image (f); (g)-(i) show another example from our dataset, where (g) is an aerial view of the estimated camera
position (h), which is marked with the blue camera in front of the textured 3D model, (h) is the synthesized view from the estimated camera position
corresponding to actual MAV image (i).
(1)
(2)

z_k = h(q_{k|k-1}),     (3)

The visual odometry provides a relative motion estimate ξ̃_{k,k-1} ∈ R^4,

ξ̃_{k,k-1} = (s̃_k, Δψ_k),     (4)

where s̃_k ∈ R^3 denotes the translational component of the motion and Δψ_k ∈ R the yaw increment. s̃_k is valid up to a scale factor, thus the metric translation s_k ∈ R^3 of the MAV at time k with respect to the camera reference frame is equal to

s_k = λ s̃_k.     (5)

We define ξ_{k,k-1} ∈ R^4 as

ξ_{k,k-1} = (s_k, Δψ_k),     (6)

(7)
(8)

assuming that q_{k-1|k-1} and ξ_{k,k-1} are uncorrelated. We compute the Jacobian matrices numerically. The rows of the Jacobian matrices (∇_i f_q), (∇_i f_ξ) ∈ R^{1x4} (i = 1, 2, 3, 4) are computed as

(∇_i f_q) = [ ∂(f_i)/∂(q_{1,k-1|k-1})   ∂(f_i)/∂(q_{2,k-1|k-1})   ∂(f_i)/∂(q_{3,k-1|k-1})   ∂(f_i)/∂(q_{4,k-1|k-1}) ]

(∇_i f_ξ) = [ ∂(f_i)/∂(ξ_{1,k,k-1})   ∂(f_i)/∂(ξ_{2,k,k-1})   ∂(f_i)/∂(ξ_{3,k,k-1})   ∂(f_i)/∂(ξ_{4,k,k-1}) ]     (9)

where q_{i,k-1|k-1} and ξ_{i,k,k-1} denote the i-th component of q_{k-1|k-1} respectively ξ_{k,k-1}. The function f_i relates the updated state estimate q_{k-1|k-1} and the VO output ξ_{k,k-1} to the i-th component of the predicted state q_{k|k-1}.

In conclusion, the state covariance matrix Σ_{q,k|k-1} defines an uncertainty space (with a confidence level of 3σ). If the measurement z_k that we compute by means of the appearance-based global positioning system is not included in this uncertainty space, we do not update the state and we rely on the VO estimate.
C. Uncertainty estimation of the appearance-based global localization

Our goal is to update the state of the MAV q_{k|k-1} whenever an appearance-based global position measurement z_k ∈ R^4 is available. We define z_k as

z_k = (p_k^S, ψ_k^S),     (10)

where p_k^S ∈ R^3 denotes the position and ψ_k^S ∈ R the yaw in the global reference system.

The appearance-based global positioning system (Section II) provides the index j ∈ N of the Street View image corresponding to the current MAV image, together with two sets of n ∈ N 2D corresponding image points between the two images. Furthermore, it also provides the 3D coordinates of the corresponding image points in the global reference system. We define the set of 3D coordinates as X^S := {x_i^S} (x_i^S ∈ R^3, i = 1, ..., n) and the set of 2D coordinates as M^D = {m_i^D} (m_i^D ∈ R^2, i = 1, ..., n).

If the MAV image matches with a Street View image, it cannot be farther than 25 meters from that Street View camera (cf. Fig. 3), according to our experiments. We illustrate the uncertainty bound of the MAV from the bird-eye-view perspective in Fig. 6 with the green ellipse, where the blue dots represent Street View cameras. In order to reduce the uncertainty associated to z_k, we use the two sets of corresponding image points.

We compute z_k such that the reprojection error of X^S with respect to M^D is minimized, that is

z_k = argmin_z sum_{i=1}^{n} || m_i^D - π(x_i^S, z) ||,     (11)

where π(x_i^S, z) denotes the reprojection of the 3D point x_i^S into the image for the camera pose z.
Fig. 8: Comparison between the estimated trajectories: (a) bird-eye-view perspective of the results overlaid on Google Maps; (b) side view of (a); black circles represent the Street View camera positions; note that the terrain is ascending, consequently, we measure the altitude above the ground; we display the trajectory measured using the on-board GPS in green; the path estimate obtained with the system described in this paper is shown in blue, the squares identify the state updates; we show in red the enhanced version of our path estimate, computed using the on-board magnetometer data to estimate the yaw of the MAV, red circles identify the state updates; finally, magenta displays the estimate given by pure Visual Odometry.
Fig. 9: Comparison between path estimates shown within the cadastral 3D city model: Top row: we display the Visual Odometry estimate in black, GPS in green, our estimate in blue; (a) altitude evaluation: in the experiment, the MAV flew close to the middle of the street and it never flew over the height of 6 m (above the ground); from this point of view, our path estimate (blue) is more accurate than the GPS one (green); (b) perspective view of the path estimates; (c) trajectory zoom: the pure VO trajectory penetrates one of the surrounding buildings; using the proposed method, we are able to reduce the drift of the VO; Bottom row: we show a visual comparison of: (d) the actual view; (e) the rendered view of the textured 3D model corresponding to (d) that the MAV perceives according to our estimate; (f) the rendered view of the textured 3D model corresponding to (d) that the MAV perceives according to the GPS measurement; to conclude, the algorithm presented in this paper outperforms the other techniques in estimating the trajectory of the MAV flying at low altitude in urban environments.
[12] Formulas and constants for the calculation of the swiss conformal
cylindrical projection and for the transformation between coordinate
systems, Federal Department of Defence, Civil Protection and Sport
DDPS, Tech. Rep., 2008.
[13] A. Taneja, L. Ballan, and M. Pollefeys, Registration of spherical
panoramic images with cadastral 3d models, in 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012
Second International Conference on. IEEE, 2012, pp. 479-486.
[14] M. A. Fischler and R. C. Bolles, Random sample consensus: a
paradigm for model fitting with applications to image analysis and
automated cartography, Communications of the ACM, vol. 24, no. 6,
pp. 381-395, 1981.
[15] L. Kneip, D. Scaramuzza, and R. Siegwart, A novel parametrization
of the perspective-three-point problem for a direct computation of
absolute camera position and orientation, in Proc. of The IEEE
International Conference on Computer Vision and Pattern Recognition
(CVPR), Colorado Springs, USA, June 2011.
[16] F.Moreno-Noguer, V.Lepetit, and P.Fua, Accurate non-iterative o(n)
solution to the pnp problem, in IEEE International Conference on
Computer Vision, Rio de Janeiro, Brazil, October 2007.
[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer
Vision, 2nd ed. New York, NY, USA: Cambridge University Press,
2003.
[18] S. Thrun, W. Burgard, D. Fox, et al., Probabilistic robotics. MIT
press Cambridge, 2005, vol. 1.
[19] D. Scaramuzza and F. Fraundorfer, Visual odometry [tutorial],
Robotics & Automation Magazine, IEEE, vol. 18, no. 4, pp. 80-92,
2011.
[20] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz, Multicore bundle
adjustment, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011, pp. 3057-3064.
[21] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, Robust Monte Carlo localization for mobile robots, Artificial Intelligence, vol. 128, no. 1-2, pp. 99-141, 2001.
Bibliography
[1] M. Achtelik, A. Bachrach, R. He, S. Prentice, and N. Roy. Stereo vision
and laser odometry for autonomous helicopters in gps-denied indoor environments. In Proceedings of the SPIE Unmanned Systems Technology XI,
volume 7332, Orlando, FL, 2009.
[2] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. Commun. ACM, 54(10):105-112, 2011.
[3] Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh,
Stephane Lafon, Richard Lyon, Abhijit Ogale, Luc Vincent, and Josh
Weaver. Google street view: Capturing the world at street level. Computer, 43, 2010.
[4] Richard Bowden, John P. Collomosse, and Krystian Mikolajczyk, editors.
British Machine Vision Conference, BMVC 2012, Surrey, UK, September
3-7, 2012. BMVA Press, 2012.
[5] Koray Çelik, Soon Jo Chung, Matthew Clausman, and Arun K. Somani. Monocular vision slam for indoor aerial vehicles. In IROS, pages 1566-1573, 2009.
[6] Winston Churchill and Paul M. Newman. Practice makes perfect? managing and leveraging visual experiences for lifelong navigation. In ICRA,
pages 4525-4532, 2012.
[7] J. Engel, J. Sturm, and D. Cremers. Accurate figure flying with a quadrocopter using onboard visual and inertial sensing. In Proc. of the Workshop
on Visual Control of Mobile Robots (ViCoMoR) at the IEEE/RJS International Conference on Intelligent Robot Systems (IROS), 2012.
[8] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a
paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395, June 1981.
[9] F. Moreno-Noguer, V. Lepetit, and P. Fua. Accurate non-iterative o(n) solution to the pnp problem. In IEEE International Conference on Computer
Vision, Rio de Janeiro, Brazil, October 2007.
[10] Gerald Fritz, Christin Seifert, Manish Kumar, and Lucas Paletta. Building
detection from mobile imagery using informative sift descriptors. In SCIA,
pages 629-638, 2005.
[11] Christian Frueh, Siddharth Jain, and Avideh Zakhor. Data processing algorithms for generating textured 3d building facade meshes from laser scans
and camera images. International Journal of Computer Vision, 61(2):159-184, 2005.
[12] Christian Früh, Russell Sammon, and Avideh Zakhor. Automated texture mapping of 3d city models with oblique aerial imagery. In 3DPVT, pages 396-403, 2004.
[13] Andrew Harltey and Andrew Zisserman. Multiple view geometry in computer vision (2. ed.). Cambridge University Press, 2006.
[14] Andreas Hoppe, Sarah Barman, and Tim Ellis, editors. British Machine
Vision Conference, BMVC 2004, Kingston, UK, September 7-9, 2004. Proceedings. BMVA Press, 2004.
[15] Stefan Hrabar and Gaurav S. Sukhatme. Vision-based navigation through
urban canyons. J. Field Robotics, 26(5):431-452, 2009.
[16] Christian Kerl, Jürgen Sturm, and Daniel Cremers. Robust odometry estimation for rgb-d cameras. In ICRA, pages 3748-3754, 2013.
[17] Georg Klein and David Murray. Parallel tracking and mapping for small
ar workspaces. IEEE and ACM International Symposium on Mixed and
Augmented Reality, November 2007.
[18] L Kneip, D Scaramuzza, and R Siegwart. A novel parametrization of the
perspective-three-point problem for a direct computation of absolute camera position and orientation. In Proc. of The IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Colorado
Springs, USA, June 2011.
[19] Lubor Ladicky, Christopher Russell, Pushmeet Kohli, and Philip H. S. Torr.
Associative hierarchical crfs for object class image segmentation. In ICCV,
pages 739-746, 2009.
[20] David G. Lowe. Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60(2):91-110, 2004.
[21] Andras Majdik, Yves Albers-Schoenberg, and Davide Scaramuzza. Mav urban localization from google street view data. In International Conference
on Intelligent Robots and Systems, page To appear, 2013.
[22] Andrew Mastin, Jeremy Kepner, and John W. Fisher III. Automatic registration of lidar and optical images of urban scenes. In CVPR, pages
2639-2646, 2009.
[23] Navigation National Coordination Office for Space-Based Positioning and
Timing. Official U.S. government information about the Global Positioning System (GPS) and related information, 2013. [Online; accessed 19-September-2013].
[24] David Nister. An efficient solution to the five-point relative pose problem.
IEEE Trans. Pattern Anal. Mach. Intell., 26(6):756-777, 2004.
[25] Preprints of the 18th IFAC World Congress Milano (Italy), editor. The
Navigation and Control Technology Inside the AR.Drone Micro UAV, Milano, Italy, 2011.
[26] Federal Office of Topography swisstopo. Swiss reference systems, 2013.
[Online; accessed 24-September-2013].
[27] Tomas Pajdla Petr Gronat. Building streetview datasets for place recognition and city reconstruction. Workshop 2011, Center for Machine Perception, FEE, CTU in Prague, Czech Republic.
[28] Marc Pollefeys, David Nister, Jan-Michael Frahm, Amir Akbarzadeh,
Philippos Mordohai, Brian Clipp, Chris Engels, David Gallup, Seon Joo
Kim, Paul Merrell, C. Salmi, Sudipta N. Sinha, B. Talton, Liang Wang,
Qingxiong Yang, Henrik Stewenius, Ruigang Yang, Greg Welch, and Herman Towles. Detailed real-time urban 3d reconstruction from video. International Journal of Computer Vision, 78(2-3):143-167, 2008.
[29] Timo Pylvänäinen, Jérôme Berclaz, Thommen Korah, Varsha Hedau, Mridul Aanjaneya, and Radek Grzeszczuk. 3d city modeling from street-level data for augmented reality applications. In 3DIMPVT, pages 238-245, 2012.
[30] Davide Scaramuzza and Friedrich Fraundorfer. Visual odometry [tutorial].
IEEE Robot. Automat. Mag., 18(4):80-92, 2011.
[31] Roland Siegwart and Illah R. Nourbakhsh. Introduction to Autonomous
Mobile Robots. Bradford Company, Scituate, MA, USA, 2004.
[32] Aparna Taneja, Luca Ballan, and Marc Pollefeys. Registration of spherical
panoramic images with cadastral 3d models. In 3DIMPVT, pages 479-486,
2012.
[33] OpenCV Development Team. Camera calibration with opencv, 2013. [Online; accessed 19-September-2013].
[34] Alex Teichman and Sebastian Thrun. Practical object recognition in autonomous driving and beyond. In ARSO, pages 35-38, 2011.
[35] Stadt Zurich Tiefbau und Entsorgungsdepartement. 3d-stadtmodell, 2013.
[Online; accessed 24-September-2013].
[36] Gonzalo Vaca-Castano, Amir Roshan Zamir, and Mubarak Shah. City scale
geo-spatial trajectory estimation of a moving camera. In CVPR, 2012.
[37] Stephan Weiss, Davide Scaramuzza, and Roland Siegwart. Monocular-slam-based navigation for autonomous micro helicopters in gps-denied environments. J. Field Robotics, 28(6):854-874, 2011.
[38] Andreas Wendel, Michael Maurer, and Horst Bischof. Visual landmark-based localization for mavs using incremental feature updates. In 3DIMPVT, pages 278-285, 2012.
[39] Wei Zhang and Jana Kosecka. Image based localization in urban environments. In 3DPVT, pages 33-40, 2006.
Title of work:
Student:
Name: Yves Albers-Schoenberg
E-mail: yvesal@student.ethz.ch
Legi-Nr.: 06-732-523