Yves Albers-Schoenberg
Master Thesis
Robotics and Perception Lab
University of Zurich
Supervision
Dr. Andras Majdik
Prof. Dr. Davide Scaramuzza
November 2013
Contents
Abstract

1 Introduction
    1.1 Goal
    1.2 Motivation
    1.3 Autonomous Flight in Urban Environments
        1.3.1 Above-Rooftop Flight
        1.3.2 Street-Level Flight
    1.4 Legal Framework
    1.5 Literature Review
4 Experimental Setup
    4.1 Platform
    4.2 Test Area
5 Results and Discussion
    5.1 Visual-Inspection
    5.2 Uncertainty Quantification
    5.3 Virtual-views and Iterative Refinements
    5.4 GPS Comparison
6 Conclusion and Outlook
A Appendix
    A.1 OpenCV EPnP + Ransac
    A.2 ICRA Submission
Abstract
This thesis presents a proof-of-concept of a purely vision-based global positioning system for a Micro Aerial Vehicle (MAV) acting in an urban environment.
The overall goal is to contribute to the advance of autonomously acting aerial
service robots in city-like areas. It is shown that the increasing availability
of textured 3D city models can be used to localize a MAV in the case where
satellite-based GPS is not, or only partially available. Textured urban scenes are
created by overlaying Google Street View images on a georeferenced cadastral
3D city model of Zurich. For a particular MAV image, the most similar Street View image is then identified by an image search algorithm, and the global camera position of the MAV is derived. An extensive test dataset containing aerial recordings of a 2 km long trajectory in the city of Zurich is used to verify and evaluate the proposed approach. It is concluded that the suggested vision-based positioning algorithm can be used as a complement or an alternative to satellite-based GPS with comparable results in terms of localization accuracy. Finally, suggestions are presented on how to improve the introduced vision-based positioning approach and implement it in a future real-life application.
Results of this thesis have been used in the ICRA submission Micro Air Vehicle Localization and Position Tracking from Textured 3D Cadastral Models 1 .
1 International Conference on Robotics and Automation (ICRA 2014), Micro Air Vehicle Localization and Position Tracking from Textured 3D Cadastral Models (under review), Andras L. Majdik, Damiano Verda, Yves Albers-Schoenberg, Davide Scaramuzza
Chapter 1
Introduction
This chapter describes the goal of this master thesis and gives an overview of autonomous flight of Micro Aerial Vehicles (MAVs) in urban environments. Challenges are highlighted and a motivation for the suggested vision-based positioning approach is provided. Moreover, a literature review is conducted summarizing the current state of the art.
1.1 Goal
1.2 Motivation
With the rapid advance of low-cost Micro Aerial Vehicles, new applications such as airborne goods delivery 1 , inspection 2 , traffic surveillance or first-aid delivery in case of accidents are starting to emerge. Moreover, it is conceivable that tomorrow's small-sized aerial service robots will increasingly carry out tasks autonomously, i.e. without any direct human intervention.
Accurate localization is indispensable for any autonomously acting robot and is
a prerequisite for the successful completion of tasks in a real-life environment.
1 As
2 As
Figure 1.1: On the left: There is no direct line of sight for the GPS signal to
the red satellites due to an urban canyon. On the right: The GPS signals are
reflected by the surrounding buildings
Satellite-based global positioning systems like GPS, Glonass, Galileo or Compass work based on the principle of triangulation and have become the state of the art for global outdoor positioning, forming a crucial component of many modern technological systems. Everyday applications like smart-phones, driving-assistance or fleet tracking heavily rely on the availability of satellite-based signals for positioning. While standard consumer-grade GPS receivers have a typical accuracy between 3 and 15 meters 95% of the time, augmentation techniques like differential GPS (DGPS) or Wide Area Augmentation Systems (WAAS) to support aircraft navigation can reach a typical accuracy of 1 to 3 meters [23]. The accuracy and reliability of a standard GPS sensing device fundamentally depends on the number of visible satellites which are in the line of sight of the receiver. In urban areas, the availability of satellite-based GPS signals is often reduced compared to unobstructed terrain, or even completely unavailable in case of restricted sky view. So-called urban canyons tend to shadow the GPS signals, and building facades reflect the signals, violating the underlying triangulation assumption that signals travel along a direct line of sight between the satellite and the receiver. Several approaches have been suggested in the literature to deal with these drawbacks, such as using additional ground stations or fusing the GPS measurements together with data from Inertial Measurement Units (IMUs) for dead-reckoning. This thesis aims to provide a vision-based alternative to satellite-based global positioning in urban environments by taking advantage of 3D city models together with geotagged image databases such as Google Street View 3 or Flickr 4 . The motivation is to develop novel approaches for MAV positioning, paving the way for tomorrow's aerial robotics applications in urban environments.
3 https://maps.google.ch/
4 http://www.flickr.com/
1.3 Autonomous Flight in Urban Environments
In the context of this work, the term autonomous flight in urban environments refers to the capability of a MAV to independently, i.e. without any human piloting, execute the following directive:
Fly from Address A to Address B
This capability is a basic requirement for any autonomously acting aerial robot fulfilling tasks in city-like environments. Fig. 1.2 shows a simplified reference control scheme of an autonomous robot in the style of [31]. As framed by the red dashed line, an autonomously flying MAV will carry out all four major building blocks of navigation, namely localization, path planning, motion control and perception, in an automated way.
Figure 1.2: The mission commands fly from Address A to Address B are given by the operator. To execute the mission commands, the MAV needs to iteratively carry out the following steps until the goal is reached: firstly, localize and determine the current position; secondly, plan the next step (path) to reach the target; thirdly, generate motor commands to execute the planned path and interact with the environment; fourthly, extract information from the environment to get an update on the current state.
Localization and Map Building This work focuses on the localization step in the above control scheme. It is explicitly assumed that the MAV has access to a given map of the environment, i.e. the 3D city model, and does not need to simultaneously localize and map the environment (SLAM). SLAM systems like [17] have been successfully applied to localize MAVs in indoor environments where no map is available [7].
Hereafter, the term global localization refers to the positioning of the robot with respect to a global coordinate system such as the World Geodetic System 1984 (WGS84). Besides positioning (e.g. determination of latitude and longitude), localization usually also includes information on the robot's attitude (i.e. yaw, roll and pitch).
Path-Planning As defined in [31], path-planning involves identifying a trajectory that will cause the robot to reach the goal location when executed. This is a strategic problem-solving competence that requires the robot to plan how to achieve its long-term goals. Path-planning usually involves the determination of intermediate way-points between the current position and the goal. Even though path-planning is a long-term process, it can change when new information on the environment becomes available or the mission control commands are changed. A crucial competence of any autonomous robot acting in human environments is the capability of short-term obstacle avoidance. Especially in urban areas, where the robot's workspace is shared with pedestrians, cars and public transport, a robust obstacle avoidance system is a basic prerequisite for any safe robot operation.
Motion Control Motion control is the process of generating suitable motor commands so that the robot executes the planned path. In the case of a quadrocopter, motion control regulates the rotary speed of the four rotors to move the MAV to the desired position and attitude. Generally, one differentiates between open-loop control, where the robot's position is not fed back to the kinematic controller to regulate the velocity or the position, and closed-loop control, where the robot's system state (velocity, position) is fed back as an input to the kinematic controller. The most widely used closed-loop control mechanism is a Proportional Integral Derivative (PID) controller, which minimizes the error between a measured system variable and its desired set-point.
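To make the closed-loop idea concrete, the following minimal sketch shows a textbook PID controller in Python; the gains, the altitude-hold scenario and all variable names are illustrative assumptions and are not taken from the platform used in this thesis.

import time


class PID:
    # Minimal textbook PID controller (illustrative, not tuned for a real MAV).
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        # Error between the desired set-point and the measured system variable
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        # Control output, e.g. a thrust correction for a simple altitude hold
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical usage: hold an altitude of 5 m, updating at 50 Hz
altitude_hold = PID(kp=0.8, ki=0.05, kd=0.3, setpoint=5.0)
thrust_correction = altitude_hold.update(measurement=4.2, dt=0.02)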
Perception Perception refers to the process of information extraction from the robot's environment. During sensing, raw data is collected depending on the robot's specific sensor configuration. Various types of sensors are used in robotics, such as laser scanners or ultrasonic sensors for range sensing, IMUs for attitude estimation or cameras for positioning and motion detection. Generally, one differentiates between active sensors, which release energy and measure the environmental response to that energy, and passive sensors, which detect ambient energy changes without releasing energy to the environment. Moreover, one differentiates between exteroceptive sensors, which measure environmental properties such as the temperature, and interoceptive sensors, which measure the robot's internal state such as the actuator positions. A detailed overview of different sensing technologies can be found in [31]. The meaningful interpretation of raw sensor data is referred to as information extraction and is a key process in the perception phase. In this work, the main sensor used is a monocular camera producing a continuous image stream.
Safety Reliable safety measures are a core requirement for any autonomous
mobile system acting in a real-world environment. Especially in urban areas
where the robot shares its workspace with human beings, well-tested safety
measures such as obstacle avoidance are crucial for any robotic application.
Based on the context-specific requirements for the above functions, two scenarios
for autonomous flight in urban environments are defined and explained below.
1.3.1 Above-Rooftop Flight
In this scenario, the MAVs are flying above the buildings as illustrated in Figures 1.3 and 1.4. Depending on the city-specific urban structure (e.g. a mega city in an emerging country with skyscrapers vs. an ancient city with historic buildings in Western Europe), a minimum flying height will be defined such that the MAV is always flying above the rooftops of the buildings. The main advantage of this scenario is the absence of obstacles in the form of man-made structures and humans. Therefore, trajectory planning is drastically simplified, resulting in a faster and safer system. Moreover, MAV localization can be robustly carried out based on satellite-based GPS as no buildings obstruct the direct line of sight to the satellites. Recent research has dealt with and largely solved GPS-based autonomous flight. Low-cost autopilots such as the PX4 5 can be used together with open source software such as qgroundcontrol 6 or Paparazzi 7 to control a GPS-based flight mission. To demonstrate the practice and the limitations of this approach, an autonomous test flight has been conducted using a Parrot AR.Drone 2.0 together with Qgroundcontrol. A video presentation summarizing the results of this flight can be found attached to this thesis. It is clearly shown that the GPS way-point following works well in principle. However, it is also demonstrated that the position accuracy, i.e. the MAV's ability to follow the designated path, is not sufficient to use GPS-based flight in the street-level flight scenario described below.
5 https://pixhawk.ethz.ch/px4/en/start
6 http://qgroundcontrol.org/
7 http://paparazzi.enac.fr
1.3.2 Street-Level Flight
In this scenario, the MAV flies at street level, i.e. between the building facades, as illustrated in Figures 1.5 and 1.6. Depending on the city-specific characteristics and the local obstacle scenario (e.g. streets with cars, public transport, pedestrians), there will be a minimum height of approximately 4-5 meters for safety reasons. The city-specific positions of overhead contact wires and crossovers will moreover determine an acceptable range for a safe flying altitude. The main challenges associated with autonomous flight in this scenario are obstacle avoidance and trajectory planning. Flying from address A to address B requires a path-planning strategy which takes into account the local scene structure and the prevailing traffic situation. However, accurate positioning also becomes more challenging than in the above-rooftop scenario, as the satellite-based GPS signals can be shadowed by the surrounding buildings as illustrated in Figure 1.1.
In a realistic application, the two scenarios are likely to be combined. Take-off/landing and short-distance flights will be carried out at street level while long-distance flights could be conducted above the rooftops. This work aims to contribute to solving the problem of localizing the MAV in the outlined street-level flight scenario.
1.4 Legal Framework
This section provides a brief overview of the legal environment concerning the operation of MAVs in urban areas 8 . In this context there are two main legal aspects to be considered: a) the rules governing the operation of unmanned aerial vehicles (UAVs) and b) the protection of data and the private sphere of individuals.
a) The rules regulating the operation of UAVs
The operation of UAVs is governed by the ordinance on special categories of aircraft (Ordinance) issued by the Federal Department of the Environment, Transport, Energy and Communications (DETEC) 9 .
The Ordinance distinguishes between UAVs weighing more than 30 kilograms and those weighing up to 30 kilograms.
The most significant of the Ordinance's rules governing UAVs weighing up to 30 kilograms, which are of relevance for our purposes, are the following:
According to art. 14 of the Ordinance, the operation of UAVs with a total weight of up to 30 kilograms does not require an authorization of the Swiss Federal Office of Civil Aviation (FOCA);
Constant and direct eye contact with the UAV has to be maintained at all times (art. 17 para. 1, Ordinance);
Autonomous operation of UAVs (through cameras or GPS) within the eye contact area of the pilot is allowed provided that the pilot is always in a position to intervene on the UAV; otherwise the authorization of FOCA is required;
8 This
1.5 Literature Review
In recent years, several research papers have addressed the development of autonomous Unmanned Ground Vehicles (UGVs), leading to striking new technologies like self-driving cars. These can map and react in highly uncertain street environments while partially using [6] or completely neglecting [34] GPS systems. In the coming years, a similar burst in the development of autonomously acting Micro Aerial Vehicles is expected. Several recent papers have addressed visual localization and navigation in indoor environments using low-cost MAVs [5, 37], or [40] which tackles the problem of safely navigating a MAV through a corridor using optical flow. Most of these approaches are based on Simultaneous Localization and Mapping (SLAM) systems such as [17] using a monocular camera. Other approaches rely on stereo vision or laser odometry as described in [1].
Several papers have addressed vision-based localization in city environments. In [36] the authors present a method for estimating the geospatial trajectory of a moving camera with unknown intrinsic parameters. A similar approach is discussed in [14], which aims to localize a mobile camera device by performing a database search using a wide-baseline matching algorithm. [10] introduces a SIFT-based approach [20] to detect buildings with mobile imagery. In [39] the authors propose an image-based localization system using GPS-tagged images. The camera position of the query view is therein triangulated with respect to the most similar database image. Note that most of these approaches address the localization of ground-level imagery with respect to geo-referenced ground-level image databases. However, this thesis explicitly focuses on vision-based aerial localization for MAVs. An interesting paper addressing vision-based MAV localization in urban canyons based on optical flow is given by [15]. Moreover, probably the most similar work to the approach presented in this thesis is given in [38], in which the authors make use of metric, geo-referenced visual landmarks based on images taken by a consumer camera on the ground to localize the MAV. However, in contrast, the approach presented in this thesis is completely based on publicly available 3D city models and image databases. A short literature overview on textured 3D models is presented at the beginning of the next chapter.
10 Art. 3 (g) FDPIC: Processing: any processing of data, irrespective of the means used, in particular the collection, storage, use, modification, communication, archiving and deletion of data.
11 Art. 3 (a) FDPIC: Personal data (data): all information of an identified or identifiable person.
12 BGE 138 II 346.
13 For further information on the Google Street View case: http://www.edoeb.admin.ch/datenschutz/00683/00690/00694/01109/index.html?lang=en
Chapter 2
1 http://www.flickr.com/
2 https://picasaweb.google.com/
3 https://www.google.com/maps
2.1 3D Cadastral Models
Accurate 3D city models based on administrative cadastral measurements are becoming increasingly available to the public all over the world. In Switzerland, the municipal authorities of Basel 4 , Bern 5 and Zurich 6 provide access to their cadastral 3D data. The city model of Zurich used in this work was acquired from the urban administration and claims to have an average lateral position error of ±10 cm and an average error in height of ±50 cm. The city model is referenced in the Swiss Coordinate System CH1903, which is described in detail in [26]. An online conversion calculator between CH1903 and WGS84 is available 7 . Please note that this model does not contain any texture information. As specified in [35], the model is available in several current Computer-Aided Design (CAD) file formats and comes in three different Levels-of-Detail (LODs).
Digital Terrain Model (LOD 0): The digital terrain model is available as a Triangulated Irregular Network (TIN) or in the format of interpolated contour lines, cf. Fig. 2.1 (a).
3D Block Model (LOD 1): The 3D block model represents the buildings and their height in the form of blocks (prisms), cf. Fig. 2.1 (b).
3D Rooftop Model (LOD 2): The 3D rooftop model represents the facades and the rooftops of the buildings in more detail and also models walls and bridges, cf. Fig. 2.1 (c).
Figure 2.1: The figures show the different Levels-of-Detail (LODs) in which the
cadastral 3D model is available. The images in this Figure belong to the city of
Zurich.
In this work, the LOD 2 model is used to get the highest level of accuracy
available. However, as shown in Fig. 2.2, the LOD 2 model is a simplification
of the reality. Balconies (as shown in yellow), windows (as shown in green) and
special structures (as shown in red) are usually not modelled. It is evident that
4 http://www.gva-bs.ch/produkte_3d-stadtmodelle.cfm
5 http://www.geobern.ch/3d_home.asp
6 http://www.stadt-zuerich.ch/ted/de/index/geoz/3d_stadtmodell.html
7 http://www.swisstopo.admin.ch/internet/swisstopo/de/home/apps/calc/navref.html
2.2
8 https://developers.google.com/maps/documentation/streetview/
9 https://developers.google.com/maps/documentation/javascript/streetview?hl=en
10 http://www.python.org
INPUT
- A text file containing a list L_download of WGS84 referenced GPS coordinates (latitude, longitude) gps_1, ..., gps_j, ..., gps_m derived from Google Maps, for which the closest (in terms of Euclidean distance) available panoramic image should be downloaded.
- The panoramic zoom level z_zoom defining the panoramic image size P_height x P_width.
OUTPUT
- A folder containing a set I_panos of panoramic images p_1, ..., p_j, ..., p_M, one for every GPS coordinate gps_j ∈ L_download.
- A list L_geo containing the geotags geo_1, ..., geo_j, ..., geo_m for the downloaded panoramic images. Every geotag is given by the latitude, longitude, yaw, roll and pitch of the panoramic camera position.
FUNCTIONAL REQUIREMENTS
- Download, for every GPS coordinate gps_j ∈ L_download, the tiles which together make up the closest panoramic image. Stitch the tiles together and save the panoramic image p_j in I_panos.
- For every gps_j ∈ L_download, get the geotag geo_j of the closest panoramic image and save it in L_geo.
Figure 2.3: Functional setup of the Street View script used to download Street View panoramas.
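To make the functional setup of Figure 2.3 more tangible, the sketch below outlines how such a download script could be organized in Python. The tile-fetching and closest-panorama lookup steps are deliberately left as stubs because the exact endpoints and parameters are not specified here; the tile grid size, file names and helper names are illustrative assumptions, not the actual script used in this work.

import os
from PIL import Image  # Pillow, used only for stitching the tiles


def fetch_tile(pano_id, zoom, x, y):
    # Placeholder: return one panorama tile as a PIL image. The actual HTTP
    # request against the Street View tile endpoint is intentionally omitted.
    raise NotImplementedError


def closest_panorama(lat, lon):
    # Placeholder: return (pano_id, geotag) of the panorama closest to (lat, lon).
    raise NotImplementedError


def stitch_panorama(pano_id, zoom, tile_size=512, tiles_x=4, tiles_y=2):
    # Paste the individual tiles into one equirectangular panorama image.
    pano = Image.new("RGB", (tiles_x * tile_size, tiles_y * tile_size))
    for x in range(tiles_x):
        for y in range(tiles_y):
            pano.paste(fetch_tile(pano_id, zoom, x, y), (x * tile_size, y * tile_size))
    return pano


def download_panoramas(gps_list, zoom, out_dir="panos"):
    # For every GPS coordinate, save the closest panorama and collect its geotag.
    os.makedirs(out_dir, exist_ok=True)
    geotags = []
    for j, (lat, lon) in enumerate(gps_list):
        pano_id, geotag = closest_panorama(lat, lon)
        stitch_panorama(pano_id, zoom).save(os.path.join(out_dir, "pano_%04d.jpg" % j))
        geotags.append(geotag)
    return geotags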
(c) This figure shows a panoramic Street View image (equirectangular projection) stitched together by using the dynamic Street View API. The yaw spans from 0 to 360 degrees (x-axis along the image width) whereas the pitch extends from 0 to 180 degrees (y-axis along the image height).
Figure 2.4: Figures (d)-(l) show different perspective cutouts from the panoramic image in (c) using different cutout parameters as described in chapter 2.3.
2.3
As shown later in chapter 3.2, a perspective cutout of the Street View panoramas, i.e. an image which meets the underlying assumptions of a perspective camera model as described in [13], needs to be generated. This is done following the procedure outlined in [27]. The functional setup of the cutout function is described in Figure 2.5.
FUNCTION: perspective cutout
DESCRIPTION: Generate a perspective cutout of a panoramic Street View image.
INPUT
- Panoramic Street View image p_j.
- Panoramic image size P_size given by the image width P_width and the image height P_height.
- Desired image size C_size of the perspective cutout given by C_width and C_height.
- Horizontal field of view hfov for the desired cutout.
- Image center of the desired perspective cutout specified by yaw and pitch in the panoramic projection.
OUTPUT
- Perspective view c_k according to the input specifications.
FUNCTIONAL REQUIREMENTS
- Transform the equirectangular projection to a perspective view.
Figure 2.5: Functional setup of the perspective cutout function.
Based on the input parameters, the internal camera matrix K_street for the generated perspective cutout can be calculated as follows:

    c_x = C_height / 2,    c_y = C_width / 2        (2.1)

where c_x and c_y represent the optical camera center. The camera focal lengths f_x, f_y are given by:

    f_y = f_x = C_width / (2 tan(hfov/2))        (2.2)

    K_street = [ f_x   0    c_x ]
               [  0   f_y   c_y ]        (2.3)
               [  0    0     1  ]
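The following numpy/OpenCV sketch illustrates Equations (2.1)-(2.3) together with the equirectangular-to-perspective mapping described in Figure 2.5. The axis and sign conventions (x right, y down, z forward; panorama pitch of 90 degrees at the horizon) are assumptions made for this illustration and may differ from the conventions of the original implementation.

import numpy as np
import cv2


def cutout_intrinsics(c_width, c_height, hfov_deg):
    # Internal camera matrix of the cutout, Eq. (2.1)-(2.3)
    f = c_width / (2.0 * np.tan(np.radians(hfov_deg) / 2.0))
    cx, cy = c_height / 2.0, c_width / 2.0     # ordering as written in Eq. (2.1)
    return np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])


def perspective_cutout(pano, c_width, c_height, hfov_deg, yaw_deg, pitch_deg):
    # Sample a pinhole-camera view from an equirectangular panorama.
    pano_h, pano_w = pano.shape[:2]
    f = c_width / (2.0 * np.tan(np.radians(hfov_deg) / 2.0))   # Eq. (2.2)

    # Ray direction for every cutout pixel in an (x right, y down, z forward) frame
    u, v = np.meshgrid(np.arange(c_width), np.arange(c_height))
    rays = np.stack([u - c_width / 2.0,
                     v - c_height / 2.0,
                     np.full(u.shape, f)], axis=-1)

    # Rotate the rays towards the requested viewing direction
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg - 90.0)  # 90 deg = horizon
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    d = rays @ (Ry @ Rx).T
    d /= np.linalg.norm(d, axis=-1, keepdims=True)

    # Convert ray directions to panorama pixel coordinates (yaw 0..360, pitch 0..180)
    lon = np.arctan2(d[..., 0], d[..., 2]) % (2 * np.pi)
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0)) + np.pi / 2
    map_x = (lon / (2 * np.pi) * (pano_w - 1)).astype(np.float32)
    map_y = (lat / np.pi * (pano_h - 1)).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, interpolation=cv2.INTER_LINEAR)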
2.4
The provided geotags geo_j ∈ L_geo (cf. Figure 2.3) for the Google Street View imagery are not exact. As shown in [32], where 1400 images were used for an analysis, the average error of the camera positions is 3.7 meters and the average error of the camera orientation is 1.9 degrees. In the same work, an algorithm is proposed to improve the precision of the Street View image poses. This algorithm uses the described cadastral 3D city model of Zurich to detect the outlines of the buildings by rendering out 3D panorama views as illustrated in Fig. 2.6 (a)-(b). Accordingly, the outlines of the buildings are also computed for the Street View panoramas using the image segmentation technique described in [19]. Finally, the refined pose is computed by an iterative optimization, namely by minimizing the offset between the segmented outlines from the Street View panoramas and the outlines of the rendered-out panorama view from the 3D cadastral model. For this work, the described refinement algorithm was applied to correct the Google Street View geotags used in the experimental setup, cf. chapter 4. Fig. 2.6 shows the difference when overlaying rendered-out panoramas of the cadastral 3D model before and after applying the correction algorithm. It is clearly evident that the match quality, i.e. the accuracy when overlaying the 3D city model with Street View images, drastically increases after the application of the described refinement algorithm.
Figure 2.6: Figure (a) shows the rendered-out building outlines based on the original geotag of a panoramic Street View image. Figure (b) shows the rendered-out building outlines based on the refined geotag. Figure (c) overlays the panoramic image with the outlines based on the original geotag. Figure (d) overlays the panoramic image with the outlines based on the refined geotag. It is clearly shown that the overlay in Figure (d) is much more precise than in Figure (c).
Note that the refinement algorithm was run by the authors of [32] as the code had not been published at the time of writing. A functional setup of the refinement algorithm is, however, provided in Figure 2.7.
INPUT
- A text file containing a list L_geo of Street View geotags (latitude, longitude, yaw, pitch, roll) geo_1, ..., geo_j, ..., geo_m derived from the function download panoramas (cf. Figure 2.3), which describe the Street View camera locations for a set of panoramic images p_1, ..., p_j, ..., p_M.
- A set I_panos of panoramic images p_1, ..., p_j, ..., p_M.
- The 3D cadastral model for the locations in L_geo.
OUTPUT
- A list L_refined containing the refined panoramic camera locations xyz_1, ..., xyz_j, ..., xyz_m referenced in the 3D model coordinate frame CH1903.
- For every original geotag geo_j related to the panoramic image p_j, the refined external camera matrix RT_j given by the refined rotation matrix R_j, which describes the rotation of the Street View camera with respect to the model origin, and the translation vector T_j, which describes the translation with respect to the model origin. Note that xyz_j = -inv(R_j) T_j.
FUNCTIONAL REQUIREMENTS
- Segment building outlines in the panoramic images p_1, ..., p_j, ..., p_M.
- Render out panoramic building outlines from the 3D cadastral model for the Street View camera locations in L_geo.
- Overlay the segmented building outlines with the panoramic renderings and measure the offset.
- Iteratively refine the Street View camera locations by running an optimization to minimize the offset.
Figure 2.7: The functional setup of the refinement algorithm proposed by [32] to correct the panoramic geotags of the Street View images.
2.5
A given perspective cutout of a downloaded Street View panorama can be backprojected onto the 3D cadastral model taking into account the refined position as illustrated in Fig. 2.8 (a)-(d). This is done with the open-source 3D modelling software Blender 11 . Some sample files showing textured 3D model scenes are added to this thesis. Note that the quality of the backprojection largely depends on the accuracy of the refined position estimates (i.e. the refined geotags) of the Street View camera and on the modelling accuracy of the 3D cadastral model. The main goal of the backprojection is to assign the texture, i.e. the Street View images, to their corresponding 3D geometries in the cadastral model. An alternative approach to map the 2D pixel coordinates of the Street View cutouts to their global 3D coordinates in the city model is to add the Street View camera perspective to the 3D model and subsequently render out the global 3D coordinates for all the pixels. This process is illustrated in Figure 2.8 (e)-(f).
Figure 2.8: Figures (e) -(g) illustrate the rendered out global 3D model coordinates in the style of a heat map.
11 http://www.blender.org
Moreover, the functional setup for rendering out 3D coordinates for pixels in the Street View images is described in Figure 2.9. Note that the 3D coordinates for the pixels can either be rendered out in the global coordinate system or in the local camera coordinate system. If the global reference frame is used, every pixel in the Street View image can be directly linked to its absolute global coordinates in the city model reference system. Alternatively, the depth values can be rendered out for every pixel and then be converted to the local camera coordinate frame. Remember that the global 3D coordinates are referenced in the Swiss coordinate system CH1903 as outlined in chapter 2.1.
FUNCTION: get 3D coordinates
DESCRIPTION: Render out the global 3D coordinates and/or depth for the Street View pixels.
INPUT
- The Street View camera location RT_k specifying the external camera parameters of a specific perspective cutout c_k, where R_k is the rotation matrix of the Street View camera with respect to the model origin and T_k gives the translation vector with respect to the model origin.
- The internal camera parameters K_street of the perspective cutout as given in chapter 2.3.
- The cadastral 3D model which contains the location RT_k.
OUTPUT
- For every pixel p_kuv in c_k, the global 3D coordinates to which the pixel corresponds, i.e. X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv). p_kuv stands for the pixel in cutout k, row u and column v.
- Alternatively, for every pixel p_kuv in c_k, the depth D(p_kuv) which corresponds to the pixel. If desired, also the 3D coordinates in the local camera frame can be extracted, i.e. X_local(p_kuv), Y_local(p_kuv), Z_local(p_kuv).
FUNCTIONAL REQUIREMENTS
- Create a perspective camera in the 3D model according to the external parameters RT_k and the internal parameters K_street.
- Render out coordinate paths, i.e. save the corresponding 3D model coordinates X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv) for every pixel p_kuv in the image plane of the perspective cutout c_k.
Figure 2.9: This figure outlines the process of linking the Street View cutout pixels to their corresponding 3D coordinates.
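To complement the functional description in Figure 2.9, the following small numpy sketch shows the underlying geometry: a pixel with known depth is first lifted into the local camera frame with the inverse internal camera matrix and then transformed into the global model frame using the external parameters. The pixel convention (x = column, y = row), the depth definition (measured along the camera z-axis) and all numeric values are illustrative assumptions.

import numpy as np


def pixel_to_global(x, y, depth, K, R, T):
    # Back-project pixel (x = column, y = row) with known depth into the local
    # camera frame: X_local = depth * K^-1 [x, y, 1]^T
    x_local = depth * (np.linalg.inv(K) @ np.array([x, y, 1.0]))
    # Transform to the global model frame. With the convention
    # X_local = R X_global + T it follows that X_global = R^-1 (X_local - T).
    x_global = np.linalg.inv(R) @ (x_local - T)
    return x_local, x_global


# Illustrative usage with made-up camera parameters
K = np.array([[185.0, 0.0, 320.0],
              [0.0, 185.0, 180.0],
              [0.0, 0.0, 1.0]])
x_local, x_global = pixel_to_global(x=250, y=100, depth=12.0, K=K, R=np.eye(3), T=np.zeros(3))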
Chapter 3
Vision-based Global Positioning
This chapter presents the vision-based global positioning approach. The functional requirements are derived and the main steps are explained in detail.
The underlying idea of the vision-based global positioning approach is straightforward and illustrated in Figure 3.1: First (a), in the preprocessing phase a 3D
referenced image database containing perspective Street View cutouts is generated. In this context, 3D referenced means that we can link every pixel in the
image database to the corresponding global 3D point which resulted in the 2D
image projection. Important steps include: download the Street View panoramas, create perspective cutouts, refine the Street View geotags and finally render
out the 3D path of every cutout. Second, the MAV image (b) which we want
to localize is searched in the Street View cutout database. This is done using
the so-called air-ground algorithm which outputs 2D-2D match points that link
corresponding feature points between the MAV and the Street View image (c).
Third, the resulting 2D-2D matches can be converted into 2D-3D matches which
link the MAV image feature points to their global 3D counterparts i.e. the 3D
points which result in the projection of the 2D feature points. This is done with
the help of the 3D referenced image database established in the preprocessing
phase. Finally, a so-called PnP algorithm can be used to estimate the MAV's
external camera parameters (d) which describe the global location and attitude
of the MAV with respect to the global reference frame (e).
3.1 Preprocessing
FUNCTION: Preprocessing
DESCRIPTION: Steps required to generate a 3D referenced image database.
INPUT
- Flight area gps_1, ..., gps_j, ..., gps_m ∈ A_flight where the MAV will operate. This is a list of WGS84 referenced GPS coordinates.
OUTPUT
- Geo-referenced image database I_cutout containing N perspective cutouts c_1, ..., c_k, ..., c_N along the flying route, as described in Figure 2.3.
- Internal camera matrix K_street specifying the focal lengths f_x, f_y and the optical centers c_x, c_y of the perspective Street View cutouts.
- A mapping which links every pixel p_kuv of cutout c_k ∈ I_cutout to its global 3D model coordinates X_global(p_kuv), Y_global(p_kuv), Z_global(p_kuv), where p_kuv stands for the pixel in cutout k in row u and column v, cf. function get 3D coordinates in Figure 2.9.
FUNCTIONAL REQUIREMENTS
- Download the Street View panoramas for every GPS coordinate gps_j ∈ A_flight and store them in a panorama image database I_panos using the function download panoramas, cf. Figure 2.3.
- Process the panoramas p_1, ..., p_j, ..., p_M ∈ I_panos and generate the perspective cutouts c_1, ..., c_k, ..., c_N ∈ I_cutout with the function perspective cutout, cf. Figure 2.5.
- Refine the GPS coordinates gps_1, ..., gps_j, ..., gps_m ∈ A_flight using the algorithm of Figure 2.7 and store the refined positions, referenced in the 3D model coordinate frame, as xyz_1, ..., xyz_j, ..., xyz_m ∈ A_ref.
- Based on A_ref and the perspective cutout inputs, derive the rotation matrix R_k and the translation vector T_k specifying the external camera parameters RT_k of the cutouts c_1, ..., c_k, ..., c_N in the 3D model coordinate frame.
- Calculate the internal camera parameters K_street based on the perspective cutout inputs, cf. chapter 2.3.
- Create a mapping between the pixels p_kuv and their global 3D model coordinates by rendering out the X_global, Y_global, Z_global path from the 3D model for every cutout c_k ∈ I_cutout using RT_k and K_street, cf. Figure 2.8.
Figure 3.2: Steps required to prepare the 3D referenced image database which is used in the vision-based positioning algorithm.
3.2 Air-ground Algorithm
The air-ground algorithm was introduced in [21] and partially resulted from the author's semester thesis 1 . The main goal of this algorithm is to find the most similar Street View image for a given MAV image by finding corresponding feature points. In the said thesis, it is shown that state-of-the-art image search techniques usually fail to robustly identify correct feature matches between street-level aerial images recorded by a MAV and perspective Google Street View images. The reasons for this are significant viewpoint changes between the two images, image noise, environmental changes and different illumination. The air-ground algorithm introduces a novel technique to simulate artificial views according to the air-ground geometry of the system and hence manages to significantly increase the number of matched feature points. Moreover, a state-of-the-art outlier rejection technique using virtual line descriptors (KVLD) [4] is used to reduce the number of wrong correspondences. Please refer to the cited papers for details on the air-ground algorithm. Figure 3.3 shows an example of the matches found by the air-ground algorithm between the MAV image on the left side and its corresponding Street View image on the right side. The green lines illustrate the corresponding feature points whereas the magenta lines describe the virtual lines as described in [4].
Figure 3.3: Match points found between the MAV image (left) and the Street View cutout (right) with the air-ground algorithm. Note that there are still some outliers.
Note that the output of the original air-ground algorithm is essentially a set of 2D-2D image correspondences between the MAV image and the most similar Street View image. As described in [21], by identifying the most similar Street View image, one can localize the MAV image in the sense of a topological map. However, no metric localization, i.e. the exact global position in a metric map, can be derived based solely on the 2D-2D correspondences. As described in this thesis, the 2D-3D correspondences between the MAV image coordinates of the feature points and the 3D coordinates referenced in a global coordinate frame can be established using the cadastral 3D city model. Based on these correspondences, the global position of the MAV can be inferred as shown in the next section. The functional setup of the air-ground algorithm is illustrated in Figure 3.4.
1 Micro
INPUT
- A geotagged image database I_cutout containing a set of perspective Street View cutouts c_1, ..., c_k, ..., c_N.
- MAV image d_j for which we want to identify the most similar Street View image in the set I_cutout.
OUTPUT
- The most similar Street View cutout c_j which corresponds to the MAV image d_j, i.e. the Street View cutout with the highest number of corresponding feature points.
- A list u_MAV containing N_matches x 2 entries, where the first column refers to the u coordinate and the second column to the v coordinate of a corresponding feature point in the MAV image plane. The image pixel coordinate system is the standard used by OpenCV (a).
- A list u_STREET containing N_matches x 2 entries, where the first column refers to the u coordinate and the second column to the v coordinate of a corresponding feature point in the Street View image plane.
- Note that the feature point in the first row of u_MAV corresponds to the feature point in the first row of u_STREET, and so on. N_matches stands for the total number of feature correspondences found between the two images.
FUNCTIONAL REQUIREMENTS
- Generate artificial views of the images to be compared by means of an affine transformation.
- Identify salient feature points in the artificial views.
- Backproject the feature points of the artificial views to the original images.
- Find corresponding feature points by means of an approximate nearest neighbor search.
(a) http://docs.opencv.org/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html
Figure 3.4: Functional setup of the air-ground algorithm. Please refer to [21] for details.
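The air-ground algorithm itself is described in [21]; the simplified OpenCV sketch below only illustrates the basic idea behind the functional requirements above, i.e. matching against affinely warped virtual views and backprojecting the matched keypoints to the original image. It is not the actual implementation: the tilt and rotation sampling, the ratio-test threshold and the use of SIFT (cv2.SIFT_create requires a recent OpenCV build) are assumptions, and the KVLD outlier rejection step is omitted.

import cv2
import numpy as np


def affine_views(img, tilts=(1.0, 2.0), rotations=(0, 45, 90, 135)):
    # Generate affinely warped virtual views together with the 3x3 matrix that
    # maps coordinates in the warped view back to the original image.
    h, w = img.shape[:2]
    views = []
    for t in tilts:
        for phi in rotations:
            A = np.vstack([cv2.getRotationMatrix2D((w / 2, h / 2), phi, 1.0), [0, 0, 1]])
            M = np.diag([1.0, 1.0 / t, 1.0]) @ A      # rotation followed by a tilt
            warped = cv2.warpPerspective(img, M, (w, h))
            views.append((warped, np.linalg.inv(M)))
    return views


def match_air_ground(mav_img, street_img, ratio=0.8):
    # Match a MAV image against virtual views of a Street View cutout with SIFT.
    if mav_img.ndim == 3:
        mav_img = cv2.cvtColor(mav_img, cv2.COLOR_BGR2GRAY)
    if street_img.ndim == 3:
        street_img = cv2.cvtColor(street_img, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    kp_m, des_m = sift.detectAndCompute(mav_img, None)
    matches_uv = []
    for warped, back in affine_views(street_img):
        kp_w, des_w = sift.detectAndCompute(warped, None)
        if des_m is None or des_w is None:
            continue
        for pair in matcher.knnMatch(des_m, des_w, k=2):
            if len(pair) < 2:
                continue
            m, n = pair
            if m.distance < ratio * n.distance:        # Lowe's ratio test
                u, v = kp_w[m.trainIdx].pt
                p = back @ np.array([u, v, 1.0])       # backproject to the original cutout
                matches_uv.append((kp_m[m.queryIdx].pt, (p[0] / p[2], p[1] / p[2])))
    return matches_uv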
3.3
The goal of this section is to calculate the MAV's external camera parameters, which are given by the 3 x 3 rotation matrix R_MAV and the 3 x 1 translation vector T_MAV, or alternatively by the 3 x 4 matrix RT_MAV, as follows:

    R_MAV = [ r11  r12  r13 ]            [ t1 ]
            [ r21  r22  r23 ] ,  T_MAV = [ t2 ] ,
            [ r31  r32  r33 ]            [ t3 ]

    RT_MAV = [ r11  r12  r13  t1 ]
             [ r21  r22  r23  t2 ]        (3.1)
             [ r31  r32  r33  t3 ]

Basically, the camera's external parameters define the camera's heading and location in the world reference frame. In other words, they define the coordinate transformation from the global 3D coordinate frame to the camera's local 3D coordinate frame. Note that T_MAV specifies the position of the origin of the global coordinate system expressed in the coordinates of the local camera-centred coordinate system [13]. The global camera position X_MAV in the world reference frame is given by:

    X_MAV = -R_MAV^(-1) T_MAV        (3.2)
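Equation (3.2) is used repeatedly in the following chapters; as a small numerical sanity check (variable names are illustrative):

import numpy as np


def camera_position(R_mav, T_mav):
    # Eq. (3.2): camera centre in world coordinates. For a proper rotation
    # matrix the inverse equals the transpose.
    return -np.linalg.inv(R_mav) @ T_mav

# The returned position X satisfies R_mav @ X + T_mav = 0, i.e. the camera
# centre is mapped to the origin of the camera frame.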
Several approaches have been proposed in the literature to estimate the external camera parameters based on 3D points and their 2D projections by a perspective camera. In [8], the term perspective-n-point (PnP) problem was introduced and different solutions were described to retrieve the absolute camera pose given n 3D-2D correspondences. The authors in [18] addressed the PnP problem for the minimal case where n equals 3 points and introduced a novel parametrization to compute the absolute camera position and orientation. In this thesis, the Efficient Perspective-n-Point Camera Pose Estimation (EPnP) algorithm [9] is used to estimate the MAV camera position and orientation with respect to the global reference frame. In their paper, the authors present a novel technique to determine the position and orientation of a camera given its intrinsic parameters and a set of n correspondences between 3D points and their 2D projections. The advantage of EPnP with respect to other state-of-the-art non-iterative PnP techniques is that it has a much lower computational complexity. The computational complexity grows linearly with the number of points supplied. Moreover, EPnP has proven to be more robust than other non-iterative techniques in terms of noise in the 2D location. An alternative to non-iterative approaches are iterative techniques, which optimize the pose estimation by minimizing a specific criterion. These techniques have been shown to achieve a very high accuracy if the optimization is properly initialized and successfully converges to a stable solution. However, convergence is not guaranteed and iterative techniques are computationally much more expensive than non-iterative techniques. Moreover, it was shown by the authors of [9] that EPnP achieves almost the same accuracy as state-of-the-art iterative techniques. To summarize, EPnP was used in this thesis because of its speed, robustness to noise and its simple implementation. Note that any other PnP technique could be used at this point to estimate the external camera parameters R_MAV and T_MAV of the MAV. The minimal number of correspondences required for EPnP is n = 4.
Given that the output of our air-ground matching algorithm may still contain outlier correspondences, EPnP is combined with the Random Sample Consensus (Ransac) scheme. Figure 3.5 illustrates the idea of Ransac with the classic example of fitting a line to sample data that contains outliers: First, Ransac randomly selects two points from the sample data and fits a line. Second, the number of inlier points, i.e. the points which are close enough to the fitted line according to a certain threshold, is determined. This procedure is repeated a certain number of times and the model parameters which have the highest number of inliers are selected to be the best ones. Ransac works robustly as long as the outlier percentage of the model data is below 50 percent. As specified in [21], the number of iterations N needed to select at least one random sample set free of outliers with a given confidence level p, e.g. p = 0.95, can be computed as

    N = log(1 - p) / log(1 - (1 - y)^s)        (3.3)

where y is the outlier ratio of the underlying model data and s the number of model parameters needed to estimate the model. In the case of the line example s = 2. In the case of EPnP the minimal set is at least equal to s = 4.
Figure 3.5: Left side: sample data for fitting a line containing outlier points. Right side: fitted line by applying Ransac. Images taken from Wikipedia.
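Equation (3.3) is straightforward to evaluate; a small helper function (the values in the example call are arbitrary):

import math


def ransac_iterations(p, outlier_ratio, s):
    # Eq. (3.3): iterations needed to draw, with confidence p, at least one
    # sample of s correspondences that is free of outliers.
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - (1.0 - outlier_ratio) ** s))


# e.g. 95% confidence, 30% outliers, minimal EPnP sample of s = 4 -> 11 iterations
print(ransac_iterations(p=0.95, outlier_ratio=0.30, s=4))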
The procedure used by Ransac in the case of EPnP to test whether a given correspondence point is an inlier is as follows: Firstly, Ransac randomly selects s points from the 3D-2D correspondences and supplies them to EPnP, which calculates R_MAV and T_MAV. The remaining 3D points are then reprojected to the 2D image plane based on R_MAV and T_MAV according to the following equation:

    z_c [ x_repr ]                    [ x_global ]
        [ y_repr ]  =  K_MAV RT_MAV   [ y_global ]        (3.4)
        [    1   ]                    [ z_global ]
                                      [    1     ]

where z_c is the projective scale factor.
2 http://docs.opencv.org/master/modules/calib3d/doc/
INPUT
- A set of corresponding 3D-2D points, i.e. a set of 3D points X_global and their 2D projections X_camera in the MAV's camera frame.
- The internal camera parameters K_MAV of the MAV camera.
- The Ransac parameters, i.e. the allowed reprojection error threshold repr_thresh in pixels, the confidence level p_confidence, and the number of matches s supplied to EPnP, which must be at least s = 4.
OUTPUT
- The external camera parameters R_MAV and T_MAV describing the MAV camera position with respect to the global reference frame.
FUNCTIONAL REQUIREMENTS
- Randomly select a subset of s 3D-2D match points.
- Calculate R_MAV and T_MAV based on EPnP.
- Reproject the 3D points and calculate the reprojection error e_reprojection.
- Consider a 3D point to be an inlier if e_reprojection < repr_thresh. Otherwise consider the match to be an outlier.
- Repeat this procedure according to the confidence level p_confidence and Equation 3.3.
- Take the iteration which resulted in the highest number of inliers and recalculate the final R_MAV and T_MAV based on these inliers using EPnP.
Figure 3.6: Functional setup of EPnP + Ransac. Please refer to [9] for details.
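Since the appendix refers to the OpenCV implementation of EPnP + Ransac, the following sketch shows how the functional setup of Figure 3.6 maps onto cv2.solvePnPRansac in a recent OpenCV Python binding; the threshold and confidence values are illustrative and not the exact settings used for the experiments.

import cv2
import numpy as np


def estimate_mav_pose(pts_3d, pts_2d, K_mav, repr_thresh=8.0, confidence=0.95):
    # pts_3d: Nx3 global model points, pts_2d: Nx2 pixel coordinates (OpenCV convention)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts_3d, dtype=np.float32),
        np.asarray(pts_2d, dtype=np.float32),
        K_mav, None,
        reprojectionError=repr_thresh,
        confidence=confidence,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R_mav, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 rotation matrix
    T_mav = tvec.reshape(3)
    X_mav = -R_mav.T @ T_mav              # Eq. (3.2): global camera position
    return R_mav, T_mav, X_mav, inliers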
3.4 Vision-based Positioning
Based on the previous steps, the vision-based positioning algorithm can now be
easily formulated. Algorithm 1 shows the basic setup of the system. Note that
in a realistic application, not the entire 3D referenced image database will be
searched. Based on a so-called position prior pprior , it will make sense to narrow
down the search space to make it as small as possible and hence speed up the
whole algorithm. Such a position prior can be given by the latest satellite-based
GPS estimate, IMU-based dead-reckoning or the previous vision-based estimate.
Moreover, if there is a magnetometer available, the heading measurements could
also be used to reduce the search space.
Algorithm 1: Vision-based Positioning
Data: 3D referenced image database I_cutout
Result: Global MAV position X_MAV
1 initialization, cf. Preprocessing in Figure 3.2;
2 for every MAV image d_j do
3     if position prior p_prior is available then
4         Reduce search space to I_reduced ⊂ I_cutout;
5         Set I_search = I_reduced;
6     else
7         Set I_search = I_cutout;
8     Find the most similar Street View cutout c_j ∈ I_search and the corresponding 2D-2D matches u_MAV, u_STREET with the air-ground algorithm, cf. Figure 3.4;
9     Convert the 2D-2D matches into 2D-3D matches using the 3D referenced image database, cf. Figure 2.9;
10    Estimate the external camera parameters R_MAV and T_MAV with EPnP + Ransac, cf. Figure 3.6;
11    Compute the global MAV position X_MAV according to Equation 3.2;
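Putting the pieces together, the following illustrative glue code shows how Algorithm 1 could be organized around the sketches from the previous sections (match_air_ground from chapter 3.2 and estimate_mav_pose from chapter 3.3); the database layout, the prior-based search radius and all names are assumptions made only for this sketch.

import numpy as np


def localize_mav_image(mav_img, database, K_mav, prior=None, radius_m=100.0):
    # database: list of dicts with keys 'cutout' (image), 'position' (3D array) and
    # 'pixel_to_3d' (callable mapping a cutout pixel to its global 3D point).
    # Narrow the search space around the position prior, if one is available.
    candidates = [e for e in database
                  if prior is None or np.linalg.norm(e["position"] - prior) < radius_m]

    # Air-ground matching: keep the cutout with the most correspondences.
    best_entry, best_matches = None, []
    for entry in candidates:
        matches = match_air_ground(mav_img, entry["cutout"])
        if len(matches) > len(best_matches):
            best_entry, best_matches = entry, matches
    if best_entry is None or len(best_matches) < 4:      # EPnP needs at least 4 points
        return None

    # Convert 2D-2D matches into 2D-3D matches and run EPnP + Ransac.
    pts_2d = [uv_mav for uv_mav, _ in best_matches]
    pts_3d = [best_entry["pixel_to_3d"](uv_street) for _, uv_street in best_matches]
    return estimate_mav_pose(pts_3d, pts_2d, K_mav)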
Chapter 4
Experimental Setup
This chapter describes the experimental setup which is used to test the vision-based positioning system. Two datasets are used to verify the performance of the introduced approach: firstly, the big dataset which was already presented in [21] and secondly, the small dataset which was newly recorded and also contains GPS data for comparison.
4.1 Platform
Figure 4.1: On the left: Distorted MAV image, on the right: Undistorted MAV
image
The recorded imagery was therefore undistorted using the OpenCV library as
outlined in [33]. The OpenCV drone distortion parameters and the camera matrix were determined accordingly; the resulting internal camera matrix of the MAV camera is

    K_MAV = [ 558.265829       0.0       328.999406 ]
            [     0.0      558.605079    178.924958 ]        (4.1)
            [     0.0          0.0           1.0    ]

For the Street View cutouts, the focal length follows from Equation (2.2) with a cutout width of C_width = 640 pixels:

    f_x = f_y = C_width / (2 tan(hfov/2)) = 640 / (2 tan(60°)) = 184.7521        (4.4)

which yields the internal camera matrix of the Street View cutouts

    K_STREET = [ 184.7521      0       180 ]
               [     0      184.7521   320 ]        (4.5)
               [     0         0        1  ]
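For completeness, undistorting a recorded frame with the calibrated camera matrix is a single OpenCV call; the file name is hypothetical and the zero distortion vector below is only a placeholder for the calibrated coefficients, which are not reproduced here.

import cv2
import numpy as np

# Camera matrix from Eq. (4.1); replace the zeros with the calibrated
# radial/tangential distortion coefficients of the AR.Drone camera.
K_MAV = np.array([[558.265829, 0.0, 328.999406],
                  [0.0, 558.605079, 178.924958],
                  [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)

frame = cv2.imread("mav_frame.png")                 # hypothetical input frame
undistorted = cv2.undistort(frame, K_MAV, dist_coeffs)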
4.2 Test Area
To test the vision-based positioning approach, two datasets were recorded by piloting the AR.Drone manually through the streets of Zurich filming the building facades, i.e. the front-looking camera was turned by 90 degrees with respect to the flying direction. The first dataset, called ETH Zurich big, covers a trajectory of roughly 2 kilometres in the neighbourhood of the ETH Zurich and has already been used in [21]. Figure 4.2 (a) shows the map of the recorded flying route together with some sample images. The average flying altitude is roughly 7 meters. This dataset has been recorded using the ROS 1 ardrone autonomy software package. The drone was controlled with a wireless joystick and the images were streamed down to a MacBook Pro over a wireless connection. In total, the ETH Zurich big dataset contains 40599 images. For computational reasons, to test the vision-based global positioning system, the dataset has been sub-sampled using every 10th image, resulting in a total of 4059 images. The trajectory of the big dataset corresponds to 113 Street View panoramic images which are roughly 10 - 15 meters apart from each other. As we were flying with
1 http://www.ros.org
the MAV camera facing the building facades, i.e. turned by 90 degrees with respect to the driving direction of the Street View car, only 113 perspective cutouts are stored in the image database I_cutout. In other words, the yaw parameter in the function perspective cutout was always set to 90 degrees. In a more realistic scenario, every panoramic image would be sampled according to different yaw parameters, e.g. yaw = [0, 90, 180, 270], as we cannot explicitly assume to know in which direction the MAV is looking if it is flying completely autonomously. However, to test the vision-based positioning approach, the above setting seems to be reasonable in terms of computational resources.
Besides the side-looking dataset, another dataset for the same trajectory was recorded with a front-looking camera facing the flying direction. This front-looking dataset contains only 22363 images but covers the same area. The reduced number of images for the front-looking dataset is due to a much easier manual control when flying in the viewing direction of the camera and hence a higher flying speed. However, the front-looking dataset has not been used in this work.
The second dataset, called ETH Zurich small, is a subpath of the path gathered in the big dataset (cf. blue line in Figure 4.2 (a)) and has been recorded together with satellite-based GPS in order to have a comparison to the proposed vision-based approach. For every recorded frame, this dataset also contains the recorded GPS coordinates, i.e. latitude, longitude and altitude according to WGS84, from where the image has been taken. The temporal synchronization of the GPS tags and the image frames is done on the software level using the open source package cvdrone 2 , which combines the OpenCV image library with the AR.Drone 2.0 API.
To calculate the vision-based position estimates, every 10th image of the dataset
ETH Zurich big is processed using algorithm 1 outlined in section 3.4. Every
MAV image is compared to the eight nearest Street View images in order to
find the correct match i.e. the Street View image which corresponds to the
observed scene. By comparing every MAV image only to the eight nearest Street
View images, the search space and hence the computational requirements are
drastically reduced for the air-ground algorithm described in section 3.2. This
corresponds to the so-called position prior pprior described in algorithm 1. In a
real flight scenario, it is realistic to have a position prior of at least 100 meters
accuracy based on other on-board sensors such as satellite-based GPS or an
IMU used for dead-reckoning. It therefore seems reasonable to only compare
the MAV images to the nearest Street View images instead of searching the
whole Street View database. However, the proposed approach also works for
a bigger search space as demonstrated in [21] at the expense of an increased
computational complexity.
The following list summarizes the parameters used to achieve the results presented in the next chapter:
- Number of Street View cutouts stored in I_cutout: N_cutout = 113.
- Number of processed MAV images used to test the vision-based positioning approach: N_MAV = 4059.
- Image size of the cutouts: C_width x C_height = 640 x 360 pixels.
2 https://github.com/puku0x/cvdrone
(a) Recorded Test Area: The red line describes the MAV flying path of the ETH Zurich big dataset. The blue-white line designates the ETH Zurich small dataset which is a subset of the big dataset and was recorded together with satellite-based GPS.
Figure 4.2: (a) shows an aerial map of the recorded datasets. Figures (b)-(d)
show MAV example images (top row) together with corresponding Street View
images (bottom row). Note that there are significant differences in terms of viewpoint, illumination, scale and environmental setting between the MAV images
and the corresponding Street View images which makes a correct classification
highly challenging.
Chapter 5
Results and Discussion
5.1 Visual-Inspection
Figure 5.2 shows the top view of the ETH Zurich big dataset. The red dots represent the vision-based global position estimates for each camera frame for which a corresponding Street View image was found according to Algorithm 1. It can be seen that almost the whole flying route is covered by position estimates. Some of the streets are covered very densely with position estimates, meaning that many correct Street View correspondences are found, while other areas are rather sparsely covered. There are four areas, designated with the numbers 1-4 in Figure 5.2 (a), where no position estimates are available. The reason why no or not enough Street View correspondences are found in areas 1, 2 and 3 of Figure 5.2 (a) is vegetation which occludes the buildings, as illustrated in Figure 5.1. The reason why there are no Street View correspondences found for area number 4 is not so obvious. Possible reasons could be the relatively high flying speed in that area, resulting in motion-blurred images and a reduced number of MAV frames per Street View image to be matched. Other possible reasons for
(a) MAV image from area 1 (b) MAV image from area 2 (c) MAV image from area 3
Figure 5.1: No Street View matches were found for these areas, cf. Figure 5.2, as the buildings are occluded by vegetation.
By close examination of Figures 5.2 (a) and (b), the reader will realize that the vision-based position estimates, i.e. the red dots, are not exactly the same for the two plots. The reason for this is that in Figure 5.2 (a) the minimal set of s = 4 match points was used to run EPnP and Ransac as described in chapter 3.3, whereas in Figure 5.2 (b) a non-minimal set of s = 8 match points was used. To illustrate the difference between the two approaches, Figure 5.3 shows some close-ups of the whole map. The first row shows close-ups when using the minimal set whereas the second row shows close-ups when using the non-minimal set. By closely comparing the first two rows, one can conclude that the position estimates derived by using the non-minimal set seem to be more plausible than by using only the minimal set. This is illustrated by the fact that the non-minimal position estimates tend to be more organized along a smooth trajectory which is realistic with respect to the real MAV flying path. Stated differently, the position estimates derived by using the minimal set tend to jump around more than by using the non-minimal set, i.e. they are more widely spread and less spatially consistent. The reason for the more robust results in the non-minimal case is that the position estimates derived by EPnP are less affected by outliers and degenerate point configurations. However, in both approaches, the minimal and the non-minimal, a few extreme outliers occur which are clearly not along the flying path, as highlighted by the yellow boxes in Figure 5.3. One possible cause for these outliers are wrong point correspondences between the Street View images and the MAV images given by the air-ground algorithm described in chapter 3.2. Another potential explanation are inaccurate 3D point coordinates supplied to EPnP resulting from inaccuracies when overlaying the Street View images with the cadastral city model as illustrated in chapter 2.5. The last row of Figure 5.3 shows the same close-ups after filtering the non-minimal estimates (second row) based on the standard deviations calculated in the next section. It is shown that the outliers (yellow boxes) can be successfully discarded by limiting the allowed standard deviations. The next section explains step by step how to get a measure for the uncertainty of the derived vision-based position estimates based on a Monte Carlo approach.
(a) Top View Vision-based Position Estimates: EPnP + Ransac. For this plot, Ransac
uses a minimal set of s=4 points for the EPnP algorithm.
(b) Top View Vision-based Position Estimates: EPnP + Ransac. For this plot, Ransac was required to use a non-minimal set of s = 8 match points for the EPnP algorithm, cf. chapter 3.3. The visual difference between using s = 4 or s = 8 points for EPnP + Ransac is shown in more detail in Figure 5.3.
Figure 5.2: The red dots represent the vision-based position estimates whereas
the black line in Figure (b) illustrates the flight path. The numbers 1-4 in Figure
(a) show areas where no matches were found cf. Figure 5.1.
Figure 5.3: Close ups of Figure 5.2 (a) and (b): First row: Vision-based position
estimates (red points) using a minimal set of s=4 points for EPnP and Ransac.
Second row: Vision-based estimates using a non-minimal set of s=8 points for
EPnP and Ransac. By comparing the first and the second row, one can see
that the estimates illustrated in the second row for the non-minimal set tend
to be more aligned along the street. However, in both cases there are some
clearly wrong estimates (highlighted in yellow) which are not along the flying
path. To get rid of those estimates, it is suggested to filter the vision-based
estimates based on the standard deviations as shown in the last row. Please
refer to chapter 5.2 for more details.
5.2 Uncertainty Quantification
To quantify the uncertainties related to the vision-based global positioning estimates with respect to the underlying data, a Monte Carlo type approach is
used to calculate the covariances for each estimate. The procedure is outlined
below:
Algorithm 2: Uncertainty Quantification
Data: A set of N_matches 2D-3D match points u_MAV and X_global for a specific MAV image
Result: Covariance and standard deviations related to the vision-based position estimation
1:  for a specific MAV image - Street View image match pair do
2:    Initialize: calculate the vision-based position estimate X_MAV according to Algorithm 1;
3:    initialize counter: j = 1;
4:    for it = 1:1000 do
5:      1.) Randomly select a subset of s 3D-2D match points out of the total number of N_matches 3D-2D match points and calculate the rotation matrix R_MAV,it and the translation vector T_MAV,it with EPnP;
6:      2.) Calculate the number of inliers N_inliers by reprojecting all the N_matches 3D points X_global to the image plane with a suitable pixel threshold t based on Equations 3.4 and 3.5;
7:      if N_inliers > s, i.e. if the number of inliers is bigger than the number of sample points s, then
8:        save R_MAV,it and T_MAV,it as R_MAV,j and T_MAV,j and store the global localization estimate X_MAV,j = -R_MAV,j^(-1) T_MAV,j and the heading yaw_j (which can be extracted from R_MAV,j) in the list L_estimates;
9:        increase counter: j = j + 1;
10:     else
11:       do not save R_MAV,it and T_MAV,it;
12:   Calculate the covariance and the standard deviations over all estimates stored in L_estimates. (a)

(a) The covariances and the standard deviations can be easily calculated using the Matlab functions cov and std, cf. http://www.mathworks.ch/ch/help/matlab/ref/cov.html
The Monte Carlo approach used is straightforward: First, the algorithm randomly selects a subset of all the found 2D-3D match points. Second, it calculates the vision-based position estimates R_MAV,j and T_MAV,j using EPnP. Third, it reprojects all the 3D points to the image plane as explained by Equation 3.4 and checks if they are valid inliers, i.e. if their reprojection error is smaller than the allowed reprojection error threshold. Finally, if enough inliers are found, the position estimates R_MAV,j and T_MAV,j are saved. This procedure is repeated 1000 times. Based on all the saved position estimates, the covariances quantifying the uncertainty related to the initial vision-based position estimate X_MAV for a certain MAV - Street View image match pair can be calculated. The described procedure is illustrated in Figure 5.4.
Figure 5.4: This Figure illustrates the described procedure to calculate the covariances for two example MAV - Street View match pairs. Left side: top view of MAV - Street View match pair Nr. 4730-31 from the small dataset. Right side: top view of MAV - Street View match pair Nr. 3500-32 from the small dataset. The red dots correspond to the Monte Carlo position estimates given by R_MAV,j and T_MAV,j. The green squares represent the positions of the Street View cameras. The blue ellipses limit a 95-percent confidence interval based on the calculated covariance. The black strokes represent the direction of the mean yaw. The yellow points represent the mean position estimates based on all the Monte Carlo samples. Finally, the magenta crosses represent the positions of the 3D feature points X_global which are found on the building facades. The blue ellipses can be used to identify how reliable a certain vision-based position estimate is, based on a certain probability interval. Based on the blue ellipses, we can say that we are 95 percent sure to be in the area bordered by the blue ellipse. As shown in the right image, the Monte Carlo estimates can be highly clustered, resulting in a relatively narrow ellipse, meaning that the vision-based estimate can be considered to be reliable. On the other hand, as shown in the left image, the Monte Carlo estimates can also be dispersed, resulting in a less concentrated confidence area.
Please note the following: one drawback of the described Monte Carlo approach is that it depends on the number of valid Monte Carlo estimates (represented by the red dots in Figure 5.4). This number is given by the counter j at the end of Algorithm 2. Remember that an estimate is considered to be valid if more than s inliers are found (where s stands for the number of randomly sampled 2D-3D points supplied to EPnP). However, this number can differ strongly between MAV - Street View match pairs. If the total number of 3D-2D match points N_matches is high (e.g. N_matches = 200), the number of valid Monte Carlo estimates is usually also high (e.g. j = 500 out of the 1000 iterations result in a valid estimate). In contrast, if the total number of 3D-2D match points N_matches is low (e.g. N_matches = 10), the number of valid Monte Carlo estimates will also be low (e.g. only j = 3 out of the 1000 iterations result in a valid estimate). If the number of valid Monte Carlo estimates is too small (i.e. less than 20), it is reasonable not to use those estimates at all to calculate covariances, as they might be highly unreliable. The same holds when calculating the standard deviations of the position and the yaw for the MAV - Street View match pairs based on the Monte Carlo estimates: if the
number of Monte Carlo estimates is too low (e.g. if only j = 3 out of the 1000
iterations result in a valid estimate), the calculated standard deviations may be
either very small or very high. Therefore, only uncertainty estimates which are
based on more than j = 20 Monte Carlo estimates are considered to be reliable
in this thesis. This is the reason why Figure 5.5 only shows standard deviations
for about 800 MAV - Street View match pairs out of the 4059 maximum possible
(i.e. if every MAV image could have been correctly classified). The fundamental
problem here is that for certain MAV - Street View match pairs, not enough
correct correspondences can be found by the air-ground algorithm. Suggestions
on how to improve that are given in chapter 6.
(a) This Figure shows a top view (global X-Y coordinates) of a subset of the big
dataset which corresponds to the corner illustrated on the left side of Figure 5.3. The
blue ellipses show the 95-percent confidence intervals of the vision-based position
estimates calculated based on the outlined Monte Carlo approach. The green boxes
correspond to the Street View camera positions. The magenta crosses show the
positions of the matched 3D feature points on the building facades. It is shown that
the vision-based estimates are usually found near the Street View camera positions.
Moreover, it can be seen that most of the confidence intervals border a reasonably small area, meaning that the accuracy of the vision-based positioning approach seems to be practical for accurately localizing a MAV in an urban environment.
Figure 5.5 shows the Monte Carlo-based standard deviations calculated for the big dataset for the global X, Y, Z-coordinates and the camera yaw. Based on the calculated standard deviations, one can define a simple filter rule to discard vision-based position estimates whose standard deviation is too high. For example, one could only consider vision-based estimates with a standard deviation of less than 2.0 meters for the X, Y and Z-coordinates. The result of such an approach is illustrated in the last row of Figure 5.3. It clearly demonstrates that such a strategy eliminates the extreme outliers. However, the price to pay is that the total number of available vision-based estimates is reduced. Of course, in a real-life application, more sophisticated rules could be applied to discard outliers, e.g. by using IMU data.
Figure 5.5: This Figure shows the standard deviations for the MAV - Street View match pairs. The y-axis shows the standard deviation in meters / degrees, the x-axis stands for a certain MAV - Street View match pair from the big dataset. The blue curve shows the standard deviation of the global vision-based X-coordinate [m]. The green curve shows the standard deviation of the global vision-based Y-coordinate [m]. The red curve shows the standard deviation of the global vision-based Z-coordinate [m]. The magenta curve shows the standard deviation of the estimated yaw [degrees]. The mean standard deviations for the X-coordinate and the Y-coordinate are 1.16 meters and 1.56 meters, respectively. The mean standard deviation for the Z-coordinate (which is the height) is slightly larger at 2.20 meters. The mean standard deviation for the yaw is 7.86 degrees.
5.3 Virtual-views and Iterative Refinements
The quality of the vision-based position estimates can also be assessed visually by rendering so-called virtual views from the estimated camera positions and comparing them to the original MAV images. The procedure is as follows: based
on the estimated MAV camera position, a camera is added to the 3D city model.
The internal parameters of the MAV camera have been described in chapter 4.1.
Texture is then applied to the scene by backprojecting the Street View image to
the model as explained in chapter 2.5 and a rendered view from the perspective
of the estimated camera position is generated. If the estimated camera position
is identical with the true camera position of the MAV, the original MAV image
should cover exactly the same scene as the artificially generated virtual-view.
Stated differently, the higher the visual similarity between the two images, the better the vision-based position estimate. Figure 5.6
shows some representative examples of original MAV images and the virtual
views generated according to their vision-based position estimates. Note that
there are some artefacts and inaccuracies in the virtual views resulting from
failures related to the backprojection of the Street View images to the 3D city
model. The examples clearly show that the vision-based position estimates are
in the close neighbourhood of the true camera positions as there is a substantial
overlap between the two depicted scenes. However, it is also evident that the
precision of the vision-based position estimates is in the order of a couple of
meters rather than a few centimetres. This is in accordance with the calculated
position covariances described in the previous section.
Another interesting application of the virtual views is to use them to refine
the global position estimates. The idea is as follows: After calculating the position estimate based on the vision-based global positioning algorithm described
in chapter 3.4, the algorithm is applied a second time. However, this time,
the air-ground matching step is carried out between the virtual-view and the
original MAV image. The other parts of the algorithm remain the same. The
procedure is illustrated in Figures 5.7 and 5.8. The first row shows the matches
found between the MAV image and its corresponding Street View image as a
result of the air-ground matching algorithm described in chapter 3.4. Based on these
matches, the MAV camera position is estimated using EPnP and Ransac and
a virtual view is rendered out from the textured model. The rendered view is
shown in the second row on the right side. The air-ground algorithm is then
applied again between the MAV image and the rendered-out virtual view. The
resulting matches are shown in the third row. Finally, the MAV camera position is again estimated by applying EPnP and Ransac and a new virtual view is
generated based on this refined position estimate. In Figure 5.7 it can be clearly
seen that the precision of the vision-based position estimate improves by applying this procedure. This is shown by the fact that there is more visual overlap
between the second virtual view and the original MAV image than between the first
virtual view and the MAV image. However, if the same procedure is applied
to the virtual view in Figure 5.8, no significant improvement can be observed.
The following observation which was also confirmed in other examples may give
an explanation for this discrepancy: if the first vision-based position estimate
is comparatively bad (especially in terms of the camera rotation) as shown in
the first row of Figure 5.7, a second iteration using the air-ground algorithm
will improve the position estimate. However, if the first vision-based position
estimate is already relatively good (especially in terms of the camera rotation),
a second iteration using the air-ground algorithm will not significantly improve
the position estimate. An interesting future application would be to iteratively
refine the position estimates by minimizing the photometric error between the MAV image and the generated virtual views, as explained in chapter 6.
Figure 5.6: The left column shows the original MAV images, the right column
shows the rendered-out virtual views. The more similar the two images are, the
better is the vision-based global position estimate.
Figure 5.7: Image (a) shows the matches between the original MAV image and
its corresponding Street View image as a result of the air-ground algorithm. The
second row shows the original MAV image (b) on the left side and the virtual
view (c) generated based on the position estimate after EPnP and Ransac on the
right side. The third row shows the matches found between the original MAV
image and the virtual view. The last row shows the original MAV image (e) on
the left side and the virtual view (f) generated based on the second iteration of
EPnP and Ransac. It can be clearly seen that the position estimate is improved
i.e. that the visual overlap is increased between the original MAV image (e)
and the virtual view after the second iteration (f).
Figure 5.8: Image (a) shows the matches between the original MAV image and
its corresponding Street View image as a result of the air-ground algorithm. The
second row shows the original MAV image (b) on the left side and the virtual
view (c) generated based on the position estimate after EPnP and Ransac on
the right side. The third row shows the matches found between the original
MAV image and the virtual view. The last row shows the original MAV image
(e) on the left side and the virtual view (f) generated based on the second
iteration of EPnP and Ransac. In contrast to Figure 5.7, there is no significant
improvement of the position estimate after the second iteration i.e. the visual
overlap between the original MAV image and the virtual views is not visibly
increased.
5.4 GPS Comparison
In Figure 5.9, the black path is based on pure visual odometry whereas the red path is based on visual odometry in combination with the proposed vision-based global updates.
Figure 5.9: Top view: The green dots show the path given by the GPS. The
black dots represent the path estimated purely based on visual odometry. The
red dots represent the visual odometry together with the updates given by the
vision-based global positioning system. Note that the beginnings of the latter two paths are the same (both in black). After the first global position update, the drift is corrected and the red path starts. The jumps in the red path
are caused by the vision-based position updates which are used to correct the
drift of the visual odometry.
Figure 5.11 shows the global X,Y,Z coordinates in the 3D model coordinate
system for each image frame for the satellite-based GPS readings (blue) and the
vision-based position estimates (green) 3 . It is shown that the position estimates
for the global X-coordinates are relatively similar for the two (cf. Figure 5.11
(a)). Their mean deviation is 2.78 meters. The Y-coordinates clearly move in accordance with each other, however, there is a visible offset between the two, which is also reflected by the relatively high mean deviation of 9.26 meters (cf. Figure 5.11 (b)). Moreover, this can also be seen in Figure 5.9, where the first GPS measurement starts more to the right than the first vision-based estimate. The GPS-based Z-coordinate, which corresponds to the flying height, is rather erratic and clearly overestimated (cf. Figure 5.11 (c)). This is also illustrated in Figure 5.10, which shows some screenshots from the 3D model. The GPS path (green) is clearly too high. Based on the altitude sensor and the flight parameters, we know that the flying height has never been more than 6 meters above ground. However, the GPS path is sometimes even above the buildings.
As mentioned in the beginning, without any ground-truth it is not possible
to finally conclude whether the GPS-based flight path or the visual odometry
together with the vision-based global positioning approach is more realistic.
However, based on the plots in this section we can draw the following conclusions:
3 The GPS measurements in WGS84 have been converted to the Swiss Coordinate System CH1903 as described in chapter 2.1.
Figure 5.10: 3D screenshots from the estimated paths: it can be seen that the
flying height is overestimated by the GPS measurements (green). Moreover,
it is shown that the pure visual odometry (black) eventually crashes into the buildings if no global vision-based update is applied to correct the drift.
Pure visual odometry without any update to correct the drift will eventually lead to significant positioning errors, as shown in Figure 5.9 where the visual odometry hits the buildings. The proposed vision-based global positioning update successfully manages to correct for this drift. Figures 5.11 (a) and (b) show that the vision-based path is rather similar to the GPS path. In particular, the relative movements between the two are highly correlated. The reason for the visible gap in the absolute position in Figure 5.11 (b) cannot be clearly attributed to either of the two estimates. However, from visual inspection, it seems likely that the GPS estimate is slightly biased to the right side, as shown in Figure 5.9. In terms of altitude, the satellite-based GPS measurements are clearly too high. This is not surprising as GPS has a much better lateral than vertical precision. The proposed vision-based approach seems to be more precise in terms of altitude estimation. Based on the flight recordings, the estimated vision-based path seems to be plausible as the flight was carried out in the middle of the street. From the results presented it may therefore be concluded that the proposed vision-based tracking and positioning system can offer a viable alternative or extension to a purely satellite-based global positioning system.
Figure 5.11: Global X, Y and Z-coordinates per MAV image ID for the satellite-based GPS readings and for the visual odometry with vision-based global updates (legend: GPS; Visual Odometry + Mag + Update).
(a) X-coordinates: The GPS path and the vision-based path are highly correlated. Correlation coefficient = 0.99.
(b) Y-coordinates: The GPS path and the vision-based path are highly correlated. Correlation coefficient = 0.98.
(c) Z-coordinates: The GPS path and the vision-based path deviate significantly. GPS is clearly overestimating the height and is inconsistent.
Chapter 6
Conclusion and Outlook
Figure 6.1: Inaccuracies resulting from the backprojection of the Street View images onto the city model are directly related to the positioning inaccuracies. Especially at the borders and the tops of the buildings, the backprojection errors can be seen (a-b). Wrong correspondences after the air-ground algorithm (c) are also a source of error.
To increase the number of inliers after EPnP + Ransac and hence increase
the accuracy and robustness of the vision-based position estimates, there are
basically two approaches: (1) to get a more accurate textured 3D city model
and/or (2) to increase the number of air-ground matches. The following steps
could be carried out to achieve this:
Use multiple cameras with a bigger field of view. This is the most straightforward and practical solution to increase the number of MAV - Street View image correspondences. By having multiple cameras, the overlap between the Street View image and the MAV images can be drastically increased. For example, by having two side-looking cameras (facing the buildings), the number of found matches can be doubled. Moreover, the field of view of the cameras should be as large as possible. Figure 6.3 illustrates the effect of a larger field of view on the number of found matches.
In a real system, in terms of reliability, it will make sense to use all sensory information which is available on the MAV. It is therefore suggested to fuse satellite-based GPS measurements, IMU data and the suggested vision-based approach to obtain the highest possible positioning accuracy and reliability.
Figure 6.3: Top row: aerial image recorded with a GoPro camera (horizontal field of view of 170 degrees) and the resulting air-ground matches with the most similar Street View image. Bottom row: aerial image of a nearby scene taken with the standard AR.Drone 2.0 camera used in this work (horizontal field of view of 92 degrees). It is clearly shown that the number of matches can be significantly increased by using a camera with a bigger field of view. Note that the GoPro camera on the Fotokite is installed with a downwards-looking angle, i.e. the camera has a negative pitch of 45 degrees. This is the reason why a big part of the image depicts the street and not the facades. In a real system, the pitch angle of the installed cameras should be chosen such that the amount of visual overlap between the MAV image and the Street View images is maximized.
Other suggestions for future work and final remarks
In terms of hardware and software requirements, for a realistic system, the
following suggestions are made:
Use an open-source hardware platform which can openly be configured according to specific user requirements. In this work, the standard interface of the AR.Drone 2.0 has been used. While this is an easy-to-work-with, robust and low-cost MAV platform, the on-board software cannot be directly manipulated. Moreover, it would be very cumbersome, if not impossible, to add additional hardware like cameras or on-board computers to it. It is therefore suggested to build one's own platform, e.g. using the PX4 autopilot by ETH 4, which builds on the AR.Drone 2.0 platform.
Figure 6.4: Figure (a) shows a depth map generated directly from Google Street
View. Note that the 3D geometry has been recorded with a laser-range scanner
and synchronized with the Google Street View camera as described in [3]. The
scene is modelled with the help of planes. Figure (b) shows the Street View
panoramic image for the particular scene. Figure (c) shows the panoramic depth
map generated by the cadastral 3D city model as used in this thesis. Note that
the 3D scene is modelled in more detail than in the depth map from Google in (a). However, the Street View camera geotags are not perfectly synchronized with the cadastral 3D city model, resulting in the errors described in 6.1. Please note that Figure (a) has not been generated using the official Google Street View API. It is therefore recommended to consult Google's Terms and Conditions before making excessive use of the script described before.
Figure 6.5: This Figure illustrates the principle of the described active camera
ranging approach. The main idea is to let the MAV turn around if not enough
correspondences can be found for the current view to accurately localize. On
the top left side, a Street View panoramic image is displayed. On the bottom,
a stitched user-generated (e.g. generated by the MAV) panoramic image for a
nearby scene is displayed, which results when the camera is turned around by 360 degrees. The green boxes show parts of the two panoramas which can be
successfully matched with the air-ground algorithm as shown on the right side.
The black boxes represent parts of the panoramas for which no matches were
found. It is evident that if the MAV camera is currently stuck in the state of a black box, i.e. no matches are found for the current MAV image, it will make sense to turn around and look for a perspective which can be matched with the available Street View panorama.
Also use a satellite-based GPS receiver. This thesis proves that pure
vision-based global localization is possible. However, a GPS receiver will
be very useful in combination with the suggested vision-based approach.
Especially in the above-rooftop flight scenario described in chapter 1.3.1, satellite-based GPS will be the method of choice.
Use MORSE and ROS for a realistic software implementation.
The algorithms and functions in this thesis have been partially programmed
in Python, Matlab and C++. For the work with the 3D cadastral city
model, the open-source 3D modelling software Blender 5 has been used, which offers a direct interface to Python. However, for future applications, it is suggested to use MORSE 6, which is a generic simulator for academic robotics. MORSE can now be easily interfaced with the Robot Operating System (ROS) and the Blender game engine, which allows for a much more efficient workflow.
Finally, to conclude this thesis, the author would like to share some personal
remarks concerning the advance of aerial robotic applications in urban areas.
Based on the experience gathered during this work, it is concluded that the
biggest challenges on the way to autonomous aerial robots will not be of a technical nature (these can be solved) but will more likely be public concerns resulting in a restrictive regulatory environment. When recording the datasets used in
this work, several interactions with the public showed that many people perceive camera-equipped Micro Aerial Vehicles as a threat to their privacy. These
public concerns should be taken seriously by the scientific community. As mentioned in chapter 1.4, the questions of privacy and liability law in the context
of MAV applications are not entirely solved yet. The regulators in Switzerland
and elsewhere will be forced to clarify these pending legal issues in the near future. For the successful advance of autonomous MAV applications, like aerial parcel delivery or first-aid response systems, it will therefore be crucial that the involved scientists and engineers pro-actively engage in the coming public discussion by showing the opportunities of this relatively new technological trend,
and at the same time, by being aware of the possible threats related to it.
4 https://pixhawk.ethz.ch/
5 http://www.blender.org/
6 http://www.openrobots.org/morse/doc/latest/what_is_morse.html
Appendix A
Appendix
A.1 OpenCV EPnP + Ransac
/*
 * PnP_Drone.cpp
 *
 * Created on: Oct 24, 2013
 * Author: yves.albers@gmail.com
 *
 * This file is used to calculate the EPnP + Ransac position estimates
 *
 * Input:  InputYML.yml file containing 2D-3D matches for a
 *         specific MAV - Street View match pair;
 *         Thesis notation:
 *         -> x3d_h refers to X_{global}, -> x2d_drone refers to u_{MAV}
 *         -> x2d_street refers to u_{STREET}, K_Drone refers to K_{MAV}
 *
 * Output: OutputYML.yml file containing the vision-based position estimate:
 *         Thesis notation:
 *         -> tvec refers to T_{MAV}, R_Mat refers to R_{MAV}
 *         -> T_Vec refers to X_{MAV}
 *
 * Note that instead of using EPnP also other PnP approaches can be tested
 *   PnPType:
 *   1 = Iterative PnP
 *   2 = P3P by Gao et al.
 *   3 = EPnP
 *
 * Note: this file has been written so that it can be used in batch mode from the
 * command line by the python script batch_run_pnp.py
 * The output yml files can be read by Matlab
 */

#include <cv.h>
#include <highgui.h>
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
A.2 ICRA Submission
The paper on the following pages was submitted to the IEEE International Conference on Robotics and Automation (ICRA 2014) and is currently under review.
I. INTRODUCTION
Our motivation is to create vision-driven localization
methods for Micro Aerial Vehicles (MAV) flying in urban
environments, where the satellite GPS signal is often shadowed by the presence of the buildings, or not available. Accurate localization is indispensable to safely operate small-sized aerial service-robots to perform everyday tasks, e.g.,
goods delivery,1 inspection and monitoring,2 first-response
and telepresence in case of accidents.
In this paper, we tackle the problem of localizing MAVs
in urban streets, with respect to the surrounding buildings.
We propose the use of textured 3D city models to solve the
localization problem of a camera equipped MAV. A graphical
illustration of the problem addressed in this work is shown
in Fig. 1.
In our previous work [1], we described an algorithm
to search airborne MAV images within a geo-registered,
street-level image database. Namely, we localized airborne
images in a topological map, where each node of the map is
represented by a Street View image. 3
In this paper, we advance our earlier work [1] by backprojecting the geo-referenced images onto the 3D cadastral model of the city to obtain the depth of the scene. Therefore, the algorithm described in this paper computes the 3D position of the corresponding image points between the airborne
Fig. 2: Comparison between ground-level Street View (top row) and airborne
MAV (row 2 and 3) images used in this work. Note the significant changes
in terms of viewpoint, over-season variation, and scene between the database
((a); respectively (d)) and query images ((b), (c); respectively (e), (f)) that obstruct their visual recognition.
3 For details please watch: http://youtu.be/
Fig. 3: Number of inlier feature points matched between the MAV and
ground images versus the distance to the closest Street View image.
of the system, rather than the real-time, efficient implementation. Though, for the sake of completeness, we present
in Fig. 4 the effective processing time of the Air-ground
image matching algorithm, using a commercially available
laptop with an 8-core, 2.40 GHz clock architecture. The
Air-ground matching algorithm is computed in five major
steps: (1) virtual view generation and feature extraction; (2)
approximate nearest-neighbor search within the full Street
View database; (3) putative correspondences selection; (4)
approximate nearest-neighbor search among the features
extracted from the aerial MAV image with respect to the
selected ground level image; (5) acceptance of good matches
(kVLD inlier detection). In Fig. 4 we used more than 400
airborne MAV images. All the images were searched within
the entire Street View images that could be found along the
2km trajectory. Notice that the longest computation time is
the approximate nearest-neighbor search in the entire Street
View database for the feature descriptors found in the MAV
image. However, for position tracking, this step is completely
neglected (Section III) since, in this case, the MAV image
is compared only with the neighboring Street View images
(usually up to 4 or 8, computed in parallel on different cores,
depending on the road configuration). Finally, notice that
the histogram voting (Fig. 4) takes only 0.01 seconds. On
average, steps (1), (4), and (5) are computed in 3.2 seconds.
Therefore, if the MAV flies roughly with a speed of 2 m/s,
its position would be updated every 6.5 meters (subsection
IV).
B. Textured 3D cadastral models
The 3D cadastral model of Zurich used in this work was
acquired from the city administration and claims to have an
average lateral position error of 10 cm and an average error in height of 50 cm. The city model is referenced
in the Swiss Coordinate System CH1903 [12]. Note that this
model does not contain any textures.
The geo-location information of the Google Street View
dataset is not exact. The geo-tags of the Street View images provide only approximate information about where the
images were recorded by the vehicle. Indeed, according to
Fig. 5: (a) perspective view of the cadastral 3D city model; (b) the ground-level Street View image overlaid on the model; (c) the back-projected texture
onto the cadastral 3D city model; (d) estimated MAV camera positions matched with one Street View image; (e) the synthesized view from one estimated
camera position corresponding to actual MAV image (f); (g)-(i) show another example from our dataset, where (g) is an aerial view of the estimated camera
position (h), which is marked with the blue camera in front of the textured 3D model, (h) is the synthesized view from the estimated camera position
corresponding to actual MAV image (i).
(1)
(2)

z_k = h(q_{k|k-1}),     (3)

The visual odometry provides a relative motion estimate ξ̃_{k,k-1} ∈ R^4,

ξ̃_{k,k-1} = (s̃_k, Δψ_k),     (4)

where s̃_k ∈ R^3 denotes the translational component of the motion and Δψ_k ∈ R the yaw increment. s̃_k is valid up to a scale factor, thus the metric translation s_k ∈ R^3 of the MAV at time k with respect to the camera reference frame is equal to

s_k = λ s̃_k.     (5)

We define ξ_{k,k-1} ∈ R^4 as

ξ_{k,k-1} = (s_k, Δψ_k),     (6)

(7)
(8)

assuming that q_{k-1|k-1} and ξ_{k,k-1} are uncorrelated. We compute the Jacobian matrices numerically. The rows of the Jacobian matrices (∇_i f_q), (∇_i f_ξ) ∈ R^{1x4} (i = 1, 2, 3, 4) are computed as

(∇_i f_q) = [ ∂(f_i)/∂(q_{1,k-1|k-1})   ∂(f_i)/∂(q_{2,k-1|k-1})   ∂(f_i)/∂(q_{3,k-1|k-1})   ∂(f_i)/∂(q_{4,k-1|k-1}) ]

(∇_i f_ξ) = [ ∂(f_i)/∂(ξ_{1,k,k-1})   ∂(f_i)/∂(ξ_{2,k,k-1})   ∂(f_i)/∂(ξ_{3,k,k-1})   ∂(f_i)/∂(ξ_{4,k,k-1}) ]     (9)

where q_{i,k-1|k-1} and ξ_{i,k,k-1} denote the i-th component of q_{k-1|k-1} respectively ξ_{k,k-1}. The function f_i relates the updated state estimate q_{k-1|k-1} and the VO output ξ_{k,k-1} to the i-th component of the predicted state q_{k|k-1}.

In conclusion, the state covariance matrix Σ_{q,k|k-1} defines an uncertainty space (with a confidence level of 3σ). If the measurement z_k that we compute by means of the appearance-based global positioning system is not included in this uncertainty space, we do not update the state and we rely on the VO estimate.
C. Uncertainty estimation of the appearance-based global localization

Our goal is to update the state of the MAV q_{k|k-1} whenever an appearance-based global position measurement z_k ∈ R^4 is available. We define z_k as

z_k = (p_k^S, ψ_k^S),     (10)

where p_k^S ∈ R^3 denotes the position and ψ_k^S ∈ R the yaw in the global reference system.

The appearance-based global positioning system (Section II) provides the index j ∈ N of the Street View image corresponding to the current MAV image, together with two sets of n ∈ N 2D corresponding image points between the two images. Furthermore, it also provides the 3D coordinates of the corresponding image points in the global reference system. We define the set of 3D coordinates as X^S := {x_i^S} (x_i^S ∈ R^3, i = 1, ..., n) and the set of 2D coordinates as M^D = {m_i^D} (m_i^D ∈ R^2, i = 1, ..., n).

If the MAV image matches with a Street View image, it cannot be farther than 25 meters from that Street View camera (cf. Fig. 3), according to our experiments. We illustrate the uncertainty bound of the MAV from the bird-eye-view perspective in Fig. 6 with the green ellipse, where the blue dots represent Street View cameras. In order to reduce the uncertainty associated to z_k, we use the two sets of corresponding image points.

We compute z_k such that the reprojection error of X^S with respect to M^D is minimized, that is

z_k = argmin_z sum_{i=1}^{n} || m_i^D - π(x_i^S, z) ||,     (11)

where π(x_i^S, z) denotes the reprojection of the 3D point x_i^S into the image for the camera pose z.
Fig. 8: Comparison between the estimated trajectories: (a) bird-eye-view perspective of the results overlaid on Google Maps; (b) side view of (a); black circles represent the Street View camera positions; note that the terrain is ascending, consequently, we measure the altitude above the ground; we display the trajectory measured using the on-board GPS in green; the path estimate obtained with the system described in this paper is shown in blue, the squares identify the state updates; we show in red the enhanced version of our path estimate, computed using the on-board magnetometer data to estimate the yaw of the MAV, red circles identify the state updates; finally, magenta displays the estimate given by pure Visual Odometry.
Fig. 9: Comparison between path estimates shown within the cadastral 3D city model: Top row: we display the Visual Odometry estimate in black, GPS in green, our estimate in blue; (a) altitude evaluation: in the experiment, the MAV flew close to the middle of the street and it never flew over the height of 6 m (above the ground); from this point of view, our path estimate (blue) is more accurate than the GPS one (green); (b) perspective view of the path estimates; (c) trajectory zoom: the pure VO trajectory penetrates one of the surrounding buildings; using the proposed method, we are able to reduce the drift of the VO; Bottom row: we show a visual comparison of: (d) the actual view; (e) the rendered view of the textured 3D model corresponding to (d) that the MAV perceives according to our estimate; (f) the rendered view of the textured 3D model corresponding to (d) that the MAV perceives according to the GPS measurement; to conclude, the algorithm presented in this paper outperforms the other techniques in estimating the trajectory of the MAV flying at low altitude in urban environments.
[12] Formulas and constants for the calculation of the swiss conformal
cylindrical projection and for the transformation between coordinate
systems, Federal Department of Defence, Civil Protection and Sport
DDPS, Tech. Rep., 2008.
[13] A. Taneja, L. Ballan, and M. Pollefeys, Registration of spherical
panoramic images with cadastral 3d models, in 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012
Second International Conference on. IEEE, 2012, pp. 479-486.
[14] M. A. Fischler and R. C. Bolles, Random sample consensus: a
paradigm for model fitting with applications to image analysis and
automated cartography, Communications of the ACM, vol. 24, no. 6,
pp. 381-395, 1981.
[15] L. Kneip, D. Scaramuzza, and R. Siegwart, A novel parametrization
of the perspective-three-point problem for a direct computation of
absolute camera position and orientation, in Proc. of The IEEE
International Conference on Computer Vision and Pattern Recognition
(CVPR), Colorado Springs, USA, June 2011.
[16] F.Moreno-Noguer, V.Lepetit, and P.Fua, Accurate non-iterative o(n)
solution to the pnp problem, in IEEE International Conference on
Computer Vision, Rio de Janeiro, Brazil, October 2007.
[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer
Vision, 2nd ed. New York, NY, USA: Cambridge University Press,
2003.
[18] S. Thrun, W. Burgard, D. Fox, et al., Probabilistic robotics. MIT
press Cambridge, 2005, vol. 1.
[19] D. Scaramuzza and F. Fraundorfer, Visual odometry [tutorial],
Robotics & Automation Magazine, IEEE, vol. 18, no. 4, pp. 80-92,
2011.
[20] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz, Multicore bundle
adjustment, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011, pp. 3057-3064.
[21] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, Robust Monte Carlo localization for mobile robots, Artificial Intelligence, vol. 128, no. 1-2, pp. 99-141, 2001.
Bibliography
[1] M. Achtelik, A. Bachrach, R. He, S. Prentice, and N. Roy. Stereo vision
and laser odometry for autonomous helicopters in gps-denied indoor environments. In Proceedings of the SPIE Unmanned Systems Technology XI,
volume 7332, Orlando, FL, 2009.
[2] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. Commun. ACM, 54(10):105-112, 2011.
[3] Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh,
Stephane Lafon, Richard Lyon, Abhijit Ogale, Luc Vincent, and Josh
Weaver. Google street view: Capturing the world at street level. Computer, 43, 2010.
[4] Richard Bowden, John P. Collomosse, and Krystian Mikolajczyk, editors.
British Machine Vision Conference, BMVC 2012, Surrey, UK, September
3-7, 2012. BMVA Press, 2012.
[5] Koray Çelik, Soon Jo Chung, Matthew Clausman, and Arun K. Somani. Monocular vision slam for indoor aerial vehicles. In IROS, pages 1566-1573, 2009.
[6] Winston Churchill and Paul M. Newman. Practice makes perfect? managing and leveraging visual experiences for lifelong navigation. In ICRA,
pages 4525-4532, 2012.
[7] J. Engel, J. Sturm, and D. Cremers. Accurate figure flying with a quadrocopter using onboard visual and inertial sensing. In Proc. of the Workshop
on Visual Control of Mobile Robots (ViCoMoR) at the IEEE/RJS International Conference on Intelligent Robot Systems (IROS), 2012.
[8] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a
paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395, June 1981.
[9] F. Moreno-Noguer, V. Lepetit, and P. Fua. Accurate non-iterative o(n) solution to the pnp problem. In IEEE International Conference on Computer
Vision, Rio de Janeiro, Brazil, October 2007.
[10] Gerald Fritz, Christin Seifert, Manish Kumar, and Lucas Paletta. Building
detection from mobile imagery using informative sift descriptors. In SCIA,
pages 629-638, 2005.
[11] Christian Frueh, Siddharth Jain, and Avideh Zakhor. Data processing algorithms for generating textured 3d building facade meshes from laser scans
and camera images. International Journal of Computer Vision, 61(2):159-184, 2005.
[12] Christian Früh, Russell Sammon, and Avideh Zakhor. Automated texture mapping of 3d city models with oblique aerial imagery. In 3DPVT, pages 396-403, 2004.
[13] Andrew Harltey and Andrew Zisserman. Multiple view geometry in computer vision (2. ed.). Cambridge University Press, 2006.
[14] Andreas Hoppe, Sarah Barman, and Tim Ellis, editors. British Machine
Vision Conference, BMVC 2004, Kingston, UK, September 7-9, 2004. Proceedings. BMVA Press, 2004.
[15] Stefan Hrabar and Gaurav S. Sukhatme. Vision-based navigation through
urban canyons. J. Field Robotics, 26(5):431-452, 2009.
[16] Christian Kerl, Jürgen Sturm, and Daniel Cremers. Robust odometry estimation for rgb-d cameras. In ICRA, pages 3748-3754, 2013.
[17] Georg Klein and David Murray. Parallel tracking and mapping for small
ar workspaces. IEEE and ACM International Symposium on Mixed and
Augmented Reality, November 2007.
[18] L Kneip, D Scaramuzza, and R Siegwart. A novel parametrization of the
perspective-three-point problem for a direct computation of absolute camera position and orientation. In Proc. of The IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Colorado
Springs, USA, June 2011.
[19] Lubor Ladicky, Christopher Russell, Pushmeet Kohli, and Philip H. S. Torr.
Associative hierarchical crfs for object class image segmentation. In ICCV,
pages 739-746, 2009.
[20] David G. Lowe. Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60(2):91-110, 2004.
[21] Andras Majdik, Yves Albers-Schoenberg, and Davide Scaramuzza. Mav urban localization from google street view data. In International Conference
on Intelligent Robots and Systems, page To appear, 2013.
[22] Andrew Mastin, Jeremy Kepner, and John W. Fisher III. Automatic registration of lidar and optical images of urban scenes. In CVPR, pages
2639-2646, 2009.
[23] Navigation National Coordination Office for Space-Based Positioning and
Timing. Official U.S. government information about the Global Positioning System (GPS) and related information, 2013. [Online; accessed 19-September-2013].
[24] David Nister. An efficient solution to the five-point relative pose problem.
IEEE Trans. Pattern Anal. Mach. Intell., 26(6):756-777, 2004.
[25] Preprints of the 18th IFAC World Congress Milano (Italy), editor. The
Navigation and Control Technology Inside the AR.Drone Micro UAV, Milano, Italy, 2011.
[26] Federal Office of Topography swisstopo. Swiss reference systems, 2013.
[Online; accessed 24-September-2013].
[27] Tomas Pajdla Petr Gronat. Building streetview datasets for place recognition and city reconstruction. Workshop 2011, Center for Machine Perception, FEE, CTU in Prague, Czech Republic.
[28] Marc Pollefeys, David Nister, Jan-Michael Frahm, Amir Akbarzadeh,
Philippos Mordohai, Brian Clipp, Chris Engels, David Gallup, Seon Joo
Kim, Paul Merrell, C. Salmi, Sudipta N. Sinha, B. Talton, Liang Wang,
Qingxiong Yang, Henrik Stewenius, Ruigang Yang, Greg Welch, and Herman Towles. Detailed real-time urban 3d reconstruction from video. International Journal of Computer Vision, 78(2-3):143-167, 2008.
[29] Timo Pylvänäinen, Jérôme Berclaz, Thommen Korah, Varsha Hedau, Mridul Aanjaneya, and Radek Grzeszczuk. 3d city modeling from street-level data for augmented reality applications. In 3DIMPVT, pages 238-245, 2012.
[30] Davide Scaramuzza and Friedrich Fraundorfer. Visual odometry [tutorial].
IEEE Robot. Automat. Mag., 18(4):80-92, 2011.
[31] Roland Siegwart and Illah R. Nourbakhsh. Introduction to Autonomous
Mobile Robots. Bradford Company, Scituate, MA, USA, 2004.
[32] Aparna Taneja, Luca Ballan, and Marc Pollefeys. Registration of spherical
panoramic images with cadastral 3d models. In 3DIMPVT, pages 479-486,
2012.
[33] OpenCV Development Team. Camera calibration with opencv, 2013. [Online; accessed 19-September-2013].
[34] Alex Teichman and Sebastian Thrun. Practical object recognition in autonomous driving and beyond. In ARSO, pages 35-38, 2011.
[35] Stadt Zurich Tiefbau und Entsorgungsdepartement. 3d-stadtmodell, 2013.
[Online; accessed 24-September-2013].
[36] Gonzalo Vaca-Castano, Amir Roshan Zamir, and Mubarak Shah. City scale
geo-spatial trajectory estimation of a moving camera. In CVPR, 2012.
[37] Stephan Weiss, Davide Scaramuzza, and Roland Siegwart. Monocular-slam-based navigation for autonomous micro helicopters in gps-denied environments. J. Field Robotics, 28(6):854-874, 2011.
[38] Andreas Wendel, Michael Maurer, and Horst Bischof. Visual landmark-based localization for mavs using incremental feature updates. In 3DIMPVT, pages 278-285, 2012.
[39] Wei Zhang and Jana Kosecka. Image based localization in urban environments. In 3DPVT, pages 33-40, 2006.
Title of work:
Student:
Name: Yves Albers-Schoenberg
E-mail: yvesal@student.ethz.ch
Legi-Nr.: 06-732-523