Basil Huber
Section de microtechnique
Ecole Polytechnique Federale de Lausanne

High-Speed Pose Estimation using a Dynamic Vision Sensor

Master Thesis
Robotics and Perception Group
University of Zurich

Supervision:
Prof. Dr. Davide Scaramuzza, RPG, UZH
Prof. Dr. Dario Floreano, LIS, EPFL
Elias Mueggler, RPG, UZH

March 2014
Contents

Abstract
Nomenclature vii
1 Introduction 1
1.1 Motivation 1
1.2 Related Work 4
1.3 Contribution 5
2 DVS Calibration 7
2.1 Outline 7
2.2 Approach 7
2.2.1 Displaying a Pattern 7
2.2.2 Focusing 8
2.2.3 Intrinsic Camera Calibration 8
2.3 Results 10
2.3.1 Focusing 10
2.3.2 Intrinsic Camera Calibration 10
3 DVS Simulation 14
3.1 Outline 14
3.2 Approach 15
3.3 Simulation Procedure 19
3.4 Results 20
3.4.1 Correction of DVS-Screen Misalignment 20
3.4.2 Screen Refreshing Effects 21
4 Pose Estimation 24
4.1 Outline 24
4.2 Approach 24
4.2.1 Initialization 25
4.2.2 Tracking 29
4.3 Experimental Setup 31
4.3.1 Trajectory Simulation 31
4.3.2 DVS on Quadrotor 32
4.4 Results 33
4.4.1 Trajectory Simulation 33
4.4.2 DVS on Quadrotor 38
5 Conclusion 41
5.1 Future Work 43
Abstract
We see because we move; we move because we see.
James J. Gibson, The Perception of the Visual World
Micro Aerial Vehicles (MAVs) have gained importance in various fields, such as search and rescue missions, surveillance, and delivery services over the last years. To stabilize and navigate reliably, the pose of the MAV must be known precisely. While impressive maneuvers can be performed using external motion capture systems, their usage is limited to small predefined areas. Numerous solutions exist to estimate the pose using onboard sensors such as laser range finders, Inertial Measurement Units (IMUs), and cameras. To navigate quickly through cluttered and dynamic environments, an MAV must be able to react agilely to suddenly appearing obstacles. In current visual navigation techniques, the agility is limited by the update rate of the perception pipeline, typically in the order of 30 Hz. The Dynamic Vision Sensor (DVS) is a novel visual sensor that allows these limits to be pushed. While conventional cameras provide brightness values at a fixed frame rate, this sensor only registers changes in brightness. When the illumination of a pixel changes, an event is emitted containing the pixel location, the sign of the change (i.e., increase or decrease of illumination), and a timestamp indicating the precise time of the change. The events are emitted asynchronously, leading to a latency of only 15 µs. We first introduce a convenient method for the intrinsic camera calibration of the DVS using a computer screen to display the calibration pattern. We then present a simulation approach where the real DVS is used in a virtual environment: the DVS is placed in front of a computer screen displaying a scene. The recorded data is subject to the genuine behavior of the DVS, including noise, while providing ground truth for the DVS pose and the scene. Finally, we present a method to estimate the 6-DoF pose of the DVS with respect to a known pattern. The algorithm is capable of estimating the pose of the DVS mounted on a quadrotor during a flip with an angular rate of 1200 °/s.
Nomenclature
Notation
Scalars are written in lower case letters (x), vectors in upper case letters (X), and matrices in upper case bold letters (X).
Chapter 1
Introduction
1.1 Motivation
Figure 1.1: Comparison between a standard camera and a DVS: A black dot on a
turning wheel (left) is observed by both sensors. The standard camera registers the
whole scene at a fixed rate. The DVS observes only the black dot, but it is not
restrained to a frame rate. If the wheel stops turning, no events are emitted by the
DVS while the conventional camera continues sending images.
To take advantage of the asynchronous signal, new algorithms have to be developed. Many traditional approaches are based on features detected in an
image [18]. Features describe interest points in an image. These interest points
should have a high repeatability (i.e., they should be recognised as such under
different viewing conditions). Interest points can be corners and edges [19, 20],
blobs [21, 22], maxima and minima of the difference of Gaussian (DoG) function [23], or other salient image regions. The interest points are then described
by a feature descriptor which should be able to identify them across different
images.
Due to the asynchronous nature of the DVS, images in the traditional sense do
not exist. Therefore, these methods cannot be directly applied to DVS data.
One approach to get images from DVS data is to synthesize them by integrating
the events emitted by the DVS. In integrated images, the grey value of a pixel
is defined by the number of events that this pixel emitted during the integration
time. Depending on the implementation, events of different signs are handled
differently. These images show gradients (typically edges) of objects that move
with respect to the DVS. The integration time is critical for the quality of the
images, comparable to the exposure time of conventional cameras. When the
integration time is long compared to the apparent motion, the image is blurred,
which is the equivalent of motion blur. In images with short integration time,
the gradients are only partially visible. Furthermore, only gradients perpendicular to the apparent motion are visible in these images. Thus, the synthesized
images depend not only on the appearance of the scene, but also on the movements during the integration time, as shown in Figure 1.2.

Figure 1.2: Integrated images depend on the apparent motion. The blue arrows show the direction of the apparent motion. White/black pixels indicate an intensity increase/decrease. Hatched pixels indicate that the intensity first decreases and then increases; hence, an event of each polarity is emitted. The two movements on the right have identical initial and end positions, but produce different integrated images.

If many gradients in different directions exist in the scene, some of the above-mentioned methods might be adapted to the DVS, using edges as feature points. However, approaches based on integrated images cannot take full advantage of the asynchronous circuit since they introduce lag.
Therefore, we aim for a purely event-based approach, where each emitted event
updates the estimation of the MAV pose. This allows us to benefit from the
low latency of the sensor and the sparsity of its output data, thus resulting in
an update rate of several kHz. Approaches using conventional cameras have a
theoretical limit at the camera frame rate (50 Hz to 100 Hz). Due to the large
amount of redundant data, the update rate of the perception pipeline is typically limited to 30 Hz [24]. The perception pipeline is currently a bottleneck for the agility of MAVs. We propose an algorithm that estimates the 6 Degrees of Freedom (DoF) pose with minimal integration to avoid lag. Our approach is based on tracking straight gradient segments in the event stream of the DVS. We update the pose estimate upon the arrival of each event that is attributed to a tracked segment, allowing very high-frequency pose updates.
Several tools exist for calibrating conventional cameras [25, 26, 27]. These tools typically involve taking pictures of a calibration pattern. The calibration parameters are then estimated by minimizing the reprojection error. These techniques cannot be used directly for the DVS since it does not register static scenes. We propose a calibration method that uses a computer screen to display the calibration pattern. The backlighting of LED-backlit LCD screens flickers at a high frequency when dimmed; hence, the pattern can be seen by the DVS. This trick is also used for the focal adjustment of the DVS.
Ground truth data for the DVS trajectory is an important aid for testing and evaluating pose estimation algorithms.
1.2 Related Work
Despite the DVS being a relatively new technology, various applications using it for tracking have been proposed. In many applications, clusters of events were tracked.
Examples include traffic control [28], ball tracking [29], and particle tracking [30].
Based on the particle tracker, a visual shape tracking algorithm was proposed [31].
The shape tracking is used as feedback for a robotic microgripper. The DVS output is used to determine the gripper's position. In a first step, objects are detected using the generalized Hough transform. Contrary to the approach of this work, the objects are detected in the image of a conventional camera rather than in the DVS event stream. To estimate the gripper's pose and shape, the location of incoming events is compared to a model of the gripper in the estimated pose. The pose of the gripper relative to the model is then estimated using an Iterative Closest Point (ICP) approach. The pose is then used to provide the operator of the gripper with haptic feedback.
Several methods to track the knee of a walking person using a DVS were proposed [32]. One approach consists of an event-based line-segment detector. In
another approach, the leg is found in the Hough space, similar to the approach
of this work. The tracking is then implemented with a particle filter in the
Hough space.
Another approach uses event-based optical flow to estimate motion [33].
Since the DVS does not provide gray levels, it is challenging to find the spatial
gradient. To do so, they integrate the image over a very short time (50 µs).
The gradient is then approximated as the difference of the event count between
neighboring pixels during the integration time. To avoid integrating events,
they proposed an approach relying only on the timing of the events [34]. For
this approach, the optical flow is estimated by comparing the timestamps of the
most recent events of neighboring pixels. Knowing the time difference and the
distance between the pixels, they can directly calculate the velocity of the visual
flow. To make their method more robust, they assume the flow velocity to be
locally constant. In their experiments, they track a black bar painted on a conveyor belt and on a turning wheel. Based on this optical flow implementation,
they present a Time-To-Contact (TTC) estimation method [35]. The TTC is
the time until the camera reaches an obstacle assuming uniform motion. It is
estimated using the velocity of the optic flow at a point in the image and the
distance of this point to the Focus of Expansion (FoE). The TTC can then be
used for obstacle avoidance and motion planning.
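The TTC relation just described can be written down directly. A small sketch under the stated assumption of uniform motion (the function name and argument conventions are ours, not taken from [35]):

```python
import math

def time_to_contact(point, foe, flow):
    """Estimate time-to-contact from event-based optical flow.

    point: (x, y) image location of the tracked point
    foe:   (x, y) focus of expansion
    flow:  (vx, vy) optical-flow velocity at `point`, in pixels/s

    Under uniform translational motion, TTC is the distance of the
    point to the FoE divided by the magnitude of its optical flow.
    """
    d = math.hypot(point[0] - foe[0], point[1] - foe[1])
    v = math.hypot(flow[0], flow[1])
    return d / v
```

For example, a point 10 pixels from the FoE moving at 5 pixels/s yields a TTC of 2 s, independent of the (unknown) metric depth of the scene.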
They further estimated the depth (i.e., the distance to the camera) of an object
using event-based stereo matching between two DVS [36]. They propose to estimate the epipolar line of a DVS pixel by finding pixels of the other DVS that
emit events nearly simultaneously.
Another group uses blinking LEDs on a quadrotor to find its pose with a stationary DVS [37]. While providing good results for pose estimation during
aggressive maneuvers, the tracking is limited to the field of view of the DVS,
similar to approaches using a motion capture system.
In more recent work [38], they mounted the DVS on a two-wheeled robot. Features are tracked using an onboard Kinect camera, providing pose estimates at the frame rate of the camera. Between two subsequent frames, the relative motion is estimated based on events emitted by the DVS. The received events are compared to the events expected when considering the most recent image of the CMOS camera. The motion with the highest coherence between expected and received events is taken as the estimate. To estimate the translation, depth information of the Kinect has to be included. This approach is very promising for estimating rotational motion, while it performs poorly for translation due to the low resolution of the DVS. However, tracking is lost when the motion between two subsequent frames is faster than half the field of view. Furthermore, when performing maneuvers, the CMOS images suffer from motion blur and can therefore not be used for pose updates. Hence, the pose estimation then relies only on the estimation of the relative motion of the DVS. Over time, the pose estimate drifts considerably until the next sharp CMOS image arrives. Although they claimed that it could be extended to estimate 3-DoF rotation, they only demonstrate a 1-DoF rotation implementation.
Another approach performs localization based on a particle filter with DVS data [39]. Similar to the above approach, they compare the received events with the expected events. However, rather than localizing with respect to a camera image, they localize with respect to a predefined global map. Although this approach is promising, it was only implemented in 2D (3-DoF). In later work [40], they expanded their method to perform simultaneous localization and mapping (SLAM). Hence, the 2D map is built and expanded during operation. They demonstrated the performance of their approach on a slow ground robot.
1.3 Contribution

In this work, we first introduce a novel technique for focusing and intrinsic camera calibration of the DVS (Chapter 2). For this procedure, we display a blinking pattern on a computer screen.
Chapter 2
DVS Calibration

2.1 Outline
In this section, the calibration process is explained and the results are presented. In the first section, the methods for focusing and for the intrinsic camera calibration are explained. First, we show how the screen can be used to display patterns so that they are visible to the DVS. Then, the procedure for focusing the DVS is explained, followed by the procedure for the intrinsic camera calibration. We explain the parameters obtained from the calibration and briefly explain the distortion model used.
In Section 2.3, we demonstrate the performance of our focusing and intrinsic
camera calibration method. First, we compare the output of a DVS before and
after focusing. Next, we show a distorted and undistorted image. We then
discuss reasons for the remaining distortion.
2.2 Approach
While the DVS differs from conventional cameras in many ways, the optics are the same. However, since the DVS can only detect intensity changes in the observed image, conventional camera calibration tools, where a pattern (typically a checkerboard) is held in front of the camera, cannot be used out of the box. For the DVS to detect a calibration pattern, the pattern has to be moving or its brightness has to change. In our approach, we let the pattern blink to render it visible to the DVS. A convenient way to produce a blinking pattern is to use a computer screen. The same effect can be used to focus the camera.
2.2.1 Displaying a Pattern
Observing an LED-backlit LCD screen with a stationary DVS under different brightness settings reveals the mechanism used to dim this type of screen.
When the screen is set to full brightness and a static image is displayed, no events are emitted. However, when the screen is dimmed, bright areas on the screen emit events at a high rate (≈ 170 events/s per pixel) while dark areas do not emit any. This effect is caused by the pulse-width modulation of the screen's backlighting used for dimming [41]. The flickering caused by the modulation is not visible to the human eye since the frequency is typically in the range of 100 Hz to 200 Hz, but it varies strongly from model to model [42].
The high event rate of bright areas allows the pattern to be found with a short integration time and low sensitivity (i.e., only strong illumination changes emit events). Hence, the percentage of events that are generated by noise is decreased. These noise events include spontaneously generated events (e.g., shot noise, dark current) and events generated by unintentional movements of the DVS relative to edges (e.g., the border of the screen). In addition, motion blur due to unintentional movement of the DVS during the integration time is limited.
2.2.2 Focusing
As for conventional cameras, the distance between the image sensor and the lens
has to be adjusted to get a sharp image on the sensor. To adjust the focus, the
user has to change the sensor-lens distance by manually screwing the lens closer to or farther from the sensor. Our blinking screen technique provides a pattern
visible in the DVS data without the need to move the camera. After placing the
DVS in front of the screen, the focusing pattern is shown on the dimmed screen.
We chose a set of concentric unfilled squares, alternately black and white as
shown in Figure 2.1a. The squares are logarithmically scaled to provide squares
with suitable thickness for different distances between the DVS and the screen.
Integrated images are synthesized by setting pixels to white if they registered
more events than a threshold value and to black otherwise. Hence, white screen
regions appear white in the integrated images, while black regions appear black.
A preview window showing the integrated output of the DVS allows the user
to observe the sharpness of the image. The user can then adjust the focus until
the preview appears as sharp as possible. When out of focus, the white lines are blurred and hence the pattern is not recognizable towards the center of the image, where the distance between the white lines is narrow. The better the sensor-lens distance is adjusted, the more of the pattern is visible, as shown
in Figure 2.2.
2.2.3 Intrinsic Camera Calibration
For the intrinsic calibration of the DVS, the pattern consists of white circles on a black background, as shown in Figure 2.1b. There is a tradeoff between the size and the number of circles. The larger and farther apart the circles are, the easier it is to detect them in the integrated image. However, more circles result in more points for the calibration, hence improving the estimated parameters. In our implementation, we chose a grid of 7 × 7 circles.
As for conventional camera calibration tools [25], the user should take several pictures of the pattern from different viewpoints. By increasing the number of images and choosing viewpoints as different as possible, the user can greatly increase the quality of the estimation of the intrinsic parameters. Especially tilting the camera with respect to the screen considerably increases the accuracy of the estimation [43].

Figure 2.1: Pattern shown on the screen for (a) focusing and (b) intrinsic camera calibration.
For the detection of the calibration pattern, the events are integrated over 75 ms.
As for focusing, pixels with an event count higher than a certain threshold are
white in the synthesized image whereas the other pixels are black. After this
thresholding, a morphological closing filter [44] is applied to fill holes or dents
in the white regions. This filter first dilates the white regions and then shrinks
them again, resulting in more convex regions. The circles are then detected
using the OpenCV [45] routine findCirclesGrid [46]. This function provides the centers of circles that are arranged in a projection of a grid.
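The thresholding and closing steps described above can be sketched in a few lines. The thesis uses OpenCV's morphology routines and findCirclesGrid for the actual detection; the following is a minimal numpy-only sketch with illustrative helper names:

```python
import numpy as np

def dilate(img):
    """3x3 binary dilation: a pixel becomes 1 if any 3x3 neighbor is 1."""
    p = np.pad(img, 1)
    out = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= p[1 + dy:1 + dy + img.shape[0], 1 + dx:1 + dx + img.shape[1]]
    return out

def erode(img):
    """3x3 binary erosion: a pixel stays 1 only if all 3x3 neighbors are 1."""
    p = np.pad(img, 1)
    out = np.ones_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= p[1 + dy:1 + dy + img.shape[0], 1 + dx:1 + dx + img.shape[1]]
    return out

def close_binary(img):
    """Morphological closing: dilation followed by erosion fills small
    holes and dents in the white regions, as described in the text."""
    return erode(dilate(img))
```

Applied to the thresholded event-count image, the closing fills single-pixel holes inside the white circle regions without growing their outer boundary, which makes the subsequent circle-center detection more reliable.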
The calibration is performed using the OpenCV routine calibrateCamera [47], based on Bouguet's Camera Calibration Toolbox [25]. The calibration routine provides the DVS focal length $(f_x, f_y)$, principal point $(c_x, c_y)$, and the radial and tangential distortion coefficients $(k_1, \dots, k_5)$. For the focal length and the principal point, the pinhole camera model is used. In this model, the image coordinates $X_D \in \mathbb{R}^3$ are described as
\[
\lambda X_D = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} X_{cam}, \tag{2.1}
\]
where $X_{cam} \in \mathbb{R}^3$ are the camera coordinates and $\lambda$ is a scaling factor, chosen so that the third component of $X_D$ equals 1.
Brown's Plumb Bob model [48] is used to approximate the distortion caused by imperfect centering of the lens and imperfections of the lens. The radial distortion is approximated with a sixth-order model. First, the normalized image coordinates $X_n = (x_n, y_n)^T$ are obtained by projecting the camera coordinates onto the image plane,
\[
X_n = \begin{bmatrix} x_{cam}/z_{cam} \\ y_{cam}/z_{cam} \end{bmatrix}. \tag{2.2}
\]
The distorted image coordinates $X_d$ are then given by
\[
X_d = (1 + k_1 r^2 + k_2 r^4 + k_5 r^6)\, X_n + D_t, \tag{2.3}
\]
where $r$ is the distance from the optical axis ($r^2 = x_n^2 + y_n^2$) and $D_t$ is the tangential distortion vector,
\[
D_t = \begin{bmatrix} 2 k_3 x_n y_n + k_4 (r^2 + 2 x_n^2) \\ 2 k_4 x_n y_n + k_3 (r^2 + 2 y_n^2) \end{bmatrix}. \tag{2.4}
\]
The distorted image points can be undistorted by iteratively solving (2.3), using the previous estimate of $X_n$ to calculate $r$ and $D_t$ as follows:
\[
X_n = \frac{X_d - D_t}{1 + k_1 r^2 + k_2 r^4 + k_5 r^6}. \tag{2.5}
\]
The distorted coordinates are used as an initial guess for the undistorted coordinates ($X_n = X_d$).
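The fixed-point iteration of equation (2.5) is straightforward to implement. A sketch in plain Python, assuming the coefficient ordering $(k_1, k_2, k_3, k_4, k_5)$ used above, with $k_3, k_4$ tangential and $k_5$ the sixth-order radial term (the function names are illustrative):

```python
def distort(xn, yn, k):
    """Apply the Plumb Bob model (2.3): radial plus tangential distortion.
    k = (k1, k2, k3, k4, k5)."""
    k1, k2, k3, k4, k5 = k
    r2 = xn**2 + yn**2
    radial = 1 + k1 * r2 + k2 * r2**2 + k5 * r2**3
    dtx = 2 * k3 * xn * yn + k4 * (r2 + 2 * xn**2)
    dty = 2 * k4 * xn * yn + k3 * (r2 + 2 * yn**2)
    return radial * xn + dtx, radial * yn + dty

def undistort(xd, yd, k, iterations=10):
    """Invert (2.3) by the fixed-point iteration (2.5), starting from
    the distorted coordinates (Xn = Xd)."""
    k1, k2, k3, k4, k5 = k
    xn, yn = xd, yd
    for _ in range(iterations):
        r2 = xn**2 + yn**2
        radial = 1 + k1 * r2 + k2 * r2**2 + k5 * r2**3
        dtx = 2 * k3 * xn * yn + k4 * (r2 + 2 * xn**2)
        dty = 2 * k4 * xn * yn + k3 * (r2 + 2 * yn**2)
        xn, yn = (xd - dtx) / radial, (yd - dty) / radial
    return xn, yn
```

For realistic coefficient magnitudes the iteration contracts quickly; a handful of iterations recovers the undistorted point to well below pixel precision.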
2.3 Results

2.3.1 Focusing
Figure 2.2 shows the integrated images taken from the screen showing the focusing pattern. On the left, it can be seen that the image is blurred due to
bad focusing of the DVS. On the right, the image is shown after adjusting the
camera-lens distance. The right image is clearly sharper.
Figure 2.2: Integrated images taken during the focusing process: (a) out of focus; (b) sharp image after finding the optimal lens-camera distance.

2.3.2 Intrinsic Camera Calibration
The result of the calibration for the first lens can be seen in Figure 2.3. On the left, the output without correction of the distortion is shown. It can be observed that the lines are curved due to the radial distortion. The further from the image center, the smaller the radius of curvature becomes. On the right, the pixel locations are corrected using the above-mentioned method. The lines appear overall straight. However, it can be observed that the lines show piecewise curvature in the opposite direction (left border of the image). This effect is not due to an error in the estimation of the intrinsic parameters, but rather due to the spatially discrete nature of the sensor. Thus, it occurs for all pixel-based cameras but is more important for low-resolution sensors. Despite being distorted, pieces of a line appear straight in the uncorrected image due to the limited resolution of the DVS. These straight segments are then bent to correct the distortion, resulting in overbent segments. Figure 2.5 illustrates this problem.
Figure 2.4 shows the output for the second lens (3.5 mm). Although the lines are slightly less curved in the corrected image, they still appear distorted. This is due to badly estimated intrinsic camera parameters.
The proposed calibration method is mainly limited by two factors. First, LCD displays emit light towards the front and only little light is emitted sideways. Hence, not enough light reaches the imaging sensor when the DVS is tilted too much. Therefore, images from fewer different viewpoints can be taken, which decreases the quality of the intrinsic parameter estimation. The second reason lies in the limited resolution of the DVS (128 × 128 pixels). The low resolution introduces an error in the estimation of the centers of the circles in the integrated image.
The rms of the reprojection error over all points used for the calibration is 0.28 pixel for the first lens (2.8 mm S-mount) and 0.30 pixel for the second lens (3.5 mm C-mount). It is calculated as
\[
e_{rms} = \sqrt{\frac{1}{NM} \sum_{j=1}^{N} \sum_{i=1}^{M} \lVert d_{ij} - \hat{d}_{ij} \rVert^2},
\]
where $N = 50$ is the number of images, $M = 49$ is the number of circles, and $d_{ij}$ is the expected and $\hat{d}_{ij}$ the found position of circle $i$ in integrated image $j$. The fact that the reprojection error for both lenses is in the same range suggests that the poor estimation of the intrinsic camera parameters with the second lens is due to the low variation of the viewing angles.

Figure 2.3: Integrated image of a line grid using a 2.8 mm S-mount lens (a) without and (b) with correction of the distortion.

Figure 2.4: Integrated image of a line grid using a 3.5 mm C-mount lens (a) without and (b) with correction of the distortion.
Figure 2.5: Pixel-level schematics showing the influence of the low resolution on the rectification: (a) distorted, (b) undistorted. (a) shows the distorted image of two straight lines. Due to the low resolution, parts of the curved lines appear straight. (b) shows the rectified image. Convexly curved lines appear straight again. However, lines that are straight in the undistorted image are curved concavely (dashed curves). Therefore, the straight segments from the distorted image become curved segments in the rectified image. This effect occurs not only for the DVS, but for all pixel-based cameras.
Chapter 3
DVS Simulation

3.1 Outline
Figure 3.1: Setup for the simulation: (a) initialization, (b) simulation. (a) The DVS is placed in front of the screen. The calibration pattern is shown to estimate the misalignment of the DVS with respect to the screen. (b) The simulation is shown on the screen. Due to the applied misalignment correction, the DVS sees the scene under the intended perspective.
3.2 Approach
For this method, a virtual environment is set up using OpenGL. A virtual camera is placed in this environment. The output of this virtual camera is rendered live and shown on the computer screen. The DVS is placed in front of the computer screen as shown in Figure 3.1. Since the DVS cannot be perfectly aligned with the screen, the output of the virtual camera is transformed by a homography. In this way, the DVS sees the scene under the same perspective as the virtual camera. This setup is shown in Figure 3.2. This allows simulating the environment while retaining the actual DVS behavior, including noise and latency. For testing our pose estimation algorithm, the virtual camera performs a predefined trajectory.
Projection Pipeline. To provide a better understanding of the implementation of the simulation, we explain the OpenGL rendering pipeline using OpenGL terminology. Throughout the whole rendering pipeline, 4D homogeneous coordinates are used. The virtual camera is implemented using the pinhole camera model. In a first step, the world coordinates $X_W \in \mathbb{R}^4$ are transformed to virtual camera coordinates $X_{cam} \in \mathbb{R}^4$ (sometimes referred to as eye coordinates). This transformation is described by the multiplication with the $4 \times 4$ view matrix $M_O$ as
\[
X_{cam} = M_O X_W. \tag{3.1}
\]
The view matrix is defined as
\[
M_O = \begin{bmatrix} R_O & T_O \\ 0 & 1 \end{bmatrix}, \tag{3.2}
\]
where $R_O$ is the $3 \times 3$ rotation matrix from the world to the virtual camera frame and $T_O$ is the virtual camera translation vector. The virtual camera coordinates are then projected to OpenGL clipping coordinates $X_{clip} \in \mathbb{R}^4$. Clipping coordinates define the virtual image before the division by the depth. This transformation is performed by the multiplication with the $4 \times 4$ projection matrix $K_O$ and can be described as
\[
X_{clip} = K_O X_{cam}. \tag{3.3}
\]
In our implementation, we choose
\[
K_O = \begin{bmatrix} f_{Ox} & 0 & 0 & 0 \\ 0 & f_{Oy} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \tag{3.4}
\]
where $f_{Ox}$ and $f_{Oy}$ are the focal lengths of the virtual camera in the x and y directions, respectively. The clipping coordinates can then be written as
\[
X_{clip} = \begin{bmatrix} x_{clip} \\ y_{clip} \\ z_{clip} \\ w_{clip} \end{bmatrix} = \begin{bmatrix} f_{Ox} x_{cam} \\ f_{Oy} y_{cam} \\ 0 \\ z_{cam} \end{bmatrix}. \tag{3.5}
\]
The coordinate $z_{clip}$ is stored in the depth buffer. The depth buffer is used by OpenGL to determine whether a point is visible or occluded by another point. To prevent very far objects from being rendered, only points with $z_{clip}$ between $-1$ and $1$ are visible. For the simulation used in this work, we choose a 2D scene, and hence no occlusion can occur. We therefore choose $z_{clip} = 0$ to make all points visible, independent of their distance to the camera. For 3D scenes, this has to be adapted to allow OpenGL to handle occlusions. The coordinate $w_{clip}$ is used to scale the coordinates by the depth of the point, independently of the value in the depth buffer. The transformation from world to clipping coordinates is performed on the Graphics Processing Unit (GPU). The programmer can describe this behavior with a program loaded onto the GPU called a vertex shader.
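The chain of transformations (3.1)-(3.5), followed by the perspective division of the next paragraph, can be checked numerically. A small numpy sketch (the function names are ours, not OpenGL API calls):

```python
import numpy as np

def view_matrix(R_O, T_O):
    """4x4 view matrix M_O of eq. (3.2)."""
    M = np.eye(4)
    M[:3, :3] = R_O
    M[:3, 3] = T_O
    return M

def projection_matrix(f_Ox, f_Oy):
    """4x4 projection matrix K_O of eq. (3.4); z_clip is forced to 0."""
    K = np.zeros((4, 4))
    K[0, 0] = f_Ox
    K[1, 1] = f_Oy
    K[3, 2] = 1.0  # w_clip = z_cam, used for the perspective division
    return K

def to_viewport(X_W, R_O, T_O, f_Ox, f_Oy):
    """World -> clipping -> viewport coordinates, eqs. (3.1)-(3.6)."""
    X_cam = view_matrix(R_O, T_O) @ X_W              # eq. (3.1)
    X_clip = projection_matrix(f_Ox, f_Oy) @ X_cam   # eq. (3.3)
    return X_clip[:3] / X_clip[3]                    # division by w_clip
```

With the identity pose and unit focal lengths, a world point at depth $z_{cam} = 4$ is simply scaled by $1/4$ in x and y, with the third viewport component fixed to 0 as chosen above.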
The clipping coordinates are then transformed onto the virtual image plane, called the viewport, defined as
\[
X_{vp} = \begin{bmatrix} x_{clip}/w_{clip} \\ y_{clip}/w_{clip} \\ z_{clip}/w_{clip} \end{bmatrix} = \begin{bmatrix} f_{Ox} x_{cam}/z_{cam} \\ f_{Oy} y_{cam}/z_{cam} \\ 0 \end{bmatrix}. \tag{3.6}
\]
The viewport coordinates range from $-1$ to $1$ in the x and y dimensions. The window that the image is displayed in is, however, not necessarily quadratic. To compensate for this distortion, the inverse of the window's aspect ratio is incorporated into $f_{Ox}$. To achieve a realistic simulation, the fields of view of the virtual camera and the DVS should coincide. A point that is on the border of the DVS image should also be on the border of the viewport. Therefore, we choose
\[
f_{Ox} = f_{Dx}\, \frac{\max(x_{vp}) - \min(x_{vp})}{\max(x_D) - \min(x_D)} = \frac{1}{63.5} f_{Dx}, \tag{3.7}
\]
\[
f_{Oy} = -f_{Dy}\, \frac{\max(y_{vp}) - \min(y_{vp})}{\max(y_D) - \min(y_D)} = -\frac{1}{63.5} f_{Dy}, \tag{3.8}
\]
where $f_{Dx}$ and $f_{Dy}$ are the focal lengths of the DVS, $\max(x_{vp}) = \max(y_{vp}) = 1$ and $\min(x_{vp}) = \min(y_{vp}) = -1$ are the maximal and minimal viewport coordinates, and $\max(x_D) = \max(y_D) = 127$ and $\min(x_D) = \min(y_D) = 0$ are the maximal and minimal DVS image coordinates. Note that this formula assumes the optical center $(c_{Dx}, c_{Dy})$ of the DVS to be in the center of the image. Although this is generally not the case, it provides an acceptable approximation to ensure a common field of view for both images. In the misalignment correction transformation, we compensate for this inaccuracy. The negative sign in $f_{Oy}$ comes from different coordinate system conventions: while the y coordinate points downwards in the DVS image, it points upwards in the OpenGL viewport. The correction of the sign ensures that the image on the screen is not shown upside down. The transformation from clipping coordinates $X_{clip}$ to viewport coordinates $X_{vp}$ is performed automatically by OpenGL on the GPU. Hence, the programmer cannot modify this transformation.
The viewport coordinates $X_{vp}$ are then scaled to screen coordinates $X_s \in \mathbb{R}^2$, depending on the window size. These coordinates are unknown to the programmer. Note that here, we define the screen coordinate system as a two-dimensional coordinate system centered in the middle of the application window. The y axis points upwards and the x axis to the right. We use physical units (meters) rather than pixels. These coordinates can then be described as
\[
X_s = \begin{bmatrix} w_s/2 & 0 & 0 \\ 0 & h_s/2 & 0 \end{bmatrix} X_{vp} = \begin{bmatrix} w_s f_{Ox} x_{cam}/(2 z_{cam}) \\ h_s f_{Oy} y_{cam}/(2 z_{cam}) \end{bmatrix}, \tag{3.9}
\]
where $w_s$ and $h_s$ are the width and the height of the application window in meters.
From the screen, the points are then projected onto the DVS image plane. As for the virtual camera, a pinhole model is used to describe the DVS. The projection is described as
\[
\lambda_0 X_D = K_D \begin{bmatrix} R_D & T_D \end{bmatrix} \begin{bmatrix} X_s \\ 0 \\ 1 \end{bmatrix}, \tag{3.10}
\]
where $\lambda_0$ is a scaling factor, $K_D$ is the DVS calibration matrix, and $R_D$ and $T_D$ describe the pose of the DVS with respect to the screen. Combining these transformations, the ideal viewport coordinates are found as
\[
\lambda_3 \begin{bmatrix} x'_{vp} \\ y'_{vp} \\ 1 \end{bmatrix} = H_D X_D, \tag{3.13}
\]
where $H_D$ denotes the combined transformation from DVS image coordinates back to viewport coordinates and $\lambda_3$ is again a scaling factor. These coordinates represent the ideal position of the circles on the OpenGL viewport, so that both cameras observe the scene under the same perspective.
Knowing the current and the desired viewport coordinates, the homography between these two can be found using the homography equation
\[
\lambda_4 \begin{bmatrix} x'_{vp} \\ y'_{vp} \\ 1 \end{bmatrix} = H \begin{bmatrix} x_{vp} \\ y_{vp} \\ 1 \end{bmatrix}, \tag{3.14}
\]
where 4 is a scaling factor and H is the homography matrix. This homography
is the transformation allowing to correct the virtual cameras output. For the
correction, this homography is applied to the clipping coordinates, since the
programmer cannot modify the viewport coordinates. The corrected clipping
coordinates can be found as
X'_{clip} = \tilde{H} P_O \begin{pmatrix} R_O & T_O \\ 0 & 1 \end{pmatrix} X_W,   (3.15)

with

\tilde{H} = \begin{pmatrix} h_{11} & h_{12} & 0 & h_{13} \\ h_{21} & h_{22} & 0 & h_{23} \\ 0 & 0 & 1 & 0 \\ h_{31} & h_{32} & 0 & h_{33} \end{pmatrix},   (3.16)

where P_O denotes the projection matrix of the virtual camera and the entries h_{ij} of H are embedded into the 4 × 4 matrix \tilde{H} such that the z (depth) coordinate is passed through unchanged.
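The embedding of the 3 × 3 homography H into the 4 × 4 clipping-space matrix H̃ described above can be sketched in a few lines of numpy (the helper name is our own; this is an illustration, not the thesis code):

```python
import numpy as np

def embed_homography(H):
    """Embed a 3x3 homography H into a 4x4 matrix acting on OpenGL
    clipping coordinates (x, y, z, w); the z row is left untouched
    so that depth is passed through unchanged."""
    H_tilde = np.zeros((4, 4))
    cols = [0, 1, 3]                      # skip the z column
    for r_out, r_in in zip([0, 1, 3], [0, 1, 2]):
        H_tilde[r_out, cols] = H[r_in]    # x, y, and w rows from H
    H_tilde[2, 2] = 1.0                   # z row: (0, 0, 1, 0)
    return H_tilde
```

Applying the resulting matrix to a clip-space point transforms x, y, and w by H while leaving z unchanged.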
3.3 Simulation Procedure
In this section, we show the procedure to record simulated data. The user can
enter the desired camera trajectory directly in the source code. By default, the
start and end pose of the camera are entered. The trajectory is then calculated
as a linear interpolation of the translation vector and the orientation angles
yaw, pitch, and roll. This part can be easily modified in the code to get any
parametrizable trajectory. In our experiment, we display a black square on a
white background. The user can modify the scene to be displayed and the way
it is rendered.
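The default linear interpolation between a start and an end pose might look as follows (a hypothetical Python/numpy stand-in for the OpenGL application code; the pose representation and names are our own):

```python
import numpy as np

def interpolate_pose(start, end, t, T):
    """Linearly interpolate the translation vector (x, y, z) and the
    orientation angles (yaw, pitch, roll) between a start and an end
    pose. t is the current time, T the total trajectory duration."""
    s = t / T  # interpolation parameter in [0, 1]
    trans = (1 - s) * np.asarray(start['translation']) + s * np.asarray(end['translation'])
    ypr = (1 - s) * np.asarray(start['ypr']) + s * np.asarray(end['ypr'])
    return trans, ypr
```

Replacing this helper with any other parametrizable function of t yields an arbitrary trajectory, as noted above.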
Once both the trajectory and the scene are set, the camera is placed in front of
the screen. It should be placed as close to the screen as possible while still
keeping the whole height of the screen in view. If the camera is placed too far from the
screen, the correction above might result in an image that is too large for the
screen. A preview window showing the integrated DVS output assists the user
in this task. The brightness must be set to the maximum so that the screen is
not flickering.
The user can then start the initialization. First, the initialization pattern is
shown to determine the homography for the alignment correction. Then, the
animation is shown and recorded. Figure 3.1 shows the setup for the initialization phase on the left side and the simulation phase on the right side.
3.4 Results

3.4.1 Correction of DVS-Screen Misalignment
To investigate the quality of our correction method for the DVS-screen misalignment, we propose the following setup: The DVS is placed in front of the
screen after being intrinsically calibrated (see Chapter 2). For this experiment,
we use a 3.5 mm S-mount lens, which introduces only minimal distortion. The
calibration pattern is displayed on the dimmed screen to find the homography
between the desired and the current viewport coordinates. Then, a grid of
thin white lines is displayed on the screen without applying the correction. An
image is taken by integrating the DVS output for 100 ms. This image is shown
in Figure 3.3a. In the next step, the correction is applied to the projection
matrix and the grid is displayed again. Another DVS image is taken, shown
in Figure 3.3b. Both images are thresholded to remove noise.
(a) uncorrected
(b) corrected
Figure 3.3: Integrated image of a line grid using a 3.5 mm S-mount lens (a) without and (b)
with misalignment correction; certain image regions did not receive enough light
due to the tilt of the DVS. Both images are undistorted and thresholded to remove
noise. In (b) the expected position of the grid is shown in red.
The position of the events is compared to the expected position of the lines.
The mean distance between the event positions and the expected lines is 0.28
pixels. Note that this error is not only due to inaccuracy of the correction. It
has to be considered that the lines are thicker than one pixel at some places in
the image. Furthermore, the low resolution of the DVS introduces discretization
error and the correction for the lens distortion is not perfect. The correction
itself is limited by the accuracy of the estimation of the circle centers from the
image. This estimation is influenced by the quality of the correction of the
lens distortion and suffers again from the low resolution. The error caused by
the pixel nature of the screen can be neglected when comparing the resolution
of a computer screen (typically in the order of 1600 × 900 pixel) to that of the DVS
(128 × 128 pixel).
3.4.2 Screen Refreshing Effects
The simulation is not only limited by the misalignment between the DVS and
the screen, but also by the screen refreshing. While the DVS does not suffer
from motion blur thanks to its asynchronous circuit, the screen starts to display
blurred images when motion is faster than one screen pixel per screen refresh.
The screen refresh period of our setup was measured to be 16 ms. This value is
determined by measuring the time between two subsequent drawing commands
sent by OpenGL. Furthermore, the individual pixels are updated row by row
from the top to the bottom. This could introduce an unnatural chronological
order of the emitted events. To investigate this phenomenon, we display a
black square moving horizontally over the screen. The square is invisible at the
beginning of the animation. It then slides into view from the left side until the
screen is fully covered. The sensitivity of the DVS is set low to minimize the
noise in the measurement.
Figure 3.4: Plot of the timestamp of the first event of each pixel in milliseconds; blue
pixels fired first, red pixels last. White pixels indicate pixels that have not emitted any
events. A black square was slid across the screen. The four vertical bands correspond
to screen refreshes.
Figure 3.4 visualizes the timestamps of the events registered during this experiment. Four vertical bands can clearly be seen. Each band represents a screen
refresh. The time between these bands is consistent with the measured screen
refresh period. It is therefore important for simulations to avoid too high apparent
motion. The speed of the simulation is a trade-off between the screen's motion
blur and the noise level, since the DVS output is more affected by noise during slow
simulations.

Figure 3.5: The experiment shown in Figure 3.4 is repeated, but 5 times slower. The horizontal
gradient is smooth over the whole image, indicating that the simulation does not suffer
from motion blur of the screen at that speed.
When investigating the bands more closely, a vertical gradient from top to bottom can be observed, especially in the rightmost band. When turning the DVS
by 180° around the optical axis, the gradients are upside down. This suggests
that the gradients arise from the row-wise screen refresh rather than from DVS-related
issues. Figure 3.5 shows the result when performing the experiment 5
times slower. A smooth horizontal gradient over the whole screen can be seen.
The absence of the bands observed before indicates that the simulation does not
suffer from motion blur of the screen at that speed.
In a second experiment, the whole screen was turned from white to black in
between two refreshes. The timestamps of the first event of each pixel are shown
in Figure 3.6. Against our first intuition, the timestamps of the events do not
show a continuous vertical gradient over the whole image plane. It appears
that several gradients overlap, and different horizontal bands of these
gradients are visible. This could either be due to the refreshing of the screen
or to the readout of the pixels in the DVS. To disambiguate the origin of this
phenomenon, we turned the DVS by 90° around the optical axis and repeated the
experiment. If it is due to the screen, the bands should be turned in the new
plot as well, while they should stay the same if it is related to the DVS. The
result is shown in Figure 3.7. Again, horizontal bands of vertical gradients can
be observed. This indicates that this phenomenon is due to the DVS rather
than to the screen refreshing. OFF events (i.e., decrease of illumination) are
read out row-wise [17]. Since the vertical gradients within the bands do not
change orientation either, they are assumed to originate in the readout as well,
rather than in the screen refreshing. Hence, the row-wise screen refresh is visible
if only a small part of the screen changes. In that case, the DVS readout is fast
enough to be influenced by this effect. If, however, the whole screen changes its
brightness, the lag introduced by the readout method dominates the timing of
the events.
Figure 3.6: Timestamps of the events in milliseconds; several overlapping vertical gradients can be seen.
Figure 3.7: Timestamps of the events in milliseconds; the DVS is turned 90° with
respect to the plot in Figure 3.6. The gradients are still vertical.
Chapter 4
Pose Estimation
4.1 Outline
In this chapter we present the main contribution of this thesis, the tracking and
pose estimation algorithm. In the first section, we explain the algorithm. We
start with the initialization phase, where we show how we find the square whose
sides are used as landmarks. To this end, we first explain how the line segments
are extracted from the event stream. Then, we explain how the polygon is
found among the extracted line segments and how the initial pose is estimated.
Subsequently, we explain the tracking and the pose estimation.
In the next section, we present the conducted experiments; the setups
and the motivation for the experiments are described. We first do this for the
experiments involving our simulation method and then for those which involve
real data collected with the DVS on a quadrotor.
In the last section, the results of the experiments are shown and discussed.
Again, we first treat the simulation, followed by the real data.
4.2 Approach
The pose of the camera is estimated by extracting the position of line segments
from the DVS data and comparing them to their known position in world coordinates. In the initialization phase, the stream of events is searched for straight
lines. We then search these straight lines for line segments. In our implementation, we use the sides of a square as landmarks. Being a closed shape,
the endpoints of the segments can be estimated as the intersection with the
neighboring segment. Assuming only limited tilting of the camera, the lines are
easily discriminable since the angles are not very acute or obtuse and the lines
are therefore not interfering with each other.
The tracking algorithm then tracks the edges of the square through time. The
pose of the DVS is estimated after the arrival of each event, resulting in high-frequency event-based updates.
4.2.1 Initialization
Finding Line Segments To find line segments in the stream of events, the
arriving events are integrated to form an image: each arriving event is
added to a buffer containing the event locations.
To detect lines in this image, the Hough transform is used [49]. It transforms
an image point P to a curve in the Hough space described by the following
equation:

r_P(\theta) = P_x \cos\theta + P_y \sin\theta.   (4.1)

Each pair (r_P(\theta), \theta) fulfilling the above equation represents a line passing through
the point P [50]. The radius r_P(\theta) is the smallest distance from the line to the
origin and \theta is the angle between the normal to the line through the origin and
the x axis, as shown in Figure 4.1a. The curves of a set of collinear points intersect in a single point in the Hough space, representing the line passing through
all points, as shown in Figure 4.1b.
For the implementation, the space of all possible lines is discretized. A bin for
each possible line is stored. Upon the arrival of an event, the value of each bin
which fulfils (4.1) for the location of the event Pi is incremented. Hence bins
with a high vote count represent lines passing through many received events.
Figure 4.1c shows the values of the bins for the example in Figure 4.1a. The resolution
of the Hough space (i.e., the number of bins) is an important parameter.
Too few bins result in a poor estimate of the orientation and position of the line.
If the number of bins is too high, clear maxima are not found since the points
belonging to a segment generally do not lie exactly on a straight line. Furthermore,
the computational cost increases with the number of bins. We chose an angular resolution of 7.5° and a radial resolution of 2.5 pixel, resulting in 24 × 73 bins.
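The event-driven Hough voting described above can be sketched as follows (a numpy illustration using the bin resolutions chosen in the text; the function and constants are our own, not the thesis code):

```python
import numpy as np

ANG_RES = 7.5                      # angular resolution in degrees
RAD_RES = 2.5                      # radial resolution in pixels
N_THETA = int(180 / ANG_RES)       # 24 theta bins
N_R = 73                           # radial bins
R_OFFSET = (N_R // 2) * RAD_RES    # shift so that r may be negative

def vote(bins, x, y):
    """Increment every Hough bin (theta, r) whose line passes through
    the event location (x, y), following r = x cos(theta) + y sin(theta)."""
    for i in range(N_THETA):
        theta = np.deg2rad(i * ANG_RES)
        r = x * np.cos(theta) + y * np.sin(theta)
        j = int(round((r + R_OFFSET) / RAD_RES))
        if 0 <= j < N_R:
            bins[i, j] += 1

bins = np.zeros((N_THETA, N_R), dtype=int)
# Events on the horizontal line y = 20 all vote for theta = 90 deg, r = 20.
for x in range(0, 60, 3):
    vote(bins, x, 20)
```

After the loop, the bin for theta = 90° and r = 20 pixel holds one vote per event, while all other bins stay well below the line-candidate threshold.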
After receiving a certain number of events, the bins are searched for potential
lines. Usually, the Hough bins are searched for local maxima. However, if only
local maxima are considered, line segments could be omitted if there is a line
with similar parameters that has more votes. This problem is shown in Figure 4.2.
Therefore, the space of all bins is thresholded: bins containing more than a
certain number of events represent line candidates (we chose a threshold of
25 events). In a next step, each line candidate is searched for segments with a
high event density (i.e., clusters of events along the line). To do so, all events
that are close enough to a line are attributed to the line (we chose a maximal
distance of 2 pixel). An event can be attributed to several lines. The events
attributed to a line are then sorted according to their distance (parallel to the
line) to the intersection of the line with its normal passing through the origin, s, as
shown in Figure 4.3.

Figure 4.1: Hough transform of a point P: (a) parametrization of a line by its radius r and angle θ in the image plane (x, y in pixel); (b) Hough curves of a set of collinear points; (c) the corresponding Hough bin values (θ in degrees, r in pixel).

A segment is defined as the part of a line candidate where
neighbouring event locations are not separated further than a certain distance
(we chose 15 pixel). A segment furthermore has to have a minimal length and a
minimal number of events lying on it (we chose 20 pixel for the minimal length
and 25 events for the minimal event count). The set of found segments is then
searched for a closed shape as explained in the following paragraph. If no closed
shape can be detected, the accumulation of events continues and the Hough
space is again searched for segments after a certain number of events.
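Splitting the events attributed to a line candidate into segments can be sketched as follows (our own helper; the thresholds follow the values chosen in the text):

```python
def find_segments(s_values, max_gap=15.0, min_length=20.0, min_events=25):
    """Split event positions along a line (parameter s, in pixels) into
    segments wherever neighbouring events are separated by more than
    max_gap; keep only segments that are long and dense enough."""
    s_sorted = sorted(s_values)
    segments, current = [], [s_sorted[0]]
    for s in s_sorted[1:]:
        if s - current[-1] <= max_gap:
            current.append(s)        # still within the current cluster
        else:
            segments.append(current) # gap too large: start a new cluster
            current = [s]
    segments.append(current)
    # Keep only clusters satisfying the length and event-count criteria.
    return [(seg[0], seg[-1]) for seg in segments
            if seg[-1] - seg[0] >= min_length and len(seg) >= min_events]
```

Each returned pair gives the s coordinates of a segment's endpoints along the line.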
(b) Hough bins
Figure 4.2: Problem of the local-maxima approach for the segment search; although only the
blue line contains a segment, the green line would be chosen as line candidate since
it has more votes. Hence, we consider all lines with a certain bin count as candidates.
Figure 4.3: Parametrization of the event location on a line; the parameter s is used
to describe the location of an event on a line. It is used to sort the events in order to
find line segments.
Finding the Square To detect the square, the image is searched for 4 line
segments forming a quadrangle. In a first step, the line segments are sorted
according to their length. The search for the square starts with the longest
segment. This gives priority to longer segments; hence, in case of ambiguity,
larger quads are found first and selected. One end is selected arbitrarily.
A list of all segments which have one endpoint close to this point is generated
(we count a point as close if the distance is smaller than one third of the length
of the current segment). Only segments that are oriented counter-clockwise
and have an angle larger than 45° and smaller than 135° are considered. This
condition implies that the camera is not tilted too much with respect to the
square during the initialization phase. It increases the robustness of the search
since it avoids many false detections. Then further elements connected counter-clockwise are looked for recursively. This results in a recursive chain of elements
possibly forming a square. Once four elements are found, it is checked whether
the far end point of the current segment is close to the far end point of the first
segment. If so, the search is stopped. If not, the first element is removed
from the chain and the search is continued. If no segment can be added to the
current segment and it does not form a quadrangle, a dead end is hit. The
search continues from the previous segment.
To guarantee that a possible quadrangle can be found, a segment is checked
twice: even if a segment has no close segment in counter-clockwise direction at one end, it is possible that the other end has an adjacent counter-clockwise neighbor, as shown in Figure 4.4. To avoid unnecessary computation,
a list of possible segment endpoints is maintained. Once a segment is added,
the endpoint close to the previous segment is removed from the list of possible
endpoints. This guarantees that segments are checked at most twice.
To avoid as many false positive detections as possible, a minimal side length
for the quadrangle is established: once a quadrangle is found, its corners are
calculated as the intersections of its sides. If the distance between two adjacent
corners is smaller than the minimal length for segments used in the segment
detection (20 pixel), the quadrangle is rejected and the search continued. This
detection method could easily be applied to other closed convex shapes.
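The corner computation and the minimal-side check described above can be sketched as follows (hypothetical numpy helpers; representing each side by two points is our own choice):

```python
import numpy as np

def line_intersection(p1, p2, q1, q2):
    """Intersection of the infinite lines through (p1, p2) and (q1, q2),
    computed in homogeneous coordinates via cross products."""
    l1 = np.cross([*p1, 1.0], [*p2, 1.0])   # line through p1, p2
    l2 = np.cross([*q1, 1.0], [*q2, 1.0])   # line through q1, q2
    x = np.cross(l1, l2)                    # homogeneous intersection
    return x[:2] / x[2]

def quad_is_large_enough(corners, min_side=20.0):
    """Reject quadrangles whose shortest side is below the minimal
    segment length used in the segment detection (20 pixel)."""
    n = len(corners)
    return all(np.linalg.norm(np.asarray(corners[i], dtype=float)
                              - np.asarray(corners[(i + 1) % n], dtype=float)) >= min_side
               for i in range(n))
```

The homogeneous cross-product form handles all line orientations, including vertical sides, without special cases.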
P1
P2
Figure 4.4: Square detection: Segments have to be checked twice to see if they belong
to the square since there could be a square on both sides of it. For example, the
segment in the middle is not attributed to a square when starting from point P1 . It is
however attributed when starting from P2 .
Initial Pose Estimation A first coarse pose estimate is obtained by calculating the homography between the estimated position of the corners in the
image and their known position in the world frame. The pose of the DVS is then
estimated by decomposing the homography. The correspondence between world
and image points is established by assuming the rotation of the DVS around the
optical axis to be between −45° and 45°. A refinement of the pose estimation
is achieved by minimizing the reprojection error.
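Estimating the homography from the four corner correspondences can be done with the direct linear transform; a minimal numpy sketch (our own helper, omitting the decomposition and refinement steps) could look like this:

```python
import numpy as np

def find_homography(src, dst):
    """Direct linear transform: homography H with dst ~ H @ src for
    four (or more) point correspondences, via the SVD null vector."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Two equations per correspondence, linear in the entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)      # null vector = homography entries
    return H / H[2, 2]            # fix the projective scale
```

With exactly four non-degenerate correspondences the system has a one-dimensional null space, so the smallest singular vector recovers H up to scale.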
4.2.2 Tracking
After finding the line segments to be tracked in the image plane, their position
is updated upon the arrival of each new event that can be attributed to a segment.
If an event can be attributed to a segment, it is appended to the segment's
event buffer. The DVS pose is then optimized considering all events attributed
to all line segments. The segment positions are then estimated by projecting
the world coordinates onto the image plane.
Event Attribution When an event arrives, its distance to each line segment
is calculated as

d_\perp = \| v_0 - (v_0 \cdot n) n \|,   (4.2)

where v_0 = \overrightarrow{P_0 P} is the vector from one of the line endpoints P_0 to the location of
the event P and n = \overrightarrow{P_0 P_1} / \|\overrightarrow{P_0 P_1}\| is the unit vector of the segment. The distance
parallel to the line to the closer of the segment's endpoints is calculated as

d_\parallel = \begin{cases} -v_0 \cdot n & \text{if } |v_0 \cdot n| < |v_1 \cdot n| \\ v_1 \cdot n & \text{if } |v_0 \cdot n| \geq |v_1 \cdot n| \end{cases}   (4.3)

where v_1 = \overrightarrow{P_1 P} is the vector from the other line endpoint P_1 to the location of
the event P. This distance is negative if the point lies between the endpoints
and positive if it lies outside. An event is in the range of segment i if both its
orthogonal and its parallel distance to the line are smaller than a threshold:

d_{\perp,i} < d_{\perp,\max} \wedge d_{\parallel,i} < d_{\parallel,\max}.   (4.4)
(a) regions for attribution
Figure 4.5: In (a), the attribution of events is shown. Events located in region A or
B are attributed only to the segment la or lb, respectively. Events located in C are
attributed to both segments. Events located in D are attributed to neither segment.
In (b), the shrinking of the square is depicted: events produced by lb can draw the
segment la towards the center when they are attributed to la.
Event Buffer Each line segment has its own event buffer. When a new event
is attributed to a line, an old event in the buffer is replaced.
When a line is rotating, the number of events increases with the distance
to the center of rotation. If the center of rotation is close to an endpoint of
the line segment, this can lead to inaccurate estimations of the position and the
orientation of the segment. This is due to the fact that all events stored for this
segment are gathered on the other end of the segment, as shown in Figure 4.6a.
This problem can be addressed in two ways: increasing the number of events
stored for each segment or replacing stored events that have occurred close to
the location of the current event. The first solution introduces an unwanted lag
since old events located far from the current segment position are considered for
the estimation. Furthermore, the computational cost and the required memory
increase. The second solution can improve the distribution of event locations
along the segment. Despite their age, events close to the center of rotation still
lie on the line and are therefore still valid. These events can improve
the estimation of the segment considerably. In the presented approach, events
replace close-by events if the distance (parallel to the line) between the events is
smaller than a threshold distance. This threshold is chosen to be equal
to half the distance between the points if they were distributed uniformly. This
results in a more uniform distribution of events along the line and hence a better
estimation. A disadvantage is that once an outlier is attributed to the line, it
might influence the estimation for a long time. However, since the number of
outliers is relatively small, the advantages outweigh this drawback.
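The replacement policy described above can be sketched as follows (our own helper; per the text, min_spacing would be chosen as half the uniform event spacing, i.e. segment length divided by twice the buffer size):

```python
def insert_event(buffer, s_new, event, min_spacing):
    """Insert an event into a segment's fixed-size buffer.
    buffer: list of (s, event) tuples, where s is the position along
    the line. If a stored event lies closer than min_spacing (parallel
    to the line), replace that one; otherwise replace the oldest entry."""
    for i, (s, _) in enumerate(buffer):
        if abs(s - s_new) < min_spacing:
            buffer[i] = (s_new, event)   # replace the nearby event
            return
    buffer.pop(0)                        # otherwise drop the oldest
    buffer.append((s_new, event))
```

This keeps the buffer size constant while favouring a uniform spread of event locations along the segment.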
Figure 4.6: Pixel level schematics showing the problem of replacing the oldest event
stored for a line; The true line (black dashed) is rotated (black solid). The line is
estimated (red) based on the position of the events (gray). (a) New events replace the
oldest event stored for this line. Note how the stored events cluster on one end of the
line, thus, corrupting the line estimate. (b) Instead, new events replace the old event
closest to their location. Hence, the line estimate is more accurate.
The distance d_\perp of the attributed events to the projected lines is then minimised
in the least-squares sense. The optimization is performed by the MATLAB function lsqnonlin,
which implements the trust-region-reflective algorithm. The new estimate for the
line segments is obtained by the projection of the world coordinates using the
pose resulting from the optimization.
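In Python, the analogue of MATLAB's lsqnonlin is scipy.optimize.least_squares with method='trf' (trust-region-reflective). The following toy stand-in illustrates the call on a simplified residual (a 2D translation instead of the full pose parametrization, which is omitted here):

```python
import numpy as np
from scipy.optimize import least_squares

# Toy stand-in for the pose optimization: find the 2D translation that
# best aligns model points with observed event locations.
model = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
observed = model + np.array([0.5, -0.25])   # ground-truth shift

def residuals(t):
    # One residual per point coordinate: distance of each shifted model
    # point to its observation (analogous to event-to-line distances).
    return ((model + t) - observed).ravel()

# 'trf' is SciPy's trust-region-reflective method, matching lsqnonlin.
result = least_squares(residuals, x0=np.zeros(2), method='trf')
```

In the actual tracker, the parameter vector would be the 6-DOF camera pose and the residuals the distances (4.2) of all buffered events to their projected segments.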
4.3 Experimental Setup

4.3.1 Trajectory Simulation
The pose of the virtual camera along the conic helical trajectory is given by

M = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 & 0 \\ \sin\alpha\cos\beta & \cos\alpha\cos\beta & -\sin\beta & 0 \\ \sin\alpha\sin\beta & \cos\alpha\sin\beta & \cos\beta & z \end{pmatrix},   (4.5)

where \alpha = 720° \cdot t/T, \beta = 21.0°, and z = 2.5 m \cdot t/T. We chose the duration of the
simulation as T = 4 s to minimize effects of screen refreshing. This trajectory is
used to investigate the influence of the size of the square in the DVS image plane.
The second experiment consists of the same trajectory, except that the distance
to the square is kept fixed. The experiment was performed with two different distances from the square. First, the camera has a distance of 1 m from the square,
hence the projection of the square is a large quadrangle that should be easily trackable
(the longest side measures 114 pixel and the shortest 70 pixel). The
second distance is chosen as 2.5 m, resulting in a quadrangle whose longest side
measures 45 pixel and the shortest 31 pixel. This is near the maximal
distance at which the square can still be tracked. With these simulations, we
want to observe the influence of the number of events stored for each line.
Figure 4.7: Trajectory of the virtual camera: The optical axis (blue) is always pointing
towards the center of the square.
4.3.2 DVS on Quadrotor
Figure 4.8: Image of the modified AR.drone; 1) DVS mounted on top of the standard
CMOS camera; 2) Odroid onboard computer; 3) reflective markers for the motion
tracking system.
The drone was remote-controlled during the whole flight using the AR.drone's standard smartphone application. Flips can be performed easily by choosing this maneuver in
the application. During the flip, the drone rises approximately 50 cm and then falls
back to its initial height. The angular velocity during these flips reaches
peak values of 1200° s⁻¹. The distance to the wall was chosen so that the square
is always in the field of view of the DVS, ranging from 0.75 m to 2 m.
Figure 4.9: AR.drone performing a flip; the black square can be seen in the background.
As in the simulation, it measures 0.9 m × 0.9 m.
4.4 Results

4.4.1 Trajectory Simulation
Conic Helical Trajectory The estimated trajectory and the ground truth
for the helical trajectory are shown in Figure 4.10. The position is expressed
as the camera translation vector T in camera coordinates and the orientation
is expressed as Euler angles of the rotation matrix R that transforms world
coordinates to camera coordinates such that

X_{cam} = \begin{pmatrix} R & T \\ 0 & 1 \end{pmatrix} X_W,   (4.6)
Figure 4.10: Pose estimation for a helical trajectory; (a) estimation of the camera
translation vector T and the Euler angles (blue) including ground truth (red); (b) error
of the estimation; 8 events were stored per line.
where X_W ∈ R⁴ are homogeneous world coordinates and X_cam ∈ R⁴ are homogeneous camera coordinates. The error of the position estimation, \Delta T, is
described as the Euclidean distance between the camera translation vector of
the estimation and the ground truth:

\Delta T = \| T - \hat{T} \|,   (4.7)

where \hat{T} is the ground truth translation vector.
The angle \Delta\theta of the axis-angle representation is then used as a measure for the
orientation error:

\Delta\theta = \arccos\left( \frac{\mathrm{trace}(\Delta R) - 1}{2} \right), \quad \text{with } \Delta R = R \hat{R}^{-1},   (4.8)

where \hat{R} is the ground truth rotation matrix.
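The two error measures (4.7) and (4.8) can be computed as follows (a numpy sketch; the helper names are our own):

```python
import numpy as np

def position_error(T, T_gt):
    """Euclidean distance between estimated and ground-truth camera
    translation vectors, as in (4.7)."""
    return np.linalg.norm(np.asarray(T, dtype=float) - np.asarray(T_gt, dtype=float))

def orientation_error(R, R_gt):
    """Angle of the axis-angle representation of dR = R @ inv(R_gt),
    as in (4.8); the clip guards against round-off outside [-1, 1]."""
    dR = np.asarray(R, dtype=float) @ np.linalg.inv(R_gt)
    return np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
```

For a pure rotation of 30° about any axis, the orientation error evaluates to π/6 regardless of the axis.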
Figure 4.11: Error of the pose estimation for a helical trajectory; (a) norm of the
distance between the estimated and ground truth camera translation vectors, \Delta T; (b)
angle between the estimated and the ground truth pose in axis-angle representation.
Figure 4.11 shows the error of the estimation, expressed as stated above. The
mean position error for the first second of the simulation is 1.85 cm with a standard deviation of 0.63 cm. The mean angular error for this period is 9.15° with a
standard deviation of 0.96°. This low standard deviation compared to the high
mean error suggests that the error is due to a lag or a bias. The plots indicate a
lag in the order of 50 ms when looking at the pitch and roll angles. It is, however,
not clear whether this lag is introduced by the tracker or whether it is due
to a temporal misalignment of the ground truth and the measured data. The
temporal alignment is based on the event density in the DVS event stream: the
ground truth starts when the number of events exceeds 500 events s⁻¹, averaged
over 5 events.
It can be seen that the error of the estimation grows as the camera moves farther
from the square. Especially the errors in the z direction (optical axis) and in pitch
and roll increase rapidly towards the end of the simulation, as shown in Figure 4.10.
When the camera is approximately 2.5 m away from the square, tracking is lost.
This is caused by the low number of events that are emitted per segment. Figure 4.12b shows the number of events emitted during the simulation. The event
rate decreases from 14 kHz to 4 kHz as the distance between the camera and the
square increases from 1 m to 3 m. It can also be seen that, due to the decrease
of the quality of the estimation, fewer events get attributed to a line segment
towards the end of the simulation. At this distance, the shortest side of the
square measures only 30 pixel and the apparent motion is much smaller. This
increases the number of old events in the buffers of the segments. The quality
of the estimation is further decreased since the events are close to each other
rather than spread over the entire image plane. The low number of events also
increases the influence of sensor noise.
(a) coarse (b) zoom
Figure 4.12: Rate of emitted events during the simulation (blue) and pose updates (red),
sampled at 10 Hz (a) and 2 kHz (b); the pose update rate is equal to the number of
events attributed to the square per millisecond.
(a) The number of events decreases as the camera moves farther from the square. As
the estimation error increases, fewer events are attributed to the square. (b) In the
zoom, the screen update period of 16 ms can be seen as peaks. The width of the
peaks corresponds to the temporal noise.
Circular Trajectory Figure 4.13 shows the pose error as a function of the
number of events stored per line (buffer size) for a circular trajectory of the
DVS around the square. As for the helical trajectory, it can be seen that the
error is higher if the DVS is farther from the square. For more than 30 events
per segment, tracking is lost in the case of the farther trajectory (Tz = 2.5 m);
30 events per segment corresponds to nearly 1 event per pixel at this distance.
As described in Section 4.2.2, newly arriving events that are close to an event
in the buffer replace this event rather than the oldest event. Hence, once an
event is far from the line, its chance to be replaced is small. The more events in
the buffer, the smaller is the chance that such an event is removed. Therefore, the
estimation suffers from lag and new events eventually cannot be attributed to
line segments due to the bad estimation.
The position error (Euclidean distance between the camera translation vector
T and the corresponding ground truth vector) decreases with increasing
buffer size. This behavior is as expected, since more points are available for
the optimization for large buffers. Furthermore, the more events are used to
estimate the square, the lower is the influence of an event that does not originate
from the line segment. In the case where the camera is farther away from the
square, the error stagnates between 15 events and 25 events and increases slightly
for larger buffers. The angular error is nearly constant in the case where the
camera is close to the square. In the other case, however, it increases for buffers
larger than 6 events with a local minimum at 13 and 14 events, presumably for
the reason stated above.
Figure 4.13: Error of the camera translation vector and the camera orientation in
axis-angle representation for a circular trajectory 1 m (red) and 2.5 m (blue) above the
square as a function of the number of events per line segment. Tracking was lost with
more than 30 events in case of the farther trajectory.
Figure 4.14 shows the time needed for the pose estimation of the whole trajectory.
It can be seen that the computation time decreases with increasing buffer size.
This might appear counter-intuitive. However, when considering the number
of iterations per pose estimation shown in Figure 4.14, it can be seen that the more
points are considered, the faster the minimization converges and thus the faster
the algorithm is.
Figure 4.14: Time required to process the whole tracking of the circular trajectory
(left) and the average number of iterations to find the optimal pose (right) as a function
of the buffer size (number of events stored per line).
4.4.2 DVS on Quadrotor
Figure 4.15: Estimated trajectory (red) with ground truth (blue) and errors (black)
for three consecutive flips with a quadrotor
Figure 4.16: Output of the standard CMOS camera on the quadrotor during the flip;
the images are corrupted by strong motion blur.
Chapter 5
Conclusion
In this work, we presented, to the best of our knowledge, the first intrinsic camera calibration tool for the DVS. It is convenient and easy to use, requiring only
a computer screen. Using the flickering backlight of a dimmed LED-backlit
computer screen, this tool makes it possible to estimate the intrinsic camera parameters,
including the distortion coefficient. Our method is limited by the direction of
the light emitted from the screen in combination with the light passing through
the lens to the DVS sensor. Since computer screens are designed to emit light
only towards the front, the pattern cannot be detected when the camera is strongly
tilted with respect to the screen. For lenses that let enough light pass through
to the sensor, the pattern can be detected for angles up to 45°, leading to an
accurate estimation of the calibration parameters. Some lenses let only little
light pass through to the sensor, especially around the border of the image.
Using these lenses, the pattern can only be detected when the DVS is placed
relatively straight in front of the screen. This limits the accuracy of the
estimation. However, the image quality can still be increased, even for these
lenses.
We also introduced a very flexible simulation method that allows virtual scenes
to be observed with the DVS. Using OpenGL programming, complex virtual
scenes can be implemented and used for the simulation. The virtual camera can
be placed and moved arbitrarily. Since the scene and the perspective are defined
by the user, ground truth is easily obtained. Because the method provides a way
of recording highly repeatable data sets, it can be used for testing and benchmarking
tracking algorithms and other applications. Its main advantage is that
the behavior of the real DVS is used rather than modeled; hence, highly realistic
simulations can be achieved.
In terms of performance, this method is mainly limited by the screen's refresh
rate. As soon as the apparent motion exceeds 1 screen pixel per screen refresh,
effects similar to motion blur start to appear. With a refresh rate of 62.5 Hz,
the critical motion of our setup is 62.5 screen pixels per second, corresponding
to 17.6 mm/s on our 22-inch screen. However, motion blur in the order of several
screen pixels is acceptable, considering the limited resolution of the DVS.
Thus the critical motion is 62.5 DVS pixels per second. Assuming that the DVS
sees the whole height of the screen, a DVS pixel corresponds to about 8.2 screen
pixels. The critical motion is hence approximately 512 screen pixels per second
(about 144 mm/s), which allows simulations to be performed at reasonable speed.
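The critical-motion arithmetic above can be verified with a short script. The screen resolution (1680 × 1050), DVS resolution (128 × 128), and pixel pitch are assumptions consistent with the stated ratio of about 8.2 screen pixels per DVS pixel:

```python
# Sketch of the critical-motion arithmetic. Screen resolution (1680x1050),
# pixel pitch, and DVS resolution (128x128) are assumed values consistent
# with the numbers quoted in the text.

refresh_rate_hz = 62.5        # screen refresh rate
pixel_pitch_mm = 0.282        # assumed pitch of a 22-inch 1680x1050 screen
screen_height_px = 1050       # assumed vertical screen resolution
dvs_height_px = 128           # DVS vertical resolution

# Strict criterion: at most 1 screen pixel of apparent motion per refresh
crit_screen_px_s = refresh_rate_hz * 1.0
crit_mm_s = crit_screen_px_s * pixel_pitch_mm            # ~17.6 mm/s

# Relaxed criterion: at most 1 DVS pixel of apparent motion per refresh
screen_px_per_dvs_px = screen_height_px / dvs_height_px  # ~8.2 screen px
crit_dvs_screen_px_s = refresh_rate_hz * screen_px_per_dvs_px  # ~512 px/s
crit_dvs_mm_s = crit_dvs_screen_px_s * pixel_pitch_mm          # ~144 mm/s

print(round(crit_mm_s, 1), round(screen_px_per_dvs_px, 1), round(crit_dvs_mm_s, 1))
```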
While arbitrarily complex scenes can be produced in OpenGL, including shading
with several light sources and smooth and specular highlights, the programmer
has to be familiar with OpenGL. Implementing such complex scenes is laborious
and time consuming. Hence, in practice, the use of our simulator is limited to
relatively simple scenes.
We successfully used this software for testing and evaluating our tracking algorithm.
To the best of our knowledge, this work presents the first 6-DoF pose estimation
using a DVS. Our algorithm estimates the pose of the DVS upon the arrival of
every event.
In our experiments using a quadrotor, we found a mean error of the camera
position of 10.8 cm. The angular error, using the axis-angle representation,
was found to be 5.1°. Using simulation data, the mean error of the camera
translation vector was 1 cm and the orientation error was 10°. Considering this
error, this method cannot be used for the control of a quadrotor. To demonstrate
the influence of the low resolution of the DVS on the pose estimation, we performed
a simulation (pure calculations). In this simulation, a camera follows a circular
trajectory over the square, pointing its optical axis towards the center of the
square. As in the performed experiment, the camera is at a distance of 2.5 m
from the square. On each iteration step, the sides of the square are projected
onto the image plane using the pinhole camera model. On each line, 8 events are
uniformly distributed. To simulate the spatially discrete nature of the DVS, the
coordinates of the event locations are rounded to the nearest integer value. We
then estimated the pose based on these points, using the same minimization
procedure as in our algorithm. This simulation shows the theoretical limit imposed
by the low resolution, assuming that no subpixel precision can be achieved. We
found a mean position error of 1.0 cm with a standard deviation of 0.6 cm. For
the angular error, we found a mean value of 1.3° with a standard deviation of
0.5°. Since we assume that a large part of the mean error of our experiment
(3 cm and 11°) is due to a temporal misalignment between the estimation and
the ground truth, we focus our comparison on the standard deviation. In
our experiment, we found a standard deviation of 2.3 cm for the position error
and 3.4° for the angular error. Comparing the standard deviations, we can
state that the influence of the low resolution is important, but not solely
responsible for the error in our estimation. To investigate the
influence of badly distributed events, we repeated the experiment, but distributed
the events randomly along the sides of the square. The mean position error increased
to 1.3 cm with a standard deviation of 0.9 cm, and the angular error to a mean
value of 2.1° with a standard deviation of 1.1°. Although the standard deviations
achieved in our experiment are considerably higher than the theoretical limit
(factor 2.2 for the position and factor 6.8 for the angle), it has to be noted
that both the low resolution and the nonuniform distribution of events have
a considerable effect on the estimation. A higher-resolution DVS is currently
under development and will reduce the limitations due to the low resolution.
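The event-generation step of this quantization simulation can be sketched as follows. The pose minimization itself (Sec. 4.2) is omitted here; the focal length is an assumed value for a 128 × 128 sensor, and the camera is placed frontally for simplicity:

```python
import numpy as np

# Sketch of the quantization experiment: the sides of a 1 m square are
# projected with a pinhole model from a camera 2.5 m away, 8 event
# locations are placed uniformly on each projected side, and the
# coordinates are rounded to integer pixels. The focal length (in pixels)
# is an assumed value; the pose minimization step is not reproduced here.

f_px = 100.0                 # assumed focal length in pixels
cx = cy = 64.0               # principal point of a 128x128 sensor
z_cam = 2.5                  # camera distance to the square [m]

# Corners of the 1 m square in the camera frame (frontal view)
corners = np.array([[-0.5, -0.5], [0.5, -0.5], [0.5, 0.5], [-0.5, 0.5]])

events_ideal = []
for i in range(4):
    a, b = corners[i], corners[(i + 1) % 4]
    for t in np.linspace(0.0, 1.0, 8, endpoint=False):
        p = a + t * (b - a)                  # point on the side
        u = f_px * p[0] / z_cam + cx         # pinhole projection
        v = f_px * p[1] / z_cam + cy
        events_ideal.append((u, v))

events_ideal = np.array(events_ideal)        # 4 sides x 8 events = 32 events
events_quant = np.rint(events_ideal)         # spatial discretization of the DVS

# Per-event quantization error in pixels (bounded by sqrt(2)/2)
err = np.linalg.norm(events_quant - events_ideal, axis=1)
print(len(events_ideal), float(err.max()))
```

Feeding `events_quant` instead of `events_ideal` into the pose minimization yields the resolution-limited error discussed above.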
Another important influence on the quality of the estimation is temporal noise.
Events that arrive with a delay compared to events of the neighboring pixels
considerably degrade the estimate of the line segment position.
In combination with an appropriate filter and additional sensors such as an IMU,
this approach might allow controlling an MAV during aggressive maneuvers.
Although the current implementation is relatively slow, the algorithm is expected
to run at a very high frequency when optimally implemented. If needed,
it can be sped up by reducing the frequency of the pose updates. To this end,
the pose would be updated only after the arrival of a certain number of events.
For the event attribution, the line segments would be assumed static
between two pose updates. Alternatively, a motion model could be used to predict
the pose, allowing the position and orientation of the line segments to be
estimated. Since the pose estimation is the main contributor to the computational
cost, the total cost is expected to scale nearly linearly with the
frequency of the pose updates.
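The batched-update scheme just described can be sketched as follows. The functions `attribute_to_segment` and `estimate_pose` are placeholders standing in for the algorithm's event attribution and minimization steps; segments and events are reduced to scalars purely for illustration:

```python
from collections import deque

# Sketch of the batched pose-update scheme: events are attributed to line
# segments immediately, but the expensive pose minimization runs only once
# every `batch_size` events. Both helper functions are placeholders.

def attribute_to_segment(event, segments, pose):
    # Placeholder: nearest-segment attribution under the last known pose.
    return min(range(len(segments)), key=lambda i: abs(event - segments[i]))

def estimate_pose(buffers, pose):
    # Placeholder for the minimization over all buffered events (Sec. 4.2);
    # here it merely counts refinements.
    return pose + 1

def process_events(events, segments, pose, batch_size=10):
    buffers = [deque(maxlen=30) for _ in segments]  # per-segment event buffers
    since_update = 0
    for ev in events:
        idx = attribute_to_segment(ev, segments, pose)
        buffers[idx].append(ev)
        since_update += 1
        if since_update >= batch_size:              # one pose update per batch
            pose = estimate_pose(buffers, pose)
            since_update = 0
    return pose

# 100 events with a batch size of 10 trigger exactly 10 pose refinements.
final = process_events(range(100), segments=[0, 50], pose=0, batch_size=10)
print(final)
```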
While we only demonstrated our tracker with a square, it can track any number
of line segments. However, in the current implementation, an overlap between
two nearly parallel or collinear segments can cause false correspondences
between events and line segments. If enough other line segments
are tracked, the pose should still be estimated reasonably well thanks to the
chosen cost function for the minimization; the tracking would then recover and
false correspondences would gradually be removed. If the influence of the
overlapping lines is too large (i.e., not enough other lines), bad pose estimates
could cause the tracking to be lost by introducing even more false correspondences.
Another problem that is not handled by our algorithm is occlusion. While tracking
algorithms based on conventional cameras usually know when they have lost track
of a feature, our algorithm assumes that the line segment is stationary. This is
also the case when a segment leaves the field of view.
5.1 Future Work
As stated above, implementing complex scenes and rendering is laborious in the
current version of the simulation tool. While the simple black square suffices
to test our algorithm, it might be interesting to simulate complex scenes with
objects occluding each other, light sources producing specular highlights, and
other effects. We therefore suggest basing the application on Blender [51], an
open-source computer graphics suite often used for creating 3D animation films.
In Blender, 3D objects and even whole scenes can easily be created or imported.
Furthermore, it provides ready-made illumination and material settings, allowing
realistic rendering with minimal effort. The DVS-screen misalignment correction
could be implemented as a Python [52] script, which can easily be executed in
Blender [53]. Blender also provides a game engine that allows interacting with
the scene in real time; hence, an interactive DVS simulation could be implemented.
As mentioned in the last section, fusing the output of our pose estimation
algorithm with other sensors could significantly reduce the error. Since the
noise of our estimate has a high frequency, it would be useful to fuse it with
a sensor that also has a high update rate. IMUs provide high-rate measurements
but suffer from drift. Thus, when combining an IMU with the suggested algorithm,
the IMU could reduce the noise while the drift is corrected by our algorithm.
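The fusion idea can be illustrated with a minimal complementary filter. The scalar state, the gain `alpha`, and the sensor models are illustrative assumptions; a real system would fuse full 6-DoF poses, e.g. with an extended Kalman filter:

```python
# Minimal complementary-filter sketch: high-rate IMU increments (which
# drift) are corrected by the noisy but drift-free DVS pose estimate.
# The 1-D state and the gain `alpha` are illustrative simplifications.

def complementary_filter(imu_increments, dvs_poses, alpha=0.98):
    """Fuse integrated IMU motion with absolute DVS pose estimates (1-D)."""
    fused = 0.0
    out = []
    for d_imu, p_dvs in zip(imu_increments, dvs_poses):
        prediction = fused + d_imu                        # IMU propagation (drifts)
        fused = alpha * prediction + (1 - alpha) * p_dvs  # DVS correction (no drift)
        out.append(fused)
    return out

# True pose is 1.0; the IMU has a pure +0.01 bias per step, so integrating
# it alone would drift to 2.0 after 200 steps. The filter keeps the drift
# bounded because the DVS estimate pulls the state back.
imu = [0.01] * 200
dvs = [1.0] * 200
fused = complementary_filter(imu, dvs)
print(fused[-1])
```

With these numbers the filter converges to a bounded offset of `alpha * bias / (1 - alpha)` above the true pose instead of drifting without limit.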
While tracking a square is suitable as a proof of concept, it cannot be applied
to real applications. A more general way of establishing the correspondences
between detected line segments and the map would have to be developed. This
would allow navigation and pose estimation in a known environment. However,
a robot often has to navigate through unknown terrain, so the map would have
to be expanded during operation. While this has been successfully demonstrated
in 2D [40], it remains an important challenge in 3D.
As explained above, our current algorithm cannot distinguish between an unmoving
segment and an occluded segment. This problem could be solved by considering
for the pose estimation only line segments that are consistent with
the currently estimated pose. We suggest using a Random Sample Consensus
(RANSAC) [54] approach. In this scenario, three line segments would be
selected at random and the pose estimated based on these segments. It
would then be evaluated how many line segments are coherent with this pose.
If enough segments support the pose, it would be re-estimated using all
coherent segments. If the pose is not supported by enough segments,
three other segments would be picked at random until a consensus is found.
Alternatively, rather than picking line segments for RANSAC, events from the
line segments' buffers could be picked. While the latter alternative is more
robust, its computational cost is much higher than that of the former approach.
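The suggested scheme can be sketched as a RANSAC skeleton. `pose_from_segments` and `segment_error` are placeholders for the thesis' pose minimization and reprojection error; segments are reduced to scalars and the "pose" to their mean, purely to make the consensus loop concrete:

```python
import random

# RANSAC skeleton for occlusion-robust pose estimation: pick three line
# segments at random, estimate a pose, count consistent segments, and
# re-estimate from all inliers once enough segments agree. The two helper
# functions are placeholders for the real minimization and error measure.

def pose_from_segments(segs):
    return sum(segs) / len(segs)          # placeholder pose estimate

def segment_error(seg, pose):
    return abs(seg - pose)                # placeholder consistency measure

def ransac_pose(segments, n_iters=100, threshold=1.0, min_inliers=4, seed=0):
    rng = random.Random(seed)
    for _ in range(n_iters):
        sample = rng.sample(segments, 3)             # three random segments
        pose = pose_from_segments(sample)
        inliers = [s for s in segments
                   if segment_error(s, pose) < threshold]
        if len(inliers) >= min_inliers:              # consensus found
            return pose_from_segments(inliers)       # refit on all inliers
    return None                                      # no consensus

# Eight consistent segments around 10.0 plus two occluded/outlier segments;
# the outliers are rejected and the refit uses only the consistent ones.
segs = [9.8, 10.1, 10.0, 9.9, 10.2, 10.05, 9.95, 10.1, 25.0, 3.0]
pose = ransac_pose(segs)
print(pose)
```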
The next generation of the DVS, the apsDVS [55], not only provides event-based
output, but also full image frames at a low frame rate (up to 25 fps). This
allows global localization to be performed on image frames with standard
algorithms, while the event-based output can be used for pose estimation
during fast maneuvers. Furthermore, the apsDVS has a higher resolution of
240 × 180 pixels.
Bibliography
[1] R. Murphy, S. Tadokoro, D. Nardi, A. Jacoff, P. Fiorini, H. Choset, and A. Erkmen, "Search and rescue robotics," in Springer Handbook of Robotics. Springer Berlin Heidelberg, 2008, pp. 1151-1173. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-30301-5_51
[2] T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair, I. Grixa, F. Ruess, M. Suppa, and D. Burschka, "Toward a fully autonomous UAV: Research platform for indoor and outdoor urban search and rescue," IEEE Robotics Automation Magazine, vol. 19, no. 3, pp. 46-56, Sept 2012.
[3] N. Michael, E. Stump, and K. Mohta, "Persistent surveillance with a team of MAVs," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Sept 2011, pp. 2708-2714.
[4] K. W. Weng and M. Abidin, "Design and control of a quad-rotor flying robot for aerial surveillance," in Student Conference on Research and Development (SCOReD 2006), June 2006, pp. 173-177.
[5] A. Faust, I. Palunko, P. Cruz, R. Fierro, and L. Tapia, "Aerial suspended cargo delivery through reinforcement learning," Department of Computer Science, University of New Mexico, Tech. Rep., 2013.
[6] A. Kushleyev, B. MacAllister, and M. Likhachev, "Planning for landing site selection in the aerial supply delivery," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Sept 2011, pp. 1146-1153.
[7] M. Muller, S. Lupashin, and R. D'Andrea, "Quadrocopter ball juggling," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Sept 2011, pp. 5113-5120.
[8] D. Mellinger, N. Michael, and V. Kumar, "Trajectory generation and control for precise aggressive maneuvers with quadrotors," Intl. J. of Robotics Research, vol. 31, no. 5, pp. 664-674, 2012. [Online]. Available: http://ijr.sagepub.com/content/31/5/664.abstract
[9] B. Yun, K. Peng, and B. Chen, "Enhancement of GPS signals for automatic control of a UAV helicopter system," in IEEE Intl. Conf. on Control and Automation (ICCA), May 2007, pp. 1185-1189.
[10] N. Abdelkrim, N. Aouf, A. Tsourdos, and B. White, "Robust nonlinear filtering for INS/GPS UAV localization," in Mediterranean Conf. on Control and Automation, June 2008, pp. 695-702.
Title of work: High-Speed Pose Estimation using a Dynamic Vision Sensor
Student:
Name: Basil Huber
E-mail: basil.huber@gmail.com
Legi-Nr.: 1866171