SURVEILLANCE SYSTEMS
DESIGN, APPLICATIONS
AND TECHNOLOGY
No part of this digital document may be reproduced, stored in a retrieval system or transmitted in any form or
by any means. The publisher has taken reasonable care in the preparation of this digital document, but makes no
expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No
liability is assumed for incidental or consequential damages in connection with or arising out of information
contained herein. This digital document is sold with the clear understanding that the publisher is not engaged in
rendering legal, medical or any other professional services.
COMPUTER SCIENCE, TECHNOLOGY
AND APPLICATIONS
SURVEILLANCE SYSTEMS
DESIGN, APPLICATIONS
AND TECHNOLOGY
ROGER SIMMONS
EDITOR
New York
Copyright © 2017 by Nova Science Publishers, Inc.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted
in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical photocopying,
recording or otherwise without the written permission of the Publisher.
We have partnered with Copyright Clearance Center to make it easy for you to obtain permissions to
reuse content from this publication. Simply navigate to this publication’s page on Nova’s website and
locate the “Get Permission” button below the title description. This button is linked directly to the
title’s permission page on copyright.com. Alternatively, you can visit copyright.com and search by
title, ISBN, or ISSN.
For further questions about using the service on copyright.com, please contact:
Copyright Clearance Center
Phone: +1-(978) 750-8400 Fax: +1-(978) 750-4470 E-mail: info@copyright.com.
Independent verification should be sought for any data, advice or recommendations contained in this
book. In addition, no responsibility is assumed by the publisher for any injury and/or damage to
persons or property arising from any methods, products, instructions, ideas or otherwise contained in
this publication.
This publication is designed to provide accurate and authoritative information with regard to the subject
matter covered herein. It is sold with the clear understanding that the Publisher is not engaged in
rendering legal or any other professional services. If legal or any other expert assistance is required, the
services of a competent person should be sought. FROM A DECLARATION OF PARTICIPANTS
JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR ASSOCIATION AND A
COMMITTEE OF PUBLISHERS.
Additional color graphics may be available in the e-book version of this book.
Preface vii
Chapter 1 Omnidirectional Surveillance System
for Household Safety 1
Kai Yiat Jim, Wai Kit Wong and Yee Kit Chan
Chapter 2 Tracking Moving Objects in Video Surveillance
Systems with Kalman and Particle Filters –
A Practical Approach 55
Grzegorz Szwoch
Chapter 3 Performance Evaluation of Single Object Visual
Tracking: Methodology, Dataset and Experiments 107
Juan C. SanMiguel, José M. Martínez
and Mónica Lozano
Index 143
PREFACE
frame basis, providing a ‘track’ of each object. First, the Kalman filter approach is presented: the implementation of a dynamic model for the filter prediction, methods of obtaining the measurement for updating the filter, and the influence of the noise variance parameters on the results are discussed. Tracking with Kalman filters fails in many practical situations when the tracked objects come into conflict due to object occlusion and fragmentation in the camera images. Another method presented here is based on particle filters, which are updated using color histograms of the tracked objects. This method is more robust to tracking conflicts than the Kalman filter, but it is less accurate in describing the object size, and it is also much more demanding in terms of computation. Therefore, a combined approach for resolving tracking conflicts is proposed. This algorithm uses Kalman filters for basic, non-conflict tracking, and switches to the particle filter to resolve cases of occlusion and fragmentation. A methodology for evaluating tracking algorithms is also presented, together with an example of testing the three presented tracking algorithms on a sample test video.
Chapter 3 – Performance evaluation of visual tracking approaches (trackers) based on ground-truth data makes it possible to determine their strengths and weaknesses. In this chapter, the authors present a methodology for tracker evaluation that quantifies performance against variations of the tracker input (data and configuration). It addresses three aspects: the dataset, the performance criteria and the evaluation measure. A dataset with ground truth is designed that includes common tracking problems such as illumination changes, complex movements and occlusions. Four performance criteria are defined: parameter stability, initialization robustness, global accuracy and computational complexity. A new measure is proposed to estimate spatio-temporal tracker accuracy while accounting for human errors in the generation of ground-truth data. This measure is then compared with the related state of the art, showing its superiority for evaluating trackers. Finally, the proposed methodology is validated on state-of-the-art trackers, demonstrating its utility in identifying tracker characteristics.
In: Surveillance Systems ISBN: 978-1-53610-703-6
Editor: Roger Simmons © 2017 Nova Science Publishers, Inc.
Chapter 1

OMNIDIRECTIONAL SURVEILLANCE SYSTEM FOR HOUSEHOLD SAFETY
Kai Yiat Jim*, Wai Kit Wong† and Yee Kit Chan‡
Faculty of Engineering and Technology, Multimedia University,
Ayer Keroh Lama, Melaka, Malaysia
ABSTRACT
Recent statistics on home security reveal that around 3.7 million home break-ins are committed each year in the United States; on average, there is a home intrusion every 8.4 seconds. Homes without security systems are up to 300% more likely to be broken into, and police clear only about 13% of all reported burglaries due to the lack of witnesses or physical evidence. These figures show the necessity of installing a trespasser detection surveillance system in a residence to mitigate burglaries. Besides that, according to the 2010 National Health Interview Survey, the overall rate of nonfatal fall injury cases for which a health-care professional was contacted was 43 per 1,000 people, which suggests that many more fall injuries go unattended; with inattentiveness or late notification, such falls can turn fatal. Therefore, to save more lives, there is also a need to install a health-care surveillance system in a residence for human faint detection.
* Email: helium_jim@yahoo.com.
† Email: wkwong@mmu.edu.my.
‡ Email: ykchan@mmu.edu.my.
1. INTRODUCTION
Household surveillance has been an important part of our daily lives,
whether it is to prevent trespassing or to ensure the safety of our loved ones.
Trespassing problem has been around for a few decades and it keeps on
Omnidirectional Surveillance System … 3
increasing, causing potential danger to our safety. On the other hand, cases
where the elderly faint and experience fatal injuries, have also been increasing
throughout the years. Hence, this work is proposed to solve both the
trespassing problem and the elderly fainting issue.
Burglar alarm systems and video surveillance have been widely used around the world as solutions for trespassing detection. On the other hand, wearable sensors and video monitoring have been implemented as healthcare surveillance for the elderly. However, none of them is as flexible as image-processing-based surveillance, which can detect events automatically, capture images and cover a wide surveillance area.
Common video surveillance employs a directional view limited to a 180-degree viewing angle, so more cameras are needed to cover a wider angle, which increases the total cost of the surveillance system. Therefore, a method has been devised to obtain an omnidirectional image with a 360-degree viewing angle using only minimal hardware. Generally, practitioners have used either a mechanical or an optical approach to obtain omnidirectional images. The optical approach has usually been favoured, since the mechanical approach leads to many problems of discontinuity and inconsistency.
The optical approach often suffers from image deformation, which makes the captured image difficult to interpret. Thus, it is necessary to apply an efficient unwarping method to the captured omni-image. Basically, unwarping is the process in digital image processing that ‘opens’ up an omni-image into a panoramic image, whose information can then be easily interpreted for any direct implementation. There are three unwarping methods actively adopted in visual surveillance systems around the world: the pano-mapping table method, the discrete geometry techniques (DGT) method and the log-polar mapping method. The best method is selected based on the advantages and disadvantages of each.
Finally, automatic trespasser and faint detection algorithms are implemented on the hardware setup to form a complete surveillance system. The proposed methods include the extreme point curvature analysis algorithm and the integrated body contours algorithm. The extreme point curvature analysis algorithm checks the curves on the top, bottom, left and right of an object blob to detect a trespasser, whereas the integrated body contours algorithm combines head detection, leg detection, and the ellipse fitting's ratio and orientation as the main features to detect fainting. These automatic detection algorithms are needed because current video surveillance requires monitoring by humans, whose efficiency drops as they grow tired.
In this chapter, the topics are divided into the surveillance system, the omnidirectional imaging system, the automatic trespasser & faint detection algorithms, experimental results & discussion and, lastly, the conclusion & future research directions.
2. SURVEILLANCE SYSTEM
wired surveillance digital cameras lack flexibility and are appropriate for permanent setups. Both serve the same function of transmitting image signals to a central hub, which are then displayed on a monitor screen for viewing. However, manpower is still required to observe the monitor screen and to determine the presence of trespassers. Continuous monitoring becomes less effective as the observer gets distracted due to fatigue [4], which eventually causes errors such as false alarms and unnoticed trespassing. In conclusion, although video surveillance is widely used for security purposes, it is not the best solution to counter trespassing in places such as households.
These wearable sensors are efficient and may even be lifesaving tools in an emergency. However, we cannot overlook the possibility that the elderly or patients will forget to use the product (wearing it or bringing it with them). Elderly people tend to be more forgetful, and having to use the product at all times would be burdensome for them.
length, allows the camera to view a much wider range that resembles a hemispherical scene. Although the fish-eye lens has been used in numerous applications that require a wide angle [23, 24], Nalwa [25] found that it is difficult to design a fish-eye lens that ensures all incoming principal rays intersect at a single point to provide a fixed viewpoint. This means that the obtained image does not provide a distortion-free perspective image of the viewed scene. Hence, a complex and large design is required to build an optimal fish-eye lens that can capture a good omnidirectional view image, and such a lens may cost a fortune. Meanwhile, the hyperbolic optical mirror offers a cheaper solution with less design complexity, and provides the same reflective quality as the fish-eye lens.
Since the mechanical approach leads to many problems of discontinuity and inconsistency, the optical approach is selected for this work, in particular the hyperbolic optical mirror, as it outperforms the fish-eye lens as stated above. The proposed omnidirectional surveillance system model is discussed in Section 3.1 below.
Figure 2. (a) Wireless webcam, (b) Custom bracket, (c) Custom hyperbolic mirror, (d)
Combined camera set.
The captured hyperbolic mirror image suffers from deformation that may lead to difficulties in analysis. Therefore, a suitable method is necessary for unwarping the hyperbolic mirror image into an easy-to-read form. Generally, unwarping is a method in digital image processing in which the spherical hyperbolic mirror image is ‘opened’ up into a panoramic image that can be directly used and understood. Three universal unwarping methods are currently applied actively around the world for transforming an omnidirectional mirror image into a panoramic image: the discrete geometry technique (DGT) method [26], the pano-mapping table method [27] and the log-polar mapping method [28]. The following review of the unwarping methods is based on a work [29] by W. S. Pua.
Figure 3. (a) Circle lying in between pixels, (b) Circle being split into 4 sections.
$$r = f_r(p) = a_0 + a_1 p + a_2 p^2 + a_3 p^3 + a_4 p^4 \qquad (1)$$
where r corresponds to the radius, p is the particular radius for each of the 5 points taken, and $a_0, \ldots, a_4$ are the five coefficients to be estimated using the values obtained from the landmark points.

Once the 5 coefficients are obtained, the pano-mapping table ($T_{MN}$) can be generated. The size of the table is determined manually by setting it to M × N. Hence, in order to fill up a table with M × N entries, the landmark point (p), which corresponds to the radius of the omnidirectional mirror image, is divided into M separate parts, and the angle (θ) is divided into N parts as follows:
$$p_{ij} = i \times \frac{radius}{M} \qquad (2)$$

$$\theta_{ij} = j \times \frac{360°}{N} \qquad (3)$$
The calculation starts with the first point, where i = 1 and j = 1, giving $p_{11} = radius/M$ and $\theta_{11} = 360°/N$. The value of $p_{ij}$ is then substituted into the "radial stretching function" to obtain the particular radius at that landmark point. The radius obtained is substituted into the equations below and rounded, in order to get the corresponding coordinates in the omnidirectional image.
$$v = r \cos\theta \qquad (4)$$

$$u = r \sin\theta \qquad (5)$$
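To make the table generation concrete, the following Python sketch builds a pano-mapping table from Eqs. (1)-(5). The function name, the argument layout and the final centering offsets (xc, yc) are our own assumptions; the chapter itself only prescribes the equations:

```python
import numpy as np

def build_pano_mapping_table(coeffs, radius, M, N, xc, yc):
    """Build an M x N pano-mapping table (sketch of Eqs. 1-5).

    coeffs   : the five radial-stretching coefficients a0..a4 (Eq. 1),
               assumed to be pre-estimated from the landmark points.
    radius   : radius of the omnidirectional mirror image in pixels.
    (xc, yc) : assumed center of the mirror circle in the omni-image.
    Returns an (M, N, 2) array holding, for each table entry, the
    (column, row) coordinates to sample in the omnidirectional image.
    """
    table = np.zeros((M, N, 2), dtype=np.int64)
    for i in range(1, M + 1):
        p = i * radius / M                                   # Eq. 2
        r = sum(a * p**k for k, a in enumerate(coeffs))      # Eq. 1
        for j in range(1, N + 1):
            theta = np.deg2rad(j * 360.0 / N)                # Eq. 3
            v = r * np.cos(theta)                            # Eq. 4
            u = r * np.sin(theta)                            # Eq. 5
            table[i - 1, j - 1] = (round(xc + u), round(yc + v))
    return table
```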
$$\theta(x_i, y_i) = \left(\frac{N}{2\pi}\right) \tan^{-1}\!\left(\frac{y_i - y_c}{x_i - x_c}\right) \qquad (7)$$
$$r_n = b\, r_{n-1} \qquad (10)$$

where r is the sampling circle radius and b is the ratio between two adjacent sampling
sampling circles. Figure 6 shows the circular sampling structure and the
unwarping process done by using the log-polar mapping method [32]. The
Omnidirectional Surveillance System … 21
mean value of pixels within each and every circular sampling is calculated and
it will be assigned to the center point of the circular sampling. The process will
then continue by mapping the mean value of log-polar pixel (𝑝, 𝜃) into another
Cartesian form using 𝑥0 (𝑝, 𝜃) = 𝑝𝑐𝑜𝑠𝜃 + 𝑥𝑐 and 𝑦0 (𝑝, 𝜃) = 𝑝𝑐𝑜𝑠𝜃 + 𝑦𝑐 as
stated above. Finally, the un-warping process will be completed at the end of
the mapping.
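The log-polar sampling described above can be summarized in code. The sketch below is illustrative only: the parameter names are ours, the input is assumed to be a grayscale image, and a single pixel is sampled per circle instead of averaging all pixels inside it:

```python
import numpy as np

def logpolar_unwarp(omni, xc, yc, r0, b, n_rings, n_angles):
    """Unwarp a grayscale omni-image into an (n_rings x n_angles)
    panorama by log-polar sampling (a simplified sketch).

    r0 : radius of the innermost sampling circle.
    b  : ratio between consecutive sampling radii (Eq. 10).
    """
    pano = np.zeros((n_rings, n_angles), dtype=omni.dtype)
    h, w = omni.shape[:2]
    r = r0
    for ring in range(n_rings):
        for k in range(n_angles):
            theta = 2.0 * np.pi * k / n_angles
            # center of the sampling circle in Cartesian coordinates,
            # i.e., x0 = p*cos(theta) + xc and y0 = p*sin(theta) + yc
            x = int(round(r * np.cos(theta) + xc))
            y = int(round(r * np.sin(theta) + yc))
            if 0 <= x < w and 0 <= y < h:
                # the chapter assigns the mean of all pixels inside the
                # sampling circle; one pixel is used here for brevity
                pano[ring, k] = omni[y, x]
        r *= b                    # r_n = b * r_(n-1), Eq. 10
    return pano
```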
Figure 7. (a) Sample of omnidirectional mirror image, (b) Panoramic image generated
by using DGT method, (c) Panoramic image generated by using pano-mapping table
method, (d) Panoramic image generated by using log-polar method.
Omnidirectional Surveillance System … 23
and it is not as obvious as in the DGT method and the pano-mapping table method. In terms of image quality, the pano-mapping table method has the highest quality among the three methods, followed by the log-polar mapping method with slightly lower (but still acceptable) quality, and lastly the DGT method with the lowest quality. In terms of the algorithm used to perform the unwarping process, the pano-mapping table method has the simplest algorithm, followed by the log-polar mapping method with a slightly more complex algorithm, and lastly the DGT method with the most complicated algorithm. In terms of complexity, the pano-mapping table method has the least complexity, followed by the DGT method, and lastly the log-polar mapping method. In terms of processing time, on average, the pano-mapping table method takes the least time to transform an omnidirectional mirror image into a panoramic image, followed by the log-polar mapping method and the DGT method. In terms of data compression, the log-polar mapping method has the highest compression rate compared to the pano-mapping table method and the DGT method. A high compression rate is important to preserve the CPU's memory, which is usually very limited.

After comparing and weighing each category of performance for the three unwarping methods above, we decided to implement the log-polar mapping method in this work, due to its all-round performance that suits our needs.
Step 1: Image acquisition: An image of the monitored area is taken using the setup proposed in Section 3.1. The image taken can be seen in Figure 8.
Step 2: Background subtraction: Convert both the current image (IC) and the background image (IB) into grayscale. Check:
IF |(X_IC, Y_IC) - (X_IB, Y_IB)| > TBS (background subtraction threshold);
THEN (X_IR, Y_IR) is set to be a white pixel (value 1 for binary);
(a code sketch of this step is given after Figure 10)
Figure 10. (a) Background image, (b) Current image, (c) Resultant image.
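A minimal Python version of Step 2 might look as follows. The grayscale conversion weights and the use of an absolute difference are our assumptions; TBS = 60 is the value found optimal in Section 5.1.1:

```python
import numpy as np

def background_subtract(current, background, t_bs=60):
    """Produce a binary foreground mask by frame differencing.

    current, background : RGB images as uint8 numpy arrays.
    t_bs                : background subtraction threshold TBS.
    """
    # convert both images to grayscale (ITU-R BT.601 weights assumed)
    weights = np.array([0.299, 0.587, 0.114])
    ic = current.astype(np.float64) @ weights
    ib = background.astype(np.float64) @ weights
    # pixels whose difference exceeds TBS become white (binary 1)
    return (np.abs(ic - ib) > t_bs).astype(np.uint8)
```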
Step 1: Top peak point: From every pixel of the object’s boundary, check:
IF (Ya – Ya-1 < 0) & (Ya+1 – Ya > 0);
THEN set (Xa,Ya) = Fn;
where n = number of peak points in the object.
NEXT set Fn with minimum y-coordinate as Ftpp or Top Peak Point.
Step 2: Side turning point: Following the boundary, from Ftpp, search
clockwise and check:
IF (Xa – Xa-1 > 0) & (Xa+1 – Xa < 0);
THEN set (Xa,Ya) = Fright;
NEXT, from Fright, search clockwise along the boundary and check:
IF (Xa – Xa-1 < 0) & (Xa+1 – Xa > 0);
THEN set (Xa,Ya) = FrightN.
Similarly, for left side, following the boundary, from Ftpp, search
anticlockwise and check:
IF (Xa+1 – Xa > 0) & (Xa – Xa-1 < 0);
THEN set (Xa,Ya) = Fleft;
NEXT, from Fleft, search anticlockwise along the boundary and check:
IF (Xa+1 – Xa < 0) & (Xa – Xa-1 > 0);
THEN set (Xa,Ya) = FleftN.
*Only the first turning point encountered is recorded and used.
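As an illustration of Steps 1-2, the sketch below detects the Top Peak Point in Python. The boundary representation, an ordered list of (x, y) contour pixels, is our assumption:

```python
def find_top_peak(boundary):
    """Find the Top Peak Point Ftpp on a closed object boundary.

    boundary : list of (x, y) pixels ordered along the contour.
    A peak Fn satisfies (Ya - Ya-1 < 0) & (Ya+1 - Ya > 0): the
    y-coordinate decreases into the point and increases after it,
    i.e., a local minimum of y (the top, in image coordinates).
    Returns the peak with the minimum y-coordinate, or None.
    """
    peaks = []
    n = len(boundary)
    for a in range(n):
        ya_prev = boundary[a - 1][1]
        ya = boundary[a][1]
        ya_next = boundary[(a + 1) % n][1]
        if ya - ya_prev < 0 and ya_next - ya > 0:
            peaks.append(boundary[a])
    return min(peaks, key=lambda p: p[1]) if peaks else None
```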
Step 3: Check head symmetry and position: Check the following
conditions:
IF [(HRheight/HLheight) < 2] & [(HLheight/HRheight) < 2];
AND Cdistance < (2*Ndistance);
Step 1: Right peak point: From every pixel of the object’s boundary,
check:
IF (Xa – Xa-1 > 0) & (Xa+1 – Xa < 0);
THEN set (Xa,Ya) = Fn;
where n = number of right peak points in the object.
NEXT set Fn with maximum x-coordinate as Frpp or Right Peak
Point.
Step 2: Side turning point: Following the boundary, from Frpp, search
clockwise and check:
IF (Ya – Ya-1 > 0) & (Ya+1 – Ya < 0);
THEN set (Xa,Ya) = Fbottom;
NEXT, from Fbottom, search clockwise along the boundary and
check:
IF (Ya – Ya-1 < 0) & (Ya+1 – Ya > 0);
THEN set (Xa,Ya) = FbottomN.
Similarly, for top side, following the boundary, from Frpp, search
anticlockwise and check:
IF (Ya+1 – Ya > 0) & (Ya – Ya-1 < 0);
THEN set (Xa,Ya) = Ftop;
NEXT, from Ftop, search anticlockwise along the boundary and
check:
IF (Ya+1 – Ya < 0) & (Ya – Ya-1 > 0);
THEN set (Xa,Ya) = FtopN.
*Only the first turning point encountered is recorded and used.
Step 3: Check head symmetry and position: Check the following
conditions:
IF [(HBheight/HTheight) < 2] & [(HTheight/HBheight) < 2];
AND Cdistance < (2*Ndistance);
where HBheight = horizontal distance between Frpp and FbottomN,
HTheight = horizontal distance between Frpp and FtopN,
Ndistance = vertical distance between FbottomN and FtopN,
Cdistance = [vertical distance between Frpp and FtopN] – [Ndistance/2],
THEN proceed to Step 4,
ELSE Hright = 0 or ‘right head position’ is not detected and skip to
Section 4.1.1.3.
Step 4: Check head curve: Check the following conditions along the
boundaries:
Step 1: Left peak point: From every pixel of the object’s boundary, check:
IF (Xa – Xa-1 < 0) & (Xa+1 – Xa > 0);
THEN set (Xa,Ya) = Fn;
where n = number of peak points in the object.
NEXT set Fn with minimum x-coordinate as Flpp or Left Peak
Point.
Step 2: Side turning point: Following the boundary, from Flpp, search
clockwise and check:
Step 1: Obtain start point: Find the lowest point of the object with
maximum y-coordinate and minimum x-coordinate. Record the point
as Leg Start Point, FlegSP = (XlegSP,YlegSP). Next, find the mean x-
coordinate and mean y-coordinate of the object and set that point as
Object Middle Point, FobjMP = (XobjMP,YobjMP).
Step 2: Obtain turning point:
i. IF (XlegSP - XobjMP < 0), search anticlockwise along the
boundary from FlegSP and check:
IF (Ya – Ya-1 < 0) & (Ya+1 – Ya > 0);
THEN set (Xa,Ya) = FlegTP and proceed to Step 3(i);
ii. IF (XlegSP - XobjMP > 0), search clockwise along the boundary
from FlegSP and check:
IF (Ya – Ya-1 < 0) & (Ya+1 – Ya > 0);
THEN set (Xa,Ya) = FlegTP and proceed to Step 3(ii);
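A possible Python rendering of Steps 1-2 of the leg detection is sketched below. The clockwise contour ordering and the use of the boundary mean as an approximation of the object middle point are our assumptions:

```python
def find_leg_points(boundary):
    """Locate the Leg Start Point FlegSP and Leg Turning Point FlegTP.

    boundary : list of (x, y) contour pixels, assumed ordered clockwise.
    """
    # Step 1: lowest point (maximum y, ties broken by minimum x)
    leg_sp = max(boundary, key=lambda p: (p[1], -p[0]))
    # middle point approximated by the mean of the boundary pixels
    x_mp = sum(p[0] for p in boundary) / len(boundary)
    # Step 2: walk anticlockwise if XlegSP - XobjMP < 0, else clockwise
    step = -1 if leg_sp[0] - x_mp < 0 else 1
    start = boundary.index(leg_sp)
    n = len(boundary)
    for offset in range(1, n):
        a = (start + step * offset) % n
        ya_prev = boundary[(a - step) % n][1]
        ya = boundary[a][1]
        ya_next = boundary[(a + step) % n][1]
        # IF (Ya - Ya-1 < 0) & (Ya+1 - Ya > 0): a local minimum of y
        if ya - ya_prev < 0 and ya_next - ya > 0:
            return leg_sp, boundary[a]
    return leg_sp, None
```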
The spatial moments of the object blob are given by:

$$m_{pq} = \sum_{x}\sum_{y} x^p\, y^q\, f(x, y) \qquad (12)$$

where p, q = 0, 1, 2, 3, …
The center of the ellipse, $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$, can be derived from the zero-order and first-order spatial moments. Then the central moments can be calculated as:

$$\mu_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^p\, (y - \bar{y})^q\, f(x, y) \qquad (13)$$

where p, q = 0, 1, 2, 3, …
Using the central moment, the ellipse’s orientation or the angle between
the major axis of the person and the horizontal axis x can be computed as
follows:
$$\theta = \frac{1}{2}\arctan\!\left(\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right) \qquad (14)$$
Then, to recover the major semi-axis $a$ and the minor semi-axis $b$, the greatest moment of inertia $I_{max}$ and the least moment of inertia $I_{min}$ must be computed. They can be calculated by evaluating the eigenvalues of the covariance matrix:
$$J = \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix} \qquad (15)$$

$$I_{max} = \frac{\mu_{20} + \mu_{02} + \sqrt{(\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2}}{2} \qquad (16)$$

$$I_{min} = \frac{\mu_{20} + \mu_{02} - \sqrt{(\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2}}{2} \qquad (17)$$
Finally, the major semi-axis 𝑎 and the minor semi-axis 𝑏 of the best fitting
ellipse can be calculated as follows:
$$a = \left(\frac{4}{\pi}\right)^{1/4} \left[\frac{(I_{max})^3}{I_{min}}\right]^{1/8} \qquad (18)$$

$$b = \left(\frac{4}{\pi}\right)^{1/4} \left[\frac{(I_{min})^3}{I_{max}}\right]^{1/8} \qquad (19)$$
Hence, the ellipse fitting's ratio is Robject = a/b and the ellipse fitting's orientation is Oobject = θ, as mentioned above.
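The whole moment-based ellipse fit condenses into a few lines of Python. This sketch follows Eqs. (13)-(19); using arctan2 instead of arctan for Eq. (14), and a small epsilon guard for degenerate blobs, are our own robustness choices:

```python
import numpy as np

def fit_ellipse_moments(mask):
    """Fit an ellipse to a binary object mask via image moments.

    Returns (R_object, O_object): the ratio a/b of the semi-axes
    and the orientation theta in degrees.
    """
    ys, xs = np.nonzero(mask)
    xbar, ybar = xs.mean(), ys.mean()        # m10/m00, m01/m00
    # central moments (Eq. 13)
    mu20 = ((xs - xbar) ** 2).sum()
    mu02 = ((ys - ybar) ** 2).sum()
    mu11 = ((xs - xbar) * (ys - ybar)).sum()
    # orientation of the major axis (Eq. 14)
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    # greatest and least moments of inertia (Eqs. 16-17)
    root = np.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    i_max = (mu20 + mu02 + root) / 2
    i_min = max((mu20 + mu02 - root) / 2, 1e-12)  # guard degenerate blobs
    # semi-axes of the best-fitting ellipse (Eqs. 18-19)
    a = (4 / np.pi) ** 0.25 * (i_max ** 3 / i_min) ** 0.125
    b = (4 / np.pi) ** 0.25 * (i_min ** 3 / i_max) ** 0.125
    return a / b, np.degrees(theta)
```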
From the head detection and leg detection information obtained in Section
4.1, check:
Posture: Conditions
Stand:
  If (Hcenter = 1) & (Lpresence = 1)
  If (Hcenter = 1) & (Lpresence = 0) & (RstandMin < Robject < RstandMax) & (OstandMin < Oobject < OstandMax)
Bend:
  If (Hcenter = 0) & (Lpresence = 1) & (RbendMin < Robject < RbendMax) & (ObendMin < Oobject < ObendMax)
  If [(Hright = 1) or (Hleft = 1)] & (Lpresence = 0) & (RbendMin < Robject < RbendMax) & (ObendMin < Oobject < ObendMax)
Sit:
  If (Hcenter = 1) & (Lpresence = 0) & (RsitMin < Robject < RsitMax) & (OsitMin < Oobject < OsitMax)
Lie:
  If (Hcenter = 0) & (Hright = 0) & (Hleft = 0) & (Lpresence = 0) & (RlieMin1 < Robject < RlieMax1) & (OlieMin1 < Oobject < OlieMax1)
  If [(Hright = 1) or (Hleft = 1)] & (Lpresence = 0) & (RlieMin2 < Robject < RlieMax2) & (OlieMin2 < Oobject < OlieMax2)
  If (Hcenter = 0) & (Lpresence = 0) & (RlieMin2 < Robject < RlieMax2) & (OlieMin3 < Oobject < OlieMax3)
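The rule table above translates directly into code. The sketch below applies the conditions in table order; the layout of the threshold dictionary and the naming of the three 'lie' rule sets are our assumptions (the third lie rule reuses the lie-2 ratio bounds with the lie-3 orientation bounds, as in the table):

```python
def classify_posture(h_center, h_right, h_left, l_presence,
                     r_object, o_object, th):
    """Rule-based posture classification (sketch of the table above).

    th maps a rule name to (r_min, r_max, o_min, o_max), e.g.,
    th['stand'], th['bend'], th['sit'], th['lie1'], th['lie2'] and
    th['lie3'], with th['lie3'] = (RlieMin2, RlieMax2, OlieMin3, OlieMax3).
    """
    def in_range(key):
        r_min, r_max, o_min, o_max = th[key]
        return r_min < r_object < r_max and o_min < o_object < o_max

    if h_center and l_presence:
        return 'stand'
    if h_center and not l_presence and in_range('stand'):
        return 'stand'
    if not h_center and l_presence and in_range('bend'):
        return 'bend'
    if (h_right or h_left) and not l_presence and in_range('bend'):
        return 'bend'
    if h_center and not l_presence and in_range('sit'):
        return 'sit'
    if not (h_center or h_right or h_left or l_presence) and in_range('lie1'):
        return 'lie'
    if (h_right or h_left) and not l_presence and in_range('lie2'):
        return 'lie'
    if not h_center and not l_presence and in_range('lie3'):
        return 'lie'
    return 'unknown'
```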
Parameter optimization is performed to obtain the best results for the extreme point curvature analysis algorithm and the integrated body contours algorithm. The parameters include the background subtraction threshold, the head & leg detection thresholds, the posture conditions and the ellipse fitting's ratio & orientation, as described in the following sections.
Figure 15. (a) TBS = 40, (b) TBS = 50, (c) TBS = 60, (d) TBS = 70, (e) TBS = 80.
From Figure 15, we can observe that the object blob in the image with TBS = 40 has a deformed shape, where the leg and arm parts are totally covered by white pixels. The object blob in the image with TBS = 50 still has its arm (right side) covered by white pixels, and there is a lot of noise around the leg region. The object blob in the image with TBS = 60 resembles the actual object the most. Lastly, the object blobs in the images with TBS = 70 and 80 are smaller around the arm, leg and head regions compared to the actual object. Hence, TBS = 60 is selected as the ideal background subtraction threshold value.
Figure 19 shows the graph of accuracy versus threshold for the leg curve. The accuracy starts at a peak of 99.4% at a threshold value of 0.1, remains constant, and then drops to 98.8% at a threshold value of 0.7. The accuracy continues to drop until the end, reaching 50% at a threshold value of 1.0. Hence, 0.6 is selected as the leg curve threshold (Cleg), since it retains a high detection accuracy of 98.8%. A threshold of 0.6 is selected instead of 0.1 to 0.5 because the higher the threshold, the lower the false alarm rate.
Posture: Explanation
Stand:
  From the outline of the shape in Figure 20(a), the center head position and the leg curve can be seen.
  From the outline of the shape in Figure 20(b), the center head position can be seen while the leg curve is absent.
Bend:
  From the outline of the shape in Figure 20(c), the leg curve can be seen while the center head position is absent.
  From the outline of the shape in Figure 20(d), either the right or the left head position can be seen while the leg curve is absent.
Sit:
  From the outlines of the shapes in Figure 20(e) and Figure 20(f), the center head position can be seen while the leg curve is absent.
Lie:
  From the outline of the shape in Figure 20(g), the center head position, the right head position, the left head position and the leg curve are all absent.
  From the outline of the shape in Figure 20(h), either the right or the left head position can be seen while the leg curve is absent.
  This condition is a mitigation step to reduce false alarms. It is an extension of the rule for the outline in Figure 20(g); hence, they have similar conditions where the head and leg curves are absent.
* Note that Table 6 directly mirrors Table 5: the explanation at a given row and column in Table 6 corresponds to the condition in the same row and column of Table 5.
Figure 20. (a) Stand frontal, (b) Stand side, (c) Bend frontal, (d) Bend side, (e) Sit
frontal, (f) Sit side, (g) Lie frontal, (h) Lie side.
As for Robject and Oobject, which are also included in the conditions in Section 4.3, the explanations are given in Section 5.1.4 below.
Figure 21. (a) Ratio graph for stand posture, (b) Orientation graph for stand posture.
Figure 22. (a) Ratio graph for bend posture, (b) Orientation graph for bend posture.
Based on Figure 22(a), the minimum bend ratio (RbendMin) is set as 1.5 and the maximum bend ratio (RbendMax) is set as 3.5. Based on Figure 22(b), the
Figure 23. (a) Ratio graph for sit posture, (b) Orientation graph for sit posture.
Figure 24. (a) Ratio graph for lie posture, (b) Orientation graph for lie posture.
Based on Figure 23(a), the minimum sit ratio (RsitMin) is set as 1 and the maximum sit ratio (RsitMax) is set as 2.5. Based on Figure 23(b), the minimum sit orientation (OsitMin) is set as 70 and the maximum sit orientation (OsitMax) is set as 90.
The fluctuations shown in Figure 24(a) and Figure 24(b) are due to the shifting of the posture from the frontal view to the side view, which increases both the ellipse fitting's ratio and orientation. Based on Figure 24(a), the minimum lie ratio 1 (RlieMin1) is set as 0 and the minimum lie ratio 2 (RlieMin2) is set as 1.9, while the maximum lie ratio 1 (RlieMax1) is set as 1.9 and the maximum lie ratio 2 (RlieMax2) is set as 8. The first set of ratios is for the lie posture in the frontal position, and the second set is for the lie posture in the side position. Based on Figure 24(b), the minimum lie orientation 1 (OlieMin1) is set as 0, the minimum lie orientation 2 (OlieMin2) is set as 0 and the minimum lie orientation 3 (OlieMin3) is set as 5, while the maximum lie orientation 1 (OlieMax1) is set as 90, the maximum lie orientation 2 (OlieMax2) is set as 5 and the maximum lie orientation 3 (OlieMax3) is set as 60. Similarly, the first set of orientations is for the lie posture in the frontal position and the second set is for the side position; the third set is a special set built as a mitigation measure for better accuracy, as stated in Table 6.
A total of 5,136 images from 4 different individuals are used to test the extreme point curvature analysis algorithm. The evaluation is carried out using "Operator Perceived Activity" (OPA) [39], where the operator compares the output of the algorithm with the actual condition in the image. Our approach is also compared, in terms of accuracy, with works done in recent years, such as the size filter method [40] and the head detection method [41]. The results are as follows:
Metrics        Our approach    Modified aspect ratio [42]    Head detection [43]
Accuracy (%)   92.20           90.60                         78.13
REFERENCES
[1] Gerald N. H. and Kathleen T. H. (2016) “The People’s Law
Dictionary”, [online], Retrieved 2016 November 5 from http://legal-
dictionary.thefreedictionary.com/Trespassers.
[2] Shannan C. (2010). “Victimization during Household Burglary”,
[online], Retrieved 2016 November 5 from http://www.bjs.gov/
content/pub/pdf/vdhb.pdf.
[3] McGoey, C.E. (2012). “Home Security: Burglary Prevention Advice”.
Aegis Books Inc. 2012.
[4] Miller, J. C., Smith M. L. and McCauley M. E. (1998). “Crew Fatigue
and Performance on U.S. Coast Guard Cutters”. U.S. Coast Guard
Research & Development Center, 1998.
[5] American Heritage, Dictionary of the English Language, Fifth Edition.
(2011). “Health care”. Houghton Mifflin Harcourt Publishing Company,
[online], Retrieved 2016 November 5 from http://www.
thefreedictionary.com/health+care.
[6] Random House Kernerman Webster’s College Dictionary (2010).
“Health care”. Random House, Inc. [online], Retrieved 2016 November
5 from http://www.thefreedictionary.com/health+care.
[7] Adams, P. F., Martinez, M. E., Vickerie, J. L. and Kirzinger, W. K.
(2011). “Summary health statistics for the U.S. population”, National
Health Interview Survey, 2010, Vital and Health Statistics Series.
[8] Centers for Disease Control and Prevention (2016). "Falls Among Older Adults: An Overview", [online], Retrieved 2016 November 5 from
http://www.cdc.gov/HomeandRecreationalSafety/Fals/adultfalls.html.
[9] Stevenson, S. (2014). “10 Products You’ve Never Heard Of”. [online],
Retrieved 2016 November 5 from http://www.aplaceformom.com/blog/
2014-6-1-cutting-edge-products-for-seniors/.
[10] Baker, A. (2016). “The Top 5 Safety Wearable Products for Seniors”,
[online], Retrieved 2016 November 5 from http://www.safewise.com/
blog/top-safety-wearable-products-for-seniors/.
[11] Miller. J. T. (2016). “How to Keep Tabs On an Elderly Parent with
Video Monitoring”, [online], Retrieved 2016 November 5 from
http://www.huffingtonpost.com/jim-t-miller/how-to-keep-tabs-on-an-
el_b_8954044.html.
[12] Fischler, M. A. and Bolles, R. C. (1981). "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography", Comm. of the ACM, Vol. 24: pp. 381-395.
[13] Corke, P., Strelow, D., and Singh, S. (2004). "Omnidirectional visual odometry for a planetary rover". International Conference on Intelligent Robots and Systems (IROS 2004), Vol. 4, pp. 4007-4012.
[14] Durrant-Whyte, H., and Bailey, T. (2006). "Simultaneous Localization and Mapping (SLAM): Part I The Essential Algorithms". Robotics and Automation Magazine Vol. 13: pp. 99-110.
[15] Kawanishi, T., Yamazawa, K., Iwasa, H., Takemura, H., and Yokoya, N. (1998). "Generation of High-resolution Stereo Panoramic Images by Omnidirectional Imaging Sensor Using Hexagonal Pyramidal Mirrors", Proc. 14th Int. Conf. on Pattern Recognition, Vol. 1, pp. 485-489.
[16] Ishiguro, H., Yamamoto, M., and Tsuji, S. (1992). "Omni-Directional Stereo", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 14, No. 2, pp. 257-262.
[17] Huang, H-C., and Hung, Y. P. (1998). "Panoramic Stereo Imaging System with Automatic Disparity Warping and Seaming", Graphical Models and Image Processing, Vol. 60, No. 3, pp. 196-208.
[18] Peleg, S., and Ben-Ezra, M. (1999). "Stereo Panorama with a Single Camera", Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 395-401.
[19] Shum, H., and Szeliski, R. (1999). "Stereo Reconstruction from Multi-perspective Panoramas", Proc. Seventh Int. Conf. Computer Vision, pp. 14-21.
[20] Chen, S. E. (1995). "QuickTime VR: An Image-Based Approach to Virtual Environment Navigation", Proc. of the 22nd Annual ACM Conf. on Computer Graphics, pp. 29-38.
[21] Kumar, J., and Bauer, M. (2000). "Fisheye lens design and their relative performance", Proc. SPIE, Vol. 4093, pp. 360-369.
[22] Pajdla, T., and Roth, H. (2000). "Panoramic Imaging with SVAVISCA Camera - Simulations and Reality", Research Reports of CMP, Czech Technical University in Prague, No. 16.
[23] Oh, S. J., and Hall, E. L. (1987). "Guidance of a Mobile Robot Using an Omnidirectional Vision Navigation System", Proc. of the Society of Photo-Optical Instrumentation Engineers, SPIE, 852, pp. 288-300.
[24] Kuban, D. P., Martin, H. L., Zimmermann, S. D., and Busico, N. (1994). "Omniview Motionless Camera Surveillance System", United States Patent No. 5,359,363.
[25] Nalwa, V. (1996). "A True Omnidirectional Viewer", Technical Report, Bell Laboratories, Holmdel, NJ 07733, USA.
[26] Akihiko T., Atsushi I. (2004). “Panoramic Image Transform of
Omnidirectional Images Using Discrete Geometry Techniques”, in
Proceedings of the 2nd International Symposium on 3D Data
Processing, Visualization, and Transmission (3DPVT’04).
[27] Jeng, S.W. and Tsai, W.H. (2007). “Using pano-mapping tables for
unwarping of omni-images into panoramic and perspective-view
images”, in IET Image Process., 1, (2), pp. 149–155.
[28] Jurie, F. (1999). “A new log-polar mapping for space variant imaging:
Application to face detection and tracking”, Pattern Recognition,
Elsevier Science, 32:55, pp. 865-875.
[29] Pua, W. S., Wong, W. K., Loo, C. K. and Lim, W. S. (2013). “A Study
of Different Unwarping Methods for Omnidirectional Imaging”,
Computer Technology and Application 3, pp. 226-239.
[30] Huang, D. S., Wunsch, D.C., Levine, D.S., Jo, K-H. (2008). Advanced
intelligent computing theories and applications: with aspects of
theoretical and methodological issues in 4th International Conference on
Intelligent Computing, ICIC 2008, Shanghai, China, September.
[31] Hampapur, A., Brown, L., Connell, J., Ekin, A., Haas, N., Lu, M. et al.
(2005). "Smart Video Surveillance", IEEE Signal Processing Mag., pp.
39-51.
[32] George, W., Siavash, Z. (2000). “Robust Image Registration Using Log-
Polar Transform”, in Proc. of IEEE Intl. Conf. on Image Processing.
[33] Traver, V. J., Alexandre B. (2010). “A review of log-polar imaging for
visual perception in robotics” in Robotics and Autonomous Systems 58,
pp. 378-398.
[34] José S. V., Alexandre B. (2003). "Vision-based Navigation,
Environmental Representations and Imaging Geometries”, in VisLab-
Chapter 2

TRACKING MOVING OBJECTS IN VIDEO SURVEILLANCE SYSTEMS WITH KALMAN AND PARTICLE FILTERS – A PRACTICAL APPROACH
Grzegorz Szwoch*
Gdansk University of Technology,
Department of Multimedia Systems, Gdansk, Poland
ABSTRACT
The development and tuning of an automated object tracking system for implementation in a video surveillance system is a complex task, requiring an understanding of how these algorithms work and experience in choosing proper algorithm parameters to obtain accurate results. This Chapter presents a practical approach to the problem of single-camera object tracking, based on object detection and tracking with Kalman filters and particle filters. The aim is to provide practical guidelines for specialists who design, tune and evaluate video surveillance systems based on the automated tracking of moving objects.
The main components of the tracking system, the most important
parameters and their influence on the obtained results, are discussed. The
described tracking algorithm starts with the detection phase which
* Email: greg@sound.eti.pg.gda.pl.
INTRODUCTION
Smart cameras are the current trend in video surveillance systems [Nab01]. The number of cameras installed in modern video monitoring solutions keeps increasing, and a human operator is not able to notice every important event that occurs in the surveyed areas. Therefore, automated video content analysis (VCA) algorithms are implemented as a 'helper' for the operators [Lin11]. Traditionally, such algorithms require powerful workstations to run complex video analysis in real time. However, this situation started to change in recent years, with the development of powerful and energy-efficient computing platforms such as GPUs, FPGAs and DSPs.
Such devices may be implemented within embedded camera systems, forming
smart devices that combine video sensors with processors running VCA
algorithms, within a single device.
Various complex VCA algorithms may be implemented in surveillance
systems, performing detection of specific events, such as abandoned luggage,
robbery, traffic law violations, etc. However, most of these algorithms are
built upon two basic operations: object detection and object tracking [Czy11].
The former extracts moving objects from video, and the latter tracks
movement of individual objects as long as they are present in the camera view.
These 'tracks' of moving objects may then be used for a high-level analysis of
the object behavior. Therefore, the developer of VCA solutions has to ensure
that these two basic operations are performed with a sufficient accuracy. If the
object is lost or the tracker is assigned to a wrong object, further event analysis
becomes impossible. Therefore, this Chapter focuses on performing the object
tracking stage in a way that accurate tracks, useful for further analysis, are
obtained.
Object detection identifies individual objects in static images (single video
frames), and produces data needed for object tracking. There are two main
approaches to object detection. The first one aims to detect a specific class of
objects, employing an algorithm trained with sample images of the desired
class. Such algorithms include the Viola-Jones detector [Vio01], commonly
used for face detection, and histograms of oriented gradients (HoG), usually
applied to human detection [Dal05]. The other group of methods is based on
background subtraction and it usually employs a statistical background model
in order to separate moving objects from the background. Such a model has to
be constantly updated in order to adapt to varying conditions. An algorithm of
this type which is commonly applied for the VCA is based on Gaussian
mixture models (GMM), as proposed by Stauffer and Grimson [Sta99], and
later extended by Zivkovic [Ziv06]. There are also other background
subtraction algorithms, e.g., the Codebook algorithm which utilizes a layered
background model [Kim05]. The main drawback of the background
subtraction approach is that it works only in fixed view cameras.
For successful tracking, the position of each object has to be established in
each analyzed video frame. A collection of the extracted object data
constitutes a track of the object. The main challenge in object tracking is
related to conflict situations in which different moving objects overlap in the
camera view (occlusion) or they are divided into separate objects. If the
frequency of such conflicts is relatively low and individual objects may be
detected most of the time, algorithms that track the results of object detection
(image regions, often called blobs) are usually employed. Such algorithms
work on the ‘prediction-update’ principle. The most notable algorithm of this
type is based on Kalman filters [Wel04]. It is often used in VCA applications
because it is computationally efficient and accurate as long as the conflicts are
relatively rare and short-term [Czy11]. Particle filters [Aru02, Ris04] are an
alternative approach which is significantly more robust to tracking conflicts,
but it is also much more demanding in terms of processing resources, and less
accurate in tracking the object size. Therefore, particle filters are less common
in VCA systems than Kalman filters, but their advantages were utilized in a
number of published works. For example, Isard and Blake [Isa98] proposed the Condensation algorithm based on particle filters, for tracking object contours in a cluttered environment. Nummiaro et al. [Num03] used particle
filters based on color histograms for tracking moving objects in video. Czyz et
al. [Czy06] extended the particle tracker with automatic object detection and
tracking multiple objects.
The tracking methods described above are not suitable for crowded
scenes, with a large number of objects and a high occurrence of object
occlusions. Such scenarios require a different approach to tracking. For
example, the CamShift algorithm [Bra98] searches for a specified object in the
image with a sliding window, using the histogram back-projection method.
Optical flow methods (e.g., Lucas-Kanade [Luc81] and Horn-Schunck
[Hor81]) are often used for tracking objects in busy scenes. Dense optical flow
methods detect the movement by analyzing all image pixels, while sparse
optical flow algorithms analyze the movement of key points (corners),
detected e.g., with the Shi-Tomasi algorithm [Shi94]. The optical flow
approaches are computationally demanding and because of this, their
implementation in VCA systems is still a challenge.
This Chapter focuses on the first type of object tracking algorithms,
namely on Kalman and particle filters. A theory of these algorithms may be
found in many publications, and there are also reports on implementations of these
approaches to object tracking in video. However, developers of VCA systems
still face two important problems. The first one is related to obtaining accurate
measurements of positions and sizes of the tracked objects, required for
updating their tracker. It is easy to do if the object is clearly identified in the
camera image, but in the case of tracking conflicts, obtaining a valid measurement is not trivial. The second problem is related to parameter tuning in the object detection and tracking algorithms, in order to obtain accurate object
tracks. Despite the abundance of publications on object tracking in video with
these methods, it is not easy to find a clear solution to both problems. This
Chapter has therefore two main aims. First, it attempts to fill the
abovementioned gap, by describing the influence of the algorithm parameters
on the obtained results, and also presenting the problem of obtaining accurate
measurements for updating the tracking filter in presence of conflicts. Second,
a novel approach that combines Kalman filters with particle filters, is
proposed. This dual-type tracker uses the simpler Kalman filter when there are
no conflicts, and the more demanding particle tracker only for resolving these
conflicts.
The rest of the Chapter is organized as follows. The next Section describes
the background subtraction procedure based on the GMM, and discusses the
influence of its parameters on the accuracy of object detection. Next, the
Kalman filter algorithm which tracks objects using data obtained from the
object detection stage, is presented. A relationship between the algorithm
parameters and the tracking accuracy, as well as problems related to tracking
conflicts, are discussed. Then, a tracker utilizing a particle filter is presented.
In the subsequent Section, the combined tracker is proposed in order to obtain
more accurate measurements for updating the Kalman filter in case of
conflicts, by utilizing the advantages of both filter types. The final Sections
present a method of evaluating the performance of tracking algorithms, discuss
the results of tests of the presented algorithms, and finish the Chapter with the
Conclusion.
OBJECT DETECTION
Performing object tracking in a video stream requires that data on the
position and size of each moving object, in every analyzed video image, is
obtained first. Therefore, the task of the object detection procedure is to
analyze the individual video frames, and to extract image regions representing
important moving objects that should be tracked. In this Section, the classic
approach based on background subtraction with the GMM algorithm, is
presented. The obtained results are post-processed, and then connected
components (blobs) representing moving objects are extracted, forming data
suitable for tracking.
Background Subtraction
$$\left| x - \mu \right| \le b\,\sigma \qquad (1)$$
where b is a factor determining the maximum distance between the pixel value and the model mean; usually, b is 2.0 to 3.0. A separate background model has to be constructed for each image pixel. For each analyzed video frame, every pixel is compared with its model and assigned to either the background or the foreground. In practice, three-channel (RGB) images are analyzed, so the background models have to store μ and σ for each color channel separately. The pixel belongs to the background if Eq. 1 is fulfilled for all three channels.
A pre-learned background model would become invalid if, for example,
light in the scene changes (e.g., sun comes out from the clouds). Therefore, the
background models need to be constantly updated. In the GMM algorithm,
each model is initialized by setting its mean to the value of the first observed
pixel, and the initial variance to a predefined, high value. For each analyzed
image, means and variances of the models are updated if the pixel was
assigned to the background, using the following equations [Sta99]:
$$\mu_t = \mu_{t-1} + \rho\,(x_t - \mu_{t-1}) \qquad (2)$$

$$\sigma_t^2 = \sigma_{t-1}^2 + \rho\left[(x_t - \mu_t)^2 - \sigma_{t-1}^2\right] \qquad (3)$$
where x is the pixel value, t is the time index and ρ is the background update rate. Higher values of ρ result in a model that adapts quickly to changes, but
frequent changes in the background may prevent the model from becoming
stable. Lower values cause the model to react slowly to changes, resulting in a
more stable model which needs more time to ‘re-learn.’ During the adaptation
phase (before these changes are fully incorporated into the model), a large number of false detections may occur. The weight of each Gaussian is also updated in every frame [Ziv06]:

$$\pi_k \leftarrow \pi_k + \alpha\,(o_k - \pi_k) - \alpha\,c_T \qquad (4)$$

where π is the weight, o is one if the Gaussian was matched and zero
otherwise, and α is the learning ratio, which has a similar meaning to ρ (often, identical values are used for both parameters). The parameter cT was
introduced by Zivkovic [Ziv06] in order to reduce the influence of older (not
updated recently) Gaussians on the detection. When cT is 0, the classic GMM
algorithm is used. Additionally, the modified algorithm chooses k
dynamically. After the update, the weights are normalized to the unit sum, and
the Gaussians are ordered by a decreasing ratio of the weight to the variance.
The decision whether the pixel belongs to the foreground or to the
background is made as follows. The number of distributions that describe the
background is given by [Sta99]:
$$B = \arg\min_{b}\left(\sum_{k=1}^{b} \pi_k > T\right) \qquad (5)$$
where T is the threshold, usually set to 0.5-0.7. This approach limits the actual
background model to a number of Gaussians with sufficiently high weights. If
a matching Gaussian was found and it is one of B Gaussians with the highest
weights, the pixel is assigned to the background, otherwise it is classified as
the foreground.
Background subtraction with the GMM may be interpreted as follows.
Initially, the Gaussian modeling a given pixel has the mean equal to the value
of this pixel, and a high variance. If the Gaussian is matched in the consecutive
video frames, its variance decreases gradually, with the coefficient ρ determining the speed of this adaptation. It is useful to limit the minimum
variance value in order to prevent ‘overfitting.’ The variance becomes low for
a stable background and it is higher in case of e.g., frequent light changes or
the camera noise.
The background model is usually initialized with the first received video
frame, so the initial model includes moving objects present in this image.
Therefore, the model initialization should be performed when no moving
objects are present, or the model has to be learned for a defined time before the
actual detection is started. Alternatively, the learning rates ρ and α may be set to much higher values during the initialization phase, for example:

$$\alpha_t = \max\left(\frac{T_{init}}{t},\ 0.5\right), \quad t \le T_{init} \qquad (6)$$
where t is the frame index and Tinit is the number of frames used for
initialization. With this approach, the learning parameters start with high
values and they gradually decrease towards the target values. As a result, the
detection accuracy is improved during the initialization.
Background subtraction requires a significant amount of processing time
and it has high memory requirements. The following considerations should be
taken into account when this algorithm is implemented.
Video resolution. Each image pixel is analyzed independently. Therefore,
the video resolution has the largest impact on the processing time and on the
memory usage. A powerful processing hardware is required for a real-time
analysis of high resolution video streams. If only limited resources are available, the video resolution may be decreased by downscaling each image,
e.g., by a factor of 2. This operation reduces the processing requirements
significantly, but at the same time, the analysis resolution is decreased. It
should also be noted that the algorithm is well suited for parallel
implementation, e.g., on GPU platforms [Szw15, Szw16].
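As a practical illustration, OpenCV ships an implementation of Zivkovic's GMM variant. The sketch below shows a typical processing loop; the parameter values, the input file name and the morphological cleanup are examples chosen by us, not recommendations from this chapter:

```python
import cv2

# Zivkovic-style GMM background subtractor; varThreshold plays a role
# analogous to the (b*sigma)^2 matching distance discussed above
mog2 = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                          detectShadows=True)

cap = cv2.VideoCapture('surveillance.avi')      # example input file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # downscale by a factor of 2 to reduce processing time and memory
    small = cv2.resize(frame, None, fx=0.5, fy=0.5)
    fg_mask = mog2.apply(small)                 # 255 marks foreground
    # simple morphological cleanup before connected-component extraction
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)
cap.release()
```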
Each detected object (blob) is described by the position of its center point and its size:

$$b = \left(x_c,\ y_c,\ w,\ h\right) \qquad (7)$$
The most important problems that may occur during the background
subtraction stage, and possible approaches to avoid at least some of them, are
summarized below.
Changes in the scene lighting. This problem is important in both the
outdoor scenes, when clouds move across the sky, and the indoor scenes, when
the natural or artificial light changes. Frequent variations in the scene lighting
require a re-adaptation of the model, and during this phase, object detection is
practically impossible because of an excessive number of false-positive
results. In some situations, this problem may be reduced by increasing the
learning rate and using a larger number of Gaussians per pixel. Special
algorithms performing backlight compensation may also be useful.
Objects incorporated into the background. Objects that remain stationary
for a prolonged time may eventually be learned as the background. An
example is a vehicle stopping at the red light. When this object moves, it
leaves a ‘background hole’ which, without further analysis, would be detected
as a moving object. This problem may be reduced by lowering the learning
rate, which contradicts the solution to the previous problem. Background holes
may be detected by an analysis of the pixel values in the neighborhood of the
blob contour. In case of holes, there should be a smooth transition of the pixel
values, while for the actual objects, edges should be detected alongside the
contour [Szw10].
Camouflaging. This is one of the most common problems in background subtraction. When the color of a moving object is very similar to the background (e.g., a person in a white shirt on a light gray wall), some pixels of the moving object are wrongly assigned to the background, which fragments or shrinks the detected blob.
Figure 3. Examples of typical object detection problems. Top: occlusion, a vehicle and
three persons are represented with a single, merged blob. Bottom: fragmentation, the
white vehicle is fragmented into two blobs due to an inaccurate background
subtraction.
Figure 4. A track of the moving object – positions of the tracked object, detected in the
analyzed video frames, are marked with white dots. The top images present the
examples of tracking results (the bounding box of the white van is shown).
Kalman filters (KF) are commonly used for tracking data from noisy
sensors [Wel04], and they are applicable to object tracking in video [Czy11]. It
is assumed that both the dynamic process and the measurements are
contaminated with noise having a normal distribution. Tracking moving
objects may be performed using a first order or a second order dynamic model
(more complex models are also possible) [Li03]. A vector of the tracker state
variables for the first order model is given by:
$$\mathbf{s} = \left[x,\ y,\ \dot{x},\ \dot{y},\ w,\ h,\ d_w,\ d_h\right]^{T} \qquad (8)$$
where the variables denote: the position of the center point of the object’s
bounding box, the object velocity (in x and y directions), the size of the
bounding box, and the size change. Alternatively, a single scale factor may be
used instead of the two size change variables, or they may be omitted from the
vector (assuming that the tracker size is constant). For the second order
dynamic model, the acceleration is also included:
$$\mathbf{s} = \left[x,\ y,\ \dot{x},\ \dot{y},\ \ddot{x},\ \ddot{y},\ w,\ h,\ d_w,\ d_h\right]^{T} \qquad (9)$$

In the first order model, the state evolves between frames as:

$$\begin{aligned} x_t &= x_{t-1} + \dot{x}_{t-1}, & y_t &= y_{t-1} + \dot{y}_{t-1}, \\ \dot{x}_t &= \dot{x}_{t-1}, & \dot{y}_t &= \dot{y}_{t-1}, \\ w_t &= w_{t-1} + d_{w,t-1}, & h_t &= h_{t-1} + d_{h,t-1}, \\ d_{w,t} &= d_{w,t-1}, & d_{h,t} &= d_{h,t-1} \end{aligned} \qquad (10)$$

In the second order model, the position and velocity equations also include the acceleration:

$$\begin{aligned} x_t &= x_{t-1} + \dot{x}_{t-1} + 0.5\,\ddot{x}_{t-1}, & y_t &= y_{t-1} + \dot{y}_{t-1} + 0.5\,\ddot{y}_{t-1}, \\ \dot{x}_t &= \dot{x}_{t-1} + \ddot{x}_{t-1}, & \dot{y}_t &= \dot{y}_{t-1} + \ddot{y}_{t-1}, \\ \ddot{x}_t &= \ddot{x}_{t-1}, & \ddot{y}_t &= \ddot{y}_{t-1} \end{aligned} \qquad (11)$$

where the size equations are the same as in the first order model.
It is not trivial to justify the choice of either model. The first order model
assumes a constant velocity, which is not true in practical situations, so this
model is oversimplified. On the other hand, accurate modeling of the velocity
changes due to both the actual acceleration and deceleration of the object, and
the camera perspective effect, with the second order model is problematic, and
also a larger number of the state variables affects the filter performance. The
second order model may be useful if the tracked objects are expected to slow
down and accelerate very often (e.g., vehicles at traffic lights), so that the
trackers do not lose their objects. In simpler scenarios (e.g., persons inside a
building), the first order model should be sufficient.
$$\mathbf{s}_t^- = A\,\mathbf{s}_{t-1} \qquad (12)$$
where A is the process matrix. In the scenario described here, the matrix A for
the first and the second order models may be determined from Eqs. 10 and 11:
$$A_{I} = \begin{pmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\qquad
A_{II} = \begin{pmatrix}
1 & 0 & 1 & 0 & 0.5 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0.5 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \qquad (13)$$
Next, an estimate of the error covariance matrix is computed:

$$P_t^- = A\,P_{t-1}\,A^{T} + Q \qquad (14)$$
The matrix P is initialized with the starting (usually high) variance and it
is updated internally by the filter. The matrix Q describes the process noise
covariance and it represents the uncertainty in the dynamic model. Usually, in
tracking moving objects in cameras, it is a diagonal matrix, with values on the
diagonal representing the variance of each state variable. Often, the same
value is used for the variance of all variables, but also separate variances may
be set e.g., for estimation of the position and the size.
From the predicted state variable vector, a predicted tracker state may be
obtained by selecting the variables related to the position and the size:
$$\mathbf{o}_t = \left(x_t,\ y_t,\ w_t,\ h_t\right) \qquad (15)$$
$$K_t = P_t^- H^{T} \left(H P_t^- H^{T} + R\right)^{-1} \qquad (16)$$

$$\mathbf{s}_t = \mathbf{s}_t^- + K_t\left(\mathbf{z}_t - H\,\mathbf{s}_t^-\right) \qquad (17)$$

$$P_t = \left(I - K_t H\right) P_t^- \qquad (18)$$
$$H_{I} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}
\qquad
H_{II} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix} \qquad (19)$$
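Putting Eqs. (8) and (12)-(19) together, a first order blob tracker can be sketched in Python with NumPy. The noise values passed to the constructor are placeholders only and must be tuned per scenario, as discussed below:

```python
import numpy as np

class BlobKalman:
    """First order Kalman tracker for a blob (x, y, w, h); a sketch
    of Eqs. 8 and 12-19 with example noise settings."""

    def __init__(self, x, y, w, h, q_var=1e-2, r_var=1.0, p0=1.0):
        # state s = [x, y, vx, vy, w, h, dw, dh], Eq. 8
        self.s = np.array([x, y, 0, 0, w, h, 0, 0], dtype=float)
        self.A = np.eye(8)                  # process matrix A_I, Eq. 13
        self.A[0, 2] = self.A[1, 3] = 1.0   # position += velocity
        self.A[4, 6] = self.A[5, 7] = 1.0   # size += size change
        self.H = np.zeros((4, 8))           # measurement matrix H_I, Eq. 19
        self.H[0, 0] = self.H[1, 1] = self.H[2, 4] = self.H[3, 5] = 1.0
        self.Q = np.eye(8) * q_var          # process noise covariance
        self.R = np.eye(4) * r_var          # measurement noise covariance
        self.P = np.eye(8) * p0             # initial error covariance

    def predict(self):
        self.s = self.A @ self.s                              # Eq. 12
        self.P = self.A @ self.P @ self.A.T + self.Q          # Eq. 14
        return self.s[[0, 1, 4, 5]]                           # Eq. 15

    def update(self, z):
        """z: measured (x, y, w, h) of the matched blob."""
        K = self.P @ self.H.T @ np.linalg.inv(
            self.H @ self.P @ self.H.T + self.R)              # Eq. 16
        self.s = self.s + K @ (np.asarray(z) - self.H @ self.s)  # Eq. 17
        self.P = (np.eye(8) - K @ self.H) @ self.P            # Eq. 18
```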
The three noise variances are the most important parameters that affect the
tracker performance in tracking moving objects in the camera. Similarly to the
background subtraction algorithm, there are no universal values for these
parameters and they should be tuned for a specific tracking scenario. However,
some guidelines may be provided in order to help the algorithm developers in
optimizing the tracking system.
Error variance (P). Only the initial values have to be provided, the matrix
is later updated by the filter. These values should be high because of an
uncertainty of the initial state. For example, we know the initial position and
the size of the object, but we don’t know its velocity, so the latter may get a
higher noise variance. The error variance will decrease as the filter converges.
A value of 1 is often used for the initial error variance for all variables.
However, since the initial estimates of the velocity, acceleration and size
change are inaccurate, it is justified to set a larger variance for these variables.
Process noise (Q). Variance of the process noise determines the balance
between the deterministic and the random component of the modeled process.
A low variance means that we expect the dynamic model to describe the
process accurately. For example, if we know that the tracked objects always
move with a constant velocity, then we can rely on the first order dynamic
model, by setting a low variance of the process noise. In practice, the velocity
of moving objects is not constant. If the noise variance is set to a low value
and the object stops, the tracker may overshoot and lose the object. Therefore, a
higher variance value is needed in this case. On the other hand, the variance
set too high will cause the filter not to trust the dynamic model and to assume
that the movement is more random in nature. As a result, random changes of
the tracker state may result in losing the tracked object. Therefore, tuning the
process noise variance requires finding a value that balances both cases. For
example, if a camera shows pedestrians on a sidewalk, small velocity changes
may be expected, so lower variance values may be sufficient. If a camera
observes a busy road intersection, higher variance values will be needed.
Additionally, different values may be used for separate variables, e.g., velocity
changes may be more random than size variations.
Measurement noise (R). Variance of the measurement noise defines a
confidence on the measurements provided to the tracker for updating its state.
If the measurements are less accurate (more noisy), a higher variance should
be set. This parameter sets the balance between the state predicted by the filter
and the measurements. For example, a low variance of the measurement noise
means that we are confident in the measurement accuracy, so the predicted
state will be largely ignored in the update phase. Conversely, a high variance
means that the predicted state is more important than the measurement. In
practice, this variance cannot be set too high, because the filter will ignore the
measurements and the tracked object may be lost if it changes its velocity or
the direction of movement. On the other hand, the variance that is set too low
may disrupt the tracker by incorporating inaccurate measurements. Similarly
to the process noise, different values of the measurement noise may be set for
individual variables.
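A minimal first-order tracker configuration, assuming the OpenCV KalmanFilter
class, is sketched below; the variance values are illustrative starting points
in the spirit of the guidelines above, not values prescribed by this chapter:

import cv2
import numpy as np

def make_kalman_tracker(x, y, w, h):
    kf = cv2.KalmanFilter(8, 4)  # state (x, y, vx, vy, w, h, dw, dh), measured (x, y, w, h)
    A = np.eye(8, dtype=np.float32)
    A[0, 2] = A[1, 3] = A[4, 6] = A[5, 7] = 1.0      # first order model (Eq. 10)
    kf.transitionMatrix = A
    H = np.zeros((4, 8), dtype=np.float32)            # measurement matrix (Eq. 19)
    H[0, 0] = H[1, 1] = H[2, 4] = H[3, 5] = 1.0
    kf.measurementMatrix = H
    kf.processNoiseCov = np.eye(8, dtype=np.float32) * 1e-3       # Q
    kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1e-3   # R
    P = np.eye(8, dtype=np.float32)                   # initial error covariance
    P[2, 2] = P[3, 3] = P[6, 6] = P[7, 7] = 10.0      # unknown velocity and size change
    kf.errorCovPost = P
    kf.statePost = np.array([[x], [y], [0], [0], [w], [h], [0], [0]], dtype=np.float32)
    return kf

# per frame: predicted = kf.predict(), then
# kf.correct(np.array([[mx], [my], [mw], [mh]], dtype=np.float32))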
One may argue that measurements made with the object detection
algorithm are accurate, so the variance of measurement noise should be low.
This is not the case. In an ideal situation, a specific point of the object should
be tracked. However, in the discussed scenario, a center point of the blob is
tracked and the position of this point within the object may change on a frame-
by-frame basis. For example, when a walking person is tracked, the shape of
the blob changes in different phases of the movement, so the position of the
blob center within the object varies from frame to frame, making the
measurements noisy. The noise parameters may be tuned by computing the mean
position error of the tracker over a test recording with the ground truth data:

e = \frac{1}{N} \sum_{k=1}^{N} \sqrt{(x_{p,k} - x_k)^2 + (y_{p,k} - y_k)^2}   (20)
where (xk, yk) is the ground truth position of the object in k-th video frame,
(xp,k, yp,k) is the predicted object position obtained from the tracker, N is the
number of the analyzed states. The error value may be computed for all
analyzed tracks. In order to obtain an optimal set of parameters, a grid search
may be performed by repeating this procedure for different sets of parameters,
usually on a logarithmic scale (e.g., 10⁻⁵, 10⁻⁴, …, 10⁻¹). With this method,
optimal values resulting in a minimized error may be found and used for the
tracking, and later tuned if necessary.
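The grid search itself is easy to script; in the sketch below, evaluate is a
hypothetical, caller-supplied function that runs the tracker with the given
variances on an annotated recording and returns the error of Eq. 20:

import itertools

def grid_search(evaluate, exponents=range(-5, 0)):
    # evaluate(q, r) is a hypothetical helper: it runs the tracker with process
    # noise variance q and measurement noise variance r, returning the Eq. 20 error
    values = [10.0 ** e for e in exponents]           # 1e-5, 1e-4, ..., 1e-1
    errors = {(q, r): evaluate(q, r)
              for q, r in itertools.product(values, values)}
    best = min(errors, key=errors.get)                # pair with the lowest error
    return best, errors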
In practice, it is not always required to set the noise variance values in Q
and R independently. In many cases, only two variance values are defined: one
for the process noise and another for the measurement noise. Therefore, Q and
R are diagonal matrices with constant values on each diagonal. In this case,
the results of Kalman filtering depend only on the ratio of both variances, not
on their absolute values. The result obtained for the variances set to e.g., (10⁻¹,
10⁻³) will be identical to those obtained with the variances (10⁻⁵, 10⁻⁷). This
feature of KFs (which is rarely mentioned in the literature) may simplify the
process of the algorithm tuning, as the number of parameters is practically
reduced to a single ratio. However, in some more difficult scenarios, it may be
beneficial to use separate variance parameters for the position, size, velocity,
etc.
Tracking Conflicts
So far, a simplified case of tracking objects that do not interact with each
other was considered. In practice, the detected blobs often merge together,
forming a single blob (an occlusion in the camera view), and they may also
split (e.g., a person leaving their luggage). Resolving such cases is the most
challenging problem in object tracking in video, and this problem does not
have a definitive solution. The KF is not able to handle such cases by itself; it
is the job of the tracking algorithm to provide an accurate measurement that
takes tracking conflicts into account. In this section, some attempts to solve
this problem with the KF will be discussed and a more sophisticated approach,
using particle filters, will be discussed later in the Chapter.
When the predicted positions of all trackers are compared with the blobs
detected in the current video frame using the background subtraction and
object detection algorithms, a relationship matrix is constructed [Czy08]. If a
tracker/blob is related to more than one blob/tracker, a tracking conflict
occurs. As a result, it is not possible to obtain a direct measurement for
updating the KF, because there is no single blob that represents the whole
tracked object, and only that object. An example is shown in Figure 5. The
object that was previously tracked in a non-conflict situation, now enters the
conflict, because the predicted position of the object is related to a blob that
represents two objects, so the measurement obtained from the object detection
algorithm is inaccurate. Therefore, the updated state of the tracker is also
inaccurate (Figure 6).
Figure 6. An example of the occlusion. Left: two vehicles are tracked separately.
Center: the same blob is used for the (inaccurate) measurement for both objects. Right:
the tracker resumes correct tracking of the vehicle after a short-term occlusion.
A split due to separating objects. This situation occurs when two or more
objects that were moving together, become separated, for example: persons
that were walking in a group, but at some point walked away from each other;
a person leaving luggage and walking away, etc. When objects split, each of
them should get their own tracker (Figure 7, center). Such a case may be
detected by analyzing the blobs related to the tracker, as described later in
this Chapter.
Figure 7. Examples of object splitting. Left: a single tracker for a group of objects
moving together. Center: a split caused by the separated objects (a vehicle and three
persons start to move independently). Right: a split caused by the fragmentation (two
blobs for a single object, because of the background subtraction errors, as shown in
Figure 3).
Particle Filters
In the prediction phase of the particle filter, each particle selected in the
resampling step is propagated with the dynamic model and perturbed with random
noise:

s_t = A s^-_{t-1} + w_t   (21)
where s_t is the predicted state, s^-_{t-1} is the state after resampling, and w_t is
the vector of random values from a normal distribution with a zero mean and with
variances σ². Therefore, if the same particle was selected multiple times from the
original set, the dynamic model propagates all particles to the same state, and
the random process adds uncertainty to the prediction phase, resulting in
dispersion of these particles. The variance values define a spread of each
variable. Similarly to the KF, a higher variance is needed if movement of the
object is expected to deviate significantly from the dynamic model.
The measurement phase recalculates the weights of all particles and
normalizes them to the unit sum. In the tracking application, the particle
weight should reflect the similarity between the predicted and the measured
state. An example using color histograms for computing weights will be
described in the next Section. Finally, an estimate of the object state may be
computed as a weighted mean of the particle set (Figure 8):
\hat{s}_t = \sum_{i=1}^{N} \pi_t^{(i)} s_t^{(i)}   (22)

where \pi_t^{(i)} are the normalized particle weights.
Figure 8. Tracking a white van with the particle filter. Left: predicted states of the
particles, representing the object position, are shown as dots. Center: the updated
particles (only the particles with sufficiently high weights are retained). Right: the
mean state of the tracker, shown as a bounding box.
The particle filter is well suited to such conflict resolving, but a method for
measuring the object position and size is needed.
In order to compute a similarity between the tracked object and the predicted
region in the video frame, color histograms may be used [Num03]. The target
object histogram is stored in the tracker. For each particle, the histogram of the
image region described by the particle state is computed and normalized, and
its similarity to the target histogram is calculated and used for the weight
updating. Effective histogram-based tracking requires that the histograms are
invariant to brightness changes (so that objects can be tracked in scenes with
different lighting) but, at the same time, specific to a given object (i.e., objects
that look different also have significantly different histograms).
Color histograms may be computed in a variety of color spaces (RGB,
normalized RG, HSV, HLS, etc.). In the algorithm presented here, the
improved HLS (iHLS) color space, introduced by Hanbury [Han03], was used.
The main advantage of this color space is that it removes the dependency of
the saturation on the brightness. Therefore, it is better suited for the analysis of
camera images [Bla06].
Various methods of histogram calculation in the iHLS color space are
possible: three separate 1D histograms, a combined 1D histogram, a 2D
histogram with only selected channels, etc. For example, Hanbury proposed a
merged histogram of the hue and the brightness, with bin values weighted by
the saturation [Han03]. However, in the preliminary experiments on the
tracking algorithm presented here, this approach did not work well and it
resulted in losing the tracked objects. It was found that a 2D histogram
constructed from the hue and the saturation channels in the iHLS color space
works with good accuracy. Values of the hue range from 0 to 360 degrees, and
values of the saturation range from 0 to 1. The number of histogram bins
should be chosen so that the histogram reflects significant differences between
different objects, but at the same time, the histogram is not too detailed. The
number of bins which is too high increases the computation time and memory
requirements, and makes the histogram comparison more difficult (the values
are spread among too many bins). As a good compromise, a histogram
consisting of 64 ranges for the hue and 8 ranges for the saturation, was chosen.
Because the hue is meaningless for low saturation (almost gray) pixels, it is
replaced with the brightness value, scaled to the 0–360 range, for pixels falling
into the bins representing the lowest saturation values. After computing the
histogram for all pixels in the image region described by a given particle, it is
normalized to the unit sum.
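A possible implementation of this histogram is sketched below (an illustration,
not the chapter's exact code); it approximates the iHLS saturation as
max(R, G, B) − min(R, G, B) after Hanbury, reuses the standard hue, and
simplifies the low-saturation replacement rule:

import cv2
import numpy as np

def hue_saturation_histogram(region_bgr, hue_bins=64, sat_bins=8):
    # region_bgr: uint8 BGR image patch covering the analyzed region
    b, g, r = cv2.split(region_bgr.astype(np.float32) / 255.0)
    mx = np.maximum(np.maximum(r, g), b)
    mn = np.minimum(np.minimum(r, g), b)
    sat = mx - mn                                     # iHLS-style saturation, 0..1
    hue = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2HSV)[..., 0] * 2.0  # degrees, 0..360
    brightness = 0.5 * (mx + mn)
    low_sat = sat < (1.0 / sat_bins)                  # pixels in the lowest saturation bin
    hue = np.where(low_sat, brightness * 360.0, hue)  # replace hue with scaled brightness
    hist, _, _ = np.histogram2d(hue.ravel(), sat.ravel(),
                                bins=(hue_bins, sat_bins),
                                range=((0.0, 360.0), (0.0, 1.0)))
    total = hist.sum()
    return hist / total if total > 0 else hist        # normalize to the unit sum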
The position of an object is represented by a center point (x, y) and the
size (w, h). The image region represented by a given particle state may be
82 Grzegorz Szwoch
therefore visualized as an ellipse with the center point (x, y) and the axes (w,
h), and all pixels within this ellipse are used for the histogram computation.
The ellipse is not rotated (its axes are always parallel to the image borders).
An incorrect estimation of the object size may result in including the
background pixels into the histogram computation. When too many
background pixels are included, the risk of ‘sticking’ the tracker to the
background increases. In order to avoid such a problem, it is possible to weight
the pixels used for the histogram computation. A similar approach was
proposed by Nummiaro [Num03]. Pixels close to the ellipse border receive
lower weights, so that the background pixels on the edges of the analyzed
region have a smaller contribution to the histogram. The histogram weight r of
each pixel is computed from its distance from the ellipse center, as follows:
r = \max\left( 1 - \sqrt{ \frac{(x_p - x_c)^2}{(w/2)^2} + \frac{(y_p - y_c)^2}{(h/2)^2} },\ 0 \right)   (23)
where (xc, yc) is the ellipse center, (xp, yp) are the pixel coordinates, (w, h) are
the ellipse axes lengths. The value of r is one for the ellipse center, it decreases
towards zero when the distance from the center increases, and it is zero for
pixels outside the ellipse.
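For a whole image patch, the weights of Eq. 23 may be computed at once; a
minimal NumPy sketch, assuming the patch covers the bounding box of the ellipse:

import numpy as np

def ellipse_weights(w, h):
    # pixel grid over the w-by-h bounding box; (xc, yc) is the ellipse center
    ys, xs = np.mgrid[0:int(h), 0:int(w)]
    xc, yc = (w - 1) / 2.0, (h - 1) / 2.0
    norm = np.sqrt(((xs - xc) / (w / 2.0)) ** 2 + ((ys - yc) / (h / 2.0)) ** 2)
    return np.maximum(1.0 - norm, 0.0)   # 1 at the center, 0 on and outside the border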
For calculation of the similarity between the histogram computed from an
image region and the target histogram stored in the tracker, various metrics
may be used, e.g., correlation, intersection, Chi-Square, Bhattacharyya,
Hellinger, quadratic, etc. [Zha14]. It was found during the experiments that
comparable results were obtained with different metrics. Therefore, for
performance reasons, the Bhattacharyya metric was used. The distance
between two histograms H1, H2 consisting of N bins is:
d = \sqrt{ 1 - \sum_{i=1}^{N} \sqrt{H_1(i)\, H_2(i)} }   (24)
The particle weights are calculated from the histograms distance using an
exponential weighting function [Num98]:
\pi = \exp\left( -\frac{d^2}{2 s^2} \right)   (25)

The target histogram stored in the tracker is adapted to the current observation:

H_{T,n} = (1 - p)\, H_{T,n-1} + p\, H   (26)
where HT,n is the target histogram stored in the tracker in frame n, H is the
histogram computed from the image region corresponding to the current mean
tracker state, p is an update factor which determines the histogram update
ratio. The target histogram is computed when the tracker is initialized.
Updating the target histogram is necessary in order to take changes of the
object appearance into account, e.g., if the object changes its orientation
relative to the camera, or it moves to an area with different light. In order to
prevent distorting the target histogram with incorrect results, e.g., when the
tracked object is partially occluded, this operation should be performed only if
the distance between histograms HT,n-1 and H is below the threshold (typically
in the range of 0.25 to 0.4).
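Taken together, Eqs. 24–26 may be sketched as follows (assuming NumPy and
histograms already normalized to the unit sum; the parameter values follow the
ranges suggested above):

import numpy as np

def bhattacharyya_distance(h1, h2):
    return np.sqrt(max(0.0, 1.0 - np.sqrt(h1 * h2).sum()))        # Eq. 24

def particle_weight(h_particle, h_target, s2=0.3):
    d = bhattacharyya_distance(h_particle, h_target)
    return np.exp(-d * d / (2.0 * s2))                            # Eq. 25

def update_target_histogram(h_target, h_current, p=0.05, threshold=0.3):
    if bhattacharyya_distance(h_target, h_current) < threshold:   # gated update
        h_target = (1.0 - p) * h_target + p * h_current           # Eq. 26
    return h_target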
If the number of particles is too low, the state prediction becomes inaccurate
and unstable (a large variation between
frames is observed). A higher number of particles increases the tracking
accuracy, but it also significantly affects the processing time. In practical
situations, 500 to 2000 particles per one tracker should be used, depending on
the variability of object movement and the frequency of tracking conflicts. A
value of 500 is recommended for the initial experiments.
Noise variance determines the spread of the particle states during the
prediction phase; this value should be set so that it covers the expected
deviations from the dynamic model. For example, the variance of the position
noise should be set so that all particles cover the image region in which the
object may appear, but they do not extend significantly beyond this region.
Similarly to the KF, a low variance means that the object is expected to move
according to the dynamic model. If the value is too low, the actual position of
a tracked object may not be fully covered by any particle, resulting in tracking
errors. On the other hand, the variance that is set too high causes the particles
to spread too wide, increasing the risk of taking a wrong image region as a
measurement and losing the tracked object. Different noise variances may be
used for each variable type (the position, velocity, size, etc.). The values of
variance in PFs are usually higher than in the KF. It is recommended to set a
higher variance for the position and velocity noise (e.g., 1.0–10.0) and a much
smaller variance for the size noise (e.g., 10⁻⁶). These values need to be
tuned for a specific scenario and techniques similar to those described for KFs
(e.g., a grid search) may be employed.
The weighting function parameter s2 in Eq. 25 sets the required similarity
of the object histograms. With higher values, larger differences between the
histograms are allowed. Lower values may help in separating objects with a
similar look, but they may result in losing the tracked object if its appearance
changes. A value of 0.3 may be used as a starting one.
The histogram update threshold defines the maximum distance between
the computed and the target histogram that allows for updating the latter. This
value should not be too low, because the target histogram will not be updated
if the object appearance changes, (e.g., when its orientation relative to the
camera is changed) and it should also not be too high, because the target
histogram may be corrupted when the tracked object is occluded. Typical
values are 0.3 to 0.5. The histogram update rate (p in Eq. 26) controls the
speed of adaptation of the target histogram to changes in the object appearance
and it is typically set to 0.01 – 0.05.
Implementation Considerations
If tracking conflicts are rare and short-term, KFs are expected to perform
reasonably well, and employing PFs in
this case is not going to provide a significant performance boost. However, if
object occlusions, splitting and fragmentation are frequent, the PFs may
improve the tracking accuracy in a significant way. For example, in case of
object occlusions (blob merges), the KF provides a single prediction of the
object state, and it is not possible to verify the accuracy of this hypothesis and
to obtain a better measurement for the tracker update. On the other hand, the
PF provides multiple hypotheses that may be verified e.g., by comparing color
histograms, and the particle weights reflect the accuracy of these predictions.
Therefore, the particle tracker is able to find the optimal estimate of the object
state. Compared with algorithms such as CamShift [Bra98] that perform a
‘blind’ search for a region with the best matching histogram, the particle
tracker uses a dynamic model to predict the most probable object state. Of
course, this approach will not work well if an object is completely occluded
for a prolonged time, or when it is camouflaged in the background. However,
if the occlusion is partial and temporary, it will result in high dispersion of the
particles during the conflict, but after the object becomes fully visible, the
particle set should be able to refocus on it. An example of a successful
tracking in the described case is presented in Figure 9. When the tracked
object is partially occluded, the particle set is still focused on the visible part
of the object, and when the occluding object moves away, the tracker readjusts
itself to the tracked object. The tracker is even able to handle short-term full
occlusions, provided that the occluding object has a sufficiently different color
histogram from the target. Obviously, there is a risk of ‘stealing’ the tracker by
another object with a similar appearance. This effect cannot be fully
eliminated, but it may be reduced by tuning the algorithm parameters.
Figure 9. Phases of tracking the dark (parked) vehicle that is partially occluded by the
white van moving in front of it. Top: a ‘swarm’ of particles modeling the object
position (after the filter update). Bottom: the mean tracker state as a bounding box.
Object fragmentation is usually not an issue with the PFs, because they are
able to recover the image region containing the object. On the other hand,
permanent object splitting is not handled by the particle tracker itself. In case
of splitting the object into two or more separate ones, the tracker will stick to
the object which has the highest similarity with the target histogram. The
remaining objects will be lost, so a dedicated procedure is needed to handle the
split by assigning a new tracker to the separated objects.
The initialization of the tracker for new objects that appear on the scene,
and for objects left behind after the split, is an obvious problem with this
approach. One solution is to use the background subtraction and blob
extraction procedures to find objects not assigned to any tracker and initialize
their tracking with this data. The drawback of such an approach is that the
computation time increases significantly, because two computationally complex
algorithms (background subtraction and particle filtering) are employed.
However, incorporating the background subtraction procedure into
the PF tracker has an additional advantage of removing the influence of the
background pixels on the tracking accuracy. In a standalone particle tracker
presented here, color histograms are calculated from all pixels inside the
ellipse determined by the tracker. This may also include the background
pixels, for example, from the area between legs of a walking person. As a
result, the histograms are distorted by the background pixels, which increases
the risk of losing the tracked object if a tracker sticks to the background (it
may happen e.g., when the object is mostly occluded). In this case, the
background subtraction stage may be used to mask out the background pixels,
removing them from the histogram calculation. Therefore, this modification
may increase the tracking accuracy.
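Such masking is directly supported by histogram routines that accept a mask; a
sketch assuming OpenCV (using the built-in HSV space here for simplicity,
rather than iHLS):

import cv2

def masked_hs_histogram(region_bgr, fg_mask):
    # fg_mask: uint8 foreground mask from background subtraction (0 = background)
    hsv = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], fg_mask, [64, 8], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 1.0, 0.0, cv2.NORM_L1)  # unit-sum normalization
    return hist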
Figure 10. Block diagram of the proposed combined object tracking algorithm:
the KF state prediction feeds the association matrix; in non-conflict cases the
KF is updated directly, while conflicts activate the PF state prediction and
update.
Additionally, the histogram of the blob and its distance from the target
histogram are calculated. If the distance exceeds a certain threshold, it means
that the blob does not represent the tracked object, and the tracker is
removed. If the histogram distance is sufficiently low, the target histogram is
updated (Eq. 26). The PF is inactive during the non-conflict tracking.
A relationship of one tracker to more than one blob indicates either the
fragmentation, or the object splitting into two or more separate tracks. For the
analysis of this case, a bounding box encompassing all the related blobs is
calculated. If the conflict results from a fragmentation, the size of the
combined bounding box remains similar to the object size stored in the tracker,
and the distance between the histogram computed from all the related blobs
and the target histogram is also small. Therefore, the combined bounding box
is used as a measurement for the KF update. It is recommended to increase the
measurement noise variance for this case, because fragmentation makes the
measurement inaccurate by nature. The target histogram is not updated.
Tracker splitting into separate objects may be detected by observing that
the size of the combined bounding box of the matched blobs increases in the
successive frames, and also the distance between the individual blobs
increases. When the size of the combined blob is larger than the original
tracker size by a factor exceeding a threshold, a split is detected. Histograms
and sizes of the partial blobs are compared with the tracker data obtained
before the conflict occurred. If a match is found between the merged blob and
a single tracker, this blob is used for updating the tracker and new trackers are
created from the remaining blobs. An additional analysis is also needed in
order to detect splitting and fragmentation occurring at the same time.
The remaining conflict situations represent cases in which more than one
tracker is related to one or more blobs. Such a situation occurs during the
occlusion, and also in complex cases of the occlusion coexisting with the
fragmentation and splitting. These are the most difficult conflicts to resolve.
Generally, the matched blob (or the combined blob) is larger than the
individual tracker states, so the main problem is to find a region inside the
blob that contains a given tracked object, and to use this region for the tracker
update. The PF is used for this task, because it allows for verification of the
predicted state (by computing the distance between the color histograms).
When this type of conflict is detected, the tracker activates the PF. A
simplified state vector, containing only the position and the velocity, is used in
particle tracking:
s = [x, y, \dot{x}, \dot{y}]^T   (27)
Initial values of these variables are copied from the last KF state. The
object size and its changes are not taken into account, because it is not possible
to obtain the size measurement when a tracked object is occluded. It is
therefore assumed that the object size does not change significantly during the
conflict. Reducing the dimension of the state vector allows for using a smaller
number of particles per tracker. The prediction phase is also simplified:
\dot{x}_t = \dot{x}_{t-1} + v_x, \quad \dot{y}_t = \dot{y}_{t-1} + v_y,
x_t = x_{t-1} + \dot{x}_t, \quad y_t = y_{t-1} + \dot{y}_t   (28)
where vx and vy are noise values from the independent normal distributions.
The process noise variance should allow for expected deviations from the
dynamic model, but it should also be sufficiently low, in order to keep the
particles within the blob borders. Since it is known that the object is inside the
blob (assuming that the results of background subtraction are accurate), the
particles should not extend beyond the blob borders. Therefore, a proper
choice of the noise variance keeps the spread of particles within the blob
limits.
Verification of the hypotheses (calculation of the particle weights) is
performed as before, by comparing color histograms computed for each
particle, with the target histogram stored in the tracker. If the overlap between
the tracked objects is small and these objects differ in appearance, it may be
expected that the particles having the highest weight in the set describe the
image region containing the tracked object. The computation of weights is
done according to Eqs. 23-25, but the particles having the estimated position
beyond the blob limits automatically receive a zero weight.
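A sketch of this simplified prediction and weighting step, assuming NumPy,
particles stored as an (N, 4) array of (x, y, vx, vy) states, and the blob
given as a binary mask:

import numpy as np

def predict_particles(p, vel_var=1.0):
    n = p.shape[0]
    p[:, 2:4] += np.random.normal(0.0, np.sqrt(vel_var), (n, 2))  # Eq. 28, velocity
    p[:, 0:2] += p[:, 2:4]                                        # Eq. 28, position
    return p

def reweight_particles(p, weights, blob_mask):
    # zero the weights of particles whose predicted position leaves the blob
    xs = p[:, 0].astype(int).clip(0, blob_mask.shape[1] - 1)
    ys = p[:, 1].astype(int).clip(0, blob_mask.shape[0] - 1)
    weights = np.where(blob_mask[ys, xs] > 0, weights, 0.0)
    total = weights.sum()
    return weights / total if total > 0 else weights              # unit sum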
After the update phase is finished, the mean posterior state of the PF is
computed (Eq. 22) and used as a representation of the object position, with the
size retained from the original state (before the conflict). This result is then
used as a measurement for updating the KF, possibly with an increased value
of the noise variance. Tracking with the PF continues until a non-conflict state
is detected and the tracker switches back to the KF only. The target histogram
remains unaffected during tracking with the PF.
The complete tracking procedure (Figure 10) may be summarized with the
following steps: the KF predicts the state of each tracker; the association
matrix relates the predicted trackers to the detected blobs; in non-conflict
cases, the matched blob provides the measurement for the KF update; in
conflict cases, the PF is activated and its mean posterior state provides the
measurement. This process is repeated for all the tracked objects, and then
for the consecutive video frames.
EXPERIMENTS
In order to perform a thorough evaluation of any object tracking
algorithm, it has to be tested on a large number of object tracks, obtained from
a set of video recordings representing various tracking scenarios, with varying
complexity of object movement. In practical applications, the tracking system
has to be verified also on real recordings obtained from the target surveillance
system. For each video, the ground truth data describing the exact position and
size of all objects in each video frame, has to be available. Creation of such a
set requires a substantial amount of work. Finding a ready to use benchmark
set suitable for object tracking evaluation is also problematic. Therefore, in
this Section, a simplified test procedure utilizing only one video recording and
a single object, will be presented. The aim is to illustrate the testing procedure,
and to provide an overview on the accuracy of the tracking methods presented
here. However, it is by no means an exhaustive performance evaluation of
these algorithms.
For a quantitative analysis of the tracker performance, various metrics are
used [Ceh16]. Commonly utilized region-based metrics may be calculated by
comparing a coverage of two image regions: the one obtained from the
tracking algorithm, and the ground truth data. These regions are usually
described with rectangles, denoted as t and g for the tracker and the ground-
truth, respectively. Pixels situated inside these rectangles may be classified as:
true positives (TP, pixels belonging to both t and g), false positives (FP, pixels
belonging to t only) and false negatives (FN, pixels belonging to g only).
Recall describes the part of g that was detected correctly. Lower recall
corresponds to a higher number of FNs:
recall = \frac{TP}{TP + FN} = \frac{area(t \cap g)}{area(g)}   (29)

Precision describes the part of t that covers the ground truth; it decreases
as the number of FPs grows:

precision = \frac{TP}{TP + FP} = \frac{area(t \cap g)}{area(t)}   (30)

Accuracy takes both types of errors into account:

accuracy = \frac{TP}{TP + FP + FN} = \frac{area(t \cap g)}{area(t) + area(g) - area(t \cap g)}   (31)
Values of all these measures range from 0 (the worst) to 1 (the best); they
may also be expressed as percentages. None of these metrics is exhaustive, so all
three of them need to be calculated and provided in the report. It should be
pointed out that in general, it is not possible to lower the number of both FPs
and FNs at the same time, by tuning the algorithm parameters. The precision
and recall metrics are usually related with each other, and altering some
parameters of the algorithm (e.g., the noise variances in the KF) often leads to
changing both the number of FPs and FNs in the opposite direction. Therefore,
when the precision increases, the recall may decrease, and vice versa. This
effect is often visualized by plotting the precision and recall values in a single
graph, as a function of the tested parameter, forming a receiver operating
characteristic (ROC). The ROC curve is useful in finding a proper balance
between the number of FPs and FNs.
Another useful measure is the distance error, measured as a distance in
pixels between the center position of the tracker and the center point of the
ground truth rectangle. Obviously, this measure does not take the size of the
object into account, but it is useful for assessment of the object location
accuracy. It is measured as an average of the squared distances obtained from
N analyzed video frames:
d_{err} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left[ (x_{t,i} - x_{g,i})^2 + (y_{t,i} - y_{g,i})^2 \right] }   (32)
where (xt,i, yt,i) is the center point of the tracker in i-th frame, and (xg,i, yg,i) is
the center point of the object rectangle in the ground truth data. This measure
is therefore a root-mean-square error (RMSE).
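Both the region-based metrics and the distance error are simple to compute; a
sketch assuming NumPy, with rectangles given as (x, y, w, h) tuples:

import numpy as np

def region_metrics(t, g):
    ix = max(0, min(t[0] + t[2], g[0] + g[2]) - max(t[0], g[0]))
    iy = max(0, min(t[1] + t[3], g[1] + g[3]) - max(t[1], g[1]))
    tp = ix * iy                                  # overlap area: true positives
    at, ag = t[2] * t[3], g[2] * g[3]
    recall = tp / ag                              # Eq. 29
    precision = tp / at                           # Eq. 30
    accuracy = tp / (at + ag - tp)                # Eq. 31
    return recall, precision, accuracy

def distance_error(tracker_centers, gt_centers):
    d2 = ((np.asarray(tracker_centers) - np.asarray(gt_centers)) ** 2).sum(axis=1)
    return np.sqrt(d2.mean())                     # Eq. 32, RMSE in pixels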
Table 1 presents the obtained results. Although the differences are not large,
it may be observed that all the metrics are lower for
low values of the variance ratio, and for ratios larger than 10, no significant
change in the results is observed. Since the differences were small, there was
no point in plotting a ROC curve in this case. Based on the analysis, a ratio of
1, i.e., identical values for both noise variances, was chosen as the optimal
value for this case. However, in practical scenarios, the described evaluation
procedure should be performed on a large number of tracks obtained in
varying conditions, in order to obtain a meaningful result.
Table 1. Performance metrics (%) obtained for the object tracked with
the Kalman filter, as a function of the ratio of the process noise variance
to the measurement noise variance.
In case of the particle filter tracker, the analysis was performed using a PF
containing 512 particles in the set. In the prediction phase, the noise variance
was equal to 1 for the position and the velocity, and 10⁻⁶ for the size. These
optimal values were found experimentally. The measurement phase was
performed by computing the distances between color histograms, as presented
earlier. Values: s2 = 0.3 (Eq. 25), p = 0.05 (Eq. 26) and the histogram update
threshold equal to 0.3, were used. Finally, the proposed combined tracker was
evaluated, using the same values of the KF and PF parameters as above.
Additionally, the metrics were measured for the PF with a varying number of
particles.
Table 2 presents the obtained results averaged for the complete object
track, and Figure 11 shows the plots of all metrics vs. the frame number,
illustrating how each tracker performs in different situations (a conflict or no
conflict). The KF tracker has a near perfect recall, but the overall precision is
below 75% because of the conflicts, during which the area covered by the
tracker is larger than the actual object (a result of inaccurate measurements
during the occlusions).
Figure 11. The performance metrics calculated for each frame of the tested video. Top
to bottom: accuracy, recall, precision, and the distance error in pixels.
Figure 12. Tracking results in sample frames from the analyzed video set. Columns
from left to right: the Kalman tracker, the particle tracker and the proposed, combined
algorithm.
CONCLUSION
The approach to object tracking in video surveillance systems presented
here utilizes various algorithms: background subtraction, Kalman filters and
particle filters.
ACKNOWLEDGMENT
This work has been funded by the Artemis JU as part of the COPCAMS
project under GA number 332913.
REFERENCES
[Aru02] Arulampalam, M. S.; Maskell, S.; Gordon, N.; Clapp, T. A tutorial on
particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE
Trans. Signal Processing. 2002, 50, 174-188.
[Bla06] Blauensteiner, P.; Wildenauer, H.; Hanbury, A.; Kampel, M. On
colour spaces for change detection and shadow suppression. Computer
Vision Winter Workshop. 2006, 87-92.
[Bos07] Bose, B.; Wang, X.; Grimson, G. Multi-class object tracking
algorithm that handles fragmentation and grouping. IEEE Conf. Computer
Vision and Pattern Recognition. 2007, 1-8.
[Bra98] Bradski, G. R. Computer vision face tracking for use in a perceptual
user interface. Intel Technology Journal. 1998, Q2, 214-219.
[Ceh16] Cehovin, L.; Kristan, M. Visual object tracking performance
measures revisited. IEEE Trans. Image Processing. 2016, 25, 1261-1274.
[Czy06] Czyz, J.; Ristic, B.; Macq, B. A particle filter for joint detection and
tracking of color objects. Image and Vision Computing. 2006, 25, 1271-
1281.
[Czy08] Czyzewski, A.; Dalka, P. Examining Kalman filters applied to
tracking objects in motion. 9th Int. Workshop on Image Analysis for
Multimedia Interactive Services. 2008, 175-178.
[Czy11] Czyzewski, A.; Szwoch, G.; Dalka, P.; Szczuko, P.; Ciarkowski, A.;
Ellwart, D.; Merta, T.; Łopatka, K.; Kulasek, L.; Wolski, J. Multi-stage
video analysis framework. Video surveillance; Lin, W.; Ed.; InTech:
Rijeka, 2011, pp. 147-172.
[Shi94] Shi, J; Tomasi, C. Good features to track. 9th IEEE Conf. Computer
Vision and Pattern Recognition. 1994, 593-600.
[Sta99] Stauffer, C; Grimson, W. E. L. Adaptive background mixture models
for real-time tracking. Proc. IEEE Conf. Computer Vision and Pattern
Recognition (CVPR). 1999, 246-252.
[Suz85] Suzuki, S. Topological structural analysis of digitized binary images
by border following. Computer Vision, Graphics and Image Processing.
1985, 30, 32-46.
[Szw10] Szwoch G.; Dalka, P.; Czyzewski, A. A framework for automatic
detection of abandoned luggage in airport terminal. Intelligent Interactive
Multimedia Systems and Services. 2010, 9, 13-22.
[Szw11] Szwoch, G.; Dalka, P.; Czyzewski, A. Resolving conflicts in object
tracking for automatic detection of events in video. Elektronika. 2011, 52,
52-55.
[Szw15] Szwoch, G. Performance evaluation of parallel background
subtraction on GPU platforms. Elektronika. 2015, 56, 23-27.
[Szw16] Szwoch G.; Ellwart, D.; Czyzewski, A. Parallel implementation of
background subtraction algorithms for real-time video processing on a
supercomputer platform. J. Real-Time Image Processing. 2016, 11, 111-
125.
[Vio01] Viola, P.; Jones, M. Rapid object detection using a boosted cascade of
simple features. Proc. IEEE Comp. Soc. Conf. Computer Vision and
Pattern Recognition. 2001, 1, 511-518.
[Wan00] Wan, E. A.; van der Merwe, R. The unscented Kalman filter for
nonlinear estimation. IEEE Symp. Adaptive Systems for Signal Processing,
Communications and Control. 2000, 153-158.
[Wel04] Welch, G.; Bishop, G. An introduction to the Kalman filter. Technical
report TR 95-041, University of North Carolina, 2004.
https://www.cs.unc.edu/~welch/kalman/kalmanIntro.html.
[Xu05] Xu, L.-Q.; Landabaso, J. L.; Pardas, M. Shadow removal with blob-
based morphological reconstruction for error correction. IEEE Conf.
Acoustics, Speech & Signal Processing. 2005, 729-732.
[Zha14] Zhang, Q.; Canosa, R. L. A comparison of histogram distance metrics
for content-based image retrieval. Proc. SPIE Imaging and Multimedia
Analytics in a Web and Mobile World. 2014, 9027, 90270O.
[Ziv06] Zivkovic, Z.; Van der Heijden, F. Efficient adaptive density estimation
per image pixel for the task of background subtraction. Pattern
Recognition Letters. 2006, 27, 773-780.
BIOGRAPHICAL SKETCH
Grzegorz Szwoch, PhD
the supercomputer cluster. The main algorithm that was implemented on the
parallel computing platform, was based on the Codebook background
subtraction method, supplemented with object tracking with Kalman filters
and the object detection module. The algorithm was implemented within a web
service.
In the European project ADDPRIV – Automatic Data Relevancy
Discrimination for a Privacy - sensitive Video Surveillance (2011-2014), he
worked on automatic detection of unattended luggage in public spaces. Within
this project, the detection algorithm utilizing a multi-layer model based on the
modified Codebook method, was developed. The system was tested in a real-
life scenario in Milan-Linate airport.
In the European project COPCAMS – Cognitive and Perceptive Cameras
(2013-2016), the research was focused on parallel processing of multimedia
streams for application in smart camera systems. During this project, he
worked on parallel implementation of the object detection and tracking
algorithms on GPU platforms, with CUDA and OpenCL. He developed an
object tracking algorithm based on particle filters, intended for implementation
in systems equipped with non-fixed cameras, e.g., unmanned aerial vehicles.
He also proposed a combined object tracking algorithm, employing both
Kalman and particle filters, for an improved resolving of a difficult tracking
cases.
His professional interests include audio, image and video processing and
analysis, programming (Python, C++) and web technologies. He is particularly
interested in employing parallel processing platforms (such as GPUs) and
mini-computers (e.g., Raspberry Pi) for the analysis of multimedia data.
He is also an academic teacher on the topics of sound synthesis, computer
graphics, audio measurement, and applications of digital signal processors.
Professional Appointments:
since 2004: Gdansk University of Technology, Assistant Professor
2000-2004: Gdansk University of Technology, Research Assistant
2. Kotus, J., Dalka, P., Szczodrak, M., Szwoch, G., Szczuko, P.,
Czyzewski, A. Multimodal Surveillance Based Personal Protection
System. Signal Processing: Algorithms, Architectures, Arrangements,
and Applications (SPA) 2013, Poznan, 2013, 100-105.
3. Czyzewski, A., Bratoszewski, P., Ciarkowski, A., Cichowski, J.,
Lisowski, K., Szczodrak, M., Szwoch, G., Krawczyk, H. Massive
surveillance data processing with supercomputing cluster. Information
Sciences, 296 (1), 2014, 322-344, DOI: 10.1016/j.ins.2014.11.013.
4. Dalka, P., Ellwart, D., Szwoch, G., Lisowski, K., Szczuko, P.,
Czyzewski, A. Selection of Visual Descriptors for the Purpose of
Multi-camera Object Re-identification. In: U. Stanczyk and L.C. Jain
(eds.), Feature Selection for Data and Pattern Recognition. Studies in
Computational Intelligence, 584, Springer 2014, 263-303, DOI:
10.1007/978-3-662-45620-0_12.
5. Szwoch, G., Dalka, P. Detection of vehicles stopping in restricted
zones in video from surveillance cameras. In: A. Dziech, A.
Czyzewski (eds.), Multimedia Communications, Services and
Security. Communications in Computer and Information Science,
429, Springer 2014, 242-253, DOI: 10.1007/978-3-319-07569-3_20.
6. Lech, M., Dalka, P., Szwoch, G., Czyzewski, A. Examining Quality
of Hand Segmentation Based on Gaussian Mixture Models. In: A.
Dziech, A. Czyzewski (eds.), Multimedia Communications, Services
and Security. Communications in Computer and Information Science,
429, Springer 2014, 111-121, DOI: 10.1007/978-3-319-07569-3_9.
7. Szwoch, G. Parallel background subtraction in video streams using
OpenCL on GPU platforms. Signal Processing: Algorithms,
Architectures, Arrangements, and Applications (SPA) 2014, Poznan,
2014, 54-59.
8. Szwoch, G. Performance evaluation of parallel background
subtraction on GPU platforms. Elektronika: konstrukcje, technologie,
zastosowania, 2015 (4), 2015, 25-29, DOI: 10.15199/13.2015.4.4.
9. Szwoch, G., Ellwart, D., Czyzewski, A. Parallel implementation of
background subtraction algorithms for real-time video processing on a
supercomputer platform. Journal of Real-Time Image Processing, 11
(1), 2016, 111-125, DOI: 10.1007/s11554-012-0310-5.
10. Szwoch, G. Extraction of stable foreground image regions for
unattended luggage detection. Multimedia Tools and Applications, 75
(2), 2016, 761-786, DOI: 10.1007/s11042-014-2324-4.
In: Surveillance Systems ISBN: 978-1-53610-703-6
Editor: Roger Simmons
© 2017 Nova Science Publishers, Inc.
Chapter 3

PERFORMANCE EVALUATION OF SINGLE
OBJECT VISUAL TRACKING

Juan C. SanMiguel, José M. Martínez and Mónica Lozano

Abstract
ground-truth data. Then, such a measure is compared with the related state-
of-the-art, showing its superiority for evaluating trackers. Finally, the pro-
posed methodology is validated on state-of-the-art trackers, demonstrating
its utility for identifying tracker characteristics.
1. Introduction
Visual tracking has received enormous attention by the research community dur-
ing the past years, resulting in a wide variety of approaches (trackers) [55][27].
In this situation, selecting the optimum tracker for each application requires
evaluating tracker performance (i.e., determining its strengths and weaknesses)
under different challenges that affect trackers, such as noise, clutter, illumination
changes and occlusions.
Common performance evaluation of trackers analyzes the obtained re-
sults through a methodology defined by a dataset (a set of sequences), the
ground-truth data (manual annotations of the ideal result) and the measures (to
quantify the performance) [27]. Its design is challenging as it has to cover,
with enough variability, the situations and problems of interest [15] and cor-
rectly estimate performance. Although there are approaches not based on
ground-truth [52][41], most of the literature measures performance as the (spa-
tial and temporal) deviation between the tracker output and the ground-truth
data [7][25][45]. However, current approaches address this evaluation only
partially, as they do not systematically cover the tracking problems [6][35][24]
or use small datasets [28] (generating ground-truth is tiresome, which limits
the dataset variability and size). Moreover, many measures exist and, as there
are no comparisons [7], it is difficult to decide which one to use, which
increases the complexity of designing a proper methodology. In summary,
these limitations have restricted the wide
acceptance of a common tracker evaluation approach and motivated the recent
interest in major conferences [53][34] and the organization of challenges1 .
In this paper, we propose a methodology for performance evaluation of
single-object tracking. It provides an evaluation framework including the
dataset, the evaluation measures and the aspects to understand the advantages
and drawbacks of the tracker under evaluation. Each tracker is modeled as a
1 IEEE Workshop on Visual Object Tracking Challenge, http://www.votchallenge.net/
black box with two inputs (visual data and configuration parameters) and one
output (target estimations). Tracking performance is measured against varia-
tions of the visual data (problems affecting tracking) and configuration param-
eters (inaccurate initialization, non-optimum settings) by comparing its results
with ground-truth data. A dataset is designed to represent the relevant tracking
problems with different complexity levels via synthetic and real data (126 se-
quences, ~23,000 frames). Then, four evaluation criteria are defined: parameter
stability, initialization robustness, global accuracy and computational complex-
ity. A novel spatio-temporal performance measure is proposed to counteract the
ground-truth errors made by the annotators. Finally, experiments are presented
to validate the proposed methodology2. We compare existing accuracy mea-
sures, showing the benefits of the proposed measure under inaccurate ground-
truth data. Then, we apply the methodology to classical and recently proposed
trackers determining their strengths across a wide variety of testing conditions.
The structure of this paper is as follows. Section 2 presents the related
work. The proposed methodology is overviewed in Section 3. The dataset and
performance criteria are described, respectively, in Sections 4 and 5. Then,
Section 6 presents the experimental results. Finally, Section 7 summarizes the
main conclusions.
2. Related Work
Performance evaluation for visual tracking can be categorized as low (high)
level based on the application’s independency (dependency) [27]. Evaluation is
also for single (SOTE) and multiple object tracking (MOTE) [9]. MOTE is often
simplified to SOTE after associating the estimated and ground-truth targets [7].
In this section, we briefly discuss recent advances in visual tracking and review
the SOTE low-level approaches focusing on the performance evaluation scores,
the benchmark datasets and evaluation frameworks.
Filter [32] and Lucas-Kanade [5] trackers. These approaches fail in the presence
of objects similar to the target or occlusions by other objects. Recent proposals ad-
dress these limitations via adaptive strategies to update the target model such as
incremental PCA [40], continuous outlier detection [61] and scale-orientation
adaptations of MS [31][49]. Combination of rigid and deformable genera-
tive models can be done via superpixels to increase robustness against occlu-
sions [33]. Local information can be used to increase the target model accuracy
such as the MS extension for background correction [30] and the FFT-based
tracker [59]. Discriminative trackers focus on developing classifiers to distin-
guish between the target and its background, being sensitive to sudden changes
in the surrounding background. For example, the TLD tracker [23] combines
PN learning and a tracker to exploit the spatio-temporal structure of the data.
Target-background dissimilarity can be also measured via superpixels [54]. Due
to the high computational cost of previous approaches, fast discriminative track-
ers are proposed focused on compressive sensing which updates a set of weak
classifiers via sparse factorization [58] and on adaptive dimensionality reduction
of color attributes based on their discriminative power [17].
Multiple trackers or models can be combined to overcome the limitations
of each tracker such as selecting relevant data to update the target model [21],
where x^E_f and x^{GT}_f are the estimated and ground-truth targets for frame f;
|A^E_f ∩ A^{GT}_f| is their spatial overlap (in pixels); |A^E_f| and |A^{GT}_f| represent their
area (in pixels). Unlike centroid-based measures [11][43], SO considers the er-
rors in the estimated target size and saturates to one [27]. Evaluating the target
estimation can be also considered via its error (i.e., the non-overlapping re-
gion) at pixel [15] or region level [13]. Finally, other approaches compute such
ground-truth similarity using Euclidean [6] or Mahalanobis [19] distances.
Trajectory-level evaluation quantifies spatio-temporal accuracy of the esti-
mated target tracks. For example, the mean over all the frames of each sequence
can be taken for SOs [15][25][29] or centroid distances [11][43]. In addition,
[11] computed the positive and false-matches using point-wise ground-truth for
measuring the rate of correct, wrong and missing targets. [38] focused on the
ability of the tracker to maintain the same identifier for each detected target.
Extending the previous approaches, [57] thresholded the SO for determining
correct target detection and derived a set of track-based measures for measuring
their fragmentation as well as their spatial and temporal closeness to ground-
truth tracks. More recently, [28] defined the loss of target as the number of SOs
for an entire track that are below a threshold. Then, this measure is computed
for a set of predefined thresholds and accumulated to obtain the performance
score of each track.
As a conclusion, several approaches are available to measure tracker perfor-
mance given a particular tracker initialization and configuration, being not clear
which measure to use. Hence, it would be desirable to analyze which measure
is more efficient and to systematically study the performance variation under
different configurations and initializations (as in [28]).
Figure 1. Overview of the proposed evaluation methodology: a tracker, treated
as a black box with two inputs (visual data and configuration) and one output
(tracking results), is followed by the performance evaluation stage.
3. Methodology Overview
The proposed methodology for evaluating single-object trackers is depicted in
Fig. 1. It is composed of two stages: tracking analysis and performance evalua-
tion.
The first stage models the tracker to evaluate as a black box with two inputs
(visual data and configuration) and one output (results). The visual data is the
video sequence with the targets to track. Evaluating tracker accuracy requires
using data covering the tracking problems (e.g., occlusions, scale changes). The
configuration describes all the tracker parameters that can be manually set (e.g.,
window search). The results are the estimated target locations defined by their
bounding boxes (center position and size).
The second stage formalizes the tracker evaluation by comparing its re-
sults with ground-truth data. We propose such evaluation as changing the
tracker input and then analyzing its accuracy. Hence, visual data and con-
figuration variations are modeled by, respectively, sequences with complexity-
variable tracking problems (requiring to design a new dataset) and the two
principal aspects to configure a tracker (initial target location and parameters).
Tracker performance is evaluated through four criteria: parameter stability, ini-
tialization robustness, global accuracy and computational complexity (detailed
in Sec. 5).
4. Dataset Design
We propose a new dataset, named SOVTds (Single-Object Video Tracking
dataset), composed of synthetic and real data selected from publicly available
benchmarks. It covers common problems and situations of tracking, having a
total of 126 sequences (~23,000 frames) where ground-truth data is generated
for each frame as the target bounding box (center and size). The detailed
description of the covered problems is given below.
Abrupt (and Local) Illumination Changes. As the target moves, it can enter
in areas with variable illumination. Hence, the tracker might be confused losing
the target.
Noise. It appears as random variations over the values of the image pixels
and can significantly degrade the quality of the extracted features for the target
model.
Occlusion. It is defined when an object moves between the camera and the
target. It can be partial or total if, respectively, a region or the whole target are
not visible.
Scale Changes. It happens when a target moves during the sequence and in-
creases or decreases its size due to changes in its distance from the camera.
Similar Objects. It considers objects with similar features to those of the tar-
get (e.g., color, edges) as the tracker might be confused and track them.
After describing the problems covered by the dataset, we define the criteria to
evaluate their complexity (Table 2). These criteria include objective (illumina-
tion change, occlusion and scale change) and subjective (complex movement,
noise and similar objects) factors. Some factors can be artificially generated
(noise and illumination changes) allowing to create synthetic sequences or mod-
ify real ones with any required complexity.
Figure 3. Sample frames for the situations of the proposed dataset (from top row
to bottom row): synthetic (S1), laboratory (S2), Simple real (S3) and Complex
real (S4). In addition, samples of some tracking-related problems are also pre-
sented for each column (from left to right): abrupt illumination change, noise,
occlusion, scale change and (color-based) similar objects. Target are repre-
sented by green squares.
problems (noise, gradual and abrupt illumination changes), the problems are ar-
tificially introduced. S2 contains 21 sequences (~6,500 frames). Sample frames
are shown in the second row of Fig. 3.
5. Performance Evaluation
We describe the proposed evaluation measure and the four criteria to assess
tracker performance.
AWSO = \frac{1}{F_{GT}} \sum_{f=1}^{F_{GT}} WSO(f)   (2)
where F_{GT} is the number of frames with ground-truth data; o^i_f, l^i_f and g^i_f are
pixel coordinates of, respectively, the overlapped, the estimated and ground-
truth locations for frame f; N_O, N_E and N_{GT} are the number of overlapped,
estimated and ground-truth pixels; k_E(·) and k_{GT}(·) are two kernels that weight
each pixel inversely proportionally to its distance from, respectively, the
estimated (l^c_f) or ground-truth (g^c_f) center locations. Both are defined as:
k_E(o^i_f) = \left[ 1 - d(o^i_f, l^c_f) / d_{max}(o^i_f, l^c_f, l^{1...N_E}_f) \right]^n   (4)

k_{GT}(o^i_f) = \left[ 1 - d(o^i_f, g^c_f) / d_{max}(o^i_f, g^c_f, g^{1...N_{GT}}_f) \right]^n   (5)
where d(·, ·) is the Euclidean distance between each pixel coordinate (o^i_f, l^i_f or
g^i_f) and the center of the estimated (l^c_f) or ground-truth (g^c_f) target; d_{max}(·, ·, ·)
gets the maximum distance determined by the furthest ground-truth point g^i_f
along the line formed by o^i_f and g^c_f (similarly for the target estimation using
l^c_f and l^i_f instead of g^c_f and g^i_f); n controls the importance given to pixels
close to the center location.
As a summary, we measure performance for each frame by combining the
weighted coverage of the spatial overlap for both the estimation and the ground-
truth location. Values close to one (zero) indicate high (low) tracker spatial
accuracy. Fig. 4 shows an example of SO and WSO measures.
Figure 4. Sample results for standard SO and the proposed WSO measures for
different spatial overlaps. Estimations and ground-truth targets are depicted by,
respectively, blue and green squares.
where N_s and N_p are, respectively, the number of sequences and test values of
p; and AWSO_{s,v} and μ_{s,p} are, respectively, the accuracy result of the s-th
sequence for the value v of p and its mean over all v (computed as Eq. 2).
Detecting stable parameters requires defining the stability concept, which
often depends on the application. As a first approach, we threshold σp using a
maximum allowed deviation (σmax ) to accept stability (σp ≤ σmax ). However,
the opposite condition (σp > σmax ) does not imply instability as results may be
[Figure 5 plot: tracker accuracy (AWSO) as a function of the test values
v_1, ..., v_10 of parameters p_1, ..., p_7.]

      p1    p2    p3    p4    p5    p6    p7
σp   .012  .035  .266  .258  .354  .422  .425
ηp   .010  .031  -.668  .765  .010  .001  .566
γp   1     .556  .333  .444  .111  .556  .000
Figure 5. Sample results for parameter stability. The curves represent the re-
sult’s variation of seven tracker parameters (all with 10 test values) for one se-
quence. For σmax = 0.05, only p1 and p2 are stable. A predominant decreasing
and increasing trend is observed for, respectively, p3 and p4 (high |ηp|). p5 and
p6 have partial stability (high γp). The most unstable is p7 (low γp and high σp ).
partially stable (see Fig. 5). To detect it, we measure properties of the
AWSO_{s,v} results using the mean accumulated difference (η_p ∈ [−1, 1]) and
the ratio of consecutive stable values (γ_p ∈ [0, 1]):
η_p = \frac{1}{N_s} \sum_{s=1}^{N_s} \sum_{v=2}^{N_p} \left( AWSO_{s,v} - AWSO_{s,v-1} \right)   (7)
γ_p = \frac{1}{N_s} \sum_{s=1}^{N_s} \frac{1}{N_p} \sum_{v=2}^{N_p} \left( \Delta AWSO < σ_{max} \right)   (8)
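These measures are easy to compute from a table of AWSO results; the sketch
below assumes NumPy and, since the definition of σp (Eq. 6) is not reproduced
in this extraction, takes it as the mean per-sequence standard deviation, which
is an assumption:

import numpy as np

def stability_measures(awso, sigma_max=0.05):
    # awso: (Ns, Np) array of AWSO results for one parameter
    diffs = np.diff(awso, axis=1)                       # AWSO_{s,v} - AWSO_{s,v-1}
    sigma_p = awso.std(axis=1).mean()                   # assumed reading of Eq. 6
    eta_p = diffs.sum(axis=1).mean()                    # Eq. 7
    gamma_p = ((np.abs(diffs) < sigma_max).sum(axis=1)  # Eq. 8
               / awso.shape[1]).mean()
    return sigma_p, eta_p, gamma_p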
gAWSO = \frac{1}{F_T} \sum_{s=1}^{N_s} F^s_{GT} \cdot AWSO_{s,v^*}   (9)
where $s$ indicates the $s$th sequence and $v^*$ the optimum values of the tracker parameters as computed in sec. 5.2.1; $AWSO_{s,v^*}$ is its accuracy value computed as eq. 2; and $F_{GT}^{s}$ and $F_T$ are the number of frames of, respectively, each sequence and the whole problem (with $N_s$ test sequences).
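As eq. (9) is a frame-weighted average, it can be sketched directly; awso_opt and frames_per_seq are hypothetical arrays holding, respectively, the per-sequence accuracies at the optimum parameter values and the per-sequence ground-truth frame counts:

```python
import numpy as np

def global_awso(awso_opt, frames_per_seq):
    """Sketch of eq. (9): sequence accuracies weighted by sequence length."""
    f_gt = np.asarray(frames_per_seq, dtype=float)   # F_GT^s for each sequence
    a = np.asarray(awso_opt, dtype=float)            # AWSO_{s,v*} for each sequence
    return float((f_gt * a).sum() / f_gt.sum())      # F_T is the sum of all F_GT^s
```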
Visual tracking, where real-time constraints may apply [27], often involves intensive processing. Hence, the computational complexity of tracking has to be estimated. It can be determined theoretically via big-O notation [44]. However, current trackers contain many stages, which limits the use of such notation. In practice, this complexity can be approximated as the mean time (or memory) required for tracking. Note that both measures depend on the implementation and the testing machine, but they are accepted as approximations of algorithmic complexity [8]. We extend this complexity analysis by computing such time as a function of the target size.
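A rough sketch of this empirical estimation is given below; tracker_update stands for a hypothetical per-frame update callable of any of the evaluated trackers, and the measured value depends on the implementation and machine, as noted above:

```python
import time

def mean_time_per_frame(tracker_update, frames):
    """Mean wall-clock seconds per tracked frame (empirical complexity)."""
    start = time.perf_counter()
    for frame in frames:
        tracker_update(frame)          # one tracking iteration on one frame
    return (time.perf_counter() - start) / max(len(frames), 1)
```

Binning such measurements by target area yields curves like those of Fig. 13.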
6. Experiments
We present two experiments for the proposed approach. First, we evaluate the
proposed measure AWSO (Sec. 5.1) and related ones for tracker accuracy. Then,
we apply the proposed evaluation methodology (Sec. 5.2) to selected trackers
using the SOVTds dataset (Sec. 4). A standard PC (P-IV 2.8GHz and 2 GB
RAM) is used.
Table 3. Evaluation results for different tracking errors. Results are for the CBWH (low), MS (med) and SOAMST (high) trackers using selected sequences from the Karlsruhe [1], PETS2010 [37], CLEMSON [10], I-LIDS [3] and VISOR [48] datasets.

Error case   Test sequence        SFDA   ATA    iATE   iAUC   TC     AWSO
Low S3 cars nM .741 .741 .840 .758 1 .867
Low S3 cars nL .753 .753 .849 .779 1 .825
Low S3 faces nM .797 .797 .797 .791 1 .894
Low S3 faces nL .804 .804 .804 .799 1 .899
Low S3 people nH .829 .829 .829 .824 1 .909
Low S3 people nM .837 .837 .817 .831 1 .920
Low S3 people nL .841 .841 .831 .836 1 .925
Med seq bb .429 .592 .466 .426 .608 .488
Med seq jd .353 .638 .384 .351 .545 .325
Med seq mb .642 .694 .558 .637 .922 .679
Med seq ms .510 .685 .510 .507 .758 .594
Med seq sb .427 .675 .363 .424 .631 .453
Med seq villains2 .553 .567 .535 .549 .920 .617
High AB Easy man .275 .376 .534 .272 .615 .385
High mv2 002 redcar .225 .394 .557 .223 .542 .300
High mv2 005 scar .205 .426 .448 .203 .451 .105
High visor2 head .120 .628 .132 .119 .167 .148
High visor5 head .035 .572 .039 .035 .062 .032
High visor6 head .084 .392 .181 .083 .172 .130
This inconsistency is depicted in Fig. 7(a), where both results have similar SFDA. The results of SFDA and AWSO show that both measures provide a reliable estimation for all error cases. However, as shown in Fig. 7(b), low SFDA values may correspond to correctly tracked targets, as most of the estimated locations are very close to the target center; AWSO compensates for this effect and hence provides a better evaluation.
In conclusion, SFDA correctly represents tracker performance for all cases. Although the theoretical SFDA range is [0, 1], according to the results its real range is almost [0, 0.8], which reduces its variability. iAUC is very similar to SFDA. ATA and TC are not consistent for, respectively, high and all error cases. The proposed AWSO addresses the annotation errors, allowing full range coverage and thus improving on SFDA.
For the probabilistic tracker (PFC), we inspect the number of particles (N) and the noise of the predicted target position (σpos). For detection-based trackers (IVT and TLD), we inspect the number of samples taken (numsample), the template size (tmplsize), the number of previous frames buffered (batchsize), the modeling complexity (numtrees) and the number of detections evaluated in each iteration (maxbbox). We use seven representative sequences of the dataset for the stability analysis.
Table 4 presents the results of the proposed stability measures, which we analyze considering σmax = 0.05 (i.e., tolerating a maximum deviation of 5%). For MeanShift-based trackers (MS, CBWH and SOAMST), maxiter is stable and WSA shows a noticeable decreasing trend (high σp and negative ηp).
Low values of WSA are therefore preferred for better performance. WSA behaves identically in TM. Both LK parameters (SigmaIter and TransIter) show high invariance in the tracker results; thus, low values are selected to reduce complexity. For PFC, σpos requires more tuning effort as it has a high σp. However, it has a clear stable range (γp = .625) and no relevant increasing or decreasing trend (ηp = .060). The most invariant parameter is the number of particles (N), as all the test values provide similar performance. For IVT, a noteworthy reliance on tmplsize is observed, where the best performance is clearly obtained for few values (γp = .167) with an almost peaked pattern (no slope, as ηp = −.021, and high σp).
Cars (First Row of Fig. 8). For high overlap (90%), results are very similar to ground-truth initialization (the 100% case), indicating that the relevant target data is still included. Among the trackers, LK has low performance with a decreasing trend. PFC shows high robustness to size changes (wh) due to its ability to correct the estimated target scale (from an inaccurate one to the true one). For IVT and TLD, their target model update schemes degrade rapidly if non-accurate target samples are included (low overlaps) and the background contains features similar to those of the target. CBWH, CT, ACA and PCOM are the best approaches, capable of dealing with non-accurate initialization better than the other trackers since they employ sparse representations and adapt to color changes.
Faces (Second Row of Fig. 8). Face targets have a lower overall complexity, which is reflected in the results. In general, face targets allow easy annotation in a bounding-box format; hence, the values are higher than in the Cars case for all the categories. In fact, some trackers (e.g., CBWH, MS, PCOM) get better results for slight size changes (the 90% case). This can be explained by the ground-truth annotation not being completely accurate, with errors at the borders. Besides, PFC and SOAMST also show high robustness to size and position changes. Unlike for Cars, the sequential update of the target model of IVT
Figure 8. Target initialization performance of the selected trackers for Car (first row), Face (second row) and Person (third row) targets using SFDA [25]. The first, second and third columns correspond to changes in, respectively, size (wh), position (xy) and both (xywh). For each change, three spatial overlaps with the ground-truth data are considered: 90%, 75% and 50%. The 100% case is the ground-truth initialization.
does not have a significant impact on performance. However, TLD still presents degradation due to the intensive use of background data for such updates. Trackers with low scores (LK, STC and TM) improve their results, as reducing the initialization size decreases the amount of background information in the computed target model. Finally, CT shows its robustness in the 90–75% cases.
Figure 9. Performance scores (gAWSO) of the selected trackers for each problem of the S1 situation (synthetic sequences).
People (Third Row of Fig. 8). As the overlap is reduced, performance degrades at a higher rate than in the previous cases. Moreover, SOAMST, ST and PFC demonstrate high robustness, since the performance drop is not severe for size and position changes of the target. CBWH and PCOM also get results similar to ground-truth initialization (for the 90% and 75% cases), indicating possible errors at the borders of the ground-truth annotations. As in the Faces case, the accuracy decrease is harder to observe in low-performance trackers such as TM and LK. In fact, TM shows an improvement when the initialization size is highly reduced. STC shows that the spatial context of people targets can change dramatically depending on the initialization. IVT, TLD and the rest of the trackers lead to conclusions similar to the previous results. Among the trackers, CBWH, PCOM and PSRMT obtain the best results for size and position variations.
After analyzing parameter variation and target initialization, we present the re-
sults of the selected trackers for the modeled situations in the dataset. We use
the same settings for all sequences.
S1: Synthetic Sequences (Fig. 9). The best results are provided by many trackers (TM, ACA, PCOM, ST and LK), as the background of all the sequences has a uniform color different from that of the target. LK has high performance in most of the tracking problems except for scale changes, as it is not able to adapt to drastic size changes. CBWH also presents good performance for the noise and complex motion cases.
Figure 10. Performance scores of selected trackers for each problem of the S2
situation (lab sequences).
S2: Lab Sequences (Fig. 10). Six trackers have the best performance (see the Mean bars) in most of the problems: ACA, CBWH, PCOM, MS, TLD and TM. High robustness to noise is observed in most of the trackers, whereas they struggle in the presence of abrupt motion, scale changes and occlusions. For TM, its non-adaptivity to scale changes is reflected in its low results compared to the other approaches. LK obtains low performance, as most of the problems cause it to lose the target at the beginning of the sequence without finding it again, therefore delivering poor results. In particular, its performance decrease for occlusions is notable compared with the other trackers. CBWH performs successfully in most of the cases except for scale changes and similar objects
Figure 11. Performance scores of selected trackers for each problem of the S3 situation (simple real sequences).
because, respectively, it tracks constant-size targets and it does not compute an accurate target model in the presence of similar objects. MS slightly outperforms CBWH, showing that discriminating the features in the target neighborhood does not always improve results (especially if the sequence contains complex backgrounds). PFC gets medium results in all the problems except scale changes, where its adaptation to size changes pays off; however, its overall results are worse than those of other trackers. SOAMST is similar to PFC, showing a significant performance decrease for complex movement and global illumination changes. IVT and TLD have limitations when tracking under sudden target motions, but both perform among the best trackers for the other problems. For similar objects, discriminative trackers (PSRMT, ST, STC and PCOM) obtain good results, as they consider the nearby background, which differs slightly between nearby objects. In summary, ACA provides a good compromise for each tracking problem, always being among the top trackers.
S3: Simple Real Sequences (Fig. 11). Robustness to noise and global illumination changes is achieved by CBWH, LK, PCOM, ACA, PSRMT, ST and IVT. It should be noted that all trackers failed for occlusions. CBWH obtains good results, closely followed by LK. Unlike in S1 and S2, PFC results are comparable to the best ones, as real targets always undergo small changes in size and appearance that PFC is capable of dealing with. TM results, in contrast, drop compared to the previous situations: the presence of objects similar to the target is frequent in real data and, therefore, TM is easily distracted. However, LK is able to adapt the template to target changes by considering the neighborhood of the target. SOAMST adaptation is heavily affected by target-like
objects in the background, as it only looks for similar features, disregarding the target size. It is not able to correctly track scale changes with real data, and other similar approaches without size adaptability (CBWH and MS) get better results. MS has average performance, with great robustness to global illumination changes. IVT leads to conclusions similar to S2: it is sensitive to complex motion and occlusions, which affect the accuracy of the target model update and lead to drifting. TLD shares the drawbacks of IVT whilst also being strongly affected by objects similar to the target located close to it. The sparse target modeling of ST is adequate when facing real-data problems. Unlike in the previous situations, ACA shows that color adaptability faces additional challenges compared to controlled environments (S1 and S2). PCOM is again among the best trackers, as its modeling of noise is appropriate for real data. STC shows that contextual target information is difficult to extract, as much clutter often exists around the target. Again, sparse models (ST and PSRMT) present robustness against occlusions and similar objects.
S4: Complex Real Sequences (Fig. 12). S4 data have the highest complexity, as the sequences mix various tracking problems. Thus, we analyze the target types instead of each problem. Results for Cars show the best performance, as these targets allow easy annotation and modeling. Face targets have similar characteristics, but they might move quickly (as the camera distance is usually shorter than for Cars). Hence, the model update scheme is affected by wrong target estimations, explaining the performance drop of ACA, CT, STC, IVT, LK and SOAMST. For People targets, a clear decrease in performance is observed, showing how difficult they are to model and track. Among the trackers, CBWH is the best, closely followed by ACA, as color cues are adapted, and by CT/PFC, as removing background data from the Person model improves accuracy. MS and ST present slightly lower performance, being limited for tracking People targets. Finally, IVT, SOAMST, LK, TLD and TM reduce their accuracy when dealing with complex real data (as compared with the other situations). As expected, the performance for S4 is the lowest of all situations.
Figure 12. Performance scores of selected trackers for each target type of the S4 situation (complex real sequences).
The fastest trackers are ACA, STC, CT, TM, CBWH and MS (with, respectively, 0.013, 0.009, 0.015, 0.011, 0.025 and 0.033 seconds per frame) due to their simple computations. A second category comprises trackers with medium complexity, such as LK (0.309) and SOAMST (0.362). Finally, the advanced trackers are the slowest: TLD (0.578), IVT (0.789), PFC (1.155), PCOM (0.49), PSRMT (0.57) and ST (0.96). Although these results depend on implementations that may not be optimal, they allow a rough speed-based categorization.
Fig. 13 depicts the execution time versus the target area, which can be understood as a measure of complexity scalability. As expected, most of the trackers require more time for increasing target sizes. In contrast, IVT and TLD show a different trend: both trackers use a predefined number of fixed-size patches extracted from the target, allowing them to be almost size-independent. This advantage could be useful when dealing with high-quality data. However, there is an additional cost for small targets, where their execution time is higher than that of the other trackers.
6.2.5. Discussion
Here we discuss the major findings after analyzing the selected trackers with the
proposed methodology.
For parameter stability, the search area is a sensitive parameter in many trackers. In real settings, it should be close to the target area to avoid including objects similar to the target in the analysis. However, robustness against size changes and sudden motion requires larger search areas. As a result, tuning this parameter exhibits a trade-off between adaptability (to size and motion)
Figure 13. Execution time of each tracker (in logarithmic scale) versus the area
of the target being tracked (in pixels).
and drifting (of the target model). For the probabilistic tracker (PFC) and several trackers by detection (ST, PCOM), the most relevant parameter concerns the predicted target position (σpos), which depends on the expected target motion, rather than the number of particles. Patch-based trackers (IVT and TLD) are very dependent on the size of the template, which should be fixed for all target types. Parameters in charge of the model update schemes (batchsize of IVT and maxbbox of TLD) are also unstable, showing that automatic update is still an open issue.
Concerning the initialization results, three findings are noteworthy: first, a slight reduction of the ground-truth size is preferred to avoid annotation errors and improve performance; second, non-accurate target initialization frequently leads to errors in the automatic update of target models; third, all trackers follow the same trend for size-position changes, showing that the higher the overlap, the better the results (as expected).
Some conclusions can be extracted from the analysis of the tracking problems. CBWH, ACA and PCOM show the best results in most of the experiments, as discarding background data and modeling noisy inputs are good strategies to improve tracking (MS). As CBWH tracks fixed-size targets, it demonstrates that size adaptation is not fully solved in real scenarios (see the results of SOAMST, LK, IVT and TLD). Context information is difficult to use for managing the update of the target model (STC). Multi-hypothesis trackers such as ST globally improve the performance but increase the computational cost. Robustness to noise is achieved by all the trackers, and illumination changes are partially handled by the evaluated trackers (PCOM, PSRMT, CBWH, LK, TLD, IVT). For occlusions and similar objects, the selected trackers obtain low performance, even in the presence of short-term occlusions. Finally, a noticeable performance drop is observed in sequences mixing problems (situation S4), which represent complex real data. Unlike the trend exhibited by many trackers, PFC gets better results for real data, as it handles data complexity more effectively.
Conclusion
In this chapter, we have presented a methodology for the performance evaluation of single-object visual tracking based on ground-truth data. It proposes a standard procedure for comparing trackers on sequences that represent the most relevant problems. In particular, we consider four situations ranging from controlled (synthetic sequences) to uncontrolled (complex real sequences) conditions. For each one, a set of sequences is generated for each problem with different degrees of complexity. This dataset can be extended by including video sequences from large-scale evaluations [45]. The methodology evaluates tracker performance in terms of its parameter stability, robustness to initialization, global accuracy and computational complexity. For estimating accuracy, a novel measure is proposed that compensates for the errors made by the annotators (mainly at the target borders), based on the widely used spatial overlap measure. Finally, experiments are performed to demonstrate the utility of the proposed methodology. We compare the proposed accuracy measure against representative state-of-the-art ones, demonstrating its utility for high, medium and low error cases. Then, we apply the proposed methodology to evaluate relevant state-of-the-art trackers against different tracking problems.
As future work, we will focus on extending the proposed approach to eval-
uate multi-target tracking.
Acknowledgment
This work has been partially supported by the Spanish Government (TEC2014-
53176-R HAVideo).
References
[1] (Last accessed, 05 Apr 2013). Institut für Algorithmen und Kognitive Systeme: Cars Dataset. http://i21www.ira.uka.de/image-sequences/.
[3] AVSS2007 (Last accessed, 05 Apr 2013). I-LIDS dataset for avss 2007.
http://www.avss2007.org/.
[4] Bailer, C., Pagani, A., and Stricker, D. (2014). A superior tracking ap-
proach: Building a strong tracker through fusion. In European Conf. on
Computer Vision, page (In press).
[7] Baumann, A., Boltz, M., Ebling, J., Koenig, M., Loos, H. S., Merkel, M.,
Niem, W., Warzelham, J. K., and Yu, J. (2008). A review and comparison
of measures for automatic video surveillance systems. EURASIP J Image
Video Process, 2008:1–30.
[11] Black, J., Ellis, T., and Rosin, P. (2003). A novel method for video
tracking performance evaluation. In Proc. IEEE Int. Workshop Perform.
Eval. Track. Surveill., pages 125–132, Nice (France).
[13] Carvalho, P., Cardoso, J. S., and Corte-Real, L. (2012). Filling the gap
in quality assessment of video object tracking. Image Vision Comput.,
30(9):630 – 640.
[14] CAVIAR (Last accessed, 05 Apr 2013). Context Aware Vision using
Image-based Active Recognition.
http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.
[15] Chu, D. and Smeulders, A. (2010). Thirteen hard cases in visual tracking.
In Proc. IEEE Adv. Video-Based Signal Surveill., pages 103–110, Boston
(USA).
[16] Comaniciu, D., Ramesh, V., and Meer, P. (2003). Kernel-based object
tracking. IEEE Trans Pattern Anal. Mach. Intell., 25(5):564–577.
[17] Danelljan, M., Khan, F. S., Felsberg, M., and van de Weijer, J. (2014).
Adaptive color attributes for real-time visual tracking. In IEEE Int. Conf.
on Computer Vision and Pattern Recognition, page (In press).
[18] Doermann, D. and Mihalcik, D. (2000). Tools and techniques for video
performance evaluation. In Proc. Int. Conf. Pattern Recog., pages 167–
170.
[19] Edward, K., Matthew, P., and Michael, B. (2009). An information the-
oretic approach for tracker performance evaluation. In Proc. IEEE Int.
Conf. Comput. Vis., pages 1523 –1529.
[20] Gao, Y., Ji, R., Zhang, L., and Hauptmann, A. (2014). Symbiotic tracker
ensemble toward a unified tracking framework. IEEE Trans. on Circuits
and Systems for Video Technology, 24(7):1122–1131.
[21] Hong, S., Kwak, S., and Han, B. (2013). Orderless tracking through
model-averaged posterior estimation. In Computer Vision (ICCV), 2013
IEEE International Conference on, pages 2296–2303.
[25] Kasturi, R., Goldgof, D., Soundararajan, P., Manohar, V., Garofolo, J.,
Bowers, R., Boonstra, M., Korzhova, V., and Zhang, J. (2009). Frame-
work for performance evaluation of face, text, and vehicle detection and
tracking in video: Data, metrics, and protocol. IEEE Trans. Pattern Anal.
Mach. Intell., 31(2):319–336.
[26] List, T., Bins, J., Vazquez, J., and Fisher, R. (2005). Performance eval-
uating the evaluator. In Proc. IEEE Int. Workshop Perform. Eval. Track.
Surveill., pages 129–136.
[27] Maggio, E. and Cavallaro, A. (2011). Video tracking: theory and prac-
tice. Wiley.
[29] Nghiem, A., Bremond, F., Thonnat, M., and Valentin, V. (2007). Etiseo,
performance evaluation for video surveillance systems. In Proc. IEEE
Adv. Video-Based Signal Surveill., pages 476–481, London (UK).
[30] Ning, J., Zhang, L., Zhang, D., and Wu, C. (2012a). Robust mean shift
tracking with corrected background-weighted histogram. IET Computer
Vision, 6(1):62–69.
[31] Ning, J., Zhang, L., Zhang, D., and Wu, C. (2012b). Scale and orientation
adaptive mean shift tracking. IET-Computer Vision, 6(1):52–61.
[32] Nummiaro, K., Koller-Meier, E., and Van Gool, L. (2003). An adaptive color-based particle filter. Image and Vision Computing, 21(1):99–110.
[33] Oron, S., Bar-Hillel, A., Levi, D., and Avidan, S. (2014). Locally order-
less tracking. Int. Journal of Computer Vision, pages 1–16.
[34] Pang, Y. and Ling, H. (2013). Finding the best from the second bests: inhibiting subjective bias in evaluation of visual tracking algorithms. In Proc. of IEEE Int. Conf. on Computer Vision, pages 1–8, Sydney (Australia).
[35] PETS Datasets (Last accessed, 05 Apr 2013). IEEE Int. Workshop Per-
form. Eval. Track. Surveill. (2001-2007).
http://www.cvg.rdg.ac.uk/datasets/index.html.
[36] PETS2000 (Last accessed, 05 Apr 2013). IEEE Int. Workshop Perform.
Eval. Track. Surveill. (2000). ftp://ftp.pets.rdg.ac.uk/pub/PETS2000.
[37] PETS2010 (Last accessed, 05 Apr 2013). IEEE Int. Workshop Perform.
Eval. Track. Surveill. (2010). http://pets2010.net/.
[40] Ross, D. A., Lim, J., Lin, R.-S., and Yang, M.-H. (2008). Incremental
learning for robust visual tracking. Int. J. Comput. Vision, 77(1-3):125–
141.
[41] SanMiguel, J., Cavallaro, A., and Martinez, J. (2012). Adaptive online
performance evaluation of video trackers. IEEE Trans. Image Process.,
21(5):2812 –2823.
[42] Schlogl, T., Beleznai, C., Winter, M., and Bischof, H. (2004). Perfor-
mance evaluation metrics for motion detection and tracking. In Proc.
IEEE Int. Conf. Pattern Recogn., volume 4, pages 519 – 522 Vol.4.
[43] Sebastian, P., Comley, R., and Voon, Y. (2011). Performance evaluation metrics for video tracking. IETE Tech. Review, 28(6):493–502.
[45] Smeulders, A., Chu, D., Cucchiara, R., Calderara, S., Dehghan, A., and
Shah, M. (2014). Visual Tracking: An Experimental Survey. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468.
[49] Wang, D. and Lu, H. (2014). Visual tracking via probability continu-
ous outlier model. In IEEE Int. Conf. on Computer Vision and Pattern
Recognition, page (In press).
[50] Wang, D., Lu, H., and Yang, M.-H. (2013). Online object tracking with
sparse prototypes. IEEE Transactions on Image Processing, 22(1):314–
325.
[51] Wang, Q., Chen, F., Xu, W., and Yang, M. H. (2011). An experimental
comparison of online object-tracking algorithms. In Proceedings of the
SPIE, pages 81381A–81381A.
[52] Wu, H., Sankaranarayanan, A., and Chellappa, R. (2010). Online empir-
ical evaluation of tracking algorithms. IEEE Trans. Pattern Anal. Mach.
Intell., 32(8):1443–1458.
[53] Wu, Y., Lim, J., and Yang, M. H. (2013). Online object tracking: A
benchmark. In Proc. of IEEE Int. Conf. on Computer Vision and Pattern
Recognition, pages 1–8, Portland (Oregon, USA).
[54] Yang, F., Lu, H., and Yang, M.-H. (2014). Robust superpixel tracking.
IEEE Trans on Image Processing, 23(4):1639–1651.
[55] Yang, H., Shao, L., Zheng, F., Wang, L., and Song, Z. (2011). Re-
cent advances and trends in visual tracking: A review. Neurocomputing,
74(18):3823 – 3831.
[56] Yi, K. M., Jeong, H., Heo, B., Chang, H. J., and Choi, J. Y. (2013).
Initialization-insensitive visual tracking through voting with salient local
features. In Computer Vision (ICCV), 2013 IEEE International Confer-
ence on, pages 2912–2919.
[57] Yin, F., Makris, D., Velastin, S., and Orwell, J. (2010). Quantitative eval-
uation of different aspects of motion trackers under various challenges.
Annals of the BMVA, 5:1–11.
[58] Zhang, K., Zhang, L., and Yang, M.-H. (2014a). Fast compressive tracking. IEEE Trans. Pattern Anal. Mach. Intell., 36(10):2002–2015.
[59] Zhang, K., Zhang, L., Yang, M.-H., and Zhang, D. (2014b). Fast tracking
via spatio-temporal context learning. In European Conf. on Computer
Vision, page (In press).
[60] Zhang, L. and van der Maaten, L. (2014). Preserving structure in model-
free tracking. IEEE Trans on Pattern Analysis and Machine Intelligence,
36(4):756–769.