
FRAGMENT-BASED REAL-TIME OBJECT TRACKING:

A SPARSE REPRESENTATION APPROACH


Naresh Kumar M. S., Priti Parate, R. Venkatesh Babu
Supercomputer Education and Research Centre
Indian Institute of Science, Bangalore, India - 560012
ABSTRACT
Real-time object tracking is a critical task in many computer vision applications. Achieving rapid and robust tracking while handling changes in object pose and size, varying illumination and partial occlusion is a challenging task given the limited amount of computational resources. In this paper we propose a real-time object tracker in the $\ell_1$ framework addressing these issues. In the proposed approach, dictionaries containing templates of overlapping object fragments are created. The candidate fragments are sparsely represented in the dictionary fragment space by solving the $\ell_1$-regularized least-squares problem. The non-zero coefficients indicate the relative motion between the target and candidate fragments, along with a fidelity measure. The final object motion is obtained by fusing the reliable motion information. The dictionary is updated based on the object likelihood map. The proposed tracking algorithm is tested on various challenging videos and found to outperform the earlier approach.
Index Terms: Object tracking, Fragment tracking, Motion estimation, $\ell_1$ minimization, Sparse representation
1. INTRODUCTION
Visual tracking is an important task in computer vision, with a variety of applications such as surveillance, robotics, human-computer interaction and medical imaging. One of the main challenges that limits the performance of a tracker is appearance change caused by variation in pose, illumination or view point. A significant amount of work has been done to address these problems and develop a robust tracker. However, robust object tracking still remains a big challenge in computer vision research.
There have been many proposals towards building a robust tracker; a thorough survey can be found in [1]. In early works, minimizing the SSD (sum of squared differences) between regions was a popular choice for the tracking problem [2], and a gradient descent algorithm was most commonly used to find the minimum SSD. Often in such methods, only a local minimum could be reached. The mean-shift tracker [3] uses mean-shift iterations and a similarity measure based on the Bhattacharyya coefficient between the target model and candidate regions to track the object. The incremental tracker [4] and covariance tracker [5] are other examples of tracking methods which use an appearance model to represent the target observations.
One of the recently developed and popular trackers is the $\ell_1$ tracker [6]. In this work, the authors utilize a particle filter to select candidate particles and then represent them sparsely in the space spanned by the object templates using $\ell_1$ minimization. This requires a large number of particles for reliable tracking, which results in a high computational cost and thus brings down the speed of the tracker. Attempting to speed up the tracking by reducing the number of particles only deteriorates the accuracy of the tracker. There have been attempts to improve the performance of [6]. In [7] the authors reduce the computation time by decomposing a single object template into the particle space. In [8] hash kernels are used to reduce the dimensionality of the observation.
In this paper, we propose a computationally efficient, $\ell_1$-minimization-based, real-time and robust tracker. The tracker uses fragments of the object and the candidate to estimate the motion of the object. The number of candidate fragments required to track the object in this method is small, thus reducing the computational burden of the $\ell_1$ tracker. Further, the fragment-based approach combined with the trivial templates makes the tracker robust against partial occlusion. The results show that the proposed tracker gives more accurate tracking at much higher execution speeds in comparison to the earlier approach.
The rest of the paper is organized as follows. Section 2 provides an overview of the proposed tracker. Section 3 describes the proposed approach in detail. Section 4 discusses the results and Section 5 concludes the paper.
2. OVERVIEW
The proposed tracking algorithm is essentially a template tracker in the $\ell_1$ framework. The object is partitioned into overlapping fragments that form the atoms of the dictionary. The candidate fragments are sparsely represented in the space spanned by the dictionary fragments by solving the $\ell_1$ minimization problem. The resulting sparse representation indicates the flow of fragments between consecutive frames. This flow information, or the motion vectors, is utilized for estimating the object motion between consecutive frames. The proposed algorithm uses only grey-scale information for tracking. Similar to the mean-shift tracker [3], the proposed algorithm also assumes sufficient overlap between the object and candidate regions, such that there is at least one fragment in the candidate area that corresponds to an object fragment. In this approach two dictionaries are used. One is kept static while the other is updated based on the tracking result and a confidence measure computed using histogram models. The dictionaries are initialized with the object selected in the first frame. The proposed algorithm is able to track objects with rapid changes in appearance, illumination and occlusion in real time. Changes in size are also tracked to some extent.
3. PROPOSED APPROACH
3.1. Sparse representation and $\ell_1$ minimization

The discriminative property of sparse representation has recently been utilized for various computer vision applications such as tracking [6], detection [9] and classification [10]. A candidate vector $y$ can be sparsely represented in the space spanned by the columns of the dictionary matrix $D = [d_1, d_2, \ldots, d_n] \in \mathbb{R}^{l \times n}$. Mathematically,

$$y = Da \quad (1)$$

where $a = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^n$ is the coefficient vector over the basis $D$. In application, the system represented by (1) can be underdetermined, since $l \ll n$, and there is then no unique solution for $a$. Such a system is solved as an $\ell_1$-regularized least-squares problem, which is known to yield sparse solutions [10]:

$$\min_a \; \|Da - y\|_2^2 + \lambda \|a\|_1 \quad (2)$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ are the $\ell_1$ and $\ell_2$ norms, respectively, and $\lambda$ weights the sparsity-inducing penalty.
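For illustration, the following minimal sketch shows how such an $\ell_1$-regularized least-squares problem can be solved in Python. The paper uses the SPAMS toolbox [11]; here scikit-learn's Lasso (whose objective matches (2) up to a scaling of $\lambda$) is used as a stand-in, and all sizes are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical sizes: l-dimensional vectorized fragments, n dictionary atoms.
l, n = 64, 225
rng = np.random.default_rng(0)
D = rng.standard_normal((l, n))        # dictionary of fragment templates
a_true = np.zeros(n)
a_true[7] = 1.0                        # candidate matches atom 7 exactly
y = D @ a_true                         # candidate fragment vector

# Solve min ||Da - y||_2^2 + lambda ||a||_1 (eq. (2), up to lambda scaling).
lasso = Lasso(alpha=0.01, max_iter=10_000)
lasso.fit(D, y)
a = lasso.coef_                        # sparse coefficient vector
print(int(np.argmax(np.abs(a))))       # should recover index 7
```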
3.2. Dictionary creation and object-candidate fragments

In template-based tracking in the $\ell_1$ framework, tracking is achieved by matching the candidate template to one among a set of object templates through sparse representation [6]; the set of object templates forms the dictionary. On similar lines, in our method the dictionaries are initialized with overlapping fragments of the object. We make use of two dictionaries: one is static and the other is updated for the purpose of modelling the appearance changes. The fragment size depends on the original object size and the dictionary array size. Suppose the object size is $M \times N$ and we choose the dictionary size to be $u \times v$; then the fragment size will be $(M - u + 1) \times (N - v + 1)$. Figure 1 shows the dictionary created from the overlapping fragments of the object by going through the object in raster-scan order. Each fragment is resized into a template of predefined size and then vectorized into a single column vector $d_{qj} \in \mathbb{R}^l$. Each dictionary is a set $D_q = [d_{q1}, d_{q2}, \ldots, d_{qn}] \in \mathbb{R}^{l \times n}$.

Fig. 1. Object fragment dictionary of size $15 \times 15$.

For tracking, the candidate in the current frame is taken as the area of pixels where the object was located in the previous frame. An array of fragments of the candidate is constructed in the same way as for the object dictionary, and the size of this array is the same as that of the dictionary. Only a certain number of fragments, sub-sampled from this array of candidate fragments, are sufficient to estimate the motion of the object and track it.
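A minimal sketch of this dictionary construction, assuming a grey-scale object patch and the hypothetical helper name build_fragment_dictionary, could look as follows (OpenCV is used only for resizing; any resize routine works):

```python
import numpy as np
import cv2  # used only for resizing; any resize routine works

def build_fragment_dictionary(obj, u=15, v=15, frag_shape=(8, 8)):
    """Build one dictionary D_q from the u*v overlapping fragments of an
    object patch, scanned in raster order (cf. Fig. 1). Each fragment of
    size (M-u+1) x (N-v+1) is resized to frag_shape and vectorized."""
    M, N = obj.shape
    fh, fw = M - u + 1, N - v + 1        # fragment size (Section 3.2)
    cols = []
    for r in range(u):                    # raster scan over the fragment grid
        for c in range(v):
            frag = obj[r:r + fh, c:c + fw].astype(np.float32)
            frag = cv2.resize(frag, (frag_shape[1], frag_shape[0]))
            cols.append(frag.ravel())     # d_qj in R^l, l = 8*8 here
    return np.stack(cols, axis=1)         # D_q in R^{l x n}, n = u*v
```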
Each candidate fragment is represented as a sparse linear combination of the dictionary fragments. Equation (1) can be rewritten for the $k$-th candidate fragment $y_k \in \mathbb{R}^l$ in a set of $p$ candidate fragments $Y = [y_1, y_2, \ldots, y_p] \in \mathbb{R}^{l \times p}$ as

$$y_k = [D_1, D_2]\,[A_{1k}^T, A_{2k}^T]^T \quad (3)$$

where $D_1$ and $D_2$ are the static and dynamic dictionaries respectively and $A_{qk} = [a_{qk1}, a_{qk2}, \ldots, a_{qkn}]^T \in \mathbb{R}^n$ is the corresponding target coefficient vector. Equation (3) represents an underdetermined system, since $l \ll 2n$, which can be solved for a sparse solution using $\ell_1$ minimization as described in Section 3.1.
3.3. Handling occlusion, clutter, changes in illumination, appearance and size

In order to handle mild occlusions and clutter, trivial templates (positive and negative) are used as proposed in [6]. The negative trivial templates impose a non-negativity constraint on the target coefficient vector. In our approach, fragmentation helps the tracker even under heavy partial occlusion. Occluded candidate fragments with low confidence measures are eliminated before estimating the object motion. Equations (3) and (2) can now be written as

$$y_k = Dx_k \quad (4)$$

$$\min_{x_k} \; \|Dx_k - y_k\|_2^2 + \lambda \|x_k\|_1 \quad (5)$$

where $D = [D_1, D_2, I, -I]$, $x_k = [A_{1k}^T, A_{2k}^T, (e_k^+)^T, (e_k^-)^T]^T$, $I = [i_1, i_2, \ldots, i_l] \in \mathbb{R}^{l \times l}$ is the set of trivial templates, and $i_i \in \mathbb{R}^l$ is a vector with only one non-zero element. $e_k^+ \in \mathbb{R}^l$ and $e_k^- \in \mathbb{R}^l$ are the positive and negative trivial coefficient vectors respectively.
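As an illustration, a sketch of solving (5) for one candidate fragment might look as follows; the function and parameter names are assumptions, and scikit-learn's Lasso with a positivity constraint again stands in for SPAMS:

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_match(D1, D2, y_k, lam=0.01):
    """Solve eq. (5) for one candidate fragment y_k with the augmented
    dictionary D = [D1, D2, I, -I]; the trivial templates absorb pixels
    corrupted by occlusion or clutter. Constraining x_k >= 0 over this
    augmented dictionary enforces non-negative target coefficients."""
    l = y_k.shape[0]
    I = np.eye(l)
    D = np.hstack([D1, D2, I, -I])               # l x (2n + 2l)
    lasso = Lasso(alpha=lam, positive=True, max_iter=10_000)
    lasso.fit(D, y_k)
    return lasso.coef_                           # x_k = [A1k; A2k; e+; e-]
```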
The object pose and illumination are prone to change, and a static dictionary is unreliable in such cases: it loses its ability to model the object once the appearance changes, and the error accumulated over time results in a drift from the actual object position. Incorporating a second dictionary and updating it throughout the tracking process helps overcome this problem. Currently our algorithm does not include measures to handle excessive changes in the object size. However, it can cope with small changes in the object size, since these are easily captured by the fragment templates that are updated in the second dictionary.
3.4. Motion estimation through fragment matching and confidence measure

A sparse solution for $x_k$ indicates which fragment from the dictionaries most closely resembles the candidate fragment $y_k$. A sparse reconstruction is obtained for all $p$ candidate fragments. A set of $p'$ candidates with the highest confidence measures is chosen out of the $p$ candidates. The confidence measure is computed from the target coefficient vector as

$$C_{conf,k} = \left( \sum_{t=1}^{2n} x_k(t) \right) \bigg/ \left( 1 + \sum_{t=2n+1}^{2n+2l} |x_k(t)| \right) \quad (6)$$
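In code, assuming the coefficient layout of Section 3.3 ($2n$ target entries followed by $2l$ trivial entries) and the hypothetical name fragment_confidence, eq. (6) reduces to a couple of lines:

```python
import numpy as np

def fragment_confidence(x_k, n):
    """Eq. (6): mass of the 2n target coefficients relative to the
    (absolute) mass of the trivial coefficients that follow them."""
    target = x_k[:2 * n].sum()              # non-negative by construction
    trivial = np.abs(x_k[2 * n:]).sum()     # e+ and e- coefficients
    return target / (1.0 + trivial)
```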
Motion vectors for the $p'$ candidates are obtained from the offset between the location of each candidate fragment and that of the corresponding matched fragment in the dictionary. We denote this set of motion vectors as $MV = \{mv_1, mv_2, \ldots, mv_{p'}\}$, where $mv_r = (x_r, y_r)$ is the motion vector of the $r$-th candidate. The set of $p'$ motion vectors is reduced to $(s+2)$ motion vectors by eliminating directional outliers. Outliers based on magnitude are then removed to get $s$ motion vectors $MV' = \{mv'_1, mv'_2, \ldots, mv'_s\}$. The motion vector for the object is now estimated from the set $MV'$ using two methods. In the first method, the resultant motion vector $MV_{obj,1}$ has its x and y components equal to the median values of the x and y components of the motion vectors in $MV'$. In the second method, the resultant motion vector $MV_{obj,2}$ is computed using
$$MV_{obj,2} = (x, y) = \left( \frac{1}{s} \sum_{r=1}^{s} |mv'_r| \right) \left( \sum_{r=1}^{s} mv'_r \right) \bigg/ \left| \sum_{r=1}^{s} mv'_r \right| \quad (7)$$

which is a vector with a magnitude equal to the mean of $|MV'|$ and a direction equal to that of the resultant of $MV'$.
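A compact sketch of both fusion rules, with mv holding the $s$ inlier motion vectors $MV'$ as rows, might be:

```python
import numpy as np

def fuse_motion_vectors(mv):
    """mv: (s, 2) array of inlier fragment motion vectors MV'.
    Returns the two candidate object motions of Section 3.4."""
    mv_obj1 = np.median(mv, axis=0)                  # component-wise median
    resultant = mv.sum(axis=0)                       # direction of eq. (7)
    mean_mag = np.linalg.norm(mv, axis=1).mean()     # mean of |mv'_r|
    mv_obj2 = mean_mag * resultant / max(np.linalg.norm(resultant), 1e-12)
    return mv_obj1, mv_obj2
```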
One of the two motion vectors $MV_{obj,1}$ and $MV_{obj,2}$ is chosen based on a confidence measure computed using histogram models of the object and background. The object histogram $P_{obj}$, with 20 bins, is constructed from the pixels occupying the central 25% area of the object. The background histogram $P_{bg}$, with 20 bins, is constructed from the pixels occupying the area surrounding the object, up to 15 pixels out. These histograms are normalized. Figure 2 shows the areas used to construct these histograms. The area between the innermost rectangle and the middle rectangle is not used, as this region contains both object and background pixels, which adds confusion to the models.

Fig. 2. Pixels used to build the object and background histograms (regions labelled "Object", "Background" and "Not used").

The likelihood map is calculated using equation (8) for the pixels occupying the central 25% area of the candidate area $T$. The confidence measure for each motion vector is taken as the sum of the corresponding likelihood values of the pixels, using equation (9):

$$L(x, y) = P_{obj}(b(T(x, y))) \, / \, \max(P_{bg}(b(T(x, y))), \epsilon) \quad (8)$$

$$L_{conf} = \sum_{x, y} L(x, y) \quad (9)$$

where the function $b$ maps the pixel at location $(x, y)$ to its bin and $\epsilon$ is a small quantity to prevent division by zero. Out of $MV_{obj,1}$ and $MV_{obj,2}$, the motion vector with the larger value of this confidence measure is chosen. A higher confidence measure implies that a larger number of pixels in the target area pointed to by that motion vector belong to the object, compared to the area pointed to by the other motion vector.
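A sketch of this likelihood-based confidence, assuming 8-bit grey-scale pixels binned uniformly into the 20-bin histograms, could be:

```python
import numpy as np

def likelihood_confidence(patch, p_obj, p_bg, bins=20, eps=1e-6):
    """Eqs. (8)-(9): sum of object-vs-background likelihood ratios over
    a patch (the central 25% of the candidate area shifted by a motion
    vector). p_obj and p_bg are the normalized 20-bin histograms."""
    b = np.clip(patch.astype(np.int64) * bins // 256, 0, bins - 1)  # bin map
    L = p_obj[b] / np.maximum(p_bg[b], eps)                         # eq. (8)
    return float(L.sum())                                           # eq. (9)
```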
3.5. Dictionary update

Fragments in the second dictionary are chosen for update after analysing how well each fragment matched the candidate fragments. This can be inferred from the target coefficient vectors $A_{2k}$. The maximum value along each row (each row corresponds to a fragment in the dictionary) of the matrix $A = [A_{21}, A_{22}, \ldots, A_{2p}]$ helps in sorting out fragments that matched very well, matched mildly, or did not match at all with the candidate fragments. Since there are only $p$ candidate fragments, a large portion of the dictionary fragments will not have matched at all, indicated by their zero coefficient values. A small number of such fragments (depending on the update factor, which is expressed as a percentage of the total number of elements in each dictionary) are updated, since there was no contribution from them in the current iteration. They are updated with the corresponding fragments of the tracking result, after performing a check on each new fragment based on the histogram models explained in Section 3.4. The likelihood map, inverse likelihood map and confidence measure of each new fragment $F$ are computed as

$$L_f(x, y) = P_{obj}(b(F(x, y))) \, / \, \max(P_{bg}(b(F(x, y))), \epsilon) \quad (10)$$

$$IL_f(x, y) = P_{bg}(b(F(x, y))) \, / \, \max(P_{obj}(b(F(x, y))), \epsilon) \quad (11)$$

$$L_{conf,f} = \left( \sum_{x, y} L_f(x, y) \right) \bigg/ \left( \sum_{x, y} IL_f(x, y) \right) \quad (12)$$

The fragment is updated only if the confidence measure $L_{conf,f} > 1$ (indicating that the fragment has more pixels belonging to the object), to prevent erroneous updates of the dictionary fragments.
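A sketch of this gate, reusing the binning convention assumed above, might be:

```python
import numpy as np

def should_update_fragment(frag, p_obj, p_bg, bins=20, eps=1e-6):
    """Eqs. (10)-(12): accept a new fragment for the dynamic dictionary
    only when its object-likelihood mass exceeds its background-likelihood
    mass, i.e. L_conf,f > 1."""
    b = np.clip(frag.astype(np.int64) * bins // 256, 0, bins - 1)
    L = (p_obj[b] / np.maximum(p_bg[b], eps)).sum()    # eq. (10)
    IL = (p_bg[b] / np.maximum(p_obj[b], eps)).sum()   # eq. (11)
    return L / max(IL, eps) > 1.0                      # eq. (12)
```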
Algorithm 1 Proposed Tracking
1: Input: Initial position of the object in the first frame.
2: Initialize: $D_1$ and $D_2$ with overlapping fragments of the object.
3: repeat
4: In the next frame, select the candidate from the same location as the object in the previous frame and prepare the set of $p$ candidate fragments.
5: Solve the $\ell_1$ minimization problem using SPAMS [11] to sparsely reconstruct the candidate fragments in the space spanned by the dictionary fragments.
6: Compute the confidence measure $C_{conf,k}$ using equation (6).
7: Choose the top $p'$ candidate fragments based on $C_{conf,k}$ and compute their motion vectors $MV$.
8: Remove outliers in $MV$ based on direction and magnitude to get $s$ motion vectors $MV'$.
9: Compute motion vector $MV_{obj,1}$ as the median values of the x and y components of $MV'$.
10: Compute motion vector $MV_{obj,2}$ using equation (7).
11: Choose $MV_{obj,1}$ or $MV_{obj,2}$ as the motion vector for the object, whichever gives a higher confidence measure based on the likelihood in (9).
12: Update fragments of dictionary $D_2$ that did not match any of the candidate fragments, if $L_{conf,f} > 1$.
13: until End of video feed
4. RESULTS AND DISCUSSION
The proposed tracker is implemented in MATLAB and evaluated on four different video sequences: pktest02 (450 frames), face (206 frames), panda (451 frames) and trellis (226 frames). We use the software (SPAMS) provided by [11] to solve the $\ell_1$ minimization problem. For evaluating the performance of the proposed tracker, its results are compared with those of the $\ell_1$ tracker proposed by Mei et al. [6]. The $\ell_1$ tracker is configured with 300 particles and 10 object templates of size $12 \times 15$. The proposed tracker is configured for $p = 25$ candidate fragments of size $8 \times 8$, $p' = 21$, $s = 5$, and an update factor of 5%.
Figure 3 shows the trajectory error (position error) plots with respect to ground truth for the four videos, using the proposed method and the $\ell_1$ tracker [6]. Table 1 summarizes the performance of the trackers under consideration. It can be seen that the proposed tracker achieves real-time performance with better accuracy than the particle-filter-based $\ell_1$ tracker [6] when executed on a PC. The proposed tracker runs 60-70 times faster than [6]. Figures 4, 5, 6 and 7 show the tracking results; the results of the proposed approach and the $\ell_1$ tracker are shown by blue (solid) and yellow (dashed) windows respectively. In Figure 4, frame 153 shows that the $\ell_1$ tracker failed when the car was occluded by the tree, and it continues to drift away. The proposed tracker survives the occlusion and gradual pose change, as seen in frames 153, 156, 219 and 430. Figure 5 shows that the proposed tracker is also robust to changes in appearance and illumination, at frames 69, 114 and 192. Figure 6 shows that the proposed tracker was able to track drastic changes in pose when the panda changes its direction of motion, while the tracker in [6] fails at frames 94 and 327. Figure 7 shows the ability of the proposed tracker to track the object even under partial illumination changes, owing to the fragment-based approach. In frame 71, it can be seen that the lower left region is illuminated more.
Fig. 3. Trajectory position error with respect to ground truth for: (a) pktest02, (b) face, (c) panda and (d) trellis sequences. [Each panel plots absolute error versus frame number for the proposed tracker and Mei et al. [6].]
Fragments in the lower left region would give low confidence measures and are discarded before computing the object displacement, whereas the tracker in [6] uses the entire object to build the dictionary of templates and hence fails to track the object under such partial illumination changes. The videos corresponding to the results presented in Figs. 4 to 7 are available at http://www.serc.iisc.ernet.in/venky/tracking results/.

Fig. 4. Results for the pktest02 video at frames 5, 153, 156, 219 and 430. [Color convention for all results: solid blue - proposed tracker; dashed yellow - $\ell_1$ tracker.]

Fig. 5. Results for the face video at frames 3, 10, 69, 114 and 192.

Fig. 6. Results for the panda video at frames 4, 45, 94, 327 and 450.

Fig. 7. Results for the trellis video at frames 12, 24, 71, 141 and 226.
Table 1. Execution time and trajectory error (RMSE) comparison of the proposed tracker and the $\ell_1$ tracker [6].

Video            Execution time per frame (s)   Trajectory error (RMSE)
(frames)         Proposed       [6]             Proposed       [6]
pktest02 (450)   0.0316         2.0770          2.9878         119.5893
face (206)       0.0308         2.2194          7.0961         9.5666
panda (451)      0.0303         2.2742          4.7350         25.5386
trellis (226)    0.0301         2.1269          12.8113        42.3399
5. CONCLUSION AND FUTURE WORK
In this paper we have proposed a computationally efficient tracking algorithm which makes use of fragments of the object and the candidate to track the object. The performance of the proposed tracker has been demonstrated on various complex video sequences, and it is shown to perform better than the earlier tracker in terms of both accuracy and speed. Future work includes improvement of the dictionary and its update mechanism to model changes in the pose, size and illumination of the object more precisely.
6. REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys, vol. 38, no. 4, 2006.
[2] G. D. Hager and P. N. Belhumeur, "Efficient region tracking with parametric models of geometry and illumination," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1025-1039, 1998.
[3] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2000, vol. 2, pp. 142-149.
[4] David A. Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125-141, 2008.
[5] F. Porikli, O. Tuzel, and Peter Meer, "Covariance tracking using model update based on Lie algebra," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006, vol. 1, pp. 728-735.
[6] Xue Mei and Haibin Ling, "Robust visual tracking using l1 minimization," in Proceedings of IEEE International Conference on Computer Vision, 2009, pp. 1436-1443.
[7] Huaping Liu and Fuchun Sun, "Visual tracking using sparsity induced similarity," in Proceedings of IEEE International Conference on Pattern Recognition, 2010, pp. 1702-1705.
[8] Hanxi Li and Chunhua Shen, "Robust real-time visual tracking with compressed sensing," in Proceedings of IEEE International Conference on Image Processing, 2010.
[9] Ran Xu, Baochang Zhang, Qixiang Ye, and Jianbin Jiao, "Human detection in images via l1-norm minimization learning," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 3566-3569.
[10] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 210-227, 2009.
[11] SPAMS, http://www.di.ens.fr/willow/spams/.
