The pixel classification can be written as

$$ s'_i = \begin{cases} 1, & M_L \le M_i \le M_H \\ 0, & \text{otherwise} \end{cases} \qquad (3) $$

where $s'_i$ is the pixel at location $i$, $M_i$ is the magnitude of its motion vector, and $M_L$ and $M_H$ are the lower and upper thresholds for the motion-vector magnitude. To improve the classification performance, the optimal threshold values $M_L$ and $M_H$ can be obtained by minimizing the Bayesian risk constructed from the probability densities and prior probabilities of the two classes (target and foreground). Then, all pixels that are interconnected in the binary image are merged, forming several fragments. In the tracking step, the motion of the target object is modeled as follows:
$$ x_{k+1} = F x_k + v_k, \qquad z_k = H x_k + w_k \qquad (4) $$

where $x_k$ is the state vector of the target at time $k$, $z_k$ is the observation vector of the target, $v_k$ and $w_k$ are zero-mean white Gaussian noise sequences with covariance matrices $Q_k$ and $R_k$ respectively, and $F$ and $H$ are time-independent matrices. Classically, such objects can be tracked by the Kalman filter; however, since the Kalman filter can track only one fragment, it can produce serious errors when one object is split into several fragments. The PDAF can therefore be applied to handle such cases. It shows reliable performance, but it does not guarantee a reduction of the computational complexity.
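As an aside, the classification of Eq. (3) followed by the merging of interconnected pixels can be sketched as below. This is a minimal illustration, not the author's implementation: `m_low`/`m_high` stand for the thresholds $M_L$ and $M_H$, and 4-connectivity is assumed for the merging step.

```python
import numpy as np

def classify_pixels(mag, m_low, m_high):
    # Eq. (3): a pixel is foreground when its motion-vector magnitude
    # lies between the lower and upper thresholds.
    return (mag >= m_low) & (mag <= m_high)

def merge_fragments(binary):
    # Merge 4-connected foreground pixels into labeled fragments
    # using an explicit flood fill (no external dependencies).
    labels = np.zeros(binary.shape, dtype=int)
    current = 0
    h, w = binary.shape
    for y in range(h):
        for x in range(w):
            if binary[y, x] and labels[y, x] == 0:
                current += 1
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    if 0 <= cy < h and 0 <= cx < w and binary[cy, cx] and labels[cy, cx] == 0:
                        labels[cy, cx] = current
                        stack += [(cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)]
    return labels, current
```

Each returned label then corresponds to one fragment handed to the tracking step.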
2.3.4 Issues in Compressed Domain Approach
The essential goal of the compressed domain approach is to significantly reduce the computational complexity, even though it slightly degrades the performance of object detection and tracking. The processing times of the major algorithms are shown in Table 1; the remaining algorithms have not reported how fast they are.
Table 1. The processing time of compressed domain algorithms.
Authors Frames/sec PC Note
Zen et al. [11] 5~10 unknown
Wang et al. [7] 2 450 MHz
Chen et al. [26] 43 unknown
Benzougar et al. [6] 40 400 MHz Excluding video decoding
Mezaris et al. [22] 200 800 MHz Excluding video decoding
Zeng et al. [12] 2~16 700 MHz Available for H.264|AVC
Treetasanatavorn et al. [15] 0.1 500 MHz
Porikli and Sun [17] 111~500 4.3 GHz
Aggarwal et al. [2] 100 1.8 GHz
The algorithms with significantly fast processing times are those of Chen et al., Mezaris et al., Porikli and Sun, and Aggarwal et al. In particular, although the algorithms of Mezaris et al. and of Porikli and Sun perform object segmentation as well as object detection and tracking, their processing times are remarkably fast.
Nevertheless, these algorithms have some fatal shortcomings which cause poor object detection and tracking performance. First of all, they are applicable only in extremely restricted environments; that is, they can produce serious errors in particular scene situations. For instance, Chen et al.'s algorithm first
extracts the foreground region from the difference image of temporally neighboring DC images [26]. This is unreliable because most internal parts of the object region can be excluded from the extracted foreground when the interior of the object is low-textured. In other words, the algorithm can achieve successful results only when the texture of most of the object region changes noticeably. In the case of Mezaris et al.'s algorithm, the foreground is obtained through global-motion compensation based on an iterative macroblock rejection scheme [22]. That is, a motion vector which differs greatly from the global motion is considered background. However, this can fail to extract the whole foreground region when the motion of the moving objects is not clearly distinguishable from the global motion. Porikli and Sun's algorithm can also err due to the limitation of its region-merging technique [17]. For spatiotemporal segmentation, it merges blocks that have similar motion vectors and DCT coefficients. However, an object region can contain chaotic motion vectors; for example, when a deformable object moves in the same direction as the camera, or when it consists largely of homogeneous texture, a chaotic set of motion vectors with various amplitudes and directions is produced in unpredictable patterns. Likewise, the limitation of Aggarwal et al.'s algorithm is that it does not consider changes in the size of the target object, which is manually selected as a rectangular box [2]. The algorithm is thus applicable only when the object size is constant over frames.
Another problem in these algorithms is that they are not compatible with
H.264|AVC. These algorithms commonly exploit the DC images which are
formed from DCT coefficients in an I-frame. In MPEG-1 or MPEG-2 bitstreams,
the DC image formation is possible because raw pixel data in I-frames is directly
converted by the discrete cosine transform (DCT) without intra prediction. In H.264|AVC, on the other hand, the difference between the original pixel data and the intra-predicted pixel values is converted by the integer transform (IT), so the DC image cannot be built from either I-frames or P-frames.
Additionally, these algorithms do not support consistent object recognition based on color information. For example, when we track multiple persons, a person can repeatedly enter and leave the camera's view. In H.264|AVC videos it is then difficult to recognize the person's identity based on motion vectors and IT coefficients alone.
In the proposed methods, the above problems are addressed in three ways: (1) reinforcing adaptability to various scenes, (2) reflecting the distinctive features of H.264|AVC bitstreams, and (3) decoding the ROIs partially. As a result, the proposed methods not only maintain fast computation times but also achieve more reliable performance than the traditional algorithms.
III Proposed Schemes for Moving Object Detection and Tracking with Partial Decoding in H.264|AVC Bitstream Domain
In this chapter, two algorithms for object detection and tracking in the H.264|AVC bitstream domain are introduced. One is a semi-automatic method for interactive broadcasting services, and the other is an automatic method intended especially for real-time surveillance applications. The semi-automatic method adopts the dissimilarity minimization algorithm, whereas the automatic method is based on the spatial and temporal macroblock filter (STMF). Both techniques concentrate on improving performance in the various scenes where traditional compressed domain algorithms are not applicable.
It should be noticed that unlike traditional compressed domain algorithms,
the proposed algorithms exploit partially decoded pixel data as well as encoded
information like motion vectors or IT coefficients in order to detect and track
moving objects. Even though some compressed domain algorithms contain a partial decoding process, it does not directly contribute to the detection and tracking procedure; it is used only for boundary refinement [8,13]. The partial decoding in the proposed algorithms can increase the processing time; however, it contributes greatly to finding more accurate locations and sizes of moving objects. It also provides the color information of multiple objects, which can be used for object recognition or metadata formation.
3.1 Semi-automatic Approach
In order to extract the location of a predefined target object from stationary or non-stationary scenes encoded by H.264|AVC, the dissimilarity energy minimization algorithm can be exploited. It makes use of motion vectors and partially decoded luminance signals to perform tracking adaptively according to the properties of the target object in H.264/AVC videos. It is a semi-automatic feature-based approach that tracks feature points selected by a user. First, it roughly predicts the position of each feature point using motion vectors extracted from the H.264/AVC bitstream. Then, it finds the best position inside a given search region by considering three cues: the texture, form, and motion dissimilarity energies. Since only the neighborhoods of feature points are partially decoded to compute these energies, the computational complexity is greatly reduced. The set of best positions of the feature points in each frame is selected to minimize the total dissimilarity energy by dynamic programming. Also, the weight factors for the dissimilarity energies are adaptively updated by a neural network. Compared with traditional compressed domain algorithms, this algorithm can successfully track the target object even when its shape deforms over frames or its motion vectors are not homogeneous due to a highly textured background.
3.1.1 Forward Mapping of Backward Motion Vectors
The motion vectors extracted directly from the H.264|AVC bitstream can be used to roughly predict the motion of feature points. Since all motion vectors in P-frames point backward, they must be converted to point forward. Following Porikli and Sun [17], the forward motion field is built by the region-matching method. First, motion vectors of blocks of various sizes are dispersed to 4x4 unit blocks. After each block is projected to the previous frame, the set of overlapping blocks is extracted as shown in Figure 1.
Figure 1. The region-matching method for constructing the forward motion field
Forward motion vectors of the overlapped blocks in the previous frame are updated according to the ratio of the overlapping area to the whole block area. Assuming that the jth 4x4 block $b_{k,j}$ in the kth frame overlaps the ith 4x4 block $b_{k-1,i}$ in the (k-1)th frame, the forward motion vector $\mathrm{fmv}_{k-1}(b_{k-1,i})$ is given by
$$ \mathrm{fmv}_{k-1}(b_{k-1,i}) = -\frac{1}{16} \sum_{j=1}^{N} S_{k-1}(i,j)\, \mathrm{mv}_k(b_{k,j}) \qquad (5) $$
where $S_{k-1}(i,j)$ stands for the overlapping area between $b_{k,j}$ and $b_{k-1,i}$, and $\mathrm{mv}_k(b_{k,j})$ denotes the backward motion vector of $b_{k,j}$, with $i,j = 1,2,\ldots,N$. We assume that the H.264/AVC videos are encoded in the baseline profile, in which each GOP contains just one I-frame and several P-frames. It should be noticed that the above region-matching method cannot be applied to the last P-frame of a GOP, since the next I-frame has no backward motion vectors. Assuming that the motion of each block is approximately constant within a small time interval, the forward motion vector of any block in the last P-frame can be assigned as the reverse of its backward motion vector:
$$ \mathrm{fmv}_{k-1}(b_{k-1,i}) = -\,\mathrm{mv}_{k-1}(b_{k-1,i}). \qquad (6) $$
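A compact sketch of this region-matching construction (Eqs. 5 and 6) is given below. It assumes the motion vectors are already dispersed to 4x4 blocks and stored per block in pixel units; the array layout and function name are illustrative, not from the source.

```python
import numpy as np

def forward_motion_field(backward_mv, block=4):
    """Build a forward motion field from backward 4x4-block motion
    vectors by region matching. `backward_mv` has shape (H, W, 2) in
    block units, with vectors (x, y) in pixels."""
    h, w, _ = backward_mv.shape
    fwd = np.zeros_like(backward_mv, dtype=float)
    for by in range(h):
        for bx in range(w):
            mv = backward_mv[by, bx]
            # project block (by, bx) of frame k into frame k-1
            px = bx * block + mv[0]
            py = by * block + mv[1]
            bx0 = int(np.floor(px / block))
            by0 = int(np.floor(py / block))
            # distribute -mv over the (up to four) overlapped blocks,
            # weighted by overlap area / 16 as in Eq. (5)
            for oy in (by0, by0 + 1):
                for ox in (bx0, bx0 + 1):
                    if 0 <= oy < h and 0 <= ox < w:
                        overlap_x = max(0.0, block - abs(px - ox * block))
                        overlap_y = max(0.0, block - abs(py - oy * block))
                        fwd[oy, ox] += -(overlap_x * overlap_y / 16.0) * mv
    return fwd
```

A separate pass (Eq. 6) would simply negate the backward vectors of the last P-frame in each GOP.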
Thereafter, the positions of feature points in the next frame are predicted using the forward motion vectors. If the nth feature point in the (k-1)th frame has the displacement vector $f_{k-1,n} = (fx_{k-1,n}, fy_{k-1,n})$ and is included in the ith block $b_{k-1,i}$, the predicted displacement vector $p_{k,n} = (px_{k,n}, py_{k,n})$ in the kth frame is defined as

$$ p_{k,n} = f_{k-1,n} + \mathrm{fmv}_{k-1}(b_{k-1,i}). \qquad (7) $$
Since the predicted position of a feature point is not precise, we need to search for the best position of each feature point inside the search region centered at the predicted position $p_{k,n} = (px_{k,n}, py_{k,n})$. Each candidate point inside the search region is evaluated using the dissimilarity energies related to texture, form, and motion. The set of candidate points with the minimum total dissimilarity energy is selected as the optimal configuration of the feature points.
3.1.2 Texture Dissimilarity Energy
The similarity of texture measures how similar the luminance in the neighborhood of a candidate point is to that of the corresponding feature point in the previous frame. The set of candidate points inside the square search region is denoted as $C_{k,n} = \{c_{k,n}(1), c_{k,n}(2), \ldots, c_{k,n}(L)\}$ with $L = (2M+1)(2M+1)$ for the nth feature point in the kth frame. Then, the texture dissimilarity energy $E_C$ for the ith candidate point $c_{k,n}(i) = (cx_{k,n}(i), cy_{k,n}(i))$ is defined as
$$ E_C(k;n,i) = \frac{1}{(2W+1)^2} \sum_{x=-W}^{W} \sum_{y=-W}^{W} \left| s_k\big(cx_{k,n}(i)+x,\, cy_{k,n}(i)+y\big) - s_{k-1}\big(fx_{k-1,n}+x,\, fy_{k-1,n}+y\big) \right| \qquad (8) $$
where $s_k(x,y)$ stands for the luminance value at pixel $(x,y)$ of the kth frame, and $W$ is the maximum half-interval of the neighborhood. The smaller $E_C$ is, the more similar the texture of the candidate's neighborhood is to that of the corresponding feature point in the previous frame. This energy forces the best point toward the position with the most plausible neighborhood texture. Figure 2 shows how the search region and the neighborhood of a candidate point are used to calculate $E_C$.
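The neighborhood comparison of Eq. (8) amounts to a mean absolute luminance difference between two (2W+1)x(2W+1) windows. A sketch, assuming interior points and (x, y) coordinates; the names are illustrative:

```python
import numpy as np

def texture_energy(cur, prev, cand, feat, W):
    """Texture dissimilarity energy in the spirit of Eq. (8): mean
    absolute luminance difference between the neighborhood of a
    candidate point in frame k and that of the matching feature
    point in frame k-1. `cand` and `feat` are (x, y) tuples assumed
    to lie at least W pixels inside the image."""
    cx, cy = cand
    fx, fy = feat
    n_cur = cur[cy - W: cy + W + 1, cx - W: cx + W + 1].astype(float)
    n_prev = prev[fy - W: fy + W + 1, fx - W: fx + W + 1].astype(float)
    return np.abs(n_cur - n_prev).sum() / (2 * W + 1) ** 2
```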
Figure 2. The search region is centered at the predicted point located by a forward motion vector. A candidate point inside the search region has a square neighborhood used to compute $E_C$.
Only the necessary blocks are partially decoded in P-frames to reduce the computational complexity. Intra-coded blocks, on the other hand, cannot be partially decoded in isolation, since they are spatially predicted from their neighboring blocks.
General partial decoding takes a long time, since decoding particular blocks in a P-frame requires many reference blocks in the previous frames to be decoded. We can predict the decoded blocks to reduce the computation time. To predict the decoded blocks in the kth P-frame, we assume that the velocity within one GOP is uniform and equal to the forward motion vector of the (k-2)th frame. For the ith frame with $i = k, k+1, \ldots, K$, the predicted search region $P_{k,n}(i)$ is defined as the set of pixels necessary to calculate the texture dissimilarity energies of all possible candidate points for the nth feature point. The half maximum interval $T_{k,i}$ of $P_{k,n}(i)$ is $T_{k,i} = (i-k+1)M + W + \epsilon$, where $\epsilon$ denotes the prediction error. Then, $P_{k,n}(i)$ is given as follows:
$$ P_{k,n}(i) = \left\{\, m = (x_m, y_m) \;\middle|\; |x_m - \hat{p}x| \le T_{k,i},\; |y_m - \hat{p}y| \le T_{k,i} \,\right\}, \quad \hat{p} = (\hat{p}x, \hat{p}y) = f_{k-1,n} + (i-k+1)\,\mathrm{fmv}_{k-2}\big(b(f_{k-2,n})\big) \qquad (9) $$
where $b(f_{k-2,n})$ stands for the block which includes the nth feature point $f_{k-2,n}$. The decoded block set $D_{k,n}(i)$ is defined as the set of blocks which must be decoded to reconstruct $P_{k,n}(i)$. Using the motion vectors of the (k-1)th frame, $D_{k,n}(i)$ is given by
$$ D_{k,n}(i) = \left\{\, b(d) \;\middle|\; d = p + (i-k)\,\mathrm{mv}_{k-1}\big(b(f_{k-1,n})\big),\; p \in P_{k,n}(i) \,\right\} \qquad (10) $$
Assuming that there exist $F$ feature points, the total decoded block set $D_k$ in the kth frame can finally be computed as

$$ D_k = \bigcup_{n=1}^{F} \bigcup_{i=k}^{K} D_{k,n}(i). \qquad (11) $$
Figure 3 shows how partial decoding is performed in the first P-frame of
one GOP which contains one I-frame and three P-frames. It should be noticed
that the time for calculating the total decoded block set is proportional to the
GOP size.
Figure 3. The structure of partial decoding in the first P-frame of a GOP which contains one I-frame and three P-frames. Two decoded block sets $D_{k,n}(k+1)$ and $D_{k,n}(k+2)$ in the first P-frame are projected from the two predicted search regions $P_{k,n}(k+1)$ and $P_{k,n}(k+2)$.
3.1.3 Form Dissimilarity Energy
The similarity of form measures how similar the network of candidate points is to the network of feature points in the previous frame. The feature points are joined by straight lines as in Figure 4. After a feature point is initially selected, it is connected to the closest one among the non-linked feature points. In this way, the feature network in the first frame is built by connecting all feature points successively.
To calculate the form dissimilarity energy of each candidate point, we assume that the feature points keep the order assigned in the first frame. The feature point $f_{k-1,n}$ in the (k-1)th frame has the difference vector $fd_{k-1,n} = f_{k-1,n} - f_{k-1,n-1}$, as shown in Figure 4. Likewise, the ith candidate point of the nth feature point in the kth frame has the difference vector $cd_{k,n}(i) = c_{k,n}(i) - c_{k,n-1}(j)$, where $j$ indexes the candidate chosen for the (n-1)th feature point. Then, the form dissimilarity energy $E_F$ for the ith candidate point of the nth feature point ($n > 0$) is defined as follows:

$$ E_F(k;n,i) = \left\| cd_{k,n}(i) - fd_{k-1,n} \right\|^{1/2} \qquad (12) $$
All candidate points of the first feature point ($n=0$) have zero form dissimilarity energy, $E_F(k;0,i) = 0$. The smaller $E_F$ is, the less the form of the feature network is deformed. The form dissimilarity energy forces the best position of a candidate point toward the position where the form of the feature network changes as little as possible.
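Equation (12) can be sketched as below; `best_prev` stands for the candidate chosen for the (n-1)th feature point, and all names are illustrative:

```python
import numpy as np

def form_energy(cand_n, best_prev, feat_n, feat_prev):
    """Form dissimilarity energy in the spirit of Eq. (12): compares
    the difference vector between consecutive candidate points with
    the difference vector between the corresponding feature points
    of the previous frame. All points are (x, y) tuples."""
    cd = np.subtract(cand_n, best_prev)      # cd_{k,n}(i)
    fd = np.subtract(feat_n, feat_prev)      # fd_{k-1,n}
    return np.linalg.norm(cd - fd) ** 0.5    # ||cd - fd||^(1/2)
```

A perfectly preserved network shape yields zero energy.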
Figure 4. The network of feature points in the previous frame and the network of
candidate points in the current frame.
3.1.4 Motion Dissimilarity Energy
The reliability of a forward motion vector measures how close it is to the true motion, i.e. how exactly it locates the predicted point. Following Fu et al. [6], if the predicted point $p_{k,n}$, which was located by the forward motion vector $\mathrm{fmv}_{k-1}$, returns to its original location in the previous frame under the backward motion vector $\mathrm{mv}_k$, then $\mathrm{fmv}_{k-1}$ is highly reliable. Assuming that $p_{k,n}$ is included in the jth block $b_{k,j}$, the reliability $R$ is given as follows:
$$ R(p_{k,n}) = \exp\left( -\frac{\left\| \mathrm{fmv}_{k-1}(b_{k-1,i}) + \mathrm{mv}_k(b_{k,j}) \right\|^2}{2\sigma^2} \right) \qquad (13) $$
where $\sigma^2$ is the variance of the reliability. Figure 5 shows forward motion vectors with high and low reliability. In a similar way to Fu's definition [18], the motion dissimilarity energy $E_M$ for the ith candidate point is defined as follows:

$$ E_M(k;n,i) = R(p_{k,n}) \left\| c_{k,n}(i) - p_{k,n} \right\| \qquad (14) $$
With high reliability $R$, $E_M$ has a greater effect on finding the best point than $E_C$ or $E_F$, since it varies sharply with the distance between the predicted point and a candidate point.
Figure 5. The reliability of forward motion vectors. The great gap between a forward
motion vector and a backward motion vector results in low reliability.
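The reliability of Eq. (13) and the motion energy of Eq. (14) can be sketched as follows (vectors are (x, y) pairs; names are illustrative):

```python
import numpy as np

def reliability(fmv, bmv, sigma):
    # Eq. (13): a forward MV is reliable when the backward MV returns
    # the predicted point to its origin, i.e. fmv + bmv is near zero.
    d = np.add(fmv, bmv)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def motion_energy(fmv, bmv, cand, pred, sigma):
    # Eq. (14): reliability-weighted distance between a candidate
    # point and the predicted point.
    dist = float(np.linalg.norm(np.subtract(cand, pred)))
    return reliability(fmv, bmv, sigma) * dist
```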
3.1.5 Energy Minimization
The dissimilarity energy $E_{k,n}(i)$ for the ith candidate point of the nth feature point is defined as follows:

$$ E_{k,n}(i) = \omega_C(k) E_C(k;n,i) + \omega_F(k) E_F(k;n,i) + \omega_M(k) E_M(k;n,i) \qquad (15) $$

where $\omega_C(k)$, $\omega_F(k)$, and $\omega_M(k)$ are the weight factors for the texture, form, and motion dissimilarity energies. If the configuration of candidate points is denoted
as $I = \{c_{k,1}(i_1), c_{k,2}(i_2), \ldots, c_{k,F}(i_F)\}$, the optimal configuration $I_{opt}(k)$ in the kth frame is selected as the one that minimizes the total dissimilarity energy $E_k(I)$, expressed by

$$ E_k(I) = \sum_{n=1}^{F} E_{k,n}(i_n). \qquad (16) $$
When all possible configurations of candidate points are considered, the computation takes $O((2M+1)^{2F})$ time, which becomes prohibitive for a large search region or many feature points. We can reduce the amount of computation to a cost linear in $F$ using the discrete multistage decision process called dynamic programming, which consists of two steps [19]:
A. The accumulated dissimilarity energy (ADE) $E_{local}(n,i)$ for the ith candidate point of the nth feature point ($n>0$) is calculated as follows:

$$ E_{local}(n,i) = \min_j \left[ E_{k,n}(i,j) + E_{local}(n-1,j) \right] \qquad (17) $$

where $E_{k,n}(i,j)$ denotes the dissimilarity energy of (15) evaluated with candidate $j$ chosen for the (n-1)th feature point.
The ADE for the first feature point is $E_{local}(0,i) = E_{k,0}(i)$. Then, the candidate point of the (n-1)th feature point which minimizes the ADE is selected; the index of this point is saved as

$$ s(n,i) = \operatorname*{argmin}_j \left[ E_{k,n}(i,j) + E_{local}(n-1,j) \right] \qquad (18) $$
B. For the last feature point, the candidate point with the smallest ADE is selected as the best point $o_F$. Then the best point $o_n$ for the nth feature point is decided by backtracking as follows:

$$ o_F = \operatorname*{argmin}_i E_{local}(F,i) \quad \text{and} \quad o_n = s(n+1, o_{n+1}). \qquad (19) $$

The best position of the nth feature point is then $f_{k,n} = c_{k,n}(o_n)$.
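Steps A and B above can be sketched as a standard Viterbi-style dynamic program. The `energy` callback stands in for the dissimilarity energy of Eq. (15), with the previous candidate index passed as `j` (`None` for the first feature point); all names are illustrative.

```python
def dp_track(energy, F, L):
    """Dynamic programming over F feature points with L candidates
    each (Eqs. 17-19). Returns the optimal candidate index for each
    feature point."""
    INF = float("inf")
    acc = [[0.0] * L for _ in range(F)]   # accumulated energies E_local
    back = [[0] * L for _ in range(F)]    # backtracking indices s(n, i)
    for i in range(L):
        acc[0][i] = energy(0, i, None)
    for n in range(1, F):
        for i in range(L):
            best_j, best_v = 0, INF
            for j in range(L):
                v = energy(n, i, j) + acc[n - 1][j]
                if v < best_v:
                    best_j, best_v = j, v
            acc[n][i], back[n][i] = best_v, best_j
    # backtrack from the candidate with the smallest accumulated energy
    best = [min(range(L), key=lambda i: acc[F - 1][i])]
    for n in range(F - 1, 0, -1):
        best.append(back[n][best[-1]])
    return best[::-1]
```

The cost is O(F L^2) instead of O(L^F) for exhaustive search over all configurations.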
3.1.6 Adaptive Weight Factors
Arbitrarily assigned weight factors for the texture, form, and motion dissimilarity energies can give rise to tracking errors, since the target object can have various properties. For this reason, the weight factors need to be decided adaptively according to the properties of the target object. For instance, for an object whose texture scarcely changes, the weight factor $\omega_C$ should automatically be set to a high value.
The weight factors can be automatically updated in each frame using the neural network shown in Figure 6. The dissimilarity energy $E_k$ is transformed to the output value $E'_k$ by a nonlinear activation function $\varphi$. The update of the weight factors is performed by the backpropagation algorithm, which minimizes the squared output error $\varepsilon_k$ defined as follows:

$$ \varepsilon_k = \frac{1}{2}\left( E_d - E'_k \right)^2 \qquad (20) $$

where $E_d$ denotes the ideal output value. If the activation function is the unipolar sigmoid function ($\varphi(x) = 1/(1+e^{-x})$), the gradient step for a weight factor is calculated as

$$ \Delta\omega_x(k) = \eta \left( E_d - E'_k \right) E'_k \left( 1 - E'_k \right) E_x(k) \qquad (21) $$

where $x$ can be $C$ (texture), $F$ (form), or $M$ (motion), and $\eta$ is the learning constant [20].
Figure 6. The neural network for updating weight factors.
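One update step of Eqs. (20) and (21) can be sketched as follows, with the net input taken as the weighted sum of Eq. (15) and a unipolar sigmoid output. The function and key names are illustrative, not the author's notation.

```python
import math

def update_weights(weights, energies, e_desired, lr=0.1):
    """One backpropagation step for the adaptive weight factors.
    `weights` and `energies` map 'C', 'F', 'M' to the current weight
    factors and dissimilarity energies."""
    e_k = sum(weights[x] * energies[x] for x in weights)  # weighted sum, Eq. (15)
    out = 1.0 / (1.0 + math.exp(-e_k))                    # sigmoid output E'_k
    grad = (e_desired - out) * out * (1.0 - out)          # delta term of Eq. (21)
    return {x: weights[x] + lr * grad * energies[x] for x in weights}
```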
3.2 Automatic Approach
For the automatic detection and tracking of moving objects in the H.264|AVC bitstream domain, a novel method based on the spatial and temporal macroblock filter (STMF) is introduced. The STMF exploits macroblock types and IT coefficients, which indicate the existence of motion and the temporal texture change within a macroblock; this encoded information is exploited to extract foreground regions.
As depicted in Figure 7, the method is composed of two stages: object extraction and object refinement. In the object extraction stage, all object regions are roughly extracted by the STMF based on the occurrence probability of the objects. The STMF first removes blocks that are judged to be background based on macroblock types and IT coefficients, and then clusters the remaining blocks into several fragments called block groups. Since some block groups can still belong to the background, it calculates the occurrence probability of each block group based on its temporal consistency; only block groups with high probability are considered real objects. In the object refinement stage, the location and size of the object regions are precisely refined by background subtraction with partial decoding in I-frames and by motion interpolation in P-frames.
[Figure 7 diagram: Object Extraction in P-frames (Block Group Extraction, Spatial Filtering, Temporal Filtering), followed by Object Refinement (Region Prediction, Partial Decoding and Background Subtraction in the I-frame, Motion Interpolation in P-frames)]
Figure 7. A procedure of object region extraction and refinement
3.2.1 Block Group Extraction
To detect and track moving objects in surveillance videos encoded by an H.264|AVC baseline-profile encoder, we assume that the surveillance camera is fixed, so that there is no camera motion, and that I-frames are inserted periodically at intervals of no more than 10 frames. With a fixed camera, most macroblocks of the background tend to be encoded in the skip mode in P-frames, while most parts of the foreground tend to be encoded in non-skip modes. From these observations, we may consider sets of non-skip blocks as foreground candidates for moving object detection and tracking.
Figure 8. Block groups before and after spatial filtering
Figure 8 shows that the approximate foreground in a P-frame consists of a set of block groups, each consisting of non-skip-mode blocks that are connected in the horizontal, vertical, or diagonal directions. However, such simple segmentation into block groups is not enough to define moving objects, since non-skip-mode blocks may also occur in the background and skip-mode blocks may occur in the foreground region. For example, some macroblocks in a homogeneous region of the background are encoded as inter-coded blocks with motion vectors instead of skip-mode blocks. Likewise, when the visual change caused by object motion is negligible, the whole object or some parts of it can be encoded as skip-mode blocks. Moreover, one object region can be separated into several block groups that are disconnected from one another. Therefore, block grouping based on the simple classification into skip-mode and non-skip-mode blocks is not sufficient to define moving objects as ROIs. To decide whether each block group represents a real object or a part of the background, we use the spatial and temporal macroblock filter (STMF), which is applied only in P-frames. The filter consists of two modules: spatial filtering and temporal filtering.
3.2.2 Spatial Filtering
The spatial filtering removes most block groups in the background by using IT coefficients. That is, block groups which contain just one non-skip macroblock, or which contain no non-zero IT coefficients, are considered to belong to the background, since such groups tend to occur in the background rather than in the foreground. In other words, we regard as a candidate for a real object any block group that contains more than one non-skip macroblock and includes non-zero IT coefficients. Although some block groups of a real foreground object may be misclassified as background, this rarely happens; instead, many more block groups are removed from the background. Thus, spatial filtering gives a good chance of removing a large number of false block groups in the background.
As shown in Figure 8, nine block groups (indicated as F1 and B1~B8) are detected first in a frame. After spatial filtering, two active block groups (F1, B4) are left, while the other block groups are removed. It can be seen that most of the block groups belonging to the background consist of only a single macroblock, except B3 and B4. After spatial filtering, B3 is removed because all its IT coefficients are zero, whereas B4 survives because of its non-zero IT coefficients. Each frame after spatial filtering can contain several active block groups, so the proposed method supports multiple-object detection and tracking.
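The two spatial-filtering rules above can be sketched as a simple predicate over block-group summaries (the dict keys are invented for illustration):

```python
def spatial_filter(block_groups):
    """Keep only block groups that contain more than one non-skip
    macroblock AND at least one non-zero IT coefficient; all other
    groups are treated as background."""
    return [
        g for g in block_groups
        if g["num_nonskip"] > 1 and g["has_nonzero_it"]
    ]
```

This mirrors the Figure 8 example, where B3 is removed for having all-zero IT coefficients and the single-macroblock groups are removed outright.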
3.2.3 Temporal Filtering
The temporal filtering process further removes the background block groups that survive spatial filtering. The surviving block groups are called active block groups. The active block groups are then labeled with object IDs by object detection and tracking through temporal evolution. Each block group can be classified as a real object or as background; active block groups that have not yet been determined to be either are called candidate objects. Hence, an active block group can be labeled as a candidate object "C", a real object "R", or background "B".
For the classification of active block groups, a newly appeared (or detected) active block group is initially regarded as a candidate object. The candidate object is regarded as a real object when it exhibits temporal coherence, i.e. a high occurrence probability, during an observation period. Active block groups in the background tend to randomly appear and disappear over time, while those in the foreground tend to move smoothly and to appear over a relatively long period in subsequent frames.
The more frequently a candidate object occurs during a given observation period, the higher its occurrence probability becomes; and the longer the observation period is, the more precise the classification becomes. The structure of temporal filtering is illustrated in Figure 9.
Figure 9. Temporal filtering based on the occurrence probability of active group trains
Before applying temporal filtering to an initial active block group $A$, it is assigned the active group train $T_l$, which is labeled by $l$ and is defined as follows:

$$ T_l = \left\{ G_l^i \;\middle|\; G_l^1 = A,\; i = 1, \ldots, \Phi \right\} \qquad (22) $$

where $\Phi$ indicates the length of the observation period, and $G_l^i$, called the succeeding active block group, denotes the set of active groups corresponding to $A$ in the ith frame of the observation period:

$$ G_l^i = \left\{ X \;\middle|\; X \cap G_l^{i-1} \neq \emptyset,\; X \in C^i \right\} \qquad (23) $$

where $C^i$ denotes the set of all active block groups in the ith frame of the observation period and $X$ is an active block group. In other words, $G_l^i$ consists of all active block groups in the ith frame that overlap $G_l^{i-1}$. If $G_l^i = \emptyset$, we let $G_l^i = G_l^{i-1}$, assuming that the corresponding object does not move or that there is little or no change in the intensity of the active block group; $G_6^3$ in Figure 9 corresponds to such a case.
In this way, we compute $G_l^i$ recursively for $1 \le i \le \Phi$, and then obtain $T_l$ (a sequence of $G_l^i$) by accumulating the initial active block group and its succeeding active block groups over all frames of the observation period. Thereafter, in the last frame of the observation period, we calculate the occurrence probability $P_l$ of the active group train $T_l$, which is defined as follows:

$$ P_l = P\left( L_l = \mathrm{R} \;\middle|\; G_l^1, G_l^2, \ldots, G_l^\Phi \right) \qquad (24) $$
where $L_l$ indicates the type of the active group train $T_l$ after the observation period. That is, $P_l$ describes the probability that all candidate objects corresponding to the active group train $T_l$ are real objects. According to the Bayes rule, we have:
$$ P\left( L_l = \mathrm{R} \mid G_l^1, \ldots, G_l^\Phi \right) = \frac{ P\left( G_l^1, \ldots, G_l^\Phi \mid L_l = \mathrm{R} \right) P\left( L_l = \mathrm{R} \right) }{ P\left( G_l^1, \ldots, G_l^\Phi \right) } = \frac{ \prod_{i=1}^{\Phi} P\left( G_l^i \mid G_l^1, \ldots, G_l^{i-1}, L_l = \mathrm{R} \right) P\left( L_l = \mathrm{R} \right) }{ P\left( G_l^1, \ldots, G_l^\Phi \right) } \qquad (25) $$
Suppose that the succeeding candidate object $G_l^i$ in the current frame depends only on $G_l^{i-1}$ in the previous frame. Then, we have

$$ P\left( G_l^i \mid G_l^1, \ldots, G_l^{i-1}, L_l = \mathrm{R} \right) = P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right). \qquad (26) $$
From (25) and (26), we have

$$ P\left( L_l = \mathrm{R} \mid G_l^1, \ldots, G_l^\Phi \right) = \frac{ \prod_{i=1}^{\Phi} P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) P\left( L_l = \mathrm{R} \right) }{ P\left( G_l^1, \ldots, G_l^\Phi \right) } \qquad (27) $$
Since $P(L_l = \mathrm{R})$ and $P(G_l^1, G_l^2, \ldots, G_l^\Phi)$ depend on the nature of the scene, that is, they are a priori probabilities, we consider only the conditional probabilities in (27). Accordingly, we judge that the active group train $T_l$ is a real object if the following condition is satisfied:
$$ -\sum_{i=1}^{\Phi} \ln P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) < \Omega \qquad (28) $$
where $\Omega$ is the occurrence threshold with $\Omega > 0$. If condition (28) does not hold, the active group train $T_l$ is removed, because it is regarded as a part of the background. If $G_l^i \neq \emptyset$, $P(G_l^i \mid G_l^{i-1}, L_l = \mathrm{R})$ can be calculated as follows:
$$ P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) = \frac{ n\left( G_l^i \cap G_l^{i-1} \right) }{ n\left( G_l^{i-1} \right) } \qquad (29) $$
where $n(G_l^{i-1})$ denotes the number of macroblocks in the region of $G_l^{i-1}$. If $G_l^i = \emptyset$, we have
$$ P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) = \frac{c(l)}{\Phi} \qquad (30) $$

where $c(l)$ is the number of frames in which succeeding candidate objects for the active group train $T_l$ are found during the observation period.
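The decision rule of Eqs. (28) through (30) can be sketched as below. The per-frame counts play the roles of $n(G_l^i \cap G_l^{i-1})$ and $n(G_l^{i-1})$, `None` marks a frame with an empty succeeding group, and all names are illustrative.

```python
import math

def is_real_object(overlap_counts, prev_counts, period, omega):
    """Temporal-filtering decision: a train is a real object when the
    summed negative log of the per-frame conditional probabilities
    stays below the threshold omega (Eq. 28)."""
    c_l = sum(1 for n in overlap_counts if n is not None)
    total = 0.0
    for n_overlap, n_prev in zip(overlap_counts, prev_counts):
        if n_overlap is None:
            p = c_l / period          # empty succeeding group, Eq. (30)
        else:
            p = n_overlap / n_prev    # overlap ratio, Eq. (29)
        total += -math.log(max(p, 1e-12))  # guard against log(0)
    return total < omega
```

A train that overlaps consistently frame after frame accumulates little penalty; a sporadic background train is rejected.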
Once an active block group train is regarded as the motion trajectory of a real object, object tracking is performed by searching, in each subsequent frame after the observation period, for the candidate objects that overlap the corresponding real object group in the previous frame. In this case, the train becomes the real object's train and is extended over the subsequent frames. Real object tracking is performed in the same way as candidate object tracking in Equation (23). If a real object has no succeeding candidate objects in a subsequent frame, it is assumed that the real object does not move and stays at its location.
When we detect and track multiple objects with active block groups, we may encounter the train tangling problem, in which at least two trains are merged together (train merging) or one train is separated into two or more individual trains (train separation). Train merging occurs when one active block group overlaps several candidate or real objects in the previous frame, as shown in Figure 10(a). For simplicity, we consider only the case of train merging between two active group trains. Figure 10(b) shows train separation, where an active group train is divided into two active groups.
Figure 10. Train tangling. (a) Train merging. (b) Train separation.
When two active group trains $T_{l_1}$ and $T_{l_2}$ overlap a single active block group and their corresponding objects are both labeled as candidate objects ($L_{l_1} = \mathrm{C}$ and $L_{l_2} = \mathrm{C}$), one of the two trains is removed. In the case of one real object and one candidate object, the candidate object train is removed. When both trains are real objects, the overlapped active block group is split into two active block groups, each of which corresponds to one of the real objects ($T_{l_1}$ and $T_{l_2}$). That is, if both active group trains belong to real objects, the two objects are not merged, which means that two overlapped real objects are considered to move independently.
On the other hand, train separation occurs when several active block groups overlap one candidate or one real object in the previous frame, as shown in Figure 10(b). If the active group train $T_l$ in Figure 10(b) was a candidate object in the previous frame, the two active block groups ($A_1$ and $A_2$) overlapping $T_l$ are merged into one candidate object. If the active group train $T_l$ is a real object, the two active block groups are considered independent objects, of which one is regarded as the real object corresponding to $T_l$ and the other as a new candidate object.
3.2.4 Region Prediction of Moving Objects in I-frames
Finally, the location and size of a real object are determined by a rectangle that encompasses the exterior of its active block group. We define the feature vector

$$ f_{i,l} = \left( p_{i,l},\; h_{i,l},\; w_{i,l} \right) \qquad (31) $$

where $p_{i,l} = (x_{i,l}, y_{i,l})$ is the location of the object and $(h_{i,l}, w_{i,l})$ is the size of the object given by its height and width.
In practice, there is some discrepancy between the rectangular box and the actual object size. Therefore, the object region defined by the rectangular box must be refined in every frame during object detection and tracking. For this, we employ background subtraction and motion interpolation, as shown in Figures 11 and 12. That is, we periodically update the size and location of a real object every GOP by background subtraction. The background subtraction is performed on every I-frame by comparing it with the background, and is followed by a refinement process for the real object region in the I-frame. Then, motion interpolation is performed over the P-frames between the current I-frame and the previous I-frame.
Since I-frames contain neither macroblock partition types nor temporal prediction residuals, their object regions need to be estimated by projecting the real object regions of the previous P-frames onto the I-frame. The projection of a real object in a P-frame onto the next I-frame is made as follows:
$$ f'_{i,l} = \left( p'_{i,l},\; \max_{k=1,\ldots,N-1} h_{i-k,l},\; \max_{k=1,\ldots,N-1} w_{i-k,l} \right) \qquad (32) $$
Figure 11. Optimizing the feature vector of an object through background subtraction in an I-frame. (a) The background image. (b) The I-frame of the original sequence. (c) A partially decoded image from the H.264|AVC bitstream. (d) A background-subtracted image.
where $f'_{i,l}$ […], and $N$ denotes the length of one GOP. The predicted location $p'_{i,l}$ and $p_B(x)$
B