
A system for face detection and tracking in unconstrained environments

Augusto Destrero
Francesca Odone
Alessandro Verri
DISI - Università di Genova
Via Dodecaneso 35 - I-16146 Genova, Italy
{destrero,odone,verri}@disi.unige.it
June 10, 2007

Abstract
We describe a trainable system for face detection and tracking. The structure of the system is based on multiple cues
that discard non-face areas as soon as possible: we combine motion, skin, and face detection. The latter is the core
of our system and consists of a hierarchy of small SVM classifiers built on the output of an automatic feature selection
procedure. Our feature selection is entirely data-driven and
allows us to obtain powerful descriptions from a relatively
small set of data. Finally, Kalman tracking of the face region optimizes detection results over time. We present an
experimental analysis of the face detection module and results obtained with the whole system on the specific task of
counting people entering the scene.

1. Introduction
In this paper we describe a full system that we designed and
implemented for real-time face detection. The efficiency
is guaranteed by a coarse-to-fine multiple cue structure that
aims at discarding a non-face area as soon as there is enough
evidence against it. The building blocks of the system are
a motion detector, a skin detector, and a feature-based face detector implemented as a cascade of SVM classifiers. Finally, a Kalman tracker is applied to minimize operations
over time. Figure 1 sketches the various processing phases
on a given frame. It also shows the scenario monitored by
our system: a corridor of our department, busy in office
hours, illuminated by both natural and neon light. The various blocks are independent and may be switched off in case
the conditions they require are not met. The hardware of the
system is a standard PC equipped with a frame grabber and
a color video-surveillance camera.
Our motion detection is based on an incremental background construction method, which we use to initialize (or
re-initialize) a model for the background and to keep it updated as time passes. The background initialization method
is similar to the one proposed in [2]. As for skin detection
we implement the method proposed in [7], with thresholds


Figure 1: The environment monitored by our system and a sketch of the processing phases. From left: motion detection, skin detection, multi-scale face analysis and detection.

set so as to minimize the number of false negatives. The core of our system is a face detection method that consists of
a cascade of SVM classifiers. We represent image data as
a collection of meaningful features that are automatically
selected during a training stage with a feature selection procedure. The family of features we use are the rectangle features proposed in [15], which are well known as an appropriate description for the structure of faces and can also be implemented efficiently. Our feature selection is a three-stage
procedure based on a regularization algorithm with a sparsity constraint [4] (for more details on its application to the
case of faces see [5]), that produces a sparse solution highlighting the most interesting features for the classification
problem at hand. Once we obtain a reduced set of features,
we partition it in N small groups of uncorrelated features
and train N linear SVM classifiers on such representations.
At run time we apply a cascade of tests which is very efficient since few image areas are tested with all classifiers,
while in most cases the hypothesis that the area is a face is
discarded after a few steps.
The main contributions of this paper are the novel feature selection method and the overall system design, which can be used as is for monitoring tasks or as a pre-processing step for an access control system. Our feature selection is entirely data-driven and can be applied to a different problem by simply


changing the training set. To confirm this we will show some results of an eye detector built with the same principle on a different pool of data.
Face detection has been one of the first and most studied applications of content-based image analysis, as confirmed by the massive scientific production on this topic
(see, for instance, [17], and references therein). In this work
we considered appearance-based methods, using a description that looks for local meaningful patterns, similarly to
[12, 11, 15]. This paper mainly focuses on the design of
the whole system so we rely on various processing steps
to minimize the number of times actual face detection is
performed, but its core is the hierarchical, feature-based detection
module. Hierarchical methods have been widely studied
as speed-up strategies for object detection. They are based on the assumptions (usually met) that most background elements are very different from the class of interest and that the background usually covers most of the image area. Hierarchical approaches to face detection include [3, 13, 15, 8, 9]. The structure of our method is related to [15], since we often considered it as a reference,
but a detailed analysis shows that the similarities between
the two methods are very few: we use the same features
that they propose, but our feature selection is different and
it is inspired by a different theoretical framework. From
the application standpoint, the main advantage of our approach is that with a much smaller training set we obtain
comparable results [5]. Also, our hierarchical approach is
different, as we use SVMs as baseline classifiers. In this respect our method is more related to [9], even if we automatically compute the layers of the cascade while they rely on a selection of classifiers designed in advance, implementing
decision surfaces of increasing complexity. Our approach
has connections with the combination of rectangle features
and SVMs presented in [16], but they consider neither
automatic feature selection nor cascading as we do.
The paper is organized as follows. Section 2 describes
the preprocessing phases of the system, Section 3 is devoted to our feature selection and object detection
method, Section 4 first reports test results on face detection,
then analyzes the overall performance of the system on the
task of counting people that walk towards the camera, and
finally presents some results on eye detection.

2. Preprocessing of the frame


The basic structure of our system starts with a change detection process that relies on a background model updated
incrementally over time. Background initialization is the
fundamental preprocessing step at the basis of most video
surveillance systems. Once a model for the observed scene
is known, motion detection can simply be performed by
comparing the scene model to new frames. Many applica-

tions assume a tuning procedure in which the empty scene is observed by the system for a few seconds. In this paper we follow a different approach, loosely resembling the one proposed in [2], that allows us to compute (or re-compute)
the background in the presence of dynamic events. We start
from an empty background and add background information only for pixels that are not moving. Motion is detected
by analyzing a buffer of recent frames and estimating the standard deviation of the pixel values along the frames. Given a pixel $p = (i, j)$, if its standard deviation $\sigma_{ij}$ is small the pixel is marked as still and its average value in the buffer is assigned to the background. Once a pixel is initialized, its value is updated as long as it remains still in the video. Background update follows the standard rule

$$B_t(i, j) = \alpha B_{t-1}(i, j) + (1 - \alpha) I_t(i, j),$$

where $\alpha$ is selected according to how fast we want to update the background. In our environment we may choose $\alpha$ close to 1, since the background changes are mainly due
to smooth illumination changes. The same procedure is applied every few frames to keep an updated version of the
background. In the presence of abrupt changes (detected
with a global analysis on the number of pixels estimated as
moving) the initialization procedure is started again. In the
presence of moderate variations a few seconds are enough
to obtain a complete model of the background.
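To make the scheme concrete, here is a minimal NumPy sketch of one maintenance step; the buffer length, the stillness threshold, and the parameter names are illustrative choices of ours, not values from the system.

```python
import numpy as np

def update_background(background, initialized, frame_buffer,
                      still_thresh=2.0, alpha=0.95):
    """One step of incremental background maintenance.

    background   : HxW float array, current background model B
    initialized  : HxW bool array, pixels already entered in the model
    frame_buffer : TxHxW float array of the most recent frames
    """
    # A pixel is "still" if its values vary little across the buffer.
    still = frame_buffer.std(axis=0) < still_thresh

    # Initialize still pixels not yet in the model with their
    # average value over the buffer.
    new = still & ~initialized
    background[new] = frame_buffer.mean(axis=0)[new]
    initialized |= new

    # Update already-initialized still pixels with the standard rule
    # B_t = alpha * B_{t-1} + (1 - alpha) * I_t.
    current = frame_buffer[-1]
    upd = still & initialized & ~new
    background[upd] = alpha * background[upd] + (1 - alpha) * current[upd]
    return background, initialized
```

A global check on the fraction of moving pixels, not shown here, would trigger the re-initialization described above in case of abrupt changes.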
A fast skin detection may be applied to the motion detection output to limit the size of the search area. Also, it
allows us to discard video frames containing people walking away from the camera, without further investigation.
Most face detection systems based on skin color analysis are
founded on studies [1] showing that the spectral reflectance
of human skin is independent of the human race and of the wavelength of the incident light. Hence, human skin can be
found in a limited cloud of a color space. Skin detection is
often used for face detection [7, 17] in spite of the fact that
it suffers from the presence of other skin areas (hands, arms,
see Fig. 1) and other material (leather, red or brown clothes,
hair). Here we use it only as a preprocessing step, following the work of Elazouzi et al. [6]. We choose the YUV color space
because it is a standard in video: the components U and V
are called chrominance components and hold all the color
information of the pixel, so we can use these components alone to analyze the image. We test each pixel of the motion area according to the following scheme: we model the skin color in the $UV$-plane as a quadruple consisting of the mean values $m_U$ and $m_V$ and the tolerance values $t_U$ and $t_V$. We estimate the mean values on a training set of skin images and set the tolerance values to a proportion of the standard deviation of the training set data. The tolerance values are chosen so as to minimize the number of false negatives in a calibration phase: skin detection is used to lower the number of tests when the observed area is not ambiguous.
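A minimal sketch of this skin test follows, assuming the U and V planes are already available as arrays; the tolerance factor k is an illustrative stand-in for the calibrated proportion mentioned above.

```python
import numpy as np

def fit_skin_model(skin_uv, k=2.5):
    """Estimate (mU, mV, tU, tV) from an Nx2 array of U,V skin samples.
    k scales the standard deviation into the tolerance values and would
    be tuned on a calibration set to keep false negatives low."""
    m_u, m_v = skin_uv.mean(axis=0)
    t_u, t_v = k * skin_uv.std(axis=0)
    return m_u, m_v, t_u, t_v

def skin_mask(u_plane, v_plane, model):
    """Mark as skin every pixel whose chrominance falls inside the
    tolerance box around the mean skin colour."""
    m_u, m_v, t_u, t_v = model
    return (np.abs(u_plane - m_u) < t_u) & (np.abs(v_plane - m_v) < t_v)
```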


Figure 2: The support of the rectangle features implemented in our system (types 2h, 2v, 3h, 3v, and 4; see text).

3. Feature-based object detection


This section is devoted to a brief overview of our feature selection method and the subsequent object detection adopted.
Detection problems can be modeled as binary classification problems, where we assign a positive (+1) or negative (−1) label to an image area according to the presence or absence of the
object of interest.
We start from a dataset of positive and negative example images of the object class of interest, and expand the
information carried by each image into a high-dimensional
dictionary of overcomplete features, describing various local patterns. Then we apply automatic feature selection so
to obtain a small and compact set of features that capture
the most meaningful patterns for a given classification problem. The features that we implement in our work are the
so-called rectangle features [15]: Fig. 2 shows the support
of the 5 types of rectangle features that we compute at all
positions and scales of the analyzed image patch. The grey-level image values corresponding to the white area are subtracted from the values corresponding to the dark area; the result is then normalized with respect to the size of the patch and the standard deviation of its grey levels.
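The constant-time computation of these features relies on the integral image of [15]. The sketch below shows the idea for the two-rectangle horizontal type (2h); the exact normalization constant is our reading of the description above.

```python
import numpy as np

def integral_image(img):
    """Summed-area table padded with a zero row and column, so that
    the sum of img[y0:y1, x0:x1] costs four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def feature_2h(ii, y, x, h, w, patch_area, patch_std):
    """Two-rectangle horizontal feature: left (white) half minus right
    (dark) half, normalized by patch size and grey-level deviation."""
    half = w // 2
    white = box_sum(ii, y, x, y + h, x + half)
    dark = box_sum(ii, y, x + half, y + h, x + w)
    return (white - dark) / (patch_area * patch_std + 1e-8)
```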
We now sketch the feature selection procedure, giving
the details useful for understanding its implications for our system. For more information on the algorithm and an exhaustive analysis of the feature selection strategy see [5],
while for the theory see, for instance, [4].

3.1. Feature selection


We consider a training set of positive and negative examples
and describe each image by means of rectangle features. We
then assume a linear dependence between input representation and output labels, and build a linear system
$$Af = g \qquad (1)$$

where $A$ is the matrix of processed image data: each entry $A_{ij}$ is obtained from the $i = 1, \ldots, n$ images, each of which is represented by $j = 1, \ldots, p$ rectangle features. In a binary classification setting we may set $g_i \in \{-1, 1\}$; $f$ is the unknown vector that weighs the importance of the features for each image of the training set. In this setting, feature selection means looking for a sparse solution $f = (f_1, \ldots, f_p)$: features corresponding to non-zero $f_i$ are relevant to model the diversity of the two classes.
In general one could solve Problem (1) by matrix inversion, but typically the system is largely under-determined and may be severely ill-conditioned because of the redundancy of the measurements. One possible way to address these problems is to resort to some regularization strategy. In particular, since we aim at selecting a subset of features, a sparsity-enforcing penalty is ideal for our case. We consider the $L_1$ norm, which leads to a feasible problem of the form
$$f_\lambda = \arg\min_f \left\{ \|g - Af\|_2^2 + 2\lambda \|f\|_1 \right\} \qquad (2)$$

($\|f\|_1$ is the $L_1$-norm). This problem, usually referred to as the lasso problem [14], may be solved implementing an iterative method called thresholded Landweber, whose iterative step is ($f_\lambda^0 = 0$):

$$f_\lambda^{t+1} = S_\lambda \left[ f_\lambda^t + A^T (g - A f_\lambda^t) \right], \qquad t = 0, 1, \ldots \qquad (3)$$

where $S_\lambda$ is a soft-thresholder:

$$(S_\lambda h)_n = \begin{cases} h_n - \lambda\,\mathrm{sign}(h_n) & \text{if } |h_n| \geq \lambda \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

It has been shown [4] that, under appropriate conditions on the matrix $A$ (which needs to be normalized so that $\|A\| < 1$), the thresholded Landweber algorithm converges
to the minimizer of (2).
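A compact NumPy sketch of iteration (3) with the soft-thresholder (4) follows; the rescaling step and the fixed iteration count are illustrative simplifications.

```python
import numpy as np

def thresholded_landweber(A, g, lam, n_iter=1000):
    """Iterative soft-thresholding for the lasso problem (2)."""
    A = A / (np.linalg.norm(A, 2) * 1.01)   # enforce ||A|| < 1 for convergence
    f = np.zeros(A.shape[1])
    for _ in range(n_iter):
        h = f + A.T @ (g - A @ f)                          # Landweber step
        f = np.sign(h) * np.maximum(np.abs(h) - lam, 0.0)  # S_lambda of eq. (4)
    return f   # non-zero entries mark the selected features
```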
Problem (3) involves the manipulation of matrices that may be very large, as they are in our case. For this reason we consider an optimization procedure based on solving $S$ smaller sub-problems, each obtained by extracting, with replacement, a subset of $m$ features from the original set, with $m \ll p$. The $S$ intermediate solutions are combined and
used as input of a second regularized feature selection. Details and motivations of this two-stage feature selection can
be found in [5]. The solution obtained at the end of the two
stages is sparse and consistent, but usually it is not small
enough for real-time processing. We then apply a third selection stage on the features that survived the previous two stages, which reduces the amount of redundancy in the set of selected features. We choose one delegate for each group of
features of the same type that are spatially close: we first
restrict the set starting from a random feature and adding
only features that are (a) distant according to

$$D(f_i, f_j) = \begin{cases} \infty & \text{if } f_i \in F_k,\ f_j \in F_l,\ k, l \in \{2h, 2v, 3h, 3v, 4\},\ k \neq l \\ d(f_i, f_j) & \text{otherwise} \end{cases} \qquad (5)$$

(where $d(f_i, f_j)$ is the sum of the Euclidean distances between corresponding corners of the rectangle supports), or (b) features that are close but appear uncorrelated (for this purpose we use Spearman's correlation test [10]). We call the final set of features $S^{out}$.
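As an illustration of this third stage, a greedy filter could look as follows; the distance threshold, the significance level, and the exact acceptance rule are our reading of conditions (a) and (b).

```python
import numpy as np
from scipy.stats import spearmanr

def select_delegates(features, responses, min_dist=4.0, p_thresh=0.05):
    """Greedy pruning of the surviving features.

    features  : list of (type, corners) pairs, corners an array of points
    responses : n_samples x n_features array of feature values
    A candidate is kept if, for every kept feature of the same type,
    it is either distant (condition (a)) or uncorrelated according to
    Spearman's test (condition (b))."""
    def d(ci, cj):
        # sum of Euclidean distances between corresponding corners
        return sum(np.linalg.norm(a - b) for a, b in zip(ci, cj))

    kept = []
    for j, (tj, cj) in enumerate(features):
        ok = True
        for i in kept:
            ti, ci = features[i]
            if ti != tj:
                continue              # different type: D is infinite
            if d(ci, cj) >= min_dist:
                continue              # same type but spatially distant
            rho, p = spearmanr(responses[:, i], responses[:, j])
            if p < p_thresh:          # close and significantly correlated
                ok = False
                break
        if ok:
            kept.append(j)
    return kept
```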


3.2. Cascade of classifiers


The set $S^{out}$ is used to set up a coarse-to-fine cascade of classifiers. Each classifier is built by extracting from $S^{out}$ small subsets of at least 3 distant features that are able to reach a fixed target performance on a validation set. We start with 3 mutually distant features (according to (5)) and add further features until the target performance is obtained, testing on a validation set with a linear SVM.
The target performance is chosen so that each weak classifier is unlikely to miss positive occurrences: we set the minimum hit rate to 99.5% and the maximum false positive rate to 50%. These modest targets allow us to achieve
good performances with the global classifier, since global
performance of the cascade will be computed according to
the following [15]:
$$H = \prod_{i=1}^{K} h_i \qquad \text{and} \qquad F = \prod_{i=1}^{K} f_i$$

where $h_i$ is the hit rate and $f_i$ is the false positive rate of each weak classifier $i$. In our case, assuming a cascade of 10 weak classifiers, we get $H = 0.995^{10} \approx 0.95$ and $F = 0.5^{10} \approx 9.8 \times 10^{-4}$.
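These cumulative rates are straightforward to verify:

```python
K, h, f = 10, 0.995, 0.5
H = h ** K   # overall hit rate, about 0.951
F = f ** K   # overall false positive rate, about 9.8e-4
```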
At the end of the cascade design we test the system live
and store the detected objects. We then analyze the results
checking whether they meet our needs in terms of performance. If not, we repeat the second and third stages of feature selection on a new set of negative examples, made of
all the false positives detected by the system (similar to the
bootstrapping procedure described in [12]); the features we
obtain are more specialized in discriminating between positive and difficult negative examples. Further layers derived
from this refinement are added at the bottom of the cascade.

3.3. Running the object detector


The final object detector scans the image or a part of it,
performing classification at multiple scales and locations.
Following [15], instead of scaling the image, we can more
efficiently rescale the features used by the classifier, because rectangle features can be computed at any size in
constant time. The choice of the scale factor $s$ between consecutive scale levels $S_k$ directly affects both detection performance and scanning speed; in our experiments we chose $s = S_{k+1}/S_k = 1.1$, achieving good accuracy. The search window is also moved to different locations, shifted 1 pixel at a time at the base scale; at higher scales the shift is $\Delta_k = [S_k]$ pixels, where $[\cdot]$ denotes the rounding operation. Given an image patch at a certain location and scale, we use it as input to the successive layers of the cascade.
If the patch is classified as negative at one stage of the cascade it is immediately discarded. If it passes all stages the
object is detected (see Fig. 4). We expect multiple detections to occur around a real object because of the spatial correlation of the image, while the number of detections around a false positive is often lower, so we discard regions that accumulate less than a fixed number of hits (4 in our experiments). Moreover, it is reasonable to keep only one delegate for each group of overlapping detections, and we do so simply by partitioning the set of detections into disjoint subsets and keeping the average bounding box of each subset.
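A schematic version of this scanning loop is given below, under the assumption that each cascade layer is available as a callable on integral-image windows; names and geometric details are illustrative.

```python
def scan_image(ii, img_h, img_w, cascade, base=19, s=1.1):
    """Multi-scale scan with early rejection.

    ii      : integral image of the frame (Section 3)
    cascade : list of callables; each takes (ii, y, x, size) and returns
              True if the window may still contain a face
    """
    raw = []
    size = float(base)
    while size <= min(img_h, img_w):
        step = max(1, int(round(size / base)))   # shift grows with the scale
        for y in range(0, img_h - int(size) + 1, step):
            for x in range(0, img_w - int(size) + 1, step):
                # early rejection: stop at the first layer that says no
                if all(layer(ii, y, x, int(size)) for layer in cascade):
                    raw.append((y, x, int(size)))
        size *= s
    # Regions supported by fewer than the required number of hits would
    # be discarded here, and overlapping detections averaged (not shown).
    return raw
```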

4. Experiments
The methodology described in Section 3, which is entirely
data-driven, has been applied to both face and eye detection
problems. In this section we first report experiments that
confirm the appropriateness of our object detection method
on faces, then we describe the experimental analysis of the
whole system on a face tracking and people counting problem; finally, we present some experiments made with our
system trained to detect eyes.

4.1. Training and validating a face detector


Our example-based face detector is currently built and validated on a dataset of 2000 positive and 2000 negative training examples, 2000 validation examples (which we use to tune parameters and build the cascade), and about 4000 test examples. Examples are face and non-face images resized to a common size of $19 \times 19$ pixels. This dataset was acquired automatically with our system previously trained on a benchmark dataset available on the web (CMU-CBCL for frontal
faces).
The feature selection procedure described in Section 3.1
is applied to the training images. At the end of the two-stage regularized selection we obtain 247 features, while after the third step we are left with 42 features. After selection, in order to check the appropriateness of our representation with respect to the detection problem, we trained a linear SVM classifier on the training images represented with the sets of features $S_{247}$ and $S_{42}$ respectively (the SVM parameter $C$ was tuned with cross validation on the validation set). We then classify the images of the test set. Figure 3 reports the ROC curves obtained while varying the SVM offset. The results are very good, and show that the degradation of the results when adding the third step of correlation analysis is limited. We then build the classifier cascade, obtaining 13 layers, each made of at most 4 features
(this means that in the construction process 4 features were
always enough to meet target performances).
We test the face detector on live videos, gathering all false
positives. We then use those examples to find a new set
of more discriminative features for difficult examples. The
final feature set, $S_{59}$, is made of 59 features and produces
a cascade of 19 layers (see Fig. 4). The performance we


Figure 5: Sample shots from the system interface, acquired (from top left) at times 11:00:58, 11:44:17, 13:15:13, 13:55:42, 14:25:23, 14:25:25.

Figure 3: Two-stage feature selection with and without the third step of correlation analysis (ROC curves). The results show that the third stage guarantees a considerable reduction of the set size with little degradation of the results.

Figure 4: A sketch of the final cascade of 19 classifiers, showing the support, at the correct position and scale, of the features used at the first and last layers.

obtain applying the cascade to our test set is a 0.1% false positive rate and a 94.0% hit rate.

4.2. Detecting, tracking and counting faces


This final section reports results obtained running the system for 5 hours (from 10:00 am to 3:00 pm) on a busy weekday: we concentrate on the task of counting the number of
people leaving the corridor, that is, people facing the camera for enough frames to be detected and tracked. We aim
at keeping a balance between false positives and negatives:
thus we limit the number of false positives by tuning the face detector so that only a small number of false faces is accepted; the temporal component then helps us increase the number of detected faces along the sequence. As a ground truth we manually
marked all the temporal ranges where motion was detected,
marking positive ranges in the presence of roughly frontal
faces, negative ranges in the presence of people walking
away from the camera, lateral faces, motion detection errors

due to changes of the background. Once a face is first located in a video, we track it with a Kalman tracker (whose state models position and velocity, while the measurements model the position evolution over time) so as to build the trajectory of the face and count it just once. The tracking module also allows us to evaluate the stability of detected faces over time, discarding those that survive for only a few frames; more precisely, we set a minimum number of frames in which a face has to be tracked
to be considered stable. This increases the detection performance, since often false positives are less stable than true
ones. The overall performance with respect to the number of people that crossed the scene over the 5 hours was
16% false positives and a hit rate of 84%. The results
are very encouraging, considering that the corridor activity
was entirely out of our control, and it included abrupt scene
changes (due to illumination or to doors opened or closed),
people standing in the corridor for unpredictable lengths of time, people reading or putting their jumper on while walking, people
using the telephone thus occluding part of the face, and so
forth (see Fig. 5). Fig. 6 shows the difference between the
number of real faces and the detected faces: positive values
indicate misses, negative values false positives. To make
the figure more readable we concatenated the temporal ranges where motion was detected; thus the x-axis does not show a real flow of time, but a timestamp running over the various
video-shots. A posteriori inspection on a subset of the video
shots allowed us to estimate the performance of the detector with respect to the analyzed patches: a false positive rate of $4 \times 10^{-7}$ and a hit rate of about 67%. The number of patches analyzed in an average frame is about 20,000.
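For reference, a minimal constant-velocity Kalman filter of the kind described above can be sketched as follows; the time step and the noise levels are illustrative, not the values used in our system.

```python
import numpy as np

def make_kalman(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity model for the face centre: the state is
    (x, y, vx, vy), the measurement is the detected position (x, y)."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # measurement model
    Q = q * np.eye(4)                            # process noise
    R = r * np.eye(2)                            # measurement noise
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle given a newly detected position z."""
    x = F @ x                                    # predict state
    P = F @ P @ F.T + Q
    S = H @ P @ H.T + R                          # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    x = x + K @ (z - H @ x)                      # update with measurement
    P = (np.eye(4) - K @ H) @ P
    return x, P
```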

4.3. Training an eye detector


Since our method is completely data-driven, we can apply the same procedure to a different dataset. We automatically collect 1244 pairs of eyes from the FERET dataset [18] using the ground truth for eye positions, together with


References

[1] R. R. Anderson and J. A. Parrish. The optics of human skin. Journal of Investigative Dermatology, 77:13–19, 1981.


[2] A. Bevilacqua. A novel background initialization method in visual surveillance. In MVA Workshop, pages 614–617, 2002.


[3] P. J. Burt. Smart sensing within a pyramidal vision machine. Proc. of the IEEE, 76(8), 1988.


[4] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. on Pure Appl. Math., 57, 2004.


Figure 6: The difference between the number of faces in the scene and the detected faces: positive values indicate misses, negative values false positives.

[5] A. Destrero, C. De Mol, F. Odone, and A. Verri. A regularized approach to feature selection for face detection. Technical Report DISI-TR-2007-01, Università di Genova, 2007.
[6] K. Elazouzi, P. Kauff, O. Schreer, S. Askar, and Y. Kondratyuk. Vision-based skin-colour segmentation of moving hands for real-time applications. In Proc. 1st European Conf. on Visual Media Production, 2004.
[7] A. Elgammal and M. Abdel-Mottaleb. Face detection in complex environments from color images. In Proc. of ICIP, 1999.
[8] F. Fleuret and D. Geman. Coarse-to-fine face detection. International Journal of Computer Vision, 41:85–107, 2001.
[9] B. Heisele, T. Serre, S. Mukherjee, and T. Poggio. Feature
reduction and hierarchy of classifiers for fast object detection
in video images. In IEEE Proc. CVPR, 2001.
[10] E. L. Lehmann. Nonparametrics: Statistical methods based
on ranks. Holden-Day, 1975.

Figure 7: Some face and eye detection results.

[11] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(4), 2001.

other patches of faces from the same dataset not containing eyes. We resized all images to a common size of $40 \times 20$ pixels, then used 1000 positive and negative examples as the training set for our system, and the remaining 244 positive and negative examples as the validation set. Our feature selection procedure produced a set of 82 features, from which we obtained a cascade of 22 layers. We integrated the eye detector as a new module of our system, performing a multi-scale search for eyes in the regions marked as face by the face detector. Some results of simultaneous face and eye detection are shown in Fig. 7.
Within our system this further detection will be used to select the best face views in a face validation framework.
Acknowledgements We thank Christine De Mol for
many useful suggestions on feature selection. Hardware
and low-level processing libraries are due to Imavis s.r.l. (http://www.imavis.com/), which we thank for their
help and support.

[12] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In CVPR, 1997.
[13] A. Rosenfeld and G. J. Vanderbrug. Coarse-fine template matching. IEEE Trans. on Sys. Man. Cyb., 2, 1977.
[14] R. Tibshirani. Regression shrinkage and selection via the
lasso. J. Royal Statist. Soc. B, 58(1), 1996.
[15] P. Viola and M. J. Jones. Robust real-time face detection.
International Journal of Computer Vision, 57(2), 2004.
[16] Q. Wang, J. Yang, and W. Yang. Face detection using rectangle features and SVM. Int. Journ. of Intelligent Technology,
1(3), 2006.
[17] M.-H. Yang, D. J. Kriegman, and N. Ahuja. Detecting faces
in images: a survey. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 24(1):34–58, 2002.
[18] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss. The
FERET Evaluation Methodology for Face-Recognition Algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(10):1090–1104, 2000.

