
2012 IEEE International Conference on Systems, Man, and Cybernetics October 14-17, 2012, COEX, Seoul, Korea

A Fast Eye Localization and Verification Method to Improve Face Matching in Surveillance Videos
Lucas Sousa de Oliveira (1,2), Díbio Leandro Borges (1), Flávio de Barros Vidal (1), Lin-Ching Chang (2)

(1) Department of Computer Science, University of Brasilia, Brasilia, DF 70910-900, Brazil
(2) Department of Electronic Engineering and Computer Science, The Catholic University of America, Washington, DC, USA

Abstract: The automatic face recognition (FR) problem has been extensively studied and applied to domains including biometrics, security, authentication, surveillance, and identification. Face recognition algorithms commonly use high-dimensional information and are therefore computationally expensive. The use of wrongly detected features can confuse the recognition process and make it even slower. This paper presents a set of heuristic lightweight features to describe eye regions. These features are used to further classify detected eye regions into false positive and positive ones. Note that the detected eye regions are all very similar, making it hard to find a good set of features to separate them. The classification is done by simply applying a threshold proportional to the variance of the data. The method was able to correctly classify 49.8% of the false positive samples, which, if applied to Viola and Jones' best result, could potentially turn its 93.7% performance into 96.8%. If the classification also considers the pairs heuristic, the rate goes up to 83.3% at the expense of wrongly classifying positive samples. It would make the 93.7% become 98.9%.

I. INTRODUCTION

The face recognition (FR) problem has been a challenge for the research community for many years, even though great results have been achieved. The attempts to solve it can be divided into four categories [1]: knowledge-based [2], feature-invariant-based [3], template-matching-based [4] and appearance-based methods [5]. These methods usually exploit facial features such as eyes, mouth and nose in different ways to determine whether a face belongs to a specific person [6]-[10]. There are also cases, usually in face detection problems, where these features are used implicitly, being chosen by an artificial intelligence system, e.g. a neural network [7] or adaptive boosting [8].

The solutions for FR, especially the ones categorized as template matching and appearance based, can easily become high dimensional and therefore computationally expensive. Because of this, obtaining a small and precise set of features is an important and tough quest [11]. The larger this set, the longer the time required to find the corresponding face, if any.

Face recognition has its most popular application in surveillance problems. This kind of problem allows us to have a camera in a fixed position, controlled light and a somewhat controlled flow of people, thus restricting pose and illumination. Although this setting reduces the complexity of the FR problem, there are still some concerns about the time it takes to analyze each frame. This time, which depends on the frame resolution, the detection algorithm and its parameters, should be short enough to allow the system to run in real time, i.e. at least 15 fps. Both Viola [8] and Rowley [7] can meet this time constraint.

Detection algorithms usually have a parameter that affects how many detected regions are output. In [8] and [7], this is a so-called scaling parameter, and it allows the system to detect objects of sizes different from the training size. Note that as more regions of interest (ROIs), i.e. possible eyes, are output, the detection algorithm takes more time and outputs more redundant information. The extra time is bad for the desired application, but the redundant information is useful: it can be used to increase the certainty that a detected ROI is actually an eye.

It should also be noted that this problem is a somewhat hard task. The regions output by the detection algorithm have already been classified and therefore carry very similar characteristics. The problem can then be summarized as finding good and computationally efficient features to describe the eyes.

In this paper, we propose a fast method to localize and verify eyes, taking as input the regions of interest (ROIs) produced by the Viola-Jones detector [8]. The set of features proposed here is shown to improve face matching by reducing by nearly 50% the number of false positive features to be further classified. The experiments were done using a video surveillance scenario where many people move in the scene and light conditions are poor. The next sections detail the proposed approach and review the results.

II. MATERIAL AND METHODS

The proposed method uses the ROIs output by a well-known detection algorithm, which in this research was the Viola-Jones detector [8], and further classifies them, reducing the number of false positives with potentially no side effect on the positive regions.

Figure 1 shows a simple representation of the proposed system. The footage captured, the detection algorithm, the features and the classification performed are described in detail in the following sections. Note that the input to the proposed method is not the frames themselves but the detected eye regions. The eye detection was made using the Haar detector from [12] with a scaling parameter equal to 1.05, which resulted in 9405 detected ROIs, of which 712 were positive and 8693 were false positive. The overall system expects a frame as input, which is passed to the detection stage, where ROIs are detected and then further classified. The resulting set of regions is passed to the face recognition algorithm.

978-1-4673-1714-6/12/$31.00 ©2012 IEEE


Fig. 1: System representation. The dashed line represents the proposed method and where it would be applied.

The analysis assumes that the input video is in grayscale. Some authors, such as [13], use color information in different color spaces for a similar task called eye location. Although this provides fairly good solutions, dealing with color spaces introduces a potential delay and increases the system complexity. Grayscale already provides enough information, as seen in Figure 4. To simplify the implementation, we used the method of [8] as already implemented in the OpenCV library [12].

A. Dataset

Although the proposed method uses ROIs as input instead of full frames, defining the detection algorithm's input video helps in understanding some characteristics of those ROIs that can be exploited.

The footage used to extract the possible eye regions was recorded in a corridor. The camera was positioned in a fixed location 2.5 meters off the floor in the middle of a 2-meter-wide hallway. The reason for that was to reduce or possibly eliminate the problems with focus, pose and illumination. This is also a fairly realistic scenario for testing.

The recording camera captured frames at 720p resolution and 30 fps. In total, 764 grayscale frames were captured with those characteristics, of which 431 were used for the analysis. Figure 2 shows a sample frame from this dataset. From these 431 frames, 9405 ROIs were extracted using the Viola-Jones object detector, described in the next section. Those were labelled by hand, providing 8693 wrongly detected regions and 712 positive regions.

Fig. 2: A sample frame of the video dataset.

B. Viola-Jones Detector

The choice of the eye detection algorithm can affect the choice of features used to further classify the detected regions. It is intuitive that those new features should extract information from the ROIs that differs from what the detector already uses.

The Viola-Jones detector uses features known as Haar-like features to describe an image patch. These are composed of rectangular regions such as the ones shown in Figure 3. The value of such a feature is simply the sum of the pixels within the clear region minus the sum of the pixels within the shaded region.

Fig. 3: The Haar-like features.

The first major contribution of Viola and Jones' work was the integral image. It is basically an image where each pixel equals the sum of every pixel from it to the origin (0,0). Once it is built, the sum of the pixels in a rectangular region can be computed with only 4 memory accesses and 3 operations, instead of an iteration over the region.

Another contribution was the construction of a classifier through the selection of critical visual features by an AdaBoost learning algorithm. This is enough to build weak classifiers. In order to improve the classification rate, a cascade classification, like the one described in section II-D2, was built. Each stage of their cascade classifier has an increasing time complexity, but also an increasing classification rate.

In order to classify subwindows of all sizes, Viola and Jones applied a scaling factor to the subwindows. Typically, a factor of 1.25, as used in [8], gives a good classification speed but a poor false negative rate. In this research, it was observed that a value of 1.05 was enough to retrieve most of the eyes in the frame, at the cost of many false positive regions and reduced system speed.

The output of this stage is the set of image patches, i.e. ROIs, that the detector classified as possible eyes.

C. Proposed features

Observing the output ROIs, we noticed that the eye presents some distinct patterns that could be used for its classification. Some of these patterns were found in the profile of the ROI's middle region and the others were derived from heuristics.

1) Profile Valleys: When plotting the intensity of the pixels along the horizontal middle of the ROI against their position, we can see that there are two distinct valleys, as seen in Figure 4. These correspond to the white regions of the eye, namely the sclera. Note that the color convention used in this plot assigns the null value to white and the full value to black.
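As a rough illustration of how the two valley features could be extracted, the sketch below takes the middle row of a grayscale ROI, inverts it so that bright sclera pixels become valleys (reading the plot convention as white = 0 and black = full scale), and locates one valley in each half of the profile in a single pass. The function names, the one-valley-per-half rule and the normalisation by ROI width are assumptions made for this example; they are not taken from the authors' implementation.

```python
from typing import List, Sequence, Tuple


def middle_profile(roi: Sequence[Sequence[int]], max_value: int = 255) -> List[int]:
    """Return the inverted intensity profile of the ROI's middle row.

    Inversion follows the assumed plot convention: white maps to 0 and
    black to max_value, so the bright sclera shows up as valleys.
    """
    row = roi[len(roi) // 2]
    return [max_value - p for p in row]


def valley_features(profile: Sequence[int]) -> Tuple[float, float]:
    """Find the deepest point of each half of the profile in one pass and
    return both positions relative to the profile length (hypothetical rule)."""
    n = len(profile)
    half = n // 2
    left_pos, right_pos = 0, half
    for x, v in enumerate(profile):
        if x < half:
            if v < profile[left_pos]:
                left_pos = x
        elif v < profile[right_pos]:
            right_pos = x
    inv = 1.0 / n  # normalise by ROI width
    return left_pos * inv, right_pos * inv


# Toy 3x8 ROI: bright sclera at columns 1 and 6, darker iris in between.
toy_roi = [[0] * 8, [40, 200, 60, 20, 30, 70, 210, 50], [0] * 8]
print(valley_features(middle_profile(toy_roi)))  # (0.125, 0.75)
```

Relative positions are used here so that the same thresholds can apply to ROIs of different sizes; a real detector output would likely need some smoothing of the profile first, which this sketch omits.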

Fig. 4: Positive ROI and its profile plot. (a) A sample positive ROI with highlighted profile. (b) The profile plot for the ROI in Fig. 4a.

This way, two features can be retrieved. They correspond to the positions of these valleys relative to the ROI size. Note that although these are weak classifiers, they are somewhat robust, since they can be found in ROIs even with some level of rotation and scaling. Their computation is also really fast, since it only requires a one-pass search and two multiplications.

2) Profile Variance: While comparing the false positive and positive ROIs, we noticed an interesting texture property. The detection algorithm tends to find false ROIs with very small profile variance, while positive ROIs have a distinct form. The computation of the profile variance is an attempt to make use of this property.

This feature was computed with a fast one-pass calculation using the equation $\sigma^2 = E[(X - \mu)^2] = E[X^2] - (E[X])^2$. The terms $E[X^2]$ and $E[X]$ can be accumulated independently during the same single pass over the profile and combined at the end.

3) ROI Size: If we consider the surveillance camera to be static and without zooming, we can also use the size of the output ROIs as a feature for classification. The detection method requires a minimum amount of detail to be able to find the eye properly. Because of that, the positive regions only enclose eyes within some distance from the camera, and therefore have a limited window size range, while false positives have a very wide spectrum of sizes, which makes them easy to classify by size. Note that the time required to compute this feature is insignificant, since the information is already present as the ROI's width/height information.

4) Pairs: One important piece of information about the eyes is that they are most likely to come in pairs, and the distance between the paired elements is biologically constrained to be approximately 1.6 times the size of the eye [14]. This way, knowing the ROI size, we could build an annular region where another eye should be found. The annular region is defined as

$$\left|\, \lVert P_i - P_j \rVert_2 - \mu_{pd} S_i \,\right| < \sigma_{pd}^2 \quad \text{for } i \neq j,$$

where $\mu_{pd} = 1.6$ and $\sigma_{pd} = 0.6$ are the distance between the eyes, proportional to the intercanthal distance, and its standard deviation, both derived from [14], and $S_i$ is the size of the i-th ROI. Note that the parameters used needed to be greater than those presented in [14] because the output ROIs tend to enclose more than just the eye. It was found that good values for $\mu_{pd}$ and $\sigma_{pd}$ were 2 and 1, respectively. Note that by using dynamic programming techniques, almost half of the calculations can be avoided.

There are two issues to consider with this approach:
a) If too many regions are output, each of them has a higher chance of having a neighbor within its annular region;
b) Removal of a false positive can result in missed eyes.

Note though that the second issue can be used in our favor. If an eye was not detected, it either means that the algorithm is not robust enough or that the detected ROI was not part of a face at all. If we suppose it was not part of a face, any ROI that does not have any neighbor in its annular region should be removed.

D. Classification

For each potential eye region we propose a set of features to represent it. The proposed features were described in the previous section. Here we cover only the classification task.

The eye regions need to be classified into two categories: false positives and positives. This is equivalent to doing a likelihood-ratio test where the loss weight for classifying a positive region as a false positive is much greater than the loss weight for the opposite. To simplify the computation, a simple threshold was applied. For each feature vector $X_i = [x_{i1}, x_{i2}, \ldots, x_{iN}]^T$ it was computed

$$|x_{ij} - \mu_j| < K_j \sigma_j^2, \qquad (1)$$


where $x_{ij}$ is the value of the j-th feature of the i-th region and $K_j$ its parameter to be optimized. If (1) is true for every j, the i-th region is classified as a positive sample. If (1) is false for at least one j, the i-th region is said to be a false positive.

1) Optimum K: In order to keep as many positive regions as possible while removing as many false positives as possible, the parameter K needed to be optimized. This was done using an idea from the $3\sigma$ characteristic of normal distributions. The optimum K is simply the smallest value of K for which no more positive samples are wrongly classified.

2) Cascade Classification: In order to allow the system to run as fast as possible, a cascade classifier, as shown in Figure 5, was built. An important characteristic of the cascade classification is that each stage has to classify less data than the previous one. This reduces the classification time considerably, at the cost of analyzing only one feature at a time.
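The threshold test, the choice of the optimum K and the cascade can be sketched together as below. The sketch assumes that (1) compares the deviation of a feature from its mean with K times its variance, and that mean and variance are estimated on the hand-labelled positive ROIs; the source does not state which samples were used, so this is a guess. The names `fit_stage` and `run_cascade`, the dictionary layout of an ROI and the use of `<=` in place of the strict inequality are all hypothetical choices for this illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Roi = Dict[str, float]
# One cascade stage: (feature function, mean, variance, K). Hypothetical layout.
Stage = Tuple[Callable[[Roi], float], float, float, float]


def fit_stage(feature: Callable[[Roi], float], positives: Sequence[Roi]) -> Stage:
    """Estimate mean and variance of one feature on positive ROIs (an
    assumption) and pick the smallest K that rejects none of them."""
    values = [feature(r) for r in positives]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    if var == 0.0:
        raise ValueError("feature is constant on the positive samples")
    k = max(abs(v - mu) / var for v in values)
    return feature, mu, var, k


def run_cascade(rois: Sequence[Roi], stages: Sequence[Stage]) -> List[Roi]:
    """Apply the threshold test one feature at a time; each stage only
    evaluates its feature on the ROIs kept by the previous stages."""
    kept = list(rois)
    for feature, mu, var, k in stages:
        # "<=" rather than "<" so that the extreme positive sample that
        # defines K is itself kept.
        kept = [r for r in kept if abs(feature(r) - mu) / var <= k]
    return kept


# Toy data: five positive ROIs described by two made-up features.
positives = [{"size": s, "var": v}
             for s, v in zip([20, 22, 24, 26, 28], [100, 110, 120, 130, 140])]
stages = [fit_stage(lambda r: r["size"], positives),
          fit_stage(lambda r: r["var"], positives)]
candidates = [{"size": 21, "var": 125}, {"size": 60, "var": 120},
              {"size": 27, "var": 10}, {"size": 24, "var": 138}]
print(run_cascade(candidates, stages))  # keeps the first and last candidate
```

Because every stage filters the list before the next feature is evaluated, cheap and selective features placed first reduce the work left for the later, more expensive ones.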

Fig. 5: The conceptual framework of the cascade classifier.

To build an effective cascade classifier, one has to consider two things: the computation time to obtain each feature, and the performance of each feature. The first is multiplied by the number of regions the current stage has to analyze. The latter defines the number of regions the following stages will have to process. As our objective is to minimize this time while obtaining the greatest performance possible, the sequence of features analyzed was: window size, profile variance, left and right valleys and, optionally, pairs. Some timing and classification rate results to support this choice can be found in section III.

Note that the left and right valleys and the profile variance can be calculated within the same iterations. Although this would reduce the average time to compute those features, all of them would then have to be computed for the same set of regions. This would reduce the flexibility of the cascade classifier.

III. RESULTS

In this section, three sets of results are presented. Most of them refer to the classification rate, defined here as the fraction of false positive regions removed with no effect on the positive samples, unless otherwise noted. The first is the classification rate using different values of K (refer to section II-D for the definition of K). The second is the classification rate for the optimum K. The third is the computation time needed to compute each feature.

A. Profile Valleys

Figure 6 shows the number of elements removed by using different K values. Note that even though the classification used was somewhat rough, the profile valleys removed 1.9% of the false positives with the optimum K value. Moreover, the time needed to compute this feature is about 3.4 µs/frame.

Fig. 6: Classification performance for the right and left valley features while using fractions of the optimum K.

B. Profile Variance

This feature provides a good classification rate compared to the others. It removed 34.4% of the false positive samples without misclassifying any positive samples. The time needed to compute this feature is 14.9 µs/frame. Figure 7 shows this feature's removal rate for various fractions of the optimum K. Note that the false positive curve in this figure descends more slowly than the curves for the other features.

Fig. 7: Classification performance for the profile variance feature while using fractions of the optimum K.
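The one-pass computation behind the profile variance feature (section II-C) can be sketched as follows. The loop accumulates the sum and the sum of squares of the profile and combines them at the end through $\sigma^2 = E[X^2] - (E[X])^2$. The function name and the plain-list input are assumptions for illustration only.

```python
from typing import Iterable


def profile_variance(profile: Iterable[float]) -> float:
    """Population variance of a profile, computed in a single pass."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for value in profile:
        n += 1
        total += value
        total_sq += value * value
    if n == 0:
        raise ValueError("empty profile")
    mean = total / n
    # E[X^2] - (E[X])^2
    return total_sq / n - mean * mean


print(profile_variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 4.0
```

For 8-bit pixel values and profiles a few dozen pixels long this form is accurate enough; with much larger magnitudes the subtraction can lose precision, and a two-pass or incremental formula would be the safer choice.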


C. ROI Size

Figure 8 shows this feature's removal rate for various values of the parameter K. The ROI size can be classified without additional computation, meaning the time required to compute this feature is negligible, i.e., 3.4 µs/frame. For the optimum K, it removed 14.98% of the false positives.

Fig. 8: Classification performance for the ROI size feature while using fractions of the optimum K.

D. Pairs

This feature alone allows the system to classify 57.5% of the false positives correctly. It also removes 64.5% of the positive samples. Also note that this feature takes more time to compute, as shown in Tables I and II, and should preferably be used as the last stage of the cascade.

E. Cascade Classifier

Figure 9 shows the misclassification rate for each class using different K values. Table I presents the performance for each feature individually and for the cascade classifier built.

TABLE I: Classification rate using each individual feature (lines 1-4, 6), and the classification rate using the cascade classifiers, in which line 5 includes the features in lines 1-4 and line 7 includes all 5 features. The last column shows the computation time needed for each task.

Line  Analysis           False Positives Removed  Time/Frame
1     Left Valley         0.0000%                 3.4 µs (single entry for lines 1-3)
2     Right Valley        1.9556%
3     Window Size        14.9776%
4     Light Std. Dev.    34.4415%                 14.9 µs
5     Cascade            49.7527%                 22.9 µs
6     Pairs (1)          57.4945%                 15.9 µs
7     Cascade w/ Pairs   83.2969%                 27.4 µs

(1) As mentioned in section III-D, it also removes 64.5% of the positive samples.

Fig. 9: Classification performance for the cascade classifier while using fractions of the optimum K. The cascade classifier includes the right and left valley features, the ROI size and the profile variance only.

Considering that the detected ROIs have already passed a classification stage, the further classification proposed was able to remove a large portion of the incorrectly detected regions with quite good speed. Table II shows the computation time needed for each task within the system. It can be seen there that the cascade classifier and even the pairs feature perform extremely fast compared to the detection phase. The only thing keeping this whole method from performing in real time is the detection algorithm, which was slow because of the scaling parameter, as commented in section I.

TABLE II: Time spent by each stage while analyzing all 431 frames.

Stage         Time
Load video    5.76 s
Detect ROIs   529.05 s
Cascade       9.87 ms
Pairs         6.85 ms
Total         540.07 s

As there were 431 frames being analyzed, each frame took approximately 1.25 seconds to be fully processed, which translates into 0.8 fps. This is far from the 15 fps expected, but the processing time can be reduced by adjusting the scaling parameter mentioned before. Note though that the proposed cascade classifier only took about 0.0018% of the total time (9.87 ms out of 540.07 s), which means that this is indeed a fast method. The exception was the pairs feature, which alone took roughly another 0.0013% of the computation time.

IV. DISCUSSION

In this paper, some features to describe eyes were presented. The classification rate of those features was good considering the rough classification method used. This classification was derived from those made for normal distributions, where 99.7% of the data is expected to lie within the $3\sigma$ range. This way, the parameter K was varied to find this region.


But this does not guarantee an optimum classification rate. It only provides a good approximation to an optimum K. This way, the results obtained can possibly be improved by the use of a more complex pattern recognition analysis, one that can make use of the multi-dimensional property of the system. Such methods were avoided this time because computing time was the main concern. Finding different features and/or non-linear transformations between the features presented may be a good solution. A promising non-linear relation was already found between the left valley and the right valley features, and it will be extensively tested in future work.

The computation time was one of the main concerns in this research. Some other features were tested, but they were left out of this report due to their computational cost.

Intuitively, the profile valleys seem to be excellent features for classification; however, they were not as good as expected. The reason for this is that the positioning of the detected region is not precise. This comes from the way the dataset was used to train the detector and cannot be easily modified. A possible approach to deal with this is to use some technique to better position the region before computing the valleys. This could possibly improve the profile variance performance too.

The inclusion of the pairs feature in the cascade classifier is also somewhat troublesome, since it removes a lot of the positive samples we are aiming to identify. This is thought to be a robustness limitation of the detection algorithm. Because of this lack of robustness, the scaling parameter had to be set in such a way that all the eyes in the frame were detected. But by doing so, many false regions were detected. This way, it is hard to find regions, false or not, that do not have any ROI within their annular region. With a more robust detection algorithm, this feature can become more useful.

V. CONCLUSION

The task of improving a detection algorithm was found to be extremely hard.
It can be described solely as a search for good features that preferably complement those used by the detection algorithm. That gets even harder when the algorithm does not use those features explicitly, as in [7], but lets another algorithm choose them, as in [8]. Nevertheless, the purpose of this work was to find a fast and fairly good classification method to be applied in surveillance videos. The 49.8% classification rate, obtained while taking 9.87 ms to classify 9405 ROIs, was enough to achieve those objectives.

In this paper, four new features have been proposed to further classify the detected eye regions, which were previously identified by the OpenCV implementation of the Viola-Jones eye detection algorithm. The possible eye regions were classified into two classes: positive and false positive. Those features are lightweight and easily computed. They rely mostly on information already present in the region's header or on information found in the ROI's profile. A cascade classifier that applies each individual feature sequentially is proposed, which can further improve the detection rate for eye regions.

The proposed method is simple, fast, and effective. We demonstrated that our method can improve on the Viola-Jones method (93.7% classification rate), reaching a classification rate of 96.8% when using the cascade classifier with left and right valleys, ROI size and profile variance, and 98.9% when using that cascade classifier plus the pairing heuristic. Although we applied the method to improve face matching in surveillance videos, it can be easily adapted to other applications that involve automatic face recognition or eye detection.

REFERENCES
[1] M.-H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 1, pp. 34-58, Jan. 2002.
[2] G. Yang and T. S. Huang, "Human face detection in a complex background," Pattern Recognition, vol. 27, no. 1, pp. 53-63, 1994.
[3] R. Kjeldsen and J. Kender, "Finding skin in color images," in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Oct. 1996, pp. 312-317.
[4] A. Lanitis, C. Taylor, and T. Cootes, "Automatic face identification system using flexible appearance models," Image and Vision Computing, vol. 13, no. 5, pp. 393-401, 1995, 5th British Machine Vision Conference.
[5] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, Jan. 1991.
[6] A. F. Abate, M. Nappi, D. Riccio, and G. Sabatino, "2D and 3D face recognition: A survey," Pattern Recogn. Lett., vol. 28, no. 14, pp. 1885-1906, Oct. 2007.
[7] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 23-38, 1998.
[8] P. A. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, May 2004.
[9] C. Zhang and Z. Zhang, "A survey of recent advances in face detection," Microsoft Research, One Microsoft Way, Redmond, WA 98052, Technical Report MSR-TR-2010-66, June 2010.
[10] D. W. Hansen and Q. Ji, "In the eye of the beholder: A survey of models for eyes and gaze," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 478-500, Mar. 2010.
[11] B. Kroon, A. Hanjalic, and S. M. Maas, "Eye localization for face matching: is it always useful and under what conditions?" in Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, ser. CIVR '08. New York, NY, USA: ACM, 2008, pp. 379-388.
[12] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[13] R. Kumar, S. Raja, and A. Ramakrishnan, "Eye detection using color cues and projection functions," 2002, pp. 337-340.
[14] N. A. Dodgson, "Variation and extrema of human interpupillary distance," in Proceedings of SPIE: Stereoscopic Displays and Virtual Reality Systems, vol. 5291, San Jose, CA, 2004, pp. 36-46.
[15] T. Moriyama, J. Xiao, J. Cohn, and T. Kanade, "Meticulously detailed eye model and its application to analysis of facial image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 738-752, May 2006.
[16] Q. Wang and J. Yang, "Eye detection in facial images with unconstrained background," Journal of Pattern Recognition Research, vol. 1, no. 1, pp. 55-62, Jan. 2006.
[17] X. Tang, Z. Ou, T. Su, H. Sun, and P. Zhao, "Robust precise eye location by AdaBoost and SVM techniques," in Advances in Neural Networks, LNCS 3497, vol. 3497, 2005, pp. 93-98.
[18] S. Kawato and J. Ohya, "Two-step approach for real-time eye tracking with a new filtering technique," in Proceedings of 2000 IEEE International Conference on Systems, Man and Cybernetics, vol. 2, 2000, pp. 1366-1371.
[19] Z. Wencong, C. Hong, Y. Peng, and Z. Zhenquan, "Precise eye localization with AdaBoost and fast radial symmetry," in Proceedings of 2006 IEEE International Conference on Computational Intelligence and Security, vol. 1, 2006, pp. 725-730.
