
Explanation-based Facial Motion Tracking Using a Piecewise Bezier Volume Deformation Model

Hai Tao and Thomas S. Huang, Image Processing and Formation Laboratory, Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801
Abstract
Capturing real motions from video sequences is a powerful method for automatically building facial articulation models. In this paper, we propose an explanation-based facial motion tracking algorithm based on a piecewise Bézier volume deformation model (PBVD). The PBVD is a suitable model both for the synthesis and the analysis of facial images. It is linear and independent of the facial mesh structure. With this model, basic facial movements, or action units, are interactively defined. By changing the magnitudes of these action units, animated facial images are generated. The magnitudes of these action units can also be computed from real video sequences using a model-based tracking algorithm. However, in order to customize the articulation model for a particular face, the predefined PBVD action units need to be adaptively modified. In this paper, we first briefly introduce the PBVD model and its application in facial animation. Then a multiresolution PBVD-based motion tracking algorithm is presented. Finally, we describe an explanation-based tracking algorithm that takes the predefined action units as the initial articulation model and adaptively improves them during the tracking process to obtain a more realistic articulation model. Experimental results on PBVD-based animation, model-based tracking, and explanation-based tracking are shown in this paper.

In this paper, however, our focus is the facial deformation model, which represents the dynamics of a face. Four categories of facial deformation models have been proposed: parameterized models [2], physical muscle models [3], free-form deformation models [4], and performance-driven animation models [5]. In analysis, these models are applied as constraints that regulate the facial movements.


1 Introduction
Recently, great efforts have been made to integrate computer vision and computer graphics techniques in the areas of human-computer interaction, model-based video conferencing, visually guided animation, and image-based rendering. A key element in these vision-graphics systems is the model. A model provides the information describing the geometry, the dynamics, and many other attributes of an object; it represents the a priori knowledge and imposes a set of constraints for analysis [1]. Among many applications, the analysis and synthesis of facial images is a good example that demonstrates the close relationship between graphics and vision technologies. As shown in Figure 1, a model-based facial image communication system involves three major parts: (a) the analyzer, or the motion generator; (b) the synthesizer that renders the facial image; and (c) the transmission channel that efficiently communicates between (a) and (b). The system relies on an important component, the facial model. Both the geometric and the deformation representations are equally important components in facial modeling. We have developed a system to obtain a 3D mesh model of a face from 3D CyberWare scanner data (Figure 2).

Figure 1: A facial image communication system.

In this paper, a new free-form facial deformation model called piecewise Bézier volume deformation (PBVD) is proposed. Some properties of this model, such as its linearity and its independence of the mesh structure, make it a suitable choice for both realistic facial animation and robust facial motion analysis. The differences between this approach and Kalra's method [4] are twofold. First, by using nonparallel volumes, irregular 3D manifolds are formed. As a result, fewer deformation volumes are needed and the number of control points is reduced, which is a desired property for tracking algorithms. Second, because it is built on facial feature points, this model is mesh independent and can easily be adapted to articulate any face model. Using the PBVD model, a facial animation system, a model-based facial motion tracking algorithm, and an explanation-based tracking algorithm are presented. These algorithms have been successfully implemented in several applications, including video-driven facial animation, lip motion tracking, and real-time facial motion tracking. The remaining sections are organized as follows: Section 2 introduces the PBVD model and the PBVD-based


animation system. Section 3 describes a PBVD model-based tracking algorithm. Explanation-based tracking is then described in Section 4. Some experimental results are demonstrated in Section 5, followed by discussions and concluding remarks in Section 6.



Figure 3: A Bézier volume and a part of the face mesh.

Figure 2: A facial mesh model derived from the CyberWare scanner data. Left: the mesh model. Right: the texture-mapped model.

2 PBVD model

2.1 PBVD formulation and properties
A 3D Bézier volume [10] is defined as

$$\mathbf{x}(u,v,w) = \sum_{i=0}^{n}\sum_{j=0}^{m}\sum_{k=0}^{l} \mathbf{b}_{i,j,k}\,B_i^n(u)\,B_j^m(v)\,B_k^l(w), \qquad (1)$$

where x(u, v, w) is a point inside the volume, which, in our case, is a facial mesh point. The variables (u, v, w) are the control parameters ranging from 0 to 1, b_{i,j,k} are the control points, and B_i^n(u), B_j^m(v), and B_k^l(w) are the Bernstein polynomials. By moving each control point b_{i,j,k} by an amount d_{i,j,k}, the displacement of the facial mesh point x(u, v, w) is

$$\mathbf{d}(u,v,w) = \sum_{i=0}^{n}\sum_{j=0}^{m}\sum_{k=0}^{l} \mathbf{d}_{i,j,k}\,B_i^n(u)\,B_j^m(v)\,B_k^l(w). \qquad (2)$$
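As a concrete illustration of Equations (1)-(2), the short numpy sketch below evaluates the Bernstein weights of one Bézier volume at a mesh point's (u, v, w) parameters and applies a set of control-point displacements. The volume order (2, 2, 1) and the sample parameters are made-up values; the paper does not list them.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein polynomial B_i^n(t)."""
    return comb(n, i) * (t ** i) * ((1.0 - t) ** (n - i))

def weight_row(u, v, w, order=(2, 2, 1)):
    """Bernstein weights of every control point of one Bezier volume,
    evaluated at a mesh point's parameters (u, v, w) in [0, 1]^3."""
    n, m, l = order
    return np.array([bernstein(n, i, u) * bernstein(m, j, v) * bernstein(l, k, w)
                     for i in range(n + 1)
                     for j in range(m + 1)
                     for k in range(l + 1)])

# Hypothetical (u, v, w) parameters of two embedded mesh points.
B = np.vstack([weight_row(0.2, 0.5, 0.0), weight_row(0.7, 0.3, 1.0)])

# d_{i,j,k}: displacement of every control point (3*3*2 = 18 points, xyz each).
D = np.zeros((18, 3))
D[4] = [0.0, -0.5, 0.1]        # move one control node

V = B @ D                      # Equation (2) evaluated for both mesh points at once
```

Stacking one such weight row per embedded mesh point gives exactly the matrix B used in Equation (3) below.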

Figure 3 shows the deformation of a Bézier volume that contains a part of the facial mesh. In order to deform the face, multiple Bézier volumes are formed to embed all the deformable parts. These volumes are formed based on facial feature points such as eye corners, mouth corners, etc. Each Bézier volume contains two layers, the external layer and the internal layer, which together form the volume that contains the face mesh. Normal vectors of the facial feature points are used to form these volumes. To ensure continuity in the deformation process, neighboring Bézier volumes are of the same order along their borders. In other words, there are the same number of control points on each side of a boundary. The piecewise Bézier volume structure used in our implementation is shown in Figure 4. Using this model, facial regions with similar motions are controlled by a single volume, and different volumes are connected so that the smoothness between regions is maintained.

Figure 4: The face mesh and the 16 Bézier volumes.

Once the PBVD model is constructed, for each mesh point on the face model, its corresponding Bernstein polynomials are computed. The deformation can then be written in matrix form as

$$\mathbf{V} = \mathbf{B}\mathbf{D}, \qquad (3)$$

where V contains the nodal displacements of the mesh points and D contains the displacement vectors of the Bézier volume control nodes. The matrix B describes the mapping function composed of Bernstein polynomials. Manipulating the control points through an interactive tool produces various desired expressions, visual speech, or action units. Figure 5 illustrates the real control mesh and the rendered expression smile. At each moment, the nonrigid motion of a face may be modeled as a linear combination of different expressions or visemes (visual phonemes), or formally

$$\mathbf{V} = \mathbf{B}\,[\mathbf{D}_0\;\mathbf{D}_1\,\cdots\,\mathbf{D}_m]\,[p_0\;p_1\,\cdots\,p_m]^T = \mathbf{B}\mathbf{D}\mathbf{P} = \mathbf{L}\mathbf{P}, \qquad (4)$$

where D_i is an expression or a viseme and p_i is its corresponding intensity. The overall motion of the face is

$$\mathbf{R}(\mathbf{V}_0 + \mathbf{L}\mathbf{P}) + \mathbf{T}, \qquad (5)$$

where V_0 is the neutral facial mesh, R is the rotation determined by the three rotation angles $(\omega_x, \omega_y, \omega_z)^T = \mathbf{W}$, and T is the 3D translation.
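The following sketch composes Equations (4) and (5) for one frame. It assumes the per-unit mesh displacement fields B·D_i have already been evaluated (here they are random toy arrays) and that the rotation angles follow an X-Y-Z Euler convention, which the paper does not specify.

```python
import numpy as np

def rotation_matrix(wx, wy, wz):
    """Rotation from three angles in radians; X-Y-Z order is an assumption."""
    cx, sx, cy, sy, cz, sz = np.cos(wx), np.sin(wx), np.cos(wy), np.sin(wy), np.cos(wz), np.sin(wz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def pose_face(V0, unit_fields, p, W, T):
    """Equations (4)-(5): nonrigid deformation as a linear combination of
    action-unit displacement fields, followed by the rigid motion."""
    V0 = np.asarray(V0)                                   # neutral mesh, (num_nodes, 3)
    nonrigid = sum(pi * Di for pi, Di in zip(p, unit_fields))   # L @ P evaluated on the mesh
    return (rotation_matrix(*W) @ (V0 + nonrigid).T).T + np.asarray(T)

# Toy usage: two action units on a three-node mesh.
V0 = np.zeros((3, 3))
unit_fields = [np.random.randn(3, 3) * 0.01, np.random.randn(3, 3) * 0.01]
posed = pose_face(V0, unit_fields, p=[0.8, 0.2], W=(0.0, 0.1, 0.0), T=(0.0, 0.0, 50.0))
```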



3 PBVD model-based tracking algorithm


3.1 Video analysis of the facial movements

Figure 5: (a) The PBVD volumes (b) the expression smile.

2.2 PBVD-based facial animation


Based on the PBVD model, facial action units are constructed using an interactive tool. Then various expressions and visemes are created either from combining these action units or from moving some control nodes. We have created 23 visemes manually to implement a talking head system. Six of them are shown in Figure 6. In addition to these visemes, six universal expressions have also been created.

Several algorithms for extracting face motion information from video sequences have been proposed [6]-[9]. Most of these methods are designed to detect action-unit-level animation parameters. The assumption is that the basic deformation model is already given and will not be changed. In this section, we propose a tracking algorithm in the same flavor using the PBVD model. The algorithm adopts a multiresolution framework to integrate the low-level motion field information with the high-level deformation constraints. Since the PBVD model is linear, an efficient optimization process using a least squares estimator is formulated to incrementally track the head poses and the facial movements. The derived motion parameters can be used for facial animation, expression recognition, or bimodal speech recognition.

3.2 Model-based tracking using the PBVD model

The changes of the motion parameters between two consecutive video frames are computed based on the motion field. The algorithm is shown in Figure 8. We assume that the camera is stationary.

Figure 6: Expressions and visemes created using the PBVD model. The expressions and visemes (bottom row) and their corresponding control meshes (top row). The facial movements are, from left to right, neutral, anger, smile, vowel-or, and vowel-met.


Figure 8: Block diagram of the model-based PBVD tracking system.

At the initialization stage, the face needs to be in an approximately frontal view so that the generic 3D model can be fitted. The inputs to the fitting algorithm are the positions of facial feature points, which are manually picked. All motion parameters are set to zero (i.e., (T_0, W_0, P_0) = 0), which means a neutral face is assumed. The camera parameters are known in our implementation. Otherwise, a small segment of the video sequence should be used to estimate these parameters using structure-from-motion techniques.

Figure 7: An animation sequence with smile and the speech "I am George McConkie."

Once visemes and expressions are created, animation sequences can be generated by assigning appropriate values to the magnitudes of these visemes and expressions at each time instance. In our implementation, a human subject speaks into a microphone according to a script. The phoneme segmentation is then obtained using a speech recognition tool. Based on the phoneme segmentation results, mouth shapes and expressions are computed using a coarticulation model similar to [11]. Audio and animation results are then synchronized to generate realistic talking head sequences. Figure 7 shows some frames from a synthetic visual speech sequence.
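A toy version of this pipeline is sketched below: it turns a phoneme segmentation into per-frame viseme intensity vectors using simple linear ramps. The ramps are only a crude stand-in for the coarticulation model of [11], and the segment data and viseme indices are invented.

```python
import numpy as np

def viseme_tracks(segments, num_visemes=23, fps=30.0, ramp=0.05):
    """Per-frame viseme magnitudes from a phoneme segmentation.
    `segments` is a list of (viseme_index, start_sec, end_sec); each segment
    contributes a trapezoidal intensity profile with `ramp`-second transitions."""
    duration = max(end for _, _, end in segments)
    num_frames = int(np.ceil(duration * fps)) + 1
    P = np.zeros((num_frames, num_visemes))
    t = np.arange(num_frames) / fps
    for v, start, end in segments:
        rise = np.clip((t - start) / ramp, 0.0, 1.0)
        fall = np.clip((end - t) / ramp, 0.0, 1.0)
        P[:, v] = np.maximum(P[:, v], np.minimum(rise, fall))
    return P

# Hypothetical segmentation; the viseme indices are made up.
P = viseme_tracks([(3, 0.00, 0.20), (7, 0.20, 0.45)])
# Each row of P can feed Equation (4) as the intensity vector for that frame.
```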


Then, from the video frames n and n+1, the 2D motion vectors of many mesh nodal points are estimated using the template matching method. In our implementation, the template for each node contains 11 x 11 pixels and the searching region for each node is an area of 17 x 17 pixels. To alleviate the shifting problem, the templates from the previous frame and the templates from the initial frame are used together. For example, the even nodes of a patch are tracked using the templates from the previous frame and the odd nodes are tracked using those of the initial frame. Our experiments showed that this approach is very effective. From these motion vectors, 3D rigid motions and nonrigid motions (intensities of expressions/visemes or action units) are computed simultaneously using a least squares estimator. Since the PBVD model is linear, only the perspective projection and the rotation introduce nonlinearity. This property makes the algorithm simpler and more robust. The 2D interframe motion for node i is

$$d\hat{\mathbf{V}}_{2D,i} = \mathbf{M}_i\left(d\mathbf{T} + \mathbf{G}_i\,d\mathbf{W} + [\mathbf{R}\mathbf{L}]_i\,d\mathbf{P}\right), \qquad (6)$$

where dV_{2D,i} is the 2D interframe motion of the node. The projection matrix M_i is

$$\mathbf{M}_i = \frac{fs}{z}\begin{bmatrix} 1 & 0 & -x/z \\ 0 & 1 & -y/z \end{bmatrix}, \qquad (7)$$

where f is the focal length of the camera, s is the scale factor, and z is the depth of the mesh node. The vector (x, y, z) represents the 3D mesh nodal position after both the rigid and the nonrigid deformation, i.e., R(V_0 + LP) + T. The vector (x_s, y_s, z_s) represents the 3D mesh nodal position after only the nonrigid deformation, without translation and rotation (i.e., V_0 + LP). Matrix G_i and [RL]_i denote the i-th row of the matrix G and of the matrix RL, respectively. An over-determined system is formed because many 2D interframe motion vectors are calculated. As a result, the changes of the motion parameters (dT, dW, dP) can be estimated using a least squares estimator. By adding these changes to the previously estimated motion parameters, the motion parameters for the new frame are obtained.
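Given per-node 2D motion vectors and the corresponding Jacobian blocks, the least squares step can be sketched as below, assuming the per-node linearization of Equation (6); the array shapes and the variable names are hypothetical.

```python
import numpy as np

def estimate_parameter_changes(d_v2d, M_rows, G_rows, RL_rows):
    """Stack one 2-row block per tracked node and solve the over-determined
    system for (dT, dW, dP) with least squares.
    d_v2d:   (N, 2) measured 2D interframe motion vectors
    M_rows:  N projection Jacobians, each 2x3
    G_rows:  N rotation Jacobians, each 3x3
    RL_rows: N blocks of R @ L for that node, each 3 x num_units"""
    A_blocks, b_blocks = [], []
    for d, M, G, RL in zip(d_v2d, M_rows, G_rows, RL_rows):
        # Unknown vector ordering: [dT (3) | dW (3) | dP (num_units)].
        A_blocks.append(M @ np.hstack([np.eye(3), G, RL]))
        b_blocks.append(d)
    A = np.vstack(A_blocks)
    b = np.hstack(b_blocks)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3:6], x[6:]            # dT, dW, dP
```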

3.3 Multiresolution framework

Two problems with the above algorithm are the expensive template matching computation and the noisy motion vectors it derives. The first problem is obvious, because the computational complexity for each motion vector is about (17 x 17) x (11 x 11) x 3 = 104,907 integer multiplications. The second problem is partially caused by the fact that, in the above algorithm, the computation of the motion field is totally independent of the motion constraints, which makes it vulnerable to various noises. If the lower-level motion field measurements are too noisy, a good estimation of the motion parameters can never be achieved, even with correct constraints. A multiresolution framework is proposed in this section to partially solve these problems. The block diagram of this new algorithm is illustrated in Figure 9. An image pyramid is formed for each video frame. The algorithm proposed in the previous section is then applied to the consecutive frames sequentially, from the lowest resolution to the original resolution. For the diagram depicted in Figure 9, the changes of the motion parameters are first computed in the quarter-resolution images as (dT^(0), dW^(0), dP^(0)). By adding these changes to (T_n, W_n, P_n), the estimated new motion parameters are derived as (T_n^(1), W_n^(1), P_n^(1)). Similarly, changes of the motion parameters are computed in the half-resolution images as (dT^(1), dW^(1), dP^(1)), based on the previous motion parameter estimates (T_n^(1), W_n^(1), P_n^(1)). This process continues until the original resolution is reached.

Figure 9: The multiresolution PBVD tracking algorithm.

In this coarse-to-fine algorithm, the motion vector computation can be achieved with smaller searching regions and smaller templates. In our implementation, for each motion vector, the number of multiplications is [(5 x 5) x (7 x 7) x 3] x 4 = 14,700, which is about seven times fewer than in the model-based scheme.


A more important property of this method is that, to some extent, this coarse-to-fine framework integrates the motion vector computation with the high-level constraints. The computation of the motion parameter changes at each level is based on the approximate motion parameters obtained from the lower-resolution images. As a result, more robust tracking results are obtained.
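A minimal sketch of the coarse-to-fine loop is given below. The `estimate_changes` callback stands in for one pass of the Section 3.2 estimator at a given image scale; the three-level pyramid (quarter, half, full resolution) follows Figure 9, while the averaging filter is our own simplification.

```python
import numpy as np

def downsample(img):
    """Halve resolution by 2x2 averaging (a stand-in for a proper pyramid filter)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def build_pyramid(img, levels=3):
    pyr = [np.asarray(img, dtype=np.float64)]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr[::-1]                       # coarsest level first

def coarse_to_fine(frame_n, frame_n1, params, estimate_changes):
    """Refine (T, W, P) from quarter resolution up to full resolution.
    `params` holds numpy arrays; `estimate_changes(img_a, img_b, params, scale)`
    stands for one pass of the model-based estimator at that image scale."""
    T, W, P = params
    for level, (a, b) in enumerate(zip(build_pyramid(frame_n), build_pyramid(frame_n1))):
        scale = 2.0 ** (level - 2)         # 0.25, 0.5, 1.0
        dT, dW, dP = estimate_changes(a, b, (T, W, P), scale)
        T, W, P = T + dT, W + dW, P + dP   # accumulate the per-level corrections
    return T, W, P
```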


3.4 Confidence measurements

Two quantities are computed for each frame as confidence measurements. The average normalized correlation Q_c is computed over the nodes that use the templates from the initial video frame; if the tracker fails, this quantity is small. The average LSE fitting error Q_e indicates the tracking quality; when Q_e is large, the motion field and the fitted model are not consistent. Q_c and Q_e are closely related: when Q_c is small, which means the matching has a low score, Q_e is large. However, a large Q_e does not necessarily imply a small Q_c, because the problem could be that the model itself is not correct. In our implementation, we use the confidence measurement J = Q_c / Q_e to monitor the status of the tracker. When J is smaller than a certain threshold, a face detection algorithm is initiated to find the approximate location of the face, and the tracking algorithm then continues.
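The confidence logic can be sketched as follows; the threshold value is invented, since the paper does not report the one it uses.

```python
import numpy as np

def tracker_confidence(ncc_scores, fit_residuals, threshold=0.5):
    """Combine the two per-frame quality measures described above.
    Q_c: mean normalized correlation against the initial-frame templates.
    Q_e: mean least squares fitting error.  The threshold is a made-up value."""
    q_c = float(np.mean(ncc_scores))
    q_e = float(np.mean(fit_residuals)) + 1e-8   # avoid division by zero
    J = q_c / q_e
    needs_reinit = J < threshold                 # trigger face re-detection
    return q_c, q_e, J, needs_reinit
```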

Figure 10: The block diagram for the explanation-based tracking algorithm.

4.2 Learning expressions/visemes or action units


The learning of D is based on the model-based analysis of a video segment. As shown in Figure 11, the predefined action units D are first used in a model-based method to track n frames; the tracking error is then analyzed and the deformation D is modified. The new D is then used as the model for the next segment. This process is repeated throughout the entire tracking process.

4 Explanation-based motion tracking


The model-based approach is powerful because it dramatically reduces the solution search space by imposing domain knowledge as constraints. However, if the model is oversimplified, or is not consistent with the actual dynamics, the results are prone to errors. To be able to learn new facial motion patterns while not losing the benefits of the model-based approach, a new tracking algorithm called the explanation-based method is proposed in this section.


Figure 11: Learning of the expressions/visemes or action units (D) in a video sequence.

4.1 Approach
We use the term explanation-based to describe the strategy that starts from a rough domain theory and then incrementally elaborates the knowledge by learning new events. The existing domain theory provides a possible explanation of each new event, and the learning algorithm explores the new data to adjust the domain theory. For a PBVD tracking algorithm, the predefined action units provide an explanation for the computed motion vectors. The fitting error, which is the combination of noise and model error, is then analyzed to adjust the model. A block diagram is shown in Figure 10. It is essentially the same as the block diagram for the model-based PBVD method except for the block that adjusts the nonrigid motion model L. The model L includes two parts, B and D, and both of them can be adjusted. Changing D means changing the displacement vectors of the control nodes for each action unit so that the model fits the data better. Modifying B means modifying the Bézier volumes so that the descriptive power of the model is enhanced. In this paper, we discuss the learning of D.

To adjust D, for each frame the 2D motion residuals are projected back to 3D nonrigid motion according to the 3D model and an in-surface motion assumption. The fitting error for any mesh node can be written as

$$\mathbf{V}_{err2D} = \mathbf{M}\,\mathbf{R}\,(a\,\mathbf{u}_1 + b\,\mathbf{u}_2), \qquad (9)$$

where V_err2D is the 2D fitting error in the LSE equation, M is the perspective projection matrix, and R is the rotation matrix. The vectors u_1 and u_2 are the 3D vectors that span the tangent plane of the investigated facial mesh node; they can be determined from the normal vector of that mesh node. From Equation 9, a and b can be solved and the 3D residual V_err = a u_1 + b u_2 is derived. For each frame, the projected 3D motion is

$$\mathbf{V}_{new} = \mathbf{L}\tilde{\mathbf{P}} + \alpha\,\mathbf{V}_{err}, \qquad (10)$$


where L is the previous PBVD model, $\tilde{P}$ is the vector representing the nonrigid motion magnitudes, and α is the learning rate. The term L$\tilde{P}$ is the fitted part of the motion. The vector V_new is collected for each frame in a video segment. At the end of that segment, the adjustment of D is performed. Adjusting D is equivalent to adjusting the D_i (Equation 4).
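Assuming Equations (9) and (10) take the forms given above, the per-node back-projection is a small 2x2 least squares problem and the per-frame target motion is a simple blend; the sketch below uses hypothetical array shapes.

```python
import numpy as np

def backproject_residual(v_err_2d, M, R, u1, u2):
    """Solve Equation (9) for (a, b) in the least squares sense and return the
    3D residual constrained to the node's tangent plane, V_err = a*u1 + b*u2."""
    A = M @ R @ np.column_stack([u1, u2])            # 2x2 system
    ab, *_ = np.linalg.lstsq(A, np.asarray(v_err_2d, dtype=float), rcond=None)
    a, b = ab
    return a * np.asarray(u1) + b * np.asarray(u2)

def frame_target_motion(L, p_tilde, v_err, alpha=0.4):
    """Equation (10): fitted motion plus the learning-rate-weighted residual,
    both flattened over all mesh nodes.  alpha = 0.4 is the value reported in
    Section 5.3."""
    return L @ p_tilde + alpha * v_err
```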

5.1 Facial model initialization


The face in the first video frame needs to be in an approximately frontal view and with a neutral expression. From this frame, facial feature points are extracted manually. The 3D fitting algorithm is then applied to warp the generic model according to these feature points. However, since we only have 2D information, the whole warping process is performed in 2D except for the initial scaling. Once the geometric facial model is fitted, a PBVD model is automatically derived from the facial feature points and their corresponding normals.
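The paper does not spell out the 2D warping step, but a plausible first stage is a least squares 2D similarity transform (scale, rotation, translation) from the generic model's projected feature points to the manually picked image points, as sketched below; the parametrization and the helper name are our own.

```python
import numpy as np

def fit_similarity_2d(model_pts, image_pts):
    """Least squares 2D similarity transform mapping generic-model feature
    points onto manually picked image points.  Only a plausible stand-in;
    the paper does not describe its 2D warping in detail."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(np.asarray(model_pts, float), np.asarray(image_pts, float)):
        rows.append([x, -y, 1.0, 0.0]); rhs.append(u)    # u = a*x - b*y + tx
        rows.append([y,  x, 0.0, 1.0]); rhs.append(v)    # v = b*x + a*y + ty
    p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    a, b, tx, ty = p
    A = np.array([[a, -b], [b, a]])                      # scale * rotation
    return A, np.array([tx, ty])                         # point maps as A @ x + t
```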

An iterative algorithm is proposed in this paper. In each loop, only one D_i is modified; the others are fixed.
For example, in the first iteration only D_0 is adjusted. For each frame, we derive the 3D motion vector $\mathbf{V}_{r0} = \mathbf{V}_{new} - \mathbf{B}[\,\mathbf{0}\;\mathbf{D}_1 \cdots \mathbf{D}_m\,]\tilde{\mathbf{P}}$. A PCA analysis of V_r0 over a video segment is performed to extract the major motion patterns V_e1, V_e2, etc. The number of these patterns is decided by the energy distribution of the eigenvalues. A maximum number of patterns can also be imposed to avoid over-fitting. We assume these motions are due to some new deformation units D_0j.
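The segment-level PCA step can be sketched as follows; the 90% energy threshold and the cap of three patterns are example values, not the paper's.

```python
import numpy as np

def residual_motion_patterns(residuals, energy=0.9, max_patterns=3):
    """PCA over the per-frame 3D residual vectors collected in one segment.
    `residuals` has one flattened residual field per row (frames x 3*nodes).
    Patterns are kept until the cumulative eigenvalue energy passes the
    threshold, up to a cap that guards against over-fitting."""
    X = residuals - residuals.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    var = S ** 2
    cum = np.cumsum(var) / var.sum()
    k = min(int(np.searchsorted(cum, energy)) + 1, max_patterns)
    return Vt[:k]            # each row is one candidate motion pattern V_e,j
```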


To find the deformation unit that causes each eigenvector, an LSE problem is solved in which a smoothness constraint C regulates D_0j so that the motion of the PBVD control nodes is smooth.
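One way to realize this step is the stacked least squares system below, where the smoothness matrix C and its weight are placeholders; the paper only names the constraint C without giving its form.

```python
import numpy as np

def fit_deformation_unit(B, v_e, C, lam=1.0):
    """Solve for control-node displacements D0 such that B @ D0 reproduces the
    mesh-space pattern v_e while C @ D0 stays small (smoothness).
    Call once per spatial coordinate (x, y, z) of the pattern."""
    A = np.vstack([B, lam * C])                         # stacked LSE system
    rhs = np.concatenate([v_e, np.zeros(C.shape[0])])
    D0, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return D0
```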

Figure 13: Lip tracking for bimodal speech recognition.

5.2 PBVD model-based tracking


In the PBVD tracking algorithm, the choice of the deformation units D_i depends on the application. In a bimodal speech recognition application, 6 action units are used to describe the motions around the mouth. These action units are illustrated in Figure 12. The tracking result for each frame is twelve parameters, including the rotation, the translation, and the intensities of these action units.


5 Implementation and experimental results


The PBVD model has been implemented on an SGI ONYX machine with a VTX graphics engine. Real-time tracking at 10 frames/s has been achieved using the multiresolution framework. The tracker has also been used for bimodal speech recognition and bimodal emotion recognition. The explanation-based method has been implemented to improve the facial image synthesis.

Figure 12: The action units in bimodal speech recognition. Top: the control nodes. Bottom: the action units.

Figure 14: Tracking results of the real-time demo system. The two bars are Q_c (left) and Q_e (right), respectively.


For the bimodal emotion recognition and the real-time demo system, 12 action units are used. Actually, users can design any set of deformation units for the tracking algorithm. These deformations can be either at expression level or at action unit level. Lip tracking results are shown in Figure 13. Figure 14 shows the results of the real-time tracker. Facial animation sequences are generated from the detected motion parameters. Figure 15 shows the original video frame and the synthesized results. The synthesized facial model uses the initial video frame as the texture. The texture-mapped model is then deformed according to the motion parameters.


Acknowledgments
This work was supported in part by the Army Research Laboratory Cooperative Agreement No. DAAL01-960003.

References
[1] C. Kambhamettu, D. B. Goldgof, D. Terzopoulos, and T. S. Huang, "Nonrigid motion analysis," in Handbook of PRIP: Computer Vision, vol. 2. San Diego, CA: Academic Press, 1994, pp. 405-430.
[2] F. I. Parke, "Parameterized models for facial animation," IEEE Comput. Graph. and Appl., vol. 2, no. 9, pp. 61-68, Nov. 1982.
[3] Y. Lee, D. Terzopoulos, and K. Waters, "Realistic modeling for facial animation," in Proc. SIGGRAPH 95, 1995, pp. 55-62.
[4] P. Kalra, A. Mangili, N. M. Thalmann, and D. Thalmann, "Simulation of facial muscle actions based on rational free form deformations," in Proc. EUROGRAPHICS 92, Sept. 1992, pp. 59-69.
[5] L. Williams, "Performance-driven facial animation," in Proc. SIGGRAPH 90, Aug. 1990, pp. 235-242.
[6] H. Li, P. Roivainen, and R. Forchheimer, "3-D motion estimation in model-based facial image coding," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 6, pp. 545-555, June 1993.
[7] C. S. Choi, K. Aizawa, H. Harashima, and T. Takebe, "Analysis and synthesis of facial image sequences in model-based image coding," IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 3, pp. 257-275, June 1994.
[8] I. Essa and A. Pentland, "Coding, analysis, interpretation and recognition of facial expressions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 757-763, July 1997.
[9] D. DeCarlo and D. Metaxas, "The integration of optical flow and deformable models with applications to human face shape and motion estimation," in Proc. CVPR 96, 1996, pp. 231-238.
[10] T. W. Sederberg and S. R. Parry, "Free-form deformation of solid geometric models," in Proc. SIGGRAPH 86, 1986, pp. 151-160.
[11] D. W. Massaro and M. M. Cohen, "Modeling coarticulation in synthetic visual speech," in N. M. Thalmann and D. Thalmann (Eds.), Models and Techniques in Computer Animation, Tokyo: Springer-Verlag, 1993, pp. 139-156.

Figure 15: The original video frame (top) and the synthesized tracking results (bottom).


Figure 16: The synthesis results. (a) Original frame; (b) model-based; (c) explanation-based.

5.3 Explanation-based tracking


A set of predefined motion units is used as the initial deformation model. Then, these motion units are adjusted during the tracking. To compare the results, some generic motion units are used in the model-based method; the resulting synthesis is not convincing (Figure 16(b)). In Figure 16(c), the improved result of the explanation-based method is shown. In our implementation, the segment size is 20 frames and the learning rate is α = 0.4.

6 Discussion
In this paper, issues in generating facial articulation models from real video sequences are addressed. The three major contributions are the PBVD model and facial animation, PBVD model-based tracking, and explanation-based tracking. For future research, the emphasis will be on improving the low-level motion field calculation, combining the explanation-based approach with the multiresolution framework, and recognizing spatio-temporal patterns from the facial motion parameters.

