
Reading newsreaders

Good news or bad news revealed by the automatic analysis of eyebrow movements.
Davy Verbeek
s331442

HAIT Master Thesis series nr. 12-001

THESIS SUBMITTED IN PARTIAL FULFILLMENT


OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF ARTS IN COMMUNICATION AND INFORMATION SCIENCES,
MASTER TRACK HUMAN ASPECTS OF INFORMATION TECHNOLOGY,
AT THE SCHOOL OF HUMANITIES
OF TILBURG UNIVERSITY

Thesis committee:

Prof. dr. E.O. Postma


Prof. dr. M.G.J. Swerts
B. Joosten MA

Tilburg University
Faculty of Humanities
Department of Communication and Information Sciences
Tilburg center for Cognition and Communication (TiCC)
Tilburg, the Netherlands
January 2012

1
2
PREFACE
A lot of people give up just before they're about to make it.
You know you never know when that next obstacle is going to be the last one.
Chuck Norris

I guess that's the way it goes with everything in life. Every goal has its obstacles and overcoming
them is a part of reaching your goal. Life isn't always going to be easy or fast and often requires
thought and effort to overcome the problems that lay ahead. For me, graduation was no exception.
This thesis marks the end of my time here at Tilburg University. I was proud when I received my
Bachelor's degree for the Business Communication and Digital Media track, but becoming a Master
of Arts in the Human Aspects of Information Technology track has been my goal from the very
beginning. It took me 5.5 years to get to this point, slightly longer than the rest. I am proud to say
that I finally achieved my goal of becoming a master! I could say that the delay was merely caused by
obstacles and difficulties, and surely there were some, but let's be honest, most of it was caused by me.
During my time at Tilburg University I may have been a little lazy, but most of all I've had a lot of fun.
John Lennon once said: "Time you enjoy wasting, was not wasted," and let me tell you, I strongly
agree with him! I want to take this opportunity to thank all the people that helped me reach my goal
of becoming a master. First of all I want to thank Eric Postma for bearing with me for the past couple
of years. Guiding me with both my Bachelor's and my Master's thesis must have been quite a challenge.
I also want to thank Bart Joosten for his technical support and constant willingness to help me during
the writing of my thesis. Furthermore, during my time here at TiU two groups have played a vital
role. First, I want to thank the people from the T-gang for the fun times we had together and the
memories we created (Peace up T-town!). Second, I want to thank all the people from Chuck Norris
HQ for all the useless and useful discussions we had. Finally I want to thank my mom and dad and
brothers for always pushing me towards my goal even when I didn't want to. Now it is time to set a
new goal, with new obstacles to overcome with the help of friends and family. Thank you all!

Sincerely,
Davy Verbeek

3
SUMMARY
Communication is one of the most important aspects of life. Faces play an especially important role in
communication, so the ability to read and interpret facial expressions is essential. In this thesis
we focus on the emotional content of broadcast news items. Research conducted by Swerts and
Krahmer (2010) revealed that humans can guess the emotional content of a news item (i.e., bad
news versus good news) based on the facial expressions of newsreaders alone. State-of-the-art
digital recognition methods, such as the facial feature recognition method proposed by Joosten
(2011), may be able to replicate this ability.
The problem statement addressed in this thesis reads: Can we determine the emotional content of a
news item by analyzing the face of newsreaders using an automatic facial feature recognition
method?

To address this problem statement, an adapted version of the automatic facial feature recognition
method of Joosten (2011) has been applied to a collection of video clips of news items. In addition,
we searched for facial features that predict the emotional content of news items. We found the
vertical movements of the eyebrows to be predictive features. Using these features, we trained
classifiers to recognize the emotional content of news items. We trained four commonly used
classifiers: the k-nearest neighbor classifier, the decision tree classifier, the multilayer perceptron,
and the support vector machine. The classifiers were able to predict the emotional content of news
items with an accuracy of about 70%. The conclusion is that the emotional content of a news item
can be determined quite well using an automatic facial analysis method.

4
CONTENTS
Preface 3
Summary 4
Contents 5
Chapter 1: Introduction 6
1.1 Problem Statement and Research Questions 6
1.2 Related Work 7
1.3 Thesis Outline 7
Chapter 2: The FER Method 8
2.1 Face Detection 9
2.2 Facial Feature Extraction 9
2.2.1 Labeling Training Images 9
2.2.2 Creating the Shape and Appearance Models 9
2.2.3 Fitting Unseen Images 10
2.3 Determining the Feature 10
2.4 Classification 11
Chapter 3: Experimental Set-up 14
3.1 The Dataset 14
3.2 Experimental Procedure 14
3.2.1 Face Detection 15
3.2.2 Facial Feature Extraction 15
3.2.3 Classification 16
3.3 Evaluation 16
Chapter 4: The Results 17
4.1 Can we find a distinctive feature based on the movement of the eyebrows? 17
4.2 What classifier performs best in our classification task? 19
Chapter 5: General Discussion 22
5.1 The Task 22
5.2 The Experiment 23
Chapter 6: Conclusion and Future Work 25
6.1 Research Questions and Problem Statement 25
6.2 Application and Future Research 26
Literature 27

5
Chapter 1
Introduction

Human-computer interaction has become an important aspect of our daily lives and is still growing
on a daily basis. The computer has been given a permanent place in every household and has become
an important instrument in science. Recent developments are trying to take an extra step in creating a
computer with which we can actually have a real conversation. Even though computers are able to
register everything we say using speech recognition, communication will not be possible until
computers possess the ability to give meaning to the registered words. Humans are highly capable of
understanding and interpreting signals that give words meaning. We use our hands to depict what
we say, the pitch and melody of our voice to emphasize or clarify our words and our face to express
our attitude towards the information we are transferring (Cassell, 2001; Vinciarelli, Pantic, Boulard,
& Pentland, 2008). This is an aspect of communication with which the computer is not yet able to
deal. To fully realize natural human-computer interaction, realistic human-centered user interfaces
are needed which can respond naturally during communication with human users. To accomplish
this goal, these interfaces must possess the ability to recognize human social signals and social
behaviour (Vinciarelli, Pantic, Boulard, & Pentland, 2008; Zeng, Pantic, Roisman, & Huang, 2009;
Pantic, Nijholt, Pentland, & Huang, 2008; Pantic & Rothkrantz, 2003). This field of research emerged
approximately 15 years ago, when computer scientists started to use the power of computing to
automatically analyze non-verbal behaviour.
The two most important categories of non-verbal communication are hand gestures and facial
expressions. In this thesis we will focus on facial expressions. Humans derive all kinds of different
information from the facial features of a speaker during communication (Swerts & Krahmer, 2010).
By using different expressions of the face one can convey a whole range of different meanings. Facial
expressions have become of great interest to the fields of computer vision and human-computer
interaction (Pantic & Bartlett, 2007; Cassell, 2001). Research shows that facial expressions can show
a person's emotional state, give information about the speaker's personality, enhance and support
speech or possibly even replace it (Cassell, 2001; Donato, Bartlett, Hager, Ekman, & Sejnowski,
1999).
Swerts and Krahmer (2010) conducted an experiment in which human participants viewed a
newsreader presenting the news. Their research showed that humans are capable of determining the
emotional content (whether a news item was positive or negative) of a video based on the facial
expression of the newsreaders. Participants noted that newsreaders were generally more expressive
in their facial expressions when they were conveying a positive news item. In this thesis we will try to
determine the emotional content of a video using an automatic facial analysis method. To know if a
message is positive or negative is an important step in understanding the message itself. If a
computer is able to determine the emotional content of everything we say, it may take us a step
closer to achieving natural communication with computers. We will use a dataset which consists of
videos similar to those used by Swerts and Krahmer (2010). The task consists of classifying the emotional
content of a news item presented by a newsreader based on the movement of the eyebrows. We do
not take other aspects of the face into account because we believe that the eyebrows are a good
indicator of expressiveness. The facial feature analysis is carried out automatically by a computer.
Our goal is to find a distinctive feature which can be automatically extracted and used by a classifier
to predict the emotional content of a video message.

6
1.1 Problem Statement and Research Questions
This thesis focuses on the connection between the facial expressions of local newsreaders and the
message they are trying to convey. In the study by Swerts and Krahmer (2010) it appeared that human
observers are indeed capable of recognizing the emotional content of a message based on the facial
expressions of newsreaders. We want to know if a computer is able to classify the emotional content
of a video using an automatic facial feature recognition method. For this purpose we need a facial
feature which provides a clear distinction between positive and negative videos. Furthermore we
want to know which classifier is the most suitable for the task proposed. The problem statement (PS)
and Research Questions (RQ) addressed in this thesis read as follows.

PS: Can we determine the emotional content of a video by analyzing the face of newsreaders
using an automatic facial feature recognition method?

RQ 1: Can we find a distinctive feature based on the movement of the eyebrows?


RQ 2: What classifier performs best in our classification task?

We are searching for a feature which represents the clearest distinction between the two conditions.
As mentioned above the focus during our search is the movement of the eyebrows. If a distinctive
feature is found we will use this feature to carry out our classification task. For this purpose we will
use a total of four different classifiers. We will evaluate their performance and highlight the
differences between the classifiers. The problem statement will be answered based on answers to
the research questions.

1.2 Related Work


Automatically recognizing and classifying certain facial expressions has been the subject of many
recent studies. Most early attempts at analyzing facial expressions have focused on the recognition
of basic emotional states. Emotions such as happiness, anger or sadness can successfully be
recognized using facial expression recognition (Pantic & Rothkrantz, 2003; Pantic, Pentland, Nijholt,
& Huang, 2006). Recent studies have also shown that these methods are capable of determining the
gender, age and even the personality of a person (Vinciarelli, Pantic, Boulard, & Pentland, 2008).
Further studies have found features which are able to predict pain (Cohn, 2010) (Williams, 2002), or
determine the difficulty of a posed question in children (Joosten, 2011). During these studies, a
variety of methods have been used to analyze facial expressions. One of the most recently developed
methods is the Computer Expression Recognition Toolbox (CERT) by Bartlett et al. (2009). CERT has
been successfully applied to the detection of pain (Littlewort, Bartlett, and Lee, 2009), the detection
of driver drowsiness (Vural et al., 2007), and the perceived difficulty of a video lecture (Whitehill,
Bartlett, and Movellan, 2008).

1.3 Thesis Outline


The outline of the remainder of this thesis is as follows. In chapter 2 we will discuss the method and
the classifiers used in our experiments. We will describe our experiments in chapter 3 and the results
will be presented in chapter 4. Then, in chapter 5 we will discuss the results and the strengths and
weaknesses of our experiments. Finally, we will state our conclusions and give recommendations for
future research in Chapter 6.

7
Chapter 2
The FER Method
The automatic analysis of human faces can be described as the measuring of deformations of the
different facial components and their spatial relations (Chibelushi & Bourel, 2003). The difficult part
is to translate these deformations into meaningful features which can be measured or counted. In the
face there are numerous features which we can analyze. The features of most interest are
changes in the eyebrows, the expression of the mouth, movement of the head and eye gaze. These
features are considered to be the social signals with the highest informational value in
communication (Cassell, 2001). To extract the features present in the face we need to extract their
coordinates. Several methods have been developed capable of this task. We will use a facial feature
recognition method based on the Active Appearance Model (AAM). Joosten (2011) proposed a method
called the Facial Expression Recognition (FER) method, which consists of three basic steps (Figure 1).

1. Detecting the face in an image. (Face Detection)


2. Extracting facial features from the detected face region. (Facial Feature Extraction)
3. Analyzing these facial features and classifying this information. (Facial Feature Interpretation)

Figure 1 The three steps of the FER-method (Joosten, 2011).

For the first step we use a method called the Viola-Jones detector. Research shows that this face
detector is highly effective as well as computationally efficient and fast (Joosten, 2011; Viola &
Jones, 2001). In the second step we will extract several facial features from the face
detected in an image. The Active Appearance Model is used to automatically fit a specified grid of
coordinates on unseen images. For a large number of training instances these coordinates have to be
specified by hand. This results in a training set with which the AAM can be trained (Matthews, 2004).
After fitting the grid on an unseen image, the coordinate values will be stored. Using only the
coordinates of the eyebrows we will begin our search for a distinctive feature. This feature can then
be used to classify the video fragments for our classification task. A classifier is a machine learning
method in which the class of an unseen sample is determined using a set of training data. As
mentioned in chapter 1 we will use a total of four different classifiers.

8
In the remainder of this chapter we will discuss each stage in further detail. In section 2.1
we will describe the Viola-Jones detector method used to detect the location of the faces in our
image dataset. Section 2.2 will explain the Active Appearance Model and the pre-work that is
necessary for this method. In section 2.3 we will describe the method used for searching our
distinctive feature. Finally, in section 2.4, we will describe the four different classifiers used for our
classification task.

2.1 Face Detection


The method used by Viola and Jones (2001) is based on a representation called the integral image.
The Viola-Jones face detector scans each input image and determines the presence of a face.
Whether or not a face will be recognized depends on the presence of certain visual patterns. These
patterns are recognized by extracting features which are relevant to the detection of faces (Viola &
Jones, 2001). These features, i.e. a certain shape or contour, are translated into a number which
indicates the presence of the corresponding pattern. Due to the large number of features, Viola and
Jones incorporated AdaBoost, a machine learning method by Freund and Schapire (1999), to rule out
all features not relevant for face detection.
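
To make the integral image representation concrete, the MATLAB sketch below (our own illustration, not code from the thesis or from the detector implementation used in it) computes an integral image and evaluates one two-rectangle Haar-like feature with it; the file name and rectangle coordinates are arbitrary.

% Integral image sketch: after padding, ii(r+1,c+1) holds the sum of all pixels
% I(1:r,1:c), so any rectangular pixel sum costs only four look-ups.
I = double(imread('frame_0001.png'));        % hypothetical input frame
if ndims(I) == 3, I = mean(I, 3); end        % crude grayscale conversion
ii = cumsum(cumsum(I, 1), 2);                % running row and column sums
ii = [zeros(1, size(ii,2)+1); zeros(size(ii,1),1), ii];   % zero padding so r-1, c-1 stay valid
rectsum = @(r1,c1,r2,c2) ii(r2+1,c2+1) - ii(r1,c2+1) - ii(r2+1,c1) + ii(r1,c1);
% A two-rectangle (edge-like) feature: upper half of a patch minus its lower half.
feat = rectsum(10,10,14,19) - rectsum(15,10,19,19);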

2.2 Facial Feature Extraction


For the facial feature extraction step we use an Active Appearance Model. An AAM is a method used
to model and register all kinds of deformable visual objects. In recent years, AAMs have gained
popularity and have been used for all kinds of applications, mostly face recognition, due to their
excellent performance and rapid fitting to unseen images (Asthana, Saragih, Wagner, & Goecke,
2009; Matthews, 2004; Cootes, Edwards, & Taylor, 2001; Maaten & Hendriks, 2010; Cohn, 2010).
The application of an AAM proceeds in three steps. First, we manually specify a grid of
landmarks to create a set of training images. Second, we create two models, a shape and an
appearance model, using this training data. The shape and the appearance model are combined and
form the Active Appearance Model of the face (Joosten, 2011). In the third step we fit the
specified grid on new images using this global model. We will further discuss these steps in the
upcoming sections.

2.2.1 Labeling Training Images


The Active Appearance Model can be described as a supervised learning algorithm. This means that
in order for AAMs to work, training data is needed to create the statistical shape and
appearance models (Joosten, 2011). This training data consists of a number of manually labeled
example images in which specific facial landmarks are assigned and their coordinates are stored.
With facial landmarks we mean characteristic points present in the face, e.g. corners of the eyes or
mouth, tip of the nose or the shape of the mouth, jaw or eyebrows. When manually labeled
examples are available, AAMs have the ability to accurately fit the global model of the face onto the
remaining images (Cootes, 2001) (Matthews, 2004).

2.2.2 Creating the Shape and Appearance Models


The next step is to create the statistical models, i.e. the shape and the appearance models, using the
labeled images in the training data. Two steps are needed to create each model. First of all,
for constructing the shape model, Procrustes analysis (Goodall, 1991) is applied to align each training
image onto a common grid. Procrustes analysis is a method which optimally transforms a shape using
translation, rotation, rescaling and reflection (Goodall, 1991). These transformations are stored in

9
four variables called the Procrustes components (Joosten, 2011). An average model, called the mean
face shape, is now calculated using these Procrustes components. The second step is to apply
Principal Component Analysis (Jolliffe, 2002) to the set of aligned training grids. Principal Component
Analysis is a method to determine the components in which the training samples differs the most.
Principal Component Analysis is used to find the modes of shape variation (Joosten, 2011) (Asthana,
Saragih, Wagner, & Goecke, 2009). At this stage the shape model is complete, consisting of the mean
face shape, the 4 Procrustes components and the modes of shape variation (the shape components).
These components account for the range of facial variation to which this model can be fitted.
In creating the appearance model, the first step consists of warping the training images onto
the mean face shape of the shape model. These warped images are considered the shape normalized
appearances. As in the shape model, the shape normalized appearances are used to compute a mean
face appearance. PCA is applied to model the texture variation of the skin and face (called the
appearance components). The appearance model now consists of both the mean appearance and
the appearance components. The Active Appearance Model consists of both the shape and the
appearance model. Once the AAM is created, new instances of this model can be generated due to
its capability of representing large variation in both shape and texture (Joosten, 2011) (Asthana,
Saragih, Wagner, & Goecke, 2009).
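
As an illustration of the PCA step (our own sketch, not the code used in the thesis), the MATLAB fragment below computes a mean shape and modes of shape variation from a matrix of aligned training shapes; the variable names and the number of retained modes are assumptions.

% X: N x 2L matrix of N Procrustes-aligned training shapes, each row holding
% the L landmark coordinates as (x1..xL, y1..yL).
meanShape = mean(X, 1);                          % mean face shape
Xc = X - repmat(meanShape, size(X,1), 1);        % centre the shapes
[U, S, V] = svd(Xc, 'econ');                     % PCA via singular value decomposition
shapeModes = V(:, 1:5);                          % keep, e.g., the first 5 modes of shape variation
weights = Xc * shapeModes;                       % shape parameters of each training image
% A new shape instance is meanShape + w * shapeModes' for some weight vector w.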

2.2.3 Fitting Unseen Images


The next step in our facial feature extraction component is to estimate the coordinates of landmarks
in new unseen images. This process starts with generating a new AAM instance which is compared to
the unseen image. Next, a fitting algorithm is performed which basically tries to minimize the error
between the generated AAM instance and the unseen image. An initial estimate of the face
configuration is needed for the fitting algorithm to converge correctly. A facial feature point detector
by Everingham et al. (2006) is used to locate specific important locations in the face, i.e. the corners of
the eyes, the mouth and the tip of the nose. These locations are used to determine the weights of
the four Procrustes components by transforming (translating, rotating, scaling or reflecting) the
mean shape to fit the specified locations in the image. In turn, the image is warped to match the
mean shape to compare it to the mean appearance (of the appearance model). The values of the
Procrustes components and the difference in pixel values between the initial texture estimate and
the warped image are combined into the initial error between the model estimate and the actual
image. A fitting algorithm then tries to reach an optimal solution in which this error is
minimized (Matthews, 2004). The AAM translates the facial expressions in an image into a series of
parameters or weights using the shape and/or the appearance model. In this thesis we will base our
results on the shape parameters only. Using the location of the landmarks in the mean shape and the
known weights, we can then convert these into specific coordinates. These coordinates can be used
by a classifier in which multiple classification tasks are possible.

2.3 Determining the feature


The coordinates of the landmarks serve as the basis for our feature search. To begin our search, we
filtered out data that was not relevant to our experiments and focused only on the
coordinates of the landmarks that represented the eyebrows. This data filtering and further data
mining was done using MATLAB, a tool for numerical computation and visualization. In our search for
a good feature which represents a clear distinction between the two categories we started by visually
inspecting the video fragments used in the experiment. After deciding which properties seem the
most promising we started by visualizing these features in MATLAB. The features were calculated
using basic functions available in MATLAB. Some examples of the functions used were variance (var),
average (mean), minimal value (min) and maximal value (max). Based on the plots of point

10
distributions of calculated features we decided whether or not the distinction between categories
was strong enough to use in our classification task.
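
To make this concrete, the MATLAB sketch below (our own illustration with assumed variable names, not the actual thesis script) computes the kind of feature described here: the vertical variance of the eyebrow landmarks of one video fragment, separately for the left and the right eyebrow.

% Y: F x P matrix with the y-coordinates of the P eyebrow landmarks over the
% F frames of one video fragment (assumed layout: left-eyebrow points first).
P = size(Y, 2);
leftIdx  = 1:P/2;                                % hypothetical left-eyebrow columns
rightIdx = P/2+1:P;                              % hypothetical right-eyebrow columns
featLeft  = mean(var(Y(:, leftIdx),  0, 1));     % variance over time, averaged over points
featRight = mean(var(Y(:, rightIdx), 0, 1));
feature = [featLeft, featRight];                 % two-value feature for this video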

2.4 Classification
In this section we will explain the choice for our classifiers. It is commonly known that the
performance of classifiers depends on the sort of data to be classified. Considering the no-free-lunch
theorem by Wolpert and Macready (1997), we can also assume that no single classifier performs best
on all problems available. To determine a suitable classifier for our classification task we opted for
using multiple classifiers. We will determine which classifier performs best in our classification task.
We chose to use four of the most commonly used classifiers in statistical classification. As mentioned
in chapter 1 we used: k-nearest neighbor, decision tree, multilayer perceptron and support vector
machine. In optimizing our results we limited our changes to parameters which alter the complexity
of the classifiers. Below we describe each classifier briefly and indicate the parameter used to
change its complexity.

k-Nearest Neighbor

To classify a new video, a number of neighbors are located of which the class is known. The majority
rule is used to determine the class of the unknown video according to its nearest neighbors (Witten
& Frank, 2005). The k-nearest neighbor classifier is a very robust classifier capable of dealing with
noisy training data and large datasets. The complexity of the k-nearest neighbor can be decreased or
increased by changing the k-value, i.e. the number of nearest neighbors used during classification.
Figure 2 shows an example of the k-nearest neighbor using five nearest neighbors. In this example
the unseen sample (triangle) is classified using five neighbors (k=5). These consist of two class 1
(circles) samples and three class 2 (squares) samples. In this case the unseen sample is classified as a
class 2 sample.

Figure 2 Example of a k-nearest neighbor classification using 5 nearest neighbors.
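
As a minimal sketch of this majority-vote rule (our own illustration, not the Weka IBk implementation used later in the thesis), the MATLAB fragment below classifies one new feature vector; Xtrain, ytrain and xnew are assumed variable names.

% Xtrain: n x d training features, ytrain: n x 1 class labels, xnew: 1 x d query.
k = 5;
d = sqrt(sum((Xtrain - repmat(xnew, size(Xtrain,1), 1)).^2, 2));   % Euclidean distances
[~, order] = sort(d);                         % nearest neighbours first
label = mode(ytrain(order(1:k)));             % majority vote among the k nearest neighbours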

Decision tree

A decision tree is the result of a divide-and-conquer learning approach and has a tree-like
structure. Each video is passed along the tree structure and is subjected to a set of rules after which a
class is decided (Witten & Frank, 2005). These rules can range from matching a certain characteristic

11
to having a certain value. Besides the fact that decision trees require little computational effort, the
real advantage is that a decision tree can be described as a set of rules which can be followed. The
complexity of a decision tree depends on the number of branches. Figure 3 shows an example of a
very simple decision tree which determines the class, in this example class A or class B, of an unseen
sample (X).

Figure 3 Example of a very simple decision tree classifier.

Multilayer Perceptron

The multilayer perceptron is a classifier that is capable of classifying non-linear point distributions.
The decision boundary of a multilayer perceptron is calculated using several linear decision
boundaries. Combining these boundaries results in a non-linear decision boundary capable of
classifying all sorts of data (Witten & Frank, 2005). The complexity of the multilayer perceptron
corresponds to the number of linear decision boundaries used. To increase the complexity we
need to increase the number of hidden layers/neurons. An example of a decision boundary calculated
using a multilayer perceptron can be seen in figure 4. In this example you can see that the decision
boundary encircles the class in the middle very well. Everything within the decision boundary will be
classified as a circle. Non-linear decision boundaries can take on different forms than the one shown
in this example.

Figure 4 Example of a complex decision boundary using a multilayer perceptron.
12
Support Vector Machine

A Support Vector Machine is a classifier that is highly efficient when dealing with high dimensional
data. To classify a new instance, an SVM transforms the data to a dimension in which it is linearly
separable. This transformation is done by using a kernel and the type of kernel depends on the data
used. A basic Support Vector Machine can be represented as a point distribution of two different
categories, separated by a line with a margin (Figure 5). This line is oriented so that the margin
between the two classes is maximized. The points that determine the width of the margin are called
the support vectors. The complexity of a support vector machine can be changed by altering the
c-value. This value represents the trade-off between learning error and the number of support vectors
(Witten & Frank, 2005).

Figure 5 Example of a decision boundary with margin using an SVM.

13
Chapter 3
Experimental Set-up
Chapter 3 will focus on how we applied the method described in chapter 2 in our experiment. We
will start by describing the process of creating our dataset in section 3.1. In section 3.2 we will
describe the application of the FER-method to our dataset, including how we applied the classifiers
for our classification task. Section 3.3 describes how we evaluate the results.

3.1 The Data Set


The selection of our dataset is based on an experiment conducted by Swerts & Krahmer (2010) as
described earlier. We recorded several broadcasts of the 8 o'clock news on Dutch public
television. Each video was recorded at a speed of 29 frames per second. 25 different broadcasts were
randomly chosen from the year 2010, featuring a male (Rob Trip) or a female newsreader (Sacha de
Boer) (Figure 6). From these 25 videos we extracted a total of 183 fragments, each about one specific
news item. Next, we divided these fragments by their emotional content and chose for each category
the most extreme cases, based on our own judgment. Our dataset consists of a total of 100 news
items divided into 50 male videos and 50 female videos, where each gender set contains 25 positive
and 25 negative news items. In order to be able to apply the FER-method we needed to adjust
the fragments. For each fragment we extracted a 150x150 pixel window in which the head
of the newsreader is centered. This way we can eliminate most of the background, which can
interfere with the localization of the face region, and thus obtain a better fit. To do this, we
used the cropping tool of a video converting and editing tool called Daniusoft Video Converter
Ultimate v3.1.1. Using MATLAB we converted each edited fragment into a set of frames, resulting in a
dataset with a total of 40322 frames divided into 100 sets of video frames, each containing between
144 and 773 frames.

Figure 6 Sacha de Boer and Rob Trip
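
The frame extraction can be done with a few lines of MATLAB; the sketch below is our own illustration (the file names and the use of VideoReader are assumptions, not the exact script used for the thesis).

% Convert one cropped 150x150 fragment into individual image frames.
v = VideoReader('fragment_001.avi');                  % hypothetical cropped clip
for k = 1:v.NumberOfFrames
    frame = read(v, k);                               % read frame k
    imwrite(frame, sprintf('fragment_001_%04d.png', k));
end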

3.2 Experimental Procedure


For our experiments we used the Facial Expression Recognition (FER) method proposed by Joosten
(2011). This method is composed of the three steps described in chapter 2. In the
remainder of this chapter we will describe the way we applied the FER-method to our dataset. In
section 3.2.1 we will focus on face detection and in section 3.2.2 the facial feature extraction

14
component will be discussed. These two steps are incorporated in a single algorithm provided by
Laurens van der Maaten (Maaten & Hendriks, 2010). In section 3.2.3 we will describe how the
classifiers have been applied.

3.2.1 Face Detection


The first step in applying the FER-method is to detect the faces present in the videos of our dataset.
The FER-method makes use of the Viola-Jones Detector to localize the faces in the video frames of
our dataset. We kept the default settings with a window size of 24 x 24 pixels and 3 different feature
types. For a more detailed description of the Viola-Jones detector, see the paper by Viola and
Jones (2001). Due to the stable position of the faces present in our dataset, the Viola-Jones detector
managed to localize each face correctly. The detected face regions ranged from
40x40 to 45x45 pixels, depending on the distance of the face to the camera.
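
For readers who want to reproduce this step in MATLAB, a minimal sketch is shown below; it assumes the Computer Vision Toolbox's built-in Viola-Jones detector rather than the implementation by van der Maaten that was actually used, and the file name is hypothetical.

% Viola-Jones face detection on one extracted frame.
detector = vision.CascadeObjectDetector();     % default frontal-face model
img  = imread('fragment_001_0001.png');
bbox = step(detector, img);                    % one [x y width height] row per detected face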

3.2.2 Facial Feature Extraction


The next step is extracting facial feature data from the video frames. This process consists of three
different steps and is based on the Active Appearance model by Cootes et al. (2001). The AAM model
depends on the availability of a training set of annotated images. Thus, the first step in extracting
facial feature data is annotating our training set. Using this data we will generate the appearance and
shape models. The third step consists of fitting new data to these created models. Each step will be
described in the next three sections.

Annotating the Data


In our research we divided our image data into two separate datasets, a female and a male category.
From each set we selected a total of 100 images for annotation. These images were taken from every
fifth video from which the ten most representative expressions were chosen. This resulted in a set of
200 images which we manually annotated using software developed by Tim Cootes. In annotating
the data we used a total of 38 different landmarks which cover the most important facial features.
Figure 7 shows a manually annotated image for the male category.

Figure 7 Manually annotated image of Rob Trip.

15
Model Generation
Our next step consists of generating the shape and the appearance models which are needed to fit
the determined coordinate grid on unseen images. The shape model and the appearance model
(described in chapter 2) together form the Active Appearance Model. These models are
automatically generated in our algorithm during the training period. In our experiment we created
two different AAMs, for the male and the female newsreader. Each AAM model is generated using
100 annotated training images.

Image Fitting
The first step in fitting our coordinates onto an unseen image is the detection of several important
facial feature points. These are determined in the facial-extraction component of the algorithm as
the corners of the eyes, mouth and the tip of the nose. Estimates of their locations are used to
calculate the values of the so-called Procrustes components (translation, rotation, rescaling and
reflection). The shape model is now fitted to the new image and converged to an optimal solution
using the values of the Procrustes components (Joosten, 2011). These Procrustes values are stored
and processed further into usable coordinate values. These values are ready to be analyzed and used
for our classification task.

3.2.3 Classification
In chapter 2 we briefly discussed the classifiers used in our experiment, namely the k-nearest neighbor,
the decision tree, the multilayer perceptron and the support vector machine. We will now discuss the way
these classifiers were applied to our classification task. We applied these classifiers using Weka 3.6
(Waikato Environment for Knowledge Analysis). Weka is an open-source machine learning package
freely available from http://www.cs.waikato.ac.nz/ml/weka/. This program can be used for
classifying, preprocessing, clustering or visualizing datasets. For our purpose we only used the
classification function. Most commonly used classifiers are present in this software package. As
k-nearest neighbor classifier we used the IBk algorithm. For the decision tree classifier we used J48, an
implementation of the C4.5 algorithm developed by Ross Quinlan. The multilayer perceptron function
is readily present in Weka. Lastly, we chose the SMO function developed by John Platt (Witten &
Frank, 2005). For our classification task we used the leave-one-out training method. This means that
every instance is used as a training example except for the instance that is being classified. During
our classification we varied the complexity values as described in chapter 2. After determining the
optimal complexity values for our classification task we stored the results. The results of our
classification task can be found in chapter 4.
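
As an illustration of the leave-one-out procedure (our own MATLAB sketch, not the Weka workflow itself), the fragment below evaluates the simple k-nearest neighbor rule from chapter 2; X and y are assumed variable names for the feature matrix and the class labels of all videos.

% Leave-one-out evaluation of a k-NN classifier on n videos.
n = size(X, 1);  k = 14;  pred = zeros(n, 1);
for i = 1:n
    trainIdx = setdiff(1:n, i);                               % leave video i out
    d = sqrt(sum((X(trainIdx,:) - repmat(X(i,:), n-1, 1)).^2, 2));
    [~, order] = sort(d);
    pred(i) = mode(y(trainIdx(order(1:k))));                  % majority vote
end
accuracy = mean(pred == y);                                   % fraction of correct classifications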

3.3 Evaluation
This section describes the way we evaluate the performance of our experiments. In chapter 4 we will
discuss the quality of the determined feature based on its distinctiveness. We will also reflect on the
performance of the classifiers in our classification task. In this thesis we label a classification as
successful if it reaches a correct classification rate of 65% or more. In chapter 5 we will discuss the
quality of our dataset and the possible effects this could have on the results.

16
Chapter 4
The Results
In this chapter we will discuss the results of our experiments. These results are divided into two
sections, each covering a specific research question. Section 4.1 describes the results of
our search for an optimal facial feature for our classification task based on the movement of the
eyebrows. In section 4.2 we will discuss the results of the different classifiers and their optimal
complexity settings.

4.1 Can we find a distinctive feature based on the movement of the eyebrows?
In this section we will explain the outcome of our search for an optimal feature for our
classification task. As mentioned, we focused only on the movement of the eyebrows. Several plots
and point distributions were used to discover potential features. After a lengthy search we
discovered a feature which seemed to provide us with the distinction needed for our classification
task. This feature was discovered after we noticed that the newsreaders were more prone to move
their eyebrows up and down during positive news fragments than during negative fragments.
The movements of the individual eyebrow points were plotted against each other. This plot can be
seen in figure 8 and shows that there is a distinction between the positive (green) and the negative
(blue) category. The final feature consists of two values and is calculated as the vertical variance
of the movement of the points located on the right and the left eyebrow separately. Figure 9 and
figure 10 show the plots of the male and the female video feature values, respectively. The two
categories combined can be seen in figure 11. These figures show a distinction between the positive
(blue) and the negative (green) videos, even though there is a lot of overlap between the two categories.

Figure 8 Y-variance of the eyebrow coordinates (female) plotted against each other using MATLAB.

17
Figure 9 Plot of the distinctive feature in the female category.

Figure 10 Plot of the distinctive feature in the male category.

18
Figure 11 Plot of the distinctive feature in the total category.

4.2 What classifier performs best in our classification task?


This section will present the results of our classification task. We have used four different classifiers
in our experiments and each will be discussed in a separate section. The results consist of the overall
performance of the classifier for all three datasets (male, female and total) and its optimal settings.
For further evaluation we will calculate the negative predictive value (NPV) and the positive
predictive value (PPV). These values represent the reliability of a prediction: the NPV is the
proportion of negative predictions that are correct, and the PPV is the proportion of positive
predictions that are correct. Figure 12 shows an overview of the overall performance of the four
classifiers on each dataset, expressed as percentages. The results will be further discussed in chapter 5.

Figure 12 Correctly classified video percentage of the four classifiers on each dataset.
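
To show how these numbers follow from a confusion matrix, the MATLAB sketch below (our own illustration) recomputes the accuracy, NPV and PPV for the k-nearest neighbor result on the total dataset reported in table 1 below.

% 2x2 confusion matrix: rows = actual class, columns = predicted class,
% in the order [negative; positive].
C = [39 10;     % actual negative: 39 predicted negative, 10 predicted positive
     19 30];    % actual positive: 19 predicted negative, 30 predicted positive
accuracy = trace(C) / sum(C(:));      % (39+30)/98, about 70.4% correct
NPV = C(1,1) / sum(C(:,1));           % 39/58, about 67.2% of negative predictions correct
PPV = C(2,2) / sum(C(:,2));           % 30/40, 75% of positive predictions correct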

19
k-Nearest Neighbor
In our attempt to optimize the k-nearest neighbor classifier for our classification task we found an
optimal k-value of 14 neighbors. Increasing or decreasing the complexity of the classifier any further
resulted in a lower performance for the male, the female and the total dataset. For the male
dataset 64% of the classifier's predictions were correct, with an NPV of 63% and a PPV of 65.2%. On
the female dataset it scored a total of 68.75% correct predictions, with an NPV of 64.5% and a PPV of
76.5%. On the combined dataset, it reached a performance of 70.4% with an NPV of 67.25% and a
PPV of 75%. See table 1 for an overview of the results for the k-nearest neighbor classifier.

k-nearest neighbor (k=14)

Dataset   % correct   Correct   Confusion matrix (rows = actual, columns = predicted)
                                             Pred. Neg   Pred. Pos
Male      64%         32/50     Actual Neg      17           8
                                Actual Pos      10          15
Female    68.75%      33/48     Actual Neg      20           4
                                Actual Pos      11          13
Total     70.4%       69/98     Actual Neg      39          10
                                Actual Pos      19          30

Table 1 Results for the k-nearest neighbor classifier.

Decision Tree
For the j48 decision tree classifier, changing the default Weka settings resulted in a lower
performance. Therefore we believe that the default value is most suitable for our classification task.
This resulted in a performance of 70% for the male dataset with an NPV of 100% and a PPV of 62.5%.
On the female dataset it scored a total of 75% correct predictions with an NPV of 70% and a PPV of
88.2%. On the combined dataset, it reached a performance of 63.25% with an NPV of 60.7% and a
PPV of 67.6%. See table 2 for an overview of the results for the decision tree classifier.

Decision Tree

Dataset   % correct   Correct   Confusion matrix (rows = actual, columns = predicted)
                                             Pred. Neg   Pred. Pos
Male      70%         35/50     Actual Neg      10          15
                                Actual Pos       0          25
Female    75%         36/48     Actual Neg      21           2
                                Actual Pos       9          15
Total     63.25%      62/98     Actual Neg      37          12
                                Actual Pos      24          25

Table 2 Results for the decision tree classifier.

Multilayer Perceptron
For the multilayer perceptron we used the default setting which automatically determined the
optimal number of hidden layers and neurons. This resulted in 72% correct predictions for the male
dataset with an NPV of 73.9% and a PPV of 70.4%. For the female dataset the multilayer perceptron
scored a total of 72.9% correct predictions with an NPV of 69% and a PPV of 79%. On the combined
dataset, it reached a performance of 66.3% with an NPV of 66% and a PPV of 66.7%. See table 3 for
an overview of the results for the multilayer perceptron classifier.

20
Multilayer Perceptron (Lr = 0.1, TrT = 1000)

Dataset   % correct   Correct   Confusion matrix (rows = actual, columns = predicted)
                                             Pred. Neg   Pred. Pos
Male      72%         36/50     Actual Neg      17           8
                                Actual Pos       6          19
Female    72.9%       35/48     Actual Neg      20           4
                                Actual Pos       9          15
Total     66.3%       65/98     Actual Neg      33          16
                                Actual Pos      17          32

Table 3 Results for the multilayer perceptron classifier.

Support Vector Machine


In our attempt to find the optimal complexity value for our classification task we found the highest
performing value to be c=100000. This resulted in a correct classification of 62% for the male dataset
with an NPV of 60% and a PPV of 65%. On the female dataset it scored a total of 72.9% correct
predictions with an NPV of 70.4% and a PPV of 76.2%. On the combined dataset, it reached a
performance of 69.4% with an NPV of 66.1% and a PPV of 74.4%. See table 4 for an overview of the
results for the support vector machine classifier.

Support Vector Machine (c = 100000)

Dataset   % correct   Correct   Confusion matrix (rows = actual, columns = predicted)
                                             Pred. Neg   Pred. Pos
Male      62%         31/50     Actual Neg      18           7
                                Actual Pos      12          13
Female    72.9%       35/48     Actual Neg      19           5
                                Actual Pos       8          16
Total     69.4%       68/98     Actual Neg      39          10
                                Actual Pos      20          29

Table 4 Results for the SVM classifier.

21
Chapter 5
General Discussion
Here we will discuss the research described in this thesis. The results of our classification task will be
discussed in section 5.1. We will question the validity and quality of our results and some
improvements to increase the quality of our research will be proposed. In section 5.2 we will point
out the strengths and weaknesses of our dataset and the method applied to solve our classification
task. We will focus on the quality of our dataset, the importance of using real world data and the
applicability of our face analysis method. Furthermore we will give suggestions about which aspects
of our experiment could be improved.

5.1 The Task


The classification task described in this thesis, classifying the emotional content of a news item by
analyzing the face of newsreaders using an automatic facial expression recognition method, is very
similar to an experiment described by Swerts & Krahmer (2010) in which they used human observers
instead of an automatic facial analysis method. Their research showed that almost all participants
were able to judge the emotional content correctly and participants noted that newsreaders were
generally more expressive in their facial expressions when the message had a good emotional
content (Swerts & Krahmer, 2010). This corresponds to the results found in our experiments. The
feature used in our experiment could indeed be seen as a measurement of expressiveness. In
general, more movement in the eyebrows means more expressive facial expressions. Unfortunately,
expressiveness cannot be attributed to the movement of the eyebrows alone which may explain the
difference in performance between human observers and our automatic facial analysis method.
The second similar aspect is the modality used. Most Social Signal Processing research is
monomodal, using only auditory or visual input, which seems illogical given the multimodal nature
of behaviour (Vinciarelli, Pantic, Boulard, & Pentland, 2008; Grant, 1969). In our experiments we used
visual input only because we wanted to focus on the non-verbal behaviour of the face. Research on
multimodality shows inconclusive evidence of the advantages of using more modalities in analyzing
faces. Research by Esposito (2009) shows that both verbal and non-verbal communication convey the
same emotional information. This was already stated by Quintilianus (1st century AD) in his De
institutione oratoria, which says that a speaker's expression should be congruent with what they are
saying. Most people focus on only one modality, depending on both language and culture (Esposito,
2009). Other research shows that combining verbal and non-verbal information results in better recognition
of basic emotions (Schuller, Mueller, Hoernler, Hoethker, Konosu, & Rigoll, 2007). This also
corresponds to the ideas of social psychology about effective social situations (Vinciarelli, Pantic,
Boulard, & Pentland, 2008). Several facial expression recognition experiments using both visual and
auditory input achieved an accuracy between 72% and 85% (Pantic, Pentland, Nijholt, & Huang,
2006). This is indeed higher than the results we achieved, but may also be influenced by the
number of features used.
Our results show that there is a noticeable inconsistency in the expressiveness of our male
and female newsreaders. Bout (2008) shows that female newsreaders are generally more expressive
than male newsreaders. However, this is not reflected in our results. This may be explained by
individual differences in emotional expression. According to Cohn (2007), this difference is strong
enough to serve as a basis for person recognition.

22
Lastly, in our experiments we used four different classifiers. Using other classifiers or even
combining several classifiers may result in a higher accuracy. Combining multiple classifiers has
proven to increase the overall accuracy (Vinciarelli, Pantic, Boulard, & Pentland, 2008).

5.2 The Experiment

The Dataset
One of the most important issues in social signal processing, and an aspect often neglected for
practical reasons, is the use of real-world data. Most datasets are collected in laboratories or
other artificial settings. Even though the use of actors is very common in research focused on the
study of faces, it is likely to oversimplify a real-world situation, and many aspects of real social
behaviour could be missing (Vinciarelli, Pantic, Boulard, & Pentland, 2008; Wilting, Krahmer, &
Swerts, 2006). Unfortunately, recordings of genuine facial behaviour suitable for research are difficult
to find (Pantic, 2009). In our experiment we focused on the use of Dutch newsreaders. These
newsreaders are expected to present the news in a neutral way but also serve as the face of the
news channel and are expected to show a certain degree of emotion to attract the audience while
preserving their neutral position (Tomascikova, 2010). Using newsreader data in research comes with
both advantages and disadvantages.
First of all, even though newsreader data can be labeled as real world data, the question
remains whether this kind of discourse can be used to explain other kinds of discourse. It may even be
the case that the expressions newsreaders use on-air differ from their natural expressions. Nevertheless, this is not
always the case. The newsreaders used in our experiments mentioned that while presenting the
news, neither one of them thinks about the use of their facial expression (Swerts & Krahmer, 2010).
With this in mind we can assume that the facial expressions in our dataset are natural facial behavior
indeed. Unfortunately, this may be the case in our research but it cannot be guaranteed for other
newsreader data. The biggest advantage of the use of newsreader data in social signal processing is
the setting. The stable position of the head and the uniform background make newsreader data
highly suitable for automatic facial analysis (Joosten, 2011).
A possible weakness of our research is the quality of our dataset. The quality of the video
fragments used was generally lower than that of other datasets, and the preprocessing steps
decreased the quality even more. Therefore our dataset consists of low-resolution images
(150x150 pixels), with face region resolutions ranging from 40x40 to 45x45 pixels. These low
resolutions might influence the efficiency of the FER-method used in our experiments. Tian (2004)
evaluated the performance of the most common facial analysis methods at several resolutions and
showed that face detectors are able to locate faces with a face region of 36x48 pixels and larger. For the
facial feature recognition step it appears that for a face region of 36x48 or smaller it is best to use
appearance features instead of geometric features. With a resolution lower than 36x48 it is also
more difficult to recognize finely detailed expressions. A resolution of 40x40 therefore seems
sufficient, although we chose geometric features, which may have influenced the efficiency of the
FER-method.

The Facial Expression Recognition Method


The FER-method has proven to be useful in the field of facial analysis (Joosten,
2011). However, there are certain drawbacks which can influence the efficiency of the
method. The first drawback is the FER-method's inability to deal with head movements. Slight
movements can still be recognized, but rotating or scaling the head too far may result in the model
not fitting properly onto the image (Chibelushi & Bourel, 2003). This drawback does not seem to have
an impact on our experiments, mainly due to the stable position of the head.
The next drawback is the necessity of an accurate shape model to fit onto unseen
images. This model is formed by providing a large amount of training data and can only take on
deformations available in the training set (Cootes et al., 2004). The main problem with this training

23
set is that the training examples have to be labeled manually. This manual labour is very time-consuming
and has a major influence on the efficiency of the FER-method. Lucey, Lucey, and Cohn
(2010) argue that about 60 or more landmarks are required for facial-expression recognition.
Unfortunately this is much higher than the 38 landmarks used in our experiments; we only chose
the landmarks that seemed most relevant to us in an attempt to decrease the time needed for manual
labour.
A more critical concern is the reliability of the manually labeled images. Cohn (2007) noted
that in most cases approximately 20-30% of the manually labeled images are inaccurate. This
inaccuracy can have a huge impact on the performance of our FER-method. After visual inspection of
the fitted images we came to the conclusion that a high degree of jittering (movement of the model
while the face showed no movement) was present. This jittering can affect the correctness of the
extracted features, which in turn could influence the results of our classification task (Joosten, 2011). This
jittering may be reduced by using a feature tracking method instead of the feature fitting method
currently used. With feature tracking, the previous frame is taken into account during the fitting
process. This results in a more natural movement of the landmark coordinates and is thus more likely
to be correct.

24
Chapter 6
Conclusion and future work
In this chapter we will reach a conclusion based on the results of our experiment. In section 6.1 we
will answer the research questions posed in chapter 1 and discuss the outcome of our problem
statement. Section 6.2 gives some suggestions for improvements and future research.

6.1 Research Questions and Problem Statement


We will first provide answers to the research questions proposed in chapter 1. Considering these
answers we will answer our problem statement and reach our conclusion.

Can we find a distinctive feature based on the movement of the eyebrows?


Yes, we did find a distinctive feature based on the eyebrows. The feature we discovered that showed
the most promise for our classification task can be described as the amount of vertical movement of
the eyebrows. More and stronger movement of the eyebrows resulted in a higher variance value and
is characteristic of a positive emotional content. Better or equally performing features may be
present but have not been discovered during the writing of this thesis.

What classifier performs best in our classification task?


The classifier that performs best in our classification task depends largely on the group tested on.
When looking at the male and the female category, we can see that all classifiers perform better in
the female category. The decision tree and the multilayer perceptron are the two classifiers that
perform best on these two categories. The decision tree got the best performance in the female
category (75%) but a lower score for the male category (70%) while the multilayer perceptron
performs approximately equally high for both categories (72.9% and 72%). On the other hand, when
taking into account both categories simultaneously in one combined dataset the performance of
these classifiers drops. On the total dataset the performance is highest using the k-nearest neighbor
classifier (70.4%), followed by the support vector machine (69.4%). When using the total dataset for
this classification task it is therefore best to use the k-nearest neighbor classifier. Considering the
male and female categories separately, the stability of the multilayer perceptron suggests that it is
most suitable for this classification task.

We have provided answers to our research questions and can now focus on answering the problem
statement as stated in chapter 1:

Can we determine the emotional content of a video by analyzing the face of newsreaders using an
automatic facial feature recognition method?

Considering the performance of our classifiers we can conclude that a computer is capable of
determining the emotional content using an automatic facial feature recognition method. An average
performance of 67%, 72.4% and 67.3% for the male, female and total dataset shows that the feature
used in our experiments can be used for classifying the emotional content of a video. Even though
we found these results for newsreaders only, we believe that these results can be used for other
purposes. However, improvements are needed to increase the performance before it can be used in
real applications.

25
6.2 Application and Future Research
Research in the field of automatic facial analysis is most important if we want to achieve the goal of
natural communication between man and machine. For this we need to explore every possible
aspect of human-machine communication including topics like verbal and non-verbal
communication. Every study concerning these aspects is a small step towards the successful
application of natural human-computer interaction. Every step is an exploration of what we can
achieve and how this can be improved. This thesis is no exception, as it is an attempt to explore the
possibilities of an automatic facial analysis system. The problem statement answered in this thesis is
an interesting one, but it mostly serves as a small part of a much bigger picture. The results are not
yet strong enough to apply in a real-life situation such as, for example, the automatic classification of
news fragments for cataloguing purposes. To achieve this, a much higher accuracy is needed and the
methods have to be improved so that the entire process can be automated. In this thesis, however,
it is not our intention to improve the method used, but to use the method to solve a
classification task. Therefore, we will focus on what can be done to increase the accuracy of our
classification task. First of all, increasing the number of landmarks may increase the amount of
information available. More landmarks, combined with the use of more and different features, may
greatly increase the accuracy. Second, using a dataset of higher quality may increase the performance
of the Active Appearance Model and may decrease the jittering of the image fitting process,
resulting in a better fit. Finally, using different classifiers or combining several classifiers may improve
the classification results, although this could also lead to a lower accuracy.
Lastly I want to encourage future research to focus on all sorts of different classification
tasks. These tasks provide us with the knowledge of the possibilities of automatic facial analysis and
the problems that may arise. This knowledge enables us to take a step forward in achieving natural
communication between man and machine.

26
Literature
Asthana, A., Saragih, J., Wagner, M., & Goecke, R. (2009). Evaluating AAM fitting methods for facial
expression recognition. 3rd International Conference on Affective Computing and Intelligent
Interaction and Workshops (pp. 1-8). IEEE.

Bout, A. (2008). Expressiviteit bij nieuwslezers. Master Thesis, Tilburg University.

Cassell, J. (2001). Nudge nudge wink wink: elements of face-to-face conversation for embodied
conversational agents. MIT Press.

Chibelushi, C. C., & Bourel, F. (2003). Facial Expression Recognition: A Brief Tutorial Overview.
CVonline: On-Line Compendium of Computer Vision, 9.

Cohn, J. (2007). Foundations of human computing: facial expression and emotion. Artificial
Intelligence for Human Computing, 1-16.

Cohn, J. (2010). Advances in Behavioral Science Using Automated Facial Image Analysis and
Synthesis. Signal Processing Magazine, IEEE, 27(6), 128-133.

Cootes, T., Edwards, G., & Taylor, C. (2001). Active appearance models. Pattern Analysis and Machine
Intelligence, 23(6), 681-685.

Cootes, T., Taylor, C., & others. (2004). Statistical models of appearance for computer vision. World
Wide Web Publication.

Donato, G., Bartlett, M., Hager, J., Ekman, P., & Sejnowski, T. (1999). Classifying facial actions. Pattern
Analysis and Machine Intelligence, 21(10), 974-989.

Edwards, G., Taylor, C., & Cootes, T. (1998). Interpreting face images using active appearance models.
Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998.
Proceedings. (pp. 300-305). IEEE.

Esposito, A. (2009). The Perceptual and Cognitive Role of Visual and Auditory Channels in Conveying
Emotional Information. Cognitive Computation, 1(3), 268-278.

Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! My name is... Buffy - Automatic Naming of
Characters in TV Video. Proceedings of the 17th British Machine Vision Conference (pp. 889-908).

Goodall, C. (1991). Procrustes methods in the statistical analysis of shape. Journal of the Royal
Statistical Society. Series B. Methodological, 53(2), 285-339.

Grant, E. (1969). Human facial expression. Man, 4(4), 525-692.

Jolliffe, I. T. (2002). Principal Component Analysis. New York: Springer, second edition.

Joosten, B. (2011). Facial Expression Recognition. Master Thesis, Tilburg University.

27
Maaten, L. V., & Hendriks, E. (2010). Capturing Appearance Variation in Active Appearance Models.
Computer Vision and Pattern Recognition Workshops, 34-41.

Matthews, I., & Baker, S. (2004). Active appearance models revisited. International Journal of Computer Vision,
60(2), 135-164.

Pantic, M. (2009). Machine analysis of facial behaviour: Naturalistic and dynamic behaviour.
Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535), 3505-3513.

Pantic, M., & Bartlett, M. (2007). Machine analysis of facial expressions. 377-416.

Pantic, M., & Rothkrantz, L. (2003). Toward an affect-sensitive multimodal human-computer
interaction. Proceedings of the IEEE, 91(9), 1370-1390.

Pantic, M., Nijholt, A., Pentland, A., & Huang, T. (2008). Human-Centred Intelligent Human-Computer
Interaction (HCI2): how far are we from attaining it? International Journal of
Autonomous and Adaptive Communications Systems, 1(2), 168-187.

Pantic, M., Pentland, A., Nijholt, A., & Huang, T. (2006). Human computing and machine
understanding of human behavior: a survey. Proceedings of the 8th international conference
on Multimodal interfaces (pp. 239-248). ACM.

Schmidt, K., Ambadar, Z., Cohn, J., & Reed, L. (2006). Movement differences between deliberate and
spontaneous facial expressions: Zygomaticus major action in smiling. Journal of Nonverbal
Behavior, 30(1), 37-52.

Schuller, B., Mueller, R., Hoernler, B., Hoethker, A., Konosu, H., & Rigoll, G. (2007). Audiovisual
recognition of spontaneous interest within conversations. Proceedings of the 9th
international conference on Multimodal interfaces (pp. 30-37). ACM.

Swerts, M., & Krahmer, E. (2010). Visual prosody of newsreaders: Effects of information structure,
emotional content and intended audience on facial expressions. Journal of Phonetics, 38(2),
197-206.

Tian, Y. (2004). Evaluation of face resolution for expression analysis. Conference on Computer Vision
and Pattern Recognition Workshop, 2004. CVPRW'04. IEEE.

Tomascikova, S. (2010). On Narrative Construction of Television News. Bulletin of the Transilvania
University of Brasov, 3.

Vinciarelli, A., Pantic, M., Boulard, H., & Pentland, A. (2008). Social signal processing: state-of-the-art
and future perspectives of an emerging domain. MM '08 proceedings of the 16th ACM
international conference for Multimedia (pp. 1061-1070). New York: ACM.

Viola, P., & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Features.
Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition. 1, pp. 511-518. IEEE.

Williams, A. C. de C. (2002). Facial expression of pain: an evolutionary account. Behavioral and brain
sciences, 25(4), 439-455.

28
Wilting, J., Krahmer, E., & Swerts, M. (2006). Real vs. acted emotional speech. Ninth International
Conference on Spoken Language Processing.

Witten, I., & Frank, E. (2005). Data Mining: Practical machine learning tools and techniques. Morgan
Kaufmann.

Zeng, Z., Pantic, M., Roisman, G., & Huang, T. (2009). A survey of affect recognition methods: Audio,
visual, and spontaneous expressions. Pattern Analysis and Machine Intelligence, 31(1), 39-58.

29
