
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Facial Expression Recognition:


Towards Meaningful Prior Knowledge
in Deep Neural Networks

Filipe Martins Marques

MASTER'S THESIS

Integrated Master in Bioengineering

Supervisor: Jaime dos Santos Cardoso


Second Supervisor: Pedro Miguel Martins Ferreira

June 12, 2018


© Filipe Marques, 2018
Resumo

Facial expressions, by definition, are associated with how emotions are expressed and play a key role in communication. This makes facial expression an interdisciplinary domain transversal to several sciences, such as behavioral science, neurology and artificial intelligence. Facial expression is documented as the most informative means of communication for humans, which is why computer vision applications such as human-computer interaction or sign language recognition need an efficient facial expression recognition system.
Facial expression recognition methods have been studied and explored, demonstrating impressive performances on the detection of discrete emotions. However, these approaches are only remarkably efficient in controlled environments, i.e., environments where illumination and pose are monitored. Facial expression recognition systems in computer vision applications, besides being improvable in controlled scenarios, need to be efficient in real-world scenarios, although the most recent methods have not reached a desirable performance in such environments.
Convolutional neural networks have been widely used in several computer vision and object recognition tasks. Recently, convolutional neural networks have also been applied to facial expression recognition. However, these methods have not yet reached their full potential in facial expression recognition, since training complex models on small databases, such as those available for facial expression recognition, usually results in overfitting. With this in mind, the study of new neural network methods involving innovative training strategies is necessary.
In this dissertation, a new model is proposed in which different sources of domain knowledge are integrated. The proposed method aims to include information extracted from networks pre-trained on tasks of the same domain (image or object recognition), together with morphological and physiological information about facial expression. This inclusion of information is achieved by the regression of relevance maps that highlight key regions for facial expression recognition. It was studied to what extent the refinement of the relevance maps of facial expressions and the use of features from other networks allow better results in expression classification.
The proposed method achieved the best result when compared with the implemented state-of-the-art methods, thus showing the ability to learn expression-specific features. In addition, the model is simpler (fewer parameters to be trained) and requires fewer computational resources. This demonstrates that an efficient inclusion of domain knowledge leads to more efficient models in tasks where the respective databases are limited.

Abstract

Facial expressions, by definition, are associated with how emotions are expressed. This makes facial expression an interdisciplinary domain transversal to behavioral science, neurology and artificial intelligence. Facial expression is documented as the most informative means of communication for humans, which is why computer vision applications such as natural human-computer interaction or sign language recognition need an efficient facial expression recognition system.
Facial expression recognition (FER) methods have been extensively studied and achieve impressive performances on the detection of discrete emotions. However, these approaches are only remarkably efficient in controlled environments, i.e., environments where illumination and pose are monitored. FER systems integrated in computer vision applications need to be efficient in real-world scenarios, yet current state-of-the-art methods do not reach accurate expression recognition in such environments.
Deep convolutional neural networks have been widely used in several computer vision tasks
involving object recognition. Recently, deep learning methods have also been applied to facial
expression recognition. Nonetheless, these methods have not reached their full potential in the
FER task as training high capacity models in small datasets, such as the ones available in the FER
field, usually result in overfitting. In this regard, further research of novel deep learning models
and training strategies has crucial significance.
In this dissertation, a novel neural network that integrates different sources of domain knowl-
edge is proposed. The proposed method integrates knowledge transferred from pre-trained networks on similar recognition tasks with prior knowledge of facial expression. The prior knowledge integration is achieved by means of a regressed map with meaningful spatial features for the model. Further experiments and studies were performed to assess whether refined regressed maps of facial landmarks and features transferred from other networks can lead to better results.
The proposed method outperforms the implemented state-of-the-art methods and shows the ability to learn expression-specific features. Besides, the network is simpler and requires less computational resources. Thus, it is demonstrated that an effective use of prior knowledge can
lead to more efficient models in tasks where large datasets are not available.

Acknowledgments

First of all, I would like to thank the Faculty of Engineering of University of Porto and all the
people that I met in university, from teachers to my colleagues and friends, for all the education,
the support and the strength to make me complete this course in these five years and, most impor-
tantly, for making me discover my potential.

To my supervisor, Professor Jaime Cardoso, for all the guidance, support and experience. To
the INESC-TEC, for the facilities, kindness and networking provided. To my second supervisor,
Pedro Ferreira, for all the patience, the availability to help me, the experience, dedication and mo-
tivation when I needed.

To Tiago, for bringing out the dedicated worker that was hidden in me. Thank you as well for
all the motivation, time, patience and joy. To Inês, for being my voice of reason. To Joana, for
bringing out my free spirit. To Rita, for growing up with me and helping me build my personality.

To my parents, for being the best they can be every day, for all the things that I cannot enumerate here and, especially, for all the love.

To my siblings, for annoying me since forever, but most importantly, for being here for me all
the time as well.

To my family, for teaching me moral values and lessons that I will carry all my life. To my
grandfather that is looking down on me somewhere.

To all my friends who handle me and let me be just as I am.

Filipe Marques

Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Goals
  1.4 Contributions
  1.5 Dissertation Outline

2 Background
  2.1 Facial Expressions
  2.2 Pre-processing
  2.3 Feature Descriptors for Facial Expression Recognition
      2.3.1 Local-Binary Patterns (LBP)
      2.3.2 Gabor Filters
  2.4 Learning and classification
      2.4.1 Support Vector Machine (SVM)
      2.4.2 Deep Convolutional Neural Networks (DCNNs)
  2.5 Model Selection

3 State-of-the-Art
  3.1 Face detection (from Viola & Jones to DCNNs)
  3.2 Face Registration
  3.3 Feature Extraction
      3.3.1 Traditional (geometric and appearance)
      3.3.2 Deep Convolutional Neural Networks
  3.4 Expression Recognition
  3.5 Summary

4 Implemented Reference Methodologies
  4.1 Hand-crafted based approaches
  4.2 Conventional CNN
      4.2.1 Architecture
      4.2.2 Learning
      4.2.3 Regularization
  4.3 Transfer Learning Based Approaches
      4.3.1 VGG16
      4.3.2 FaceNet
  4.4 Physiological regularization
      4.4.1 Loss Function
      4.4.2 Supervised Term
      4.4.3 Unsupervised Term

5 Proposed Method
  5.1 Architecture
      5.1.1 Representation Module
      5.1.2 Facial Module
      5.1.3 Classification Module
  5.2 Loss Function
  5.3 Iterative refinement

6 Results and Discussion
  6.1 Implementation Details
  6.2 Relevance Maps
  6.3 Results on CK+
  6.4 Results on SFEW

7 Conclusions

References
List of Figures

2.1 Study of FEs by electrically stimulating facial muscles.
2.2 Examples of AUs on the FACS.
2.3 Illustration of the neutral expression and the six basic emotions.
2.4 A typical framework of an HOG-based face detection method.
2.5 Example of LBP calculation.
2.6 Example of an SVM classifier applied to features extracted from faces.
2.7 Architecture of a deep network for FER.
2.8 Visualization of the activation maps for different layers.
2.9 Batch-Normalization applied to the activations x over a mini-batch.
2.10 Dropout Neural Network Model.
2.11 Common pipeline for model selection.

3.1 Multiple face detection in uncontrolled scenarios.
3.2 Architecture of the Multi-Task CNN for face detection.
3.3 Detection of AUs based on geometric features.
3.4 Spatial representations of the main approaches for feature extraction.
3.5 Framework of the curriculum learning method.

4.1 Illustration of the pre-processing.
4.2 Illustration of the implemented geometric feature computation.
4.3 Architecture of the conventional deep network used.
4.4 Examples of the implemented data augmentation process.
4.5 Network configurations of VGG.
4.6 Original image followed by density maps obtained by a superposition of Gaussians at the location of each facial landmark, with an increasing value of σ.

5.1 Architecture of the proposed network. The relevance maps are produced by regression from the facial component module, which is composed of an encoder-decoder. The maps x̂ are operated (⊗) with the feature representations (f) that are outputted by the representation module and then fed to the classification module, predicting the class probabilities (ŷ).
5.2 Pipeline for feature extraction from Facenet. Only the layers before pooling operations are represented. GAP - Global Average Pooling.
5.3 Facial Module architecture.
5.4 Illustrative examples of the facial landmarks computation for the SFEW dataset.
5.5 Architecture of the proposed network for iterative refinement.

6.1 Samples from CK+ and from SFEW database.
6.2 Examples of predicted relevance maps for different methods used.
6.3 Frame-by-frame analysis of the relevance maps.
6.4 Class Distribution on CK+ database.
6.5 Confusion Matrix of CK+ database.
6.6 Class distribution for SFEW database.
6.7 Confusion Matrix of SFEW database.
List of Tables

6.1 Hyperparameters sets.
6.2 Performance achieved by the traditional baseline methods on CK+.
6.3 CK+ experimental results.
6.4 SFEW experimental results.
Abbreviations

AAM Active Appearance Model


AFEW Acted Facial Expression In The Wild
AU Action Unit
BN Batch-Normalization
CNN Convolutional Neural Network
CK Cohn-Kanade
FACS Facial Action Coding System
FE Facial Expression
FER Facial Expression Recognition
GPU Graphics Processing Unit
HCI Human–Computer Interaction
HOG Histogram of Oriented Gradients
kNN k-Nearest Neighbors
LBP Local binary patterns
MTCNN Multi-Task Convolutional Neural Network
NMS Non-Maximum Suppression
P-NET Proposal Network
PCA Principal Component Analysis
R-NET Refine Network
ReLU Rectified Linear Units
SFEW Static Facial Expressions in the Wild
SVM Support Vector Machine

Chapter 1

Introduction

1.1 Context

In psychology, emotion refers to the conscious and subjective experience that is characterized by mental states, biological reactions and psychological or physiological expressions, i.e., Facial Expressions (FEs). It is common to relate FEs to affect, which can be defined as the experience of emotion and is associated with how the emotion is expressed. Together with voice, language, hands and body posture, FEs form a fundamental communication system between humans in social contexts.
Facial expressions were introduced as a research field by Charles Darwin in his book "The Ex-
pression of the Emotions in Man and Animals" [1]. Darwin questioned whether facial expressions
had some instrumental purpose in the evolutionary history. For example, lifting the eyebrows
might have helped our ancestors respond to unexpected environmental events by widening the
visual field and therefore enabling them to see more. Even though its instrumental function may have been lost, this facial expression remains part of our biological endowment, and we therefore still lift our eyebrows when something surprising happens in the environment, whether seeing more is of any value or not. Since then, FEs have been established as one of the most important cues for human emotion recognition.

1.2 Motivation

Expression recognition is a task that human beings perform daily and effortlessly, but it is not
yet easily performed by computers. By definition, Facial Expression Recognition (FER) involves the identification of cognitive activity, facial feature deformations and facial movements. Computationally speaking, the FER task is performed on static images or image sequences, with the purpose of categorizing them into different abstract classes based on visual information only.
In the last few years, automated FER has attracted much attention in the research community
due to its wide range of applications. In technology and robotic systems, several developed robots
with social skills like Sony’s AIBO and ATR’s Robovie have been developed [2]. In education,

1
2 Introduction

FER can also play a role detecting students’ frustration by improving e-learning experiences [3].
Game industry is already investing in the expansion of gaming experience by adapting difficulty,
music, characters or mission according to the player’s emotional responses [4, 5].
In the medical field, emotional assessment can be used in multiple conditions. For instance,
pain detection is used for monitoring the patient progress in clinical settings and depression recog-
nition from FEs is a very important application for the analysis of psychological distress [6, 7].
Facial Expression also plays a significant role in several diseases. In autism, for instance, emo-
tions are not expressed the same way and the understanding of how basic emotions work and how
they are conveyed in autism could lead to therapy improvements [8]. Deafness also leads to an adaptation of communication, with sign language being the common means of communication. In sign language, FEs play a significant role. Facial and head movements are used in sign languages
at all levels of linguistic structure. At the phonological level, some signs have an obligatory fa-
cial component in their citation form. Facial actions mark relative clauses, content questions and
conditionals, amongst others [9]. Therefore, an integration of automated FER is essential for an
efficient automated sign language recognition system.
Several automated FER methods have been proposed and demonstrated remarkable perfor-
mances in highly controlled environments (i.e., high-resolution frontal faces with uniform back-
grounds). However, the automatic FER in real-world scenarios is still a very challenging task.
Those challenges are mainly related to the inter-individual variability in facial expressiveness and to different acquisition conditions.
Most machine learning methods are task-specific, in which the representation (features) is first extracted and then a classifier is learned from it. Deep learning can be seen as the subset of machine learning methods that are able to jointly learn the representation and the classification of data. Deep learning approaches learn data representations with multiple levels of abstraction, leading to features that traditional methods could not extract. The recent success of deep networks relies on the current availability of large labeled datasets and on advances in GPU technology. In some computer vision tasks, however, large and diverse datasets are scarce, and dedicated training strategies are needed to overcome this. State-of-the-art methods use strategies such as data augmentation, dropout and ReLU activations to reach top results in most object recognition tasks.
FER is one of the cases where only small datasets are available. Current state-of-the-art strategies for deep neural networks achieve satisfactory results in controlled environments, but when applied to expressions in the wild their performance decays abruptly. Therefore, novel strategies for
regularization in deep neural networks are needed in order to develop a robust system that is able
to recognize emotions in natural environments.

1.3 Goals

The purpose of this dissertation is the development of fundamental work on FER to propose a novel
method mainly based on deep neural networks. In particular, the main goal of this dissertation is
the proposal and development of a deep learning model for FER that explicitly models the facial
key-points information along with the expression classification. The underlying idea is to increase
the discriminative ability of the learned features by regularizing the entire learning process and,
hence, improve the generalization capability of deep models to the small datasets of FER.

1.4 Contributions
In this dissertation, our efforts were targeted towards the development and analysis of different
deep learning architectures and training strategies to deal with the problem of training deep models
in small datasets. In this regard, the main contributions of this work can be summarized as follows:

• The implementation of several baseline and state-of-the-art methods for FER, in order to provide a fair comparison and evaluation of different approaches. The implemented methods include traditional methods based on hand-crafted features and state-of-the-art methods based on deep neural networks, such as transfer learning approaches and methods that intend to integrate physiological knowledge on FER.

• Development of a novel deep neural network that, by integrating different sources of prior knowledge, achieves state-of-the-art performances. The proposed method integrates knowledge transferred from pre-trained networks jointly with physiological knowledge on facial expression.

1.5 Dissertation Outline


This dissertation covers a historic overview of expression recognition, followed by the exposition of the FER pipeline, in Chapter 2. Chapter 3 looks over the state of the art on FER, in which the relevant works proposed for each step of the FER pipeline are presented (ranging from face detection to expression recognition). Chapter 4 details the methodology followed for the implementation of baseline and state-of-the-art methods. Chapter 5 focuses on the proposed method. Chapter 6 describes the databases and the implementation details, followed by the results and a discussion of the findings. Finally, Chapter 7 draws the main conclusions of the performed study and discusses future work.
Chapter 2

Background

Human faces generally reflect the inner feelings/emotions and hence facial expressions are sus-
ceptible to changes in the environment. Expression recognition assists in interpreting the states of
mind and distinguishes between various facial gestures. In fact, FE and FER are interdisciplinary
domains standing at the crossing of behavioral science, neurology, and artificial intelligence.
For instance, in early psychology, Mehrabian [10] found that only 7% of the information that a human expresses is conveyed through language, 38% through speech, and 55% through facial expression. FER aims to develop automatic, efficient and accurate systems to distinguish the facial expressions of human beings, so that human emotions such as happiness, sadness, anger, fear, surprise and disgust can be understood from facial expressions. The developments in FER hold potential for computer vision applications, such as natural human computer
interaction (HCI), human emotion analysis and interactive video.
Section 2.1 starts with a historic overview on facial expressions followed by how human emo-
tion can be described. The default pipeline of FER is then detailed, beginning with pre-processing
in section 2.2. A technical explanation on how the main feature descriptors work is then presented
in section 2.3. These descriptors are then fed into a classifier for learning purposes: section 2.4
covers how the learning is processed. The Chapter ends with an overview on a model selection
strategy in section 2.5.

2.1 Facial Expressions

Duchenne de Boulogne believed that the human face worked as a map whose features could be
codified into universal taxonomies of mental states. This led him to conduct one of the first studies on how FEs are produced, by electrically stimulating facial muscles (Figure 2.1) [11]. At
the same time, Charles Darwin also studied FEs and hypothesized that they must have had some
instrumental purpose in the evolutionary history. For instance, constricting the nostrils in disgust
served to reduce inhalation of noxious or harmful substances [12].


Following these works, Paul Ekman claimed that there is a set of facial expressions that are
innate, and they mean that the person making that face is experiencing an emotion [13], defend-
ing the universality of facial expression. Further studies support that there is a high degree of
consistency in the facial musculature among peoples of the world. The muscles necessary to ex-
press primary emotions are found universally and homologous muscles have been documented in
non-human primates [14] [15].

Figure 2.1: Study of FEs by electrically stimulating facial muscles [11].

Physiological specificity is also documented: heart rate and skin temperature vary with basic emotions. For instance, in anger, blood flow to the hands increases to prepare for a fight. Left-frontal asymmetry is greater during enjoyment, while right-frontal asymmetry is greater during disgust. This evidence supports the argument that emotion expressions reliably signal action tendencies [16] [17].
Facial expression signals emotion, communicative intent, individual differences in personal-
ity, psychiatric and medical status and helps to regulate social interaction. With the advent of
automated methods of FER, new discoveries and improvements became possible.
The description of human expressions and emotions can be divided in two main categories:
categorical and dimensional description.
It is common to classify emotions into distinct classes essentially due to Darwin and Ekman
studies [13]. Affect recognition systems aim at recognizing the appearance of facial actions or the
emotions conveyed by the actions. The former set of systems usually relies on the Facial Action
Coding System (FACS) [18]. FACS consists of facial Action Units (AUs), which are codes that
describe facial configurations. Some examples of AUs are presented in Figure 2.2.
The temporal evolution of an expression is typically modeled with four temporal segments:
neutral, onset, apex and offset. Neutral is the expressionless phase with no signs of muscular
activity. Onset corresponds to the period during which muscular contraction begins and increases
in intensity. Apex is a plateau where the intensity usually reaches a stable level, whereas offset is the phase of muscular action relaxation [18]. Usually, the order of these phases is: neutral-onset-apex-offset. The analysis and comprehension of AUs and temporal segments are studied in psychology, and their recognition enables the analysis of sophisticated emotional states such as pain and helps distinguish between genuine and posed behavior [19].

Figure 2.2: Examples of AUs on the FACS [18].
Systems and models that recognize emotions can recognize basic or non-basic emotions. Basic emotions come from the affect model developed by Paul Ekman, which describes six basic and universal emotions: happiness, sadness, surprise, fear, anger and disgust (see Figure 2.3).

Figure 2.3: Illustration of the neutral expression and the six basic emotions. The images are
extracted from the JAFFE database [20].

Basic emotions are believed to be limited in their ability to represent the broad range of every-
day emotions [19]. More recently, researchers have considered non-basic emotion recognition using a variety of alternatives for modeling non-basic emotions. One approach is to define an extra set of emotion classes, for instance, relief or contempt [21]. In fact, the Cohn-Kanade database, a popular FE database, includes contempt as an emotion label.
Another approach, which represents a wider range of emotions, is continuous modeling using affect dimensions [22]. These dimensions include how pleasant or unpleasant a feeling is, how likely the person is to take action under the emotional state, and the sense of control over the emotion. Due to the higher dimensionality of such descriptions, they can potentially describe more complex and subtle emotions. Nonetheless, the richness of the space is more difficult to use for
automatic recognition systems because it can be challenging to link such a described emotion to an FE [12].
For automatic classification systems, it is common to simplify the problem and adopt a categorical description of affect by dividing the space into the limited set of categories defined by Paul Ekman. This is the approach followed in this dissertation.

2.2 Pre-processing
The default pipeline of a FER system includes face detection and alignment as its first step. This is considered a pre-processing of the original image and is covered in this section. Face detection and subsequent alignment can be achieved using classical approaches, such as the Viola&Jones algorithm and HOG descriptors, or by deep learning approaches.

The Viola&Jones object detection framework [23] was proposed by Paul Viola and Michael
Jones in 2001 as the first framework to give competitive object detection rates. It can be used
for detecting objects in real time, but it is mainly applied for face detection. Besides processing
the images quickly, another advantage of the Viola&Jones algorithm is the low false positive rate.
The main goal is to distinguish faces from non-faces. The main steps of this algorithm can be
summarized as follows:
(1) Haar Feature Selection: Human faces have similar properties (e.g., the eyes region is
darker than the nose bridge regions). These properties can be matched using Haar features, also
known as digital image features based upon Haar basis functions. A Haar-like feature considers
adjacent rectangular regions at a specific location in a detection window, sums up the pixel inten-
sities in each region and calculates the difference between these sums. This difference is then used
to categorize subsections of an image.
(2) Creating an Integral Image: The integral image computes a value at each pixel (x, y) that is the sum of the pixel values above and to the left of (x, y), inclusive. This image representation allows computing rectangular features such as Haar-like features, speeding up the extraction process. As each feature's rectangular area is always adjacent to at least one other rectangle, any two-rectangle feature can be computed in just six array references (a code sketch illustrating the integral image is given after this list).
(3) Adaboost Training: The Adaboost is a classification scheme that works by combining
weak learners into a more accurate ensemble classifier. The training procedure consists of multiple
boosting rounds. During each boosting round, the goal is to find a weak learner that achieves the
lowest weighted training error. Then, the weights of the misclassified training samples are raised.
At the end of the training process, the final classifier is given by a linear combination of all weak
learners. The weight of each learner is directly proportional to its accuracy.
(4) Cascading Classifiers: The attentional cascade starts with simple classifiers that are able
to reject many of the negative (i.e., non-face) sub-windows, while keeping almost all positive (i.e.,
face) sub-windows. That is, a positive response from the first classifier triggers the evaluation of
a second and more complex classifier and so on. A negative outcome at any point leads to the
immediate rejection of the sub-window.
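To make the integral-image step concrete, the sketch below builds an integral image with NumPy and evaluates a simple two-rectangle Haar-like feature (the difference between two adjacent bands). The window size, coordinates and pixel values are hypothetical and purely illustrative; they are not the ones used by the original detector.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y+1, :x+1], computed by cumulative sums."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of pixels in the rectangle rows y0..y1-1, cols x0..x1-1 (4 array references;
    two adjacent rectangles share a boundary, hence the six references in total)."""
    total = ii[y1 - 1, x1 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

# Toy 24x24 gray-scale detection window (hypothetical values).
window = np.random.randint(0, 256, (24, 24)).astype(np.float64)
ii = integral_image(window)

# Two-rectangle Haar-like feature: darker band (e.g., eyes) minus the band below it.
top = rect_sum(ii, 4, 4, 10, 20)
bottom = rect_sum(ii, 10, 4, 16, 20)
print(top - bottom)   # the feature value used by a weak classifier
```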

Another common method for face detection is the extraction of HOG descriptors to be fed
into a Support Vector Machine (SVM) classifier. The basic idea is that local object appearance and
shape can often be characterized rather well by the distribution of local intensity gradients or edge
directions, even without precise knowledge of the corresponding gradient or edge positions. The
HOG representation has several advantages. It captures the edge or gradient structure that is very
characteristic of local shape, and it does so in a local representation with an easily controllable
degree of invariance to local geometric and photometric transformations: translations or rotations
make little difference if they are much smaller than the local spatial or orientation bin size [24].
In practice, this is implemented by dividing the image frame into cells and, for each cell,
a local 1-D histogram of gradient directions or edge orientations, over the pixels of the cell, is
created. The combined histogram entries form the representation [24]. The feature vector is then
fed into an SVM classifier to find whether there is a face in the image or not. A representation of
this framework can be found in Figure 2.4.

Figure 2.4: A typical framework of an HOG-based face detection method [24].
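A minimal sketch of this HOG-plus-SVM pipeline is given below, assuming scikit-image and scikit-learn are available; `windows` and `labels` are hypothetical arrays of equal-sized gray-scale patches and face/non-face labels, and the descriptor parameters are illustrative.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(patch):
    # 9-bin gradient-orientation histograms over 8x8-pixel cells,
    # normalized in 2x2-cell blocks, concatenated into one feature vector.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# Hypothetical training data: gray-scale windows and 1 (face) / 0 (non-face) labels.
windows = np.random.rand(20, 64, 64)
labels = np.random.randint(0, 2, 20)

X = np.array([hog_descriptor(w) for w in windows])
clf = LinearSVC(C=1.0).fit(X, labels)

# At test time, the same descriptor is computed for each sliding window.
print(clf.predict(X[:5]))
```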

More recently, deep learning methods have shown efficiency in most computer vision tasks
and hold the state of the art in face detection as well. A detailed description on the fundamentals
of deep learning can be found in section 2.4.2 and the state of the art in deep learning methods for
face detection in section 3.1.

2.3 Feature Descriptors for Facial Expression Recognition


After face detection, the facial changes caused by facial expressions have to be extracted. This
subsection presents two of the most widely used feature descriptors, namely Local Binary Patterns
(LBP) and Gabor filters.

2.3.1 Local-Binary Patterns (LBP)


Local Binary Patterns (LBP) were first presented in [25] to be used in texture description. The
basic method labels each pixel with decimal values called LBPs or LBP codes, to describe the local
structure around each pixel. As illustrated in Figure 2.5, the value of the center pixel is subtracted
from the 8-neighbor pixels’ values; if the result is negative the binary value is 0, otherwise 1.
The calculation starts from the pixel at the top left corner of the 8-neighborhood and continues in
clockwise direction. After calculating with all neighbors, an eight digit binary value is produced.
When this binary value is converted to decimal, the LBP code of the pixel is generated, and placed
at the coordinates of that pixel in the LBP matrix.

Figure 2.5: Example of LBP calculation, extracted from [26].
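The basic operator described above can be written in a few lines of NumPy, as in the sketch below: each interior pixel's 8 neighbors are thresholded against the center value, read clockwise from the top-left corner, and converted into a decimal LBP code.

```python
import numpy as np

def lbp_codes(img):
    """Basic 3x3 LBP: threshold the 8 neighbors against the center pixel,
    read them clockwise from the top-left neighbor, and build a decimal code."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]      # clockwise order
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = img[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # The first neighbor visited becomes the most significant bit.
        codes |= (neighbor >= center).astype(np.uint8) << (7 - bit)
    return codes

img = np.random.randint(0, 256, (6, 6), dtype=np.uint8)   # toy gray-scale patch
print(lbp_codes(img))                                     # the LBP matrix
```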

2.3.2 Gabor Filters

The Gabor filter is one of the most popular approaches for texture description. Gabor filter-based feature extraction consists in applying a Gabor filter bank to the input image, defined by parameters including frequency (f), orientation (θ) and the smoothing parameters of the Gaussian envelope (σ). This makes the approach invariant to illumination, rotation, scale and translation. Gabor filters are based on the following function [27]:
Ψ(u, v) = e^(−(π²/f²)(γ²(u′ − f)² + η²v′²))   (2.1)

u′ = u cos θ + v sin θ   (2.2)

v′ = −u sin θ + v cos θ   (2.3)

In the frequency domain (Eqs. 2.1–2.3), the function is a single real-valued Gaussian centered at f. γ is the sharpness (bandwidth) along the Gaussian major axis and η is the sharpness along the minor axis (perpendicular to the wave). In the given form, the aspect ratio of the Gaussian is η/γ. Gabor features, also referred to as Gabor jets, Gabor banks or multi-resolution Gabor features, are constructed from the responses of Gabor filters by using multiple filters with several frequencies and orientations. The scales of a filter bank are selected with exponential spacing and the orientations with linear spacing. These filters are then convolved with the image, in order to obtain different representations of the image to be used as descriptors.
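As a sketch of how such a filter bank can be applied, the code below builds spatial-domain Gabor kernels for a few exponentially spaced frequencies and linearly spaced orientations and convolves them with an image. The kernel construction, frequencies and image are illustrative assumptions, not the exact configuration used elsewhere in this work.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(f, theta, gamma=1.0, eta=1.0, size=31):
    """Real part of a spatial-domain Gabor filter with frequency f and orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(f ** 2) * ((xr / gamma) ** 2 + (yr / eta) ** 2))
    return envelope * np.cos(2 * np.pi * f * xr)    # Gaussian envelope times a carrier wave

frequencies = [0.25 / (2 ** k) for k in range(3)]   # exponential spacing
orientations = [k * np.pi / 4 for k in range(4)]    # linear spacing

image = np.random.rand(64, 64)                      # hypothetical gray-scale face crop
responses = [fftconvolve(image, gabor_kernel(f, t), mode='same')
             for f in frequencies for t in orientations]
features = np.stack(responses)                      # 12 filtered representations as descriptors
print(features.shape)                               # (12, 64, 64)
```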

2.4 Learning and classification

Given a collection of extracted features, it is necessary to build a model capable of correctly separating and classifying the expressions. Traditional FER systems use a three-stage training procedure: (i) feature extraction/learning, (ii) feature selection, and (iii) classifier construction. On the other hand, FER systems based on deep learning techniques combine these three steps into a single step. This section presents an overview of one of the most widely used traditional classifiers, the Support Vector Machines (SVMs), as well as one of the most relevant deep learning approaches, the Convolutional Neural Networks (CNNs).

2.4.1 Support Vector Machine (SVM)

Support Vector Machine [28] performs an implicit mapping of data into a higher (potentially
infinite) dimensional feature space, and then finds a linear separating hyperplane with the maximal
margin to separate data in this higher dimensional space. Given a training set of labeled examples
a new test example x is classified by the following function:

f(x) = sgn( ∑_{i=1}^{l} α_i y_i K(x_i, x) + b ),   (2.4)

where αi are Lagrange multipliers of a dual optimization problem that describe the separating
hyperplane, K is a kernel function, and b is the threshold parameter of the hyperplane. The training
sample xi with αi > 0 is called a support vector, and SVM finds the hyperplane that maximizes the
distance between the support vectors and the hyperplane. SVM allows domain-specific selection
of the kernel function. Though new kernels are being proposed, the most frequently used kernel
functions are the linear, polynomial, and Radial Basis Function (RBF) kernels. SVM makes binary
decisions, so the multi-class classification is accomplished by using, for instance, the one-against-
rest technique, which trains binary classifiers to discriminate one expression from all others, and
outputs the class with the largest output of binary classification. The selection of the SVM hyper-
parameters can be optimized through a k-fold cross-validation scheme. The parameter setting
producing the best cross-validation accuracy is picked [29].
In general, SVMs exhibit good classification accuracy even when only a modest amount of
training data is available, making them particularly suitable to expression recognition [30]. Figure
2.6 represents a possible pipeline for FER using feature descriptors along with SVM classifier.

Figure 2.6: Example of an SVM classifier applied to features extracted from faces [30].
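A minimal sketch of such a pipeline with scikit-learn is shown below; `features` stands for descriptors (e.g., LBP or Gabor responses) extracted from aligned faces and `labels` for the corresponding expression classes, both hypothetical here. The RBF kernel hyper-parameters are selected by k-fold cross-validation, as described above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical descriptors (e.g., concatenated LBP histograms) and expression labels (0-6).
features = np.random.rand(100, 256)
labels = np.random.randint(0, 7, 100)

# RBF SVM with a one-vs-rest decision function; C and gamma chosen by 5-fold cross-validation.
param_grid = {'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.001]}
search = GridSearchCV(SVC(kernel='rbf', decision_function_shape='ovr'),
                      param_grid, cv=5)
search.fit(features, labels)

print(search.best_params_)            # the setting with the best cross-validation accuracy
print(search.predict(features[:5]))   # predicted expression classes
```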

2.4.2 Deep Convolutional Neural Networks (DCNNs)

Recently, deep learning methods have been shown to be efficient in many computer vision tasks, such as pattern recognition, character recognition, object recognition and autonomous robot driving. Deep learning models are composed of consecutive processing layers that learn
representations of data with multiple levels of abstraction, capturing features that traditional meth-
ods could not compute. One of the factors that allows the computation of complex features is the
back-propagation algorithm that indicates how a machine should change its internal parameters to
compute new representations of input data [31].

The emergent success of CNNs on recognition and segmentation tasks can be explained by 3
factors: (1) The availability of large labeled training sets; (2) The recent advances in GPU tech-
nology, which allows training large CNNs in a reasonable computation time; (3) The introduc-
tion of effective regularization strategies that greatly improve the model generalization capacity.
However, in the FER context, the availability of large training sets is scarce, raising the need for strategies to improve the models.

CNNs learn to extract the features directly from the training database using iterative algorithms
like gradient descent. An ordinary CNN learns its weights using the back-propagation algorithm.
A CNN has two main components, namely, local receptive fields and shared weights. In local
receptive fields, each neuron is connected to a local group of the input space. The size of this
group of the input space is equal to the filter size, where the input space can be either pixels from the input image or features from the previous layer. In a CNN, the same weights and biases are used over all local receptive fields, which significantly decreases the number of parameters of the model. However, the increased complexity and depth of a typical CNN architecture make it prone to overfitting [32]. CNNs can have multiple architectures, but the standard is a series
of convolutional layers that produce a certain amount of feature maps given by the number of
filters defined for the convolutions, leading to different image representations. This is followed
by pooling layers. Max pooling, the most common pooling layer, applies a max filter to (usually) non-
overlapping subregions of the initial representation, reducing the dimensionality of the current
representation. Then, these representations are fed into fully-connected layers that can be seen as
a multilayer perceptron that aims to map the activation volume, from the combination of previous
different layers, into a class probability distribution. The network is followed by an affine layer
that computes the scores [33] [34].
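A small Keras sketch of this convolution/pooling/fully-connected pattern is shown below; the layer sizes, input shape and number of classes are illustrative and do not correspond to any network evaluated later in this work.

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(48, 48, 1), n_classes=7):
    model = models.Sequential([
        # Convolutional layers produce stacks of feature maps (one per filter).
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),                 # reduces spatial dimensionality
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        # Fully-connected layers map the activation volume to class probabilities.
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

build_cnn().summary()
```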

Figure 2.7 represents a possible network for facial expression recognition applying regulariza-
tion methods.

It is useful to understand the features that are being extracted by the network in order to
understand how the training and classification is performed. Figure 2.8 shows the visualization
of the activation maps of different layers. It can be seen that, the deeper the layers are, the more
sparse and localized the activation maps become [34].
Figure 2.7: The architecture of the deep network proposed in [34] for FER.

Figure 2.8: Visualization of the activation maps for different layers from [34].

2.4.2.1 Activation Functions

An activation function is a non-linear transformation that defines the output of a specific node
given a set of inputs. Activation functions decide whether a neuron should be activated so they
assume an important role in the network design. The commonly used activation functions are
presented as follows:

• Linear Activation: The activation is proportional to the input. The input x, will be trans-
formed to ax. This can be applied to various neurons and multiple neurons can be activated
at the same time. The issue with a linear activation function is that the whole network is
equivalent to a single layer with linear activation.

• Sigmoid function: In general, a sigmoid function is real-valued, monotonic, and differen-


tiable having a non-negative first derivative which is bell shaped. The function ranges from
0 to 1 and has an S shape. This means that small changes in x bring large changes in the out-
put, Y . This is desired when performing a classification task since it pushes the predictions
to extreme values. The sigmoid function can be written as follows:

Y = 1 / (1 + e^(−x))   (2.5)

• ReLU: ReLU is the most widely used activation function since it is known for having better
fitting abilities than the sigmoid function [35]. ReLU function is non linear so it back-
propagates the error. ReLU can be written as:

Y = max(0, x) (2.6)

It gives an output equal to x if x is positive and 0 otherwise. Only specific neurons are
activated, making the network sparse and efficient for computation.

• Softmax: For classification problems, the output is commonly a multi-class problem. The sigmoid function can only handle two classes, so softmax is used to output the probabilities of each class. The softmax function converts the outputs of each unit to values between 0 and 1, just like a sigmoid function, but it also divides each output such that the total sum of the outputs is equal to 1. The output of the softmax function is therefore equivalent to a categorical probability distribution. Mathematically, the softmax function is shown below:

  σ(z)_j = e^(z_j) / ∑_{k=1}^{K} e^(z_k),   (2.7)

  where z is the vector of inputs to the output layer and j indexes the output units, so j = 1, 2, ..., K. A small numerical sketch of these activation functions is given after this list.
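The sketch below evaluates the sigmoid, ReLU and softmax functions of Eqs. 2.5–2.7 on a small NumPy vector; subtracting the maximum input before exponentiating is a standard numerical-stability detail, not part of the definitions above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # Eq. 2.5

def relu(x):
    return np.maximum(0.0, x)              # Eq. 2.6

def softmax(z):
    z = z - z.max()                        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()                     # Eq. 2.7: outputs sum to 1

z = np.array([2.0, -1.0, 0.5])
print(sigmoid(z))   # values in (0, 1)
print(relu(z))      # negative inputs clipped to 0
print(softmax(z))   # a categorical probability distribution
```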

2.4.2.2 Regularization

As mentioned previously, CNNs can easily overfit. To avoid overfitting, regularization methods can be applied. Regularization techniques can be seen as the imposition of certain prior distributions on the model parameters.

Batch-Normalization is a method known for reducing internal covariate shift in neural


networks [36].
To increase the stability of a neural network, batch normalization normalizes the output of a
previous activation layer by subtracting the batch mean and dividing by the batch standard devia-
tion. In figure 2.9 a representation of the Batch-Normalization transform is presented.

Figure 2.9: Batch-Normalization applied to the activations x over a mini-batch. Extracted from
[36]

In the notation y = BNγ,β (x), the parameters γ and β have to be initialized and will be learned.
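The transform of Figure 2.9 can be written directly in NumPy, as in the sketch below: the activations of a mini-batch are normalized with the batch statistics and then scaled and shifted by the learnable parameters γ and β. The input values are hypothetical.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-Normalization over a mini-batch x of shape (batch, features)."""
    mu = x.mean(axis=0)                       # mini-batch mean
    var = x.var(axis=0)                       # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift with learned parameters

x = np.random.randn(8, 4) * 3.0 + 2.0         # hypothetical activations
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))          # approximately 0 and 1 per feature
```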

Dropout is widely used to train deep neural networks. Unlike other regularization techniques that modify the cost function, dropout modifies the architecture of the model, since it forces the network to drop different neurons across iterations. Dropout can be used between convolutional layers or only in the classification module. A dropped neuron's contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to that neuron on the backward pass [37].
Dropout reduces complex co-adaptations of neurons. Since a neuron cannot rely on the presence of particular other neurons, it is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons (see Figure 2.10).

Figure 2.10: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Extracted from [37].

Data-augmentation is one of the most common methods to reduce overfitting on image data by artificially enlarging the dataset using label-preserving transformations. The main techniques are classified as data warping, an approach that seeks to directly augment the input data to the model in the data space [38]. The generic practice is to perform geometric and color augmentation. For each input image, a new image is generated that is shifted, zoomed in/out, rotated, flipped, distorted, or shaded with a hue. Both the image and its duplicate are fed into the neural net.
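A sketch of such label-preserving augmentation with the Keras ImageDataGenerator is given below; the transformation ranges are illustrative, and `x_train`/`y_train` are hypothetical arrays of face crops and expression labels.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Geometric, label-preserving transformations applied on the fly to each batch.
augmenter = ImageDataGenerator(
    rotation_range=10,        # small random rotations
    width_shift_range=0.1,    # horizontal shifts
    height_shift_range=0.1,   # vertical shifts
    zoom_range=0.1,           # zoom in/out
    horizontal_flip=True,     # mirror the face
)

x_train = np.random.rand(32, 48, 48, 1)      # hypothetical face crops
y_train = np.random.randint(0, 7, 32)        # hypothetical expression labels

# Each batch contains randomly transformed copies of the original images.
images, labels = next(augmenter.flow(x_train, y_train, batch_size=8))
print(images.shape)                           # (8, 48, 48, 1)
```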

L1 and L2 regularization are traditional regularization strategies that consist in adding a penalty term to the objective function and controlling the model complexity through that penalty term. L1 and L2 are common regularization techniques not only in deep neural networks but in machine learning algorithms in general. L1 regularization uses a penalty term which encourages the sum of the absolute values of the parameters to be minimal. It has frequently been observed that L1 regularization in many models forces parameters to equal zero, so that the parameter vector is sparse. This makes it a natural candidate for feature selection.
L2 regularization can be seen as an adaptive minimization of the squared error with a penalization term, which penalizes less influential features (features that have very little influence on the dependent variable) more heavily. A high penalization term can lead to underfitting. Therefore, this term needs an optimal value to prevent both overfitting and underfitting [39].
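In practice the penalty term is attached to the layer weights; the Keras sketch below adds an L2 penalty to a dense layer (an L1 penalty would use regularizers.l1 instead). The coefficient 1e-4 is an illustrative value that would be tuned by validation.

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Flatten(input_shape=(48, 48, 1)),
    # L2 penalty on the weights: the loss gains a term 1e-4 * sum(w**2).
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),
    # regularizers.l1 would instead push many weights to exactly zero (sparsity).
    layers.Dense(7, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
```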

Early-Stopping is a strategy that decides when the model stops training by monitoring the validation-set metrics. An indicator that the network is overfitting to the training data is that the loss on the validation set stops improving for a certain number of epochs. To avoid this, early stopping is implemented: the network stops training when it reaches a certain number of epochs without improvement on the validation set.
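A sketch of early stopping with a Keras callback is shown below: training halts once the validation loss has not improved for `patience` epochs, and the best weights are restored. The data arrays and small network are hypothetical placeholders.

```python
import numpy as np
from tensorflow.keras import layers, models, callbacks

x = np.random.rand(200, 48, 48, 1)            # hypothetical images
y = np.random.randint(0, 7, 200)              # hypothetical labels

model = models.Sequential([
    layers.Flatten(input_shape=(48, 48, 1)),
    layers.Dense(64, activation='relu'),
    layers.Dense(7, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Stop when the validation loss has not improved for 10 consecutive epochs.
stopper = callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                  restore_best_weights=True)
model.fit(x, y, validation_split=0.2, epochs=200, callbacks=[stopper], verbose=0)
```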

2.5 Model Selection


Model selection is the task of selecting a model from a set of candidate models. Commonly, model selection strategies are based on validation. Validation consists in partitioning a set of the training data and using this subset to validate the predictions from the training. It intends to assess how the results of a model will generalize to an independent data set, as illustrated in Figure 2.11.

Figure 2.11: Common pipeline for model selection. Extracted from [40].

Different pipelines can be designed using validation for model selection. Commonly, a set of data is split into three subsets: the train set, on which the model will be trained; the validation set, on which the model will be validated; and the test set, on which the model performance is assessed. Different models, or models with different hyper-parameters, are validated on the validation set. The hyper-parameter optimization is typically carried out by a grid-search approach. Grid search is an exhaustive search through a manually specified subset of the hyper-parameter space. A step-by-step description of the grid-search pipeline for model selection is presented as follows (a code sketch is given after the list):

1. The data set is split randomly, with user-independence between the sets, P times in three
sub-sets: train-set, validation-set and test-set.

2. Sets of hyper-parameters to be optimized are defined. Let A and B be two hyper-parameter sets to be optimized; each value of set A is defined as ai (i = 1, ..., I) and each value of set B as bj (j = 1, ..., J).

3. The Cartesian product of the two sets, A and B, is performed, returning a set of pairs (ai, bj), for each of which a model will be trained. In the end, I × J models are trained.

4. Each model is evaluated on the validation set, returning a specific metric value.

5. The models are ordered by their performance on the validation set. The set of hyper-parameters that produces the best model is selected.

6. The model with the selected hyper-parameters is evaluated on the test set of split p (with
p = 1, ..., P splits).

7. The performance of the algorithm corresponds to the average value of the performance of
the selected model on the P splits.
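The sketch below mirrors the procedure above for a single split, using scikit-learn utilities; `make_model` is a hypothetical factory that builds a classifier from a pair of hyper-parameters, and the grid values and data are illustrative.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def make_model(c, gamma):
    # Hypothetical model factory; any estimator with fit/predict would do.
    return SVC(C=c, gamma=gamma)

X = np.random.rand(300, 64)
y = np.random.randint(0, 7, 300)

# Step 1: split into train, validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 2-3: Cartesian product of the hyper-parameter sets A and B.
A = [1, 10, 100]          # values a_i
B = [0.1, 0.01, 0.001]    # values b_j

# Steps 4-5: train I x J models, evaluate each on the validation set, keep the best.
scores = {}
for a_i, b_j in product(A, B):
    model = make_model(a_i, b_j).fit(X_train, y_train)
    scores[(a_i, b_j)] = accuracy_score(y_val, model.predict(X_val))
best = max(scores, key=scores.get)

# Step 6: evaluate the selected configuration on the test set of this split.
final = make_model(*best).fit(X_train, y_train)
print(best, accuracy_score(y_test, final.predict(X_test)))
```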
Chapter 3

State-of-the-Art

Automatic Facial Expression Recognition (FER) can be summarized in four steps: Face Detection,
Face Registration, Feature Extraction and Expression Recognition. These four steps encompass
methods and techniques that will be covered in the next sections.

3.1 Face detection (from Viola & Jones to DCNNs)

Face detection is usually the first step for all automated facial analysis systems, for instance face
modeling, face relighting, expression recognition or face authentication systems. Given an image,
the goal is to detect faces in the image and return their location in order to process these images.
There are some factors that may compromise and determine the success of face detection. Beyond
image conditions and acquisition protocols, different camera-face poses can lead to different views
from a face. Furthermore, structural components such as beards or glasses introduce variability
in the detection, leading to occlusions [41].
For RGB images, the algorithm of Viola & Jones [23] is still one of the most used face detec-
tion methods. It was proposed in 2001 as the first object detection framework to provide compet-
itive face detection in real time. Since it is essentially a 2D face detector, it can only generalize
within the pose limits of its training set and large occlusions will impair its accuracy.
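As an illustration of how this detector is typically used in practice, the sketch below runs OpenCV's bundled pre-trained frontal-face Haar cascade on a gray-scale image; the input path `face.jpg` and the detection parameters are illustrative assumptions.

```python
import cv2

# OpenCV ships a pre-trained frontal-face Haar cascade with the library.
cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
detector = cv2.CascadeClassifier(cascade_path)

image = cv2.imread('face.jpg')                       # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Multi-scale sliding-window detection; returns (x, y, w, h) boxes.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                  minSize=(30, 30))
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite('faces_detected.jpg', image)
```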

Some methods overcome these weaknesses by building different detectors for different views
of the face [42], by introducing robustness to luminance variation [43], or by improving the weak
classifiers. Bo Wu et al. [44] proposed using a single Haar-like feature to compute an equal-bin histogram that is then used in a RealBoost learning algorithm. In [45], a new weak classifier, the Bayesian stump, is proposed. Features such as LBP can also be used to improve invariance to image conditions. Hongliang Jin et al. [46] apply LBP in a Bayesian framework and Zhang et al. [47] combine LBP with a boosting algorithm that uses a multi-branch regression tree as its weak classifier. Another feature set can be found in [24], which applies an SVM over grids of
histograms of oriented gradient (HOG) descriptors.


Convolutional Neural Networks (CNNs) have been widely used in image segmentation or clas-
sification tasks as well as for face localization. Convolutional networks are specifically designed
to learn invariant representations of images as they can easily learn the type of shift-invariant local
features that are relevant to face detection and pose estimation. Therefore, CNN-based face de-
tectors outperform the traditional approaches, especially in unconstrained scenarios, in which there
is a large variability of face-poses, viewing angles, occlusions and illumination conditions. Some
examples of face detection in unconstrained scenarios can be found in Figure 3.1. They can also
be replicated in large images at a small computational cost when compared with the traditional
methods mentioned before [48].

Figure 3.1: Multiple face detection in uncontrolled scenarios using the CNN-based method pro-
posed in [49].

In [48] a CNN detects and estimates pose by minimizing an energy function with respect to the
face/non-face binary variable and the continuous pose parameters. This way, the trained algorithm
is capable of handle a wide range of poses without retraining, outperforming traditional methods.
The work of Haoxiang Li et al. [49] also takes advantage of the CNN discriminative capacity
proposing a cascade of CNNs. The cascade operates at different resolutions in order to quickly
discard obvious non-faces and carefully evaluate the small number of strong candidates. Besides achieving state-of-the-art performance, the algorithm is capable of fast face detection.
Another state-of-the-art approach is the work of Kaipeng Zhang et al. [50], in which a deep cascaded multi-task framework (MTCNN) is designed to detect faces and facial landmarks. The method consists of three stages: in the first stage, it produces candidate windows. Then, it refines the windows by rejecting a large number of non-face windows through a more complex CNN. Finally, it uses a more powerful CNN to refine the result again and output five facial landmark positions. The schema of the MTCNN is illustrated in Figure 3.2. In particular, the input image is resized to different scales, forming the input to a three-stage cascaded framework:
Stage 1: First, a fully convolutional network, called Proposal Network (P-Net), is imple-
mented to obtain the candidate facial windows and their bounding box regression vectors. Then,
candidates are calibrated based on the estimated bounding box regression vectors. Non-maximum
suppression (NMS) is performed to merge highly overlapped candidates.
Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further
rejects a large number of false candidates, performs calibration with bounding box regression, and
conducts NMS.

Stage 3: This stage is similar to the second stage, but the purpose of this stage is to identify
face regions with more supervision. In particular, the network outputs five facial landmark positions.

Figure 3.2: Architecture of the Multi-Task CNN for face detection [50].

In this network there are three main tasks to be trained: face/non-face classification, bounding
box regression, and facial landmark localization. For the face/non-face classification task, the goal
of training is to minimize the traditional loss function for classification problems, the categorical
cross-entropy. The other two tasks (i.e., the bounding box and landmark localization) are treated
as a regression problem, in which the Euclidean loss has to be minimized [50].
Since this method holds the state of the art for face detection, the MTCNN is used as face
detector in this dissertation.
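
As an illustration of how this detector can be invoked in practice, the sketch below assumes the publicly available mtcnn Python package, which implements the cascade of [50]; it is not necessarily the exact implementation used in this work.

# Hedged sketch: face detection with the "mtcnn" Python package (an
# assumption; any implementation of the cascade in [50] could be used).
import cv2
from mtcnn import MTCNN

detector = MTCNN()
image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)

# Each detection holds a bounding box, a confidence score and the five
# facial landmarks produced by the last stage of the cascade.
for detection in detector.detect_faces(image):
    x, y, w, h = detection["box"]
    print(detection["confidence"], (x, y, w, h), detection["keypoints"])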

3.2 Face Registration


Once the face is detected, many FER methods require a face registration step for face alignment.
During the registration, fiducial points (or landmarks) are detected, allowing the alignment of the
face to different poses and deformations. These facial key-points can also be used to compute
localized features. Interest points combined with local descriptors provide reliable and repeatable
measurements from images for a wide range of applications, capturing the essence of a scene
without the need for semantic-level interpretation [51]. Landmark localization is thus an essential step,
as these fiducial points can be used for face alignment and to compute meaningful features for
FER [12]. Key-points are mainly located around facial components such as eyes, mouth, nose and
chin. These key-points can be computed either using scale invariant feature transform (SIFT) [52]
[53] or through a CNN where facial landmark localization is taken as a regression problem [50].

3.3 Feature Extraction


Facial features can be extracted using different approaches and techniques that will be covered in
this sub-section. Feature extraction approaches can be broadly classified into two main groups:
hand-crafted features and learned features, which can be applied locally or to the global im-
age. Concerning the temporal information, algorithms can also be further divided into static or
dynamic.

3.3.1 Traditional (geometric and appearance)

Hand-crafted features can be divided into appearance or geometric. Geometric features describe
faces using distances and shapes of fiducial points (landmarks). Many geometric-based FER meth-
ods recognize expressions by first detecting AUs and then decoding a specific expression from
them. As an example, [54] addresses the recognition of facial actions through landmark dis-
tances, taking as a prior the fact that facial actions involved in spontaneous emotional expressions
are more symmetrical, involving both the left and the right side of the face. Figure 3.3 represents
the recognition of one AU.

Figure 3.3: Detection of AUs based on geometric features used in [54].

Geometric features can provide useful information when tracked along the temporal axis. Such an ap-
proach can be found in [55], in which a model for dynamic facial expression recognition based
on landmark localization is proposed. Geometric features can also be used to build an active
appearance model (AAM), the generalization of a statistical model of the shape and gray-level ap-
pearance of the object of interest [56]. AAM is often used for deriving representations of faces for
facial action recognition [57] [58]. Local geometric feature extraction approaches aim to describe
deformations in motions or localized regions of the face. An example is the work proposed
by Stefano Berretti et al. [59], which describes local deformations (given by the key-points) through
SIFT descriptors. Dynamic descriptors for local geometric features are based on landmark dis-
placements coded with motion units [60], [61] or deformation of facial elements as eyes, mouth
or eyebrows [62][63].
From the literature it is clear that geometric features are effective in describing facial
expressions. However, effective geometric feature extraction depends heavily on accurate
facial key-point detection and tracking. In addition, geometric features are not able to encode
relevant information caused by skin texture changes and expression wrinkles.
Appearance features are based on filters applied to the image to extract the appearance
changes on the face. Early global appearance methods focused on Gabor-wavelet representations
[64] [65]; the most popular global appearance features remain Gabor filters and LBPs.
In [66] and [67], the input face images are convolved with a bank of Gabor filters to extract
multi-scale and multi-orientation coefficients that are invariant to illumination, rotation, scale
and translation. LBPs are widely used for feature extraction in facial expression recognition for
their tolerance to illumination changes and their computational simplicity. Caifeng Shan et
al. [29] implement LBPs as feature descriptors, using AdaBoost to learn the most discrim-
inative LBP features and an SVM as the classifier. In general, such methods have limitations
on generalization to other datasets. Other global appearance methods are based on the Bag of
Words (BoW) approach. Karan Sikka et al. [68] explore BoW for appearance-based dis-
criminative feature extraction, combining highly discriminative Multi-Scale Dense SIFT (MSDF)
features with spatial pyramid matching (SPM).
Dynamic global appearance features are an extension to the temporal domain. In [69] local
binary pattern histograms from three orthogonal planes (LBP-TOP) are proposed. Bo Sun et al.
[70] use a combination of LBP-TOP and local phase quantization from three orthogonal planes
(LPQ-TOP), a descriptor similar to LBP-TOP but more robust to blur. Often a combination of
different descriptors is used to form hybrid models. An example is the work proposed in [71],
where LPQ-TOP is used along with local Gabor binary patterns from three orthogonal planes
(LGBP-TOP). Figure 3.4 shows the most commonly used appearance-based feature extraction
approaches.
Local appearance features require previous knowledge of regions of interest such as the mouth,
eyes or eyebrows. Consequently, their performance is dependent on the localization and tracking
of these regions. In [72], the appearance of gray-scale frames is described by spreading an array
of cells across the mouth and extracting the mean intensity from each cell. The features are then
modeled using an SVM. A Gray Level Co-occurrence Matrix is used in [73] as feature descriptor
of specific regions of interest.
There is no definitive answer as to which feature extraction method is better; it depends on the
problem and/or the AUs to be detected. Therefore, it is common to combine both types of appearance-based
methods, as this is usually associated with an increase in performance [74] [75]. FER methods
based on these traditional descriptors achieve remarkable results, but mainly in con-
trolled scenarios. The performance of these features decreases dramatically in unconstrained


Figure 3.4: Spatial representations of main approaches for feature extraction: (1) Facial Points, (2)
LBP histograms, (3) LPQ histograms, (4) Gabor representation, (5) SIFT descriptors, (6) dense
Bag of Words [19].

environments where face images cover complex and large intra-personal variations such as pose,
illumination, expression and occlusion. The challenge is then to find an ideal facial representation
which is robust for facial expression recognition in unconstrained environments. As described
in the following subsection, the recent success of deep learning approaches, especially those using
CNNs, has been extended to the FER problem.

3.3.2 Deep Convolutional Neural Networks

The state of the art for FER is mostly composed of deep learning-based methods. For instance,
the works [76][77][78] are some implementations of CNNs for expression recognition,
holding state-of-the-art performance on a public dataset of uncontrolled environments (SFEW).
Zhiding Yu et al. [77] use ensembles of CNNs, a commonly used strategy to reduce the model
variance and, hence, improve the model performance. Ensembles of networks can be seen as
multiple networks initialized with different weights, leading to different responses. The outputs in
this case are averaged, but they can be merged by other means, such as majority voting.
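
As a minimal illustration of such output averaging, the sketch below assumes several Keras models already trained with different random initializations; the model file names are hypothetical.

# Hedged sketch of ensemble averaging over CNN softmax outputs.
# The model file names below are hypothetical.
import numpy as np
from tensorflow.keras.models import load_model

models = [load_model(p) for p in ("cnn_a.h5", "cnn_b.h5", "cnn_c.h5")]

def ensemble_predict(x):
    # Average the per-class probabilities of all ensemble members;
    # majority voting over their argmax would be an alternative merge.
    probs = np.stack([m.predict(x) for m in models], axis=0)
    return probs.mean(axis=0)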
Another commonly used technique is transfer learning: a CNN is previously trained, usually
on some related dataset, and then fine-tuned on the target dataset. Mao Xu et al. [79] propose a
facial expression recognition model based on features transferred from CNNs trained for face identification.
The model transfers high-level features from face identification to
classify expressions into one of seven discrete emotions with a multi-class SVM classifier. In [80],
another transfer learning method is proposed. It uses AlexNet [35], a
well-known network trained on a large-scale database, as the pre-trained network. Since the target dataset is 2D gray-scale, the
authors perform image transformations to convert the 2D gray-scale images to 3D values.
Another strategy to work around limited datasets is the use of artificial data. Generative Ad-
versarial Networks (GANs) [81] generate artificial examples that must be identified by a discrim-
inator. GANs have already been used in face recognition. The method proposed in [82] generates
meaningful artifacts for state-of-the-art face recognition algorithms, for example glasses, leading
to more data and preventing real noise from influencing the network's final prediction. Such approaches
can possibly be extended to FER as well.
To solve the problem of model generalization on FER methods, curriculum learning [83] can
be a viable option. Initially, the weights favor “easier” examples, or examples illustrating the sim-
plest concepts, that can be learned most easily. The next training criterion involves a slight change
in the weighting of examples that increases the probability of sampling slightly more difficult ex-
amples. At the end of the sequence, the re-weighting of the examples is uniform and trained on
the target training set. In [84] a meta-dataset is created with values of complexity of the task of
emotion recognition. This meta-dataset is generated through a complexity function that measures
the complexity of image samples and ranks the original training dataset. The dataset is split in
different batches based on complexity rank. The deep network is then trained with easier batches
and progressively fine-tuned with the harder ones until the network is trained with all the data (see
Figure 3.5).

Figure 3.5: Framework of curriculum learning method proposed in [84]. Faces are extracted and
aligned and then ranked into different subsets based on the curriculum. A succession of deep convolu-
tional networks is applied, and then the weights of the fully connected layers are fine-tuned.

Regularization, as mentioned before, is also one of the factors that contribute to better deep
models. More recently, new methods for regularization were introduced, such as dropout [37],
drop-connect [85], max pooling dropout [86], stochastic pooling [87] and, to some degree, batch
normalization [36]. H. Ding et al. [88] propose a probability distribution function to model the
high-level neuron responses based on an already fine-tuned face network, leading to regularization
at the feature level and achieving state-of-the-art performance on the SFEW database.
Besides regularization and transfer learning, the inclusion of domain knowledge about the
problem is also a common approach, since it makes the feature space more discriminative. The
work of Liu et al. [78] explores the psychological theory that FEs can be decomposed into multiple
action units and uses this domain knowledge to build an oriented network for FER. The network can
be divided into three stages: firstly, a convolutional layer and a max-pooling layer are built to
learn the Micro-Action-Pattern (MAP) representation, extracting domain knowledge from the data,
which can explicitly depict local appearance variations caused by facial expressions. Then, feature
grouping is applied to simulate larger receptive fields by combining correlated MAPs adaptively,
aiming to generate more abstract mid-level semantics. As a last stage, a multi-layer learning process
is employed in each receptive field to construct group-wise sub-networks. Likewise,
the method proposed by Ferreira et al. [89] builds a network inspired by the physiological
evidence that FEs are the result of the motion of facial muscles. The proposed network is an end-
to-end deep neural network along with a well-designed loss function that forces the model to
learn expression-specific features. The model can be divided into three main components, where the
representation component is a regular encoding that computes a feature space. This feature space
is then filtered with a relevance map of facial expressions computed by the facial-part component.
The new feature space is fed to the classification component and classified in discrete classes.
With this approach, the network increases the ability to compute discriminative features for FER.

3.4 Expression Recognition


Within a categorical classification approach, there are several works that tackle the categorical sys-
tem in different ways: most works use feature descriptors followed by a classifier to categorically
classify expressions/emotions. Principal Component Analysis (PCA) can be performed before
classification in order to reduce feature dimensionality. This classification can be achieved by us-
ing SVMs [90] [59], Random Forests [91] or k-nearest neighbors (kNN) [67]. More recently, deep
networks are being used and, as mentioned in the previous sub-section, jointly perform feature
extraction and recognition; however, it is also possible to stop training, output the extracted features from
the network and then proceed to classification with a separate classifier [76][77][78].

3.5 Summary
Facial expression recognition has a standard pipeline, starting with face detection/alignment, followed by
feature extraction and expression recognition/classification. Feature extraction plays a crucial role in the
performance of a FER system. It can be performed using traditional methods that take into
account geometric or appearance features, or it can be more complex, using convolutional neural
networks as feature descriptors. CNNs learn representations of the data with levels of abstraction
that traditional methods cannot, leading to new features. However, the performance of CNNs depends on
the dataset size and on the GPU technology that determines training speed. For FER, the available datasets
are scarce, giving rise to the need for methods that regularize, augment and generalize the
available datasets in order to improve performance. Novel FER methods intend to im-
prove generalization and to mitigate dataset scarcity by incorporating prior knowledge into the networks. Prior
knowledge can consist of features transferred from other networks and domains, but it can also be
domain knowledge that improves the discrimination in the feature space. Either way, the works
that include prior knowledge indicate that this approach can achieve state-of-the-art results and can
help generalization and the use of small datasets.
Chapter 4

Implemented Reference Methodologies

Several approaches, from traditional methods to state of the art methods, were implemented. The
methodology followed for each approach will be covered in this chapter.
The implemented traditional approaches include hand-crafted feature-based methods (geometric and
appearance) as well as a conventional CNN trained from scratch. These methods were imple-
mented as baselines for the proposed FER. In addition, several state-of-the-art methods were also
implemented (i.e., transfer learning and physiological inspired networks) to serve as a starting
point for the proposed method.

Figure 4.1: Illustration of the pre-processing where (a) and (b) are instances from an unconstrained
dataset (SFEW [92]) and (c) and (d) are instances from a controlled public dataset (CK+ [93]). The
original images are fed to the face-detector (a MTCNN framework [50]) and the faces detected
will be used as input to the implemented methods.

As a pre-processing step, all methods are preceded by face detection and alignment. Then, the
images are normalized, cropped and resized. To jointly perform face detection and alignment,
the MTCNN [50] is used as the face detector. Some examples of pre-processed images are presented
in Figure 4.1.

4.1 Hand-crafted based approaches


Concerning hand-crafted methods, several appearance-based methods as well as a geometric-
based approach were implemented. The geometric approach is based on features computed from
facial key-points (see Figure 4.2), such as:

1. Distances x and y of each key-point to the center point;

2. Euclidean distance of each key-point to the center point;

3. Relative angle of each key-point to the center point corrected by nose angle offset.

These features are then concatenated to form the geometric feature descriptor. Finally, this feature
descriptor is fed into a multi-class SVM for expression classification.

Figure 4.2: Illustration of the implemented geometric feature computation.
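
A minimal sketch of these geometric features is given below; the key-points, the center point and the nose key-point are assumed to be available as NumPy arrays, and the implementation is illustrative rather than the exact one used here.

# Hedged sketch of the geometric descriptor described above. "keypoints" is
# assumed to be an (N, 2) array of landmark coordinates, "center" the center
# point and "nose" the nose key-point used for the angle offset.
import numpy as np

def geometric_descriptor(keypoints, center, nose):
    diffs = keypoints - center                      # x and y distances
    euclidean = np.linalg.norm(diffs, axis=1)       # Euclidean distances
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])   # angle to the center
    nose_angle = np.arctan2(nose[1] - center[1], nose[0] - center[0])
    rel_angles = angles - nose_angle                # corrected by nose offset
    # Concatenate everything into a single vector fed to the multi-class SVM.
    return np.concatenate([diffs.ravel(), euclidean, rel_angles])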

The implemented appearance-based FER methods are based on two commonly used tech-
niques for texture classification, namely Gabor filter banks and LBP. Regarding the Gabor filters
approach, a bank of Gabor filters with different orientations, frequencies and standard deviations
was first created. Afterwards, the input images are convolved (or filtered) with the different Ga-
bor filter kernels, resulting in several image representations of the original image. The mean and
variance of the filtered images (image representation) are then used as descriptors for classifica-
tion. In particular, different Gabor-descriptors were extracted, according to the degree of local
information:

• Gabor-global: This feature descriptor consists in the concatenation of global mean and
variance values of each Gabor representation.

• Gabor-local: The Gabor representations are divided into a grid of cells. The mean and
variance of each cell are computed and, then, concatenated to form the feature vector.

• Gabor-kpts: It requires the facial key-point coordinates. In particular, the mean and variance
of the Gabor representations are computed locally in a neighborhood of each facial key-point.

Then, these feature descriptors are fed into an SVM for expression classification.
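
The sketch below illustrates the Gabor-global descriptor using OpenCV's Gabor kernels; the filter-bank parameters shown are illustrative placeholders, not the tuned values reported in Section 6.1.

# Hedged sketch of the Gabor-global descriptor with OpenCV. The kernel
# parameters are illustrative, not the values selected in this work.
import cv2
import numpy as np

def gabor_global_descriptor(gray_image):
    features = []
    for sigma in (1.0, 3.0):
        for theta in np.arange(0, np.pi, np.pi / 4):
            kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=sigma,
                                        theta=theta, lambd=10.0,
                                        gamma=0.5, psi=0.0)
            response = cv2.filter2D(gray_image, cv2.CV_32F, kernel)
            # Global mean and variance of each Gabor representation.
            features.extend([response.mean(), response.var()])
    return np.array(features)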

Regarding the implemented LBP-based approach, the LBP representation of the input image
is first computed and then this LBP representation is used to build a histogram of the LBP pat-
terns. For an extra level of rotation and luminance invariance, only the uniform LBP patterns [94]
were extracted. Similarly to the Gabor-based approach, different LBP feature descriptors were
computed:

• LBP-global: This feature vector consists of the global histogram of the LBP pat-
terns.

• LBP-local: The LBP representations are divided into a grid of cells. The histograms
of the LBP patterns of each cell are then computed and concatenated to form the
feature vector.

• LBP-kpts: The histograms of the LBP representations of the region around each facial
key-point are concatenated to form the feature vector.

For classification, these LBP-based feature descriptors are also used to train a multi-class SVM.
Moreover, different combinations of these methods were also evaluated. That is,
LBPs were applied to the Gabor representations and geometric features were concatenated with LBP
features.
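
As an illustration, the LBP-global and LBP-local descriptors could be computed with scikit-image as sketched below; the neighborhood, radius and grid values are placeholders rather than the exact settings used in this work.

# Hedged sketch of LBP-based descriptors using scikit-image uniform patterns.
# P, R and the grid size are placeholders, not the tuned values.
import numpy as np
from skimage.feature import local_binary_pattern

P, R = 8, 1  # number of neighbours and radius

def lbp_histogram(patch):
    lbp = local_binary_pattern(patch, P, R, method="uniform")
    # The "uniform" mapping yields P + 2 distinct pattern labels.
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def lbp_local_descriptor(gray_image, grid=(10, 10)):
    rows = np.array_split(gray_image, grid[0], axis=0)
    cells = [c for r in rows for c in np.array_split(r, grid[1], axis=1)]
    # Concatenate the per-cell histograms to form the LBP-local vector;
    # lbp_histogram(gray_image) alone corresponds to LBP-global.
    return np.concatenate([lbp_histogram(cell) for cell in cells])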

4.2 Conventional CNN


Hand-crafted methods have been widely used in image recognition problems; however, broader
representations of the image can be extracted using deep neural networks. Convolutional layers
take images or feature maps as input, and convolve these inputs with a set of filter banks in a
sliding-window manner to output feature maps that represent a spatial arrangement of the facial
image. The weights of the convolutional filters within a feature map are shared, and the inputs of the
feature map layer are locally connected. In addition, sub-sampling layers lower the spatial resolution
of the representation by averaging or max-pooling the given input feature maps to reduce their
dimensions and thereby ignore variations caused by small shifts and geometric distortions [95]. A deep
neural network was implemented and trained from scratch. The architecture, regularization and
learning strategies of the implemented CNN from scratch are described in sections 4.2.1, 4.2.2 and
4.2.3, respectively.

4.2.1 Architecture

The architecture of the implemented model is presented in Figure 4.3. The architecture can be
divided into two main parts: representation computation, Xr, and classification, Xc.

Figure 4.3: Architecture of the conventional deep network used.

As the schema suggests, the representation module corresponds to Xr and the classification
module to Xc. Xr is a functional block that takes as input the pre-processed images x and com-
putes new representations of the data. It consists of sequences of two consecutive 3x3 convolu-
tional layers, with rectified linear units (ReLU) as non-linearities, followed by a 2x2 max-pooling
operation for down-sampling. Regularization layers (covered next) can be included between the
convolutional layers. The classification module Xc consists of a sequence of fully connected layers,
where the last layer is a softmax layer that outputs the probabilities for each class label, ŷ. Between
the fully connected layers, the inclusion of regularization layers was assessed, as presented in section 6.1.
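
A minimal Keras sketch of this architecture is shown below; the number of blocks, filters and dense units are illustrative and would in practice be set by the grid-search of Table 6.1.

# Hedged Keras sketch of the scratch network: blocks of two 3x3 convolutions
# with ReLU followed by 2x2 max-pooling (Xr), then dense layers and a softmax
# output (Xc). Filter and unit counts are illustrative.
from tensorflow.keras import layers, models

def build_scratch_cnn(input_shape=(120, 120, 1), n_classes=8, n_blocks=3):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    filters = 32
    for _ in range(n_blocks):                        # representation module Xr
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2))
        filters *= 2
    model.add(layers.Flatten())                      # classification module Xc
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dropout(0.4))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model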

4.2.2 Learning

The model is trained in order to return the predictions ŷ on class labels. The goal of training is to
minimize the following loss function:

L_{classification} = - \sum_{i=1}^{N} y_i^T \log(\hat{y}_i),    (4.1)

where yi is a column vector with one-hot encoding of the class label for input i and ŷi are the soft-
max predictions of the model. The Adaptive Moment Estimation (Adam) optimizer is used. It
computes adaptive learning rates for each parameter by keeping exponentially decaying averages
of past gradients and of past squared gradients. The learning rate (Lr) is optimized by means of a
grid-search procedure over the range of values presented in Table 6.1.
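
A sketch of this training setup is given below: categorical cross-entropy (Eq. 4.1) minimized with Adam, a grid-searched learning rate, and the early-stopping patience and batch size reported in Chapter 6; the restore_best_weights option is an assumption added for convenience.

# Hedged sketch of the training configuration: Adam with a grid-searched
# learning rate, categorical cross-entropy, early stopping (patience = 45)
# and a batch size of 64, as reported in Chapter 6.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

model = build_scratch_cnn()              # from the previous sketch
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss", patience=45,
                           restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=500, batch_size=64, callbacks=[early_stop])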

4.2.3 Regularization

Due to the high representational capacity and the high number of parameters estimated
by deep models, overfitting is a common problem. Regularization techniques penalize the
weight matrices of the nodes. As described in the following subsections, a wide range of regulariza-
tion strategies was applied to the implemented CNN. These regularization techniques were also
applied to the remaining networks presented in this dissertation.

4.2.3.1 Data-Augmentation

The simplest way to reduce overfitting is to increase the size of the training data; however, in
some domains this is not always possible. Therefore, data augmentation is needed. Data augmentation
consists of the synthesis of additional training samples through different image transfor-
mations and noise addition. Here, a randomized data augmentation scheme based on geometric
transformations is applied during the training step. The purpose of data augmentation is to in-
crease the robustness of the model by training on a wider range of face positions, poses and viewing
angles. When performing data augmentation, the transformations applied to the image have to
be chosen so as not to corrupt the corresponding label. For instance, vertical flips are not performed,
since some images would lose their assigned label and corrupt the classification system. The data
augmentation process is applied in an online fashion, within every iteration, to all the images of
each mini-batch. The following equation represents the geometric transformations used to augment the
training data:

\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} s & 0 \\ 0 & s \end{bmatrix} \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix} \begin{bmatrix} x - t_1 \\ y - t_2 \end{bmatrix},    (4.2)

where θ is the rotation angle, t1 and t2 define the translation parameters, s defines the scale factor and p
is a binary variable for the horizontal flip. Pixels mapped outside the original image are assigned
the value of the closest existing pixel.
The parameters of each transformation are presented in section 6.1, and the range of each
transformation is chosen in order to ensure that the image is never corrupted by abrupt transforma-
tions. Some instances of the augmented data are presented in Figure 4.4.
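
For illustration, an equivalent online augmentation can be configured with Keras' ImageDataGenerator as sketched below; the ranges approximate those in Section 6.1 and are assumptions rather than the exact implementation.

# Hedged sketch of the online augmentation using Keras' ImageDataGenerator.
# The ranges approximate Section 6.1 (±5° rotations, 5 % shifts and zoom,
# horizontal flips only, nearest-pixel filling).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=5,          # degrees
    width_shift_range=0.05,    # fraction of the image width
    height_shift_range=0.05,   # fraction of the image height
    zoom_range=0.05,
    horizontal_flip=True,
    vertical_flip=False,       # vertical flips would corrupt some labels
    fill_mode="nearest")       # copy the closest existing pixel

# flow() yields augmented mini-batches on the fly during training:
# model.fit(augmenter.flow(x_train, y_train, batch_size=64), epochs=500)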

4.2.3.2 Dropout

Dropout is commonly used in the fully connected layers and not in the convolutional layers, since
it affects the number of parameters of the network and this number is higher in the fully connected

Figure 4.4: Examples of the implemented data augmentation process. For each pair, the left image
corresponds to the original image and the right to the respective transformation: (a) Horizontal Flip; (b)
Rotation; (c) Zooming; (d) Width Shift; (e) and (f) Height Shifts.

layers. Given this, whether dropout is applied in the representation module or only in the classifi-
cation module is controlled by a binary variable, d. Its magnitude, D, is also searched and is defined
in Table 6.1, Chapter 6.

4.2.3.3 L2

L2 regularization is performed for each new feature computation, i.e., it is applied to each convolutional
layer. The penalization term is optimized and a detailed description is presented in the
implementation details in Table 6.1, Chapter 6.

4.2.3.4 Early-Stopping

In order to apply early-stopping during training, it is necessary to define the patience, p, which denotes
the number of epochs with no further improvement after which the training is stopped; it
is a hyper-parameter of the network (see Table 6.1 in Chapter 6).

4.2.3.5 Batch-Normalization

This layer has a defined momentum and is initialized by γ and β. Batch-Normalization, like
Dropout, can be applied between the convolutional layers or only between the fully-connected lay-
ers, as controlled by a binary variable, B.

4.3 Transfer Learning Based Approaches


It is rare to train an entire Convolutional Network from scratch with random weight initialization,
because data is often scarce and generic features can be reused across different models and problems.
When performing transfer learning, a base network is first trained on a base dataset and
task; then, the learned features are transferred to the desired target and trained on the new dataset.
The success of the method depends heavily on the generalization of the extracted features and on how
similar the first task is to the target task. Two different networks were used as pre-trained models:
VGG16 and FaceNet. They hold state-of-the-art results in their recognition tasks and
the datasets on which they were trained belong to a similar domain; therefore, potentially common
features can be reused for classification.

4.3.1 VGG16

VGG16 is a network pre-trained on the ImageNet Large Scale Visual Recognition Challenge
dataset. The original dataset contains 1.2 million training images, with another 50,000 images
for validation and 100,000 images for testing. The goal of this image classification challenge is to
train a model that can correctly classify an input image into 1,000 separate object categories. The
categories correspond to common object classes such as dogs, cats, houses, vehicles and so on [96].
VGG16 uses only 3 x 3 convolutional layers stacked on top of each other in increasing depth.
Reducing volume size is handled by max pooling. Two fully-connected layers, each with 4,096
nodes are then followed by a softmax classifier. As the name suggests, VGG16 has 16 weight
layers. A representation of VGG16 architecture is presented in Figure 4.5.

Figure 4.5: Network configurations of VGG, from [96]. The pre-trained network used corresponds
to configuration D (VGG16).

There are two steps in the training of the pre-trained network. Firstly, VGG16 is used
as a fixed feature extractor. That is, the original fully-connected layers are removed and replaced by fully-
connected layers adapted to our dataset. The hyper-parameters of the new dense layers (number of
units, regularization and number of layers) are optimized by means of grid-search (see their range
of values in Table 6.1, Chapter 6). In the second step, the convolutional layers that were fixed before are
trained on the FER dataset in order to fine-tune the weights. All the layers are trained in the second
step.
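
This two-step strategy can be sketched with keras.applications as below; the dense-layer sizes and the number of output classes are illustrative assumptions.

# Hedged sketch of the two-step transfer learning with VGG16 (ImageNet
# weights). Dense sizes and the number of classes are illustrative.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Step 1: freeze the convolutional base and train only the new dense head.
base.trainable = False
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(7, activation="softmax"),   # 7 classes for SFEW, 8 for CK+
])
# ... compile and train the dense head here ...

# Step 2: unfreeze everything and fine-tune all layers on the FER dataset.
base.trainable = True
# ... re-compile with a small learning rate and continue training ...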

4.3.2 FaceNet

Facenet learns a mapping from face images to a compact Euclidean Space where distances directly
correspond to a measure of face similarity. Once this is done, tasks such as face recognition, veri-
fication, and clustering are easy to do using standard techniques (using the FaceNet embeddings as
features). The training is done using triplets: one image of a face (‘anchor’), another image of that
same face (‘positive exemplar’), and an image of a different face (‘negative exemplar’) [97]. The
dataset consists of 100 million to 200 million face thumbnails covering about 8 million identities. The
training strategy adopted for FaceNet is the same as for VGG16: the convolutional layers are fixed for a
certain number of epochs, until the network converges. Afterwards, all the layers of the network
are fine-tuned. As in the VGG16 strategy, the classification part is also composed of fully
connected layers and regularization layers, whose hyperparameters are also optimized using a grid-
search approach (as presented in Table 6.1, Chapter 6).

4.4 Physiological regularization

Transfer learning is commonly used for additional feature computation, but the benefits are highly
dependent on the source-target domain similarity, and the number of parameters to be trained is
also increased. The basis of transfer learning is inductive transfer, where the allowed hypothesis
space is shrunk and the features used become more selective. This selection of the feature space can
also be imposed by using domain knowledge. In fact, in FER, it is known that FEs are the result
of the motion of facial muscles [13].
The physiologically-based neural network proposed in [89] is composed of three well-designed
modules: the facial-parts module, the representation module and the classification module.
The purpose of the facial-parts module is to learn an encoding-decoding function that maps
from an input image x to a relevance map x̂, representing the probability of each pixel being
relevant for recognition. This task is trained using a supervised learning approach when the anno-
tation of facial key-points exists. Otherwise, unsupervised learning is performed, where the
loss function enforces sparsity and spatial contiguity on the activations of x̂.
The representation module is a series of convolutions trained from scratch with random
weight initialization. The representation module aims to learn an embedding function that maps
from an input image x to a feature space f. The relevance map x̂ that is learned in the
facial-parts module is then used to filter the learned representations f into a new feature space f',
enforcing them to respond strongly only to the most relevant facial parts.
The classification module is the same as the module presented in Figure 4.3: it consists of
a sequence of fully connected layers followed by regularization, returning a vector of probabilities
for each class, ŷ.

4.4.1 Loss Function

The goal of the network is to explicitly model relevant local facial regions and to recognize ex-
pressions. Given this, the network has two tasks to be trained: the regression of relevance maps, x̂, and
the prediction of expression labels, ŷ. The class labels are trained by defining a categorical cross-en-
tropy cost function, as defined in Eq. 4.1. The regression task is trained using supervised learning
when annotations of key-points exist and unsupervised strategies when only class labels exist.

4.4.2 Supervised term

The supervised learning requires annotation of the true coordinates of key-points located over im-
portant facial components, such as the eyes, nose, mouth and eyebrows. In this scenario, a target
relevance map for each training image is created. As illustrated in Fig. 4.6, for a given training
image, each facial landmark is represented by a Gaussian, with mean at the key-point coordinates
and a predefined standard deviation. The target relevance map is formed by the mixture of the
Gaussians of each facial landmark. The standard deviation should be set to control the neighbor-
hood size around the facial landmarks and is also an hyperparameter for this model. The relevance

Figure 4.6: Original image followed by density maps obtained by a superposition of Gaussians at
the location of each facial landmark, with an increasing value of σ .

map is learned by minimizing a dedicated loss, L_{relevanceMap}: the mean squared
error between the target and the predicted relevance maps, such that:

L_{relevanceMap} = \frac{1}{N} \sum_{i=1}^{N} \left( X_i^{target} - \hat{x}_i \right)^2,    (4.3)

where X_i^{target} is the map created by the superposition of the Gaussians of each key-point and x̂i is the
map predicted by the model for the training sample i. Therefore, this loss term encourages the
relevance map x̂ to take high values in the neighborhood of the most important facial components.
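
A sketch of this target-map construction is given below; the normalization of the final map is an assumption added for illustration.

# Hedged sketch of the target relevance map: one Gaussian per facial landmark,
# superimposed (Fig. 4.6). sigma = 21 as in Section 6.1; the final
# normalization is an assumption.
import numpy as np

def target_relevance_map(keypoints, height, width, sigma=21.0):
    ys, xs = np.mgrid[0:height, 0:width]
    target = np.zeros((height, width), dtype=np.float32)
    for kx, ky in keypoints:
        target += np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
    return target / target.max()   # scale to [0, 1]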

4.4.3 Unsupervised Term

Contrary to the supervised learning strategy, this training strategy does not require the availability
of key-point annotations. In this scenario, the facial-module loss, L_{facial_module}, is defined to
regularize the activations of the relevance map x̂ by imposing sparsity and spatial contiguity as
follows:
L_{facial_module} = \sum_{i=1}^{N} L_{contiguity}(\hat{x}_i) + \alpha \sum_{i=1}^{N} L_{sparsity}(\hat{x}_i),    (4.4)

where α controls the dominance of each component. Sparsity assures that small and disjoint facial
regions are relevant for the recognition and corresponds to L1 regularization. The sparsity term is
defined as follows:
L_{sparsity}(\hat{x}) = \frac{1}{m \times n} \sum_{m,n} |\hat{x}_{m,n}|,    (4.5)

where m and n denote the resolution of the relevance map x̂. The contiguity term enforces the
activations of x̂ to be smooth and spatially localized. Contiguity corresponds to the total variation
regularization and is defined by:

L_{contiguity}(\hat{x}) = \frac{1}{m \times n} \sum_{m,n} \left( |\hat{x}_{m+1,n} - \hat{x}_{m,n}| + |\hat{x}_{m,n+1} - \hat{x}_{m,n}| \right)    (4.6)
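
These two terms can be sketched in TensorFlow as below, assuming x̂ is a batch of single-channel relevance maps; the combination with the weight α follows Eq. 4.4.

# Hedged TensorFlow sketch of the sparsity (Eq. 4.5) and contiguity /
# total-variation (Eq. 4.6) terms, for x_hat with shape (batch, m, n, 1).
import tensorflow as tf

def sparsity_loss(x_hat):
    # (1 / (m*n)) * sum |x_hat|, computed per sample.
    return tf.reduce_mean(tf.abs(x_hat), axis=[1, 2, 3])

def contiguity_loss(x_hat):
    dv = tf.abs(x_hat[:, 1:, :, :] - x_hat[:, :-1, :, :])   # vertical diffs
    dh = tf.abs(x_hat[:, :, 1:, :] - x_hat[:, :, :-1, :])   # horizontal diffs
    mn = tf.cast(tf.shape(x_hat)[1] * tf.shape(x_hat)[2], x_hat.dtype)
    return (tf.reduce_sum(dv, axis=[1, 2, 3]) +
            tf.reduce_sum(dh, axis=[1, 2, 3])) / mn

def facial_module_loss(x_hat, alpha=1e-3):
    # Eq. 4.4: contiguity plus alpha-weighted sparsity, averaged over the batch.
    return tf.reduce_mean(contiguity_loss(x_hat) + alpha * sparsity_loss(x_hat))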
Chapter 5

Proposed Method

Inspired by the state-of-the-art methods presented previously and by the idea that prior knowledge
plays a crucial role in FER, the proposed method is a deep neural network architecture with
an encoding that corresponds to a pre-trained network, followed by a stage with a loss function that
jointly learns the most relevant facial parts along with expression recognition.
The proposed network is divided into three main modules: the representation module, the fa-
cial module and the classification module. The relevance maps regressed by the facial module are obtained
from a balance between map supervision and contiguity and sparsity impositions, contrary to the
physiologically-based network, where only one learning approach was applied at a time. The
representation module also differs from the previous networks, since the feature space produced
results from a merge operation between the relevance maps and the features transferred from a pre-trained
network. Transfer learning approaches are limited by the domain similarity, but a hybrid ap-
proach in which only low-level feature spaces are transferred can, in fact, induce the network to
compute additional features. The proposed method intends to obtain representations from early
stages of pre-trained networks and then transform the transferred features into a wider feature space
that highlights relevant facial regions for FER.
Concerning the facial module, the approach is similar to the physiological regularization
network seen before, but includes both supervised and unsupervised learning of the relevance
maps. Models that impose domain knowledge are desirable, since they induce a bias in the network
to compute features relevant to the problem to be solved. The facial module presented before
introduces domain knowledge, but the knowledge introduced has to be computed carefully, since
it will affect all the activations from the representation module. A supervised approach tells the
network the exact areas, but it can "over-fit" in the sense that the model ignores potentially relevant
features or gives the same weight to regions that contribute differently to the expression. For
instance, wrinkles or dimples can be decisive for the classification, but they are not included in
the relevance map, and ideally the network should compute features capable of considering these
cues. Such cues correspond to small and disjoint facial regions and can be captured by
unsupervised approaches that enforce sparse and contiguous features. The proposed network uses
a facial module that produces relevance maps learned through supervised targets and through
mathematical impositions such as sparsity and contiguity. Therefore, the facial module will have
regions of reference for relevant features but freedom to compute additional sparse features that
are not included in the targeted relevance maps.

5.1 Architecture
As presented in Figure 5.1, the network is composed of three main modules: the representation
module, the facial module and the classification module. Each module will be covered in detail in
the following subsections.

Figure 5.1: Architecture of the proposed network. The relevance maps are produced by regression
from the facial component module, which is composed of an encoder-decoder. The maps x̂ are
operated (⊗) with the feature representations (f) that are output by the representation module and
then fed to the classification module, predicting the class probabilities (ŷ).

5.1.1 Representation Module


The representation module contains a series of layers that encode a set of high-level features
to be used by the proposed network. With x being the input images and Xr the representation module,
the high-level feature representation f can be written explicitly as f = Xr(x).
Two variants of the representation module are evaluated: computing the encoded
features f with an encoder designed from scratch, or importing the features f from a pre-trained
network. When evaluating the scratch encoder, the representation module corresponds only to the
encoding block presented in Figure 5.1 and consists of a series of convolutional layers. The encod-
ing block has the same architecture as the Xr block of the scratch network. The hyperparameters
used for this encoding block (number of layers, depth, regularization performed, etc.) correspond
to the best hyperparameters of the scratch network and are presented in Table 6.1.
When evaluating the representation module with a pre-trained network, the encoding block
corresponds to a set of convolutional layers originating from the pre-trained network. The main
idea is to extract the features from a specific layer and then use this feature representation to be
refined by the relevance maps. The pipeline for feature extraction from the pre-trained network is
presented in Figure 5.2.
When the representation module is based on a pre-trained network, the feature space extracted
is fed into a series of additional convolutional layers, returning the feature space ( f ) that will be

Figure 5.2: Pipeline for feature extraction from Facenet. Only the layers before pooling operations
are represented. GAP- Global Average Pooling.

used in the merge operation. The additional convolutions assure the computation of more complex
and higher-level features, since the extracted features come from an early stage of the pre-trained
network. The resulting feature space, f, is then operated with the relevance map obtained from the
facial module. The merge operation ⊗ between the feature space f and the relevance map x̂ has
two possible approaches that will be evaluated, an element-wise product or a concatenation:

f' = f ⊗ x̂,    (5.1)

where f denotes the activations of the representation module (the learned features), with N feature maps,
which are merged with the relevance map x̂. The merge operation can be an element-wise product
that returns a new set of features f' with N feature maps. Alternatively, the merge operation can
be a concatenation between the feature maps from the representation module and the relevance
map from the facial module. In this scenario, the output will be a concatenated set of features with
the previous N feature maps plus the relevance map (N+1 feature maps). It is necessary that the
operated terms have the same dimension; therefore, the relevance map undergoes a pooling operation to
match the dimension of the feature space f. Due to the need for a similar semantic level between the
operated terms, the layer extracted from the pre-trained network has to come from an intermediate
layer where the features are sized 17 by 17.
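
The merge operation can be sketched with standard Keras layers as below; the pooling used to match resolutions is an assumption, since the pooling type and size depend on the chosen encoder.

# Hedged Keras sketch of the merge operation of Eq. 5.1 between the feature
# maps f and the relevance map x_hat. The pooling used to match resolutions
# is an assumption.
from tensorflow.keras import layers

def merge_features(f, x_hat, mode="product", pool_size=2):
    # Downsample the single-channel relevance map to the resolution of f.
    x_hat_small = layers.MaxPooling2D(pool_size)(x_hat)
    if mode == "product":
        # Element-wise product; the single channel is broadcast over the
        # N feature maps of f, returning N filtered feature maps.
        return layers.Multiply()([f, x_hat_small])
    # Concatenation appends the relevance map as an extra channel (N + 1 maps).
    return layers.Concatenate(axis=-1)([f, x_hat_small])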

5.1.2 Facial Module

The facial module can be seen as an encoder-decoder where a convolutional path is followed by a
deconvolutional path, in such a way that it is possible to learn a mapping from an input image,
x, to a relevance map x̂. A scheme of the facial module can be found in Figure 5.3.
The convolutional path follows the typical architecture of a fully convolutional network, sim-
ilar to the scratch network presented before. It comprises several sequences of two consecutive 3x3
convolutional layers, with rectified linear units (ReLUs) as non-linearities and L2 regularization,
followed by a 2x2 max-pooling operation for down-sampling. The number of convolutional filters
is doubled at each max-pooling operation. The sequences of pooling and transpose operations are
represented by Xe and Xd, respectively, and are repeated according to the desired depth of the

Figure 5.3: Facial Module architecture. A regression map of facial components x̂ is obtained after
sequences of convolutions (Xe ) and deconvolutions (Xd ).

network. For each pooling-transpose operation a skip-connection is implemented. This way, sub-
sequent layers can re-use intermediate representations, maintaining more information, which can lead to
better performance.
Every step in the deconvolution path comprises a 2x2 transpose convolution and two 3x3 con-
volutions, each one followed by a ReLU and regularized by L2. The transpose convolution is
applied for up-sampling and densification of the incoming feature maps. At the final layer, a 3x3
convolution with an activation function (either sigmoid or linear, whichever produces the best maps)
is used to map the activations into a probability relevance map x̂.
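
A compact sketch of this encoder-decoder is given below, with skip connections between the pooling and transpose steps; the L2 regularizers are omitted for brevity and the depth and filter counts are illustrative.

# Hedged Keras sketch of the facial module: a convolutional path (Xe) with
# skip connections and a deconvolutional path (Xd) ending in a single-channel
# relevance map. L2 regularizers are omitted; sizes are illustrative.
from tensorflow.keras import layers

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def facial_module(inputs, depth=2, base_filters=32, activation="sigmoid"):
    skips, x, filters = [], inputs, base_filters
    for _ in range(depth):                        # convolutional path (Xe)
        x = conv_block(x, filters)
        skips.append(x)                           # kept for the skip-connection
        x = layers.MaxPooling2D(2)(x)
        filters *= 2                              # filters doubled per pooling
    x = conv_block(x, filters)
    for skip in reversed(skips):                  # deconvolutional path (Xd)
        filters //= 2
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])       # re-use middle representations
        x = conv_block(x, filters)
    # Final 3x3 convolution mapping the activations to the relevance map x_hat.
    return layers.Conv2D(1, 3, padding="same", activation=activation)(x)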

5.1.3 Classification Module

The classification module has the same structure as the Xc module of the scratch neural
network in Figure 4.3. It consists of a sequence of fully connected layers followed by regulariza-
tion, ending in a vector of probabilities for each class, ŷ.

5.2 Loss Function

There are two main tasks performed by the network: the regression of relevance maps and the
classification task. The goal of the model is to minimize a loss function composed of the losses of
the two tasks, as shown in the following equation:

L = L_{classification} + L_{facial_module}    (5.2)

L_{classification} is the categorical cross-entropy defined before in Equation 4.1 for the scratch
neural network. The facial module is learned through an interaction among three terms, one of them respon-
sible for the supervised learning of the relevance maps, L_{supervised}, which corresponds to the mean
squared error between the produced map x̂ and the target map x^{target} (see Eq. 4.3).

It should be noted that the proposed method requires annotation of facial landmarks in order to
compute the target map. Some datasets, such as CK+ [93], provide facial landmark annotations but, for
instance, the SFEW database [92] is weakly annotated: only emotions are annotated. To solve this,
a key-point detector is applied, generating facial landmarks for each training image automatically. The
key-point detector is the framework presented by Bulat et al. [98]. Some instances of the application
of the key-point detector to the training set of SFEW are presented in Figure 5.4.

Figure 5.4: Illustrative examples of the facial landmarks computation for the SFEW dataset using
the framework proposed in [98]. Each pair of images contains the original image with the facial
landmarks superimposed (left side) and the corresponding target density map (right side).

The terms that integrate the weak supervision of the maps impose sparsity and contiguity
(see Eq. 4.4).
L_{facial_module} = \lambda L_{supervised} + \gamma L_{sparsity} + \alpha L_{contiguity}    (5.3)

The factors γ, λ and α are positive values that balance the weight of each term (see Table
6.1). The task of map regression now favors a balance between sparse and contiguous
representations and expression-specific regions. Since this task is optimized at an intermediate point of the
network, the classification task will depend on the relevance maps.

5.3 Iterative refinement

Some state-of-the-art approaches for recognition tasks apply iterative strategies in which a density
map needs to be refined. For instance, Cao et al. [99] propose a network for pose estimation through
part affinity fields estimated recursively. In the presented network, the relevance maps obtained
assume a crucial role in the classification, since they are merged with the computed features.
The refinement of these relevance maps may therefore improve the model. To implement this strategy,
the task of map regression is defined as a stage of the network and each stage is an iterative
prediction architecture, following Wei et al. [100], which refines the predictions over successive
stages with intermediate supervision at each stage.

Figure 5.5: Architecture of the proposed network for iterative refinement. The maps (x̂) produced
by regression from the facial component module, composed of an encoder-decoder, are operated
(⊗) with the feature representations (f) that are output by the representation module Xr. The
resulting features can be fed to an additional stage that computes a new relevance map x̂. The
final feature representations f' are then fed to the classification module Xc, predicting the class
probabilities ŷ.

When the network is implemented with a recursive approach, the feature space returned by
the merge operation, f', is the input for a new stage. For stages ≥ 1, the feature space from
the first encoding, f, is supplied to each merge operation, allowing the classifier to freely combine
contextual information by picking the most predictive features (see Figure 5.5). It is expected that
a new and more refined map is generated in each stage. The number of stages, nstg, is defined when
designing the network and can be found in Table 6.1. When the network reaches the last stage, the
final feature space is fed into the classification module, Xc, returning the class probabilities,
ŷ.
Chapter 6

Results and Discussion

The experimental evaluation of the implemented methods was performed using two publicly avail-
able databases in the FER research field: the Extended Cohn-Kanade (CK+) database [93] and
the Static Facial Expressions in the Wild (SFEW) database [92]. Datasets used in FER can be
grouped by the nature of the environment: controlled, where illumination and pose are defined, and
uncontrolled, where external conditions and pose are not controlled. CK+ images
are acquired in a controlled environment and annotated with 8 expression labels (6 basic plus neutral
and contempt). It has limited gender, age and ethnic diversity and contains only frontal views with
homogeneous illumination.

The other dataset used, SFEW, is targeted for unconstrained FER. It is the first database that
depicts real-world or simulated real-world conditions for expression recognition. The images are
all extracted from movies and labeled with the six primary emotions plus neutral expression [101].
Therefore, there is a wide range of poses, viewing angles, occlusions, illumination conditions and,
hence, the recognition is much more challenging. Samples from each database used can be found in
Figure 6.1.

Figure 6.1: (1) - Samples from the CK+ dataset, where images were acquired under controlled envi-
ronments [102]. (2) - Samples from the SFEW dataset: images of spontaneous expressions acquired
in uncontrolled environments [92].


6.1 Implementation Details

As a common pre-processing step across all methods, the multi-task CNN face detector [50] is used for
face detection, and the images were normalized, cropped and resized to 120 by 120 pixels, except
for methods that are based on pre-trained networks. When using Facenet as the pre-trained net-
work, the images were resized to 160 by 160 pixels, and to 224 by 224 pixels when using VGG16 as the
pre-trained network.

Regarding the traditional approaches, the grid cell size is 10x10 and the window of Gabor-kpts
and LBP-kpts is 16x16. For LBP, a neighborhood of 8 points and a radius of 8 are used. The Gabor filter
bank comprises 16 filters with different values of σ ∈ {1, 3}, θ ∈ {0, π/4, π/2, 3π/4} and f ∈ {0.05, 0.25}.

The data augmentation performed consisted of geometric transformations, where the rotation
angle θ is randomly sampled up to ±5π/180 rad (i.e., ±5°). The scale factor s, which defines a random zoom over the
image, is a random value from the interval [0.95, 1.05]. The translation parameters t1 and t2 are
randomly sampled fraction values of up to 5 % of the image height and width. For inputs of 120
by 120, the translation parameters assume integer values from the interval [0, 6]. The horizontal
flip, p, is a boolean variable, assuming a True or False value.

The hyperparameters of the models are optimized by means of grid search, using a validation set drawn
from the training set. The hyperparameter sets used can be found in Table 6.1. The methods imple-
mented can be categorized into three main approaches: the CNN from scratch, the physiologically inspired
network and the proposed network. For each approach there are multiple sets of hyperparameters
that were optimized. These include parameters common across the approaches, such as the
dropout magnitude on the dense layers, D, the dimension of the fully-connected layers, FCu, the learning rate,
Lr, the magnitude of L2 and a boolean that determines the use of batch normalization, Bd, between
fully-connected layers. The number of dense layers was set to 3. Concerning batch normalization,
the weights of β were initialized with zeros and the weights of γ with ones. For all experiments,
500 epochs were defined to train each network, setting the patience of Early-Stopping to 45 epochs.
The Gaussians used to form the relevance maps were obtained using a standard deviation of 21.
For a fair comparison with other methods, the architecture of the CNN from scratch was opti-
mized. The number of functional blocks, defined previously as Xr, sets the depth of the network
and is also a hyperparameter. Regularization between convolutional layers is also optimized.
For the physiologically inspired network, the best activation function that returns the relevance
map x̂ is searched, as well as the coefficients that control the interaction between relevance map
regression and the classification task (λ when only full supervision is performed, and λ and γ
when only contiguity and sparsity are imposed).
The proposed network is optimized by searching for the best parameters that create an accurate
relevance map for classification (α, γ and λ define the interaction between supervision, contiguity,

Table 6.1: Hyperparameter sets.

Group                         Hyperparameter                        Symbol   Set
Common                        Dropout Dense Magnitude               D        {0.3; 0.4}
                              Batch Normalization Dense             Bd       {True; False}
                              Dense Units                           FCu      {1024; 512}
                              Learning Rate                         Lr       {1e-4; 1e-5}
                              L2 Regularizer Factor                 L2       {1e-3; 1e-4}
Scratch                       Architecture Blocks                   Xr       {3; 4}
                              Batch Normalization in Conv Layers    Br       {True; False}
                              Dropout in Conv Layers                dr       {True; False}
Physiological Inspired Nets   Maps Activation Function              Ac       {Linear; Sigmoid}
                              Fully Supervision                     λ        {1; 2; 5}
                              Weakly Supervision                    λ        {1e-3; 1e-4; 1e-5}
                                                                    γ        {1e-3; 1e-4; 1e-5}
Proposed Network              Supervision Factor                    λ        {1; 2; 5}
                              Weakly Supervision Factor             α        {1e-3; 1e-4}
                                                                    γ        {1e-3; 1e-4}
                              Maps Activation Function              Ac       {Linear; Sigmoid}
                              Merge Operation                       ⊗        {Concat; Product}
                              Number of Stages                      nstg     {1; 3}
                              Representation Module                 Xr       {Scratch; Facenet}

sparsity and classification in the loss function). Besides the activation function, the merge opera-
tion between the relevance map x̂ and the computed features is chosen between a concatenation
and a point-wise product. To evaluate the iterative refinement strategy of the maps,
the appropriate number of stages, nstg, is searched. The representation module, Xr, that produces
the features to be merged with the maps has two possible sources: a scratch implementation or the
pre-trained Facenet network.
All deep models are implemented in Keras with Tensorflow as backend. All models are trained
with the Adam optimization algorithm using a batch size of 64 samples. No learning rate decay was
used.

6.2 Relevance Maps


The maps computed by the methods that use prior knowledge to capture semantic fea-
tures related to facial expression were output and analyzed. The predicted relevance maps, x̂,
can be found in Figure 6.2. The map predicted by the pro-
posed method (column 5) is placed next to the maps predicted by networks that use only weak
supervision (column 3) or only full supervision of the maps (column 4). The activations of these
maps are strong around relevant facial components in the three learning schemes and introduce ad-
ditional discriminative representations. As expected, the maps from the fully supervised learning
approach are the most similar to the target map (column 2) since they were trained to minimize the

Figure 6.2: Examples of predicted relevance maps for the different methods used. (1) Original samples
from the CK+ dataset. (2) Target relevance map. (3) Predicted map using only weak supervision
in the physiologically inspired net. (4) Predicted relevance map from the fully supervised scheme of
the physiologically inspired net. (5) Predicted relevance map from the proposed network using both types
of supervision.

mean square error between these two maps. Therefore, strong information around facial land-
marks is given to the model, but peculiar regions that are not encoded as facial landmarks are not
activated (e.g., wrinkles), forcing the network to ignore potential regions that encode some
expression. On the other hand, although the weakly supervised learning does not use any informa-
tion about facial landmark location, it creates maps that are sparse and spatially localized around
important facial components as well as around expression wrinkles. Since no supervision is performed in
this learning scheme, an exhaustive optimization of hyper-parameters is needed to obtain a suit-
able relevance map. Due to the high number of hyper-parameters and models computed, it was
not possible to do a proper optimization of hyper-parameters for the weak supervision scheme.
As presented in Table 6.1, only two values for each hyper-parameter were tested. For this reason,
as shown in Figure 6.2, column 3, the produced maps only weakly highlight expression-specific
regions. An extensive optimization would be needed in order to produce accurate relevance maps
with more contrast between activations.

The proposed method includes the two types of learning, allowing for an interaction between the different terms. The coefficients γ, α and λ define the degree of freedom of the activated features: the higher the λ, the closer the maps will be to the target map and the fewer additional features come from the weakly supervised term. Column 5 presents the maps produced by the proposed method. Almost all of the surrounding areas around facial landmarks are encoded in these maps, along with other regions that were only present in the maps from the weakly supervised approach. The chin is highlighted in the three samples, and the wrinkles and dimples from the first row are also present.
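To make the interplay between these terms concrete, the following sketch shows one way such a combined objective could be written in TensorFlow/Keras. The assignment of γ, α and λ to the contiguity, sparsity and map-supervision terms, as well as the exact form of the sparsity and contiguity penalties, are illustrative assumptions rather than the exact formulation used in this work.

import tensorflow as tf

# Illustrative weights for the loss terms (placeholder values).
gamma, alpha, lam = 1.0, 0.01, 0.5

def sparsity_term(pred_map):
    # L1 penalty encouraging a sparse relevance map (assumed form).
    return tf.reduce_mean(tf.abs(pred_map))

def contiguity_term(pred_map):
    # Total-variation-like penalty encouraging spatially contiguous activations (assumed form).
    dy = tf.abs(pred_map[:, 1:, :, :] - pred_map[:, :-1, :, :])
    dx = tf.abs(pred_map[:, :, 1:, :] - pred_map[:, :, :-1, :])
    return tf.reduce_mean(dy) + tf.reduce_mean(dx)

def total_loss(y_true, y_pred, target_map, pred_map):
    # Classification term on the expression labels (one-hot encoded).
    cls = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    # Fully supervised term: mean square error between predicted and target maps.
    sup = tf.reduce_mean(tf.square(pred_map - target_map), axis=[1, 2, 3])
    # Weakly supervised terms: sparsity and contiguity of the predicted map.
    weak = alpha * sparsity_term(pred_map) + gamma * contiguity_term(pred_map)
    return cls + lam * sup + weak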
Within the highlighted regions, the activations have different magnitudes, being more specific in encoding key structures for expression recognition. For instance, Figure 6.3 illustrates the generated maps for the same expression of the same subject but at different temporal moments. Although it is an anger expression in all frames, it is clear that the expression is more intense in the last frames. Since the target maps are generated taking into account only the facial landmarks, these maps are similar in all frames and do not represent the different intensities of the anger expression. With the proposed model this dimension of the expression is covered: the wrinkles around the eyebrows and the nose are more highlighted the more intense the expression is. This property of the relevance maps forces a better discrimination of the generated feature space.

Figure 6.3: Frame-by-frame analysis of the relevance maps. The first row corresponds to the original images, the second row presents the target maps and the third row presents the relevance maps generated by the proposed model. The first column represents a neutral expression while the remaining columns represent the anger expression.

6.3 Results on CK+


CK+ contains 327 annotated image sequences with 8 expression labels: the 6 basic emotions plus the neutral and contempt ones. Each video starts with a neutral expression and reaches the expression peak in the last frame. Similar to other works [103] [89], the first frame and the last three frames of each video were extracted, resulting in a subset of 1308 images. Figure 6.4 shows the class distribution for the CK+ dataset. All the splits are stratified and therefore maintain the original class distribution.

Figure 6.4: Class Distribution on CK+ [93].

For model selection and evaluation, the data is stratified and randomly split three times into training and test sets with subject independence. In each split, 80% of the original set corresponds to the training set and 20% to the test set. Each training set is further divided, also with subject independence and in a stratified way, with 80% for training and 20% for validation. In the end, for each split, the validation set is used to validate the training and the selected model is tested on the test set. The performance is evaluated by computing the average accuracy and loss over the three test sets.
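One possible way to build such splits is sketched below with scikit-learn's StratifiedGroupKFold (available from scikit-learn 1.0 onwards), which yields folds that are both stratified by expression label and grouped by subject; the subject-ID array and the fold bookkeeping are assumptions for illustration, not the protocol code used in this work.

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def make_splits(labels, subjects, n_repeats=3, seed=0):
    """Yield (train_idx, val_idx, test_idx) arrays for repeated stratified, subject-independent splits."""
    labels = np.asarray(labels)
    subjects = np.asarray(subjects)
    rng = np.random.RandomState(seed)
    X = np.zeros((len(labels), 1))  # features are irrelevant for splitting
    for _ in range(n_repeats):
        # Outer split: ~80% train+validation, ~20% test (5 folds -> take the first).
        outer = StratifiedGroupKFold(n_splits=5, shuffle=True,
                                     random_state=rng.randint(1 << 30))
        trainval_idx, test_idx = next(outer.split(X, labels, groups=subjects))
        # Inner split of the remaining data: ~80% train, ~20% validation.
        inner = StratifiedGroupKFold(n_splits=5, shuffle=True,
                                     random_state=rng.randint(1 << 30))
        tr, va = next(inner.split(X[trainval_idx], labels[trainval_idx],
                                  groups=subjects[trainval_idx]))
        yield trainval_idx[tr], trainval_idx[va], test_idx

# Usage: for tr, va, te in make_splits(labels, subjects): train on tr, validate on va, test on te.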
The experiments on CK+ are presented in Table 6.2 and in Table 6.3. Table 6.2 presents the results on CK+ using traditional methods based on hand-crafted features. Geometric features outperform appearance-based methods, holding the best performance among the hand-crafted methods. This shows the significance of facial landmarks for FER, since these features alone reach an accuracy of almost 80%. Within the appearance methods, LBP and Gabor show similar performances. All the traditional approaches are outperformed by convolutional neural networks by a significant margin. The best traditional approach (geometric features + LBP around keypoints) differs by almost 10% from the weakest method using convolutional neural networks (CNN from scratch).

Table 6.2: Performance achieved by the traditional baseline methods on CK+.

Hand-crafted Features        Method                       Loss    Acc (%)
Geometric                    Geometric                    0.74    79.76
Appearance                   LBP Global                   1.59    46.35
                             Gabor Global                 1.62    43.11
                             LBP Local                    0.81    67.82
                             Gabor Local                  0.93    65.53
                             LBP kpts                     0.91    72.41
                             Gabor kpts                   0.93    70.13
                             Gabor Global + LBP Global    1.64    42.25
                             Gabor Local + LBP Local      0.82    69.12
                             Gabor + LBP kpts             0.67    77.26
Geometric + Appearance       Geometric + LBP kpts         0.55    79.76

In Table 6.3 the methods based on convolutional neural networks are evaluated. The proposed method is compared with a CNN from scratch, which works as a baseline, and with methods that hold the state of the art in FER. Although having the lowest score, the CNN from scratch is a strong baseline since it presents a result near the state-of-the-art methods. It has strong regularization applied and its representation module has the same architecture as the proposed method and as the fully and weakly supervised learning approaches. Within the pre-trained network approaches, FaceNet beats VGG16, as expected, since the domain of the database on which FaceNet was trained is similar to the databases used here. As stated in [89], the approach that includes physiological knowledge is better than a CNN from scratch, showing that domain knowledge can improve the model. Analyzing the two approaches for knowledge inclusion, imposing sparsity and contiguity demonstrates better results than a supervised approach with maps of facial landmark regions. These observations can be explained by the fact that weak supervision allows for the activation of regions, such as wrinkles, that are not present in the target maps.
The proposed method pre-trained on FaceNet beats all the other approaches, both in average accuracy and in average loss. The proposed method, which includes both approaches of learning the maps, outperforms all networks except the one pre-trained on FaceNet when using the representation module from the scratch CNN. Besides outperforming other networks trained from scratch, it outperforms a pre-trained approach (VGG16) that was trained on millions of images with a more complex network. This shows that a simpler network, with fewer parameters to train and with domain knowledge included, can beat heavier networks such as VGG16. It is also surprising that the proposed method based on a feature space from FaceNet beats a network pre-trained on FaceNet, since the features extracted from the original network belong to intermediate layers and FaceNet is more complex and deeper.

Table 6.3: CK+ experimental results.

Method Average Accuracy (%) Average Loss


CNN from Scratch 88.6 0.58
Pre-trained Facenet 93.75 0.28
Pre-trained VGG16 91.67 0.42
Fully Supervised 88.79 0.47
Weakly Supervised 89.78 0.43
Proposed Method with CNN from scratch 91.11 0.51
Proposed Method with pre-trained facenet 94.21 0.20
Proposed Method with pre-trained facenet with iterative refinement 93.85 0.23

These observations can be explained by the capability of the network to use a larger amount of features with a low semantic level and then discriminate them with prior knowledge (facial landmark regions).
The iterative refinement strategy tested consisted in iteratively repeating the stage responsible for the relevance map estimation over the pre-computed features. The underlying idea is to refine the map over consecutive stages and inspect whether these refined maps induce a better classification. As Table 6.3 shows, the proposed method with this iterative refinement approach does not introduce gains, presenting a similar performance. In fact, it was observed that the loss value associated with each generated map either maintained or slightly decreased its value, indicating that the maps were already near optimal and that an iterative refinement strategy would not lead to more accurate maps in the presented case.
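For reference, the sketch below shows one way such a multi-stage scheme could be wired with the Keras functional API: the map-estimation block is applied repeatedly over the pre-computed features, emitting an intermediately supervised map at every stage. The layer sizes, the feature map shape and the number of stages are illustrative assumptions, not the exact architecture used here.

from tensorflow import keras
from tensorflow.keras import layers

def build_refinement_model(feat_shape=(14, 14, 256), n_stages=3):
    feats = keras.Input(shape=feat_shape, name="precomputed_features")
    outputs = []
    x = feats
    for s in range(n_stages):
        # Each stage re-estimates the relevance map from the features
        # concatenated with the previous stage's map (if any).
        h = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        stage_map = layers.Conv2D(1, 1, activation="sigmoid",
                                  name=f"relevance_map_stage_{s + 1}")(h)
        outputs.append(stage_map)  # intermediate supervision target at every stage
        x = layers.Concatenate()([feats, stage_map])
    # Each output can be compared against the target map during training (e.g. with MSE).
    return keras.Model(feats, outputs)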

Figure 6.5: Confusion Matrix of CK+ database.

The confusion matrix illustrated in Figure 6.5 shows the performance of the proposed method using a pre-trained network. Anger, contempt and fear are the expressions that are most difficult to classify. This can be explained by the fact that the frequency of these classes is lower than that of most classes (see Figure 6.4).

6.4 Results on SFEW

The other dataset used, SFEW, is targeted at unconstrained FER. SFEW was created as part of the Emotion Recognition in the Wild (EmotiW) 2015 Grand Challenge [104] and it has a strict evaluation protocol with predefined training, validation, and test sets. In particular, the training set comprises a total of 891 images. Since it was not possible to obtain the test set, the results are reported on the validation data, which contains 431 images. It has 7 classes: the 6 basic emotions plus the neutral expression. The class distribution for SFEW is presented in Figure 6.6.

Figure 6.6: Class distribution for SFEW database [92].

SFEW is known for being one of the most challenging FER datasets. For instance, the challenge baseline performance is 35.96% [104] and the state-of-the-art performance for SFEW is held by Yu et al. [77] with an accuracy of 52.29%. However, the method proposed by Yu et al., like most top state-of-the-art methods for SFEW, uses an ensemble of multiple networks to boost performance and uses other databases to train beforehand. The implemented methods were applied directly to SFEW and the results can be found in Table 6.4. As observed in the CK+ results, the proposed method outperforms both the CNN from scratch and the pre-trained network. The consistency of the results on both datasets shows that the model, besides being simpler with fewer parameters to be trained, also performs better with the inclusion of features from pre-trained networks and the integration of domain knowledge.

Table 6.4: SFEW experimental results.

Method Average Accuracy (%) Loss


CNN from scratch 36.01 1.88
Pre-trained Facenet 46.02 1.57
Proposed method (with Pre-trained Facenet) 47.26 1.79

It should be pointed out that, when using just a pre-trained network, the method only achieved the reported accuracies when all the layers of FaceNet were fine-tuned on our dataset. When training only the fully-connected layers, the performance was similar to that of the CNN implemented from scratch.
The proposed method uses a fixed feature extractor from FaceNet by transferring the feature space of a specific layer, without training the FaceNet layers on our dataset. An improvement to the proposed method could be the fine-tuning of all FaceNet layers on SFEW.
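The difference between these two regimes amounts to toggling layer trainability in Keras. The snippet below is a generic sketch, assuming the pre-trained backbone is available as a saved Keras model (the file name is hypothetical); it is not the actual training script used here.

from tensorflow import keras

# Hypothetical path to a saved pre-trained FaceNet model.
backbone = keras.models.load_model("facenet_pretrained.h5")

def configure_backbone(model, trainable, lr):
    """Fixed feature extractor (trainable=False) vs. full fine-tuning (trainable=True)."""
    for layer in model.layers:
        layer.trainable = trainable
    # A smaller learning rate is typically used when fine-tuning all layers.
    return keras.optimizers.Adam(learning_rate=lr)

# Regime 1: fixed feature extractor, as used by the proposed method.
optimizer = configure_backbone(backbone, trainable=False, lr=1e-3)

# Regime 2: fine-tune all FaceNet layers on the target dataset (possible improvement).
# optimizer = configure_backbone(backbone, trainable=True, lr=1e-5)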
The confusion matrix correspondent to the best performance (Proposed method with pre-
trained facenet) is presented in Figure 6.7.

Figure 6.7: Confusion Matrix of SFEW database.

The recognition accuracy for fear is the lowest among all the classes. The fear expression is
mostly confused with anger and neutral expression. This observation is also documented in other
works [77] [89].
Chapter 7

Conclusions

Facial expressions can assist in the interpretation of different states of mind and are part of the fundamental communication system in humans. Their automatic recognition would open new strategies and improvements in different fields that involve Human-Computer Interaction (HCI) or systems where expressions have a crucial semantic meaning.
Several FER methods have been evaluated, from methods based on traditional feature extraction, such as LBP and Gabor filters, to different approaches based on deep convolutional neural networks. It is clear that deep networks perform better than methods based on hand-crafted features due to their ability to compute a whole new set of data representations. Within deep neural approaches, several methods have been proposed and some state-of-the-art methods are remarkable in object recognition tasks. However, large FER datasets are scarce and some state-of-the-art methods design heavy networks that demand high computational resources, which is not viable or efficient in some cases. There are studies that examine domain knowledge and its role in deep neural networks. In most cases, a correct inclusion of prior knowledge can lead to better feature discrimination and, therefore, better results.
In order to study the role of prior knowledge in deep neural networks, several state-of-the-art methods were implemented and a novel method was presented. The proposed method is a deep neural network architecture with an encoding that corresponds to a pre-trained network and a posterior stage with a loss function that jointly learns the most relevant facial parts, through different sources of learning, along with the expression recognition. The result is a model that is able to learn expression-specific features, demonstrating better performance than the state-of-the-art methods implemented. The proposed method is composed of three main modules: (1) Representation Module, (2) Facial Module, (3) Classification Module. The facial module aims to regress a relevance map that highlights regions around facial landmarks. The training of this task is governed by an interaction between supervised learning and unsupervised learning that imposes sparsity and contiguity. The output of this module is a relevance map with the crucial regions for FER activated. The relevance map filters the feature space that is returned by the representation module. This representation module can be an encoding implemented from scratch or an encoding transferred from a pre-trained network, FaceNet. Finally, the classification module is trained on these filtered features and returns a vector with the predicted classes.
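To summarize the data flow between the three modules, the sketch below shows one possible Keras wiring in which the regressed relevance map filters the representation features through a point-wise product before classification; the layer sizes and input resolution are illustrative assumptions and not the exact architecture described in this work.

from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(96, 96, 3), n_classes=8):
    img = keras.Input(shape=input_shape)

    # (1) Representation module: a small convolutional encoder
    # (alternatively, features transferred from a pre-trained FaceNet).
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(img)
    x = layers.MaxPooling2D()(x)
    feats = layers.Conv2D(64, 3, padding="same", activation="relu")(x)

    # (2) Facial module: regresses a single-channel relevance map.
    m = layers.Conv2D(32, 3, padding="same", activation="relu")(feats)
    rel_map = layers.Conv2D(1, 1, activation="sigmoid", name="relevance_map")(m)

    # Merge: the relevance map filters the feature space (point-wise product).
    filtered = layers.Multiply()([feats, rel_map])

    # (3) Classification module: predicts the expression class.
    z = layers.GlobalAveragePooling2D()(filtered)
    out = layers.Dense(n_classes, activation="softmax", name="expression")(z)

    return keras.Model(img, [rel_map, out])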


The experimental results on the two databases used, CK+ (controlled conditions) and SFEW (natural conditions), demonstrate that the proposed method outperforms the state-of-the-art methods implemented, showing the potential of integrating different sources of prior knowledge: domain knowledge coming from facial landmarks and expression morphology, and prior representations transferred from other networks trained on image recognition tasks. The studies and experiments performed using an encoding from scratch as the representation module also reveal that a simpler network architecture with robust regularization and rich prior knowledge can beat some pre-trained networks that have more complex and deeper architectures and, therefore, more parameters to be tuned. Concerning only the facial module, it is clear that a balance between supervision of the regressed maps and mathematical impositions such as sparsity and contiguity can lead to refined relevance maps, since facial landmark regions are encoded along with small and disjoint regions such as wrinkles. Since the relevance maps play a crucial role in discriminating the feature space, approaches that lead to refined maps can also lead to better results. Given this, a recursive strategy was implemented where the facial module was repeated consecutively to refine the predictions over successive stages with intermediate supervision at each stage. The refinement of the successive maps was not clear and the performance was similar. This can be explained by the accurate computation of these maps in the first stage, with no relevant gains in the following ones.
As future work, the proposed network and its training strategies could be applied to more datasets and to other domains as well. The proposed method could also be extended to video, using for instance Long Short-Term Memory (LSTM) networks or optical flow stream networks.
References

[1] Charles Darwin and Phillip Prodger. The expression of the emotions in man and animals.
Oxford University Press, USA, 1998.

[2] Tijn Kooijmans, Takayuki Kanda, Christoph Bartneck, Hiroshi Ishiguro, and Norihiro
Hagita. Interaction debugging: an integral approach to analyze human-robot interaction.
In Proceedings of the 1st ACM SIGCHI/SIGART conference on Human-robot interaction,
pages 64–71. ACM, 2006.

[3] Ashish Kapoor, Winslow Burleson, and Rosalind W Picard. Automatic prediction of frus-
tration. International journal of human-computer studies, 65(8):724–736, 2007.

[4] Chek Tien Tan, Daniel Rosser, Sander Bakkes, and Yusuf Pisan. A feasibility study in
using facial expressions analysis to evaluate player experiences. In Proceedings of The 8th
Australasian Conference on Interactive Entertainment: Playing the System, page 5. ACM,
2012.

[5] Sander Bakkes, Chek Tien Tan, and Yusuf Pisan. Personalised gaming: a motivation and
overview of literature. In Proceedings of the 8th Australasian Conference on Interactive
Entertainment: Playing the System, page 4. ACM, 2012.

[6] Jeffrey M Girard, Jeffrey F Cohn, Mohammad H Mahoor, S Mohammad Mavadati, Zakia
Hammal, and Dean P Rosenwald. Nonverbal social withdrawal in depression: Evidence
from manual and automatic analyses. Image and vision computing, 32(10):641–647, 2014.

[7] Stefan Scherer, Giota Stratou, Marwa Mahmoud, Jill Boberg, Jonathan Gratch, Albert
Rizzo, and Louis-Philippe Morency. Automatic behavior descriptors for psychological dis-
order analysis. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE Interna-
tional Conference and Workshops on, pages 1–8. IEEE, 2013.

[8] Sarah Griffiths, Christopher Jarrold, Ian S Penton-Voak, Andy T Woods, Andy L Skinner,
and Marcus R Munafò. Impaired recognition of basic emotions from facial expressions
in young people with autism spectrum disorder: Assessing the importance of expression
intensity. Journal of autism and developmental disorders, pages 1–11, 2017.

[9] Eeva A Elliott and Arthur M Jacobs. Facial expressions, emotions, and sign languages.
Frontiers in psychology, 4, 2013.

[10] Albert Mehrabian. Communication without words. Communication theory, pages 193–200,
2008.

[11] Guillaume-Benjamin Duchenne. The mechanism of human facial expression. Cambridge University Press, 1990.


[12] Ciprian Adrian Corneanu, Marc Oliu Simon, Jeffrey F Cohn, and Sergio Escalera Guerrero.
Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition:
History, trends, and affect-related applications. IEEE transactions on pattern analysis and
machine intelligence, 38(8):1548–1568, 2016.

[13] Paul Ekman. An argument for basic emotions. Cognition & emotion, 6(3-4):169–200, 1992.

[14] Karen L Schmidt and Jeffrey F Cohn. Human facial expressions as adaptations: Evolu-
tionary questions in facial expression research. American journal of physical anthropology,
116(S33):3–24, 2001.

[15] David Matsumoto, Dacher Keltner, Michelle N Shiota, MAUREEN O’Sullivan, and Mark
Frank. Facial expressions of emotion. Handbook of emotions, 3:211–234, 2008.

[16] Robert W Levenson, Paul Ekman, and Wallace V Friesen. Voluntary facial action generates
emotion-specific autonomic nervous system activity. Psychophysiology, 27(4):363–384,
1990.

[17] Nico H Frijda, Anna Tcherkassof, et al. Facial expressions as modes of action readiness.
The psychology of facial expression, pages 78–102, 1997.

[18] Paul Ekman and Wallace V Friesen. Facial action coding system. 1977.

[19] Evangelos Sariyanidi, Hatice Gunes, and Andrea Cavallaro. Automatic analysis of facial
affect: A survey of registration, representation, and recognition. IEEE transactions on
pattern analysis and machine intelligence, 37(6):1113–1133, 2015.

[20] Frank Y Shih, Chao-Fa Chuang, and Patrick SP Wang. Performance comparisons of facial
expression recognition in jaffe database. International Journal of Pattern Recognition and
Artificial Intelligence, 22(03):445–459, 2008.

[21] Tanja Bänziger, Marcello Mortillaro, and Klaus R Scherer. Introducing the geneva mul-
timodal expression corpus for experimental research on emotion perception. Emotion,
12(5):1161, 2012.

[22] Hatice Gunes and Björn Schuller. Categorical and dimensional affect analysis in continuous
input: Current trends and future directions. Image and Vision Computing, 31(2):120–136,
2013.

[23] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple
features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of
the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.

[24] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Con-
ference on, volume 1, pages 886–893. IEEE, 2005.

[25] Timo Ojala, Matti Pietikainen, and David Harwood. Performance evaluation of texture
measures with classification based on kullback discrimination of distributions. In Pattern
Recognition, 1994. Vol. 1-Conference A: Computer Vision & Image Processing., Proceed-
ings of the 12th IAPR International Conference on, volume 1, pages 582–585. IEEE, 1994.

[26] Abdenour Hadid. The local binary pattern approach and its applications to face analysis. In
Image Processing Theory, Tools and Applications, 2008. IPTA 2008. First Workshops on,
pages 1–9. IEEE, 2008.

[27] Joni-Kristian Kamarainen. Gabor features in image analysis. In Image Processing Theory,
Tools and Applications (IPTA), 2012 3rd International Conference on, pages 13–14. IEEE,
2012.

[28] Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and
other kernel-based learning methods. Cambridge university press, 2000.

[29] Caifeng Shan, Shaogang Gong, and Peter W McOwan. Facial expression recognition based
on local binary patterns: A comprehensive study. Image and Vision Computing, 27(6):803–
816, 2009.

[30] Philipp Michel and Rana El Kaliouby. Facial expression recognition using support vector
machines. In The 10th International Conference on Human-Computer Interaction, Crete,
Greece, 2005.

[31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–
444, 2015.

[32] Michael Goh. Facial expression recognition using a hybrid cnn–sift aggregator. In Multi-
disciplinary Trends in Artificial Intelligence: 11th International Workshop, MIWAI 2017,
Gadong, Brunei, November 20-22, 2017, Proceedings, volume 10607, page 139. Springer,
2017.

[33] Arushi Raghuvanshi and Vivek Choksi. Facial expression recognition with convolutional
neural networks. CS231n Course Projects, 2016.

[34] Shima Alizadeh and Azar Fazel. Convolutional neural networks for facial expression recog-
nition. arXiv preprint arXiv:1704.06756, 2017.

[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing sys-
tems, pages 1097–1105, 2012.

[36] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift. In International Conference on Machine Learning,
pages 448–456, 2015.

[37] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal
of machine learning research, 15(1):1929–1958, 2014.

[38] Jason Wang and Luis Perez. The effectiveness of data augmentation in image classification
using deep learning. Technical report, Technical report, 2017.

[39] Andrew Y Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78. ACM, 2004.

[40] Evaluating machine learning models - O'Reilly Media. https://www.oreilly.com/ideas/evaluating-machine-learning-models. (Accessed on 06/11/2018).

[41] Ming-Hsuan Yang, David J Kriegman, and Narendra Ahuja. Detecting faces in images:
A survey. IEEE Transactions on pattern analysis and machine intelligence, 24(1):34–58,
2002.

[42] Michael Jones and Paul Viola. Fast multi-view face detection. Mitsubishi Electric Research
Lab TR-20003-96, 3:14, 2003.

[43] Bernhard Froba and Andreas Ernst. Face detection with the modified census transform.
In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International
Conference on, pages 91–96. IEEE, 2004.

[44] Bo Wu, Haizhou Ai, Chang Huang, and Shihong Lao. Fast rotation invariant multi-view
face detection based on real adaboost. In Automatic Face and Gesture Recognition, 2004.
Proceedings. Sixth IEEE International Conference on, pages 79–84. IEEE, 2004.

[45] Rong Xiao, Huaiyi Zhu, He Sun, and Xiaoou Tang. Dynamic cascades for face detection.
In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8.
IEEE, 2007.

[46] Hongliang Jin, Qingshan Liu, Hanqing Lu, and Xiaofeng Tong. Face detection using im-
proved lbp under bayesian framework. In Image and Graphics (ICIG’04), Third Interna-
tional Conference on, pages 306–309. IEEE, 2004.

[47] Lun Zhang, Rufeng Chu, Shiming Xiang, Shengcai Liao, and Stan Z Li. Face detection
based on multi-block lbp representation. In International Conference on Biometrics, pages
11–18. Springer, 2007.

[48] Margarita Osadchy, Yann Le Cun, and Matthew L Miller. Synergistic face detection
and pose estimation with energy-based models. Journal of Machine Learning Research,
8(May):1197–1215, 2007.

[49] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional
neural network cascade for face detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 5325–5334, 2015.

[50] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and align-
ment using multitask cascaded convolutional networks. IEEE Signal Processing Letters,
23(10):1499–1503, 2016.

[51] Tinne Tuytelaars, Krystian Mikolajczyk, et al. Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2008.

[52] Jimei Yang, Shengcai Liao, and Stan Z Li. Automatic partial face alignment in nir video
sequences. In International Conference on Biometrics, pages 249–258. Springer, 2009.

[53] David G Lowe. Object recognition from local scale-invariant features. In Computer vision,
1999. The proceedings of the seventh IEEE international conference on, volume 2, pages
1150–1157. Ieee, 1999.

[54] Maja Pantic and Ioannis Patras. Dynamics of facial expression: recognition of facial actions
and their temporal segments from face profile image sequences. IEEE Transactions on
Systems, Man, and Cybernetics, Part B (Cybernetics), 36(2):433–449, 2006.

[55] Robert Walecki, Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Variable-state latent
conditional random fields for facial expression recognition and action unit detection. In
Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference
and Workshops on, volume 1, pages 1–8. IEEE, 2015.

[56] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance mod-
els. IEEE Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001.

[57] Simon Lucey, Ahmed Bilal Ashraf, and Jeffrey F Cohn. Investigating spontaneous facial
action recognition through aam representations of the face. In Face recognition. InTech,
2007.

[58] Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Multi-output laplacian dynamic or-
dinal regression for facial expression recognition and intensity estimation. In Computer Vi-
sion and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2634–2641. IEEE,
2012.

[59] Stefano Berretti, Boulbaba Ben Amor, Mohamed Daoudi, and Alberto Del Bimbo. 3d facial
expression recognition using sift descriptors of automatically detected keypoints. The Visual
Computer, 27(11):1021, 2011.

[60] Ira Cohen, Nicu Sebe, Ashutosh Garg, Lawrence S Chen, and Thomas S Huang. Facial
expression recognition from video sequences: temporal and static modeling. Computer
Vision and image understanding, 91(1):160–187, 2003.

[61] Ira Cohen, Nicu Sebe, FG Gozman, Marcelo Cesar Cirelo, and Thomas S Huang. Learning
bayesian network classifiers for facial expression recognition both labeled and unlabeled
data. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Com-
puter Society Conference on, volume 1, pages I–I. IEEE, 2003.

[62] Petar S Aleksic and Aggelos K Katsaggelos. Automatic facial expression recognition using
facial animation parameters and multistream hmms. IEEE Transactions on Information
Forensics and Security, 1(1):3–11, 2006.

[63] Montse Pardàs and Antonio Bonafonte. Facial animation parameters extraction and expres-
sion recognition using hidden markov models. Signal Processing: Image Communication,
17(9):675–688, 2002.

[64] Zhengyou Zhang, Michael Lyons, Michael Schuster, and Shigeru Akamatsu. Compari-
son between geometry-based and gabor-wavelets-based facial expression recognition using
multi-layer perceptron. In Automatic Face and Gesture Recognition, 1998. Proceedings.
Third IEEE International Conference on, pages 454–459. IEEE, 1998.

[65] Marian Stewart Bartlett, Gwen Littlewort, Mark Frank, Claudia Lainscsek, Ian Fasel, and
Javier Movellan. Recognizing facial expression: machine learning and application to spon-
taneous behavior. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, volume 2, pages 568–573. IEEE, 2005.

[66] Michael J Lyons, Julien Budynek, and Shigeru Akamatsu. Automatic classification of
single facial images. IEEE transactions on pattern analysis and machine intelligence,
21(12):1357–1362, 1999.

[67] Wenfei Gu, Cheng Xiang, YV Venkatesh, Dong Huang, and Hai Lin. Facial expression
recognition using radial encoding of local gabor features and classifier synthesis. Pattern
Recognition, 45(1):80–91, 2012.

[68] Karan Sikka, Tingfan Wu, Josh Susskind, and Marian Bartlett. Exploring bag of words
architectures in the facial expression domain. In Computer Vision–ECCV 2012. Workshops
and Demonstrations, pages 250–259. Springer, 2012.

[69] Guoying Zhao and Matti Pietikainen. Dynamic texture recognition using local binary pat-
terns with an application to facial expressions. IEEE transactions on pattern analysis and
machine intelligence, 29(6):915–928, 2007.

[70] Bo Sun, Liandong Li, Tian Zuo, Ying Chen, Guoyan Zhou, and Xuewen Wu. Combining
multimodal features with hierarchical classifier fusion for emotion recognition in the wild.
In Proceedings of the 16th International Conference on Multimodal Interaction, pages 481–
486. ACM, 2014.

[71] Lang He, Dongmei Jiang, Le Yang, Ercheng Pei, Peng Wu, and Hichem Sahli. Multimodal
affective dimension prediction using deep bidirectional long short-term memory recurrent
neural networks. In Proceedings of the 5th International Workshop on Audio/Visual Emo-
tion Challenge, pages 73–80. ACM, 2015.

[72] A Geetha, Vennila Ramalingam, S Palanivel, and B Palaniappan. Facial expression recognition–a real time approach. Expert Systems with Applications, 36(1):303–308, 2009.

[73] Benjamín Hernández, Gustavo Olague, Riad Hammoud, Leonardo Trujillo, and Eva
Romero. Visual learning of texture descriptors for facial expression recognition in ther-
mal imagery. Computer Vision and Image Understanding, 106(2):258–269, 2007.

[74] Sander Koelstra, Maja Pantic, and Ioannis Patras. A dynamic texture-based approach to
recognition of facial actions and their temporal models. IEEE transactions on pattern anal-
ysis and machine intelligence, 32(11):1940–1954, 2010.

[75] Maja Pantic and Marian Stewart Bartlett. Machine analysis of facial expressions. In Face
recognition. InTech, 2007.

[76] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from
scratch. arXiv preprint arXiv:1411.7923, 2014.

[77] Zhiding Yu and Cha Zhang. Image based static facial expression recognition with multiple
deep network learning. In Proceedings of the 2015 ACM on International Conference on
Multimodal Interaction, pages 435–442. ACM, 2015.

[78] Mengyi Liu, Shaoxin Li, Shiguang Shan, and Xilin Chen. Au-aware deep networks for
facial expression recognition. In Automatic Face and Gesture Recognition (FG), 2013 10th
IEEE International Conference and Workshops on, pages 1–6. IEEE, 2013.

[79] Mao Xu, Wei Cheng, Qian Zhao, Li Ma, and Fang Xu. Facial expression recognition based
on transfer learning from deep convolutional networks. In Natural Computation (ICNC),
2015 11th International Conference on, pages 702–708. IEEE, 2015.

[80] Tian Xia, Yifeng Zhang, and Yuan Liu. Expression recognition in the wild with transfer
learning.

[81] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in
neural information processing systems, pages 2672–2680, 2014.

[82] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Adversarial gen-
erative nets: Neural network attacks on state-of-the-art face recognition. arXiv preprint
arXiv:1801.00349, 2017.

[83] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learn-
ing. In Proceedings of the 26th annual international conference on machine learning, pages
41–48. ACM, 2009.

[84] Liangke Gui, Tadas Baltrušaitis, and Louis-Philippe Morency. Curriculum learning for
facial expression recognition. In Automatic Face & Gesture Recognition (FG 2017), 2017
12th IEEE International Conference on, pages 505–511. IEEE, 2017.

[85] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of
neural networks using dropconnect. In International Conference on Machine Learning,
pages 1058–1066, 2013.

[86] Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural networks.
Neural Networks, 71:1–10, 2015.

[87] Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolu-
tional neural networks. arXiv preprint arXiv:1301.3557, 2013.

[88] Hui Ding, Shaohua Kevin Zhou, and Rama Chellappa. Facenet2expnet: Regularizing a deep
face recognition net for expression recognition. In Automatic Face & Gesture Recognition
(FG 2017), 2017 12th IEEE International Conference on, pages 118–126. IEEE, 2017.

[89] Pedro Ferreira, Jaime Cardoso, and Ana Rebelo. Physiological inspired deep neural networks for emotion recognition. https://drive.google.com/open?id=11HI3sEF4V0U06F-30-LLwWFO99IuCCH4, 2018.

[90] Irene Kotsia and Ioannis Pitas. Facial expression recognition in image sequences using
geometric deformation features and support vector machines. IEEE transactions on image
processing, 16(1):172–187, 2007.

[91] Arnaud Dapogny, Kevin Bailly, and Séverine Dubuisson. Dynamic facial expression recog-
nition by joint static and multi-time gap transition classification. In Automatic Face and
Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on,
volume 1, pages 1–6. IEEE, 2015.

[92] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Static facial expression
analysis in tough conditions: Data, evaluation protocol and benchmark. In Computer Vision
Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2106–2112.
IEEE, 2011.

[93] Takeo Kanade, Jeffrey F Cohn, and Yingli Tian. Comprehensive database for facial ex-
pression analysis. In Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth
IEEE International Conference on, pages 46–53. IEEE, 2000.

[94] Shengcai Liao, Xiangxin Zhu, Zhen Lei, Lun Zhang, and Stan Z Li. Learning multi-scale
block local binary patterns for face recognition. In International Conference on Biometrics,
pages 828–837. Springer, 2007.

[95] Byoung Chul Ko. A brief review of facial emotion recognition based on visual information.
sensors, 18(2):401, 2018.

[96] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.

[97] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding
for face recognition and clustering. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 815–823, 2015.

[98] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d
face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International
Conference on Computer Vision, 2017.

[99] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose
estimation using part affinity fields. In CVPR, volume 1, page 7, 2017.

[100] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose
machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 4724–4732, 2016.

[101] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Acted facial expressions
in the wild database. Australian National University, Canberra, Australia, Technical Report
TR-CS-11, 2, 2011.

[102] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain
Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit
and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops
(CVPRW), 2010 IEEE Computer Society Conference on, pages 94–101. IEEE, 2010.

[103] Mengyi Liu, Shiguang Shan, Ruiping Wang, and Xilin Chen. Learning expressionlets
on spatio-temporal manifold for dynamic facial expression recognition. In Computer Vi-
sion and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1749–1756. IEEE,
2014.

[104] Emotion recognition in the wild challenge 2015. https://cs.anu.edu.au/few/emotiw2015.html. (Accessed on 06/10/2018).
