
30th Annual International IEEE EMBS Conference

Vancouver, British Columbia, Canada, August 20-24, 2008

Feature extraction of speech signals in emotion identification


M. Morales-Perez, J. Echeverry-Correa, A. Orozco-Gutierrez and G. Castellanos-Dominguez

Abstract: In this work, the acoustic and spectral characteristics of speech and the automatic recognition of human emotional states through speech analysis have been studied. Acoustic features have been evaluated and features derived from time-frequency representations are proposed. The method is based on the representation of the speech signal through energy distributions (Gabor transform and WVD) and discrete coefficients (DWT and linear prediction analysis). A recognition accuracy of 94.6% for emotion detection is obtained on the SES database of emotional speech in the Spanish language.

I. INTRODUCTION
In the study of oral communication it is possible to
distinguish between two different channels. One of them
provides the transmission of messages in an explicit way
(speech), while the other channel contributes in an implicit
way providing information about the speaker; in this channel,
which is not discussed as much as the first one, the voice can
be considered a biological signal, due to the fact that it
contains extralinguistic information about the physiological and
emotional states of people [1].
Any type of research aimed at understanding the human
mind facilitates the development of applications in which
human-machine interfaces are involved [2]. For this reason,
it is relevant in the study of human behavior to understand
how humans express their emotions [3]. This paper presents
a methodology for the extraction of acoustic and spectral
characteristics from speech signals, from which it is possible
to determine the emotional state of the speaker.
This methodology can be implemented
in diagnosis and treatment of psychological disorders such
as Post-Traumatic Stress Disorders (PTSD) in which voice
analysis reveals hidden information to the specialist about
real emotional state of the patient [4]. Speech signals
are non-stationary random processes; they exhibit innate
rhythms and periodicities that are more readily expressed and
appreciated in terms of frequency than of time. For this
reason the use of time-frequency transforms is needed in
order to extract relevant features in time and frequency
domains [5].
This paper is organized as follows: first a description of the
time-frequency transforms used to perform the representation
of the signal is provided. Then, there is a description of
the automatic emotion identification process, from feature extraction and selection to automatic emotion recognition. Finally, a discussion of the results and some conclusions is provided.

M. Morales, J. Echeverry and A. Orozco are with the Faculty of Electrical, Electronic, Physics and Systems Engineering, Universidad Tecnológica de Pereira, La Julita, Pereira, Colombia. {mmperez,jdec,aorozco}@ohm.utp.edu.co

G. Castellanos is with the Faculty of Engineering, Universidad Nacional de Colombia, Manizales, Colombia. cgcastellanosd@unal.edu.co

978-1-4244-1815-2/08/$25.00 ©2008 IEEE.
II. TIME-FREQUENCY AND PROCESSING METHODS
Unlike conventional methods, such as time-based techniques and the Fourier transform, time-frequency transforms give a joint representation in the time and frequency domains; with this information it is possible to know the dynamics of the spectral content over time. In this work the Gabor transform, the discrete wavelet transform (DWT), the Wigner-Ville distribution (WVD), linear prediction analysis and raw data analysis are used.
The Gabor transform applies a time window g(t) to a portion of the speech signal s(t), allowing a local application of the Fourier transform. This reveals the frequency information located within the effective time support of the window. Translating the window in time covers the whole domain of the signal, and a time-frequency representation s_g(\tau, \omega) is obtained [6].
This transform can be obtained by (1):

s_g(\tau, \omega) = \int_{-\infty}^{\infty} s(t)\, g(t - \tau)\, e^{-j\omega t}\, dt \qquad (1)
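As an illustration, a windowed (Gabor-like) representation of this kind can be computed with a Gaussian window. This is a minimal sketch assuming SciPy is available; the 30 ms window with 50% overlap follows the segmentation described later in the paper, and the Gaussian width is an illustrative choice.

import numpy as np
from scipy.signal import stft

def gabor_representation(s, fs, win_ms=30.0, overlap=0.5):
    """Windowed Fourier (Gabor-like) magnitude |s_g(tau, omega)| of a speech
    signal s sampled at fs Hz, using a Gaussian analysis window."""
    nperseg = int(win_ms * 1e-3 * fs)            # samples per 30 ms window
    noverlap = int(nperseg * overlap)            # 50% overlap between frames
    window = ("gaussian", nperseg / 6.0)         # std chosen so the window decays inside the frame (assumption)
    f, t, Z = stft(s, fs=fs, window=window, nperseg=nperseg, noverlap=noverlap)
    return f, t, np.abs(Z)                       # frequency axis, time axis, magnitude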

The main purpose of the DWT is to represent the signal through approximation and detail coefficients {a_j, d_j}, which are obtained by passing the signal through low-pass and high-pass filters, respectively.
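A sketch of this decomposition using PyWavelets is shown below. The symlet family and the 8 decomposition levels are taken from the discussion section of this paper; the specific symlet order (sym8) is an assumption.

import pywt

def dwt_coefficients(s, wavelet="sym8", levels=8):
    """Approximation and detail coefficients {a_j, d_j} of the signal s, obtained
    by iterated low-pass / high-pass filtering and downsampling."""
    max_level = pywt.dwt_max_level(len(s), pywt.Wavelet(wavelet).dec_len)
    coeffs = pywt.wavedec(s, wavelet=wavelet, level=min(levels, max_level))
    approximation, details = coeffs[0], coeffs[1:]   # [a_L, d_L, d_{L-1}, ..., d_1]
    return approximation, details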
The WVD is a non-linear transform that is functionally similar to a spectrogram; it is given by (2). On the whole it provides better time and frequency resolution, at the expense of many artifacts and the introduction of negative values, which would correspond to negative energy (not physically possible, and a significant defect of the method). However, the Wigner-Ville spectrum provides very useful information about the energy content of a signal.
WV_s(t, \omega) = \int_{-\infty}^{\infty} s\!\left(t + \frac{\tau}{2}\right) s^{*}\!\left(t - \frac{\tau}{2}\right) e^{-j\omega\tau}\, d\tau \qquad (2)
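A direct, unoptimized sketch of a discrete version of (2) is given below; it evaluates the instantaneous autocorrelation at integer lags and Fourier-transforms it over the lag variable, which is one common discretization rather than the paper's exact implementation.

import numpy as np

def wigner_ville(x):
    """Discrete Wigner-Ville distribution of x; rows index time, columns index frequency bins."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    acf = np.zeros((n, n), dtype=complex)
    for t in range(n):
        lag_max = min(t, n - 1 - t)              # largest admissible half-lag at time t
        lags = np.arange(-lag_max, lag_max + 1)
        # instantaneous autocorrelation x(t + lag) * conj(x(t - lag))
        acf[t, lags % n] = x[t + lags] * np.conj(x[t - lags])
    return np.real(np.fft.fft(acf, axis=1))      # Fourier transform over the lag axis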

One of the most commonly used techniques for characterization based on predefined models of speech signals is Linear Prediction (LP) analysis. LP focuses on the generation of a filter model given by the transfer function between its input and output. The behavior of the vocal tract is related
to a filter with transfer function H(z) whose parameters
vary over time depending on the action taken to utter a
word. There are two possible input signals for the filter:
voiced speech (pulse train) and unvoiced speech (white
noise). A basic scheme of this model is presented in Fig. 1.


[Fig. 1. Speech production general model: a pulse generator driven by the fundamental period (voiced speech) and a random noise generator (unvoiced speech) excite a glottal model G(z) and a vocal tract model H(z), described by the vocal tract parameters and a gain G.]

For modeling the vocal tract, an all-pole function is used for H(z), as in (3):

H(z) = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} \qquad (3)

Finally, the signal is completely represented by the estimated coefficients a_i and the gain G.

In the case of raw data, the analysis works directly on the entire universe of data; no transform or representation that reduces the information is performed on the signal.
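The coefficients a_i and the gain G in (3) can be estimated per frame with the classical autocorrelation (Yule-Walker) method. The sketch below assumes SciPy; the prediction order of 12 is an illustrative value, not one stated in the paper.

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """Estimate the all-pole coefficients a_i and the gain G of one speech frame
    via the autocorrelation (Yule-Walker) method."""
    frame = np.asarray(frame, dtype=float)
    full = np.correlate(frame, frame, mode="full")
    r = full[len(frame) - 1 : len(frame) + order]      # autocorrelation r[0..order]
    a = solve_toeplitz(r[:order], r[1:order + 1])      # solve the Toeplitz normal equations
    gain = np.sqrt(max(r[0] - np.dot(a, r[1:order + 1]), 0.0))  # residual power -> model gain
    return a, gain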


III. FEATURE EXTRACTION


Feature extraction methods obtain parameters that, according to their relevance, allow a complete or partial description of the speech signal. The main objective is to reduce the amount of data and to enhance the aspects of the signal that contribute significantly to subsequent processes (recognition, segmentation or classification). In speech analysis it is common to use two types of features: acoustic features, which have a physical sense, and representation characteristics, which correspond to values calculated over a representation of the speech signal and in general do not correspond to any physical sense.
Feature extraction is performed over time intervals of between 20 and 40 ms. In these intervals the signal is considered quasi-stationary, due to the fact that its statistical parameters remain invariant inside the observation interval.
As previously mentioned, the acoustic parameters have a physical meaning, which allows a qualification of the vocal qualities. Acoustic parameters can be classified in the following way:


1) Quasi-periodic parameters: show different levels of periodicity in the speech signal; in this work the fundamental frequency (F0) was studied.
2) Disturbance parameters: show the relative variation of a given parameter (jitter, shimmer, HNR). The disturbance parameters are calculated using the following expression:

V_p = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|p(i+1) - p(i)\right|}{\frac{1}{N}\sum_{i=1}^{N}\left|p(i)\right|}

where p(i) is the value of the parameter in the i-th interval, N is the number of intervals and V_p is the value of the parameter disturbance.
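As a sketch, the disturbance measure above can be applied to any per-interval parameter track, for example the F0 values of consecutive frames for a jitter-like measure or the frame amplitudes for a shimmer-like measure:

import numpy as np

def disturbance(p):
    """Relative perturbation V_p of a parameter track p(1..N), as defined above."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    mean_abs_diff = np.abs(np.diff(p)).sum() / (n - 1)   # (1/(N-1)) * sum |p(i+1) - p(i)|
    mean_abs = np.abs(p).sum() / n                       # (1/N) * sum |p(i)|
    return mean_abs_diff / mean_abs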


The representation features describe the dynamical behavior of the signal and are calculated from one of the forms of representation (LPC coefficients, time-frequency transforms); in general they are not associated with any physical phenomenon.
As there is no a priori knowledge of the features that will provide a better result in emotion recognition, it was considered appropriate to obtain a large number of parameters and then dismiss those which subsequently prove redundant. As stated earlier, there are two types of characteristics to extract from speech signals: those from acoustic analysis and those from the time-frequency representations.
First, time-frequency features are extracted from the various
transformations applied to the signal. They can be viewed
as statistical parameters which provide information on the
nature of the power spectral density of the signal.
The acoustic characteristics can be studied as parameters which provide information on physical qualities of the voice (changes of tone or variations of voice amplitude). Fig. 2 shows contours of the fundamental frequency, calculated from the autocorrelation function, for the word /coche/ in different emotional states. A high frequency content can be observed for the Surprise and Happiness states, and a low one for the Neutral and Sadness states [9].
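A minimal per-frame autocorrelation pitch estimator of this kind is sketched below; the voiced-frame assumption and the 75-400 Hz search range are illustrative choices, not values given in the paper.

import numpy as np

def f0_autocorrelation(frame, fs, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency of a voiced frame from the peak of its
    autocorrelation function, searched within a plausible pitch range."""
    frame = np.asarray(frame, dtype=float) - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags 0..N-1
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])                  # strongest periodicity
    return fs / lag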

[Fig. 2. Contours of the fundamental frequency (Hz) over time (ms) for the word /coche/ in the Happiness, Anger, Neutral, Surprise and Sadness states.]

In all cases (except for the Gabor transform, which segments the signal through its inherent window), the signal was previously segmented with windows 30 ms long and an overlap of 50%. The values of each extracted parameter are then averaged over the frames, resulting in an average feature vector for each signal. In total 104 features are obtained, grouped according to the methodology used in their extraction; a general scheme of the characterization is shown in Fig. 3.
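The segmentation and per-signal averaging just described can be sketched as follows, assuming a hypothetical frame_fn that maps one frame to a fixed-length feature vector (the actual 104 features combine all the extraction stages of Fig. 3):

import numpy as np

def framed_feature_vector(s, fs, frame_fn, win_ms=30.0, overlap=0.5):
    """Split s into 30 ms frames with 50% overlap, apply frame_fn to each frame
    and average the per-frame feature vectors into one vector per signal."""
    frame_len = int(win_ms * 1e-3 * fs)
    hop = int(frame_len * (1.0 - overlap))
    frames = [s[i:i + frame_len] for i in range(0, len(s) - frame_len + 1, hop)]
    return np.mean([frame_fn(f) for f in frames], axis=0)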
[Fig. 3. Characterization scheme: the windowed speech signal passes through the feature extraction stages (Gabor, WVD, DWT, LPC, ACF and raw data analysis), yielding 104 characteristics per signal.]

IV. EXPERIMENTAL RESULTS

In this work, the Spanish Emotional Speech (SES) database has been used. It contains two emotional speech recording sessions played by a professional male actor in an acoustically treated studio. Each recorded session includes thirty words, fifteen short sentences and three paragraphs, simulating three basic or primary emotions (sadness, happiness and anger), one secondary emotion (surprise) and a neutral speaking style [7]. This study was conducted on 30 single words of each emotional state.
Table I shows the results of emotion identification carried out by a group of human listeners over short phrases; the target global percentage was 80.89% [8].

[TABLE I. Confusion matrix for emotion identification performed by human listeners (rows: intended emotion; columns: identified emotion, %), with per-emotion precision.]

The results of classification and recognition can be affected by a lack of care in the design of the experiments. In this case, to ensure the statistical reliability of the results, a Bayes classifier is used to implement the emotion recognizer and the leave-one-out method is used for test validation.
In order to eliminate redundant features, all possible combinations are classified and the sub-group with the highest percentage in the classification process is chosen.
Tests were conducted using each of the techniques on an individual basis in order to compare the results of each of the proposed transformations and methods of representation. The confusion matrices and the respective percentages of emotion recognition accuracy are shown in Tables II to VI.
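A sketch of this validation and subset-selection protocol is given below, assuming scikit-learn. The paper's Bayes classifier is approximated here by a Gaussian naive Bayes model, and the exhaustive search is interpreted as running over combinations of the feature groups of Fig. 3; both are assumptions, not the authors' exact setup.

import itertools
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a Gaussian naive Bayes classifier on the
    per-signal feature matrix X (n_signals, n_features) with emotion labels y."""
    return cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut()).mean()

def best_group_combination(groups, y):
    """Evaluate every combination of feature groups (e.g. the Gabor, WVD, DWT,
    LPC, ACF and raw-data blocks) and keep the best-scoring subset.
    `groups` maps a group name to its (n_signals, n_features) matrix."""
    best = (0.0, ())
    names = list(groups)
    for k in range(1, len(names) + 1):
        for combo in itertools.combinations(names, k):
            X = np.hstack([groups[g] for g in combo])
            best = max(best, (loo_accuracy(X, y), combo))
    return best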

[TABLE II. Confusion matrix for emotion identification with LPC features (rows: real emotion; columns: identified emotion, %); global accuracy 64.66%.]

[TABLE III. Confusion matrix for emotion identification with WVD features (rows: real emotion; columns: identified emotion, %).]

[TABLE IV. Confusion matrix for emotion identification with DWT features (rows: real emotion; columns: identified emotion, %).]

[TABLE V. Confusion matrix for emotion identification with Gabor features (rows: real emotion; columns: identified emotion, %).]

[TABLE VI. Confusion matrix for emotion identification with raw data features (rows: real emotion; columns: identified emotion, %); global accuracy 80.66%.]

Using a combination of the most discriminating features of the techniques described above, the results shown in Table VII are obtained.

[TABLE VII. Confusion matrix for emotion identification with mixed features (rows: real emotion; columns: identified emotion, %); global accuracy 94.66%.]
V. DISCUSSION
The identification of emotions with LPC features presents the highest percentage for the neutral emotional state, with 83.33%, while the lowest was obtained for sadness, with 40%. This technique exhibits a tendency to over-identify emotional states with high energy content, such as surprise, and to miss those with low energy content, such as sadness.
Using time-frequency techniques, identification percentages of 77% were reached. The lowest percentage was obtained using the WVD; the cross-terms generated in the WVD do not yield a desirable representation of the emotional states. A symlet wavelet with 8 decomposition levels was used to calculate the approximation and detail coefficients; the best percentage was obtained for surprise, with 80%, and the lowest for anger, with 66.66%. The identification presents low variance, which indicates discrimination of the different emotional states. The best results using time-frequency techniques were obtained with the Gabor transform, due to its trade-off between time and frequency resolution.
The Raw Data technique presents the best results for the task of emotion identification, with a percentage of 80.66%. This technique is an analysis of the voice in the time domain and represents the dynamics of parameters such as intensity, duration and speech rate.
Finally, the most discriminating features of the previous techniques were combined. The identification task shows notable success, with a global percentage of 94.66%, low variance and higher precision in comparison with any of the methodologies used previously.

VI. CONCLUSIONS

A methodology for emotion detection in speech signals, based on the use of acoustic and representation (spectral) features, has been developed. The results show high percentages in emotional state recognition. The technique was validated using a Bayesian classifier. Comparing Tables I and VII proves the effectiveness of the methodology, which improves by 13.68% on the identification carried out by a group of human listeners over short phrases.
It is possible to develop a real-time system that serves as diagnostic support for specialists in the treatment of psychological illnesses such as Post-Traumatic Stress Disorder.
It is important to highlight that feature extraction on raw data presents better results than the other techniques evaluated individually; raw data contain information about the intensity, duration and pauses of the voice. The use of features extracted directly from the linear prediction coefficients is also justified, given that their adjustment constitutes a dynamic representation of the vocal tract model, which is affected by the physiological changes produced by the different emotional states.

VII. ACKNOWLEDGMENTS

This work has been partially funded by Colciencias and Universidad Tecnológica de Pereira under contract 1110-370-19600.
The authors would like to thank the Speech Technology Group, Dept. of Electronic Engineering, Technical University of Madrid, and especially Juan Manuel Montero, for providing the SES database used in this study.

REFERENCES
[1] X. Huang, Spoken Language Processing. Prentice Hall, 2001.
[2] R. Cowie et al., "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, 2001.
[3] E. Väyrynen, "Automatic emotion recognition from speech," Master's thesis, Department of Electrical and Information Engineering, University of Oulu, Finland, 2005.
[4] G. Castellanos, E. Delgado, G. Daza, L. G. Sanchez, and J. F. Suarez, "Feature selection in pathology detection using hybrid multidimensional analysis," in Proc. 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS '06), pp. 5503-5506, 2006.
[5] F. Hlawatsch and G. Matz, "Time-frequency signal processing: A statistical perspective," invited paper in Proc. CSSP-98, Mierlo (NL), pp. 207-219, 1998.
[6] R. Rangayyan, Biomedical Signal Analysis. Wiley & Sons, 2001.
[7] J. M. Montero, J. Gutiérrez-Arriola, S. Palazuelos, E. Enríquez, S. Aguilera, and J. M. Pardo, "Emotional speech synthesis: From speech database to TTS," in Proc. 5th International Conference on Spoken Language Processing (ICSLP 98), Australia, 1998.
[8] R. Barra, J. M. Montero, and J. Macías, "Prosodic and segmental rubrics in emotion identification," in Proc. ICASSP 2006, IEEE International Conference on Acoustics, Speech and Signal Processing, 2006.
[9] J. M. Montero, J. Gutiérrez-Arriola, R. Córdoba, E. Enríquez, and J. M. Pardo, "The role of pitch and tempo in Spanish emotional speech: Towards concatenative synthesis," in Improvements in Speech Synthesis. Wiley & Sons, 2002.
