Speaker Identification
Using Instantaneous Frequencies
Marco Grimaldi, Fred Cummins

Abstract: This work presents an experimental evaluation of different features for use in speaker identification. The features are tested using speech data provided by the CHAINS corpus, in a closed-set speaker identification task. The main objective of the paper is to present a novel parametrization of speech that is based on the AM-FM representation of the speech signal and to assess the utility of these features in the context of speaker identification. In order to explore the extent to which different instantaneous frequencies due to the presence of formants and harmonics in the speech signal may predict a speaker's identity, this work evaluates three different decompositions of the speech signal within the same AM-FM framework: a first setup has been used previously for formant tracking; a second setup is designed to enhance familiar resonances below 4000 Hz; and a third setup is designed to approximate the bandwidth scaling of the filters conventionally used in the extraction of MFCCs. From each of the proposed setups, parameters are extracted and used in a closed-set text-independent speaker identification task. The performance of the new featural representation is compared with results obtained adopting MFCC and RASTA-PLP features in the context of a generic Gaussian mixture model (GMM) classification system. In evaluating the novel features, we look selectively at information for speaker identification contained in the frequency ranges 0 Hz–4000 Hz and 4000 Hz–8000 Hz, as the instantaneous frequencies revealed by the AM-FM approach suggest the presence of structures not well known from conventional spectrographic analyses. Accuracy results obtained using the new parametrization perform as well as conventional MFCC parameters within the same reference system, when tested and trained on modally voiced speech which is mismatched in both channel and style. When the testing material is whispered speech, the new parameters provide better results than any of the other features tested, although they remain far from ideal in this limiting case.

I. INTRODUCTION

Humans are fairly good at identifying speakers based on their voices alone. The large amount of work in the field of speaker recognition over the previous 30 years has been predicated on the belief that automated systems ought to be able to do as well, or even better, than humans. Yet we still lack a solid understanding of those characteristics of speech that index an utterance as originating in one speaker rather than another. The introduction of the idea of a voice print by Kersta [21] led to a common perception that the now familiar spectrogram could function in much the same way as a fingerprint. Phoneticians are well aware, however, of the inherent variability in the speech signal, such that no two utterances are ever identical. Moreover, much of what we know about speech comes from an intellectual focus upon features which make linguistic communication possible, which necessarily ignores the idiosyncratic properties of speech that differentiate linguistically similar utterances. The goal of work in speaker recognition, on the other hand, is to find measurable quantities that minimize within-speaker variability and simultaneously maximize between-speaker variability [1], [18], [38].

The general area of speaker recognition encompasses two fundamental tasks: speaker identification and speaker verification [3], [7], [35]. Speaker identification is the task of assigning an unknown voice to one of the speakers known by the system: it is assumed that the voice must come from a fixed set of speakers. Thus, the system must solve an n-class classification problem and the task is often referred to as closed-set identification. On the other hand, speaker verification refers to the case of open-set identification: it is generally assumed that the unknown voice may come from an impostor. Regardless of the specific task at hand, it is common practice to adopt a probabilistic approach that predicts the likelihood that a given speech sample belongs to a given speaker [3], [7], [27], [34], [36], [35]. The base system for speaker recognition is usually composed of a speech parametrization module and a statistical modeling module [3], which are responsible for the production of a machine-readable parametrization of the speech samples and the computation of a statistical model from the parameters. The main difference between speaker identification and speaker verification is that in the first case the system provides one model for each speaker, while, in the second case, the system provides a total of two models: one for the hypothesized speaker and one representing the hypothesis that the speech sample comes from some other speaker (the background model).

The base model for speech parametrization (the front-end of the recognition system) usually adopted is the source-filter model, which leads to the extraction of parameters such as LPC, MFCC, PLP, etc. (e.g.: [8], [17], [24], [27], [34], [36], [35]). While these parameters have proved highly successful in robust speech recognition, the amplitude spectrum typically employed is highly sensitive to changes in speaking conditions such as changing channels and speaking style [6], [35], [38]. This is well illustrated by the so-called "great divide" in the KING corpus [15]. Minor changes in the recording setup of the corpus introduced differences in the spectral slope of the data, badly affecting system performance. To compensate for the shortcomings of standard speech parametrization, researchers often opt to train the decision algorithm using a mix of samples from different sessions/setups [15] and to implement different forms of channel normalization such as cepstral mean normalization or RASTA processing [3], [17], [34], [36]. While these methodologies have potential in some speaker verification scenarios (where it is possible to perform multi-session recordings), the same is not true in areas where researchers and investigators do not have full control of the devices used for recording, or of the recording environment,
or the amount of recorded material. Moreover, many research studies have stressed the fact that further difficulties may arise due to (possibly volitional) changes in speaking register and speaking style [1], [6], [18], [38].

In the context of text-independent speaker recognition, where there is no prior knowledge of what the speaker will say, one of the most successful methods of modeling human identity from speech has been the Gaussian Mixture Model (GMM) [3]. The GMM is usually viewed as a hybrid approach between parametric and nonparametric density models: like a parametric model, it has structure and parameters that control the behavior of the density in known ways, but without the constraint that the data must follow a specific distribution [3], [34], [36]. In the literature, other approaches have been proposed to model human identity from speech (e.g.: artificial neural networks and support vector machines, [8], [40], [41], [42]), demonstrating that other techniques may provide comparable performance.

The main objective of this work is an experimental comparison of different parametrizations of speech for speaker recognition. In order to reduce the degrees of freedom of the problem, we focus on the performance that a generic GMM classification system provides in closed-set identification (speaker identification, hereafter SI). The closed-set scenario is preferred to the open-set scenario (speaker verification), because of the open nature of the latter. As argued by Bimbot et al. [3, pp. 435-436], when building a background model for verification, speech should be selected such that it reflects the expected alternative speech to be encountered during the verification procedure. This applies to the type and quality of speech as well as the composition of the set of speakers. Moreover, other than general guidelines and experimentation, there is no objective measure to determine the optimal number of speakers or amount of speech to use in training a background model [3]. On the other hand, by focusing on closed-set identification, differences in the performance of the reference classification system may be directly related to a better separation of the speakers in the feature space, or lack of it. Furthermore, given that both speaker identification and speaker verification rely on the same base techniques, improvements in one technology may also provide improvements in the other.

The speech samples used in this work are extracted from the CHAINS corpus [9]. The CHAINS corpus is a novel speech corpus designed to facilitate research into the characterization of individual speakers. It offers speech samples obtained in two different recording sessions, with the unique possibility of studying the effect of non-modal voice (whispering) in SI. The first recording session used a very high quality recording environment and apparatus, while the second recording session is more typical of data recorded in a quiet office using a near-field microphone. Across the two sessions, each speaker provides recordings in six qualitatively different speaking styles. In this work, we make use of speech samples selected across the two recording sessions, training and testing Gaussian mixture models (GMMs) in mismatched channel and speaking style conditions. Speech samples extracted from the first, high-fidelity, recording session are used to train the induction algorithm, while material extracted from the second, lower quality recording session is used in testing the algorithm. The training material consists of utterances recorded from subjects reading a prepared text aloud, while the test material consists of speech samples obtained from subjects who willfully alter their speaking style. A first set of experiments uses test material in which the prepared text is read aloud at a fast rate; a second set of experiments uses test material in which the prepared text is read in a whisper. While our main focus is on the evaluation of a novel parametrization of speech for the purposes of SI, this work also attempts to address outstanding challenges in SI by making use of recordings in which the speakers intentionally modify their voices [37] and to examine the effect of non-modal voice (whispering) on the performance of the SI system.

This work introduces a new set of descriptors based on the AM-FM representation of the speech signal. This new signal characterization is obtained by extending the use of the pyknogram of the signal [30]. It seeks to exploit the rich set of time-varying frequencies inherent in the speech signal, and to encode them as parameters for speaker identification. Well-known frequency components of voice include both formant resonances and harmonics of the fundamental frequency. Other frequency components of unknown origin (bony structures, idiosyncratic anatomy) may also prove useful in identifying individuals. We therefore explore three alternative ways of parametrizing speech within a single AM-FM framework: the first setup we employ has been used in the literature for formant tracking [30]; a second setup was found, during preliminary tests, to enhance familiar resonances below 4000 Hz; and a third setup is designed to mimic the frequency scaling of the filters adopted in the extraction of MFCC coefficients. The recognition rates obtained using the three AM-FM setups are compared with the results obtained adopting MFCC and RASTA-PLP parameterizations within the same generic Gaussian mixture model (GMM) classification system.

II. THE AM-FM MODEL

Popular speech processing techniques are conventionally based on spectral analysis, using some variant of linear prediction or cepstral analysis [2], [13], [23]. These parametrizations are commonly used as a front-end for both speech recognition and for speaker identification/verification systems and are based (in one way or another) on the source-filter model of speech production [32], [33]. The source-filter model assumes that the sound source in modally voiced speech is localized in the larynx, while the vocal tract acts as a convolution filter for the emitted sound. Although this approach has led to great advances in the last 30 years, it is known to neglect some structure present in the speech signal [12]. Examples of phenomena not well captured by the source-filter model include unstable airflow, turbulence, and non-linearities arising from oscillators with time-varying masses [12], [25], [39]. In recent years, new ways of modeling and characterizing speech have been proposed in a number of different works (e.g.: [10], [11], [12], [19], [28], [29], [31]). Of particular interest here is AM-FM signal modeling. AM-FM modeling is a technique used especially by electrical engineers in the
context of frequency modulated signals, such as FM radio signals. This technique has been applied to speech signal analysis, with varying degrees of success, in areas such as formant tracking [30], speech synthesis [22], speech recognition [10], [11], [28], [29], [31] and speaker identification [19]. The AM-FM model is generally used to decompose a speech signal into decorrelated band-pass channels, each of which is characterized in terms of its envelope (instantaneous amplitude) and phase (instantaneous frequency).

In order to characterize a (single) instantaneous frequency for a real-valued signal, an analytic signal is first constructed: it is a transformation of the real signal into the complex domain and it is adopted because it permits the characterization of the real input in terms of instantaneous amplitude and frequency [14], [33]. More formally: given a real input signal s(t), its analytic signal s_a(t) can be computed as

    $s_a(t) = s(t) + j\,\hat{s}(t)$    (1)

where ŝ(t) is the Hilbert transform of s(t). The analytic signal s_a(t) can be decomposed as follows:

    $s_a(t) = a(t)\, e^{j\phi(t)}$    (2)

where a(t) is the instantaneous amplitude (envelope) of the analytic signal, while φ(t) is its phase. The instantaneous frequency (IF) f(t) of the analytic signal can be computed directly from the phase:

    $f(t) = \frac{1}{2\pi} \frac{d\phi(t)}{dt}.$    (3)

The importance of the instantaneous frequency (IF) stems from the fact that speech is a nonstationary signal with spectral characteristics that vary with time.

When a person speaks, the supra-glottal vocal tract modifies the sound wave originating in the vocal cords in very specific ways. Depending on the message embodied in the speech itself and on the anatomy of the vocal cavities of the speaker, the final result of the act of speaking is a very complex pressure signal containing many different forms of information. Different parts of the skull (soft parts and hard parts of the vocal tract) vibrate under the influence of a common energy generator (the lungs) and the consequent airflow across the glottis. The human ability to change the form of some parts of the vocal tract (e.g. the cross section of the larynx, the size of the mouth cavity) gives rise to the familiar dynamic formant structure found in speech. Changing the stiffness of the vocal cords causes changes in pitch. The presence of diseases (e.g., colds) in a speaker has the effect of enhancing the role played in speech production by some structures of the vocal tract (e.g., the nasal cavity). A speaker may voluntarily change some of his speaking habits in order to achieve a predefined goal, e.g. mimicry or disguise [37]. A speaker may involuntarily change some properties of his vocal tract under the influence of a particular state of mind, e.g. anger, stress, sadness. All these intrinsic factors in speech production affect the physical properties of the speech waveform: they collectively ensure that speech is a signal with a rich set of multi-component time-varying frequencies.

In the context of AM-FM signal modelling, the concept of a single-valued instantaneous frequency for a multi-component signal becomes meaningless without breaking the signal down into its components. As discussed in [4], the decomposition of a signal is not unique if its frequency components coincide at some points in the time-frequency plane. This is the case for speech, e.g.: formants are well known to have points in the time-frequency plane where they appear to join or split. In this case, the decomposition is heuristic in nature and its optimal form "will depend on the specific application" [4].

An example of AM-FM signal decomposition for speech analysis was proposed by Potamianos et al. [30] in the context of formant tracking. The authors proposed a new way of representing the speech signal through the computation of the pyknogram, which is a density plot of the frequencies present in the input signal. The pyknogram is computed by the authors using a uniformly spaced Gabor filterbank and a demodulation schema based on the Teager energy-tracking operator [20], [25]. A single instantaneous frequency is computed for each filter output and each frame of speech, yielding a 2-D array of instantaneous frequencies. Potamianos et al. demonstrate that the AM-FM approach can overcome some of the limitations of the classic source-filter model and greatly help the difficult task of speech formant tracking.

Jankowski et al. [19] adopted the AM-FM model to characterize some fine structures of the human voice for the purpose of speaker identification. The authors proposed a mixed LPC/AM-FM approach to identify, select and parametrize the first three formants of human speech. Their work demonstrated that this procedure may be beneficial in speaker identification: formant AM-FM parameters substantially improve identification rates on female speakers [19]. A more thorough comparison of AM-FM representations in an SI context has hitherto remained outstanding.

In this work, we extend the use of the pyknogram and the decomposition of the multi-frequency component speech signal to the problem of speaker identification. The approach proposed here seeks to identify which instantaneous frequencies are present in the speech signal and to encode them as parameters for speaker identification. Given the non-unique decomposition of a multi-frequency component signal and the lack of a theory justifying the adoption of any specific approximation to extract parameters which are invariant in the context of SI, we perform speech parametrization by applying three different decomposition algorithms to the same input signals. Within the same approach (AM-FM), it is possible to vary the way in which the speech pyknogram is computed in order to focus on different frequencies in the signal: structures due to the presence of resonators in the upper vocal tract (formants) and fine structures related to the presence of quasi-harmonic vibrations within the whole vocal tract.

In the next section we present the main structure of the algorithm used to calculate the pyknogram and the ad hoc modifications applied to selectively attend to different information embodied in the signal.
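Before turning to that algorithm, equations (1)-(3) can be made concrete with a minimal Python sketch of our own (not taken from the paper): scipy's Hilbert transform yields the analytic signal of a test chirp, and differencing the unwrapped phase recovers its instantaneous frequency.

    import numpy as np
    from scipy.signal import hilbert

    fs = 16000.0                                     # assumed sampling rate [Hz]
    t = np.arange(0, 0.05, 1.0 / fs)                 # 50 ms of signal
    s = np.cos(2 * np.pi * (500 * t + 2000 * t**2))  # chirp whose true IF is 500 + 4000*t Hz

    s_a = hilbert(s)                        # analytic signal s_a(t) = s(t) + j*s_hat(t), eq. (1)
    a = np.abs(s_a)                         # instantaneous amplitude a(t), eq. (2)
    phi = np.unwrap(np.angle(s_a))          # instantaneous phase phi(t), eq. (2)
    f = np.diff(phi) * fs / (2 * np.pi)     # f(t) = (1/2pi) dphi/dt, eq. (3)

    print(round(f[400]))                    # ~600 Hz, the true IF at t = 0.025 s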
A. Computing the Pyknogram of the Signal

In order to compute the instantaneous frequencies of the speech signal (and its pyknogram), a multiband demodulation analysis (MDA) must be performed. Similar to the process described in [30], the MDA consists of a multiband filtering scheme and a demodulation algorithm. First the speech signal is bandpassed with the use of a filterbank, and then each bandpass waveform is demodulated and its instantaneous amplitude and frequency computed.

The filterbank adopted consists of a set of Gabor bandpass filters with center frequencies that are uniformly spaced on the frequency axis. The filter bandwidth is either constant or variable depending on which frequency structures we seek to extract (Section II-B). Gabor filters are chosen because they are optimally compact and smooth in both the time and frequency domains. This characteristic guarantees accurate amplitude and frequency estimates in the demodulation stage [30] and reduces the incidence of ringing artifacts in the time domain [19].

The demodulation schema adopted is based on Hilbert Transform Demodulation (HTD). Although other demodulation schemas are less computationally expensive (e.g. the DESA family of algorithms [25], [30]), the HTD can give smaller error and smoother frequency estimates [25], especially when the first formant is close to the fundamental frequency [30]. The main schema adopted to demodulate the speech signal can be summarized as follows:

- the speech signal s(t) is bandpass filtered and a set of waveforms w_i(t) is obtained (i denotes the output of the i-th filter of the filterbank);
- for each bandpass waveform w_i(t), its Hilbert transform ŵ_i(t) is computed;
- the instantaneous amplitude a_i(t) for each bandpass waveform is computed as:

    $a_i(t) = \sqrt{w_i^2(t) + \hat{w}_i^2(t)}$    (4)

- the instantaneous frequency f_i(t) for each bandpass waveform is computed as the first time derivative of the phase φ_i(t):

    $f_i(t) = \frac{1}{2\pi}\frac{d\phi_i(t)}{dt} = \frac{1}{2\pi}\frac{d}{dt}\left[\arctan\left(\hat{w}_i(t)/w_i(t)\right)\right].$    (5)

Instantaneous amplitude and instantaneous frequency are combined to obtain a mean-amplitude weighted short-time estimate F_i of the instantaneous frequency for each w_i(t):

    $F_i = \frac{\int_{t_0}^{t_0+\tau} f_i(t)\, a_i^2(t)\, dt}{\int_{t_0}^{t_0+\tau} a_i^2(t)\, dt}$    (6)

where τ is the selected length of the time-frame. The short-time frequency estimate is computed for the full length of each w_i(t), with an overlap window of τ/2. The adoption of a mean-amplitude weighted instantaneous frequency (equation 6) is motivated by the fact that it provides more accurate frequency estimates and is more robust for low-energy and noisy frequency bands when compared with an unweighted frequency mean [30].

The computation of the short-time instantaneous frequency for each bandpass waveform i leads to the extraction of the pyknogram of the speech signal. Examples of pyknograms are shown in Figures 1, 2 and 3.

Figure 1. Pyknogram of "If it doesn't matter who wins, why do we keep score?". 80 filters linearly spaced between 200 Hz and 8200 Hz, constant bandwidth of 400 Hz (Filterbank setup 1). See also Figure 4 for a conventional spectrographic representation.

Figure 2. Pyknogram of "If it doesn't matter who wins, why do we keep score?". 80 filters linearly spaced between 200 Hz and 8200 Hz, constant bandwidth of 266 Mel (Filterbank setup 2).

The pyknogram, being a density plot of the frequencies present in the input signal, reveals the presence of strong resonances as regions of high density of the estimated short-time frequencies. Harmonic-like structures are identified as regions where the short-time estimates are placed with approximately equal frequency spacing.¹ Note that the pyknogram is not a 3-D plot similar to a spectrogram: each individual point represents the location of an (amplitude-weighted) instantaneous frequency, but not its intensity. It is also crucial that in our approach the bandwidths of the filters exhibit considerable overlap, allowing the center frequencies from multiple filters to converge on a single strong frequency.

¹ A special case is when the signal lacks a marked frequency component within a given band: the estimated instantaneous frequency is then simply the center frequency of the filter.
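The following Python sketch is one possible rendering of this pipeline (our reading of Section II-A, not the authors' code): a Gaussian-magnitude bandpass stands in for each Gabor filter, scipy's Hilbert transform supplies the envelope and phase of eqs. (4)-(5), and each frame is reduced to the amplitude-weighted mean of eq. (6). Treating the bandwidth parameter as roughly twice the Gaussian sigma is an assumption.

    import numpy as np
    from scipy.signal import hilbert

    def gabor_bandpass(s, fs, fc, bw_hz):
        """Bandpass s with a Gaussian (Gabor-like) magnitude response centred on fc."""
        freqs = np.fft.rfftfreq(len(s), 1.0 / fs)
        H = np.exp(-0.5 * ((freqs - fc) / (bw_hz / 2.0)) ** 2)   # bw treated as ~2 sigma
        return np.fft.irfft(np.fft.rfft(s) * H, n=len(s))

    def pyknogram(s, fs, centers, bandwidths, frame=1024, hop=512):
        """Return an (n_filters, n_frames) array of weighted short-time IF estimates."""
        rows = []
        for fc, bw in zip(centers, bandwidths):
            w = gabor_bandpass(s, fs, fc, bw)
            wa = hilbert(w)                                           # analytic waveform
            a2 = np.abs(wa) ** 2                                      # squared envelope, eq. (4)
            f = np.diff(np.unwrap(np.angle(wa))) * fs / (2 * np.pi)   # f_i(t), eq. (5)
            F = [np.sum(f[k:k + frame] * a2[k:k + frame]) /           # eq. (6), weighted mean
                 np.sum(a2[k:k + frame])
                 for k in range(0, len(f) - frame, hop)]
            rows.append(F)
        return np.array(rows)

    # Setup 1: 80 filters linearly spaced between 200 Hz and 8200 Hz, 400 Hz bandwidth.
    # pyk = pyknogram(signal, 44100, np.linspace(200, 8200, 80), np.full(80, 400.0))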
The details of the overlap are specific to the setup employed to calculate the pyknogram of the signal, as described in the next section.

B. Different Bandwidth, Different Frequency Structures

An extremely important role in the computation of the pyknogram (and in the encoding of the features derived from it) is played by the bandwidth of each individual Gabor filter within the filterbank. By varying the bandwidth of the Gabor filters, the short-time frequency estimates can be tuned to identify resonances (formants) or structures due to quasi-harmonic vibrations (e.g. during vowel production) in the vocal tract (e.g. the harmonics of the fundamental frequency). Fine-tuning the bandwidth of the filters used for speech analysis and speech characterization is a standard practice found in many approaches. The power spectrum of speech can be fine-tuned (varying the bandwidth of the spectral analysis) to enhance structures due to the presence of harmonics or formants. Generally, a broad-band spectrogram is obtained by setting the bandwidth (through the time length of the analysis window) to a value of about 250 Hz; a narrow-band spectrogram has a typical bandwidth of about 50 Hz. The extraction of a conventional MFCC feature vector is based on the definition of a filterbank of triangular filters with uniformly spaced center frequencies and constant bandwidth on the Mel scale: this approach produces filters with variable bandwidth on the Hertz scale. In order to tune the pyknogram to the speech resonances for formant tracking, Potamianos et al. [30] used a Gabor filterbank with a constant bandwidth of 400 Hz.

In this work, three different filterbanks are used to compute the pyknogram. In each case, the filterbank is composed of Gabor filters with center frequencies which are uniformly spaced on the Hertz scale, while the bandwidths are defined as follows:

- setup 1: constant (on the Hertz scale) bandwidth of 400 Hz
- setup 2: constant (on the Mel scale) bandwidth of 266 Mel
- setup 3: constant (on the Mel scale) bandwidth of 106 Mel

The first setup (setup 1) has been used in the literature [30] for formant tracking and is known to be of use in identifying structure due to the first 4 or 5 formants in the vocal tract (below about 4000 Hz). The second setup (setup 2) has been evaluated in our tests and found to reveal marked structures above 4000 Hz, while enhancing the formant structure below 4000 Hz. The third setup (setup 3) has been designed to replicate the bandwidth scaling of the triangular filters used for MFCC extraction. With this configuration it is possible to identify both the formants and the harmonic structures well known in the literature and present in the range of frequencies between 50 Hz and 4000 Hz.

Figure 3. Pyknogram of "If it doesn't matter who wins, why do we keep score?". 80 filters linearly spaced between 200 Hz and 8200 Hz, constant bandwidth of 106 Mel (Filterbank setup 3).

Figures 1, 2 and 3 show three different pyknograms obtained using the three different setups. The input sound file used is the same and has been extracted from the CHAINS corpus [9]: irm01 s01 solo.wav ("If it doesn't matter who wins, why do we keep score?"). Figure 4 shows its spectrogram computed using Praat [5]. All three pyknograms are computed using 80 Gabor filters with uniform center frequency spacing between 200 Hz and 8200 Hz. The bandwidths of the filters are set as described above. Thus the three setups differ in the degree of overlap among the filters, and in the relation between overlap and frequency.

The three different setups adopted in the calculation of the pyknogram of the speech signal reveal different instantaneous frequencies within the signal. Qualitatively, these differences are found not only in the range of frequencies commonly exploited for speaker and speech recognition (below 4000 Hz), but extend to the higher part of the frequency scale, between 4000 Hz and 8000 Hz. In this region, Figure 3 (setup 3) reveals instantaneous frequencies that differ greatly from the ones detected using the other two setups in the same frequency range.

Figure 4. Spectrogram of a speech file ("If it doesn't matter who wins, why do we keep score?"). The spectrogram shows frequencies between 0 Hz and 8200 Hz.

It is interesting to note that standard techniques used in speech analysis are somewhat limited in detecting frequency structures above 4000 Hz: the majority of them are not visible in the spectrogram of the same speech file shown in Figure 4.² On the other hand, the same formants between 0 and 4000 Hz are replicated almost exactly in each of Figures 1, 2, 3 and 4.

² No pre-emphasis is applied to the speech file to calculate the pyknograms in Figures 1, 2 and 3. Pre-emphasis of 6 dB/octave is applied to calculate the spectrogram in Figure 4.
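To make the two Mel-constant conventions concrete, the sketch below converts a constant Mel bandwidth into its Hertz equivalent at a given center frequency. It uses the common 2595 log10(1 + f/700) form of the Mel scale, which is an assumption on our part; the values it prints approximately reproduce the setup 2 and setup 3 columns of Table I below.

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def bw_hz(fc, bw_mel):
        """Hertz width of a band of constant Mel width centred on fc."""
        m = hz_to_mel(fc)
        return mel_to_hz(m + bw_mel / 2.0) - mel_to_hz(m - bw_mel / 2.0)

    for fc in (200, 1000, 3600, 8200):
        # setup 2 (266 Mel) and setup 3 (106 Mel); compare with Table I
        print(fc, round(bw_hz(fc, 266.0)), round(bw_hz(fc, 106.0)))

The printed values make the design choice visible: a constant Mel bandwidth translates into filters that widen with frequency on the Hertz axis, which is exactly what distinguishes setups 2 and 3 from the constant 400 Hz filters of setup 1.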
Given the unknown nature of the high-frequency components of human speech, we will not try to motivate their presence. We will limit ourselves to encoding their differing values (Section III) and verifying their relative utility for speaker identification (Section VI). Moreover, Section VI-C presents a quantitative analysis of the importance of the different instantaneous frequencies in the region 4000 Hz–8000 Hz, derived according to the three setups, for the problem of speaker identification.

Table I summarizes the frequency bandwidths employed in setup 1, setup 2 and setup 3 for selected Gabor filters.

Table I. Center frequency (cf) and bandwidth (bw) for selected Gabor filters used in setup 1, setup 2 and setup 3.

    cf [Hz]   setup 1: bw [Hz] / bw [Mel]   setup 2: bw [Hz] / bw [Mel]   setup 3: bw [Hz] / bw [Mel]
    200       400 / 510                     200 / 266                     85 / 106
    400       400 / 414                     260 / 266                     100 / 106
    600       400 / 350                     305 / 266                     120 / 106
    1000      400 / 266                     400 / 266                     160 / 106
    3600      400 / 105                     1020 / 266                    400 / 106
    6200      400 / 65                      1640 / 266                    650 / 106
    8200      400 / 50                      2100 / 266                    840 / 106

In the next section we describe the encoding of the pyknogram into the features used to characterize speech for speaker identification.

III. SPEECH PARAMETRIZATION

A. Pyknogram Frequency Estimate Coefficients

The encoding of the pyknogram into a set of feature vectors is straightforward, given its short-time nature and the probabilistic approach taken in speaker identification and verification. An utterance can be represented by an N × M matrix, where N corresponds to the number of Gabor filters used in the filterbank and M to the number of times the input signal can be segmented into frames with a given window length (the value of τ in equation 6). For each frame, N values corresponding to the estimated short-time frequency of each Gabor filter are encoded. This approach is very similar to the standard procedure used for, e.g., MFCC encoding; the main difference being that in the MFCC case N corresponds to the first N DCT coefficients used to decorrelate the cepstrum of the signal. In our case no DCT is applied to the pyknogram, while the frequency values are expressed in kHz. During preliminary tests, we found that expressing the short-time frequency estimates in kHz allows the feature set to work well together with a Gaussian Mixture Model classifier (Section IV). Hereafter we refer to the features based on the estimates of the short-time frequency as pykfec: pyknogram frequency estimate coefficients. Our decision not to apply the DCT has the effect of introducing an element of redundancy due to correlation in our encoding, but as we show in Section VI-B, this does not, in fact, seem to adversely affect the accuracy of the reference GMM system.

As a first approximation, we standardize the choice of N (the number of Gabor filters in the filterbank) to 80, with center frequencies uniformly spaced on a linear scale between 200 Hz and 8200 Hz; however, in Section VI-B, we examine the effect on SI of varying the number of filters within the same frequency range. Moreover, in order to test the importance of instantaneous frequencies extracted from different frequency ranges, we evaluate the speaker recognition rate provided by 40 pykfec computed using 40 Gabor filters uniformly spaced between 200 Hz and 4200 Hz (Section VI-C), also varying the setup adopted in the calculation of the pyknogram. Finally, in order to quantitatively test the importance of the different instantaneous frequencies in the region 4000 Hz–8000 Hz, we also evaluate the speaker recognition rate provided by 40 pykfec computed using 40 Gabor filters uniformly spaced between 4200 Hz and 8200 Hz (Section VI-C), varying the setup adopted in the calculation of the pyknogram.

The window length chosen to calculate the short-time frequency estimate is 1024 taps (about 23 msec for a signal sampled at 44100 Hz) with an overlap between successive windows of 512 taps. No channel compensation schema is adopted.

B. MFCC

In order to provide a comparison to the performance obtained using pykfec in the task of speaker identification, we also use standard MFCCs with the same general GMM classifier. Since pykfec are extracted using very different setups and taking into account different frequency ranges, the MFCCs are extracted according to three schemas roughly matching the ranges analysed in the pykfec extraction:

- MFCC-0-8000: MFCC extracted using 40 triangular filters spaced between 0 Hz and 8000 Hz;
- MFCC-0-4000: MFCC extracted using 40 triangular filters spaced between 0 Hz and 4000 Hz;
- MFCC-4000-8000: MFCC extracted using 40 triangular filters spaced between 4000 Hz and 8000 Hz.

The number of coefficients used for identification is not selected a priori; it is evaluated experimentally (Section VI) in order to maximize the speaker recognition rate. The window length is again set to 1024 taps with an overlap between successive windows of 512 taps, to ensure homogeneous sampling between the two approaches (MFCC and pykfec) for the same input files. The zeroth cepstral coefficient is not used in the Mel-frequency cepstral feature vector, while the value of the coefficients is normalized using cepstral mean removal [34], [36], in order to compensate for the different recording channels used to train and subsequently test the induction algorithm.

C. RASTA-PLP

A second comparison is provided using RASTA-PLP coefficients [16], [17] for speech parametrization. PLP coefficients are a hybrid representation, using aspects of both a filterbank and an all-pole model spectral representation. The spectrum is first passed through a Bark-spaced, trapezoidal-shaped filterbank and then fitted with an all-pole model. In this work, the model order is not fixed, but varies in order to maximize the recognition rate of the induction algorithm. The spectral representation is transformed to cepstral coefficients and a discrete cosine transform (DCT) applied as a final step. The zeroth cepstral coefficient is discarded as a form of energy normalization [34]. In order to compensate for channel effects, the input speech is pre-processed using RASTA filtering [17] before proceeding to the extraction of the PLP feature vector. RASTA-PLP features are extracted in the frequency interval 0 Hz–8000 Hz and with a time window of about 23 msec (with a step equal to half of the time window).
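As one concrete, hedged rendering of the MFCC recipe of Section III-B (librosa is our choice of tool here, not necessarily the authors'), the sketch below extracts 40-filter MFCCs over a selectable frequency range with 1024-tap windows and 512-tap hops, drops the zeroth coefficient, and applies cepstral mean removal:

    import numpy as np
    import librosa

    def mfcc_features(path, n_coef=25, fmin=0.0, fmax=8000.0):
        y, sr = librosa.load(path, sr=44100)
        c = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coef + 1, n_fft=1024,
                                 hop_length=512, n_mels=40, fmin=fmin, fmax=fmax)
        c = c[1:]                              # discard the zeroth cepstral coefficient
        c = c - c.mean(axis=1, keepdims=True)  # cepstral mean removal per utterance
        return c.T                             # (n_frames, n_coef)

    # feats = mfcc_features("irm01_s01_solo.wav", n_coef=25, fmax=4000.0)  # MFCC-0-4000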
IV. THE GAUSSIAN MIXTURE SPEAKER MODEL

The focus of this study is the utility of the various parametric representations of speech employed. Accordingly, a rather generic and simple classifier is employed, which is sufficiently powerful to discriminate among alternative features. No attempt is made to optimize the classifier, however. Gaussian Mixture Models are classifiers commonly used in speaker identification/verification, e.g. [27], [34], [36], [35]. This classifier is able to approximate the distribution of the acoustic classes representing broad phonetic events occurring in speech production (e.g. during the production of vowels, nasals, fricatives) and often outperforms other algorithms on the problem of Speaker Identification [36], [40], [42]. During training, the induction algorithm estimates the mixture of Gaussian models that best approximates the distribution of values produced by the SI system front-end for a given speaker.

Formally, a speaker is described by a mixture of M Gaussian models λ = {λ_1, ..., λ_M}: the mixture density is a weighted sum of the M component densities. Given an input feature vector x⃗, the conditional probability is computed from the mixture as follows:

    $p(\vec{x} \mid \lambda) = \sum_{m=1}^{M} c_m\, b_m(\vec{x})$    (7)

where the c_m are the mixture weights and b_m(x⃗) is an N-variate Gaussian function. The dimensionality N of the Gaussian function coincides with the dimensionality of the feature vector x⃗, while the M models, and their relative weights, are estimated from the training data using a special case of the expectation-maximization (EM) algorithm [36].

A set of S speakers {s_1, ..., s_S} is represented by S Gaussian mixture models {λ_1, ..., λ_S}. A given observation sequence X = {x⃗_1, ..., x⃗_T} is tested by finding the speaker model which has maximum a posteriori probability. By applying Bayes' rule and using the logarithm [36], the decision can be computed as:

    $\hat{S} = \arg\max_{1 \le s \le S} \sum_{t=1}^{T} \log p(\vec{x}_t \mid \lambda_s)$    (8)

To evaluate different test utterance lengths, the sequence of feature vectors is divided for each speaker into a sequence of overlapping segments of t_0 feature vectors [36]. Each segment is treated as a separate test utterance. For a test set containing the observation sequence X = {x⃗_1, ..., x⃗_T}, two adjacent segments would be:

    $\mathrm{segment}_n = \vec{x}_n, \ldots, \vec{x}_{n+t_0}$

and

    $\mathrm{segment}_{n+1} = \vec{x}_{n+1}, \ldots, \vec{x}_{n+t_0+1}$

where n + t_0 + 1 ≤ T. The final performance evaluation is computed as the ratio of correctly identified sequences over the total number of sequences for the whole set of speakers.

The GMM can assume different forms depending on the choice of covariance matrix used in the estimation of the N-variate Gaussian functions. In this work, nodal, diagonal covariance matrices are used to model the speakers [36], [40], [42].

A. Algorithm Details

There are some known shortcomings of the GMM induction algorithm, due largely to the initialization procedure used to estimate the variance and mean of each Gaussian in the mixture, and to the maximum and minimum values permitted for the variance. As pointed out by Reynolds et al. [36], the issue of initializing the GMM algorithm for SI is less challenging than in other areas [26]: in Speaker Identification, elaborate initialization schemes are not necessary for training Gaussian mixture models [36]. We apply a k-means clustering algorithm to find M clusters in the training data as an initialization procedure. The k-means search stops whenever a stable configuration is found.

On the other hand, the maximum and minimum values allowed for the variance of each Gaussian in the mixture are two parameters that play an important role in the training of the GMM algorithm. In preliminary tests using pykfec features, we found that if the nodal variance assumes large values (≫100), the EM algorithm does not always converge and the generalization accuracy of the GMM is badly affected. When the minimum nodal variance is not bounded, the GMM may model singularities due, e.g., to a lack of data to represent a specific speaker or to the presence of outliers in the training set caused by noisy data [36]. To overcome these two problems, we encode the pykfec, which express the estimate of short-time instantaneous frequency, in kHz. This simple stratagem guarantees that the maximum nodal variance does not exceed a few kHz, or the maximum bandwidth of the filters in the filterbanks (Table I). The minimum value for the nodal variance is set to 0.001, which corresponds to a minimum nodal variance of 1 Hz when pykfec are used. The same absolute value for the minimum nodal variance is used when MFCCs are employed.
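A compact sketch of how eqs. (7)-(8) and the segment-based scoring can be realized; scikit-learn's GaussianMixture (diagonal covariances, k-means initialization, and a variance floor via reg_covar) is our stand-in for the training recipe described above, not the authors' implementation:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_models(train_feats, n_components=64):
        """train_feats: dict speaker -> (n_frames, n_dims) feature array."""
        models = {}
        for spk, X in train_feats.items():
            g = GaussianMixture(n_components=n_components, covariance_type="diag",
                                init_params="kmeans",   # k-means initialization
                                reg_covar=1e-3,         # variance floor (0.001, i.e. ~1 Hz for kHz-scaled pykfec)
                                max_iter=200)
            models[spk] = g.fit(X)
        return models

    def identify(models, X):
        """Eq. (8): speaker maximizing the summed frame log-likelihood."""
        return max(models, key=lambda spk: models[spk].score(X) * len(X))

    def segments(X, t0, hop=1):
        """Overlapping test segments of t0 feature vectors, as described above."""
        return [X[n:n + t0] for n in range(0, len(X) - t0 + 1, hop)]

    # accuracy = fraction of segments for which identify(models, seg) names the true speaker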
V. DATABASE DESCRIPTION

The CHAINS corpus [9] contains the recordings of 36 speakers obtained in two different sessions with a time separation of about two months. The first recording session was carried out in a professional recording studio; speakers were recorded in a sound-proof booth using a Neumann U87 condenser microphone. The second recording session was carried out in a quiet office environment with a Shure SM50 head-mounted microphone connected to a Marantz PMD 670 Compact Flash recorder. Across the two recording sessions, each speaker provided recordings in six different speaking styles. In this work, we make use of three different speaking styles: NORM,³ in which speakers read a prepared text alone at a comfortable rate; FAST, in which the same prepared text was read at a fast rate; and WHSP, which is a whispered reading of the same material. Full details of the corpus are provided in [9]. The NORM condition, which belongs to the first recording session of the corpus, is used as training material, while speech recorded in the second session in the FAST and WHSP styles is used as test material, in order to examine robustness with respect to both stylistic and channel characteristics.

³ Referred to as SOLO in the CHAINS corpus.

Among the speech material the CHAINS corpus offers, we select (for each of the speaking styles above) the first nine individual sentences as speech samples for testing (s1 to s9 in the corpus), while the following sentences (s10 to s33 in the corpus) are used to generate the training sets. The training of the GMMs is performed using only NORM recordings from the first recording session.

Four sets of the available speakers are employed (8, 16, 24 and 36 speakers respectively). The 8-speaker, 16-speaker and 24-speaker sets have equal numbers of males and females, and speakers are drawn from the university population in Dublin (native speakers of Hiberno-English). The 36-speaker set covers the whole set of available speakers in the corpus. Of these, 28 (16 male, 12 female) are from the Eastern part of Ireland (Dublin and adjacent counties). A further 8 subjects (4 male, 4 female) are from the UK or USA: 2 males from the UK and 2 males from the US; 3 females from the US and the last remaining female from the UK [9].⁴

⁴ The bulk of the subjects are aged between 19 and 25 years: the mode of the data is equal to 21, the median value equal to 22 and the semi-interquartile range equal to 3.5.

VI. EXPERIMENTAL EVALUATION

A. Methodology

The utility of our three novel encoding methods (pykfec setup 1, setup 2, setup 3) is evaluated in a closed-set text-independent speaker identification task. We compare performance with encodings using MFCCs and RASTA-PLP coefficients. The evaluation is subdivided into three parts. In the first part (Section VI-B), we estimate the accuracy of the reference GMM system, taking into account two different recording channels and two speaking styles. The training material is extracted from the first recording session of the CHAINS corpus and comprises speech samples recorded in the NORM style; the test material is extracted from the second recording session and employs speech samples recorded in the FAST style. Results are obtained varying the number of parameters used to encode speech, the amount of training material used, the number of models in the Gaussian mixture and the number of speakers at hand. All the parameters (pykfec, MFCC, and RASTA-PLP) are extracted within the frequency range 0 Hz–8000 Hz.

In the second part of the evaluation, we present results obtained by restricting the parameterization to two different frequency ranges: we compare the speaker recognition rate obtained extracting parameters from the frequency range 0 Hz–4000 Hz and from the range 4000 Hz–8000 Hz. This is done to quantitatively evaluate the importance of the instantaneous frequencies revealed by the AM-FM approach within the higher part of the frequency axis (the frequency range 4000 Hz–8000 Hz). The training material is extracted from the first recording session of the CHAINS corpus using speech samples recorded in the NORM style; the test material is extracted from the second recording session, in which the speech was recorded in the FAST style. Results obtained using both MFCC and pykfec are reported.

In the third part of the evaluation, non-modal (whispered) speech is used as testing material. This is done to provide a very hard test bed for the identification system and to assess the extent to which it is possible for a model trained on modally voiced speech to generalize to whispered speech. The training material is extracted from the first recording session of the CHAINS corpus (NORM); the test material is again extracted from the second recording session, but in the WHSP style. Results obtained using both MFCC and pykfec are reported and compared with the accuracy results reported in the first and second parts of the evaluation.

In each case, the accuracy of the SI system is expressed as the mean of ten runs, each run having a different random initialization of the GMMs. The error of the generalization score is calculated as twice the standard deviation of the mean, corresponding to a confidence interval of about 95%.
B. Varying Recording Channel and Speaking Style

In this section we compare the three novel parameterizations pykfec (setup 1, setup 2, setup 3), along with the more standard MFCC and RASTA-PLP encodings. The training material is extracted from the first recording session of the CHAINS corpus (Section V) with speech in the NORM style; the test material is from the second, lower fidelity, recording session, and uses speech spoken in a FAST style.

Figures 5 (a), (b), (c) and (d) summarize the results obtained encoding speech with pykfec (setup 1, setup 2, setup 3) extracted from the frequency range 0 Hz–8000 Hz (the Gabor filters are spaced linearly between 200 Hz and 8200 Hz, with varying bandwidth according to the selected setup).

Figure 5. Accuracy of the reference GMM classifier using pykfec for speaker parametrization, varying (a) the number of Gaussian components, (b) the length of the training material, (c) the number of speakers and (d) the number of features. Parameters are extracted from the frequency interval 0 Hz–8000 Hz.

Figure 5 (a) shows the accuracy of the reference GMM classifier varying the number of Gaussian components and varying the AM-FM setup adopted to extract the pyknogram of the signal. The total length of the training material is approximately 25 seconds per speaker, the number of speakers is 16, and utterances of 10 seconds are used for testing. 80 pykfec are used to encode the speech signal.

Figure 5 (b) shows the accuracy of the GMM classifier varying the total length of training material per speaker. The number of speakers is 16, and utterances of 10 seconds are used for testing. 80 pykfec are used to encode the speech signal.

Figure 5 (c) summarizes the accuracy of the GMM classifier while varying the number of speakers. The number of Gaussian components is 64, and utterances of 10 seconds are used for testing. The amount of training material is approximately 25 seconds per speaker. 80 pykfec are used to encode the speech signal.

Finally, Figure 5 (d) shows the accuracy of the classifier while varying the number of features (pykfec setup 1 and setup 3). The number of Gaussian components is 32; utterances of 5 seconds are used for testing. The amount of training material per speaker is approximately 25 seconds. The number of speakers is 16.

Figures 5 (a), (b), (c) and (d) indicate that a parametrization based on pykfec setup 3 outperforms parametrizations based on both setup 1 and setup 2 (e.g.: considering the 16-speaker set, training 64 GMMs per speaker, using 10 second test utterances and approximately 25 seconds of training material per speaker, pykfec setup 1 provides an accuracy of 83%±3%, pykfec setup 2 provides an accuracy of 72%±5% and pykfec setup 3 provides an accuracy of 92%±3%, as plotted in Figure 5 (a)). Pykfec setup 3 shows the best identification performance and greater stability while increasing the number of Gaussian components in the reference classifier, increasing the amount of training material and varying the number of features. Unsurprisingly, if we increase the number of speakers while keeping the amount of training material per speaker constant, the accuracy of the reference system decreases. However, pykfec setup 3 still performs better than pykfec setup 1 and pykfec setup 2 in all experiments. Pykfec setup 2 shows the worst recognition rate in all the experiments presented here. It is interesting to note (Figure 5 (d)) that in the case of pykfec the reference system is capable of handling considerable redundancy in the features very well: increasing the number of parameters from 40 to 100 does not seem to produce any overfitting (training 32 GMMs per speaker, using 5 second test utterances and approximately 25 seconds of training material per speaker, pykfec setup 2 provides an accuracy of about 79%, pykfec setup 3 provides an accuracy of about 87%, as plotted in Figure 5 (d)).

Figures 6 (a), (b) and (c) summarize the results obtained encoding speech with MFCCs extracted from the frequency range 0 Hz–8000 Hz.

Figure 6 (a) plots classifier accuracy while varying the number of Gaussian components and the number of MFCCs
used to encode the signal. The total length of the training material is approximately 25 seconds per speaker, the number of speakers is 16, and utterances of 10 seconds are used for testing.

Figure 6 (b) shows the accuracy curves of the classifier while varying the total length of training material per speaker. The number of speakers is 16, while utterances of 10 seconds are used for testing.

Figure 6 (c) shows the accuracy of the classifier while varying the number of speakers. The number of Gaussian components is 64, while utterances of 10 seconds are used for testing. The amount of training material is approximately 25 seconds per speaker.

Figure 6. Accuracy of the reference GMM classifier using MFCC for speaker parametrization, varying (a) the number of Gaussian components, (b) the length of the training material and (c) the number of speakers. Parameters are extracted from the frequency interval 0 Hz–8000 Hz.

Figures 6 (a), (b) and (c) show that the accuracy of the reference system trained with MFCCs is comparable to the recognition rate obtained using pykfec setup 3 (e.g.: considering the 16-speaker set, training 64 GMMs per speaker, using 10 second test utterances and approximately 25 seconds of training material per speaker, pykfec setup 3 (80 features) provides an accuracy of 92%±3%, 25 MFCCs provide an accuracy of 88%±4% and 35 MFCCs provide an accuracy of 88%±2%, as shown in Figure 5 (a) and Figure 6 (a)). However, when MFCCs are adopted for speech encoding, the stability of the GMMs across the various experimental setups is somewhat uneven: increasing the number of cepstral coefficients used to encode the speech may, under some circumstances, harm the recognition rate. Figure 6 (a) shows that the 35 MFCC and 25 MFCC parameterizations yield equivalent accuracy scores while varying the number of Gaussian components in the classifier. On the other hand, Figure 6 (b) suggests that when increasing the amount of training material, 25 MFCCs provide better performance than 35 MFCCs: training the GMM classifier with about 60 seconds of material per speaker and 64 Gaussian components per speaker, 25 MFCCs provide an accuracy of 94%±1% and 35 MFCCs provide an accuracy of 88%±1%. This is not the case for pykfec setup 3: the trained classifier does not show signs of overfitting due to the increased number of features, as shown in Figure 5 (d).

Figures 7 (a) and (b) summarize the results obtained encoding speech with RASTA-PLP coefficients extracted from the frequency range 0 Hz–8000 Hz.

Figure 7 (a) plots accuracy while varying the number of Gaussian components and the number of RASTA-PLP coefficients used to encode the signal. The total length of the training material is approximately 25 seconds per speaker, the number of speakers is 16, while utterances of 10 seconds are used for testing.

Figure 7 (b) shows accuracy while varying the total length of training material per speaker and the number of RASTA-PLP coefficients used to encode the signal. The number of speakers is 16, while utterances of 10 seconds are used for testing.

Figures 7 (a) and (b) indicate that RASTA-PLP coefficients are not as effective in capturing speaker identity as MFCCs and pykfec setup 3, within the context of our relatively simple classifier. While the GMM recognition rate using the RASTA-PLP parametrization reaches about 80% when increasing both the training material and the number of Gaussian components, MFCCs and pykfec setup 3 yield accuracy scores of about 90% in similar experimental setups (see Figures 5 (a), (b) and Figures 6 (a), (b)).
In order to facilitate a comparison of the results presented in Figures 5 (a-d), Figures 6 (a-c) and Figures 7 (a, b), we reproduce in Table II and Figure 8 some of the accuracy results obtained in the different experimental setups. "Training Length" indicates the amount of training material per speaker used. In Table II, "Test Length" gives the length of the utterance used for testing; "Number of Features" indicates the number of parameters used to encode the speech signal.

Table II. Accuracy of different encodings varying the training and test material length and number of parameters. The reference classifier is trained with 64 Gaussian components. Results are obtained on the 16-speaker dataset. The accuracy scores are computed as the mean of ten independent runs, each run having a different random initialization of the GMMs. The error of the score is calculated as twice the standard deviation of the mean.

    pykfec setup 1
    Training Length [sec]   Test Length [sec]   Number of Features   Acc. [%]   Error [%]
    25                      5                   20                   83         5
    25                      5                   40                   87         6
    25                      5                   80                   87         5
    25                      10                  80                   92         3
    60                      10                  80                   93         1

    MFCC
    Training Length [sec]   Test Length [sec]   Number of Features   Acc. [%]   Error [%]
    25                      10                  15                   86         3
    25                      10                  20                   86         4
    25                      10                  25                   88         4
    60                      10                  25                   94         1
    25                      10                  35                   88         3
    60                      10                  35                   88         1

    RASTA-PLPC
    Training Length [sec]   Test Length [sec]   Number of Features   Acc. [%]   Error [%]
    25                      10                  12                   54         6
    25                      10                  16                   65         4
    25                      10                  20                   76         9
    60                      10                  20                   82         6

Figure 7. Accuracy of the reference GMM classifier using RASTA-PLP for speaker parametrization, varying (a) the number of Gaussian components and (b) the length of the training material. Parameters are extracted from the frequency interval 0 Hz–8000 Hz.

Figure 8. Accuracy of the reference GMM classifier using pykfec setup 1, setup 2 and setup 3 compared with the accuracy of the same classifier using 25 MFCC, 35 MFCC and 20 RASTA-PLP for speaker parametrization. The reference classifier is trained with 64 Gaussian components; results are obtained on the 16-speaker dataset and about 60 seconds of training material per speaker is used; 10 second utterances are used for testing. The different parameters are extracted from the same frequency interval 0 Hz–8000 Hz.

C. The Role of Different Frequency Intervals

In this section, we present results obtained by restricting our parameter extraction to two limited frequency ranges, 0 Hz–4000 Hz and 4000 Hz–8000 Hz. The motivation for this is to investigate the potential role in speaker identification for the structures above 4 kHz that are apparent in the pyknogram, but less obvious from a spectrographic representation. In general, little is known about the relevance or origin of such structures, or how they might vary from individual to individual. The training material is extracted from the first recording session of the CHAINS corpus and uses NORM speech; the test material is extracted from the second recording session and spoken in the FAST style. The reference classifier is trained with 64 Gaussian components and utterances of 10 seconds are used for testing. The results obtained are also compared with MFCCs extracted from the same two frequency ranges (Figure 10).

Figure 9 (a) shows that encoding speech with 40 pykfec setup 3 features extracted from the interval 0 Hz–4000 Hz provides better identification performance than the use of either pykfec setup 1 or pykfec setup 2, with accuracy curves that are comparable to those obtained using 80 pykfec setup 3 features extracted from the larger frequency range of 0 Hz–8000 Hz (e.g.: considering the 24-speaker set, training 64 GMMs per speaker, using 10 second test utterances and approximately 25 seconds of training material per speaker, pykfec setup 3 0 Hz–4000 Hz (40 features) provide an accuracy of 79±4 while pykfec setup 3 0 Hz–8000 Hz (80 features) provide an
12
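In practice, band-restricted extraction amounts to keeping only the analysis channels whose centre frequencies fall in the chosen interval. The sketch below illustrates the idea under our own assumptions; the channel centres, array shapes and function name are hypothetical and this is not the authors' pykfec implementation.

import numpy as np

def select_band(features, centre_freqs, f_lo, f_hi):
    """Keep the columns of `features` whose channel centres lie in [f_lo, f_hi].

    features     : (n_frames, n_channels) array
    centre_freqs : (n_channels,) array of channel centre frequencies in Hz
    """
    mask = (centre_freqs >= f_lo) & (centre_freqs <= f_hi)
    return features[:, mask]

# Example: an 80-channel encoding over the full range reduced to the
# channels covering 0-4000 Hz, mirroring the 80- vs 40-feature counts above.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 80))        # placeholder feature frames
centres = np.linspace(200.0, 8200.0, 80)      # setup-1-style channel spacing
print(select_band(feats, centres, 0.0, 4000.0).shape)   # (100, 38)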

Figure 9 (a) shows that encoding speech with 40 pykfec setup 3 features extracted from the interval 0 Hz–4000 Hz provides better identification performance than the use of either pykfec setup 1 or pykfec setup 2, with accuracy curves that are comparable to those obtained using 80 pykfec setup 3 features extracted from the larger frequency range of 0 Hz–8000 Hz (e.g., considering the 24-speaker set, training 64 GMMs per speaker, using 10 second test utterances and approximately 25 seconds of training material per speaker, pykfec setup 3 0 Hz–4000 Hz (40 features) provides an accuracy of 79 ± 4, while pykfec setup 3 0 Hz–8000 Hz (80 features) provides an accuracy of 80 ± 6, as shown in Figure 9 (a) and Figure 5 (c)). On the other hand, Figure 9 (b) clearly shows that parameters extracted from the higher frequency range 4000 Hz–8000 Hz are less selective for speaker identity than those from 0 Hz to 4000 Hz: e.g., under the same conditions (24-speaker set, 64 GMMs per speaker, 10 second test utterances, approximately 25 seconds of training material per speaker), pykfec setup 3 4000 Hz–8000 Hz (40 features) provides an accuracy of 58 ± 7, while pykfec setup 3 0 Hz–4000 Hz (40 features) provides an accuracy of 79 ± 4 (Figure 9 (a)).

It is interesting to note that pykfec setup 1 and pykfec setup 2 provide equivalent accuracy in the frequency range 0 Hz–4000 Hz, while in the frequency range 4000 Hz–8000 Hz equivalent accuracy is obtained by adopting pykfec setup 1 and pykfec setup 3. This corresponds well with the impression of similarity seen in comparing the three setups in Figure 1, Figure 2 and Figure 3. The pyknograms obtained adopting setup 1 and setup 2 appear to represent formant-like information below 4000 Hz similarly, while setup 3 draws out individual harmonics more. Conversely, setup 1 and setup 3 represent broad resonances above 4 kHz in similar fashion, while these get washed out by the broad bandwidths of setup 2.

[Figure 9: two panels plotting accuracy (y-axis: Accuracy [%]) against the number of speakers (8–36) for 40 pykfec setup 1, setup 2 and setup 3 features; panel (a): 0–4000 Hz, panel (b): 4000–8000 Hz.]
Figure 9. Accuracy of the reference GMM classifier using pykfec for speaker parametrization, varying the number of speakers considered and the frequency interval from which the parameters are extracted (0 Hz–4000 Hz and 4000 Hz–8000 Hz).

Figure 10 shows the results obtained using 25 and 35 MFCCs to encode the same speech signals as above, varying the number of speakers and the frequency intervals from which the MFCCs are extracted. The reference classifier is trained with 64 Gaussian components and utterances of 10 seconds are used for testing.

[Figure 10: plots of accuracy (y-axis: Accuracy [%]) against the number of speakers (8–36) for 25 and 35 MFCC, each extracted from 0–4000 Hz and from 4000–8000 Hz.]
Figure 10. Accuracy of the reference GMM classifier using MFCC for speaker parametrization, varying the number of speakers considered, the number of coefficients and the frequency interval from which the parameters are extracted (0 Hz–4000 Hz and 4000 Hz–8000 Hz).
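For the MFCC side of this comparison, band-limited coefficients can be obtained by bounding the mel filterbank. One possible realization, using the third-party librosa package rather than the reference implementation, is sketched below; the file name and sampling rate are placeholders.

import librosa

# Placeholder file and sampling rate; any 16 kHz utterance would do.
y, sr = librosa.load("utterance.wav", sr=16000)

# 25 MFCCs from 0-4000 Hz: fmin/fmax bound the underlying mel filterbank.
mfcc_low = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25, fmin=0.0, fmax=4000.0)

# The same number of coefficients from the 4000-8000 Hz band.
mfcc_high = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25, fmin=4000.0, fmax=8000.0)

print(mfcc_low.shape, mfcc_high.shape)   # (25, n_frames) each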
Figure 10 shows that 25 and 35 MFCCs provide roughly equivalent identification rates when extracted from the same frequency intervals (e.g., considering the 36-speaker set, training 64 GMMs per speaker, using 10 second test utterances and approximately 25 seconds of training material per speaker, 25 MFCC 0 Hz–4000 Hz provide an accuracy of 66 ± 6, while 35 MFCC 0 Hz–4000 Hz provide an accuracy of 71 ± 5). Moreover, it confirms that parameters extracted from the frequency range 4000 Hz–8000 Hz are not good indicators of speaker identity: under the same conditions, 35 MFCCs 0 Hz–4000 Hz provide an accuracy of 71 ± 5, while 35 MFCCs 4000 Hz–8000 Hz provide an accuracy of 47 ± 4. It is interesting to note that the recognition rate obtained using 40 pykfec setup 3 features is roughly equivalent to the performance obtained with both 25 and 35 MFCCs, irrespective of the frequency range from which the parameters are extracted: again under the same conditions, pykfec setup 3 0 Hz–4000 Hz (40 features) provides an accuracy of 79 ± 4 (Figure 9 (a)), while 35 MFCC 0 Hz–4000 Hz provide an accuracy of 71 ± 5 (Figure 10).

Figure 11 summarizes some of the results presented in this section.
[Figure 11: bar chart of accuracy (y-axis: Accuracy [%], 0–100) for 40 pykfec setup 3, 25 MFCC and 35 MFCC.]
Figure 11. Accuracy of the reference GMM classifier using 40 pykfec setup 3 compared with the accuracy of the same classifier using 25 MFCC and 35 MFCC for speaker parametrization. The reference classifier is trained with 64 Gaussian components, results are obtained on the 36-speaker dataset and about 25 seconds of training material per speaker is used; 10 second utterances are used for testing. The different parameters are extracted from the same frequency interval 0 Hz–4000 Hz.

D. Different Phonation: Whispering
In this section, non-modal (whispered) speech is used to test the GMM reference classifier. This is done to provide a relatively hard test case for the identification system and to verify the extent to which it is possible to generalize an identification model (the mixture of Gaussian models trained for each speaker) based on modally voiced speech to whispered speech. The training material is NORM speech from the first recording session of the CHAINS corpus; the test material is from the second recording session, in which speech was whispered throughout. The reference classifier is trained with 64 Gaussian components and utterances of 10 seconds are used for testing. The results reported in this section are obtained using both pykfec and MFCCs for speech parametrization (Figure 12).

[Figure 12: two panels plotting accuracy (y-axis: Accuracy [%], 0–100) against the number of speakers (8–36); panel (a) legend: 80 pykfec setup 1, setup 2 and setup 3; panel (b) legend: 25 MFCC and 35 MFCC.]
Figure 12. Accuracy of the reference GMM classifier trained on SOLO speech and tested using WHSP speech, varying the number of speakers considered and the parametrization of the speech samples. Parameters are extracted from the frequency interval 0 Hz–8000 Hz.

Figure 12 shows that neither MFCCs nor pykfec provides a speech parametrization that is invariant with respect to drastic changes in phonation: the identity model extracted from modal speech (the NORM style) is not useful when test samples are whispered. On the other hand, when the same models are tested using FAST speech, the reference system is well capable of recognizing speakers' identity (Figures 5 (a–d) and Figures 6 (a–c)).
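The closed-set decision rule used throughout these experiments, one 64-component GMM per enrolled speaker and a maximum average log-likelihood choice over the known set, can be sketched as follows, with scikit-learn standing in for the reference implementation; the function and variable names are ours, not the authors'.

from sklearn.mixture import GaussianMixture

def train_speaker_models(train_feats, n_components=64):
    """train_feats: dict mapping speaker id -> (n_frames, n_dims) feature array."""
    models = {}
    for spk, feats in train_feats.items():
        # One diagonal-covariance GMM per enrolled speaker.
        models[spk] = GaussianMixture(n_components=n_components,
                                      covariance_type="diag").fit(feats)
    return models

def identify(models, test_feats):
    """Closed-set decision: return the speaker whose model assigns the test
    utterance the highest average per-frame log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(test_feats))

Training the models on NORM features and calling identify() on whispered-speech features reproduces the mismatched train/test condition examined in this section.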
Finally, Figure 13 shows the effect of extracting MFCCs and pykfec from the two frequency ranges 0 Hz–4000 Hz and 4000 Hz–8000 Hz under the same experimental conditions as above. The training and test materials are as before. The reference classifier is trained with 64 Gaussian components and utterances of 10 seconds are used for testing.

Figures 13 (a) and 13 (b) show that 40 pykfec setup 3 extracted from the frequency range 0 Hz–4000 Hz provides a better recognition rate than any other encoding tested. Nevertheless, the accuracy of the reference system is far from ideal, confirming that the parametrizations explored are not invariant with respect to drastic changes in phonation. For the purpose of comparison with Figure 1, Figure 14 shows a sample pyknogram of "If it doesn't matter who wins, why do we keep score?", derived from whispered speech. Finally, Figure 15 summarizes some of the results presented in this section.
[Figure 13: two panels plotting accuracy (y-axis: Accuracy [%], 0–100) against the number of speakers (8–36); panel (a) legend: 40 pykfec setup 1 and setup 3, each for 0–4000 Hz and 4000–8000 Hz; panel (b) legend: 25 and 35 MFCC, each for 0–4000 Hz and 4000–8000 Hz.]
Figure 13. Accuracy of the reference GMM classifier trained on SOLO speech and tested using WHSP speech, varying the number of speakers considered and the parametrization of the speech samples. Parameters are extracted from the frequency intervals 0 Hz–4000 Hz and 4000 Hz–8000 Hz.

[Figure 14: pyknogram, frequency (0–8000 Hz) against time (0–2 sec).]
Figure 14. Pyknogram of "If it doesn't matter who wins, why do we keep score?" (whisper). 80 filters linearly spaced between 200 Hz and 8200 Hz, constant bandwidth of 400 Hz (Filterbank setup 1).

[Figure 15: bar chart of accuracy (y-axis: Accuracy [%], 0–100) for 40 pykfec setup 3, 25 MFCC and 35 MFCC.]
Figure 15. Accuracy of the reference GMM classifier using 40 pykfec setup 3 compared with the accuracy of the same classifier using 25 MFCC and 35 MFCC for speaker parametrization. The reference classifier is trained with 64 Gaussian components, results are obtained on the 24-speaker dataset and about 25 seconds of modal speech (NORM) per speaker is used; 10 second (whisper) utterances are used for testing. The different parameters are extracted from the same frequency interval 0 Hz–4000 Hz.

VII. CONCLUSIONS

This work has introduced a new set of descriptors that capture the identity of speakers well and that demonstrate robustness with respect to changes in recording channel and speaking style, without requiring any advanced channel compensation scheme, as is usually the case for standard encoding approaches such as MFCCs. Our experimental evaluation indicates that the characterization of the different instantaneous frequencies within the speech signal plays a significant role in capturing the identity of a speaker. The identity of a human speaker can be captured robustly by looking at which instantaneous frequencies are produced by the speech production system.

The great plasticity and complexity of the human speech production system ensure that the problem of speech parametrization is not uniquely solvable. For this reason, this work explores three different ways to decompose the speech signal within the same framework. Three different setups were tested for computing the pyknogram of the signal and the derived pykfec parameters, setups that differ in their specificity for the various harmonics and resonances typical of the speech signal.
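To make the pyknogram computation concrete, the sketch below approximates filterbank setup 1 (80 channels linearly spaced between 200 Hz and 8200 Hz with a constant 400 Hz bandwidth, as in the caption of Figure 14), but substitutes Hilbert-based demodulation for the energy separation algorithm used in the paper; the function name, filter length and amplitude threshold are our own choices.

import numpy as np
from scipy.signal import firwin, lfilter, hilbert

def pyknogram(x, fs, n_ch=80, f_lo=200.0, f_hi=8200.0, bw=400.0, numtaps=255):
    """Per-channel (time, instantaneous frequency) points for signal x."""
    points = []
    for fc in np.linspace(f_lo, f_hi, n_ch):
        lo, hi = max(fc - bw / 2, 1.0), min(fc + bw / 2, fs / 2 - 1.0)
        if lo >= hi:
            continue  # channel lies beyond the Nyquist frequency
        # Band-pass the signal around the channel centre.
        band = lfilter(firwin(numtaps, [lo, hi], pass_zero=False, fs=fs), 1.0, x)
        analytic = hilbert(band)
        amp = np.abs(analytic)
        # Instantaneous frequency: scaled derivative of the unwrapped phase.
        inst_f = np.diff(np.unwrap(np.angle(analytic))) * fs / (2 * np.pi)
        # Keep estimates only where the channel carries appreciable energy.
        keep = amp[1:] > 0.1 * amp.max()
        points.append((np.nonzero(keep)[0] / fs, inst_f[keep]))
    return points

Plotting the returned (time, frequency) points for all channels yields the kind of display shown in Figure 14.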
When compared with pykfec setup 1 and pykfec setup 2, pykfec setup 3 shows the best performance in terms of speaker identification and stability as we increase the number of Gaussian components in the reference classifier, the amount of training material and the number of features. With respect to the latter, despite the introduction of clear redundancy among the descriptors, the reference GMM classifier shows no loss in recognition rate as we increase the number of parameters from 40 up to 100. Moreover, with the proposed AM-FM approach, channel normalization is not required, as the (instantaneous) amplitude is used only for identifying the short-time frequency estimate within a single band.

In the experiments in which we selectively extracted parameters from restricted frequency ranges, we found that although the pykfec features appear to reveal interesting structures above 4 kHz, these are not necessarily of use in speaker identification, and our novel features did not fare significantly better than the MFCC representation.

In the first set of experiments, we demonstrated that the novel representations are capable of generalizing well from high-quality speech recorded at a comfortable rate to lower-quality recordings in which the speaker tried to speak quickly. This represents a promising degree of robustness with respect to both channel characteristics and style. The robustness has its limits, however, as was demonstrated when we used whispered speech as test material. Even here, though, it was clear that the new pykfec may be of use in the future, as results obtained with setup 3 when features were extracted below 4 kHz were demonstrably better than those of either of the other two setups, or of either of the MFCC parametrizations employed.

Our examination of the robustness of these novel representations is necessarily limited in scope. However, the robust performance across the board exhibited by the novel AM-FM derived features is promising and merits the attention of the speaker identification and verification community for consideration in further work.
VIII. ACKNOWLEDGMENTS

The authors wish to acknowledge the SFI/HEA Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support. This work is supported by Science Foundation Ireland grant No. O4/IN3/I568 to Fred Cummins. We would also like to thank the anonymous reviewers for their very constructive feedback.
text-independent speaker identification over telephone channels. IEEE
R EFERENCES Transactions on Speech and Audio Processing, 7(5), 1999.
[28] K.K. Paliwal. Usefulness of phase in speech processing. In Proceedings
[1] T.B. Alderman. Forensic Speaker Indentification, A Likelihood Ratio- of IPSJ Spoken Language Processing Workshop, pages 16, Gifu, Japan,
Based Approach Using Vowel Formants. Lincom Studies in Phonetics, 2003.
2005. [29] K.K Paliwal and B.S. Atal. Frequency-related representation of speech.
[2] B.S. Atal and S.L. Hanauer. Speech analysis and synthesis by linear In Proceedings of Eurospeech 2003, page 6568, September 2003.
prediction of the speech wave. Journal of the Acoustic Society of [30] A. Potamianos and P. Maragos. Speech formant frequency and band-
America, 50:637655, 1971. width tracking using multiband energy demodulation. Journal of the
[3] Frederic Bimbot, Jean-Francois Bonastre, Corinne Fredouille, Guillaume Acoustic Society of America, 99:37953806, 1996.
Gravier, Ivan Magrin-Chagnolleau, Sylvain Meignier, Teva Merlin, [31] A. Potamianos and P.Maragos. Time-frequency distributions for au-
Javier Ortega-Garca, Dijana Petrovska-Delacretaz, and Douglas A. tomatic speech recognition. Speech and Audio Processing, IEEE
Reynolds. A tutorial on text-independent speaker verification. EURASIP Transactions on, 9(3):196200, 2001.
J. Appl. Signal Process., 2004(1):430451, 2004. [32] L.R. Rabiner and R.W. Shafer. Digital Signal Processing of Speech
[4] B. Boashash. Estimating and interpreting the instanteneous frequency Signals. Prentice-Hall, 1989.
of a signal - part 1: Fundamentals. Proceedings of the IEEE, 80(4):519 [33] A. Rao and R. Kumaresan. On decomposing speech into modulated
538, 1992. components. IEEE Transactions on Speech and Audio Processing,
[5] P. Boersma. Praat, a system for doing phonetics by computer. Glot 8(3):240254, 2000.
International, 5(9/10):341345, 2001. [34] D.A. Reynolds. Experimental evaluation of features for robust speaker
[6] J.F. Bonastre, F. Bimbot, L.J. Boe, J. P. Campbell, D. A. Reynolds, identification. IEEE Transactions on Speech and Audio Processing,
and I. Magrin-Chagnolleau. Person authentication by voice: A need for 2(3):639643, 1994.
caution. In Proceedings of Eurospeech 2003, Genova, September 2003. [35] D.A. Reynolds. An overview of automatic speaker recognition technol-
[7] Jr. Campbell, J.P. Speaker recognition: a tutorial. Proceedings of the ogy. In IEEE International Conference on Acoustics, Speech, and Signal
IEEE, 85(9):14371462, Sep 1997. Processing (ICASSP 02), pages IV4072 IV4075, 2002.
[8] K. Chen, L. Wang, and H. Chi. Methods of combining multiple clas- [36] D.A. Reynolds and R.C. Rose. Robust text-independent speaker identi-
sifiers with different features and their applications to text-independent fication using gaussian mixture speaker models. IEEE Transactions on
speaker identification. International Journal of Pattern Recognition and Speech and Audio Processing, 3(1):7283, 1995.
Artificial Intelligence, 11(3):417445, 1997. [37] Robert D. Rodman. Computer recognition of speakers who disguise
[9] F. Cummins, M. Grimaldi, T. Leonard, and J. Simko. The chains corpus: their voice. In Proc ICSPAT 2000. 2000.
Characterizing individual speakers. In Proceedings of SPECOM06, [38] P. Rose. Forensic Speaker Indentification. Taylor and Francis Forensic
pages 431435, St. Petersburg, RU, 2006. Science Series, 2002.
[10] D. Dimitriadis and P. Maragos. Robust energy demodulation based [39] H. M. Teager and S. M. Teager. Evidence for nonlinear sound production
on continuous models with application to speech recognition. In mechanisms in the vocal tract. In Hardcastle and A. Marchal, editors,
Proceedings of Eurospeech 2003, pages 28532856, 2003. Speech Production and Speech Modelling, volume 55. NATO Advanced
[11] D.V. Dimitriadis, P. Maragos, and A. Potamianos. Robust am-fm features Study Institute Series D, Bonas, France, July 1989.
for speech recognition. Signal Processing Letters, IEEE, 12(9):621 624, [40] V. Wan and S. Renals. Evaluation of kernel methods for speaker
2005. verification and identification. In IEEE International Conference on
[12] M. Foundez-Zanuy, S. McLaughlin, A. Esposito, A. Hussain, J. Schoent- Acoustics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP
gen, G. Kubin, W. B. Kleijn, and P. Maragos. Nonlinear speech 02)., volume 1, pages 669672, 2002.
processing : Overview and applications. Control and Intelligent Systems, [41] V. Wan and S. Renals. Speaker verification using sequence discriminant
30:110, 2002. support vector machines. IEEE Transactions on Speech and Audio
[13] S. Furui. Cepstral analysis technique for automatic speaker verifica- Processing, 13(2):203210, March 2005.
tion. IEEE Transactions on Acoustics, Speech, and Signal Processing, [42] B. Xiang and T. Berger. Efficient text-independent speaker verification
29(2):254272, 1981. with structural gaussian mixture models and neural network. IEEE
[14] D. Gabor. Theory of communication. JIEE, 93(3):429457, November Transactions on Speech and Audio Processing, 11(5), 2003.
1946.
[15] J. Godfrey, D. Graff, and A. Martin. Public databases for speaker
recognition and verification. In Proceedings of the ESCA Workshop
Automatic Speaker Recognition, Identification, Verification, Martigny,
Switzerland, April 1994.
[16] H. Hermansky. Perceptual linear prediction (plp) analysis for speech.
Journal of Acoustic Society of America, pages 17381752, 1990.
[17] H. Hermansky, N. Morgan, and A. Bayya.and P. Kohn. Rasta-plp speech
analysis technique. IEEE International Conference on Acoustics, Speech,
and Signal Processing, 1992. ICASSP-92., 1:121124, 23-26 Mar 1992.
[18] H. Hollien. Forensic Voice Indentification. Academic Press, 2002.
[19] C.R. Jr Jankowski, T.F. Quatieri, and D.A. Reynolds. Measuring fine
structure in speech: Application to speaker identification. In IEEE
International Conference on Acoustics, Speech, and Signal Processing
(ICASSP 95), pages 325328, 1995.
[20] J.K. Keiser. On teagers energy algorithm and its generalization to
countinous signals. In Proceedings of IEEE DSP Workshop, New York,
1990.
[21] L.G. Kersta. Voiceprint identification. Nature, 196:12531257, 1962.
[22] GANG LI, LUNJI QIU, and LING KOK NG. Signal representation
based on instantaneous amplitude models with application to speech syn-
thesis. Speech and Audio Processing, IEEE transactions on, 8(3):353
357, 2000.

Você também pode gostar