
EEL6586 Final Project:

A Speaker Identification and Verification System









Team members: Zhongmin Liu
Qizhang Yin
Weimin Zhang




Submission date: 04/24/2002



1. Introduction

Speaker recognition is the process of automatically recognizing who is speaking
on the basis of individual information included in speech waves. This technique
makes it possible to use the speaker's voice to verify their identity and control access
to services such as voice dialing, banking by telephone, telephone shopping, database
access services, information services, voice mail, security control for confidential
information areas, and remote access to computers.

The goal of this project is to develop a simple, yet complete and representative
automatic speaker recognition system. We test our system only on a very small
(but already non-trivial) speech database of 3 speakers, labeled s1 to s3. These
speakers can then be authorized by a combination of voice and password recognition.

2. Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker
identification is the process of determining which registered speaker provides a given
utterance. Speaker verification, on the other hand, is the process of accepting or
rejecting the identity claim of a speaker.

Speaker recognition methods can also be divided into text-independent and text-
dependent methods. In a text-independent system, the speaker model captures
characteristics of the speech that show up irrespective of what is being said, i.e.,
spectral features. One of the most successful text-independent recognition methods is
based on vector quantization (VQ). VQ codebooks consisting of the condensed
representative feature vectors are used as an efficient means of characterizing
speaker-specific features. A speaker-specific codebook is generated by clustering the
training feature vectors of each speaker. In the recognition stage, an input utterance is
vector-quantized using the codebook of each reference speaker and the VQ distortion
accumulated over the entire input utterance is used to make the recognition decision.

In a text-dependent system, on the other hand, a speaker's identity is recognized
from his pronunciation of one or more specific phrases, such as the password used in
this project. Text-dependent methods are usually based on template-matching
techniques such as the dynamic time warping (DTW) algorithm. The hidden Markov
model (HMM) can efficiently model statistical variation in spectral features and has
achieved significantly better recognition accuracy than DTW. Therefore, an
HMM-based method is applied to the word (password) recognition in this project.

Figure 1 shows the basic structures of speaker identification and verification
systems. Feature extraction is the procedure of extracting a small amount of data from
the voice signal that can later be used to represent each speaker; the well-known
MFCC spectral coefficients are adopted here. Feature matching involves identifying
an unknown speaker by comparing the features extracted from his voice with those
stored in the speaker database.




(a) Speaker identification (Text-independent recognition)







(b) Speaker verification (Text-dependent recognition)

Fig 1. Basic structures of speaker recognition systems

In the training phase, each registered speaker has to provide speech samples so
that the system can build a reference model for that speaker. In the case of speaker
verification, a speaker-specific threshold is also determined from the training
samples. During the testing phase, the input speech is matched against the stored
reference models and a recognition decision is made.

3. Speech Feature Extraction

The purpose of this module is to convert the speech waveform to some type of
parametric representation (at a considerably lower information rate) for further
analysis and processing; this module is often referred to as the signal-processing front end.

The speech signal is a slowly time-varying signal (called quasi-stationary). When
examined over a sufficiently short period of time (5-100 ms), its characteristics are
fairly stationary. However, over longer periods of time (on the order of 1/5 second or
more) the signal characteristics change to reflect the different speech sounds being
spoken. Therefore, short-time spectral analysis is the most common way to
characterize the speech signal.
A wide range of possibilities exist for parametrically representing the speech
signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-
Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best
known and most popular, and it will be used in this project.
MFCC is based on the known variation of the human ear's critical bandwidths
with frequency: filters spaced linearly at low frequencies and logarithmically at high
frequencies are used to capture the phonetically important characteristics of speech.
This is expressed in the mel-frequency scale, which has linear frequency spacing
below 1000 Hz and logarithmic spacing above 1000 Hz. The process of computing MFCC
is described in more detail next.

3.1 Mel-frequency Cepstrum Coefficients processor

A block diagram of the structure of an MFCC processor is given in Figure 2. The
speech input is typically recorded at a sampling rate above 10 kHz. This sampling
frequency is chosen to minimize the effects of aliasing in the analog-to-digital
conversion. Such sampled signals can capture all frequencies up to 5 kHz, which
covers most of the energy of sounds generated by humans. As discussed
previously, the main purpose of the MFCC processor is to mimic the behavior of the
human ear. In addition, compared with the speech waveforms themselves, the MFCCs
are less susceptible to variations in the speaker's voice and the recording conditions.



Fig 2. MFCC processor

3.1.1 Frame Blocking
The continuous speech signal is blocked into frames of N samples, with adjacent
frames being separated by M (M < N). The 1st frame consists of the first N samples.
The 2nd frame begins M samples after the 1st frame, and overlaps it by N - M
samples. Similarly, the 3rd frame begins 2M samples after the 1st frame (or M
samples after the 2nd frame) and overlaps it by N - 2M samples. This process
continues until all the speech is accounted for within one or more frames. Typical
values for N and M are N = 256 (which is equivalent to roughly 30 ms of windowing
and facilitates the fast radix-2 FFT) and M = 100.
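
As an illustration, the following MATLAB sketch blocks a signal into overlapping
frames with the values quoted above (N = 256, M = 100). The synthetic signal and the
12.5 kHz rate are placeholders, not the project's actual recordings:

    fs = 12500;                              % assumed sampling rate (the report only says "above 10 kHz")
    s  = randn(fs, 1);                       % placeholder for ~1 s of recorded speech
    N  = 256;                                % frame length
    M  = 100;                                % frame shift, so adjacent frames overlap by N - M samples
    numFrames = floor((length(s) - N) / M) + 1;
    frames = zeros(N, numFrames);
    for k = 1:numFrames
        frames(:, k) = s((k-1)*M + 1 : (k-1)*M + N);   % k-th frame starts (k-1)*M samples in
    end
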
3.1.2 Windowing
The next step is to window each individual frame so as to minimize the signal
discontinuities at the beginning and end of each frame. The concept here is to
minimize the spectral distortion by using the window to taper the signal to zero at the
beginning and end of each frame. If we define the window as w(n), 0 <= n <= N-1,
where N is the number of samples in each frame, then the result of windowing is the
signal

    y_l(n) = x_l(n) w(n),    0 <= n <= N-1

Typically the Hamming window is used, which has the form:

    w(n) = 0.54 - 0.46 cos( 2*pi*n / (N-1) ),    0 <= n <= N-1
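
A minimal MATLAB sketch of this tapering step is given below; the window is written
out from the formula above so that no toolbox function is needed, and the random
frames merely stand in for the blocked frames of the previous step:

    N = 256;
    n = (0:N-1)';
    w = 0.54 - 0.46 * cos(2*pi*n / (N-1));               % Hamming window w(n), 0 <= n <= N-1
    frames = randn(N, 50);                               % stands in for the blocked frames from Section 3.1.1
    windowed = frames .* repmat(w, 1, size(frames, 2));  % y_l(n) = x_l(n) w(n), applied frame by frame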


3.1.3 Fast Fourier Transform (FFT)
FFT converts each frame from the time domain into the frequency domain. The
FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is
defined on the set of N samples {x_k} as follows:

    X_n = sum_{k=0}^{N-1} x_k exp( -j*2*pi*k*n / N ),    n = 0, 1, 2, ..., N-1

The resulting sequence {X_n} is interpreted as follows: the zero frequency
corresponds to n = 0, positive frequencies 0 < f < fs/2 correspond to values
1 <= n <= N/2 - 1, while negative frequencies -fs/2 < f < 0 correspond to
N/2 + 1 <= n <= N - 1.
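
The bin-to-frequency correspondence can be sketched in MATLAB as follows; the
12.5 kHz rate and the random frame are assumptions used only to keep the snippet
self-contained:

    fs = 12500;                             % assumed sampling rate
    N  = 256;
    frame = randn(N, 1);                    % stands in for one windowed frame
    X = fft(frame);                         % N-point DFT computed with the radix-2 FFT
    P = abs(X).^2;                          % power spectrum
    f = (0:N/2-1)' * fs / N;                % frequencies of the first N/2 bins (0 up to just below fs/2)
    halfSpectrum = P(1:N/2);                % non-negative frequencies, used later by the mel filterbank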
3.1.4 Mel-frequency Wrapping
As mentioned above, psychophysical studies have shown that human perception
of the frequency contents of sounds for speech signals does not follow a linear scale.
Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is
measured on a scale called the mel scale. The mel-frequency scale has linear
frequency spacing below 1 kHz and logarithmic spacing above 1 kHz. As a
reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing
threshold, is defined as 1000 mels. Therefore we can use the following approximate
formula to compute the mels for a given frequency f in Hz:
    mel(f) = 2595 * log10( 1 + f / 700 )
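
In MATLAB this mapping and its inverse can be written as two anonymous functions;
the inverse is not stated in the report but follows directly from the formula above:

    hz2mel = @(f) 2595 * log10(1 + f / 700);
    mel2hz = @(m) 700 * (10.^(m / 2595) - 1);   % inverse mapping, convenient for placing filter edges
    hz2mel(1000)                                % approximately 1000 mels, the reference point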
One approach to simulating the subjective spectrum is to use a filter bank spaced
uniformly on the mel scale (see Figure 3). Each filter has a triangular bandpass
frequency response, and the spacing as well as the bandwidth is determined by a
constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the
output power of these filters when S(ω) is the input. The number of mel spectrum
coefficients, K, is typically chosen as 20.
Note that this filter bank is applied in the frequency domain, so it simply
amounts to applying the triangle-shaped windows of Figure 3 to the spectrum. A
useful way of thinking about this mel-wrapping filter bank is to view each filter as a
histogram bin (where bins overlap) in the frequency domain.
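
A sketch of one possible construction of such a K-band triangular filterbank is shown
below. The sampling rate and the exact edge placement are illustrative assumptions
and are not necessarily identical to the filterbank used inside the mfcc.m code:

    fs = 12500;  nfft = 256;  K = 20;          % assumed rate; FFT size and K as in the text
    hz2mel = @(f) 2595 * log10(1 + f / 700);
    mel2hz = @(m) 700 * (10.^(m / 2595) - 1);
    edges = mel2hz(linspace(hz2mel(0), hz2mel(fs/2), K + 2));   % K+2 edge frequencies in Hz
    bins  = floor(edges / fs * nfft) + 1;                       % corresponding FFT bin indices (1-based)
    H = zeros(K, nfft/2);                                       % one triangular filter per row
    for k = 1:K
        for j = bins(k):bins(k+1)-1
            H(k, j) = (j - bins(k)) / (bins(k+1) - bins(k));         % rising edge
        end
        for j = bins(k+1):bins(k+2)-1
            H(k, j) = (bins(k+2) - j) / (bins(k+2) - bins(k+1));     % falling edge
        end
    end
    P = abs(fft(randn(nfft, 1))).^2;       % stands in for the power spectrum of one windowed frame
    melSpectrum = H * P(1:nfft/2);         % K mel spectrum values for this frame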


Fig 3. Mel-spaced filterbank

3.1.5 Cepstrum
Finally the log mel spectrum is converted back to time. The result is called the
mel frequency cepstrum coefficients (MFCC). The cepstral representation of the
speech spectrum provides a good representation of the local spectral properties of the
signal for the given frame analysis. Because the mel spectrum coefficients (and so
their logarithm) are real numbers, we can convert them to the time domain using the
Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum
coefficients that result from the last step by S~_k, k = 1, 2, ..., K, we can calculate
the MFCCs, c~_n, as

    c~_n = sum_{k=1}^{K} ( log S~_k ) * cos( n * (k - 1/2) * pi / K ),    n = 1, 2, ..., K

Note that we exclude the first component, c~_0, from the DCT since it represents
the mean value of the input signal, which carries little speaker-specific information.
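
This last step can be sketched in MATLAB as follows; the DCT sum is written out
explicitly so no toolbox is needed, and the use of the natural logarithm is an
implementation detail not fixed by the formula above:

    K = 20;                                 % number of mel spectrum coefficients
    S = rand(K, 1) + 0.1;                   % stands in for the (positive) mel power spectrum of one frame
    numCoeffs = 13;                         % the 13 coefficients referred to later in the report
    c = zeros(numCoeffs, 1);
    for n = 1:numCoeffs
        c(n) = sum(log(S) .* cos(n * ((1:K)' - 0.5) * pi / K));   % c_n for n = 1..13 (c_0 is dropped)
    end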

4. Feature Matching

Feature matching refers to the classification of the extracted features from
individual speakers. The feature matching techniques used in speaker recognition
include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and
Vector Quantization (VQ).

4.1 Vector Quantization
In this project, since little training data is available, VQ is well suited for voice
recognition, owing to its ease of implementation and high accuracy.
VQ is a process of mapping vectors from a large vector space to a finite number of
regions in that space. Each region is called a cluster and can be represented by its
centroid, called a codeword. The collection of all codewords constitutes the
codebook of a known speaker.
Figure 4 shows a conceptual diagram to illustrate this recognition process. In the
figure, only two speakers and two dimensions of the acoustic space are shown. The
circles refer to the acoustic vectors from speaker 1 while the triangles are from
speaker 2. In the training phase, a speaker-specific VQ codebook is generated for
each known speaker by clustering his training acoustic vectors. The resulting codewords
are shown in Fig 4 as black circles and black triangles for speakers 1 and 2,
respectively. The distance from a vector to the closest codeword is called the distortion.
In the recognition phase, an input utterance of an unknown voice is vector-quantized
using each trained codebook and the total VQ distortion is computed. The speaker
whose VQ codebook yields the smallest total distortion is identified.
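
The decision rule just described can be sketched as follows; the test vectors and the
three 16 x 13 codebooks below are random placeholders rather than our trained models:

    testVectors = randn(200, 13);                                % MFCC vectors of the unknown utterance (rows)
    codebooks = {randn(16, 13), randn(16, 13), randn(16, 13)};   % one 16x13 codebook per trained speaker
    dist = zeros(1, numel(codebooks));
    for s = 1:numel(codebooks)
        cb = codebooks{s};
        total = 0;
        for t = 1:size(testVectors, 1)
            d = sum((cb - repmat(testVectors(t, :), size(cb, 1), 1)).^2, 2);  % squared distance to each codeword
            total = total + sqrt(min(d));                        % distortion = distance to the closest codeword
        end
        dist(s) = total / size(testVectors, 1);                  % average distortion over the whole utterance
    end
    [minDist, speakerID] = min(dist);                            % smallest total distortion identifies the speaker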


Fig 4. Conceptual diagram illustrating vector quantization codebook formation.
One speaker can be discriminated from another based on the location of the centroids.

The LBG VQ algorithm we used is formally implemented by the following
recursive procedure (a code sketch is given after the list):
1. Design a 1-vector codebook; this is the centroid of the entire set of training
vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword y_n according
to the rule

       y_n+ = y_n (1 + ε)
       y_n- = y_n (1 - ε)

where n varies from 1 to the current size of the codebook, and ε is a splitting
parameter (we choose ε = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the
current codebook that is closest (in terms of the similarity measure), and assign
that vector to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the
training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset
threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
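
The following MATLAB sketch follows steps 1-6 above; the training data, the stopping
tolerance and the handling of empty cells are illustrative choices, not the exact
settings of our implementation:

    trainVectors = randn(500, 13);          % placeholder training data (one MFCC vector per row)
    M = 16;                                 % target codebook size
    eps_split = 0.01;                       % splitting parameter (step 2)
    tol = 1e-3;                             % relative-improvement threshold for the inner loop (step 5)

    codebook = mean(trainVectors, 1);       % step 1: centroid of the entire training set
    while size(codebook, 1) < M             % step 6: grow until the codebook has M codewords
        codebook = [codebook * (1 + eps_split); codebook * (1 - eps_split)];   % step 2: split
        prevD = Inf;
        while true
            n = size(trainVectors, 1);
            k = size(codebook, 1);
            d = zeros(n, k);
            for j = 1:k                      % step 3: nearest-neighbor search
                dvec = trainVectors - repmat(codebook(j, :), n, 1);
                d(:, j) = sum(dvec.^2, 2);
            end
            [minD, idx] = min(d, [], 2);
            for j = 1:k                      % step 4: centroid update
                members = trainVectors(idx == j, :);
                if ~isempty(members)
                    codebook(j, :) = mean(members, 1);
                end
            end
            D = mean(minD);                  % step 5: average distortion
            if abs(prevD - D) / D < tol
                break;
            end
            prevD = D;
        end
    end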

Figure 5 shows, in a flow diagram, the detailed steps of the LBG algorithm.
"Cluster vectors" is the nearest-neighbor search procedure, which assigns each
training vector to the cluster associated with the closest codeword. "Find centroids" is
the centroid update procedure. "Compute D (distortion)" sums the distances of all
training vectors in the nearest-neighbor search so as to determine whether the
procedure has converged.


Fig 5. Flow diagram of the LBG algorithm

4.2 Hidden Markov Model (HMM)

For speech signals, a left-right HMM is found to be the most useful. A left-right
model has the property that as time increases, the state index increases (or stays
the same); that is, the system states proceed from left to right. Since the properties of a
speech signal change over time in a successive manner, this model is very well suited
for word recognition. A 6-state left-right HMM is used here to model the 2-phoneme
password.
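
For illustration, a 6-state left-right transition matrix of this kind can be initialized
as follows; the 0.5/0.5 transition probabilities are only illustrative starting values,
not our trained parameters:

    numStates = 6;
    A = zeros(numStates);
    for i = 1:numStates - 1
        A(i, i)     = 0.5;     % probability of staying in the current state
        A(i, i + 1) = 0.5;     % probability of moving one state to the right
    end
    A(numStates, numStates) = 1;   % the last state can only loop on itself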

5. Implementation



Fig 6. System Implementation


(The flowchart in Figure 6 has two branches. In the training branch, the user's input
voice is used to train a VQ codebook and the spoken password is used to train an
HMM, and the training results are stored in the database. In the identification branch,
the user again inputs his voice and speaks the password; the password is tested and
the result is output.)
6. Simulation and Results

The performance of our voice recognition and identification system depends to
a large extent on the recording quality of the training and test data. With the simple
equipment we used to record our voices and passwords, and the small number of
speakers in our group for training and testing, the correct rate of the whole
identification system is 100%, achieved by careful recording and proper adjustment of
the relevant parameters, such as codebook size and iteration number. VQ tests using a
set of 16 downloaded speaker wave files also work perfectly with our system.













Figure 7 shows a 2-D plot of the original trained MFCC vectors of two
speakers, obtained by selecting two of the 13 dimensions (e.g. the 5th and 6th
coefficients), and, for comparison, a 2-D plot of the codebooks generated for the two
speakers in the same two dimensions with a codebook size of 16. We can see that each
codeword on the plot closely matches a corresponding cluster of the original MFCC
data points and thus still accurately represents each speaker's voice characteristics.
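
A plot of this kind can be produced with a few lines of MATLAB; all data below are
random stand-ins for our trained MFCC vectors and codebooks:

    mfcc1 = randn(300, 13);       cb1 = randn(16, 13);        % speaker 1 vectors and codebook (placeholders)
    mfcc2 = randn(300, 13) + 1;   cb2 = randn(16, 13) + 1;    % speaker 2 vectors and codebook (placeholders)
    d1 = 5;  d2 = 6;                                          % the two MFCC dimensions to display
    figure; hold on;
    plot(mfcc1(:, d1), mfcc1(:, d2), 'b.');
    plot(mfcc2(:, d1), mfcc2(:, d2), 'g.');
    plot(cb1(:, d1), cb1(:, d2), 'ko', 'MarkerFaceColor', 'k');
    plot(cb2(:, d1), cb2(:, d2), 'k^', 'MarkerFaceColor', 'k');
    xlabel('5th MFCC coefficient'); ylabel('6th MFCC coefficient');
    hold off;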

In the VQ part, we calculate the Euclidean distance between the unknown
utterance and each codebook; the codebook giving the lowest distance identifies the
speaker. On the other hand, a series of tests showed that it is very hard to set a
threshold on this distance that distinguishes people who are not in our database from
people who are in our training database, as illustrated by the table of VQ test values
below. Since the threshold value varies from word to word, any fixed threshold would
be too high for most utterances or too low for most others. The limitation of our
system is therefore that people who have not been trained into the system can still
pass the voice recognition realized through the VQ algorithm. This limitation was not
expected. However, such a person will still not be able to pass the password
verification, which preserves the security of the system. For people already trained
into the database who try to access the system, VQ helps to pinpoint the password
associated with the speaker in the database, and thus improves the security of the
whole system and serves as an integral part of our security system.
Minimum Euclidean Distance

                     Zhongmin (train)   Weimin (train)   Qizhang (train)
    Zhongmin (test)       3.9441           11.0518            7.6278
    Weimin (test)         7.9036            4.4215            5.7494
    Qizhang (test)        8.9537           11.3254            6.3821

The password verification system is built on Malcolm Slaney's mfcc.m code,
available at http://rvl4.ecn.purdue.edu/~malcolm/interval/1998-010. In this part, we
assume all the passwords have a similar number of phonemes, so that we can use a
constant number of HMM states, denoted N in our MATLAB files. This simplifies
our implementation to some extent. In the attached M-file we choose N = 6, and this
number can be changed for different applications. However, we encountered a
problem when using this mfcc file: running the function always produced an error
because a determinant was zero or negative. By inspection, we found that it was
caused by the database we recorded ourselves, in which some data files contain many
zero samples. To solve the problem, we add very low-level noise to all the data files
to avoid zero-valued data. Simulations show that this noise has no apparent effect on
the performance.
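
The fix amounts to a simple dithering step, sketched below; the noise level of 1e-6 is
an illustrative value, not the one actually used in the project:

    s = [zeros(1000, 1); randn(11000, 1)];     % stands in for a recording containing a silent stretch
    s = s + 1e-6 * randn(size(s));             % very low-level additive noise removes the exact zeros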
In addition, for word recognition we need to reject an incorrect password rather
than simply pick the most likely word. This means we have to set a threshold below
which a password is rejected. After a series of tests we determined the threshold to be
950. The following shows an example of our test results:

VQ Results:
Training:
.
Codebook size 2: 80.804 (Distortion at the end of intermediate step)
Codebook size 4: 36.259
Codebook size 8: 16.053
Codebook size 16: 5.972

Codebook size 2: 51.841
Codebook size 4: 36.826
Codebook size 8: 13.907
Codebook size 16: 5.201

Codebook size 2: 70.125
Codebook size 4: 38.826
Codebook size 8: 17.810
Codebook size 16: 8.840

code =
[16x13 double] [16x13 double] [16x13 double]

Testing 1:
Speaker 1 matches with speaker 1
Speaker 2 matches with speaker 2
Speaker 3 matches with speaker 3
Test Values:
4.3394 11.3757 8.2228
8.3055 4.7606 6.0796
9.3113 11.6095 6.9993

Testing 2:
Speaker 1 matches with speaker 1
Speaker 2 matches with speaker 2
Speaker 3 matches with speaker 3
Test Values:
3.9441 11.0518 7.6278
7.9036 4.4215 5.7494
8.9537 11.3254 6.3821

HMM Results:

Values of scores for wrong passwords:
Test1:
448.3982 453.0269 338.5430 433.8976 464.9219 340.7509 447.1000 372.9350
Test2:
426.8563 399.6682 360.0860 435.6481 414.3633 390.5534 552.7840 428.9359
Test3:
816.9799 686.1951 558.3107 656.9667 786.2361 810.1102 718.3228 665.1145
Test4:
788.5213 704.0753 811.8671 770.9657 660.4869 715.6127 735.7128

Values of scores for correct passwords:
Test1:
1.0e+003 *
1.6267 1.6764 1.6476 1.6067 1.6645 1.6510 1.5664 1.6434
Test2:
1.0e+003 *
1.6979 1.6882 1.6796 1.5907 1.6284 1.6279 1.6327 1.6725
Test3:
1.0e+003 *
1.7620 1.8311 1.7683 1.7246 1.6722 1.7871 1.7006 1.8441
Test4:
1.0e+003 *
1.8620 1.7334 1.7784 1.7618 1.7945 1.8492 1.7953 1.6618

References

[1] X. Huang, A. Acero, H.-W. Hon, and R. Reddy, Spoken Language Processing: A
    Guide to Theory, Algorithm and System Development, Prentice Hall, 2001.
[2] S. Furui, "An overview of speaker recognition technology," ESCA Workshop on
    Automatic Speaker Recognition, Identification and Verification, pp. 1-9, 1994.
[3] F. K. Soong, A. E. Rosenberg, and B. H. Juang, "A vector quantization approach
    to speaker recognition," AT&T Technical Journal, vol. 66, no. 2, pp. 14-26,
    March 1987.
[4] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in
    speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-285,
    February 1989.
