3.1.3 Fast Fourier Transform (FFT)
The FFT converts each frame from the time domain into the frequency domain. The FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:

    X_k = \sum_{n=0}^{N-1} x_n e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1
The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0; positive frequencies 0 < f < f_s/2 correspond to the values 1 \le n \le N/2 - 1, while negative frequencies -f_s/2 < f < 0 correspond to N/2 + 1 \le n \le N - 1.
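As a rough illustration (in Python/NumPy rather than the Matlab used in the project), the following sketch evaluates the DFT sum directly for one short frame and checks that the FFT produces the same sequence; the sample values are arbitrary and not taken from the report's data:

```python
import numpy as np

# One frame of N samples (illustrative values only).
x = np.array([0.0, 1.0, 0.0, -1.0])
N = len(x)

# Direct evaluation of X_k = sum_n x_n * exp(-j*2*pi*k*n/N).
n = np.arange(N)
X_direct = np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

# The FFT computes the same sequence, but in O(N log N) operations.
X_fft = np.fft.fft(x)
assert np.allclose(X_direct, X_fft)
```

In the FFT output, index 0 holds the zero-frequency term and, as described above, indices past N/2 hold the negative frequencies.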
3.1.4 Mel-frequency Wrapping
As mentioned above, psychophysical studies have shown that human perception
of the frequency contents of sounds for speech signals does not follow a linear scale.
Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is
measured on a scale called the mel scale. The mel-frequency scale has linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz. As a
reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing
threshold, is defined as 1000 mels. Therefore we can use the following approximate
formula to compute the mels for a given frequency f in Hz:
    mel(f) = 2595 \log_{10}(1 + f/700)
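This formula is simple enough to state as a one-line function (a Python sketch, not the project's Matlab code); by construction the 1 kHz reference point maps to approximately 1000 mels:

```python
import math

def hz_to_mel(f_hz):
    """Approximate subjective pitch in mels: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(1000.0)))  # → 1000, the reference point
```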
One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale (see Fig 3). Each filter has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.
Note that this filter bank is applied in the frequency domain; it therefore simply amounts to applying the triangle-shaped windows of Fig 3 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins overlap) in the frequency domain.
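A minimal sketch of how such a filter bank can be constructed (Python/NumPy, with an assumed 512-point FFT and 8 kHz sampling rate; the report does not specify these values): K + 2 points are spaced uniformly in mels, mapped back to FFT bins, and each consecutive triple of points defines one triangular filter.

```python
import numpy as np

def mel_filterbank(K=20, n_fft=512, fs=8000):
    """K triangular filters spaced uniformly on the mel scale (illustrative sketch)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # K + 2 equally spaced mel points define K overlapping triangles.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), K + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((K, n_fft // 2 + 1))
    for k in range(1, K + 1):
        left, center, right = bins[k - 1], bins[k], bins[k + 1]
        for i in range(left, center):        # rising edge of the triangle
            fbank[k - 1, i] = (i - left) / max(center - left, 1)
        for i in range(center, right):       # falling edge of the triangle
            fbank[k - 1, i] = (right - i) / max(right - center, 1)
    return fbank

fb = mel_filterbank()  # shape (20, 257): one row per filter
```

Multiplying a frame's power spectrum by each row and summing gives the K mel spectrum values used below.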
Fig 3. Mel-spaced filterbank (frequency axis 0-7000 Hz)
3.1.5 Cepstrum
Finally the log mel spectrum is converted back to time. The result is called the
mel frequency cepstrum coefficients (MFCC). The cepstral representation of the
speech spectrum provides a good representation of the local spectral properties of the
signal for the given frame analysis. Because the mel spectrum coefficients (and so
their logarithm) are real numbers, we can convert them to the time domain using the
Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients resulting from the last step by \tilde{S}_k, k = 1, 2, \ldots, K, we can calculate the MFCCs, \tilde{c}_n, as

    \tilde{c}_n = \sum_{k=1}^{K} (\log \tilde{S}_k) \cos\left[ n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right], \qquad n = 1, 2, \ldots, K
Note that we exclude the first component, \tilde{c}_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
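The DCT sum above can be written out directly (a Python sketch with arbitrary illustrative mel spectrum values, keeping n = 1, ..., K-1 and dropping the n = 0 mean term as in the text):

```python
import numpy as np

# Illustrative mel power spectrum S~_k for K = 20 filters (arbitrary positive values).
K = 20
S = np.linspace(1.0, 2.0, K)

# c~_n = sum_{k=1..K} log(S~_k) * cos(n * (k - 1/2) * pi / K),
# computed for n = 1..K-1 (the n = 0 mean term is excluded).
k = np.arange(1, K + 1)
mfcc = np.array([np.sum(np.log(S) * np.cos(n * (k - 0.5) * np.pi / K))
                 for n in range(1, K)])
```

This is exactly half of the standard (unnormalized) DCT-II of the log mel spectrum, so a library DCT routine can be substituted for the explicit loop.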
4. Feature Matching
Feature matching refers to the classification of the features extracted from individual speakers. The feature matching techniques used in speaker recognition
include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and
Vector Quantization (VQ).
4.1 Vector Quantization
In this project, since little data is available, VQ is suitable for voice recognition due to its ease of implementation and high accuracy.
VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its centroid, called a codeword. The collection of all codewords constitutes the codebook for a known speaker.
Figure 4 shows a conceptual diagram to illustrate this recognition process. In the
figure, only two speakers and two dimensions of the acoustic space are shown. The
circles refer to the acoustic vectors from the speaker 1 while the triangles are from the
speaker 2. In the training phase, a speaker-specific VQ codebook is generated for
each known speaker by clustering his training acoustic vectors. The resulting codewords are shown in Fig 4 as black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword is called the distortion.
In the recognition phase, an input utterance of an unknown voice is vector-quantized
using each trained codebook and the total VQ distortion is computed. The speaker
corresponding to the VQ codebook with smallest total distortion is identified.
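The recognition rule can be sketched in a few lines (Python/NumPy illustration with toy 2-D codebooks, not the project's actual data): quantize every test vector against each speaker's codebook, sum the per-vector distortions, and pick the codebook with the smallest total.

```python
import numpy as np

def total_vq_distortion(vectors, codebook):
    """Sum of distances from each vector to its nearest codeword."""
    # d has shape (num_vectors, num_codewords).
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

# Toy example: two 2-codeword codebooks; the utterance lies near codebook A.
rng = np.random.default_rng(0)
codebook_a = np.array([[0.0, 0.0], [1.0, 1.0]])
codebook_b = np.array([[5.0, 5.0], [6.0, 6.0]])
utterance = rng.normal(loc=[0.5, 0.5], scale=0.1, size=(30, 2))

scores = [total_vq_distortion(utterance, cb) for cb in (codebook_a, codebook_b)]
best = int(np.argmin(scores))  # → 0, i.e. speaker A is identified
```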
Fig 4. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids.
The LBG VQ algorithm we used is formally implemented by the following recursive procedure:
1. Design a 1-vector codebook; this is the centroid of the entire set of training
vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword y_n according to the rule

    y_n^+ = y_n (1 + \epsilon)
    y_n^- = y_n (1 - \epsilon)

where n varies from 1 to the current size of the codebook, and \epsilon is a splitting parameter (we choose \epsilon = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the
current codebook that is closest (in terms of similarity measurement), and assign
that vector to the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the
training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset
threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
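The six steps above can be sketched compactly (a Python/NumPy illustration under the assumption that M is a power of two, since the codebook doubles at each split; the project's own implementation was in Matlab):

```python
import numpy as np

def lbg(training, M, eps=0.01, tol=1e-3):
    """LBG codebook design: repeated splitting plus k-means-style refinement."""
    codebook = training.mean(axis=0, keepdims=True)   # step 1: 1-vector codebook
    while len(codebook) < M:                           # step 6: until size M
        # Step 2: split each codeword y into y(1 + eps) and y(1 - eps).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev_dist = np.inf
        while True:
            # Step 3: nearest-neighbour search.
            d = np.linalg.norm(training[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # Step 4: centroid update (empty cells keep their old codeword).
            for i in range(len(codebook)):
                members = training[nearest == i]
                if len(members) > 0:
                    codebook[i] = members.mean(axis=0)
            # Step 5: iterate until the average distortion stops improving.
            dist = d.min(axis=1).mean()
            if prev_dist - dist < tol:
                break
            prev_dist = dist
    return codebook

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 2))
cb = lbg(data, M=4)  # 4-codeword codebook for 2-D toy data
```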
Figure 5 shows, in a flow diagram, the detailed steps of the LBG algorithm.
"Cluster vectors" is the nearest-neighbor search procedure, which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors found in the nearest-neighbor search, in order to determine whether the procedure has converged.
Fig 5. Flow diagram of the LBG algorithm
4.2 Hidden Markov Model (HMM)
For speech signals, a left-right HMM is found to be more useful. A left-right model has the property that as time increases, the state index increases (or stays the same); that is, the system states proceed from left to right. Since the properties of a speech signal change over time in a successive manner, this model is well suited for word recognition. A 6-state left-right HMM is used to model the 2-phoneme password.
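The left-right constraint shows up directly in the transition matrix: only the diagonal and first superdiagonal are nonzero. A sketch for the 6-state case (the probability values are illustrative assumptions, not the trained values from the project):

```python
import numpy as np

# Transition matrix of a 6-state left-right HMM: from each state the system
# may only stay put or move one state to the right.
N = 6
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i] = 0.6       # self-loop probability (illustrative value)
    A[i, i + 1] = 0.4   # probability of advancing to the next state
A[N - 1, N - 1] = 1.0   # the final state is absorbing

# Each row must be a probability distribution over next states.
assert np.allclose(A.sum(axis=1), 1.0)
```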
5. Implementation
Fig 6. System Implementation. In the training phase, the user's voice is recorded and used for word training with VQ, and the spoken password is used for training with HMM; the training results are stored in the database. In the identification phase, the user speaks the password, the system tests it against the stored models, and the results are output.
6. Simulation and Results
The performance of our voice recognition and identification system depends to a large extent on the quality of the training and test recordings. Given the simple equipment we used to record our voices and passwords, and the small number of speakers in our group for training and testing, the correct rate for the whole identification system is 100% when recordings are made carefully and relevant parameters, such as codebook size and iteration number, are properly adjusted. VQ tests using a set of 16 downloaded speaker wave files also work perfectly in our system.
Figure 7 shows a 2-D plot of the original trained MFCC vectors from two speakers, obtained by selecting any two dimensions (e.g., the 5th and 6th) of the 13 coefficients, and, for comparison, a 2-D plot of the codebooks generated by the two speakers using the same two dimensions with a codebook size of 16. We can see that each codeword on the plot closely represents the corresponding cluster in the original MFCC data points and still accurately represents each speaker's voice characteristics.
In our VQ part, we calculate the Euclidean distance between the unknown word and each codebook; the speaker whose codebook gives the lowest distance is identified. On the other hand, based on a series of tests it is very hard to set a threshold on this distance that distinguishes people not in our database who try to gain access from people already in our training database, as illustrated by the following table of VQ test values. Since the appropriate threshold varies for each word, any fixed threshold would be too high for some values and too low for others. The limitation of our system is therefore that people who have not been trained in our system can still pass the voice recognition realized through the VQ algorithm. This limitation was not expected. However, such a person will still not be able to pass the password verification, which ensures the security of the system. For people already trained in the database, VQ helps to pinpoint the password associated with the speaker in the database, and thus improves the security of the whole system and serves as an integral part of our security system.
Minimum Euclidean Distance
Zhongmin(train) Weimin(train) Qizhang(train)
Zhongmin(test) 3.9441 11.0518 7.6278
Weimin(test) 7.9036 4.4215 5.7494
Qizhang(test) 8.9537 11.3254 6.3821
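The decision rule applied to this table is a simple per-row minimum; the following Python sketch reproduces it with the distances shown above:

```python
import numpy as np

# Minimum Euclidean distances from the table above (rows: test utterances,
# columns: trained codebooks for Zhongmin, Weimin, Qizhang).
distances = np.array([
    [3.9441, 11.0518, 7.6278],
    [7.9036,  4.4215, 5.7494],
    [8.9537, 11.3254, 6.3821],
])
speakers = ["Zhongmin", "Weimin", "Qizhang"]

# Each test utterance is assigned to the codebook with the smallest distortion.
identified = [speakers[int(i)] for i in distances.argmin(axis=1)]
# → ['Zhongmin', 'Weimin', 'Qizhang']: every speaker is identified correctly.
```

Note that the off-diagonal values (e.g. 5.7494 for Weimin's test against Qizhang's codebook) are not much larger than some diagonal ones, which is exactly why a single rejection threshold is hard to choose.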
The password verification system is built using Malcolm Slaney's mfcc.m code, available at http://rvl4.ecn.purdue.edu/~malcolm/interval/1998-010. In this part, we assume all passwords have a similar number of phonemes, so that we can use a constant number of states, N, in our Matlab files. This simplifies our implementation to some extent. In the attached M-file we choose N=6; this number can be changed for different applications. However, we met some problems when using this mfcc file. When running the function, an error always occurs because a determinant is zero or negative. By inspection, we found it was caused by the database we set up ourselves: many zeros exist in some data files. To solve the problem, we decided to add a very small amount of noise to all the data files to avoid zero data. Simulations show that this noise has no apparent effect on performance.
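This dithering fix can be sketched in a few lines (a Python/NumPy illustration of the idea, not the project's Matlab code; the noise scale is an assumed value):

```python
import numpy as np

def dither(signal, scale=1e-6, seed=0):
    """Add tiny noise so exact zeros cannot produce singular covariance estimates."""
    rng = np.random.default_rng(seed)
    return signal + scale * rng.standard_normal(signal.shape)

x = np.zeros(1000)              # a pathological all-zero segment
y = dither(x)
assert not np.any(y == 0.0)     # the exact zeros are gone
assert np.abs(y).max() < 1e-4   # noise stays negligible relative to speech amplitudes
```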
In addition, for word recognition we need to reject an incorrect password rather than always selecting the most likely word. This means we have to set a threshold to discard wrong passwords. After a series of tests we determined the threshold to be 950. The following shows an example of our test results:
VQ Results:
Training:
Codebook size 2: 80.804 (Distortion at the end of intermediate step)
Codebook size 4: 36.259
Codebook size 8: 16.053
Codebook size 16: 5.972
Codebook size 2: 51.841
Codebook size 4: 36.826
Codebook size 8: 13.907
Codebook size 16: 5.201
Codebook size 2: 70.125
Codebook size 4: 38.826
Codebook size 8: 17.810
Codebook size 16: 8.840
code =
[16x13 double] [16x13 double] [16x13 double]
Testing 1:
Speaker 1 matches with speaker 1
Speaker 2 matches with speaker 2
Speaker 3 matches with speaker 3
Test Values:
4.3394 11.3757 8.2228
8.3055 4.7606 6.0796
9.3113 11.6095 6.9993
Testing 2:
Speaker 1 matches with speaker 1
Speaker 2 matches with speaker 2
Speaker 3 matches with speaker 3
Test Values:
3.9441 11.0518 7.6278
7.9036 4.4215 5.7494
8.9537 11.3254 6.3821
HMM Results:
Values of scores for wrong passwords:
Test1:
448.3982 453.0269 338.5430 433.8976 464.9219 340.7509 447.1000
372.9350
Test2:
426.8563 399.6682 360.0860 435.6481 414.3633 390.5534 552.7840
428.9359
Test3:
816.9799 686.1951 558.3107 656.9667 786.2361 810.1102 718.3228
665.1145
Test4:
788.5213 704.0753 811.8671 770.9657 660.4869 715.6127 735.7128
Values of scores for correct passwords:
Test1:
1.0e+003 *
1.6267 1.6764 1.6476 1.6067 1.6645 1.6510 1.5664 1.6434
Test2:
1.0e+003 *
1.6979 1.6882 1.6796 1.5907 1.6284 1.6279 1.6327 1.6725
Test3:
1.0e+003 *
1.7620 1.8311 1.7683 1.7246 1.6722 1.7871 1.7006 1.8441
Test4:
1.0e+003 *
1.8620 1.7334 1.7784 1.7618 1.7945 1.8492 1.7953 1.6618