
Proceedings of the 2017 4th International Conference on Advances in Electrical Engineering (ICAEE), 28-30 September, Dhaka, Bangladesh

Modified Mel-frequency Cepstral Coefficients (MMFCC) in Robust Text-dependent Speaker Identification

Md. Atiqul Islam


Department of Electrical and Electronic Engineering
International Islamic University Chittagong
Chittagong, Bangladesh
atiq.atrai@gmail.com

Abstract: This paper presents a new approach to closed-set speaker identification. The traditional mel-frequency cepstral coefficient (MFCC) feature is modified, and the resulting feature is named the modified MFCC (MMFCC). A text-dependent dataset was used to measure the speaker identification rate of the presented method in both clean and noisy conditions. Four types of noise were added to the clean signals to produce noisy signals over a range of signal-to-noise ratios (SNRs) from -5 dB to 10 dB. The obtained performance was compared with that of traditional features such as MFCC and the Gammatone frequency cepstral coefficient (GFCC). The results show that the proposed method achieves significantly better performance than conventional MFCC- and GFCC-based methods under noisy conditions.

Index Terms: Text-dependent, robust speaker identification, modified MFCC (MMFCC), envelope.

I. INTRODUCTION

Automatic speaker identification is the biometric process of identifying a target speaker from a number of known or unknown speakers by matching a voice pattern against a set of speaker models. Speaker identification is becoming popular in trading, banking, shopping, crime investigation, and information retrieval, and as a forensic tool, because of the unique and distinguishing characteristics of each individual's voice production mechanism. Generally, speaker identification is performed with either text-dependent or text-independent speech. In an automatic text-dependent speaker identification system, the testing speech is confined to specific utterances recorded over several sessions.

In this paper, a new text-dependent speaker identification approach is studied. The process is implemented in three steps: feature extraction from the audio signal, speaker prototype (behavioral) modeling, and testing of system performance through pattern classification. Feature extraction from the voice signal is the most challenging issue in automatic speaker identification. A successful feature provides speaker-distinguishing characteristics that fit properly into the speaker model, which in turn yields robust identification performance irrespective of the acoustic environment. For this reason, great emphasis is placed on feature extraction. A number of feature extraction models have been developed; broadly, they fall into two groups: auditory-periphery-based models such as MFCC [1] and GFCC [2], and voice-production-based models such as perceptual linear predictive (PLP) coefficients [3] and linear prediction cepstral coefficients (LPCCs) [4].

The MFCC is the most widely used front-end in speaker, speech, and phoneme classification, and nowadays it serves as a standard baseline against which newly proposed methods are compared. MFCC provides good speaker identification accuracy in clean conditions [5, 6], but in noisy conditions (at 0 dB SNR) its performance drops to a remarkable degree [7]. Applying a cubic-root operation in place of the log operation in MFCC extraction can slightly improve identification accuracy under noisy conditions [8].

The study in [5] shows that including phase information enhances MFCC performance. According to the investigation in [9],

978-1-5386-0869-2/17/$31.00 ©2017 IEEE


the phase spectrum and the magnitude spectrum both contribute to speech intelligibility, depending on the proper selection of window shape. However, the voice of a sound is identified mainly by its magnitude spectrum, as observed in [10].

A new feature named GFCC [2], which differs from MFCC in its filter (a Gammatone filter rather than a triangular filter) and its non-linearity (a cubic-root operation instead of the log operation), has been proposed to improve on MFCC. For text-dependent speaker identification, however, the GFCC-based method yields results similar to the MFCC-based method [11].

Voice is a multi-dimensional phenomenon and cannot be represented with a single variable. A possible approach is to look at a sound's overall spectral energy distribution. The importance of energy estimation along the basilar membrane has been discussed in [12]. According to that study, the auditory system averages the basilar membrane energy (called the envelope) and transmits the sound information through the auditory nerve. Therefore, in this study the energies of the filter-bank responses are accumulated and returned to the time domain by applying the discrete cosine transform (DCT). The new feature presented in this study is called the modified mel-frequency cepstral coefficient (MMFCC). This feature is applied throughout this study to obtain robust speaker identification scores on the UM dataset [11]. The achieved performance was significantly better than that of the conventional MFCC- and GFCC-based methods under noisy conditions. In summary, this paper presents a new feature that modifies the conventional MFCC front-end for speaker identification; the obtained performance outperformed the similar MFCC- and GFCC-based features under noisy conditions, and the proposed extraction algorithm is also supported by the physiology of the human peripheral auditory mechanism.

Fig. 1: The methodological block diagram of MMFCC feature extraction from an audio speech signal.

II. METHODOLOGY

In this section, the proposed and baseline feature extraction processes, the experimental setup, and the speaker modeling are described.

A. Modified MFCC (MMFCC) Extraction

Feature extraction from the audio speech signal is almost identical to conventional MFCC. Fig. 1 shows the block diagram of the extraction process for the newly proposed feature, and Fig. 2 shows the functional block diagram of the overall speaker identification method. Initially, the input signal is pre-emphasized to equalize the loudness of the audio waveform.

Most auditory-periphery-based models share a common process that mimics the human auditory system: the audio signal is passed through a number of band-pass filters, treating the basilar membrane as a bank of band-pass filters with overlapping pass-bands. In this study, 25 linearly spaced band-pass filter channels (as in MFCC) were used, and a Hamming window (25 ms, with 10 ms overlap) was applied to simulate the basilar membrane response.

The pre-emphasized audio speech signal was passed through the Hamming window to obtain a feature matrix of size u × v, where u is the number of window points and v is the number of columns (frames) that fit the total signal length, computed as

v = (l - o) / (u - o) ... (1)

Here, l is the length of the audio speech signal and o is the number of overlapping points between adjacent windows. A fast Fourier transform was applied to the obtained feature matrix; in this study, 512 input points were used as the number of fast Fourier transform points (NFFT). The obtained feature, called a specgram, is similar to a spectrogram. The power spectrum of the specgram was then taken. In conventional MFCC, a log operation is used to reflect the cochlear non-linearity; here, a cubic-root operation was applied to the power spectrum to scale the loudness according to Stevens [13, 14]:

E(i, j) = |y(i, j)|^(1/3) ... (2)

The obtained power spectrum is then passed through the uniformly spaced filter bank. In this study, 25 filters were used, covering 0 Hz to half the sampling frequency (as in the original MFCC extraction process). The frequency components of the power spectrum were converted to the mel scale following the MFCC extraction procedure.

R. S. Holambe and M. S. Deshpande observed in [12] that the human auditory system averages the energies of each critical band of the basilar membrane and finally presents a compressed form of the input audio signal. Accordingly, the average energy of the filter-bank responses was taken.
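Taken together, the steps above (pre-emphasis, framing per Eq. (1), FFT power spectrum, cubic-root compression per Eq. (2), filter-bank averaging, the band-energy envelope step, and a final DCT) can be sketched in Python with NumPy. This is an illustrative reconstruction, not the authors' implementation: the uniform rectangular filter bank, the default parameters, and all function names are assumptions.

```python
import numpy as np

def dct2(x, n_out):
    """Plain DCT-II along the last axis, keeping n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    return x @ basis.T

def mmfcc(signal, fs=8000, n_filters=25, n_ceps=13, nfft=512):
    """Sketch of the MMFCC pipeline of Section II-A (illustrative only)."""
    # Pre-emphasis to equalize the loudness of the waveform.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing with a 25 ms Hamming window and 10 ms shift; Eq. (1):
    # v = (l - o) / (u - o), with l = signal length, u = window points,
    # o = overlapping points between adjacent windows.
    u, hop = int(0.025 * fs), int(0.010 * fs)
    o = u - hop
    v = (len(sig) - o) // (u - o)
    frames = np.stack([sig[i * hop:i * hop + u] for i in range(v)])
    frames = frames * np.hamming(u)

    # FFT power spectrum (the "specgram"); NFFT = 512 as in the paper.
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2

    # Cubic-root loudness compression in place of the log; Eq. (2).
    compressed = power ** (1.0 / 3.0)

    # 25-band filter bank from 0 Hz to fs/2. A uniform rectangular
    # bank stands in here for the mel-warped bank of the paper.
    edges = np.linspace(0, compressed.shape[1], n_filters + 1).astype(int)
    fbank = np.stack([compressed[:, a:b].mean(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])]).T

    # Envelope step: band energy over 6-frame groups with 50% overlap
    # (cf. Eq. (3)), then a DCT back to the cepstral domain.
    m = (fbank.shape[0] - 3) // 3
    env = np.stack([(fbank[i * 3:i * 3 + 6] ** 2).sum(axis=0)
                    for i in range(m)])
    return dct2(env, n_ceps)
```

For a one-second signal at 8 kHz this yields a small matrix of envelope cepstra, illustrating the roughly threefold size reduction mentioned below.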

Fig. 2: The newly implemented method's functional block diagram.

In this study, 6 points of each band response were considered as a frame, with 50% overlap, and the energy of each band was computed as

S(i, m) = x(i, 1:m*r) * x(i, 1:m*r)^T ... (3)

Here, i indicates the filter band index, m is the number of frames with overlap r, and T stands for the matrix transpose. This process reduces the feature size to about one-third of the conventional feature. Finally, a discrete cosine transform was applied to convert the obtained energy spectrum into cepstral coefficients. The resulting feature is named the modified mel-frequency cepstral coefficient (MMFCC). It should be mentioned that the dynamic and acceleration coefficients were not included in MMFCC; they could be a topic of further study.

B. Baseline Feature Extraction

The feature extraction processes for MFCC and GFCC are described below in turn.

i. Mel-frequency cepstral coefficient (MFCC)

The extraction process of conventional MFCC is almost identical to that of the proposed feature. They differ only in the cochlear non-linearity: MFCC applies a log operation to the FFT-based power spectrum and does not average the filter-response energies. The number of filter bands, the frequency range, and all other parameters were kept the same as in the proposed feature extraction procedure. The MFCC features were extracted using the rastamat toolbox [13]. For a fair comparison with the proposed method, only the static coefficients of MFCC were taken into consideration.

ii. Gammatone filter cepstral coefficient (GFCC)

GFCC extraction is broadly similar to MFCC extraction. GFCC uses a Gammatone filter bank rather than triangular filters to reflect the cochlear basilar membrane responses, and applies a cubic-root operation to the filter responses to introduce the cochlear non-linearity. In this study, 64 Gammatone filter bands were used to simulate frequency responses from 50 Hz to 3 kHz, following [14]; this frequency range was chosen to keep pace with the information carried by the proposed feature. The filter responses were decimated to 100 Hz, which is equivalent to framing, and a cubic-root operation was applied. Finally, a DCT was applied to convert the spectral information into the time domain. It was observed in [14] that, owing to the energy-compaction property of the DCT, most of the speech information is retained in the 1st to 23rd bands; therefore, only the first 23 coefficients were used as the GFCC feature in this study. Note that no FFT is needed in GFCC extraction.

C. Experimental Setup

This paper presents a closed-set text-dependent speaker identification process. The text-dependent University Malaya (UM) dataset [11] was used to record the speaker identification rate of the newly proposed method in clean and noisy conditions. Noisy speech was obtained by adding four different noises over a range of SNRs from -5 dB to 10 dB in steps of 5 dB; white, pink, street, and babble noises were used as background noise.

The UM dataset contains 39 speakers with ten speech samples each; the utterance of each sample is 'University Malaya'. Seven samples from each speaker were used to create the speaker prototype behavioral model, and only clean signals were used for training. Once a speaker model was ready, it was saved for testing. The remaining three samples were used to test the presented method in clean and distorted conditions.

D. Speaker Modeling

The most crucial task in speaker identification is building the speaker behavioral model. A successful classifier extracts, from the features of the audio speech signal, the latent parameters that characterize each individual speaker's identity, and the availability of adequate information ensures accurate speaker modeling. In this study, a Gaussian mixture model-universal background model (GMM-UBM) [15] was used for speaker modeling to achieve robust SID performance.

The GMM speaker model is adapted from the UBM with the speaker's training data to make the system faster and more stable, and to obtain better performance. The application of expectation maximization (EM) [16] makes GMM-based speaker modeling a successful classifier: EM can capture the required latent parameters for the GMM

from a small quantity of training data, and the obtained parameters can be applied to new data by maximum a-posteriori (MAP) adaptation [17].

A GMM-UBM-based classifier with 128 mixture components was used here to train the newly developed features and obtain a speaker prototype behavioral model for each speaker. The same algorithm was also run for the MFCC- and GFCC-based methods to make a fair comparison.
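As a concrete illustration of this pipeline, the sketch below fits a UBM, derives each speaker model by mean-only MAP adaptation (after Reynolds et al. [15]), and identifies a test utterance by maximum average log-likelihood. It uses scikit-learn's GaussianMixture as a stand-in EM trainer and toy parameters (the paper uses 128 mixtures); all names and values are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_feats, n_mix=4, seed=0):
    """Fit a universal background model on pooled feature frames."""
    ubm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                          random_state=seed, reg_covar=1e-3)
    ubm.fit(pooled_feats)
    return ubm

def map_adapt_means(ubm, feats, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to one speaker's data."""
    post = ubm.predict_proba(feats)               # (T, M) responsibilities
    n_k = post.sum(axis=0)                        # soft frame counts
    ex = (post.T @ feats) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]    # data-dependent weight
    return alpha * ex + (1.0 - alpha) * ubm.means_

def gmm_loglik(feats, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    diff = feats[:, None, :] - means[None, :, :]  # (T, M, D)
    log_p = (-0.5 * ((diff ** 2) / variances).sum(-1)
             - 0.5 * np.log(2.0 * np.pi * variances).sum(-1)
             + np.log(weights))
    return np.logaddexp.reduce(log_p, axis=1).mean()

def identify(test_feats, ubm, speaker_means):
    """Return the index of the adapted speaker model with highest score."""
    scores = [gmm_loglik(test_feats, ubm.weights_, m, ubm.covariances_)
              for m in speaker_means]
    return int(np.argmax(scores))
```

In the paper's setup the frames fed to `train_ubm` and `map_adapt_means` would be MMFCC (or MFCC/GFCC) vectors from the seven training samples per speaker.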

III. RESULT AND ANALYTICAL STUDY

In this section, the text-dependent results of the proposed method are presented. The performance of the proposed method was evaluated using both clean and noisy signals, and compared with that of the conventional MFCC- and GFCC-based methods to show the novelty of the proposed feature.

Fig. 3 shows the speaker identification rates of the MMFCC-, MFCC-, and GFCC-based methods. A common observation in speech processing is that the automatic identification rate falls as the noise level rises. It can be seen from Fig. 3 that the MFCC- and GFCC-based performances dropped continuously with increasing noise level; the rate of decline for the baseline methods was almost linear with decreasing SNR. The proposed MMFCC-based method, however, was comparatively more robust to noise than the baseline methods.

The obtained results also imply that the proposed method provides better speaker identification accuracy for slowly varying noise such as pink noise, and suffered most for stationary noises. The MMFCC-based method's accuracy remained above 40% at 0 dB SNR for every noise type except street noise, whereas the baseline methods' accuracies fell below 20% at 0 dB SNR. The GFCC-based method gave comparatively better results than MFCC for all noises except babble noise, irrespective of SNR, which is consistent with the result of [11]; conversely, MFCC produced a better identification score than GFCC for babble noise. In summary, based on the results of Fig. 3, the newly proposed method provides significantly improved performance over the other existing methods irrespective of noise type and SNR.

To investigate the robustness issue, the proposed method was also run without the DCT. The performance was then 100% in clean conditions but fell to a significant degree under noisy conditions compared with the presented results. Based on this result, it can be said that the human auditory system finally converts the spectral information of audio speech into the time domain to identify a target speaker in a noisy environment.

Fig. 3: Comparison of text-dependent speaker identification results for four different noises over a range of SNRs (-5 dB, 0 dB, 5 dB, 10 dB, and clean) using the proposed MMFCC-, MFCC-, and GFCC-based methods. Each panel plots SID accuracy (%) under one noise type: white, pink, street, or babble.
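The noisy test material used here (Section II-C) can be reproduced by scaling a noise recording to a target SNR before adding it to the clean utterance. A minimal sketch follows; it is illustrative, not the authors' code, and the function name is an assumption.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`
    (in dB), then add it to the clean signal."""
    noise = noise[:len(clean)]                  # trim noise to signal length
    p_clean = np.mean(clean ** 2)               # clean signal power
    p_noise = np.mean(noise ** 2)               # raw noise power
    # Gain that makes 10*log10(p_clean / p_scaled_noise) == snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

For the paper's protocol this would be applied with white, pink, street, and babble recordings at -5, 0, 5, and 10 dB SNR.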

There are a number of nonlinearities in the human auditory system, such as compression, two-tone rate suppression, non-linear tuning, and adaptation in the inner-hair-cell-AN synapse. As mentioned above, a cubic-root operation was applied to the speech power spectrum to reflect the cochlear non-linearity. However, as observed in this study, all of these nonlinearities disappeared once the basilar membrane energies were averaged. So the cubic-root operation was not what contributed to the better speaker identification performance here; rather, it is the averaged energy (the envelope) that contributes significantly to the improved performance, and this is what distinguishes the proposed feature from MFCC.

IV. CONCLUSION

Improving automatic speaker identification performance under noisy conditions is still challenging. To provide comparatively better text-dependent speaker identification results under contaminated conditions, a new feature named the modified mel-frequency cepstral coefficient (MMFCC) has been introduced in this paper. The newly proposed method was tested in both clean and noisy conditions using GMM-UBM, and the obtained performance was compared with that of the conventional MFCC- and GFCC-based methods. The proposed method provides significantly improved performance over the baseline methods. Owing to the scarcity of text-dependent datasets, there was no option to validate the proposed method more extensively; the presented feature could be evaluated on a large text-dependent dataset in future work. Text-independent speaker identification and speech recognition are other possible future topics for the proposed feature.

REFERENCES

[1] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980, vol. 28, no. 4, pp. 357-366.
[2] Y. Shao, S. Srinivasan, and D. Wang, "Incorporating auditory feature uncertainties in robust speaker identification," in Proc. IEEE ICASSP 2007, pp. IV-277-IV-280.
[3] E. Shriberg, "Higher-level features in speaker recognition," Lecture Notes in Computer Science, 2007, vol. 4343, pp. 241-259.
[4] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, Apr. 1975, vol. 63, no. 4, pp. 561-580.
[5] S. Nakagawa, L. Wang, and S. Ohtsuka, "Speaker identification and verification by combining MFCC and phase information," IEEE Transactions on Audio, Speech, and Language Processing, 2012, vol. 20, pp. 1085-1095.
[6] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, 1990, vol. 9, pp. 351-356.
[7] T.-S. Chi, T.-H. Lin, and C.-C. Hsu, "Spectro-temporal modulation energy based mask for robust speaker identification," The Journal of the Acoustical Society of America, 2012, vol. 131, pp. EL368-EL374.
[8] X. Zhao and D. Wang, "Analyzing noise robustness of MFCC and GFCC features in speaker identification," in Proc. IEEE ICASSP 2013, pp. 7204-7208.
[9] K. K. Paliwal and L. Alsteris, "Usefulness of phase spectrum in human speech perception," in Proc. Eurospeech '03, 2003, pp. 2117-2120.
[10] R. Plomp and H. J. M. Steeneken, "Effect of phase on the timbre of complex tones," J. Acoust. Soc. Am., 1969, vol. 46, pp. 409-421.
[11] M. Islam, M. Zilany, and A. Wissam, "Neural-response-based text-dependent speaker identification under noisy conditions," in International Conference for Innovation in Biomedical Engineering and Life Sciences, 2016, pp. 11-14.
[12] R. S. Holambe and M. S. Deshpande, "Nonlinearity framework in speech processing," in Advances in Non-Linear Modeling for Speech Processing, Springer, 2012, pp. 11-25.
[13] D. P. W. Ellis, "PLP and RASTA (and MFCC, and inversion) in Matlab," 2005. [Online]. Available: http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/.
[14] Y. Shao, S. Srinivasan, and D. Wang, "Incorporating auditory feature uncertainties in robust speaker identification," in Proc. IEEE ICASSP 2007, pp. IV-277-IV-280.
[15] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, 2000, vol. 10, pp. 19-41.
[16] J. A. Bilmes, "A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," International Computer Science Institute, 1998, vol. 4, p. 126.
[17] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, "The Hidden Markov Model Toolkit (HTK) Version 3.2.1 User's Guide," Cambridge University Engineering Department, 2002.

