

Journal of Scientific & Industrial Research
Vol. 70, April 2011, pp. 270-272

Automatic isolated digit recognition system: an approach using HMM

Kritika Nimje¹* and Madhu Shandilya²


Electronics and Communication Department, Maulana Azad National Institute of Technology, Bhopal, India

Received 18 August 2010; revised 10 February 2011; accepted 17 February 2011

This study presents the design and development of an automatic isolated digit recognition system (AIDRS) using the Hidden
Markov Model (HMM). The basic model of the system is able to recognize spoken digit utterances in English. The implementation
was done using the Hidden Markov Model Toolkit (HTK). The system is found successful and can identify spoken digits at an 89.2%
recognition rate, which is an acceptable rate of accuracy for speech recognition.

Keywords: Automatic isolated digit recognition system (AIDRS), Hidden Markov Model (HMM), Speech recognition

Introduction

Human-computer interaction via speech involves speech recognition (SR) and speech synthesis¹. SR, which is the conversion of a speech signal into text, has been one of the most difficult problems to solve². Speech has the potential to be a better interface than other input devices (keyboard or mouse)³. SR is used in a wide variety of applications, ranging from voice-based command control to dictation systems and assistance systems for physically challenged people. Automated Speech Recognition⁴ (ASR) is the ability of a machine or program to recognize voice commands or take dictation, which involves the ability to match a voice pattern against a provided or acquired vocabulary. Hidden Markov Model (HMM) has been used as a classifier for isolated digit recognition for English numbers⁵⁻⁸. HMM is the preferred method for SR because it provides a simple and flexible mechanism for modeling sequences of variable length. Later, many practical approaches utilizing different statistical models introduced noticeable improvements in speed and accuracy⁹. Considerable R&D has taken place in various languages in recent years¹¹⁻¹⁶.

This study presents a new SR system, automatic isolated digit recognition system (AIDRS), that uses HMM as a classifier for isolated digit recognition for English numbers.

*Author for correspondence
E-mail: kritika_nimje2005@rediffmail.com

Experimental Section

Design and Development of Automatic Isolated Digit Recognition System (AIDRS)
A complete ASR system based on HMM (Fig. 1) was designed and developed with two phases: i) Training phase, which creates knowledge about speech and stores and organizes the system knowledge gained; and ii) Verification phase, which figures out the meaning of the input speech given in the testing phase using HMM models. The input to the isolated digit recognition system (Fig. 2) is a spoken digit, which is converted into its corresponding text.

Training Phase (TP)
TP accepts speech samples from different people and trains the system to create acoustic models for each word in the vocabulary. A limited vocabulary is chosen for forming speech models. Digits are converted into acoustic models statistically characterized as HMMs. TP goes through two stages: i) Data preparation, which involves fixing the vocabulary; and ii) Recording data, which involves, in order, creating transcription files, coding data, creating word models, creating flat start word models, re-estimating models, and fixing silence models.

Verification Phase (VP)
VP displays 5 random numbers between 0 and 9 and asks the user to pronounce those digits. The utterance for each number is recorded as test data, which is decoded to recognize the spoken words. VP also consists of several

Fig. 1: HMM based speech recognition system (block labels: train speech, transcription, training module, HMM models, test speech, recognition module, test transcription)

Fig. 2: Isolated digit recognition system (Training Phase: data preparation, creating monophone HMM; Verification Phase: data preparation, building word network, constructing dictionary, recognizing test data)

steps, carried out in order: recording data, building a word network, constructing a dictionary, and recognizing test data.

For implementing HMM in the present isolated digit recognition system, the Hidden Markov Model Toolkit (HTK)¹⁰ is used. HTK is a toolkit (Fig. 3) for building HMMs. All functionality of HTK is built into library modules and tools.

Fig. 3: HTK tools used for isolated digit recognition at different phases (tools shown: HCopy, HERest, HHEd, HParse, HVite, HResults; stages shown: feature extraction, configuration, initialize, re-estimate, modify, acoustic model, construct rules and wordnet, dictionary, testing search, compare with answers, accuracy)

Results and Discussion
In the experiment, the database consists of 10 digits. System parameters were: sampling rate, 16 kHz with a 16-bit sample resolution; Hamming window duration, 32 ms (step size, 10 ms); length of cepstral liftering, 22; filter bank channels, 26; number of MFCC coefficients, 12; and pre-emphasis coefficient, 0.97¹⁰. HMMs were trained on 1400 speech samples and tested on 500 samples from 20 speakers. The HMMs were trained and tested using the Baum-Welch and Viterbi algorithms. Each word was modeled with 7 states using a left-right (LR) HMM topology.

HTK was used for designing and testing SR systems throughout all experiments. The baseline system was initially designed as a phoneme-level recognizer with three active states, one Gaussian mixture per state, and continuous, left-to-right, no-skip HMM models. Training was done for a fixed number of iterations (up to 3 iterations). The recognition rate of a trained HMM is defined as¹¹

$$\text{Recognition rate} = \frac{\text{Number of correct recognitions}}{\text{Total number of testing samples for each digit}} \times 100\%$$

The numbers of correct and incorrect recognitions were determined (Table 1). The accuracy percentage for each digit was: 0, 89.36; 1, 80; 2, 90.91; 3, 89.58; 4, 89.29; 5, 94.64; 6, 91.67; 7, 87.23; 8, 89.65; and 9, 87.5%. Table 2 shows the digits that were picked in case of misrecognition for all digits. Among all, digit 1 produced the lowest recognition rate, because digit 1 is most often confused with digit 7 and digit 5. Digit 1 was wrongly recognized as 7 for 15.56% of the time and as 5 for 4.44% of the time. This probably occurred because the pronunciation of digit 1 is similar to part of digit 7: the sound "one" in digit 1 and "ven" in digit 7 lead to similar feature vector values, so the system tends to recognize digit 1 as 7.

Table 1: Confusion matrix for each digit accuracy (rows: uttered digit; columns: recognized digit; values in %)

Uttered   1      2      3      4      5      6      7      8      9      0
1         80     0      0      0      4.44   0      15.56  0      0      0
2         0      90.91  0      5.45   0      0      0      0      3.64   0
3         0      4.16   89.58  0      0      4.16   0      0      2.1    0
4         0      0      0      89.29  7.14   0      0      0      0      3.57
5         0      1.78   0      0      94.64  3.58   0      0      0      0
6         0      0      2.08   0      0      91.67  6.25   0      0      0
7         0      0      0      4.26   0      0      87.23  0      0      8.51
8         0      0      0      3.45   0      6.9    0      89.65  0      0
9         0      0      5      0      5      2.5    0      0      87.5   0
0         0      0      4.25   4.25   0      2.14   0      0      0      89.36
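The per-digit accuracies and the confusions discussed in the text can be read directly off Table 1. The sketch below transcribes the table as percentages and derives both; the function names are illustrative and not part of the paper or of HTK. Because the table gives percentages rather than raw counts, the mean computed here is unweighted.

```python
# Table 1 as percentages. Rows are uttered digits and columns are recognized
# digits, both in the table's order: 1..9 followed by 0.
DIGITS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
CONFUSION = [
    [80, 0, 0, 0, 4.44, 0, 15.56, 0, 0, 0],
    [0, 90.91, 0, 5.45, 0, 0, 0, 0, 3.64, 0],
    [0, 4.16, 89.58, 0, 0, 4.16, 0, 0, 2.1, 0],
    [0, 0, 0, 89.29, 7.14, 0, 0, 0, 0, 3.57],
    [0, 1.78, 0, 0, 94.64, 3.58, 0, 0, 0, 0],
    [0, 0, 2.08, 0, 0, 91.67, 6.25, 0, 0, 0],
    [0, 0, 0, 4.26, 0, 0, 87.23, 0, 0, 8.51],
    [0, 0, 0, 3.45, 0, 6.9, 0, 89.65, 0, 0],
    [0, 0, 5, 0, 5, 2.5, 0, 0, 87.5, 0],
    [0, 0, 4.25, 4.25, 0, 2.14, 0, 0, 0, 89.36],
]


def per_digit_accuracy(matrix, digits):
    """Map each uttered digit to its recognition rate (the diagonal entry)."""
    return {digit: matrix[i][i] for i, digit in enumerate(digits)}


def confused_with(matrix, digits):
    """Map each uttered digit to the wrong digits it was recognized as,
    ordered from most to least frequent."""
    result = {}
    for i, digit in enumerate(digits):
        errors = [
            (rate, digits[j])
            for j, rate in enumerate(matrix[i])
            if j != i and rate > 0
        ]
        errors.sort(key=lambda pair: -pair[0])  # highest error rate first
        result[digit] = [wrong for _, wrong in errors]
    return result


accuracy = per_digit_accuracy(CONFUSION, DIGITS)
confusions = confused_with(CONFUSION, DIGITS)
worst_digit = min(accuracy, key=accuracy.get)
unweighted_mean = sum(accuracy.values()) / len(accuracy)

print(worst_digit, confusions[worst_digit])
print(round(unweighted_mean, 2))
```

The unweighted mean of the diagonal comes out slightly below the reported 89.2% overall rate. One possible explanation is that the digits had different numbers of test utterances, but the per-digit counts are not given, so this remains an assumption.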

Table 2: Digits that were picked in case of misrecognition for all digits

Digit   Mostly confused with
0       3, 6, 4
1       5, 7
2       4, 9
3       2, 6, 9
4       0, 5
5       6
6       3, 7
7       0, 4
8       4, 6
9       3, 5, 6

The system is relatively successful, as it can identify spoken digits at an average rate of 89.2%, which is relatively high given that the present training corpus is rather small. Presently, the system can be used only for isolated single-digit recognition; it can be enhanced to perform user authentication based on speech and to accommodate connected digits. The system can also be extended to a larger vocabulary, including letters and commonly used words, and can be made more robust by using a larger database for training.

Conclusions
A newly designed and developed SR system, AIDRS, is able to recognize spoken digit utterances in English. The system is found successful and can identify spoken digits at an 89.2% recognition rate, which is an acceptable rate of accuracy for speech recognition.

References
1 Sproat R, Multilingual Text-to-Speech Synthesis: The Bell Labs Approach, Chap 5 (Kluwer Academic Publishers, Massachusetts) 1998.
2 Markus F, Why is Speech Recognition Difficult? (Department of Computing Science, Chalmers University of Technology, Sweden) 24 Feb 2003.
3 Jurafsky D & Martin J H, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edn, 2007: http://www.cs.colorado.edu/~martin/slp2.html.
4 Douglas O’S, Interacting with computers by voice: Automatic speech recognition and synthesis, Proc IEEE, 91 (2003) 1272-1305.
5 Rabiner L R, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proc IEEE, 77 (1989) 257-286.
6 Dimov D & Azamanov I, Experimental specifics of using HMM in isolated word speech recognition, in Int Conf on Computer Syst & Technol – CompSysTech (Varna, Bulgaria) 2005, 3A.17.1-17.9.
7 Jelinek F, Statistical Methods for Speech Recognition (MIT Press, Cambridge, Massachusetts, USA) 1997.
8 Juang B & Rabiner L, Hidden Markov Models for speech recognition, Technometrics, 33 (1991) 251-272.
9 Young S, A review of large-vocabulary continuous-speech recognition, IEEE Signal Process Mag, 13 (1996) 45-57.
10 Young S, Evermann G, Gales M, Hain T, Kershaw D et al, The HTK Book (for HTK Version 3.4); Cambridge Univ Engg Dept: http://htk.eng.cam.ac.uk/protdoc/ktkbook.pdf, 2006.
11 Rosdi F & Ainon N, Isolated Malay speech recognition using Hidden Markov Models, in Proc for Int Conf on Comput & Commun Engg (Kuala Lumpur) 2008, 721-725.
12 Alotaibi Y A, Alghamdi M & Alotaiby F, Speech recognition system of Arabic digits based on a telephony Arabic corpus, in Proc 4th Int Conf, ICISP 2010 (Trois-Rivières, QC, Canada) 30 Jun - 2 Jul 2010.
13 Muhammad G, Alotaibi Y A & Huda M N, Automatic speech recognition for Bangla digits, in Proc 12th Int Conf on Comput & Inform Technol (ICCIT 2009) (Dhaka, Bangladesh) 21-23 Dec 2009.
14 Kuria C & Balakrishnan K, Speech recognition of Malayalam numbers, in Int Symp on Innovations in Nat Comput (INC-2009) (Cochin) 2009.
15 Huang X, Alex A & Hon H W, Spoken Language Processing: A Guide to Theory, Algorithm and System Development (Prentice Hall, Upper Saddle River, New Jersey) 2001.
16 Jurafsky D & Martin J H, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edn, 2007: http://www.cs.colorado.edu/~martin/slp2.html.
