Manoj Kumar, Hitesh Kumar, Shweta Sinha Kamrah Institute of Information Technology, Gurgaon
Manoj.delhi24@gmail.com, hiteshkumar111@gmail.com, Meshweta_7@rediffmail.com
ABSTRACT This paper discusses the recognition of Hindi digits based on an emotion-rich small vocabulary. A feed-forward multilayer neural network is trained by the back-propagation method for speaker-independent isolated word recognition. Mel Frequency Cepstral Coefficients (MFCC) are extracted as speech features and used to train a Multi-Layer Feed-Forward Network (MLFFN). The same routine is applied to signals during the recognition stage, and unknown test patterns are classified to the nearest pattern. An analysis based on a varying number of hidden neurons in the network is presented. The network is trained with input waves captured in an office environment and is tested against a database created in a similar environment. It has been observed that the MLFFN works as a good classifier for the test data, and that the number of speech features extracted plays a very important role in the recognition of isolated Hindi digits by machine.
1 Introduction
Automatic Speech Recognition plays a very important role in human-machine interaction. The transfer of information through speech from one person to another consists of variations in a pressure wave coming from the mouth of the speaker, which propagates through the air and reaches the ears of listeners, who decode the wave into the received message. In computer technology, speech recognition refers to the recognition of human speech by computers for the performance of speaker-initiated, computer-generated functions. Speech recognition systems are usually built upon three common approaches: the acoustic-phonetic approach, the pattern-recognition approach and the artificial-intelligence approach [1]. The acoustic-phonetic approach attempts to decode the speech signal in a sequential manner based on knowledge of the acoustic features and the relations between acoustic features and phonetic symbols. The pattern-recognition approach, on the other hand, classifies speech patterns without the explicit feature determination and segmentation of the former approach. The artificial-intelligence (AI) approach forms a hybrid of the acoustic-phonetic and pattern-recognition approaches. After the great success of the AI approach [2, 3, 4], it became a field of interest for many more researchers. There are many recognition systems for different languages, often used in applications such as military systems, aircraft and deaf telephony. Efforts are also being made to develop such systems for the Hindi language [5]; Hindi digit recognition is one step towards achieving this. In this paper the application of neural networks to the pattern-recognition approach is discussed. We propose the use of a multilayer feed-forward neural network, trained by the back-propagation technique, for Hindi digit recognition. The input to the training module of the system is the set of speech features of digits recorded in neutral emotion. The trained network is tested against digits recorded in the same environment and emotion. The speech features extracted from the recorded digits during the training and testing phases are the Mel Frequency Cepstral Coefficients. Several networks with different structures (different numbers of hidden neurons) were trained, and their performance in recognizing unknown input patterns was compared.
Figure 1: System overview — capturing of speech signals in different emotions, database creation, training of the network for words and emotions, and recognition of the input test word in different emotions.
3 Speech Database
The speech recognition process requires corpora with which to train the system. Research [2] shows that the size of the corpora plays a very important role in the success of any such system. For proper training of the system we collected a speech database from speakers in the age group of 22 to 35 years. All speakers are female and are from the Hindi-speaking region.

Total number of speakers (speaking at different rates): 30
Vocabulary size: 10 digits (Shunya to Nau)
Repetitions: each digit spoken by every speaker 5 times in neutral emotion
Emotion used for training the network: neutral
Emotion used for testing the network: neutral
Total number of utterances in the training database: 30 × 10 × 5 = 1500 utterances
Test database: recordings of 10 speakers for every word in each of sad, surprise and neutral emotion
Number of speech features: 12 MFCC coefficients along with energy
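The training-set size above follows directly from the recording plan. A small sketch (the speaker count, vocabulary and repetition count are taken from the text; the romanized digit names are the standard Hindi digit words, spelled here as an assumption):

```python
# Corpus layout as described in the text: 30 speakers, 10 Hindi digits
# (Shunya to Nau), 5 repetitions per digit, all in neutral emotion.
NUM_SPEAKERS = 30
DIGITS = ["Shunya", "Ek", "Do", "Teen", "Chaar",
          "Paanch", "Chhah", "Saat", "Aath", "Nau"]
REPETITIONS = 5

def training_utterances(speakers=NUM_SPEAKERS,
                        vocab_size=len(DIGITS),
                        reps=REPETITIONS):
    """Total utterances = speakers x vocabulary size x repetitions."""
    return speakers * vocab_size * reps

print(training_utterances())  # 30 * 10 * 5 = 1500
```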
4 Spectral Analysis
Speech is a non-stationary signal, so to extract the spectral features of sub-phones we analyse the spectrum in successive narrow time windows about 20-25 ms wide. For reliable frequency analysis, human speech is considered to be fairly stationary over 20-25 ms windows [8]. The analysis is carried out by applying the Fast Fourier Transform (FFT) to each window, which gives the intensity of several bands on the frequency scale. After digitization and quantization of the waveform, our goal is to transform the input waveform into a sequence of acoustic feature vectors, each of which represents the information in a small time window of the signal. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features, extracted by cepstral analysis of the signal. MFCC are features based on human listening perception [8, 9]: since the human ear is not equally sensitive to all frequency bands, MFCC extraction attenuates the high-frequency components using the mel scale. The overall extraction process can be represented as the sequence of steps shown below.
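The short-time analysis just described can be sketched concretely. In this minimal NumPy illustration the sampling rate, window length and hop size are illustrative choices consistent with the 20-25 ms windows mentioned above, not values taken from the paper:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping short-time frames."""
    win = int(sample_rate * win_ms / 1000)   # e.g. 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # e.g. 160 samples
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])

def frame_spectra(frames):
    """Magnitude spectrum of each Hamming-windowed frame via the FFT."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, axis=1))

# One second of a 440 Hz tone -> 98 frames of 400 samples each at 16 kHz.
t = np.arange(16000) / 16000.0
frames = frame_signal(np.sin(2 * np.pi * 440 * t))
print(frames.shape)  # (98, 400)
```

Each row of `frame_spectra(frames)` is one short-time spectrum; within a 25 ms frame the signal is close enough to stationary for the FFT to be meaningful.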
Figure 2: MFCC extraction pipeline — continuous speech → frame blocking → windowing → FFT (spectrum S_k) → mel-frequency wrapping → cepstrum → mel cepstrum coefficients.
For obtaining MFCC features, the sample words of the database were divided into windows of 25 ms with a frame rate of 10 ms. The window used during the extraction process is the Hamming window. The Fast Fourier Transform was applied to the windowed data, followed by a bank of filters spaced logarithmically above 1000 Hz, to obtain the cepstrum coefficients. For training of the system, one set of inputs was prepared with 12 coefficients from every frame of each word, and another set with the 12 MFCC coefficients from every frame along with one energy value.
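The steps above (windowed FFT, mel filter bank, cepstrum, 12 coefficients plus energy) can be sketched in NumPy. This is an illustrative implementation, not the paper's exact filter bank: the sampling rate (16 kHz), FFT size (512) and filter count (26) are assumptions, and the triangular filters are spaced on the standard mel scale, which is approximately linear below 1 kHz and logarithmic above it:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                       n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(frame, sample_rate=16000, n_fft=512, n_ceps=12):
    """12 cepstral coefficients plus log energy for one frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    log_mel = np.log(mel_filterbank(n_fft=n_fft, sample_rate=sample_rate)
                     @ spectrum + 1e-10)
    # DCT-II of the log filter-bank energies gives the cepstrum.
    n = len(log_mel)
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(1, n_ceps + 1)[:, None])
    log_energy = np.log(np.sum(frame ** 2) + 1e-10)
    return np.concatenate([basis @ log_mel, [log_energy]])

# One 25 ms frame (400 samples at 16 kHz) of a 300 Hz tone:
frame = np.sin(2 * np.pi * 300 * np.arange(400) / 16000)
features = mfcc(frame)
print(features.shape)  # (13,)
```

The 13-dimensional vector (12 cepstral coefficients plus one energy value) matches the feature count used for the second training set described above.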
5 Multi-Layer Feed-Forward Network
Multi-Layer Feed-Forward Networks (MLFFN) are one of many types of neural network. They comprise a number of neurons connected together to form a network; the functionality of the network resides in the strengths, or weights, of the links between the neurons. Neural networks are useful for modelling the behaviour of real-world phenomena [6, 7]. Being able to model the behaviour of a phenomenon, a neural network can subsequently classify different aspects of that behaviour, recognize what is going on at the moment, diagnose whether it is correct or faulty, predict what it will do next, and if necessary respond accordingly. This paper uses a multilayer feed-forward neural network with one hidden layer. The activation function at the hidden and output layers is the sigmoid, and the network is trained with scaled conjugate gradient back-propagation with momentum. The model can be extended to include more MFCC features for analysis. The extracted features are fed as input to the network, processed by the hidden layer, and passed to the output layer. Each neuron at the output layer corresponds to one input digit, and only one neuron is activated at a time. The overall training of the network is done in multiple epochs. The input for each frame is kept in the input file in the required format. The training targets are shown in the table below, where each digit activates a different output neuron. The network can be represented as shown in the figure below.
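The architecture just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: it trains by plain batch back-propagation with a momentum term as a simpler stand-in for the scaled conjugate gradient algorithm, and the layer sizes in the usage example are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLFFN:
    """One-hidden-layer feed-forward network with sigmoid activations,
    trained by batch back-propagation with momentum (a simple stand-in
    for the scaled conjugate gradient used in the paper)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.v1 = np.zeros_like(self.W1)   # momentum buffers
        self.v2 = np.zeros_like(self.W2)

    def forward(self, X):
        self.h = sigmoid(X @ self.W1 + self.b1)
        return sigmoid(self.h @ self.W2 + self.b2)

    def train_step(self, X, T, lr=0.2, momentum=0.9):
        Y = self.forward(X)
        # Gradient of the squared error through both sigmoid layers.
        d_out = (Y - T) * Y * (1.0 - Y)
        d_hid = (d_out @ self.W2.T) * self.h * (1.0 - self.h)
        self.v2 = momentum * self.v2 - lr * (self.h.T @ d_out)
        self.v1 = momentum * self.v1 - lr * (X.T @ d_hid)
        self.W2 += self.v2
        self.W1 += self.v1
        self.b2 -= lr * d_out.sum(axis=0)
        self.b1 -= lr * d_hid.sum(axis=0)
        return float(np.mean((Y - T) ** 2))

# Sketch usage: learn XOR with 4 hidden units (sizes are arbitrary).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
net = MLFFN(2, 4, 1)
losses = [net.train_step(X, T) for _ in range(3000)]
```

The momentum term smooths the weight updates across epochs, which is the role it also plays in the configuration listed in Table 1.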
Figure 3: Structure of the multi-layer feed-forward network — speech features at the input layer, one hidden layer, and the output layer.
Digit: target output vector (one output neuron per digit)
0: 1 0 0 0 0 0 0 0 0 0
1: 0 1 0 0 0 0 0 0 0 0
2: 0 0 1 0 0 0 0 0 0 0
3: 0 0 0 1 0 0 0 0 0 0
4: 0 0 0 0 1 0 0 0 0 0
5: 0 0 0 0 0 1 0 0 0 0
6: 0 0 0 0 0 0 1 0 0 0
7: 0 0 0 0 0 0 0 1 0 0
8: 0 0 0 0 0 0 0 0 1 0
9: 0 0 0 0 0 0 0 0 0 1
Training Target
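The target scheme in the table above is plain one-hot encoding, which can be generated directly:

```python
import numpy as np

def one_hot_targets(n_classes=10):
    """One row per digit 0-9: the digit's own output neuron is 1,
    all others 0, matching the training-target table."""
    return np.eye(n_classes, dtype=int)

targets = one_hot_targets()
print(targets[3])  # digit 3 -> [0 0 0 1 0 0 0 0 0 0]
```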
Training algorithm: scaled conjugate gradient descent with momentum
Input features: 12 MFCC coefficients and energy, from short-time frames
Hidden-layer transfer function: log sigmoid
Output-layer transfer function: log sigmoid
Epochs: 1500
Learning rate: 0.01
Momentum constant: 0.9
Performance goal: 0.01
Table 1: Neural Network Configuration
Table 2: Confusion matrix (row percentages over digits 0-9) for networks with 12 MFCC features as input, for different numbers of hidden neurons (20 and 30), tested on the neutral and sad-emotion data sets. [Individual entries are not recoverable in this copy; diagonal entries give the per-digit recognition rates.]
Testing the same data on the network trained with the 12 MFCC coefficients along with energy gave better performance, with average recognition accuracy of up to 93%.
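Average accuracies of the kind quoted above can be read off a confusion matrix as the mean of its diagonal. The matrix below is a made-up 3-class example for illustration, not the paper's data:

```python
import numpy as np

def average_accuracy(confusion):
    """Mean of the per-class recognition rates, where confusion[i, j]
    is the percentage of class-i utterances recognised as class j."""
    confusion = np.asarray(confusion, dtype=float)
    return float(np.mean(np.diag(confusion)))

# Hypothetical 3-digit confusion matrix (row percentages).
example = [[95.0, 3.0, 2.0],
           [4.0, 92.0, 4.0],
           [1.0, 7.0, 92.0]]
print(average_accuracy(example))  # 93.0
```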
References
[1] Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Pearson, Second Edition.
[2] H. S. Li, J. Liu and R. S. Liu, "High Performance Mandarin Digit Speech Recognition", Journal of Tsinghua University (Science and Technology), 2000.
[3] H. S. Li, M. J. Yang and R. S. Liu, "Mandarin Digital Speech Recognition Adaptive Algorithm", Journal of Circuits and Systems, Vol. 4, No. 2, 1999.
[4] Bin Lu and Jing-Jing Su, "Research on Isolated Word Speech Recognition Based on Biomimetic Pattern Recognition", International Conference on Artificial Intelligence and Computational Intelligence, IEEE Computer Society, pp. 436-439, 2009.
[5] J. Chen, K. K. Paliwal and S. Nakamura, "Cepstrum derived from differentiated power spectrum for robust speech recognition", Speech Communication, Vol. 41, pp. 469-484, 2003.
[6] Mike Schuster and Kuldip K. Paliwal, "Bidirectional Recurrent Neural Networks", IEEE Transactions on Signal Processing, Vol. 45, No. 11, pp. 2673-2681, November 1997.
[7] K. B. Khanchandani and Moiz A. Hussain, "Emotion Recognition using Multilayer Perceptron and Generalized Feed Forward Neural Network", Journal of Scientific & Industrial Research, Vol. 68, pp. 367-371, May 2009.
[8] Eric H. C. Choi, "On Compensating the Mel-Frequency Cepstral Coefficients for Noisy Speech Recognition", 29th Australasian Computer Science Conference, 2006.
[9] Lecture Notes of Summer School on ASR-10 (2010), 5th-9th Sep 2010, Osmania University, organized by IIIT Hyderabad.