
Bangla Oral-Nasal Vowel Pairs: Acoustic Categorization and Comparative Study of Feature Extraction Methods
Shahina Haque, Tomio Takara
Abstract: Acoustic feature extraction of Bangla oral-nasal vowel pairs was carried out with the pole-zero cepstral method and the all-pole Linear Predictive Coding (LPC) method. It is observed that (1) the Bangla nasal vowel space shrinks and shifts towards the front with respect to the oral vowel space, and (2) for Bangla nasal vowels, the cepstral method extracts more spectral details, including spectral zeros, than the all-pole LPC method. The comparative study shows that although both methods extract similar speech parameters for each vowel, the cepstral method is more appropriate than the LPC method as far as the nasal vowels are concerned.

Index Terms: Cepstral method, LPC method, oral-nasal vowel, pole-zero

1 INTRODUCTION
Speech analysis is the branch of science which deals with the analysis of speech sounds, taking into consideration their method of production, modeling that production process with a suitable model, and estimating the parameters of the model. These parameters should capture all the required information underlying the speech and form the acoustic feature vector of that speech sound. There are several speech analysis methods, and the proper selection and use of a speech analysis technique greatly affects the extracted speech features. Therefore, the first task of processing a specific language should be the proper selection of a speech analysis method that procures enough information to reproduce or recognize the speech from the parameters. Depending on the language, a proper speech analysis technique has to be used; otherwise the analysis fails to procure all the required speech parameters, which may give rise to errors in speech synthesis or recognition.

Bangla is a language of more than 150 million people of Bangladesh and one of the most widely spoken languages of the world. All 7 vowels in Bangla have corresponding nasal counterparts [1], and nasalization of a vowel changes the meaning of some words in Bangla. Nasality introduces poles and zeros in the spectrum of a nasal vowel [2]. The contrast lies in the spectra of nasal and oral vowels, as shown in Fig. 1, where it can be seen how the spectrum of /ĩ/ differs from the spectrum of /i/. The spectrum of /ĩ/ contains both poles and zeros. Therefore, an appropriate technique should be used that can procure the information of both the poles and the zeros of the spectrum of /ĩ/.

Research on oral-nasal vowels has been carried out for other languages [3], [4]. Research on Bangla speech analysis was first reported in 1976 [5]. Some other works address Bangla oral vowel analysis and its representation in a vowel space, as well as synthesis and recognition [6], [7], [8], [9], [10], [11]. Researchers are still working on many unexplored areas of Bangla language processing, but there is little research on the acoustic details of Bangla nasal vowels. Therefore, we aim to explore Bangla oral-nasal vowel analysis in detail. The objectives of our study are: (1) to extract the speech parameters of Bangla oral-nasal vowel pairs by the selected methods; (2) to study the extracted parameters and evaluate whether they are within the theoretical limits; (3) to compare the parameters obtained for the oral-nasal vowel pairs; and (4) to perform a comparative study to evaluate which method of analysis is appropriate for the Bangla language.

This paper is organized as follows: Section 2 describes the process of procuring the experimental data used in our work. Section 3 discusses the theory of the LPC and cepstral methods of speech analysis, the results obtained by applying each technique to the selected data, and a discussion of those results. Section 4 presents the comparative study of the results obtained by the two methods. Section 5 concludes the paper.

S. Haque is with the Department of Electronics and Telecommunication Engineering, Daffodil International University, Dhaka, Bangladesh.
T. Takara is with the Faculty of Information Engineering, University of the Ryukyus, Okinawa, Japan.


2 SPEECH MATERIALS

The aim of this section is to describe how the speech samples were acquired. The experimental part consists of recording each of the isolated Bangla oral vowels /i/, /e/, /æ/, /a/, /ɔ/, /o/, /u/ and their nasal counterparts /ĩ/, /ẽ/, /æ̃/, /ã/, /ɔ̃/, /õ/, /ũ/ at a normal speaking rate, three times, in a quiet room, by three male native Bangla speakers (aged around 27 years), on a DAT tape at a sampling rate of 48 kHz with 16-bit resolution. The best of the three speakers' voices and the best speech sample were chosen for our work. These digitized speech sounds are then downsampled to 10 kHz and normalized for the purpose of analysis.
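As a rough, hypothetical illustration of this preparation step (not the authors' actual tooling), the short Python sketch below uses NumPy and SciPy to downsample a recorded vowel from 48 kHz to 10 kHz and peak-normalize it; the file names are placeholders.

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

# Hypothetical input: an isolated vowel recorded at 48 kHz, 16-bit PCM.
fs_in, x = wavfile.read("bangla_vowel_i.wav")      # placeholder file name
x = x.astype(np.float64)

# Downsample 48 kHz -> 10 kHz (48000 / 10000 = 24 / 5).
y = resample_poly(x, up=5, down=24)
fs_out = 10000

# Peak-normalize to [-1, 1] before analysis.
y /= np.max(np.abs(y)) + 1e-12

wavfile.write("bangla_vowel_i_10k.wav", fs_out, (y * 32767).astype(np.int16))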

3 SPEECH ANALYSIS TECHNIQUES

In this section, the cepstral and LPC methods, which analyze the speech signal and estimate the parameters useful for the given speech processing application, are discussed.

3.1 Preprocessing of the Speech Signal

The speech signal is non-stationary in nature, but it can be assumed to be stationary over short durations, called frames, obtained by windowing for the purpose of analysis. The speech signal is analyzed frame-wise, with a frame rate of 50-100 frames/sec, and for each frame the duration of the speech segment is taken to be 20-30 ms. A new frame is obtained by shifting the windowing function, typically by 10 ms, to a subsequent time. After normalization and windowing, the speech samples are ready to be used for analysis.

3.2 Cepstral Analysis Technique

The cepstral method represents the vocal tract by a pole-zero digital filter characterized by the cepstral coefficients, which makes it suitable for characterizing all kinds (oral, nasal, fricative, etc.) of speech signals. The observed speech signal x[n] is the convolution of the excitation source signal u[n] and the vocal-tract impulse response v[n] in the time domain, x[n] = u[n] * v[n], or equivalently the product of the excitation and system spectra in the frequency domain, X(\omega) = U(\omega) V(\omega). Taking the logarithm splits the product of the two spectra into a summation, so the cepstrum of the speech sequence is the sum of the vocal-tract cepstrum and the glottal excitation cepstrum. The cepstrum c[n] is the inverse Fourier transform of the short-time logarithmic amplitude spectrum X(\omega) of the speech waveform [12], as given by Eq. 1:

c[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log |X(\omega)| \, e^{j\omega n} \, d\omega    (1)

From the high-quefrency part the pitch period is obtained, while the first few cepstral coefficients of the low-quefrency part contain the characteristics of the vocal tract.

Analysis, Results and Discussion: In the analysis phase, the speech wave is segmented into frames of 25.6 ms, and a Hamming window of the same length as the frame is used. The frame shift is 10 ms. We used the first 30 cepstral coefficients to represent the formant information; the first cepstral coefficient is the power content of the signal. The voiced/unvoiced decision, the pitch period and the number of frames are also extracted from the analysis. The parameters obtained by the cepstral analysis of the Bangla oral-nasal vowel pairs are tabulated in Table 1. The cepstrum and smoothed spectrum obtained by this method are shown in Fig. 1 for /i/. The average pitch is found to be 132.3 Hz for the oral vowels and 141.6 Hz for the nasal vowels, which is within the theoretical limit of 90-200 Hz for a male speaker. The low vowels /a/ and /ɔ/ are observed to have lower pitch than the other vowels, and the pitch of each nasal vowel is higher than that of its oral counterpart. In the cepstral spectra of the vowel /i/ shown in Fig. 1, the resonant peaks and antiresonances are clearly visible. The formants are extracted and tabulated in Table 1. The extracted formant frequencies F1 and F2 of both the oral and the nasal vowels are plotted to obtain the oral and nasal vowel spaces for Bangla, as shown in Fig. 2. This vowel chart signifies the place and manner of articulation for each Bangla vowel phonation. As can be seen from the figure, the nasal vowel space shrinks and shifts towards the front with respect to the oral vowel space.

Fig. 1. Cepstrum of /i/ and smoothed amplitude cepstral spectra of /i/ and /ĩ/
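To make the framing, windowing and cepstrum computation described above concrete, the following is a minimal Python/NumPy sketch, not the authors' implementation: 25.6 ms Hamming-windowed frames with a 10 ms shift, the real cepstrum of Eq. 1 computed as the inverse FFT of the log-magnitude spectrum, a smoothed envelope from the first 30 low-quefrency coefficients, and a pitch estimate from the strongest peak in the 90-200 Hz quefrency range. The FFT length and the small regularization constants are assumptions.

import numpy as np

def frame_cepstrum(frame, n_fft=1024):
    """Real cepstrum of one windowed frame (Eq. 1): inverse FFT of log|X(w)|."""
    log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)
    return np.fft.irfft(log_mag, n_fft)

def cepstral_analysis(signal, fs=10000, frame_ms=25.6, shift_ms=10.0, n_ceps=30):
    """Frame-wise cepstral envelope and pitch, following Sections 3.1 and 3.2."""
    frame_len = int(fs * frame_ms / 1000)          # 256 samples at 10 kHz
    shift = int(fs * shift_ms / 1000)              # 100 samples
    window = np.hamming(frame_len)
    pitches, envelopes = [], []
    for start in range(0, len(signal) - frame_len, shift):
        frame = signal[start:start + frame_len] * window
        c = frame_cepstrum(frame)
        # Vocal-tract envelope: keep only the low-quefrency coefficients.
        lifter = np.zeros_like(c)
        lifter[:n_ceps] = c[:n_ceps]
        lifter[-(n_ceps - 1):] = c[-(n_ceps - 1):]  # symmetric counterpart
        envelopes.append(20 / np.log(10) * np.fft.rfft(lifter).real)  # envelope in dB
        # Pitch: strongest high-quefrency peak, searched in the 90-200 Hz range.
        lo, hi = int(fs / 200), int(fs / 90)
        peak = lo + np.argmax(c[lo:hi])
        pitches.append(fs / peak)
    return np.array(pitches), np.array(envelopes)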


3.3 LPC Analysis Technique

LPC analysis decomposes the digitized speech signal into its excitation source (the fundamental frequency F0 and its amplitude, i.e. the loudness of the source), while the vocal tract is represented as an all-pole filter modeled by a number of coefficients, the count of which is known as the LPC order. This system is excited by an impulse train for voiced speech or by a random noise sequence for unvoiced speech. Thus, the parameters of this model are: the voiced/unvoiced classification, the pitch period for voiced speech, the gain parameter G, and the coefficients {a_k} of the digital filter. Eq. 2 expresses the transfer function of the filter model in the z-domain, where V(z) is the vocal tract transfer function, G is the gain of the filter, and {a_k} is the set of autoregression coefficients called the Linear Prediction Coefficients. The upper limit of the summation, p, is the order of the all-pole filter.

V(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (2)

Analysis, Results and Discussion: The parameters (pitch, voiced/unvoiced decision, gain and the all-pole LPC coefficients that characterize the formants) obtained by the 12th-order LPC analysis of the Bangla oral-nasal vowel pairs are tabulated in Table 1. The average pitch is found to be 129.2 Hz for the oral vowels and 137.4 Hz for the nasal vowels, which is within the theoretical limit of 90-200 Hz for a male speaker.

The low vowels /a/ and /ɔ/ are observed to have lower pitch than the other vowels, and the pitch of each nasal vowel is higher than that of its oral counterpart. The 12th-order LPC spectrum obtained by the LPC method is shown in Fig. 4 for the vowel /i/. The resonant peaks are clearly visible in the LPC spectrum, and the formants are extracted by a peak-picking method. The extracted formant frequencies F1 and F2 of both the oral and the nasal vowels are plotted to obtain the oral and nasal vowel spaces for Bangla, as shown in Fig. 2. As can be observed from the vowel space of Fig. 2 obtained by the LPC method, the nasal vowel space shrinks and shifts towards the front compared to the oral vowel space.
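The sketch below (again Python/NumPy, not the authors' code) illustrates the 12th-order LPC analysis of Eq. 2: prediction coefficients from the autocorrelation method via the Levinson-Durbin recursion, followed by a simple local-maximum search on the all-pole spectrum that roughly mirrors the peak-picking described above. The FFT length and the 200 Hz lower bound on formant candidates are assumptions.

import numpy as np

def lpc_levinson(frame, order=12):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-9                        # tiny regularization for silent frames
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update using the previous coefficients
        a[i] = k
        err *= 1.0 - k * k
    return a, np.sqrt(err)                   # prediction polynomial and gain G

def lpc_formants(a, gain, fs=10000, n_fft=1024, n_formants=2):
    """Pick formants as local maxima of the all-pole LPC spectrum (Eq. 2)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    spec_db = 20 * np.log10(np.abs(gain / np.fft.rfft(a, n_fft)) + 1e-12)
    peaks = [i for i in range(1, len(spec_db) - 1)
             if spec_db[i - 1] < spec_db[i] > spec_db[i + 1] and freqs[i] > 200]
    return freqs[peaks[:n_formants]]

# Example use on one Hamming-windowed 25.6 ms frame of a vowel at 10 kHz:
# a, g = lpc_levinson(frame, order=12)
# f1, f2 = lpc_formants(a, g)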

4 COMPARATIVE STUDY AND RESULTS


Speech processing is the extraction of speech parameters from the speech signal for convenient representation. The ultimate aim of studying Bangla vowels is to support complete Bangla-based computer speech processing. In this work, Bangla vowels are analyzed using the two methods, cepstral and LPC, and the extracted parameters given in Table 1 are compared. The results obtained from the study are discussed below.

4.1 Extracted Speech Parameters


Pitch: Pitch is the rate at which the vocal folds vibrate. The average value of pitch is found to be 132.3 Hz and 129.3 Hz for the oral vowels by the cepstral and LPC methods respectively, and 141.6 Hz and 137.5 Hz for the nasal vowels by the cepstral and LPC methods. These values are within the theoretical range of 90 Hz to 200 Hz for a male speaker. The pitch values obtained by the LPC and cepstral methods are acceptably comparable for each vowel. The pitch of the low vowels /a/ and /ɔ/ seems to be lower than the pitch of the high vowels, and the pitch of each nasal vowel is higher than that of its oral counterpart.

Formants: The resonance frequencies of the vocal tract are the formants. Usually the first two formants are the main spectral cues for speech perception; therefore we extracted the first and second formant frequencies (F1 and F2). Acceptably comparable values are obtained by the two methods. The evaluation of the formant parameters was done by constructing a vowel space from F1 and F2 and comparing it to the English vowel space. The oral vowel space obtained by plotting the extracted F1 and F2 shows a formant pattern similar to that of the English vowels /i/, /e/, /a/, /o/ and /u/, as shown in Fig. 3.

Fig. 2: Bangla oral-nasal vowel space obtained by the cepstral and LPC methods


TABLE 1
EXTRACTED ORAL-NASAL VOWEL SPEECH PARAMETERS OBTAINED BY THE CEPSTRAL AND LPC METHODS

                Cepstral method                 LPC method
Vowel     Pitch (Hz)  F1 (Hz)  F2 (Hz)    Pitch (Hz)  F1 (Hz)  F2 (Hz)

Oral vowels
/i/          122        312     2304         120        293     2275
/e/          143        430     2226         139        479     2129
/æ/          143        780     1914         139        762     1875
/a/          115        859     1171         113        771     1162
/ɔ/          113        742      937         113        625      898
/o/          145        430      700         139        459      723
/u/          145        312      664         142        303      752

Nasal vowels
/ĩ/          147        390     2500         143        391     2617
/ẽ/          150        585     2343         145        508     2422
/æ̃/          154        742     1601         150        742     1680
/ã/          122        625     1054         119        664     1055
/ɔ̃/          118        585      820         115        625      859
/õ/          150        507      898         145        547      826
/ũ/          150        390      820         145        430      898

Fig. 4: Cepstral and LPC spectra of the vowel /i/
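As an illustration of how a vowel chart like Fig. 2 can be redrawn from Table 1, a short hypothetical matplotlib sketch using the cepstral-method F1/F2 values (and the vowel symbols assumed in the table above) is given below; it is not the authors' plotting code.

import matplotlib.pyplot as plt

# Cepstral-method F1/F2 values (Hz) taken from Table 1.
oral = {"i": (312, 2304), "e": (430, 2226), "æ": (780, 1914), "a": (859, 1171),
        "ɔ": (742, 937), "o": (430, 700), "u": (312, 664)}
nasal = {"ĩ": (390, 2500), "ẽ": (585, 2343), "æ̃": (742, 1601), "ã": (625, 1054),
         "ɔ̃": (585, 820), "õ": (507, 898), "ũ": (390, 820)}

fig, ax = plt.subplots()
for label, (f1, f2) in oral.items():
    ax.scatter(f2, f1, color="tab:blue")
    ax.annotate(label, (f2, f1))
for label, (f1, f2) in nasal.items():
    ax.scatter(f2, f1, color="tab:red", marker="x")
    ax.annotate(label, (f2, f1))
ax.invert_xaxis()   # front vowels (high F2) on the left, as in a conventional vowel chart
ax.invert_yaxis()   # high vowels (low F1) at the top
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
ax.set_title("Bangla oral and nasal vowel space (cepstral method, Table 1)")
plt.show()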

4.2 Appropriate Method of Feature Extraction for Bangla Language

Bangla has a nasal counterpart for every one of its oral vowels. A nasal vowel's spectrum contains both poles and zeros, which are the main spectral cues of nasality. Therefore, we should choose a speech analysis method that can capture the detailed information contained in the nasal vowel. From our study, as shown in Fig. 4, we observe that the pole-zero cepstral method captures more spectral details, including spectral zeros, than the all-pole LPC method. Therefore, since Bangla has nasal vowels, we may say that the pole-zero cepstral method is more appropriate for extracting nasal vowel parameters than the all-pole LPC method.
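Putting the earlier sketches together, one hypothetical way to visualize this kind of comparison for a single frame is shown below. It reuses the frame_cepstrum and lpc_levinson helpers defined in the previous sketches and builds a crude synthetic pole-zero frame instead of a recorded vowel, so the pulse rate, filter frequencies and radii are purely illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import lfilter

# Reuses frame_cepstrum() and lpc_levinson() from the earlier sketches.
fs, n_fft, n_ceps = 10000, 1024, 30

def resonator(freq_hz, radius):
    # Second-order polynomial with a conjugate pole or zero pair at freq_hz.
    return np.array([1.0, -2.0 * radius * np.cos(2 * np.pi * freq_hz / fs), radius ** 2])

# Crude synthetic "nasal-like" frame: 130 Hz pulse train through a filter with
# resonances near 300 Hz and 2300 Hz and an antiresonance (zero) near 1000 Hz.
excitation = np.zeros(256)
excitation[::77] = 1.0
b = resonator(1000, 0.95)                                         # spectral zero
a_vt = np.polymul(resonator(300, 0.97), resonator(2300, 0.97))    # spectral poles
frame = lfilter(b, a_vt, excitation) * np.hamming(256)

freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

# Cepstrally smoothed envelope: can retain the dip caused by the zero.
c = frame_cepstrum(frame, n_fft)
lifter = np.zeros_like(c)
lifter[:n_ceps], lifter[-(n_ceps - 1):] = c[:n_ceps], c[-(n_ceps - 1):]
ceps_env_db = 20 / np.log(10) * np.fft.rfft(lifter).real

# All-pole LPC envelope: models resonant peaks only.
a_lpc, gain = lpc_levinson(frame, order=12)
lpc_env_db = 20 * np.log10(np.abs(gain / np.fft.rfft(a_lpc, n_fft)) + 1e-12)

plt.plot(freqs, ceps_env_db, label="cepstral envelope")
plt.plot(freqs, lpc_env_db, label="12th-order LPC envelope")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude (dB)")
plt.legend()
plt.show()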

5 CONCLUSIONS
From our study of the extracted parameters, we have observed that: (1) the speech parameters obtained by both methods have comparable values for both oral and nasal vowels; (2) the pitch of the extracted oral and nasal vowels is within the theoretical limits; (3) the nasal vowel space obtained by both methods has a tendency to shrink and shift towards the front with respect to the corresponding oral vowel space; and (4) from the comparative study of the parameters obtained by the two analysis methods, the pole-zero cepstral method procures more spectral details for Bangla than the all-pole LPC method. Therefore, for Bangla, the pole-zero cepstral method is more appropriate for parameter extraction than the all-pole LPC method.

Fig. 3: Bangla oral vowel space obtained by analysis (cepstral and LPC methods), together with the standard Bangla and standard English oral vowel spaces


Our future work will be to build a standard oral and nasal vowel space for Bangla from a large amount of data, and then to synthesize the oral-nasal vowel pairs in order to evaluate which of the two methods is better for Bangla synthesis. The work may be further extended to the analysis of other speech units, storing the speech parameters in a database for further work on synthesis, recognition or coding.

REFERENCES
[1] M. A. Hai, Dhvani-Vignan O Bangla Dhvani Tattwa, June 1985 (in Bangla).
[2] J. R. Deller and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, IEEE Press, 2000.
[3] S. Hawkins and K. N. Stevens, "Acoustic and Perceptual Correlates of the Non-nasal-Nasal Distinction for Vowels," J. Acoust. Soc. Am., vol. 77, no. 4, pp. 1560-1575, April 1985.
[4] M. Y. Chen, "Acoustic Correlates of English and French Nasalized Vowels," J. Acoust. Soc. Am., vol. 102, no. 4, pp. 2360-2370, October 1997.
[5] M. Paramanik and K. Kido, "Bengali Speech: Formant Structure of Single Vowels and Initial Vowels of Words," IEEE International Conference on ICASSP, Japan, 1976 (conference proceedings).
[6] S. A. Hossain, M. L. Rahman, and F. Ahmed, "Acoustic Classification of Bangla Vowels," World Academy of Science, Engineering and Technology, vol. 26, 2007.
[7] S. A. Hossain, M. L. Rahman, and F. Ahmed, "Acoustic Space of Bangla Vowels," Proc. of the WSEAS 5th International Conference on Speech and Image Processing, Corfu, Greece, August 2005, pp. 138-142 (conference proceedings).
[8] S. A. Hossain, "Analysis and Synthesis of Bangla Phonemes for Computer Speech Recognition," Ph.D. dissertation, Department of Computer Science and Engineering, University of Dhaka, February 2008.
[9] S. Haque and T. Takara, "Nasality Perception of Vowels in Different Language Background," Proceedings of INTERSPEECH 2006 - ICSLP, pp. 869-872, 17-21 September 2006, Pittsburgh, Pennsylvania, USA (conference proceedings).
[10] S. Haque and T. Takara, "Rule Based Speech Synthesis by Cepstral Method for Standard Bangla," Proceedings of the 18th International Congress on Acoustics, ICA 2004, 4-9 April 2004, Th.P3.19, IV-3341, Kyoto, Japan (conference proceedings).
[11] S. Haque and T. Takara, "Recognition and Synthesis of Bangla Oral-Nasal Vowel Pairs," Proceedings of the Annual Meeting of the Acoustical Society of Japan, March 2003, Tokyo, Japan, pp. 437-438, 2Q-32 (conference proceedings).
[12] S. Furui, Digital Speech Processing, Synthesis, and Recognition, Second Edition, Marcel Dekker, Inc., 2001.

Shahina Haque received her B.Sc. and M.Sc. degrees in applied physics and electronics from Rajshahi University, Bangladesh. She joined the Bangladesh Atomic Energy Commission as a scientific officer in 1999. Since 1999 she was affiliated with the Department of Computer Science and Technology, Islamic University, Kushtia, Bangladesh. Since 2001 she studied at the Graduate School of Engineering and Science, University of the Ryukyus. Since 2008 she has been with Daffodil International University, Shukrabad, Dhanmondi, Dhaka, Bangladesh. Her current research interests are speech, image and biomedical signal processing.

Tomio Takara received his B.S. degree in physics from Kagoshima University, Kagoshima, Japan, in 1976, and his M.E. degree and Dr. Eng. degree in Information Processing from Tokyo Institute of Technology, Tokyo, Japan, in 1979 and 1983, respectively. He joined the University of the Ryukyus, Okinawa, Japan, in 1983 as an Assistant Professor and was promoted to Associate Professor in 1988 and Professor in 1995. Since then he has been a Professor in the Department of Information Engineering, Faculty of Engineering. He has been a director of the computer and networking center of the university since 2002. During 1991-1992 he studied at Carnegie Mellon University as a visiting scientist. Dr. Takara is a member of IEICE, ASJ, ISJ, JSAI, IEEJ, and IEEE. He is the vice president of the Kyushu branch of ASJ and a director of IEEE Fukuoka. He is the recipient of the 1990 Okinawa Society Award for Encouragement of study on Okinawa. He is presently interested in spoken language processing and machine intelligence.
