
International Journal of Advances in Engineering & Technology (IJAET), Jan 2012. ISSN: 2231-1963

ESTIMATION AND OPTIMIZATION OF PROSODIC PARAMETERS TO IMPROVE THE QUALITY OF ARABIC SYNTHETIC SPEECH
Abdelkader CHABCHOUB & Adnen CHERIF
Signal Processing Laboratory, Science Faculty of Tunis, 1060, Tunisia

ABSTRACT

Prosody modeling has been extensively applied in speech synthesis, simply because every speech synthesis system needs to generate the prosodic properties of speech in order to produce natural and intelligible synthetic speech. This paper introduces a new technique for predicting a deterministic prosodic target at an early stage, which relies on probabilistic models of the F0 contour and can also predict segmental duration. The paper also proposes a method that searches for the optimal unit sequence by maximizing a joint likelihood at both the segmental and prosodic levels. This method has been successfully applied to the analysis corpus used to develop the Arabic prosody database, which is itself the input of the Arabic speech synthesizer. Extensive objective and subjective evaluation shows a marked improvement in Arabic prosodic quality.

KEYWORDS: Segmental duration, pitch, predictive, prosodic model, neural network, speech synthesis, Arabic speech.

I. INTRODUCTION

Generating natural-sounding prosody is a central challenge in text-to-speech synthesis (TTS), a technology that nowadays enables computers to talk and assists people in learning languages. While existing synthesis techniques produce speech that is intelligible, few people would claim that listening to computer speech is natural or expressive. Therefore, in recent years, research in speech synthesis has been directed more towards improving the intelligibility and naturalness of synthetic systems, in order to achieve better overall quality, tone of voice and intonation [1] [2]. In several systems, producing a voice of good quality still requires extensive research before such systems can be used more widely. For the Arabic language, linguistic and prosodic processing [3] is essential for synthesis quality. We therefore introduce a processing stage based on the modification of Arabic prosody (optimization of the pitch and prediction of duration), trained to improve the new Arabic voice. From the phonetic point of view, this means processing the prosodic parameters defined by the fundamental frequency (F0), the segmental duration and the intensity. Modeling these parameters is the main target of our research work, which concentrates essentially on the fundamental frequency and the duration [4]. This paper is organized as follows. Section 2 describes the Arabic prosody database and the corpus used in the study, and presents the list of phonemes and the corresponding acoustic parameters for each phoneme (duration and F0). These values feed the module that optimizes the prosodic parameters (pitch and duration), presented in Section 3. Section 4 presents
the results and evaluation of the algorithm as well as the implementation of the speech synthesis system.

II. DATABASE OF ARABIC SPEECH PROSODY

The quality of a speech synthesis system depends on the intelligibility and naturalness of the speech it generates, hence the need to generate quality prosody. Our database has been developed to improve the quality of Arabic synthetic speech with MBROLA [5]. The fundamental idea is to create a speech corpus consisting of phone-sequence phonemic/prosodic context combinations that forms a specially structured subset of the set of all such combinations, and then to use Arabic prosody transplantation [6]. The modules are cascaded in the order Phonetisation-Duration-Pitch. The input is a pair consisting of a speech signal file and a time-aligned phonemic annotation, followed by phoneme validation (SAMPA code), identification of the voiced and unvoiced frames (V/NV), duration extraction, pitch extraction [7] and, finally, prosodic modification/optimization. The results of this algorithm are the entries of our Arabic prosodic database. The main data flow steps are shown in Figure 1, which represents the generation of the database.
[Figure 1 shows the data flow: Original speech → Automatic annotation and segmentation → V/NV classification → Duration and pitch measurement → Prosodic modification and optimization → Arabic prosodic database.]

Figure 1. Arabic prosodic database generation with duration prediction and pitch optimization algorithm.

2.1. Description of the analysis corpus used


The corpus we used to build our database is composed of 120 sentences, with an average of 5 words per sentence. These sentences contain in total 1296 syllables and 3240 phonemes, including short vowels, long vowels and semi-vowels, fricative, plosive and liquid consonants, and nasal consonants. Breaks were marked with "_" in the text corresponding to the natural voice. The sentences were read at an average speed (10 to 12 phonemes/second) by a speaker who did not receive any specific instruction, in order to avoid any influence that could affect spontaneity. The corpus was recorded at a 16 kHz sampling rate with 16-bit encoding.

2.2. Segmentation and labeling of the corpus


The continuous speech corpus has been segmented and labeled by a semi-automatic procedure, which involves the following steps [12]:

Step 1: Manual phonetic transcription of each sentence using the SAMPA transcription system.
Step 2: Automatic segmentation with Praat.

2.3. Automatic segmentation of the corpus.


The extraction of pitch is an important step. For each phoneme duration (in ms), we extract the pitch at several positions; these values become the parameters of the input file for MBROLA. The result is a robust and accurate pitch extraction algorithm that provides good-quality synthetic speech.
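As an illustration, the sketch below builds MBROLA-style .pho lines from a phoneme's duration and a few pitch samples. The phoneme labels and values are hypothetical, and the layout (phoneme, duration in ms, then pairs of position in percent and F0 in Hz) follows the standard MBROLA input format rather than the authors' exact script.

```python
def pho_line(phoneme, duration_ms, pitch_points):
    """Format one MBROLA .pho line: phoneme, duration (ms),
    then (position %, F0 Hz) pairs taken at several positions."""
    pairs = " ".join(f"{int(pos)} {int(f0)}" for pos, f0 in pitch_points)
    return f"{phoneme} {int(duration_ms)} {pairs}".strip()

# Hypothetical values for illustration only.
print(pho_line("a", 104, [(20, 120), (80, 133)]))   # -> "a 104 20 120 80 133"
print(pho_line("_", 387, []))                        # pause: no pitch targets
```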

2.4. Identification of the voiced and unvoiced frames


The automatic segmentation of a speech signal is used to identify the voiced and unvoiced frames. This classification is based on the zero-crossing rate and the energy value of each signal frame, as sketched below.
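The following is a minimal sketch of such a frame-wise voiced/unvoiced decision using NumPy; the frame length, hop size and thresholds are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def vuv_frames(signal, frame_len=320, hop=160, energy_thr=0.01, zcr_thr=0.25):
    """Label each frame voiced (True) or unvoiced (False) from
    short-time energy and zero-crossing rate."""
    labels = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # crossings per sample
        labels.append(energy > energy_thr and zcr < zcr_thr)   # voiced: high energy, low ZCR
    return np.array(labels)

# Illustrative usage with a synthetic 16 kHz signal.
fs = 16000
t = np.arange(fs) / fs
voiced_like = 0.5 * np.sin(2 * np.pi * 150 * t)   # periodic, low ZCR
print(vuv_frames(voiced_like).mean())              # close to 1.0 (mostly voiced)
```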
Figure 2. Voiced zones of an Arabic sentence ("door"). From top to bottom: speech signal, zero-crossing rate, energy, and voiced-zone decision (time in seconds). In this automatic segmentation, the interval [0.4, 0.75] s, for example, is voiced: it corresponds to a low zero-crossing rate and a high energy.
Figure 3. Unvoiced zones of an Arabic sentence ("sun"). From top to bottom: speech signal, zero-crossing rate, energy, and voiced-zone decision (time in seconds). In this automatic segmentation, the interval [0.65, 0.8] s, for example, is unvoiced: it corresponds to a high zero-crossing rate and a low energy.

2.5. Duration and pitch extraction


The extraction of duration and pitch is the next step. Copying the phonemes and their durations from the annotation file and measuring the pitch values from the original recording of a human utterance allows best-case speech synthesis. To extract pitch from the recordings, a Praat script called max-pitch was implemented as in [8]. This script goes through the Sound and TextGrid files in a directory, opens each
pair of Sound and TextGrid files, calculates the pitch maximum of each labeled interval, and saves the results to a text file [9]. Running this script raised a problem, and some modifications to the script were made. The inputs to this script are WAV files and TextGrid annotation files. The Praat pitch extraction script produces one TXT file with the pitch values of all the phonemes of all the files in the directory. The output pitchresults.txt file contains the following information:
1. The file names of the files in the directory.
2. The labels.
3. The maximum pitch values of the labeled intervals, in Hz.
The results for one file in the directory are shown in the following example, a .pho file automatically extracted by the Praat script:
_ 387
s 90 ...
a 104 ...
d 118 ...
i 77 ...
q 125 ...
i 63 ...
l 103 ...
H 71 ...
a 75 ...
z 116 ...
i 155 ...
z 152 ...
(each line gives the phoneme, its duration in ms, followed by pairs of pitch position (%) and F0 value (Hz))
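For readers who want to reproduce this step in Python rather than with a native Praat script, the following is a minimal sketch using the praat-parselmouth bindings (an assumption; the paper uses a Praat script). The interval list stands in for the TextGrid tiers, and the file name is hypothetical.

```python
import parselmouth
from parselmouth.praat import call

def max_pitch_per_interval(wav_path, intervals):
    """Return (label, max F0 in Hz) for each labeled interval,
    mirroring Praat's 'Get maximum' query on a Pitch object."""
    sound = parselmouth.Sound(wav_path)
    pitch = sound.to_pitch()                      # default autocorrelation-based analysis
    results = []
    for label, t_start, t_end in intervals:
        f0_max = call(pitch, "Get maximum", t_start, t_end, "Hertz", "Parabolic")
        results.append((label, f0_max))
    return results

# Hypothetical annotation: (SAMPA label, start s, end s).
intervals = [("s", 0.10, 0.19), ("a", 0.19, 0.29), ("d", 0.29, 0.41)]
for label, f0 in max_pitch_per_interval("utterance.wav", intervals):
    print(f"{label}\t{f0:.1f} Hz")
```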

III. ARABIC PROSODIC MODELLING

3.1. Prediction models of segmental duration


Studies on the automatic generation of segmental duration have evolved considerably in recent years. The model proposed in this paper is based on two basic ingredients: linear prediction and neural networks. The model of W. N. Campbell assumes that the temporal organization of an utterance is made at a level higher than that of the phoneme. Two stages are distinguished in the implementation of this model: the first is the prediction of syllable durations and the second is the prediction of the phoneme durations within each syllable. Syllable durations are predicted by an automatic learning process. Neural networks are used for this learning because it is assumed that they can learn the fundamental interactions between contextual effects, which represent the rule-governed behaviour implicit in the data. If the networks can encode these fundamental interactions, then they should also handle data not previously encountered [10]. Regarding the segmental durations, their distribution is described by an elongation coefficient (deviation from the mean). Campbell suggested that all the phonemes of one syllable share the same elongation factor z, the z-score. The z-score of each phonemic realization of the study corpus is calculated by:

z_{realisation} = \frac{duree_{observee}(realisation_{phoneme}) - \mu_{phoneme}}{\sigma_{phoneme}}    (1)

where \mu_{phoneme} and \sigma_{phoneme} are the mean and standard deviation obtained from the absolute durations of the realizations of each phoneme in the corpus. Thus, once each phonetic realization has been normalized using the z-score (mean = 0 and standard deviation = 1), the durations of the syllables are determined by the neural network [11]. Moreover, the model calculates the z-score associated with each syllable by solving the following equation:
Duree(syllabe) = \sum_{i=1}^{n} \exp(\mu_i + z\,\sigma_i)    (2)

The sum runs over the phonemic elements of the syllable, z is the z-score associated with that syllable, and the pair (\mu_i, \sigma_i) contains the mean and standard deviation associated with phoneme i, obtained from the logarithms of the durations (in milliseconds) of the realizations of this phoneme in the corpus. Thus, the duration of each phoneme of the syllable is calculated using equation (3).

Duree(phoneme_i) = \exp(\mu_i + z\,\sigma_i)    (3)
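As an illustration of equations (1)-(3), here is a minimal numerical sketch of the z-score duration model. It assumes that the per-phoneme log-duration statistics and the syllable z-score (which the paper obtains from a neural network) are already available; the statistics and the z value below are invented for the example.

```python
import math

# Hypothetical per-phoneme statistics (mean, std) of log-durations in ms,
# as would be estimated from the analysis corpus.
log_stats = {"s": (4.45, 0.22), "a": (4.60, 0.25), "d": (4.30, 0.20)}

def phoneme_duration(phoneme, z):
    """Equation (3): duration of one phoneme for a syllable-level z-score."""
    mu, sigma = log_stats[phoneme]
    return math.exp(mu + z * sigma)

def syllable_duration(phonemes, z):
    """Equation (2): syllable duration as the sum of its phoneme durations."""
    return sum(phoneme_duration(p, z) for p in phonemes)

# Example: the syllable /sad/ with a z-score of +0.5 (slightly lengthened).
z = 0.5
print({p: round(phoneme_duration(p, z), 1) for p in "sad"})
print(round(syllable_duration("sad", z), 1))
```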

3.2. F0 Prediction Module Based on a Neural Network

Neural networks provide a good solution for problems involving strong non-linearity between input and output parameters, and also when the quantitative mechanism of the mapping is not well understood. The use of neural networks in prosodic modeling has been reported in [13] and [14], but those methods do not make use of a model to limit the degrees of freedom of the problem, and additional care must be taken to account for the continuity of F0 contours (for example by using recurrent networks). In the proposed model, the continuity and basic shape of the F0 contours are ensured by the F0 model [15][16]. In this paper, three types of neural network structures are evaluated: the multi-layer perceptron (MLP), the Jordan network (a structure with feedback from the output elements) and the Elman network (a structure with feedback from the hidden elements). The latter two structures are partial recurrent networks and are tested here in order to account for the mutual influence of neighbouring accentual phrases. All structures have a single hidden layer containing either 10 or 20 elements. For the experiments, we used the SNNS neural network simulation software [17]. The results of F0 contour prediction on the test data set are shown in Figure 4. Figure 6 shows the pitch contours of an original and a synthetic utterance produced with our system.
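The following is a minimal sketch of this kind of F0 predictor, using scikit-learn's MLPRegressor as a stand-in for the SNNS multi-layer perceptron described above; the context features and training data are invented placeholders, and the recurrent (Jordan/Elman) variants are not shown.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical context features per accentual phrase
# (e.g. position in sentence, number of syllables, previous F0 target),
# and the F0 target value to predict (in Hz).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 120 + 40 * X[:, 0] - 20 * X[:, 1] + rng.normal(scale=2.0, size=200)

# Single hidden layer of 10 units, as in the evaluated MLP structure.
mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(X[:150], y[:150])
rmse = np.sqrt(np.mean((mlp.predict(X[150:]) - y[150:]) ** 2))
print("test RMSE (Hz):", round(rmse, 2))
```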

Figure 4. Evaluation of the fundamental frequency F0 of an Arabic phrase. From top to bottom: voice signal, F0 obtained by the autocorrelation method, F0 obtained by the spectral method, annotation segments with the average F0 value per syllable, and F0 estimation by MOMEL.

IV. RESULTS AND EVALUATION

4.1. Implementation of prosodic values into MBROLA


The MBROLA synthesis system is multilingual, but it was originally designed around the phonotactic characteristics of the French language; its adaptation to the Arabic language therefore requires segmental and prosodic adjustments of our synthesis system. A first look at the results of the system showed a considerable resemblance between the natural and synthetic versions, in particular between the natural and synthetic F0 contours. Only a few minor differences can be observed, since the F0 values were extracted only once
every 10 ms. Note also the halved F0 in the creaky parts of the synthetic versions, which successfully simulated creak. Similarly, the spectrogram shows only small differences from the estimation algorithm. This can be seen in Figure 5 and Figure 6. The implementation of our prosodic parameter estimation and optimization algorithm produced intelligible and natural Arabic synthetic speech.

Figure 5. Natural and synthetic speech: signal and spectrogram of an Arabic sentence.

Figure 6. Natural and synthetic speech: pitch contour of an Arabic sentence.

4.2. Subjective evaluation
The evaluation consists of a subjective comparison between four models. A comparison category rating (CCR) test was used to compare the quality of the synthetic speech generated by our system, the Euler system, the Acapela system and natural speech. The listening tests were conducted by four Arab adults who are native speakers of the language; all listeners were born and raised in Arab countries. For both listening tests we prepared listening-test programs, and a brief introduction was given before the test. The listeners were asked to attribute a preference score to each sample pair on the comparison mean opinion score (CMOS) scale [18]. The listening test was performed with headphones. After collecting all the listeners' responses, we calculated the average values and obtained the following results. In the first listening test, the average correct rate for original and analysis-synthesis sounds was 98%, and that of rule-based synthesized sounds was 90%. We found the synthesized words to be very intelligible (Figure 7).
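As an illustration of how such per-system averages can be computed, the sketch below aggregates hypothetical listener scores; the numbers are invented and do not reproduce the paper's results.

```python
from statistics import mean

# Hypothetical CMOS-style preference scores from four listeners per system.
scores = {
    "our system": [4, 5, 4, 5],
    "Euler":      [3, 3, 4, 3],
    "Acapela":    [4, 4, 4, 5],
    "natural":    [5, 5, 5, 5],
}

for system, vals in scores.items():
    print(f"{system:12s} mean opinion score: {mean(vals):.2f}")
```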

Figure 7. Average scores of the first test (Euler system, our system, natural speech and Acapela system) for the intelligibility of speech.

V. CONCLUSIONS

A new high-quality Arabic speech synthesis technique has been introduced in this paper. The technique is based on the estimation and optimization of prosodic parameters such as pitch and duration for the MBROLA method. It has also been shown that syllables produce reasonably natural-quality speech and that durational modeling is crucial for naturalness, with a significant reduction in the number of units in the total developed base. This was readily observed during the listening tests and the objective evaluation when comparing the original with the synthetic speech.

REFERENCES
[1] S. Baloul, (2003) "Développement d'un système automatique de synthèse de la parole à partir du texte arabe standard voyellé", Thèse de doctorat, Université du Maine, Le Mans, France.
[2] M. Elshafi, H. Al-Muhtaseb & M. Al-Ghamdi, (2002) "Techniques for high quality Arabic speech synthesis", Information Sciences 140, pp. 255-267, Elsevier.
[3] M. Assaf, (2005) "A Prototype of an Arabic Diphone Speech Synthesizer in Festival", Master Thesis, Department of Linguistics and Philology, Uppsala University.
[4] B. Möbius & G. Dogil, (2002) "Phonemic and postural effects on the production of prosody", Speech Prosody 2002 (Aix-en-Provence), pp. 523-526.
[5] T. Dutoit, V. Pagel, N. Pierret, F. Bataille & O. van der Vrecken, (1996) "The MBROLA Project: Towards a Set of High-Quality Speech Synthesizers Free of Use".
[6] M. Al-Zabibi, (1990) "An Acoustic-Phonetic Approach in Automatic Arabic Speech Recognition", The British Library in Association with UMI.
[7] G. Demenko, S. Grocholewski, A. Wagner & M. Szymański, (2006) "Prosody Annotation for Corpus Based Speech Synthesis", In: Proceedings of the Eleventh Australasian International Conference on Speech Science and Technology, Auckland, New Zealand, pp. 460-465.
[8] P. Boersma & D. Weenink, (2005) "Praat: Doing phonetics by computer" [Computer program], Version 4.3.04, retrieved March 31, 2005 from http://www.praat.org/
[9] J. Bachan & D. Gibbon, (2006) "Close Copy Speech Synthesis for Speech Perception Testing", In: Investigationes Linguisticae, vol. 13, pp. 9-24.
[10] W. N. Campbell, (1992) "Syllable-based segmental duration", in G. Bailly and C. Benoît (eds), Talking Machines: Theories, Models and Designs, Elsevier Science Publishers, Amsterdam, pp. 211-224.
[11] A. Lacheret-Dujour & B. Beaugendre, (1999) "La prosodie du français", Paris, Editions du CNRS.
[12] F. Chouireb, M. Guerti, M. Nal & Y. Dimeh, (2007) "Development of a Prosodic Database for Standard Arabic", The Arabian Journal for Science and Engineering, Volume 32, Number 2B, pp. 251-262, ISSN: 1319-8025, October.
[13] S. Keagy, (2000) "Integrating voice and data networks: Practical solutions for the new world of packetized voice over data networks", Cisco Press.
[14] G. Sonntag, T. Portele & B. Heuft, (1997) "Prosody generation with a neural network: Weighing the importance of input parameters", in Proceedings of ICASSP, pp. 931-934, Munich, Germany.
[15] J. P. Teixeira, D. Freitas & H. Fujisaki, (2003) "Prediction of Fujisaki model's phrase commands", in Proceedings of Eurospeech, Geneva, pp. 397-400.
[16] J. P. Teixeira, D. Freitas & H. Fujisaki, (2004) "Prediction of accent commands for the Fujisaki intonation model", in Proceedings of Speech Prosody 2004, Nara, Japan, March 23-26, pp. 451-454.
[17] SNNS (Stuttgart Neural Network Simulator) User Manual, (1995) Version 4.1, University of Stuttgart, Institute for Parallel and Distributed High Performance Systems (IPVR).
[18] K. S. Rao & B. Yegnanarayana, (2004) "Intonation modeling for Indian languages", in Proceedings of Interspeech 2004, Jeju Island, Korea, 4-8 October, pp. 733-736.

Authors
A. Chabchoub is a researcher in the Signal Processing Laboratory at the Faculty of Sciences of Tunis, Tunisia (FST). He holds a degree in electronics and received an M.Sc. degree in Automatic and Signal Processing (ATS) from the National Engineering School of Tunis (ENIT). He is currently a PhD student under the supervision of Prof. A. Cherif. His research interests include speech synthesis and analysis.

A. Cherif received his engineering diploma from the Engineering Faculty of Tunis and his Ph.D. in electrical engineering and electronics from the National Engineering School of Tunis (ENIT). He is currently a professor at the Faculty of Sciences of Tunis, where he is responsible for the Signal Processing Laboratory. He has participated in several research and cooperation projects and is the author of international communications and publications.
