
International Journal of Advances in Engineering & Technology (IJAET), Jan 2012. ISSN: 2231-1963

ESTIMATION AND OPTIMIZATION OF PROSODIC PARAMETERS TO IMPROVE THE QUALITY OF ARABIC SYNTHETIC SPEECH
Abdelkader CHABCHOUB & Adnen CHERIF
Signal Processing Laboratory, Science Faculty of Tunis, 1060, Tunisia

ABSTRACT

Prosody modeling has been extensively applied in speech synthesis, simply because every speech synthesis system needs to generate the prosodic properties of speech in order to produce natural and intelligible synthetic speech. This paper introduces a new technique for predicting a deterministic prosodic target at an early stage, which relies on probabilistic models of the F0 contour and can also predict segmental duration. The paper also proposes a method that searches for the optimal unit sequence by maximizing a joint likelihood at both the segmental and prosodic levels. This method has been successfully applied to the analysis corpus used to develop the Arabic prosody database, which is itself the input of the Arabic speech synthesizer. Extensive objective and subjective evaluation shows a marked improvement in Arabic prosodic quality.

KEYWORDS: Segmental duration, pitch, predictive, prosodic model, neural network, speech synthesis, Arabic speech.

I. INTRODUCTION

Generating natural-sounding prosody is a central challenge in text-to-speech synthesis (TTS), a technology that nowadays enables computers to talk and assists people in learning languages. While existing synthesis techniques produce speech that is intelligible, few people would claim that listening to computer speech is natural or expressive. Therefore, in recent years, research in speech synthesis has been directed more towards improving the intelligibility and naturalness of synthetic systems, in order to achieve better overall quality, tone of voice and intonation [1] [2]. In several systems, producing a voice of good quality still requires extensive research before such systems can be used more widely. For the Arabic language, linguistic and prosodic processing [3] is essential for synthesis quality. We therefore introduce a processing stage based on the modification of Arabic prosody (optimization of the pitch and prediction of duration), trained to improve the new Arabic voice. From the phonetic point of view, this means processing the prosodic parameters defined by the fundamental frequency (F0), the segmental duration and the intensity. Modeling these parameters is the main target of our research work, which concentrates essentially on the fundamental frequency and the duration [4]. This paper is organized as follows. Section 2 describes the Arabic prosody database and the corpus used in the study, and presents the list of phonemes and the corresponding acoustic parameters for each phoneme (duration and F0). These values feed the module that optimizes the prosodic parameters (pitch and duration), presented in Section 3. Section 4 presents
the results and evaluation of the algorithm as well as the implementation of the speech synthesis system.

II. DATABASE OF ARABIC SPEECH PROSODY

The quality of a speech synthesis system depends on the intelligibility and naturalness of the speech it generates, hence the need to generate quality prosody. Our database has been developed to improve the quality of Arabic synthetic speech with MBROLA [5]. The fundamental idea is to create a speech corpus consisting of phone-sequence phonemic/prosodic context combinations that forms a specially structured subset of the set of all such combinations, and then to use Arabic prosody transplantation [6]. The modules are cascaded in the order Phonetisation-Duration-Pitch. The input is a pair consisting of a speech signal file and a time-aligned phonemic annotation, followed by phoneme validation (SAMPA code), identification of the voiced and unvoiced frames (V/NV), duration extraction, pitch extraction [7] and, finally, prosodic modification/optimization. The results of this algorithm are the entries of our Arabic prosodic database. The main data flow steps are shown in Figure 1, which represents the generation of the database.
[Figure 1 shows the data flow: Original speech → Automatic annotation and segmentation → V/NV classification → Duration and pitch measurement → Prosodic modification and optimization → Arabic prosodic database.]

Figure 1. Arabic prosodic database generation with duration prediction and pitch optimization algorithm.

2.1. Description of the analysis corpus used


The corpus we used to build our database is composed of 120 sentences, with an average of 5 words per sentence. These sentences contain in total 1296 syllables and 3240 phonemes, including short vowels, long vowels and semi-vowels, fricative, plosive and liquid consonants, and nasal consonants. Breaks were marked with "_" in the text corresponding to the natural voice. The sentences were read at an average speed (10 to 12 phonemes/second) by a speaker who did not receive any specific instruction, in order to avoid any influence that could affect spontaneity. The corpus was recorded at a 16 kHz sampling rate with 16-bit encoding.

2.2. Segmentation and labeling of the corpus


The continuous speech corpus has been segmented and labeled by a semi-automatic procedure, which involves the following steps [12]:

Step 1: Manual phonetic transcription of each sentence using the SAMPA transcription system.
Step 2: Automatic segmentation with Praat.

2.3. Automatic segmentation of the corpus.


The extraction of pitch is an important step. For each phoneme duration (in ms), we extract the pitch at several positions; these values become the parameters of the input file for MBROLA. The result is a robust and accurate pitch extraction algorithm that provides good-quality synthetic speech.
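As an illustration, the sketch below builds MBROLA-style .pho lines from a phoneme's duration and a few pitch samples. The phoneme labels and values are hypothetical, and the layout (phoneme, duration in ms, then pairs of position in percent and F0 in Hz) follows the standard MBROLA input format rather than the authors' exact script.

```python
def pho_line(phoneme, duration_ms, pitch_points):
    """Format one MBROLA .pho line: phoneme, duration (ms),
    then (position %, F0 Hz) pairs taken at several positions."""
    pairs = " ".join(f"{int(pos)} {int(f0)}" for pos, f0 in pitch_points)
    return f"{phoneme} {int(duration_ms)} {pairs}".strip()

# Hypothetical values for illustration only.
print(pho_line("a", 104, [(20, 120), (80, 133)]))   # -> "a 104 20 120 80 133"
print(pho_line("_", 387, []))                        # pause: no pitch targets
```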

2.4. Identification of the voiced and unvoiced frames


The automatic segmentation of a speech signal is used to identify the voiced and unvoiced frames. This classification is based on the zero-crossing rate and the energy value of each signal frame, as sketched below.
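The following is a minimal sketch of such a frame-wise voiced/unvoiced decision using NumPy; the frame length, hop size and thresholds are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def vuv_frames(signal, frame_len=320, hop=160, energy_thr=0.01, zcr_thr=0.25):
    """Label each frame voiced (True) or unvoiced (False) from
    short-time energy and zero-crossing rate."""
    labels = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # crossings per sample
        labels.append(energy > energy_thr and zcr < zcr_thr)   # voiced: high energy, low ZCR
    return np.array(labels)

# Illustrative usage with a synthetic 16 kHz signal.
fs = 16000
t = np.arange(fs) / fs
voiced_like = 0.5 * np.sin(2 * np.pi * 150 * t)   # periodic, low ZCR
print(vuv_frames(voiced_like).mean())              # close to 1.0 (mostly voiced)
```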
Figure 2. Voiced zones of an Arabic sentence ("door"). From top to bottom: speech signal, zero-crossing rate, energy, and voiced-zone decision (time in seconds). In this automatic segmentation, the interval [0.4, 0.75] s, for example, is voiced: it corresponds to a low zero-crossing rate and a high energy.
Figure 3. Unvoiced zones of an Arabic sentence ("sun"). From top to bottom: speech signal, zero-crossing rate, energy, and voiced-zone decision (time in seconds). In this automatic segmentation, the interval [0.65, 0.8] s, for example, is unvoiced: it corresponds to a high zero-crossing rate and a low energy.

2.5. Duration and pitch extraction


The extraction of duration and pitch is the next step. Copying the phonemes and their durations from the annotation file and measuring the pitch values from the original recording of a human utterance allows best-case speech synthesis. To extract pitch from the recordings, a Praat script called max-pitch was implemented as in [8]. This script goes through the Sound and TextGrid files in a directory, opens each
pair of Sound and TextGrid files, calculates the pitch maximum of each labeled interval, and saves the results to a text file [9]. Running this script raised a problem, and some modifications to the script were made. The inputs to this script are WAV files and TextGrid annotation files. The Praat pitch extraction script produces one TXT file with the pitch values of all the phonemes of all the files in the directory. The output pitchresults.txt file contains the following information:
1. The file names of the files in the directory.
2. The labels.
3. The maximum pitch values of the labeled intervals, in Hz.
The results for one file in the directory are shown in the following example, a .pho file automatically extracted by the Praat script:
_ 387
s 90 ...
a 104 ...
d 118 ...
i 77 ...
q 125 ...
i 63 ...
l 103 ...
H 71 ...
a 75 ...
z 116 ...
i 155 ...
z 152 ...
(each line gives the phoneme, its duration in ms, followed by pairs of pitch position (%) and F0 value (Hz))
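For readers who want to reproduce this step in Python rather than with a native Praat script, the following is a minimal sketch using the praat-parselmouth bindings (an assumption; the paper uses a Praat script). The interval list stands in for the TextGrid tiers, and the file name is hypothetical.

```python
import parselmouth
from parselmouth.praat import call

def max_pitch_per_interval(wav_path, intervals):
    """Return (label, max F0 in Hz) for each labeled interval,
    mirroring Praat's 'Get maximum' query on a Pitch object."""
    sound = parselmouth.Sound(wav_path)
    pitch = sound.to_pitch()                      # default autocorrelation-based analysis
    results = []
    for label, t_start, t_end in intervals:
        f0_max = call(pitch, "Get maximum", t_start, t_end, "Hertz", "Parabolic")
        results.append((label, f0_max))
    return results

# Hypothetical annotation: (SAMPA label, start s, end s).
intervals = [("s", 0.10, 0.19), ("a", 0.19, 0.29), ("d", 0.29, 0.41)]
for label, f0 in max_pitch_per_interval("utterance.wav", intervals):
    print(f"{label}\t{f0:.1f} Hz")
```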

III. ARABIC PROSODIC MODELLING

3.1. Prediction models of segmental duration


Studies on the automatic generation of segmental duration have evolved considerably in recent years. The model proposed in this paper is based on two basic ingredients: linear prediction and neural networks. The model of W. N. Campbell assumes that the temporal organization of an utterance is made at a level higher than that of the phoneme. Two stages are distinguished in the implementation of this model: the first is the prediction of syllable durations and the second is the prediction of the phoneme durations within each syllable. Syllable durations are predicted by an automatic learning process. Neural networks are used for this learning because it is assumed that they can learn the fundamental interactions between contextual effects, which represent the rule-governed behaviour implicit in the data. If the networks can encode these fundamental interactions, then they should also handle data not previously encountered [10]. Regarding the segmental durations, their distribution is described by an elongation coefficient (deviation from the mean). Campbell suggested that all the phonemes of one syllable share the same elongation factor z, the z-score. The z-score of each phonemic realization of the study corpus is calculated by:

z_{realisation} = \frac{duree_{observee}(realisation_{phoneme}) - \mu_{phoneme}}{\sigma_{phoneme}}    (1)

where \mu_{phoneme} and \sigma_{phoneme} are the mean and standard deviation obtained from the absolute durations of the realizations of each phoneme in the corpus. Thus, once each phonetic realization has been normalized using the z-score (mean = 0 and standard deviation = 1), the durations of the syllables are determined by the neural network [11]. Moreover, the model calculates the z-score associated with each syllable by solving the following equation:
Duree(syllabe) = \sum_{i=1}^{n} \exp(\mu_i + z\,\sigma_i)    (2)

The sum runs over the phonemic elements of the syllable, z is the z-score associated with that syllable, and the pair (\mu_i, \sigma_i) contains the mean and standard deviation associated with phoneme i, obtained from the logarithms of the durations (in milliseconds) of the realizations of this phoneme in the corpus. Thus, the duration of each phoneme of the syllable is calculated using equation (3).

Duree(phoneme_i) = \exp(\mu_i + z\,\sigma_i)    (3)
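As an illustration of equations (1)-(3), here is a minimal numerical sketch of the z-score duration model. It assumes that the per-phoneme log-duration statistics and the syllable z-score (which the paper obtains from a neural network) are already available; the statistics and the z value below are invented for the example.

```python
import math

# Hypothetical per-phoneme statistics (mean, std) of log-durations in ms,
# as would be estimated from the analysis corpus.
log_stats = {"s": (4.45, 0.22), "a": (4.60, 0.25), "d": (4.30, 0.20)}

def phoneme_duration(phoneme, z):
    """Equation (3): duration of one phoneme for a syllable-level z-score."""
    mu, sigma = log_stats[phoneme]
    return math.exp(mu + z * sigma)

def syllable_duration(phonemes, z):
    """Equation (2): syllable duration as the sum of its phoneme durations."""
    return sum(phoneme_duration(p, z) for p in phonemes)

# Example: the syllable /sad/ with a z-score of +0.5 (slightly lengthened).
z = 0.5
print({p: round(phoneme_duration(p, z), 1) for p in "sad"})
print(round(syllable_duration("sad", z), 1))
```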

3.2. F0 Prediction Module Based on a Neural Network

Neural networks provide a good solution for problems involving strong non-linearity between input and output parameters, and also when the quantitative mechanism of the mapping is not well understood. The use of neural networks in prosodic modeling has been reported in [13] and [14], but those methods do not make use of a model to limit the degrees of freedom of the problem, and additional care must be taken to account for the continuity of F0 contours (for example by using recurrent networks). In the proposed model, the continuity and basic shape of the F0 contours are ensured by the F0 model [15][16]. In this paper, three types of neural network structures are evaluated: the multi-layer perceptron (MLP), the Jordan network (a structure with feedback from the output elements) and the Elman network (a structure with feedback from the hidden elements). The latter two structures are partial recurrent networks and are tested here in order to account for the mutual influence of neighbouring accentual phrases. All structures have a single hidden layer containing either 10 or 20 elements. For the experiments, we used the SNNS neural network simulation software [17]. The results of F0 contour prediction on the test data set are shown in Figure 4. Figure 6 shows the pitch contours of an original and a synthetic utterance produced with our system.
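The following is a minimal sketch of this kind of F0 predictor, using scikit-learn's MLPRegressor as a stand-in for the SNNS multi-layer perceptron described above; the context features and training data are invented placeholders, and the recurrent (Jordan/Elman) variants are not shown.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical context features per accentual phrase
# (e.g. position in sentence, number of syllables, previous F0 target),
# and the F0 target value to predict (in Hz).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 120 + 40 * X[:, 0] - 20 * X[:, 1] + rng.normal(scale=2.0, size=200)

# Single hidden layer of 10 units, as in the evaluated MLP structure.
mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(X[:150], y[:150])
rmse = np.sqrt(np.mean((mlp.predict(X[150:]) - y[150:]) ** 2))
print("test RMSE (Hz):", round(rmse, 2))
```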

Figure 4. Evaluation of the fundamental frequency F0 of an Arabic phrase. From top to bottom: voice signal, F0 obtained by the autocorrelation method, F0 obtained by the spectral method, annotation segments with the average F0 value per syllable, and F0 estimation by MOMEL.

IV. RESULTS AND EVALUATION

4.1. Implementation of prosodic values into MBROLA


The MBROLA synthesis system is multilingual, but it was originally designed around the phonotactic characteristics of the French language; its adaptation to the Arabic language therefore requires segmental and prosodic adjustments of our synthesis system. A first look at the results of the system showed a considerable resemblance between the natural and synthetic versions, in particular between the natural and synthetic F0 contours. Only a few minor differences can be observed, since the F0 values were extracted only once
every 10 ms. Note also the halved F0 in the creaky parts of the synthetic versions, which successfully simulated creak. Similarly, the spectrogram shows only small differences from the estimation algorithm. This can be seen in Figure 5 and Figure 6. The implementation of our prosodic parameter estimation and optimization algorithm produced intelligible and natural Arabic synthetic speech.

Figure 5. Natural and synthetic speech: signal and spectrogram of an Arabic sentence.

Figure 6. Natural and synthetic speech: pitch contour of an Arabic sentence.

4.2. Subjective evaluation
The evaluation consists of a subjective comparison between four models. A comparison category rating (CCR) test was used to compare the quality of the synthetic speech generated by our system, the Euler system, the Acapela system and natural speech. The listening tests were conducted by four Arab adults who are native speakers of the language; all listeners were born and raised in Arab countries. For both listening tests we prepared listening-test programs, and a brief introduction was given before the test. The listeners were asked to attribute a preference score to each sample pair on the comparison mean opinion score (CMOS) scale [18]. The listening test was performed with headphones. After collecting all the listeners' responses, we calculated the average values and obtained the following results. In the first listening test, the average correct rate for original and analysis-synthesis sounds was 98%, and that of rule-based synthesized sounds was 90%. We found the synthesized words to be very intelligible (Figure 7).
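As an illustration of how such per-system averages can be computed, the sketch below aggregates hypothetical listener scores; the numbers are invented and do not reproduce the paper's results.

```python
from statistics import mean

# Hypothetical CMOS-style preference scores from four listeners per system.
scores = {
    "our system": [4, 5, 4, 5],
    "Euler":      [3, 3, 4, 3],
    "Acapela":    [4, 4, 4, 5],
    "natural":    [5, 5, 5, 5],
}

for system, vals in scores.items():
    print(f"{system:12s} mean opinion score: {mean(vals):.2f}")
```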

Figure 7. Average scores of the first test (Euler system, our system, natural speech and Acapela system) for the intelligibility of speech.

V. CONCLUSIONS

A new high-quality Arabic speech synthesis technique has been introduced in this paper. The technique is based on the estimation and optimization of prosodic parameters such as pitch and duration for the MBROLA method. It has also been shown that syllables produce reasonably natural-quality speech and that durational modeling is crucial for naturalness, with a significant reduction in the number of units in the total developed base. This was readily observed during the listening tests and the objective evaluation when comparing the original with the synthetic speech.

REFERENCES
[1] S. Baloul, (2003) "Développement d'un système automatique de synthèse de la parole à partir du texte arabe standard voyellé", Thèse de doctorat, Université du Maine, Le Mans, France.
[2] M. Elshafi, H. Al-Muhtaseb & M. Al-Ghamdi, (2002) "Techniques for high quality Arabic speech synthesis", Information Sciences 140, pp. 255-267, Elsevier.
[3] M. Assaf, (2005) "A Prototype of an Arabic Diphone Speech Synthesizer in Festival", Master Thesis, Department of Linguistics and Philology, Uppsala University.
[4] B. Möbius & G. Dogil, (2002) "Phonemic and postural effects on the production of prosody", Speech Prosody 2002 (Aix-en-Provence), pp. 523-526.
[5] T. Dutoit, V. Pagel, N. Pierret, F. Bataille & O. van der Vrecken, (1996) "The MBROLA Project: Towards a Set of High-Quality Speech Synthesizers Free of Use".
[6] M. Al-Zabibi, (1990) "An Acoustic-Phonetic Approach in Automatic Arabic Speech Recognition", The British Library in Association with UMI.
[7] G. Demenko, S. Grocholewski, A. Wagner & M. Szymański, (2006) "Prosody Annotation for Corpus Based Speech Synthesis", In: Proceedings of the Eleventh Australasian International Conference on Speech Science and Technology, Auckland, New Zealand, pp. 460-465.
[8] P. Boersma & D. Weenink, (2005) "Praat: Doing phonetics by computer" [Computer program], Version 4.3.04, retrieved March 31, 2005 from http://www.praat.org/
[9] J. Bachan & D. Gibbon, (2006) "Close Copy Speech Synthesis for Speech Perception Testing", In: Investigationes Linguisticae, vol. 13, pp. 9-24.
[10] W. N. Campbell, (1992) "Syllable-based segmental duration", in G. Bailly and C. Benoît (eds), Talking Machines: Theories, Models and Designs, Elsevier Science Publishers, Amsterdam, pp. 211-224.
[11] A. Lacheret-Dujour & B. Beaugendre, (1999) "La prosodie du français", Paris, Editions du CNRS.
[12] F. Chouireb, M. Guerti, M. Nal & Y. Dimeh, (2007) "Development of a Prosodic Database for Standard Arabic", The Arabian Journal for Science and Engineering, Volume 32, Number 2B, pp. 251-262, ISSN: 1319-8025, October.
[13] S. Keagy, (2000) "Integrating voice and data networks: Practical solutions for the new world of packetized voice over data networks", Cisco Press.
[14] G. Sonntag, T. Portele & B. Heuft, (1997) "Prosody generation with a neural network: Weighing the importance of input parameters", in Proceedings of ICASSP, pp. 931-934, Munich, Germany.
[15] J. P. Teixeira, D. Freitas & H. Fujisaki, (2003) "Prediction of Fujisaki model's phrase commands", in Proceedings of Eurospeech, Geneva, pp. 397-400.
[16] J. P. Teixeira, D. Freitas & H. Fujisaki, (2004) "Prediction of accent commands for the Fujisaki intonation model", in Proceedings of Speech Prosody 2004, Nara, Japan, March 23-26, pp. 451-454.
[17] SNNS (Stuttgart Neural Network Simulator) User Manual, (1995) Version 4.1, University of Stuttgart, Institute for Parallel and Distributed High Performance Systems (IPVR).
[18] K. S. Rao & B. Yegnanarayana, (2004) "Intonation modeling for Indian languages", in Proceedings of Interspeech 2004, Jeju Island, Korea, 4-8 October, pp. 733-736.

Authors
A. Chabchoub is a researcher in the Signal Processing Laboratory at the Faculty of Sciences of Tunis, Tunisia (FST). He holds a degree in electronics and received an M.Sc. degree in Automatic and Signal Processing (ATS) from the National Engineering School of Tunis (ENIT). He is currently a PhD student under the supervision of Prof. A. Cherif. His research interests include speech synthesis and analysis.

A. Cherif received his engineering diploma from the Engineering Faculty of Tunis and his Ph.D. in electrical engineering and electronics from the National Engineering School of Tunis (ENIT). He is currently a professor at the Faculty of Sciences of Tunis, where he is responsible for the Signal Processing Laboratory. He has participated in several research and cooperation projects and is the author of international communications and publications.
