Parminder Singh
Associate Professor, Department of Computer Science and Engineering Guru Nanak Dev Engineering College, Ludhiana, Punjab, India
ABSTRACT
Speech synthesis systems that concatenate recorded speech units are currently very popular. These systems are known for producing high-quality, natural-sounding speech, as they generate speech by joining together waveforms of different speech units. This method of speech generation is quite practical. However, the speech units being concatenated may have different spectra on either side of the concatenation points. Such mismatches are spectral in nature and give rise to spectral discontinuity in the concatenated speech waveform. The presence of such discontinuities can be very distracting to the listener and degrades the overall quality of the output speech. This paper proposes a speech signal processing technique that deals with the problem of spectral discontinuity in the context of concatenated waveform synthesis. It involves post-processing of the synthesized speech waveform in the time domain. The technique was applied to different single-channel Punjabi wave audio files created by concatenating different Punjabi syllables. A listening test was conducted to evaluate the proposed technique, and it was observed that the spectral discontinuity is reduced to a large extent and the output speech sounds more natural, with a reduction of audible noise.
General Terms
Technique for speech signal processing

Keywords
Speech waveform, Concatenative speech synthesis, Spectral discontinuity

1. INTRODUCTION
Speech is the primary form of communication used by human beings to express their thoughts, feelings and ideas. Speech production involves a series of complex movements that alter and mould the basic tone created by the human voice into specific sounds [1]. The mechanism for generating the human voice can be subdivided into three parts: the lungs, the vocal folds within the larynx, and the articulators (the parts of the vocal tract above the larynx, consisting of the tongue, palate, cheeks, lips, nose and teeth). Speech sounds are created when air pumped from the lungs causes vibratory activity in the human vocal tract. These vibrations can themselves be represented by speech waveforms. Figure 1 shows a visual representation of vibrations typical of those in human speech: a speech waveform for a Punjabi word.

Figure 1: Example Speech Waveform for a Punjabi word

A computer system with the ability to convert written text into speech is known as a Text-To-Speech (TTS) synthesis system. The quality of a speech synthesizer is judged by its naturalness, the similarity of the generated speech to a real human voice, and its intelligibility, the ease with which the generated speech can be understood. The main goal of researchers and linguists is to create speech synthesis systems that are both natural and intelligible. Three types of methods are mainly used to synthesize artificial speech: articulatory synthesis, formant synthesis and concatenative synthesis [2]. Articulatory and formant synthesis are rule-based methods, whereas concatenative synthesis is database-driven. Articulatory synthesis uses a physical model of the human speech production organs and articulators. Formant synthesis models the frequencies of the speech signal based on the source-filter model; parameters such as fundamental frequency, voicing and noise levels are varied over time to create a speech waveform according to certain rules. Concatenative synthesis generates speech by concatenating recorded speech units and is described in more detail in Section 2.

The remainder of this paper is organized as follows. Section 2 presents an overview of concatenative speech synthesis. Section 3 discusses the problem of spectral discontinuity in the context of concatenative speech synthesis. Section 4 explains the stages of the technique proposed to remove audible spectral discontinuities from a concatenated speech waveform. Section 5 evaluates the results of the proposed technique. Finally, Section 6 concludes the paper and outlines future work.
International Journal of Computer Applications (0975-8887), Volume 53, No. 16, September 2012
if the discontinuities at the concatenation points are inaudible. But when these joins are audible, their presence can be very frustrating to the listener and reduces the overall perceived quality of the synthesized speech.

In systems that use databases of longer speech units, and where the variety of output is limited, the problem of spectral discontinuity is less severe, because longer speech units mean fewer concatenation points. However, in systems that create speech by combining a large number of smaller speech units, spectral discontinuity at the concatenation boundaries is a major problem: as the number of joins increases, so does the number of discontinuities.

There are a number of reasons for the presence of spectral discontinuities. Audible discontinuity may arise from inconsistencies in fundamental frequency, from different levels of loudness (energy of the segments), or from contextual differences and variations in the speaker's speaking style [6].

To avoid spectral discontinuity at concatenation boundaries, an appropriate signal processing technique must be applied. Ideally, such an approach would include algorithms that examine the synthetic speech waveform at the concatenation points and then manipulate the waveform at these points to produce a more natural-sounding continuity. In the next section we propose one such signal processing technique to reduce the effect of spectral discontinuities in the original acoustic signal.
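As a concrete illustration of this kind of post-hoc manipulation at a join, the sketch below applies a simple linear cross-fade over a short overlap region at the concatenation point. This is a generic baseline for softening amplitude and spectral jumps, not the paper's own technique; the signals, sample rate and overlap length are illustrative assumptions.

```python
import numpy as np

def crossfade_join(left, right, overlap):
    """Join two speech-unit waveforms with a linear cross-fade over
    `overlap` samples to soften the jump at the concatenation point."""
    fade = np.linspace(0.0, 1.0, overlap)
    blended = left[-overlap:] * (1.0 - fade) + right[:overlap] * fade
    return np.concatenate([left[:-overlap], blended, right[overlap:]])

# Hypothetical example: two 8 kHz tone segments with an amplitude mismatch
sr = 8000
t = np.arange(sr // 10) / sr                      # 100 ms each
unit_a = 0.9 * np.sin(2 * np.pi * 150 * t)
unit_b = 0.4 * np.sin(2 * np.pi * 150 * t)
out = crossfade_join(unit_a, unit_b, overlap=80)  # 10 ms overlap
```

Because the fade weights sum to one at every sample of the overlap, the joined waveform moves gradually from the level of the first unit to that of the second instead of jumping abruptly.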
4. PROPOSED TECHNIQUE
The block diagram in Figure 2 gives an outline of the proposed technique: Input Speech Signal, Pitch-mark Extraction, Fragment Generation, Resulting Signal.

Figure 2: Block Diagram of the Proposed Technique

The proposed technique works on speech signals directly in the time domain and allows them to be processed and modified in real time. The input for this technique is a concatenated Punjabi speech waveform generated by joining together different syllables of Punjabi. Our goal is to produce as output a final speech waveform that is free from distortion even when its tempo is increased or decreased. The working of the proposed technique and the different stages involved in it are discussed in the following sub-sections.

amplitudes and have been extracted using the algorithm described in Section 4.1.
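The details of the pitch-mark extraction algorithm of Section 4.1 are not reproduced in this excerpt, so the following is only a minimal sketch, assuming that one pitch mark is taken per fixed-length frame as that frame's maximum-amplitude sample (consistent with the frame-wise isolation shown in Figure 3(a); the frame length and signal are assumptions).

```python
import numpy as np

def extract_pitch_marks(signal, frame_len):
    """Sketch of frame-wise pitch-mark extraction: within each
    fixed-length frame, take the index of the maximum-amplitude
    sample as that frame's pitch mark (one mark per frame)."""
    marks = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        marks.append(start + int(np.argmax(np.abs(frame))))
    return marks

# Hypothetical input: a 100 Hz tone standing in for voiced speech at 8 kHz
sr = 8000
t = np.arange(sr // 5) / sr                        # 200 ms
speech = np.sin(2 * np.pi * 100 * t)
marks = extract_pitch_marks(speech, frame_len=80)  # 10 ms frames
```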
Figure 3: Generation of Speech Fragments. (a) One pitch mark (P1, P2, P3, P4) isolated in each of the fixed-length frames (N, N+1, N+2, N+3) in succession; (b) a fragment of length 2T0 lifted from around the pitch mark P3 located in frame N+2.

After a pitch mark P3 has been located in frame N+2 in Figure 3(a), a fragment is lifted from around this pitch mark, extending approximately one pitch period in each direction. Note that the generated fragment is therefore twice the pitch period T0 in length, as shown in Figure 3(b).
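The fragment-lifting step of Figure 3(b) can be sketched as below: a window of length 2*T0 centred on the pitch mark is copied out of the signal. The zero-padding at the signal edges is an assumption, since the paper does not specify boundary handling.

```python
import numpy as np

def lift_fragment(signal, pitch_mark, t0):
    """Lift a fragment of length 2*T0 centred on a pitch mark,
    extending one pitch period T0 in each direction, as in
    Figure 3(b). Samples outside the signal are zero-padded."""
    frag = np.zeros(2 * t0)
    lo, hi = pitch_mark - t0, pitch_mark + t0
    src_lo, src_hi = max(lo, 0), min(hi, len(signal))
    frag[src_lo - lo:src_hi - lo] = signal[src_lo:src_hi]
    return frag

# Hypothetical usage on a toy ramp signal
sig = np.arange(100, dtype=float)
frag = lift_fragment(sig, pitch_mark=50, t0=10)   # 20-sample fragment
```

The pitch mark always lands at the centre of the fragment, which keeps later overlap-and-add style operations aligned on pitch periods.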
tempo or play rate of the output speech. These steps are discussed in the following subsections.

discontinuities. After scaling the sample values at the beginning and end of every speech fragment, it was observed that the distortion was reduced to a large extent and the audible quality also improved.

A listening test was also conducted to evaluate the results of the proposed technique. Six listeners were asked to rate the quality of the speech synthesized using the proposed technique for a number of sentences. Each listener was required to give each sentence a rating on a scale from 1 to 5, where 1 represents the lowest perceived audio quality and 5 the highest, as shown in Table 1. The Mean Opinion Score (MOS) is the arithmetic mean of all individual scores and gives a numerical indication of the perceived audio quality.

Table 1. Parameters for Mean Opinion Score (MOS)

MOS   Quality     Distortion
5     Excellent   Imperceptible
4     Good        Perceptible but not annoying
3     Fair        Slightly annoying
2     Poor        Annoying
1     Bad         Very annoying
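The scaling of sample values at the beginning and end of every speech fragment can be sketched as a short fade-in/fade-out applied to each fragment's edges, so that adjacent fragments meet near zero amplitude and clicks at the joins are suppressed. The linear ramp shape and its length are assumptions; the paper does not specify the exact scaling function used.

```python
import numpy as np

def taper_fragment(fragment, ramp):
    """Scale the first and last `ramp` samples of a speech fragment
    with a linear fade-in and fade-out, so consecutive fragments
    meet near zero amplitude at their boundaries."""
    out = fragment.astype(float).copy()
    env = np.linspace(0.0, 1.0, ramp)
    out[:ramp] *= env          # fade in
    out[-ramp:] *= env[::-1]   # fade out
    return out

# Hypothetical usage: taper a constant fragment with a 10-sample ramp
frag = np.ones(100)
smoothed = taper_fragment(frag, ramp=10)
```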
The ratings given by the listeners suggest that the proposed technique succeeds in achieving our goal of generating output speech with minimal audible distortion.
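As a small worked example of the MOS computation, the score for one test sentence is simply the arithmetic mean of the listeners' individual ratings. The ratings below are hypothetical, not the paper's data.

```python
# Hypothetical ratings from the six listeners for one test sentence
ratings = [4, 5, 4, 4, 5, 4]

# MOS is the arithmetic mean of all individual scores
mos = sum(ratings) / len(ratings)
print(round(mos, 2))  # prints 4.33
```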
5. RESULTS
Our aim was to remove the spectral discontinuity from the original concatenated Punjabi speech waveform. The proposed technique was analysed to check whether the desired results were produced. For this purpose, we used a few 16-bit mono-channel concatenated Punjabi wav audio files.

Figure 4: Original Speech Waveform

As mentioned before, the utterance rate of the original speech was also modified with the help of the proposed method, which in turn affected its overall duration. Slower speech was produced by repeating certain speech fragments, increasing the overall duration of the final speech. Figure 5 shows the waveform of speech with a slower tempo. Excessive elongation of the synthesized signal causes an echo effect. Similarly, omitting certain fragments from the final speech signal produced faster speech, reducing the overall duration; however, omission of fragments causes a loss of speech information content. Figure 6 shows the waveform of speech with a faster utterance rate. Note the change in duration with the change in the tempo of speech in the waveforms.
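The tempo modification described above can be sketched as resequencing equal-length fragments: repeating some fragments slows the speech down, while omitting some speeds it up. This simplified, non-overlapping concatenation is an assumption about the fragment scheduling; a production system would typically overlap-and-add the tapered fragments at pitch-mark positions.

```python
import numpy as np

def change_tempo(fragments, rate):
    """Resequence equal-length speech fragments to change tempo.
    rate < 1.0 slows speech (some fragments are repeated);
    rate > 1.0 speeds it up (some fragments are omitted)."""
    n_out = int(round(len(fragments) / rate))
    picks = np.minimum((np.arange(n_out) * rate).astype(int),
                       len(fragments) - 1)
    return np.concatenate([fragments[i] for i in picks])

# Hypothetical usage with ten 4-sample fragments
fragments = [np.full(4, k, dtype=float) for k in range(10)]
slow = change_tempo(fragments, 0.5)  # each fragment repeated: longer output
fast = change_tempo(fragments, 2.0)  # every other fragment kept: shorter output
```

Repetition lengthens the signal (and, taken too far, produces the echo effect noted above), while omission shortens it at the cost of dropping some speech content.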
7. REFERENCES
[1] Honda, M. (2003), Human Speech Production Mechanisms, NTT Technical Review, Vol. 1, No. 3, pp. 24-29.
[2] Tabet, Y. and Boughazi, M. (2011), Speech Synthesis Techniques: A Survey, 7th International Workshop on Systems, Signal Processing and their Applications, pp. 67-70.
[3] Thakur, S. K. and Satao, K. J. (2011), Study of Various Kinds of Speech Synthesizer Technologies and Expression for Expressive Text To Speech Conversion System, International Journal of Advanced Engineering Sciences and Technologies, Vol. 8, No. 2, pp. 301-305.
[4] Chappell, D. and Hansen, J. (2002), A Comparison of Spectral Smoothing Methods for Segment Concatenation Based Speech Synthesis, Speech Communication, Vol. 36, pp. 343-374.
[5] Kirkpatrick, B. (2010), Spectral Discontinuity in Concatenative Speech Synthesis: Perception, Join Costs and Feature Transformations, PhD Thesis, Dublin City University, pp. 1-63.
[6] Klabbers, E. and Veldhuis, R. (2001), Reducing Audible Spectral Discontinuities, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 1, pp. 39-51.
[7] White, S. (2003), Visualizing Speech Synthesis, Bachelor's Thesis, pp. 4-9.
[8] Lemmetty, S. (1999), Review of Speech Synthesis Technology, Master's Thesis, Department of Electrical and Communication Engineering, Helsinki University of Technology, pp. 28-46.
[9] Bjorkan, I. (2010), Speech Generation and Modification in Concatenative Speech Synthesis, PhD Thesis, Department of Electronics and
[10] Concatenative Speech Synthesis, M.Sc. Thesis, University of Crete, Greece, pp. 1-18.
[11] Visagie, A. (2004), Speech Generation in a Spoken Dialogue System, Master's Thesis, University of Stellenbosch, South Africa, pp. 35-91.
[12] Wouters, J. and Macon, M. (2001), Control of Spectral Dynamics in Concatenative Speech Synthesis, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 1, pp. 30-38.
[13] Klabbers, E. (1997), High Quality Output Speech Generation through Advanced Phrase Concatenation, Proceedings of the COST Workshop on Speech Technology in the Public Telephone Network: Where are we today?, Rhodes, Greece, Vol. 1, No. 88, pp. 85-88.
[14] Rabiner, L. and Schafer, R. (2007), Introduction to Digital Speech Processing, Vol. 1, No. 1-2, pp. 1-194.
[15] Mousa, A. (2010), Voice Conversion Using Pitch-Shifting Algorithm by Time Stretching with PSOLA and Re-sampling, Journal of Electrical Engineering, Vol. 61, No. 1, pp. 57-61.
[16] Plumpe, M. and Meredith, S. (1998), Which is More Important in a Concatenative Text to Speech System: Pitch, Duration or Spectral Discontinuity?, Proceedings of the Third ESCA/COCOSDA Workshop on Speech Synthesis, Jenolan, Australia.