An Etude On Speech Synthesis With Instruments: An Additive Synthesis Approach

Leal 1
AD SUM LUCEM: AN ETUDE IN INSTRUMENTAL SPEECH SYNTHESIS
Partiels by Gérard Grisey and Gondwana by Tristan Murail are two landmark pieces
of what can be identified as the spectral movement.1 Both pieces are the
instrumental realization, or translation, of a technique developed mainly through electronic
and/or digital means: additive synthesis and frequency modulation (FM) synthesis, respectively.2 Both
pieces also use these synthesis processes as models rather than translating them accurately and
literally into the acoustic instrumental realm. In
Partiels, Grisey used the analysis of a trombone pitch to understand the harmonic material that
constituted the trombone timbre at that pitch, and used the harmonic series and the relative
intensity, duration, and appearance time of each partial as models for the piece. In Gondwana,
Murail used data obtained by calculating the pitches of the sidebands that
result from an FM equation. In this case, a formula determines which and how
many sidebands are produced when a carrier frequency of a given value is modulated
by a modulator frequency of a given value. Such realizations are not literal mainly because
the synthesis techniques mentioned, when realized by electronic means, use sine waves, which are the
simplest type of periodic wave. In contrast, Grisey and Murail had to deal with the complex sounds of
acoustic instruments, meaning that the resultant timbres include series of additional harmonic
partials in relation to their electronic counterparts.3 In this project I sought to apply a similar
1 Joshua Fineberg, "Spectral Music," Contemporary Music Review 19, no. 2 (2000): 2.
2 François Rose, "Introduction to the Pitch Organization of French Spectral Music,"
Perspectives of New Music 34, no. 2 (Summer 1996): 8–11; Tristan Murail, "Villeneuve-lès-Avignon
Conferences, Centre Acanthes, 9–11 and 13 July 1992," Contemporary Music Review
24, nos. 2–3 (2005): 205–211.
3 Rose, "Pitch Organization," 8–11; Murail, "Villeneuve-lès-Avignon," 205–211.

approach for the realization of speech synthesis, which is almost exclusively done with
computers, by means of traditional-ensemble acoustic instruments.
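The FM relationship mentioned above can be illustrated with a minimal sketch. It assumes the standard simple-FM result that sidebands appear at the carrier frequency plus and minus integer multiples of the modulator frequency, with negative frequencies folding back to their absolute value. The function name, the `order` parameter, and the example frequencies are hypothetical and are not taken from Gondwana. In practice, the number of audible sidebands depends on the modulation index.

```python
def fm_sidebands(carrier_hz, modulator_hz, order):
    """Return the sorted frequencies |fc +/- n*fm| for n = 0..order.

    `order` is an illustrative cut-off chosen by the caller; it stands in
    for the number of sidebands that the modulation index makes audible.
    """
    freqs = {abs(carrier_hz)}
    for n in range(1, order + 1):
        freqs.add(abs(carrier_hz + n * modulator_hz))  # upper sideband
        freqs.add(abs(carrier_hz - n * modulator_hz))  # lower sideband, folded if negative
    return sorted(freqs)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the pattern.
    print(fm_sidebands(400.0, 100.0, 3))
```

Each resulting frequency can then be rounded to the nearest playable pitch, which is where the instrumental realization departs from the literal synthesis.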
Speech synthesis by analog acoustic means has been explored in the past, with a strong
impact on the development of the study of phonetics. In his thesis on speech synthesis, Sami
Lemmetty described several early examples of mechanical speech synthesis. These examples
include Christian Kratzenstein's development of a series of resonators to produce the long
vowels /a/ /e/ /i/ /o/ /u/; Wolfgang von Kempelen's models of his Acoustic-Mechanical Speech
Machine, which consisted of a couple of reeds, a box with a variety of resonators, and a flexible
end that could be manipulated to filter the sound in the way the mouth filters the sounds
produced by the vocal cords (the machine could produce some vowels and some nasal and
fricative consonants); Charles Wheatstone's and Alexander Graham Bell's constructions of von
Kempelen's machine; and Robert Willis's exploration of the production of vowels by
means of organ-pipe-like tubes.4 These examples, however, are mainly based on the production
of phoneme-like sounds by a single emitter, several parameters of which can be manipulated in
ways that traditional-ensemble instruments cannot. In any case, the production of movable filters,
which could be a valid approach, is beyond the scope of this paper.
Digital speech synthesis is a complex process and requires complicated computer
algorithms. Two common approaches to speech synthesis are unit selection synthesis
and statistical parametric synthesis. Both approaches require access to a
large database of recorded speech sounds. The main difference between them is that
unit selection mixes and reproduces small bits of sound straight from the database, while
statistical parametric synthesis uses algorithms to analyze the samples, calculate possible
4 Sami Lemmetty, Review of Speech Synthesis Technology (Master's Thesis). (Finland: Helsinki
University of Technology, 1999), 46.
outcomes, and synthesize the result of such operations.5 Understanding the details of either of these
techniques would require a high degree of specialization, which is beyond the scope of this project.
In any case, the use of analytical techniques to interpret the components of specific speech
sounds (phonemes), and the subsequent synthesis of an approximate sound using acoustic instruments,
more closely resembles the latter approach.
A less data-intensive approach to synthesizing certain phonemes by electronic means is
subtractive synthesis. Knowledge of the formants of specific vowels and consonants can be used
to control band-pass filters applied to a complex sound (such as white noise). This
approach is similar to that of the mechanical synthesis examples described above. The
implementation of such a method, however, seems impractical for traditional-ensemble acoustic
instruments. The visual analysis of such formants and of the harmonic components of speech can
instead be used as a model for an additive synthesis approach, which seems more reasonable in
terms of the amount of data and more suitable for the selected medium.
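For comparison, the subtractive approach can be sketched in a few lines: white noise is passed through one band-pass filter per formant and the filtered bands are averaged. This is only an illustration under stated assumptions. The filter is a standard second-order (biquad) band-pass, and the formant frequencies, the Q value, and the function names are hypothetical placeholders rather than values measured in this project.

```python
import math
import random


def band_pass(signal, center_hz, q, sample_rate):
    """Second-order (biquad) band-pass filter with unity gain at center_hz."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    # Normalized coefficients; the b1 coefficient is zero for this filter type.
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def formant_noise(formants_hz, seconds=0.5, sample_rate=44100, q=8.0, seed=0):
    """Filter white noise through one band-pass per formant and average the bands."""
    rng = random.Random(seed)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(int(seconds * sample_rate))]
    bands = [band_pass(noise, f, q, sample_rate) for f in formants_hz]
    return [sum(samples) / len(bands) for samples in zip(*bands)]


if __name__ == "__main__":
    # Hypothetical formant centers, for illustration only.
    shaped = formant_noise([700.0, 1200.0])
    print(len(shaped), "samples of formant-shaped noise")
```

A single noise source shaped by movable filters mirrors the mechanical machines described earlier, which is why this route fits an ensemble of fixed-pitch instruments less well than the additive one.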
The analysis of a large pool of phonemes, even using the simplest of the techniques
described above, implies a large amount of data, even if such data is reduced to visual
representations. Because the scope of this project was limited by the time available during the
second half of a semester's worth of course work (for one class), an achievable goal seemed to
be the realization of a single phrase of text. Because of my Latin American background, I am
more familiar with the sound of Spanish phonemes, and my limited understanding of English
phonemes might lead me to misjudge the resultant sounds and their similarity to
English phonemes. Latin, however, seemed to be a language that could fulfill the needs of this project.
Latin is a language historically related to Western art music, which is often what we study in the
5 Heiga Zen, Keiichi Tokuda, and Alan W. Black, "Statistical parametric speech
synthesis," Speech Communication 51, no. 11 (2009): 1039–1044.
academy. At the same time, Latin vowels and consonants are closer to Spanish phonemes than
English phonemes are. I created the following sentence: "Ad sum lucem arte et laborum," which
roughly translates as "I am the light of arts and labor." I am not a religious person, but I interpret
the sentence as a manifestation of Marx and Engels's "spectre of communism," with which they
open the Manifesto of the Communist Party.6
Because vowels and most consonants can be understood as spectrally different (mainly
harmonic vs mainly inharmonic components), I decided it was better to approach them in two
different ways. Vowels can be synthesized more easily by means of additive synthesis because
they are mainly constituted of harmonic components. I then proceeded to record and analyze my
voice saying the Spanish vowels (which correspond also to most Latin vowels) /a/ /e/ /i/ /o/
and /u/ (Figures 1, 2, and 3).
Figure 1 Harmonic analysis of vowels using Spear software (panels: /a/ /e/ /i/ /o/ /u/).
6 Friedrich Engels and Karl Marx, The Communist Manifesto, (Kindle Edition).
Figure 2 Harmonic analysis of vowels using Spear7 (all selected; panels: /a/ /e/ /i/ /o/ /u/)
Figure 3 Sonogram of vowels using Max/MSP8 (panels: /a/ /e/ /i/ /o/ /u/)
7 See http://www.klingbeil.com/spear/
8 See https://cycling74.com/products/max/#.WEiJqvkrI4k
To try an additive synthesis approach to reproducing the vowels, I set up fifteen sine-wave
oscillators. The first oscillator on the left produced a fundamental pitch, which was estimated at
123 Hz, and each oscillator to its right produced a frequency resulting from the multiplication of
the fundamental by 2, 3, 4, and so on consecutively up to 14. Each oscillator was connected to a
number box, which displayed the resultant frequency in Hz, and to a gain slider, which allowed for
independent control of the volume (Figure 4).
Figure 4 Additive synthesizer using Max/MSP
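The oscillator bank described above can be approximated outside Max/MSP with a short sketch. It assumes a bank of sine oscillators at integer multiples of the 123 Hz fundamental, each with its own gain. The gain values in the example are hypothetical placeholders, not the settings used for any vowel in this project, and the function names are illustrative.

```python
import math


def harmonic_frequencies(fundamental_hz, count):
    """Frequencies of the first `count` harmonics (the fundamental is harmonic 1)."""
    return [fundamental_hz * k for k in range(1, count + 1)]


def additive_tone(fundamental_hz, gains, seconds=1.0, sample_rate=44100):
    """Sum one sine oscillator per gain value; gains[k] scales harmonic k + 1.

    The output is divided by the sum of the gains (assumed non-negative) so
    that the samples stay within [-1, 1].
    """
    n_samples = int(seconds * sample_rate)
    total = sum(gains) or 1.0
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(
            g * math.sin(2.0 * math.pi * fundamental_hz * (k + 1) * t)
            for k, g in enumerate(gains)
        )
        samples.append(value / total)
    return samples


if __name__ == "__main__":
    # Hypothetical gain profile for illustration only.
    gains = [1.0, 0.8, 0.6, 0.9, 0.7, 0.3, 0.1, 0.05, 0.02, 0.02, 0.01, 0.01, 0.0, 0.0]
    print(harmonic_frequencies(123.0, len(gains)))
    tone = additive_tone(123.0, gains, seconds=0.5)
    print(len(tone), "samples")
```

Storing one gain list per vowel corresponds to the list of messages used to switch quickly between vowels in the patch.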
I observed the harmonic structure of each vowel and the relative intensity of each
harmonic, and used that information as an estimate to produce vowels using the additive
synthesizer. The model was based only on approximations because the production mechanisms are
different. Perception played the most important role in tweaking the gain for each oscillator. I
was satisfied when I identified the sound as a vowel rather than as separate oscillators.
Comparing the results for each vowel was also important because the formants of one vowel
were easier to perceive in relation to those of other vowels. The results for each
vowel were stored as a list of messages for all the oscillators, allowing me to move quickly from
one vowel to another. I recorded the result for a sequence of vowels and analyzed it using Spear
(Figures 5 and 6). While the harmonic analysis of the synthesized vowels is evidently less
complex than that of real speech, the approximation is enough to recognize the timbral
difference between vowels.
Figure 5 Analysis of vowels produced by additive synthesis (all selected; panels: /a/ /e/ /i/ /o/ /u/)

Figure 6 Analysis of vowels produced by additive synthesis (panels: /a/ /e/ /i/ /o/ /u/)
To produce a similar effect with traditional-ensemble acoustic instruments, it was
necessary to consider the timbral qualities of those instruments. I selected the clarinet as the main
instrument since, when played at a very low dynamic, its timbre resembles that of a sine wave.9 The
translation, however, is not free of technical issues: a bass clarinet is needed to produce the pitch
closest to 123 Hz, which is B2 (123.47 Hz), and for the sound to resemble the timbre of a
sine wave it is crucial that the note be played pianissimo. Otherwise, the result
includes a series of partials that have undesired effects on the sum during the synthesis, and
therefore on the resultant timbre. Similarly, because of the relative intensity of the partials that
constitute each vowel, very high pitches are required, which are technically almost impossible
to produce at the required dynamic, especially considering that the lower partials are also played
pianissimo.
These and other issues are addressed at the end of the report.
For now, an approximation using samples is treated with as much freedom as possible. To run
9 Joe Wolfe, "Clarinet acoustics: an introduction," University of New South Wales
School of Physics, 2016. http://newt.phys.unsw.edu.au/jw/clarinetacoustics.html#pff
a first trial, I downloaded samples from the University of Iowa Electronic Music Studios
website.10 Pitches were determined by taking the frequencies produced by the oscillators in the
additive synthesizer and selecting the closest semitone. Using Audacity, a free digital audio
workstation (DAW), I loaded each sample into a separate track, which allowed me to control
the gain of each partial independently. While all samples corresponded to a pianissimo
performance of the given pitch, a great deal of gain control had to be used to produce a
timbre that could be identified as a vowel. This time, the model was taken from the additive
synthesizer, and again the results were slightly altered to produce the desired effect. A harmonic
analysis of the pseudo-analog synthesis of vowels is shown in Figures 7 and 8.
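The step of choosing the closest semitone for each oscillator frequency can be sketched as follows. The sketch assumes twelve-tone equal temperament with A4 = 440 Hz and sharp-based note names; the tuning reference and the function name are assumptions, since the text does not specify them.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_semitone(freq_hz, a4_hz=440.0):
    """Return (note name, tempered frequency, deviation in cents) for freq_hz."""
    # MIDI note numbers place A4 at 69, with one unit per semitone.
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    tempered_hz = a4_hz * 2.0 ** ((midi - 69) / 12)
    cents = 1200 * math.log2(freq_hz / tempered_hz)
    name = f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"
    return name, tempered_hz, cents


if __name__ == "__main__":
    fundamental_hz = 123.0  # fundamental estimated in the text
    for k in range(1, 15):
        name, tempered_hz, cents = nearest_semitone(fundamental_hz * k)
        print(f"partial {k:2d}: {fundamental_hz * k:7.1f} Hz -> {name} "
              f"({tempered_hz:.2f} Hz, {cents:+.0f} cents)")
```

The cents column shows how far each tempered pitch falls from the ideal harmonic, which is one source of difference between the sampled aggregate and the sine-wave model.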
Figure 7 Pseudo-analog synthesized vowels (all selected; panels: /a/ /e/ /i/ /o/ /u/)
10 Lawrence Fritts, Musical Instruments Samples, University of Iowa Electronic Music Studios.
http://theremin.music.uiowa.edu/MIS.html.
Figure 8 Pseudo-analog synthesized vowels (panels: /a/ /e/ /i/ /o/ /u/)

As suggested earlier, I relied more on intuitive approximation than on exact data. It can be
noted, however, that there are similarities in the analyses of the vowels across the different media.
For instance, in Figures 1 and 2 the difference between the vowels /e/ and /i/ can be observed
mainly in the less nuanced harmonics of /i/ in the area between 500 and 750 Hz, and
the total lack of harmonics in the 750 to 2000 Hz area. Also, the vowel /e/ presents more nuanced
harmonics than the vowel /a/ in the 200 to 500 Hz area, but the progression of harmonics upwards for
/e/ is less gradual, and there is a visible weakening of harmonics in the area between 750 and
1500 Hz, while for /a/ the intensity decreases gradually up to the 1600 Hz area, where
there is a visible lack of harmonics. The vowel /o/ has a structure similar to that of /a/, but with more
nuanced harmonics in the 200 to 500 Hz area and a visible lack of harmonics in the 1200 Hz
area. /u/ appears similar to /o/, but with less nuanced harmonics in the 500 to 1200 Hz area, where
/u/ keeps losing harmonics gradually, in contrast to the more sudden lack of them in /o/. The
difference between /o/ and /u/, however, seems to be much clearer in the higher register, which is
not displayed in the figures. Similar tendencies can be observed in Figures 5 and 6, and 7 and 8.
Differences might be explained by the contrasting degrees of complexity contributed by

inharmonic sounds in the cases of the original speech and the pseudo-acoustic representation in
relation to the additive synthesized case.
Two alternative approaches to the synthesis of vowels were explored: one including flutes
as well as clarinets, and one using only clarinets but taking into account the harmonic structures of
the individual pitches produced at different dynamics (using mainly the lower pitches to produce
the aggregate). These experiments were not explored in enough depth to produce satisfying
results. In the first case, my lack of understanding of the flute spectrum produced undesired
inharmonic elements. In the second case, there was not enough independent control over the
gain of each partial, which made the result far less controllable.
To study the consonants' spectra, I recorded my voice saying various syllables, such as
apa, aba, opo, ola, and so on. Then, once I had compared some of my analyses with the descriptions
and analyses included in articles and electronic resources about phonemes, I proceeded to record
my voice saying the proposed sentence in Latin.11 Figure 9 shows the spectral analysis of the
whole sentence.
Figure 9 Speech analysis using Max/MSP (segments: Ad sum lucem arte et laborum)
11 Ad sum Lucem Arte et Laborum

Figure 10 is another analysis of the sentence (same recording) using the Emu online software.12 I
only included part of the sentence here, since it was the section I was able to realize using the
pseudo-analog method.
Figure 10 Speech analysis using Emu (segments: Ad sum lucem)

Spectrograms were considerably better than harmonic analyses for interpreting consonants. /l/
and /m/ have a relatively harmonic spectrum, and their synthesis was approached in the same way as
in the case of vowels. /d/, /s/, and /ch/, however, include much more noise and lack a clear
harmonic structure. /d/ is a stop phoneme, so the effect was achieved by suddenly cutting the
sound and allowing a softer, considerably less rich (in terms of partials) aggregate to appear right
after the stop, as the resonant segment of the /d/ sound. This is not a very accurate representation
of the phoneme, but it is close enough for the ear to interpret when put in a larger context. The /s/
sound, as can be observed, has an extremely rich spectrum, consisting mainly of noise and
without a clear harmonic structure, and can be classified as a colored noise.13 As the average
listener cannot really differentiate between very similar colored noises, the task was to find a
percussive sound whose spectrum was also a colored noise with a similar appearance. The sound
12 See http://ips-lmu.github.io/EMU-webApp/
13 Joshua Fineberg, "Appendix 1: Guide to the Basic Concepts and Techniques of Spectral
Music," Contemporary Music Review 19, no. 2 (2000): 91.
of a hi-hat when opened by the foot mechanism was the most similar case among the available
sounds.14 An artificial envelope was applied to avoid the initial percussive hit of the hi-hat. A
similar approach was applied to the phoneme /ch/, but some high frequencies were added to
resemble its whistle-like characteristic. The result of the proposed approach can be heard at
https://soundcloud.com/camilo-ignacio-leal-molina/clar-ad-sum-lucem-1. Figure 11 shows a
spectrogram of the pseudo-acoustic synthesis realized with the Emu web app.
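The artificial envelope used to soften the attack of the hi-hat can be illustrated with a minimal sketch. The shape of the envelope is not specified in the text, so a linear fade-in is assumed here; the function name, the fade length, and the noise burst standing in for the hi-hat sample are hypothetical.

```python
import random


def apply_fade_in(samples, fade_seconds, sample_rate):
    """Scale the first fade_seconds of the signal by a linear ramp from 0 to 1."""
    fade_len = min(len(samples), int(fade_seconds * sample_rate))
    out = list(samples)
    for n in range(fade_len):
        out[n] *= n / fade_len  # the gain rises linearly, masking the initial hit
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    sample_rate = 44100
    # White noise stands in for the hi-hat sample in this illustration.
    burst = [rng.uniform(-1.0, 1.0) for _ in range(sample_rate // 2)]
    softened = apply_fade_in(burst, 0.05, sample_rate)
    print(abs(softened[0]), len(softened))
```

A longer fade removes more of the percussive onset but also shortens the steady noise that carries the /s/ color, so the fade length is a matter of listening.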
Figure 11 Spectrogram realized with Emu (segments: Ad sum lucem)

Several elements were ignored in this study, mainly because of time constraints. Future
exploration should include the study of specific transitions between consonants and vowels and
between one vowel and another, as there are specific harmonic structures and spectral elements that
would help listeners to better interpret the phonemes. In addition, real acoustic synthesis should be
attempted in order to understand the technical limitations of the proposed approach.
14 Fritts, Musical Instruments Samples

Bibliography
Engels, Friedrich, and Karl Marx. The Communist Manifesto. Kindle Edition.
Fineberg, Joshua. 2000. "Appendix 1: Guide to the Basic Concepts and Techniques of Spectral
Music." Contemporary Music Review 19 (2): 81–113.
Fineberg, Joshua. 2000. "Spectral Music." Contemporary Music Review 19 (2): 1–5.
doi:10.1080/07494460000640221.
Fritts, Lawrence. 2016. University of Iowa Electronic Music Studios.
http://theremin.music.uiowa.edu/MIS.html.
Lemmetty, Sami. 1999. Review of Speech Synthesis Technology (Master's Thesis). Finland:
Helsinki University of Technology.
Murail, Tristan. 2005. "Villeneuve-lès-Avignon Conferences, Centre Acanthes, 9–11 and 13 July
1992." Contemporary Music Review 24 (2–3): 187–267.
doi:10.1080/07494460500154889.
Rose, François. 1996. "Introduction to the Pitch Organization of French Spectral Music."
Perspectives of New Music 34 (2): 6–39.
Wolfe, Joe. 2016. University of New South Wales School of Physics.
http://newt.phys.unsw.edu.au/jw/clarinetacoustics.html#pff.
Zen, Heiga, Keiichi Tokuda, and Alan W. Black. 2009. "Statistical parametric speech
synthesis." Speech Communication 51 (11): 1039–1064.
doi:10.1016/j.specom.2009.04.004.
