Degree thesis in
Automatic Language Processing
Supervisor:
Presented by:
FILIPPO BONORA
Co-supervisor:
Dott.ssa GLORIA GAGLIARDI
Third session
Academic year
2014-2015
Contents
Abstract.................................................................................................................................................3
Riassunto..............................................................................................................................................5
Chapter I: framework...........................................................................................................................7
1.1 Some characteristics of human voice.........................................................................................7
1.1.1 Automatic approach to acoustic speech processing.........................................................12
1.2 About speaker profiling...........................................................................................................13
Chapter II: ..........................................................................................................................................17
2.1 Italian languages versus standard Italian.................................................................................17
2.2 Diatopic varieties of Italian......................................................................................................20
Chapter III: methods...........................................................................................................................25
3.1 Accent identification problem..................................................................................................25
3.2 Mel-Frequency Cepstrum Coefficients (MFCCs)...................................................................28
3.3 Machine-learning for speech processing.................................................................................30
3.4 Literature..................................................................................................................................34
3.5 Introduction to next experiments.............................................................................................37
Chapter IV: experiments.....................................................................................................................40
4.0 Data: CLIPS telephonic corpus................................................................................................40
4.1 Two surveys to explore human perception of Italian accents..................................................41
4.1.1 Results and discussion.....................................................................................................46
4.2 Tools.........................................................................................................................................50
4.2.1 openSMILE......................................................................................................................51
4.2.2 Weka.................................................................................................................................53
4.3 Composing data sets................................................................................................................54
4.3.1 Feature extraction with openSMILE...............................................................55
4.3.2 Detecting outliers in CLIPS telephonic corpus................................................................56
4.3.3 Adding information for data mining................................................................57
4.4 Speaker variable is a confounding variable.............................................................................58
4.4.1 Gender variable................................................................................................................66
4.5 Linguistic areas........................................................................................................................68
4.6 Three way classification task...................................................................................................76
4.7 Removing short samples..........................................................................................................79
Chapter V: conclusion.........................................................................................................................83
5.1 Summary of experiments..........................................................................................83
5.2 Discussion................................................................................................................................85
5.3 Propositions for further work...................................................................................................86
APPENDIX 1 ....................................................................................................................................89
APPENDIX 2.....................................................................................................................................89
APPENDIX 3.....................................................................................................................................90
APPENDIX 4.....................................................................................................................................91
APPENDIX 5.....................................................................................................................................92
APPENDIX 6.....................................................................................................................................93
APPENDIX 7.....................................................................................................................................94
APPENDIX 8.....................................................................................................................................98
APPENDIX 9.....................................................................................................................................99
APPENDIX 10.................................................................................................................................100
References........................................................................................................................................102
Abstract
The principal aim of this thesis is to investigate the feasibility of developing an automatic
classifier of regional Italian accents, exploring techniques and methods from a cross-disciplinary
literature that combines computational phonetics, forensic linguistics, machine learning and
dialectometry. At the same time, this work can be read as the study and application of a certain
representation of the sound spectrum widely used in speaker recognition, namely the
Mel-frequency Cepstrum and its coefficients (MFCCs).
A system able to recognize regional inflections can be useful in two applications: 1) if it is
trained on a sufficiently fine-grained data set, it can serve investigative purposes. Speaker
profiling for forensic applications is a domain concerned with collecting attributes of a speaker
from a phone call or, in general, a recording. Indeed, various features of an individual can be
inferred from the voice, such as gender, weight, height, smoking habits, etc. One of the most
discriminating traits of a speaker is surely his or her geo-linguistic provenance, notably in
countries with a high degree of linguistic fragmentation like Italy. 2) It is well known that one of
the aspects that most undermine the performance of an automatic speech recognition system is,
indeed, the regional accent of the speaker. An extension identifying the geo-linguistic provenance
would allow such a system to read and predict the error (or rather, the deviation from the
standard language) that the speaker makes. More generally, speaker profiling could represent the
future direction for vocal interfaces, which will tend to adapt to the characteristics of their users:
similarly to what humans do with familiar voices [cf. for instance Souza et al 2013].
To carry out our set of experiments we used the telephonic sub-corpus of CLIPS (Corpora e
Lessici dell'Italiano Parlato e Scritto), distributed across 15 regional varieties of Italian. With 40
arbitrarily chosen samples we created two surveys/quizzes, shared on a linguistics blog, with the
purpose of studying how L1 Italian speakers perform on average in recognizing regional accents
from short telephonic audio samples. The rest of the corpus was used to train various machine
learning models to classify samples on the basis of the speaker's accent. Classification tasks were
set up at various levels of granularity: by CLIPS variety (15 regional Italians corresponding
to 15 linguistically representative cities), by linguistic area (7 in total, chosen on the basis of
dialectological criteria, i.e. isoglosses) and by linguistic macro-area (3 in total). These classifiers
were built with several methods and algorithms, and therefore achieved very different
performances. We subsequently analyzed the results and general behavior of the classifiers in
light of the human performance on the online tests. We then set up some experiments in order to
verify whether the criteria used by humans to predict the provenance of a speaker are similar to
those of the machine: for example, whether the machine can perceive a connection between two
varieties belonging to the same linguistic area (such as Napoli and Bari), or whether it gets better
results on longer (namely, richer in linguistic and spectral information) audio samples.
Everything depends on the given description of the audio sample: notably, on the modeling
based on MFCC features, the critical object throughout this work.
The first chapter of the present thesis introduces the framework: the purposes of this work,
the disciplines which deal with our task, and a short introduction to acoustic phonetics.
The second chapter deals with the regional varieties of Italian, describing them through
several examples and finally touching on some dialectological and dialectometric aspects.
The third chapter is a critical introduction to the automatic accent identification problem,
presenting the methods and instruments of our case study through the existing literature.
The fourth chapter (which includes most of the appendices preceding the bibliography) is the
experimental section. The experiments are preceded by 1) a description of the software used to
set them up and 2) a short paragraph about the problem of human perception of accent, which
introduces the discussion of the results obtained through the tests spread online.
In the last chapter we discuss the experiments, drawing our conclusions and attempting to put
forward proposals for future work.
Riassunto
The main objective of this thesis is to investigate the feasibility of an automatic classifier of
regional Italian accents, exploring techniques and methods from a cross-disciplinary literature
that combines computational phonetics, forensic linguistics, machine learning and dialectometry.
At the same time, the present work can be read as the study and application of a representation
of the sound spectrum widely used in speech and speaker recognition, namely the Mel-frequency
Cepstrum and its coefficients (MFCCs).
A system able to recognize regional inflections can be useful in two ways: 1) if trained at a
sufficiently fine granularity, it can serve investigative purposes. Speaker profiling for forensic use
is a discipline concerned with profiling a speaker starting from a phone call or a recording.
Indeed, many attributes of a subject can be determined from the voice, such as gender, weight,
height, etc. One of the most discriminating traits of a speaker is surely his or her geo-linguistic
provenance, especially in a country with high linguistic fragmentation like Italy. 2) It is well
known that one of the aspects that most undermine the performance of a speech recognition
system is precisely the speaker's regional accent. An extension identifying the (geo-)linguistic
provenance would allow the speech recognition system to read and predict the error (or rather,
the deviation from the standard) that the speaker makes. More generally, user/speaker profiling
could represent the future direction of vocal interfaces, which will tend to adapt to the
characteristics of their users: as, after all, human beings do with voices familiar to them [cf., for
instance, Souza et al 2013].
To carry out our set of experiments we used the telephonic sub-corpus of CLIPS (Corpora e
Lessici dell'Italiano Parlato e Scritto), distributed across 15 regional varieties of Italian. With 40
arbitrarily chosen samples we created quizzes/surveys, then shared on a linguistics blog, with the
aim of studying how L1 Italians perform on average in recognizing regional accents from short
telephonic samples. The rest of the corpus was used to train various machine learning models to
classify the samples according to the speaker's accent.
The classifications were carried out at various levels of granularity: by CLIPS variety (15
regional Italians corresponding to 15 linguistically representative cities), by linguistic area (7 in
total, chosen according to dialectological criteria) and by macro-area (3 in total). These
classifiers are models built with various methods and algorithms, and therefore achieved very
different performances.
Chapter I: framework
The vocal tract is essentially a tube consisting of the mouth (oral cavity) and throat
(pharyngeal cavity), with the lips at one end and the larynx at the other (the vocal folds are in
the larynx). The length of the tube can be slightly increased by rounding and protruding the
lips and by lowering the larynx (raising the larynx will slightly shorten the tube). The nose
forms another tube (nasal cavities from the nostrils to the velopharyngeal port) which can be
connected to the oropharyngeal tube (pharyngeal cavity plus oral cavity) by lowering the soft
palate (velum) to open the velopharyngeal port. The jaw can be lowered or raised and the
tongue can be moved to change the shape of the oropharyngeal tube. [Morrison 2010]
Parts of the vocal tract that can be used to produce distinctive sounds are called
articulators. They can be grouped into active and passive articulators on the basis of their
activity. The articulators that move during the process of articulation are called active
articulators, whereas organs of speech, which remain relatively motionless, are called passive
articulators. [image from Kulshreshtha et Mathur 2012]
The vocal tract is similar to a musical instrument: air is blown into the vocal tract by
compressing the lungs so as to push air between the vocal folds. One can produce a voiced
sound (a vowel, a sonorant, a nasal, a liquid...) or a voiceless sound (an obstruent, a plosive...).
We are mainly interested in the former type of sound because it conveys a large amount of
spectral information, such as the fundamental frequency (F0) and the formant distribution,
quantified by each formant (F1, F2, F3 ...). F0 is the rate at which the vocal folds vibrate
during voicing. Some speakers have longer and more massive vocal folds and others have
shorter and less massive vocal folds; on average adult males have larger vocal folds than adult
females, but there is also variation within each sex. F0 averages around 125 Hz for adult males
and 200 Hz for adult females.
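Estimating F0 from a waveform is straightforward in principle; below is a minimal sketch (illustrative only, not part of the thesis pipeline) using the autocorrelation method on a synthetic male-range signal. All function names and parameters are our own choices for the example.

```python
import numpy as np

def estimate_f0(signal, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate F0 (Hz) of a voiced frame via autocorrelation."""
    sig = signal - np.mean(signal)
    # Full autocorrelation; keep non-negative lags only.
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    # Search only the lags corresponding to the plausible F0 range.
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / best_lag

# A synthetic "voiced" frame: 125 Hz fundamental plus two harmonics,
# roughly the average F0 of an adult male speaker.
sr = 16000
t = np.arange(0, 0.05, 1.0 / sr)
frame = (np.sin(2 * np.pi * 125 * t)
         + 0.5 * np.sin(2 * np.pi * 250 * t)
         + 0.25 * np.sin(2 * np.pi * 375 * t))
print(round(estimate_f0(frame, sr)))  # 125
```

The autocorrelation peaks at the lag equal to one fundamental period, so dividing the sampling rate by that lag recovers F0.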
Another acoustic feature which can be prosodically relevant is pitch, that is, the relative
highness or lowness of a tone as perceived by the ear, which depends directly on F0. Pitch is the
main acoustic correlate of two phonetic features: tone and intonation.
To understand how formants (F1, F2, F3 ...) in a spectrogram describe modifications of the
vocal tract, consider an example from [Morrison 2010:99.460]:
"The different mouth shapes result in different resonance frequencies which
make the sound of different vowels. The primary acoustic differences between the
vowels in heed, hid, head, and had are that the first formant(F1) increases as
the constriction widens and second formant (F2) decreases. Now say the ee
sound from heed again, but this time move your tongue back until you are saying
the vowel sound from who. It turns out that moving your tongue back in your
mouth lowers F2 and that rounding your lips also lowers F2, so doing both
together has a larger effect. The most important acoustic difference between the
vowel sounds in heed and who is the change in F2 (F1 stays about the same) [...]
In many languages F1 and F2 peaks are the primary acoustic indicators of vowel
category (vowel phoneme) identity (the peak formant values rather than the exact
shape of the spectra are perceptually relevant)"
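The observation in the quote can be turned into a toy classifier: given the F1/F2 peaks of a vowel, pick the nearest vowel centroid. The formant values below are approximate averages in the spirit of classic formant measurements; they are illustrative numbers, not data from this thesis.

```python
# Approximate average F1/F2 values (Hz) for the vowels of "heed",
# "hid", "head", "had" -- illustrative textbook-style numbers.
VOWEL_FORMANTS = {
    "heed (/i/)": (270, 2290),
    "hid (/I/)":  (390, 1990),
    "head (/E/)": (530, 1840),
    "had (/ae/)": (660, 1720),
}

def classify_vowel(f1, f2):
    """Nearest-centroid vowel guess from the first two formant peaks."""
    return min(VOWEL_FORMANTS,
               key=lambda v: (f1 - VOWEL_FORMANTS[v][0]) ** 2
                           + (f2 - VOWEL_FORMANTS[v][1]) ** 2)

# F1 rises and F2 falls as the constriction widens from "heed" to "had".
print(classify_vowel(280, 2250))  # heed (/i/)
print(classify_vowel(640, 1700))  # had (/ae/)
```

This is exactly the sense in which "the peak formant values rather than the exact shape of the spectra" carry the vowel identity.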
"In phonetics, voice-onset time (VOT) is a feature of the production of stop consonants. It is
defined as the length of time that passes between the release of a stop consonant and the onset
of voicing, the vibration of the vocal folds, or, according to other authors, periodicity. " (source:
Wikipedia)
The general idea of Fourier analysis, a branch of mathematics, is that a complex and
continuous function can be approximated by the sum of various simpler trigonometric
functions. The decomposition process itself is called a Fourier transform.
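Numerically, this decomposition can be demonstrated in a few lines: the discrete Fourier transform of a composite signal reveals the frequencies and relative strengths of the sinusoids it contains (the frequencies chosen here are arbitrary, for illustration only).

```python
import numpy as np

# A composite signal: the sum of two sinusoids (200 Hz and 500 Hz),
# standing in for a complex periodic waveform.
sr = 8000                       # sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / sr)
signal = 1.0 * np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 500 * t)

# The DFT decomposes the signal into its sinusoidal components.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)

# The two strongest components match the original frequencies.
top = freqs[np.argsort(spectrum)[-2:]]
print([float(f) for f in sorted(top)])  # [200.0, 500.0]
```

With a one-second window the frequency bins are spaced at exactly 1 Hz, so the two component frequencies are recovered exactly.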
"Landline telephone systems only transmit frequencies between about 300 Hz and 3.4 kHz
(this is known as a bandpass) and distort frequencies close to the edges of the bandpass [. . .]
Some vowels such as /i/ and /u/ have intrinsically low F1 which for male speakers may be
affected by the low end of the bandpass. F3 and above for females and F4 and above for males
are likely to be affected by the high end of the bandpass [. . .] Mobile-telephone systems also
apply a bandpass to the signal; the low end of the bandpass is maintained at 100 Hz (lower
than for a landline system), and the high end varies between 2.8 kHz and 3.6 kHz. But in
addition, mobile systems use compression and decompression algorithms (codecs) to reduce
the amount of data sent, and this results in further deterioration of the signal." [Morrison
2010:99.610]
One of the main goals of this work is to test the effectiveness of MFCCs on the accent
recognition task (thus, a sub-task of speaker profiling). We will talk more closely about this type
of feature in chapter n.
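Since MFCCs are the central representation of this work, it may help to sketch the textbook extraction recipe (power spectrum, triangular mel filterbank, log compression, DCT). This is a simplified illustration under our own parameter choices, not the actual openSMILE configuration used in the experiments.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping used for MFCC filterbanks.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_filters=26, n_ceps=13):
    """Compute MFCCs for a single pre-windowed frame (textbook recipe)."""
    # 1) Power spectrum of the frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # 2) Triangular filters equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    fbank_energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0, 1)
        down = np.clip((hi - freqs) / (hi - mid), 0, 1)
        fbank_energies[i] = np.sum(power * np.minimum(up, down))
    # 3) Log compression (mimics loudness perception).
    log_e = np.log(fbank_energies + 1e-10)
    # 4) DCT-II decorrelates the log energies -> cepstral coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return basis @ log_e

sr = 8000
t = np.arange(0, 0.025, 1.0 / sr)           # one 25 ms frame
frame = np.sin(2 * np.pi * 440 * t) * np.hamming(len(t))
coeffs = mfcc(frame, sr)
print(coeffs.shape)  # (13,)
```

In practice a toolkit such as openSMILE applies this recipe frame by frame over the whole recording, producing a sequence (or statistical summary) of coefficient vectors.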
The main claim on which speaker recognition bases its epistemological reasons is that
everyone's voice is individual and distinct from everyone else's. As [Kersta et al 1962]
state, voiceprints (as they named the spectrograms) are unique and individualistic in nature,
remaining unchanged throughout the lifetime of the speaker, even though he/she grows old,
loses tonsils, teeth, or adenoids. Even if it is reckless to claim the infallibility of these voiceprints
(as Kersta did), this fact was substantially confirmed by the large experimental study initiated
by Tosi and associates at Michigan State University early in 1968 and concluded in 1970 [Tosi
1972]. Nowadays voice, being considered a behavioral biometric, is accepted as judicial
evidence [Jessen 2007] in many countries.
The standard speaker recognition casework in forensic applications is the following:
Typically in forensic speaker recognition, a recording of an unknown voice,
usually of an offender, is to be compared with recordings of a known voice, usually
of the suspect or defendant. The interested parties (police, court) want to know if
the unknown voice comes from the same speaker as the known. [Rose 2006]
Nevertheless, it is not always the case that we have two recordings at our disposal:
sometimes the only clue to a criminal's identity is his or her language. When
that is the case, a linguistic profile can be a useful tool. Forensic linguistic profiling
is the analysis of language to infer attributes of a speaker or writer from his or her
linguistic characteristics. Speaker and author profiling are used when there is an
unknown perpetrator, and investigators need to narrow down the pool of potential
suspects by identifying linguistic features that can be associated with particular
geographic areas, social groups, or unusual pathologies [Schilling et Marsters
2015:196]
Thus, during a police investigation, a linguistic/phonetic speaker profile can be useful. In
[appendix 10] we provide the reviews of two real cases.
Speaker recognition techniques are also used to implement person authentication in
security systems: telephone banking, telephone shopping, database access services,
information services, voice mail, security control for confidential information areas,
remote access to computers, etc. [Chauhan et al 2013].
Conversely, speaker profiling (and notably the methods used to make inferences about a
speaker's geographic or cultural background) can provide crucial attributes to improve speech
recognition performance: such information enables effective adaptation of speech and
language processing systems, e.g., by switching to specialized acoustic, pronunciation, or
language models in speech recognition [Akbacak et al 2011].
Speaker profiling/recognition for commercial purposes can be text-dependent (the speaker
has to pronounce a given sound) or text-independent (the system virtually covers any
manifestation of speech), and it has to work as a fully automatic system.
Speaker profiling/recognition for forensic applications is generally text-independent,
because the speaker in the recording is not cooperative: he obviously has no interest in
being recognized, and sometimes he even tries to disguise his voice [Farrús 2009]. On the other
hand, technical4 forensic speaker recognition/profiling can be performed in various ways, with
an automatic, a computer-assisted or a traditional approach, exploiting acoustic features or also
4
In some cases, forensic voice comparison is carried out by non-experts, a fact that [Rose 2006]
calls naive forensic speaker recognition. This approach is obviously risky (it has historically led
to some errors: see ibid.) and is discouraged by the academic community.
The economic growth of the Sixties contributed enormously to the spread of a single model
of Italian. As of 2012 [ISTAT 2012], almost the entire population knew the Italian language
(at least passively), although more than 50% of people speak dialect within the family, being
effectively bilingual5 subjects.
Even if the Italian languages [Berruto 1993: 3-36] often present a low structural distance6
compared to standard Italian, it is more proper to consider them independent languages or
dialects, not Italian diatopic varieties. Indeed, these languages historically have, in some
cases even more than the highly literary standard Italian, a wide range of registers for different
diaphasic contexts. Moreover, according to [Coseriu 1973], Italian dialects are primary
dialects: they have developed since the dissolution of spoken Latin, exactly
like the Tuscan of Florence, even if the latter (in its literary shape) has become the national
standard language.
Unlike dialects/languages, regional varieties are full-fledged diatopic varieties of standard
Italian. According to Berruto [1987:17], they are "the wide range of phenomena occurring
between literary [standard] Italian and dialects [...]". In Italy the primary source of linguistic
diversification is geographic distribution, along the diatopic axis.
5
Actually it is incorrect to speak of bilingualism in this case, since that term refers to two
languages with the same political rights, whereas the Italian languages are not official and
cannot be used in administration or instruction. In Italy we might rather observe an example
of diglossia: "Diglossic languages (and diglossic language situations) are usually described as
consisting of two (or more) varieties that coexist in a speech community; the domains of
linguistic behavior are parceled out in a kind of complementary distribution." [source:
ccat.sas.upenn.edu]. Another terminological issue we would like to stress is the difference
between language and dialect in the Anglo-Saxon and Italian linguistic traditions: while in
the former dialect generally means a linguistic variety, in the latter a dialect is a real language
spoken in some geographical area. Thus, there are no structural differences between a language
and a dialect, except that the latter lacks political recognition. We will use the word
dialect in this second sense (Italian linguistic tradition), speaking instead of linguistic
variety when we want to refer to some sociolinguistic variation of a single language.
6
Nevertheless, the structural distance between standard Italian and dialects can be quite
relevant in some cases, concerning not only the phonetic and lexical levels, but also the
morphological and syntactic ones [cf. Maiden and Parry 1996].
Regional Italians are a relatively recent phenomenon. After unification in 1861 and the
consequent introduction of Italian as the nation's official language, dialects slowly started to
shift toward standard Italian (and, to a lesser extent, vice versa). As a result, neo-standard
Italian has come to include these sorts of smoothed varieties of dialects: not only pronunciation
features, but also lexical items and, in its oral use, morpho-syntactic hallmarks.
Thereby, from a stillborn language, Italian has become, step by step, a living and constantly
evolving one. On the one hand, dialects survive in the countryside, in some villages and even in
some cities; on the other hand, standard Italian has changed by accepting various dialectal
hallmarks, producing several diatopic varieties. Consequently, through their varieties,
members of a community show (consciously or unconsciously) their sociocultural identity:
by using a variety, a speaker provides some information about his sociocultural positioning
[Berruto 2011]7. Thus, linguistic varieties, and notably the regional varieties of Italian,
can be an interesting object for the speaker profiling domain [cf. chapter 1].
Below, a chart describing the architecture of contemporary Italian along the diamesic,
diaphasic and diastratic axes is provided [the two images are from Berruto 1993]:
Italian's diglossic nature, as well as dialectal features, was smoothed by the development of
regional varieties: the italianization of dialects, especially at the phonetic level, is well-described
in [Grassi et al 2003:257]. However, the rise of a regional variety does not imply a greater
awareness among people in their use of Italian and/or a lesser one in the use of their own
dialect. For instance, [D'Addario 2015:377] shows how speakers from Taranto (in the south-east
of Italy) are not always aware of using dialectal expressions instead of standard Italian ones. As
an example, D'Addario cites the verb scendere (to come down), which in standard Italian is
intransitive, whereas in the southern area it is also used as a transitive verb.
This scheme concerns just a single geographical variety: the diatopic axis is not present
because, conceptually, it embeds the other linguistic variables.
"Isogloss is the imaginary line we can use to connect the extreme points of an area
characterized by the presence of a same linguistic phenomenon" [Vignuzzi 2010]. Even if it
describes verifiable facts, the isogloss is a traditional method of dialectology, and it therefore
cannot be applied without making subjective choices. This fact is described in more detail by
[Kessler 1995]. We will have to keep this aspect in mind when, in [paragraph 4.5], we test the
computational soundness of the concept of linguistic area: areas and macro-areas, like
isoglosses, are partially based on intuition. New computational approaches in dialectometry
promise to improve the precision of dialect clustering [Szmrecsanyi 2011; Heeringa 2004],
even if in Italy this discipline is not very developed yet [a remarkable work on Tuscan dialects:
Montemagni 2007].
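Computational dialectometry in the tradition of Heeringa typically measures the distance between two varieties as the average edit (Levenshtein) distance between their phonetic renderings of the same word list. A minimal sketch with illustrative transcriptions (hypothetical, not CLIPS data):

```python
def levenshtein(a, b):
    """Edit distance between two transcriptions (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def dialect_distance(words_a, words_b):
    """Aggregate dialect distance: mean edit distance over a word sample."""
    return sum(levenshtein(x, y) for x, y in zip(words_a, words_b)) / len(words_a)

# Illustrative renderings of the same two words in two hypothetical varieties.
variety_1 = ["kapelli", "bello"]
variety_2 = ["kavei", "beddu"]
print(dialect_distance(variety_1, variety_2))  # 3.0
```

Pairwise distances computed this way over many varieties yield the distance matrix on which dialect clustering operates.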
The six main colors roughly represent six broad geo-linguistic areas, and thus six broad
regional varieties of Italian. We added to these a seventh variety, namely the Sardegna regional
Italian, which is not represented here since this is a chart of Italian dialects, and Sardo is not
considered an Italian language, like Ladino and Friulano. We decided to include this variety
because, even if the spoken language is not an Italian language, a regional Italian accent
developed in parallel over the twentieth century: the latter has its own features and is affected
by the native language [cf. Lorinczi 1999].
In this chart we have pointed out the cities where the CLIPS corpus9 data were collected.
Almost all of these cities play a socio-linguistically leading role in their area, spreading
their linguistic variety: this phenomenon is called linguistic koinè and, according to [Grassi et
al 2003:176], it is one of the most important forces which has contributed to the "italianization"
of dialects.
Above the "La Spezia-Rimini" isogloss there is a wide northern macro-area, which we can
mainly divide into the Veneto area (various shades of yellow in the chart above) and the
Gallo-Italic area (purple shades). The main phonetic deviations from standard Italian are the
lenition of intervocalic or geminate consonants, which can go as far as complete elision (Italian
/kapelli/ becomes /kavei/); assibilation (Italian /cera/ becomes /sira/); the deletion of
unstressed final vowels except /a/; and the presence of some vowels from Occitan, a
phenomenon that occurs in the Gallo-Italic area but not in the Veneto area. Another hallmark of
the Venetian variety (represented in CLIPS audio samples) is the retroflex /r/.
Below the "La Spezia-Rimini" isogloss, which "deeply divides Italian linguistic varieties and
is moreover the most important linguistic border of Latin Europe" [Vignuzzi 2010], we
encounter another important isogloss, identified by Gerhard Rohlfs in 1937: the Roma-Ancona
isogloss. Together with the La Spezia-Rimini one, it defines a median macro-area that is
relatively uniform, especially with regard to vowel trends. However, we can identify two distinct
areas: the Tuscan one (shades of brown in the chart) and the Median one (shades of red). The
Tuscan area is characterized by phenomena like the regressive assimilation of adjacent
consonants and a typical lenition of some fricatives and plosives called Gorgia (Italian /tɛkniko/
becomes /tɛnniho/). In the Median area, the assimilation of adjacent consonants is rather
progressive; we can furthermore observe the sonorization of some consonants like /p/ /t/ /k/
/f/ /s/ (Italian /andate/ becomes /annade/) and the affrication of some fricatives (Italian
/borsa/ becomes /bortsa/).
The Roma-Ancona isogloss actually has fuzzy borders, so the transition between the Median
area and the Southern area (shades of blue in the chart) can be perceived as a continuum. Some
crucial features of the latter area (of which Naples has been the cultural and linguistic center
for many centuries) are metaphony and the near-deletion of the final vowel (namely its
reduction to schwa): e.g. Italian /neri/ becomes /nirə/.
The extreme southern area (shades of green in the chart above) includes Sicily, the southern
part of Puglia and most of Calabria. In this area the (Sicilian) vowel system is older than the
standard Italian one: instead of 7 vowels (/a/ /e/ /ɛ/ /i/ /ɔ/ /o/ /u/) there are only 5 (the same
ones, except /e/ and /o/). Moreover, retroflex consonants are quite frequent in this variety
(Italian /bɛllo/ becomes /bɛɖɖu/).
Finally, the island of Sardegna is a linguistically heterogeneous area. Its dialects (old,
conservative, Catalan-influenced Neo-Latin languages) are not considered as belonging to the
Italian set, due to some crucial differences in plural construction or in the articles used. For this
reason, the regional variety of Sardegna has quite recognizable phonetic features [see also
the results of our tests in chapter n], like resistance to palatalization, gemination of
consonants, and partial palatalization of /s/.
That is obviously a non-exhaustive list of phonetic features. Moreover, we did not discuss prosodic and lexical traits, which also have relevant consequences on the respective regional varieties. However, the oral corpus we worked on concerns regional varieties of Italian, not dialects: all those hallmarks have been greatly smoothed in the movement of dialects towards standard Italian. Furthermore, the data are (semi-)text-dependent: telephone speakers are asked to read some phrases or numbers, or to simulate a scenario using some specific words. Thus, some prosodic phenomena, such as global phrasal intonation, are compromised, because they would be representative only in spontaneous speech.
A speaker profiling module could be an interesting extension of future ASR systems. The latter could improve their performance by learning about the speaker's linguistic hallmarks, such as an intonation peculiarity or a speech impediment. Among all the collectable idiolectal features there is also the regional inflection: such a feature can help the ASR system anticipate some individual speech behaviors of the speaker. For instance, it is surely useful for such a system to know in advance that an Italian from Prato (Tuscany) would never pronounce /k/ and /t/ as such when they are intervocalic and not geminated.
Nevertheless, it is not easy to build a speaker profiling system with these characteristics; rather, it is very expensive and time-consuming. The main reason is that we would have to inform the machine about the linguistic content conveyed by each phone/sound. Let's put it another way: when we try to detect, even naively, someone's provenance, we exploit not only the acoustic side of sound, but also the linguistic one. That is to say, when we listen to words like /koriandolo/ or /karˈtuttʃa/ (respectively "coriander" and "cartridge"), we already know our own regionally-biased pronunciation of the same words and we can likewise figure out how speakers from other regions realize them (or words similar to them). For instance, the first word ("coriander") is likely to show great variation across linguistic areas, especially at the vocalic level, whereas the second word ("cartridge") is likely to vary at the consonantal level. Even if one has never heard these words pronounced by this or that regional speaker, he has probably heard other, similar words, which led him to build a sort of model for each linguistic area of Italy: evidently, a large part of this process relies on intuition. Nonetheless, being aware of both the acoustic quality and the linguistic-phonetic level allows one to match, quoting Hjelmslev, the substance of expression with the form of expression.
An ASR system does exactly this, but we have to give it some pieces of speech (phones) matched with the related linguistic units (phonemes). These phones, being most of the time sampled from one individual person (usually a voice professional), are not representative of all the varieties of a language, much less of all idiolectal variations. However, if we want to add a module collecting information on accent to compensate for this inadequacy, we still need to pass through the linguistic knowledge of the ASR system: it is a vicious circle.
Thus, three options are viable: 1) to build a fine-grained metric in order to capture every possible variation of an idiolect or dialect, and then to develop a language model of a speaker using a distance measure like Levenshtein's: this is roughly what the ForensicLab of Pompeu Fabra University (Barcelona) is doing, notably for idiolectal profiling in forensic applications [Turell 2013]. For the dialectal domain we can cite instead [Kulshreshtha et Mathur 2012] and [Huckvale 2004]. (Image below from [Turell 2013])
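To make the distance-based idea concrete: the Levenshtein distance mentioned above can be computed with a standard dynamic-programming recurrence. The sketch below is a generic textbook implementation, not the ForensicLab system; the phonetic transcriptions in the example are only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# e.g. distance between a regional and a standard rendering of "andate"
print(levenshtein("annade", "andate"))  # 2 substitutions
```

Applied to aligned transcriptions of the same words, such a distance gives a rough quantitative measure of how far two pronunciations (and, aggregated, two varieties) are from each other.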
2) to rely on the acoustic side of speech, without passing through the linguistic level. The underlying hypothesis of this method is that accent differences are quantifiable at the acoustic level: we just need to find a workable acoustic-phonetic feature to capture these dialectal or idiolectal differences. This method is ideally the best one, because it is the least time-consuming. We will present some literature using this strategy for speaker profiling in [paragraph 3.4]. Furthermore, this is the method we tested and analyzed in this work, with the aim of classifying Italian regional speakers by accent. 3) The two approaches just described can be mixed. An example of this strategy is provided by [Brown 2014]; the proposed system, however, is text-dependent. We will explore this study in [paragraph 3.4].
The flowchart below conceptualizes our task, which straddles dialectometry and speaker profiling, and the various viable ways to cope with it.
The general approach of dialectometry is to build a linguistic model of a given variety out of some dialectal corpora, through quantitative approaches like cluster analysis. Speaker profiling, on the other hand, attempts to assign an attribute to a speaker. Starting from the latter, our main goal is to accomplish a classification task, targeting some already supplied classes (linguistic models). Ideally, the best way to carry out this task is to mix dialectometry and speaker profiling strategies: conceptually, it is what the authors of [Turell 2013] do, albeit for idiolectometry rather than dialectometry. Nevertheless this method is expensive in terms of time, so many studies [par. 3.4] address the classification task along a shorter path: build a machine-learning model of a variety (having previously extracted selected features from the recordings) and then perform the classification with it. This is the method we are going to test, in the next chapter, on Italian varieties.
Last, with the question mark in the middle of the arrow, we wanted to suggest the possibility of setting up a mixed model.
Mel Frequency Cepstral Coefficients (MFCCs) are used in state-of-the-art ASR systems and have proven to be among the most effective spectral features in speech-related tasks [Jurafsky 2000:329]. Furthermore, they are widely used in speaker recognition [Vibha 2010], due to their robustness to noisy channels [cf. note n in chapter 1] and their reliability on very short audio samples.
We will briefly describe what these coefficients are. The Mel scale is a psychophysical scale of pitch perception. It has been shown that human perception of sound frequency does not follow a linear scale, but approximately a logarithmic one [Houtsma 1995]. The Mel scale was introduced to have a scale consistent with the perceived height of a sound: 1000 mels correspond, by definition, to 1000 Hz (for audible sounds), and a sound perceived as twice as high is assigned twice as many mels.
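A common analytic approximation of this scale (one of several variants found in the literature, here O'Shaughnessy's formula) can be sketched as follows:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mels (O'Shaughnessy's formula,
    one of several variants used in the literature)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# by construction, 1000 Hz maps very close to 1000 mels
print(round(hz_to_mel(1000.0), 1))
```

Note how the curve is roughly linear below 1000 Hz and logarithmic above it, matching the perceptual behavior described above.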
On the other hand, the cepstrum is the result of taking the inverse Fourier transform (IFT) of the logarithm of the estimated spectrum of a signal. Operations on cepstra are labelled cepstral analysis: this is the historical father of MFCCs (if we mean the "real" FFT cepstrum and not the "complex" cepstrum or the linear predictive coding (LPC) cepstrum). Thus, MFCCs combine the advantages of cepstral analysis with a perceptual frequency scale (the Mel scale) based on critical bands.
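The "real" cepstrum just defined can be written in a few lines of numpy; this is a didactic sketch of the definition (IFT of the log magnitude spectrum), not production code:

```python
import numpy as np

def real_cepstrum(signal: np.ndarray) -> np.ndarray:
    """Real cepstrum: inverse FFT of the log magnitude spectrum.
    A small epsilon guards against log(0)."""
    spectrum = np.abs(np.fft.fft(signal))
    return np.real(np.fft.ifft(np.log(spectrum + 1e-12)))

# toy frame: the cepstrum is real-valued and has the
# same length as the input frame
frame = np.sin(2 * np.pi * 50 * np.arange(256) / 8000.0)
c = real_cepstrum(frame)
print(c.shape)
```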
Speech analysis assumes that signal properties change slowly with time. This motivates short-time, window-based processing of the speech signal to extract its parameters. Every 10 ms, a Hamming window is applied to the signal, and the Fast Fourier Transform (FFT, see [note n chapter 1]) is used to compute the short-term spectrum. 20 overlapping Mel-scale triangular filters are applied to this short-term spectrum; the output of each filter is the sum of the weighted spectral magnitudes. A Discrete Cosine Transform of the logarithm of the filter outputs yields the cepstral coefficients. The figure below, taken from [Sinha et al 2015], represents the steps of the MFCC computation process.
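The steps just described can be sketched end-to-end for a single frame. This is a simplified didactic version (filter count, frame length and other constants are illustrative), not the implementation used by openSMILE or any production front-end:

```python
import numpy as np

def mfcc_frame(frame, sample_rate=8000, n_filters=20, n_coeffs=13):
    """Compute MFCCs for one frame: Hamming window -> FFT ->
    mel triangular filterbank -> log -> DCT."""
    n_fft = len(frame)
    # short-term magnitude spectrum of the windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft)))

    # triangular filters equally spaced on the mel scale
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz2mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)

    # weighted sum per filter, then log
    log_energies = np.log(fbank @ spectrum + 1e-12)

    # DCT-II of the log filterbank energies gives the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energies

coeffs = mfcc_frame(np.random.randn(200))
print(coeffs.shape)  # (13,)
```

In a real system this function would be applied every 10 ms over overlapping frames, producing one coefficient vector per frame.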
graphic generative model for the joint probability distribution p(x,y). Similarly, Markov models and Bayesian networks share strong (naive) independence assumptions between the features. The former, however, is a dynamic Bayesian network, i.e. it relates variables to each other over adjacent time steps, which is why it is used in ASR. The main task of HMMs in ASR is to match phones with phonemes, maximizing their probability. A hidden Markov model is just like a regular Markov model in that it describes a process that goes through a sequence of states. The difference is that in a regular Markov model the output is a sequence of state names, and because each state has a unique name, the output uniquely determines the path through the model. In an HMM, each state has a probability distribution of possible outputs, and the same output can appear in more than one state. These models are called 'hidden' because "the true state of the model is hidden from the observer. In general, when you see that an HMM outputs some symbol, you can't be sure what state the symbol came from." [Russell et al 2010]. A well-known search algorithm used to compute p(phone|phoneme) over an HMM is the Viterbi algorithm.
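A minimal Viterbi decoder over a toy two-state HMM can illustrate the idea; the model and its probabilities below are invented for the example, and this is of course far simpler than the phone lattices searched by a real ASR system:

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence.
    start_p[i], trans_p[i, j] and emit_p[i, o] are probabilities."""
    n_states = len(start_p)
    T = len(obs)
    delta = np.zeros((T, n_states))           # best path probability so far
    psi = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * trans_p[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * emit_p[j, obs[t]]
    # backtrack from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# toy model: state 0 mostly emits symbol 0, state 1 mostly symbol 1
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], start, trans, emit))  # [0, 0, 1, 1]
```

The decoded path recovers the hidden states that best explain the observations, which is exactly what an ASR system does when aligning acoustic observations to phonemes.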
We will now briefly present three other methods which are broadly deployed in speaker recognition and speaker profiling: Support Vector Machines (SVMs), Artificial Neural Networks (ANNs) and k-nearest neighbours (k-NN).
SVMs12 perform classification by finding the hyperplane that maximizes the margin between two classes; the vectors that define the hyperplane are the support vectors. An ideal SVM analysis should produce a hyperplane that completely separates the vectors into two non-overlapping classes. However, perfect separation may not be possible, or it may yield an overfitted model that does not classify correctly; in this situation the SVM finds the hyperplane that maximizes the margin while minimizing the misclassifications. The simplest way to separate two groups of data is with a straight line (in two dimensions), a flat plane (in three dimensions) or, in general, an (N-1)-dimensional hyperplane (in N dimensions). However, there are situations where a nonlinear boundary can separate the clusters more efficiently. SVMs handle this by using a (nonlinear) kernel function to map the data into a different space where a (linear) hyperplane can be used to perform the separation.
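The effect of the kernel can be seen on a toy data set of two concentric rings, which no straight line can separate. This scikit-learn sketch is only an analogy (our experiments used Weka's SMO, not scikit-learn), and the data are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two ring-shaped classes: not linearly separable in the plane
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.normal(1.0, 0.1, 100),   # inner ring, class 0
                        rng.normal(3.0, 0.1, 100)])  # outer ring, class 1
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

linear = SVC(kernel="linear").fit(X, y)  # fails: no separating line exists
rbf = SVC(kernel="rbf").fit(X, y)        # RBF kernel makes a hyperplane work
print(linear.score(X, y), rbf.score(X, y))
```

The linear model stays near chance level, while the RBF-kernel model separates the rings almost perfectly: the kernel has implicitly mapped the points into a space where the two classes are linearly separable.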
12 For our experiments we mainly deployed the Sequential Minimal Optimization (SMO) algorithm, an SVM learning algorithm that is conceptually simple, easy to implement and often faster [Platt 1998]. Weka implements an optimization of John Platt's SMO algorithm for training a support vector classifier, developed by [Keerthi et al. 2001]. The main feature of this algorithm is that it breaks down the large quadratic programming (QP) problem that arises during the training of support vector machines. Quadratic programming is a mathematical optimization problem over a quadratic function of several variables; avoiding solving it en masse improves, in our case, training speed.
(DNNs). (In the figure above, hidden nodes work as intermediate states before the output [source: Wikipedia].)
13 We never used the Weka ANNs algorithm (namely, Multilayer Perceptron) in our experiments, since it took too much time in the training phase; probably it did not fit the high dimensionality of our data set.
Finally, k-NN14 is perhaps the most straightforward classifier in the arsenal of machine-learning techniques. Using Wikipedia's definition of k-NN and lazy classifiers in general:
"Classification is achieved by identifying the nearest neighbours to a query example and using those
neighbours to determine the class of the query. [...] Because induction is delayed to run time, it is
considered a Lazy Learning technique. Since classification is based directly on the training
examples it is also called Example-Based Classification or Case-Based Classification [...] The
main advantage gained in employing a lazy learning method, such as Case based reasoning, is that
the target function will be approximated locally, such as in the k-nearest neighbor algorithm. [...]
The disadvantages with lazy learning include the large space requirement to store the entire training
data set. Particularly noisy training data increases the case base unnecessarily, because no
abstraction is made during the training phase. [...] Lazy classifiers are most useful for large data sets
with few attributes." So, interpretability and ease of implementation are the advantages, but on the
other hand k-NN is very sensitive to irrelevant or redundant features because all features contribute
to the similarity and thus to the classication. This can be ameliorated by careful feature selection
or feature weighting. [Cunningham et al 2007]
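The lazy scheme described above fits in a few lines: no model is built, and all the work happens at query time. The labels and points below are invented purely for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance). No training phase: induction is delayed
    to query time, as in any lazy learner."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# hypothetical 2-D feature vectors labeled with a (made-up) accent class
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = ["north", "north", "south", "south"]
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # "north"
```

Note that every feature dimension contributes equally to the distance, which is exactly why irrelevant or redundant features hurt k-NN so much.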
3.4 Literature
In this paragraph we review some relatively recent studies dealing with different automatic accent identification tasks. We consider studies about regional varieties of a same language, while we leave aside works about detecting an L2 speaker's provenance. Nonetheless, even if recognizing the foreign accent of an L2 speaker is perhaps an easier task because of the higher variability, the problem to tackle is roughly the same as in our work, namely detecting a geo-linguistic provenance15. Before starting the review, it is worth recalling that there are many ways to deal with
14 Weka's default k-NN algorithm, i.e. the method we used for our experiments, is named IBk (Instance-Based k learning algorithm).
15 We list some recent works in this field, which make extensive use of the MFCC feature:
1) Ullah Sameeh. (2009) A Soft Computing Based Approach for Multi-Accent Classification in IVR Systems. University of Waterloo, Ontario.
2) Torres-Carrasquillo Pedro A., Sturim Douglas E., Reynolds Douglas A., McCree Alan. (2008) Eigen-channel compensation and discriminatively trained Gaussian mixture models for dialect and accent recognition. In: ISCA INTERSPEECH 2008, pp. 723-726.
3) Piat Marina, Fohr Dominique, Illina Irina. (2008) Foreign accent identification based on prosodic parameters. In: Proceedings of INTERSPEECH 2008, pp. 759-762.
4) Pedersen Carol, Diederich Joachim. (2008) Accent in Speech Samples: Support Vector Machines for Classification and Rule Extraction. In: Studies in Computational Intelligence, 80, pp. 205-226.
this task, but two of them are primary: 1) the phonotactic method, which derives from ASR and automatic language identification (for a good review of this latter task, [cf. Zissman 1995]). This method is more expensive to set up than the others, but it is able to capture some linguistic information of the targeted language. 2) The spectral method (our case), i.e. extracting spectral information from the samples without passing through the linguistic content. This approach is bound up with the speaker recognition and signal processing domains. It is less time-consuming and perhaps the most popular nowadays.
In [Sinha et al 2015] the authors built a classifier to identify 4 prominent Hindi dialects. They trained several AANN models (Auto Associative Neural Network) on a text-dependent corpus (300 different sentences pronounced by men and women of different ages and provenances), divided by speaker between training and test sets. They alternatively used 3 different spectral features: MFCCs, PLP (perceptual linear prediction), and MF-PLP (PLP derived from a Mel-scale filter bank). These cepstrum-based features were deployed along with 3 prosodic features: energy, F0, and syllable duration; the latter was handled through a second, cooperative AANN classifier and a segmentation into syllable units. While MFCC and MF-PLP got roughly the same scores, the best result was reached with MF-PLP and all 3 prosodic features together (82% accuracy on average; 81% with MFCC instead of MF-PLP). Nevertheless, MFCC with no syllable duration information got good results too: 73% accuracy on average. It should be noted that this is a study about regional dialects and not regional accents: the differences between these varieties could be relevant.
In [Brown 2014] the author developed a system on the basis of the ACCDIST metric [Huckvale 2004], built to measure relative distances between phonological units of speakers, considering 14 English regional accents of the British Isles. Differently from other ACCDIST-based accent recognition systems (such as [Hanani 2012]: we will discuss this study in a moment), Brown's system is designed to process content-mismatched (spontaneous) speech data; on the other hand, it still needs the transcription of the data. Using ACCDIST distance metrics and the MFCC feature to train an SVM model, it achieved, on a 4-way classification, 86.7% accuracy on content-controlled data and 52.5% on content-mismatched data. Below are two flowcharts, taken from the same study, which well represent the development of the system:
In [Akbacak et al 2012] the authors studied the effectiveness of language recognition techniques for a 4-way classification task over 4 Arabic dialects, achieving a 2.47% average equal error rate. The data used consisted of 30-second telephone speech samples. The system developed is quite complex: on the one hand, the authors used dialectal and cross-dialectal phonotactic models to train an n-gram model along with an SVM model; on the other hand, they developed a GMM-UBM model (Gaussian Mixture Model-Universal Background Model), extracting a 56-dimensional vector consisting of MFCC, SDC (shifted delta cepstrum) and energy. The combination of the former phonotactic model and the latter acoustic model gave the results above. Nonetheless, similarly to [Sinha et al 2015], the study does not deal with regional accents but with fully-fledged dialects of the Arab world, which are known to be very diverse from one another.
In [Hou et al 2010] the authors proposed an approach for the identification of 2 Chinese accents using both cepstral (SDC, also called MFCC_D) and prosodic (pitch) features with a gender-dependent model. As in the previous work, the system developed is mixed: the two features are preprocessed through a GMM model whose output is fed to an SVM model, which takes the final decision. Below are the general flowchart of the experiments and the results obtained with SDC, pitch, and both together:
In his thesis, [Hanani 2012] built a system to recognize regional accents of British English on the basis of the ACCDIST metric and a state-of-the-art language identification system. This model uses GMM and SVM in a complementary way to take the final decision, on a data set of MFCC and SDC features extracted from a text-dependent corpus, obtaining around 94% accuracy on a 2-way classification (against a human score of 90%).
In the Microsoft Research study [Chen et al 2001], the authors built a GMM classifier to recognize 4 Chinese accents over a large cross-dialectal corpus. They obtained very good results through a gender-dependent system and a high number of GMM components, without transcriptions or phone modeling, just extracting MFCC and energy features with their delta regressions. It is not clear, however, whether this system is text-dependent or not.
Unlike the studies that addressed the problem by supporting the acoustic modeling for accent detection with previous linguistic modeling [cf. paragraph 3.4], we tried to rely solely on the acoustic-phonetic side, even if this idea is perhaps counter-intuitive: in fact, the regional accent as a linguistic feature of a speaker can be detected by humans only if they are able to couple phone and phoneme, like an ASR system. For example, we believe of course that an Italian speaker who does not understand Chinese at all will not be able to recognize the various Chinese varieties and accents. Nevertheless, we cannot know whether such an Italian speaker, if subjected to massive listening of various accents from China, would become able to discern dialects even without linguistic competence. The operating principle of the machine we attempted to build is exactly the latter and, although it may appear counter-intuitive, there are several studies dealing with the problem through this approach and obtaining, moreover, good results [which we discussed in paragraph 3.4].
Our interest in MFCC features is due to the fact that they seem to be the most effective and successful ones in many speech- and speaker-based studies: the same spectral feature is efficient in recognizing a speaker, a linguistic pattern, and broad linguistic characteristics like accent or idiolect. Furthermore, in many studies MFCCs provide a sort of baseline feature to which others are added to reach the state of the art: MFCCs + pitch, MFCCs + Perceptual Linear Prediction (PLP), MFCCs + duration, MFCCs + energy, and so on [cf. again paragraph 3.4]. However, MFCCs always seem to constitute a sort of conditio sine qua non, and the performance of a classifier trained on these features alone is often close to the state of the art. To sum up, our main purpose is not to build an efficient, state-of-the-art classifier for Italian accent recognition: it is rather to carry out a critical study of the use of MFCCs in accent recognition, using a corpus where the regional variability of Italian is relatively important, and comparing the behavior of the automatic system with the human one.
The data employed are a crucial aspect, even if the system has no previous linguistic knowledge. Looking at the corpus deployed (telephonic CLIPS, [cf. paragraph 4.1]), the identification task is expected to be very hard for a machine due to four main factors. 1) Variability is rather great for a relatively small country like Italy; however, it is not comparable with the variability of an L2 English corpus, or that of a larger country like India or China [cf. paragraph 3.4]. The identification task we propose is actually (at the maximum level of granularity) hardly feasible even for humans [cf. paragraph 4.2.1]. 2) This is a telephonic corpus: as we explain in [note n of chapter 1], MFCCs are quite robust to noisy channels; however, some spectral information is equally compromised. Furthermore, some samples contain telephonic noises which undermine performance too; unfortunately, we cannot easily remove such noises in an automatic way. 3) The samples are partially content-mismatched (spontaneous), whereas the majority of corpora used for accent recognition purposes are completely text-dependent [more details on CLIPS hallmarks are provided in paragraph 4.0]. This is a crucial difference, in that a text-dependent system captures more focused and accurate spectral information, but on the other hand it can work only on specific words as input. 4) A relatively large number of samples are quite short, and the accent in them is hardly detectable; some contain just a monosyllabic word with no regional variation inside. For samples like these, it is practically impossible to recognize the provenance of the speaker. However, we ran an experiment [cf. paragraph 4.6] which tried to tackle this problem.
Considering all these factors, we can initiate the experimentation phase.
experiments in which subjects interact with a computer system that they believe to be autonomous, but which is actually being operated, or partially operated, by an unseen human being [source: Wikipedia]). These written scenarios, requiring the production of some specific words, ensure a broad phonetic and phonological coverage. Basically, for each speech sample we have at our disposal the identification number of the speaker, his home town and his gender.
A brief description of the CLIPS telephonic sub-corpus:

Sampling frequency: 8000 Hz
Encoding: μ-law
Number of samples: 8621
Number of speakers: 314
perhaps physical factors. [...] The only method available to nonexperts is, of course, auditory analysis (listening), and often they do not even have audio-recorded evidence at their disposal when offering their profile, but must rely on previous experience with the voice or language variety in question [...] when naïve listeners provide profiles based on auditory methods, they cannot properly be said to be conducting "analysis", since they seem to listen only holistically and usually cannot articulate which specific speech features lead them to reach their conclusions.
However, even a speech expert's opinion is not completely reliable, since his analysis, though sometimes supported by computational methods, is guided, to a large extent, by intuition.
Let's have a look at [Köster et al 2012]: in a speaker profiling task, German experts in voice comparison were evaluated with respect to their performance and their methodological approach in accent identification. In the discussion section, the authors state that
the limited success of the use of different methods and findings from dialectology also
suggests that proficiency tests on speaker profiling as well as the evaluation of a forensic
expert involved in casework should not only focus on methods, of dialectology in particular,
but should also focus on the expert's individual performance to identify regional accents. Even
if a method has been exhaustively described and accredited, and even if an expert can
demonstrate his/her formal qualifications and ability to apply the method, there seems to be
no guarantee that (at least with the given time constraints) either 1) perceptual determination
works properly or 2) phonetic details gained with the help of a particular perceptual method
will be interpreted correctly. [ibid:68]
To sum up, the group of experts performed on average quite well not because of the methods adopted, but due to their individual and innate competence: it is somewhat surprising, according to the authors, that
pure listening (which was probably performed holistically, especially for those participants who needed very little time) is even more successful in the task of determining dialects/regional accents under certain [...]
The surveys we prepared were meant to cope with the task of aural-perceptual accent recognition through a quantitative approach (collecting as much data as possible), and they were addressed to any sort of Italian listener (naive or expert). The questions we asked ourselves were: are Italians good, on average, at recognizing Italian regional accents? Which accents are more easily recognized? And are regional accents always perceptible?
It is worth recalling, as we saw in chapter 2, that we are not talking about dialects or languages. Regional varieties of Italian are not different languages: they are fully-fledged Italian. Regional varieties, although they are something relatively new whose evolution has not finished yet, differ from each other not necessarily in syntactic, morphological or lexical aspects, but sometimes only in phonetic and prosodic ones. In spite of this, it is quite easy for an Italian to estimate the broad geo-linguistic area of a speaker. This task is harder when the fragments of speech are very short, as is the case for the greater part of the CLIPS corpus, but it remains feasible, as the results we are going to present show.
In the light of this, we decided to set up two surveys in order to 1) explore, broadly, how good Italian people are at recognizing regional accents and 2) build a test bench so as to be able to compare the performance of our machine-learning classifier with the human one.
For the first survey, we chose twenty-one samples well distributed among the cities of the corpus (Bergamo is included in this first survey, but not in the second one: [cf. paragraph 4.3.2]). These samples were chosen arbitrarily and regardless of gender; the only selection criterion was that every sample had to contain at least one typically regional feature. We tried to avoid both the easiest and the hardest examples, in order to make the choice difficult and, at the same time, not to discourage the tester.
We placed a chart of the Italian linguistic areas [shown in paragraph 2.2] at the head of this survey. As a sort of quiz, we embedded the audio samples in the form (made with Google Forms) and asked, for each sample, two multiple-choice questions: 1) which is the broad geo-linguistic area of the speaker? 2) Which is the speaker's home town?
An example is shown below:
The survey was posted on a computational linguistics blog ( http://www.nlpspoiler.it/sai-riconoscere-gli-accenti-regionali-italiani-allora-gioca/ ) and shared across various Italian linguistics Web communities. Furthermore, we added a touch of gamification, offering a prize for the best scores, in order to make the quiz more friendly and appealing.
We asked people to participate only if they were native Italian speakers. Moreover, we asked participants to include their age and their home town (plus a second town, in case the tester had moved elsewhere for at least five years); we required this information because we had planned to study the distribution of answers across the variables of age and the geo-linguistic area the tester belonged to. This plan was later abandoned, because we considered it off-topic with respect to the subject of this work.
A month later, we decided to set up another survey, for three main reasons. 1) We discovered some sampling errors in the Bergamo sub-section and decided to remove its speakers from the data set [cf. par. 4.3.2]. 2) The first survey, made with Google Forms, did not provide easy embedding for audio samples, making the form not very user-friendly; in the meanwhile, we had found another free online survey tool, namely PollDaddy. 3) Asking some testers, we learnt that the geo-linguistic area questions were somewhat useless, since the first thing users did after listening to a sample was generally to check the list of possible cities; in any case, the choice of geo-linguistic area could be inferred from the user's city selection.
Thus, the new test was stripped of the Bergamo speakers and of the questions about the broad linguistic area. Another relevant point is that the new samples were chosen not by us, but by a university colleague. We took this decision in order to avoid selector subjectivity: the selection of samples is surely biased by the experimenter's linguistic habits. The first selector was an Italian from the south of Tuscany (the brown linguistic area in the chart of Italian varieties at [paragraph 2.2] of this work), while the selector for the second survey was an Italian from
The survey was posted on a computational linguistics blog.
The pictures above are from test 1: the correct answer was the northern area for both, but in the first case the speaker was from Genova, and in the second from Parma.
The pictures above are from test 2: the correct answers are in green. The Venezia variety seems not to be easily distinguished from the other northern varieties. However, linguistic boundaries are fuzzier here, that is to say there is nothing like an isogloss [cf. chapter 2].
The pictures above are from test 1: the correct answer was the southern area for the first and the extreme southern area for the second (respectively, Napoli and Palermo).
The pictures above are from test 2: the correct answers are in green. Users did not easily distinguish southern and extreme-southern varieties.
Users generally got good results in guessing the Roma, Cagliari and Firenze accents:
The graphs above are from test 2, while those below are from test 1: the correct answer for the graph on the left was Roma, for the one in the center Cagliari, and for the one on the right Firenze.
Finally, we can observe a sort of trend in identifying the whole northern variety as the Milano variety: in test 2 Milano was chosen 6 times as the most selected answer, while it was correct only twice. In test 1 the same city was chosen 4 times (moreover, twice as the second most selected answer) and it was correct only once. We briefly mentioned the likely cause of this phenomenon, namely the concept of linguistic koinè, in chapter 2.
4.2 Tools
Five main instruments were used during the different phases of this work. We first used SoX to convert the CLIPS audio samples from .raw to .wav format. SoX is "a free cross-platform digital audio editor [...] written in standard C and having a command-line interface" [source: sox.sourceforge.net]. In this way, we made the files suitable for the feature extraction processing [appendix 1]. After that, we used the openSMILE software [Eyben et al. 2010] to carry out the feature extraction from each audio sample, obtaining a comma-separated file containing the selected features and raw numeric descriptions of each audio file. We then added some information to the spreadsheet, such as the gender, the home town, the broad linguistic area and the identification number of each speaker. To do that, we developed some simple Perl and UNIX scripts. Then, we built different data sets to be processed, with the aim of setting up several machine learning experiments. Next, we processed these data sets with Weka [www.cs.waikato.ac.nz/ml/weka/], an open-source machine learning software written in Java. Last, we analyzed the results of our classifiers by comparing them with 1) human scores on an accent recognition task and 2) some experiments set up to find out confounding variables.
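The Perl and UNIX scripts themselves are shown in the appendices; purely as an illustration, a minimal Python analogue (with hypothetical sample names and metadata, and assuming the ';'-separated CSV that openSMILE's csvSink emits by default) of the metadata-merging step could look like this:

```python
import csv
import io

# Hypothetical speaker metadata keyed by sample name; in the real
# pipeline this information came from the CLIPS corpus itself.
SPEAKERS = {
    "speech01": {"gender": "F", "city": "parma", "area": "north", "id": "p01"},
    "speech02": {"gender": "M", "city": "bari",  "area": "south", "id": "b07"},
}

def add_metadata(feature_csv):
    """Append gender, city, area and speaker id to each feature row."""
    reader = csv.reader(io.StringIO(feature_csv), delimiter=";")
    header = next(reader)
    rows = [header + ["gender", "city", "area", "speaker_id"]]
    for row in reader:
        meta = SPEAKERS[row[0]]  # the first column is the sample name (-N)
        rows.append(row + [meta["gender"], meta["city"], meta["area"], meta["id"]])
    return rows

demo = "name;mfcc1;mfcc2\nspeech01;12.3;-4.5\nspeech02;10.1;-3.2\n"
table = add_metadata(demo)
print(table[1])  # ['speech01', '12.3', '-4.5', 'F', 'parma', 'north', 'p01']
```

The same result was obtained in the thesis with Perl one-liners; the sketch only makes the column layout explicit.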
4.2.1 openSMILE
OpenSMILE (open-Source Media Interpretation by Large feature-space Extraction) is an open-source, freely downloadable tool-kit developed at the Technische Universität München and now maintained by audEERING. This software is especially used in signal processing and music information retrieval frameworks, in order to select and extract acoustic features for acoustic analysis or for use with machine learning methods. It can produce various output file formats, such as CSV (comma-separated values), HTK (hidden-Markov tool-kit) and ARFF (attribute-relation file format). The latter is the one we are interested in, because of its compatibility with Weka. On the other hand, it accepts only one type of audio input file: RIFF-WAVE (PCM). OpenSMILE is written in C++ and is available for various operating systems, but it has no graphical interface, so it can be used only from a shell. However, it is a powerful instrument, able to extract a great range of features and their functionals (namely, various statistical filters applied to the features, in order to smooth the data and map the contours of the audio samples). Among the most common are: MFCCs, Perceptual Linear Predictive (PLP) coefficients, Linear Predictive Coefficients (LPC), CHROMA (octave warped semitone spectra), Fundamental Frequency (F0), formant distribution, pitch, energy, loudness, voicing probability, jitter and shimmer.
Feature extraction can be run from a terminal with a simple command line like this:

SMILExtract -C myconfig/demo1.conf -I wav_samples/speech01.wav -O speech01.energy.csv -N speech01
SMILExtract is the openSMILE component that extracts the features from the signal, composing the raw numeric vector of the audio sample. The <-I> option specifies the audio input file, in .wav format (more precisely, WAVE-RIFF PCM format). The <-O> option specifies the output file (.csv in the example above). The <-N> option assigns a name to the sample, used as the first value of the CSV output. <-C> specifies the configuration file, which is the most interesting element of the SMILExtract invocation and is worth spending some words on. Overall, openSMILE can be fully configured via a text-based configuration file, which looks like this:
[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[energy].type = cEnergy
instance[csvSink].type = cCsvSink
...
[waveSource:cWaveSource]
filename = \cm[inputfile(I):file name of the input wave file]
[framer:cFramer]
reader.dmLevel = <<XXXX>>
writer.dmLevel = <<XXXX>>
[energy:cEnergy]
...
[csvSink:cCsvSink]
filename = \cm[outputfile(O):file name of the output CSV file]
After the components have been listed, setting parameters must be specified for each one. Each component has its own section, and all components are connected through a central data memory component. The latter is primary, together with the waveSource component, which reads the input file [image from Eyben et al. 2010:28]:
All components except these two read from a specific data memory level and write to another one. Thereby, in order to extract one specific feature, one has to follow a certain pattern when composing one's own configuration file. Although the configuration syntax is quite hard to learn, openSMILE provides some default sets which can be used as a starting point for a custom model.
4.2.2 Weka
Weka is an open-source software package for data mining, classification and predictive modeling tasks. It was developed in New Zealand at the University of Waikato, and it is certainly one of the most common open-source systems for machine learning. It is written in Java, and its machine learning algorithms can either be applied directly to a data set or called from one's own Java code. The system, which provides a user-friendly graphical interface, contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. The main input file format is the ARFF format, illustrated below with the classic iris data set:
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
...
Lines that begin with a '%' are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case-insensitive. The header of an ARFF file contains the name of the relation and a list of the attributes (each introduced by the @ATTRIBUTE keyword) together with their types. Below the header, the @DATA section contains the data in comma-separated format.
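The format above is simple enough to be generated directly from our feature spreadsheets; as a sketch (not the actual scripts used in this work), a minimal ARFF serializer could be written as:

```python
def to_arff(relation, attributes, rows):
    """Serialize a data set in Weka's ARFF format.

    attributes: list of (name, type) pairs, where type is either the
    string "NUMERIC" or a list of nominal values (rendered as {v1,v2,...}).
    """
    lines = [f"@RELATION {relation}", ""]
    for name, atype in attributes:
        if isinstance(atype, list):
            atype = "{" + ",".join(atype) + "}"
        lines.append(f"@ATTRIBUTE {name} {atype}")
    lines += ["", "@DATA"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)

# Tiny example with one numeric attribute and a nominal class
arff = to_arff(
    "iris",
    [("sepallength", "NUMERIC"), ("class", ["Iris-setosa", "Iris-versicolor"])],
    [(5.1, "Iris-setosa")],
)
print(arff)
```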
[...] smaller data set. This performance gap remained modest even when we built the classifier with different algorithms, so we decided to remove the delta regressions from our main data set.
4.3.2 Detecting outliers in the CLIPS telephonic corpus
For various reasons, e.g. failures of the operator console monitoring the telephonic speech, or tester misunderstandings, there are some errors in the CLIPS telephonic corpus. For example, it is not rare to find a sample with a long tail of telephonic noises. Void samples are also quite common, as are samples with background noise. These errors are not troublesome if one wants to carry out a manual acoustic analysis, whereas they can deteriorate the performance of an automatic model trained on the data, which is our case.
In order to remove such samples from the training set, we set up a fast procedure using openSMILE and Perl programming. Our main targets were the void samples, which are present in considerable numbers and obviously do not convey any information about speech features.
First of all, we extracted loudness information for each telephonic corpus sample. To do that, we used openSMILE, and notably one of its default configuration files. [...]
[...] outliers. Afterwards, we also added to this text file some other types of outliers which we decided to exclude from our data set. For instance, we decided to rule out the Bergamo speakers. We took this decision for two reasons: 1) among the Bergamo speakers there were three who are clearly not native to Bergamo (beyond the Bergamo sub-set, we encountered this drawback among a few other speakers too, who are not, in our opinion, native to the city they are assigned to; these errors, if such they are, are probably due to loose control over the people subjected to the sampling); 2) Bergamo, being only about 55 kilometers from Milano, belongs to a sort of koinè whose more prestigious variety, in influence and power, is that of Milano. Since humans were generally not able to distinguish the Milano accent from the Bergamo one (as we noted in the first survey), we decided that this difference could not be relevant for the machine-learning model, and removed the Bergamo samples.
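The void-sample filter described above can be sketched in a few lines; the threshold and the sample names below are illustrative, not the values actually used in the thesis:

```python
# Flag samples whose mean loudness (as extracted by openSMILE) falls
# below a threshold: such samples are essentially silence and carry no
# information about speech features.
THRESHOLD = 0.05  # illustrative value

def find_void_samples(loudness_by_sample, threshold=THRESHOLD):
    """Return, sorted, the names of samples with mean loudness below threshold."""
    return sorted(name for name, loud in loudness_by_sample.items()
                  if loud < threshold)

measurements = {
    "tel_001": 0.41,   # normal speech
    "tel_002": 0.003,  # essentially silence -> outlier
    "tel_003": 0.27,
    "tel_004": 0.0,    # void sample -> outlier
}
voids = find_void_samples(measurements)
print(voids)  # ['tel_002', 'tel_004']
```

The list of flagged names can then be used to drop the corresponding rows from the feature file, which is what the Perl scripts did.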
In addition to the outliers, we also excluded from our training set the 41 samples which had been submitted to the human testers [see par. 4.1]. We did this in order to be able subsequently to test our classifier on these samples and compare its results with the human ones.
Lastly, we merged all the ARFF files (there was one for each gender of each city) into a single data set file.
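Since all the per-city ARFF files share the same header, the merge amounts to keeping one header and concatenating the @DATA sections. A minimal sketch (again a Python analogue of the UNIX scripts, not the scripts themselves):

```python
def merge_arff(files):
    """Concatenate ARFF files that share the same header: keep the
    header of the first file, then append only the @DATA rows of
    every file."""
    merged, header_done = [], False
    for text in files:
        lines = text.splitlines()
        # index of the @DATA marker in this file
        split = next(i for i, l in enumerate(lines)
                     if l.strip().upper() == "@DATA")
        if not header_done:
            merged.extend(lines[: split + 1])
            header_done = True
        merged.extend(l for l in lines[split + 1:] if l.strip())
    return "\n".join(merged)

a = "@RELATION accents\n@ATTRIBUTE mfcc1 NUMERIC\n@DATA\n1.0\n"
b = "@RELATION accents\n@ATTRIBUTE mfcc1 NUMERIC\n@DATA\n2.0\n"
merged = merge_arff([a, b])
print(merged)
```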
Finally, we ran the SMO algorithm (described, with the other methods used, in [paragraph 3.3]) on the twenty samples of test 2, using a data set devoid of Bergamo speakers and outliers [cf. paragraph 4.3.2]:

=== Summary ===
Correctly Classified Instances        8        40      %
Incorrectly Classified Instances     12        60      %
Kappa statistic                       0.3548
Mean absolute error                   0.1257
Root mean squared error               0.2473
Relative absolute error              94.8648 %
Root relative squared error          96.0627 %
Total Number of Instances            20
=== Confusion Matrix ===
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
With the second classifier, the results were:

=== Summary ===
Correctly Classified Instances        7        35      %
Incorrectly Classified Instances     13        65      %
Kappa statistic                       0.3029
Mean absolute error                   0.093
Root mean squared error               0.2905
Relative absolute error              70.2351 %
Root relative squared error         112.8459 %
Total Number of Instances            20

=== Confusion Matrix ===
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
Finally, using the IBk (k-NN) algorithm, the model reached proficient scores:

=== Summary ===
Correctly Classified Instances       18        90      %
Incorrectly Classified Instances      2        10      %
Kappa statistic                       0.8913
Mean absolute error                   0.0145
Root mean squared error               0.1194
Relative absolute error              10.9414 %
Root relative squared error          46.3942 %
Total Number of Instances            20

=== Confusion Matrix ===
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
As can be seen, the accuracy of all three methods is higher than the human one. However, these generally positive results sharply plummeted when we removed from the training set the samples belonging to the speakers of survey 2. Below are the same algorithms used above, but excluding these samples from the training set.
Using the SMO (SVMs) algorithm:

=== Summary ===
Correctly Classified Instances        1         5      %
Incorrectly Classified Instances     19        95      %
Kappa statistic                      -0.0215
Mean absolute error                   0.1299
Root mean squared error               0.2557
Relative absolute error              97.9839 %
Root relative squared error          99.2652 %
Total Number of Instances            20
=== Confusion Matrix ===
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
With the second classifier:

=== Summary ===
Correctly Classified Instances        4        20      %
Incorrectly Classified Instances     16        80      %
Kappa statistic                       0.1398
Mean absolute error                   0.1168
Root mean squared error               0.3175
Relative absolute error              88.1124 %
Root relative squared error         123.2673 %
Total Number of Instances            20
=== Confusion Matrix ===
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
And with the IBk (k-NN) algorithm:

=== Summary ===
Correctly Classified Instances        6        30      %
Incorrectly Classified Instances     14        70      %
Kappa statistic                       0.235
Mean absolute error                   0.1001
Root mean squared error               0.3159
Relative absolute error              75.4742 %
Root relative squared error         122.6396 %
Total Number of Instances            20
=== Confusion Matrix ===
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
[...] system, because it is not large enough. The second experiment we will show below is perhaps more reliable, because we exploited a larger test set to evaluate our classifiers: we excluded 10% of the samples from the training set and used this part as a test set. On the one hand, we removed 10% of the samples at random, namely 882 samples. On the other hand, we did the same thing not at random but on the basis of the speakers: 10% of the total number of speakers (selected at random) became the test set of our classifier (namely, 854 samples), which was trained on the other 90% of the speakers. We provide below the comparison of the scores of the random-based test and the speaker-based one, repeated three times, once for each method (in order: SMO (SVMs), Bayesian network, IBk (k-NN)). Similarly to the last experiment, a strong fall in performance can be seen when the machine is not trained with the speakers of the test set (speaker-based test).
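The difference between the two hold-out strategies can be sketched as follows (a stdlib Python sketch with a toy sample representation, not the actual scripts used here): in the random split the same speaker can end up on both sides, while in the speaker-based split entire speakers are held out.

```python
import random

def random_split(samples, frac=0.1, seed=0):
    """Hold out a random fraction of the samples (speakers may leak
    between training and test sets)."""
    rnd = random.Random(seed)
    shuffled = samples[:]
    rnd.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[cut:], shuffled[:cut]  # train, test

def speaker_split(samples, frac=0.1, seed=0):
    """Hold out a random fraction of the *speakers*, with all their samples."""
    rnd = random.Random(seed)
    speakers = sorted({s["speaker"] for s in samples})
    rnd.shuffle(speakers)
    held = set(speakers[: max(1, int(len(speakers) * frac))])
    train = [s for s in samples if s["speaker"] not in held]
    test = [s for s in samples if s["speaker"] in held]
    return train, test

data = [{"speaker": f"spk{i % 5}", "sample": i} for i in range(50)]
train, test = speaker_split(data)
# No speaker appears on both sides of a speaker-based split:
assert {s["speaker"] for s in train}.isdisjoint({s["speaker"] for s in test})
```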
Using the SMO (SVMs) algorithm on the random-based test and training sets:

=== Summary ===
Correctly Classified Instances      353        42.8919 %
Incorrectly Classified Instances    470        57.1081 %
Kappa statistic                       0.3838
Mean absolute error                   0.1259
Root mean squared error               0.2477
Relative absolute error              95.0148 %
Root relative squared error          96.2155 %
Total Number of Instances           823
[Per-class FP rate, precision, recall and F-measure over the 14 cities; weighted-average F-measure 0.427]
And then the same algorithm using the speaker-based test and training sets:
=== Summary ===
Correctly Classified Instances      102        11.9578 %
Incorrectly Classified Instances    751        88.0422 %
Kappa statistic                       0.0599
Mean absolute error                   0.1312
Root mean squared error               0.2582
Relative absolute error              98.5316 %
Root relative squared error          99.7343 %
Total Number of Instances           853
[Per-class FP rate, precision, recall and F-measure over the 14 cities; weighted-average F-measure 0.118]
Using the Bayesian Networks algorithm on the random-based test and training sets:
=== Summary ===
Correctly Classified Instances      247        30.0122 %
Incorrectly Classified Instances    576        69.9878 %
Kappa statistic                       0.2453
Mean absolute error                   0.1006
Root mean squared error               0.2972
Relative absolute error              75.941  %
Root relative squared error         115.4308 %
Total Number of Instances           823
[Per-class FP rate, precision, recall and F-measure over the 14 cities; weighted-average F-measure 0.289]
And then the same algorithm using the speaker-based test and training sets:
=== Summary ===
Correctly Classified Instances       81         9.4959 %
Incorrectly Classified Instances    772        90.5041 %
Kappa statistic                       0.0247
Mean absolute error                   0.1294
Root mean squared error               0.3415
Relative absolute error              97.1348 %
Root relative squared error         131.8942 %
Total Number of Instances           853
Precision   Recall   F-Measure   ROC Area   Class
0.143       0.043    0.067       0.633      bari
0.029       0.217    0.051       0.372      cagliari
0.15        0.113    0.129       0.703      catanzaro
0           0        0           0.39       firenze
0           0        0           0.37       genova
0           0        0           0.316      lecce
0.068       0.059    0.063       0.569      milano
0.133       0.163    0.147       0.538      napoli
0.2         0.06     0.092       0.655      parma
0.347       0.221    0.27        0.776      perugia
0.133       0.038    0.06        0.64       palermo
0           0        0           0.144      roma
0.157       0.215    0.182       0.661      torino
0.098       0.173    0.125       0.642      venezia
0.114       0.095    0.094       0.551      Weighted Avg.
Using the IBk (k-NN) algorithm on the random-based test and training sets:
=== Summary ===
Correctly Classified Instances      630        76.5492 %
Incorrectly Classified Instances    193        23.4508 %
Kappa statistic                       0.747
Mean absolute error                   0.0337
Root mean squared error               0.1829
Relative absolute error              25.4317 %
Root relative squared error          71.0215 %
Total Number of Instances           823
Precision   Recall   F-Measure   ROC Area   Class
0.721       0.772    0.746       0.875      bari
0.754       0.743    0.748       0.86       cagliari
0.614       0.761    0.68        0.866      catanzaro
0.695       0.82     0.752       0.898      firenze
0.8         0.784    0.792       0.886      genova
0.78        0.78     0.78        0.883      lecce
0.807       0.697    0.748       0.841      milano
0.861       0.689    0.765       0.841      napoli
0.733       0.677    0.704       0.828      parma
0.928       0.889    0.908       0.941      perugia
0.813       0.754    0.782       0.869      palermo
0.685       0.833    0.752       0.902      roma
0.732       0.743    0.738       0.859      torino
0.851       0.769    0.808       0.88       venezia
0.773       0.765    0.766       0.874      Weighted Avg.
And then the same algorithm using the speaker-based test and training sets:
=== Summary ===
Correctly Classified Instances      130        15.2403 %
Incorrectly Classified Instances    723        84.7597 %
Kappa statistic                       0.093
Mean absolute error                   0.1211
Root mean squared error               0.3476
Relative absolute error              90.9447 %
Root relative squared error         134.2706 %
Total Number of Instances           853
Precision   Recall   F-Measure   ROC Area   Class
0.139       0.239    0.176       0.578      bari
0.128       0.435    0.198       0.677      cagliari
0.1         0.075    0.086       0.516      catanzaro
0.066       0.091    0.076       0.511      firenze
0.032       0.032    0.032       0.478      genova
0.075       0.035    0.048       0.495      lecce
0.229       0.108    0.147       0.53       milano
0.219       0.071    0.108       0.52       napoli
0.224       0.194    0.208       0.569      parma
0.317       0.169    0.22        0.567      perugia
0.065       0.058    0.061       0.493      palermo
0           0        0           0.444      roma
0.208       0.308    0.248       0.607      torino
0.382       0.558    0.453       0.75       venezia
0.172       0.152    0.148       0.547      Weighted Avg.
Overall, performance plummets sharply when the speakers of the training set and of the test set are not the same. This means that the speaker variable is crucially important for our MFCCs-trained model, and that the classification is actually driven by the speaker's individual features, not by accent features. Thereby, it is crucially important, as most of the explored literature does [cf. chapter 3], to keep the speakers of the training and test sets separate.
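The Weka summaries above report accuracy and the Kappa statistic; for reference, both can be computed directly from a confusion matrix. The following is a stdlib sketch of that computation (Cohen's kappa), not the Weka implementation:

```python
def accuracy_and_kappa(matrix):
    """Compute accuracy and Cohen's kappa from a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Expected agreement under chance: sum over classes of
    # (row total / n) * (column total / n).
    expected = sum(
        (sum(matrix[i]) / n) * (sum(row[i] for row in matrix) / n)
        for i in range(len(matrix))
    )
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Toy two-class example: 40+45 correct out of 100
acc, kappa = accuracy_and_kappa([[40, 10], [5, 45]])
print(round(acc, 3), round(kappa, 3))  # 0.85 0.7
```

A kappa close to 0, as in the speaker-based tests above, indicates agreement no better than chance even when raw accuracy is above the 1/14 baseline.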
[...]
=== Summary ===
Correctly Classified Instances       66        12.4528 %
Incorrectly Classified Instances    464        87.5472 %
Kappa statistic                       0.0605
Mean absolute error                   0.1251
Root mean squared error               0.353
Relative absolute error              94.114  %
Root relative squared error         136.2606 %
Total Number of Instances           530
[Per-class FP rate, precision, recall and F-measure over the 14 cities; weighted-average F-measure 0.13]
=== Confusion Matrix ===
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
[...]
=== Summary ===
Correctly Classified Instances       63        15.4412 %
Incorrectly Classified Instances    345        84.5588 %
Kappa statistic                       0.0918
Mean absolute error                   0.1209
Root mean squared error               0.3468
Relative absolute error              91.2094 %
Root relative squared error         134.3149 %
Total Number of Instances           408
Precision   Recall   F-Measure   ROC Area   Class
0.286       0.37     0.323       0.654      bari
0.034       0.045    0.039       0.489      cagliari
0           0        0           0.47       catanzaro
0.133       0.125    0.129       0.53       firenze
0           0        0           0.494      genova
0.167       0.064    0.092       0.514      lecce
0.056       0.016    0.025       0.487      milano
0.273       0.111    0.158       0.535      napoli
0.385       0.588    0.465       0.775      parma
0.043       0.154    0.068       0.523      perugia
0.111       0.081    0.094       0.483      palermo
0.015       0.04     0.022       0.435      roma
0.415       0.71     0.524       0.814      torino
0           0        0           0.496      venezia
0.155       0.154    0.14        0.546      Weighted Avg.
=== Confusion Matrix ===
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
Compared with the gender-neutral system, accuracy did not improve; rather, it degraded, as in the case of the female speakers' model above. According to these machine-learning experiments, the gender variable does not seem to affect the performance of our MFCCs-based classifier.
4.5 Linguistic areas
Let's try to interpret the confusion matrix of our best classifier (the one built with the k-NN method), run on the samples of test 2:
[14 × 14 confusion matrix over the classes a = bari, b = cagliari, c = catanzaro, d = firenze, e = genova, f = lecce, g = milano, h = napoli, i = parma, j = perugia, k = palermo, l = roma, m = torino, n = venezia]
Looking at the human scores, when users get the city choice wrong, it is generally because they have selected a city of the same linguistic area. It is quite common to mistake the Torino accent for the Genova one, the Napoli accent for the Bari one, or the Lecce one for the Palermo one, even if there are some cases where this does not hold:
Nevertheless, our classifier does not seem to follow this criterion of linguistic proximity: as we can see from the matrix above, when it mistakes one city for another, there is often no strong linguistic relationship between them.
inst#   actual        predicted
1       1:bari        12:roma
2       1:bari        1:bari
3       2:cagliari    7:milano
4       3:catanzaro   3:catanzaro
5       4:firenze     4:firenze
6       4:firenze     2:cagliari
7       5:genova      4:firenze
8       6:lecce       8:napoli
9       7:milano      12:roma
10      7:milano      7:milano
11      8:napoli      12:roma
12      11:palermo    5:genova
13      11:palermo    4:firenze
14      9:parma       9:parma
15      10:perugia    14:venezia
16      12:roma       14:venezia
17      12:roma       14:venezia
18      13:torino     6:lecce
19      14:venezia    14:venezia
20      14:venezia    7:milano
Obviously, we cannot draw general conclusions on the basis of a test set of just 20 audio samples. However, visualizing the confusion matrix of our IBk (k-NN) classifier tested on unknown speakers, here too we found only a partial correlation between cities of the same linguistic area, a correlation which is probably not meaningful [see appendix 7].
In view of the above experiments, it seems that perceptual categories like the linguistic areas, which apparently work for humans, do not work for our MFCCs-based machine learning classifier: the latter does not seem to follow a criterion of linguistic proximity, although there is no shortage of cases where humans do not respect it either.
We ran an additional experiment to verify whether the concept of linguistic area could have a sort of computational soundness inside our classifier. In brief, we checked the classifier's ability to distinguish two cities of the same linguistic area, compared with its ability to distinguish two cities of different linguistic areas. Our hypothesis was that the system would achieve better performance in distinguishing cities with different linguistic hallmarks than cities belonging to the same broad linguistic area. Hence, the structure of the experiment was the following:
70
1)
Launch the classifier on a data set composed by 2 cities from a same chosen linguistic
area.
2)
Launch the classifier on a data set composed by 2 cities whose one belongs to the
Launch the classifier on a data set composed by 2 cities whose one belongs to the
targeted linguistic area and the other to a not neighboring linguistic area.
These specific data sets were generate with some Perl scripts we built up for this purpose.
We show these scripts in [appendix 8].
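The pair-wise data sets amount to filtering the full ARFF file down to the rows of two chosen cities. As an illustration (a Python analogue of the Perl scripts, with a simplified ARFF layout where the class label is the last field), this could be sketched as:

```python
def city_pair_subset(arff_text, city_a, city_b):
    """Keep the ARFF header and only the data rows whose class label
    (last comma-separated field) is one of the two chosen cities."""
    lines = arff_text.splitlines()
    split = next(i for i, l in enumerate(lines)
                 if l.strip().upper() == "@DATA")
    header, data = lines[: split + 1], lines[split + 1:]
    keep = [l for l in data if l.strip()
            and l.rsplit(",", 1)[1] in (city_a, city_b)]
    return "\n".join(header + keep)

arff = ("@RELATION accents\n@ATTRIBUTE mfcc1 NUMERIC\n"
        "@ATTRIBUTE class {lecce,catanzaro,venezia}\n@DATA\n"
        "1.0,lecce\n2.0,venezia\n3.0,catanzaro\n")
sub = city_pair_subset(arff, "lecce", "catanzaro")
print(sub)
```

A real script would also trim the nominal values in the class attribute declaration down to the two retained cities, which this sketch leaves untouched.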
Below are the results of the lazy IBk classifier on a test set of unknown speakers from Lecce and Catanzaro, i.e. two cities belonging to the same linguistic area (the extreme-southern one):

=== Summary ===
Correctly Classified Instances       72        51.7986 %
Incorrectly Classified Instances     67        48.2014 %
Kappa statistic                       0.1215
Mean absolute error                   0.0699
Root mean squared error               0.2603
Relative absolute error              97.2752 %
Root relative squared error         138.7581 %
Total Number of Instances           139

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.651     0.429       0.792    0.556       0.571      catanzaro
0.208     0.732       0.349    0.472       0.571      lecce
0.377     0.616       0.518    0.504       0.571      Weighted Avg.
Next, the results of the lazy IBk classifier on a test set of unknown speakers from Lecce and Bari, i.e. two cities belonging to two neighboring areas (respectively, the extreme-southern and the southern one):

=== Summary ===
Correctly Classified Instances       71        53.7879 %
Incorrectly Classified Instances     61        46.2121 %
Kappa statistic                       0.1909
Mean absolute error                   0.067
Root mean squared error               0.255
Relative absolute error              91.6848 %
Root relative squared error         133.4554 %
Total Number of Instances           132

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.651     0.423       0.891    0.573       0.62       bari
0.109     0.857       0.349    0.496       0.62       lecce
0.298     0.706       0.538    0.523       0.62       Weighted Avg.
Last, the results of the lazy IBk classifier on a test set of unknown speakers from Lecce and Venezia, i.e. two cities belonging to two non-neighboring areas (respectively, the extreme-southern and the Veneto one):

=== Summary ===
Correctly Classified Instances       90        70.3125 %
Incorrectly Classified Instances     38        29.6875 %
Kappa statistic                       0.4203
Mean absolute error                   0.0438
Root mean squared error               0.2044
Relative absolute error              61.1706 %
Root relative squared error         109.0592 %
Total Number of Instances           128

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.387     0.603       0.83     0.698       0.716      catanzaro
0.17      0.836       0.613    0.708       0.716      venezia
0.26      0.74        0.703    0.704       0.716      Weighted Avg.
In this case, the machine learning model seems to respect a sort of proximity criterion: it is better at distinguishing distant cities than close ones.
Let's repeat this experiment for other areas. Below are the results of the lazy IBk classifier on a test set of unknown speakers from Genova and Torino, i.e. two cities belonging to the same linguistic area (the northern one):
=== Summary ===
Correctly Classified Instances       69        53.9063 %
Incorrectly Classified Instances     59        46.0938 %
Kappa statistic                       0.0695
Mean absolute error                   0.0667
Root mean squared error               0.2549
Relative absolute error              92.4914 %
Root relative squared error         134.6259 %
Total Number of Instances           128

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.169     0.577       0.238    0.337       0.534      genova
0.762     0.529       0.831    0.647       0.534      torino
0.47      0.553       0.539    0.494       0.534      Weighted Avg.
Next, the results of the lazy IBk classifier on a test set of unknown speakers from Torino and Venezia, i.e. two cities belonging to two neighboring areas (respectively, the northern and the Veneto one), which moreover are not separated by strong isoglosses [cf. chapter 2]:

=== Summary ===
Correctly Classified Instances      101        72.1429 %
Incorrectly Classified Instances     39        27.8571 %
Kappa statistic                       0.4474
Mean absolute error                   0.041
Root mean squared error               0.1982
Relative absolute error              56.4392 %
Root relative squared error         103.7125 %
Total Number of Instances           140

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.347     0.667       0.8      0.727       0.727      torino
0.2       0.79        0.653    0.715       0.727      venezia
0.268     0.733       0.721    0.721       0.727      Weighted Avg.
Last, the results of the lazy IBk classifier on a test set of unknown speakers from Genova and Palermo, i.e. two cities belonging to two non-neighboring areas (respectively, the northern and the extreme-southern one):

=== Summary ===
Correctly Classified Instances       67        58.2609 %
Incorrectly Classified Instances     48        41.7391 %
Kappa statistic                       0.1953
Mean absolute error                   0.0606
Root mean squared error               0.2426
Relative absolute error              83.1049 %
Root relative squared error         126.5906 %
Total Number of Instances           115

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.192     0.714       0.397    0.51        0.602      genova
0.603     0.525       0.808    0.636       0.602      palermo
0.378     0.629       0.583    0.567       0.602      Weighted Avg.
In this case, the machine learning model does not seem to respect the proximity criterion completely.
Now let's repeat the experiment on the Italian median area. Below are the results of the lazy IBk classifier on a test set of unknown speakers from Roma and Perugia, i.e. two cities belonging to the same linguistic area (the median one):
=== Summary ===
Correctly Classified Instances      104        61.5385 %
Incorrectly Classified Instances     65        38.4615 %
Kappa statistic                       0.2352
Mean absolute error                   0.0559
Root mean squared error               0.233
Relative absolute error              77.7626 %
Root relative squared error         123.6709 %
Total Number of Instances           169

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.662     0.424     0.567       0.662    0.611       0.622      perugia
0.576     0.338     0.671       0.576    0.62        0.622      roma
0.615     0.377     0.623       0.615    0.616       0.622      Weighted Avg.
Next, the results of the lazy IBk classifier on test sets of unknown speakers from Firenze and Perugia, followed by Firenze and Roma (two pairs of cities belonging in each case to two neighboring areas which, moreover, are not separated by strong isoglosses [cf. chapter 2]).
Lazy IBk, Firenze/Perugia:

=== Summary ===
Correctly Classified Instances       73        60.3306 %
Incorrectly Classified Instances     48        39.6694 %
Kappa statistic                       0.2119
Mean absolute error                   0.0576
Root mean squared error               0.2366
Relative absolute error              78.9739 %
Root relative squared error         123.8109 %
Total Number of Instances           121

Lazy IBk, Firenze/Roma:

=== Summary ===
Correctly Classified Instances       63        65.625  %
Incorrectly Classified Instances     33        34.375  %
Kappa statistic                       0.2916
Mean absolute error                   0.0501
Root mean squared error               0.2204
Relative absolute error              69.6414 %
Root relative squared error         116.9057 %
Total Number of Instances            96

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.192     0.677       0.477    0.56        0.642      firenze
0.523     0.646       0.808    0.718       0.642      roma
0.371     0.66        0.656    0.646       0.642      Weighted Avg.
Last, the results of the lazy IBk classifier on test sets of unknown speakers from Roma and Milano, followed by Perugia/Lecce and Perugia/Cagliari (three pairs of cities belonging in each case to two non-neighboring areas).
Lazy IBk, Roma/Milano:

=== Summary ===
Correctly Classified Instances       61        39.6104 %
Incorrectly Classified Instances     93        60.3896 %
Kappa statistic                      -0.1589
Mean absolute error                   0.0868
Root mean squared error               0.2919
Relative absolute error             115.6791 %
Root relative squared error         147.5163 %
Total Number of Instances           154

FP Rate   Precision   Recall   F-Measure   Class
0.558     0.567       0.373    0.45        milano
0.627     0.264       0.442    0.331       roma
0.581     0.465       0.396    0.41        Weighted Avg.

Lazy IBk, Perugia/Lecce:

=== Summary ===
Correctly Classified Instances       75        46.0123 %
Incorrectly Classified Instances     88        53.9877 %
Kappa statistic                      -0.0728
Mean absolute error                   0.0779
Root mean squared error               0.2758
Relative absolute error             107.1565 %
Root relative squared error         144.3262 %
Total Number of Instances           163

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.481     0.486       0.407    0.443       0.467      lecce
0.593     0.44        0.519    0.476       0.467      perugia
0.534     0.464       0.46     0.459       0.467      Weighted Avg.

Lazy IBk, Perugia/Cagliari:

=== Summary ===
Correctly Classified Instances       85        65.8915 %
Incorrectly Classified Instances     44        34.1085 %
Kappa statistic                       0.3482
Mean absolute error                   0.0497
Root mean squared error               0.2194
Relative absolute error              68.192  %
Root relative squared error         114.7137 %
Total Number of Instances           129

FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.468     0.55        0.846    0.667       0.69       cagliari
0.154     0.837       0.532    0.651       0.69       perugia
0.28      0.721       0.659    0.657       0.69       Weighted Avg.
In this last case our hypothesis seems to fail: the machine-learning model did not follow the proximity criterion at all; rather, it appears to distinguish cities belonging to the same area better than cities from different areas.
Lazy IBk, 7-way classification (linguistic areas)

Correctly Classified Instances          250              26.0417 %
Incorrectly Classified Instances        710              73.9583 %
Kappa statistic                           0.0935
Mean absolute error                       0.2113
Root mean squared error                   0.4595
Relative absolute error                  90.7376 %
Root relative squared error             134.8303 %
Total Number of Instances               960

Weighted Avg. ROC Area: 0.547
Confusion matrix (columns a and b missing):

    c    d    e    f    g       <-- classified as
   25    6   16    8   39   |   a = area_verde
   44   24   13   19   43   |   b = area_viola
   29   11    8    6   27   |   c = area_blu
    4    4    7    1    1   |   d = area_marrone
   31    1   16    3    8   |   e = sardegna
    1    4    3   29   12   |   f = area_gialla
   15   21    9   12   32   |   g = area_rossa
Naive-Bayes, 7-way classification (linguistic areas)

Correctly Classified Instances          137              14.2708 %
Incorrectly Classified Instances        823              85.7292 %
Kappa statistic                           0.0269
Mean absolute error                       0.2449
Root mean squared error                   0.4814
Relative absolute error                 105.1587 %
Root relative squared error             141.2592 %
Total Number of Instances               960
Naive-Bayes, 3-way classification (linguistic macro areas)

Correctly Classified Instances          238              35.5755 %
Incorrectly Classified Instances        431              64.4245 %
Kappa statistic                          -0.0258
Mean absolute error                       0.4297
Root mean squared error                   0.6309
Relative absolute error                  97.9081 %
Root relative squared error             134.9023 %
Total Number of Instances               669

              FP Rate   Precision   Recall   F-Measure   Class
              0.059     0.243       0.047    0.079       area_verde
              0.637     0.389       0.508    0.441       area_viola
              0.34      0.32        0.431    0.367       area_rossa
Weighted Avg. 0.392     0.329       0.356    0.318

    a    b    c       <-- classified as
    9  154   28   |   a = area_verde
    8  151  138   |   b = area_viola
   20   83   78   |   c = area_rossa
Nevertheless, the best results were achieved not through the IBk algorithm but through the SMO (raising the C parameter) and logistic regression methods. On the other hand, with default settings the SVM method did not infer the three classes and attributed all the samples to the largest class of the three, namely the northern one. Below, the results using SMO and Logistic Regression:
SMO, 3-way classification

Correctly Classified Instances          289              43.1988 %
Incorrectly Classified Instances        380              56.8012 %
Kappa statistic                           0.1145
Mean absolute error                       0.4132
Root mean squared error                   0.5138
Relative absolute error                  94.1476 %
Root relative squared error             109.8515 %
Total Number of Instances               669

              FP Rate   Precision   Recall   F-Measure   Class
              0.213     0.346       0.283    0.311       area_verde
              0.468     0.466       0.512    0.488       area_viola
              0.213     0.444       0.459    0.451       area_rossa
Weighted Avg. 0.326     0.426       0.432    0.428

Confusion matrix (columns a and b missing):

    a    b    c       <-- classified as
    .    .   28   |   a = area_verde
    .    .   76   |   b = area_viola
    .    .   83   |   c = area_rossa
...
Logistic Regression, 3-way classification

Correctly Classified Instances          287              42.8999 %
Incorrectly Classified Instances        382              57.1001 %
Kappa statistic                           0.115
Mean absolute error                       0.4072
Root mean squared error                   0.4854
Relative absolute error                  92.7673 %
Root relative squared error             103.7763 %
Total Number of Instances               669

              FP Rate   Precision   Recall   F-Measure   ROC Area   Class
              0.247     0.322       0.293    0.307                  area_verde
              0.43      0.484       0.505    0.494                  area_viola
              0.213     0.438       0.448    0.443       0.659      area_rossa
Weighted Avg. 0.319     0.425       0.429    0.427       0.578

Confusion matrix (columns a and b missing):

    a    b    c       <-- classified as
    .    .   36   |   a = area_verde
    .    .   68   |   b = area_viola
    .    .   81   |   c = area_rossa
Even if these last classifiers performed somewhat better than chance level, the results are rather disappointing: the system does not seem able to distinguish Italian regional varieties well through the MFCC12 modelling. Furthermore, we found experimentally that adding one or two delta regressions does not significantly improve the results. We probably have to bet on new phonotactic, spectral or prosodic features.
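For reference, the delta regressions mentioned here (the _de and _de_de attributes of Appendix 5) are usually computed with the standard regression formula over a short symmetric window. A minimal numpy sketch, assuming the per-frame MFCC matrix is already extracted and a window half-width of 2 (openSMILE's exact settings may differ):

```python
import numpy as np

def delta(feat, W=2):
    """First-order delta (differential) of a feature contour.

    feat: (T, D) array, one row per frame.
    d_t = sum_{w=1..W} w * (feat[t+w] - feat[t-w]) / (2 * sum_w w^2),
    with edge frames replicated for padding.
    """
    T = feat.shape[0]
    denom = 2 * sum(w * w for w in range(1, W + 1))
    padded = np.pad(feat, ((W, W), (0, 0)), mode="edge")
    d = np.zeros_like(feat, dtype=float)
    for w in range(1, W + 1):
        d += w * (padded[W + w : W + w + T] - padded[W - w : W - w + T])
    return d / denom

# second-order deltas (the _de_de features) are just the delta of the delta:
# dd = delta(delta(mfcc))
```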
                                    6 sec threshold    complete data set
Correctly Classified Instances          12.80 %             15.24 %
Incorrectly Classified Instances        87.20 %             84.76 %
Kappa statistic                          0.0621              0.093
Mean absolute error                      0.1246              0.1211
Root mean squared error                  0.3522              0.3476
Relative absolute error                 93.93 %             90.94 %
Root relative squared error            136.56 %            134.27 %
Total Number of Instances              414                 853
And then the SMO model, compared with the result of the same model on the complete
data set:
                                    6 sec threshold    complete data set
Correctly Classified Instances          18.36 %             11.96 %
Incorrectly Classified Instances        81.64 %             88.04 %
Kappa statistic                          0.1206              0.0599
Mean absolute error                      0.1299              0.1312
Root mean squared error                  0.2554              0.2582
Relative absolute error                 97.88 %             98.53 %
Root relative squared error             99.03 %             99.73 %
Total Number of Instances              414                 853
Next, we tested the reduced data set on a 7-way classification (linguistic areas). Below, the results of the lazy IBk model compared with the results of the same model on the complete data set:
                                    6 sec threshold    complete data set
Correctly Classified Instances          23.67 %             26.04 %
Incorrectly Classified Instances        76.33 %             73.96 %
Kappa statistic                          0.0668              0.0935
Mean absolute error                      0.2181              0.2113
Root mean squared error                  0.4665              0.4595
Relative absolute error                 92.69 %             90.74 %
Root relative squared error            135.90 %            134.83 %
Total Number of Instances              414                 960
And then the SMO model, compared with the result of the same model on the complete
data set:
                                    6 sec threshold    complete data set
Correctly Classified Instances          25.12 %             24.38 %
Incorrectly Classified Instances        74.88 %             75.63 %
Kappa statistic                          0.0729              0.0236
Mean absolute error                      0.2308              0.2312
Root mean squared error                  0.3417              0.3427
Relative absolute error                 98.07 %             99.27 %
Root relative squared error             99.57 %            100.56 %
Total Number of Instances              414                 960
Finally, we tested the reduced data set on a 3-way classification (linguistic macro areas, cf. par. 4.5). Below, the results of the lazy IBk model on a 6-second-threshold data set and on a 10-second-threshold data set, compared with the results of the same model on the complete data set:
                                    6 sec thresh.    10 sec thresh.    complete data set
Correctly Classified Instances          36.56 %          37.70 %           39.46 %
Incorrectly Classified Instances        63.44 %          62.29 %           60.54 %
Kappa statistic                          0.0268           0.0633            0.0659
Mean absolute error                      0.423            0.4154            0.4036
Root mean squared error                  0.6499           0.6434            0.6351
Relative absolute error                 95.94 %          93.55 %           91.96 %
Root relative squared error            138.6 %          136.5 %           135.79 %
Total Number of Instances              279              122               669
Hereafter, the SMO model (C parameter: 2.0), followed by the result of the same model on
the complete data set:
                                    6 sec thresh.    10 sec thresh.    complete data set
Correctly Classified Instances          46.23 %          44.26 %           43.199 %
Incorrectly Classified Instances        53.76 %          55.73 %           56.801 %
Kappa statistic                          0.1812           0.1866            0.1145
Mean absolute error                      0.3903           0.3971            0.4132
Root mean squared error                  0.4902           0.4953            0.5138
Relative absolute error                 88.53 %          89.43 %           94.147 %
Root relative squared error            104.5 %          105 %             109.85 %
Total Number of Instances              279              122               669
Hereafter, the naive-Bayes model, compared with the result of the same model on the complete data set:

                                    6 sec thresh.    10 sec thresh.    complete data set
Correctly Classified Instances          36.56 %          40.16 %           35.575 %
Incorrectly Classified Instances        63.44 %          59.83 %           64.424 %
Kappa statistic                          0.0538           0.1159           -0.0258
Mean absolute error                      0.4137           0.3923            0.4297
Root mean squared error                  0.6126           0.5767            0.6309
Relative absolute error                 93.84 %          88.36 %           97.908 %
Root relative squared error            130.6 %          122.3 %           134.9 %
Total Number of Instances              279              122               669
Finally, the logistic regression model, compared with the result of the same model on the
complete data set:
                                    6 sec thresh.    10 sec thresh.    complete data set
Correctly Classified Instances          46.59 %          38.52 %           42.9 %
Incorrectly Classified Instances        53.4 %           61.47 %           57.1 %
Kappa statistic                          0.1841           0.0835            0.115
Mean absolute error                      0.381            0.4078            0.4072
Root mean squared error                  0.4804           0.5607            0.4854
Relative absolute error                 86.41 %          91.84 %           92.767 %
Root relative squared error            102.4 %          118.9 %           103.77 %
Total Number of Instances              279              122               669
To sum up, except for the IBk method (a memory-based algorithm), all the other methods tend slightly to improve their performances when trained and tested on longer samples. Notably, a well-balanced ratio between training data and selected threshold is obtained using the SMO or Logistic methods with the 6-second-threshold reduced data set. Differently, the baseline built with a naive-Bayes method seems to enhance its performances sharply when longer samples are used as the data set.
Chapter V: conclusion
Classifier performances were compared with the scores of human Italian testers on two surveys spread online. Built earlier, these surveys were composed of about 20 samples each; they covered all the cities in the corpus, and the audio files were chosen from the CLIPS corpus by two different experimenters.
2) Find out confounding variables in the data set. We focused mainly on the gender variable and the speaker variable.
3) We ran a set of experiments focused on the concept of linguistic area and geo-linguistic proximity. Notably, we checked the model's ability to distinguish varieties of the same linguistic area, varieties of two neighbouring areas, and varieties of two non-neighbouring areas.
4) Finally, we repeated the same machine-learning experiments with a smaller data set stripped of the shortest samples, in order to see whether performances are better on longer recordings.
The main data set used is the result of an acoustic feature extraction run by the openSMILE software. Each sample was described by a 227-dimensional vector of Mel-frequency cepstral coefficient (MFCC) features: notably, 12 MFC coefficients and 19 functionals for each of them. The choice of features reflects the fact that most of the literature proposes MFCCs as the most prominent acoustic feature for recognizing the geo-linguistic accent of a speaker and for speaker profiling tasks in general [cf. chapter 3].
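To make this representation concrete, here is a numpy sketch of the functional-extraction step: each of the 12 MFCC contours, however obtained, is collapsed into a fixed set of per-contour statistics in the spirit of the emobase functionals listed in Appendix 4. Only a subset is implemented, and the definitions of the two linear-regression errors are one plausible reading, not necessarily openSMILE's exact ones:

```python
import numpy as np

def functionals(contour):
    """Collapse one per-frame contour (1-D array) into fixed statistics,
    after a subset of the emobase functionals of Appendix 4."""
    t = np.arange(len(contour))
    m, b = np.polyfit(t, contour, 1)          # linregc1 (slope), linregc2 (offset)
    resid = contour - (m * t + b)
    q1, q2, q3 = np.percentile(contour, [25, 50, 75])
    return {
        "max": contour.max(), "min": contour.min(),
        "amean": contour.mean(), "stddev": contour.std(),
        "linregc1": m, "linregc2": b,
        "linregerrA": np.abs(resid).mean(),   # linear error (assumed definition)
        "linregerrQ": (resid ** 2).mean(),    # quadratic error (assumed definition)
        "quartile1": q1, "quartile2": q2, "quartile3": q3,
        "iqr1-3": q3 - q1,
    }

def sample_vector(mfcc):
    """mfcc: (12, T) matrix of MFCC contours -> one flat, fixed-length
    feature vector describing the whole sample."""
    return np.array([v for row in mfcc
                       for v in functionals(np.asarray(row, float)).values()])
```

Whatever the sample duration T, the resulting vector has a fixed length, which is what allows variable-length recordings to feed a standard classifier.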
Next, the data set was stripped of anomalies and outliers through both manual and automatic procedures [see paragraph 4.3.2], and it was enriched with new variables, such as gender, linguistic area and speaker identifier, in order to be able to perform several data-mining operations and therefore to read through the data.
Classifiers were built using mainly three algorithms: Sequential Minimal Optimization (SMO, a relatively recent version of the classic Support Vector Machines method), Bayesian Networks/naive-Bayes, and IBk (the Weka default k-nearest-neighbours lazy algorithm). The outcomes and trends of these methods over the various tasks were quite varied.
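With k = 1, IBk essentially stores the training vectors and labels each test sample with the class of its nearest stored vector. A minimal illustrative stand-in (Euclidean distance, majority vote, hypothetical city labels) rather than Weka's implementation, which additionally normalizes attributes and offers distance weighting:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=1):
    """Minimal k-nearest-neighbours classifier: Euclidean distance,
    majority vote among the k closest training samples."""
    preds = []
    for x in np.atleast_2d(X_test):
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored sample
        nearest = np.argsort(dists)[:k]               # indices of the k closest
        vote = Counter(y_train[i] for i in nearest)   # majority class among them
        preds.append(vote.most_common(1)[0][0])
    return preds
```

Being memory-based, such a model has no training phase proper; this is also why, as discussed below, it reacts differently from the other methods when the training data are reduced.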
We tested the models built, using the samples of the online surveys as a test set at different levels of granularity, in order to compare human behaviours with machine trends. The speaker variable turned out to be a crucial confounding variable, completely compromising the interpretation of classifier predictions. On the other hand, the gender variable did not seem to mislead the system. Through new experiments and by discussing model outputs, we wanted to see whether our MFCC-based classifier, like humans, followed a sort of geo-linguistic proximity criterion to guess the class. The main purpose of this analysis was to appraise whether a correlation exists between Italian cities belonging to the same linguistic area, given some objective borders such as isoglosses.
Finally, we carried out a set of machine-learning experiments removing the shortest samples of the CLIPS corpus from our data set. In doing so, we assumed that longer samples are easier to classify since they convey more spectral information. The various methods responded differently to these experiments.
In the next paragraph we will critically discuss the results obtained. Below, a flowchart of the overall experiment we carried out.
5.2 Discussion
On 19/01/2016 the testers of the second survey spread online numbered 73, and they achieved an average accuracy of 34.3%, which we consider a fine result on a 14-way classification task (7% is chance level), taking into account that the speakers' audio samples provided were rather short and difficult to guess. Using these samples as a test set for a k-NN (Weka IBk algorithm) model, the accuracy of the predictions was quite satisfactory (30%); however, machine-learning experiments on larger test sets got high error rates (around 91% relative error rate for our best, k-NN-based, model). Test sets were quite well balanced by gender, variety and speaker; in any case, a better evaluation method is perhaps necessary, such as leave-one-speaker-out cross-validation. While the results showed the importance of keeping the speakers of the training and test sets separate, the gender variable does not seem to mislead the classifier: the performances of gender-dependent models remained stable around 14% accuracy on the 14-way classification, no differently from the gender-independent model. On the linguistic-area task (7-way classification) the k-NN classifier got around 90% relative error rate, similarly to the 14-way classification, with an accuracy of 26%. A naive-Bayes baseline classifier obtained nearly chance level (14% accuracy).
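The leave-one-speaker-out evaluation suggested above can be sketched as a split generator that keeps every sample of a given speaker on one side of the fold only (pure Python; the per-sample identifiers are a hypothetical layout, corresponding in our ARFF files to the locutor attribute added in Appendix 3):

```python
def leave_one_speaker_out(speakers):
    """Yield (train_idx, test_idx) pairs in which every sample of exactly
    one speaker is held out, so no speaker appears on both sides of a fold.

    speakers: one identifier per sample, e.g. ["1001", "1001", "1002", ...]
    """
    for held_out in sorted(set(speakers)):
        test_idx = [i for i, s in enumerate(speakers) if s == held_out]
        train_idx = [i for i, s in enumerate(speakers) if s != held_out]
        yield train_idx, test_idx
```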
The set of experiments seeking correlations between varieties of the same area did not give notable results: the hypothesis seems to hold only for the extreme-southern area, whereas we did not obtain equally interesting outcomes for the northern area. For the median area, the system rather appears to distinguish cities of the same area better. Equally, the visualization of the confusion matrix of the best classifier (k-NN method) did not show any relevant correlation between city varieties and linguistic areas [appendix 7].
Reducing the number of classes to 3 by merging some linguistic areas and removing others, the best scores are obtained through another algorithm, namely Logistic regression. Nevertheless, the relative error rate is still high (92.7%) and the Kappa statistic quite low (0.115). Again, the naive-Bayes baseline obtained roughly chance level (which was 33.3% in this latter case).
In our last experiment we noticed a general trend: performances improve when the number of short samples is reduced. This means that the longer the recordings and the richer they are in spectral information, the better the system classifies them. However, the enhancement is slight and varies according to the method used. Generally, the k-NN model (most of the time our best classifier) does not seem to respond properly to data reduction, probably because it is a memory-based algorithm. On the other hand, the SMO and especially the Logistic Regression models show some improvement when short samples are removed: on the 3-way classification the latter reached a relative error rate of 86.4%, with an accuracy of 46.6%. Even if all methods seemed to suffer from too strict a reduction of the data, our naive-Bayes baseline improved its performances significantly when trained with longer samples.
However, the best solution is perhaps 1) to flank the acoustic module with a GMM syllable modelling, namely to implement a phonotactic module as in [Akbacak et al 2012]; likewise, an even more effective approach is the extraction of i-vectors, a newly proposed set of features proved to be very effective for this kind of task [Verna et Das 2015]; or 2) to create a metric for Italian regional variety, similar to ACCDIST for British accents [Brown 2014], which is able to distinguish among 14 different accents (in text-dependent contexts). This last solution is maybe the most interesting because of its suitability for forensic applications and because it would contribute to the development of Italian dialectometry, even if it is rather time-consuming to implement. With such a system we could easily isolate the portions of speech where cross-accent differences are prominent: vowels, realization of fricatives, voice onset time and so on.
We express a last consideration that goes beyond this accent recognition task. Regional accent is just one factor (though a substantial one) in speaker variability. If we consider an ASR system that improves its performance by adapting to the dialectal and idiolectal characteristics of its user, it is not truly necessary to implement a text-independent model. For example, a vocal interface could present its first-time user with a sort of targeted survey, asking them to pronounce some key words, in order to collect focused information about dialect and idiolect. Or, even less intrusively, the ASR system could trigger a collecting module whenever the user pronounces certain informative words, throughout the system's lifetime.
APPENDIX 1
UNIX script to convert CLIPS .raw samples to WAV en masse using SoX

for i in ./*.raw; do sox -r 8000 --bits 8 --encoding u-law -t raw "$i" "M_wav/$i.wav"; done
The converted file, now suitable for openSMILE extraction, has these properties:

Input File      : 'example.wav'
Channels        : 1
Sample Rate     : 8000
Precision       : 14-bit
Duration        : 00:00:17.68 = 141408 samples ~ 1325.7 CDDA sectors
File Size       : 141k
Bit Rate        : 64.0k
Sample Encoding : 8-bit u-law
APPENDIX 2
Perl script to filter outliers after extracting the loudness features

use strict; use warnings; use locale;

my $wav;
my $c = 0;
my $threshold = 15;  # threshold changes with regard to gender:
                     # 15 for female speakers, 25 for male speakers

while (my $ligne = <STDIN>) {
    chomp $ligne;
    if ($ligne =~ /(\w+_?\w+,\w+,[FM],\d+,'.*'),/g) {
        $wav = $1;
        while ($ligne =~ /0\.000000e\+00/g) {
            $c++;
        }
        if ($c >= $threshold) {
            print "$wav\n";
        }
        $c = 0;
    }
}
APPENDIX 3
Perl script to enrich an ARFF file with linguistic area, city name, gender and speaker information

use strict; use warnings; use locale;

my $loc;
while (my $ligne = <STDIN>) {
    chomp $ligne;
    if ($ligne =~ /\@relation/) {
        print $ligne, "\n";
        print "\@isoglossa\n";
        print "\@cityname\n";
        print "\@gender\n";
        print "\@locutor\n";
    }
    elsif ($ligne =~ /\.\/TL(\d\d\d\d)\d/) {
        $loc = $1;
        print "$ARGV[0],$ARGV[1],$ARGV[2],$loc,$ligne\n";
    }
    else {
        print "$ligne\n";
    }
}

## UNIX usage example with arguments:
## $> perl path/to/script.pl sardegna cagliari F < path/to/arff > path/to/new/richer/arff
APPENDIX 4
Functionals of emobase.conf
max         The maximum value
min         The minimum value
maxPos      The absolute position of the maximum value (in frames)
minPos      The absolute position of the minimum value (in frames)
amean       The arithmetic mean of the contour
linregc1    The slope (m) of a linear approximation of the contour
linregc2    The offset (t) of a linear approximation of the contour
linregerrA  The linear error, computed as the difference of the linear approximation and the actual contour
linregerrQ  The quadratic error, computed as the difference of the linear approximation and the actual contour
stddev      The standard deviation of the values in the contour
skewness    The skewness (3rd order moment)
kurtosis    The kurtosis (4th order moment)
quartile1   The first quartile (25% percentile)
quartile2   The second quartile (50% percentile)
quartile3   The third quartile (75% percentile)
iqr1-2      The inter-quartile range: quartile2 - quartile1
iqr2-3      The inter-quartile range: quartile3 - quartile2
iqr1-3      The inter-quartile range: quartile3 - quartile1
APPENDIX 5
681 MFCC features used to train the model, or 227 if delta regressions are excluded
About annotation:
"The suffix _sma appended to the names of the low-level descriptors indicates that they were
smoothed by a moving average filter with window length 3. The suffix _de appended to sma
suffix indicates that the current feature is a 1st order delta coefficient (differential) of the
smoothed low-level descriptor." [Eyben et al 2010]
@attribute cityname {bari,cagliari,catanzaro,firenze,genova,lecce,milano,napoli,parma,perugia,palermo,roma,torino,venezia}
@attribute mfcc_sma[1]_max numeric
@attribute mfcc_sma[1]_min numeric
@attribute mfcc_sma[1]_range numeric
@attribute mfcc_sma[1]_maxPos numeric
@attribute mfcc_sma[1]_minPos numeric
@attribute mfcc_sma[1]_amean numeric
@attribute mfcc_sma[1]_linregc1 numeric
@attribute mfcc_sma[1]_linregc2 numeric
@attribute mfcc_sma[1]_linregerrA numeric
@attribute mfcc_sma[1]_linregerrQ numeric
@attribute mfcc_sma[1]_stddev numeric
@attribute mfcc_sma[1]_skewness numeric
@attribute mfcc_sma[1]_kurtosis numeric
@attribute mfcc_sma[1]_quartile1 numeric
@attribute mfcc_sma[1]_quartile2 numeric
@attribute mfcc_sma[1]_quartile3 numeric
@attribute mfcc_sma[1]_iqr1-2 numeric
@attribute mfcc_sma[1]_iqr2-3 numeric
@attribute mfcc_sma[1]_iqr1-3 numeric
. . .
@attribute mfcc_sma[12]_max numeric
@attribute mfcc_sma[12]_min numeric
. . .
@attribute mfcc_sma[12]_iqr2-3 numeric
@attribute mfcc_sma[12]_iqr1-3 numeric
. . .
@attribute mfcc_sma_de[1]_max numeric
. . .
@attribute mfcc_sma_de[12]_iqr1-3 numeric
@attribute mfcc_sma_de_de[1]_max numeric
. . .
@attribute mfcc_sma_de_de[12]_iqr1-3 numeric
APPENDIX 6
37 loudness features used to detect outliers

Extraction run by openSMILE, using the configuration file prosodyViterbiLoudness.conf

Features list:

@attribute name string
@attribute frameTime numeric
@attribute F0final_sma_stddev numeric
@attribute F0final_sma_amean numeric
@attribute F0final_sma_linregc1 numeric
@attribute F0final_sma_centroid numeric
@attribute F0final_sma_percentile10.0 numeric
@attribute F0final_sma_percentile90.0 numeric
@attribute F0final_sma_pctlrange0-1 numeric
@attribute F0finalLog_sma_stddev numeric
@attribute F0finalLog_sma_amean numeric
@attribute F0finalLog_sma_linregc1 numeric
@attribute F0finalLog_sma_centroid numeric
@attribute F0finalLog_sma_percentile10.0 numeric
@attribute F0finalLog_sma_percentile90.0 numeric
@attribute F0finalLog_sma_pctlrange0-1 numeric
@attribute voicingFinalUnclipped_sma_stddev numeric
@attribute voicingFinalUnclipped_sma_amean numeric
@attribute voicingFinalUnclipped_sma_linregc1 numeric
@attribute voicingFinalUnclipped_sma_centroid numeric
@attribute voicingFinalUnclipped_sma_percentile10.0 numeric
@attribute voicingFinalUnclipped_sma_percentile90.0 numeric
@attribute voicingFinalUnclipped_sma_pctlrange0-1 numeric
@attribute HarmonicsToNoiseRatioACFLogdB_sma_stddev numeric
@attribute HarmonicsToNoiseRatioACFLogdB_sma_amean numeric
@attribute HarmonicsToNoiseRatioACFLogdB_sma_linregc1 numeric
@attribute HarmonicsToNoiseRatioACFLogdB_sma_centroid numeric
@attribute HarmonicsToNoiseRatioACFLogdB_sma_percentile10.0 numeric
@attribute HarmonicsToNoiseRatioACFLogdB_sma_percentile90.0 numeric
@attribute HarmonicsToNoiseRatioACFLogdB_sma_pctlrange0-1 numeric
@attribute loudness_sma_stddev numeric
@attribute loudness_sma_amean numeric
@attribute loudness_sma_linregc1 numeric
@attribute loudness_sma_centroid numeric
@attribute loudness_sma_percentile10.0 numeric
@attribute loudness_sma_percentile90.0 numeric
@attribute loudness_sma_pctlrange0-1 numeric
@attribute class {0,1,2,3}
@data
...
APPENDIX 7
Visualising the performances of the IBk MFCC-based classifier through the confusion matrix, using R portable. The highest column is the correct answer; the other columns show the distribution of the wrong predictions.
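The matrices plotted in this appendix are built in the usual way, with rows indexed by the true class and columns by the predicted class (the layout used in the Weka output; the plots themselves were made with R portable). A minimal sketch of the construction step:

```python
def confusion_matrix(true_labels, pred_labels, classes):
    """Confusion matrix with rows indexed by the true class and columns
    by the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, pred_labels):
        m[index[t]][index[p]] += 1
    return m
```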
APPENDIX 8
Perl scripts generating focused data sets

use strict; use warnings; use locale;

my $city1 = 'roma';     # this script generates a data set of audio samples
my $city2 = 'perugia';  # from the two cities beside; we used it to set up exp. a

while (my $ligne = <STDIN>) {
    chomp $ligne;
    if ($ligne =~ /\@.*$/) {
        print $ligne, "\n";
    }
    elsif ($ligne eq "") {
        print $ligne, "\n";
    }
    elsif ($ligne =~ /.[^,]*,$city1|.[^,]*,$city2/) {
        print $ligne, "\n";
    }
}

. . .

use strict; use warnings; use locale;

my $gen = 'F';  # this script generates a data set of audio samples from a
                # selected gender; we used it to set up a confounding-variable
                # experiment (par. 4.6)

while (my $ligne = <STDIN>) {
    chomp $ligne;
    if ($ligne =~ /\@.*$/) {
        print $ligne, "\n";
    }
    elsif ($ligne eq "") {
        print $ligne, "\n";
    }
    elsif ($ligne =~ /.[^,]*,.[^,]*,$gen/) {
        print $ligne, "\n";
    }
}
APPENDIX 9
UNIX script used to extract the duration of all samples

# for i in ./*.wav; do sox "$i" -n stat 2>&1
#   | sed -n 's#^Length (seconds):[^0-9]*\([0-9.]*\)$#\1#p'
#   ; echo "$i"; done >> ../list

Perl script used to comment out, in the ARFF data set, the samples lasting less than 6 seconds

use strict; use locale; use warnings;

open (LIE, "<", "path/to/the/list/created/with/unix/command/above");
my %short;
my $c = 0;
while (my $ligne = <LIE>) {
    chomp $ligne;
    if ($ligne =~ /^[012345]\.\d/) {
        $c++;
    }
    elsif ($ligne =~ /^\.\// and $c == 1) {
        $short{$ligne} = 1;
        $c = 0;
    }
}
close (LIE);

while (my $ligne = <STDIN>) {
    chomp $ligne;
    if ($ligne =~ /(\w+_?\w+),(\w+),([FM]),(\d+),(\.\/.*av),/) {
        if (defined($short{$5.' '.$2.' '.$3})) {
            print "\%$ligne\n";
        }
        else {
            print $ligne, "\n";
        }
    }
    else {
        print $ligne, "\n";
    }
}
APPENDIX 10
Forensic cases where speaker accent profiling has been useful
"The realisation that the voice on the tape was not the Ripper, was a stunning blow to the
police. The voice had led them off on a wild goose chase for close to 18 months. The credibility
that the police put in the letters and tape had also helped the real Yorkshire Ripper, Peter
Sutcliffe, to escape further police scrutiny during interviews because he was eliminated on
voice and handwriting samples. The police failure to err on the side of caution as to whether
the author of the letters and tape was also the Ripper, also meant that the author, given the
moniker Wearside Jack, would also benefit from that police belief. Even if he was the killer of
Joan Harrison, which is by no means a certainty, he probably would have been able to come up
with alibis for some of the killings, or by where he lived, or by his work, could not have been in
the area or had the opportunity to commit the murders. Since he wasn't the Yorkshire Ripper,
the possibilities for avoiding suspicion based on the murders are almost limitless. As well,
there is the possibility that Wearside Jack was never interviewed by the police, by living
outside the country, or was never suspected, or wasn't reported to the police.
It must also be remembered that even Peter Sutcliffe was able to satisfy the police in his
interviews that he was not the killer, even before the release of the tape. Mainly, his alibis
consisted of "being at home" at the crucial times, which were backed up by his wife. As well,
the questioning was usually about events months previous to the interviews. The only
apparently "iron-clad" alibi he gave was for the night of the return visit to the body of Jean
Jordan, when the Sutcliffes had been having a house-warming party. Of course, Peter Sutcliffe
had returned to Jean Jordan's body after that event.
The analysis of the tape had produced two possible valuable leads to the author of the tape.
The department of Linguistics and Phonetics at Glasgow University found that Wearside Jack
suffered two speech defects, one being a distinctive pronunciation of the letter 's', and the
other being a hidden stammer. It was almost a certainty that he had undergone speech
therapy training. The police looked upon this as a possible breakthrough, and approached
every speech therapist in the North of England, but most refused to help based on the grounds
of medical ethics.
Even the voice experts had been surprised that the author of the tape had not been identified
by his voice characteristics. Jack Windsor Lewis was interviewed by Barbara Frum on the CBC
100
(Canada) Radio show "As It Happens" on January 12 1981, shortly after Peter Sutcliffe's
confession. In answer to the question about why, as time went on, it became more and more
improbable that the author of the tape was the Yorkshire Ripper, he said: "Based on the
improbability of people failing to find a man with such a distinctive voice. It's a very distinctive
accent, a very distinctive voice quality, and he has certain speech defects, and so on, that, all
the features of the voice put together make him highly identifiable."
Jack Windsor Lewis also stated that people recognise voices fairly easily, and: "I'm sure people
would have come forward immediately and said I recognise this voice. This is the voice of, and
then quoted a name. Now obviously if they are looking for a murderer only they will pass by
someone who is referred to them as having such a voice but couldn't possibly be a murderer."
[http://www.execulink.com/~kbrannen/wearside.htm]
"In 1981, 13-year-old Mary Doe goes missing from her central California home. Her parents
tell her siblings that she has run away and that they are never to speak of her again. More than
20 years later, in 2003, Marys siblings report the case to the police, who immediately suspect
homicide. Marys parents, now living in New York State, are interviewed, and her stepfather
comes close to confessing. A short time later, a woman is stopped by the police in Phoenix,
Arizona, for a trafc violation. She has an Arizona drivers license in the name of Mary Doe and
claims to be the missing girl and to have spent 20-plus years as a runaway in Arizona and
California, living under various assumed names. For a number of reasons, the detectives who
interview this woman, hereafter referred to as the Person of Interest (POI), suspect that she is
an imposter. Not least of these reasons is the POIs strong Southern accent, a seeming
impossibility in someone who spent most of her rst 13 years in the Northeast, the West
Coast, and Hawaii. The POI claims that her accent comes from brief visits to New Orleans and
Georgia, again, a dubious claim at best. This clearly is a case for a forensic linguist, and experts
are consulted to help unmask the imposters real identity by creating a forensic speaker
prole of the POIs regional background." [Schilling et Marsters 2015]
References

Akbacak, M., Vergyri, D., Stolcke, A., Scheffer, N., Mandal, A. (2012) Effective Arabic dialect classification using diverse phonotactic models. In: Proceedings of Interspeech '12.

Baker, B., Vogt, R., Sridharan, S. (2005) Gaussian Mixture Modelling of Broad Phonetic and Syllabic Events for Text-Independent Speaker Verification. In: Proceedings of the 9th European Conference on Speech Communication and Technology (Eurospeech '05 Interspeech), Lisbon, Portugal, pp. 2429-2432.

Berruto, Gaetano (2011) Variazione linguistica. Entry of Enciclopedia dell'Italiano Treccani. http://www.treccani.it/enciclopedia/variazione-linguistica_(Enciclopedia_dell'Italiano)/ (last visited 03/02/2016)

Berruto, G. (1987) Sociolinguistica dell'italiano contemporaneo. Roma, La Nuova Italia Scientifica (14th reprint Roma, Carocci, 2006).

Berruto, G. (1993) Varietà diamesiche, diastratiche e diafasiche. In: Sobrero, A. (ed.), Introduzione all'italiano contemporaneo. La variazione e gli usi, Bari, Laterza.

Boughton, Z. (2006) When perception isn't reality: Accent identification and perceptual dialectology in French. Journal of French Language Studies 16: 277-304.

Bove, T., Giua, P.E., Forte, A., Rossi, C. (2002) Un metodo statistico per il riconoscimento del parlatore basato sull'analisi delle formanti. In: Statistica, anno LXII, n. 3.

Brown, Georgina (2014) Y-ACCDIST: An Automatic Accent Recognition System for Forensic Applications. MA by research thesis, University of York.

Caramazza, A., Yeni-Komshian, G. H. (1974) Voice onset time in two French dialects. Journal of Phonetics 2, 239-245.

Chauhan, Tejal, Hemant Soni, Sameena Zafar (2013) A Review of Automatic Speaker Recognition System. In: International Journal of Soft Computing and Engineering (IJSCE), ISSN 2231-2307, Volume 3, Issue 4, September 2013.

Chen, T., Huang, C., Chang, E., Jiang, W. (2001) Automatic accent identification using Gaussian mixture models. In: Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 343-346.

Coseriu, Eugenio (1973) Lezioni di linguistica generale. Torino, Boringhieri.

Cunningham, P., Delany, S.J. (2007) k-Nearest neighbour classifiers. Technical Report UCD-CSI-2007-4, Dublin: Artificial Intelligence Group.

D'Addario, C. (2015) Percezione dell'italiano regionale. In: Dialetto: parlato, scritto, trasmesso. Edited by Gianna Marcato, Padova, CLEUP 2015.

De Mauro, T. (1963) Storia linguistica dell'Italia unita. Laterza editore.

Eyben, F., Wollmer, M., Schuller, B. (2010) openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor. In: Proc. ACM Multimedia (MM), pp. 1459-1462. Florence, Italy.

Farrús, M. (2008) Fusing prosodic and acoustic information for speaker recognition. PhD thesis, Universitat Politècnica de Catalunya.

Gooskens, C. (2002) How well can Norwegians identify their dialects? In: Nordic Journal of Linguistics.

Grassi, C., Sobrero, A.A. & Telmon, T. (1997) Fondamenti di dialettologia italiana. Roma: Laterza.

Grassi, C., Sobrero, A.A. & Telmon, T. (2003) Introduzione alla dialettologia italiana. Roma: Laterza.

Hanani, Abualsoud (2012) Human and Computer Recognition of Regional Accents and Ethnic Groups from British English Speech. School of Electronic, Electrical and Computer Engineering, The University of Birmingham.

Heeringa, W. (2004) Measuring Dialect Pronunciation Differences Using Levenshtein Distance. Ph.D. thesis, University of Groningen.

Hou, Jue, Yi Liu, Thomas Fang Zheng, Jesper Olsen, Jilei Tian (2010) Using Cepstral and Prosodic Features for Chinese Accent Identification. In: The 7th International Symposium on Chinese Spoken Language Processing (ISCSLP 2010), Tainan, 2010: 177-181.

Houtsma, A.J.M. (1995) Pitch perception. In: Hearing. Handbook of Perception and Cognition. Edited by Brian C.J. Moore. Academic Press, second edition, pp. 267-295 (chapter 8).

Huckvale, M. (2004) ACCDIST: a metric for comparing speakers' accents. In: Proc. International Conference on Spoken Language Processing, Jeju, Korea, pp. 29-32.

ISTAT report (2012) L'uso della lingua italiana, dei dialetti e di altre lingue in Italia. Source: www.istat.it

Jessen, M. (2007) Speaker Classification in Forensic Phonetics and Acoustics. In C. Müller (Ed.), Speaker Classification (1), 180-204. Berlin: Springer.

Jurafsky, D., Martin, J.H. (2000) Speech and Language Processing. An introduction to natural language processing, computational linguistics and speech recognition. Pearson Prentice Hall, second edition.

Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murty, K.R.K. (2001) Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13: 637-649.

Kersta, L. G. (1962) Voiceprint Identification Infallibility. J. Acoust. Soc. Am. 34.

Kessler, B. (1995) Computational dialectology in Irish Gaelic. In: Proc. Conf. European ACL, 7th, Dublin, March 27-31, pp. 60-67. San Francisco: Morgan Kaufmann Publishers.

Köster, O., R. Kehrein, K. Masthoff and Y.H. Boubaker (2012) The tell-tale accent: identification of regionally marked speech in German telephone conversations by forensic phoneticians. Journal of Speech, Language and the Law 19.1, 51-71.

Kulshreshtha, M., Mathur, R. (2012) Dialect Accent Feature for Establishing Speaker Identity: A Case Study. Springer Briefs in Electrical and Computer Engineering, 2012.

Lippmann, R. P. (1997) Speech recognition by machines and humans. Speech Commun., vol. 22, pp. 1-15, 1997.

Lorinczi, M. (1999) Storia sociolinguistica della lingua sarda alla luce degli studi di linguistica sarda. In F. Fernandez Rei and A. Santamarina Fernandez (eds), Estudios de sociolinguistica romanica. Linguas e variedades minorizadas, Universidade de Santiago de Compostela, 1999, pp. 385-424.

Maiden, M., Parry, M.M. (eds.) (1996) The Dialects of Italy. London: Routledge.

Meyer, B. T., Kollmeier, B. (2011) Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition. Speech Comm., vol. 53, no. 5, pp. 753-767, 2011.

Montemagni, S. (2007) Patterns of phonetic variation in Tuscany: using dialectometric techniques on multi-level representations of dialectal data. In P. Osenova et al. (eds.), Proceedings of the Workshop on Computational Phonology at RANLP-2007, pp. 49-60.

Morrison, G.S. (2010) Forensic voice comparison. In Freckelton, I., Selby, H. (eds.), Expert Evidence. Sydney, Australia: Thomson Reuters.

Muthusamy, Y. K., Cole, R.A. (1992) Automatic segmentation and identification of ten languages

U. (2010) Isoglossa. Entry from Enciclopedia dell'Italiano Treccani. http://www.treccani.it/enciclopedia/isoglossa_(Enciclopedia_dell'Italiano)/ (last visited 03/02/2016)

Zissman, M. A. (1996) Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech and Audio Proc., SAP-4(1): 31-44, January 1996.