
ALMA MATER STUDIORUM - UNIVERSITÀ DI BOLOGNA

SCUOLA DI LETTERE E BENI CULTURALI

Degree programme in
Italianistica, Culture letterarie europee, Scienze linguistiche (LM)

ABOUT THE DEVELOPMENT OF AN AUTOMATIC MFCC-BASED
CLASSIFIER OF REGIONAL ITALIAN ACCENTS

Thesis in
Trattamento automatico delle lingue

Supervisor: Prof. FABIO TAMBURINI
Co-supervisor: Dott.ssa GLORIA GAGLIARDI
Presented by: FILIPPO BONORA

Third session
Academic year 2014-2015

Contents

Abstract
Riassunto
Chapter I: framework
  1.1 Some characteristics of human voice
    1.1.1 Automatic approach to acoustic speech processing
  1.2 About speaker profiling
Chapter II: regional Italian accents
  2.1 Italian languages versus standard Italian
  2.2 Diatopic varieties of Italian
Chapter III: methods
  3.1 Accent identification problem
  3.2 Mel-Frequency Cepstrum Coefficients (MFCCs)
  3.3 Machine-learning for speech processing
  3.4 Literature
  3.5 Introduction to next experiments
Chapter IV: experiments
  4.0 Data: CLIPS telephonic corpus
  4.1 Two surveys to explore human perception of Italian accents
    4.1.1 Results and discussion
  4.2 Tools
    4.2.1 openSMILE
    4.2.2 Weka
  4.3 Composing data sets
    4.3.1 Features extraction with openSMILE
    4.3.2 Detecting outliers in CLIPS telephonic corpus
    4.3.3 Add information for data mining
  4.4 Speaker variable is a confounding variable
    4.4.1 Gender variable
  4.5 Linguistic areas
  4.6 Three way classification task
  4.7 Removing short samples
Chapter V: conclusion
  5.1 Resume of experiments
  5.2 Discussion
  5.3 Propositions for further work
APPENDIX 1
APPENDIX 2
APPENDIX 3
APPENDIX 4
APPENDIX 5
APPENDIX 6
APPENDIX 7
APPENDIX 8
APPENDIX 9
APPENDIX 10
References

Abstract
The principal aim of this thesis is to investigate the feasibility of developing an automatic classifier of regional Italian accents, exploring techniques and methods from a cross-disciplinary literature that combines computational phonetics, forensic linguistics, machine learning and dialectometry. At the same time, this work can be read as the study and application of a certain representation of the sound spectrum widely used in speaker recognition, namely the Mel-frequency Cepstrum and its coefficients (MFCCs).

A system able to recognize regional inflections can be useful in two applications: 1) if it is trained with a sufficiently fine-grained data set, it can serve investigative purposes. Speaker profiling for forensic applications is a domain that deals with collecting attributes of a speaker from a phone call or, more generally, a recording. In fact, various features of an individual can be determined from the voice, such as gender, weight, height, smoking habits etc. One of the most discriminating traits of a speaker is surely the geo-linguistic provenance, notably in a country with high linguistic fragmentation like Italy. 2) It is well known that one of the aspects that most undermines the performance of an automatic speech recognition system is, indeed, the regional accent of the speaker. An extension identifying the geolinguistic provenance would permit such a system to read and predict the errors (or rather, the deviations from the standard language) that the speaker makes. More generally, speaker profiling could represent the future direction for vocal interfaces, which will tend to fit the hallmarks of their users: similarly to what humans do with familiar voices [cf. for instance Souza et al 2013].
To carry out our set of experiments we used the telephonic sub-corpus of CLIPS (Corpora e Lessici dell'Italiano Parlato e Scritto), spread among 15 regional varieties of Italian. With 40 arbitrarily chosen samples we created two surveys/quizzes, shared on a linguistics blog, with the purpose of studying how L1 Italian speakers behave on average when recognizing regional accents in short telephone audio samples. The rest of the corpus was used to train various machine learning models to classify samples on the basis of the speaker's accent. Classification tasks were set up with various levels of granularity: by CLIPS varieties (15 regional Italians corresponding to 15 linguistically representative cities), by linguistic areas (7 in total, chosen on the basis of dialectological criteria, such as isoglosses) and by linguistic macro-areas (3 in total). These classifiers were built with several methods and algorithms, and therefore they performed very differently. We subsequently analyzed the results and the general behavior of the classifiers in light of the human ones on the online tests. We then set up some experiments in order to verify whether the criteria used by humans to predict the provenance of a speaker are similar to those of the machine: for example, whether the machine can perceive a connection between two varieties belonging to the same linguistic area (such as Napoli and Bari), or whether it gets better results on longer (namely, richer in linguistic and spectral information) audio samples.

Everything depends on the given description of the audio sample: notably, the modeling based on the MFCC features, which is critically examined throughout this work.
The first chapter of the present thesis introduces the framework: the purposes of this work, the disciplines that deal with our task, and a short introduction to acoustic phonetics.

The second chapter deals with the regional varieties of Italian, describing them with several examples, and finally touching on some dialectological and dialectometric aspects.

The third chapter is a critical introduction to the automatic accent identification problem, showing the methods and instruments of our case study through the existing literature.

The fourth chapter (which includes most of the appendices preceding the bibliography) is the experimental section. The experiments are preceded by 1) a description of the software used to set them up and 2) a short paragraph about the problem of human perception of accent, which will introduce the discussion of the results obtained through the tests spread online.

In the last chapter we discuss the experiments, drawing our conclusions and attempting to put forward proposals for future work.

Riassunto

The main objective of this thesis is to investigate the feasibility of an automatic classifier of regional Italian accents, exploring techniques and methods from a cross-disciplinary literature that combines computational phonetics, forensic linguistics, machine learning and dialectometry. At the same time, the present work can be read as the study and application of a representation of the sound spectrum widely used in speech and speaker recognition, namely the Mel-frequency Cepstrum and its coefficients (MFCCs).

A system able to recognize regional inflections can be useful in two ways: 1) if trained at a sufficiently fine granularity, it can serve investigative purposes. Speaker profiling for forensic use is a discipline concerned with profiling a speaker from a phone call or a recording. From the voice, indeed, many attributes of a subject can be determined, such as gender, weight, height etc. One of the most discriminating traits of a speaker is surely the geo-linguistic provenance, especially in a country with high linguistic fragmentation such as Italy. 2) It is well known that one of the aspects that most undermines the performance of a speech recognition system is precisely the regional accent of the speaker. An extension identifying the (geo)linguistic provenance would allow the speech recognition system to read and predict the "error" (or rather, the deviation from the standard) that the speaker makes. More generally, user/speaker profiling could represent the future direction of vocal interfaces, which will tend to adapt to the characteristics of their users: as human beings, after all, also do with familiar voices [cf. for instance Souza et al 2013].

To carry out our set of experiments we used the telephonic sub-corpus of CLIPS (Corpora e Lessici dell'Italiano Parlato e Scritto), distributed across 15 regional varieties of Italian. With 40 arbitrarily chosen samples we created some quizzes/questionnaires, then shared on a linguistics blog, in order to study how L1 Italians behave on average in recognizing regional accents in short telephone samples. The rest of the corpus was used to train various machine learning models to classify the samples according to the speaker's accent.

Classifications were carried out at various levels of granularity: by CLIPS variety (15 regional Italians corresponding to 15 linguistically representative cities), by linguistic area (7 in total, chosen on dialectological criteria) and by macro-area (3 in total). These classifiers are models built with various methods and algorithms, and therefore their performances differ widely from one another. We subsequently analyzed the results and the general behavior of the classifiers in light of the human ones on the online quizzes.

We then set up experiments to verify whether the criteria used by a human to predict the provenance of a speaker are similar to those used by the machine: for example, whether the machine can perceive a link between two varieties belonging to the same linguistic macro-area (such as Napoli and Bari), or whether it obtains better results on longer audio samples, richer in linguistic information.

All of this depends crucially on the description given of the sound object: in this case, the modeling based on the MFCC features, an object of criticism throughout our work.

The first chapter of the present thesis introduces the framework: the motivations of the work, the disciplines that deal with the topic, and a short introduction to acoustic phonetics.

The second chapter deals with the regional varieties of Italian, describing the object in question with various examples, and finally touching on dialectological and dialectometric aspects.

The third chapter is a critical introduction to the automatic accent identification problem, presenting the methods and instruments of our case study through the existing literature.

The fourth chapter (which also includes most of the appendices preceding the bibliography) is the experimental section. The experiments proper are preceded by 1) a description of the software used to set them up and 2) a short paragraph on the problem of human perception of accent, which introduces the discussion of the data obtained with the online questionnaires.

In the fifth chapter we discuss the experiments and draw our conclusions, attempting to put forward proposals for future work on the topic.

Chapter I: framework

1.1 Some characteristics of human voice


Phonetics is the study of both the linguistic and the physical aspects of human speech as it is produced, transmitted and perceived. Acoustics, on the other hand, is a branch of physics that deals with the production, transmission and perception of sound in general. Thus, when we talk about acoustic phonetics, we basically mean a science that addresses human speech through the quantitative methods provided by physics, while also taking into account the linguistic (phonological) meaning of an item.
In this paragraph we will provide a brief introduction to some acoustic-phonetic characteristics of the human voice. This discipline is crucial for implementing speech/speaker recognition applications, systems for text-to-speech synthesis (and vice versa), language identification systems, intelligent virtual agents and so on. Furthermore, automatic speech processing is important to the forensic sciences, particularly for forensic voice comparison or profiling tasks.
The theory of voice identification [Tosi 1972] is based on the fact that every voice has
unique and individual characteristics that distinguish it from any other voice.
Generally speaking, variability in speech can be of two types: intra-speaker and inter-speaker variability. Intra-speaker variability (within the same speaker) is due to many factors: e.g. we could have two recordings of the same speaker made far apart in time, so that his voice has changed in the meanwhile; but more simply, emotions, rate of utterance, mode of speech, disease, the mood of the speaker, or the emphasis given to a word can lead to strong intra-speaker variability. Inter-speaker variation, which exists among different speakers, arises mainly from anatomical differences in the vocal organs and from learned differences in the use of the speech mechanism.

The vocal tract is essentially a tube consisting of the mouth (oral cavity) and throat
(pharyngeal cavity), with the lips at one end and the larynx at the other (the vocal folds are in
the larynx). The length of the tube can be slightly increased by rounding and protruding the
lips and by lowering the larynx (raising the larynx will slightly shorten the tube). The nose
forms another tube (nasal cavities from the nostrils to the velopharyngeal port) which can be
connected to the oropharyngeal tube (pharyngeal cavity plus oral cavity) by lowering the soft
palate (velum) to open the velopharyngeal port. The jaw can be lowered or raised and the
tongue can be moved to change the shape of the oropharyngeal tube. [Morrison 2010]

Parts of the vocal tract that can be used to produce distinctive sounds are called articulators. They can be grouped into active and passive articulators on the basis of their activity: the articulators that move during the process of articulation are called active articulators, whereas the organs of speech that remain relatively motionless are called passive articulators. [image from Kulshreshtha et Mathur 2012]

The vocal tract is similar to a musical instrument: air is blown into the vocal tract by compressing the lungs so as to push air between the vocal folds. One can produce a voiced sound (a vowel, a sonorant, a nasal, a liquid...) or a voiceless sound (an obstruent, a plosive...). We are mainly interested in the former type of sound because it can convey a large amount of spectral information, such as the fundamental frequency (F0) and the formant distribution, quantified by each formant (F1, F2, F3, ...). F0 is the rate at which the vocal folds vibrate during voicing. Some speakers have longer and more massive vocal folds and others have shorter and less massive ones; on average, adult males have larger vocal folds than adult females, but there is also variation within each sex. F0 averages around 125 Hz for adult males and 200 Hz for adult females.
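To make the notion of F0 concrete, here is a minimal sketch (not part of the thesis experiments) that estimates the F0 of a voiced frame by picking the strongest autocorrelation peak within a plausible pitch range; the 75-400 Hz search bounds and the synthetic test signal are illustrative assumptions.

import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate F0 by locating the strongest autocorrelation peak
    within the lag range corresponding to fmin..fmax."""
    frame = frame - frame.mean()                # remove DC offset
    ac = np.correlate(frame, frame, mode="full")
    ac = ac[len(ac) // 2:]                      # keep non-negative lags
    lo, hi = int(sr / fmax), int(sr / fmin)     # lag bounds in samples
    lag = lo + np.argmax(ac[lo:hi])             # best candidate period
    return sr / lag                             # F0 in Hz

# Synthetic 30 ms voiced frame at 8 kHz with a 200 Hz fundamental
sr = 8000
t = np.arange(0, 0.03, 1 / sr)
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
print(round(estimate_f0(frame, sr)))            # -> 200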
Another acoustic feature which can be prosodically relevant is pitch, that is, the relative highness or lowness of a tone as perceived by the ear, depending directly on F0. Pitch is the main acoustic correlate of two phonetic features: tone and intonation.
To understand how formants (F1, F2, F3, ...) in a spectrogram describe modifications of the vocal tract, let's look at an example from [Morrison 2010:99.460]:
"The different mouth shapes result in different resonance frequencies which
make the sound of different vowels. The primary acoustic differences between the
vowels in heed, hid, head, and had are that the first formant(F1) increases as
the constriction widens and second formant (F2) decreases. Now say the ee
sound from heed again, but this time move your tongue back until you are saying

the vowel sound from who. It turns out that moving your tongue back in your
mouth lowers F2 and that rounding your lips also lowers F2, so doing both
together has a larger effect. The most important acoustic difference between the
vowel sounds in heed and who is the change in F2 (F1 stays about the same) [...]
In many languages F1 and F2 peaks are the primary acoustic indicators of vowel
category (vowel phoneme) identity (the peak formant values rather than the exact
shape of the spectra are perceptually relevant)"

[image from Bove et al 2002]


All of us have some dialectal and idiolectal characteristics, for example a different manner of pronouncing /a/: in that case the formant distribution is interesting for describing a speaker's uniqueness. Indeed, this method is also used in speaker identification [for example Bove et al 2002], even though state-of-the-art systems rather exploit Mel-Frequency Cepstral Coefficients (MFCCs: we will see them in paragraph 3.2). Below is an English vowel quadrilateral, useful to represent the wide spectrum of possible executions of any vowel.


However, distinguishing a vowel from another voiced sound in a spectrogram (especially nasal and liquid consonants) can be very hard without transcriptions. It is nonetheless true that a speaker's articulation of vowels and consonants (and the relationship between them, like the Voice Onset Time1 [Caramazza et al 1974]) is crucial in order to collect someone's dialectal and idiolectal features.
Supra-segmental or prosodic features can also be interesting for determining the dialectal provenance of a speaker. In terms of linguistics, prosody comprises stress (stressed vs. unstressed syllables, which has nothing to do with loudness), phrasal/lexical intonation and tone. Basically, prosody corresponds to various features of speech such as the form in which a sentence or word is uttered, e.g. question, command, or statement. It also reflects the presence of emphasis, contrast or focus in the utterance, even the speaker's emotional condition. In terms of acoustics, the prosody of a language involves variation in F0, formant distribution, pitch, loudness, syllable length, speech rate and so on.

"In phonetics, voice-onset time (VOT) is a feature of the production of stop consonants. It is
defined as the length of time that passes between the release of a stop consonant and the onset
of voicing, the vibration of the vocal folds, or, according to other authors, periodicity. " (source:
Wikipedia)


1.1.1 Automatic approach to acoustic speech processing


Speech, like almost every sound in nature, is a complex wave, which can be decomposed into simple waves through mathematical methods. The most widely used method to accomplish this task is Fourier analysis, an efficient implementation of which is included in most acoustic analysis software (like openSMILE: [cf. paragraph 4.2.1]) under the name of Fast Fourier Transform (FFT)2.
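As a toy illustration of this decomposition (not drawn from the thesis experiments), the sketch below builds a complex wave from two sinusoids and recovers their frequencies with NumPy's FFT; the sampling rate and the component frequencies are arbitrary choices.

import numpy as np

# A complex wave: the sum of a 100 Hz and a 250 Hz sinusoid, sampled at 8 kHz
sr = 8000
t = np.arange(0, 1.0, 1 / sr)
signal = np.sin(2 * np.pi * 100 * t) + 0.3 * np.sin(2 * np.pi * 250 * t)

# The FFT decomposes the complex wave into its sinusoidal components
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), 1 / sr)
magnitude = np.abs(spectrum)

# The two strongest components recover the original frequencies
peaks = freqs[np.argsort(magnitude)[-2:]]
print(sorted(float(f) for f in peaks))   # -> [100.0, 250.0]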
The automatic approach to speaker recognition/profiling was developed by signal-processing engineers for tasks like voice-recognition password systems for telephone banking. As with the acoustic-phonetic approach, the automatic approach is based on quantitative measurements of acoustic properties of speech, but typically no attempt is made to exploit information relating to phonetic units. Also, as implied by the name, once the system has been designed and built, it operates fully automatically: no transcriptions nor manual analyses are required. Some automatic speaker recognition systems "make use of higher level features, for example, they use automatic-speech-recognition systems to divide the acoustic signal into phonetic units (which may not correspond to the phonetic units which a phonetician would extract), or they automatically extract fundamental-frequency trajectories. Such features could also be exploited by automatic forensic-voice-comparison systems. Some human supervision could also be incorporated into an acoustic-phonetic-automatic hybrid approach." [Morrison 2010:99.720]
Typical features in an automatic system are Mel-frequency cepstral coefficients (MFCCs), generally measured in a frame of length 20-30 ms, which are, roughly speaking, a short-term spectral feature (like formant analysis) involving cepstrum analysis, that is, the inverse Fourier transform (IFT) of the logarithm of the spectrum. MFCC features are widely used in speech recognition and speaker recognition/profiling nowadays, especially because of their robustness to noisy channels, like the telephone3 [compare Lippmann 1997 and Meyer 2011]. One of the main goals of this work is to test the effectiveness of MFCCs on the accent recognition task (thus, a sub-task of speaker profiling). We will talk more closely about this type of feature in paragraph 3.2.

2 The general idea of Fourier analysis, which is a branch of mathematics, is that a complex and continuous function can be approximated by the sum of various simpler trigonometric functions. The decomposition process itself is called a Fourier transformation.

3 "Landline telephone systems only transmit frequencies between about 300 Hz and 3.4 kHz (this is known as a bandpass) and distort frequencies close to the edges of the bandpass [...] Some vowels such as /i/ and /u/ have intrinsically low F1 which for male speakers may be affected by the low end of the bandpass. F3 and above for females and F4 and above for males are likely to be affected by the high end of the bandpass [...] Mobile-telephone systems also apply a bandpass to the signal; the low end of the bandpass is maintained at 100 Hz (lower than for a landline system), and the high end varies between 2.8 kHz and 3.6 kHz. But in addition, mobile systems use compression and decompression algorithms (codecs) to reduce the amount of data sent, and this results in further deterioration of the signal." [Morrison 2010:99.610]

1.2 About speaker profiling


To explain what speaker profiling is, we need to start from its framework. We could consider this domain as a group of issues related to speaker recognition, which is in turn a sub-domain of authorship attribution and acoustics [defined above in par. 1.1]. Authorship attribution is a set of stylometric techniques used to identify the author of a written or oral text, deployed in forensic scenarios or in the digital humanities. The acoustics of speech serves rather artificial intelligence (AI), software engineering and signal processing. To sum up, speaker recognition/profiling can be useful both in developing AI tools and in forensic applications. Overall, it is important to keep speech-related and speaker-related disciplines distinct, since the former deal with what the speaker says, while the latter deal with who the speaker is. In particular, speaker profiling is about the speaker's attributes (like gender, smoking habits, age, weight, height, pathologies [Poorjam et al 2014]), but it is worth noting that an attribute like the region or social group of a speaker's socialization also involves speech facts (about what the speaker says). Let's try to conceptualize this complex framework:


The main statement on which speaker recognition bases its epistemological reasons is that everyone's voice is individual and distinct from any other. As [Kersta et al 1962] state, voiceprints (as they named spectrograms) are unique and individualistic in nature, remaining unchanged throughout the lifetime of the speaker, even though he/she grows old, loses tonsils, teeth, or adenoids. Even if it is reckless to claim the infallibility of these voiceprints (as Kersta did), this fact is substantially confirmed by the large experimental study initiated by Tosi and associates at Michigan State University early in 1968 and concluded in 1970 [Tosi 1972]. Nowadays the voice, being considered a behavioral biometric, is accepted as judicial evidence [Jessen 2007] in many countries.
The standard speaker recognition casework in forensic applications is the following:
Typically in forensic speaker recognition, a recording of an unknown voice,
usually of an offender, is to be compared with recordings of a known voice, usually
of the suspect or defendant. The interested parties (police, court) want to know if
the unknown voice comes from the same speaker as the known. [Rose 2006]
Nevertheless, it is not always the case that two recordings are available:

sometimes the only clue to a criminal's identity is his or her language. When
that is the case, a linguistic profile can be a useful tool. Forensic linguistic profiling
is the analysis of language to infer attributes of a speaker or writer from his or her
linguistic characteristics. Speaker and author profiling are used when there is an
unknown perpetrator, and investigators need to narrow down the pool of potential
suspects by identifying linguistic features that can be associated with particular
geographic areas, social groups, or unusual pathologies [Schilling et Marsters
2015:196]
Thus, during a police investigation, a linguistic/phonetic speaker profile can be useful. In [appendix 10] we provide reviews of two real cases.
Speaker recognition techniques are also used to implement person authentication in
security systems, like banking by telephone, telephone shopping, database access services,
information services, voice mail, security control for confidential information areas, and
remote access to computers etc. [Chauhan et al 2013].
By contrast, speaker profiling (and notably the methods used to make inferences about a speaker's geographic or cultural background) can provide crucial attributes for improving speech recognition performance: such information enables effective adaptation of speech and language processing systems, e.g. by switching to specialized acoustic, pronunciation, or language models in speech recognition [Akbakac et al 2011].
Speaker profiling/recognition for commercial purposes can be text-dependent (the speaker has to pronounce this or that sound) or text-independent (the system virtually covers any manifestation of speech), and it has to work as a fully automatic system.

Speaker profiling/recognition for forensic applications is generally text-independent, because the speaker of the recording is not cooperative: he obviously has no interest in cooperating; rather, he sometimes tries to disguise his voice [Farrús 2009]. On the other hand, technical4 forensic speaker recognition/profiling can be performed in various ways, with an automatic, a computer-assisted or a traditional approach, exploiting acoustic features or also linguistic-auditory ones.
4 In some cases, forensic voice comparison is carried out by non-experts, a fact that [Rose 2006] calls naive forensic speaker recognition. This approach is obviously risky (it has historically led to some errors: see ibid.) and is discouraged by the academic community.

However, traditional methods depend too much on individual intuitions; that is the reason why a forensic scientist is trained to make use of acoustic measurements and statistical analysis to quantify the results. As [Morrison 2010:99.940] claims, the police and the court
will usually understand that a definitive answer cannot be given: a trial is,
after all, about making decisions in the face of uncertainty. So they will usually ask:
how probable is it that the samples have been said by the same person? This is a
very reasonable way of putting it, since philosophers and statisticians will agree
that the best way of quantifying uncertainty is by using probability (Lindley 1991).


Chapter II: regional Italian accents

2.1 Italian languages versus standard Italian


Italian is spoken by roughly 85 million people in the world. 59 million are Italians, while a consistent part of the speakers is spread across countries where Italian is among the official languages (Malta, Switzerland, Vatican City and San Marino), some Balkan countries (especially Albania, Croatia and Slovenia) and the Principality of Monaco. Furthermore, there are some important Italian-speaking communities in the US, Argentina and Somalia.
Despite these numbers, which make Italian one of the 30 most spoken languages in the world [source: www.ethnologue.com], it should be remembered that Italian is not a uniform language: it is rather a group of Romance languages spread across the Italian territory. In addition, there are languages such as Sardo and Friulano which are spoken within the borders of the country but cannot be considered Italian languages [Grassi et al 1997:81].
Indeed, Italian has had a peculiar evolution since its emergence. Dante Alighieri was one of the first to talk about Italian vernaculars in De vulgari eloquentia, during the first part of the 14th century. Two centuries later, a sort of "question of the language" developed across various courts of Italy: a literary and highly codified Italian arose around Pietro Bembo and other grammarians, based on classic Tuscan literature (notably Giovanni Boccaccio for prose and Francesco Petrarca for poetry).
This obsolete and mainly written language was effectively spoken only in rare contexts, such as the aristocratic courts of Florence during the 14th and 15th centuries, since Latin remained the main academic and cultural language for a long time. Tuscan literary Italian remained substantially unspoken until 1861, when it became the official language of the newly united Italy. Despite the obligation for Italian schools to teach in Italian, renouncing native languages, nothing substantially changed: according to [De Mauro 1963:43], just 2.5% of Italians could speak Italian at the moment of unification.


However, things changed rapidly during the 20th century, especially because of two main events: 1) World War I, which obliged soldiers sent to the front from every part of Italy to communicate with one another; 2) the diffusion of radio and television during the economic growth of the Sixties, which enormously contributed to spreading a single model of Italian.
As of 2012 [ISTAT 2012], almost the whole population knew Italian (at least passively), although more than 50% of people speak dialect within the family, being effectively bilingual5 subjects.
Even if the Italian languages [Berruto 1993:3-36] often present a low structural distance6 compared to standard Italian, it is more proper to consider them independent languages or dialects, and not diatopic varieties of Italian. Indeed, these languages historically have, in some cases even more than the highly literary standard Italian, a wide range of registers for different diaphasic contexts. Moreover, according to [Coseriu 1973], Italian dialects are primary dialects: they have developed since the dissolution of the Latin language in its oral uses, exactly like the Tuscan of Florence, even if the latter (in its literary shape) has become the national standard language.
Unlike dialects/languages, regional varieties are full-fledged diatopic varieties of standard Italian. According to Berruto [1987:17], they are "the wide range of phenomena occurring between literary [standard] Italian and dialects [...] In Italy the primary source of linguistic diversification is provoked by geographic distribution, along the diatopic axis".

5 Actually, it is incorrect to talk about bilingualism in this case, since this term refers to two languages with the same political rights, whereas the Italian languages are not official and cannot be used in administration or instruction. In Italy we might observe, rather, an example of diglossia: "Diglossic languages (and diglossic language situations) are usually described as consisting of two (or more) varieties that coexist in a speech community; the domains of linguistic behavior are parceled out in a kind of complementary distribution" [source: ccat.sas.upenn.edu]. Another terminological issue we would like to stress is the difference between "language" and "dialect" in the Anglo-Saxon and Italian linguistic traditions: while in the former "dialect" generally means a linguistic variety, in the latter a dialect is a real language spoken in some geographical area. Thus, there is no structural difference between a language and a dialect, except that the second has no political recognition. We will use the word "dialect" in this second meaning (Italian linguistic tradition), talking rather about "linguistic variety" when we want to refer to some sociolinguistic variation of a same language.

6 Nevertheless, the structural distance between standard Italian and the dialects can be quite relevant in some cases, concerning not only the phonetic and lexical levels, but also the morphological and syntactic ones [cf. Maiden and Parry 1996].
Regional Italians are a relatively recent phenomenon. After unification in 1861 and the consequent introduction of Italian as the nation's official language, the dialects slowly started to converge toward standard Italian (and, to a lesser extent, vice versa). As a result, neo-standard Italian has managed to include these sorts of smoothed varieties of the dialects: not only pronunciation aspects, but also lexical items and, in its oral use, morpho-syntactic hallmarks. Thereby, from a stillborn language, Italian has step by step become a living, constantly evolving one. On the one hand, dialects survive in the countryside, in some villages and even in some cities; on the other hand, standard Italian has changed by accepting various dialectal hallmarks, producing several diatopic varieties. Consequently, through their varieties, the members of a community show (in a conscious or unconscious manner) their sociocultural identity: by using a variety, a speaker provides some information about his sociocultural positioning [Berruto 2011]7. Thus, linguistic varieties, and notably the regional varieties of Italian, could be an interesting object for the speaker profiling domain [cf. chapter I].
Below, a chart describing the architecture of contemporary Italian along the diamesic, diaphasic and diastratic axes is provided [images from Berruto 1993]:

7 Italian diglossic nature, as well as dialectal features, was smoothed by the development of the regional varieties: the italianization of dialects, especially at the phonetic level, is well described in [Grassi et al 2003:257]. However, the rise of a regional variety does not imply greater awareness in people's use of Italian and/or lesser awareness in the use of their own dialect. For instance, [D'Addario 2015:377] shows how speakers from Taranto (in the south-east of Italy) are not always aware of using dialectal expressions instead of standard Italian ones. As an example, D'Addario mentions the verb /ʃendere/ (scendere, "to come down"), which in standard Italian is intransitive, whereas in the southern area it is also used as a transitive verb.


This scheme covers just a single geographical variety: the diatopic axis is not present because, conceptually, it embeds the other linguistic variables.

2.2 Diatopic varieties of Italian


It is not so easy to identify the borders of a geo-linguistic area (namely, an area where we encounter certain uniform linguistic phenomena), since Italian often varies as a continuum. However, for this study we decided to employ 7 broad geo-linguistic areas. The concept of isogloss8, introduced by J.G.A. Bielenstein in 1892 [Vignuzzi 2010], can help us to identify broad geo-linguistic areas. These isoglosses (La Spezia-Rimini, Roma-Ancona, Taranto-Ostuni, Diamante-Cassano: the most important linguistic lines, from the north to the south of Italy), argued and studied by Italian dialectologists since the pioneer Graziadio Isaia Ascoli (1829-1907), fit properly the geo-linguistic chart below [source: Sima Brankov's blog]:

8 "Isogloss is the imaginary line we can use to connect the extreme points of an area characterized by the presence of a same linguistic phenomenon" [Vignuzzi 2010]. Even if it describes verifiable facts, the isogloss is a traditional method of dialectology; thereby it cannot be applied without making subjective choices. This fact is described in more detail by [Kessler 1995]. We will have to keep this aspect under consideration when, in [paragraph 4.5], we test the computational soundness of the concept of linguistic area: areas and macro-areas, like isoglosses, are partially based on intuition. New computational approaches of dialectometry promise to improve the precision of dialect clustering [Szmrecsanyi 2011; Heeringa 2004], even if in Italy this discipline is not very developed yet [a remarkable work on Tuscan dialects: Montemagni 2007].


The six main colors roughly represent six broad geo-linguistic areas, and thus six broad regional varieties of Italian. We added to these a seventh variety, namely the regional Italian of Sardegna, which is not represented here, since this is a chart of the Italian dialects, and Sardo is not considered an Italian language, like Ladino and Friulano. We decided to include this variety because, even if the spoken language is not an Italian language, a regional Italian accent developed in parallel during the twentieth century: the latter has its own features and is affected by the native language [cf. Lorinczi 1999].
In this chart we have pointed out the cities where the CLIPS corpus9 data were collected. Almost all of these cities play a socio-linguistically leading role in their area, spreading their linguistic variety: this phenomenon is called linguistic koinè and, according to [Grassi et al 2003:176], it is one of the most important forces that have contributed to the "italianization" of dialects.

9 CLIPS (Corpora e Lessici dell'Italiano Parlato e Scritto) is an important oral corpus of contemporary Italian. For our study we used the telephonic sub-corpus, varied along the diatopic axis. Further information will be provided in [chapter IV].
Above the "La Spezia-Rimini" isogloss there is a wide northern macro-area, which we can mainly divide into the Veneto area (various shades of yellow in the chart above) and the Gallo-Italic area (purple shades). The main phonetic deviations from standard Italian are the lenition of intervocalic or geminated consonants, which can go as far as complete elision (Italian /kapelli/ becomes /kavei/); assibilation (Italian /tʃera/ becomes /sira/); the deletion of unstressed final vowels except /a/; and the presence of some vowels from Occitan, a phenomenon that occurs in the Gallo-Italic area but not in the Veneto area. Another hallmark of the Venetian variety (represented in the CLIPS audio samples) is the retroflex /r/.
Under "La Spezia-Rimini" isogloss, which "deeply divide Italian linguistic varieties and is
moreover the most important linguistic border of Latin Europe [Vignuzzi 2010] we
encounter another important isogloss, identified by Gerhard Rohlfs in 1937, that is the RomaAncona isogloss. It defines, with La Spezia-Rimini one, a median macro-area relatively
uniform, especially with regard to vowels trends. However, we can individuate two distinct
areas: the Tuscan (shades of brown in the chart) and the Median ones (shades of red). Tuscan
area is characterized by phenomena like regressive assimilation of close consonants and a
typical lenition of some fricatives and plosives called Gorgia (Italian /t kniko/
become /tnniho/). In Median area, assimilation of close consonants is rather progressive; we
can furthermore observe the sonorization of some consonants like /p/t/k/f/s/ (Italian
9

CLIPS (Corpora e Lessici dell'Italiano Parlato e Scritto) is an important oral corpus of


contemporary Italian. For our study, we used the telephonic sub-corpus, varied along the
diatopic axis. Further information will be provided in [chapter n].

23

/andate/ become /annade/) and the affrication of some fricatives (Italian /borsa/ become
/bortsa/)
The Roma-Ancona isogloss actually has fuzzy borders, so the transition between the Median area and the Southern area (shades of blue in the chart) can be perceived as a continuum. Some crucial features of the latter area (of which Naples has been the cultural and linguistic center for many centuries) are metaphony and the near-deletion of the final vowel, reduced to schwa: e.g. Italian /neri/ becomes /nirə/.
The extreme southern area (shades of green in the chart above) includes Sicily, the southern part of Puglia and most of Calabria. In this area the (Sicilian) vowel system is older than the standard Italian one: instead of 7 vowels (/a/ /e/ /ɛ/ /i/ /ɔ/ /o/ /u/) there are only 5 (the same ones, except /e/ and /o/). Moreover, retroflex consonants are quite frequent in this variety (Italian /bɛllo/ becomes /bɛɖɖu/).
Finally, the island of Sardegna is a linguistically heterogeneous area. Its dialects (old, conservative, Catalan-influenced neo-Latin languages) are not considered as belonging to the Italian set, due to some crucial differences in plural formation and in the articles used. For this reason, the regional variety of Sardegna has quite recognizable phonetic features [see also the results of our tests in chapter IV], like resistance to palatalization, gemination of consonants, and partial palatalization of /s/.
This is obviously a non-exhaustive list of phonetic features. Moreover, we did not discuss prosodic and lexical traits, which have relevant consequences for the respective regional varieties. However, the oral corpus we have worked on concerns regional varieties of Italian, not dialects: all those hallmarks have been enormously smoothed in the dialects' movement towards standard Italian. Furthermore, the data are (semi-)text-dependent: telephone speakers are asked to read some phrases or numbers, or to simulate a scenario using some specific words. Thus, some prosodic phenomena such as global phrasal intonation are compromised, because they would be representative only in spontaneous speech.


Chapter III: methods

3.1 Accent identification problem


As we saw in the first chapter, an automatic classifier for accent recognition can be useful both in forensic speaker profiling applications and as an extension of an ASR (automatic speech recognition) system. In the former case, such a speaker profiling system could be more reliable than an expert, who inevitably provides an analysis led by intuitions and individual skills (we will tackle this problem in [paragraph 4.1]). Moreover, a tool that virtually knows all the accent characteristics of a specific area cannot be affected by anything like subjectivity: that is to say, even an expert phonetician is obviously influenced by his own linguistic habits, while a (learning) machine has no habits; it is just trained with some data. Ideally, the more proper and fine-grained the data, the more performance will improve. For example, everyone can notice that we are generally better at recognizing accents close to our own variety: an Italian from the south of Tuscany is overall good at detecting the various accents of the villages in his territory, something that a native of Firenze probably cannot do properly. At the same time, such an Italian from the south of Tuscany can recognize the various leading accents of Tuscany, like the Livorno variety, the Firenze variety, the Arezzo one, and so on. He can surely accomplish this task better than an Italian from the north of Puglia, or anyone else not from Tuscany. In other words, the ability to distinguish accents (and so the provenance of the speaker) is bound up with our own linguistic biography, that is to say, our linguistic competences and experiences. A machine learning system could virtually be trained with any sort of linguistic experience (in the form of data sets)10.
10 Of course, a primary problem in this case is the lack of data. Even an important resource of oral Italian like the CLIPS corpus (the one we use for this study) is far from satisfying whenever the purpose is to represent all Italian regional varieties. Nevertheless, new Web-based methods of collecting data could constitute a real turning point. A good example of a crowdsourcing project: http://www.abruzzesemolisano.it/


A speaker profiling module could be an interesting extension of ASR systems in the future. The latter could improve their performance just by learning the speaker's linguistic hallmarks, such as an intonation peculiarity or a speech impediment. Among all the collectable idiolectal features there is also the regional inflection: such a feature can help the ASR system to anticipate some individual speech behaviors of the speaker. For instance, it is surely useful for such a system to know in advance that an Italian from Prato (Tuscany) might never pronounce the /k/ and /t/ sounds when they are intervocalic and not geminated.
Nevertheless, it is not easy to realize a speaker profiling system with these characteristics, or rather it is very expensive and time-consuming. The main reason is that we should inform the machine about the linguistic content conveyed by the phone/sound. Let's put it another way: when we try to detect, even naively, the provenance of someone, we exploit not only the acoustic side of sound, but also the linguistic one. That is to say, when we listen to words like /koriandolo/ or /kartuttʃa/ (respectively "coriander" and "cartridge"), we already know our own regionally-biased pronunciation of the same words and we can likewise figure out how speakers from different regions execute them (or words similar to them). For instance, the first word ("coriander") is likely to be affected by great variation across linguistic areas, especially at the vocalic level, while the second word ("cartridge") is likely to vary at the consonantal level. Even if one has never heard these words pronounced by this or that regional speaker, he has probably heard other similar words, which have led him to build a sort of model for each linguistic area of Italy: evidently, a large part of this process concerns intuition. Nonetheless, being aware of both the acoustic quality and the linguistic-phonetic level allows one to match, quoting Hjelmslev, the substance of expression with the form of expression.
An ASR system does exactly this, but we have to provide it with pieces of speech (phones) matched with the related phonological units (phonemes). These phones, being most of the time sampled from one individual (usually a voice professional), are not representative of all the varieties of the same language, much less of all idiolectal variations. However, if we want to add a module collecting information on accent to compensate for this inadequacy, we still need to pass through the linguistic knowledge of the ASR system: it is a vicious circle.
Thus, there are three viable options: 1) to build a fine-grained metric in order to identify every single possible variation of an idiolect or a dialect, and then to develop a model of the language of a speaker using a distance measure like Levenshtein's (a minimal implementation is sketched right after this list): it is roughly what the ForensicLab of Pompeu Fabra University (Barcelona) is doing, notably for idiolectal profiling for forensic applications [Turell 2013]. For the dialectal domain we can report instead [Kulshreshtha et Mathur 2012] and [Huckvale 2004]. (Image below from [Turell 2013])

2) to rely on the acoustic aspect of the speech, without passing through the linguistic level. The underlying hypothesis of this method is that accent differences are quantifiable at the acoustic level: we just need to find a workable acoustic-phonetic feature in order to capture these dialectal or idiolectal differences. This method is ideally the best, because it is less time-consuming. We will present some literature using this strategy for speaker profiling in [paragraph 3.4]. Furthermore, this is the method we tested and analyzed in this work, with the aim of classifying Italian regional speakers by accent. 3) The two approaches just shown can be mixed. An example of this strategy is provided by [Brown 2014]. The system proposed there, however, is text-dependent. We will explore this study in [paragraph 3.4].
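As a pointer for option 1, here is a minimal sketch of the Levenshtein distance mentioned above; the two transcriptions compared are purely illustrative, not items drawn from [Turell 2013] or from the CLIPS corpus.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Distance between a standard and a (hypothetical) dialectal transcription
print(levenshtein("kapelli", "kavei"))  # -> 3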


The flowchart below conceptualizes our task, straddling dialectometry and speaker profiling, and the various viable ways to cope with it.

The general approach of dialectometry is to build a linguistic model of such and such a variety, through quantitative approaches like cluster analysis, out of some dialectal corpora. On the other hand, speaker profiling attempts to assign an attribute to a speaker. Starting the process from the latter, its main goal is to accomplish a classification task, targeting some already supplied classes (linguistic models). Basically, the best way to carry out this task is to mix dialectometry and speaker profiling strategies: conceptually, it is what the authors of [Turell 2013] are doing, although concerning idiolectometry rather than dialectometry. Nevertheless, this method is expensive in terms of time, so many studies [par. 3.4] attempt to address the classification task using a shorter path: build a machine-learning model of a variety (having previously extracted selected features from the recordings) and then accomplish the classification with it. This is the method we are going to test on Italian varieties in the next chapter.
Last, with the question mark in the middle of the arrow, we wanted to suggest the possibility of setting up a mixed model.

3.2 Mel-Frequency Cepstrum Coefficients (MFCCs)


Mel Frequency Cepstral Coefficients (MFCCs) are used in state-of-the-art ASR systems and have proven to be among the most effective spectral features in speech-related tasks [Jurafsky 2000:329]. Furthermore, they are largely used in speaker recognition [Vibha 2010], due to their robustness to noisy channels [cf. footnote 3 in chapter I] and their reliability on very short audio samples.
We will briefly describe what these coefficients are. The Mel scale is a psycho-physical scale of pitch perception. It has been shown that human perception of sound frequency does not follow a linear scale, but approximately a logarithmic one [Houtsma 1995]. The Mel scale was introduced to have a scale consistent with the perceived height of a sound: 1000 mels correspond, by definition, to 1000 Hz, and a sound perceived as twice as high is assigned twice as many mels.
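A common closed-form approximation of the Hz-to-mel mapping (one of several variants found in the literature, and not necessarily the exact one implemented by openSMILE) can be sketched as follows; note how equal Hz increments yield progressively smaller mel increments.

import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

for f in (500, 1000, 1500, 2000, 4000):
    print(f, "Hz ->", round(float(hz_to_mel(f))), "mel")
# 500 -> 607, 1000 -> 1000, 1500 -> 1291, 2000 -> 1521, 4000 -> 2146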

On the other hand, the cepstrum is the result of taking the inverse Fourier transform (IFT) of the logarithm of the estimated spectrum of a signal. Operations on cepstra are labelled cepstral analysis, the historical father of MFCCs (if we mean the "real" FFT cepstrum and not the "complex" cepstrum or the linear predictive coding (LPC) cepstrum). Thus, MFCCs combine the advantages of cepstral analysis with a perceptual frequency scale based on critical bands (the Mel scale).
Speech analysis assumes that signal properties change slowly with time. This motivates short-time window-based processing of the speech signal to extract its parameters. Every 10 ms, a Hamming window is applied to a pre-emphasized 20 ms long speech segment. The Fast Fourier Transform (FFT, see footnote 2 in chapter I) is used to compute the short-term spectrum. 20 overlapping Mel-scale triangular filters are applied to this short-term spectrum; the output of each filter is the sum of the weighted spectral magnitudes. Finally, the Discrete Cosine Transform of the logarithm of the filter outputs yields the cepstrum coefficients. The figure below, taken from [Sinha et al 2015], represents the steps of the MFCC computation process.
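For illustration, the same chain of steps can be reproduced with an off-the-shelf library. The sketch below uses librosa rather than openSMILE (the tool actually employed in chapter IV), and the file name and the 8 kHz telephone sampling rate are assumptions.

import librosa

# Hypothetical path to one telephone-quality sample
y, sr = librosa.load("sample.wav", sr=8000)

# 20 ms Hamming windows every 10 ms, 20 mel filters, 13 coefficients:
# roughly the configuration described in the text
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_fft=int(0.020 * sr),        # 20 ms analysis window
    hop_length=int(0.010 * sr),   # 10 ms frame shift
    window="hamming",
    n_mels=20,                    # 20 triangular mel filters
    n_mfcc=13,                    # keep the first 13 coefficients
)
print(mfcc.shape)                 # (13, number_of_frames)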


3.3 Machine-learning for speech processing


Machine learning is the most widely used method to solve automatic classification tasks in speech processing. In this framework, building a machine learning system generally means describing some samples through feature extraction, and afterwards training a model to recognize certain characteristics of the samples. The training is carried out with an algorithm whose choice depends on the task to be addressed. From the modeled data, the system infers rules to be used in classifying new items. If such items are phonemes (as in ASR), the main task is to recognize the phoneme from the spectral characteristics of the speech segment and from the joint probability that this phoneme follows or precedes another one. Thereby, the most used models in ASR are Hidden Markov Models (HMM) or Gaussian Mixture Models (GMM), which are dynamic and capture temporal features. By contrast, the speaker is something which does not vary in the short term, so profiling tasks can be addressed through static data capturing the central tendencies of this or that characteristic. This task can be dealt with without high-level features (like phonemes) and with a static classifier, like the classic naive-Bayes algorithm11. We will describe below the most used methods in speech processing and speaker profiling, including the ones we used in the experiment chapter [chap. 4].
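As a concrete illustration of such a static classifier, a minimal naive-Bayes sketch (scikit-learn and random toy data are assumed here purely for illustration; our actual experiments used Weka):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# toy stand-in for static per-sample feature vectors (e.g., MFCC functionals)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24))
y = rng.choice(['nord', 'centro', 'sud'], size=200)

# each feature is modeled independently per class (class conditional independence)
nb = GaussianNB().fit(X, y)
print(nb.predict(X[:5]))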
11 The Naive Bayesian classifier is based on Bayes' theorem with independence assumptions between predictors. [...] Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence. [...] A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. [source: www.saedsayad.com]

Let's start with the Hidden Markov Models (HMMs). Like the naive-Bayes classifier, an HMM is a graphic-generative model for the joint probability distribution p(x,y). Similarly, Markov models and
Bayesian networks share the strong (naive) independence assumptions between the features.
Differently, the former is a dynamic Bayesian network, namely it relates variables to each other over adjacent time steps - the reason why it is used in ASR. The main task of HMMs in ASR is to couple (i.e., maximize the probability of) phones and phonemes. A hidden Markov model is just like a regular Markov model in that it describes a process that goes through a sequence of states. The difference is that in a regular Markov model the output is a sequence of state names, and because each state has a unique name, the output uniquely determines the path through the model. In an HMM, each state has a probability distribution of possible outputs, and the same output can appear in more than one state. These models are called 'hidden' because "the true state of the model is hidden from the observer. In general, when you see that an HMM outputs some symbol, you can't be sure what state the symbol came from." [Russel et al 2010]. A well-known search algorithm used to compute the most likely hidden state sequence given the observations over an HMM is the Viterbi algorithm.

Hidden states of an HMM automaton (source: Wikipedia)
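Below, a compact sketch of the Viterbi decoder for a discrete HMM (a generic textbook formulation in Python, not the implementation of any particular ASR toolkit):

import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state path for an observation sequence.
    pi: initial state probabilities (S,), A: transitions (S, S),
    B: emission probabilities (S, V), obs: observed symbol indices."""
    S, T = len(pi), len(obs)
    logp = np.full((T, S), -np.inf)      # best log-probability ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    logp[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for s in range(S):
            scores = logp[t - 1] + np.log(A[:, s])
            back[t, s] = np.argmax(scores)
            logp[t, s] = scores[back[t, s]] + np.log(B[s, obs[t]])
    # trace back the best path from the most probable final state
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy example: two hidden states, three output symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))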


A widely used method, in conjunction with the Expectation Maximization (EM) training algorithm, is the Gaussian Mixture Model (GMM). A GMM can be thought of as a single-state HMM. In other words, a state in an HMM has a mixture of distributions, with the probability of belonging to a distribution being represented by the emission probability (which can be seen as the conditional distribution of the observed variables from a specific state), as a parametric model of the probability distribution of features. Below is a simple flowchart which describes the general pattern recognition procedure through the phonotactic approach (such as HMMs and GMMs) in an ASR task:
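To make the idea concrete, here is a minimal sketch of GMM-based classification: one EM-trained mixture per class, with the class whose model gives the highest log-likelihood winning (scikit-learn's GaussianMixture and random toy data are assumed, not the toolchains of the studies cited below):

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(X, y, n_components=8):
    # one EM-trained mixture per class
    models = {}
    for label in np.unique(y):
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
        models[label] = gmm.fit(X[y == label])
    return models

def classify(models, X):
    # average per-frame log-likelihood under each class model
    labels = list(models)
    scores = np.array([models[l].score_samples(X).mean() for l in labels])
    return labels[int(np.argmax(scores))]

# toy usage: random 12-dimensional "MFCC" vectors for two classes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12)); y = np.repeat(['nord', 'sud'], 100)
models = train_gmms(X, y)
print(classify(models, rng.normal(size=(30, 12))))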

We will now briefly present three other methods which are broadly deployed in speaker recognition and speaker profiling: Support Vector Machines (SVMs), Artificial Neural Networks (ANNs) and k-nearest neighbours (k-NN).
SVMs12 perform classification by finding the hyperplane that maximizes the margin between two classes. The vectors that define the hyperplane are the support vectors. An ideal SVM analysis should produce a hyperplane that completely separates the vectors into two non-overlapping classes. However, perfect separation may not be possible, or it may result in a model that does not classify correctly. In this situation the SVM finds the hyperplane that maximizes the margin and minimizes the misclassifications. The simplest way to separate two groups of data is with a straight line (one dimension), a flat plane (two dimensions) or an N-dimensional hyperplane. However, there are situations where a nonlinear region can separate the clusters of cases more efficiently. SVMs handle this by using a (nonlinear) kernel function to map the data into a different space where a (linear) hyperplane can be used to do the separation.
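A minimal usage sketch with scikit-learn (the RBF kernel, toy data and parameters here are illustrative assumptions; in our experiments we used Weka's SMO implementation, described in the footnote below):

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# toy stand-in for MFCC feature vectors with city labels
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 227))
y = rng.choice(['milano', 'napoli', 'roma'], size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# scaling matters for kernel SVMs; C trades margin width for misclassifications
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out samples (near chance on random data)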

12 For our experiments we mainly deployed the Sequential Minimal Optimization algorithm (SMO), an SVM learning algorithm that is conceptually simple, easy to implement and often faster [Platt 1998]. Weka implements an optimization of John Platt's SMO algorithm for training support vector classifiers, developed by [Keerthi et al. 2001]. The main feature of this support vector algorithm is that it manages to solve the quadratic programming (QP) problem that arises during the training of support vector machines. Quadratic programming is a special mathematical optimization problem concerning quadratic functions of several variables. In our case this impacts computing performance, i.e. training speed.


SVM linear hyperplane example (source: Wikipedia)


An ANN13 is a framework inspired by biological neural networks, used to either model complex relationships between inputs and outputs or find patterns in data. It is based on an interconnected group of artificial neurons, and it employs a "connectionist approach to computation" when processing information. ANNs have been successfully used for a "great variety of applications, such as decision making, quantum chemistry, radar systems, face identification, gesture recognition, handwritten text recognition, medical diagnosis, financial applications, robotics, data mining, and e-mail spam filtering". In recent years, there has been a renewed interest in the use of ANNs for speech applications due to a major advance made in pre-training the weights of deep neural networks (DNNs). (In the figure above: hidden nodes work as intermediate states before the output [source: Wikipedia])

13 We never used the Weka ANN algorithm (namely, the Multilayer Perceptron) in our experiments, since it took too much time in the training phase. It probably did not fit the high dimensionality of our data set.
Finally, k-NN14 is perhaps the most straightforward classifier in the arsenal of machine learning techniques. Using the good Wikipedia definition of k-NN and lazy classifiers in general: "Classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. [...] Because induction is delayed to run time, it is considered a Lazy Learning technique. Since classification is based directly on the training examples it is also called Example-Based Classification or Case-Based Classification [...] The main advantage gained in employing a lazy learning method, such as Case based reasoning, is that the target function will be approximated locally, such as in the k-nearest neighbor algorithm. [...] The disadvantages with lazy learning include the large space requirement to store the entire training data set. Particularly noisy training data increases the case base unnecessarily, because no abstraction is made during the training phase. [...] Lazy classifiers are most useful for large data sets with few attributes." So, interpretability and ease of implementation are the advantages, but on the other hand k-NN is very sensitive to irrelevant or redundant features, because all features contribute to the similarity and thus to the classification. This can be ameliorated by careful feature selection or feature weighting [Cunningham et al 2007].
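For illustration, a k-NN sketch in scikit-learn (toy data assumed; Weka's IBk, mentioned in the footnote below, is the analogous implementation we actually used):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy stand-in for 227-dimensional MFCC feature vectors
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 227))
y = rng.choice(['bari', 'torino'], size=200)

# induction is delayed to query time: fit() essentially just stores the examples
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(X[:3]))  # nearest stored neighbours vote on the class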

3.4 Literature
In this paragraph we review some relatively recent studies dealing with different automatic accent identification tasks. We focused on studies about regional varieties of a same language, while we overlooked works about the detection of L2 speakers' provenance. Nonetheless, even if recognizing the foreign accent of an L2 speaker is perhaps an easier task because of the higher variability, the problem to tackle is roughly the same as in our work, namely to detect a geo-linguistic provenance15.
14 Weka's default k-NN algorithm, that is to say the method we used for our experiments, is called IBk (Instance-Based learning with parameter k).
15 We list some recent works in this field, which make wide-ranging use of MFCC features:
1) Ullah Sameeh (2009) A Soft Computing Based Approach for Multi-Accent Classification in IVR Systems. University of Waterloo, Ontario.
2) Torres-Carrasquillo Pedro A., Sturim Douglas E., Reynolds Douglas A., McCree Alan (2008) Eigen-channel compensation and discriminatively trained gaussian mixture models for dialect and accent recognition. In: ISCA INTERSPEECH 2008, pp. 723-726.
3) Piat Marina, Fohr Dominique, Ilina Irina (2008) Foreign accent identification based on prosodic parameters. In: Proceedings of INTERSPEECH 2008, pp. 759-762.
4) Pedersen Carol, Diederich Joachim (2008) Accent in Speech Samples: Support Vector Machines for Classification and Rule Extraction. In: Studies in Computational Intelligence, 80, pp. 205-226.


Before starting the review, it is worth recalling that there are many ways to deal with this task, but two of them are primary. 1) The phonotactic method, which derives from ASR and from the automatic detection of language (for a good review of this latter task, [cf. Zissman 1995]). This method is more expensive to set up than the others, but it is able to capture some linguistic information about the targeted language. 2) The spectral method (our case), that is to say extracting spectral information from the samples without passing through the linguistic content. This approach is bound up with the speaker recognition and signal processing domains. It is less time-consuming and maybe the most popular nowadays.
In [Sinha et al 2015] the authors built a classifier to identify 4 prominent Hindi dialects. They trained several AANN models (Auto Associative Neural Network) on a text-dependent corpus (300 different sentences pronounced by men and women of different age and provenance), divided by speaker between training and test sets. They used, alternatively, 3 different spectral features: MFCCs, PLP (perceptual linear prediction), and MF-PLP (PLP derived from a Mel-scale filter bank). These cepstrum-based features were deployed along with 3 different prosodic features: energy, ΔF0/F0, and syllable duration. This latter was deployed through a second, cooperative AANN classifier and a segmentation into syllable units. While MFCC and MF-PLP got roughly the same scores, the best result was reached with MF-PLP and all 3 prosodic features together (82% accuracy on average; 81% with MFCC instead of MF-PLP). Nevertheless, MFCC with no syllable duration information got good results too: 73% accuracy on average. Note that this is a study about regional dialects and not regional accents: the difference between these kinds of varieties could be relevant.
In [Brown 2014] the author developed a system on the basis of the ACCDIST metric [Huckvale 2004], built to measure relative distances between the phonological units of speakers, considering 14 English regional accents of the British Isles. Differently from other ACCDIST-based accent recognition systems (such as [Hanani 2012], a study we will see in a moment), Brown's system is designed to process content-mismatched (spontaneous) speech data. On the other hand, it equally needs the transcription of the data. Using ACCDIST distance metrics and MFCC features to train an SVM model, it achieved, on a 4-way classification, 86.7% accuracy on content-controlled data and 52.5% on content-mismatched data. Below are two flowcharts taken from the same study which well represent the development of the system:


In [Akbacak et al 2012] the authors studied the effectiveness of language recognition techniques for a 4-way classification task over 4 Arabic dialects, achieving a 2.47% average equal error rate. The data consisted of 30-second telephone speech samples. The system developed is quite complex: on the one hand, the authors used dialectal and cross-dialectal phonotactic models to train an n-gram model along with an SVM model. On the other hand, they developed a GMM-UBM model (Gaussian Mixture Model-Universal Background Model), extracting a 56-dimensional vector consisting of MFCC, SDC (Shifted Delta Cepstrum) and energy. The combination of the former phonotactic model and the latter acoustic model gave the results above. Nonetheless, similarly to [Sinha et al 2015], the study does not deal with regional accents but with fully-fledged dialects from the Arab world, which are known to be very diverse from each other.
In [Hou et al 2010] the authors proposed an approach for the identification of 2 Chinese accents using both cepstral (SDC, also called MFCC_D) and prosodic (pitch) features with a gender-dependent model. As in the previous work, the system developed is mixed: the two features are preprocessed through a GMM model whose output is computed by an SVM model, which takes the final decision. Below are the general flowchart of the experiments and the results obtained with SDC, pitch, and both together:


In the [Hanani 2012] thesis, the author built a system to recognize regional accents of British English on the basis of the ACCDIST metric and a state-of-the-art language identification system. This model uses GMM and SVM in a complementary way to take the final decision, on an MFCC and SDC feature data set extracted from a text-dependent corpus, obtaining around 94% accuracy on a 2-way classification (against the 90% scored by humans).
In the Microsoft research study [Chen et al 2001], the authors built a GMM classifier to recognize 4 Chinese accents over a large cross-dialectal corpus. They obtained very good results through a gender-dependent system and a high number of GMM components, without transcriptions or phone modeling, just extracting MFCC and energy features with their delta regressions. However, it is not clear whether this system is text-dependent or not.

3.5 Introduction to next experiments


The set of experiments we set out in the next chapter aims to study the behavior and main trends of Mel-Frequency Cepstrum Coefficients (MFCCs) in Italian regional variety detection. Furthermore, we will deal with a system having no linguistic knowledge. In other words, differently from studies that addressed the problem by supporting the acoustic modeling for accent detection with previous linguistic modelings [cf. paragraph 3.4], we tried to rely solely on the acoustic-phonetic side, even if this idea is perhaps counter-intuitive: in fact, the regional accent, as a linguistic feature of a speaker, can be detected by humans only if they are able to couple phone and phoneme, as an ASR system does. For example, we believe of course that an Italian speaker who does not understand Chinese at all will not be able to recognize the various Chinese varieties and accents. Nevertheless, we cannot know whether such an Italian speaker, if subjected to massive listening of various accents from China, would be able to discern dialects even without linguistic competence. The operating principle of the machine we attempted to build is exactly this latter, and although it can appear counter-intuitive, there are several studies dealing with the problem through this approach and obtaining, moreover, good results [which we discussed in paragraph 3.4].
Our interest in MFCC features is due to the fact that they seem to be the most effective and successful ones in many speech and speaker-based studies: the same spectral feature is efficient in recognizing a speaker, a linguistic pattern and some broad linguistic characteristics like accent or idiolect. Furthermore, in many studies MFCCs provide a sort of baseline feature to which some other one is added to reach the state of the art: MFCCs + pitch, MFCCs + Perceptual Linear Prediction (PLP), MFCCs + duration, MFCCs + energy, and so on [cf. again paragraph 3.4]. However, MFCCs always seem to constitute a sort of conditio sine qua non, and the performance of a classifier trained on these features alone is often close to the state of the art. To sum up, our main purpose is not to build an efficient, state-of-the-art classifier for Italian accent recognition: it is rather to carry out a critical study on the use of MFCCs in accent recognition, using a corpus where the regional variability of Italian is relatively important, and comparing the behavior of the automatic system with the human one.
The employed data are a crucial aspect, even if the system has no previous linguistic knowledge. Looking at the deployed corpus (telephonic CLIPS, [cf. paragraph 4.0]), the identification task is supposed to be very hard for a machine due to four main factors. 1) Variability is rather great for a relatively small country like Italy; however, it is not comparable with the variability of an L2 English corpus, or that of a larger country like India or China [cf. paragraph 3.4]. The identification task we proposed is actually (at the maximum level of granularity) hardly affordable even for humans [cf. paragraph 4.1.1]. 2) This is a telephonic corpus: as we explained in [note n of chapter 1], MFCCs are quite robust to noisy channels. However, some spectral information is equally compromised. Furthermore, some samples contain telephonic noises which undermine performance too: unluckily, we cannot easily remove such noises in an automatic way. 3) The samples are partially content-mismatched (spontaneous), whereas the majority of corpora used for accent recognition purposes are completely text-dependent [more details on the CLIPS hallmarks are provided in paragraph 4.0]. This is a crucial difference, in that a text-dependent system captures more focused and accurate spectral information, but on the other hand it can work only on specific words as input. 4) A relatively great number of samples is pretty short, and the accent is hardly detectable: some of them contain just a monosyllabic word with no regional variation inside. For samples like these, it is practically impossible to recognize the provenance of the speaker. However, we ran an experiment [cf. paragraph 4.6] which tried to tackle this problem.
Considering all these factors, we can initiate the experimentation phase.


Chapter IV: experiments

4.0 Data: CLIPS telephonic corpus


CLIPS (Corpora e Lessici dell'Italiano Parlato e Scritto: [Sobrero 2006]) is a well-known Italian oral corpus, and probably the most important one concerning diamesic and diatopic varieties of Italian.
The CLIPS telephonic sub-corpus (the CLIPS section we are interested in) was published in 2006 and developed by the Ugo Bordoni foundation for the University of Naples. It is mainly divided into 15 directories, one for each city where data were collected: Bari (540 samples in total), Bergamo (469), Cagliari (687), Catanzaro (482), Firenze (638), Genova (555), Lecce (536), Milano (612), Napoli (370), Palermo (641), Parma (566), Perugia (627), Roma (703), Torino (643) and Venezia (552). For each directory, there are two sub-directories for the genders: male (globally 3946 samples) and female (4675 samples) speakers. There are 314 speakers in total, each having 27.13 samples on average. Other sub-directories concern tagged materials or distinctions in the way data were collected.
The signal of the recordings has a μ-law encoding (8000 Hz sampling frequency) and the files are downloadable in RAW audio format, namely an audio file without any header information such as sampling rate or endianness. Sample duration varies from around 3 to 30 seconds. We had to convert them to RIFF-WAVE (PCM) in order to make them processable by openSMILE. We did it with SoX for Linux [source: sox.sourceforge.net], through this particular setting: [appendix 1].
CLIPS data were collected by assigning some scenarios to each speaker. The majority of these scenarios concern communications with hotels or with some specific services: complaints, reservations, cancellations and so on. Testers simulated these scenarios producing semi-spontaneous speech, reasonably monitored by an operator console (Wizard of Oz technique: experiments in which subjects interact with a computer system that they believe to be autonomous, but which is actually being operated or partially operated by an unseen human being [source: Wikipedia]). These written scenarios, requiring the production of some specific words, ensure a broad phonetic and phonological coverage. Basically, for each speech sample we dispose of the identification number of the speaker, his home town and his gender.
A brief description of the CLIPS telephonic sub-corpus:

Sampling frequency            8000 Hz
Encoding                      μ-law
Number of samples             8621
Number of speakers            314
Number of Italian varieties   15

4.1 Two surveys to explore human perception of Italian accents


Perceptual survey-based methods are widely used in dialectology. Since Weijnen, in 1946, published a map constructed on the basis of the survey question "In which nearby location(s) do people speak the same or nearly the same dialect as yours?", experiments on the quantification of dialect perception have been quite frequent (for a review, [Heeringa 2004:13]).16
16 Someone who has tackled the problem of dialect distance through perception is [Gooskens 2002]: listeners were asked to judge 15 fragments of Norwegian dialect-sensitive speech on a scale from 1 to 10, from the most similar to their native dialect to the most dissimilar. Other studies especially tackled the sociolinguistic consequences of dialect perception. Some examples: [D'addario 2015] about the awareness of using a diatopic variety in the south of Italy, [Boughton 2006] about French accent identification, [Paciorkowski et Gilbert 2008] about the influence of several linguistic variables on the first impression a listener forms in a US accent identification task.

In the speaker profiling domain, as in speaker recognition for forensic applications, the aural-perceptual method is still important and reliable, especially if it is performed by speech experts. As [Schilling et Marsters 2015:200] state,

Aural-perceptual analysis performed by expert linguists goes far beyond impressionistic listening and involves careful examination of the voice quality, rhythmic, intonational and segmental features of recorded speech evidence for indicators of the various social and perhaps physical factors. [...] The only method available to nonexperts is, of course, auditory analysis (listening), and often they do not even have audio-recorded evidence at their disposal when offering their profile, but must rely on previous experience with the voice or language variety in question [...] when naïve listeners provide profiles based on auditory methods, they cannot properly be said to be conducting "analysis", since they seem to listen only holistically and usually cannot articulate which specific speech features lead them to reach their conclusions.
However, even a speech expert's opinion is not completely reliable, since his analysis, though sometimes supported by computational methods, is led, to a large extent, by intuition.
Let's have a look at [Köster et al 2012]: in a speaker profiling task, German experts on voice comparison were evaluated concerning their performance and their methodological approach in accent identification. In the discussion section, the authors state that

the limited success of the use of different methods and findings from dialectology also suggests that proficiency tests on speaker profiling as well as the evaluation of a forensic expert involved in casework should not only focus on methods, of dialectology in particular, but should also focus on the expert's individual performance to identify regional accents. Even if a method has been exhaustively described and accredited, and even if an expert can demonstrate his/her formal qualifications and ability to apply the method, there seems to be no guarantee that (at least with the given time constraints) either 1) perceptual determination works properly or 2) phonetic details gained with the help of a particular perceptual method will be interpreted correctly. [ibid:68]
To sum up, the group of experts performs on average quite well not because of the method adopted, but due to their individual and innate competence: it is somewhat surprising, according to the authors, that

pure listening (which was probably performed holistically, especially for those participants who needed very little time) is even more successful in the task of determining dialects/regional accents under certain forensic conditions than the use of classical dialectological methodology and knowledge. [ibid:67]


Even if in real cases it is not so easy to involve a large number of speech experts in such classification tasks, this procedure is actually very common in investigations. Nevertheless, it can be useful for forensic applications but not for a speaker profiling aimed at improving a speech/speaker recognition system [cf. chapter 1 of the present work and Sinha et al 2015], where we need a completely automatic model able to distinguish speaker attributes.

The surveys we prepared were meant to cope with the task of aural-perceptual accent recognition through a quantitative approach (collecting as much data as possible), and they were addressed to any sort of Italian listener (naive or expert). The questions we asked ourselves were: are Italians good, on average, at recognizing Italian regional accents? Which accents are more recognizable? And are regional accents always perceptible?
It is worth recalling, as we saw in chapter 2, that we are not talking about dialects or languages. Regional varieties of Italian are not different languages: they are fully-fledged Italian. Regional varieties, although they are something relatively new which has not finished its evolution yet, differ from each other not necessarily in syntactic, morphological or lexical aspects, but even only in phonetic and prosodic aspects. In spite of this, it is quite easy for an Italian to estimate the broad geo-linguistic area of a speaker. This task is harder when the fragments of speech are very short, as is the case for the greater part of the CLIPS corpus; it is however affordable, as the results we are going to present show.
In the light of this, we decided to set up two surveys in order to 1) explore, broadly, how Italian people behave in recognizing regional accents and 2) build a test-bench so as to be able to compare the performance of our machine-learning classifier with the human one.
To realize the first survey, we chose twenty-one samples well distributed among the cities of the corpus (in this first survey Bergamo is included, but not in the second one: [cf. paragraph 4.3.2]). These samples were chosen arbitrarily and regardless of gender; the only criterion was that every sample had to contain at least one typically regional feature. We tried to avoid both the easiest and the hardest examples, in order to make the choice difficult and at the same time not to discourage the tester.
We placed a picture of the Italian linguistic areas [shown in paragraph 2.2] at the head of this survey. As a sort of quiz, we embedded the audio samples in the form (made with Google Forms) and asked, for each sample, two multiple-choice questions: 1) which is the broad geo-linguistic area of the speaker? 2) Which is the speaker's home town?
Below is an example:


The survey was posted on a computational linguistics blog ( http://www.nlpspoiler.it/sai-riconoscere-gli-accenti-regionali-italiani-allora-gioca/ ) and shared across various Italian linguistics Web communities. Furthermore, we ran a sort of gamification, putting up a prize for the best scores, in order to make the quiz more friendly and appealing.
We asked people to participate only if they were Italian native speakers. Moreover, we asked participants to include their age and their home town (also a second town in case the tester had moved elsewhere for a minimum period of 5 years); we required this information because we had planned to study the distribution of answers across the variables of age and of the geo-linguistic area the tester belonged to. This plan was later abandoned because we considered it off topic compared to the subject of this work.


A month later, we decided to set up another survey, for mainly three reasons: 1) We discovered some sampling errors in the Bergamo sub-section and decided to remove its speakers from the data set [cf. par. 4.3.2]. 2) The first survey, made with Google Forms, did not provide an easy embedding for audio samples, making the form not very user-friendly. In the meanwhile, we had found another free online tool for surveys, namely PollDaddy. 3) Asking some testers, we learnt that the geo-linguistic area questions were somewhat useless, since the first action of the user after listening to the samples was generally to check the list of possible cities. In any case, the choice of geo-linguistic area could have been inferred from the user's city selection.
Thus, the new test was stripped of the Bergamo speakers and of the questions about the broad linguistic area. Another relevant point is that the new samples were chosen not by us, but by a university colleague. We took this decision in order to avoid selector subjectivity: the selection of samples is surely biased by the experimenter's linguistic habits. The first selector was an Italian from the south of Tuscany (brown linguistic area in the Italian varieties chart in [paragraph 2.2] of this work), while the selector for the second survey was an Italian from Romagna (purple linguistic area in [ibid]).


The survey was posted on a computational linguistics blog ( http://www.nlpspoiler.it/riconosci-le-inflessioni-regionali/ ) and shared. Furthermore, we ran a sort of gamification, making the survey a self-assessment test of one's own skills in recognizing Italian accents.
We asked people to participate only if they were Italian native speakers. Moreover, we asked participants to include their age and their home town (also a second town in case the tester had moved elsewhere for a minimum period of 5 years), for the same reasons as in the first survey.

4.1.1 Results and discussion


As of 3/12/2015, 53 testers had taken the first test. The overall accuracy of individual testers varies over a broad range, between 22.2% and 71.43%. On the linguistic area questions (a 7-way classification task), users got a good accuracy on average, namely 49% of correct answers, a score that does not take into account the fact that some testers chose the wrong area just because they did not check carefully the position of each city on the map provided: this means that the results should be, in effect, better. Slightly worse were the results on the 15-way city selection classification (26.7%). In any case, the scores on cities are not fully realistic, because answering this question was not mandatory: a proper answer would have been worth half a point towards winning the prize, but if the tester was not sure at all, he could leave the question blank.
On the second test we had, identically to the first one, 53 testers as of 3/12/2015. The overall accuracy of individual testers varies over a broad range, between 10% and 70%, reaching on average 33.6% of correct answers.
We can observe that testers were particularly able in recognizing cities of the median, Tuscan and Sardinian linguistic areas, whereas they had some trouble distinguishing the northern linguistic area from the Veneto linguistic area, especially when the speaker came from a borderline zone like Parma. They had the same difficulties in separating the meridional linguistic area from the extreme meridional linguistic area.

The pictures above are from test 1: the correct answer was the northern area for both, but in the first the speaker was from Genova, and in the second from Parma.

The pictures above are from test 2: in green, the correct answers. The Venezia variety seems not to be easily distinguished from the other northern varieties. However, linguistic boundaries are fuzzier here, that is to say that there is not something like an isogloss [cf. chapter 2].

The pictures above are from test 1: the correct answer was the southern area for the first and the extreme southern area for the second (respectively, Napoli and Palermo).


The pictures above are from test 2: in green, the correct answers. Users did not easily distinguish the southern and extreme southern varieties.
Users generally got good results in guessing the Roma, Cagliari and Firenze accents:


The graphs above are from test 2, while those below are from test 1: the correct answer of the graph on the left was Roma, of that on the right Firenze, and of that in the center Cagliari.

Finally, we can observe a sort of trend in identifying the whole northern variety with the Milano variety: in test 2, Milano was chosen 6 times as the most selected answer, while it was correct only twice. In test 1 the same city was chosen 4 times (moreover, twice as the second most selected answer) and it was correct only once. We briefly mentioned the likely cause of this phenomenon, namely the concept of linguistic koinè, in chapter 2.

4.2 Tools
Mainly five instruments were used during the different phases of this work. We first used SoX to convert the CLIPS audio samples from .raw to .wav format. SoX is a "free cross-platform digital audio editor [...] written in standard C and having a command-line interface" [source: sox.sourceforge.net]. By doing that, we made the files suitable for the feature extraction processing [appendix 1]. After that, we used the openSMILE software [Eyben et al 2010] to perform the feature extraction from each audio sample, obtaining a comma-separated file containing the selected features and raw numeric descriptions of each audio file. Then, we added some information to the spreadsheet, like the gender, the home town, the broad linguistic area and the identification number of each speaker. To do that, we developed some simple Perl and UNIX scripts. Then, we built different data sets to be processed, with the aim of setting up several machine learning experiments. Next, we processed these data sets with Weka [www.cs.waikato.ac.nz/ml/weka/], an open-source machine learning software written in Java. Last, we analyzed the results of our classifiers, comparing them with 1) human scores on an accent recognition task and 2) some experiments set up to find confounding variables.

4.2.1 openSMILE
OpenSMILE (open-Source Media Interpretation by Large feature-space Extraction) is an open-source, freely downloadable tool-kit developed by audEERING at the Technische Universität München, in Bavaria. This software is especially used in signal processing and music information retrieval frameworks, in order to select and extract acoustic features for acoustic analysis or for use with machine learning methods. It is able to produce various output files such as CSV (comma-separated values), HTK (Hidden-Markov Tool-Kit) and ARFF (attribute-relation file format). This latter is the one we are interested in because of its compatibility with Weka. On the other hand, it accepts only one type of audio input file: RIFF-WAVE (PCM).
OpenSMILE is written in C++ and is available for various operating systems, but it has no graphical interface, so it provides shell usage only. However, it is a powerful instrument, able to extract a great range of features and their functionals (namely, various statistical filters applied to the features, in order to smooth data and map the contours of the audio samples). Some of the most common are: MFCCs, Perceptual Linear Predictive (PLP) coefficients, Linear Predictive Coefficients (LPC), CHROMA (octave-warped semitone spectra), Fundamental Frequency (F0), formant distribution, pitch, energy, loudness, voicing probability, jitter and shimmer.
Feature extraction can be run from a terminal with a simple command line like this:

SMILExtract -C myconfig/demo1.conf -I wav_samples_speech01.wav -O speech01.energy.csv -N wav_samples_speech01

SMILExtract is the openSMILE component that extracts features from the signal, composing the raw numeric vector of the audio sample. The <-I> option is the audio input file, in .wav format (more precisely, WAVE-RIFF PCM format). The <-O> option is the output file (.csv in the example above). The <-N> option assigns a name to the sample as the first value of the CSV output. <-C> is the configuration file, which is the most interesting element of the SMILExtract function, and it is worth spending some words on it. Overall, openSMILE can be fully configured via a text-based configuration file, which looks like this:
[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[energy].type = cEnergy
instance[csvSink].type = cCsvSink
...
...
[waveSource:cWaveSource]
filename = \cm[inputfile(I):file name of the input wave file]

[framer:cFramer]
reader.dmLevel = <<XXXX>>
writer.dmLevel = <<XXXX>>

[energy:cEnergy]
...
...
[csvSink:cCsvSink]
filename = \cm[outputfile(O):file name of the output CSV file]

After having listed the components, setting parameters must be specified for each one. Each component has its own section, and all components are connected via their links to a central data memory component. This latter is primary, together with the waveSource component, which reads the input file [image from Eyben et al 2010:28]:


All components except these latter are generated from a specific component (namely, they read a specific data memory level) and they generate another one (namely, they write a specific data memory level). Thereby, in order to extract one specific feature, one has to follow a certain pattern, composing one's own configuration file. Although the configuration syntax is quite hard to learn, openSMILE provides some default sets which can be used to compose our own custom model.

4.2.2 Weka
Weka is an open-source software for data mining, classification and predictive modeling tasks. It was developed in New Zealand at the University of Waikato, and it is certainly one of the most common open-source systems for machine learning. It is written in Java, together with its machine learning algorithms library. The algorithms can either be applied directly to a data set or called from your own Java code. The system, which provides a user-friendly graphical interface, contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. The main input file format is the ARFF, specifically developed by the University of Waikato. An ARFF header example on a simple data set about flowers is provided [source: cs.waikato.ac.nz/ml/weka/arff]:
% 1. Title: Iris Plants Database
%
% 2. Sources:
%      (a) Creator: R.A. Fisher
%      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%      (c) Date: July, 1988
%
@RELATION iris

@ATTRIBUTE sepallength  NUMERIC
@ATTRIBUTE sepalwidth   NUMERIC
@ATTRIBUTE petallength  NUMERIC
@ATTRIBUTE petalwidth   NUMERIC
@ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

The data of an ARFF file looks like this:


@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

Lines that begin with a '%' are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case-insensitive. The header of an ARFF file contains the name of the relation and a list of the attributes (introduced by the @ symbol) with their types (the type of each variable). Below the header, the data follow in comma-separated format.

4.3 Composing data sets


After having converted the raw samples to RIFF-WAVE (PCM) using SoX [see: appendix 1], the telephonic corpus was ready for the feature extraction process. Even though the distribution of audio samples is not well balanced among the different sampling cities, we decided to use all the material, except for outliers and errors [see paragraph 4.3.2].


4.3.1 Features extraction with openSMILE


Once openSMILE 2.1 was compiled for Linux, we ran our first MFCC feature extractions through this UNIX script, in order to extract in bulk:

for i in ./*.raw.wav; do
  SMILExtract -C path/to/example1.conf -I "$i" -O path/to/example1.arff -N "$i"
  SMILExtract -C path/to/example2.conf -I "$i" -O path/to/example2.arff -N "$i"
done

Initially, we used as configuration file the default MFCC set (MFCC12_0_D_A.conf), discovering that its output is not an ARFF but an HTK (Hidden-Markov Tool-Kit) file. Instead of modifying the output module of the set, or converting the HTK file, we sped up the procedure by extracting with the 'emobase' configuration file.
Emobase is a set of 988 acoustic features built for emotion recognition tasks. It contains the following low-level descriptors (LLD): intensity, loudness, 12 MFC coefficients, pitch (F0), probability of voicing, F0 envelope, 8 LSF (Line Spectral Frequencies), and zero-crossing rate. Delta regression coefficients (the first-order delta regression is already in the default set; a second-order delta was added by us) are computed from these LLD, and the following functionals are applied to the LLD and the delta coefficients: max./min. value and respective relative position within the input, range, arithmetic mean, 2 linear regression coefficients with linear and quadratic error, standard deviation, skewness, kurtosis, quartiles 1-3, and 3 inter-quartile ranges [cf. Eyben et al 2010]. For a brief description of these functionals see [appendix 4].
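Since we mention them repeatedly, the delta regression coefficients deserve a concrete note: they are typically computed with the standard regression formula over a window of neighbouring frames. A minimal sketch (a window of N = 2 frames is assumed here, which is a common default, not necessarily the openSMILE setting):

import numpy as np

def delta(feats, N=2):
    """First-order delta regression over frames (rows of feats):
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)"""
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(feats.shape[0])
    ])

# second-order deltas are just deltas of the deltas
feats = np.random.randn(100, 12)   # 100 frames of 12 MFCCs
d1 = delta(feats)
d2 = delta(d1)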
Processing the ARFF output directly in Weka, we were able to select the features to include in the training set through a user-friendly ticking system. We then kept only the MFCC coefficients and the two delta regressions. The list of features used to train our machine learning classifier is in [appendix 5], accompanied by a brief explanation of the annotations.
Launching the machine learning system on our dataset for the first few times, in order to familiarize ourselves with the instruments, we immediately realised that a 681-dimensional data set is quite heavy to process. We repeated the test removing both delta regressions, in order to lighten the computing load. Thereby, we reduced the number of features from 681 to 227, getting nearly the same results. We therefore considered the computational speed of a smaller dataset preferable. The gap in performance remained small even when we built the classifier with different algorithms: so, we decided to remove the delta regressions from our main data set.
4.3.2 Detecting outliers in the CLIPS telephonic corpus
For various reasons, e.g. the failure of the operator console in monitoring telephonic speech, or tester misunderstandings, there are some errors in the CLIPS telephonic corpus. For example, it is not rare to find a sample having a long tail of telephonic noises. Void samples are also quite common, as are those with background noise. These errors are not troublesome if one wants to carry out a manual acoustic analysis, whereas they can deteriorate the performance of an automatic model trained on the data, which is our case.
In order to remove such samples from the training set, we set up a fast procedure using openSMILE and Perl programming. Our targets were the void samples, which are present in considerable number and obviously do not convey any information about speech features.
First of all, we extracted loudness information for each telephonic corpus sample. To do that we used openSMILE, and notably one of its default configuration files: 'prosodyViterbiLoudness.conf' [for the specific features, see appendix 6]. This default configuration turned out to be effective in detecting silent samples among the others:

The underlined sample is likely to be silent.


Next, we carried out various manual inspections to verify whether the samples were effectively void; in the light of this, through a simple Perl script we chose a prudent threshold to filter the outliers [see: appendix 2]. These samples were collected in a text file, obtaining a total of 163 outliers. Afterwards, we added to this text file some other types of outliers which we arbitrarily decided to exclude from our data set. For instance, we decided to rule out the Bergamo speakers. We took this decision for two reasons: 1) among the Bergamo speakers there were 3 who are clearly non-native to Bergamo. Beyond the Bergamo sub-set, we encountered this drawback also among a few other speakers, who are not, in our opinion, native to the city they are assigned to. These errors (if they are errors) are maybe due to a loose control on the people subjected to the sampling. 2) Bergamo, being only about 55 kilometers from Milano, belongs to a sort of koinè, dominated by the Milano variety, which is more prestigious for influence and power. Since humans were generally not able to distinguish the Milano accent from the Bergamo one (as we noted in the first survey), we solved our problem by deciding that this difference cannot be relevant for the machine-learning model, and removing the Bergamo samples.
In addition to the outliers, we also excluded from our training set the 41 samples which were submitted to the human testers [see par. 4.1]. We did it so as to be able to subsequently test our classifier on these samples and compare its results with the human ones.

4.3.3 Add information for data mining


After getting an ARFF file for each city and each gender, we had to add some information to each row/sample, in the form of variables: information about the speaker's home town and gender, the identification number of each speaker and the broad geo-linguistic area [cf. chapter 2], in order to increase the number of practicable analyses. To add these elements, we wrote some short scripts in Perl and UNIX shell [see: appendix 3]. In the end, the shuffled ARFF data looked like this:


Lastly, we merged all the ARFF files (there was one for each gender of each city) into a single data set file.

4.4 Speaker variable is a confounding variable


In order to have a look at the different behaviors of humans and machine on the accent recognition task, we tried to use the 20 samples of survey no. 2 [see par. 4.1] as a test set for different machine-learning methods. To test a classifier on a given data set (in this case, an ARFF file composed of the audio samples of the second test and their MFCC features), we ticked the supplied test set button in Weka and set the file with our samples, which must be compatible with the training set.


Finally, we ran the SMO algorithm (described, with the other methods used, in [paragraph 3.3]) on the twenty samples of test 2, using a data set devoid of Bergamo speakers and outliers [cf. paragraph 4.3.2]:
=== Summary ===

Correctly Classified Instances           8               40      %
Incorrectly Classified Instances        12               60      %
Kappa statistic                          0.3548
Mean absolute error                      0.1257
Root mean squared error                  0.2473
Relative absolute error                 94.8648 %
Root relative squared error             96.0627 %
Total Number of Instances               20

=== Confusion Matrix ===

 a b c d e f g h i j k l m n   <-- classified as
 0 0 0 0 0 0 0 0 1 0 0 0 0 1 | a = bari
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 | b = cagliari
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | c = catanzaro
 0 0 0 1 0 0 0 0 0 0 1 0 0 0 | d = firenze
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 | e = genova
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | f = lecce
 0 0 0 1 1 0 0 0 0 0 0 0 0 0 | g = milano
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | h = napoli
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 | i = parma
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 | j = perugia
 0 0 0 0 0 0 1 0 0 0 1 0 0 0 | k = palermo
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 | l = roma
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | m = torino
 0 0 0 0 1 0 0 0 0 0 0 0 0 1 | n = venezia

### Classifier output using 227 features (MFCCs), functions.SMO algorithm


### test supplied set: audio samples of test 2, classified for: cityname

Using instead the Bayesian Networks algorithm:

=== Summary ===

Correctly Classified Instances           7               35      %
Incorrectly Classified Instances        13               65      %
Kappa statistic                          0.3029
Mean absolute error                      0.093
Root mean squared error                  0.2905
Relative absolute error                 70.2351 %
Root relative squared error            112.8459 %
Total Number of Instances               20

=== Confusion Matrix ===

 a b c d e f g h i j k l m n   <-- classified as
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 | a = bari
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 | b = cagliari
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 | c = catanzaro
 0 0 0 1 0 0 0 0 1 0 0 0 0 0 | d = firenze
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 | e = genova
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | f = lecce
 0 0 0 0 1 1 0 0 0 0 0 0 0 0 | g = milano
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | h = napoli
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 | i = parma
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 | j = perugia
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 | k = palermo
 0 0 1 0 0 0 0 0 0 0 0 1 0 0 | l = roma
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 | m = torino
 0 0 0 0 0 1 0 0 0 0 0 0 0 1 | n = venezia

### Classifier output using 227 features (MFCCs), bayesNet algorithm


### test supplied set: audio samples of test 2, classified for: cityname

Finally, using the IBk (k-NN) algorithm, the model reaches proficient scores:

=== Summary ===

Correctly Classified Instances          18               90      %
Incorrectly Classified Instances         2               10      %
Kappa statistic                          0.8913
Mean absolute error                      0.0145
Root mean squared error                  0.1194
Relative absolute error                 10.9414 %
Root relative squared error             46.3942 %
Total Number of Instances               20

=== Confusion Matrix ===

 a b c d e f g h i j k l m n   <-- classified as
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 | a = bari
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 | b = cagliari
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 | c = catanzaro
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 | d = firenze
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 | e = genova
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 | f = lecce
 0 0 0 0 0 0 2 0 0 0 0 0 0 0 | g = milano
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | h = napoli
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 | i = parma
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 | j = perugia
 0 0 0 0 0 0 0 0 0 0 2 0 0 0 | k = palermo
 0 0 0 0 0 0 0 0 0 0 0 2 0 0 | l = roma
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | m = torino
 0 0 0 0 0 0 1 0 0 0 0 0 0 1 | n = venezia

### Classifier output using 227 features (MFCCs), lazy.IBk algorithm


### test supplied set: audio samples of test 2, classified for: cityname

As can be seen, the accuracy of all three methods is higher than the human one. However, these generally positive results sharply plummeted when we removed from the training set the samples belonging to the speakers of survey 2. Below are the same algorithms used above, but excluding these samples from the training set.
Using the SMO (SVMs method) algorithm:
Correctly Classified Instances           1                5      %
Incorrectly Classified Instances        19               95      %
Kappa statistic                         -0.0215
Mean absolute error                      0.1299
Root mean squared error                  0.2557
Relative absolute error                 97.9839 %
Root relative squared error             99.2652 %
Total Number of Instances               20

=== Confusion Matrix ===

 a b c d e f g h i j k l m n   <-- classified as
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 | a = bari
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 | b = cagliari
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | c = catanzaro
 0 0 0 0 0 0 0 0 0 1 1 0 0 0 | d = firenze
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | e = genova
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = lecce
 0 0 0 1 1 0 0 0 0 0 0 0 0 0 | g = milano
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 | h = napoli
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 | i = parma
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 | j = perugia
 0 0 0 1 0 0 1 0 0 0 0 0 0 0 | k = palermo
 0 0 0 0 0 0 0 0 1 0 0 0 0 1 | l = roma
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 | m = torino
 0 0 1 0 1 0 0 0 0 0 0 0 0 0 | n = venezia

Accuracy moved from 40% to 5%.


Using the Bayesian Networks algorithm:

Correctly Classified Instances           4               20      %
Incorrectly Classified Instances        16               80      %
Kappa statistic                          0.1398
Mean absolute error                      0.1168
Root mean squared error                  0.3175
Relative absolute error                 88.1124 %
Root relative squared error            123.2673 %
Total Number of Instances               20

=== Confusion Matrix ===

 a b c d e f g h i j k l m n   <-- classified as
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 | a = bari
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 | b = cagliari
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 | c = catanzaro
 0 0 0 0 0 0 0 0 1 1 0 0 0 0 | d = firenze
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 | e = genova
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | f = lecce
 0 0 0 0 1 1 0 0 0 0 0 0 0 0 | g = milano
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | h = napoli
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 | i = parma
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 | j = perugia
 0 0 1 0 0 0 1 0 0 0 0 0 0 0 | k = palermo
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 | l = roma
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 | m = torino
 0 0 0 0 0 1 0 0 0 0 0 0 0 1 | n = venezia

Accuracy moved from 35% to 20%.



Using the IBk (k-NN method) algorithm:

Correctly Classified Instances           6               30      %
Incorrectly Classified Instances        14               70      %
Kappa statistic                          0.235
Mean absolute error                      0.1001
Root mean squared error                  0.3159
Relative absolute error                 75.4742 %
Root relative squared error            122.6396 %
Total Number of Instances               20

=== Confusion Matrix ===

 a b c d e f g h i j k l m n   <-- classified as
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 | a = bari
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 | b = cagliari
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 | c = catanzaro
 0 1 0 1 0 0 0 0 0 0 0 0 0 0 | d = firenze
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 | e = genova
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | f = lecce
 0 0 0 0 0 0 1 0 0 0 0 1 0 0 | g = milano
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 | h = napoli
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 | i = parma
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 | j = perugia
 0 0 0 1 1 0 0 0 0 0 0 0 0 0 | k = palermo
 0 0 0 0 0 0 0 0 0 0 0 0 0 2 | l = roma
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 | m = torino
 0 0 0 0 0 0 1 0 0 0 0 0 0 1 | n = venezia

Accuracy moved from 90% to 30%.


This means that the machine-learning models built above actually generated classes related to speaker characteristics, and not to accent ones. In fact, the model had a sufficient amount of data to build a class on each speaker's individual acoustic characteristics (like timbre), rather than on the accent features (a cross-speaker attribute). This could be expected if we consider that MFCC acoustic features are deeply exploited in speaker recognition tasks, where we need to attribute the authorship of a recording and not to describe some hallmarks of the speaker who produced it. Indeed, most of the considered studies point out the importance of keeping the speakers of training and test sets separated: for this reason they usually split the two sets by speaker or use the leave-one-(speaker)-out cross-validation technique17.
17 Leave-one-out cross-validation is K-fold cross-validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that, N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before, the average error is computed and used to evaluate the model. The evaluation given by the leave-one-out cross-validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute. [source: www.cs.cmu.edu]

By the way, a test set of about 20 samples is surely not an adequate test-bench for our system, because it is not large enough. The second experiment we will show below is perhaps
more reliable, because we exploited a larger test set to evaluate our classifiers: we excluded 10% of the samples from the training set and used this part as a test set. On the one hand, we took away 10% of the samples at random, namely 882 samples. On the other hand, we did the same thing not randomly, but on the basis of speakers: 10% of the total number of speakers (selected at random) became the test set of our classifier (namely, 854 samples), which was trained on the other 90% of speakers. We provide below the score comparison between the random-based test and the speaker-based one, repeated once for each method (in order: SMO (SVMs), Bayesian network, IBk (k-NN)). Similarly to the previous experiment, a strong fall in performance can be seen when the machine is not trained with the speakers of the test set (speaker-based test).
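The difference between the two splits can be sketched as follows (scikit-learn is assumed purely for illustration; in practice we partitioned the ARFF files in Weka, and the leave-one-speaker-out variant mentioned in the footnote is the LeaveOneGroupOut analogue of the same idea):

import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 227))                   # stand-in MFCC vectors
y = rng.choice(['bari', 'roma', 'torino'], 1000)   # city labels
speakers = rng.integers(0, 100, size=1000)         # speaker id of each sample

# random split: samples of one speaker can end up in both sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# speaker-based split: 10% of the *speakers* form the test set,
# so no test speaker is ever seen during training
gss = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=speakers))
X_tr2, X_te2, y_tr2, y_te2 = X[train_idx], X[test_idx], y[train_idx], y[test_idx]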
Using SMO (SVMs method) algorithm on the random-based test and training sets:
Correctly Classified Instances         353               42.8919 %
Incorrectly Classified Instances       470               57.1081 %
Kappa statistic                          0.3838
Mean absolute error                      0.1259
Root mean squared error                  0.2477
Relative absolute error                 95.0148 %
Root relative squared error             96.2155 %
Total Number of Instances              823

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.404    0.046    0.397      0.404   0.4        0.864     bari
0.5      0.089    0.343      0.5     0.407      0.813     cagliari
0.391    0.044    0.346      0.391   0.367      0.789     catanzaro
0.5      0.065    0.333      0.5     0.4        0.85      firenze
0.529    0.034    0.509      0.529   0.519      0.891     genova
0.42     0.031    0.467      0.42    0.442      0.84      lecce
0.348    0.057    0.348      0.348   0.348      0.789     milano
0.333    0.013    0.6        0.333   0.429      0.869     napoli
0.277    0.034    0.409      0.277   0.33       0.782     parma
0.569    0.053    0.506      0.569   0.536      0.902     perugia
0.333    0.036    0.46       0.333   0.387      0.831     palermo
0.317    0.034    0.422      0.317   0.362      0.79      roma
0.486    0.04     0.531      0.486   0.507      0.837     torino
0.596    0.042    0.492      0.596   0.539      0.898     venezia
0.429    0.045    0.439      0.429   0.427      0.838     Weighted Avg.

And then the same algorithm using the speaker-based test and training sets:
Correctly Classified Instances         102               11.9578 %
Incorrectly Classified Instances       751               88.0422 %
Kappa statistic                          0.0599
Mean absolute error                      0.1312
Root mean squared error                  0.2582
Relative absolute error                 98.5316 %
Root relative squared error             99.7343 %
Total Number of Instances              853

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.196    0.088    0.113      0.196   0.143      0.668     bari
               0.217    0.125    0.046      0.217   0.076      0.667     cagliari
               0.113    0.049    0.133      0.113   0.122      0.728     catanzaro
               0.205    0.084    0.117      0.205   0.149      0.663     firenze
               0.016    0.066    0.019      0.016   0.017      0.345     genova
               0.058    0.04     0.139      0.058   0.082      0.521     lecce
               0.059    0.06     0.118      0.059   0.078      0.363     milano
               0        0.028    0          0       0          0.552     napoli
               0.134    0.053    0.176      0.134   0.153      0.726     parma
               0.143    0.037    0.275      0.143   0.188      0.819     perugia
               0.115    0.112    0.063      0.115   0.081      0.569     palermo
               0        0.088    0          0       0          0.382     roma
               0.277    0.066    0.257      0.277   0.267      0.702     torino
               0.327    0.042    0.333      0.327   0.33       0.863     venezia
Weighted Avg.  0.12     0.059    0.133      0.12    0.118      0.601

Using Bayesian Networks algorithm on the random-based test and training sets:
Correctly Classified Instances         247               30.0122 %
Incorrectly Classified Instances       576               69.9878 %
Kappa statistic                          0.2453
Mean absolute error                      0.1006
Root mean squared error                  0.2972
Relative absolute error                 75.941  %
Root relative squared error            115.4308 %
Total Number of Instances              823

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.211    0.043    0.267      0.211   0.235      0.767     bari
               0.4      0.166    0.183      0.4     0.251      0.697     cagliari
               0.217    0.045    0.222      0.217   0.22       0.72      catanzaro
               0.28     0.026    0.412      0.28    0.333      0.795     firenze
               0.412    0.025    0.525      0.412   0.462      0.873     genova
               0.44     0.026    0.524      0.44    0.478      0.783     lecce
               0.197    0.066    0.206      0.197   0.202      0.69      milano
               0.422    0.098    0.2        0.422   0.271      0.691     napoli
               0.077    0.018    0.263      0.077   0.119      0.773     parma
               0.486    0.055    0.461      0.486   0.473      0.819     perugia
               0.087    0.009    0.462      0.087   0.146      0.781     palermo
               0.117    0.013    0.412      0.117   0.182      0.715     roma
               0.4      0.077    0.326      0.4     0.359      0.751     torino
               0.519    0.088    0.284      0.519   0.367      0.828     venezia
Weighted Avg.  0.3      0.055    0.339      0.3     0.289      0.762

And then the same algorithm using the speaker-based test and training sets:

Correctly Classified Instances          81                9.4959 %
Incorrectly Classified Instances       772               90.5041 %
Kappa statistic                          0.0247
Mean absolute error                      0.1294
Root mean squared error                  0.3415
Relative absolute error                 97.1348 %
Root relative squared error            131.8942 %
Total Number of Instances              853

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.043    0.015    0.143      0.043   0.067      0.633     bari
               0.217    0.204    0.029      0.217   0.051      0.372     cagliari
               0.113    0.043    0.15       0.113   0.129      0.703     catanzaro
               0        0.061    0          0       0          0.39      firenze
               0        0.028    0          0       0          0.37      genova
               0        0.061    0          0       0          0.316     lecce
               0.059    0.109    0.068      0.059   0.063      0.569     milano
               0.163    0.138    0.133      0.163   0.147      0.538     napoli
               0.06     0.02     0.2        0.06    0.092      0.655     parma
               0.221    0.041    0.347      0.221   0.27       0.776     perugia
               0.038    0.016    0.133      0.038   0.06       0.64      palermo
               0        0.041    0          0       0          0.144     roma
               0.215    0.095    0.157      0.215   0.182      0.661     torino
               0.173    0.104    0.098      0.173   0.125      0.642     venezia
Weighted Avg.  0.095    0.07     0.114      0.095   0.094      0.551

Using IBk (k-NN method) algorithm on the random-based test and training sets:
Correctly Classified Instances         630               76.5492 %
Incorrectly Classified Instances       193               23.4508 %
Kappa statistic                          0.747
Mean absolute error                      0.0337
Root mean squared error                  0.1829
Relative absolute error                 25.4317 %
Root relative squared error             71.0215 %
Total Number of Instances              823

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.772    0.022    0.721      0.772   0.746      0.875     bari
               0.743    0.023    0.754      0.743   0.748      0.86      cagliari
               0.761    0.028    0.614      0.761   0.68       0.866     catanzaro
               0.82     0.023    0.695      0.82    0.752      0.898     firenze
               0.784    0.013    0.8        0.784   0.792      0.886     genova
               0.78     0.014    0.78       0.78    0.78       0.883     lecce
               0.697    0.015    0.807      0.697   0.748      0.841     milano
               0.689    0.006    0.861      0.689   0.765      0.841     napoli
               0.677    0.021    0.733      0.677   0.704      0.828     parma
               0.889    0.007    0.928      0.889   0.908      0.941     perugia
               0.754    0.016    0.813      0.754   0.782      0.869     palermo
               0.833    0.03     0.685      0.833   0.752      0.902     roma
               0.743    0.025    0.732      0.743   0.738      0.859     torino
               0.769    0.009    0.851      0.769   0.808      0.88      venezia
Weighted Avg.  0.765    0.018    0.773      0.765   0.766      0.874

And then the same algorithm using the speaker-based test and training sets:
Correctly Classified Instances         130               15.2403 %
Incorrectly Classified Instances       723               84.7597 %
Kappa statistic                          0.093
Mean absolute error                      0.1211
Root mean squared error                  0.3476
Relative absolute error                 90.9447 %
Root relative squared error            134.2706 %
Total Number of Instances              853

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.239    0.084    0.139      0.239   0.176      0.578     bari
               0.435    0.082    0.128      0.435   0.198      0.677     cagliari
               0.075    0.045    0.1        0.075   0.086      0.516     catanzaro
               0.091    0.07     0.066      0.091   0.076      0.511     firenze
               0.032    0.077    0.032      0.032   0.032      0.478     genova
               0.035    0.048    0.075      0.035   0.048      0.495     lecce
               0.108    0.049    0.229      0.108   0.147      0.53      milano
               0.071    0.033    0.219      0.071   0.108      0.52      napoli
               0.194    0.057    0.224      0.194   0.208      0.569     parma
               0.169    0.036    0.317      0.169   0.22       0.567     perugia
               0.058    0.054    0.065      0.058   0.061      0.493     palermo
               0        0.115    0          0       0          0.444     roma
               0.308    0.096    0.208      0.308   0.248      0.607     torino
               0.558    0.059    0.382      0.558   0.453      0.75      venezia
Weighted Avg.  0.152    0.059    0.172      0.152   0.148      0.547

Overall, performances sharply plummet when the speakers of the training set and the test set are not the same. This means that the speaker variable is crucially important for our MFCC-trained model, and that classification is actually carried out on the basis of the speakers' individual features, not of accent ones. It is therefore crucial, as the majority of the explored literature does [cf. chapter 3], to keep the speakers of the training and test sets separated.
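For the record, the speaker-disjoint evaluations reported above were run through Weka; from the command line, an equivalent invocation of the IBk model would look roughly like this (a sketch: the weka.jar location and our file names are assumptions, and -K 1 simply makes the single-neighbour setting explicit):

java -cp weka.jar weka.classifiers.lazy.IBk -K 1 -t train_speakers.arff -T test_speakers.arff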

4.4.1 Gender variable


We wanted to see how, and how much, the involvement of both male and female speakers influences the performance of our MFCC-based classifier. To do so, we generated with a simple Perl script [see appendix 8] two data sets (each a couple of training and test sets with separated speakers) which contained respectively female and male speakers' samples only; a usage sketch follows below.
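For instance, the gender filter of appendix 8 (with $gen set to 'F' or 'M') can be invoked as below; the script and file names here are hypothetical:

## $> perl gender_filter.pl < enriched.arff > female_only.arff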
First of all, we deployed our best model, trained using the IBk (k-NN method) algorithm, on the female data set, getting the following results:

Correctly Classified Instances          66               12.4528 %
Incorrectly Classified Instances       464               87.5472 %
Kappa statistic                          0.0605
Mean absolute error                      0.1251
Root mean squared error                  0.353
Relative absolute error                 94.114  %
Root relative squared error            136.2606 %
Total Number of Instances              530

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.211    0.086    0.083      0.211   0.119      0.563     bari
               0.391    0.073    0.196      0.391   0.261      0.66      cagliari
               0.07     0.051    0.107      0.07    0.085      0.51      catanzaro
               0        0.095    0          0       0          0.454     firenze
               0.037    0.116    0.035      0.037   0.036      0.462     genova
               0        0.053    0          0       0          0.475     lecce
               0        0.047    0          0       0          0.478     milano
               0.045    0.049    0.077      0.045   0.057      0.499     napoli
               0.12     0.04     0.24       0.12    0.16       0.541     parma
               0.188    0.004    0.857      0.188   0.308      0.592     perugia
               0.133    0.039    0.091      0.133   0.108      0.548     palermo
               0.375    0.096    0.242      0.375   0.294      0.64      roma
               0.088    0.125    0.046      0.088   0.061      0.483     torino
               0.154    0.065    0.205      0.154   0.176      0.545     venezia
Weighted Avg.  0.125    0.063    0.2        0.125   0.13       0.532

=== Confusion Matrix ===

  a  b  c  d  e  f  g  h  i  j  k  l  m  n   <-- classified as
  4  2  0  0  0  0  2  0  1  0  0  3  4  3 | a = bari
  2  9  1  0  0  1  3  0  0  0  2  1  1  3 | b = cagliari
  1  7  3  2  9  1  0  3  6  0  4  6  1  0 | c = catanzaro
  1  3  0  0  3  1  2  1  0  0  0  1  0  0 | d = firenze
  7  2  2  5  2  1  7  3  3  0  1  3 14  4 | e = genova
  0  3  1  3 26  0  1  1  1  0  1  1  1  0 | f = lecce
  3  5  2  2  2  6  0  9  1  1  0  2  6  2 | g = milano
 10  1  2  3  1  2  1  2  3  1  2  3  9  4 | h = napoli
  2  1  3  3  7  2  0  5  6  0  2  7  1 11 | i = parma
  6  3  1 18  2  9  0  1  1 12  0  4  5  2 | j = perugia
  1  1  1  0  1  3  0  0  2  0  2  2  1  1 | k = palermo
  0  2  6  5  0  0  1  0  0  0  2 15  8  1 | l = roma
 11  5  2  3  1  0  2  1  1  0  3  2  3  0 | m = torino
  0  2  4  5  3  0  4  0  0  0  3 12 11  8 | n = venezia

Then, the same model for male speakers' audio samples:


Correctly Classified Instances          63               15.4412 %
Incorrectly Classified Instances       345               84.5588 %
Kappa statistic                          0.0918
Mean absolute error                      0.1209
Root mean squared error                  0.3468
Relative absolute error                 91.2094 %
Root relative squared error            134.3149 %
Total Number of Instances              408

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.37     0.066    0.286      0.37    0.323      0.654     bari
               0.045    0.073    0.034      0.045   0.039      0.489     cagliari
               0        0.06     0          0       0          0.47      catanzaro
               0.125    0.069    0.133      0.125   0.129      0.53      firenze
               0        0.018    0          0       0          0.494     genova
               0.064    0.042    0.167      0.064   0.092      0.514     lecce
               0.016    0.049    0.056      0.016   0.025      0.487     milano
               0.111    0.045    0.273      0.111   0.158      0.535     napoli
               0.588    0.041    0.385      0.588   0.465      0.775     parma
               0.154    0.111    0.043      0.154   0.068      0.523     perugia
               0.081    0.065    0.111      0.081   0.094      0.483     palermo
               0.04     0.175    0.015      0.04    0.022      0.435     roma
               0.71     0.082    0.415      0.71    0.524      0.814     torino
               0        0.013    0          0       0          0.496     venezia
Weighted Avg.  0.154    0.062    0.155      0.154   0.14       0.546

=== Confusion Matrix ===

  a  b  c  d  e  f  g  h  i  j  k  l  m  n   <-- classified as
 10  3  0  4  0  1  1  0  4  1  1  0  2  0 | a = bari
  2  1  1  0  0  0  7  0  0  9  1  1  0  0 | b = cagliari
  1  0  0  0  0  2  1  0  1  3  0  1  1  0 | c = catanzaro
  3  9  3  4  0  1  1  1  0  1  2  0  7  0 | d = firenze
  0  0  2  2  0  1  2  0  0  0  1  0  1  0 | e = genova
  6  5  4  1  0  3  0  5  3  6  0  8  5  1 | f = lecce
  1  1  1  6  7  0  1  2  3  2  9 26  2  0 | g = milano
  3  2  6  8  0  2  0  6  1  6  2 11  7  0 | h = napoli
  0  0  0  2  0  0  0  3 10  0  1  1  0  0 | i = parma
  2  2  0  0  0  2  0  0  1  2  0  4  0  0 | j = perugia
  4  0  4  0  0  4  2  2  3  1  3 13  1  0 | k = palermo
  1  5  0  0  0  2  3  2  0  3  0  1  4  4 | l = roma
  0  1  2  0  0  0  0  0  0  4  2  0 22  0 | m = torino
  2  0  1  3  0  0  0  1  0  8  5  2  1  0 | n = venezia

Compared to the gender-neutral system, accuracy did not improve; rather, it degraded, as in the case of the female speakers' model above. According to these machine-learning experiments, the gender variable does not seem to affect the performance of our MFCC-based classifier.
4.5 Linguistic areas
Let's try to interpret the confusion matrix of our best classifier (the one built with the k-NN method) when run on the samples of test 2:
 a b c d e f g h i j k l m n   <-- classified as
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 | a = bari
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 | b = cagliari
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 | c = catanzaro
 0 1 0 1 0 0 0 0 0 0 0 0 0 0 | d = firenze
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 | e = genova
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | f = lecce
 0 0 0 0 0 0 1 0 0 0 0 1 0 0 | g = milano
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 | h = napoli
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 | i = parma
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 | j = perugia
 0 0 0 1 1 0 0 0 0 0 0 0 0 0 | k = palermo
 0 0 0 0 0 0 0 0 0 0 0 0 0 2 | l = roma
 0 0 0 0 0 1 0 0 0 0 0 0 0 0 | m = torino
 0 0 0 0 0 0 1 0 0 0 0 0 0 1 | n = venezia

Looking at the human scores, when a tester fails the city choice, it is generally because they selected a city of the same linguistic area. It is quite common to mistake the Torino accent for the Genova one, the Napoli accent for the Bari one, or the Lecce one for the Palermo one, even if there are some cases where that is not true.
Our classifier, instead, does not seem to follow this criterion of linguistic proximity: as we can see through the matrix above and the instance-by-instance predictions below, when it mistakes one city for another, there is often no strong linguistic relationship between them.

inst#   actual        predicted
1       1:bari        12:roma
2       1:bari        1:bari
3       2:cagliari    7:milano
4       3:catanzaro   3:catanzaro
5       4:firenze     4:firenze
6       4:firenze     2:cagliari
7       5:genova      4:firenze
8       6:lecce       8:napoli
9       7:milano      12:roma
10      7:milano      7:milano
11      8:napoli      12:roma
12      11:palermo    5:genova
13      11:palermo    4:firenze
14      9:parma       9:parma
15      10:perugia    14:venezia
16      12:roma       14:venezia
17      12:roma       14:venezia
18      13:torino     6:lecce
19      14:venezia    14:venezia
20      14:venezia    7:milano

Obviously, we cannot draw general conclusions on the basis of a test set of just 20 audio samples. However, visualizing the confusion matrix of our IBk (k-NN) classifier tested on unknown speakers, here too we found only some partial correlation between cities of the same linguistic area: a correlation which is probably not meaningful [see appendix 7].
In view of the above experiments, it seems that perceptual categories like the linguistic areas, which apparently work for humans, do not work for our MFCC-based machine-learning classifier: the latter does not seem to follow a linguistic proximity criterion, although there is no shortage of cases where humans do not respect it either.
We ran an additional experiment to verify whether the concept of linguistic area could have a sort of computational soundness inside our classifier. In brief, we checked the classifier's ability to distinguish 2 cities of the same linguistic area, compared to its ability to distinguish 2 cities of different linguistic areas. Our hypothesis was that the system would achieve better performances in distinguishing cities with different linguistic hallmarks than cities belonging to the same broad linguistic area. Hence, the structure of this experiment was the following:
1) Launch the classifier on a data set composed of 2 cities from the same chosen linguistic area.
2) Launch the classifier on a data set composed of 2 cities, one belonging to the targeted linguistic area and the other to a neighboring linguistic area.
3) Launch the classifier on a data set composed of 2 cities, one belonging to the targeted linguistic area and the other to a non-neighboring linguistic area.
These specific data sets were generated with some Perl scripts we built for this purpose. We show these scripts in [appendix 8], and a usage sketch below.
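As an illustration, the city-pair filter of appendix 8 (with $city1 and $city2 set to the targeted pair) would be invoked as follows; the script and file names here are hypothetical:

## $> perl pair_filter.pl < enriched.arff > lecce_catanzaro.arff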
Down here, the results of the lazy IBk classifier on a test set of unknown speakers from Lecce and Catanzaro, that is, 2 cities belonging to the same linguistic area (the extreme-southern one):

Correctly Classified Instances          72               51.7986 %
Incorrectly Classified Instances        67               48.2014 %
Kappa statistic                          0.1215
Mean absolute error                      0.0699
Root mean squared error                  0.2603
Relative absolute error                 97.2752 %
Root relative squared error            138.7581 %
Total Number of Instances              139

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.792    0.651    0.429      0.792   0.556      0.571     catanzaro
               0.349    0.208    0.732      0.349   0.472      0.571     lecce
Weighted Avg.  0.518    0.377    0.616      0.518   0.504      0.571

After, the results of the lazy IBk classifier on a test set of unknown speakers from Lecce and Bari, that is, 2 cities belonging to two neighboring areas (respectively, the extreme-southern and the southern area):

Correctly Classified Instances          71               53.7879 %
Incorrectly Classified Instances        61               46.2121 %
Kappa statistic                          0.1909
Mean absolute error                      0.067
Root mean squared error                  0.255
Relative absolute error                 91.6848 %
Root relative squared error            133.4554 %
Total Number of Instances              132

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.891    0.651    0.423      0.891   0.573      0.62      bari
               0.349    0.109    0.857      0.349   0.496      0.62      lecce
Weighted Avg.  0.538    0.298    0.706      0.538   0.523      0.62

Last, the results of the lazy IBk classifier on a test set of unknown speakers from Lecce and Venezia, that is, 2 cities belonging to two non-neighboring areas (respectively, the extreme-southern and the Veneto area):

Correctly Classified Instances          90               70.3125 %
Incorrectly Classified Instances        38               29.6875 %
Kappa statistic                          0.4203
Mean absolute error                      0.0438
Root mean squared error                  0.2044
Relative absolute error                 61.1706 %
Root relative squared error            109.0592 %
Total Number of Instances              128

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.83     0.387    0.603      0.83    0.698      0.716     catanzaro
               0.613    0.17     0.836      0.613   0.708      0.716     venezia
Weighted Avg.  0.703    0.26     0.74       0.703   0.704      0.716

In this case, the machine-learning model seems to respect a sort of proximity criterion: it is more able to distinguish distant cities than close ones.
Let's repeat this experiment for other areas. Down here, the results of the lazy IBk classifier on a test set of unknown speakers from Genova and Torino, that is, 2 cities belonging to the same linguistic area (the northern one):
Correctly Classified Instances          69               53.9063 %
Incorrectly Classified Instances        59               46.0938 %
Kappa statistic                          0.0695
Mean absolute error                      0.0667
Root mean squared error                  0.2549
Relative absolute error                 92.4914 %
Root relative squared error            134.6259 %
Total Number of Instances              128

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.238    0.169    0.577      0.238   0.337      0.534     genova
               0.831    0.762    0.529      0.831   0.647      0.534     torino
Weighted Avg.  0.539    0.47     0.553      0.539   0.494      0.534

After, the results of the lazy IBk classifier on a test set of unknown speakers from Torino and Venezia, that is, 2 cities belonging to two neighboring areas (respectively, the northern and the Veneto area), which moreover are not separated by strong isoglosses [cf. chapter 2]:

Correctly Classified Instances         101               72.1429 %
Incorrectly Classified Instances        39               27.8571 %
Kappa statistic                          0.4474
Mean absolute error                      0.041
Root mean squared error                  0.1982
Relative absolute error                 56.4392 %
Root relative squared error            103.7125 %
Total Number of Instances              140

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.8      0.347    0.667      0.8     0.727      0.727     torino
               0.653    0.2      0.79       0.653   0.715      0.727     venezia
Weighted Avg.  0.721    0.268    0.733      0.721   0.721      0.727

Last, the results of the lazy IBk classifier on a test set of unknown speakers from Genova and Palermo, that is, 2 cities belonging to two non-neighboring areas (respectively, the northern and the extreme-southern area):

Correctly Classified Instances          67               58.2609 %
Incorrectly Classified Instances        48               41.7391 %
Kappa statistic                          0.1953
Mean absolute error                      0.0606
Root mean squared error                  0.2426
Relative absolute error                 83.1049 %
Root relative squared error            126.5906 %
Total Number of Instances              115

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.397    0.192    0.714      0.397   0.51       0.602     genova
               0.808    0.603    0.525      0.808   0.636      0.602     palermo
Weighted Avg.  0.583    0.378    0.629      0.583   0.567      0.602

In this case, the machine-learning model does not seem to fully respect the proximity criterion.
Now let's repeat this experiment on the Italian median area. Down here, the results of the lazy IBk classifier on a test set of unknown speakers from Roma and Perugia, that is, 2 cities belonging to the same linguistic area (the median one):
Correctly Classified Instances         104               61.5385 %
Incorrectly Classified Instances        65               38.4615 %
Kappa statistic                          0.2352
Mean absolute error                      0.0559
Root mean squared error                  0.233
Relative absolute error                 77.7626 %
Root relative squared error            123.6709 %
Total Number of Instances              169

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.662    0.424    0.567      0.662   0.611      0.622     perugia
               0.576    0.338    0.671      0.576   0.62       0.622     roma
Weighted Avg.  0.615    0.377    0.623      0.615   0.616      0.622

After, the results of the lazy IBk classifier on a test set of unknown speakers from Firenze and Perugia, followed by Firenze and Roma (2 couples of cities belonging to neighboring areas which, moreover, are not separated by strong isoglosses [cf. chapter 2]):

Lazy firenze perugia
Correctly Classified Instances          73               60.3306 %
Incorrectly Classified Instances        48               39.6694 %
Kappa statistic                          0.2119
Mean absolute error                      0.0576
Root mean squared error                  0.2366
Relative absolute error                 78.9739 %
Root relative squared error            123.8109 %
Total Number of Instances              121

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.659    0.429    0.468      0.659   0.547      0.617     firenze
               0.571    0.341    0.746      0.571   0.647      0.617     perugia
Weighted Avg.  0.603    0.373    0.645      0.603   0.611      0.617
...
lazy firenze roma
Correctly Classified Instances          63               65.625  %
Incorrectly Classified Instances        33               34.375  %
Kappa statistic                          0.2916
Mean absolute error                      0.0501
Root mean squared error                  0.2204
Relative absolute error                 69.6414 %
Root relative squared error            116.9057 %
Total Number of Instances               96

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.477    0.192    0.677      0.477   0.56       0.642     firenze
               0.808    0.523    0.646      0.808   0.718      0.642     roma
Weighted Avg.  0.656    0.371    0.66       0.656   0.646      0.642

Last, the results of the lazy IBk classifier on a test set of unknown speakers from Roma and Milano, followed by Perugia/Lecce and Perugia/Cagliari (3 couples of cities, each belonging to two non-neighboring areas):

lazy roma milano
Correctly Classified Instances          61               39.6104 %
Incorrectly Classified Instances        93               60.3896 %
Kappa statistic                         -0.1589
Mean absolute error                      0.0868
Root mean squared error                  0.2919
Relative absolute error                115.6791 %
Root relative squared error            147.5163 %
Total Number of Instances              154

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.373    0.558    0.567      0.373   0.45       0.407     milano
               0.442    0.627    0.264      0.442   0.331      0.407     roma
Weighted Avg.  0.396    0.581    0.465      0.396   0.41       0.407
...
lazy perugia lecce
Correctly Classified Instances          75               46.0123 %
Incorrectly Classified Instances        88               53.9877 %
Kappa statistic                         -0.0728
Mean absolute error                      0.0779
Root mean squared error                  0.2758
Relative absolute error                107.1565 %
Root relative squared error            144.3262 %
Total Number of Instances              163

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.407    0.481    0.486      0.407   0.443      0.467     lecce
               0.519    0.593    0.44       0.519   0.476      0.467     perugia
Weighted Avg.  0.46     0.534    0.464      0.46    0.459      0.467
...
lazy perugia cagliari
Correctly Classified Instances          85               65.8915 %
Incorrectly Classified Instances        44               34.1085 %
Kappa statistic                          0.3482
Mean absolute error                      0.0497
Root mean squared error                  0.2194
Relative absolute error                 68.192  %
Root relative squared error            114.7137 %
Total Number of Instances              129

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.846    0.468    0.55       0.846   0.667      0.69      cagliari
               0.532    0.154    0.837      0.532   0.651      0.69      perugia
Weighted Avg.  0.659    0.28     0.721      0.659   0.657      0.69

In this last case our hypothesis seems to fail: the machine-learning model did not respect the proximity criterion at all; rather, it appears to distinguish better between cities of the same area.

4.6 Three way classification task


Deploying the same lazy-learning k-NN-based classifier targeting the 7 linguistic areas rather than the cities, we got these results:
Correctly Classified Instances         250               26.0417 %
Incorrectly Classified Instances       710               73.9583 %
Kappa statistic                          0.0935
Mean absolute error                      0.2113
Root mean squared error                  0.4595
Relative absolute error                 90.7376 %
Root relative squared error            134.8303 %
Total Number of Instances              960

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.168    0.146    0.222      0.168   0.191      0.512     area_verde
               0.364    0.265    0.38       0.364   0.372      0.55      area_viola
               0.201    0.147    0.195      0.201   0.198      0.528     area_blu
               0.091    0.073    0.056      0.091   0.07       0.51      area_marrone
               0.222    0.063    0.222      0.222   0.222      0.58      sardegna
               0.387    0.055    0.372      0.387   0.379      0.666     area_gialla
               0.234    0.158    0.198      0.234   0.214      0.539     area_rossa
Weighted Avg.  0.26     0.168    0.268      0.26    0.263      0.547

=== Confusion Matrix ===

   a   b   c   d   e   f   g   <-- classified as
  32  65  25   6  16   8  39 | a = area_verde
  46 108  44  24  13  19  43 | b = area_viola
  18  45  29  11   8   6  27 | c = area_blu
  11  16   4   4   7   1   1 | d = area_marrone
   5   8  31   1  16   3   8 | e = sardegna
  13  13   1   4   3  29  12 | f = area_gialla
  19  29  15  21   9  12  32 | g = area_rossa

Significantly better than our naive-Bayes baseline, which performs as below:

Correctly Classified Instances         137               14.2708 %
Incorrectly Classified Instances       823               85.7292 %
Kappa statistic                          0.0269
Mean absolute error                      0.2449
Root mean squared error                  0.4814
Relative absolute error                105.1587 %
Root relative squared error            141.2592 %
Total Number of Instances              960

A 15 way (the number of cities in CLIPS, including Bergamo) or a 7 way (the number of linguistic areas we were able to represent from the cities) classification seemed hardly affordable for our MFCC12-based model. In fact, the majority of the analyzed literature using spectral methods carried out classifications with far fewer choices: 2, 3, 4 classes at most. In accordance with the literature, we built a new data set (training and test set) merging the areas into three macro-areas. These are quite homogeneous as regards linguistic phenomena [see chapter 2]. The first class was composed of the northern area (Parma, Milano, Torino, Genova), the second was made up of the median area merged with the Tuscany area (Roma, Perugia, Firenze), and the third matched the extreme-southern area (Catanzaro, Lecce, Palermo). To sum up, we composed three macro-varieties (a northern, a median and an extreme-southern one) which are not crossed by deep isoglosses; a relabeling sketch is given below. We could have added Venezia (Veneto area) to the northern variety, but we avoided it since the number of audio samples from northern speakers was already high compared to the other areas. Likewise, we avoided making up a southern variety (possibly composed of Napoli and Bari), since we did not have enough data from this area.
Our first classification, performed with the naive-Bayes method, was meant to be a sort of baseline. The scores obtained are close to chance level:

Correctly Classified Instances         238               35.5755 %
Incorrectly Classified Instances       431               64.4245 %
Kappa statistic                         -0.0258
Mean absolute error                      0.4297
Root mean squared error                  0.6309
Relative absolute error                 97.9081 %
Root relative squared error            134.9023 %
Total Number of Instances              669

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.047    0.059    0.243      0.047   0.079      0.563     area_verde
               0.508    0.637    0.389      0.508   0.441      0.459     area_viola
               0.431    0.34     0.32       0.431   0.367      0.546     area_rossa
Weighted Avg.  0.356    0.392    0.329      0.356   0.318      0.512

=== Confusion Matrix ===

   a   b   c   <-- classified as
   9 154  28 | a = area_verde
   8 151 138 | b = area_viola
  20  83  78 | c = area_rossa

Nevertheless, the best results were achieved not through the IBk algorithm but through SMO (raising the parameter C above its default) and logistic regression. With the default settings, in fact, the SVMs method did not infer the three classes and attributed all the samples to the largest class among the three, namely the northern one.
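A sketch of such a tuned run from the Weka command line follows (the weka.jar location and our file names are assumptions; -C is SMO's complexity constant):

java -cp weka.jar weka.classifiers.functions.SMO -C 2.0 -t train_3areas.arff -T test_3areas.arff

Down here, the results using SMO and Logistic Regression: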

Correctly Classified Instances         289               43.1988 %
Incorrectly Classified Instances       380               56.8012 %
Kappa statistic                          0.1145
Mean absolute error                      0.4132
Root mean squared error                  0.5138
Relative absolute error                 94.1476 %
Root relative squared error            109.8515 %
Total Number of Instances              669

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.283    0.213    0.346      0.283   0.311      0.53      area_verde
               0.512    0.468    0.466      0.512   0.488      0.514     area_viola
               0.459    0.213    0.444      0.459   0.451      0.632     area_rossa
Weighted Avg.  0.432    0.326    0.426      0.432   0.428      0.551

=== Confusion Matrix ===

   a   b   c   <-- classified as
  54 109  28 | a = area_verde
  69 152  76 | b = area_viola
  33  65  83 | c = area_rossa

...

Correctly Classified Instances         287               42.8999 %
Incorrectly Classified Instances       382               57.1001 %
Kappa statistic                          0.115
Mean absolute error                      0.4072
Root mean squared error                  0.4854
Relative absolute error                 92.7673 %
Root relative squared error            103.7763 %
Total Number of Instances              669

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.293    0.247    0.322      0.293   0.307      0.567     area_verde
               0.505    0.43     0.484      0.505   0.494      0.537     area_viola
               0.448    0.213    0.438      0.448   0.443      0.659     area_rossa
Weighted Avg.  0.429    0.319    0.425      0.429   0.427      0.578

=== Confusion Matrix ===

   a   b   c   <-- classified as
  56  99  36 | a = area_verde
  79 150  68 | b = area_viola
  39  61  81 | c = area_rossa

Even if these last classifiers performed somewhat better than chance level, the results are rather disappointing: the system does not seem able to distinguish Italian regional varieties well through the MFCC12 modeling. Furthermore, we found experimentally that adding one or two delta regressions does not significantly improve the results. We probably have to bet on new phonotactic, spectral or prosodic features.

4.7 Removing short samples


The majority of the CLIPS telephonic corpus samples last around 4-5 seconds. Obviously, it is more difficult to guess the speaker's provenance if the sample is short: it conveys less linguistic information. Supposing that, as for humans, it is easier for our classifier to recognize a speaker's provenance on longer recordings, we tried to build some data sets removing the samples lasting less than a certain threshold, namely 6 or 10 seconds [the script to build the data set is in appendix 9; a usage sketch follows below]. Setting up these experiments, we considered two possible scenarios: either the system performs better overall, since it has more spectral information for each sample and is thus better able to classify them; or the system gets worse results, because it has less data on which to train (this, however, depends on the type of algorithm deployed). Let's show the results.
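In practice the filtering is the two-step pipeline of appendix 9; a usage sketch, under our file-layout assumptions:

## step 1: list each sample's duration next to its name (appendix 9, UNIX part)
## $> for i in ./*.wav; do sox "$i" -n stat 2>&1 \
##      | sed -n 's#^Length (seconds):[^0-9]*\([0-9.]*\)$#\1#p'; echo "$i"; done >> ../list
## step 2: comment out (with %) the ARFF rows of samples shorter than the threshold
## $> perl filter_short.pl < dataset.arff > dataset_6s.arff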
Down here the results of the lazy IBk model on a 14 way classification (city varieties),
compared with the result of the same model on the complete data set:

IBk (k-NN method), 14 way classification

                                    6 sec threshold   complete data set
Correctly Classified Instances      12.80%            15.24%
Incorrectly Classified Instances    87.20%            84.76%
Kappa statistic                     0.0621            0.093
Mean absolute error                 0.1246            0.1211
Root mean squared error             0.3522            0.3476
Relative absolute error             93.93%            90.94%
Root relative squared error         136.56%           134.27%
Total Number of Instances           414               853

And then the SMO model, compared with the result of the same model on the complete
data set:

SMO (SVM method), 14 way classification

                                    6 sec threshold   complete data set
Correctly Classified Instances      18.36%            11.96%
Incorrectly Classified Instances    81.64%            88.04%
Kappa statistic                     0.1206            0.0599
Mean absolute error                 0.1299            0.1312
Root mean squared error             0.2554            0.2582
Relative absolute error             97.88%            98.53%
Root relative squared error         99.03%            99.73%
Total Number of Instances           414               853

After, we tested the reduced data set with a 7 way classification (linguistic areas). Down
here the results of the lazy IBk model compared with the result of the same model on the
complete data set:

IBk (k-NN method), 7 way classification

                                    6 sec threshold   complete data set
Correctly Classified Instances      23.67%            26.04%
Incorrectly Classified Instances    76.33%            73.96%
Kappa statistic                     0.0668            0.0935
Mean absolute error                 0.2181            0.2113
Root mean squared error             0.4665            0.4595
Relative absolute error             92.69%            90.74%
Root relative squared error         135.90%           134.83%
Total Number of Instances           414               960

And then the SMO model, compared with the result of the same model on the complete
data set:

SMO (SVM method), 7 way classification

                                    6 sec threshold   complete data set
Correctly Classified Instances      25.12%            24.38%
Incorrectly Classified Instances    74.88%            75.63%
Kappa statistic                     0.0729            0.0236
Mean absolute error                 0.2308            0.2312
Root mean squared error             0.3417            0.3427
Relative absolute error             98.07%            99.27%
Root relative squared error         99.57%            100.56%
Total Number of Instances           414               960

Finally, we tested the reduced data set with a 3 way classification (linguistic macro-areas, cf. par. 4.6). Down here, the results of the lazy IBk model on a 6-second threshold data set and a 10-second threshold data set, compared with the result of the same model on the complete data set:

IBk (k-NN method), 3 way classification

                                    6 sec thresh.   10 sec thresh.   complete data set
Correctly Classified Instances      36.56%          37.70%           39.46%
Incorrectly Classified Instances    63.44%          62.29%           60.54%
Kappa statistic                     0.0268          0.0633           0.0659
Mean absolute error                 0.423           0.4154           0.4036
Root mean squared error             0.6499          0.6434           0.6351
Relative absolute error             95.94%          93.55%           91.96%
Root relative squared error         138.6%          136.5%           135.79%
Total Number of Instances           279             122              669

Hereafter, the SMO model (C parameter: 2.0), followed by the result of the same model on
the complete data set:

SMO (SVM method), 3 way classification

                                    6 sec thresh.   10 sec thresh.   complete data set
Correctly Classified Instances      46.23%          44.26%           43.199%
Incorrectly Classified Instances    53.76%          55.73%           56.801%
Kappa statistic                     0.1812          0.1866           0.1145
Mean absolute error                 0.3903          0.3971           0.4132
Root mean squared error             0.4902          0.4953           0.5138
Relative absolute error             88.53%          89.43%           94.147%
Root relative squared error         104.5%          105%             109.85%
Total Number of Instances           279             122              669

Hereafter, the naive-Bayes model, compared with the result of the same model on the complete data set:

naive-Bayes, 3 way classification

                                    6 sec thresh.   10 sec thresh.   complete data set
Correctly Classified Instances      36.56%          40.16%           35.575%
Incorrectly Classified Instances    63.44%          59.83%           64.424%
Kappa statistic                     0.0538          0.1159           -0.0258
Mean absolute error                 0.4137          0.3923           0.4297
Root mean squared error             0.6126          0.5767           0.6309
Relative absolute error             93.84%          88.36%           97.908%
Root relative squared error         130.6%          122.3%           134.9%
Total Number of Instances           279             122              669

Finally, the logistic regression model, compared with the result of the same model on the
complete data set:

logistic regression, 3 way classification

                                    6 sec thresh.   10 sec thresh.   complete data set
Correctly Classified Instances      46.59%          38.52%           42.9%
Incorrectly Classified Instances    53.4%           61.47%           57.1%
Kappa statistic                     0.1841          0.0835           0.115
Mean absolute error                 0.381           0.4078           0.4072
Root mean squared error             0.4804          0.5607           0.4854
Relative absolute error             86.41%          91.84%           92.767%
Root relative squared error         102.4%          118.9%           103.77%
Total Number of Instances           279             122              669

To sum up, except for the IBk method (a memory-based algorithm), all the other methods tend to slightly improve their performances when they are trained and tested on longer samples. Notably, a well-balanced ratio between training data and selected threshold is obtained using the SMO or Logistic methods with the 6-second-threshold reduced data set. Differently, the naive-Bayes baseline seems to sharply enhance its performance when longer samples are used as the data set.


Chapter V: conclusion

5.1 Resume of experiments


In chapter 4 we built various machine-learning MFCC-based classifiers in order to classify the audio samples of the CLIPS telephonic corpus by the speakers' regional accent. We targeted mainly 3 levels of granularity: city variety (14 way classification), linguistic area variety (7 way classification) and linguistic macro-area variety (3 way classification). We then carried out four broad analyses:
1) Classifier performances were compared with the scores of human Italian testers on two surveys spread online. Built up at an earlier stage, these surveys were composed of about 20 samples each, they covered all the cities in the corpus, and the audio files were chosen from the CLIPS corpus by two different experimenters.
2) Classifier performances were problematized through some tests built up in order to find confounding variables in the data set. We focused mainly on the gender variable and the speaker variable.
3) We carried out a set of experiments focused on the concept of linguistic area and geo-linguistic proximity. Notably, we checked the model's ability to distinguish varieties of the same linguistic area, varieties of two neighboring areas, and varieties of two non-neighboring areas.
4) Finally, we repeated the same machine-learning experiments with a smaller data set stripped of the shortest samples, in order to see whether performances are better on longer recordings.


The main data set used is the result of an acoustic feature extraction run by the openSMILE software. Each sample was described by a 227-dimensional vector of Mel-frequency cepstral coefficient (MFCC) features: namely, 12 MFC coefficients and 19 functionals for each one. This choice of features was made considering that most of the literature proposes MFCCs as the most prominent acoustic feature for recognizing the geo-linguistic accent of a speaker, and for speaker profiling tasks in general [cf. chapter 3].
Next, the data set was stripped of anomalies and outliers through both manual and automatic procedures [see paragraph 4.3.2], and it was enriched with new variables like gender, linguistic area and speaker identifier, in order to be able to perform several data mining operations, and therefore to read through the data.
The classifiers were built using mainly three algorithms: the Sequential Minimal Optimization algorithm (SMO, a relatively recent version of the classic Support Vector Machines method), Bayesian Networks/naive-Bayes, and IBk (the Weka default k-nearest-neighbours lazy algorithm). The outcomes and trends of these methods over the various tasks were quite varied.
We tested the models built using the samples of the online surveys as a test set, at different levels of granularity, in order to compare human behaviours with machine trends. The speaker variable turned out to be a crucial confounding variable, completely compromising the interpretation of the classifiers' predictions. On the other hand, the gender variable did not seem to affect the system. Through new experiments and by discussing the model outputs, we wanted to see whether our MFCC-based classifier, like humans, followed a sort of geo-linguistic proximity criterion to guess the class. The main purpose of this analysis was to appraise whether a correlation exists between Italian cities belonging to the same linguistic area, given some objective borders like isoglosses.
Finally, we carried out a set of machine-learning experiments removing the shortest samples of the CLIPS corpus from our data set. In doing so, we supposed that longer samples are easier to classify, since they convey more spectral information. The various methods responded differently to these experiments.
In the next paragraph we will critically discuss the results obtained. Down here, a flowchart of the global experiment we carried out.


5.2 Discussion
As of 19/01/2016, the testers of the second survey spread online numbered 73, and they achieved on average an accuracy of 34.3%, which we think is a fine result on a 14 way classification task (7% being chance level), taking into account that the speakers' audio samples provided were rather short and difficult to guess. Using these samples as the test set of a k-NN (IBk Weka algorithm) model, the accuracy of the predictions was quite satisfactory (30%); however, the machine-learning experiments on larger test sets got high error rates (around 91% relative error for our best, k-NN-based, model). The test sets were quite well balanced by gender, variety and speaker; in any case, a better method of evaluation is perhaps necessary, such as leave-one-(speaker)-out cross-validation. While the results showed the importance of keeping the speakers of the training and test sets separated, the gender variable does not seem to affect the classifier: the performances of the gender-dependent models remained stable around 14% accuracy on the 14 way classification, not differently from the gender-independent model. On the linguistic area classification task (7 way classification), the k-NN classifier got around 90% relative error, similarly to the 14 way classification, with an accuracy of 26%. A naive-Bayes baseline classifier obtained nearly chance level (14% accuracy).
The set of experiments seeking some correlation between varieties of the same area did not give considerable results: the hypothesis seems to work only for the extreme-southern area, whereas we did not obtain the same interesting outcomes for the northern area, and for the median area the system rather appears to distinguish better between cities of the same area. Equally, the visualization of the confusion matrix of the best classifier (k-NN method) did not present any relevant correlation between city varieties and linguistic areas [appendix 7].
Reducing the number of classes to 3 by merging some linguistic areas and removing others, the best scores are obtained through another algorithm, namely Logistic regression. Nevertheless, the relative error rate is still high (92.7%) and the Kappa statistic quite low (0.115). Again, the naive-Bayes baseline obtained roughly chance level (which in this latter case was 33.3%).
In our last experiment we noticed a general trend: performances improve when the number of short samples is reduced. This means that the longer the recordings are, and the richer in spectral information, the better the system is able to classify them. However, the improvement is slight and it varies according to the method used. Generally, the k-NN model (most of the time our best classifier) does not seem to respond properly to data reduction, probably because it is a memory-based algorithm. On the other hand, some improvement from removing short samples can be noted in the SMO and especially in the Logistic Regression models: the latter reached, on the 3 way classification, a relative error rate of 86.4%, with an accuracy of 46.6%. Even if all the methods seemed to suffer from too drastic a reduction of the data, our naive-Bayes baseline improved its performance significantly when trained with longer samples.

5.3 Propositions for further work


The performances of our MFCC-based classifiers are relatively poor on the 14, 7 and 3 way classification tasks. These models do not seem to follow linguistic proximity criteria to predict the variety of the speaker, whereas they appear to be susceptible to enhancement when the samples are longer (as for humans).
Overall, these low scores were reasonably foreseeable for two primary reasons: 1) all the studies that deal successfully with an accent classification task handle very broad and different varieties (Arabic, Chinese accents... cf. [paragraph 3.4]), whereas our Italian varieties, at any level of granularity, are likely to be more similar to each other, being spread over a much smaller territory and being probably affected by smaller acoustic-phonetic differences.


Furthermore, it is worth recalling that those studies were based on a 2/3/4 way classification task at most, whereas our classification began from a more fine-grained one (14 way). In any case, at the current state of the art, such an automatic/acoustic system cannot be useful for forensic applications, where the precision of the prediction and the granularity obviously have to be higher. 2) Moreover, the majority of the systems using only acoustic strategies are text-dependent, whereas our classifier is trained on a corpus which, besides being telephonic, has a (pseudo) text-independent nature: this kind of classifier has worse performances across all the literature.
Nevertheless, regarding features for modelling, MFCCs did not work in our case as well as the rest of the literature suggests, notably [Hou et al 2010], where the authors achieved a good MFCC baseline without combining it with other features. Being used for many purposes (speech and speaker recognition), it is relatively hard to direct MFCC features to our specific accent recognition task, which is simultaneously linguistic and relative to speaker variability. Even if the results are somewhat above chance level, it is rather difficult to understand whether the classifier follows valuable criteria in the prediction task: for example, the set of machine-learning experiments targeting the linguistic areas shows quite clearly almost no correlation between some city varieties where one actually exists. Some studies suggest combining MFCCs with other spectral features (such as PLP coefficients) or prosodic ones (mainly pitch and energy). Intonation characteristics are not very often considered: this is understandable as long as text-dependent corpora are used to train the classifiers. Our corpus, equally, is not suitable for intonation modelling, because the samples are not completely spontaneous. On the other hand, short-term prosodic features (like pitch, energy and F0) could be fine features for accent recognition tasks, even if perhaps too susceptible to gender and individual characteristics.
Intuitively, formant distribution could be an efficient spectral feature, as demonstrated in [Kulshershtha 2012], enabling the different vocalization characteristics to be captured across the various Italian regional varieties. However, a potential MFCC + formant distribution modelling would likely need to be focused on the vowels of speech, that is to say, the parts where differences among regional varieties are more meaningful and quantifiable through the spectrum. To do this, a previous broad segmentation of the speech into n categories (roughly as in [Muthusamy et Cole 1992] and [Baker et al 2005]) could be a good compromise between effectiveness and affordability.

However, the best solutions are perhaps: 1) to flank the acoustic module with a GMM syllable modelling, namely to implement a phonotactic module as in [Akbacak et al 2012]; likewise, an even more effective approach is the extraction of i-vectors, that is to say, a newly proposed set of features proved to be very effective for this kind of task [Verna et Das 2015]; 2) to create a metric for Italian regional varieties, similar to ACCDIST for British accents [Brown 2014], which is able to distinguish (in text-dependent contexts) among 14 different accents. This last solution is maybe the most interesting one, because of its suitability for forensic application and since it would contribute to the development of Italian dialectometry, even if it is rather time-consuming to implement. With such a system we could easily isolate the parts of speech where cross-accent differences are prominent: vowels, realization of fricatives, voice onset time and so on.
We express a last consideration that goes beyond this accent recognition task. Regional accent is just one factor (though a consistent one) in speaker variability. If we want an ASR system that improves its performance by adapting to the dialectal and idiolectal characteristics of its user, it is not truly necessary to implement a text-independent model. For example, a vocal interface could address to its first user a sort of targeted survey, asking them to pronounce some key words, in order to collect focused information about dialect and idiolect. Or, even less intrusively, the ASR system could trigger some collecting module whenever the user pronounces certain susceptible words, throughout the whole life of the system.


APPENDIX 1
UNIX script to convert the CLIPS .raw samples to .wav en masse using SoX
for i in ./*.raw;do sox -r 8000 --bits 8 --encoding u-law -t raw "$i"
"M_wav/$i.wav";done

The converted file, now suitable for openSMILE extraction, has these properties:

Input File      : 'example.wav'
Channels        : 1
Sample Rate     : 8000
Precision       : 14-bit
Duration        : 00:00:17.68 = 141408 samples ~ 1325.7 CDDA sectors
File Size       : 141k
Bit Rate        : 64.0k
Sample Encoding : 8-bit u-law

APPENDIX 2
Perl script to filter outliers after having extracted the loudness features
use strict; use warnings; use locale;
my $wav;
my $c = 0;
my $threshold = 15; # the threshold changes with regard to gender: 15 for female
                    # speakers, 25 for male speakers
while (my $ligne = <STDIN>) {
    chomp $ligne;
    # data lines look like: isoglossa,cityname,[FM],locutor,'./file.wav',...
    if ($ligne =~ /(\w+_?\w+,\w+,[FM],\d+,'.*'),/g) {
        $wav = $1;
        # count the zeroed loudness frames of this sample
        while ($ligne =~ /0\.000000e\+00/g) {
            $c++;
        }
        # print the sample id if too many frames are zero (likely an outlier)
        if ($c >= $threshold) {
            print "$wav\n";
        }
        $c = 0;
    }
}
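A hypothetical invocation (the script and file names are ours):
## $> perl filter_outliers.pl < loudness_features.csv > outliers_list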


APPENDIX 3
Perl script to enrich arff file with linguistic area, cityname, gender and speaker information
use strict;use warnings;use locale;
my $loc;
while (my $ligne = <STDIN>) {
chomp $ligne;
if ($ligne =~ /\@relation/) {
print $ligne, "\n";
print "\@isoglossa\n";
print "\@cityname\n";
print "\@gender\n";
print "\@locutor\n";
}
elsif ($ligne =~ /\.\/TL(\d\d\d\d)\d/) {
$loc=$1;
print "$ARGV[0],$ARGV[1],$ARGV[2],$loc,$ligne\n";
}
else {
print "$ligne\n";
}
}
## UNIX usage example with arguments:
## $> perl path/to/script.pl sardegna cagliari F < path/to/arff > path/to/new/richer/arff


APPENDIX 4
Functionals of emobase.conf
max         maximum value
min         minimum value
maxPos      the absolute position of the maximum value (in frames)
minPos      the absolute position of the minimum value (in frames)
amean       the arithmetic mean of the contour
linregc1    the slope (m) of a linear approximation of the contour
linregc2    the offset (t) of a linear approximation of the contour
linregerrA  the linear error, computed as the difference between the linear approximation and the actual contour
linregerrQ  the quadratic error, computed as the difference between the linear approximation and the actual contour
stddev      the standard deviation of the values in the contour
skewness    the skewness (3rd order moment)
kurtosis    the kurtosis (4th order moment)
quartile1   the first quartile (25% percentile)
quartile2   the second quartile (50% percentile)
quartile3   the third quartile (75% percentile)
iqr1-2      the inter-quartile range: quartile2-quartile1
iqr2-3      the inter-quartile range: quartile3-quartile2
iqr1-3      the inter-quartile range: quartile3-quartile1


APPENDIX 5
681 MFCCs features to train model, or 227 if we exclude delta regressions
About annotation:
"The suffix _sma appended to the names of the low-level descriptors indicates that they were
smoothed by a moving average filter with window length 3. The suffix _de appended to sma
suffix indicates that the current feature is a 1st order delta coefficient (differential) of the
smoothed low-level descriptor." [Eyben et al 2010]
@attribute cityname
{bari,cagliari,catanzaro,firenze,genova,lecce,milano,napoli,parma,perugia,palerm
o,roma,torino,venezia}
@attribute mfcc_sma[1]_max numeric
@attribute mfcc_sma[1]_min numeric
@attribute mfcc_sma[1]_range numeric
@attribute mfcc_sma[1]_maxPos numeric
@attribute mfcc_sma[1]_minPos numeric
@attribute mfcc_sma[1]_amean numeric
@attribute mfcc_sma[1]_linregc1 numeric
@attribute mfcc_sma[1]_linregc2 numeric
@attribute mfcc_sma[1]_linregerrA numeric
@attribute mfcc_sma[1]_linregerrQ numeric
@attribute mfcc_sma[1]_stddev numeric
@attribute mfcc_sma[1]_skewness numeric
@attribute mfcc_sma[1]_kurtosis numeric
@attribute mfcc_sma[1]_quartile1 numeric
@attribute mfcc_sma[1]_quartile2 numeric
@attribute mfcc_sma[1]_quartile3 numeric
@attribute mfcc_sma[1]_iqr1-2 numeric
@attribute mfcc_sma[1]_iqr2-3 numeric
@attribute mfcc_sma[1]_iqr1-3 numeric
. . .
@attribute mfcc_sma[12]_max numeric
@attribute mfcc_sma[12]_min numeric
. . .
@attribute mfcc_sma[12]_iqr2-3 numeric
@attribute mfcc_sma[12]_iqr1-3 numeric
. . .
@attribute mfcc_sma_de[1]_max numeric
. . .
@attribute mfcc_sma_de[12]_iqr1-3 numeric
@attribute mfcc_sma_de_de[1]_max numeric
. . .
@attribute mfcc_sma_de_de[12]_iqr1-3 numeric


APPENDIX 6
37 loudness features to detect outliers
Extraction run by openSMILE, using the configuration file prosodyViterbiLoudness.conf
Features list:
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute

name string
frameTime numeric
F0final_sma_stddev numeric
F0final_sma_amean numeric
F0final_sma_linregc1 numeric
F0final_sma_centroid numeric
F0final_sma_percentile10.0 numeric
F0final_sma_percentile90.0 numeric
F0final_sma_pctlrange0-1 numeric
F0finalLog_sma_stddev numeric
F0finalLog_sma_amean numeric
F0finalLog_sma_linregc1 numeric
F0finalLog_sma_centroid numeric
F0finalLog_sma_percentile10.0 numeric
F0finalLog_sma_percentile90.0 numeric
F0finalLog_sma_pctlrange0-1 numeric
voicingFinalUnclipped_sma_stddev numeric
voicingFinalUnclipped_sma_amean numeric
voicingFinalUnclipped_sma_linregc1 numeric
voicingFinalUnclipped_sma_centroid numeric
voicingFinalUnclipped_sma_percentile10.0 numeric
voicingFinalUnclipped_sma_percentile90.0 numeric
voicingFinalUnclipped_sma_pctlrange0-1 numeric
HarmonicsToNoiseRatioACFLogdB_sma_stddev numeric
HarmonicsToNoiseRatioACFLogdB_sma_amean numeric
HarmonicsToNoiseRatioACFLogdB_sma_linregc1 numeric
HarmonicsToNoiseRatioACFLogdB_sma_centroid numeric
HarmonicsToNoiseRatioACFLogdB_sma_percentile10.0 numeric
HarmonicsToNoiseRatioACFLogdB_sma_percentile90.0 numeric
HarmonicsToNoiseRatioACFLogdB_sma_pctlrange0-1 numeric
loudness_sma_stddev numeric
loudness_sma_amean numeric
loudness_sma_linregc1 numeric
loudness_sma_centroid numeric
loudness_sma_percentile10.0 numeric
loudness_sma_percentile90.0 numeric
loudness_sma_pctlrange0-1 numeric
class {0,1,2,3}

@data
...


APPENDIX 7
Visualizing the IBk MFCC-based classifier performances through the confusion matrix, using R portable. The tallest column is the correct answer; the other columns show the distribution of the wrong predictions.


APPENDIX 8
Perl scripts generating focused data sets
use strict; use warnings; use locale;
# this script generates a data set of the audio samples from the
# two cities selected below; we used it to set up the
# linguistic-area experiments (par. 4.5)
my $city1 = 'roma';
my $city2 = 'perugia';
while (my $ligne = <STDIN>) {
    chomp $ligne;
    if ($ligne =~ /\@.*$/) {
        print $ligne, "\n";        # keep ARFF header lines
    }
    elsif ($ligne eq "") {
        print $ligne, "\n";
    }
    elsif ($ligne =~ /.[^,]*,$city1|.[^,]*,$city2/) {
        print $ligne, "\n";        # keep the data rows of the two cities
    }
}
. . .
use strict; use warnings; use locale;
# this script generates a data set of the audio samples from the
# selected gender; we used it to set up a confounding-variable
# experiment (par. 4.4.1)
my $gen = 'F';
while (my $ligne = <STDIN>) {
    chomp $ligne;
    if ($ligne =~ /\@.*$/) {
        print $ligne, "\n";        # keep ARFF header lines
    }
    elsif ($ligne eq "") {
        print $ligne, "\n";
    }
    elsif ($ligne =~ /.[^,]*,.[^,]*,$gen/) {
        print $ligne, "\n";        # keep the data rows of the selected gender
    }
}


APPENDIX 9
UNIX script used to extract the duration of all samples
# for i in ./*.wav; do sox "$i" -n stat 2>&1
# | sed -n 's#^Length (seconds):[^0-9]*\([0-9.]*\)$#\1#p'
# ; echo "$i"; done >> ../list

Perl script used to comment out (with %) the ARFF data-set samples lasting less than 6 seconds
use strict; use locale; use warnings;
open (LIE, "<", "path/to/the/list/created/with/unix/command/above");
my %short;
my $c = 0;
# the list alternates a duration line and a file-name line:
# remember the names of the files lasting less than 6 seconds
while (my $ligne = <LIE>) {
    chomp $ligne;
    if ($ligne =~ /^[012345]\.\d/) {
        $c++;
    }
    elsif ($ligne =~ /^\.\// and $c == 1) {
        $short{$ligne} = 1;
        $c = 0;
    }
}
close (LIE);
# comment out (%) the ARFF rows whose sample is in the short list
while (my $ligne = <STDIN>) {
    chomp $ligne;
    if ($ligne =~ /(\w+_?\w+),(\w+),([FM]),(\d+),(\.\/.*av),/) {
        if (defined($short{$5.' '.$2.' '.$3})) {
            print "\%$ligne\n";
        }
        else {
            print $ligne, "\n";
        }
    }
    else {
        print $ligne, "\n";
    }
}


APPENDIX 10
Forensic cases where speaker accent profiling has been useful
"The realisation that the voice on the tape was not the Ripper, was a stunning blow to the
police. The voice had led them off on a wild goose chase for close to 18 months. The credibility
that the police put in the letters and tape had also helped the real Yorkshire Ripper, Peter
Sutcliffe, to escape further police scrutiny during interviews because he was eliminated on
voice and handwriting samples. The police failure to err on the side of caution as to whether
the author of the letters and tape was also the Ripper, also meant that the author, given the
moniker Wearside Jack, would also benefit from that police belief. Even if he was the killer of
Joan Harrison, which is by no means a certainty, he probably would have been able to come up
with alibis for some of the killings, or by where he lived, or by his work, could not have been in
the area or had the opportunity to commit the murders. Since he wasn't the Yorkshire Ripper,
the possibilities for avoiding suspicion based on the murders are almost limitless. As well,
there is the possibility that Wearside Jack was never interviewed by the police, by living
outside the country, or was never suspected, or wasn't reported to the police.
It must also be remembered that even Peter Sutcliffe was able to satisfy the police in his
interviews that he was not the killer, even before the release of the tape. Mainly, his alibis
consisted of "being at home" at the crucial times, which were backed up by his wife. As well,
the questioning was usually about events months previous to the interviews. The only
apparently "iron-clad" alibi he gave was for the night of the return visit to the body of Jean
Jordan, when the Sutcliffes had been having a house-warming party. Of course, Peter Sutcliffe
had returned to Jean Jordan's body after that event.
The analysis of the tape had produced two possible valuable leads to the author of the tape.
The department of Linguistics and Phonetics at Glasgow University found that Wearside Jack
suffered two speech defects, one being a distinctive pronunciation of the letter 's', and the
other being a hidden stammer. It was almost a certainty that he had undergone speech
therapy training. The police looked upon this as a possible breakthrough, and approached
every speech therapist in the North of England, but most refused to help based on the grounds
of medical ethics.
Even the voice experts had been surprised that the author of the tape had not been identified
by his voice characteristics. Jack Windsor Lewis was interviewed by Barbara Frum on the CBC
(Canada) Radio show "As It Happens" on January 12 1981, shortly after Peter Sutcliffe's
confession. In answer to the question about why, as time went on, it became more and more
improbable that the author of the tape was the Yorkshire Ripper, he said: "Based on the
improbability of people failing to find a man with such a distinctive voice. It's a very distinctive
accent, a very distinctive voice quality, and he has certain speech defects, and so on, that, all
the features of the voice put together make him highly identifiable."
Jack Windsor Lewis also stated that people recognise voices fairly easily, and: "I'm sure people
would have come forward immediately and said I recognise this voice. This is the voice of, and
then quoted a name. Now obviously if they are looking for a murderer only they will pass by
someone who is referred to them as having such a voice but couldn't possibly be a murderer."
[http://www.execulink.com/~kbrannen/wearside.htm]

"In 1981, 13-year-old Mary Doe goes missing from her central California home. Her parents
tell her siblings that she has run away and that they are never to speak of her again. More than
20 years later, in 2003, Marys siblings report the case to the police, who immediately suspect
homicide. Marys parents, now living in New York State, are interviewed, and her stepfather
comes close to confessing. A short time later, a woman is stopped by the police in Phoenix,
Arizona, for a trafc violation. She has an Arizona drivers license in the name of Mary Doe and
claims to be the missing girl and to have spent 20-plus years as a runaway in Arizona and
California, living under various assumed names. For a number of reasons, the detectives who
interview this woman, hereafter referred to as the Person of Interest (POI), suspect that she is
an imposter. Not least of these reasons is the POIs strong Southern accent, a seeming
impossibility in someone who spent most of her rst 13 years in the Northeast, the West
Coast, and Hawaii. The POI claims that her accent comes from brief visits to New Orleans and
Georgia, again, a dubious claim at best. This clearly is a case for a forensic linguist, and experts
are consulted to help unmask the imposters real identity by creating a forensic speaker
prole of the POIs regional background." [Schilling et Marsters 2015]
References

Akbacak, M., Vergyri, D., Stolcke, A., Scheffer, N., Mandal, A. (2012) Effective Arabic dialect
classification using diverse phonotactic models. In: Proceedings of Interspeech '12.
Baker, B., Vogt, R., Sridharan, S. (2005) Gaussian Mixture Modelling of Broad Phonetic and
Syllabic Events for Text-Independent Speaker Verification. In: Proceedings of the 9th European
Conference on Speech Communication and Technology (Eurospeech '05 - Interspeech), Lisbon,
Portugal, pp. 2429-2432.
Berruto, Gaetano (2011) Variazione linguistica. Entry in Enciclopedia dell'Italiano, Treccani.
http://www.treccani.it/enciclopedia/variazione-linguistica_(Enciclopedia_dell'Italiano)/
(last visited 03/02/2016)
Berruto, G. (1987) Sociolinguistica dell'italiano contemporaneo. Roma, La Nuova Italia
Scientifica (14th reprint: Roma, Carocci, 2006).
Berruto, G. (1993) Varietà diamesiche, diastratiche e diafasiche. In Sobrero, A. (ed.),
Introduzione all'italiano contemporaneo - La variazione e gli usi. Bari: Laterza.
Boughton, Z. (2006) When perception isn't reality: Accent identification and perceptual
dialectology in French. Journal of French Language Studies 16: 277-304.
Bove, T., Giua, P.E., Forte, A., Rossi, C. (2002) Un metodo statistico per il riconoscimento del
parlatore basato sull'analisi delle formanti. In: Statistica, anno LXII, n. 3.
Brown, Georgina (2014) Y-ACCDIST: An Automatic Accent Recognition System for Forensic
Applications. MA by research thesis, University of York.
Caramazza, A., Yeni-Komshian, G.H. (1974) Voice onset time in two French dialects. Journal of
Phonetics 2, 239-245.
Chauhan, T., Soni, H., Zafar, S. (2013) A Review of Automatic Speaker Recognition System. In:
International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307,
Volume 3, Issue 4, September 2013.
Chen, T., Huang C., Chang, E., Jiang, W., (2001) Automatic accent identification using Gaussian
mixture models. In: Proc. IEEE workshop on Automatic Speech Recognition and
Understanding, pp. 343-346.
Coseriu, Eugenio (1973) Lezioni di linguistica generale, Torino, Boringhieri.
Cunningham, P., Delany, S.J. (2007) k-Nearest neighbour classifiers. Technical Report UCD-CSI-2007-4. Dublin: Artificial Intelligence Group.
D'Addario, C. (2015) Percezione dell'italiano regionale. In: Dialetto: parlato, scritto, trasmesso,
a cura di Gianna Marcato. Padova: CLEUP, 2015.
De Mauro, T. (1963) Storia linguistica dell'Italia unita. Laterza editore.
Eyben, F., Wöllmer, M., Schuller, B. (2010) openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor. In: Proc. ACM Multimedia (MM), pp. 1459-1462. Florence,
Italy.
Farrús, M. (2008) Fusing prosodic and acoustic information for speaker recognition. PhD Thesis,
Universitat Politècnica de Catalunya.
Gooskens, C. (2002) How well can Norwegians identify their dialects? In: Nordic Journal of
Linguistics.
Grassi, C., Sobrero, A.A. & Telmon, T. (1997) Fondamenti di dialettologia italiana. Roma:
Laterza.
Grassi, C., Sobrero, A.A. & Telmon, T. (2003) Introduzione alla dialettologia italiana. Roma:
Laterza.
Hanani, A. (2012) Human and Computer Recognition of regional accents and ethnic
groups from British English Speech. School of Electronic, Electrical and Computer Engineering,
The University of Birmingham.
Heeringa, W. (2004) Measuring Dialect Pronunciation Differences Using Levenshtein Distance.
Ph.D. Thesis. University of Groningen.
Hou, J., Liu, Y., Zheng, T.F., Olsen, J., Tian, J. (2010) Using Cepstral and
Prosodic Features for Chinese Accent Identification. In: The 7th International Symposium on
Chinese Spoken Language Processing (ISCSLP 2010), Tainan, pp. 177-181.
Houtsma, A.J.M. (1995) Pitch perception. In: Hearing. Handbook of Perception and Cognition.
Edited by Brian C.J. Moore. Academic Press, second edition, pp. 267-295 (chapter 8).
Huckvale, M. (2004) ACCDIST: a metric for comparing speakers' accents. In: Proc. International
Conference on Spoken Language Processing, Jeju, Korea, pp. 29-32.
ISTAT report (2012) L'uso della lingua italiana, dei dialetti e di altre lingue in Italia. Source:
www.istat.it
Jessen, M. (2007) Speaker Classification in Forensic Phonetics and Acoustics. In C. Müller (Ed.),
Speaker Classification (1), 180-204. Berlin: Springer.
Jurafsky, D., Martin, J.H. (2000) Speech and language processing. An introduction to natural
language processing, computational linguistics and speech recognition. Pearson Prentice Hall,
second edition.
Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murty, K.R.K. (2001) Improvements to Platt's
SMO algorithm for SVM classifier design. Neural Computation 13: 637-649.
Kersta, L. G., (1962) Voiceprint Identification Infallibility. J. Acoust. Soc. Am. 34.
Kessler, B. (1995) Computational dialectology in Irish Gaelic. In: Proc. Conf. European ACL, 7th,
Dublin, March 27-31, pp. 60-67. San Francisco: Morgan Kaufmann Publishers.
Köster, O., Kehrein, R., Masthoff, K., Boubaker, Y.H. (2012) The tell-tale accent: identification
of regionally marked speech in German telephone conversations by forensic phoneticians.
Journal of Speech, Language and the Law 19.1, 51-71.
Kulshreshtha, M., Mathur, R. (2012) Dialect Accent Feature for Establishing Speaker Identity: A
case study. Springer Briefs in Electrical and Computer Engineering, 2012.
Lippmann, R.P. (1997) Speech recognition by machines and humans. Speech Commun., vol. 22,
pp. 1-15, 1997.
Lorinczi, M. (1999) Storia sociolinguistica della lingua sarda alla luce degli studi di linguistica
sarda. In F. Fernández Rei and A. Santamarina Fernández (eds), Estudios de sociolingüística
románica. Linguas e variedades minorizadas. Universidade de Santiago de Compostela, 1999,
pp. 385-424.
Maiden, M., Parry, M.M. (eds.) (1996) The dialects of Italy. London: Routledge.
Meyer, B. T., Kollmeier, B., (2011) Robustness of spectrotemporal features against intrinsic and
extrinsic variations in automatic speech recognition. Speech Comm., vol. 53, no. 5, pp. 753-767,
2011.
Montemagni, S. (2007) Patterns of phonetic variation in Tuscany: using dialectometric
techniques on multi-level representations of dialectal data. In P. Osenova et al. (eds.),
Proceedings of the Workshop on Computational Phonology at RANLP-2007, pp. 49-60.
Morrison, G.S. (2010) Forensic voice comparison. In Freckelton, I., Selby, H. (eds.), Expert
Evidence. Sydney, Australia: Thomson Reuters
Muthusamy, Y.K., Cole, R.A. (1992) Automatic segmentation and identification of ten languages
using telephone speech. In: Proceedings of the International Conference on Spoken Language
Processing '92, Banff, Alberta, Canada, October 1992.
Paciorkowski, B., Gilbert, M. (2008) Using regional accents to form first impressions of a
speaker. Hanover College.
Platt, J. (1998) Sequential minimal optimization: A fast algorithm for training support vector
machines. In B. Schölkopf, C. Burges, and A. Smola (eds), Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, USA, 1998.
Poorjam, A.H., Bahari, M.H., Van hamme, H. (2014) Multitask speaker profiling for estimating
age, height, weight and smoking habits from spontaneous telephone speech signals. In: 4th
International Conference on Computer and Knowledge Engineering (ICCKE), 2014, pp. 7-12.
Rose, P. (2006) Technical forensic speaker recognition: Evaluation, types and testing of evidence.
Comput. Speech Lang., vol. 20, no. 2-3, pp. 159-191, 2006.
Russell, S.J., Norvig, P., Davis, E. (2010) Artificial Intelligence: A Modern Approach.
3rd ed. Upper Saddle River, NJ: Prentice Hall.
Schilling, Natalie, Marsters, Alexandria (2015) Unmasking identity: speaker profiling for
forensic linguistic purposes. Annual Review of Applied Linguistics 35: 195-214.
Sinha, S., Jain, A., Agrawal, S.S. (2015) Acoustic-phonetic feature based dialect
identification in Hindi speech. International Journal On Smart Sensing and Intelligent Systems
8(1): 237-254.
Sobrero, A., Tempesta, I. (2006) Definizione delle caratteristiche generali del corpus:
informatori, località. In: Progetto CLIPS, Unità di Lecce (10/3/2006).
Souza, P., Gehani, N., Wright, R., and McCloy, D. (2013) The advantage of knowing the talker.
Journal of the American Academy of Audiology, Volume 24, Number 8, pp. 689-700.
Szmrecsanyi, Benedikt (2011) Corpus-based dialectometry: a methodological sketch. Corpora
6(1).
Tosi, O., Oyer, H., Lashbrook, W., Pedrey, C., Nash, W. (1972) Experiment on voice identification. J.
Acoust. Soc. Amer., vol. 51, no. 6, pp. 2030-2043, 1972.
Turell, T. (Forensiclab - Unitat de Variació Lingüística) (updated: January 2013) Forensic
Idiolectometry and Index of Idiolectal Similitude. Institut Universitari de Lingüística Aplicada,
Universitat Pompeu Fabra.
Verma, P., Das, P.K. (2015) i-Vectors in speech processing applications: a survey. In: International
Journal of Speech Technology 18: 529-546. Springer Science+Business Media, New York, 2015.
Vibha, T. (2010) MFCC and its applications in speaker recognition. International Journal on
Emerging Technologies 1(1): 19-22 (2010).
Vignuzzi, U. (2010) Isoglossa. Entry from Enciclopedia dell'Italiano, Treccani.
http://www.treccani.it/enciclopedia/isoglossa_(Enciclopedia_dell'Italiano)/
(last visited 03/02/2016)
Zissman, M.A. (1995) Comparison of four approaches to automatic language identification of
telephone speech. IEEE Trans. Speech and Audio Proc., SAP-4(1): 31-44, January 1996.