Overview
What is a Corpus?
Speech Corpus and Types
Why do we need a speech corpus?
Uses of a Speech Corpus
Using Speech Corpora in NLP Applications
LDC-IL Speech Corpora
How to Annotate a Speech Corpus using Praat
Guidelines for Annotation
Recording of the Data
Storing LDC-IL Data in NIST Format
The NIST Format
Utility of Annotation
SRM University
Copyright 2008 LDC-IL, CIIL
29-Jan-12
What is a corpus?
'Corpus' means 'body' in Latin, and literally refers to the biological structure that constitutes humans and other animals (Wikipedia). A corpus is a collection of spoken language stored on computer and used for language research and writing dictionaries (Macmillan Dictionary 2002). It is a collection of written or spoken texts (Oxford Dictionary 2005). In other words, a corpus is a collection of linguistic data, compiled either as written texts or as a transcription of recorded speech.
Speech Corpus
A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions, in a format that can be used to create an acoustic model (which can then be used with a speech recognition engine). There are two types of speech corpora: read speech and spontaneous speech.

1. Read speech includes:
   - Book excerpts
   - Broadcast news
   - Lists of words
   - Sequences of numbers

2. Spontaneous speech includes:
   - Dialogs: between two or more people (includes meetings)
   - Narratives: a person telling a story
   - Map tasks: one person explains a route on a map to another
Number of Speakers
Data will be collected from a minimum of 450 speakers (225 male and 225 female) of each language. In addition, natural-conversation data from various domains shall also be collected for the Indian languages, for research into spoken language.
Speakers and Recorded Hours per Language
[Language labels were lost in extraction; only row 10, Maithili, is identifiable.]

Speakers  Hours (hh:mm:ss)
456       105:51:38
472       138:18:47
433       201:10:48
154       111:32:11
450       156:23:04
450       163:25:47
492       143:28:54
150       44:59:07
314       105:47:05
457       107:10:27
306       168:13:50
485       145:04:46
462       165:30:05
468       110:48:26
453       213:37:27
156       50:51:36
480       124:19:58
Speech Segmentation
Segmentation of data:
Collected speech data is in continuous form and hence has to be segmented by content type, i.e. texts, sentences, and words.

Segmentation tools:
WaveSurfer is the tool used for segmentation of the speech data.

Warehousing:
After segmenting the data according to the various content types, it has to be warehoused properly. The data has to be warehoused for each content type, using the metadata information.
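The sentence-level export step described above (writing each labeled region of a long recording to its own file) can be sketched with Python's standard wave module. WaveSurfer itself does this interactively; the function name and the (start, end) label format below are assumptions for illustration:

```python
import wave

def split_wav(src_path, boundaries, out_prefix):
    """Split a WAV file into sentence-level clips.

    boundaries: list of (start_sec, end_sec) pairs, e.g. taken from
    WaveSurfer segmentation labels (format assumed for this sketch).
    Writes out_prefix_001.wav, out_prefix_002.wav, ... and returns the names.
    """
    names = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(boundaries, start=1):
            src.setpos(int(start * rate))              # seek in frames
            frames = src.readframes(int((end - start) * rate))
            name = f"{out_prefix}_{i:03d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)                  # same rate/width/channels
                dst.writeframes(frames)                # nframes fixed up on close
            names.append(name)
    return names
```

The clip inherits the source's sample rate and sample width, so the warehoused files stay byte-compatible with the original recording.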
Meta Data
Investigator name
Language
Datasheet script
Age group
Sound file format
District
Mother tongue
Place of elementary education
Recording date
Duration/length of recorded item (hh.mm.ss)
Speaker's ID
Dialect (region)
Speaker's gender
Recording environment
State
Place
Educational qualification
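The metadata fields above map naturally onto a record type. A minimal sketch as a Python dataclass, covering a subset of the listed fields (the field types and the example values are assumptions, not part of the LDC-IL specification):

```python
from dataclasses import dataclass, asdict

@dataclass
class RecordingMetadata:
    """A subset of the LDC-IL metadata fields; types are assumptions."""
    investigator_name: str
    language: str
    age_group: str
    sound_file_format: str            # e.g. "wav"
    district: str
    mother_tongue: str
    recording_date: str               # e.g. "2012-01-29"
    duration: str                     # hh.mm.ss, as on the slide
    speaker_id: str
    dialect_region: str
    speaker_gender: str               # "m" | "f"
    recording_environment: str
    state: str
    place: str
    educational_qualification: str
```

`asdict()` turns a record into a plain dictionary, which is convenient for warehousing each content type alongside its metadata.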
Speech Annotation
Annotation of data:
Data to be used for speech recognition shall be annotated at the phoneme, syllable, word, and sentence levels. Data to be used for speech synthesis shall be annotated at the phone, phoneme, syllable, word, and phrase levels.

Annotation tools:
Tools will be developed for semi-automatic annotation of the speech data. These tools will also be useful for annotating speech synthesis databases.
Speech Segmentation
Speech data is in continuous form and hence has to be segmented at sentence level using the WaveSurfer tool:
- Open the file in WaveSurfer: select the waveform view and open the file.
- Each sentence should be segmented, but the duration of a sentence should be no longer than 30 seconds.
- If a sentence is longer than 30 seconds, it should be split at the nearest pause before a full stop.
- The selection should then be saved in the required folder.
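The 30-second rule above can be expressed as a small helper: given the pause positions inside an over-long segment, pick the latest pause that keeps the first piece within the limit. This is a sketch under stated assumptions (the function name and the representation of pauses as a sorted list of timestamps are mine, not from the guidelines):

```python
def choose_split(start, end, pauses, max_len=30.0):
    """Pick a split time for a segment that exceeds max_len seconds.

    pauses: sorted pause times (seconds) detected inside the segment.
    Returns None if the segment already fits; otherwise returns the
    latest pause that keeps the first piece within max_len.
    """
    if end - start <= max_len:
        return None                     # no split needed
    candidates = [p for p in pauses if start < p <= start + max_len]
    if not candidates:
        raise ValueError("no pause found within the 30-second window")
    return max(candidates)              # nearest pause before the limit
```

For a 45-second segment with pauses at 10 s, 25 s, and 40 s, the split falls at 25 s: the nearest pause that still keeps the first piece under 30 seconds.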
2. Three different types of silences need to be marked. Any silence shorter than 50 ms need NOT be marked.
a) Short silence (possibly intra-word) (sil1): around 50-150 ms
b) Medium silence (possibly inter-word) (sil2): between 150-300 ms
c) Long silence (possibly inter-phrase) (sil3): greater than 300 ms
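The three silence classes form a simple threshold function. A minimal sketch (the guidelines say "around" 50-150 ms and "between" 150-300 ms, so the treatment of the exact boundary values here is an assumption):

```python
def silence_label(duration_ms):
    """Map a silence duration in milliseconds to its annotation label.

    Follows the LDC-IL guideline thresholds; behaviour at exactly
    150 ms and 300 ms is an assumption, since the slide says "around".
    """
    if duration_ms < 50:
        return None          # shorter than 50 ms: not marked at all
    if duration_ms <= 150:
        return "sil1"        # short, possibly intra-word
    if duration_ms <= 300:
        return "sil2"        # medium, possibly inter-word
    return "sil3"            # long, possibly inter-phrase
```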
[mini]*ster means that the speaker intended to say minister but spoke mini in an unclear fashion and ster clearly. *ster means that the speaker intended to say minister but spoke only ster.
6. Language Change
Language changes such as code mixing and code switching need to be marked as follows: [.lc-english <text>]
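Spans marked with this tag can be pulled out of a transcript with a regular expression. A sketch, assuming the tag always has the shape [.lc-<language> <text>] with the text running to the closing bracket:

```python
import re

# Matches [.lc-<language> <text>]; assumes no nested brackets inside the span.
LC_TAG = re.compile(r"\[\.lc-(\w+)\s+([^\]]+)\]")

def language_switches(transcript):
    """Return (language, text) pairs for every code-switch span."""
    return LC_TAG.findall(transcript)
```

This makes it easy to count or strip code-switched material, e.g. when building a monolingual pronunciation lexicon from the transcripts.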
8. Number
Spell out all number sequences except in cases such as 123 or 101 where the numbers have a specific meaning. Transcribe years like 1983 as spoken: nineteen eighty three. Do not use hyphens (twenty eight, not twenty-eight).
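For sequences that are read digit by digit (like 101 read as "one zero one"), the transcription is mechanical. A minimal helper for that one case; full year spelling ("nineteen eighty three") needs a proper number-to-words converter and is not attempted here:

```python
# English digit names; hyphen-free, per the transcription guideline.
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_digits(seq):
    """Spell a digit string as read digit by digit, e.g. '101' -> 'one zero one'."""
    return " ".join(DIGITS[d] for d in seq)
```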
9. Mispronunciations
If a speaker mispronounces a word and the mispronunciation is not an actual word, transcribe the word as it is spoken. Utterances should be no longer than 30 seconds, so the annotator should find a long silence (around 500 ms) and split the sentence appropriately. Keep a separate folder for the noisy data; for the time being it was suggested not to annotate it. An SNR-measuring tool will give the percentage of the data that has to be annotated.
1. Database ID
The following was decided for the database id. The database id will be a string of 8 characters. The first 4 letters will correspond to the organization that collects the database, followed by an underscore; the language id will consist of three characters. For example, the Tamil database collected at CIIL will be given the name CIIL_TAM. This will be included in the header using the following tag: database_id -s13 CIIL_TAM_READ (the tag for the read Tamil speech database). Database version: this will be included in the header using the tag database_version -s3 1.0
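The id scheme and the header tag can be checked programmatically. A sketch (the function names and the validation rules beyond the stated lengths are assumptions; the -s<n> length prefix follows the slide's example):

```python
def make_database_id(org, lang):
    """Build the 8-character database id: 4-char org + '_' + 3-char language."""
    if len(org) != 4 or len(lang) != 3:
        raise ValueError("org must be 4 characters and lang 3 characters")
    return f"{org.upper()}_{lang.upper()}"

def database_id_header(db_id, speech_type="READ"):
    """Render the NIST-style header entry; -s<n> carries the value length."""
    value = f"{db_id}_{speech_type}"
    return f"database_id -s{len(value)} {value}"
```

For CIIL's read Tamil data this reproduces the slide's header line, database_id -s13 CIIL_TAM_READ, with the 13 computed from the value's length rather than hard-coded.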
3. Microphones
For the collection of LDC-IL speech data, we have used an in-built digital recorder microphone (stereo) (INBDR). However, the following other types of microphones can also be used:
1. external low quality (LOWQ)
2. external high quality (noise cancelling) (HIGQ)
3. in-built cell phone (INBCP)
4. in-built landline (INBLL)
5. throat microphone (THROT)
6. bone microphone (BONE)
Example: data recorded using a digital recorder with an in-built microphone should have the following entry in the NIST header: microphone -s5 INBDR
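Validating the microphone code before writing the header catches typos early. A sketch using the codes listed above (the helper name is an assumption):

```python
# Microphone codes from the LDC-IL guideline list.
MIC_CODES = {"INBDR", "LOWQ", "HIGQ", "INBCP", "INBLL", "THROT", "BONE"}

def microphone_header(code):
    """Render the NIST header entry for a known microphone code."""
    if code not in MIC_CODES:
        raise ValueError(f"unknown microphone code: {code}")
    return f"microphone -s{len(code)} {code}"
```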
4. Utterance ID
Each utterance may be identified by type and number <4 characters for type>_<3 digit utterance number>
For example the entry in the header would be:
5. Speaker ID
Each speaker may be identified using 9 characters: <4 characters for the region>_<4 characters to identify the speaker>. The entry for each speaker will be: speaker_id -s9 <4 characters to identify region>_<m|f><4-character speaker id>. For example, a female speaker from South Karnataka with speaker id ab0a (4-character alphanumeric, lower case only): speaker_id -s9 STND_fab0a
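The speaker-id shape can be enforced with a pattern check. A sketch (the function name and the exact character classes are assumptions; note that the slide writes -s9 even though the example value STND_fab0a has 10 characters once the gender letter is included, and this sketch reproduces the slide's convention as-is):

```python
import re

# 4 upper-case region chars, underscore, gender letter, 4 lower-case alphanumerics.
SPEAKER_RE = re.compile(r"^[A-Z]{4}_[mf][a-z0-9]{4}$")

def speaker_header(region, gender, sid):
    """Render the header entry for one speaker.

    region: 4 characters; gender: 'm' or 'f'; sid: 4 lower-case alphanumerics.
    Keeps the slide's literal -s9 prefix, even though the value is 10 chars.
    """
    value = f"{region.upper()}_{gender}{sid}"
    if not SPEAKER_RE.match(value):
        raise ValueError(f"malformed speaker id: {value}")
    return f"speaker_id -s9 {value}"
```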
Utility of Annotation
Annotated speech data is the raw material for the development of speech recognition and speech synthesis systems. An acoustic-phonetic study of the speech sounds of a language is essential for determining the parameters of speech synthesis systems following an articulatory or parametric approach.
Tamil Academy, SRM University; all the Professors, Teachers, Staff, and the Participants; and the LDC-IL Team.