
An Introduction to Speech Annotation

L.R. PREM KUMAR


Senior Research Assistant
Linguistic Data Consortium for Indian Languages Central Institute of Indian Languages, Mysore

Copyright 2008 LDC-IL, CIIL

Overview
What is a Corpus?
Speech Corpus and Types
Why do we need speech corpus?
Use of Speech Corpus
Using Speech Corpus in NLP Application
LDCIL Speech Corpora
How to Annotate a Speech Corpus using Praat?
Guidelines for Annotation
Recording of the data
Storing LDC-IL Data in NIST Format
The NIST Format
Utility of Annotation

SRM University
Copyright 2008 LDC-IL, CIIL

29-Jan-12

What is a corpus?
'Corpus' means 'body' in Latin, and literally refers to the biological structures that constitute humans and other animals (Wikipedia). A corpus is a collection of spoken language stored on a computer and used for language research and writing dictionaries (Macmillan Dictionary 2002). It is a collection of written or spoken texts (Oxford Dictionary 2005). In other words, a corpus is a collection of linguistic data, either compiled as written texts or as a transcription of recorded speech.

Speech Corpus
A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions, in a format that can be used to create an acoustic model (which can then be used with a speech recognition engine). There are two types of speech corpora: read speech and spontaneous speech.
1. Read speech includes:
Book excerpts
Broadcast news
Lists of words
Sequences of numbers
2. Spontaneous speech includes:
Dialogs - between two or more people (includes meetings)
Narratives - a person telling a story
Map-tasks - one person explains a route on a map to another

Why do we need speech corpus?


To develop tools that facilitate the collection of high-quality speech data.
To collect data that can be used for building speech recognition and speech synthesis systems, and to provide speech-to-speech translation from one language to another language spoken in India (including Indian English).


Use of Speech Corpus


Speech recognition and speech synthesis
Speech-to-speech translation for a pair of Indian languages
Health care (medical transcription)
Real-time voice recognition
Multimodal interfaces to the computer in Indian languages
E-mail readers over the telephone
Readers for the visually disadvantaged
Automatic translation, etc.

Using Speech Corpus in NLP Application


Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. Speech synthesis is the artificial production of human speech; a computer system used for this purpose is called a speech synthesizer. It helps towards building text-to-speech applications.

LDCIL Speech Corpora


Speech Dataset Collection


Phonetically balanced vocabulary - 800
Phonetically balanced sentences - 500
Connected text created using phonetically balanced vocabulary - 6
Date format - 2
Command and control words - 250
Proper nouns (400 place and 400 person names) - 824
Most frequent words - 1000
Form and function words - 200
News domain: news, editorial, essay (each text not less than 500 words) - 150

Number of Speakers
Data will be collected from a minimum of 450 speakers (225 male and 225 female) of each language. In addition to this, natural conversation data from various domains shall also be collected for Indian languages, for research into spoken language.


Speech Corpora (Segmented & Warehoused)

S.No  Language    Speakers  Hours
1     Assamese    456       105:51:38
2     Bengali     472       138:18:47
3     Bodo        433       201:10:48
4     Dogri       154       111:32:11
5     Gujarati    450       156:23:04
6     Hindi       450       163:25:47
7     Kannada     492       143:28:54
8     Kashmiri    150       44:59:07
9     Konkani     455       195:14:47
10    Maithili    480       124:19:58
11    Malayalam   314       105:47:05
12    Manipuri    457       107:10:27
13    Marathi     306       168:13:50
14    Nepali      485       145:04:46
15    Oriya       462       165:30:05
16    Punjabi     468       110:48:26
17    Tamil       453       213:37:27
18    Telugu      156       50:51:36
19    Urdu        156       43:33:42

Speech Segmentation
Segmentation of data:
Collected speech data is in a continuous form and hence it has to be segmented as per the various content types, i.e., text, sentences, words.
Segmentation tools:
Wave Surfer is the tool used for segmentation of speech data.

Warehousing:
After segmenting the data according to the various content types, it has to be warehoused properly. The data has to be warehoused for each content type, using the metadata information.



Meta Data
Investigator name
Language
Datasheet script
Age group
Sound files format
District
Mother tongue
Place of elementary education
Recording date
Duration/length of recorded item (hh.mm.ss)
Speaker's ID
Dialect (region)
Speaker's gender
Recording environment
State
Place
Educational qualification


Speech Annotation
Annotation of data:
Data to be used for speech recognition shall be annotated at phoneme, syllable, word and sentence levels.
Data to be used for speech synthesis shall be annotated at phone, phoneme, syllable, word, and phrase levels.
Annotation tools:
Tools will be developed for semiautomatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases.


Speech Segmentation
Speech data is in a continuous form and hence it has to be segmented at sentence level using the Wave Surfer tool.
Open the file in WaveSurfer: select waveform and open the file.
Each sentence should be segmented, but the duration of a segment should be no longer than 30 seconds.
If the sentence is longer than 30 seconds, then the sentence should be segmented at the nearest pause before a full stop.
The selection should then be saved in the required folder.
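The segmentation rule above (cut at the nearest pause so that no segment exceeds 30 seconds) can be sketched as follows. This is an illustrative helper, not part of the WaveSurfer workflow; pause times are assumed to come from a separate silence detector, and the final segment is assumed to fit within the limit.

```python
def split_at_pauses(total_len, pauses, max_len=30.0):
    """Split a recording of total_len seconds into segments no longer
    than max_len, cutting only at the given pause times (sorted, in
    seconds). Returns a list of (start, end) tuples."""
    cuts = []
    start = 0.0
    last_pause = None
    for p in pauses:
        if p - start <= max_len:
            last_pause = p          # still fits; remember the latest usable pause
        else:
            if last_pause is None:  # no pause available within the window
                raise ValueError("no pause within the %.0f s window" % max_len)
            cuts.append(last_pause)  # cut at the nearest earlier pause
            start = last_pause
            last_pause = p if p - start <= max_len else None
    segments = []
    prev = 0.0
    for c in cuts + [total_len]:
        segments.append((prev, c))
        prev = c
    return segments

# e.g. a 70 s recording with pauses at 20, 35, 50 and 65 s
print(split_at_pauses(70.0, [20.0, 35.0, 50.0, 65.0]))
# → [(0.0, 20.0), (20.0, 50.0), (50.0, 70.0)]
```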

How to Annotate a Speech Corpus using Praat?


What is speech annotation?


Annotation is about assigning different tags, like background noise, background speech, vocal noise, echo etc., to the segmented speech files. While annotating the files we should also keep in mind that the text should correspond to the speech. The term linguistic annotation covers any descriptive or analytic notation applied to raw language data. The added notations may include information of various kinds:
multi-tier transcription of speech in terms of units such as acoustic-phonetic features, syllables, words etc.;
syntactic and semantic analysis;
paralinguistic information (stress, speaking rate);
non-linguistic information (speaker's gender, age, voice quality, emotions, dialect, room acoustics, additive noise, channel effects).


Formation of LDCIL Guideline


There are various tools available for speech segmentation and annotation, like CSL, EMU, Transcriber, Praat etc. We are using the Praat software for the annotation of our speech data. Praat is a product of the Phonetic Sciences department of the University of Amsterdam [4] and hence is oriented towards acoustic-phonetic studies by phoneticians. It has multiple functionalities that include speech analysis/synthesis and manipulation, labeling and segmentation, and listening experiments. The guidelines for annotation of our data are adapted from the CSLU, OGI, Mississippi State University, Switchboard, LDC and UPenn guidelines.

Guidelines for Annotation


Open the stereo file in Praat and then create a text file.
Open both the files in Praat and then select the correct text which corresponds to the speech file.
The data should be annotated as per the pronunciation: if a word is pronounced wrongly, then it should be annotated as pronounced.
The following should be marked while annotating the text:


1. Non Speech sounds should be marked


Human noise:
(a) Background speech (.bs)
(b) Vocal noise (.vn)
Non-human noise:
(c) Background noise (.bn)



2. Three different types of silences need to be marked
Annotation of silences: any silence shorter than 50 ms NEED NOT be marked.
(a) short silence (possibly intra-word) (sil1): silences of length around 50-150 ms
(b) medium silence (possibly inter-word) (sil2): silences of length between 150-300 ms
(c) long silence (possibly inter-phrase) (sil3): silences greater than 300 ms
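The three duration bands above can be sketched as a simple lookup. This is an illustrative helper, not an LDC-IL tool; the guideline gives approximate ranges, so the exact behaviour at the 150 ms and 300 ms boundaries is an assumption.

```python
def silence_tag(dur_ms):
    """Map a silence duration in milliseconds to the silence tag
    from the guideline, or None if it is too short to be marked."""
    if dur_ms < 50:
        return None          # shorter than 50 ms: not marked
    if dur_ms <= 150:
        return "sil1"        # short silence, possibly intra-word
    if dur_ms <= 300:
        return "sil2"        # medium silence, possibly inter-word
    return "sil3"            # long silence, possibly inter-phrase

print([silence_tag(d) for d in (30, 90, 200, 450)])
# → [None, 'sil1', 'sil2', 'sil3']
```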


3. Echo needs to be marked


Echo is a sound that is heard after it has been reflected off a surface such as a wall. Annotation for echo: mark .ec at the beginning of the annotation.



4. Multi-speaker data needs to be annotated


A new speaker at the foreground level speaks: <text spoken>
A new annotation to mark this is defined (.sc)


5. Cut off speech and intended speech need to be marked

[mini]*ster means that the speaker intended to speak minister but spoke mini in an unclear fashion and ster clearly. *ster means that the speaker intended to speak minister but spoke only ster.



6. Language Change
Language change, like code mixing and code switching, needs to be marked as follows: [.lc-english <text>]



7. Annotation of speech disfluency


Only restarts/false starts need to be marked. For example, the speaker intends to speak bengaluru but speaks be bengaluru. Then mark this as be-bengaluru.



8. Number
Spell out all number sequences except in cases such as 123 or 101 where the numbers have a specific meaning. Transcribe years like 1983 as spoken: nineteen eighty three. Do not use hyphens (twenty eight, not twenty-eight).
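The year rule can be sketched as below. This is a hypothetical helper, not an LDC-IL tool; it handles the common two-digit-pair pattern (1983 → nineteen eighty three) and would need extra rules for years such as 1905 or 2005.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    """Spell 0-99 without hyphens, per the guideline (twenty eight)."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + (" " + ONES[n % 10] if n % 10 else "")

def year_words(year):
    """Spell a year like 1983 as spoken: 'nineteen eighty three'."""
    hi, lo = divmod(year, 100)
    if lo == 0:
        return two_digits(hi) + " hundred"
    return two_digits(hi) + " " + two_digits(lo)

print(year_words(1983))  # → nineteen eighty three
```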



9. Mispronunciations
If a speaker mispronounces a word and the mispronunciation is not an actual word, transcribe the word as it is spoken.
Utterances should be no longer than 30 seconds, so the annotator should find a long silence (around 500 ms) and split the sentence appropriately.
Keep a separate folder for the noisy data; for the time being it was suggested not to annotate those now. An SNR measuring tool will give the percentage of the data which has to be annotated.


Some more points to be taken into account


Vocal noise followed by a silence: .vn silx (if the silence is more than 50 ms).
Vocal noise, silence, vocal noise and then again a silence or vocal noise of more than 50 ms: .vn silx .vn silx
Please mark the silence in case of background speech too: .bs silx
If background noise is followed by a silence, or if the .bn is more than 50 ms: .bn silx
If there is any background noise in a particular position, then mark that within square brackets: [.bn ..]


Recording of the data


Data should be recorded in stereo format. The wave files should be preserved in four different forms:
Left channel
Right channel
Converted to mono
Original stereo
NIST files, the format in which the files are saved, are created for all the above wave files. For example, if a single stereo file is defined as S1_0001.wav, it will be stored as:
left microphone: S1_0001_left.nist
right microphone: S1_0001_right.nist
converted mono: S1_0001_mono.nist
original: S1_0001_stereo.nist
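The naming pattern above can be sketched as a small helper that derives the four NIST file names from the stereo wave file name. The helper itself is hypothetical, not part of the LDC-IL toolchain.

```python
def nist_names(wav_name):
    """Derive the four NIST file names for a stereo recording,
    following the pattern S1_0001.wav -> S1_0001_left.nist etc."""
    stem = wav_name.rsplit(".", 1)[0]  # drop the .wav extension
    return {form: "%s_%s.nist" % (stem, form)
            for form in ("left", "right", "mono", "stereo")}

print(nist_names("S1_0001.wav"))
# → {'left': 'S1_0001_left.nist', 'right': 'S1_0001_right.nist',
#    'mono': 'S1_0001_mono.nist', 'stereo': 'S1_0001_stereo.nist'}
```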

Storing LDC-IL data in NIST format


LDC-IL has collected read speech consisting of sentences and words. The details are as follows:
1. Databases of 19 Indian languages have been collected.
2. A minimum of 450 speakers were used to collect the database for each language.
3. The environment of the recording is taken into consideration.
4. All recordings are done in stereo.
5. The age group of the speakers is recorded.
6. The sampling rate varies as 44100 Hz or 48000 Hz, at 16 bits.
All the above information must go into the header of the NIST file. Items 4 and 6 are generated automatically by the Praat software. Labels have to be given for items 1, 2 and 5.

The NIST Format


As some data has been labeled, it was decided earlier that all recordings must be converted to the NIST format. Each line in the header is a triplet; this gives various kinds of information about the waveform, namely database name, speaker information, number of channels, environment. E.g.:
NIST_1A
1024
database_id -s13 CIIL_PUN_READ
database_version -s3 1.0
recording_environment -s3 HOM
microphone -s5 INBDR
utterance_id -s8 sent_001
speaker_id -s9 STND_fad004
age -i 25
rec_type -s6 stereo
channel_count -i 1
sample_count -i 601083
sample_n_bytes -i 2
sample_byte_format -s2 01
sample_coding -s3 pcm
sample_rate -i 44100
sample_min -i -32768
sample_max -i 32767
end_head
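A minimal sketch of reading such a header, assuming the layout shown in the example: the NIST_1A magic word, the header size on its own line, then name/type/value triplets up to end_head. This parser is illustrative, not an official NIST/SPHERE reader.

```python
def parse_nist_header(text):
    """Parse a NIST/SPHERE-style header into a dict of field -> value.
    Integer fields (-i) are converted to int; others are kept as strings."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] == "NIST_1A" or parts[0].isdigit():
            continue                      # skip magic word and header size
        if parts[0] == "end_head":
            break
        if len(parts) >= 3:
            name, typ, value = parts[0], parts[1], " ".join(parts[2:])
            fields[name] = int(value) if typ == "-i" else value
    return fields

header = """NIST_1A
1024
database_id -s13 CIIL_PUN_READ
sample_rate -i 44100
end_head"""
h = parse_nist_header(header)
print(h["database_id"], h["sample_rate"])  # → CIIL_PUN_READ 44100
```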


1. Database ID
The following was decided for the database id:
The database id will be a string of length 8 characters. The first 4 letters will correspond to the organization that collects the database. This will be followed by an underscore (_). The language id will consist of three characters.
For example, the Tamil database collected at CIIL will be given the following name: CIIL_TAM
This will be included in the header using the following tag: database_id -s13 CIIL_TAM_READ (tag for read Tamil speech)
Database version: this will be included in the header using the following tag: database_version -s3 1.0
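The naming scheme can be sketched as a small helper. The helper is hypothetical; it computes the -sN string length from the value, which comes to 13 for CIIL_TAM_READ.

```python
def database_id_tag(org, lang, style="READ"):
    """Compose a database_id header line like
    'database_id -s13 CIIL_TAM_READ' from a 4-character organization
    code and a 3-character language id."""
    assert len(org) == 4 and len(lang) == 3
    value = "%s_%s_%s" % (org.upper(), lang.upper(), style.upper())
    return "database_id -s%d %s" % (len(value), value)

print(database_id_tag("CIIL", "TAM"))  # → database_id -s13 CIIL_TAM_READ
```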

2. Recording environment


The recording environment could be one of the following:
1. home (HOM)
2. public places (PUB)
3. office (OFF)
4. telephone (TEL)
E.g.: data recorded in a home should have the following entry in the NIST header: recording_environment -s3 HOM


3. Microphones
For the collection of LDC-IL speech data, we have used the in-built digital recorder microphone (stereo) (INBDR). However, the following other types of microphones can be used:
1. external low quality (LOWQ)
2. external high quality (noise cancelling) (HIGQ)
3. in-built cell phone (INBCP)
4. in-built landline (INBLL)
5. throat microphone (THROT)
6. bone microphone (BONE)
Example: data recorded using a digital recorder with in-built microphone(s) should have the following entry in the NIST header: microphone -s5 INBDR

4. Utterance ID
Each utterance may be identified by type and number: <4 characters for type>_<3 digit utterance number>
For example the entry in the header would be:

utterance_id -s8 <word|phrs|sent|uttr>_<3 digit utterance number>


For example, the entry for the 5th word in the database will be:

utterance_id -s8 word_005
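A hypothetical helper composing this entry; a value like word_005 is 8 characters, matching the -s8 tag in the examples.

```python
def utterance_id_tag(utt_type, n):
    """Compose an utterance_id header line like
    'utterance_id -s8 word_005' from a 4-character type and a number."""
    assert utt_type in ("word", "phrs", "sent", "uttr")
    return "utterance_id -s8 %s_%03d" % (utt_type, n)

print(utterance_id_tag("word", 5))  # → utterance_id -s8 word_005
```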


5. Speaker ID
Each speaker may be identified using 9 characters: <4 characters for the region>_<4 characters to identify the speaker>
The entry for each speaker will be:
speaker_id -s9 <4 characters to identify region>_<m|f><4 character speaker id>
Example: a female speaker from South Karnataka with a speaker id ab0a (4-character alphanumeric, lower case only):
speaker_id -s9 STND_fab0a
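A hypothetical helper composing this entry. Note that the example value STND_fab0a is actually 10 characters, so this sketch computes the -sN length from the value rather than hard-coding 9.

```python
def speaker_id_tag(region, gender, spk):
    """Compose a speaker_id header line like 'speaker_id -s10 STND_fab0a'.
    region: 4 characters; gender: 'm' or 'f'; spk: 4-character lower-case
    alphanumeric speaker id."""
    assert len(region) == 4 and gender in ("m", "f") and len(spk) == 4
    value = "%s_%s%s" % (region.upper(), gender, spk.lower())
    return "speaker_id -s%d %s" % (len(value), value)

print(speaker_id_tag("STND", "f", "ab0a"))  # → speaker_id -s10 STND_fab0a
```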


6. Age group of the speaker


The speaker should be more than 16 years and not more than 60 years of age.


Utility of Annotation
Annotated speech data is the raw material for the development of speech recognition and speech synthesis systems. An acoustic-phonetic study of the speech sounds of a language is essential for determining the parameters of speech synthesis systems following an articulatory or parametric approach.


LDCIL Tamil Team


Academic Faculty
S. Thennarasu, Sr. Lecturer
L.R. Prem Kumar, Sr. Research Assistant
R. Amudha, Junior Research Assistant
R. Prabagaran, Junior Resource Person
Technical Faculty
Mohamed Yoonus, Sr. Lecturer
Vadivel, Lecturer

Speech Annotation Demo


Tamil Academy, SRM University
All the Professors, Teachers, Staff and the Participants & LDCIL Team