
1. Introduction:

1.1 Topic

Language is man's most important means of communication, and speech is its primary medium. A speech signal is a complex combination of airborne pressure waveforms. This complex pattern must be detected by the human auditory system and decoded by the brain, which combines audio and visual cues to perceive speech more effectively. The project aims to emulate this mechanism in human-machine communication systems by exploiting the acoustic and visual properties of human speech.

1.2 Organization

2. Need for the project:


Current speech recognition engines that employ only acoustic features are not 100% robust. Visual cues can be used to resolve ambiguity in the auditory modality. Hence a more flexible and reliable system for speech perception can be designed, which finds a variety of applications in:

 Dictation systems
 Voice-based communication in tele-banking, voice mail, database query systems, information retrieval systems, etc.
 System control in automobiles, robotics, airplanes, etc.
 Security systems for speaker verification

3. Objective:
Recognise 10 English words (speaker independent) with at least 90% accuracy in a noisy
environment.

4. Methodology:

The project is divided into the following parts; illustrative sketches of each stage are given after the list.


 Processing of Audio Signals
o Detection of end points to demarcate word boundaries
o Analysis of various acoustic features such as pitch and formants, energy
and time difference of speech signals, etc.
o Extraction of selected features

 Processing of Video Signals
o Demarcate frames from the video sequence
o Identify faces, and then lip regions
o Extract features from the lip profile

 Recognition of Speech by synchronizing Audio and Visual Data
o Synchronize audio and video features for pattern recognition using standardized algorithms
o Train the system to recognize the spoken word under adverse acoustic
conditions.
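
As a rough illustration of the audio front end, the Python sketch below shows a simple energy-based end-point detector; the function name, frame length, hop size and threshold ratio are illustrative choices of ours, not values fixed by the project.

    import numpy as np

    def detect_endpoints(signal, rate, frame_ms=25, hop_ms=10, threshold_ratio=0.1):
        # Frame and hop sizes in samples (illustrative values, not taken from the project).
        frame = int(rate * frame_ms / 1000)
        hop = int(rate * hop_ms / 1000)
        # Short-time energy of each analysis frame.
        energy = np.array([np.sum(signal[i:i + frame].astype(float) ** 2)
                           for i in range(0, max(len(signal) - frame, 1), hop)])
        threshold = threshold_ratio * energy.max()
        voiced = np.flatnonzero(energy > threshold)
        if voiced.size == 0:
            return 0, len(signal)  # no frame exceeds the threshold: keep the whole signal
        start = voiced[0] * hop
        end = min(voiced[-1] * hop + frame, len(signal))
        return start, end  # sample indices bounding the spoken word

The samples between these end points would then feed the pitch, formant and energy analysis listed above.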
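For the video branch, one possible way to locate the face and then a lip region is OpenCV's pre-trained Haar cascade; the choice of detector and the "lower third of the face" heuristic are assumptions made for illustration, not decisions taken from the project.

    import cv2

    # Pre-trained frontal-face detector shipped with OpenCV (an assumed choice).
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def extract_lip_region(frame):
        # Detect faces on a greyscale copy of the video frame.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None  # no face found in this frame
        x, y, w, h = faces[0]
        # Heuristic: the lips lie roughly in the lower third of the face box.
        return frame[y + 2 * h // 3 : y + h, x : x + w]

Frames can be pulled from the recording with cv2.VideoCapture and passed to this function one at a time; the cropped lip image would then feed whatever lip-profile features the project settles on.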
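For the recognition stage, the schedule below mentions training an HMM. The sketch assumes the hmmlearn package and a simple frame-level concatenation of audio and lip features, with one Gaussian HMM trained per word and the recognised word chosen by highest log-likelihood; the package, the number of states and the concatenation scheme are illustrative assumptions rather than the project's fixed design.

    import numpy as np
    from hmmlearn import hmm  # one possible HMM implementation, not mandated by the project

    def train_word_model(utterances, n_states=5):
        # utterances: list of (frames x features) arrays, one per recording of the word,
        # where each frame joins the audio features with the time-aligned lip features.
        X = np.vstack(utterances)
        lengths = [len(u) for u in utterances]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        return model

    def recognise(models, features):
        # models: dict mapping each word to its trained HMM.
        # The recognised word is the one whose model scores the observed features highest.
        return max(models, key=lambda word: models[word].score(features))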

5. Project Schedule:

 January 2008
o Processing of audio signals
o Feature extraction from the chosen training database
o Pattern recognition and signature extraction from the features
o Training the HMM with the training set
 February 2008
o Processing of video signals
o Feature extraction from the chosen training database
o Pattern recognition and signature extraction from the features
 March 2008
o Synchronize audio and video features for pattern recognition
o Extension of training data set to 10 words
 April 2008
o Upgrading the system for speaker-independent applications
o Performance analysis comparing results of the audio-only approach with those of the joint audio-visual approach
 May 2008
o Documentation

