
1. Introduction:

1.1 Topic

Language is man's most important means of communication, and speech is its primary medium. A speech signal is a complex combination of airborne pressure waveforms. This complex pattern must be detected by the human auditory system and decoded by the brain, which combines audio and visual cues to perceive speech more effectively. The project aims to emulate this mechanism in human-machine communication systems by exploiting the acoustic and visual properties of human speech.

1.2 Organization

2. Need for the project:


Current speech recognition engines that employ only acoustic features are not 100% robust. Visual cues can be used to resolve ambiguity in the auditory modality. Hence a more flexible and reliable system for speech perception can be designed, which finds a variety of applications in:

 Dictation systems
 Voice-based communication in tele-banking, voice mail, database query systems, information retrieval systems, etc.
 System control in automobiles, robotics, airplanes, etc.
 Security systems for speaker verification

3. Objective:
Recognise 10 English words (speaker independent) with at least 90% accuracy in a noisy
environment.

4. Methodology:

The project is divided into the following parts; illustrative sketches of each stage are given after the list.


 Processing of Audio Signals
o Detection of end points to demarcate word boundaries
o Analysis of various acoustic features such as pitch and formants, energy
and time difference of speech signals, etc.
o Extraction of selected features

 Processing of Video Signals
o Demarcate frames from the video sequence
o Identify faces, and then lip regions
o Extract features from the lip profile

 Recognition of Speech by synchronizing Audio and Visual Data
o Synchronize audio and video features for pattern recognition using standardized algorithms
o Train the system to recognize the spoken word under adverse acoustic
conditions.
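
As a rough illustration of the audio front end, the Python sketch below shows a simple energy-based end-point detector; the function name, frame length, hop size and threshold ratio are illustrative choices of ours, not values fixed by the project.

    import numpy as np

    def detect_endpoints(signal, rate, frame_ms=25, hop_ms=10, threshold_ratio=0.1):
        # Frame and hop sizes in samples (illustrative values, not taken from the project).
        frame = int(rate * frame_ms / 1000)
        hop = int(rate * hop_ms / 1000)
        # Short-time energy of each analysis frame.
        energy = np.array([np.sum(signal[i:i + frame].astype(float) ** 2)
                           for i in range(0, max(len(signal) - frame, 1), hop)])
        threshold = threshold_ratio * energy.max()
        voiced = np.flatnonzero(energy > threshold)
        if voiced.size == 0:
            return 0, len(signal)  # no frame exceeds the threshold: keep the whole signal
        start = voiced[0] * hop
        end = min(voiced[-1] * hop + frame, len(signal))
        return start, end  # sample indices bounding the spoken word

The samples between these end points would then feed the pitch, formant and energy analysis listed above.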
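For the video branch, one possible way to locate the face and then a lip region is OpenCV's pre-trained Haar cascade; the choice of detector and the "lower third of the face" heuristic are assumptions made for illustration, not decisions taken from the project.

    import cv2

    # Pre-trained frontal-face detector shipped with OpenCV (an assumed choice).
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def extract_lip_region(frame):
        # Detect faces on a greyscale copy of the video frame.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None  # no face found in this frame
        x, y, w, h = faces[0]
        # Heuristic: the lips lie roughly in the lower third of the face box.
        return frame[y + 2 * h // 3 : y + h, x : x + w]

Frames can be pulled from the recording with cv2.VideoCapture and passed to this function one at a time; the cropped lip image would then feed whatever lip-profile features the project settles on.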
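For the recognition stage, the schedule below mentions training an HMM. The sketch assumes the hmmlearn package and a simple frame-level concatenation of audio and lip features, with one Gaussian HMM trained per word and the recognised word chosen by highest log-likelihood; the package, the number of states and the concatenation scheme are illustrative assumptions rather than the project's fixed design.

    import numpy as np
    from hmmlearn import hmm  # one possible HMM implementation, not mandated by the project

    def train_word_model(utterances, n_states=5):
        # utterances: list of (frames x features) arrays, one per recording of the word,
        # where each frame joins the audio features with the time-aligned lip features.
        X = np.vstack(utterances)
        lengths = [len(u) for u in utterances]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        return model

    def recognise(models, features):
        # models: dict mapping each word to its trained HMM.
        # The recognised word is the one whose model scores the observed features highest.
        return max(models, key=lambda word: models[word].score(features))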

5. Project Schedule:

 January 2008
o Processing of audio signals
o Feature extraction from the chosen training database
o Pattern recognition and signature extraction from the features
o Training the HMM with the training set
 February 2008
o Processing of video signals
o Feature extraction from the chosen training database
o Pattern recognition and signature extraction from the features
 March 2008
o Synchronize audio and video features for pattern recognition
o Extension of training data set to 10 words
 April 2008
o Upgrading the system for speaker-independent applications
o Performance analysis comparing results of the audio-only approach with those of the joint audio-visual approach
 May 2008
o Documentation

