Motivation
Physiological experiments in different mammal species: a large percentage of neurons in the primary auditory cortex (A1) respond differently to upward- versus downward-moving ripples in the spectrogram of the input (Depireux et al., 2001).
Spectro-temporal receptive fields (STRFs): individual neurons are sensitive to specific spectro-temporal modulation frequencies in the incoming sound signal.
Introduction
Cortically inspired time-frequency (TF) features capture spectral and temporal modulations that are relevant to speech recognition and discrimination. Spectro-temporal features are derived by filtering a spectrogram with a particular set of filters. In this case, Gabor filters are applied to the auditory spectrogram.
Gabor Filters
1D Gabor: a sinusoid multiplied by a Gaussian envelope.
2D Gabor: a complex sinusoid s(n, k) multiplied by a Gaussian envelope.
[Figures: examples of 1D and 2D Gabor filters, with the filter parameters and dummy indices labeled.]
Tons of combinations!
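A minimal sketch of a 2D Gabor filter built as the product of a complex sinusoid s(n, k) and a Gaussian envelope, with n taken as the time-frame index and k as the frequency-channel index. The function name, the parameter names (omega_n, omega_k, sigma_n, sigma_k), and the specific Gaussian form are illustrative assumptions, not the exact parameterization used in the slides.

```python
import cmath
import math


def gabor_2d(size_n, size_k, omega_n, omega_k, sigma_n, sigma_k):
    """Return a size_n x size_k complex 2D Gabor filter as nested lists.

    n is the time-frame index and k the frequency-channel index, both
    measured from the center of the filter. All names are hypothetical.
    """
    n0 = (size_n - 1) / 2.0
    k0 = (size_k - 1) / 2.0
    filt = []
    for n in range(size_n):
        row = []
        for k in range(size_k):
            dn, dk = n - n0, k - k0
            # Complex sinusoid s(n, k): sets the temporal and spectral
            # modulation frequencies the filter responds to.
            s = cmath.exp(1j * (omega_n * dn + omega_k * dk))
            # Gaussian envelope: localizes the filter in time and frequency.
            h = math.exp(-0.5 * ((dn / sigma_n) ** 2 + (dk / sigma_k) ** 2))
            row.append(s * h)
        filt.append(row)
    return filt


# One filter out of many possible parameter combinations.
g = gabor_2d(size_n=9, size_k=9, omega_n=math.pi / 4, omega_k=math.pi / 4,
             sigma_n=2.0, sigma_k=2.0)
```

Varying the modulation frequencies, envelope widths, and filter sizes quickly produces a large filter bank, which is why the number of combinations grows so fast.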
System
[Figure: system block diagram with blocks labeled Stream, Stream, PCA, MFCC, and Output.]
MLP (Multilayer Perceptron)
The structure of the MLP depends on the type of feature and corpus.

Feature type   Number of input units   Frames of context
Spectral       567                     9
Cepstral       351                     9

[Figure: MLP diagram with hidden units and output units labeled; system diagram labels: PCA, 32D, 45D, MFCC, Output.]
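The input sizes listed for the MLP are consistent with stacking 9 frames of context: 567 = 9 × 63 and 351 = 9 × 39. A minimal sketch of such stacking follows. The per-frame dimensions 63 and 39 are inferred from that arithmetic, and the edge handling (repeating the first or last frame) and the function name are assumptions.

```python
def stack_context(frames, context=9):
    """Concatenate each frame with its neighbors into one MLP input vector.

    frames: list of per-frame feature vectors (lists of floats).
    context: odd number of frames in the window, centered on the current frame.
    Edges are padded by repeating the first or last frame (an assumption).
    """
    half = context // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Hypothetical utterances of 20 frames each.
spectral = stack_context([[0.0] * 63 for _ in range(20)])
cepstral = stack_context([[0.0] * 39 for _ in range(20)])
print(len(spectral[0]), len(cepstral[0]))  # 567 351
```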
The output of each MLP stream provides an estimate of the posterior probability distribution over phones. These phone probability estimates are then combined across streams by inverse-entropy weighting.
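A minimal sketch of inverse-entropy combination for a single frame, assuming a weighted linear sum in which each stream's weight is proportional to the inverse of the entropy of its posterior. The function names and the small eps guard are assumptions; the slides do not say which variant of the method was used.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of one posterior distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def inverse_entropy_combine(stream_posteriors, eps=1e-12):
    """Combine per-stream phone posteriors for a single frame.

    stream_posteriors: one probability vector per stream, all the same length.
    Streams with low entropy (confident streams) receive larger weights.
    """
    inv = [1.0 / (entropy(p) + eps) for p in stream_posteriors]
    total = sum(inv)
    weights = [v / total for v in inv]
    n_phones = len(stream_posteriors[0])
    combined = [
        sum(w * p[j] for w, p in zip(weights, stream_posteriors))
        for j in range(n_phones)
    ]
    return combined, weights


# Two hypothetical streams over three phones.
confident = [0.9, 0.05, 0.05]
uncertain = [0.4, 0.3, 0.3]
combined, weights = inverse_entropy_combine([confident, uncertain])
```

Because the weights sum to one, the combined vector is still a valid probability distribution, and the more confident stream dominates it.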
Then the KL transform (principal components analysis, PCA) is applied to the log probabilities of the merged MLPs. The result is reduced to 32 dimensions and orthogonalized. The features are mean- and variance-normalized per utterance and finally appended to the MFCC features.
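The post-processing steps can be sketched with NumPy as follows. This is an illustration under stated assumptions: the number of phones (56) and the random data are hypothetical, the MFCC vector is taken to be 39-dimensional, and the transform is estimated on the same utterance it is applied to for brevity, whereas a real system would more likely estimate it once on training data.

```python
import numpy as np


def klt_features(log_post, mfcc, n_components=32):
    """Sketch of the post-processing for one utterance.

    log_post: (frames, phones) log posteriors of the merged MLPs.
    mfcc: (frames, 39) MFCC features.
    Returns an array of shape (frames, n_components + 39).
    """
    # KL transform / PCA: project onto the leading covariance eigenvectors,
    # which reduces the dimension and decorrelates the features.
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    projected = centered @ eigvecs[:, order]

    # Per-utterance mean and variance normalization.
    projected = (projected - projected.mean(axis=0)) / (projected.std(axis=0) + 1e-8)

    # Append the normalized features to the MFCC features.
    return np.hstack([projected, mfcc])


# Hypothetical utterance: 200 frames, 56 phones.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 56))
log_post = x - np.log(np.exp(x).sum(axis=1, keepdims=True))  # log-softmax
mfcc = rng.standard_normal((200, 39))
features = klt_features(log_post, mfcc)
print(features.shape)  # (200, 71)
```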
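Features to HMM: the 32D PCA output is appended to the 39D MFCC features, giving a 71D feature vector for the HMM.
[Figure: complete system diagram with blocks labeled Stream, Stream, PCA, MFCC, Output, Features, and HMM.]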
Experiments
Database
Aurora 2 (0 to 20 dB)
Numbers95: consists of various numeric portions extracted from telephone dialogues.
  Vocabulary size of 32 words.
  The training set contains 3590 utterances of clean data, totaling roughly 3 hours.
  2 test sets of 1227 utterances. The first contains only clean data. The second contains the same utterances with noise added at five SNRs (20 dB, 15 dB, 10 dB, 5 dB, and 0 dB).
Additive noise
Systems compared: baseline (39 MFCC), 4-stream system, 28-stream system
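As an illustration of what adding noise at a given SNR means, the sketch below scales a noise signal so that the ratio of clean power to noise power matches a target value in dB. This is the generic definition only. How the noisy test sets were actually produced is not described here, and the signals, function name, and sample rate are hypothetical.

```python
import math
import random


def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db, then add it."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Solve p_clean / (gain**2 * p_noise) = 10 ** (snr_db / 10) for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]


# Hypothetical signals: a 440 Hz tone at 8 kHz and Gaussian noise.
rand = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 8000.0) for t in range(8000)]
noise = [rand.gauss(0.0, 1.0) for _ in range(8000)]

# One noisy version per SNR condition.
noisy = {snr: add_noise_at_snr(clean, noise, snr) for snr in (20, 15, 10, 5, 0)}
```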
Results
[Figures: results on Aurora 2 and Numbers 95, with annotations Discussion 1, Discussion 2, and Discussion 3.]
Future Work
Not just additive noise
Another TF feature might not work
Log-mel filterbank? Or power, as in PNCC?
How to combine MLPs? Inverse entropy?
[Figure: system diagram with blocks labeled Stream, Stream, PCA (32D), MFCC (39D), and Output (71D).]