ABSTRACT

In this paper, we present a novel method for environmental sound classification in non-stationary noise environments. The proposed method consists of three stages: noise source separation, acoustic feature extraction, and multi-class classification. At the first stage, we employ probabilistic latent component analysis (PLCA) to perform time-varying noise separation. To alleviate the artifacts introduced by source separation, a series of spectral weightings is applied to enhance the reliability of the audio spectra. At the feature extraction stage, we extract an acoustic subspace that effectively characterizes the temporal-spectral patterns of the denoised sound spectrogram. Subsequently, regularized kernel Fisher discriminant analysis (KFDA) is adopted to conduct multi-class sound classification by exploiting class-conditional distributions over the extracted acoustic subspaces (features). The proposed method is evaluated on the Real World Computing Partnership (RWCP) sound scene database, and experimental results demonstrate its superior performance compared to other methods.

Index Terms: Sound recognition, PLCA, source separation, eigen-decomposition, discriminant analysis

1. INTRODUCTION

Recently, the research area of machine audition has gained considerable attention [1]; its goal is to recognize real-life acoustic events using effective audio signal processing techniques. Various potential applications have promoted these studies, such as intelligent environment context recognition in robotics [2] and audio-based surveillance [3]. Initial works on acoustic event classification mostly followed speech recognition approaches, e.g., employing Mel-Frequency Cepstral Coefficients (MFCCs) to describe sound patterns and performing classification with HMMs [4]. However, several critical defects limit the applicability of speech recognition techniques to sound event recognition: MFCCs are not robust to noise, while Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) are insufficient to fully characterize the temporal-spectral dynamics of non-speech sounds. A thorough comparison of conventional audio processing techniques for sound event recognition can be found in [5], which extensively investigated acoustic features based on spectrograms of the Short-Time Fourier Transform (STFT), the Continuous/Discrete Wavelet Transform (CWT/DWT), and MFCCs, together with conventional classifiers such as Artificial Neural Networks (ANN) and Learning Vector Quantization (LVQ).

Advanced classification schemes have also been examined for audio classification lately, such as using an SVM [6] to exploit non-linear time-frequency distributions of acoustic events. Moreover, motivated by biological evidence that local time-frequency information contributes greatly to human audition, significant progress has been made on audio representations. Various novel acoustic features have been proposed, such as the spikegram [7], a neural-spike-like representation of sound events; likewise, another sparse audio feature, based on matching pursuit with Gabor dictionaries, is presented in [2]. Building on sparse audio representations, Dennis et al. [8] further developed a spiking neural network (SNN) for noise-robust sound classification and achieved superior results compared to conventional methods.

The psychology of hearing suggests that human beings recognize noise-corrupted acoustic events effortlessly, since we can concentrate on a sound of interest and isolate it from background noise. This study endeavors to realize a similarly human-like sound recognition scheme using advanced signal processing techniques. A three-step framework is proposed, comprising noise reduction, robust acoustic feature extraction, and multi-class classification. The noise reduction stage is designed to imitate the concentration function of human hearing and thereby lays the basis for noise-robust audio classification. To tackle the single-channel noise source separation problem, we employ the Probabilistic Latent Component Analysis (PLCA) algorithm [9], which learns dictionaries and their activation weights representing the non-stationary background noise and the sound event separately, so that noise source separation can be performed effectively; however, it simultaneously introduces artifacts (distortions) into the denoised sound, especially in extremely noisy environments. To alleviate such artifact interference, we develop a series of spectral weightings, based on the characteristics of the PLCA model, to enhance the reliability of the sound spectra. After noise reduction, we
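The PLCA-based separation stage can be sketched as follows. This is a minimal single-channel illustration under our own naming and simplifications, not the implementation of [9]: the magnitude spectrogram is modeled as P(f,t) ≈ Σ_z P(f|z) P(z,t) and fitted with EM updates, an optional pre-trained noise dictionary is held fixed so that the remaining components capture the sound event, and the event is recovered with a Wiener-style mask (our assumption for the reconstruction step).

```python
import numpy as np

def plca(V, n_event=20, W_noise=None, n_iter=100, seed=0):
    """Factorize a magnitude spectrogram V (freq x time) as
    P(f,t) ~= sum_z P(f|z) P(z,t).  Columns of W_noise, if given,
    are fixed noise bases; the remaining bases model the event."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    P = V / V.sum()                                # joint distribution P(f,t)
    n_noise = 0 if W_noise is None else W_noise.shape[1]
    Z = n_noise + n_event
    W = rng.random((F, Z)); W /= W.sum(axis=0)     # P(f|z), columns sum to 1
    if n_noise:
        W[:, :n_noise] = W_noise                   # keep noise dictionary fixed
    H = rng.random((Z, T)); H /= H.sum()           # P(z,t)
    for _ in range(n_iter):
        R = W @ H + 1e-12                          # model estimate of P(f,t)
        Q = P / R                                  # E-step ratio P/P_model
        H *= W.T @ Q                               # M-step: update activations
        Wnew = W * (Q @ H.T)
        Wnew /= Wnew.sum(axis=0) + 1e-12
        W[:, n_noise:] = Wnew[:, n_noise:]         # only event bases adapt
    # Wiener-style reconstruction of the event part of the spectrogram
    event = V * (W[:, n_noise:] @ H[n_noise:]) / (W @ H + 1e-12)
    return W, H, event
```

In practice the noise dictionary would be learned in advance from noise-only frames with the same update rules, then passed in as `W_noise`.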
where X is the input noisy spectrogram, \hat{X} is the spectrogram reconstructed by the PLCA model, and P(f|e) represents the spectral error probability distribution. Based on this, a spectral reliability measure can be derived as:

    \omega_f = 1 - p(f|e) / \max_f p(f|e)    (5)

so that bands with lower error probability are assigned higher weights, while bands with high error probability receive smaller weights that suppress their significance. A series of spectral reliability weightings for babble noise is shown in Fig. 2. The weighting is applied to the audio spectrogram by S_t = X_t \odot \omega, where X_t denotes the t-th frame of the denoised spectrum and S_t the enhanced spectral feature after weighting by \omega.

2.2. Robust feature extraction

Based on the aforementioned process, we further develop a robust acoustic representation using a triangle filter bank and eigen-decomposition. The Mel-filter bank is well developed for characterizing speech signals. We follow the idea of the Mel-filter bank but design a uniformly positioned triangle-filter bank to describe sound events, which brings three main advantages. First, the triangle filters produce a more robust audio representation by introducing a fuzzy assignment over the audio spectra. Second, in contrast to the Mel-filter bank, which emphasizes the low bands where speech content mainly lies, the uniformly positioned filters provide identical resolution across all frequency bands and thus capture richer temporal-spectral dynamics of sound events. Third, the filter bank effectively reduces the feature dimension and therefore facilitates sound classification. The filtering process is expressed as:

    s_n = \sum_f \lambda_n(f) s(f)    (6)

where s(f) denotes the denoised audio spectrum in band f and \lambda_n(f) the spectral weighting coefficients of the n-th filter. The filtered spectrum is then written as S_t = [s_1, ..., s_N]^T.

Based on the filtered audio spectrogram [S_1, ..., S_T], S_t \in R^{N \times 1}, we perform an eigen-decomposition to further characterize significant temporal-spectral patterns of the sound event. The eigen-decomposition can be written as:

    C u_k = \lambda_k u_k,  C = (1/T) \sum_{t=1}^{T} S_t S_t^T    (7)

and the contribution ratio of the k-th eigenvector is

    r_k = \lambda_k / \sum_{i=1}^{N} \lambda_i    (8)

which shows its significance in representing the audio. We select the first K principal eigenvectors with the highest contribution ratios, U_K = [u_1 ... u_K], K < N, to form a sound representation. In addition, the eigenvectors in U_K are normalized by their contribution ratios by computing \hat{u}_k = (\lambda_k / \sum_{i=1}^{K} \lambda_i) u_k, k = 1...K, in a manner similar to [14]. The contribution weightings give prominence to the principal eigenvectors describing the sound event. Finally, we concatenate the K normalized principal eigenvectors to build the acoustic feature vector.

2.3. Sound classification by regularized kernel Fisher discriminant analysis

We employ KFDA to perform sound classification. Let X_i = {x_{i1}, ..., x_{iN_i}} be the audio features from class i, and let \phi(x) be a nonlinear mapping of an input vector x into a kernel feature space F. KFDA seeks a direction w \in F maximizing the class separability, defined as:

    J(w) = w^T S_B w / w^T (S_W + \epsilon I) w,
    S_B = (m_1 - m_2)(m_1 - m_2)^T,
    S_W = \sum_i \sum_{x \in X_i} (\phi(x) - m_i)(\phi(x) - m_i)^T,
    m_i = (1/N_i) \sum_{n=1}^{N_i} \phi(x_{in})    (9)

where m_i denotes the class center, and S_B and S_W are the between-class and within-class scatter matrices in the kernel feature space. Since the within-class scatter may be singular, a regularization term \epsilon I is added. Equation (9) is a typical Rayleigh quotient, and its solution can be found in [15].
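The feature-extraction chain above, spectral reliability weighting (Eq. (5)), uniformly positioned triangle filtering (Eq. (6)), and eigen-decomposition with contribution-ratio normalization, can be sketched as follows. This is an illustrative outline under our own assumptions: the function names, the exact filter construction, and the use of the frame covariance for the eigen-decomposition are ours, not the authors' code.

```python
import numpy as np

def reliability_weights(p_err):
    """Eq. (5): w_f = 1 - p(f|e)/max p(f|e), one weight per band."""
    return 1.0 - p_err / p_err.max()

def triangle_bank(n_bins, n_filters):
    """Uniformly positioned triangular filters over all bins
    (unlike the Mel bank, centres are equally spaced)."""
    centres = np.linspace(0, n_bins - 1, n_filters + 2)
    bank = np.zeros((n_filters, n_bins))
    f = np.arange(n_bins)
    for n in range(n_filters):
        left, centre, right = centres[n], centres[n + 1], centres[n + 2]
        up = (f - left) / (centre - left)          # rising slope
        down = (right - f) / (right - centre)      # falling slope
        bank[n] = np.clip(np.minimum(up, down), 0, None)
    return bank

def eigen_features(S, K):
    """Eigen-decompose the spectrogram covariance, keep the K principal
    eigenvectors, and scale each by its contribution ratio."""
    C = S @ S.T / S.shape[1]                       # frame covariance
    vals, vecs = np.linalg.eigh(C)                 # eigh returns ascending
    vals, vecs = vals[::-1], vecs[:, ::-1]         # sort descending
    ratio = vals[:K] / vals[:K].sum()              # contribution weights
    return (vecs[:, :K] * ratio).T.reshape(-1)     # concatenated feature

# pipeline: weight the denoised spectrogram, filter, then extract features
X = np.abs(np.random.default_rng(0).random((64, 100)))   # toy spectrogram
w = reliability_weights(np.random.default_rng(1).random(64))
S = triangle_bank(64, 16) @ (X * w[:, None])
feat = eigen_features(S, K=5)
```

With 16 filters and K = 5, the feature vector concatenates five 16-dimensional weighted eigenvectors into an 80-dimensional representation of the whole spectrogram.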
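As an illustration of the regularized Rayleigh quotient in Eq. (9), the following sketches a standard dual-form solution of two-class kernel Fisher discriminant analysis; the RBF kernel choice, parameter values, and function names are our own assumptions rather than the paper's setup.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between row-vector sets A and B."""
    d = ((A[:, None] - B[None]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def kfda_train(X1, X2, gamma=0.5, eps=1e-3):
    """Two-class regularized KFDA: maximize
    w^T S_B w / w^T (S_W + eps*I) w in kernel space, solved in the dual."""
    X = np.vstack([X1, X2])
    n1 = len(X1)
    K = rbf(X, X, gamma)
    M1 = K[:, :n1].mean(1)                         # dual image of class-1 mean
    M2 = K[:, n1:].mean(1)                         # dual image of class-2 mean
    # within-class scatter in dual form: N = sum_i K_i (I - 1/n_i) K_i^T
    N = np.zeros((len(X), len(X)))
    for Ki in (K[:, :n1], K[:, n1:]):
        ni = Ki.shape[1]
        N += Ki @ (np.eye(ni) - np.ones((ni, ni)) / ni) @ Ki.T
    alpha = np.linalg.solve(N + eps * np.eye(len(X)), M1 - M2)
    b = -0.5 * alpha @ (M1 + M2)                   # threshold midway between means
    return X, alpha, b, gamma

def kfda_score(model, Xt):
    """Project test vectors on the discriminant; > 0 means class 1."""
    X, alpha, b, gamma = model
    return rbf(Xt, X, gamma) @ alpha + b
```

The regularizer eps plays the role of the \epsilon I term in Eq. (9), making the within-class scatter invertible; multi-class classification can then be handled by combining such binary discriminants (e.g., one-vs-rest).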
    Method    Proposed   Dennis [8]   YE [11]   MF-HMM [8]   MFCC-HMM [8]
    Clean     100%       98.5%        100%      95.7%        99.0%
    20 dB     100%       98.0%        98.5%     94.2%        62.1%
    10 dB     100%       95.3%        98.5%     84.7%        34.4%
    0 dB      100%       90.2%        59.5%     69.5%        21.8%
    -5 dB     99.0%      84.6%        30.0%     53.8%        19.5%
    Avg.      99.6%      93.3%        77.3%     79.6%        47.3%

Table 1. Results comparison on 10 sound classes

Fig. 3. Robust feature extracted from noise-corrupted sounds