
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP)

ROBUST ACOUSTIC FEATURE EXTRACTION FOR SOUND CLASSIFICATION BASED ON NOISE REDUCTION

Jiaxing Ye, Takumi Kobayashi, Masahiro Murakawa, Tetsuya Higuchi

National Institute of Advanced Industrial Science and Technology


1-1-1 Umezono, Tsukuba, Japan

ABSTRACT

In this paper, we present a novel method for environmental sound classification in non-stationary noise environments. The proposed method consists of three stages: noise source separation, acoustic feature extraction, and multi-class classification. At the first stage, we employ probabilistic latent component analysis (PLCA) to perform time-varying noise separation. To alleviate the artifacts introduced by source separation, a series of spectral weightings is applied to enhance the reliability of the audio spectra. At the feature extraction stage, we extract an acoustic subspace that effectively characterizes the temporal-spectral patterns of the denoised sound spectrogram. Subsequently, regularized kernel Fisher discriminant analysis (KFDA) is adopted for multi-class sound classification by exploiting the class-conditional distributions of the extracted acoustic subspace features. The proposed method is evaluated on the Real World Computing Partnership (RWCP) sound scene database, and experimental results demonstrate its superior performance compared to other methods.

Index Terms: Sound recognition, PLCA, source separation, eigen-decomposition, discriminant analysis

1. INTRODUCTION

Recently, the research area of machine audition has gained considerable attention [1]; its goal is to recognize real-life acoustic events using effective audio signal processing techniques. Various potential applications have promoted these studies, such as intelligent environment context recognition in robotics [2] and audio-based surveillance [3]. Initial works on acoustic event classification mostly followed speech recognition approaches, e.g., employing Mel-Frequency Cepstral Coefficients (MFCCs) to describe sound patterns and performing classification through HMM modeling [4]. However, several critical defects limit the applicability of speech recognition techniques to sound event recognition: MFCCs are not robust to noise, while Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) are insufficient to fully characterize the temporal-spectral dynamic patterns of non-speech sounds. A thorough comparison of conventional audio processing techniques for sound event recognition can be found in [5], which extensively investigated acoustic features based on Short-Time Fourier Transform (STFT) spectrograms, Continuous/Discrete Wavelet Transforms (CWT/DWT) and MFCCs, together with conventional classifiers such as Artificial Neural Networks (ANNs) and Learning Vector Quantization (LVQ).

Advanced classification schemes have recently been examined for audio classification, such as using SVMs [6] to exploit non-linear time-frequency distributions of acoustic events. Moreover, motivated by biological evidence that local time-frequency information contributes greatly to human audition, significant progress has been made on audio representation. Various novel acoustic features have been proposed, such as the spikegram [7], a neural-spike-like representation of sound events; likewise, another sparse audio feature, based on matching pursuit with Gabor dictionaries, is presented in [2]. Building on sparse audio representations, Dennis et al. [8] further developed a spiking neural network (SNN) for noise-robust sound classification and achieved superior results compared to conventional methods.

The psychology of hearing suggests that human beings recognize noise-corrupted acoustic events effortlessly because we can concentrate on the sound we are interested in and isolate it from the background noise. This study endeavors to realize a human-like sound recognition scheme by employing advanced signal processing techniques. A three-step framework is proposed, comprising noise reduction, robust acoustic feature extraction and multi-class classification. The noise reduction stage is designed to imitate the hearing concentration function of the human auditory system, and thereby lays the basis for noise-robust audio classification. To tackle the single-channel noise source separation problem, we employ the Probabilistic Latent Component Analysis (PLCA) algorithm [9], which learns dictionaries and their activation weights representing the non-stationary background noise and the sound event separately, so that noise source separation can be performed effectively. However, separation simultaneously introduces artifacts (distortions) into the denoised sound, especially in extremely noisy environments. To alleviate such artifact interference, we develop a series of spectral weightings that enhance the usability of the sound spectra based on characteristics of the PLCA model.

978-1-4799-2893-4/14/$31.00 ©2014 IEEE


Fig. 1. Overview of the proposed sound classification system

After noise reduction, we extract acoustic features in the following steps: 1. a uniformly spaced triangular spectral filter bank is applied to the sound spectrogram to generate a robust spectral feature with low dimensionality; 2. we extract an acoustic subspace from the filtered spectrogram to characterize the predominant patterns of the audio event within and across the time and frequency domains, which is more representative than conventional frame-based features. In addition, the subspace method inherently provides a denoising mechanism [10] and thus exhibits more robust properties. At the multi-class environmental sound classification stage, based on the aforementioned audio features, we adopt regularized Kernel Fisher Discriminant Analysis (KFDA) to exploit the non-linear class-conditional distributions of audio events, which is equivalent to the method presented in [11].

2. THE PROPOSED APPROACH

As presented in Fig. 1, the proposed method consists of three main stages: PLCA noise reduction, robust acoustic feature extraction and KFDA sound classification. In the following sections, we introduce the details of these steps.

2.1. PLCA noise source separation

PLCA is an effective non-negative matrix decomposition algorithm for non-stationary source modeling [9]. The formulation of PLCA can be expressed as:

    P(f,t) ≈ Σ_{z=1}^{C} P(z) P(f|z) P(t|z)    (1)

where P(f,t) denotes the (normalized) magnitude of the audio spectrogram at frame t and frequency band f, which is regarded as a random variable. P(f|z) is a multinomial distribution representing the frequency basis vectors (dictionary) corresponding to the sound source, P(z) is the probability distribution of the latent variable z, and P(t|z) denotes the latent variable activations along the time axis. The number of latent components is denoted by C and is determined by the user. The PLCA decomposition is carried out by minimizing the KL divergence d(P(t,f) || Q(t,f)) between the input spectra P(t,f) and the reconstructed spectra Q_t(f) = Σ_z P(f|z) P(z) P(t|z). PLCA is efficient for modeling non-stationary sound because the learnt dictionary accommodates the invariant characteristics of the input sound; for this reason, linear combinations of its dictionary elements (bases) are sufficient to represent the time-varying patterns in the audio signal.

In sound source separation, the input spectrogram is viewed as a mixture of noise and sound event, which can be written as:

    P(f,t) ≈ Σ_{z∈S} P(z) P(f|z) P(t|z) + Σ_{z∈N} P(z) P(f|z) P(t|z)    (2)

where P(f|z) for z ∈ N and P(f|z) for z ∈ S represent the noise dictionary and the sound event dictionary, respectively. Noise reduction can therefore be realized by setting the noise activations P(z) for z ∈ N to zero, so that the denoised sound spectra can be obtained by:

    P̂(f,t) ≈ Σ_{z∈S} P(z) P(f|z) P(t|z)    (3)

We perform noise source separation in a semi-supervised manner, in which the noise dictionary P(f|z) for z ∈ N is trained on background noise collected beforehand, while the sound event dictionary P(f|z) for z ∈ S and the dictionary element activations P(z) P(t|z) for z ∈ S are estimated through the PLCA decomposition in (2).

Though PLCA presents state-of-the-art performance for noise reduction [12], artifacts are produced by the noise source separation process, especially under extremely noisy conditions, e.g., -5 dB SNR. Increasing the number of EM iterations for KL divergence minimization and initializing more latent components in the PLCA model are possible ways to suppress artifacts; however, they introduce a heavy computational load that greatly degrades the algorithm's practicality. In this work, we present an alternative solution that enhances the intelligibility of the denoised sound by focusing on frequency bands in which fewer artifacts are generated. It is grounded in the fact that the human auditory system recognizes noise-corrupted sound by processing the audio in local high-SNR frequency bands [13]. We realize such a spectral concentration mechanism by assigning a series of spectral reliability weights derived from the reconstruction error of the PLCA model for background noise estimation.

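For concreteness, the semi-supervised PLCA separation of (1)-(3) can be sketched in NumPy. This is a minimal illustration under our own conventions, not the authors' implementation: the first Zn dictionary columns hold the frozen, pre-trained noise bases P(f|z), z ∈ N, and W_noise is assumed to be column-normalized.

```python
import numpy as np

def plca_separate(V, Zn, Zs, W_noise, n_iter=100, seed=0):
    """Semi-supervised PLCA, eqs. (1)-(3).
    V: (F, T) magnitude spectrogram; W_noise: (F, Zn) pre-trained,
    column-normalized noise bases (kept frozen); Zs: sound components."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    Z = Zn + Zs
    W = rng.random((F, Z)) + 1e-3          # P(f|z)
    W[:, :Zn] = W_noise                    # frozen noise dictionary
    W /= W.sum(0)
    H = rng.random((Z, T)) + 1e-3          # P(t|z)
    H /= H.sum(1, keepdims=True)
    pz = np.full(Z, 1.0 / Z)               # P(z)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|f,t) for every time-frequency bin
        R = W[:, :, None] * (pz[:, None] * H)[None, :, :]   # (F, Z, T)
        R /= R.sum(1, keepdims=True) + 1e-12
        # M-step: reweight responsibilities by the observed energy
        VR = V[:, None, :] * R                              # (F, Z, T)
        W = VR.sum(2)
        W /= W.sum(0) + 1e-12
        W[:, :Zn] = W_noise                                 # keep noise bases fixed
        H = VR.sum(0)
        pz = H.sum(1)
        H /= H.sum(1, keepdims=True) + 1e-12
        pz /= pz.sum()
    # eq. (3): reconstruct using only the sound-event components z in S
    V_denoised = W[:, Zn:] @ (pz[Zn:, None] * H[Zn:])
    return V_denoised, W, H, pz
```

A background-noise spectrogram would first be decomposed with the same EM updates (all bases free) to obtain W_noise; at separation time only the sound-event bases and all activations are re-estimated, as described above.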
The frequency band-wise reconstruction error probability distribution is expressed as:

    P(f|e) = Σ_t (X_{t,f} − X̂_{t,f})² / Σ_t Σ_f (X_{t,f} − X̂_{t,f})²    (4)

where X is the input noise spectrogram and X̂ is the spectrogram reconstructed by the PLCA model. P(f|e) represents the spectral error probability distribution, from which a spectral reliability measure can be derived as:

    ω_f = 1 − P(f|e) / max_f P(f|e)    (5)

so that bands with lower error rates are assigned higher weights, whereas bands with high error probability receive smaller weights that suppress their significance. A series of spectral reliability weights for babble noise is shown in Fig. 2. The weights are applied to the audio spectrogram by S_t = X_t ⊙ ω, where X_t denotes the t-th frame of the denoised spectrum and S_t represents the spectral feature enhanced by the weights ω_f.

Fig. 2. Reliability weights for denoised sound spectra

2.2. Robust feature extraction

Based on the aforementioned process, we further develop a robust acoustic representation using a triangular filter bank and eigen-decomposition. The Mel-filter bank is well developed for characterizing speech signals. We follow the Mel-filter bank idea and design a uniformly positioned triangular filter bank to describe sound events, which offers three main advantages. First, the triangular filters produce a more robust audio representation by introducing a fuzzy assignment over the audio spectra. Second, in contrast to the Mel-filter bank, which emphasizes the low-frequency bands where speech content mainly lies, the uniformly positioned filters render identical resolution across all frequency bands and thus capture richer temporal-spectral dynamics of sound events. Third, the filter bank effectively reduces the feature dimension and therefore facilitates sound classification. The filtering process is expressed as:

    s_n = Σ_f φ_n(f) s(f)    (6)

where s(f) denotes the denoised audio spectrum at band f and φ_n(f) represents the spectral weighting coefficients of the n-th filter. The filtered spectrum can then be written as S_t = [s_1, ..., s_N]^T.

Based on the filtered audio spectrogram [S_1, ..., S_T], S_t ∈ R^{N×1}, we perform eigen-decomposition to further characterize the significant temporal-spectral patterns of the sound event. The eigen-decomposition can be written as:

    R_S = U Λ U^T,  R_S = E{S_t S_t^T}    (7)

where U = [u_1 ... u_N] contains the eigenvectors characterizing the predominant patterns of the sound and Λ = diag(λ_1 ... λ_N) is the diagonal eigenvalue matrix. The contribution ratio of the k-th eigenvector u_k is defined as:

    α_k = λ_k / Σ_{i=1}^{N} λ_i    (8)

which indicates its significance in representing the audio. We select the first K principal eigenvectors with the highest contribution ratios, U_K = [u_1 ... u_K], K < N, to form the sound representation. In addition, the eigenvectors in U_K are normalized by their contribution ratios by computing ū_k = (α_k / Σ_{i=1}^{K} α_i) u_k, k = 1...K, in a manner similar to [14]. The contribution weightings give prominence to the principal eigenvectors for describing the sound event. Finally, we concatenate the K normalized principal eigenvectors to build the acoustic feature vector.

2.3. Sound classification by regularized kernel Fisher discriminant analysis

We employ KFDA to perform sound classification. Let X_i = {x_{i1}, ..., x_{iN_i}} be the audio features from class i, and let Φ(x) be a nonlinear mapping of the input vector x into a kernel feature space F. KFDA seeks a direction w ∈ F maximizing the class separability, which is defined as:

    J(w) = (w^T S_B^Φ w) / (w^T (S_W^Φ + εI) w),  S_B^Φ = (m_1^Φ − m_2^Φ)(m_1^Φ − m_2^Φ)^T,
    S_W^Φ = Σ_i Σ_{x∈X_i} (Φ(x) − m_i^Φ)(Φ(x) − m_i^Φ)^T,  m_i^Φ = (1/N_i) Σ_{n=1}^{N_i} Φ(x_{in})    (9)

where m_i^Φ denotes the class center, while S_B^Φ and S_W^Φ are the between-class and within-class scatter matrices in the kernel feature space. Since the within-class scatter may be singular, a regularization term εI is added. This is a typical Rayleigh quotient, and its solution can be found in [15].

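Before turning to the experiments, the feature-extraction chain of Secs. 2.1-2.2 (reliability weighting (4)-(5), uniform triangular filtering (6), and the contribution-weighted eigen subspace (7)-(8)) can be strung together as a sketch; the function names and the exact placement of the filter edges are our own illustrative choices, not the authors' code.

```python
import numpy as np

def reliability_weights(X, X_hat):
    """Eqs. (4)-(5): band reliability from PLCA reconstruction error."""
    err = (X - X_hat) ** 2
    p_fe = err.sum(axis=1) / (err.sum() + 1e-12)   # P(f|e)
    return 1.0 - p_fe / p_fe.max()                  # omega_f in [0, 1]

def uniform_triangle_bank(F, N):
    """Eq. (6): N uniformly positioned triangular filters over F bins."""
    edges = np.linspace(0, F - 1, N + 2)
    f = np.arange(F)
    bank = np.zeros((N, F))
    for n in range(N):
        lo, c, hi = edges[n], edges[n + 1], edges[n + 2]
        # rising and falling slopes of the n-th triangle, clipped at zero
        bank[n] = np.maximum(0.0, np.minimum((f - lo) / (c - lo),
                                             (hi - f) / (hi - c)))
    return bank

def subspace_feature(S, K):
    """Eqs. (7)-(8): contribution-weighted K-dim principal eigen subspace.
    S: (N, T) filtered spectrogram; returns a length K*N feature vector."""
    R = (S @ S.T) / S.shape[1]                      # R = E{S_t S_t^T}
    lam, U = np.linalg.eigh(R)                      # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]                  # sort descending
    alpha = lam / (lam.sum() + 1e-12)               # contribution ratios
    w = alpha[:K] / (alpha[:K].sum() + 1e-12)       # renormalize over kept K
    return (U[:, :K] * w).ravel(order="F")          # concatenate weighted u_k
```

A weighted, filtered spectrogram is then obtained as `bank @ (X * w[:, None])`, after which `subspace_feature` yields the final acoustic feature vector fed to the classifier.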
3. EXPERIMENTS

3.1. Dataset

We evaluate the proposed sound classification method on the RWCP sound scene dataset [16], in which the sound files are recorded at a 48 kHz sampling rate with high SNR. There are 105 categories of sounds, including some duplicated classes such as 5 types of coin sounds. We removed the duplicated classes, obtaining 62 distinct sound categories with 5949 samples for evaluation. Typical non-stationary babble noise extracted from the NOISEX92 database is used to evaluate the robustness of the proposed method.

3.2. Parameter setting

We set the window length to 10 ms with 7.5 ms overlap in the Short-Time Fourier Transform (STFT). At the PLCA noise reduction stage, the background noise dictionary size and the foreground sound event dictionary size are both fixed to 50, since this was found sufficient to model the babble noise and the sound events. The number of EM iterations is experimentally set to 100. N = 50 triangular filters are used to build the filter bank in (6). We concatenate the K = 3 principal eigenvectors with the highest contribution ratios to form the audio representation; according to their contribution ratios (8), these accommodate over 95% of the patterns in the sound. The regularization term ε in (9) is set to 10^{-4}. 10 seconds of babble noise are used to train the noise model in PLCA; this segment is excluded at the noisy audio sample generation stage to ensure there is no overlap between the training and testing background sound. In KFDA, the spread parameter of the Gaussian kernel is set to 0.3.

3.3. Experiment 1: validation by comparison

In the first experiment, we compare the proposed approach with other methods [8, 11]. The evaluation protocol is similar to [8]: 10 classes of sound events are selected from the RWCP database, i.e., bells5, bottle1, buzzer, cymbals, horn, kara, metal15, phone4, ring and whistle1. For each class, 40 files are randomly selected, half for training and half for testing. Babble noise is added to the testing data at 20, 10, 0 and -5 dB SNRs. Final results are computed by averaging the outcomes over 5 runs of the test. Several conventional methods, such as MFCC with HMM and Mel-Frequency (MF) features with HMM, are also evaluated [8].

    Method   Proposed  Dennis [8]  YE [11]  MF-HMM [8]  MFCC-HMM [8]
    Clean    100%      98.5%       100%     95.7%       99.0%
    20 dB    100%      98.0%       98.5%    94.2%       62.1%
    10 dB    100%      95.3%       98.5%    84.7%       34.4%
    0 dB     100%      90.2%       59.5%    69.5%       21.8%
    -5 dB    99.0%     84.6%       30.0%    53.8%       19.5%
    Avg.     99.6%     93.3%       77.3%    79.6%       47.3%

Table 1. Results comparison on 10 sound classes

According to the results shown in Table 1, the proposed scheme achieved favorable classification accuracy under all noise conditions and significantly outperformed the other methods. In addition, we plot all features extracted from the 10 categories of sounds corrupted with 20 to -5 dB babble noise in Fig. 3, where clusters for each sound class can be clearly observed, which manifests the robustness of the proposed feature. Moreover, the significance of adopting noise reduction together with spectral enhancement can be observed by comparing our result with that of [11], in which a similar (subspace) audio representation is employed: the proposed method greatly outperforms [11] under extremely noisy conditions.

Fig. 3. Robust features extracted from noise-corrupted sounds

3.4. Experiment 2: evaluation with more data

To further confirm the superiority of the proposed method, we perform a more extensive sound classification test using the RWCP database with 62 sound classes and 5949 samples. Performance is measured by 10-fold cross-validation, in which clean data are used for training and the test samples are corrupted by babble noise at 20, 10, 5, 0 and -5 dB SNRs, respectively. From the results in Fig. 4, the proposed method obtained an average accuracy of 91.04%; even under the challenging -5 dB SNR condition, it achieved more than 90% classification accuracy.

Fig. 4. Classification results on 62 sound classes with various noise intensities

4. CONCLUSION

This paper proposed a novel noise-robust sound classification approach that presents favorable performance in severe acoustic environments. The proposed method assembles advanced signal processing techniques to emulate the human auditory mechanism, in which a noise-corrupted sound event is recognized based on the part of the spectral information with high SNR. In detail, three main steps are included: noise source separation, robust acoustic feature extraction and regularized KFDA multi-class classification. We demonstrated the proposed method on the RWCP sound scene database, and extensive experiments validated the superiority of the proposed approach.

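As a closing illustration of Sec. 2.3, the regularized KFDA criterion (9) admits a standard two-class solution in dual form. The sketch below uses the Gaussian kernel with the paper's spread 0.3 as the default; the helper names are ours, and the multi-class extension used in the paper follows [15].

```python
import numpy as np

def gauss_kernel(A, B, sigma=0.3):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kfda_fit(X1, X2, sigma=0.3, eps=1e-4):
    """Two-class regularized KFDA, eq. (9), in dual form:
    alpha = (N + eps*I)^{-1} (M1 - M2), with w = sum_j alpha_j Phi(x_j)."""
    X = np.vstack([X1, X2])
    n1, n2 = len(X1), len(X2)
    K = gauss_kernel(X, X, sigma)
    K1, K2 = K[:, :n1], K[:, n1:]
    M1, M2 = K1.mean(axis=1), K2.mean(axis=1)       # kernel class means
    # within-class scatter in dual form: N = sum_i K_i (I - 1/n_i) K_i^T
    N = sum(Ki @ (np.eye(ni) - np.full((ni, ni), 1.0 / ni)) @ Ki.T
            for Ki, ni in ((K1, n1), (K2, n2)))
    alpha = np.linalg.solve(N + eps * np.eye(len(X)), M1 - M2)
    return alpha, X

def kfda_project(alpha, Xtrain, x, sigma=0.3):
    """Project new samples x onto the learnt discriminant direction."""
    return gauss_kernel(np.atleast_2d(x), Xtrain, sigma) @ alpha
```

The regularizer eps plays the role of the term εI added to the within-class scatter in (9), guaranteeing that the linear system is solvable even when the scatter matrix is singular.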
