audio track content understanding of a multimedia document, we can determine, for example, whether we are dealing with news, commercials or sports programs. In our case study, we identify primary audio sources: speech, music and environmental sounds. We also consider laughter and applause identification because they are key sounds in

Reviewing speech indexing platforms [4] (respectively music [9]), we see how speech segments are selected while the other segments are rejected. Speech/music detection is not studied within the same framework because speech and music have different structures [11]. We use a simpler approach which consists in detecting the two basic
[Figure: system overview — a training signal is pre-processed and passed to the training phase, which produces the models {speech, non-speech}, {music, non-music} and {laughter, non-laughter}; a test signal is pre-processed and classified against these models to yield the decision y.]

under constraints:

$$\sum_{i=1}^{l} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l,$$

with $x_i$ = training vectors, $y_i$ = class label of $x_i$, $K(\cdot)$ = kernel function, and $C$ = tradeoff between misclassification of training data and achieved margin.
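The dual constraints above can be checked empirically. As a minimal sketch (assuming scikit-learn, which the paper does not use, and toy two-class data standing in for audio feature vectors), the `dual_coef_` attribute of a fitted `SVC` holds the products $y_i \alpha_i$ for the support vectors, so the equality and box constraints are directly verifiable:

```python
# Sketch (not the paper's code): verify the SVM dual constraints
# with scikit-learn's SVC on synthetic two-class data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy feature vectors standing in for e.g. speech / non-speech frames.
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

C = 1.0  # tradeoff between misclassification and achieved margin
clf = SVC(kernel="rbf", C=C).fit(X, y)

coefs = clf.dual_coef_.ravel()  # y_i * alpha_i, support vectors only
print(abs(coefs.sum()) < 1e-6)            # sum_i y_i alpha_i = 0  -> True
print(bool(np.all(np.abs(coefs) <= C)))   # 0 <= alpha_i <= C      -> True
```

Only the support vectors have non-zero $\alpha_i$, which is why `dual_coef_` is much smaller than the training set.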
In the last test we built a background non-class model for applause sounds, tested using the same vectors for both technologies (cf. Table 3). We observe a dependence of the GMM scores on the number of training vectors available, but faster performance than SVM. We note that the SVM scores do not vary much when we reduce the number of training vectors.

Table 3. Results for applause indexing using the same training vectors.

System   Class vectors   Score    CPU time (training)   CPU time (classif.)
GMM      7500            98.58%   2'55                  3'25
GMM      2500            95.04%   57                    3'03
SVM      7500            98.35%   6'28                  22'23
SVM      2500            97.56%   2'29                  12'15

5. CONCLUSIONS

We have developed a framework to compare two different classification technologies. GMM is a robust classical approach which obtained very good performance. SVM comes from a strong mathematical background. Experiments show the advantages and disadvantages of each method. For each kind of sound (speech, music, applause, laughter) we obtain equivalent accuracy scores for GMM and SVM, but fewer training vectors are needed for the SVM models. The SVM optimization becomes seriously compromised when we handle large data volumes, while GMM needs a large amount of data to train proper models. For speech or music detection, where the available volume of data can be very large, a GMM-based solution is powerful. On the other hand, for rarer sounds, like applause or laughter, SVM, requiring less data, is the best alternative.

[3] R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, John Wiley & Sons, 2001.

[4] J.-L. Gauvain, L. Lamel, G. Adda, Audio partitioning and transcription for broadcast data indexation, CBMI'99, Toulouse.

[5] L. Lu, H.-J. Zhang, H. Jiang, Content Analysis for Audio Classification and Segmentation, IEEE Trans. on Speech and Audio Processing, vol. 10, no. 7, October 2002.

[6] M. T. Maybury (Ed.), Intelligent Multimedia Information Retrieval, The MIT Press, 1997.

[7] J. Pinquier, A. Arias, R. André-Obrecht, Audio classification by search of primary components, International Workshop on Image, Video and Audio Retrieval and Mining, Québec, Canada, October 2004.

[8] J. Rissanen, A universal prior for integers and estimation by minimum description length, The Annals of Statistics, vol. 11, pp. 416-431, November 1982.

[9] S. Rossignol, X. Rodet, J. Soumagne, J.-L. Collette, P. Depalle, Automatic characterisation of musical signals: feature extraction and temporal segmentation, Journal of New Music Research, pp. 1-16, 2000.

[10] B. Schölkopf, A. Smola, Learning with Kernels, The MIT Press, 2002.

[11] E. Scheirer, M. Slaney, Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator, ICASSP'97, Munich, vol. II, pp. 1331-1334, 1997.

[12] C. Tomasi, Estimating Gaussian Mixture Densities with EM: A Tutorial, http://citeseer.nj.nec.com.

[13] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
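The two classification routes compared in the conclusions can be sketched side by side. This is a minimal illustration, not the paper's experiment: it assumes scikit-learn and synthetic Gaussian features, fits one generative mixture per class (deciding by maximum log-likelihood) against a single discriminative SVM:

```python
# Sketch (assumed setup, not the paper's experiment): a per-class GMM
# likelihood classifier vs. an SVM on the same synthetic features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def make_data(n):
    # Two well-separated 4-dimensional Gaussian classes.
    X = np.vstack([rng.normal(-1.5, 1, (n, 4)), rng.normal(1.5, 1, (n, 4))])
    y = np.array([0] * n + [1] * n)
    return X, y

X_tr, y_tr = make_data(500)
X_te, y_te = make_data(200)

# GMM route: one mixture per class, decide by maximum log-likelihood.
gmms = [GaussianMixture(n_components=2, random_state=0).fit(X_tr[y_tr == c])
        for c in (0, 1)]
loglik = np.column_stack([g.score_samples(X_te) for g in gmms])
gmm_acc = np.mean(loglik.argmax(axis=1) == y_te)

# SVM route: a single discriminative model on the same data.
svm_acc = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"GMM accuracy: {gmm_acc:.2f}, SVM accuracy: {svm_acc:.2f}")
```

Rerunning `make_data` with smaller `n` is one way to probe the training-set-size sensitivity discussed above, though the paper's figures come from real audio features, not this toy data.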