
Using Exact Locality Sensitive Mapping to Group and Detect Audio-Based Cover Songs

Yi Yu+, J. Stephen Downie*, Fabian Moerchen**, Lei Chen++, Kazuki Joe+

+ Department of Information and Computer Sciences, Nara Women's University,
Kitauoya Nishi-machi, Nara 630-8506, Japan. {yuyi, joe}@ics.nara-wu.ac.jp
* Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign,
501 E. Daniel St., Champaign, IL 61820, USA. jdownie@uiuc.edu
** Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA.
fabian.moerchen@siemens.com
++ Department of Computer Science, Hong Kong University of Science and Technology,
HKSAR, China. leichen@cs.ust.hk

Abstract

Cover song detection has become a very active research topic as large numbers of personal music recordings and performances are released on the Internet. A good cover song recognizer helps us group and detect cover songs and thereby improves the search experience. Traditional detection matches two musical audio sequences by exhaustive pairwise comparison. In contrast to existing work, our aim is to generate a group of concatenated feature sets based on regression modeling and to organize them with indexing-based approximate techniques, avoiding costly audio sequence comparisons. We mainly focus on using Exact Locality Sensitive Mapping (ELSM) to join the concatenated feature sets and soft hash values. Similarity-invariance of audio sequence comparison is applied to define an optimal combination of several audio features. Soft hash values are pre-calculated to help locate the searching range more accurately. Furthermore, we apply our algorithms to real audio cover songs, grouping and detecting batches of relevant cover songs embedded in large audio datasets.

1. Introduction

The Internet provides an open space to store our digital music recordings and performances and share them with the world. An increasing number of people join online music communities to upload their recordings and performances. A cover song is a new rendition of a previously recorded song in popular music (see http://en.wikipedia.org/wiki/Cover_version). With the growing amount of personal recordings and performances on music social websites, detecting unknown cover songs is becoming extremely important. Accurate audio-based pairwise matching can usually be obtained by comparing audio sequences. However, the feature representation of a musical audio sequence is very high dimensional, which makes it time-consuming to detect relevant audio tracks quickly. Solving this hard problem involves two main aspects: (i) refining the music representation to improve the accuracy of musical semantic similarity (pitch [5][22], Mel-Frequency Cepstral Coefficients (MFCC) [7], Chroma [19][21]), and (ii) organizing the music documents in a way that speeds up music similarity search (tree structures [1][4][13], hierarchical structures [10], Locality Sensitive Hashing (LSH) [6][9]). Strengthening only one of these aspects is not enough. For example, in [21] beat-synchronous Chroma features are successful at matching similar music audio sequences, but one pairwise comparison of feature sequences costs about 0.3 second, so a single query would take about 30 s against a database of only 100 tracks.

We address both aspects to solve audio-based cover song detection and retrieval: we accurately locate the searching range of music audio tracks using concatenated feature sets and exact locality sensitive mapping. A novel melody-based summarization principle is presented to determine, by multivariable regression, a group of weights that define an optimal combination of several audio features, the Features Union (FU). In addition, we study the relationship between the FU summarization and hash values and propose two retrieval schemes, Exact Locality Sensitive Mapping (ELSM) and SoftLSH. We confirm the practicality of our algorithms with real-world multi-cover-version queries over large musical audio datasets. Interesting examples of cover songs can be found and listened to at http://www.e.ics.nara-wu.ac.jp/~yuyi/AudioExamples.htm.
2. Related work

To accelerate audio-based content detection, several researchers have applied index-based techniques [1][4][6]. In [1] a composite feature tree (of semantic features such as timbre, rhythm and pitch) was proposed to facilitate KNN search. The weight of each individual feature is determined by multivariable regression, and Principal Component Analysis (PCA) is used to transform the extracted feature sequence into a new space sorted by the importance of the acoustic features. In [4] the extracted features (Discrete Fourier Transform) are grouped by Minimum Bounding Rectangles (MBR) and compared with an R*-tree. Though the number of features can be reduced, the summarized (grouped) features sometimes fail to discriminate two different signals.

LSH [11] is an index-based data organization structure proposed to find all pairs similar to a query point in Euclidean space. It has been very successful in different applications as a well-known way to decide whether a pair of documents are similar (web pages [12], audio [6][9], images [15], video [14], etc.). Feature vectors are extracted from the documents and regarded as similar to one another if they map to the same hash value. Yang [6] used random subsets of spectral features (Short Time Fourier Transform) to calculate hash values for parallel LSH hash instances. Given a query, its features are matched against the reference features in the hash tables, and a Hough transform is then performed on the matching pairs to detect the similarity between the query and each reference song by linearity filtering. In [9] MFCCs are extracted as the feature of single spoken words; on top of the basic LSH idea the authors proposed multi-probe LSH, which probes the multiple buckets that are likely to contain content similar to the query.

The LSH scheme is described as follows. If two features (V_q, V_i) are very similar, they have a small distance ||V_q - V_i||, hash to the same value, and fall into the same bucket with high probability; if they are quite different, they collide only with small probability. A function family H = {h: S -> U}, each h mapping a point from domain S to U, is called locality sensitive if, for any features V_q and V_i, the probability

    Prob(t) = Pr_H[ h(V_q) = h(V_i) : ||V_q - V_i|| = t ]                          (1)

is a strictly decreasing function of t. That is, the collision probability of features V_q and V_i diminishes as their distance increases. The family H is further called (R, cR, p_1, p_2)-sensitive (c > 1, p_2 < p_1) if for any V_q, V_i in S,

    if ||V_q - V_i|| <  R,   Pr_H[ h(V_q) = h(V_i) ] >= p_1
    if ||V_q - V_i|| >  cR,  Pr_H[ h(V_q) = h(V_i) ] <= p_2                        (2)

A good family of hash functions tries to amplify the gap between p_1 and p_2.
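To make the locality sensitive property in Eqs. (1)-(2) concrete, the following minimal NumPy sketch builds a standard p-stable (random projection) hash function of the form h(v) = floor((a.v + b)/w) and empirically compares collision rates for a near pair and a far pair; the function names, dimensions and the choice w = 4 are our own illustrative assumptions, not details taken from the systems cited above.

    import numpy as np

    def make_lsh_function(dim, w, rng):
        """Return one locality sensitive hash function h(v) = floor((a.v + b) / w).

        a is a random Gaussian (p-stable) projection vector and b is a random
        offset in [0, w); nearby vectors collide with high probability."""
        a = rng.standard_normal(dim)
        b = rng.uniform(0.0, w)
        return lambda v: int(np.floor((np.dot(a, v) + b) / w))

    # Toy check of the locality-sensitive property: close vectors collide more often.
    rng = np.random.default_rng(0)
    dim, w, trials = 64, 4.0, 1000
    v_q = rng.standard_normal(dim)
    v_near = v_q + 0.1 * rng.standard_normal(dim)   # small ||v_q - v_i||
    v_far = v_q + 5.0 * rng.standard_normal(dim)    # large ||v_q - v_i||
    hits_near = hits_far = 0
    for _ in range(trials):
        h = make_lsh_function(dim, w, rng)
        hits_near += h(v_q) == h(v_near)
        hits_far += h(v_q) == h(v_far)
    print(hits_near / trials, hits_far / trials)    # p1 should clearly exceed p2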
There is a significant shortcoming in the existing works [1][4][6][9]: no analysis was done to investigate whether the features used can completely or maximally represent the original melody characteristics and can distinguish two audio feature sequences. Nor was the retrieval quality evaluated against a natural benchmark (human perception). In this work we study the correlation between audio feature sets and soft hash/mapping values by using exhaustive musical audio sequence comparisons to predict desired but unseen music songs, and we give simple principles for evaluating melody-based music retrieval. In contrast to the existing works, our work has two advantages. (i) Similarity-invariance among audio feature sequence comparisons is applied to train semantic audio representations with supervised learning; the proposed Features Union (FU) better represents musical audio sequences. (ii) We account for possible differences between perceptually similar audio documents and map the features into a continuous hash space (soft mapping), so that the neighborhood determined by the query intersects the buckets that possibly contain similar documents. Compared with exhaustive KNN and previously used LSH retrieval schemes, our algorithms achieve almost the same retrieval quality as KNN with much less retrieval time.
3. Algorithms

Our algorithms comprise two main parts: spectrum-based audio semantic summarization and soft-hashing-based information retrieval. We reveal a principle of spectral similarity-invariance, by which we can summarize long audio feature sequences into a compact, semantic single feature, the FU. Instead of traditional hard hash values we use a group of soft hash values and exact locality sensitive mapping, which help locate the searching range more accurately. Based on the FU, two retrieval schemes are proposed to solve the cover song detection problem.
3.1. Spectrum-based features union

Audio documents can be described as time-series feature sequences. Directly computing the distance between audio feature sequences (matching audio documents) is an important task in query-by-content audio information retrieval. Dynamic Programming (DP) [5][19] can be used to match two audio feature sequences and is an essentially exhaustive searching approach that offers high accuracy, but it lacks scalability and retrieval slows down as the database grows. To speed up audio feature sequence comparison and obtain scalable content-based retrieval, semantic features are extracted from the audio structures. The (high-level) semantic features used in [1] are mainly intended for musical genre classification [20] and cannot effectively represent melody-based, lower-level music information. In the following we propose a new semantic feature summarization suited to melody-based music information.

A single feature cannot summarize a song well; multiple features can be combined to represent it. These features play different roles in the query stage and must be given different weights. Existing retrieval schemes have selected different audio features. We choose several competitive audio features and introduce a scheme based on multivariable regression to determine the weight of each feature. The goal of our approach is to apply linear and non-parametric regression models to investigate the correlation. In the model we use K (K=7) groups of features (218 dimensions in total): mean and standard deviation of MFCC (13+13) [7], mean and standard deviation of Mel-magnitudes (40+40) [8], mean and standard deviation of Chroma (12+12) [21], and Pitch Histogram (88) [5].

Let the feature groups of the m_i-th song be v_{mi,1}, v_{mi,2}, ..., v_{mi,K} (i=1,2). With a different weight alpha_k assigned to each feature group, the total summary vector is

    V_{mi} = [alpha_1 v_{mi,1}, alpha_2 v_{mi,2}, ..., alpha_K v_{mi,K}]^T          (3)

The Euclidean distance between two summaries V_{m1} and V_{m2} is

    d(V_{m1}, V_{m2}) = d( sum_{k=1..K} alpha_k v_{m1,k}, sum_{k=1..K} alpha_k v_{m2,k} )
                      = sum_{k=1..K} alpha_k^2 d(v_{m1,k}, v_{m2,k})                (4)

To determine the weights in Eq.(3), we apply a multivariable regression process. Consider a training database composed of M pairs of songs <R_{m1}, R_{m2}>, m = 1, 2, ..., M, containing both similar and non-similar pairs. From these pairs we obtain Chroma sequences similar to [21] and then calculate the M sequence distances d_DP(R_{m1}, R_{m2}) via DP.

We choose the weights in Eq.(3) so that the distance d(V_{m1}, V_{m2}) calculated from the summaries is as near as possible to the sequence distance d_DP(R_{m1}, R_{m2}); that is, we hope the melody information is retained in the summary. After determining the distances between the pairs of training data, we obtain an M*K matrix D_V and an M-dimensional column vector D_DP. The mth row of D_V holds the K distance values calculated from the independent feature groups, d(v_{m1,k}, v_{m2,k}), k = 1, 2, ..., K, and the mth element of D_DP is the normalized distance between the two feature sequences, d_DP(R_{m1}, R_{m2}) * K / (|R_{m1}| * |R_{m2}|). Let A = [alpha_1^2, alpha_2^2, ..., alpha_K^2]^T. According to Eq.(4), D_V, A and D_DP satisfy the equation D_V * A = D_DP, so A = (D_V^T D_V)^{-1} D_V^T D_DP, from which we obtain the weights alpha_k; we are only interested in their absolute values. The feature set defined in Eq.(3) is called the Features Union (FU).
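The weight estimation above reduces to an ordinary least-squares problem, D_V * A = D_DP. The sketch below is one possible NumPy realization under that reading: it solves for A, takes alpha_k = sqrt(|A_k|), and concatenates the K weighted groups into the 218-dimensional FU vector; the random training distances and the helper names (fit_fu_weights, build_fu) are illustrative assumptions, not the authors' code.

    import numpy as np

    def fit_fu_weights(d_v, d_dp):
        """Solve D_V . A = D_DP in the least-squares sense and return the
        per-feature-group weights alpha_k = sqrt(|A_k|) used to build the FU vector."""
        a, *_ = np.linalg.lstsq(d_v, d_dp, rcond=None)
        return np.sqrt(np.abs(a))

    def build_fu(feature_groups, alphas):
        """Concatenate the K feature groups, each scaled by its weight (Eq. (3))."""
        return np.concatenate([alpha * v for alpha, v in zip(alphas, feature_groups)])

    # Illustrative use with random stand-ins for the training distances.
    rng = np.random.default_rng(1)
    M, K = 80, 7                       # M training pairs, K feature groups
    d_v = rng.random((M, K))           # d(v_m1,k, v_m2,k) for each pair and group
    d_dp = rng.random(M)               # normalized DP distances d_DP(R_m1, R_m2)
    alphas = fit_fu_weights(d_v, d_dp)
    groups = [rng.random(n) for n in (13, 13, 40, 40, 12, 12, 88)]   # 7 groups, 218 dims
    fu = build_fu(groups, alphas)      # 218-dimensional FU summary vector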
3.2. Exact locality sensitive hashing/mapping

Almost all hash schemes, including LSH, use hard (discrete) integer hash values. In LSH an FU V_i is locality-sensitively mapped to H(V_i), which is further quantized to the integer hash value round(H(V_i)) (round(x) is the nearest integer to x). Two FUs V_i and V_j with a short distance d(V_i, V_j) obtain the same integer hash value with high probability. By assigning integer hash values to buckets, the songs located in the same bucket as the query can be found quickly.

However, even if two similar FUs V_i and V_j have a short distance d(V_i, V_j), it is not always guaranteed that they share the same hash values, because of mapping and quantization errors. When a vector of N hash values, instead of a single hash value, is used to locate a bucket, the precision improves but the effect of quantization error becomes more pronounced. To find a similar song in the database for a given query, multiple parallel and independent hash instances are then necessary, which in turn takes more time and requires more space. Our solution to this problem is to exploit the continuous, non-quantized hash values with two schemes, Exact Locality Sensitive Mapping (ELSM) and SoftLSH.

3.2.1 SoftLSH

We assume the search-by-hash system has L parallel hash instances and that each hash instance has a group of N locality sensitive mapping functions. In the mth hash instance the function group is H_m = {h_m1, h_m2, ..., h_mN}. Its kth function h_mk(.) maps an FU feature V to a continuous, non-quantized hash value h_mk(V). After mapping, the hash vector in the mth hash instance corresponding to V is H_m(V) = {h_m1(V), h_m2(V), ..., h_mN(V)}.
Consider the kth dimension of the hash vectors H_m(V_i) and H_m(V_j) corresponding to V_i and V_j respectively. By a first-order Taylor approximation, the difference between h_mk(V_i) and h_mk(V_j) is

    h_mk(V_i) - h_mk(V_j) ~= h'_mk(V_j) (V_i - V_j)                                  (5)

When V_i and V_j are similar to each other, they have a short distance d(V_i, V_j). According to Eq.(5), h_mk(V_i) and h_mk(V_j) are then close to each other, and so are the vectors H_m(V_i) and H_m(V_j).

At the quantization stage the hash space is divided into non-overlapping squares, and H_m(V) is quantized to a set of N integer hash values round(H_m(V)), the center of its square. Two FUs falling into the same square have the same integer hash values, but this quantization does not retain the distance between two FUs well. Figure 1 shows an example with N = 2: d(H_m(V_i), H_m(V_j)) is less than the allowed error, yet none of the integer hash values of the two FUs coincide. H_m(V_i) is quantized to (2,3) while H_m(V_j) is quantized to (1,2). Careful observation shows that such quantization errors usually happen when both H_m(V_i) and H_m(V_j) lie near the edges of their squares. Even a small error near an edge can produce a difference of up to N between the two integer hash sets round(H_m(V_i)) and round(H_m(V_j)).

[Figure 1. Concept of Soft LSH: the hash space is divided into unit squares along the axes h_m1 and h_m2; H_m(V_i) and H_m(V_j) fall into different squares even though the neighborhood C(H_m(V_j), r) covers both and intersects several squares.]

In Figure 1 the FU feature V_i is a neighbor of V_j, and the hash value H_m(V_i) lies in the neighborhood C(H_m(V_j), r), a ball of radius r centered at H_m(V_j). But H_m(V_i) and H_m(V_j) fall into different squares and therefore end up far apart after quantization. Here each square corresponds to a bucket. Clearly C(H_m(V_j), r) intersects several quantization squares simultaneously, including the square where H_m(V_i) lies. With V_j as the query and C(H_m(V_j), r) calculated in advance, the buckets that may hold V_i can be found easily. From all the features in these buckets, those located in C(H_m(V_j), r) are taken as candidates, and the KNN algorithm is applied to the candidates to find the features that are actually similar to V_j. Of course V_i will be one of the nearest neighbors.

3.2.2 Query with ELSM

Our first solution to the quantization problem is to use the ELSM feature together with KNN instead of assigning FUs to buckets.

[Figure 2. Feature organization in the database: the feature sequence of a song (dimension over time) is summarized into its FU feature V_i, which each hash instance maps to a hash vector H_m(V_i).]

Each song in the database is processed as in Figure 2. Its FU feature V is obtained by the regression model. The mth hash instance has its own set of N hash functions h_mk(V) = (a_mk . V + b_mk) / w_mk (1 <= k <= N), determined by the random variables a_mk and b_mk and the quantization interval w_mk. Through w_mk, the standard deviations of the soft hash values in different hash instances are made almost equal, so the distribution of hash vectors roughly spans a square in Euclidean space. The hash set for the summarized semantic FU feature V in the mth hash instance is H_m(V) = [h_m1(V), h_m2(V), ..., h_mN(V)]. With L parallel hash instances, the hash vectors generated from the FU feature V for all instances are H(V) = [H_1(V), H_2(V), ..., H_L(V)], which has N*L dimensions. Since the mapping functions h_mk(.) are locality sensitive, the new hash vector H(V) retains most of the information embedded in V.

H(V) can serve as a feature (the ELSM feature) and be used together with KNN. Because the mapping values are soft, it does not suffer quantization information loss. This scheme can also be regarded as an ideal hash in which each bucket contains only ELSM features that are very similar to each other: when a query arrives, its ELSM feature locates the bucket containing all the similar songs (the same ones exhaustive KNN would return). In such ideal cases the search accuracy equals that of using ELSM with exhaustive KNN. Usually N*L is much smaller than the dimension of the FU. Though ELSM cannot respond as fast as SoftLSH, it is still much faster than using the FU feature directly, and its search accuracy upper-bounds that of SoftLSH. With ELSM as the feature in the KNN search we can verify the effectiveness of locality sensitive mapping.
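A minimal sketch of the construction just described, assuming hash functions h_mk(V) = (a_mk . V + b_mk) / w_mk with Gaussian a_mk, a shared quantization interval w, and brute-force KNN over the N*L-dimensional ELSM vectors; the class and parameter names (ELSMMapper, n_funcs, n_instances) are our own and purely illustrative.

    import numpy as np

    class ELSMMapper:
        """L parallel hash instances, each with N functions h_mk(V) = (a_mk . V + b_mk) / w.
        The concatenated, non-quantized outputs form the N*L-dimensional ELSM feature."""

        def __init__(self, dim, n_funcs=4, n_instances=10, w=4.0, seed=0):
            rng = np.random.default_rng(seed)
            self.a = rng.standard_normal((n_instances, n_funcs, dim))   # a_mk
            self.b = rng.uniform(0.0, w, (n_instances, n_funcs))        # b_mk
            self.w = w                                                   # shared w_mk

        def elsm(self, v):
            """Soft (continuous) hash vector H(V): no rounding, no bucket assignment."""
            return ((self.a @ v + self.b) / self.w).ravel()

    def knn(query_elsm, db_elsm, k=5):
        """Brute-force KNN in the low-dimensional ELSM space."""
        dists = np.linalg.norm(db_elsm - query_elsm, axis=1)
        return np.argsort(dists)[:k]

    # Illustrative use with random FU vectors standing in for real songs.
    rng = np.random.default_rng(2)
    mapper = ELSMMapper(dim=218)
    db = rng.standard_normal((1000, 218))
    db_elsm = np.vstack([mapper.elsm(v) for v in db])
    print(knn(mapper.elsm(db[42]), db_elsm))   # song 42 should rank first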
3.2.3 Query with SoftLSH

Our second solution to the quantization problem is to exploit the non-quantized hash values (SoftLSH) to locate, in each hash instance, all the buckets that possibly hold features similar to the query.

For the ith song with FU feature V_i, its sequence number i is stored, in the mth hash instance, in the bucket round(H_m(V_i)). Its soft hash values for all hash instances, H(V_i), are stored together in a separate buffer and serve as the ELSM feature. The residual part H_m(V_i) - round(H_m(V_i)) reflects the uncertainty. This part is usually neglected in LSH indexing schemes; fully exploiting it enables accurate location of the buckets that possibly contain similar features. At retrieval time the FU of the query, V_q, is calculated, and from it the ELSM feature H(V_q) = [H_1(V_q), H_2(V_q), ..., H_L(V_q)]. In the mth hash instance the features similar to V_q lie in the buckets that intersect the neighborhood C(H_m(V_q), r). Because of the quantization the buckets are squares, and any vertex of a bucket lying inside the neighborhood implies that the bucket intersects it.

Buckets in a hash instance are centered at vectors of integer hash values, and their vertexes are the center plus or minus 0.5 in each dimension. round(H_m(V_q)), the integer part of H_m(V_q), indicates the bucket most likely to contain similar features, but nearby buckets may contain them as well. The vertexes of these buckets, round(H_m(V_q)) + (j_1 +/- 0.5, ..., j_N +/- 0.5), are therefore examined: for every vector (j_1, ..., j_N) whose vertexes fall inside C(H_m(V_q), r), the bucket centered at round(H_m(V_q)) + (j_1, ..., j_N) possibly contains similar features. The features falling in these buckets are then examined by KNN with the ELSM feature.
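The bucket-location step can be sketched as follows for a single hash instance: enumerate the integer hash vectors around round(H_m(V_q)) and keep those whose bucket vertices (center +/- 0.5 per dimension) fall inside C(H_m(V_q), r). Restricting the enumeration to the immediate neighbors of the central bucket and the radius r = 0.3 are simplifying assumptions made purely for illustration.

    import itertools
    import numpy as np

    def candidate_buckets(h_q, r):
        """Return the integer hash vectors of buckets intersecting C(h_q, r).

        h_q : soft (non-quantized) hash vector H_m(V_q) of one hash instance.
        A bucket is a unit cube centered at an integer vector; it is kept if at
        least one of its vertices (center +/- 0.5 per dimension) lies in the ball."""
        center = np.round(h_q).astype(int)           # bucket most likely to hold similar songs
        n = len(h_q)
        buckets = []
        # examine the central bucket and its immediate neighbors
        for offset in itertools.product((-1, 0, 1), repeat=n):
            cand = center + np.array(offset)
            vertices = itertools.product(*[(c - 0.5, c + 0.5) for c in cand])
            if any(np.linalg.norm(np.array(vtx) - h_q) <= r for vtx in vertices):
                buckets.append(tuple(int(c) for c in cand))
        return buckets

    # Example with N = 2, as in Figure 1.
    print(candidate_buckets(np.array([1.55, 2.45]), r=0.3))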
3.2.4 Summary

The original concept of LSH [11] was introduced in Section 2. In E2LSH [16] a high-dimensional feature vector is first projected into a sub-feature space by a group of locality sensitive functions, and hash values are then calculated from the sub-features. ELSM is similar to the first half of E2LSH; however, in ELSM the sub-feature vectors are not turned into hash values. They are used directly as new features for exhaustive KNN searching. The ELSM feature is logically related to the number of hash instances but is never mapped to integer hash values. We also propose the SoftLSH scheme as a variation of LSH: it quantizes the ELSM feature into integer hash values and uses the ELSM feature to accurately locate the searching region.

All these LSH variants solve the Approximate Nearest Neighbor problem in a Euclidean space. E2LSH [16] makes LSH more efficient for retrieval with very high dimensional features: it performs locality sensitive dimension reduction to obtain projections of the feature in different low-dimensional sub-spaces, and with multiple hash tables in parallel the retrieval accuracy can be guaranteed while the retrieval speed is accelerated. Panigrahy [17] considered the distance d(p,q) between the query q and its nearest neighbor p in the query stage of the LSH scheme: by selecting a random point p' at distance d(p,q) from q and checking the bucket that p' hashes to, the entropy-based LSH scheme ensures that all buckets which contain p with high probability are probed. An improvement of this scheme by multi-probing was proposed in [9], where minor adjustments of the integer hash values are made to find the buckets that may contain the point p. According to Eq.(5), when the feature summaries of two tracks are similar, their non-quantized hash values are also similar. Instead of probing, our SoftLSH scheme uses the ELSM feature to accurately locate all the buckets that intersect the neighborhood determined by the query.

4. Experimental setup

Our music collection includes 5275 tracks that fall into five non-overlapping datasets. Trains80 is collected from www.yyfc.com (a non-commercial amusement website where users can sing their favorite songs, record them online, and share them with friends) and from our personal collections. It consists of 80 pairs of tracks: 40 pairs each contain two versions of the same song, while each of the other 40 pairs contains two different songs. These 160 tracks are used to train the weights of the regression model proposed in Section 3.1. Covers79 is also collected from www.yyfc.com and consists of 79 popular Chinese songs, each represented by different versions (sung by different people with similar background music). Each song has 13.5 versions on average, for a total of 1072 audio tracks.

RADIO is from www.shoutcast.com and ISMIR is collected from http://ismir2004.ismir.net/genre_contest. JPOP (Japanese popular songs) is from our personal collections. Covers79, ISMIR, RADIO and JPOP are used in the evaluation, altogether 5275 tracks, with the last three datasets serving as background audio files in the simulation.
Each track is 30 s long, in mono-channel wave format with 16 bits per sample and a sampling rate of 22.05 kHz. The audio data is normalized and then divided into overlapping frames. Each frame contains 1024 samples and adjacent frames overlap by 50%. Each frame is weighted by a Hamming window and appended with 1024 zeros to fit the FFT length (2048 points). From the FFT result the instantaneous frequencies are extracted and the Chroma is calculated; from the amplitude spectrum the pitch, MFCC and Mel-magnitudes are calculated. The summary is then computed over all frames.
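A short sketch of the frame segmentation described above (1024-sample frames, 50% overlap, Hamming window, zero-padding to a 2048-point FFT), using plain NumPy; the subsequent Chroma, pitch, MFCC and Mel-magnitude computations are omitted, and the random signal merely stands in for a real track.

    import numpy as np

    def frame_spectra(samples, frame_len=1024, fft_len=2048):
        """Split a mono signal into 50%-overlapping frames, apply a Hamming window,
        zero-pad each frame to fft_len and return the magnitude spectra."""
        hop = frame_len // 2
        window = np.hamming(frame_len)
        spectra = []
        for start in range(0, len(samples) - frame_len + 1, hop):
            frame = samples[start:start + frame_len] * window
            padded = np.pad(frame, (0, fft_len - frame_len))   # append 1024 zeros
            spectra.append(np.abs(np.fft.rfft(padded)))
        return np.array(spectra)

    # 30 s of (random) audio at 22.05 kHz as a stand-in for a real track.
    signal = np.random.default_rng(3).standard_normal(30 * 22050)
    print(frame_spectra(signal).shape)   # (number of frames, 1025)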
The ground truth is set up according to human perception. We listened to all the songs and manually labeled them, so that the retrieval results of our algorithms correspond to human perception and support practical application. The Trains80 and Covers79 datasets were divided into groups according to their verse (the main theme represented by the song lyrics) to judge whether tracks belong to the same group; one group represents one song, and the different versions of that song are the members of the group. The 30 s segments in these two datasets are extracted from the verse sections of the songs.
5. Evaluation

In this section we evaluate the searching schemes KNN, ELSM, LSH and SoftLSH, give the corresponding analysis, and demonstrate their potential application to query-by-content musical audio retrieval. All schemes are based on the FU feature. KNN is an exhaustive search, while LSH represents quantization into hash buckets. KNN achieves the highest recall and precision (upper bound); LSH has the least retrieval time (lower bound). We hope our algorithms will approach KNN in performance while retaining almost the same retrieval time as LSH. Our task is mainly cover song detection, or near-duplicate detection of audio files, similar to [2][5][19][21]; our methods can be extended to query-by-example audio-based retrieval problems.

The dataset Covers79 is embedded in the evaluation set of 5275 tracks (Covers79+ISMIR+RADIO+JPOP). The whole evaluation set spans a broad range of music genres (classical, electronic, metal, rock, world, etc.). With each track in Covers79 as the query in turn, we calculate the ranked tracks similar to the query. Each query q chosen from Covers79 has its relevant set S_q (perceptually similar songs), determined by the number of audio cover tracks in its group. The average size of a query's relevant set is 12.5 (on average each song in Covers79 has 13.5 covers; when one cover is used as the query, the remaining covers are in the database). The total number of relevant items can be calculated from each group (a theoretical maximum of 14452). The retrieval system outputs the retrieved set K_q. To evaluate the performance of the algorithms, recall and precision are defined as |S_q intersect K_q| / |S_q| and |S_q intersect K_q| / |K_q| respectively, and the F-measure is defined as 2 * recall * precision / (precision + recall).
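The following sketch computes the three measures exactly as defined above, with the relevant set S_q and the retrieved set K_q represented as Python sets; the example numbers are made up.

    def retrieval_metrics(relevant, retrieved):
        """Recall, precision and F-measure for one query, as defined in the text."""
        hits = len(relevant & retrieved)
        recall = hits / len(relevant) if relevant else 0.0
        precision = hits / len(retrieved) if retrieved else 0.0
        f = (2 * recall * precision / (precision + recall)) if hits else 0.0
        return recall, precision, f

    # Made-up example: 12 relevant covers, 10 tracks returned, 8 of them correct.
    s_q = set(range(12))
    k_q = set(range(8)) | {100, 101}
    print(retrieval_metrics(s_q, k_q))   # (0.666..., 0.8, 0.727...)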
[Figure 3. Recall under different number of hash instances (2 to 14), for KNN, ELSM, LSH and SoftLSH.]

[Figure 4. Average retrieval time (s) under different number of hash instances, for KNN, ELSM, LSH and SoftLSH.]

Figure 3 shows the recall of the four schemes. KNN is the gold standard given this feature representation and always performs best; LSH is always inferior to the newly proposed schemes. When there are very few hash instances, the ELSM feature has a low dimension and cannot represent the FU well, so the performance of ELSM and SoftLSH is poor compared with KNN. As the number of hash instances increases from 2 to 6, the recall of both ELSM and SoftLSH rises correspondingly and their curves approach the KNN performance. The recall of SoftLSH is quite close to that of ELSM, which shows that searching the neighborhood of the query's hash values performs almost the same as an exhaustive search. The gap between KNN and ELSM/SoftLSH also decreases as more hash instances are used. The recall, however, does not increase linearly: its slope approaches 0, and further increasing the number of hash instances yields diminishing returns.
When the number of hash instances is greater than 10, the gap between ELSM/SoftLSH and KNN is almost constant, which means that the information loss from using a lower-dimensional feature cannot be recovered by adding hash instances. With 10 hash instances, 0.682*14452 relevant tracks are identified by KNN, 0.633*14452 by ELSM, 0.445*14452 by LSH and 0.625*14452 by SoftLSH.

Figure 4 shows the average retrieval time per query. The exhaustive KNN always takes the longest time (0.542 s). The time consumed by the other three schemes gradually increases with the number of hash instances. The average retrieval time of SoftLSH is about twice that of LSH because it searches the multiple buckets that intersect the query's neighborhood. From Figures 3 and 4, the tradeoff between accuracy and time indicates that 10 hash instances is a suitable choice: SoftLSH then has a recall close to KNN at a much lower retrieval time, whereas the additional time saved by LSH would cause a significant drop in accuracy. Therefore the number of hash tables is set to 10 in the following experiments.

[Figure 5. Precision-recall curve (10 hash instances) for KNN, ELSM, LSH and SoftLSH.]

[Figure 6. F-measure at different numbers of retrieved tracks for KNN, ELSM, LSH and SoftLSH.]

Figure 5 shows the precision-recall curve obtained by adjusting the number of system outputs. As expected, at the same recall KNN always has the highest precision and LSH the lowest. Some perceptually similar tracks have quite different features and can only be retrieved when KNN returns many tracks; therefore the precision of KNN decreases quickly once recall reaches about 0.7. ELSM and SoftLSH approach the performance of KNN, but at the same precision they lose about 4% in recall compared with KNN because they use a lower-dimensional feature. The number of hash instances is fixed at 10 in this experiment. Some tracks cannot be retrieved by the LSH scheme at all: its recall is upper-bounded at 0.5, and a higher recall would require many more hash instances in LSH than in SoftLSH.

Figure 6 shows the F-measure scores of the four schemes for different numbers of retrieved tracks. LSH always performs worst. KNN performs slightly better than ELSM and SoftLSH, at the cost of a much longer search time, as shown in Figure 4. Note that when the number of retrieved tracks is smaller than the number of the query's covers in the database, increasing the number of retrieved tracks yields an almost linear increase in recall and only a small decrease in precision, so the F-measure rises quickly. Once the number of retrieved tracks exceeds the actual number of covers, the slopes of the recall curves of all schemes flatten while precision keeps decreasing. In this experiment each query has on average 12.5 covers in the database, and indeed in Figure 6 the curves of KNN, ELSM and SoftLSH reach their maximal F-measure when the number of returned songs equals 12. This shows that the FU feature is very effective in representing the similarity of tracks within each group: tracks belonging to the same group, which really have a short distance, quickly appear in the returned list, whereas not-so-similar tracks have a relatively large distance, and retrieving too many tracks only lowers precision and F-measure. It also confirms that SoftLSH is a good alternative to KNN.

6. Conclusion

Both the representation and the organization of audio files play important roles in audio content detection. In this paper we have considered both the semantic summarization of audio documents and hash-based approximate retrieval, with the purpose of reducing retrieval time and improving retrieval quality. By a new principle of similarity-invariance, a concise audio feature representation (FU) is generated based on multivariable regression. Associated with the FU, variants of LSH (ELSM and SoftLSH) are proposed.
Different from conventional LSH schemes, soft hash values are exploited to accurately locate the searching region and improve the retrieval quality without requiring many hash instances. The proposed retrieval schemes can easily be adapted to other applications (e.g., video, bio-informatics) with little effort.

We experimentally show the efficacy of our algorithms by evaluation on multi-version music cover datasets, adopting human perception as the quality measure. As expected, our results demonstrate that (i) the FU feature is a good summary of an audio sequence and (ii) SoftLSH achieves a better balance between retrieval time and accuracy than conventional LSH and KNN. There is still room for improvement: in the future we will study semantic features that better represent melody information and other training models that best combine the feature groups.

Acknowledgment

We thank the Initiative Project of Nara Women's University for supporting the first author's visit to IMIRSEL, where this work was partly discussed in summer 2007. The second author was supported by the Andrew W. Mellon Foundation and the National Science Foundation (NSF) under Nos. IIS-0340597 and IIS-0327371.

References

[1] B. Cui, J. Shen, G. Cong, H. Shen, and C. Yu, "Exploring Composite Acoustic Features for Efficient Music Similarity Query", ACM MM'06, pp. 634-642, 2006.

[2] M. Robine, P. Hanna, P. Ferraro, and J. Allali, "Adaptation of String Matching Algorithms for Identification of Near-Duplicate Music Documents", SIGIR Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, 2007.

[3] J. F. Serrano and J. M. Inesta, "Music Motive Extraction through Hanson Intervallic Analysis", CIC'06, pp. 154-160, 2006.

[4] I. Karydis, A. Nanopoulos, A. N. Papadopoulos, and Y. Manolopoulos, "Audio Indexing for Efficient Music Information Retrieval", MMM'05, pp. 22-29, 2005.

[5] W. H. Tsai, H. M. Yu, and H. M. Wang, "A Query-by-Example Technique for Retrieving Cover Versions of Popular Songs with Similar Melodies", ISMIR 2005, pp. 183-190.

[6] C. Yang, "Efficient Acoustic Index for Music Retrieval with Various Degrees of Similarity", ACM Multimedia, pp. 584-591, 2002.

[7] T. Pohle, M. Schedl, P. Knees, and G. Widmer, "Automatically Adapting the Structure of Audio Similarity Spaces", Proceedings of the 1st Workshop on Learning the Semantics of Audio Signals (LSAS), pp. 66-75, 2006.

[8] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.

[9] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-Probe LSH: Efficient Indexing for High Dimensional Similarity Search", Proc. of the Very Large Data Bases Conference (VLDB), pp. 950-961, 2007.

[10] N. Bertin and A. Cheveigne, "Scalable Metadata and Quick Retrieval of Audio Signals", ISMIR 2005, pp. 238-244.

[11] P. Indyk and R. Motwani, "Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality", Proc. of the 30th Annual ACM Symposium on Theory of Computing, pp. 604-613, 1998.

[12] M. Henzinger, "Finding Near-Duplicate Web Pages: a Large-Scale Evaluation of Algorithms", Proc. of the 29th Conference on Research and Development in Information Retrieval, 2006.

[13] J. Reiss, J. J. Aucouturier, and M. Sandler, "Efficient Multi Dimensional Searching Routines for Music Information Retrieval", 2nd ISMIR, 2001.

[14] S. Hu, "Efficient Video Retrieval by Locality Sensitive Hashing", ICASSP 2005, pp. 449-452.

[15] P. Indyk and N. Thaper, "Fast Color Image Retrieval via Embeddings", Workshop on Statistical and Computational Theories of Vision (ICCV), 2003.

[16] LSH Algorithm and Implementation (E2LSH), http://web.mit.edu/andoni/www/LSH/index.html.

[17] R. Panigrahy, "Entropy Based Nearest Neighbor Search in High Dimensions", Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.

[18] M. Lesaffre and M. Leman, "Using Fuzzy to Handle Semantic Descriptions of Music in a Content-based Retrieval System", Proc. LSAS 2006, pp. 43-5, 2006.

[19] J. P. Bello, "Audio-based Cover Song Retrieval Using Approximate Chord Sequences: Testing Shifts, Gaps, Swaps and Beats", ISMIR 2007, pp. 239-244.

[20] G. Tzanetakis and P. Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, pp. 293-302, 2002.

[21] D. Ellis and G. Poliner, "Identifying Cover Songs with Chroma Features and Dynamic Programming Beat Tracking", Proceedings of ICASSP 2007, Vol. 4, pp. 1429-1432, 2007.

[22] R. Miotto and N. Orio, "A Methodology for the Segmentation and Identification of Music Works", ISMIR 2007, pp. 239-244.
