
DOI 10.7603/s40601-014-0015-7
GSTF Journal on Computing (JoC) Vol.4 No.3, October 2015

Audio Music Monitoring: Analyzing Current Techniques for Song Recognition and Identification

E.D. Nishan W. Senevirathna and Lakshman Jayaratne

Received 20 Jul 2015  Accepted 13 Aug 2015

Abstract— When people are attached to or interested in something, they usually try to interact with it frequently. Music has been attached to people since the day they were born. As music repositories grow, people face many challenges, such as finding a song quickly, categorizing and organizing a collection, and listening to a song again whenever they want. Because of this, people tend to look for electronic solutions. To index music, most researchers use content-based information retrieval, since content-based classification needs no additional information beyond the audio features embedded in the signal. It is also the most suitable way to search for music when the user does not know the metadata attached to it, such as the author of the song. The most valuable application of this kind of audio recognition is copyright infringement detection. Throughout this survey we present the approaches proposed by various researchers to detect and recognize music using content-based mechanisms, and we conclude by analyzing the current status of the field.

Keywords— audio fingerprint; feature extraction; wavelets; broadcast monitoring; audio classification; audio identification.

I. INTRODUCTION

Music repositories in the world are increasing exponentially, and new artists can enter the field easily with new technologies. Once we listen to a new song, we cannot find it again easily if we do not know its metadata, such as the author or singer. The most common method of accessing music is through textual metadata, but this no longer functions properly against huge music collections. In the field of audio music recognition, the key considerations are the following:

- Can we find an unknown song using a small part of it or by humming the melody?
- Can we organize and index songs without metadata such as the singer of the song?
- Can we detect copyright infringement, for example after a song has been broadcast on a radio channel?
- Can we identify a cover song when multiple versions exist?
- Can we obtain a statistical report about the songs broadcast on a radio channel without a manual monitoring process?

The above considerations motivate researchers to find proper solutions for these challenges. Many ideas have been proposed and some of them have been implemented; Shazam is one example. However, this is still a challenging research area, since there is no optimal solution. The problem becomes even more complex when:

- the audio signal is altered by noise;
- the audio signal is polluted by unnecessary audio objects, such as advertisements in radio broadcasting;
- multiple versions of a song exist;
- only a small part of a song is available.

In any of the above situations the human auditory system can recognize the music, but providing an automated electronic solution is a very challenging task: the similarity between the original music and the query may be very small, and the similar features may be impossible to model mathematically. This means researchers also need to consider perceptual features in order to provide a proper solution. Feature extraction can be considered the heart of any of these approaches, since the accuracy of the whole system depends on how features are extracted.

The rest of this survey provides a broader overview and comparison of the proposed feature extraction methods, searching algorithms and overall solution architectures.

DOI: 10.5176/2251-3043_4.3.328


II. CLASSIFICATIONS (RECOGNITION) VS. IDENTIFICATIONS


What is the different between audio recognition
(classification) and identification? In audio classification,
audio object will be classified into pre-defined sets like song,
advertisement, vocals etc. but they are not identified further.
Ultimately we know that this is a song or advertisement but
we dont know what that song is! Audio classification is less
complex than recognition. Most of the time, we can see that
these two things are combined each other in order to get better
result. For an example, in audio song recognition system, first
we can extract only songs among collection of other audio
objects using audio classifier and output will be fed in to the
audio recognition system. Using that kind of approach we can
get better result by narrow downing the search space. There
are more proposed audio classification approaches. Some of
them will be discussed in next sub section.
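This two-stage pipeline can be sketched in a few lines of Python; classify_clip and identify_song below are hypothetical stubs standing in for an audio classifier and an identifier, not functions from any specific system.

    from typing import Optional

    def classify_clip(clip: bytes) -> str:
        """Coarse stage (stub): a classifier returning 'song', 'speech', 'ad', ..."""
        raise NotImplementedError

    def identify_song(clip: bytes) -> Optional[str]:
        """Fine stage (stub): a fingerprint look-up returning a song title, or None."""
        raise NotImplementedError

    def recognize(clip: bytes) -> Optional[str]:
        # Only objects classified as songs reach the (expensive) identification
        # stage; this is how the classifier narrows down the search space.
        if classify_clip(clip) != "song":
            return None
        return identify_song(clip)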
A. Audio classifications
1) Overview
There is a considerable number of real-world applications for audio classification. For example, it is very helpful to be able to automatically search sound effects, such as explosions, windstorms, earthquakes and animal sounds, in the very large audio databases used in film post-processing [1]. Audio content analysis and classification is also useful for audio-assisted video classification: for example, all videos of gun-fight scenes should include the sound of shooting and/or explosions, but the image content may vary significantly from one scene to another.

When classifying audio content, different classes have to be considered, and most researchers have started by separating speech and music. However, the classes depend on the situation. For example, "music", "speech" and "others" can be considered for the parsing of news stories, whereas an audio recording can be classified into "speech", "laughter", "silence" and "non-speech" for the purpose of segmenting discussion recordings in meetings [1]. In any of these cases, we have to extract some sort of audio features. This is the challenging part, and it is the point on which past research differs. We can, however, treat feature extraction for audio classification and feature extraction for audio identification separately, since most of the time these two cases consider disjoint feature sets [7].
2) Feature extraction of audio classification
Most of the time, the output of audio classification is the input of audio identification. This reduces the search space, speeds up the process and helps retrieve better results. In most of the research, audio classification is broken down into further steps. In [1] two steps are used: in the first stage, the audio signal is segmented and classified into basic types, including speech, music, several types of environmental sounds, and silence. They call this the coarse-level classification. In the second stage, further classification is conducted within each basic type. Speech is differentiated into the voices of men, women and children, as well as speech with a music background and so on. Music is classified according to the instruments or types (for example, classics, blues, jazz, rock and roll, music with singing and plain song). Environmental sounds are classified into finer classes such as applause, bell rings, footsteps, windstorms, laughter, birds' cries, and so on. They call this the fine-level classification. The overall idea is to reduce the search space step by step in order to get better results. We can also use a feature extraction mechanism appropriate to each finer-level class based on its basic type: due to differences in the origin of the three basic types of audio, i.e. speech, music and environmental sounds, different approaches can be taken in their fine classification. Most researchers have used low-level (physical, acoustic) features such as the spectral centroid or Mel-frequency coefficients, but end users may prefer to interact at a higher semantic level [2]; for example, they may want to find a dog-barking sound rather than "environmental sounds". However, low-level features can be extracted with signal processing more easily than high-level (perceptual) features.
Most researchers have used the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM) as the pattern recognition tool; these are widely used and very powerful statistical tools in pattern recognition. To use them, we have to extract distinctive features. Audio features can be grouped into two or more sets. Most researchers group all audio features into two groups: physical (or mathematical) features and perceptual features. Physical features are directly extracted from the audio wave, such as the energy of the wave, frequency, peaks, average zero crossings and so on. These features cannot be identified by the human auditory system. Perceptual features, in contrast, are the features humans can understand, like loudness, pitch, timbre, rhythm and so on. Perceptual features cannot easily be modeled by mathematical functions, but they are very important audio features, since humans use them to differentiate audio.

Sometimes, however, audio features are classified into hierarchical groups with similar characteristics [12]. In [12] all audio features are divided into six main categories; refer to Figure 1.


Figure 1. High-level audio feature classification [12].

However, no one can define an audio feature and its category exactly, since there is no broad consensus on the allocation of features to particular groups. The same feature may be classified into two different groups by two different researchers, depending on the authors' viewpoints. The features shown in Figure 1 can be further classified into several groups by considering the structure of each feature.

Considering the structure of temporal-domain features, [12] classifies them into three sub-groups: amplitude-based, power-based, and zero-crossing-based features. Each of these features relates to one or more physical properties of the wave; refer to Figure 2.

Figure 2. The organization of features in Temporal Domain [12].

Here, some researchers have defined the zero crossing rate (ZCR) as a physical feature. Frequency-domain features are very important, and most researchers consider only the frequency domain. Next we look at the frequency-domain feature classification done by [12]; refer to Figure 3.

Sometimes the other four main feature categories are further sub-classified as well, but those divisions are less important. Next we will see the main characteristics of the major features.

Figure 3. The organization of features in Frequency Domain [12]

a) Temporal (Raw) Domain features

Most of the time, we cannot extract features without altering the native audio signal, but there are several features which can be extracted from the native signal directly; these are known as temporal features. Since we do not alter the native signal, this is a very low-cost feature extraction methodology, but using these features alone we cannot uniquely identify audio music.

The zero crossing rate is the main temporal-domain feature. It is a helpful, low-cost feature which is often used in audio classification. Usually we define it as the number of zero crossings in the temporal domain within one second; it is a rough estimate of the dominant frequency and the spectral centroid [12]. Sometimes we obtain the ZCR after altering the audio signal slightly: in this case we extract frequency information and the corresponding intensity-scaled sub-bands from the time-domain zero crossings. This gives a more stable measurement and is very helpful in noisy environments: noise is always spread around the zero axis but does not create a considerable number of peaks, so the peak-related zero crossing rate remains unchanged.

Amplitude-based features are another example of temporal-domain features. We can obtain them by computing them directly from the audio waveform. They are again a good measurement, but they are subject to change even when the audio signal is altered a little by unwanted noise effects.

Power measurement is also a raw-domain feature which is almost the same as the amplitude-based features: the power, or energy, of a signal is the square of the amplitude represented by the waveform. Volume is a well-known power measurement feature; it is widely used in silence detection and speech/music segmentation.
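As a rough illustration of these temporal features, the following NumPy sketch frames a signal (using the 256-sample, 25%-overlap convention quoted later in this survey) and computes the per-frame zero crossing rate and short-time energy; the test tone is an arbitrary example input.

    import numpy as np

    def frame_signal(x, frame_len=256, hop=192):
        """Split a 1-D signal into overlapping frames (25% overlap at 256 samples)."""
        starts = range(0, len(x) - frame_len + 1, hop)
        return np.stack([x[s:s + frame_len] for s in starts])

    def zero_crossing_rate(frames):
        """Fraction of adjacent sample pairs whose sign changes, per frame."""
        return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    def short_time_energy(frames):
        """Sum of squared amplitudes per frame (the square of the amplitude)."""
        return np.sum(frames.astype(float) ** 2, axis=1)

    sr = 8000
    tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s of a 440 Hz tone
    frames = frame_signal(tone)
    print(zero_crossing_rate(frames)[0], short_time_energy(frames)[0])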


b) Physical features

Most audio features are obtained from the frequency domain, since almost all features live in this domain. Before extracting frequency-domain features, we have to transform the base signal into some other representation. Several methods can be used for this; the most popular are the Fourier transform and the autocorrelation, and other popular methods are the cosine transform, the wavelet transform, and the constant-Q transform [12]. Frequency-domain features can be categorized into two major classes: physical features and perceptual features. Physical features are defined using physical characteristics of the audio signal which have no semantic meaning. Next we discuss the mainly used physical features, and then the perceptual features.

Auto-regression-based features: in statistics and signal processing, an autoregressive (AR) model is a representation of a type of random process; as such, it describes certain time-varying processes in nature, economics, etc. [18]. This is a widely used standard technique for speech/music discrimination, and it can be used to extract basic parameters of a speech signal, such as formant frequencies and the vocal tract transfer function [18]. Sometimes this feature group is divided further into two sub-groups, linear predictive coding (LPC) and line spectral frequencies (LSF), but we will not discuss these sub-groups in detail here.
Short-Time Fourier Transform-based features (STFT): this is another widely used family of audio features based on the audio spectrum. The STFT can be used to obtain the characteristics of both the frequency component and the phase component. There are several features under the STFT, such as Shannon entropy, Rényi entropy, spectral centroid, spectral bandwidth, spectral flatness measure, spectral crest factor and Mel-frequency cepstral coefficients [15].

Short-time energy function: the energy of an audio signal is measured by the amplitude of that signal, and the amplitude variation over time is called the energy function of the signal. For speech signals, it is a basis for distinguishing voiced speech components from unvoiced ones, as the energy function values for unvoiced components are significantly smaller than those of the voiced components [1].

Short-time average zero-crossing rate (ZCR): this feature is another measurement used to separate voiced speech components from unvoiced ones. Usually voiced components have a much smaller ZCR than unvoiced components [1].

Short-time fundamental frequency (FuF): using this feature we can find harmonic properties. Most musical instrument sounds are harmonic, and some sounds can be a mixture of harmonic and non-harmonic content. This feature can also be used to classify audio objects [1].

Spectral Flatness Measure (SFM): an estimation of the tone-like or noise-like quality of a band in the spectrum [1]. It is widely used for audio classification.
There are some other widely used physical features. Mel-Frequency Cepstrum Coefficients (MFCC) are one. Papaodysseus et al. (2001) presented the "band representative vectors", an ordered list of indexes of bands with prominent tones (i.e. with peaks of significant amplitude). The energy of each band is used by Kimura et al. (2001). Normalized spectral sub-band centroids are proposed by Seo et al. (2005). Haitsma et al. use the energies of 33 Bark-scaled bands to obtain their hash string, which is the sign of the energy-band differences (both in the time and the frequency axis), and so on.
Most of the time, silent audio frames are identified early and are not passed on for further processing. There are several approaches to identifying or defining a silent frame; some researchers have used the ZCR property. In [4], silent frames are defined as follows. Before feature extraction, an audio signal (8-bit ISDN μ-law encoding) is pre-emphasized with parameter 0.96 and then divided into frames. Given the sampling frequency of 8000 Hz, the frames are of 256 samples (32 ms) each, with 25% (64 samples or 8 ms) overlap between adjacent frames. A frame is Hamming-windowed by $w_i = 0.54 - 0.46\cos(2\pi i/256)$, and it is marked as a silent frame if

$$\sum_{i=1}^{256} (s_i w_i)^2 < 400^2,$$

where $s_i$ is the pre-emphasized signal magnitude at $i$ and $400^2$ is an empirical threshold. Even though most researchers have used physical features, in order to give better results we have to consider perceptual features as well, since those are the features recognized by the human auditory system.
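A minimal NumPy sketch of this silence test might look as follows; it reads the summand as the windowed sample $(s_i w_i)^2$ and assumes sample magnitudes are on a scale where $400^2$ is a meaningful threshold.

    import numpy as np

    def silent_frames(signal, threshold=400.0, alpha=0.96, frame_len=256, hop=192):
        """Flag silent frames per [4]: pre-emphasis (0.96), 256-sample frames
        with 64-sample overlap, Hamming window, empirical threshold 400^2."""
        s = np.append(signal[0], signal[1:] - alpha * signal[:-1])   # pre-emphasis
        i = np.arange(frame_len)
        w = 0.54 - 0.46 * np.cos(2 * np.pi * i / frame_len)          # Hamming window
        flags = []
        for start in range(0, len(s) - frame_len + 1, hop):
            frame = s[start:start + frame_len] * w
            flags.append(np.sum(frame ** 2) < threshold ** 2)        # silence test
        return np.array(flags)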
Spectral peaks: this is a very important feature, since it is a noise-robust representation of the audio wave. Noise energy spreads across the whole spectrum and therefore has little effect on the peaks. This feature is mainly used to create a unique fingerprint from a small segment of an audio clip captured by a mobile phone or some other device. The strength of the technique is that it relies solely on the salient frequencies (peaks) and rejects all other spectral content [12].


c) Perceptual features

How can human beings recognize audio? They use audio features which are sensitive to the human auditory system; those features are known as perceptual features. Usually these features cannot be extracted easily, since they cannot easily be modeled mathematically. However, there have been a number of research attempts in this area, whose final goal is to model perceptual features mathematically. In [5], statistical values (including means, variances, and autocorrelations) of several time- and frequency-domain measurements were used to represent perceptual features such as loudness, brightness, bandwidth and pitch; this method is only suitable for sounds with a single timbre. Beyond that, the following are the most commonly used perceptual features. According to past research, perceptual features can be grouped into six groups; refer to Figure 4.

Figure 4. The organization of perceptual features in the frequency domain [12].

Brightness: the word "brightness" is more familiar in the context of illumination, where we usually measure the brightness of illuminated surfaces such as LCD monitors: if the illumination is very high we call it high brightness, otherwise low brightness. Likewise, we can define audio brightness using frequency instead of illumination: a sound becomes brighter as the high-frequency content becomes more dominant and the low-frequency content becomes less dominant [12].

Most of the time, we measure brightness as the spectral centroid (SC). It indicates where the "center of mass" of the spectrum is. Perceptually, it has a robust connection with the impression of "brightness" of a sound. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights [19].
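For illustration, the spectral centroid of one frame can be computed directly from the FFT magnitudes, as in this small sketch:

    import numpy as np

    def spectral_centroid(frame, sr=8000):
        """Magnitude-weighted mean of the frequencies present in one frame."""
        mags = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        return np.sum(freqs * mags) / np.sum(mags)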
Tonality: this is an important audio feature for distinguishing noise-like sounds from other sounds. Tonal sounds typically have line spectra, whereas noise-like sounds have continuous spectra. Usually tonality is measured by bandwidth and/or flatness.

Bandwidth is usually defined as the magnitude-weighted average of the differences between the spectral components and the spectral centroid. As tonal sounds typically have line spectra, their component variation around the SC is low, which means tonal sounds have a lower bandwidth than noise-like sounds. There are several other feature classes which measure the tonality of an audio signal, such as spectral dispersion, spectral roll-off point, spectral crest factor, sub-band spectral flux (SSF) and entropy. More details on each feature can be found in [12].
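The bandwidth measure defined above can be sketched the same way, reusing the centroid computation:

    import numpy as np

    def spectral_bandwidth(frame, sr=8000):
        """Magnitude-weighted average deviation of frequencies from the centroid."""
        mags = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        centroid = np.sum(freqs * mags) / np.sum(mags)
        return np.sum(np.abs(freqs - centroid) * mags) / np.sum(mags)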
Loudness: we use this feature in our day-to-day life; it is the characteristic of a sound that is primarily a psychological correlate of physical strength (amplitude). More formally, it is defined as "that attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud" [20], or equivalently on a scale extending from soft to loud [12]. By this definition it is a widely used perceptual feature, and it can be extracted more easily than other perceptual features.

Pitch: again, this is an audio feature closely tied to the human auditory system, like loudness. Pitch is a basic dimension of audio, defined together with loudness, duration, and timbre. In past research this feature was widely used for genre classification and audio identification. Pitches are compared as "higher" and "lower" in the sense associated with musical melodies, which requires sound whose frequency is clear and stable enough to distinguish from noise [21].
Chroma: this feature is an interesting and powerful representation of audio. Any tone belongs to one of the musical octaves, and to place a tone within an octave we use its tone height: if the tone height is low, it may belong to class C, and if it is very high, it could belong to class B. Another measurement of a tone's class is called chroma: the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave. Since, in music, notes exactly one octave apart are perceived as particularly similar, knowing the distribution of chroma, even without the absolute frequency (i.e. the original octave), can give useful musical information about the audio [14].
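A simple sketch of this projection, mapping each FFT bin to its pitch class relative to an assumed A4 = 440 Hz reference:

    import numpy as np

    def chroma_vector(frame, sr=8000, f_ref=440.0):
        """Project FFT magnitudes of one frame onto the 12 pitch-class bins."""
        mags = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        chroma = np.zeros(12)
        for f, m in zip(freqs[1:], mags[1:]):              # skip the DC bin
            pc = int(round(12 * np.log2(f / f_ref))) % 12  # pitch class, octave folded
            chroma[pc] += m
        total = np.sum(chroma)
        return chroma / total if total > 0 else chroma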
Harmonicity: this is a property that distinguishes periodic signals (harmonic sounds) from non-periodic signals (inharmonic and noise-like sounds). Harmonics are frequencies at integer multiples of the fundamental frequency [12]; i.e. if the fundamental frequency is f, the harmonics have frequencies 2f, 3f, 4f, etc. Harmonic frequencies are equally spaced by the width of the fundamental frequency and can be found by repeatedly adding that frequency. As a practical example, the nodes of a vibrating string are harmonics; refer to Figure 5.

Most of the time spectral features are used; however, depending on the targeted research or system, features are extracted from one or many categories. For example, the continuous feature category is the most suitable one for emotion detection [13]. Usually, some preprocessing has to be done before perceptual feature extraction in order to extract the perceptual features more accurately. Even though there are several classifications, the basic features remain unchanged (the class they fall into may vary).
3) Applications of audio classification
As already mentioned, the major application of audio classification is audio identification: audio identification systems use the output of audio classification systems as their input, i.e. the pre-processing part of an audio identification system is done by an audio classification system. This reduces the search space, and through this approach we can provide efficient and accurate audio identification systems. The following are some other applications of audio classification.

Figure 5. Nodes of a vibrating string are harmonics [11]

According to the past literature, there are several other, different audio feature classifications as well. For completeness, we next look briefly at other definitions. Acoustical speech features reported in the literature can be grouped as shown in Figure 6 [13]. Existing systems use a number of integrated continuous, qualitative and spectral features, as well as Teager energy operator (TEO)-based features.

1. Genre classification: a music genre is a conventional category that identifies pieces of music. There are several well-known categories such as Pop, Rock, Jazz, Hip hop, etc. The audio classification methodologies discussed here are heavily used in genre classification.
2. Automatic emotion recognition: it is well known that human speech contains not only the linguistic content, but also the emotion of the speaker. Emotion may play a key role in many applications: in entertainment electronics to capture emotional user behavior, in automatic speech recognition to resolve how something was said rather than what was said, and in text-to-speech systems to synthesize emotionally more natural speech [13]. Audio classification approaches are widely used in such systems.
3. Indexing video content: much research now uses the audio channel of video files to index or classify video objects. For example, if there are frequent gunfire or explosion sounds in a video object, it can be classified as a war scene.

Figure 6. Examples of acoustical features reported in the literature, grouped into four categories [13].


Figure 7. General flow of the audio identification process.

B. Audio Identification
1) Overview
Audio identification is a very challenging task compared to audio classification, since we have to specifically match an unknown audio object against thousands of pre-installed audio objects, whereas in audio classification we classify an audio object into a small number of pre-defined classes. As discussed earlier, most researchers have joined the two together in order to get better results: first we classify the unknown audio object and identify its class, and then we match it against the other pre-installed objects in the same class. By doing this we speed up the process, omitting the unrelated classes of audio objects, and we obtain better results.

In this section we focus only on the identification part. Based on the past literature, we can give a high-level overview of the overall process used by most researchers; see Figure 7. Here the feature extraction part is exactly the same as the feature extraction for audio classification, which we have already discussed. The key things to discuss are how to create the audio archive and the searching mechanisms; these two things are discussed later in detail. Apart from that, almost all researchers have divided the audio object into sets of overlapping frames, and the reason for doing so is important: usually we have to identify an audio object, such as a song, when only a small part of it is presented, and this small part can come from any place in the original track. In this case we do not know the offset of that part, and framing addresses this problem; see Figure 8.

Figure 8. Dividing an audio object into a set of overlapping frames. This approach makes it possible to identify any small part of an unknown audio object, extracted from anywhere in the original source.

Referring to Figure 8, we can see sets of frames and overlapping areas, and according to past research we can interpret the image as follows. The sampling frequency and frame size can differ from one study to another, but most research has used the following parameter values: given a sampling frequency of 8000 Hz, the frames are of 256 samples (32 ms) each, with 25% (64 samples or 8 ms) overlap between adjacent frames, and each frame is Hamming-windowed by $w_i = 0.54 - 0.46\cos(2\pi i/256)$ [9].


2) Audio identification methodologies


This is the most important part of audio identification. Here we discuss the mechanisms proposed for creating an archive of audio. In other words, we now have various audio features which should be stored in a database: what approaches have been proposed to convert a set of audio features into an audio fingerprint?
a) Audio fingerprint

Audio fingerprinting is best known for its ability to link unlabeled audio to the corresponding metadata (e.g. artist and song name), regardless of the audio format. It is a very powerful and widely used method, and its main advantage is its independence from the audio format. What is meant by a fingerprint? It is a unique representation of an object: just as a human fingerprint can be used to identify a human, an audio fingerprint can be used to identify an audio object uniquely. An audio file is just a binary file, nothing more, but if we use this digital file as the fingerprint we face several problems: an unknown audio object may be only a part of the original, or it may be partially corrupted, and in such cases we cannot get a unique fingerprint. Therefore, direct comparison of the digitized waveform is neither efficient nor effective. To alleviate these issues, we usually split the audio object into sets of overlapping frames, as discussed earlier, and generate a fingerprint for each frame. But again we cannot use the digitized waveform of a frame as the fingerprint. First we extract one or more desired audio features, as discussed in Section 2.2; then we join those feature values in some specific manner, which differs from one study to another. At the end we obtain some sort of string representation of several features. A more efficient implementation of this approach can use a hash method, such as MD5 (Message Digest 5) or CRC (Cyclic Redundancy Checking), to obtain a compact representation of the combined features [2]. The downside of this representation is that a hash value is fragile: even a single-bit change is enough to produce a completely different hash value, so the fingerprint method cannot be used as-is for a robust implementation. As a whole, the fingerprinting model can be represented as in Figure 9.

Figure 9. The flow of the Audio Fingerprint model

The ideal fingerprint should have the following properties.
1. It should accurately identify an audio object.
2. Its representation should be robust against distortion or interference in the transmission channel.
3. It should yield a powerful fingerprint from only a few seconds of the audio object.
4. It should be computationally efficient.
5. The size of the fingerprints should be small.
6. Fingerprint extraction should have low complexity.

This method is less vulnerable to attack, since changing the fingerprint means altering the quality of the sound. Usually the fingerprint database will be very large, since many fingerprints have to be extracted from each audio object. Therefore we cannot use a traditional brute-force searching mechanism; instead, many researchers have used indexed look-up tables, which return results very quickly.
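A toy sketch of this scheme combines per-frame MD5 fingerprints with an indexed look-up table; the coarse integer quantization step is an illustrative assumption to blunt the hash fragility noted above, not a method from the cited papers.

    import hashlib
    from collections import defaultdict

    def fingerprint(features, step=10.0):
        """Quantize a frame's feature vector and hash it with MD5."""
        quantized = ",".join(str(int(v / step)) for v in features)
        return hashlib.md5(quantized.encode()).hexdigest()[:16]

    index = defaultdict(list)   # fingerprint -> list of (song_id, frame_offset)

    def add_song(song_id, frame_features):
        for offset, feats in enumerate(frame_features):
            index[fingerprint(feats)].append((song_id, offset))

    def query(frame_features):
        """Vote for the archived song whose frames match the query most often."""
        votes = defaultdict(int)
        for feats in frame_features:
            for song_id, _ in index.get(fingerprint(feats), []):
                votes[song_id] += 1
        return max(votes, key=votes.get) if votes else None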
b) Audio watermarking

An audio watermark is a message embedded in the audio object when it is recorded. According to [16], watermarking is the addition of some form of identifying mark that can be used to prove the authenticity or ownership of a candidate item. Embedding a watermark does not alter the perception of the song, and the title of a song can be identified by extracting the message embedded in the audio. Strictly speaking, this is not a content-based audio identification mechanism, since we do not examine the audio properties themselves; because of this, watermark-based identification is sometimes known as a blind detection method.

Dual Tone Multi-Frequency (DTMF) signaling, used in touch-tone and mobile telephony, is the origin of this watermarking approach. DTMF defines two tones, one for bit 1 and one for bit 0 [16]:

DTMF 1 tone: 697 Hz and 1209 Hz combined
DTMF 0 tone: 941 Hz and 1336 Hz combined

To reduce the data to be watermarked, the message can be encoded as the series of bit representations of its ASCII codes. Every character has a unique ASCII code, so any character can be represented as a pattern of pure sine waves using the combined DTMF frequencies for 1 and 0. This approach is represented in Figure 10.

Figure 10. The flow of the initial watermarking system.
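A small sketch of this encoding; the 8 kHz sampling rate and 50 ms bit duration are arbitrary choices for illustration.

    import numpy as np

    SR = 8000                                            # assumed sampling rate
    DTMF = {"1": (697.0, 1209.0), "0": (941.0, 1336.0)}  # tone pairs from [16]

    def bit_tone(bit, duration=0.05, sr=SR):
        """One watermark bit as the sum of its two DTMF sine waves."""
        t = np.arange(int(duration * sr)) / sr
        f1, f2 = DTMF[bit]
        return np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

    def watermark_signal(text):
        """Encode a message as the DTMF bit pattern of its 8-bit ASCII codes."""
        bits = "".join(format(ord(c), "08b") for c in text)
        return np.concatenate([bit_tone(b) for b in bits])

    wm = watermark_signal("ID42")   # to be mixed into the track at low level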
However, an audio watermark can be tampered with, since it is not a property of the audio itself. We also have no option for already-released legacy audio objects such as songs. Furthermore, with this method we cannot identify two songs or audio objects with the same perceptual content when one of them lacks the watermark.
c) Using Neural Networks/SVM

The Support Vector Machine (SVM) is also a widely used approach; although SVMs are used mostly for audio classification rather than identification, we discuss them here for completeness. The SVM is a statistical learning algorithm for building classifiers, and it has been used to solve many practical problems such as face detection and three-dimensional (3-D) object recognition.

Again, features are extracted using the methods discussed earlier, and those features are used to train the classifier. Most of the time perceptual features are used, such as total power, sub-band powers, brightness, bandwidth and pitch, together with Mel-frequency cepstral coefficients (MFCCs). The means and standard deviations of the feature trajectories over all frames are then computed, and these statistics form the feature set for the audio sound. After that, a training vector set is created and used to train the SVM classifier. We will not discuss SVMs in detail here; more information can be found in [4][8][9][13].
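For completeness, here is a minimal scikit-learn sketch of such a classifier; the random arrays are placeholders for the per-clip feature statistics described above.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.random((100, 26))            # placeholder per-clip feature stats
    y_train = rng.integers(0, 2, 100)          # placeholder class labels

    clf = SVC(kernel="rbf")                    # train the SVM classifier
    clf.fit(X_train, y_train)
    print(clf.predict(rng.random((1, 26))))    # classify an unseen clip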

There are some other widely used neural-network-based methods as well, such as Nearest Neighbor (NN) and Nearest Feature Line (NFL) [5].

d) Auditory Zernike moment

All of the methods discussed so far share a major drawback: they work on raw (uncompressed) audio formats such as WAV. Nowadays, however, compressed audio formats such as MP3 have grown into the dominant way to store music on personal computers and to transmit it over the Internet [17]. It would therefore be very useful to recognize compressed audio directly, without decompressing it, and doing so will definitely be more efficient and more accurate. Very few attempts work in the compressed audio domain; this method is one of them. Like most identification methods, this approach also creates a fingerprint at the end, but the way it is built is considerably different from the others.

The Zernike moment feature is used in image-processing techniques such as image recognition, image watermarking, human face recognition and image analysis, due to its prominent properties of strong robustness and rotation, scale, and translation (RST) invariance. Because of these properties, researchers have been motivated to use Zernike moments for audio information retrieval as well.

According to past research, four kinds of compressed-domain features have been used: modified discrete cosine transform (MDCT) spectral coefficients, MFCC, MPEG-7 descriptors, and chroma vectors extracted from the compressed MP3 bit stream. Zernike moments are defined using a rather complex set of polynomials; we will not discuss them in full detail (more information can be found in [17]), but for completeness we show how Zernike moments are obtained for an image. The following is extracted from [17].

As already mentioned, Zernike moments are defined over a set of polynomials which form a complete orthogonal basis on the unit disk $x^2 + y^2 \le 1$. These polynomials have the form

$$V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta},$$


where $n$ is a non-negative integer, $m$ is a non-zero integer subject to the constraints that $(n - |m|)$ is non-negative and even, $\rho$ is the length of the vector from the origin to the pixel $(x, y)$, and $\theta$ is the angle between that vector and the x-axis in the counter-clockwise direction. $R_{nm}(\rho)$ is the Zernike radial polynomial in $(\rho, \theta)$ polar coordinates, defined as

$$R_{nm}(\rho) = \sum_{k=0}^{(n-|m|)/2} \frac{(-1)^k (n-k)!}{k!\left(\frac{n+|m|}{2}-k\right)!\left(\frac{n-|m|}{2}-k\right)!}\, \rho^{n-2k}.$$

Note that $R_{n,m}(\rho) = R_{n,-m}(\rho)$, so $V_{n,-m}(\rho, \theta) = V^{*}_{n,m}(\rho, \theta)$.

Zernike moments are the projections of a function onto these orthogonal basis functions. The Zernike moment of order $n$ with repetition $m$ for a continuous two-dimensional (2D) function $f(x, y)$ that vanishes outside the unit disk is defined as

$$Z_{nm} = \frac{n+1}{\pi} \iint_{x^2+y^2 \le 1} f(x, y)\, V^{*}_{nm}(\rho, \theta)\, dx\, dy.$$

For a 2D signal such as a digital image, the integrals are replaced by summations:

$$Z_{nm} = \frac{n+1}{\pi} \sum_{x} \sum_{y} f(x, y)\, V^{*}_{nm}(\rho, \theta), \quad x^2 + y^2 \le 1.$$

Zernike moment features can only be extracted from a 2D space, but audio data is time-varying 1D data, so the 1D audio data must first be mapped into a 2D space somehow. Past research offers several ways to do this; for example, a series of consecutive granule-MDCT 2D images can be constructed [17].
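For reference, the radial polynomial defined above translates directly into code:

    from math import factorial

    def zernike_radial(n, m, rho):
        """Zernike radial polynomial R_nm(rho) from the definition above."""
        m = abs(m)
        assert (n - m) >= 0 and (n - m) % 2 == 0, "n - |m| must be non-negative and even"
        return sum(
            (-1) ** k * factorial(n - k)
            / (factorial(k) * factorial((n + m) // 2 - k) * factorial((n - m) // 2 - k))
            * rho ** (n - 2 * k)
            for k in range((n - m) // 2 + 1)
        )

    print(zernike_radial(4, 2, 0.5))   # R_{4,2}(0.5) = 4*0.5^4 - 3*0.5^2 = -0.5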
3) Applications of Audio Identification

Audio identification is a very important real-world problem, and many applications can be found in this area. In this section we discuss several important real-world applications.

a) Copyright infringement detection

Music copyright enforcement is a major problem when dealing with digital audio files, which can easily be copied and distributed. Audio watermarking, discussed earlier, is one solution to this problem: before releasing a song, we can embed a watermark which does not affect the audio quality, and afterwards we can identify the audio object by extracting the watermark. This works fine for new releases, but there is no option for already-released audio.

Another approach to the copyright-protection problem is the audio fingerprint. In this method, as discussed earlier, we construct a fingerprint by analyzing the audio signal, a fingerprint that is uniquely associated with that signal; we can then identify a song by searching for its fingerprint in a previously constructed database. This kind of solution can be used to monitor radio broadcasting, audio file-sharing systems and so on.

b) Searching audio objects effectively

Sometimes we need to download or find a song but do not know the lyrics exactly. In this case we can query an audio database by humming the melody or providing a part of the song. As an example, an automated system may organize a user's music collection by properly naming each file according to artist and song title; another application could attempt to retrieve the artist and title of a song given a short clip recorded from a radio broadcast, or perhaps even hummed into a microphone [10]. In such cases we can use content-based audio identification methods to query the database. Audible Magic and Shazam are examples of such systems that already use audio fingerprinting [6].

Sometimes we may also want to search, index and organize the songs on our personal computer. We often have the same song under different names and in different locations, and content-based audio identification methodologies can be used for these tasks as well.

c) Analyzing audio objects for video indexing

Usually we identify videos using image-processing techniques, but this is inefficient and not very accurate. Instead, we can analyze the audio attached to the video file in order to index it [1]. This is well suited to commercial advertisement tracking systems.

d) Speech recognition

Speech recognition (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", or just "speech to text" (STT). Additionally, research addresses the recognition of the spoken language and the speaker, and the extraction of emotions [13]. This is another major application of audio identification.
III. OPEN ISSUES
As we discussed earlier, this is still a growing research area, because there are several major challenges and issues that have not been addressed properly so far. In this section we discuss those open issues.
Most of the time, we cannot perform the major audio analysis tasks in a controlled environment; this is the main issue faced by researchers. There are thousands of interruptions and interferences, such as unwanted noise effects, audio alterations, variations of audio characteristics like playback speed, tempo and beat, variations of the signal source, and so on. We can divide these issues into two major groups: psychoacoustic and technical.

Psychoacoustics focuses on the mechanisms by which an audio signal is processed into sensations in our brain. Even though the human auditory system has been investigated extensively in recent years, we still do not fully understand all aspects of auditory perception [13]. Therefore, modeling psychological features in order to simulate human perception is not a trivial task, but it is really important; this is one of the major overheads in this research area.

Normally, humans recognize unknown audio using their historical knowledge. This is very important for identifying a new version or a cover copy of an original audio object, but we cannot easily model this historical knowledge mathematically. One example is audio object masking: masking is the process by which the threshold of hearing for one sound is raised by the presence of another (masking) sound. The human auditory system has a special capability to distinguish between simultaneous masking and temporal masking using the frequency selectivity of the human ear. This has been modeled mathematically using the loudness of audio objects, but it does not provide 100% accuracy compared to the native auditory system.

Beyond that, there are several technical difficulties as well. An audio signal is usually exposed to distortions, such as interfering noise and channel distortions, so modeling a technically robust solution is a very challenging task. Noise, sound pressure level, tempo variations, the concurrent presence of several audio objects and so on badly affect any audio recognition algorithm. These are the major issues and challenges in this area, and whenever we introduce a new feature we have to keep these challenges in mind.
IV. CONCLUSIONS AND FUTURE DIRECTIONS
Throughout this review, we discussed the digital audio classification and identification techniques developed by various researchers. In conclusion, we can summarize our findings as follows.

This is still a young research area, so there is a lot of room for improvement. Finding, searching and indexing audio files using attached metadata no longer functions properly: audio repositories are increasing rapidly and new songs are introduced frequently, so we have to move to content-based audio identification methodologies. Historically, most researchers have used the audio fingerprinting concept to do this. The most important part of any of these methods is feature extraction, since it is the heart of the system. We still do not have features that are robust against every kind of signal distortion and alteration, and most of the solutions cannot scale to fit current audio repositories; we therefore now have to think about robust and scalable solutions.

Cover song identification, or dealing with several versions of the same song, is a very important research area within audio identification, and it is even more important when we think about the intellectual property of artists. There are several attempts in this area, such as [3], but they should be improved in the future.
ACKNOWLEDGEMENTS
I offer my sincerest gratitude to my supervisor, Dr. K.L. Jayaratne, who has supported me throughout my research. I would also like to show my gratitude to Mr. Brian for supporting me. Finally, I thank everybody who contributed to the successful realization of my project.
REFERENCES

[1] T. Zhang and C.-C. J. Kuo, "Hierarchical system for content-based audio classification and retrieval," in Photonics East (ISAM, VVDC, IEMB), International Society for Optics and Photonics, 1998, pp. 398-409.
[2] P. Cano, "Content-Based Audio Search from Fingerprinting to Semantic Audio Retrieval," Ph.D. dissertation, UPF, 2007.
[3] J. Serrà, E. Gómez, and P. Herrera, "Audio cover song identification and similarity: background, approaches, evaluation, and beyond," in Advances in Music Information Retrieval, vol. 274, Z. Ras and A. A. Wieczorkowska, Eds. Springer-Verlag Berlin/Heidelberg, 2010, pp. 307-332.
[4] S. Z. Li and G.-D. Guo, "Content-based audio classification and retrieval using SVM learning," Invited Talk, PCM, 2000.
[5] S. Z. Li, "Content-based audio classification and retrieval using the nearest feature line method," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 5, pp. 619-625, 2000.
[6] T. Huang, Y. Tian, W. Gao, and J. Lu, "Mediaprinting: Identifying multimedia content for digital rights management," 2010.
[7] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 2, no. 1, pp. 1-19, 2006.
[8] J. T. Foote, "Content-based retrieval of music and audio," in Multimedia Storage and Archiving Systems II, Proc. of SPIE, 1997, pp. 138-147.
[9] G. Guo and S. Z. Li, "Content-based audio classification and retrieval by support vector machines," IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 209-215, 2003.
[10] M. Riley, E. Heinen, and J. Ghosh, "A text retrieval approach to content-based audio retrieval," in Int. Symp. on Music Information Retrieval (ISMIR), 2008, pp. 295-300.
[11] Wikipedia, "Harmonic --- Wikipedia, The Free Encyclopedia," http://en.wikipedia.org/w/index.php?title=Harmonic&oldid=657491925, 2015. [Online; accessed 6-May-2015].
[12] D. Mitrović, M. Zeppelzauer, and C. Breiteneder, "Features for content-based audio retrieval," Advances in Computers, vol. 78, pp. 71-150, 2010.
[13] M. C. Sezgin, B. Gunsel, and G. K. Kurt, "Perceptual audio features for emotion detection," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2012, no. 1, pp. 1-21, 2012.
[14] M. A. Bartsch and G. H. Wakefield, "Audio thumbnailing of popular music using chroma-based representations," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 96-104, Feb. 2005.
[15] A. Ramalingam and S. Krishnan, "Gaussian mixture modeling using short time Fourier transform features for audio fingerprinting," in Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, 2005, pp. 1146-1149.
[16] R. Healy and J. Timoney, "Digital audio watermarking with semi-blind detection for in-car and domestic music content identification," in Audio Engineering Society Conference: 36th International Conference: Automotive Audio, 2009.
[17] W. Li, C. Xiao, and Y. Liu, "Low-order auditory Zernike moment: a novel approach for robust music identification in the compressed domain," EURASIP Journal on Advances in Signal Processing, vol. 2013, no. 1, 2013.
[18] D. Mitrović, M. Zeppelzauer, and C. Breiteneder, "Chapter 3 - Features for Content-Based Audio Retrieval," in Advances in Computers: Improving the Web, vol. 78, Elsevier, 2010, pp. 71-150.
[19] B. Gajic and K. K. Paliwal, "Robust feature extraction using subband spectral centroid histograms," in Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on, 2001, vol. 1, pp. 85-88.
[20] B. R. Glasberg and B. C. J. Moore, "A model of loudness applicable to time-varying sounds," J. Audio Eng. Soc., vol. 50, no. 5, pp. 331-342, 2002.
[21] K. Kondo, "Method of changing tempo and pitch of audio by digital signal processing," Google Patents, 1999.

AUTHORS' PROFILE

Nishan Senevirathna obtained his B.Sc. (Hons) in Computer Science from the University of Colombo School of Computing (UCSC), Sri Lanka in 2013. He is currently working as a Senior Software Engineer at CodeGen International (Pvt) Ltd and following an M.Phil. degree program at UCSC. His research interests include Multimedia Computing, Image Processing, High Performance Computing and Human Computer Interaction.

Dr. Lakshman Jayaratne (Ph.D. (UWS), B.Sc. (SL), MACS, MCS (SL), and MIEEE) obtained his B.Sc. (Hons) in Computer Science from the University of Colombo (UCSC), Sri Lanka in 1992. He obtained his Ph.D. in Information Technology in 2006 from the University of Western Sydney, Sydney, Australia. He works as a Senior Lecturer at the UCSC, University of Colombo. He was the President of the IEEE Sri Lanka Chapter in 2012. He has wide experience in IT consultancies for public and private sector organizations in Sri Lanka, and he worked as a Research Advisor to the Ministry of Defense, Sri Lanka. He was awarded in Recognition of Excellence in Research in 2013 at the Postgraduate Convocation of the University of Colombo, Sri Lanka. His research interests include Multimedia Information Management, Multimedia Databases, Intelligent Human-Web Interaction, Web Information Management and Retrieval, and Web Search Optimization, as well as Audio Music Monitoring for Radio Broadcasting and a Computational Approach to Training on Music Notations for the Visually Impaired in Sri Lanka.

This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

© The Author(s) 2015. This article is published with open access by the GSTF.
