
Carnegie Mellon University
Research Showcase @ CMU
Computer Science Department, School of Computer Science
9-1997

A Machine Learning Approach to Musical Style Recognition

Roger B. Dannenberg, Carnegie Mellon University
Belinda Thom, Carnegie Mellon University
David Watson, Carnegie Mellon University

Follow this and additional works at: http://repository.cmu.edu/compsci

This Conference Proceeding is brought to you for free and open access by the School of Computer Science at Research Showcase @ CMU. It has been accepted for inclusion in the Computer Science Department by an authorized administrator of Research Showcase @ CMU. For more information, please contact research-showcase@andrew.cmu.edu.
A Machine Learning Approach to Musical Style
Recognition
Roger B. Dannenberg, Belinda Thom, and David Watson
School of Computer Science, Carnegie Mellon University
{rbd,bthom,dwatson}@cs.cmu.edu

Abstract

Much of the work on perception and understanding of music by computers has focused on low-level perceptual features such as pitch and tempo. Our work demonstrates that machine learning can be used to build effective style classifiers for interactive performance systems. We present an analysis explaining why these techniques work so well when hand-coded approaches have consistently failed, and we describe a reliable real-time performance style classifier.

1 Introduction

The perception and understanding of music by computers offers a challenging set of problems. Much of the work to date has focused on low-level perceptual features such as pitch and tempo, yet many computer music applications would benefit from higher-level understanding. For example, interactive performance systems [3] are sometimes designed to react to higher-level intentions of the performer. Unfortunately, there is often a discrepancy between the ideal realization of an interactive system, in which the musician and machine carry on a high-level musical discourse, and the realization, in which the musician does little more than trigger stored sound events. This discrepancy is caused in part by the difficulty of recognizing high-level characteristics or style of a performance with any reliability.

Our experience has suggested that even relatively simple stylistic features, such as "playing energetically," "playing lyrically," or "playing with syncopation," are difficult to detect reliably. Although it may appear obvious how one might detect these styles, good musical performance is always filled with contrast. For example, energetic performances contain silence, slow lyrical passages may have rapid runs of grace notes, and syncopated passages may have a variety of confusing patterns. In general, higher-level musical intent appears chaotic and unstructured when presented in low-level terms such as MIDI performance data.

Avoiding higher-level inference is common in other composition and research efforts. The research literature is filled with articles on pitch detection, score following, and event processing. There are also interactive systems that respond to simple features such as duration, pitch, density, and intervals, but there is relatively little discussion of higher-level music processing.

In short, there are many reasons to believe that style recognition is difficult. Machine learning has been shown to improve the performance of many perception and classification systems (including speech recognizers and vision systems). We have studied the feasibility of applying machine learning techniques to build musical style classifiers.

The result of this research is the primary focus of this paper. Our initial problem was to classify an improvisation as one of four styles: "lyrical," "frantic," "syncopated," or "pointillistic" (the latter consisting of short, well-separated sound events). We later added the additional styles: "blues," "quote" (play a familiar tune), "high," and "low." The exact meaning of these terms is not important. What really matters is the ability of the performer to consistently produce intentional and different styles of playing at will.

The ultimate test is the following: Suppose, as an improviser, you want to communicate with a machine through improvisation. You can communicate four different tokens of information: "lyrical," "frantic," "syncopated," and "pointillistic." The question is, if you play a style that you identify as "frantic," what is the probability that the machine will perceive the same token? It is crucial that this classification be responsive in real time. We arbitrarily constrained the classifier to operate within five seconds.
Figure 1: 6 overlapping training examples are derived from each 15 seconds of performance.

2 Data Collection

To study this problem, we created a set of training data recorded from actual performances. So far, our experiments have used trumpet performances recorded as MIDI via an IVL Pitchrider 4000 pitch-to-MIDI converter for machine-readable data, and DAT tape for audio data. We also devised software to help collect data. The performer watches a computer screen for instructions. Every fifteen seconds, a new style is displayed, and the performer performs in that style until the next style is displayed. Because the style changes are abrupt, we throw out the first four seconds of data, allowing for a "mental gear shift" and a new style to settle in. For each style we retain 10 seconds of good data. (The 15th second of data is also discarded.)

We want our classifier to exhibit a latency of 5 seconds, so it must only use 5 seconds of data. Therefore, the MIDI data is divided into six overlapping intervals with durations of five seconds, as shown in Figure 1. Since we are using supervised training, we need to label each interval of training data. Rather than use the labels that were displayed for the performer during the recording, we decided to have the performer rate the data afterward to make sure each sample realizes the intended style. These ratings also give a richer description of the data; for example, a performance might be both "frantic" and somewhat "syncopated," and this additional information can be used in some machine learning approaches.

We recorded 25 examples each of 8 styles, resulting in 1200 five-second training examples. Ideally, we would rate each five-second example on each style, but this would require 9600 separate ratings. To reduce the number of ratings, we rated ten-second segments and assigned the rating to each of the six derived training examples. Thus, each individual rating is shared by six training examples. The training data was presented for rating in random order.

3 Classification Techniques

We constructed several classifiers using naive Bayesian, linear, and neural network approaches. We will describe the basic ideas here. To build a classifier, we first identified 13 low-level features based on the MIDI data: averages and standard deviations of MIDI key number, duration, duty factor, pitch and volume, as well as counts of notes, pitch bend messages, and volume change messages. (Pitch differs from key number in that pitch-bend information is included. Duty factor means the ratio of duration to inter-onset interval.)

3.1 Bayesian Classifier

The naive Bayesian classifier [1] assumes that the features are uncorrelated and normally distributed. (Neither of these is true, but this approach works well anyway.) Given a vector of features, F, we would like to know which classification C is most likely. Using our assumptions and Bayes' Theorem, it can be shown that the most likely class is the one whose mean feature vector has the least "normalized distance" to F. The "normalized distance" is the Euclidean distance after scaling each dimension by its standard deviation:

\[
\delta_C = \sqrt{\sum_{i=1}^{n} \left( \frac{F_i - \mu_{C,i}}{\sigma_{C,i}} \right)^2},
\]

where C indexes classes, i indexes features, \mu_{C,i} are means, and \sigma_{C,i} are standard deviations.

3.2 Linear Classifier

Linear classifiers compute a weighted sum of features, where a different set of weights is used for each class. The linear classifier algorithm we used [4] also assumes features are normally distributed, but not that they are uncorrelated. A linear classifier tries to separate members of the class from non-members by cutting the feature vector space with a hyperplane. Depending on the features, this division may or may not be very successful.

3.3 Neural Networks

Of the three approaches we tried, neural networks are the most powerful because they incorporate non-linear terms and they do not make strong assumptions about the feature probability distributions [1]. We used a Cascade-Correlation architecture [2] which consists initially of only input and output units
(equivalent to a linear classifier). During training, hidden units are added with connections to inputs, outputs, and any previously added hidden units. Effectively, there are many hidden layers, each with one node. Every hidden unit applies a non-linear sigmoid function to a weighted sum of its inputs. There is one output per class; outputs are regarded as boolean values that predict membership in a given class.

Number of Classes   Bayesian   Linear   Neural Network
4                   98.1       99.4     98.5
8                   90.0       84.3     77.0

Table 1: Percentage of correct classifications by different classifiers.

4 Results

All three classifiers produced excellent results with 4 classes and impressive results with 8 classes. We measured performance by training a classifier on 4/5 of the data and then classifying the remaining 1/5. (Since the data is in groups of 6 overlapping, and therefore correlated, training examples, entire groups went into either the training or the validation sets.) This was repeated 5 times with different partitions of the data so that each example was classified once. The classifier outputs Yes or No for each of 4 or 8 classes, and we report the total percent of correct answers. We were surprised by the reliability of the classifiers. Table 1 shows the numerical results. Training the Bayesian and linear classifiers took seconds, but the neural net took hours.

We implemented a real-time version of the naive Bayesian classifier. A circular buffer stores the last five seconds of MIDI data, and every second, features are computed and a classification is made. A fifth classification, "silence," was added for the case where the number of notes is zero. The execution time is well under 1 ms per classification, so computation cost is negligible.

Note that many of the misclassifications in the 8-class case involve the "quote" style, which one would expect to require knowledge of familiar melodies and some encoding of melody into the features.

5 Discussion

Our work has demonstrated that machine learning techniques can be used to build effective, efficient, and reliable style classifiers for interactive performance systems. However, we wondered why these systems worked well when hand-coded approaches have failed. We also wondered under what circumstances these machine learning systems would also fail.

To answer the first question, we have attempted to visualize the data by constructing histograms and scatter plots. Since there are 13 features, one can only look at projections onto one or two dimensions. Figure 2 shows a histogram of the number-of-notes feature. The number of notes per second is a common input parameter in performance systems because it has a high correlation with the concept of "frantic" or "fast." However, as Figure 2 shows, there is considerable overlap among the histograms even though the class ratings are almost all mutually exclusive. Other histograms show similar overlap, so classifications based on a single feature are not very useful.

Figure 2: Histogram of the number of notes in each sample for a given style (number of training examples versus number of notes).

Figure 3 shows a scatter plot of two dimensions: average duty cycle and number of notes. Here, the four classes almost separate. Given Figure 3, can we still argue that machine learning is necessary? To build a good classifier without machine learning, one would first need to identify good features. Individual features do not work well, and in our study there are \binom{13}{2} = 78 different pairs of features one might consider. Most pairs of features will not lead to good classifiers, and even if a good pair is known, parameters must be chosen accurately. Casually guessing at good features or combinations, plugging in "reasonable" parameters, and testing will almost always fail. In contrast, machine learning approaches that automatically take into account a large body of many-dimensional training data lead to very effective classifiers.

Figure 3: Average duty cycle versus number of notes (scatter plot).

It is worth noting that we never hand-tuned the original 13 features; rather, we allowed the inference algorithms to determine the effective use of the data. The classification of stylistic intent is relatively simple for machine learning algorithms (at least in the 4-class case), which allows one to spend less time on problems of representation.

5.1 Live Performance

Experience with the 4-class classifier in live performance has provided a new subjective evaluation. The classifier is fast and effective, but not as reliable as the data would predict. Experiments have shown that the performer's "style," or feature vector distribution, changes when the performer interacts with computer-generated music. Thus, training examples should be captured in context. In other words, a "syncopated" style in isolation differs from a "syncopated" style performed in an interactive composition.

Another problem with live performance has been "false positives" (misclassifications which erroneously imply the performer is playing a particular style). It may be better to report nothing than to make a wrong guess.

An experiment with our naive Bayesian classifier suggests that simple confidence measures can dramatically reduce false positives. Recall that this classifier makes decisions based on normalized distances from means. If the distances to the means of two classes are nearly equal, our confidence in the decision should be reduced. Therefore, we simply reject classifications when the least distance is not less than a given fraction of the next-to-least distance. This type of analysis is one of the beauties of a statistical approach. Figure 4 illustrates the reduction of false positives using this technique.

Figure 4: As the confidence threshold increases, the total number of classified examples decreases, but the number of misclassified examples decreases faster, so the ratio of correctly classified to all classified examples increases. (Curves: Correct/Classified, Correct/Total, and Misclassified/Total versus confidence thresholds of 1, 1.1, 1.2, 1.4, 2, and 4.)

6 Summary and Conclusion

Machine learning techniques are a powerful approach to music analysis. We constructed a database of training examples and used it to train systems to classify improvisational style. The features used for classification are simple parameters that can be extracted from MIDI data in a straightforward way. The styles that are classified consist of a range of performance intentions: "frantic," "lyrical," "pointillistic," "syncopated," "high," "low," "quote," and "blues." The first four of these styles are classified with better than 98% accuracy. Eight classifiers, trained to return "yes" or "no" for each of the eight styles, had an overall accuracy of 77% to 90%. Confidence measures can be introduced to reduce the number of false positives. Further work is required to study other classification problems, feature selection, and feature learning. We believe this work has applications in music performance, composition, analysis, and education, to mention just a few. We expect much more sophisticated music understanding systems in the future.

References

[1] Bishop, C. M. Neural Networks for Pattern Recognition. Clarendon Press, 1995.

[2] Fahlman, S. E., and Lebiere, C. The Cascade-Correlation learning architecture. Tech. Rep. CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Feb. 1990.

[3] Rowe, R. Interactive Music Systems. MIT Press, 1993.

[4] Rubine, D. The automatic recognition of gestures. Tech. rep., School of Computer Science, Carnegie Mellon University, 1991.
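For concreteness, the normalized-distance rule of Section 3.1 and the confidence-based rejection of Section 5.1 can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the class statistics and feature vectors are invented (two features stand in for the paper's 13), and the threshold convention, accepting a classification only when the next-to-least distance exceeds the least distance times the threshold, is one reading of the "given fraction" rule that matches the thresholds of 1 to 4 in Figure 4.

```python
import math

# Per-class feature means and standard deviations, as would be estimated
# from training data. Two features (notes per window, duty factor) stand
# in for the paper's 13; all numbers are invented for illustration.
CLASS_STATS = {
    "lyrical": {"means": [4.0, 0.8], "stds": [2.0, 0.15]},
    "frantic": {"means": [40.0, 0.5], "stds": [10.0, 0.2]},
}

def normalized_distance(features, stats):
    """Euclidean distance after scaling each dimension by its std dev."""
    return math.sqrt(sum(
        ((f - m) / s) ** 2
        for f, m, s in zip(features, stats["means"], stats["stds"])
    ))

def classify(features, threshold=1.2):
    """Return the nearest class, or None when confidence is low.

    The classification is rejected when the least distance times the
    threshold is not less than the next-to-least distance, i.e., when
    the two nearest classes are nearly equidistant (Section 5.1).
    """
    dists = sorted(
        (normalized_distance(features, s), c) for c, s in CLASS_STATS.items()
    )
    (d1, c1), (d2, _) = dists[0], dists[1]
    if d1 * threshold >= d2:  # not confidently closer to one class
        return None
    return c1

print(classify([38.0, 0.45]))  # a note-dense window -> frantic
```

With threshold = 1 every window is classified; raising the threshold trades coverage for precision, which is the trade-off Figure 4 plots.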
