Você está na página 1de 78

FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Automatic Transcription of Drums and Vocalised Percussion

António Filipe Santana Ramires

FOR JURY EVALUATION

Mestrado Integrado em Engenharia Eletrotécnica e de Computadores

Supervisor in FEUP: Rui Penha, PhD


Supervisor in INESC: Matthew Davies, PhD

July 13, 2017



© António Filipe Santana Ramires, 2017
Resumo

The evolution of computer processing power, and the consequent ability to perform digital signal processing in real time, led to the appearance of DAWs, making music creation accessible to the general public. With these changes, new instruments and interfaces for creating electronic music have emerged, but there is still a strong demand for new controllers. Developments in MIR and in Machine Learning have made possible systems capable of transcribing drum and beatbox phrases. However, these systems are developed with a focus on evaluating the performance of transcription algorithms and are not easy to use in a music production scenario.
The main goal of this work is to create an application that allows music producers to use their voice to create percussion phrases when composing in DAWs. An easy-to-use, user-oriented system capable of automatically transcribing percussion vocalisations, called LVT, is proposed. This application was developed using Max for Live and follows the "segment-and-classify" method for drum transcription [1]. LVT has three modules: i) an event detector, which detects the onset of a vocalisation; ii) a module that extracts relevant features from the audio of each event; and iii) a Machine Learning component that implements the k-nearest neighbours algorithm for the classification of percussion vocalisations.
Due to the differences between vocalisations of the same percussion sound by different users, a user-dependent approach was developed. In this perspective, the end user is able to train the algorithm with the desired vocalisations for each drum sound. A Max external that implements the Sequential Forward Selection algorithm, to choose the most relevant features for each user, is proposed, as well as an annotated dataset of percussion vocalisations.
The evaluation of LVT carried out in this work has two goals. The first is to identify the performance improvement obtained when using an algorithm trained by the end user, compared with an algorithm trained on a general dataset. The second goal is to analyse whether LVT provides the user with a better workflow for music production when compared with the existing tools: LDT [2] and the Ableton Live Convert Drums to MIDI function. The results showed that both goals set for LVT were achieved.

Abstract

The growth of computer processing power, and the consequent possibility of real-time
Digital Signal Processing (DSP) for audio, led to the appearance of Digital Audio Workstations
(DAWs), making the creation of computer music available to the general public. Along with these
changes, new instruments and interfaces for creating electronic music have surfaced. However,
there is still a high demand for new controllers. The developments in music information retrieval
(MIR) and in machine-learning paved the way for systems capable of transcribing drum loops
and beatboxing. However, these systems are focused on evaluating the performance of transcrip-
tion algorithms in offline testing scenarios and are either not easy to operate for end-users or not
sufficiently reliable for use in a real music production workflow.
The primary goal of this work is to develop an application that enables music producers to
use their voice to create drum patterns when composing in music DAWs. An easy-to-use and
user-oriented system capable of automatically transcribing vocalisations of percussive sounds,
called LVT, is presented. This system was developed as a Max for Live device which follows the
“segment-and-classify” methodology [1] for drum transcription. LVT includes three modules: i)
an onset detector to segment events in time; ii) a module that extracts relevant features from the
audio content; and iii) a machine-learning component that implements the k-nearest neighbours
(k-NN) algorithm for classification of vocalised drum timbres.
Due to the differences in vocalisations from distinct users for the same drum sound, a user-
specific approach to vocalised transcription was developed. In this perspective, a specific end-
user trains the algorithm with their own vocalisations for each drum sound before vocalising the
desired pattern. A Max external that implements sequential forward selection for choosing the
features most relevant for their chosen sounds is proposed, as well as a new annotated dataset
of vocalised drum sounds.
The evaluation of the LVT presented in this work addresses two objectives. The first one is
to identify the improvement when using a user-trained algorithm instead of one trained on a general dataset.
The second one is to assess if LVT can provide an optimised workflow for music production in
Ableton Live when compared to existing drum transcription algorithms: LDT [2], and the Ableton
Live Convert Drums to MIDI function. The results showed that both objectives expected for the
LVT were accomplished.

Acknowledgements

First of all, I would like to thank my parents for all the love and support they have given me, for their constant desire to broaden my knowledge and for always having accepted my decisions.
To my sister, grandparents, uncles, aunts and cousins for always wishing me well and rooting for me. A special thanks to Tia Lurdes, for all her affection, and to Mimi, for the pampering and home-made food.
To Catarina, for all the love and affection, for having put up with all my most difficult moments, giving me the strength to carry on, for how much she made me grow and for making my world a much better place.
To the supervisors of this dissertation, Professor Matthew Davies and Professor Rui Penha, for their supervision, for having supported this idea and for giving me the opportunity to work in a field I love. To Matthew Davies for the tireless support and interest in this dissertation and for his friendship. To Rui Penha for the passion for music that he always manages to convey.
To all the members of the Sound and Music Computing Group, who supported me through the difficulties encountered in this thesis, especially Diogo for his readiness to help and to clarify any doubt.
To my friends who were always there for me and who strengthened my passion for music: Chico, Bicá, Brás, Craveiro, Gonçalo, Martins, Costa, Alex, Cavaleiro and Sérgio.
To everyone who took part in the dataset collection and to Rádio Universidade de Coimbra.
To all the other colleagues and friends who have supported me throughout my life, especially at the Universidade de Coimbra.

António Ramires

“Any creator, when encouraging another to make things, should feel content.”

António Pinho Vargas in "À Procura da Perfeita Repetição"

Contents

Abstract i

Acknowledgements v

Abbreviations xv

1 Introduction 1
1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Publication Resulting from this Dissertation . . . . . . . . . . . . . . . . . . . . 2

2 Background and State of the Art 3


2.1 Vocalised Percussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Electronic Music Production . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Electronic Music Composition Tools . . . . . . . . . . . . . . . . . . . . 4
2.3 Music Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.4 Drum Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Vocalised Percussion Transcription . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Problem Characterization 13
3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Methodology 15
4.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1.1 Onset Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.3 Feature Selection and Machine Learning Algorithm . . . . . . . . . . . . 18
4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Onset Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.3 Feature Selection and Machine Learning Algorithm . . . . . . . . . . . . 21
4.3 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.1 LVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.2 LVT Receiver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


5 Data Preparation 27
5.1 Dataset Recording . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Dataset Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Evaluation 31
6.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

7 Conclusions 41
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.3 Perspectives on the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

A seqfeatsel C Code 45
A.1 seqfeatsel Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
A.2 Flowchart of the seqfeatsel external . . . . . . . . . . . . . . . . . . . . . . . . 55

References 57
List of Figures

3.1 Different kick drum waveforms overlaid with spectrograms. From left to right:
Drum kit, beatboxer and vocalised kick drum. . . . . . . . . . . . . . . . . . . . 13

4.1 Flowchart summarising the system . . . . . . . . . . . . . . . . . . . . . . . . . 16


4.2 Main part of the Max patch responsible for the operation of the system and its
components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Inside the pfft∼ patch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 User interface of the LVT device . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.5 User interface of LVT receiver device . . . . . . . . . . . . . . . . . . . . . . . 26

5.1 Pattern participants were asked to reproduce . . . . . . . . . . . . . . . . . . . . 27


5.2 Organization of the dataset files in an Ableton Live project . . . . . . . . . . . . 29
5.3 Example of the audio annotation in Sonic Visualiser . . . . . . . . . . . . . . . . 30
5.4 Example of how participants vocalised the pattern . . . . . . . . . . . . . . . . . 30
5.5 Two different vocalisations of kick drum . . . . . . . . . . . . . . . . . . . . . . 30
5.6 Two different vocalisations of a snare drum . . . . . . . . . . . . . . . . . . . . 30

6.1 Ableton project for the evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 32


6.2 Desired Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.3 How the number of operations was calculated. 1) Delete the extra events; 2) Cor-
rect the events that can be corrected; 3) Add the missing events. . . . . . . . . . 33
6.4 Effect of changing the window size per vocalised drum sounds and across micro-
phones. All LDT scores are shown in red, Ableton Live (ABL) in green and LVT
in blue. The solid lines indicate the laptop microphone, the dotted lines the AKG
microphone, and the dashed lines the iPad microphone. . . . . . . . . . . . . . . 35
6.5 Transcription of the first user vocalisations using the LVT system trained by the
second user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.6 Transcription of the second user vocalisations using the LVT system trained by the
first user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.7 Effect of choosing a wrong feature for a user. a) 2nd user; b)1st user with feature
for 2nd user; c)1st user. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.8 Example of an LVT transcription . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.9 Example of an Ableton Live Convert Drums to MIDI transcription . . . . . . . . 37
6.10 Example of a LDT transcription . . . . . . . . . . . . . . . . . . . . . . . . . . 37

A.1 Flowchart of the seqfeatsel external . . . . . . . . . . . . . . . . . . . . . . . . 56

List of Tables

2.1 Summary of the different vocalised percussion approaches . . . . . . . . . . . . 12

5.1 Number of individual hits contained in the recordings . . . . . . . . . . . . . . . 28

6.1 F-measure results for the PC microphone . . . . . . . . . . . . . . . . . . . . . 34


6.2 F-measure results for the AKG microphone . . . . . . . . . . . . . . . . . . . . 34
6.3 F-measure results for the iPad microphone . . . . . . . . . . . . . . . . . . . . . 34
6.4 Number of Operations for the PC microphone . . . . . . . . . . . . . . . . . . . 38
6.5 Number of Operations for the AKG c4000b microphone . . . . . . . . . . . . . 38
6.6 Number of Operations for the iPad microphone . . . . . . . . . . . . . . . . . . 38

Abbreviations

DAW - Digital Audio Workstation


k-NN - k-Nearest Neighbours
ACE - Autonomous Classification Engine
RMS - Root Mean Square
FFT - Fast Fourier Transform
ANN - Artificial Neural Network
MFCC - Mel frequency Cepstral Coefficients
BFCC - Bark frequency Cepstral Coefficients
MIR - Music Information Retrieval
SVM - Support Vector Machines
GMM - Gaussian Mixture Models
HMM - Hidden Markov Model
IF - Instance Filtering
SNR - Signal to Noise Ratio
NMF - Non-negative Matrix Factorisation
DSP - Digital Signal Processing
MIDI - Musical Instrument Digital Interface

Chapter 1

Introduction

1.1 Context

Music culture has changed considerably in recent years. New music genres were created, new possibili-
ties of production were discovered and new instruments were tested. In this context, in particular
with the emergence of drum machines, different ways of expressing percussive patterns have sur-
faced. The most common interfaces either use pads or a sequencer in order to acquire rhythmic
representations. These tools fail to fulfil their task if the user is not able to reproduce the desired
pattern through finger-drumming or by sequencing it. The human voice is an easy and cheap way
to express a drum pattern. With the development of computers, software Digital Audio Work-
stations began to emerge, and Ableton Live, through the use of the "Convert Drums to MIDI"
function, is able to transcribe drum recordings to a MIDI pattern. The transcription presented by
this function is not accurate if the voiced input does not realistically mimic expected drum sounds,
such as the ones from a drum machine or a drum kit. Therefore, this project aims to design an
interface for expressing drum patterns through the use of the human voice.

1.2 Goals

The objectives defined for this project are the following:

• Compile a dataset of vocalised drum patterns to be available online.

• Conceive methods for automatic transcription of vocalised percussion.

• Research techniques for the incorporation of user-input.

• Create a Max for Live device to transcribe vocalised percussion.

• Evaluate the device and compare it to existing solutions.


1.3 Motivation
With the changes that occurred in music culture, music production and the way musicians work
with their instruments have also changed [3]. The ability to invent and reinvent new ways to pro-
duce music is nowadays a key to progress. Consequently, new proposals, such as designing new
techniques for the composition of music, are necessary. Within the genre of electronic music, the
sequencing of drum patterns plays a critical role. The voice is an important and powerful instru-
ment of rhythm production [4] and it can be used to express a drum pattern. In order to leverage
this concept within a computational system, we create a tool that can help users (both expert mu-
sicians and amateur enthusiasts) input the rhythm patterns they have in mind to a sequencer, via
automatic transcription of vocalised percussion. Our proposed tool is beneficial both from the per-
spective of workflow optimisation (by providing accurate real-time transcriptions) and as a means
to encourage users to engage with technology in the pursuit of creative activities.

1.4 Dissertation Structure


Besides the Introduction, this dissertation contains seven Chapters. In Chapter 2, the state of the
art is described and the evolution of the work in this area is presented. In Chapter 3, the obstacles
we were met with in this project as well as the proposed solution to overcome them are detailed.
In Chapter 4, a description of the theoretical and practical implementation of the LVT system
is given. Chapter 5 describes the procedure used to collect, organise and annotate the dataset. In
Chapter 6, the test methodology and the results for the evaluation of the state of the art systems
and of LVT are presented. Finally, in Chapter 7, the results and contributions of this work are
summarised and future work is proposed.

1.5 Publication Resulting from this Dissertation


This dissertation led to the presentation of the following paper:

• A. Ramires, M. Davies and R. Penha, “Automatic Transcription of Vocalised Percussion” in


DCE17 - 2nd Doctoral Congress in Engineering, 2017.
Chapter 2

Background and State of the Art

2.1 Vocalised Percussion

Vocalised percussion is one of the most intuitive ways for humans to express a rhythm. This
universal language uses phonemes with no meaning to mimic instruments and has been used in
many different cultures, throughout history, either to represent and teach percussive patterns or as
a percussive instrument itself [4] [5].
Firstly, in Australia, the ‘didjeridu talk’ or ‘tongue talk’ is used to memorise and guide a yidaki
(didjeridu) performance and the onomatopoeias used vary between different communities. Both
the conga players from Cuba and the Ewe people from Ghana use vocalised percussion sounds to
speak their riffs. In Asia, the bols and Konnakol from India comprise sound symbols to represent
different tabla hits. In the case of Europe, vocalisation is also used as an instrument itself. The
puirt a’ bhèil from Scotland and Ireland is used as an instrument substitute when the fiddles and
pipes are not available [5].
In the United States, in addition to early examples of jazz scat singing, we have beatboxing,
a form of vocal percussion originated in 1980’s hip-hop culture, where musicians use their lips,
cheeks, and throat to create different beats. This term originally referred to the vocal reproduction
of 80's drum machines, also known as beatboxes, which were unaffordable for the vast majority of
people from this culture. With the evolution of beatboxing and with the use of microphones, the
range of expressions that beatboxers use is not simply restricted to drum sounds. By using inhaled
and exhaled sounds, different vocal modes such as head voice, growl or falsetto and trills, rolls
and buzzes, the beatboxer can create sounds such as a vocal scratch or a "synth kick" [6] [5] [4]
[7].

2.2 Electronic Music Production

With the increase in computers' processing power, it became possible for musicians to perform
real-time Digital Signal Processing (DSP) on their personal systems. The emphasis in music-making
tools has shifted from hardware to software, and the general public can now make music on their
home computers. While computer music has been performed in academic research and composition
communities for many years, the availability of accessible software music tools has given rise to
a computer music culture outside these circles. Many exciting kinds of music are being made by
non-academic artists and producers in home studios all over the world [3].
Electronically produced music is part of popular culture. Musical ideas that were once con-
sidered far out, such as the use of environmental sounds, ambient music, turntable music, digital
sampling, computer music, the electronic modification of acoustic sounds, and music made from
fragments of speech, have now been incorporated into popular music. Genres including new age,
rap, hip-hop, electronic music, techno and jazz have been influenced by production values and
techniques that originated with classic electronic music [8].

2.2.1 Electronic Music Composition Tools

Various inventions have been devised to assist musicians in performing, arranging, recording and
composing music. A historically early method of recording music which is still in use today is the
player piano. Holes, corresponding to particular notes, are punched in paper which is rotated as
the player piano is played [9]. Newer tools used in both analogue and software electronic music
production comprise sound generators, effects processors and mixers [10].
Any technology that can transduce human gesture or movement into an electrical signal is avail-
able to be used as an electronic music composition tool. The commonly used technologies include
infrared, ultrasonic, hall effect, electromagnetic and video. With the development of MIDI, com-
puter hardware such as keyboards, switches, pushbuttons, sliders, joysticks or drum pads can also
be used as a means to input patterns and melodies in the computer. Many musicians have built
their own input devices as prototypes by using microphones, accelerometers and other types of
sensors combined with electronic circuitry. But all this hardware only works if it is connected to the
computer and managed by some software - the performance software [11].

2.3 Music Information Retrieval

Music Information Retrieval (MIR) is concerned with the extraction and inference of meaningful
features from music, indexing of music using these features and search and retrieval schemes, as
defined by [12] [13].
During the 2000s, with the development of computers and the corresponding increase in com-
puter power, MIR research shifted its focus from analysing symbolic representations of music
pieces to applying signal processing techniques directly to music audio signals [13].
According to Schedl et al., in [13], MIR comprises several investigation subfields. The most
typical ones are the following:

• Feature Extraction: This first group of topics is related to the extraction of relevant features
from music content. It includes several tasks such as timbre description; music transcrip-
tion and melody extraction; onset detection, beat tracking and tempo estimation; tonality
estimation and structural analysis, segmentation and summarisation.

• Similarity: This subsystem is the core of many applications such as music retrieval and mu-
sic recommendation systems. This comprises tasks such as similarity measurement, identi-
fication of cover songs and query by humming.

• Classification: This group uses the information retrieved by the previous subfields in order
to classify music. Emotion and mood recognition; music genre classification; instrument
classification; composer, singer and artist identification and auto-tagging are common areas
of research.

• Applications: This final subfield comprises the development of applications that use MIR
tools. These can vary from audio fingerprinting to playlist generation and music visualisa-
tion.

Feature extraction is essential to this work and its tasks will, therefore, be described more
thoroughly.
Automatic music transcription, according to [14] "is the process of converting an acoustic
musical signal into some form of musical notation". This area is one of the most intensively re-
searched in MIR and is often considered the core technology to improve any MIR system. While
most publications deal with pitched instruments, rhythm extraction is also a major focus. A com-
plete music transcription system comprises various sub-systems such as multi-pitch detection, on-
set detection, instrument recognition and rhythm extraction. A large subset of current approaches
for the transcription of harmonic sounds employs spectrogram factorisation techniques, such as
NMF and probabilistic latent component analysis [15] [16].
Beat tracking is defined in [17] as deriving from a music audio signal a sequence of beat
instants that might correspond to when a human listener would tap his or her foot. This task is
related both to note onset detection, which consists of identifying the start points of musical events
in an audio signal, and tempo induction, which consists of finding the underlying rate of a piece of
music [18]. Despite the differences between these two tasks, their investigation has always been
closely connected. Research on this topic started in the 1970s and an overview on its evolution
can be found in [19].
Research on structural analysis or self-similarity analysis mainly consists of detecting signal
changes and repetitions within the same musical piece. This analysis is based on the computation
of the self-similarity matrix, proposed by Foote in [20]. An important application of this research
is music summarisation, as songs may be represented by their most frequently repeated segments
[13].

2.4 Drum Transcription


Drum transcription is essential in automatic music transcription as, in several music genres, the
drum track possesses information about tempo, rhythm, style and possibly the structure of the song
[15]. Various problems can arise when dealing with this task. These are related to the diversity of
the drum sounds to be labelled, the difference in loudness in different loops and the possibility of
overlapping sounds.
Most drum transcription methods can be separated into three different groups, as proposed in [1]
and [21]:

• Segment and Classify: This first approach segments different drum events and, based on
the features extracted, classifies them using machine learning techniques, such as support
vector machines (SVM) or Gaussian mixture models (GMM). This proved successful in solo
drum recordings, but, in polyphonic music, its application is more challenging, as most of
the features used for classification are sensitive to the presence of background music.

• Separate and Detect: The input signal is split in its various components via source separa-
tion. The different streams then go through an onset detector, such as an energy threshold
based one, in order to find the instances of each signal. To achieve source separation, a
time-frequency transform is normally used. This decomposition is traditionally achieved
with independent subspace analysis (ISA) or non-negative matrix factorisation (NMF).

• Match and Adapt: These methods search for the occurrence of a temporal or time-frequency
templates within the music signal and browse a database to find the most similar pattern to
the queried one.

The "segment and classify" and "separate and detect" methods are the most relevant ones
for this work, as they do not need previously created templates to match the query. In order to
power creativity, the user should be able to input any sequence of their own design, and not only
previously constructed ones.
Gillet et al. [22] study the performance of hidden Markov models (HMM) and SVMs on the
transcription of drum loops in a "segment and classify" method. Instead of focusing the identifi-
cation only on sounds taken in isolation, the dataset used consists of pre-recorded drum patterns,
such as those found in commercial sample CDs, where an event can contain more than one drum
hit. In order to split the loop into the corresponding events, an onset detection algorithm based
on sub-band decomposition was used. A k-NN classifier was used to find the most appropriate
group of features to be used in the event classification sub-system. The selected features were:
(1) the mean of 13 Mel frequency Cepstral Coefficients (MFCC); (2) the spectral centroid; (3) the
spectral width; (4) the spectral asymmetry; (5) the spectral flatness; (6) 6 band-wise frequency
content parameters, that correspond to the log-energy in six pre-defined bands. The classification
was done both with an HMM and an SVM. The first class of models was used as it proved efficient
when short term time dependencies exist. This is the case if the sound produced by a drum con-
tinues to resonate when the following stroke happens. Both classes were tested with two different
approaches. The first one uses a single 2^n-ary classifier, in which each possible combination
of strokes is represented as a separate class. The second one uses n binary classifiers, one per
instrument. A third experiment was conducted using a drum kit dependent approach, where four
different HMM classifiers were used, one for each kind of drum kit (Electro, Light, Heavy or Hip-
Hop). The results obtained show that the SVM surpasses the HMMs in all approaches, acquiring
65.1% accuracy using only one classifier, and 64.8% using n binary classifiers. The highest accu-
racy for the HMM classifier occurred when the drum kit dependent approach was used, attaining a
precision of 62.5%, while only 59.1% was achieved in the model trained on all data. The authors
state this probably was due to the high variability of the data set, which the HMM approach could
not handle.
Gillet et al. [23], in order to remove the non-percussive parts of polyphonic music, use a
band-wise harmonic/noise decomposition. This algorithm is only able to identify two classes:
kick and snare. The aim of the first stage of this system is to obtain the stochastic part of the input
signal. Percussive sounds have a strong stochastic component, contrary to the pitched instruments.
In order to achieve this, the input signal is decomposed into eight non-overlapping sub-bands, by
passing it through an octave-band filter bank, since, in this way, the computational cost of the noise
subspace projection is greatly reduced. The second stage is to project the noise subspace. This
is accomplished by using the Exponentially Damped Sinusoidal model. After this, the signals
still contain attacks and transients from pitched instruments, therefore, in the next stage, where
the resulting signal goes to an onset detection algorithm, non-percussive events are also detected.
This is handled by adding another class for these sounds. The onset detection is done by half-
wave rectifying and low-pass filtering the sub-band noise signals, and then finding the peaks of
their derivative. The features extracted from each segment were the energy in the first 6 sub-bands
and the average of the first 12 MFCCs, without c0 . The classification used is different from the
standard "segment and classify" approach as some onsets do not contain drum events and have
to be discarded. Two SVM classifiers were used, one for the kick and one for the snare. The
probabilistic output of the SVM classifier was used as a likelihood measure, in order to retrain the
system with the most probable events. The best F-measure achieved was 89.2% for a mix where
the drum was 6dB louder than the rest, and 84.0% for a balanced mix.
Similarly to Gillet et al. in [23], Tanghe et al., in [24], present a strategy to segment and
classify drum events, in real time, in polyphonic music, using an SVM. The algorithm operates in
a streaming fashion, which allows the processing of large audio files and never-ending audio streams.
The onset detection consists of several sub-systems, in order to only detect local maxima, and
detects both drum and non-drum events. Firstly, the audio signal is fed into a short term Fourier
transform and then input to a Mel filterbank. The weighted sum of the differences between the
current amplitude levels and that of the recent past is calculated. Here, the more recent the value
is, the more important it is. By dividing the envelope follower of the output of the Mel filterbank
by the result of the weighted sum, the relative differences in each frequency band are calculated.
The output of this sub-system goes to a peak detector and, if this peak is higher than a selected
threshold, it is sent to a heuristic grouping peak detector that outputs “true” if a local maximum
is reached and “false” if not. A new peak can only be detected if the calculated sum decreases
after the previous peak is identified. A module was created to extract features from the stream of
audio samples. The descriptors obtained were the overall RMS, the RMS in 3 frequency bands, the
RMS per band relative to the overall RMS, the RMS per band relative to RMS of other bands, the
zero-crossing rate, the crest factor, the temporal centroid, the spectral centroid, kurtosis, skewness,
rolloff and flatness and the MFCC and ∆ MFCC. The classification is then executed by an SVM,
trained with annotated audio files. The highest average classification F-measure achieved was
61.1%.

Miron et al. [2] introduce a drum transcription algorithm capable of handling real-time audio.
In order to detect events, first, a high frequency onset detection is used, since this was reported
to be the best for percussive onsets. As this stage can detect false positives, an instance filtering
(IF) method using sub-band onset detection is used. This method uses the complex domain onset
detector in three different frequency bands, one for each class (kick, snare and hi-hat). Since
different drum strokes can occur at the same time, features for each frequency band are computed
separately, so that the noise influence is reduced. The obtained features are computed in the decay
part of each sound and, in order to give less importance to silent frames, weighted with the RMS
value. The extracted features are the energy in 23 Bark bands, 23 Bark-frequency cepstral coefficients,
spectral centroid and spectral rolloff. The machine learning part consists of three k-NN classifiers,
each adapted to the corresponding class. This system is implemented in PD-extended, Max MSP
and in Max for Live. The F-measure result obtained by the event detection sub-system was 93%,
and 81% by the complete system for the validation dataset. The use of the IF stage along
with the k-NN classifier led to an increase in the performance and precision in all classes.

A "separate and detect" method for drum transcription was presented by Roebel et al. [25].
The separation of the three sources is done using a non-negative matrix deconvolution, in which
the update rules are obtained from the Itakura-Saito divergence. The detection of drum events uses
three criteria. The first one comprises an activation based test. The second threshold is used in
order to establish a minimum SNR for detected events. The prominence of the target class in the
power spectrum is weighted and compared to a third threshold. If all tests are passed, the event is
retained. The use of these three conditions shows a better overall performance for the algorithm.

The non-negative matrix factorisation approach can also be used in a "match and adapt"
method, as proposed by Wu et al. [15]. A template adaptive drum transcription algorithm that
uses partially fixed non-negative matrix factorisation is presented. This algorithm uses two dictio-
naries: one previously defined with drum templates and one trained with melodic content, in the
standard NMF manner. Two methods are then tested for the template adaptation. For three classes
of drums, the system is able to achieve average F-measures of 77.9% and 72.2% in monophonic
and polyphonic music respectively.

2.5 Vocalised Percussion Transcription

The transcription of vocalised percussion deals with automatic identification of vocalisations of


percussive sounds. This area has many applications such as live music transcription, human-
machine musical interaction or even identifying drum loops within a database [26].
Most systems for the transcription of vocalised percussion follow the "segment and classify"
approach and integrate three different parts: a component that separates the different events, a
component that generates descriptors for each event and a machine learning component that as-
signs the different events to the corresponding class [27]. Since the voice can produce only one
sound at a time, vocalised percussion transcription is a monophonic problem.
Amaury Hazan et al. [27] created a tool for transcribing voice percussive rhythms that aims to
reduce the gap between the user and the device used for acquiring rhythmic representation. This
work focuses on transcribing not only standard vocalised drum sounds but also a whole range of
acoustic oral rhythms. In order to separate the different events, an energy based algorithm is used.
This decomposes the input stream into several frames, computes their energy and compares it to
a threshold that is user-defined. In order to obtain descriptors, each event is split into attack and
decay, by finding the maximum of the event’s sound envelope. The features to be extracted were
divided between temporal and spectral. From the decay, both temporal and spectral descriptors
were obtained, whereas in the attack only temporal features were analysed. The spectral descrip-
tors used were obtained through the use of an FFT. These descriptors are: spectral energy, spectral
centroid, flatness, kurtosis and the first five MFCC. In the case of temporal features, the descriptors
used were the duration, the log-duration, the energy, the zero-crossing rate and the temporal cen-
troid. Two different machine learning components were used. The first one was the tree induction
algorithm method C4.5, with and without two optimisations: boosting and bagging. The second
algorithm used was the k-NN. The attained results favoured the utilisation of the C4.5 algorithm
with bagging with 90% accuracy for a test set with recordings from unseen performers, compared
to 87% accuracy for the C4.5 algorithm with boosting, and 79% both for the C4.5 algorithm alone
and for the k-NN.
In the report "Automatic Transcription of Beatboxing" by Christensen et al. [28], the group
developed a MATLAB application that identifies three beatboxing sounds, the kick, the snare and
the hi-hat. In order to segment different events, they use an energy based algorithm similar to
the one in the paper by Amaury Hazan et al. [27]. The classification was done using the k-NN
classifier, previously trained with a beatbox dataset recorded by them. A choice was made to
only test one feature at a time. These were energy, zero crossing rate, the first 20 MFCCs and
spectral centroid, spread and flux. Later, they analysed which k values showed better results for
each feature used. The best performing feature was the MFCC feature vector, with k = 7 or 8. An
accuracy of 98.9% was achieved when using these parameters.
In their paper [4], besides presenting a data collection comprising recordings from both beat-
boxers and non-beatboxers, Sinyor et al. studied the efficiency of the Autonomous Classification
Engine in identifying vocalised percussion sounds. This engine optimises the set of features to
be used by the machine learning component [29]. A second experiment was performed using a
genetic algorithm for selecting features that proved to be superior compared to the 1-NN ACE
experiment. The segmentation of the input was done manually and the descriptors used were both
the average and the standard deviation of compactness, of spectral rolloff, of RMS derivative and
of zero crossing overall, the standard deviation of the overall RMS and of the frequency corre-
sponding to the highest peak of the FFT, and the average of both the zero crossing derivative
and the strongest frequency of the spectral centroid. The classifier that proved to work best with
ACE was AdaBoost with C4.5 decision trees as base learners, which obtained a 98.15% accuracy
when using 3 classes of sounds, and 95.55% when using 5 classes.
Kapur et al. [7] introduce two different systems. The first one receives a beatboxing loop
and identifies the corresponding drum loop within a bank. The second one transcribes the same
input to the corresponding drum sounds (kick, snare and hi-hat). Despite the first system being
an interesting utilisation for beatbox transcription, we will focus on the transcription application.
The segmentation of the input is also done by splitting the audio when its volume is higher than
a definable value. The classification algorithm used was a backpropagation Artificial Neural Net-
work with a single dimensional feature that was the number of Zero Crossings. This method was
used since it was the one to achieve best results in a real time implementation, having obtained
an accuracy of 97.3%. In this method each of the drum sounds should be recorded 4 times before
the transcription, in order to train the ANN. The user may then record a beatboxing loop, that will
be processed and fed to a sampler, where the user can select the desired real drum sounds. The
transformed query can then be saved as a new audio file.
In [26], Stowell et al. study the effect of delaying the classification in a real time transcription
system and present a new annotated beatbox dataset. They choose to have 3 classes that are kick,
snare and hi-hat. The onset detection was done manually, in order to factor out the influence of
this component in the system. Several different features were obtained through SuperCollider
3.3, and then analysed to see the corresponding effect in the accuracy of the transcription by the
naive Bayes classifier, with the Kullback-Leibler divergence. The best value of time frames to
be analysed for each feature was also tested. The features that proved to be more appropriate for
beatbox transcription were the 25th- and 50th-percentile and the spectral centroid and flux. The
delay that performed best in most tests was 23ms. The obtained accuracy with these parameters
was 88.4% for the kick, 81.6% for the hi-hat and 53.1% for the snare. A perceptual experiment
was also conducted in order to evaluate the tolerable latency in the decision-making component.
For common drum sounds, the best maximum delay which preserved an excellent or good audio
quality varied from 12ms to 35ms.
Nakano et al. [30] present a "match and adapt" method to retrieve drum patterns from a
database by voice percussion recognition. By using the Viterbi algorithm and only two sound
classes (kick and snare), a recognition rate of the desired pattern of 93% is attained. Gillet et
al. [31] present a system, with the same function as the previous one, that uses a segment and
classify approach. The input is manually segmented and the features used are 13 MFCCs and 13
∆ MFCCs. The transcription of the query is done by using the Bakis (left-right) HMM model.

Hipke et al. [32] present a transcription system which uses an identifier that is trained by
the end user. This system is named BeatBox and enables end-user creation of custom beatbox
recognisers, represented in the GUI by different pads. Each pad also shows the reliability of
the differentiation for each vocalisation. The onset detection is threshold based and the features
computed are the spectral centroid and RMS. The classification algorithm used was k-NN, whose
k value is automatically selected by the system.
The DAW Ableton Live has a "Convert Drums to New MIDI Track" function that "extracts
the rhythms from unpitched, percussive audio and places them into a clip on a new MIDI track"
and should be able to work "with your own recordings such as beatboxing" [33]. In the context of
this work, this is the most similar system to the one we propose, as Ableton Live is a DAW widely
used by both expert musicians and amateur enthusiasts. This feature works satisfactorily when
transcribing drums and beatboxing recordings that imitate drum sets. By testing it with simple
onomatopoeia such as boom, pam, ta, pa or tss the transcription did not work as intended. In
recordings with a cheap microphone, the noise was identified as a hi-hat. Moreover, some of the
vocalisations of snares or kicks were identified as a snare and kick at the same time.
Table 2.1 summarises the previous articles in terms of accuracy, number of classes used, type of
segmentation, the descriptors used and the machine learning algorithm adopted.

2.6 Summary
As was described in this chapter, there are many articles focused on the complications regarding
the transcription of percussive sounds. The presented systems are mostly focused on evaluating
the performance of transcription algorithms and not on the possible applications to end-users.
Despite the satisfactory results, the transcription of percussion presents highly diverse prob-
lems in need of techniques specifically adapted to the main target scenario, whether the input of
the system consists of vocalised percussion, beatboxing, or isolated or mixed polyphonic recordings.

Table 2.1: Summary of the different vocalised percussion approaches

Author | Classes | Segmentation | Descriptors used | Machine learning | Accuracy
Hazan [27] | 4 | Energy threshold | Spectral energy, spectral centroid, flatness, kurtosis, the first five MFCCs, duration, log-duration, energy, zero-crossing rate and temporal centroid | C4.5 w/ bagging | 90%
Christensen [28] | 3 | Energy threshold | 20 MFCCs | k-NN | 98.9%
Sinyor [4] | 5 | Manual | Average and standard deviation of compactness, spectral rolloff, RMS derivative and zero crossings overall; standard deviation of the overall RMS and of the maximum frequency; average of the zero-crossing derivative and of the strongest frequency of the spectral centroid | C4.5 w/ boosting | 95.55%
Sinyor [4] | 3 | Manual | Same descriptors as above | C4.5 w/ boosting | 98.15%
Kapur [7] | 3 | Energy threshold | Number of zero crossings | ANN | 97.3%
Stowell [26] | 3 | Manual | The first 8 MFCCs, spectral centroid, spread, flatness, flux, slope, crest, crest in sub-bands, distribution percentiles, high-frequency content and zero-crossing rate | Naive Bayes | Kick 88.4%, snare 81.6%, hi-hat 53.1%
Chapter 3

Problem Characterization

3.1 Problem Definition

The problem consists of creating a tool that assists producers, whether trained in beatboxing
or not, to create patterns with the use of their own voice. This application should receive as
input either a recording or a stream of audio that contains vocalised percussive sounds and output
the corresponding transcription in a ready-to-use MIDI file. Furthermore, the system should be
adapted to vocalised input constraints, such as the inability of producing two sounds at the same
time.
The applications stated in the previous chapter do not suffice if a user-specific vocalised per-
cussion transcription system, aimed at computer musicians, is desired. These tools are aimed at
testing the behaviour of transcription systems; they export results instead of patterns, and their
interfaces are not easy to use. Moreover, most of them are aimed at either beatboxing or drum
transcription and are tuned to receive only a selection of sounds.

Figure 3.1: Different kick drum waveforms overlaid with spectrograms. From left to right: drum
kit, beatboxer and vocalised kick drum.

A substantial difference exists between the sound produced by a drum kit, by a beatboxer and

by a common user not trained in beatboxing as can be seen informally in Figure 3.1. In addition,
different people vocalise drum sounds in different manners.
Therefore, for this tool to function as required, it has to be adapted to each user and to the
characteristics of the human voice.

3.2 Proposed Solution


In order to solve this problem, a vocalised drum transcription software, able to be trained with the
user's vocalisations, is proposed. The system is integrated in a Max for Live project. Max for Live
is a visual programming environment, based on Max 7, that allows users to build instruments and
effects for use within Ableton Live.
Firstly, a dataset of vocalised percussion was compiled. It was then annotated using Sonic
Visualiser1 , a free application for viewing and analysing the contents of music audio files. The
recordings were saved and organised both in a compressed archive and in an Ableton Live project
file, for compatibility and to facilitate the testing of the transcription systems. These files are
hosted in the project’s web page2 .
Then, the system was developed following a user-specific approach. This system follows the
"segment and classify" method previously described and integrates three elements: an onset de-
tector, a component that generates descriptors for each event, and a machine learning component.
The onset detection was done with Aubio Onset∼ [34]. The extraction of the features described
in the state of the art was done in real-time with the use of the object Zsa.mfcc∼, the library
Zsa.descriptors [35] and other Max MSP objects. The first of these tools outputs the MFCCs as
a list, the second one extracts spectral centroid, spread, slope, decrease and rolloff, a sinusoidal
model based on peaks detection and a tempered virtual fundamental [35]. The zero crossing rate
and number of zero crossings were calculated with the zerox∼ object. The machine learning com-
ponent was trained with the user’s preferred vocalisation and the features selected to show the
better results for the provided input. This was done through the use of the Sequential Forward Se-
lection method, with the most significant features selected by the accuracy obtained from testing
the training data. This metric evaluates the most adequate feature in order to achieve the maximum
separation between clusters in a machine learning algorithm. The Sequential Forward Selection
method works by selecting the most significant feature, according to a specific parameter (in this
case the accuracy obtained from testing the training data), and adding it to an initially empty set
until there are no improvements or no features remain. An user interface was created in Max for
Live, so as to facilitate the utilisation of the application.

1 http://www.sonicvisualiser.org/
2 https://lvtsmc.wordpress.com/
Chapter 4

Methodology

In this chapter, the approach used to create an automatic system that transcribes vocalised per-
cussion is described. When developing the system, the focus was on creating a model capable of
delivering a reliable and accurate transcription and which is easily operated by music professionals
and amateur enthusiasts.
This chapter is divided into three parts. The first one details the approach chosen for
the design of the system, the second one describes how this solution was implemented and in the
third part, the way the system should be used is described. A flowchart of the system functioning
is presented in Figure 4.1.

4.1 Approach
In this section, the operation of the system is detailed and divided into various components. Fol-
lowing [1], this device uses the segment-and-classify approach. This approach proved to be partic-
ularly successful on solo drum signals [1] but not as efficient for polyphonic sounds. Since our system
only needs to transcribe percussive events, this approach was chosen as it is
the most suitable option.
The first component is an onset detector, which is responsible for detecting when events occur.
When these are detected, the second module extracts the features from the relevant time frame and
outputs their value to the final stage, the machine learning and feature selection component. The
features that provide a better classification are chosen and used in a k-NN classifier.

4.1.1 Onset Detection

The onset detection function used is the high frequency content (HFC). In the tests conducted
in [36], it proved to be the most effective method for identifying non-pitched percussive onsets,
detecting 96.7% of all the events without detecting any false positives. The function was originally
proposed in [37]. This algorithm calculates the weighted mean of the amplitude for each bin. The
higher the frequency is, the more weight the bin has. It is powerful at detecting onsets that can be
modelled as bursts of white noise, such as snares and cymbal sounds.
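As an informal illustration of this idea, the sketch below computes an HFC value per STFT frame and applies a very simplified peak picking. It is not the aubio implementation used in this work; the frame and hop sizes, the use of a weighted sum of squared bin magnitudes (one common formulation of HFC) and the thresholding rule are assumptions made only for the example.

```python
import numpy as np

def hfc_curve(x, frame_size=512, hop_size=128):
    """High frequency content per frame: squared bin magnitudes weighted
    by bin index, so high-frequency energy contributes more to the value."""
    window = np.hanning(frame_size)
    bins = np.arange(frame_size // 2 + 1)
    n_frames = 1 + (len(x) - frame_size) // hop_size
    hfc = np.zeros(n_frames)
    for m in range(n_frames):
        frame = x[m * hop_size: m * hop_size + frame_size] * window
        mag = np.abs(np.fft.rfft(frame))
        hfc[m] = np.sum(bins * mag ** 2)
    return hfc

def pick_onsets(hfc, threshold):
    """Very simplified peak picking: a frame is an onset if the detection
    function exceeds the threshold and is a local maximum."""
    return [m for m in range(1, len(hfc) - 1)
            if hfc[m] > threshold and hfc[m - 1] < hfc[m] >= hfc[m + 1]]
```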


Figure 4.1: Flowchart summarising the system

4.1.2 Feature Extraction

The module that follows the onset detection is the feature extraction. A set of temporal and spectral
features are extracted from the incoming audio signal when an onset is detected.
The temporal features are the RMS value of the energy and the number of zero crossings. The
number of zero crossings corresponds to the number of times that a signal crosses the x-axis in a
fixed time frame, while the energy feature is the RMS value of the signal contained in an
audio frame. These features are calculated over a time frame of 4096 samples, so that the features
are extracted from the first 93ms of the vocalisation for audio sampled at 44.1kHz.
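A minimal sketch of these two temporal features is given below, assuming a mono frame of 4096 samples taken at the detected onset (4096 / 44100 Hz is roughly 93 ms); it is an illustration only, not the zerox∼-based Max implementation used in the device.

```python
import numpy as np

FRAME_SIZE = 4096  # 4096 samples at 44.1 kHz is roughly 93 ms

def temporal_features(frame):
    """RMS energy and zero-crossing count of one analysis frame."""
    rms = np.sqrt(np.mean(frame ** 2))
    # a zero crossing occurs wherever two consecutive samples differ in sign
    signs = np.signbit(frame)
    zero_crossings = int(np.count_nonzero(signs[1:] != signs[:-1]))
    return rms, zero_crossings
```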
The spectral descriptors extracted are the following:

• Spectral Centroid: This feature corresponds to the centre of mass of a spectrum and is
connected to the perception of a sound's “brightness” [38]. Its value can be calculated as
follows:

$\mu = \dfrac{\sum_{i=0}^{n-1} f[i]\, a[i]}{\sum_{i=0}^{n-1} a[i]}$   (4.1)
where n is half of the FFT window size, i is the bin index, a[i] is the corresponding amplitude
and f[i] is its frequency, calculated as follows:

$f[i] = i \cdot \dfrac{\text{sample rate}}{\text{FFT window size}}$   (4.2)

[35]

• Spectral Spread: This descriptor measures the spread of the spectrum around the spectral
centroid (its variance):

$\nu = \dfrac{\sum_{i=0}^{n-1} \left( f[i] - \mu \right)^2 a[i]}{\sum_{i=0}^{n-1} a[i]}$   (4.3)

[35]

• Spectral Slope: Calculates the slope of the magnitude spectrum by doing a linear regression
of it:

$\text{slope} = \dfrac{1}{\sum_{i=0}^{n-1} a[i]} \cdot \dfrac{n \sum_{i=0}^{n-1} f[i]\, a[i] - \sum_{i=0}^{n-1} f[i] \sum_{i=0}^{n-1} a[i]}{n \sum_{i=0}^{n-1} f[i]^2 - \left( \sum_{i=0}^{n-1} f[i] \right)^2}$   (4.4)

[35]

• Spectral Decrease: This feature is similar to the spectral slope as it also represents the
decreasing of the magnitude spectrum, but according to [39], is supposed to relate to human
perception:

$\text{decrease} = \dfrac{1}{\sum_{i=2}^{K} a[i]} \sum_{i=2}^{K} \dfrac{a[i] - a[1]}{i - 1}$   (4.5)

[35]

• Spectral Roll-Off: Computes the frequency value below which 95% of the signal energy is
contained. For x = roll-off point:

$\sum_{i=0}^{f_c} a^2[f[i]] = x \sum_{i=0}^{n-1} a^2[f[i]]$   (4.6)

[35]

• Spectral Skewness: Measures the asymmetry of the spectrum around its centre of mass and
was originally proposed in [39].

• Spectral Flux: Measures how quickly the energy of a signal is changing and is calculated
as the Euclidean distance between two normalised spectra. [40]

• Spectral Kurtosis: Is similar to skewness but, instead of measuring the asymmetry of the
spectrum, it measures the flatness around its centre of mass.

• Spectral Flatness: Provides a measure of how similar to white noise a sound is and is mea-
sured in four frequency bands (250-500Hz, 500-1000Hz, 1000-2000Hz and 2000-4000Hz)
[41].

• MFCC: These are the coefficients that form a mel-frequency cepstrum and are commonly
used in speech recognition. They represent the spectrum based on its perception and are
considered in [42] to be the “best available approximation of human ear”.

• BFCC: This method is similar to the MFCC, but uses the Bark frequency filter bank instead
of the Mel filters [42].

The extracted features are normalised and, when a feature outputs a frequency value, its scale is
converted from exponential to linear so that lower frequencies carry as much weight as higher
ones.
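As an illustration of how the first two descriptors above can be computed from a magnitude spectrum, a minimal sketch follows. It is written for clarity only; in the device these values are produced by the Zsa.Descriptors objects rather than by hand-written code, and all names here are illustrative.

#include <math.h>

/* Spectral centroid (Eq. 4.1) and spread (Eq. 4.3) of one magnitude
 * spectrum a[] with n bins, given the sample rate and FFT size. */
static void centroid_and_spread(const double *a, int n, double sr, int fft_size,
                                double *centroid, double *spread)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        double f = i * sr / fft_size;        /* bin frequency, Eq. 4.2 */
        num += f * a[i];
        den += a[i];
    }
    *centroid = (den > 0.0) ? num / den : 0.0;

    double var = 0.0;
    for (int i = 0; i < n; i++) {
        double f = i * sr / fft_size;
        var += (f - *centroid) * (f - *centroid) * a[i];
    }
    *spread = (den > 0.0) ? var / den : 0.0;
}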

4.1.3 Feature Selection and Machine Learning Algorithm

The features extracted in the previous module are given to the feature selection object that is
connected to a machine learning algorithm.
Our system is meant to be user-specific. The classification is adapted to each user and not to
a general dataset. Therefore, each user should train the algorithm with their own vocalisations,
in order to obtain higher prediction accuracy. The features that best differentiate each
drum sound vocalisation for a given user should be chosen automatically, without user intervention.
To achieve this, we implemented a feature selection method, Sequential Forward
Selection (SFS), to be used along with the k-NN machine learning algorithm.
The SFS method was first proposed in [43] and is a bottom-up feature selection algorithm,
which means that it starts with an empty set of features. The feature that provides the best
performance is added to this set first. Additional features are then added sequentially until the
stopping condition is met; this condition is normally a threshold on the performance of the system
or a maximum number of selected features.
The user should first train the system with vocalisations of each drum sound. Therefore, the
SFS can use the number of correct k-NN guesses on the training set as a measure of the usefulness
of each feature. This was the approach chosen in order to select the features that work best for
each user's vocalisations. Whenever adding a feature brings no improvement, the
algorithm stops.
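A compact sketch of this selection loop is given below. The helper knn_accuracy is hypothetical and stands for training and evaluating the k-NN classifier on the training instances restricted to the currently selected feature columns; the full external that actually performs this together with timbreID is presented in Appendix A.

/* Sequential Forward Selection driven by training-set accuracy.
 * sel[] receives the chosen feature indices; the return value is how
 * many were selected before accuracy stopped improving. */
static int sfs_select(int n_features, int sel[],
                      double (*knn_accuracy)(const int *sel, int n_sel))
{
    int n_sel = 0;
    double best_so_far = 0.0;

    for (;;) {
        int best_feat = -1;
        double best_acc = best_so_far;

        for (int f = 0; f < n_features; f++) {        /* try each unused feature */
            int used = 0;
            for (int k = 0; k < n_sel; k++)
                if (sel[k] == f) { used = 1; break; }
            if (used) continue;

            sel[n_sel] = f;                           /* tentatively add it */
            double acc = knn_accuracy(sel, n_sel + 1);
            if (acc > best_acc) { best_acc = acc; best_feat = f; }
        }

        if (best_feat < 0) break;                     /* no improvement: stop */
        sel[n_sel++] = best_feat;
        best_so_far = best_acc;
    }
    return n_sel;
}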
The machine learning algorithm used is the k-Nearest Neighbour. This is a simple method
based on measuring the distances between the training data and the input sample. The
Euclidean distance is a common measure of distance between points and can be calculated as
\sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}, where x and y are the two points, i is the axis index and n is the
dimensionality of the Euclidean space. The input sample is assigned the most frequent class among
its k nearest training samples [44].
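A minimal sketch of this classification step (Euclidean distance and majority vote) is shown below. The data layout and capacities are illustrative assumptions; in the actual system this step is delegated to the timbreID external described in Section 4.2.3.

#include <float.h>

#define MAX_TRAIN   256   /* illustrative capacity of the training set */
#define MAX_CLASSES 5     /* LVT allows up to five drum types, labelled 0..4 */

typedef struct { double feat[8]; int label; } knn_sample;

/* Returns the label voted for by the k nearest neighbours of query[]. */
static int knn_classify(const knn_sample *train, int n_train,
                        const double *query, int n_feat, int k)
{
    int used[MAX_TRAIN] = {0}, votes[MAX_CLASSES] = {0};

    for (int picked = 0; picked < k && picked < n_train; picked++) {
        double best = DBL_MAX;
        int best_i = 0;
        for (int i = 0; i < n_train; i++) {
            if (used[i]) continue;
            double d = 0.0;
            for (int j = 0; j < n_feat; j++) {
                double diff = train[i].feat[j] - query[j];
                d += diff * diff;             /* squared Euclidean distance */
            }
            if (d < best) { best = d; best_i = i; }
        }
        used[best_i] = 1;                     /* take each neighbour only once */
        votes[train[best_i].label]++;
    }

    int winner = 0;                           /* majority vote over the k picks */
    for (int c = 1; c < MAX_CLASSES; c++)
        if (votes[c] > votes[winner]) winner = c;
    return winner;
}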

4.2 Implementation
The system was implemented as a Max for Live device, so that it is easy for Ableton Live users to
install and work with. Max for Live is a toolkit that allows users to build devices for Ableton Live
using the visual programming language Max.
The implementation of the most relevant part of the back end system is shown in Figure 4.2.

Figure 4.2: Main part of the Max patch responsible for the operation of the system and its components

4.2.1 Onset Detection

Before the audio input that comes from Ableton Live is given to the onset detector, the left and
right channels are summed to mono. This way, there is only one signal chain in the system instead
of two audio channels to be analysed separately.
In Max, the HFC algorithm for onset detection can be implemented using the external object
aubioOnset∼. Aubio [45] is a free and open source C library designed for the extraction of anno-
tations from audio signals. Some of the functions provided by this library were wrapped as Pure
Data externals and, later, the onset detection function was adapted for Max MSP as an external

by Marius Miron. This object receives one audio signal and, when a peak is detected, outputs a
“bang”.
The aubioOnset∼ is initialised as “aubioOnset∼ hfc 512 128 -70 0.7”. Therefore, the param-
eters used in our system are the following:

• Onset Detection Function: HFC. As previously shown, high frequency content is the most
appropriate method to use when detecting non pitched percussive sounds, which is the case
of the vocalised percussion sounds.

• Threshold: 0.7. This parameter controls the threshold value for the onset peak picking
and the values should be between 0.001 and 0.9 [34]. Different values were tested for
the threshold of the onset detection algorithm. Different audio clips from the dataset were
analysed and 0.7 provided a good balance between detecting most of the vocalisations while
not detecting many false positives.

• Silence Threshold: -70dB. This option corresponds to the volume under which the onsets
will not be detected.

• Buffer Size: 512 samples. This value sets the number of samples in the buffer to be analysed
and also corresponds to the window of the spectral and temporal computations [34]. The
larger this buffer, the better the frequency resolution but the longer it takes to detect an onset.
A buffer size of 512 samples provides accurate detection of onsets and only corresponds to a
delay of approximately 11.6 ms at a sample rate of 44.1 kHz.

• Hop Size: 128 samples. This parameter corresponds to the number of samples between two
consecutive analysis frames [34]. The selected value provided a good temporal resolution.

4.2.2 Feature Extraction


The feature extraction is implemented by using either Max MSP native objects or the library
Zsa.Descriptors [35].
The number of zero crossings is calculated using the zerox∼ Max MSP object. This function
receives audio in its first inlet and outputs the number of times the analysed signal passed through
the X axis. The audio frame used for the analysis is the last signal vector. A signal vector is the
block that MSP uses in its operation and its size can be defined in the audio setup window. In
order to derive the energy, an envelope follower is used. The value of the envelope is stored when
an onset is detected and, unlike the rest of the features, it is not used for the classification but
for acquiring a velocity value. The sampled value is compared to a maximum and mapped to a
number between 1 and 127, that corresponds to the velocity of the given vocalisation.
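A minimal sketch of this velocity mapping is given below; the maximum reference value and the clipping behaviour are assumptions made for illustration rather than a literal transcription of the patch.

/* Map the envelope value sampled at onset time to a MIDI velocity
 * between 1 and 127, relative to a running maximum rms_max. */
static int rms_to_velocity(double rms, double rms_max)
{
    double norm = (rms_max > 0.0) ? rms / rms_max : 0.0;  /* 0..1 */
    if (norm > 1.0) norm = 1.0;
    if (norm < 0.0) norm = 0.0;
    int vel = 1 + (int)(norm * 126.0 + 0.5);              /* 1..127 */
    return (vel > 127) ? 127 : vel;
}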
In order to extract the spectral features from the audio, the Zsa.Descriptors library is used. This
library, developed at IRCAM by Mikhail Malt and Emmanuel Jordan, covers a large set of
descriptors and is able to work in real time. In order to increase CPU efficiency,

the Zsa objects are implemented inside the same pfft∼ patch, as shown in Figure 4.3. This object
is a spectral processing manager for patchers and allows working in the frequency domain in Max
MSP. The window size chosen for the FFT is 4096 samples, in order to provide good frequency
resolution: the purpose of this system is to provide good accuracy in the identification of the
vocalisations rather than minimal latency. The overlap factor chosen for the FFT analysis is 8,
which corresponds to a hop size of 512 samples.
The number of zero crossings and all the spectral features are extracted 3584 samples after the
onset is detected. The buffer size of the onset detection module is 512 samples, and an event can
only be detected at the end of the buffer. As we only want to analyse the audio frame that follows
the onset, the feature values are only evaluated 4096 samples after the earliest moment at which the onset
can happen. If the onset is at the beginning of the onset detector buffer, it will take 512 samples
to be detected and the features that correspond to this event will be extracted 512 + 3584 = 4096
samples after it occurred. The value of the energy is only calculated when the onset is detected, as
we want to use the maximum power to calculate the velocity, and this occurs at the beginning of
the vocalisation.

Figure 4.3: Inside the pfft∼ patch

4.2.3 Feature Selection and Machine Learning Algorithm

The k-NN algorithm was implemented by using TimbreID, a Pure Data external developed by
William Brent [46] and ported to Max by Marius Miron that implements the k-NN machine learn-
ing algorithm. It can use different metrics for the calculation of the distance but the one chosen
for this system is the Euclidean distance.

As no such object existed, an external that implements the Sequential Forward Selection was
written in C using the Max API [47]. This external was developed to work together with timbreID,
so the messages it sends and receives are adapted to this classifier. A flowchart summarising its
operation and the full code can be seen in Appendix A.
When writing an external object for Max, according to the Max API, there are five basic steps.
The first one corresponds to adding the ext.h and ext_obex.h header files.
The object declaration follows. Here, a C structure is declared that contains all the class
variables. In this case, the C structure is declared as follows:
typedef struct _seqfeatsel
{
    t_object ob;              // the object itself (must be first)
    bool iden;                // is it already trained or not?
    bool flag;                // has timbreID given an answer?
    bool debug;               // used for debugging purposes
    bool fase;                // is timbreID answer a no care?
    long numFeatures;         // number of features received
    long rowCount;            // counts the rows
    long ultimaNota;          // last received MIDI note
    long resposta;            // answer from timbreID
    long knn;                 // knn value
    long numNotas;            // total number of notes
    short nSel;               // number of selected features
    t_member *notasPCluster;  // array with the notes for the cluster msg
    t_instance *trainingTab;  // array with the received features
    int *selCol;              // columns selected through sfs
    void *a_out;              // output the column to use
    void *b_out;              // outputs the rows for the kNN external
} t_seqfeatsel;

struct.c

The next step is to create an initialisation routine. When the object is loaded in Max, this
routine is run and informs Max which methods should be called when an instance of the object is
created or destroyed, or when it receives a message.
void ext_main(void *r)
{
    t_class *c;

    c = class_new("seqfeatsel", (method)seqfeatsel_new, (method)seqfeatsel_free,
                  (long)sizeof(t_seqfeatsel), 0L, A_GIMME, 0);

    class_addmethod(c, (method)seqfeatsel_message, "list", A_GIMME, 0);
    class_addmethod(c, (method)seqfeatsel_in1, "in1", A_LONG, 0);
    class_addmethod(c, (method)seqfeatsel_id, "id", 0);
    class_addmethod(c, (method)seqfeatsel_clear, "clear", 0);
    class_addmethod(c, (method)seqfeatsel_debug, "debug", 0);
    class_addmethod(c, (method)seqfeatsel_knn, "knn", A_LONG, 0);
    class_addmethod(c, (method)seqfeatsel_assist, "assist", A_CANT, 0);

    class_register(CLASS_BOX, c);
    seqfeatsel_class = c;
}

initialization.c

After this, the new instance routine should be developed. In this function, all the class variables
are initialised, the memory for the arrays is allocated and the inlets and outlets of the object are
created. The code is presented next:
void *seqfeatsel_new(t_symbol *s, long argc, t_atom *argv)
{
    t_seqfeatsel *x = NULL;
    x = (t_seqfeatsel *)object_alloc(seqfeatsel_class);
    x->b_out = listout(x);
    x->a_out = outlet_new((t_seqfeatsel *)x, NULL);
    x->notasPCluster = (t_member *)sysmem_newptr(0);
    x->trainingTab = (t_instance *)sysmem_newptr(0);
    x->selCol = (int *)sysmem_newptr(0);  // columns selected through sfs
    intin(x, 1);
    x->iden = false;
    x->flag = false;
    x->debug = false;
    x->fase = false;
    x->numFeatures = 0;
    x->rowCount = 0;                      // counts the rows
    x->ultimaNota = 0;
    x->resposta = 0;
    x->knn = 1;
    x->numNotas = 0;
    x->nSel = 0;
    return x;
}

newinstance.c

Finally, the message handlers were written. These are the methods that are run when a message
arrives. As can be seen in the initialisation routine, this object handles 7 different messages:
When a feature list is received, if the object is not trained, the number of features and the
label are stored and the number of instances is incremented. Otherwise, if the object is trained, the
received message is filtered and the object outputs only the selected features.
Whenever a message containing "id" is received, the object starts the process of identifying
which features provide the best performance. For each feature:

• If the feature has already been selected, jump to the end.

• All instances are sent to timbreID, in order to train it with data.



• Messages are sent to timbreID informing it how to cluster the training data.

• A message to set the k-NN value is sent to timbreID.

• Send each instance again but this time compare the output from timbreID with the correct
label. In order to make this method wait for timbreID’s answer, a flag is set to 0 and, while
this flag is not changed, the thread is put to sleep.

• A message to reset timbreID is sent.

The feature that best improves the classification is added to the list of selected features. If a feature
is in this list, every time features are sent to timbreID, the values of that feature for each instance
are also sent. The improvement of the iteration is calculated and, if it is greater than 0, another
feature can be added and the cycle repeats. Otherwise, the object is considered trained and
messages are sent in order to train timbreID with the selected features and with the cluster
information.
If an integer is received in the right inlet, the inlet to which the timbreID output is connected,
the flag is set to 1 and the "id" method can exit the while loop.
When a "debug" message is received, the debug flag is set to 1. This is used for debugging
purposes and prints important log messages to the Max window.
The “clear” message resets all the class variables to their initial values and a reset message is
sent to timbreID.
If the “knn (int)” message is received, the k-NN value is set to the one specified in the message.
The final method handles the assist message. This is used to provide visual information in the
Max patch window about what each inlet and outlet does.

4.3 User Interface


Due to the constraints imposed by the Max for Live toolkit, in order to have an instrument capable
of outputting MIDI notes, two devices have to be used. The first one is a Max for Live audio
effect, while the second one is a MIDI effect. The user should load both devices in Ableton Live,
LVT loaded as an audio effect (in an effect rack or on an audio track) and the LVT Receiver loaded
in a MIDI track. The audio effect, LVT, receives the audio input and sends messages containing
the transcribed MIDI notes to the MIDI effect, LVT Receiver, which outputs these notes to the
Ableton Live track. These objects’ user interfaces and how they should be used are described in
the following subsections.

4.3.1 LVT

The user interface of the LVT device contains two panels, as can be seen in Figure 4.4. In the first
panel, the user sets the number of repetitions of each drum vocalisation used to train the identifier
and its corresponding MIDI note, and a light for each drum sound is displayed. The second panel

contains the Train button to initialise the system’s training, the Reset button that resets the system
to its initial state, a light that informs if the system is trained or not and the device identifier.

Figure 4.4: User interface of the LVT device

The first step when using this system is to set the number of repetitions for each vocalisation.
If one of the percussive instruments is not going to be vocalised, setting this value to 0 will disable
it. Then, the desired MIDI notes for each vocalisation should be set. These should be the notes
that correspond to the desired sounds in the instrument that follows the LVT receiver in the MIDI
chain. LVT allows for up to five different drum types.
The training of the system starts by pressing the Train button in the second panel. The
user should then begin vocalising the desired drum sounds by repeating each sound the number
of times defined earlier. As each vocalisation is detected, the correspondingly labelled drum type
will light up. When the training phase has finished, the Trained light is turned on.
The system is now trained; it can be reset to its initial state by pressing the Reset button.
When a vocalisation is performed, the corresponding light will flash. At this point, the messages
are not yet reaching the LVT Receiver; the set-up of the LVT Receiver is described in the next subsection.
In order to link the interface with the transcription system, several small Max patches were
used. The Train button works as a switch for the audio input. A patch was created to count the num-
ber of training instances and to label each one appropriately. One patch calculates the velocity of
the detected event from the instantaneous RMS energy value. Another patch packs the classifica-
tion from timbreID as a MIDI note and sends it to the LVT Receiver. The final patch is responsible
for getting the device identifier from Ableton Live and to present it in the LVT window.

4.3.2 LVT Receiver

The LVT Receiver user interface has only one panel. This panel contains a drop-down menu used
to select the identifier of the LVT sender device, a reset button and a mute button, used to stop
listening to the messages from LVT.
The setup of the LVT receiver is simple. The identifier that is displayed in the LVT window
should be selected from the LVT receiver drop-down menu. After this step, the LVT system is
ready to use and a VST or a plugin can be loaded in Ableton Live after the LVT receiver. The

Figure 4.5: User interface of LVT receiver device

MIDI transcription can be stored in an Ableton Live MIDI clip by creating a new MIDI track that
gets its MIDI input from the Post FX channel of the track where the LVT Receiver is loaded. There
is a constant delay between the vocalisation and the transcription, therefore, the start time of the
MIDI clip should be adjusted manually when the recording is finished.
This receiver device is responsible for unpacking the messages received from the chosen LVT
device and converting them to MIDI notes. The message is formatted using the midiformat object
which packs the information in a MIDI message that is then output to Ableton Live.
Chapter 5

Data Preparation

In this chapter, the approach used to collect a dataset of vocalised percussion will be described
in detail. The method used to record the vocalisations will be explained first, followed by the
procedure used to annotate these recordings.

5.1 Dataset Recording

In order to collect different vocalisations of percussion, a group of 11 men and 9 women was
selected. A balanced gender distribution was sought so that the dataset better reflects the general
population. Almost all of the participants work at a University Radio; 7 of them are music
enthusiasts with some knowledge of music making, while the rest have only basic music knowledge.
Only one of the people involved has beatboxing skills, while the others vocalised the percussive
sounds in a “less professional” way.
Participants were first asked to reproduce 4 bars of a fixed drum pattern with the vocalisations
they were more comfortable with and, after this, 4 bars of improvisation using the same sounds.
The pattern the participants were asked to reproduce was a simple 4 bar loop with kicks on the first
and second beat, snare on the third and hi-hats between them, as shown in Figure 5.1. Participants
were first familiarised with this pattern by listening to it played through an 808 drum-kit, as some
of them could not read music scores. Based on the state of the art, only these three drum sounds
were chosen to be vocalised.

Figure 5.1: Pattern participants were asked to reproduce


Participants were given Audio Technica ATH M50X headphones with a metronome at 140
bpm to have a time reference. In order to collect data for possible different use cases, the audio
output was recorded through 3 different microphones (one from a laptop, an AKG c4000b and one
from an iPad). The first one had a lot of noise due to its poor quality, the second one served as a
reference for a good microphone, while the last one served as a reference for a good mobile phone
microphone.
The recordings were made in a sound treated studio in order to isolate external noises.
This process led to 120 audio clips with approximately 6 seconds of duration. The number of
different drum hits contained in these recordings can be seen in Table 5.1.

         Fixed Pattern        Improvisation     Total
Kick     8 × 20 × 3 = 480     164 × 3 = 492     972
Snare    4 × 20 × 3 = 240     98 × 3 = 294      534
Hi-hat   8 × 20 × 3 = 480     181 × 3 = 543     1023

Table 5.1: Number of individual hits contained in the recordings

5.2 Dataset Annotation


The audio files that resulted from the recordings referred to in the previous section were compiled in
2 folders, one for the improvisation part and the other one for the fixed pattern recordings. Each
file was given a name that contained the code for each person, ‘I’ if it was an improvisation or ‘P’
if it was a fixed pattern and a number corresponding to the microphone that recorded the audio (1
for laptop microphone, 2 for the AKG microphone and 3 for the iPad one).
Besides saving the recordings in files, these were also collected in an Ableton Live project, as
seen in Figure 5.2. The files were split into 2 groups, each one having a track for each microphone.
Both the files with the annotations and the Ableton Live project are available in the dissertation
website1 .
The annotation of the audio was done using Sonic Visualizer2 , an application for viewing and
analysing the contents of music audio files, developed at the Centre for Digital Music, Queen
Mary University of London. The detection of the onsets was done manually and each event was
labelled as kick, snare or hi-hat. The transcription was saved both as a .csv file and as a MIDI file,
with the kick mapped to note 36, the snare to note 38 and the hi-hat to note 42. The resulting files
were given the same name as the audio files, but without the microphone number. An example of
a transcription can be seen in Figure 5.3.
From the resulting transcriptions, we could see that participants did not reproduce the hits in
time with the 140 bpm metronome, as seen in Figure 5.4. However, this has no effect on the
transcription system or its accuracy, as neither is tempo dependent.
Participants did not vocalise the drum hits in the same way. First, the user with beatboxing
knowledge vocalised the kick and snare in a different manner than the rest of the participants, as
1 https://lvtsmc.wordpress.com/
2 http://www.sonicvisualiser.org/

Figure 5.2: Organization of the dataset files in an Ableton Live project

can be heard in the dataset (JSil audio clip). Besides this, participants vocalised the kick and the
snare in the way that was easiest for them; therefore, different sounds were used to reproduce the
same drum hits, as can be seen in Figures 5.5 and 5.6.

Figure 5.3: Example of the audio annotation in Sonic Visualizer

Figure 5.4: Example of how participants vocalised the pattern

(a) A vocalised kick


(b) A kick reproduced by a beatboxer

Figure 5.5: Two different vocalisations of kick drum

(a) One vocalisation of a snare (b) Another possible vocalisation of a snare

Figure 5.6: Two different vocalisations of a snare drum


Chapter 6

Evaluation

This chapter describes the methodology used to test and evaluate the performance of the LVT
system, in comparison with the existing solutions, LDT [2] and Ableton Live Convert Drums to
MIDI function.

6.1 Experiment Design


The evaluation of LVT comprises one principal experiment that serves two purposes. The first
is to understand how a user-specific trained system performs compared to the state of the art, i.e.,
systems which are trained to work on general drum timbres, while the second is to explore
whether LVT can help to improve a producer's workflow by examining the effort required to get
from a vocalised input pattern to an accurate MIDI representation.
In order to evaluate the three systems on the same data while still using different data for the
training and the evaluation of LVT, some preparation of the dataset was required. Five kick, snare
and hi-hat vocalisations were extracted from the improvisation part of the dataset so as to create
training clips for the LVT (presented in Section 4.3). These clips were created in a manner that
simulates how a user would train the algorithm, with a speed similar to the one each participant had
in their improvisation audio recording. Seven of the contributors to the dataset did not vocalise
each drum sound at least five times and, therefore, their recordings were removed from the evaluation
data. This resulted in an evaluation set of 13 participants with both a training and a testing audio
clip recorded in three different microphones, which corresponds to 78 audio clips. These clips
were compiled in an Ableton Live project, which is available in the dissertation website1 and that
can be seen in Figure 6.1. This Ableton Live project contains an audio track and three MIDI
tracks for each microphone. The audio tracks contain the training and the testing audio clips and
the MIDI tracks contain the clips with the transcriptions from each system: LVT, LDT Max for
Live device [2] and Ableton Live Convert Drums to MIDI.
To obtain a measure of the accuracy of a user trained system compared to the state of the art
systems, the F-measure of the transcriptions was calculated. The F-measure is the harmonic mean
1 https://lvtsmc.wordpress.com/


Figure 6.1: Ableton project for the evaluation

of precision and recall, and can be calculated as follows:

F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (6.1)

where:

\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (6.2)

\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (6.3)
This was calculated by importing all the transcription MIDI clips with MIDI Tools2 into MATLAB,
comparing the transcriptions with the annotations and then calculating the F-measure for each drum
and the average of these values. These results were plotted to see the effect of increasing the
F-measure tolerance window (as a means to understand the effect of temporal localisation) on this
accuracy measurement.
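For illustration, a minimal sketch of how an F-measure with a tolerance window can be computed is shown below. It is written in C for consistency with the rest of this document; the actual evaluation was carried out in MATLAB as described above, and the greedy matching strategy shown here is an assumption.

#include <math.h>

typedef struct { double time; int drum; } onset_t;

/* An estimated onset counts as a true positive when an unmatched
 * reference onset of the same drum lies within +/- tol seconds of it. */
static double f_measure(const onset_t *ref, int n_ref,
                        const onset_t *est, int n_est, double tol)
{
    int matched_ref[1024] = {0};          /* illustrative capacity */
    int tp = 0;

    for (int e = 0; e < n_est; e++) {
        for (int r = 0; r < n_ref; r++) {
            if (!matched_ref[r] && ref[r].drum == est[e].drum &&
                fabs(ref[r].time - est[e].time) <= tol) {
                matched_ref[r] = 1;       /* each reference matched once */
                tp++;
                break;
            }
        }
    }
    int fp = n_est - tp;                  /* estimates with no matching reference */
    int fn = n_ref - tp;                  /* references with no matching estimate */
    double precision = (tp + fp) ? (double)tp / (tp + fp) : 0.0;
    double recall    = (tp + fn) ? (double)tp / (tp + fn) : 0.0;
    return (precision + recall > 0.0)
         ? 2.0 * precision * recall / (precision + recall) : 0.0;
}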
Finally, to acquire a measure of how this system can improve a producer's workflow, the time
taken to get a transcription and the number of operations needed to reach the desired pattern were
calculated.
To measure the time an Ableton Live transcription takes, a stopwatch was used to measure the
time since the "Convert Drums to MIDI" button was pressed until the resulting MIDI clip came
into view. This procedure was done on 9 random clips from the 3 different microphones. These
measurements were then averaged. As LDT works in real time, the time to achieve a transcription
is the same as the time of the audio recording to be transcribed. Finally, in order to obtain a
2 http://www.ee.columbia.edu/∼csmit/matlab_midi.html

transcription from LVT, two times have to be measured. First, the training time corresponds to the
time of the training audio clips which was calculated and averaged. The time to transcribe a given
audio recording when the system is trained corresponds to the time of this recording.
Then, the number of operations needed to achieve the desired pattern, which can be seen in
Figure 6.2, was calculated.

Figure 6.2: Desired Pattern

Figure 6.3: How the number of operations was calculated. 1) Delete the extra events; 2) Correct
the events that can be corrected; 3) Add the missing events.

The possible operations are divided into three categories: correcting, removing, or adding an event.
The procedure to compute these values is explained in Figure 6.3. First, all the additional events
are removed. These are the events that do not correspond to a real onset or that cannot be corrected
to the real classification. Afterwards, the events that are the result of a misclassification are cor-
rected and, finally, the missing events are added to the MIDI clip. The number of operations for
each clip was written down and, then, the number of operations for each category was calculated
for each microphone and for each transcription system. Since it is not reasonable to assume that
a producer editing the transcriptions would work at a constant speed, it was deemed more reliable
to take an objective measurement of workflow effort in terms of the number of operations, rather
than recording the time taken, as was done for the algorithm processing.

6.2 Results
In this section the results from the evaluation previously described are presented.
The calculated results of the F-measure accuracy (with a tolerance window of ± 0.035 sec-
onds) for each microphone are presented in Tables 6.1, 6.2 and 6.3.

Kick Snare Hi-hat


LVT 27.9% 18.1% 7.6%
Ableton Live 34.4% 34.3% 15.8%
LDT 3.87% 15.4% 11.8%
Table 6.1: F-measure results for the PC microphone

Kick Snare Hi-hat


LVT 91.4% 69.1% 80.2%
Ableton Live 51.8% 47.0% 29.7%
LDT 53.8% 20.4% 41.9%
Table 6.2: F-measure results for the AKG microphone

Kick Snare Hi-hat


LVT 82.2% 39.5% 70.6%
Ableton Live 55.2% 43.0% 31.8%
LDT 58.4% 22.3% 42.8%
Table 6.3: F-measure results for the iPad microphone

These results show that, except for the laptop microphone and the snare from the iPad microphone,
LVT achieves much better performance than the other systems, sometimes even double their
F-measure. All systems report low accuracy on the laptop microphone. The performance of the
state of the art systems on the AKG and iPad microphones is similar, whilst, for LVT, the use of
the laptop microphone leads to the worst overall performance.
The effect of changing the window size value for the F-measure for every drum can be seen
in Figure 6.4, where each line represents the performance of each system for each microphone (1-
laptop microphone, 2- AKG microphone and 3-iPad microphone).
The values for the F-measure stay approximately constant for values higher than 0.035 sec-
onds. The performance of LVT on the recordings from the AKG microphone surpasses the others
by a significant amount, as shown by the fact that it outperforms all other configurations for all
window sizes. On the kick and hi-hat, it is followed by the performance of LVT on the iPad
recordings.
In order to see the effect of user-specific training on the performance of LVT, an example is
provided where LVT is trained on one user and tested on another – and vice-versa. When training
the LVT with a different person with different vocalisations, the accuracy of the transcription
is decreased, as can be seen in Figures 6.5 and 6.6. The upper part of each figure shows the
transcription of a user when the system is trained with their own vocalisations, while the bottom
part corresponds to the transcription when it is trained with the other user. As these figures show,
without the user-specific training, many misclassifications occur.
A surprising observation for these two users is that the feature selection in the training phase
suggested the need only for a single feature per user. In other words, the different drum vocalisa-
tions in training could be perfectly separated using just one dimension. The selected feature for

Figure 6.4: Effect of changing the F-measure window size per vocalised drum sound (panels: Kick,
Snare, Hi-hat and Average; vertical axis: F-measure; horizontal axis: window size in seconds) and
across microphones. All LDT scores are shown in red, Ableton Live (ABL) in green and LVT in
blue. The solid lines indicate the laptop microphone, the dotted lines the AKG microphone, and
the dashed lines the iPad microphone.

Figure 6.5: Transcription of the first user vocalisations using the LVT system trained by the second
user

the first user was the spectral flatness from 0 to 250 Hz and for the second one was the spectral
skewness.
The effect of selecting the wrong feature on clustering can be seen in Figures 6.5, 6.6 and 6.7.
In the latter, the distribution of kicks (circle), snares (diamond) and hi-hats (*) is shown. In a),
the elements of the second user's vocalised pattern are distributed from 0 to 1 according to the
value of the selected feature (skewness). In b), the elements from the first user's vocalisation according to

Figure 6.6: Transcription of the second user vocalisations using the LVT system trained by the
first user

the feature selected for the second user are displayed and, finally, in c) these same elements are
distributed according to the feature considered most appropriate by SFS (flatness from 0 to 250 Hz).
When the appropriate feature is selected, it can be seen that the different drum hits are closely
clustered. When this is not the case, the drum hits are more spread and the regions for each drum
are not well defined.

Figure 6.7: Effect of choosing a wrong feature for a user. a) 2nd user; b) 1st user with feature for
2nd user; c) 1st user.

In terms of a qualitative comparison between LVT, LDT and Ableton Live, an example tran-
scription can be seen in Figure 6.8 for LVT; for Ableton Live Convert Drums to MIDI in 6.9; and
for LDT in 6.10.
In this example, where LVT transcribes the vocalised pattern accurately, Ableton Live constantly
detects hi-hats on top of the other drum sounds and even during silence. Furthermore, it did not
detect any kick drum in this recording. In turn, LDT, besides detecting all the ground-truth events,
also identified a large number of false positives.

Figure 6.8: Example of an LVT transcription

Figure 6.9: Example of an Ableton Live Convert Drums to MIDI transcription

Figure 6.10: Example of a LDT transcription



The timing measurements for each of the systems are the following:

• Ableton Live Convert Drums to MIDI: 12.9s

• LVT: 6.2s+6.9s= 13.1s (average of training clips + audio clip time)

• LDT: 6.9s (audio clip time)

In order to achieve a transcription, LDT is the quickest, followed by Ableton Live and then
LVT. The Ableton Live Convert Drums to MIDI function first shows a loading window but, after it
finishes, some additional time elapses before the DAW displays the MIDI clip in its user interface.
In addition to these processing times, required to give an initial automatic transcription, Tables
6.4, 6.5 and 6.6 summarise the results obtained from counting the total number of operations
needed to obtain the desired pattern for each microphone. The total number of events for each
microphone is 13 × 20 = 260, which corresponds to the number of users × the number of events in
each audio clip.

Corrected Added Removed


LVT 26 182 3
Ableton Live 22 33 440
LDT 52 124 95
Table 6.4: Number of Operations for the PC microphone

Corrected Added Removed


LVT 39 7 15
Ableton Live 33 12 296
LDT 52 24 206
Table 6.5: Number of Operations for the AKG c4000b microphone

Corrected Added Removed


LVT 57 40 5
Ableton Live 67 10 215
LDT 51 22 198
Table 6.6: Number of Operations for the iPad microphone

From these tables, it is easy to see that the transcriptions from LDT and Ableton Live require
many events to be removed, whilst the ones from LVT do not. The number of corrected vocalisations
for LVT and Ableton Live is similar, while the one for LDT remains approximately constant
across all microphones. For the laptop and iPad microphones, LVT under-detected more events than
the rest of the systems.

6.3 Discussion
In this section, the results presented in the previous section are analysed and discussed.
By examining the results provided in Tables 6.1, 6.2 and 6.3, we can see that LVT provides a
transcription closer to the ground truth than the generally trained state of the art systems, as shown
by the higher F-measure. Besides the fact that LVT is trained per user, these results may derive
from the fact that this system does not try to detect polyphonic events (more than one drum
vocalisation at the same time) as the other systems do. Furthermore, LVT does not detect as many
events as the other systems, which has an influence on the F-measure results in terms of false
positives.
From Figure 6.7, we can see that feature selection is an important step towards acquiring an accurate
transcription. In order to have the k-NN algorithm work as well as possible, the different vocalisations
must be tightly clustered, as can be seen in a) and c). From Figures 6.5 and 6.6 we can see
that a system trained by a different user yields inaccurate classifications, which shows the
importance of training the system to adapt to the user's vocalisations.
For the small cost in time needed to train the LVT, the transcription accuracy is greatly
increased and, as shown by the far fewer post-transcription operations, a significant amount of
time is saved when correcting the transcribed pattern. On this basis, the end-to-end workflow, from
training to transcription to correction, is most efficient for LVT, suggesting a real, tangible benefit for
user-adaptive analysis. However, while LVT performs especially well with the AKG microphone
and with the iPad microphone, its performance with the computer microphone is particularly poor.
This microphone has a lot of background noise and the system is not able to detect onsets and
hence cannot provide a transcription. Thus within the processing pipeline, accurate onset detection
is extremely important, and its impact is directly observable in the transcription accuracy.
Concerning other possible limitations of LVT, if a user does not vocalise the drum sounds the
same way in the training and in the identification phase, the transcription will not work well – a
factor to which more generally trained systems would not be susceptible. Furthermore, if a user
vocalises drums that sound too similar to each other, as was the case in some of the clips from the
dataset, the events will not be separated clearly enough from the perspective of the audio
features, and therefore the machine learning component will struggle to identify them correctly.
Finally, if a user vocalises the drum sounds too quietly, the onset detection will not work and the
event will not be transcribed.
As a final point, in order to have a quantitative measure of how a system performs in terms of a
producer's workflow, the number of operations to achieve a desired pattern is considered a more
meaningful measure than the F-measure. This is due to the fact that F-measure is dependent on
the window size and on the fact that a misclassification is represented both as a false positive and
as a false negative. The same occurs with poorly timed transcriptions, where an onset outside the
F-measure window also contributes to this calculation. These two possible deviations are easily
fixed in a DAW via simple shifting operations and are thus less significant errors than totally
spurious false positives or totally missing events.
Chapter 7

Conclusions

7.1 Summary

In this dissertation, a new interface for music creation, called LVT, was presented. This system
allows Ableton Live users to sequence MIDI patterns that can be used for designing rhythms by
using their voice. The state of the art systems, including one already in Ableton Live, are not
able to transcribe vocalised percussion effectively, as these are trained for general recorded drum
sounds which are not vocalised. Different people vocalise drum sounds in different manners: a
snare drum vocalisation from one user can sound similar to the hi-hat vocalisation from another.
LVT has to be trained before it is used, in order to fit the vocalisations of any end user. As each
user can choose the desired vocalisations for each drum sound, the system is versatile enough to
also transcribe drums or any kind of unpitched percussive sounds. As long as the training sounds
are different enough from each other, the system is able to choose the features that provide a
good separation and therefore a good classification accuracy for any input. In order to improve
the accuracy of this system, a Max external that implements the Sequential Forward Selection for
selecting features was developed. LVT is implemented as a Max for Live device, which enables
Ableton Live users to use this system by interacting with the simple and easy-to-use graphical user
interface designed for it.

The evaluation of the LVT and of the existing state of the art systems was done by running tests
on a dataset that was recorded and annotated. In order to collect different percussion vocalisations,
participants performed a fixed pattern with the vocalisations they found most suitable. The evaluation
of the accuracy of the transcription was done by calculating the F-measure and by counting the
number of actions needed to transform the resulting transcription into the desired pattern. The F-
measure is an adequate evaluation of the transcription accuracy while the number of operations
relates to how this accuracy affects the Ableton Live user workflow. LVT produced superior results
in both tests, showing that this tool can be used as an alternative to the existing drum transcription
systems in order to create MIDI drum patterns using the voice as the instrument.


7.2 Future Work

Despite the good results of the LVT described earlier, there are some features that can be added
to this system in order to improve its usability and performance. Due to time restrictions and the
amount of work some of these features need, they were not yet implemented. These improvements
are the following:

• Settings window: Add a window to the User Interface where system parameters can be set.
Examples of possible relevant parameters that may be changed by the user are the number
of neighbours for the k-NN classifier and the threshold and mode for the onset detection.
These values can only be changed inside the Max for Live patch and, therefore, they are not
easily accessible to the user.

• Save and Load button: Adding a Save and a Load button would make it possible to save
the training to a file, so that the system does not have to be retrained each time the user
loads it, and would provide more portability between computers for the same user.

• More feature selection methods and machine learning algorithms: Evaluate the effec-
tiveness of other feature selection or extraction methods and add the possibility to choose
the one desired by the user. Possible methods to be added are Sequential Backward Selec-
tion, Generalised Sequential Forward and Backward Selection, Sequential Forward Floating
Search or even adding the possibility for the user to manually select the desired features to
be extracted. Other machine learning algorithms could also be implemented and evaluated.

• Export the selected features report: The system prints to the Max command window the
index of the selected features. Adding the possibility of exporting a report that contains the
chosen features is a possible improvement.

• More features extracted: Adding more features to the feature extraction module. Temporal
and spectral features can be added such as the duration or the spectral roughness. Adding
the duration as a feature can help to make a distinction between open and closed hi-hats, as
some vocalisations of these cymbals only differ in their duration.

• Further Testing: More testing should be done on LVT with a different number of vocalised
drum sounds and with different values for the training instances.

• Pattern-Based Analysis: Researching pattern-based analysis techniques to be incorporated


in the LVT system. A possible implementation of this topic is to detect microtimings on the
vocalised pattern and correct the result when these are not present.

7.3 Perspectives on the Project


Introducing participants with no background in music production to music creation and its
technology, as well as allowing them to improvise with vocalised drum sounds, was a rewarding
experience. Furthermore, this project provided me with insight into the development of Max
externals and how Max MSP works, as well as into feature extraction, onset detection and machine
learning.
Appendix A

seqfeatsel C Code

This appendix contains the C code for the Max external seqfeatsel and the corresponding flowchart.


A.1 seqfeatsel Code

#include "ext.h"       // standard Max include, always required
#include "ext_obex.h"  // required for new style Max object
//////////////////////// object struct

typedef struct instance
{
    double *instance;
} t_instance;

typedef struct member
{
    int *member;
} t_member;

typedef struct _seqfeatsel
{
    t_object ob;              // the object itself (must be first)
    bool iden;                // is it already trained or not?
    bool flag;                // has timbreID given an answer?
    bool debug;               // used for debugging purposes
    bool fase;                // is timbreID answer a no care?
    long numFeatures;         // number of features received
    long rowCount;            // counts the rows
    long ultimaNota;          // last received MIDI note
    long resposta;            // answer from timbreID
    long knn;                 // knn value
    long numNotas;            // total number of notes
    short nSel;               // number of selected features
    t_member *notasPCluster;  // array with the notes for the cluster message
    t_instance *trainingTab;  // array with the received features
    int *selCol;              // columns selected through sfs
    void *a_out;              // output the column to use
    void *b_out;              // outputs the rows for the kNN external
} t_seqfeatsel;

//////////////////////// function prototypes
//// standard set
void *seqfeatsel_new(t_symbol *s, long argc, t_atom *argv);  // object creation method
void seqfeatsel_assist(t_seqfeatsel *x, void *b, long m, long a, char *s);
void seqfeatsel_free(t_seqfeatsel *x);

void seqfeatsel_message(t_seqfeatsel *x, t_symbol *s, long argc, t_atom *argv);
void seqfeatsel_in1(t_seqfeatsel *x, long entrada);
void seqfeatsel_debug(t_seqfeatsel *x);
void seqfeatsel_id(t_seqfeatsel *x);
void seqfeatsel_clear(t_seqfeatsel *x);
void seqfeatsel_print(long argc, t_atom *argv);
void seqfeatsel_print2(long certa, long proposta, long maxrows, long j);
void seqfeatsel_knn(t_seqfeatsel *x, long knn);

//////////////////////// global class pointer variable
void *seqfeatsel_class;
void ext_main(void *r)
{
    t_class *c;

    c = class_new("seqfeatsel", (method)seqfeatsel_new, (method)seqfeatsel_free,
                  (long)sizeof(t_seqfeatsel), 0L, A_GIMME, 0);

    class_addmethod(c, (method)seqfeatsel_message, "list", A_GIMME, 0);
    class_addmethod(c, (method)seqfeatsel_in1, "in1", A_LONG, 0);
    class_addmethod(c, (method)seqfeatsel_id, "id", 0);
    class_addmethod(c, (method)seqfeatsel_clear, "clear", 0);
    class_addmethod(c, (method)seqfeatsel_debug, "debug", 0);
    class_addmethod(c, (method)seqfeatsel_knn, "knn", A_LONG, 0);
    class_addmethod(c, (method)seqfeatsel_assist, "assist", A_CANT, 0);

    class_register(CLASS_BOX, c);
    seqfeatsel_class = c;
}

void seqfeatsel_print2(long certa, long proposta, long maxrows, long j)
{
    post("Resposta certa: %ld, Resposta TimbreID: %ld, Numero de instances: %ld, j: %ld.",
         certa, proposta, maxrows, j);
}

void seqfeatsel_knn(t_seqfeatsel *x, long knnval) {

    x->knn = knnval;

}
void seqfeatsel_print(long argc, t_atom *argv)
{
    long i;
    t_atom *ap;
    post("there are %ld arguments", argc);
    // increment ap each time to get to the next atom
    for (i = 0, ap = argv; i < argc; i++, ap++) {
        switch (atom_gettype(ap)) {
            case A_LONG:
                post("%ld: %ld", i + 1, atom_getlong(ap));
                break;
            case A_FLOAT:
                post("%ld: %.2f", i + 1, atom_getfloat(ap));
                break;
            case A_SYM:
                post("%ld: %s", i + 1, atom_getsym(ap)->s_name);
                break;
            default:
                post("%ld: unknown atom type (%ld)", i + 1, atom_gettype(ap));
                break;
        }
    }
}
void seqfeatsel_message(t_seqfeatsel *x, t_symbol *s, long argc, t_atom *argv) {

    if (x->iden == false) {
        int i, listLength;
        long linha = x->rowCount;
        listLength = argc;

        x->trainingTab = (t_instance *)sysmem_resizeptr(x->trainingTab,
                                                        (x->rowCount + 1) * sizeof(t_instance));
        x->trainingTab[linha].instance = (double *)sysmem_newptr(listLength * sizeof(double));

        x->rowCount++;
        if (x->debug == 1) { post("Recebi Mensagem e entrei no iden == false"); }
        if (x->numFeatures != listLength) {
            x->selCol = (int *)sysmem_resizeptr(x->selCol, listLength * sizeof(int));
            x->numFeatures = listLength; }
        if (x->debug == 1) { post("argc = %ld e numFeatures = %ld", argc, x->numFeatures); }

        if (linha == 0) {  // First Note
            if (x->debug == 1) { post("Primeira Nota"); }

            x->ultimaNota = atom_getlong(argv);
            if (x->debug == 1) { post("ultima nota = %ld", x->ultimaNota); }
            x->numNotas = 0;
            x->notasPCluster = (t_member *)sysmem_resizeptr(x->notasPCluster, sizeof(t_member));
            x->notasPCluster[0].member = (int *)sysmem_newptr(3 * sizeof(int));
            x->notasPCluster[0].member[0] = 0;
            x->notasPCluster[0].member[1] = 0;
        }

        if (x->ultimaNota != atom_getlong(argv)) {  // if the received note is different from the previous one
            if (x->debug == 1) { post("Nota Diferente"); }

            x->ultimaNota = atom_getlong(argv);

            x->notasPCluster[x->numNotas].member[2] = linha - 1;  // end of the previous row

            x->numNotas++;

            x->notasPCluster = (t_member *)sysmem_resizeptr(x->notasPCluster,
                                                            (x->numNotas + 1) * sizeof(t_member));
            x->notasPCluster[x->numNotas].member = (int *)sysmem_newptr(3 * sizeof(int));

            x->notasPCluster[x->numNotas].member[0] = x->numNotas;  // new note
            x->notasPCluster[x->numNotas].member[1] = linha;        // start index of new note
        }
        x->trainingTab[linha].instance[0] = x->numNotas;
        if (x->debug == 1) { post("[%ld][%ld]: %ld", 0, linha, x->numNotas); }

        for (i = 1; i < listLength; i++)
            x->trainingTab[linha].instance[i] = atom_getlong(argv + i);
    }

    if (x->iden == true) {

        t_atom *saida = (t_atom *)sysmem_newptr(x->nSel * sizeof(t_atom));
        // filters columns
        if (x->debug == 1) { post("x iden e verdade"); }

        for (int k = 0; k < x->nSel; k++) {
            atom_setfloat(saida + k, atom_getfloat(argv + x->selCol[k]));
        }
        outlet_list(x->b_out, NULL, x->nSel, saida);
        sysmem_freeptr(saida);

    }
}

void seqfeatsel_id(t_seqfeatsel *x) {

    short i, j, k, improv, numCorrectasOld, l, numCorrectas, c, maximum, indiceMax;
    numCorrectasOld = 0;

    t_atom *mensknn = (t_atom *)sysmem_newptr(sizeof(t_atom));
    t_atom *mensCluster = (t_atom *)sysmem_newptr(4 * sizeof(t_atom));

    t_atom *saida = (t_atom *)sysmem_newptr(0);
    t_atom *treino = (t_atom *)sysmem_newptr(0);

    int *correctas = (int *)sysmem_newptr(x->numFeatures * sizeof(int));
    t_symbol *cluster, *knnsym;
    cluster = gensym("manual_cluster");
    knnsym = gensym("knn");
    improv = 1;

    x->notasPCluster[x->numNotas].member[2] = x->rowCount - 1;  // marks the end of the table
    if (x->debug == 1) { post("Estou no ID"); }
    if (x->rowCount > 1) {
        while (improv > 0) {

            saida = (t_atom *)sysmem_resizeptr(saida, (x->nSel + 1) * sizeof(t_atom));
            treino = (t_atom *)sysmem_resizeptr(treino, x->nSel * sizeof(t_atom));

            // Runs all columns except the one with labels
            for (i = 1; i < x->numFeatures; i++) {
                if (x->debug == 1) { post("Entrei na Coluna %ld", i); }

                // if column already selected, don't run
                for (k = 0; k < x->nSel; k++) {
                    if (i == x->selCol[k]) { correctas[i] = 0; goto fora; }
                }

                // Trains timbreID one row at a time
                x->fase = false;
                for (j = 0; j < x->rowCount; j++) {

                    // creates message and sends it
                    for (k = 0; k < x->nSel; k++) {
                        atom_setfloat(saida + k, x->trainingTab[j].instance[x->selCol[k]]);
                    }
                    atom_setfloat(saida + x->nSel, x->trainingTab[j].instance[i]);
                    outlet_list(x->a_out, NULL, x->nSel + 1, saida);

                    if (x->debug == 1) { post("Enviei para treinar o timbreID %f",
                        x->trainingTab[j].instance[i]); }
                }

                // sends clustering messages
                for (l = 0; l < x->numNotas + 1; l++) {
                    atom_setlong(mensCluster, x->numNotas + 1);
                    atom_setlong(mensCluster + 1, l);
                    atom_setlong(mensCluster + 2, x->notasPCluster[l].member[1]);
                    atom_setlong(mensCluster + 3, x->notasPCluster[l].member[2]);

                    outlet_anything(x->a_out, cluster, 4, mensCluster);
                    if (x->debug == 1) { post("Enviei mensagem de Cluster"); }
                }
                atom_setlong(mensknn, x->knn);
                outlet_anything(x->a_out, knnsym, 1, mensknn);

                numCorrectas = 0;
                x->fase = true;
                // sends messages for timbreID to identify
                for (j = 0; j < x->rowCount; j++) {  // for every row
                    // sends row to output 2
                    for (k = 0; k < x->nSel; k++) {
                        atom_setfloat(saida + k, x->trainingTab[j].instance[x->selCol[k]]);
                    }
                    atom_setfloat(saida + x->nSel, x->trainingTab[j].instance[i]);
                    x->flag = false;
                    outlet_list(x->b_out, NULL, x->nSel + 1, saida);
                    if (x->debug == 1) { post("Enviei para o timbreID identificar %f",
                        x->trainingTab[j].instance[i]); }
                    // wait for answer
                    while (x->flag == false) {
                        systhread_sleep(1);
                    }
                    if (x->debug == 1) { post("Sai do while"); }
                    // see if answer is right
                    long resposta = x->resposta;
                    if (resposta == (long)x->trainingTab[j].instance[0]) {
                        // add to vector with number of correct answers
                        numCorrectas++;
                        if (x->debug == 1) { seqfeatsel_print2((long)
                            x->trainingTab[j].instance[0], resposta, x->rowCount, j); }
                    }
                }
                correctas[i] = numCorrectas;
                if (x->debug == 1) { post("Coluna %ld tem %ld certas", i, numCorrectas); }
                // send reset message to timbreID
                outlet_anything(x->a_out, gensym("clear"), 0, NULL);
                fora: if (x->debug == 1) { post("Enviei mensagem de Clear"); }
            }
            // adds the column with the best accuracy to the selected columns
            maximum = correctas[1];
            indiceMax = 1;
            for (c = 1; c < x->numFeatures; c++)
            {
                if (x->debug == true) { post("correctas[%ld] = %ld e max = %ld", c,
                    correctas[c], maximum); }
                if (correctas[c] > maximum)
                {
                    maximum = correctas[c];
                    indiceMax = c;
                }
            }
            // calculates the improvement
            improv = maximum - numCorrectasOld;
            numCorrectasOld = maximum;
            if (improv > 0) {
                post("Added column %ld", indiceMax);
                x->selCol[x->nSel] = indiceMax;
                x->nSel++;
            }
            if (x->debug == 1) { post("Improv %ld", improv); }
        }  // end of the while loop

        for (j = 0; j < x->rowCount; j++) {

            // creates a t_atom[] with the row to send and sends it
            for (k = 0; k < x->nSel; k++) {
                atom_setfloat(treino + k, x->trainingTab[j].instance[x->selCol[k]]);
            }
            outlet_list(x->a_out, NULL, x->nSel, treino);
            if (x->debug == 1) { post("TREINEI O TIMBREID"); }
        }

        // Cluster message
        for (l = 0; l < x->numNotas + 1; l++) {
            atom_setlong(mensCluster, x->numNotas + 1);
            atom_setlong(mensCluster + 1, l);
            atom_setlong(mensCluster + 2, x->notasPCluster[l].member[1]);
            atom_setlong(mensCluster + 3, x->notasPCluster[l].member[2]);

            outlet_anything(x->a_out, cluster, 4, mensCluster);
            if (x->debug == 1) { post("Enviei mensagem de Cluster"); }
        }
        atom_setlong(mensknn, x->knn);
        outlet_anything(x->a_out, knnsym, 1, mensknn);

        sysmem_freeptr(mensknn);
        sysmem_freeptr(mensCluster);
        sysmem_freeptr(saida);
        sysmem_freeptr(treino);
        sysmem_freeptr(correctas);

        outlet_anything(x->a_out, gensym("idfinish"), 0, NULL);
        x->iden = true;
        if (x->debug == true) { post("acabei o ciclo while"); }
    }
}

void seqfeatsel_in1(t_seqfeatsel *x, long entrada) {
    // receives the timbreID guess
    if (x->debug == true) { post("Recebi do timbreID"); }

    if (x->fase == true) {
        if (x->debug == true) { post("Recebi do timbreID e liguei"); }
        x->resposta = entrada;
        x->flag = true;
    }
}

void seqfeatsel_assist(t_seqfeatsel *x, void *b, long msg, long arg, char *dst)
{   // assist message
    if (msg == ASSIST_INLET) {
        switch (arg) {
            case 0: strcpy(dst, "(lists/messages) Receives feature lists with the first element labeling the event"); break;
            case 1: strcpy(dst, "(integer) Receives the identification from the classifier"); break;
        }
    }
    else if (msg == ASSIST_OUTLET) {
        switch (arg) {
            case 0: strcpy(dst, "(lists/messages) Sends messages and training lists to the classifier"); break;
            case 1: strcpy(dst, "(lists) Sends lists for identification as well as the final filtered list to the classifier"); break;
        }
    }
}

void seqfeatsel_clear(t_seqfeatsel *x)
{   // when it receives clear
    if (x->debug == 1) { post("Recebi Clear"); }

    x->iden = false;
    x->flag = false;
    x->debug = false;

    x->numFeatures = 0;
    x->rowCount = 0;  // counts the rows
    x->ultimaNota = 0;
    x->resposta = 0;

    x->numNotas = 0;
    x->nSel = 0;
    outlet_anything(x->a_out, gensym("clear"), 0, NULL);

}

void seqfeatsel_debug(t_seqfeatsel *x)
{   // sets debug to 1

    x->debug = true;
    if (x->debug == 1) { post("Recebi debug"); }

}

void *seqfeatsel_new(t_symbol *s, long argc, t_atom *argv)
{
    t_seqfeatsel *x = NULL;

    x = (t_seqfeatsel *)object_alloc(seqfeatsel_class);

    x->b_out = listout(x);
    x->a_out = outlet_new((t_seqfeatsel *)x, NULL);

    x->notasPCluster = (t_member *)sysmem_newptr(0);
    x->trainingTab = (t_instance *)sysmem_newptr(0);
    x->selCol = (int *)sysmem_newptr(0);  // columns selected through SFS

    intin(x, 1);

    x->iden = false;
    x->flag = false;
    x->debug = false;
    x->fase = false;

    x->numFeatures = 0;
    x->rowCount = 0;  // counts the rows
    x->ultimaNota = 0;
    x->resposta = 0;
    x->knn = 1;

    x->numNotas = 0;
    x->nSel = 0;
    return x;
}

void seqfeatsel_free(t_seqfeatsel *x) {
    int i;
    if (x->numNotas != 0) {
        for (i = 0; i < x->numNotas + 1; i++)
            sysmem_freeptr(x->notasPCluster[i].member);
    }
    for (i = 0; i < x->rowCount; i++)
        sysmem_freeptr(x->trainingTab[i].instance);

    sysmem_freeptr(x->notasPCluster);
    sysmem_freeptr(x->trainingTab);
    sysmem_freeptr(x->selCol);
}

seqfeatselclean.c
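For readers who want to follow the control flow of seqfeatsel_id without the Max message passing around it, the listing below is a minimal, self-contained sketch of the same sequential forward selection loop. It is illustrative only: the helper evaluate_subset() and the constants NUM_FEATURES and MAX_SEL are hypothetical stand-ins for the train/identify round trip through timbreID and for the sizes kept in the external's state.

#include <stdio.h>

#define NUM_FEATURES 8   /* column 0 holds the label, columns 1..7 are features */
#define MAX_SEL      8

/* Hypothetical stand-in for the timbreID train/identify cycle: it should
   return how many training rows are classified correctly when `candidate`
   is added to the already selected columns. Here it returns a dummy score
   so that the control flow can be compiled and traced. */
static int evaluate_subset(const int *selected, int nSel, int candidate)
{
    (void)selected;
    return nSel + candidate;
}

int main(void)
{
    int selected[MAX_SEL];
    int nSel = 0;
    int bestSoFar = 0;
    int improvement = 1;

    while (improvement > 0 && nSel < MAX_SEL) {
        int bestScore = -1, bestColumn = -1;

        /* Try every feature column that has not been selected yet. */
        for (int col = 1; col < NUM_FEATURES; col++) {
            int alreadySelected = 0;
            for (int k = 0; k < nSel; k++)
                if (selected[k] == col) { alreadySelected = 1; break; }
            if (alreadySelected)
                continue;

            int score = evaluate_subset(selected, nSel, col);
            if (score > bestScore) { bestScore = score; bestColumn = col; }
        }

        /* Keep the best candidate only if it improves on the previous round,
           mirroring the `improv > 0` test in seqfeatsel_id. */
        improvement = bestScore - bestSoFar;
        if (improvement > 0) {
            selected[nSel++] = bestColumn;
            bestSoFar = bestScore;
            printf("Added column %d (score %d)\n", bestColumn, bestScore);
        }
    }
    return 0;
}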

A.2 Flowchart of the seqfeatsel external



Figure A.1: Flowchart of the seqfeatsel external


References

[1] O. Gillet and G. Richard, “Transcription and separation of drum signals from polyphonic
music,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, pp. 529–
540, March 2008.

[2] M. Miron, M. E. P. Davies, and F. Gouyon, “An open-source drum transcription system for Pure Data and Max MSP,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 221–225, May 2013.

[3] M. Zadel and G. Scavone, “Laptop performance: Techniques, tools, and a new interface de-
sign,” in Proceedings of the International Computer Music Conference, pp. 643–648, 2006.

[4] E. Sinyor, C. McKay, R. Fiebrink, D. McEnnis, and I. Fujinaga, “Beatbox classification using ACE,” in Proceedings of the International Conference on Music Information Retrieval, pp. 672–675, 2005.

[5] M. Atherton, “Rhythm-speak: mnemonic, language play or song?,” in Proceedings of the Inaugural International Conference on Music Communication Science, pp. 15–18, 2007.

[6] D. Stowell and M. D. Plumbley, “Characteristics of the beatboxing vocal style,” tech. rep., Centre for Digital Music, Department of Electronic Engineering, Queen Mary, University of London, 2008.

[7] A. Kapur, M. Benning, and G. Tzanetakis, “Query-by-beat-boxing: music retrieval for the DJ,” in Proceedings of the International Conference on Music Information Retrieval, pp. 170–178, 2004.

[8] T. B. Holmes and T. Holmes, Electronic and experimental music: pioneers in technology and
composition. Psychology Press, 2002.

[9] S. Sanderson, “Low profile keyboard device and system for recording and scoring music,”
Dec. 13 1988. US Patent 4,790,230.

[10] M. Duignan, J. Noble, P. Barr, and R. Biddle, Metaphors for Electronic Music Production in
Reason and Live, pp. 111–120. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004.

[11] S. Sapir, “Gestural control of digital audio environments,” Journal of New Music Research,
vol. 31, no. 2, pp. 119–129, 2002.

[12] J. S. Downie, “Music information retrieval,” Annual Review of Information Science and Tech-
nology, vol. 37, no. 1, pp. 295–340, 2003.


[13] M. Schedl, E. Gómez, and J. Urbano, “Music information retrieval: Recent developments
and applications,” Foundations and Trends in Information Retrieval, vol. 8, no. 2-3, pp. 127–
261, 2014.

[14] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, “Automatic music transcription: challenges and future directions,” Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 407–434, 2013.

[15] C. W. Wu and A. Lerch, “Drum transcription using partially fixed non-negative matrix factor-
ization,” in 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 1281–1285,
Aug 2015.

[16] E. Benetos, S. Ewert, and T. Weyde, “Automatic transcription of pitched and unpitched
sounds from polyphonic music,” in Acoustics, Speech and Signal Processing (ICASSP), 2014
IEEE International Conference on, pp. 3107–3111, IEEE, 2014.

[17] D. P. W. Ellis, “Beat tracking by dynamic programming,” Journal of New Music Research,
vol. 36, no. 1, pp. 51–60, 2007.

[18] M. E. P. Davies and M. D. Plumbley, “Context-dependent beat tracking of musical audio,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1009–1020, March 2007.

[19] M. F. McKinney, D. Moelants, M. E. Davies, and A. Klapuri, “Evaluation of audio beat tracking and music tempo extraction algorithms,” Journal of New Music Research, vol. 36, no. 1, pp. 1–16, 2007.

[20] J. Foote, “Visualizing music and audio using self-similarity,” in Proceedings of the seventh
ACM international conference on Multimedia (Part 1), pp. 77–80, ACM, 1999.

[21] C. Dittmar and D. Gärtner, “Real-time transcription and separation of drum recordings based
on NMF decomposition,” in Proc. of the Intl. Conference on Digital Audio Effects (DAFx),
pp. 187–194, 2014.

[22] O. Gillet and G. Richard, “Automatic transcription of drum loops,” in 2004 IEEE Interna-
tional Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. iv–269–iv–272
vol.4, May 2004.

[23] O. Gillet and G. Richard, “Drum track transcription of polyphonic music using noise sub-
space projection,” in In Proc. of ISMIR, pp. 92–99, 2005.

[24] K. Tanghe, S. Degroeve, and B. D. Baets, “An algorithm for detecting and labeling drum
events in polyphonic music,” in In Proc. of First Annual Music Information Retrieval Evalu-
ation eXchange, pp. 11–15, 2005.

[25] A. Roebel, J. Pons, M. Liuni, and M. Lagrange, “On automatic drum transcription using non-negative matrix deconvolution and Itakura-Saito divergence,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 414–418, April 2015.

[26] D. Stowell and M. D. Plumbley, “Delayed decision-making in real-time beatbox percussion classification,” Journal of New Music Research, vol. 39, pp. 203–213, Sep 2010.

[27] A. Hazan, “Towards automatic transcription of expressive oral percussive performances,” in Proceedings of the 10th International Conference on Intelligent User Interfaces, IUI ’05, (New York, NY, USA), pp. 296–298, ACM, 2005.

[28] D. Christensen, E. R. Høeg, R. B. Lind, S. A. Nilsson, D. M. Smed, C. Sørensen, and S. P. Vinkel, “Automatic transcription of beatboxing,” tech. rep., Aalborg University, 2014.

[29] C. McKay, R. Fiebrink, D. McEnnis, B. Li, and I. Fujinaga, “ACE: A framework for optimizing music classification,” in Proceedings of the International Conference on Music Information Retrieval, pp. 42–49, 2005.

[30] T. Nakano, J. Ogata, M. Goto, and Y. Hiraga, “A drum pattern retrieval method by voice
percussion,” in In Proc. of ISMIR 2004, pp. 550–553, 2004.

[31] O. Gillet and G. Richard, “Drum loops retrieval from spoken queries,” Journal of Intelligent
Information Systems, vol. 24, no. 2, pp. 159–177, 2005.

[32] K. Hipke, M. Toomim, R. Fiebrink, and J. Fogarty, “Beatbox: end-user interactive definition
and training of recognizers for percussive vocalizations,” in Proceedings of the 2014 Inter-
national Working Conference on Advanced Visual Interfaces, pp. 121–124, ACM, 2014.

[33] D. DeSantis, I. Gallagher, K. Haywood, R. Knudsen, G. Behles, J. Rang, R. Henke, and T. Slama, Ableton Reference Manual Version 9. Ableton, 2016.

[34] P. Brossier, “Man page of aubioonset,” Jan 2017.

[35] M. Malt and E. Jourdan, “Zsa. descriptors: a library for real-time descriptors analysis,” in
Proceedings of 5th Sound and Music Computing Conference Berlin, 2008.

[36] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler, “A tutorial on onset detection in music signals,” IEEE Transactions on Speech and Audio Processing, vol. 13, pp. 1035–1047, Sept 2005.

[37] P. Masri, Computer modelling of sound for transformation and synthesis of musical signals.
PhD thesis, University of Bristol, 1996.

[38] J. M. Grey and J. W. Gordon, “Perceptual effects of spectral modifications on musical tim-
bres,” The Journal of the Acoustical Society of America, vol. 63, no. 5, pp. 1493–1500, 1978.

[39] G. Peeters, “A large set of audio features for sound description (similarity and classification)
in the CUIDADO project,” tech. rep., IRCAM, 2004.

[40] D. Giannoulis, M. Massberg, and J. D. Reiss, “Parameter automation in a dynamic range compressor,” Journal of the Audio Engineering Society, vol. 61, no. 10, pp. 716–726, 2013.

[41] S. Dubnov, “Generalization of spectral flatness measure for non-gaussian linear processes,”
IEEE Signal Processing Letters, vol. 11, pp. 698–701, Aug 2004.

[42] T. Gulzar, A. Singh, and S. Sharma, “Comparative analysis of LPCC, MFCC and BFCC for the recognition of Hindi words using artificial neural networks,” International Journal of Computer Applications, vol. 101, no. 12, pp. 22–27, 2014.

[43] A. W. Whitney, “A direct method of nonparametric measurement selection,” IEEE Trans. Comput., vol. 20, pp. 1100–1103, Sept. 1971.

[44] L. E. Peterson, “K-nearest neighbor,” Scholarpedia, vol. 4, no. 2, p. 1883, 2009. revision
#136646.

[45] P. Brossier, MartinHN, N. Philippsen, T. Seaver, E. Müller, and S. Alexander, “aubio/aubio: 0.4.5.” https://doi.org/10.5281/zenodo.496134. Accessed: 2017-5-17.

[46] W. Brent, “A timbre analysis and classification toolkit for pure data,” in ICMC, 2010.

[47] “Max API.” https://cycling74.com/sdk/MaxSDK-7.1.0/html/index.html. Accessed: 2017-5-20.