
Music and movement share a dynamic structure that supports universal expressions of emotion

Beau Sievers^a,1, Larry Polansky^b, Michael Casey^b, and Thalia Wheatley^a,1

^aDepartment of Psychological and Brain Sciences and ^bDepartment of Music, Dartmouth College, Hanover, NH 03755

Edited by Dale Purves, Duke-National University of Singapore Graduate Medical School, Singapore, and approved November 7, 2012 (received for review
May 28, 2012)

Music moves us. Its kinetic power is the foundation of human
behaviors as diverse as dance, romance, lullabies, and the military
march. Despite its significance, the music-movement relationship is
poorly understood. We present an empirical method for testing
whether music and movement share a common structure that affords
equivalent and universal emotional expressions. Our method uses
a computer program that can generate matching examples of music
and movement from a single set of features: rate, jitter (regularity of
rate), direction, step size, and dissonance/visual spikiness. We applied
our method in two experiments, one in the United States and another
in an isolated tribal village in Cambodia. These experiments revealed
three things: (i) each emotion was represented by a unique combination of features, (ii) each combination expressed the same emotion
in both music and movement, and (iii) this common structure between music and movement was evident within and across cultures.

| cross-modal

Music moves us, literally. All human cultures dance to music
and music's kinetic faculty is exploited in everything from
military marches and political rallies to social gatherings and
romance. This cross-modal relationship is so fundamental that in
many languages the words for music and dance are often interchangeable, if not the same (1). We speak of music "moving" us
and we describe emotions themselves with music and movement
words like "bouncy" and "upbeat" (2). Despite its centrality to
human experience, an explanation for the music-movement link
has been elusive. Here we offer empirical evidence that sheds
new light on this ancient marriage: music and movement share
a dynamic structure.
A shared structure is consistent with several findings from
research with infants. It is now well established that very young
infants, even neonates (3), are predisposed to group metrically
regular, auditory events similarly to adults (4, 5). Moreover, infants
also infer meter from movement. In one study, 7-mo-old infants
were bounced in duple or triple meter while listening to an ambiguous rhythm pattern (6). When hearing the same pattern later
without movement, infants preferred the pattern with intensity
(auditory) accents that matched the particular metric pattern at
which they were previously bounced. Thus, the perception of a
beat, established by movement or by music, transfers across modalities. Infant preferences suggest that perceptual correspondences between music and movement, at least for beat perception,
are predisposed and therefore likely universal. By definition, however, infant studies do not examine whether such predispositions
survive into adulthood after protracted exposure to culture-specific
influences. For this reason, adult cross-cultural research provides
important complementary evidence for universality.
Previous research suggests that several musical features are
universal. Most of these features are low-level structural properties, such as the use of regular rhythms, preference for small-integer frequency ratios, hierarchical organization of pitches, and so
on (7, 8). We suggest music's capacity to imitate biological dynamics including emotive movement is also universal, and that this
capacity is subserved by the fundamental dynamic similarity of the
domains of music and movement. Imitation of human physiological responses would help explain, for example, why angry
music is faster and more dissonant than peaceful music. This
capacity may also help us understand music's inductive effects: for
example, the soothing power of lullabies and the stimulating,
synchronizing force of military marching rhythms.

70–75 | PNAS | January 2, 2013 | vol. 110 | no. 1

Here we present an empirical method for quantitatively comparing music and movement by leveraging the fact that both can
express emotion. We used this method to test to what extent
expressions of the same emotion in music and movement share the
same structure; that is, whether they have the same dynamic features. We then tested whether this structure comes from biology or
culture. That is, whether we are born with the predisposition to
relate music and movement in particular ways, or whether these
relationships are culturally transmitted. There is evidence that
emotion expressed in music can be understood across cultures,
despite dramatic cultural differences (9). There is also evidence
that facial expressions and other emotional movements are cross-culturally universal (10–12), as Darwin theorized (13). A natural
predisposition to relate emotional expression in music and movement would explain why music often appears to be cross-culturally
intelligible when other fundamental cultural practices (such as
verbal language) are not (14). To determine how music and
movement are related, and whether that relationship is peculiar to
Western culture, we ran two experiments. First, we tested our
common structure hypothesis in the United States. Then we conducted a similar experiment in Lak, a culturally isolated tribal
village in northeastern Cambodia. We compared the results from
both cultures to determine whether the connection between music
and movement is universal. Because many musical practices are
culturally transmitted, we did not expect both experiments to have
precisely identical results. Rather, we hypothesized results from
both cultures would differ in their details yet share core dynamic
features enabling cross-cultural legibility.
General Method
We created a computer program capable of generating both music
and movement; the former as simple, monophonic piano melodies,
and the latter as an animated bouncing ball. Both were controlled
by a single probabilistic model, ensuring there was an isomorphic
relationship between the behavior of the music and the movement
of the ball. This model represented both music and movement in
terms of dynamic contour: how changes in the stimulus unfold over
time. Our model for dynamic contour comprised five quantitative
parameters controlled by on-screen slider bars. Stimuli were generated in real time, and manipulation of the slider bars resulted in
immediate changes in the music being played or the animation
being shown (Fig. 1).

Author contributions: B.S., L.P., and T.W. designed research; B.S. performed research; B.S.,
M.C., and T.W. analyzed data; B.S. and T.W. wrote the paper; and B.S. wrote the MaxMSP
program and recorded the supporting media.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.

To whom correspondence may be addressed. E-mail: beau@beausievers.com or thalia.


This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.



Fig. 1. Paradigm. Participants manipulated five slider bars corresponding to five dynamic features
to create either animations or musical clips that expressed different emotions.

Sievers et al.

indicating their function (SI Text). Each of the cross-modal (music-movement) mappings represented by these slider bars constituted
a hypothesis about the relationship between music and movement.
That is, based on their uses in emotional music and movement
(2, 23), we hypothesized that rate, jitter, direction of movement, and step size have equivalent emotional function in both
music and movement. Additionally, we hypothesized that both
dissonance and spikiness would have negative valence, and that
in equivalent cross-modal emotional expressions the magnitude of
one would be positively correlated with the magnitude of the other.
United States
Methods. Our first experiment took place in the United States
with a population of college students. Participants (n = 50) were
divided into two groups, music (n = 25) and movement (n = 25).
Each participant completed the experiment individually and without knowledge of the other group. That is, each participant was
told about either the music or the movement capability of the
program, but not both.
After the study was described to the participants, written informed consent was obtained. Participants were given a brief
demonstration of the computer program, after which they were
allowed unlimited time to get used to the program through undirected play. At the beginning of this session, the slider bars were
automatically set to random positions. Participants ended the play
session by telling the experimenter that they were ready to begin
the experiment. The duration of play was not recorded, but the
modal duration was 5–10 min. To begin a melody or movement
sequence, participants pressed the space bar on a computer keyboard. The music and movement output were continuously updated
based on the slider bar positions such that participants could see
(or hear) the results of their efforts as they moved the bars. Between
music sequences, there was silence. Between movement sequences,
the ball would hold still in its final position before resetting to a
neutral position at the beginning of the next sequence.
After indicating they were ready to begin the experiment, participants were instructed to take as much time as needed to use the
program to express five emotions: "angry," "happy," "peaceful,"
"sad," and "scared." Following Hevner (24), each emotion word
was presented at the top of a block of five words with roughly the
same meaning (SI Text). These word clusters were present on the
screen throughout the entire duration of the experiment. Participants could work on each of these emotions in any order, clicking
on buttons to save or reload slider bar settings for any emotion at
any time. Only the last example of any emotion saved by the participant was used in our analyses. Participants could use all five
sliders throughout the duration of the experiment, and no restrictions were placed on the order in which the sliders were used. For
example, participants were free to begin using the tempo slider,
then switch to the dissonance slider at any time, then back to the
tempo slider again, and so on. In practice, participants constantly
switched between the sliders, listening or watching the aggregate
effect of all slider positions on the melody or ball movement.
Results. The critical question was whether subjects who used
music to express an emotion set the slider bars to the same
positions as subjects who expressed the same emotion with the
moving ball.


The five parameters corresponded to the following features:
rate (as ball bounces or musical notes per minute, henceforth
beats per minute or BPM), jitter (SD of interonset interval), direction of movement (ratio of downward to upward movements,
controlling either pitch trajectory or ball tilt), step size (ratio of big
to small movements, controlling pitch interval size or ball bounce
height), and finally consonance/smoothness [quantified using
Huron's (15) aggregate dyadic consonance measure and mapped
to surface texture].
The settings for each of these parameters affected both the music
and the movement such that certain musical features were guaranteed to correspond with certain movement features. The rate and
jitter sliders controlled the rate and variation in interonset interval
of events in both modalities. The overall contour of each melody or
bounce sequence was determined by the combined positions of the
direction of movement and step-size sliders. Absolute pitch position corresponded to the extent to which the ball was tilted forward
or backward. Low pitches corresponded with the ball leaning
forward, as though looking toward the ground, and high pitches
corresponded with leaning backward, looking toward the sky.
For music, the consonance slider controlled the selection of one
of 38 possible 5-note scales, selected from the 12-note Western
chromatic scale and sorted in order by their aggregate dyadic
consonance (15) (SI Text). For movement, the consonance slider
controlled the visual spikiness of the ball's surface. Dissonant intervals in the music corresponded to increases in the spikiness of
the ball, and consonant intervals smoothed out its surface. Spikiness
was dynamic in the sense that it was perpetually changing because
of the probabilistic and continuously updating nature of the program; it did not influence the bouncing itself. Our choice of spikiness
as a visual analog of auditory dissonance was inspired by the concept
of auditory roughness described by Parncutt (16). We understand
roughness as an apt metaphor for the experience of dissonance.
We did not use Parncutt's method for calculating dissonance based
on auditory beating (17, 18). To avoid imposing a particular physical
definition of dissonance, we used values derived directly from aggregated empirical reports of listener judgments (15). Spikiness was
also inspired by nonarbitrary mappings between pointed shapes and
unrounded vowels (e.g., "kiki") and between rounded shapes and
rounded sounds (e.g., "bouba"; refs. 19–21). The dissonance-spikiness mapping was achieved by calculating the dissonance of the
melodic interval corresponding to each bounce, and dynamically
scaling the spikiness of the surface of the ball proportionately.
Three of these parameters are basic dynamic properties: speed
(BPM), direction, and step-size. Regularity (jitter) and smoothness were added because of psychological associations with
emotion [namely, predictability and tension (22)] that were not
already captured by speed, step-size, and direction. The number
of features (five) was based on the intuition that this number
created a large enough possibility space to provide a proof-of-concept test of the shared structure hypothesis without becoming
unwieldy for participants. We do not claim that these five features
optimally characterize the space.
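To make the idea of one model driving both modalities concrete, the following is a minimal sketch and not the authors' MaxMSP program. It assumes that the direction and step-size settings can be read as probabilities (`p_down`, `p_big`), that jitter is the SD of a Gaussian interonset interval, and that a shared abstract position maps linearly to pitch or ball tilt. The names `generate_events`, `as_melody`, `as_ball`, `SMALL_STEP`, `BIG_STEP`, and all numeric defaults are hypothetical, and the consonance/spikiness parameter is left out.

```python
import random

# Hypothetical step sizes in abstract position units (not from the source).
SMALL_STEP = 1
BIG_STEP = 4

def generate_events(rate_bpm, jitter_sd, p_down, p_big, n_events=16, seed=None):
    """Return (onset_seconds, position) pairs from one shared parameter set.

    Assumptions: p_down is the probability that a step goes downward,
    p_big is the probability that a step is big, and jitter_sd is the SD
    (in seconds) of the interonset interval.
    """
    rng = random.Random(seed)
    mean_ioi = 60.0 / rate_bpm  # seconds between events at the given rate
    onset, position = 0.0, 0
    events = []
    for _ in range(n_events):
        events.append((onset, position))
        # Jittered interonset interval, floored so time always advances.
        onset += max(0.05, rng.gauss(mean_ioi, jitter_sd))
        direction = -1 if rng.random() < p_down else 1
        size = BIG_STEP if rng.random() < p_big else SMALL_STEP
        position += direction * size
    return events

def as_melody(events, base_midi_pitch=60):
    """Read the shared position as a pitch offset in semitones."""
    return [(onset, base_midi_pitch + position) for onset, position in events]

def as_ball(events, degrees_per_unit=3.0):
    """Read the shared position as ball tilt (positive = leaning backward)."""
    return [(onset, position * degrees_per_unit) for onset, position in events]

events = generate_events(rate_bpm=120, jitter_sd=0.05, p_down=0.7, p_big=0.2, seed=1)
melody, ball = as_melody(events), as_ball(events)
```

Because both renderings are derived from the same event list, their onsets and contours agree by construction, which is the isomorphism the method relies on.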
These parameters were selected to accommodate the production
of specific features previously identified with musical emotion (2).
In the music domain, this set of features can be grouped as timing
features (tempo, jitter) and pitch features (consonance, step
size, and direction). Slider bars were presented with text labels

To answer this question, the positions of the sliders for each
modality (music vs. movement) and each emotion were analyzed
using multiway ANOVA. Emotion had the largest main effect on
slider position [F(2.97, 142.44) = 185.56, P < 0.001, partial η² =
0.79]. Partial η² reflects how much of the overall variance (effect
plus error) in the dependent variable is attributable to the factor in
question. Thus, 79% of the overall variance in where participants
placed the slider bars was attributable to the emotion they were
attempting to convey. This main effect was qualified by an Emotion ×
Slider interaction indicating each emotion required different slider
settings [F(4.81, 230.73) = 112.90, P < 0.001; partial η² = 0.70].
Although we did find a significant main effect of Modality
(music vs. movement) [F(1,48) = 4.66, P < 0.05], it was small
(partial η² = 0.09) and did not interact with Emotion [Emotion ×
Modality: F(2.97, 142.44) = 0.97, P > 0.4; partial η² = 0.02]. This
finding indicates slider bar settings for music and movement were
slightly different from each other, regardless of the emotion being
represented. We also found a three-way interaction between
Slider, Emotion, and Modality. This interaction was significant but
modest [F(4.81, 230.73) = 4.50, P < 0.001; partial η² = 0.09], and
can be interpreted as a measure of the extent to which music and
movement express different emotions with different patterns of
dynamic features.
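As a check on how the effect sizes relate to the reported test statistics, partial η² can be recovered from an F value and its degrees of freedom using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies that identity to the values reported above; the function name `partial_eta_squared` is ours and is not part of any analysis package used in the study.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom.

    Uses the identity SS_effect / (SS_effect + SS_error)
    = F * df_effect / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Values taken from the United States ANOVA reported in the text.
emotion_effect = partial_eta_squared(185.56, 2.97, 142.44)
emotion_by_slider = partial_eta_squared(112.90, 4.81, 230.73)
modality_effect = partial_eta_squared(4.66, 1, 48)
```

Rounded to two decimals, these agree with the reported effect sizes for the Emotion main effect, the Emotion × Slider interaction, and the Modality main effect.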
To investigate the similarity of emotional expressions, we
conducted a Euclidean distance-based clustering analysis. This
analysis revealed a cross-modal, emotion-based structure (Fig. 2).
These results strongly suggest the presence of a common structure. That is, within this experiment, rate, jitter, step size, and direction of movement functioned the same way in emotional music
and movement, and aggregate dyadic dissonance was functionally
analogous to visual spikiness. For our United States population,
music and movement shared a cross-modal expressive code.
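A distance-based clustering of this kind can be sketched as follows. This is an illustrative average-linkage procedure (clusters fused by mean Euclidean distance between members), written in plain Python; it is not the authors' analysis code, and the slider vectors in `toy_means` are made-up placeholders rather than data from the experiment.

```python
import math
from itertools import combinations

def mean_distance(points, cluster_a, cluster_b):
    """Mean Euclidean distance between all cross-cluster pairs of members."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def average_linkage(points):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(label,) for label in points]
    merges = []
    while len(clusters) > 1:
        # Fuse the pair of clusters with the smallest mean member distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: mean_distance(points, pair[0], pair[1]))
        merges.append((a, b, mean_distance(points, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Made-up mean slider settings (five features each), for illustration only.
toy_means = {
    "angry/music":    [0.90, 0.80, 0.90, 0.70, 0.90],
    "angry/movement": [0.85, 0.75, 0.95, 0.65, 0.85],
    "sad/music":      [0.10, 0.20, 0.80, 0.20, 0.40],
    "sad/movement":   [0.15, 0.25, 0.85, 0.15, 0.50],
}
merges = average_linkage(toy_means)
```

In this toy input the two modalities of each emotion fuse before the emotions fuse with each other, which is the pattern a cross-modal, emotion-based structure would produce.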
Methods. We conducted our second experiment in Lak, a rural
village in Ratanakiri, a sparsely populated province in northeastern Cambodia. Lak is a Kreung ethnic minority village that has
maintained a high degree of cultural isolation. (For a discussion
of the possible effects of modernization on Lak, see SI Text.) In
Kreung culture, music and dance occur primarily as a part of rituals, such as weddings, funerals, and animal sacrifices (25). Kreung
music is formally dissimilar to Western music: it has no system of
vertical pitch relations equivalent to Western tonal harmony, is
constructed using different scales and tunings, and is performed on
morphologically dissimilar instruments. For a brief discussion of
the musical forms we observed during our visit, see SI Text.

Fig. 2. Music-movement similarity structure in the United States data.

Clusters are fused based on the mean Euclidean distance between members.
The data cluster into a cross-modal, emotion-based structure.

72 | www.pnas.org/cgi/doi/10.1073/pnas.1209023110

Fig. 3. (A) Kreung participants used a MIDI controller to manipulate the
slider bar program. (B) Lak village debriefing at the conclusion of the study.

The experiment we conducted in Lak proceeded in the same
manner as the United States experiment, except for a few modifications made after pilot testing. Initially, because most of the
participants were illiterate, we simply removed the text labels from
the sliders. However, in our pilot tests we found participants had
difficulty remembering the function of each slider during the
movement task. We compensated by replacing the slider labels for
the movement task with pictures (SI Text). Instructions were
conveyed verbally by a translator. (For a discussion of the translation of emotion words, see SI Text.) None of the participants had
any experience with computers, so the saving/loading functionality
of the program was removed. Whereas the United States participants were free to work on any of the five emotions throughout the
experiment, the Kreung participants worked out each emotion
one-by-one in a random order. There were no required repetitions
of trials. However, when Kreung subjects requested to work on
a different emotion than the one assigned, or to revise an emotion
they had already worked on, that request was always granted. As
with the United States experiment, we always used the last example of any emotion chosen by the participant. Rather than using
a mouse, participants used a hardware MIDI controller (Korg
nanoKontrol) to manipulate the sliders on the screen (Fig. 3A).
When presented with continuous sliders as in the United States
experiment, many participants indicated they were experiencing
decision paralysis and could not complete the task. To make the
task comfortable and tractable we discretized the sliders, limiting
each to three positions: low, medium, and high (SI Text). As with
the United States experiment, participants were split into separate
music (n = 42) and movement (n = 43) groups.
Results: Universal Structure in Music and Movement
There were two critical questions for the cross-cultural analysis: (i)
Are emotional expressions universally cross-modal? and (ii) Are
emotional expressions similar across cultures? The first question
asks whether participants who used music to express an emotion
set the slider bars to the same positions as participants who
expressed the same emotion with the moving ball. This question
does not examine directly whether particular emotional expressions are universal. A cross-modal result could be achieved even
if different cultures have different conceptions of the same emotion (e.g., "happy" could be upward and regular in music and
movement for the United States, but downward and irregular in
music and movement for the Kreung). The second question asks
whether each emotion (e.g., "happy") is expressed similarly across
cultures in music, movement or both.
To compare the similarity of the Kreung results to the United
States results, we conducted three analyses. All three analyses
required the United States and Kreung data to be in a comparable
format; this was accomplished by making the United States data
discrete. Each slider setting was assigned a value of low, medium,
or high in accordance with the nearest value used in the Kreung
experiment. The following sections detail these three analyses. See
SI Text for additional analyses, including a linear discriminant


Monte Carlo Simulation. Traditional ANOVA is well-suited to

detecting mean differences given a null hypothesis that the means
are the same. However, this test cannot capture the similarity
between populations given the size of the possibility space. One
critical advance of the present paradigm is that it allowed participants to create different emotional expressions within a large
possibility space. Analogously, an ANOVA on distance would
show that Boston and New York City do not share geographic
coordinates, thereby rejecting the null hypothesis that these cities
occupy the same space. Such a comparison would not test how close
Boston and New York City are compared with distances between
either city and every other city across the globe (i.e., relative to
the entire possibility space). To determine the similarity between
Kreung and United States data given the entire possibility space
afforded by the five sliders, we ran a Monte Carlo simulation.
The null hypothesis of this simulation was that there are no
universal perceptual constraints on music-movement-emotion
association, and that individual cultures may create music and
movement anywhere within the possibility space. Showing instead
that differences between cultures are small relative to the size of
the possibility space strongly suggests music-movement-emotion
associations are subject to biological constraints.
We represented the mean results of each experiment as a 25-dimensional vector (five emotions × five sliders), where each
dimension has a range from 0.0 to 2.0. The goal of the Monte
Carlo simulation was to see how close these two vectors were to
each other relative to the size of the space they both occupy. To
do this, we sampled the space uniformly at random, generating
one-million pairs of 25-dimensional vectors. Each of these vectors represented a possible outcome of the experiment. We
measured the Euclidean distance between each pair to generate
a distribution of intervector distances (mean = 8.11, SD = 0.97).
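The sampling procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' simulation: it draws each coordinate independently and uniformly from the stated range, uses far fewer pairs than one million by default so that it runs quickly, and the function name `simulate_distances` is ours. The exact summary statistics depend on scaling and sampling details that the text does not fully specify, so this sketch is not expected to reproduce the reported mean and SD.

```python
import math
import random
import statistics

def simulate_distances(n_pairs=100_000, n_dims=25, low=0.0, high=2.0, seed=0):
    """Euclidean distances between random pairs of points in the slider space.

    Assumption: each of the n_dims coordinates is drawn independently and
    uniformly from [low, high].
    """
    rng = random.Random(seed)
    distances = []
    for _ in range(n_pairs):
        u = [rng.uniform(low, high) for _ in range(n_dims)]
        v = [rng.uniform(low, high) for _ in range(n_dims)]
        distances.append(math.dist(u, v))
    return distances

distances = simulate_distances(n_pairs=5_000)
null_mean = statistics.fmean(distances)
null_sd = statistics.stdev(distances)
```

An observed distance between two real result vectors can then be compared against `null_mean` and `null_sd`, or against the empirical distribution directly.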

Cross-Cultural Similarity by Emotion: Euclidean Distance. For this

analysis, we derived emotional prototypes from the results of the
United States experiment. This derivation was accomplished by
selecting the median value for each slider for each emotion (for
music and movement combined) from the United States results
and mapping those values to the closest Kreung setting of low,
medium, and high. For example, the median rate for "sad" was
46 BPM in the United States sample. This BPM was closest to the
low setting used in the Kreung paradigm (55 BPM). Using this
method for all five sliders, the "sad" United States prototype was:
low rate, low jitter, medium consonance, low ratio of big to small
movements, and high ratio of downward to upward movements.
We measured the similarity of each of the Kreung datapoints to
the corresponding United States prototype by calculating the
Euclidean distance between them.
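The prototype derivation and distance comparison can be sketched as below. Only the low rate setting (55 BPM), the 46 BPM median, and the "sad" prototype are taken from the text; the medium and high BPM values, the individual rate values in `us_sad_rates`, and the second prototype labeled `"other"` are hypothetical placeholders, as are the function names. Levels are coded low = 0, medium = 1, high = 2, with sliders ordered as (rate, jitter, consonance, step size, direction).

```python
import math
import statistics

# Only the "low" value comes from the text; "medium" and "high" are hypothetical.
RATE_LEVELS_BPM = {"low": 55, "medium": 120, "high": 200}

def nearest_level(value, level_values):
    """Name of the discrete setting whose value is closest to `value`."""
    return min(level_values, key=lambda name: abs(level_values[name] - value))

def closest_prototype(datapoint, prototypes):
    """Emotion whose prototype has the smallest Euclidean distance to datapoint."""
    return min(prototypes, key=lambda emotion: math.dist(datapoint, prototypes[emotion]))

# Made-up continuous rate settings whose median is the reported 46 BPM.
us_sad_rates = [40, 46, 60]
sad_rate_level = nearest_level(statistics.median(us_sad_rates), RATE_LEVELS_BPM)

prototypes = {
    "sad": (0, 0, 1, 0, 2),    # low, low, medium, low, high (from the text)
    "other": (2, 1, 0, 2, 0),  # hypothetical placeholder for comparison
}
# A hypothetical Kreung datapoint differing from "sad" on one slider.
match = closest_prototype((0, 1, 1, 0, 2), prototypes)
```

Repeating the comparison for every datapoint and every emotion prototype gives the matching pattern summarized in the following paragraphs.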
For every emotion except "angry," this distance analysis revealed that Kreung results (for music and movement combined)
were closer to the matching United States emotional prototypes
than they were to any of the other emotional prototypes. In other
words, the Kreung participants' idea of "sad" was more similar to
the United States "sad" prototype than to any other emotional
prototype, and this cross-cultural congruence was observed for all
emotions except "angry." This pattern also held for the movement
results when considered separately from music. When the music
results were evaluated alone, three of the five emotions (happy,
sad, and scared) were closer to the matching United States prototype than any nonmatching prototypes.
The three Kreung emotional expressions that were not closest to
their matching United States prototypes were angry movement,


ANOVA. We z-scored the data for each parameter (slider) separately within each population. We then combined all z-scored data
into a single, repeated-measures ANOVA with Emotion and
Sliders as within-subjects factors and Modality and Population
as between-subjects factors. Emotion had the largest main effect
on slider position [F(3.76, 492.23) = 40.60, P < 0.001, partial η² =
0.24], accounting for 24% of the overall (effect plus error) variance. There were no significant main effects of Modality (music
vs. movement) [F(1,131) = 0.004, P = 0.95] or Population (United
States, Kreung) [F(1,131) < 0.001, P = 0.99] and no interaction
between the two [F(1,131) = 1.15, P = 0.29].
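The standardization step that precedes this ANOVA can be illustrated as follows. This is a sketch under assumptions: it uses the sample SD (the text does not say whether the sample or population SD was used), and the slider values in `toy_population` are invented for illustration.

```python
import statistics

def z_score(values):
    """Standardize values to mean 0 and SD 1 (sample SD; an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def z_score_by_slider(population):
    """Z-score each slider separately within one population's data."""
    return {slider: z_score(values) for slider, values in population.items()}

# Made-up settings for two sliders within a single population.
toy_population = {
    "rate": [0.0, 1.0, 2.0],
    "jitter": [2.0, 2.0, 0.0, 0.0],
}
standardized = z_score_by_slider(toy_population)
```

Standardizing within each population removes overall differences in slider means and spread between populations, which is consistent with the absence of a Population main effect reported above.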
This main effect of Emotion was qualified by an Emotion ×
Sliders interaction, indicating each emotion was expressed by
different slider settings [F(13.68, 1791.80) = 38.22, P < 0.001;
partial η² = 0.23]. Emotion also interacted with Population, albeit
more modestly [F(3.76, 492.23) = 11.53, P < 0.001, partial η² =
0.08] and both were qualified by the three-way Emotion × Sliders ×
Population interaction, accounting for 7% of the overall variance in slider bar settings [F(13.68, 1791.80) = 10.13, P < 0.001,
partial η² = 0.07]. This three-way interaction can be understood
as how much participants' different emotion configurations could
be predicted by their population identity. See SI Text for the
z-scored means.
Emotion also interacted with Modality [F(3.76, 492.23) = 2.84,
P = 0.02; partial η² = 0.02], which was qualified by the three-way
Emotion × Modality × Sliders [F(13.68, 1791.80) = 4.92, P < 0.001;
partial η² = 0.04] and the four-way Emotion × Modality × Sliders ×
Population interactions [F(13.68, 1791.80) = 2.8, P < 0.001; partial
η² = 0.02]. All of these Modality interactions were modest, accounting for between 2% and 4% of the overall variance.
In summary, the ANOVA revealed that the slider bar configurations depended most strongly on the emotion being conveyed
(Emotion × Slider interaction, partial η² = 0.23), with significant but
small influences of modality and population (partial η²s < 0.08).

The distance between the Kreung and United States mean

result vectors was 4.24, which was 3.98 SDs away from the mean.
Out of one-million vector pairs, fewer than 30 pairs were this
close together, suggesting it is highly unlikely that the similarities
between the Kreung and United States results were because of
chance (Fig. 4).
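The position of the observed distance within the simulated distribution can be expressed as a standardized offset and as an empirical tail fraction. The sketch below uses the reported mean, SD, and observed distance; the function names are ours, and the small list passed to `tail_fraction` is a stand-in for a full set of simulated distances.

```python
def standardized_offset(observed, null_mean, null_sd):
    """How many SDs the observed distance lies from the null mean."""
    return (observed - null_mean) / null_sd

def tail_fraction(distances, observed):
    """Fraction of simulated distances at least as small as the observed one."""
    return sum(d <= observed for d in distances) / len(distances)

# Reported values: observed distance 4.24, null mean 8.11, null SD 0.97.
offset = standardized_offset(4.24, 8.11, 0.97)

# Stand-in list; in practice this would hold the simulated distances.
example_fraction = tail_fraction([1.0, 2.0, 3.0, 4.0], 2.0)
```

With the rounded values given in the text the offset comes out at roughly four SDs below the mean, in line with the reported figure, and fewer than 30 pairs in one million corresponds to a tail fraction below 0.00003.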
Taken together, the ANOVA and Monte Carlo simulation
revealed that the Kreung and United States data were remarkably similar given the possibility space, and that the combined data were best predicted by the emotions being conveyed
and least predicted by the modality used. The final analysis examined Euclidean distances between Kreung and United States
data for each emotion separately.

analysis examining the importance of each feature (slider) in distinguishing any given emotion from the other emotions.





Fig. 4. Distribution of distances in the Monte Carlo simulation. Bold black

line indicates where the similarity of United States and Kreung datasets falls
in this distribution.


angry music, and peaceful music; however, these had several

matching parameters. For both cultures, angry music and angry
movement were fast and downward. Although it was closer to the
United States scared prototype, Kreung angry music matched
the United States angry prototype in four of five parameters.
Kreung peaceful music was closest to the United States happy
prototype, and second closest to the United States peaceful prototype. In both cultures, happy music was faster than peaceful
music, and happy movement was faster than peaceful movement.
These data suggest two things. First, the dynamic features of
emotion expression are cross-culturally universal, at least for the
five emotions tested here. Second, these expressions have similar
dynamic contours in both music and movement. That is, music
and movement can be understood in terms of a single dynamic
model that shares features common to both modalities. This
ability is made possible not only by the existence of prototypical
emotion-specific dynamic contours, but also by isomorphic structural relationships between music and movement.
The natural coupling of music and movement has been suggested by a number of behavioral experiments with adults. Friberg
and Sundberg observed that the deceleration dynamics of a runner
coming to a stop accurately characterize the final slowing at the
end of a musical performance (26). People also prefer to tap to
music at tempos associated with natural types of human movement
(27) and common musical tempi appear to be close to some biological rhythms of the human body, such as the heartbeat and
normal gait. Indeed, people synchronize the tempo of their walking
with the tempo of the music they hear (but not to a similarly paced
metronome), with optimal synchronization occurring around 120
BPM, a common tempo in music and walking. This finding led the
authors to suggest that the perception of musical pulse is due to an
internalization of the locomotion system (28), consistent more
generally with the concept of embodied music cognition (29).
The embodiment of musical meter presumably recruits the
putative mirror system comprised of regions that coactivate for
perceiving and performing action (30). Consistent with this hypothesis, studies have demonstrated neural entrainment to beat
(31, 32) indexed by beat-synchronous β-oscillations across auditory and motor cortices (31). This basic sensorimotor coupling
has been described as creating a pleasurable feeling of being "in
the groove" that links music to emotion (33, 34).
The capacity to imitate biological dynamics may also be expressed in nonverbal emotional vocalizations (prosody). Several
studies have demonstrated better than chance cross-cultural
recognition of several emotions from prosodic stimuli (35–39).
Furthermore, musical expertise improves discrimination of tonal
variations in languages such as Mandarin Chinese, suggesting
common perceptual processing of pitch variations across music
and language (40). It is thus possible, albeit to our knowledge not
tested, that prosody shares the dynamic structure evinced here by
music and movement. However, cross-modal fluency between
music and movement may be particularly strong because of the
more readily identifiable pitch contours and metric structure in
music compared with speech (4, 41).
The close relationship between music and movement has
attracted significant speculative attention from composers, musicologists, and philosophers (42–46). Only relatively recently have
scientists begun studying the music-movement relationship empirically (47–50). This article addresses several limitations in this
literature. First, using the same statistical model to generate music
and movement stimuli afforded direct comparisons previously
impossible because of different methods of stimulus creation.
Second, modeling lower-level dynamic parameters (e.g., consonance), rather than higher-level constructs decreased the potential
for cultural bias (e.g., major/minor). Finally, by creating emotional
expressions directly rather than rating a limited set of stimuli
prepared in advance, participants could explore the full breadth of
the possibility space.
74 | www.pnas.org/cgi/doi/10.1073/pnas.1209023110

Fitch (51) describes the human musical drive as an "instinct to learn," which is shaped by universal proclivities and constraints.
Within the range of these constraints music is free to evolve as
a cultural entity, together with the social practices and contexts of
any given culture. We theorize that part of the "instinct to learn"
is a proclivity to imitate. Although the present study focuses on
emotive movement, music across the world imitates many other
phenomena, including human vocalizations, birdsong, the sounds
of insects, and the operation of tools and machinery (52–54).
We do not claim that the dynamic features chosen here describe
the emotional space optimally; there are likely to be other useful
features as well as higher-level factors that aggregate across features (37, 55, 56). We urge future research to test other universal,
cross-modal correspondences. To this end, labanotation, a symbolic language for notating dance, may be a particularly fruitful
source. Based on general principles of human kinetics (57), labanotation scripts speed (rate), regularity, size, and direction of
movement, as well as shape forms consistent with smoothness/
spikiness. Other Laban features not represented here, but potentially useful for emotion recognition, include weight and symmetry.
It may also be fruitful to test whether perceptual tendencies
documented in one domain extend across domains. For example,
innate (and thus likely universal) auditory preferences include:
seven or fewer pitches per octave, consonant intervals, scales with
unequal spacing between pitches (facilitating hierarchical pitch
organization), and binary timing structures (see refs. 14 and 58 for
reviews). Infants are also sensitive to hierarchical pitch organization (5) and melodic transpositions (see ref. 59 for a review). These
auditory sensitivities may be the result of universal proclivities and
constraints with implications extending beyond music to other
dynamic domains, such as movement. Our goal was simply to use
a small set of dynamic features that describe the space well enough
to provide a test of cross-modal and cross-cultural similarity.
Furthermore, although these dynamic features describe the space
of emotional expression for music and movement, the present
study does not address whether these features describe the space
of emotional experience (60, 61).
Our model should not be understood as circumscribing the
limits of emotional expression in music. Imitation of movement is
just one way among many in which music may express emotion, as
cultural conventions may develop independently of evolved proclivities. This explanation allows for cross-cultural consistency yet
preserves the tremendous diversity of musical traditions around
the world. Additionally, we speculate that, across cultures, musical forms will vary in terms of how emotions and their related
physical movements are treated differentially within each cultural
context. Similarly, the forms of musical instruments and the
substance of musical traditions may in turn influence differential
cultural treatment of emotions and their related physical movements. This interesting direction for further research will require
close collaboration with ethnomusicologists and anthropologists.
By studying universal features of music we can begin to map its
evolutionary history (14). Specifically, understanding the cross-modal nature of musical expression may in turn help us understand
why and how music came to exist. That is, if music and movement
have a deeply interwoven, shared structure, what does that shared
structure afford and how has it affected our evolutionary path? For
example, Homo sapiens is the only species that can follow precise
rhythmic patterns that afford synchronized group behaviors, such
as singing, drumming, and dancing (14). Homo sapiens is also the
only species that forms cooperative alliances between groups that
extend beyond consanguineal ties (62). One way to form and
strengthen these social bonds may be through music: specifically
the kind of temporal and affective entrainment that music evokes
from infancy (63). In turn, these musical entrainment-based bonds
may be the basis for Homo sapiens' uniquely flexible sociality (64).
If this is the case, then our evolutionary understanding of music is
not simply reducible to the capacity for entrainment. Rather,
music is the arena in which this and other capacities participate in
determining evolutionary fitness.
The shared structure of emotional music and movement must be reflected in the organization of the brain. Consistent with this
view, music and movement appear to engage shared neural substrates, such as those recruited by time-keeping and sequence
learning (31, 65, 66). Dehaene and Cohen (67) offer the term
"neuronal recycling" to describe how late-developing cultural
abilities, such as reading and arithmetic, come into existence by
repurposing brain areas evolved for older tasks. Dehaene and
Cohen suggest music recycles or makes use of premusical
representations of pitch, rhythm, and timbre. We hypothesize
that this explanation can be pushed a level deeper: neural representations of pitch, rhythm, and timbre likely recycle brain
areas evolved to represent and engage with spatiotemporal perception and action (movement, speech). Following this line of thinking, music's expressivity may ultimately be derived from the
evolutionary link between emotion and human dynamics (12).

1. Baily J (1985) Musical Structure and Cognition, eds Howell P, Cross I, West R (Academic, London).
2. Juslin PN, Laukka P (2004) Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. J New Music Res 33(3):
3. Winkler I, Háden GP, Ladinig O, Sziller I, Honing H (2009) Newborn infants detect the beat in music. Proc Natl Acad Sci USA 106(7):2468–2471.
4. Zentner MR, Eerola T (2010) Rhythmic engagement with music in infancy. Proc Natl Acad Sci USA 107(13):5768–5773.
5. Bergeson TR, Trehub SE (2006) Infants' perception of rhythmic patterns. Music Percept
6. Phillips-Silver J, Trainor LJ (2005) Feeling the beat: Movement influences infant rhythm perception. Science 308(5727):1430.
7. Trehub S (2000) The Origins of Music, Chapter 23, eds Wallin NL, Merker B, Brown S (MIT Press, Cambridge, MA).
8. Higgins KM (2006) The cognitive and appreciative import of musical universals. Rev Int Philos 2006/4(238):487–503.
9. Fritz T, et al. (2009) Universal recognition of three basic emotions in music. Curr Biol
10. Ekman P (1993) Facial expression and emotion. Am Psychol 48(4):384–392.
11. Izard CE (1994) Innate and universal facial expressions: Evidence from developmental and cross-cultural research. Psychol Bull 115(2):288–299.
12. Scherer KR, Banse R, Wallbott HG (2001) Emotion inferences from vocal expression correlate across languages and cultures. J Cross Cult Psychol 32(1):76–92.
13. Darwin C (2009) The Expression of the Emotions in Man and Animals (Oxford Univ Press, New York).
14. Brown S, Jordania J (2011) Universals in the world's musics. Psychol Music, 10.1177/
15. Huron D (1994) Interval-class content in equally tempered pitch-class sets: Common scales exhibit optimum tonal consonance. Music Percept 11(3):289–305.
16. Parncutt R (1989) Harmony: A Psychoacoustical Approach (Springer, Berlin).
17. McDermott JH, Lehr AJ, Oxenham AJ (2010) Individual differences reveal the basis of consonance. Curr Biol 20(11):1035–1041.
18. Bidelman GM, Heinz MG (2011) Auditory-nerve responses predict pitch attributes related to musical consonance-dissonance for normal and impaired hearing. J Acoust Soc Am 130(3):1488–1502.
19. Köhler W (1929) Gestalt Psychology (Liveright, New York).
20. Ramachandran VS, Hubbard EM (2001) Synaesthesia: A window into perception, thought and language. J Conscious Stud 8(12):3–34.
21. Maurer D, Pathman T, Mondloch CJ (2006) The shape of boubas: Sound-shape correspondences in toddlers and adults. Dev Sci 9(3):316–322.
22. Krumhansl CL (2002) Music: A link between cognition and emotion. Curr Dir Psychol Sci 11(2):45–50.
23. Bernhardt D, Robinson P (2007) Affective Computing and Intelligent Interaction, eds Paiva A, Prada R, Picard RW (Springer, Berlin), pp 59–70.
24. Hevner K (1936) Experimental studies of the elements of expression in music. Am J Psychol 48(2):246–268.
25. United Nations Development Programme Cambodia (2010) Kreung Ethnicity: Documentation of Customary Rules (UNDP Cambodia, Phnom Penh, Cambodia). Available at www.un.org.kh/undp/media/files/Kreung-indigenous-people-customary-rules-Eng.pdf. Accessed June 29, 2011.
26. Friberg A, Sundberg J (1999) Does music performance allude to locomotion? A model of final ritardandi derived from measurements of stopping runners. J Acoust Soc Am
27. Moelants D, Van Noorden L (1999) Resonance in the perception of musical pulse. J New Music Res 28(1):43–66.
28. Styns F, van Noorden L, Moelants D, Leman M (2007) Walking on music. Hum Mov Sci
29. Leman M (2007) Embodied Music Cognition and Mediation Technology (MIT Press, Cambridge, MA).
30. Molnar-Szakacs I, Overy K (2006) Music and mirror neurons: From motion to 'e'motion. Soc Cogn Affect Neurosci 1(3):235–241.
31. Fujioka T, Trainor LJ, Large EW, Ross B (2012) Internalized timing of isochronous sounds is represented in neuromagnetic β oscillations. J Neurosci 32(5):1791–1802.
32. Nozaradan S, Peretz I, Missal M, Mouraux A (2011) Tagging the neuronal entrainment to beat and meter. J Neurosci 31(28):10234–10240.

33. Janata P, Tomic ST, Haberman JM (2012) Sensorimotor coupling in music and the psychology of the groove. J Exp Psychol Gen 141(1):54–75.
34. Koelsch S, Siebel WA (2005) Towards a neural basis of music perception. Trends Cogn Sci 9(12):578–584.
35. Bryant GA, Barrett HC (2008) Vocal emotion recognition across disparate cultures. J Cogn Cult 8(1-2):135–148.
36. Sauter DA, Eisner F, Ekman P, Scott SK (2010) Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations. Proc Natl Acad Sci USA 107(6):2408–2412.
37. Banse R, Scherer KR (1996) Acoustic profiles in vocal emotion expression. J Pers Soc Psychol 70(3):614–636.
38. Thompson WF, Balkwill LL (2006) Decoding speech prosody in five languages. Semiotica 2006(158):407–424.
39. Elfenbein HA, Ambady N (2002) On the universality and cultural specificity of emotion recognition: A meta-analysis. Psychol Bull 128(2):203–235.
40. Marie C, Delogu F, Lampis G, Belardinelli MO, Besson M (2011) Influence of musical expertise on segmental and tonal processing in Mandarin Chinese. J Cogn Neurosci
41. Zatorre RJ, Baum SR (2012) Musical melody and speech intonation: Singing a different tune. PLoS Biol 10(7):e1001372.
42. Smalley D (1996) The listening imagination: Listening in the electroacoustic era. Contemp Music Rev 13(2):77–107.
43. Susemihl F, Hicks RD (1894) The Politics of Aristotle (Macmillan, London), p 594.
44. Meyer LB (1956) Emotion and Meaning in Music (Univ of Chicago Press, Chicago, IL).
45. Truslit A (1938) Gestaltung und Bewegung in der Musik [Shape and Movement in Music] (Chr Friedrich Vieweg, Berlin-Lichterfelde). German.
46. Iyer V (2002) Embodied mind, situated cognition, and expressive microtiming in African-American music. Music Percept 19(3):387–414.
47. Gagnon L, Peretz I (2003) Mode and tempo relative contributions to "happy-sad" judgements in equitone melodies. Cogn Emotion 17(1):25–40.
48. Eitan Z, Granot RY (2006) How music moves: Musical parameters and listeners' images of motion. Music Percept 23(3):221–247.
49. Juslin PN, Lindström E (2010) Musical expression of emotions: Modelling listeners' judgments of composed and performed features. Music Anal 29(1-3):334–364.
50. Phillips-Silver J, Trainor LJ (2007) Hearing what the body feels: Auditory encoding of rhythmic movement. Cognition 105(3):533–546.
51. Fitch WT (2006) On the biology and evolution of music. Music Percept 24(1):85–88.
52. Fleming W (1946) The element of motion in baroque art and music. J Aesthet Art Crit
53. Roseman M (1984) The social structuring of sound: The temiar of peninsular malaysia. Ethnomusicology 28(3):411–445.
54. Ames DW (1971) Taaken smarii: A drum language of hausa youth. Africa 41(1):12–31.
55. Vines BW, Krumhansl CL, Wanderley MM, Dalca IM, Levitin DJ (2005) Dimensions of emotion in expressive musical performance. Ann N Y Acad Sci 1060:462–466.
56. Russell JA (1980) A circumplex model of affect. J Pers Soc Psychol 39(6):1161–1178.
57. Laban R (1975) Laban's Principles of Dance and Movement Notation, 2nd edition, ed Lange R (MacDonald and Evans, London).
58. Stalinski SM, Schellenberg EG (2012) Music cognition: A developmental perspective. Top Cogn Sci 4(4):485–497.
59. Trehub SE, Hannon EE (2006) Infant music perception: Domain-general or domain-specific mechanisms? Cognition 100(1):73–99.
60. Gabrielsson A (2002) Emotion perceived and emotion felt: Same or different? Music Sci Special Issue 2001–2002:123–147.
61. Juslin PN, Västfjäll D (2008) Emotional responses to music: The need to consider underlying mechanisms. Behav Brain Sci 31(5):559–575, discussion 575–621.
62. Hagen EH, Bryant GA (2003) Music and dance as a coalition signaling system. Hum Nat
63. Phillips-Silver J, Keller PE (2012) Searching for roots of entrainment and joint action in early musical interactions. Front Hum Neurosci, 10.3389/fnhum.2012.00026.
64. Wheatley T, Kang O, Parkinson C, Looser CE (2012) From mind perception to mental connection: Synchrony as a mechanism for social understanding. Social Psychology and Personality Compass 6(8):589–606.
65. Janata P, Grafton ST (2003) Swinging in the brain: Shared neural substrates for behaviors related to sequencing and music. Nat Neurosci 6(7):682–687.
66. Zatorre RJ, Chen JL, Penhune VB (2007) When the brain plays music: Auditory-motor interactions in music perception and production. Nat Rev Neurosci 8(7):547–558.
67. Dehaene S, Cohen L (2007) Cultural recycling of cortical maps. Neuron 56(2):384–398.

Sievers et al.

PNAS | January 2, 2013 | vol. 110 | no. 1 | 75


ACKNOWLEDGMENTS. We thank Dan Wegner, Dan Gilbert, and Jonathan Schooler for comments on previous drafts; George Wolford for statistical
guidance; Dan Leopold for help collecting the United States data; and the
Ratanakiri Ministry of Culture, Ockenden Cambodia, and Cambodian Living
Arts for facilitating visits to Lak and for assistance with Khmer-Kreung translation, as well as Trent Walker for English-Khmer translation. We also thank
the editor and two anonymous reviewers for providing us with constructive
comments and suggestions that improved the paper. This research was supported in part by a McNulty grant from The Nelson A. Rockefeller Center (to
T.W.) and a Foreign Travel award from The John Sloan Dickey Center for
International Understanding (to T.W.).

Supporting Information
Sievers et al. 10.1073/pnas.1209023110
SI Text
Data. Spreadsheets containing the raw data are attached. Dataset
S1 includes all of the data from the United States experiment.
Dataset S2 includes all of the data from the Kreung experiment,
as well as discretized data from the United States experiment. The
raw means by emotion are provided in Table S1, for each population. Table S2 provides the z-scored means of the discrete data
by Emotion, Slider, and Population (corresponding to the cross-cultural ANOVA). These means were z-scored within each slider
(using discrete data), for each population separately. These means
are graphically portrayed in Fig. S1.
Fisher's Linear Discriminant Analysis. In an attempt to estimate which
sliders were most important to the emotion categorization effect,
we performed a Fisher's linear discriminant analysis between the
slider values and their associated emotions for each population, for
each modality. The numbers represent the importance of each
feature (slider) in distinguishing the given emotion from all of the
other emotions (higher numbers mean more importance). These
data are presented in Table S3.
In the United States data, consonance was the most effective
feature for discriminating each emotion from the other four
emotions in both modalities, accounting for 60% of the total
discrimination between emotions. The second most effective feature was direction (up/down), accounting for 26% (music) and
35% (movement). In the Kreung music data, rate and step size
were most effective for discriminating each emotion compared
with the other four; rate and consonance were most effective for
the Kreung movement data.
It is important to note that low discriminant values do not necessarily imply unimportance for two reasons. First, this analysis only
reveals the importance of each feature as a discriminant for each
emotion when that emotion is compared with the other four
emotions. That is, any other comparison (e.g., each emotion
compared with a different subset of emotions) would yield different
values. For example, although jitter may seem relatively unimportant for discriminating emotions based on the data in Table S3,
jitter was a key feature for discriminating between particular
emotion dyads (e.g., scared vs. sad in the United States data).
Second, whether a parameter (slider) was redundant in this dataset
is impossible to conclude because of potential interactions between
parameters. Additional research is necessary to determine whether
one or more features could be excluded without significant cost to
emotion recognition. Such research would benefit from testing
each feature in isolation (e.g., by holding others constant) to better
elucidate its contribution to emotional expression within and across
modalities and cultures.
Multimedia. Audio and visual files of the emotional prototypes for both music and movement in the United States and Kreung
experiments are available as Audios S1, S2, S3, S4, S5, S6, S7, S8,
S9, and S10, and Movies S1, S2, S3, S4, S5, S6, S7, S8, S9, and
S10. Each file contains three sequential probabilistically generated examples based on the prototype settings as explained in the
cross-cultural Euclidean distance analysis.
Detailed Methods. Our computer program was created using Max/
MSP (1), Processing (2), and OpenGL (3). Subjects were presented an interface with slider-bars corresponding to the five dimensions of our statistical model: rate (in beats per minute or
BPM), jitter (SD of rate), consonance/visual spikiness, step size,
and step direction. The five sliders controlled parametric values
fed to an algorithm that probabilistically moved the position of a marker around a discrete number-line in real time. We will refer
to the movements of this marker as a path. The position of the
marker at each step in the generated path was mapped to either
music or animated movement.
The number-line traversal algorithm can be split into two parts.
The first part, called the metronome, controlled the timing of
trigger messages sent to the second part, called the path generator, which kept track of and controlled movement on the number
line. The tempo and jitter parameters were fed to the metronome,
and the consonance, step size, and step direction parameters were
fed to the path generator. When the subject pressed the space bar
on the computer keyboard, the metronome turned on, sent 16
trigger messages to the path generator (variably timed as described below), and then turned off. The beginnings and endings
of paths correspond to the on and off of the metronome.
Tempo was constrained to values between a minimum of 30 BPM
and a maximum of 400 BPM. Jitter was expressed as a coefficient of
the tempo with a range between 0 and 0.99. When jitter was set to 0,
the metronome would send out a stream of events at evenly spaced
intervals as specified by the tempo slider. If the jitter slider was
above 0, then specific per-event delay values were calculated
nondeterministically as follows. Immediately before each event,
a uniformly random value was chosen between 0 and the current
value of the jitter slider. That value was multiplied by the period in
milliseconds as specified by the tempo slider, and then the next
event was delayed by a number of milliseconds equal to the result.
These delays were specified on a per-event basis and applied to
events after they left the metronome. No event was delayed for
longer than the metronome period. This per-event delay was essentially a shifting or sliding of each event in the stream toward,
but never past, the next note in the stream. Each shift left less
empty space on one side of the note's original position and more
empty space on the other. This process ensured that tempo and
jitter were independent. The effect was that as the value of the
jitter slider increased, the precise timing of event onsets became
less predictable but the mean event density remained the same.
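The timing rule above can be sketched in a few lines of Python. This is only an illustration of the description, not the original Max/MSP patch; the function name `jittered_onsets` and its arguments are hypothetical, and the 16-event default follows the number of trigger messages mentioned earlier.

```python
import random


def jittered_onsets(bpm, jitter, n_events=16, rng=None):
    """Return onset times in ms for n_events ticks with per-event jitter.

    Hypothetical helper: each tick leaves the metronome on an even grid
    and is then delayed by uniform(0, jitter) times the period.
    """
    if not 30 <= bpm <= 400:
        raise ValueError("tempo must lie between 30 and 400 BPM")
    if not 0.0 <= jitter <= 0.99:
        raise ValueError("jitter must lie between 0 and 0.99")
    rng = rng or random.Random()
    period_ms = 60_000.0 / bpm  # spacing of the undelayed ticks
    onsets = []
    for i in range(n_events):
        nominal = i * period_ms  # time the event leaves the metronome
        delay = rng.uniform(0.0, jitter) * period_ms  # never a full period
        onsets.append(nominal + delay)
    return onsets
```

Because every delay is drawn from the same distribution and is shorter than one period, events keep their order and the average spacing stays at one period, which is one way to read the statement that tempo and jitter were independent.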
The path generator can be conceived of as a black box with
a memory slot, which could store one number and which responded to a small set of messages: reset, select next number, and
output next number. Whenever the path generator was sent the
reset message, a new starting position was picked and stored in the
memory slot (the exact value of the starting position was constrained by the value of the scale choice slider as explained below).
Whenever the path generator was sent the select next number
message, it picked a new number according to the constraints
specified by the slider bars: first, the size of the interval was selected, then the direction (up or down), then a specific number
according to the position of the scale choice slider. The output next
number message caused the path generator to output the next
number to the music and motion generators, described below.
When selecting a new number, the path generator first chose a
step size, or the distance between the previous number (stored in
the memory slot) and the next. This value was calculated nondeterministically based on the position of the step size slider. The
step size slider had a minimum value of 0 and a maximum value
of 1. When choosing a step size, a uniformly random number
between 0 and 1 was generated. This number was then used as the
x value in the following equation, where a = the value of the step
size slider:
[Equation: the result r as a function of x and a; not recoverable from the extracted text]
The result r was multiplied by 4 and then rounded up to the nearest integer to give the step size of the event. As the value of
the step size slider increased, the likelihood of a small step size
decreased, and vice versa. If the slider was in the minimum position, all of the steps would be as small as possible. If it was in
the maximum position, all of the steps would be as large as
possible. If it was in the middle position, there would be an equal
likelihood of all possible step sizes. Other positions skew the
distribution one way or the other, where higher values resulted in a larger average step size. Note that these step size units
did not correspond directly to the units of the number line; they
were flexibly mapped to the number line as directed by the setting of the consonance parameter, as described below.
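The closed form of the step size equation is not legible in this copy, so the following Python sketch substitutes a different, well-known curve (Schlick's bias function) purely as an assumption. It was chosen only because it reproduces the three behaviors described: the smallest steps at slider 0, the largest at slider 1, and a uniform spread at 0.5. The names `schlick_bias` and `choose_step_size`, and the clamp to the range 1 to 4, are hypothetical.

```python
import math
import random


def schlick_bias(x, a):
    """Stand-in for the unreadable equation (an assumption, not the source formula).

    Maps x in [0, 1] to [0, 1]; a = 0.5 leaves x unchanged.
    """
    if a <= 0.0:
        return 0.0
    if a >= 1.0:
        return 1.0
    return x / ((1.0 / a - 2.0) * (1.0 - x) + 1.0)


def choose_step_size(slider, rng=None, max_step=4):
    """Draw x uniformly, bias it, multiply by 4, and round up."""
    rng = rng or random.Random()
    x = rng.random()
    r = schlick_bias(x, slider)
    # Clamping to at least 1 is an assumption about "as small as possible".
    return min(max_step, max(1, math.ceil(r * max_step)))
```

With the slider at 0.5 the four step sizes come out equally likely, because the uniform draw passes through unchanged before being multiplied by 4 and rounded up.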
After the step size was chosen, the path generator determined
the direction of the next step: up or down. As with step size, the
step direction was calculated nondeterministically based on the
position of the step direction slider. The step direction slider had
a minimum value of 0 and a maximum value of 1. When choosing
step direction, a uniformly random number between 0 and 1 was
generated. If that number was less than or equal to the value of
the step direction slider, then the next step would be downward;
otherwise the next step would be upward.
Finally, the number was mapped on to one of 38 unique scales. As
the notion of a scale is drawn from Western music theory, this
decision requires some elaboration. In Western music theory,
a collection of pitches played simultaneously or in sequence may be
heard as consonant or dissonant. The perception of a given musical
note as consonant or dissonant is not a function of its absolute pitch
value, but of the collection of intervals between all pitches comprising the current chord or phrase. The relationship between interval size and dissonance is nonlinear. For example, an interval of
seven half steps, or a perfect fifth, is considered quite consonant,
whereas an interval of six half steps, or a tritone, is considered quite
dissonant. Intervallic distance, consonance/dissonance, and equivalency are closely related. If a collection of pitch classes x (a pitch
class set, or PC set) has the same set of intervallic relationships as
another PC set y, those two PC sets will have the same degree of
consonance and are transpositionally identical (and in certain
conditions equivalent).
Absolute pitches also possess this property of transpositional
equivalency. When the frequency of a note is doubled, it is perceived as belonging to the same pitch class. For example, the A key
closest to the middle of a piano has a fundamental frequency of
440 Hz, but the A an octave higher has a fundamental frequency of
880 Hz; both are heard as an A. Western music divides the octave
into 12 pitch classes, called the chromatic scale, from which all
other scales are derived. Because we wanted to investigate musical
dissonance and possible functional analogs in the modality of
motion, our number-line scales were designed to be analogous to
musical scales, where a number-line scale is a five-member subset
of the chromatic set [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. There are 768
such subsets of the chromatic set, many of which are (in the
domain of music) transpositionally or inversionally equivalent.
Our scale list was created by generating the prime forms of these
768 subsets and then removing duplicates, yielding 38 unique
scales (4). These scales were ordered by their aggregate dyadic
consonance (5).
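The reduction to 38 scales can be checked with a brute-force enumeration. The Python sketch below is an illustration rather than the authors' procedure: instead of computing standard prime forms it picks, as a simpler stand-in, the smallest sorted tuple among all transpositions of a set and of its inversion, which groups the subsets into the same equivalence classes. The name `canonical_form` is hypothetical.

```python
from itertools import combinations


def canonical_form(pcs):
    """Smallest sorted tuple among the 12 transpositions of pcs and of its inversion."""
    candidates = []
    for source in (list(pcs), [(-p) % 12 for p in pcs]):
        for t in range(12):
            candidates.append(tuple(sorted((p + t) % 12 for p in source)))
    return min(candidates)


# Every five-member subset of the chromatic set [0, ..., 11].
subsets = list(combinations(range(12), 5))
# One representative per transposition/inversion class.
classes = sorted({canonical_form(s) for s in subsets})
print(len(subsets), len(classes))
```

Run as written, this enumeration yields 38 classes, matching the scale count in the text. It counts 792 five-member subsets (12 choose 5) rather than 768, so the figure quoted above may reflect a different counting convention.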
The algorithm for generating a specic path across the number
line was as follows. The number line consisted of the integers from
0 to 127 inclusive. When the algorithm began, three variables
were stored. First, a starting-point offset between 0 and 11 was
selected uniformly at random, then an octave bias variable was set
to 5, and a scale position variable was set to 0. The starting-point
offset was used to ensure that each musical phrase began on a different, randomly selected note, ensuring that no single context-determining tonic pitch or root scale degree could be identified on the basis of repetition. The current scale class was determined by using the scale position variable as an index to the
array of scale elements specified by the position of the scale slider.
For example, if the current selected scale was [0, 3, 4, 7, 10] and
the current scale position variable was 2, then the current scale
class would be 4 (indices start from 0). The current position on the
number line was given by multiplying the octave bias by 12, adding
the starting-point offset, and then adding the current scale class
value. For example, if the octave bias was 5, the starting-point
offset was 4, and the scale class value was 7, then the current
position on the number line would be 71.
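The position arithmetic in this paragraph can be written as a one-line function. The Python sketch below uses hypothetical names (`number_line_position` and its arguments) and simply restates the rule: octave bias times 12, plus the starting-point offset, plus the scale class found at the current scale position.

```python
def number_line_position(octave_bias, start_offset, scale, scale_position):
    """Return the number-line position for the given path-generator state."""
    scale_class = scale[scale_position]  # indices start from 0
    return octave_bias * 12 + start_offset + scale_class


# Worked example from the text: octave bias 5, offset 4, scale class 7.
example_scale = [0, 3, 4, 7, 10]
print(number_line_position(5, 4, example_scale, 3))
```

In the example scale, the scale class 7 sits at index 3, so the call reproduces the position of 71 given above.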
When the select next number message was received, an interval
and note direction were selected as described above. If the note
direction was upward, then the new scale position value was given
by the following:
(current scale position + new interval value) % 5
If the note direction was downward, then the new scale position value was given by:
(5 + current scale position - new interval value) % 5
Either of these conditions may imply a modular wrapping
around the set of possible values (0–4). If this was the case, then the
current octave variable was either incremented by 1 in the case of
an upward interval, or decremented by 1 in the case of a downward interval. If a step in the path would move the position on
the number line outside of the allowed range, 12 would be either
added to or subtracted from the new position. This means
that when the upper or lower boundaries of the allowed pitch
range were hit, the melody would simply stay in the topmost or
bottommost octave, flattening out the overall pitch contour at the
extremes. This process could result in occasional upward melodic
intervals at the bottommost extreme or downward melodic intervals
at the topmost extreme, despite the setting of the pitch direction
slider. In practice, this rarely occurred, and was confined to the
more extreme "angry" emotional expressions where pitch direction
was maximally downward and step size was maximally large.
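The direction draw and the scale-position update can be sketched together in Python. The names `choose_direction` and `take_step` are hypothetical. Folding the ±12 range correction back into the octave variable is an assumption, made so that the path stays in the outermost octave as the text describes.

```python
import random

SCALE_SIZE = 5  # each number-line scale has five members


def choose_direction(slider, rng):
    """Return -1 (down) if a uniform draw is <= slider, otherwise +1 (up)."""
    return -1 if rng.random() <= slider else 1


def take_step(octave, scale_pos, interval, direction, scale, offset):
    """Advance the path by one step; returns (octave, scale_pos, position)."""
    raw = scale_pos + direction * interval
    octave += raw // SCALE_SIZE   # +1 on an upward wrap, -1 on a downward wrap
    scale_pos = raw % SCALE_SIZE  # modular wrap onto 0..4
    position = octave * 12 + offset + scale[scale_pos]
    # Keep the position on the 0..127 number line by shifting one octave.
    while position > 127:
        octave -= 1
        position -= 12
    while position < 0:
        octave += 1
        position += 12
    return octave, scale_pos, position
```

For example, starting at scale position 4 in octave 5 and moving up by 2 wraps to scale position 1 in octave 6. Python's floor division and modulo handle the downward case without the explicit "5 +" term.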
The subjects were divided into two groups. For the first group,
number-line values were mapped to musical notes, and for the
second group, number-line values were mapped to animated movement.
Our mapping from movement across a number-line to Western
music was straightforward, as its most significant modality-specific
features were taken care of by the very design of the number-line
algorithm. The division of pitches into pitch-classes and scales is
accounted for by the scale-class and scale selection system used by
the algorithm, as is the modulo 12 equivalency of pitch-classes.
Each number was mapped to a specic pitch which was sounded as
the algorithm selects the number. The number 60 was mapped to
middle C, or C4. Movement of a distance of 1 on the number line
corresponded to a pitch change of a half-step, with higher numbers
being higher in pitch. For example, 40 maps to E2, 0 maps to C−1,
and 127 maps to G9. Notes were triggered via MIDI and played on
the grand piano instrument included with Apple GarageBand.
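The number-to-pitch mapping can be illustrated with a small Python helper. The name `note_name` and the use of sharps for the black keys are assumptions made for the illustration. The octave numbering is fixed by the statement that 60 is middle C (C4).

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_name(number):
    """Name the pitch for a number-line value, with 60 mapped to C4."""
    if not 0 <= number <= 127:
        raise ValueError("number-line values run from 0 to 127")
    octave = number // 12 - 1  # chosen so that 60 falls in octave 4
    return f"{NOTE_NAMES[number % 12]}{octave}"


for value in (40, 60, 127):
    print(value, note_name(value))
```

Each unit on the number line is one half step, so the helper only needs the remainder modulo 12 for the pitch class and the quotient for the octave.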
Mapping from movement across a number-line to animated
movement was less straightforward. Our animated character was
a red ellipsoid ball with cubic eyes (Fig. S2). The ball sat atop
a rectangular dark gray floor on a light gray background. An
ellipsoid was chosen because it can be seen as rotating around a
center. The addition of eyes was intended to engage cognitive
processes related to the perception of biological motion. We wanted our subjects to perceive the ball as having its own subjectivity,
that it could be capable of communicating or experiencing happiness, sadness, and so forth. The movement of our character
(henceforth referred to as the Ball) was limited to bouncing up
and down, rotating forward and backward, and modulating the
spikiness of its surface. Technical details follow.
The Ball was drawn as a red 3D sphere composed of a limited
number of triangular faces, which were transformed into an ellipsoid by scaling its y axis by a factor of 1.3. The Ball was positioned such that it appeared to be resting on a rectangular floor
beneath it. Its base appeared to flatten where it made contact with
the floor. The total visible height of the Ball when it was above the
floor was 176 pixels; this was reduced to 168 pixels when the Ball
was making contact with the floor. Its eyes were small white cubes
located about 23% downward from the top of the ellipsoid. The
Ball and the floor were rotated about the y axis such that it appeared the Ball was looking somewhere to the left of the viewer.
Every time the current position on the number line changed, the
Ball would bounce. A bounce is the translation of the Ball to
a position somewhere above its resting position and back down
again. Bounce duration was equal to 93% of the current period of the
metronome. The 7% reduction was intended to create a perceptible
landing between each bounce. Bounce height was determined by
the difference between the current position on the number line and
the previous position. A difference of 1 resulted in a bounce height
of 20 pixels. Each further increase of 1 in the difference raised the bounce height by 13.33 pixels (e.g., a difference of 5
would result in a bounce height of 73.33 pixels). The Ball reached
its translational apex when the bounce was 50% complete. The arc
of the bounce followed the first half of a sine curve.
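The bounce rules above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' program: the function names are hypothetical, the per-step increment is taken as 40/3 pixels (an assumption, since the text rounds it to 13.33 and only 40/3 reproduces the 73.33-pixel example), and the sketch assumes that only the magnitude of the number-line difference affects bounce height.

```python
import math

BOUNCE_FRACTION = 0.93       # share of the metronome period spent in the air
BASE_HEIGHT_PX = 20.0        # bounce height for a number-line difference of 1
HEIGHT_STEP_PX = 40.0 / 3.0  # assumed exact value of the reported 13.33 pixels


def bounce_height(difference: int) -> float:
    """Peak bounce height in pixels for a given number-line difference."""
    steps = abs(difference)  # assumption: direction does not change the height
    if steps == 0:
        return 0.0           # no change in position, so no bounce
    return BASE_HEIGHT_PX + (steps - 1) * HEIGHT_STEP_PX


def ball_offset(t: float, period: float, difference: int) -> float:
    """Vertical offset in pixels, t time units after a bounce begins."""
    duration = BOUNCE_FRACTION * period
    if t < 0 or t >= duration:
        return 0.0           # resting on the floor during the landing gap
    # First half of a sine curve: zero at takeoff, apex at 50% of the bounce.
    return bounce_height(difference) * math.sin(math.pi * t / duration)


print(round(bounce_height(5), 2))  # 73.33
```

With a period of 1.0, the Ball is airborne for 0.93 time units and rests for the remaining 0.07, which is the perceptible landing described above.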
The Ball would rotate, leaning forward or backward, depending
on the current number line value. High values caused the Ball to
lean backward, such that it appeared to look upward, and low
values caused the Ball to lean forward or look down. When the
current value of the number line was 60, the Ball's angle of rotation was 0°. An increase of 1 on the number line decreased the
Ball's angle of rotation by 1°; conversely, a decrease of 1 on the
number line increased the Ball's angle of rotation by 1°. For
example, if the current number-line value were 20, the Ball's
angle of rotation would be 40°. If the current number-line value
were 90, the Ball's angle of rotation would be −30°.
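The rotation rule reduces to a single subtraction. The sketch below uses hypothetical names and reads the sign convention from the one-degree-per-step rule: positive angles lean forward (looking down) and negative angles lean backward (looking up), so a number-line value of 90 gives a negative angle.

```python
NEUTRAL_VALUE = 60  # number-line value at which the Ball is upright


def rotation_angle(number_line_value: int) -> int:
    """Angle of rotation in degrees; positive leans forward, negative backward."""
    return NEUTRAL_VALUE - number_line_value


print(rotation_angle(20))  # 40
print(rotation_angle(90))  # -30
```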
The Ball could also be more or less spiky. The amplitude of the
spikes, or perturbations of the Ball's surface, was analogically
mapped to musical dissonance. The visual effect was achieved by
adding noise to the x, y, and z coordinates of each vertex in the set
of triangles comprising the Ball. Whenever a new position on the
number line was chosen, the aggregate dyadic consonance of the
interval formed by the new position and the previous position was
calculated. The maximum aggregate dyadic consonance was 0.8;
the minimum was −1.428. The results were scaled such that when
the consonance value was 0.8, the spikiness value was 0, and when
the consonance value was −1.428, the spikiness value was 0.2.
Changes in consonance of 0.01 resulted in a change of 0.0008977 to
the spikiness value. For each vertex on the Ball's surface, spikiness
offsets for each of the three axes were calculated. Each spikiness
offset was a number chosen uniformly at random between −1 and
1, which was then multiplied by the Ball's original spherical radius
times the current spikiness value.
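As a rough illustration of the mapping just described, the following Python sketch rescales consonance linearly between the two stated endpoints (0.8 maps to 0 and −1.428 maps to 0.2, a slope of 0.2/2.228, or about 0.0898 spikiness per unit of consonance) and then perturbs each vertex. The function names and the list-of-tuples vertex format are assumptions made for the example; the original was rendered with a shader-based pipeline that is not reproduced here.

```python
import random

CONSONANCE_MAX = 0.8     # most consonant interval, maps to spikiness 0.0
CONSONANCE_MIN = -1.428  # most dissonant interval, maps to spikiness 0.2
SPIKINESS_MAX = 0.2


def spikiness(consonance: float) -> float:
    """Linearly rescale aggregate dyadic consonance to a spikiness value."""
    span = CONSONANCE_MAX - CONSONANCE_MIN
    return SPIKINESS_MAX * (CONSONANCE_MAX - consonance) / span


def perturb(vertices, radius, spike, rng=random):
    """Offset each (x, y, z) vertex by uniform noise scaled by radius * spike."""
    scale = radius * spike
    return [
        tuple(coord + rng.uniform(-1.0, 1.0) * scale for coord in vertex)
        for vertex in vertices
    ]
```

Passing a seeded `random.Random` instance as `rng` makes the perturbation repeatable, which is convenient when comparing frames.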
For the emotion labels, care was taken to avoid using words
etymologically related to either music or movement (e.g., "upbeat" for "happy" or "downtrodden" for "sad"). See Fig. S3A for
a screenshot of the United States experiment interface and the
lists of emotion words presented to participants.
In the United States experiment, the labels for the slider bars
changed between tasks as described in Fig. S3B. In the Kreung
experiment, slider bars were not labeled in the music task, and
were accompanied by icons in the movement task. See Fig. S4 for
a screenshot of the slider bars during the Kreung movement task.
Sievers et al. www.pnas.org/cgi/content/short/1209023110

To confirm or reject the presence of a cross-cultural code, it
needed to be possible for Kreung participants to select slider-bar
positions that would create music and movement similar to that
created by the United States participants. For this reason, the
discretization values for the Kreung slider bars were derived from
the United States data. These values are shown in Table S4.
For consonance, the extreme low value of 4 was chosen by taking
the midpoint between the median consonance values for "angry"
and "scared." The extreme high value of 37 was the midpoint between the median values for "happy" and "peaceful." The central
value of 30 was chosen by taking the median value for "sad," which
was neither at the numeric middle nor either of the endpoints. The
values for the other sliders were selected similarly, with two exceptions. For rate, for which there was no emotion sitting reliably
between the high and low extremes, the numeric middle between
the extremes was chosen as the central value. For direction, because the values for "happy" and "peaceful" clustered around the
center of the scale, the ideal center (50, neither up nor down) was
chosen as the central value.
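The selection rule for the three positions of each slider can be written as a small helper. This is only a sketch of the procedure as described: the function names are hypothetical, and the medians in the example are placeholders chosen so that the arithmetic lands on the reported consonance values of 4, 30, and 37. They are not the study's data.

```python
def midpoint(a, b):
    """Arithmetic midpoint of two values."""
    return (a + b) / 2.0


def slider_positions(medians, low_pair, high_pair, center=None):
    """Return (low, center, high) discretization values for one slider.

    center may be an emotion name (use that emotion's median), a number
    (use it directly, as for direction), or None (use the numeric middle
    of the two extremes, as for rate).
    """
    low = midpoint(medians[low_pair[0]], medians[low_pair[1]])
    high = midpoint(medians[high_pair[0]], medians[high_pair[1]])
    if center is None:
        middle = midpoint(low, high)
    elif isinstance(center, str):
        middle = medians[center]
    else:
        middle = center
    return low, middle, high


# Placeholder medians for illustration only (NOT taken from the study).
placeholder = {"angry": 2, "scared": 6, "sad": 30, "happy": 36, "peaceful": 38}
print(slider_positions(placeholder, ("angry", "scared"), ("happy", "peaceful"), "sad"))
# (4.0, 30, 37.0)
```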
Although this discretization did limit the number of possible
settings of the slider bars, it did not substantially encourage the
Kreung participants to use the same settings as the United States
participants. With three possibilities for each of five slider bars,
there were 3^5, or 243, possible settings available per emotion, with
only one of those corresponding to the choices of the United
States participant population. That is, for each emotion, there
was a 0.4% chance the prototypical United States configuration
would be chosen at random.
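The count of settings and the chance level quoted above follow directly from the slider design, as this short check shows (the variable names are illustrative):

```python
from fractions import Fraction

POSITIONS_PER_SLIDER = 3
SLIDER_COUNT = 5

settings = POSITIONS_PER_SLIDER ** SLIDER_COUNT  # every combination of positions
chance = Fraction(1, settings)                   # one prototypical configuration

print(settings)                # 243
print(f"{float(chance):.1%}")  # 0.4%
```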
In the United States experiment the sliders were automatically
set to random positions at the beginning of each session. In the
Kreung experiment, with discretized sliders with three values per
slider, all of the sliders were set in the most neutral, middle position.
Subjects could press a button (the space bar on the computer
keyboard) to begin a melody or movement sequence. The sliders
could be moved both during and between sequences. The duration
of each sequence was determined by the tempo setting of the slider.
When moved during a sequence, the melody/ball would change
immediately in response to the movements. Between music
sequences, there was silence. Between movement sequences, the
ball would hold still in its final position before resetting to a neutral
position at the beginning of the next sequence.
Slider Bar Reliability for Kreung Data. The discrete nature of the
Kreung data afforded χ² analyses to quantify the likelihood that
parameters (sliders) were used randomly. The values in Table S5
indicate the likelihood that the distributions of slider positions in
the Kreung data were due to chance (lower values indicate
lower likelihood of random positioning). As can be seen, the rate
slider was used systematically (nonrandomly) for all emotions,
across modalities. Other sliders varied in their reliability by emotion but were used nonrandomly for subsets of emotions.
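One way such a test could be run for a single slider and emotion is a χ² goodness-of-fit test against equal use of the three positions. The sketch below makes that assumption explicit; the text does not specify the null distribution or the software used, and the function names are hypothetical. With three positions there are two degrees of freedom, for which the χ² upper-tail probability has the closed form exp(−x/2), so no statistics library is needed.

```python
import math


def chi_square_uniform(counts):
    """Chi-square statistic for observed counts against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


def p_value_three_positions(counts):
    """P value for a three-position slider (2 degrees of freedom).

    Assumes the null hypothesis that all three positions are equally likely.
    """
    if len(counts) != 3:
        raise ValueError("expected counts for exactly three slider positions")
    return math.exp(-chi_square_uniform(counts) / 2.0)
```

Counts spread evenly over the three positions give a p value of 1, whereas counts concentrated on one position give a value near 0, matching the reading of Table S5 in which lower values indicate less random use.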
Lak and Modernization. Lak, the village where we conducted our
study, had no infrastructure for water, waste management, or electricity, although it was equipped with a gas-powered generator. The
Kreung language is not mutually intelligible with Khmer, Cambodia's official language, and has no writing system. Lak and nearby
villages maintain their own dispute resolution practices separate
from the Cambodian legal system. The Kreung practice an animist
religion and have maintained related practices such as speaking with
spirits and ritual animal sacrifice (6). Access to the village is limited
by its remote location and the difficulty of travel on unmaintained
dirt roads, which require a four-wheel drive vehicle and are impassable much of the year because of flooding. Almost none of the
Kreung participants could speak or read Khmer, so communication was facilitated by an English-Khmer translator who worked in
conjunction with a Khmer-Kreung translator who lived in the village.
Until large-scale logging operations started in Ratanakiri in the
late 1990s, the tribal ethnic minorities in the area remained culturally isolated. The destruction of the forests made the traditional
practices of slash-and-burn agriculture and periodic village relocation untenable, and the past decade has seen gradual, partial,
and reluctant modernization (7). This modernization has been
limited, and has not resulted in sustained contact with Western
culture via television, movies, magazines, books, or radio.
We conducted a survey of our participants to determine the
extent of their exposure to non-Kreung culture. The survey included age, sex, cell phone ownership, time spent talking or
listening to music on cell phones, time spent watching television,
and time spent speaking Khmer vs. speaking Kreung. Very few of
the Kreung participants owned cell phones, and those who did
reported spending very little time using them for listening to
music. None of the Kreung participants reported having listened
to Western music. However, some reported having watched videotapes of Thai movies that had been dubbed into Khmer for
entertainment, and thus may have had passive exposure to Khmer
and Thai music on video soundtracks. We should note again that
most of our participants could not speak or understand Khmer.
We did not disqualify participants with passive exposure to
Khmer and Thai music via videos for the following reasons: (i)
Khmer and Thai music are not Western. Their traditions, instruments, styles of singing, tuning systems, and use of vertical
harmony (if any) are substantially different from those in Western
music, and so exposure to Khmer and Thai music would not acclimate our participants to Western musical conventions. (ii) Our
computer program was not biased toward Western film music.
The program excluded vertical harmony and systematic rhythmic
variation, and included many more scales than the familiar
Western major and minor modes. Ultimately, both the Kreung
and Western participants frequently chose settings outside the
bounds of Western cliché.
Music and Dance in Kreung Culture. In Kreung culture, music and
dance occur primarily as a part of rituals such as weddings, funerals,
and animal sacrifices (6). Kreung music and dance traditions have
not been well documented. Interviews with Kreung musicians
indicated that there is no formal standardization of tuning or
temperament as is found in Western music, nor is there any system of vertical pitch relations equivalent to Western tonal harmony; Kreung music tends to be heterophonic in nature. Kreung
musical instruments bear no obvious morphological relationship
to Western instruments. Furthermore, although some Kreung and
Khmer instruments are similar, most are profoundly different,
and there is very little overlap between traditional Kreung music
and that performed throughout the rest of Cambodia. The geographical isolation of the Kreung combined with the pronounced
formal dissimilarity of Kreung and Western music made Lak an
ideal location for a test of cross-cultural musical universality.
We observed two musical forms in Lak. First, a gong orchestra,
where each performer plays a single gong in a prearranged
rhythmic pattern, causing a greater melodic pattern to emerge
from the ensemble, often accompanying group singing and dancing. Second was a heterophonic style of music based around
a string instrument called the mem, accompanied by singing and
wooden flutes. In this form, all players follow the same melodic
line while adding loosely synchronized embellishments. The mem
is an extremely quiet bowed monochord that uses the musician's
mouth as a resonating chamber. Traditionally the mem is bowed
with a wooden or bamboo stick, and its sound is described as imitative of buzzing insects. Kreung music is passed along by pedagogical tradition, and the role of musician is performed primarily
by those who are highly skilled and extensively trained. Examples
of the two forms of Kreung music described here are included in
Audios S11 and S12. These recordings are courtesy of Cambodian
Living Arts (www.cambodianlivingarts.org) and Sublime Frequencies (www.sublimefrequencies.com).
We found the Kreung data to be substantially noisier than the
data collected in the United States. We speculate that some of this
noise was related to unfamiliarity with the experimental context,
as described in the main text. However, additional noise may have
been the result of unfamiliarity with the tuning, timbre, and scales
used in our program, none of which are native to Kreung culture.
Kreung-English Translation: Happy vs. Peaceful. In the Kreung
language there is no word that translates directly to "peaceful." After
extensive conversation with our translators, the closest words
we could find were "sngap" and "sngap chet." Idiomatically, these
translate to something like "still heart," which seemed to capture
the essence of peacefulness we were looking for. However,
"sngap" and "sngap chet" do not refer to emotion as a state of
being, but instead refer to emotion as an active process. In particular, they refer to the process of having been angry and then
experiencing that anger dissolve into happiness. For this reason,
both words strongly connote happiness. Some of our subjects
seemed to use these words as synonyms for happiness, occasionally even reporting they had completed expressing "happy" after
being asked to express "peaceful," although never the other way
around. The difficulty understanding the concept of peacefulness
cross-culturally may be consistent with a previous finding in the
literature that reported "peacefulness" as the least successfully
identified emotion compared with "happy," "sad," and "scary"
(8). Nevertheless, the Kreung results show a distinct difference
between "happy" and "peaceful." "Peaceful" music and "peaceful"
motion both tended to be substantially slower than their "happy"
counterparts, a relationship matching the findings of the experiment in the United States.
Comment on Clynes and Nettheim. Clynes and Nettheim attempted
to show cross-modal, cross-cultural recognition of emotional
expressions produced in the domain of touch and mapped to sound
(9). However, there are several important differences between
their experiments and the one reported here. Clynes and Nettheim
used a forced-choice paradigm and created individual touch-to-sound mappings per emotion, as opposed to using fixed rules
representing hypotheses about the relationship between the two
domains. Clynes's mapping decisions introduced intuitively generated, arbitrary pitch content specific to each emotion, suggesting
what was being tested was not a cross-modal relationship, but
simply the effect of pitch on emotion perception. Clynes proposes
the idea of "essentic forms": fixed, short-time, essential emotional
forms that are biologically determined and measured in terms of
touch. Although this is interesting as a hypothesis, it is not confirmed by the available data (10), and is not a model of feature-based cross-modal perception.
Informed Consent in the Kreung Village. Before the study began, we
met with several villagers and described the series of studies we
would be conducting. At this time we also discussed fair compensation. Together, we determined that a participant would be
paid the same amount that they would have forfeited by not
going to work in the field that day. We set up the equipment in the
house of one of the Khmer-Kreung translators. Any adult villager
could come to the house if he or she wanted to participate in the
study. We did not solicit participation. As most of the villagers in
Lak cannot read or write, we did not obtain written consent. Instead, consent was implied by coming to the house to participate.

1. Zicarelli D (1998) in Proceedings of the 1998 International Computer Music Conference, ed Simoni M (Univ of Michigan, Ann Arbor), pp 463–466.
2. Reas C, Fry B (2006) Processing: Programming for the media arts. AI Soc 20(4):526–538.
3. Rost RJ (2004) OpenGL Shading Language (Addison-Wesley, Boston).
4. Forte A (1973) The Structure of Atonal Music (Yale Univ Press, New Haven).
5. Huron D (1994) Interval-class content in equally tempered pitch-class sets: Common scales exhibit optimum tonal consonance. Music Percept 11(3):289–305.
6. United Nations Development Programme Cambodia (2010) Kreung Ethnicity: Documentation of Customary Rules (UNDP Cambodia, Phnom Penh, Cambodia). Available at www.un.org.kh/undp/media/files/Kreung-indigenous-people-customary-rules-Eng.pdf. Accessed June 29, 2011.
7. Paterson G, Thomas A (2005) Commonplaces and Comparisons: Remaking Eco-Political Spaces in Southeast Asia (Regional Center for Social Science and Sustainable Development, Chiang Mai, Thailand).
8. Vieillard S, et al. (2008) Happy, sad, scary and peaceful musical excerpts for research on emotions. Cogn Emotion 22(4):218–237.
9. Clynes M, Nettheim N (1982) Music, Mind, and Brain (Plenum, New York), pp
10. Trussoni SJ, O'Malley A, Barton A (1988) Human emotion communication by touch: A modified replication of an experiment by Manfred Clynes. Percept Mot Skills 66(2):

Fig. S1. Average values of z-scores (with SE bars) for each slider (i.e., feature) by each emotion and by population. The sign for the up/down slider was flipped
for visualization purposes (positive values indicate upward tilt/pitch).

Fig. S2. The Ball.

Fig. S3. (A) Interface for the United States music task. (B) Slider labels by task modality.

Fig. S4. Interface for the Kreung movement task with icons as a mnemonic aid.

Table S1. Raw means for each slider by emotion for each population
United States means
Size (big/small)
Direction (up/down)
Kreung means
Size (big/small)
Direction (up/down)
Angry Happy Peaceful Sad Scared
30400 331.00 280.12
53.70 33.24
8.00 32.00
67.92 49.36
76.94 35.56
53.74 289.68
19.44 58.22
22.66 10.08
26.28 52.28
78.30 51.12


Table S2. Z-scored, discrete means for each slider by emotion, for each population
Angry Happy Peaceful
United States discrete means
Size (big/small)
Direction (up/down)
Kreung discrete means
Size (big/small)
Direction (up/down)

Table S3. Fisher's linear discriminant analysis
Music and motion
Angry Happy Peace
Music and motion
United States
Size (big/small)
Direction (up/down)
Size (big/small)
Direction (up/down)
United States
Size (big/small)
Direction (up/down)
Size (big/small)
Direction (up/down)
United States
Size (big/small)
Direction (up/down)
Size (big/small)
Direction (up/down)
Scared Total
0.11 <0.01
0.04 <0.01
0.26 <0.01
Linear discriminant analysis. Slider importance for discrimination of each emotion from all other emotions. Higher values indicate higher importance.

Table S4. Discretization values for Kreung slider bars (derived from United States data)
Size (big/small)
Direction (up/down)

Table S5. χ² reliability of slider bar use by Kreung participants
Music and motion
Step size (big/small)
Direction (up/down)

Movie S1. United States angry movement.
Movie S2. Kreung angry movement.
Movie S3. United States happy movement.
Movie S4. Kreung happy movement.
Movie S5. United States peaceful movement.
Movie S6. Kreung peaceful movement.
Movie S7. United States sad movement.
Movie S8. Kreung sad movement.
Movie S9. United States scared movement.
Movie S10. Kreung scared movement.

Audio S1. United States angry music.
Audio S2. Kreung angry music.
Audio S3. United States happy music.
Audio S4. Kreung happy music.
Audio S5. United States peaceful music.
Audio S6. Kreung peaceful music.
Audio S7. United States sad music.
Audio S8. Kreung sad music.
Audio S9. United States scared music.
Audio S10. Kreung scared music.
Audio S11. Example of Kreung gong music, courtesy of Sublime Frequencies.
Audio S12. Example of Kreung mem music, Bun Hear, courtesy of Cambodian Living Arts.

Dataset S1. All data from the United States experiment, in continuous format

Dataset S2. All data from the United States and Kreung experiments
Kreung data were collected as discrete values (each slider had three positions: 0, 1, 2). Continuous values from the United States data were converted to
discrete values as described in the main text.
