NeuroImage 106 (2015) 222-237
A computational analysis of the neural bases of Bayesian inference

Antonio Kolossa a, Bruno Kopp b,*, Tim Fingscheidt a

a Institute for Communications Technology, Technische Universität Braunschweig, Schleinitzstr. 22, 38106 Braunschweig, Germany
b Department of Neurology, Hannover Medical School, Carl-Neuberg-Str. 1, 30625 Hannover, Germany

Article info
Article history:
Accepted 2 November 2014
Available online 8 November 2014
Keywords:
Event-related potentials
Single-trial EEG
Free-energy principle
Bayesian brain
Surprise
Probability weighting
Abstract
Empirical support for the Bayesian brain hypothesis, although of major theoretical importance for cognitive neuroscience, is surprisingly scarce. This hypothesis posits simply that neural activities code and compute Bayesian probabilities. Here, we introduce an urn-ball paradigm to relate event-related potentials (ERPs) such as the P300 wave to Bayesian inference. Bayesian model comparison is conducted to compare various models in terms of their ability to explain trial-by-trial variation in ERP responses at different points in time and over different regions of the scalp. Specifically, we are interested in dissociating specific ERP responses in terms of Bayesian updating and predictive surprise. Bayesian updating refers to changes in probability distributions given new observations, while predictive surprise equals the surprise about observations under current probability distributions. Components of the late positive complex (P3a, P3b, Slow Wave) provide dissociable measures of Bayesian updating and predictive surprise. Specifically, the updating of beliefs about hidden states yields the best fit for the anteriorly distributed P3a, whereas the updating of predictions of observations accounts best for the posteriorly distributed Slow Wave. In addition, parietally distributed P3b responses are best fit by predictive surprise. These results indicate that the three components of the late positive complex reflect distinct neural computations. As such they are consistent with the Bayesian brain hypothesis, but these neural computations seem to be subject to nonlinear probability weighting. We integrate these findings with the free-energy principle that instantiates the Bayesian brain hypothesis.
© 2014 Elsevier Inc. All rights reserved.
Introduction
How can the brain make reliable and valid inferences about the external world based on variable sensory information? Bayesian decision
theory offers a useful theoretical framework for explaining this inference process (Jaynes, 2003; Robert, 2007). Bayesian inference constantly updates prior beliefs to posterior beliefs in light of observed data
according to probability rules (Bayes' theorem; Baldi and Itti (2010)).
Thus, it can hardly surprise that a hypothesis has been proposed according to which the brain codes and computes Bayesian probabilities (Knill
and Pouget, 2004; Friston, 2005; Doya et al., 2007; Gold and Shadlen,
2007; Kopp, 2008). While earlier research provided results that are consistent with the Bayesian brain hypothesis (Hampton et al., 2006;
Ostwald et al., 2012; Vilares et al., 2012; Lieder et al., 2013), no
agreed-upon conclusion about the utility of the Bayesian brain hypothesis as a theoretical framework for explaining cognitive functions of the
brain has been achieved in the eld (Clark, 2013).
In order to test the Bayesian brain hypothesis, we tried to explain event-related brain potentials (ERPs) during successive trials in an urn-ball task (Phillips and Edwards, 1966) in terms of underlying probability distributions. Specifically, we were interested in dissociating temporally and regionally specific cortical responses in terms of Bayesian updating and predictive surprise. To introduce experimental variance in terms of surprise-related responses, we manipulated the task's probabilistic contingencies at two levels. First, we introduced uncertainty about the sort of urn containing balls (by sampling urns from two distributions). Second, we manipulated the proportion of ball colors within each urn. We will refer to these probabilistic contingencies as prior probabilities and likelihoods, respectively. Notice that the subject inferred the nature of the urn based on small samples of balls that were drawn from the sampled urn.

* Corresponding author.
E-mail addresses: kolossa@ifn.ing.tu-bs.de (A. Kolossa), kopp.bruno@mh-hannover.de (B. Kopp), fingscheidt@ifn.ing.tu-bs.de (T. Fingscheidt).
http://dx.doi.org/10.1016/j.neuroimage.2014.11.007
1053-8119/© 2014 Elsevier Inc. All rights reserved.

Cognitive processes related to Bayesian updating can be split into two sub-processes: (1) Bayesian surprise represents the change in beliefs about hidden states given current observations. Bayesian surprise is computationally expressed as the divergence between the prior probability distribution over states and the posterior probability distribution over states given current observations; we henceforth label probability distributions over states as belief distributions. (2) Postdictive surprise represents the change in predictions of future observations given current observations. Postdictive surprise is computationally expressed as the divergence between the prior prediction distribution over observations and the posterior prediction distribution over observations given current observations. Notice that the probability distributions over observations (or prediction distributions) are dependent on the belief distributions, such that both the belief distributions and the prediction distributions depend on Bayes' theorem. Thus, postdictive surprise represents an aspect of Bayesian updating which is dependent on, yet computationally distinguishable from, Bayesian surprise. Bayesian surprise and postdictive surprise are both fundamentally different from predictive surprise, which is simply the surprise about current observations under the current prediction distribution. Another way to grasp these distinctions is that Bayesian surprise refers to probability distributions (belief distributions) over unobservable random variables (here hidden states or urns), whereas postdictive surprise and predictive surprise refer (in different ways, as explained above) to probability distributions (prediction distributions) over observed random variables (here observed balls).
We also examined the possibility that electrophysiological substrates of Bayesian inference may be better explained when nonlinear
weighting of probabilities is taken into account. Nonlinear probability
weighting was originally conjectured by prospect theory which represents a successful descriptive theory of economic decision behavior
(Kahneman and Tversky, 1979; Tversky and Kahneman, 1992; Fox
and Poldrack, 2009). In addition, nonlinear likelihood weighting has
been applied in technical systems for audiovisual speech recognition
(Neti et al., 2000). We addressed the issue of nonlinear probability
weighting by repeating the above comparisons (of Bayesian updating
and predictive surprise) using Bayesian observer models with and without nonlinear weighting.
We recorded 20-channel electroencephalographic data from human participants who performed our urn-ball paradigm. Our analyses focused on the late positive complex that is known to be decomposable into three separable ERP components (Sutton and Ruchkin, 1984; Dien et al., 2004): First, the late positive complex incorporates the anteriorly distributed P3a component that usually occurs at latencies around 340 ms post-stimulus (Kopp and Lange, 2013). Second, the late positive complex comprises the parietally distributed P3b component at, in comparison to the P3a latency, variably delayed latencies (Kolossa et al., 2012). Notice that the functional significance of these two ERP components seems to be related to uncertainty (Sutton et al., 1965; Kopp and Lange, 2013), surprise (Donchin, 1981; Kolossa et al., 2012), decision-making (O'Connell et al., 2012; Kelly and O'Connell, 2013) and, putatively, Bayesian inference (Friston, 2005; Kopp, 2008). Third, a posteriorly distributed Slow Wave (SW) completes the late positive complex; its functional significance is, however, comparably less well understood (but see Ruchkin et al. (1988); García-Larrea and Cézanne-Bert (1998); Spencer et al. (2001); Matsuda and Nittono (2014)).
The present study aims at contributing to the literature by applying
Bayesian model comparison to discuss Bayesian updating and predictive surprise in terms of their ability to explain trial-by-trial ERP variability at different points in time and over different regions of the
scalp. Notice that this is a meta-Bayesian analysis, in the sense that we
are using Bayesian model comparison to select among various regressors that are all based on the assumption of an ideal Bayesian observer
(Daunizeau et al., 2010; Lieder et al., 2013). Thus, the present study
should be regarded as a challenge to the validity of the Bayesian brain
hypothesis as a theoretical framework for cognitive functions of the
brain. One of us proposed close relationships between cortical P3 responses and Bayesian inference
several years ago, on purely theoretical grounds (Kopp, 2008). We were further interested in whether a
Bayesian observer model that incorporates nonlinear probability
weighting would outperform an unweighted model when explaining
the measured cortical responses.
Notice that our earlier empirical research forms the basis of more specific hypotheses that could be examined in the context of the urn-ball task. This urn-ball task was especially designed for this purpose. The first hypothesis proposes that ERP responses at fronto-central channels (P3a) are best explained by Bayesian surprise. This suggestion has been made by Kopp and Lange (2013) by referring to Sokolov's (1966) Bayesian model of the orienting response. Notice that the P3a component of the ERP is usually considered as indicating the brain's orienting response (e.g., Friedman et al., 2001; Barceló et al., 2002; Nieuwenhuis et al., 2011), and Kopp and Lange's (2013) P3a data were consistent with Sokolov's (1966) model of the orienting response. The second hypothesis postulates that ERP responses at centro-parietal channels (P3b) are best explained by predictive surprise. This suggestion was made by Kolossa et al. (2012) who showed that trial-by-trial P3b amplitude fluctuations in a simple two-choice response time task could be best explained by a computational model of predictive surprise. Assuming that these two hypotheses possess some validity, a novel hypothesis arises according to which ERP responses at occipito-parietal channels (SW; Matsuda and Nittono, 2014) would be best explained by postdictive surprise.
Materials and methods
Participants, experimental design, and data acquisition & analysis
Participants
Sixteen undergraduate psychology students participated to gain
course credits (15 females, 1 male). Their age ranged from 19 to 50
years (M = 24.7; SD = 9.3 years of age). Handedness was examined
with the Edinburgh Handedness Inventory (Oldfield, 1971), revealing
that one participant was left-handed and two were ambidextrous. All
participants indicated having normal or corrected-to-normal sight.
The procedure was approved by the local Ethics Committee.
Experimental design
Our urn-ball task represents a modification of tasks that were used by Phillips and Edwards (1966) and by Grether (1980, 1992; see also Furl and Averbeck (2011) and Achtziger et al. (2014), for a similar paradigm). There were U = 2 types of urns (labeled u = 1 and u = 2) which could be distinguished by the distribution of the K = 2 types of colored balls (labeled k = 1 and k = 2 for red and blue, respectively) of the ten balls contained in one urn. The urn types represented so-called states $s = u \in \mathcal{U} = \{1, 2\}$ which were hidden from the participants during the experiment. The balls were so-called events $k \in \mathcal{K} = \{1, 2\}$ which could be observed by the participants during the experiment. The experimental design (Fig. 1) consisted of a factorial combination of two levels of prior probabilities (Pc, Pu) and two levels of likelihoods (Lc, Lu), yielding four experimental conditions $C \in \{\mathrm{PcLc}, \mathrm{PcLu}, \mathrm{PuLc}, \mathrm{PuLu}\}$, each of which contained B = 50 episodes of random sampling ($b \in \{1, \ldots, 50\}$), each consisting of N = 4 trials ($n \in \{1, \ldots, 4\}$), yielding a total of 800 sequentially presented colored ball stimuli. All conditions were administered to each participant, with short breaks (approximately 3 min) between the conditions, and their order was counterbalanced across participants. The ball colors were also counterbalanced across participants, but we will ignore this in our description to avoid confusion.
At the beginning of each condition, the tableau of ten urns containing a total of one hundred balls was shown to the participants,
representing prior probabilities and likelihoods. The visualization of
prior probabilities and likelihoods in the form of tableaus allowed participants to build an internal representation of these probabilistic parameters. Each episode of random sampling consisted of the following
sequence of events: First, one of the ten urns was selected randomly
to form the state s = u, but the outcomes of these selections remained
hidden to the participants during the experiment. Subsequently, a random sample of four balls was sequentially drawn, with replacement,
from the selected urn, and shown one after the other, taking the form
of observations o(n) = k at each trial n.
Participants were asked to indicate the color of each ball stimulus by pressing the left or right Ctrl key on a standard computer keyboard (using the left or right index finger, respectively). Once a sample of four balls had been completed, participants had to choose which type of urn had been selected on the current episode of sampling (i.e., which urn type u constitutes the state s). They indicated their choice by pressing the left or right Ctrl key for the state being urn type s = u = 1 and urn type s = u = 2, respectively. Stimulus-response mapping was counterbalanced across participants (i.e., left or right hand responses indicating s = 1 and s = 2 choices). The duration of an episode of sampling, i.e., the presentation of the sample of ball stimuli, the collection of the color responses, and the final urn choice, amounted to around 12 s to 15 s. Neither feedback nor reward was provided during the course of the experiment.

Fig. 1. Illustration of the urn-ball paradigm and outline of the experimental setup for the condition uncertain prior probability and uncertain likelihood PuLu, i.e., P(s = 1) = 0.7, P(s = 2) = 0.3, P(o = 1|s = 1) = P(o = 2|s = 2) = 0.7, and P(o = 2|s = 1) = P(o = 1|s = 2) = 0.3. At the beginning of a condition, the 'tableau' of urns u and balls k is shown to the participant, describing prior probability and likelihood. Afterwards, an urn is randomly selected to constitute the hidden state s = u and not shown to the participant. From this urn, four balls are drawn consecutively over trials $n \in \{1, 2, 3, 4\}$ with replacement, and observed by the participants, $o(n) = k \in \{1, 2\}$. Participants were asked to indicate the color of each ball stimulus by pressing the corresponding key, and they had to choose which type of urn had been selected on the current sampling once a sample of four balls had been completed.

Prior probabilities were manipulated by presenting ten urns, composed of different numbers of type u = 1 and type u = 2 urns. In uncertain prior probability conditions (Pu), seven type u = 1 urns and three
type u = 2 urns (i.e., P(s = 1) = 0.7, P(s = 2) = 0.3) were presented.
In the certain prior probability condition (Pc), nine type u = 1 urns
and one type u = 2 urn (i.e., P(s = 1) = 0.9, P(s = 2) = 0.1) were presented. On uncertain likelihood conditions (Lu), urn type u = 1
contained seven red (k = 1) and three blue (k = 2) balls (i.e., P(o =
1|s = 1) = 0.7, P(o = 2|s = 1) = 0.3), while urn type u = 2 contained
three red and seven blue balls (i.e., P(o = 1|s = 2) = 0.3, P(o = 2|s =
2) = 0.7). On certain likelihood conditions (Lc), urn type u = 1
contained nine red balls and one blue ball (i.e., P(o = 1|s = 1) =
0.9, P(o = 2|s = 1) = 0.1), while urn type u = 2 contained one red
ball and nine blue balls (i.e., P(o = 1|s = 2) = 0.1, P(o = 2|s = 2) = 0.9).
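
The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration, not the stimulus code used in the experiment: the function names, the dictionary layout, and the seed are assumptions, and drawing from a tableau of ten urns or ten balls with replacement is modeled as a Bernoulli draw with the stated probabilities.

```python
import random

# Probability of urn type 1 (prior) and of the majority color within an urn
# (likelihood), as stated in the text for the certain/uncertain levels.
PRIORS = {"Pc": 0.9, "Pu": 0.7}
LIKELIHOODS = {"Lc": 0.9, "Lu": 0.7}


def sample_episode(prior_level, likelihood_level, rng, n_trials=4):
    """Draw one hidden urn type and n_trials ball colors (1 = red, 2 = blue)."""
    p_urn1 = PRIORS[prior_level]
    p_major = LIKELIHOODS[likelihood_level]
    urn = 1 if rng.random() < p_urn1 else 2
    # Urn type 1 holds mostly red balls, urn type 2 mostly blue balls.
    p_red = p_major if urn == 1 else 1.0 - p_major
    balls = [1 if rng.random() < p_red else 2 for _ in range(n_trials)]
    return urn, balls


def sample_condition(prior_level, likelihood_level, n_episodes=50, seed=0):
    """Generate all episodes of one condition (seed is a hypothetical choice)."""
    rng = random.Random(seed)
    return [sample_episode(prior_level, likelihood_level, rng)
            for _ in range(n_episodes)]


# Four conditions x 50 episodes x 4 trials.
CONDITIONS = [(p, l) for p in PRIORS for l in LIKELIHOODS]
episodes = sample_condition("Pu", "Lu")
TOTAL_STIMULI = sum(len(balls)
                    for cond in CONDITIONS
                    for _, balls in sample_condition(*cond))
```

Each call to `sample_condition` returns 50 pairs of a hidden urn type and four observed colors, so the full factorial design yields the 800 ball stimuli mentioned in the text.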
Before the experiment began, each participant completed four practice episodes of sampling under the supervision of, and in a task-related dialogue with, the experimenter to become accustomed to the task. The tableau of each practice sampling consisted of one single u = 1 urn and one single u = 2 urn (yielding uniform prior probabilities). Successful completion of these practice episodes demonstrated that the participants understood the procedure and their task. Visual ball stimuli were presented at the center of a computer screen (Eizo FlexScan T766 19; Hakusan, Ishikawa, Japan) against a gray background (stimulus size = 1, stimulus duration = 100 ms, stimulus onset asynchrony = 2500 ms). The experiment was run using Presentation (Neurobehavioral Systems, Albany, CA).
Data acquisition
The electroencephalogram (EEG) was recorded continuously, using a QuickAmps-72 amplifier (Brain Products, Gilching, Germany) and the Brain Vision Recorder Version 1.02 software (Brain Products, Gilching, Germany) from frontal (F7, F3, Fz, F4, F8), fronto-central (FCz), central (T7, C3, Cz, C4, T8), parietal (P7, P3, Pz, P4, P8), occipital (O1, O2), and mastoid (M1, M2) sites. Ag-AgCl EEG electrodes were used that were mounted on an EasyCap (EasyCap, Herrsching-Breitbrunn, Germany). Electrode impedance was kept below 10 kΩ. All EEG electrodes were referenced to average reference during the recording. Participants were informed about the problem of non-cerebral artifacts, and they were encouraged to reduce the occurrence of movement artifacts. Ocular artifacts were monitored by means of bipolar pairs of electrodes positioned at the sub- and supraorbital ridges (vertical electrooculogram, vEOG) and at the external ocular canthi (horizontal electrooculogram, hEOG). The EEG and EOG channels were subject to a bandpass filter from 0.01 Hz to 30 Hz and digitized at 250 Hz sampling rate.
Data analysis
Initial off-line analysis of the EEG data was performed by means of the Brain Vision Analyzer Version 2.0.1 software (Brain Products, Gilching, Germany). EEG electrodes were off-line re-referenced to average mastoid reference. Careful manual artifact rejection was performed before averaging to discard trials during which eye movements or any other non-cerebral artifact except blinks had occurred. Deflections in the averaged EOG waveforms were small, indicating that fixation was well maintained on those trials that survived the manual artifact rejection process. Semi-automatic blink detection and the application of an established method for blink artifact removal were employed for blink correction (Gratton et al., 1983).
In the next step, event-related potential (ERP) waveforms were created. Using MATLAB 7.11.0 and EEGLAB 11.0.4.3b (Delorme and Makeig, 2004), the continuous EEG was divided into epochs of 700 ms duration, starting 100 ms before stimulus onset. Epochs were corrected using the interval $[-100, 0]$ ms relative to stimulus presentation as the baseline. Solely artifact-free ERP epochs that were recorded in response to the presentation of correctly identified ball stimuli entered the subsequent levels of analysis. The Statistical Parametric Mapping (SPM8) software was used for meta-Bayesian model comparison (Friston et al., 2002, 2007).
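
The epoching and baseline correction described above can be illustrated with a minimal Python sketch for a single channel. It is not the EEGLAB pipeline used in the study: the function names and the list-based signal representation are assumptions, and only the sampling rate (250 Hz), the epoch length (700 ms) and the pre-stimulus baseline (100 ms) are taken from the text.

```python
FS_HZ = 250        # sampling rate stated in the text
PRE_MS = 100       # epoch starts 100 ms before stimulus onset
EPOCH_MS = 700     # total epoch duration


def ms_to_samples(ms, fs_hz=FS_HZ):
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(ms * fs_hz / 1000.0))


def extract_epoch(signal, onset_index):
    """Cut one epoch around a stimulus onset and subtract the pre-stimulus mean."""
    n_pre = ms_to_samples(PRE_MS)
    n_total = ms_to_samples(EPOCH_MS)
    start = onset_index - n_pre
    if start < 0 or start + n_total > len(signal):
        raise ValueError("epoch exceeds the recording")
    epoch = signal[start:start + n_total]
    baseline = sum(epoch[:n_pre]) / n_pre      # mean over the 100 ms before onset
    return [x - baseline for x in epoch]


# Toy signal: constant 5.0 before the stimulus, 7.0 afterwards.
signal = [5.0] * 100 + [7.0] * 200
epoch = extract_epoch(signal, onset_index=100)
```

At 250 Hz an epoch spans 175 samples, of which the first 25 form the baseline; in the toy example the post-stimulus samples equal 2.0 after baseline correction.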
Selection of ERP data for further analyses
For Bayesian model comparison, we selected electrodes at which the ERP components (N250, P3a, P3b, SW) are most frequently reported to occur. The resulting electrode sets equaled C3, Cz, and C4 for the N250 (Towey et al., 1980), Fz, FCz, and Cz for the P3a (Kopp and Lange, 2013), Pz for the P3b (Kolossa et al., 2012), and O1 and O2 for the SW (Matsuda and Nittono, 2014). Virtual electrodes that corresponded to each of the four ERP components were created by averaging epochs over the entire sets of respective electrodes. ERP component latencies were determined through the analysis of ERP waveform variability at these virtual electrodes. Specifically, the grand average ERP waves for frequent and rare balls in each of the four experimental conditions (resulting in eight waves per virtual electrode) were analyzed with regard to the latencies $t_{\sigma^2_{\max}}$ at which maximum variability between ERP waveforms occurred. Maximum variance was searched within predefined time intervals of [200, 300] ms for the N250 (Towey et al., 1980), of [300, 400] ms for the P3a and P3b (Kolossa et al., 2012; Kopp and Lange, 2013), and of [400, 560] ms for the SW (García-Larrea and Cézanne-Bert, 1998). Single-trial ERP amplitudes which entered Bayesian model comparison were estimated by averaging the EEG signal of the virtual electrodes over the time interval $t_{\sigma^2_{i,\max}} \pm 20$ ms, with $i \in \{\mathrm{N250}, \mathrm{P3a}, \mathrm{P3b}, \mathrm{SW}\}$. The interval width was chosen to provide smoothed amplitudes at the moment of peak variance (Luck, 2005).
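
The selection procedure can be sketched as follows. The sketch rests on several assumptions that the text does not settle: epochs are lists of samples at 250 Hz starting at -100 ms, variability is measured as the population variance across the grand-average waves at each sample, and the averaging window extends 20 ms to either side of the peak-variance latency. All function names are hypothetical.

```python
from statistics import pvariance

FS_HZ = 250
EPOCH_START_MS = -100


def ms_to_index(t_ms):
    """Sample index of a post-stimulus time within an epoch."""
    return round((t_ms - EPOCH_START_MS) * FS_HZ / 1000)


def index_to_ms(index):
    """Post-stimulus time of a sample index within an epoch."""
    return index * 1000 / FS_HZ + EPOCH_START_MS


def virtual_electrode(channels):
    """Average equally long epochs sample by sample across electrodes."""
    return [sum(vals) / len(vals) for vals in zip(*channels)]


def peak_variance_index(waves, window_ms):
    """Index of maximum variance across waves within a search window."""
    lo, hi = ms_to_index(window_ms[0]), ms_to_index(window_ms[1])
    variances = [pvariance([w[i] for w in waves]) for i in range(lo, hi + 1)]
    return lo + variances.index(max(variances))


def single_trial_amplitude(epoch, peak_index, half_width_ms=20):
    """Mean amplitude in a window centered on the peak-variance latency."""
    half = round(half_width_ms * FS_HZ / 1000)
    lo = max(0, peak_index - half)
    hi = min(len(epoch), peak_index + half + 1)
    segment = epoch[lo:hi]
    return sum(segment) / len(segment)


# Toy data: eight grand-average waves that differ only at one sample.
waves = [[0.0] * 175 for _ in range(8)]
for j, wave in enumerate(waves):
    wave[110] = float(j)
peak = peak_variance_index(waves, (300, 400))
```

With these toy waves the search within [300, 400] ms returns the sample at which the waves differ, and `single_trial_amplitude` then averages 11 samples around that latency for each single-trial epoch.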
Bayesian observer model

Fig. 2 shows the hierarchical structure of the model space. A Bayesian observer lies at its core. Two specific probability distributions (the belief distribution, BEL, and the prediction distribution, PRE) are formalized that represent dynamic variables which fluctuate within the Bayesian observer. There are two random variables: events, i.e., balls which are sampled and form observations $o(n) = k \in \mathcal{K} = \{1, \ldots, K\}$, and hidden states $s = u \in \mathcal{U} = \{1, \ldots, U\}$, i.e., the type of urns from which the balls are sampled. Bayesian surprise, postdictive surprise, and predictive surprise are used as response functions, i.e., they are calculated over the BEL and PRE distributions in order to link the probability distributions (which a Bayesian observer would have to encode) to measurable scalp potentials. Consequently, these surprise values, calculated on a trial-by-trial basis, are used for meta-Bayesian model comparison. Table 1 presents all distributions that were chosen for evaluation, including those taken from the literature as detailed in Appendices A (OST model (Ostwald et al., 2012)) and B (digital filtering model (DIF) (Kolossa et al., 2012)). In the following, we will give a short introduction to Bayesian inference before the BEL and PRE distributions will be explained in more detail.
Let $P(\xi)$ be the prior probability of a random variable $\xi \in \mathcal{X}$, and $P(\xi \mid o)$ the posterior probability given an observation o. The transition from prior distribution $P_{\mathcal{X}}$ to posterior distribution $P_{\mathcal{X}|o}$ follows Bayes' theorem

$$P(\xi \mid o) = \frac{P(o \mid \xi)\, P(\xi)}{P(o)}, \quad \xi \in \mathcal{X}, \qquad (1)$$

with likelihood $P(o \mid \xi)$ of the observation o given $\xi$, and P(o) as the evidence (i.e., the probability of the observation). The belief about a hidden state before an observation represents the prior probability $P(\xi) = P(s(n) = u \mid o_1^{n-1}) = P_u(n-1)$ and the posterior probability $P(\xi \mid o) = P(s(n) = u \mid o_1^{n}) = P_u(n)$ following an observation, with $o_1^{n-1} = (o(1), o(2), \ldots, o(n-1))$ as sequence of $n-1$ previous observations. The distribution of balls within the urns equals the likelihood $P(o \mid \xi) = P(o(n) = k \mid s(n) = u) = L_{k|u}$ and the prediction of the observation constitutes the observation probability $P(o) = P(o(n) = k \mid o_1^{n-1}) = P_k(n)$. The calculation of posterior beliefs $P_{\mathcal{U}}(n)$ is fundamental to the BEL distribution while the PRE distribution rests upon the calculation of the prediction distribution $P_{\mathcal{K}}(n+1)$ which in turn is based on these posterior beliefs.

Fig. 2. Hierarchical structure of the model space. There are two random variables: hidden states s (urn type u selected at the start of an episode of sampling) and observable events k (balls drawn). The ideal Bayesian observer updates the probability distributions over hidden states (belief updating, BEL) and observations (prediction updating, PRE) following Bayes' theorem (Eq. (1)). Bayesian updating and predictive surprise are response functions that link the probability distributions to cortical activity.

Table 1
Overview and short description of the evaluated distributions and ways to implement probability weighting.

| Distribution | Description |
|---|---|
| BEL | Belief distribution about hidden states based on a non-weighting Bayesian observer. |
| BELSI | BEL distribution based on a Bayesian observer who employs nonlinear probability weighting of the inference input. |
| BELSO | BEL distribution based on a Bayesian observer who employs nonlinear probability weighting of the inference output. |
| PRE | Prediction distribution about observations based on a non-weighting Bayesian observer. |
| PRESI | PRE distribution based on a Bayesian observer who employs nonlinear probability weighting of the inference input. |
| PRESO | PRE distribution based on a Bayesian observer who employs nonlinear probability weighting of the inference output. |
| OST (a) | Memory-based Bayesian surprise model with exponential forgetting (Ostwald et al., 2012). |
| DIF (a) | Memory-based predictive surprise model employing the digital filtering approach (Kolossa et al., 2012). |

(a) Additional models taken from literature are described in Appendices B and C, respectively.

Fig. 3 illustrates likelihoods, beliefs, predictions and their trial-by-trial updating by an ideal Bayesian observer in the PuLu condition. The left panel shows the probability distributions and likelihoods on trial $n-1$ while the right panel shows these quantities on trial n after a blue colored ball was observed (o(n) = 2). The horizontal line defines the beliefs ($P_{u=1}$, $P_{u=2}$), i.e., the probability distribution over the hidden states, while the vertical lines define the likelihoods ($L_{k=1}$, $L_{k=2}$), i.e., the distribution of balls within each type of urn which did not change within conditions. The colored areas visualize the resulting prediction distribution over events ($P_{k=1}$ (red ball), $P_{k=2}$ (blue ball)). The arrow indicates the presentation of a blue ball (o(n) = 2) which triggers Bayesian inference (BEL distribution, $P_{\mathcal{U}}(n-1) \to P_{\mathcal{U}}(n)$) equaling shifts of the horizontal line. The resulting ratio of colored areas represents the new prediction distribution $P_{\mathcal{K}}(n) \to P_{\mathcal{K}}(n+1)$.

Fig. 3. Illustration of Bayesian inference, i.e., the updating of the belief distribution $P_{\mathcal{U}}$ and of the prediction distribution $P_{\mathcal{K}}$ under uncertain prior probabilities and likelihoods $L_k$. Left panel: Likelihoods and probability distributions on trial $n-1$. Right panel: Likelihoods and probability distributions on trial n. Horizontal lines define probability distributions of the hidden state (beliefs about urns), while vertical lines define the likelihoods (ball distribution within the urns). Colored areas visualize the resulting prediction distribution for red balls vs. blue balls, respectively. The arrow indicates the observation of a blue ball on trial n (o(n) = 2). In this case, predictive surprise about the blue ball exceeds predictive surprise about a (potential) red ball since $P_{k=2}(n) < P_{k=1}(n)$. The observation of the (surprising) blue ball triggers Bayesian inference about the hidden state that is equivalent to shifting the horizontal line. The KL divergence between $P_{\mathcal{U}}(n-1)$ and $P_{\mathcal{U}}(n)$ yields the scalar values used to predict trial-by-trial EEG variations based on the BEL distribution (Bayesian surprise). The resulting ratio of colored areas on trial n represents the updated prediction distribution. The KL divergence between $P_{\mathcal{K}}(n)$ and $P_{\mathcal{K}}(n+1)$ yields the scalar values used to predict trial-by-trial EEG variations based on the PRE distribution (postdictive surprise).

Belief distribution (BEL)

The belief distribution (BEL) is the posterior probability distribution of the hidden state s being urn type u based on Bayes' theorem (Eq. (1)). We define the likelihood term $P(o(n) = k \mid s(n) = u) = L_{k|u}$ as the probability of the observation o(n) = k, given that the events are originating from state s(n) = u, and the prior $P(s(n) = u \mid o_1^{n-1}) = P_u(n-1)$ as the probability of state s(n) = u, given a sequence $o_1^{n-1} = (o(1), o(2), \ldots, o(n-1))$ of $n-1$ previous observations. The initial prior probability P(s(n = 1) = u) and likelihood were described to the participants at the beginning of each condition by presenting the tableau of all possible urns with their respective ball distributions. Note that the likelihood term remains constant throughout all trials $n \in \{1, \ldots, N = 4\}$. The posterior probability $P(s(n) = u \mid o_1^{n}) = P_u(n)$ after observation o(n) is evaluated according to Bayes' theorem (Eq. (1)):

$$P_u(n) = \frac{L_{k|u}\, P_u(n-1)}{C(n)}, \qquad (2)$$

with $u \in \mathcal{U} = \{1, \ldots, U\}$ yielding the posterior distribution $P_{\mathcal{U}}(n)$ and the normalization factor C(n) being the observation probability

$$C(n) = P(o(n) = k \mid o_1^{n-1}) = P_k(n) = \sum_{u \in \mathcal{U}} L_{k|u}\, P_u(n-1). \qquad (3)$$

The preceding posterior $P(s(n-1) = u \mid o_1^{n-1}) = P_u(n-1)$ on trial $n-1$ serves as the prior $P(s(n) = u \mid o_1^{n-1})$ on trial n. This is valid since $P(s(n) = u \mid s(n-1) = u) = 1$, implying that within one episode of sampling the chosen urn does not change.
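
As an illustration, the recursion in Eqs. (2) and (3) can be written as a short Python sketch for the PuLu condition. The function and variable names are assumptions made for this example. The per-trial Bayesian surprise is computed as the Kullback-Leibler divergence between the belief distribution before and after each observation, as described for Fig. 3, using the natural logarithm.

```python
import math


def update_beliefs(prior, likelihood, k):
    """One belief update: returns the posterior over urn types and C(n)."""
    joint = {u: likelihood[u][k] * p for u, p in prior.items()}
    evidence = sum(joint.values())            # C(n) = P_k(n), Eq. (3)
    posterior = {u: j / evidence for u, j in joint.items()}   # Eq. (2)
    return posterior, evidence


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats for distributions over the same keys."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


def run_episode(prior, likelihood, observations):
    """Track beliefs over an episode; the posterior of trial n-1 is the prior of trial n."""
    beliefs, surprises = [dict(prior)], []
    for k in observations:
        posterior, _ = update_beliefs(beliefs[-1], likelihood, k)
        surprises.append(kl_divergence(beliefs[-1], posterior))
        beliefs.append(posterior)
    return beliefs, surprises


# PuLu condition: P(s = 1) = 0.7; likelihood[u][k] = P(o = k | s = u).
PRIOR_PU = {1: 0.7, 2: 0.3}
LIKELIHOOD_LU = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.3, 2: 0.7}}
beliefs, bayesian_surprise = run_episode(PRIOR_PU, LIKELIHOOD_LU, [2, 1, 1, 2])
```

In this example a single blue ball (k = 2) moves the belief in urn type 1 from 0.7 to 0.5, because the evidence is C(n) = 0.3 · 0.7 + 0.7 · 0.3 = 0.42 and both urn types contribute 0.21 to it. After two red and two blue balls the belief returns to the initial prior of 0.7.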
Prediction distribution (PRE)
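The prediction distribution (PRE) estimates the probability distribution over future observations $P_{\mathcal{K}}(n+1)$ based on posterior beliefs $P_{\mathcal{U}}(n)$ by calculating

$$P_k(n+1) = P(o(n+1) = k \mid o_1^{n}) = \sum_{u \in \mathcal{U}} P(o(n+1) = k \mid s(n+1) = u)\, P(s(n) = u \mid o_1^{n}) \qquad (4)$$

for all $k \in \mathcal{K}$. As the urn does not change within an episode of sampling, Eq. (4) can be simplified to

$$P_k(n+1) = \sum_{u \in \mathcal{U}} L_{k|u}\, P_u(n), \quad k \in \mathcal{K}. \qquad (5)$$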
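
A minimal Python sketch of Eq. (5) follows, again for the PuLu condition. The names are hypothetical. The sketch also computes the divergence between the prediction distributions before and after an observation, which the caption of Fig. 3 describes as the scalar used for postdictive surprise; the natural logarithm is used.

```python
import math


def predict(beliefs, likelihood):
    """Prediction distribution over ball colors: P_k = sum_u L_{k|u} * P_u."""
    colors = next(iter(likelihood.values())).keys()
    return {k: sum(likelihood[u][k] * beliefs[u] for u in beliefs)
            for k in colors}


def posterior(beliefs, likelihood, k):
    """Bayes update of the beliefs after observing ball color k."""
    joint = {u: likelihood[u][k] * p for u, p in beliefs.items()}
    evidence = sum(joint.values())
    return {u: j / evidence for u, j in joint.items()}


def postdictive_surprise(beliefs, likelihood, k):
    """KL divergence between the predictions before and after observing k."""
    before = predict(beliefs, likelihood)
    after = predict(posterior(beliefs, likelihood, k), likelihood)
    return sum(before[c] * math.log(before[c] / after[c])
               for c in before if before[c] > 0)


# PuLu condition: likelihood[u][k] = P(o = k | s = u).
PRIOR_PU = {1: 0.7, 2: 0.3}
LIKELIHOOD_LU = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.3, 2: 0.7}}
prediction = predict(PRIOR_PU, LIKELIHOOD_LU)
surprise_blue = postdictive_surprise(PRIOR_PU, LIKELIHOOD_LU, 2)
```

Before any ball is observed, the prediction for a red ball is 0.7 · 0.7 + 0.3 · 0.3 = 0.58 and for a blue ball 0.42. After a blue ball the beliefs are uniform, so the updated prediction is 0.5 for each color.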
Hyper-parameterization using probability weighting functions

It has been shown that estimates of probabilities P (prior probabilities, i.e., beliefs about the urns before an observation, and likelihoods, i.e., ball distribution within the urns, in our paradigm) by human observers vary systematically from the objective probabilities in a way that low probabilities are overestimated and high probabilities are underestimated, as shown in Fig. 4. This variation is formalized via a probability weighting function w(P) which is commonly reported to be (inverse) S-shaped (Kahneman and Tversky, 1979; Prelec, 1998; Gonzalez and Wu, 1999; Zhang and Maloney, 2012; Cavagnaro et al., 2013). We use (inverse) S-shaped probability weighting as introduced in prospect theory (Kahneman and Tversky, 1979; Tversky and Kahneman, 1992; Fox and Poldrack, 2009) with the weighting function family as proposed by Zhang and Maloney (2012)

$$\mathrm{Lo}(w(P)) = \gamma\, \mathrm{Lo}(P) + (1 - \gamma)\, \mathrm{Lo}(P_0), \qquad (6)$$

with the shape parameter $\gamma$ controlling curvature, $P_0$ as crossover point ($w(P_0) = P_0$), and the log-odds function (Barnard, 1949)

$$\mathrm{Lo}(w(P)) = \log \frac{w(P)}{1 - w(P)}. \qquad (7)$$

Note that if we chose $P_0 = 0.5$ we obtain

$$1 - w(P) = w(1 - P) \qquad (8)$$

due to symmetry and that $\gamma = 1$ yields the identity w(P) = P.

Fig. 4. Comparison of objective probabilities ($\gamma = 1$, solid curve) with inverse S-shaped weighted probabilities (Eq. (6) with crossover point $P_0 = 0.5$ and shape parameter $\gamma = 0.65$, dashed curve). Probabilities lower than 0.5 are overestimated, and probabilities higher than 0.5 are underestimated, which is made apparent by the dotted horizontal line at w(P) = 0.5.

Probability weighting can be incorporated into the observer model as a hyper-parameterization of all input of the inference (i.e., prior probability and likelihood), denoted as

$$P^{w} = w(P). \qquad (9)$$

Bayesian inference takes place as before, yielding the BELSI distribution (Eq. (2))

$$P_u(n) = \frac{L^{w}_{k|u}\, P^{w}_u(n-1)}{C(n)}, \quad u \in \mathcal{U}, \qquad (10)$$

with the normalization factor C(n) (Eq. (3))

$$C(n) = \sum_{u \in \mathcal{U}} L^{w}_{k|u}\, P^{w}_u(n-1), \qquad (11)$$

and the PRESI distribution (Eq. (5))

$$P_k(n+1) = \sum_{u \in \mathcal{U}} L^{w}_{k|u}\, P^{w}_u(n), \quad k \in \mathcal{K}. \qquad (12)$$

Note that the posterior probability from the preceding trial is weighted when it becomes the prior probability for the current trial. This is a consequence of the hyper-parameterization of all probabilities. Empirically derived values of the weighting parameters differ between paradigms (Cavagnaro et al., 2013) and probability conditions (Zhang and Maloney, 2012). We addressed parameter variability by keeping $P_0 = 0.5$ which, according to Eq. (8), ensures that the same weighting functions can be applied to all probabilities while meeting the probabilistic constraints $\sum_{k \in K} L_{k|u}^w = 1$ and $\sum_{u \in U} P_u^w(n) = 1$. This keeps the complexity of the observer model as low as possible, as the only free (hyper-)parameter of the model is the shape parameter $\gamma$. Fig. 5 illustrates the influence of probability weighting, constrained in the manner described above, on Bayesian inference. The effect of probability over- and underestimation clearly persists in the posterior probabilities, which are biased towards 0.5.
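The input weighting described above can be illustrated with a minimal Python sketch. It assumes two urns and two ball colors with $P_0 = 0.5$, so that weighted probabilities still sum to one. The function names (`weight`, `belsi_update`, `presi_predict`) and the example values are hypothetical and not taken from the original analysis. The sketch further assumes that the stored belief is weighted each time it enters a computation, including the prediction step; the text states this only for the prior.

```python
import math


def weight(p, gamma, p0=0.5):
    """Weighting linear in log-odds: Lo(w(P)) = gamma*Lo(P) + (1-gamma)*Lo(P0)."""
    if p <= 0.0 or p >= 1.0:
        return p  # limiting values for gamma > 0; avoids log(0)
    log_odds = gamma * math.log(p / (1.0 - p)) + (1.0 - gamma) * math.log(p0 / (1.0 - p0))
    return 1.0 / (1.0 + math.exp(-log_odds))


def belsi_update(belief, likelihood, k, gamma):
    """One trial of the weighted belief update for an observed ball colour k.

    belief[u] is the posterior from the previous trial and likelihood[u][k]
    is the probability of colour k in urn u. Both inputs are weighted before
    Bayes' theorem is applied.
    """
    w_prior = [weight(p, gamma) for p in belief]
    w_lik = [weight(likelihood[u][k], gamma) for u in range(len(belief))]
    c = sum(l * p for l, p in zip(w_lik, w_prior))  # normalization factor C(n)
    return [l * p / c for l, p in zip(w_lik, w_prior)]


def presi_predict(belief, likelihood, gamma):
    """Prediction over ball colours for the next trial from weighted inputs.

    Assumption of this sketch: the current belief is weighted again here,
    exactly as it is when it serves as the prior of the next update.
    """
    w_belief = [weight(p, gamma) for p in belief]
    n_colours = len(likelihood[0])
    return [sum(weight(likelihood[u][k], gamma) * w_belief[u] for u in range(len(belief)))
            for k in range(n_colours)]


# Hypothetical example: uniform prior, likelihood 0.9, frequent ball (k = 0) observed.
LIKELIHOOD = [[0.9, 0.1], [0.1, 0.9]]
unweighted = belsi_update([0.5, 0.5], LIKELIHOOD, k=0, gamma=1.0)
weighted = belsi_update([0.5, 0.5], LIKELIHOOD, k=0, gamma=0.65)
prediction = presi_predict(weighted, LIKELIHOOD, gamma=0.65)
```

With $\gamma = 1$ the weighting reduces to the identity, so `unweighted` is the ordinary Bayesian posterior. With $\gamma = 0.65$ the posterior for the favored urn lies between 0.5 and the unweighted value, which is the bias towards 0.5 that Fig. 5 illustrates.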
Probability weighting for the BELSI and PRESI distributions equals a processing step which precedes inference because the input is weighted before Bayes' theorem is applied (i.e., all probabilities and likelihoods on the right-hand side of Eq. (1)). Alternatively, probability weighting can be implemented as a processing step which succeeds inference. This means that the output (i.e., the BEL and PRE distributions themselves) is weighted before response functions (see below, the Surprise section) are applied to the posterior BEL and PRE distributions. Following Eqs. (2) and (5) yields the BELSO distribution

$$P_u^w(n) = w(P_u(n)), \quad u \in U, \tag{13}$$

and the PRESO distribution

$$P_k^w(n+1) = w(P_k(n+1)), \quad k \in K, \tag{14}$$

respectively. Notice that this can be interpreted as a non-linearity which exists only between non-weighted probability distributions and response functions rather than a proper non-linear weighting of probabilities.

Surprise

Bayesian updating

The dissimilarity between two probability distributions is commonly measured via the Kullback-Leibler (KL) divergence (see, e.g., Itti and Baldi, 2009; Baldi and Itti, 2010). For prior and posterior distributions over X it is

$$D_{KL}(P_X \,\|\, P_{X|o}) = \sum_{x \in X} P(x) \ln \frac{P(x)}{P(x|o)}. \tag{15}$$

The Kullback-Leibler divergence between prior and posterior distributions over hidden states is called Bayesian surprise $I_B$. It is obtained by setting $P_U(n-1) = P_X$ as prior distribution and $P_U(n) = P_{X|o}$ as posterior distribution:

$$I_B(n) = D_{KL}(P_U(n-1) \,\|\, P_U(n)). \tag{16}$$

Notice that Bayesian surprise reflects the degree of Bayesian updating, because the prior distribution on any trial equals the posterior distribution on the previous trial. In short, Bayesian surprise reflects the changes in beliefs over hidden states that are induced by observations. By analogy, postdictive surprise can be defined via the prediction distributions before an observation, $P_K(n)$, and after an observation, $P_K(n+1)$:

$$I_B(n) = D_{KL}(P_K(n) \,\|\, P_K(n+1)). \tag{17}$$

Predictive surprise

In contrast to Bayesian updating, predictive surprise is the surprise about the current observation o(n) at trial n being k under the prediction $P_k(n) \in P_K(n)$ after a sequence $o_1^{n-1} = (o(1), o(2), \ldots, o(n-1))$ of $n-1$ former observations, calculated according to (Shannon and Weaver, 1948; Strange et al., 2005)

$$I_P(n) = -\log_2 P_k(n). \tag{18}$$

Notice that $P_k(n)$ is the denominator of Eq. (2), i.e., the one probability taken from the prediction distribution $P_K(n)$ that corresponds to the actual observation o(n). As the state is not revealed to the participant at any time, it is not possible to calculate Shannon surprise in relation to the state. However, the average Shannon surprise can be calculated as the entropy of the belief distribution

$$I_H(n) = -\sum_{u \in U} P_u(n) \log_2 P_u(n). \tag{19}$$
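The four response functions (Bayesian surprise, postdictive surprise, predictive surprise, and entropy) can be sketched in a few lines of Python. This is an illustration for an unweighted two-urn observer, not the code used in the study. The helper names `predict` and `update` and the example probabilities are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)


def bayesian_surprise(belief_before, belief_after):
    """Divergence between beliefs over urns before and after an observation."""
    return kl_divergence(belief_before, belief_after)


def postdictive_surprise(prediction_before, prediction_after):
    """Divergence between predictions over ball colours before and after an observation."""
    return kl_divergence(prediction_before, prediction_after)


def predictive_surprise(prediction, k):
    """Shannon surprise (in bits) about the observed colour k."""
    return -math.log2(prediction[k])


def belief_entropy(belief):
    """Entropy (in bits) of the belief distribution."""
    return -sum(pu * math.log2(pu) for pu in belief if pu > 0.0)


def predict(belief, likelihood):
    """Prediction over colours: P_k = sum_u L_{k|u} * P_u."""
    return [sum(likelihood[u][k] * belief[u] for u in range(len(belief)))
            for k in range(len(likelihood[0]))]


def update(belief, likelihood, k):
    """Bayes' theorem for one observed colour k."""
    joint = [likelihood[u][k] * belief[u] for u in range(len(belief))]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical trial: uniform prior over two urns, likelihood 0.9, colour k = 0 observed.
LIKELIHOOD = [[0.9, 0.1], [0.1, 0.9]]
prior = [0.5, 0.5]
pred_before = predict(prior, LIKELIHOOD)
posterior = update(prior, LIKELIHOOD, k=0)
pred_after = predict(posterior, LIKELIHOOD)

i_p = predictive_surprise(pred_before, 0)
i_b = bayesian_surprise(prior, posterior)
i_post = postdictive_surprise(pred_before, pred_after)
i_h = belief_entropy(posterior)
```

In this example the observation was predicted with probability 0.5, so predictive surprise is one bit, while the belief moves from (0.5, 0.5) to (0.9, 0.1). The prediction changes less than the belief, so postdictive surprise is smaller than Bayesian surprise.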
Fig. 5. Comparison of posterior probabilities P(s = 1|o = k) for frequent (red ball, k = 1, upper curves) and rare (blue ball, k = 2, lower curves) events calculated via Bayes' theorem (Eq. (1)) with unweighted prior probability P(s = 1) and exemplary likelihood P(o = 1|s = 1) = 0.9 (solid curves), and weighted prior probability and likelihood (inverse S-shaped weighting (Eq. (6)) with $\gamma = 0.65$, dashed curves). The black dotted line represents a posterior probability of 0.5. For both frequent and rare events, weighting leads to a bias towards higher uncertainties in posterior probabilities. The double-headed arrows illustrate the quantity of Bayesian surprise (Eq. (16)) for a prior probability of P(s = 1) = 0.5 for a frequent event (red dashed arrow with weighting, red dashed plus red solid arrow without weighting) and a rare event (blue dashed arrow with weighting, blue dashed plus blue solid arrow without weighting).

Evaluation methods
The combinations of probability distributions and response functions will be referred to as models in this section for the sake of brevity. To compare different models of the EEG we used a linear hierarchical model as implemented in the Parametric Empirical Bayesian (PEB) schemes in the SPM software (spm_PEB.m) (Friston et al., 2002, 2007). These empirical Bayes models simply equip a standard general linear model with a further hierarchical level that places constraints on the parameter estimates of the first level. The evidence for each model is approximated with a variational free energy bound which consists of an accuracy and a complexity term (Penny et al., 2004; Friston et al., 2007; Penny, 2012). This approximation can then be used to compute Bayes factors and log-evidences in the usual way. The exact specification of the design matrices is detailed in Appendix E. The log-evidences of the models $F_i = \ln p(Y|M_i)$, with $p(Y|M_i)$ being the likelihood of the data Y given the model $M_i$, and $i \in \{I_B(\mathrm{BEL}), I_B(\mathrm{BELSI}), I_B(\mathrm{BELSO}), I_H(\mathrm{BEL}), I_H(\mathrm{BELSI}), I_H(\mathrm{BELSO}), I_P(\mathrm{PRE}), I_P(\mathrm{PRESI}), I_P(\mathrm{PRESO}), I_B(\mathrm{PRE}), I_B(\mathrm{PRESI}), I_B(\mathrm{PRESO}), \mathrm{OST}, \mathrm{DIF}\}$ were used for model comparison. The log-evidences were summed across probability conditions for each participant. We used random-effects Bayesian model selection (BMS) for group studies and computed exceedance probabilities, each of which equals the probability that model i is more likely than the remaining models (Stephan et al., 2009).
Family-level comparisons permit inferences about probability weighting (Penny et al., 2010). The model space can be partitioned into three model families F which correspond to different assumptions about probability weighting: the Bayesian observer without any probability weighting ($F_{LI}$), probability weighting of the input ($F_{SI}$), and probability weighting of the output ($F_{SO}$), as discussed in the Hyper-parameterization using probability weighting functions section. A fourth family ($F_{EF}$) consists of the OST and DIF models, which are both based on counting functions with exponential forgetting. For topographical representations we use log-Bayes factors ln(BF) that equal the differences in log-evidence between two models (Kass and Raftery, 1995; Penny et al., 2004; Friston et al., 2007)

$$\ln(BF_{i,NULL}) = \ln \frac{p(Y|M_i)}{p(Y|M_{NULL})} = F_i - F_{NULL}, \tag{20}$$
with $M_{NULL}$ as the common reference null-model. The participant-specific log-Bayes factors were summed over participants to obtain the group log-Bayes factor ln(GBF) for one model against the reference model (Stephan et al., 2007). Due to the use of a common reference model, all models can be compared with each other following

$$\ln(GBF_{i,j}) = \ln(GBF_{i,NULL}) - \ln(GBF_{j,NULL}). \tag{21}$$
Positive values reflect evidence in favor of model i over j, with values larger than three being considered as strong, and values larger than five as very strong evidence (Kass and Raftery, 1995; Penny et al., 2004). The GBF is based on a fixed-effects assumption over participants. We additionally report the posterior model probabilities for fixed-effects Bayesian model comparison in Appendix A (Stephan et al., 2007; Penny et al., 2010).
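The arithmetic behind the group log-Bayes factors can be made concrete with a short Python sketch. The log-evidence values below are invented for illustration and are not data from the study. The function names and the label "weaker" for values of three or less are likewise assumptions of this sketch.

```python
def log_bayes_factor(f_model, f_null):
    """Log-Bayes factor as a difference of log-evidences."""
    return f_model - f_null


def group_log_bayes_factor(f_model_by_participant, f_null_by_participant):
    """Sum of participant-specific log-Bayes factors (fixed-effects assumption)."""
    return sum(log_bayes_factor(fm, fn)
               for fm, fn in zip(f_model_by_participant, f_null_by_participant))


def compare_models(ln_gbf_i_null, ln_gbf_j_null):
    """ln(GBF) of model i over model j from two factors against the common null model."""
    return ln_gbf_i_null - ln_gbf_j_null


def evidence_label(ln_gbf):
    """Thresholds named in the text: above 3 is strong, above 5 is very strong."""
    if ln_gbf > 5.0:
        return "very strong"
    if ln_gbf > 3.0:
        return "strong"
    return "weaker"


# Hypothetical log-evidences for three participants (not data from the study).
F_NULL = [-520.0, -498.0, -505.0]
F_MODEL_I = [-515.0, -495.5, -501.0]
F_MODEL_J = [-518.0, -497.0, -504.5]

ln_gbf_i = group_log_bayes_factor(F_MODEL_I, F_NULL)
ln_gbf_j = group_log_bayes_factor(F_MODEL_J, F_NULL)
ln_gbf_ij = compare_models(ln_gbf_i, ln_gbf_j)
label = evidence_label(ln_gbf_ij)
```

Because every model is scored against the same null model, the null log-evidences cancel in the pairwise comparison, which is why a single reference model suffices for comparing all models with each other.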
Results
Behavioral results and hyper-parameter fitting

Fig. 6 shows the likelihoods (ratios of choices) $P(c = u = 1 \mid P_{u=1}(4))$ across participants for choice c being urn type u = 1 depending on the mean posterior probability $P_{u=1}(4)$ after observing an episode of sampling. The mean posterior probabilities were calculated over all sequences containing identical ratios of types of ball colors. The general form of the likelihood function in a binary decision task is

$$P_{opt}(c = 1 \mid P_{u=1}(4)) = \frac{1}{1 + e^{-a\,(P_{u=1}(4) - b)}}$$

(Daunizeau et al., 2014). We assume that the decision depends purely on the posterior probability distribution and that these probabilities are used optimally. Values a and b control the link between choices and posterior probabilities and a bias, respectively. We set b = 0.5 because both choice options are equally preferred and thus no bias is expected. The parameter a is set to 16, yielding a steep slope around $P_{u=1}(4) = 0.5$. The dashed black line indicates this optimal choice function. The hyper-parameter $\gamma$, which controls the curvature of the inverse S-shaped weighting function (Eq. (6)) for the BELSI distribution, is optimized by minimizing the mean squared error (MSE) between the measured likelihoods and the likelihood function

$$\gamma_{opt} = \arg\min_{\gamma}\, \mathrm{MSE}\{P, P_{opt}\}. \tag{22}$$

It was evaluated in the range [0.5, ..., 1] with an increment of 0.01, yielding $\gamma_{opt} = 0.65$. Fig. 6(A) shows that for the BEL distribution (Eq. (2)), the data points are spread out and in one case even erroneously positioned in the lower right quadrant. Fig. 6(B) shows that for the BELSI distribution (Eq. (10)) with $\gamma = 0.65$, the data points are clustered around the optimal choice function and the erroneous data point moves into the correct quadrant. Note that this is the only free parameter of the Bayesian observer model which was optimized using behavioral data. The parameter estimate obtained in this way was used for the following EEG data-based analysis of the BELSI, BELSO, PRESI, and PRESO distributions.
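The grid search over the shape parameter can be sketched as follows. This is a simplified illustration under stated assumptions and not the original fitting code. It uses a hypothetical likelihood of 0.7, a uniform prior, five invented four-ball episodes, and synthetic choice ratios generated from the optimal choice function at $\gamma = 0.65$. It also scores each episode separately instead of averaging posteriors over episodes with identical color ratios.

```python
import math


def weight(p, gamma):
    """Inverse S-shaped weighting for P0 = 0.5: w = p^g / (p^g + (1 - p)^g)."""
    return p ** gamma / (p ** gamma + (1.0 - p) ** gamma)


def posterior_urn1(prior1, lik, episode, gamma):
    """Posterior of urn 1 after an episode, weighting all inputs on every trial.

    lik[u][k] is the probability of ball colour k in urn u (two urns, two colours).
    """
    belief = [prior1, 1.0 - prior1]
    for k in episode:
        w_prior = [weight(p, gamma) for p in belief]
        w_lik = [weight(lik[u][k], gamma) for u in (0, 1)]
        c = w_lik[0] * w_prior[0] + w_lik[1] * w_prior[1]
        belief = [w_lik[u] * w_prior[u] / c for u in (0, 1)]
    return belief[0]


def choice_function(p, a=16.0, b=0.5):
    """Sigmoid choice function with slope a and bias b."""
    return 1.0 / (1.0 + math.exp(-a * (p - b)))


def fit_gamma(prior1, lik, episodes, choice_ratios, grid):
    """Return the grid value that minimizes the MSE between ratios and choice function."""
    def mse(gamma):
        errors = [(ratio - choice_function(posterior_urn1(prior1, lik, ep, gamma))) ** 2
                  for ep, ratio in zip(episodes, choice_ratios)]
        return sum(errors) / len(errors)
    return min(grid, key=mse)


# Hypothetical setup (not the parameters or data of the study).
LIK = [[0.7, 0.3], [0.3, 0.7]]
EPISODES = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 1, 1, 1), (1, 1, 1, 1)]
GRID = [round(0.5 + 0.01 * i, 2) for i in range(51)]  # 0.50, 0.51, ..., 1.00

# Synthetic "measured" choice ratios generated with a known gamma of 0.65.
synthetic_ratios = [choice_function(posterior_urn1(0.5, LIK, ep, 0.65)) for ep in EPISODES]
gamma_opt = fit_gamma(0.5, LIK, EPISODES, synthetic_ratios, GRID)
```

Because the synthetic ratios were generated with $\gamma = 0.65$, the search recovers that value exactly. With real choice ratios the minimum MSE would be larger than zero and the recovered value would depend on the data.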
Conventional ERP results
Fig. 7 shows ERP waveforms as they emerged at electrodes C3, Cz, and C4 for the N250, at Fz, FCz, and Cz for the P3a, at Pz for the P3b, and at O1 and O2 for the SW, separately for the more and less frequent ball colors. Here and throughout the paper, the red (blue) color generally represents the more (less) frequent color to avoid confusion. For example, in the situation which is depicted in Fig. 1, red balls (58 exemplars) were actually more frequent than blue balls (42 exemplars). Maximum-variance ERP analysis (see the Selection of ERP data for further analyses section) reveals the presence of a centrally distributed N250 wave in the latency range [200, 300] ms with maximum variance at 232 ms (Towey et al., 1980), a frontally distributed P3a (Kopp and Lange, 2013) in the latency range [300, 400] ms with maximum variance at 356 ms, a parietally distributed P3b (Kolossa et al., 2012) in the latency range [300, 400] ms with maximum variance at 380 ms, and a posterior-positive Slow Wave (SW) in the latency range [400, 560] ms with maximum variance at 504 ms (García-Larrea and Cézanne-Bert, 1998; Matsuda and Nittono, 2014).

Fig. 6. Likelihoods (ratios of choices) $P(c = u = 1 \mid P_{u=1}(4))$ of choosing urn type u = 1 for all participants depending on the mean posterior probability $P_{u=1}(4)$ of the (A) BEL and (B) BELSI distributions after observing an episode of sampling. Mean posterior probabilities were calculated over all episodes containing identical ratios of types of balls. Error bars indicate standard error between participants. The hyper-parameter $\gamma$ was optimized by minimizing the mean squared error (MSE) between the likelihood function indicated by the dashed black line and the measured likelihoods, resulting in $\gamma = 0.65$.

Fig. 7. ERP waves for frequent (red ball, red curves) and rare (blue ball, blue curves) events for certain (Lc, solid curves) and uncertain (Lu, dashed curves) likelihoods at electrodes C3, Cz, and C4 for the N250, at Fz, FCz, and Cz for the P3a, at Pz for the P3b, and at O1 and O2 for the SW. Time intervals for the search for maximum variance for the ERP components are highlighted in gray and the time point of maximum variance is marked by a dashed black line at all respective electrodes. Left hand panels: certain prior conditions (PcLc and PcLu). Right hand panels: uncertain prior conditions (PuLc and PuLu). The presence of a centrally distributed N250 wave in the latency range [200, 300] ms with maximum variance at 232 ms and of a prominent late positive complex is revealed. The late positive complex can be decomposed into three separable ERP components: a frontally distributed P3a in the latency range [300, 400] ms with maximum variance at 356 ms, a parietally distributed P3b in the latency range [300, 400] ms with maximum variance at 380 ms, and a posterior-positive Slow Wave (SW) in the latency range [400, 560] ms with maximum variance at 504 ms.

Model-based trial-by-trial analyses

This section describes the model-based results. It starts with Fig. 8, which presents the group log-Bayes factors ln(GBF) for the twelve most relevant electrodes and for discrete time bins for the winning probability distribution-surprise combinations. It is followed by Fig. 9, which shows the respective scalp maps. Next, Table 2 shows the exceedance probabilities of all tested probability distribution-surprise combinations for the late positive complex (and the N250) at the ERP-specific virtual electrodes and time windows which were determined as described in the Selection of ERP data for further analyses section. Table 3 generalizes the results to the comparison of the three (or four) model families. Finally, the relation between Bayesian surprise and the measured data at electrode FCz (which was chosen as it represents the center of the P3a region-of-interest electrodes) is shown for the BELSI distribution in Fig. 10.
Fig. 8 displays group log-Bayes factors $\ln(GBF_{i,NULL})$ with $i \in \{I_B(\mathrm{BELSI}), I_P(\mathrm{PRESI}), I_B(\mathrm{PRESI})\}$ of the BELSI and PRESI distributions versus a constant null model over time ($[-100, 600]$ ms around the eliciting event). The electrodes that are not displayed in Fig. 8 are mainly the marginal electrodes, at which we did not see anything of importance. Higher log-Bayes factors (red traces) represent better fits between our surprise regressors and the measured trial-by-trial ERP amplitude modulations. Bayesian updating and predictive surprise seem to provide accurate approximations to the actual data, with a fronto-central focus within the P3a latency range, a centro-parietal focus in the P3b latency range, and an occipito-parietal focus in the SW latency range.
Fig. 9 displays group log-Bayes factors in the form of topographic maps separately for the P3a, P3b, and SW latencies. Note that these maps do not show scalp distributions of measured data. Rather, they display the degree to which different kinds of surprise, as calculated based on the distributions, approximate measured trial-by-trial ERP data. The P3a maps show a circumscribed fronto-central focus, along with a left-occipital spot. For the P3b, a more posteriorly (centro-parietal) and broadly distributed focus appears. Finally, the fit between surprise and measured data is sharply confined to the occipito-parietal region with regard to the SW.
Table 2 displays the resulting exceedance probabilities for all tested distribution-surprise combinations separately for N250 (electrodes C3, Cz, C4), P3a (electrodes Fz, FCz, Cz), P3b (electrode Pz), and SW (electrodes O1, O2) latencies. Trial-by-trial amplitude variability for the P3a is clearly best accounted for by the BELSI distribution with Bayesian surprise (exceedance probability of 0.58). The PRESI distribution with predictive surprise shows the maximum for P3b amplitudes (0.67), while for SW amplitudes the highest exceedance probability (0.19) is shared by the PRESI distribution with postdictive surprise and the OST model with Bayesian surprise. Predictive surprise based on the DIF model shows the highest exceedance probability (0.68) for the N250 (see Appendix D). All exceedance probabilities for entropy calculated over any of the belief distributions remain negligible. These results imply that the BELSI distribution with Bayesian surprise provides superior predictions with regard to trial-by-trial P3a variability at fronto-central electrodes. In addition, the PRESI distribution with predictive surprise is superior with regard to the centro-parietal P3b, while there is no clearly superior observation model with regard to occipito-parietal SW amplitude variability. In order to get clearer results for the SW, we calculated the exceedance probabilities solely for the four models of the winning family $F_{SI}$. The family-specific analysis was inspired by Lieder et al. (2013). Thus, instead of asking which of all tested models best explains the data, we ask which model of a specific subset (family) is best. The resulting exceedance probabilities were 0.15 for $I_B(\mathrm{BELSI})$, 0.01 for $I_H(\mathrm{BELSI})$, 0.71 for $I_B(\mathrm{PRESI})$, and 0.13 for $I_P(\mathrm{PRESI})$, indicating superiority of the PRESI distribution with postdictive surprise.

Fig. 8. Degree to which Bayesian updating and predictive surprise based on the BELSI and PRESI distributions approximate measured trial-by-trial ERP data in group log-Bayes factors $\ln(GBF_{i,NULL})$ with $i \in \{I_B(\mathrm{BELSI}), I_P(\mathrm{PRESI}), I_B(\mathrm{PRESI})\}$ versus a constant model NULL over electrodes and time.

Fig. 9. Scalp maps of group log-Bayes factors for time intervals $t \in [336, 376]$ ms, $t \in [360, 400]$ ms, and $t \in [484, 524]$ ms for P3a, P3b, and SW latency ranges, respectively. Notice central foci within the P3a latency range and occipital-parietal foci in the P3b and SW latency ranges. The P3a maps show a circumscribed fronto-central focus along with a left-occipital spot. For the P3b, a more posteriorly (occipital-parietal) and broadly distributed focus appears. Finally, the fit between surprise and measured data is sharply confined to the left-occipital region with regard to the SW.
Table 3 shows family-level exceedance probabilities. For each component of the late positive complex, the model family based on the observer with probability weighting of the inference input ($F_{SI}$) is clearly favored, while the model family that was based on exponential forgetting ($F_{EF}$) has the highest exceedance probability for the N250.
Fig. 10 shows the correlation between Bayesian surprise, $I_B(n)$, that was obtained from the BELSI distribution (with hyper-parameter $\gamma = 0.65$) and cortical activations (trial-by-trial ERP amplitude modulations, measured at electrode FCz, here summarized through grand-average ERP waves) across the four experimental conditions C ∈ {PcLc, PcLu,
PuLc, PuLu} and across eight potential sequences of three successive ball stimuli (cf. panels (A)-(H)). Note that while Bayesian surprise is shown at each stage of the sequence on the left, the ERP waves, shown on the right, are in response to the third ball stimulus only.

Table 2
Exceedance probabilities for all tested distribution-surprise combinations over the interval $t \in [212, 252]$ ms for N250, over the interval $t \in [336, 376]$ ms for P3a, over the interval $t \in [360, 400]$ ms for P3b, and over the interval $t \in [484, 524]$ ms for SW. Maximum exceedance probabilities are marked with an asterisk.

                          ERP waves and electrodes
Surprise  Distribution    N250 (a)   P3a      P3b      SW
I_B       BEL             <0.01      <0.01    <0.01    <0.01
I_B       BELSI           0.01       0.58*    0.03     0.09
I_B       BELSO           0.03       0.01     <0.01    0.14
I_H       BEL             <0.01      0.02     0.01     0.04
I_H       BELSI           <0.01      0.07     <0.01    0.01
I_H       BELSO           <0.01      <0.01    <0.01    0.05
I_P       PRE             0.06       0.05     0.02     0.04
I_P       PRESI           0.04       0.03     0.67*    0.07
I_P       PRESO           0.02       <0.01    0.01     0.06
I_B       PRE             <0.01      <0.01    <0.01    <0.01
I_B       PRESI           0.02       0.02     0.02     0.19*
I_B       PRESO           <0.01      <0.01    <0.01    <0.01
I_B       OST (a)         0.15       0.02     0.04     0.19*
I_P       DIF (a)         0.68*      0.20     0.19     0.12

(a) Results for the N250 as well as for the OST and DIF models will be discussed in detail in Appendix D.

Table 3
Family-level exceedance probabilities for the distributions based on the observer without weighting (F_LI), with weighting of the inference input (F_SI), weighting of the inference output (F_SO), and models based on exponential forgetting (F_EF). Maximum exceedance probabilities are marked with an asterisk.

Family    N250     P3a      P3b      SW
F_LI      0.03     0.07     0.02     0.07
F_SI      0.06     0.81*    0.75*    0.41*
F_SO      0.04     0.01     0.01     0.27
F_EF      0.87*    0.11     0.22     0.25
A close correlation between Bayesian surprise and cortical activations is revealed by a comparison between the various values of Bayesian surprise for the third trial, $I_B(n = 3)$, and the corresponding ERP measures. Specifically, gradually increasing ERP wave amplitudes are associated with successive increases in Bayesian surprise, $I_B(n = 3)$ (compare (A) vs. (C) vs. (E) vs. (G) and (B) vs. (D) vs. (F) vs. (H), respectively). Further, the left panels show that Bayesian surprise is mainly grouped by likelihood (Lc (solid curves) vs. Lu (dashed curves)). In order to show this effect in the ERP data, the waves for certain (Lc, solid curves) and uncertain (Lu, dashed curves) likelihood have been averaged separately, regardless of prior probabilities. The ERP waves also seem to reflect the degree to which Bayesian surprise $I_B(n = 3)$ under Lc conditions (solid curves in left panels and single solid curve in right panels) surpasses $I_B(n = 3)$ under Lu conditions (dashed curves in left panels and single dashed curve in right panels).
As a measure of absolute fit, we computed the fraction of variance explained by the winning distribution-surprise combinations for each component of the late positive complex (P3a, P3b, and SW) and the N250. We report mean values across participants as well as minimum and maximum individual values. For the P3a, 1.4% of the variance was explained (minimum 0%, maximum 6.2%). For the P3b, 2.6% of the variance was explained (minimum 0%, maximum 7.5%). For the SW, 2.4% of the variance was explained (minimum 0%, maximum 8.4%). For the N250, 1.8% of the variance was explained (minimum 0%, maximum 15.2%).

Fig. 10. Relationships between Bayesian surprise ($I_B(\mathrm{BELSI})$ with hyper-parameter $\gamma = 0.65$) and the measured data across sequences of observed events. Left panels: Bayesian surprise $I_B(n) = D_{KL}(P_U(n-1) \,\|\, P_U(n))$ (Eq. (16)) over trials n = 1, 2, 3 for all probability conditions PcLc (diamond-marked solid curve), PcLu (triangle-marked dashed curve), PuLc (inverted triangle-marked solid curve), and PuLu (square-marked dashed curve). The sequence of observed events is shown below each figure with a red ball denoting a frequent and a blue ball a rare event. A clear likelihood effect is visible (solid vs. dashed curves). Right panels: In order to show this effect in the ERP data, the waves for certain (Lc, solid curves) and uncertain (Lu, dashed curves) likelihood have been averaged separately yet regardless of prior probabilities. The resulting grand-average ERP waves are shown for the third observation of a sequence o(n = 3) at electrode FCz. Gradually increasing ERP wave amplitudes are associated with successive increases in Bayesian surprise $I_B(n = 3)$. Mean reaction times (after averaging across individual median reaction times) are marked by vertical dashed black lines.
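A fraction of variance explained can be illustrated for the simplest case of one surprise regressor and an intercept fitted by ordinary least squares. This is an assumption made for illustration only: the study estimated its models with the hierarchical PEB scheme, and the exact computation behind the reported percentages is not given here. The function name and the example values are hypothetical.

```python
def fraction_of_variance_explained(x, y):
    """R^2 of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy


# Hypothetical single-trial values: surprise per trial and an amplitude that
# follows surprise weakly and is dominated by trial-to-trial noise.
surprise = [0.1, 0.5, 0.2, 0.9, 0.4, 0.7, 0.3, 0.8]
noise = [6.0, -5.0, -7.0, 4.0, 8.0, -6.0, 3.0, -3.0]
amplitude = [2.0 * s + e for s, e in zip(surprise, noise)]

r2_noisy = fraction_of_variance_explained(surprise, amplitude)
r2_exact = fraction_of_variance_explained(surprise, [2.0 * s + 1.0 for s in surprise])
```

Single-trial EEG amplitudes are dominated by activity unrelated to the regressor, so small fractions of explained variance are compatible with clear differences between models in terms of log-evidence.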
Discussion
This study explored neural correlates of Bayesian inference by combining an urn-ball paradigm (Fig. 1) with computational modeling of trial-by-trial electrophysiological signals. Our approach led to the discovery that dissociable cortical signals seem to code and compute distinguishable aspects of Bayes-optimal probabilistic inference. Thus, we isolated discrete ERP components which could be dissociated with regard to their putative function in accomplishing Bayesian inference (cf. Figs. 8, 9; Table 2). Specifically, we found the late positive complex spatially, temporally and functionally decomposable into three separable ERP components (see also Dien et al., 2004): (1) Bayesian surprise yielded superior approximations of activation changes in anteriorly distributed P3a waves at relatively short latency (Kopp and Lange, 2013). (2) Postdictive surprise best explains posteriorly distributed SW amplitudes at the longest latency. (3) Predictive surprise outperformed Bayesian updating with regard to activation changes in parietally distributed P3b waves at intermediate latency (Kolossa et al., 2012). Taken together, these results are consistent with the Bayesian brain hypothesis insofar as dissociable cortical activities seem to code and compute various aspects of Bayesian inference.
Bayesian updating generally reflects the Kullback-Leibler divergence between two probability distributions as defined in Eq. (15), but further differentiation is necessary in order to minimize potential misunderstandings (Fiorillo, 2012): Bayesian surprise represents the change in beliefs over hidden states given new observations, which equals the Kullback-Leibler divergence between $P_U(n-1)$ and $P_U(n)$ (see Eq. (16)). In contrast, postdictive surprise represents the change in predictions over future events given new observations and equals the Kullback-Leibler divergence between $P_K(n)$ and $P_K(n+1)$ (see Eq. (17)). Predictive surprise is simply the surprise over the current observation under its degree of prediction (see Eq. (18)). Our data imply that Bayesian surprise is related to trial-by-trial P3a amplitude variability, postdictive surprise suitably models trial-by-trial SW amplitude variability, and predictive surprise best predicts trial-by-trial P3b amplitude variability.
The hyper-parameter $\gamma$ was fitted by minimizing the mean squared error to approximate optimal decision behavior. This approach provided $\gamma = 0.65$, with $\gamma < 1$ being associated with inverse S-shaped probability weighting (Fig. 4). A Bayesian observer with $\gamma = 0.65$ was compared with an otherwise equivalent observer without probability weighting ($\gamma = 1$). We found that the observer with input probability weighting outperformed the unweighted observer when explaining observed ERPs (Tables 2 and 3). These findings seem to demonstrate a ubiquitous role of probability weighting in probabilistic inference (Kahneman and Tversky, 1979; Tversky and Kahneman, 1992; Fox and Poldrack, 2009). With regard to non-linear probability weighting, we have taken our lead from the (neuro-)economics literature. The alternative possibility that nonlinearity might lie at the level of mapping from probability distributions to electrophysiological responses, such that the electrophysiological responses may be a nonlinear function of the neuronal representation of unweighted probabilities, did not receive support.
The primary effect of inverse S-shaped probability weighting (with $\gamma < 1$) on Bayesian updating is to increase uncertainty. Inspection of Fig. 5 reveals that all posterior probabilities based on an observer with probability weighting lie between the corresponding posterior probabilities based on an observer without probability weighting and P = 0.5, which equals the point of maximum uncertainty. Probability weighting might constitute one of the reasons why empirical support for the Bayesian brain hypothesis (Knill and Pouget, 2004; Friston, 2005; Doya et al., 2007; Gold and Shadlen, 2007; Kopp, 2008) has apparently been so difficult to obtain in former studies. Notice that earlier attempts to identify brain areas that weight probabilities did not lead to converging results; yet, common denominators of potential areas seem to lie within fronto-striatal loops (Trepel et al., 2005; Preuschoff et al., 2006; Tobler et al., 2008; Hsu et al., 2009; Takahashi et al., 2010; Wu et al., 2011; Berns and Bell, 2012), within the parietal cortex (Berns et al., 2008), and/or within the anterior insula (Preuschoff et al., 2008; Bossaerts, 2010; Mohr et al., 2010). Alternatively, (inverse) S-shaped probability weighting might constitute an emergent feature of processing probabilistic information by neurons (Gold and Shadlen, 2007; Yang and Shadlen, 2007; Soltani and Wang, 2010; Pouget et al., 2013).
Based on our findings, we suggest the probabilistic reasoning (PR) model of the Bayesian brain that basically reflects the tri-partitioned late positive complex (Fig. 11). In short, as shown in Fig. 11, the PR model posits the existence of a Bayesian reasoning unit (BRU) that interacts, in a reciprocal manner, with cognitive systems that process incoming environmental information (Haykin and Fuster, 2014). Further, the PR model conjectures that the BRU is capable of Bayes-optimal updating: Firstly, it computes posterior distributions that take the prior and observation into account (belief updating, related to trial-by-trial P3a amplitude variations (Eq. (10))). Secondly, prediction distributions for future observations are computed from posterior distributions (prediction updating, related to trial-by-trial SW amplitude variations (Eq. (12))).
On its output branch, the emergent BRU predictions exert control over pre-adaptive biases on cognitive processing (Fuster, 2014), whereas BRU belief updating is based on the incoming emergent observation. Notice further that predictive surprise (related to trial-by-trial P3b amplitude variations) can be thought of as the magnitude of prediction errors induced by pre-adaptively biased processing within the cognitive processing stream. Predictive surprise could also be considered as the evolution of a decision variable, i.e., as the accumulation of evidence from bias levels to a decision threshold (Kopp, 2008; O'Connell et al., 2012; Kelly and O'Connell, 2013). Further, we leave it open whether the P3a reflects the proper updating of beliefs, or an obligatory attentional process that forms part of belief updating (i.e., an orienting response; Friedman et al., 2001; Barry and Rushby, 2006; Kopp and Lange, 2013).
Against the background that the P3a originates from prefrontal cortical regions while the P3b is generated in temporal/parietal regions (Polich, 2007), our results suggest how a network of brain areas may give rise to Bayesian inference. Specifically, while belief updating and prediction updating seem to be computed in prefrontal cortical regions (Lee et al., 2007), predictive surprise seems to originate from regions located in posterior association cortices of the visuomotor pathway (Summerfield and Koechlin, 2008; de Lange et al., 2010; d'Acremont et al., 2013). The occipital scalp topography of the SW needs a short comment. One plausible possibility is that the SW reflects the setting and updating of pre-adaptive biases. Kok and colleagues recently found that perceptual predictions trigger the formation of specific stimulus templates in primary visual cortex to efficiently process sensory inputs (Kok et al., 2014). Given that we sampled EEG data from merely twenty channels, we cannot localize the underlying neural architecture of the Bayesian observer with sufficient precision; thus, further research is required in order to move forward from a sensor space analysis to a source space analysis of the Bayesian observer.
Notice that our probabilistic reasoning model of the late positive complex can be regarded as a computational advancement of the most widely renowned and respected conceptual theory in the P3 field, i.e., the so-called context updating model (Donchin, 1981; Donchin and Coles, 1988). In short, this model postulates that the P3 is evoked in the service of meta-cognitive processes that are concerned with maintaining a proper representation of the environment, such as the mapping of probabilities on the environment, the deployment of attention, or the setting of priorities and biases.
Fig. 11. An outline of our probabilistic reasoning (PR) model of the tri-partitioned late positive complex. (A) A conceptual outline of the PR model. The model posits the existence of a Bayesian reasoning unit (BRU) that interacts with cognitive systems that process incoming environmental information (Fuster, 2014; Haykin and Fuster, 2014). The BRU computes, retains and updates two distinguishable probability distributions, one over the hidden state (beliefs; lighter gray color) and another one over the observable events (predictions; darker gray color). The PR model conjectures that belief updating (Bayesian surprise, Eq. (16)) and prediction updating (postdictive surprise, Eq. (17)) are associated with trial-by-trial P3a and SW amplitude variations, respectively. The emergent BRU predictions set pre-adaptive biases on perceptual decisions, whereas BRU belief updating is based on the observation that emerges from these decisions. Notice further that predictive surprise (Eq. (18), related to trial-by-trial P3b amplitude variations) can be thought of as the magnitude of prediction errors induced by unpredicted or surprising observations. (B) A more formal outline of the PR model, in particular of the computational fine structure of the BRU. Units of time (n − 1, n) separate the dynamic evolution of beliefs over states (BELSI, Eq. (10)) that obeys Bayes' theorem (lighter gray color). Units of time (n, n + 1) also separate the dynamic evolution of predictions over observations (PRESI, Eq. (12)) as prescribed by Bayes' theorem (darker gray color).
Our urn-ball task was specifically designed to examine the neural bases of Bayesian inference. Our electrophysiological findings suggest that the brain acts as a Bayesian observer, i.e., that it might adjust probabilistic internal states, which entail beliefs about hidden states in the environment, in a probabilistic generative model of sensory data. This generative model enables inference, and this framework provides an abstract explanation of adaptive cognition and behavior, which has been instantiated in schemes like the free-energy principle (Friston, 2010; Lieder et al., 2013). However, it is important that the generative model also permits predictions of future events, and this finding provides an abstract explanation of pre-adaptive cognition and behavior (see Fig. 11; Fuster, 2014).
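As a hedged illustration of the quantities named in Fig. 11, the following Python sketch updates a belief distribution over two hidden urns after each observed ball and derives three surprise measures from it. The urn compositions, the uniform prior, and all function names are illustrative assumptions rather than the task parameters or the observer models fitted in this article. Bayesian and postdictive surprise are taken here as Kullback-Leibler divergences between consecutive belief and prediction distributions (prior in the first argument), and predictive surprise as the negative log probability of the observation.

```python
import math

# Hypothetical likelihoods P(ball colour | urn); NOT the values used in the study.
LIKELIHOOD = [
    [0.7, 0.3],  # urn 0: P(colour 0), P(colour 1)
    [0.3, 0.7],  # urn 1
]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def predict(belief):
    """Prediction over observations: sum_u P(o = k | u) * belief[u]."""
    n_obs = len(LIKELIHOOD[0])
    return [sum(LIKELIHOOD[u][k] * belief[u] for u in range(len(belief)))
            for k in range(n_obs)]

def update(belief, obs):
    """Posterior over urns after observing ball colour `obs` (Bayes' theorem)."""
    joint = [LIKELIHOOD[u][obs] * belief[u] for u in range(len(belief))]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def run_episode(observations, prior=(0.5, 0.5)):
    """Return the final belief and per-trial (predictive, Bayesian, postdictive) surprise."""
    belief = list(prior)
    rows = []
    for obs in observations:
        prediction = predict(belief)
        predictive = -math.log(prediction[obs])            # surprise about the observation
        posterior = update(belief, obs)
        bayesian = kl(belief, posterior)                   # change in beliefs over urns
        postdictive = kl(prediction, predict(posterior))   # change in predictions
        rows.append((predictive, bayesian, postdictive))
        belief = posterior
    return belief, rows

belief, rows = run_episode([0, 0, 1, 0])
```

Each tuple in `rows` holds the three surprise values for one trial, so the sketch makes explicit that the measures are computed from the same observation sequence but from different distributions.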
But why did those neuronal connections evolve that are required for maintaining those computationally expensive generative models? According to the free-energy principle (which instantiates the Bayesian brain hypothesis), the driving force is the minimization of average predictive surprise (Friston, 2010), and this minimization function is based on Bayes-optimal probabilistic inference, putatively modulated via the hyper-parameter. This line of reasoning suggests that meta-Bayesian learning of a policy for setting the values of this hyper-parameter might occur in response to experienced predictive surprise values, thereby shifting the Bayesian observer towards higher or lower levels of uncertainty (see above), perhaps by optimizing synaptic gain. However, we do not yet have direct evidence for this proposition; thus, further research is required on the issue.
Future research might benefit from these discoveries in multiple ways. First, the urn-ball task provides a valuable approach towards a computational analysis of the Bayesian brain. Second, trial-by-trial variation in ERP responses in late positive complex latency ranges seems to be a valuable target for computational models of the neural bases of Bayesian inference, despite the fact that the fractions of variance explained remained relatively small due to the disadvantageous signal-to-noise ratio of these measures. Third, it seems to be important to use non-linearly weighted probabilities in attempts to model the neural bases of Bayesian inference. Fourth, future studies might envisage fine-grained variations of the hyper-parameter (such as individual policies with regard to this parameter, dynamic intra-individual adaptation of its value, etc.).
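To make the point about non-linearly weighted probabilities concrete, one widely used family of weighting functions is the linear-in-log-odds form discussed by Zhang and Maloney (2012). The sketch below is illustrative only: the function name `weight`, the parameter names `shape` and `crossover`, and the example values are assumptions and do not reproduce the hyper-parameter values fitted in this study.

```python
import math

def log_odds(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def weight(p, shape, crossover=0.5):
    """Linear-in-log-odds weighting.

    Lo(w(p)) = shape * Lo(p) + (1 - shape) * Lo(crossover).
    shape = 1 leaves p unchanged; shape < 1 gives an inverse S-shaped curve
    that overweights small and underweights large probabilities.
    """
    if p <= 0.0 or p >= 1.0:
        return p  # the endpoints 0 and 1 are left unchanged
    z = shape * log_odds(p) + (1.0 - shape) * log_odds(crossover)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical shape values, chosen only to show the curvature.
for shape in (1.0, 0.7, 0.5):
    print(shape, [round(weight(p, shape), 3) for p in (0.1, 0.5, 0.9)])
```

Sweeping `shape` over a grid and re-running a model comparison at every value is one way such a hyper-parameter could be explored per participant.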
Our results open a new window onto neural probabilistic inference by isolating discrete cortical signatures of Bayesian updating and predictive surprise in the human brain. These signals could be continuously monitored with a minimum of signal processing and with sufficient temporal resolution to allow their individual dynamics to be observed during Bayesian inference. To our knowledge, this has not been reported before. Our results suggest that the brain learns probabilistic parameters in a generative model of the environment, weighting prior beliefs and new observations proportionately (Vilares and Körding, 2011). The availability of this generative model makes the brain a pre-adaptive system (Fuster, 2014), lending it the capability to act proactively on prospective events.
In sum, our results suggest how a network of brain areas may allow for plausible probabilistic reasoning (Jaynes, 1988; Dayan et al., 1995; Rangel et al., 2008; Bach and Dolan, 2012; Summerfield and Tsetsos, 2012). We do not claim that the Bayesian calculus is consciously available to and reportable by the observers who perform an urn-ball task. However, our findings shed new light on Pierre-Simon Laplace's opinion that probability theory is "nothing but good common sense reduced to mathematics. It provides an exact appreciation of what sound minds feel with a kind of instinct, frequently without being able to account for it." (Laplace cited after McGrayne, 2011, p. 33).
Acknowledgments
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) via a Future Fund of Technische Universität Braunschweig (725 508 02). Thanks are due to Dirk Ostwald, Max-Planck-Institute for Human Development, Berlin, Germany, for providing a software implementation of his model.
Appendix A. Fixed-effects results
Choosing between fixed-effects and random-effects Bayesian model comparisons depends on assumptions about how consistently the models provide descriptions of the processes as they occur in individual participants (Stephan et al., 2007, 2009; Penny et al., 2010). Optimal probabilistic inference can be viewed as a very basic phenomenon that holds universally. Alternatively, it can be seen as a cognitive task which can be performed by applying different strategies that might differ across individual participants. An investigation of this issue lies beyond the scope of this article. Thus, we additionally provide the results for fixed-effects Bayesian model comparisons in Tables A.1 and A.2. Notice that the posterior model probabilities mirror the tendencies apparent in the exceedance probabilities, but they are all > 0.99 except for the SW, where 0.95 emerges in favor of postdictive surprise that was obtained from the PRESI distribution.
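For orientation, a fixed-effects comparison pools evidence by summing each model's log-evidence over participants and then normalizing under a prior over models. The following is a minimal sketch assuming a uniform model prior; the function name and the toy log-evidence values are hypothetical and are not taken from Tables A.1 and A.2.

```python
import math

def fixed_effects_posterior(log_evidence):
    """Posterior model probabilities under a uniform model prior.

    log_evidence[m][s] is the log model evidence of model m for participant s.
    """
    # Group log-evidence: participants are treated as independent data sets.
    group = [sum(per_subject) for per_subject in log_evidence]
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    peak = max(group)
    unnormalized = [math.exp(g - peak) for g in group]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Toy values for two models and three participants (illustrative only).
toy = [[-100.0, -98.0, -101.0],
       [-99.0, -97.5, -100.0]]
posterior = fixed_effects_posterior(toy)
```

Because the log-evidences add up across participants, even modest but consistent per-participant differences drive the posterior probabilities towards 0 or 1, which is in line with the near-extreme values reported for the fixed-effects analysis.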
Appendix B. Ostwald's model (OST)
Ostwald et al. (2012) propose a model of surprise for $k \in \{1, K = 2\}$ observable events with exponential forgetting of past observations. The OST model does not estimate the discrete event probabilities $P_k(n)$ directly, but treats the event probability $\mu = P(o(n) = k) \in [0, 1]$ itself as a random variable. The probability density function of a beta distribution over $\mu$ is parameterized via event counters $c_k(n)$:
$$p(\mu \mid c_1(n), c_2(n)) = \frac{\Gamma(c_1(n) + c_2(n))}{\Gamma(c_1(n))\,\Gamma(c_2(n))}\, \mu^{c_1(n)-1} (1-\mu)^{c_2(n)-1}, \tag{B.1}$$
with $\Gamma$ being the gamma function. The event counters are updated on a trial-by-trial basis according to
$$c_k(n) = \sum_{\nu=0}^{n} e^{-\frac{1}{\tau}(n-\nu)}\, \tilde{c}_k(\nu), \quad \text{with } k \in \{1, 2\}, \tag{B.2}$$
with parameter $\tau$ controlling memory length and
$$\tilde{c}_k(n) = \sum_{\nu=0}^{n} \tilde{g}_k(\nu), \tag{B.3}$$
with
$$\tilde{g}_k(n) = \begin{cases} 1, & \text{if } n = 0 \text{ (uniform beta prior distribution)} \\ 1, & \text{if } n > 0 \text{ and } o(n) = k \\ 0, & \text{otherwise} \end{cases} \tag{B.4}$$
counting the number of occurrences of event $k$ until trial $n$. Parameter $\tau$ is set to 2.6 as in the BS3 model in Ostwald et al. (2012).
Bayesian surprise $I_B(n)$ is then calculated as the Kullback–Leibler divergence between the prior $p(\mu) = p(\mu \mid c_1(n-1), c_2(n-1))$ and posterior $p(\mu \mid o) = p(\mu \mid c_1(n), c_2(n))$ probability density functions (OST)
$$I_B(n) = \int p(\mu \mid c_1(n-1), c_2(n-1)) \,\ln \frac{p(\mu \mid c_1(n-1), c_2(n-1))}{p(\mu \mid c_1(n), c_2(n))}\, d\mu. \tag{B.5}$$
Appendix C. Digital filtering model (DIF)
The digital filtering model (DIF) (Kolossa et al., 2012) estimates
$$P_k(n) = P(o(n) = k \mid o_1^{n-1}) \tag{C.1}$$
as the (subjective) probability of event $k$ being observed at trial $n$, after a sequence $o_1^{n-1} = (o(1), o(2), \ldots, o(n-1))$ of $n - 1$ former observations. It consists of three digital filters whose outputs are summed to yield $P_k(n)$ according to
$$P_k(n) = \alpha_{\text{long}}\, c_{\text{long},k}(n) + \alpha_{\text{short}}\, c_{\text{short},k}(n) + \alpha_{\Delta}\, c_{\Delta,k}(n) + \frac{1}{C}. \tag{C.2}$$
The terms $c_{\text{long},k}(n)$ and $c_{\text{short},k}(n)$ model long- and short-term memory as count functions with exponential forgetting according to
$$c_{i,k}(n) = \frac{1}{C_i} \sum_{\nu=-\infty}^{n-1} e^{-\frac{1}{\beta_i}(n-\nu)}\, g_k(\nu), \tag{C.3}$$
with $i \in \{\text{short}, \text{long}\}$, and $\beta_i$ as time constants controlling the memory length, and
$$g_k(n) = \begin{cases} \frac{1}{K}, & \text{if } n \le 0 \text{ (uniform initial prior)} \\ 1, & \text{if } n > 0 \text{ and } o(n) = k \\ 0, & \text{otherwise} \end{cases} \tag{C.4}$$
being the digital filter input. Term $c_{\Delta,k}(n)$ captures alternation expectation. The weighting parameters $\alpha_{\text{long}}$, $\alpha_{\text{short}}$, $\alpha_{\Delta}$, and constants $\frac{1}{C}$ and $\frac{1}{C_i}$ guarantee normalized probabilities $P_k(n) \in [0, 1]$. The DIF model was used to calculate predictive surprise $I_P(n)$ and the parameters were set to the optimized parameters as found in Kolossa et al. (2012).
Appendix D. Additional results
The N2–P3 complex has long been considered an index of the operation of adaptive brain systems that make it possible to anticipate the occurrence of environmental events and to react to unexpected discrepancies (Hillyard and Picton, 1987). Several varieties of the fronto-centrally distributed N2 have been reported in Folstein and Van Petten (2008). Towey et al. (1980) showed an increase in N250 latency with increased difficulty in discriminating target from nontarget auditory stimuli, and they consequently associated N250 latency with decision latency. This knowledge warrants attention to the N250, beyond that to the late positive complex, when examining Bayesian inference.
Fig. D.1 shows the degree to which surprise as calculated by the OST and DIF models approximates measured trial-by-trial ERP data in log-Bayes factors. Fig. D.2 shows log-Bayes factors in the form of topographic maps for the N250 latency range. For the N250, a central, slightly right-lateralized spot becomes apparent. Activation changes of the fronto-centrally distributed N250 (Towey et al., 1980) could be best accounted for by the DIF model (Kolossa et al., 2012). The DIF model achieves the maximum exceedance probability for the N250 (0.68), indicating that this model provides a superior approximation of the measured N250 data when compared with the remaining models (Table 2).
The DIF model basically rests on counting observed events with short-term and long-term exponential forgetting rates. Thus, this model envisages probabilistic inference on the urn-ball task to be based upon memory for the frequency of occurrence of observable events. Ostwald et al. (2012) suggested a similar model which treats the event probability itself as a (hidden) random variable, but the present results show that the DIF model approximates N250 changes better than Ostwald et al.'s model does (cf. Fig. D.1(B) vs. Fig. D.1(A); Table 2).
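The contrast between the two memory-based models can be illustrated with a deliberately simplified sketch. The first function estimates event probabilities from exponentially forgotten event counts and is only loosely inspired by the DIF model (a single time constant, no alternation term). The second computes the Kullback-Leibler divergence between two beta densities by numerical integration, as one would for a surprise measure defined over a hidden event probability. The time constant, the pseudo-count, the grid size, and all names are assumptions chosen for illustration, not the published parameter values.

```python
import math

def forgetting_probabilities(observations, n_events=2, time_constant=3.0, pseudo=1.0):
    """Return the estimated P_k(n) before each trial from exponentially decayed counts."""
    decay = math.exp(-1.0 / time_constant)
    counts = [pseudo] * n_events               # uniform initial prior (assumed pseudo-count)
    history = []
    for obs in observations:
        total = sum(counts)
        history.append([c / total for c in counts])
        counts = [decay * c for c in counts]   # forget older evidence
        counts[obs] += 1.0                     # count the event just observed
    return history

def predictive_surprise(observations, **kwargs):
    """Negative log probability of each observation under the forgetting model."""
    probs = forgetting_probabilities(observations, **kwargs)
    return [-math.log(p[o]) for p, o in zip(probs, observations)]

def beta_log_pdf(x, a, b):
    """Log density of a Beta(a, b) distribution at 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)

def beta_kl(a0, b0, a1, b1, steps=20000):
    """KL( Beta(a0, b0) || Beta(a1, b1) ) via the midpoint rule on (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                      # midpoints never touch 0 or 1
        lp = beta_log_pdf(x, a0, b0)
        lq = beta_log_pdf(x, a1, b1)
        total += math.exp(lp) * (lp - lq) * h
    return total

surprise = predictive_surprise([0, 0, 1, 0])
update_size = beta_kl(1.0, 1.0, 2.0, 1.0)      # uniform prior, one event counted
```

The first measure depends only on remembered event frequencies, whereas the second quantifies how much a distribution over the hidden event probability changes, which is the distinction drawn in the text between the two models.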
Taken together, our data suggest a dissociation between fast,
memory-based (N250) and slow, model-based (late positive complex)
forms of surprise in the brain. The distinction between memory-based
and model-based surprise may be viewed in the context of more general distinctions, such as the one between habitual and goal-directed control over behavior (Dolan and Dayan, 2013).
Appendix E. Detailed specification of the PEB design matrices
The log-Bayes factors for Figs. 8 and 9 were generated by repeating the model comparison for all electrodes and discrete time indexes $t \in [-100, 600]$ ms with a sampling period $T_s = 4$ ms and $t = 0$ ms being the time of stimulus presentation. In order to provide the evidences for the calculation of the exceedance probabilities (Tables 2 and 3) and posterior model probabilities (Tables A.1 and A.2), the time index was set to $t \in t_{i,\max}$, with $i \in \{\text{N250}, \text{P3a}, \text{P3b}, \text{SW}\}$, and the respective averaged electrodes were used. To keep the presentation simple, the electrodes are not explicitly expressed by an additional subscript.
We used a two-level hierarchical model of the form
$$\begin{aligned} Y_{\ell,t}^{C} &= X_{\ell}^{C,(1)}\, \theta_{\ell,t}^{C,(1)} + E_{\ell,t}^{C,(1)} \\ \theta_{\ell,t}^{C,(1)} &= X^{(2)}\, \theta_{\ell,t}^{C,(2)} + E_{\ell,t}^{C,(2)}, \end{aligned} \tag{E.1}$$
with $\ell \in \{1, \ldots, L = 16\}$ denoting the individual participants, and $C \in \{\text{PcLc}, \text{PcLu}, \text{PuLc}, \text{PuLu}\}$ the probability condition. The first level design matrix $X_{\ell}^{C,(1)}$ contains the model-specific surprise values $I_{\ell,b}^{C}(n)$ as regressors for all episodes of sampling $b \in \{1, \ldots, B = 50\}$ within one probability condition, with $n \in \{1, \ldots, N = 4\}$ being the discrete index of consecutive trials within one episode of sampling. It consists of one sub-matrix for each episode of sampling $X_{\ell,b}^{C,(1)} = [I_{\ell,b}^{C}(n = 1), \ldots, I_{\ell,b}^{C}(n = N)] \in \mathbb{R}^{1 \times N}$, giving it the form $X_{\ell}^{C,(1)} = [X_{\ell,b=1}^{C,(1)}, \ldots, X_{\ell,b=B}^{C,(1)}]^{T} \in \mathbb{R}^{BN \times 1}$ with $[\cdot]^{T}$ being the transpose. The specific value of $BN$ varies according to the number of trials in which the participant responded correctly, as only these trials were included for evaluation. Vector $Y_{\ell,t}^{C} = [Y_{\ell,b=1,t}^{C}, \ldots, Y_{\ell,b=B,t}^{C}]^{T} \in \mathbb{R}^{BN \times 1}$, with $Y_{\ell,b,t}^{C} = [Y_{\ell,b,t}^{C}(n = 1), \ldots, Y_{\ell,b,t}^{C}(n = N)] \in \mathbb{R}^{1 \times N}$, contains the smoothed measured voltages for time instant $t$ across trials $n$ and episodes of sampling $b$, with $Y_{\ell,b,t}^{C}(n) \triangleq Y_{\ell,b,tT_s}(n)$. Both $X_{\ell}^{C,(1)}$ and $Y_{\ell,t}^{C}$ have been normalized to zero-mean and unit-variance. The second level design matrix is set to be scalar $X^{(2)} = 0$.
The first level of the general linear model calculates the normalized measured trial-by-trial single-time index voltages $Y_{\ell,b,t}^{C}(n)$ as a function of $I_{\ell,b}^{C}(n)$, with the linear parameter $\theta_{\ell,t}^{C,(1)}$, and an error $\varepsilon_{\ell,b,t}^{C,(1)}(n)$ according to:
$$Y_{\ell,b,t}^{C}(n) = \theta_{\ell,t}^{C,(1)}\, I_{\ell,b}^{C}(n) + \varepsilon_{\ell,b,t}^{C,(1)}(n). \tag{E.2}$$
The second level serves as a prior on the parameter $\theta_{\ell,t}^{C,(1)}$. All errors are assumed to be normally distributed with $E_{\ell,t}^{C,(1)} \sim \mathcal{N}(0, \Sigma_{\varepsilon,\ell,t}^{C,(1)})$ and $E_{\ell,t}^{C,(2)} \sim \mathcal{N}(0, \Sigma_{\varepsilon,\ell,t}^{C,(2)})$. The covariance is parameterized following $\Sigma_{\varepsilon,\ell,t}^{C,(1)} = \lambda_{\ell,t}^{C,(1)} I_{BN}$ and $\Sigma_{\varepsilon,\ell,t}^{C,(2)} = \lambda_{\ell,t}^{C,(2)} I_{1}$, with $I_{BN} \in \mathbb{R}^{BN \times BN}$ being an identity matrix. The hyperparameters $\lambda_{\ell,t}^{C,(1)}$ and $\lambda_{\ell,t}^{C,(2)}$ are the free parameters of the hierarchical linear model and are estimated using an EM algorithm for maximum likelihood estimation. The common reference null-model $M_{\text{NULL}}$ has a non-normalized first level design matrix $X_{\ell}^{C,(1)} = [1, \ldots, 1]^{T} \in \mathbb{R}^{BN \times 1}$.
Table A.2
Family-level posterior model probabilities for the distributions based on the observer without weighting (F_LI), with weighting of the inference input (F_SI), weighting of the inference output (F_SO), and models based on exponential forgetting (F_EF). Maximum posterior model probabilities are marked with an asterisk.
Family   N250     P3a      P3b      SW
F_LI     <0.01    <0.01    <0.01    <0.01
F_SI     <0.01    >0.99*   >0.99*   >0.99*
F_SO     <0.01    <0.01    <0.01    <0.01
F_EF     >0.99*   <0.01    <0.01    <0.01
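The first-level model of Appendix E regresses normalized single-trial voltages on model-specific surprise values. As a rough, self-contained illustration, the sketch below fits that single regressor by ordinary least squares and scores it against a constant-regressor null model with a BIC-style approximation to the log-Bayes factor. This is explicitly a swapped-in simplification: it is not the parametric empirical Bayes scheme with EM-estimated hyperparameters described above, and the function names and toy numbers are hypothetical.

```python
import math

def zscore(values):
    """Normalize to zero mean and unit variance (fails for constant input)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def first_level_fit(surprise, voltage):
    """Least-squares slope of voltage on surprise and an approximate log-Bayes factor.

    Both the surprise model and the constant null model have one regression
    parameter, so the BIC complexity penalties cancel and the approximation
    reduces to (n / 2) * ln(RSS_null / RSS_model). It can therefore never
    favour the null model, unlike a full evidence computation.
    """
    x = zscore(surprise)
    y = zscore(voltage)
    n = len(y)
    theta = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    rss_model = sum((yi - theta * xi) ** 2 for xi, yi in zip(x, y))
    rss_null = sum(yi * yi for yi in y)        # a constant fits the (zero) mean
    log_bf = 0.5 * n * math.log(rss_null / rss_model)
    return theta, log_bf

# Toy single-trial data (illustrative only, not measured ERP amplitudes).
toy_surprise = [0.2, 1.5, 0.7, 2.0, 0.4, 1.1]
toy_voltage = [1.0, 3.2, 1.9, 3.5, 1.8, 2.4]
theta, log_bf = first_level_fit(toy_surprise, toy_voltage)
```

With both variables normalized, the fitted slope equals the correlation between surprise and voltage, which gives a quick plausibility check before running a full hierarchical analysis.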
Fig. D.1. Degree to which surprise as calculated by the DIF and OST models approximates
measured trial-by-trial ERP data as group log-Bayes factors of the models versus a constant
null model over electrodes and time.
Table A.1
Posterior model probabilities for all tested distribution–surprise combinations over the interval t ∈ [212, 252] ms for N250, over the interval t ∈ [336, 376] ms for P3a, over the interval t ∈ [360, 400] ms for P3b, and over the interval t ∈ [484, 524] ms for SW. Maximum posterior model probabilities are marked with an asterisk.
                         ERP waves
Surprise  Distribution   N250     P3a      P3b      SW
I_B       BEL            <0.01    <0.01    <0.01    <0.01
I_B       BELSI          <0.01    >0.99*   <0.01    <0.01
I_B       BELSO          <0.01    <0.01    <0.01    <0.01
I_H       BEL            <0.01    <0.01    <0.01    <0.01
I_H       BELSI          <0.01    <0.01    <0.01    <0.01
I_H       BELSO          <0.01    <0.01    <0.01    <0.01
I_P       PRE            <0.01    <0.01    <0.01    <0.01
I_P       PRESI          <0.01    <0.01    >0.99*   0.05
I_P       PRESO          <0.01    <0.01    <0.01    <0.01
I_B       PRE            <0.01    <0.01    <0.01    <0.01
I_B       PRESI          <0.01    <0.01    <0.01    0.95*
I_B       PRESO          <0.01    <0.01    <0.01    <0.01
I_B       OST            <0.01    <0.01    <0.01    <0.01
I_P       DIF            >0.99*   <0.01    <0.01    <0.01
Fig. D.2. Scalp maps of log-Bayes factors for time interval t ∈ [212, 252] ms for N250. A central focus becomes apparent along with a left-occipital spot.
References
Achtziger, A., Alós-Ferrer, C., Hügelschäfer, S., Steinhauser, M., 2014. The neural basis of belief updating and rational decision making. Soc. Cogn. Affect. Neurosci. 9, 55–62.
Bach, D.R., Dolan, R.J., 2012. Knowing how much you don't know: a neural organization of uncertainty estimates. Nat. Rev. Neurosci. 13, 572–586.
Baldi, P., Itti, L., 2010. Of bits and wows: a Bayesian theory of surprise with applications to attention. Neural Netw. 23, 649–666.
Barceló, F., Periáñez, J.A., Knight, R.T., 2002. Think differently: a brain orienting response to task novelty. NeuroReport 13, 1887–1892.
Barnard, G.A., 1949. Statistical inference. J. R. Stat. Soc. Ser. B 11, 115–149.
Barry, R.J., Rushby, J.A., 2006. An orienting reflex perspective on anteriorisation of the P3 of the event-related potential. Exp. Brain Res. 173, 539–545.
Berns, G.S., Bell, E., 2012. Striatal topography of probability and magnitude information for decisions under uncertainty. NeuroImage 59, 3166–3172.
Berns, G.S., Capra, C.M., Chappelow, J., Moore, S., Noussair, C., 2008. Nonlinear neurobiological probability weighting functions for aversive outcomes. NeuroImage 39, 2047–2057.
Bossaerts, P., 2010. Risk and risk prediction error signals in anterior insula. Brain Struct. Funct. 214, 645–653.
Cavagnaro, D.R., Pitt, M.A., Gonzalez, R., Myung, J.I., 2013. Discriminating among probability weighting functions using adaptive design optimization. J. Risk Uncertain. 47, 255–289.
Clark, A., 2013. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 36, 181–253.
d'Acremont, M., Schultz, W., Bossaerts, P., 2013. The human brain encodes event frequencies while forming subjective beliefs. J. Neurosci. 33, 10887–10897.
Daunizeau, J., Den Ouden, H.E.M., Pessiglione, M., Kiebel, S.J., Friston, K.J., Stephan, K.E., 2010. Observing the observer (II): deciding when to decide. PLoS One 5, e15555.
Daunizeau, J., Adam, V., Rigoux, L., 2014. VBA: a probabilistic treatment of nonlinear models for neurobiological and behavioural data. PLoS Comput. Biol. 10, e1003441.
Dayan, P., Hinton, G.E., Neal, R.M., Zemel, R.S., 1995. The Helmholtz machine. Neural Comput. 7, 889–904.
de Lange, F.P., Jensen, O., Dehaene, S., 2010. Accumulation of evidence during sequential decision making: the importance of top-down factors. J. Neurosci. 30, 731–738.
Delorme, A., Makeig, S., 2004. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134, 9–21.
Dien, J., Spencer, K.M., Donchin, E., 2004. Parsing the late positive complex: mental chronometry and the ERP components that inhabit the neighborhood of the P300. Psychophysiology 41, 665–678.
Dolan, R.J., Dayan, P., 2013. Goals and habits in the brain. Neuron 80, 312–325.
Donchin, E., 1981. Surprise! Surprise? Psychophysiology 18, 493–513.
Donchin, E., Coles, M.G., 1988. Is the P300 component a manifestation of context updating? Behav. Brain Sci. 11, 357–427.
Doya, K., Ishii, S., Pouget, A., Rao, R.P.N., 2007. Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press, Cambridge, MA.
Fiorillo, C.D., 2012. Beyond Bayes: on the need for a unified and Jaynesian definition of probability and information within neuroscience. Information 3, 175–203.
Folstein, J.R., Van Petten, C., 2008. Influence of cognitive control and mismatch on the N2 component of the ERP: a review. Psychophysiology 45, 152–170.
Fox, C.R., Poldrack, R.A., 2009. Prospect theory and the brain. In: Glimcher, P.W., Camerer, C.F., Fehr, E., Poldrack, R.A. (Eds.), Neuroeconomics: Decision Making and the Brain. Elsevier Academic Press, London, UK, pp. 145–173.
Friedman, D., Cycowicz, Y.M., Gaeta, H., 2001. The novelty P3: an event-related brain potential (ERP) sign of the brain's evaluation of novelty. Neurosci. Biobehav. Rev. 25, 355–373.
Friston, K.J., 2005. A theory of cortical responses. Philos. Trans. R. Soc. B-Biol. Sci. 360, 815–836.
Friston, K.J., 2010. The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11, 127–138.
Friston, K.J., Penny, W.D., Phillips, C., Kiebel, S.J., Hinton, G., Ashburner, J., 2002. Classical and Bayesian inference in neuroimaging: theory. NeuroImage 16, 465–483.
Friston, K.J., Mattout, J., Trujillo-Barreto, N., Ashburner, J., Penny, W.D., 2007. Variational free energy and the Laplace approximation. NeuroImage 34, 220–234.
Furl, N., Averbeck, B.B., 2011. Parietal cortex and insula relate to evidence seeking relevant to reward-related decisions. J. Neurosci. 31, 17572–17582.
Fuster, J.M., 2014. The prefrontal cortex makes the brain a preadaptive system. Proc. IEEE 102, 417–426.
García-Larrea, L., Cézanne-Bert, G., 1998. P3, positive slow wave and working memory load: a study on the functional correlates of slow wave activity. Clin. Neurophysiol. 108, 260–273.
Gold, J.I., Shadlen, M.N., 2007. The neural basis of decision making. Annu. Rev. Neurosci. 30, 535–574.
Gonzalez, R., Wu, G., 1999. On the shape of the probability weighting function. Cogn. Psychol. 38, 129–166.
Gratton, G., Coles, M.G.H., Donchin, E., 1983. A new method for off-line removal of ocular artifact. Electroencephalogr. Clin. Neurophysiol. 55, 468–484.
Grether, D.M., 1980. Bayes rule as a descriptive model: the representativeness heuristic. Q. J. Econ. 95, 537–557.
Grether, D.M., 1992. Testing Bayes rule and the representativeness heuristic: some experimental evidence. J. Econ. Behav. Organ. 17, 31–57.
Hampton, A.N., Bossaerts, P., O'Doherty, J.P., 2006. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J. Neurosci. 26, 8360–8367.
Haykin, S., Fuster, J.M., 2014. On cognitive dynamic systems: cognitive neuroscience and engineering learning from each other. Proc. IEEE 102, 608–628.
Hillyard, S.A., Picton, T.W., 1987. Electrophysiology of cognition. In: Plum, F. (Ed.), Handbook of Physiology: The Nervous System, Section 1, vol. 5. Higher Functions of the Brain, Part 2. American Physiological Society, Bethesda, MD, pp. 519–584.
Hsu, M., Krajbich, I., Zhao, C., Camerer, C.F., 2009. Neural response to reward anticipation under risk is nonlinear in probabilities. J. Neurosci. 29, 2231–2237.
Itti, L., Baldi, P., 2009. Bayesian surprise attracts human attention. Vis. Res. 49, 1295–1306.
Jaynes, E.T., 1988. How does the brain do plausible reasoning? In: Erickson, G.J., Smith, C.R. (Eds.), Maximum-Entropy and Bayesian Methods in Science and Engineering. Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 1–24.
Jaynes, E.T., 2003. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, UK.
Kahneman, D., Tversky, A., 1979. Prospect theory: an analysis of decision under risk. Econometrica 47, 263–291.
Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Am. Stat. Assoc. 90, 773–795.
Kelly, S.P., O'Connell, R.G., 2013. Internal and external influences on the rate of sensory evidence accumulation in the human brain. J. Neurosci. 33, 19434–19441.
Knill, D.C., Pouget, A., 2004. The Bayesian brain: the role of uncertainty in neural coding and computation for perception and action. Trends Neurosci. 27, 712–719.
Kok, P., Failing, M.F., de Lange, F.P., 2014. Prior expectations evoke stimulus templates in the primary visual cortex. J. Cogn. Neurosci. 26, 1546–1554.
Kolossa, A., Fingscheidt, T., Wessel, K., Kopp, B., 2012. A model-based approach to trial-by-trial P300 amplitude fluctuations. Front. Hum. Neurosci. 6, 359.
Kopp, B., 2008. The P300 component of the event-related brain potential and Bayes' theorem. In: Sun, M.K. (Ed.), Cognitive Sciences at the Leading Edge. Nova Science Publishers, New York, NY, pp. 87–96.
Kopp, B., Lange, F., 2013. Electrophysiological indicators of surprise and entropy in dynamic task-switching environments. Front. Hum. Neurosci. 7, 300.
Lee, D., Rushworth, M.F.S., Walton, M.E., Watanabe, M., Sakagami, M., 2007. Functional specialization of the primate frontal cortex during decision making. J. Neurosci. 27, 8170–8173.
Lieder, F., Daunizeau, J., Garrido, M.I., Friston, K.J., Stephan, K.E., 2013. Modelling trial-by-trial changes in the mismatch negativity. PLoS Comput. Biol. 9, e1002911.
Luck, S.J., 2005. An Introduction to the Event-Related Potential Technique. MIT Press, Cambridge, MA.
Matsuda, I., Nittono, H., 2014. Motivational significance and cognitive effort elicit different late positive potentials. Clin. Neurophysiol. http://dx.doi.org/10.1016/j.clinph.2014.05.030.
McGrayne, S.B., 2011. The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, & Emerged Triumphant From Two Centuries of Controversy. Yale University Press, New Haven, CT.
Mohr, P.N., Biele, G., Heekeren, H.R., 2010. Neural processing of risk. J. Neurosci. 30, 6613–6619.
Neti, C., Potamianos, G., Luettin, J., Matthews, I., Glotin, H., Vergyri, D., Sison, J., Mashari, A., Zhou, J., 2000. Audiovisual speech recognition. Final Workshop 2000 Report, Center for Language and Speech Processing vol. 764. Johns Hopkins University, Baltimore, MD.
Nieuwenhuis, S., De Geus, E.J., Aston-Jones, G., 2011. The anatomical and functional relationship between the P3 and autonomic components of the orienting response. Psychophysiology 48, 162–175.
O'Connell, R.G., Dockree, P.M., Kelly, S.P., 2012. A supramodal accumulation-to-bound signal that determines perceptual decisions in humans. Nat. Neurosci. 15, 1729–1735.
Oldfield, R.C., 1971. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9, 97–113.
Ostwald, D., Spitzer, B., Guggenmos, M., Schmidt, T.T., Kiebel, S.J., Blankenburg, F., 2012. Evidence for neural encoding of Bayesian surprise in human somatosensation. NeuroImage 62, 177–188.
Penny, W.D., 2012. Comparing dynamic causal models using AIC, BIC and free energy. NeuroImage 59, 319–330.
Penny, W.D., Stephan, K.E., Mechelli, A., Friston, K.J., 2004. Comparing dynamic causal models. NeuroImage 22, 1157–1172.
Penny, W.D., Stephan, K.E., Daunizeau, J., Rosa, M.J., Friston, K.J., Schofield, T.M., Leff, A.P., 2010. Comparing families of dynamic causal models. PLoS Comput. Biol. 6, e1000709.
Phillips, L.D., Edwards, W., 1966. Conservatism in a simple probability inference task. J. Exp. Psychol. 72, 346–354.
Polich, J., 2007. Updating P300: an integrative theory of P3a and P3b. Clin. Neurophysiol. 118, 2128–2148.
Pouget, A., Beck, J.M., Ma, W.J., Latham, P.E., 2013. Probabilistic brains: knowns and unknowns. Nat. Neurosci. 16, 1170–1178.
Prelec, D., 1998. The probability weighting function. Econometrica 66, 497–527.
Preuschoff, K., Bossaerts, P., Quartz, S.R., 2006. Neural differentiation of expected reward and risk in human subcortical structures. Neuron 51, 381–390.
Preuschoff, K., Quartz, S.R., Bossaerts, P., 2008. Human insula activation reflects risk prediction errors as well as risk. J. Neurosci. 28, 2745–2752.
Rangel, A., Camerer, C., Montague, P.R., 2008. A framework for studying the neurobiology of value-based decision making. Nat. Rev. Neurosci. 9, 545–556.
Robert, C., 2007. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, New York, NY.
Ruchkin, D.S., Johnson, R., Mahaffey, D., Sutton, S., 1988. Toward a functional categorization of slow waves. Psychophysiology 25, 339–353.
Shannon, C.E., Weaver, W., 1948. The mathematical theory of communication. Commun. Bell Syst. Tech. J. 27, 379–423.
Sokolov, Y.N., 1966. Orienting reflex as information regulator. In: Leontiev, A.N., Luria, A.R., Smirnov, A.A. (Eds.), Psychological Research in the U.S.S.R. Progress Publishers, Moscow, pp. 334–360.
Soltani, A., Wang, X.J., 2010. Synaptic computation underlying probabilistic inference. Nat. Neurosci. 13, 112–119.
Spencer, K.M., Dien, J., Donchin, E., 2001. Spatiotemporal analysis of the late ERP responses to deviant stimuli. Psychophysiology 38, 343–358.
Stephan, K.E., Weiskopf, N., Drysdale, P.M., Robinson, P.A., Friston, K.J., 2007. Comparing hemodynamic models with DCM. NeuroImage 38, 387–401.
Stephan, K.E., Penny, W.D., Daunizeau, J., Moran, R.J., Friston, K.J., 2009. Bayesian model selection for group studies. NeuroImage 46, 1004–1017.
Strange, B.A., Duggins, A., Penny, W.D., Dolan, R.J., Friston, K.J., 2005. Information theory, novelty and hippocampal responses: unpredicted or unpredictable? Neural Netw. 18, 225–230.

Summerfield, C., Koechlin, E., 2008. A neural representation of prior information during perceptual inference. Neuron 59, 336–347.
Summerfield, C., Tsetsos, K., 2012. Building bridges between perceptual and economic decision-making: neural and computational mechanisms. Front. Neurosci. 6, 70.
Sutton, S., Ruchkin, D.S., 1984. The late positive complex. Ann. N. Y. Acad. Sci. 425, 1–23.
Sutton, S., Braren, M., Zubin, J., John, E.R., 1965. Evoked-potential correlates of stimulus uncertainty. Science 150, 1187–1188.
Takahashi, H., Matsui, H., Camerer, C., Takano, H., Kodaka, F., Ideno, T., Okubo, S., Takemura, K., Arakawa, R., Eguchi, Y., et al., 2010. Dopamine D1 receptors and nonlinear probability weighting in risky choice. J. Neurosci. 30, 16567–16572.
Tobler, P.N., Christopoulos, G.I., O'Doherty, J.P., Dolan, R.J., Schultz, W., 2008. Neuronal distortions of reward probability without choice. J. Neurosci. 28, 11703–11711.
Towey, J., Rist, F., Hakerem, G., Ruchkin, D.S., Sutton, S., 1980. N250 latency and decision time. Bull. Psychon. Soc. 15, 365–368.
Trepel, C., Fox, C.R., Poldrack, R.A., 2005. Prospect theory on the brain? Toward a cognitive neuroscience of decision under risk. Cogn. Brain Res. 23, 34–50.
Tversky, A., Kahneman, D., 1992. Advances in prospect theory: cumulative representation of uncertainty. J. Risk Uncertain. 5, 297–323.
Vilares, I., Körding, K., 2011. Bayesian models: the structure of the world, uncertainty, behavior, and the brain. Ann. N. Y. Acad. Sci. 1224, 22–39.
Vilares, I., Howard, J.D., Fernandes, H.L., Gottfried, J.A., Körding, K.P., 2012. Differential representations of prior and likelihood uncertainty in the human brain. Curr. Biol. 22, 1641–1648.
Wu, S.W., Delgado, M.R., Maloney, L.T., 2011. The neural correlates of subjective utility of monetary outcome and probability weight in economic and in motor decision under risk. J. Neurosci. 31, 8822–8831.
Yang, T., Shadlen, M.N., 2007. Probabilistic reasoning by neurons. Nature 447, 1075–1080.
Zhang, H., Maloney, L.T., 2012. Ubiquitous log odds: a common representation of probability and frequency distortion in perception, action, and cognition. Front. Neurosci. 6, 1.
