Bayes Deep

Bayesian Reasoning
and Deep Learning

Shakir Mohamed
DeepMind
shakirm.com
@shakir_za
9 October 2015
Abstract
Deep learning and Bayesian machine learning are currently two of the most
active areas of machine learning research. Deep learning provides a powerful
class of models and an easy framework for learning that now provides state-ofthe-art methods for applications ranging from image classification to speech
recognition. Bayesian reasoning provides a powerful approach for information
integration, inference and decision making that has established it as the key
tool for data-efficient learning, uncertainty quantification and robust model
composition that is widely used in applications ranging from information
retrieval to large-scale ranking. Each of these research areas has shortcomings
that can be effectively addressed by the other, pointing towards a needed
convergence of these two areas of machine learning; the complementary
aspects of these two research areas is the focus of this talk. Using the tools of
auto-encoders and latent variable models, we shall discuss some of the ways in
which our machine learning practice is enhanced by combining deep learning
with Bayesian reasoning. This is an essential, and ongoing, convergence that
will only continue to accelerate and provides some of the most exciting
prospects, some of which we shall discuss, for contemporary machine learning
research.
Bayesian Reasoning and Deep Learning
Deep Learning
Bayesian Reasoning
Better ML
Deep Learning
A framework for constructing flexible models
+ Rich non-linear models for

classification and sequence prediction.
+ Scalable learning using stochastic
approximations and conceptually simple.
- Only point estimates

- Hard to score models, do
model selection and
complexity penalisation.
+ Easily composable with other gradientbased methods

Bayesian Reasoning
A framework for inference and decision making
+ Unified framework for model building,

inference, prediction and decision making
- Mainly conjugate and linear

models
+ Explicit accounting for uncertainty and

variability of outcomes
- Potentially intractable
inference leading to
expensive computation or
long simulation times.
+ Robust to overfitting; tools for model

selection and composition.
Two Streams of Machine Learning

Deep Learning
+ Rich non-linear models for

classification and sequence
prediction.
+ Scalable learning using stochastic
approximation and conceptually
simple.
+ Easily composable with other
gradient-based methods
- Only point estimates
- Hard to score models, do selection
and complexity penalisation.
Bayesian Reasoning
- Mainly conjugate and linear

models
- Potentially intractable inference,
computationally expensive or long
simulation time.
+ Unified framework for model
building, inference, prediction and
decision making
+ Explicit accounting for uncertainty
and variability of outcomes
+ Robust to overfitting; tools for
model selection and composition.
6
Outline
Bayesian Reasoning
Deep Learning
Complementary strengths that we should

expect to be successfully combined.
1
Why is this a good idea?

Review of deep learning
Limitations of maximum likelihood and MAP estimation
How can we achieve this convergence?

Case study using auto-encoders and latent variable models
Approximate Bayesian inference
What else can we do?

Semi-supervised learning, classification, better inference
and more.
A (Statistical) Review of Deep Learning

Table 1: Correspondence between link and activations functions in
generalised regression.
Target
Regression
Link
Inv link
Activation
Generalised Linear Regression

= w> x + b
Real
Binary
p(y|x) = p(y|g(); )
The basic function can be any linear

function, e.g., ane, convolution.
g(.) is an inverse link function that well
refer to as an activation function.
Linear
Logistic
Binary
Probit
Binary
Gumbel
Binary
Logistic
Categorical
Multinomial
Counts
Counts
Non-neg.
Sparse
Ordered
Poisson
Poisson
Gamma
Tobit
Ordinal
Identity
Logit log 1-
Identity
Sigmoid
Inv
Gauss
-1
CDF
()
Compl.
log-log
log(-log())
Gauss
CDF
()
Gumbel CDF
-x
e-e
Probit
Hyperbolic
Tangent
tanh()
Multin. Logit
Tanh
log()
p
()
Reciprocal
1
1+exp(-)
Sigmoid
Softmax
Pi
j j
exp()
2
1
max max(0; )
Cum.
Logit
( k - )
ReLU
the Bernoulli distribution.

There are many link functions that allow us to make other distributional assumptions for the target (response) y. In deep learning, the
link function is referred to as the activation function and I list in the
table below the names for these functions used in the two fields. From
this table we can see that many of the popular approaches for specifying neural networks that have counterparts in statistics and related
literatures under (sometimes) very different names, such multinomial
regression in statistics and softmax classification in deep learning, or
rectifier in deep learning and tobit models is statistics.
Maximum likelihood estimation

Optimise the negative log-likelihood
L=
1.2
log p(y|g(); )
recursive generalised linear models
Constructing a recursive GLM or deep deep feed-forward neural network using the linear predictor as the basic building block. GLMS
Recursive Generalised Linear Regression

Recursively compose the basic linear functions.
Gives a deep neural network.
E[y] = hL . . . hl h0 (x)
A general framework for building non-linear, parametric models
Problem: Overfitting of MLE leading to limited generalisation.

Regularisation Strategies for Deep Networks
Regularisation is essential to overcome the limitations of maximum
likelihood estimation.
Regularisation, penalised regression, shrinkage.
A wide range of available regularisation techniques:
Large data sets
Input noise/jittering and data augmentation/expansion.
L2 /L1 regularisation (Weight decay, Gaussian prior)
Binary or Gaussian Dropout
Batch normalisation
More robust loss function using MAP estimation instead.
10
More Robust Learning

MAP estimators and limitations
Power of MAP estimators is that they provide
some robustness to overfitting.
Creates sensitivities to parameterisation.
1. Sensitivities aect gradients and can make learning hard
Invariant MAP estimators and exploiting natural
gradients, trust region methods and other
improved optimisation.
2. Still no way to measure confidence of our model.
Can generate frequentist confidence intervals
and bootstrap estimates.
11
Towards Bayesian Reasoning

Proposed solutions have not fully dealt with the underlying issues.
Issues arise as a consequence of:
Reasoning only about the most likely solution and
Not maintaining knowledge of the underlying variability (and
averaging over this).
Given this powerful model class and invaluable tools for

regularisation and optimisation, let us develop a
Pragmatic Bayesian Approach for
Probabilistic Reasoning in Deep Networks.
Bayesian reasoning over some, but not all parts of our models (yet).
12
Outline
Bayesian Reasoning
Deep Learning

1


Case study using auto-encoders and latent variable models
Approximate Bayesian inference

Semi-supervised learning, classification, better inference
and more.
13
Dimensionality Reduction and Auto-encoders

Unsupervised learning and auto-encoders
A generic tool for dimensionality
reduction and feature extraction.
Minimise reconstruction error using an
encoder and a decoder.
+
Non-linear dimensionality reduction

using deep networks for encoder and
decoder.
Easy to implement as a single

computational graph and train using
SGD
No natural handling of missing data

No representation of variability of the
representation space.
z = f(y)
Decoder
g(.)
Encoder
f(.)
y* = g(z)
Data y
L=
log p(y|g(z))
L = ky
2
g(f (y))k2
14
Dimensionality Reduction and Auto-encoders
Some questions about auto-encoders:

What is the model we are interested in?
Why use an encoder?
How do we regularise?
z = f(y)
Decoder
g(.)
Encoder
f(.)
y* = g(z)
Data y
Best to be explicit about the:

Probabilistic model of interest and
Mechanism we use for inference.
15
Density Estimation and Latent Variable Models

Latent variable models:
Generic and flexible model class for density estimation.
Specifies a generative process that gives rise to the data.
BXPCA
Latent Gaussian Models:

Probabilistic PCA, Factor analysis (FA), Bayesian Exponential
Family PCA (BXPCA).
Latent Variable
z N (z|, )
Observation Model
= Wz + b
y Expon(y|)
Exponential fam natural parameters .
W
y
n = 1, , N
Use our knowledge of deep learning to design even richer models.

16
Deep Generative Models
DLGM
Rich extension of previous model using deep neural networks.

E.g., non-linear factor analysis, non-linear Gaussian belief
networks, deep latent Gaussian models (DLGM).
z2
Latent Variables (Stochastic layers)
zl N (zl |fl (zl+1 ), l )
fl (z) = (Wh(z) + b)
Deterministic layers
h4
h3
W1
z1
hi (x) = (Ax + c)
h2
Observation Model
= Wh1 + b
h1
y Expon(y|)
Can also use non-exponential family.

n = 1, , N
17
Deep Latent Gaussian Models

Our inferential tasks are:
1. Explain this data
p(z|y, W) / p(y|z, W)p(z)

2. Make predictions:
p(y |y) =
p(y |z, W)p(z|y, W)dz
3. Choose
Z the best model
p(y|W) =
p(y|z, W)p(z)dz
z1
h2
h1
y
n = 1, , N
18
Variational Inference
Use tools from approximate inference to handle intractable integrals.
KL[q(z|y)kp(z|y)]
Approximation class
True posterior
q (z)
Reconstruction
F(y, q) = Eq(z) [log p(y|z)]
Reconstruction cost:
Expected log-likelihood
measures how well
samples from q(z) are able
to explain the data y.
Penalty: Explanation of
the data q(z) doesnt deviate
too far from your beliefs
p(z) - Okhams razor.
Penalty
KL[q(z)kp(z)]
Penalty is derived from your model and does not need to be designed.
19
Amortised Variational Inference

z ~ q(z | y)
F (y, q) = Eq(z) [log p(y|z)]

Approx. Posterior
Reconstruction
KL[q(z)kp(z)]
Penalty
Approximate posterior distribution q(z): Best match

to true posterior p(z|y), one of the unknown
inferential quantities of interest to us.
Inference network: q is an encoder or inverse model.
Parameters of q are now a set of global parameters
used for inference of all data points - test and train.
Amortise (spread) the cost of inference over all data.
Inference/
Encoder
q(z |y)
Data y
Encoders provide an ecient mechanism for

amortised posterior inference
20
Auto-encoders and Inference in DGMs

Approx. Posterior
KL[q(z)kp(z)]
Reconstruction
z ~ q(z | y)
Model
p(y |z)
Inference
Network
q(z |y)
Penalty
Model (Decoder): likelihood p(y|z).

Inference (Encoder): variational distribution q(z|y)
Stochastic encoder-decoder systems

implement variational inference.
y ~ p(y | z)
Data y
Specific combination of variational inference in latent

variable models using inference networks
Variational Auto-encoder
But dont forget what your model is, and what inference you use.
21
What Have we Gained

+
Transformed an auto-encoders into more

interesting deep generative models.
Rich new class of density estimators built

with non-linear models.
Used a principled approach for deriving

loss functions that automatically include
appropriate penalty functions.
Explained how an encoder enters into

our models and why this is a good idea.
Able to answer all our desired inferential

questions.
Knowledge of the uncertainty associated

with our latent variables.
KL[q(z)kp(z)]
z ~ q(z | y)
Model
p(y |z)
Inference
Network
q(z |y)
y ~ p(y | z)
Data y
22
What Have we Gained

KL[q(z)kp(z)]
Able to score our models and do model

selection using the free energy.
Can impute missing data under any

missingness assumption
Can still combine with natural gradient

and improved optimisation tools.
Easy implementation - have a single

computational graph and simple Monte
Carlo gradient estimators.
Computational complexity the same as

any large-scale deep learning system.
z ~ q(z | y)
Model
p(y |z)
Inference
Network
q(z |y)
y ~ p(y | z)
Data y
A true marriage of Bayesian Reasoning and Deep Learning

23
...
Data Visualisation
MNIST Handwritten digits
...
28x28
DLGM
...
500
Samples from 2D latent model
...
...
100
...
300
...
28x28
...
100
...
400
...
96x96
Labels in 2D latent space
24
DLGM
Visualising MNIST in 3D
25
DLGM
Data Simulation
Data
Samples
26
Missing Data Imputation

Original Data
unobserved pixels
Inferred Image
DLGM
10%
observed
50%
observed
27
Outline
Bayesian Reasoning
Deep Learning

1


Auto-encoders and latent variable models
Approximate and variational inference

Semi-supervised learning, recurrent networks, classification,
better inference and more.
28
Semi-supervised Learning
Semi-supervised DLGM
Can extend the marriage of Bayesian reasoning and deep learning to the
problem of semi-supervised classification.
z
W
x
n = 1, , N
29
Semi-supervised DLGM
Analogical Reasoning
30
Generative Models with Attention

Figure 7. MNIST generation sequences for DRAW without atWe can
also combine other tools from deep learning to design
tention. Notice how the network first generates a very blurry imthat is subsequently
refined.generative models: recurrent networks
even age
more
powerful
and attention.
Figure 8. Generated MNIST images with two digits.
attention
it constructs the digit by tracing the lines
nt Neural Network For with
Image
Generation
much like a person with a pen.
DRAW
ts scenes
d by the
.
y step is
ne while
ew years
by a seby a sin& Hinton,
to, 2014;
014; Serequential
h can be
s such as
model in
possible
nse it re-
P (x|z)
decoder
FNN
ct
write
ct
write
. . . cT
4.3. MNIST Generation with Two Digits

dec
decoder
P (x|z1:T )
decoder
ht motivation
The main
for using
1
RNN
RNNan attention-based generative model is that large images can be built up iteratively,
z
zt+1
zt
decoding
by adding to a small
part of the image at a time.
To test
(generative
model)
sample this capability sample
sample
in a controlled
fashion, we trained DRAW
encoding
two
28 |x,
28 zMNIST
images choQ(z|x) to generate
Q(ztimages
|x, z1:t with
Q(z
1)
t+1
1:t )
(inference)
sen at random and placed at random locations in a 60 60
encoderIn casesencoder
black background.
where the two digits overlap,
henc
t 1
RNN
RNNtogether at each point and
encoder the pixel intensities were added
FNN
clipped to be noread
greater thanread
one. Examples of generated
data are shown in Fig. 8. The network typically generates
x the other, suggesting
x
x
one digit and then
an ability to recreate composite scenes from simple pieces.
Figure 2. Left:
Conventional
Auto-Encoder. Dur4.4. Street
View House Variational
Number Generation
ing generation, a sample z is drawn from a prior P (z) and passedFigure 9. Generated SVHN images. The rightmost column
MNIST digits are very simplistic in terms of visual strucshows the training images closest (in L2 distance) to the generthrough the
feedforward
decoder
network
to
compute
the
probature, and we were keen to see how well DRAW performed
ated images beside them. Note that the two columns are visually
bility of the
input
P
(x|z)
given
the
sample.
During
inference
the
on natural images. Our first natural image generation exsimilar, but the numbers are generally different.
input x is periment
passed to
thetheencoder
network,
producing
an approxused
multi-digit
Street View
House Numbers
datasetQ(z|x)
(Netzer etover
al., 2011).
used the same
preprocessimate posterior
latentWe
variables.
During
training, z
ing
as
(Goodfellow
et al.,
2013),
a 64 64
31
Bayesian Reasoning
and
Deep
Learning
is sampled
from
Q(z|x)
and
then
usedyielding
to compute
thehouse
total de-highly realistic, as shown in Figs. 9 and 10. Fig. 11 reveals
Uncertainty on Model Parameters
Bayesian Neural Networks
We can also combine other tools from deep learning to design even more
powerful generative models: recurrent networks and attention.
Weight Uncertainty in Neural
WY 1
0.1
0.5
h2 H 1
H2
0.1
0.7
W2 H3
1.3
H1
H2
H3
0.1 0.3 1.4
0.2
1.2
h1
W3
Figure 1. Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribuytion, as provided by Bayes by Backprop.
the par
through
regressi
this c
Inputs
tion on
tion (gi
transfor
The we
tion (M
the ML
n = 1, , N
is related to recent methods in deep, generative modelling

(Kingma and Welling, 2014; Rezende et al., 2014; Gregor
32
In Review
Deep learning as a framework for building highly
flexible non-linear parametric models, but
regularisation and accounting for uncertainty
and lack of knowledge is still needed.
Bayesian reasoning as a general framework for

inference that allows us to account for
uncertainty and a principled approach for
regularisation and model scoring.
Combined Bayesian reasoning with auto-encoders and
showed just how much can be gained by a marriage of these
two streams of machine learning research.
z ~ q(z | y)
Model
p(y |z)
Inference
Network
q(z |y)
y ~ p(y | z)
Data y
33
Thanks to many people:

Danilo Rezende, Ivo Danihelka, Karol Gregor, Charles Blundell,
Theophane Weber, Andriy Mnih, Daan Wierstra (Google DeepMind),
Durk Kingma, Max Welling (U. Amsterdam)
Thank You.
34
Some References
Probabilistic Deep Learning
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in deep generative models."
ICML (2014).
Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." ICLR 2014.
Mnih, Andriy, and Karol Gregor. "Neural variational inference and learning in belief networks." ICML (2014).
Gregor, Karol, et al. "Deep autoregressive networks." ICML (2014).
Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with deep generative models. NIPS (pp. 3581-3589).
Gregor, K., Danihelka, I., Graves, A., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:
1502.04623.
Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. arXiv preprint arXiv:1505.05770.
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight Uncertainty in Neural Networks. arXiv preprint arXiv:1505.05424.
Hernndez-Lobato, J. M., & Adams, R. P. (2015). Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. arXiv
preprint arXiv:1502.05336.
Gal, Y., & Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint
arXiv:1506.02142.
35
What is a Variational Method?

Variational Principle
General family of methods for approximating
complicated densities by a simpler class of densities.
KL[q(z|y)kp(z|y)]
Approximation class
True posterior
q (z)
Deterministic approximation procedures
with bounds on probabilities of interest.
Fit the variational parameters.
36
From IS to Variational Inference

Integral problem
Proposal
Importance Weight
Jensens inequality
log
p(x)g(x)dx
p(x) log g(x)dx
log p(y) = log

log p(y) = log
p(y|z)p(z)dz
q(z)
p(y|z)p(z)
dz
q(z)
p(z)
log p(y) = log p(y|z)
q(z)dz
q(z)
Z
p(z)
log p(y)
q(z) log p(y|z)
dz
q(z)
=
Variational lower bound

q(z) log p(y|z)
= Eq(z) [log p(y|z)]
q(z)
q(z) log
p(z)
KL[q(z)kp(z)]
37
Minimum Description Length (MDL)

Stochastic encoder
Data code-length
KL[q(z)kp(z)]
Hypothesis code
Stochastic encoder-decoder systems implement variational inference.
Regularity in our data that can be explained with

latent variables, implies that the data is compressible.
MDL: inference seen as a problem of compression
we must find the ideal shortest message of our data y:
marginal likelihood.
Must introduce an approximation to the ideal
message.
Encoder: variational distribution q(z|y),
Decoder: likelihood p(y|z).
z ~ q(z | y)
Decoder
p(y |z)
Encoder
q(z |y)
y ~ p(y | z)
Data y
38
Denoising Auto-encoders (DAE)

Stochastic encoder
Reconstruction
(z, y)
Penalty
Stochastic encoder-decoder systems implement variational inference.
DAE: A mechanism for finding representations or

features of data (i.e. latent variable explanations).
Encoder: variational distribution q(z|y),
Decoder: likelihood p(y|z).

The variational approach requires you to be explicit
about your assumptions. Penalty is derived from your
model and does not need to be designed.
z ~ q(z | y)
Decoder
p(y |z)
Encoder
q(z |y)
y ~ p(y | z)
Data y
39
Amortising the Cost of Inference

Repeat:
E-step
For i = 1, N
n
/ r Eq
(z) [log p (yn |zn )]
r KL[q(zn )kp(zn )]
Instead of solving this optimisation

for every data point n, we can
instead use a model.
M-step
1 X
/
r log p (yn |zn )
N n
z
z ~ q(z | y)
Model
p(y |z)
Inference
Network
q(z |y)
y ~ p(y | z)
Data y
Inference network: q is an encoder or inverse model.

Parameters of q are now a set of global parameters
used for inference of all data points - test and train.
Share the cost of inference (amortise) over all data.
Combines easily with mini-batches and Monte Carlo
expectations.
Can jointly optimise variational and model
parameters: no need for alternating optimisation.
40
Implementing your Variational Algorithm

Avoid deriving pages of gradient updates for variational inference.
Variational inference turns integration
into optimisation:
Automated Tools:
Dierentiation: Theano, Torch7, Stan
Message passing: infer.NET
Eq [( log p(y|z) + log q(z)

Forward pass
Prior
p(z)
z H[q(z)]
Backward pass
r
Prior
p(z)
log p(z)
Inference
q(z |x)
Stochastic gradient descent and

other preconditioned optimisation.
Same code can run on both GPUs
or on distributed clusters.
Probabilistic models are modular,
can easily be combined.
Model
p(x |z)
Model
p(x |z)
Data x
log p(x|z)
log p(z)]
Inference
q(z |x)
Ideally want probabilistic programming

using variational inference.
41
Stochastic Backpropagation
A Monte Carlo method that works with continuous latent variables.
Original problem
r Eq(z) [f (z)]
2
Reparameterisation
Backpropagation
with Monte Carlo
z N (, )
z = + N (0, 1)
r EN (0,1) [f ( + )]
EN (0,1) [r={, } f ( + )]
Can use any likelihood function, avoids the need for additional lower bounds.
Low-variance, unbiased estimator of the gradient.
Can use just one sample from the base distribution.
Possible for many distributions with location-scale or other known
transformations, such as the CDF.
42
Monte Carlo Control Variate Estimators

More general Monte Carlo approach that can be used with both discrete
or continuous latent variables.
Property of the score function:
r q (z|x)
r log q (z|x) =
q (z|x)
Original problem
r Eq
Score ratio
Eq
(z) [log p (y|z)r
log q(z|y)]
Eq
(z) [(log p (y|z)
c)r log q(z|y)]
MCCV Estimate
(z) [log p (y|z)]
c is known as a control variate and is used

to control the variance of the estimator.
43

Bayes Deep

Enviado por

Dados do documento

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Bayes Deep

Enviado por

Direitos autorais:

Formatos disponíveis

Bayesian Reasoning

and Deep Learning

Bayesian Reasoning and Deep Learning

A framework for constructing flexible models

+ Rich non-linear models for

- Only point estimates

+ Easily composable with other gradientbased methods

A framework for inference and decision making

+ Unified framework for model building,

- Mainly conjugate and linear

+ Explicit accounting for uncertainty and

+ Robust to overfitting; tools for model

Two Streams of Machine Learning

+ Rich non-linear models for

Bayesian Reasoning and Deep Learning

- Mainly conjugate and linear

Complementary strengths that we should

Why is this a good idea?

How can we achieve this convergence?

What else can we do?

Bayesian Reasoning and Deep Learning

A (Statistical) Review of Deep Learning

Generalised Linear Regression

The basic function can be any linear

the Bernoulli distribution.

Maximum likelihood estimation

Bayesian Reasoning and Deep Learning

recursive generalised linear models

A (Statistical) Review of Deep Learning

Recursive Generalised Linear Regression

A (Statistical) Review of Deep Learning

More robust loss function using MAP estimation instead.

Bayesian Reasoning and Deep Learning

More Robust Learning

Towards Bayesian Reasoning

Given this powerful model class and invaluable tools for

Complementary strengths that we should

Why is this a good idea?

How can we achieve this convergence?

What else can we do?

Bayesian Reasoning and Deep Learning

Dimensionality Reduction and Auto-encoders

Non-linear dimensionality reduction

Easy to implement as a single

No natural handling of missing data

Bayesian Reasoning and Deep Learning

Dimensionality Reduction and Auto-encoders

Some questions about auto-encoders:

Best to be explicit about the:

Bayesian Reasoning and Deep Learning

Density Estimation and Latent Variable Models

Latent Gaussian Models:

Exponential fam natural parameters .

Use our knowledge of deep learning to design even richer models.

Deep Generative Models

Rich extension of previous model using deep neural networks.

Latent Variables (Stochastic layers)

zl N (zl |fl (zl+1 ), l )

Can also use non-exponential family.

Deep Latent Gaussian Models

1. Explain this data

p(z|y, W) / p(y|z, W)p(z)

p(y |z, W)p(z|y, W)dz

Bayesian Reasoning and Deep Learning