Bayesian Reasoning and Deep Learning
Shakir Mohamed
shakirm.com
@shakir_za
9 October 2015
Abstract
Deep learning and Bayesian machine learning are currently two of the most
active areas of machine learning research. Deep learning provides a powerful
class of models and an easy framework for learning that now provides state-of-the-art methods for applications ranging from image classification to speech recognition. Bayesian reasoning provides a powerful approach for information
integration, inference and decision making that has established it as the key
tool for data-efficient learning, uncertainty quantification and robust model
composition that is widely used in applications ranging from information
retrieval to large-scale ranking. Each of these research areas has shortcomings
that can be effectively addressed by the other, pointing towards a needed
convergence of these two areas of machine learning; the complementary
aspects of these two research areas are the focus of this talk. Using the tools of
auto-encoders and latent variable models, we shall discuss some of the ways in
which our machine learning practice is enhanced by combining deep learning
with Bayesian reasoning. This is an essential, and ongoing, convergence that
will only continue to accelerate and provides some of the most exciting
prospects, some of which we shall discuss, for contemporary machine learning
research.
Bayesian Reasoning and Deep Learning
[Diagram: combining Deep Learning and Bayesian Reasoning gives better machine learning.]
Shortcoming of Bayesian reasoning: potentially intractable inference, leading to expensive computation or long simulation times.
Outline
Bayesian Reasoning
Deep Learning
A general framework: combine a linear predictor η = w⊤x + b with a link function, so that

p(y|x) = p(y | g(η); θ)

Target type | Regression | Link g(μ) | Inverse link g⁻¹(η) | Activation
----------- | ---------- | --------- | ------------------- | ----------
Real | Linear | Identity | Identity |
Binary | Logistic | Logit log(μ/(1−μ)) | Sigmoid 1/(1+exp(−η)) | Sigmoid
Binary | Probit | Inverse Gauss CDF Φ⁻¹(μ) | Gauss CDF Φ(η) | Probit
Binary | Gumbel | Compl. log-log log(−log(μ)) | Gumbel CDF e^(−e^(−η)) |
Binary | Logistic | Hyperbolic tangent tanh(η) | | Tanh
Categorical | Multinomial | | Multin. logit exp(η_i)/Σ_j exp(η_j) | Softmax
Counts | Poisson | log(μ) | exp(ν) |
Counts | Poisson | √μ | ν² |
Non-neg. | Gamma | Reciprocal 1/μ | 1/ν |
Sparse | Tobit | | max(0, ν) | ReLU
Ordered | Ordinal | | Cum. logit σ(φ_k − ν) |

Maximum-likelihood learning maximises L = log p(y | g(η); θ).
Constructing a recursive GLM, or deep feed-forward neural network, using the linear predictor as the basic building block:

E[y] = h_L ∘ h_{L−1} ∘ … ∘ h_0(x)

This is a general framework for building non-linear, parametric models.
Problem: overfitting of maximum-likelihood estimation, leading to limited generalisation.
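To make the construction concrete, here is a minimal numpy sketch of a recursive GLM and its maximum-likelihood loss; all layer sizes, nonlinearities and weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def recursive_glm(x, params):
    """Recursive GLM / feed-forward net: E[y] = h_L(... h_1(h_0(x))).
    Each layer is a linear predictor eta = W h + b followed by an
    inverse link (tanh for hidden layers, sigmoid for a binary output)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)
    W, b = params[-1]
    return sigmoid(W @ h + b)

def neg_log_likelihood(y, mu, eps=1e-9):
    """Bernoulli negative log-likelihood, -log p(y | g(eta))."""
    return -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))

# Hypothetical shapes: 5 inputs -> 10 hidden units -> 1 output.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(10, 5)), np.zeros(10)),
          (rng.normal(size=(1, 10)), np.zeros(1))]
x, y = rng.normal(size=5), np.array([1.0])
print(neg_log_likelihood(y, recursive_glm(x, params)))
```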
Bayesian reasoning over some, but not all, parts of our models (yet).
Outline
Bayesian Reasoning
Deep Learning
Auto-encoders: an encoder f(·) computes the code z = f(y) from the data y, and a decoder g(·) computes the reconstruction y* = g(z).
Objectives: the log-likelihood L = log p(y | g(z)), or, for a Gaussian likelihood, the squared reconstruction error L = ‖y − g(f(y))‖².
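A minimal numpy sketch of this objective, with a hypothetical linear encoder and decoder; the squared error is, up to constants, the negative Gaussian log-likelihood −log p(y | g(z)) with fixed variance.

```python
import numpy as np

rng = np.random.default_rng(1)
We = rng.normal(size=(2, 8))        # hypothetical encoder weights
Wd = rng.normal(size=(8, 2))        # hypothetical decoder weights

f = lambda y: We @ y                # encoder: z = f(y)
g = lambda z: Wd @ z                # decoder: y* = g(z)

y = rng.normal(size=8)              # Data y
loss = np.sum((y - g(f(y))) ** 2)   # L = ||y - g(f(y))||^2
print(loss)
```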
BXPCA (Bayesian exponential family PCA):
Latent variables: z ~ N(z | μ, Σ)
Observation model: η = Wz + b,  y ~ Expon(y | η)
[Graphical model: W and z generate y; plate over n = 1, …, N]
DLGM (deep latent Gaussian model):
Stochastic layers z_l, with layer functions f_l(z) = σ(W h(z) + b)
Deterministic layers h_i(x) = σ(Ax + c)
Observation model: η = W h_1 + b,  y ~ Expon(y | η)
[Graphical model: stochastic layers z2, z1; deterministic layers h4, h3, h2, h1; observed y; plate over n = 1, …, N]
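To make the generative process concrete, here is a hedged sketch of ancestral sampling from a tiny DLGM; the layer sizes, tanh nonlinearities and Bernoulli observation model are illustrative assumptions, not the exact choices of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Hypothetical parameters for a tiny DLGM: z2 -> z1 -> h1 -> y.
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # stochastic layer function f_l
A1, c1 = rng.normal(size=(6, 4)), np.zeros(6)   # deterministic layer h_i
W0, b0 = rng.normal(size=(8, 6)), np.zeros(8)   # observation model

z2 = rng.normal(size=4)                           # z2 ~ N(0, I)
z1 = np.tanh(W2 @ z2 + b2) + rng.normal(size=4)   # z1 ~ N(f(z2), I)
h1 = np.tanh(A1 @ z1 + c1)                        # deterministic layer
eta = W0 @ h1 + b0                                # natural parameters
y = rng.binomial(1, sigmoid(eta))                 # y ~ Bernoulli(sigmoid(eta))
print(y)
```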
Integrating over the latent variables gives the marginal likelihood

p(y | W) = ∫ p(y | z, W) p(z) dz

which lets us make predictions p(y* | y) and choose the best model.
[Graphical model: z1 → h2 → h1 → y; plate over n = 1, …, N]
Variational Inference
Use tools from approximate inference to handle intractable integrals: choose q(z) from a tractable approximation class and minimise KL[q(z|y) ‖ p(z|y)] to the true posterior. This yields the free-energy objective

F(q) = E_q(z)[log p(y|z)] − KL[q(z) ‖ p(z)]

Reconstruction cost: the expected log-likelihood measures how well samples from q(z) are able to explain the data y.
Penalty: the explanation of the data q(z) doesn't deviate too far from your beliefs p(z) (Occam's razor).
The penalty is derived from your model and does not need to be designed.
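For the common case of a diagonal Gaussian q(z|y) = N(μ, σ²) and a standard normal prior p(z) = N(0, I), the penalty has a closed form. A short sketch; the Bernoulli reconstruction term is an illustrative choice.

```python
import numpy as np

def gaussian_kl(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)], in closed form."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def free_energy(y, mu_y, mu, sigma, eps=1e-9):
    """F = E_q[log p(y|z)] - KL[q(z)||p(z)], with the expectation
    approximated by the single sample z that produced mu_y."""
    recon = np.sum(y * np.log(mu_y + eps) + (1 - y) * np.log(1 - mu_y + eps))
    return recon - gaussian_kl(mu, sigma)

print(gaussian_kl(np.zeros(2), np.ones(2)))   # 0.0: q matches the prior
y = np.array([1.0, 0.0, 1.0])
mu_y = np.array([0.9, 0.2, 0.7])              # decoder output for one z ~ q
print(free_energy(y, mu_y, np.zeros(2), np.ones(2)))
```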
The variational objective has an auto-encoder structure: an inference network (encoder) q(z|y) maps data y to latent samples z ~ q(z|y), and the model (decoder) p(y|z) maps latents back to data y ~ p(y|z). The reconstruction term pairs with the penalty KL[q(z) ‖ p(z)].
[Diagram, repeated over several slides: Data y → Inference Network q(z|y) → z ~ q(z|y) → Model p(y|z) → y ~ p(y|z), with the reconstruction and KL penalty terms annotated.]
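Putting the pieces of the diagram together, a minimal single-datapoint sketch of the pipeline; the linear inference and generative networks, and all sizes, are illustrative assumptions, and parameter gradients are omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

D, K = 8, 2                                        # hypothetical data/latent sizes
W_mu = rng.normal(size=(K, D))                     # inference network weights
W_ls = rng.normal(size=(K, D))
W_dec = rng.normal(size=(D, K))                    # generative model weights

y = rng.binomial(1, 0.5, size=D).astype(float)     # Data y

mu, log_sigma = W_mu @ y, W_ls @ y                 # inference network q(z|y)
z = mu + np.exp(log_sigma) * rng.normal(size=K)    # z ~ q(z|y), reparameterised
p = np.clip(sigmoid(W_dec @ z), 1e-9, 1 - 1e-9)    # model p(y|z)

recon = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma)
elbo = recon - kl                                  # the free-energy objective
print(elbo)
```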
Data Visualisation: MNIST handwritten digits.
[Architecture diagrams for DLGMs, with layer sizes such as 28x28 inputs with hidden layers of 500, 300 and 100 units, and 96x96 inputs with hidden layers of 400 and 100 units.]
DLGM: Visualising MNIST in 3D. [Figure: 3D latent-space visualisation.]
DLGM: Data Simulation. [Figure: training data alongside samples from the model.]
DLGM: Inferring unobserved pixels. [Figure: inferred images with 10% and 50% of the pixels observed.]
Outline
Bayesian Reasoning
Deep Learning
Semi-supervised Learning
Semi-supervised DLGM
We can extend the marriage of Bayesian reasoning and deep learning to the problem of semi-supervised classification.
[Graphical model: latent z and parameters W generating data x; plate over n = 1, …, N]
Semi-supervised DLGM: Analogical Reasoning. [Figure: analogies generated by the semi-supervised model.]
We can also combine other tools from deep learning to design even more powerful generative models: recurrent networks and attention.

DRAW: A Recurrent Neural Network for Image Generation (Gregor et al., 2015). With attention, DRAW constructs a digit by tracing the lines, much like a person with a pen.

[Figure 2 of the DRAW paper. Left: conventional variational auto-encoder. During generation, a sample z is drawn from a prior P(z) and passed through the feedforward decoder network to compute the probability of the input P(x|z) given the sample. During inference, the input x is passed to the encoder network, producing an approximate posterior Q(z|x) over the latent variables. During training, z is sampled from Q(z|x) and then used to compute the total description length. Right: DRAW network. At each step t, a recurrent encoder reads from the image x and produces Q(z_t | x, z_{1:t-1}); a recurrent decoder writes to a canvas c_1, …, c_T, and the final canvas defines P(x | z_{1:T}).]

From the paper: the main motivation for using an attention-based generative model is that large images can be built up iteratively, by adding to a small part of the image at a time. To test this capability in a controlled fashion, DRAW was trained to generate images containing two 28x28 MNIST digits, chosen at random and placed at random locations on a 60x60 black background. In cases where the two digits overlap, the pixel intensities were added together at each point and clipped to be no greater than one. The network typically generates one digit and then the other, suggesting an ability to recreate composite scenes from simple pieces.

Street View House Number generation: MNIST digits are very simplistic in terms of visual structure, so DRAW was also tested on natural images, using the multi-digit Street View House Numbers dataset (Netzer et al., 2011) with the same preprocessing as Goodfellow et al. (2013). The generated house numbers are highly realistic.

[Figure 9 of the DRAW paper: generated SVHN images. The rightmost column shows the training images closest (in L2 distance) to the generated images beside them; the two columns are visually similar, but the numbers are generally different.]
Weight uncertainty with Bayes by Backprop:

[Figure 1 of Blundell et al. (2015). Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.]
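In the spirit of the figure, a minimal sketch of weight uncertainty: each weight receives a Gaussian variational posterior and is sampled with the same reparameterisation trick. This is an illustrative reduction, not the full Bayes by Backprop algorithm (no prior or KL term is included), and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Variational posterior over a weight matrix: W ~ N(M, diag(S^2)).
M = rng.normal(size=(3, 5)) * 0.1   # posterior means
S = np.full((3, 5), 0.1)            # posterior standard deviations

def sample_weights():
    """Draw one weight sample via the reparameterisation trick."""
    return M + S * rng.normal(size=M.shape)

x = rng.normal(size=5)
preds = np.stack([np.tanh(sample_weights() @ x) for _ in range(100)])
mean, std = preds.mean(axis=0), preds.std(axis=0)   # predictive uncertainty
print(mean, std)
```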
In Review
Deep learning is a framework for building highly flexible non-linear parametric models, but regularisation and accounting for uncertainty and lack of knowledge are still needed.
[Diagram, as before: Data y → Inference Network q(z|y) → z ~ q(z|y) → Model p(y|z) → y ~ p(y|z).]
Thank You.
Some References
Probabilistic Deep Learning
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
Mnih, A., & Gregor, K. (2014). Neural variational inference and learning in belief networks. ICML.
Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with deep generative models. NIPS (pp. 3581-3589).
Gregor, K., Danihelka, I., Graves, A., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
Hernández-Lobato, J. M., & Adams, R. P. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv preprint arXiv:1502.05336.
Gal, Y., & Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142.
Variational inference recap: find the member q(z) of an approximation class that minimises KL[q(z|y) ‖ p(z|y)] to the true posterior. These are deterministic approximation procedures with bounds on probabilities of interest; we fit the variational parameters.
Deriving the variational lower bound: introduce a proposal q(z), rewrite the integrand with the importance weight p(z)/q(z), and apply Jensen's inequality (log ∫ p(x) g(x) dx ≥ ∫ p(x) log g(x) dx):

log p(y) = log ∫ p(y|z) p(z) dz
         = log ∫ q(z) p(y|z) [p(z)/q(z)] dz
         ≥ ∫ q(z) log p(y|z) dz + ∫ q(z) log [p(z)/q(z)] dz
         = E_q(z)[log p(y|z)] − KL[q(z) ‖ p(z)]
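A quick numerical sanity check of the bound, on an illustrative conjugate Gaussian model where everything is tractable: with p(z) = N(0, 1) and p(y|z) = N(z, 1), the evidence is p(y) = N(0, 2), the bound holds for any q, and it is tight at the exact posterior N(y/2, 1/2).

```python
import numpy as np

y = 0.8
log_py = -0.5 * np.log(2 * np.pi * 2.0) - y**2 / (2 * 2.0)   # exact log evidence

def elbo(m, s):
    """E_q[log p(y|z)] - KL[q(z)||p(z)] for q(z) = N(m, s^2), in closed form."""
    expected_ll = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + s**2)
    kl = 0.5 * (s**2 + m**2 - 1.0 - 2.0 * np.log(s))
    return expected_ll - kl

print(log_py)                         # -1.3987...
print(elbo(0.3, 1.0))                 # strictly below log_py
print(elbo(y / 2, np.sqrt(0.5)))      # exact posterior: bound is tight
```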
Minimum description length view: the reconstruction term is the data code-length, and the penalty KL[q(z) ‖ p(z)] is the hypothesis code.
[Diagram, as before: Data y → Encoder q(z|y) → z ~ q(z|y) → Decoder p(y|z) → y ~ p(y|z).]
More generally, the objective pairs a reconstruction term ℓ(z, y) with a penalty.
[Same encoder-decoder diagram.]
Alternating optimisation of the variational objective:
E-step (variational parameters φ): ∇_φ ∝ ∇_φ E_q[log p(y_n|z_n)] − ∇_φ KL[q(z_n) ‖ p(z_n)]
M-step (model parameters θ): ∇_θ ∝ (1/N) Σ_n ∇_θ log p_θ(y_n|z_n)
[Diagram, as before: inference network q(z|y) and model p(y|z).]
[Diagram: computational flow through the model. Forward pass: data x → inference network q(z|x) → sample z → model p(x|z), with contributions log p(x|z), log p(z) and the entropy H[q(z)]. Backward pass: gradients ∇ propagate back through the model p(x|z), the prior p(z) and the inference network q(z|x).]
Stochastic Backpropagation
A Monte Carlo method that works with continuous latent variables.

Original problem: ∇_θ E_q(z)[f(z)]
Reparameterisation: z ~ N(μ, σ²)  ⇔  z = μ + σε,  ε ~ N(0, 1)
Backpropagation with Monte Carlo: ∇ E_N(0,1)[f(μ + σε)] = E_N(0,1)[∇_{θ={μ,σ}} f(μ + σε)]

- Can use any likelihood function; avoids the need for additional lower bounds.
- Low-variance, unbiased estimator of the gradient.
- Can use just one sample from the base distribution.
- Possible for many distributions with location-scale or other known transformations, such as the CDF.
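A small numerical check of the estimator, using the illustrative test function f(z) = z² with z ~ N(μ, σ²); then E[f(z)] = μ² + σ², so the true gradients are (2μ, 2σ).

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 1.5, 0.7

f = lambda z: z**2            # illustrative test function
df = lambda z: 2 * z          # its derivative f'(z)

# Reparameterise z = mu + sigma*eps and push the gradient inside:
# dE/dmu = E[f'(mu + sigma*eps)],  dE/dsigma = E[f'(mu + sigma*eps) * eps].
eps = rng.normal(size=100_000)
z = mu + sigma * eps
grad_mu, grad_sigma = df(z).mean(), (df(z) * eps).mean()

print(grad_mu, 2 * mu)        # both close to 3.0
print(grad_sigma, 2 * sigma)  # both close to 1.4
```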
Score-function estimator (an alternative gradient estimator):
Score ratio: ∇_φ log q_φ(z|x) = ∇_φ q_φ(z|x) / q_φ(z|x)
Original problem: ∇_φ E_q[f(z)]
Apply the score ratio: ∇_φ E_q[f(z)] = E_q[f(z) ∇_φ log q(z|y)]
Estimate the expectation by Monte Carlo, using control variates to reduce variance (MCCV estimate).
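For comparison, a sketch of the score-function estimator on the same illustrative problem, using the mean of f as a simple baseline; this generic control variate is an assumption, not the specific variance-reduction scheme of the talk.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 1.5, 0.7
f = lambda z: z**2                      # illustrative test function

z = rng.normal(mu, sigma, size=100_000)
score_mu = (z - mu) / sigma**2          # d/dmu log N(z | mu, sigma^2)

baseline = f(z).mean()                  # simple control variate: E[score] = 0,
grad_mu = ((f(z) - baseline) * score_mu).mean()   # so the estimate stays unbiased
print(grad_mu, 2 * mu)                  # both close to 3.0
```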