Practical Recommendations for Gradient-Based Training of Deep
Architectures
arXiv:1206.5533v2 [cs.LG] 16 Sep 2012
Yoshua Bengio
Version 2, Sept. 16th, 2012
ing signal at each level of a hierarchy of features⁴. Unsupervised representation learning algorithms can be applied several times to learn different layers of a deep model. Several unsupervised representation learning algorithms have been proposed since then. Those covered in this chapter (such as auto-encoder variants) retain many of the properties of artificial multi-layer neural networks, relying on the back-propagation algorithm to estimate stochastic gradients. Deep Learning algorithms such as those based on the Boltzmann machine and those based on auto-encoder or sparse coding variants often include a supervised fine-tuning stage. This supervised fine-tuning as well as the gradient descent performed with auto-encoder variants also involves the back-propagation algorithm, just as when training deterministic feedforward or recurrent artificial neural networks. Hence this chapter also includes recommendations for training ordinary supervised deterministic neural networks or, more generally, most machine learning algorithms relying on iterative gradient-based optimization of a parametrized learner with respect to an explicit training criterion.

This chapter assumes that the reader already understands the standard algorithms for training supervised multi-layer neural networks, with the loss gradient computed thanks to the back-propagation algorithm (Rumelhart et al., 1986). It starts by explaining basic concepts behind Deep Learning and the greedy layer-wise pretraining strategy (Section 1.1), and recent unsupervised pre-training algorithms (denoising and contractive auto-encoders) that are closely related in the way they are trained to standard multi-layer neural networks (Section 1.2). It then reviews in Section 2 basic concepts in iterative gradient-based optimization and in particular the stochastic gradient method, gradient computation with a flow graph, automatic differentiation. The main section of this chapter is Section 3, which explains hyper-parameters in general, their optimization, and specifically covers the main hyper-parameters of neural networks. Section 4 briefly describes simple ideas and methods to debug and visualize neural networks, while Section 5 covers parallelism, sparse high-dimensional inputs, symbolic inputs and embeddings, and multi-relational learning. The chapter closes (Section 6) with open questions on the difficulty of training deep architectures and improving the optimization methods for neural networks.

⁴ In standard multi-layer neural networks trained using back-propagated gradients, the only signal that drives parameter updates is provided at the output of the network (and then propagated backwards). Some unsupervised learning algorithms provide a local source of guidance for the parameter update in each layer, based only on the inputs and outputs of that layer.

⁵ Note that in our experiments, deep architectures tend to generalize very well even when they have quite large numbers of parameters.

1.1 Deep Learning and Greedy Layer-Wise Pretraining

The notion of reuse, which explains the power of distributed representations (Bengio, 2009), is also at the heart of the theoretical advantages behind Deep Learning. The complexity theory of circuits, which include neural networks as special cases (e.g., Håstad, 1986; Håstad and Goldmann, 1991), long predates the recent research on deep learning. The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit. Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor (Bengio, 2009). The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone non-linearity on top of an affine transformation), computation of a kernel, or logic gates. Theoretical results (Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006b; Bengio and LeCun, 2007; Bengio and Delalleau, 2011) clearly identify families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep. If the same set of functions can be represented from within a family of architectures associated with a smaller VC-dimension (e.g. fewer hidden units⁵), learning theory would suggest that it can be learned
with fewer examples, yielding improvements in both computational efficiency and statistical efficiency.

Another important motivation for feature learning and Deep Learning is that they can be done with unlabeled examples, so long as the factors (unobserved random variables explaining the data) relevant to the questions we will ask later (e.g. classes to be predicted) are somehow salient in the input distribution itself. This is true under the manifold hypothesis, which states that natural classes and other high-level concepts in which humans are interested are associated with low-dimensional regions in input space (manifolds) near which the distribution concentrates, and that different class manifolds are well-separated by regions of very low density. It means that a small semantic change around a particular example can be captured by changing only a few numbers in a high-level abstract representation space. As a consequence, feature learning and Deep Learning are intimately related to principles of unsupervised learning, and they can work in the semi-supervised setting (where only a few examples are labeled), as well as in the transfer learning and multi-task settings (where we aim to generalize to new classes or tasks). The underlying hypothesis is that many of the underlying factors are shared across classes or tasks. Since representation learning aims to extract and isolate these factors, representations can be shared across classes and tasks.

One of the most commonly used approaches for training deep neural networks is based on greedy layer-wise pre-training (Bengio et al., 2007). The idea, first introduced in Hinton et al. (2006), is to train one layer of a deep architecture at a time using unsupervised representation learning. Each level takes as input the representation learned at the previous level and learns a new representation. The learned representation(s) can then be used as input to predict variables of interest, for example to classify objects. After unsupervised pre-training, one can also perform supervised fine-tuning of the whole system⁶, i.e., optimize not just the classifier but also the lower levels of the feature hierarchy with respect to some objective of interest. Combining unsupervised pre-training and supervised fine-tuning usually gives better generalization than pure supervised learning from a purely random initialization. The unsupervised representation learning algorithms for pre-training proposed in 2006 were the Restricted Boltzmann Machine or RBM (Hinton et al., 2006), the auto-encoder (Bengio et al., 2007) and a sparsifying form of auto-encoder similar to sparse coding (Ranzato et al., 2007).

⁶ The whole system composes the computation of the representation with computation of the predictor’s output.

⁷ Different regularization mechanisms have been proposed to push reconstruction error up in low density areas: denoising criterion, contractive criterion, and code sparsity. It has been argued that such constraints play a role similar to the partition function for Boltzmann machines (Ranzato et al., 2008a).

1.2 Denoising and Contractive Auto-Encoders

An auto-encoder has two parts: an encoder function $f$ that maps the input $x$ to a representation $h = f(x)$, and a decoder function $g$ that maps $h$ back in the space of $x$ in order to reconstruct $x$. In the regular auto-encoder the reconstruction function $r(\cdot) = g(f(\cdot))$ is trained to minimize the average value of a reconstruction loss on the training examples. Note that reconstruction loss should be high for most other input configurations⁷. The regularization mechanism makes sure that reconstruction cannot be perfect everywhere, while minimizing the reconstruction loss at training examples digs a hole in reconstruction error where the density of training examples is large. Examples of reconstruction loss functions include $\|x - r(x)\|^2$ (for real-valued inputs) and $-\sum_i \left[ x_i \log r_i(x) + (1 - x_i) \log(1 - r_i(x)) \right]$ (when interpreting $x_i$ as a bit or a probability of a binary event). Auto-encoders capture the input distribution by learning to better reconstruct more likely input configurations. The difference between the reconstruction vector and the input vector can be shown to be related to the log-density gradient as estimated by the learner (Vincent, 2011; Bengio et al., 2012) and the Jacobian matrix of the reconstruction with respect to the input gives information about the second derivative of the density, i.e., in which direction the density remains high when you are on a high-density
manifold (Rifai et al., 2011a; Bengio et al., 2012). In the Denoising Auto-Encoder (DAE) and the Contractive Auto-Encoder (CAE), the training procedure also introduces robustness (insensitivity to small variations), respectively in the reconstruction $r(x)$ or in the representation $f(x)$. In the DAE (Vincent et al., 2008, 2010), this is achieved by training with stochastically corrupted inputs, but trying to reconstruct the uncorrupted inputs. In the CAE (Rifai et al., 2011a), this is achieved by adding an explicit regularizing term in the training criterion, proportional to the norm of the Jacobian of the encoder, $\left\| \frac{\partial f(x)}{\partial x} \right\|^2$. But the CAE and the DAE are closely related (Bengio et al., 2012): when the noise is Gaussian and small, the denoising error minimized by the DAE is equivalent to minimizing the norm of the Jacobian of the reconstruction function $r(\cdot) = g(f(\cdot))$, whereas the CAE minimizes the norm of the Jacobian of the encoder $f(\cdot)$. Besides Gaussian noise, another interesting form of corruption has been very successful with DAEs: it is called the masking corruption and consists in randomly zeroing out a large fraction (like 20% or even 50%) of the inputs, where the zeroed out subset is randomly selected for each example. In addition to the contractive effect, it forces the learned encoder to be able to rely only on an arbitrary subset of the input features.

Another way to prevent the auto-encoder from perfectly reconstructing everywhere is to introduce a sparsity penalty on $h$, discussed below (Section 3.1).

1.3 Online Learning and Optimization of Generalization Error

The objective of learning is not to minimize training error or even the training criterion. The latter is a surrogate for generalization error, i.e., performance on new (out-of-sample) examples, and there are no hard guarantees that minimizing the training criterion will yield good generalization error: it depends on the appropriateness of the parametrization and training criterion (with the corresponding prior they imply) for the task at hand.

Many learning tasks of interest will require huge quantities of data (most of which will be unlabeled) and as the number of examples increases, so long as capacity is limited (the number of parameters is small compared to the number of examples), training error and generalization error approach each other. In the regime of such large datasets, we can consider that the learner sees an unending stream of examples (e.g., think about a process that harvests text and images from the web and feeds it to a machine learning algorithm). In that context, it is most efficient to simply update the parameters of the model after each example or few examples, as they arrive. This is the ideal online learning scenario, and in a simplified setting, we can even consider each new example $z$ as being sampled i.i.d. from an unknown generating distribution with probability density $p(z)$. More realistically, examples in online learning do not arrive i.i.d. but instead from an unknown stochastic process which exhibits serial correlation and other temporal dependencies. Many learning algorithms rely on gradient-based numerical optimization of a training criterion. Let $L(z, \theta)$ be the loss incurred on example $z$ when the parameter vector takes value $\theta$. The gradient vector for the loss associated with a single example is $\frac{\partial L(z, \theta)}{\partial \theta}$.

If we consider the simplified case of i.i.d. data, there is an interesting observation to be made: the online learner is performing stochastic gradient descent on its generalization error. Indeed, the generalization error $C$ of a learner with parameters $\theta$ and loss function $L$ is

$$C = E[L(z, \theta)] = \int p(z) L(z, \theta) \, dz$$

while the stochastic gradient from sample $z$ is

$$\hat{g} = \frac{\partial L(z, \theta)}{\partial \theta}$$

with $z$ a random variable sampled from $p$. The gradient of generalization error is

$$\frac{\partial C}{\partial \theta} = \frac{\partial}{\partial \theta} \int p(z) L(z, \theta) \, dz = \int p(z) \frac{\partial L(z, \theta)}{\partial \theta} \, dz = E[\hat{g}]$$

showing that the online gradient $\hat{g}$ is an unbiased estimator of the generalization error gradient $\frac{\partial C}{\partial \theta}$. It means that online learners, when given a stream of
non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error.

examples:

$$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{1}{B} \sum_{t'=Bt+1}^{B(t+1)} \frac{\partial L(z_{t'}, \theta)}{\partial \theta}. \qquad (1)$$
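As a rough illustration of the mini-batch update in Eq. (1), the sketch below applies it to a toy one-parameter least-squares problem. Everything specific here is an assumption made for the example and is not taken from the chapter: the loss $L(z, \theta) = \frac{1}{2}(\theta x - y)^2$, the synthetic data $y = 3x$, the constant learning rate, and the helper names `grad_loss` and `minibatch_sgd`. Indices are 0-based in the code, whereas Eq. (1) is written with 1-based indices.

```python
import random

def grad_loss(z, theta):
    """Gradient w.r.t. theta of the toy loss L(z, theta) = 0.5 * (theta * x - y)**2."""
    x, y = z
    return (theta * x - y) * x

def minibatch_sgd(examples, theta0, batch_size, learning_rate, num_updates):
    """Repeat the update of Eq. (1): theta <- theta - eps_t * mean gradient over a mini-batch."""
    theta = theta0
    for t in range(num_updates):
        # Mini-batch t holds examples B*t .. B*(t+1)-1, wrapping around at the end of the data.
        start = (t * batch_size) % len(examples)
        batch = examples[start:start + batch_size]
        avg_grad = sum(grad_loss(z, theta) for z in batch) / len(batch)
        theta = theta - learning_rate(t) * avg_grad
    return theta

# Hypothetical noiseless data generated with a true parameter of 3.0.
rng = random.Random(0)
examples = []
for _ in range(1024):
    x = rng.uniform(-1.0, 1.0)
    examples.append((x, 3.0 * x))

theta_hat = minibatch_sgd(
    examples,
    theta0=0.0,
    batch_size=32,
    learning_rate=lambda t: 0.5,  # constant eps_t, an arbitrary choice for this toy problem
    num_updates=500,
)
print(theta_hat)
```

Passing the learning rate as a function of the update index `t` keeps the sketch compatible with a decreasing schedule; a constant is used here only because the toy problem is noiseless.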
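The unbiasedness argument of Section 1.3, $E[\hat{g}] = \frac{\partial C}{\partial \theta}$, can also be checked numerically in a simplified setting. In the sketch below a finite population of examples stands in for the unknown density $p(z)$, so the gradient of the generalization error is available exactly as a population average. The toy loss, the noisy linear data and all names are assumptions made for this illustration only.

```python
import random

def grad_loss(z, theta):
    """Single-example stochastic gradient g_hat for L(z, theta) = 0.5 * (theta * x - y)**2."""
    x, y = z
    return (theta * x - y) * x

rng = random.Random(0)

# A finite population plays the role of p(z): examples are drawn uniformly from it.
population = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    population.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

theta = 1.0

# Exact gradient of C = E[L(z, theta)] under the uniform distribution on the population.
exact_grad = sum(grad_loss(z, theta) for z in population) / len(population)

# Average of many i.i.d. single-example gradients, each one an "online" gradient g_hat.
num_draws = 50000
mc_grad = sum(grad_loss(rng.choice(population), theta) for _ in range(num_draws)) / num_draws

gap = abs(mc_grad - exact_grad)
print(exact_grad, mc_grad, gap)
```

Each individual `g_hat` is noisy, but their average approaches the exact gradient as the number of draws grows, which is the sense in which the online gradient is unbiased.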
scent, sometimes called “batch gradient descent”, which corresponds to the case where $B$ equals the training set size, i.e., there is one parameter update per epoch). The great advantage of stochastic gradient descent and other online or minibatch update methods is that their convergence does not depend on the size of the training set, only on the number of updates and the richness of the training distribution. In the limit of a large or infinite training set, a batch method (which updates only after seeing all the examples) is hopeless. In fact, even for ordinary datasets of tens or hundreds of thousands of examples (or more!), stochastic gradient descent converges much faster than ordinary (batch) gradient descent, and beyond some dataset sizes the speed-up is almost linear (i.e., doubling the size almost doubles the gain)¹⁰. It is really important to use the stochastic version in order to get reasonable clock-time convergence speeds.

As for any stochastic gradient descent method (including the mini-batch case), it is important for efficiency of the estimator that each example or mini-batch be sampled approximately independently. Because random access to memory (or even worse, to disk) is expensive, a good approximation, called incremental gradient (Bertsekas, 2010), is to visit the examples (or mini-batches) in a fixed order corresponding to their order in memory or disk (repeating the examples in the same order on a second epoch, if we are not in the pure online case where each example is visited only once). In this context, it is safer if the examples or mini-batches are first put in a random order (to make sure this is the case, it could be useful to first shuffle the examples). Faster convergence has been observed if the order in which the mini-batches are visited is changed for each epoch, which can be reasonably efficient if the training set fits in computer memory.

¹⁰ On the other hand, batch methods can be parallelized easily, which becomes an important advantage with currently available forms of computing power.

2.2 Gradient Computation and Automatic Differentiation

The gradient can be either computed manually or through automatic differentiation. Either way, it helps to structure this computation as a flow graph, in order to prevent mathematical mistakes and make sure an implementation is computationally efficient. The computation of the loss $L(z, \theta)$ as a function of $\theta$ is laid out in a graph whose nodes correspond to elementary operations such as addition, multiplication, and non-linear operations such as the neural network's activation function (e.g., sigmoid or hyperbolic tangent), possibly at the level of vectors, matrices or tensors. The flow graph is directed and acyclic and has three types of nodes: input nodes, internal nodes, and output nodes. Each of its nodes is associated with a numerical output which is the result of the application of that computation (none in the case of input nodes), taking as input the output of previous nodes in a directed acyclic graph. Example $z$ and parameter vector $\theta$ (or their elements) are the input nodes of the graph (i.e., they do not have inputs themselves) and $L(z, \theta)$ is a scalar output of the graph. Note that here, in the supervised case, $z$ can include an input part $x$ (e.g. an image) and a target part $y$ (e.g. a target class associated with an object in the image). In the unsupervised case $z = x$. In a semi-supervised case, there is a mix of labeled and unlabeled examples, and $z$ includes $y$ on the labeled examples but not on the unlabeled ones.

In addition to associating a numerical output $o_a$ to each node $a$ of the flow graph, we can associate a gradient $g_a = \frac{\partial L(z, \theta)}{\partial o_a}$. The gradient will be defined and computed recursively in the graph, in the opposite direction of the computation of the nodes’ outputs, i.e., whereas $o_a$ is computed using outputs $o_p$ of predecessor nodes $p$ of $a$, $g_a$ will be computed using the gradients $g_s$ of successor nodes $s$ of $a$. More precisely, the chain rule dictates

$$g_a = \sum_s g_s \frac{\partial o_s}{\partial o_a}$$

where the sum is over immediate successors of $a$. Only output nodes have no successor, and in particular for the output node that computes $L$, the
gradient is set to 1 since $\frac{\partial L}{\partial L} = 1$, thus initializing the recursion. Manual or automatic differentiation then only requires defining the partial derivative associated with each type of operation performed by any node of the graph. When implementing gradient descent algorithms with manual differentiation the result tends to be verbose, brittle code that lacks modularity, all bad things in terms of software engineering. A better approach is to express the flow graph in terms of objects that modularize how to compute outputs from inputs as well as how to compute the partial derivatives necessary for gradient descent. One can pre-define the operations of these objects (in a “forward propagation” or fprop method) and their partial derivatives (in a “backward propagation” or bprop method) and encapsulate these computations in an object that knows how to compute its output given its inputs, and how to compute the gradient with respect to its inputs given the gradient with respect to its output. This is the strategy adopted in the Theano library¹¹ with its Op objects (Bergstra et al., 2010), as well as in libraries such as Torch¹² (Collobert et al., 2011b) and Lush¹³.

¹¹ http://deeplearning.net/software/theano/
¹² http://www.torch.ch
¹³ http://lush.sourceforge.net

Compared to Torch and Lush, Theano adds an interesting ingredient which makes it a full-fledged automatic differentiation tool: symbolic computation. The flow graph itself (without the numerical values attached) can be viewed as a symbolic representation (in a data structure) of a numerical computation. In Theano, the gradient computation is first performed symbolically, i.e., each Op object knows how to create other Ops corresponding to the computation of the partial derivatives associated with that Op. Hence the symbolic differentiation of the output of a flow graph with respect to any or all of its input nodes can be performed easily in most cases, yielding another flow graph which specifies how to compute these gradients, given the input of the original graph. Since the gradient graph typically contains the original graph (mapping parameters to loss) as a sub-graph, in order to make computations efficient it is important to automate (as done in Theano) a number of simplifications which are graph transformations preserving the semantics of the output (given the input) but yielding smaller (or more numerically stable or more efficiently computed) graphs (e.g., removing redundant computations). To take advantage of the fact that computing the loss gradient includes as a first step computing the loss itself, it is advantageous to structure the code so that both the loss and its gradient are computed at once, with a single graph having multiple outputs.

The advantages of performing gradient computations symbolically are numerous. First of all, one can readily compute gradients over gradients, i.e., second derivatives, which are useful for some learning algorithms. Second, one can define algorithms or training criteria involving gradients themselves, as required for example in the Contractive Auto-Encoder (which uses the norm of a Jacobian matrix in its training criterion, i.e., really requires second derivatives, which here are cheap to compute). Third, it makes it easy to implement other useful graph transformations such as graph simplifications or numerical optimizations and transformations that help make the numerical results more robust and more efficient (such as working in the domain of logarithms of probabilities rather than in the domain of probabilities directly). Other potential beneficial applications of such symbolic manipulations include parallelization and additional differential operators (such as the R-operator, recently implemented in Theano, which is very useful to compute the product of a Jacobian matrix $\frac{\partial f(x)}{\partial x}$ or Hessian matrix $\frac{\partial^2 L(x, \theta)}{\partial \theta^2}$ with a vector without ever having to actually compute and store the matrix itself (Pearlmutter, 1994)).

3 Hyper-Parameters

A pure learning algorithm can be seen as a function taking training data as input and producing as output a function (e.g. a predictor) or model (i.e. a bunch of functions). However, in practice, many learning algorithms involve hyper-parameters, i.e., annoying knobs to be adjusted. In many algorithms such as Deep Learning algorithms the number of hyper-parameters (ten or more!) can make the idea of having to adjust all of them unappealing. In addition, it has been shown that the use of computer clusters for hyper-parameter selection can have an important effect on results (Pinto et al., 2009). Choosing hyper-parameter values is formally equivalent to the question of model selection, i.e., given a family or set of learning algorithms, how to pick the most appropriate one inside the set? We define a hyper-parameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself. It is basically an outside control knob. It can be discrete (as in model selection) or continuous (such as the learning rate discussed above). Of course, one can hide these hyper-parameters by wrapping another learning algorithm, say B, around A, to select A’s hyper-parameters (e.g. to minimize validation set error). We can then call B a hyper-learner, and if B has no hyper-parameters itself then the composition of B over A could be a “pure” learning algorithm, with no hyper-parameter. In the end, to apply a learner to training data, one has to have a pure learning algorithm. The hyper-parameters can be fixed by hand or tuned by an algorithm, but their value has to be selected. The value of some hyper-parameters can be selected based on the performance of A on its training data, but most cannot. For any hyper-parameter that has an impact on the effective capacity of a learner, it makes more sense to select its value based on out-of-sample data (outside the training set), e.g., a validation set performance, online error, or cross-validation error. Note that some learning algorithms (in particular unsupervised learning algorithms such as algorithms for training RBMs by approximate maximum likelihood) are problematic in this respect because we cannot directly measure the quantity that is to be optimized (e.g. the likelihood) because it is intractable. On the other hand, the expected denoising reconstruction error is easy to estimate (by just averaging the denoising error over a validation set).

Once some out-of-sample data has been used for selecting hyper-parameter values, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation¹⁴, in the case of small datasets) to estimate generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside).

¹⁴ Double cross-validation applies recursively the idea of cross-validation, using an outer loop cross-validation to evaluate generalization error and then applying an inner loop cross-validation inside each outer loop split’s training subset (i.e., splitting it again into training and validation folds) in order to select hyper-parameters for that split.

3.1 Neural Network Hyper-Parameters

Different learning algorithms involve different sets of hyper-parameters, and it is useful to get a sense of the kinds of choices that practitioners have to make in choosing their values. We focus here mostly on those relevant to neural networks and Deep Learning algorithms.

3.1.1 Hyper-Parameters of the Approximate Optimization

First of all, several learning algorithms can be viewed as the combination of two elements: a training criterion and a model (e.g., a family of functions, a parametrization) on the one hand, and on the other hand, a particular procedure for approximately optimizing this criterion. Correspondingly, one should distinguish hyper-parameters associated with the optimizer from hyper-parameters associated with the model itself, i.e., typically the function class, regularizer and loss function. We have already mentioned above some of the hyper-parameters typically associated with gradient-based optimization. Here is a more extensive descriptive list, focusing on those used in stochastic (mini-batch) gradient descent (although number of training iterations is used for all iterative optimization algorithms).

• The initial learning rate ($\epsilon_0$ below, Eq. (2)). This is often the single most important hyper-parameter and one should always make sure that it has been tuned (up to approximately a factor of 2). Typical values for a neural network with standardized inputs (or inputs mapped to the (0,1) interval) are less than 1 and greater than $10^{-6}$ but these should not be taken as strict
ranges and greatly depend on the parametrization of the model. A default value of 0.01 typically works for standard multi-layer neural networks but it would be foolish to rely exclusively on this default value. If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning.

• The choice of strategy for decreasing or adapting the learning rate schedule (with hyper-parameters such as the time constant $\tau$ in Eq. (2) below). The default value of $\tau \to \infty$ means that the learning rate is constant over training iterations. In many cases the benefit of choosing other than this default value is small. An example of $O(1/t)$ learning rate schedule, used in Bergstra and Bengio (2012), is

$$\epsilon_t = \frac{\epsilon_0 \tau}{\max(t, \tau)} \qquad (2)$$

which keeps the learning rate constant for the first $\tau$ steps and then decreases it in $O(1/t^\alpha)$, with traditional recommendations (based on asymptotic analysis of the convex case) suggesting $\alpha = 1$. See Bach and Moulines (2011) for a recent analysis of the rate of convergence for the general case of $\alpha \le 1$, suggesting that smaller values of $\alpha$ should be used in the non-convex case, especially when using a gradient averaging or momentum technique (see below). An adaptive and heuristic way of automatically setting $\tau$ above is to keep $\epsilon_t$ constant until the training criterion stops decreasing significantly (by more than some relative improvement threshold) from epoch to epoch. That threshold is a less sensitive hyper-parameter than $\tau$ itself. An alternative to a fixed schedule with a couple of (global) free hyper-parameters like in the above formula is the use of an adaptive learning rate heuristic, e.g., the simple procedure proposed in Bottou (2013): at regular intervals during training, using a fixed small subset of the training set (what matters is only the number of examples used, not what fraction of the whole training set it represents), continue training with $N$ different choices of learning rate (all in parallel), and keep the value that gave the best results until the next re-estimation of the optimal learning rate. Other examples of adaptive learning rate strategies are discussed below (Sec. 6.2).

• The mini-batch size ($B$ in Eq. (1)) is typically chosen between 1 and a few hundred, e.g. $B = 32$ is a good default value, with values above 10 taking advantage of the speed-up of matrix-matrix products over matrix-vector products. The impact of $B$ is mostly computational, i.e., larger $B$ yields faster computation (with appropriate implementations) but requires visiting more examples in order to reach the same error, since there are fewer updates per epoch. In theory, this hyper-parameter should impact training time and not so much test performance, so it can be optimized separately from the other hyper-parameters, by comparing training curves (training and validation error vs amount of training time), after the other hyper-parameters (except learning rate) have been selected. $B$ and $\epsilon_0$ may slightly interact with other hyper-parameters so both should be re-optimized at the end. Once $B$ is selected, it can generally be fixed while the other hyper-parameters can be further optimized (except for a momentum hyper-parameter, if one is used).

• Number of training iterations $T$ (measured in mini-batch updates). This hyper-parameter is particular in that it can be optimized almost for free using the principle of early stopping: by keeping track of the out-of-sample error (as for example estimated on a validation set) as training progresses (every $N$ updates), one can decide how long to train for any given setting of all the other hyper-parameters. Early stopping is an inexpensive way to avoid strong overfitting, i.e., even if the other hyper-parameters would lead to overfitting, early stopping will considerably reduce the overfitting damage that would otherwise ensue. It also means that it hides the overfitting effect of other hyper-parameters, possibly obscuring the analysis that one may want to do when trying to figure out the effect of individual
hyper-parameters, i.e., it tends to even out the performance obtained by many otherwise overfitting configurations of hyper-parameters by compensating a too large capacity with a smaller training time. For this reason, it might be useful to turn early-stopping off when analyzing the effect of individual hyper-parameters. Now let us turn to implementation details. Practically, one needs to continue training beyond the selected number of training iterations T̂ (which should be the point of lowest validation error in the training run) in order to ascertain that validation error is unlikely to go lower than at the selected point. A heuristic introduced in the Deep Learning Tutorials¹⁵ is based on the idea of patience (set initially to 10000 examples in the MLP tutorial), which is a minimum number of training examples to see after the candidate selected point T̂ before deciding to stop training (i.e. before accepting this candidate as the final answer). As training proceeds and new candidate selected points T̂ (new minima of the validation error) are observed, the patience parameter is increased, either multiplicatively or additively on top of the last T̂ found. Hence, if we find a new minimum¹⁶ at t, we save the current best model, update T̂ ← t and we increase our patience up to t + constant or t × constant. Note that validation error should not be estimated after each training update (that would be really wasteful) but after every N examples, where N is at least as large as the validation set (ideally several times larger so that the early stopping overhead remains small)¹⁷.

¹⁵ http://deeplearning.net/tutorial/
¹⁶ Ideally, we should use a statistical test of significance and accept a new minimum (over a longer training period) only if the improvement is statistically significant, based on the size and variance estimates one can compute for the validation set.
¹⁷ When an extra processor on the same machine is available, validation error can conveniently be recomputed by a processor different from the one performing the training updates, allowing more frequent computation of validation error.
¹⁸ Think about a ball coming down a valley. Since it has not started from the bottom of the valley it will oscillate between its sides as it settles deeper, forcing the learning rate to be small to avoid large oscillations that would kick it out of the valley. Averaging out the local gradients along the way will cancel the opposing forces from each side of the valley.
¹⁹ Polyak averaging uses for predictions a moving average of the parameters found in the trajectory of stochastic gradient descent.

• Momentum β. It has long been advocated (Hinton, 1978, 2010) to temporally smooth out the stochastic gradient samples obtained during the stochastic gradient descent. For example, a moving average of the past gradients can be computed with ḡ ← (1−β)ḡ + βg, where g is the instantaneous gradient $\partial L(z_t,\theta)/\partial\theta$ or a mini-batch average, and β is a small positive coefficient that controls how fast the old examples get downweighted in the moving average. The simplest momentum trick is to make the updates proportional to this smoothed gradient estimator ḡ instead of the instantaneous gradient g. The idea is that it removes some of the noise and oscillations that gradient descent has, in particular in the directions of high curvature of the loss function¹⁸. A default value of β = 1 (no momentum) works well in many cases but in some cases momentum seems to make a positive difference. Polyak averaging (Polyak and Juditsky, 1992) is a related form of parameter averaging¹⁹ that has theoretical advantages and has been advocated and shown to bring improvements on some unsupervised learning procedures such as RBMs (Swersky et al., 2010). More recently, several mathematically motivated algorithms (Nesterov, 2009; Le Roux et al., 2012) have been proposed that incorporate some form of momentum and that also ensure much faster convergence (linear rather than sublinear) compared to stochastic gradient descent, at least for convex optimization problems. See also Bottou (2013) for an example of averaged SGD with successful empirical speedups in the convex case. Note however that in the pure online case (stream of examples) and under some assumptions, the sublinear rate of convergence of stochastic gradient descent with O(1/t) decrease of learning rate is an optimal rate, at least for convex problems (Nemirovski and Yudin, 1983). That would suggest that for really large train-
ing sets it may not be possible to obtain better rates than ordinary stochastic gradient descent, although the constants in front (which depend on the condition number of the Hessian) may still be greatly reduced by using second-order information online (Bottou and LeCun, 2004; Bottou and Bousquet, 2008).

• Layer-specific optimization hyper-parameters: although rarely done, it is possible to use different values of optimization hyper-parameters (such as the learning rate) on different layers of a multi-layer network. This is especially appropriate (and easier to do) in the context of layer-wise unsupervised pre-training, since each layer is trained separately (while the layers below are kept fixed). This would be particularly useful when the number of units per layer varies a lot from layer to layer. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 3.3.4).
Some researchers also advocate the use of different learning rates for the different types of parameters one finds in the model, such as biases and weights in the standard multi-layer network, but the issue becomes more important when parameters such as precision or variance are included in the lot (Courville et al., 2011).

Up to now we have only discussed the hyper-parameters in the setup where one trains a neural network by stochastic gradient descent. With other optimization algorithms, some hyper-parameters are typically different. For example, Conjugate Gradient (CG) algorithms typically have a number of line search steps (which is a hyper-parameter) and a tolerance for stopping each line search (another hyper-parameter). An optimization algorithm like L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) also has a hyper-parameter controlling the memory usage of the algorithm, the rank of the Hessian approximation kept in memory, which also has an influence on the efficiency of each step. Both CG and L-BFGS are iterative (e.g., one line search per iteration), and the number of iterations can be optimized as described above for stochastic gradient descent, with early stopping.

3.2 Hyper-Parameters of the Model and Training Criterion

Let us now turn to “model” and “criterion” hyper-parameters typically found in neural networks, especially deep neural networks.

• Number of hidden units $n_h$. Each layer in a multi-layer neural network typically has a size that we are free to set and that controls capacity. Because of early stopping and possibly other regularizers (e.g., weight decay, discussed below), it is mostly important to choose $n_h$ large enough. Larger than optimal values typically do not hurt generalization performance much, but of course they require proportionally more computation (in $O(n_h^2)$ if scaling all the layers at the same time in a fully connected architecture). As with many other hyper-parameters, there is the option of allowing a different value of $n_h$ for each hidden layer²⁰ of a deep architecture. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 3.3.4).
In a large comparative study (Larochelle et al., 2009), we found that using the same size for all layers worked generally better or the same as using a decreasing size (pyramid-like) or increasing size (upside down pyramid), but of course this may be data-dependent. For most tasks that we worked on, we find that an overcomplete²¹ first hidden layer works better than an undercomplete one. Another even more often validated empirical observation is that the optimal $n_h$ is much larger when using unsupervised pre-training in a supervised neural network, e.g., going from hundreds of units to thousands of units. A plausible explanation is that after unsupervised pre-training many of the hidden units are carrying information that is irrelevant to the specific supervised task of interest. In order to make sure that the information relevant to the task is captured, larger hidden layers are therefore necessary when using unsupervised pre-training.

²⁰ A hidden layer is a group of units that is neither an input layer nor an output layer.
²¹ Larger than the input vector.
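The quadratic cost of widening all hidden layers at once, noted in the discussion of the number of hidden units above, can be checked by counting parameters. The following is a minimal sketch, assuming a plain fully connected feedforward network; the function name `mlp_parameter_count` and the layer sizes in the usage note are illustrative assumptions, not taken from the text.

```python
def mlp_parameter_count(n_in, hidden_sizes, n_out):
    """Count weights and biases of a fully connected feedforward network.

    Illustrative helper (not from the paper): hidden_sizes lists the
    number of units n_h of each hidden layer, from bottom to top.
    """
    sizes = [n_in] + list(hidden_sizes) + [n_out]
    # One weight per (unit in layer k, unit in layer k+1) pair.
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    # One bias per non-input unit.
    biases = sum(sizes[1:])
    return weights, biases


def hidden_to_hidden_weights(hidden_sizes):
    """Weights between consecutive hidden layers only (the O(n_h^2) part)."""
    return sum(a * b for a, b in zip(hidden_sizes[:-1], hidden_sizes[1:]))
```

With, say, 784 inputs, 10 outputs and three hidden layers, going from 500 to 1000 units per layer doubles the input and output weight matrices but quadruples the hidden-to-hidden ones, which dominate as the layers grow.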
• Weight decay regularization coefficient λ. A way to reduce overfitting is to add a regularization term to the training criterion, which limits the capacity of the learner. The parameters of machine learning models can be regularized by pushing them towards a prior value, which is typically 0. L2 regularization adds a term $\lambda \sum_i \theta_i^2$ to the training criterion, while L1 regularization adds a term $\lambda \sum_i |\theta_i|$. Both types of terms can be included. There is a clean Bayesian justification for such a regularization term: it is the negative log-prior $-\log P(\theta)$ on the parameters θ. The training criterion then corresponds to the negative joint log-likelihood of data and parameters, $-\log P(\text{data}, \theta) = -\log P(\text{data} \mid \theta) - \log P(\theta)$, with the loss function $L(z, \theta)$ being interpreted as $-\log P(z \mid \theta)$ and $-\log P(\text{data} \mid \theta) = \sum_{t=1}^{T} L(z_t, \theta)$ if the data consists of T i.i.d. examples $z_t$. This detail is important to note because when one is doing stochastic gradient-based learning, it makes sense to use an unbiased estimator of the gradient of the total training criterion (including both the total loss and the regularizer), but one only considers a single mini-batch or example at a time. How should the regularizer be weighted in this sum, which is different from the sum of the regularizer and the total loss on all examples? On each mini-batch update, the gradient of the regularization penalty should be multiplied not just by λ but also by $B/T$, i.e., one over the number of updates needed to go once through the training set. When the training set size is not a multiple of B, the last mini-batch will have size $B' < B$ and the contribution of the regularizer to the mini-batch gradient should therefore be modified accordingly (i.e. scaled by $B'/B$ compared to other mini-batches). In the pure online setting (where no training set size is fixed ahead of time and examples are not iterated over again), it would then make sense to use $B/t$ at example t, or one over the number of updates to date. L2 regularization penalizes large values more strongly and corresponds to a Gaussian prior $\propto \exp\left(-\frac{1}{2}\frac{\|\theta\|^2}{\sigma^2}\right)$ with prior variance $\sigma^2 = 1/(2\lambda)$. Note that there is a connection between early stopping (see above, choosing the number of training iterations) and L2 regularization (Collobert and Bengio, 2004a), with one basically playing the same role as the other (but early stopping allowing a much more efficient selection of the hyper-parameter value, which suggests dropping L2 regularization altogether when early-stopping is used). However, L1 regularization behaves differently and can sometimes be useful, acting as a form of feature selection. L1 regularization makes sure that parameters that are not really very useful are driven to zero (i.e. encouraging sparsity of the parameter values), and corresponds to a Laplace density prior $\propto e^{-|\theta|/s}$ with scale parameter $s = 1/\lambda$. L1 regularization often helps to make the input filters²² cleaner (more spatially localized) and easier to interpret. Stochastic gradient descent will not yield actual zeros but values hovering around zero. If both L1 and L2 regularization are used, a different coefficient (i.e. a different hyper-parameter) should be considered for each, and one may also use a different coefficient for different layers. In particular, the input weights and the output weights may be treated differently.

²² The input weights of a 1st layer neuron are often called “filters” because of analogies with signal processing techniques such as convolutions.

One reason for treating output weights differently (i.e., not relying only on early stopping) is that we know that it is sufficient to regularize only the output weights in order to constrain capacity: in the limit case of the number of hidden units going to infinity, L2 regularization corresponds to Support Vector Machines (SVM) while L1 regularization corresponds to boosting (Bengio et al., 2006a). Another reason for treating inputs and outputs differently from hidden units is because they may be sparse. For example, some input features may be 0 most of the time while others are non-zero frequently. In that case, there are fewer examples that inform the model about that rarely active input feature, and the corresponding parameters (weights outgoing from the corresponding input units) should
be more regularized than the parameters associated with frequently observed inputs. A similar situation may occur with target variables that are sparse (e.g., trying to predict rarely observed events). In both cases, the effective number of meaningful updates seen by these parameters is less than the actual number of updates. This suggests scaling the regularization coefficient of these parameters by one over the effective number of updates seen by the parameter. A related formula turns up in Bayesian probit regression applied to sparse inputs (Graepel et al., 2010).

Some practitioners also choose to penalize only the weights w and not the biases b associated with the hidden unit activations w′z + b for a unit taking the vector of values z as input. This guarantees that even with strong regularization, the predictor would converge to the optimal constant predictor, rather than the one corresponding to 0 activation. For example, with the mean-square loss and the cross-entropy loss, the optimal constant predictor is the output average.

• Sparsity of activation regularization coefficient α. A common practice in the Deep Learning literature (Ranzato et al., 2007, 2008b; Lee et al., 2008, 2009; Bagnell and Bradley, 2009; Glorot et al., 2011a; Coates and Ng, 2011; Goodfellow et al., 2011) consists of adding a penalty term to the training criterion that encourages the hidden units to be sparse, i.e., with values at or near 0. Although the L1 penalty (discussed above in the case of weights) can also be applied to hidden units activations, this is mathematically very different from the L1 regularization term on parameters. Whereas the latter corresponds to a prior on the parameters, the former does not because it involves the training distribution (since we are looking at data-dependent hidden units outputs). Although we will not discuss this much here, the inspiration for a sparse representation in Deep Learning comes from the earlier work on sparse coding (Olshausen and Field, 1997). As discussed in Goodfellow et al. (2009) sparse representations may be advantageous because they encourage representations that disentangle the underlying factors of representation. A sparsity-inducing penalty is also a way to regularize (in the sense of reducing the number of examples that the learner can learn by heart) (Ranzato et al., 2008b), which means that the sparsity coefficient is likely to interact with the many other hyper-parameters which influence capacity. In general, increased sparsity can be compensated by a larger number of hidden units.

Several approaches have been proposed to induce a sparse representation (or with more hidden units whose activation is closer to 0). One approach (Ranzato et al., 2008b; Le et al., 2011; Zou et al., 2011) is simply to penalize the L1 norm of the representation or another function of the hidden units’ activation (such as the student-t log-prior). This typically makes sense for non-linearities such as the sigmoid which have a saturating output around 0, but not for the hyperbolic tangent non-linearity (whose saturation is near the -1 and 1 interval borders rather than near the origin). Another option is to penalize the biases of the hidden units, to make them more negative (Ranzato et al., 2007; Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008). Note that penalizing the bias runs the danger that the weights could compensate for the bias²³, which could hurt the numerical optimization of parameters.

²³ Because the input to the layer generally has a non-zero average, that when multiplied by the weights acts like a bias.

When directly penalizing the hidden unit outputs, several variants can be found in the literature, but no clear comparative analysis has been published to evaluate which one works better. Although the L1 penalty (i.e., simply α times the sum of output elements $h_j$ in the case of sigmoid non-linearity) would seem the most natural (because of its use in sparse coding), it is used in few papers involving sparse auto-encoders. A close cousin of the L1 penalty is the Student-t penalty ($\log(1 + h_j^2)$), originally proposed for sparse coding (Olshausen and Field, 1997). Sev-
eral researchers penalize the average output $\bar{h}_j$ (e.g. over a mini-batch), and instead of pushing it to 0, encourage it to approach a fixed target ρ. This can be done through a mean-square error penalty such as $\sum_j (\rho - \bar{h}_j)^2$, or maybe more sensibly (because $h_j$ behaves like a probability), a Kullback-Leibler divergence with respect to the binomial distribution with probability ρ, $-\rho \log \bar{h}_j - (1 - \rho) \log(1 - \bar{h}_j) + \text{constant}$, e.g., with ρ = 0.05, as in (Hinton, 2010). In addition to the regularization penalty itself, the choice of activation function can have a strong impact on the sparsity obtained. In particular, rectifying non-linearities (such as max(0, x), instead of a sigmoid) have been very successful in several instances (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a; Mesnil et al., 2011; Glorot et al., 2011b). The rectifier also relates to the hard tanh (Collobert and Bengio, 2004b), whose derivatives are also 0 or 1.
In sparse coding and sparse predictive coding (Kavukcuoglu et al., 2009) the activations are directly optimized and actual zeros are the expected result of the optimization. In that case, ordinary stochastic gradient is not guaranteed to find these zeros (it will oscillate around) and other methods such as proximal gradient are more appropriate (Bertsekas, 2010).

• Neuron non-linearity. The typical neuron output is s(a) = s(w′x + b), where x is the vector of inputs into the neuron, w the vector of weights and b the offset or bias parameter, while s is a scalar non-linear function. Several non-linearities have been proposed and some choices of non-linearities have been shown to be more successful (Jarrett et al., 2009; Glorot and Bengio, 2010; Glorot et al., 2011a). The most commonly used by the author, for hidden units, are the sigmoid $1/(1 + e^{-a})$, the hyperbolic tangent $\frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$, the rectifier max(0, a) and the hard tanh (Collobert and Bengio, 2004b). Note that the sigmoid was shown to yield serious optimization difficulties when used as the top hidden layer of a deep supervised network (Glorot and Bengio, 2010) without unsupervised pre-training, but works well for auto-encoder variants²⁴. For output (or reconstruction) units, hard neuron non-linearities like the rectifier do not make sense because when the unit is saturated (e.g. a < 0 for the rectifier) and associated with a loss, no gradient is propagated inside the network, i.e., there is no chance to correct the error²⁵. In the case of hidden layers the gradient manages to go through a subset of the hidden units, even if the others are saturated. For output units a good trick is to obtain the output non-linearity and the loss by considering the associated negative log-likelihood and choosing an appropriate (conditional) output probability model, usually in the exponential family. For example, one can typically take squared error and linear outputs to correspond to a Gaussian output model, cross-entropy and sigmoids to correspond to a binomial output model, and − log output[target class] with softmax outputs to correspond to multinomial output variables. For reasons yet to be elucidated, having a sigmoidal non-linearity on the output (reconstruction) units (along with target inputs normalized in the (0,1) interval) seems to be helpful when training the contractive auto-encoder.

²⁴ The author hypothesizes that this discrepancy is due to the fact that the weight matrix W of an auto-encoder of the form $r(x) = W^T \mathrm{sigmoid}(W x)$ is pulled towards being orthonormal since this would make the auto-encoder closer to the identity function, because $W^T W x \approx x$ when W is orthonormal and x is in the span of the rows of W.
²⁵ A hard non-linearity for the output units is very different from a hard non-linearity in the loss function, such as the hinge loss. In the latter case the derivative is 0 only when there is no error.
²⁶ By symmetry, if hidden units of the same layer share the same input and output weights, they will compute the same output and receive the same gradient, hence performing the same update and remaining identical, thus wasting capacity.

• Weights initialization scaling coefficient. Biases can generally be initialized to zero but weights need to be initialized carefully to break the symmetry between hidden units of the same layer²⁶. Because different output units receive different gradient signals, this symmetry breaking issue does not con-
cern the output weights (into the output units), which can therefore also be set to zero. Although several tricks (LeCun et al., 1998a; Glorot and Bengio, 2010) for initializing the weights into hidden layers have been proposed (i.e. a hyper-parameter is the discrete choice between them), Bergstra and Bengio (2012) also inserted as an extra hyper-parameter a scaling coefficient for the initialization range. These tricks are based on the idea that units with more inputs (the fan-in of the unit) should have smaller weights. Both LeCun et al. (1998a) and Glorot and Bengio (2010) recommend scaling by the inverse of the square root of the fan-in, although Glorot and Bengio (2010) and the Deep Learning Tutorials use a combination of the fan-in and fan-out, e.g., sample a Uniform(−r, r) with $r = \sqrt{6/(\text{fan-in} + \text{fan-out})}$ for hyperbolic tangent units and $r = 4\sqrt{6/(\text{fan-in} + \text{fan-out})}$ for sigmoid units. We have found that we could avoid any hyper-parameter related to initialization using these formulas (and the derivation in Glorot and Bengio (2010) can be used to derive the formula for other settings). Note however that in the case of RBMs, a zero-mean Gaussian with a small standard deviation around 0.1 or 0.01 works well (Hinton, 2010) to initialize the weights, while visible biases are typically set to their optimal value if the weights were 0, i.e., $\log(\bar{x}/(1 - \bar{x}))$ in the case of a binomial visible unit whose corresponding binary input feature has empirical mean $\bar{x}$ in the training set.
An important choice is whether one should use unsupervised pre-training (and which unsupervised feature learning algorithm to use) in order to initialize parameters. In most settings we have found unsupervised pre-training to help and very rarely to hurt, but of course that implies additional training time and additional hyper-parameters.

• Random seeds. There are often several sources of randomness in the training of neural networks and deep learners (such as for random initialization, sampling examples, sampling hidden units in stochastic models such as RBMs, or sampling corruption noise in denoising auto-encoders). Some random seeds could therefore yield better results than others. Because of the presence of local minima in the training criterion of neural networks (except in the linear case or with fixed lower layers), parameter initialization matters. See Erhan et al. (2010b) for an example of histograms of test errors for hundreds of different random seeds. Typically, the choice of random seed only has a slight effect on the result and can mostly be ignored in general or for most of the hyper-parameter search process. If computing power is available, then a final set of jobs with different random seeds (5 to 10) for a small set of best choices of hyper-parameter values can squeeze a bit more performance. Another way to exploit computing power to push performance a bit is model averaging, as in Bagging (Breiman, 1994) and Bayesian methods. After training them, the outputs of different networks (or in general different learning algorithms) can be averaged. For example, the difference between the neural networks being averaged into a committee may come from the different seeds used for parameter initialization, or the use of different subsets of input variables, or different subsets of training examples (the latter being called Bagging).

• Preprocessing. Many preprocessing steps have been proposed to massage raw data into appropriate inputs for neural networks and model selection must also choose among them. In addition to element-wise standardization (subtract mean and divide by standard deviation), Principal Components Analysis (PCA) has often been advocated (LeCun et al., 1998a; Bergstra and Bengio, 2012) and also allows dimensionality reduction, at the price of an extra hyper-parameter (the number of principal components retained, or the proportion of variance explained). A convenient non-linear preprocessing is the uniformization (Mesnil et al., 2011) of each feature (which estimates its cumulative distribution $F_i$ and then transforms each feature $x_i$ by its quantile $F_i^{-1}(x_i)$, i.e., returns
an approximate normalized rank or quantile for the value $x_i$). A simpler to compute transform that may help reduce the tails of input features is a non-linearity such as the logarithm or the square root, in an attempt to make them more Gaussian-like.

In addition to the above somewhat generic choices, more choices arise with different architectures and learning algorithms. For example, the denoising auto-encoder has a hyper-parameter scaling the amount of input corruption and the contractive auto-encoder has as hyper-parameter a coefficient scaling the norm of the Jacobian of the encoder, i.e., controlling the importance of the contraction penalty. The latter seems to be a rather sensitive hyper-parameter that must be tuned carefully. The contractive auto-encoder’s success also seems sensitive to the weight tying constraint used in many auto-encoder architectures: the decoder’s weight matrix is equal to the transpose of the encoder’s weight matrix. The specific architecture used in the contractive auto-encoder (with tied weights, sigmoid non-linearities on hidden and reconstruction units, along with squared loss or cross-entropy loss) works quite well but other related variants do not always train well, for reasons that remain to be understood.
There are also many architectural choices that are relevant in the case of convolutional architectures (e.g. for modeling images, time-series or sound) (LeCun et al., 1989, 1998b; Le et al., 2010) in which hidden units have local receptive fields. Their discussion is postponed to another chapter (LeCun, 2013).

3.3 Manual Search and Grid Search

Many of the hyper-parameters or model choices described above can be ignored by picking a standard trick suggested here or in some other paper. Still, one remains with a substantial number of choices to be made, which may give the impression of neural network training as an art. With modern computing facilities based on large computer clusters, it is however possible to make the optimization of hyper-parameters a more reproducible and automated process, using techniques such as grid search or better, random search, or even hyper-parameter optimization, discussed below.

3.3.1 General guidance for the exploration of hyper-parameters

First of all, let us consider recommendations for exploring hyper-parameter settings, whether with manual search, with an automated procedure, or with a combination of both. We call a numerical hyper-parameter one that involves choosing a real number or an integer (where order matters), as opposed to making a discrete symbolic choice from an unordered set. Examples of numerical hyper-parameters are regularization coefficients, number of hidden units, number of training iterations, etc. One has to think of hyper-parameter selection as a difficult form of learning: there is both an optimization problem (looking for hyper-parameter configurations that yield low validation error) and a generalization problem: there is uncertainty about the expected generalization after optimizing validation performance, and it is possible to overfit the validation error and get optimistically biased estimators of performance when comparing many hyper-parameter configurations. The training criterion for this learning is typically the validation set error, which is a proxy for generalization error. Unfortunately, the relation between hyper-parameters and validation error can be complicated. Although to first approximation we expect a kind of U-shaped curve (when considering only a single hyper-parameter, the others being fixed), this curve can also have noisy variations, in part due to the use of finite data sets.

• Best value on the border. When considering the validation error obtained for different values of a numerical hyper-parameter one should pay attention as to whether or not the best value found is near the border of the investigated interval. If it is near the border, then this suggests that better values can be found with values beyond the border: it is recommended in that case to explore further, beyond that border. Because the relation between a hyper-parameter
and validation error can be noisy, it is generally not enough to try very few values. For instance, trying only 3 values for a numerical hyper-parameter is insufficient, even if the best value found is the middle one.

• Scale of values considered. Exploring values of a numerical hyper-parameter entails choosing a starting interval to be searched, which is therefore a kind of hyper-hyper-parameter. By choosing the interval large enough to start with, but based on previous experience with this hyper-parameter, we ensure that we do not get completely wrong results. Now instead of choosing the intermediate values linearly in the chosen interval, it often makes much more sense to consider a linear or uniform sampling in the log-domain (in the space of the logarithm of the hyper-parameter). For example, the results obtained with a learning rate of 0.01 are likely to be very similar to the results with 0.011 while results with 0.001 could be quite different from results with 0.002 even though the absolute difference is the same in both cases. The ratio between different values is often a better guide of the expected impact of the change. That is why exploring uniformly or regularly-spaced values in the space of the logarithm of the numerical hyper-parameter is typically preferred for positive-valued numerical hyper-parameters.

• Computational considerations. Validation error is actually not the only measure to consider in selecting hyper-parameters. Often, one has to consider computational cost, either of training or prediction. Computing resources for training and prediction are limited and generally condition the choice of intervals of considered values: for example increasing the number of hidden units or number of training iterations also scales up computation. An interesting idea is to use computationally cheap estimators of validation error to select some hyper-parameters. For example, Saxe et al. (2011) showed that the architecture hyper-parameters of convolutional networks could be selected using random weights in the lower layers of the network (filters of the convolution). While this yields a noisy and biased (pessimistic) estimator of the validation error which would otherwise be obtained with full training, this cheap estimator appears to be correlated with the expensive validation error. Hence this cheap estimator is enough for selecting some hyper-parameters (or for keeping under consideration for further and more expensive evaluation only the few best choices found). Even without cheap estimators of generalization error, high-throughput computing (e.g., on clusters, GPUs, or clusters of GPUs) can be exploited to run not just hundreds but thousands of training jobs, something not conceivable only a few years ago, with each job taking on the order of hours or days for larger datasets. With computationally cheap surrogates, some researchers have run on the order of ten thousand trials, and we can expect future advances in parallelized computing power to boost these numbers.

3.3.2 Coordinate Descent and Multi-Resolution Search

When performing a manual search and with access to only a single computer, a reasonable strategy is coordinate descent: change only one hyper-parameter at a time, always making a change from the best configuration of hyper-parameters found up to now. Instead of a standard coordinate descent (which systematically cycles through all the variables to be optimized) one can make sure to regularly fine-tune the most sensitive variables, such as the learning rate.
Another important idea is that there is no point in exploring the effect of fine changes before one or more reasonably good settings have been found. The idea of multi-resolution search is to start the search by considering only a few values of the numerical hyper-parameters (over a large range), or considering large changes each time a new value is tried. One can then start from the one or few best configurations found and explore more locally around them with smaller variations around these values.
17
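
As a rough illustration of multi-resolution search combined with regular spacing in the log-domain, the following Python sketch searches over a single positive hyper-parameter. The function names (`log_spaced`, `multi_resolution_search`, `toy_validation_error`), the narrowing rule and the toy error surface are assumptions made for this example only; in practice `validation_error` would train a model with the given value and return its validation error.

```python
import math


def log_spaced(low, high, k):
    """Return k >= 2 values regularly spaced in the log-domain on [low, high]."""
    step = (math.log(high) - math.log(low)) / (k - 1)
    return [math.exp(math.log(low) + i * step) for i in range(k)]


def multi_resolution_search(validation_error, low, high, k=5, rounds=3):
    """Coarse-to-fine search over one positive hyper-parameter.

    Each round evaluates k log-spaced values, then narrows the interval
    to the two grid neighbours of the best value of that round.
    """
    best_value, best_error = None, float("inf")
    for _ in range(rounds):
        grid = log_spaced(low, high, k)
        errors = [validation_error(v) for v in grid]
        i = min(range(k), key=errors.__getitem__)
        if errors[i] < best_error:
            best_value, best_error = grid[i], errors[i]
        # Explore more locally around the best value in the next round.
        low = grid[max(i - 1, 0)]
        high = grid[min(i + 1, k - 1)]
    return best_value, best_error


def toy_validation_error(learning_rate):
    # Hypothetical stand-in for "train a model, measure validation error":
    # a bowl in log10(learning_rate) with its minimum at 0.03.
    return (math.log10(learning_rate) - math.log10(0.03)) ** 2


best_lr, best_err = multi_resolution_search(toy_validation_error, 1e-5, 1.0)
print(best_lr, best_err)
```

With these settings the search uses 15 evaluations and ends near the minimum of the toy surface. Whether narrowing to the immediate neighbours of the best grid point is appropriate depends on how noisy the validation error is; with noisy estimates one may prefer to keep a wider interval or several candidate values.
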
3.3.3 Automated and Semi-automated Grid Search

Once some interval or set of values has been selected for each hyper-parameter (thus defining a search space), a simple strategy that exploits parallel computing is the grid search. One first needs to convert the numerical intervals into lists of values (e.g., K regularly-spaced values in the log-domain of the hyper-parameter). The grid search is simply an exhaustive search through all the combinations of these values. The cross-product of these lists contains a number of elements that is unfortunately exponential in the number of hyper-parameters (e.g., with 5 hyper-parameters, each allowed to take 6 different values, one gets $6^5 = 7776$ configurations). In section 3.4 below we consider an approach that works more efficiently than the grid search when the number of hyper-parameters increases beyond 2 or 3.

The advantage of the grid search, compared to many other optimization strategies (such as coordinate descent), is that it is fully parallelizable. If a large computer cluster is available, it is tempting to choose a model selection strategy that can take advantage of parallelization. One practical disadvantage of grid search (especially against random search, Sec. 3.4), with a parallelized set of jobs on a cluster, is that if only one of the jobs fails²⁷ then one has to launch another volley of jobs to complete the grid (and yet a third one if any of these fails, etc.), thus multiplying the overall computing time.

²⁷ For all kinds of hardware and software reasons, a job failing is very common.

Typically, a single grid search is not enough and practitioners tend to proceed with a sequence of grid searches, each time adjusting the ranges of values considered based on the previous results obtained. Although this can be done manually, this procedure can also be automated by considering the idea of multi-resolution search to guide this outer loop. Different, more local, grid searches can be launched in the neighborhood of the best solutions found previously. In addition, the idea of coordinate descent can also be thrown in, by making each grid search focus on only a few of the hyper-parameters. For example, it is common practice to start by exploring the initial learning rate while keeping fixed (and initially constant) the learning rate descent schedule. Once the shape of the schedule has been chosen, it may be possible to further refine the learning rate, but in a smaller interval around the best value found.

Humans can get very good at performing hyper-parameter search, and having a human in the loop also has the advantage that it can help detect bugs or unwanted or unexpected behavior of a learning algorithm. However, for the sake of reproducibility, machine learning researchers should strive to use procedures that do not involve human decisions in the middle, only at the outset (e.g., setting hyper-parameter ranges, which can be specified in a paper describing the experiments).

3.3.4 Layer-wise optimization of hyper-parameters

In the case of Deep Learning with unsupervised pre-training there is an opportunity for combining coordinate descent and cheap relative validation set performance evaluation associated with some hyper-parameter choices. The idea, described by Mesnil et al. (2011) and Bengio (2011), is to perform greedy choices for the hyper-parameters associated with lower layers (near the input) before training the higher layers. One first trains (unsupervised) the first layer with different hyper-parameter values and somehow estimates the relative validation error that would be obtained from these different configurations if the final network only had this single layer as internal representation. In the common case where the ultimate task is supervised, it means training a simple supervised predictor (e.g. a linear classifier) on top of the learned representation. In the case of a linear predictor (e.g. regression or logistic regression) this can even be done on the fly while unsupervised training of the representation progresses (i.e. can be used for early stopping as well), as in (Larochelle et al., 2009). Once a set of apparently good (according to this greedy evaluation) hyper-parameter values has been found (or possibly using only the best one found), these good values can be used as a starting point to train (and hyper-optimize) a second layer in the same way, etc. The completely greedy approach is to keep only the best configuration up to
now (for the lower layers), but keeping the K best configurations overall only multiplies computational costs of hyper-parameter selection by K for layers beyond the first one, because we would still keep only the best K configurations from all the 1st layer and 2nd layer hyper-parameters as starting points for exploring 3rd layer hyper-parameters, etc. This procedure is formalized in Algorithm 1 below. Since greedy layer-wise pre-training does not modify the lower layers when pre-training the upper layers, this is also very efficient computationally. This procedure allows one to set the hyper-parameters associated with the unsupervised pre-training stage, and then there remain hyper-parameters to be selected for the supervised fine-tuning stage, if one is desired. A final supervised fine-tuning stage is strongly suggested, especially when there are many labeled examples (Lamblin and Bengio, 2010).

Algorithm 1: Greedy layer-wise hyper-parameter optimization.
input K: number of best configurations to keep at each level.
input NLEVELS: number of levels of the deep architecture
input LEVELSETTINGS: list of hyper-parameter settings to be considered for unsupervised pre-training of a level
input SFTSETTINGS: list of hyper-parameter settings to be considered for supervised fine-tuning
Initialize set of best configurations S = ∅
for L = 1 to NLEVELS do
    for C in LEVELSETTINGS do
        for H in (S or {∅}) do
            * Pretrain level L using hyper-parameter setting C for level L and the parameters obtained with setting H for lower levels.
            * Evaluate target task performance $\mathcal{L}$ using this depth-L pre-trained architecture (e.g. train a linear classifier on top of these layers and estimate validation error).
            * Push the pair (C ∪ H, $\mathcal{L}$) into S if it is among the K best performing of S.
        end for
    end for
end for
for C in SFTSETTINGS do
    for H in S do
        * Supervised fine-tuning of the pre-trained architecture associated with H, using supervised fine-tuning hyper-parameter setting C.
        * Evaluate target task performance $\mathcal{L}$ of this fine-tuned predictor (e.g. validation error).
        * Push the pair (C ∪ H, $\mathcal{L}$) into S if it is among the K best performing of S.
    end for
end for
output S the set of K best-performing models with their settings and validation performance.

3.4 Random Sampling of Hyper-Parameters

A serious problem with the grid search approach to find good hyper-parameter configurations is that it scales exponentially badly with the number of hyper-parameters considered. In the above sections we have discussed numerous hyper-parameters and if all of them were to be explored at the same time it would be impossible to use only a grid search to do so.

One may think that there are no other options simply because this is an instance of the curse of dimensionality. But as we have found in our work on Deep Learning (Bengio, 2009), if there is some structure in a target function we are trying to discover, then there is a chance to find good solutions without paying an exponential price. It turns out that in many practical cases we have encountered, there is a kind of structure that random sampling can exploit (Bergstra and Bengio, 2012). The idea of random sampling is to replace the regular grid by a random (typically uniform) sampling. Each tested hyper-parameter configuration is selected by independently sampling each hyper-parameter from a prior distribution (typically uniform in the log-domain, inside the interval of interest). For a discrete
hyper-parameter, a multinomial distribution can be defined according to our prior beliefs on the likely good values. At worst, i.e., with no prior preference at all, this would be a uniform distribution across the allowed values. In fact, we can use our prior knowledge to make this prior distribution quite sophisticated. For example, we can readily include knowledge that some values of some hyper-parameters only make sense in the context of other particular values of hyper-parameters. This is a practical consideration for example when considering layer-specific hyper-parameters when the number of layers itself is a hyper-parameter.

The experiments performed (Bergstra and Bengio, 2012) show that random sampling can be many times more efficient than grid search as soon as the number of hyper-parameters goes beyond the 2 or 3 typically seen with SVMs and vanilla neural networks. The main reason why faster convergence is observed is that it allows one to explore more values for each hyper-parameter, whereas in grid search, the same value of a hyper-parameter is repeated in exponentially many configurations (of all the other hyper-parameters). In particular, if only a small subset of the hyper-parameters really matters, then this procedure can be shown to be exponentially more efficient. What we found is that for different datasets and architectures, the subset of hyper-parameters that mattered most was different, but it was often the case that a few hyper-parameters made a big difference (and the learning rate is always one of them!). When marginalizing (by averaging or minimizing) the validation performance to visualize the effect of one or two hyper-parameters, we get a noisier picture using a random search compared to a grid search, because of the random variations of the other hyper-parameters, but one with much more resolution, because so many more different values have been considered. Practically, one can plot the curves of best validation error as the number of random trials²⁸ is increased (with mean and standard deviation, obtained by considering, for each choice of number of trials, all possible same-size subsets of trials), and this curve tells us whether we are approaching a plateau, i.e., it tells us whether it is worth it or not to continue launching jobs, i.e., we can perform a kind of early stopping in the outer optimization over hyper-parameters. Note that one should distinguish the curve of the “best trial in first N trials” from the curve of the mean (and standard deviation) of the “best in a subset of size N”. The latter is a better statistical representative of the improvements we should expect if we increase the number of trials. Even if the former has a plateau, the latter may still be on the increase, pointing to the need for more hyper-parameter configuration samples, i.e., more trials (Bergstra and Bengio, 2012). Comparing these curves with the equivalent obtained from grid search we see faster convergence with random search. On the other hand, note that one advantage of grid search compared to random sampling is that the qualitative analysis of results is easier because one can consider variations of a single hyper-parameter with all the other hyper-parameters being fixed. It may remain a valid option to do a small grid search around the best solutions found by random search, considering only the hyper-parameters that were found to matter or which concern a scientific question of interest²⁹.

²⁹ This is often the case in machine learning research, e.g.,

Random search maintains the advantage of easy parallelization provided by grid search and improves on it. Indeed, a practical advantage of random search compared to grid search is that if one of the jobs fails then there is no need to re-launch that job. It also means that if one has launched 100 random search jobs, and finds that the convergence curve still has an interesting slope, one can launch another 50 or 100 without wasting the first 100. It is not that simple to combine the results of two grid searches because they are not always compatible (i.e., one is not a subset of the other).

Finally, although random search is a useful addition to the toolbox of the practitioner, semi-automatic exploration is still helpful and one will often iterate between launching a new volley of jobs and analysis of the results obtained with
the previous volley in order to guide model design and research. What we need is more, and more efficient, automation of hyper-parameter optimization. There are some interesting steps in this direction (Hutter, 2009; Bergstra et al., 2011; Hutter et al., 2011; Srinivasan and Ramakrishnan, 2011) but much more needs to be done.

4 Debugging and Analysis

4.1 Gradient Checking and Controlled Overfitting

A very useful debugging step consists in verifying that the implementation of the gradient $\partial L / \partial \theta$ is compatible with the computation of $L$ as a function of $\theta$. If the analytically computed gradient does not match the one obtained by a finite difference approximation, this signals that a bug is probably present somewhere. First of all, by looking at which $i$ give a large relative difference between $\partial L / \partial \theta_i$ and its finite difference approximation, we can get hints as to where the problem may be. An error in sign is particularly troubling, of course. A good next step is then to verify in the same way intermediate gradients $\partial L / \partial a$, with $a$ some quantities that depend on the faulty $\theta$, such as intervening neuron activations.

As many researchers know, the gradient can be approximated by a finite difference approximation obtained from the first-order Taylor expansion of a scalar function $f$ with respect to a scalar argument $x$:

$$\frac{\partial f(x)}{\partial x} = \frac{f(x + \varepsilon) - f(x)}{\varepsilon} + O(\varepsilon)$$

But a less known fact is that a second order approximation can be achieved by considering the following alternative formula:

$$\frac{\partial f(x)}{\partial x} = \frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon} + O(\varepsilon^2).$$

The second order terms of the Taylor expansion of $f(x + \varepsilon)$ and $f(x - \varepsilon)$ cancel each other because they are even, leaving only 3rd or higher order terms, i.e., $O(\varepsilon^2)$ error after dividing the difference by $2\varepsilon$. Hence this formula is twice as expensive (not a big deal while debugging) but provides quadratically more precision.

Note that because of finite precision in the computation, there will be a difference between the analytic (even correct) and finite difference gradient. Contrary to naive expectations, the relative difference may grow if we choose an $\varepsilon$ that is too small, i.e., the error should first decrease as $\varepsilon$ is decreased, and then may worsen when numerical precision kicks in, due to non-linearities. We have often used a value of $\varepsilon = 10^{-4}$ in neural networks, a value that is sufficiently small to detect most bugs.

Once the gradient is known to be well computed, another sanity check is that gradient descent (or any other gradient-based optimization) should be able to overfit on a small training set³⁰. In particular, to factor out effects of SGD hyper-parameters, a good sanity check for the code (and the other hyper-parameters) is to verify that one can overfit on a small training set using a powerful second order method such as L-BFGS. For any optimizer, though, as the number of examples is increased, the degradation of training error should be gradual while validation error should improve. And one typically sees the advantages of SGD over batch second-order methods like L-BFGS increase as the training set size increases. The break-even point may depend on the task, parallelization (multi-core or GPU, see Sec. 5 below), and architecture (number of computations compared to number of parameters, per example).

³⁰ In principle, bad local minima could prevent that, but in the overfitting regime, e.g., with more hidden units than examples, the global minimum of the training error can generally be reached almost surely from random initialization, presumably because the training criterion becomes convex in the parameters that suffice to get the training error to zero (Bengio et al., 2006a), i.e., the output weights of the neural network.

Of course, the real goal of learning is to achieve good generalization error, and the latter can be estimated by measuring performance on an independent test set. When test error is considered too high, the first question to ask is whether it is because of a difficulty in optimizing the training criterion or because of overfitting. Comparing training error and test error (and how they change as we change hyper-parameters that influence capacity,
such as the number of training iterations) helps to answer that question. Depending on the answer, of course, the appropriate ways to improve test error are different. Optimization difficulties can be fixed by looking for bugs in the training code, inappropriate values of optimization hyper-parameters, or simply insufficient capacity (e.g. not enough degrees of freedom, hidden units, embedding sizes, etc.). Overfitting difficulties can be addressed by collecting more training data, introducing more or better regularization terms, multi-task training, unsupervised pre-training, an unsupervised term in the training criterion, or considering different function families (or neural network architectures). In a multi-layer neural network, both problems can be simultaneously present. For example, as discussed in Bengio et al. (2007) and Bengio (2009), it is possible to have zero training error with a large top-level hidden layer that allows the output layer to overfit, while the lower layers are not doing a good job of extracting useful features because they were not properly optimized.

Unless using a framework such as Theano which automatically handles the efficient allocation of buffers for intermediate results, it is important to pay attention to such buffers in the design of the code. The first objective is to avoid memory allocation in the middle of the training loop, i.e., all memory buffers should be allocated once and for all. Careless reuse of the same memory buffers for different uses can however lead to bugs, which can be checked, in the debugging phase, by initializing buffers to the NaN (Not-A-Number) value, which propagates into downstream computation (making it easy to detect that uninitialized values were used)³¹.

³¹ Personal communication from David Warde-Farley, who learned this trick from Sam Roweis.

4.2 Visualizations and Statistics

The most basic statistics that should be measured during training are error statistics. The average loss on the training set and the validation set and their evolution during training are very useful to monitor progress and differentiate overfitting from poor optimization. To make comparisons easier, it may be useful to compare neural networks during training in terms of their “age” (number of updates made times mini-batch size B, i.e., number of examples visited) rather than in terms of number of epochs (which is very sensitive to the training set size).

When using unsupervised training to learn the first few layers of a deep architecture, a very common debugging and analysis tool is the visualization of filters, i.e., of the weight vectors associated with individual hidden units. This is simplest in the case of the first layer and where the inputs are images (or image patches), time-series, or spectrograms (all of which are visually interpretable). Several recipes have been proposed to extend this idea to visualize the preferred input of hidden units in layers that follow the first one (Lee et al., 2008; Erhan et al., 2010a). In the case of the first layer, since one often obtains Gabor filters, a parametric fit of these filters to the weight vector can be done so as to visualize the distribution of orientations, positions and scales of the learned filters. An interesting special case of visualizing first-layer weights is the visualization of word embeddings (see Section 5.3 below) using a dimensionality reduction technique such as t-SNE (van der Maaten and Hinton, 2008).

An extension of the idea of visualizing filters (which can apply to non-linear or deeper features) is that of visualizing local (around the given test point) leading tangent vectors, i.e., the main directions in input space to which the representation (at a given layer) is most sensitive (Rifai et al., 2011b).

In the case where the inputs are not images or easily visualizable, or to get a sense of the weight values in different hidden units, Hinton diagrams (Hinton, 1989) are also very useful, using small squares whose color (black or white) indicates a weight’s sign and whose area represents its magnitude.

Another way to visualize what has been learned by an unsupervised (or joint label-input) model is to look at samples from the model. Sampling procedures have been defined at the outset for RBMs, Deep Belief Nets, and Deep Boltzmann Machines, for example based on Gibbs sampling. When weights become larger, mixing between modes can become very slow with Gibbs sampling. An interesting alternative is rates-FPCD (Tieleman and Hinton, 2009;
Breuleux et al., 2011) which appears to be more robust to this problem and generally mixes faster, but at the cost of losing theoretical guarantees.

In the case of auto-encoder variants, it was not clear until recently whether they were really capturing the underlying density (since they are not optimized with respect to the maximum likelihood principle or an approximation of it). It was therefore even less clear if there existed appropriate sampling algorithms for auto-encoders, but a recent proposal for sampling from contractive auto-encoders appears to be working very well (Rifai et al., 2012), based on arguments about the geometric interpretation of the first derivative of the encoder (Bengio et al., 2012), showing that denoising and contractive auto-encoders capture local moments (first and second) of the training density.

To get a sense of what individual hidden units represent, it has also been proposed to vary only one unit while keeping the others fixed, e.g., to the value obtained by finding the hidden-unit representation associated with a particular input example.

Another interesting technique is the visualization of the learning trajectory in function space (Erhan et al., 2010b). The idea is to associate the function (as opposed to simply the parameters) computed by a neural network with a low-dimensional (2-D or 3-D) representation, e.g., with the t-SNE (van der Maaten and Hinton, 2008) or Isomap (Tenenbaum et al., 2000) algorithms, and then plot the evolution of this function during training, or the population of such trajectories for different initializations. This provides visualization of effective local minima³² and shows that no two different random initializations ended up in the same effective local minimum.

³² It is difficult to know for sure if it is a true local minimum or if it appears like one because the optimization algorithm is stuck.

Finally, another useful type of visualization is to display statistics (e.g., histogram, mean and standard deviation) of activations (inputs and outputs of the non-linearities at each layer), activation gradients, parameters and parameter gradients, by groups (e.g. different layers, biases vs weights) and across training iterations. See Glorot and Bengio (2010) for a practical example. A particularly interesting quantity to monitor is the discriminative ability of the representations learnt at each layer, as discussed in (Montavon et al., 2012), and ultimately leading to an analysis of the disentangled factors captured by the different layers as we consider deeper architectures.

5 Other Recommendations

5.1 Multi-core machines, BLAS and GPUs

Matrix operations are the most time-consuming in efficient implementations of many machine learning algorithms and this is particularly true of neural networks and deep architectures. The basic operations are matrix-vector products (forward propagation and back-propagation) and vector times vector outer products (resulting in a matrix of weight gradients). Matrix-matrix multiplications can be done substantially faster than the equivalent sequence of matrix-vector products for two reasons: by smart caching mechanisms such as implemented in the BLAS library (which is called from many higher-level environments such as Python’s numpy and Theano, Matlab, Torch or Lush), and thanks to parallelism. Appropriate versions of BLAS can distribute these computations across the cores of multi-core machines. The speed-up is however generally a fraction of the total speedup one can hope for (e.g. 4× on a 4-core machine), because of communication overheads and because not all computation is parallelized. Parallelism becomes more efficient when the size of these matrices is increased, which is why mini-batch updates can be computationally advantageous, and more so when more cores are present.

The extreme multi-core machines are the GPUs (Graphics Processing Units), with hundreds of cores. Unfortunately, they also come with constraints and specialized compilers which make it more difficult to fully take advantage of their potential. On 512-core machines, we are routinely able to get speed-ups of 4× to 40× for large neural networks. To make the
use of GPUs practical, it really helps to use existing libraries that efficiently implement computations on GPUs. See Bergstra et al. (2010) for a comparative study of the Theano library (which compiles numpy-like code for GPUs). One practical issue is that only the GPU-compiled operations will typically be done on the GPU, and that transfers between the GPU and CPU considerably slow things down. It is important to use a profiler to find out what is done on the GPU and how efficient these operations are in order to quickly invest one’s time where needed to make an implementation GPU-efficient and keep most operations on the GPU card.

5.2 Sparse High-Dimensional Inputs

Sparse high-dimensional inputs can be efficiently handled by traditional supervised neural networks by using a sparse matrix multiplication. Typically, the input is a sparse vector while the weights are in a dense matrix, and one should use an efficient implementation made for just this case in order to optimally take advantage of sparsity. There is still going to be an overhead on the order of 2× or more (on the multiply-add operations, not the others) compared to a dense implementation of the matrix-vector product.

For many unsupervised learning algorithms there is unfortunately a difficulty. The computation for these learning algorithms usually involves some kind of reconstruction of the input (like for all auto-encoder variants, but also for RBMs and sparse coding variants), as if the inputs were in the output space of the learner. Two exceptions to this problem are semi-supervised embedding (Weston et al., 2008) and Slow Feature Analysis (Wiskott and Sejnowski, 2002; Berkes and Wiskott, 2002). The former pulls the representation of nearby examples near each other and pushes dissimilar points apart, while also tuning the representation for a supervised learning task. The latter maximizes the learned features’ variance while minimizing their covariance and maximizing their temporal auto-correlation.

For algorithms that do need a form of input reconstruction, an efficient approach based on sampled reconstruction (Dauphin et al., 2011) has been proposed, successfully implemented and evaluated for the case of auto-encoders and denoising auto-encoders. The first idea is that on each example (or mini-batch), one samples a subset of the elements of the reconstruction vector, along with the associated reconstruction loss. One only needs to compute the reconstruction and the loss associated with these sampled elements (or features), as well as the associated back-propagation operations into hidden units and reconstruction weights. That alone would multiplicatively reduce the computational cost by the amount of sparsity but make the gradient much more noisy and possibly biased as well, if the sampling distribution is not uniform. To reduce the variance of that estimator, the idea is to guess for which features the reconstruction loss will be larger and to sample these features (and their loss) with higher probability. In particular, the authors always sample the features with a non-zero in the input (or the corrupted input, in the denoising case), and uniformly sample an equal number of those with a zero in the input and corrupted input. To make the estimator unbiased now requires introducing a weight on the reconstruction loss associated with each sampled feature, inversely proportional to the probability of sampling it, i.e., this is an importance sampling scheme. The experiments show that the speed-up increases linearly with the amount of sparsity while the average loss is optimized as well as in the deterministic full-computation case.

5.3 Symbolic Variables, Embeddings, Multi-Task Learning and Multi-Relational Learning

Parameter sharing (Lang and Hinton, 1988; LeCun, 1989; Caruana, 1993; Baxter, 1995, 1997) is an old neural network technique for increasing statistical power: if a parameter is used in N times more contexts (different tasks, different parts of the input, etc.) then it may be as if we had N times more training examples for tuning its value. More examples to estimate a parameter reduces its variance (with respect to sampling of training examples), which directly influences generalization error: for example the generalization mean squared error can
be decomposed as the sum of a bias term and a variance term (Geman et al., 1992). The reuse idea was first exploited by applying the same parameter to different parts of the input, as in convolutional neural networks (Lang and Hinton, 1988; LeCun, 1989). Reuse was also exploited by sharing the lower layers of a network (and the representation of the input that they capture) across multiple tasks associated with different outputs of the network (Caruana, 1993; Baxter, 1995, 1997). This idea is also one of the key motivations behind Deep Learning (Bengio, 2009) because one can think of the intermediate features computed in higher (deeper) layers as different tasks that can share the sub-features computed in lower layers (nearer the input). This very basic notion of reuse is key to improving generalization in many settings, guiding the design of neural network architectures in practical applications as well.

An interesting special case of these ideas is in the context of learning with symbolic data. If some input variables are symbolic, taking value in a finite alphabet, they can be represented as neural network inputs by a one-hot subvector of the input vector (with a 0 everywhere except for a 1 at the position associated with the particular symbol). Now, sometimes different input variables refer to different instances of the same type of symbol. A patent example is with neural language models (Bengio et al., 2003; Bengio, 2008), where the input is a sequence of words. In these models, the same input layer weights are reused for words at different positions in the input sequence (as in convolutional networks). The product of a one-hot sub-vector with this shared weight matrix is a generally dense vector, and this associates each symbol in the alphabet with a point in a vector space [33], which we call its embedding. The idea of vector space representations for words and symbols is older (Deerwester et al., 1990) and is a particular case of the notion of distributed representation (Hinton, 1986, 1989) central to the connectionist approaches. Learned embeddings of symbols (or other objects) can be conveniently visualized using a dimensionality reduction algorithm such as t-SNE (van der Maaten and Hinton, 2008).

[33] The result of the matrix multiplication, which equals one of the columns of the matrix.

In addition to sharing the embedding parameters across positions of words in an input sentence, Collobert et al. (2011a) share them across natural language processing tasks such as Part-Of-Speech tagging, chunking and semantic role labeling. Parameter sharing is a key idea behind convolutional nets, recurrent neural networks and dynamic Bayes nets, in which the same parameters are used for different temporal or spatial slices of the data. This idea has been generalized from sequences and 2-D images to arbitrary graphs with recursive neural networks or recursive graphical models (Pollack, 1990; Frasconi et al., 1998; Bottou, 2011; Socher et al., 2011), Markov Logic Networks (Richardson and Domingos, 2006) and relational learning (Getoor and Taskar, 2006). A relational database can be seen as a set of objects (or typed values) and relations between them, of the form (object1, relation-type, object2). The same global set of parameters can be shared to characterize such relations, across relations (which can be seen as tasks) and objects. Object-specific parameters are the parameters specifying the embedding of a particular discrete object. One can think of the elements of each embedding vector as implicit learned attributes. Different tasks may demand different attributes, so that objects which share some underlying characteristics and behavior should end up having similar values of some of their attributes. For example, words appearing in semantically and syntactically similar contexts end up getting a very close embedding (Collobert et al., 2011a). If the same attributes can be useful for several tasks, then statistical power is gained through parameter sharing, and transfer of information between tasks can happen, making the data of some task informative for generalizing properly on another task.

The idea proposed in Bordes et al. (2011, 2012) is to learn an energy function that is lower for positive (valid) relations present in the training set, and parametrized in two parts: on the one hand the symbol embeddings and on the other hand the rest of the neural network that maps them to a scalar energy. In addition, by considering relation types themselves as particular symbolic objects, the model can reason about relations themselves and have relations
between relation types. For example, ‘To be’ can act as a relation type (in subject-attribute relations) but in the statement “‘To be’ is a verb” it appears both as a relation type and as an object of the relation.

Such multi-relational learning opens the door to the application of neural networks outside of their traditional applications, which were based on a single homogeneous source of data, often seen as a matrix with one row per example and one column (or group of columns) per random variable. Instead, one often has multiple heterogeneous sources of data (typically providing examples seen as a tuple of values), each involving different random variables. So long as these different sources share some variables, then the above multi-relational multi-task learning approaches can be applied. Each variable can be associated with its embedding function (that maps the value of a variable to a generic representation space that is valid across tasks and data sources). This framework can be applied not only to symbolic data but to mixed symbolic/numeric data if the mapping from object to embedding is generalized from a table look-up to a parametrized function (the simplest being a linear mapping) from its raw attributes (e.g., image features) to its embedding. This has been exploited successfully to design image search systems in which images and queries are mapped to the same semantic space (Weston et al., 2011).

6 Open Questions

6.1 On the Added Difficulty of Training Deeper Architectures

There are experimental results which provide some evidence that, at least in some circumstances, deeper neural networks are more difficult to train than shallow ones, in the sense that there is a greater chance of missing out on better minima when starting from random initialization. This is borne out by all the experiments where we find that some initialization scheme can drastically improve performance. In the Deep Learning literature this has been shown with the use of unsupervised pre-training (supervised or not), both applied to supervised tasks, namely training a neural network for classification (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007), and unsupervised tasks, namely training a Deep Boltzmann Machine to model the data distribution (Salakhutdinov and Hinton, 2009).

The learning trajectory visualizations of Erhan et al. (2010b) have shown that even when starting from nearby configurations in function space, different initializations seem to always fall in a different effective local minimum. Furthermore, the same study showed that the minima found when using unsupervised pre-training were far in function space from those found from random initialization, in addition to giving better generalization error. Both of these findings highlight the importance of initialization, hence of local minima effects, in deep networks. Finally, it has been shown that these effects were both increased when considering deeper architectures (Erhan et al., 2010b).

There are also results showing that specific ways of setting the initial distribution and ordering of examples (“curriculum learning”) can yield better solutions (Elman, 1993; Bengio et al., 2009; Krueger and Dayan, 2009). This also suggests that very particular ways of initializing parameters, very different from uniformly sampled, can have a strong impact on the solutions found by gradient descent. The hypothesis proposed in Bengio et al. (2009) is that curriculum learning can act similarly to a continuation method, i.e., starting from an easier optimization task (e.g. convex) and tracking the local minimum as the learning task is gradually made more difficult and closer to the real task of interest.

Why would training deeper networks be more difficult? This is clearly still an open question. A plausible partial answer is that deeper networks are also more non-linear (since each layer composes more non-linearity on top of the previous ones), making gradient-based methods less efficient. It may also be that the number and structure of local minima both change qualitatively as we increase depth. Theoretical arguments support a potentially exponential gain in expressive power of deeper architectures (Bengio, 2009; Bengio and Delalleau, 2011) and it would be plausible that with this added expressive power coming from the combinatorics of composed reuse of subfunctions could come a corresponding increase in the number (and possibly quality) of local minima. But the best ones could then also be more difficult to find.

On the practical side, several experimental results point to factors that may help training deep architectures:

• A local training signal. What many successful procedures for training deep networks have in common is that they involve a local training signal that helps each layer decide what to do without requiring the back-propagation of gradients through many non-linearities. This includes of course the many variants of greedy layer-wise pre-training but also the less well-known semi-supervised embedding algorithm (Weston et al., 2008).

• Initialization in the right range. Based on the idea that both activations and gradients should be able to flow well through a deep architecture without significant reduction in variance, Glorot and Bengio (2010) proposed setting up the initial weights to make the Jacobian of each layer have singular values near 1 (or preserve variance in both directions). In their experiments this clearly helped, greatly reducing the gap between purely supervised and pre-trained deep networks.

• Choice of non-linearities. In the same study (Glorot and Bengio, 2010) and a follow-up (Glorot et al., 2011a) it was shown that the choice of hidden layer non-linearities interacted with depth. In particular, without unsupervised pre-training, a deep neural network with sigmoids in the top hidden layer would get stuck for a long time on a plateau and generally produce inferior results, due to the special role of 0 and of the initial gradients from the output units. Symmetric non-linearities like the hyperbolic tangent did not suffer from that problem, while softer non-linearities (without exponential tails) such as the softsign function s(a) = a/(1+|a|) worked even better. In Glorot et al. (2011a) it was shown that an asymmetric but hard-limiting non-linearity such as the rectifier (s(a) = max(0, a), see also Nair and Hinton, 2010) actually worked very well (but should not be used for output units), in spite of the prior belief that when hidden units are saturated, gradients would not flow well into lower layers. In fact gradients flow very well, but on selected paths, possibly making the credit assignment (which parameters should change to handle the current error) sharper and the Hessian condition number better. A recent heuristic that is related to the difficulty of gradient propagation through neural net non-linearities is the idea of “centering” the non-linear operation such that each hidden unit has zero average output and zero average slope (Schraudolph, 1998; Raiko et al., 2012).

6.2 Adaptive Learning Rates and Second-Order Methods

To improve convergence and remove learning rates from the list of hyper-parameters, many authors have advocated exploring adaptive learning rate methods, either for a global learning rate (Cho et al., 2011), a layer-wise learning rate, a neuron-wise learning rate, or a parameter-wise learning rate (Bordes et al., 2009) (which then starts to look like a diagonal Newton method). LeCun (1987); LeCun et al. (1998a) advocate the use of a second-order diagonal Newton (always positive) approximation, with one learning rate per parameter (associated with the approximated inverse second derivative of the loss with respect to the parameter). Hinton (2010) proposes scaling learning rates so that the average weight update is on the order of 1/1000th of the weight magnitude. LeCun et al. (1998a) also propose a simple power method in order to estimate the largest eigenvalue of the Hessian (whose inverse would be the optimal learning rate). An interesting alternative to variants of Newton’s method are variants of the natural gradient method (Amari, 1998), but like the basic Newton method it is computationally too expensive, requiring operations on a too large square matrix (number of parameters by number of parameters). Diagonal and low-rank online approximations of natural gradient (Le Roux et al., 2008; Le Roux et al., 2011)
have been proposed and shown to speed up training in some contexts. Several adaptive learning rate procedures have been proposed recently and merit more attention and evaluations in the neural network context, such as adagrad (Duchi et al., 2011) and the adaptive learning rate method from Schaul et al. (2012) which claims to remove completely the need for a learning rate hyper-parameter.

Whereas stochastic gradient descent converges very quickly initially, it is generally slower than second-order methods for the final convergence, and this may be important in some applications. As a consequence, batch training algorithms (performing only one update after seeing the whole training set) such as the Conjugate Gradient method (a second-order method) have dominated stochastic gradient descent for not too large datasets (e.g. less than thousands or tens of thousands of examples). Furthermore, it has recently been proposed and successfully applied to use second-order methods over large mini-batches (Le et al., 2011; Martens, 2010). The idea is to do just a few iterations of the second-order methods on each mini-batch and then move on to the next mini-batch, starting from the best previous point found. A useful twist is to start training with one or more epochs of SGD, since SGD remains the fastest optimizer early on in training.

At this point in time however, although the second-order and natural gradient methods are appealing conceptually, have demonstrably helped in the studied cases and may in the end prove to be very important, they have not yet become a standard for neural network optimization and need to be validated and maybe improved by other researchers, before displacing simple (mini-batch) stochastic gradient descent variants.

6.3 Conclusion

In spite of decades of experimental and theoretical work on artificial neural networks, and with all the impressive progress made since the first edition of this book, in particular in the area of Deep Learning, there is still much to be done to better train neural networks and better understand the underlying issues that can make the training task difficult. As stated in the introduction, the wisdom distilled here should be taken as a guideline, to be tried and challenged, not as a practice set in stone. The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence.

Acknowledgements

The author is grateful for the comments and feedback provided by Nicolas Le Roux, Ian Goodfellow, James Bergstra, Guillaume Desjardins, Razvan Pascanu, David Warde-Farley, Eric Larsen, Frederic Bastien, and Sina Honari, as well as for the financial support of NSERC, FQRNT, CIFAR, and the Canada Research Chairs.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms. In NIPS’2011.
Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse coding. In NIPS’2009, pages 113–120.
Baxter, J. (1995). Learning internal representations. In COLT’95, pages 311–320.
Baxter, J. (1997). A Bayesian/information theoretic model of learning via multiple task sampling. Machine Learning, 28, 7–40.
Bengio, Y. (2008). Neural net language models. Scholarpedia, 3(1), 3881.
Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.
Bengio, Y. (2011). Deep learning of representations for unsupervised and transfer learning. In JMLR W&CP: Proc. Unsupervised and Transfer Learning.
Bengio, Y. and Delalleau, O. (2011). On the expressive power of deep architectures. In ALT’2011.
Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large Scale Kernel Machines.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. JMLR, 3, 1137–1155.
Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex neural networks. In NIPS’2005, pages 123–130.
Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions for local kernel machines. In NIPS’2005, pages 107–114.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS’2006.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In ICML’09.
Bengio, Y., Alain, G., and Rifai, S. (2012). Implicit density estimation by local moment matching to sample from auto-encoders. Technical report, arXiv:1207.0057.
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Machine Learning Res., 13, 281–305.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proc. Python for Scientific Comp. Conf. (SciPy).
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In NIPS’2011.
Berkes, P. and Wiskott, L. (2002). Applying slow feature analysis to image sequences yields a rich repertoire of complex cell properties. In ICANN’02, pages 81–86.
Bertsekas, D. P. (2010). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Technical Report 2848, LIDS.
Bordes, A., Bottou, L., and Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10, 1737–1754.
Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning structured embeddings of knowledge bases. In AAAI 2011.
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning representations for open-text semantic parsing. AISTATS’2012.
Bottou, L. (2011). From machine learning to machine reasoning. Technical report, arXiv:1102.1808.
Bottou, L. (2013). Large-scale learning with stochastic gradient descent. In K.-R. Müller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer.
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS’2008.
Bottou, L. and LeCun, Y. (2004). Large-scale on-line learning. In NIPS’2003.
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.
Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.
Caruana, R. (1993). Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School, pages 372–379.
Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In ICML’2011, pages 105–112.
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse coding and vector quantization. In ICML’2011.
Collobert, R. and Bengio, S. (2004a). Links between perceptrons, MLPs and SVMs. In ICML’2004.
Collobert, R. and Bengio, S. (2004b). Links between perceptrons, MLPs and SVMs. In International Conference on Machine Learning, ICML.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011a). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.
Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-slab RBMs. In ICML’2011.
Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Sampled reconstruction for large-scale learning of embeddings. In Proc. ICML’2011.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. J. Am. Soc. Information Science, 41(6), 391–407.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 781–799.
Erhan, D., Courville, A., and Bengio, Y. (2010a). Understanding representations learned in deep architectures. Technical Report 1355, Université de Montréal/DIRO.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010b). Why does unsupervised pre-training help deep learning? J. Machine Learning Res., 11, 625–660.
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Getoor, L. and Taskar, B. (2006). Introduction to Statistical Relational Learning. MIT Press.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS’2010, pages 249–256.
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In AISTATS’2011.
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML’2011.
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS’2009, pages 646–654.
Goodfellow, I., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.
Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In ICML’2010.
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In STOC’86, pages 6–20.
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129.
Hinton, G. E. (1978). Relaxation and its role in vision. Ph.D. thesis, University of Edinburgh.
Hinton, G. E. (1986). Learning distributed representations of concepts. In Proc. 8th Annual Conf. Cog. Sc. Society, pages 1–12.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003, Department of Computer Science, University of Toronto.
Hinton, G. E. (2013). A practical guide to training restricted Boltzmann machines. In K.-R. Müller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer.
Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hutter, F. (2009). Automated Configuration of Algorithms for Solving Hard Computational Problems. Ph.D. thesis, University of British Columbia.
Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In LION-5.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In ICCV’09.
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In CVPR’2009.
Krueger, K. A. and Dayan, P. (2009). Flexible shaping: how learning in small steps helps. Cognition, 110, 380–394.
Lamblin, P. and Bengio, Y. (2010). Important gains from supervised fine-tuning of deep architectures on large labeled sets. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop.
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.
Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In ICML’2008.
Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for training deep neural networks. J. Machine Learning Res., 10, 1–40.
Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled convolutional neural networks. In NIPS’2010.
Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization methods for deep learning. In ICML’2011.
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient algorithm. In NIPS’07.
Le Roux, N., Bengio, Y., and Fitzgibbon, A. (2011). Improving first and second-order methods by modeling uncertainty. In Optimization for Machine Learning. MIT Press.
Le Roux, N., Schmidt, M., and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, arXiv:1202.6258.
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de Paris VI.
LeCun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto.
LeCun, Y. (2013). to appear. In K.-R. Müller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
LeCun, Y., Bottou, L., Orr, G. B., and Müller, K. (1998a). Efficient backprop. In Neural Networks, Tricks of the Trade.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient based learning applied to document recognition. IEEE, 86(11), 2278–2324.
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML’2009.
Martens, J. (2010). Deep learning via Hessian-free optimization. In ICML’2010, pages 735–742.
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7.
Montavon, G., Braun, M. L., and Muller, K.-R. (2012). Deep Boltzmann machines as feed-forward hierarchies. In AISTATS’2012.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML’2010.
Nemirovski, A. and Yudin, D. (1983). Problem complexity and method efficiency in optimization. Wiley.
Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1), 221–259.
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37, 3311–3325.
Pearlmutter, B. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.
Pinto, N., Doukhan, D., DiCarlo, J. J., and Cox, D. D. (2009). A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput Biol, 5(11), e1000579.
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–105.
Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30(4), 838–855.
Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made easier by linear transformations in perceptrons. In AISTATS’2012.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06.
Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008a). Sparse feature learning for deep belief networks. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS’07), pages 1185–1192, Cambridge, MA. MIT Press.
Ranzato, M., Boureau, Y., and LeCun, Y. (2008b). Sparse feature learning for deep belief networks. In NIPS’2007.
Richardson, M. and Domingos, P. (2006). Markov logic networks. Machine Learning, 62, 107–136.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-encoders: Explicit invariance during feature extraction. In ICML’2011.
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011b). The manifold tangent classifier. In NIPS’2011.
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML’2012.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Salakhutdinov, R. and Hinton, G. (2009). Deep Boltzmann machines. In AISTATS’2009.
Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On random weights and unsupervised feature learning. In ICML’2011.
Schaul, T., Zhang, S., and LeCun, Y. (2012). No More Pesky Learning Rates. Technical report.
Schraudolph, N. N. (1998). Centering neural network gradient factors. In G. B. Orr and K.-R. Muller, editors, Neural Networks: Tricks of the Trade, pages 548–548. Springer.
Socher, R., Manning, C., and Ng, A. Y. (2011). Parsing natural scenes and natural language with recursive neural networks. In ICML’2011.
Srinivasan, A. and Ramakrishnan, G. (2011). Parameter screening and optimisation for ILP using designed experiments. Journal of Machine Learning Research, 12, 627–662.
Swersky, K., Chen, B., Marlin, B., and de Freitas, N. (2010). A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In Information Theory and Applications Workshop.
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In ICML’2009.
van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine Learning Res., 9.
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7).
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Res., 11.
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008.
Weston, J., Bengio, S., and Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI.
Wiskott, L. and Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.
Zou, W. Y., Ng, A. Y., and Yu, K. (2011). Unsupervised learning of visual invariance with temporal coherence. In NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning.
Published as a conference paper at ICLR 2015
ABSTRACT
1 INTRODUCTION
Stochastic gradient-based optimization is of core practical importance in many fields of science and
engineering. Many problems in these fields can be cast as the optimization of some scalar parameter-
ized objective function requiring maximization or minimization with respect to its parameters. If the
function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization
method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same
computational complexity as just evaluating the function. Often, objective functions are stochastic.
For example, many objective functions are composed of a sum of subfunctions evaluated at different
subsamples of data; in this case optimization can be made more efficient by taking gradient steps
w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself
as an efficient and effective optimization method that was central in many machine learning success
stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton
& Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have other
sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. For
all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this
paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In
these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be
restricted to first-order methods.
We propose Adam, a method for efficient stochastic optimization that only requires first-order gra-
dients with little memory requirement. The method computes individual adaptive learning rates for
different parameters from estimates of first and second moments of the gradients; the name Adam
is derived from adaptive moment estimation. Our method is designed to combine the advantages
of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gra-
dients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary
settings; important connections to these and other stochastic optimization methods are clarified in
section 5. Some of Adam’s advantages are that the magnitudes of parameter updates are invariant to
rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter,
it does not require a stationary objective, it works with sparse gradients, and it naturally performs a
form of step size annealing.
∗ Equal contribution. Author ordering determined by coin flip over a Google Hangout.
Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details,
and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise
square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$,
$\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$
we denote $\beta_1$ and $\beta_2$ to the power $t$.
Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $v_0 \leftarrow 0$ (Initialize 2nd moment vector)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t/(1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t/(1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t/(\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)
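For readers who prefer code to pseudo-code, Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `adam`, the `grad_fn(theta, t)` callback, the fixed `num_steps` budget (standing in for "not converged") and the toy quadratic objective are all assumptions made for the example.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, num_steps=1000):
    """Sketch of Algorithm 1; runs a fixed number of steps instead of a convergence test."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # biased first moment estimate
    v = np.zeros_like(theta)  # biased second raw moment estimate
    for t in range(1, num_steps + 1):
        g = grad_fn(theta, t)                    # g_t, gradient at theta_{t-1}
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g      # elementwise square of g_t
        m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second raw moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Hypothetical toy objective: f(theta) = 0.5 * ||theta - target||^2.
target = np.array([1.0, -2.0])

def toy_grad(theta, t):
    return theta - target

theta_final = adam(toy_grad, np.zeros(2), alpha=0.05, num_steps=2000)
```

The stepsize 0.05 is chosen only so that the toy problem finishes quickly; it is not a recommendation from the paper.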
In section 2 we describe the algorithm and the properties of its update rule. Section 3 explains
our initialization bias correction technique, and section 4 provides a theoretical analysis of Adam’s
convergence in online convex programming. Empirically, our method consistently outperforms other
methods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam is
a versatile algorithm that scales to large-scale high-dimensional machine learning problems.
2 ALGORITHM
See algorithm 1 for pseudo-code of our proposed algorithm Adam. Let f (θ) be a noisy objec-
tive function: a stochastic scalar function that is differentiable w.r.t. parameters θ. We are in-
terested in minimizing the expected value of this function, E[f (θ)] w.r.t. its parameters θ. With
$f_1(\theta), \ldots, f_T(\theta)$ we denote the realisations of the stochastic function at subsequent timesteps
1, ..., T . The stochasticity might come from the evaluation at random subsamples (minibatches)
of datapoints, or arise from inherent function noise. With gt = ∇θ ft (θ) we denote the gradient, i.e.
the vector of partial derivatives of ft , w.r.t θ evaluated at timestep t.
The algorithm updates exponential moving averages of the gradient (mt ) and the squared gradient
(vt ) where the hyper-parameters β1 , β2 ∈ [0, 1) control the exponential decay rates of these moving
averages. The moving averages themselves are estimates of the 1st moment (the mean) and the
2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are
initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, especially
during the initial timesteps, and especially when the decay rates are small (i.e. the βs are close to 1).
The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected
estimates $\hat{m}_t$ and $\hat{v}_t$. See section 3 for more details.
Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing
the order of computation, e.g. by replacing the last three lines in the loop with the following lines:
$\alpha_t = \alpha \cdot \sqrt{1 - \beta_2^t}/(1 - \beta_1^t)$ and $\theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t/(\sqrt{v_t} + \hat{\epsilon})$.
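The reordering can be checked numerically. The sketch below compares one step of the original update with the reordered one, under the simplifying assumption $\epsilon = \hat{\epsilon} = 0$ so that the two forms agree exactly; the function names and the random gradients are illustrative only.

```python
import numpy as np

def step_reference(theta, m, v, g, t, alpha, beta1, beta2):
    """Last three lines of the loop in Algorithm 1, with epsilon = 0."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / np.sqrt(v_hat), m, v

def step_reordered(theta, m, v, g, t, alpha, beta1, beta2):
    """Reordered form: fold both bias corrections into the stepsize alpha_t."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    alpha_t = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return theta - alpha_t * m / np.sqrt(v), m, v

rng = np.random.default_rng(0)
theta_a = theta_b = np.zeros(3)
m_a = m_b = np.zeros(3)
v_a = v_b = np.zeros(3)
for t in range(1, 51):
    g = rng.normal(size=3)  # the same synthetic gradient is fed to both variants
    theta_a, m_a, v_a = step_reference(theta_a, m_a, v_a, g, t, 0.001, 0.9, 0.999)
    theta_b, m_b, v_b = step_reordered(theta_b, m_b, v_b, g, t, 0.001, 0.9, 0.999)
```

With a non-zero $\epsilon$ the two forms differ slightly, because $\hat{\epsilon}$ is added to $\sqrt{v_t}$ rather than to $\sqrt{\hat{v}_t}$.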
An important property of Adam’s update rule is its careful choice of stepsizes. Assuming $\epsilon = 0$, the
effective step taken in parameter space at timestep $t$ is $\Delta_t = \alpha \cdot \hat{m}_t/\sqrt{\hat{v}_t}$. The effective stepsize has
two upper bounds: $|\Delta_t| \le \alpha \cdot (1 - \beta_1)/\sqrt{1 - \beta_2}$ in the case $(1 - \beta_1) > \sqrt{1 - \beta_2}$, and $|\Delta_t| \le \alpha$
otherwise. The first case only happens in the most severe case of sparsity: when a gradient has
been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize
will be smaller. When $(1 - \beta_1) = \sqrt{1 - \beta_2}$ we have that $|\hat{m}_t/\sqrt{\hat{v}_t}| < 1$ therefore $|\Delta_t| < \alpha$. In
more common scenarios, we will have that $\hat{m}_t/\sqrt{\hat{v}_t} \approx \pm 1$ since $|E[g]/\sqrt{E[g^2]}| \le 1$. The effective
magnitude of the steps taken in parameter space at each timestep are approximately bounded by
the stepsize setting $\alpha$, i.e., $|\Delta_t| \lessapprox \alpha$. This can be understood as establishing a trust region around
the current parameter value, beyond which the current gradient estimate does not provide sufficient
information. This typically makes it relatively easy to know the right scale of α in advance. For
many machine learning models, for instance, we often know in advance that good optima are with
high probability within some set region in parameter space; it is not uncommon, for example, to
have a prior distribution over the parameters. Since α sets (an upper bound of) the magnitude of
steps in parameter space, we can often deduce the right order of magnitude of α such that optima
can be reached from θ0 within some number of iterations. With a slight abuse of terminology,
we will call the ratio $\hat{m}_t/\sqrt{\hat{v}_t}$ the signal-to-noise ratio (SNR). With a smaller SNR the effective
stepsize ∆t will be closer to zero. This is a desirable property, since a smaller SNR means that
there is greater uncertainty about whether the direction of $\hat{m}_t$ corresponds to the direction of the true
gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading
to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize
∆t is also invariant to the scale of the gradients; rescaling the gradients $g$ with factor $c$ will scale $\hat{m}_t$
with a factor $c$ and $\hat{v}_t$ with a factor $c^2$, which cancel out: $(c \cdot \hat{m}_t)/\sqrt{c^2 \cdot \hat{v}_t} = \hat{m}_t/\sqrt{\hat{v}_t}$.
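Two of these properties are easy to check numerically for a single parameter: the invariance of $\Delta_t$ to a positive rescaling of the gradients, and the fact that a constant gradient yields $|\Delta_t| = \alpha$. The sketch below assumes $\epsilon = 0$; the helper `effective_steps` and the synthetic gradient sequences are illustrative, not part of the paper.

```python
import numpy as np

def effective_steps(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Return Delta_t = alpha * m_hat_t / sqrt(v_hat_t) for a scalar parameter (epsilon = 0)."""
    m, v = 0.0, 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        steps.append(alpha * (m / (1 - beta1 ** t)) / np.sqrt(v / (1 - beta2 ** t)))
    return np.array(steps)

rng = np.random.default_rng(0)
grads = rng.normal(size=200)                      # synthetic noisy gradients

steps = effective_steps(grads)
steps_scaled = effective_steps(1000.0 * grads)    # same gradients, rescaled by c = 1000
constant_steps = effective_steps(np.full(50, 3.0))  # gradient is 3.0 at every timestep
largest_step = np.max(np.abs(steps))              # compare against alpha = 0.001 by inspection
```

With a constant gradient the bias-corrected estimates are exact ($\hat{m}_t = g$ and $\hat{v}_t = g^2$), so every step has magnitude $\alpha$ regardless of the gradient's scale.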
3 INITIALIZATION BIAS CORRECTION
As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive
the term for the second moment estimate; the derivation for the first moment estimate is completely
analogous. Let g be the gradient of the stochastic objective f , and we wish to estimate its second
raw moment (uncentered variance) using an exponential moving average of the squared gradient,
with decay rate β2 . Let g1 , ..., gT be the gradients at subsequent timesteps, each a draw from an
underlying gradient distribution gt ∼ p(gt ). Let us initialize the exponential moving average as
v0 = 0 (a vector of zeros). First note that the update at timestep t of the exponential moving average
$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (where $g_t^2$ indicates the elementwise square $g_t \odot g_t$) can be written as
a function of the gradients at all previous timesteps:
$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2 \qquad (1)$$
We wish to know how $E[v_t]$, the expected value of the exponential moving average at timestep $t$,
relates to the true second moment $E[g_t^2]$, so we can correct for the discrepancy between the two.
Taking expectations of the left-hand and right-hand sides of eq. (1):
$$E[v_t] = E\left[ (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2 \right] \qquad (2)$$
$$= E[g_t^2] \cdot (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} + \zeta \qquad (3)$$
$$= E[g_t^2] \cdot (1 - \beta_2^t) + \zeta \qquad (4)$$
where $\zeta = 0$ if the true second moment $E[g_i^2]$ is stationary; otherwise $\zeta$ can be kept small since
the exponential decay rate $\beta_2$ can (and should) be chosen such that the exponential moving average
assigns small weights to gradients too far in the past. What is left is the term $(1 - \beta_2^t)$ which is
caused by initializing the running average with zeros. In algorithm 1 we therefore divide by this
term to correct the initialization bias.
In case of sparse gradients, for a reliable estimate of the second moment one needs to average over
many gradients by choosing a small value of $(1 - \beta_2)$, i.e. $\beta_2$ close to 1; however it is exactly this case of
$\beta_2$ close to 1 where a lack of initialisation bias correction would lead to initial steps that are much larger.
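The effect of the correction is easiest to see in the stationary case where $g_t^2$ is the same at every step, so that $\zeta = 0$ and eq. (4) holds exactly. The following sketch uses an arbitrary illustrative value of 4.0 for the squared gradient and tracks the biased and bias-corrected estimates over the first ten timesteps.

```python
import numpy as np

beta2 = 0.999
g_squared = 4.0   # illustrative stationary value of g_t^2 (an assumption for this example)

v = 0.0
biased, corrected = [], []
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g_squared
    biased.append(v)                        # v_t, shrunk towards zero by the factor (1 - beta2^t)
    corrected.append(v / (1 - beta2 ** t))  # v_hat_t, recovers g_squared
biased = np.array(biased)
corrected = np.array(corrected)

# Ratio sqrt(v_hat_1 / v_1): how much larger the first step's denominator scaling
# would make the update if v_1 were used in place of v_hat_1.
first_step_inflation = np.sqrt(corrected[0] / biased[0])
```

At $t = 1$ the uncorrected estimate is only $(1 - \beta_2)$ times the true value, so dividing by its square root instead of $\sqrt{\hat{v}_1}$ would inflate the step by a factor of $1/\sqrt{1 - \beta_2}$, roughly 31.6 for $\beta_2 = 0.999$.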
4 CONVERGENCE ANALYSIS
We analyze the convergence of Adam using the online learning framework proposed in (Zinkevich,
2003). We are given an arbitrary, unknown sequence of convex cost functions $f_1(\theta), f_2(\theta), \ldots, f_T(\theta)$. At
each time $t$, our goal is to predict the parameter $\theta_t$ and evaluate it on a previously unknown cost
function $f_t$. Since the nature of the sequence is unknown in advance, we evaluate our algorithm
using the regret, that is, the sum over all previous steps of the difference between the online prediction $f_t(\theta_t)$
and the value $f_t(\theta^*)$ at the best fixed point parameter $\theta^*$ from a feasible set $X$. Concretely,
the regret is defined as:
$$R(T) = \sum_{t=1}^{T} [f_t(\theta_t) - f_t(\theta^*)] \qquad (5)$$
where $\theta^* = \arg\min_{\theta \in X} \sum_{t=1}^{T} f_t(\theta)$. We show Adam has $O(\sqrt{T})$ regret bound and a proof is given
in the appendix. Our result is comparable to the best known bound for this general convex online
learning problem. We also use some definitions to simplify our notation, where $g_t \triangleq \nabla f_t(\theta_t)$ and $g_{t,i}$
is the $i$th element. We define $g_{1:t,i} \in \mathbb{R}^t$ as a vector that contains the $i$th dimension of the gradients
over all iterations till $t$, $g_{1:t,i} = [g_{1,i}, g_{2,i}, \cdots, g_{t,i}]$. Also, we define $\gamma \triangleq \beta_1^2/\sqrt{\beta_2}$. Our following
theorem holds when the learning rate $\alpha_t$ is decaying at a rate of $t^{-1/2}$ and the first moment running
average coefficient $\beta_{1,t}$ decays exponentially with $\lambda$, which is typically close to 1, e.g. $1 - 10^{-8}$.
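Theorem 4.1. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$
for all $\theta \in \mathbb{R}^d$ and distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$,
$\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \ldots, T\}$, and $\beta_1, \beta_2 \in [0, 1)$ satisfy $\beta_1^2/\sqrt{\beta_2} < 1$. Let $\alpha_t = \alpha/\sqrt{t}$
and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, $\lambda \in (0, 1)$. Adam achieves the following guarantee, for all $T \ge 1$.
$$R(T) \le \frac{D^2}{2\alpha(1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha(1 + \beta_1) G_\infty}{(1 - \beta_1)\sqrt{1 - \beta_2}(1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2\alpha(1 - \beta_1)(1 - \lambda)^2}$$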
Our Theorem 4.1 implies that when the data features are sparse and the gradients are bounded, the summation
terms can be much smaller than their upper bounds, $\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \ll d G_\infty \sqrt{T}$ and
$\sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} \ll d G_\infty \sqrt{T}$, in particular if the class of function and data features are in the form of
section 1.2 in (Duchi et al., 2011). Their results for the expected value $E[\sum_{i=1}^{d} \|g_{1:T,i}\|_2]$ also apply
to Adam. In particular, the adaptive method, such as Adam and Adagrad, can achieve $O(\log d \sqrt{T})$,
an improvement over $O(\sqrt{dT})$ for the non-adaptive method. Decaying $\beta_{1,t}$ towards zero is important
in our theoretical analysis and also matches previous empirical findings, e.g. (Sutskever et al.,
2013) suggests reducing the momentum coefficient in the end of training can improve convergence.
Finally, we can show the average regret of Adam converges,
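Corollary 4.2. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$
for all $\theta \in \mathbb{R}^d$ and distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$,
$\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \ldots, T\}$. Adam achieves the following guarantee, for all
$T \ge 1$.
$$\frac{R(T)}{T} = O\left( \frac{1}{\sqrt{T}} \right)$$
This result can be obtained by using Theorem 4.1 and $\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \le d G_\infty \sqrt{T}$. Thus,
$\lim_{T \to \infty} R(T)/T = 0$.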
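The regret of eq. (5) and the schedules $\alpha_t = \alpha/\sqrt{t}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$ can be made concrete with a toy online problem. The sketch below is only an illustration, not a check of the theorem: the scalar quadratic costs $f_t(\theta) = \frac{1}{2}(\theta - y_t)^2$, the synthetic targets $y_t$, the unconstrained feasible set, and the use of $1 - \beta_1^t$ as the first-moment bias correction under a time-varying $\beta_{1,t}$ are all assumptions of this example.

```python
import numpy as np

def average_regret(targets, alpha=0.1, beta1=0.9, beta2=0.999, lam=1 - 1e-8, eps=1e-8):
    """Average regret R(T)/T of Adam on f_t(theta) = 0.5 * (theta - targets[t-1])**2."""
    targets = np.asarray(targets, dtype=float)
    theta, m, v = 0.0, 0.0, 0.0
    online_loss = 0.0
    for t, y in enumerate(targets, start=1):
        online_loss += 0.5 * (theta - y) ** 2   # f_t(theta_t), suffered before the update
        g = theta - y                           # gradient of f_t at theta_t
        beta1_t = beta1 * lam ** (t - 1)        # decaying first-moment coefficient
        alpha_t = alpha / np.sqrt(t)            # decaying stepsize
        m = beta1_t * m + (1 - beta1_t) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)            # simplification: treats beta1_t as constant
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha_t * m_hat / (np.sqrt(v_hat) + eps)
    theta_star = targets.mean()                 # best fixed parameter for quadratic costs
    best_fixed_loss = np.sum(0.5 * (theta_star - targets) ** 2)
    return (online_loss - best_fixed_loss) / len(targets)

rng = np.random.default_rng(0)
targets = 1.0 + 0.1 * rng.normal(size=2000)     # synthetic cost sequence
avg_regret_short = average_regret(targets[:50])
avg_regret_long = average_regret(targets)
```

On this toy sequence the average regret over the longer horizon comes out smaller than over the first 50 steps, which is the qualitative behaviour Corollary 4.2 describes.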
5 RELATED WORK
Optimization methods bearing a direct relation to Adam are RMSProp (Tieleman & Hinton, 2012;
Graves, 2013) and AdaGrad (Duchi et al., 2011); these relationships are discussed below. Other
stochastic optimization methods include vSGD (Schaul et al., 2012), AdaDelta (Zeiler, 2012) and the
natural Newton method from Roux & Fitzgibbon (2010), all setting stepsizes by estimating curvature
from first-order information. The Sum-of-Functions Optimizer (SFO) (Sohl-Dickstein et al., 2014)
is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear
in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained
systems such as a GPU. Like natural gradient descent (NGD) (Amari, 1998), Adam employs a
preconditioner that adapts to the geometry of the data, since vbt is an approximation to the diagonal
of the Fisher information matrix (Pascanu & Bengio, 2013); however, Adam’s preconditioner (like
AdaGrad’s) is more conservative in its adaption than vanilla NGD by preconditioning with the square
root of the inverse of the diagonal Fisher information matrix approximation.
RMSProp: An optimization method closely related to Adam is RMSProp (Tieleman & Hinton,
2012). A version with momentum has sometimes been used (Graves, 2013). There are a few impor-
tant differences between RMSProp with momentum and Adam: RMSProp with momentum gener-
ates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are
directly estimated using a running average of first and second moment of the gradient. RMSProp
also lacks a bias-correction term; this matters most in case of a value of β2 close to 1 (required in case of
sparse gradients), since in that case not correcting the bias leads to very large stepsizes and often
divergence, as we also empirically demonstrate in section 6.4.
AdaGrad: An algorithm that works well for sparse gradients is AdaGrad (Duchi et al., 2011). Its
basic version updates parameters as $\theta_{t+1} = \theta_t - \alpha \cdot g_t/\sqrt{\sum_{i=1}^{t} g_i^2}$. Note that if we choose $\beta_2$ to be
infinitesimally close to 1 from below, then $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \cdot \sum_{i=1}^{t} g_i^2$. AdaGrad corresponds to a
version of Adam with $\beta_1 = 0$, infinitesimal $(1 - \beta_2)$ and a replacement of $\alpha$ by an annealed version
$\alpha_t = \alpha \cdot t^{-1/2}$, namely
$$\theta_t - \alpha \cdot t^{-1/2} \cdot \hat{m}_t/\sqrt{\lim_{\beta_2 \to 1} \hat{v}_t} = \theta_t - \alpha \cdot t^{-1/2} \cdot g_t/\sqrt{t^{-1} \cdot \sum_{i=1}^{t} g_i^2} = \theta_t - \alpha \cdot g_t/\sqrt{\sum_{i=1}^{t} g_i^2}.$$
Note that this direct correspondence between Adam and Adagrad does
not hold when removing the bias-correction terms; without bias correction, like in RMSProp, a β2
infinitesimally close to 1 would lead to infinitely large bias, and infinitely large parameter updates.
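This correspondence can be verified numerically on a fixed gradient sequence. In the sketch below the limit $\beta_2 \to 1$ is approximated by $\beta_2 = 1 - 10^{-6}$, $\epsilon$ is set to 0 in both methods, and the gradients are synthetic; the function names are illustrative.

```python
import numpy as np

def adagrad_steps(grads, alpha):
    """Basic AdaGrad step sizes alpha * g_t / sqrt(sum_{i<=t} g_i^2), with epsilon = 0."""
    return alpha * grads / np.sqrt(np.cumsum(grads ** 2))

def adam_limit_steps(grads, alpha, beta2=1 - 1e-6):
    """Adam steps with beta1 = 0, beta2 close to 1 and alpha_t = alpha * t**-0.5 (epsilon = 0)."""
    v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)   # approaches the plain average of g_i^2 as beta2 -> 1
        m_hat = g                      # beta1 = 0 makes the first moment equal to g_t
        steps.append(alpha * t ** -0.5 * m_hat / np.sqrt(v_hat))
    return np.array(steps)

rng = np.random.default_rng(0)
grads = rng.normal(size=20)            # synthetic gradients for one parameter
steps_adagrad = adagrad_steps(grads, alpha=0.01)
steps_adam = adam_limit_steps(grads, alpha=0.01)
```

Dropping the division by $(1 - \beta_2^t)$ in `adam_limit_steps` would make the steps larger by roughly a factor of $1/\sqrt{t(1 - \beta_2)}$, which illustrates why the correspondence depends on the bias correction.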
6 EXPERIMENTS
To empirically evaluate the proposed method, we investigated different popular machine learning
models, including logistic regression, multilayer fully connected neural networks and deep convolu-
tional neural networks. Using large models and datasets, we demonstrate Adam can efficiently solve
practical deep learning problems.
We use the same parameter initialization when comparing different optimization algorithms. The
hyper-parameters, such as learning rate and momentum, are searched over a dense grid and the
results are reported using the best hyper-parameter setting.
We evaluate our proposed method on L2-regularized multi-class logistic regression using the MNIST
dataset. Logistic regression has a well-studied convex objective, making it suitable for comparison
of different optimizers without worrying about local minimum issues. The stepsize α in our logistic
regression experiments is adjusted by $1/\sqrt{t}$ decay, namely $\alpha_t = \alpha/\sqrt{t}$, which matches our theoretical
prediction from section 4. The logistic regression classifies the class label directly on the 784
dimension image vectors. We compare Adam to accelerated SGD with Nesterov momentum and
Adagrad using minibatch size of 128. According to Figure 1, we found that Adam yields similar
convergence as SGD with momentum and both converge faster than Adagrad.
As discussed in (Duchi et al., 2011), Adagrad can efficiently deal with sparse features and gradients
as one of its main theoretical results whereas SGD is slow at learning rare features. Adam with
$1/\sqrt{t}$ decay on its stepsize should theoretically match the performance of Adagrad. We examine the
sparse feature problem using IMDB movie review dataset from (Maas et al., 2011). We pre-process
the IMDB movie reviews into bag-of-words (BoW) feature vectors including the first 10,000 most
frequent words. The 10,000 dimension BoW feature vector for each review is highly sparse. As sug-
gested in (Wang & Manning, 2013), 50% dropout noise can be applied to the BoW features during
[Figure 1 plots. Left panel: MNIST Logistic Regression (curves: AdaGrad, SGDNesterov, Adam). Right panel: IMDB BoW feature Logistic Regression (curves: Adagrad+dropout, RMSProp+dropout, SGDNesterov+dropout, Adam+dropout). Vertical axis: training cost.]
Figure 1: Logistic regression training negative log likelihood on MNIST images and IMDB movie
reviews with 10,000 bag-of-words (BoW) feature vectors.
training to prevent over-fitting. In figure 1, Adagrad outperforms SGD with Nesterov momentum
by a large margin both with and without dropout noise. Adam converges as fast as Adagrad. The
empirical performance of Adam is consistent with our theoretical findings in sections 2 and 4. Sim-
ilar to Adagrad, Adam can take advantage of sparse features and obtain faster convergence rate than
normal SGD with momentum.
Multi-layer neural networks are powerful models with non-convex objective functions. Although
our convergence analysis does not apply to non-convex problems, we empirically found that Adam
often outperforms other methods in such cases. In our experiments, we made model choices that are
consistent with previous publications in the area; a neural network model with two fully connected
hidden layers with 1000 hidden units each and ReLU activation is used for this experiment with
minibatch size of 128.
First, we study different optimizers using the standard deterministic cross-entropy objective func-
tion with L2 weight decay on the parameters to prevent over-fitting. The sum-of-functions (SFO)
method (Sohl-Dickstein et al., 2014) is a recently proposed quasi-Newton method that works with
minibatches of data and has shown good performance on optimization of multi-layer neural net-
works. We used their implementation and compared with Adam to train such models. Figure 2
shows that Adam makes faster progress in terms of both the number of iterations and wall-clock
time. Due to the cost of updating curvature information, SFO is 5-10x slower per iteration compared
to Adam, and has a memory requirement that is linear in the number of minibatches.
Stochastic regularization methods, such as dropout, are an effective way to prevent over-fitting and
often used in practice due to their simplicity. SFO assumes deterministic subfunctions, and indeed
failed to converge on cost functions with stochastic regularization. We compare the effectiveness of
Adam to other stochastic first order methods on multi-layer neural networks trained with dropout
noise. Figure 2 shows our results; Adam shows better convergence than other methods.
Convolutional neural networks (CNNs) with several layers of convolution, pooling and non-linear
units have shown considerable success in computer vision tasks. Unlike most fully connected neural
nets, weight sharing in CNNs results in vastly different gradients in different layers. A smaller
learning rate for the convolution layers is often used in practice when applying SGD. We show the
effectiveness of Adam in deep CNNs. Our CNN architecture has three alternating stages of 5x5
convolution filters and 3x3 max pooling with stride of 2 that are followed by a fully connected layer
of 1000 rectified linear hidden units (ReLU’s). The input images are pre-processed by whitening, and
[Figure 2 plots, panels (a) and (b). Vertical axis: training cost.]
Figure 2: Training of multilayer neural networks on MNIST images. (a) Neural networks using
dropout stochastic regularization. (b) Neural networks with deterministic cost function. We compare
with the sum-of-functions (SFO) optimizer (Sohl-Dickstein et al., 2014).
[Figure 3 plots, left and right panels. Vertical axis: training cost.]
Figure 3: Convolutional neural networks training cost. (left) Training cost for the first three epochs.
(right) Training cost over 45 epochs. CIFAR-10 with c64-c64-c128-1000 architecture.
dropout noise is applied to the input layer and fully connected layer. The minibatch size is also set
to 128 similar to previous experiments.
Interestingly, although both Adam and Adagrad make rapid progress lowering the cost in the initial
stage of the training, shown in Figure 3 (left), Adam and SGD eventually converge considerably
faster than Adagrad for CNNs, as shown in Figure 3 (right). We notice the second moment estimate $\hat{v}_t$
vanishes to zeros after a few epochs and is dominated by the $\epsilon$ in algorithm 1. The second moment
estimate is therefore a poor approximation to the geometry of the cost function in CNNs compared
to the fully connected network from Section 6.2, whereas reducing the minibatch variance through
the first moment is more important in CNNs and contributes to the speed-up. As a result, Adagrad
converges much slower than others in this particular experiment. Though Adam shows marginal
improvement over SGD with momentum, it adapts learning rate scale for different layers instead of
hand picking manually as in SGD.
[Figure 4 plots. Panels: (a) after 10 epochs, (b) after 100 epochs; rows: β1=0 and β1=0.9. Horizontal axis: log10(α). Vertical axis: Loss.]
Figure 4: Effect of bias-correction terms (red line) versus no bias correction terms (green line)
after 10 epochs (left) and 100 epochs (right) on the loss (y-axes) when learning a Variational Auto-
Encoder (VAE) (Kingma & Welling, 2013), for different settings of stepsize α (x-axes) and hyper-
parameters β1 and β2 .
We also empirically evaluate the effect of the bias correction terms explained in sections 2 and 3.
As discussed in section 5, removal of the bias correction terms results in a version of RMSProp (Tieleman
& Hinton, 2012) with momentum. We vary β1 and β2 when training a variational auto-encoder
(VAE) with the same architecture as in (Kingma & Welling, 2013) with a single hidden
layer with 500 hidden units with softplus nonlinearities and a 50-dimensional spherical Gaussian
latent variable. We iterated over a broad range of hyper-parameter choices, i.e. β1 ∈ [0, 0.9] and
β2 ∈ [0.99, 0.999, 0.9999], and log10 (α) ∈ [−5, ..., −1]. Values of β2 close to 1, required for robustness
to sparse gradients, result in larger initialization bias; therefore we expect the bias correction
term to be important in such cases of slow decay, preventing an adverse effect on optimization.
In Figure 4, values of β2 close to 1 indeed lead to instabilities in training when no bias correction term
was present, especially in the first few epochs of the training. The best results were achieved with small
values of (1 − β2) and bias correction; this was more apparent towards the end of optimization when
gradients tend to become sparser as hidden units specialize to specific patterns. In summary, Adam
performed equal to or better than RMSProp, regardless of hyper-parameter setting.
7 EXTENSIONS
7.1 ADAMAX
In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a
(scaled) L2 norm of their individual current and past gradients. We can generalize the L2 norm based
update rule to a Lp norm based update rule. Such variants become numerically unstable for large
p. However, in the special case where we let p → ∞, a surprisingly simple and stable algorithm
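emerges; see algorithm 2. We’ll now derive the algorithm. Let, in case of the Lp norm, the stepsize
at time $t$ be inversely proportional to $v_t^{1/p}$, where:
$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p \qquad (6)$$
$$= (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p \qquad (7)$$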
Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See section 7.1 for details.
Good default settings for the tested machine learning problems are $\alpha = 0.002$, $\beta_1 = 0.9$ and
$\beta_2 = 0.999$. With $\beta_1^t$ we denote $\beta_1$ to the power $t$. Here, $(\alpha/(1 - \beta_1^t))$ is the learning rate with the
bias-correction term for the first moment. All operations on vectors are element-wise.
Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $u_0 \leftarrow 0$ (Initialize the exponentially weighted infinity norm)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $u_t \leftarrow \max(\beta_2 \cdot u_{t-1}, |g_t|)$ (Update the exponentially weighted infinity norm)
    $\theta_t \leftarrow \theta_{t-1} - (\alpha/(1 - \beta_1^t)) \cdot m_t/u_t$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)
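Note that the decay term is here equivalently parameterised as $\beta_2^p$ instead of $\beta_2$. Now let $p \to \infty$,
and define $u_t = \lim_{p \to \infty} (v_t)^{1/p}$, then:
$$u_t = \lim_{p \to \infty} (v_t)^{1/p} = \lim_{p \to \infty} \left( (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p \right)^{1/p} \qquad (8)$$
$$= \lim_{p \to \infty} (1 - \beta_2^p)^{1/p} \left( \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p \right)^{1/p} \qquad (9)$$
$$= \lim_{p \to \infty} \left( \sum_{i=1}^{t} \left( \beta_2^{(t-i)} \cdot |g_i| \right)^p \right)^{1/p} \qquad (10)$$
$$= \max\left( \beta_2^{t-1} |g_1|, \beta_2^{t-2} |g_2|, \ldots, \beta_2 |g_{t-1}|, |g_t| \right) \qquad (11)$$
Which corresponds to the remarkably simple recursive formula:
$$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|) \qquad (12)$$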
with initial value u0 = 0. Note that, conveniently enough, we don’t need to correct for initialization
bias in this case. Also note that the magnitude of parameter updates has a simpler bound with
AdaMax than Adam, namely: |∆t | ≤ α.
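Algorithm 2 translates to NumPy almost line by line. The sketch below is illustrative only: the function name `adamax`, the `grad_fn(theta, t)` callback, the fixed step budget and the toy quadratic objective are assumptions made for the example. Like Algorithm 2 it has no $\epsilon$ term, so it assumes the first gradient is non-zero in every coordinate (otherwise $u_t$ would be zero and the division undefined).

```python
import numpy as np

def adamax(grad_fn, theta0, alpha=0.002, beta1=0.9, beta2=0.999, num_steps=1000):
    """Sketch of Algorithm 2; runs a fixed number of steps instead of a convergence test."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # biased first moment estimate
    u = np.zeros_like(theta)  # exponentially weighted infinity norm
    for t in range(1, num_steps + 1):
        g = grad_fn(theta, t)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))   # eq. (12), elementwise
        theta = theta - (alpha / (1 - beta1 ** t)) * m / u
    return theta

# Hypothetical toy objective: f(theta) = 0.5 * ||theta - target||^2.
target = np.array([1.0, -2.0])

def toy_grad(theta, t):
    return theta - target

theta_final = adamax(toy_grad, np.zeros(2), alpha=0.05, num_steps=2000)
```

Only the first moment needs a bias correction here, because $u_t$ is a running maximum rather than a running average started at zero.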
Since the last iterate is noisy due to stochastic approximation, better generalization performance is
often achieved by averaging. Previously in Moulines & Bach (2011), Polyak-Ruppert averaging
(Polyak & Juditsky, 1992; Ruppert, 1988) has been shown to improve the convergence of standard
SGD, where $\bar{\theta}_t = \frac{1}{t} \sum_{k=1}^{t} \theta_k$. Alternatively, an exponential moving average over the parameters can
be used, giving higher weight to more recent parameter values. This can be trivially implemented
by adding one line to the inner loop of algorithms 1 and 2: $\bar{\theta}_t \leftarrow \beta_2 \cdot \bar{\theta}_{t-1} + (1 - \beta_2) \theta_t$, with $\bar{\theta}_0 = 0$.
Initialization bias can again be corrected by the estimator $\hat{\theta}_t = \bar{\theta}_t/(1 - \beta_2^t)$.
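A minimal sketch of this exponential moving average over iterates, with the bias-corrected estimator, might look as follows. The helper `ema_of_iterates` and the synthetic sequence of constant iterates are assumptions for illustration; in practice the single update line would sit inside the optimizer's loop.

```python
import numpy as np

def ema_of_iterates(thetas, beta2=0.999):
    """Return the bias-corrected averages theta_hat_t for a sequence of iterates theta_1..theta_T."""
    thetas = np.asarray(thetas, dtype=float)
    theta_bar = np.zeros_like(thetas[0])       # theta_bar_0 = 0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        theta_bar = beta2 * theta_bar + (1 - beta2) * theta
        corrected.append(theta_bar / (1 - beta2 ** t))
    return np.array(corrected)

# If every iterate is identical, the corrected average should reproduce it from step 1 onwards,
# whereas the uncorrected theta_bar_1 would be only (1 - beta2) times the iterate.
constant_iterates = np.tile(np.array([1.0, -2.0]), (100, 1))
averaged = ema_of_iterates(constant_iterates)
```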
8 CONCLUSION
We have introduced a simple and computationally efficient algorithm for gradient-based optimiza-
tion of stochastic objective functions. Our method is aimed towards machine learning problems with
large datasets and/or high-dimensional parameter spaces. The method combines the advantages of
two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients,
and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward
to implement and requires little memory. The experiments confirm the analysis on the rate of con-
vergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range
of non-convex optimization problems in the field of machine learning.
9 ACKNOWLEDGMENTS
This paper would probably not have existed without the support of Google Deepmind. We would
like to give special thanks to Ivo Danihelka, and Tom Schaul for coining the name Adam. Thanks to
Kai Fan from Duke University for spotting an error in the original AdaMax derivation. Experiments
in this work were partly carried out on the Dutch national e-infrastructure with the support of SURF
Foundation. Diederik Kingma is supported by the Google European Doctorate Fellowship in Deep
Learning.
REFERENCES
Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
Deng, Li, Li, Jinyu, Huang, Jui-Ting, Yao, Kaisheng, Yu, Dong, Seide, Frank, Seltzer, Michael, Zweig, Geoff,
He, Xiaodong, Williams, Jason, et al. Recent advances in deep learning for speech research at microsoft.
ICASSP 2013, 2013.
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic
optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
Graves, Alex, Mohamed, Abdel-rahman, and Hinton, Geoffrey. Speech recognition with deep recurrent neural
networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on,
pp. 6645–6649. IEEE, 2013.
Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313
(5786):504–507, 2006.
Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior,
Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine,
IEEE, 29(6):82–97, 2012a.
Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Im-
proving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580,
2012b.
Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. In The 2nd International Conference
on Learning Representations (ICLR), 2013.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher.
Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies-Volume 1, pp. 142–150. Association for
Computational Linguistics, 2011.
Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for
machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.
Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint
arXiv:1301.3584, 2013.
Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal
on Control and Optimization, 30(4):838–855, 1992.
Published as a conference paper at ICLR 2015
Roux, Nicolas L and Fitzgibbon, Andrew W. A fast natural newton method. In Proceedings of the 27th
International Conference on Machine Learning (ICML-10), pp. 623–630, 2010.
Ruppert, David. Efficient estimations from a slowly convergent robbins-monro process. Technical report,
Cornell University Operations Research and Industrial Engineering, 1988.
Schaul, Tom, Zhang, Sixin, and LeCun, Yann. No more pesky learning rates. arXiv preprint arXiv:1206.1106,
2012.
Sohl-Dickstein, Jascha, Poole, Ben, and Ganguli, Surya. Fast large-scale optimization by unifying stochas-
tic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Machine
Learning (ICML-14), pp. 604–612, 2014.
Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and
momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning
(ICML-13), pp. 1139–1147, 2013.
Tieleman, T. and Hinton, G. Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning.
Technical report, 2012.
Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of the 30th International Confer-
ence on Machine Learning (ICML-13), pp. 118–126, 2013.
Zeiler, Matthew D. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. 2003.
10 APPENDIX
10.1 CONVERGENCE PROOF
Definition 10.1. A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^d$, for all $\lambda \in [0, 1]$,
$$\lambda f(x) + (1 - \lambda) f(y) \ge f(\lambda x + (1 - \lambda) y)$$
Also, notice that a convex function can be lower bounded by a hyperplane at its tangent.
Lemma 10.2. If a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex, then for all $x, y \in \mathbb{R}^d$,
$$f(y) \ge f(x) + \nabla f(x)^T (y - x)$$
The above lemma can be used to upper bound the regret and our proof for the main theorem is
constructed by substituting the hyperplane with the Adam update rules.
The following two lemmas are used to support our main theorem. We also use some definitions to simplify our notation, where $g_t \triangleq \nabla f_t(\theta_t)$ and $g_{t,i}$ is its $i$th element. We define $g_{1:t,i} \in \mathbb{R}^t$ as the vector
that contains the $i$th dimension of the gradients over all iterations up to $t$, $g_{1:t,i} = [g_{1,i}, g_{2,i}, \cdots, g_{t,i}]$.
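Lemma 10.3. Let $g_t = \nabla f_t(\theta_t)$ and $g_{1:t}$ be defined as above and bounded, $\|g_t\|_2 \le G$, $\|g_t\|_\infty \le G_\infty$. Then,
$$\sum_{t=1}^{T} \sqrt{\frac{g_{t,i}^2}{t}} \le 2 G_\infty \|g_{1:T,i}\|_2$$
From $\|g_{1:T,i}\|_2^2 - g_{T,i}^2 + \frac{g_{T,i}^4}{4 \|g_{1:T,i}\|_2^2} \ge \|g_{1:T,i}\|_2^2 - g_{T,i}^2$, we can take the square root of both sides and have,
$$\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2 \|g_{1:T,i}\|_2} \le \|g_{1:T,i}\|_2 - \frac{g_{T,i}^2}{2 \sqrt{T G_\infty^2}}$$
Rearrange the inequality and substitute the $\sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2}$ term,
$$G_\infty \sqrt{\|g_{1:T,i}\|_2^2 - g_{T,i}^2} + \sqrt{\frac{g_{T,i}^2}{T}} \le 2 G_\infty \|g_{1:T,i}\|_2$$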
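Lemma 10.4. Let $\gamma \triangleq \frac{\beta_1^2}{\sqrt{\beta_2}}$. For $\beta_1, \beta_2 \in [0, 1)$ that satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$ and bounded $g_t$, $\|g_t\|_2 \le G$, $\|g_t\|_\infty \le G_\infty$, the following inequality holds
$$\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} \le \frac{2}{1 - \gamma} \frac{1}{\sqrt{1 - \beta_2}} \|g_{1:T,i}\|_2$$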
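Proof. Under the assumption, $\frac{\sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)^2} \le \frac{1}{(1 - \beta_1)^2}$. We can expand the last term in the summation using the update rules in Algorithm 1,
$$
\begin{aligned}
\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}}
&= \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1 - \beta_2^T}}{(1 - \beta_1^T)^2} \frac{\left(\sum_{k=1}^{T} (1 - \beta_1) \beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T \sum_{j=1}^{T} (1 - \beta_2) \beta_2^{T-j} g_{j,i}^2}} \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1 - \beta_2^T}}{(1 - \beta_1^T)^2} \sum_{k=1}^{T} \frac{T \left((1 - \beta_1) \beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T \sum_{j=1}^{T} (1 - \beta_2) \beta_2^{T-j} g_{j,i}^2}} \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1 - \beta_2^T}}{(1 - \beta_1^T)^2} \sum_{k=1}^{T} \frac{T \left((1 - \beta_1) \beta_1^{T-k} g_{k,i}\right)^2}{\sqrt{T (1 - \beta_2) \beta_2^{T-k} g_{k,i}^2}} \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{\sqrt{1 - \beta_2^T}}{(1 - \beta_1^T)^2} \frac{(1 - \beta_1)^2}{\sqrt{T (1 - \beta_2)}} \sum_{k=1}^{T} T \left(\frac{\beta_1^2}{\sqrt{\beta_2}}\right)^{T-k} \|g_{k,i}\|_2 \\
&\le \sum_{t=1}^{T-1} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}} + \frac{T}{\sqrt{T (1 - \beta_2)}} \sum_{k=1}^{T} \gamma^{T-k} \|g_{k,i}\|_2
\end{aligned}
$$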
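Similarly, we can upper bound the rest of the terms in the summation.
$$
\begin{aligned}
\sum_{t=1}^{T} \frac{\hat{m}_{t,i}^2}{\sqrt{t \hat{v}_{t,i}}}
&\le \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t (1 - \beta_2)}} \sum_{j=0}^{T-t} t \gamma^j \\
&\le \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t (1 - \beta_2)}} \sum_{j=0}^{T} t \gamma^j
\end{aligned}
$$
For $\gamma < 1$, using the upper bound on the arithmetic-geometric series, $\sum_t t \gamma^t < \frac{1}{(1 - \gamma)^2}$:
$$\sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t (1 - \beta_2)}} \sum_{j=0}^{T} t \gamma^j \le \frac{1}{(1 - \gamma)^2 \sqrt{1 - \beta_2}} \sum_{t=1}^{T} \frac{\|g_{t,i}\|_2}{\sqrt{t}}$$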
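The arithmetic-geometric bound used in this step can be recovered by differentiating the geometric series. The following is a short worked derivation rather than part of the original proof, and it assumes $0 \le \gamma < 1$ with the sum taken over all $t \ge 1$:

```latex
\sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma}
\;\Longrightarrow\;
\sum_{t=1}^{\infty} t\,\gamma^{t-1} = \frac{1}{(1-\gamma)^{2}}
\;\Longrightarrow\;
\sum_{t=1}^{\infty} t\,\gamma^{t} = \frac{\gamma}{(1-\gamma)^{2}} < \frac{1}{(1-\gamma)^{2}}
```

Any finite partial sum of nonnegative terms is bounded by the same quantity.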
To simplify the notation, we define $\gamma \triangleq \frac{\beta_1^2}{\sqrt{\beta_2}}$. Intuitively, our following theorem holds when the
learning rate $\alpha_t$ is decaying at a rate of $t^{-1/2}$ and the first moment running average coefficient $\beta_{1,t}$ decays
exponentially with $\lambda$, which is typically close to 1, e.g. $1 - 10^{-8}$.
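Theorem 10.5. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$ and distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$,
$\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, ..., T\}$, and $\beta_1, \beta_2 \in [0, 1)$ satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$. Let $\alpha_t = \frac{\alpha}{\sqrt{t}}$
and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, $\lambda \in (0, 1)$. Adam achieves the following guarantee, for all $T \ge 1$.
$$R(T) \le \frac{D^2}{2\alpha(1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha(\beta_1 + 1) G_\infty}{(1 - \beta_1)\sqrt{1 - \beta_2}(1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2\alpha(1 - \beta_1)(1 - \lambda)^2}$$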
$$f_t(\theta_t) - f_t(\theta^*) \le g_t^T (\theta_t - \theta^*) = \sum_{i=1}^{d} g_{t,i} (\theta_{t,i} - \theta^*_{,i})$$
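We focus on the $i$th dimension of the parameter vector $\theta_t \in \mathbb{R}^d$. Subtract the scalar $\theta^*_{,i}$ and square
both sides of the above update rule, we have,
$$
\begin{aligned}
g_{t,i} (\theta_{t,i} - \theta^*_{,i})
&= \frac{(1 - \beta_1^t) \sqrt{\hat{v}_{t,i}}}{2 \alpha_t (1 - \beta_{1,t})} \left( (\theta_{t,i} - \theta^*_{,i})^2 - (\theta_{t+1,i} - \theta^*_{,i})^2 \right) \\
&\quad + \frac{\beta_{1,t}}{(1 - \beta_{1,t})} \frac{\hat{v}_{t-1,i}^{1/4}}{\sqrt{\alpha_{t-1}}} (\theta^*_{,i} - \theta_{t,i}) \sqrt{\alpha_{t-1}} \frac{m_{t-1,i}}{\hat{v}_{t-1,i}^{1/4}} + \frac{\alpha_t (1 - \beta_1^t) \sqrt{\hat{v}_{t,i}}}{2 (1 - \beta_{1,t})} \left( \frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}}} \right)^2 \\
&\le \frac{1}{2 \alpha_t (1 - \beta_1)} \left( (\theta_{t,i} - \theta^*_{,i})^2 - (\theta_{t+1,i} - \theta^*_{,i})^2 \right) \sqrt{\hat{v}_{t,i}} + \frac{\beta_{1,t}}{2 \alpha_{t-1} (1 - \beta_{1,t})} (\theta^*_{,i} - \theta_{t,i})^2 \sqrt{\hat{v}_{t-1,i}} \\
&\quad + \frac{\beta_1 \alpha_{t-1} m_{t-1,i}^2}{2 (1 - \beta_1) \sqrt{\hat{v}_{t-1,i}}} + \frac{\alpha_t \hat{m}_{t,i}^2}{2 (1 - \beta_1) \sqrt{\hat{v}_{t,i}}}
\end{aligned}
$$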
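We apply Lemma 10.4 to the above inequality and derive the regret bound by summing across all
the dimensions for $i \in 1, ..., d$ in the upper bound of $f_t(\theta_t) - f_t(\theta^*)$ and the sequence of convex
functions for $t \in 1, ..., T$:
$$
\begin{aligned}
R(T) &\le \sum_{i=1}^{d} \frac{1}{2 \alpha_1 (1 - \beta_1)} (\theta_{1,i} - \theta^*_{,i})^2 \sqrt{\hat{v}_{1,i}} + \sum_{i=1}^{d} \sum_{t=2}^{T} \frac{1}{2 (1 - \beta_1)} (\theta_{t,i} - \theta^*_{,i})^2 \left( \frac{\sqrt{\hat{v}_{t,i}}}{\alpha_t} - \frac{\sqrt{\hat{v}_{t-1,i}}}{\alpha_{t-1}} \right) \\
&\quad + \frac{\beta_1 \alpha G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2} (1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \frac{\alpha G_\infty}{(1 - \beta_1) \sqrt{1 - \beta_2} (1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 \\
&\quad + \sum_{i=1}^{d} \sum_{t=1}^{T} \frac{\beta_{1,t}}{2 \alpha_t (1 - \beta_{1,t})} (\theta^*_{,i} - \theta_{t,i})^2 \sqrt{\hat{v}_{t,i}}
\end{aligned}
$$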
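We can use the arithmetic-geometric series upper bound for the last term:
$$
\begin{aligned}
\sum_{t=1}^{T} \frac{\beta_{1,t}}{(1 - \beta_{1,t})} \sqrt{t}
&\le \sum_{t=1}^{T} \frac{1}{(1 - \beta_1)} \lambda^{t-1} \sqrt{t} \\
&\le \sum_{t=1}^{T} \frac{1}{(1 - \beta_1)} \lambda^{t-1} t \\
&\le \frac{1}{(1 - \beta_1)(1 - \lambda)^2}
\end{aligned}
$$
Therefore, we have the following regret bound:
$$R(T) \le \frac{D^2}{2\alpha(1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha(1 + \beta_1) G_\infty}{(1 - \beta_1)\sqrt{1 - \beta_2}(1 - \gamma)^2} \sum_{i=1}^{d} \|g_{1:T,i}\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2\alpha(1 - \beta_1)(1 - \lambda)^2}$$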
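The decaying schedules assumed in Theorem 10.5, $\alpha_t = \alpha/\sqrt{t}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, can be illustrated with a minimal scalar sketch of the Adam update. This is an illustration under stated assumptions, not the authors' reference implementation: the function name `adam_decayed`, the small constant `eps` in the denominator (which the proof omits), and the quadratic test objective are all hypothetical choices made here.

```python
import math


def adam_decayed(grad, theta0, steps, alpha=0.1, beta1=0.9, beta2=0.999,
                 lam=1 - 1e-8, eps=1e-8):
    """Scalar Adam with alpha_t = alpha / sqrt(t) and beta_{1,t} = beta1 * lam**(t-1)."""
    theta, m, v = theta0, 0.0, 0.0
    trajectory = [theta]
    for t in range(1, steps + 1):
        g = grad(theta)
        alpha_t = alpha / math.sqrt(t)        # step size decaying as t^(-1/2)
        beta1_t = beta1 * lam ** (t - 1)      # exponentially decayed first-moment rate
        m = beta1_t * m + (1.0 - beta1_t) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)        # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)        # bias-corrected second moment
        theta -= alpha_t * m_hat / (math.sqrt(v_hat) + eps)
        trajectory.append(theta)
    return trajectory


# Hypothetical convex test problem: f(theta) = (theta - 3)^2, gradient 2 * (theta - 3).
traj = adam_decayed(lambda th: 2.0 * (th - 3.0), theta0=0.0, steps=5000)
```

On this toy objective the iterate moves from 0 toward the minimizer at 3. The sketch only shows how the two schedules enter the update; it does not verify the regret bound.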
Deep Learning with Limited Numerical Precision
As a first step towards achieving this cross-layer co-design, we explore the use of low-precision fixed-point arithmetic for deep neural network training with a special focus on the rounding mode adopted while performing operations on fixed-point numbers. The motivation to move to fixed-point arithmetic (from the conventional floating-point computations) is two-fold. Firstly, fixed-point compute units are typically faster and consume far less hardware resources and power than floating-point engines. The smaller logic footprint of the fixed-point arithmetic circuits would allow for the instantiation of many more such units for a given area and power budget. Secondly, low-precision data representation reduces the memory footprint, enabling larger models to fit within the given memory capacity. Cumulatively, this could provide dramatically improved data-level parallelism.
The key finding of our exploration is that deep neural networks can be trained using low-precision fixed-point arithmetic, provided that the stochastic rounding scheme is applied while operating on fixed-point numbers. We test the validity of the proposed approach by training deep neural networks for the MNIST and CIFAR10 image classification tasks. Deep networks trained using 16-bit wide fixed-point and stochastic rounding achieve nearly the same performance as that obtained when trained using 32-bit floating-point computations. Furthermore, we present a hardware accelerator design, prototyped on an FPGA, that achieves high throughput and low power using a large number of fixed-point arithmetic units, a dataflow architecture, and compact stochastic rounding modules.
2. Related Work
Determining the precision of the data representation and the compute units is a critical design choice in the hardware (analog or digital) implementation of artificial neural networks. Not surprisingly, a rich body of literature exists that aims to quantify the effect of this choice on the network's performance. However, a disproportionately large majority of these studies are focused primarily on implementing just the feed-forward (inference) stage, assuming that the network is trained offline using high precision computations. Some recent studies that embrace this approach have relied on the processor's vector instructions to perform multiple 8 bit operations in parallel (Vanhoucke et al., 2011), or employ reconfigurable hardware (FPGAs) for high-throughput, energy-efficient inference (Farabet et al., 2011; Gokhale et al., 2014), or take the route of custom hardware implementations (Kim et al., 2014; Merolla et al., 2014).
Previous studies have also investigated neural network training using different number representations. Iwata et al. (Iwata et al., 1989) implement the back-propagation algorithm using 24-bit floating-point processing units. Hammerstrom (Hammerstrom, 1990) presents a framework for on-chip learning using 8 to 16 bit fixed-point arithmetic. In (Holt & Hwang, 1993), the authors perform theoretical analysis to understand a neural network's ability to learn when trained in a limited precision setting. Results from empirical evaluation of simple networks indicate that in most cases, 8-16 bits of precision is sufficient for back-propagation learning. In (Höhfeld & Fahlman, 1992), probabilistic rounding of weight updates is used to further reduce (< 8 bits) the precision requirements in gradient-based learning techniques. While these studies provide valuable insights into the behavior of the limited precision training of neural networks, the networks considered are often limited to variants of the classical multilayer perceptron containing a single hidden layer and only a few hidden units. Extrapolating these results to the state-of-the-art deep neural networks that can easily contain millions of trainable parameters is non-trivial. Consequently, there is a need to reassess the impact of limited precision computations within the context of more contemporary deep neural network architectures, datasets, and training procedures.
A recent work (Chen et al., 2014) presents a hardware accelerator for deep neural network training that employs fixed-point computation units, but finds it necessary to use 32-bit fixed-point representation to achieve convergence while training a convolutional neural network on the MNIST dataset. In contrast, our results show that it is possible to train these networks using only 16-bit fixed-point numbers, so long as stochastic rounding is used during fixed-point computations. To our knowledge, this work represents the first study of application of stochastic rounding while training deep neural networks using low-precision fixed-point arithmetic.
3. Limited Precision Arithmetic
Standard implementations of deep neural network training via the back-propagation algorithm typically use 32-bit floating-point (float) representation of real numbers for data storage and manipulation. Instead, consider the generalized fixed-point number representation: [QI.QF], where QI and QF correspond to the integer and the fractional part of the number, respectively. The number of integer bits (IL) plus the number of fractional bits (FL) yields the total number of bits used to represent the number. The
sum IL + FL is referred to as the word length WL. In this paper, we use the notation ⟨IL, FL⟩ to denote a fixed-point representation in which IL (FL) corresponds to the length of the integer (fractional) part of the number. We also employ ε to denote the smallest positive number that may be represented in the given fixed-point format. Therefore, the ⟨IL, FL⟩ fixed-point format limits the precision to FL bits, sets the range to $[-2^{\mathrm{IL}-1},\ 2^{\mathrm{IL}-1} - 2^{-\mathrm{FL}}]$, and defines ε to be equal to $2^{-\mathrm{FL}}$.
3.2. Multiply and accumulate (MACC) operation
Consider two d-dimensional vectors a and b such that each component is represented in the fixed-point format ⟨IL, FL⟩, and define $c_0 = a \cdot b$ as the inner product of a and b. $c_0$ is also represented in some fixed-point format $\langle \widetilde{\mathrm{IL}}, \widetilde{\mathrm{IF}} \rangle$. We split the computation of $c_0$ into the following two steps:
1. Compute $z = \sum_{i=1}^{d} a_i b_i$
Figure 1. MNIST dataset using fully connected DNNs: Training error (a, c) and the test error (b, d) for training using
fixed-point number representation and rounding mode set to either “Round to nearest” (top) or “Stochastic rounding”
(bottom). The word length for fixed-point numbers WL is kept fixed at 16 bits and results are shown for three different
fractional (integer) lengths: 8(8), 10(6), and 14(2) bits. Results using float are also shown for comparison.
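The ⟨IL, FL⟩ format and the two rounding modes named in the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names `quantize` and `macc` are hypothetical, values are held in ordinary floats, round-to-nearest sends exact ties downward, stochastic rounding rounds up with probability equal to the fractional remainder measured in units of ε, out-of-range results saturate, and the inner product is accumulated at full precision before a single final rounding.

```python
import math
import random


def quantize(x, il, fl, mode="nearest", rng=random):
    """Round x to the <IL, FL> fixed-point grid and saturate to its range."""
    eps = 2.0 ** -fl                          # smallest positive representable number
    lo = -(2.0 ** (il - 1))                   # lower end of the representable range
    hi = 2.0 ** (il - 1) - eps                # upper end of the representable range
    floor_x = math.floor(x / eps) * eps       # largest multiple of eps that is <= x
    frac = (x - floor_x) / eps                # remainder in units of eps, in [0, 1)
    if mode == "nearest":
        y = floor_x + eps if frac > 0.5 else floor_x
    elif mode == "stochastic":
        y = floor_x + eps if rng.random() < frac else floor_x
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return min(max(y, lo), hi)


def macc(a, b, il, fl, mode="nearest", rng=random):
    """Inner product accumulated at full precision, then rounded once (an assumption)."""
    z = sum(ai * bi for ai, bi in zip(a, b))
    return quantize(z, il, fl, mode, rng)


# A parameter update smaller than eps / 2 under both rounding modes.
eps = 2.0 ** -8
update = 0.3 * eps
nearest = quantize(update, 8, 8, "nearest")
rng = random.Random(0)
samples = [quantize(update, 8, 8, "stochastic", rng) for _ in range(20000)]
stochastic_mean = sum(samples) / len(samples)
```

With round-to-nearest the small update is lost entirely, while the stochastic samples are each either 0 or ε and average out close to the original update, which is the sense in which stochastic rounding keeps gradient information in expectation.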
4. Training Deep Networks
In this section, we present the results of our investigation into the effect of employing limited precision data representation during the training of deep neural networks. We consider both fully connected deep neural networks (DNN) as well as convolutional neural networks (CNN) and present results for the MNIST (Lecun & Cortes) and the CIFAR10 (Krizhevsky & Hinton, 2009) datasets. As a baseline for comparison, we first evaluate the network performance (in terms of the rate of reduction of both the training error and the error on the test set) using the conventional 32-bit floating-point arithmetic. Subsequently, we constrain the neural network parameters (weights $W^l$, biases $B^l$), as well as the other intermediate variables generated during the back-propagation algorithm (layer outputs $Y^l$, back-propagated error $\delta^l$, weight updates $\Delta W^l$, bias updates $\Delta B^l$) to be represented in the fixed-point format and train the network again starting from random initialization of the parameters. While training using fixed-point, the different model hyperparameters such as weight initialization, regularization parameters, learning rates etc. are kept unchanged from the ones used during the baseline evaluation. The word length WL for the fixed-point format is set to 16 bits, i.e. the number of bits allocated to represent the integer and the fractional parts add up to 16.
This fairly restrictive choice of number representation has some important implications. From the perspective of neural network training, an aggressive reduction of the precision with which the parameter updates are computed and stored may result in the loss of the gradient information if the updates are significantly smaller than the ε for the given fixed-point format. As a consequence, this may impede the progress of the gradient descent algorithm, or worse, introduce instabilities during the training procedure. Note that in the round-to-nearest scheme, any parameter update in the range $(-\epsilon/2, \epsilon/2)$ is always rounded to zero, as opposed to the stochastic rounding scheme which maintains a non-zero probability of small parameter updates to round to ±ε. Secondly, since the fixed-point format offers only a limited range, outputs of the ReLU activation function may get clipped to the upper limit set by ⟨IL, FL⟩. From a hardware perspective, the use of 16-bits for data storage (instead of float) corresponds to a factor 2 reduction in the amount of memory needed
Figure 2. MNIST dataset using CNNs: Training error (a) and the test error (b) for training using fixed-point number
representation and rounding mode set to either “Round to nearest” or “Stochastic rounding”. The word length for fixed-
point numbers WL is kept fixed at 16 bits and results are shown for different fractional (integer) lengths for weights and
weight updates: 12(4), and 14(2) bits. Layer outputs use h6, 10i format in all cases. Results using float are also shown
for comparison.
for training a given network. Moreover, the use of the same word length for all network variables carries with it the added advantage of simplifying the hardware implementation.
4.1. MNIST
4.1.1. Fully connected DNN
In the first set of experiments, we construct a fully connected neural network with 2 hidden layers, each containing 1000 units with ReLU activation function and train this network to recognize the handwritten digits from the MNIST dataset. This dataset comprises 60,000 training images and 10,000 test images; each image is 28 x 28 pixels containing a digit from 0 to 9. The pixel values are normalized to lie in the [0, 1] range. No other form of data pre-processing or augmentation is performed. The weights in each layer are initialized by sampling random values from N(0, 0.01) while the bias vectors are initialized to 0. The network is trained using minibatch stochastic gradient descent (SGD) with a minibatch size of 100 to minimize the cross entropy objective function. The float baseline achieves a test error of 1.4%.
Next, we retrain the network using fixed-point computations and set WL to 16 bits. Figure 1 shows the results for the two rounding modes: Round-to-nearest and Stochastic rounding. In both cases, allocating 14 bits to the fractional part⁴ produces no noticeable degradation in either the convergence rate or the classification accuracy. A reduction in the precision below 14 bits begins to negatively impact the network's ability to learn when the round-to-nearest scheme is adopted. This is primarily because at reduced fractional precision, most of the parameter updates are rounded down to zero. In contrast, the stochastic rounding preserves the gradient information, at least statistically, and the network is able to learn with as few as 8 bits of precision without any significant loss in performance. Note, however, at a precision lower than 8 bits, even the stochastic rounding scheme is unable to fully prevent the loss of gradient information.
⁴ Using up 14 bits for the fractional part leaves only 2 bits (including the sign bit) for representing the integer portion of the number. This does not seem to adversely affect the network performance.
4.1.2. CNN
Using the MNIST dataset, we also evaluate a CNN with an architecture similar to LeNet-5 (LeCun et al., 1998). It comprises 2 convolutional layers with 5x5 filters and ReLU activation function. The first layer has 8 feature maps while the second convolutional layer produces 16 feature maps. Each convolutional layer is followed by a pooling/subsampling layer. The pooling layers implement the max pooling function over non-overlapping pooling windows of size 2x2. The output of the second pooling layer feeds into a fully connected layer consisting of 128 ReLU neurons, which is then connected into a 10-way softmax output layer. For training this network, we adopt an exponentially decreasing learning rate, scaling it by a factor of 0.95 after every epoch of training. The learning rate for the first epoch is set to 0.1. Momentum (p = 0.9) is used to speed up SGD convergence. The weight decay parameter is set to 0.0005 for all layers. When
Figure 3. CIFAR10 dataset using CNNs: Training error (a) and the test error (b) for training using fixed-point number
representation and rounding mode set to either “Round to nearest” or “Stochastic rounding”. The word length for fixed-
point numbers WL is kept fixed at 16 bits and results are shown for different fractional (integer) lengths for weights and
weight updates: 12(4), and 14(2) bits. The black arrows indicate the epoch after which the training is carried out using
WL = 20 bits. Results using float are also shown for comparison.
trained using float, the network achieves a test error of 0.77%. As was done previously for DNNs, we retrain the network using fixed-point computations with WL set to 16 bits. However, in this case, saturating the output of the convolutional layers to a low integer value created some difficulty in jump-starting the training procedure. As a result, we increase the number of bits allocated for the integer part at the expense of reducing the precision and choose the ⟨6, 10⟩ format for representing the layer outputs. Figure 2 compiles the results obtained using the two different rounding modes. Unlike in the case of DNNs, when the round-to-nearest scheme is adopted during fixed-point computations, the training procedure fails to converge. When stochastic rounding is used, we achieve a test error of 0.83% and 0.90% for 14-bit and 12-bit precision, respectively, corresponding to only a slight degradation from the float baseline.
4.2. CIFAR10
To further test the validity of the stochastic rounding approach, we consider another commonly used image classification benchmark: CIFAR10. The training set consists of 50,000 RGB images of size 32x32 pixels. The images are divided into 10 classes, each containing 5,000 images. The test set has 10,000 images. We scale the image RGB values to the [0,1] range and do not perform any other form of data pre-processing or augmentation. For this dataset, we construct a CNN with 3 convolutional layers each followed by a subsampling/pooling layer. The convolutional layers consist of 64 5x5 filters and the subsampling layers implement the max pooling function over a window of size 3x3 using a stride of 2. The 3rd pooling layer connects to a 10-way softmax output layer. This architecture is similar to the one introduced in (Hinton et al., 2012) with the exception that it does not implement local response normalization or dropout layers.
The network training starts off with a learning rate of 0.01, which is reduced by a factor of 2 after 50, 75, and 100 epochs. Using 32-bit floating point numbers for training, this network configuration misclassifies approximately 24.6% of the images in the test set. This serves as the baseline for comparing the results obtained while training the network using fixed-point computations. Similar to earlier experiments, we set the WL for fixed-point number to 16 and test the different rounding modes and fractional precision. The layer outputs are represented in the ⟨4, 12⟩ format. As observed previously and as shown in Figure 3, training using fixed-point with round-to-nearest scheme begins to collapse after only a few epochs. On the contrary, the stochastic rounding scheme appears to bestow upon the training procedure a significantly higher degree of stability. For 14 bits of fractional precision and the stochastic rounding scheme, the network's behavior is quite similar to that observed during the baseline evaluation and achieves a test error of 25.4%. If the precision is reduced further (to 12 bits) the convergence rate degrades as the learning proceeds and after a point, SGD stops making progress. This is expected since at reduced precision, the parameter updates tend to become sparser (despite stochastic rounding) due to the perilous combination of smaller gradients and diminished learning rates. The network's performance suffers as a result and the minimum achievable test error saturates at 28.8%. Fortunately, this damage is reversible as shown in Figure 3. After
training for 100 epochs using the ⟨4, 12⟩ format, we relax the constraint on WL slightly and increase WL by 4 bits to 20 bits. This increases the fractional precision to 16 bits (⟨4, 16⟩ format) and subsequent training results in a rapid improvement in the network's performance. After an additional 15-20 epochs of training using the higher precision representation, the test error approaches that obtained using float.
This result reveals a promising (and possibly more robust) strategy for deep neural network training in which the network is first trained using low-precision fixed-point arithmetic and stochastic rounding. At the point where learning shows stagnation, the network can be "fine-tuned" using only a few epochs of higher-precision fixed-point computations. Such a concept of employing mixed-precision computations has been explored previously in the context of floating point arithmetic (Baboulin et al., 2009), motivated largely by the fact that most modern processors achieve a factor 2 to 4 higher computational throughput for single-precision (32-bit) floating-point as compared with double-precision (64-bit) floating-point. Similar concepts, in conjunction with stochastic rounding, can be extended to perform mixed-precision fixed-point arithmetic.⁵
⁵ While preparing this paper, we became aware of a very recent work (Courbariaux et al., 2014) that shares our motivations but adopts an orthogonal approach. The authors propose the use of dynamic fixed-point (a hybrid of the fixed-point and the conventional floating-point arithmetic) for training deep neural networks. However, hardware implications of this approach are not immediately obvious.
5. Hardware Prototyping
The execution time of the mini-batch stochastic gradient descent algorithm is dominated by a series of GEMM operations in the feed-forward, error back-propagation and weight update calculation steps⁶. As a result, an improvement in the computational throughput of the GEMM operation translates into an improvement in the training time. GPUs offering a large number of parallel vector processors and high memory bandwidth have therefore been very effective in accelerating these workloads.
⁶ Convolution may also be rewritten as a GEMM operation
In this section we describe an FPGA-based hardware accelerator for matrix-matrix multiplication. Our choice of using FPGAs as the hardware substrate is motivated by two factors. Firstly, FPGAs enable fast hardware development times and significantly lower costs when compared to ASICs⁷. Secondly, modern FPGAs have a large number of hard-wired fixed-point DSP units that are well-suited to implementing the fixed-point arithmetic described in the earlier sections, and can potentially yield gains in performance and power efficiency. However, limited memory bandwidth must still be carefully managed through various design choices.
⁷ Application Specific Integrated Circuits
Figure 4. Block diagram of the FPGA-based fixed-point matrix multiplier.
Our prototype is implemented on an off-the-shelf FPGA card featuring a Xilinx Kintex325T FPGA and 8 GB DDR3 memory, and communicating with the host PC over a PCIe bus. This FPGA has 840 DSP multiply-accumulate units and almost 2 MB of on-chip block RAM. The data bandwidth between the off-chip DDR3 memory and the FPGA is 6.4 GB/s. The typical dimensions of the input matrices preclude storing entire matrices in on-chip RAM. Thus, these matrices are stored in the DDR3 memory and parts of the matrices are brought into the FPGA for performing the computations. The off-chip communication bandwidth limitation necessitates that we reuse the on-chip data to the highest extent possible to make the achievable throughput, measured in giga-operations/second (G-ops/s), compute-bound.
5.1. System Description
Figure 4 presents a block diagram of our fixed-point matrix multiplier. The DSP units within the FPGA are organized as a massively parallel 2-dimensional systolic array (SA) (Kung, 1982) of size n such that $n^2 < 840$. This forms the core of the multiplier and will be described in greater detail in the next subsection. Most of the block RAM on the FPGA is designated as the L2 cache where a fraction of the input matrices are stored. The READ logic sends data requests to the DDR3 memory and organizes the incoming data into the L2 cache. The WRITE logic sends back computed results to the external memory. The L2-to-SA circuit moves relevant rows and columns from the L2 cache to the array. The TOP
controller coordinates the entire process. The FPGA also contains Xilinx-supplied IP blocks that interface to the DDR3 memory.
The operation sequence of the multiplier is as follows. Assume the first input matrix A has dimensions l x k and the second input matrix B has dimensions k x m. Initially n columns of matrix B and pn rows of matrix A, where p is the largest integer we can choose based on on-chip memory capacity constraints, are brought into the FPGA to compute $pn^2$ elements of the result matrix. The next n columns of matrix B are then brought in and processed. This continues until all m columns of matrix B have been multiplied with the first pn rows of matrix A. This entire sequence is repeated l/pn times to process all rows of matrix A. Double buffering is employed to hide the latency of bringing new subsets of the matrices into the chip. This sequence of operation ensures that elements of matrix A are reused m times once brought into the FPGA while those of matrix B are reused pn times. This reuse allows efficient use of the bandwidth between the FPGA and the DDR3 memory.
5.2. Systolic Array Architecture
Figure 5. Schematic of the systolic core for matrix multiplication. [Diagram labels: Input A FIFOs, Input B FIFOs, DSP MACC units, Local Storage Registers, Output C FIFOs.]
Figure 5 shows the logical organization of the systolic array. Each node of the systolic array (DSP MACC) has a DSP unit that implements two operations (multiply and accumulate) in every clock cycle. Elements of input matrices A and B brought in from L2-cache are staged in local block RAM units configured as FIFO (First In First Out) queues. Each FIFO contains elements from either a row of A or a column of B. In each clock cycle, one element is read out from the FIFO. Elements from earlier cycles are cascaded right (for A) or down (for B) and the corresponding partial products are accumulated at the DSP units. After accumulation of all partial products, output data is cascaded out to stochastic rounding units (DSP ROUND) that are also implemented with DSP units. Rounded results are stored in output FIFOs (one per column) before final readout to external memory. Throughput of the array depends on the number of DSPs available and the maximum operating frequency at which the system can be operated without timing errors. This is an example of a wavefront-type systolic array where all connections are local, i.e. only between neighboring DSPs and edge FIFOs, which limits interconnect delays and improves maximum operating frequency.
Figure 6. Wavefront systolic array operation.
In a wavefront array, as depicted in Figure 6, at the end of k cycles, where k corresponds to the inner dimension of the matrix multiplication, MACC unit "11" has accumulated all of its partial products. At this point, the accumulated result is transferred to a local register and the DSP is reset. This frees it up to receive data from the next matrix multiplication operation, even before other elements have completed. This achieves high throughput for the systolic array so long as the pipeline is fed with new incoming data. At the end of (k + 2n − 2) cycles, the matrix multiplication is complete, and data from the last DSP unit can be read out. Output paths from local registers to the edge of the array are also cascaded.
Word length of the result elements after MACC operations is much larger (typically 48 bits if using 7-series DSPs) than word length of the inputs (typically 18 bits or less). Before transferring to output FIFOs, result elements must be trimmed through the stochastic rounding of least significant bits (LSB) and truncation of excess MSB bits (after detection of
Deep Learning with Limited Numerical Precision
Resource     Used     Available   Utilization
LUTs         62922    203800      31%
Flip-flops   146510   407600      36%
DSP          812      840         97%
Block RAM    334      445         75%

8 A more direct stochastic rounding approach is multi-bit magnitude comparison of result LSB vs. a random number, followed by a conditional addition and examining excess MSBs. The approach in this section achieves the same result but removes the first full multi-bit comparison, enabling compact implementation on a single DSP unit.

Bottou, Léon and Bousquet, Olivier. The tradeoffs of large scale learning. In NIPS, volume 4, pp. 2, 2007.

Chen, Yunji, Luo, Tao, Liu, Shaoli, Zhang, Shijin, He, Liqiang, Wang, Jia, Li, Ling, Chen, Tianshi, Xu, Zhiwei, Sun, Ninghui, et al. Dadiannao: A machine-learning supercomputer. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 609–622. IEEE, 2014.

Chilimbi, Trishul, Suzue, Yutaka, Apacible, Johnson, and Kalyanaraman, Karthik. Project adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 571–582, Broomfield, CO, October 2014.

Coates, Adam, Huval, Brody, Wang, Tao, Wu, David, Catanzaro, Bryan, and Andrew, Ng. Deep learning with cots hpc systems. In Proceedings of The 30th International Conference on Machine Learning, pp. 1337–1345, 2013.

Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. Low precision arithmetic for deep learning. arXiv preprint arXiv:1412.7024, 2014.

Dean, Jeffrey, Corrado, Greg, Monga, Rajat, Chen, Kai, Devin, Matthieu, Mao, Mark, Senior, Andrew, Tucker, Paul, Yang, Ke, Le, Quoc V, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.

Farabet, Clément, Martini, Berin, Corda, Benoit, Akselrod, Polina, Culurciello, Eugenio, and LeCun, Yann. Neuflow: A runtime reconfigurable dataflow processor for vision. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pp. 109–116. IEEE, 2011.

Gokhale, Vinayak, Jin, Jonghoon, Dundar, Aysegul, Martini, Berin, and Culurciello, Eugenio. A 240 g-ops/s mobile coprocessor for deep neural networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pp. 696–701. IEEE, 2014.

Hammerstrom, Dan. A vlsi architecture for high-performance, low-cost, on-chip learning. In Neural Networks, 1990., 1990 IJCNN International Joint Conference on, pp. 537–544. IEEE, 1990.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Höhfeld, Markus and Fahlman, Scott E. Probabilistic rounding in neural network learning with limited precision. Neurocomputing, 4(6):291–299, 1992.

Holt, JL and Hwang, Jenq-Neng. Finite precision error analysis of neural network hardware implementations. Computers, IEEE Transactions on, 42(3):281–290, 1993.

Iwata, Akira, Yoshida, Yukio, Matsuda, Satoshi, Sato, Yukimasa, and Suzumura, Nobuo. An artificial neural network accelerator using general purpose 24 bit floating point digital signal processors. In Neural Networks, 1989. IJCNN., International Joint Conference on, pp. 171–175. IEEE, 1989.

Kim, Jonghong, Hwang, Kyuyeon, and Sung, Wonyong. X1000 real-time phoneme recognition vlsi using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 7510–7514. IEEE, 2014.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 1(4):7, 2009.

Kung, H.T. Why systolic architectures? Computer, 15(1):37–46, Jan 1982. doi: 10.1109/MC.1982.1653825.

Lecun, Yann and Cortes, Corinna. The MNIST database of handwritten digits. URL http://yann.lecun.com/exdb/mnist/.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Merolla, Paul A, Arthur, John V, Alvarez-Icaza, Rodrigo, Cassidy, Andrew S, Sawada, Jun, Akopyan, Filipp, Jackson, Bryan L, Imam, Nabil, Guo, Chen, Nakamura, Yutaka, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.

Murray, Alan F and Edwards, Peter J. Enhanced mlp performance and fault tolerance resulting from synaptic weight noise during training. Neural Networks, IEEE Transactions on, 5(5):792–802, 1994.

Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.

Wu, Ren, Yan, Shengen, Shan, Yi, Dang, Qingqing, and Sun, Gang. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.
Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift
Sergey Ioffe Christian Szegedy
Google Inc., sioffe@google.com Google Inc., szegedy@google.com
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

1 Introduction

Deep learning has dramatically advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum (Sutskever et al., 2013) and Adagrad (Duchi et al., 2011) have been used to achieve state of the art performance. SGD optimizes the parameters Θ of the network, so as to minimize the loss

$$\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta)$$

where x1...N is the training data set. With SGD, the training proceeds in steps, and at each step we consider a mini-batch x1...m of size m. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters, by computing

$$\frac{1}{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}.$$

of the loss over a mini-batch is an estimate of the gradient over the training set, whose quality improves as the batch size increases. Second, computation over a batch can be much more efficient than m computations for individual examples, due to the parallelism afforded by the modern computing platforms.

While stochastic gradient is simple and effective, it requires careful tuning of the model hyper-parameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters. The training is complicated by the fact that the inputs to each layer are affected by the parameters of all preceding layers – so that small changes to the network parameters amplify as the network becomes deeper.

The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000). This is typically handled via domain adaptation (Jiang, 2008). However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing

$$\ell = F_2(F_1(u, \Theta_1), \Theta_2)$$

where F1 and F2 are arbitrary transformations, and the parameters Θ1, Θ2 are to be learned so as to minimize the loss ℓ. Learning Θ2 can be viewed as if the inputs x = F1(u, Θ1) are fed into the sub-network

$$\ell = F_2(x, \Theta_2).$$

For example, a gradient descent step

$$\Theta_2 \leftarrow \Theta_2 - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial F_2(x_i, \Theta_2)}{\partial \Theta_2}$$

(for batch size m and learning rate α) is exactly equivalent to that for a stand-alone network F2 with input x. Therefore, the input distribution properties that make training more efficient – such as having the same distribution between the training and test data – apply to training the sub-network as well. As such it is advantageous for the distribution of x to remain fixed over time. Then, Θ2 does
not have to readjust to compensate for the change in the distribution of x.

Fixed distribution of inputs to a sub-network would have positive consequences for the layers outside the sub-network, as well. Consider a layer with a sigmoid activation function z = g(W u + b) where u is the layer input, the weight matrix W and bias vector b are the layer parameters to be learned, and $g(x) = \frac{1}{1+\exp(-x)}$. As |x| increases, g′(x) tends to zero. This means that for all dimensions of x = W u + b except those with small absolute values, the gradient flowing down to u will vanish and the model will train slowly. However, since x is affected by W, b and the parameters of all the layers below, changes to those parameters during training will likely move many dimensions of x into the saturated regime of the nonlinearity and slow down the convergence. This effect is amplified as the network depth increases. In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010) ReLU(x) = max(x, 0), careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. Eliminating it offers a promise of faster training. We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values. This allows us to use much higher learning rates without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for Dropout (Srivastava et al., 2014). Finally, Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

2 Towards Reducing Internal Covariate Shift

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed. It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.

We could consider whitening activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values (Wiesler et al., 2014; Raiko et al., 2012; Povey et al., 2014; Desjardins & Kavukcuoglu). However, if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step. For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data: x̂ = x − E[x] where x = u + b, X = {x1...N} is the set of values of x over the training set, and $E[x] = \frac{1}{N}\sum_{i=1}^{N} x_i$. If a gradient descent step ignores the dependence of E[x] on b, then it will update b ← b + ∆b, where ∆b ∝ −∂ℓ/∂x̂. Then u + (b + ∆b) − E[u + (b + ∆b)] = u + b − E[u + b]. Thus, the combination of the update to b and subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss. As the training continues, b will grow indefinitely while the loss remains fixed. This problem can get worse if the normalization not only centers but also scales the activations. We have observed this empirically in initial experiments, where the model blows up when the normalization parameters are computed outside the gradient descent step.
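The cancellation in the bias example can be checked numerically. The following is a minimal sketch in plain Python, not code from the paper; the helper names `center` and `layer_output`, the toy inputs, and the size of the update `delta_b` are all assumptions chosen for illustration.

```python
# Sketch: centering computed outside the gradient step cancels any update to b.
# All names and numbers here are illustrative assumptions.

def center(values):
    """Subtract the mean over the (toy) training set: x_hat = x - E[x]."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def layer_output(u, b):
    """x = u + b, followed by mean subtraction over the whole set."""
    return center([ui + b for ui in u])

u = [0.5, -1.0, 2.0, 3.5]   # toy layer inputs over the "training set"
b = 0.25                    # learned bias
delta_b = 10.0              # an update computed while ignoring that E[x] depends on b

before = layer_output(u, b)
after = layer_output(u, b + delta_b)

# Largest difference between the two outputs; it stays at floating-point
# noise level, so the loss cannot change even though b has grown by delta_b.
max_diff = max(abs(p - q) for p, q in zip(before, after))
```

Because the output is unchanged, the gradient with respect to x̂ is also unchanged at the next step, so the same ∆b would be applied again; this is the unbounded growth of b that the text describes.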
In Sec. 4.2, we apply Batch Normalization to the best-performing ImageNet classification network, and show that we can match its performance using only 7% of the training steps, and can further exceed its accuracy by a substantial margin. Using an ensemble of such networks trained with Batch Normalization, we achieve the top-5 error rate that improves upon the best known results on ImageNet classification.

The issue with the above approach is that the gradient descent optimization does not take into account the fact that the normalization takes place. To address this issue, we would like to ensure that, for any parameter values, the network always produces activations with the desired distribution. Doing so would allow the gradient of the loss with respect to the model parameters to account for the normalization, and for its dependence on the model parameters Θ. Let again x be a layer input, treated as a
vector, and X be the set of these inputs over the training data set. The normalization can then be written as a transformation

$$\hat{x} = \mathrm{Norm}(x, \mathcal{X})$$

which depends not only on the given training example x but on all examples X – each of which depends on Θ if x is generated by another layer. For backpropagation, we would need to compute the Jacobians

$$\frac{\partial \mathrm{Norm}(x, \mathcal{X})}{\partial x} \quad \text{and} \quad \frac{\partial \mathrm{Norm}(x, \mathcal{X})}{\partial \mathcal{X}};$$

ignoring the latter term would lead to the explosion described above. Within this framework, whitening the layer inputs is expensive, as it requires computing the covariance matrix $\mathrm{Cov}[x] = E_{x \in \mathcal{X}}[x x^T] - E[x]E[x]^T$ and its inverse square root, to produce the whitened activations $\mathrm{Cov}[x]^{-1/2}(x - E[x])$, as well as the derivatives of these transforms for backpropagation. This motivates us to seek an alternative that performs input normalization in a way that is differentiable and does not require the analysis of the entire training set after every parameter update.

Some of the previous approaches (e.g. (Lyu & Simoncelli, 2008)) use statistics computed over a single training example, or, in the case of image networks, over different feature maps at a given location. However, this changes the representation ability of a network by discarding the absolute scale of activations. We want to preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data.

3 Normalization via Mini-Batch Statistics

Since the full whitening of each layer’s inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1. For a layer with d-dimensional input x = (x(1) . . . x(d)), we will normalize each dimension

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}.$$

These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = E[x^{(k)}]$, we could recover the original activations, if that were the optimal thing to do.

In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations. However, this is impractical when using stochastic optimization. Therefore, we make the second simplification: since we use mini-batches in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation. This way, the statistics used for normalization can fully participate in the gradient backpropagation. Note that the use of mini-batches is enabled by computation of per-dimension variances rather than joint covariances; in the joint case, regularization would be required since the mini-batch size is likely to be smaller than the number of activations being whitened, resulting in singular covariance matrices.

Consider a mini-batch B of size m. Since the normalization is applied to each activation independently, let us focus on a particular activation x(k) and omit k for clarity. We have m values of this activation in the mini-batch,

$$\mathcal{B} = \{x_{1 \ldots m}\}.$$

Let the normalized values be x̂1...m, and their linear transformations be y1...m. We refer to the transform

$$\mathrm{BN}_{\gamma,\beta} : x_{1 \ldots m} \rightarrow y_{1 \ldots m}$$

as the Batch Normalizing Transform. We present the BN Transform in Algorithm 1. In the algorithm, ε is a constant added to the mini-batch variance for numerical stability.

Input: Values of x over a mini-batch: B = {x1...m}; Parameters to be learned: γ, β
Output: {yi = BNγ,β(xi)}

$$\mu_{\mathcal{B}} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i \qquad \text{// mini-batch mean}$$
$$\sigma_{\mathcal{B}}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 \qquad \text{// mini-batch variance}$$
$$\hat{x}_i \leftarrow \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \qquad \text{// normalize}$$
$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i) \qquad \text{// scale and shift}$$

Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch.

The BN transform can be added to a network to manipulate any activation. In the notation y = BNγ,β(x), we
indicate that the parameters γ and β are to be learned, but it should be noted that the BN transform does not independently process the activation in each training example. Rather, BNγ,β(x) depends both on the training example and the other examples in the mini-batch. The scaled and shifted values y are passed to other network layers. The normalized activations x̂ are internal to our transformation, but their presence is crucial. The distribution of values of any x̂ has the expected value of 0 and the variance of 1, as long as the elements of each mini-batch are sampled from the same distribution, and if we neglect ε. This can be seen by observing that $\sum_{i=1}^{m} \hat{x}_i = 0$ and $\frac{1}{m}\sum_{i=1}^{m} \hat{x}_i^2 = 1$, and taking expectations. Each normalized activation x̂(k) can be viewed as an input to a sub-network composed of the linear transform y(k) = γ(k) x̂(k) + β(k), followed by the other processing done by the original network. These sub-network inputs all have fixed means and variances, and although the joint distribution of these normalized x̂(k) can change over the course of training, we expect that the introduction of normalized inputs accelerates the training of the sub-network and, consequently, the network as a whole.

(Duchi et al., 2011). The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization

$$\hat{x} = \frac{x - E[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$

using the population, rather than mini-batch, statistics. Neglecting ε, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate $\mathrm{Var}[x] = \frac{m}{m-1} \cdot E_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$, where the expectation is over training mini-batches of size m and $\sigma_{\mathcal{B}}^2$ are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains. Since the means and variances are fixed during inference, the normalization is simply a linear transform applied to each activation. It may further be composed with the scaling by γ and shift by β, to yield a single linear transform that replaces BN(x). Algorithm 2 summarizes the procedure for training batch-normalized networks.
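The Batch Normalizing Transform for a single scalar activation, together with the inference-time form that uses population statistics, can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the default `eps`, and the toy mini-batch are assumptions, and `population_stats` assumes all mini-batches have the same size m.

```python
import math

def bn_train(xs, gamma, beta, eps=1e-5):
    """Training-time BN for one scalar activation over a mini-batch xs."""
    m = len(xs)
    mu = sum(xs) / m                                       # mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m               # mini-batch variance (divides by m)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]  # normalize
    ys = [gamma * xh + beta for xh in x_hat]               # scale and shift
    return ys, x_hat, mu, var

def population_stats(batches):
    """E[x] <- average of batch means; Var[x] <- m/(m-1) * average of batch variances."""
    m = len(batches[0])  # assumption: every mini-batch has the same size m
    mus, variances = [], []
    for xs in batches:
        _, _, mu, var = bn_train(xs, 1.0, 0.0)
        mus.append(mu)
        variances.append(var)
    mean = sum(mus) / len(mus)
    var = m / (m - 1) * sum(variances) / len(variances)
    return mean, var

def bn_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BN folded into a single linear transform y = scale * x + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - scale * mean
    return scale * x + shift

# Toy mini-batch; eps=0.0 makes the mean-0 / variance-1 property exact up to rounding.
batch = [1.0, 2.0, 4.0, 7.0]
ys, x_hat, mu, var = bn_train(batch, gamma=2.0, beta=0.5, eps=0.0)
pop_mean, pop_var = population_stats([batch])
y_at_mean = bn_inference(pop_mean, 2.0, 0.5, pop_mean, pop_var, eps=0.0)
```

With ε set to zero, the normalized values sum to zero and their mean square is one, which is the observation the text uses to argue that x̂ has mean 0 and variance 1. At inference the input equal to the population mean maps to β, since the normalized value there is zero.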
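During training we need to backpropagate the gradient of loss ℓ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform. We use chain rule, as follows (before simplification):

$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma$$
$$\frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_{\mathcal{B}}) \cdot \frac{-1}{2} (\sigma_{\mathcal{B}}^2 + \epsilon)^{-3/2}$$
$$\frac{\partial \ell}{\partial \mu_{\mathcal{B}}} = \left( \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \right) + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{\sum_{i=1}^{m} -2 (x_i - \mu_{\mathcal{B}})}{m}$$
$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{2 (x_i - \mu_{\mathcal{B}})}{m} + \frac{\partial \ell}{\partial \mu_{\mathcal{B}}} \cdot \frac{1}{m}$$
$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i$$
$$\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}$$

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training. Furthermore, the learned affine transform applied to these normalized activations allows the BN transform to represent the identity transformation and preserves the network capacity.

3.1 Training and Inference with Batch-Normalized Networks

Input: Network N with trainable parameters Θ; subset of activations $\{x^{(k)}\}_{k=1}^{K}$
Output: Batch-normalized network for inference, $N_{\mathrm{BN}}^{\mathrm{inf}}$
1: $N_{\mathrm{BN}}^{\mathrm{tr}} \leftarrow N$ // Training BN network
2: for k = 1 . . . K do
3:   Add transformation $y^{(k)} = \mathrm{BN}_{\gamma^{(k)},\beta^{(k)}}(x^{(k)})$ to $N_{\mathrm{BN}}^{\mathrm{tr}}$ (Alg. 1)
4:   Modify each layer in $N_{\mathrm{BN}}^{\mathrm{tr}}$ with input $x^{(k)}$ to take $y^{(k)}$ instead
5: end for
6: Train $N_{\mathrm{BN}}^{\mathrm{tr}}$ to optimize the parameters $\Theta \cup \{\gamma^{(k)}, \beta^{(k)}\}_{k=1}^{K}$
7: $N_{\mathrm{BN}}^{\mathrm{inf}} \leftarrow N_{\mathrm{BN}}^{\mathrm{tr}}$ // Inference BN network with frozen parameters
8: for k = 1 . . . K do
9:   // For clarity, $x \equiv x^{(k)}$, $\gamma \equiv \gamma^{(k)}$, $\mu_{\mathcal{B}} \equiv \mu_{\mathcal{B}}^{(k)}$, etc.
10:  Process multiple training mini-batches B, each of size m, and average over them:
       $E[x] \leftarrow E_{\mathcal{B}}[\mu_{\mathcal{B}}]$
       $\mathrm{Var}[x] \leftarrow \frac{m}{m-1} E_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$
11:  In $N_{\mathrm{BN}}^{\mathrm{inf}}$, replace the transform $y = \mathrm{BN}_{\gamma,\beta}(x)$ with $y = \frac{\gamma}{\sqrt{\mathrm{Var}[x]+\epsilon}} \cdot x + \left( \beta - \frac{\gamma E[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}} \right)$
12: end for
Algorithm 2: Training a Batch-Normalized Network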
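The chain-rule expressions for the BN transform given in this section can be checked against finite differences. The sketch below is a plain-Python illustration under stated assumptions: `toy_loss` is a hypothetical loss (a weighted sum of the outputs, so that ∂ℓ/∂y_i is simply the i-th weight), and the inputs, weights, and step size h are arbitrary choices, not values from the paper.

```python
import math

def bn_forward(xs, gamma, beta, eps):
    """BN transform for one scalar activation over a mini-batch."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    ys = [gamma * xh + beta for xh in x_hat]
    return ys, x_hat, mu, var

def bn_backward(xs, dys, gamma, eps):
    """Unsimplified chain-rule gradients, given dl/dy_i in dys."""
    m = len(xs)
    _, x_hat, mu, var = bn_forward(xs, gamma, 0.0, eps)
    dx_hat = [dy * gamma for dy in dys]
    dvar = sum(dxh * (x - mu) for dxh, x in zip(dx_hat, xs)) * -0.5 * (var + eps) ** -1.5
    dmu = (sum(dxh * -1.0 / math.sqrt(var + eps) for dxh in dx_hat)
           + dvar * sum(-2.0 * (x - mu) for x in xs) / m)
    dxs = [dxh / math.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m
           for dxh, x in zip(dx_hat, xs)]
    dgamma = sum(dy * xh for dy, xh in zip(dys, x_hat))
    dbeta = sum(dys)
    return dxs, dgamma, dbeta

def toy_loss(xs, gamma, beta, eps, weights):
    """Hypothetical loss: weighted sum of BN outputs, so dl/dy_i = weights[i]."""
    ys, _, _, _ = bn_forward(xs, gamma, beta, eps)
    return sum(w * y for w, y in zip(weights, ys))

xs = [0.3, -1.2, 2.5, 0.9]
weights = [1.0, -2.0, 0.5, 3.0]
gamma, beta, eps = 1.5, 0.2, 1e-3

dxs, dgamma, dbeta = bn_backward(xs, weights, gamma, eps)

# Central finite differences with respect to each x_i, for comparison.
h = 1e-6
numeric_dxs = []
for i in range(len(xs)):
    plus = xs[:i] + [xs[i] + h] + xs[i + 1:]
    minus = xs[:i] + [xs[i] - h] + xs[i + 1:]
    numeric_dxs.append((toy_loss(plus, gamma, beta, eps, weights)
                        - toy_loss(minus, gamma, beta, eps, weights)) / (2 * h))
max_err = max(abs(a - n) for a, n in zip(dxs, numeric_dxs))
```

A side effect worth noting: the input gradients sum to (numerically) zero, because adding a constant to every x_i in the mini-batch leaves the BN output unchanged.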
To Batch-Normalize a network, we specify a subset of activations and insert the BN transform for each of them, according to Alg. 1. Any layer that previously received x as the input, now receives BN(x). A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1, or with any of its variants such as Adagrad

3.2 Batch-Normalized Convolutional Networks

Batch Normalization can be applied to any set of activations in the network. Here, we focus on transforms
that consist of an affine transformation followed by an element-wise nonlinearity:

$$z = g(Wu + b)$$

where W and b are learned parameters of the model, and g(·) is the nonlinearity such as sigmoid or ReLU. This formulation covers both fully-connected and convolutional layers. We add the BN transform immediately before the nonlinearity, by normalizing x = W u + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, W u + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.

Note that, since we normalize W u + b, the bias b can be ignored since its effect will be canceled by the subsequent mean subtraction (the role of the bias is subsumed by β in Alg. 1). Thus, z = g(W u + b) is replaced with

$$z = g(\mathrm{BN}(Wu))$$

where the BN transform is applied independently to each dimension of x = W u, with a separate pair of learned parameters γ(k), β(k) per dimension.

For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · p q. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.

3.3 Batch Normalization enables higher learning rates

In traditional deep networks, a learning rate that is too high may result in gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations in gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.

Batch Normalization also makes training more resilient to the parameter scale. Normally, large learning rates may increase the scale of layer parameters, which then amplify the gradient during backpropagation and lead to the model explosion. However, with Batch Normalization, backpropagation through a layer is unaffected by the scale of its parameters. Indeed, for a scalar a,

$$\mathrm{BN}(Wu) = \mathrm{BN}((aW)u)$$

and we can show that

$$\frac{\partial \mathrm{BN}((aW)u)}{\partial u} = \frac{\partial \mathrm{BN}(Wu)}{\partial u}$$
$$\frac{\partial \mathrm{BN}((aW)u)}{\partial (aW)} = \frac{1}{a} \cdot \frac{\partial \mathrm{BN}(Wu)}{\partial W}$$

The scale does not affect the layer Jacobian nor, consequently, the gradient propagation. Moreover, larger weights lead to smaller gradients, and Batch Normalization will stabilize the parameter growth.

We further conjecture that Batch Normalization may lead the layer Jacobians to have singular values close to 1, which is known to be beneficial for training (Saxe et al., 2013). Consider two consecutive layers with normalized inputs, and the transformation between these normalized vectors: ẑ = F(x̂). If we assume that x̂ and ẑ are Gaussian and uncorrelated, and that F(x̂) ≈ J x̂ is a linear transformation for the given model parameters, then both x̂ and ẑ have unit covariances, and $I = \mathrm{Cov}[\hat{z}] = J \mathrm{Cov}[\hat{x}] J^T = J J^T$. Thus, $J J^T = I$, and so all singular values of J are equal to 1, which preserves the gradient magnitudes during backpropagation. In reality, the transformation is not linear, and the normalized values are not guaranteed to be Gaussian nor independent, but we nevertheless expect Batch Normalization to help make gradient propagation better behaved. The precise effect of Batch Normalization on gradient propagation remains an area of further study.

3.4 Batch Normalization regularizes the model

When training with Batch Normalization, a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example. In our experiments, we found this effect to be advantageous to the generalization of the network. Whereas Dropout (Srivastava et al., 2014) is typically used to reduce overfitting, in a batch-normalized network we found that it can be either removed or reduced in strength.

4 Experiments

4.1 Activations over time

To verify the effects of internal covariate shift on training, and the ability of Batch Normalization to combat it, we considered the problem of predicting the digit class on the MNIST dataset (LeCun et al., 1998a). We used a very simple network, with a 28x28 binary image as input, and
[Figure 1 plots: (a) accuracy curves, Without BN vs. With BN, over 10K to 50K training steps; (b) Without BN; (c) With BN]

Figure 1: (a) The test accuracy of the MNIST network trained with and without Batch Normalization, vs. the number of training steps. Batch Normalization helps the network train faster and achieve higher accuracy. (b, c) The evolution of input distributions to a typical sigmoid, over the course of training, shown as {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces the internal covariate shift.

details are given in the Appendix. We refer to this model as Inception in the rest of the text. The model was trained using a version of Stochastic Gradient Descent with momentum (Sutskever et al., 2013), using the mini-batch size of 32. The training was performed using a large-scale, distributed architecture (similar to (Dean et al., 2012)). All networks are evaluated as training progresses by computing the validation accuracy @1, i.e. the probability of predicting the correct label out of 1000 possibilities, on a held-out set, using a single crop per image.

In our experiments, we evaluated several modifications of Inception with Batch Normalization. In all cases, Batch Normalization was applied to the input of each nonlinearity, in a convolutional way, as described in section 3.2, while keeping the rest of the architecture constant.
3 fully-connected hidden layers with 100 activations each.
Each hidden layer computes y = g(W u+b) with sigmoid
4.2.1 Accelerating BN Networks
nonlinearity, and the weights W initialized to small ran-
dom Gaussian values. The last hidden layer is followed Simply adding Batch Normalization to a network does not
by a fully-connected layer with 10 activations (one per take full advantage of our method. To do so, we further
class) and cross-entropy loss. We trained the network for changed the network and its training parameters, as fol-
50000 steps, with 60 examples per mini-batch. We added lows:
Batch Normalization to each hidden layer of the network,
Increase learning rate. In a batch-normalized model,
as in Sec. 3.1. We were interested in the comparison be-
we have been able to achieve a training speedup from
tween the baseline and batch-normalized networks, rather
higher learning rates, with no ill side effects (Sec. 3.3).
than achieving the state of the art performance on MNIST
(which the described architecture does not). Remove Dropout. As described in Sec. 3.4, Batch Nor-
Figure 1(a) shows the fraction of correct predictions malization fulfills some of the same goals as Dropout. Re-
by the two networks on held-out test data, as training moving Dropout from Modified BN-Inception speeds up
progresses. The batch-normalized network enjoys the training, without increasing overfitting.
higher test accuracy. To investigate why, we studied in- Reduce the L2 weight regularization. While in Incep-
puts to the sigmoid, in the original network N and batch- tion an L2 loss on the model parameters controls overfit-
normalized network Ntr ting, in Modified BN-Inception the weight of this loss is
BN (Alg. 2) over the course of train-
reduced by a factor of 5. We find that this improves the
ing. In Fig. 1(b,c) we show, for one typical activation from
the last hidden layer of each network, how its distribu- accuracy on the held-out validation data.
tion evolves. The distributions in the original network Accelerate the learning rate decay. In training Incep-
change significantly over time, both in their mean and tion, learning rate was decayed exponentially. Because
the variance, which complicates the training of the sub- our network trains faster than Inception, we lower the
learning rate 6 times faster.
sequent layers. In contrast, the distributions in the batch-
normalized network are much more stable as training pro- Remove Local Response Normalization While Incep-
gresses, which aids the training. tion and other networks (Srivastava et al., 2014) benefit
from it, we found that with Batch Normalization it is not
necessary.
4.2 ImageNet classification
Shuffle training examples more thoroughly. We enabled
We applied Batch Normalization to a new variant of the within-shard shuffling of the training data, which prevents
Inception network (Szegedy et al., 2014), trained on the the same examples from always appearing in a mini-batch
ImageNet classification task (Russakovsky et al., 2014). together. This led to about 1% improvements in the val-
The network has a large number of convolutional and idation accuracy, which is consistent with the view of
pooling layers, with a softmax layer to predict the image Batch Normalization as a regularizer (Sec. 3.4): the ran-
class, out of 1000 possibilities. Convolutional layers use domization inherent in our method should be most bene-
ReLU as the nonlinearity. The main difference to the net- ficial when it affects an example differently each time it is
work described in (Szegedy et al., 2014) is that the 5 × 5 seen.
convolutional layers are replaced by two consecutive lay- Reduce the photometric distortions. Because batch-
ers of 3 × 3 convolutions with up to 128 filters. The net- normalized networks train faster and observe each train-
work contains 13.6 · 106 parameters, and, other than the ing example fewer times, we let the trainer focus on more
top softmax layer, has no fully-connected layers. More “real” images by distorting them less.
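The MNIST setup of Section 4.1 can be illustrated with a minimal NumPy sketch of one hidden layer y = g(Wu + b), with and without a batch-normalizing transform on the sigmoid inputs, together with the {15, 50, 85}th percentiles that Figure 1(b, c) tracks. The function names, the random stand-in data, the initialization scale 0.01, and the constant `eps` are assumptions made for illustration; this is not the authors' code, and it covers only the training-time (mini-batch) statistics.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_norm_train(z, gamma, beta, eps=1e-5):
    """Normalize each activation over the mini-batch, then apply a learned scale and shift."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mean) / np.sqrt(var + eps)
    return gamma * z_hat + beta

def hidden_layer(u, W, b, gamma=None, beta=None):
    """Return (sigmoid inputs, sigmoid outputs); rows of u are examples, so Wu is written u @ W."""
    z = u @ W + b
    if gamma is not None:
        z = batch_norm_train(z, gamma, beta)
    return z, sigmoid(z)

rng = np.random.default_rng(0)
# Stand-in for a mini-batch of 60 binary 28x28 images (random bits, not MNIST).
u = rng.integers(0, 2, size=(60, 784)).astype(np.float64)
W = rng.normal(0.0, 0.01, size=(784, 100))  # small random Gaussian weights (scale assumed)
b = np.zeros(100)

z_plain, _ = hidden_layer(u, W, b)
z_bn, y_bn = hidden_layer(u, W, b, gamma=np.ones(100), beta=np.zeros(100))

# Percentiles of the inputs to one sigmoid, the quantity plotted in Figure 1(b, c).
pct_plain = np.percentile(z_plain[:, 0], [15, 50, 85])
pct_bn = np.percentile(z_bn[:, 0], [15, 50, 85])
```

Logging `pct_plain` and `pct_bn` at regular intervals during training would give curves of the kind shown in Figure 1(b, c).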
[Figure 2 plot: validation accuracy (0.4 to 0.8) vs. training steps (5M to 30M) for Inception, BN-Baseline, BN-x5, BN-x30 and BN-x5-Sigmoid, with the steps needed to match Inception marked.]

Figure 2: Single crop validation accuracy of Inception and its batch-normalized variants, vs. the number of training steps.

Model            Steps to 72.2%   Max accuracy
Inception        31.0 · 10^6      72.2%
BN-Baseline      13.3 · 10^6      72.7%
BN-x5            2.1 · 10^6       73.0%
BN-x30           2.7 · 10^6       74.8%
BN-x5-Sigmoid                     69.8%

Figure 3: For Inception and the batch-normalized variants, the number of training steps required to reach the maximum accuracy of Inception (72.2%), and the maximum accuracy achieved by the network.
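As a worked reading of Figure 3, the step counts can be turned into speedup factors relative to Inception. The short sketch below only divides the numbers reported in the table; the variable names are illustrative.

```python
# Training steps needed to reach 72.2% validation accuracy (values from Figure 3).
steps_to_72_2 = {
    "Inception": 31.0e6,
    "BN-Baseline": 13.3e6,
    "BN-x5": 2.1e6,
    "BN-x30": 2.7e6,
}

# Speedup of each variant relative to Inception (ratio of step counts).
speedup = {
    name: steps_to_72_2["Inception"] / steps
    for name, steps in steps_to_72_2.items()
}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x fewer steps than Inception")
```

By this measure BN-Baseline needs a bit less than half of Inception's steps, while BN-x5 needs roughly one fourteenth of them.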
Model                       Resolution  Crops  Models  Top-1 error  Top-5 error
GoogLeNet ensemble          224         144    7       -            6.67%
Deep Image low-res          256         -      1       -            7.96%
Deep Image high-res         512         -      1       24.88%       7.42%
Deep Image ensemble         variable    -      -       -            5.98%
BN-Inception single crop    224         1      1       25.2%        7.82%
BN-Inception multicrop      224         144    1       21.99%       5.82%
BN-Inception ensemble       224         144    6       20.1%        4.9%*

Figure 4: Batch-Normalized Inception comparison with previous state of the art on the provided validation set comprising 50000 images. *BN-Inception ensemble has reached 4.82% top-5 error on the 100000 images of the test set of the ImageNet as reported by the test server.
applies to sub-networks and layers, and removing it from internal activations of the network may aid in training.

Our proposed method draws its power from normalizing activations, and from incorporating this normalization in the network architecture itself. This ensures that the normalization is appropriately handled by any optimization method that is being used to train the network. To enable stochastic optimization methods commonly used in deep network training, we perform the normalization for each mini-batch, and backpropagate the gradients through the normalization parameters. Batch Normalization adds only two extra parameters per activation, and in doing so preserves the representation ability of the network. We presented an algorithm for constructing, training, and performing inference with batch-normalized networks. The resulting networks can be trained with saturating nonlinearities, are more tolerant to increased training rates, and often do not require Dropout for regularization.

Merely adding Batch Normalization to a state-of-the-art image classification model yields a substantial speedup in training. By further increasing the learning rates, removing Dropout, and applying other modifications afforded by Batch Normalization, we reach the previous state of the art with only a small fraction of training steps, and then beat the state of the art in single-network image classification. Furthermore, by combining multiple models trained with Batch Normalization, we perform better than the best known system on ImageNet, by a significant margin.

Interestingly, our method bears similarity to the standardization layer of (Gülçehre & Bengio, 2013), though the two methods stem from very different goals, and perform different tasks. The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution. On the contrary, (Gülçehre & Bengio, 2013) apply the standardization layer to the output of the nonlinearity, which results in sparser activations. In our large-scale image classification experiments, we have not observed the nonlinearity inputs to be sparse, neither with nor without Batch Normalization. Other notable differentiating characteristics of Batch Normalization include the learned scale and shift that allow the BN transform to represent identity (the standardization layer did not require this since it was followed by the learned linear transform that, conceptually, absorbs the necessary scale and shift), handling of convolutional layers, deterministic inference that does not depend on the mini-batch, and batch-normalizing each convolutional layer in the network.

In this work, we have not explored the full range of possibilities that Batch Normalization potentially enables. Our future work includes applications of our method to Recurrent Neural Networks (Pascanu et al., 2013), where the internal covariate shift and the vanishing or exploding gradients may be especially severe, and which would allow us to more thoroughly test the hypothesis that normalization improves gradient propagation (Sec. 3.3). We plan to investigate whether Batch Normalization can help with domain adaptation, in its traditional sense, i.e. whether the normalization performed by the network would allow it to more easily generalize to new data distributions, perhaps with just a recomputation of the population means and variances (Alg. 2). Finally, we believe that further theoretical analysis of the algorithm would allow still more improvements and applications.

References

Bengio, Yoshua and Glorot, Xavier. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249–256, May 2010.
Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc’Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.
Desjardins, Guillaume and Kavukcuoglu, Koray. Natural neural networks. (unpublished).
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. ISSN 1532-4435.
Gülçehre, Çaglar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. CoRR, abs/1301.4083, 2013.
He, K., Zhang, X., Ren, S., and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv e-prints, February 2015.
Hyvärinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Netw., 13(4-5):411–430, May 2000.
Jiang, Jing. A literature survey on domain adaptation of statistical classifiers, 2008.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.
LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and K., Muller (eds.), Neural Networks: Tricks of the trade. Springer, 1998b.
Lyu, S and Simoncelli, E P. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pp. 1–8. IEEE Computer Society, Jun 23-28 2008. doi: 10.1109/CVPR.2008.4587821.
Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.
Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In ICML (3), volume 28 of JMLR Proceedings, pp. 1139–1147. JMLR.org, 2013.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
Wiesler, Simon and Ney, Hermann. A convergence analysis of log-linear training. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 657–665, Granada, Spain, December 2011.
weight layers. Also it increases the number of pa-
rameters by 25% and the computational cost is in-
creased by about 30%.
• The number of 28×28 inception modules is increased
from 2 to 3.
• Inside the modules, sometimes average, sometimes
maximum-pooling is employed. This is indicated in
the entries corresponding to the pooling layers of the
table.
• There are no across the board pooling layers be-
tween any two Inception modules, but stride-2 con-
volution/pooling layers are employed before the fil-
ter concatenation in the modules 3c, 4e.
Our model employed separable convolution with depth
multiplier 8 on the first convolutional layer. This reduces
the computational cost while increasing the memory con-
sumption at training time.
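The cost reduction from the separable first layer can be checked with a rough count. The sketch below assumes that "separable convolution with depth multiplier 8" means a depthwise 7×7 convolution producing 8 maps per input channel followed by a 1×1 pointwise convolution, that the input has 3 (RGB) channels, and that the layer produces the 112×112×64 output listed for the first convolution; none of these details is spelled out in the text, so the numbers are illustrative only.

```python
# Assumed shapes for the first convolutional layer (see lead-in).
k, c_in, c_out, depth_multiplier = 7, 3, 64, 8
out_h = out_w = 112

# Ordinary 7x7 convolution: every output channel sees every input channel.
full_weights = k * k * c_in * c_out

# Separable version: depthwise 7x7 filters, then a 1x1 pointwise mix.
depthwise_weights = k * k * c_in * depth_multiplier
pointwise_weights = (c_in * depth_multiplier) * c_out
separable_weights = depthwise_weights + pointwise_weights

# Multiply-accumulates per image, one per weight per output position.
full_macs = out_h * out_w * full_weights
separable_macs = out_h * out_w * separable_weights

# The depthwise stage creates an extra intermediate feature map that has to be
# kept for the backward pass, which is one plausible source of the extra
# training-time memory mentioned above.
intermediate_values = out_h * out_w * c_in * depth_multiplier

print(full_weights, separable_weights, full_macs / separable_macs)
```

Under these assumptions the separable layer uses 2712 weights instead of 9408, about 3.5 times fewer multiply-accumulates, at the price of storing an additional 112×112×24 intermediate map.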
type            patch size/  output size  depth  #1×1  #3×3    #3×3  double #3×3  double  Pool +proj
                stride                                 reduce        reduce       #3×3
convolution*    7×7/2        112×112×64   1
max pool        3×3/2        56×56×64     0
convolution     3×3/1        56×56×192    1            64      192
max pool        3×3/2        28×28×192    0
inception (3a)               28×28×256    3      64    64      64    64           96      avg + 32
inception (3b)               28×28×320    3      64    64      96    64           96      avg + 64
inception (3c)  stride 2     28×28×576    3      0     128     160   64           96      max + pass through
inception (4a)               14×14×576    3      224   64      96    96           128     avg + 128
inception (4b)               14×14×576    3      192   96      128   96           128     avg + 128
inception (4c)               14×14×576    3      160   128     160   128          160     avg + 128
inception (4d)               14×14×576    3      96    128     192   160          192     avg + 128
inception (4e)  stride 2     14×14×1024   3      0     128     192   192          256     max + pass through
inception (5a)               7×7×1024     3      352   192     320   160          224     avg + 128
inception (5b)               7×7×1024     3      352   192     320   192          224     max + 128
avg pool        7×7/1        1×1×1024     0
Binarized Neural Networks: Training Neural Networks with Weights and
Activations Constrained to +1 or −1
1 Université de Montréal
2 Technion - Israel Institute of Technology
3 Columbia University
4 CIFAR Senior Fellow
*Indicates equal contribution. Ordering determined by coin flip.
replace most arithmetic operations with bit-wise operations, which potentially lead to a substantial increase in power-efficiency (see Section 3). Moreover, a binarized CNN can lead to binary convolution kernel repetitions; we argue that dedicated hardware could reduce the time complexity by 60%.

• Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 4).

• The code for training and running our BNNs is available on-line, in both the Theano framework (https://github.com/MatthieuCourbariaux/BinaryNet) and the Torch framework (https://github.com/itayhubara/BinaryNet).

1. Binarized Neural Networks

In this section, we detail our binarization function, show how we use it to compute the parameter gradients, and how we backpropagate through it.

1.1. Deterministic vs Stochastic Binarization

When training a BNN, we constrain both the weights and the activations to either +1 or −1. Those two values are very advantageous from a hardware perspective, as we explain in Section 4. In order to transform the real-valued variables into those two values, we use two different binarization functions, as in (Courbariaux et al., 2015). Our first binarization function is deterministic:

x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise,} \end{cases} \qquad (1)

where x^b is the binarized variable (weight or activation) and x the real-valued variable. It is very straightforward to implement and works quite well in practice. Our second binarization function is stochastic:

x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p, \end{cases} \qquad (2)

where σ is the “hard sigmoid” function:

\sigma(x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right). \qquad (3)

The stochastic binarization is more appealing than the sign function, but harder to implement as it requires the hardware to generate random bits when quantizing. As a result, we mostly use the deterministic binarization function (i.e., the sign function), with the exception of activations at train-time in some of our experiments.

1.2. Gradient Computation and Accumulation

Although our BNN training method uses binary weights and activations to compute the parameter gradients, the real-valued gradients of the weights are accumulated in real-valued variables, as per Algorithm 1. Real-valued weights are likely required for Stochastic Gradient Descent (SGD) to work at all. SGD explores the space of parameters in small and noisy steps, and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to keep sufficient resolution for these accumulators, which at first glance suggests that high precision is absolutely required.

Moreover, adding noise to weights and activations when computing the parameter gradients provides a form of regularization that can help to generalize better, as previously shown with variational weight noise (Graves, 2011), Dropout (Srivastava, 2013; Srivastava et al., 2014) and DropConnect (Wan et al., 2013). Our method of training BNNs can be seen as a variant of Dropout, in which instead of randomly setting half of the activations to zero when computing the parameter gradients, we binarize both the activations and the weights.

1.3. Propagating Gradients Through Discretization

The derivative of the sign function is zero almost everywhere, making it apparently incompatible with backpropagation, since the exact gradient of the cost with respect to the quantities before the discretization (pre-activations or weights) would be zero. Note that this remains true even if stochastic quantization is used. Bengio (2013) studied the question of estimating or propagating gradients through stochastic discrete neurons. They found in their experiments that the fastest training was obtained when using the “straight-through estimator,” previously introduced in Hinton (2012)’s lectures.

We follow a similar approach but use the version of the straight-through estimator that takes into account the saturation effect, and does use deterministic rather than stochastic sampling of the bit. Consider the sign function quantization

q = \mathrm{Sign}(r),

and assume that an estimator g_q of the gradient \partial C / \partial q has been obtained (with the straight-through estimator when needed). Then, our straight-through estimator of \partial C / \partial r is simply

g_r = g_q 1_{|r| \le 1}. \qquad (4)

Note that this preserves the gradient’s information and cancels the gradient when r is too large. Not cancelling the gradient when r is too large significantly worsens the performance. The use of this straight-through estimator is illustrated in Algorithm 1. The derivative 1_{|r| \le 1} can also be
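The binarization functions of Eqs. (1) to (3) and the straight-through estimator of Eq. (4) can be sketched in a few lines of NumPy. The function names and the explicit random generator argument are illustrative choices, not taken from the released Theano or Torch code.

```python
import numpy as np

def hard_sigmoid(x):
    """Eq. (3): clip((x + 1) / 2, 0, 1)."""
    return np.clip((np.asarray(x, dtype=np.float64) + 1.0) / 2.0, 0.0, 1.0)

def binarize_deterministic(x):
    """Eq. (1): +1 where x >= 0, -1 otherwise."""
    return np.where(np.asarray(x) >= 0, 1.0, -1.0)

def binarize_stochastic(x, rng):
    """Eq. (2): +1 with probability hard_sigmoid(x), -1 otherwise."""
    p = hard_sigmoid(x)
    return np.where(rng.random(p.shape) < p, 1.0, -1.0)

def straight_through_grad(g_q, r):
    """Eq. (4): pass the gradient g_q through where |r| <= 1, cancel it elsewhere."""
    return np.asarray(g_q) * (np.abs(np.asarray(r)) <= 1.0)

rng = np.random.default_rng(0)
r = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
q = binarize_deterministic(r)                 # forward pass: q = Sign(r)
g_q = np.ones_like(r)                         # placeholder for an upstream gradient
g_r = straight_through_grad(g_q, r)           # backward pass through the sign
```

In this example the first and last entries of `g_r` are zero because |r| > 1 there, which is the saturation effect the estimator accounts for.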
did not observe accuracy loss when using the shift based BN algorithm instead of the vanilla BN algorithm.

1.5. Shift based AdaMax

The ADAM learning rule (Kingma & Ba, 2014) also seems to reduce the impact of the weight scale. Since ADAM requires many multiplications, we suggest using instead the shift-based AdaMax we detail in Algorithm 4. In the experiment we conducted we did not observe accuracy loss when using the shift-based AdaMax algorithm instead of the vanilla ADAM algorithm.

Algorithm 4 Shift based AdaMax learning rule (Kingma & Ba, 2014). g_t^2 indicates the element-wise square g_t ◦ g_t. Good default settings are α = 2^{−10}, 1 − β_1 = 2^{−3}, 1 − β_2 = 2^{−10}. All operations on vectors are element-wise. With β_1^t and β_2^t we denote β_1 and β_2 to the power t.
Require: Previous parameters θ_{t−1} and their gradient g_t, and learning rate α.
Ensure: Updated parameters θ_t
  {Biased 1st and 2nd raw moment estimates:}
  m_t ← β_1 · m_{t−1} + (1 − β_1) · g_t
  v_t ← max(β_2 · v_{t−1}, |g_t|)
  {Updated parameters:}
  θt ← θt−1 − (α (1 − β1 )) · m̂ vt−1 )
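A minimal sketch of an AdaMax-style step (Kingma & Ba, 2014) with the default settings listed for Algorithm 4 is given below. The bias correction by 1 − β_1^t follows the standard AdaMax rule, and rounding 1/v_t to the nearest power of two (the helper `ap2`) is an assumed stand-in for the shift operations; neither is a transcription of the shift-based update line, and all function names are illustrative.

```python
import numpy as np

def ap2(x):
    """Round |x| to the nearest power of two (in the log domain), keeping the sign; 0 stays 0."""
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = np.sign(x[nz]) * 2.0 ** np.round(np.log2(np.abs(x[nz])))
    return out

def adamax_step(theta, grad, m, v, t,
                alpha=2.0 ** -10, beta1=1 - 2.0 ** -3, beta2=1 - 2.0 ** -10,
                eps=1e-12):
    """One AdaMax-style update; multiplying by ap2(1/v) emulates a shift (assumption)."""
    m = beta1 * m + (1 - beta1) * grad           # biased first moment estimate
    v = np.maximum(beta2 * v, np.abs(grad))      # exponentially weighted infinity norm
    step = (alpha / (1 - beta1 ** t)) * m * ap2(1.0 / (v + eps))
    return theta - step, m, v

theta = np.array([1.0, -2.0])
grad = np.array([0.5, -0.25])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
theta_new, m, v = adamax_step(theta, grad, m, v, t=1)
```

Because α, 1 − β_1 and 1 − β_2 are all powers of two, the remaining multiplications by these constants could also be carried out as shifts on fixed-point hardware.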
1.6. First Layer

In a BNN, only the binarized values of the weights and activations are used in all calculations. As the output of one layer is the input of the next, all the layers' inputs are binary, with the exception of the first layer. However, we do not believe this to be a major issue. First, in computer vision, the input representation typically has much fewer channels (e.g., Red, Green and Blue) than internal representations (e.g., 512). As a result, the first layer of a ConvNet is often the smallest convolution layer, both in terms of parameters and computations (Szegedy et al., 2014). Second, it is relatively easy to handle continuous-valued inputs as fixed point numbers, with m bits of precision. For example, in the common case of 8-bit fixed point inputs:

s = x \cdot w^b \qquad (6)

s = \sum_{n=1}^{8} 2^{n-1} (x^n \cdot w^b), \qquad (7)

Algorithm 5 Running a BNN. L is the number of layers.
Require: a vector of 8-bit inputs a_0, the binary weights W^b, and the BatchNorm parameters θ.
Ensure: the MLP output a_L.
  {1. First layer:}
  a_1 ← 0
  for n = 1 to 8 do
    a_1 ← a_1 + 2^{n−1} × XnorDotProduct(a_0^n, W_1^b)
  end for
  a_1^b ← Sign(BatchNorm(a_1, θ_1))
  {2. Remaining hidden layers:}
  for k = 2 to L − 1 do
    a_k ← XnorDotProduct(a_{k−1}^b, W_k^b)
    a_k^b ← Sign(BatchNorm(a_k, θ_k))
  end for
  {3. Output layer:}
  a_L ← XnorDotProduct(a_{L−1}^b, W_L^b)
  a_L ← BatchNorm(a_L, θ_L)
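The bit-plane decomposition of Eq. (7) and an XNOR-based dot product of the kind Algorithm 5 relies on can be checked numerically. In the sketch below, x^n is read as the vector holding the n-th bit of every 8-bit input, and ±1 values are encoded as the bits 1 and 0; these readings, and the function names `bitplane_dot` and `xnor_dot`, are assumptions made for illustration.

```python
import numpy as np

def bitplane_dot(x, w_b):
    """Eq. (7): s = sum_{n=1..8} 2^(n-1) * (x^n . w^b) for 8-bit inputs x and w^b in {-1, +1}."""
    x = np.asarray(x, dtype=np.uint8)
    w_b = np.asarray(w_b, dtype=np.int64)
    s = 0
    for n in range(1, 9):
        x_n = ((x >> (n - 1)) & 1).astype(np.int64)   # n-th bit of every input
        s += (1 << (n - 1)) * int(x_n @ w_b)
    return s

def xnor_dot(a_b, w_b):
    """Dot product of two {-1, +1} vectors computed as 2 * popcount(XNOR) - length."""
    a_bits = np.asarray(a_b) > 0                      # +1 -> True, -1 -> False
    w_bits = np.asarray(w_b) > 0
    matches = int(np.count_nonzero(~(a_bits ^ w_bits)))
    return 2 * matches - a_bits.size

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=1024, dtype=np.uint8)   # stand-in 8-bit inputs
w_b = rng.choice([-1, 1], size=1024)
a_b = rng.choice([-1, 1], size=1024)

s_bitplanes = bitplane_dot(x, w_b)
s_direct = int(x.astype(np.int64) @ w_b)              # Eq. (6) computed directly
```

Both routes give the same weighted sum, so the first layer needs only eight binary-weight dot products plus shifts and additions.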
Table 1. Classification test error rates of DNNs trained on MNIST (MLP architecture without unsupervised pretraining), CIFAR-10 (without data augmentation) and SVHN.

Data set                                              MNIST          SVHN     CIFAR-10
Binarized activations+weights, during training and test
  BNN (Torch7)                                        1.40%          2.53%    10.15%
  BNN (Theano)                                        0.96%          2.80%    11.40%
  Committee Machines’ Array (Baldassi et al., 2015)   1.35%          -        -
Binarized weights, during training and test
  BinaryConnect (Courbariaux et al., 2015)            1.29 ± 0.08%   2.30%    9.90%
Binarized activations+weights, during test
  EBP (Cheng et al., 2015)                            2.2 ± 0.1%     -        -
  Bitwise DNNs (Kim & Smaragdis, 2016)                1.33%          -        -
Ternary weights, binary activations, during test
  (Hwang & Sung, 2014)                                1.45%          -        -
No binarization (standard results)
  Maxout Networks (Goodfellow et al.)                 0.94%          2.47%    11.68%
  Network in Network (Lin et al.)                     -              2.35%    10.41%
  Gated pooling (Lee et al., 2015)                    -              1.69%    7.62%
Figure 1. Training curves of a ConvNet on CIFAR-10 depending on the method. The dotted lines represent the training costs (square hinge losses) and the continuous lines the corresponding validation error rates. Although BNNs are slower to train, they are nearly as accurate as 32-bit float DNNs.

Figure 2. Binary weight filters, sampled from the first convolution layer. Since we have only 2^{k^2} unique 2D filters (where k is the filter size), filter replication is very common. For instance, on our CIFAR-10 ConvNet, only 42% of the filters are unique.
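The bound behind Figure 2 is easy to reproduce: a k × k filter with entries in {−1, +1} can take only 2^{k^2} distinct values, so a large bank of such filters must contain repetitions. The sketch below counts unique 2D filters in a randomly generated bank; the bank size and the helper name `unique_filter_fraction` are assumptions, and the random weights are a stand-in for trained ones.

```python
import numpy as np

def unique_filter_fraction(w_b):
    """Fraction of distinct 2D filters in an array of shape (num_filters, k, k)."""
    flat = w_b.reshape(w_b.shape[0], -1)
    return np.unique(flat, axis=0).shape[0] / flat.shape[0]

k = 3
max_unique = 2 ** (k * k)          # 512 possible binary 3x3 filters

rng = np.random.default_rng(0)
w_b = rng.choice([-1, 1], size=(4096, k, k))   # stand-in bank of binary filters
frac = unique_filter_fraction(w_b)

print(max_unique, frac)
```

With 4096 filters drawn from only 512 possibilities, at most 12.5% of them can be unique, which illustrates why filter replication is common in binarized ConvNets.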
binary weights and neurons by updating the posterior distributions over the weights. These distributions are updated by differentiating their parameters (e.g., mean values) via the back propagation (BP) algorithm. Esser et al. (2015) implemented a fully binary network at run time using a very similar approach to EBP, showing significant improvement in energy efficiency. The drawback of EBP is that the binarized parameters were only used during inference.

The probabilistic idea behind EBP was extended in the BinaryConnect algorithm of Courbariaux et al. (2015). In BinaryConnect, the real-valued version of the weights is saved and used as a key reference for the binarization process. The binarization noise is independent between different weights, either by construction (by using stochastic quantization) or by assumption (a common simplification; see Spang (1962)). The noise would have little effect on the next neuron’s input because the input is a summation over many weighted neurons. Thus, the real-valued version could be updated by the back propagated error by simply ignoring the binarization noise in the update. Using this method, Courbariaux et al. (2015) were the first to binarize weights in CNNs and achieved near state-of-the-art performance on several datasets. They also argued that noisy weights provide a form of regularization, which could help to improve generalization, as previously shown in (Wan et al., 2013). This method binarized weights while still maintaining full precision neurons.

Lin et al. (2015) carried over the work of Courbariaux et al. (2015) to the back-propagation process by quantizing the representations at each layer of the network, to convert some of the remaining multiplications into binary shifts by restricting the neurons' values to power-of-two integers. Lin et al. (2015)’s work and ours seem to share similar characteristics. However, their approach continues to use full precision weights during the test phase. Moreover, Lin et al. (2015) quantize the neurons only during the back propagation process, and not during forward propagation.

Other research (Baldassi et al., 2015) showed that fully binary training and testing is possible in an array of committee machines with randomized input, where only one weight layer is being adjusted. Judd et al. and Gong et al. aimed to compress a fully trained high precision network by using quantization or matrix factorization methods. These methods required training the network with full precision weights and neurons, thus requiring numerous MAC operations avoided by the proposed BNN algorithm. Hwang & Sung (2014) focused on a fixed-point neural network design and achieved performance almost identical to that of the floating-point architecture. Kim et al. (2014) provided evidence that DNNs with ternary weights, used on a dedicated circuit, consume very low power and can be operated with only on-chip memory, at run time. Sung et al. also indicated satisfactory empirical performance of neural networks with 8-bit precision. Kim & Paris (2015) retrained neural networks with binary weights and activations.

So far, to the best of our knowledge, no work has succeeded in binarizing weights and neurons, at the inference phase and the entire training phase of a deep network. This was achieved in the present work. We relied on the idea that binarization can be done stochastically, or be approximated as random noise. This was previously done for the weights by Courbariaux et al. (2015), but our BNNs extend this to the activations. Note that the binary activations are especially important for ConvNets, where there are typically many more neurons than free weights. This allows highly efficient operation of the binarized DNN at run time, and at the forward propagation phase during training. Moreover, our training method has almost no multiplications, and therefore might be implemented efficiently in dedicated hardware. However, we have to save the value of the full precision weights. This is a remaining computational bottleneck during training, since it requires relatively high energy resources. Novel memory devices might be used to alleviate this issue in the future; see e.g. (Soudry et al.).

Conclusion

We have introduced BNNs, DNNs with binary weights and activations at run-time and when computing the parameter gradients at train-time (see Section 1). We have conducted two sets of experiments on two different frameworks, Torch7 and Theano, which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN, and achieve nearly state-of-the-art results (see Section 2). Moreover, during the forward pass (both at run-time and train-time), BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power-efficiency (see Section 3). Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST MLP 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 4). Future works should explore how to extend the speed-up to train-time (e.g., by binarizing some gradients), and also extend benchmark results to other models (e.g., RNN) and datasets (e.g., ImageNet).

Acknowledgments

We would like to express our appreciation to Elad Hoffer, for his technical assistance and constructive comments. We thank our fellow MILA lab members who took the time to read the article and give us some feedback. We thank the developers of Torch (Collobert et al., 2011), a Lua based environment, and Theano (Bergstra et al., 2010; Bastien et al., 2012), a Python library which allowed us to easily develop a fast and optimized code for GPU. We also thank the developers of Pylearn2 (Goodfellow et al., 2013) and Lasagne (Dieleman et al., 2015), two Deep Learning libraries built on the top of Theano. We thank Yuxin Wu for helping us compare our GPU kernels with cuBLAS. We are also grateful for funding from CIFAR, NSERC, IBM, Samsung, and the Israel Science Foundation (ISF).

Coates, Adam, Huval, Brody, Wang, Tao, Wu, David, Catanzaro, Bryan, and Ng, Andrew. Deep learning with COTS HPC systems. In Proceedings of the 30th international conference on machine learning, pp. 1337–1345, 2013.
Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. Training deep neural networks with low precision multiplications. ArXiv e-prints, abs/1412.7024, December 2014.
Govindu, Gokul, Zhuo, Ling, Choi, Seonil, and Prasanna, Viktor. Analysis of high-performance floating-point arithmetic on FPGAs. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, pp. 149. IEEE, 2004.
Graham, Benjamin. Spatially-sparse convolutional neural networks. arXiv preprint arXiv:1409.6070, 2014.
Graves, Alex. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011.
Han, Song, Pool, Jeff, Tran, John, and Dally, William. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
Hinton, Geoffrey. Neural networks for machine learning. Coursera, video lectures, 2012.
Hinton, Geoffrey, Deng, Li, Dahl, George E., Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, and Kingsbury, Brian. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, Nov. 2012.
Horowitz, Mark. Computing’s Energy Problem (and what we can do about it). IEEE International Solid State Circuits Conference, pp. 10–14, 2014.
Hwang, Kyuyeon and Sung, Wonyong. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1–6. IEEE, 2014.
Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
Judd, Patrick, Albericio, Jorge, Hetherington, Tayler, Aamodt, Tor, Jerger, Natalie Enright, Urtasun, Raquel, and Moshovos, Andreas. Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets. pp. 12.
Kim, Jonghong, Hwang, Kyuyeon, and Sung, Wonyong. X1000 real-time phoneme recognition vlsi using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 7510–7514. IEEE, 2014.
Kim, M. and Smaragdis, P. Bitwise Neural Networks. ArXiv e-prints, January 2016.
Kim, Minje and Paris, Smaragdis. Bitwise Neural Networks. ICML Workshop on Resource-Efficient Machine Learning, 37, 2015.
Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.
Lee, Chen-Yu, Gallagher, Patrick W, and Tu, Zhuowen. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. arXiv preprint arXiv:1509.08985, 2015.
Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network In Network. arXiv preprint, pp. 10.
Lin, Zhouhan, Courbariaux, Matthieu, Memisevic, Roland, and Bengio, Yoshua. Neural networks with few multiplications. ArXiv e-prints, abs/1510.03009, October 2015.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharsan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
Mordvintsev, Alexander, Olah, Christopher, and Tyka, Mike. Inceptionism: Going deeper into neural networks, 2015. Accessed: 2015-06-30.
Pham, Phi-Hung, Jelaca, Darko, Farabet, Clement, Martini, Berin, LeCun, Yann, and Culurciello, Eugenio. Neuflow: dataflow vision processing system-on-a-chip. In Circuits and Systems (MWSCAS), 2012 IEEE 55th International Midwest Symposium on, pp. 1044–1047. IEEE, 2012.
Romero, Adriana, Ballas, Nicolas, Kahou, Samira Ebrahimi, Chassang, Antoine, Gatta, Carlo, and Bengio, Yoshua. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
Sainath, Tara, rahman Mohamed, Abdel, Kingsbury, Brian, and Ramabhadran, Bhuvana. Deep convolutional neural networks for LVCSR. In ICASSP 2013, 2013.
Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, van den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, Dieleman, Sander, Grewe, Dominik, Nham, John, Kalchbrenner, Nal, Sutskever, Ilya, Lillicrap, Timothy, Leach, Madeleine, Kavukcuoglu, Koray, Graepel, Thore, and Hassabis, Demis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, Jan 2016. Article.
Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Kingma, Diederik and Ba, Jimmy. Adam: A method for stochas-
tic optimization. arXiv preprint arXiv:1412.6980, 2014. Soudry, Daniel, Di Castro, Dotan, Gal, Asaf, Kolodny, Avinoam,
and Kvatinsky, Shahar. Memristor-Based Multilayer Neu-
Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classifica- ral Networks With Online Gradient Descent Training. IEEE
tion with deep convolutional neural networks. In NIPS’2012. Transactions on Neural Networks and Learning Systems, (10):
2012. 2408–2421.
LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Soudry, Daniel, Hubara, Itay, and Meir, Ron. Expectation back-
Patrick. Gradient-based learning applied to document recogni- propagation: Parameter-free training of multilayer neural net-
tion. Proceedings of the IEEE, 86(11):2278–2324, November works with continuous or discrete weights. In NIPS’2014,
1998. 2014.
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1
Wan, Li, Zeiler, Matthew, Zhang, Sixin, LeCun, Yann, and Fer-
gus, Rob. Regularization of neural networks using dropcon-
nect. In ICML’2013, 2013.
Journal of Machine Learning Research 13 (2012) 281-305 Submitted 3/11; Revised 9/11; Published 2/12
Abstract
Grid search and manual search are the most widely used strategies for hyper-parameter optimiza-
tion. This paper shows empirically and theoretically that randomly chosen trials are more efficient
for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a compar-
ison with a large previous study that used grid search and manual search to configure neural net-
works and deep belief networks. Compared with neural networks configured by a pure grid search,
we find that random search over the same domain is able to find models that are as good or better
within a small fraction of the computation time. Granting random search the same computational
budget, random search finds better models by effectively searching a larger, less promising con-
figuration space. Compared with deep belief networks configured by a thoughtful combination of
manual search and grid search, purely random search over the same 32-dimensional configuration
space found statistically equal performance on four of seven data sets, and superior performance
on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation
set performance reveals that for most data sets only a few of the hyper-parameters really matter,
but that different hyper-parameters are important on different data sets. This phenomenon makes
grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some
light on why recent “High Throughput” methods achieve surprising success—they appear to search
through a large number of hyper-parameters because most hyper-parameters do not matter much.
We anticipate that growing interest in large hierarchical models will place an increasing burden on
techniques for hyper-parameter optimization; this work shows that random search is a natural base-
line against which to judge progress in the development of adaptive (sequential) hyper-parameter
optimization algorithms.
Keywords: global optimization, model selection, neural networks, deep learning, response surface
modeling
1. Introduction
The ultimate objective of a typical learning algorithm A is to find a function f that minimizes some
expected loss L (x; f ) over i.i.d. samples x from a natural (grand truth) distribution Gx . A learning
algorithm A is a functional that maps a data set X (train) (a finite set of samples from Gx ) to a function
f . Very often a learning algorithm produces f through the optimization of a training criterion with
respect to a set of parameters θ. However, the learning algorithm itself often has bells and whistles
called hyper-parameters λ, and the actual learning algorithm is the one obtained after choosing
λ, which can be denoted Aλ , and f = Aλ (X (train) ) for a training set X (train) . For example, with a
© 2012 James Bergstra and Yoshua Bengio.
B ERGSTRA AND B ENGIO
Gaussian kernel SVM, one has to select a regularization penalty C for the training criterion (which
controls the margin) and the bandwidth σ of the Gaussian kernel, that is, λ = (C, σ).
What we really need in practice is a way to choose λ so as to minimize generalization error
Ex∼Gx [L (x; Aλ (X (train) ))]. Note that the computation performed by A itself often involves an inner
optimization problem, which is usually iterative and approximate. The problem of identifying a
good value for hyper-parameters λ is called the problem of hyper-parameter optimization. This
paper takes a look at algorithms for this difficult outer-loop optimization problem, which is of great
practical importance in empirical machine learning work:
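$$\lambda^{(*)} = \operatorname*{argmin}_{\lambda \in \Lambda} \; \mathbb{E}_{x \sim G_x}\left[ L\left(x; A_\lambda(X^{(\mathrm{train})})\right) \right]. \tag{1}$$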
In general, we do not have efficient algorithms for performing the optimization implied by Equa-
tion 1. Furthermore, we cannot even evaluate the expectation over the unknown natural distribution
Gx , the value we wish to optimize. Nevertheless, we must carry out this optimization as best we
can. With regards to the expectation over Gx , we will employ the widely used technique of cross-
validation to estimate it. Cross-validation is the technique of replacing the expectation with a mean
over a validation set X (valid) whose elements are drawn i.i.d x ∼ Gx . Cross-validation is unbiased
as long as X (valid) is independent of any data used by Aλ (see Bishop, 1995, pp. 32-33). We see in
Equations 2-4 the hyper-parameter optimization problem as it is addressed in practice:
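$$\lambda^{(*)} \approx \operatorname*{argmin}_{\lambda \in \Lambda} \; \operatorname*{mean}_{x \in X^{(\mathrm{valid})}} L\left(x; A_\lambda(X^{(\mathrm{train})})\right). \tag{2}$$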
R ANDOM S EARCH FOR H YPER -PARAMETER O PTIMIZATION
search is used to identify regions in Λ that are promising and to develop the intuition necessary to
choose the sets L(k) . A major drawback of manual search is the difficulty in reproducing results.
This is important both for the progress of scientific research in machine learning as well as for ease
of application of learning algorithms by non-expert users. On the other hand, grid search alone does
very poorly in practice (as discussed here). We propose random search as a substitute and baseline
that is reasonably efficient (roughly equivalent to or better than combining manual search
and grid search, in our experiments) and keeps the advantages of implementation simplicity and
reproducibility of pure grid search. Random search is actually more practical than grid search
because it can be applied even when using a cluster of computers that can fail, and allows the
experimenter to change the “resolution” on the fly: adding new trials to the set or ignoring failed
trials are both feasible because the trials are i.i.d., which is not the case for a grid search. Of course,
random search can probably be improved by automating what manual search does, i.e., a sequential
optimization, but this is left to future work.
There are several reasons why manual search and grid search prevail as the state of the art despite
decades of research into global optimization (e.g., Nelder and Mead, 1965; Kirkpatrick et al., 1983;
Powell, 1994; Weise, 2009) and the publishing of several hyper-parameter optimization algorithms
(e.g., Nareyek, 2003; Czogiel et al., 2005; Hutter, 2009):
• Manual optimization gives researchers some degree of insight into Ψ;
• There is no technical overhead or barrier to manual optimization;
• Grid search is simple to implement and parallelization is trivial;
• Grid search (with access to a compute cluster) typically finds a better λ̂ than purely manual
sequential optimization (in the same amount of time);
• Grid search is reliable in low dimensional spaces (e.g., 1-d, 2-d).
We will come back to the use of global optimization algorithms for hyper-parameter selection
in our discussion of future work (Section 6). In this paper, we focus on random search, that is, inde-
pendent draws from a uniform density from the same configuration space as would be spanned by a
regular grid, as an alternative strategy for producing a trial set {λ(1) ...λ(S) }. We show that random
search has all the practical advantages of grid search (conceptual simplicity, ease of implementation,
trivial parallelism) and trades a small reduction in efficiency in low-dimensional spaces for a large
improvement in efficiency in high-dimensional search spaces.
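As a minimal sketch of the procedure described above, the following Python draws a trial set of S independent configurations and keeps the one with the lowest validation loss, in the spirit of Equation 2. The names `random_search`, `sample_config`, `validation_loss`, `toy_sample` and `toy_loss` are illustrative and do not come from the paper; `validation_loss(config)` is assumed to train a model with the given configuration and return its mean loss on the validation set.

```python
import random


def random_search(sample_config, validation_loss, n_trials, seed=0):
    """Draw n_trials i.i.d. configurations and return the best one.

    sample_config(rng) -> one configuration (hypothetical callback)
    validation_loss(config) -> float, lower is better (hypothetical callback)
    """
    rng = random.Random(seed)
    # Trials are independent, so they could be evaluated in any order
    # (or in parallel) without changing the result.
    trials = [sample_config(rng) for _ in range(n_trials)]
    losses = [validation_loss(config) for config in trials]
    best = min(range(n_trials), key=losses.__getitem__)
    return trials[best], losses[best], list(zip(trials, losses))


def toy_sample(rng):
    # Two hyper-parameters drawn uniformly from the unit square.
    return {"x1": rng.random(), "x2": rng.random()}


def toy_loss(config):
    # Stand-in for training plus validation: only x1 matters here.
    return (config["x1"] - 0.3) ** 2


best_config, best_loss, history = random_search(toy_sample, toy_loss, n_trials=16)
```

Because the draws are independent, more trials can be appended to `history` later, or failed trials dropped, without invalidating the experiment.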
In this work we show that random search is more efficient than grid search in high-dimensional
spaces because functions Ψ of interest have a low effective dimensionality; essentially, Ψ of interest
are more sensitive to changes in some dimensions than others (Caflisch et al., 1997). In particular, if
a function f of two variables could be approximated by another function of one variable ( f (x1 , x2 ) ≈
g(x1 )), we could say that f has a low effective dimension. Figure 1 illustrates how point grids
and uniformly random point sets differ in how they cope with low effective dimensionality, as in
the above example with f . A grid of points gives even coverage in the original 2-d space, but
projections onto either the x1 or x2 subspace produces an inefficient coverage of the subspace. In
contrast, random points are slightly less evenly distributed in the original space, but far more evenly
distributed in the subspaces.
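The projection argument can be checked with a small sketch. Assuming a 3 x 3 grid with ticks at cell centres (the tick placement and the helper names are choices made for this illustration only), the code counts how many distinct values of x1 are tested by nine grid trials and by nine uniformly random trials.

```python
import itertools
import random


def grid_points(ticks_per_axis, dims=2):
    """Regular grid on the unit hyper-cube, with ticks at cell centres."""
    ticks = [(i + 0.5) / ticks_per_axis for i in range(ticks_per_axis)]
    return list(itertools.product(ticks, repeat=dims))


def random_points(n_points, dims, rng):
    """Independent uniform draws from the unit hyper-cube."""
    return [tuple(rng.random() for _ in range(dims)) for _ in range(n_points)]


def distinct_values(points, axis):
    """Number of distinct coordinates left after projecting onto one axis."""
    return len({point[axis] for point in points})


rng = random.Random(0)
grid = grid_points(3)             # 3 x 3 = 9 trials
rand = random_points(9, 2, rng)   # 9 trials
grid_coverage = distinct_values(grid, axis=0)
rand_coverage = distinct_values(rand, axis=0)
```

If f depends almost only on x1, the grid spends nine trials to probe three values of x1, while the random set probes nine.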
If the researcher could know ahead of time which subspaces would be important, then he or she
could design an appropriate grid. However, we show the failings of this strategy in Section 2. For a
Figure 1: Grid and random search of nine trials for optimizing a function f (x, y) = g(x) + h(y) ≈
g(x) with low effective dimensionality. Above each square g(x) is shown in green, and
left of each square h(y) is shown in yellow. With grid search, nine trials only test g(x)
in three distinct places. With random search, all nine trials explore distinct values of
g. This failure of grid search is the rule rather than the exception in high dimensional
hyper-parameter optimization.
given learning algorithm, looking at several relatively similar data sets (from different distributions)
reveals that on different data sets, different subspaces are important, and to different degrees. A grid
with sufficient granularity to optimize hyper-parameters for all data sets must consequently be
inefficient for each individual data set because of the curse of dimensionality: the number of wasted
grid search trials is exponential in the number of search dimensions that turn out to be irrelevant for
a particular data set. In contrast, random search thrives on low effective dimensionality. Random
search has the same efficiency in the relevant subspace as if it had been used to search only the
relevant dimensions.
This paper is organized as follows. Section 2 looks at the efficiency of random search in practice
vs. grid search as a method for optimizing neural network hyper-parameters. We take the grid search
experiments of Larochelle et al. (2007) as a point of comparison, and repeat similar experiments
using random search. Section 3 uses Gaussian process regression (GPR) to analyze the results of
the neural network trials. The GPR lets us characterize what Ψ looks like for various data sets,
and establish an empirical link between the low effective dimensionality of Ψ and the efficiency
of random search. Section 4 compares random search and grid search with more sophisticated
point sets developed for Quasi Monte-Carlo numerical integration, and argues that in the regime of
interest for hyper-parameter selection grid search is inappropriate and more sophisticated methods
bring little advantage over random search. Section 5 compares random search with the expert-
guided manual sequential optimization employed in Larochelle et al. (2007) to optimize Deep Belief
Networks. Section 6 comments on the role of global optimization algorithms in future work. We
conclude in Section 7 that random search is generally superior to grid search for optimizing hyper-
parameters.
Likewise, we must define the estimated variance V about these means on the validation and test sets,
for example, for the zero-one loss (Bernoulli variance):
With other loss functions the estimator of variance will generally be different.
The standard practice for evaluating a model found by cross-validation is to report Ψ(test) (λ(s) )
for the λ(s) that minimizes Ψ(valid) (λ(s) ). However, when different trials have nearly optimal val-
idation means, then it is not clear which test score to report, and a slightly different choice of λ
could have yielded a different test error. To resolve the difficulty of choosing a winner, we report a
weighted average of all the test set scores, in which each one is weighted by the probability that its
particular λ(s) is in fact the best. In this view, the uncertainty arising from X (valid) being a finite sam-
ple of Gx makes the test-set score of the best model among λ(1) , ..., λ(S) a random variable, z. This
score z is modeled by a Gaussian mixture model whose S components have means µs = Ψ(test) (λ(s) ),
To summarize, the performance z of the best model in an experiment of S trials has mean $\mu_z$ and
variance $\sigma_z^2$,
$$\mu_z = \sum_{s=1}^{S} w_s \mu_s, \quad \text{and} \tag{5}$$
$$\sigma_z^2 = \sum_{s=1}^{S} w_s \left( \mu_s^2 + \sigma_s^2 \right) - \mu_z^2. \tag{6}$$
It is simple and practical to estimate weights ws by simulation. The procedure for doing so is to
repeatedly draw hypothetical validation scores Z (s) from Normal distributions whose means are the
Ψ(valid) (λ(s) ) and whose variances are the squared standard errors V(valid) (λ(s) ), and to count how
often each trial generates a winning score. Since the test scores of the best validation scores are
typically relatively close, ws need not be estimated very precisely and a few tens of hypothetical
draws suffice.
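The simulation just described, together with Equations 5 and 6, might be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, scores are treated as accuracies (so the largest draw wins), and the default of 100 draws is arbitrary.

```python
import random


def estimate_weights(valid_means, valid_variances, n_draws=100, seed=0):
    """Estimate w_s, the probability that trial s is in fact the best."""
    rng = random.Random(seed)
    wins = [0] * len(valid_means)
    for _ in range(n_draws):
        # One hypothetical validation score per trial, drawn from a Normal
        # with the trial's validation mean and squared standard error.
        draws = [rng.gauss(mean, var ** 0.5)
                 for mean, var in zip(valid_means, valid_variances)]
        # Assumption: scores are accuracies, so the largest draw wins.
        winner = max(range(len(draws)), key=draws.__getitem__)
        wins[winner] += 1
    return [count / n_draws for count in wins]


def best_model_mean_and_variance(weights, test_means, test_variances):
    """Mixture mean (Equation 5) and mixture variance (Equation 6)."""
    mu_z = sum(w * m for w, m in zip(weights, test_means))
    second_moment = sum(w * (m * m + v)
                        for w, m, v in zip(weights, test_means, test_variances))
    return mu_z, second_moment - mu_z ** 2
```

When one trial clearly dominates in validation, its weight approaches 1 and the estimate reduces to the usual practice of reporting that trial's test score.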
In expectation, this technique for estimating generalization gives a higher estimate than the
traditional technique of reporting the test set error of the best model in validation. The difference is
related to the variance Ψ(valid) and the density of validation set scores Ψ(λ(i) ) near the best value. To
the extent that Ψ(valid) casts doubt on which model was best, this technique averages the performance
of the best model together with the performance of models which were not the best. The next section
(Random Experiment Efficiency Curve) illustrates this phenomenon and discusses it in more detail.
[Figure 2 plot. Panel title: rectangles images. Vertical axis: accuracy (0.45 to 0.80). Horizontal axis: experiment size (# trials), 1 to 128.]
Figure 2: A random experiment efficiency curve. The trials of a random experiment are i.i.d, so
an experiment of many trials (here, 256 trials optimizing a neural network to classify the
rectangles basic data set, Section 2.3) can be interpreted as several independent smaller
experiments. For example, at horizontal axis position 8, we consider our 256 trials to
be 32 experiments of 8 trials each. The vertical axis shows the test accuracy of the best
trial(s) from experiments of a given size, as determined by Equation 5. When there are
sufficiently many experiments of a given size (i.e., 10), the distribution of performance
is illustrated by a box plot whose boxed section spans the lower and upper quartiles and
includes a line at the median. The whiskers above and below each boxed section show
the position of the most extreme data point within 1.5 times the inter-quartile range of the
nearest quartile. Data points beyond the whiskers are plotted with ’+’ symbols. When
there are not enough experiments to support a box plot, as occurs here for experiments of
32 trials or more, the best generalization score of each experiment is shown by a scatter
plot. The two thin black lines across the top of the figure mark the upper and lower
boundaries of a 95% confidence interval on the generalization of the best trial overall
(Equation 6).
consider what Figure 2 would look like if the experiment had included a lucky trial whose validation
score were around 77% as usual, but whose test score were 80%. In the bar plot for trials of size
1, we would see the top performer scoring 80%. In larger experiments, we would average that 80%
performance together with other test set performances because 77% is not clearly the best validation
score; this averaging would make the upper envelope of the efficiency curve slope downward from
80% to a point very close to the current test set estimate of 76%.
Figure 2 characterizes the range of performance that is to be expected from experiments of vari-
ous sizes, which is valuable information to anyone trying to reproduce these results. For example, if
we try to repeat the experiment and our first four random trials fail to find a score better than 70%,
then the problem is likely not in hyper-parameter selection.
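The way one large random experiment is reinterpreted as several smaller ones can be sketched as below. The function name is hypothetical, and the sketch makes one simplification that should be kept in mind: it reports the test score of the single validation winner of each sub-experiment, rather than the weighted average of Equation 5 used for the figure.

```python
def efficiency_curve(valid_scores, test_scores, sizes):
    """Split i.i.d. trials into disjoint experiments of each given size.

    Returns {size: [test score of the validation winner of each experiment]}.
    Simplification: the single validation winner is reported, not the
    weighted average of Equation 5.
    """
    n_trials = len(valid_scores)
    curve = {}
    for size in sizes:
        results = []
        # Consecutive, non-overlapping blocks of `size` trials.
        for start in range(0, n_trials - size + 1, size):
            block = range(start, start + size)
            winner = max(block, key=lambda i: valid_scores[i])
            results.append(test_scores[winner])
        curve[size] = results
    return curve
```

With 256 trials, `sizes=[8]` yields 32 sub-experiments, matching the example at horizontal axis position 8 in the caption of Figure 2.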
Figure 3: From top to bottom, samples from the mnist rotated, mnist background random, mnist
background images, mnist rotated background images data sets. In all data sets the
task is to identify the digit (0 - 9) and ignore the various distracting factors of variation.
Figure 4: Top: Samples from the rectangles data set. Middle: Samples from the rectangles images
data set. Bottom: Samples from the convex data set. In rectangles data sets, the image is
formed by overlaying a small rectangle on a background. The task is to label the small
rectangle as being either tall or wide. In convex, the task is to identify whether the set of
white pixels is convex (images 1 and 4) or not convex (images 2 and 3).
The mnist rotated background images data set is a variation on mnist rotated in which the
images have been rotated by an amount chosen randomly between 0 and 2π radians, and then
composited onto natural image patch backgrounds. This data set included 10 000 training
examples, 2000 validation examples, and 50 000 test examples.
The rectangles data set (Figure 4, top) is a simple synthetic data set of outlines of rectangles.
The images are 28x28, the outlines are white (1-valued) and the backgrounds are black (0-valued).
The height and width of the rectangles were sampled uniformly, but when their difference was
smaller than 3 pixels the samples were rejected. The top left corner of the rectangles was also
sampled uniformly, with the constraint that the whole rectangle fits in the image. Each image is
labelled as one of two classes: tall or wide. This task was easier than the MNIST digit classification,
so we only used 1000 training examples, and 200 validation examples, but we still used 50 000
testing examples.
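A rough sketch of the rejection-sampling procedure for one rectangles image is given below. The text does not state the range from which heights and widths were drawn; sampling integers from 1 to 28 is an assumption of this sketch, as are the function and parameter names.

```python
import random


def sample_rectangle(rng, size=28, min_diff=3):
    """Return (image, label) for one outline rectangle on a black background."""
    # Rejection step: resample until height and width differ enough.
    while True:
        height = rng.randint(1, size)  # assumed range, not given in the text
        width = rng.randint(1, size)
        if abs(height - width) >= min_diff:
            break
    # Top-left corner, constrained so the whole rectangle fits in the image.
    top = rng.randint(0, size - height)
    left = rng.randint(0, size - width)
    image = [[0] * size for _ in range(size)]
    for r in range(top, top + height):
        for c in range(left, left + width):
            on_outline = (r in (top, top + height - 1)
                          or c in (left, left + width - 1))
            if on_outline:
                image[r][c] = 1  # white outline pixel
    label = "tall" if height > width else "wide"
    return image, label
```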
The rectangles images data set (Figure 4, middle) is a variation on rectangles in which the
foreground rectangles were filled with one natural image patch, and composited on top of a different
background natural image patch. The process for sampling rectangle shapes was similar to the one
used for rectangles, except a) the area covered by the rectangles was constrained to be between
25% and 75% of the total image, b) the length and width of the rectangles were forced to be of at
least 10 pixels, and c) their difference was forced to be of at least 5 pixels. This task was harder
than rectangles, so we used 10000 training examples, 2000 validation examples, and 50 000 testing
examples.
The convex data set (Figure 4, bottom) is a binary image classification task. Each 28x28 image
consists entirely of 1-valued and 0-valued pixels. If the 1-valued pixels form a convex region in
image space, then the image is labelled as being convex, otherwise it is labelled as non-convex. The
convex sets consist of a single convex region with pixels of value 1.0. Candidate convex images
were constructed by taking the intersection of a number of half-planes whose location and orienta-
tion were chosen uniformly at random. The number of intersecting half-planes was also sampled
randomly according to a geometric distribution with parameter 0.195. A candidate convex image
was rejected if there were fewer than 19 pixels in the convex region. Candidate non-convex images
were constructed by taking the union of a random number of convex sets generated as above, but
with the number of half-planes sampled from a geometric distribution with parameter 0.07 and with
a minimum number of 10 pixels. The number of convex sets was sampled uniformly from 2 to
4. The candidate non-convex images were then tested by checking a convexity condition for every
pair of pixels in the non-convex set. Those sets that failed the convexity test were added to the data
set. The parameters for generating the convex and non-convex sets were balanced to ensure that the
conditional overall pixel mean is the same for both classes.
probability, an ℓ2 regularization penalty was applied, whose strength was drawn exponentially from
$3.1 \times 10^{-7}$ to $3.1 \times 10^{-5}$. This sampling process covers roughly the same domain with the same
density as the grid used in Larochelle et al. (2007), except for the optional preprocessing steps. The
grid optimization of Larochelle et al. (2007) did not consider normalizing or keeping only leading
PCA dimensions of the inputs; we compare to random sampling with and without these restrictions.4
We formed experiments for each data set by drawing S = 256 trials from this distribution. The
results of these experiments are illustrated in Figures 5 and 6. Random sampling of trials is surpris-
ingly effective in these settings. Figure 5 shows that even among the fraction of jobs (71/256) that
used no preprocessing, the random search with 8 trials is better than the grid search employed in
Larochelle et al. (2007).
Typically, the extent of a grid search is determined by a computational budget. Figure 6 shows
what is possible if we use random search in a larger space that requires more trials to explore. The
larger search space includes the possibility of normalizing the input or applying PCA preprocessing.
In the larger space, 32 trials were necessary to consistently outperform grid search rather than 8,
indicating that there are many harmful ways to preprocess the data. However, when we allowed
larger experiments of 64 trials or more, random search found superior results to those found more
quickly within the more restricted search. This tradeoff between exploration and exploitation is
central to the design of an effective random search.
The efficiency curves in Figures 5 and 6 reveal that different data sets give rise to functions Ψ
with different shapes. The mnist basic results converge very rapidly toward what appears to be a
global maximum. The fact that experiments of just 4 or 8 trials often have the same maximum as
much larger experiments indicates that the region of Λ that gives rise to the best performance is
approximately a quarter or an eighth respectively of the entire configuration space. Assuming that
the random search has not missed a tiny region of significantly better performance, we can say that
random search has solved this problem in 4 or 8 guesses. It is hard to imagine any optimization
algorithm doing much better on a non-trivial 7-dimensional function. In contrast the mnist rotated
background images and convex curves show that even with 16 or 32 random trials, there is consid-
erable variation in the generalization of the reportedly best model. This indicates that the Ψ function
in these cases is more peaked, with small regions of good performance.
[Figure plot panels. Vertical axis: accuracy. Horizontal axis: experiment size (# trials), 1 to 32.]
parameter. The kernels defined for each hyper-parameter were combined by multiplication (joint
Gaussian kernel). We fit a GP to samples of Ψ by finding the length scale (l) for each hyper-
parameter that maximized the marginal likelihood. To ensure relevance could be compared between
hyper-parameters, we shifted and scaled each one to the unit interval. For hyper-parameters that
were drawn geometrically or exponentially (e.g., learning rate, number of hidden units), kernel
calculations were based on the logarithm of the effective value.
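The preprocessing and kernel combination described in this paragraph can be illustrated with a short sketch. It assumes a squared-exponential form for each per-hyper-parameter kernel, which is consistent with the phrase "joint Gaussian kernel" but is not spelled out in the text above; all function names are hypothetical.

```python
import math


def to_unit_interval(value, low, high, log_scale=False):
    """Shift and scale a hyper-parameter to [0, 1].

    log_scale=True is for hyper-parameters drawn geometrically or
    exponentially, whose kernel is computed on the logarithm of the value.
    """
    if log_scale:
        value, low, high = math.log(value), math.log(low), math.log(high)
    return (value - low) / (high - low)


def gaussian_kernel_1d(a, b, length_scale):
    """Assumed squared-exponential kernel for a single hyper-parameter."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def joint_kernel(x, y, length_scales):
    """Product of the per-hyper-parameter kernels."""
    value = 1.0
    for a, b, scale in zip(x, y, length_scales):
        value *= gaussian_kernel_1d(a, b, scale)
    return value
```

Under this form, a very long length scale leaves the kernel almost unchanged along that dimension, so the fitted model treats the corresponding hyper-parameter as having little influence on Ψ.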
[Figure 6 plot panels. Vertical axis: accuracy. Horizontal axis: experiment size (# trials), 1 to 64.]
Figure 6: Neural network performance when standard preprocessing algorithms are considered (9
hyper-parameters). Dashed blue line represents grid search accuracy using (on average)
100 trials (Larochelle et al., 2007), in which no preprocessing was done. Often the extent
of a search is determined by a computational budget, and with random search 64 trials are
enough to find better models in a larger less promising space. Exploring just four PCA
variance levels by grid search would have required 5 times as many (average 500) trials
per data set.
Figure 7 shows the relevance of each component of Λ in modelling Ψ(λ). Finding the length
scales that maximize marginal likelihood is not a convex problem and many local minima exist. To
get a sense of what length scales were supported by the data, we fit each set of samples from Ψ
50 times, resampling different subsets of 80% of the observations every time, and reinitializing the
length scale estimates randomly between 0.1 and 2. Figure 7 reveals two important properties of Ψ
for neural networks that suggest why grid search performs so poorly relative to random experiments:
1. a small fraction of hyper-parameters matter for any one data set, but
The shape of the target rectangle in variants (2) and (4) was determined by sampling side lengths
uniformly from the unit interval, and then scaling the rectangle to have a volume of 1%. This
process gave the rectangles a shape that was often wide or tall - much longer along some axes than
others. The position of the target was drawn uniformly among the positions totally inside the unit
hyper-cube. In the case of tall or wide targets (2) and (4), the indicator function [of the target] had
a lower effective dimension than the dimensionality of the overall space because the dimensions in
which the target is elongated can be almost ignored.
The simulation experiment began with the generation of 100 random search problems. Then for
each experiment design method (random, Sobol, latin hypercube, grid) we created experiments of
1, 2, 3, and so on up to 512 trials.5 The Sobol, Niederreiter, and Halton sequences yielded similar
results, so we used the Sobol sequence to represent the performance of these low-discrepancy set
construction methods. There are many possible grid experiments of any size in multiple dimensions
(at least for non-prime experiment sizes). We did not test every possible grid; instead, we tested
every grid with a monotonic resolution. For example, for experiments of size 16 in 5 dimensions
we tried the five grids with resolutions (1, 1, 1, 1, 16), (1, 1, 1, 2, 8), (1, 1, 2, 2, 4), (1, 1, 1, 4,
4), (1, 2, 2, 2, 2); for experiments of some prime size P in 3 dimensions we tried one grid with
resolution (1, 1, P). Since the target intervals were generated in such a way that rectangles identical
up to a permutation of side lengths have equal probability, grids with monotonic resolution are
representative of all grids. The score of an experiment design method for each experiment size was
the fraction of the 100 targets that it found.
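The set of grids with monotonic resolution can be enumerated as the non-decreasing factorizations of the experiment size into one factor per dimension. The recursive helper below is a sketch with a hypothetical name; for size 16 in 5 dimensions it yields the five resolutions listed above, and for a prime size P in 3 dimensions it yields only (1, 1, P).

```python
def monotonic_grids(size, dims, smallest=1):
    """All non-decreasing tuples of `dims` positive integers with product `size`."""
    if dims == 1:
        # The last factor must not be smaller than the previous one.
        return [(size,)] if size >= smallest else []
    grids = []
    for first in range(smallest, size + 1):
        if size % first == 0:
            for rest in monotonic_grids(size // first, dims - 1, first):
                grids.append((first,) + rest)
    return grids
```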
To characterize the performance of random search, we used the analytic form of the expectation.
The expected probability of finding the target is 1.0 minus the probability of missing the target
with every single one of T trials in the experiment. If the volume of the target relative to the unit
hypercube is (v/V = 0.01) and there are T trials, then this probability of finding the target is
$$1 - \left(1 - \frac{v}{V}\right)^{T} = 1 - 0.99^{T}.$$
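The analytic expression for random search is simple to evaluate numerically. The sketch below (function names are illustrative) computes the probability of at least one hit in T trials and, by inverting the formula, the smallest T that reaches a desired probability.

```python
import math


def prob_find_target(volume_fraction, n_trials):
    """Probability that at least one of n_trials i.i.d. uniform draws hits the target."""
    return 1.0 - (1.0 - volume_fraction) ** n_trials


def trials_needed(volume_fraction, target_prob):
    """Smallest number of trials whose success probability reaches target_prob."""
    return math.ceil(math.log(1.0 - target_prob) / math.log(1.0 - volume_fraction))


# Success probability for a 1% target at a few experiment sizes.
table = {T: prob_find_target(0.01, T) for T in (1, 10, 100, 512)}
```

Note that the expression does not depend on the dimensionality of the space or on the shape of the target, only on its relative volume.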
Figure 8 illustrates the efficiency of each kind of point set at finding the multidimensional in-
tervals. There were some grids that were best at finding cubes and hyper-cubes in 3-d and 5-d, but
most grids were the worst performers. No grid was competitive with the other methods at finding
the rectangular-shaped intervals, which had low effective dimension (cases 2 and 4; Figure 8, right
panels). Latin hypercubes, commonly used to initialize experiments in Bayesian optimization, were
no more efficient than the expected performance of random search. Interestingly, the Sobol se-
quence was consistently best by a few percentage points. The low-discrepancy property that makes
the Sobol sequence useful in integration helps here, where it has the effect of minimizing the size of holes
where the target might pass undetected. The advantage of the Sobol sequence is most pronounced in
experiments of 100-300 trials, where there are sufficiently many trials for the structure in the Sobol
5. Samples from the Sobol sequence were provided by the GNU Scientific Library (M. Galassi et al., 2009).
Figure 8: The efficiency in simulation of low-discrepancy sequences relative to grid and pseudo-
random experiments. The simulation tested how reliably various experiment design meth-
ods locate a multidimensional interval occupying 1% of a unit hyper-cube. There is one
grey dot in each sub-plot for every grid of every experiment size that has at least two ticks
in each dimension. The black dots indicate near-perfect grids whose finest and coarsest
dimensional resolutions differ by either 0 or 1. Hyper-parameter search is most typi-
cally like the bottom-right scenario. Grid search experiments are inefficient for finding
axis-aligned elongated regions in high dimensions (i.e., bottom-right). Pseudo-random
samples are as efficient as latin hypercube samples, and slightly less efficient than the
Sobol sequence.
depart significantly from i.i.d points, but not sufficiently many trials for random search to succeed
with high probability.
A thought experiment gives some intuition for why grid search fails in the case of rectangles.
Long thin rectangles tend to intersect with several points if they intersect with any, reducing the
effective sample size of the search. If the rectangles had been rotated away from the axes used to
build the grid, then depending on the angle the efficiency of grid search could approach the efficiency of
random or low-discrepancy trials. More generally, if the target manifold were not systematically
aligned with subsets of trial points, then grid search would be as efficient as the random and quasi-
random searches.
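The effect of alignment can be made concrete with a small calculation. The sketch below is not the simulation behind Figure 8. It assumes a 2-d unit square, a target strip of width w that spans the full height (so its relative volume is w), a strip position drawn uniformly, and a square grid with ticks at cell centres. All helper names are hypothetical.

```python
def grid_hit_probability(ticks_per_dim: int, width: float) -> float:
    """Probability that an n-by-n grid places a point inside a vertical strip
    of the given width whose left edge is uniform on [0, 1 - width].

    Grid ticks sit at (k + 0.5) / n. The strip spans the full height, so it
    contains either a whole column of n grid points or none at all.
    """
    n = ticks_per_dim
    if width > 0.5 / n:
        raise ValueError("sketch assumes width <= half the tick spacing")
    # A tick x is covered when the left edge falls in [x - width, x]; these n
    # intervals are disjoint and lie inside [0, 1 - width].
    return n * width / (1.0 - width)


def random_hit_probability(n_trials: int, width: float) -> float:
    """Probability that at least one of n_trials i.i.d. uniform points
    lands in a region of relative volume `width`."""
    return 1.0 - (1.0 - width) ** n_trials


if __name__ == "__main__":
    print(grid_hit_probability(10, 0.01))     # 100 grid points
    print(random_hit_probability(100, 0.01))  # 100 random points
```

With 100 points each, the grid finds the strip about 10% of the time against roughly 63% for random search: the 100 grid points test only 10 distinct values along the one dimension that matters.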
• There was also the choice of how to preprocess the data. Either we used the raw pixels or
we removed some of the variance using a ZCA transform (in which examples are projected
onto principal components, and then multiplied by the transpose of the principal components
to place them back in the input space).
• If using ZCA preprocessing, we kept an amount of variance drawn uniformly from 0.5 to 1.0.
• We chose a learning rate for finetuning of the final classifier log-uniformly from 0.001 to 10.
• We chose an anneal start time for finetuning log-uniformly from 100 to 10000.
• We chose ℓ2 regularization of the weight matrices at each layer during finetuning to be either
0 (with probability 0.5), or log-uniformly from $10^{-7}$ to $10^{-4}$.
This hyper-parameter space includes 8 global hyper-parameters and 8 hyper-parameters for each
layer, for a total of 32 hyper-parameters for 3-layer models.
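Drawing one trial from such a space takes only a few lines. The sketch below covers only the finetuning and preprocessing hyper-parameters listed in the bullets above, not the full 32-dimensional space. The dictionary keys, the equal odds of raw versus ZCA input, and the rounding of the anneal start time to an integer are assumptions made for illustration.

```python
import math
import random


def log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_finetuning_config(rng: random.Random) -> dict:
    """Draw one random setting of the hyper-parameters in the bullets above."""
    use_zca = rng.random() < 0.5  # assumption: raw pixels and ZCA equally likely
    return {
        "preprocessing": "zca" if use_zca else "raw",
        # Fraction of variance kept is only meaningful with ZCA preprocessing.
        "zca_variance_kept": rng.uniform(0.5, 1.0) if use_zca else None,
        "finetune_learning_rate": log_uniform(rng, 0.001, 10.0),
        # Assumption: the anneal start time is an integer number of updates.
        "anneal_start": int(round(log_uniform(rng, 100, 10000))),
        "l2_penalty": 0.0 if rng.random() < 0.5 else log_uniform(rng, 1e-7, 1e-4),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_finetuning_config(rng))
```

Because every trial is drawn independently, a list of such dictionaries can be generated up front and dispatched to a cluster in any order.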
A grid search is not practical for the 32-dimensional search problem of DBN model selection,
because even just 2 possible values for each of 32 hyper-parameters would yield more trials than
we could conduct ($2^{32} > 10^9$ trials and each can take hours). For many of the hyper-parameters,
especially real valued ones, we would really like to try more than two values. The approach taken
in Larochelle et al. (2007) was a combination of manual search, multi-resolution grid search and
coordinate descent. The algorithm (including manual steps) is somewhat elaborate, but sensible,
and we believe that it is representative of how model search is typically done in several research
groups, if not the community at large. Larochelle et al. (2007) describe it as follows:
“The hyper-parameter search procedure we used alternates between fixing a neural net-
work architecture and searching for good optimization hyper-parameters similarly to
coordinate descent. More time would usually be spent on finding good optimization
parameters, given some empirical evidence that we found indicating that the choice of
the optimization hyper-parameters (mostly the learning rates) has much more influence
on the obtained performance than the size of the network. We used the same procedure
to find the hyper-parameters for DBN-1, which are the same as those of DBN-3 except
the second hidden layer and third hidden layer sizes. We also allowed ourselves to
test for much larger first-hidden layer sizes, in order to make the comparison between
DBN-1 and DBN-3 fairer.
“We usually started by testing a relatively small architecture (between 500 and 700
units in the first and second hidden layer, and between 1000 and 2000 hidden units
in the last layer). Given the results obtained on the validation set (compared to those
of NNet for instance) after selecting appropriate optimization parameters, we would
then consider growing the number of units in all layers simultaneously. The biggest
networks we eventually tested had up to 3000, 4000 and 6000 hidden units in the first,
second and third hidden layers respectively.
“As for the optimization hyper-parameters, we would proceed by first trying a few com-
binations of values for the stochastic gradient descent learning rate of the supervised
and unsupervised phases (usually between 0.1 and 0.0001). We then refine the choice of
tested values for these hyper-parameters. The first trials would simply give us a trend on
the validation set error for these parameters (is a change in the hyper-parameter making
things worse or better) and we would then consider that information in selecting appropriate
additional trials. One could choose to use learning rate adaptation techniques
(e.g., slowly decreasing the learning rate or using momentum) but we did not find these
techniques to be crucial.”
There was large variation in the number of trials used in Larochelle et al. (2007) to optimize the
DBN-3. One data set (mnist background images) benefited from 102 trials, while another (mnist
background random) only 13 because a good result was found more quickly. The average number
of trials across data sets for the DBN-3 model was 41. In considering the number of trials per data
set, it is important to bear in mind that the experiments on different data sets were not performed
independently. Rather, later experiments benefited from the experience the authors had drawn from
earlier ones. Although grid search was part of the optimization loop, the manual intervention turns
the overall optimization process into something with more resemblance to an adaptive sequential
algorithm.
[Figure 9 plots: accuracy versus experiment size (# trials).]
Figure 9: Deep Belief Network (DBN) performance according to random search. Here random
search is used to explore up to 32 hyper-parameters. Results obtained by grid-assisted
manual search using an average of 41 trials are marked in finely-dashed green (1-layer
DBN) and coarsely-dashed red (3-layer DBN). Random experiments of 128 random trials
found an inferior best model for three data sets, a competitive model in four, and a superior
model in one (convex). (Best viewed in color.)
Random search versions of the DBN experiments from Larochelle et al. (2007) are shown in
Figure 9. In this more challenging optimization problem random search is still effective, but not
superior as it was in the case of neural network optimization. Comparing to the 3-layer DBN
results in Larochelle et al. (2007), random search found a better model than the manual search in
one data set (convex), an equally good model in four (mnist basic, mnist rotated, rectangles, and
rectangles images), and an inferior model in three (mnist background images, mnist background
random, mnist rotated background images). Comparing to the 1-layer DBN results, random
search of the 1-layer, 2-layer and 3-layer configuration space found at least as good a model in all
cases. In comparing these scores, the reader should bear in mind that the scores in the original
experiments were not computed using the same score-averaging technique that we described in
Section 2.1, and our averaging technique is slightly biased toward underestimation. In the DBN
efficiency curves we see that even experiments with larger numbers of trials (64 and larger) feature
significant variability. This indicates that the regions of the search space with the best performance
are small, and randomly chosen i.i.d. trials do not reliably find them.
6. Future Work
Our result on the multidimensional interval task, together with the GPR characterization of the shape
of Ψ, together with the computational constraint that hyper-parameter searches only draw on a few
hundred trials, all suggest that pseudo-random or quasi-random trials are optimal for non-adaptive
hyper-parameter search. There is still work to be done for each model family, to establish how it
should be parametrized for i.i.d. random search to be as reliable as possible, but the most promising
and interesting direction for future work is certainly in adaptive algorithms.
There is a large body of literature on global optimization, a great deal of which bears on the ap-
plication of hyper-parameter optimization. General numeric methods such as simplex optimization
(Nelder and Mead, 1965), constrained optimization by linear approximation (Powell, 1994; Weise,
2009), finite difference stochastic approximation and simultaneous perturbation stochastic
approximation (Kleinman et al., 1999) could be useful, as well as methods for search in discrete spaces
such as simulated annealing (Kirkpatrick et al., 1983) and evolutionary algorithms (Rechenberg,
1973; Hansen et al., 2003). Drew and de Mello (2006) have already proposed an optimization al-
gorithm that identifies effective dimensions, for more efficient search. They present an algorithm
that distinguishes between important and unimportant dimensions: a low-discrepancy point set is
used to choose points in the important dimensions, and unimportant dimensions are “padded” with
thinner coverage and cheaper samples. Their algorithm’s success hinges on the rapid and successful
identification of important dimensions. Sequential model-based optimization methods and partic-
ularly Bayesian optimization methods are perhaps more promising because they offer principled
approaches to weighting the importance of each dimension (Hutter, 2009; Hutter et al., 2011; Srini-
vasan and Ramakrishnan, 2011).
With so many sophisticated algorithms to draw on, it may seem strange that grid search is still
widely used, and, with straight faces, we now suggest using random search instead. We believe the
reason for this state of affairs is a technical one. Manual optimization followed by grid search is
easy to implement: grid search requires very little code infrastructure beyond access to a cluster
of computers. Random search is just as simple to carry out, uses the same tools, and fits in the
same workflow. Adaptive search algorithms on the other hand require more code complexity. They
require client-server architectures in which a master process keeps track of the trials that have com-
pleted, the trials that are in progress, the trials that were started but failed to complete. Some kind
of shared database and inter-process communication mechanisms are required. Trials in an adaptive
experiment cannot be queued up all at once; the master process must be involved somehow in the
scheduling and timing of jobs on the cluster. These technical hurdles are not easy to jump with the
standard tools of the trade such as MATLAB or Python; significant software engineering is required.
Until that engineering is done and adopted by a community of researchers, progress on the study of
sophisticated hyper-parameter optimization algorithms will be slow.
7. Conclusion
Grid search experiments are common in the literature of empirical machine learning, where they are
used to optimize the hyper-parameters of learning algorithms. It is also common to perform multi-
stage, multi-resolution grid experiments that are more or less automated, because a grid experiment
with a fine-enough resolution for optimization would be prohibitively expensive. We have shown
that random experiments are more efficient than grid experiments for hyper-parameter optimization
in the case of several learning algorithms on several data sets. Our analysis of the hyper-parameter
response surface (Ψ) suggests that random experiments are more efficient because not all hyper-
parameters are equally important to tune. Grid search experiments allocate too many trials to the
exploration of dimensions that do not matter and suffer from poor coverage in dimensions that are
important. Compared with the grid search experiments of Larochelle et al. (2007), random search
found better models in most cases and required less computational time.
Random experiments are also easier to carry out than grid experiments for practical reasons
related to the statistical independence of every trial.
• The experiment can be stopped any time and the trials form a complete experiment.
• If extra computers become available, new trials can be added to an experiment without having
to adjust the grid and commit to a much larger experiment.
• If the computer carrying out a trial fails for any reason, its trial can be either abandoned or
restarted without jeopardizing the experiment.
Random search is not incompatible with a controlled experiment. To investigate the effect
of one hyper-parameter of interest X, we recommend random search (instead of grid search) for
optimizing over other hyper-parameters. Choose one set of random values for these remaining
hyper-parameters and use that same set for each value of X.
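The controlled experiment described above can be sketched as follows. This is an illustration under assumptions, not code from the paper: `controlled_random_search`, `sample_learning_rate` and `toy_score` are hypothetical names, and the toy scoring function stands in for training and validating a model.

```python
import math
import random


def controlled_random_search(x_values, sample_others, evaluate, n_trials, seed=0):
    """For each value of the hyper-parameter of interest X, evaluate the SAME
    randomly drawn settings of the remaining hyper-parameters and keep the best."""
    rng = random.Random(seed)
    # One shared set of random values for the remaining hyper-parameters.
    shared = [sample_others(rng) for _ in range(n_trials)]
    best = {x: max(evaluate(x, others) for others in shared) for x in x_values}
    return best, shared


# Toy stand-ins (assumptions for illustration only).
def sample_learning_rate(rng):
    return {"learning_rate": 10 ** rng.uniform(-3, 0)}


def toy_score(x, others):
    # Higher is better; the best X is 2 and the best learning rate is 10^-1.5.
    return -(x - 2) ** 2 - (math.log10(others["learning_rate"]) + 1.5) ** 2


if __name__ == "__main__":
    best, _ = controlled_random_search(
        [1, 2, 3], sample_learning_rate, toy_score, n_trials=8
    )
    print(best)
```

Because every value of X is paired with the same draws, differences between the entries of `best` reflect X alone rather than luck in the other hyper-parameters.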
Random experiments with large numbers of trials also bring attention to the question of how
to measure test error of an experiment when many trials have some claim to being best. When
using a relatively small validation set, the uncertainty involved in selecting the best model by cross-
validation can be larger than the uncertainty in measuring the test set performance of any one model.
It is important to take both of these sources of uncertainty into account when reporting the uncer-
tainty around the best model found by a model search algorithm. This technique is useful to all
experiments (including both random and grid) in which multiple models achieve approximately the
best validation set performance.
Low-discrepancy sequences developed for QMC integration are also good alternatives to grid-
based experiments. In low dimensions (e.g., 1-5) our simulated results suggest that they can hold
some advantage over pseudo-random experiments in terms of search efficiency. However, the trials
of a low-discrepancy experiment are not i.i.d. which makes it inappropriate to analyze performance
with the random efficiency curve. It is also more difficult in practice to conduct a quasi-random
experiment because like a grid experiment, the omission of a single point can be more severe.
Moreover, when there are many hyper-parameter dimensions relative to the computational budget for
the experiment, a low-discrepancy trial set is not expected to behave very differently from a pseudo-
random one.
Finally, the hyper-parameter optimization strategies considered here are non-adaptive: they do
not vary the course of the experiment by considering any results that are already available. Random
search was not generally as good as the sequential combination of manual and grid search from
an expert (Larochelle et al., 2007) in the case of the 32-dimensional search problem of DBN op-
timization, because the efficiency of sequential optimization overcame the inefficiency of the grid
search employed at each step of the procedure. Future work should consider sequential, adaptive
search/optimization algorithms in settings where many hyper-parameters of an expensive function
must be optimized jointly and the effective dimensionality is high. We hope that future work in that
direction will consider random search of the form studied here as a baseline for performance, rather
than grid search.
Acknowledgments
This work was supported by the Natural Sciences and Engineering Research Council of Canada and
Compute Canada, and implemented with Theano (Bergstra et al., 2010).
References
I. A. Antonov and V. M. Saleev. An economic method of computing LPτ -sequences. USSR Compu-
tational Mathematics and Mathematical Physics, 19(1):252–256, 1979.
R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, New Jersey,
1961.
Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):
1–127, 2009. doi: 10.1561/2200000006.
Y. Bengio and X. Glorot. Understanding the difficulty of training deep feedforward neural networks.
In Y. W. Teh and M. Titterington, editors, Proc. of The Thirteenth International Conference on
Artificial Intelligence and Statistics (AISTATS’10), pages 249–256, 2010.
J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, and Y. Bengio.
Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific
Computing Conference (SciPy), June 2010. Oral.
C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, London, UK, 1995.
P. Bratley, B. L. Fox, and H. Niederreiter. Implementation and tests of low-discrepancy sequences.
Transactions on Modeling and Computer Simulation, (TOMACS), 2(3):195–213, 1992.
R. E. Caflisch, W. Morokoff, and A. Owen. Valuation of mortgage backed securities using brownian
bridges to reduce effective dimension, 1997.
C. Chang and C. Lin. LIBSVM: A Library for Support Vector Machines, 2001.
I. Czogiel, K. Luebke, and C. Weihs. Response surface methodology for optimizing hyper parame-
ters. Technical report, Universität Dortmund Fachbereich Statistik, September 2005.
S. S. Drew and T. Homem de Mello. Quasi-Monte Carlo strategies for stochastic optimization. In
Proc. of the 38th Conference on Winter Simulation, pages 774 – 782, 2006.
D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised
pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.
N. Hansen, S. D. Müller, and P. Koumoutsakos. Reducing the time complexity of the derandomized
evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11
(1):1–18, 2003.
G. E. Hinton. A practical guide to training restricted Boltzmann machines. Technical Report 2010-
003, University of Toronto, 2010. version 1.
G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural
Computation, 18:1527–1554, 2006.
F. Hutter. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD
thesis, University of British Columbia, 2009.
F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algo-
rithm configuration. In LION-5, 2011. Extended version as UBC Tech report TR-2010-10.
Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller, editors,
Neural Networks: Tricks of the Trade. Springer, 1998b.
M. Galassi et al. GNU Scientific Library Reference Manual, 3rd edition, 2009.
J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7:
308–313, 1965.
M. J. D. Powell. A direct search optimization method that models the objective and constraint
functions by linear interpolation. Advances in Optimization and Numerical Analysis, pages 51–
67, 1994.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press,
2006.
I. Rechenberg. Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der
biologischen Evolution. Frommann-Holzboog, Stuttgart, 1973.
A. Srinivasan and G. Ramakrishnan. Parameter screening and optimisation for ILP using designed
experiments. Journal of Machine Learning Research, 12:627–662, February 2011.
P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features
with denoising autoencoders. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Pro-
ceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), pages
1096–1103. ACM, 2008.
T. Weise. Global Optimization Algorithms - Theory and Application. Self-Published, second edi-
tion, 2009. Online available at http://www.it-weise.de/.
Journal of Machine Learning Research 12 (2011) 2121-2159 Submitted 3/10; Revised 3/11; Published 7/11
Abstract
We present a new family of subgradient methods that dynamically incorporate knowledge of the
geometry of the data observed in earlier iterations to perform more informative gradient-based
learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very
predictive but rarely seen features. Our paradigm stems from recent advances in stochastic op-
timization and online learning which employ proximal functions to control the gradient steps of
the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal func-
tion, which significantly simplifies setting a learning rate and results in regret guarantees that are
provably as good as the best proximal function that can be chosen in hindsight. We give several
efficient algorithms for empirical risk minimization problems with common and important regu-
larization functions and domain constraints. We experimentally study our theoretical analysis and
show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient
algorithms.
Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization
1. Introduction
In many applications of online and stochastic learning, the input instances are of very high di-
mension, yet within any particular instance only a few features are non-zero. It is often the case,
however, that infrequently occurring features are highly informative and discriminative. The infor-
mativeness of rare features has led practitioners to craft domain-specific feature weightings, such as
TF-IDF (Salton and Buckley, 1988), which pre-emphasize infrequently occurring features. We use
this old idea as a motivation for applying modern learning-theoretic techniques to the problem of
online and stochastic learning, focusing concretely on (sub)gradient methods.
© 2011 John Duchi, Elad Hazan and Yoram Singer.
D UCHI , H AZAN AND S INGER
Standard stochastic subgradient methods largely follow a predetermined procedural scheme that
is oblivious to the characteristics of the data being observed. In contrast, our algorithms dynamically
incorporate knowledge of the geometry of the data observed in earlier iterations to perform more
informative gradient-based learning. Informally, our procedures give frequently occurring features
very low learning rates and infrequent features high learning rates, where the intuition is that each
time an infrequent feature is seen, the learner should “take notice.” Thus, the adaptation facilitates
finding and identifying very predictive but comparatively rare features.
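The intuition can be illustrated with a per-feature step size of the form η divided by the root of the accumulated squared gradients, which is the shape the diagonal update in this paper takes. The sketch below is only an illustration: the function name and the small constant `delta`, added to avoid division by zero, are assumptions.

```python
import numpy as np


def effective_learning_rates(grads: np.ndarray, eta: float = 1.0,
                             delta: float = 1e-8) -> np.ndarray:
    """Per-feature step sizes eta / (delta + sqrt(sum_t g_{t,i}^2)).

    grads has shape (T, d), one (sub)gradient per row.
    """
    accum = np.sqrt(np.sum(grads ** 2, axis=0))
    return eta / (delta + accum)


if __name__ == "__main__":
    T = 400
    grads = np.zeros((T, 2))
    grads[:, 0] = 1.0      # feature 0 is active on every round
    grads[::100, 1] = 1.0  # feature 1 is active on 4 of the 400 rounds
    print(effective_learning_rates(grads))  # approximately [0.05, 0.5]
```

The frequently seen feature ends up with a step size ten times smaller than the rare one, so a single observation of the rare feature moves its weight much further.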
We also make frequent use of the following two matrices. Let $g_{1:t} = [g_1 \cdots g_t]$ denote the matrix
obtained by concatenating the subgradient sequence. We denote the $i$th row of this matrix, which
amounts to the concatenation of the $i$th component of each subgradient we observe, by $g_{1:t,i}$. We
also define the outer product matrix $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$.
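A small numerical example may help fix this notation. It is only an illustration with made-up subgradients; the variable names are ours.

```python
import numpy as np

# Three subgradients in R^2, stored as the columns of g_{1:t} (shape d x t).
g_1t = np.array([[1.0, 0.0, 2.0],
                 [0.0, 3.0, 0.0]])

# i-th row g_{1:t,i}: the i-th component of every subgradient seen so far.
row_0 = g_1t[0]

# Outer product matrix G_t = sum_tau g_tau g_tau^T, which equals g_{1:t} g_{1:t}^T.
G_t = g_1t @ g_1t.T

# The diagonal of G_t holds the squared Euclidean norms of the rows.
assert np.allclose(np.diag(G_t), np.sum(g_1t ** 2, axis=1))

if __name__ == "__main__":
    print(G_t)
```

Only the diagonal of `G_t` is needed for the diagonal variant of the method, and it can be maintained as a running sum of squared gradient components.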
Online learning and stochastic optimization are closely related and basically interchangeable
(Cesa-Bianchi et al., 2004). In order to keep our presentation simple, we confine our discussion and
algorithmic descriptions to the online setting with the regret bound model. In online learning, the
learner repeatedly predicts a point $x_t \in X \subseteq \mathbb{R}^d$, which often represents a weight vector assigning
importance values to various features. The learner’s goal is to achieve low regret with respect to a
static predictor $x^*$ in the (closed) convex set $X \subseteq \mathbb{R}^d$ (possibly $X = \mathbb{R}^d$) on a sequence of functions
$f_t(x)$, measured as
\[ R(T) = \sum_{t=1}^{T} f_t(x_t) - \inf_{x \in X} \sum_{t=1}^{T} f_t(x). \]
At every timestep $t$, the learner receives the (sub)gradient information $g_t \in \partial f_t(x_t)$. Standard
subgradient algorithms then move the predictor $x_t$ in the opposite direction of $g_t$ while maintaining
$x_{t+1} \in X$ via the projected gradient update (e.g., Zinkevich, 2003)
\[ x_{t+1} = \Pi_X(x_t - \eta g_t) = \operatorname{argmin}_{x \in X} \| x - (x_t - \eta g_t) \|_2^2. \]
In contrast, let the Mahalanobis norm $\|\cdot\|_A = \sqrt{\langle \cdot, A \cdot \rangle}$ and denote the projection of a point $y$ onto $X$
according to $A$ by $\Pi_X^A(y) = \operatorname{argmin}_{x \in X} \|x - y\|_A = \operatorname{argmin}_{x \in X} \langle x - y, A(x - y) \rangle$. Using this notation,
our generalization of standard gradient descent employs the update
\[ x_{t+1} = \Pi_X^{G_t^{1/2}}\left(x_t - \eta\, G_t^{-1/2} g_t\right). \]
A DAPTIVE S UBGRADIENT M ETHODS
The above algorithm is computationally impractical in high dimensions since it requires computa-
tion of the root of the matrix Gt , the outer product matrix. Thus we specialize the update to
\[ x_{t+1} = \Pi_X^{\operatorname{diag}(G_t)^{1/2}}\left(x_t - \eta \operatorname{diag}(G_t)^{-1/2} g_t\right). \tag{1} \]
Both the inverse and root of diag(Gt ) can be computed in linear time. Moreover, as we discuss later,
when the gradient vectors are sparse the update above can often be performed in time proportional
to the support of the gradient. We now elaborate and give a more formal discussion of our setting.
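A minimal sketch of update (1) follows, assuming the domain is a box (or all of R^d when the bounds are infinite). For a box and a diagonal norm the projection decouples per coordinate and reduces to clipping. The class name, the default step size and the small constant `delta` guarding against division by zero are assumptions of this sketch, not part of the paper's statement of (1).

```python
import numpy as np


class DiagonalAdaGrad:
    """Sketch of update (1) for X = {x : lo <= x_i <= hi} (or X = R^d)."""

    def __init__(self, dim, eta=0.1, delta=1e-8, lo=-np.inf, hi=np.inf):
        self.x = np.zeros(dim)
        self.sum_sq = np.zeros(dim)  # diag(G_t), kept as a running sum
        self.eta, self.delta, self.lo, self.hi = eta, delta, lo, hi

    def step(self, g):
        g = np.asarray(g, dtype=float)
        self.sum_sq += g ** 2
        scale = self.delta + np.sqrt(self.sum_sq)  # diag(G_t)^{1/2} plus delta
        y = self.x - self.eta * g / scale
        # Projection onto a box under a diagonal Mahalanobis norm is
        # coordinate-wise clipping, because the objective is separable.
        self.x = np.clip(y, self.lo, self.hi)
        return self.x


if __name__ == "__main__":
    opt = DiagonalAdaGrad(dim=2, eta=1.0, lo=-1.5, hi=1.5)
    for _ in range(2):
        print(opt.step([2.0, 0.0]))
```

Coordinates whose gradient component is zero are left unchanged by a step, which is why a sparse implementation only needs to touch the support of the gradient.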
In this paper we consider several different online learning algorithms and their stochastic convex
optimization counterparts. Formally, we consider online learning with a sequence of composite
functions φt . Each function is of the form φt (x) = ft (x) + ϕ(x) where ft and ϕ are (closed) convex
functions. In the learning settings we study, ft is either an instantaneous loss or a stochastic estimate
of the objective function in an optimization task. The function ϕ serves as a fixed regularization
function and is typically used to control the complexity of x. At each round the algorithm makes a
prediction xt ∈ X and then receives the function ft . We define the regret with respect to the fixed
(optimal) predictor x∗ as
\[ R_\phi(T) \triangleq \sum_{t=1}^{T} \left[\phi_t(x_t) - \phi_t(x^*)\right] = \sum_{t=1}^{T} \left[f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)\right]. \tag{2} \]
Our goal is to devise algorithms which are guaranteed to suffer asymptotically sub-linear regret,
namely, Rφ (T ) = o(T ).
Our analysis applies to related, yet different, methods for minimizing the regret (2). The first
is Nesterov’s primal-dual subgradient method (2009), and in particular Xiao’s (2010) extension,
regularized dual averaging, and the follow-the-regularized-leader (FTRL) family of algorithms (see
for instance Kalai and Vempala, 2003; Hazan et al., 2006). In the primal-dual subgradient method
the algorithm makes a prediction $x_t$ on round $t$ using the average gradient $\bar{g}_t = \frac{1}{t} \sum_{\tau=1}^{t} g_\tau$. The update
encompasses a trade-off between a gradient-dependent linear term, the regularizer $\varphi$, and a strongly-
convex term $\psi_t$ for well-conditioned predictions. Here $\psi_t$ is the proximal term. The update amounts
to solving
\[ x_{t+1} = \operatorname{argmin}_{x \in X} \left\{ \eta \langle \bar{g}_t, x \rangle + \eta \varphi(x) + \frac{1}{t} \psi_t(x) \right\}, \tag{3} \]
where η is a fixed step-size and x1 = argminx∈X ϕ(x). The second method similarly has numer-
ous names, including proximal gradient, forward-backward splitting, and composite mirror descent
(Tseng, 2008; Duchi et al., 2010). We use the term composite mirror descent. The composite mirror
descent method employs a more immediate trade-off between the current gradient gt , ϕ, and staying
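close to $x_t$ using the proximal function $\psi$,
\[ x_{t+1} = \operatorname{argmin}_{x \in X} \left\{ \eta \langle g_t, x \rangle + \eta \varphi(x) + B_{\psi_t}(x, x_t) \right\}, \tag{4} \]
where $B_{\psi_t}$ is the Bregman divergence associated with $\psi_t$.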
Our work focuses on temporal adaptation of the proximal function in a data driven way, while
previous work simply sets $\psi_t \equiv \psi$, $\psi_t(\cdot) = \sqrt{t}\,\psi(\cdot)$, or $\psi_t(\cdot) = t\,\psi(\cdot)$ for some fixed $\psi$.
We provide formal analyses equally applicable to the above two updates and show how to au-
tomatically choose the function ψt so as to achieve asymptotically small regret. We describe and
analyze two algorithms. Both algorithms use squared Mahalanobis norms as their proximal
functions, setting $\psi_t(x) = \langle x, H_t x \rangle$ for a symmetric matrix $H_t \succeq 0$. The first uses diagonal matrices while
the second constructs full dimensional matrices. Concretely, for some small fixed $\delta \ge 0$ (specified
later, though in practice $\delta$ can be set to 0) we set
\[ H_t = \delta I + \operatorname{diag}(G_t)^{1/2} \ \text{(Diagonal)} \quad \text{and} \quad H_t = \delta I + G_t^{1/2} \ \text{(Full)}. \tag{5} \]
Plugging the appropriate matrix from the above equation into ψt in (3) or (4) gives rise to our
ADAGRAD family of algorithms. Informally, we obtain algorithms which are similar to second-
order gradient descent by constructing approximations to the Hessian of the functions ft , though we
use roots of the matrices.
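The two choices of the matrix in (5) can be computed as in the following sketch. It assumes the subgradients are stacked as the rows of an array and obtains the matrix square root from a symmetric eigendecomposition; the function names are hypothetical.

```python
import numpy as np


def psd_sqrt(G):
    """Symmetric square root of a positive semidefinite matrix."""
    eigvals, eigvecs = np.linalg.eigh(G)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def proximal_matrices(grads, delta=0.0):
    """Return (H_diag, H_full) as in (5); grads has shape (t, d), one g_tau per row."""
    grads = np.asarray(grads, dtype=float)
    d = grads.shape[1]
    G = grads.T @ grads  # G_t = sum_tau g_tau g_tau^T
    H_diag = delta * np.eye(d) + np.diag(np.sqrt(np.diag(G)))
    H_full = delta * np.eye(d) + psd_sqrt(G)
    return H_diag, H_full


if __name__ == "__main__":
    H_diag, H_full = proximal_matrices([[3.0, 4.0]])
    print(H_diag)
    print(H_full)
```

The diagonal variant needs only the d accumulated squared components, whereas the full variant requires an eigendecomposition whose cost grows cubically with d.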
These results are formally given in Theorem 7 and its corollaries. When our proximal function
$\psi_t(x) = \langle x, \operatorname{diag}(G_t)^{1/2} x \rangle$ we have bounds attainable in time at most linear in the dimension d of
We formally state the above two regret bounds in Theorem 5 and its corollaries.
Following are a simple example and corollary to Theorem 5 to illustrate one regime in which
we expect substantial improvements (see also the next subsection). Let $\varphi \equiv 0$ and consider
Zinkevich’s online gradient descent algorithm. Given a compact convex set $X \subseteq \mathbb{R}^d$ and sequence
of convex functions $f_t$, Zinkevich’s algorithm makes the sequence of predictions $x_1, \ldots, x_T$ with
$x_{t+1} = \Pi_X(x_t - (\eta/\sqrt{t})\, g_t)$. If the diameter of $X$ is bounded, thus $\sup_{x,y \in X} \|x - y\|_2 \le D_2$, then
online gradient descent, with the optimal choice in hindsight for the stepsize $\eta$ (see the bound (7) in
Section 1.4), achieves a regret bound of
\[ \sum_{t=1}^{T} f_t(x_t) - \inf_{x \in X} \sum_{t=1}^{T} f_t(x) \le \sqrt{2}\, D_2 \sqrt{\sum_{t=1}^{T} \|g_t\|_2^2}. \tag{6} \]
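For reference, the non-adaptive baseline can be sketched as below. The sketch assumes X is a Euclidean ball of a given radius (so that D_2 is twice the radius), which is only one example of a compact convex set; the function names and the callback signature are ours.

```python
import numpy as np


def project_l2_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def online_gradient_descent(subgradient, x1, eta, radius, n_rounds):
    """Run x_{t+1} = Pi_X(x_t - (eta / sqrt(t)) g_t) and return all iterates.

    subgradient(t, x) must return an element of the subdifferential of f_t at x.
    """
    x = np.asarray(x1, dtype=float)
    iterates = [x.copy()]
    for t in range(1, n_rounds + 1):
        g = subgradient(t, x)
        x = project_l2_ball(x - (eta / np.sqrt(t)) * g, radius)
        iterates.append(x.copy())
    return iterates


if __name__ == "__main__":
    target = np.array([3.0, 0.0])  # minimizer of f lies outside the unit ball
    its = online_gradient_descent(lambda t, x: x - target, [0.0, 0.0],
                                  eta=1.0, radius=1.0, n_rounds=5)
    print(its[-1])
```

Every coordinate shares the single step size η/√t here, which is the feature the adaptive methods replace with per-coordinate scaling.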
When $X$ is bounded via $\sup_{x,y \in X} \|x - y\|_\infty \le D_\infty$, the following corollary is a simple consequence of
our Theorem 5.
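Corollary 1 Let the sequence $\{x_t\} \subset \mathbb{R}^d$ be generated by the update (4) and assume that
$\max_t \|x^* - x_t\|_\infty \le D_\infty$. Using stepsize $\eta = D_\infty/\sqrt{2}$, for any $x^*$, the following bound holds.
\[ R_\phi(T) \le \sqrt{2d}\, D_\infty \sqrt{\inf_{s \succeq 0,\ \langle \mathbf{1}, s \rangle \le d}\ \sum_{t=1}^{T} \|g_t\|_{\operatorname{diag}(s)^{-1}}^2} = \sqrt{2}\, D_\infty \sum_{i=1}^{d} \|g_{1:T,i}\|_2. \]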
The important feature of the bound above is the infimum under the square root, which allows us to
perform better than simply using the identity matrix, and the fact that the stepsize is easy to set a
priori. For example, if the set $X = \{x : \|x\|_\infty \le 1\}$, then $D_2 = 2\sqrt{d}$ while $D_\infty = 2$, which suggests that
if we are learning a dense predictor over a box, the adaptive method should perform well. Indeed,
in this case we are guaranteed that the bound in Corollary 1 is better than (6) as the identity matrix
belongs to the set over which we take the infimum.
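The two right-hand sides can be compared numerically on the box X = {x : ||x||_∞ ≤ 1}. The sketch below only evaluates the bounds on a made-up gradient sequence; it does not run either algorithm, and the function names and the example data are assumptions.

```python
import numpy as np


def ogd_bound(grads, D2):
    """Right-hand side of (6): sqrt(2) * D2 * sqrt(sum_t ||g_t||_2^2)."""
    return np.sqrt(2.0) * D2 * np.sqrt(np.sum(grads ** 2))


def adagrad_bound(grads, Dinf):
    """Right-hand side of Corollary 1: sqrt(2) * Dinf * sum_i ||g_{1:T,i}||_2."""
    return np.sqrt(2.0) * Dinf * np.sum(np.sqrt(np.sum(grads ** 2, axis=0)))


if __name__ == "__main__":
    T, d = 1000, 100
    sparse = np.zeros((T, d))       # one gradient per row
    sparse[:, 0] = 1.0              # feature 0 is active on every round
    for i in range(1, d):
        sparse[i, i] = 1.0          # every other feature is active exactly once
    dense = np.ones((T, d))

    D2, Dinf = 2.0 * np.sqrt(d), 2.0  # diameters of the unit box
    print("sparse:", ogd_bound(sparse, D2), adagrad_bound(sparse, Dinf))
    print("dense: ", ogd_bound(dense, D2), adagrad_bound(dense, Dinf))
```

On the sparse sequence the adaptive bound is markedly smaller, while on the all-ones sequence the two coincide, which matches the statement that the bound of Corollary 1 is never worse than (6) on the box.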
To conclude the outline of results, we would like to point to two relevant research papers. First,
Zinkevich’s regret bound is tight and cannot be improved in a minimax sense (Abernethy et al.,
2008). Therefore, improving the regret bound requires further reasonable assumptions on the input
space. Second, in independent work performed concurrently with the research presented in this
paper, McMahan and Streeter (2010) study competitive ratios, showing guaranteed improvements
of the above bounds relative to families of online algorithms.
by Jensen’s inequality. In the rightmost sum, we have $c \sum_{i=1}^{d} i^{-\alpha/2} = O(\log d)$ for $\alpha \ge 2$, and
$\sum_{i=1}^{d} i^{-\alpha/2} = O(d^{1-\alpha/2})$ for $\alpha \in (1, 2)$. If the domain $X$ is a hypercube, say $X = \{x : \|x\|_\infty \le 1\}$, then
in Corollary 1 $D_\infty = 2$, and the regret of ADAGRAD is $O(\max\{\log d, d^{1-\alpha/2}\} \sqrt{T})$. For contrast, the
standard regret bound (6) for online gradient descent has $D_2 = 2\sqrt{d}$ and $\|g_t\|_2^2 \ge 1$, yielding best
case regret $O(\sqrt{dT})$. So we see that in this sparse yet heavy tailed feature setting, ADAGRAD’s
regret guarantee can be exponentially smaller in the dimension d than the non-adaptive regret bound.
Our remaining examples construct a sparse sequence for which there is a perfect predictor that
the adaptive methods learn after d iterations, while standard online gradient descent (Zinkevich,
(Here $[\cdot]_{[-1,1]^d}$ denotes the truncation of the vector to $[-1, 1]^d$). In particular, after suffering $d - 1$
more losses, ADAGRAD has a perfect classifier. However, on the remaining iterations gradient
descent has $\eta/\sqrt{t} \le \varepsilon$ and thus evidently suffers loss at least $d/(2\varepsilon)$. Of course, for small $\varepsilon$, we
have $d/(2\varepsilon) \gg d$. In short, ADAGRAD achieves constant regret per dimension while online gradient
descent can suffer arbitrary loss (for unbounded t). It seems quite silly, then, to use a global learning
rate rather than one for each feature.
Full Matrix Adaptation. We use a similar construction to the diagonal case to show a situation
in which the full matrix update from (5) gives substantially lower regret than stochastic gradient
descent. For full divergences we set $X = \{x : \|x\|_2 \le \sqrt{d}\}$. Let $V = [v_1 \ldots v_d] \in \mathbb{R}^{d \times d}$ be an
orthonormal matrix. Instead of having $z_t$ cycle through the unit vectors, we make $z_t$ cycle through
the $v_i$ so that $z_t = \pm v_i$. We let the label $y_t = \operatorname{sign}(\langle \mathbf{1}, V^\top z_t \rangle) = \operatorname{sign}\left(\sum_{i=1}^{d} \langle v_i, z_t \rangle\right)$. We provide an
elaborated explanation in Appendix A. Intuitively, with $\psi_t(x) = \langle x, H_t x \rangle$ and $H_t$ set to be the full
matrix from (5), ADAGRAD again needs to observe each orthonormal vector $v_i$ only once while
stochastic gradient descent’s loss can be made $\Omega(d/\varepsilon)$ for any $\varepsilon > 0$.
A DAPTIVE S UBGRADIENT M ETHODS
we change the proximal function to achieve performance guarantees which are competitive with the
best proximal term found in hindsight. The second, as alluded to earlier, is to automatically adjust
the learning rates for online learning and stochastic gradient descent on a per-feature basis. The
latter can be very useful when our gradient vectors gt are sparse, for example, in a classification
setting where examples may have only a small number of non-zero features. As we demonstrated
in the examples above, it is rather deficient to employ exactly the same learning rate for a feature
seen hundreds of times and for a feature seen only once or twice.
Our techniques stem from a variety of research directions, and as a byproduct we also extend a
few well-known algorithms. In particular, we consider variants of the follow-the-regularized leader
(FTRL) algorithms mentioned above, which are kin to Zinkevich’s lazy projection algorithm. We
use Xiao’s recently analyzed regularized dual averaging (RDA) algorithm (2010), which builds upon
Nesterov’s (2009) primal-dual subgradient method. We also consider forward-backward splitting
(FOBOS) (Duchi and Singer, 2009) and its composite mirror-descent (proximal gradient)
generalizations (Tseng, 2008; Duchi et al., 2010), which in turn include as special cases projected gradients
(Zinkevich, 2003) and mirror descent (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003). Re-
cent work by several authors (Nemirovski et al., 2009; Juditsky et al., 2008; Lan, 2010; Xiao, 2010)
considered efficient and robust methods for stochastic optimization, especially in the case when the
expected objective f is smooth. It may be interesting to investigate adaptive metric approaches in
smooth stochastic optimization.
The idea of adapting first order optimization methods is by no means new and can be traced
back at least to the 1970s with the work on space dilation methods of Shor (1972) and variable
metric methods, such as the BFGS family of algorithms (e.g., Fletcher, 1970). This prior work
often assumed that the function to be minimized was differentiable and, to our knowledge, did not
consider stochastic, online, or composite optimization. In her thesis, Nedić (2002) studied variable
metric subgradient methods, though it seems difficult to derive explicit rates of convergence from the
results there, and the algorithms apply only when the constraint set X = Rd . More recently, Bordes
et al. (2009) proposed a Quasi-Newton stochastic gradient-descent procedure, which is similar in
spirit to our methods. However, their convergence results assume a smooth objective with positive
definite Hessian bounded away from 0. Our results apply more generally.
Prior to the analysis presented in this paper for online and stochastic optimization, the strongly
convex function ψ in the update equations (3) and (4) either remained intact or was simply multiplied
by a time-dependent scalar throughout the run of the algorithm. Zinkevich’s projected gradient,
for example, uses $\psi_t(x) = \|x\|_2^2$, while RDA (Xiao, 2010) employs $\psi_t(x) = \sqrt{t}\, \psi(x)$ where $\psi$ is a
strongly convex function. The bounds for both types of algorithms are similar, and both rely on the
norm $\|\cdot\|$ (and its associated dual $\|\cdot\|_*$) with respect to which $\psi$ is strongly convex. Mirror-descent
type first order algorithms, such as projected gradient methods, attain regret bounds of the form
(Zinkevich, 2003; Bartlett et al., 2007; Duchi et al., 2010)
$$R_\phi(T) \le \frac{1}{\eta} B_\psi(x^*, x_1) + \frac{\eta}{2} \sum_{t=1}^T \|f_t'(x_t)\|_*^2 . \qquad (7)$$
Choosing $\eta \propto 1/\sqrt{T}$ gives $R_\phi(T) = O(\sqrt{T})$. When $B_\psi(x, x^*)$ is bounded for all $x \in X$, we choose
step sizes $\eta_t \propto 1/\sqrt{t}$ which is equivalent to setting $\psi_t(x) = \sqrt{t}\, \psi(x)$. Therefore, no assumption on
the time horizon is necessary. For RDA and follow-the-leader algorithms, the bounds are similar
$$R_\phi(T) \le \sqrt{T}\, \psi(x^*) + \frac{1}{2\sqrt{T}} \sum_{t=1}^T \|f_t'(x_t)\|_*^2 . \qquad (8)$$
The problem of adapting to data and obtaining tighter data-dependent bounds for algorithms
such as those above is a natural one and has been studied in the mistake-bound setting for online
learning in the past. A framework that is somewhat related to ours is the confidence weighted
learning scheme by Crammer et al. (2008) and the adaptive regularization of weights algorithm
(AROW) of Crammer et al. (2009). These papers provide mistake-bound analyses for second-
order algorithms, which in turn are similar in spirit to the second-order Perceptron algorithm (Cesa-
Bianchi et al., 2005). The analyses by Crammer and colleagues, however, yield mistake bounds
dependent on the runs of the individual algorithms and are thus difficult to compare with our regret
bounds.
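AROW maintains a mean prediction vector $\mu_t \in \mathbb{R}^d$ and a covariance matrix $\Sigma_t \in \mathbb{R}^{d \times d}$ over $\mu_t$
as well. At every step of the algorithm, the learner receives a pair $(z_t, y_t)$ where $z_t \in \mathbb{R}^d$ is the $t$th
example and $y_t \in \{-1, +1\}$ is the label. Whenever the predictor $\mu_t$ attains a margin value smaller
than 1, AROW performs the update
$$\beta_t = \frac{1}{\langle z_t, \Sigma_t z_t \rangle + \lambda}, \qquad \alpha_t = [1 - y_t \langle z_t, \mu_t \rangle]_+ ,$$
$$\mu_{t+1} = \mu_t + \alpha_t \Sigma_t y_t z_t, \qquad \Sigma_{t+1} = \Sigma_t - \beta_t \Sigma_t z_t z_t^\top \Sigma_t . \qquad (9)$$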
In the above scheme, one can force $\Sigma_t$ to be diagonal, which reduces the run-time and storage
requirements of the algorithm but still gives good performance (Crammer et al., 2009). In contrast
to AROW, the ADAGRAD algorithm uses the root of the inverse covariance matrix, a consequence of
our formal analysis. Crammer et al.’s algorithm and our algorithms have similar run times, generally
linear in the dimension $d$, when using diagonal matrices. However, when using full matrices the
runtime of the AROW algorithm is $O(d^2)$, which is faster than ours, since our algorithm requires
computing the root of a matrix.
In concurrent work, McMahan and Streeter (2010) propose and analyze an algorithm which
is very similar to some of the algorithms presented in this paper. Our analysis builds on recent
advances in online learning and stochastic optimization (Duchi et al., 2010; Xiao, 2010), whereas
McMahan and Streeter use first-principles to derive their regret bounds. As a consequence of our
approach, we are able to apply our analysis to algorithms for composite minimization with a known
additional objective term ϕ. We are also able to generalize and analyze both the mirror descent and
dual-averaging family of algorithms. McMahan and Streeter focus on what they term the compet-
itive ratio, which is the ratio of the worst case regret of the adaptive algorithm to the worst case
regret of a non-adaptive algorithm with the best proximal term ψ chosen in hindsight. We touch on
this issue briefly in the sequel, but refer the interested reader to McMahan and Streeter (2010) for
this alternative elegant perspective. We believe that both analyses shed insights into the problems
studied in this paper and complement each other.
There are also other lines of work on adaptive gradient methods that are not directly related to
our work but nonetheless relevant. Tighter regret bounds using the variation of the cost functions ft
were proposed by Cesa-Bianchi et al. (2007) and derived by Hazan and Kale (2008). Bartlett et al.
(2007) explore another adaptation technique for ηt where they adapt the step size to accommodate
both strongly and weakly convex functions. Our approach differs from previous approaches as it
does not focus on a particular loss function or mistake bound. Instead, we view the problem of
adapting the proximal function as a meta-learning problem. We then obtain a bound comparable to
the bound obtained using the best proximal function chosen in hindsight.
$$\psi(y) \ge \psi(x) + \langle \nabla \psi(x), y - x \rangle + \frac{1}{2} \|x - y\|_\psi^2 .$$
Strong convexity is guaranteed if and only if $B_{\psi_t}(x, y) \ge \frac{1}{2} \|x - y\|_{\psi_t}^2$. We also denote the dual norm
of $\|\cdot\|_{\psi_t}$ by $\|\cdot\|_{\psi_t^*}$. For completeness, we provide the proofs of the following two results in Appendix F,
as they build straightforwardly on work by Duchi et al. (2010) and Xiao (2010). For the primal-dual
subgradient update, the following bound holds.
Proposition 2 Let the sequence $\{x_t\}$ be defined by the update (3). For any $x^* \in X$,
$$\sum_{t=1}^T f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*) \le \frac{1}{\eta} \psi_T(x^*) + \frac{\eta}{2} \sum_{t=1}^T \|f_t'(x_t)\|_{\psi_{t-1}^*}^2 . \qquad (10)$$
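Proposition 3 Let the sequence $\{x_t\}$ be defined by the update (4). Assume w.l.o.g. that $\varphi(x_1) = 0$.
For any $x^* \in X$,
$$\sum_{t=1}^T f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*)$$
$$\le \frac{1}{\eta} B_{\psi_1}(x^*, x_1) + \frac{1}{\eta} \sum_{t=1}^{T-1} \big[ B_{\psi_{t+1}}(x^*, x_{t+1}) - B_{\psi_t}(x^*, x_{t+1}) \big] + \frac{\eta}{2} \sum_{t=1}^T \|f_t'(x_t)\|_{\psi_t^*}^2 . \qquad (11)$$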
The above propositions allow us to prove regret bounds for a family of algorithms that iteratively
modify the proximal functions $\psi_t$ in an attempt to lower the regret bounds.
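INPUT: $\eta > 0$, $\delta \ge 0$
VARIABLES: $s \in \mathbb{R}^d$, $H \in \mathbb{R}^{d \times d}$, $g_{1:t,i} \in \mathbb{R}^t$ for $i \in \{1, \ldots, d\}$
INITIALIZE $x_1 = 0$, $g_{1:0} = []$
FOR $t = 1$ to $T$
    Suffer loss $f_t(x_t)$
    Receive subgradient $g_t \in \partial f_t(x_t)$ of $f_t$ at $x_t$
    UPDATE $g_{1:t} = [g_{1:t-1} \; g_t]$, $s_{t,i} = \|g_{1:t,i}\|_2$
    SET $H_t = \delta I + \mathrm{diag}(s_t)$, $\psi_t(x) = \frac{1}{2} \langle x, H_t x \rangle$
    Primal-Dual Subgradient Update (3):
        $$x_{t+1} = \mathrm{argmin}_{x \in X} \Big\{ \eta \Big\langle \frac{1}{t} \sum_{\tau=1}^t g_\tau, x \Big\rangle + \eta \varphi(x) + \frac{1}{t} \psi_t(x) \Big\} .$$
    Composite Mirror Descent Update (4):
        $$x_{t+1} = \mathrm{argmin}_{x \in X} \big\{ \eta \langle g_t, x \rangle + \eta \varphi(x) + B_{\psi_t}(x, x_t) \big\} .$$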
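As a rough illustration of Algorithm 1, the following Python sketch implements the composite mirror descent update (4) in the simplest setting, assuming $\varphi \equiv 0$ and $X = \mathbb{R}^d$, where the update reduces to $x_{t+1} = x_t - \eta H_t^{-1} g_t$. The function name `adagrad_diag`, the callable `grad_fn`, and the default parameter values are hypothetical choices made for this example only.

```python
import numpy as np

def adagrad_diag(grad_fn, x0, eta=1.0, delta=1e-8, steps=500):
    """Diagonal ADAGRAD sketch, assuming phi = 0 and X = R^d.

    grad_fn(x) must return a (sub)gradient g_t at x (hypothetical interface).
    """
    x = np.array(x0, dtype=float)
    sum_sq = np.zeros_like(x)            # running sum of g_{tau,i}^2 per coordinate
    for _ in range(steps):
        g = grad_fn(x)
        sum_sq += g * g
        h = delta + np.sqrt(sum_sq)      # diagonal of H_t = delta*I + diag(s_t)
        x -= eta * g / h                 # x_{t+1} = x_t - eta * H_t^{-1} g_t
    return x
```

Only the running sum of squared gradients is stored, so the matrix $g_{1:t}$ never has to be kept explicitly; each coordinate receives its own effective step size $\eta / H_{t,ii}$.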
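$$\min_s \sum_{t=1}^T \sum_{i=1}^d \frac{g_{t,i}^2}{s_i} \quad \text{s.t.} \quad s \succeq 0, \; \langle \mathbf{1}, s \rangle \le c . \qquad (12)$$
This problem is solved by setting $s_i = \|g_{1:T,i}\|_2$ and scaling $s$ so that $\langle s, \mathbf{1} \rangle = c$. To see this, we can
write the Lagrangian of the minimization problem by introducing multipliers $\lambda \succeq 0$ and $\theta \ge 0$ to get
$$L(s, \lambda, \theta) = \sum_{i=1}^d \frac{\|g_{1:T,i}\|_2^2}{s_i} - \langle \lambda, s \rangle + \theta (\langle \mathbf{1}, s \rangle - c) .$$
Taking partial derivatives to find the infimum of $L$, we see that $-\|g_{1:T,i}\|_2^2 / s_i^2 - \lambda_i + \theta = 0$, and
complementarity conditions on $\lambda_i s_i$ (Boyd and Vandenberghe, 2004) imply that $\lambda_i = 0$. Thus
we have $s_i = \theta^{-1/2} \|g_{1:T,i}\|_2$, and normalizing appropriately using $\theta$ gives that
$s_i = c\, \|g_{1:T,i}\|_2 / \sum_{j=1}^d \|g_{1:T,j}\|_2$.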
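The closed form above can be checked numerically. The sketch below is an illustration only: the helper names and the random data are assumptions made for the example. It computes $s_i = c\, \|g_{1:T,i}\|_2 / \sum_j \|g_{1:T,j}\|_2$ and evaluates the objective, which at this point should equal $(\sum_i \|g_{1:T,i}\|_2)^2 / c$.

```python
import numpy as np

def coordinate_norms(G):
    """Column norms ||g_{1:T,i}||_2 of a T x d matrix whose rows are the g_t."""
    return np.linalg.norm(G, axis=0)

def objective(G, s):
    """sum_t sum_i g_{t,i}^2 / s_i for a strictly positive vector s."""
    return float(np.sum(coordinate_norms(G) ** 2 / s))

def optimal_s(G, c):
    """Closed-form minimizer: s_i proportional to ||g_{1:T,i}||_2, summing to c."""
    norms = coordinate_norms(G)
    return c * norms / norms.sum()

rng = np.random.default_rng(0)       # illustrative random gradients
G = rng.normal(size=(50, 4))
c = 3.0
s_star = optimal_s(G, c)
best = objective(G, s_star)
```

Any other strictly positive $s$ with $\langle \mathbf{1}, s \rangle = c$ should give an objective value no smaller than `best`, which follows from the Cauchy-Schwarz inequality.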
Let $\mathrm{diag}(v)$ denote the diagonal matrix with diagonal $v$. It is natural to suspect that for $s$ achieving
the infimum in Equation (12), if we use a proximal function similar to $\psi(x) = \langle x, \mathrm{diag}(s) x \rangle$ with
associated squared dual norm $\|x\|_{\psi^*}^2 = \langle x, \mathrm{diag}(s)^{-1} x \rangle$, we should do well lowering the gradient
terms in the regret bounds (10) and (11).
To prove a regret bound for our Algorithm 1, we note that both types of updates suffer losses that
include a term depending solely on the gradients obtained along their run. The following lemma
is applicable to both updates, and was originally proved by Auer and Gentile (2000), though we
provide a proof in Appendix C. McMahan and Streeter (2010) also give an identical lemma.
Lemma 4 Let $g_t = f_t'(x_t)$ and $g_{1:t}$ and $s_t$ be defined as in Algorithm 1. Then
$$\sum_{t=1}^T \langle g_t, \mathrm{diag}(s_t)^{-1} g_t \rangle \le 2 \sum_{i=1}^d \|g_{1:T,i}\|_2 .$$
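A quick numerical sanity check of Lemma 4 can be sketched as follows. The function name `lemma4_sides` is hypothetical, and the convention that a term with $s_{t,i} = 0$ contributes zero is an assumption that matches the remark that $s_{t,i} = 0$ implies $g_{t,i} = 0$.

```python
import numpy as np

def lemma4_sides(G):
    """Return (lhs, rhs) of Lemma 4 for a T x d float array G whose rows are g_t."""
    sq = G ** 2
    s = np.sqrt(np.cumsum(sq, axis=0))        # s[t, i] = ||g_{1:t,i}||_2
    # g_{t,i}^2 / s_{t,i}, with 0/0 treated as 0 (coordinate not yet observed)
    terms = np.divide(sq, s, out=np.zeros_like(sq), where=s > 0)
    lhs = float(terms.sum())
    rhs = float(2.0 * s[-1].sum())
    return lhs, rhs
```

For example, calling `lemma4_sides(np.ones((4, 1)))` compares $1 + 1/\sqrt{2} + 1/\sqrt{3} + 1/2$ against $2 \cdot 2 = 4$.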
To obtain a regret bound, we need to consider the terms consisting of the dual-norm of the subgradient
in the regret bounds (10) and (11), which is $\|f_t'(x_t)\|_{\psi_t^*}^2$. When $\psi_t(x) = \langle x, (\delta I + \mathrm{diag}(s_t)) x \rangle$,
it is easy to see that the associated dual-norm is
$$\|g\|_{\psi_t^*}^2 = \langle g, (\delta I + \mathrm{diag}(s_t))^{-1} g \rangle .$$
From the definition of $s_t$ in Algorithm 1, we clearly have $\|f_t'(x_t)\|_{\psi_t^*}^2 \le \langle g_t, \mathrm{diag}(s_t)^{-1} g_t \rangle$. Note that
if $s_{t,i} = 0$ then $g_{t,i} = 0$ by definition of $s_{t,i}$. Thus, for any $\delta \ge 0$, Lemma 4 implies
$$\sum_{t=1}^T \|f_t'(x_t)\|_{\psi_t^*}^2 \le 2 \sum_{i=1}^d \|g_{1:T,i}\|_2 . \qquad (13)$$
To obtain a bound for a primal-dual subgradient method, we set $\delta \ge \max_t \|g_t\|_\infty$, in which case
$\|g_t\|_{\psi_{t-1}^*}^2 \le \langle g_t, \mathrm{diag}(s_t)^{-1} g_t \rangle$, and we follow the same lines of reasoning to achieve the
inequality (13).
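It remains to bound the various Bregman divergence terms for Proposition 3 and the term $\psi_T(x^*)$
for Proposition 2. We focus first on the composite mirror-descent update. Examining the bound (11)
and Algorithm 1, we notice that
$$B_{\psi_{t+1}}(x^*, x_{t+1}) - B_{\psi_t}(x^*, x_{t+1}) = \frac{1}{2} \langle x^* - x_{t+1}, \mathrm{diag}(s_{t+1} - s_t)(x^* - x_{t+1}) \rangle \le \frac{1}{2} \max_i (x_i^* - x_{t+1,i})^2 \|s_{t+1} - s_t\|_1 .$$
Since $\|s_{t+1} - s_t\|_1 = \langle s_{t+1} - s_t, \mathbf{1} \rangle$ and $\langle s_T, \mathbf{1} \rangle = \sum_{i=1}^d \|g_{1:T,i}\|_2$, we have
$$\sum_{t=1}^{T-1} \big[ B_{\psi_{t+1}}(x^*, x_{t+1}) - B_{\psi_t}(x^*, x_{t+1}) \big] \le \frac{1}{2} \sum_{t=1}^{T-1} \|x^* - x_{t+1}\|_\infty^2 \langle s_{t+1} - s_t, \mathbf{1} \rangle$$
$$\le \frac{1}{2} \max_{t \le T} \|x^* - x_t\|_\infty^2 \sum_{i=1}^d \|g_{1:T,i}\|_2 - \frac{1}{2} \|x^* - x_1\|_\infty^2 \langle s_1, \mathbf{1} \rangle . \qquad (14)$$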
We also have
$$\psi_T(x^*) = \delta \|x^*\|_2^2 + \langle x^*, \mathrm{diag}(s_T) x^* \rangle \le \delta \|x^*\|_2^2 + \|x^*\|_\infty^2 \sum_{i=1}^d \|g_{1:T,i}\|_2 .$$
Combining the above arguments with Propositions 2 and 3, and using (14) with the fact that
$B_{\psi_1}(x^*, x_1) \le \frac{1}{2} \|x^* - x_1\|_\infty^2 \langle \mathbf{1}, s_1 \rangle$, we have proved the following theorem.
Theorem 5 Let the sequence $\{x_t\}$ be defined by Algorithm 1. For $x_t$ generated using the primal-dual
subgradient update (3) with $\delta \ge \max_t \|g_t\|_\infty$, for any $x^* \in X$,
$$R_\phi(T) \le \frac{\delta}{\eta} \|x^*\|_2^2 + \frac{1}{\eta} \|x^*\|_\infty^2 \sum_{i=1}^d \|g_{1:T,i}\|_2 + \eta \sum_{i=1}^d \|g_{1:T,i}\|_2 .$$
For $x_t$ generated using the composite mirror-descent update (4), for any $x^* \in X$
$$R_\phi(T) \le \frac{1}{2\eta} \max_{t \le T} \|x^* - x_t\|_\infty^2 \sum_{i=1}^d \|g_{1:T,i}\|_2 + \eta \sum_{i=1}^d \|g_{1:T,i}\|_2 .$$
The above theorem is a bit unwieldy. We thus perform a few algebraic simplifications to get the
next corollary, which has a more intuitive form. Let us assume that $X$ is compact and set
$D_\infty = \sup_{x \in X} \|x - x^*\|_\infty$. Furthermore, define
$$\gamma_T \triangleq \sum_{i=1}^d \|g_{1:T,i}\|_2 = \inf_s \Big\{ \sum_{t=1}^T \langle g_t, \mathrm{diag}(s)^{-1} g_t \rangle : \langle \mathbf{1}, s \rangle \le \sum_{i=1}^d \|g_{1:T,i}\|_2, \; s \succeq 0 \Big\} .$$
Corollary 6 Assume that $D_\infty$ and $\gamma_T$ are defined as above. For $\{x_t\}$ generated by Algorithm 1 using
the primal-dual subgradient update (3) with $\eta = \|x^*\|_\infty$, for any $x^* \in X$ we have
$$R_\phi(T) \le 2 \|x^*\|_\infty \gamma_T + \delta \frac{\|x^*\|_2^2}{\|x^*\|_\infty} \le 2 \|x^*\|_\infty \gamma_T + \delta \|x^*\|_1 .$$
Using the composite mirror descent update (4) to generate $\{x_t\}$ and setting $\eta = D_\infty / \sqrt{2}$, we have
$$R_\phi(T) \le \sqrt{2} D_\infty \sum_{i=1}^d \|g_{1:T,i}\|_2 = \sqrt{2} D_\infty \gamma_T .$$
We now give a short derivation of Corollary 1 from the introduction: use Theorem 5, Corollary 6,
and the fact that
$$\inf_s \Big\{ \sum_{t=1}^T \sum_{i=1}^d \frac{g_{t,i}^2}{s_i} : s \succeq 0, \; \langle \mathbf{1}, s \rangle \le d \Big\} = \frac{1}{d} \Big( \sum_{i=1}^d \|g_{1:T,i}\|_2 \Big)^2$$
as in (12) in the beginning of Section 3. Plugging the $\gamma_T$ term in from Corollary 6 and multiplying
$D_\infty$ by $\sqrt{d}$ completes the proof of the corollary.
As discussed in the introduction, Algorithm 1 should have lower regret than non-adaptive algorithms
on sparse data, though this depends on the geometry of the underlying optimization space
$X$. For example, suppose that our learning problem is a logistic regression with 0/1-valued features.
Then the gradient terms are likewise based on 0/1-valued features and sparse, so the gradient terms
in the bound $\sum_{i=1}^d \|g_{1:T,i}\|_2$ should all be much smaller than $\sqrt{T}$. If some features appear much more
frequently than others, then the infimal representation of $\gamma_T$ and the infimal equality in Corollary 1
show that we have significantly lower regret by using higher learning rates for infrequent features
and lower learning rates on commonly appearing features. Further, if the optimal predictor is relatively
dense, as is often the case in prediction problems with sparse inputs, then $\|x^*\|_\infty$ is the best
p-norm we can have in the regret.
More precisely, McMahan and Streeter (2010) show that if $X$ is contained within an $\ell_\infty$ ball
of radius $R$ and contains an $\ell_\infty$ ball of radius $r$, then the bound in the above corollary is within a
factor of $\sqrt{2} R / r$ of the regret of the best diagonal proximal matrix, chosen in hindsight. So, for
example, if $X = \{x \in \mathbb{R}^d : \|x\|_p \le C\}$, then $R/r = d^{1/p}$, which shows that the domain $X$ does affect
the guarantees we can give on optimality of ADAGRAD.
The solution is obtained by defining $G_t = \sum_{\tau=1}^t g_\tau g_\tau^\top$ and setting $S$ to be a normalized version of
the root of $G_T$, that is, $S = c\, G_T^{1/2} / \mathrm{tr}(G_T^{1/2})$. For a proof, see Lemma 15 in Appendix E, which also
shows that when $G_T$ is not full rank we can instead use its pseudo-inverse. If we iteratively use
divergences of the form $\psi_t(x) = \langle x, G_t^{1/2} x \rangle$, we might expect as in the diagonal case to attain low
regret by collecting gradient information. We achieve our low regret goal by employing a similar
doubling lemma to Lemma 4 and bounding the gradient norm terms. The resulting algorithm is
given in Algorithm 2, and the next theorem provides a quantitative analysis of the brief motivation
above.
Theorem 7 Let $G_t$ be the outer product matrix defined above and the sequence $\{x_t\}$ be defined by
Algorithm 2. For $x_t$ generated using the primal-dual subgradient update of (3) and $\delta \ge \max_t \|g_t\|_2$,
for any $x^* \in X$
$$R_\phi(T) \le \frac{\delta}{\eta} \|x^*\|_2^2 + \frac{1}{\eta} \|x^*\|_2^2\, \mathrm{tr}(G_T^{1/2}) + \eta\, \mathrm{tr}(G_T^{1/2}) .$$
For $x_t$ generated with the composite mirror-descent update of (4), if $x^* \in X$ and $\delta \ge 0$
$$R_\phi(T) \le \frac{\delta}{\eta} \|x^*\|_2^2 + \frac{1}{2\eta} \max_{t \le T} \|x^* - x_t\|_2^2\, \mathrm{tr}(G_T^{1/2}) + \eta\, \mathrm{tr}(G_T^{1/2}) .$$
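INPUT: $\eta > 0$, $\delta \ge 0$
VARIABLES: $S_t \in \mathbb{R}^{d \times d}$, $H_t \in \mathbb{R}^{d \times d}$, $G_t \in \mathbb{R}^{d \times d}$
INITIALIZE $x_1 = 0$, $S_0 = 0$, $H_0 = 0$, $G_0 = 0$
FOR $t = 1$ to $T$
    Suffer loss $f_t(x_t)$
    Receive subgradient $g_t \in \partial f_t(x_t)$ of $f_t$ at $x_t$
    UPDATE $G_t = G_{t-1} + g_t g_t^\top$, $S_t = G_t^{1/2}$
    SET $H_t = \delta I + S_t$, $\psi_t(x) = \frac{1}{2} \langle x, H_t x \rangle$
    Primal-Dual Subgradient Update (3):
        $$x_{t+1} = \mathrm{argmin}_{x \in X} \Big\{ \eta \Big\langle \frac{1}{t} \sum_{\tau=1}^t g_\tau, x \Big\rangle + \eta \varphi(x) + \frac{1}{t} \psi_t(x) \Big\} .$$
    Composite Mirror Descent Update (4):
        $$x_{t+1} = \mathrm{argmin}_{x \in X} \big\{ \eta \langle g_t, x \rangle + \eta \varphi(x) + B_{\psi_t}(x, x_t) \big\} .$$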
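The full-matrix variant can be sketched in the same simplified setting as before, assuming $\varphi \equiv 0$ and $X = \mathbb{R}^d$, so that the composite mirror descent update becomes $x_{t+1} = x_t - \eta H_t^{-1} g_t$. The names `psd_sqrt` and `adagrad_full` are hypothetical, and computing $G_t^{1/2}$ by a fresh eigendecomposition at every step is simply the most direct choice for an illustration, not a recommendation for efficiency.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)            # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def adagrad_full(grad_fn, x0, eta=0.5, delta=1e-6, steps=500):
    """Full-matrix ADAGRAD sketch, assuming phi = 0 and X = R^d."""
    x = np.array(x0, dtype=float)
    d = x.size
    G = np.zeros((d, d))                 # G_t = sum of outer products g g^T
    for _ in range(steps):
        g = grad_fn(x)
        G += np.outer(g, g)
        H = delta * np.eye(d) + psd_sqrt(G)     # H_t = delta*I + G_t^{1/2}
        x -= eta * np.linalg.solve(H, g)        # x_{t+1} = x_t - eta * H_t^{-1} g_t
    return x
```

Each step costs a $d \times d$ eigendecomposition and a linear solve, which is why the diagonal version is the practical choice in high dimensions.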
Proof To begin, we consider the difference between the divergence terms at time $t + 1$ and time $t$
from the regret (11) in Proposition 3. Let $\lambda_{\max}(M)$ denote the largest eigenvalue of a matrix $M$. We
have
$$B_{\psi_{t+1}}(x^*, x_{t+1}) - B_{\psi_t}(x^*, x_{t+1}) = \frac{1}{2} \big\langle x^* - x_{t+1}, (G_{t+1}^{1/2} - G_t^{1/2})(x^* - x_{t+1}) \big\rangle$$
$$\le \frac{1}{2} \|x^* - x_{t+1}\|_2^2\, \lambda_{\max}(G_{t+1}^{1/2} - G_t^{1/2}) \le \frac{1}{2} \|x^* - x_{t+1}\|_2^2\, \mathrm{tr}(G_{t+1}^{1/2} - G_t^{1/2}) .$$
For the last inequality we used the fact that the trace of a matrix is equal to the sum of its eigenvalues
along with the property $G_{t+1}^{1/2} - G_t^{1/2} \succeq 0$ (see Lemma 13 in Appendix B) and therefore
$\mathrm{tr}(G_{t+1}^{1/2} - G_t^{1/2}) \ge \lambda_{\max}(G_{t+1}^{1/2} - G_t^{1/2})$. Thus, we get
$$\sum_{t=1}^{T-1} \big[ B_{\psi_{t+1}}(x^*, x_{t+1}) - B_{\psi_t}(x^*, x_{t+1}) \big] \le \frac{1}{2} \sum_{t=1}^{T-1} \|x^* - x_{t+1}\|_2^2 \big( \mathrm{tr}(G_{t+1}^{1/2}) - \mathrm{tr}(G_t^{1/2}) \big) .$$
Now we use the fact that $G_1$ is a rank 1 PSD matrix with non-negative trace to see that
$$\sum_{t=1}^{T-1} \|x^* - x_{t+1}\|_2^2 \big( \mathrm{tr}(G_{t+1}^{1/2}) - \mathrm{tr}(G_t^{1/2}) \big) \le \max_{t \le T} \|x^* - x_t\|_2^2\, \mathrm{tr}(G_T^{1/2}) - \|x^* - x_1\|_2^2\, \mathrm{tr}(G_1^{1/2}) . \qquad (16)$$
It remains to bound the gradient terms common to all our bounds. We use the following three
lemmas, which are essentially directly applicable. We prove the first two in Appendix D.
Lemma 8 Let $B \succeq 0$ and $B^{-1/2}$ denote the root of the inverse of $B$ when $B \succ 0$ and the root of the
pseudo-inverse of $B$ otherwise. For any $\nu$ such that $B - \nu g g^\top \succeq 0$ the following inequality holds:
$$2\, \mathrm{tr}((B - \nu g g^\top)^{1/2}) \le 2\, \mathrm{tr}(B^{1/2}) - \nu\, \mathrm{tr}(B^{-1/2} g g^\top) .$$
Lemma 9 Let $\delta \ge \|g\|_2$ and $A \succeq 0$, then $\langle g, (\delta I + A^{1/2})^{-1} g \rangle \le \langle g, ((A + g g^\top)^\dagger)^{1/2} g \rangle$.
Proof We prove the lemma by induction. The base case is immediate, since we have
$$\langle g_1, (G_1^\dagger)^{1/2} g_1 \rangle = \frac{\langle g_1, g_1 \rangle}{\|g_1\|_2} = \|g_1\|_2 \le 2 \|g_1\|_2 .$$
Now, assume the lemma is true for $T - 1$, so from the inductive assumption we get
$$\sum_{t=1}^T \langle g_t, S_t^\dagger g_t \rangle \le 2 \sum_{t=1}^{T-1} \langle g_t, S_{T-1}^\dagger g_t \rangle + \langle g_T, S_T^\dagger g_T \rangle .$$
Since $S_{T-1}$ does not depend on $t$ we can rewrite $\sum_{t=1}^{T-1} \langle g_t, S_{T-1}^\dagger g_t \rangle$ as
$$\mathrm{tr}\Big( S_{T-1}^\dagger \sum_{t=1}^{T-1} g_t g_t^\top \Big) = \mathrm{tr}((G_{T-1}^\dagger)^{1/2} G_{T-1}) ,$$
where the right-most equality follows from the definitions of $S_t$ and $G_t$. Therefore, we get
$$\sum_{t=1}^T \langle g_t, S_t^\dagger g_t \rangle \le 2\, \mathrm{tr}((G_{T-1}^\dagger)^{1/2} G_{T-1}) + \langle g_T, (G_T^\dagger)^{1/2} g_T \rangle = 2\, \mathrm{tr}(G_{T-1}^{1/2}) + \langle g_T, (G_T^\dagger)^{1/2} g_T \rangle .$$
Using Lemma 8 with the substitution $B = G_T$, $\nu = 1$, and $g = g_T$ lets us exploit the concavity of the
function $\mathrm{tr}(A^{1/2})$ to bound the above sum by $2\, \mathrm{tr}(G_T^{1/2})$.
We can now finalize our proof of the theorem. As in the diagonal case, we have that the squared
dual norm (seminorm when $\delta = 0$) associated with $\psi_t$ is
$$\|x\|_{\psi_t^*}^2 = \langle x, (\delta I + S_t)^{-1} x \rangle .$$
Thus it is clear that $\|g_t\|_{\psi_t^*}^2 \le \langle g_t, S_t^\dagger g_t \rangle$. For the dual-averaging algorithms, we use Lemma 9 above
to show that $\|g_t\|_{\psi_{t-1}^*}^2 \le \langle g_t, S_t^\dagger g_t \rangle$ so long as $\delta \ge \|g_t\|_2$. Lemma 10’s doubling inequality then implies
that
$$\sum_{t=1}^T \|f_t'(x_t)\|_{\psi_t^*}^2 \le 2\, \mathrm{tr}(G_T^{1/2}) \quad \text{and} \quad \sum_{t=1}^T \|f_t'(x_t)\|_{\psi_{t-1}^*}^2 \le 2\, \mathrm{tr}(G_T^{1/2}) \qquad (17)$$
for the mirror-descent and primal-dual subgradient algorithm, respectively.
To finish the proof, note that $B_{\psi_1}(x^*, x_1) \le \frac{1}{2} \|x^* - x_1\|_2^2\, \mathrm{tr}(G_1^{1/2})$ when $\delta = 0$. By combining this
with the first of the bounds (17) and the bound (16) on $\sum_{t=1}^{T-1} \big[ B_{\psi_{t+1}}(x^*, x_{t+1}) - B_{\psi_t}(x^*, x_{t+1}) \big]$,
Proposition 3 gives the theorem’s statement for the mirror-descent family of algorithms. Combining the
fact that $\sum_{t=1}^T \|f_t'(x_t)\|_{\psi_{t-1}^*}^2 \le 2\, \mathrm{tr}(G_T^{1/2})$ and the bound (16) with Proposition 2 gives the desired bound
on $R_\phi(T)$ for the primal-dual subgradient algorithms, which completes the proof of the theorem.
As before, we can give a corollary that simplifies the bound implied by Theorem 7. The infimal
equality in the corollary uses Lemma 15 in Appendix B. The corollary underscores that for learning
problems in which there is a rotation $U$ of the space for which the gradient vectors $g_t$ have
small inner products $\langle g_t, U g_t \rangle$ (essentially a sparse basis for the $g_t$), using full-matrix proximal
functions can attain significantly lower regret.
Corollary 11 Assume that $\varphi(x_1) = 0$. Then the regret of the sequence $\{x_t\}$ generated by Algorithm 2
when using the primal-dual subgradient update with $\eta = \|x^*\|_2$ is
$$R_\phi(T) \le 2 \|x^*\|_2\, \mathrm{tr}(G_T^{1/2}) + \delta \|x^*\|_2 .$$
Let $X$ be a compact set so that $\sup_{x \in X} \|x - x^*\|_2 \le D$. Taking $\eta = D / \sqrt{2}$ and using the composite
mirror descent update with $\delta = 0$, we have
$$R_\phi(T) \le \sqrt{2} D\, \mathrm{tr}(G_T^{1/2}) = \sqrt{2d}\, D \sqrt{ \inf_S \Big\{ \sum_{t=1}^T g_t^\top S^{-1} g_t : S \succeq 0, \; \mathrm{tr}(S) \le d \Big\} } .$$
5. Derived Algorithms
In this section, we derive updates using concrete regularization functions ϕ and settings of the
domain X for the ADAGRAD framework. We focus on showing how to solve Equations (3) and (4)
with the diagonal matrix version of the algorithms we have presented. We focus on the diagonal
case for two reasons. First, the updates often take closed-form in this case and carry some intuition.
Second, the diagonal case is feasible to implement in very high dimensions, whereas the full matrix
version is likely to be confined to a few thousand dimensions. We also discuss how to efficiently
compute the updates when the gradient vectors are sparse.
We begin by noting a simple but useful fact. Let $G_t$ denote either the outer product matrix of
gradients or its diagonal counterpart and let $H_t = \delta I + G_t^{1/2}$, as usual. Simple algebraic manipulations
yield that each of the updates (3) and (4) in the prequel can be written in the following form
(omitting the stepsize $\eta$):
$$x_{t+1} = \mathrm{argmin}_{x \in X} \Big\{ \langle u, x \rangle + \varphi(x) + \frac{1}{2} \langle x, H_t x \rangle \Big\} . \qquad (18)$$
In particular, at time $t$ for the RDA update, we have $u = \eta t \bar{g}_t$. For the composite gradient update (4),
$$\eta \langle g_t, x \rangle + \frac{1}{2} \langle x - x_t, H_t (x - x_t) \rangle = \langle \eta g_t - H_t x_t, x \rangle + \frac{1}{2} \langle x, H_t x \rangle + \frac{1}{2} \langle x_t, H_t x_t \rangle$$
so that $u = \eta g_t - H_t x_t$. We now derive algorithms for solving the general update (18). Since most
of the derivations are known, we generally provide only the closed-form solutions or algorithms for
the solutions in the remainder of the subsection, deferring detailed derivations to Appendix G for
the interested reader.
5.1 ℓ1 -regularization
We begin by considering how to solve the minimization problems necessary for Algorithm 1 with
diagonal matrix divergences and $\varphi(x) = \lambda \|x\|_1$. We consider the two updates we proposed and
denote the $i$th diagonal element of the matrix $H_t = \delta I + \mathrm{diag}(s_t)$ from Algorithm 1 by
$H_{t,ii} = \delta + \|g_{1:t,i}\|_2$. For the primal-dual subgradient update, the solution to (3) amounts to the following simple
update for $x_{t+1,i}$:
$$x_{t+1,i} = \mathrm{sign}(-\bar{g}_{t,i}) \frac{\eta t}{H_{t,ii}} [|\bar{g}_{t,i}| - \lambda]_+ . \qquad (19)$$
Comparing the update (19) to the standard dual averaging update (Xiao, 2010), which is
$$x_{t+1,i} = \mathrm{sign}(-\bar{g}_{t,i})\, \eta \sqrt{t}\, [|\bar{g}_{t,i}| - \lambda]_+ ,$$
it is clear that the difference distills to the step size employed for each coordinate. Our generalization
of RDA yields a dedicated step size for each coordinate inversely proportional to the time-based
norm of the coordinate in the sequence of gradients. Due to the normalization by this term the step
size scales linearly with $t$, so when $H_{t,ii}$ is small, gradient information on coordinate $i$ is quickly
incorporated.
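The per-coordinate update (19) can be written in a few lines. In this sketch the function name and argument layout are hypothetical: `g_sum` holds $\sum_{\tau \le t} g_\tau$, `sum_sq` holds $\sum_{\tau \le t} g_{\tau,i}^2$ per coordinate, and we assume $\delta > 0$ so that $H_{t,ii}$ is never zero.

```python
import numpy as np

def adagrad_rda_l1_step(g_sum, sum_sq, t, eta, lam, delta):
    """One ADAGRAD-RDA step with l1 regularization, following update (19).

    Assumes delta > 0 so that every H_{t,ii} is strictly positive.
    """
    g_bar = g_sum / t                          # average gradient
    h = delta + np.sqrt(sum_sq)                # H_{t,ii} = delta + ||g_{1:t,i}||_2
    shrunk = np.maximum(np.abs(g_bar) - lam, 0.0)
    return np.sign(-g_bar) * (eta * t / h) * shrunk
```

Coordinates whose average gradient magnitude stays below $\lambda$ are set exactly to zero, which is the source of sparsity in the iterates.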
The composite mirror-descent update (4) has a similar form that essentially amounts to iterative
shrinkage and thresholding, where the shrinkage differs per coordinate:
$$x_{t+1,i} = \mathrm{sign}\Big( x_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} \Big) \Big[ \Big| x_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} \Big| - \frac{\lambda \eta}{H_{t,ii}} \Big]_+ .$$
We compare the actual performance of the newly derived algorithms to previously studied versions
in the next section.
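The shrinkage-and-thresholding form of the composite mirror-descent update translates directly into vectorized code. This is a sketch under stated assumptions: the function name is hypothetical, `sum_sq` already includes the current gradient, and every $H_{t,ii}$ is strictly positive.

```python
import numpy as np

def adagrad_md_l1_step(x, g, sum_sq, eta, lam, delta):
    """One composite mirror-descent step with l1 regularization.

    sum_sq must already include g**2; assumes delta + sqrt(sum_sq) > 0.
    """
    h = delta + np.sqrt(sum_sq)                # H_{t,ii}
    y = x - eta * g / h                        # per-coordinate gradient step
    return np.sign(y) * np.maximum(np.abs(y) - lam * eta / h, 0.0)
```

The threshold $\lambda \eta / H_{t,ii}$ shrinks as a coordinate accumulates gradient mass, so frequently updated features are regularized more gently per step than rarely seen ones.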
For both updates it is clear that we can perform “lazy” computation when the gradient vectors
are sparse, a frequently occurring setting when learning for instance from text corpora. Suppose
that from time step $t_0$ through $t$, the $i$th component of the gradient is 0. Then we can evaluate the
above updates on demand since $H_{t,ii}$ remains intact. For composite mirror-descent, at time $t$ when
$x_{t,i}$ is needed, we update
$$x_{t,i} = \mathrm{sign}(x_{t_0,i}) \Big[ |x_{t_0,i}| - \frac{\lambda \eta}{H_{t_0,ii}} (t - t_0) \Big]_+ .$$
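The lazy formula can be compared against the step-by-step update with zero gradients. The two helper functions below are hypothetical names introduced for this check; `h` stands for the frozen values $H_{t_0,ii}$ and `k` for the number of skipped steps $t - t_0$.

```python
import numpy as np

def eager_zero_grad_steps(x, h, eta, lam, k):
    """Apply the l1 mirror-descent update k times with g = 0 (H stays fixed)."""
    for _ in range(k):
        x = np.sign(x) * np.maximum(np.abs(x) - lam * eta / h, 0.0)
    return x

def lazy_catch_up(x, h, eta, lam, k):
    """Single closed-form update covering k skipped zero-gradient steps."""
    return np.sign(x) * np.maximum(np.abs(x) - k * lam * eta / h, 0.0)
```

Both routines return the same vector up to floating-point rounding, but the lazy version costs one operation per touched coordinate regardless of how many steps were skipped.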
Even simpler just-in-time evaluation can be performed for the primal-dual subgradient update.
Here we need to keep an unnormalized version of the average $\bar{g}_t$. Concretely, we keep track of
$u_t = t \bar{g}_t = \sum_{\tau=1}^t g_\tau = u_{t-1} + g_t$, then use the update (19):
$$x_{t,i} = \mathrm{sign}(-u_{t,i}) \frac{\eta t}{H_{t,ii}} \Big[ \frac{|u_{t,i}|}{t} - \lambda \Big]_+ ,$$
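INPUT: $v \succeq 0$, $a \succ 0$, $c \ge 0$.
IF $\sum_i a_i v_i \le c$ RETURN $z^* = v$
SORT $v_i / a_i$ into $\mu = [v_{i_j} / a_{i_j}]$ s.t. $v_{i_j} / a_{i_j} \ge v_{i_{j+1}} / a_{i_{j+1}}$
SET $\rho := \max \Big\{ \rho : \sum_{j=1}^\rho a_{i_j} v_{i_j} - \frac{v_{i_\rho}}{a_{i_\rho}} \sum_{j=1}^\rho a_{i_j}^2 < c \Big\}$
SET $\theta = \dfrac{\sum_{j=1}^\rho a_{i_j} v_{i_j} - c}{\sum_{j=1}^\rho a_{i_j}^2}$
RETURN $z^*$ where $z_i^* = [v_i - \theta a_i]_+$.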
use the matrix $H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$ from Algorithm 1. We provide a brief derivation sketch and
an $O(d \log d)$ algorithm in this section. First, we convert the problem (18) into a projection problem
onto a scaled $\ell_1$-ball. By making the substitutions $z = H^{1/2} x$ and $A = H^{-1/2}$, it is clear that
problem (18) is equivalent to
$$\min_z \big\| z + H^{-1/2} u \big\|_2^2 \quad \text{s.t.} \quad \|A z\|_1 \le c .$$
Now, by appropriate choice of $v = -H^{-1/2} u = -\eta t H_t^{-1/2} \bar{g}_t$ for the primal-dual update (3) and
$v = H_t^{1/2} x_t - \eta H_t^{-1/2} g_t$ for the mirror-descent update (4), we arrive at the problem
$$\min_z \frac{1}{2} \|z - v\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^d a_i |z_i| \le c . \qquad (20)$$
We can clearly recover $x_{t+1}$ from the solution $z^*$ to the projection (20) via $x_{t+1} = H_t^{-1/2} z^*$.
By the symmetry of the objective (20), we can assume without loss of generality that $v \succeq 0$ and
constrain $z \succeq 0$, and a bit of manipulation with the Lagrangian (see Appendix G) for the problem
shows that the solution $z^*$ has the form
$$z_i^* = \begin{cases} v_i - \theta^* a_i & \text{if } v_i \ge \theta^* a_i \\ 0 & \text{otherwise} \end{cases}$$
for some $\theta^* \ge 0$. The algorithm in Figure 3 constructs the optimal $\theta$ and returns $z^*$.
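The projection procedure of Figure 3 can be sketched with a sort and two cumulative sums. The function name `project_scaled_l1` is hypothetical, the feasibility test uses the weighted sum $\sum_i a_i v_i$ that appears in the constraint of (20), and the explicit handling of $c = 0$ is an assumption added for robustness.

```python
import numpy as np

def project_scaled_l1(v, a, c):
    """Project v >= 0 onto {z >= 0 : sum_i a_i z_i <= c}, with a > 0 and c >= 0."""
    v = np.asarray(v, dtype=float)
    a = np.asarray(a, dtype=float)
    if np.dot(a, v) <= c:
        return v.copy()                        # already feasible
    if c <= 0:
        return np.zeros_like(v)                # only z = 0 is feasible
    order = np.argsort(-(v / a))               # sort ratios v_i / a_i descending
    a_s, v_s = a[order], v[order]
    cum_av = np.cumsum(a_s * v_s)              # sum_{j<=rho} a_{i_j} v_{i_j}
    cum_aa = np.cumsum(a_s ** 2)               # sum_{j<=rho} a_{i_j}^2
    ok = cum_av - (v_s / a_s) * cum_aa < c
    rho = np.nonzero(ok)[0].max()              # largest index satisfying the test
    theta = (cum_av[rho] - c) / cum_aa[rho]
    return np.maximum(v - theta * a, 0.0)
```

The sort dominates the cost, which gives the $O(d \log d)$ running time mentioned in the text; the returned vector meets the constraint with equality whenever the input is infeasible.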
5.3 ℓ2 Regularization
We now turn to the case where $\varphi(x) = \lambda \|x\|_2$ while $X = \mathbb{R}^d$. This type of regularization is useful
for zeroing multiple weights in a group, for example in multi-task or multiclass learning (Obozinski
et al., 2007). Recalling the general proximal step (18), we must solve
$$\min_x \; \langle u, x \rangle + \frac{1}{2} \langle x, H x \rangle + \lambda \|x\|_2 . \qquad (21)$$
There is no closed form solution for this problem, but we give an efficient bisection-based procedure
for solving (21). We start by deriving the dual. Introducing a variable $z = x$, we get the equivalent
problem of minimizing $\langle u, x \rangle + \frac{1}{2} \langle x, H x \rangle + \lambda \|z\|_2$ subject to $x = z$. With Lagrange multipliers $\alpha$ for
the equality constraint, we obtain the Lagrangian
$$L(x, z, \alpha) = \langle u, x \rangle + \frac{1}{2} \langle x, H x \rangle + \lambda \|z\|_2 + \langle \alpha, x - z \rangle .$$
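INPUT: $u \in \mathbb{R}^d$, $H \succ 0$, $\lambda > 0$.
IF $\|u\|_2 \le \lambda$
    RETURN $x = 0$
SET $v = H^{-1} u$, $\theta_{\max} = \|v\|_2 / \lambda - 1/\sigma_{\max}(H)$,
    $\theta_{\min} = \|v\|_2 / \lambda - 1/\sigma_{\min}(H)$
WHILE $\theta_{\max} - \theta_{\min} > \varepsilon$
    SET $\theta = (\theta_{\max} + \theta_{\min})/2$, $\alpha(\theta) = -(H^{-1} + \theta I)^{-1} v$
    IF $\|\alpha(\theta)\|_2 > \lambda$
        SET $\theta_{\min} = \theta$
    ELSE
        SET $\theta_{\max} = \theta$
RETURN $x = -H^{-1}(u + \alpha(\theta))$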
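Taking the infimum of $L$ with respect to the primal variables $x$ and $z$, we see that the infimum is
attained at $x = -H^{-1}(u + \alpha)$. Coupled with the fact that $\inf_z \lambda \|z\|_2 - \langle \alpha, z \rangle = -\infty$ unless $\|\alpha\|_2 \le \lambda$,
in which case the infimum is 0, we arrive at the dual form
$$\inf_{x,z} L(x, z, \alpha) = \begin{cases} -\frac{1}{2} \langle u + \alpha, H^{-1}(u + \alpha) \rangle & \text{if } \|\alpha\|_2 \le \lambda \\ -\infty & \text{otherwise.} \end{cases}$$
We can solve problem (22) efficiently using a bisection search of its equivalent representation in
Lagrange form,
$$\min_\alpha \; \langle v, \alpha \rangle + \frac{1}{2} \langle \alpha, H^{-1} \alpha \rangle + \frac{\theta}{2} \|\alpha\|_2^2 ,$$
where $\theta > 0$ is an unknown scalar. The solution to the latter as a function of $\theta$ is clearly
$\alpha(\theta) = -(H^{-1} + \theta I)^{-1} v = -(H^{-1} + \theta I)^{-1} H^{-1} u$. Since $\|\alpha(\theta)\|_2$ is monotonically decreasing in $\theta$ (consider
the eigen-decomposition of the positive definite $H^{-1}$), we can simply perform a bisection search
over $\theta$, checking at each point whether $\|\alpha(\theta)\|_2 \gtrless \lambda$.
To find initial upper and lower bounds on $\theta$, we note that
$$\frac{\|v\|_2}{\sigma_{\min}(H)^{-1} + \theta} \le \|\alpha(\theta)\|_2 \le \frac{\|v\|_2}{\sigma_{\max}(H)^{-1} + \theta} ,$$
where $\sigma_{\max}(H)$ denotes the maximum singular value of $H$ and $\sigma_{\min}(H)$ the minimum. To guarantee
$\|\alpha(\theta_{\max})\|_2 \le \lambda$, we thus set $\theta_{\max} = \|v\|_2 / \lambda - 1/\sigma_{\max}(H)$. Similarly, for $\theta_{\min}$ we see that so long as
$\theta \le \|v\|_2 / \lambda - 1/\sigma_{\min}(H)$ we have $\|\alpha(\theta)\|_2 \ge \lambda$. The fact that $\partial \|x\|_2 = \{z : \|z\|_2 \le 1\}$ when $x = 0$
implies that the solution for the original problem (21) is $x = 0$ if and only if $\|u\|_2 \le \lambda$. We provide
pseudocode for solving (21) in Algorithm 4.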
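The bisection procedure can be sketched as follows for a symmetric positive definite $H$. Several details here are assumptions made for the example rather than part of the source: the function name `l2_prox_step`, the cap on the number of bisection iterations, and the clamping of the lower bracket at 0 (valid because the optimal $\theta$ is positive whenever $\|u\|_2 > \lambda$).

```python
import numpy as np

def l2_prox_step(u, H, lam, eps=1e-12, max_iter=200):
    """Solve min_x <u,x> + 0.5 <x,Hx> + lam*||x||_2 by bisection on theta."""
    u = np.asarray(u, dtype=float)
    if np.linalg.norm(u) <= lam:
        return np.zeros_like(u)                   # x = 0 iff ||u||_2 <= lam
    H_inv = np.linalg.inv(H)
    v = H_inv @ u
    eig = np.linalg.eigvalsh(H)                   # ascending eigenvalues of H
    s_min, s_max = eig[0], eig[-1]
    v_norm = np.linalg.norm(v)
    theta_max = v_norm / lam - 1.0 / s_max        # guarantees ||alpha|| <= lam
    theta_min = max(v_norm / lam - 1.0 / s_min, 0.0)  # clamped at 0 (assumption)
    eye = np.eye(u.size)

    def alpha(theta):
        return -np.linalg.solve(H_inv + theta * eye, v)

    for _ in range(max_iter):
        if theta_max - theta_min <= eps:
            break
        theta = 0.5 * (theta_max + theta_min)
        if np.linalg.norm(alpha(theta)) > lam:
            theta_min = theta                     # norm too large: increase theta
        else:
            theta_max = theta
    theta = 0.5 * (theta_max + theta_min)
    return -H_inv @ (u + alpha(theta))
```

For a nonzero solution, optimality of (21) requires $u + Hx + \lambda x / \|x\|_2 = 0$, which gives a convenient way to check the output numerically.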
5.4 ℓ∞ Regularization
We again let $X = \mathbb{R}^d$ but now choose $\varphi(x) = \lambda \|x\|_\infty$. This type of update, similarly to $\ell_2$, zeroes
groups of variables, which is handy in finding structurally sparse solutions for multitask or multiclass
problems. Solving the $\ell_\infty$ regularized problem amounts to
$$\min_x \; \langle u, x \rangle + \frac{1}{2} \langle x, H x \rangle + \lambda \|x\|_\infty . \qquad (23)$$
The dual of this problem is a modified $\ell_1$-projection problem. As in the case of $\ell_2$ regularization,
we introduce an equality constrained variable $z = x$ with associated Lagrange multipliers $\alpha \in \mathbb{R}^d$ to
obtain
$$L(x, z, \alpha) = \langle u, x \rangle + \frac{1}{2} \langle x, H x \rangle + \lambda \|z\|_\infty + \langle \alpha, x - z \rangle .$$
Performing identical manipulations to the $\ell_2$ case, we take derivatives and get that $x = -H^{-1}(u + \alpha)$
and, similarly, unless $\|\alpha\|_1 \le \lambda$, $\inf_z L(x, z, \alpha) = -\infty$. Thus the dual problem for (23) is
$$\max_\alpha \; -\frac{1}{2} (u + \alpha)^\top H^{-1} (u + \alpha) \quad \text{s.t.} \quad \|\alpha\|_1 \le \lambda .$$
When $H$ is diagonal we can find the optimal $\alpha^*$ using the generalized $\ell_1$-projection in Algorithm 3,
then reconstruct the optimal $x$ via $x = -H^{-1}(u + \alpha^*)$.
6. Experiments
We performed experiments with several real world data sets with different characteristics: the Im-
ageNet image database (Deng et al., 2009), the Reuters RCV1 text classification data set (Lewis
et al., 2004), the MNIST multiclass digit recognition problem, and the census income data set from
the UCI repository (Asuncion and Newman, 2007). For uniformity across experiments, we focus on
the completely online (fully stochastic) optimization setting, in which at each iteration the learning
algorithm receives a single example. We measure performance using two metrics: the online loss
or error and the test set performance of the predictor the learning algorithm outputs at the end of a
single pass through the training data. We also give some results that show how imposing sparsity
constraints (in the form of $\ell_1$ and mixed-norm regularization) affects the learning algorithm’s
performance. One benefit of the ADAGRAD framework is its ability to straightforwardly generalize to
$$x_{t+1} = \mathrm{argmin}_x \; [1 - y_t \langle z_t, x \rangle]_+ + \frac{\lambda}{2} \|x - x_t\|_2^2 ,$$
where $\lambda$ is a regularization parameter. PA’s update is similar to the update employed by AROW
(see (9)), but the latter maintains second order information on $x$. By using a representer theorem
it is also possible to derive efficient updates for PA and AROW when the loss is the logistic loss,
$\log(1 + \exp(-y_t \langle z_t, x_t \rangle))$. We thus compare the above six algorithms using both hinge and
logistic loss.
adaptive algorithms outperformed the non-adaptive algorithms. Moreover, both ADAGRAD-RDA
and ADAGRAD-FOBOS outperform AROW on all the classification tasks. Unregularized RDA and
FOBOS attained similar results as did the $\ell_1$-regularized variants (of course without sparsity), but
we omit the results to avoid clutter and because they do not give much more understanding.
$$\frac{1}{|\mathrm{Pos}(c)|} \sum_{i=1}^{|\mathrm{Pos}(c)|} \frac{i}{p(i)} .$$
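The quantity above can be sketched in a few lines. This is an illustration under an assumption: the definition of p(i) is not visible in the surviving text, so the code takes p(i) to be the 1-based position of the i-th positive example in the ranked list. The function name and the boolean-list input format are hypothetical.

```python
def average_precision(ranked_is_positive):
    """Average precision of one ranked list.

    ranked_is_positive: iterable of booleans, best-ranked item first.
    Assumes p(i) is the 1-based position of the i-th positive item.
    """
    positions = [rank for rank, is_pos in enumerate(ranked_is_positive, start=1) if is_pos]
    if not positions:
        return 0.0  # convention chosen here for a list with no positives
    # (1 / |Pos|) * sum_i  i / p(i)
    return sum(i / p for i, p in enumerate(positions, start=1)) / len(positions)
```

Under this reading, a list that ranks every positive ahead of every negative scores exactly 1.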
Figure 5: Learning curves on MNIST (x-axis: examples seen, ×10^4; y-axis: mistakes; curves: PA,
Ada RDA, RDA, Ada RDA L1/L2, RDA L1/L2).
We compute the mean of each measurement across all classes, performing this twelve times for
each of the sets of rankers trained. Table 2 summarizes our results. We do not report variance as the
variance was on the order of $10^{-5}$ for each algorithm. One apparent characteristic to note from the
table is that ADAGRAD RDA achieves higher levels of sparsity than the other algorithms: using
only 73% of the input features it achieves very high performance. Moreover, it outperforms all the
algorithms in average precision. AROW has better results than the other algorithms in terms of
precision-at-k for k ≤ 10, though ADAGRAD’s performance catches up to and eventually surpasses
AROW’s as k grows.
Table 3: Test set error rates and sparsity proportions on MNIST. The scalar λ is the multiplier on
the ℓ1 /ℓ2 regularization term.
is similar). From the curves, we see that Adaptive RDA seems to have similar performance to PA,
and the adaptive versions of RDA are vastly superior to their non-adaptive counterparts. Table 3
further supports this, where we see that the adaptive RDA algorithms outperform their non-adaptive
counterparts both in terms of sparsity (the proportion of non-zero rows) and test set error rates.
Figure 6: Test set error rates as function of proportion of training data seen on Census Income data
set (x-axis: proportion of training data; y-axis: test error rate; curves: AROW, PA, RDA, Ada RDA).
Table 4: Test set error rates as function of proportion of training data seen (proportion of non-zeros
in parenthesis where appropriate) on Census Income data set.
ADAGRAD with ℓ1-regularization, and we sweep the regularization multiplier λ from $10^{-8}$ to $10^{-1}$.
These values result in predictors ranging from a completely dense predictor to an all-zeros predictor,
respectively.
We summarize our results in Figure 7, which shows the test set performance of ADAGRAD
for each of the four categories ECAT, CCAT, GCAT, and MCAT. Within each plot, the horizontal
black line labeled AROW designates the baseline performance of AROW on the text classification
task, though we would like to note that AROW generates fully dense predictors. The plots all
portray a similar story. With high regularization values, ADAGRAD exhibits, as expected, poor
performance as it retains no predictive information from the learning task. Put another way, when
the regularization value is high ADAGRAD is confined to an overly sparse predictor which exhibits
poor generalization. However, as the regularization multiplier λ decreases, the learned predictor
becomes less sparse and eventually the accuracy of ADAGRAD exceeds AROW’s accuracy. It is
interesting to note that for these experiments, as soon as the predictor resulting from a single pass
Figure 7: Test set error rates as a function of proportion of non-zeros in predictor x output by
ADAGRAD (AROW plotted for reference). Panels: ECAT, CCAT, GCAT, MCAT; x-axis: proportion
non-zero (log scale); y-axis: test-set error rate.
through the data has more than 1% non-zero coefficients, ADAGRAD’s performance matches that of
AROW. We also would like to note that the variance in the test-set error rates for these experiments
is on the order of $10^{-6}$, and we thus do not draw error bars in the graphs. The performance of
ADAGRAD as a function of regularization for other sparse data sets, especially in relation to that of
AROW, was qualitatively similar to this experiment.
7. Conclusions
We presented a paradigm that adapts subgradient methods to the geometry of the problem at hand.
The adaptation allows us to derive strong regret guarantees, which for some natural data distributions
achieve better performance guarantees than previous algorithms. Our online regret bounds can be
naturally converted into rate of convergence and generalization bounds (Cesa-Bianchi et al., 2004).
Our experiments show that adaptive methods, specifically ADAGRAD-FOBOS, ADAGRAD-RDA,
and AROW clearly outperform their non-adaptive counterparts. Furthermore, the ADAGRAD family
of algorithms naturally incorporates regularization and gives very sparse solutions with similar
performance to dense solutions. Our experiments with adaptive methods use a diagonal approximation
to the matrix obtained by taking outer products of subgradients computed along the run of the
algorithm. It remains to be tested whether using the full outer product matrix can further improve
performance.
To conclude we would like to underscore a possible elegant generalization that interpolates
between full-matrix proximal functions and diagonal approximations using block diagonal matrices.
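Specifically, for $v \in \mathbb{R}^d$ let $v = [v_{[1]}^\top \cdots v_{[k]}^\top]^\top$ where $v_{[i]} \in \mathbb{R}^{d_i}$ are subvectors of v with $\sum_{i=1}^{k} d_i = d$.
We can define the associated block-diagonal approximation to the outer product matrix $\sum_{\tau=1}^{t} g_\tau g_\tau^\top$
by
$$G_t = \sum_{\tau=1}^{t} \begin{bmatrix} g_{\tau,[1]} g_{\tau,[1]}^\top & 0 & \cdots & 0 \\ 0 & g_{\tau,[2]} g_{\tau,[2]}^\top & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & g_{\tau,[k]} g_{\tau,[k]}^\top \end{bmatrix} .$$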
Corollary 12 Let $G_t$ be the block-diagonal outer product matrix defined above and the sequence
$\{x_t\}$ be defined by the RDA update of (3) with $\psi_t(x) = \langle x, G_t^{1/2} x \rangle$. Then, for any $x^* \in \mathcal{X}$,
$$R_\phi(T) \le \frac{1}{\eta} \max_i \|x^*_{[i]}\|_2^2 \, \operatorname{tr}(G_T^{1/2}) + \eta \operatorname{tr}(G_T^{1/2}) .$$
A similar bound holds for composite mirror-descent updates, and it is straightforward to get infimal
equalities similar to those in Corollary 11 with the infimum taken over block-diagonal matrices.
Such an algorithm can interpolate between the computational simplicity of the diagonal proximal
functions and the ability of full matrices to capture correlation in the gradient vectors.
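To make the interpolation concrete, here is a minimal NumPy sketch of the block-diagonal matrix $G_t$. It is an illustration only; the function name `block_diag_outer_sum` and the convention of passing the subgradients as rows of an array are assumptions, not part of the paper.

```python
import numpy as np

def block_diag_outer_sum(grads, block_sizes):
    """Block-diagonal approximation to sum_tau g_tau g_tau^T.

    grads: array of shape (t, d), one subgradient per row.
    block_sizes: the sizes d_1, ..., d_k, which must sum to d.
    """
    grads = np.asarray(grads, dtype=float)
    d = grads.shape[1]
    if sum(block_sizes) != d:
        raise ValueError("block sizes must sum to the dimension d")
    G = np.zeros((d, d))
    start = 0
    for size in block_sizes:
        stop = start + size
        block = grads[:, start:stop]          # rows are g_{tau,[i]}^T
        G[start:stop, start:stop] = block.T @ block
        start = stop
    return G
```

With a single block of size d this reproduces the full outer product matrix, and with d blocks of size 1 it reduces to the diagonal approximation used in the experiments.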
A few open questions stem from this line of research. The first is whether we can efficiently
use full matrices in the proximal functions, as in Section 4. A second open issue is whether non-
Euclidean proximal functions, such as the relative entropy, can be used. We also think that the
strongly convex case—when ft or ϕ is strongly convex—presents interesting challenges that we
have not completely resolved. We hope to investigate both empirical and formal extensions of this
work in the near future.
Acknowledgments
There are many people to whom we owe our sincere thanks for this research. Fernando Pereira
helped push us in the direction of working on adaptive online methods and has been a constant
source of discussion and helpful feedback. Samy Bengio provided us with a processed version of
the ImageNet data set and was instrumental in helping to get our experiments running, and Adam
Sadovsky gave many indispensable coding suggestions. The anonymous reviewers also gave several
suggestions that improved the quality of the paper. Lastly, Sam Roweis was a sounding board for
some of our earlier ideas on the subject, and we will miss him dearly.
$$x_2 = x_1 + G_1^\dagger v_1 = x_1 + v_1 v_1^\top v_1 = x_1 + v_1 .$$
Since $\langle x_2, v_1 \rangle = 1$, we see that ADAGRAD suffers no loss (and $G_t = G_1$) until a vector $z_t = \pm v_i$ for
$i \ne 1$ is played by the adversary. However, an identical argument shows that $G_t$ is simply updated
to $v_1 v_1^\top + v_i v_i^\top$, in which case $x_t = v_1 + v_i$. Indeed, an inductive argument shows that until all the
vectors $v_i$ are seen, we have $\|x_t\|_2 < \sqrt{d}$ by orthogonality, and eventually we have
$$x_t = \sum_{i=1}^{d} v_i \quad \text{and} \quad \|x_t\|_2 = \sqrt{\sum_{i=1}^{d} \|v_i\|_2^2} = \sqrt{d}$$
so that $x_t \in \mathcal{X} = \{x : \|x\|_2 \le \sqrt{d}\}$ for ADAGRAD for all t. All future predictions thus achieve margin
1 and suffer no loss.
Proof This is Example 3 of Davis (1963). We include a proof for convenience of the reader.
Let λ be any eigenvalue (with corresponding eigenvector x) of $A^{1/2} - B^{1/2}$; we show that λ ≥ 0.
Clearly $A^{1/2}x - \lambda x = B^{1/2}x$. Taking the inner product of both sides with $A^{1/2}x$ and applying the
Cauchy-Schwarz inequality to the right-hand side, we have
$$\|A^{1/2}x\|_2^2 - \lambda \langle A^{1/2}x, x \rangle = \langle A^{1/2}x, B^{1/2}x \rangle \le \|A^{1/2}x\|_2 \|B^{1/2}x\|_2 = \sqrt{\langle Ax, x\rangle \langle Bx, x\rangle} \le \langle Ax, x\rangle = \|A^{1/2}x\|_2^2 ,$$
where the last inequality follows from the assumption that $A \succeq B$. Thus we must have $\lambda \langle A^{1/2}x, x\rangle \ge 0$,
which implies λ ≥ 0.
The gradient of the function $\operatorname{tr}(X^p)$ is easy to compute for integer values of p. However, when p is
real we need the following lemma. The lemma tacitly uses the fact that there is a unique positive
semidefinite $X^p$ when $X \succeq 0$ (Horn and Johnson, 1985, Theorem 7.2.6).
In the above, o(A) is a matrix that goes to zero faster than A → 0, and the second line follows via a
first-order Taylor expansion of $(1 + d_i)^p$. From the above, we immediately have
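where we define 0/0 = 0. We use induction on T to prove inequality (24). For T = 1, the inequality
trivially holds. Assume the bound (24) holds true for T − 1, in which case
$$\sum_{t=1}^{T} \frac{a_t^2}{\|a_{1:t}\|_2} = \sum_{t=1}^{T-1} \frac{a_t^2}{\|a_{1:t}\|_2} + \frac{a_T^2}{\|a_{1:T}\|_2} \le 2\|a_{1:T-1}\|_2 + \frac{a_T^2}{\|a_{1:T}\|_2} .$$
Writing $b_T = \|a_{1:T}\|_2^2 = \sum_{t=1}^{T} a_t^2$, the concavity of the square root (which gives
$\sqrt{b_T - a_T^2} \le \sqrt{b_T} - a_T^2 / (2\sqrt{b_T})$) yields
$$2\|a_{1:T-1}\|_2 + \frac{a_T^2}{\|a_{1:T}\|_2} = 2\sqrt{b_T - a_T^2} + \frac{a_T^2}{\sqrt{b_T}} \le 2\sqrt{b_T} = 2\|a_{1:T}\|_2 .$$
Having proved the bound (24), we note that by construction $s_{t,i} = \|g_{1:t,i}\|_2$, so
$$\sum_{t=1}^{T} \langle g_t, \operatorname{diag}(s_t)^{-1} g_t \rangle = \sum_{t=1}^{T} \sum_{i=1}^{d} \frac{g_{t,i}^2}{\|g_{1:t,i}\|_2} \le 2 \sum_{i=1}^{d} \|g_{1:T,i}\|_2 .$$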
2. We note that we use an identical technique in the full-matrix case. See Lemma 8.
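The induction above establishes that $\sum_{t=1}^{T} a_t^2 / \|a_{1:t}\|_2 \le 2\|a_{1:T}\|_2$ for any real sequence. A quick numerical sanity check can be sketched as follows; the helper name `lemma_sides` is a hypothetical choice, and the code follows the 0/0 = 0 convention stated in the proof.

```python
import numpy as np

def lemma_sides(a):
    """Return (sum_t a_t^2 / ||a_{1:t}||_2,  2 * ||a_{1:T}||_2) for a sequence a."""
    a = np.asarray(a, dtype=float)
    prefix_norms = np.sqrt(np.cumsum(a ** 2))     # ||a_{1:t}||_2 for t = 1..T
    # Entries with a zero prefix norm contribute 0, i.e. the convention 0/0 = 0.
    terms = np.divide(a ** 2, prefix_norms,
                      out=np.zeros_like(a), where=prefix_norms > 0)
    return float(terms.sum()), float(2.0 * prefix_norms[-1])

# Example: check the inequality on a few random sequences.
rng = np.random.default_rng(0)
for _ in range(5):
    lhs, rhs = lemma_sides(rng.normal(size=50))
    assert lhs <= rhs + 1e-12
```

Applying the same check coordinate by coordinate to a matrix of subgradients corresponds to the final display, where the sequence is $g_{1:T,i}$ for each coordinate i.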
Now suppose simply $A, B \succeq 0$ (but neither is necessarily strict). Then for any δ > 0, we have
$A + \delta I \succ 0$ and $B + \delta I \succ 0$ and therefore
where we used Lemma 13 for the second matrix inequality. Moreover, $\alpha A + (1 - \alpha)B + \delta I \to
\alpha A + (1 - \alpha)B$ as δ → 0. Since $A^p$ is continuous (when we use the unique PSD root), this line of
reasoning proves that (25) holds for $A, B \succeq 0$. Thus, we proved that
Recall now that Lemma 14 implies that the gradient of $\operatorname{tr}(A^{1/2})$ is $\frac{1}{2} A^{-1/2}$ when $A \succ 0$.
Therefore, from the concavity of $A^{1/2}$ and the form of its gradient, we can use the standard first-order
inequality for concave functions so that for any $A, B \succ 0$,
$$\operatorname{tr}(A^{1/2}) \le \operatorname{tr}(B^{1/2}) + \frac{1}{2} \operatorname{tr}(B^{-1/2}(A - B)) . \qquad (26)$$
Let $A = B - \nu gg^\top \succeq 0$ and suppose only that $B \succeq 0$. We must take some care since $B^{-1/2}$ may
not necessarily exist, and the above inequality does not hold true in the pseudo-inverse sense when
$B \not\succ 0$. However, for any δ > 0 we know that $2\nabla_B \operatorname{tr}((B + \delta I)^{1/2}) = (B + \delta I)^{-1/2}$, and $A - B = -\nu gg^\top$.
From (26) and Lemma 13, we have
Note that $g \in \operatorname{Range}(B)$, because if it were not, we could choose some u with Bu = 0 and $\langle g, u \rangle \ne 0$,
which would give $\langle u, (B - cgg^\top)u \rangle = -c \langle g, u \rangle^2 < 0$, a contradiction. Now let $B = V \operatorname{diag}(\lambda) V^\top$ be
Thus, by taking δ ↓ 0 in (27), and since both $\operatorname{tr}((B + \delta I)^{1/2})$ and $\operatorname{tr}((B + \delta I)^{-1/2} gg^\top)$ are evidently
continuous in δ, we complete the proof.
Proof of Lemma 9 We begin by noting that $\delta^2 I \succeq gg^\top$, so from Lemma 13 we get
$(A + gg^\top)^{1/2} \preceq (A + \delta^2 I)^{1/2}$. Since A and I are simultaneously diagonalizable, we can generalize the
inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, which holds for a, b ≥ 0, to positive semi-definite matrices, thus,
$$(A + \delta^2 I)^{1/2} \preceq A^{1/2} + \delta I .$$
Therefore, if $A + gg^\top$ is of full rank, we have $(A + gg^\top)^{-1/2} \succeq (A^{1/2} + \delta I)^{-1}$ (Horn and Johnson,
1985, Corollary 7.7.4(a)). Since $g \in \operatorname{Range}((A + gg^\top)^{1/2})$, we can apply an analogous limiting
argument to the one used in the proof of Lemma 8 and discard all zero eigenvalues of $A + gg^\top$, which
completes the lemma.
Lemma 15 If A is of full rank, then the minimizer of (28) is $S = cA^{1/2} / \operatorname{tr}(A^{1/2})$. If A is not of full rank,
then setting $S = cA^{1/2} / \operatorname{tr}(A^{1/2})$ gives
In either case, $\operatorname{tr}(S^\dagger A) = \operatorname{tr}(A^{1/2})^2 / c$.
Proof Both proofs rely on constructing the Lagrangian for (28). We introduce $\theta \in \mathbb{R}_+$ for the trace
constraint and $Z \succeq 0$ for the positive semidefinite constraint on S. In this case, the Lagrangian is
If S is full rank, then to satisfy the generalized complementarity conditions for the problem (Boyd
and Vandenberghe, 2004), we must have Z = 0. Therefore, we get $S^{-1} A S^{-1} = \theta I$. We now can
multiply by S on the right and the left to get that $A = \theta S^2$, which implies that $S \propto A^{1/2}$. If A is of full
rank, the optimal solution for $S \succ 0$ forces θ to be positive so that tr(S) = c. This yields the solution
$S = cA^{1/2} / \operatorname{tr}(A^{1/2})$. In order to verify optimality of this solution, we set Z = 0 and $\theta = c^{-2} \operatorname{tr}(A^{1/2})^2$
which gives $\nabla_S L(S, \theta, Z) = 0$, as is indeed required.
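Suppose now that A is not full rank and that
$$A = Q \begin{bmatrix} \Lambda & 0 \\ 0 & 0 \end{bmatrix} Q^\top$$
is the eigen-decomposition of A. Let n be the dimension of the null-space of A (so the rank of A is
d − n). Define the variables
$$Z(\theta) = \begin{bmatrix} 0 & 0 \\ 0 & \theta I \end{bmatrix}, \quad S(\theta, \delta) = \frac{1}{\sqrt{\theta}} \, Q \begin{bmatrix} \Lambda^{1/2} & 0 \\ 0 & \delta I \end{bmatrix} Q^\top, \quad S(\delta) = \frac{c}{\operatorname{tr}(A^{1/2}) + \delta n} \, Q \begin{bmatrix} \Lambda^{1/2} & 0 \\ 0 & \delta I \end{bmatrix} Q^\top .$$
Further, let $g(\theta) = \inf_S L(S, \theta, Z(\theta))$ be the dual of (28). From the above analysis and (29), it is
evident that
$$-S(\theta, \delta)^{-1} A S(\theta, \delta)^{-1} + \theta I - Z(\theta) = -\theta Q \begin{bmatrix} \Lambda^{-1/2} \Lambda \Lambda^{-1/2} & 0 \\ 0 & \delta^{-2} I \cdot 0 \end{bmatrix} Q^\top + \theta I - \begin{bmatrix} 0 & 0 \\ 0 & \theta I \end{bmatrix} = 0 .$$
So $S(\theta, \delta)$ achieves the infimum in the dual for any δ > 0, $\operatorname{tr}(S(0) Z(\theta)) = 0$, and
$$g(\theta) = \sqrt{\theta} \operatorname{tr}(\Lambda^{1/2}) + \sqrt{\theta} \operatorname{tr}(\Lambda^{1/2}) + \sqrt{\theta} \delta n - \theta c .$$
Setting $\theta = \operatorname{tr}(\Lambda^{1/2})^2 / c^2$ gives $g(\theta) = \operatorname{tr}(\Lambda^{1/2})^2 / c - \delta n \operatorname{tr}(\Lambda^{1/2}) / c$. Taking δ → 0 gives $g(\theta) = \operatorname{tr}(A^{1/2})^2 / c$,
which means that $\lim_{\delta \to 0} \operatorname{tr}(S(\delta)^{-1} A) = \operatorname{tr}(A^{1/2})^2 / c = g(\theta)$. Thus the duality gap for the original
problem is 0 so S(0) is the limiting solution.
The last statement of the lemma is simply plugging $S^\dagger = (A^\dagger)^{1/2} \operatorname{tr}(A^{1/2}) / c$ in to the objective being
minimized.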
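The full-rank case of Lemma 15 can be checked numerically. The sketch below rests on an assumption: problem (28) is not displayed in the surviving text, so the code takes it to be the minimization of $\operatorname{tr}(S^{-1}A)$ over $S \succeq 0$ with $\operatorname{tr}(S) \le c$, as the Lagrangian discussion suggests. All function names are illustrative.

```python
import numpy as np

def psd_sqrt(A):
    """Unique PSD square root of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def lemma15_candidate(A, c):
    """The candidate minimizer S = c * A^{1/2} / tr(A^{1/2})."""
    root = psd_sqrt(A)
    return c * root / np.trace(root)

def objective(S, A):
    """tr(S^{-1} A), the quantity assumed to be minimized in (28)."""
    return float(np.trace(np.linalg.solve(S, A)))

# Example: the candidate attains tr(A^{1/2})^2 / c.
A = np.diag([1.0, 4.0])
c = 3.0
S = lemma15_candidate(A, c)
assert abs(objective(S, A) - np.trace(psd_sqrt(A)) ** 2 / c) < 1e-9
```

Any other feasible S, for instance a multiple of the identity with trace c, should give an objective value that is at least as large.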
Since $\psi_t / \eta$ is 1/η-strongly convex with respect to the norm $\|\cdot\|_{\psi_t}$, the function $\psi_t^*$ has η-Lipschitz
continuous gradients with respect to $\|\cdot\|_{\psi_t^*}$:
for any $g_1, g_2$ (see, e.g., Nesterov, 2005, Theorem 1 or Hiriart-Urruty and Lemaréchal, 1996,
Chapter X). Further, a simple argument with the fundamental theorem of calculus gives that if f has
L-Lipschitz gradients, $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + (L/2) \|y - x\|^2$, and
$$\nabla \psi_t^*(g) = \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ -\langle g, x \rangle + t\varphi(x) + \frac{1}{\eta} \psi_t(x) \right\} . \qquad (31)$$
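Using the bound (30) and identity (31), we can give the proof of the corollary. Indeed, letting
$g_t \in \partial f_t(x_t)$ and defining $z_t = \sum_{\tau=1}^{t} g_\tau$, we have
$$\begin{aligned}
\sum_{t=1}^{T} & f_t(x_t) + \varphi(x_t) - f_t(x^*) - \varphi(x^*) \\
& \le \sum_{t=1}^{T} \langle g_t, x_t - x^* \rangle - \varphi(x^*) + \varphi(x_t) \\
& \le \sum_{t=1}^{T} \langle g_t, x_t \rangle + \varphi(x_t) + \sup_{x \in \mathcal{X}} \left\{ -\sum_{t=1}^{T} \langle g_t, x \rangle - T\varphi(x) - \frac{1}{\eta} \psi_T(x) \right\} + \frac{1}{\eta} \psi_T(x^*) \\
& = \frac{1}{\eta} \psi_T(x^*) + \sum_{t=1}^{T} \langle g_t, x_t \rangle + \varphi(x_t) + \psi_T^*(-z_T) .
\end{aligned}$$
The Lipschitz continuity of $\nabla \psi_t^*$, the identity (31), and the fact that $z_T - z_{T-1} = -g_T$ give
$$\begin{aligned}
\sum_{t=1}^{T} & f_t(x_t) + \varphi(x_{t+1}) - f_t(x^*) - \varphi(x^*) \\
& \le \frac{1}{\eta} \psi_T(x^*) + \sum_{t=1}^{T} \langle g_t, x_t \rangle + \varphi(x_{t+1}) + \psi_{T-1}^*(-z_T) - \varphi(x_{T+1}) \\
& \le \frac{1}{\eta} \psi_T(x^*) + \sum_{t=1}^{T} \langle g_t, x_t \rangle + \varphi(x_{t+1}) - \varphi(x_{T+1}) \\
& \qquad + \psi_{T-1}^*(-z_{T-1}) - \langle \nabla \psi_{T-1}^*(z_{T-1}), g_T \rangle + \frac{\eta}{2} \|g_T\|_{\psi_{T-1}^*}^2 \\
& = \frac{1}{\eta} \psi_T(x^*) + \sum_{t=1}^{T-1} \langle g_t, x_t \rangle + \varphi(x_{t+1}) + \psi_{T-1}^*(-z_{T-1}) + \frac{\eta}{2} \|g_T\|_{\psi_{T-1}^*}^2 .
\end{aligned}$$
We can repeat the same sequence of steps that gave the last equality to see that
$$\sum_{t=1}^{T} f_t(x_t) + \varphi(x_{t+1}) - f_t(x^*) - \varphi(x^*) \le \frac{1}{\eta} \psi_T(x^*) + \frac{\eta}{2} \sum_{t=1}^{T} \|g_t\|_{\psi_{t-1}^*}^2 + \psi_0^*(-z_0) .$$
Recalling that $x_1 = \operatorname{argmin}_{x \in \mathcal{X}} \{\varphi(x)\}$ and that $\psi_0^*(0) = 0$ completes the proof.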
We now turn to the proof of Proposition 3. We begin by stating and fully proving an (essentially)
immediate corollary to Lemma 2.3 of Duchi et al. (2010).
Lemma 16 Let $\{x_t\}$ be the sequence defined by the update (4) and assume that $B_{\psi_t}(\cdot, \cdot)$ is strongly
convex with respect to a norm $\|\cdot\|_{\psi_t}$. Let $\|\cdot\|_{\psi_t^*}$ be the associated dual norm. Then for any $x^*$,
$$\eta (f_t(x_t) - f_t(x^*)) + \eta (\varphi(x_{t+1}) - \varphi(x^*)) \le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{\eta^2}{2} \|f_t'(x_t)\|_{\psi_t^*}^2 .$$
Proof The optimality of $x_{t+1}$ for (4) implies for all $x \in \mathcal{X}$ and $\varphi'(x_{t+1}) \in \partial\varphi(x_{t+1})$
In particular, this obtains for $x = x^*$. From the subgradient inequality for convex functions, we have
$f_t(x^*) \ge f_t(x_t) + \langle f_t'(x_t), x^* - x_t \rangle$, or $f_t(x_t) - f_t(x^*) \le \langle f_t'(x_t), x_t - x^* \rangle$, and likewise for $\varphi(x_{t+1})$. We
thus have
Now, by (32), the first term in the last equation is non-positive. Thus we have that
$$\begin{aligned}
& = B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x_{t+1}, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \eta \langle x_t - x_{t+1}, f_t'(x_t) \rangle \\
& = B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x_{t+1}, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \eta \langle \eta^{-1/2} (x_t - x_{t+1}), \sqrt{\eta} f_t'(x_t) \rangle \\
& \le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x_{t+1}, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{1}{2} \|x_t - x_{t+1}\|_{\psi_t}^2 + \frac{\eta^2}{2} \|f_t'(x_t)\|_{\psi_t^*}^2 \\
& \le B_{\psi_t}(x^*, x_t) - B_{\psi_t}(x^*, x_{t+1}) + \frac{\eta^2}{2} \|f_t'(x_t)\|_{\psi_t^*}^2 .
\end{aligned}$$
In the above, the first equality follows from simple algebra with Bregman divergences, the second
to last inequality follows from Fenchel’s inequality applied to the conjugate functions $\frac{1}{2} \|\cdot\|_{\psi_t}^2$ and
$\frac{1}{2} \|\cdot\|_{\psi_t^*}^2$ (Boyd and Vandenberghe, 2004, Example 3.27), and the last inequality follows from the
assumed strong convexity of $B_{\psi_t}$ with respect to the norm $\|\cdot\|_{\psi_t}$.
G.1 ℓ1-regularization
We give the derivation for the primal-dual subgradient update, as composite mirror-descent is
entirely similar. We need to solve update (3), which amounts to
$$\min_x \; \eta \langle \bar{g}_t, x \rangle + \frac{1}{2t} \delta \|x\|_2^2 + \frac{1}{2t} \langle x, \operatorname{diag}(s_t) x \rangle + \eta\lambda \|x\|_1 .$$
Let $\hat{x}$ denote the optimal solution of the above optimization problem. Standard subgradient calculus
implies that when $|\bar{g}_{t,i}| \le \lambda$ the solution is $\hat{x}_i = 0$. Similarly, when $\bar{g}_{t,i} < -\lambda$, then $\hat{x}_i > 0$, the
objective is differentiable, and the solution is obtained by setting the gradient to zero:
$$\eta \bar{g}_{t,i} + \frac{H_{t,ii}}{t} \hat{x}_i + \eta\lambda = 0 , \quad \text{so that} \quad \hat{x}_i = \frac{\eta t}{H_{t,ii}} (-\bar{g}_{t,i} - \lambda) .$$
Likewise, when $\bar{g}_{t,i} > \lambda$ then $\hat{x}_i < 0$, and the solution is $\hat{x}_i = \frac{\eta t}{H_{t,ii}} (-\bar{g}_{t,i} + \lambda)$. Combining the three
cases, we obtain the simple update (19) for $x_{t+1,i}$.
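The three cases collapse into a single soft-thresholding expression, $\hat{x}_i = -\operatorname{sign}(\bar{g}_{t,i}) \frac{\eta t}{H_{t,ii}} [|\bar{g}_{t,i}| - \lambda]_+$. The following is a minimal sketch of that per-coordinate update, not the authors' implementation. It assumes $H_{t,ii} = \delta + s_{t,i}$, which is what differentiating the objective above gives; the function and argument names are hypothetical.

```python
import numpy as np

def l1_primal_dual_step(g_bar, s, t, eta, lam, delta):
    """Closed-form solution of the l1-regularized primal-dual update.

    g_bar: average subgradient (vector), s: per-coordinate norms s_t,
    t: iteration count, eta: step size, lam: l1 multiplier, delta: ridge term.
    Assumes H_{t,ii} = delta + s_{t,i} (read off the quadratic terms above).
    """
    g_bar = np.asarray(g_bar, dtype=float)
    H_diag = delta + np.asarray(s, dtype=float)
    # [|g_bar| - lam]_+ is zero exactly when |g_bar_i| <= lam (first case).
    shrunk = np.maximum(np.abs(g_bar) - lam, 0.0)
    # The sign flip covers the remaining two cases in one expression.
    return -np.sign(g_bar) * (eta * t / H_diag) * shrunk
```

Coordinates whose average subgradient stays within [−λ, λ] are set exactly to zero, which is the source of the sparsity discussed in the experiments.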
$$z_i^* = \begin{cases} v_i - \theta^* a_i & \text{if } v_i \ge \theta^* a_i \\ 0 & \text{otherwise.} \end{cases}$$
Conversely, had we obtained a value θ ≥ 0 satisfying the above equation, then θ would evidently
induce the optimal $z^*$ through the equation $z_i = [v_i - \theta a_i]_+$.
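Now, let ρ be the largest index in {1, . . . , d} such that $v_i - \theta^* a_i > 0$ for i ≤ ρ and $v_i - \theta^* a_i \le 0$
for i > ρ. From the assumption that $v_i / a_i \ge v_{i+1} / a_{i+1}$ (the entries are sorted in descending order
of $v_i / a_i$), we have $v_{\rho+1} / a_{\rho+1} \le \theta^* < v_\rho / a_\rho$. Thus, had
we known the last non-zero index ρ, we would have obtained
$$\sum_{i=1}^{\rho} a_i v_i - \frac{v_\rho}{a_\rho} \sum_{i=1}^{\rho} a_i^2 = \sum_{i=1}^{\rho} a_i^2 \left( \frac{v_i}{a_i} - \frac{v_\rho}{a_\rho} \right) < c ,$$
$$\sum_{i=1}^{\rho} a_i v_i - \frac{v_{\rho+1}}{a_{\rho+1}} \sum_{i=1}^{\rho} a_i^2 = \sum_{i=1}^{\rho+1} a_i^2 \left( \frac{v_i}{a_i} - \frac{v_{\rho+1}}{a_{\rho+1}} \right) \ge c .$$
Given ρ satisfying the above inequalities, we can reconstruct the optimal $\theta^*$ by noting that the latter
inequality should equal c exactly when we replace $v_\rho / a_\rho$ with θ, that is,
$$\theta^* = \frac{\sum_{i=1}^{\rho} a_i v_i - c}{\sum_{i=1}^{\rho} a_i^2} . \qquad (33)$$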
The above derivation results in the following procedure (when $\langle a, v \rangle > c$). We sort v in descending
order of $v_i / a_i$ and find the largest index ρ such that $\sum_{i=1}^{\rho} a_i v_i - (v_\rho / a_\rho) \sum_{i=1}^{\rho} a_i^2 < c$. We then
reconstruct $\theta^*$ using equality (33) and return the soft-thresholded values of $v_i$ (see Algorithm 3). It
is easy to verify that the algorithm can be implemented in O(d log d) time. A randomized search
with bookkeeping (Pardalos and Rosen, 1990) can be straightforwardly used to derive a linear time
algorithm.
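The sort-based procedure can be sketched as below. This is an illustrative O(d log d) version written from the derivation in this section, not a transcription of Algorithm 3. It assumes nonnegative entries of v, strictly positive entries of a, and c > 0, and it uses the index test from the first displayed inequality. The function name is hypothetical.

```python
import numpy as np

def generalized_projection(v, a, c):
    """Return (z, theta) with z_i = [v_i - theta * a_i]_+ and <a, z> = c.

    Assumes v >= 0, a > 0 and c > 0. If <a, v> <= c, v is already feasible
    and is returned unchanged with theta = 0.
    """
    v = np.asarray(v, dtype=float)
    a = np.asarray(a, dtype=float)
    if float(a @ v) <= c:
        return v.copy(), 0.0
    order = np.argsort(-v / a)                 # descending order of v_i / a_i
    vs, as_ = v[order], a[order]
    cum_av = np.cumsum(as_ * vs)               # sum_{i<=rho} a_i v_i
    cum_aa = np.cumsum(as_ ** 2)               # sum_{i<=rho} a_i^2
    # Largest (0-based) index rho whose test value stays strictly below c.
    feasible = cum_av - (vs / as_) * cum_aa < c
    rho = int(np.nonzero(feasible)[0].max())
    theta = (cum_av[rho] - c) / cum_aa[rho]    # equality (33)
    z = np.maximum(v - theta * a, 0.0)         # soft-thresholded values
    return z, float(theta)
```

With a equal to the all-ones vector this reduces to the familiar Euclidean projection onto a scaled simplex.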
References
J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower
bounds for online convex games. In Proceedings of the Twenty First Annual Conference on
Computational Learning Theory, 2008.
T. Ando. Concavity of certain maps on positive definite matrices and applications to Hadamard
products. Linear Algebra and its Applications, 26:203–241, 1979.
A. Asuncion and D. J. Newman. UCI machine learning repository, 2007. URL
http://www.ics.uci.edu/~mlearn/MLRepository.html.
P. Auer and C. Gentile. Adaptive and self-confident online learning algorithms. In Proceedings of
the Thirteenth Annual Conference on Computational Learning Theory, 2000.
P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural
Information Processing Systems 20, 2007.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex
optimization. Operations Research Letters, 31:167–175, 2003.
A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent.
Journal of Machine Learning Research, 10:1737–1754, 2009.
P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3
(3):163–166, 1984.
N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with
expert advice. Machine Learning, 66:321–352, 2007.
C. Davis. Notions generalizing convexity for functions defined on spaces of matrices. In Proceed-
ings of the Symposia in Pure Mathematics, volume 7, pages 187–201. American Mathematical
Society, 1963.
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchi-
cal image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2009.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal
of Machine Learning Research, 10:2873–2908, 2009.
R. Fletcher. A new approach to variable metric algorithms. Computer Journal, 13:317–322, 1970.
D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1371–1384, 2008.
E. Hazan and S. Kale. Extracting certainty from uncertainty: regret bounded by variation in costs.
In Proceedings of the Twenty First Annual Conference on Computational Learning Theory, 2008.
E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex
optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning
Theory, 2006.
J. B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II. Springer-
Verlag, 1996.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Pro-
ceedings of the Twentieth International Conference on Machine Learning, 2003.
Journal of Machine Learning Research 15 (2014) 1929-1958 Submitted 11/13; Published 6/14
Abstract
Deep neural nets with a large number of parameters are very powerful machine learning
systems. However, overfitting is a serious problem in such networks. Large networks are also
slow to use, making it difficult to deal with overfitting by combining the predictions of many
different large neural nets at test time. Dropout is a technique for addressing this problem.
The key idea is to randomly drop units (along with their connections) from the neural
network during training. This prevents units from co-adapting too much. During training,
dropout samples from an exponential number of different “thinned” networks. At test time,
it is easy to approximate the effect of averaging the predictions of all these thinned networks
by simply using a single unthinned network that has smaller weights. This significantly
reduces overfitting and gives major improvements over other regularization methods. We
show that dropout improves the performance of neural networks on supervised learning
tasks in vision, speech recognition, document classification and computational biology,
obtaining state-of-the-art results on many benchmark data sets.
Keywords: neural networks, regularization, model combination, deep learning
1. Introduction
Deep neural networks contain multiple non-linear hidden layers and this makes them very
expressive models that can learn very complicated relationships between their inputs and
outputs. With limited training data, however, many of these complicated relationships
will be the result of sampling noise, so they will exist in the training set but not in real
test data even if it is drawn from the same distribution. This leads to overfitting and many
methods have been developed for reducing it. These include stopping the training as soon as
performance on a validation set starts to get worse, introducing weight penalties of various
kinds such as L1 and L2 regularization and soft weight sharing (Nowlan and Hinton, 1992).
With unlimited computation, the best way to “regularize” a fixed-sized model is to
average the predictions of all possible settings of the parameters, weighting each setting by
© 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov
Figure 1: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right:
An example of a thinned net produced by applying dropout to the network on the left.
Crossed units have been dropped.
its posterior probability given the training data. This can sometimes be approximated quite
well for simple or small models (Xiong et al., 2011; Salakhutdinov and Mnih, 2008), but we
would like to approach the performance of the Bayesian gold standard using considerably
less computation. We propose to do this by approximating an equally weighted geometric
mean of the predictions of an exponential number of learned models that share parameters.
Model combination nearly always improves the performance of machine learning meth-
ods. With large neural networks, however, the obvious idea of averaging the outputs of
many separately trained nets is prohibitively expensive. Combining several models is most
helpful when the individual models are different from each other and in order to make
neural net models different, they should either have different architectures or be trained
on different data. Training many different architectures is hard because finding optimal
hyperparameters for each architecture is a daunting task and training each large network
requires a lot of computation. Moreover, large networks normally require large amounts of
training data and there may not be enough data available to train different networks on
different subsets of the data. Even if one was able to train many different large networks,
using them all at test time is infeasible in applications where it is important to respond
quickly.
Dropout is a technique that addresses both these issues. It prevents overfitting and
provides a way of approximately combining exponentially many different neural network
architectures efficiently. The term “dropout” refers to dropping out units (hidden and
visible) in a neural network. By dropping a unit out, we mean temporarily removing it from
the network, along with all its incoming and outgoing connections, as shown in Figure 1.
The choice of which units to drop is random. In the simplest case, each unit is retained with
a fixed probability p independent of other units, where p can be chosen using a validation
set or can simply be set at 0.5, which seems to be close to optimal for a wide range of
networks and tasks. For the input units, however, the optimal probability of retention is
usually closer to 1 than to 0.5.
Dropout
[Figure 2 panels: (a) At training time; (b) At test time.]
Figure 2: Left: A unit at training time that is present with probability p and is connected to units
          in the next layer with weights w. Right: At test time, the unit is always present and
          the weights are multiplied by p. The output at test time is the same as the expected output
          at training time.
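The relationship between the two panels can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions, not the authors' training code: it covers a single linear layer, the function names are hypothetical, and p denotes the retention probability as in the text.

```python
import numpy as np

def dropout_train(y, p, rng):
    """Training-time dropout: keep each unit independently with probability p."""
    mask = rng.random(y.shape) < p     # True means the unit is retained
    return y * mask

def scale_weights_for_test(W, p):
    """Test-time rule: use all units, with outgoing weights multiplied by p."""
    return p * W

# Example: for a linear layer, the average training-time output approaches
# the output of the weight-scaled test-time network.
rng = np.random.default_rng(0)
W = np.array([[1.0, 2.0, 3.0, 4.0]])
y = np.ones((20000, 4))                        # 20000 copies of the same activations
train_mean = (dropout_train(y, 0.5, rng) @ W.T).mean()
test_out = float((scale_weights_for_test(W, 0.5) @ np.ones(4))[0])
print(train_mean, test_out)
```

For a purely linear layer the match is exact in expectation; with non-linear units stacked on top, the scaled network is an approximation to averaging the thinned networks, as the abstract describes.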
for training dropout nets. This includes a detailed analysis of the practical considerations
involved in choosing hyperparameters when training dropout networks.
2. Motivation
A motivation for dropout comes from a theory of the role of sex in evolution (Livnat et al.,
2010). Sexual reproduction involves taking half the genes of one parent and half of the
other, adding a very small amount of random mutation, and combining them to produce an
offspring. The asexual alternative is to create an offspring with a slightly mutated copy of
the parent’s genes. It seems plausible that asexual reproduction should be a better way to
optimize individual fitness because a good set of genes that have come to work well together
can be passed on directly to the offspring. On the other hand, sexual reproduction is likely
to break up these co-adapted sets of genes, especially if these sets are large and, intuitively,
this should decrease the fitness of organisms that have already evolved complicated co-
adaptations. However, sexual reproduction is the way most advanced organisms evolved.
One possible explanation for the superiority of sexual reproduction is that, over the long
term, the criterion for natural selection may not be individual fitness but rather mix-ability
of genes. The ability of a set of genes to be able to work well with another random set of
genes makes them more robust. Since a gene cannot rely on a large set of partners to be
present at all times, it must learn to do something useful on its own or in collaboration with
a small number of other genes. According to this theory, the role of sexual reproduction
is not just to allow useful new genes to spread throughout the population, but also to
facilitate this process by reducing complex co-adaptations that would reduce the chance of
a new gene improving the fitness of an individual. Similarly, each hidden unit in a neural
network trained with dropout must learn to work with a randomly chosen sample of other
units. This should make each hidden unit more robust and drive it towards creating useful
features on its own without relying on other hidden units to correct its mistakes. However,
the hidden units within a layer will still learn to do different things from each other. One
might imagine that the net would become robust against dropout by making many copies
of each hidden unit, but this is a poor solution for exactly the same reason as replica codes
are a poor way to deal with a noisy channel.
A closely related, but slightly different motivation for dropout comes from thinking
about successful conspiracies. Ten conspiracies each involving five people is probably a
better way to create havoc than one big conspiracy that requires fifty people to all play
their parts correctly. If conditions do not change and there is plenty of time for rehearsal, a
big conspiracy can work well, but with non-stationary conditions, the smaller the conspiracy
the greater its chance of still working. Complex co-adaptations can be trained to work well
on a training set, but on novel test data they are far more likely to fail than multiple simpler
co-adaptations that achieve the same thing.
3. Related Work
Dropout can be interpreted as a way of regularizing a neural network by adding noise to
its hidden units. The idea of adding noise to the states of units has previously been used in
the context of Denoising Autoencoders (DAEs) by Vincent et al. (2008, 2010) where noise
is added to the input units of an autoencoder and the network is trained to reconstruct the
noise-free input. Our work extends this idea by showing that dropout can be effectively
applied in the hidden layers as well and that it can be interpreted as a form of model
averaging. We also show that adding noise is not only useful for unsupervised feature
learning but can also be extended to supervised learning problems. In fact, our method can
be applied to other neuron-based architectures, for example, Boltzmann Machines. While
5% noise typically works best for DAEs, we found that our weight scaling procedure applied
at test time enables us to use much higher noise levels. Dropping out 20% of the input units
and 50% of the hidden units was often found to be optimal.
Since dropout can be seen as a stochastic regularization technique, it is natural to
consider its deterministic counterpart which is obtained by marginalizing out the noise. In
this paper, we show that, in simple cases, dropout can be analytically marginalized out
to obtain deterministic regularization methods. Recently, van der Maaten et al. (2013)
also explored deterministic regularizers corresponding to different exponential-family noise
distributions, including dropout (which they refer to as “blankout noise”). However, they
apply noise to the inputs and only explore models with no hidden layers. Wang and Manning
(2013) proposed a method for speeding up dropout by marginalizing dropout noise. Chen
et al. (2012) explored marginalization in the context of denoising autoencoders.
In dropout, we minimize the loss function stochastically under a noise distribution.
This can be seen as minimizing an expected loss function. Previous work of Globerson and
Roweis (2006); Dekel et al. (2010) explored an alternate setting where the loss is minimized
when an adversary gets to pick which units to drop. Here, instead of a noise distribution,
the maximum number of units that can be dropped is fixed. However, this work also does
not explore models with hidden units.
4. Model Description
This section describes the dropout neural network model. Consider a neural network with
$L$ hidden layers. Let $l \in \{1, \ldots, L\}$ index the hidden layers of the network. Let $z^{(l)}$ denote
the vector of inputs into layer $l$, $y^{(l)}$ denote the vector of outputs from layer $l$ ($y^{(0)} = x$ is
the input). $W^{(l)}$ and $b^{(l)}$ are the weights and biases at layer $l$. The feed-forward operation
of a standard neural network (Figure 3a) can be described as (for $l \in \{0, \ldots, L - 1\}$ and
any hidden unit $i$)
$$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)},$$
$$y_i^{(l+1)} = f(z_i^{(l+1)}),$$
where $f$ is the activation function.
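A minimal sketch of one layer's feed-forward step in this notation, assuming the usual form z = W y + b followed by y = f(z). The dropout variant multiplies each input by an independent Bernoulli(p) mask before the weights are applied. The function names, the logistic choice of f and the toy weights are assumptions made for this illustration only.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, y_prev, f=logistic):
    """z_i = w_i . y_prev + b_i, then y_i = f(z_i), for every unit i."""
    z = [sum(w_ij * y_j for w_ij, y_j in zip(w_i, y_prev)) + b_i
         for w_i, b_i in zip(W, b)]
    return [f(z_i) for z_i in z]

def dropout_layer_forward(W, b, y_prev, p, rng, f=logistic):
    """Same as layer_forward, but each input is kept with probability p."""
    r = [1.0 if rng.random() < p else 0.0 for _ in y_prev]
    y_thinned = [r_j * y_j for r_j, y_j in zip(r, y_prev)]
    return layer_forward(W, b, y_thinned, f), r

# Toy layer with 3 inputs and 2 units (values are arbitrary).
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b = [0.0, 0.1]
x = [1.0, 2.0, 3.0]
y_standard = layer_forward(W, b, x)
y_dropout, mask = dropout_layer_forward(W, b, x, p=0.5, rng=random.Random(0))
```

With p = 1 every input is retained and the two functions return the same outputs.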
Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov
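[Figure 3: diagram of the feed-forward computation of a unit, with inputs $y^{(l)}$, weights $w_i^{(l+1)}$, bias $b_i^{(l+1)}$, pre-activation $z_i^{(l+1)}$, activation $f$ and output $y_i^{(l+1)}$. A second panel adds the variables $r_j^{(l)}$ and the thinned inputs $\tilde{y}^{(l)}$.]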
5.1 Backpropagation
Dropout neural networks can be trained using stochastic gradient descent in a manner simi-
lar to standard neural nets. The only difference is that for each training case in a mini-batch,
we sample a thinned network by dropping out units. Forward and backpropagation for that
training case are done only on this thinned network. The gradients for each parameter are
averaged over the training cases in each mini-batch. Any training case which does not use a
parameter contributes a gradient of zero for that parameter. Many methods have been used
to improve stochastic gradient descent such as momentum, annealed learning rates and L2
weight decay. Those were found to be useful for dropout neural networks as well.
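The per-case sampling and gradient averaging can be sketched on a deliberately small model. The example assumes a single linear unit with squared loss whose inputs are dropped, rather than hidden units of a deep net; the function name and the loss are choices made for this illustration.

```python
import random

def dropout_minibatch_gradient(w, batch, p, rng):
    """Average gradient of 0.5 * (w . (r * x) - t)^2 over a mini-batch,
    drawing a fresh mask r for every training case."""
    grad = [0.0] * len(w)
    for x, t in batch:
        r = [1.0 if rng.random() < p else 0.0 for _ in x]
        x_thin = [r_j * x_j for r_j, x_j in zip(r, x)]
        err = sum(w_j * x_j for w_j, x_j in zip(w, x_thin)) - t
        for j, x_j in enumerate(x_thin):
            grad[j] += err * x_j  # contributes zero when input j was dropped
    return [g / len(batch) for g in grad]

w = [0.5, 0.5]
batch = [([1.0, 2.0], 1.0)]
grad_full = dropout_minibatch_gradient(w, batch, p=1.0, rng=random.Random(0))
```

A weight whose input was dropped for a given case receives no contribution from that case, but the sum is still divided by the full mini-batch size.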
One particular form of regularization was found to be especially useful for dropout:
constraining the norm of the incoming weight vector at each hidden unit to be upper
bounded by a fixed constant $c$. In other words, if $w$ represents the vector of weights incident
on any hidden unit, the neural network was optimized under the constraint $\|w\|_2 \le c$. This
constraint was imposed during optimization by projecting $w$ onto the surface of a ball of
radius $c$, whenever $w$ went out of it. This is also called max-norm regularization since it
implies that the maximum value that the norm of any weight can take is $c$. The constant
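The projection step can be sketched as follows, assuming it is applied to each unit's incoming weight vector after a gradient update. The function name is hypothetical.

```python
import math

def max_norm_project(w, c):
    """Return w unchanged if ||w||_2 <= c, otherwise rescale it to norm c."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= c:
        return list(w)
    scale = c / norm
    return [x * scale for x in w]

projected = max_norm_project([3.0, 4.0], 2.0)  # norm 5 is rescaled to norm 2
unchanged = max_norm_project([0.3, 0.4], 2.0)  # already inside the ball
```

Rescaling preserves the direction of the weight vector and only shrinks its length.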
6. Experimental Results
We trained dropout neural networks for classification problems on data sets in different
domains. We found that dropout improved generalization performance on all data sets
compared to neural networks that did not use dropout. Table 1 gives a brief description of
the data sets. The data sets are:
• MNIST : A standard toy data set of handwritten digits.
• TIMIT : A standard speech benchmark for clean speech recognition.
• CIFAR-10 and CIFAR-100 : Tiny natural images (Krizhevsky, 2009).
• Street View House Numbers data set (SVHN) : Images of house numbers collected by
Google Street View (Netzer et al., 2011).
• ImageNet : A large collection of natural images.
• Reuters-RCV1 : A collection of Reuters newswire articles.
• Alternative Splicing data set: RNA features for predicting alternative gene splicing
(Xiong et al., 2011).
We chose a diverse set of data sets to demonstrate that dropout is a general technique
for improving neural nets and is not specific to any particular application domain. In this
section, we present some key results that show the effectiveness of dropout. A more detailed
description of all the experiments and data sets is provided in Appendix B.
6.1.1 MNIST
Method                                                      Unit Type  Architecture               Error %
Standard Neural Net (Simard et al., 2003)                   Logistic   2 layers, 800 units        1.60
SVM Gaussian kernel                                         NA         NA                         1.40
Dropout NN                                                  Logistic   3 layers, 1024 units       1.35
Dropout NN                                                  ReLU       3 layers, 1024 units       1.25
Dropout NN + max-norm constraint                            ReLU       3 layers, 1024 units       1.06
Dropout NN + max-norm constraint                            ReLU       3 layers, 2048 units       1.04
Dropout NN + max-norm constraint                            ReLU       2 layers, 4096 units       1.01
Dropout NN + max-norm constraint                            ReLU       2 layers, 8192 units       0.95
Dropout NN + max-norm constraint (Goodfellow et al., 2013)  Maxout     2 layers, (5 × 240) units  0.94
DBN + finetuning (Hinton and Salakhutdinov, 2006)           Logistic   500-500-2000               1.18
DBM + finetuning (Salakhutdinov and Hinton, 2009)           Logistic   500-500-2000               0.96
DBN + dropout finetuning                                    Logistic   500-500-2000               0.92
DBM + dropout finetuning                                    Logistic   500-500-2000               0.79
setting that do not use dropout or unsupervised pretraining achieve an error of about
1.60% (Simard et al., 2003). With dropout the error reduces to 1.35%. Replacing logistic
units with rectified linear units (ReLUs) (Jarrett et al., 2009) further reduces the error to
1.25%. Adding max-norm regularization again reduces it to 1.06%. Increasing the size of
the network leads to better results. A neural net with 2 layers and 8192 units per layer
gets down to 0.95% error. Note that this network has more than 65 million parameters and
is being trained on a data set of size 60,000. Training a network of this size to give good
generalization error is very hard with standard regularization methods and early stopping.
Dropout, on the other hand, prevents overfitting, even in this case. It does not even need
early stopping. Goodfellow et al. (2013) showed that results can be further improved to
0.94% by replacing ReLU units with maxout units. All dropout nets use p = 0.5 for hidden
units and p = 0.8 for input units. More experimental details can be found in Appendix B.1.
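The parameter count quoted above can be checked with a few lines of arithmetic. The sketch assumes a fully connected net with 784 inputs (28 × 28 pixels), two hidden layers of 8192 units, 10 output units, and one bias per unit; the helper name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected net with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_params = count_parameters([784, 8192, 8192, 10])
print(n_params)  # 73629706
```

Under these assumptions the net has about 73.6 million parameters, consistent with "more than 65 million", and most of them sit in the 8192 × 8192 weight matrix between the two hidden layers.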
Dropout nets pretrained with stacks of RBMs and Deep Boltzmann Machines also give
improvements as shown in Table 2. DBM-pretrained dropout nets achieve a test error of
0.79% which is the best performance ever reported for the permutation invariant setting.
We note that it is possible to obtain better results by using 2-D spatial information and
augmenting the training set with distorted versions of images from the standard training
set. We demonstrate the effectiveness of dropout in that setting on more interesting data
sets.
In order to test the robustness of dropout, classification experiments were
done with networks of many different architectures keeping all hyperparameters,
including p, fixed. Figure 4 shows the test
[Figure 4: plot of Classification Error %; legend entry: Without dropout.]
Method Error %
Binary Features (WDCH) (Netzer et al., 2011) 36.7
HOG (Netzer et al., 2011) 15.0
Stacked Sparse Autoencoders (Netzer et al., 2011) 10.3
KMeans (Netzer et al., 2011) 9.4
Multi-stage Conv Net with average pooling (Sermanet et al., 2012) 9.06
Multi-stage Conv Net + L2 pooling (Sermanet et al., 2012) 5.36
Multi-stage Conv Net + L4 pooling + padding (Sermanet et al., 2012) 4.90
Conv Net + max-pooling 3.95
Conv Net + max pooling + dropout in fully connected layers 3.02
Conv Net + stochastic pooling (Zeiler and Fergus, 2013) 2.80
Conv Net + max pooling + dropout in all layers 2.55
Conv Net + maxout (Goodfellow et al., 2013) 2.47
Human Performance 2.0
followed by a max-pooling layer. Appendix B.2 describes the architecture in more detail.
Dropout was applied to all the layers of the network with the probability of retaining a hid-
den unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the different layers of the network (going
from input to convolutional layers to fully connected layers). Max-norm regularization was
used for weights in both convolutional and fully connected layers. Table 3 compares the
results obtained by different methods. We find that convolutional nets outperform other
methods. The best performing convolutional nets that do not use dropout achieve an error
rate of 3.95%. Adding dropout only to the fully connected layers reduces the error to 3.02%.
Adding dropout to the convolutional layers as well further reduces the error to 2.55%. Even
more gains can be obtained by using maxout units.
The additional gain in performance obtained by adding dropout in the convolutional
layers (3.02% to 2.55%) is worth noting. One may have presumed that since the convo-
lutional layers don’t have a lot of parameters, overfitting is not a problem and therefore
dropout would not have much effect. However, dropout in the lower layers still helps be-
cause it provides noisy inputs for the higher fully connected layers which prevents them
from overfitting.
The CIFAR-10 and CIFAR-100 data sets consist of 32 × 32 color images drawn from 10
and 100 categories respectively. Figure 5b shows some examples of images from this data
set. A detailed description of the data sets, input preprocessing, network architectures and
other experimental details is given in Appendix B.3. Table 4 shows the error rate obtained
by different methods on these data sets. Without any data augmentation, Snoek et al.
(2012) used Bayesian hyperparameter optimization to obtain an error rate of 14.98% on
CIFAR-10. Using dropout in the fully connected layers reduces that to 14.32% and adding
dropout in every layer further reduces the error to 12.61%. Goodfellow et al. (2013) showed
that the error is further reduced to 11.68% by replacing ReLU units with maxout units. On
CIFAR-100, dropout reduces the error from 43.48% to 37.20%, an absolute reduction of
6.28 percentage points.
No data augmentation was used for either data set (apart from the input dropout).
Figure 5: Samples from image data sets. Each row corresponds to a different category.
6.1.4 ImageNet
ImageNet is a data set of over 15 million labeled high-resolution images belonging to roughly
22,000 categories. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual
competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has
been held. A subset of ImageNet with roughly 1000 images in each of 1000 categories is
used in this challenge. Since the number of categories is rather large, it is conventional to
report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test
images for which the correct label is not among the five labels considered most probable by
the model. Figure 6 shows some predictions made by our model on a few test images.
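The top-1 and top-5 error rates defined above can be computed as in the following sketch. The function name and the toy scores are illustrative only; ties between scores are broken by label index here, which is an assumption.

```python
def top_k_error(scores, labels, k):
    """Fraction of cases whose true label is not among the k highest-scoring labels."""
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

# Two toy test cases over three labels.
scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
labels = [2, 1]
top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
```

In this toy case both true labels are ranked second, so the top-1 error is 1.0 and the top-2 error is 0.0.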
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so
most of our experiments were performed on this data set. Table 5 compares the performance
of different methods. Convolutional nets with dropout outperform other methods by a large
margin. The architecture and implementation details are described in Krizhevsky
et al. (2012).
Figure 6: Some ImageNet test cases with the 4 most probable labels as predicted by our model.
The length of the horizontal bars is proportional to the probability assigned to the labels
by the model. Pink indicates ground truth.
nets with other models. A 6-layer net gives a phone error rate of 23.4%. Dropout further
improves it to 21.8%. We also trained dropout nets starting from pretrained weights. A
4-layer net pretrained with a stack of RBMs gets a phone error rate of 22.7%. With dropout,
this reduces to 19.7%. Similarly, for an 8-layer net the error reduces from 20.5% to 19.7%.
7. Salient Features
The experiments described in the previous section provide strong evidence that dropout
is a useful technique for improving neural networks. In this section, we closely examine
how dropout affects a neural network. We analyze the effect of dropout on the quality of
features produced. We see how dropout affects the sparsity of hidden unit activations. We
also see how the advantages obtained from dropout vary with the probability of retaining
units, size of the network and the size of the training set. These observations give some
insight into why dropout works so well.
Figure 7: Features learned on MNIST with one hidden layer autoencoders having 256 rectified
linear units.
In a standard neural network, the derivative received by each parameter tells it how it
should change so the final loss function is reduced, given what all other units are doing.
Therefore, units may change in a way that they fix up the mistakes of the other units.
This may lead to complex co-adaptations. This in turn leads to overfitting because these
co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit,
dropout prevents co-adaptation by making the presence of other hidden units unreliable.
Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must
perform well in a wide variety of different contexts provided by the other hidden units. To
observe this effect directly, we look at the first level features learned by neural networks
trained on visual tasks with and without dropout.
Figure 8: Effect of dropout on sparsity. ReLUs were used for both models. Left: The histogram
of mean activations shows that most units have a mean activation of about 2.0. The
histogram of activations shows a huge mode away from zero. Clearly, a large fraction of
units have high activation. Right: The histogram of mean activations shows that most
units have a smaller mean activation of about 0.7. The histogram of activations
shows a sharp peak at zero. Very few units have high activation.
We found that as a side-effect of doing dropout, the activations of the hidden units
become sparse, even when no sparsity inducing regularizers are present. Thus, dropout au-
tomatically leads to sparse representations. To observe this effect, we take the autoencoders
trained in the previous section and look at the sparsity of hidden unit activations on a ran-
dom mini-batch taken from the test set. Figure 8a and Figure 8b compare the sparsity for
the two models. In a good sparse model, there should only be a few highly activated units
for any data case. Moreover, the average activation of any unit across data cases should
be low. To assess both of these qualities, we plot two histograms for each model. For each
model, the histogram on the left shows the distribution of mean activations of hidden units
across the minibatch. The histogram on the right shows the distribution of activations of
the hidden units.
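The two quantities behind these histograms can be computed from a matrix of activations as sketched below. The layout of the matrix (one row per data case, one column per hidden unit) and the function name are assumptions for this example; the fraction of exactly-zero activations is an extra summary that is meaningful for ReLUs.

```python
def sparsity_summary(activations):
    """activations[n][j] is the activation of hidden unit j on data case n."""
    n_cases = len(activations)
    n_units = len(activations[0])
    # Mean activation of each unit across the mini-batch (first histogram).
    mean_per_unit = [sum(row[j] for row in activations) / n_cases
                     for j in range(n_units)]
    # All individual activations pooled together (second histogram).
    flat = [a for row in activations for a in row]
    frac_zero = sum(1 for a in flat if a == 0.0) / len(flat)
    return mean_per_unit, frac_zero

means, frac_zero = sparsity_summary([[0.0, 2.0], [0.0, 4.0]])
```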
Comparing the histograms of activations we can see that fewer hidden units have high
activations in Figure 8b compared to Figure 8a, as seen by the significant mass away from
zero for the net that does not use dropout. The mean activations are also smaller for the
dropout net. The overall mean activation of hidden units is close to 2.0 for the autoencoder
without dropout but drops to around 0.7 when dropout is used.
In the first case, we train the same network architecture with different amounts of
dropout. We use a 784-2048-2048-2048-10 architecture. No input dropout was used. Fig-
ure 9a shows the test error obtained as a function of p. If the architecture is held constant,
having a small p means very few units will turn on during training. It can be seen that this
has led to underfitting since the training error is also high. We see that as p increases, the
error goes down. It becomes flat when 0.4 ≤ p ≤ 0.8 and then increases as p becomes close
to 1.
[Figure 9: two panels, each plotting test error and training error (Classification Error %) against the probability of retaining a unit (p).]
ture (784-1024-1024-2048-10) was used for all data sets. Dropout with p = 0.5 was
performed at all the hidden layers and p = 0.8 at the input layer. It can be observed that
for extremely small data sets (100, 500) dropout does not give any improvements.
The model has enough parameters that it can overfit on the training data, even with
all the noise coming from dropout. As the size of the data set is increased, the gain
from doing dropout increases up to a point and then declines. This suggests that for any
given architecture and dropout rate, there is a “sweet spot” corresponding to some amount
of data that is large enough to not be memorized in spite of the noise but not so large that
overfitting is not a problem anyways.
Figure 10: Effect of varying data set size. [Plot of Classification Error % against Dataset size, from 10^2 to 10^5.]
We again use the MNIST data set and do classification by averaging the predictions
of k randomly sampled neural networks. Figure 11 shows the test error rate obtained for
different values of k. This is compared with the error obtained using the weight scaling
method (shown as a horizontal line). It can be seen that around k = 50, the Monte-Carlo
method becomes as good as the approximate method. Thereafter, the Monte-Carlo method
is slightly better than the approximate method but well within one standard deviation of
it. This suggests that the weight scaling method is a fairly good approximation of the true
model average.
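The comparison between the two test-time procedures can be sketched on a toy model. The example assumes a single logistic unit whose inputs are dropped, rather than the multi-layer MNIST nets used in the experiment; the weights, the seed and the function names are arbitrary choices for illustration.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_scaling_prediction(w, b, x, p):
    """Test-time approximation: use all inputs, with weights scaled by p."""
    return logistic(sum(p * w_j * x_j for w_j, x_j in zip(w, x)) + b)

def monte_carlo_prediction(w, b, x, p, k, rng):
    """Average the predictions of k randomly sampled thinned networks."""
    total = 0.0
    for _ in range(k):
        z = b
        for w_j, x_j in zip(w, x):
            if rng.random() < p:  # input retained in this sample
                z += w_j * x_j
        total += logistic(z)
    return total / k

# Toy unit with arbitrary weights; inputs are retained with p = 0.5.
w, b, x, p = [1.0, -2.0, 0.5], 0.1, [1.0, 1.0, 1.0], 0.5
approx = weight_scaling_prediction(w, b, x, p)
mc = monte_carlo_prediction(w, b, x, p, k=2000, rng=random.Random(0))
```

The two numbers are close but not identical, because the logistic function is nonlinear and the average of the outputs differs from the output at the average input.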
$$P(h, v; \theta) = \frac{1}{Z(\theta)} \exp(v^\top W h + a^\top h + b^\top v).$$
Where $\theta = \{W, a, b\}$ represents the model parameters and $Z$ is the partition function.
Dropout RBMs are RBMs augmented with a vector of binary random variables $r \in
\{0, 1\}^F$. Each random variable $r_j$ takes the value 1 with probability $p$, independent of
others. If $r_j$ takes the value 1, the hidden unit $h_j$ is retained, otherwise it is dropped from
the model. The joint distribution defined by a Dropout RBM can be expressed as
$$P(h, v \mid r; \theta) = \frac{1}{Z'(\theta, r)} \exp(v^\top W h + a^\top h + b^\top v) \prod_{j=1}^{F} g(h_j, r_j),$$
$$g(h_j, r_j) = \mathbb{1}(r_j = 1) + \mathbb{1}(r_j = 0)\,\mathbb{1}(h_j = 0).$$
$$P(h \mid r, v) = \prod_{j=1}^{F} P(h_j \mid r_j, v),$$
$$P(h_j = 1 \mid r_j, v) = \mathbb{1}(r_j = 1)\,\sigma\Big(a_j + \sum_i W_{ij} v_i\Big).$$
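Sampling the hidden units of a Dropout RBM from this conditional can be sketched as follows. The hidden bias is written a here, following the a^T h term of the joint distribution; the function name and the data layout (W[i][j] connecting visible unit i to hidden unit j) are assumptions for this illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_hidden(v, W, a, p, rng):
    """Sample r_j ~ Bernoulli(p), then h_j given r_j and v.

    A dropped unit (r_j = 0) is on with probability 0; a retained unit
    is on with probability sigmoid(a_j + sum_i W[i][j] * v[i]).
    """
    r, h = [], []
    for j in range(len(a)):
        r_j = 1 if rng.random() < p else 0
        prob_on = r_j * sigmoid(a[j] + sum(W[i][j] * v[i] for i in range(len(v))))
        h.append(1 if rng.random() < prob_on else 0)
        r.append(r_j)
    return r, h

# Toy RBM with 2 visible and 3 hidden units (values are arbitrary).
v = [1, 0]
W = [[0.5, -1.0, 2.0], [0.3, 0.3, 0.3]]
a = [0.0, 0.1, -0.2]
r, h = sample_hidden(v, W, a, p=0.5, rng=random.Random(0))
```

A hidden unit can only be on in a sample in which it was retained, so h_j never exceeds r_j.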
Figure 12: Features learned on MNIST by 256 hidden unit RBMs. The features are ordered by L2
norm.
Conditioned on $r$, the distribution over $\{v, h\}$ is the same as the distribution that an RBM
would impose, except that the units for which $r_j = 0$ are dropped from $h$. Therefore, the
Dropout RBM model can be seen as a mixture of exponentially many RBMs with shared
weights, each using a different subset of $h$.
Figure 13: Effect of dropout on sparsity. Left: The activation histogram shows that a large num-
ber of units have activations away from zero. Right: A large number of units have
activations close to zero and very few units have high activation.
learned by the dropout RBM appear qualitatively different in the sense that they seem to
capture features that are coarser compared to the sharply defined stroke-like features in the
standard RBM. There seem to be very few dead units in the dropout RBM relative to the
standard RBM.
Next, we investigate the effect of dropout RBM training on sparsity of the hidden unit
activations. Figure 13a shows the histograms of hidden unit activations and their means on
a test mini-batch after training an RBM. Figure 13b shows the same for dropout RBMs.
The histograms clearly indicate that the dropout RBMs learn much sparser representations
than standard RBMs even when no additional sparsity inducing regularizer is present.
9. Marginalizing Dropout
Dropout can be seen as a way of adding noise to the states of hidden units in a neural
network. In this section, we briefly explore the class of models that arise as a result of
marginalizing this noise. These models can be seen as deterministic versions of dropout.
In contrast to standard (“Monte-Carlo”) dropout, these models do not need random bits
and it is possible to get gradients for the marginalized loss functions.
Deterministic algorithms have been proposed that try to learn models that are robust to
feature deletion at test time (Globerson and Roweis, 2006). Marginalization in the context
of denoising autoencoders has been explored previously (Chen et al., 2012). The marginal-
ization of dropout noise in the context of linear regression was discussed in Srivastava (2013).
Wang and Manning (2013) further explored the idea of marginalizing dropout to speed up
training. van der Maaten et al. (2013) investigated different input noise distributions and
the regularizers obtained by marginalizing this noise. Wager et al. (2013) describe how
dropout can be seen as an adaptive regularizer.
$$\|y - Xw\|^2.$$
When the input $X$ is dropped out such that any input dimension is retained with
probability $p$, the input can be expressed as $R * X$ where $R \in \{0, 1\}^{N \times D}$ is a random matrix
with $R_{ij} \sim \text{Bernoulli}(p)$ and $*$ denotes an element-wise product. Marginalizing the noise,
the objective function becomes
$$\underset{w}{\text{minimize}} \quad \mathbb{E}_{R \sim \text{Bernoulli}(p)} \left[ \|y - (R * X) w\|^2 \right].$$
This reduces to
$$\underset{w}{\text{minimize}} \quad \|y - pXw\|^2 + p(1 - p) \|\Gamma w\|^2,$$
where $\Gamma = (\text{diag}(X^\top X))^{1/2}$. Therefore, dropout with linear regression is equivalent, in
expectation, to ridge regression with a particular form for $\Gamma$. This form of $\Gamma$ essentially
scales the weight cost for weight $w_i$ by the standard deviation of the $i$th dimension of the
data. If a particular data dimension varies a lot, the regularizer tries to squeeze its weight
more.
Another interesting way to look at this objective is to absorb the factor of $p$ into $w$.
This leads to the following form
$$\underset{w}{\text{minimize}} \quad \|y - X\tilde{w}\|^2 + \frac{1 - p}{p} \|\Gamma \tilde{w}\|^2,$$
where $\tilde{w} = pw$. This makes the dependence of the regularization constant on $p$ explicit.
For $p$ close to 1, all the inputs are retained and the regularization constant is small. As
more dropout is done (by decreasing $p$), the regularization constant grows larger.
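The marginalization can be checked numerically on a tiny problem by enumerating every binary mask R and comparing the exact expectation with a closed form. The closed form used here, ||y − pXw||² + p(1 − p)||Γw||², is what the objective above becomes when w̃ = pw is substituted back. The data values and function names are arbitrary choices for this sketch.

```python
import itertools

def squared_error(y, X, w):
    return sum((y_i - sum(x_ij * w_j for x_ij, w_j in zip(row, w))) ** 2
               for y_i, row in zip(y, X))

def expected_dropout_loss(y, X, w, p):
    """Exact E_R ||y - (R * X) w||^2 by enumerating every binary mask R."""
    n, d = len(X), len(X[0])
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n * d):
        prob = 1.0
        for bit in bits:
            prob *= p if bit else 1.0 - p
        masked = [[bits[i * d + j] * X[i][j] for j in range(d)] for i in range(n)]
        total += prob * squared_error(y, masked, w)
    return total

def marginalized_loss(y, X, w, p):
    """||y - p X w||^2 + p (1 - p) ||Gamma w||^2 with Gamma = diag(X^T X)^(1/2)."""
    n, d = len(X), len(X[0])
    fit = squared_error(y, [[p * x for x in row] for row in X], w)
    gamma_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    penalty = p * (1.0 - p) * sum(g * w_j ** 2 for g, w_j in zip(gamma_sq, w))
    return fit + penalty

X = [[1.0, 2.0], [3.0, -1.0]]
y = [1.0, 0.5]
w = [0.3, -0.7]
exact = expected_dropout_loss(y, X, w, p=0.8)
closed = marginalized_loss(y, X, w, p=0.8)
```

The enumeration grows as 2^(N·D), so it is only useful as a sanity check on very small matrices.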
Table 10: Comparison of classification error % with Bernoulli and Gaussian dropout. For MNIST,
the Bernoulli model uses p = 0.5 for the hidden units and p = 0.8 for the input units.
For CIFAR-10, we use p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) going from the input layer to the
top. The value of σ for the Gaussian dropout models was set to be $\sqrt{(1 - p)/p}$. Results were
averaged over 10 different random seeds.
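One way to read this choice of σ (an interpretation offered here, not stated in the caption) is that σ² = (1 − p)/p equals the variance of a Bernoulli(p) mask rescaled by 1/p so that its mean is 1. The sketch below checks that identity for the retention probabilities quoted above; the function names are hypothetical.

```python
import math

def gaussian_dropout_sigma(p):
    """Standard deviation sqrt((1 - p) / p) used for the Gaussian noise."""
    return math.sqrt((1.0 - p) / p)

def scaled_bernoulli_moments(p):
    """Mean and variance of r / p with r ~ Bernoulli(p)."""
    mean = p * (1.0 / p) + (1.0 - p) * 0.0
    var = p * (1.0 / p - mean) ** 2 + (1.0 - p) * (0.0 - mean) ** 2
    return mean, var

for p in (0.5, 0.75, 0.8, 0.9):
    mean, var = scaled_bernoulli_moments(p)
    print(p, round(gaussian_dropout_sigma(p) ** 2, 6), round(var, 6))
```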
11. Conclusion
Dropout is a technique for improving neural networks by reducing overfitting. Standard
backpropagation learning builds up brittle co-adaptations that work for the training data
but do not generalize to unseen data. Random dropout breaks up these co-adaptations by
making the presence of any particular hidden unit unreliable. This technique was found
to improve the performance of neural nets in a wide variety of application domains includ-
ing object classification, digit recognition, speech recognition, document classification and
analysis of computational biology data. This suggests that dropout is a general technique
and is not specific to any domain. Methods that use dropout achieve state-of-the-art re-
sults on SVHN, ImageNet, CIFAR-100 and MNIST. Dropout considerably improved the
performance of standard neural nets on other data sets as well.
This idea can be extended to Restricted Boltzmann Machines and other graphical mod-
els. The central idea of dropout is to take a large model that overfits easily and repeatedly
sample and train smaller sub-models from it. RBMs easily fit into this framework. We de-
veloped Dropout RBMs and empirically showed that they have certain desirable properties.
One of the drawbacks of dropout is that it increases training time. A dropout network
typically takes 2-3 times longer to train than a standard neural network of the same ar-
chitecture. A major cause of this increase is that the parameter updates are very noisy.
Each training case effectively tries to train a different random architecture. Therefore, the
gradients that are being computed are not gradients of the final architecture that will be
used at test time. Therefore, it is not surprising that training takes a long time. However,
it is likely that this stochasticity prevents overfitting. This creates a trade-off between over-
fitting and training time. With more training time, one can use high dropout and suffer less
overfitting. However, one way to obtain some of the benefits of dropout without stochas-
ticity is to marginalize the noise to obtain a regularizer that does the same thing as the
dropout procedure, in expectation. We showed that for linear regression this regularizer is
a modified form of L2 regularization. For more complicated models, it is not obvious how to
obtain an equivalent regularizer. Speeding up dropout is an interesting direction for future
work.
Acknowledgments
This research was supported by OGS, NSERC and an Early Researcher Award.
B.1 MNIST
The MNIST data set consists of 60,000 training and 10,000 test examples each representing
a 28×28 digit image. We held out 10,000 random training images for validation. Hyperpa-
rameters were tuned on the validation set such that the best validation error was produced
after 1 million weight updates. The validation set was then combined with the training set
and training was done for 1 million weight updates. This net was used to evaluate the per-
formance on the test set. This way of using the validation set was chosen because we found
that it was easy to set up hyperparameters so that early stopping was not required at all.
Therefore, once the hyperparameters were fixed, it made sense to combine the validation
and training sets and train for a very long time.
The architectures shown in Figure 4 include all combinations of 2, 3, and 4 layer networks
with 1024 and 2048 units in each layer. Thus, there are six architectures in all. For all the
architectures (including the ones reported in Table 2), we used p = 0.5 in all hidden layers
and p = 0.8 in the input layer. A final momentum of 0.95 and weight constraints with c = 2
were used in all the layers.
To test the limits of dropout’s regularization power, we also experimented with 2 and 3
layer nets having 4096 and 8192 units. 2 layer nets gave improvements as shown in Table 2.
However, the three layer nets performed slightly worse than 2 layer ones with the same
level of dropout. When we increased dropout, performance improved but not enough to
outperform the 2 layer nets.
B.2 SVHN
The SVHN data set consists of approximately 600,000 training images and 26,000 test
images. The training set consists of two parts: a standard labeled training set and another
set of labeled examples that are easy. A validation set was constructed by taking examples
from both the parts. Two-thirds of it were taken from the standard set (400 per class) and
one-third from the extra set (200 per class), a total of 6000 samples. This same process
is used by Sermanet et al. (2012). The inputs were RGB pixels normalized to have zero
mean and unit variance. Other preprocessing techniques such as global or local contrast
normalization or ZCA whitening did not give any noticeable improvements.
The best architecture that we found uses three convolutional layers each followed by
a max-pooling layer. The convolutional layers have 96, 128 and 256 filters respectively.
Each convolutional layer has a 5 × 5 receptive field applied with a stride of 1 pixel. Each
max pooling layer pools 3 × 3 regions at strides of 2 pixels. The convolutional layers are
followed by two fully connected hidden layers having 2048 units each. All units use the
rectified linear activation function. Dropout was applied to all the layers of the network
with the probability of retaining the unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the
different layers of the network (going from input to convolutional layers to fully connected
layers). In addition, the max-norm constraint with c = 4 was used for all the weights. A
momentum of 0.95 was used in all the layers. These hyperparameters were tuned using a
validation set. Since the training set was quite large, we did not combine the validation
set with the training set for final training. We report the test error of the model that had
the smallest validation error.
and then rotate it back. The network architecture and dropout rates are the same as those for
SVHN, except for the learning rates for the input layer, which had to be set to smaller values.
B.4 TIMIT
The open source Kaldi toolkit (Povey et al., 2011) was used to preprocess the data into log-
filter banks. A monophone system was trained to do a forced alignment and to get labels for
speech frames. Dropout neural networks were trained on windows of 21 consecutive frames
to predict the label of the central frame. No speaker dependent operations were performed.
The inputs were mean centered and normalized to have unit variance.
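Building the 21-frame input windows can be sketched as below. The text does not say how frames near the start and end of an utterance are handled, so this example simply skips centers without a full window, which is an assumption; the function name is also hypothetical.

```python
def context_windows(frames, width=21):
    """Pair each central frame index with the concatenation of the
    `width` consecutive frames centered on it (full windows only)."""
    half = width // 2
    windows = []
    for c in range(half, len(frames) - half):
        window = [value for frame in frames[c - half:c + half + 1] for value in frame]
        windows.append((c, window))
    return windows

# 25 one-dimensional toy frames give 5 full windows, centered on frames 10 to 14.
frames = [[float(i)] for i in range(25)]
windows = context_windows(frames)
```

Each window would be the network input and the label of frame c its target.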
We used a probability of retention p = 0.8 in the input layers and 0.5 in the hidden layers.
Max-norm constraint with c = 4 was used in all the layers. A momentum of 0.95 with a
high learning rate of 0.1 was used. The learning rate was decayed as $\epsilon_0 (1 + t/T)^{-1}$. For
DBN pretraining, we trained RBMs using CD-1. The variance of each input unit for the
Gaussian RBM was fixed to 1. For finetuning the DBN with dropout, we found that in
order to get the best results it was important to use a smaller learning rate (about 0.01).
Adding max-norm constraints did not give any improvements.
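The decay schedule mentioned above, initial rate × (1 + t/T)^(-1), can be sketched as follows. The text gives the initial rate of 0.1 but not the value of T or the unit of t, so T = 1000 and "t counts updates" are placeholders chosen for this example.

```python
def decayed_learning_rate(initial_rate, t, T):
    """initial_rate * (1 + t / T) ** -1, where t is the step count
    and T sets how quickly the rate decays."""
    return initial_rate / (1.0 + t / T)

# Hypothetical T = 1000: the rate halves after T steps.
rates = [decayed_learning_rate(0.1, t, T=1000) for t in (0, 1000, 3000)]
```

With this form the rate falls to half its initial value at t = T and to a quarter at t = 3T.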
B.5 Reuters
The Reuters RCV1 corpus contains more than 800,000 documents categorized into 103
classes. These classes are arranged in a tree hierarchy. We created a subset of this data set
consisting of 402,738 articles and a vocabulary of 2000 words comprising 50 categories
in which each document belongs to exactly one class. The data was split into equal sized
training and test sets. We tried many network architectures and found that dropout gave
improvements in classification accuracy over all of them. However, the improvement was
not as significant as that for the image and speech data sets. This might be explained by
the fact that this data set is quite big (more than 200,000 training examples) and overfitting
is not a very serious problem.
where $p^s_{i,t}$ is the target probability for state $s$ and tissue type $t$ in input $i$; $q^s_t(r_i)$ is the
predicted probability for state $s$ in tissue type $t$ for input $r_i$ and $\bar{p}^s$ is the average of $p^s_{i,t}$
over $i$ and $t$.
A two layer dropout network with 1024 units in each layer was trained on this data set.
A value of p = 0.5 was used for the hidden layer and p = 0.7 for the input layer. Max-norm
regularization with high decaying learning rates was used. Results were averaged across the
same 5 folds used by Xiong et al. (2011).
References
M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized denoising autoencoders for
domain adaptation. In Proceedings of the 29th International Conference on Machine
Learning, pages 767–774. ACM, 2012.
G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton. Phone recognition with the mean-
covariance restricted Boltzmann machine. In Advances in Neural Information Processing
Systems 23, pages 469–477, 2010.
O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features.
Machine Learning, 81(2):149–178, 2010.
A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In
Proceedings of the 23rd International Conference on Machine Learning, pages 353–360.
ACM, 2006.
I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks.
In Proceedings of the 30th International Conference on Machine Learning, pages 1319–
1327. ACM, 2013.
G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks.
Science, 313(5786):504 – 507, 2006.
G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets.
Neural Computation, 18:1527–1554, 2006.
K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage
architecture for object recognition? In Proceedings of the International Conference on
Computer Vision (ICCV’09). IEEE, 2009.
A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report,
University of Toronto, 2009.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolu-
tional neural networks. In Advances in Neural Information Processing Systems 25, pages
1106–1114, 2012.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computa-
tion, 1(4):541–551, 1989.
Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, Z. Li, M.-H. Tsai, X. Zhou,
T. Huang, and T. Zhang. ImageNet classification: fast descriptor coding and large-scale
SVM training. Large Scale Visual Recognition Challenge, 2010.
A. Livnat, C. Papadimitriou, N. Pippenger, and M. W. Feldman. Sex, mixability, and
modularity. Proceedings of the National Academy of Sciences, 107(4):1452–1457, 2010.
V. Mnih. CUDAMat: a CUDA-based matrix class for Python. Technical Report UTML
TR 2009-004, Department of Computer Science, University of Toronto, November 2009.
A. Mohamed, G. E. Dahl, and G. E. Hinton. Acoustic modeling using deep belief networks.
IEEE Transactions on Audio, Speech, and Language Processing, 2010.
R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., 1996.
P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks ap-
plied to visual document analysis. In Proceedings of the Seventh International Conference
on Document Analysis and Recognition, volume 2, pages 958–962, 2003.
N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the 18th
annual conference on Learning Theory, COLT’05, pages 545–560. Springer-Verlag, 2005.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B. Methodological, 58(1):267–288, 1996.
A. N. Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39(5):
195–198, 1943.
L. van der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized
corrupted features. In Proceedings of the 30th International Conference on Machine
Learning, pages 410–418. ACM, 2013.
P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust
features with denoising autoencoders. In Proceedings of the 25th International Conference
on Machine Learning, pages 1096–1103. ACM, 2008.
S. Wang and C. D. Manning. Fast dropout training. In Proceedings of the 30th International
Conference on Machine Learning, pages 118–126. ACM, 2013.
Yann LeCun¹, Leon Bottou¹, Genevieve B. Orr², and Klaus-Robert Müller³
¹ Image Processing Research Department, AT&T Labs - Research, 100 Schulz Drive,
Red Bank, NJ 07701-7033, USA
² Willamette University, 900 State Street, Salem, OR 97301, USA
³ GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany
{yann,leonb}@research.att.com, orr@willamette.edu, klaus@first.gmd.de
Originally published in Orr, G. and Müller, K., "Neural Networks: Tricks of the Trade",
Springer, 1998.

Abstract. The convergence of back-propagation learning is analyzed so as to explain common
phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided
with tricks that are rarely exposed in serious technical publications. This paper gives some
of those tricks, and offers explanations of why they work.
Many authors have suggested that second-order optimization methods are advantageous for
neural net training. It is shown that most "classical" second-order methods are impractical
for large neural networks. A few methods are proposed that do not have these limitations.

1 Introduction
Backpropagation is a very popular neural network learning algorithm because it is
conceptually simple, computationally efficient, and because it often works. However, getting
it to work well, and sometimes to work at all, can seem more of an art than a science.
Designing and training a network using backprop requires making many seemingly arbitrary
choices such as the number and types of nodes, layers, learning rates, training and test
sets, and so forth. These choices can be critical, yet there is no foolproof recipe for
deciding them because they are largely problem and data dependent. However, there are
heuristics and some underlying theory that can help guide a practitioner to make better
choices.
In the first section below we introduce standard backpropagation and discuss a number of
simple heuristics or tricks for improving its performance. We next discuss issues of
convergence. We then describe a few "classical" second-order non-linear optimization
techniques and show that their application to neural network training is very limited,
despite many claims to the contrary in the literature. Finally, we present a few
second-order methods that do accelerate learning in certain cases.

2 Learning and Generalization
There are several approaches to automatic machine learning, but many of the successful
approaches can be categorized as gradient-based learning methods. The ...
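[Figure: gradient-based learning machine. The learning machine M(Z, W) takes the input
Z0, Z1, ..., Zp and the parameters W; its output and the desired output D0, D1, ..., Dp
are passed to the cost function.]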
equations are obtained:

$$\frac{\partial E^p}{\partial y_n^i} = f'(y_n^i)\,\frac{\partial E^p}{\partial x_n^i} \qquad (4)$$
$$\frac{\partial E^p}{\partial w_n^{ij}} = x_{n-1}^j\,\frac{\partial E^p}{\partial y_n^i} \qquad (5)$$
$$\frac{\partial E^p}{\partial x_{n-1}^k} = \sum_i w_n^{ik}\,\frac{\partial E^p}{\partial y_n^i} \qquad (6)$$

The above equations can also be written in matrix form:

$$\frac{\partial E^p}{\partial Y_n} = F'(Y_n)\,\frac{\partial E^p}{\partial X_n} \qquad (7)$$
$$\frac{\partial E^p}{\partial W_n} = X_{n-1}\,\frac{\partial E^p}{\partial Y_n} \qquad (8)$$
$$\frac{\partial E^p}{\partial X_{n-1}} = W_n^T\,\frac{\partial E^p}{\partial Y_n} \qquad (9)$$

The simplest learning (minimization) procedure in such a setting is the gradient descent
algorithm where $W$ is iteratively adjusted as follows:

$$W(t) = W(t-1) - \eta\,\frac{\partial E}{\partial W}. \qquad (10)$$

In the simplest case, $\eta$ is a scalar constant. More sophisticated procedures use
variable $\eta$. In other methods $\eta$ takes the form of a diagonal matrix, or is an
estimate of the inverse Hessian matrix of the cost function (second derivative matrix) such
as in the Newton and Quasi-Newton methods described later in the chapter. A proper choice
of $\eta$ is important and will be discussed at length later.
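The matrix-form backpropagation equations and the gradient descent update above can be sketched for a single layer as follows. This is a minimal illustration, not the authors' implementation: it assumes a layer of the form Y = W X followed by X_out = tanh(Y), a squared-error cost, and made-up numbers; the helper names `forward`, `backward` and `cost` are hypothetical. The weight gradient is written out explicitly as the outer product of dE/dY with the layer input.

```python
import math


def forward(W, x):
    """Compute Y = W x, then the layer output F(Y) with F = tanh applied element-wise."""
    y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    return y, [math.tanh(y_i) for y_i in y]


def backward(W, x, y, dE_dX):
    """Apply the three matrix-form backpropagation equations to one layer."""
    # dE/dY = F'(Y) * dE/dX, with tanh'(y) = 1 - tanh(y)**2
    dE_dY = [(1.0 - math.tanh(y_i) ** 2) * g for y_i, g in zip(y, dE_dX)]
    # dE/dW: outer product of dE/dY with the layer input
    dE_dW = [[d_i * x_j for x_j in x] for d_i in dE_dY]
    # dE/dX_prev = W^T * dE/dY
    dE_dXprev = [sum(W[i][k] * dE_dY[i] for i in range(len(W))) for k in range(len(x))]
    return dE_dW, dE_dXprev


def cost(W, x, d):
    """Squared-error cost 0.5 * ||F(W x) - d||^2 (an assumed choice of cost)."""
    _, out = forward(W, x)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, d))


# Made-up weights, input and desired output.
W = [[0.1, -0.2], [0.4, 0.3]]
x, d = [1.0, 0.5], [0.2, -0.1]
y, out = forward(W, x)
dE_dW, _ = backward(W, x, y, [o - t for o, t in zip(out, d)])

# One gradient descent step: W <- W - eta * dE/dW.
eta = 0.1
W_new = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W, dE_dW)]
```

A finite-difference comparison against `cost` is a cheap way to check such hand-written gradients before training with them.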
4 A Few Practical Tricks
Backpropagation can be very slow, particularly for multilayered networks where the cost
surface is typically non-quadratic, non-convex, and high dimensional with many local minima
and/or flat regions. There is no formula to guarantee that (1) the network will converge to
a good solution, (2) convergence is swift, or (3) convergence even occurs at all. However,
in this section we discuss a number of tricks that can greatly improve the chances of
finding a good solution while also decreasing the convergence time, often by orders of
magnitude. More detailed theoretical justifications will be given in later sections.

4.1 Stochastic versus Batch Learning
At each iteration, equation (10) requires a complete pass through the entire dataset in
order to compute the average or true gradient. This is referred to as batch learning since
an entire "batch" of data must be considered before weights are updated.
Alternatively, one can use stochastic (online) learning where a single example
$\{Z^t, D^t\}$ is chosen (e.g. randomly) from the training set at each iteration $t$. An
estimate of the true gradient is then computed based on the error $E^t$ of that example,
and then the weights are updated:

$$W(t+1) = W(t) - \eta\,\frac{\partial E^t}{\partial W}. \qquad (11)$$

Because this estimate of the gradient is noisy, the weights may not move precisely down the
gradient at each iteration. As we shall see, this "noise" at each iteration can be
advantageous. Stochastic learning is generally the preferred method for basic
backpropagation for the following three reasons:

Advantages of Stochastic Learning
1. Stochastic learning is usually much faster than batch learning.
2. Stochastic learning also often results in better solutions.
3. Stochastic learning can be used for tracking changes.

Consider the simple case where a training set of size 1000 is inadvertently composed of 10
identical copies of a set with 100 samples. Averaging the gradient over all 1000 patterns
gives the exact same result as computing the gradient based on just the first 100. Thus,
batch gradient descent is wasteful because it recomputes the same quantity 10 times before
one parameter update. On the other hand, stochastic gradient will see a full epoch as 10
iterations through a 100-long training set. In practice, examples rarely appear more than
once in a dataset, but there are usually clusters of patterns that are very similar. For
example, in phoneme classification, all of the patterns for the phoneme /æ/ will
(hopefully) contain much of the same information. It is this redundancy that can make batch
learning much slower than on-line.
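The redundancy argument above can be checked numerically with a small sketch. It assumes a one-parameter linear model with squared error and synthetic data (100 distinct samples copied 10 times); the data, the learning rate and the helper name `grad` are illustrative choices, not taken from the text.

```python
def grad(w, data):
    """Average gradient of 0.5 * (w * x - d)**2 over the (x, d) pairs in data."""
    return sum((w * x - d) * x for x, d in data) / len(data)


base = [(0.01 * i, 0.02 * i) for i in range(1, 101)]   # 100 distinct samples, target d = 2x
full = base * 10                                       # 1000 samples: 10 identical copies

w0 = 0.0
g_full = grad(w0, full)    # batch gradient over all 1000 patterns
g_base = grad(w0, base)    # the same quantity computed from the first 100 only

# One pass over the data: batch makes a single update, stochastic makes one per sample.
eta = 0.5
w_batch = w0 - eta * g_full
w_sgd = w0
for x, d in full:
    w_sgd -= eta * (w_sgd * x - d) * x
```

After the single pass, `w_sgd` sits essentially at the solution `w = 2`, while the batch learner has taken only one step away from `w0`.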
Stochastic learning also often results in better solutions because of the noise in the
updates. Nonlinear networks usually have multiple local minima of differing depths. The
goal of training is to locate one of these minima. Batch learning will discover the minimum
of whatever basin the weights are initially placed in. In stochastic learning, the noise
present in the updates can result in the weights jumping into the basin of another,
possibly deeper, local minimum. This has been demonstrated in certain simplified cases.
Stochastic learning is also useful when the function being modeled is changing over time, a
quite common scenario in industrial applications where the data distribution changes
gradually over time (e.g. due to wear and tear of the machines). If the learning machine
does not detect and follow the change, it is impossible to learn the data properly and
large generalization errors will result. With batch learning, changes go undetected and we
obtain rather bad results since we are likely to average over several rules, whereas
on-line learning, if operated properly (see below in section 4.7), will track the changes
and yield good approximation results.
Despite the advantages of stochastic learning, there are still reasons why one might
consider using batch learning:

Advantages of Batch Learning
1. Conditions of convergence are well understood.
2. Many acceleration techniques (e.g. conjugate gradient) only operate in batch learning.
3. Theoretical analysis of the weight dynamics and convergence rates are simpler.

These advantages stem from the same noise that makes stochastic learning advantageous. This
noise, which is so critical for finding better local minima, also prevents full convergence
to the minimum. Instead of converging to the exact minimum, the convergence stalls out due
to the weight fluctuations. The size of the fluctuations depends on the degree of noise of
the stochastic updates. The variance of the fluctuations around the local minimum is
proportional to the learning rate $\eta$. So in order to reduce the fluctuations we can
either decrease (anneal) the learning rate or have an adaptive batch size. In theory it is
shown that the optimal annealing schedule of the learning rate is of the form

$$\eta \sim \frac{c}{t} \qquad (12)$$

where $t$ is the number of patterns presented and $c$ is a constant. In practice, this may
be too fast.
Another method to remove noise is to use "mini-batches", that is, start with a small batch
size and increase the size as training proceeds. Møller discusses one method for doing this
and Orr discusses this for linear problems. However, deciding the rate at which to increase
the batch size and which inputs to include in the small batches is as difficult as
determining the proper learning rate. Effectively the size of the learning rate in
stochastic learning corresponds to the respective size of the mini batch.
Note also that the problem of removing the noise in the data may be less critical than one
thinks because of generalization. Overtraining may occur long before the noise regime is
even reached.
Another advantage of batch training is that one is able to use second order methods to
speed the learning process. Second order methods speed learning by estimating not just the
gradient but also the curvature of the cost surface. Given the curvature, one can estimate
the approximate location of the actual minimum.
Despite the advantages of batch updates, stochastic learning is still often the preferred
method, particularly when dealing with very large data sets, because it is simply much
faster.
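The effect of annealing the learning rate as $\eta \sim c/t$, discussed earlier in this section, can be seen in a toy setting. The sketch below assumes a one-dimensional cost whose per-sample error is 0.5 * (w - x)**2 with noisy samples x; the data, the constant rate 0.5 and the helper name `sgd` are illustrative assumptions. With c = 1 the stochastic iterate reduces exactly to the running mean of the samples, whereas a constant rate keeps fluctuating around it.

```python
import random

random.seed(0)
samples = [random.gauss(3.0, 1.0) for _ in range(500)]   # noisy observations of the minimum


def sgd(samples, rate):
    """Minimise the average of 0.5 * (w - x)**2 with one update per sample."""
    w = 0.0
    for t, x in enumerate(samples, start=1):
        w -= rate(t) * (w - x)    # the stochastic gradient for the current sample is (w - x)
    return w


w_const = sgd(samples, lambda t: 0.5)        # constant rate: fluctuations never die out
w_anneal = sgd(samples, lambda t: 1.0 / t)   # eta = c / t with c = 1
mean = sum(samples) / len(samples)
```

In this toy case the annealed run ends on the sample mean, which is the minimiser of the average cost.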
4.2 Shuffling the Examples
Networks learn the fastest from the most unexpected sample. Therefore, it is advisable to
choose a sample at each iteration that is the most unfamiliar to the system. Note, this
applies only to stochastic learning since the order of input presentation is irrelevant for
batch². Of course, there is no simple way to know which inputs are information rich;
however, a very simple trick that crudely implements this idea is to simply choose
successive examples that are from different classes, since training examples belonging to
the same class will most likely contain similar information.

Choose Examples with Maximum Information Content
1. Shuffle the training set so that successive training examples never (rarely) belong to
the same class.
2. Present input examples that produce a large error more frequently than examples that
produce a small error.
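One simple way to realise the first recommendation, so that successive examples rarely share a class, is to shuffle within each class and then deal the examples out round-robin. The sketch below is only one possible scheme under that assumption; the function name `interleave_by_class` and the toy data are hypothetical.

```python
import random
from collections import defaultdict
from itertools import zip_longest


def interleave_by_class(examples, seed=0):
    """Shuffle within each class, then emit examples round-robin across classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, label in examples:
        by_class[label].append((x, label))
    for items in by_class.values():
        rng.shuffle(items)
    ordered = []
    # zip_longest pads exhausted classes with None, which is skipped below.
    for group in zip_longest(*by_class.values()):
        ordered.extend(item for item in group if item is not None)
    return ordered


# Toy data set: three balanced classes "a", "b" and "c".
data = [(i, "a") for i in range(4)] + [(i, "b") for i in range(4)] + [(i, "c") for i in range(4)]
order = interleave_by_class(data)
```

For balanced classes no two neighbours share a label; with unbalanced classes the tail of the ordering may still contain runs of the largest class.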
² The order in which gradients are summed in batch may be affected by roundoff error if
there is a significant range of gradient values.

However, one must be careful when perturbing the normal frequencies of input examples
because this changes the relative importance that the network places on different examples.
This may or may not be desirable. For example, this technique applied to data containing
outliers can be disastrous because outliers ...
... input vector are positive, all of the updates of weights that feed into a node will be
the same sign (i.e. sign($\delta$)). As a result, these weights can only all decrease or
all increase together for a given input pattern. Thus, if a weight vector must change
direction it can only do so by zigzagging, which is inefficient and thus very slow.
In the above example, the inputs were all positive. However, in general, any shift of the
average input away from zero will bias the updates in a particular direction and thus slow
down learning. Therefore, it is good to shift the inputs so that the average over the
training set is close to zero. This heuristic should be applied at all layers, which means
that we want the average of the outputs of a node to be close to zero because these outputs
are the inputs to the next ...
... variable and $z_i^p$ is the $i$-th component of the $p$-th training example. Scaling
speeds ...

[Figure: a linear unit with output y, inputs z1, z2 and weights ω1, ω2, with lines of
constant E drawn in the (ω1, ω2) plane; input transformation stages labelled KL-Expansion
and Covariance Equalization.]

... whose outputs are always positive and so must have a mean that is positive.

Sigmoids
1. Symmetric sigmoids such as hyperbolic tangent often converge faster than the standard
logistic function.
2. A recommended sigmoid is: $f(x) = 1.7159 \tanh\left(\frac{2}{3}x\right)$. Since the tanh
function is sometimes computationally expensive, an approximation of it by a ratio of
polynomials can be used instead.
3. Sometimes it is helpful to add a small linear term, e.g. $f(x) = \tanh(x) + ax$, so as
to avoid flat spots.

[Figure 4: two sigmoid plots, panels (a) and (b).]

The constants in the recommended sigmoid given above have been chosen so that, when used
with transformed inputs (see previous discussion), the variance of the outputs will also be
close to 1 because the effective gain of the sigmoid is roughly 1 over its useful range. In
particular, this sigmoid has the properties (a) $f(\pm 1) = \pm 1$, (b) the second
derivative is a maximum at $x = 1$, and (c) the effective gain is close to 1.
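The listed properties of the recommended sigmoid can be checked numerically. The sketch below assumes the constants of the recommended sigmoid, f(x) = 1.7159 tanh(2x/3), and searches a grid for the point where the magnitude of the analytic second derivative is largest; the grid range and resolution are arbitrary choices.

```python
import math

A, B = 1.7159, 2.0 / 3.0


def f(x: float) -> float:
    """Recommended sigmoid f(x) = 1.7159 * tanh(2x/3)."""
    return A * math.tanh(B * x)


def f_second(x: float) -> float:
    """Analytic second derivative: -2 * A * B**2 * tanh(Bx) * (1 - tanh(Bx)**2)."""
    t = math.tanh(B * x)
    return -2.0 * A * B * B * t * (1.0 - t * t)


# Location on [0, 3] where the magnitude of the second derivative is largest.
grid = [i / 1000.0 for i in range(3001)]
x_star = max(grid, key=lambda x: abs(f_second(x)))
```

Here `f(1.0)` comes out very close to 1 and `x_star` lies just below 1, in line with properties (a) and (b).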
One of the potential problems with using symmetric sigmoids is that the error surface can
be very flat near the origin. For this reason it is good to avoid initializing with very
small weights. Because of the saturation of the sigmoids, the error surface is also flat
far from the origin. Adding a small linear term to the sigmoid can sometimes help avoid the
flat regions.

4.5 Choosing Target Values
In classification problems, target values are typically binary (e.g. {-1, +1}). Common
wisdom might seem to suggest that the target values be set at the value of the sigmoid's
asymptotes. However, this has several drawbacks.
First, instabilities can result. The training process will try to drive the output as close
as possible to the target values, which can only be achieved asymptotically. As a result,
the weights (output and even hidden) are driven to larger and larger values where the
sigmoid derivative is close to zero. The very large weights increase the gradients;
however, these gradients are then multiplied by an exponentially small sigmoid derivative
(except when a twisting term³ is added to the sigmoid), producing a weight update close to
zero. As a result, the weights may become stuck.
Second, when the outputs saturate, the network gives no indication of confidence level.
When an input pattern falls near a decision boundary the output class is uncertain. Ideally
this should be reflected in the network by an output value that is in between the two
possible target values, i.e. not near either asymptote. However, large weights tend to
force all outputs to the tails of the sigmoid regardless of the uncertainty. Thus, the
network may predict a wrong class without giving any indication of its low confidence in
the result. Large weights that saturate the nodes make it impossible to differentiate
between typical and nontypical examples.
A solution to these problems is to set the target values to be within the range of the
sigmoid, rather than at the asymptotic values. Care must be taken, however, to ensure that
the node is not restricted to only the linear part of the sigmoid. Setting the target
values to the point of the maximum second derivative on the sigmoid is the best way to take
advantage of the nonlinearity without saturating the sigmoid. This is another reason the
sigmoid in Figure 4b is a good choice. It has maximum second derivative at $\pm 1$, which
corresponds to the binary target values typical in classification problems.

Targets
Choose target values at the point of the maximum second derivative on the sigmoid so as to
avoid saturating the output units.

³ A twisting term is a small linear term added to the node output, e.g.
$f(x) = \tanh(x) + ax$.

4.6 Initializing the Weights

$$\sigma_{y_i} = \left(\sum_j w_{ij}^2\right)^{1/2} \qquad (14)$$

Thus, to ensure that the $\sigma_{y_i}$ are approximately 1, the weights should be randomly
drawn from a distribution with mean zero and a standard deviation given by

$$\sigma_w = m^{-1/2} \qquad (15)$$

where $m$ is the number of inputs to the unit.
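The initialization rule above can be sketched as follows. This is an illustration under stated assumptions: weights are drawn from a Gaussian (the rule only fixes the mean and standard deviation), the fan-in m = 400 is hypothetical, and the inputs are taken to be uncorrelated with unit variance so that the standard deviation of the weighted sum is the square root of the sum of squared weights.

```python
import math
import random


def init_weights(m: int, rng: random.Random) -> list:
    """Draw m incoming weights with mean zero and standard deviation m**-0.5."""
    sigma_w = m ** -0.5
    return [rng.gauss(0.0, sigma_w) for _ in range(m)]


rng = random.Random(0)
m = 400                                            # hypothetical fan-in
w = init_weights(m, rng)
# Std of the weighted sum for unit-variance, uncorrelated inputs.
sigma_y = math.sqrt(sum(w_j ** 2 for w_j in w))
```

With this scaling `sigma_y` stays close to 1 regardless of the fan-in, which keeps the unit in the useful range of the sigmoid at the start of training.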
Initializing Weights
Assuming that:
1. the training set has been normalized, and
2. the sigmoid from Figure 4b has been used,
then weights should be randomly drawn from a distribution (e.g. uniform) with mean zero and
standard deviation $\sigma_w = m^{-1/2}$, where $m$ is the fan-in (the number of
connections feeding into the node).

...
- learning rates should be proportional to the square root of the number of inputs to the
unit
- weights in lower layers should typically be larger than in the higher layers

Other tricks for improving the convergence include:

Momentum

$$\Delta w(t+1) = \eta\,\frac{\partial E_{t+1}}{\partial w} + \mu\,\Delta w(t) \qquad (16)$$

...
Note that this set of rules is easy to compute and straightforward to implement. We simply
have to keep track of an additional vector in Eq. (18): the averaged gradient $r$. The norm
of this vector then controls the size of the learning rate (see Eq. (19)). The algorithm
follows the simple intuition: far away from the minimum (large distance) it proceeds in big
steps, and close to the minimum it anneals the learning rate (for theoretical details see
the cited reference).
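The momentum rule listed among the tricks above can be sketched on a one-dimensional quadratic. Two assumptions are made here and are not taken from the text: the accumulated step is subtracted from the weight (descent convention), and the values eta = 0.1, mu = 0.5 and the cost E(w) = 0.5 * h * w**2 are chosen only for illustration.

```python
def descend(eta, mu, steps, h=1.0, w0=1.0):
    """Minimise E(w) = 0.5 * h * w**2 with a momentum term of strength mu."""
    w, delta = w0, 0.0
    for _ in range(steps):
        grad = h * w
        delta = eta * grad + mu * delta   # Delta w(t+1) = eta * dE/dw + mu * Delta w(t)
        w -= delta                        # assumed sign convention: the step is subtracted
    return w


w_plain = descend(eta=0.1, mu=0.0, steps=20)      # plain gradient descent
w_momentum = descend(eta=0.1, mu=0.5, steps=20)   # same rate with momentum
```

With the same small learning rate, the run with momentum ends far closer to the minimum at w = 0 than plain gradient descent after the same number of steps.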
Fig. 5. Convergence of the flow. During the final stage of learning the average flow is
approximately one dimensional towards the minimum and it is a good approximation of the
minimum eigenvalue direction of the Hessian.

4.8 Radial Basis Functions vs Sigmoid Units
Although most systems use nodes based on dot products and sigmoids, many other types of
units (or layers) can be used. A common alternative is the radial basis function (RBF)
network. In RBF networks, the dot product of the weight and input vector is replaced with a
Euclidean distance between the input and weight, and the sigmoid is replaced by an
exponential. The output activity is computed, e.g. for one output, as

$$g(x) = \sum_{i=1}^{N} w_i \exp\left(-\frac{1}{2\sigma_i^2}\,\|x - \nu_i\|^2\right)$$

where $\nu_i$ ($\sigma_i$) is the mean (standard deviation) of the $i$-th Gaussian. These
units can replace or coexist with the standard units and they are usually trained by ...
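The output of an RBF unit as described above, a weighted sum of Gaussians of the Euclidean distance between the input and each centre, can be sketched as follows. The function name `rbf_output`, the centres, the widths and the weights are made up for illustration.

```python
import math


def rbf_output(x, weights, centers, sigmas):
    """g(x) = sum_i w_i * exp(-||x - nu_i||^2 / (2 * sigma_i^2))."""
    total = 0.0
    for w_i, nu_i, sigma_i in zip(weights, centers, sigmas):
        dist2 = sum((a - b) ** 2 for a, b in zip(x, nu_i))   # squared Euclidean distance
        total += w_i * math.exp(-dist2 / (2.0 * sigma_i ** 2))
    return total


# Two hypothetical Gaussians; the input is placed exactly on the first centre.
centers = [[0.0, 0.0], [3.0, 4.0]]
g_at_center = rbf_output([0.0, 0.0], [2.0, 1.0], centers, [1.0, 1.0])
```

At a centre the corresponding Gaussian contributes its full weight, while the distant unit contributes almost nothing, which is the local response that distinguishes RBF units from dot-product units.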
Fig. 6. Gradient descent for different learning rates. [Panels (a) to (d) plot E(ω) against
ω around ω_min for increasing learning rates, including η > η_opt and η > 2 η_opt; a
further panel plots dE/dω against ω, marking ω_c, ω_min, dE(ω_c)/dω and Δω.]
What is the optimal value of the learning rate, $\eta_{opt}$? Let us first consider the
case in 1-dimension. Assuming that $E$ can be approximated by a quadratic function,
$\eta_{opt}$ can be derived by first expanding $E$ in a Taylor series about the current
weight, $w_c$:

$$E(w) = E(w_c) + (w - w_c)\frac{dE(w_c)}{dw} + \frac{1}{2}(w - w_c)^2\frac{d^2E(w_c)}{dw^2} + \cdots, \qquad (21)$$

where we use the shorthand $\frac{dE(w_c)}{dw} \equiv \frac{dE}{dw}\big|_{w=w_c}$. If $E$
is quadratic, the second order derivative is constant and the higher order terms vanish.
Differentiating both sides with respect to $w$ then gives

$$\frac{dE(w)}{dw} = \frac{dE(w_c)}{dw} + (w - w_c)\frac{d^2E(w_c)}{dw^2}. \qquad (22)$$

Setting $w = w_{min}$ and noting that $dE(w_{min})/dw = 0$, we are left after rearranging
with

$$w_{min} = w_c - \left(\frac{d^2E(w_c)}{dw^2}\right)^{-1}\frac{dE(w_c)}{dw}. \qquad (23)$$

Comparing this with the update equation (20), we find that we can reach a minimum in one
step if

$$\eta_{opt} = \left(\frac{d^2E(w_c)}{dw^2}\right)^{-1}. \qquad (24)$$

Perhaps an easier way to obtain this same result is illustrated in Figure 6(ii). The bottom
graph plots the gradient of $E$ as a function of $w$. Since $E$ is quadratic, the gradient
is simply a straight line with value zero at the minimum and $\partial E(w_c)/\partial w$
at the current weight $w_c$. $\partial^2 E/\partial w^2$ is simply the slope of this line
and is computed using the standard slope formula

$$\frac{\partial^2 E}{\partial w^2} = \frac{\partial E(w_c)/\partial w - 0}{w_c - w_{min}}. \qquad (25)$$

Solving for $w_{min}$ then gives equation (23).
While the learning rate that gives fastest convergence is $\eta_{opt}$, the largest
learning rate that can be used without causing divergence is (also see Figure 6(i)d)

$$\eta_{max} = 2\,\eta_{opt}. \qquad (26)$$

If $E$ is not exactly quadratic then the higher order terms in equation (21) are not
precisely zero and (23) is only an approximation. In such a case, it may take multiple
iterations to locate the minimum even when using $\eta_{opt}$; however, convergence can
still be quite fast.
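The one-dimensional result can be illustrated on an exactly quadratic cost. The sketch below assumes E(w) = 0.5 * h * (w - w_min)**2 with the made-up values h = 4 and w_min = 1, so that the optimal rate is the inverse second derivative, 1/h; the helper names `step` and `run` are hypothetical.

```python
def step(w, eta, h=4.0, w_min=1.0):
    """One gradient descent step on E(w) = 0.5 * h * (w - w_min)**2."""
    return w - eta * h * (w - w_min)


def run(eta, steps=10, w=3.0):
    """Apply `steps` gradient descent steps starting from w."""
    for _ in range(steps):
        w = step(w, eta)
    return w


eta_opt = 1.0 / 4.0              # inverse of the second derivative h
w_one_step = step(3.0, eta_opt)  # lands on the minimum in a single step

w_slow = run(0.5 * eta_opt)      # eta < eta_opt: monotone approach
w_osc = run(1.5 * eta_opt)       # eta_opt < eta < 2 * eta_opt: oscillates but converges
w_div = run(2.5 * eta_opt)       # eta > 2 * eta_opt: diverges
```

The three runs reproduce the regimes of the learning-rate figure: slow monotone convergence, damped oscillation, and divergence once the rate exceeds twice the optimal value.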
In multiple dimensions, determining $\eta_{opt}$ is a bit more difficult because the right
side of (24) is a matrix $H^{-1}$, where $H$ is called the Hessian, whose components are
given by

$$H_{ij} \equiv \frac{\partial^2 E}{\partial w_i\,\partial w_j} \qquad (27)$$

with $1 \le i, j \le N$, and $N$ equal to the total number of weights. $H$ is a measure of
the curvature of $E$. In two dimensions, the lines of constant $E$ ...

Fig. 7. Lines of constant E. [Panels (a) and (b); axes ω1 and ν1, minima marked ω_min,1 and
ω_min,2.]

... where $P$ is the number of training vectors. The Hessian in this case turns out to be
the same as the covariance matrix of the inputs,

$$H = \frac{1}{P}\sum_p x^p x^{pT}.$$

Thus, each eigenvalue of $H$ is also a measure of the covariance or spread of the inputs
along the corresponding eigendirection, as shown in Figure 8.

[Figure 8: spread of the inputs in the (x1, x2) plane.]

... equation will converge.
How does this help with choosing the learning rates? Ideally we want different learning
rates along the different eigendirections. This is simple if the eigendirections are lined
up with the coordinate axes of the weights. In such a case, the weights are uncoupled and
we can assign each weight its own learning rate based on the corresponding eigenvalue.
However, if the weights are coupled then we must first rotate $H$ such that $H$ is
diagonal, i.e. the coordinate axes line up with the eigendirections (see Figure 7b). This
is the purpose of diagonalizing the Hessian discussed earlier.
Let $\Theta$ be the rotation matrix such that

$$\Lambda = \Theta H \Theta^T$$

where $\Lambda$ is diagonal and $\Theta^T \Theta = I$. The cost function then can be
written as ...

$$\eta < \frac{2}{\lambda_{max}}$$

in order to avoid divergence, where $\lambda_{max}$ is the largest eigenvalue of $H$. For
fastest convergence we have

$$\eta_{opt} = \frac{1}{\lambda_{max}}.$$

If $\lambda_{min}$ is a lot smaller than $\lambda_{max}$ then convergence will be very slow
along the $\lambda_{min}$ direction. In fact, convergence time is proportional to the
condition number ...
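For a linear unit trained with squared error, the relation between the input covariance, its eigenvalues and the usable learning rates can be sketched in two dimensions. The inputs below are made up so that the spread is much larger along one axis; the helper names `input_covariance` and `eigenvalues_2x2` are hypothetical, and the closed-form eigenvalue formula applies to symmetric 2x2 matrices only.

```python
import math


def input_covariance(xs):
    """H = (1/P) * sum_p x^p x^p^T for 2-D inputs, returned as (h11, h12, h22)."""
    P = len(xs)
    h11 = sum(x[0] * x[0] for x in xs) / P
    h12 = sum(x[0] * x[1] for x in xs) / P
    h22 = sum(x[1] * x[1] for x in xs) / P
    return h11, h12, h22


def eigenvalues_2x2(h11, h12, h22):
    """Eigenvalues of a symmetric 2x2 matrix, largest first."""
    mean = 0.5 * (h11 + h22)
    radius = math.sqrt((0.5 * (h11 - h22)) ** 2 + h12 ** 2)
    return mean + radius, mean - radius


# Made-up inputs, widely spread along the first axis and narrowly along the second.
xs = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.2), (0.0, -0.2)]
lam_max, lam_min = eigenvalues_2x2(*input_covariance(xs))
eta_opt = 1.0 / lam_max          # fastest convergence
eta_limit = 2.0 / lam_max        # learning rates above this diverge
condition_number = lam_max / lam_min
```

Here the eigenvalues are 0.5 and 0.02, so the condition number is 25: a rate tuned to the widely spread direction converges 25 times more slowly along the narrow one.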
0. %1
Linear Network. The figure displays a set of examples drawn from two Gaussian-distributed classes centered at (-0.4, -0.8) and (0.4, 0.8). The eigenvalues ...

[Figure: Two classes drawn from Gaussian distributions centered at (-0.4, -0.8) and (0.4, 0.8), shown together with the simple linear network (inputs x_0, x_1; weights ω_0, ω_1, ω_2; output y).]
The figure for stochastic learning shows the same example using stochastic instead of batch mode learning. Here, a learning rate of η = 0.2 is used. One can see that the trajectory is much noisier than in batch mode since only an estimate of the gradient is used at each iteration. The cost is plotted as a function of epoch. An epoch here is simply defined as a fixed number of input presentations which, for stochastic learning, corresponds to the same number of weight updates. In batch, an epoch corresponds to one weight update.

[Figure: Weight trajectory (weight space) and error curve (log MSE in dB versus epochs) during learning for (a) η = 1.5 and (b) η = 2.5.]
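
The difference between the two epoch definitions can be made concrete with a small sketch. This is an illustration only: the data set, the true weights and the helper names are assumptions, and a noise-free linear target is used so that both procedures have a well-defined minimum.

```python
import random


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, data):
    return sum((d - predict(w, x)) ** 2 for x, d in data) / (2 * len(data))


def batch_epoch(w, data, eta):
    """Batch mode: accumulate the gradient over all patterns, update once."""
    grad = [0.0] * len(w)
    for x, d in data:
        err = d - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] -= err * xi / len(data)
    return [wi - eta * gi for wi, gi in zip(w, grad)]


def stochastic_epoch(w, data, eta, rng):
    """Stochastic mode: update the weights after every single presentation."""
    order = list(data)
    rng.shuffle(order)
    for x, d in order:
        err = d - predict(w, x)
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w


rng = random.Random(0)
true_w = [0.5, -0.3]  # hypothetical target weights
inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
data = [(x, predict(true_w, x)) for x in inputs]

eta, epochs = 0.2, 10
w_batch, w_sgd = [0.0, 0.0], [0.0, 0.0]
batch_updates = sgd_updates = 0
for _ in range(epochs):
    w_batch = batch_epoch(w_batch, data, eta)
    batch_updates += 1            # one weight update per epoch
    w_sgd = stochastic_epoch(w_sgd, data, eta, rng)
    sgd_updates += len(data)      # one weight update per presentation

mse_init = mse([0.0, 0.0], data)
mse_batch = mse(w_batch, data)
mse_sgd = mse(w_sgd, data)
print(batch_updates, sgd_updates, mse_init, mse_batch, mse_sgd)
```

For the same number of epochs the stochastic run performs as many updates per epoch as there are patterns, which is why its error curve moves much faster when plotted against epochs.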
[Figure: weight trajectory and error curve (log MSE in dB versus epochs) during stochastic learning.]
and 2 biases. The activation function is $f(x) = 1.71 \tanh(\tfrac{2}{3} x)$. The training set contains examples from each of 2 classes. Both classes are Gaussian distributed with the same standard deviation. Class 1 has a mean of -1 and class 2 has a mean of +1. Target values are -1 for class 1 and +1 for class 2. The figure shows the stochastic trajectory for the example.

[Figure: The minimal multilayer network, with weights ω_0, ω_1, ω_2, ω_3 and output y.]
Input Transformations and Error Surface Transformations Revisited

We can use the results of the previous section to justify several of the tricks discussed earlier.
Subtract the means from the input variables

The reason for the above trick is that a nonzero mean in the input variables creates a very large eigenvalue. This means the condition number will be large, i.e., the cost surface will be steep in some directions and shallow in others, so that convergence will be slow.

If the input variables are correlated, normalizing their variances will not make the error surface spherical, but it will possibly reduce its eccentricity.

Correlated input variables usually cause the eigenvectors of H to be rotated away from the coordinate axes (compare panels (a) and (b) of the figure showing the spread of the inputs); thus weight updates are not decoupled. Decoupled weights make the "one learning rate per weight" method optimal; thus, we have the following trick:

Decorrelate the input variables
Now suppose that the input variables of a neuron have been decorrelated; the Hessian for this neuron is then diagonal and its eigenvectors point along the coordinate axes. In such a case the gradient is not the best descent direction, as can be seen in panel (b) of the same figure. At the point P, an arrow shows that the gradient does not point towards the minimum. However, if we instead assign each weight its own learning rate (equal to the inverse of the corresponding eigenvalue), then the descent direction will be in the direction of the other arrow, which points directly towards the minimum:

Use a separate learning rate for each weight
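
A small sketch may help to see why one learning rate per weight is optimal once the Hessian is diagonal. The curvatures and the position of the minimum below are hypothetical values, and the cost is assumed to be exactly quadratic.

```python
def descend(w, w_min, curvatures, rates, steps):
    """Gradient descent on E(w) = 0.5 * sum(h_i * (w_i - w_min_i)**2)."""
    for _ in range(steps):
        grad = [h * (wi - mi) for h, wi, mi in zip(curvatures, w, w_min)]
        w = [wi - r * g for wi, r, g in zip(w, rates, grad)]
    return w


curvatures = [4.0, 0.25]  # hypothetical diagonal Hessian entries (eigenvalues)
w_min = [1.0, -2.0]       # hypothetical location of the minimum
w0 = [0.0, 0.0]

# A single global rate has to respect the largest curvature.
global_rate = 1.0 / max(curvatures)
w_global = descend(w0, w_min, curvatures, [global_rate] * 2, steps=1)

# One rate per weight, equal to the inverse of its own eigenvalue.
per_weight_rates = [1.0 / h for h in curvatures]
w_separate = descend(w0, w_min, curvatures, per_weight_rates, steps=1)

print("global rate   :", w_global)
print("separate rates:", w_separate)
```

With separate rates the single step lands on the minimum in both coordinates, whereas the global rate only solves the steep coordinate and barely moves along the flat one.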
Classical second order optimization methods

In the following we will briefly introduce the Newton, conjugate gradient, Gauss-Newton, Levenberg-Marquardt and Quasi-Newton (BFGS) methods (see also the references).

The Newton algorithm is impractical for more than a few variables. Since the error function is in general non-quadratic, the Newton algorithm in its original form is not usable for general neural network learning. However, it gives good insights for developing more sophisticated algorithms, as discussed in the following.
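[Figure: sketch of a network and its transformed counterpart at the network input, with weights ω, rotation Θ, scaling Λ^{-1/2} and transformed weights U.]

In conjugate gradient the search direction combines the current gradient with the previous direction, $\rho_k = -\nabla E(w_k) + \beta_k \rho_{k-1}$, where $\beta_k$ can be chosen according to Fletcher and Reeves,

$$\beta_k = \frac{\nabla E(w_k)^T \nabla E(w_k)}{\nabla E(w_{k-1})^T \nabla E(w_{k-1})},$$

or Polak and Ribiere,

$$\beta_k = \frac{(\nabla E(w_k) - \nabla E(w_{k-1}))^T \nabla E(w_k)}{\nabla E(w_{k-1})^T \nabla E(w_{k-1})}.$$

Two directions $\rho_k$ and $\rho_{k-1}$ are defined as conjugate if

$$\rho_k^T H \rho_{k-1} = 0,$$

i.e., conjugate directions are orthogonal directions in the space of an identity Hessian matrix (see the figure). Very important for convergence in both choices is a good line search procedure.

[Figure: conjugate directions ρ_{k-1} and ρ_k at the point ω_k.]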
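
The conjugacy condition can be illustrated on a small quadratic problem. This is a sketch under stated assumptions: the cost is $E(w) = \tfrac{1}{2} w^T A w - b^T w$ with a hypothetical 2x2 matrix A standing in for the Hessian, the line search is exact (which is only possible because the cost is quadratic), and the Polak-Ribiere formula is used for $\beta_k$.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, w, iterations):
    """Minimize 0.5 * w^T A w - b^T w with Polak-Ribiere directions."""
    g = [gi - bi for gi, bi in zip(matvec(A, w), b)]   # gradient A w - b
    rho = [-gi for gi in g]                            # first direction
    directions = [rho]
    for _ in range(iterations):
        A_rho = matvec(A, rho)
        eta = -dot(g, rho) / dot(rho, A_rho)           # exact line search
        w = [wi + eta * ri for wi, ri in zip(w, rho)]
        g_new = [gi + eta * ari for gi, ari in zip(g, A_rho)]
        diff = [gn - go for gn, go in zip(g_new, g)]
        beta = dot(diff, g_new) / dot(g, g)            # Polak-Ribiere
        rho = [-gn + beta * ri for gn, ri in zip(g_new, rho)]
        g = g_new
        directions.append(rho)
    return w, directions


A = [[3.0, 1.0], [1.0, 2.0]]  # hypothetical Hessian
b = [1.0, 1.0]
w_final, directions = conjugate_gradient(A, b, [0.0, 0.0], iterations=2)
conjugacy = dot(directions[1], matvec(A, directions[0]))
print(w_final, conjugacy)
```

In two dimensions two conjugate steps reach the minimum of the quadratic, and the product $\rho_1^T A \rho_0$ vanishes up to rounding. For a real network the line search has to be done numerically, which is why its quality matters so much.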
The positive definite estimate of the inverse Hessian is done directly without requiring matrix inversion and by only using gradient information. Algorithmically this can be described as follows: (1) first a positive definite matrix M is chosen, e.g. M = I; (2) then the search direction is set to

$$\rho(t) = M(t) \nabla E(w(t));$$

(3) a line search is performed along ρ, which gives the update for the parameters at time t,

$$w(t) = w(t-1) - \eta(t) \rho(t).$$

Finally (4) the estimate of the inverse Hessian is updated. Compared to the Newton algorithm, the Quasi-Newton approach only needs gradient information. The most successful Quasi-Newton algorithm is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. The update rule for the estimate of the inverse Hessian is

$$M(t) = M(t-1) + \left(1 + \frac{\phi^T M \phi}{\delta^T \phi}\right) \frac{\delta \delta^T}{\delta^T \phi} - \frac{\delta \phi^T M + M \phi \delta^T}{\delta^T \phi},$$

where $\phi = \nabla E(w(t)) - \nabla E(w(t-1))$, $\delta = w(t) - w(t-1)$, and M on the right-hand side stands for M(t-1).
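
The BFGS update of the inverse Hessian estimate can be sketched in a few lines. This is an illustrative implementation of the standard update, with δ the change in weights and φ the change in gradient; the 2x2 matrix A used to generate φ is a hypothetical Hessian, and the check at the end is the secant condition that the updated estimate maps φ back to δ.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [sum(a * x for a, x in zip(row, v)) for row in M]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def bfgs_update(M, delta, phi):
    """Return the updated inverse-Hessian estimate.

    M     : current symmetric estimate of the inverse Hessian
    delta : w(t) - w(t-1)
    phi   : grad E(w(t)) - grad E(w(t-1))
    """
    n = len(delta)
    dp = dot(delta, phi)
    M_phi = matvec(M, phi)
    scale = (1.0 + dot(phi, M_phi) / dp) / dp
    dd = outer(delta, delta)
    d_Mphi = outer(delta, M_phi)   # delta phi^T M (M is symmetric)
    Mphi_d = outer(M_phi, delta)   # M phi delta^T
    return [[M[i][j] + scale * dd[i][j] - (d_Mphi[i][j] + Mphi_d[i][j]) / dp
             for j in range(n)] for i in range(n)]


A = [[3.0, 1.0], [1.0, 2.0]]       # hypothetical true Hessian
identity = [[1.0, 0.0], [0.0, 1.0]]
delta = [1.0, 0.0]                 # a step in weight space
phi = matvec(A, delta)             # gradient change for a quadratic cost

M_new = bfgs_update(identity, delta, phi)
secant = matvec(M_new, phi)        # should reproduce delta
print(M_new, secant)
```

Only gradients enter the update; no matrix is inverted. Storing and updating M still costs memory and time quadratic in the number of weights, which limits the method to small networks.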
In the Gauss-Newton algorithm the Hessian is approximated by the square of the Jacobian (see also the later section on the square Jacobian approximation for further discussion):

$$\Delta w = \left( \sum_p \left(\frac{\partial f(w, x_p)}{\partial w}\right)^T \frac{\partial f(w, x_p)}{\partial w} \right)^{-1} \nabla E(w).$$

The Levenberg-Marquardt method is like the Gauss-Newton above, but it has a regularization parameter μ that prevents it from blowing up if some eigenvalues are small:

$$\Delta w = \left( \sum_p \left(\frac{\partial f(w, x_p)}{\partial w}\right)^T \frac{\partial f(w, x_p)}{\partial w} + \mu I \right)^{-1} \nabla E(w),$$

where I denotes the unity matrix. The Gauss-Newton method is valid for quadratic cost functions; however, a similar procedure also works with the Kullback-Leibler cost and is called Natural Gradient (see e.g. the references on natural gradient).
Tricks to compute the Hessian information in multilayer networks

Finite Difference

We can write the k-th line of the Hessian as

$$H^{(k)} = \frac{\partial (\nabla E(w))}{\partial w_k} \approx \frac{\nabla E(w + \delta \phi_k) - \nabla E(w)}{\delta},$$

where $\phi_k = (0, 0, \ldots, 1, \ldots, 0)$ is a vector of zeros with a single 1 at the k-th position. This can be implemented with a simple recipe: (1) compute the total gradient by forward and backward propagation over the training set, (2) add δ to the k-th parameter and compute the gradient again, and (3) subtract both results and divide by δ. Due to numerical errors in this computation scheme the resulting Hessian might not be perfectly symmetric. In this case it should be symmetrized as described below.
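
The finite-difference recipe can be sketched as follows. The gradient function is a hypothetical stand-in for a backprop routine (it is the gradient of the toy cost $w_0^2 w_1 + 3 w_1^2$), and the symmetrization used here, averaging the matrix with its transpose, is one common choice rather than necessarily the one intended by the text.

```python
def finite_difference_hessian(grad_fn, w, delta=1e-5):
    """Estimate the Hessian line by line from gradient differences."""
    n = len(w)
    g0 = grad_fn(w)                       # (1) gradient at w
    rows = []
    for k in range(n):
        shifted = list(w)
        shifted[k] += delta               # (2) perturb the k-th parameter
        gk = grad_fn(shifted)
        rows.append([(a - b) / delta for a, b in zip(gk, g0)])  # (3)
    # Average with the transpose to remove small asymmetries.
    return [[0.5 * (rows[i][j] + rows[j][i]) for j in range(n)]
            for i in range(n)]


def example_gradient(w):
    """Gradient of the toy cost E(w) = w0**2 * w1 + 3 * w1**2."""
    return [2.0 * w[0] * w[1], w[0] ** 2 + 6.0 * w[1]]


H = finite_difference_hessian(example_gradient, [1.0, 2.0])
print(H)  # the exact Hessian at (1, 2) is [[4, 2], [2, 6]]
```

The cost is one extra gradient evaluation per parameter, so this is only practical for small networks or for checking other approximations.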
Square Jacobian approximation for the Gauss-Newton and Levenberg-Marquardt algorithms

Assuming a mean squared cost function

$$E(w) = \frac{1}{2} \sum_p (d_p - f(w, x_p))^T (d_p - f(w, x_p)),$$

then the gradient is

$$\frac{\partial E(w)}{\partial w} = -\sum_p (d_p - f(w, x_p))^T \frac{\partial f(w, x_p)}{\partial w}$$

and the Hessian follows as

$$H(w) = \sum_p \left(\frac{\partial f(w, x_p)}{\partial w}\right)^T \frac{\partial f(w, x_p)}{\partial w} - \sum_p (d_p - f(w, x_p))^T \frac{\partial^2 f(w, x_p)}{\partial w \partial w}.$$

A simplifying approximation of the Hessian is the square of the Jacobian, which is a positive semi-definite matrix of dimension N × N:

$$H(w) \approx \sum_p \left(\frac{\partial f(w, x_p)}{\partial w}\right)^T \frac{\partial f(w, x_p)}{\partial w},$$

where the second term of the Hessian above was dropped. This is equivalent to assuming that the network is a linear function of the parameters w. Again this is readily implemented for the k-th column of the Jacobian: for all training patterns, (1) we forward propagate, then (2) set the activity of the output units to 0 and only the k-th output to 1, (3) a backpropagation step is taken and the gradient is accumulated.
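
To show how the square Jacobian enters a Levenberg-Marquardt step, here is a minimal sketch for a model with a single parameter, so that the matrix to invert is a scalar. The model f(w, x) = tanh(w x), the data, the starting point and the accept/reject rule for μ are all assumptions made for illustration.

```python
import math


def model(w, x):
    return math.tanh(w * x)


def jacobian(w, x):
    """Derivative of the model output with respect to the parameter w."""
    return x * (1.0 - math.tanh(w * x) ** 2)


def cost(w, data):
    return 0.5 * sum((d - model(w, x)) ** 2 for x, d in data)


def levenberg_marquardt(w, data, mu=1.0, iterations=50):
    for _ in range(iterations):
        grad = -sum((d - model(w, x)) * jacobian(w, x) for x, d in data)
        jtj = sum(jacobian(w, x) ** 2 for x, _ in data)  # square Jacobian
        step = grad / (jtj + mu)       # (J^T J + mu I)^{-1} grad, scalar case
        candidate = w - step
        if cost(candidate, data) < cost(w, data):
            w, mu = candidate, mu / 10.0   # accept, move toward Gauss-Newton
        else:
            mu *= 10.0                     # reject, regularize more strongly
    return w


true_w = 1.5  # hypothetical parameter used to generate the targets
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
data = [(x, model(true_w, x)) for x in xs]

w_fit = levenberg_marquardt(0.5, data)
print(w_fit)
```

A large μ turns the step into a small gradient step, while a small μ recovers the Gauss-Newton step. For a network with N weights the scalar division becomes the solution of an N × N linear system.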
Backpropagating second derivatives

Let us consider a multi-layer system with some functional blocks with $N_i$ inputs, $N_o$ outputs and N parameters of the form O = F(W, X). Now assume we know $\partial^2 E / \partial O^2$, which is an $N_o \times N_o$ matrix. Then it is straightforward to compute the matrix

$$\frac{\partial^2 E}{\partial W^2} = \left(\frac{\partial O}{\partial W}\right)^T \frac{\partial^2 E}{\partial O^2} \frac{\partial O}{\partial W} + \frac{\partial E}{\partial O} \frac{\partial^2 O}{\partial W^2}.$$

[Figure: Backpropagating the diagonal Hessian: sigmoids (left) and RBFs (right). Labels: z, f( ), y.]
Computing the product of the Hessian and a vector

For an arbitrary vector Ψ, the product HΨ can be approximated by finite differences,

$$H \Psi \approx \frac{1}{\alpha} \left( \frac{\partial E}{\partial W}(W + \alpha \Psi) - \frac{\partial E}{\partial W}(W) \right),$$

using only two gradient computations (at the points W + αΨ and W respectively), which can be readily computed with backprop (α is a small constant).

This method can be applied to compute the principal eigenvector and eigenvalue of H by the power method. By iterating and setting

$$\Psi(t+1) = \frac{H \Psi(t)}{\|\Psi(t)\|},$$

the vector Ψ(t) will converge to the largest eigenvector of H and ‖Ψ(t)‖ to the corresponding eigenvalue. See also the references for an even more accurate method that (1) does not use finite differences and (2) has similar complexity.
Analysis of the Hessian in multi-layer networks

It is interesting to understand how some of the tricks shown previously influence the Hessian, i.e., how the Hessian changes with the architecture and the details of the implementation. Typically, the eigenvalue distribution of the Hessian looks like the one sketched in the histogram figure: a few small eigenvalues, many medium ones and few very large ones. We will now argue that the large eigenvalues will cause the trouble in the training process.

[Figure: Eigenvalue spectrum (log10 eigenvalue versus eigenvalue order) in a 4 layer shared weights network (256-128-64-10) trained on handwritten digits.]
the last layer. This affects the learning speed and can provide an ingredient to explain the slow learning in lower layers and the fast (sometimes oscillating) learning in the last layer. A trick to compensate for this different scale of learning is to use the inverse diagonal Hessian to control the learning rate (see also the section on applying second order methods to multilayer networks).
Applying Second Order Methods to Multilayer Networks

Before we concentrate in this section on how to tailor second order techniques for training large networks, let us first repeat some rather pessimistic facts about applying classical second order methods. Techniques using full Hessian information (Gauss-Newton, Levenberg-Marquardt and BFGS) can only apply to very small networks trained in batch mode; however, those small networks are not the ones that need speeding up the most. Most second order methods (conjugate gradient, BFGS, ...) require a line search and can therefore not be used in the stochastic mode. Many of the tricks discussed previously apply only to batch learning. From our experience we know that a carefully tuned stochastic gradient descent is hard to beat on large classification problems. For smaller problems that require accurate real-valued outputs, like in function approximation or control problems, we see that conjugate gradient (with the Polak-Ribiere formula) offers the best combination of speed, reliability and simplicity.
[Figure: Eigenvalue spectrum (number of eigenvalues versus eigenvalue magnitude) in a 4 layer shared weights network (256-128-64-10) trained on handwritten digits. The few largest eigenvalues are marked "Big killers".]
[Figure: weight-space trajectory and error curve (log MSE in dB versus epochs) for a network with three parameters (2 weights, 1 bias), with learning rates η0 = 0.12, η1 = 0.03, η2 = 0.02. The panel also lists the largest Hessian eigenvalue and the batch learning rate.]

[Figure: the same plots for learning rates η0 = 0.76, η1 = 0.18, η2 = 0.12.]
Power Method. We repeat the result of our earlier discussion: starting from a random initial vector Ψ, the iteration

$$\Psi_{new} = \frac{H \Psi_{old}}{\|\Psi_{old}\|}$$

will eventually converge to the principal eigenvector (or a vector in the principal eigenspace) and $\|\Psi_{old}\|$ will converge to the corresponding eigenvalue.

Taylor Expansion. Another method makes use of the fact that small perturbations of the gradient also lead to the principal eigenvector of H:

$$\Psi_{new} = \frac{1}{\alpha} \left( \frac{\partial E}{\partial W}\left(W + \alpha \frac{\Psi_{old}}{\|\Psi_{old}\|}\right) - \frac{\partial E}{\partial W}(W) \right),$$

where α is a small constant. One iteration of this procedure requires two forward and two backward propagation steps for each pattern in the training set.
Online Computation of Ψ. The following rule makes use of the running average to obtain the largest eigenvalue of the average Hessian very fast:

$$\Psi_{new} = (1 - \gamma) \Psi_{old} + \gamma \frac{1}{\alpha} \left( \frac{\partial E^p}{\partial W}\left(W + \alpha \frac{\Psi_{old}}{\|\Psi_{old}\|}\right) - \frac{\partial E^p}{\partial W}(W) \right).$$

To summarize, the eigenvalue/vector computation proceeds as follows:

1. a random vector is chosen for the initialization of Ψ,
2. an input pattern is presented with its desired output, a forward and backward propagation step is performed and the gradients G(W) are stored,
3. $\alpha \Psi_{old} / \|\Psi_{old}\|$ is added to the current weight vector W,
4. a forward and backward propagation step is performed with the perturbed weight vector and the gradients G(W') are stored,
5. the difference $\frac{1}{\alpha}(G(W') - G(W))$ is computed and the running average of the eigenvector is updated,
6. we loop from (2) to (6) until a reasonably stable result is obtained for Ψ,
7. the optimal learning rate is then given as

$$\eta_{opt} = \frac{1}{\|\Psi\|}.$$
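
The online procedure summarized above can be sketched for a single linear unit, for which the per-pattern gradient is available in closed form. Everything concrete here is an assumption made for illustration: the four patterns, the values of γ and α, and the fixed (rather than random) initial Ψ, which keeps the run repeatable. For these patterns the average Hessian is diagonal with entries 2 and 0.5.

```python
import math


def pattern_gradient(w, x, d):
    """Gradient of the per-pattern cost 0.5 * (d - w.x)**2 for a linear unit."""
    err = d - sum(wi * xi for wi, xi in zip(w, x))
    return [-err * xi for xi in x]


def online_largest_eigenvalue(patterns, w, gamma=0.01, alpha=0.01, sweeps=1000):
    psi = [1.0] * len(w)                                    # step 1
    for _ in range(sweeps):
        for x, d in patterns:
            g = pattern_gradient(w, x, d)                   # step 2
            length = math.sqrt(sum(p * p for p in psi))
            w_pert = [wi + alpha * p / length
                      for wi, p in zip(w, psi)]             # step 3
            g_pert = pattern_gradient(w_pert, x, d)         # step 4
            diff = [(a - b) / alpha
                    for a, b in zip(g_pert, g)]             # step 5
            psi = [(1.0 - gamma) * p + gamma * q
                   for p, q in zip(psi, diff)]
    return math.sqrt(sum(p * p for p in psi))               # ||psi||


patterns = [([2.0, 0.0], 1.0), ([-2.0, 0.0], -1.0),
            ([0.0, 1.0], 0.5), ([0.0, -1.0], -0.5)]
estimate = online_largest_eigenvalue(patterns, [0.0, 0.0])
eta_opt = 1.0 / estimate                                    # step 7
print(estimate, eta_opt)
```

The estimate settles close to the largest eigenvalue of the average Hessian, with a small residual fluctuation whose size is controlled by γ.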
In the figure below we see the evolution of the eigenvalue as a function of the number of pattern presentations for a neural network in a handwritten character recognition task. In practice we adapt the leak size of the running average in order to get fewer fluctuations (as also indicated on the figure). In the figure we see that after relatively few pattern presentations the correct order of magnitude for the eigenvalue, i.e. the learning rate, is reached. From the experiments we also observe that the fluctuations of the average Hessian over training are small.

[Figure: Evolution of the eigenvalue as a function of the number of pattern presentations (curves for γ = 0.003 and γ = 0.01) for a shared weight network with 5 layers, 64638 connections and 1278 free parameters. The training set consists of 1000 handwritten digits.]
In the following two figures we start with the same initial conditions, and perform a fixed number of epochs with learning rates computed by multiplying the predicted learning rate by a predefined constant. Choosing constant 1 (i.e. using the predicted optimal rate) always gives residual errors which are very close to the error achieved by the best choice of the constant. In other words, the "predicted optimal rate" is optimal enough.

[Figure: Mean squared error after 1 to 5 epochs as a function of the ratio between learning rate and predicted optimal learning rate for a fully connected network (784 × 30 × 10). The training set consists of 300 handwritten digits.]

[Figure: Mean squared error after 1 to 5 epochs as a function of the ratio between learning rate and predicted optimal learning rate for a shared weight network with 5 layers (1024 × 1568 × 392 × 400 × 100 × 10), 64638 (local) connections and 1278 free parameters (shared weights). The training set consists of 1000 handwritten digits.]
Discussion and Conclusion

According to the recommendations mentioned above, a practitioner facing a multi-layer neural net training problem would go through the following steps:

- shuffle the examples
- center the input variables by subtracting the mean
- normalize the input variables to a standard deviation of 1
- if possible, decorrelate the input variables
- pick a network with the sigmoid function recommended earlier
- set the target values within the range of the sigmoid, typically +1 and -1
- initialize the weights to random values as prescribed earlier

The preferred method for training the network should be picked as follows:

- if the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg-Marquardt method
- if the training set is not too large, or if the task is regression, use conjugate gradient

Classical second-order methods are impractical in almost all useful cases. The non-linear dynamics of stochastic gradient descent in multi-layer neural networks, particularly as it pertains to generalization, is still far from being well understood. More theoretical work and systematic experimental work is needed.
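
The first three preprocessing steps of the list above can be sketched as follows. This is a minimal illustration on a hypothetical four-example data set; decorrelation is left out because it requires an eigendecomposition of the input covariance matrix.

```python
import math
import random


def preprocess(examples, seed=0):
    """Shuffle the examples, center each input variable, scale it to unit std."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)                                   # shuffle
    n, dim = len(shuffled), len(shuffled[0])
    means = [sum(x[i] for x in shuffled) / n for i in range(dim)]
    centered = [[x[i] - means[i] for i in range(dim)]
                for x in shuffled]                          # subtract the mean
    stds = [math.sqrt(sum(x[i] ** 2 for x in centered) / n) or 1.0
            for i in range(dim)]                            # guard constant inputs
    return [[x[i] / stds[i] for i in range(dim)]
            for x in centered]                              # unit std


examples = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 40.0]]  # hypothetical
processed = preprocess(examples)

column_means = [sum(x[i] for x in processed) / len(processed) for i in range(2)]
column_vars = [sum(x[i] ** 2 for x in processed) / len(processed) for i in range(2)]
print(column_means, column_vars)
```

In practice the means and standard deviations would be computed on the training set only and then reused unchanged for validation and test inputs.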
Acknowledgments. Y.L., L.B. and K.-R.M. gratefully acknowledge mutual exchange grants from DAAD and NSF.

References
~
©
©!t)Z]\rbeiש4)_W1bM\k_n\bej1i¦j1^ikjºgOshbelfEse be_nq>\rbM\;Z_Ese_7bRghq>\fW_Wg jl\se bM\;,beik_WZ]\jljli¦\j
^bM\
i¦_Ej6sn©YÀj i¦fMo>\_Wv )© d
W_Ebn} ikfMo>\;_E/YM©*
d;bM1\;j*},\j>u®old;Z]\;ga`_7seghf7o1_
}*_n1ised;begW}
}'
d
klZ_ }'ql\;^;_R~r¢
.©.u®o1_ YOu
`cbe _Wgh
gW}9
~n
6©
"!$#%&
'()# +*,#"
- .0/21-'3%+
¢6©
©vt)Z]\beiש R\rse1bM\;^;bM\;1ik_Wj.s®dbe6g5_ 4fWik_Wj.se i¦j k_n\bej1i¦j1^1©
768# %9':
'()# }
~W ¢ 7¯ ¢;¡6~ ; ¢
}%~n;
¤6©
©am©
®\shseiseiש*ibegOsh|c\;jl@gh_WfWd;j> |»dbM1_EbvZ_Eseold6 gc°Jd;bvk_n\bej1ikjl^1¯._Es[_W_Wj]gOse_W_Wq_EgOs® _WghfW_Wj.s
\j>Ùjl_Esed;=j < gZ_Eseold6!© }>£1¯k~E£1~ ;l~ }c~n;'¢.©
£1©
©>v_Wf7
_Eb)\;j>A @D©>y*_r lj*>©YÀZ
q1?be68dr6# ik%jl^D9se':o1
_a'()fW# d
j6'_7be^
_Wj1fW_d;°%{>\f76{ bed
q>\^'\rsei¦d;jk_n\bej1i¦j1^
iseo¾gh_WfWd
jl3d61_7bZ_Eseold1gW©YÀj¾pa\n6i¦3u,d;1be_EseE6.} _Ed;¿be_E ikj.sed
j*},\j>¾u,_Ebhbe_Wj1fW_
6_ Oj1dngh i×}l_n1ised;begW} }
ql\;^;_Wg)¢;R ;.'.©!y,\Wbe_W*j1fWB_ #"
% be k{>\. l0ZÚ#B!Ct'ghD2gh&d.fWEi¦\F seGH_WGIgW}968~n#
2¤;J ©
5'()#2K-'MLN#"5 O/2%+%7/P
DQ#"#
¡6©
)© ©.ikghold;q9© SB
Q'(TM#-UV!$# M*
'('3XWY
-#-. 2K'((# ©, ¦\be_Wjl1d;j`cbe_WghgW}1)½6°JdbM*}
~W
¡6©
©y%©ndshsed
*©64jlkikjl_/\¦^;d;beiseolZg,\;jlRgOsed.f7ol\;gOseikf\qlq bedr½ ikZ]\seikd;jlgW©;Y[japa\n6i¦a1\;\
*}W_n1ised;bn}
}
®Z \Z: KP{ beiU [\^
_
}*Q~n;K
.]¤6©Ku®^ol_a)_E
,sed
>j'(Y[T?jl# gOseiUse` se_a_EFH F_Ebebdik_WcAgW}># ®\;UZDBDQ{1#be9di¦1
^;'0_a'4DQAj1i¦
_EbeTghi':s[# e`% be2_Egh-gW'(K© '('3af
6©apD©1!©1vbed.d
Zo1_n\
Ù\;jlÙpD©ly*dn®_
© 1sei¦;\bei¦\{lk_R°J1jlfEseikd
jÙikj.se_Ebeqd
¦\seikd;j\;jl\;l\;q seik'_
¤ © j1_Es[©*y/d;©*be6gW1© j668seik# j1%_:9P\ j> /Qt
1 ©9-'3© %+ }_W¢6ik^
¯ '_Ej>¢.~*;6©
¡
¡.d
}/Z~Wql
¤;se¤ ik© jl^]gh_WfWd;j>d;bM1_7b1_Ebeik;\seik'_Wg4i¦j¾l_W_W |
1d;bh\rbM3jl_7s[®dbe6gW¯,tbe_W6ik_Ea© 2B
2
H
5'((# 2X#
>'(T?# U }~W
6©u,d
©
\ )ql©!qpR_n\r\bnbe©'_Wjº\j>1© © d.d6 .©a)d;se_:d;j¦_W\bejlikj1^ÙbM\rse_:ghfMol_n lk_Wg4°?d;bagOsed.fMo>\gOsei¦f:d
q1sei|
*,#"
--K./21-'3%+ =*,#"
- K. >#B!C'D20E FGHG 68# "!"P
}1q>\;^;_Wg£'" ;.£'¤6}! \;j \se_Wd }
~W¤ C© @D ®t©ry9}9_n~W
l¤;j9© © LNd#"be^'5\ j¬
-# \;Q1J°?Z] \;)#j12j92© '3YH
9H9-Q'(K-
"._3
-# 2P
5'()# 2 '\ -
QK.O%#"
~W C© @D5 :©;f y9©v_r`/ o11pTj9© se ol_E_Eghjli¦_EgWbM}1\4¦ikj1n\rik'se_Eikd
beghjÙis \;_jl`:©ljl_E_Es sÀ® d©beD 1 be_Wikgh_ ik^
j`cgO\shbebMikg\rse4_W^
Y ik7_W}9gW~W©
YÀ¤
j:6m© ©'`c°J_Eik°?_Ebn} ©.6f7o be_Ese_Ebn}
_W1ised;bn}
YK>B
2 "!$# %
'()# &*B#"
--K./Q1 -'3%+ \# }'ql\;^;_Wg¡;;R¤ ;
'¡.}
¢;¢6©C@D~W
©6;y9 _n© lj9}6YM©.¬\;j.se_Ebn}.\;j>Ù© t
© 6d
k¦\ ©%6_WfWd;j>d;bM _Eb®q bed
q_Ebhseik_Wg®d°,_7bhbed;bgh bh°×\;fE_WgW©cY[j
YK >
Q "!$# %
'()# *,#"
--K.&/21-'3%+ # } 1\j \rse_Wd1}1 ®t}>~n; ~;©
¢ C© @D©1dy9be^'_r\ j1¬j9}1\;`1S©°?Z]@D\;©lj1 j9ikZ]© \bM!}l\j>)©1`_n\bekZ: shse_Ebn©%t1sed
Z]\rseikfRk_n\bej1i¦j1^:bM\se_4Z]\½ ikZi¦W\|
P>'(T?# U } ¯ ¡;¢
"¡ ;6¡;; }c~n;
6©
"!
¢;¡6© © k¦_7bn© 1q_Ebe6i¦gh_WÁk_n\rbejlikjl^ºd
jÁ¦\be^;_be_n ljll\;j.s
shbM\;ikjlikj1^¾gh_EsegW© 2'3 P
'()#P
}>£ ~7¯k~n¡"; ¢;¡6}c~W
6©
"!
¢ © ©#S © P
\d.d6#B!6\;jlSB
)=©/Q11 ©r-'3pa%0\rbe '_Wj9 ©;>\gOsk_n\rbejlikjl^ikj
j1_Es[d;be6g,d° ¦d.fW\;k |Ýse1jl_naq1bed.fE_Wghghikjl^
1jlisegW© },~;¯ ¢;¤ ~ ; ¢;£1}%~W
¤
6©
¢;6©a
© 1bM\sM\6S©B
_:? 68#%O
9S9
J':
5'()a#f ©`%olpTseo1_WghikgW}l)jlik'_EbeghisÀd°%u,d;6'd1}*~n
¢6©
¢¤ ©a
© bM\sM\6}c¬@© |Àm© #l« kk_Ebn}/t
© !ik_Wo1_
}%\j>#©ct)Z]\rbeiשt41\;q sei¦
_d
j1|»kikjl_]k_n\bej1ikjl^ikj
fMo>\jl^;i¦j1^_Wj66ibed
jlZ_Ej6segW©úY[j ikfMo>\;_Ea )© d
W_Ebn} ikf7ol\;_WRYM©v
d;bM1\;j9}®\;jlúu®old;Z]\;g
; © ©l)©>)bhbn© 1P
%+)
5
P0 .H# K'D%+ !$# 7/2':#"
DS
-'()
+
2K. ©/`/o1pTseol_WghikgW})be_W^
d;j
bM\;1l\se_aY[j1gOseise1se_
}!~n;'¡.©
6~
© © )©D)bhbn© m_WZdr6i¦j1^¼jld;i¦gh_ ikj d;j1|»kikjl_ gh_n\befMolghikjl^ç\
1\;q sei¦
_ {l\sefMo ghikW_EgW©éY[j
ikfMo>\_Wv )© d;W_Ebn} ikf7ol\;_W/Y7©*
dbMl\j9}9\;jl3u®old;Z]\;ga`_7seghf7o1_
}*_n1ised;begW}
P
&
}.'d;¦1Z_ }
q>\^
_¢;
¢6©6u®o1_ YOu `%be_EghgW}l~n;'.©
¢6© © 1©y%
©>)"bh!$bn# ©* m%_W
^
'()1# ¦+\be*iknB\#"se
ikd; j-Kikj.0se/2ol1_-gh'3_Wk%+_WfE seikd
jÙd;°*bM\
i¦\;!{>\ghi¦gv°Jlj1fEseikd
j]fW_Wj.se_EbegW©
}! 7¯
R; '¢;6}c~n;'¡6©
; ©68©#t%O©>9`_W':\
be'()kZ:# shse_E bn©v>\;gOs_E½1\;f7s)ZDlseikqlkikfn\seikd;j{.Ùseo1_aol_Wghghi¦\j9© >
?68# %9':
'()# }
6¯k~W£.$ ;>~
}c~W
£1©
£1© © ©,`cbe_WghgW})©`©,¦\;jlj1_Ebh.}!©9t
©,u,_W1'd
kgh..}9\;j> ©,u4© /_Eshse_Ebekikjl^ © O% )
-
©# ®\Z:{ beiU ^
_]4j1ik'_Ebeghis[`%be_WghgW}
®W\
5Z: 9{ :beiU ^
_
} %6jl ^
=¦\D2j>A*
}*~n'>;¤
#B!¤6© /P
5)2'( ?
*B#-.
%0%+K.
¡6©apD©61\;\
!}._n1isedbn© Z KP,[\
2K.&
Q'(T?# -U0_aEFHFb cA# UBDQ#9I
''D2T
©Áu®o1_)_Esed
j#YÀjlgOseise1se_6_Ebeik_WgW}v ®\;ZD{1bei¦1^;_)jlik'_Ebeghis[3`cbe_WghgW}v ®\;Z@|
£ ~
©
© \qljlik© /Q':
'(K-'()
-
J[\
2. =D2# 1 ik¦_7.}l)_E @dbe!}!~W
¤6©
£'¢6©t© \ik{_W×} u4© \;jl\;n\W®\ } © ikj.sed
j*}.¬@©1 o1i¦;\;j1d1}6\j>¬@©61© y9\;j1^1©c`%old;jl_WZ_be_WfEd
^;|
£
©
P© /2 i¦. _E^
_E
beikjl*,fM#"
}1t
--© K¬a. d
Zd61\ }1\;jlu4© _Wgh
_WgW©/6sed.fMo>\gOseikfa j>\ZikfWg®d;°¦_W\bejlikj1^:iseo
Zd;Z_Wj.selZ ikjjl_E1bM\;*jl_7s[®dbe6gW© }>¢
6¯ £;£.¢
$¡ ;.£;£''.}%~n
£1©
£;£1© © © @®\jl^\j>©ctZ]\beiשu®ol_5_ 4#SfE i¦P_Ejl
fE\#B!Y\;*?j>Dº1 se-ol)
5_:Obe d;{llgOsej1_Wghg
d;°j>\rse1bM\^;bM\;1ik_Wj.s
_WghfW_Wj.sk_n\rbejlikjl^Dbelk_
©%Y[j ikf7ol\;_W*Y7©1
dbMl\j9} i¦fMo>\_W91©l¬a_W\bejlgW}l\;jl \bM\@t©1 d;¦¦\6}
_W1ised;begW}
P
K
2 $!$# %
'()# *,#"
- .&/21 '3%0 }1
d
klZ_a~n6© u®ol_ YOu
`cbe_WghgW}9~n
;¤ ©