CS7015 (Deep Learning) : Lecture 8
Regularization: Bias Variance Tradeoff, l2 regularization, Early stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input, Ensemble methods, Dropout
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Acknowledgements
Chapter 7, Deep Learning book
Ali Ghodsi’s Video Lectures on Regularization (Lecture 2.1 and Lecture 2.2)
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Module 8.1 : Bias and Variance
We will begin with a quick overview of bias, variance and the trade-off between them.
Let us consider the problem of fitting a curve through a given set of points.
The points were drawn from a sinusoidal function (the true $f(x)$).
We consider two models:
Simple (degree: 1): $y = \hat{f}(x) = w_1 x + w_0$
Complex (degree: 25): $y = \hat{f}(x) = \sum_{i=1}^{25} w_i x^i + w_0$
Note that in both cases we are making an assumption about how y is related to x. We have no idea about the true relation $f(x)$.
The training data consists of 100 points.
We sample 25 points from the training data and train a simple and a complex model.
We repeat the process ‘k’ times to train multiple models (each model sees a different sample of the training data).
[Figure: "Simple" and "Complex" models fitted to different samples; the points were drawn from a sinusoidal function (the true $f(x)$)]
We make a few observations from these plots:
Simple models trained on different samples of the data do not differ much from each other.
However they are very far from the true sinusoidal curve (underfitting).
On the other hand, complex models trained on different samples of the data are very different from each other (high variance).
Let $f(x)$ be the true model (sinusoidal in this case) and $\hat{f}(x)$ be our estimate of the model (simple or complex, in this case) then,
$\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$
$E[\hat{f}(x)]$ is the average (or expected) value of the model.
We can see that for the simple model the average value (green line) is very far from the true value $f(x)$ (sinusoidal function).
Mathematically, this means that the simple model has a high bias.
On the other hand, the complex model has a low bias.
[Figure legend: Green Line: average value of $\hat{f}(x)$ for the simple model; Blue Curve: average value of $\hat{f}(x)$ for the complex model; Red Curve: true model ($f(x)$)]
We now define,
$\text{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$
(Standard definition from statistics)
Roughly speaking it tells us how much the different $\hat{f}(x)$’s (trained on different samples of the data) differ from each other.
It is clear that the simple model has a low variance whereas the complex model has a high variance.
In summary (informally)
Simple model: high bias, low variance
Complex model: low bias, high variance
There is always a trade-off between the bias and variance.
Both bias and variance contribute to the mean square error. Let us see how.
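The bias and variance definitions above can be estimated numerically. The following is a minimal sketch under assumptions that are not in the slides: the inputs are held fixed and only the noise is redrawn for each of the k fits, the noise level `sigma` is chosen arbitrarily, and degree 9 stands in for the "complex" model (the slides use degree 25) to keep the least-squares fit well conditioned.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)        # fixed inputs (assumption)
f_true = np.sin(2 * np.pi * x)       # the true f(x)
sigma = 0.3                          # noise level (assumption)
k = 200                              # number of repeated fits

def fit_predict(degree, y):
    """Least-squares polynomial fit, evaluated at the training inputs."""
    coeffs = P.polyfit(x, y, degree)
    return P.polyval(x, coeffs)

def bias_sq_and_variance(degree):
    """Average squared bias and average variance over the inputs."""
    preds = np.array([
        fit_predict(degree, f_true + rng.normal(0.0, sigma, x.size))
        for _ in range(k)
    ])
    mean_pred = preds.mean(axis=0)                 # estimate of E[f_hat(x)]
    bias_sq = np.mean((mean_pred - f_true) ** 2)   # (E[f_hat(x)] - f(x))^2
    variance = np.mean(preds.var(axis=0))          # E[(f_hat(x) - E[f_hat(x)])^2]
    return bias_sq, variance

simple = bias_sq_and_variance(1)     # degree 1
complex_ = bias_sq_and_variance(9)   # degree 9, stand-in for "complex"
print("simple : bias^2=%.4f variance=%.4f" % simple)
print("complex: bias^2=%.4f variance=%.4f" % complex_)
```

Under these assumptions the degree-1 fit should show the larger squared bias and the degree-9 fit the larger variance, matching the informal summary.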
Module 8.2 : Train error vs Test error
Consider a new point (x, y) which was not seen during training.
If we use the model $\hat{f}(x)$ to predict the value of y then the mean square error is given by
$E[(y - \hat{f}(x))^2]$
(average square error in predicting y for many such unseen points)
We can show that
$E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$ (irreducible error)
See proof here
The parameters of $\hat{f}(x)$ (all $w_i$’s) are trained using a training set $\{(x_i, y_i)\}_{i=1}^{n}$.
However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training.
This gives rise to the following two entities of interest:
$train_{err}$ (say, mean square error)
$test_{err}$ (say, mean square error)
Typically these errors exhibit the trend shown in the adjacent figure.
[Figure: error vs. model complexity; the low-complexity region is marked "High bias", the high-complexity region "High variance", and the point in between "Sweet spot: perfect tradeoff, ideal model complexity"]
$E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$ (irreducible error)
Intuitions developed so far
Let there be n training points and m test (validation) points
$train_{err} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$
$test_{err} = \frac{1}{m}\sum_{i=n+1}^{n+m} (y_i - \hat{f}(x_i))^2$
As the model complexity increases $train_{err}$ becomes overly optimistic and gives us a wrong picture of how close $\hat{f}$ is to $f$.
The validation error gives the real picture of how close $\hat{f}$ is to $f$.
We will concretize this intuition mathematically now and eventually show how to account for the optimism in the training error.
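The two error definitions above can be computed directly for models of increasing complexity. This is an illustrative sketch only: the data generator, the noise level, the sizes n and m, and the chosen polynomial degrees are all assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)
n, m, sigma = 30, 200, 0.3           # train size, test size, noise (assumptions)

def make_data(size):
    x = rng.uniform(0.0, 1.0, size)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size)
    return x, y

x_train, y_train = make_data(n)
x_test, y_test = make_data(m)

def errors(degree):
    """Return (train_err, test_err) for a polynomial of the given degree."""
    coeffs = P.polyfit(x_train, y_train, degree)   # fitted on training data only
    train_err = np.mean((y_train - P.polyval(x_train, coeffs)) ** 2)
    test_err = np.mean((y_test - P.polyval(x_test, coeffs)) ** 2)
    return train_err, test_err

results = {d: errors(d) for d in (1, 3, 9)}
for d, (tr, te) in results.items():
    print(f"degree {d}: train_err={tr:.4f} test_err={te:.4f}")
```

Because the models are nested least-squares fits, the training error can only go down as the degree grows; the test error is the quantity to watch for overfitting.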
Let $D = \{x_i, y_i\}_{i=1}^{m+n}$, then for any point (x, y) we have,
$y_i = f(x_i) + \varepsilon_i$
which means that $y_i$ is related to $x_i$ by some true function $f$ but there is also some noise $\varepsilon$ in the relation.
For simplicity, we assume
$\varepsilon \sim N(0, \sigma^2)$
and of course we do not know $f$.
Further we use $\hat{f}$ to approximate $f$ and estimate the parameters using $T \subset D$ such that
$\hat{y}_i = \hat{f}(x_i)$
We are interested in knowing
$E[(\hat{f}(x_i) - f(x_i))^2]$
but we cannot estimate this directly because we do not know $f$.
We will see how to estimate this empirically using the observation $y_i$ & prediction $\hat{y}_i$.
$E[(\hat{y}_i - y_i)^2] = E[(\hat{f}(x_i) - f(x_i) - \varepsilon_i)^2]$   ($y_i = f(x_i) + \varepsilon_i$)
$= E[(\hat{f}(x_i) - f(x_i))^2 - 2\varepsilon_i(\hat{f}(x_i) - f(x_i)) + \varepsilon_i^2]$
$= E[(\hat{f}(x_i) - f(x_i))^2] - 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] + E[\varepsilon_i^2]$
$\therefore E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$
We will take a small detour to understand how to empirically estimate an Expectation and then return to our derivation.
Suppose we have observed the goals scored (z) in k matches as
$z_1 = 2, z_2 = 1, z_3 = 0, \ldots, z_k = 2$
Now we can empirically estimate $E[z]$, i.e. the expected number of goals scored, as
$E[z] = \frac{1}{k}\sum_{i=1}^{k} z_i$
Analogy with our derivation: We have a certain number of observations $y_i$ & predictions $\hat{y}_i$ using which we can estimate
$E[(\hat{y}_i - y_i)^2] = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2$
... returning back to our derivation
$E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$
We can empirically evaluate R.H.S using training observations or test observations.
Case 1: Using test observations
$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} (\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m} \varepsilon_i^2}_{\text{small constant}} + 2\underbrace{E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\ \text{covariance}(\varepsilon_i,\ \hat{f}(x_i) - f(x_i))}$
$\because \text{covariance}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$
$= E[X(Y - \mu_Y)]$ (if $\mu_X = E[X] = 0$)
$= E[XY] - E[X\mu_Y] = E[XY] - \mu_Y E[X] = E[XY]$
None of the test observations participated in the estimation of $\hat{f}(x)$ [the parameters of $\hat{f}(x)$ were estimated only using training data]
$\therefore \varepsilon \perp (\hat{f}(x_i) - f(x_i))$
$\therefore E[\varepsilon_i \cdot (\hat{f}(x_i) - f(x_i))] = E[\varepsilon_i] \cdot E[\hat{f}(x_i) - f(x_i)] = 0 \cdot E[\hat{f}(x_i) - f(x_i)] = 0$
$\therefore$ true error = empirical test error + small constant
Hence, we should always use a validation set (independent of the training set) to estimate the error.
Case 2: Using training observations
$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2 - \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i^2 + 2\underbrace{E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{\neq\ 0}$
Now, $\varepsilon \not\perp \hat{f}(x)$ because $\varepsilon$ was used for estimating the parameters of $\hat{f}(x)$
$\therefore E[\varepsilon_i \cdot (\hat{f}(x_i) - f(x_i))] \neq E[\varepsilon_i] \cdot E[\hat{f}(x_i) - f(x_i)] = 0$, i.e. the covariance term is not 0.
Hence, the empirical train error is smaller than the true error and does not give a true picture of the error.
But how is this related to model complexity? Let us see.
Module 8.3 : True error and Model complexity
Using Stein’s Lemma (and some trickery) we can show that
$\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i(\hat{f}(x_i) - f(x_i)) = \frac{\sigma^2}{n}\sum_{i=1}^{n} \frac{\partial \hat{f}(x_i)}{\partial y_i}$
When will $\frac{\partial \hat{f}(x_i)}{\partial y_i}$ be high? When a small change in the observation causes a large change in the estimation ($\hat{f}$).
Can you link this to model complexity?
Yes, indeed a complex model will be more sensitive to changes in observations whereas a simple model will be less sensitive to changes in observations.
Hence, we can say that
true error = empirical train error + small constant + Ω(model complexity)
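For a least-squares polynomial fit the relation above can be checked numerically, because the fitted values are linear in the observations: $\hat{f} = Hy$ with hat matrix $H$, so $\partial \hat{f}(x_i)/\partial y_i = H_{ii}$ and the right-hand side becomes $\sigma^2\,\mathrm{trace}(H)/n$. The sketch below is an illustration under that linear-model assumption (it is not the general proof); the inputs, noise level and degrees are made up for the example, and the left-hand side is averaged over many noise draws.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 40, 0.5                   # sample size and noise level (assumptions)
x = np.linspace(0.0, 1.0, n)
f = np.sin(2 * np.pi * x)            # true function at the training inputs

def hat_matrix(degree):
    """H such that f_hat = H @ y for a least-squares polynomial fit."""
    X = np.vander(x, degree + 1, increasing=True)
    return X @ np.linalg.pinv(X)

def both_sides(degree, trials=20000):
    H = hat_matrix(degree)
    eps = rng.normal(0.0, sigma, size=(trials, n))   # one noise vector per trial
    f_hat = (f + eps) @ H.T                          # fitted values per trial
    lhs = np.mean(np.sum(eps * (f_hat - f), axis=1) / n)
    rhs = sigma ** 2 * np.trace(H) / n               # (sigma^2 / n) * sum_i H_ii
    return lhs, rhs

lhs_simple, rhs_simple = both_sides(1)
lhs_complex, rhs_complex = both_sides(6)
print("degree 1:", lhs_simple, rhs_simple)
print("degree 6:", lhs_complex, rhs_complex)
```

Here trace(H) equals the number of fitted coefficients, so the term grows with the polynomial degree, which is one concrete reading of "Ω(model complexity)".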
Let us verify that indeed a complex model is more sensitive to minor changes in the data.
We have fitted a simple and complex model for some given data.
We now change one of these data points.
The simple model does not change much as compared to the complex model.
Hence while training, instead of minimizing the training error $\mathcal{L}_{train}(\theta)$ we should minimize
$\min_{\theta} \mathcal{L}(\theta) = \mathcal{L}_{train}(\theta) + \Omega(\theta)$
Where $\Omega(\theta)$ would be high for complex models and small for simple models.
$\Omega(\theta)$ acts as an approximation for $\frac{\sigma^2}{n}\sum_{i=1}^{n} \frac{\partial \hat{f}(x_i)}{\partial y_i}$
This is the basis for all regularization methods.
We can show that l1 regularization, l2 regularization, early stopping and injecting noise in input are all instances of this form of regularization.
[Figure: error vs. model complexity, with "High bias", "Sweet spot" and "High variance" regions; the gap between the curves is $\frac{\sigma^2}{n}\sum_{i=1}^{n} \frac{\partial \hat{f}(x_i)}{\partial y_i}$]
$\Omega(\theta)$ should ensure that model has reasonable complexity.
Why do we care about this bias variance tradeoff and model complexity?
Deep Neural networks are highly complex models.
Many parameters, many non-linearities.
It is easy for them to overfit and drive training error to 0.
Hence we need some form of regularization.
Different forms of regularization
l2 regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout
Module 8.4 : l2 regularization
For l2 regularization we have,
$\widetilde{\mathcal{L}}(w) = \mathcal{L}(w) + \frac{\alpha}{2}\|w\|^2$
For SGD (or its variants), we are interested in
$\nabla\widetilde{\mathcal{L}}(w) = \nabla\mathcal{L}(w) + \alpha w$
Update rule:
$w_{t+1} = w_t - \eta\nabla\mathcal{L}(w_t) - \eta\alpha w_t$
Requires a very small modification to the code.
Let us see the geometric interpretation of this.
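The "very small modification to the code" amounts to one extra term in the update. The sketch below applies the update rule to a toy quadratic loss; the matrix `H`, the minimizer `w_star`, the step size and `alpha` are illustrative values, not taken from the lecture.

```python
import numpy as np

def sgd_step_l2(w, grad_loss, eta, alpha):
    """One step of w_{t+1} = w_t - eta * grad L(w_t) - eta * alpha * w_t."""
    return w - eta * grad_loss(w) - eta * alpha * w

# Toy quadratic loss L(w) = 0.5 * (w - w_star)^T H (w - w_star) (assumption).
H = np.array([[3.0, 1.0], [1.0, 2.0]])
w_star = np.array([1.0, -2.0])

def grad_loss(w):
    return H @ (w - w_star)

eta, alpha = 0.1, 0.5
w = np.zeros(2)
for _ in range(500):
    w = sgd_step_l2(w, grad_loss, eta, alpha)

# For this quadratic loss the regularized minimizer is (H + alpha I)^{-1} H w_star.
w_closed_form = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)
print(w, w_closed_form)
```

With `alpha = 0` the extra term vanishes and the loop reduces to plain gradient descent on the unregularized loss.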
Assume $w^*$ is the optimal solution for $\mathcal{L}(w)$ [not $\widetilde{\mathcal{L}}(w)$] i.e. the solution in the absence of regularization ($w^*$ optimal → $\nabla\mathcal{L}(w^*) = 0$)
Consider $u = w - w^*$. Using Taylor series approximation (up to 2nd order)
$\mathcal{L}(w^* + u) = \mathcal{L}(w^*) + u^T\nabla\mathcal{L}(w^*) + \frac{1}{2}u^T H u$
$\mathcal{L}(w) = \mathcal{L}(w^*) + (w - w^*)^T\nabla\mathcal{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$
$= \mathcal{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$   ($\because \nabla\mathcal{L}(w^*) = 0$)
$\nabla\mathcal{L}(w) = \nabla\mathcal{L}(w^*) + H(w - w^*)$
$= H(w - w^*)$
Now,
$\nabla\widetilde{\mathcal{L}}(w) = \nabla\mathcal{L}(w) + \alpha w$
$= H(w - w^*) + \alpha w$
Let $\widetilde{w}$ be the optimal solution for $\widetilde{\mathcal{L}}(w)$ [i.e. regularized loss]
$\because \nabla\widetilde{\mathcal{L}}(\widetilde{w}) = 0$
$H(\widetilde{w} - w^*) + \alpha\widetilde{w} = 0$
$\therefore (H + \alpha I)\widetilde{w} = Hw^*$
$\therefore \widetilde{w} = (H + \alpha I)^{-1}Hw^*$
Notice that if $\alpha \to 0$ then $\widetilde{w} \to w^*$ [no regularization]
But we are interested in the case when $\alpha \neq 0$
Let us analyse the case when $\alpha \neq 0$
If H is symmetric Positive Semi Definite
$H = Q\Lambda Q^T$ [Q is orthogonal, $QQ^T = Q^TQ = I$]
$\widetilde{w} = (H + \alpha I)^{-1}Hw^*$
$= (Q\Lambda Q^T + \alpha I)^{-1}Q\Lambda Q^T w^*$
$= (Q\Lambda Q^T + \alpha QIQ^T)^{-1}Q\Lambda Q^T w^*$
$= [Q(\Lambda + \alpha I)Q^T]^{-1}Q\Lambda Q^T w^*$
$= (Q^T)^{-1}(\Lambda + \alpha I)^{-1}Q^{-1}Q\Lambda Q^T w^*$
$= Q(\Lambda + \alpha I)^{-1}\Lambda Q^T w^*$   ($\because (Q^T)^{-1} = Q$)
$\widetilde{w} = QDQ^T w^*$
where $D = (\Lambda + \alpha I)^{-1}\Lambda$ is a diagonal matrix which we will see in more detail soon.
$\widetilde{w} = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T w^* = QDQ^T w^*$
So what is happening here?
$w^*$ first gets rotated by $Q^T$ to give $Q^T w^*$
However if $\alpha = 0$ then Q rotates $Q^T w^*$ back to give $w^*$
If $\alpha \neq 0$ then let us see what D looks like
$(\Lambda + \alpha I)^{-1} = \operatorname{diag}\left(\frac{1}{\lambda_1 + \alpha}, \frac{1}{\lambda_2 + \alpha}, \ldots, \frac{1}{\lambda_n + \alpha}\right)$
$D = (\Lambda + \alpha I)^{-1}\Lambda = \operatorname{diag}\left(\frac{\lambda_1}{\lambda_1 + \alpha}, \frac{\lambda_2}{\lambda_2 + \alpha}, \ldots, \frac{\lambda_n}{\lambda_n + \alpha}\right)$
So what is happening now?
Each element i of $Q^T w^*$ gets scaled by $\frac{\lambda_i}{\lambda_i + \alpha}$ before it is rotated back by Q
if $\lambda_i \gg \alpha$ then $\frac{\lambda_i}{\lambda_i + \alpha} \approx 1$
if $\lambda_i \ll \alpha$ then $\frac{\lambda_i}{\lambda_i + \alpha} \approx 0$
Thus only significant directions (larger eigen values) will be retained.
Effective parameters $= \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_i + \alpha} < n$
The weight vector ($w^*$) is getting rotated to ($\widetilde{w}$)
All of its elements are shrinking but some are shrinking more than the others
This ensures that only important features are given high weights
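The rotate, scale, rotate-back reading of $\widetilde{w} = QDQ^T w^*$ can be checked numerically. The sketch below builds a small symmetric H with one large and one small eigenvalue; the rotation angle, eigenvalues, `alpha` and `w_star` are illustrative values only.

```python
import numpy as np

# Symmetric PSD "Hessian" with eigenvalues 10 and 0.01 (made-up values).
theta = np.pi / 6
Q0 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])
H = Q0 @ np.diag([10.0, 0.01]) @ Q0.T
w_star = np.array([1.0, 1.0])
alpha = 0.5

# Direct form: w_tilde = (H + alpha I)^{-1} H w_star
w_tilde_direct = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# Eigen form: rotate by Q^T, scale by lambda_i / (lambda_i + alpha), rotate back by Q
lam, Q = np.linalg.eigh(H)           # H = Q diag(lam) Q^T
D = lam / (lam + alpha)              # diagonal entries of D
w_tilde_eig = Q @ (D * (Q.T @ w_star))

effective_params = D.sum()           # sum_i lambda_i / (lambda_i + alpha)
print("scaling factors:", D)
print("effective parameters:", effective_params)
```

The direction with the large eigenvalue is scaled by a factor close to 1, while the direction with the small eigenvalue is shrunk almost to 0.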
Module 8.5 : Dataset augmentation
[Figure: given training data, an image with label = 2; augmented data (created using some knowledge of the task), all with label = 2: rotated by 20°, rotated by 65°, shifted vertically, shifted horizontally, blurred, changed some pixels]
We exploit the fact that certain transformations to the image do not change the label of the image.
Typically, More data = better learning
Works well for image classification / object recognition tasks
Also shown to work well for speech
For some tasks it may not be clear how to generate such data
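A few of the label-preserving transformations listed above can be sketched with plain NumPy. The helper names (`shift`, `blur`, `perturb_pixels`, `augment`) and all parameter values are hypothetical choices for this example; rotation is left out because it needs an image library.

```python
import numpy as np

def shift(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling the exposed border with 0."""
    h, w = image.shape
    out = np.zeros_like(image)
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def blur(image):
    """3x3 box blur with edge padding."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="edge")
    return sum(padded[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def perturb_pixels(image, rng, fraction=0.02):
    """Replace a small random fraction of pixels with random intensities."""
    out = image.copy()
    mask = rng.random(image.shape) < fraction
    out[mask] = rng.random(int(mask.sum()))
    return out

def augment(image, label, rng):
    """Return transformed copies of the image, all keeping the original label."""
    variants = [shift(image, 2, 0), shift(image, 0, 2),
                blur(image), perturb_pixels(image, rng)]
    return [(v, label) for v in variants]

rng = np.random.default_rng(0)
image = np.zeros((28, 28))
image[5, 5] = 1.0
augmented = augment(image, 2, rng)
```

Which transformations are safe depends on the task: a large rotation, for example, can turn a 6 into a 9.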
Module 8.6 : Parameter Sharing and tying
Parameter Sharing
Used in CNNs
Same filter applied at different positions of the image
Or same weight matrix acts on different input neurons
Parameter Tying
Typically used in autoencoders
The encoder and decoder weights are tied.
[Figure: autoencoder with hidden representation h(x) and reconstruction x̂]
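Parameter tying in an autoencoder can be sketched as a decoder that reuses the transpose of the encoder matrix instead of owning a separate one. The layer sizes, the sigmoid encoder and the linear decoder below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 3                      # illustrative layer sizes

W = rng.normal(0.0, 0.1, (n_hidden, n_in))  # the only weight matrix
b = np.zeros(n_hidden)                      # encoder bias
c = np.zeros(n_in)                          # decoder bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    return sigmoid(W @ x + b)               # h(x)

def decode(h):
    return W.T @ h + c                      # tied: decoder weight is W transposed

x = rng.normal(size=n_in)
x_hat = decode(encode(x))

tied_weights = W.size                       # n_hidden * n_in
untied_weights = 2 * W.size                 # separate encoder and decoder matrices
```

Tying halves the number of weight parameters, which is how it acts as a constraint on model complexity.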
Module 8.7 : Adding Noise to the inputs
We saw this in Autoencoder
[Figure: denoising autoencoder: input x is corrupted to x̃ by the noise process P(x̃|x), mapped to h(x) and reconstructed as x̂]
We can show that for a simple input output neural network, adding Gaussian noise to the input is equivalent to weight decay (L2 regularisation)
Can be viewed as data augmentation
[Figure: a single output neuron with noisy inputs $x_1 + \varepsilon_1, x_2 + \varepsilon_2, \ldots, x_k + \varepsilon_k, \ldots, x_n + \varepsilon_n$]
$\varepsilon \sim N(0, \sigma^2)$
$\widetilde{x}_i = x_i + \varepsilon_i$
$\hat{y} = \sum_{i=1}^{n} w_i x_i$
$\widetilde{y} = \sum_{i=1}^{n} w_i \widetilde{x}_i$
$= \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} w_i \varepsilon_i$
$= \hat{y} + \sum_{i=1}^{n} w_i \varepsilon_i$
We are interested in $E[(\widetilde{y} - y)^2]$
$E[(\widetilde{y} - y)^2] = E\left[\left(\hat{y} + \sum_{i=1}^{n} w_i \varepsilon_i - y\right)^2\right]$
$= E\left[\left((\hat{y} - y) + \sum_{i=1}^{n} w_i \varepsilon_i\right)^2\right]$
$= E[(\hat{y} - y)^2] + E\left[2(\hat{y} - y)\sum_{i=1}^{n} w_i \varepsilon_i\right] + E\left[\left(\sum_{i=1}^{n} w_i \varepsilon_i\right)^2\right]$
$= E[(\hat{y} - y)^2] + 0 + E\left[\sum_{i=1}^{n} w_i^2 \varepsilon_i^2\right]$   ($\because \varepsilon_i$ is independent of $\varepsilon_j$ and $\varepsilon_i$ is independent of $(\hat{y} - y)$)
$= E[(\hat{y} - y)^2] + \sigma^2\sum_{i=1}^{n} w_i^2$   (same as L2 norm penalty)
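The final identity can be sanity-checked by simulation for a single input and a fixed weight vector. The weights, input, target and noise level below are made-up numbers; the expectation over the noise is approximated by averaging over many noise draws.

```python
import numpy as np

rng = np.random.default_rng(3)
w = np.array([0.5, -1.0, 2.0, 0.1, -0.3])   # illustrative weights
x = np.array([1.0, 0.0, -1.0, 2.0, 0.5])    # one illustrative input
y = 0.3                                     # its target
sigma = 0.2                                 # input noise standard deviation

y_hat = w @ x                               # prediction on the clean input

eps = rng.normal(0.0, sigma, size=(200_000, x.size))
y_tilde = (x + eps) @ w                     # predictions on noisy inputs

lhs = np.mean((y_tilde - y) ** 2)                   # estimate of E[(y_tilde - y)^2]
rhs = (y_hat - y) ** 2 + sigma ** 2 * np.sum(w ** 2)  # squared error + L2 penalty
print(lhs, rhs)
```

The second term of `rhs` does not depend on the data, only on the weights, which is why input noise behaves like an L2 penalty with strength set by the noise variance.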
Module 8.8 : Adding Noise to the outputs
Hard targets: 0 0 1 0 0 0 0 0 0 0
minimize: $-\sum_{i=0}^{9} p_i \log q_i$
true distribution: $p = \{0, 0, 1, 0, 0, 0, 0, 0, 0, 0\}$
estimated distribution: $q$
Intuition
Do not trust the true labels, they may be noisy
Instead, use soft targets
Soft targets: $\frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ 1-\varepsilon\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}\ \frac{\varepsilon}{9}$
$\varepsilon$ = small positive constant
minimize: $-\sum_{i=0}^{9} p_i \log q_i$
true distribution + noise: $p = \left\{\frac{\varepsilon}{9}, \frac{\varepsilon}{9}, 1 - \varepsilon, \frac{\varepsilon}{9}, \ldots\right\}$
estimated distribution: $q$
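Building the soft target vector and the corresponding cross-entropy takes only a few lines. The function names and the value `eps = 0.1` are illustrative choices, and the predicted distribution `q` is a made-up example.

```python
import numpy as np

def soft_targets(label, num_classes=10, eps=0.1):
    """Put 1 - eps on the true class and eps / (num_classes - 1) on every other class."""
    p = np.full(num_classes, eps / (num_classes - 1))
    p[label] = 1.0 - eps
    return p

def cross_entropy(p, q):
    """-sum_i p_i log q_i for a target distribution p and a prediction q."""
    return -np.sum(p * np.log(q))

p_soft = soft_targets(2)                 # true label is 2, as in the slide
q = np.full(10, 0.02)                    # made-up prediction, mostly on class 2
q[2] = 0.82
loss = cross_entropy(p_soft, q)
```

With soft targets the loss is still lowest when `q` matches `p_soft`, so the model is no longer pushed to assign probability exactly 1 to a single class.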
Module 8.9 : Early stopping
Track the validation error
Have a patience parameter p
If you are at step k and there was no improvement in validation error in the previous p steps then stop training and return the model stored at step k − p
Basically, stop the training early before it drives the training error to 0 and blows up the validation error
[Figure: training error and validation error vs. steps; training stops at step k and the model from step k − p is returned]
Very effective and the most widely used form of regularization
Can be used even with other regularizers (such as l2)
How does it act as a regularizer?
We will first see an intuitive explanation and then a mathematical analysis
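The patience rule described above can be sketched as a small training loop. The function `train_with_early_stopping` and its arguments are hypothetical names for this example; `train_step` and `validation_error` stand for whatever training and evaluation routines are actually in use.

```python
import copy

def train_with_early_stopping(params, train_step, validation_error,
                              patience, max_steps):
    """Stop when validation error has not improved for `patience` steps.

    Returns the best parameters seen, the step they came from, and the
    step at which training stopped.
    """
    best_params = copy.deepcopy(params)
    best_err = validation_error(params)
    best_step = 0
    step = 0
    for step in range(1, max_steps + 1):
        params = train_step(params)
        err = validation_error(params)
        if err < best_err:
            # Improvement: remember this model.
            best_err, best_step = err, step
            best_params = copy.deepcopy(params)
        elif step - best_step >= patience:
            # No improvement in the last `patience` steps: stop.
            break
    return best_params, best_step, step

# Toy usage: the "model" is one number that grows by 1 per step,
# and the validation error is lowest at 5 (made-up example).
best_w, best_step, stop_step = train_with_early_stopping(
    params=0.0,
    train_step=lambda w: w + 1.0,
    validation_error=lambda w: (w - 5.0) ** 2,
    patience=3,
    max_steps=100,
)
```

In the toy run the best model is found at step 5 and training stops at step 8, i.e. the model from step k − p is returned with k = 8 and p = 3.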
Recall that the update rule in SGD is
$w_{t+1} = w_t - \eta\nabla w_t$
$= w_0 - \eta\sum_{i=1}^{t} \nabla w_i$
Let $\tau$ be the maximum value of $\nabla w_i$ then
$|w_{t+1} - w_0| \leq \eta t|\tau|$
Thus, t controls how far $w_t$ can go from the initial $w_0$
In other words it controls the space of exploration
We will now see a mathematical analysis of this
Recall that the Taylor series approximation for $\mathcal{L}(w)$ is
$\mathcal{L}(w) = \mathcal{L}(w^*) + (w - w^*)^T\nabla\mathcal{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$
$= \mathcal{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$   [$w^*$ is optimal so $\nabla\mathcal{L}(w^*)$ is 0]
$\nabla(\mathcal{L}(w)) = H(w - w^*)$
Now the SGD update rule is:
$w_t = w_{t-1} - \eta\nabla\mathcal{L}(w_{t-1})$
$= w_{t-1} - \eta H(w_{t-1} - w^*)$
$= (I - \eta H)w_{t-1} + \eta Hw^*$
$w_t = (I - \eta H)w_{t-1} + \eta Hw^*$
Using EVD of H as $H = Q\Lambda Q^T$, we get:
$w_t = (I - \eta Q\Lambda Q^T)w_{t-1} + \eta Q\Lambda Q^T w^*$
If we start with $w_0 = 0$ then we can show that (See Appendix)
$w_t = Q[I - (I - \eta\Lambda)^t]Q^T w^*$
Compare this with the expression we had for optimum $\widetilde{w}$ with L2 regularization
$\widetilde{w} = Q[I - (\Lambda + \alpha I)^{-1}\alpha]Q^T w^*$
We observe that $w_t = \widetilde{w}$, if we choose $\eta$, t and $\alpha$ such that
$(I - \eta\Lambda)^t = (\Lambda + \alpha I)^{-1}\alpha$
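The closed form for $w_t$ can be checked by running the recursion on a toy quadratic loss and comparing with the eigen-decomposition expression. The matrix `H`, the vector `w_star`, the step size and the number of steps are illustrative values only.

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (made up)
w_star = np.array([1.0, -2.0])           # minimizer of the unregularized loss
eta, t = 0.1, 7                          # step size and number of updates

# Run the recursion w_t = w_{t-1} - eta * H (w_{t-1} - w_star) from w_0 = 0.
w = np.zeros(2)
for _ in range(t):
    w = w - eta * H @ (w - w_star)

# Closed form: w_t = Q [I - (I - eta * Lambda)^t] Q^T w_star
lam, Q = np.linalg.eigh(H)
shrink = 1.0 - (1.0 - eta * lam) ** t    # diagonal of I - (I - eta Lambda)^t
w_closed_form = Q @ (shrink * (Q.T @ w_star))
print(w, w_closed_form, shrink)
```

Each entry of `shrink` plays the role that $\lambda_i/(\lambda_i + \alpha)$ plays for l2 regularization: directions with larger eigenvalues approach 1 faster as t grows.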
Things to remember
Early stopping only allows t updates to the parameters.
If a parameter w corresponds to a dimension which is important for the loss $\mathcal{L}(\theta)$ then $\frac{\partial\mathcal{L}(\theta)}{\partial w}$ will be large
However if a parameter is not important ($\frac{\partial\mathcal{L}(\theta)}{\partial w}$ is small) then its updates will be small and the parameter will not be able to grow large in ‘t’ steps
Early stopping will thus effectively shrink the parameters corresponding to less important directions (same as weight decay).
Module 8.10 : Ensemble methods
63/84
l2 regularization
Early stopping
Ensemble methods
Dropout
64/84
Combine the output of different models to reduce generalization error
The models can correspond to different classifiers
It could be different instances of the same classifier trained with:
different hyperparameters
different features
different samples of the training data
[Figure: inputs x1, x2, x3, x4 feed three classifiers (Logistic Regression, SVM, Naive Bayes) whose outputs y_lr, y_svm, y_nb are combined into y_final]
65/84
Bagging: form an ensemble using different instances of the same classifier
From a given dataset, construct multiple training sets by sampling with replacement ($T_1, T_2, ..., T_k$)
Train the ith instance of the classifier using training set $T_i$
[Figure: three Logistic Regression instances with outputs y_lr1, y_lr2, y_lr3 combined into y_final. Each model is trained with a different sample of the data (sampling with replacement)]
66/84
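The bagging procedure can be sketched in a few lines. This is a minimal illustration, assuming a least-squares linear model as the base learner instead of logistic regression and synthetic data; the helper names `bootstrap_sets`, `fit_linear` and `predict` are hypothetical.

```python
import numpy as np

def bootstrap_sets(X, y, k, rng):
    """Build k training sets T_1..T_k by sampling rows with replacement."""
    n = len(X)
    for _ in range(k):
        idx = rng.integers(0, n, size=n)
        yield X[idx], y[idx]

def fit_linear(X, y):
    """Least-squares fit with a bias column (stand-in for any base learner)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    return np.column_stack([X, np.ones(len(X))]) @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # synthetic inputs
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Train the i-th model on T_i, then average the k outputs
models = [fit_linear(Xi, yi) for Xi, yi in bootstrap_sets(X, y, k=10, rng=rng)]
y_final = np.mean([predict(w, X) for w in models], axis=0)
```

Averaging is only one way of combining the outputs; for classifiers a majority vote over the k predictions would play the same role.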
When would bagging work?
Consider a set of k LR models
Suppose that each model makes an error $\varepsilon_i$ on a test example
Let $\varepsilon_i$ be drawn from a zero mean multivariate normal distribution
Variance $= E[\varepsilon_i^2] = V$
Covariance $= E[\varepsilon_i \varepsilon_j] = C$
The error made by the average prediction of all the models is $\frac{1}{k} \sum_i \varepsilon_i$
The expected squared error is:
$mse = E[(\frac{1}{k} \sum_i \varepsilon_i)^2]$
$= \frac{1}{k^2} E[\sum_i \sum_{j = i} \varepsilon_i \varepsilon_j + \sum_i \sum_{j \neq i} \varepsilon_i \varepsilon_j]$
$= \frac{1}{k^2} E[\sum_i \varepsilon_i^2 + \sum_i \sum_{j \neq i} \varepsilon_i \varepsilon_j]$
$= \frac{1}{k^2} (\sum_i E[\varepsilon_i^2] + \sum_i \sum_{j \neq i} E[\varepsilon_i \varepsilon_j])$
$= \frac{1}{k^2} (kV + k(k - 1)C)$
$= \frac{1}{k} V + \frac{k - 1}{k} C$
67/84
$mse = \frac{1}{k} V + \frac{k - 1}{k} C$
When would bagging work?
If the errors of the model are perfectly correlated then V = C and mse = V [bagging does not help: the mse of the ensemble is as bad as the individual models]
If the errors of the model are independent or uncorrelated then C = 0 and the mse of the ensemble reduces to $\frac{1}{k} V$
On average, the ensemble will perform at least as well as its individual members
68/84
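The expression for the ensemble mse can be checked by simulation. The sketch below assumes errors drawn from a zero mean multivariate normal with variance V on the diagonal and covariance C elsewhere; the function names and the chosen values of k, V and C are illustrative only.

```python
import numpy as np

def ensemble_mse_formula(k, V, C):
    """mse = V/k + (k-1)/k * C"""
    return V / k + (k - 1) / k * C

def ensemble_mse_simulated(k, V, C, n_trials, rng):
    """Monte Carlo estimate of E[(mean of the k errors)^2]."""
    cov = np.full((k, k), C) + (V - C) * np.eye(k)   # V on diagonal, C off diagonal
    eps = rng.multivariate_normal(np.zeros(k), cov, size=n_trials)
    return np.mean(eps.mean(axis=1) ** 2)

rng = np.random.default_rng(0)
k, V = 10, 1.0
for C in (0.0, 0.5):
    sim = ensemble_mse_simulated(k, V, C, n_trials=200_000, rng=rng)
    print(f"C={C:.1f}  formula={ensemble_mse_formula(k, V, C):.3f}  simulated={sim:.3f}")
```

With C = 0 the ensemble error is a tenth of the individual error for k = 10, while a covariance of 0.5 already removes most of that gain.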
Module 8.11 : Dropout
69/84
l2 regularization
Early stopping
Ensemble methods
Dropout
70/84
Model averaging (a bagging ensemble) typically helps
Training several large neural networks for making an ensemble is prohibitively expensive
Option 1: Train several neural networks having different architectures (obviously expensive)
Option 2: Train multiple instances of the same network using different training samples (again expensive)
Even if we manage to train with option 1 or option 2, combining several models at test time is infeasible in real-time applications
71/84
Dropout is a technique which addresses both these issues.
Effectively it allows training several neural networks without any significant computational overhead.
It also gives an efficient approximate way of combining exponentially many different neural networks.
72/84
Dropout refers to dropping out units
Temporarily remove a node and all its incoming/outgoing connections, resulting in a thinned network
Each node is retained with a fixed probability (typically p = 0.5 for hidden nodes and p = 0.8 for visible nodes)
73/84
Suppose a neural network has n nodes
Using the dropout idea, each node can be retained or dropped
For example, in the above case we drop 5 nodes to get a thinned network
Given a total of n nodes, what is the total number of thinned networks that can be formed? $2^n$
Of course, this is prohibitively large and we cannot possibly train so many networks
Trick: (1) Share the weights across all the networks
(2) Sample a different network for each training instance
Let us see how
74/84
We initialize all the parameters (weights) of the network and start training
For the first training instance (or mini-batch), we apply dropout resulting in the thinned network
We compute the loss and backpropagate
Which parameters will we update? Only those which are active
75/84
For the second training instance (or mini-batch), we again apply dropout resulting in a different thinned network
We again compute the loss and backpropagate to the active weights
If the weight was active for both the training instances then it would have received two updates by now
If the weight was active for only one of the training instances then it would have received only one update by now
Each thinned network gets trained rarely (or even never) but the parameter sharing ensures that no model has untrained or poorly trained parameters
76/84
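The training procedure with shared weights can be sketched as follows. This is a minimal illustration, assuming a single hidden layer with ReLU units, a scalar output and squared-error loss; the function `dropout_step` and all sizes and values are hypothetical choices, not taken from the lecture.

```python
import numpy as np

def dropout_step(W1, W2, x, y, p, eta, rng):
    """One SGD step on a thinned network sampled from the shared weights.
    W1: (h, d) input-to-hidden, W2: (h,) hidden-to-output, loss 0.5 * err^2."""
    mask = rng.random(W1.shape[0]) < p     # True = hidden unit retained
    a = W1 @ x
    h = np.maximum(a, 0.0) * mask          # dropped units output 0
    err = h @ W2 - y
    grad_W2 = err * h                      # zero for dropped units
    grad_a = err * W2 * mask * (a > 0)
    grad_W1 = np.outer(grad_a, x)          # zero rows for dropped units
    return W1 - eta * grad_W1, W2 - eta * grad_W2, mask

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(6, 4)), rng.normal(size=6)
retained = np.zeros(6, dtype=int)          # how often each hidden unit was retained
for _ in range(2):                         # two training instances
    x, y = rng.normal(size=4), 1.0
    W1, W2, mask = dropout_step(W1, W2, x, y, p=0.5, eta=0.1, rng=rng)
    retained += mask
print(retained)
```

Because the gradient is zero for every weight attached to a dropped unit, those weights are left unchanged in that step, while all thinned networks keep reading from and writing to the same `W1` and `W2`.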
[Figure: at training time a node with weights w1, w2, w3, w4 is present with probability p; at test time the node is always present and its weights are pw1, pw2, pw3, pw4]
What happens at test time?
Impossible to aggregate the outputs of $2^n$ thinned networks
Instead we use the full neural network and scale the output of each node by the fraction of times it was on during training
77/84
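The test-time scaling can be illustrated on a single linear layer, where it matches the average over sampled masks exactly in expectation (with nonlinear layers it is only an approximation). The sketch below assumes hypothetical helper names `layer_train` and `layer_test` and random values for the activations and weights.

```python
import numpy as np

def layer_train(h, W, p, rng):
    """Training time: each input node is present with probability p."""
    mask = rng.random(h.shape) < p
    return W @ (h * mask)

def layer_test(h, W, p):
    """Test time: every node is present and its outgoing weights are scaled by p."""
    return (p * W) @ h

rng = np.random.default_rng(0)
h = rng.normal(size=4)            # activations of four nodes
W = rng.normal(size=(3, 4))       # outgoing weights
p = 0.5

# Average over many sampled thinned networks versus one scaled full network
avg_train = np.mean([layer_train(h, W, p, rng) for _ in range(20_000)], axis=0)
print(avg_train)
print(layer_test(h, W, p))
```

A single pass through the scaled full network therefore stands in for averaging over the thinned networks, which is what makes test-time prediction cheap.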
Dropout essentially applies a masking noise to the hidden units
Prevents hidden units from co-adapting
Essentially a hidden unit cannot rely too much on other units as they may get dropped out any time
Each hidden unit has to learn to be more robust to these random dropouts
78/84
Here is an example of how dropout helps in ensuring redundancy and robustness
Suppose $h_i$ learns to detect a face by firing on detecting a nose
Dropping $h_i$ then corresponds to erasing the information that a nose exists
The model should then learn another $h_i$ which redundantly encodes the presence of a nose
Or the model should learn to detect the face using other features
[Figure: network with the hidden unit $h_i$ marked]
79/84
Recap
l2 regularization
Early stopping
Ensemble methods
Dropout
80/84
Appendix
81/84
To prove: The below two equations are equivalent
$w_t = (I - \eta Q \Lambda Q^T) w_{t-1} + \eta Q \Lambda Q^T w^*$
$w_t = Q[I - (I - \eta \Lambda)^t] Q^T w^*$
Proof by induction:
Base case: t = 1 and $w_0 = 0$:
$w_1$ according to the first equation:
$w_1 = (I - \eta Q \Lambda Q^T) w_0 + \eta Q \Lambda Q^T w^*$
$= \eta Q \Lambda Q^T w^*$
$w_1$ according to the second equation:
$w_1 = Q(I - (I - \eta \Lambda)^1) Q^T w^*$
$= \eta Q \Lambda Q^T w^*$
82/84
Induction step: Let the two equations be equivalent for the tth step
$\therefore w_t = (I - \eta Q \Lambda Q^T) w_{t-1} + \eta Q \Lambda Q^T w^*$
$= Q[I - (I - \eta \Lambda)^t] Q^T w^*$
Proof that this will hold for the (t + 1)th step:
$w_{t+1} = (I - \eta Q \Lambda Q^T) w_t + \eta Q \Lambda Q^T w^*$
(using $w_t = Q[I - (I - \eta \Lambda)^t] Q^T w^*$)
$= (I - \eta Q \Lambda Q^T) Q(I - (I - \eta \Lambda)^t) Q^T w^* + \eta Q \Lambda Q^T w^*$
(Opening this bracket)
$= IQ(I - (I - \eta \Lambda)^t) Q^T w^* - \eta Q \Lambda Q^T Q(I - (I - \eta \Lambda)^t) Q^T w^* + \eta Q \Lambda Q^T w^*$
$= Q(I - (I - \eta \Lambda)^t) Q^T w^* - \eta Q \Lambda Q^T Q(I - (I - \eta \Lambda)^t) Q^T w^* + \eta Q \Lambda Q^T w^*$
83/84
Continuing
$w_{t+1} = Q(I - (I - \eta \Lambda)^t) Q^T w^* - \eta Q \Lambda Q^T Q(I - (I - \eta \Lambda)^t) Q^T w^* + \eta Q \Lambda Q^T w^*$
$= Q(I - (I - \eta \Lambda)^t) Q^T w^* - \eta Q \Lambda (I - (I - \eta \Lambda)^t) Q^T w^* + \eta Q \Lambda Q^T w^*$ ($\because Q^T Q = I$)
$= Q[(I - (I - \eta \Lambda)^t) - \eta \Lambda (I - (I - \eta \Lambda)^t) + \eta \Lambda] Q^T w^*$
$= Q[I - (I - \eta \Lambda)^t + \eta \Lambda (I - \eta \Lambda)^t] Q^T w^*$
$= Q[I - (I - \eta \Lambda)^t (I - \eta \Lambda)] Q^T w^*$
$= Q(I - (I - \eta \Lambda)^{t+1}) Q^T w^*$
Hence, proved!
84/84
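The equivalence proved by induction can also be checked numerically. The following is a minimal sketch, assuming a randomly generated symmetric positive definite H, a random $w^*$, and illustrative values for the dimension, learning rate and number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, T = 4, 0.02, 25                 # assumed dimension, learning rate, steps

A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                 # symmetric positive definite stand-in for H
lam, Q = np.linalg.eigh(H)              # H = Q diag(lam) Q^T
w_star = rng.normal(size=d)

# First equation: iterate the recurrence starting from w_0 = 0
w = np.zeros(d)
for _ in range(T):
    w = (np.eye(d) - eta * H) @ w + eta * H @ w_star

# Second equation: closed form Q [I - (I - eta*Lambda)^T] Q^T w*
closed = Q @ np.diag(1.0 - (1.0 - eta * lam) ** T) @ Q.T @ w_star

print(np.allclose(w, closed))
```

The check relies on `numpy.linalg.eigh` returning orthonormal eigenvectors for a symmetric matrix, which is the property $Q^T Q = I$ used in the proof.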
