
CS7015 (Deep Learning) : Lecture 8

Regularization: Bias Variance Tradeoff, l2 regularization, Early stopping,
Dataset augmentation, Parameter sharing and tying, Injecting noise at input,
Ensemble methods, Dropout

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 8
Acknowledgements
Chapter 7, Deep Learning book
Ali Ghodsi’s Video Lectures on Regularization (Lecture 2.1 and Lecture 2.2)
Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Module 8.1 : Bias and Variance

We will begin with a quick overview of bias, variance and the trade-off between
them.

Let us consider the problem of fitting a curve through a given set of points.

[Figure: the data points with the fitted Simple and Complex curves. The points were drawn from a sinusoidal function (the true $f(x)$).]

We consider two models:

Simple (degree 1): $y = \hat{f}(x) = w_1 x + w_0$

Complex (degree 25): $y = \hat{f}(x) = \sum_{i=1}^{25} w_i x^i + w_0$

Note that in both cases we are making an assumption about how $y$ is related to $x$. We have no idea about the true relation $f(x)$.

The training data consists of 100 points.

We sample 25 points from the training data and train a simple and a complex model.

We repeat the process $k$ times to train multiple models (each model sees a different sample of the training data).

[Figure: two panels, Simple and Complex, showing the fitted models. The points were drawn from a sinusoidal function (the true $f(x)$).]

We make a few observations from these plots.

Simple models trained on different samples of the data do not differ much from each other.

However, they are very far from the true sinusoidal curve (underfitting).

On the other hand, complex models trained on different samples of the data are very different from each other (high variance).
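The resampling experiment and the observations above can be reproduced with a short NumPy sketch. Several details here are assumptions rather than part of the slides: the true function is taken to be sin(x) on [0, 2π], the noise standard deviation is 0.1, k is 50, and the helper name `fit_many` is hypothetical. Both models are fitted by least squares with `Polynomial.fit`, which is one reasonable choice, not necessarily the one used to draw the original plots.

```python
import warnings

import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Assumed setup (not from the slides): f(x) = sin(x) on [0, 2*pi], noise std 0.1.
x_train = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=100))  # 100 training points
y_train = np.sin(x_train) + rng.normal(0.0, 0.1, size=100)
x_grid = np.linspace(0.5, 2.0 * np.pi - 0.5, 50)            # where the fitted curves are compared


def fit_many(degree, k=50, sample_size=25):
    """Fit k polynomials of the given degree, each on a random 25-point sample.

    Returns an array of shape (k, len(x_grid)); row j holds model j's predictions.
    """
    preds = np.empty((k, x_grid.size))
    for j in range(k):
        idx = rng.choice(x_train.size, size=sample_size, replace=False)
        with warnings.catch_warnings():
            # The degree-25 fit is ill-conditioned, so NumPy warns about rank; ignore it here.
            warnings.simplefilter("ignore")
            model = Polynomial.fit(x_train[idx], y_train[idx], deg=degree)
        preds[j] = model(x_grid)
    return preds


simple_preds = fit_many(degree=1)
complex_preds = fit_many(degree=25)

# Standard deviation across the k models at each grid point, averaged over the grid.
simple_spread = simple_preds.std(axis=0).mean()
complex_spread = complex_preds.std(axis=0).mean()
print(f"simple spread:  {simple_spread:.3f}")
print(f"complex spread: {complex_spread:.3f}")
```

The spread of the degree-25 models should come out much larger than that of the straight lines, which is the "very different from each other" observation in numerical form.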

Let $f(x)$ be the true model (sinusoidal in this case) and $\hat{f}(x)$ be our estimate of the model (simple or complex, in this case). Then,

$$\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$$

$E[\hat{f}(x)]$ is the average (or expected) value of the model.

[Figure legend. Green line: average value of $\hat{f}(x)$ for the simple model. Blue curve: average value of $\hat{f}(x)$ for the complex model. Red curve: true model $f(x)$.]

We can see that for the simple model the average value (green line) is very far from the true value $f(x)$ (sinusoidal function).

Mathematically, this means that the simple model has a high bias.

On the other hand, the complex model has a low bias.
We now define,

$$\text{Variance}(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$

(Standard definition from statistics)

Roughly speaking, it tells us how much the different $\hat{f}(x)$'s (trained on different samples of the data) differ from each other.

It is clear that the simple model has a low variance whereas the complex model has a high variance.
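Given the predictions of k trained models at a set of points, the two definitions translate directly into array operations. The sketch below is illustrative only: the function name `bias_and_variance` and the toy numbers are hypothetical, and the expectations are replaced by averages over the k models.

```python
import numpy as np


def bias_and_variance(preds, f_true):
    """Pointwise bias and variance from the predictions of k models.

    preds:  shape (k, p), row j holds model j's predictions at p points
    f_true: shape (p,), the true f(x) at the same points
    """
    preds = np.asarray(preds, dtype=float)
    f_true = np.asarray(f_true, dtype=float)
    mean_pred = preds.mean(axis=0)                      # empirical E[f_hat(x)]
    bias = mean_pred - f_true                           # E[f_hat(x)] - f(x)
    variance = ((preds - mean_pred) ** 2).mean(axis=0)  # E[(f_hat(x) - E[f_hat(x)])^2]
    return bias, variance


# Toy numbers (hypothetical): two models evaluated at two points.
preds = [[1.0, 2.0],
         [3.0, 2.0]]
f_true = [2.0, 1.0]
bias, variance = bias_and_variance(preds, f_true)
print(bias)      # [0. 1.]
print(variance)  # [1. 0.]
```

At the first point the two models disagree but average to the truth (zero bias, non-zero variance); at the second they agree with each other but are both wrong (non-zero bias, zero variance).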

In summary (informally)
Simple model: high bias, low variance
Complex model: low bias, high variance
There is always a trade-off between the bias and variance.
Both bias and variance contribute to the mean square error. Let us see how.

Module 8.2 : Train error vs Test error

Consider a new point $(x, y)$ which was not seen during training.

If we use the model $\hat{f}(x)$ to predict the value of $y$ then the mean square error is given by

$$E[(y - \hat{f}(x))^2]$$

(average square error in predicting $y$ for many such unseen points)

We can show that

$$E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2 \ (\text{irreducible error})$$

See proof here
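One standard argument for this decomposition (not necessarily the proof the slide links to) is sketched below. It assumes $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$ and $\text{Var}(\varepsilon) = \sigma^2$, and that the noise at the unseen point is independent of $\hat{f}(x)$; the expectation is over training samples and noise, with $x$ held fixed.

```latex
% Assumptions: y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% and \varepsilon is independent of \hat{f}(x) because (x, y) was not used for training.
\begin{align*}
E[(y - \hat{f}(x))^2]
  &= E[(f(x) - \hat{f}(x) + \varepsilon)^2] \\
  &= E[(f(x) - \hat{f}(x))^2] + 2\,E[\varepsilon]\,E[f(x) - \hat{f}(x)] + E[\varepsilon^2] \\
  &= E[(\hat{f}(x) - f(x))^2] + \sigma^2 \\[4pt]
E[(\hat{f}(x) - f(x))^2]
  &= E[(\hat{f}(x) - E[\hat{f}(x)] + E[\hat{f}(x)] - f(x))^2] \\
  &= \underbrace{E[(\hat{f}(x) - E[\hat{f}(x)])^2]}_{\text{Variance}}
   + \underbrace{(E[\hat{f}(x)] - f(x))^2}_{\text{Bias}^2}
\end{align*}
% The cross term in the last step vanishes because
% E[\hat{f}(x) - E[\hat{f}(x)]] = 0 and E[\hat{f}(x)] - f(x) is a constant.
```

Substituting the second result into the first gives $\text{Bias}^2 + \text{Variance} + \sigma^2$.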

The parameters of $\hat{f}(x)$ (all $w_i$'s) are trained using a training set $\{(x_i, y_i)\}_{i=1}^{n}$.

However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training.

This gives rise to the following two entities of interest:
$train_{err}$ (say, mean square error)
$test_{err}$ (say, mean square error)

Typically these errors exhibit the trend shown in the adjacent figure.

[Figure: error (vertical axis) against model complexity (horizontal axis). Low complexity is marked "High bias", high complexity "High variance", and the region in between "Sweet spot: perfect tradeoff, ideal model complexity".]

$$E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2 \ (\text{irreducible error})$$
Intuitions developed so far

Let there be $n$ training points and $m$ test (validation) points

$$train_{err} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$$

$$test_{err} = \frac{1}{m} \sum_{i=n+1}^{n+m} (y_i - \hat{f}(x_i))^2$$

As the model complexity increases, $train_{err}$ becomes overly optimistic and gives us a wrong picture of how close $\hat{f}$ is to $f$.

The validation error gives the real picture of how close $\hat{f}$ is to $f$.

We will concretize this intuition mathematically now and eventually show how to account for the optimism in the training error.
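The two error definitions can be computed side by side while the model complexity is varied. This is a minimal sketch under assumed conditions that are not in the slides: a noisy sine as data, n = 75 and m = 25, polynomial degree as the measure of complexity, and a hypothetical helper `mse`. How strongly the test error rises at high complexity depends on the data, so the printed numbers are only indicative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Assumed data (not from the slides): noisy sine, n training and m test points.
n, m = 75, 25
x = rng.uniform(0.0, 2.0 * np.pi, size=n + m)
y = np.sin(x) + rng.normal(0.0, 0.2, size=n + m)
x_tr, y_tr = x[:n], y[:n]   # points 1..n
x_te, y_te = x[n:], y[n:]   # points n+1..n+m


def mse(model, xs, ys):
    """Mean square error: average of (y_i - f_hat(x_i))^2 over the given points."""
    return float(np.mean((ys - model(xs)) ** 2))


train_err, test_err = {}, {}
for degree in (1, 3, 5, 9):
    f_hat = Polynomial.fit(x_tr, y_tr, deg=degree)  # least-squares fit on training data only
    train_err[degree] = mse(f_hat, x_tr, y_tr)
    test_err[degree] = mse(f_hat, x_te, y_te)
    print(f"degree {degree}: train_err={train_err[degree]:.4f}  test_err={test_err[degree]:.4f}")
```

Because each higher-degree polynomial family contains the lower-degree ones, the training error can only go down as the degree grows, whereas the test error is under no such constraint.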
Let $D = \{x_i, y_i\}_{i=1}^{m+n}$, then for any point $(x_i, y_i)$ we have,

$$y_i = f(x_i) + \varepsilon_i$$

which means that $y_i$ is related to $x_i$ by some true function $f$ but there is also some noise $\varepsilon$ in the relation.

For simplicity, we assume

$$\varepsilon \sim \mathcal{N}(0, \sigma^2)$$

and of course we do not know $f$.

Further, we use $\hat{f}$ to approximate $f$ and estimate the parameters using $T \subset D$ such that

$$y_i \approx \hat{f}(x_i)$$

We are interested in knowing

$$E[(\hat{f}(x_i) - f(x_i))^2]$$

but we cannot estimate this directly because we do not know $f$.

We will see how to estimate this empirically using the observation $y_i$ and the prediction $\hat{y}_i = \hat{f}(x_i)$.

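$$\begin{aligned}
E[(\hat{y}_i - y_i)^2] &= E[(\hat{f}(x_i) - f(x_i) - \varepsilon_i)^2] \qquad (y_i = f(x_i) + \varepsilon_i) \\
&= E[(\hat{f}(x_i) - f(x_i))^2 - 2\varepsilon_i(\hat{f}(x_i) - f(x_i)) + \varepsilon_i^2] \\
&= E[(\hat{f}(x_i) - f(x_i))^2] - 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))] + E[\varepsilon_i^2]
\end{aligned}$$

$$\therefore\ E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$$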

We will take a small detour to understand how to empirically estimate an
expectation and then return to our derivation.

Suppose we have observed the goals scored ($z$) in $k$ matches as $z_1 = 2, z_2 = 1, z_3 = 0, \ldots, z_k = 2$.

Now we can empirically estimate $E[z]$, i.e. the expected number of goals scored, as

$$E[z] = \frac{1}{k} \sum_{i=1}^{k} z_i$$

Analogy with our derivation: We have a certain number of observations $y_i$ and predictions $\hat{y}_i$ using which we can estimate

$$E[(\hat{y}_i - y_i)^2] = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$
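Both empirical estimates are plain averages, as the following small sketch shows. The numbers are made up for illustration (the slides only give the first few and the last goal counts), and the variable names are hypothetical.

```python
# Hypothetical observations: goals scored in k = 5 matches.
goals = [2, 1, 0, 3, 2]
expected_goals = sum(goals) / len(goals)  # (1/k) * sum of z_i
print(expected_goals)                     # 1.6

# The same recipe applied to squared prediction errors, with made-up y_i and y_hat_i.
y = [1.0, 2.0, 3.0]
y_hat = [1.5, 2.0, 2.0]
squared_errors = [(yh - yi) ** 2 for yh, yi in zip(y_hat, y)]
mse = sum(squared_errors) / len(squared_errors)  # (1/m) * sum of (y_hat_i - y_i)^2
print(mse)
```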

... returning to our derivation

$$E[(\hat{f}(x_i) - f(x_i))^2] = E[(\hat{y}_i - y_i)^2] - E[\varepsilon_i^2] + 2E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]$$

We can empirically evaluate the R.H.S. using training observations or test observations

Case 1: Using test observations

$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m}(\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{m}\sum_{i=n+1}^{n+m}\varepsilon_i^2}_{\text{small constant}} + 2\underbrace{E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\ \text{covariance}(\varepsilon_i,\ \hat{f}(x_i) - f(x_i))}$$

$$\begin{aligned}
\because\ \text{covariance}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)] \\
&= E[X(Y - \mu_Y)] \quad (\text{if } \mu_X = E[X] = 0) \\
&= E[XY] - E[X\mu_Y] = E[XY] - \mu_Y E[X] = E[XY]
\end{aligned}$$
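As a quick numerical sanity check of the covariance identity, the following sketch (not part of the lecture; the synthetic distributions and variable names are assumptions made for illustration) centres a sample of X so that its sample mean is zero and compares the sample covariance with the sample mean of XY.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X plays the role of the zero-mean noise; centre it so its sample mean is 0.
x = rng.normal(0.0, 1.0, size=n)
x = x - x.mean()

# Y is correlated with X and has a non-zero mean (arbitrary illustrative choice).
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X - mu_X)(Y - mu_Y)]
e_xy = np.mean(x * y)                               # E[XY]

print(cov_xy, e_xy)   # the two values agree up to floating-point error
```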
None of the test observations participated in the estimation of $\hat{f}(x)$ [the parameters of $\hat{f}(x)$ were estimated only using training data]

$$\begin{aligned}
&\therefore\ \varepsilon \perp (\hat{f}(x_i) - f(x_i)) \\
&\therefore\ E[\varepsilon_i \cdot (\hat{f}(x_i) - f(x_i))] = E[\varepsilon_i] \cdot E[\hat{f}(x_i) - f(x_i)] = 0 \cdot E[\hat{f}(x_i) - f(x_i)] = 0 \\
&\therefore\ \text{true error} = \text{empirical test error} - \text{small constant}
\end{aligned}$$

Hence, we should always use a validation set (independent of the training set) to estimate the error
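Case 2: Using training observations

$$\underbrace{E[(\hat{f}(x_i) - f(x_i))^2]}_{\text{true error}} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}_{\text{empirical estimation of error}} - \underbrace{\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2}_{\text{small constant}} + 2\underbrace{E[\varepsilon_i(\hat{f}(x_i) - f(x_i))]}_{=\ \text{covariance}(\varepsilon_i,\ \hat{f}(x_i) - f(x_i))}$$

Now, $\varepsilon \not\perp \hat{f}(x)$ because $\varepsilon$ was used for estimating the parameters of $\hat{f}(x)$

$$\therefore\ E[\varepsilon_i \cdot (\hat{f}(x_i) - f(x_i))] \neq E[\varepsilon_i] \cdot E[\hat{f}(x_i) - f(x_i)] = 0$$

so the covariance term no longer vanishes.

Hence, the empirical train error is smaller than the true error and does not give a true picture of the error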
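The two cases can be compared in a small simulation. The sketch below is illustrative only: the sinusoidal f, the noise level `sigma`, the sample sizes and the polynomial degree are all assumed values, and `np.polyfit` stands in for whatever estimator produces $\hat{f}$. It averages $\varepsilon_i(\hat{f}(x_i) - f(x_i))$ over training points and over held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3          # assumed noise standard deviation
f = np.sin           # assumed true function

def make_data(n):
    """Draw n noisy observations y = f(x) + eps and also return eps."""
    x = rng.uniform(0.0, 2.0 * np.pi, size=n)
    eps = rng.normal(0.0, sigma, size=n)
    return x, f(x) + eps, eps

def noise_fit_covariance(degree, n_train=30, n_test=1000, trials=200):
    """Average eps_i * (f_hat(x_i) - f(x_i)) on training and on test points."""
    on_train, on_test = [], []
    for _ in range(trials):
        x_tr, y_tr, eps_tr = make_data(n_train)
        x_te, _, eps_te = make_data(n_test)
        coeffs = np.polyfit(x_tr, y_tr, degree)      # f_hat sees only training data
        on_train.append(np.mean(eps_tr * (np.polyval(coeffs, x_tr) - f(x_tr))))
        on_test.append(np.mean(eps_te * (np.polyval(coeffs, x_te) - f(x_te))))
    return np.mean(on_train), np.mean(on_test)

train_cov, test_cov = noise_fit_covariance(degree=5)
print(train_cov, test_cov)
```

Under these assumptions the held-out average should come out close to zero, while the training average should be clearly positive (for least squares with d + 1 coefficients it is σ²(d + 1)/n in expectation).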

But how is this related to model complexity? Let us see


Module 8.3 : True error and Model complexity

Using Stein's Lemma (and some trickery) we can show that

$$\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i(\hat{f}(x_i) - f(x_i)) = \frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$$

When will $\frac{\partial \hat{f}(x_i)}{\partial y_i}$ be high? When a small change in the observation causes a large change in the estimation ($\hat{f}$)

Can you link this to model complexity?

Yes, indeed a complex model will be more sensitive to changes in observations whereas a simple model will be less sensitive to changes in observations

Hence, we can say that

true error = empirical train error − small constant + Ω(model complexity)
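For estimators that are linear in the observations, $\hat{f} = Sy$, the derivative $\partial \hat{f}(x_i)/\partial y_i$ is the diagonal entry $S_{ii}$, so the sum on the right-hand side is the trace of $S$. The sketch below assumes least-squares polynomial fits (an illustrative choice, not the lecture's model); `sensitivity_sum` is a hypothetical helper name.

```python
import numpy as np

def sensitivity_sum(x, degree):
    """Sum over i of d f_hat(x_i) / d y_i for a least-squares polynomial fit.

    The fit is linear in y: f_hat = S y with S = X (X^T X)^{-1} X^T.
    With the thin QR factorisation X = Q R this is S = Q Q^T.
    """
    X = np.vander(x, degree + 1)        # columns x^degree, ..., x^1, x^0
    Q, _ = np.linalg.qr(X)              # Q has degree + 1 orthonormal columns
    leverage = np.sum(Q**2, axis=1)     # diagonal of S, i.e. d f_hat(x_i) / d y_i
    return leverage.sum()

x = np.linspace(0.0, 1.0, 40)
simple_model = sensitivity_sum(x, degree=1)     # 2 coefficients
complex_model = sensitivity_sum(x, degree=8)    # 9 coefficients
print(simple_model, complex_model)
```

In this linear setting the sum equals the number of fitted coefficients (up to rounding), which is one concrete sense in which the term grows with model complexity.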
Let us verify that indeed a complex model is more sensitive to minor changes in the data
We have fitted a simple and a complex model to some given data
We now change one of these data points
The simple model does not change much as compared to the complex model
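The experiment described on this slide can be reproduced roughly as follows. This is a sketch under assumed settings (30 noisy samples of a sinusoid, degree 1 versus degree 9 polynomials, one observation shifted by 1.0); none of these values come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, size=x.size)

y_perturbed = y.copy()
y_perturbed[15] += 1.0        # change a single data point

def fit_change(degree):
    """Largest change in the fitted curve caused by moving one observation."""
    grid = np.linspace(0.0, 1.0, 200)
    before = np.polyval(np.polyfit(x, y, degree), grid)
    after = np.polyval(np.polyfit(x, y_perturbed, degree), grid)
    return np.max(np.abs(after - before))

change_simple = fit_change(1)     # simple model (straight line)
change_complex = fit_change(9)    # complex model (degree 9 polynomial)
print(change_simple, change_complex)
```

With these settings the degree 9 fit moves noticeably more than the straight line when the single point is changed.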
Hence while training, instead of minimizing the training error $\mathscr{L}_{train}(\theta)$ we should minimize

$$\min_{\theta}\ \mathscr{L}_{train}(\theta) + \Omega(\theta) = \mathscr{L}(\theta)$$

Where Ω(θ) would be high for complex models and small for simple models

Ω(θ) acts as an approximation for $\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$

This is the basis for all regularization methods

We can show that l1 regularization, l2 regularization, early stopping and injecting noise in input are all instances of this form of regularization.
[Figure: error plotted against model complexity, with regions labelled "High bias" and "High variance", a "Sweet spot" between them, and a curve labelled $\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial \hat{f}(x_i)}{\partial y_i}$]

Ω(θ) should ensure that the model has reasonable complexity
Why do we care about this bias variance tradeoff and model complexity?
Deep Neural networks are highly complex models.
Many parameters, many non-linearities.
It is easy for them to overfit and drive training error to 0.
Hence we need some form of regularization.
Different forms of regularization
l2 regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout

Module 8.4 : l2 regularization

For l2 regularization we have,

$$\widetilde{\mathscr{L}}(w) = \mathscr{L}(w) + \frac{\alpha}{2}\|w\|^2$$

For SGD (or its variants), we are interested in

$$\nabla\widetilde{\mathscr{L}}(w) = \nabla\mathscr{L}(w) + \alpha w$$

Update rule:

$$w_{t+1} = w_t - \eta\nabla\mathscr{L}(w_t) - \eta\alpha w_t$$

Requires a very small modification to the code

Let us see the geometric interpretation of this
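The "very small modification" might look like the sketch below, which adds the term $\eta\alpha w_t$ to a plain gradient step. The function name `sgd_step_l2`, the quadratic toy loss and all numeric values are assumptions chosen for illustration; they are not taken from the lecture's code.

```python
import numpy as np

def sgd_step_l2(w, grad_loss, lr, alpha):
    """One step of w_{t+1} = w_t - lr * grad L(w_t) - lr * alpha * w_t."""
    return w - lr * grad_loss(w) - lr * alpha * w

# Hypothetical toy loss L(w) = 0.5 * ||w - target||^2, so grad L(w) = w - target.
target = np.array([2.0, -1.0])

def grad_loss(w):
    return w - target

w = np.zeros(2)
for _ in range(500):
    w = sgd_step_l2(w, grad_loss, lr=0.1, alpha=0.5)

print(w)   # settles where (w - target) + alpha * w = 0, i.e. w = target / (1 + alpha)
```

For this toy loss the iterate ends up at a shrunken version of `target` rather than at `target` itself, which is the effect the following slides analyse geometrically.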
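Assume $w^*$ is the optimal solution for $\mathscr{L}(w)$ [not $\widetilde{\mathscr{L}}(w)$] i.e. the solution in the absence of regularization ($w^*$ optimal → $\nabla\mathscr{L}(w^*) = 0$)

Consider $u = w - w^*$. Using Taylor series approximation (up to 2nd order), where $H$ is the Hessian of $\mathscr{L}$ at $w^*$

$$\begin{aligned}
\mathscr{L}(w^* + u) &= \mathscr{L}(w^*) + u^T\nabla\mathscr{L}(w^*) + \frac{1}{2}u^T H u \\
\mathscr{L}(w) &= \mathscr{L}(w^*) + (w - w^*)^T\nabla\mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \\
&= \mathscr{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \quad (\because\ \nabla\mathscr{L}(w^*) = 0) \\
\nabla\mathscr{L}(w) &= \nabla\mathscr{L}(w^*) + H(w - w^*) \\
&= H(w - w^*)
\end{aligned}$$

Now,

$$\begin{aligned}
\nabla\widetilde{\mathscr{L}}(w) &= \nabla\mathscr{L}(w) + \alpha w \\
&= H(w - w^*) + \alpha w
\end{aligned}$$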
Let $\widetilde{w}$ be the optimal solution for $\widetilde{\mathscr{L}}(w)$ [i.e. regularized loss]

$$\begin{aligned}
&\because\ \nabla\widetilde{\mathscr{L}}(\widetilde{w}) = 0 \\
&H(\widetilde{w} - w^*) + \alpha\widetilde{w} = 0 \\
&\therefore\ (H + \alpha I)\widetilde{w} = Hw^* \\
&\therefore\ \widetilde{w} = (H + \alpha I)^{-1}Hw^*
\end{aligned}$$

Notice that if α → 0 then $\widetilde{w} \to w^*$ [no regularization]
But we are interested in the case when α ≠ 0
Let us analyse the case when α ≠ 0
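The closed form can be cross-checked numerically. The sketch below assumes a purely quadratic loss with a hand-picked positive definite Hessian `H`, an optimum `w_star` and a value of `alpha` (all illustrative assumptions), solves $(H + \alpha I)\widetilde{w} = Hw^*$ directly, and compares the result with gradient descent on the regularized gradient $H(w - w^*) + \alpha w$.

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 0.5]])        # assumed symmetric positive definite Hessian
w_star = np.array([1.0, 2.0])     # assumed unregularized optimum
alpha = 0.3                       # assumed regularization strength

# Closed form from the derivation: (H + alpha I) w_tilde = H w_star
w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# Cross-check: gradient descent using grad = H (w - w_star) + alpha * w
w = np.zeros(2)
for _ in range(5000):
    w = w - 0.1 * (H @ (w - w_star) + alpha * w)

print(w_tilde, w)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is a design choice of this sketch rather than something the derivation requires.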
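If H is symmetric Positive Semi Definite

$H = Q\Lambda Q^T$ [Q is orthogonal, $QQ^T = Q^TQ = I$]

$$\begin{aligned}
\widetilde{w} &= (H + \alpha I)^{-1}Hw^* \\
&= (Q\Lambda Q^T + \alpha I)^{-1}Q\Lambda Q^T w^* \\
&= (Q\Lambda Q^T + \alpha QIQ^T)^{-1}Q\Lambda Q^T w^* \\
&= [Q(\Lambda + \alpha I)Q^T]^{-1}Q\Lambda Q^T w^* \\
&= (Q^T)^{-1}(\Lambda + \alpha I)^{-1}Q^{-1}Q\Lambda Q^T w^* \\
&= Q(\Lambda + \alpha I)^{-1}\Lambda Q^T w^* \quad (\because\ (Q^T)^{-1} = Q) \\
\widetilde{w} &= QDQ^T w^*
\end{aligned}$$

where $D = (\Lambda + \alpha I)^{-1}\Lambda$ is a diagonal matrix which we will see in more detail soon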
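The chain of equalities can be checked numerically with an eigendecomposition. The following sketch builds a random symmetric positive definite `H` (an assumption made so that the example is self-contained), obtains `Q` and the eigenvalues from `np.linalg.eigh`, and compares $QDQ^Tw^*$ with the direct solution of $(H + \alpha I)\widetilde{w} = Hw^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)      # assumed symmetric positive definite Hessian
w_star = rng.normal(size=4)        # assumed unregularized optimum
alpha = 0.5

lam, Q = np.linalg.eigh(H)         # H = Q diag(lam) Q^T with orthogonal Q
D = np.diag(lam / (lam + alpha))   # D = (Lambda + alpha I)^{-1} Lambda

direct = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)
via_eig = Q @ D @ Q.T @ w_star

print(np.max(np.abs(direct - via_eig)))   # difference is at rounding level
```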
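So what is happening here?

$$\widetilde{w} = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T w^* = QDQ^T w^*$$

$w^*$ first gets rotated by $Q^T$ to give $Q^T w^*$
However if α = 0 then $Q$ rotates $Q^T w^*$ back to give $w^*$
If α ≠ 0 then let us see what $D$ looks like

$$(\Lambda + \alpha I)^{-1} = \begin{bmatrix} \frac{1}{\lambda_1 + \alpha} & & & \\ & \frac{1}{\lambda_2 + \alpha} & & \\ & & \ddots & \\ & & & \frac{1}{\lambda_n + \alpha} \end{bmatrix}$$

$$D = (\Lambda + \alpha I)^{-1}\Lambda = \begin{bmatrix} \frac{\lambda_1}{\lambda_1 + \alpha} & & & \\ & \frac{\lambda_2}{\lambda_2 + \alpha} & & \\ & & \ddots & \\ & & & \frac{\lambda_n}{\lambda_n + \alpha} \end{bmatrix}$$

So what is happening now?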
$$\tilde{w} = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T w^* = QDQ^T w^*, \qquad D = (\Lambda + \alpha I)^{-1}\Lambda = \mathrm{diag}\left(\frac{\lambda_1}{\lambda_1+\alpha}, \ldots, \frac{\lambda_n}{\lambda_n+\alpha}\right)$$

Each element $i$ of $Q^T w^*$ gets scaled by $\frac{\lambda_i}{\lambda_i+\alpha}$ before it is rotated back by $Q$
If $\lambda_i \gg \alpha$ then $\frac{\lambda_i}{\lambda_i+\alpha} \approx 1$
If $\lambda_i \ll \alpha$ then $\frac{\lambda_i}{\lambda_i+\alpha} \approx 0$
Thus only significant directions (larger eigenvalues) will be retained.

$$\text{Effective parameters} = \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_i+\alpha} < n$$

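The per-direction scaling can be checked numerically. The following is a minimal sketch, assuming a small symmetric positive-definite matrix stands in for the Hessian $H$; the function name `l2_shrunk_weights` and the toy values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def l2_shrunk_weights(H, w_star, alpha):
    """Return w_tilde = Q (Lambda + alpha I)^{-1} Lambda Q^T w_star and the diagonal of D."""
    lam, Q = np.linalg.eigh(H)              # H = Q diag(lam) Q^T for symmetric H
    scale = lam / (lam + alpha)             # diagonal entries of D
    w_tilde = Q @ (scale * (Q.T @ w_star))  # rotate, scale, rotate back
    return w_tilde, scale

# Toy Hessian (assumed): one important direction (eigenvalue 10), one unimportant (0.1)
H = np.array([[10.0, 0.0], [0.0, 0.1]])
w_star = np.array([1.0, 1.0])
w_tilde, scale = l2_shrunk_weights(H, w_star, alpha=1.0)
effective_params = scale.sum()              # sum_i lam_i / (lam_i + alpha)
print(w_tilde, effective_params)
```

With these toy values the weight along the large-eigenvalue direction is kept almost intact while the other is shrunk strongly, so the effective number of parameters is below 2.
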
The weight vector $w^*$ is getting rotated to $\tilde{w}$
All of its elements are shrinking but some are shrinking more than the others
This ensures that only important features are given high weights

Module 8.5 : Dataset augmentation

Different forms of regularization
l2 regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout

[Figure: the given training data, an image with label = 2, and six augmented copies that keep label = 2: rotated by 20°, rotated by 65°, shifted vertically, shifted horizontally, blurred, changed some pixels]
[augmented data = created using some knowledge of the task]
We exploit the fact that certain transformations to the image do not change the label of the image.

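A few of these label-preserving transformations can be sketched with NumPy alone. This is only an illustration under stated assumptions: the helper names (`shift`, `box_blur`, `perturb_pixels`, `augment`) are hypothetical, images are 2-D arrays with values in [0, 1], and rotation is left out because it needs interpolation.

```python
import numpy as np

def shift(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels; vacated pixels are filled with 0."""
    h, w = img.shape
    out = np.zeros_like(img)
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def box_blur(img):
    """3x3 mean filter with zero padding (a simple stand-in for blurring)."""
    h, w = img.shape
    padded = np.pad(img.astype(float), 1)
    return sum(padded[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def perturb_pixels(img, rng, frac=0.05):
    """Replace a random fraction of pixels with random values in [0, 1)."""
    out = img.astype(float).copy()
    mask = rng.random(img.shape) < frac
    out[mask] = rng.random(int(mask.sum()))
    return out

def augment(img, label, rng):
    """Return (image, label) pairs; every transformed copy keeps the original label."""
    return [
        (shift(img, 2, 0), label),           # shifted vertically
        (shift(img, 0, 2), label),           # shifted horizontally
        (box_blur(img), label),              # blurred
        (perturb_pixels(img, rng), label),   # changed some pixels
    ]
```

Which transformations are safe depends on the task: a large rotation may be harmless for some objects but can change the meaning of a digit.
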
Typically, more data = better learning
Works well for image classification / object recognition tasks
Also shown to work well for speech
For some tasks it may not be clear how to generate such data

Module 8.6 : Parameter Sharing and tying


Parameter Sharing
Used in CNNs
Same filter applied at different positions of the image
Or same weight matrix acts on different input neurons

Parameter Tying
Typically used in autoencoders
The encoder and decoder weights are tied.

[Figure: autoencoder with hidden representation h(x)]

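Parameter tying in an autoencoder can be sketched as follows. This is a minimal illustration, assuming a single hidden layer with a sigmoid encoder and a linear decoder; the class name `TiedAutoencoder` and its methods are hypothetical and not taken from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoencoder:
    """One-hidden-layer autoencoder whose decoder reuses the encoder matrix (W^T)."""

    def __init__(self, n_in, n_hidden, rng):
        self.W = rng.normal(scale=0.1, size=(n_hidden, n_in))  # the only weight matrix
        self.b = np.zeros(n_hidden)                            # encoder bias
        self.c = np.zeros(n_in)                                # decoder bias

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)

    def decode(self, h):
        return self.W.T @ h + self.c       # tied: decoder weight is W transposed

    def reconstruct(self, x):
        return self.decode(self.encode(x))

    def n_params(self):
        return self.W.size + self.b.size + self.c.size
```

With 8 inputs and 3 hidden units the tied model has 3*8 + 3 + 8 = 35 parameters, whereas separate encoder and decoder matrices would give 59; fewer free parameters is what makes tying act as a regularizer.
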
Module 8.7 : Adding Noise to the inputs

We saw this in autoencoders
We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay (L2 regularisation)
Can be viewed as data augmentation

[Figure: autoencoder in which the input $x$ is corrupted by a noise process $P(\tilde{x}|x)$, encoded as $h(x)$ and reconstructed as $\hat{x}$]

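[Figure: a single output unit with noisy inputs $x_1 + \varepsilon_1, x_2 + \varepsilon_2, \ldots, x_k + \varepsilon_k, \ldots, x_n + \varepsilon_n$]

$$\varepsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad \tilde{x}_i = x_i + \varepsilon_i$$

$$\hat{y} = \sum_{i=1}^{n} w_i x_i$$

$$\tilde{y} = \sum_{i=1}^{n} w_i \tilde{x}_i = \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} w_i \varepsilon_i = \hat{y} + \sum_{i=1}^{n} w_i \varepsilon_i$$

We are interested in $E[(\tilde{y} - y)^2]$

$$\begin{aligned}
E\left[(\tilde{y} - y)^2\right] &= E\left[\left(\hat{y} + \sum_{i=1}^{n} w_i \varepsilon_i - y\right)^2\right] \\
&= E\left[\left((\hat{y} - y) + \sum_{i=1}^{n} w_i \varepsilon_i\right)^2\right] \\
&= E\left[(\hat{y} - y)^2\right] + E\left[2(\hat{y} - y)\sum_{i=1}^{n} w_i \varepsilon_i\right] + E\left[\left(\sum_{i=1}^{n} w_i \varepsilon_i\right)^2\right] \\
&= E\left[(\hat{y} - y)^2\right] + 0 + E\left[\sum_{i=1}^{n} w_i^2 \varepsilon_i^2\right] \\
&= E\left[(\hat{y} - y)^2\right] + \sigma^2 \sum_{i=1}^{n} w_i^2
\end{aligned}$$

($\because$ $\varepsilon_i$ is independent of $\varepsilon_j$ and $\varepsilon_i$ is independent of $(\hat{y} - y)$)
The last term is the same as an L2 norm penalty.
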
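The result can be sanity-checked by simulation. The sketch below assumes a single fixed example $(x, y)$ and averages only over the input noise; the function name `noisy_input_mse` and all numeric values are illustrative assumptions.

```python
import numpy as np

def noisy_input_mse(w, x, y, sigma, n_samples, rng):
    """Compare the simulated E[(y_tilde - y)^2] with (y_hat - y)^2 + sigma^2 * sum(w_i^2)."""
    y_hat = w @ x                                        # clean prediction
    eps = rng.normal(0.0, sigma, size=(n_samples, x.size))
    y_tilde = (x + eps) @ w                              # predictions on noisy inputs
    empirical = np.mean((y_tilde - y) ** 2)
    predicted = (y_hat - y) ** 2 + sigma ** 2 * np.sum(w ** 2)
    return empirical, predicted

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
empirical, predicted = noisy_input_mse(w, x, y=1.0, sigma=0.3, n_samples=200_000, rng=rng)
print(empirical, predicted)   # the two values should agree up to sampling error
```
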
Module 8.8 : Adding Noise to the outputs

Hard targets: 0 0 1 0 0 0 0 0 0 0

$$\text{minimize}: \; -\sum_{i=0}^{9} p_i \log q_i$$

true distribution : $p = \{0, 0, 1, 0, 0, 0, 0, 0, 0, 0\}$
estimated distribution : $q$

Intuition
Do not trust the true labels, they may be noisy
Instead, use soft targets

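Soft targets: $\frac{\varepsilon}{9} \;\; \frac{\varepsilon}{9} \;\; 1-\varepsilon \;\; \frac{\varepsilon}{9} \;\; \frac{\varepsilon}{9} \;\; \frac{\varepsilon}{9} \;\; \frac{\varepsilon}{9} \;\; \frac{\varepsilon}{9} \;\; \frac{\varepsilon}{9} \;\; \frac{\varepsilon}{9}$

$\varepsilon$ = small positive constant

$$\text{minimize}: \; -\sum_{i=0}^{9} p_i \log q_i$$

true distribution + noise : $p = \left\{\frac{\varepsilon}{9}, \frac{\varepsilon}{9}, 1-\varepsilon, \frac{\varepsilon}{9}, \ldots\right\}$
estimated distribution : $q$
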
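The soft targets above can be built and plugged into the cross-entropy loss as follows. This is a minimal sketch; the function names `smooth_labels` and `cross_entropy` and the example distribution `q` are assumptions made for illustration.

```python
import numpy as np

def smooth_labels(true_class, n_classes, eps):
    """Soft target: 1 - eps on the true class, eps / (n_classes - 1) on every other class."""
    p = np.full(n_classes, eps / (n_classes - 1))
    p[true_class] = 1.0 - eps
    return p

def cross_entropy(p, q):
    """Cross-entropy -sum_i p_i log q_i between target p and estimated distribution q."""
    return -np.sum(p * np.log(q))

hard = np.zeros(10)
hard[2] = 1.0                                  # hard target for class 2
soft = smooth_labels(true_class=2, n_classes=10, eps=0.1)

q = np.full(10, 0.02)                          # an assumed model output
q[2] = 0.82
print(cross_entropy(hard, q), cross_entropy(soft, q))
```

With hard targets only $q_2$ enters the loss; with soft targets every $q_i$ contributes, so pushing $q_2$ all the way to 1 is penalised.
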
Module 8.9 : Early stopping

Track the validation error
Have a patience parameter $p$
If you are at step $k$ and there was no improvement in validation error in the previous $p$ steps then stop training and return the model stored at step $k - p$
Basically, stop the training early before it drives the training error to 0 and blows up the validation error

[Figure: training error and validation error versus steps; training stops at step $k$ and the model from step $k - p$ is returned]

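The patience rule can be sketched on a precomputed list of validation errors. This is an illustration only: the function name `early_stopping` is hypothetical, and a real training loop would evaluate the validation error after each step and keep a checkpoint of the best model instead of just its step index.

```python
def early_stopping(val_errors, patience):
    """Return (best_step, stop_step) for a sequence of validation errors.

    Training stops at step k once there has been no improvement in the previous
    `patience` steps; the model to return is the one stored at step k - patience.
    """
    best_err = float("inf")
    best_step = None
    for step, err in enumerate(val_errors):
        if best_step is None or err < best_err:
            best_err, best_step = err, step        # improvement: remember this step
        elif step - best_step >= patience:
            return best_step, step                 # patience exhausted: stop here
    return best_step, len(val_errors) - 1          # ran out of steps without stopping

val_errors = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.9]
best_step, stop_step = early_stopping(val_errors, patience=3)
print(best_step, stop_step)
```
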
Very effective and the most widely used form of regularization
Can be used even with other regularizers (such as $l_2$)
How does it act as a regularizer?
We will first see an intuitive explanation and then a mathematical analysis

Recall that the update rule in SGD is

$$w_{t+1} = w_t - \eta \nabla w_t = w_0 - \eta \sum_{i=1}^{t} \nabla w_i$$

Let $\tau$ be the maximum value of $\nabla w_i$, then

$$|w_{t+1} - w_0| \leq \eta t |\tau|$$

Thus, $t$ controls how far $w_t$ can go from the initial $w_0$
In other words it controls the space of exploration

We will now see a mathematical analysis of this

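Recall that the Taylor series approximation for $\mathcal{L}(w)$ is

$$\begin{aligned}
\mathcal{L}(w) &= \mathcal{L}(w^*) + (w - w^*)^T \nabla \mathcal{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \\
&= \mathcal{L}(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*) \qquad [\, w^* \text{ is optimal so } \nabla \mathcal{L}(w^*) \text{ is } 0 \,] \\
\nabla \mathcal{L}(w) &= H(w - w^*)
\end{aligned}$$

Now the SGD update rule is:

$$\begin{aligned}
w_t &= w_{t-1} - \eta \nabla \mathcal{L}(w_{t-1}) \\
&= w_{t-1} - \eta H (w_{t-1} - w^*) \\
&= (I - \eta H) w_{t-1} + \eta H w^*
\end{aligned}$$
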
$$w_t = (I - \eta H) w_{t-1} + \eta H w^*$$

Using EVD of $H$ as $H = Q\Lambda Q^T$, we get:

$$w_t = (I - \eta Q\Lambda Q^T) w_{t-1} + \eta Q\Lambda Q^T w^*$$

If we start with $w_0 = 0$ then we can show that (See Appendix)

$$w_t = Q[I - (I - \eta\Lambda)^t]Q^T w^*$$

Compare this with the expression we had for the optimum $\tilde{w}$ with L2 regularization

$$\tilde{w} = Q[I - (\Lambda + \alpha I)^{-1}\alpha]Q^T w^*$$

We observe that $w_t = \tilde{w}$, if we choose $\eta$, $t$ and $\alpha$ such that

$$(I - \eta\Lambda)^t = (\Lambda + \alpha I)^{-1}\alpha$$

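The closed form for $w_t$ can be verified numerically on a quadratic loss. The sketch below assumes a random symmetric positive-definite matrix as $H$, full-batch gradient steps with the exact gradient $H(w - w^*)$, and a step size chosen small enough for convergence; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.5 * np.eye(4)        # assumed symmetric positive-definite Hessian
w_star = rng.normal(size=4)          # assumed minimiser of the unregularised loss

lam, Q = np.linalg.eigh(H)           # H = Q diag(lam) Q^T
eta = 0.9 / lam.max()                # keeps |1 - eta * lam_i| < 1 for every i
t = 25

# Run t gradient steps from w_0 = 0 using grad L(w) = H (w - w_star)
w = np.zeros(4)
for _ in range(t):
    w = w - eta * H @ (w - w_star)

# Closed form: w_t = Q [I - (I - eta Lambda)^t] Q^T w_star
shrink = 1.0 - (1.0 - eta * lam) ** t
closed_form = Q @ (shrink * (Q.T @ w_star))
print(np.max(np.abs(w - closed_form)))
```

The vector `shrink` plays the same role as the diagonal of $D$ in L2 regularization: directions with large eigenvalues are close to 1 after a few steps, while directions with small eigenvalues stay close to 0.
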
Things to remember
Early stopping only allows $t$ updates to the parameters.
If a parameter $w$ corresponds to a dimension which is important for the loss $\mathcal{L}(\theta)$ then $\frac{\partial \mathcal{L}(\theta)}{\partial w}$ will be large
However if a parameter is not important ($\frac{\partial \mathcal{L}(\theta)}{\partial w}$ is small) then its updates will be small and the parameter will not be able to grow large in $t$ steps
Early stopping will thus effectively shrink the parameters corresponding to less important directions (same as weight decay).

Module 8.10 : Ensemble methods

Combine the output of different models to reduce generalization error
The models can correspond to different classifiers
It could be different instances of the same classifier trained with:
different hyperparameters
different features
different samples of the training data

[Figure: inputs $x_1, x_2, x_3, x_4$ fed to Logistic Regression, SVM and Naive Bayes; their outputs $y_{lr}$, $y_{svm}$, $y_{nb}$ are combined into $y_{final}$]

Bagging: form an ensemble using different instances of the same classifier
From a given dataset, construct multiple training sets by sampling with replacement ($T_1, T_2, \ldots, T_k$)
Train the $i$-th instance of the classifier using training set $T_i$

[Figure: three Logistic Regression models, each trained with a different sample of the data (sampling with replacement); their outputs $y_{lr1}$, $y_{lr2}$, $y_{lr3}$ are combined into $y_{final}$]

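The bagging procedure can be sketched as below. To keep the example short, a least-squares linear model is swapped in for logistic regression; the helper names (`bootstrap_sets`, `fit_linear`, `bagged_predict`) and the synthetic data are assumptions for illustration.

```python
import numpy as np

def bootstrap_sets(X, y, k, rng):
    """Build k training sets T_1..T_k by sampling the dataset with replacement."""
    n = len(X)
    sets = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)     # n indices drawn with replacement
        sets.append((X[idx], y[idx]))
    return sets

def fit_linear(X, y):
    """Least-squares fit (stand-in for training one instance of the classifier)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def bagged_predict(models, X):
    """Average the predictions of all models in the ensemble."""
    return np.mean([X @ w for w in models], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])                  # assumed ground truth
y = X @ w_true + 0.1 * rng.normal(size=50)

sets = bootstrap_sets(X, y, k=5, rng=rng)
models = [fit_linear(Xi, yi) for Xi, yi in sets]     # i-th model trained on T_i
y_final = bagged_predict(models, X)
```
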
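When would bagging work?
Consider a set of k LR models
Suppose that each model makes an error $\varepsilon_i$ on a test example
Let $\varepsilon_i$ be drawn from a zero mean multivariate normal distribution
Variance $= E[\varepsilon_i^2] = V$
Covariance $= E[\varepsilon_i \varepsilon_j] = C$
The error made by the average prediction of all the models is $\frac{1}{k}\sum_i \varepsilon_i$
The expected squared error is:
$$
\begin{aligned}
mse &= E\Big[\Big(\frac{1}{k}\sum_i \varepsilon_i\Big)^2\Big] \\
&= \frac{1}{k^2} E\Big[\sum_i \sum_{i=j} \varepsilon_i \varepsilon_j + \sum_i \sum_{i \neq j} \varepsilon_i \varepsilon_j\Big] \\
&= \frac{1}{k^2} E\Big[\sum_i \varepsilon_i^2 + \sum_i \sum_{i \neq j} \varepsilon_i \varepsilon_j\Big] \\
&= \frac{1}{k^2} \Big(\sum_i E[\varepsilon_i^2] + \sum_i \sum_{i \neq j} E[\varepsilon_i \varepsilon_j]\Big) \\
&= \frac{1}{k^2} (kV + k(k-1)C) \\
&= \frac{1}{k} V + \frac{k-1}{k} C
\end{aligned}
$$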
When would bagging work?
$mse = \frac{1}{k} V + \frac{k-1}{k} C$
If the errors of the models are perfectly correlated then $V = C$ and $mse = V$ [bagging does not help: the mse of the ensemble is as bad as the individual models]
If the errors of the models are independent or uncorrelated then $C = 0$ and the mse of the ensemble reduces to $\frac{1}{k} V$
On average, the ensemble will perform at least as well as its individual members
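The expression for the mse can be checked numerically. The following is a minimal sketch, assuming every model has the same error variance V and every pair of models has the same covariance C with 0 <= C <= V. The function names `ensemble_mse` and `simulate_ensemble_mse`, and the shared-plus-private noise construction used to generate correlated errors, are illustrative choices and not part of the lecture.

```python
import math
import random


def ensemble_mse(k, V, C):
    """Closed-form expected squared error of the average of k model errors."""
    return V / k + (k - 1) / k * C


def simulate_ensemble_mse(k, V, C, trials=20000, seed=0):
    """Monte Carlo estimate of E[((1/k) * sum_i eps_i)^2].

    Assumption for this sketch: eps_i = sqrt(C) * z0 + sqrt(V - C) * z_i with
    independent standard normals z0, z_1, ..., z_k. This gives zero-mean
    Gaussian errors with Var(eps_i) = V and Cov(eps_i, eps_j) = C,
    and requires 0 <= C <= V.
    """
    rng = random.Random(seed)
    shared_scale = math.sqrt(C)
    private_scale = math.sqrt(V - C)
    total = 0.0
    for _ in range(trials):
        z0 = rng.gauss(0.0, 1.0)  # noise component shared by all models
        errors = [shared_scale * z0 + private_scale * rng.gauss(0.0, 1.0)
                  for _ in range(k)]
        avg = sum(errors) / k
        total += avg * avg
    return total / trials


# Compare the formula with the simulation for uncorrelated (C = 0),
# partially correlated (C = 0.5) and perfectly correlated (C = V) errors.
for C in (0.0, 0.5, 1.0):
    print(C, ensemble_mse(10, 1.0, C), simulate_ensemble_mse(10, 1.0, C))
```

With k = 10 and V = 1 the formula gives 0.1 for C = 0 and 1.0 for C = V, which matches the two extreme cases discussed above; the simulated values should be close to these but will not match exactly.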

Module 8.11 : Dropout

Other forms of regularization
l2 regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout

Model averaging (a bagging ensemble) typically helps
Training several large neural networks to build an ensemble is prohibitively expensive
Option 1: Train several neural networks having different architectures (obviously expensive)
Option 2: Train multiple instances of the same network using different training samples (again expensive)
Even if we manage to train with option 1 or option 2, combining several models at test time is infeasible in real-time applications
Dropout is a technique which addresses both these issues.
Effectively, it allows training several neural networks without any significant computational overhead.
It also gives an efficient approximate way of combining exponentially many different neural networks.

Dropout refers to dropping out units
Temporarily remove a node and all its incoming/outgoing connections, resulting in a thinned network
Each node is retained with a fixed probability p (typically p = 0.5 for hidden nodes and p = 0.8 for visible nodes)
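Retaining each node with a fixed probability amounts to sampling a 0/1 mask and multiplying it into the activations. The sketch below illustrates this under stated assumptions: the helper names `sample_mask` and `apply_mask` and the activation values are hypothetical, and only the retention probabilities 0.8 and 0.5 come from the slide.

```python
import random


def sample_mask(n_units, p_retain, rng):
    """Return 0/1 flags; each unit is kept independently with probability p_retain."""
    return [1 if rng.random() < p_retain else 0 for _ in range(n_units)]


def apply_mask(activations, mask):
    """Zero out the activations of dropped units (mask value 0)."""
    return [a * m for a, m in zip(activations, mask)]


rng = random.Random(0)
visible = [0.2, -1.3, 0.7, 0.5]       # hypothetical input values
hidden = [1.0, 0.4, -0.6, 2.1, 0.3]   # hypothetical hidden activations

visible_mask = sample_mask(len(visible), 0.8, rng)  # p = 0.8 for visible nodes
hidden_mask = sample_mask(len(hidden), 0.5, rng)    # p = 0.5 for hidden nodes
thinned_visible = apply_mask(visible, visible_mask)
thinned_hidden = apply_mask(hidden, hidden_mask)
print(visible_mask, thinned_visible)
print(hidden_mask, thinned_hidden)
```

Zeroing a unit's activation has the same effect on the forward pass as removing the unit together with its outgoing connections, which is why a mask is enough to represent a thinned network.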

Suppose a neural network has n nodes
Using the dropout idea, each node can be retained or dropped
For example, in the above case we drop 5 nodes to get a thinned network
Given a total of n nodes, what is the total number of thinned networks that can be formed? $2^n$
Of course, this is prohibitively large and we cannot possibly train so many networks
Trick: (1) Share the weights across all the networks
(2) Sample a different network for each training instance
Let us see how.
We initialize all the parameters (weights) of the network and start training
For the first training instance (or mini-batch), we apply dropout resulting in
the thinned network
We compute the loss and backpropagate
Which parameters will we update? Only those which are active

For the second training instance (or mini-batch), we again apply dropout, resulting in a different thinned network
We again compute the loss and backpropagate to the active weights
If a weight was active for both the training instances then it would have received two updates by now
If a weight was active for only one of the training instances then it would have received only one update by now
Each thinned network gets trained rarely (or even never) but the parameter sharing ensures that no model has untrained or poorly trained parameters
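The training procedure above can be sketched on a toy network. This is only an illustration under stated assumptions: the network y = sum_j v[j] * mask[j] * (w[j] * x) with linear hidden units, the squared-error loss, the toy data, and the names `train_step`, `w`, `v` and `updates` are all hypothetical and not taken from the lecture.

```python
import random


def train_step(w, v, x, target, mask, lr):
    """One SGD step on a tiny network y = sum_j v[j] * mask[j] * (w[j] * x).

    Dropped units (mask[j] == 0) contribute nothing to the output, so the
    gradients of their weights are zero and those weights are left unchanged.
    """
    hidden = [w_j * x for w_j in w]
    y = sum(v_j * m_j * h_j for v_j, m_j, h_j in zip(v, mask, hidden))
    err = y - target  # derivative of 0.5 * (y - target)**2 with respect to y
    for j, m_j in enumerate(mask):
        if m_j == 0:
            continue  # inactive unit: no update for w[j] or v[j]
        grad_v = err * hidden[j]
        grad_w = err * v[j] * x
        v[j] -= lr * grad_v
        w[j] -= lr * grad_w
    return 0.5 * err * err


rng = random.Random(0)
n_hidden, p_retain, lr = 4, 0.5, 0.01
w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]  # shared by all thinned networks
v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]  # shared by all thinned networks
updates = [0] * n_hidden  # number of steps in which each unit was active

data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (0.5, 1.0)]  # toy (x, target) pairs
for step in range(200):
    x, target = data[step % len(data)]
    # Sample a new thinned network for every training instance.
    mask = [1 if rng.random() < p_retain else 0 for _ in range(n_hidden)]
    train_step(w, v, x, target, mask, lr)
    updates = [u + m for u, m in zip(updates, mask)]

print(updates)
```

With 4 hidden units there are only 16 possible masks, yet every unit's weights are updated in roughly half of the 200 steps, because all thinned networks read and write the same `w` and `v`.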
[Figure: at training time, a unit with weights w1, w2, w3, w4 is present with probability p; at test time, the unit is always present and its weights are pw1, pw2, pw3, pw4]

What happens at test time?

Impossible to aggregate the outputs of $2^n$ thinned networks
Instead we use the full neural network and scale the output of each node by the fraction of times it was on during training
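For a single linear unit, the test-time scaling rule can be verified exactly by enumerating every mask. The sketch below is an illustration under stated assumptions: it considers one linear unit whose inputs are each retained independently with probability p, and the names `expected_output_over_masks` and `scaled_output` and the numeric values are hypothetical. For networks with nonlinearities the scaling is only an approximation of the average over thinned networks.

```python
from itertools import product


def expected_output_over_masks(weights, inputs, p):
    """Exact expectation of sum_j m_j * w_j * x_j over all 2^n masks.

    Each m_j is 1 with probability p and 0 otherwise, independently.
    Returns the expectation and the number of masks enumerated.
    """
    n = len(weights)
    expectation = 0.0
    n_masks = 0
    for mask in product((0, 1), repeat=n):
        kept = sum(mask)
        prob = (p ** kept) * ((1 - p) ** (n - kept))  # probability of this mask
        output = sum(m * w * x for m, w, x in zip(mask, weights, inputs))
        expectation += prob * output
        n_masks += 1
    return expectation, n_masks


def scaled_output(weights, inputs, p):
    """Output of the full unit with every weight scaled by p (test-time rule)."""
    return sum(p * w * x for w, x in zip(weights, inputs))


weights = [0.5, -1.0, 2.0, 0.25]  # hypothetical w1..w4
inputs = [1.0, 2.0, -0.5, 4.0]    # hypothetical incoming activations
p = 0.5
avg, n_masks = expected_output_over_masks(weights, inputs, p)
print(n_masks, avg, scaled_output(weights, inputs, p))
```

The enumeration visits all 2^n masks, which is feasible here only because n = 4; the scaled full network reaches the same value in a single forward pass.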

Dropout essentially applies a masking noise to the hidden units
Prevents hidden units from co-adapting
Essentially a hidden unit cannot rely too much on other units as they may get dropped out at any time
Each hidden unit has to learn to be more robust to these random dropouts

Here is an example of how dropout helps in ensuring redundancy and robustness
Suppose $h_i$ learns to detect a face by firing on detecting a nose
Dropping $h_i$ then corresponds to erasing the information that a nose exists
The model should then learn another $h_i$ which redundantly encodes the presence of a nose
Or the model should learn to detect the face using other features

Recap
l2 regularization
Dataset augmentation
Parameter Sharing and tying
Adding Noise to the inputs
Adding Noise to the outputs
Early stopping
Ensemble methods
Dropout

Appendix

To prove: The two equations below are equivalent

$$w_t = (I - \eta Q \Lambda Q^T) w_{t-1} + \eta Q \Lambda Q^T w^*$$

$$w_t = Q[I - (I - \eta \Lambda)^t] Q^T w^*$$

Proof by induction:
Base case: $t = 1$ and $w_0 = 0$:
$w_1$ according to the first equation:

$$w_1 = (I - \eta Q \Lambda Q^T) w_0 + \eta Q \Lambda Q^T w^* = \eta Q \Lambda Q^T w^*$$

$w_1$ according to the second equation:

$$w_1 = Q(I - (I - \eta \Lambda)^1) Q^T w^* = \eta Q \Lambda Q^T w^*$$
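Induction step: Let the two equations be equivalent for the $t$-th step

$$\therefore w_t = (I - \eta Q \Lambda Q^T) w_{t-1} + \eta Q \Lambda Q^T w^* = Q[I - (I - \eta \Lambda)^t] Q^T w^*$$

Proof that this will hold for the $(t+1)$-th step

$$
\begin{aligned}
w_{t+1} &= (I - \eta Q \Lambda Q^T) w_t + \eta Q \Lambda Q^T w^* \\
&\quad (\text{using } w_t = Q[I - (I - \eta \Lambda)^t] Q^T w^*) \\
&= (I - \eta Q \Lambda Q^T) Q (I - (I - \eta \Lambda)^t) Q^T w^* + \eta Q \Lambda Q^T w^* \\
&\quad (\text{opening this bracket}) \\
&= I Q (I - (I - \eta \Lambda)^t) Q^T w^* - \eta Q \Lambda Q^T Q (I - (I - \eta \Lambda)^t) Q^T w^* + \eta Q \Lambda Q^T w^* \\
&= Q (I - (I - \eta \Lambda)^t) Q^T w^* - \eta Q \Lambda Q^T Q (I - (I - \eta \Lambda)^t) Q^T w^* + \eta Q \Lambda Q^T w^*
\end{aligned}
$$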

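Continuing

$$
\begin{aligned}
w_{t+1} &= Q(I - (I - \eta\Lambda)^t)Q^T w^* - \eta Q\Lambda Q^T Q(I - (I - \eta\Lambda)^t)Q^T w^* + \eta Q\Lambda Q^T w^* \\
&= Q(I - (I - \eta\Lambda)^t)Q^T w^* - \eta Q\Lambda (I - (I - \eta\Lambda)^t)Q^T w^* + \eta Q\Lambda Q^T w^* \quad (\because Q^T Q = I) \\
&= Q\big[(I - (I - \eta\Lambda)^t) - \eta\Lambda(I - (I - \eta\Lambda)^t) + \eta\Lambda\big]Q^T w^* \\
&= Q\big[I - (I - \eta\Lambda)^t + \eta\Lambda(I - \eta\Lambda)^t\big]Q^T w^* \\
&= Q\big[I - (I - \eta\Lambda)^t (I - \eta\Lambda)\big]Q^T w^* \\
&= Q(I - (I - \eta\Lambda)^{t+1})Q^T w^*
\end{aligned}
$$

Hence, proved!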
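The equivalence can also be checked numerically on a small example. The following is a minimal sketch, assuming a 2x2 problem where Q is a rotation matrix (so that $Q^T Q = I$) and $\Lambda$ is diagonal; the helper names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def matvec(M, v):
    """Multiply a 2x2 matrix (list of rows) by a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]


def transpose(M):
    """Transpose a 2x2 matrix."""
    return [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]


def recurrence(Q, lam, w_star, eta, steps):
    """Iterate w_t = (I - eta*Q*Lam*Q^T) w_{t-1} + eta*Q*Lam*Q^T w_star from w_0 = 0."""
    def H(vec):  # apply Q * Lam * Q^T to vec
        z = matvec(transpose(Q), vec)
        z = [lam[0] * z[0], lam[1] * z[1]]
        return matvec(Q, z)

    w = [0.0, 0.0]
    for _ in range(steps):
        Hw, Hstar = H(w), H(w_star)
        w = [w[i] - eta * Hw[i] + eta * Hstar[i] for i in range(2)]
    return w


def closed_form(Q, lam, w_star, eta, steps):
    """Evaluate w_t = Q [I - (I - eta*Lam)^t] Q^T w_star."""
    z = matvec(transpose(Q), w_star)
    z = [(1.0 - (1.0 - eta * lam[i]) ** steps) * z[i] for i in range(2)]
    return matvec(Q, z)


theta = 0.7            # arbitrary rotation angle, so Q is orthogonal
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
lam = [2.0, 0.5]       # hypothetical eigenvalues (diagonal of Lambda)
w_star = [1.0, -3.0]   # hypothetical optimum w*
eta = 0.1

for t in (1, 5, 50):
    print(t, recurrence(Q, lam, w_star, eta, t), closed_form(Q, lam, w_star, eta, t))
```

The two columns printed for each t should agree up to floating-point rounding, which is the numerical counterpart of the induction argument.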
