
Multilayer Perceptron

1
MLP

Function signals, Error signals and Credit Assignment Problem


Batch learning, Online Learning
Now when you have multiple neurons in multiple layers, you
get: Multilayer Perceptron
Training?
The Back Propagation Algorithm
2
Batch learning:
- All / multiple examples are presented together in one
instance / epoch of training.
- Cost function (optimization) → average error energy
- Synaptic weights are adjusted on an epoch-to-epoch basis

e_j(n) = d_j(n) − y_j(n)    (error of the jth neuron)

ε(n) = ½ Σ_{j∈C} e_j²(n)    (summed over all output neurons)

3
ε_av(N) = (1/N) Σ_{n=1}^{N} ε(n)    (averaged over all samples;
a function of all the free parameters)

Method of gradient descent

Pros and Cons

- Accurate estimation of the gradient vector (the derivative of the
cost function with respect to the weight vector), increasing the
assurance of convergence to a local minimum
- Fast learning process
- High storage / memory requirements
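The epoch-wise update above can be sketched for a single linear neuron (a minimal illustration; the helper name `batch_epoch`, the toy data and the learning rate are assumptions, not from the slides):

```python
def batch_epoch(w, b, samples, eta):
    """One epoch of batch learning: accumulate the gradient of the
    average error energy over ALL samples, then update once."""
    n = len(samples)
    gw = gb = 0.0
    for x, d in samples:
        e = d - (w * x + b)        # error signal of the (linear) neuron
        gw -= e * x / n            # d(average error energy)/dw
        gb -= e / n
    # synaptic weights adjusted on an epoch-to-epoch basis
    return w - eta * gw, b - eta * gb

samples = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w = b = 0.0
for _ in range(200):
    w, b = batch_epoch(w, b, samples, eta=0.3)
# w, b approach the generating values 2 and 1
```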

4
Online learning:
- Training is done on an example-by-example basis
- Cost function (optimization) → instantaneous error energy
- Synaptic weights are adjusted on a sample-to-sample basis

An epoch with N training samples is ordered:
{(x(1), d(1)), …, (x(N), d(N))}

Gradient descent is applied as each example pair is presented, using
the instantaneous error energy ε(n) = ½ Σ_{j∈C} e_j²(n)

In every epoch, the instantaneous error is measured until we reach the
final value ε(N)
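The sample-by-sample scheme can be sketched the same way for a single linear neuron (illustrative names, data and learning rate; not from the slides):

```python
def online_epoch(w, b, samples, eta):
    """One epoch of online learning: gradient descent is applied as each
    example pair is presented; weights change on a sample-to-sample basis."""
    for x, d in samples:
        e = d - (w * x + b)        # instantaneous error for this sample
        w += eta * e * x           # delta-rule update from 0.5 * e^2
        b += eta * e
    return w, b

samples = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w = b = 0.0
for _ in range(100):
    w, b = online_epoch(w, b, samples, eta=0.1)
# w, b approach the generating values 2 and 1
```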

5
The learning curve plots the final error value versus the epoch number.

- Usually, the same training examples, BUT randomly shuffled, are
used in each epoch.
- For study, a sufficiently large number of initial conditions is
chosen at random, each yielding a realization of the learning curve,
and their average is considered.

Stochastic method: the updates are instantaneous and random!

This lets the trajectory jump out of (avoid) local minima and allows
tracking of small changes, and it is ALSO simple and elegant!

6
Back Propagation
Online training for multilayer
perceptrons

7
Induced local field / total activation:
v_j(n) = Σ_i ω_ji(n) y_i(n)

Activation function:
y_j(n) = φ_j(v_j(n))

Cost function:
ε(ω(n)) = ½ ‖Y(n) − D(n)‖²

Based on gradient descent (similar to LMS):
Δω(n) = −η ∂ε(n)/∂ω(n)
8
Sensitivity factor (how much change):
∂ε(n)/∂ω_ji(n) = [∂ε(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] [∂v_j(n)/∂ω_ji(n)]

We have ∂ε(n)/∂e_j(n) = e_j(n)

We have ∂e_j(n)/∂y_j(n) = −1
9
We have ∂y_j(n)/∂v_j(n) = φ′_j(v_j(n)) and ∂v_j(n)/∂ω_ji(n) = y_i(n)

So we have ∂ε(n)/∂ω_ji(n) = −e_j(n) φ′_j(v_j(n)) y_i(n)

Error correction:
Δω_ji(n) = η δ_j(n) y_i(n), with local gradient δ_j(n) = e_j(n) φ′_j(v_j(n))
10
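A quick numerical check of the error-correction step for an output neuron with a logistic activation (the weights, inputs and learning rate below are made-up illustrative values):

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

eta = 0.5
y_in = [1.0, 0.4, -0.6]   # activations y_i arriving at neuron j (made up)
w = [0.2, -0.1, 0.3]      # synaptic weights w_ji (made up)
d = 1.0                   # desired response

v = sum(wi * yi for wi, yi in zip(w, y_in))   # induced local field v_j(n)
y = logistic(v)                               # output y_j(n)
e = d - y                                     # error signal e_j(n)
delta = e * y * (1.0 - y)                     # local gradient e_j * phi'(v_j)
w = [wi + eta * delta * yi for wi, yi in zip(w, y_in)]   # delta rule

# the correction reduces the instantaneous error energy (1/2) e^2
e2 = d - logistic(sum(wi * yi for wi, yi in zip(w, y_in)))
assert abs(e2) < abs(e)
```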
In Δω_ji(n) = η δ_j(n) y_i(n), the factor y_i(n) is the activation
arriving over the link from neuron i.

The local (neuron-level) error gradient δ_j(n) is the key (how to get
it in a multilayer network?)
• Output node j: easy, the desired response is supplied!
• Hidden node j: ?
11
For a hidden neuron:
Sequentially go backwards, tracking the error signals from all the
neurons it feeds into back to the hidden neuron in question.

That's why it's called back propagation.

12
For hidden neuron j, the error energy sums over the output neurons k,
ε(n) = ½ Σ_{k∈C} e_k²(n), so

∂ε(n)/∂y_j(n) = Σ_k e_k(n) ∂e_k(n)/∂y_j(n)
13
We have e_k(n) = d_k(n) − y_k(n) = d_k(n) − φ_k(v_k(n)), hence
∂e_k(n)/∂v_k(n) = −φ′_k(v_k(n)) and ∂v_k(n)/∂y_j(n) = ω_kj(n), giving

∂ε(n)/∂y_j(n) = −Σ_k e_k(n) φ′_k(v_k(n)) ω_kj(n) = −Σ_k δ_k(n) ω_kj(n)
14
Backward propagation of error:
δ_j(n) = φ′_j(v_j(n)) Σ_k δ_k(n) ω_kj(n)
15
Correction (delta rule):
Δω_ji(n) = η δ_j(n) y_i(n)

δ_j(n) = e_j(n) φ′_j(v_j(n))               (output neuron)
or
δ_j(n) = φ′_j(v_j(n)) Σ_k δ_k(n) ω_kj(n)   (hidden neuron)

For every input pattern / vector:
Forward pass: function signals flow forward from input to output.
Backward pass: error / local gradients flow from the output layer
towards the input via the hidden layers.
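The two passes can be sketched end to end for a tiny network trained online on XOR. Everything here is an illustrative assumption (2-2-1 architecture, tanh hidden units, logistic output, bias as a constant +1 input, η = 0.5, the random seed):

```python
import math, random

random.seed(0)

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

# 2-2-1 network; each neuron gets a bias via a constant +1 input
w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # hidden
w_o = [random.uniform(-1, 1) for _ in range(3)]                      # output
eta = 0.5

def forward(x):
    xb = x + [1.0]                                    # append bias input
    y_h = [math.tanh(sum(w * xi for w, xi in zip(row, xb))) for row in w_h]
    yb = y_h + [1.0]
    y_o = logistic(sum(w * yi for w, yi in zip(w_o, yb)))
    return xb, y_h, yb, y_o

def train_pattern(x, d):
    """One forward pass + one backward pass (online back propagation)."""
    xb, y_h, yb, y_o = forward(x)
    # output neuron: delta = e * phi'(v), logistic phi' = y * (1 - y)
    delta_o = (d - y_o) * y_o * (1.0 - y_o)
    # hidden neurons: delta = phi'(v) * sum_k delta_k * w_kj, tanh phi' = 1 - y^2
    delta_h = [(1.0 - y_h[j] ** 2) * delta_o * w_o[j] for j in range(2)]
    for i in range(3):                                # backward pass: updates
        w_o[i] += eta * delta_o * yb[i]
    for j in range(2):
        for i in range(3):
            w_h[j][i] += eta * delta_h[j] * xb[i]

xor = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
       ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def avg_error():
    return sum(0.5 * (d - forward(x)[3]) ** 2 for x, d in xor) / len(xor)

before = avg_error()
for _ in range(2000):
    random.shuffle(xor)          # same examples, randomly shuffled, each epoch
    for x, d in xor:
        train_pattern(x, d)
after = avg_error()
assert after < before            # training reduced the average error energy
```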
16
Error Functions

By considering d as a random variable and invoking the
log-likelihood, the cost becomes

−E_T[log p_{D|W,X}(d | w, x)]


This is nothing but the cross-entropy between d and wᵀx
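As a sanity check on this cost for a single logistic unit: the chain rule gives the gradient of the cross-entropy with respect to w_i as −(d − y)x_i, the same delta-rule form as before (values below are illustrative):

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def cross_entropy(d, y):
    # cross-entropy between desired response d and output y
    return -(d * math.log(y) + (1.0 - d) * math.log(1.0 - y))

w, x, d = [0.3, -0.2], [1.0, 2.0], 1.0        # illustrative values
y = logistic(sum(wi * xi for wi, xi in zip(w, x)))

# chain rule: gradient wrt w_i is -(d - y) * x_i
grad = [-(d - y) * xi for xi in x]

# finite-difference check of that gradient
h = 1e-6
for i in range(2):
    wp = list(w)
    wp[i] += h
    yp = logistic(sum(wi * xi for wi, xi in zip(wp, x)))
    num = (cross_entropy(d, yp) - cross_entropy(d, y)) / h
    assert abs(num - grad[i]) < 1e-4
```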

17
*discuss later
18
The activation function needs to be continuously differentiable.

Derivatives of sigmoid activation functions:

Logistic function:
φ_j(v) = 1 / (1 + e^(−av)),  a > 0
φ′_j(v_j(n)) = a y_j(n) [1 − y_j(n)]

The derivative is maximal at y_j = 0.5, so most weight change happens
at midrange signal values.
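The claim about the logistic derivative can be verified numerically (a = 1 assumed here):

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def dphi(y):
    # derivative expressed through the output: a * y * (1 - y), with a = 1
    return y * (1.0 - y)

# check the identity phi'(v) = y * (1 - y) at a few points
h = 1e-6
for v in [-2.0, 0.0, 1.5]:
    num = (logistic(v + h) - logistic(v)) / h
    assert abs(num - dphi(logistic(v))) < 1e-5

# the derivative peaks at y = 0.5, i.e. at midrange signal values
ys = [i / 100 for i in range(1, 100)]
assert max(ys, key=dphi) == 0.5
```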

19
Output neuron: δ_j(n) = a y_j(n) [1 − y_j(n)] [d_j(n) − y_j(n)]

Hidden neuron: δ_j(n) = a y_j(n) [1 − y_j(n)] Σ_k δ_k(n) ω_kj(n)

Hyperbolic tangent:
φ_j(v) = a tanh(bv)
φ′_j(v_j(n)) = (b/a) [a − y_j(n)] [a + y_j(n)]
20
Output neuron: δ_j(n) = (b/a) [a − y_j(n)] [a + y_j(n)] [d_j(n) − y_j(n)]

Hidden neuron: δ_j(n) = (b/a) [a − y_j(n)] [a + y_j(n)] Σ_k δ_k(n) ω_kj(n)

The gradient is maximal at midrange signal values (y_j = 0 for the
hyperbolic tangent), so that is where most weight change happens.
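A numerical check of the tanh derivative identity (the constants a = 1.7159, b = 2/3 are a commonly used choice, assumed here rather than taken from the slides):

```python
import math

a, b = 1.7159, 2.0 / 3.0   # assumed constants for phi(v) = a * tanh(b * v)

def phi(v):
    return a * math.tanh(b * v)

def dphi(y):
    # derivative expressed through the output: (b/a) * (a - y) * (a + y)
    return (b / a) * (a - y) * (a + y)

# check the identity numerically at a few points
h = 1e-6
for v in [-1.0, 0.0, 0.5, 2.0]:
    num = (phi(v + h) - phi(v)) / h
    assert abs(num - dphi(phi(v))) < 1e-4

# the gradient is largest where the output is near 0 (midrange)
assert dphi(0.0) > dphi(1.5)
```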

What about a sign / threshold function, or a purely linear one?

21
A learning rate η that is too large → oscillatory, unstable network!
A small η → smoother trajectory in weight space, but slow descent /
learning towards the optimum.

Adding a momentum term with momentum constant α gives the
Generalized Delta Rule:
Δω_ji(n) = α Δω_ji(n−1) + η δ_j(n) y_i(n)
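A sketch of a momentum update acting on a single weight, written with the gradient g(n), so that −η g(n) plays the role of η δ_j(n) y_i(n) (function name and values are illustrative):

```python
def momentum_updates(grads, eta, alpha):
    """Generalized-delta-rule-style update on one weight:
    dw(n) = alpha * dw(n-1) - eta * g(n)."""
    dw, out = 0.0, []
    for g in grads:
        dw = alpha * dw - eta * g
        out.append(dw)
    return out

eta, alpha = 0.1, 0.9
# successive gradients of the same sign -> the steps accelerate
accel = momentum_updates([1.0, 1.0, 1.0, 1.0], eta, alpha)
assert abs(accel[-1]) > abs(accel[0])
# alternating signs -> stabilizing (decaying) effect on the step size
stab = momentum_updates([1.0, -1.0, 1.0, -1.0], eta, alpha)
assert abs(stab[-1]) < abs(accel[-1])
```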

• When successive terms of the sum have the same sign, the sum grows,
meaning accelerating descent
• When successive terms differ in sign, the sum shrinks, a stabilizing
(decaying) effect
22
Unrolling the recursion from the start time t = 0 to the current time n:

Δω_ji(n) = −η Σ_{t=0}^{n} α^(n−t) ∂ε(t)/∂ω_ji(t)

Δω_ji is the sum of an exponentially weighted time series,
which converges when |α| < 1
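The recursive rule and the exponentially weighted sum can be checked against each other on one weight (illustrative gradient sequence):

```python
# recursive form:  dw(n) = alpha * dw(n-1) - eta * g(n)
# closed form:     dw(n) = -eta * sum_{t=0..n} alpha**(n - t) * g(t)
eta, alpha = 0.1, 0.9
grads = [0.5, -0.2, 0.3, 0.1, -0.4]   # illustrative gradient sequence

dw = 0.0
for g in grads:                        # recursion
    dw = alpha * dw - eta * g

n = len(grads) - 1                     # exponentially weighted sum
closed = -eta * sum(alpha ** (n - t) * g for t, g in enumerate(grads))
assert abs(dw - closed) < 1e-12        # the two forms agree
```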

→ Connection dependent*

23
No analytical convergence proof

One way: stop when the Euclidean norm of the gradient vector falls
below a sufficiently small threshold.

But, so many weight vectors, so many gradients!

Another way: stop when the average squared error changes by only a
sufficiently small amount from one epoch to the next.

You can devise your own suitable way!
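The two stopping criteria can be sketched as follows (the function name and the thresholds are assumptions, not from the slides):

```python
import math

def should_stop(grad, prev_err, err, grad_tol=1e-4, err_tol=1e-6):
    """Stop when either heuristic criterion fires (thresholds assumed):
    1) the Euclidean norm of the gradient vector is sufficiently small, or
    2) the error changes too little from one epoch to the next."""
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return grad_norm < grad_tol or abs(prev_err - err) < err_tol

assert should_stop([1e-5, -2e-5], 0.5, 0.3)      # tiny gradient: stop
assert should_stop([0.2, -0.1], 0.4, 0.4)        # error plateau: stop
assert not should_stop([0.2, -0.1], 0.5, 0.3)    # keep training
```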

24