By definition:

x_i = \sum_j w_{ij} a_j

o_i = \frac{1}{1 + e^{-x_i}}

Summed, squared error at the output layer:

E = \frac{1}{2} \sum_i (d_i - o_i)^2
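These definitions map directly onto code. A minimal NumPy sketch (the helper names sigmoid, net_input, and sse are our own, chosen to mirror the notation above):

import numpy as np

def sigmoid(x):
    # o = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def net_input(w, a):
    # x_i = sum_j w_ij * a_j, with one row of w per receiving unit i
    return w @ a

def sse(d, o):
    # E = 1/2 * sum_i (d_i - o_i)^2
    return 0.5 * np.sum((d - o) ** 2)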
Derivation of Backprop
By chain rule:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_i} \cdot \frac{\partial o_i}{\partial x_i} \cdot \frac{\partial x_i}{\partial w_{ij}}

Since E = \frac{1}{2} \sum_i (d_i - o_i)^2:

\frac{\partial E}{\partial o_i} = \frac{1}{2} \cdot 2\,(d_i - o_i)\,(-1) = (o_i - d_i)

\frac{\partial o_i}{\partial x_i} = \frac{\partial}{\partial x_i}\left[\frac{1}{1 + e^{-x_i}}\right] = -\frac{1}{(1 + e^{-x_i})^2}\,(-e^{-x_i}) = \frac{e^{-x_i}}{(1 + e^{-x_i})^2}

= \frac{(1 + e^{-x_i}) - 1}{1 + e^{-x_i}} \cdot \frac{1}{1 + e^{-x_i}} = \left[1 - \frac{1}{1 + e^{-x_i}}\right] \cdot \frac{1}{1 + e^{-x_i}} = (1 - o_i)\, o_i

Since x_i = \sum_j w_{ij} a_j:

\frac{\partial x_i}{\partial w_{ij}} = a_j
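Multiplying the three factors together gives the slope of the error w.r.t. an output-layer weight, the combined form that reappears in the weight-update rule later on:

\frac{\partial E}{\partial w_{ij}} = (o_i - d_i)\,(1 - o_i)\, o_i\, a_j = -(d_i - o_i)\,(1 - o_i)\, o_i\, a_j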
Derivation of Backprop
\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_i} \cdot \frac{\partial o_i}{\partial x_i} \cdot \frac{\partial x_i}{\partial w_{ij}}

\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}} \qquad \text{(where } \eta \text{ is an arbitrary learning rate)}

\frac{\partial x_i}{\partial w_{ij}} = a_j
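As an illustration with made-up numbers (not taken from the slides): suppose \eta = 0.1, target d_i = 1, output o_i = 0.73, and incoming activation a_j = 0.5. Then

\Delta w_{ij} = \eta\, (d_i - o_i)(1 - o_i)\, o_i\, a_j = 0.1 \times 0.27 \times 0.27 \times 0.73 \times 0.5 \approx 0.0027

so this weight is nudged up slightly, which raises o_i towards its target on this pattern.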
Derivation of Backprop
We now need to compute the weight changes in the hidden layer, so, as before,
we write out the equation for the error function's slope w.r.t. a
particular weight leading into the hidden layer:
\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_i} \cdot \frac{\partial a_i}{\partial x_i} \cdot \frac{\partial x_i}{\partial w_{ij}}
(where i now corresponds to a unit in the hidden layer and j now
corresponds to a unit in the input or earlier hidden layer)
From the previous derivation, the last two terms can simply be written down:

\frac{\partial a_i}{\partial x_i} = (1 - a_i)\, a_i

\frac{\partial x_i}{\partial w_{ij}} = a_j
Derivation of Backprop
However, the first term is more difficult to understand for this hidden
layer. It is what Minsky called the credit assignment problem, and it is
what stumped connectionists for two decades. The trick is to realize
that the hidden nodes do not themselves make errors; rather, they
contribute to the errors of the output nodes. So the derivative of the
total error w.r.t. a hidden neuron's activation is the sum of that hidden
neuron's contributions to the errors in all of the output neurons:
\frac{\partial E}{\partial a_i} = \sum_k \frac{\partial E}{\partial o_k} \cdot \frac{\partial o_k}{\partial x_k} \cdot \frac{\partial x_k}{\partial a_i} \qquad \text{(where } k \text{ indexes over all output units)}

\frac{\partial o_k}{\partial x_k} = (1 - o_k)\, o_k

\frac{\partial x_k}{\partial a_i} = w_{ki}
Derivation of Backprop
Combining these terms then yields:

\frac{\partial E}{\partial a_i} = - \sum_k (d_k - o_k)(1 - o_k)\, o_k\, w_{ki}

The backpropagated sum \sum_k (d_k - o_k)(1 - o_k)\, o_k\, w_{ki} plays the role of the hidden unit's error e_i; multiplied by the local derivative (1 - a_i)\, a_i it gives the hidden unit's delta \delta_i, used in the update rule below.
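In code, this backpropagation of deltas is typically a single matrix product; a sketch under the notation above (W_out[k, i] holds w_ki, and the helper names are our own):

def output_deltas(d, o):
    # delta_k = (d_k - o_k) * (1 - o_k) * o_k for every output unit k
    return (d - o) * (1.0 - o) * o

def hidden_deltas(d, o, W_out, a_hidden):
    # e_i = sum_k delta_k * w_ki, then delta_i = e_i * (1 - a_i) * a_i
    e_hidden = W_out.T @ output_deltas(d, o)
    return e_hidden * (1.0 - a_hidden) * a_hidden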
Forward Propagation of Activity
• Forward Direction layer by layer:
– Inputs applied
– Multiplied by weights
– Summed
– ‘Squashed’ by sigmoid activation function
– Output passed to each neuron in next layer
• Repeat above until network output produced
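In code, this forward sweep through a single hidden layer is just two matrix-vector products and two squashings; a sketch using the sigmoid helper from the earlier definitions (W1 and W2 are assumed weight matrices with one row per receiving unit):

def forward(W1, W2, a_in):
    # inputs applied, multiplied by weights, summed, squashed, passed on
    a_hidden = sigmoid(W1 @ a_in)    # hidden layer activations
    o = sigmoid(W2 @ a_hidden)       # network output
    return a_hidden, o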
Back-propagation of error
We can then update the weights using the Generalised Delta Rule
(GDR), also known as the Back Propagation (BP) algorithm.
For an output neuron i:

w_{ij}^{t+1} = w_{ij}^{t} + \eta\, (d_i - o_i)(1 - o_i)\, o_i\, a_j

For a hidden neuron i (using the gradient derived above):

w_{ij}^{t+1} = w_{ij}^{t} + \eta \left[ \sum_k (d_k - o_k)(1 - o_k)\, o_k\, w_{ki} \right] (1 - a_i)\, a_i\, a_j
The chain rule does the following: it distributes the error of an output unit o to all
the hidden units that it is connected to, weighted by that connection. Put differently,
a hidden unit h receives a delta from each output unit o equal to the delta
of that output unit weighted with (i.e. multiplied by) the weight of the connection
between those units.
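Both update rules fit in a few lines of code; continuing the earlier sketches (eta, W1, W2, a_in, a_hidden are assumed names, and np.outer forms the per-weight products delta_i * a_j):

def gdr_update(W1, W2, a_in, a_hidden, o, d, eta):
    # deltas are computed with the current weights, before anything changes
    delta_out = (d - o) * (1.0 - o) * o
    delta_hidden = (W2.T @ delta_out) * (1.0 - a_hidden) * a_hidden
    # w_ij(t+1) = w_ij(t) + eta * delta_i * a_j, for each layer
    W2 += eta * np.outer(delta_out, a_hidden)
    W1 += eta * np.outer(delta_hidden, a_in)
    return W1, W2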
Algorithm (Backpropagation)
Start with random weights
while error is unsatisfactory do
    for each input pattern
        compute hidden node input (net)
        compute hidden node output (o)
        compute input to output node (net)
        compute network output (o)
        Modify outer layer weights
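For concreteness, here is a minimal runnable version of the loop above, trained on XOR purely as an illustrative example (the data set, four hidden units, learning rate 0.5, 20000 epochs, and the bias handling via an appended constant input of 1 are all our assumptions, not details from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(patterns, targets, n_hidden=4, eta=0.5, epochs=20000, seed=0):
    rng = np.random.default_rng(seed)
    n_in, n_out = patterns.shape[1], targets.shape[1]
    W1 = rng.uniform(-1, 1, (n_hidden, n_in + 1))    # start with random weights
    W2 = rng.uniform(-1, 1, (n_out, n_hidden + 1))   # (+1 column = bias weight)
    for _ in range(epochs):                          # stand-in for "while error is unsatisfactory"
        for a_in, d in zip(patterns, targets):       # for each input pattern
            a_in = np.append(a_in, 1.0)              # constant bias input
            a_hid = np.append(sigmoid(W1 @ a_in), 1.0)   # hidden node input and output
            o = sigmoid(W2 @ a_hid)                  # network output
            delta_out = (d - o) * (1 - o) * o
            delta_hid = (W2.T @ delta_out) * (1 - a_hid) * a_hid
            W2 += eta * np.outer(delta_out, a_hid)       # modify outer layer weights
            W1 += eta * np.outer(delta_hid[:-1], a_in)   # modify hidden layer weights
    return W1, W2

def predict(W1, W2, x):
    a_hid = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
    return sigmoid(W2 @ a_hid)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])
W1, W2 = train(X, Y)
for x in X:
    print(x, predict(W1, W2, x))   # outputs should move towards 0, 1, 1, 0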
Homework