
Figure: supervised learning with a teacher (the teacher supplies target outputs, and the error between target and network output is used to adjust the weights).
The figure above depicts supervised learning with a teacher. In this paradigm the learning algorithm is given a set of input/output pattern pairs. The weights are adjusted so that the network will produce the required output in the future. In this example the algorithm would be given a set of pictures of animals that the teacher classifies as spider, insect, lizard, or other. If the network is shown a spider but classifies it as a lizard, the weights are adjusted so that the network responds "spider".
Learning Rules:
Mean Square Error:
• Like the perceptron learning rule, the least mean square error (LMS) algorithm is an example of supervised training, in which the learning rule is provided with a set of examples of desired network behavior:

$$\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$$

Here p_q is an input to the network, and t_q is the corresponding target output. As each input is applied to the network, the network output is compared to the target. The error is calculated as the difference between the target output and the network output. We want to minimize the average of these squared errors:

$$\text{mse} = \frac{1}{Q}\sum_{k=1}^{Q} e(k)^2 = \frac{1}{Q}\sum_{k=1}^{Q}\big(t(k)-a(k)\big)^2$$
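For illustration only (not part of the original slides), the following minimal Python sketch computes this mean square error for a small linear network; the array values, shapes, and variable names are assumptions.

```python
import numpy as np

# Hypothetical batch of Q examples: each column of P is an input p_q,
# and each column of T is the corresponding target t_q.
P = np.array([[1.0, -1.0, 2.0],
              [0.5,  2.0, 1.0]])          # shape (inputs, Q examples)
T = np.array([[1.0,  0.0, 1.5]])          # shape (outputs, Q examples)

W = np.zeros((1, 2))                      # linear-layer weights (assumed shape)
b = np.zeros((1, 1))                      # bias

A = W @ P + b                             # network outputs a(k) for the whole batch
E = T - A                                 # errors e(k) = t(k) - a(k)
mse = np.mean(E ** 2)                     # (1/Q) * sum of squared errors
print(mse)
```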

The LMS algorithm adjusts the weights and biases of the linear network so as to minimize this mean square error. Fortunately, the mean square error performance index for the linear network is a quadratic function. Thus, the performance index will either have one global minimum, a weak minimum, or no minimum, depending on the characteristics of the input vectors. Specifically, the characteristics of the input vectors determine whether or not a unique solution exists.
LMS Algorithm: (Delta Rule)
(Adaline rule) (Widrow-Hoff rule)
• The LMS algorithm, or Widrow-Hoff learning algorithm, is based on an approximate steepest descent procedure.
• Here again, linear networks are trained on examples of correct behavior.
• Widrow and Hoff had the insight that they could estimate the mean square error by using the squared error at each iteration.

If we take the partial derivative of the squared error with respect to the weights and biases at the kth iteration, we have:

$$\frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\,\frac{\partial e(k)}{\partial w_{1,j}}, \qquad \frac{\partial e^2(k)}{\partial b} = 2e(k)\,\frac{\partial e(k)}{\partial b}$$

Next, look at the partial derivative with respect to the error. For the linear network, e(k) = t(k) − (wᵀp(k) + b), so:

$$\frac{\partial e(k)}{\partial w_{1,j}} = -p_j(k), \qquad \frac{\partial e(k)}{\partial b} = -1$$
Substituting these results gives the Widrow-Hoff (LMS) update rule:

$$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k), \qquad \mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$$

Here the error e and the bias b are vectors and α is the learning rate. If α is large, learning occurs quickly, but if it is too large it may lead to instability, and the errors may even increase. To ensure stable learning, the learning rate must be less than the reciprocal of the largest eigenvalue of the correlation matrix pᵀp of the input vectors.
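A minimal Python sketch of this update rule, added for illustration; the toy data, the learning rate value, and the number of passes are assumptions.

```python
import numpy as np

# Toy training set: columns of P are inputs p(k), columns of T are targets t(k).
P = np.array([[1.0, -1.2, 2.0, 0.3],
              [0.5,  2.0, 1.0, -0.7]])
T = np.array([[0.5, -1.0, 1.5, 0.1]])

W = np.zeros((1, 2))
b = np.zeros((1, 1))
alpha = 0.04                      # learning rate (assumed small enough for stability)

for epoch in range(50):
    for k in range(P.shape[1]):
        p = P[:, [k]]             # current input vector
        t = T[:, [k]]             # current target
        e = t - (W @ p + b)       # error at this iteration
        W += 2 * alpha * e @ p.T  # Widrow-Hoff weight update
        b += 2 * alpha * e        # Widrow-Hoff bias update
```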

Finding the minimum of a function: gradient descent
Gradient descent on an error surface
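To make the idea in these figures concrete, here is a small illustrative sketch (the quadratic error surface, starting point, and step size are assumptions): gradient descent repeatedly steps in the direction of the negative gradient until it settles at a minimum.

```python
import numpy as np

def error(x):
    # Assumed example error surface: a quadratic bowl with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + 3.0 * (x[1] + 2.0) ** 2

def gradient(x):
    # Analytic gradient of the quadratic above.
    return np.array([2.0 * (x[0] - 1.0), 6.0 * (x[1] + 2.0)])

x = np.array([4.0, 3.0])          # starting point
alpha = 0.1                       # fixed learning rate (assumed)

for _ in range(100):
    x = x - alpha * gradient(x)   # step in the direction of steepest descent

print(x, error(x))                # x approaches (1, -2), error approaches 0
```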

Backpropagation:
Once the network weights and biases have been initialized, the network is ready for training. The network can be trained for function approximation (nonlinear regression), pattern association, or pattern classification. The training process requires a set of examples of proper network behavior: network inputs p and target outputs t. The default performance function for feedforward networks is the Mean Square Error (MSE), the average squared error between the network outputs a and the target outputs t.
Several different training algorithms for feedforward networks use the gradient of the performance function to determine how to adjust the weights to minimize performance.

The gradient is determined using a technique called backpropagation, which involves performing computations backwards through the network.

Backpropagation Algorithms:
There are many variations of the backpropagation algorithm. The simplest implementation of backpropagation learning updates the network weights and biases in the direction in which the performance function decreases most rapidly, i.e. the negative of the gradient. One iteration of this algorithm can be written as:

$$x_{k+1} = x_k - \alpha_k g_k$$

where x_k is the vector of current weights and biases, g_k is the current gradient, and α_k is the learning rate.
Backpropagation Algorithms:
• There are two different ways in which this gradient descent algorithm can be implemented:
• Incremental mode: the gradient is computed and the weights are updated after each input is applied to the network.
• Batch mode: all of the inputs are applied to the network before the weights are updated. (A short sketch contrasting the two modes is given below.)
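A minimal sketch of the two modes, added for illustration; the toy linear model, data, learning rate, and epoch count are assumptions.

```python
import numpy as np

# Toy data (assumed): rows of X are inputs, y holds targets for a linear model.
X = np.array([[1.0, 0.5], [-1.2, 2.0], [2.0, 1.0], [0.3, -0.7]])
y = np.array([0.5, -1.0, 1.5, 0.1])

def gradient(w, x, t):
    """Gradient of the squared error (t - w.x)^2 with respect to w."""
    return -2.0 * (t - x @ w) * x

alpha = 0.05

# Incremental (on-line) mode: update after every single example.
w = np.zeros(2)
for epoch in range(20):
    for x, t in zip(X, y):
        w -= alpha * gradient(w, x, t)

# Batch mode: sum the gradients over the whole training set, then update once.
w_batch = np.zeros(2)
for epoch in range(20):
    g = sum(gradient(w_batch, x, t) for x, t in zip(X, y))
    w_batch -= alpha * g
```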
Batch Training:
• Batch Training: In batch mode the weights and biases of the network are updated only after the entire training set has been applied to the network. The gradients calculated at each training example are added together to determine the change in the weights and biases.
• Batch Gradient Descent: In the batch steepest descent training function, the weights and biases are updated in the direction of the negative gradient of the performance function. There is only one training function associated with a given network.

Batch Gradient Descent with Momentum:
This algorithm, steepest descent with momentum, often provides faster convergence. Momentum allows a network to respond not only to the local gradient, but also to recent trends in the error surface. Acting like a low-pass filter, momentum allows the network to ignore small features in the error surface. Without momentum a network may get stuck in a shallow local minimum; with momentum a network can slide through such a minimum.
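A sketch of the momentum idea (illustrative only; the error surface, learning rate, and momentum coefficient are assumptions): each weight change is the sum of a fraction of the previous change and the usual gradient-descent step.

```python
import numpy as np

def gradient(w):
    # Assumed example: gradient of a simple quadratic error surface.
    return np.array([2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)])

w = np.array([4.0, 3.0])
velocity = np.zeros_like(w)       # running "memory" of previous weight changes
alpha = 0.05                      # learning rate (assumed)
momentum = 0.9                    # momentum coefficient (assumed)

for _ in range(200):
    # New step = momentum * previous step + plain gradient-descent step.
    velocity = momentum * velocity - alpha * gradient(w)
    w = w + velocity
```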
Faster Training:
• The previous two backpropagation training methods are often too slow for practical problems.
• There are several high-performance algorithms that can converge from ten to one hundred times faster than the algorithms mentioned earlier.
• The algorithms in this section operate in batch mode.
• These faster algorithms fall into two main categories:

Faster Training:
• The first category uses heuristic techniques, which were developed from an analysis of the performance of the standard steepest descent algorithm.
• One heuristic modification is the momentum technique, which was presented in the previous section. This section discusses two more heuristic techniques:
  - variable learning rate backpropagation; and
  - resilient backpropagation.
Faster Training:
• The second category of fast algorithms uses standard numerical optimization techniques.
• There are three types of numerical optimization techniques for neural network training:
  - conjugate gradient,
  - quasi-Newton,
  - Levenberg-Marquardt.
Variable Learning Rate:
• With standard steepest descent, the learning rate is held constant throughout training.
• The performance of the algorithm is very sensitive to the proper setting of the learning rate. If the learning rate is set too high, the algorithm may oscillate and become unstable. If the learning rate is too small, the algorithm will take too long to converge.
• It is not practical to determine the optimal setting for the learning rate before training, and, in fact, the optimal learning rate changes during the training process as the algorithm moves across the performance surface.
Variable Learning Rate:
• The performance of the Steepest Descent Algorithm (SDA) can be improved if we allow the learning rate to change during the training process.
• An adaptive learning rate will attempt to keep the learning step size as large as possible while keeping learning stable.
• The learning rate is made responsive to the complexity of the local error surface.
• An adaptive learning rate requires some changes in the training procedure used; one such policy is sketched below.
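One common adaptive policy can be sketched as follows (illustrative; the increase/decrease factors, the allowed error growth, and the error surface are assumptions): grow the learning rate after a successful step, and shrink it and discard the step when the error rises too much.

```python
import numpy as np

def error(w):
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2   # assumed error surface

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)])

w = np.array([4.0, 3.0])
lr = 0.01
lr_inc, lr_dec, max_growth = 1.05, 0.7, 1.04             # assumed tuning constants

for _ in range(300):
    w_new = w - lr * gradient(w)
    if error(w_new) > max_growth * error(w):
        lr *= lr_dec                  # error grew too much: shrink rate, reject step
    else:
        if error(w_new) < error(w):
            lr *= lr_inc              # successful step: try a slightly larger rate
        w = w_new                     # accept the step
```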
Resilient Backpropagation:
Multilayer networks typically use sigmoid transfer functions in the hidden layers. These functions are often called "squashing" functions, since they compress an infinite input range into a finite output range. Sigmoid functions are characterized by the fact that their slopes approach zero as the input gets large. This causes a problem when using steepest descent to train a multilayer network with sigmoid functions, since the gradient can have a very small magnitude and therefore cause only small changes in the weights and biases, even though the weights and biases are far from their optimal values.
Resilient Backpropagation:
• The purpose of the resilient backpropagation training algorithm is to eliminate these harmful effects of the magnitudes of the partial derivatives.
• Only the sign of the derivative is used to determine the direction of the weight update; the magnitude of the derivative has no effect on the weight update.
• The size of the weight change is determined by a separate update value.
• The update value for each weight and bias is increased by a factor whenever the derivative of the performance function with respect to that weight has the same sign for two successive iterations.
Resilient Backpropagation:
• The update value is decreased by a factor whenever the derivative with respect to that weight changes sign from the previous iteration.
• If the derivative is zero, then the update value remains the same.
• Whenever the weights are oscillating, the weight change will be reduced.
• If the weight continues to change in the same direction for several iterations, then the magnitude of the weight change will be increased. (A sketch of these rules is given below.)
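A sketch of these rules (illustrative; the initial step size, the increase/decrease factors, and the toy gradient are assumptions, and practical implementations add further safeguards):

```python
import numpy as np

def gradient(w):
    # Assumed toy error-surface gradient; any differentiable error would do.
    return np.array([2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)])

w = np.array([4.0, 3.0])
step = np.full_like(w, 0.1)       # separate update value per weight (assumed init)
inc, dec = 1.2, 0.5               # assumed increase/decrease factors
prev_g = np.zeros_like(w)

for _ in range(100):
    g = gradient(w)
    same_sign = g * prev_g > 0    # derivative kept its sign -> grow the update value
    flipped = g * prev_g < 0      # derivative changed sign -> shrink the update value
    step[same_sign] *= inc
    step[flipped] *= dec
    # Only the sign of the derivative sets the direction of the weight update.
    w -= np.sign(g) * step
    prev_g = g
```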

Conjugate Gradient Algorithms:
• The basic backpropagation algorithm adjusts the weights in the steepest descent direction (the negative of the gradient). This is the direction in which the performance function is decreasing most rapidly.
• It turns out that, although the function decreases most rapidly along the negative of the gradient, this does not necessarily produce the fastest convergence.

Conjugate Gradient Algorithms:
• In the conjugate gradient algorithms a search is performed along conjugate directions, which generally produces faster convergence than the steepest descent directions; a small sketch of the idea is given below.
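For intuition, here is a sketch of the conjugate gradient idea on a simple quadratic function 0.5·xᵀAx − bᵀx (the matrix, vector, and exact step length are assumptions; network training replaces the exact step with a line search along each conjugate direction):

```python
import numpy as np

# Assumed quadratic problem: minimize 0.5*x'Ax - b'x (its gradient is Ax - b).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)
r = b - A @ x                     # negative gradient (residual)
d = r.copy()                      # first search direction = steepest descent

for _ in range(2):                # at most n steps for an n-dimensional quadratic
    alpha = (r @ r) / (d @ A @ d) # exact step length along the current direction
    x = x + alpha * d
    r_new = r - alpha * (A @ d)
    beta = (r_new @ r_new) / (r @ r)
    d = r_new + beta * d          # next direction is conjugate to the previous ones
    r = r_new

print(x)                          # solves Ax = b, i.e. the minimum of the quadratic
```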

Quasi-Newton Algorithm:
Newton's method is an alternative to the conjugate gradient methods for fast optimization. The basic step of Newton's method is

$$x_{k+1} = x_k - A_k^{-1} g_k$$

where A_k is the Hessian matrix (second derivatives) of the performance index at the current values of the weights and biases.
Quasi-Newton Algorithms:
Newton's method often converges faster than conjugate gradient methods. Unfortunately, it is complex and expensive to compute the Hessian matrix for feedforward neural networks. There is a class of algorithms that is based on Newton's method but does not require the calculation of second derivatives. These are called quasi-Newton (or secant) methods. They update an approximate Hessian matrix at each iteration of the algorithm. The update is computed as a function of the gradient.
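As a sketch of this idea, the BFGS update below refines an approximate inverse Hessian from gradient information only (the quadratic test function and the unit step length are assumptions; practical trainers add a line search):

```python
import numpy as np

def grad(x):
    # Assumed quadratic performance index 0.5*x'Ax - b'x, so its gradient is Ax - b.
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    return A @ x - b

x = np.zeros(2)
H = np.eye(2)                      # approximate *inverse* Hessian, initialized to I
g = grad(x)

for _ in range(10):
    step = -H @ g                  # quasi-Newton search direction
    x_new = x + step               # (a practical trainer would line-search along step)
    g_new = grad(x_new)
    s, y = x_new - x, g_new - g    # change in position and change in gradient
    ys = y @ s
    if ys <= 1e-12:                # safeguard: stop if curvature information is unusable
        x, g = x_new, g_new
        break
    rho = 1.0 / ys
    I = np.eye(2)
    # BFGS update: refine the inverse-Hessian estimate using only gradient information.
    H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
    x, g = x_new, g_new
```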

Levenberg-Marquardt:
Like the quasi-Newton methods, the Levenberg-Marquardt algorithm was designed to approach second-order training speed without having to compute the Hessian matrix. When the performance function has the form of a sum of squares (as is typical in training feedforward networks), the Hessian matrix can be approximated as

$$H = J^T J$$
Levenberg-Marquardt:
and the gradient can be computed as

$$g = J^T e$$

where J is the Jacobian matrix that contains the first derivatives of the network errors with respect to the weights and biases, and e is a vector of network errors.

Levenberg-Marquardt:
The Jacobian matrix can be computed through a standard backpropagation technique that is much less complex than computing the Hessian matrix. The Levenberg-Marquardt algorithm uses this approximation to the Hessian matrix in the following Newton-like update:

$$x_{k+1} = x_k - \big[J^T J + \mu I\big]^{-1} J^T e$$

where μ is a scalar and I is the identity matrix.
Levenberg-Marquardt:
When the scalar μ is zero, this is just Newton's method, using the approximate Hessian matrix. When μ is large, this becomes gradient descent with a small step size. Newton's method is faster and more accurate near an error minimum, so the aim is to shift towards Newton's method as quickly as possible. Thus, μ is decreased after each successful step (a reduction in the performance function) and is increased only when a tentative step would increase the performance function.
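A compact sketch of this damping strategy on a generic least-squares problem (the residual function, its Jacobian, the starting point, and the μ adjustment factors are assumptions added for illustration):

```python
import numpy as np

def residuals(x):
    # Assumed toy least-squares problem with a root at (1, 1).
    return np.array([x[0] ** 2 + x[1] - 2.0,
                     x[0] + x[1] ** 2 - 2.0])

def jacobian(x):
    # First derivatives of the residuals with respect to the parameters.
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, 2.0 * x[1]]])

x = np.array([2.0, 2.0])
mu = 0.01                                   # damping parameter (assumed initial value)

for _ in range(50):
    e = residuals(x)
    J = jacobian(x)
    # Newton-like step using the Gauss-Newton Hessian approximation J'J.
    step = np.linalg.solve(J.T @ J + mu * np.eye(2), -J.T @ e)
    x_trial = x + step
    if np.sum(residuals(x_trial) ** 2) < np.sum(e ** 2):
        x = x_trial                         # successful step: accept and reduce damping
        mu *= 0.1
    else:
        mu *= 10.0                          # failed step: increase damping and retry
```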
Levenberg-Marquardt:
• In this way, the performance function is always reduced at each iteration of the algorithm.
• The main drawback of the Levenberg-Marquardt algorithm is that it requires the storage of some matrices that can be quite large for certain problems.
• The size of the Jacobian matrix is R × n, where R is the number of training sets and n is the number of weights and biases in the network.
• It turns out that this matrix does not have to be computed and stored as a whole.

Levenberg-Marquardt:
For example, if we were to divide the Jacobian into two equal submatrices, we could compute the approximate Hessian matrix by summing a series of subterms. Once one subterm has been computed, the corresponding submatrix of the Jacobian can be cleared. This is called reduced-memory Levenberg-Marquardt.
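A sketch of that idea (the shapes, the random stand-in data, and the two-way split are assumptions): the approximate Hessian JᵀJ and the gradient Jᵀe are accumulated from sub-blocks of rows of the Jacobian, so the full Jacobian never has to be held in memory at once.

```python
import numpy as np

n = 6                                   # number of weights and biases (assumed)
rng = np.random.default_rng(0)

def jacobian_block(block):
    # Stand-in for computing the Jacobian rows and errors of one half of the data set.
    J_sub = rng.standard_normal((10, n))
    e_sub = rng.standard_normal(10)
    return J_sub, e_sub

H = np.zeros((n, n))                    # running approximation of J'J
g = np.zeros(n)                         # running gradient J'e
for block in range(2):                  # two equal sub-matrices of the Jacobian
    J_sub, e_sub = jacobian_block(block)
    H += J_sub.T @ J_sub                # add this block's subterm, then discard J_sub
    g += J_sub.T @ e_sub
```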

Summary of Backpropagation Algorithms:
• There are several algorithm characteristics that can be deduced from the experiments.
• In general, on function approximation problems, for networks that contain up to a few hundred weights, the Levenberg-Marquardt (LM) algorithm will have the fastest convergence. This advantage is especially noticeable if very accurate training is required.
• In many cases, the LM algorithm is able to obtain lower mean square errors than any of the other algorithms tested. However, as the number of weights in the network increases, the advantage of LM decreases.
Summary of Backpropagation Algorithms:
• In addition, Levenberg-Marquardt (LM) performance is relatively poor on pattern recognition problems.
• Its storage requirements are larger than those of the other algorithms tested. By adjusting the training parameters, the storage requirements can be reduced, but at the cost of increased execution time.
• Resilient backpropagation is the fastest algorithm on pattern recognition problems. However, it does not perform well on function approximation problems. Its performance also degrades as the error goal is reduced.
Summary of Backpropagation Algorithms:
• The memory requirements of the resilient backpropagation algorithm are relatively small in comparison to the other algorithms considered.
• The conjugate gradient algorithms seem to perform well over a wide variety of problems, particularly for networks with a large number of weights.
• The Scaled Conjugate Gradient (SCG) algorithm is almost as fast as the LM algorithm on function approximation problems (faster for large networks) and is almost as fast as resilient backpropagation on pattern recognition problems.
Summary of Backpropagation Algorithms:
• The conjugate gradient algorithms have relatively modest memory requirements.
• The quasi-Newton backpropagation performance is similar to that of the LM algorithm. It does not require as much storage as LM, but the computation required grows geometrically with the size of the network, since the equivalent of a matrix inverse must be computed at each iteration.
• The variable learning rate algorithm is usually much slower than the other methods, but it can still be useful for some problems.