
Derivative-based Optimization

A class of gradient-based methods can be applied to optimizing nonlinear neuro-fuzzy models, thereby allowing such models to play a prominent role in the framework of soft computing. In fact, steepest descent and conjugate gradient methods are the major algorithms used for neural network learning in conjunction with the error back-propagation process. Least-squares estimation is another widely employed approach, because in many cases the sum of squared errors is chosen as the objective function to be minimized. Hence, we also discuss nonlinear least-squares problems, with particular emphasis on Gauss-Newton methods and the Levenberg-Marquardt modification. These methods are commonly used in data fitting and regression involving nonlinear models. The gradient-based methods are therefore closely related to neuro-fuzzy and soft computing techniques.
DESCENT METHODS
We focus on minimizing a real-valued objective function E defined on an n-dimensional input space θ = [θ_1, θ_2, ..., θ_n]^T. Finding a (possibly local) minimum point θ = θ* that minimizes E(θ) is of primary concern.
A given objective function E may have a nonlinear form with respect to the adjustable parameter θ. Due to the complexity of E, we often resort to an iterative algorithm to explore the input space efficiently. In iterative descent methods, the next point θ_next is determined by a step down from the current point θ_now in a direction vector d:

θ_next = θ_now + η d,

where η is some positive step size regulating to what extent to proceed in that direction. In the neuro-fuzzy literature, the term learning rate is used for the step size η. For convenience, we alternatively use the following formula:

θ_{k+1} = θ_k + η_k d_k   (k = 1, 2, 3, ...),

where k denotes the current iteration number, and θ_k and θ_{k+1} represent two consecutive elements in a generated sequence of solution candidates {θ_k}. The sequence {θ_k} is intended to converge to a (local) minimum θ*. The iterative descent methods compute the kth step η_k d_k through two procedures: first determining the direction d, and then calculating the step size η. The next point θ_next should satisfy the following inequality:

E(θ_next) = E(θ_now + η d) < E(θ_now).
The principal differences between the various descent algorithms lie in the first procedure, determining successive directions. Once the direction is decided, all algorithms call for a move to a (local) minimum point on the line determined by the current point θ_now and the direction d. That is, for the second procedure, the optimal step size can be determined by line minimization:

η* = arg min_η E(θ_now + η d).

The search for η* is accomplished by a line search (or one-dimensional search).
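As an illustration of the two-procedure structure just described, the sketch below performs descent steps on a simple quadratic bowl, taking the negative gradient as the direction and delegating the step-size search to a one-dimensional minimizer. The particular objective function, its gradient, and the use of scipy.optimize.minimize_scalar are illustrative assumptions, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def E(theta):
    # Illustrative quadratic objective with an elongated bowl.
    return theta[0] ** 2 + 10.0 * theta[1] ** 2

def grad_E(theta):
    # Analytic gradient of the objective above.
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

def descent_step(theta_now):
    # Procedure 1: choose a descent direction (here, the negative gradient).
    d = -grad_E(theta_now)
    # Procedure 2: line minimization, eta* = argmin_eta E(theta_now + eta * d).
    line = minimize_scalar(lambda eta: E(theta_now + eta * d),
                           bounds=(0.0, 1.0), method="bounded")
    return theta_now + line.x * d

theta = np.array([3.0, 1.0])
for k in range(5):
    theta = descent_step(theta)
    print(k + 1, theta, E(theta))
```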
THE METHOD OF STEEPEST DESCENT
The method of steepest descent, also known as the gradient method, is one of the oldest techniques for minimizing a given function defined on a multidimensional input space. This method forms the basis for many direct methods used in solving both constrained and unconstrained optimization problems. Moreover, despite its slow convergence, it is the most frequently used nonlinear optimization technique due to its simplicity.
When G = ηI, with some positive value η and the identity matrix I, we obtain the well-known steepest descent formula:

θ_next = θ_now − η g.

If the cosine of the angle between d and the gradient g equals −1, that is, if d points in the same direction as the negative gradient −g, the objective function E can be decreased locally by the largest amount at the current point θ_now. This implies that the negative gradient direction −g points in the locally steepest downhill direction. From a global perspective, however, moving in the negative gradient direction is not necessarily a shortcut to the minimum point θ*. If the steepest descent method employs line minimization, that is, if the minimum point η* along the direction d is obtained at each iteration, then the gradients at consecutive points satisfy

g_next^T g_now = 0,
where g_next is the gradient vector at the next point. The preceding equation indicates that the next gradient vector g_next is always orthogonal to the current gradient vector g_now; Figure 6.2 depicts this situation. For a quadratic objective function, the method of steepest descent with line minimization generates only two mutually orthogonal search directions, which are determined by the starting point. (Previous investigators note that even for a general n-input objective function, the steepest descent method tends to search asymptotically in merely some lower-dimensional subspace [1, 12, 30].) Only if the contours of the objective function E form hyperspheres (or circles in a two-dimensional space) does the steepest descent method lead straight to the minimum.
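As a concrete illustration of the update θ_next = θ_now − ηg, here is a minimal steepest descent sketch with a fixed learning rate; the sample objective, learning rate, and stopping rule are assumptions made for the example.

```python
import numpy as np

def E(theta):
    # Illustrative quadratic objective.
    return theta[0] ** 2 + 10.0 * theta[1] ** 2

def grad_E(theta):
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

def steepest_descent(theta0, eta=0.05, max_iter=200, tol=1e-8):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_E(theta)
        if np.linalg.norm(g) < tol:   # stop when the gradient (nearly) vanishes
            break
        theta = theta - eta * g       # theta_next = theta_now - eta * g
    return theta

print(steepest_descent([3.0, 1.0]))   # approaches the minimum at the origin
```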
Classical Newton's Method
The descent direction d can be determined by using the second derivatives of the objective function E, if they are available. For a general continuous objective function, the contours may be nearly elliptical in the immediate vicinity of the minimum. If the starting position θ_now is sufficiently close to a local minimum, the objective function E is expected to be well approximated by a quadratic form:

E(θ) ≈ E(θ_now) + g^T (θ − θ_now) + (1/2) (θ − θ_now)^T H (θ − θ_now),   (6.13)

where H is the Hessian matrix, consisting of the second partial derivatives of E(θ), and g is the gradient at θ_now. The preceding equation is the Taylor series expansion of E(θ) up to the second-order terms; higher-order terms are omitted under the assumption that |θ − θ_now| is sufficiently small. Since the right-hand side is a quadratic function of θ, we can simply find its minimum point by differentiating it and setting the result to zero. This leads to a set of linear equations:

g + H (θ − θ_now) = 0.   (6.14)

If the inverse of H exists, we have a unique solution. When the minimum point of the approximated quadratic function is chosen as the next point θ_next, we have the so-called Newton's method, or the Newton-Raphson method:

θ_next = θ_now − H⁻¹ g.   (6.15)
In the transformed space, the steepest descent method can be used. When T is a diagonal matrix, as in the foregoing, such a transformation is called scaling. The Newton direction is theoretically scale invariant; however, finding a successful transformation or scaling is difficult in many cases. A major disadvantage of Newton's method is that calculating the inverse of the Hessian matrix is computationally intensive and may introduce numerical problems due to round-off errors.
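The following is a minimal sketch of one Newton step under the assumption that the gradient and Hessian are available in closed form; it solves the linear system H Δθ = −g instead of explicitly inverting H, a common way to reduce the cost and round-off problems noted above. The example objective is an assumption for illustration.

```python
import numpy as np

def grad_E(theta):
    # Gradient of the illustrative quadratic E = theta0^2 + 10 * theta1^2.
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

def hess_E(theta):
    # Constant Hessian of the same quadratic objective.
    return np.array([[2.0, 0.0], [0.0, 20.0]])

def newton_step(theta_now):
    g = grad_E(theta_now)
    H = hess_E(theta_now)
    delta = np.linalg.solve(H, -g)    # solve H * delta = -g rather than forming H^-1
    return theta_now + delta

theta = np.array([3.0, 1.0])
print(newton_step(theta))             # one step reaches the exact minimum of a quadratic
```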
STEP SIZE DETERMINATION
Recall the formula for a class of gradient-based descent methods:

θ_{k+1} = θ_k + η_k d_k.

This formula entails effectively determining the step size η_k, and the efficiency of the step size determination affects the entire minimization process. For a general function E, analytically solving the line minimization

η_k = arg min_η E(θ_k + η d_k)

is often impossible. That is, the univariate function Φ(η) = E(θ_now + η d) must be minimized numerically on the line determined by the current point θ_now and the direction d. This is accomplished by line search (or one-dimensional search) methods. In the rest of this section, we discuss line minimization methods and their stopping criteria, which prevent greedy search schemes from slowing down the entire minimization algorithm.
Initial Bracketing
The line search methods discussed in subsequent sections basically assume that the search area, or the specified interval, contains a single relative minimum; that is, the function E is unimodal over the closed interval. Determining the initial interval in which a relative minimum must lie is therefore of critical importance. Before starting a line search, some routine must be employed to bracket an assumed minimum within a starting interval. Such procedures can be roughly categorized into two schemes:
1. A scheme, based on function evaluations, for finding three points θ_1 < θ_2 < θ_3 that satisfy E(θ_1) ≥ E(θ_2) and E(θ_2) ≤ E(θ_3).
2. A scheme, based on first derivatives, for finding two points θ_1 < θ_2 that satisfy E′(θ_1) < 0 and E′(θ_2) > 0.
For scheme 1, the common algorithm can be outlined as follows:
Algorithm 6.1: An initial bracketing procedure for finding three points θ_1, θ_2, and θ_3.
(1) Given a starting point θ_0 and a step h ∈ R, let θ_1 be θ_0 + h. Evaluate E(θ_1).

The algorithm based on scheme 2 is left to the reader (Exercise 4). The following sections on line search methods assume that initial bracketing procedures of these types can adequately find several starting points.
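As a complement to Algorithm 6.1, here is a sketch of one common way a scheme-1 bracketing routine can be implemented, expanding the step until the middle point has the lowest function value. The doubling factor, loop structure, and function names are illustrative assumptions rather than the book's exact procedure.

```python
def bracket_minimum(E, theta0, h=0.1, grow=2.0, max_steps=50):
    """Find three points (t1, t2, t3) with E(t2) <= E(t1) and E(t2) <= E(t3)."""
    t1, t2 = theta0, theta0 + h
    if E(t2) > E(t1):
        # Downhill is the other way: swap and search in the opposite direction.
        t1, t2, h = t2, t1, -h
    for _ in range(max_steps):
        t3 = t2 + h
        if E(t3) >= E(t2):
            # t2 is lower than both neighbours: a bracket has been found.
            return (t1, t2, t3) if t1 < t3 else (t3, t2, t1)
        t1, t2 = t2, t3
        h *= grow                # expand the step to move downhill faster
    raise RuntimeError("no bracket found; E may be unbounded below")

# Usage on a simple unimodal function of one variable.
print(bracket_minimum(lambda t: (t - 3.0) ** 2, theta0=0.0))
```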
Line Searches
The process of determining the value η* that minimizes a one-dimensional function Φ(η) is achieved by searching on the line for the minimum. The method of line searches (or one-dimensional searches) is important because higher-dimensional problems are ultimately solved by repeated line searches. Line search algorithms usually include two components: sectioning (or bracketing) and polynomial interpolation.
Newton's Method
When Φ(η), Φ′(η), and Φ″(η) are available, the classical Newton method can be applied to solving the equation Φ′(η) = 0, which yields the iteration

η_{k+1} = η_k − Φ′(η_k) / Φ″(η_k).

Figure: Newton's method (left) and the secant method (right) used to determine the step size.
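A minimal sketch of this one-dimensional Newton iteration for the step size is shown below, together with a secant variant that approximates the second derivative from two successive first derivatives; the sample Φ′, Φ″, tolerances, and function names are illustrative assumptions.

```python
def newton_step_size(dphi, d2phi, eta0, tol=1e-10, max_iter=50):
    # Solve dphi(eta) = 0 by Newton's method: eta <- eta - dphi(eta) / d2phi(eta).
    eta = eta0
    for _ in range(max_iter):
        step = dphi(eta) / d2phi(eta)
        eta -= step
        if abs(step) < tol:
            break
    return eta

def secant_step_size(dphi, eta0, eta1, tol=1e-10, max_iter=50):
    # Same idea, but the second derivative is replaced by a finite difference.
    for _ in range(max_iter):
        d0, d1 = dphi(eta0), dphi(eta1)
        eta2 = eta1 - d1 * (eta1 - eta0) / (d1 - d0)
        if abs(eta2 - eta1) < tol:
            return eta2
        eta0, eta1 = eta1, eta2
    return eta1

# Example: Phi(eta) = (eta - 2)^2 + 1, so Phi'(eta) = 2*(eta - 2) and Phi''(eta) = 2.
print(newton_step_size(lambda e: 2 * (e - 2), lambda e: 2.0, eta0=0.0))  # -> 2.0
print(secant_step_size(lambda e: 2 * (e - 2), 0.0, 1.0))                 # -> 2.0
```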

Derivative-Free Optimization


Derivative freeness These methods do not need functional
derivative information to search for a set of parameters that
minimize (or maximize) a given objective function.

Intuitive guidelines The guidelines followed by these search


procedures are usually based on simple intuitive concepts.
Some of these concepts are motivated by so-called nature's
wisdom, such as evolution and thermodynamics.
Slowness Without using derivatives, these methods are bound
to be generally slower than derivative-based optimization
methods for continuous optimization problems.
Flexibility Derivative freeness also relieves the requirement for
differentiable objective functions, so we can use as complex
an objective function as a specific application might need,
without sacrificing too much in extra coding and computation
time.
Randomness All of these methods (with the probable
exception of the standard downhill simplex search) are
stochastic, which means that they all use random number
generators in determining subsequent search directions.
Analytic opacity It is difficult to do analytic studies of these methods, in part because of their randomness and problem-specific nature. Therefore, most of our knowledge about them is based on empirical studies.
Iterative nature Unlike the linear least-squares estimator (Section 5.3), these techniques are iterative in nature, and we need certain stopping criteria to determine when to terminate the optimization process.
GENETIC ALGORITHMS
Genetic algorithms (GAs) are
derivative-free stochastic optimization methods based loosely
on the concepts of natural selection and evolutionary
processes. They were first proposed and investigated by John
Holland at the University of Michigan. As a general-purpose
optimization tool, GAs are moving out of academia and finding
significant applications in many other venues. Their popularity
can be attributed to their freedom from dependence on
functional derivatives and to their incorporation of these
characteristics:
GAs are parallel-search procedures that can be implemented on parallel processing machines for massively speeding up their operations.
GAs are applicable to both continuous and discrete (combinatorial) optimization problems.
GAs are stochastic and less likely to get trapped in local minima, which inevitably are present in any practical optimization application.
GAs' flexibility facilitates both structure and parameter identification in complex models such as neural networks and fuzzy inference systems.
GAs encode
each point in a parameter (or solution) space into a binary bit
string called a chromosome, and each point is associated with
a "fitness" value that, for maximization, is usually equal to the
objective function evaluated at the point. Instead of a single
point, GAs usually keep a set of points as a population (or
gene pool), which is then evolved repeatedly toward a better
overall fitness
value. In each generation, the GA constructs a new population
using genetic operators such as crossover and mutation;
members with higher fitness values are more likely to survive
and to participate in mating (crossover) operations. After a
number of generations, the population contains members with
better fitness values; this is analogous to Darwinian models of
evolution by random mutation and natural selection. GAs and
their variants are sometimes referred to as methods of
population-based optimization that improve performance by
upgrading entire
populations rather than individual members.
Encoding schemes These transform points in parameter space into bit-string representations. For instance, with 4 bits per coordinate, a point (11, 6, 9) in a three-dimensional parameter space can be represented as the concatenated binary string 1011 0110 1001.
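A small sketch of such an encoding scheme, assuming a fixed number of bits per coordinate; the bit width and helper names are illustrative, not prescribed by the text.

```python
def encode(point, bits=4):
    # Concatenate the fixed-width binary representation of each coordinate.
    return "".join(format(v, f"0{bits}b") for v in point)

def decode(chromosome, bits=4):
    # Split the bit string back into integer coordinates.
    return [int(chromosome[i:i + bits], 2) for i in range(0, len(chromosome), bits)]

chrom = encode((11, 6, 9))
print(chrom)           # 101101101001
print(decode(chrom))   # [11, 6, 9]
```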


Fitness evaluation The first step after creating a generation is
to calculate the fitness value of each member in the
population. For a maximization problem, the fitness value f of
the ith member is usually the objective function evaluated at
this member (or point). We usually need fitness values that are positive, so some kind of monotonic scaling and/or translation may be necessary if the objective function is not strictly positive.
Selection After evaluation, we have to create a new population from the current generation. The selection operation determines which parents participate in producing offspring for the next generation; it is analogous to survival of the fittest in natural selection.
Crossover To exploit the potential of the current gene pool, we
use crossover operators to generate new chromosomes that
we hope will retain good features from the previous
generation. Crossover is usually applied to selected pairs of
parents with a probability equal to a given crossover rate.
One-point crossover is the most basic crossover operator,
where a crossover point on
the genetic code is selected at random and two parent
chromosomes are interchanged at this point.
Figure: crossover operators: (a) one-point crossover; (b) two-point crossover.
Mutation Crossover exploits current gene potentials, but if the population does not contain all the encoded information needed to solve a particular problem, no amount of gene mixing can produce a satisfactory solution. For this reason, a mutation operator capable of spontaneously generating new chromosomes is included. The most common way of implementing mutation is to flip a bit with a probability equal to a very low given mutation rate.
Figure: producing the next generation in GAs from the current one.
Based on the aforementioned concepts, a simple genetic algorithm for maximization problems is described next; a code sketch follows the steps below.
Step 1: Initialize a population with randomly generated
individuals and evaluate the fitness value of each individual.
Step 2:
(a) Select two members from the population with probabilities proportional to their fitness values.
(b) Apply crossover with a probability equal to the crossover rate.
(c) Apply mutation with a probability equal to the mutation rate.
(d) Repeat (a) through (c) until enough members are generated to form the next generation.
Step 3: Repeat step 2 until a stopping criterion is met.
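A compact sketch of the algorithm above for maximizing a function of a bit string, using roulette-wheel selection, one-point crossover, and bit-flip mutation. The population size, rates, and the toy "one-max" fitness function are illustrative assumptions.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, crossover_rate=0.9,
           mutation_rate=0.01, generations=50):
    # Step 1: initialize a random population.
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):                        # Step 3: repeat step 2
        fits = [fitness(ind) + 1e-12 for ind in pop]    # small epsilon keeps the weights valid
        new_pop = []
        while len(new_pop) < pop_size:                  # Step 2(d)
            # (a) roulette-wheel selection: probability proportional to fitness
            p1, p2 = random.choices(pop, weights=fits, k=2)
            c1, c2 = p1[:], p2[:]
            # (b) one-point crossover with the given crossover rate
            if random.random() < crossover_rate:
                cut = random.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # (c) bit-flip mutation with the given (low) mutation rate
            for child in (c1, c2):
                for i in range(n_bits):
                    if random.random() < mutation_rate:
                        child[i] ^= 1
            new_pop.extend([c1, c2])
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

# Toy problem ("one-max"): maximize the number of 1 bits in the chromosome.
best = run_ga(fitness=sum, n_bits=16)
print(best, sum(best))
```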
SIMULATED ANNEALING
The most important part of SA is the so-called annealing schedule, or cooling schedule, which specifies how rapidly the temperature is lowered from high to low values. This is usually application specific and requires some experimentation by trial and error.

Before giving a detailed description of SA, we shall first explain its fundamental terminology.
Objective function An objective function f(·) maps an input vector x into a scalar E:

E = f(x),

where each x is viewed as a point in an input space. The task of SA is to sample the input space effectively to find an x that minimizes E.

Generating function A generating function g(·, ·) specifies the probability density function of the difference between the current point and the next point to be visited. Specifically, Δx (= x_new − x) is a random variable with probability density function g(Δx, T), where T is the temperature. For common SA (especially when used in combinatorial optimization applications), g(·, ·) is usually a function independent of the temperature T.
Acceptance function After a new point x_new has been evaluated, SA decides whether to accept or reject it based on the value of an acceptance function h(·, ·). The most frequently used acceptance function is the Boltzmann probability distribution, in which c is a system-dependent constant, T is the temperature, and ΔE is the energy difference between x_new and x.
Annealing schedule An annealing schedule regulates how
rapidly the temperature T goes from high to low values, as a
function of time or iteration counts. The exact interpretation of
high and low and the specification of a good
annealing schedule require certain problem-specific physical
insights and/or trial-and-error.
Having presented this brief guide to the SA terminology, we now describe the basic steps involved in a general SA method.
Step 1: Choose a start point x and set a high starting temperature T. Set the iteration count k to 1.
Step 2: Evaluate the objective function:
E=f(x).
Step 3: Select Δx with probability determined by the generating function g(Δx, T). Set the new point x_new equal to x + Δx.
Step 4: Calculate the new value of the objective function: E_new = f(x_new).
Step 5: Set x to x_new and E to E_new with probability determined by the acceptance function h(ΔE, T), where ΔE = E_new − E.
Step 6: Reduce the temperature T according to the annealing schedule (usually by simply setting T equal to ηT, where η is a constant between 0 and 1).
Step 7: Increment the iteration count k. If k reaches the maximum iteration count, stop the iteration. Otherwise, go back to step 3.
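The following is a minimal sketch of this loop for a continuous one-dimensional problem, assuming a Gaussian generating function, a Metropolis-style acceptance rule (accept uphill moves with probability exp(−ΔE/T)), and geometric cooling; these particular choices, constants, and the toy objective are illustrative assumptions rather than the text's prescriptions.

```python
import math
import random

def simulated_annealing(f, x0, T0=10.0, cooling=0.95, max_iter=2000):
    x, E = x0, f(x0)                              # Steps 1-2: starting point and its energy
    T = T0
    for _ in range(max_iter):
        dx = random.gauss(0.0, math.sqrt(T))      # Step 3: Gaussian generating function
        x_new = x + dx
        E_new = f(x_new)                          # Step 4: evaluate the candidate
        dE = E_new - E
        # Step 5: accept downhill moves; accept uphill moves with probability exp(-dE/T)
        if dE < 0 or random.random() < math.exp(-dE / T):
            x, E = x_new, E_new
        T *= cooling                              # Step 6: annealing (cooling) schedule
    return x, E                                   # Step 7: handled by the loop bound

# Toy one-dimensional objective with several local minima.
print(simulated_annealing(lambda x: x ** 2 + 10 * math.sin(3 * x), x0=5.0))
```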
In conventional SA, also known as Boltzmann machines [5, 6], the generating function is a Gaussian probability density function.
Traveling salesperson problem
In a typical traveling salesperson problem (TSP), we are given n cities, and the distances (or costs) between all pairs of these cities form an n × n distance (or cost) matrix D, whose (i, j)th element represents the distance (or cost) of traveling from city i to city j. The problem is to find a closed tour in which each city, except for the starting one, is visited exactly once, such that the total length (cost) is minimized. The traveling salesperson problem is a well-known problem in combinatorial optimization; it belongs to the class of problems known as NP-complete [16], in which the computation time required to find an optimal solution increases exponentially with n. For a TSP with n cities, the number of possible tours is (n − 1)!/2, which becomes prohibitively large even for moderate n. For instance, finding the best tour of the state capitals of the United States (n = 50) would require many billions of years even
with the fastest modern computers.
Figure: three operations for generating move sets in the traveling salesperson problem.
For a common traveling salesperson problem, we can define at least three move sets for SA:
Inversion Remove two edges from the tour and reconnect the pieces to make another legal tour. This is equivalent to removing a section of the tour (say 6-7-8-9) and then reinserting the same cities in the opposite order.
Switching Randomly select two cities (say 3 and 11) and switch their positions in the tour. Generally speaking, the switching move set tends to disrupt the original tour, resulting in a tour whose total length (or cost) differs significantly from that of the original.
Figure: comparison between the inversion and switching moves.
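A small sketch of the inversion and switching moves on a tour represented as a list of city indices; the representation and function names are illustrative assumptions.

```python
import random

def inversion(tour):
    # Remove two edges and reconnect by reversing the section between them.
    i, j = sorted(random.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def switching(tour):
    # Randomly pick two cities and swap their positions in the tour.
    i, j = random.sample(range(len(tour)), 2)
    new_tour = tour[:]
    new_tour[i], new_tour[j] = new_tour[j], new_tour[i]
    return new_tour

tour = list(range(12))      # cities 0..11 in their initial order
print(inversion(tour))
print(switching(tour))
```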
We apply the SA technique with the inversion move set to a TSP. Variants of Boltzmann machines include the Cauchy machine, or fast simulated annealing, in which the generating function is the Cauchy distribution.
RANDOM SEARCH
We shall start with the most primitive version, proposed by Matyas. Following some heuristic guidelines, we shall also present a modified version that is more efficient. Let f(x) be the objective function to be minimized and x be the point currently under consideration. The original random search method [11] tries to find the optimal x by iterating the following four steps:
Step 1: Choose a start point x as the current point.
Step 2: Add a random vector dx to the current point x in the parameter space and evaluate the objective function at the new point x + dx.
Step 3: If f(x + dx) < f(x), set the current point x equal to x + dx.
Step 4: Stop if the maximum number of function evaluations is reached. Otherwise, go back to step 2 to find a new point.
This is a truly random method in the sense that search
directions are purely guided by a random number generator.
There are several ways to improve this primitive version;
these are based on the following observations:
Observation 1: If search in a direction results in a higher
objective function, the opposite direction can often lead to a
lower objective function.
Observation 2: Successive successful searches in a certain
direction should bias subsequent searching toward this
direction. On the other hand, successive failures in a certain
direction should discourage subsequent searching along this
direction.
The first observation leads to a reverse step in the original
method. The second observation motivates the use of a bias
term as the center for the random vector. After including these
two guidelines, the modified random search method involves
the following six steps:
Step 1: Choose a start point x as the current point. Set the initial bias b equal to a zero vector.
Step 2: Add the bias term b and a random vector dx to the current point x in the input space and evaluate the objective function at the new point x + b + dx.
Step 3: If f(x + b + dx) < f(x), set the current point x equal to x + b + dx and the bias b equal to 0.2b + 0.4dx; go to step 6. Otherwise, go to the next step.
Step 4: If f(x + b − dx) < f(x), set the current point x equal to x + b − dx and the bias b equal to b − 0.4dx; go to step 6. Otherwise, go to the next step.
Step 5: Set the bias b equal to 0.5b and go to step 6.
Step 6: Stop if the maximum number of function evaluations is reached. Otherwise, go back to step 2 to find a new point.
Figure: flow chart for the random search method.
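A minimal sketch of the modified random search above, assuming a Gaussian perturbation and a fixed evaluation budget; these specifics and the toy objective are illustrative assumptions.

```python
import numpy as np

def modified_random_search(f, x0, sigma=0.5, max_evals=500, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    b = np.zeros_like(x)                              # Step 1: zero initial bias
    evals = 0
    while evals < max_evals:                          # Step 6: evaluation budget
        dx = rng.normal(0.0, sigma, size=x.shape)     # Step 2: random vector
        if f(x + b + dx) < f(x):                      # Step 3: forward move succeeds
            x = x + b + dx
            b = 0.2 * b + 0.4 * dx
        elif f(x + b - dx) < f(x):                    # Step 4: reverse move succeeds
            x = x + b - dx
            b = b - 0.4 * dx
        else:                                         # Step 5: both failed, shrink the bias
            b = 0.5 * b
        evals += 2                                    # roughly two candidate evaluations per pass
    return x

print(modified_random_search(lambda v: np.sum(v ** 2), x0=[3.0, -2.0]))
```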
DOWNHILL SIMPLEX SEARCH
In the downhill simplex search, we must initialize a simplex of n + 1 points. For example, a simplex is a triangle in two-dimensional space and a tetrahedron in three-dimensional space. Moreover, we would like the simplex to be nondegenerate, that is, to enclose a finite inner n-dimensional volume. An easy way to set up a simplex is to start with an initial starting point P_0; the other n points can be taken as

P_i = P_0 + λ_i e_i   (i = 1, 2, ..., n),   (7.2)

where the e_i's are unit vectors forming a basis of the n-dimensional space and λ_i is a constant reflecting a guess of the characteristic length scale of the optimization problem in question. We write y_i for the function value at P_i and let
l = arg min_i (y_i)   (l for "low"),
h = arg max_i (y_i)   (h for "high").
In other words, l and h are respectively the indices for the minimum and maximum of the y_i. In symbols,
y_l = min_i (y_i),
y_h = max_i (y_i).
Let P̄ be the average (centroid) of these n + 1 points. Each cycle of this method starts with a reflection point P* of P_h with respect to P̄. Depending on the function value at P*, we have four possible operations to change the current simplex and explore the landscape of the function efficiently in multidimensional space. These four operations are (1) reflection away from P_h; (2) reflection and expansion away from P_h; (3) contraction along the dimension connecting P_h and P̄; and (4) shrinkage toward P_l along all dimensions. Before describing the full cycle of the simplex search, we need to define four intervals to be used in the search process:
Interval 1: {y | y < y_l},
Interval 2: {y | y_l ≤ y ≤ max_{j≠h}(y_j)},
Interval 3: {y | max_{j≠h}(y_j) < y ≤ y_h},
Interval 4: {y | y_h < y}.
Figure: outcomes for a cycle in the downhill simplex search.
Figure: four intervals used in the downhill simplex search.
1. If y* is in interval 1, go to expansion.
2. If y* is in interval 2, replace P_h with P* and finish this cycle.
3. If y* is in interval 3, replace P_h with P* and go to contraction.
4. If y* is in interval 4, go to contraction.
Figure: flow chart for the downhill simplex search.
Expansion: If the expansion point's value is in interval 1, replace P_h with the expansion point and finish this cycle. Otherwise, replace P_h with the original reflection point P* and finish this cycle.
Contraction: Define the contraction point and its value, where the contraction coefficient β lies between 0 and 1. If this value is in interval 1, 2, or 3, replace P_h with the contraction point and finish this cycle. Otherwise, go to shrinkage.
Shrinkage: Replace each P_i with (P_i + P_l)/2 and finish this cycle.
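Rather than re-implementing the full reflection/expansion/contraction/shrinkage cycle, the sketch below simply runs a readily available downhill simplex (Nelder-Mead) implementation on a standard test function; the library call and the test function are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(p):
    # Classic banana-shaped test function with its minimum at (1, 1).
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

result = minimize(rosenbrock, x0=np.array([-1.2, 1.0]), method="Nelder-Mead")
print(result.x, result.fun)   # should approach [1, 1] and 0
```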
