Université Joseph Fourier
Master de Mathématiques Appliquées, 2ème année
Lecture Notes
Optimization problems arise naturally in many application fields. Whatever people do,
at some point they get a craving for organizing things in the best possible way. This intention,
converted into a mathematical form, appears to be an optimization problem of a certain type
(think of, say, the Optimal Diet Problem). Unfortunately, the next step, consisting of finding a
solution to the mathematical model, is less trivial. At first glance, everything looks very
simple: many commercial optimization packages are easily available, and any user can get a
solution to his model just by clicking on an icon on the desktop of his PC. The question,
however, is how much he can trust it.
One of the goals of this course is to show that, despite their attraction, general
optimization problems very often break the expectations of a naive user. In order to apply
these formulations successfully, it is necessary to be aware of some theory which tells us what
we can and what we cannot do with optimization problems. The elements of this theory can
be found in each lecture of the course.
This course itself is based on the lectures given by Arkadi Nemirovski at Technion in the late
1990s. On the other hand, all the errors and inanities you may find here should be put on
the account of the name on the title page.
http://www-ljk.imag.fr/membres/Anatoli.Iouditski/cours/optimisation-convexe.htm
Contents

1 Introduction
1.1 General formulation of the problem
1.1.1 Problem formulation and terminology
1.1.2 Performance of Numerical Methods
1.2 Complexity bounds for Global Optimization
1.3 Identity cards of the fields
1.4 Rules of the game
1.5 Suggested reading
Introduction
(General formulation of the problem; Important examples; Black Box and Iterative Methods;
Analytical and Arithmetical Complexity; Uniform Grid Method; Lower complexity bounds;
Lower bounds for Global Optimization; Rules of the Game.)
min f(x),
s.t. x ∈ Q = {x ∈ G | g_j(x) ≤ 0, j = 1, …, m}.     (1.1.1)
Constrained problems: Q ⊂ Rⁿ. Unconstrained problems: Q ≡ Rⁿ.
There is also some classification in accordance with the properties of the feasible set.
Problem (1.1.1) is called strictly feasible if there exists x ∈ int Q such that g_j(x) < 0 (or > 0) for
all inequality constraints and g_j(x) = 0 for all equality constraints.
(local minimum).
Let us consider now several examples demonstrating the origin of optimization problems.

Example 1.1.1 Let x^(1), …, x^(n) be our design or decision variables. Then we can fix some
functional characteristics of our decision: f(x), g₁(x), …, g_m(x). These could be the price of
the project, the amount of required resources, the reliability of the system, and many
others.
We fix the most important characteristic, f(x), as our objective. For all the others we impose
some bounds: a_j ≤ g_j(x) ≤ b_j.
Thus, we come to the problem

min f(x),
s.t.: a_j ≤ g_j(x) ≤ b_j, j = 1, …, m,
x ∈ G,
where G stands for the structural constraints, like positiveness or boundedness of some
variables, etc.
Example 1.1.2 Let our initial problem be as follows: Find x ∈ Rⁿ such that

g₁(x) = a₁,
…     (1.1.2)
g_m(x) = a_m.
Example 1.1.3 Sometimes our decision variables x^(1), …, x^(n) must be integer; say, we need
x^(i) ∈ {0, 1}. That can be described by the constraint

x^(i)(x^(i) − 1) = 0, i = 1, …, n,

and we come to the problem

min f(x),
s.t.: a_j ≤ g_j(x) ≤ b_j, j = 1, …, m,
x ∈ G,
x^(i)(x^(i) − 1) = 0, i = 1, …, n.
Looking at these examples, a reader can understand the enthusiasm of the pioneers of
nonlinear programming, which can be easily recognized in the papers of the 1950s and 1960s. Thus,
our first impression should be as follows:
However, just by looking at the same list, especially at Examples 1.1.2 and 1.1.3, a more suspicious
(or more experienced) reader should come to the following conjecture:
Indeed, life is too complicated to believe in a universal tool for solving all problems at
once.
However, conjectures are not so important in science; it is a question of personal
taste how much we believe in them. The most important event in optimization
theory in the middle of the 70s was that this conjecture was proved in a strict sense. The
proof is so simple and remarkable that we cannot avoid it in our course. But first of all,
we should introduce a special language, which is necessary to speak about such serious
things.
In this definition there are several things to be specified. First, what does it mean to solve
the problem? In some fields it could mean finding the exact solution. However, in many areas
of numerical analysis that is impossible (and optimization is definitely such a case). Therefore,
for us, to solve the problem should mean:
Now, we know that there are different numerical methods for doing that, and of course, we
want to choose the scheme which is the best for our P. However, it appears that we are
looking for something that does not exist. In fact, it does, but it is too silly. Just imagine
a method for solving (1.1.1) which always reports that x̄ = 0. Of course, this does not work
on any problem except those with x* = 0. And for the latter problems its performance is
better than that of all other schemes.
Thus, we cannot speak about the best method for a concrete problem P, but we can do
that for a class of problems F ∋ P. Indeed, numerical methods are usually developed
for solving many different problems with similar characteristics. Therefore we can define
the performance of M on F as its performance on the worst problem from F.
Since we are going to speak about the performance of M on the whole class F, we should
assume that M does not have complete information about a concrete problem P. It has
only the description of the problem class F. In order to recognize P (and solve it), the
method should be able to collect personal information about P piece by piece. For modeling
this situation, it is convenient to introduce the notion of an oracle. An oracle O is just a unit which
answers the successive questions of the method. The method M, collecting and handling
the data, is trying to solve the problem P.
In general, each problem can be included in different problem classes. For each problem
we can also imagine different types of oracles. But if we fix F and O, then we fix a model
of our problem P. In this case, it is natural to define the performance of M on (F, O) as
its performance on the worst P_w from F.²
Let us now consider the iterative process which naturally describes any method M working
with the oracle.

General Iterative Scheme.     (1.1.3)

Input: A starting point x₀ and an accuracy ε > 0.
Initialization: Set k = 0, I₋₁ = ∅. Here k is the iteration counter and I_k is the
informational set accumulated after k iterations.
Main Loop:
1. Call the oracle O at x_k.
2. Update the informational set: I_k = I_{k−1} ∪ (x_k, O(x_k)).
3. Apply the rules of method M to I_k and form the new test point x_{k+1}.
4. Check the stopping criterion. If it is satisfied, form an output x̄. Otherwise
set k = k + 1 and go to 1.
End of the Loop.
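The scheme (1.1.3) can be sketched in code. The names `oracle`, `next_point`, and `stopped` are placeholders of mine for the three ingredients above (the oracle O, the rules of the method M, and the stopping criterion); this is a minimal illustration, not part of the course notation:

```python
def general_iterative_scheme(oracle, next_point, stopped, x0):
    """Black-box scheme (1.1.3): the method sees the problem only
    through the successive answers of the oracle."""
    x, info = x0, []                 # I_{-1} = empty informational set
    while True:
        answer = oracle(x)           # Step 1: call the oracle O at x_k
        info.append((x, answer))     # Step 2: I_k = I_{k-1} + (x_k, O(x_k))
        if stopped(info):            # Step 4: stopping criterion
            # form the output: the best point seen so far
            return min(info, key=lambda pair: pair[1])[0]
        x = next_point(info)         # Step 3: rules of M give x_{k+1}

# A zero-order oracle for f(x) = (x - 1)^2 and a naive fixed-step method;
# the analytical complexity here is simply the number of oracle calls.
f = lambda x: (x - 1.0) ** 2
xbar = general_iterative_scheme(
    oracle=f,
    next_point=lambda info: info[-1][0] + 0.1,
    stopped=lambda info: len(info) >= 25,
    x0=-0.5)
```

Here the informational set I_k is just the list of (test point, answer) pairs; any concrete method differs only in its `next_point` and `stopped` rules.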
Now we can specify the term computational efforts in our definition of the performance.
In the scheme (1.1.3) we can easily find the two main sources of effort. The first one is Step 1,
where we call the oracle, and the second one is Step 3, where we form the next test point.
We introduce two measures of the complexity of the problem P for the method M:

1. Analytical complexity: the number of calls of the oracle which is required to
solve the problem P up to the accuracy ε.
2. Arithmetical complexity: the total number of arithmetic operations (including
the work of the oracle and the method) which is required to solve the
problem P up to the accuracy ε.

² Note that this P_w can be bad only for M.
Thus, the only thing which is not clear yet is the meaning of the words up to the accuracy
ε > 0. Note that this meaning is very important for our definitions of complexity.
However, it is too specific to speak about it here. We will make this meaning exact when
we consider concrete problem classes.
Comparing the notions of analytical and arithmetical complexity, we can see that the
second one is more realistic. However, for a concrete method M, the arithmetical complexity
usually can be easily obtained from the analytical complexity. Therefore, in this course we
will speak mainly about estimates of the analytical complexity of some problem classes.
There is one standard assumption about the oracle which allows one to obtain most of the
results on the analytical complexity of optimization methods. This assumption is called
the black box concept, and it looks as follows:

1. The only information available from the oracle is its answer. No intermediate
results are available.
2. The oracle is local: a small variation of the problem far enough from the test
point x does not change the answer at x.

This concept is extremely popular in numerical analysis. Of course, it looks like an
artificial wall between the method and the oracle created by ourselves. It seems natural to
allow the method to analyze the internal structure of the oracle. However, we will see that
for some problems with complicated structure this analysis is almost useless. On the other
hand, for some important problems it could help. If we have enough time, that will be the
subject of the last lecture of this course.
To conclude this section, let us present the main types of oracles used in optimization.
For all of them the input is a test point x ∈ Rⁿ, but the output is different:

Zero-order oracle: the value f(x).
First-order oracle: the value f(x) and the gradient f′(x).
Second-order oracle: the value f(x), the gradient f′(x) and the Hessian f″(x).
B_n = {x ∈ Rⁿ | 0 ≤ x^(i) ≤ 1, i = 1, …, n}.

In order to specify the problem class, let us make the following assumption:

∀ x, y ∈ B_n : |f(x) − f(y)| ≤ L‖x − y‖.

Here and in the sequel we use the notation ‖·‖ for the Euclidean norm on Rⁿ:

‖x‖ = ⟨x, x⟩^{1/2} = ( Σ_{i=1}^{n} (x^(i))² )^{1/2}.
Let us consider a trivial method for solving (1.2.4), which is called the Uniform Grid
Method. This method, G(p), has one integer input parameter p, and its scheme is as follows.

1. Form the (p + 1)ⁿ grid points

x_{(i₁,…,iₙ)} = (i₁/p, i₂/p, …, iₙ/p), where i₁ = 0, …, p, …, iₙ = 0, …, p.

2. Among all points x_{(·)} find the point x̄ with the minimal value of the objective function.
3. Return the pair (x̄, f(x̄)) as the result.
Thus, this method forms a uniform grid of test points inside the cube B_n, computes
the minimal value of the objective over this grid, and returns it as an approximate solution
to problem (1.2.4). In our terminology, this is a zero-order iterative method without
any influence of the accumulated information on the sequence of test points. Let us find its
efficiency estimate.
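For illustration, here is a direct sketch of G(p); this is illustrative code of mine, not part of the notes:

```python
import itertools

def uniform_grid_method(f, n, p):
    """Uniform Grid Method G(p) on the cube B_n = [0, 1]^n:
    evaluate the zero-order oracle at all (p + 1)^n grid points
    (i_1/p, ..., i_n/p) and return the best point with its value."""
    best_x, best_val = None, float("inf")
    for idx in itertools.product(range(p + 1), repeat=n):
        x = tuple(i / p for i in idx)
        val = f(x)                       # one call of the zero-order oracle
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# f(x) = |x1 - 0.3| + |x2 - 0.7| is Lipschitz continuous on B_2
f = lambda x: abs(x[0] - 0.3) + abs(x[1] - 0.7)
xbar, fbar = uniform_grid_method(f, n=2, p=10)
```

The cost is exactly (p + 1)ⁿ oracle calls; the exponential growth in n is the subject of the complexity discussion below.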
Theorem 1.2.1 Let f* be the global optimal value of problem (1.2.4). Then

f(x̄) − f* ≤ L√n / (2p).
Proof:
Let x* be the global minimum of our problem. Then there exists a multi-index (i₁, i₂, …, iₙ)
such that, coordinate-wise,

x ≡ x_{(i₁,i₂,…,iₙ)} ≤ x* ≤ x_{(i₁+1,i₂+1,…,iₙ+1)} ≡ y.

Choosing in each coordinate the nearest of the two corresponding endpoints, we obtain a grid
point x̃ with ‖x̃ − x*‖ ≤ √n/(2p), and therefore

f(x̄) − f* ≤ f(x̃) − f(x*) ≤ L‖x̃ − x*‖ ≤ L√n/(2p).
Note that so far we still cannot say what is the complexity of this method on problem
(1.2.4). The reason is that we did not define what the quality of the
approximate solution we are looking for should be. Let us define our goal as follows:

Find x̄ ∈ B_n : f(x̄) − f* ≤ ε.     (1.2.6)

Then we immediately get the following result.
Corollary 1.2.1 The analytical complexity of the method G is as follows:

A(G) = ( ⌊L√n/(2ε)⌋ + 2 )ⁿ

(here ⌊a⌋ is the integer part of a).
Proof:
Indeed, let us take p = ⌊L√n/(2ε)⌋ + 1. Then p ≥ L√n/(2ε), and therefore, in view of
Theorem 1.2.1, we have

f(x̄) − f* ≤ L√n/(2p) ≤ ε.

It remains to note that the method calls the oracle at (p + 1)ⁿ = (⌊L√n/(2ε)⌋ + 2)ⁿ points.
This result is more informative, but we still have some questions. First, maybe our proof
is too rough and the real performance of G(p) is much better. Second, we cannot be sure
that this is a reasonable method for solving (1.2.4); maybe there are methods with
much higher performance.
In order to answer these questions, we need to derive lower complexity bounds for (1.2.4),
(1.2.6). The main features of these bounds are as follows.
Theorem 1.2.2 For ε < L/2, the analytical complexity of the problem class (1.2.4), (1.2.6)
is at least (⌊L/(2ε)⌋)ⁿ calls of the oracle.

Proof:
Assume that there exists a method which needs less than pⁿ calls of the oracle, where

p = ⌊L/(2ε)⌋ (≥ 1),

to solve any problem of our class up to accuracy ε > 0. Let us suppose that
when the method finds its approximate solution x̄, we allow it to call the oracle one more
time at x̄; this call will not be counted in our complexity evaluation. So, the total number of
calls to the oracle of the method is N < pⁿ.
Let us apply this method to the following resisting oracle:
It reports that f(x) = 0 at any test point.
Therefore this method can find only some x̄ ∈ B_n with f(x̄) = 0.
Note that since N < pⁿ, there exists x̂ ∈ B_n such that

x̂ + (1/p)e ∈ B_n, e = (1, …, 1),

and there were no test points inside the box

B = {x | x̂ ≤ x ≤ x̂ + (1/p)e}.
Denote x* = x̂ + (1/(2p))e and consider the function

f̄(x) = min{0, L‖x − x*‖∞ − ε},

where ‖a‖∞ = max_{1≤i≤n} |aᵢ|. Note that the function f̄(x) is Lipschitz continuous (since
‖a‖∞ ≤ ‖a‖) and the optimal value of f̄(·) is −ε. Moreover, f̄(x) differs from zero only inside
the box

B′ = {x | ‖x − x*‖∞ ≤ ε/L}.

Since 2p ≤ L/ε, we conclude that

B′ ⊆ {x | ‖x − x*‖∞ ≤ 1/(2p)} ⊆ B.
Thus, f̄(x) is equal to zero at all test points of our method. Since the accuracy of the
result of our method must be ε, we come to the following conclusion: if the number of calls of the
oracle is less than pⁿ, then the accuracy of the result cannot be better than ε.
Now we can say much more about the performance of the uniform grid method. Let us
compare its efficiency estimate with the lower bound:

G: ( ⌊L√n/(2ε)⌋ + 2 )ⁿ,  Lower bound: ( L/(2ε) )ⁿ.
Thus, we conclude that G has optimal dependence of its complexity on ε, but not on n. Note
that this conclusion depends on the problem class. If we consider the functions f satisfying

∀ x, y ∈ B_n : |f(x) − f(y)| ≤ L‖x − y‖∞,

then the same reasoning as before proves that the uniform grid method is optimal, with the
efficiency estimate ( ⌊L/(2ε)⌋ + 2 )ⁿ.
Theorem 1.2.2 supports our initial claim that the general optimization problems are
unsolvable. Let us look at the following example.
Example 1.2.1 Consider the problem class F defined by the following parameters:

L = 2, n = 10, ε = 0.01.

Note that the size of the problem is very small and we ask only for 1% accuracy.
The lower complexity bound for this class is (L/(2ε))ⁿ. Let us compute what this means:

Lower bound: 10²⁰ calls of the oracle,
Complexity of the oracle: n a.o.,
Total complexity: 10²¹ a.o.,
Intel Quad Core Processor: 10⁹ a.o. per second,
Total time: 10¹² seconds,
1 year: less than 3.2 · 10⁷ sec.
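The arithmetic behind this table is easy to reproduce (the rounding of L/(2ε) to an integer p is mine):

```python
L, n, eps = 2.0, 10, 0.01

p = round(L / (2 * eps))             # L/(2 eps) = 100 points per dimension
oracle_calls = p ** n                # lower bound (L/(2 eps))^n = 10^20 calls
total_ops = oracle_calls * n         # n a.o. per oracle call: 10^21 a.o.
seconds = total_ops / 10 ** 9        # at 10^9 a.o. per second: 10^12 seconds
years = seconds / (3.2 * 10 ** 7)    # about 31 250 years
```

So the "total time" line of the table corresponds to tens of thousands of years of computation.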
This estimate is so disappointing that we cannot believe that such problems may become
solvable even in the future. Indeed, suppose we believe in Moore's law, i.e., that
processor power is multiplied by 3 every 2 years. We can hope that a PC of 2030 will
solve the problem in only 1 year, and in 2070 it will take only 1 second. However, let us just
play with the parameters of the class.
If we change n to n + 1, then we have to multiply our estimate by 100. Thus, for
n = 11 our time estimate remains valid even for the fastest available computer.
But if we multiply ε by two, we reduce the complexity by a factor of 2¹⁰ ≈ 1000. For
example, for ε = 8% the three doublings reduce the total time to about a quarter of an hour.
We should note that the lower complexity bounds for problems with smooth functions,
or for high-order methods, are not much better than that of Theorem 1.2.2. This can be
proved using the same arguments, and we leave the proof as an exercise for the reader. An
advanced reader can compare our results with the upper bound for NP-hard problems, which
are considered as examples of very difficult problems in combinatorial optimization: it
is only 2ⁿ a.o.!
To conclude this section, let us compare our situation with that in some other fields of numerical
analysis. It is well known that the uniform grid approach is a standard tool for many of
them. For example, if we need to compute numerically the value of the integral

I = ∫₀¹ f(x) dx

of an L-Lipschitz function, then the corresponding uniform grid sum S_N with N = L/ε points
guarantees |I − S_N| ≤ ε.
Note that in our terminology this is exactly the uniform grid approach. Moreover, it is a
standard way of approximating integrals. The reason why it works here lies in the
dimension of the problem: for integration the standard dimensions are 1–3, while in optimization
we sometimes need to solve problems with several million variables.
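In one dimension this is easy to check numerically. Here S_N is taken to be the left-endpoint grid sum, which is an assumption of mine, since its exact form is not reproduced above:

```python
def grid_sum(f, N):
    """Left-endpoint uniform-grid approximation S_N of the
    integral of f over [0, 1]."""
    h = 1.0 / N
    return h * sum(f(i * h) for i in range(N))

# f(x) = |x - 0.5| is Lipschitz continuous with L = 1, and I = 0.25
f = lambda x: abs(x - 0.5)
L, eps = 1.0, 1e-3
N = int(L / eps)                     # N = L/eps grid points
S = grid_sum(f, N)
error = abs(S - 0.25)                # within eps, as claimed
```

With N = 1000 oracle calls the error is indeed below ε; the same grid strategy in n dimensions would need Nⁿ calls, which is exactly the curse discussed above.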
Thus, everything is clear with global optimization. But maybe its goals are too ambitious?
Maybe in some practical problems we would be satisfied by a much less "optimal" solution?
Or maybe there are some interesting problem classes which are not as terrible as the class
of general continuous functions?
In fact, each of these questions can be answered in a different way, and this way defines the
style of research (or the rules of the game) in the different optimization fields. If we try
to classify them, we easily see that they differ from one another in the following aspects:
These aspects define in a natural way the list of desired properties of optimization
methods.
To conclude this lecture, let us present the identity cards of the fields we will consider
in our course.
The majority of the lectures are accompanied by exercise sections. In several cases
the exercises are devoted to the lecture where they are placed; sometimes they prepare the
reader for the next lecture. Exercises marked by # are closely related to the lecture where
they are placed or to the following one; it would be a good idea to solve such an exercise,
or at least to become acquainted with its solution (if any is given). Exercises which I find
difficult are marked with >.
If you want to improve your background on the basic mathematical notions involved,
consider the reference
The main drawback of the little blue book by C. Lemarechal: Methodes numeriques
d'optimisation, Notes de cours, Universite Paris IX-Dauphine, INRIA, Rocquencourt, 1989,
is that it is too small.
As far as the main body of the course is concerned, for Chapter 5 I would suggest the
reference
All these books also possess the important quality of being written in French. If you decide
that you are interested in Convex Optimization, the following reading would be extremely
gratifying:
Lecture 2. When Everything is Simple: 1-Dimensional Convex Optimization
where [a, b] is a given finite segment on the axis. It is also known that our objective f is a
continuous convex function on G; for the sake of simplicity, assume that we know bounds,
let them be 0 and V, for the values of the objective on G. Thus, all we know about the
objective is that it belongs to the family
And what we are asked to do is to find, for a given positive ε, an ε-solution to the problem,
i.e., a point x̄ ∈ G such that

f(x̄) − f* ≡ f(x̄) − min_G f ≤ ε.
Of course, our a priori knowledge of the objective given by the inclusion f ∈ F is, for
small ε, far from being sufficient for finding an ε-solution, and we need some source of
quantitative information on the objective. The standard assumption here, which comes from
optimization practice, is that we can compute the value and a subgradient of the objective
at a point, i.e., we have access to a subroutine, our oracle O, which gets, as an input, a point
x from our segment and returns the value f(x) and a subgradient f′(x) of the objective at
the point.
We have subjected the input to the subroutine to the restriction a < x < b, since the
objective, generally speaking, is not defined outside the segment [a, b], and its subgradient
might be undefined at the endpoints of the segment as well. I should also add that the
oracle is not uniquely defined by the above description; indeed, at some points f may have a
massive set of subgradients, not a single one, and we did not specify how the oracle at such
a point chooses the subgradient to be reported. As usual, we need exactly one hypothesis
of this type; namely, we assume the oracle to be local: the information on f reported at a
point x must be uniquely defined by the behavior of f in a neighborhood of x:

{f, f̄ ∈ F, x ∈ int G, f ≡ f̄ in a neighborhood of x}  ⇒  O(f, x) = O(f̄, x).
Recall that the method M is a collection of the search rules, the termination tests, and the
rules for forming the result. Note that we do not subject the rules comprising a method
to any further restrictions like "computability in finitely many arithmetic operations"; the
rules might be arbitrary functions of the information on the problem accumulated up to the
step when the rule should be used.
What we should do is find a method which, given on input the desired value of accuracy
ε, after a number of oracle calls produces an ε-solution to the problem. And what we are
interested in is the most efficient method of this type. Namely, given a method which solves
every problem from our family to the desired accuracy in a finite number of oracle calls, we
define the worst-case complexity A(M) of the method M as the maximum, over all problems
from the family, of the number of calls; what we are looking for is exactly the method of
minimal worst-case complexity. To sum up, in our terminology the problem of
finding the optimal method is:
given the family

F = {f : G = [a, b] → R | f is convex and continuous on G, 0 ≤ f ≤ V}

of problems and an ε > 0, find, among the methods M with accuracy on the
family not worse than ε, the method with the smallest possible complexity on the
family.

Recall that the complexity of the associated optimal method, i.e., the function

A(ε) = min{A(M) | Accuracy(M) ≤ ε},

is called the complexity of the family.
The bisection method starts by choosing the midpoint x₁ of the segment and asking the oracle
about the value and a subgradient of the objective at the point. If the
subgradient is zero, we are done: we have found an optimal solution. If the subgradient is
positive, then the function, due to convexity, is greater to the right of x₁ than at the point itself,
and we may cut off the right half of our initial segment: the minimum for sure is localized
in the remaining part. If the subgradient is negative, then we may cut off the left half of the
initial segment.
Thus, we either terminate with an optimal solution, or find a new segment, twice smaller
than the initial domain, which for sure localizes the set of optimal solutions. In the latter
case we repeat the procedure, with the initial domain replaced by the new localizer, and
so on. After we have performed the number of steps indicated in the formulation of the
theorem below, we terminate and form the result as the best (with the minimal value of f)
of the search points we have looked through:

x̄ ∈ {x₁, …, x_N};  f(x̄) = min_{1≤i≤N} f(xᵢ).
Note that traditionally the approximate solution given by the bisection method is identified
with the last search point (which is clearly at distance at most (b − a)2⁻ᴺ from the
optimal solution), rather than with the best point found so far. This traditional choice
has little in common with our accuracy measure (we are interested in small values of the
objective rather than in closeness to the optimal solution) and is simply dangerous, as you can
see from the following example:

[Figure 1: search points x_{N−1}, x_N.]
Here during the first N − 1 steps everything looks as if we were minimizing f(x) = x, so
that the N-th search point is x_N = 2⁻ᴺ; our experience is misleading, as you see from the
picture, and the relative accuracy of x_N as an approximate solution to the problem is very
bad, something like 1/2.
By the way, we see from this example that the evident convergence of the search points
to the optimal set at the rate at least 2⁻ⁱ does not automatically imply a fixed rate of
convergence in terms of the objective; it turns out, anyhow, that such a rate exists, but for
the best points found so far rather than for the search points themselves.
This way we obtain the scheme of the bisection algorithm.
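A minimal sketch of this scheme, assuming a first-order oracle that returns the value f(x) and a subgradient f′(x) (the helper names are mine):

```python
import math

def bisection(oracle, a, b, N):
    """Bisection on [a, b] for a convex objective.

    At each step, call the oracle at the midpoint of the current
    localizer and cut off the half that cannot contain the minimum.
    The result is the BEST of the search points, not the last one.
    """
    best_x, best_val = None, float("inf")
    for _ in range(N):
        x = 0.5 * (a + b)            # current search point
        val, g = oracle(x)
        if val < best_val:
            best_x, best_val = x, val
        if g == 0:
            return x, val            # x is exactly optimal
        if g > 0:
            b = x                    # minimum is to the left of x
        else:
            a = x                    # minimum is to the right of x
    return best_x, best_val

# f(x) = (x - 0.3)^2 on [0, 1]: its values lie in [0, 1], so V <= 1
eps = 1e-6
N = math.ceil(math.log2(1.0 / eps))  # N = ceil(log2(V/eps)) steps
oracle = lambda x: ((x - 0.3) ** 2, 2 * (x - 0.3))
xbar, fbar = bisection(oracle, 0.0, 1.0, N)
```

After N = ⌈log₂(V/ε)⌉ oracle calls the best value found satisfies f(x̄) − f* ≤ 2⁻ᴺV ≤ ε, in accordance with the theorem below.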
Theorem 2.1.1 The complexity of the family in question satisfies the inequality

A(ε) ≤ ⌈log₂(V/ε)⌉, 0 < ε < V.     (2.1.2)

The method associated with the upper bound is the usual bisection terminated after N =
⌈log₂(V/ε)⌉ steps.
Note that the range of values of ε in our statement is (0, V), and this is quite natural: since
all functions from the family take their values between 0 and V, any point of the segment
solves every problem from the family to accuracy V, so that a nontrivial optimization
problem occurs only when ε < V.
Proof:
We start with the observation that if G_N = [x⁻, x⁺] is the final localizer of the optimum
found during the bisection, then outside the localizer the value of the objective is at least
the value at the best of the search points, i.e., at least the value at the approximate solution
x̄ found by bisection:

f(x) ≥ f(x̄) ≡ min_{1≤i≤N} f(xᵢ), x ∈ G \ G_N.

Indeed, at each step of the method we cut off only those points of the domain G where f is at
least as large as at the current search point, and hence at least as large as the value at the best
of the search points, that is, the value at x̄; this is exactly what was claimed.
Now, let x* be an optimal solution to the problem, i.e., a minimizer of f; as we know,
such a minimizer does exist, since our continuous objective for sure attains its minimum
over the finite segment G. Let λ be a real greater than 2⁻ᴺ and less than 1, and consider the
λ-contraction of the segment G to x*, i.e., the set

G_λ = (1 − λ)x* + λG ≡ {(1 − λ)x* + λz | z ∈ G}.

This is a segment of length λ(b − a), and due to our choice of λ this length is greater
than that of our final localizer G_N. It follows that G_λ cannot be inside the localizer, so
that there is a point, let it be y, which belongs to G_λ and does not belong to the interior of
the final localizer:

∃ y ∈ G_λ : y ∉ int G_N.

Since y belongs to G_λ, we have

y = (1 − λ)x* + λz

for some z ∈ G, and from convexity of f it follows that

f(y) ≤ (1 − λ)f(x*) + λf(z),

whence

f(y) − f* ≤ λ(f(z) − f*) ≤ λV.

Since y lies outside the interior of the final localizer, we also have

f(x̄) ≤ f(y).

We conclude that

f(x̄) − f* ≤ f(y) − f* ≤ λV.

Since λ can be arbitrarily close to 2⁻ᴺ, we come to

f(x̄) − f* ≤ 2⁻ᴺ V = 2^{−⌈log₂(V/ε)⌉} V ≤ ε.

Thus,

Accuracy(Bisection_N) ≤ ε.

The upper bound is proved.
The observation that the length of the localizers Gᵢ converges geometrically to 0 was
crucial in the above proof of the complexity estimate. However, for the bisection procedure
to possess this property, convexity of f is not necessary; for instance, it is enough that f be
quasi-convex. On the other hand, quasi-convexity itself does not imply the convergence
of the error to 0 in the course of the iterations. To have this, we have to impose some condition on the local
variation of the objective, e.g., that f is Lipschitz continuous.¹⁾

¹⁾ We will discuss this subject at length in the Exercise section of the next lecture.
|Δᵢ| = 2^{1−2i},

fᵢ(x) = aᵢ + 2^{−3i} |x − cᵢ|, x ∈ Δᵢ,

²⁾ Recall that we have succeeded in treating this task for the class of Global Optimization problems.
thus ensuring (1⁰). Property (2⁰) holds true for trivial reasons: when i = 0, there are
no search points to be looked at.
Step i ⇒ i + 1: Let fᵢ be the objective given by our inductive hypothesis, let Δᵢ be the
active segment of this objective and let cᵢ be the midpoint of the segment.
Let also x₁, …, xᵢ, x_{i+1} be the first i + 1 search points generated by M as applied to
fᵢ. According to our inductive hypothesis, the first i of these points are outside the active
segment.
In order to obtain f_{i+1}, we modify the function fᵢ in its active segment and do not vary
the function outside the segment. The way we modify fᵢ in the active segment depends
on whether x_{i+1} is to the right of the midpoint cᵢ of the segment (right modification), or
this is not the case and x_{i+1} either coincides with cᵢ or is to the left of the point (left
modification).
The right modification is as follows: we replace the modulus-like in its active segment
function fᵢ by a piecewise linear function with three linear pieces, as shown in the picture
below. Namely, we do not change the slope of the function in the initial 1/14 part of the
segment, then change the slope from 2^{−3i} to 2^{−3(i+1)} and make a new breakpoint at the
end c_{i+1} of the first quarter of the segment Δᵢ. Starting with this breakpoint and up to the
right endpoint of the active segment, the slope of the modified function is 2^{−3(i+1)}. It is
easily seen that the modified function at the right endpoint of Δᵢ comes to the same value
as that of fᵢ, and that the modified function is convex on the whole axis.
In the case of the left modification, i.e., when x_{i+1} ≤ cᵢ, we act in the symmetric
manner, so that the breakpoints of the modified function are at the distances (3/4)|Δᵢ| and
(13/14)|Δᵢ| from the left endpoint of Δᵢ, and the slopes of the function, from left to right, are
symmetric to those of the right modification.

[Figure: the modified objective on Δᵢ, showing the new active segment Δ_{i+1}, the breakpoint c_{i+1}, the midpoint cᵢ, and the search point x_{i+1}.]
Let us verify that the modified function f_{i+1} satisfies the requirements imposed by the
lemma. As we have mentioned, this is a convex continuous function; since we do not vary
fᵢ outside the segment Δᵢ and do not decrease it inside the segment, the modified function
takes its values in (0, 1) together with fᵢ. It suffices to verify that f_{i+1} satisfies (1_{i+1}) and
(2_{i+1}).
(1_{i+1}) is evident by construction: the modified function indeed is modulus-like, with the
required slopes, in a segment of the required length. What should be proved is (2_{i+1}), the
claim that the method M as applied to f_{i+1} does not visit the
active segment of f_{i+1} during the first i + 1 steps. To prove this, it suffices to prove that the first i + 1 search points
generated by the method as applied to f_{i+1} are exactly the search points generated by it when
minimizing fᵢ, i.e., that they are the points x₁, …, x_{i+1}. Indeed, these latter points for sure are
outside the new active segment: the first i of them due to the fact that they do not even
belong to the larger segment Δᵢ, and the last point, x_{i+1}, by our construction, which ensures
that the active segment of the modified function and x_{i+1} are separated by the midpoint cᵢ
of the segment Δᵢ.
Thus, we come to the necessity to prove that x₁, …, x_{i+1} are the first i + 1 points generated
by M as applied to f_{i+1}. This is evident: the points x₁, …, xᵢ are outside Δᵢ, where fᵢ and f_{i+1}
coincide; consequently, the information (the values and the subgradients) on the functions
along the sequence x₁, …, xᵢ is also the same for both of the functions. Now, by definition
of a method, the information accumulated by it during the first i steps uniquely determines
the first i + 1 search points; since fᵢ and f_{i+1} are indistinguishable in a neighborhood of the
first i search points generated by M as applied to fᵢ, the initial (i + 1)-point segments of
the trajectories of M on fᵢ and on f_{i+1} coincide with each other, as claimed.
Thus, we have justified the inductive step and therefore have proved the lemma.
It remains to derive from the lemma the desired lower complexity bound. This is immediate.
According to the lemma, there exists a function f_K in our family which is modulus-like
in its active segment Δ_K and is such that the method during its first K steps does not visit
this active segment. But the K-th point x_K of the trajectory of M on f_K is exactly the
result x̄(M, f_K) found by the method as applied to the function; since f_K is modulus-like in Δ_K and
is convex everywhere, it attains its minimum f*_K at the midpoint c_K of the segment Δ_K, and
outside Δ_K it is greater than

f*_K + 2^{−3K} · 2^{−2K} = f*_K + 2^{−5K}

(the product here is half of the length of Δ_K times the slope of f_K). Thus,

f_K(x̄(M, f_K)) − f*_K > 2^{−5K}.
On the other hand, M, by its origin, solves all problems from the family to the accuracy ε,
and we come to

2^{−5K} < ε,

i.e., to

K > (1/5) log₂(1/ε),

as required in our lower complexity bound.
2.2 Conclusion
The one-dimensional situation we have investigated is, of course, very simple; I spoke about it
only to give you an impression of what we are going to do. In the main body of the course we
shall consider much more general classes of convex optimization problems, i.e., multidimen-
sional problems with functional constraints. Same as in our simple one-dimensional example,
we shall ask ourselves what is the complexity of the classes and what are the corresponding
optimal methods. Let me stress that these are optimal methods we mainly shall focus on
- it is much more interesting issue than the complexity itself, both from mathematical and
practical viewpoint. In this respect, one-dimensional situation is not typical - it is easy to
guess that the bisection should be optimal and to establish its rate of convergence. In several
dimensions situation is far from being so trivial and is incomparably more interesting.
2.3 Exercises
2.3.1 Can we use 1-dimensional optimization?
Note that though being extremely simple, the bisection algorithm can be of great use for
solving multi-dimensional optimization problems which look much more involved. Consider
for instance the following example of Minimizing a separable function subject to an equality
constraint.
We consider the problem

min_x { f(x) = Σ_{i=1}^n f_i(x_i) }  subject to  a^T x = b.   (2.3.4)

The function f* is also referred to as the Legendre transform of f. Note that the f_i*, being point-
wise suprema of convex functions, are themselves convex (and, by the way, differentiable).
The dual problem is thus

max_λ { λ b − Σ_{i=1}^n f_i*(λ a_i) }.   (2.3.5)
Exercise 2.3.1 Consider the problem of finding the Euclidean projection of a point y ∈ R^n
onto the standard simplex:

min_x { f(x) = |x − y|² }  subject to  Σ_{i=1}^n x_i = 1,  x_i ≥ 0,  i = 1, ..., n.   (2.3.6)
max_{z ≥ 0} { uz − (x − a)² }.
2. Using the method described in this section, propose a simple solution to the problem
(2.3.6) by bisection. Write down explicitly the formulas which allow to recover the
primal solution from the dual one.
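The notes leave the computation to the reader; as an illustration (not part of the original text), here is a minimal Python sketch of the bisection approach the exercise suggests. Dualizing the equality constraint of (2.3.6) gives the primal recovery formula x_i(λ) = max(y_i − λ, 0), and Σ_i x_i(λ) is nonincreasing in λ, so the right multiplier is found by bisection; the function name and tolerance are mine.

```python
import numpy as np

def project_to_simplex(y, tol=1e-10):
    """Euclidean projection of y onto {x : sum(x) = 1, x >= 0} by bisection
    on the multiplier lam of the dualized equality constraint."""
    y = np.asarray(y, dtype=float)
    # x_i(lam) = max(y_i - lam, 0); g(lam) = sum_i x_i(lam) - 1 is
    # nonincreasing in lam, so its root is localized by bisection.
    lo, hi = y.min() - 1.0, y.max()        # g(lo) >= 0 >= g(hi)
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if np.maximum(y - lam, 0.0).sum() > 1.0:
            lo = lam
        else:
            hi = lam
    return np.maximum(y - 0.5 * (lo + hi), 0.0)

x = project_to_simplex([2.0, 0.0])         # projects to the vertex (1, 0)
```
Each bisection step only evaluates one monotone scalar function of λ, which is exactly the reduction of the n-dimensional problem to one-dimensional root finding discussed in this section.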
2. Let 1/n ≤ ν < 1, and let a ∈ R^n satisfy 0 < a_i, Σ_i a_i = 1.
Find the optimal solution and the optimal value of the problem

max_{0 ≤ z ≤ ν} { uz − z log(z/a) }  for 0 < a.
3. Explain how the bisection algorithm can be used to solve the problem

min_x { f(x) = Σ_{i=1}^n [ x_i log(x_i/a_i) + u_i (x_i − a_i) ] }  subject to  Σ_{i=1}^n x_i = 1,  0 ≤ x_i ≤ ν,  i = 1, ..., n.

Hint: Dualize the equality constraint Σ_{i=1}^n x_i = 1.
LECTURE 2. WHEN EVERYTHING IS SIMPLE: 1-DIMENSIONAL CONVEX OPTIMIZATION
A being a symmetric positive definite n × n matrix and c being a point in R^n (the center of
the ellipsoid).
The second way is to represent W as the image of the unit Euclidean ball under an affine
invertible mapping, i.e., as

W = {x = Bu + c | u^T u ≤ 1},   (2.3.9)
Exercise 2.3.3 # Prove that the above definitions are equivalent: if W ⊂ R^n is given by
(2.3.8), then W can be represented by (2.3.9) with B chosen according to

A = (B^{-1})^T B^{-1}

(e.g., with B chosen as A^{-1/2}). Vice versa, if W is represented by (2.3.9), then W can be
represented by (2.3.8), where one should set

A = (B^{-1})^T B^{-1}.

Note that the (positive definite symmetric) matrix A involved in (2.3.8) is uniquely defined
by W (why?); in contrast to this, a nonsingular matrix B involved in (2.3.9) is defined by
W only up to a right orthogonal factor: the matrices B and B′ define the same ellipsoid if and
only if B′ = BU with an orthogonal n × n matrix U (why?)
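To see the claimed equivalence at work, here is a small numerical check (an illustration of mine, not part of the notes); it uses the symmetric square root B = A^{-1/2}, computed via an eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive definite

# B = A^{-1/2} via the eigendecomposition of A
w, V = np.linalg.eigh(A)
B = V @ np.diag(w ** -0.5) @ V.T

# check A = (B^{-1})^T B^{-1}
Binv = np.linalg.inv(B)
assert np.allclose(A, Binv.T @ Binv)

# a point Bu + c with u^T u <= 1 satisfies (x - c)^T A (x - c) <= 1
c = rng.standard_normal(n)
u = rng.standard_normal(n)
u /= np.linalg.norm(u) * 1.5           # |u| < 1
x = B @ u + c
assert (x - c) @ A @ (x - c) <= 1.0
```
The check is just the identity (Bu)^T A (Bu) = u^T A^{-1/2} A A^{-1/2} u = u^T u.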
From the second description of an ellipsoid it immediately follows that

if W = {x = Bu + c | u ∈ R^n, u^T u ≤ 1} is an ellipsoid and x ↦ p + B′x
is an invertible affine transformation of R^n (so that B′ is a nonsingular n × n
matrix), then the image of W under the transformation is also an ellipsoid, namely

W′ = {x = B′Bu + (p + B′c) | u ∈ R^n, u^T u ≤ 1},

the matrix B′B being nonsingular along with B and B′. It is also worthy of note that
the ellipsoid

W = {x = Bu + c | u ∈ R^n, u^T u ≤ 1}

is taken, by the inverse affine transformation

x ↦ B^{-1}x − B^{-1}c,

exactly into the unit Euclidean ball

V = {u ∈ R^n | u^T u ≤ 1}.
Exercise 2.3.4 # Prove that if W is an ellipsoid in R^n given by (2.3.9), then

Exercise 2.3.5 # Prove that if Q is a closed and bounded convex body 3) in R^n, then there
exist ellipsoids containing Q and among these ellipsoids there is (at least) one with the
smallest volume.
3) In what follows, "body" means a set with a nonempty interior.
Exercise 2.3.6 Prove that if Q is a closed and bounded convex body in R^n, then there
exist ellipsoids contained in Q and among these ellipsoids there is (at least) one with the
largest volume.
Note that the extremal ellipsoids associated with a closed and bounded convex body Q ac-
company Q under affine transformations: if x ↦ Ax + b is an invertible affine transformation
and Q′ is the image of Q under this transformation, then the image W′ of an extremal outer
ellipsoid W associated with Q (note the article: we have not proved the uniqueness!) is an
extremal outer ellipsoid associated with Q′, and similarly for (an) extremal inner ellipsoid.
The indicated property is, of course, an immediate consequence of the facts that affine images
of ellipsoids are again ellipsoids and that the ratio of volumes remains invariant under an
affine transformation of the space.
In what follows we focus on outer extremal ellipsoids. Useful information can be obtained
from investigating these ellipsoids for simple parts of a Euclidean ball.
Exercise 2.3.7 + Prove that the volume of the spherical hat

V_α = {x ∈ R^n | |x| ≤ 1, x_n ≥ α}

V_α = {x ∈ V | e^T x ≥ α},  α ∈ [−1, 1],

where

B = ( n²(1 − α²)/(n² − 1) )^{1/2} ( I − σ ee^T ),   σ = 1 − ( (1 − α)(n − 1) / ((1 + α)(n + 1)) )^{1/2}.

Hint: note that V_α is contained in the set of solutions to the system of the following pair of
quadratic inequalities:

x^T x ≤ 1;   (2.3.12)
the volume of this covering ellipsoid is less than voln(V) by a factor of the form exp{−O(1/n)};
thus, for the case of α = 0 (and, of course, for the case of α > 0) we may cover
V_α by an ellipsoid with volume (1 − O(1/n)) times that of V. In
fact the same conclusion (with another absolute constant factor O(1)) holds true
when α is negative (so that the spherical hat is greater than the half-ball), but not
too negative, say, when α ≥ −1/(2n).
2. In order to cover V_α by an ellipsoid of volume an absolute constant times less
than that of V, we need α to be positive of order O(n^{−1/2}) or greater. This
fits our observation that the volume of V_α itself is at least an absolute constant times
less than that of V only if α ≥ O(n^{−1/2}) (Exercise 2.3.7). Thus, whenever
the volume of V_α is an absolute constant times less than that of V, we can cover
V_α by an ellipsoid of volume also an absolute constant times less than that
of V; such a covering is already given by the Euclidean ball of radius √(1 − α²)
centered at the point αe (which, however, is not the optimal covering presented
in Exercise 2.3.8).
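The last observation is easy to check by hand: for x in the hat and α ≥ 0 one has |x − αe|² = |x|² − 2α e^T x + α² ≤ 1 − α², so the ball of radius √(1 − α²) centered at αe indeed contains V_α, and the ratio of its volume to that of the unit ball is (1 − α²)^{n/2}. A small numerical sketch (mine, not part of the notes):

```python
import math

def covering_ratio(n, alpha):
    # the ball of radius sqrt(1 - alpha^2) centered at alpha*e contains the
    # hat {|x| <= 1, e^T x >= alpha}; its volume is (1 - alpha^2)^(n/2) * vol(V)
    return (1.0 - alpha * alpha) ** (n / 2.0)

for n in (10, 100, 1000):
    alpha = 1.0 / math.sqrt(n)      # half-thickness of order n^(-1/2)
    print(n, covering_ratio(n, alpha))
```
With α of order n^{−1/2} the ratio (1 − α²)^{n/2} tends to a constant strictly below 1 (for α = n^{−1/2} it tends to exp{−1/2}), matching the claim that this choice of α already yields a constant-factor volume reduction.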
Exercise 2.3.9 #+ Let V be the unit Euclidean ball in R^n, let e be a unit vector and let
α ∈ (0, 1). Consider the symmetric spherical stripe

V_α = {x ∈ V | −α ≤ e^T x ≤ α}.

Prove that if 0 < α < 1/√n then V_α can be covered by an ellipsoid W with the volume

voln(W) ≤ α√n ( n(1 − α²)/(n − 1) )^{(n−1)/2} voln(V) < voln(V).

Find an explicit representation of the ellipsoid.
Hint: use the same construction as that for Exercise 2.3.8.
We see that in order to cover a symmetric spherical stripe of the unit Euclidean ball V by
an ellipsoid of volume less than that of V, it suffices to have the half-thickness α of
the stripe be < 1/√n, which again fits our observation (Exercise 2.3.7) that essentially all the
volume of the unit n-dimensional Euclidean ball is concentrated in the O(1/√n) neighbor-
hood of its "equator" - the cross-section of the ball and a hyperplane passing through the
center of the ball. A useful exercise is to realize when a non-symmetric spherical stripe

V_{α,β} = {x ∈ V | α ≤ e^T x ≤ β}

of the (centered at the origin) unit Euclidean ball V can be covered by an ellipsoid of volume
less than that of V.
The results of Exercises 2.3.8 and 2.3.9 imply a number of important geometrical consequences.
Exercise 2.3.10 + Prove the following theorem of Fritz John:

Let Q be a closed and bounded convex body in R^n. Then
(i) Q can be covered by an ellipsoid W in such a way that the concentric, n times
smaller ellipsoid

W′ = (1 − 1/n) c + (1/n) W

(c is the center of W) is contained in Q. One can choose as W the extremal outer ellipsoid
associated with Q.
(ii) If, in addition, Q is central-symmetric with respect to a certain point c, then the above
result can be improved: Q can be covered by an ellipsoid W centered at c in such a way that
the concentric, √n times smaller ellipsoid

W′ = (1 − 1/√n) c + (1/√n) W

is contained in Q.
Hint: use the results given by Exercises 2.3.8 and 2.3.9.

Note that the constants n and √n in the Fritz John Theorem are sharp; an extremal
example for (i) is a simplex, and for (ii) a cube.
Here are several nice geometrical consequences of the Fritz John Theorem:
Q′ being the image of Q under the transformation (it suffices to transform the
outer extremal ellipsoid associated with Q into the unit Euclidean ball centered
at the origin). It remains to note that the smaller Euclidean ball in the above
chain of inclusions contains the cube {x | |x|_∞ ≤ n^{−3/2}} and the larger one is
contained in the unit cube.

2. If Q is central-symmetric, then the parallelotopes mentioned in 1. can be
chosen to have the same center, and the homothety coefficient can be improved to
1/n; in other words, there exists an invertible affine transformation of the space
which makes the image Q′ of Q central-symmetric with respect to the origin and
ensures the inclusions

{x | |x|_∞ ≤ 1/n} ⊂ Q′ ⊂ {x | |x|_∞ ≤ 1}.

The statement is given by a reasoning completely similar to that used for
1., up to the fact that now we should refer to item (ii) of the Fritz John
Theorem.
3. Any norm ‖·‖ on R^n can be approximated, within factor √n, by a
Euclidean norm: given ‖·‖, one can find a Euclidean norm
to the origin. By item (ii) of the Fritz John Theorem, there exists a centered at
the origin ellipsoid

W = {x | x^T Ax ≤ n}

(A is an n × n symmetric positive definite matrix) which contains B, while the
ellipsoid

{x | x^T Ax ≤ 1}

is contained in B; this latter inclusion means exactly that

|x|_A ≤ 1 ⇒ x ∈ B, i.e., ‖x‖ ≤ 1,
and,
second, whenever ‖·‖ is a norm on R^n, one can indicate an m(n, ε)-dimensional subspace
E ⊂ R^n and a Euclidean norm |·|_A on R^n such that |·|_A approximates ‖·‖ on E within
factor 1 + ε:

(1 − ε)|x|_A ≤ ‖x‖ ≤ (1 + ε)|x|_A,  x ∈ E.

In other words, the Euclidean norm is "marked by God": for any given integer k an arbitrary
normed linear space contains an almost Euclidean k-dimensional subspace, provided that
the dimension of the space is large enough.
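A concrete instance of the factor-√n approximation (an illustration of mine, not from the notes): the Euclidean norm approximates the norm ‖·‖₁ within factor √n, since ‖x‖₂ ≤ ‖x‖₁ ≤ √n ‖x‖₂ for all x (the right inequality is Cauchy-Schwarz). A quick numerical check:

```python
import math
import random

random.seed(1)
n = 16
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(n)]
    l1 = sum(abs(t) for t in x)
    l2 = math.sqrt(sum(t * t for t in x))
    # the Euclidean norm sandwiches the l1 norm within factor sqrt(n)
    assert l2 <= l1 + 1e-9
    assert l1 <= math.sqrt(n) * l2 + 1e-9
```
Both inequalities are tight: the left one on coordinate vectors, the right one on the vector of all ones, so the factor √n cannot be improved for this pair of norms.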
Lecture 3
Methods with Linear Convergence
Here the domain G of the problem is a closed convex set in R^n with a nonempty interior, the
objective f and the functional constraints g_i, i = 1, ..., m, are convex continuous functions
on G.
Let us fix a closed and bounded convex domain G ⊂ R^n and the number m of func-
tional constraints, and let P = P_m(G) be the family of all feasible convex problems with
m functional constraints and the domain G. Note that since the domain G is bounded
and all problems from the family are feasible, all of them are solvable, by the standard
compactness reasons.
In what follows we identify a problem instance from the family P_m(G) with the vector-
valued function

p = (f, g_1, ..., g_m)

comprised of the objective and the functional constraints.
What we shall be interested in for a long time are efficient methods for solving
problems from the indicated very wide family. Similarly to the one-dimensional case, we
assume that the methods have access to a first order local oracle O which, given an
input vector x ∈ int G, returns the values and some subgradients of the objective and the
functional constraints at x, so that the oracle computes the mapping
The notions of a method and its complexity at a problem instance and on the whole family
are introduced exactly as it was done in Section 1.2 of our first lecture 1).
The accuracy of a method at a problem and on the family is defined in the following way. Let us
start with the vector of residuals of a point x ∈ G regarded as an approximate solution to a
problem instance p:

Residual(p, x) = ( f(x) − f*, (g_1(x))_+, ..., (g_m(x))_+ )

which is comprised of the inaccuracy in the objective and the violations of the functional con-
straints at x. In order to get a convenient scalar accuracy measure, it is reasonable to pass
from this vector to the relative accuracy

ε(p, x) = max{ (f(x) − f*) / (max_G f − f*), (g_1(x))_+ / (max_G g_1)_+, ..., (g_m(x))_+ / (max_G g_m)_+ };

to get the relative accuracy, we normalize each of the components of the vector of residuals
by its maximal, over all x ∈ G, value and take the maximum of the resulting quantities. It
is clear that the relative accuracy takes its values in [0, 1] and is zero if and only if x is an
optimal solution to p, as it should be for a reasonable accuracy measure.
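As an illustration (not part of the notes), the relative accuracy measure can be sketched in a few lines; the grid approximation of the maxima over G and all names below are mine:

```python
def relative_accuracy(f, gs, G_grid, f_star, x):
    """eps(p, x) = max( (f(x) - f*) / (max_G f - f*),
                        (g_j(x))_+ / (max_G g_j)_+ ),
    with the maxima over G approximated on a grid (illustration only)."""
    pos = lambda t: max(t, 0.0)
    terms = [(f(x) - f_star) / (max(f(t) for t in G_grid) - f_star)]
    for g in gs:
        terms.append(pos(g(x)) / pos(max(g(t) for t in G_grid)))
    return max(terms)

# toy instance on G = [0, 2]: minimize (x - 0.5)^2 subject to x - 1 <= 0;
# the optimal value is f* = 0, attained at the feasible point x = 0.5
G = [i / 1000 * 2 for i in range(1001)]
f = lambda x: (x - 0.5) ** 2
g = lambda x: x - 1.0
eps = relative_accuracy(f, [g], G, 0.0, 1.5)   # an infeasible test point
```
For the test point x = 1.5 the objective term is 1/2.25 ≈ 0.44 and the constraint term is 0.5/1 = 0.5, so the measure reports the worse of the two normalized violations, as intended.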
After we have agreed how to measure the accuracy of tentative approximate solutions, we
define the accuracy of a method M at a problem instance as the accuracy of the approximate
solution found by the method when applied to the instance:

Accuracy(M, p) = ε(p, x̄(p, M)).

The accuracy of the method on the family is its worst-case accuracy at the problems of the
family:

Accuracy(M) = sup_{p ∈ P_m(G)} Accuracy(M, p).

Last, the complexity of the family is defined in the manner we are already acquainted with,
namely, as the best complexity of a method solving all problems from the family to a given
accuracy:

A(ε) = min{ A(M) | Accuracy(M) ≤ ε }.
What we are about to do is to establish the following main result:

Theorem 3.1.1 The complexity A(ε) of the family P_m(G) of general-type convex problems
on an n-dimensional closed and bounded convex domain G satisfies the inequalities

n ln(1/ε) / (6 ln 2) − 1 ≤ A(ε) ≤ 2.181 n ln(1/ε).   (3.1.2)

Here the upper bound is valid for all ε < 1. The lower bound is valid for all ε < ε(G), where

ε(G) ≥ 1/n³.

1) that is, a set of rules for forming the sequential search points, the moment of termination and the result
as functions of the information on the problem; this information is comprised by the answers of the oracle
obtained up to the moment when a rule is to be applied
3.2. CUTTING PLANE SCHEME AND CENTER OF GRAVITY METHOD
Same as in the one-dimensional case, to prove the theorem means to establish the lower
complexity bound and to present a method associated with the upper complexity bound
(and thus optimal in complexity, up to an absolute constant factor, for small enough ε,
namely, for 0 < ε < ε(G)). We shall start with this latter task, i.e., with constructing an
optimal method
of minimizing convex continuous objectives over a given closed and bounded convex domain
G ⊂ R^n.
To solve such a problem, we can use the same basic idea as in the one-dimensional
bisection. Namely, choosing somehow the first search point x_1, we get from the oracle the
value f(x_1) and a subgradient f′(x_1) of f; thus, we obtain a linear function

G_1 = {x ∈ G | (x − x_1)^T f′(x_1) ≤ 0};

indeed, outside this new localizer our linear lower bound f_1 for the objective, and therefore
the objective itself, is greater than the value of the objective at x_1.
Now, our new localizer of the optimal set, i.e., G_1, is, same as G, a closed and bounded
convex domain, and we may iterate the process by choosing the second search point x_2 inside
G_1 and generating the next localizer

G_2 = {x ∈ G_1 | (x − x_2)^T f′(x_2) ≤ 0},

G_i = {x ∈ G_{i−1} | (x − x_i)^T f′(x_i) ≤ 0}

and loop.
The approximate solution found after i steps of the routine is, by
definition, the best point found so far, i.e., the point

x̄_i ∈ Argmin{ f(x_j) | j = 1, ..., i }.
A cutting plane method, i.e., a method associated with the scheme, is governed by the
rules for choosing the sequential search points in the localizers. In the one-dimensional case
there is, basically, only one natural possibility for this choice - the midpoint of the current
localizer (the localizer always is a segment). This choice results exactly in the bisection and
enforces the lengths of the localizers to go to 0 at the rate 2^{−i}, i being the step number. In
the multidimensional case the situation is not so simple. Of course, we would like to decrease
a reasonably defined size of the localizer at the highest possible rate; the problem is, however,
which size to choose and how to ensure its decrease. When choosing a size, we should take
care of two things:

(1) we should have a possibility to conclude that if the size of a current
localizer G_i is small, then the inaccuracy of the current approximate solution
also is small;

(2) we should be able to decrease at a certain rate the size of the sequential localizers
by appropriate choice of the search points in the localizers.
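In the one-dimensional case the scheme above reduces exactly to bisection: the search point is the midpoint of the current localizer, and the sign of the reported subgradient tells which half to keep, so the length of the localizer halves at every step. A minimal sketch (mine, not part of the notes):

```python
def cut_1d(f_prime, lo, hi, steps):
    """1-D cutting plane scheme: the search point is the midpoint of the
    current localizer [lo, hi]; the sign of a subgradient f'(x_i) shows
    which half to keep, so the length halves at every step."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if f_prime(mid) > 0:      # minimizers lie to the left of mid
            hi = mid
        else:
            lo = mid
    return lo, hi

# minimize f(x) = |x - 0.3| on [0, 1]; a subgradient is sign(x - 0.3)
lo, hi = cut_1d(lambda x: (x > 0.3) - (x < 0.3), 0.0, 1.0, 40)
```
After 40 steps the localizer has length 2^{-40} and still contains the minimizer, illustrating requirement (2) for the size "length of the segment".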
Let us start with a wide enough family of sizes which satisfy the first of our requirements.

Definition 3.2.1 A real-valued function Size(Q) defined on the family Q of all closed and
bounded convex subsets Q ⊂ R^n with a nonempty interior is called a size, if it possesses the
following properties:
(Size.1) Positivity: Size(Q) > 0 for any Q ∈ Q;
(Size.2) Monotonicity with respect to inclusion: Size(Q′) ≤ Size(Q) whenever Q′ ⊂ Q,
Q, Q′ ∈ Q;
(Size.3) Homogeneity with respect to homotheties: if Q ∈ Q, λ > 0, a ∈ R^n and

Q′ = a + λ(Q − a) = {a + λ(x − a) | x ∈ Q}

is the image of Q under the homothety with the center at the point a and the coefficient λ,
then

Size(Q′) = λ Size(Q).

Example 1. The diameter

Diam(Q) = max{ |x − x′| : x, x′ ∈ Q }

is a size;
Example 2. The average diameter

AvDiam(Q) = ( Vol_n(Q) )^{1/n}
y ∈ G_ε \ G_i.

Since G_ε clearly is contained in the domain of the problem and y does not belong to the i-th
localizer G_i, we have

f(y) > f(x̄_i);

indeed, at each step j, j ≤ i, of the method we remove from the previous localizer (which
initially is the whole domain G of the problem) only those points where the objective is
greater than at the current search point x_j and is therefore greater than at the best point x̄_i
found during the first i steps; since y was removed at one of these steps, we conclude that
f(y) > f(x̄_i), as claimed.
On the other hand, y ∈ G_ε, so that

y = (1 − ε)x* + εz

with some z ∈ G. From the convexity of f it follows that

f(y) ≤ (1 − ε)f(x*) + εf(z) ≤ (1 − ε) min_G f + ε max_G f,

whence

f(y) − min_G f ≤ ε (max_G f − min_G f).

As we know, f(y) > f(x̄_i), and we come to

f(x̄_i) − min_G f < ε (max_G f − min_G f).
Thus, we realize now what could be the sizes we are interested in, and the problem is
how to ensure a certain rate of their decrease along the sequence of localizers generated by
a cutting plane method. The difficulty here is that when choosing the next search point in
the current localizer, we do not know what the next cutting plane will be; the only thing
we know is that it will pass through the search point. Thus, we are interested in a choice
of the search point which guarantees a certain reasonable, not too close to 1, ratio of the size
of the new localizer to that of the previous localizer, independently of what the
cutting plane will be. Whether such a choice of the search point is possible depends on the size
we are using. For example, the diameter of a localizer, which is a very natural measure
of it and which was successfully used in the one-dimensional case, would be a very bad
choice in the multidimensional case. To see this, imagine that we are minimizing over
the unit square on the two-dimensional plane, and our objective in fact depends on the first
coordinate only. All our cutting planes (in our example they are lines) will be parallel to the
second coordinate axis, and the localizers will be stripes of a certain horizontal size (which we
may enforce to tend to 0) and of fixed vertical size (equal to 1). The diameters of the
localizers here decrease but do not tend to zero. Thus, the first of the particular
sizes we have looked at does not fit the second requirement. In contrast to this, the second
particular size, the average diameter AvDiam, is quite appropriate, due to the following
geometric fact which we present without proof:
Proposition 3.2.1 (Grunbaum) Let Q be a closed and bounded convex domain in R^n, let

x*(Q) = (1 / Vol_n(Q)) ∫_Q x dx

be the center of gravity of Q, and let Π be an affine hyperplane passing through the center
of gravity. Then the volumes of the parts Q′, Q″ into which Q is partitioned by Π satisfy the
inequality

Vol_n(Q′), Vol_n(Q″) ≤ { 1 − (n/(n+1))^n } Vol_n(Q) ≤ exp{−κ} Vol_n(Q),   κ = −ln(1 − 1/e) = 0.45867...;

in other words,

AvDiam(Q′), AvDiam(Q″) ≤ exp{−κ/n} AvDiam(Q).   (3.2.5)
Note that the proposition states exactly that the smallest (in terms of the volume) fraction
you can cut off an n-dimensional convex body by a hyperplane passing through the center
of gravity of the body is the fraction you get when the body is a simplex, the plane is
parallel to a facet of the simplex, and you cut off the part not containing that facet.
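The extremal simplex case is easy to check numerically (an illustration of mine, not from the notes): cutting the standard n-simplex through its center of gravity parallel to a facet leaves a sub-simplex shrunk by the factor n/(n+1), so the two volume fractions are (n/(n+1))^n and 1 − (n/(n+1))^n, and neither part is ever smaller than the fraction e^{−1}:

```python
import math

# the hyperplane through the centroid parallel to a facet splits the
# standard n-simplex into a sub-simplex of volume fraction (n/(n+1))**n
# and a remainder of fraction 1 - (n/(n+1))**n
for n in (2, 10, 100, 1000):
    small = (n / (n + 1.0)) ** n
    large = 1.0 - small
    assert small >= 1.0 / math.e            # no part below exp{-1}
    assert large <= 1.0 - 1.0 / math.e      # i.e. <= exp{-kappa}
```
As n grows, the fraction (n/(n+1))^n decreases monotonically to e^{−1} ≈ 0.368, showing that the bound exp{−κ} of the proposition is approached but never violated.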
Corollary 3.2.1 Consider the Center of Gravity method, i.e., the cutting plane method with
the search points being the centers of gravity of the corresponding localizers:

x_i = x*(G_{i−1}) ≡ (1 / Vol_n(G_{i−1})) ∫_{G_{i−1}} x dx;

consequently (see Lemma 3.2.4) the relative accuracy of the i-th approximate solution generated
by the method as applied to any problem p of minimizing a convex objective over G satisfies
the inequality

ε(p, x̄_i) ≤ exp{−κ i/n},  i ≥ 1.

In particular, to solve the problem within relative accuracy ε ∈ (0, 1) it suffices to perform
no more than

N = ⌈(n/κ) ln(1/ε)⌉ ≤ 2.181 n ln(1/ε)   (3.2.6)

steps of the method.
Remark 3.2.1 The Center of Gravity method for convex problems without functional con-
straints was invented in 1965 independently by A.Yu. Levin in the USSR and D.J. Newman in
the USA.
x_i ∈ int G_{i−1}

and subgradients

f′(x_i), g′_1(x_i), ..., g′_m(x_i)

of the objective and the constraints at x_i.
3.3. THE GENERAL CASE: PROBLEMS WITH FUNCTIONAL CONSTRAINTS
Proof. Let us first note that for any i and j one has

g*_{j,i} ≤ (max_G g_j)_+;   (3.3.9)

this is an immediate consequence of the fact that g_j^{(i)}(x) is a lower bound for g_j(x) (an
immediate consequence of the convexity of g_j), so that the maximum of this lower bound
over x ∈ G, i.e., g*_{j,i}, is at most the similar quantity for the constraint g_j itself.
Now, assume that the method terminates at a certain step i ≤ N. According to the
description of the method, this means that i is a productive step and 0 is a subgradient of the
objective at x_i; the latter means that x_i is a minimizer of f over the whole G, so that

f(x_i) ≤ f*.

(we have used (3.3.9)); these inequalities, combined with the definition of the relative accu-
racy, state exactly that x_i (i.e., the result obtained by the method in the case in question)
solves the problem within the relative accuracy ε, as claimed.
Now assume that the method does not terminate in the course of the first N steps. In view
of our premise, here we have

Size(G_N) < ε Size(G).   (3.3.10)

Let x* be an optimal solution to the problem, and let

G_ε = x* + ε(G − x*).

G_ε is a closed and bounded convex subset of G with a nonempty interior; due to the homogeneity
of Size with respect to homotheties, we have

Size(G_ε) = ε Size(G) > Size(G_N)

(the second inequality here is (3.3.10)). From this inequality and the monotonicity of the
size it follows that G_ε cannot be a subset of G_N: there exists

y ∈ G_ε \ G_N.

Now, y is a point of G (since the whole G_ε is contained in G), and since it does not belong
to G_N, it was cut off at some step of the method, i.e., there is an i ≤ N such that

y = (1 − ε)x* + εz   (3.3.12)

with certain z ∈ G.
Let us prove that in fact the i-th step is productive. Indeed, assume this is not the case. Then
from this latter inequality and (3.3.12), exactly as in the case of problems with no functional
constraints, it follows that
Now let us summarize our considerations. We have proved that in the case in question (i.e.,
when the method does not terminate during the first N steps and (3.3.8) is satisfied) there exists
a productive step i ≤ N such that (3.3.15) holds. Since the N-th approximate solution is the
best (in terms of the values of the objective) of the search points generated at the productive
steps with step numbers ≤ N, it follows that x̄_N is well-defined and

f(x̄_N) − f* ≤ f(x_i) − f* ≤ ε (max_G f − f*);   (3.3.16)

since x̄_N is, by construction, the search point generated at a certain productive step i′, we
have also

g_j(x̄_N) = g_j(x_{i′}) ≤ ε g*_{j,i′} ≤ ε (max_G g_j)_+,  j = 1, ..., m;

whence

ε(p, x̄_N) ≤ ε,

as claimed.
Combining Proposition 3.3.1 and the Grunbaum Theorem, we come to the Center of
Gravity method for problems with functional constraints. The method is obtained from our
general cutting plane scheme for constrained problems by the following specifications:

first, we use, as the current search point, the center of gravity of the previous localizer:

x_i = (1 / Vol_n(Q_{i−1})) ∫_{Q_{i−1}} x dx;

second, we terminate the method after the N-th step, N being given by the relation

N = ⌈2.181 n ln(1/ε)⌉.

With these specifications the average diameter of the i-th localizer at every step, due to the
Grunbaum Theorem, decreases with i at least as

exp{−(κ/n) i} AvDiam(G),  κ = 0.45867...,

and since 1/κ < 2.181, we come to

AvDiam(Q_N) < ε AvDiam(G);

this latter inequality, in view of Proposition 3.3.1, implies that the method does find an
ε-solution to every problem from the family, thus justifying the upper complexity bound we
are proving.
O(1) being a positive absolute constant and ε(G) being a certain positive quantity
depending on the geometry of G only (we shall see that this quantity measures how much G
differs from a parallelotope) 2).
The "spoiled" bound (3.4.17) (which is worse, by a logarithmic denominator, than the
estimate announced in the Theorem) is a more or less immediate consequence of our
one-dimensional considerations. Of course, it is sufficient to establish the lower bound

2) for the exact lower bound see: A.S. Nemirovskij and D.B. Yudin, Problem Complexity and
Method Efficiency in Optimization, Wiley-Interscience, Chichester etc.: John Wiley & Sons,
1983.
3.4. LOWER COMPLEXITY BOUND
for the case of problems without functional constraints, since the constrained ones form
a wider family (indeed, a problem without functional constraints can be thought of as
a problem with a given number m of trivial, identically zero functional constraints).
Thus, in what follows the number of constraints m is set to 0.
Let us start with the following simple observation. Let, for a given ε > 0 and a
convex objective f, the set G_ε(f) be comprised of all approximate solutions to f of
relative accuracy not worse than ε:
Assume that, for a given ε > 0, we are able to point out a finite set F of objectives
with the following two properties:
(I) no two different problems from F admit a common ε-solution:

G_ε(f) ∩ G_ε(f̃) = ∅

whenever f, f̃ ∈ F and f ≠ f̃;
(II) given in advance that the problem in question belongs to F, one can compress
an answer of the first order local oracle to a (log₂ K)-bit word. This means the
following. For a certain positive integer K one can indicate a function I(f, x) taking
values in a K-element set and a function R(i, x) such that
In other words, given in advance that the problem we are interested in belongs to F, a
method can imitate the first-order oracle O via another oracle I which returns log₂ K
bits of information rather than the infinitely many bits contained in the answer of the
first order oracle; given the compressed answer I(f, x), a method can substitute this
answer, along with x itself, into a universal (defined by F only) function in order to
get the complete first-order information on the problem.
E.g., consider the family F_n comprised of the 2^n convex functions

f(x) = max_{i=1,...,n} ε_i x_i,

where all ε_i = ±1. At every point x a function from the family admits a subgradient
of the form I(f, x) = ±e_i (the e_i are the orths of the axes), with i, same as the sign at
e_i, depending on f and x. Assume that the first order oracle in question, when asked
about f ∈ F_n, reports a subgradient of exactly this form. Since all functions from the
family are homogeneous, given x and I(f, x) we know not only a subgradient of f at
x, but also the value of f at the point:

f(x) = x^T I(f, x).
(*):
under assumptions (I) and (II) the ε-complexity of the family F, and
therefore of every larger family, is at least

log₂ |F| / log₂ K.

Indeed, let M be a method which solves all problems from F within accuracy ε in
no more than N steps. We may assume (since informationally this is the same) that
the method uses the oracle I rather than the first-order oracle. Now, the behavior of
the method is uniquely defined by the sequence of answers of I in the course of N steps;
therefore there are at most K^N different sequences of answers and, consequently, no
more than K^N different trajectories of M. In particular, the set X formed by the
results produced by M as applied to the problems from F is comprised of at most K^N
points. On the other hand, since M solves each of the |F| problems of the family within
accuracy ε, and no two different problems from the family admit a common ε-solution,
X should contain at least |F| points. Thus,

K^N ≥ |F|,

as claimed.
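The compressed oracle for the family F_n above is easy to make concrete (an illustrative sketch of mine; the function names are not from the notes): I returns one of the 2n answers ±e_i, and R recovers from that answer alone both a subgradient and, by homogeneity, the value of f at x.

```python
import random

def oracle_I(eps, x):
    """Compressed oracle for f(x) = max_i eps_i * x_i: report the index i
    and the sign attaining the max - one of 2n possible answers."""
    i = max(range(len(x)), key=lambda j: eps[j] * x[j])
    return i, eps[i]

def reconstruct_R(answer, x):
    """Recover the full first-order information from the compressed answer:
    the subgradient is sigma * e_i and, by homogeneity, f(x) = sigma * x_i."""
    i, sigma = answer
    g = [0.0] * len(x)
    g[i] = sigma
    return sigma * x[i], g

random.seed(0)
n = 5
eps = [random.choice((-1, 1)) for _ in range(n)]
x = [random.uniform(-1, 1) for _ in range(n)]
val, g = reconstruct_R(oracle_I(eps, x), x)
```
The recovered value coincides with max_i ε_i x_i, so a method really needs only log₂(2n) bits per query for this family, which is the fact the counting argument exploits.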
As an immediate consequence of what was said, we come to the following result:

the complexity of minimizing a convex function over an n-dimensional parallelotope
G within relative accuracy ε < 1/2 is at least n/(1 + log₂ n).

Indeed, all our problem classes and complexity-related notions are affine invariant, so
that we may always assume the parallelotope G mentioned in the assertion to be the
unit cube

{x ∈ R^n | |x|_∞ ≡ max_i |x_i| ≤ 1}.

For any ε < 1/2 the aforementioned family

F_n = {f(x) = max_i ε_i x_i}

clearly possesses property (I) and, as we have seen, at least for a certain first-order oracle,
possesses also property (II) with K = 2n. We immediately conclude that the complex-
ity of finding an ε-minimizer, ε < 1/2, of a convex function over an n-dimensional
parallelotope is, at least for some first order oracle, no less than

log₂ |F_n| / log₂(2n) = n / (1 + log₂ n),

as claimed. In fact, of course, the complexity is at least n for any first order oracle,
but the proof of this latter statement requires more detailed considerations.
Now let us use the above scheme to derive the lower bound (3.4.17). Recall that
when studying the one-dimensional case, we introduced a certain family of univari-
ate convex functions which was constructed as follows. The functions of the family form a tree,
with the root (generation 0) being the function
when subject to the left and to the right modifications, the function produces two
children, let them be called f_r and f_l; each of these functions, in turn, may be
subject to the right and to the left modification, producing two new functions, so that
at the level of grandchildren there are four functions f_rr, f_rl, f_lr, f_ll, and so on. Now,
each function f of a generation k > 0 possesses its own active segment δ(f)
of the length 2^{1−2k}, and on this segment the function is modulus-like:
c(f) being the midpoint of δ(f). Note that a(k) depends only on the generation f
belongs to, not on the particular representative of the generation; note also that the
active segments of the 2^k functions belonging to generation k are mutually disjoint
and that a function from our "population" coincides with its parent outside the
active segment of the parent. In what follows it is convenient also to define the active
segment of the root function f^root as the whole axis.
Now, let F_k be the set of the 2^k functions comprising the k-th generation of our population.
Let us demonstrate that any first order oracle, restricted onto this family of functions,
admits compression to log₂(2k) bits. Indeed, it is clear from our construction that
in order to restore, given an x, the value f(x) and a subgradient f′(x), it suffices to trace the
path of predecessors of f - its father, its grandfather, ... - and to find the youngest
of them, let it be f̄, such that x belongs to the active segment of f̄ (let us call this
predecessor the active at x predecessor of f). The active at x predecessor of f does
exist, since the active segment of the common predecessor f^root is the whole axis.
Now, f is obtained from f̄ by a number of modifications; the first of them possibly
varies f̄ in a neighborhood of x (x is in the active segment of f̄), but the subsequent
modifications do not, since x is outside the corresponding active segments. Thus, in a
neighborhood of x, f coincides with the function f⁺ - the modification of f̄ which leads
from f̄ to f. Now, to identify the local behavior of f (i.e., that of f⁺) at x, it suffices
to indicate the "age" of f̄, i.e., the number of the generation it belongs to, and the
type of the modification - left or right - which transforms f̄ into f⁺.
Indeed, given x and the age k̄ of f̄, we may uniquely identify the active segment of
f̄ (since the segments for different members of the same generation k̄ ≥ 1 have no
common points); given the age of f̄, its active segment and the type of the modification
leading from f̄ to f⁺, we, of course, know f⁺ in a neighborhood of the active segment of
f̄ and consequently in a neighborhood of x.
Thus, to identify the behavior of f at x and therefore to imitate the answer of
any given local oracle on the input x, it suffices to know the age k̄ of the active at
x predecessor of f and the type - left or right - of the modification which moves the
predecessor towards f, i.e., to know a point from a certain (2k)-element set, as claimed.
Now let us act as follows. Let us start with the case when our domain G is a parallelotope; due to the affine invariance of our considerations, we may assume G to be the unit n-dimensional cube:

G = {x ∈ R^n | |x|_∞ ≤ 1}.

Consider the family F^k of the objectives

f_{i1,...,in}(x) = max{f_{i1}(x_1), ..., f_{in}(x_n)}, f_{is} ∈ F_k, s = 1, ..., n.
52 LECTURE 3. METHODS WITH LINEAR CONVERGENCE
This family contains |F_k|^n = 2^{nk} objectives, all of them clearly convex and Lipschitz continuous with constant 1 with respect to the uniform norm |·|_∞. Let us demonstrate that there exists a first-order oracle such that the family, equipped with this oracle, possesses properties (I) and (II), where one should set

ε = 2^{-6k}. (3.4.18)

Indeed, a function f_{i1,...,in} attains its minimum a(k) exactly at the point x_{i1,...,in} with the coordinates comprised of the minimizers of the f_{is}(x_s). It is clear that within the cube C (i.e., within the direct product of the active segments of the f_{is}, s = 1, ..., n) the function is simply

a(k) + 2^{-3k} |x - x_{i1,...,in}|_∞,

and therefore outside this cube one has

f_{i1,...,in}(x) ≥ a(k) + 2^{-5k}.

Taking into account that all our functions f_{i1,...,in}, being restricted onto the unit cube G, take their values in [0, 1], so that for these functions the absolute inaccuracy in terms of the objective is majorated by the relative accuracy, we come to (I). It remains to note that the cubes C corresponding to various functions from the family are mutually disjoint (since the active segments of different elements of the generation F_k are disjoint). Thus, (I) is verified.
In order to establish (II), let us note that to find the value and a subgradient of f_{i1,...,in} at a point x it suffices to know the value and a subgradient at x_s of any function f_{is} which is active at x, i.e., majorates all other functions participating in the expression for f_{i1,...,in}. In turn, as we know, to indicate the value and a subgradient of f_{is} it suffices to report a point from a (2k)-element set. Thus, one can imitate a certain (not any) first-order oracle for the family F^k via a compressed oracle reporting a log2(2nk)-bit word (it suffices to indicate the number s, 1 ≤ s ≤ n, of a component f_{is} active at x and a point of a (2k)-element set identifying f_{is} at x_s).
Thus, we may imitate a certain first-order oracle for the family F^k (comprised of 2^{nk} functions), given a compressed oracle with K = 2nk; it follows from (*) that the ε-complexity of F^k for ε = 2^{-6k} (see (3.4.18)) is at least

log2(2^{nk}) / log2(2nk) = nk / log2(2nk),

i.e.,

A(ε) ≥ n log2(1/ε) / (6 log2((n/3) log2(1/ε))), ε = 2^{-6k}, k = 1, 2, ...;
3.4. LOWER COMPLEXITY BOUND 53
The variation of such a function on the domain G is at most the diameter of G with respect to the uniform norm; the latter diameter, due to (3.4.19), is at most 2β(G). It follows that any method which solves all problems from F^k within relative accuracy 2^{-6k-1}/β(G) solves all these problems within absolute accuracy 2^{-6k} as well; thus, the complexity of minimizing a convex function over G within relative accuracy 2^{-6k-1}/β(G) is at least nk/log2(2nk):

A(2^{-6k-1}/β(G)) ≥ nk / log2(2nk), k = 1, 2, ...
This lower bound immediately implies that

A(ε) ≥ O(1) n log2(1/(ε β(G))) / log2(n log2(1/(ε β(G)))), ε β(G) < 1/128,
whence, in turn,

A(ε) ≥ O(1) n ln(1/ε) / ln(n ln(1/ε)), ε ≤ 1/(128 β²(G)) ( ≥ 1/(128 n³) );

this is exactly what is required in (3.4.17).
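The passage from the counting bound in k and n to the form required in (3.4.17) is just the substitution k = (1/6) log2(1/ε); a sketch of the computation (with absolute constants absorbed into O(1)):

```latex
% with \varepsilon = 2^{-6k}, i.e. k = \tfrac{1}{6}\log_2\tfrac{1}{\varepsilon},
% the counting bound A(\varepsilon) \ge nk/\log_2(2nk) becomes
\frac{nk}{\log_2(2nk)}
 \;=\; \frac{\tfrac{n}{6}\log_2\tfrac{1}{\varepsilon}}
            {\log_2\!\Big(\tfrac{n}{3}\log_2\tfrac{1}{\varepsilon}\Big)}
 \;\ge\; O(1)\,\frac{n\ln\tfrac{1}{\varepsilon}}{\ln\!\big(n\ln\tfrac{1}{\varepsilon}\big)}.
```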
Note that our reasoning results in a lower bound which is worse than the one indicated in the Theorem not only by the logarithmic denominator, but also due to the fact that this is a lower bound for a particular first-order oracle, not for an arbitrary one. In fact both these shortcomings, i.e., the presence of the denominator and the oracle-dependent type of the lower bound, may be overcome by a more careful reasoning, but we are not going to reproduce it here.
At this point one could ask: why should we add to an actual localizer something which for sure does not contain optimal solutions? The answer is: acting in this manner, we may stabilize the geometry of our localizers and make them convenient for numerical implementation of the search rules. This is the idea underlying the Ellipsoid method we are about to present.
3.5.1 Ellipsoids
Recall that an ellipsoid in R^n is defined as a level set of a nondegenerate convex quadratic form, i.e., as a set of the type

E = {x ∈ R^n | (x - c)^T A (x - c) ≤ 1}, (3.5.20)

where A is an n × n symmetric positive definite matrix and c ∈ R^n is the center of the ellipsoid. An equivalent description is

E = {x = Bu + c | u^T u ≤ 1}, (3.5.21)

where B is an n × n nonsingular matrix. It is immediately seen that one can pass from representation (3.5.21) to (3.5.20) by setting

A = (B^T)^{-1} B^{-1}; (3.5.22)

since any symmetric positive definite matrix A admits a representation of the type (3.5.22) (e.g., with B = A^{-1/2}), the above definitions indeed are equivalent.
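A quick numerical check of this equivalence on a made-up 2x2 example: boundary points x = Bu + c, |u| = 1, must satisfy (x - c)^T A (x - c) = 1 for A = (B B^T)^{-1}.

```python
# Check numerically that the two ellipsoid representations agree for a
# made-up 2x2 example: E = {x = Bu + c, u^T u <= 1} equals
# {x : (x - c)^T A (x - c) <= 1} with A = (B^T)^{-1} B^{-1} = (B B^T)^{-1}.
import math

B = [[2.0, 0.0], [1.0, 1.0]]   # nonsingular
c = [0.5, -1.0]

# A = (B B^T)^{-1}, via the explicit 2x2 inverse formula
BBt = [[B[0][0]*B[0][0] + B[0][1]*B[0][1], B[0][0]*B[1][0] + B[0][1]*B[1][1]],
       [B[1][0]*B[0][0] + B[1][1]*B[0][1], B[1][0]*B[1][0] + B[1][1]*B[1][1]]]
det = BBt[0][0]*BBt[1][1] - BBt[0][1]*BBt[1][0]
A = [[ BBt[1][1]/det, -BBt[0][1]/det],
     [-BBt[1][0]/det,  BBt[0][0]/det]]

# boundary points x = Bu + c, u on the unit circle, must give quadratic form 1
max_err = 0.0
for k in range(360):
    t = 2.0*math.pi*k/360.0
    u = (math.cos(t), math.sin(t))
    x = (B[0][0]*u[0] + B[0][1]*u[1] + c[0],
         B[1][0]*u[0] + B[1][1]*u[1] + c[1])
    d = (x[0] - c[0], x[1] - c[1])
    q = d[0]*(A[0][0]*d[0] + A[0][1]*d[1]) + d[1]*(A[1][0]*d[0] + A[1][1]*d[1])
    max_err = max(max_err, abs(q - 1.0))
```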
From (3.5.21) it follows immediately that

Vol(E) = |Det B| Vol(V),

V being the unit Euclidean ball in R^n. For a solid Q ⊂ R^n, let us set

EllOut(Q) = inf{ Vol^{1/n}(W) | W is an ellipsoid containing Q }.

It is immediately seen that the introduced function is a size, i.e., it is positive, monotone with respect to inclusions and homogeneous with respect to similarity transformations of homogeneity degree 1.
We need the following simple lemma.

Lemma 3.5.1 Let n > 1, let

W = {x = Bu + c | u^T u ≤ 1}

be an ellipsoid in R^n, and let q ∈ R^n be nonzero. Then the half-ellipsoid

Ŵ = {x ∈ W | (x - c)^T q ≤ 0}

is contained in the ellipsoid

W+ = {x = B+ u + c+ | u^T u ≤ 1},

B+ = α(n) B - δ(n) (Bp) p^T, c+ = c - (1/(n+1)) Bp,

where

α(n) = (n² / (n² - 1))^{1/2}, δ(n) = α(n) (1 - ((n - 1)/(n + 1))^{1/2}), p = B^T q / (q^T B B^T q)^{1/2};

moreover,

Vol(W+) / Vol(W) = α^{n-1}(n) n/(n+1) ≤ exp{-1/(2(n+1))} < 1.
To prove the lemma, it suffices to reduce the situation to the similar one with W being the unit Euclidean ball V; indeed, since W is the image of V under the affine transformation u ↦ Bu + c, the half-ellipsoid Ŵ is the image, under this transformation, of the half-ball

{u ∈ V | (B^T q)^T u ≤ 0} = {u ∈ V | p^T u ≤ 0}.

Now, it is quite straightforward to verify that a half-ball indeed can be covered by an ellipsoid V+ with the volume being the required fraction of the volume of V; to verify this was one of the exercises of the previous lecture (cf. Exercise 2.3.8), and in the formulation of the exercise you were given the explicit representation of V+. It remains to note that the image of V+ under the affine transformation which maps the unit ball V onto the ellipsoid W is an ellipsoid which clearly contains the half-ellipsoid Ŵ and is in the same ratio of volumes with respect to W as V+ is with respect to the unit ball V (since the ratio of volumes remains invariant under affine transformations). The ellipsoid W+ given in the formulation of the lemma is nothing but the image of V+ under our affine transformation.
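Since formulas of this kind are easy to garble, here is a numerical verification of the update for n = 2 (B, c, q are arbitrary made-up data): every extreme point of the half-ellipsoid must land inside W+, and the volume shrinks exactly by the factor α^{n-1}(n)·n/(n+1) < 1.

```python
# Numerical sanity check, for n = 2, of the ellipsoid update of the lemma:
# the half-ellipsoid cut off by q lands inside W+, and the volume ratio is
# alpha(n)^(n-1) * n/(n+1).  B, c, q are arbitrary illustrative data.
import math

n = 2
alpha = math.sqrt(n*n/(n*n - 1.0))
delta = alpha*(1.0 - math.sqrt((n - 1.0)/(n + 1.0)))

B = [[2.0, 0.0], [1.0, 1.0]]
c = [0.5, -1.0]
q = [1.0, 2.0]

Btq = [B[0][0]*q[0] + B[1][0]*q[1], B[0][1]*q[0] + B[1][1]*q[1]]  # B^T q
nrm = math.hypot(Btq[0], Btq[1])
p = [Btq[0]/nrm, Btq[1]/nrm]
Bp = [B[0][0]*p[0] + B[0][1]*p[1], B[1][0]*p[0] + B[1][1]*p[1]]

Bplus = [[alpha*B[i][j] - delta*Bp[i]*p[j] for j in range(2)] for i in range(2)]
cplus = [c[i] - Bp[i]/(n + 1.0) for i in range(2)]

detp = Bplus[0][0]*Bplus[1][1] - Bplus[0][1]*Bplus[1][0]
Binv = [[ Bplus[1][1]/detp, -Bplus[0][1]/detp],
        [-Bplus[1][0]/detp,  Bplus[0][0]/detp]]      # Bplus^{-1}

# extreme points of the half-ellipsoid {Bu + c : |u| <= 1, p^T u <= 0} are
# the arc u(s) = cos(s) e - sin(s) p, s in [0, pi], e orthogonal to p
e = [-p[1], p[0]]
worst = 0.0
for k in range(2001):
    s = math.pi*k/2000.0
    u = [math.cos(s)*e[i] - math.sin(s)*p[i] for i in range(2)]
    x = [B[i][0]*u[0] + B[i][1]*u[1] + c[i] for i in range(2)]
    w = [x[i] - cplus[i] for i in range(2)]
    v = [Binv[i][0]*w[0] + Binv[i][1]*w[1] for i in range(2)]
    worst = max(worst, v[0]*v[0] + v[1]*v[1])        # must stay <= 1

detB = B[0][0]*B[1][1] - B[0][1]*B[1][0]
vol_ratio = abs(detp/detB)
```

The check `worst` approaches 1 because W+ touches the half-ellipsoid along the rim of the cut and at the deepest point, exactly as in the half-ball picture.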
3.5. THE ELLIPSOID METHOD 57
we remove from the previous localizer G_{i-1} only those points which do not belong to the domain of the problem, so that Ĝ_i indeed can be thought of as a new intermediate localizer.
Thus, we come to the Ellipsoid method, due to Nemirovski and Yudin (1979), which, as applied to a convex programming problem

(p): minimize f(x) s.t. g_j(x) ≤ 0, j = 1, ..., m, x ∈ G,

works as follows. We choose an ellipsoid G_0 ⊇ G and set

ϑ = EllOut(G) / EllOut(G_0).

At step i:
1) Check whether x_i ∈ int G. If it is not the case, call step i non-productive, find a nonzero e_i such that

(x - x_i)^T e_i ≤ 0 for all x ∈ G,

and go to 3).
2) If x_i ∈ int G, check whether the inequalities (3.5.26) are satisfied. If one of them, say the k-th, is violated, call step i non-productive, set

e_i = g_k'(x_i),

and go to 3).
If all inequalities (3.5.26) are satisfied, call the i-th step productive and set

e_i = f'(x_i).

steps and solves (p) within relative accuracy ε: the result x̄ is well defined and

ε(p, x̄) ≤ ε.

Given the direction e_i defining the i-th cut, it takes O(n²) arithmetic operations to update (B_{i-1}, x_i) into (B_i, x_{i+1}).
Proof. The complexity bound is an immediate corollary of the termination test (3.5.28). To prove that the method solves (p) within relative accuracy ε, note that from Lemma 3.5.1 it follows that

EllOut(G_i) ≤ κ^i(n) EllOut(G_0) ≤ κ^i(n) ϑ^{-1} EllOut(G),

κ(n) being the per-step ratio EllOut(W+)/EllOut(W) from Lemma 3.5.1 (the latter inequality comes from the origin of ϑ). It follows that if the method terminates at a step N due to (3.5.28), then

EllOut(G_N) ≤ ε EllOut(G).

Due to this latter inequality, we immediately obtain the accuracy estimate as a corollary of our general convergence statement on the cutting plane scheme (Proposition 3.3.1). Although the latter statement was formulated and proved for the basic cutting plane scheme rather than for the spoiled one, the reasoning can be literally repeated in the case of the spoiled scheme.
Note that the complexity of the Ellipsoid method depends on ϑ, i.e., on how good the initial ellipsoidal localizer we start with is. Theoretically, we could choose as G_0 the ellipsoid of the smallest volume containing the domain G of the problem, thus ensuring ϑ = 1; for simple domains, like a box, a simplex or a Euclidean ball, we may start with this optimal ellipsoid not only in theory, but also in practice. Even with this good start, the Ellipsoid method has O(n) times worse theoretical complexity than the Center of Gravity method (here it takes O(n²) steps to improve the inaccuracy by an absolute constant factor). As a compensation for this theoretical drawback, the Ellipsoid method is not only of theoretical interest: it can be used for practical computations as well. Indeed, if G is a simple domain from the above list, then all actions prescribed by rules 1)-3) cost only O(n(m+n)) arithmetic operations. Here the term mn comes from the necessity to check whether the current search point is in the interior of G and, if it is not the case, to separate the point from G, and also from the necessity to maximize the linear approximations of the constraints over G; the term n² reflects the complexity of updating B_{i-1} ↦ B_i after e_i is found. Thus, the arithmetic cost of a step is quite moderate, incomparable to the tremendous one for the Center of Gravity method.
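Rules 1)-3) are easy to turn into code when G itself is the unit Euclidean ball and there are no functional constraints (so rule 2) never fires). The sketch below uses the update of Lemma 3.5.1 and, as a simplification, a fixed step budget instead of the termination test (3.5.28); the objective is a made-up smooth quadratic, not part of the lecture.

```python
# Minimal sketch of the Ellipsoid method: minimize a convex f over the unit
# Euclidean ball G, with central cuts and the O(n^2) update of Lemma 3.5.1.
import numpy as np

def ellipsoid_min(f, grad_f, n, steps):
    B, c = np.eye(n), np.zeros(n)                    # E_0 = G_0 = unit ball
    alpha = np.sqrt(n*n/(n*n - 1.0))                 # n >= 2 assumed
    delta = alpha*(1.0 - np.sqrt((n - 1.0)/(n + 1.0)))
    best, xbest = np.inf, c.copy()
    for _ in range(steps):
        if c @ c >= 1.0:                 # center outside int G: non-productive
            e = c.copy()                 # separator: (x - c)^T c <= 0 on G
        else:                            # productive step: cut by a subgradient
            fc = f(c)
            if fc < best:
                best, xbest = fc, c.copy()
            e = grad_f(c)
            if np.linalg.norm(e) < 1e-12:
                return c, fc             # exact minimizer hit
        p = B.T @ e
        p = p/np.linalg.norm(p)
        Bp = B @ p                       # O(n^2) update of (B, c)
        B = alpha*B - delta*np.outer(Bp, p)
        c = c - Bp/(n + 1.0)
    return xbest, best

# illustrative data (assumed): smooth convex quadratic with optimum inside G
xstar = np.array([0.3, -0.2, 0.1])
f = lambda x: float((x - xstar) @ (x - xstar))
grad = lambda x: 2.0*(x - xstar)
xb, fb = ellipsoid_min(f, grad, 3, 500)
```

Each iteration costs one oracle call plus O(n²) arithmetic, matching the count in the text.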
3.6 Exercises
Up to now we have applied the Cutting Plane scheme to convex optimization problems in the standard form

minimize f(x) s.t. g_j(x) ≤ 0, j = 1, ..., m, x ∈ G

(G is a solid, i.e., a closed and bounded convex set with a nonempty interior; f and the g_j are convex and continuous on G). In fact the scheme has a wider field of applications. Namely, consider a generic problem as follows:

(f): minimize f(x) s.t. x ∈ G_f;

here G_f is a certain (specific for the problem instance) solid and f is a function taking values in the extended real axis R ∪ {-∞} ∪ {+∞} and finite on the interior of G_f.
Let us make the following assumption on our abilities to get information on (f):
(A): we have access to an oracle O_A which, given on input a point x ∈ R^n, informs us whether x belongs to the interior of G_f; if it is not the case, the oracle reports a nonzero functional e_x which separates x and G_f, i.e., is such that

(y - x)^T e_x ≤ 0, y ∈ G_f;

if x ∈ int G_f, then the oracle reports f(x) and a functional e_x such that the level set

{y ∈ G_f | f(y) < f(x)}

is contained in the open half-space {y | (y - x)^T e_x < 0}.
In the meantime we shall see that under assumption (A) we can efficiently solve (f) by cutting plane methods; but before coming to this main issue let me indicate some interesting examples.
3.6. EXERCISES 61
Exercise 3.6.1#+ Prove the latter statement.
Thus, in the case in question the sets {y ∈ G | f(y) < f(x)}, {y ∈ G | g_j(y) < g_j(x)} are contained in the half-spaces {y | (y - x)^T f'(x) < 0}, {y | (y - x)^T g_j'(x) < 0}, respectively. It follows that in the case in question, same as in the convex case, given access to a first-order oracle for (3.6.29), we can imitate the oracle O_A required by (A) for the induced problem (3.6.30).
Example 3. Linear-fractional programming. Consider the problem

minimize f(x) = max_{α∈I} a_α(x)/b_α(x) s.t. g_j(x) ≤ 0, j = 1, ..., m, b_α(x) > 0, α ∈ I, x ∈ G; (3.6.31)

here G is a solid in R^n, I is a finite set of indices, the a_α and b_α are affine functions, and the g_j are, say, convex and continuous on G. The problem is, as we see, to minimize the maximum of ratios of given linear forms over the convex set defined by the inclusion x ∈ G, the convex functional constraints g_j(x) ≤ 0 and the additional linear constraints expressing positivity of the denominators.
Let us set

G_f = {x ∈ G | g_j(x) ≤ 0, j = 1, ..., m, b_α(x) ≥ 0, α ∈ I};

we assume that the (closed and convex) set G_f possesses a nonempty interior and that the functions g_j are negative, while the b_α are positive, on the interior of G_f.
By setting

f(x) = max_{α∈I} {a_α(x)/b_α(x)}, x ∈ int G_f; +∞ otherwise, (3.6.32)
we can rewrite our problem as

minimize f(x) s.t. x ∈ G_f. (3.6.33)
Now, assume that we are given G in advance and have access to a first-order oracle O which, given on input a point x ∈ int G, reports the values and subgradients of the functional constraints at x, as well as all the a_α(·), b_α(·).
Under these assumptions we can imitate for (3.6.33) the oracle O_A required by assumption (A). Indeed, given x ∈ R^n, we first check whether x ∈ int G, and if it is not the case, find a nonzero functional e_x which separates x and G (we can do it, since G is known in advance); of course, this functional also separates x and G_f, as required in (A). Now, if x ∈ int G, we ask the first-order oracle O about the values and subgradients of the g_j, a_α and b_α at x and check whether all g_j are negative at x and all b_α(x) are positive. If it is not the case and, say, g_k(x) ≥ 0, we claim that x ∉ int G_f and set e_x equal to g_k'(x); this functional is nonzero (since otherwise g_k would attain a nonnegative minimum at x, which contradicts our assumptions about the problem) and clearly separates x and G_f (due to the convexity of g_k). Similarly, if one of the denominators b_α is nonpositive at x, we claim that x ∉ int G_f and set

e_x = -b_α';
(for t = 0 the right hand side should be replaced by a positive vector representing the starting amount of goods). Now, in the von Neumann Economic Growth problem it is asked what is the largest growth factor γ for which there exists a semi-stationary growth trajectory, i.e., a trajectory of the type x_t = γ^t x_0. In other words, we should solve the problem

maximize γ s.t. γ A x ≤ B x for some nonzero x ≥ 0.

Without loss of generality, x in the above formulation can be taken as a point from the standard simplex

G = {x ∈ R^n | x ≥ 0, Σ_j x_j = 1}

(which should be regarded as a solid in its affine hull). It is clearly seen that the problem in question can be rewritten as follows:

minimize max_{i=1,...,m} (Σ_j a_ij x_j) / (Σ_j b_ij x_j) s.t. x ∈ G; (3.6.34)
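Problem (3.6.34) is small enough to brute-force for a made-up 2-goods, 2-processes economy (the matrices A, B below are assumed data, and a grid on the 1-dimensional simplex stands in for a genuine cutting plane method): the von Neumann factor is γ* = 1/min_x max_i (Ax)_i/(Bx)_i, and at the minimizer γ* A x* ≤ B x* indeed holds.

```python
# Brute-force sketch of (3.6.34) for an assumed 2x2 economy: A is the
# consumption matrix, B the production matrix (made-up data).  On the
# simplex x = (t, 1-t) we minimize phi(x) = max_i (Ax)_i/(Bx)_i; the von
# Neumann growth factor is gamma* = 1/phi(x*).
A = [[1.0, 2.0], [3.0, 1.0]]
B = [[4.0, 1.0], [1.0, 5.0]]

def mat_vec(M, x):
    return [M[0][0]*x[0] + M[0][1]*x[1], M[1][0]*x[0] + M[1][1]*x[1]]

def phi(x):
    Ax, Bx = mat_vec(A, x), mat_vec(B, x)
    return max(Ax[i]/Bx[i] for i in range(2))   # all Bx > 0 on the simplex here

grid = [(k/10000.0, 1.0 - k/10000.0) for k in range(10001)]
xstar = min(grid, key=phi)
gamma_star = 1.0/phi(xstar)

# the semi-stationary trajectory x_t = gamma*^t x* is feasible:
Ax, Bx = mat_vec(A, xstar), mat_vec(B, xstar)
feasible = all(gamma_star*Ax[i] <= Bx[i] + 1e-9 for i in range(2))
```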
It is worth noting that the von Neumann growth factor γ* describes, in a sense, the highest rate of growth of our economy (this is far from being clear in advance: why is the proportional growth the best one? Why could we not get something better along an oscillating trajectory?) One exact statement on the optimality of the von Neumann semi-stationary trajectory (or, better to say, the simplest of these statements) is as follows:

Proposition 3.6.1 Let {x_t}_{t=0}^T be a trajectory of our economy, so that the x_t are nonnegative, x_0 ≠ 0 and

A x_{t+1} ≤ B x_t, t = 0, 1, ..., T - 1.

Assume that x_T ≥ γ^T x_0 for some positive γ (so that our trajectory results, for some T, in growth of the amount of goods γ^T times in T years). Then γ ≤ γ*.
At the same time, the semi-stationary trajectory

x_t = (γ*)^t x_0,

x_0 being the x-component of an optimal solution to (3.6.34), does ensure growth by factor (γ*)^T each T years.
Exercise 3.6.2 Prove Proposition 3.6.1.
The next example is the Generalized Eigenvalue problem: given two symmetric m × m matrices A(x), B(x) affinely depending on x ∈ R^n (which means that the entries of the matrices are affine functions of x), minimize, with respect to x, the Rayleigh ratio

max_{ζ ∈ R^m \ {0}} (ζ^T A(x) ζ) / (ζ^T B(x) ζ)

of the quadratic forms associated with these matrices under the constraint that B(x) is positive definite (and, possibly, under additional convex constraints on x). In other words, we are looking for a pair (x, λ) satisfying the constraints

B(x) is positive definite, λ B(x) - A(x) is positive semidefinite

and the additional constraints

g_j(x) ≤ 0, j = 1, ..., m, x ∈ G ⊂ R^n

(the g_j are convex and continuous on the solid G), and we are interested in the pair of this type with the smallest possible λ.
The Generalized Eigenvalue problem (the origin of the name is that in the particular case when B(x) ≡ I is the unit matrix we come to the problem of minimizing, with respect to x, the largest eigenvalue of A(x)) can be immediately written down as a semidefinite fractional problem

minimize max_{ζ∈Ξ} (ζ^T A(x) ζ) / (ζ^T B(x) ζ) s.t. g_j(x) ≤ 0, j = 1, ..., m, ζ^T B(x) ζ > 0, ζ ∈ Ξ, x ∈ G; (3.6.35)

here Ξ is the unit sphere in R^m. Note that the numerators and denominators in our objective fractions are affine in x, as required by our general assumptions on fractional problems.
Assume that we are given G in advance, same as the data identifying the affine in x matrix-valued functions A(x) and B(x), and let us have access to a first-order oracle providing us with local information on the general type convex constraints g_j. Then it is not difficult to decide, for a given x, whether B(x) is positive definite, and if it is not the case, to find ζ such that the denominator ζ^T B(x) ζ is nonpositive at x. Indeed, it suffices to compute B(x) and to subject the matrix to Cholesky factorization (I hope you know what it means). If the factorization is successful, we find a lower-triangular matrix Q with nonzero diagonal such that

B(x) = Q Q^T,

and B(x) is positive definite; if the factorization fails, then in the course of it we automatically meet a unit vector ζ which proves that B(x) is not positive definite, i.e., is such that ζ^T B(x) ζ ≤ 0. Now, if B(x), for a given x, is positive definite, then to find the ζ associated with the largest at x of the fractions

(ζ^T A(x) ζ) / (ζ^T B(x) ζ)

is the same as to find the eigenvector of the (symmetric) matrix Q^{-1} A(x) (Q^T)^{-1} associated with the largest eigenvalue of this matrix, Q being the above Cholesky factor of B(x) (why?); finding this eigenvector is a standard Linear Algebra routine.
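The Cholesky-based test just described is easy to run numerically; the sketch below (with made-up symmetric matrices standing for the values A(x), B(x) at a fixed x) checks that the largest Rayleigh ratio is attained by ζ = (Q^T)^{-1} η, η the top eigenvector of Q^{-1} A(x) (Q^T)^{-1}.

```python
# Sketch of the eigenvalue test above on assumed data: A, Bmat stand for the
# symmetric matrices A(x), B(x) at a fixed x.
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 1.0]])
Bmat = np.array([[4.0, 1.0, 0.0],
                 [1.0, 3.0, 1.0],
                 [0.0, 1.0, 2.0]])

try:
    Q = np.linalg.cholesky(Bmat)          # B(x) = Q Q^T, Q lower-triangular
    pos_def = True
except np.linalg.LinAlgError:
    pos_def = False                       # B(x) not positive definite; the
                                          # failure would yield a bad direction

if pos_def:
    M = np.linalg.inv(Q) @ A @ np.linalg.inv(Q.T)   # symmetric since A is
    lam, V = np.linalg.eigh(M)
    lam_max = float(lam[-1])
    zeta = np.linalg.solve(Q.T, V[:, -1])           # zeta = (Q^T)^{-1} eta
    rayleigh = float(zeta @ A @ zeta)/float(zeta @ Bmat @ zeta)
```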
Thus, any technique which allows us to solve (f) under assumption (A) immediately implies a numerical method for solving the Generalized Eigenvalue problem.
It is worth explaining what the source, in Control, of Generalized Eigenvalue problems is. Let me start with the well-known issue of stability of a linear differential equation

z'(t) = Γ z(t)

(z ∈ R^s). As you for sure know, the maximal growth of the trajectories of the equation as t → ∞ is predetermined by the eigenvalue of Γ with the largest real part; let this part be ρ. Namely, all the trajectories admit, for any ε > 0, the estimate

|z(t)| ≤ C_ε exp{(ρ + ε) t} |z(0)|, t ≥ 0,

and vice versa: from the fact that all the trajectories admit an estimate

|z(t)| ≤ C exp{a t} |z(0)|, t ≥ 0, (3.6.36)

it follows that a ≥ ρ.
There are different ways to prove the above fundamental Lyapunov Theorem, and one of the simplest is via quadratic Lyapunov functions. Let us say that a quadratic function z^T L z (L a symmetric positive definite s × s matrix) proves that the decay rate of the trajectories is at most a, if for any trajectory of the equation one has

(d/dt) ln(z^T(t) L z(t)) ≤ 2a (3.6.37)

and, consequently,

(z^T(t) L z(t))^{1/2} ≤ exp{a t} (z^T(0) L z(0))^{1/2},

which immediately results in an estimate of the type (3.6.36). Thus, any positive definite symmetric matrix L which satisfies, for some a, relation (3.6.37) implies an upper bound (3.6.36) on the trajectories of the equation, the upper bound involving just this a. Now, what does it mean that L satisfies (3.6.37)? Since z'(t) = Γ z(t), it means exactly that

z^T(t) L z'(t) = (1/2) z^T(t) (L Γ + Γ^T L) z(t) ≤ a z^T(t) L z(t)

for all t and all trajectories of the equation; since z(t) can be an arbitrary vector of R^s, the latter inequality means that

2a L - L Γ - Γ^T L is positive semidefinite. (3.6.38)

Thus, any pair comprised of a real a and a positive definite symmetric L satisfying (3.6.38) results in the upper bound (3.6.36); the best (with the smallest possible a) bound (3.6.36) which can be obtained in this way is given by the solution to the problem

minimize a s.t. L Γ + Γ^T L ≤ 2a L, L symmetric positive definite;

this is nothing but the Generalized Eigenvalue problem with B(L) = 2L, A(L) = Γ^T L + L Γ and no additional constraints on the design vector x ≡ L. And it can be proved that the best a given by this construction is exactly the largest of the real parts of the eigenvalues of Γ, so that in the case in question the approach based on quadratic Lyapunov functions and Generalized Eigenvalue problems results in a complete description of the asymptotic behaviour of the trajectories as t → ∞.
In fact, of course, what was said is of no literal significance: why should we solve a Generalized Eigenvalue problem in order to find something which can be found by a direct computation of the eigenvalues of Γ? The indicated approach becomes meaningful when we pass from our simple case of a linear differential equation with constant coefficients to the much more difficult (and more important for practice) case of a differential inclusion. Namely, assume that we are given a multivalued mapping z ↦ Q(z) ⊂ R^s and are interested in bounding the trajectories of the differential inclusion

z'(t) ∈ Q(z(t)),

e.g., of a time-varying system

z'(t) = Γ(t) z(t)

with a certain unknown Γ(·). Assume that we know finitely many matrices Γ_1, ..., Γ_M such that

Q(z) ⊆ Conv{Γ_1 z, ..., Γ_M z}

(e.g., we know bounds on the entries of Γ(t) in the above time-varying system). In order to obtain an estimate of the type (3.6.36), we again may use a quadratic Lyapunov function z^T L z: if for all trajectories of the inclusion one has

(z'(t))^T L z(t) ≤ a z^T(t) L z(t) (z ∈ R^s, z'(t) ∈ Q(z(t))), (3.6.40)

then, same as above, we obtain a bound of the type (3.6.36). Thus, we convert the problem of finding the best quadratic Lyapunov function (i.e., the one with the best associated decay rate a) into the Generalized Eigenvalue problem

minimize a s.t. L Γ_i + Γ_i^T L ≤ 2a L, i = 1, ..., M, L symmetric positive definite.
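For the differential inclusion this recipe can be sketched numerically. In the fragment below the matrices Γ_1, Γ_2 are made-up data, and instead of optimizing over L we simply fix L = I (so the certified rate is, in general, suboptimal): the smallest feasible a for L = I is max_i λ_max((Γ_i + Γ_i^T)/2), it dominates the spectral abscissa of every Γ_i, and it bounds the decay of a switching trajectory.

```python
# Certified decay rate for z'(t) in Conv{Gam1 z, Gam2 z}, with the (generally
# suboptimal) Lyapunov matrix L = I; Gam1, Gam2 are made-up stable matrices.
import numpy as np

Gam1 = np.array([[-1.0, 0.1], [0.05, -1.0]])
Gam2 = np.array([[-1.0, -0.1], [0.1, -1.0]])
mats = (Gam1, Gam2)

# smallest a with L*Gam_i + Gam_i^T*L <= 2aL for L = I:
a = max(float(np.linalg.eigvalsh((M + M.T)/2.0).max()) for M in mats)

# a dominates the spectral abscissa (max real part of eigenvalues) of each Gam_i
abscissa = max(float(np.linalg.eigvals(M).real.max()) for M in mats)

# and certifies decay along an arbitrarily switching Euler trajectory
z, dt = np.array([1.0, 1.0]), 1e-3
for k in range(5000):
    M = mats[(k//100) % 2]      # switch between Gam1 and Gam2 every 100 steps
    z = z + dt*(M @ z)
decay_ok = bool(np.linalg.norm(z) <= np.sqrt(2.0)*np.exp(a*5000*dt)*1.01)
```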
x_i ∈ int G_{i-1}.

The presented scheme defines, of course, a family of methods rather than a single method. The basic implementation issues, as always, are how to choose x_i in the interior of G_{i-1} and how to extend the intermediate localizer Ĝ_i to G_i; here one may use the same tactics as in the Center of Gravity or the Ellipsoid methods. An additional problem is how to start the process (i.e., how to choose G_0); this issue heavily depends on a priori information on the problem, and here we can hardly give any universal recommendations.
Now, what can be said about the rate of convergence of the method? First of all, we
should say how we measure inaccuracy. A convenient general approach here is as follows.
Let x ∈ G_f and let, for a given ε ∈ (0, 1),

G_f^ε(x) = x + ε(G_f - x) = {y = (1 - ε)x + εz | z ∈ G_f},

f_x(ε) = sup{ f(y) | y ∈ G_f^ε(x) },

f*(ε) = inf{ f_x(ε) | x ∈ G_f }.

We say that a point x̄ ∈ G_f is an ε-solution to (f) if

f(x̄) ≤ f*(ε).
Let us motivate the introduced notion. The actual motivation is, of course, that the notion works, but let us start with a kind of speculation. Assume for a moment that the problem is solvable, and let x* be an optimal solution to it. One hardly could argue that a point x ∈ G_f which is at a distance of order ε of x* is a natural candidate for the role of an ε-solution; since all points from G_f^ε(x*) are at distance at most ε Diam(G_f) from x*, all these points can be regarded as ε-solutions, in particular the worst of them (i.e., with the largest value of f), let it be x*(ε). Now, what we actually are interested in are the values of the objective; if we agree to think of x*(ε) as of an ε-solution, we should agree that any point x ∈ G_f with f(x) ≤ f(x*(ε)) also is an ε-solution. But this latter property is shared by any point which is an ε-solution in the sense of the above definition (look: f(x*(ε)) is nothing but f_{x*}(ε)), and we are done - our definition is justified!
Of course, this is nothing but a speculation. What might, and what might not, be called a good approximate solution cannot be decided in advance; the definition should come from the real-world interpretation of the problem, not from inside Optimization Theory. What could happen with our definition in the case of a bad problem can be seen from the following example:

minimize f(x) = x / (10^{-20} + x), x ∈ G_f = [0, 1].

Here in order to find a solution with the value of the objective better than, say, 1/2 (note that the optimal value is 0) we should be at a distance of order 10^{-20} of the exact solution
x* = 0. For our toy problem it is immediate, of course, to indicate the solution exactly, but think what happens if the same effect is met in the case of a multidimensional and nonpolyhedral G_f. We should note, anyhow, that problems like the one just presented are intrinsically bad (what is the problem?); in good situations our definition does work:
Size(G_N) / Size(G_f) > ε.
Exercise 3.6.5# Write a code implementing the Ellipsoid version of the Cutting Plane scheme for (f). Use the code to find the best decay rate for the differential inclusion

z'(t) ∈ Q(z(t)) ⊂ R³,

where

Q(z) = Conv{Γ_1 z, ..., Γ_M z}

and the Γ_i, i = 1, 2, ..., M = 2^6 = 64, are the vertices of the polytope

P = { ( -1    p_12  p_13
        p_21  -1    p_23
        p_31  p_32  -1  ) : |p_ij| ≤ 0.1 }.
with κ(n) given by (3.6.41). Based on this observation, construct a cutting plane method for convex problems with functional constraints where all localizers are simplices.
What should be the associated size?
What is the complexity of the method?
Hint: without loss of generality we may assume that the linear form g^T x attains its minimum over D at the vertex v_0 and that g^T(w - v_0) = 1. Choosing v_0 as our new origin and v_1 - v_0, ..., v_n - v_0 as the unit vectors of our new coordinate axes, we come to the situation studied in Exercise 3.6.6.
Note that the progress in volumes of the subsequent localizers in the method of outer simplex (i.e., the quantity κ^n(n) = 1 - O(n^{-2})) is worse than the corresponding quantity κ^n(n) = 1 - O(n^{-1}) in the Ellipsoid method. It does not, anyhow, mean that the former method is for sure worse than the latter one: in the Ellipsoid method, the actual progress in volumes always equals κ^n(n), while in the method of outer simplex the progress depends on what the cutting planes are; the quantity κ^n(n) is nothing but the worst-case bound on the progress, and the latter, for a given problem, may happen to be more significant.
Lecture 4. Large-scale optimization problems
where G is a given solid in R^n and f, g_1, ..., g_m are convex continuous functions on G. The family of all consistent problems of the indicated type was denoted by P_m(G), and we are interested in finding an ε-solution to a problem instance from the family, i.e., a point x ∈ G such that

ε(p, x) ≤ ε.

We have shown that the complexity of the family in question satisfies the inequalities

O(1) n ln(1/ε) ≤ A(ε) ≤ O(1) n ln(1/ε),

where the two O(1)'s are appropriate positive absolute constants; what should be stressed is that the upper complexity bound holds true for all ε ∈ (0, 1), while the lower one is valid only for not too large ε, namely, for

ε < ε(G).

The critical value ε(G) depends, as we remember, on the affine properties of G; for the box it is 1/2, and for any n-dimensional solid G one has

ε(G) ≥ 1/(2n³).
74 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS
Thus, our complexity bounds identify the complexity, up to an absolute constant factor, only for small enough values of ε; there is an initial interval of values of the relative accuracy,

Δ(G) = [ε(G), 1),

where up to now we have only an upper bound on the complexity and no lower bound. Should we be bothered by this incompleteness of our knowledge? I think we should. Indeed, the length of the initial segment depends on G; if G is a box, then this segment is once for ever fixed, so that there, basically, is nothing to worry about - one hardly might be interested in solving optimization problems within relative inaccuracy ≥ 1/2, and for smaller ε we know the complexity. But if G is a more general set than a box, then there is something to think about: all we can say about an arbitrary n-dimensional G is that ε(G) ≥ 1/(2n³); this lower bound tends to 0 as the dimension n of the problem increases, so that for large n the segment Δ(G) can in fact almost cover the interval (0, 1) of all possible values of ε. On the other hand, when solving large-scale problems of real-world origin, we often are not interested in too high accuracy, and it may happen that the value of ε we actually are interested in lies exactly in Δ(G), where we do not know what the complexity is and what the optimal methods are. Thus, we have reasons, both of theoretical and practical origin, to be interested in the pre-asymptotic behaviour of the complexity.
The difficulty in investigating the behaviour of the complexity in the initial range of values of the accuracy is that it depends on the affine properties of the domain G, and this is something too diffuse for a quantitative description. This is why it is reasonable to restrict ourselves to certain standard domains G. We already know what happens when G is a parallelotope, or, which is the same, a box - in this case there, basically, is no initial segment. And, of course, the next interesting case is when G is an ellipsoid, or, which is the same, a Euclidean ball (all our notions are affine invariant, so to speak about Euclidean balls is the same as to speak about arbitrary ellipsoids). This is the case we shall focus on. In fact we shall assume G to be something like a ball rather than a ball exactly. Namely, let us fix a real α ≥ 1 and assume that the asphericity of G is at most α, i.e., there is a pair of concentric Euclidean balls V_in and V_out with the ratio of radii not exceeding α and such that the smaller ball is inside G, and the larger one contains G:

V_in ⊆ G ⊆ V_out.
Comment. Before proving the theorem, let us think about what the theorem says. First, it says that the complexity of convex minimization on a domain similar to a Euclidean ball is bounded from above, uniformly in the dimension, by a function O(1) α² ε^{-2}; the asphericity α is responsible for the level of similarity between the domain and a ball. Second, we see that in the large-scale case, when the dimension of the domain is large enough for given α and ε, or, which is the same, when the inaccuracy ε is large enough for a given dimension (and asphericity), namely, when

ε ≥ 1/(2√n), (4.2.3)

then the complexity admits a lower bound O(1) α^{-2} ε^{-2} which differs from the aforementioned upper bound by a factor O(1) α⁴ depending on the asphericity only. Thus, in the large-scale case (4.2.3) our upper complexity bound coincides with the complexity up to a factor depending on asphericity only; if G is a Euclidean ball (α = 1), then this factor does not exceed 16.
Now, our new complexity results combined with the initial results related to the case of small inaccuracies give us a basically complete description of the complexity in the case when G is a Euclidean ball. The graph of the complexity in this case is as follows:

Figure 4. Complexity of convex minimization over an n-dimensional Euclidean ball: the whole range [1, ∞) of values of 1/ε can be partitioned into three segments:

the initial segment [1, 2√n]; within this segment the complexity, up to an absolute constant factor, is ε^{-2}; at the right endpoint of the segment the complexity is equal to n; in this initial segment the complexity is independent of the dimension and is in fact defined by the affine geometry of G;

the final segment [1/ε(G), ∞) = [2n, ∞); here the complexity, up to an absolute constant factor, is n ln(1/ε); this is the standard asymptotics known to us; in this final segment the complexity forgets everything about the geometry of G;

the intermediate segment [2√n, 2n]; at the left endpoint of this segment the complexity is O(n), at the right endpoint it is O(n ln n); within this segment we know the complexity up to a factor of the order of its logarithm rather than up to an absolute constant factor.
Now let us prove the theorem.
then

e_i = g_j'(x_i) / |g_j'(x_i)|;

(iii) if x_i ∈ int G and no constraint is ε-violated at x_i, i.e., no inequality (4.3.5) is satisfied, then

e_i = f'(x_i) / |f'(x_i)|.

Note that the last formula makes sense only if f'(x_i) ≠ 0; if in the case of (iii) we meet with f'(x_i) = 0, then we simply terminate and claim that x_i is the result of our activity.
Same as in the cutting plane scheme, let us say that the search point x_i is productive if at the i-th step we meet case (iii), and non-productive otherwise, and let us define the i-th approximate solution x̄_i as the best (with the smallest value of the objective) of the productive search points generated in the course of the first i iterations (if no productive search point has been generated so far, x̄_i is undefined).
The efficiency of the method is given by the following.
Proposition 4.3.1 Let a problem (p) from the family P_m(G) be solved by the short-step Subgradient Descent method associated with accuracy ε, and let N be a positive integer such that

(2 + (1/2) Σ_{j=1}^N γ_j²) / (Σ_{j=1}^N γ_j) < ε/α. (4.3.6)
4.3. UPPER COMPLEXITY BOUND: SUBGRADIENT DESCENT 77
Then either the method terminates in the course of N steps with the result being an ε-solution to (p), or x̄_N is well-defined and is an ε-solution to (p).
In particular, if
γ_i ≡ ε,
then (4.3.6) is satisfied by
N ≥ N(ε) = ⌊4 ε⁻²⌋ + 1,
and with the indicated choice of the stepsizes we can terminate the method after the N-th step; the resulting method solves any problem from the class within relative accuracy ε with the complexity N(ε), which is exactly the upper complexity bound stated in Theorem 4.2.1.
Proof. Let me make the following crucial observation: let us associate with the method the localizers
G_i = {x ∈ G | (x − x_j)^T e_j ≤ 0, 1 ≤ j ≤ i}.   (4.3.7)
Then the presented method fits our generic cutting plane scheme for problems with functional constraints, up to the fact that the G_i now need not be solids (they may possess empty interior or even be empty themselves) and x_i need not be an interior point of G_{i−1}. But none of these particularities were used in the proof of the general proposition on the rate of convergence of the scheme (Proposition 3.3.1), and in fact there we have proved the following:
Proposition 4.3.2 Assume that we are generating a sequence of search points x_i ∈ Rⁿ and associate with these points vectors e_i and approximate solutions x̄_i in accordance with (i)-(iii). Let the sets G_i be defined by the pairs (x_i, e_i) according to (4.3.7), and let Size be a size. Assume that in the course of N steps we either terminate due to vanishing of the subgradient of the objective at a productive search point, or this is not the case, but
Size(G_N) < ε Size(G)
(if G_N is not a solid, then, by definition, Size(G_N) = 0). In the first case the result formed at the termination is an ε-solution to the problem; in the second case such a solution is x̄_N (which is for sure well-defined).
Now let us apply this latter proposition to our short-step Subgradient Descent method and to the size
Size(Q) = InnerRad(Q).
We know in advance that G contains a Euclidean ball V_in of radius R/α, α = α(G) being the asphericity of G, so that InnerRad(G) ≥ R/α. Now let us estimate from above the size of the i-th localizer G_i, provided that the localizer is well-defined (i.e., that the method did not terminate in the course of the first i steps due to vanishing of the subgradient of the objective at a productive search point). Assume that G_i
contains a Euclidean ball V of a certain radius r > 0, and let x⁺ be the center of this ball. Since V is contained in G_i, we have
(x − x_j)^T e_j ≤ 0, x ∈ V, 1 ≤ j ≤ i,
whence
(x⁺ − x_j)^T e_j + h^T e_j ≤ 0, |h| ≤ r, 1 ≤ j ≤ i,
and since e_j is a unit vector, we come to
(x⁺ − x_j)^T e_j ≤ −r, 1 ≤ j ≤ i.   (4.3.9)
Combining these inequalities with the recurrence defining the method, we obtain
R (2r Σ_{j=1}^{i} γ_j − R Σ_{j=1}^{i} γ_j²) ≤ |x₁ − x⁺|² ≤ 4R²
(we have used the fact that G is contained in the Euclidean ball V_out of radius R). Thus, we come to the estimate
r ≤ [(2 + ½ Σ_{j=1}^{i} γ_j²) / (Σ_{j=1}^{i} γ_j)] R.
This bound holds for the radius r of an arbitrary Euclidean ball contained in G_i, and we come to
InnerRad(G_i) ≤ [(2 + ½ Σ_{j=1}^{i} γ_j²) / (Σ_{j=1}^{i} γ_j)] R.   (4.3.10)
InnerRad(G_N) / InnerRad(G) < ε,
with the same center. It, of course, suffices to establish the lower bound for the case of problems without functional constraints. Besides this, due to the monotonicity of the complexity in ε, it suffices to prove that if ε ∈ (0, 1) is such that
M ≡ ⌊(2ε)⁻²⌋ ≤ n,
then the complexity A(ε) is at least M. Assume that this is not the case, so that there exists a method M which solves all problems from the family in question in no more than M − 1 steps. We may assume that M solves any problem exactly in M steps, and the result always is the last search point. Let us set
δ = 1/(2√M) − ε,
so that δ > 0 by definition of M. Now consider the family F₀ comprised of the functions
f(x) = max_{1≤i≤M} (ξ_i x_i + d_i),
where ξ_i = ±1 and 0 < d_i < δ. Note that these functions are well-defined, since M ≤ n and therefore we have enough coordinates in x.
Now consider the following M-step construction.
The first step:
let x₁ be the first search point generated by M; this point is instance-independent. Let i₁ be the index of the largest in absolute value of the coordinates of x₁, let ξ*_{i₁} be the sign of that coordinate, and let d*_{i₁} = δ/2. Let F₁ be comprised of all functions from F₀ with ξ_{i₁} = ξ*_{i₁}, d_{i₁} = d*_{i₁} and d_i ≤ δ/4 for all i ≠ i₁. It is clear that all the functions of the family F₁ possess the same local behavior at x₁ and are positive at this point.
The second step:
let x₂ be the second search point generated by M as applied to a problem from the family F₁; this point does not depend on the representative of the family, since all these representatives have the same local behavior at the first search point x₁. Let i₂ be the index of the largest in absolute value of the coordinates of x₂ with indices different from i₁, let ξ*_{i₂} be the sign of that coordinate, and let d*_{i₂} = δ/4. Let F₂ be comprised of all functions from F₁ such that ξ_{i₂} = ξ*_{i₂}, d_{i₂} = d*_{i₂} and d_i ≤ δ/8 for all i different from i₁ and i₂. Note that all functions from the family coincide with each other in a neighborhood of the two-point set {x₁, x₂} and are positive on this set.
Now it is clear how to proceed. After k steps of the construction we have a family F_k comprised of all functions from F₀ with the parameters ξ_i and d_i set to certain fixed values for the k values i₁, ..., i_k of the index i, and with d_i ≤ δ2^{−(k+1)} for all remaining i; the family satisfies the following predicate:
P_k: the first k points x₁, ..., x_k of the trajectory of M as applied to any function from the family do not depend on the function, and all the functions from the family coincide with each other in a certain neighborhood of the k-point set {x₁, ..., x_k} and are positive on this set.
From P_k it follows that the (k + 1)-th search point x_{k+1} generated by M as applied to a function from the family F_k is independent of the function. At step k + 1 we
find the index i_{k+1} of the largest in absolute value of the coordinates of x_{k+1} with indices different from i₁, ..., i_k,
define ξ*_{i_{k+1}} as the sign of that coordinate,
set d*_{i_{k+1}} = δ2^{−(k+1)},
and
define F_{k+1} as the set of those functions from F_k for which ξ_{i_{k+1}} = ξ*_{i_{k+1}}, d_{i_{k+1}} = d*_{i_{k+1}} and d_i ≤ δ2^{−(k+2)} for i different from i₁, ..., i_{k+1}.
It is immediately seen that the resulting family satisfies the predicate P_{k+1}, and we may proceed in the same manner.
Now let us look at what will be found after M steps of the construction. We will end up with a family F_M which consists of exactly one function
f = max_{1≤i≤M} (ξ_i x_i + d_i)
such that f is positive along the sequence x₁, ..., x_M of search points generated by M as applied to the function. On the other hand, G contains the ball of radius 1/2 centered at the origin, and, consequently, contains the point
x* = −Σ_{i=1}^{M} (ξ_i / (2√M)) e_i,
whence
f* ≡ min_{x∈G} f(x) ≤ f(x*) < −1/(2√M) + δ = −ε
(the concluding equality follows from the definition of δ). On the other hand, f clearly is Lipschitz continuous with constant 1 on G, and G is contained in the Euclidean ball of radius 1/2, so that the variation max_G f − min_G f of f over G is ≤ 1. Thus, we have f(x_M) − f* > ε ≥ ε (max_G f − min_G f), so that the result x_M produced by M is not an ε-solution to f - the desired contradiction.
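The resisting-oracle construction can be played out numerically. Below is a hedged sketch (all names and the toy method are mine, not from the notes): the adversary fixes the signs ξ and offsets d step by step exactly as in the M-step construction, with δ = 1/(2√M) − ε (my reading of the garbled definition); afterwards f is positive along the whole trajectory while f(x*) is negative.

```python
import math

def adversary_run(method_step, n, M, eps):
    delta = 1 / (2 * math.sqrt(M)) - eps   # assumed: delta = 1/(2 sqrt(M)) - eps > 0
    assert delta > 0
    xi, d = {}, {}                         # signs and offsets, fixed on the fly
    history = []
    for k in range(1, M + 1):
        x = method_step(history)                       # next search point
        free = [i for i in range(n) if i not in xi]
        i_k = max(free, key=lambda i: abs(x[i]))       # largest still-free coordinate
        xi[i_k] = 1.0 if x[i_k] >= 0 else -1.0         # its sign
        d[i_k] = delta * 2 ** (-k)                     # offset d_{i_k} = delta 2^{-k}
        history.append((x, max(xi[i] * x[i] + d[i] for i in xi)))
    f = lambda y: max(xi[i] * y[i] + d[i] for i in xi) # the single function left in F_M
    x_star = [0.0] * n
    for i in xi:
        x_star[i] = -xi[i] / (2 * math.sqrt(M))
    return history, f, x_star

toy_method = lambda hist: [0.1 * (len(hist) + 1)] * 8  # a deterministic dummy method
hist, f, x_star = adversary_run(toy_method, n=8, M=4, eps=0.1)
assert all(val > 0 for _, val in hist)   # f is positive along the trajectory...
assert f(x_star) < 0                     # ...yet f attains negative values on G
```

A real method would query an oracle, but since all functions in F_k agree near the trajectory, its points are function-independent, which is all the construction needs; the dummy method above trivially has this property.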
From now on we assume that G is a closed and bounded convex subset of Rⁿ, possibly with empty interior, and that the objective is convex and Lipschitz continuous on G:
|f(x) − f(y)| ≤ L(f) |x − y|, x, y ∈ G,
where L(f) < ∞ and | · | is the usual Euclidean norm in Rⁿ. Note that the subgradient set of f at any point of G is nonempty and contains subgradients of norm not exceeding L(f); from now on we assume that the oracle in question reports such a subgradient at any input point x ∈ G.
We would like to solve the problem within absolute inaccuracy ε, i.e., to find x ∈ G such that
f(x) − f* ≤ ε, f* ≡ min_G f.
The simplest way to solve the problem is to apply the standard Subgradient Descent method, which generates the sequence of search points {x_i}_{i≥1} according to the rule
x_{i+1} = π_G(x_i − γ_i f'(x_i)/|f'(x_i)|),
where γ_i > 0 are the stepsizes, f'(x_i) is the subgradient reported by the oracle, and
π_G(x) = argmin{|x − y| : y ∈ G}
is the standard projector onto G. Of course, if we meet a point with f'(x) = 0, we terminate with an optimal solution at hand; from now on I ignore this trivial case.
As always, the i-th approximate solution x̄_i found by the method is the best - with the smallest value of f - of the search points x₁, ..., x_i; note that all these points belong to G.
It is easy to investigate the rate of convergence of the aforementioned routine. To this end let x* be the optimal solution closest to x₁, and let
d_i = |x_i − x*|.
We are going to see how the d_i vary. To this end let us start with the following simple and important observation (cf. Exercise 4.7.3):
Lemma 4.5.1 Let x ∈ Rⁿ, and let G be a closed convex subset of Rⁿ. Under projection onto G, x becomes closer to any point u of G; namely, the squared distance from x to u decreases at least by the squared distance from x to G:
|π_G(x) − u|² ≤ |x − u|² − |x − π_G(x)|².
(the concluding inequality is due to the convexity of f). Thus, we come to the recurrence
d²_{i+1} ≤ d²_i − 2γ_i (f(x_i) − f*)/|f'(x_i)| + γ_i²,   (4.5.14)
whence, in particular,
ε_N ≡ f(x̄_N) − f* ≤ L(f) (d₁² + Σ_{i=1}^{N} γ_i²) / (2 Σ_{i=1}^{N} γ_i).   (4.5.16)
The right hand side of this inequality clearly tends to 0 as N → ∞, provided that
Σ_{i=1}^{∞} γ_i = ∞, γ_i → 0, i → ∞
(why?), which gives us a general statement on the convergence of the method as applied to a Lipschitz continuous convex function; note that we did not use the fact that G is bounded.
Of course, we would like to choose the stepsizes resulting in the best possible estimate (4.5.16). Note that our basic recurrence (4.5.14) implies that for any N ≥ M ≥ 1 one has
2 ε_N Σ_{i=M}^{N} γ_i ≤ L(f) (d_M² + Σ_{i=M}^{N} γ_i²) ≤ L(f) (D² + Σ_{i=M}^{N} γ_i²),
whence
ε_N ≤ L(f) (D² + Σ_{i=M}^{N} γ_i²) / (2 Σ_{i=M}^{N} γ_i), N ≥ M ≥ 1,
with D being an a priori upper bound on the diameter of G. With M = ⌈N/2⌉ and
γ_i = D i^{−1/2},   (4.5.17)
the right hand side of the latter inequality does not exceed O(1) L(f) D N^{−1/2}. This way we come to the optimal, up to an absolute constant factor, estimate
ε_N ≤ O(1) L(f) D / √N, N = 1, 2, ...   (4.5.18)
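Here is a minimal sketch of the scheme, assuming G is a Euclidean ball centered at the origin (so the projector is closed-form) and using stepsizes of the form γ_i = D i^{−1/2}; the nonsmooth test function and all parameters are my own choices, not from the notes:

```python
import math

def project_ball(x, radius):
    # projector onto {y : |y| <= radius} (ball centered at the origin)
    nrm = math.sqrt(sum(v * v for v in x))
    return list(x) if nrm <= radius else [v * radius / nrm for v in x]

def subgradient_descent(f, subgrad, x0, radius, D, n_steps):
    x, best = list(x0), f(x0)
    for i in range(1, n_steps + 1):
        g = subgrad(x)
        gn = math.sqrt(sum(v * v for v in g))
        gamma = D * i ** -0.5                          # stepsizes gamma_i = D i^{-1/2}
        x = project_ball([xv - gamma * gv / gn for xv, gv in zip(x, g)], radius)
        best = min(best, f(x))                         # best value found so far
    return best

c = [0.3, -0.2, 0.1]                                   # minimizer (inside the unit ball)
f = lambda x: max(abs(xv - cv) for xv, cv in zip(x, c))  # nonsmooth convex test function
def subgrad(x):
    j = max(range(len(x)), key=lambda k: abs(x[k] - c[k]))
    g = [0.0] * len(x)
    g[j] = 1.0 if x[j] >= c[j] else -1.0
    return g

best = subgradient_descent(f, subgrad, [1.0, 1.0, 1.0], radius=1.0, D=2.0, n_steps=5000)
assert best < 0.1    # consistent with an O(1) L(f) D / sqrt(N) residual
```

Here L(f) = 1 and D = 2, so after N = 5000 steps the theoretical bound already guarantees a residual below 0.1; in practice the method usually does noticeably better than the worst-case estimate.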
4.5. SUBGRADIENT DESCENT FOR LIPSCHITZ-CONTINUOUS CONVEX PROBLEMS83
(O(1) is an easily computable absolute constant). I call this rate optimal, since the lower complexity bound of Section 4.4 says that if G is a Euclidean ball of diameter D in Rⁿ and L is a given constant, then the complexity of minimizing over G, within absolute accuracy ε, an arbitrary convex function which is Lipschitz continuous with constant L is at least
min{ n ; O(1) (LD/ε)² },
so that in the large-scale case, when
n ≥ (LD/ε)²,
the lower complexity bound coincides, within an absolute constant factor, with the upper bound given by (4.5.18).
1) Thus, we can choose the stepsizes γ_i according to (4.5.17) and obtain the dimension-independent rate of convergence (4.5.18); this rate of convergence does not admit a significant uniform-in-the-dimension improvement, provided that G is a Euclidean ball.
2) The stepsizes (4.5.17) are theoretically optimal and more or less reasonable from the practical viewpoint, provided that you deal with a domain G of reasonable diameter, i.e., a diameter of the same order of magnitude as the distance from the starting point to the optimal set. If the latter assumption is not satisfied (as is often the case), the stepsizes should be chosen more carefully. A reasonable idea here is as follows. Our rate-of-convergence proof was in fact based on the very simple relation
d²_{i+1} ≤ d²_i − 2γ_i (f(x_i) − f*)/|f'(x_i)| + γ_i²;
let us choose as γ_i the quantity which results in the strongest possible inequality of this type, namely, the one which minimizes the right hand side:
γ_i = (f(x_i) − f*) / |f'(x_i)|.   (4.5.19)
Of course, this choice is possible only when we know the optimal value f*. Sometimes this is not a problem, e.g., when we reduce a system of convex inequalities
f_i(x) ≤ 0, i = 1, ..., m,
to the minimization of
f(x) = max_i f_i(x);
here we can take f* = 0. In more complicated cases people use on-line estimates of f*; I would not like to go into details, so I assume that f* is known in advance. With the stepsizes (4.5.19) (proposed many years ago by B.T. Polyak) our recurrence becomes
d²_{i+1} ≤ d²_i − (f(x_i) − f*)² / |f'(x_i)|²,
whence, same as above,
ε_N ≡ f(x̄_N) − f* ≤ L(f) |x₁ − x*| / √N.
This estimate seems to be the best one, since it involves the actual distance |x₁ − x*| to the optimal set rather than the diameter of G; in fact G might even be unbounded. Typically, whenever one can use the Polyak stepsizes, this is the best possible tactic for the Subgradient Descent method.
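Since the Polyak stepsize (4.5.19) needs f*, the sketch below uses a test problem with known optimal value f* = 0; the function and all parameters are my own choices:

```python
import math

def polyak_descent(f, subgrad, f_star, x0, n_steps):
    x, best = list(x0), f(x0)
    for _ in range(n_steps):
        g = subgrad(x)
        gn = math.sqrt(sum(v * v for v in g))
        if gn == 0:                                   # exact minimizer reached
            break
        gamma = (f(x) - f_star) / gn                  # Polyak stepsize (4.5.19)
        x = [xv - gamma * gv / gn for xv, gv in zip(x, g)]
        best = min(best, f(x))
    return best

# f(x) = sum_j |x_j|, with known optimal value f* = 0 at the origin
f = lambda x: sum(abs(v) for v in x)
subgrad = lambda x: [1.0 if v > 0 else (-1.0 if v < 0 else 0.0) for v in x]
best = polyak_descent(f, subgrad, 0.0, [5.0, -3.0, 2.0], 200)
assert best < 1e-3
```

For this particular f one has f(x) − f* ≥ |x − x*|, so the recurrence above contracts the squared distance by a constant factor per step and convergence is fast; with stepsizes (4.5.17) the same budget of steps would give only a 1/√N-type residual.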
We can now present a small summary: we see that the Subgradient Descent, which we were exploiting in order to obtain an optimal method for large-scale convex minimization over a Euclidean ball, can be applied to the minimization of a convex Lipschitz continuous function over an arbitrary n-dimensional closed convex domain G; if G is bounded, then, under an appropriate choice of stepsizes, one can ensure the inequalities
f(x̄_N) − f* ≤ O(1) L(f) D(G) / √N,
where O(1) is a moderate absolute constant, L(f) is the Lipschitz constant of f and D(G) is the diameter of G. If the optimal value of the problem is known, then one can use stepsizes which allow one to replace D(G) by the distance |x₁ − x*| from the starting point to the optimal set; in this latter case, G need not be bounded. And the rate of convergence is optimal, I mean, it cannot be improved by more than an absolute constant factor, provided that G is an n-dimensional Euclidean ball and n > N.
Note also that if G is a simple set, say, a Euclidean ball, or a box, or the standard simplex
{x ∈ Rⁿ₊ | Σ_{i=1}^{n} x_i = 1},
then the method is computationally very cheap - a step costs only O(n) operations in addition to those spent by the oracle. In theory it all looks perfect. It is not a problem to speak about an upper accuracy bound O(N^{−1/2}) and about the optimality of this bound in the large-scale case; but in practice such a rate of convergence would result in thousands of steps, which is too much for the majority of applications. Note that in practice we are interested in the typical complexity of a method rather than in its worst-case complexity and worst-case optimality. And from this practical viewpoint the Subgradient Descent is far from being optimal: there are other methods with the same worst-case theoretical complexity bound, but with significantly better typical performance; needless to say, these methods are preferable in actual computations. What we are about to do is to look at a certain family of methods of this latter type.
where the prehistory was memorized in the current localizer). Generally speaking, what do we actually know about the objective after we have formed a sequence of search points x_j ∈ G, j = 1, ..., i? All we know is the bundle - the sequence of affine forms
f(x_j) + (x − x_j)^T f'(x_j)
reported by the oracle; we know that every form from the sequence underestimates the objective and coincides with it at the corresponding search point. All these affine forms can be assembled into a single piecewise linear convex function - the i-th model of the objective:
f_i(x) = max_{1≤j≤i} [ f(x_j) + (x − x_j)^T f'(x_j) ].
And once again - the model accumulates all our knowledge obtained so far; e.g., the information we possess does not contradict the hypothesis that the model is exact everywhere. Since the model accumulates the whole prehistory, it is reasonable to formulate the search rules for a method in terms of the model. The most natural and optimistic idea is to trust the model completely and to take, as the next search point, the minimizer of the model:
x_{i+1} ∈ Argmin_G f_i.
This is the Kelley cutting plane method - the very first method proposed for nonsmooth convex optimization. The idea is very simple: if we are lucky and the model is good everywhere, not only along the previous search points, we will significantly improve the best value of the objective found so far. On the other hand, if the model is bad, then it will be corrected at the right place. From the compactness of G one can immediately derive that the method does converge, and it is even finite if the objective is piecewise linear. Unfortunately, it turns out that the rate of convergence of the method is a disaster; one can demonstrate that the worst-case number of steps required by the Kelley method to solve a problem f within absolute inaccuracy ε (G is the unit n-dimensional ball, L(f) = 1) is at least
O(1) (1/ε)^{(n−1)/2}.
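A hedged one-dimensional sketch of the Kelley scheme: the model is the maximum of the affine forms collected so far, and its minimization over G = [a, b] is done by brute force on a grid (a stand-in for the LP solve; the test problem and the grid are my choices):

```python
def kelley_1d(f, df, a, b, n_iters, grid_size=2001):
    grid = [a + (b - a) * k / (grid_size - 1) for k in range(grid_size)]
    cuts = []                              # affine forms f(x_j) + (x - x_j) f'(x_j)
    x = a                                  # arbitrary starting point
    best = f(x)
    for _ in range(n_iters):
        cuts.append((f(x), df(x), x))
        model = lambda y: max(fv + dv * (y - xv) for fv, dv, xv in cuts)
        x = min(grid, key=model)           # trust the model completely
        best = min(best, f(x))
    return best

# nonsmooth convex test problem on [-1, 1] (my choice): f(x) = |x - 0.3|
best = kelley_1d(lambda x: abs(x - 0.3), lambda x: 1.0 if x >= 0.3 else -1.0,
                 -1.0, 1.0, 30)
assert best < 1e-2
```

On this piecewise linear objective the method is finite, in agreement with the text: two cuts already reproduce f exactly, and the third iterate is optimal. The disastrous worst-case behavior only shows up in higher dimensions.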
We see how dangerous it is to be too optimistic, and it is clear why: even in the case of a smooth objective the model is close to the objective only in a neighborhood of the search points; until the number of these points becomes very large, this neighborhood covers a negligible part of the domain G, so that the global characteristic of the model - its minimizer - is very unstable and, until the termination phase, has little in common with the actual optimal set. It should be noted that in practice the Kelley method is much better than one could think looking at its worst-case complexity (a method with practical complexity like this estimate simply could not be used even in dimension 10), but the qualitative conclusions from the estimate are more or less valid in practice as well - the Kelley method sometimes is too slow.
A natural way to improve the Kelley method is as follows. We can only hope that the model approximates the objective in a neighborhood of the search points. It is therefore reasonable to force the next search point to be not too far from the previous ones - more exactly, from the most promising, i.e., the best of them, since the latter, as the method goes on, hopefully will become close to the optimal set. To forbid the new iterate from moving far away, let us choose x_{i+1} as the minimizer of the penalized model:
x_{i+1} = argmin_{x∈G} { f_i(x) + (d_i/2) |x − x_i⁺|² },
where x_i⁺ is what is called the current prox center, and the prox coefficient d_i > 0 is a certain parameter. When d_i is large, we force x_{i+1} to be close to the prox center, and when it is small, we act almost as in the Kelley method. What is displayed is the generic form of the bundle methods; to specify a method from this family, one needs to indicate the policies for updating the prox centers and the prox coefficients. There are a number of reasonable policies of this type, and among them there are policies resulting in methods with very good practical performance. I would not like to go into details here; let me say only that, first, the best theoretical complexity estimate for the traditional bundle methods is something like O(ε⁻³); although non-optimal, this upper bound is incomparably better than the lower complexity bound for the method of Kelley. Second, there is a more or less unique reasonable policy for updating the prox center, in contrast to the policy for updating the prox coefficient. The practical performance of a bundle algorithm heavily depends on this latter policy, and sensitivity to the prox coefficient is, in a sense, the weak point of the bundle methods. Indeed, even without appealing to computational experience we can guess in advance that the scheme should be sensitive to d_i, since in the limiting cases of zero and infinite prox coefficient we get, respectively, the Kelley method, which can be slow, and the method which simply does not move from the initial point. Thus, both small and large prox coefficients are forbidden; and it is unclear how to choose the golden middle - our information has nothing in common with any quadratic terms in the model; these terms are invented by us.
To describe the method, we introduce several simple quantities. Given the i-th model f_i(·), we can compute its optimum, same as in the Kelley method; but now we are interested not in the point where the optimum is attained, but in the optimal value
f_i⁻ = min_G f_i
of the model. Since the model underestimates the objective, the quantity f_i⁻ is a lower bound for the actual optimal value; and since the models clearly increase with i at every point, their minima also increase, so that
f_1⁻ ≤ f_2⁻ ≤ ... ≤ f*.   (4.6.24)
On the other hand, let f_i⁺ be the best value of the objective found so far:
f_i⁺ = min_{1≤j≤i} f(x_j) = f(x̄_i),
where x̄_i is the best (with the smallest value of the objective) of the search points generated so far. The quantities f_i⁺ clearly decrease with i and overestimate the actual optimal value:
f(x̄_i) = f_i⁺ ≥ f*, Δ₁ ≥ Δ₂ ≥ ... ≥ 0,   (4.6.27)
where Δ_i ≡ f_i⁺ − f_i⁻ are the gaps. In the Level method, the next search point is the projection of the current iterate onto the level set
Q_i = {x ∈ G | f_i(x) ≤ l_i}
of the i-th model, associated with the level l_i = f_i⁻ + λΔ_i, λ ∈ (0, 1) being the parameter of the method.
Computationally, the method requires solving two auxiliary problems at each iteration. The first is to minimize the model in order to compute f_i⁻; this problem arises in the Kelley method and does not arise in the bundle ones. The second auxiliary problem is to project x_i onto Q_i; this is, basically, the same quadratic problem which arises in the bundle methods and does not arise in the Kelley one. If G is a polytope, which normally is the case, the first of these auxiliary problems is a linear program, and the second is a convex linearly constrained quadratic program; to solve them, one can use the traditional efficient simplex-type techniques.
Let me note that the method actually belongs to the bundle family, and that for this method the prox center always is the last iterate. To see this, let us look at the solution
x(d) = argmin_{x∈G} { f_i(x) + (d/2) |x − x_i|² }
of the auxiliary problem arising in the bundle scheme as a function of the prox coefficient d. It is clear that x(d) is the point of the set {x ∈ G | f_i(x) ≤ f_i(x(d))} closest to x_i, so that x(d) is the projection of x_i onto the level set
{x ∈ G | f_i(x) ≤ l_i(d)}
of the i-th model associated with the level l_i(d) = f_i(x(d)) (this latter relation gives us a certain equation relating d and l_i(d)). As d varies from 0 to ∞, x(d) moves along a certain path which starts at the point of the optimal set of the i-th model closest to x_i and ends at the prox center x_i; consequently, the level l_i(d) varies from f_i⁻ to f_i(x_i) = f(x_i) ≥ f_i⁺, and therefore, for a certain value d_i of the prox coefficient, we have l_i(d_i) = l_i and, consequently, x(d_i) = x_{i+1}. Note that the only goal of this reasoning was to demonstrate that the Level method does belong to the bundle scheme and corresponds to a certain implicit control of the prox coefficient; this control exists, but is completely uninteresting for us, since the method does not require knowledge of d_i.
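The Level iteration can be sketched in one dimension on a grid: compute f_i⁻ = min_G f_i, the gap, the level (assumed here to be l_i = f_i⁻ + λ·gap with parameter λ ∈ (0,1)), and project the current iterate onto the level set Q_i. The grid search stands in for the LP and QP solves; the problem data and λ are my choices:

```python
def level_method_1d(f, df, a, b, lam=0.5, n_iters=25, grid_size=2001):
    grid = [a + (b - a) * k / (grid_size - 1) for k in range(grid_size)]
    cuts, x = [], a
    f_plus = f(x)                                      # best objective value so far
    for _ in range(n_iters):
        cuts.append((f(x), df(x), x))
        model = lambda y: max(fv + dv * (y - xv) for fv, dv, xv in cuts)
        f_minus = min(model(y) for y in grid)          # f_i^- = min_G f_i  (the "LP")
        f_plus = min(f_plus, f(x))
        level = f_minus + lam * (f_plus - f_minus)     # l_i = f_i^- + lam * gap
        Q = [y for y in grid if model(y) <= level]     # level set Q_i (an interval here)
        x = min(Q, key=lambda y: abs(y - x))           # project x_i onto Q_i (the "QP")
    return f_plus

best = level_method_1d(lambda x: abs(x - 0.3),
                       lambda x: 1.0 if x >= 0.3 else -1.0, -1.0, 1.0)
assert best < 1e-2
```

Note that Q_i is never empty: the minimizer of the model lies in it, since its model value f_i⁻ is below the level. On this test problem the gap shrinks by roughly a factor of two per iteration.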
Now let me formulate and prove the main result on the method.
Theorem 4.6.1 Let the Level method be applied to a convex problem (f) with Lipschitz continuous, with constant L(f), objective f and with a closed and bounded convex domain G of diameter D(G). Then the gaps Δ_i converge to 0; namely, for any positive ε one has
i > c(λ) (L(f) D(G) / ε)²  ⟹  Δ_i ≤ ε,   (4.6.29)
where
c(λ) = 1 / ((1 − λ)² λ (2 − λ)).
In particular,
i > c(λ) (L(f) D(G) / ε)²  ⟹  f(x̄_i) − f* ≤ ε.
4.6. BUNDLE METHODS 89
Δ_i ≤ (1 − λ)⁻¹ Δ_{i(2)};
Δ_i ≤ (1 − λ)⁻¹ Δ_{i(3)},
and so on.
With this process, we partition the set I of iteration indices into sequential segments I₁, ..., I_k (I_s follows I_{s+1} in I). The last index in I_s is i(s), and we have
Δ_{i(s+1)} > (1 − λ)⁻¹ Δ_{i(s)}   (4.6.30)
(indeed, if the opposite inequality held, then i(s + 1) would be included into the group I_s, which is not the case).
2⁰. The main (and very simple) observation is as follows:
the level sets Q_i of the models corresponding to a certain group of iterations I_s have a point in common, namely, the minimizer, u_s, of the last, i(s)-th, model from the group.
Indeed, since the models increase with i, and the best found so far values of the objective decrease with i, for all i ∈ I_s one has
f_i(u_s) ≤ f_{i(s)}(u_s) = f_{i(s)}⁻ = f_{i(s)}⁺ − Δ_{i(s)} ≤ f_i⁺ − Δ_{i(s)} ≤ f_i⁺ − (1 − λ)Δ_i ≤ l_i
(the concluding inequality in the chain follows from the fact that i ∈ I_s, so that Δ_i ≤ (1 − λ)⁻¹ Δ_{i(s)}).
3⁰. The above observation allows us to estimate from above the number N_s of iterations in the group I_s. Indeed, since x_{i+1} is the projection of x_i onto Q_i and u_s ∈ Q_i for i ∈ I_s, we conclude from Lemma 4.5.1 that
|x_{i+1} − u_s|² ≤ |x_i − u_s|² − |x_i − x_{i+1}|², i ∈ I_s,
whence, summing over i ∈ I_s,
Σ_{i∈I_s} |x_i − x_{i+1}|² ≤ D²(G).   (4.6.31)
Now let us estimate from below the steplengths |x_i − x_{i+1}|. At the point x_i the i-th model f_i equals f(x_i) and is therefore ≥ f_i⁺, and at the point x_{i+1} the i-th model is, by construction of x_{i+1}, less than or equal (in fact equal) to l_i = f_i⁺ − (1 − λ)Δ_i. Thus, when passing from x_i to x_{i+1}, the i-th model varies at least by the quantity (1 − λ)Δ_i, which is, in turn, at least (1 − λ)Δ_{i(s)} (the gaps may only decrease!). On the other hand, f_i clearly is Lipschitz continuous with the same constant L(f) as the objective (recall that, according to our assumption, the oracle reports subgradients of f of norms not exceeding L(f)). Thus, on the segment [x_i, x_{i+1}] the Lipschitz continuous, with constant L(f), function f_i varies at least by (1 − λ)Δ_{i(s)}, whence
|x_i − x_{i+1}| ≥ (1 − λ) Δ_{i(s)} / L(f).
From this inequality and (4.6.31) we conclude that the number N_s of iterations in the group I_s satisfies the estimate
N_s ≤ (1 − λ)⁻² L²(f) D²(G) Δ_{i(s)}⁻².
4⁰. We have Δ_{i(1)} > ε (this is the origin of N) and Δ_{i(s)} ≥ (1 − λ)^{−(s−1)} Δ_{i(1)} (see (4.6.30)), so that the above estimate of N_s results in
N_s ≤ (1 − λ)⁻² L²(f) D²(G) (1 − λ)^{2(s−1)} ε⁻²,
whence
N = Σ_s N_s ≤ c(λ) L²(f) D²(G) ε⁻²,
as claimed.
maximum of 5 convex quadratic forms of 10 variables. In the table below you see the results obtained on this problem by the Subgradient Descent and by the Level method. In the Subgradient Descent the Polyak stepsizes were used (to this end, the method was equipped with the exact optimal value, so that the experiment was in favor of the Subgradient Descent).
The results are as follows:
Subgradient Descent: 100,000 steps, best found value -0.8413414 (absolute inaccuracy 0.0007), running time 54;
Level: 103 steps, best found value -0.8414077 (absolute inaccuracy < 0.0000001), running time 2, complexity index c(f) = 0.47.
Runs:

       Level                  Subgradient Descent
    i       f_i^+             i         f_i^+
    1     5337.066429         1       5337.066429
    2       98.595071         6         98.586465
    8        6.295622        16          7.278115
   31       -0.198568        41         -0.221810
   39       -0.674044       201         -0.801369
   54       -0.811759      4001         -0.839771
   73       -0.841058      5001         -0.840100
   81       -0.841232     17001         -0.841021
  103       -0.841408     25001         -0.841144
                          50001         -0.841276
                          75001         -0.841319
                         100000         -0.841341
The entry at iteration 4001 marks the result obtained by the Subgradient Descent within the total CPU time of the Level method; by that moment the Subgradient Descent had performed 4,000 iterations, but had restored the solution within 2 accuracy digits only, rather than the 6 digits given by Level.
The Subgradient Descent with the default stepsizes γ_i = O(1) i^{−1/2} in the same 100,000 iterations was unable to achieve a value of the objective less than -0.837594, i.e., it found the solution within a single accuracy digit.
4.7 Exercises
Exercise 4.7.1 Implement the Level method. Try to test it on the MAXQUAD test problem¹⁾, which is as follows:
min_{x∈X} f(x), X = {x ∈ R¹⁰ | |x_i| ≤ 1},
where
f(x) = max_{i=1,...,5} [ x^T A^(i) x + x^T b^(i) ]
with
A^(i)_{kj} = A^(i)_{jk} = e^{j/k} cos(jk) sin(i), j < k;
A^(i)_{jj} = (j/10) |sin(i)| + Σ_{k≠j} |A^(i)_{jk}|;
b^(i)_j = e^{j/i} sin(ij).
When implementing the Level method you will need a Quadratic Programming solver. For a SCILAB implementation you can use the QUAPRO internal solver; when working with MATLAB, you can use the conic solver from SDPT3 (http://www.math.nus.edu.sg/~mattohkc/sdpt3.html).
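Assuming the MAXQUAD data read A^(i)_{jk} = e^{j/k} cos(jk) sin(i) for j < k, diagonal entries (j/10)|sin(i)| + Σ_{k≠j}|A^(i)_{jk}|, and b^(i)_j = e^{j/i} sin(ij) (my reading of the garbled formulas), the objective can be set up in pure Python as follows; this is a sketch, not a reference implementation:

```python
import math

def maxquad_data():
    # A[i][j][k], b[i][j] for i = 1..5, j,k = 1..10 (1-based, as in the formulas)
    A = {i: [[0.0] * 11 for _ in range(11)] for i in range(1, 6)}
    b = {i: [0.0] * 11 for i in range(1, 6)}
    for i in range(1, 6):
        for j in range(1, 11):
            for k in range(j + 1, 11):   # off-diagonal entries, j < k, symmetrized
                A[i][j][k] = A[i][k][j] = math.exp(j / k) * math.cos(j * k) * math.sin(i)
        for j in range(1, 11):           # diagonally dominant diagonal; b vector
            A[i][j][j] = (j / 10) * abs(math.sin(i)) + sum(
                abs(A[i][j][k]) for k in range(1, 11) if k != j)
            b[i][j] = math.exp(j / i) * math.sin(i * j)
    return A, b

def maxquad(x):
    # f(x) = max_i ( x^T A^(i) x + x^T b^(i) ), x a 10-vector (0-based list)
    A, b = maxquad_data()
    return max(
        sum(A[i][j][k] * x[j - 1] * x[k - 1] for j in range(1, 11) for k in range(1, 11))
        + sum(b[i][j] * x[j - 1] for j in range(1, 11))
        for i in range(1, 6))

print(maxquad([0.0] * 10))   # value at the origin
```

The diagonal dominance makes each A^(i) positive semidefinite, so f is indeed a maximum of convex quadratic forms, as stated in the text.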
There are, however, more or less evident possibilities for making the method computationally more reasonable. The idea is not to tune the method to the prescribed accuracy in advance, thus making the stepsizes small from the very beginning, but to start with large stepsizes and then decrease them at a reasonable rate. To implement the idea, we need an auxiliary tool (which is important and interesting in its own right), namely, projections.
Let Q be a closed and nonempty convex subset of Rⁿ. The projection π_Q(x) of a point x ∈ Rⁿ onto Q is defined as the point of Q closest to x with respect to the usual Euclidean norm, i.e., as the solution to the following optimization problem:
minimize |x − y|² over y ∈ Q.   (P_x)
Exercise 4.7.2 # Prove that π_Q(x) exists and is unique.
Exercise 4.7.3 # Prove that a point y ∈ Q is a solution to (P_x) if and only if the vector x − y is such that
(u − y)^T (x − y) ≤ 0  ∀ u ∈ Q.   (4.7.32)
Derive from this observation the following important property:
|π_Q(x) − u|² ≤ |x − u|² − |x − π_Q(x)|²  ∀ u ∈ Q.
Thus, when we project a point onto a convex set, the point becomes closer to any point u of the set; namely, the squared distance to u decreases at least by the squared distance from x to Q.
Derive from (4.7.32) that the mappings x ↦ π_Q(x) and x ↦ x − π_Q(x) are Lipschitz continuous with Lipschitz constant 1.
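When Q is a Euclidean ball, the projector has the closed form π_Q(x) = x·min(1, r/|x|), which makes the inequality of Exercise 4.7.3 easy to check numerically; the test data below are my own:

```python
import math, random

def project_ball(x, r):
    # closed-form projector onto the Euclidean ball {y : |y| <= r}
    nrm = math.sqrt(sum(v * v for v in x))
    return list(x) if nrm <= r else [v * r / nrm for v in x]

def dist2(a, b):
    return sum((av - bv) ** 2 for av, bv in zip(a, b))

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-3, 3) for _ in range(4)]
    u = project_ball([random.uniform(-3, 3) for _ in range(4)], 1.0)  # a point of Q
    px = project_ball(x, 1.0)
    # |pi_Q(x) - u|^2 <= |x - u|^2 - |x - pi_Q(x)|^2, up to rounding
    assert dist2(px, u) <= dist2(x, u) - dist2(x, px) + 1e-9
print("projection inequality verified on 1000 random pairs")
```

The same check works for any convex set with a computable projector (a box, a simplex), which is exactly why the cheap-projection domains mentioned earlier are attractive in practice.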
where 2ρ is the Euclidean diameter of Q and the unit vector e_i is defined as follows:
(i) if x_i ∉ int G, then e_i is an arbitrary unit vector which separates x_i and G:
(x − x_i)^T e_i ≤ 0, x ∈ G;
(ii) if x_i ∈ int G, but there is a constraint g_j which is ε_i-violated at x_i, i.e., is such that
g_j(x_i) > ε_i [ max_{x∈G} ( g_j(x_i) + (x − x_i)^T g_j'(x_i) ) ]₊ ,   (4.7.35)
then
e_i = g_j'(x_i) / |g_j'(x_i)|;
where r_in is the maximal radius of a Euclidean ball contained in G. Then among the search points x_{N'}, x_{N'+1}, ..., x_N there are productive ones, and the best of them (i.e., the one with the smallest value of the objective), x̄_{N',N}, is an ε_N-solution to (p).
Derive from this result that in the case of problems without functional constraints (m = 0), where the ε_i do not influence the process at all, the relation
ε(N) < 1
implies that the best of the productive search points found in the course of the first N steps is well-defined and is an ε(N)-solution to (p).
Hint: follow the line of argument of the original proof of Proposition 4.3.1. Namely, apply the proof to the shifted process which starts at x_{N'} and uses at its i-th iteration, i ≥ 1, the stepsize γ_{i+N'−1} and the tolerance ε_{i+N'−1}. This process differs from the one considered in the lecture in two respects:
(1) the presence of a time-varying tolerance in detecting productivity, and an arbitrary step, instead of termination, when a productive search point with vanishing subgradient of the objective is met;
(2) the exploitation of the projection onto Q ⊇ G when updating the search points.
To handle (1), prove the following version of Proposition 3.3.1 (Lecture 3):
Assume that we are generating a sequence of search points x_i ∈ Rⁿ and associate with these points vectors e_i and approximate solutions x̄_i in accordance with (i)-(iii). Let
G_i = {x ∈ G | (x − x_j)^T e_j ≤ 0, 1 ≤ j ≤ i},
(if G_M is not a solid, then, by definition, Size(G_M) = 0). Then among the search points x₁, ..., x_M there are productive ones, and the best (with the smallest value of the objective) of these productive points is an ε₁-solution to the problem.
To handle (2), note that when estimating InnerRad(G_N) we used certain equalities, and we would be quite satisfied if "=" in these relations were replaced with "≤"; in view of Exercise 4.7.3, this replacement is exactly what the projection does.
Looking at the statement given by Exercise 4.7.4, we may ask ourselves what could be a reasonable way to choose the stepsizes γ_i and the tolerances ε_i. Let us start with the case of problems without functional constraints, where we can forget about the tolerances - they do not influence the process. What we are interested in is to minimize over the stepsizes the quantities ε(N). For a given pair of positive integers M ≤ N, the minimum of the quantity
Λ(M : N) = (2 + ½ Σ_{j=M}^{N} γ_j²) / (Σ_{j=M}^{N} γ_j)
over positive γ_j is attained when γ_j = 2(N − M + 1)^{−1/2}, M ≤ j ≤ N, and is equal to
2(N − M + 1)^{−1/2};
thus, to minimize ε(N) for a given N, one should set γ_j = 2N^{−1/2}, j = 1, ..., N, which would result in
ε(N) = (2ρ/r_in) N^{−1/2}.
This is, basically, the choice of stepsizes we used in the short-step version of the Subgradient Descent; an unpleasant property of this choice is that it is tied to N, and we would like to avoid the necessity of fixing in advance the number of steps allowed for the method. A natural idea is to use the recommendation γ_j = 2N^{−1/2} in a sliding way, i.e., to set
γ_j = 2 j^{−1/2}, j = 1, 2, ...   (4.7.38)
Let us look at what the quantities ε(N) will be for the stepsizes (4.7.38).
Exercise 4.7.5 # Prove that for the stepsizes (4.7.38) one has
ε(N) ≤ (ρ/r_in) Λ(]N/2[ : N) ≤ c (ρ/r_in) N^{−1/2}
with a certain absolute constant c. Compute the constant.
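Assuming the quantity in question is Λ(M : N) = (2 + ½ Σ_{j=M}^{N} γ_j²)/(Σ_{j=M}^{N} γ_j) (my reconstruction) and taking ]N/2[ as ⌈N/2⌉, the constant of Exercise 4.7.5 can be estimated numerically:

```python
import math

def Lambda(M, N):
    # Lambda(M:N) with the sliding stepsizes gamma_j = 2 j^{-1/2} of (4.7.38)
    gammas = [2 * j ** -0.5 for j in range(M, N + 1)]
    return (2 + 0.5 * sum(g * g for g in gammas)) / sum(gammas)

# c ~ sup_N sqrt(N) * Lambda(ceil(N/2) : N), estimated over a range of N
c = max(math.sqrt(N) * Lambda(math.ceil(N / 2), N) for N in range(2, 2001))
print(f"estimated absolute constant: {c:.3f}")
```

For large N the numerator tends to 2 + 2 ln 2 while the denominator behaves like 4(1 − 1/√2)√N, so the supremum is approached from below as N grows; the scan above gives a value close to that limit.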
We see that the stepsizes (4.7.38) result in an optimal, up to an absolute constant factor, rate of convergence of the quantities ε(N) to 0 as N → ∞. Thus, when solving problems without functional constraints, it is reasonable to use the aforementioned Subgradient Descent with the stepsizes (4.7.38); according to the second statement of Exercise 4.7.4 and Exercise 4.7.5, for all N such that
ε(N) ≤ c (ρ/r_in) N^{−1/2} < 1,
the best of the productive search points found in the course of the first N steps is well-defined and solves the problem within relative accuracy ε(N).
Now let us look at problems with functional constraints. It is natural to use here the same rule (4.7.38); the only question now is how to choose the tolerances ε_i. A reasonable policy would be something like
ε_i = min{0.9999, 1.01 (ρ/r_in) i^{−1/2}}.   (4.7.39)
Exercise 4.7.6 # Prove that the Subgradient Descent with stepsizes (4.7.38) and tolerances (4.7.39), as applied to a problem (p) from the family Pm(G), possesses the following convergence properties: for all N such that
c (ρ/r_in) N^{−1/2} < 0.99,
among the search points x_{]N/2[}, x_{]N/2[+1}, ..., x_N there are productive ones, and the best (with the smallest value of the objective) of these points solves (p) within relative inaccuracy not exceeding
ϑ (ρ/r_in) N^{−1/2},
ϑ being an absolute constant.
Note that if one chooses Q = V_out (i.e., ρ = R, so that ρ/r_in = θ is the asphericity of
G), then the indicated rate of convergence results in the same (up to an absolute constant
factor) complexity of solving problems from the family within relative accuracy ε as for the
basic short-step Subgradient Descent.
Looking at the relation underlying the Subgradient Descent,

x_{i+1} = π_G( x_i − γ_i f'(x_i)/|f'(x_i)| )  ⟹  |x_{i+1} − x*|² ≤ |x_i − x*|² − 2γ_i (x_i − x*)ᵀ f'(x_i)/|f'(x_i)| + γ_i²,

one should be surprised. Indeed, all of us know the origin of the gradient descent: if f is
smooth, a step in the antigradient direction decreases the first-order expansion of f and
therefore, for a reasonably chosen stepsize, decreases f itself. Note that this standard
reasoning has nothing in common with the above one: we deal with a nonsmooth f, and it
need not decrease in the direction of an anti-subgradient independently of how small the
stepsize is; there is a subgradient in the subgradient set which actually possesses the desired
property, but this is not necessarily the subgradient used in the method, and even with the
good subgradient you could say nothing about the amount by which the objective can be
decreased. The correct reasoning deals with the algebraic structure of the Euclidean norm rather
than with the local behavior of the objective, which is very surprising; it is a kind of miracle.
But we are interested in understanding, not in miracles. Let us try to understand what is
behind the phenomenon we have met.
First of all, what is a subgradient? Is it actually a vector? The answer, of course, is no.
Given a convex function f defined on an n-dimensional vector space E and an interior point
x of the domain of f, you can define a nonempty set of support functionals - linear forms
f'(x)[h] of h ∈ E which are support to f at x, i.e., such that

f(y) ≥ f(x) + f'(x)[y − x] for all y;

these forms are intrinsically associated with f and x. Now, having chosen somehow a
Euclidean structure (·,·) on E, you may associate with the linear forms f'(x)[h] vectors f'(x)
from E in such a way that

f'(x)[h] = (f'(x), h),  h ∈ E,

thus coming from support functionals to subgradients-vectors. The crucial point is that
these vectors are not defined by f and x only; they also depend on what is the Euclidean
structure on E we use. Of course, normally we think of an n-dimensional space as of the
coordinate space R^n with once for ever fixed Euclidean structure, but this habit sometimes
is dangerous; the problems we are interested in are defined in affine terms, not in the metric
ones, so why should we always look at the problems via a certain once for ever fixed Euclidean
structure which has nothing in common with the problem? Developing systematically this
evident observation, one may come to the most advanced and recent convex optimization
methods, like the polynomial time interior point ones. Our goal now is much more modest,
but we also shall get profit from the aforementioned observation. Thus, once more: the
correct objects associated with f and x are not vectors from E, but elements of the dual
to E space E* of linear forms on E. Of course, E* is of the same dimension as E and
therefore it can be identified with E; but there are many ways to identify these spaces, and
no one of them is natural, more preferable than others.
Since the support functionals f'(x)[h] live in the dual space, the Gradient Descent
cannot avoid the necessity to identify somehow the initial - primal - and the dual space, and
this is done via the Euclidean structure the method is related to - as it was already explained,
this is what allows to associate with a support functional - something which actually exists,
but belongs to the dual space - a subgradient, a vector belonging to the primal space; in
We see that all standard identifications of the primal and the dual spaces, i.e., those given
by Euclidean structures, are covered by our mappings φ ↦ V'(φ); the corresponding V's are,
up to the factor 1/2, squared Euclidean norms. A natural question is what are the mappings
associated with other squared norms.
‖φ‖* = max{ φ[x] | x ∈ E, ‖x‖ ≤ 1 }
Now, in the presented form of the Subgradient Descent there is nothing from the fact that
‖·‖ is a Euclidean norm; the only property of the norm which we actually need is the
differentiability of the associated function V. Thus, given a norm ‖·‖ on E which induces
a differentiable outside 0 conjugate norm on the conjugate space, we can write down a certain
method for minimizing convex functions over E. How could we analyze the convergence
properties of the method? In the case of the usual Subgradient Descent the proof of
convergence was based on the fact that the anti-gradient direction −f'(x) is a descent direction
and the quantity ⟨f'(x), x − x*⟩ is, up to the constant factor 2, the derivative of the function
|x − x*|² in the direction f'(x). Could we say something similar in the general case, where,
according to (4.7.40), we should deal with the situation x = V'(φ)? With this substitution,
the left hand side of (4.7.41) becomes

⟨f'(x), x* − V'(φ)⟩ = d/dt |_{t=0} V⁺(φ − t f'(x)),   V⁺(φ) = V(φ) − ⟨φ, x*⟩.

Thus, we can associate with (4.7.40) the function

V⁺(φ) = V(φ) − ⟨φ, x*⟩,   (4.7.42)

x* being a minimizer of f, and the derivative of this function in the direction −f'(V'(φ)) of
the trajectory (4.7.40) is nonpositive:

−⟨f'(V'(φ)), (V⁺)'(φ)⟩ ≤ f(x*) − f(V'(φ)) ≤ 0,   φ ∈ E*.   (4.7.43)
Now we may try to reproduce the reasoning which leads to the rate-of-convergence estimate
for the Subgradient Descent in our new situation, where we speak about the process (4.7.40)
associated with an arbitrary norm ‖·‖ on E (the norm should result, of course, in a continuously
differentiable V).
For the sake of simplicity, let us restrict ourselves to the simple case when V possesses
a Lipschitz continuous derivative. Thus, from now on let ‖·‖ be a norm on E such that the
mapping

φ ↦ V'_{‖·‖}(φ) : E* → E

is Lipschitz continuous, and let

L = sup{ ‖V'_{‖·‖}(φ) − V'_{‖·‖}(φ')‖ / ‖φ − φ'‖* : φ ≠ φ', φ, φ' ∈ E* }.

For the sake of brevity, from now on we write V instead of V_{‖·‖}.
Exercise 4.7.9 Prove that

V(φ + ψ) ≤ V(φ) + ⟨ψ, V'(φ)⟩ + (L/2)‖ψ‖*²,   φ, ψ ∈ E*,
and let x̄_i be the best (with the smallest value of f) of the points x_1, ..., x_i, and let
ε_i = f(x̄_i) − min_E f. Prove that then

ε_N ≤ L_{‖·‖}(f) · ( ‖x*‖² + L Σ_{i=1}^N γ_i² ) / ( 2 Σ_{i=1}^N γ_i ),   N = 1, 2, ...   (4.7.46)

where L_{‖·‖}(f) is the Lipschitz constant of f with respect to the norm ‖·‖. In particular, the
method converges, provided that

Σ_i γ_i = ∞,   γ_i → 0 as i → ∞.
Hint: use the result of Exercise 4.7.9 and (4.7.43) to demonstrate that

V⁺(φ_{i+1}) ≤ V⁺(φ_i) − γ_i (f(x_i) − f*)/|f'(x_i)| + (L/2)γ_i²,   V⁺(φ) = V(φ) − ⟨φ, x*⟩,
and then act exactly as in the case of the usual Subgradient Descent. Note that the basic
result explains what is the origin of the Subgradient Descent miracle which motivated our
considerations; as we see, this miracle comes not from the very specific algebraic structure of
the Euclidean norm, but from a certain robust analytic property of the norm (the Lipschitz
continuity of the derivative of the conjugate norm), and we can fabricate similar miracles
for arbitrary norms which share the indicated property. In fact you could use the outlined
Mirror Descent scheme, developed by Nemirovski and Yudin, with necessary (and more or
less straightforward) modifications, in order to extend everything that we know about the
usual - Euclidean - Subgradient Descent (I mean, the versions for optimization over a
domain rather than over the whole space and for optimization over solids under functional
constraints) to the general non-Euclidean case, but we skip these issues here.
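A minimal concrete instance of the outlined Mirror Descent scheme, added here for illustration: with the entropy potential on the standard simplex, the mapping V'(φ) becomes the softmax map, and the method accumulates subgradients in the dual space. The linear objective and the stepsizes below are illustrative choices, not data from the notes.

```python
import math

def mirror_descent_simplex(grad, n, stepsizes):
    # Dual trajectory phi_{j+1} = phi_j - gamma_j * f'(x_j); primal points
    # x_j = V'(phi_j), which for the entropy potential is the softmax map.
    phi = [0.0] * n
    avg = [0.0] * n
    count = 0
    for gamma in stepsizes:
        m = max(phi)                       # subtract max for numerical stability
        w = [math.exp(p - m) for p in phi]
        s = sum(w)
        x = [wi / s for wi in w]           # x = V'(phi), a point of the simplex
        gvec = grad(x)
        phi = [p - gamma * gi for p, gi in zip(phi, gvec)]
        avg = [a + xi for a, xi in zip(avg, x)]
        count += 1
    return [a / count for a in avg]        # averaged search point

# illustrative problem: minimize the linear function <c, x> over the simplex
c = [0.3, 0.1, 0.5]
x_avg = mirror_descent_simplex(lambda x: c, 3,
                               [math.sqrt(2.0 / (j + 1)) for j in range(500)])
f_val = sum(ci * xi for ci, xi in zip(c, x_avg))
```

The averaged point concentrates on the coordinate with the smallest cost, so f_val approaches min_i c_i = 0.1.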
Here G is a subset of R^n.
Monte-Carlo method: to run not a single, but many simulations, for a fixed value of the
design parameters, and to take the empirical average of the observed random quantities as
an estimate of their expected values. According to the well-known results on the rate of
convergence of the Monte-Carlo method, to estimate the expected values within inaccuracy ε
it requires O(1/ε²) simulations, and this is just to get the estimate of the objective and the
constraints of problem (4.7.47) at a single point! Now imagine that we are going to treat
(4.7.47) as a usual black-box represented optimization problem and intend to imitate the
usual first-order oracle for it via the aforementioned Monte-Carlo estimator. In order to get
an ε-solution to the problem, we, even in good cases, need to estimate within accuracy O(ε)
the objective and the constraints along the search points. It means that the method will
require many more simulations than the aforementioned O(1/ε²): this quantity should be
multiplied by the information-based complexity of the optimization method we are going to
use. As a result, the indicated approach in most of the cases results in inappropriately long
computations.
An extremely surprising thing is that there exists another way to solve the problem.
This way, under reasonable convexity assumptions, results in an overall number of O(1/ε²)
computations only - as if there were no optimization at all and the only goal were to estimate
the objective and the constraints at a given point. The subject of our today's lecture is this
other way - Stochastic Approximation.
To get a convenient framework for presenting Stochastic Approximation, it is worthy to
modify a little the way we are looking at our problem. Assume that when solving it, we
are allowed to generate a random sample ξ_1, ξ_2, ... of the random factors involved into the
problem; the elements of the sample are assumed to be mutually independent and distributed
according to P. Assume also that given x and ξ, we are able to compute the value F(x, ξ)
and the gradient ∇_x F(x, ξ) of the integrand in (4.7.47). Note that under mild regularity
assumptions the differentiation with respect to x and taking expectation are interchangeable:

f(x) = ∫ F(x, ξ) dP(ξ),   f'(x) = ∫ ∇_x F(x, ξ) dP(ξ).   (4.7.48)
It means that the situation is covered by the following model of an optimization method
solving (4.7.47):
At a step i, we (the method) form the i-th search point x_i and forward it to the oracle which
we have at our disposal. The oracle returns the quantities

F(x_i, ξ_i),   ∇_x F(x_i, ξ_i)

(in our previous interpretation it means that a single simulation of the stochastic system
in question is performed), and this answer is the portion of information on the problem we
get on the step in question. Using the information accumulated so far, we generate the new
search point x_{i+1}, again forward it to the oracle, enrich our accumulated information by its
answer, and so on.
The presented scheme is a very natural definition of a method based on a stochastic first
order oracle capable of providing the method with random unbiased (see (4.7.48)) estimates of
the values and the gradients of the objective and the constraints of (4.7.47). Note that the
estimates are not only unbiased, but also form a kind of Markov chain: the distribution of
the answers of the oracle at a point depends only on the point, not on the previous answers
(recall that {ξ_i} are assumed to be independent).
Now, for our further considerations it is completely unimportant that the observation
of f(x) comes from the value, and the observation of f'(x) comes from the gradient. It
suffices to postulate the following:
From now on we assume that the oracle is such that L < ∞. The quantity L will be
called the intensity of the oracle at the problem in question; in what follows it plays the
same role as the Lipschitz constant of the objective in large-scale minimization of Lipschitz
continuous convex functions.
² Of course, the model we are about to present makes sense not only for convex programs; but the methods
we are interested in will, as always, work well only in the convex case, so that we lose nothing when imposing
the convexity assumption from the very beginning.
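The oracle protocol above can be sketched as follows, on an illustrative model problem that is not from the notes: F(x, ξ) = (x − ξ)²/2 with ξ ~ N(0, 1), so that f(x) = (x² + 1)/2, and the oracle's answers are unbiased estimates of f(x) and f'(x) = x.

```python
import math
import random

def stochastic_oracle(x, xi):
    # Unbiased oracle for f(x) = E_xi F(x, xi), F(x, xi) = (x - xi)^2 / 2,
    # xi ~ N(0, 1): E F(x, xi) = (x^2 + 1)/2, E dF/dx(x, xi) = x = f'(x).
    return (x - xi) ** 2 / 2.0, (x - xi)

def stochastic_method(x0, n_steps, seed=0):
    # At step i the method forwards x_i to the oracle, receives the answers at
    # a fresh independent xi_i (so the answer's distribution depends on the
    # point only), moves with stepsize 1/sqrt(i), and averages the iterates.
    rng = random.Random(seed)
    x, total = x0, 0.0
    for i in range(1, n_steps + 1):
        xi = rng.gauss(0.0, 1.0)
        _, g = stochastic_oracle(x, xi)
        x -= g / math.sqrt(i)
        total += x
    return total / n_steps

x_bar = stochastic_method(5.0, 4000)   # the minimizer of f is x* = 0
```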
4.7. EXERCISES 105
Hint: follow the lines of the proof of the estimate (4.5.16) in Section 4.5; substitute d²_{i+1}
with v_i ≡ E|x_{i+1}(ξ^i) − x*|².
Comments: The statement and the proof of Theorem 4.7.1 are completely similar to the
related deterministic considerations of Section 4.5. The only difference is that now we are
estimating from above the expected inaccuracy of the N-th approximate solution; this is quite
natural, since the stochastic nature of the process makes it impossible to say something
reasonable about the quality of every realization of the random vector x_N = x_N(ξ^{N−1}).
It turns out that the rate of convergence established in (4.7.56) cannot be improved.
Namely, it is not difficult to prove the following statement.
Proposition 4.7.1 For every L > 0, any D > 0, any positive integer N and any stochastic-
oracle-based N-step method M of minimizing univariate convex functions over the segment
G = [0, D] on the axis, there exists a linear function f and a stochastic oracle with intensity
L on the function such that

E[ f(x̄_N) − min_G f ] ≥ O(1) · LD/√N,

x̄_N being the result formed by the method as applied to the problem f. Here O(1) is a properly
chosen positive absolute constant.
Note that in the deterministic case the rate of convergence O(1/√N) was unimprovable only
in the large-scale case; in contrast to this, in the stochastic case this rate becomes optimal
already when we are minimizing univariate linear functions.
The convergence rate O(1/√N) can be improved only if the objective is strongly convex. The
simplest and the most important result here is as follows (the proof is completely similar to
that of Theorem 4.7.1):
f(x*) + (θ/2)|x − x*|² ≤ f(x) ≤ f(x*) + (Θ/2)|x − x*|²,   x ∈ G,   (4.7.57)

with certain positive θ and Θ. Consider the process (4.7.52) with the stepsizes

γ_i = γ/i,   (4.7.58)

γ being a positive scale factor satisfying the relation

γθ > 1.   (4.7.59)
Then

ε_N ≡ E( f(x_N) − min_G f ) ≤ c(γθ) · (Θ D² + γ² L²)/N,   (4.7.60)

where D is the diameter of G, L is the intensity of the oracle at the problem in question and
c(·) is a certain problem-independent function on (1, ∞).
The algorithm (4.7.52) with the stepsizes (4.7.58) and the approximate solutions identical to
the search points is called the classical Stochastic Approximation; it originates from Kiefer
and Wolfowitz.
The good news about the algorithm is its rate of convergence: O(1/N) instead
of O(1/√N). The bad news is that this better rate is ensured only for problems
satisfying (4.7.57), and that the rate of convergence is very sensitive to the choice of the scale
factor γ in the stepsize formula (4.7.58): if this scale factor does not satisfy (4.7.59), the
rate of convergence may become worse in order. To see this, consider the following simple
example: the problem is

f(x) = ½x²,   [θ = Θ = 1];   G = [−1, 1];   x_1 = 1,
the observations are given by

F'(x, ξ) = x + ξ   [= f'(x) + ξ],

and ξ_i are standard Gaussian random variables (Eξ_i = 0, Eξ_i² = 1); we do not specify
F(x, ξ), since it is not used in the algorithm. In this example, the best choice of γ is γ = 1 (in
this case one can make (4.7.59) an equality rather than a strict inequality due to the extreme
simplicity of the objective). For this choice of γ one has

E f(x_{N+1}) ≡ E[ f(x_{N+1}) − min_x f(x) ] ≤ 1/(2N),   N ≥ 1.

In particular, it takes no more than 50 steps to reach expected inaccuracy not exceeding
0.01.
Now assume that when solving the problem we overestimate the quantity θ and choose the
stepsizes according to (4.7.58) with γ = 0.1. How many steps do we need in this case to
reach the same expected inaccuracy 0.01 - 500, 5000, or what? The answer is astonishing:
approximately 1,602,000 steps. And with γ = 0.05 (20 times less than the optimal value of
the parameter) the same accuracy costs more than 5.2 × 10^14 steps!
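The sensitivity just described is easy to reproduce numerically; the sketch below runs the classical Stochastic Approximation on this very example (f(x) = x²/2, G = [−1, 1], unit Gaussian noise), with the trial and step counts chosen for illustration.

```python
import random

def classical_sa(gamma, n_steps, trials=300, seed=1):
    # Classical SA on f(x) = x^2/2 over G = [-1, 1] with observations
    # F'(x, xi) = x + xi, xi ~ N(0, 1), stepsizes gamma_i = gamma/i, x_1 = 1.
    # Returns a Monte-Carlo estimate of E f(x_{N+1}).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = 1.0
        for i in range(1, n_steps + 1):
            x -= (gamma / i) * (x + rng.gauss(0.0, 1.0))
            x = max(-1.0, min(1.0, x))        # projection onto G
        total += x * x / 2.0
    return total / trials

good = classical_sa(1.0, 400)   # gamma = 1: expected error about 1/(2N)
bad = classical_sa(0.1, 400)    # gamma underestimated: dramatically slower
```

With γ = 0.1 the deterministic part of the error decays only like N^{-0.1}, so after 400 steps the expected inaccuracy is larger than with γ = 1 by orders of magnitude.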
We see how dangerous the classical rule (4.7.58) for the stepsizes is: underestimating
of γ (≡ overestimating of θ) may kill the procedure completely. And where from, in more or
less complicated cases, could we take a reasonable estimate of θ? It should be said that there
exist stable versions of the classical Stochastic Approximation (they, same as our version of
the routine, use large, as compared to O(1/i), stepsizes and take, as approximate solutions,
certain averages of the search points). These stable versions of the method are capable of
reaching (under assumptions similar to (4.7.57)) the O(1/N)-rate of convergence, even with
the asymptotically optimal coefficient at 1/N. Note, anyhow, that the nondegeneracy
assumption (4.7.57) is crucial for the O(1/N)-rate of convergence; if it is removed, the best
possible rate, as we know from Proposition 4.7.1, becomes O(1/√N), and this is the rate
given by our robust Stochastic Approximation with large steps and averaging.
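For comparison, here is a sketch of the "large steps plus averaging" recipe on the same model problem; note that it uses no estimate of the strong-convexity parameter θ at all.

```python
import math
import random

def robust_sa(n_steps, trials=300, seed=2):
    # "Large" stepsizes gamma_i = 1/sqrt(i) plus averaging of the search
    # points, on the model f(x) = x^2/2, G = [-1, 1], unit Gaussian noise.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x, s = 1.0, 0.0
        for i in range(1, n_steps + 1):
            x -= (x + rng.gauss(0.0, 1.0)) / math.sqrt(i)
            x = max(-1.0, min(1.0, x))        # projection onto G
            s += x
        x_bar = s / n_steps                   # averaged search point
        total += x_bar * x_bar / 2.0
    return total / trials

err = robust_sa(400)   # Monte-Carlo estimate of E f(x_bar)
```

Unlike the classical rule with a misestimated γ, the averaged iterate stays accurate without any tuning, at the price of the slower guaranteed O(1/√N) rate in the absence of (4.7.57).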
Lecture 5
Nonlinear programming:
Unconstrained Minimization
(Relaxation; Gradient method; Rate of convergence; Newton method; Gradient Method and
Newton Method: What is different? Idea of Variable Metric; Variable Metric Methods;
Conjugate Gradient Methods)
Here we discuss the classical methods of nonlinear programming. These methods
constitute the starting point of optimization theory - it is here that the history of optimization
began. Our objective will also be to get a foretaste of the new life that some very old ideas
have found in convex optimization, which has yielded the most advanced algorithmic
techniques currently available.
5.1 Relaxation
We have already mentioned in the first lecture of this course that the main goal of general
nonlinear programming is to find a local solution to a problem defined by differentiable
functions. In general, the global structure of these problems is not much simpler than that
of the problems defined by Lipschitz continuous functions. Therefore, even for such restricted
goals, it is necessary to follow some special principles, which guarantee the convergence of
the minimization process.
The majority of nonlinear programming methods are based on the idea of relaxation:
We call the sequence {a_k}_{k=0}^∞ a relaxation sequence if a_{k+1} ≤ a_k for all k ≥ 0.
In this section we consider several methods for solving the unconstrained minimization
problem

min_{x∈R^n} f(x),   (5.1.1)

where f(x) is a smooth function. To solve this problem, we can try to generate a relaxation
sequence {f(x_k)}_{k=0}^∞:

f(x_{k+1}) ≤ f(x_k),   k = 0, 1, ... .

If we manage to do that, then we immediately have the following important consequences:
0). Choose x_0 ∈ R^n.
1). Iterate

x_{k+1} = x_k − h_k f'(x_k),   k = 0, 1, ... .   (5.2.2)

This is the scheme of the gradient method. The factor h_k multiplying the gradient in this
scheme is called the step size. Of course, it is reasonable to choose the step size positive.
There are many variants of this method, which differ one from another by the step-size
strategy. Let us consider the most important ones.
1. The sequence {h_k}_{k=0}^∞ is chosen in advance, before the gradient method starts its job.
For example,

h_k = h > 0   (constant step),

or

h_k = h/√(k + 1).

2. Full relaxation:

h_k = arg min_{h≥0} f(x_k − h f'(x_k)).
3. Goldstein-Armijo rule: denote φ(h) = f(x − h f'(x)), h ≥ 0.
5.2. GRADIENT METHOD 111
Then the step-size values acceptable for this strategy belong to the part of the graph of φ
which is located between two linear functions:
Note that φ(0) = φ_1(0) = φ_2(0) and φ'(0) < φ_2'(0) < φ_1'(0) < 0. Therefore, the
acceptable values exist unless φ(h) is unbounded from below. There are several very fast
one-dimensional procedures for finding a point satisfying the conditions of this strategy, but
their description is not so important for us now.
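The Goldstein-Armijo conditions (5.2.3)-(5.2.4) themselves are stated before this excerpt; as a stand-in, here is the standard backtracking (Armijo) search over a geometric grid of stepsizes, with illustrative constants α and β.

```python
def backtracking_step(f, grad_x, x, h0=1.0, alpha=0.25, beta=0.5):
    # Accept the largest h in {h0, beta*h0, beta^2*h0, ...} satisfying the
    # sufficient-decrease condition f(x - h*f'(x)) <= f(x) - alpha*h*|f'(x)|^2.
    fx = f(x)
    h = h0
    while f(x - h * grad_x) > fx - alpha * h * grad_x * grad_x:
        h *= beta
    return h

# one step on f(x) = x^4 from x = 1, where f'(1) = 4
f = lambda x: x ** 4
h = backtracking_step(f, 4.0, 1.0)
x1 = 1.0 - h * 4.0
```

Here the grid h = 1, 0.5, 0.25 is tried in turn; h = 0.25 is the first value with sufficient decrease, and the resulting point x1 = 0 strictly decreases f.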
Let us estimate now the performance of the gradient method. Consider the problem

min_{x∈R^n} f(x),

with f ∈ C_L^{1,1}(R^n). The latter means (cf. the definition in Section 5.6.4) that f is continuously
differentiable on R^n and its derivative is Lipschitz continuous on R^n with the constant L:

‖f'(x) − f'(y)‖ ≤ L‖x − y‖

for all x, y ∈ R^n. Let us also assume that f(x) is bounded from below on R^n.
Let us evaluate first the result of one step of the gradient method. Consider y = x − h f'(x).
Then, in view of (5.6.32), we have:

f(y) ≤ f(x) + ⟨f'(x), y − x⟩ + (L/2)‖y − x‖²
     = f(x) − h‖f'(x)‖² + (h²L/2)‖f'(x)‖² = f(x) − h(1 − (h/2)L)‖f'(x)‖².   (5.2.5)
Thus, in order to get the best estimate for the possible decrease of the objective function,
we have to solve the following one-dimensional problem:

Δ(h) = −h(1 − (h/2)L) → min_h.

Computing the derivative of this function, we conclude that the optimal step size must satisfy
the equation Δ'(h) = hL − 1 = 0. Thus, it can only be h* = 1/L, and this is a minimum of
Δ(h) since Δ''(h) = L > 0.
Thus, our considerations prove that one step of the gradient method can decrease the
objective function as follows:

f(y) ≤ f(x) − (1/(2L))‖f'(x)‖².
Let us check what is going on with our step-size strategies.
Let x_{k+1} = x_k − h_k f'(x_k). Then for the constant step strategy, h_k = h, we have:

f(x_k) − f(x_{k+1}) ≥ h(1 − (h/2)L)‖f'(x_k)‖².

Therefore, if we choose h_k = 2α/L with α ∈ (0, 1), then

f(x_k) − f(x_{k+1}) ≥ (2/L) α(1 − α)‖f'(x_k)‖².

Of course, the optimal choice is h_k = 1/L.
For the full relaxation strategy we have

f(x_k) − f(x_{k+1}) ≥ (1/(2L))‖f'(x_k)‖²,

since the maximal decrease cannot be less than that with h_k = 1/L.
Finally, for the Goldstein-Armijo rule, in view of (5.2.4) we have:
where f* is the optimal value of problem (5.1.1). As a simple conclusion of (5.2.7) we have:

‖f'(x_k)‖ → 0 as k → ∞.

However, we can say something about the convergence rate. Indeed, denote

g_N* = min_{0≤k≤N} g_k,   g_k = ‖f'(x_k)‖.
Therefore there are only three points which can be a local minimum of this function:

x_1* = (0, 0),   x_2* = (0, 1),   x_3* = (0, −1).

We conclude that x_2* and x_3* are isolated local minima¹, while x_1* is only a stationary point
of our function. Indeed, f(x_1*) = 0 and

f(x_1* + εe_2) = ε⁴/4 − ε²/2 < 0

for ε small enough.
Now, let us consider the trajectory of the gradient method, which starts from x_0 = (1, 0).
Note that the second coordinate of this point is zero. Therefore, the second coordinate of
f'(x_0) is also zero. Consequently, the second coordinate of x_1 is zero, etc. Thus, the entire
sequence of points generated by the gradient method will have the second coordinate equal
to zero. This means that this sequence can converge to x_1* only.
To conclude our example, note that this situation is typical for all first-order
unconstrained minimization methods. Without additional very strict assumptions, it is impossible
to guarantee the global convergence of the minimizing sequence to a local minimum; one can
guarantee convergence only to a stationary point.
¹ In fact, in our example they are the global solutions.
We can now write down some (upper) complexity estimates for a certain class of
optimization problems. Let us look at the following example.

Solution: find x̄ ∈ R^n such that f(x̄) ≤ f(x_0), ‖f'(x̄)‖ ≤ ε.

Recall that the function class C_L^{1,1}(R^n) is defined as follows (cf. the definition in Section
5.6.4):

‖f'(x) − f'(y)‖ ≤ L‖x − y‖

for all x, y ∈ R^n.
Note that (5.2.8) can be used to obtain an upper bound for the number of steps (= calls of
the oracle) which is necessary to find a point with a small norm of the gradient. For that,
let us write out the following inequality:

g_N* ≤ [ (1/(N + 1)) · (1/ω) · L(f(x_0) − f*) ]^{1/2},

ω being the step-size-dependent constant from (5.2.7).
Let us check what can be said about the local convergence of the gradient method. Let
us consider the unconstrained minimization problem

min_{x∈R^n} f(x)

under the following assumptions:

1. f ∈ C_M^{2,2}(R^n) (the class of twice differentiable functions with Lipschitz continuous
Hessian; recall that for f ∈ C_M^{2,2}(R^n) we have

‖f''(x) − f''(y)‖ ≤ M‖x − y‖

for all x, y ∈ R^n).

2. There exists a local minimum x* of the function f at which the Hessian is positive definite.

Then

f'(x_k) = f'(x_k) − f'(x*) = ∫₀¹ f''(x* + τ(x_k − x*))(x_k − x*) dτ = G_k(x_k − x*),

where G_k = ∫₀¹ f''(x* + τ(x_k − x*)) dτ. Therefore
There is a standard technique for analyzing processes of this type, which is based on
contracting mappings. Recall that for a process

a_0 ∈ R^n,   a_{k+1} = A_k a_k,

where A_k are (n × n) matrices such that ‖A_k‖ ≤ 1 − q with q ∈ (0, 1), we can estimate the rate
of convergence of the sequence {a_k} to zero:

‖a_{k+1}‖ ≤ (1 − q)‖a_k‖ ≤ (1 − q)^{k+1}‖a_0‖ → 0.
Hence,

(1 − h_k(L + (r_k/2)M)) I_n ≤ I_n − h_k G_k ≤ (1 − h_k(l − (r_k/2)M)) I_n.

Thus,

‖I_n − h_k G_k‖ ≤ max{a_k(h_k), b_k(h_k)},   (5.2.10)

a_k(h) = 1 − h(l − (r_k/2)M),   b_k(h) = h(L + (r_k/2)M) − 1.

Note that a_k(0) = 1 and b_k(0) = −1. Therefore, if r_k < r̄ ≡ 2l/M, then a_k(h) is a strictly
decreasing function of h and we can ensure ‖I_n − h_k G_k‖ < 1 for small enough h_k. In this
case we will have r_{k+1} < r_k.
As usual, many step-size strategies are possible. For example, we can choose h_k = 1/L. Let
us consider the optimal strategy consisting of minimizing the right hand side of (5.2.10):

max{a_k(h), b_k(h)} → min_h.

Let us assume that r_0 < r̄. Then, if we form the sequence {x_k} using this strategy, we can be
sure that r_{k+1} < r_k < r̄. Therefore, the optimal step size h_k* can be found from the equation

a_k(h) = 1 − h(l − (r_k/2)M) = h(L + (r_k/2)M) − 1 = b_k(h).

Hence

h_k* = 2/(L + l).   (5.2.11)

Under this choice we obtain:

r_{k+1} ≤ ((L − l)/(L + l)) r_k + (M/(L + l)) r_k².
2l M
Let us estimate the rate of convergence. Denote q = L+l
and ak = r
L+l k
(< q). Then
ak (1 (ak q)2 ) ak
ak+1 (1 q)ak + a2k = ak (1 + (ak q)) = .
1 (ak q) 1 + q ak
1 1+q
Therefore ak+1
ak
1, or
q q(1 + q) q
1 q 1 = (1 + q) 1 .
ak+1 ak ak
Hence,
q q 2l L+l r
1 (1 + q)k 1 = (1 + q)k 1 = (1 + q)k 1 .
ak a0 L + l r0 M r0
Thus,
k
qr0 qr0 1 qr0 (1 q)k
ak .
r r0 )
r0 + (1 + q)k ( r r0 1+q r r0
This proves the following theorem.
5.3. NEWTON METHOD 117
Theorem 5.2.1 Let function f(x) satisfy our assumptions, and let the starting point x_0 be
close enough to a local minimum x*:

r_0 = ‖x_0 − x*‖ < r̄ = 2l/M.

Then the gradient method with the step size (5.2.11) converges with the following rate:

‖x_k − x*‖ ≤ ( r̄ r_0/(r̄ − r_0) ) · ( 1 − 2l/(L + 3l) )^k.

We call this rate of convergence linear.
φ(t) + φ'(t)Δt = 0.

We can expect that the solution of this equation, the displacement Δt, is a good
approximation to the optimal displacement Δt* = t* − t. Converting this idea into an algorithmic
form, we obtain the process

t_{k+1} = t_k − φ(t_k)/φ'(t_k).

This scheme can be naturally extended to the problem of finding a solution to a system
of nonlinear equations

F(x) = 0,

where x ∈ R^n and F(x): R^n → R^n. For that we have to define the displacement Δx as a
solution to the following linear system:

F(x) + F'(x)Δx = 0

(it is called the Newton system). The corresponding iterative scheme looks as follows:

x_{k+1} = x_k − [F'(x_k)]^{-1} F(x_k).
Finally, in view of Theorem 5.6.1 (necessary condition of a minimum), we can replace
the unconstrained minimization problem by the nonlinear system

f'(x) = 0.   (5.3.12)

(This replacement is not completely equivalent, but it works in nondegenerate situations.)
Further, for solving (5.3.12) we can apply the standard Newton method for systems of
nonlinear equations. In the optimization case the Newton system looks as follows:

f'(x) + f''(x)Δx = 0.

Hence, the Newton method for optimization problems appears in the following form:

x_{k+1} = x_k − [f''(x_k)]^{-1} f'(x_k).   (5.3.13)

Note that we can come to the process (5.3.13) using the idea of quadratic approximation.
Consider this approximation, computed at the point x_k:

φ(x) = f(x_k) + ⟨f'(x_k), x − x_k⟩ + ½⟨f''(x_k)(x − x_k), x − x_k⟩.

Assume that f''(x_k) > 0. Then we can choose x_{k+1} as the point of minimum of the quadratic
function φ(x). This means that

f'(x_k) + f''(x_k)(x_{k+1} − x_k) = 0,

which is exactly the process (5.3.13).
Thus, if |t_0| < 1, then this method converges and the convergence is extremely fast. The point
t_0 = 1 is an oscillation point of this method. If |t_0| > 1, then the method diverges.
In order to avoid the possible divergence, in practice we can apply the Damped
Newton method:

x_{k+1} = x_k − h_k [f''(x_k)]^{-1} f'(x_k),

where h_k > 0 is a step-size parameter. At the initial stage of the method we can use the
same step-size strategies as for the gradient method. At the final stage it is reasonable to
choose h_k = 1.
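A small numeric illustration of this trichotomy and of damping. Here f(t) = (1 + t²)^{1/2} is a standard textbook choice (whether it is the example used above is not visible in this excerpt), for which the pure Newton iteration is exactly t_{k+1} = −t_k³.

```python
def damped_newton(t0, h, n_steps):
    # f(t) = (1 + t*t)**0.5: f'(t) = t*(1+t^2)**(-1/2), f''(t) = (1+t^2)**(-3/2),
    # so the (damped) Newton step is t <- t - h * f'(t)/f''(t) = t - h*t*(1+t^2);
    # with h = 1 this is exactly t <- -t**3.
    t = t0
    for _ in range(n_steps):
        t = t - h * t * (1.0 + t * t)
    return t

diverged = damped_newton(2.0, 1.0, 5)      # pure Newton, |t0| > 1: blows up
converged = damped_newton(2.0, 0.25, 40)   # damped steps converge to t* = 0
local = damped_newton(0.1, 1.0, 3)         # |t0| < 1: extremely fast
```

From the same starting point t_0 = 2 the full-step method explodes while the damped one converges; near the minimizer the full step regains its very fast convergence.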
Let us study the local convergence of the Newton method. Consider the problem
min f (x)
xRn
Consider the process: xk+1 = xk [f (xk )]1 f (xk ). Then, using the same reasoning as
for the gradient method, we obtain the following representation:
-1
= xk x [f (xk )]1 f (x + (xk x ))(xk x )d
0
-1
where Gk = [f (xk ) f (x + (xk x ))]d . Denote rk = xk x . Then
0
-1
Gk = [f (xk ) f (x + (xk x ))]d
0
-1
f (xk ) f (x + (xk x )) d
0
-1 rk
M(1 )rk d = 2
M.
0
Therefore, if rk < Ml then f (xk ) is positive denite and [f (xk )]1 (l Mrk )1 . Hence,
2l
for rk small enough (rk < 3M ), we have
Mrk2 M 2
rk+1 ( r < rk /3).
2(l Mrk ) 6l k
Theorem 5.3.1 Let function f (x) satisfy our assumptions. Suppose that the initial starting
point x0 is close enough to x :
2l
x0 x < r = .
3M
Then xk x < r for all k and the Newton method converges quadratically:
M xk x 2
xk+1 x .
2(l M xk x )
Comparing this result with the rate of convergence of the gradient method, we see that
the Newton method is much faster. Surprisingly enough, the region of quadratic convergence
of the Newton method is almost the same as the region of the linear convergence of the
gradient method. This means that the gradient method is worth to use only at the initial
stage of the minimization process in order to get close to a local minimum. The nal job
should be performed by the Newton method.
In this section we have seen several examples of convergence rates. Let us compare
these rates in terms of complexity. As we have seen in Example 5.2.2, the upper bound for
the analytical complexity of a problem class is an inverse function of the rate of convergence.

1. Sublinear rate. This rate is described in terms of a power function of the iteration
counter. For example, we can have r_k ≤ c/√k. In this case the complexity of this scheme
is c²/ε².
Sublinear rate is rather slow. In terms of complexity, each new right digit of the answer
takes an amount of computations comparable with the total amount of the previous
work. Note also that the constant c plays a significant role in the corresponding
complexity estimate.

2. Linear rate. This rate is given in terms of an exponential function of the iteration
counter. For example, it could be like that: r_k ≤ c(1 − q)^k. Note that the corresponding
complexity bound is (1/q)(ln c + ln(1/ε)).
This rate is fast: each new right digit of the answer takes a constant amount of
computations. Moreover, the dependence of the complexity estimate on the constant c is
very weak.
3. Quadratic rate. This rate has the form of a double exponential function of the iteration
counter. For example, it could be as follows: r_{k+1} ≤ c r_k². The corresponding complexity
estimate depends double-logarithmically on the desired accuracy: ln ln(1/ε).
This rate is extremely fast: each iteration doubles the number of right digits in the
answer. The constant c is important only for the starting moment of the quadratic
convergence (c r_k < 1).
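A worked numeric check of the three rates, with illustrative constants (c = 1, q = 1/2, r_0 = 1/2) and target accuracy ε = 10⁻⁴.

```python
import math

eps = 1e-4
# sublinear, r_k = c/sqrt(k), c = 1:          k ~ (c/eps)^2
sub_steps = round(1.0 / eps ** 2)
# linear, r_k = c(1-q)^k, c = 1, q = 1/2:      k ~ (1/q)(ln c + ln(1/eps))
lin_steps = math.ceil(math.log(1.0 / eps, 2.0))
# quadratic, r_{k+1} = c*r_k^2, c = 1, r_0 = 1/2:  k ~ ln ln(1/eps)
r, quad_steps = 0.5, 0
while r > eps:
    r *= r
    quad_steps += 1
```

For the same four-digit accuracy the three rates need about 10⁸, 14 and 4 iterations respectively, which is exactly the "total previous work per digit" versus "constant work per digit" versus "doubling digits" picture described above.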
min_{x∈R^n} f(x),

f(x) ≤ φ_1(x),   x ∈ R^n,

(see Lemma 5.6.4). This fact is responsible for the global convergence of the gradient method.
Further, consider the quadratic approximation (the model) of the function f(x):

φ_2(x) = f(x̄) + ⟨f'(x̄), x − x̄⟩ + ½⟨f''(x̄)(x − x̄), x − x̄⟩.   (5.3.15)

We have already seen that the minimum of this function is

x_2* = x̄ − [f''(x̄)]^{-1} f'(x̄).

From the equation

φ_G'(x_G*) = f'(x̄) + G(x_G* − x̄) = 0,

we obtain

x_G* = x̄ − G^{-1} f'(x̄).   (5.3.16)
The first-order methods which form a sequence

{G_k} : G_k → f''(x*)

(or {H_k} : H_k ≡ G_k^{-1} → [f''(x*)]^{-1}) are called the variable metric methods. (Sometimes
the name quasi-Newton methods is used.) In these methods only the gradients are involved
in the process of generating the sequences {G_k} or {H_k}.
The reasons explaining the step of the form (5.3.16) are so important for optimization
that we will provide one more interpretation. It will also shed some light on the
"variable metric" name of this algorithm family.
We have already used the gradient and the Hessian of a nonlinear function f(x). However,
note that they are defined with respect to the standard Euclidean inner product on R^n:

⟨x, y⟩ = Σ_{i=1}^n x^{(i)} y^{(i)},   x, y ∈ R^n,   ‖x‖ = ⟨x, x⟩^{1/2}.

Indeed, the definition of the gradient is as follows:
Let us introduce now a new inner product. Consider a symmetric positive definite n × n
matrix A. For x, y ∈ R^n denote

⟨x, y⟩_A = ⟨Ax, y⟩,   ‖x‖_A = ⟨Ax, x⟩^{1/2}.

The function ‖x‖_A defines a new metric on R^n given by the matrix A. Note that topologically
this new metric is equivalent to ‖·‖:

λ_1(A)^{1/2}‖x‖ ≤ ‖x‖_A ≤ λ_n(A)^{1/2}‖x‖,

where λ_1(A) and λ_n(A) are the smallest and the largest eigenvalues of the matrix A. However,
the gradient and the Hessian are changing:

f(x + h) = f(x) + ⟨A^{-1}f'(x), h⟩_A + ¼⟨[A^{-1}f''(x) + f''(x)A^{-1}]h, h⟩_A + o(‖h‖_A²).

Hence, f'_A(x) = A^{-1}f'(x) is the new gradient and f''_A(x) = ½[A^{-1}f''(x) + f''(x)A^{-1}] is the
new Hessian (with respect to the metric defined by A).
Thus, the direction used in the Newton method can be interpreted as the gradient computed with respect to the metric defined by A = ∇²f(x). Note that the Hessian of f(x) at x computed with respect to A = ∇²f(x) is the unit matrix. For the quadratic function f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩ the Newton step gives

    x − d_N(x) = x − A⁻¹(Ax + a) = −A⁻¹a = x*.

Thus, the Newton method converges for a quadratic function in one step. Note also that

    f(x) = α + ⟨A⁻¹a, x⟩_A + ½‖x‖²_A.
Let us write out the general scheme of the variable metric methods.

General scheme.

0. Choose x₀ ∈ ℝⁿ. Set H₀ = Iₙ. Compute f(x₀) and ∇f(x₀).

1. k-th iteration (k ≥ 0).

 a) Set p_k = H_k ∇f(x_k).
 b) Find x_{k+1} = x_k − h_k p_k (see Section 2.1.2 for the step-size rules).
 c) Compute f(x_{k+1}) and ∇f(x_{k+1}).
 d) Update the matrix H_k: H_k → H_{k+1}.

The variable metric schemes differ from one another only in the implementation of Step 1d), which updates the matrix H_k. For that, they use the new information accumulated at Step 1c), namely the gradient ∇f(x_{k+1}).
The idea of this update can be explained with a quadratic function. Let

    f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩,    ∇f(x) = Ax + a.

Then, for any x, y ∈ ℝⁿ we have ∇f(x) − ∇f(y) = A(x − y). This identity explains the origin of the following quasi-Newton rule:

    Choose H_{k+1} such that H_{k+1}(∇f(x_{k+1}) − ∇f(x_k)) = x_{k+1} − x_k.
Naturally, there are many ways to satisfy this relation. Let us present several examples of the variable metric schemes which are recognized as the most efficient ones.

Example 5.3.3 Denote ΔH_k = H_{k+1} − H_k,

    γ_k = ∇f(x_{k+1}) − ∇f(x_k),    δ_k = x_{k+1} − x_k.

Then all of the following rules satisfy the quasi-Newton relation:

1. Rank-one correction scheme.

    ΔH_k = (δ_k − H_kγ_k)(δ_k − H_kγ_k)ᵀ / ⟨δ_k − H_kγ_k, γ_k⟩.

2. Davidon-Fletcher-Powell scheme (DFP).

    ΔH_k = δ_kδ_kᵀ / ⟨γ_k, δ_k⟩ − H_kγ_kγ_kᵀH_k / ⟨H_kγ_k, γ_k⟩.

3. Broyden-Fletcher-Goldfarb-Shanno scheme (BFGS).

    ΔH_k = (H_kγ_kδ_kᵀ + δ_kγ_kᵀH_k) / ⟨H_kγ_k, γ_k⟩ − β_k H_kγ_kγ_kᵀH_k / ⟨H_kγ_k, γ_k⟩,

where β_k = 1 + ⟨γ_k, δ_k⟩ / ⟨H_kγ_k, γ_k⟩.
Clearly, there are many other possibilities. From the computational point of view, BFGS is considered the most stable scheme.
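The three update rules can be checked mechanically. The following sketch (in Python with numpy; the function names are ours) implements each formula exactly as written above and verifies the quasi-Newton relation H_{k+1}γ_k = δ_k on sample data.

```python
import numpy as np

def rank_one_update(H, delta, gamma):
    """Rank-one correction: Delta H = v v' / <v, gamma>, v = delta - H gamma."""
    v = delta - H @ gamma
    return H + np.outer(v, v) / (v @ gamma)

def dfp_update(H, delta, gamma):
    """Davidon-Fletcher-Powell update."""
    Hg = H @ gamma
    return (H + np.outer(delta, delta) / (gamma @ delta)
              - np.outer(Hg, Hg) / (gamma @ Hg))

def bfgs_update(H, delta, gamma):
    """Broyden-Fletcher-Goldfarb-Shanno update, as stated in the notes."""
    Hg = H @ gamma
    gHg = gamma @ Hg
    beta = 1.0 + (gamma @ delta) / gHg
    return (H + (np.outer(Hg, delta) + np.outer(delta, Hg)) / gHg
              - beta * np.outer(Hg, Hg) / gHg)

# Verify the quasi-Newton relation H_{k+1} gamma_k = delta_k on sample data.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
H = M @ M.T + 4.0 * np.eye(4)                 # current H_k, positive definite
delta = rng.standard_normal(4)
gamma = delta + 0.1 * rng.standard_normal(4)  # a direction correlated with delta
for update in (rank_one_update, dfp_update, bfgs_update):
    print(np.allclose(update(H, delta, gamma) @ gamma, delta))  # True
```

The relation holds by pure algebra for all three rules, which is what the check confirms; what distinguishes the rules in practice is how they behave on non-quadratic functions and under rounding errors.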
For quadratic functions the variable metric methods usually terminate in n iterations. In the neighborhood of a strict minimum they have a superlinear rate of convergence: for any x₀ ∈ ℝⁿ there exists a number N such that for all k ≥ N the superlinear bound holds (the proofs are very long and technical). As far as the global convergence is concerned, these methods are not better than the gradient method (at least, from the theoretical point of view).
Note that in these methods it is necessary to store and update a symmetric n×n matrix. Thus, each iteration needs O(n²) auxiliary arithmetic operations. For a long time this feature was considered the main drawback of the variable metric methods. It stimulated interest in the conjugate gradient schemes, which have a much lower complexity per iteration (we will consider these schemes in Section 5.5). However, in view of the tremendous growth of computer power, these arguments are not so important now.
5.4 Newton method and self-concordant functions

5.4.1 Preliminaries
The traditional starting point in the theory of the Newton method, Theorem 5.3.1, possesses an evident drawback (which, however, remained unnoticed by generations of researchers). The Theorem establishes local quadratic convergence of the Basic Newton method as applied to a function f with positive definite Hessian at the minimizer; this is fine, but what is the quantitative information given by the Theorem? What indeed is the region of quadratic convergence of the method (let us denote it G), the set of those starting points from which the method converges quickly to x*? The proof provides us with a certain constructive description of G, but this description involves differential characteristics of f, like the magnitude M of the third-order derivatives of f in a neighborhood of x* and the bound on the norm of the inverted Hessian in this neighborhood (which depends on M, the radius of the neighborhood and the smallest eigenvalue l of ∇²f(x*)). Besides this, the fast convergence of the method is described in terms of the behavior of the standard Euclidean distances ‖x_t − x*‖. All these quantities (magnitudes of third-order derivatives of f, norms of the inverted Hessian, distances from the iterates to the minimizer) are frame-dependent: they depend on the choice of the Euclidean structure on the space of variables, on which orthonormal coordinates are used to compute partial derivatives, Hessian matrices and their norms, etc. When we vary the Euclidean structure (pass from the original coordinates to other coordinates via a non-orthogonal linear transformation), all these quantities somehow vary, as does the description of G given by Theorem 5.3.1. On the other hand, when passing from one Euclidean structure on the space of variables to another, we change neither the problem nor the Basic Newton method. Indeed, the latter method is independent of any a priori coordinates, as is seen from the following coordinateless description of the method (cf. (5.3.15)):

To find the Newton iterate x_{t+1} of the previous iterate x_t, take the second-order Taylor expansion of f at x_t and choose, as x_{t+1}, the minimizer of the resulting quadratic form.

Thus, the coordinates are responsible only for the point of view we use to investigate the process and are absolutely irrelevant to the process itself. And the results of Theorem 5.3.1 in their quantitative part (same as other traditional results on the Newton method) reflect this point of view, not the actual properties of the Newton process! This dependence on the viewpoint is a severe drawback: how can we get a correct impression of the actual abilities of the method, looking at the method from an accidentally chosen position? This is exactly the same as trying to get a good picture of a landscape by directing the camera at random.
5.4.2 Self-concordance
When the drawback of the traditional results is realized, could we choose a proper point of view, i.e., orient our camera properly, at least for good objectives? Assume, e.g., that our objective f is convex with nondegenerate Hessian. Then at every point x there is a natural Euclidean structure on the space of variables, intrinsic for the objective, namely, the one given by the Hessian of the objective at x; the corresponding norm is

    |h|_{f,x} = √( hᵀ∇²f(x)h ) = √( d²/dt² |_{t=0} f(x + th) ).   (5.4.17)

Note that the first expression for |h|_{f,x} seems to be frame-dependent: it is given in terms of coordinates used to compute the inner product and the Hessian. But in fact the value of this expression is frame-independent, as is seen from the second representation of |h|_{f,x}.
Now, from the standard results on the Newton method we know that the behavior of the method depends on the magnitudes of the third-order derivatives of f. Thus, these results are expressed in terms of upper bounds

    | d³/dt³ |_{t=0} f(x + th) | ≤ M

on the third-order directional derivatives of the objective, the derivatives being taken along directions h of unit length in the standard Euclidean metric. What happens if we impose a similar upper bound on the third-order directional derivatives along the directions of unit |·|_{f,x} length rather than along the directions of unit usual length? In other words, what happens if we assume that

    | d³/dt³ |_{t=0} f(x + th) | ≤ α   whenever |h|_{f,x} ≤ 1 ?

Since the left hand side of the concluding inequality is of homogeneity degree 3 with respect to h, the indicated assumption is equivalent to

    | d³/dt³ |_{t=0} f(x + th) | ≤ α |h|³_{f,x}   ∀x ∀h.

Now, the resulting inequality, qualitatively, remains true when we scale f, i.e., replace it by λf with a positive constant λ, but the value of α varies: α ↦ λ^{−1/2} α. We can use this property to normalize the constant factor α, e.g., to set it equal to 2 (this is the most technically convenient normalization).
Thus, we come to the main ingredient of the notion of a self-concordant function: a three times continuously differentiable convex function f satisfying the inequality

    | d³/dt³ |_{t=0} f(x + th) | ≤ 2 |h|³_{f,x} ≡ 2 ( d²/dt² |_{t=0} f(x + th) )^{3/2}   ∀h ∈ ℝⁿ.   (5.4.18)

Of course, the second part of the definition imposes something on f only when the domain of f is less than the entire ℝⁿ.

Note that the definition of a self-concordant function is coordinateless: it imposes a certain inequality between third- and second-order directional derivatives of the function and a certain behavior of the function on the boundary of its domain; all notions involved are frame-independent.
(ii) [Existence of minimizer] If f is below bounded (which for sure is the case when G_f is bounded), then f attains its minimum on G_f, the minimizer being unique;

(iii) [Damped Newton method] When started at an arbitrary point x₀ ∈ G_f, the process

    x_{t+1} = x_t − [1/(1 + λ(f, x_t))] [∇²f(x_t)]⁻¹ ∇f(x_t),    λ(f, x) = √( ∇f(x)ᵀ[∇²f(x)]⁻¹∇f(x) ),   (5.4.20)

i.e., the Newton method with the particular stepsizes

    γ_{t+1} = 1/(1 + λ(f, x_t)),

possesses the following properties:

(iii.1) The process keeps the iterates in G_f and is therefore well-defined;

(iii.2) If f is below bounded on G_f (which for sure is the case if λ(f, x) < 1 for some x ∈ G_f), then {x_t} converges to the unique minimizer x*_f of f on G_f;

(iii.3) Each step of the process (5.4.20) decreases f significantly, provided that λ(f, x_t) is not too small.
According to (iii.3), at every step of the initial phase the objective is decreased at least by the absolute constant

    κ = ¼ − ln(5/4);

consequently, the initial phase lasts no more than

    N_ini = (f(x₀) − min_{G_f} f) / κ

iterations.
(iii.4) In the region of quadratic convergence we have

    λ(f, x_{t+1}) ≤ 2λ²(f, x_t)/(1 − λ(f, x_t)) ≤ ½ λ(f, x_t);

thus, starting with t = t̄, the residuals in terms of the objective, f(x_t) − min_{G_f} f, also converge quadratically to zero with an objective-independent rate.
The number of steps of the Damped Newton method required to reduce the residual f(x_t) − min f in the value of a self-concordant below bounded objective to a prescribed value ε < 0.1 is no more than

    N(ε) ≤ O(1) ( [f(x₀) − min f] + ln ln(1/ε) ),   (5.4.24)

O(1) being an absolute constant.
It is also worthy of note what happens when we apply the Damped Newton method to a below unbounded self-concordant f. The answer is as follows: for a below unbounded f one has λ(f, x) ≥ 1 for every x (see (iii.2)), and, consequently, every step of the method decreases f at least by the absolute constant 1 − ln 2 (see (iii.3)).
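The process (5.4.20) is straightforward to implement once the gradient and the Hessian are available. The sketch below (our own illustration, not from the notes) runs the Damped Newton method on the self-concordant function f(x) = Σ_i (c_i x_i − ln x_i), whose minimizer is x_i = 1/c_i; the iterates stay in the domain x > 0 without any safeguards, exactly as (iii.1) promises.

```python
import numpy as np

def damped_newton(grad, hess, x0, n_steps):
    """Damped Newton process (5.4.20): x <- x - [1/(1+lam)] H^{-1} g."""
    x = np.asarray(x0, dtype=float)
    decrements = []
    for _ in range(n_steps):
        g, H = grad(x), hess(x)
        step = np.linalg.solve(H, g)    # [f''(x)]^{-1} f'(x)
        lam = np.sqrt(g @ step)         # Newton decrement lambda(f, x)
        decrements.append(lam)
        x = x - step / (1.0 + lam)      # damped step keeps x in dom f
    return x, decrements

# Example (ours): f(x) = sum_i (c_i x_i - ln x_i) on x > 0, minimizer 1/c.
c = np.array([1.0, 2.0, 4.0])
grad = lambda x: c - 1.0 / x
hess = lambda x: np.diag(1.0 / x ** 2)
x, lams = damped_newton(grad, hess, np.array([5.0, 5.0, 5.0]), 20)
# The decrement first shrinks slowly, then collapses quadratically,
# and x approaches 1/c to machine precision.
```

The two phases predicted by (iii.3) and (iii.4) are visible in the recorded decrements: a short initial phase of steady decrease followed by a quadratically convergent tail.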
The Newton decrement admits the representation

    λ²(f, x)/2 = f̃(x) − min_y f̃(y),

where

    f̃(y) = f(x) + (y − x)ᵀ∇f(x) + ½(y − x)ᵀ∇²f(x)(y − x)

is the second-order Taylor expansion of f at x. This is a coordinateless definition of λ(f, x).
Note that the region of quadratic convergence of the Damped Newton method as applied to a below bounded self-concordant function f is, according to (iii.4), the set

    G*_f = { x ∈ G_f | λ(f, x) ≤ ¼ }.   (5.4.25)
A convex quadratic function

    f(x) = ½ xᵀAx − bᵀx + c

(A a symmetric positive semidefinite n×n matrix) is self-concordant on ℝⁿ;
[Affine substitution] Let f(x) be self-concordant with domain G_f ⊂ ℝⁿ, and let x = Aξ + b be an affine mapping from ℝᵏ into ℝⁿ with image intersecting G_f. Then the composite function

    g(ξ) = f(Aξ + b)

is self-concordant with the domain

    G_g = { ξ | Aξ + b ∈ G_f }.
To justify self-concordance of the indicated functions, as well as the validity of the combination rules, only minimal effort is required; at the same time, these examples and rules give almost everything required to establish excellent global efficiency estimates for Interior Point methods as applied to Linear Programming and Convex Quadratically Constrained Quadratic Programming.
After we know examples of self-concordant functions, let us look at how our new understanding of the behavior of the Newton method on such a function differs from the one given by Theorem 5.3.1. To this end consider a particular self-concordant function, the logarithmic barrier

    f(x) = −ln(σ² − x₁²) − ln(1 − x₂²)

for the rectangle D = {x ∈ ℝ² : |x₁| < σ, |x₂| < 1}, where σ ≥ 1 is a parameter. This function indeed is self-concordant (see the third of the above raw material examples). The minimizer of the function clearly is the origin; the region of quadratic convergence of the Damped Newton method is given by

    G* = { x ∈ D | x₁²/(σ² + x₁²) + x₂²/(1 + x₂²) ≤ 1/32 }
(see (5.4.25)). We see that the region of quadratic convergence of the Damped Newton method is large enough: it contains, e.g., the rectangle D′ concentric to D and 8 times smaller than D. Besides this, (5.4.24) says that in order to minimize f to an inaccuracy ε, in terms of the objective, starting with a point x₀ ∈ D, it suffices to perform no more than

    O(1) ( ln [1/(1 − ‖x₀‖)] + ln ln(1/ε) )

iterations, where

    ‖x‖ = max{ |x₁|/σ, |x₂| }.
Now let us look at what Theorem 5.3.1 says. The Hessian ∇²f(0) of the objective at the minimizer is

    H = ( 2σ⁻²   0
          0      2 ),

and ‖H⁻¹‖ = O(σ²); in, say, the 0.5-neighborhood U of x* = 0 we also have ‖[∇²f(x)]⁻¹‖ = O(σ²). The third-order derivatives of f in U are of order 1. Thus, in the notation from the proof of Theorem 5.3.1 we have M = O(1) (this is the magnitude of the third-order derivatives of f in U), r = 0.5, and the upper bound on the norm of the inverted Hessian of f in U is O(σ²). According to the proof, the region U* of quadratic convergence of the Newton method is the r̄-neighborhood of x* = 0 with

    r̄ = O(σ⁻²).

Thus, according to Theorem 5.3.1, the region of quadratic convergence of the method becomes smaller as σ grows, while the actual behavior of this region is quite the opposite.
In this simple example, the aforementioned drawback of the traditional approach, its frame-dependence, is clearly seen. Applying Theorem 5.3.1 to the situation in question, we used an extremely bad frame (Euclidean structure). If we were clever enough to scale the variable x₁ before applying Theorem 5.3.1, i.e., to divide it by σ, it would become absolutely clear that the behavior of the Newton method is absolutely independent of σ, and the region of quadratic convergence of the method is a once for ever fixed fraction of the rectangle D.
5.5 Conjugate gradients method

Consider the problem

    min_{x ∈ ℝⁿ} f(x)

with f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩ and A = Aᵀ > 0. We have already seen that the solution of this problem is x* = −A⁻¹a. Therefore, our quadratic objective function can be written in the following form:

    f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩ = α − ½⟨Ax*, x*⟩ + ½⟨A(x − x*), x − x*⟩.

This definition looks rather abstract. However, we will see that the method can be written in a much more algorithmic form. The above representation is convenient for the theoretical analysis.
Proof:
For k = 1 we have ∇f(x₀) = A(x₀ − x*). Let the statement of the lemma be true for some k ≥ 1. Then

    x_k = x₀ + Σ_{i=1}^{k} λ_i A^i (x₀ − x*)

with certain coefficients λ_i, so that

    ∇f(x_k) = A(x_k − x*) = A(x₀ − x*) + Σ_{i=1}^{k} λ_i A^{i+1}(x₀ − x*) = λ_k A^{k+1}(x₀ − x*) + y,

where y ∈ L_k. Thus,

    L_{k+1} = Lin{L_k, A^{k+1}(x₀ − x*)} = Lin{L_k, ∇f(x_k)} = Lin{∇f(x₀), ..., ∇f(x_k)}.
The next result is important for understanding the behavior of the minimizing sequence of the method.

Proof:
Let k > i. Consider the function

    φ(λ) ≡ φ(λ₁, ..., λ_k) = f( x₀ + Σ_{j=1}^{k} λ_j ∇f(x_{j−1}) ).

In view of Lemma 5.5.1, there exists λ* such that

    x_k = x₀ + Σ_{j=1}^{k} λ*_j ∇f(x_{j−1}).

Since x_k minimizes f over the corresponding affine subspace, λ* is a minimum of the function φ(λ). Therefore

    0 = ∂φ(λ*)/∂λ_{i+1} = ⟨∇f(x_k), ∇f(x_i)⟩.
Corollary 5.5.1 The sequence generated by the conjugate gradient method is finite.

The last result we need explains the name of the method. Denote δ_i = x_{i+1} − x_i. It is clear that

    L_k = Lin{δ₀, ..., δ_{k−1}}.

Let us show how we can write out the conjugate gradient method in a more algorithmic form. Since L_k = Lin{δ₀, ..., δ_{k−1}}, we can represent x_{k+1} as follows:

    x_{k+1} = x_k − h_k ∇f(x_k) + Σ_{j=0}^{k−1} λ_j δ_j.

That is,

    δ_k = −h_k ∇f(x_k) + Σ_{j=0}^{k−1} λ_j δ_j.   (5.5.27)
In the above scheme we did not specify yet the coefficient β_k. In fact, there are many different formulas for this coefficient. All of them give the same result for a quadratic function, but in the general nonlinear case they generate different algorithmic schemes. Let us present the three most popular expressions.

1. β_k = ‖∇f(x_{k+1})‖² / ⟨∇f(x_{k+1}) − ∇f(x_k), p_k⟩.

2. Fletcher-Reeves: β_k = ‖∇f(x_{k+1})‖² / ‖∇f(x_k)‖².

3. Polak-Ribière: β_k = ⟨∇f(x_{k+1}), ∇f(x_{k+1}) − ∇f(x_k)⟩ / ‖∇f(x_k)‖².
Recall that in the quadratic case the conjugate gradient method terminates in n iterations (or less). Algorithmically, this means that p_{n+1} = 0. In the general nonlinear case that is not true, but after n iterations this direction can lose its meaning. Therefore, in all practical schemes there is a restart strategy, which at some point sets β_k = 0 (usually after every n iterations). This ensures the global convergence of the scheme (since we have a normal gradient step just after the restart), and a local n-step quadratic convergence:

    ‖x_{n+1} − x*‖ ≤ const · ‖x₀ − x*‖²,

provided that x₀ is close enough to the strict minimum x*. Note that this local convergence is slower than that of the variable metric methods. However, the conjugate gradient schemes have the advantage of a very cheap iteration. As far as the global convergence is concerned, the conjugate gradients in general are no better than the gradient method.
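For a quadratic objective the scheme can be written in a few lines. The sketch below (ours, not from the notes) uses exact line search and the Fletcher-Reeves coefficient; no restart is needed here, since for a quadratic the method terminates in at most n iterations.

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_iters):
    """CG for f(x) = 0.5 x'Ax - b'x (A symmetric positive definite),
    with exact line search and the Fletcher-Reeves coefficient."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b                 # gradient of f
    p = g.copy()                  # initial direction: the gradient
    for _ in range(n_iters):
        if np.linalg.norm(g) < 1e-12:
            break
        h = (g @ p) / (p @ (A @ p))        # exact minimization along -p
        x = x - h * p
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves coefficient
        p = g_new + beta * p
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, np.zeros(2), 2)
# After n = 2 iterations x solves Ax = b, illustrating finite termination.
```

Only matrix-vector products and inner products are needed per iteration, which is the "very cheap iteration" advantage mentioned above.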
5.6 Exercises
5.6.1 Implementing Gradient method
Exercise 5.6.1
Let us consider the implementation of the Armijo-Goldstein version of the Gradient Descent. Recall that at each step of the algorithm we are looking for x_{k+1} = x_k − h∇f(x_k) with a step-size h satisfying the Armijo-Goldstein conditions.

Rosenbrock problem

How long does it take to reduce the initial inaccuracy, in terms of the objective, by a factor of 0.1?
Quadratic problem

    f(x) = ½ xᵀAx − bᵀx,   x ∈ ℝ⁴,

with

    A = (  0.78  −0.02  −0.12  −0.14
          −0.02   0.86  −0.04   0.06
          −0.12  −0.04   0.72  −0.08
          −0.14   0.06  −0.08   0.74 ),    b = (0.76, 0.08, 1.12, 0.68)ᵀ,    x₀ = 0.
Run the method until the norm of the gradient at the current iterate becomes less than 10⁻⁶. Is the convergence fast or not? Those using MATLAB can compute the spectrum of A and compare the theoretical upper bound on the convergence rate with the observed one.
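A minimal implementation of this exercise might look as follows (Python instead of MATLAB; the signs of the off-diagonal entries of A were reconstructed from the damaged printout, so treat the data as an assumption; any symmetric positive definite A works the same way).

```python
import numpy as np

A = np.array([[ 0.78, -0.02, -0.12, -0.14],
              [-0.02,  0.86, -0.04,  0.06],
              [-0.12, -0.04,  0.72, -0.08],
              [-0.14,  0.06, -0.08,  0.74]])
b = np.array([0.76, 0.08, 1.12, 0.68])

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

def gradient_descent_armijo(x, tol=1e-6, c1=0.25, max_iters=10_000):
    """Gradient Descent; h is halved until the Armijo sufficient-decrease
    condition f(x - h g) <= f(x) - c1 * h * ||g||^2 holds."""
    iters = 0
    while np.linalg.norm(grad(x)) >= tol and iters < max_iters:
        g, h = grad(x), 1.0
        while f(x - h * g) > f(x) - c1 * h * (g @ g):
            h *= 0.5
        x = x - h * g
        iters += 1
    return x, iters

x, iters = gradient_descent_armijo(np.zeros(4))
print(iters)  # a few dozen iterations: this A is quite well-conditioned
```

Since the spectrum of this A is clustered around 1, the observed linear rate is fast; replacing A by an ill-conditioned matrix makes the iteration count grow roughly in proportion to the condition number, as the theory predicts.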
In nonlinear optimization we usually apply local approximations based on the derivatives of the nonlinear function. These are the first- and second-order approximations (or, the linear and quadratic approximations). Let f(x) be differentiable at x̄. Then for y ∈ ℝⁿ we have

    f(y) = f(x̄) + ⟨∇f(x̄), y − x̄⟩ + o(‖y − x̄‖),

where o(r) is some function of r ≥ 0 such that

    lim_{r↓0} (1/r) o(r) = 0,    o(0) = 0.
    L_f(α) = { x ∈ ℝⁿ | f(x) ≤ α }.

Exercise 5.6.2⁺ If s ∈ S_f(x̄) then ⟨∇f(x̄), s⟩ = 0.
Let s be a direction in ℝⁿ, ‖s‖ = 1. Consider the local decrease of f(x) along s:

    Δ(s) = lim_{α↓0} (1/α) [ f(x̄ + αs) − f(x̄) ].
Note that

    f(x̄ + αs) − f(x̄) = α⟨∇f(x̄), s⟩ + o(α).

Therefore Δ(s) = ⟨∇f(x̄), s⟩. Using the Cauchy-Schwarz inequality

    −‖x‖ · ‖y‖ ≤ ⟨x, y⟩ ≤ ‖x‖ · ‖y‖,

we obtain

    Δ(s) = ⟨∇f(x̄), s⟩ ≥ −‖∇f(x̄)‖.

Let us take s̄ = −∇f(x̄)/‖∇f(x̄)‖. Then

    Δ(s̄) = −⟨∇f(x̄), ∇f(x̄)⟩/‖∇f(x̄)‖ = −‖∇f(x̄)‖.

Thus, the direction −∇f(x̄) (the antigradient) is the direction of the fastest local decrease of f(x) at the point x̄.
The next statement, which is already known to us, is probably the most fundamental fact in optimization.

Theorem 5.6.1 (First-order optimality condition; Fermat theorem.)
Let x* be a local minimum of the differentiable function f(x). Then ∇f(x*) = 0.

Note that this is only a necessary condition of a local minimum. The points satisfying this condition are called the stationary points of the function f.
Let us look now at the second-order approximation. Let the function f(x) be twice differentiable at x̄. Then

    f(y) = f(x̄) + ⟨∇f(x̄), y − x̄⟩ + ½⟨∇²f(x̄)(y − x̄), y − x̄⟩ + o(‖y − x̄‖²).

The quadratic function

    f(x̄) + ⟨∇f(x̄), y − x̄⟩ + ½⟨∇²f(x̄)(y − x̄), y − x̄⟩

is called the quadratic (or second-order) approximation of the function f at x̄. Recall that the (n×n)-matrix ∇²f(x) with entries

    (∇²f(x))_{i,j} = ∂²f(x)/∂x_i∂x_j

is called the Hessian of the function f at x. Note that the Hessian is a symmetric matrix:

    ∇²f(x) = [∇²f(x)]ᵀ.

This matrix can be seen as a derivative of the vector function ∇f(x):

    ∇f(y) = ∇f(x̄) + ∇²f(x̄)(y − x̄) + o(‖y − x̄‖),

where o(r) is some vector function of r ≥ 0 such that

    lim_{r↓0} (1/r) ‖o(r)‖ = 0,    o(0) = 0.
Using the second-order approximation, we can write out the second-order optimality conditions. In what follows the notation A ≥ 0, used for a symmetric matrix A, means that A is positive semidefinite; A > 0 means that A is positive definite. The following result supplies a necessary condition of a local minimum:

    ∇f(x*) = 0,    ∇²f(x*) ≥ 0.

(This condition is necessary but not sufficient for f(y) ≥ f(x*) to hold in a neighborhood of x*.)
Theorem 5.6.3 Let the function f(x) be twice differentiable on ℝⁿ and let x* satisfy the following conditions:

    ∇f(x*) = 0,    ∇²f(x*) > 0.

Then x* is a strict local minimum of f(x).

Proof: Since (1/r)o(r) → 0, there exists a value r̄ such that for all r ∈ [0, r̄] we have

    |o(r)| ≤ (r/4) λ₁(∇²f(x*)),

where λ₁(∇²f(x*)) is the smallest eigenvalue of ∇²f(x*). Then for ‖y − x*‖ small enough

    f(y) ≥ f(x*) + ½λ₁(∇²f(x*))‖y − x*‖² + o(‖y − x*‖²) ≥ f(x*) + ¼λ₁(∇²f(x*))‖y − x*‖² > f(x*).
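The two conditions ∇f(x*) = 0 and ∇²f(x*) > 0 are straightforward to check numerically: test that the gradient vanishes and that the smallest eigenvalue of the Hessian is positive. The test function and tolerance below are our own illustration.

```python
import numpy as np

def check_local_min(grad_value, hess_value, tol=1e-8):
    """Sufficient condition of Theorem 5.6.3: gradient zero, Hessian > 0."""
    stationary = np.linalg.norm(grad_value) < tol
    lam_min = np.linalg.eigvalsh(hess_value).min()  # smallest eigenvalue
    return bool(stationary and lam_min > tol)

# Example (ours): f(x, y) = (x - 1)^2 + 10 (y + 2)^2, candidate x* = (1, -2).
x_star = np.array([1.0, -2.0])
g = np.array([2 * (x_star[0] - 1), 20 * (x_star[1] + 2)])  # gradient at x*
H = np.diag([2.0, 20.0])                                   # Hessian (constant)
print(check_local_min(g, H))  # True: x* is a strict local minimum
```

Note that a failed check does not disprove a minimum: at a degenerate point (smallest eigenvalue zero) the sufficient condition is simply inconclusive.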
for all x, y ∈ Q.

Clearly, we always have p ≤ k. If q ≥ k, then C_L^{q,p}(Q) ⊆ C_L^{k,p}(Q). For example, C_L^{2,1}(Q) ⊆ C_L^{1,1}(Q). Note also that these classes possess the following property: if f₁ ∈ C_{L₁}^{k,p}(Q), f₂ ∈ C_{L₂}^{k,p}(Q) and α, β ∈ ℝ¹, then

    αf₁ + βf₂ ∈ C_{L₃}^{k,p}(Q)

with L₃ = |α| L₁ + |β| L₂.
We use the notation f ∈ Cᵏ(Q) for f which is k times continuously differentiable on Q. The most important class of the above type is C_L^{1,1}(Q), the class of functions with Lipschitz continuous gradient. In view of the definition, the inclusion f ∈ C_L^{1,1}(ℝⁿ) means that

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖   (5.6.30)

for all x, y ∈ ℝⁿ. Let us give a sufficient condition for that inclusion.

Exercise 5.6.3⁺ A function f(x) belongs to C_L^{2,1}(ℝⁿ) if and only if

    ‖∇²f(x)‖ ≤ L,   ∀x ∈ ℝⁿ.   (5.6.31)

This simple result provides us with many representatives of the class C_L^{1,1}(ℝⁿ).

Example 5.6.1 1. The linear function f(x) = α + ⟨a, x⟩ belongs to C₀^{1,1}(ℝⁿ) since

    ∇f(x) = a,    ∇²f(x) = 0.

2. For the quadratic function f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩ we have:

    ∇f(x) = a + Ax,    ∇²f(x) = A.

Therefore f(x) ∈ C_L^{1,1}(ℝⁿ) with L = ‖A‖.

3. Consider the function of one variable f(x) = √(1 + x²), x ∈ ℝ¹. We have:

    f′(x) = x/√(1 + x²),    f″(x) = 1/(1 + x²)^{3/2} ≤ 1.
Consider f ∈ C_L^{1,1}(ℝⁿ) and the functions

    φ₁(x) = f(x₀) + ⟨∇f(x₀), x − x₀⟩ − (L/2) ‖x − x₀‖²,
    φ₂(x) = f(x₀) + ⟨∇f(x₀), x − x₀⟩ + (L/2) ‖x − x₀‖².

Then the graph of the function f is located between the graphs of φ₁ and φ₂.

Let us consider the similar result for the class of twice differentiable functions. Our main class of functions of that type will be C_M^{2,2}(ℝⁿ), the class of twice differentiable functions with Lipschitz continuous Hessian. Recall that for f ∈ C_M^{2,2}(ℝⁿ) we have

    ‖∇²f(x) − ∇²f(y)‖ ≤ M ‖x − y‖

for all x, y ∈ ℝⁿ.
Exercise 5.6.5⁺ Let f ∈ C_M^{2,2}(ℝⁿ). Then for any x, y from ℝⁿ we have:

    ‖∇f(y) − ∇f(x) − ∇²f(x)(y − x)‖ ≤ (M/2) ‖y − x‖².   (5.6.34)

We have the following corollary of this result:

Exercise 5.6.6⁺ Let f ∈ C_M^{2,2}(ℝⁿ) and ‖y − x‖ = r. Then
Exercise 5.6.7 Let p(x) be a polynomial of degree n > 0. Without loss of generality we can suppose that p(x) = xⁿ + ..., i.e., the coefficient of the highest-degree monomial is 1. Now consider the modulus |p(z)| as a function of the complex argument z ∈ ℂ. Show that this function has a minimum, and that this minimum is zero.²

Hint: Since |p(z)| → +∞ as |z| → +∞, the continuous function |p(z)| must attain a minimum on the complex plane. Let z̄ be such a point of the complex plane. Show that for small complex h

    p(z̄ + h) = p(z̄) + c_k hᵏ + o(|h|ᵏ)

for some k, 1 ≤ k ≤ n, and c_k ≠ 0. Now, if p(z̄) ≠ 0 there is a choice (which one?) of small h such that |p(z̄ + h)| < |p(z̄)|.

²This proof is tentatively attributed to Hadamard.
Lecture 6
Constrained Minimization
This lecture is devoted to the penalty and the barrier methods; as far as the underlying ideas are concerned, these methods implement the simplest approach to constrained optimization: approximate a constrained problem by unconstrained ones. Let us look at how it is done.
(ICP)    f(x) → min,
         g_i(x) ≤ 0,  i = 1, ..., m,   (6.1.1)

where g_i(x) are smooth functions. For example, we can consider g_i(x) from C_L^{1,1}(ℝⁿ).
Since the components of the problem (6.1.1) are general nonlinear functions, we cannot expect it to be easier to solve than the unconstrained minimization problem. Indeed, even the troubles with stationary points, which we have in unconstrained minimization, appear in (6.1.1) in a much stronger form. Note that a stationary point of this problem (whatever that means) can be infeasible for the functional constraints, and any minimization scheme attracted by such a point should admit that it fails to solve the problem. Therefore, the following reasoning looks almost like the only way to proceed:
3. Therefore, let us try to approximate the solution of the constrained problem (6.1.1) by a sequence of solutions of some auxiliary unconstrained problems.

Definition 6.1.1 A continuous function Φ(x) is called a penalty function for a closed set G if Φ(x) = 0 for x ∈ G and Φ(x) > 0 for x ∉ G.

2. Nonsmooth penalty: Φ(x) = Σ_{i=1}^{m} (g_i(x))₊

(we denote (a)₊ = max{a, 0}). The reader can easily continue this list.
It is easy to prove the convergence of this scheme assuming that x_{k+1} is a global minimum of the auxiliary function.³ Denote

    Ψ_k(x) = f(x) + t_k Φ(x),    Ψ*_k = min_{x∈ℝⁿ} Ψ_k(x).

Proof:
Note that Ψ*_k ≤ Ψ_k(x*) = f*. Further, for any x ∈ ℝⁿ we have Ψ_{k+1}(x) ≥ Ψ_k(x). Therefore Ψ*_{k+1} ≥ Ψ*_k. Thus, there exists a limit lim_{k→∞} Ψ*_k ≤ f*. If t_k > t then

    f(x_k) + t Φ(x_k) ≤ f(x_k) + t_k Φ(x_k) ≤ f*.

Therefore, the sequence {x_k} has limit points. For any limit point x̄ we have Φ(x̄) = 0. Therefore x̄ ∈ G and

    lim_{k→∞} Ψ*_k = f(x̄) + Φ(x̄) = f(x̄) ≥ f*.
Remark 6.1.1 Note that this result is very general, but not too informative. There are still many questions which should be answered. For example, we do not know what kind of penalty function we should use. What should be the rules for choosing the penalty coefficients? What should be the accuracy for solving the auxiliary problems? The main feature of these questions is that they cannot be answered in the framework of the general nonlinear programming theory. Therefore, traditionally, they are considered to be answered by computational practice.

³If we assume that it is a strict local minimum, then the result is much weaker.
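A minimal numerical sketch of the penalty scheme (the toy problem is our own, not from the lecture): minimize f(x) = x² subject to x ≥ 1 using the smooth penalty Φ(x) = (1 − x)₊² and increasing coefficients t_k = 10ᵏ, warm-starting each auxiliary problem at the previous minimizer.

```python
def minimize_aux(t, x):
    """Minimize psi_t(x) = x^2 + t*(1-x)_+^2 by Newton's method
    (the function is piecewise quadratic, so Newton is exact per piece)."""
    for _ in range(100):
        g = 2.0 * x - 2.0 * t * max(1.0 - x, 0.0)      # psi_t'(x)
        h = 2.0 + (2.0 * t if x < 1.0 else 0.0)        # psi_t''(x)
        x_new = x - g / h
        if abs(x_new - x) < 1e-14:
            break
        x = x_new
    return x

x = 0.0
for k in range(8):
    t = 10.0 ** k            # penalty coefficients t_k -> infinity
    x = minimize_aux(t, x)   # here x_k = t/(1+t), approaching x* = 1
print(x)
```

The printed iterate is infeasible for every finite t (it approaches the boundary from outside), which is characteristic of penalty methods; the growth of t needed for high accuracy is exactly the ill-conditioning discussed below.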
Now it is time to look at our abilities to solve the unconstrained problems (P_t) which, as we already know, for large t are good approximations of the constrained problem in question. In principle we can solve these problems by any one of the unconstrained minimization methods we know, and this is definitely a great advantage of the approach.

Remark 6.1.2 There is, however, a severe weak point of the construction: to approximate well the constrained problem by an unconstrained one, we must deal with large values of the penalty parameter, and this, as we shall see in a while, unavoidably makes the unconstrained problem (P_t) ill-conditioned and thus very difficult for any unconstrained minimization method sensitive to the conditioning of the problem. And all the methods for unconstrained minimization we know, except, possibly, the Newton method, are sensitive to conditioning (e.g., in the Gradient Descent the number of steps required to achieve an ε-solution is, asymptotically, inversely proportional to the condition number of the Hessian of the objective at the optimal point). Even the Newton method, which does not react to the conditioning explicitly (it is self-scaled), suffers a lot when applied to an ill-conditioned problem, since here we are forced to invert ill-conditioned Hessian matrices, and this, in actual computations with their rounding errors, causes a lot of trouble. The indicated drawback, the ill-conditionedness of the auxiliary unconstrained problems, is the main disadvantage of the straightforward penalty scheme, and because of it the scheme is not that widely used now and is in many cases replaced with the smarter modified Lagrangian scheme.
Definition 6.1.2 A continuous function F(x) is called a barrier function for a closed set G with nonempty interior if F(x) → ∞ when x approaches the boundary of the set G.

In order to apply the barrier approach, the problem must satisfy the Slater condition: there exists x̄ such that g_i(x̄) < 0, i = 1, ..., m.

2. Logarithmic barrier: F(x) = −Σ_{i=1}^{m} ln(−g_i(x)).

3. Exponential barrier: F(x) = Σ_{i=1}^{m} exp( −1/g_i(x) ).
Let us prove the convergence of this method assuming that x_{k+1} is a global minimum of the auxiliary function. Denote

    F_k(x) = f(x) + (1/t_k) F(x),    F*_k = min_{x∈G} F_k(x).

Assumption 6.1.2 The barrier F(x) is bounded from below: F(x) ≥ F* for all x ∈ G.

Theorem 6.1.2 Under Assumption 6.1.2,

    lim_{k→∞} F*_k = f*.

Proof:
Let x̄ ∈ int G. Then

    lim_{k→∞} F*_k ≤ lim_{k→∞} [ f(x̄) + (1/t_k) F(x̄) ] = f(x̄).

Therefore lim_{k→∞} F*_k ≤ f*. Further,

    F*_k = min_{x∈G} [ f(x) + (1/t_k) F(x) ] ≥ min_{x∈G} [ f(x) + (1/t_k) F* ] = f* + (1/t_k) F*.

Thus, lim_{k→∞} F*_k = f*.
As with the penalty function method, there are many questions to be answered. We do not know how to find the starting point x₀ and how to choose the best barrier function. We do not know the rules for updating the penalty coefficients and the acceptable accuracy of the solutions to the auxiliary problems. Finally, we have no idea about the efficiency estimates of this process. And the reason is not a lack of theory. Our problem (6.1.1) is just too complicated.
If I were writing this lecture, say, 20 years ago, I would probably stop here, or add some complaints about the fact that, in the same way as for the penalty method, the problems of minimizing F_t normally (when the solution to the original problem is on the boundary of G; otherwise the problem actually is unconstrained) become the more ill-conditioned the larger is t, so that the difficulties of their numerical solution grow with the penalty parameter. Writing this lecture now, I would say something quite opposite: there exist important situations when the difficulties in the numerical minimization of F_t do not increase with the penalty parameter, and the overall scheme turns out to be theoretically efficient and, moreover, the best known so far. This change in the evaluation of the scheme is the result of the recent "interior point revolution" in Optimization which I have already mentioned in Lecture 5.
F is a self-concordant function on int G (Section 5.4, Lecture 5), i.e., a three times continuously differentiable convex function on int G possessing the barrier property (i.e., F(x_i) → ∞ along every sequence of points x_i ∈ int G converging to a boundary point of G) and satisfying the differential inequality

    | d³/dt³ |_{t=0} F(x + th) | ≤ 2 ( hᵀ∇²F(x)h )^{3/2}   ∀x ∈ int G  ∀h ∈ ℝⁿ;
Let

    G = { x ∈ ℝⁿ | a_jᵀx ≤ b_j, j = 1, ..., m }

be a polytope given by a list of linear inequalities satisfying the Slater condition (i.e., there exists x̄ such that a_jᵀx̄ < b_j, j = 1, ..., m). Then the function

    F(x) = −Σ_{j=1}^{m} ln(b_j − a_jᵀx)

is an m-self-concordant barrier for G.
In a moment we will justify this example and consider the crucial issue of how to find a self-concordant barrier for a given feasible domain. For the time being, let us focus on another issue: how to solve (P), given a ϑ-self-concordant barrier for the feasible domain of the problem.

What we intend to do is to use the path-following scheme associated with the barrier, a certain very natural implementation of the barrier method.
For every positive t, the minimizer of F_t on int G is a singleton (we already know that it is nonempty, and the uniqueness of the minimizer follows from the convexity and nondegeneracy of F_t). Thus, we have a path

    x*(t) = argmin_{x ∈ int G} F_t(x);

as we know from Theorem 6.1.2, this path converges to the optimal set of (P) as t → ∞; besides this, it can be easily seen that the path is continuous (even continuously differentiable) in t. In order to approximate x*(t) with large values of t via the path-following scheme, we trace the path x*(t), namely, generate sequentially approximations x(t_i) to the points x*(t_i) along a certain diverging to infinity sequence t₀ < t₁ < ... of values of the parameter. This is done as follows:

given a tight approximation x(t_i) to x*(t_i), we update it into a tight approximation x(t_{i+1}) to x*(t_{i+1}) as follows:

first, choose somehow a new value t_{i+1} > t_i of the penalty parameter;

second, apply to the function F_{t_{i+1}}(·) a method for unconstrained minimization started at x(t_i), and run the method until closeness to the new target point x*(t_{i+1}) is restored, thus coming to the new iterate x(t_{i+1}) close to the new target point of the path.
Our hope is that since x*(t) is continuous in t and x(t_i) is close to x*(t_i), for not too large t_{i+1} - t_i the point x(t_i) will not be too far from the new target point x*(t_{i+1}), so that the unconstrained minimization method we use will quickly restore closeness to the new target point. With this gradual movement, we may hope to arrive near x*(t) with large t faster than by attacking the problem (P_t) directly.
All this was known for many years; the progress during the last decade was in transforming these qualitative ideas into exact quantitative recommendations.
Namely, it turned out that
A. The best possibilities to carry this scheme out arise when the barrier F is θ-self-concordant; the smaller the value of θ, the better;
B. The natural measure of closeness of a point x ∈ int G to the point x*(t) of the path is the Newton decrement of the self-concordant function
$$F_t(x) = t\,c^T x + F(x)$$
at the point x, i.e., the quantity
$$\lambda(F_t, x) = \sqrt{[\nabla F_t(x)]^T\,[\nabla^2 F_t(x)]^{-1}\,\nabla F_t(x)}$$
(cf. Proposition 5.4.1.(iii)). More specifically, it is convenient to define the notion "x is close to x*(t)" as the relation
$$\lambda(F_t, x) \le 0.05 \qquad (6.2.3)$$
(in fact, 0.05 in the right hand side could be replaced with an arbitrary absolute constant < 1, with a slight modification of the subsequent statements; I choose this particular value for the sake of simplicity).
6.2. SELF-CONCORDANT BARRIERS AND PATH-FOLLOWING SCHEME 153
Now, what do all these words "the best possibility" and "natural measure" actually mean? This is explained by the following two statements.
C. Assume that x is close, in the sense of (6.2.3), to a point x*(t) of the path x*(·) associated with a θ-self-concordant barrier for the feasible domain G of problem (P). Let us increase the parameter t to the larger value
$$t^+ = \left(1 + \frac{0.08}{\sqrt{\theta}}\right) t \qquad (6.2.4)$$
and replace x by its damped Newton iterate (cf. (5.4.20), Lecture 5)
$$x^+ = x - \frac{1}{1 + \lambda(F_{t^+}, x)}\,[\nabla^2 F_{t^+}(x)]^{-1}\,\nabla F_{t^+}(x). \qquad (6.2.5)$$
Then x^+ is close, in the sense of (6.2.3), to the new target point x*(t^+) of the path.
C. says that we are able to trace the path (all the time staying close to it in the sense of B.), increasing the penalty parameter linearly in the ratio (1 + 0.08 θ^{-1/2}) and accompanying each step in the penalty parameter by a single Newton step in x. And why we should be happy with this is explained by
D. If x is close, in the sense of (6.2.3), to a point x*(t) of the path, then the inaccuracy, in terms of the objective, of the point x as an approximate solution to (P) is bounded from above by 2θ t^{-1}:
$$f(x) - \min_{x \in G} f(x) \le \frac{2\theta}{t}. \qquad (6.2.6)$$
D. says that the inaccuracy of the iterates x(t_i) formed in the above path-following procedure goes to 0 as 1/t_i, while C. says that we are able to increase t_i linearly, at the cost of a single Newton step per each updating of t. Thus, we come to the following
Theorem 6.2.1 Assume that we are given
(i) a θ-self-concordant barrier F for the feasible domain G of problem (P);
(ii) a starting pair (x_0, t_0) with t_0 > 0 and x_0 being close, in the sense of (6.2.3), to the point x*(t_0).
Consider the path-following method (cf. (6.2.4) - (6.2.5))
$$t_{i+1} = \left(1 + \frac{0.08}{\sqrt{\theta}}\right) t_i; \qquad x_{i+1} = x_i - \frac{1}{1 + \lambda(F_{t_{i+1}}, x_i)}\,[\nabla^2 F_{t_{i+1}}(x_i)]^{-1}\,\nabla F_{t_{i+1}}(x_i). \qquad (6.2.7)$$
Then the iterates of the method are well-defined, belong to the interior of G, and the method possesses a linear global rate of convergence:
$$f(x_i) - \min_G f \le \frac{2\theta}{t_0}\left(1 + \frac{0.08}{\sqrt{\theta}}\right)^{-i}. \qquad (6.2.8)$$
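A minimal numerical sketch of the method (6.2.7), on a made-up LP over the box [-1,1]^2 with the standard logarithmic barrier (so θ = m = 4); the problem data and constants are my own illustration, not production code:

```python
import numpy as np

# Problem: minimize c^T x over G = {x : Ax <= b}, here the box [-1,1]^2
c = np.array([1.0, 0.5])
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
theta = A.shape[0]   # log-barrier of m linear inequalities: theta = m = 4

def grad_hess(t, x):
    """Gradient and Hessian of F_t(x) = t c^T x - sum_j ln(b_j - a_j^T x)."""
    s = b - A @ x
    assert np.all(s > 0), "iterate left the interior of G"
    g = t * c + A.T @ (1.0 / s)
    H = A.T @ (A / s[:, None] ** 2)    # sum_j a_j a_j^T / s_j^2
    return g, H

# x0 = 0 is the analytic center (minimizer of F itself), so for small t0
# the pair (x0, t0) satisfies the closeness condition (6.2.3).
t, x = 0.05, np.zeros(2)
for _ in range(400):
    t *= 1.0 + 0.08 / np.sqrt(theta)          # penalty update (6.2.4)
    g, H = grad_hess(t, x)
    step = np.linalg.solve(H, g)
    lam = np.sqrt(g @ step)                   # Newton decrement of F_t at x
    x = x - step / (1.0 + lam)                # damped Newton step (6.2.5)

gap = c @ x - (-1.5)   # optimum of c^T x over the box is x* = (-1,-1)
```

After 400 steps t has grown by the factor 1.04^400 ≈ 6.5e6, and in accordance with (6.2.6) the residual in the objective is of order 2θ/t, i.e., far below 1e-3.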
154 LECTURE 6. CONSTRAINED MINIMIZATION
In particular, to make the residual in f less than a given ε > 0, it suffices to perform no more than
$$N(\varepsilon) \le 20\sqrt{\theta}\,\ln\left(1 + \frac{2\theta}{t_0\,\varepsilon}\right) \qquad (6.2.9)$$
Newton steps.
We see that the parameter θ of the self-concordant barrier underlying the method is responsible for the Newton complexity of the method, via the factor √θ at the log-term in the complexity bound (6.2.9).
Remark 6.2.1 The presented result does not explain how to start tracing the path, i.e., how to get an initial pair (x_0, t_0) close to the path. This turns out to be a minor difficulty: given in advance a strictly feasible solution $\bar x$ to (P), we could use the same path-following scheme (applied to a certain artificial objective) to come close to the path x*(·), thus arriving at a position from which we can start tracing the path. In our very brief outline of the topic, it makes no sense to go into these details of initialization; it suffices to say that the necessity to start from approaching x*(·) basically does not spoil the overall complexity of the method.
It makes sense, if not to prove the aforementioned statements (the complete proofs, although rather simple, go beyond the scope of our today's lecture), then at least to motivate them, i.e., to explain the role of self-concordance and of the "magic inequality" (6.2.2) in ensuring properties C. and D. (this is all we need; the Theorem, of course, is an immediate consequence of these two properties).
Let us start with C.; this property is much more important. Thus, assume we are at a point x close, in the sense of (6.2.3), to x*(t). What does this inequality actually say?
Let us denote by
$$\|h\|_{H^{-1}} = (h^T H^{-1} h)^{1/2}$$
the scaled Euclidean norm given by the inverse to the Hessian matrix
$$H = \nabla^2 F_t(x) = \nabla^2 F(x)$$
(the equality comes from the fact that F_t and F differ by a linear function t f(x) ≡ t c^T x). Note that by definition of λ(·,·) one has
$$\lambda(F_\tau, x) = \|\nabla F_\tau(x)\|_{H^{-1}} = \|\tau c + \nabla F(x)\|_{H^{-1}}.$$
Due to the last formula, the closeness of x to x*(t) (see (6.2.3)) means exactly that
$$\|t c + \nabla F(x)\|_{H^{-1}} = \lambda(F_t, x) \le 0.05,$$
whence, by the triangle inequality,
$$t\,\|c\|_{H^{-1}} \le 0.05 + \|\nabla F(x)\|_{H^{-1}} \le 0.05 + \sqrt{\theta} \qquad (6.2.10)$$
(the concluding inequality here is given by (6.2.2), and this is the main point where this component of the definition of a self-concordant barrier comes into play).
From the indicated relations,
$$\lambda(F_{t^+}, x) = \|t^+ c + \nabla F(x)\|_{H^{-1}} \le (t^+ - t)\,\|c\|_{H^{-1}} + \|t c + \nabla F(x)\|_{H^{-1}} = \frac{t^+ - t}{t}\, t\,\|c\|_{H^{-1}} + \lambda(F_t, x)$$
[see (6.2.4), (6.2.10)]
$$\le \frac{0.08}{\sqrt{\theta}}\,(0.05 + \sqrt{\theta}) + 0.05 \le 0.134$$
(note that θ ≥ 1 by Definition 6.2.1). According to Proposition 5.4.1.(iii.3), Lecture 5, the indicated inequality says that we are in the domain of quadratic convergence of the damped Newton method as applied to the self-concordant function F_{t^+}; namely, the indicated Proposition says that
$$\lambda(F_{t^+}, x^+) \le \frac{2\,(0.134)^2}{1 - 0.134} < 0.05,$$
as claimed in C. Note that this reasoning heavily exploits the self-concordance of F.
To establish property D., one needs to analyze in more detail the notion of a self-concordant barrier, and I am not going to do it here. Just to demonstrate where θ comes from, let us prove an estimate similar to (6.2.6) for the particular case when, first, the barrier in question is the standard logarithmic barrier given by Example 6.2.1 and, second, the point x is exactly the point x*(t) rather than close to the latter point. Under the outlined assumptions we have
$$x = x^*(t) \;\Longrightarrow\; \nabla F_t(x) = 0$$
[substitute the expressions for F_t and F]
$$\Longrightarrow\; t\,c + \sum_{j=1}^{m} \frac{a_j}{b_j - a_j^T x} = 0.$$
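The computation can be completed in one more step (a sketch of the standard argument): take the inner product of the last identity with x - x*, where x* is an optimal solution of (P):

```latex
t\,c^T(x - x^*)
= \sum_{j=1}^{m} \frac{a_j^T (x^* - x)}{b_j - a_j^T x}
= \sum_{j=1}^{m} \frac{(b_j - a_j^T x) - (b_j - a_j^T x^*)}{b_j - a_j^T x}
\le \sum_{j=1}^{m} 1 = m,
```

since b_j - a_j^T x* ≥ 0 for the feasible x*. Thus c^T x - min_G c^T(·) ≤ m/t, which is exactly of the form (6.2.6) with θ = m; the extra factor 2 in (6.2.6) accounts for x being merely close to, rather than equal to, x*(t).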
6.2.2 Applications
Linear Programming. The most famous (although, I believe, not the most important) application of Theorem 6.2.1 deals with Linear Programming, when G is a polytope and F is the standard logarithmic barrier for this polytope (see Example 6.2.1). For this case, the Newton complexity of the method is O(√m), m being the number of linear inequalities involved in the description of G. Each Newton step costs, as is easily seen, O(m n²) arithmetic operations, so that the arithmetic cost per accuracy digit (the number of arithmetic operations required to reduce the current inaccuracy by an absolute constant factor) turns out to be O(m^{1.5} n²). Thus, we get a polynomial time solution method for LP with very nice complexity characteristics, typically (for m and n of the same order) better than those, e.g., of the Ellipsoid method. Note also that with certain "smart" implementation of the Linear Algebra, the above arithmetic cost can be reduced to O(m n²); this is the best known so far "cubic in the size of the problem" upper complexity bound for Linear Programming.
To extend the list of application examples, note that our abilities to solve in the outlined style a convex program of a given structure are limited only by our abilities to point out a self-concordant barrier for the corresponding feasible domain. In principle, there are no limits at all: it can be proved that every closed convex domain in R^n admits a self-concordant barrier with the value of the parameter at most O(n). This "universal barrier" is given by a certain multivariate integral and is too complicated for actual computations; recall that we should form and solve Newton systems associated with our barrier, so that we need it to be explicitly computable.
Thus, we come to the following important question:
How to construct explicit self-concordant barriers? There are many cases when we are clever enough to point out explicitly computable self-concordant barriers for the convex domains we are interested in. We already know one example of this type, Linear Programming (although we do not know at the moment why the standard logarithmic barrier for a polytope given by m linear constraints is m-self-concordant). What helps us to construct self-concordant barriers and to evaluate their parameters are the following extremely simple combination rules, completely similar to those for self-concordant functions (see Section 5.4.4, Lecture 5):
[Affine substitution] Let F(x) be a θ-self-concordant barrier for the closed convex domain G ⊂ R^n, and let x = Aξ + b be an affine mapping from R^k into R^n with the image intersecting int G. Then the composite function
$$F^+(\xi) = F(A\xi + b)$$
is a θ-self-concordant barrier for the closed convex domain
$$G^+ = \{\xi \mid A\xi + b \in G\},$$
which is the inverse image of G under the affine mapping in question.
The indicated combination rules can be applied to the following "raw materials":
[Logarithm] The function
$$-\ln(x)$$
is a 1-self-concordant barrier for the nonnegative ray R_+ = {x ∈ R | x > 0};
[the indicated property of the logarithm is given by a one-line computation]
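Indeed, writing out the promised one-line computation for F(x) = -ln x:

```latex
F'(x) = -\frac{1}{x}, \qquad
F''(x) = \frac{1}{x^{2}}, \qquad
F'''(x) = -\frac{2}{x^{3}}
\;\Longrightarrow\;
|F'''(x)| = 2\,\bigl(F''(x)\bigr)^{3/2}, \qquad
\frac{(F'(x))^{2}}{F''(x)} = 1,
```

so the third-derivative (self-concordance) inequality holds with the constant 2, and the "magic inequality" (6.2.2) holds with θ = 1.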
[Extension of the previous example: logarithmic barrier, linear/quadratic case] Let
$$G = \operatorname{cl}\{x \in \mathbb{R}^n \mid \phi_j(x) < 0,\ j = 1, \dots, m\}$$
be a nonempty set in R^n given by m convex quadratic (e.g., linear) inequalities satisfying the Slater condition. Then the function
$$f(x) = -\sum_{j=1}^{m} \ln(-\phi_j(x))$$
is an m-self-concordant barrier for G.
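For instance (a sketch; it uses, besides the affine substitution rule stated above, the analogous summation rule by which a sum of barriers is a barrier with parameter equal to the sum of the parameters), the combination rules already explain Example 6.2.1: the function F(u) = -∑_{j=1}^m ln u_j is an m-self-concordant barrier for the nonnegative orthant, and substituting the affine map u = b - Ax gives

```latex
F^{+}(x) = F(b - Ax) = -\sum_{j=1}^{m} \ln\bigl(b_j - a_j^T x\bigr),
```

which is exactly the standard logarithmic barrier for the polytope G = {x | Ax ≤ b}, with parameter θ = m.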
One could hardly imagine how wide the class of applications of the latter two barriers (ranging from Combinatorial Optimization to Structural Design and Stability Analysis/Synthesis in Control) is, especially of the Log-Det one.
the actually observed number of Newton iterations required to solve the problem within reasonable accuracy is basically independent of the size of the problem and is within 30-50 (even 20 iterations for most situations).
This empirical fact (which can be only partly supported by theoretical considerations, not proved completely) is extremely important for applications; it makes polynomial time interior point methods the most attractive (and sometimes the only appropriate) optimization tool in many important large-scale applications.
I should add that the efficient "long-step" implementations of the path-following scheme
are relatively new, and for a long time 6) the only interior point methods which demonstrated the outlined data- and size-independent convergence rate were the so-called potential reduction interior point methods. In fact, the very first interior point method, the method of Karmarkar for LP, which initiated the entire "interior point revolution", was a potential reduction algorithm, and what indeed caused the revolution was the outstanding practical performance of this method. The method of Karmarkar possesses a very nice (and in fact very simple) geometry and is closely related to the interior penalty scheme; anyhow, time limitations enforce me to skip the description of this wonderful, although now old-fashioned, algorithm.
The concluding remark I would like to make is as follows: all polynomial time implementations of the penalty/barrier scheme known so far are implementations of the barrier scheme (which is reflected in the name of these implementations: "interior point methods"); numerous attempts to do something similar with the penalty approach have failed. It is a pity, due to some attractive properties of the penalty scheme (e.g., there you do not meet the problem of finding a feasible starting point, which, of course, is needed to start the barrier scheme).
6) if one may call "long" a part of a story which itself started only in 1984