Cours D'optimisation

1
Universit
e Joseph Fourier
Master de Math
ematiques Appliquees , 2`
eme ann
ee
EFFICIENT METHODS IN OPTIMIZATION

Theorie de la programmation non-lineaire
Anatoli Iouditski
Lecture Notes
Optimization problems arise naturally in many application elds. Whatever people do,
at some point they get a craving for organizing things in a best possible way. This intention,
converted in a mathematical form, appears to be an optimization problem of certain type
(think of, say, Optimal Diet Problem). Unfortunately, the next step, consisting of nding a
solution to the mathematical model is less trivial. At the rst glance, everything looks very
simple: many commercial optimization packages are easily available and any user can get a
solution to his model just by clicking on an icon at the desktop of his PC. However, the
question is, how much he could trust it?
One of the goals of this course is to show that, despite to their attraction, the general
optimization problems very often break the expectations of a naive user. In order to apply
these formulations successfully, it is necessary to be aware of some theory, which tells us what
we can and what we cannot do with optimization problems. The elements of this theory can
be found in each lecture of the course.
This course itself is based on the lectures given by Arkadi Nemirovski at Technion in late
1990s. On the other hand all the errors and inanities you may nd here should be put on
the account of the name at the title page.
http://www-ljk.imag.fr/membres/Anatoli.Iouditski/cours/optimisation-convexe.htm
2
Contents
1 Introduction 5
1.1 General formulation of the problem . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Problem formulation and terminology . . . . . . . . . . . . . . . . . . 5
1.1.2 Performance of Numerical Methods . . . . . . . . . . . . . . . . . . . 8
1.2 Complexity bounds for Global Optimization . . . . . . . . . . . . . . . . . . 10
1.3 Identity cards of the elds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Rules of the game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Suggested reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 When everything is simple: 1-dimensional Convex Optimization 19

2.1 Example: one-dimensional convex problems . . . . . . . . . . . . . . . . . . 19
2.1.1 Upper complexity bound . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.2 Lower complexity bound . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Can we use 1-dimensional optimization? . . . . . . . . . . . . . . . . 27
2.3.2 Extremal Ellipsoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Methods with linear convergence 37

3.1 Class of general convex problems: description and complexity . . . . . . . . 37
3.2 Cutting Plane scheme and Center of Gravity Method . . . . . . . . . . . . . 39
3.2.1 Case of problems without functional constraints . . . . . . . . . . . . 39
3.3 The general case: problems with functional constraints . . . . . . . . . . . . 44
3.4 Lower complexity bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 The Ellipsoid method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.1 Ellipsoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.2 The Ellipsoid method . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6.1 Some extensions of the Cutting Plane scheme . . . . . . . . . . . . . 60
3.6.2 The method of outer simplex . . . . . . . . . . . . . . . . . . . . . . 71
3
4 CONTENTS
4 Large-scale optimization problems 73

4.1 Goals and motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 The main result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Upper complexity bound: Subgradient Descent . . . . . . . . . . . . . . . . . 76
4.4 The lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5 Subgradient Descent for Lipschitz-continuous convex problems . . . . . . . . 80
4.6 Bundle methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.6.1 The Level method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6.2 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.7.1 Around Subgradient Descent . . . . . . . . . . . . . . . . . . . . . . . 92
4.7.2 Mirror Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.7.3 Stochastic Approximation . . . . . . . . . . . . . . . . . . . . . . . . 101
4.7.4 The Stochastic Approximation method . . . . . . . . . . . . . . . . . 105
5 Nonlinear programming: Unconstrained Minimization 109

5.1 Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.3 Newton method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.3.1 Gradient Method and Newton Method: What is dierent? . . . . . . 121
5.4 Newton Method and Self-Concordant Functions . . . . . . . . . . . . . . . . 125
5.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4.2 Self-concordance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.4.3 Self-concordant functions and the Newton method . . . . . . . . . . . 127
5.4.4 Self-concordant functions: applications . . . . . . . . . . . . . . . . . 130
5.5 Conjugate gradients method . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.6.1 Implementing Gradient method . . . . . . . . . . . . . . . . . . . . . 137
5.6.2 Regular functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.6.3 Properties of the gradient . . . . . . . . . . . . . . . . . . . . . . . . 139
5.6.4 Classes of dierentiable functions . . . . . . . . . . . . . . . . . . . . 142
6 Constrained Minimization 145

6.1 Penalty and Barrier function methods . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Self-concordant barriers and path-following scheme . . . . . . . . . . . . . . 150
6.2.1 Path-following scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.2.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2.3 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Lecture 1
Introduction
(General formulation of the problem; Important examples; Black Box and Iterative Methods;
Analytical and Arithmetical Complexity; Uniform Grid Method; Lower complexity bounds;
Lower bounds for Global Optimization; Rules of the Game.)
1.1 General formulation of the problem

Let us start with xing the mathematical form of our main problem and the standard
terminology.
1.1.1 Problem formulation and terminology

Let x be an n-dimensional real vector: x = (x(1) , . . . , x(n) ) Rn , G be a subset of Rn , and
functions f (x), g1 . . . gm (x) are some real-valued function of x.
In the entire course we deal with some variants of the following minimization problem:
min f (x),
s.t.: gj (x) & 0, j = 1 . . . m, (1.1.1)
x G,
where & could be , or =.

We call f (x) the objective function, the vector function g(x) = (g1 (x), . . . , gm (x)) is called
the functional constraint, the set G is called the basic feasible set, and the set
Q = {x G | gj (x) 0, j = 1 . . . m},
is called the feasible set of the problem (1.1.1).1

There is some natural classication of the types of minimization problems:
1
That is just a convention to consider the minimization problems. Instead, we could consider a maxi-
mization problem with the objective function f (x).
5
6 LECTURE 1. INTRODUCTION
Constrained problems: Q Rn .
Unconstrained problems: Q Rn .
Smooth problems: all gj (x) are dierentiable.
Nonsmooth problems: there is a nondierentiable component gk (x),
Linearly constrained problems: all functional constraints are linear:

n

gj (x) = ai,j xi + bj aj , x + bj , j = 1 . . . m,
i=1
(here , stands for the inner product in Rn ), and G is a polyhedron.

If f (x) is also linear then (1.1.1) is a Linear Programming Problem. If f (x) is quadratic
then (1.1.1) is a Quadratic Programming Problem.
There is also some classication in accordance to the properties of the feasible set.
Problem (1.1.1) is called feasible if Q = .
Problem (1.1.1) is called strictly feasible if x int Q such that gj (x) < 0 (or > 0) for
all inequality constraints and gj (x) = 0 for all equality constraints.
Finally, we distinguish dierent types of solutions to (1.1.1):
x is called the optimal global solution to (1.1.1) if
f (x ) f (x) for all x Q
(global minimum). Then f (x ) is called the optimal value of the problem.
x is called a local solution to (1.1.1) if
f (x ) f (x) for all x int Q

Q
(local minimum).
Let us consider now several examples demonstrating the origin of the optimization prob-
lems.
Example 1.1.1 Let x(1) . . . x(n) be our design or decision variables. Then we can x some
functional characteristics of our decision: f (x), g1 , . . . , gm (x). That could be the price of
the project, the amount of the required resources, the reliability of the system, and many
others.
We x the most important characteristics, f (x), as our objective. For all others we impose
some bounds: aj gj (x) bj .
1.1. GENERAL FORMULATION OF THE PROBLEM 7
Thus, we come up with the problem:

min f (x),
s.t.: aj gj (x) bj , j = 1 . . . m,
x G,
where G stands for the structural constraints, like positiveness or boundedness of some
variables, etc.
Example 1.1.2 Let our initial problem be as follows: Find x Rn such that
g1 (x) = a1 ,
... (1.1.2)
gm (x) = am .
Then we can consider the problem:

m

min (gj (x) aj )2
x
j=1
(may be with some additional constraints on x).

Note that the problem (1.1.2) is almost universal. It covers ordinary dierential equa-
tions, partial dierential equations, problems, arising in Game Theory, and many others.
Example 1.1.3 Sometimes our decision variable x(1) . . . x(n) must be integer, say, we need
x(i) {0, 1}. That can be described by the constraint:
x(i) (x(i) 1) = 0, i = 1 . . . n.
Thus, we could treat also the {0,1} Programming Problems:
min f (x),
s.t.: aj gj (x) bj , j = 1 . . . m,
x G,
x(i) (x(i) 1) = 0, i = 1 . . . n.
Looking at these examples, a reader can understand the enthusiasm of the pioneers of
nonlinear programming, which can be easily recognized in the papers of 1950 1960. Thus,
our rst impression should be as follows:
Nonlinear Optimization is very important and perspective application theory.

It covers almost all elds of Numerical Analysis.
However, just by looking at the same list, especially at Examples 1.1.2, 1.1.3, a more suspi-
cious (or more experienced) reader should come to the following conjecture:
In general, optimization problems are unsolvable.
Indeed, the life is too complicated to believe in a universal tool for solving all problems at
once.
However, conjectures are not so important in science; that is a question of the personal
taste how much we can believe in them. The most important event in the optimization
theory in the middle of 70s was that this conjecture was proved in some strict sense. The
proof is so simple and remarkable, that we cannot avoid it in our course. But rst of all,
we should introduce some special language, which is necessary to speak about such serious
things.
1.1.2 Performance of Numerical Methods

Let us consider some sample situation: We have a problem P, which we are going to solve
using a method M. What is the performance of M on P? Let us start with the general
denition:
Performance of M on P is the total amount of computational eorts, which is

required by method M to solve the problem P.
In this denition there are several things to be specied. First, what does it mean: to solve
the problem? In some elds it could mean to nd the exact solution. However, in many areas
of numerical analysis that is impossible (and optimization is denitely the case). Therefore,
for us to solve the problem should mean:
To nd an approximate solution to M with a small accuracy > 0.
Now, we know that there are dierent numerical methods for doing that, and of course, we
want to choose the scheme, which is the best for our P. However, it appears that we are
looking for something, what does not exist. In fact, it does, but it is too silly. Just imagine
a method for solving (1.1.1), which always reports that x = 0. Of course, it does not work
on all problems except those with x = 0. And for the latter problems its performance is
better than that of all other schemes.
Thus, we cannot speak about the best method for a concrete problem P, but we can do
that for a class of problems F P. Indeed, usually the numerical methods are developed
1.1. GENERAL FORMULATION OF THE PROBLEM 9
for solving many dierent problems with the similar characteristics. Therefore we can dene
the performance of M on F as its performance on the worst problem from F .
Since we are going to speak about the performance of M on the whole class F , we should
assume that M does not have a complete information about a concrete problem P. It has
only the description of the problem class F . In order to recognize P (and solve it), the
method should be able to collect the personal information about P by parts. For modeling
this situation, it is convenient to introduce the notion of oracle. Oracle O is just a unit, which
answers the successive questions of the method. The method M, collecting and handling
the data, is trying to solve the problem P.
In general, each problem can be included in dierent problem classes. For each problem
we can imagine also the dierent types of oracles. But if we x F and O, then we x a model
of our problem P. In this case, it is natural to dene the performance of M on (F , O) as
its performance on the worst Pw from F .2
Let us now consider the iterative process which naturally describes any method M work-
ing with the oracle.
General Iterative Scheme. (1.1.3)
Input:
A starting point x0 and an accuracy > 0.
Initialization.
Set k = 0, I1 = . Here k is the iteration counter and Ik is the
informational set accumulated after k iterations.
Main Loop.
1. Call the oracle O at xk .

2. Update the informational set: Ik = Ik1 (xk , O(xk )).
3. Apply the rules of method M to Ik and form the new test point xk+1 .
4. Check the stopping criterion. If yes then form an output x. Otherwise
set k = k + 1 and go to 1.
End of the Loop.
Now we can specify the term computational eorts in our denition of the performance.
In the scheme (1.1.3) we can easily nd two main sources of that. First one is in the Step 1,
where we call the oracle, and the second one is in Step 3, where we form the next test point.
We introduce two measures of the complexity of the problem P for the method M:
1. Analytical complexity: The number of calls of the oracle, which is required to
solve the problem P up to the accuracy .
2. Arithmetical complexity: The total number of the arithmetic operations (in-
cluding the work of the oracle and the method), which is required to solve the
problem P up to the accuracy .
2
Note that this Pw can be bad only for M .
Thus, the only thing which is not clear now, is the meaning of the words up to the accuracy
> 0. Note, that this meaning is very important for our denitions of the complexity.
However, it is too specic to speak about that here. We will make this meaning exact when
we will consider the concrete problem classes.
Comparing the notions of analytical and arithmetical complexity, we can see that the
second one is more realistic. However, for a concrete method M, the arithmetical complexity
usually can be easily obtained from the analytical complexity. Therefore, in this course we
will speak mainly about some estimates of the analytical complexity of some problem classes.
There is one standard assumption about the oracle, which allows to obtain most of the
results on the analytical complexity of the optimization methods. This assumption is called
the black box concept and it looks as follows:
1. The only information available from the oracle is its answer. No intermediate
results are available.
2. The oracle is local: A small variation of the problem far enough from the test
point x does not change the answer at x.
This concept is extremely popular in the numerical analysis. Of course, it looks as an
articial wall between the method and the oracle created by ourselves. It seems natural to
allow the method to analyze the internal structure of the oracle. However, we will see that
for some problems with complicated structure this analysis is almost useless. On the other
hand, for some important problems it could help. If we have enough time, that will the
subject of the last lecture of this course.
To conclude this section, let us present the main types of the oracles used in optimization.
For all of them the input is a test point x Rn , but the output is dierent:
Zero-order oracle: the value f (x).
First-order oracle: the value f (x) and the gradient f (x).
Second-order oracle: the value f (x), the gradient f (x) and the Hessian f (x).
1.2 Complexity bounds for Global Optimization

Let us practice in applying the formal language, introduced in the previous section, to a
concrete problem. For that, let us consider
min f (x). (1.2.4)

xBn
In our terminology, that is an unconstrained minimization problem without functional con-

straints. The basic feasible set of this problem is Bn , an n-dimensional cube in Rn :
Bn = {x Rn | 0 xi 1, i = 1, . . . , n}.
In order to specify the problem class, let us make the following assumption:
1.2. COMPLEXITY BOUNDS FOR GLOBAL OPTIMIZATION 11
The objective function f (x) is Lipshitz continuous on Bn :
x, y Bn : | f (x) f (y) | L x y
with some constant L (Lipshitz constant).
Here and in the sequel we use notation for the Euclidean norm on Rn :

n

x = x, x = (x )2 .
i
i=1
Let us consider a trivial method for solving (1.2.4), which is called the Uniform Grid
Method. This method, G(p), has one integer input parameter p and its scheme is as follows.
Scheme of the method G(p). (1.2.5)
1. Form (p + 1)n points

i1 i2 in
x(i1 ,i2 ,...,in ) = , ,..., ,
p p p
where
i1 = 0, . . . , p,
i2 = 0, . . . , p,
...
in = 0, . . . , p.
2. Among all points x(...) nd the point x with the minimal value of the objective function.
3. Return the pair (
x, f (x)) as the result.
Thus, this method forms a uniform grid of the test points inside the cube Bn , computes
the minimal value of the objective over this grid and returns it as an approximate solution
to the problem (1.2.4). In our terminology, this is a zero-order iterative method without
any inuence of the accumulated information on the sequence of test points. Let us nd its
eciency estimate.
Theorem 1.2.1 Let f be the global optimal value of problem (1.2.4). Then

n
x) f L
f ( .
2p
Proof:
Let x be the global minimum of our problem. Then there exists a number (i1 , i2 , . . . , in )
such that
x x(i1 ,i2 ,...,in ) x x(i1 +1,i2 +1,...,in +1) y
(here and in the sequel we write x y for x, y Rn if and only if xi yi , i = 1, . . . , n).

Note that yi xi = 1/p, i = 1, . . . , n, and xi [xi , yi ], i = 1, . . . , n,
Denote x = (x + y)/2. Let us form a point x as follows:

yi , if xi xi ,
xi =

xi , otherwise.
1
It is clear that | xi xi | 2p
, i = 1, . . . , n. Therefore
n
n
x x 2 = xi xi )2
( .
i=1 4p2
Since x belongs to our grid, we conclude that

n
x) f (x ) f (
f ( x) f (x ) L x x L
2p
Note that now we still cannot say what is the complexity of this method on the prob-
lem (1.2.4). And the reason is that we did not dene what should be the quality of the
approximate solution we are looking for. Let us dene our goal as follows:
x) f .
Find x Bn : f ( (1.2.6)
Then we immediately get the following result.
Corollary 1.2.1 The analytical complexity of the method G is as follows:
n
n
A(G) = L +2
2
(here ]a[ is the integer part of a).
Proof:
Indeed, let us take p = L 2n + 1. Then p L 2n , and therefore, in view of Theorem 1.2.1,
we have:
n
x) f L
f ( .
2p
This result is more informative, but we still have some questions. First, may be our proof
is too rough and the real performance of G(p) is much better. Second, we cannot be sure
that this is a reasonable method for solving (1.2.4). May be there are some methods with
much higher performance.
In order to answer these questions, we need to derive for (1.2.4), (1.2.6) the lower com-
plexity bounds. The main features of these bounds are as follows.
1.2. COMPLEXITY BOUNDS FOR GLOBAL OPTIMIZATION 13
They are based on the Black Box concept.

They can be derived for a specic class of problems F equipped by a local oracle O.
These bounds are valid for all reasonable iterative schemes. Thus, they provide us with
a lower bound for the analytical complexity on the problem class.
They use the idea of resisting oracle.
For us only the notion of the resisting oracle is new. Therefore, let us discuss it in details.
A resisting oracle is trying to create a worst problem for each concrete method. It
starts from an empty function and it tries to answer each call of the method in the worst
possible way. However, the answers must be coherent with the previous answers and with the
description of the problem class. Note that after the termination of the method it is possible
to reconstruct the created problem. Moreover, if we launch the method on this problem, it
will reproduce the same sequence of the test points since it will have the same answers of
the oracle.
Let us show how it works for the problem (1.2.4). Consider the class of problems F
dened as follows:
Problem Formulation: min f (x).
xBn
Problem Class: f (x) is Lipshitz continuous on Bn .
Approximate solution: Find x Bn : f (x) f .
n
L
Theorem 1.2.2 The analytical complexity of F for the 0-order methods is at least 2

1.
Proof:
Assume that there exists a method, which needs less than

n L
p 1, p= ( 1),
2
calls of oracle to solve any problem of our class up to accuracy > 0. Let us suppose that
when the method nds its approximate solution x we allow it to call the oracle one more
time at x, which will not be counted in our complexity evaluation. So, the total number of
calls to the oracle of the method is N < pn .
Let us apply this method to the following resisting oracle:
It reports that f (x) = 0 at any test point.
Therefore this method can nd only x Bn : f ( x) = 0.
Note that there exists x Bn such that
1
x + e Bn , e = (1, . . . , 1),
p
and there were no test points inside the box
1
B = {x | x x x + e}.
p
1
Denote x = x + 2p
e and consider the function
f(x) = min{0, L x x },
where a = max | ai |.
1in
Note that the function f(x) is Lipshitz continuous (since a a ) and the optimal
value of f() is . Moreover, f(x) diers from zero only inside the box

B = {x | x x }.
L
Since 2p L/, we conclude that
1
B B {x | x x }.
2p
Thus, f(x) is equal to zero at all test points of our method. Since the accuracy of the
result of our method is , we come to the following conclusion: If the number of calls of the
oracle is less than pn then the accuracy of the result cannot be less than .
Now we can say much more about the performance of the uniform grid method. Let us
compare its eciency estimate with the lower bound:
n n
n L
G: L , Lower bound: .
2 2
Thus, we conclude that G has optimal dependence of its complexity in , but not in n. Note
that our conclusion depends on the problem class. If we consider the functions f :
x, y Bn : | f (x) f (y) | L x y
then the same reasoning
n as before proves that the uniform grid method is optimal with the
L
eciency estimate 2 .
Theorem 1.2.2 supports our initial claim that the general optimization problems are
unsolvable. Let us look at the following example.
Example 1.2.1 Consider the problem class F dened by the following parameters:
L = 2, n = 10, = 0.01.
Note that the size of the problem is very small and
we ask only for 1% accuracy.
n
L
The lower complexity bound for this class is 2 . Let us compute what does it mean:
Lower bound: 1020 calls of oracle,
Complexity of the oracle: n a.o.,
Total complexity: 1021 a.o.,
Intel Quad Core Processor: 109 a.o. per second,
Total time: 1012 seconds,
1 year: less than 3.2 107 sec.
We need: 32 000 years.

1.3. IDENTITY CARDS OF THE FIELDS 15
This estimate is so disappointing that we cannot believe that such problems may become
solvable even in the future. Indeed, suppose we believe into the Moore low, i.e. that the
processor power is to be multiplied by 3 every 2 years. We can hope that a PC of 2030 will
solve the problem in only 1 year, and in 2070 it will only take 1 second. However, let us just
play with the parameters of the class.
If we change n for n + 1 then we have to multiply our estimate by 100. Thus, for
n = 11 our time estimate is valid for the fastest available computer.
But if we multiply by two, we reduce the complexity by the factor of 1000. For
example, if = 8% then we need only 2 days to solve the problem.
We should note, that the lower complexity bounds for problems with smooth functions,
or for the high-order methods is not much better than that of Theorem 1.2.2. This can be
proved using the same arguments and we leave the proof as an exercise for the reader. An
advanced reader can compare our results with the upper bound for NP-hard problems, which
are considered as the examples of very dicult problems in combinatorial optimization. It
is 2n a.o. only!
To conclude this section, let us compare our situation with some other elds of numerical
analysis. It is well-known, that the uniform grid approach is a standard tool for many of
them. For example, if we need to compute numerically the value of the integral
1
I= f (x)dx,
0
then we have to form the discrete sum

n
1 i
Sn = f (xi ), xi = , i = 1, . . . , N.
N i=1 N
If f (x) is Lipshitz continuous then the result can be used as an approximation to I:
N = L/ | I SN | .
Note that in our terminology it is exactly the uniform grid approach. Moreover, it is a
standard way for approximating the integrals. The reason why it works here is in the
dimension of the problem. For integration the standard sizes are 1 3, and in optimization
sometimes we need to solve problems with several million variables.
1.3 Identity cards of the elds

After the pessimistic result of the previous section, rst of all we should understand what
could be our goal in the theoretical analysis of the optimization problems. Hopefully, by now
everything is clear with the global optimization. But may be its goals are too ambitious?
May be in some practical problems we would be satised by much less optimal solution?
Or, may be there are some interesting problem classes, which are not so terrible as the class
of general continuous functions?
In fact, all these question can be answered in a dierent way. And this way denes the
style of the research (or rules of the game) in the dierent optimization elds. If we will try
to classify them, we will easily see that they dier one from another in the following aspects:
Description of the goals.
Description of the problem class.
Description of the oracle.
These aspects dene in a natural way the list of desired properties of the optimization
methods.
To conclude this lecture, let us present the identity cards of the elds we will consider
in our course.
Name: Global Optimization. Goals: Find a global minimum.

Problem Class: Continuous functions.
Oracle: 0 1 2 order black box.
Desired properties: Convergence to a global minimum.
Features: This game is too short. We always lose it.
Problem Sizes: Sometimes people pretend to solve problems with several thou-
sands of variables. No guarantee for success even for very small problems.
History: Starts from 1955. Several local peaks of interest (simulated annealing,
neural networks, genetic algorithms).
Name: Nonlinear Programming.

Goals: Find a local minimum.
Problem Class: Dierentiable functions.
Oracle: 1 2 order black box.
Desired properties: Convergence to a local minimum. Fast convergence.
Features: Variability of approaches. Most widespread software. The goal is not
always acceptable.
Problem Sizes: up to 1000 variables.
History: Starts from 1955. Peak period: 1965 1975, variable theoretical
activity right now.
Name: Convex Optimization.

Goals: Find a global minimum.
Problem Class: Convex sets and functions.
Oracle: 1st order black box.
1.4. RULES OF THE GAME 17
Desired properties: Convergence to a global minimum. Rate of convergence

depends on the dimension.
Features: Very reach and interesting theory. Complete complexity theory. Ef-
cient practical methods. The problem class is sometimes restrictive.
Problem Sizes: up to 1 000 000s of variables.
History: Starts from 1970. Peak period: 1975 1985, currently a peak of
theoretical activity concerning methods for extremely large scale problems.
Name: Interior-Point Polynomial Methods.(Would be a nice subject for an ad-

vanced course)
Goals: Find a global minimum.
Problem Class: Convex sets and functions.
Oracle: 2nd order oracle which is not a black box.
Desired properties: Fast convergence to a global minimum. Rate of conver-
gence depends on the structure of the problem.
Features: Very new and perspective theory. Avoid the black box concept. The
problem class is the same as in Convex Programming.
Problem Sizes: Sometimes up to 1 000 000s variables.
History: Starts from 1984. Peak period: 1990 . . ., high theoretical activity
now.
1.4 Rules of the game

Some of the statements in the Course (theorems, propositions, lemmas, examples (if the latter
+
contain certain statement) are marked by superscripts or . The unmarked statements
are obligatory: you are required to know both the statement and its proof. The statements
marked by are semi-obligatory: you are expected to know the statement itself and may
skip its proof (the latter normally accompanies the statement), although you are welcome,
+
of course, to read the proof as well. The proofs of the statements marked by are omitted;
you are expected to be able to prove these statements by yourself, and these statements are
parts of the assignments.
The majority of Lectures are accompanied by the Exercise sections. In several cases,
the exercises are devoted to the lecture where they are placed; sometimes they prepare the
reader to the next lecture. Exercises marked by # are closely related to the lecture where
they are placed or to the following one; it would be a good thing to solve such an exercise
or at least to become acquainted with its solution (if any is given). Exercises which I nd
dicult are marked with > .
1.5 Suggested reading

It could be helpful to have a good book on analysis at hand.
If you want to improve your background on the basic mathematical notions involved,
consider the reference
J. Cea: Optimisation, theorie et algorithmes Dunod, Paris (1971).
The main drawback of the little blue book by C. Lemarechal: Methodes numeriques
doptimisation, Notes de cours, Universite Paris IX-Dauphine, INRIA, Rocquencourt, 1989,
is that it is too small.
As far as the main body of the course is concerned, for Chapter 5 I would suggest the
reference
A. Auslender: Optimisation. Methodes numeriques, Masson, Paris (1976).
P.J. Laurent: Approximation et Optimisation, Hermann (1972)
All these books also possesses that important quality of being written in French. If you decide
that you are interested in Convex Optimization, the following reading would is extremely
gratifying
A. Ben-Tal, A. Nemirovski, Lectures on Modern Convex Optimization, MPS-SIAM

Series on Optimization, SIAM, Philadelphia, (2001).
Yu. Nesterov Introductory Lectures On Convex Optimization: A Basic Course,

Boston: Kluwer Academic Publishers, (2004).
S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press

(2004).
The latter book is available on the S. Boyds page: http://www.stanford.edu/ boyd/index.html

Lecture 2
When everything is simple:

1-dimensional Convex Optimization
(Complexity of One-dimensional Convex Optimization: Upper and Lower Bounds)
2.1 Example: one-dimensional convex problems

In this part of the course we are interested in theoretically ecient methods for convex opti-
mization problems. We now take a simple start with a one-dimensional convex minimization.
Let us apply our formal language to this example, where everything can be easily seen.
Consider one-dimensional convex problem
minimize f (x) s.t. x G = [a, b], (2.1.1)
where [a, b] is a given nite segment on the axis. It is also known that our objective f is a
continuous convex function on G; for the sake of simplicity, assume that we know bounds,
let them be 0 and V , for the values of the objective on G. Thus, all we know about the
objective is that it belongs to the family
F = {f : [a, b] R | f is convex and continuous; 0 f (x) V, x [a, b]}.
And what we are asked to do is to nd, for a given positive , an -solution to the problem,
i.e., a point x G such that
x) f f (
f ( x) min f .
G
Of course, our a priori knowledge on the objective given by the inclusion f F , is, for
small , far from being sucient for nding an -solution, and we need some source of
quantitative information on the objective. The standard assumption here which comes from
the optimization practice is that we can compute the value and a subgradient of the objective
at a point, i.e., we have access to a subroutine, our oracle O, which gets, as an input, a point
19
20LECTURE 2. WHEN EVERYTHING IS SIMPLE: 1-DIMENSIONAL CONVEX OPTIMIZATION
x from our segment and returns the value f (x) and a subgradient f (x) of the objective at
the point.
We have subject the input to the subroutine to the restriction a < x < b, since the
objective, generally speaking, is not dened outside the segment [a, b], and its subgradient
might be undened at the endpoints of the segment as well. I should also add that the
oracle is not uniquely dened by the above description; indeed, at some points f may have a
massive set of subgradients, not a single one, and we did not specify how the oracle at such
a point chooses the subgradient to be reported. As usual, we need exactly one hypothesis
of this type, namely, we assume the oracle to be local: the information on f reported at a
point x must be uniquely dened by the behavior of f in a neighborhood of x:
{f, f F , x int G, f f in a neighborhood of x } O(f, x) = O(f, x).
Recall that the method M is a collection of the search rules, the termination tests and the
rules for forming the result. Note that we do not subject the rules comprising a method
to any further restrictions like computability in nitely many arithmetic operations; the
rules might be arbitrary functions of the information on the problem accumulated to the
step when the rule should be used.
What we should do is to nd a method which, given on input the desired value of accuracy
, after a number of oracle calls produces an -solution to the problem. And what we are
interested in is the most ecient method of this type. Namely, given a method which solves
every problem from our family to the desired accuracy in nite number of oracle calls, we
dene the worst-case complexity A(M) of the method M as the maximum, over all problems
from the family, of the number of calls; what we are looking for is exactly the method of the
minimal worst-case complexity. When summing up, using our terminology, the problem of
nding the optimal method is
given the family
F = {f : G = [a, b] R | f is convex and continuous on G, 0 f V }
of problems and an > 0, nd among the methods M with the accuracy on the
family not worse than the method with the smallest possible complexity on the
family.
Recall that the complexity of the associated optimal method, i.e., the function
A() = min{A(M) | Accuracy(M) }
is called the complexity of the family.
2.1.1 Upper complexity bound

Let us consider a simple bisection method to solve the problem (2.1.1). In order to minimize
f , we start with the midpoint of our segment, i.e., with
a+b
x1 = ,
2
2.1. EXAMPLE: ONE-DIMENSIONAL CONVEX PROBLEMS 21
and ask the oracle about the value and a subgradient of the objective at the point. If the
subgradient is zero, we are done - we have found an optimal solution. If the subgradient is
positive, then the function, due to convexity, to the right of x1 is greater than at the point,
and we may cut o the right half of our initial segment - the minimum for sure is localized
in the remaining part. If the subgradient is negative, then we may cut o the left half of the
initial segment.
Thus, we either terminate with an optimal solution, or nd a new segment, twice smaller
than the initial domain, which for sure localizes the set of optimal solutions. In this latter
case we repeat the procedure, with the initial domain replaced by the new localizer, and
so on. After we have performed the number of steps indicated in the formulation of the
theorem below we terminate and form the result as the best - with the minimal value of f -
of the search points we have looked through:
x {x1 , ..., xN }; f (
x) = min f (xi ).
1iN
Note that traditionally the approximate solution given by the bisection method is identied
with the last search point (which is clearly at the distance at most (b a)2N from the
optimal solution), rather than with the best point found so far. This traditional choice
has small in common with our accuracy measure (we are interested in small values of the
objective rather than in closeness to optimal solution) and is simply dangerous, as you can
see from the following example:
xN xN-1
Figure 1.
Here during the rst N 1 steps everything looks as if we were minimizing f (x) = x, so
that the N-th search point is xN = 2N ; our experience is misleading, as you see from the
picture, and the relative accuracy of xN as an approximate solution to the problem is very
bad, something like 1/2.
By the way, we see from this example that the evident convergence of the search points
to the optimal set at the rate at least 2i does not imply automatically certain xed rate of
convergence in terms of the objective; it turns out, anyhow, that such a rate exists, but for
the best points found so far rather than for the search points themselves.
This way we obtain a scheme of the bisection algorithm:
1. Initialization.
Set x = a, x+ = b, k = 1 and x1 = (a + b)/2.
2. Iteration. For k = 1, ..., N:
Call the oracle O = f (xk ).

Update the active segment: if f (xk ) < 0 set x = xk , elseif f (xk ) >
0, set x+ = xk . If f (xk ) = 0 then goto 3.
Put xk+1 = (x + x+ )/2.
3. Form the result Take x = arg mini=1,...,N +1{f (xi )}.
To establish an upper bound for A() it suces to evaluate A(M).
Theorem 2.1.1 The complexity of the family in question satises the inequality
V
A() log2 ( ), 0 < < V. (2.1.2)

The method associated with the upper bound is the usual bisection terminated after N =
log2 (V /) steps.
Note that the range of values of in our statement is (0, V ), and this is quite natural: since
all functions from the family take their values between 0 and V , any point of the segment
solves every problem from the family to the accuracy V , so that a nontrivial optimization
problem occurs only when < V .
Proof:
We start with the observation that if GN = [x , x+ ] is the nal localizer of optimum
found during the bisection, then outside the localizer the value of the objective is at least
that one at the best of the search points, i.e., at least the value at the approximate solution
x found by bisection:
x) min f (xi ), x G\GN .
f (x) f (
1iN
Indeed, at a step of the method we cut o only those points of the domain G where f is at
least as large as at the current search point and is therefore its value at the best of the
search points, that is, to the value at x; this is exactly what was claimed.
Now, let x be an optimal solution to the problem, i.e., a minimizer of f ; as we know,
such a minimizer does exist, since our continuous objective for sure attains its minimum
over the nite segment G. Let be a real greater than 2N and less than 1, and consider
-contraction of the segment G to x , i.e., the set
G = (1 )x + G {(1 )x + z | z G}.
This is a segment of the length (b a), and due to our choice of the length is greater
than that one of our nal localizer GN . It follows that G cannot be inside the localizer, so
that there is a point, let it be y, which does not belong to the nal localizer and belongs to
G :
y G : y int GN .
Since y belongs to G , we have
y = (1 )x + z
for some z G, and from convexity of f it follows that
f (y) (1 )f (x ) + f (z),
which can be rewritten as
f (y) f (f (z) f ) V,
so that y is an V -solution to the problem. On the other hand, y is outside GN , where, as

we already know, f is at least f (
x):
x) f (y).
f (
We conclude that
x) f f (y) f V.
f (
Since can be arbitrary close to 2N , we come to
x) f 2N V = 2log2 (V /) V .
f (
Thus,
Accuracy(BisectionN ) .
The upper bound is proved.
The observation that the length of the localizers Gi converges geometrically to 0 was
crucial in the above proof of the complexity estimate. However, for the bisection procedure
to possess this property, convexity of f is not necessary, for instance, it will be enough if f is
quasi-convex. On the other hand, the quasi-convexity itself does not imply the convergence
of the error to 0 in course of iterations. To have this we have impose some condition on local
variation of the objective, e.g., that f is Lipschitz-continuous.1 )
1)
We will discuss this subject at length in the Exercise section of the next lecture.
2.1.2 Lower complexity bound

We would like also to provide a lower bound for A(). At the rst glance, it is not clear
where from could one get lower bounds of complexity of the class. This problem looks less
natural than inventing methods and evaluating their behavior on the family this is what
people in optimization are doing all their life. In contrast to this, to nd a lower bound
means to say something denite about all methods from an extremely wide class, not to
investigate a particular method. We shall see in a while that the task is quite tractable in
the convex context. 2)
The answer to the question is given by the following
Theorem 2.1.2 The complexity of the family in question satises
1 V
log2 ( ) 1 < A(), 0 < < V. (2.1.3)
5
Note that Theorem 2.1.2 states that the bisection method is optimal in complexity up to an
absolute constant factor.
Proof:
To simplify explanation, consider the case when the domain G of our problems is the
segment [1, 1] and the objectives vary from 0 to 1, i.e., let V be equal to 1; due to similarity
reasons, with these assumptions we do not loose generality. Thus, given an (0, 1) we
should prove that the complexity of our family for this value of is greater than
1 1
log2 ( ) 1.
5
In other words, we should prove that if M is a method which solves all problems from the
family to the accuracy , then the complexity K of the method on the family, i.e., the worst
case number of steps is at least the aforementioned quantity. Let M solve every problem to
the accuracy in no more than K steps. By adding redundant steps, we may assume that
M performs exactly
K = K + 1
steps at every problem and that the result of M as applied to a problem is nothing but the
last, K-th search point.
Now let us prove the following basic lemma:
Lemma 2.1.1 For any i, 0 i K, our family contains an objective fi with the following
two properties:
(1 i ) : there is a segment i G = [1, 1] - the active segment of fi - of the length
i = 212i
where fi is a modulus-like function, namely,
fi (x) = ai + 3i |x ci |, x i ,
2)
Recall, that we have succeeded to treat this task for the class of Global Optimization problems.
ci being the midpoint of i ;
(2 i ) : The rst i points x1 , ..., xi generated by M as applied to fi , are outside i .
Proof will be given by induction on i.
Base i = 0 is immediate: we set
f0 (x) = |x|, 0 = G = [1, 1],
thus ensuring (10 ). Property (20 ) holds true by trivial reasons - when i = 0, then there are
no search points to be looked at.
Step i i + 1: Let fi be the objective given by our inductive hypothesis, let i be the
active segment of this objective and let ci be the midpoint of the segment.
Let also x1 , ..., xi , xi+1 be the rst i + 1 search points generated by M as applied to
fi . According to our inductive hypothesis, the rst i of these points are outside the active
segment.
In order to obtain fi+1 , we modify the function fi in its active segment and do not vary
the function outside the segment. The way we modify fi in the active segment depends
on whether xi+1 is to the right of the midpoint ci of the segment (right modication), or
this is not the case and xi+1 either coincides with ci or is to the left of the point (left
modication).
The right modication is as follows: we replace the modulus-like in its active segment
function fi by a piecewise linear function with three linear pieces, as is shown on the picture
below. Namely, we do not change the slope of the function in the initial 1/14 part of the
segment, then change the slope from 23i to 23(i+1) and make a new breakpoint at the
end ci+1 of the rst quarter of the segment i . Starting with this breakpoint and till the
right endpoint of the active segment, the slope of the modied function is 23(i+1) . It is
easily seen that the modied function at the right endpoint of i comes to the same value
as that one of fi and that the modied function is convex on the whole axis.
In the case of the left modication, i.e., when xi+1 ci , we act in the symmetric
manner, so that the breakpoints of the modied function are at the distances 34 |i | and
13
14
|i | from the left endpoint of i , and the slopes of the function, from left to right, are
23(i+1) , 23(i+1) and 23i :
i+1
u c i+1 ci xi+1
i
Figure 2. Right modication of fi . The bold segment on the axis

is the active segment of the modied function
Let us verify that the modied function fi+1 satises the requirements imposed by the
lemma. As we have mentioned, this is a convex continuous function; since we do not vary
fi outside the segment i and do not decrease it inside the segment, the modied function
takes its values in (0, 1) together with fi . It suces to verify that fi+1 satises (1i+1 ) and
(2i+1 ).
(1i+1 ) is evident by construction: the modied function indeed is modulus-like with
required slopes in a segment of a required length. What should be proved is (2i+1 ), the
claim that the method M as applied to fi+1 during the rst i + 1 step does not visit the
active segment of fi+1 . To prove this, it suces to prove that the rst i + 1 search points
generated by the method as applied to fi+1 are exactly the search point generated by it when
minimizing fi , i.e., they are the points x1 , ..., xi+1 . Indeed, these latter points for sure are
outside the new active segment - the rst i of them due to the fact that they even do not
belong to the larger segment i , and the last point, xi+1 - by our construction, which ensures
that the active segment of the modied function and xi+1 are separated by the midpoint ci
of the segment i .
Thus, we come to the necessity to prove that x1 , ..., xi+1 are the rst i + 1 points generated
by M as applied to fi+1 . This is evident: the points x1 , ..., xi are outside i , where fi and fi+1
coincide; consequently, the information - the values and the subgradients - on the functions
along the sequence x1 , ..., xi also is the same for both of the functions. Now, by denition
of a method the information accumulated by it during the rst i steps uniquely determines
the rst i + 1 search points; since fi and fi+1 are indistinguishable in a neighborhood of the
rst i search points generated by M as applied to fi , the initial (i + 1)-point segments of
the trajectories of M on fi and on fi+1 coincide with each other, as claimed.
Thus, we have justied the inductive step and therefore have proved the lemma.
2.2. CONCLUSION 27
It remains to derive from the lemma the desired lower complexity bound. This is imme-
diate. According to the lemma, there exists function fK in our family which is modulus-like
in its active segment and is such that the method during its rst K steps does not visit
this active segment. But the K-th point xK of the trajectory of M on fK is exactly the
result found by the method as applied to the function; since fK is modulus-like in K and
is convex everywhere, it attains its minimum fK at the midpoint cK of the segment K and
outside K is greater than
fK + 23K 3K fK + 23K 22K = 25K
(the product here is half of the length of K times the slope of fK ). Thus,
x(M, fK )) fK > 25K .
fK (
On the other hand, M, by its origin, solves all problems from the family to the accuracy ,
and we come to
25K < ,
i.e., to
1 1
K K + 1 > log2 ( ).
5
as required in our lower complexity bound.
2.2 Conclusion
The one-dimensional situation we have investigated is, of course, very simple; I spoke about it
only to give you an impression of what we are going to do. In the main body of the course we
shall consider much more general classes of convex optimization problems, i.e., multidimen-
sional problems with functional constraints. Same as in our simple one-dimensional example,
we shall ask ourselves what is the complexity of the classes and what are the corresponding
optimal methods. Let me stress that these are optimal methods we mainly shall focus on
- it is much more interesting issue than the complexity itself, both from mathematical and
practical viewpoint. In this respect, one-dimensional situation is not typical - it is easy to
guess that the bisection should be optimal and to establish its rate of convergence. In several
dimensions situation is far from being so trivial and is incomparably more interesting.
2.3 Exercises
2.3.1 Can we use 1-dimensional optimization?
Note that though being extremely simple, the bisection algorithm can be of great use to
solve multi-dimensional optimization problems which looks much more involved. Consider
for instance the following example of Minimizing a separable function subject to an equality
constraint.
We consider the problem
n

T
min f (x) = fi (xi ), subject to a x = b , (2.3.4)
i=1
where a Rn , b R, and fi : R R are dierentiable and strictly convex. The objective

function is called separable since it is a sum of functions of the individual variables x1 , ..., xn .
We assume that the domain of f0 intersects the constraint set, i.e., there exists a point
x0 dom f with aT x0 = b. This implies the problem has a unique optimal point x (why?).
The Lagrangian is
n
n

L(x, ) = fi (xi ) + (aT x b) = b + (fi (xi ) + ai xi ),
i=1 i=1
which is also separable, so the dual function is

n

L() = b + inf (fi (xi ) + ai xi )
x
i=1
n

= b + inf [fi (xi ) + ai xi ]
xi
i=1
n
= b fi (ai ),
i=1
where we denote fi (u) : R R, the function conjugate to fi :
fi (u) = max[uz fi (z)], i = 1, ..., n,

zR
The function f is also referred to as the Legendre transform of f . Note that fi , being a point-
wise supremum of convex functions are themselves convex (and, by the way, dierentiable).
The dual problem is thus
n

max b fi (ai ) (2.3.5)
i=1
with (scalar) variable R.

Now suppose that 1) we are able to write down the conjugate functions fi explicitly, so
that we are able to compute the rst order oracle for the convex problem (2.3.5). Indeed,
for a given u R, the value fi (ui ) is the optimal value of the problem minz [ui z fi (z)]
and [fi (ui )] its (unique!) optimizer z . Then 2) we can recover an optimal dual variable
by bisection. Since each fi is strictly convex, the function L(x, ) is strictly convex in
x, and so has a unique minimizer x . But we also know that the optimal solution x to
(2.3.4) also minimizes L(x, ), so we must have x = x . Note that We can recover x from
Lx (x, ) = 0, i.e., by solving the equations fi (x ) = ai .
2.3. EXERCISES 29
Exercise 2.3.1 Consider the problem of nding the Euclidean projection of the point y R
on the standard simplex:
n

2
min f (x) = |x y| subject to xi = 1, xi 0, i = 1, ..., n . (2.3.6)
x
i=1
1. Consider the 1-dimensional problem
max uz (x a)2 .
z0
Write down its Lagrange dual and nd its explicit solution.
2. Using the method, described in this section, propose a simple solution to the problem
(2.3.6) by bisection. Write down explicitly the formulas which allow to recover the
primary solution from the dual one.
A slightly more involved variation on the same theme is as follows:
Exercise 2.3.2 (Entropy maximization)

1. Let a Rn be an interior point of a standard simplex, i.e., ai > 0, i ai = 1, and
let u Rn . Consider the problem
n n

xi
min
x
f (x) = xi log + ui (xi ai ) subject to xi = 1, xi 0, i = 1, ..., n(2.3.7)
.
i=1 ai i=1
Find the unique (explain why?) explicit solution to (2.3.7).
1
2. Let n
< 1, and let a Rn satisfy 0 < ai , i ai = 1.
Find the optimal solution and the optimal value of the problem
z
max u(z ) z log for 0 <
0z
. Explain how the bisection algorithm can be used to solve the problem
n n

xi
min f (x) = xi log + ui (xi ai ) subject to xi = 1, 0 xi , i = 1, ..., n .
x ai
i=1 i=1
n
Hint: Dualize the equality constraint i=1 xi = 1.
2.3.2 Extremal Ellipsoids

In order to advance further we need some basic results about the ellipsoids in Rn .
There are two natural ways to dene an ellipsoid W in Rn . The rst is to represent W
as the set dened by a convex quadratic constraint, namely, as
W = {x Rn | (x c)T A(x c) 1} (2.3.8)
A being a symmetric positive denite n n matrix and c being a point in Rn (the center of
the ellipsoid).
The second way is to represent W as the image of the unit Euclidean ball under an ane
invertible mapping, i.e., as
W = {x = Bu + c | uT u 1}, (2.3.9)
where B is an n n nonsingular matrix and c is a point from Rn .
Exercise 2.3.3 # Prove that the above denitions are equivalent: if W Rn is given by
(2.3.8), then W can be represented by (2.3.9) with B chosen according to
A = (B 1 )T B 1
(e.g., with B chosen as A1/2 ). Vice versa, if W is represented by (2.3.9), then W can be
represented by (2.3.8), where one should set
A = (B 1 )T B 1 .
Note that the (positive denite symmetric) matrix A involved into (2.3.8) is uniquely dened
by W (why?); in contrast to this, a nonsingular matrix B involved into (2.3.9) is dened by
W up to a right orthogonal factor: the matrices B and B dene the same ellipsoid if and
only if B = BU with an orthogonal n n matrix U (why?)
From the second description of an ellipsoid it immediately follows that
if
W = {x = Bu + c | u Rn , uT u 1}
is an ellipsoid and
x p + B x
is an invertible ane transformation of Rn (so that B is a nonsingular n n
matrix), then the image of W under the transformation also is an ellipsoid.
Indeed, the image is nothing but
W = {x = B Bu + (p + B c) | u Rn , uT u 1},
the matrix B B being nonsingular along with B and B . It is also worthy of note that
2.3. EXERCISES 31
for any ellipsoid
W = {x = Bu + c | u Rn , uT u 1}
there exists an invertible ane transformation of Rn , e.g., the transformation
x B 1 x B 1 c,
which transforms the ellipsoid exactly into the unit Euclidean ball
V = {u Rn | uT u 1}.
In what follows we mainly focus on various volume-related issues; to avoid complicated

constant factors, it is convenient to take, as the volume unit, the volume of the unit Euclidean
ball V in Rn rather than the volume of the unit cube. The volume of a body3) Q measured
in this unit, i.e., the ratio
Voln (Q)
,
Voln (V )
Voln being the usual Lebesgue volume in Rn , will be denoted voln (Q) (we omit the subscript
n if the value of n is clear from the context).
#
Exercise 2.3.4 Prove that if W is an ellipsoid in Rn given by (2.3.9), then
vol W = | Det B|, (2.3.10)
and if W is given by (2.3.8), then
vol W = | Det A|1/2 . (2.3.11)
Our local goal is to prove the following statement:

Let Q be a convex body in Rn (i.e., a closed and bounded convex set with a nonempty
interior). Then there exist ellipsoids containing Q, and among these ellipsoids there is one
with the smallest volume. This ellipsoid is unique; it is called the outer extremal ellipsoid
associated with Q. Similarly, there exist ellipsoids contained in Q, and among these ellipsoids
there is one with the largest volume. This ellipsoid is unique; it is called the inner extremal
ellipsoid associated with Q.
In fact we are not too interested in the uniqueness of the extremal ellipsoids (and you
may try to prove the uniqueness yourself); what actually is of interest is the existence and
some important properties of the extremal ellipsoids.
Exercise 2.3.5 # . Prove that if Q is a closed and bounded convex body in Rn , then there
exist ellipsoids containing Q and among these ellipsoids there is (at least) one with the
smallest volume.
3)
in what follows body means a set with a nonempty interior
Exercise 2.3.6 . Prove that if Q is a closed and bounded convex body in Rn , then there
exist ellipsoids contained in Q and among these ellipsoids there is (at least) one with the
largest volume.
Note that extremal ellipsoids associated with a closed and bounded convex body Q ac-
company Q under ane transformations: if x Ax+b is an invertible ane transformation
and Q is the image of Q under this transformation, then the image W of an extremal outer
ellipsoid W associated with Q (note the article: we has not proved the uniqueness!) is an
extremal outer ellipsoid associated with Q , and similarly for (an) extremal inner ellipsoid.
The indicated property is, of course, an immediate consequence of the facts that ane images
of ellipsoids are again ellipsoids and that the ratio of volumes remains invariant under an
ane transformation of the space.
In what follows we focus on outer extremal ellipsoids. Useful information can be obtained
from investigating these ellipsoids for simple parts of an Euclidean ball.
+
Exercise 2.3.7 Prove that the volume of the spherical hat
V = {x Rn | |x| 1, xn }
(0 1) of the unit Euclidean ball V = {x Rn | |x| 1} satises the inequality as

follows:
Voln (V )/Voln (V ) exp{n2 /2},
> 0 being a positive absolute constant. What is, numerically, the left hand side ratio when
= 1/2 and n = 64?
#+
Exercise 2.3.8 Let n > 1,
n 1/2

n
V = {x R | |x|2 x2i 1}
i=1
be the unit Euclidean ball, let e be a unit vector in Rn and let
V = {x V | eT x }, [1, 1]
(V is what is called a spherical hat).

Prove that if
1
< < 1,
n
then the set V can be covered by an ellipsoid W of the volume

n2 n1
voln (W ) { 2 }n/2 (1 2 )(n1)/2 (1 ) < 1 = voln (V );
n 1 n+1
W is dened as
1 + n
W = {x = e + Bu | uT u 1},
n+1
2.3. EXERCISES 33
where
1/2
n 2 (1 )(n 1)
B = (1 2 ) I eeT , = 1
n2 1 (1 + )(n + 1)
Hint: note that V is contained in the set of solutions to the system of the following pair of
quadratic inequalities:
xT x 1; (2.3.12)
(eT x )(eT x 1) 0; (2.3.13)

Consequently, V satises any convex quadratic inequality which can be represented as a
weighted sum of (2.3.12) and (2.3.13), the weights being positive. Every inequality of this
type denes an ellipsoid; compute the volume of the ellipsoid as the function of the weights
and optimize it with respect to the weights.
In fact the ellipsoid given by the latter exercise is the extremal outer ellipsoid associated
with V .
Looking at the result stated by the latter exercise, one may make a number of useful
conclusions.
1. When = 0, i.e., when the spherical hat V is a half-ball, we have

n/2
1 2
voln (W ) = 1 + 2 1
n 1 n1
n/2
exp{1/(n2 1)} exp{1/(n 1)} =
n+2 1 1
= exp{ } < exp{ } = exp{ }voln (V );
2(n 1)
2 2n 2 2n 2
thus, for the case of = 0 (and, of course, for the case of > 0) we may cover
V by an ellipsoid with the volume 1 O(1/n) times less than that one of V . In
fact the same conclusion (with another absolute constant factor O(1)) holds true
when is negative (so that the spherical hat is greater than half-ball), but not
1
too negative, say, when 2n .
2. In order to cover V by an ellipsoid of absolute constant times less volume
than that one of V we need to be positive of order O(n1/2 ) or greater. This
ts our observation that the volume of V itself is at least absolute constant times
less than that one of V only if O(n1/2 ) (exercise 2.3.7). Thus, whenever
the volume of V is absolute constant times less than that one of V , we can cover
V by an ellipsoid of the volume also absolute constant times less than that one
of V ; this covering is already given by the Euclidean ball of the radius 1 2
centered at the point e (which, anyhow, is not the optimal covering presented
in exercise 2.3.8).
Exercise 2.3.9 #+ Let V be the unit Euclidean ball in Rn , e be a unit vector and let
(0, 1). Consider the symmetric spherical stripe
V = {x V | eT x }.

Prove that if 0 < < 1/ n then V can be covered by an ellipsoid W with the volume
(n1)/2
n(1 2 )
voln (W ) n < 1 = voln (V ).
n1
Find an explicit representation of the ellipsoid.
Hint: use the same construction as that one for exercise 2.3.8.
We see that in order to cover a symmetric spherical stripe of the unit Euclidean ball V by
an ellipsoid of volumeless than that one of V , it suces to have the half-thickness of
the stripe to be < 1/ n, which again ts our observation (Exercise 2.3.7) that basically all
volume of the unit n-dimensional Euclidean ball is concentrated in the O(1/ n) neighbor-
hood of its equator - the cross-section of the ball and a hyperplane passing through the
center of the ball. A useful exercise is to realize when a non-symmetric spherical stripe
V , = {x V | eT x }
of the (centered at the origin) unit Euclidean ball V can be covered by an ellipsoid of volume
less than that one of V .
The results of exercises 2.3.8 and 2.3.9 imply a number of important geometrical conse-
quences.
Exercise 2.3.10 + Prove the following theorem of Fritz John:
Let Q be a closed and bounded convex body in Rn . Then
(i) Q can be covered by an ellipsoid W in such a way that the concentric to W n times
smaller ellipsoid
1 1
W = (1 )c + W
n n
(c is the center of W ) is contained in Q. One can choose as W the extremal outer ellipsoid
associated with Q.
(ii) If, in addition, Q is central-symmetric with respect to certain point c, then the above
result can be improved: Q can be covered by an ellipsoid W centered at c in such a way that
the concentric to W n times smaller ellipsoid
1 1
W = (1 )c + W
n n
is contained in Q.
Hint: use the results given by exercises
2.3.8 and 2.3.9.
Note that the constants n and n in the Fritz John Theorem are sharp; an extremal
example for (i) is a simplex, and for (ii) - a cube.
Here are several nice geometrical consequences of the Fritz John Theorem:
2.3. EXERCISES 35
Let Q be a closed and bounded convex body in Rn . Then

1. There exist a pair of concentric homothetic with respect to their common
center parallelotopes p, P with homothety coecient equal to n3/2 such that
p Q P ; in other words, there exists an invertible ane transformation of
the space such that the image Q of Q under this transformation satises the
inclusions
1
{x Rn | x } Q {x Rn | x 1};
n3/2
here
x = max |xi |
1in
is the uniform norm of x.

Indeed, from the Fritz John Theorem it follows that there exists an invertible
ane transformation resulting in
{x | |x|2 1/n} Q {x | |x|2 1},
Q being the image of Q under the transformation (it suces to transform the
outer extremal ellipsoid associated with Q into the unit Euclidean ball centered
at the origin). It remains to note that the smaller Euclidean ball in the above
chain of inclusions contains the cube {x | x n3/2 } and the larger one is
contained in the unit cube.
2. If Q is central-symmetric, then the parallelotopes mentioned in 1. can be
chosen to have the same center, and the homothety coecient can be improved to
1/n; in other words, there exists an invertible ane transformation of the space
which makes the image Q of Q central symmetric with respect to the origin and
ensures the inclusions
1
{x | x } Q {x | x 1}.
n
The statement is given by the reasoning completely similar to that one used for
1., up to the fact that now we should refer to item (ii) of the Fritz John
Theorem.
n
3. Any norm on R can be approximated, within factor n, by a
Euclidean norm: given , one can nd a Euclidean norm
|x|A = (xT Ax)1/2 ,
A being a symmetric positive denite n n matrix, in such a way that

1
|x|A x |x|A
n
for any x Rn .
Indeed, let B = {x | x 1} be the unit ball with respect to the norm ; this
is a closed and bounded convex body, which is central symmetric with respect
to the origin. By item (ii) of the Fritz John Theorem, there exists a centered at
the origin ellipsoid
W = {x | xT Ax n}
(A is an n n symmetric positive denite matrix) which contains B, while the
ellipsoid
{x | xT Ax 1}
is contained in B; this latter inclusion means exactly that
|x|A 1 x B x 1,
i.e., means that |x|A x . The inclusion B W , by similar reasons, implies

that x n1/2 |x|A .
n
Remark 2.3.1 The third of the indicated consequences says that any norm on R can be
approximated, within constant factor n, by an appropriately chosen Euclidean norm. It
turns out that the quality of approximation can be done much better, if we would be satised
by approximating the norm not at the whole space, but at a properly chosen subspace.
Namely, there exists a marvelous and important theorem of Dvoretski which is as follows:
there exists a function m(n, ) of positive integer n and positive real with the following
properties:
rst,
lim m(n, ) = +
n
and,
second, whenever is a norm on Rn , one can indicate a m(n, )-dimensional subspace
E Rn and a Euclidean norm | |A on Rn such that | |A approximates on E within
factor 1 + :
(1 )|x|A x (1 + )|x|A , x E.
In other words, the Euclidean norm is marked by God: for any given integer k an arbitrary
normed linear space contains an almost Euclidean k-dimensional subspace, provided that
the dimension of the space is large enough.
Lecture 3
Methods with linear convergence
(Multidimensional Convex Problems: Class Complexity; Cutting Plane Scheme, Center of

Gravity Method, Ellipsoid Method)
3.1 Class of general convex problems: description and

complexity
A convex programming problem is
minimize f (x) subject to gi (x) 0, i = 1, ..., m, x G Rn . (3.1.1)
Here the domain G of the problem is a closed convex set in Rn with a nonempty interior, the
objective f and the functional constraints gi , i = 1, ..., m, are convex continuous functions
on G.
Let us x a closed and bounded convex domain G Rn and the number m of func-
tional constraints, and let P = Pm (G) be the family of all feasible convex problems with
m functional constraints and the domain G. Note that since the domain G is bounded
and all problems from the family are feasible, all of them are solvable, due to the standard
compactness reasons.
In what follows we identify a problem instance from the family Pm (G) with a vector-
valued function
p = (f, g1 , ..., gm )
comprised of the objective and the functional constraints.
What we shall be interested in for a long time are the ecient methods for solving
problems from the indicated very wide family. Similarly to the one-dimensional case, we
assume that the methods have an access to the rst order local oracle O which, given an
input vector x int G, returns the values and some subgradients of the objective and the
functional constraints at x, so that the oracle computes the mapping
x O(p, x) = (f (x)(, f (x); g1 (x), g1 (x); ...; gm (x), gm

(x)) : int G R(m+1)(n+1) .
37
38 LECTURE 3. METHODS WITH LINEAR CONVERGENCE
The notions of a method and its complexity at a problem instance and on the whole family
are introduced exactly as it was done in Section 1.2 of our rst lecture1) .
The accuracy of a method at a problem and on the family in the following way. Let us
start with the vector of residuals of a point x G regarded as an approximate solution to a
problem instance p:
Residual(p, x) = (f (x) f , (g1 (x))+ , ..., (gm (x))+ )
which is comprised of the inaccuracy in the objective and the violations of functional con-
straints at x. In order to get a convenient scalar accuracy measure, it is reasonable to pass
from this vector to the relative accuracy
f (x) f (g1 (x))+ (gm (x))+
(p, x) = max{ , , ..., };
maxG f f (maxG g1 )+
(maxG gm )+
to get the relative accuracy, we normalize each of the components of the vector of residuals
by its maximal, over all x G, value and take the maximum of the resulting quantities. It
is clear that the relative accuracy takes its values in [0, 1] and is zero if and only if x is an
optimal solution to p, as it should be for a reasonable accuracy measure.
After we have agreed how to measure accuracy of tentative approximate solutions, we
dene the accuracy of a method M at a problem instance as the accuracy of the approximate
solution found by the method when applied to the instance:
(p, M)).
Accuracy(M, p) = (p, x
The accuracy of the method on the family is its worse-case accuracy at the problems of the
family:
Accuracy(M) = sup Accuracy(M, p).
pPm (G)
Last, the complexity of the family is dened in the manner we already are acquainted with,
namely, as the best complexity of a method solving all problems from the family to a given
accuracy:
A() = min{A(M) | Accuracy(M) }.
What we are about to do is to establish the following main result:
Theorem 3.1.1 The complexity A() of the family Pm (G) of general-type convex problems
on an n-dimensional closed and bounded convex domain G satises the inequalities
ln( 1 ) 1
n 1 A() 2.181 n ln( ). (3.1.2)
6 ln 2
Here the upper bound is valid for all < 1. The lower bound is valid for all < (G), where
1
(G)
n3
1)
that is a set of rules for forming sequential search points, the moment of termination and the result
as functions of the information on the problem; this information is comprised by the answers of the oracle
obtained to the moment when a rule is to be applied
3.2. CUTTING PLANE SCHEME AND CENTER OF GRAVITY METHOD 39
for all G Rn ; for an ellipsoid G one has

1
(G) = ,
n
and for a paralellotope G
(G) = 1.
Same as in the one-dimensional case, to prove the theorem means to establish the lower
complexity bound and to present a method associated with the upper complexity bound
(and thus optimal in complexity, up to an absolute constant factor, for small enough ,
namely, for 0 < < (G). We shall start with this latter task, i.e., with constructing an
optimal method.
3.2 Cutting Plane scheme and Center of Gravity Method

The method we are about to present is based on a very natural extension of the bisection -
the cutting plane scheme.
3.2.1 Case of problems without functional constraints

To explain the cutting plane scheme, let me start with the case when there are no functional
constraints at all, so that m = 0 and the family is comprised of problems
minimize f (x) s.t. x G (3.2.3)
of minimizing convex continuous objectives over a given closed and bounded convex domain
G Rn .
To solve such a problem, we can use the same basic idea as in the one-dimensional
bisection. Namely, choosing somehow the rst search point x1 , we get from the oracle the
value f (x1 ) and a subgradient f (x1 ) of f ; thus, we obtain a linear function
f1 (x) = f (x1 ) + (x x1 )T f (x1 )
which, due to the convexity of f , underestimates f everywhere on G and coincides with f

at x1 :
f1 (x) f (x), x G; f1 (x1 ) = f (x1 ).
If the subgradient is zero, we are done - x1 is an optimal solution. Otherwise we can point
out a proper part of G which localizes the optimal set of the problem, namely, the set
G1 = {x G | (x x1 )T f (x1 ) 0};
indeed, outside this new localizer our linear lower bound f1 for the objective, and therefore
the objective itself, is greater than the value of the objective at x1 .
Now, our new localizer of the optimal set, i.e., G1 , is, same as G, a closed and bounded
convex domain, and we may iterate the process by choosing the second search point x2 inside
G1 and generating the next localizer
G2 = {x G1 | (x x2 )T f (x2 ) 0},
and so on. We come to the following generic cutting plane scheme:

starting with the localizer G0 G, choose xi in the interior of
the current localizer Gi1 and check whether f (xi ) = 0; if it is the
case, terminate, xi being the result, otherwise define the new localizer
Gi = {x Gi1 | (x xi )T f (xi ) 0}
and loop.
The approximate solution found after i steps of the routine is, by
definition, the best point found so far, i.e., the point
xi {x1 , ..., xi } such that f (

xi ) = min f (xj ).
1ji
A cutting plane method, i.e., a method associated with the scheme, is governed by the
rules for choosing the sequential search points in the localizers. In the one-dimensional case
there is, basically, the only natural possibility for this choice - the midpoint of the current
localizer (the localizer always is a segment). This choice results exactly in the bisection and
enforces the lengths of the localizers to go to 0 at the rate 2i , i being the step number. In
the multidimensional case the situation is not so simple. Of course, we would like to decrease
a reasonably dened size of localizer at the highest possible rate; the problem is, anyhow,
which size to choose and how to ensure its decreasing. When choosing a size, we should take
care of two things
(1) we should have a possibility to conclude that if the size of a current
localizer Gi is small, then the inaccuracy of the current approximate solution
also is small;
(2) we should be able to decrease at certain rate the size of sequential localizers
by appropriate choice of the search points in the localizers.
Let us start with a wide enough family of sizes which satisfy the rst of our requirements.
Denition 3.2.1 A real-valued function Size(Q) dened on the family Q of all closed and
bounded convex subsets Q Rn with a nonempty interior is called a size, if it possesses the
following properties:
(Size.1) Positivity: Size(Q) > 0 for any Q Q;
(Size.2) Monotonicity with respect to inclusion: Size(Q) Size(Q ) whenever Q Q,
Q, Q Q;
(Size.3) Homogeneity with respect to homotheties: if Q Q, > 0, a Rn and
Q = a + (Q a) = {a + (x a) | x Q}
is the image of Q under the homothety with the center at the point a and the coecient ,
then
Size(Q ) = Size(Q).
Example 1. The diameter
Diam(Q) = max{|x x | | x, x Q}
is a size;
Example 2. The average diameter
AvDiam(Q) = (Voln (Q))1/n ,
Voln being the usual n-dimensional volume, is a size.

To the moment these examples are sucient for us.
Let us prove that any size of the indicated type satises requirement (1), i.e., if the size
of a localizer is small, then the problem is almost solved.
Lemma 3.2.1 Let Size() be a size. Assume that we are solving a convex problem (3.2.3)
by a cutting plane method, and let Gi and xi be the localizers and approximate solutions
generated by the method. Then we have for all i 1
Size(Gi )
i ) i
(p, x , (3.2.4)
Size(G)
p denoting the problem in question.
Proof looks completely similar to that one used for the bisection method. Indeed, let us
x i 1. If i 1, then (3.2.4) is evident - recall that the relative accuracy always is 1.
Now assume that i < 1 for our i. Let us choose (i , 1], let x be a minimizer of our
objective f over G and let
G = x + (G x ).
Then
Size(G ) = Size(G) > i Size(G) = Size(Gi )
(we have used the homogeneity of Size() with respect to homotheties). Thus, the size of G
is greater than that on of Gi , and therefore, due to the monotonicity of Size, G cannot be
a subset of Gi . In other words, there exists a point
y G \Gi .
Since G clearly is contained in the domain of the problem and does not belong to the i-th
localizer Gi , we have
xi );
f (y) > f (
indeed, at each step j, j i, of the method we remove from the previous localizer (which
initially is the whole domain G of the problem) only those points where the objective is
greater than at the current search point xj and is therefore greater than at the best point xi
found during the rst i steps; since y was removed at one of these steps, we conclude that
f (y) > f (
xi ), as claimed.
On the other hand, y G , so that
y = (1 )x + z
with some z G. From convexity of f it follows that
f (y) (1 )f (x ) + f (z) = (1 ) min f + max f,
G G
whence
f (y) min f (max f min f ).
G G G
xi ), and we come to
As we know, f (y) > f (
f (xi ) min f (max f min f ),
G G G
or, which is exactly the same, to

(p, xi ) < .
Since is an arbitrary real > i , we conclude that (3.2.4) holds.
Thus, we realize now what could be the sizes we are interested in, and the problem is
how to ensure certain rate of their decreasing along the sequence of localizers generated by
a cutting plane method. The diculty here is that when choosing the next search point in
the current localizer, we do not know what will be the next cutting plane; the only thing
we know is that it will pass through the search point. Thus, we are interested in the choice
of the search point which guarantees certain reasonable, not too close to 1, ratio of the size
of the new localizer to that one of the previous localizer independently of what will be the
cutting plane. Whether such a choice of the search point is possible, it depends on the size
we are using. For example, the diameter of a localizer, which is a very natural measure
of it and which was successively used in the one-dimensional case, would be a very bad
choice in the multidimensional case. To realize this, imagine that we are minimizing over
the unit square on the two-dimensional plane, and our objective in fact depends on the rst
coordinate only. All our cutting planes (in our example they are lines) will be parallel to the
second coordinate axis, and the localizers will be stripes of certain horizontal size (which we
may enforce to tend to 0) and of some xed vertical size (equal to 1). The diameters of the
localizers here although decrease but do not tend to zero. Thus, the rst of the particular
sizes we have looked at does not t the second requirement. In contrast to this, the second
particular size, the average diameter AvDiam, is quite appropriate, due to the following
geometric fact which we present without proof:
Proposition 3.2.1 (Grunbaum) Let Q be a closed and bounded convex domain in Rn , let

1
x (G) = xdx
Voln (G) G
be the center of gravity of Q, and let be an ane hyperplane passing through the center
of gravity. Then the volumes of the parts Q , Q in which Q is partitioned by satisfy the
inequality
1/n
n
Voln (Q ), Voln (Q ) {1 }Voln (Q) exp{}Voln (Q),
n+1
= ln(1 1/ e) = 0.45867...;
in other words,

AvDiam(Q ), AvDiam(Q ) exp{ } AvDiam(Q). (3.2.5)
n
Note that the proposition states exactly that the smallest (in terms of the volume) fraction
you can cut o a n-dimensional convex body by a hyperplane passing through the center
of gravity of the body is the fraction you get when the body is a simplex, the plane passes
parallel to a facet of the simplex and you cut o the part not containing the facet.
Corollary 3.2.1 Consider the Center of Gravity method, i.e., the cutting plane method with
the search points being the centers of gravity of the corresponding localizers:

1
xi = x (Gi1 ) xdx.
Voln (Gi1 ) Gi1
For the method in question one has

AvDiam(Gi ) exp{ } AvDiam(Gi1 ), i 1;
n
consequently (see Lemma 3.2.4) the relative accuracy of i-th approximate solution generated
by the method as applied to any problem p of minimizing a convex objective over G satises
the inequality

i ) exp{ i}, i 1.
(p, x
n
In particular, to solve the problem within relative accuracy (0, 1) it suces to perform
no more than
1 1 1
N = n ln 2.181n ln (3.2.6)

steps of the method.
Remark 3.2.1 The Center of Gravity method for convex problems without functional con-
straints was invented in 1965 independently by A.Yu.Levin in the USSR and J. Newman in
the USA.
3.3 The general case: problems with functional con-

straints
The Center of Gravity method, as it was presented, results in the upper complexity bound
stated in our Main Theorem, but only for problems without functional constraints. In order
to establish the upper bound in full generality, we should modify the cutting plane scheme
in a manner which enables us to deal with these constraints. The very rst idea is to act as
follows: after current search point is chosen and we have received the values and subgradients
of the objective and the constraints at the point, let us check whether the point is feasible;
if it is the case, then let us use the subgradient of the objective in order to cut o the
part of the current localizer where the objective is greater that at the current search point,
same as it was done in the case when there were no functional constraints. Now, if there
is a functional constraint violated at the point, we can use its subgradient to cut o points
where the constraint is for sure greater than it is at the search point; the removed points
for sure are not feasible: This straightforward approach cannot be used as it is, since the
feasible set may have empty interior, and in this case our process, normally, will never nd
a feasible search point and, consequently, will never look at the objective. In this case the
localizers will shrink to the feasible set of the problem, and this is ne, but if the set is not
a singleton, we would not have any idea how to extract from our sequence of search points
a point where the constraints are almost satised and at the same time the objective is
close to the optimal value - recall that we simply did not look at the objective when solving
the problem!
There is, anyhow, a simple way to overcome the diculty - we should use for the cut the
subgradient of the objective at the steps when the constraints are almost satised at the
current search point, not only at the steps when the point is feasible. Let us consider the
method, proposed by Nemirovski and Yudin in 1976:
Cutting plane scheme for problems with functional constraints:

Given in advance the desired relative accuracy (0, 1), generate,
starting with G0 = G, the sequence of localizers Gi , as follows:
given Gi1 (which is a closed and bounded convex subset of G with
a nonempty interior), act as follows:
1) choose i-th search point
xi int Gi1
and ask the oracle about the values
f (xi ), g1(xi ), ..., gm (xi )
and subgradients
f (xi ), g1 (xi ), ..., gm

(xi )
of the objective and the constraints at xi .
3.3. THE GENERAL CASE: PROBLEMS WITH FUNCTIONAL CONSTRAINTS 45
2) form the affine lower bounds

(i)
gj (x) = gj (xi ) + (x xi )T gj (xi )
for the functional constraints (these actually are lower bounds since
the constraints are convex) and compute the quantities

(i)
gji, = max gj (x) , i = 1, ..., m.
G +
3) If, for all j = 1, ..., m,

gj (xi ) gj,i , (3.3.7)
claim that i is a productive step and go to 4.a), otherwise go to 4.b).
4.a) [productive step]
If f (xi ) = 0, define the new localizer Gi as
Gi = {x Gi1 | (x xi )T f (xi ) 0}
and go to 5); otherwise terminate, xi being the result formed by the
method.
4.b) [non-productive step]
Choose j such that
gj (xi ) > gj,i ,
define the new localizer Gi as
Gi = {x Gi1 | (x xi )T gj (xi ) 0}
and go to 5).
5) Define i-th approximate solution as the best (with the smallest
value of the objective) of the search points xj corresponding to the
productive steps j i. Replace i by i + 1 and go to 1).
Note that the approximate solution xi is dened only if a productive step already has been
performed.
The rate of convergence of the above method is given by the following analogue of Lemma
3.2.4:
Proposition 3.3.1 Let Size be a size, and let a cutting plane method be applied to a convex
problem p with m functional constraints. Assume that for a given N the method either
terminates in course of the rst N steps, or this is not the case and the relation
Size(GN )
< (3.3.8)
Size(G)
is satised. In the rst case the result x found by the method is a solution to the problem of
relative accuracy ; in the second case the N-th approximate solution is well-dened and
solves the problem to the relative accuracy .
Proof. Let us rst note that for any i and j one has

gj,i max gj ; (3.3.9)
G +
(i)
this is an immediate consequence of the fact that gj (x) is a lower bound for gj (x) (an
immediate consequence of the convexity of gj ), so that the maximum of this lower bound
over x G, i.e., gj,i , is at most the similar quantity for the constraint gj itself.
Now, assume that the method terminates at certain step i N. According to the
description of the method, it means that i is a productive step and 0 is a subgradient of the
objective at xi ; the latter means that xi is a minimizer of f over the whole G, so that
f (xi ) f .
Besides this, i is a productive step, so that

gj (xi ) gj,i max gj , j = 1, ..., m
G +
(we have used (3.3.9)); these inequalities, combined with the denition of the relative accu-
racy, state exactly that xi (i.e., the result obtained by the method in the case in question)
solves the problem within the relative accuracy , as claimed.
Now assume that the method does not terminate in course of the rst N steps. In view
of our premise, here we have
Size(GN ) < Size(G). (3.3.10)
Let x be an optimal solution to the problem, and let
G = x + (G x ).
G is a closed and bounded convex subset of G with a nonempty interior; due to homogeneity
of Size with respect to homotheties, we have
Size(G ) = Size(G) > Size(GN )
(the second inequality here is (3.3.10)). From this inequality and the monotonicity of the
size it follows that G cannot be a subset of GN :
y G \GN .
Now, y is a point of G (since the whole G is contained in G), and since it does not belong
to GN , it was cut o at some step of the method, i.e., there is an i N such that
eTi (y xi ) > 0, (3.3.11)
ei being the linear functional dening i-th cut.

3.3. THE GENERAL CASE: PROBLEMS WITH FUNCTIONAL CONSTRAINTS 47
Note also that since y G , we have a representation
y = (1 )x + z (3.3.12)
with certain z G.
Let us prove that in fact i-th step is productive. Indeed, assume it is not the case. Then
ei = gj (xi ) (3.3.13)

(i)
for some j such that gj (xi ) > gj,i . From this latter inequality combined with (3.3.13) and
(3.3.11) it follows that
(i) (i)
gj (y) > gj (xi ) = gj (xi ) > gj,i . (3.3.14)
On the other hand, we have
(i) (i) (i)
gj (y) = (1 )gj (x ) + gj (z) gj,i
(i)
(we have taken into account that gj is a lower bound for gj () and therefore this bound
is nonpositive at the optimal solution to the problem); the resulting inequality contradicts
(3.3.14), and thus the step i indeed is productive.
Now, since i is a productive step, we have ei = f (xi ), and (3.3.11) implies therefore that
f (y) > f (xi );
from this latter inequality and (3.3.12), exactly as in the case of problems with no functional
constraints, it follows that
f (xi ) f < f (y) f (max f f ). (3.3.15)

G
Now let us summarize our considerations. We have proved that in the case in question (i.e.,
when the method does not terminate during rst N steps and (3.3.8) is satised) there exist
a productive step i N such that (3.3.15) holds. Since the N-th approximate solution is the
best (in terms of the values of the objective) of the search points generated at the productive
steps with step numbers N, it follows that xN is well-dened and
xN ) f f (xi ) f (max f f );
f ( (3.3.16)
G
since xN is, by construction, the search point generated at certain productive step i , we
have also
(i ) ,i
xN ) = gj (xi ) gj max gj , j = 1, ..., m;
gj (
G +
these inequalities combined with (3.3.16) results in
N ) ,
(p, x
as claimed.
Combining Proposition 3.3.1 and the Grunbaum Theorem, we come to the Center of
Gravity method for problems with functional constraints. The method is obtained from our
general cutting plane scheme for constrained problems by the following specications:
rst, we use, as a current search point, the center of gravity of the previous localizer:

1
xi = xdx;
Voln (Qi1 ) Qi1
second, we terminate the method after N-th step, N being given by the relation

1
N =2.181n ln .

With these specications the average diameter of i-th localizer at every step, due to the
Grunbaum Theorem, decreases with i at least as

exp{ i} AvDiam(G), = 0.45867...,
n
1
and since
< 2.181, we come to
AvDiam(GN ) < AvDiam(G);
this latter inequality, in view of Proposition 3.3.1, implies that the method does nd an
-solution to every problem from the family, thus justifying the upper complexity bound we
are proving.
3.4 Lower complexity bound

To complete the proof of Theorem 3.1.1, it remains to establish the lower complexity
bound. This is done, basically, in the same way as in the one-dimensional case. In
order to avoid things dicult for verbal presentation, We show here a slightly worse
lower bound
n ln( 1 )
A() O(1) , 0 < < (G), (3.4.17)
1
ln n ln( )
O(1) being a positive absolute constant and (G) being certain positive quantity
depending on the geometry G only (we shall see that this quantity measures how G
diers from a paralellotope). 2
The spoiled bound (3.4.17) (which is by logarithmic denominator worse than the
estimate announced in the Theorem) is more or less immediate consequence of our
one-dimensional considerations. Of course, it is sucient to establish the lower bound
2
for the exact lower bound refer to A.S. Nemirovskij and D.B. Yudin Problem complexity and
method eciency in optimization, A Wiley-Interscience Publication. Chichester etc.: John Wiley & Sons,
1983.
3.4. LOWER COMPLEXITY BOUND 49
for the case of problems without functional constraints, since the constrained ones form
a wider family (indeed, a problem without functional constraints can be thought of as
a problem with a given number m of trivial, identically zero functional constraints).
Thus, in what follows the number of constraints m is set to 0.
Let us start with the following simple observation. Let, for a given > 0 and a
convex objective f , the set G (f ) be comprised of all approximate solutions to f of
relative accuracy not worse than :
G (f ) = {x G | f (x) min f (max f min f )}.

G G G
Assume that, for a given > 0, we are able to point out a nite set F of objectives
with the following two properties:
(I) no dierent problems from F admit a common -solution:
G (f ) G (f) =
whenever f, f F and f = f;
(II) given in advance that the problem in question belongs to F, one can compress
an answer of the rst order local oracle to be a (log2 K)-bit word. It means the
following. For certain positive integer K one can indicate a function I(f, x) taking
values in a K-element set and a function R(i, x) such that
O(f, x) = R(I(f, x), x), f F, x int G.
In other words, given in advance that the problem we are interested in belongs to F, a
method can imitate the rst-order oracle O via another oracle I which returns log2 K
bits of information rather than innitely many bits contained in the answer of the
rst order oracle; given the compressed answer I(f, x), a method can substitute this
answer, along with x itself, into a universal (dened by F only) function in order to
get the complete rst-order information on the problem.
E.g., consider the family Fn comprised of 2n convex functions
f (x) = max i xi ,
i=1,...,n
where all i are 1. At every point x a function from the family admits a subgradient
of the form I(f, x) = ei (ei are the orths of the axes), with i, same as the sign at
ei , depending on f and x. Assume that the rst order oracle in question when asked
about f Fn reports a subgradient of exactly this form. Since all functions from the
family are homogeneous, given x and I(f, x) we know not only a subgradient of f at
x, but also the value of f at the point:
f (x) = xT I(f, x).
Thus, our particular rst-order oracle as restricted onto Fn can be compressed to

log2 (2n) bits.
Now let us make the following observation:
(*):
under assumptions (I) and (II) the -complexity of the family F, and
therefore of every larger family, is at least
log2 |F|
.
log2 K
Indeed, let M be a method which solves all problems from F within accuracy in
no more than N steps. We may assume (since informationally this is the same) that
the method uses the oracle I, rather than the rst-order oracle. Now, the behavior of
the method is uniquely dened by the sequence of answers of I in course of N steps;
therefore there are at most K N dierent sequences of answers and, consequently, no
more than K N dierent trajectories of M. In particular, the set X formed by the
results produced by M as applied to problems from F is comprised of at most K N
points. On the other hand, since M solves every of |F| problems of the family within
accuracy , and no two dierent problems from the family admit a common -solution,
X should contain at least |F| points. Thus,
K N |F|,
as claimed.
As an immediate consequence of what was said, we come to the following result:
the complexity of minimizing a convex function over an n-dimensional parallelotope
G within relative accuracy < 1/2 is at least n/(1 + log2 n).
Indeed, all our problem classes and complexity-related notions are ane invariant, so
that we always may assume the parallelotope G mentioned in the assertion to be the
unit cube
{x Rn | |x| max |xi | 1}.
i
1
For any < 2 the aforementioned family
Fn = {f (x) = max i xi }
i
clearly possesses property (I) and, as we have seen, at least for certain rst-order oracle
possesses also property (II) with K = 2n. We immediately conclude that the complex-
ity of nding an -minimizer, < 1/2, of a convex function over an n-dimensional
parallelotope is, at least for some rst order oracle, no less than
log2 |F|
,
log2 (2n)
as claimed. In fact, of course, the complexity is at least n for any rst order oracle,
but to prove the latter statement it requires more detailed considerations.
Now let us use the above scheme to derive the lower bound (3.4.17). Recall that
when studying the one-dimensional case, we have introduced certain family of univari-
ate convex functions which was as follows. The functions of the family form a tree,
with the root (generation 0) being the function
f root (x) = |x|;

when subject to the left and to the right modications, the function produces two
children, let them be called fr and fl ; each of these functions, in turn, may be
subject to the right and to the left modication, producing two new functions, so that
at the level of grandchildren there are four functions frr , frl , flr , fll , and so on. Now,
every of the functions f of a generation k > 0 possesses its own active segment (f )
of the length 212k , and at this segment the function is modulus-like:
f (x) = a(k) + 8k |x c(f )|,
c(f ) being the midpoint of (f ). Note that a(k) depends only on the generation f
belongs to, not on the particular representative of the generation; note also that the
active segments of the 2k functions belonging to the generation k are mutually disjoint
and that a function from our population coincides with its parent outside the
active segment of the parent. In what follows it is convenient also to dene the active
segment of the root function f root as the whole axis.
Now, let Fk be the set of 2k functions comprising k-th generation in our population.
Let us demonstrate that any rst order oracle, restricted onto this family of functions,
admits compression to log2 (2k) bits. Indeed, it is clear from our construction that
in order to restore, given an x, f (x) and a subgradient f (x), it suces to trace the
path of predecessors of f - its father, its grandfather, ... - and to nd the youngest
of them, let it be f, such that x belongs to the active segment of f (let us call this
predecessor the active at x predecessor of f ). The active at x predecessor of f does
exist, since the active segment of the common predecessor f root is the whole axis.
Now, f is obtained from f by a number of modications; the rst of them possibly
varies f in a neighborhood of x (x is in the active segment of f), but the subsequent
modications do not, since x is outside the corresponding active segments. Thus, in a
neighborhood of x f coincides with the function f - the modication of f which leads
from f to f . Now, to identify the local behavior of f (i.e., that one of f ) at x, it suces
to indicate the age of f, i.e., the number of the generation it belongs to, and the
type of the modication - left or right - which transforms f into f.
Indeed, given x and the age k of f, we may uniquely identify the active segment of
f (since the segments for dierent members of the same generation k 1 have no
common points); given the age of f, its active segment and the type of modication
leading from f to f, we, of course, know f in a neighborhood of the active segment of
f and consequently at a neighborhood of x.
Thus, to identify the behavior of f at x and therefore to imitate the answer of
any given local oracle on the input x, it suces to know the age k of the active at
x predecessor of f and the type - left or right - of modication which moves the
predecessor towards f , i.e., to know a point from certain (2k)-element set, as claimed.
Now let us act as follows. Let us start with the case when our domain G is a
parallelotope; due to ane invariance of our considerations, we may assume G to be
the unit n-dimensional cube:
G = {x Rn | |x| 1}.
Given a positive integer k, consider the family F k comprised of the objectives
fi1 ,...,in (x) = max{fi1 (x1 ), ..., fin (xn )}, fis Fk , s = 1, ..., n.
This family contains |F|n = 2nk objectives, all of them clearly being convex and Lip-
schitz continuous with constant 1 with respect to the uniform norm | | . Let us
demonstrate that there exists a rst order oracle such that the family, equipped with
this oracle, possesses properties (I), and (II), where one should set
= 26k , K = 2nk. (3.4.18)
Indeed, a function fi1 ,...,in attains its minimum a(k) exactly at the point xi1 ,...,in with
the coordinates comprised of the minimizers of fis (xs ). It is clear that within the cube
C1 ,...,in = {x | |x xi1 ,...,in | 2 23k }
(i.e., within the direct product of the active segments of fis , s = 1, ..., n) the function
is simply
a(k) + 23k |x xi1 ,...,in | ,
therefore outside this cube one has
fi1 ,...,in (x) min fi1 ,...,in > 26k = .

G
Taking into account that all our functions fi1 ,...,in , being restricted onto the unit cube
G, take their values in [0, 1], so that for these functions absolute inaccuracy in terms
of the objective is majorated by the relative accuracy, we come to
G (fi1 ,...,in ) Ci1 ,...,in .
It remains to note that the cubes C corresponding to various functions from the family
are mutually disjoint (since the active segments of dierent elements of the generation
Fk are disjoint). Thus, (I) is veried.
In order to establish (II), let us note that to nd the value and a subgradient of
fi1 ,...,in at a point x it suces to know the value and a subgradient at xis of any function
fis which is active at x, i.e., is all other functions participating in the expression
for fi1 ,...,in . In turn, as we know, to indicate the value and a subgradient of fis it suces
to report a point from a (2k)-element set. Thus, one can imitate certain (not any) rst
order oracle for the family F k via a compressed oracle reporting log2 (2nk)-bit word
(it suces to indicate the number s, 1 s n of a component fis active at x and a
point of a (2k)-element set to identify fis at xis ).
Thus, we may imitate certain rst order oracle for the family F k (comprised of 2kn
functions), given a compressed oracle with K = 2nk; it follows from (*) that the
-complexity of F for = 26k (see (3.4.18)) is at least
log2 (2nk )
;
log2 (2nk)
expressing k in terms of , we come to
nlog2 ( 1 )
A() , = 26k , k = 1, 2, ...;
6log2 nlog2 ( 2 )
taking into account that A() is a non-increasing function, we come to an estimate of

1
the type (3.4.17) with < (G) = 64 (recall that we were dealing with the unit cube
G).
Let me stress what was in fact was proved. We have demonstrated that it is
possible to equip the family of all convex continuous functions on the unit cube with
a rst order oracle in such a way that the complexity of the family would satisfy lower
1
bound (3.4.17), provided that < 64 .
Now, what happens in the case when G is not a parallelotope? In this case we can
nd a pair of homothetic parallelotopes, p and P , such that p G P . Let us choose
among these pairs p, P that one with the smallest possible similarity ratio, let it be
called (G). Those who have solved problems given at the previous lecture know, and
other should believe that
(G) n3/2
for any closed and bounded convex body G Rn . After appropriate ane transforma-
tion of the space (the transformation does not inuence the complexity - once again,
all our complexity-related notions are ane invariant) we may assume that p is the
unit cube. Thus, we come to the situation as follows:
Cn {x Rn | |x| 1} G {x Rn | |x| (G)}. (3.4.19)
Now let us look at the family of problems
minimize f (x) s.t. x G
associated with f F k . It is easily seen that all the objectives from the family attain
their global (i.e., over the whole space) minima within the unit cube and the sets of
approximate solutions of absolute accuracy 26k to problems of the family are mutually
disjoint (these sets are exactly the small cubes already known to us). Therefore our
previous reasoning states that it is possible to equip the family with a rst order oracle
in such a way that the worst-case, with respect to the family F k , complexity of nding
n log2 k
an approximate solution of absolute accuracy 25k is at least log . On the other
2 (2nk)
k
hand, the Lipschitz constant of every function from the family F taken with respect
to the uniform norm is, as we know, at most 1, so that the variation
max f min f
G G
of such a function on the domain G is at most the diameter of G with respect to the
uniform norm; the latter diameter, due to (3.4.19), is at most 2(G). It follows that
any method which solves all problems from F k within relative accuracy 26k1 /(G)
solves all these problems within absolute accuracy 26k as well; thus, the complexity
of minimizing convex function over G within relative accuracy 26k1 /(G) is at least
n log2 k
log2 (2nk) :
nlog2 k
A(26k1 /(G)) , k = 1, 2, ...
log2 (nk)
This lower bound immediately implies that

1
nlog2 (G) 1
A() O(1) , (G) < ,
1 128
log 2 nlog2 ( (G) )
whence, in turn,
n ln(1/) 1 1
A() O(1) , (G) 2
( );
ln (n ln(1/)) 128 (G) 128n3
this is exactly what is required in (3.4.17).
Note that our reasoning results in a lower bound which is worse than that one
indicated in the Theorem not only by the logarithmic denominator, but also due to the
fact that this is a lower bound for a particular rst order oracle, not for an arbitrary one.
In fact both these shortcomings, i.e. the presence of the denominator and the oracle-
dependent type of the lower bound, may be overcome by more careful reasoning, but
we are not going to reproduce it here.
3.5 The Ellipsoid method

We have presented the Center of Gravity method which is optimal in complexity, up to an
absolute constant factor, at least when the required accuracy is small enough. The rate of
convergence of the method, which is the best possible theoretically, looks ne also from the
practical viewpoint. The upper complexity bound O(1)n ln(1/) associated with the method
means that in order to improve inaccuracy by an absolute constant factor (say, reduce it
by factor 10) it suces to perform O(1)n iterations more, which looks not too bad, taking
into account how wide is the family of problems the method can be applied to. Thus, one
may ask: what were the computational consequences of the method invented as early as in
1965? The answer is: no consequences at all, since the method cannot be used in practice,
provided that the dimension of the problem is, say > 4. The reason is that the auxiliary
problems arising at the steps of the method, I mean those of nding centers of gravity, are
computationally extremely hard; the source of the diculty is that the localizers arising in
the method may be almost arbitrary solids; indeed, all we can say is that if the domain G
of the problem is a polytope given by k0 linear inequalities, then the localizer Gi formed
after i steps of the method is a polytope given by k0 + i linear inequalities; and all known
deterministic methods for computing the center of gravity of a polytope in Rn given by
k > 2n linear inequalities take an exponential in n time to nd the center. In our denition
of complexity we ignore the computational eort required to implement the search rules;
but in practice this eort, of course, should be taken into account, this is why the Center of
Gravity method, for which this latter eort is tremendous, cannot be used at all.
The situation, nevertheless, is not as bad as one could think: to the moment we have not
exploit all abilities of the cutting plane scheme.
Let us note that the cutting plane scheme can be spoiled as follows. Given previous
localizer Gi1 , we apply our basic scheme to produce a new localizer, G i , but now this is
something intermediate, not the localizer we are forwarding to the next step (this is why we
denote by G i the solid which in the basic scheme was designated Gi ); the localizer we do use
at the next step is certain larger solid Gi G i . Thus, at a step of the modied cutting plane
scheme we perform a cut, exactly as in the basic scheme, and enlarge the resulting localizer
to obtain Gi .
3.5. THE ELLIPSOID METHOD 55
At this point one could ask: what for should we add to an actual localizer something
which for sure does not contain optimal solutions? The answer is: acting in this manner,
we may stabilize geometry of our localizers and enforce them to be convenient for numerical
implementation of the search rules. This is the idea underlying the Ellipsoid method we are
about to present.
3.5.1 Ellipsoids
Recall that an ellipsoid in Rn is dened as a level set of a nondegenerate convex quadratic
form, i.e., as a set of the type
W = {x Rn | (x c)T A(x c) 1}, (3.5.20)
where A is an n n symmetric positive denite matrix and c Rn is the center of the

ellipsoid. An equivalent denition, which will be more convenient for us, says that an
ellipsoid is the image of the unit Euclidean ball under a one-to-one ane transformation:
W = W (B, c) = {x = Bu + c | uT u 1}, (3.5.21)
where B is an n n nonsingular matrix. It is immediately seen that one can pass from
representation (3.5.21) to (3.5.20) by setting
A = (B T )1 B 1 ; (3.5.22)
since any symmetric positive denite matrix A admits a representation of the type (3.5.22)
(e.g., with B = A1/2 ), the above denitions indeed are equivalent.
From (3.5.21) it follows immediately that
Voln (W (B, c)) = | Det B|Voln (V ), (3.5.23)
where V denotes the unit Euclidean ball in Rn .

Now, by compactness reasons it is easily seen that for any n-dimensional solid Q there
exist ellipsoids containing Q and among these ellipsoids there is at least one with the smallest
volume; in fact this extremal outer ellipsoid of Q is uniquely dened, but we are not interested
in the uniqueness issues. The average diameter of this extremal outer ellipsoid is certain
function of Q, let it be denoted by EllOut(Q) and called the outer ellipsoidal size:
EllOut(Q) = (min{Voln (W ) | W is an ellipsoid containing Q})1/n .
It is immediately seen that the introduced function is a size, i.e., it is positive, monotone
with respect to inclusions and homogeneous with respect to similarity transformations of the
homogeneity degree 1.
We need the following simple lemma
Lemma 3.5.1 Let n > 1, let
W = {x = Bu + c | uT u 1}
be an ellipsoid in Rn , and let

= {x W | (x c)T q 0}
W
be a half-ellipsoid - the intersection of W and a half-space with the boundary hyperplane

passing through the center of W (here q = 0). Then W can be covered by an ellipsoid W +
of the volume
Voln (W + ) = n (n)Voln (W ),

n n2 n1 1
(n) = 2 exp{ }; (3.5.24)
n 1 n+1 2(n 1)
in particular,
1
EllOut(W + ) (n) EllOut(W ) exp{ } EllOut(W ). (3.5.25)
2n(n 1)
The ellipsoid W + is given by
W + = {x = B + u + c+ | uT u 1},
1
B + = (n)B (n)(Bp)pT , c+ = c Bp,
n+1
where
1/2
n2 n 1 BT q
(n) = , (n) = (n) 1 , p= $ .
n2 1 n+1 q T BB T q
To prove the lemma, it suces to reduce the situation to the similar one with W being the
unit Euclidean ball V ; indeed, since W is the image of V under the ane transformation
u Bu + c, the half-ellipsoid W
is the image, under this transformation, of the half-ball
V = {u V | (B T q)T u 0} = {u V | pT u 0}.
Now, it is quite straightforward to verify that a half-ball indeed can be covered by an ellipsoid
V + with the volume being the required fraction of the volume of V ; to verify this, it was
one of the exercises of the previous lecture (cf. Exercise 2.3.8), and in the formulation of the
exercise you were given the explicit representation of V + . It remains to note that the image
of V + under the ane transformation which maps the unit ball V onto the ellipsoid W is an
and is in the same ratio of volumes with
ellipsoid which clearly contains the half-ellipsoid W
+
respect to W as V is with respect to the unit ball V (since the ratio of volumes remains
invariant under ane transformations). The ellipsoid W + given in formulation of the lemma
is nothing but the image of V + under our ane transformation.
3.5.2 The Ellipsoid method

Bearing in mind our spoiled cutting plane scheme, we may interpret the statement of
Lemma 3.5.1 as follows: assume that at certain step of a cutting plane method the localizer
Gi1 to be updated is an ellipsoid. Let us choose the current search point as the center of
i which is a half-ellipsoid.
the ellipsoid; then the cut will result in the intermediate localizer G
Let us cover this intermediate localizer by an ellipsoid given by our lemma and choose the
latter ellipsoid as the new localizer Gi . Now we are in the same position as we were - the new
localizer is an ellipsoid, and we may proceed in the same manner. And due to our lemma,
we do decrease in certain ratio certain size of localizers - namely, the outer ellipsoidal size
EllOut.
Two issues should be thought of. First, how to initialize our procedure - i.e., how to
enforce the initial localizer to be an ellipsoid (recall that in our basic scheme the initial
localizer was the domain G of the problem). The answer is immediate - let us take as G0 an
arbitrary ellipsoid containing G. The second diculty is as follows: in our spoiled cutting
plane scheme the localizers, generally speaking, are not subsets of the domain of the problem;
it may, consequently, happen that the center of a localizer is outside the domain; how to
perform a cut in the latter case? Here again the diculty can be immediately overcome.
If the center xi of the current localizer Gi1 is outside the interior of G, then, due to the
Separation Theorem for convex sets, we may nd a nonzero linear functional eT x which
separates xi and int G:
(x xi )T e 0, x G.
Using e for the cut, i.e., setting
i = {x Gi1 | (x xi )T e 0},
G
we remove from the previous localizer Gi1 only those points which do not belong to the
i indeed can be thought of as a new intermediate localizer.
domain of the problem, so that G
Thus, we come to the Ellipsoid method, due to Nemirovski and Yudin (1979), which, as
applied to a convex programming problem
minimize f (x) s.t. gj (x) 0, j = 1, ..., m, x G Rn
works as follows:
Initialization. Choose n n nonsingular matrix B0 and a point x1 such

that the ellipsoid
G0 = {x = B0 u + x1 | uT u 1}
contains G. Choose > 0 such that
EllOut(G)
.
EllOut(G0 )
i-th step, i 1. Given Bi1 , xi , act as follows:

1) Check whether xi int G. If it is not the case, then call step i non-
productive, nd a nonzero ei such that
(x xi )T ei 0 x G
and go to 3), otherwise go to 2).

2) Call the oracle to compute the quantities
f (xi ), f (xi ), g1 (xi ), g1 (xi ), ..., gm (xi ), gm

(xi ).
If one of the inequalities

gj (xi ) max{gj (xi ) + (x xi )T gj (xi )} , j = 1, ..., m (3.5.26)
G +
is violated, say, k-th of them, call i-th step non-productive, set
ei = gk (xi )
and go to 3).
If all inequalities (3.5.26) are satised, call i-th step productive and set
ei = f (xi ).
If ei = 0, terminate, xi being the result found by the method, otherwise go to 3).

3) Set
T
Bi1 ei 1
p= $ , Bi = (n)Bi1 (n)(Bi1 p)pT , xi+1 = xi Bi1 p,
eTi Bi1 Bi1
T
ei n+1
(3.5.27)
(n) and (n) being the quantities from Lemma 3.5.1.
If
i (n) < , (3.5.28)
terminate with the result x being the best, with respect to the objective, of the
search points associated with the productive steps:
x Argmin{f (x) | x is one of xj with productive j i}
otherwise go to the next step.

The main result on the Ellipsoid method is as follows:
Theorem 3.5.1 The associated with a given relative accuracy (0, 1) Ellipsoid method
Ell(), as applied to a problem instance p Pm (G), terminates in no more than

1
ln 1
A(Ell()) = 2n(n 1) ln
ln(1/(n))
steps and solves p within relative accuracy : the result x is well dened and
(p, x) .
Given the direction ei dening i-th cut, it takes O(n2 ) arithmetic operations to update
(Bi1 , xi ) into (Bi , xi+1 ).
Proof. The complexity bound is an immediate corollary of the termination test (3.5.28).
To prove that the method solves p within relative accuracy , note that from Lemma 3.5.1
it follows that
EllOut(Gi ) i (n) EllOut(G0 ) i (n) 1 EllOut(G)
(the latter inequality comes from the origin of ). It follows that if the method terminates
at a step N due to (3.5.28), then
EllOut(GN ) < EllOut(G).
Due to this latter inequality, we immediately obtain the accuracy estimate as a corollary of
our general convergence statement on the cutting plane scheme (Proposition 3.3.1). Although
the latter statement was formulated and proved for the basic cutting plane scheme rather
than for the spoiled one, the reasoning can be literally repeated in the case of the spoiled
scheme.
Note that the complexity of the Ellipsoid method depends on , i.e., on how good is the
initial ellipsoidal localizer we start with. Theoretically, we could choose as G0 the ellipsoid
of the smallest volume containing the domain G of the problem, thus ensuring = 1; for
simple domains, like a box, a simplex or a Euclidean ball, we may start with this optimal
ellipsoid not only in theory, but also in practice. Even with this good start, the Ellipsoid
method has O(n) times worse theoretical complexity than the Center of Gravity method
(here it takes O(n2 ) steps to improve inaccuracy by an absolute constant factor). As a
compensation of this theoretical drawback, the Ellipsoid method is not only of theoretical
interest, it can be used for practical computations as well. Indeed, if G is a simple domain
from the above list, then all actions prescribed by rules 1)-3) cost only O(n(m+n)) arithmetic
operations. Here the term mn comes from the necessity to check whether the current search
point is in the interior of G and, if it is not the case, to separate the point from G, and also
from the necessity to maximize the linear approximations of the constraints over G; the term
n2 reects complexity of updating Bi1 Bi after ei is found. Thus, the arithmetic cost
of a step is quite moderate, incomparably to the tremendous one for the Center of Gravity
method.
3.6 Exercises
3.6.1 Some extensions of the Cutting Plane scheme
To the moment we have applied the Cutting Plane scheme to convex optimization problems
in the standard form
minimize f (x) s.t. gj (x) 0, j = 1, ..., m, x G Rn (3.6.29)
(G is a solid, i.e., a closed and bounded convex set with a nonempty interior, f and gj are
convex and continuous on G). In fact the scheme has a wider eld of applications. Namely,
consider a generic problem as follows:
(f ) : minimize f (x) s.t. x Gf Rn ; (3.6.30)
here Gf is certain (specic for the problem instance) solid and f is a function taking values
in the extended real axis R {} {+} and nite on the interior of G.
Let us make the following assumption on our abilities to get information on (f ):
(A): we have an access to an oracle OA which, given on input a point x Rn , informs us
whether x belongs to the interior of Gf ; if it is not the case, the oracle reports a nonzero
functional ex which separates x and Gf , i.e., is such that
(y x)T ex 0, y Gf ;
if x int Gf , then the oracle reports f (x) and a functional ex such that the level set
lev f (x) f = {y Gf | f (y) < f (x)}
is contained in the open half-space
{y Rn | (y x)T ex < 0}.
In the mean time we shall see that under assumption (A) we can eciently solve (f ) by
cutting plane methods; but before coming to this main issue let me indicate some interesting
3.6. EXERCISES 61
particular cases of the generic problem (f ).
Lh (x)
Figure 3.A quasiconvex function h(x)

Example 1. Convex optimization in the standard setting. Consider the usual
convex problem (3.6.29) and assume that the feasible set of the problem has a nonempty
interior where the constraints are negative. We can rewrite (3.6.29) as (3.6.30) by setting
Gf = {x G | gj (x) 0, j = 1, ..., m}.
Now, given in advance the domain G of the problem (3.6.29) and being allowed to use a
rst-order oracle O for (3.6.29), we clearly can imitate the oracle Oa required by (A).
Example 1 is not so interesting - we simply have expressed something we already know
in a slightly dierent way. The next examples are less trivial.
Example 2. Quasi-convex optimization. Assume that the objective f involved into
(3.6.29) and the functional constraints gj are quasi-convex rather than convex. Recall that
a function h dened on a convex set Q is called quasi-convex, if the sets
lev h(x) h = {y Q | h(y) h(x)}
are convex whenever x Q (note the dierence between Lh (x) and lev h(x) h; the latter set
is dened via strict inequality <, the former one via ). Besides quasi-convexity, assume
that the functions f , gj , j = 1, ..., m, are regular, i.e., continuous on G, dierentiable on the
interior of G with a nonzero gradient at any point which is not a minimizer of the function
over G. Last, assume that the feasible set Gf of the problem possesses a nonempty interior
and that the constraints are negative on the interior of Gf .
If h is a regular quasi-convex function on G and x int G, then the set lev h(x) h = {y
G | h(y) < h(x)} belongs to the half-space
h (x) = {y G | (y x)T h (x) < 0}.
#+
Exercise 3.6.1 Prove the latter statement.
Thus, in the case in question the sets {y G | f (y) < f (x)}, {y G | gj (y) < gj (x)} are
contained in the half-spaces {y | (y x)T f (x) < 0}, {y | (y x)T gj (x) < 0}, respectively.
It follows that in the case in question, same as in the convex case, we, given an access to
a rst-order oracle for (3.6.29), can imitate a required by (A) oracle OA for the induced
problem (3.6.30).
Example 3. Linear-fractional programming. Consider the problem

a (x)
minimize f (x) = max s.t. gj (x) 0, j = 1, ..., m, b (x) > 0, , x G;
b (x)
(3.6.31)
n
here G is a solid in R and is a nite set of indices, a and b are ane functions and gj
are, say, convex and continuous on G. The problem is, as we see, to minimize the maximum
of ratios of given linear forms over the convex set dened by the inclusion x G, convex
functional constraints gj (x) 0 and additional linear constraints expressing positivity of the
denominators.
Let us set
Gf = {x G | gj (x) 0, b (x) 0, };
we assume that the (closed and convex) set Gf possesses a nonempty interior and that the
functions gj are negative, while b are positive on the interior of Gf .
By setting
max {a (x)/b (x)} x int Gf
f (x) = (3.6.32)
+, otherwise
we can rewrite our problem as
minimize f (x) s.t. x Gf . (3.6.33)
Now, assume that we are given G in advance and have an access to a rst-order oracle O
which, given on input a point x int G, reports the values and subgradients of functional
constraints at x, same as reports all a (), b ().
Under this assumptions we can imitate for (3.6.33) the oracle OA required by the as-
sumption (A). Indeed, given x Rn , we rst check whether x int G, and if it is not the
case, nd a nonzero functional ex which separates x and G (we can do it, since G is known
in advance); of course, this functional separates also x and Gf , as required in (A). Now, if
x int G, we ask the rst-order oracle O about the values and subgradients of gj , a and
b at x and check whether all gj are negative at x and all b (x) are positive. If it is not the
case and, say, gk (x) 0, we claim that x int Gf and set ex equal to gk (x); this functional
is nonzero (since otherwise gk would attain a nonnegative minimum at x, which contradicts
our assumptions about the problem) and clearly separates x and Gf (due to the convexity
of gk ). Similarly, if one of the denominators b is nonpositive at x, we claim that x int Gf
and set
ex = b ;
3.6. EXERCISES 63
here again ex is nonzero and separates x and Gf (why?)

Last, if x int G, all gj are negative at x and all b are positive at the point, we claim
that x int Gf (as it actually is the case), nd the index = (x) associated with the
largest at x of the fractions a ()/b (), compute the corresponding fraction at x (this is
nothing but f (x)) and set
ex = y a(x) (y) f (x)y b (x)(y).
Since in the latter case we have

a(x) (y)
lev f (x) f {y Gf | < f (x)} {y Gf | (y x)T ex < 0},
b(x) (y)
we do t the requirements imposed by (A).

Note that the problem of the type (3.6.31) arises, e.g., in the famous von Neumann Model
of Economy Growth which is as follows. Consider an economy where m kinds of goods are
circulating. The economy is described by a pair of m n matrices A and B with positive
entries, where row index i stands for goods and column index j stands for processes.
A process j takes, as input, aij units of good i and produces, as output, bij units of the
same good, per year. Now, let xt be an n-dimensional vector with coordinates xtj being
the intensities by which we let j-th process work in year t. Then the amount of goods
consumed by all the processes run in year t is represented by the vector Axt , and the amount
of goods produced in the same year is given by the vector Bxt . If we have no external sources
of goods, then the trajectory of our economy should satisfy the inequalities
Axt+1 Bxt , t = 0, 1, ...
(for t = 0 the right hand side should be replaced by a positive vector representing the
starting amount of goods). Now, in the von Neumann Economic Growth problem it is
asked what is the largest growth factor, , for which there exists a semi-stationary growth
trajectory, i.e., a trajectory of the type xt = t x0 . In other words, we should solve the
problem
maximize s.t. Ax Bx for some nonzero x 0.
Without loss of generality, x in the above formulation can be taken as a point form the
standard simplex
G = {x Rn | x 0, xj = 1}
j
(which should be regarded as a solid in its ane hull). It is clearly seen that the problem in
question can be rewritten as follows:

j aij xj
minimize max s.t. x G; (3.6.34)
i=1,...,m j bij xj
this is a problem of the type (3.6.33).

It is worthy to note that the von Neumann growth factor describes, in a sense, the
highest rate of growth of our economy (this is far from being clear in advance: why the
Soviet proportional growth is the best one? Why could we not get something better
along an oscillating trajectory?) One exact statement on optimality of the von Neumann
semi-stationary trajectory (or, better to say, the simplest of these statements) is as follows:
Proposition 3.6.1 Let {xt }Tt=1 be a trajectory of our economy, so that xt are nonnegative,
x0 = 0 and
Axt+1 Bxt , t = 0, 1, ..., T 1.
Assume that xT T x0 for some positive (so that our trajectory results, for some T , in
growth of the amount of goods in T times in T years). Then .
Note that following the von Neumann trajectory
xt = ( )T x0 ,
x0 being the x-component of an optimal solution to (3.6.34), does ensure growth by factor
( )T each T years.

Exercise 3.6.2 Prove Proposition 3.6.1.
Example 4. Generalized Eigenvalue problem. Let us again look at problem

(3.6.31). What happens if the index set becomes innite? To avoid minor elabora-
tions, let us assume (as it actually is the case in most applications) that is a compact
set and that the functions a (x), b (x) are continuous in (x, ) (and, as above, are ane
in x). As far as reformulation (3.6.32) - (3.6.33) is concerned, no diculties occur. The
possibility to imitate the oracle OA , this is another story (it hardly would be realistic to ask
O to report innitely many numerators and denominators). Note, anyhow, that from the
discussion accompanying the previous example it is immediately seen that at a given x we
are not interested in all a , b ; what in fact we are interested in are the active at x numerator
and denominator, i.e., either (any) b () such that b (x) 0, if such a denominator exists,
or those a (), b () with the largest ratio at the point x, if all the denominators at x are
positive. Assume therefore that we know G in advance and have an access to an oracle
O which, given on input x int G, reports the values and subgradients of the functional
constraints gj at x and, besides this, tells us (at least in the case when all gj are negative at
x) whether all the denominators b (x), , are positive; if it is the case, then the oracle
returns the numerator and the denominator with the largest ratio at x, otherwise returns
the denominator which is nonpositive at x. Looking at the construction of OA given in the
previous example we immediately conclude that in our now situation we again can imitate
a compatible with (A) oracle OA for problem (3.6.33).
In fact the semidenite fractional problem we are discussing possesses interesting ap-
plications; let me introduce one of them which is of extreme importance for modern Control
Theory - the Generalized Eigenvalue problem. The problem is as follows: given two ane
functions, A(x) and B(x), taking values in the space of symmetric m m matrices (ane
3.6. EXERCISES 65
means that the entries of the matrices are ane functions of x), minimize, with respect to
x, the Rayleigh ratio
T A(x)
max
Rm \{0} T B(x)
of the quadratic forms associated with these matrices under the constraints that B(x) is
positive denite (and, possibly, under additional convex constraints on x). In other words,
we are looking for a pair (x, ) satisfying the constraints
B(x) is positive denite , B(x) A(x) is positive semidenite
and additional constraints
gj (x) 0, j = 1, ..., m, x G Rn
(gj are convex and continuous on the solid G) and are interested in the pair of this type with
the smallest possible .
The Generalized Eigenvalue problem (the origin of the name is that in the particular
case when B(x) I is the unit matrix we come to the problem of minimizing, with respect
to x, the largest eigenvalue of A(x)) can be immediately written down as a semidenite
fractional problem
T A(x)
minimize max T
s.t. gj (x) 0, j = 1, ..., m, T B(x) > 0, , x G;
B(x)
(3.6.35)
m
here is the unit sphere in R . Note that the numerators and denominators in our objec-
tive fractions are ane in x, as required by our general assumptions on fractional problems.
Assume that we are given G in advance, same as the data identifying the ane in x
matrix-valued functions A(x) and B(x), and let we have an access to a rst-order oracle
providing us with local information on the general type convex constraints gj . Then it
is not dicult to decide, for a given x, whether B(x) is positive denite, and if it is not
the case, to nd such that the denominator T B(x) is nonpositive at x. Indeed,
it suces to compute B(x) and to subject the matrix to the Cholesky factorization (I hope
you know what it means). If factorization is successful, we nd a lower-triangular matrix Q
with nonzero diagonal such that
B(x) = QQT ,
and B(x) is positive denite; if the factorization fails, then in course of it we automatically
meet a unit vector which proves that B(x) is not a positive semidenite, i.e., is such that
T B(x) 0. Now, if B(x), for a given x, is positive semidenite, then to nd associated
with the largest at x of the fractions
T A(x)
T B(x)
is the same as to nd the eigenvector of the (symmetric) matrix Q1 A(x)(QT )1 associated
with the largest eigenvalue of the matrix, Q being the above Cholesky factor of B(x) (why?);
to nd this eigenvector, this is a standard Linear Algebra routine.
Thus, any technique which allows to solve (f ) under assumption (A) immediately implies
a numerical method for solving the Generalized Eigenvalue problem.
It is worthy to explain what is the control source of Generalized Eigenvalue problems.
Let me start with the well-known issue - stability of a linear dierential equation
z (t) = z(t)
(z Rs ). As you for sure know, the maximal growth of the trajectories of the equation
as t is predetermined by the eigenvalue of with the largest real part, let this part be
; namely, all the trajectories admit, for any > 0, the estimate
|z(t)|2 C exp{( + )t}|z(0)|2 ,
and vice versa: from the fact that all the trajectories admit an estimate
|z(t)|2 C exp{at}|z(0)|2 (3.6.36)
it follows that a .
There are dierent ways to prove the above fundamental Lyapunov Theorem, and one
of the simplest is via quadratic Lyapunov functions. Let us say that a quadratic function
z T Lz (L is a symmetric positive denite s s matrix) proves that the decay rate of the
trajectories is at most a, if for any trajectory of the equation one has
z T (t)Lz (t) az T (t)Lz(t), t 0; (3.6.37)
if it is the case, then of course
d T
ln z (t)Lz(t) 2a
dt
and, consequently,
1/2 1/2
z T (t)Lz(t) exp{at} z T (0)Lz(0) ,
which immediately results in an estimate of the type (3.6.36). Thus, any positive denite
symmetric matrix L which satises, for some a, relation (3.6.37) implies an upper bound
(3.6.36) on the trajectories of the equation, the upper bound involving just this a. Now,
what does it mean that L satises (3.6.37)? Since z (t) = z(t), it means exactly that
1
z T (t)Lz(t) z T (t)(L + T L)z(t) az T (t)Lz(t)
2
for all t and all trajectories of the equation; since z(t) can be an arbitrary vector of Rs , the
latter inequality means that
2aL (T L + L) is positive semidenite. (3.6.38)

3.6. EXERCISES 67
Thus, any pair comprised of a real a and a positive denite symmetric L satisfying (3.6.38)
results in upper bound (3.6.36); the best (with the smallest possible a) bound (3.6.36) which
can be obtained on this way is given by the solution to the problem
minimize a s.t. L is positive denite, 2aL (T L L) is positive denite;
this is nothing but the Generalized Eigenvalue problem with B(L) = 2L, A(L) = T L + L
and no additional constraints on x L. And it can be proved that the best a given by this
construction is nothing but the largest of the real parts of the eigenvalues of , so that in
the case in question the approach based on quadratic Lyapunov functions and Generalized
Eigenvalue problems results in complete description of the equipped of the trajectories as
t .
In fact, of course, what was said is of no literal signicance: what for should we solve
a Generalized Eigenvalue problem in order to nd something which can be found by a direct
computation of the eigenvalues of ? The indicated approach becomes meaningful when
we come from our simple case of a linear dierential equation with constant coecients
to a much more dicult (and more important for practice) case of a dierential inclusion.
Namely, assume that we are given a multivalued mapping z Q(z) Rs and are interested
in bounding the trajectories of the dierential inclusion
z (t) Q(z(t)), t 0; (3.6.39)
such an inclusion may model, e.g., a time-varying dynamic system
z (t) = (t)z(t)
with certain unknown (). Assume that we know nitely many matrices 1 ,...,M such that
Q(z) = Conv {1 z, ..., M z}
(e.g., we know bounds on entries of (t) in the above time-varying system). In order to
obtain an estimate of the type (3.6.36), we again may use a quadratic Lyapunov function
z T Lz: if for all trajectories of the inclusion one has
z(t)Lz (t) az T (t)Lz(t),
or, which is the same, if
z T Lz az T Lz (z Rs , z Q(z)) (3.6.40)
then, same as above,
(z T (t)Lz(t))1/2 exp{at}(z T (0)Lz(0))1/2
for all trajectories.

Now, the left hand side in (3.6.40) is linear in z , and in order for the inequality in
(3.6.40) to be satised for all z Q(z) it is necessary and sucient to have z T Li z

1
2
z T (iT L + Li )z z T Lz for all z and all i, i.e., to ensure positive semideniteness of
the matrices 2aL (iT L + Li ), i = 1, ..., M. By setting
B(L) = Diag(2L, ..., 2L), A(L) = Diag(1T L + L1 , ..., M

T
L + LM )
we convert the problem of nding the best quadratic Lyapunov function (i.e., that one with
the best associated decay rate a) into the Generalized Eigenvalue problem
minimize a s.t. B(L) positive denite, aB(L) A(L) positive semidenite
with no additional constraints on x L.

Note that in the case of dierential inclusions (in contrast to that one of equations
with constant coecients) the best decay rate which can be demonstrated by a quadratic
Lyapunov function is not necessarily the actually best decay rate possessed by the trajectories
of the inclusion; the trajectories may behave themselves better than it can be demonstrated
by a quadratic Lyapunov function. This shortcoming, anyhow, is compensated by the fact
that the indicated scheme is quite tractable computationally; this is why it becomes now
one of the standard tools for stability analysis and synthesis.3
It is worthy to add that the von Neumann problem is a very specic case of a General-
ized Eigenvalue problem (make the entries of Ax and Bx the diagonal entries of diagonal
matrices).
I have indicated a number of important applications of (f ); it is time now to think how

to solve the problem. There is no diculty in applying to the problem the cutting plane
scheme.
The Cutting Plane scheme for problem (f ):

Initialization. Choose a solid G0 which covers Gf .
i-th step, i 1. Given a solid Gi1 (the previous localizer), choose
xi int Gi
and call the oracle OA , xi being the input.

Given the answer ei exi of the oracle,
- call step i productive if the oracle says that xi int Gf , and call the step
non-productive otherwise;
- check whether ei = 0 (due to (A), this may happen only at a productive step);
if it is the case, terminate, xi being the result found by the method. Otherwise
- dene i-th approximate solution to (f ) as the best, in terms of the values
of f , of the search points xj associated with the productive steps j i;
- set
i = {x Gi1 | (x xi )T ei 0};
G
- embed the intermediate localizer G i into a solid Gi and loop.
3
For user information look at the MATLAB LMI Control Toolbox manual.
3.6. EXERCISES 69
The presented scheme denes, of course, a family of methods rather than a single method.
The basic implementation issues, as always, are how to choose xi in the interior of Gi1 and
how to extend G i to Gi ; here one may use the same tactics as in the Center of Gravity or
in the Ellipsoid methods. An additional problem is how to start the process (i.e., how to
choose G0 ); this issue heavily depends on a priori information on the problem, and here we
hardly could do any universal recommendations.
Now, what can be said about the rate of convergence of the method? First of all, we
should say how we measure inaccuracy. A convenient general approach here is as follows.
Let x Gf and let, for a given (0, 1),
Gf = x + (Gf x) = {y = (1 )x + z | z Gf }
be the image of Gf under the similarity transformation which shrinks Gf to x in 1/ times.

Let
fx () = sup f (y).
yGf
The introduced quantity depends on x; let us take the inmum of it over x Gf :
f () = inf fx ().
xGf
By denition, an -solution to f is any point x Gf such that
f (x) f ().
Let us motivate the introduced notion. The actual motivation is, of course, that the notion
works, but let us start with a kind of speculation. Assume for a moment that the problem is
solvable, and let x be an optimal solution to it. One hardly could argue that a point x Gf
which is at the distance of order of of x is a natural candidate on the role of an -solution;
since all points from Gx are at the distance at most Diam(Gf ) from x , all these points
can be regarded as -solutions, in particular, the worst of them (i.e., with the largest value
of f ) point x (). Now, what we actually are interested in are the values of the objective; if
we agree to think of x () as of an -solution, we should agree that any point x Gf with
f (x) f (x ()) also is an -solution. But this latter property is shared by any point which
is an -solution in the sense of the above denition (look, f (x ()) is nothing but fx ()),
and we are done - our denition is justied!
Of course, this is nothing but a speculation. What might, and what might not be called
a good approximate solution, this cannot be decided in advance; the denition should come
from the real-world interpretation of the problem, not from inside the Optimization Theory.
What could happen with our denition in the case of a bad problem, it can be seen from
the following example:
x
minimize , x Gf = [0, 1].
1020 +x
Here in order to nd a solution with the value of the objective better than, say, 1/2 (note
that the optimal value is 0) we should be at the distance of order 1020 of the exact solution
x = 0. For our toy problem it is immediate, of course, to indicate the solution exactly,
but think what happens if the same eect is met in the case of a multidimensional and
nonpolyhedral Gf . We should note, anyhow, that the problems like that one just presented
are intrinsically bad (what is the problem?); in good situations our denition does work:
Exercise 3.6.3 # Prove that

1) if the function f involved into (f ) is convex and continuous on Gf , then any -solution
x to the problem satises the inequality
f (x) min f (max f min f ),

Gf Gf Gf
i.e., solves the problem within relative accuracy ;

2) if the function f involved into (f ) is Lipschitz continuous on Gf with respect to certain
norm | |, Lf being the corresponding Lipschitz constant, and if x is a -solution to (f ), then
f (x) min f Diam(Gf )Lf ,

Gf
where Diam is the diameter of Gf with respect to the norm in question.
Now - the end of the story.
Exercise 3.6.4 # Prove the following statement:

let a cutting plane method be applied to (f ), and let Size() be a size. Assume that for
certain N the method either terminates in course of the rst N steps, or this is not the case,
but Size(GN ) is smaller than Size(Gf ). In the rst of the cases the result produced by the
method is a minimizer of f over Gf , and in the second case the N-th approximate solution
xN is well-dened an is an -solution to (f ) for any
Size(GN )
> .
Size(Gf )
Exercise 3.6.5 # Write a code implementing the Ellipsoid version of the Cutting Plane
scheme for (f ). Use the code to nd the best decay rate for the dierential inclusion
z (t) Q(z) R3 ,
where
Q(z) = Conv {1 z, ..., M z}
and i , i = 1, 2, ..., M = 26 = 64, are the vertices of the polytope

1 p12 p13

P = { p21 1 p23 | |pij | 0.1}.
p31 p32 1
3.6. EXERCISES 71
3.6.2 The method of outer simplex

Looking at the Ellipsoid method, one may ask: why should we restrict ourselves to ellipsoids
and not to use localizers of some other simple shape? This is a reasonable question. In
fact all properties of the ellipsoidal shape which were important in the Ellipsoid version
of the cutting plane scheme were as follows:
(a) all ellipsoids in Rn are ane equivalent to each other, and, in particular, are ane
equivalent to certain standard ellipsoid, the unit Euclidean ball; therefore in order to look
what happens after a cut passing through the center of an ellipsoid, it suces to study the
simplest case when the ellipsoid is the unit Euclidean ball;
(b) the part of the unit Euclidean ball which is cut o the ball by a hyperplane passing
through the center of the ball can be covered by an ellipsoid of volume less than that one of
the ball.
Now, can we point out another simple shape which satises the above requirements? The
natural candidate is, of course, a simplex. All n-dimensional simplices are ane equivalent
to each other, so that we have no problems with (a); the only bottleneck for simplices could
be (b). The below exercises demonstrate that everything is ok with (b) as well.
Let us start with considering the standard simplex (which now plays the role of the unit
Euclidean ball).
+
Exercise 3.6.6 Let n

= {x Rn | x 0, xi 1}
i=1
be the standard n-dimensional simplex in Rn , let
1 1 T
c=( , ..., )
n+1 n+1
be the barycenter ( the center of gravity, see exercise 1.5) of the simplex, let
g = (1 , ..., n )T 0
be a nonzero nonnegative vector such that
g T c = 1,
and let
= {x | g T x g T c = 1}.

Prove that if [0, 1], then the simplex with the vertices v0 = 0,
vi = (1 + i )1 ei , i = 1, ..., n,

(ei are the standard orths of the axes) contains .
Prove that under appropriate choice of one can ensure that
n1
n n 1 1
Voln ( ) (n)Voln (), (n) = 1 + 2 1 1 O(n2 ).
n 1 n+1
(3.6.41)
Exercise 3.6.7 Let D be a simplex in Rn , v0 , ..., vn be the vertices of D,

1
w= (v0 + ... + vn )
n+1
be the barycenter of D and g be a nonzero linear functional.
Prove that the set
D = {x D | (x w)T g 0}
can be covered by a simplex D such that
Voln (D ) n (n)Voln (D),
with (n) given by (3.6.41). Based on this observation, construct a cutting plane method for
convex problems with functional constraints where all localizers are simplices.
What should be the associated size?
What is the complexity of the method?
Hint: without loss of generality we may assume that the linear form g T x attains its minimum
over D at the vertex v0 and that g T (w v0 ) = 1. Choosing v0 as our new origin and
v1 v0 , ..., vn v0 as the orths of our new coordinate axes, we come to the situation studied
in exercise 3.6.6.
Note that the progress in volumes of the subsequent localizers in the method of outer
simplex (i.e., the quantity n (n) = 1 O(n2 )) is worse than that one n (n) = 1 O(n1 ) in
the Ellipsoid method. It does not, anyhow, mean that the former method is for sure worse
than the latter one: in the Ellipsoid method, the actual progress in volumes always equals
to n (n), while in the method of outer simplex the progress depends on what are the cutting
planes; the quantity n (n) is nothing but the worst case bound for the progress, and the
latter, for a given problem, may happen to be more signicant.
Lecture 4
Large-scale optimization problems
Large-scale non-smooth convex problems, complexity bounds, subgradient descent algorithm,

bundle methods
4.1 Goals and motivations

We start now a new topic - complexity and ecient methods of large-scale convex optimiza-
tion. Thus, we come back to problems of the type
(p) minimize f (x) s.t. gi (x) 0, i = 1, ..., m, x G,
where G is a given solid in Rn and f, g1 , ..., gm are convex continuous on G functions. The
family of all consistent problems of the indicated type was denoted by Pm (G), and we are
interested in nding -solution to a problem instance from the family, i.e., a point x G
such that
f (x) min f (max f min f ), gi (x) (max gi )+ , i = 1, ..., m.

G G G G
We have shown that the complexity of the family in question satises the inequalities
O(1)n ln(1/) A() 2.182n ln(1/), (4.1.1)
where O(1) is an appropriate positive absolute constant; what should be stressed that the
upper complexity bound holds true for all (0, 1), while the lower one is valid only for not
too large , namely, for
< (G).
The critical value (G) depends, as we remember, on ane properties of G; for the box it
is 12 , and for any n-dimensional solid G one has
1
(G) .
2n3
73
74 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS
Thus, our complexity bounds identify complexity, up to an absolute constant factor, only
for small enough values of ; there is an initial interval of values of the relative accuracy
(G) = [(G), 1)
where we to the moment have only an upper bound on the complexity and have no lower
bound. Should we be bothered by this incompleteness of our knowledge? I think we should.
Indeed, what is the initial segment, it depends on G; if G is a box, then this segment is
once for ever xed, so that there, basically, is nothing to worry about - one hardly might be
interested in solving optimization problems within relative inaccuracy 1/2, and for smaller
we know the complexity. But if G is a more general set than a box, then there is something
to think about: all we can say about an arbitrary n-dimensional G is that (G) 1/(2n3 );
this lower bound tends to 0 as the dimension n of the problem increases, so that for large n
(G) can in fact almost cover the interval (0, 1) of all possible values of . On the other
hand, when solving large-scale problems of real world origin, we often are not interested in
too high accuracy, and it may happen that the value of we actually are interested in is
exactly in (G), where we do not know what is the complexity and what are the optimal
methods. Thus, we have reasons, both of theoretical and practical origin, to be interested
in the pre-asymptotic behaviour of the complexity.
The diculty in investigating the behaviour of the complexity in the initial range of
values of the accuracy is that it depends on ane properties of the domain G, and this is
something too diuse for quantitative description. This is why it is reasonable to restrict
ourselves with certain standard domains G. We already know what happens when G is a
parallelotope, or, which is the same, a box - in this case there, basically, is no initial segment.
And, of course, the next interesting case is when G is an ellipsoid, or, which is the same,
an Euclidean ball (all our notions are ane invariant, so to speak about Euclidean balls is
the same as to speak about arbitrary ellipsoids). This is the case we shall focus on. In fact
we shall assume G to be something like a ball rather than a ball exactly. Namely, let us
x a real 1 and assume that the asphericity of G is at most , i.e., there is a pair of
concentric Euclidean balls Vin and Vout with the ratio of radii not exceeding and such that
the smaller ball is inside G, and the larger one contains G:
Vin G Vout .
4.2 The main result

Our main goal is to establish the following
Theorem 4.2.1 The complexity of the family Pm (G) of convex problems with m functional
constraints on an n-dimensional domain G of asphericity satises the bounds
1 2
min{n, 0} A() 4 + 1, 0 < < 1; (4.2.2)
(2)2 2
here a 0 is the largest integer which is smaller than a real a.
4.2. THE MAIN RESULT 75
Comment. Before proving the theorem, let us think what the theorem says. First, it says
that the complexity of convex minimization on a domain similar to an Euclidean ball is
bounded from above uniformly in the dimension by a function O(1)22 ; the asphericity
is responsible for the level of similarity between the domain and a ball. Second, we see that
in the large-scale case, when the dimension of the domain is large enough for given and
, or, which is the same, when the inaccuracy is large enough for a given dimension (and
asphericity), namely, when
1
, (4.2.3)
2 n
then the complexity admits a lower bound O(1)22 which diers from the aforementioned
upper bound by factor O(1)4 which depends on asphericity only. Thus, in the large-
scale case (4.2.3) our upper complexity bound coincides with the complexity up to a factor
depending on asphericity only; if G is an Euclidean ball ( = 1), then this factor does not
exceed 16.
Now, our new complexity results combined with the initial results related to the case of
small inaccuracies gives us basically complete description of the complexity in the case when
G is an Euclidean ball. The graph of complexity in this case is as follows:
n ln n
1
1/2
1 2n n 1-

Figure 4.
Complexity of convex minimization over an n-dimensional Euclidean ball: the whole
range [1, ) of values of 1/ can be partitioned into three segments:
the initial segment [1, 2 n]; within this segment the complexity, up to an absolute con-
stant factor, is 2 ; at the right endpoint of the segment the complexity is equal to n; in this
initial segment the complexity is independent on the dimension and is in fact dened by the
ane geometry of G;
1
the nal segment [ (G) , ) = [2n, ); here the complexity, up to an absolute constant
factor, is n ln(1/), this is the standard asymptotics known to us; in this nal segment the
complexity forgets everything about the geometry of G;

the intermediate segment [2 n, 2n]; at the left endpoint of this segment the complexity
is O(n), at the right endpoint it is O(n ln n); within this segment we know complexity up to
a factor of order of its logarithm rather than up to an absolute constant factor.
Now let us prove the theorem.
4.3 Upper complexity bound: Subgradient Descent

The upper complexity bound is associated with one of the most traditional methods for
nonsmooth convex minimization - the short-step Subgradient Descent method. The generic
scheme of the method is as follows:
given a starting point x1 int G and the required relative accuracy (0, 1), form the
sequence
xi+1 = xi i Rei , (4.3.4)
where i > 0 is certain stepsize, R is the radius of the Euclidean ball Vout covering G and
the unit vector ei is dened as follows:
(i) if xi int G, then ei is an arbitrary unit vector which separates xi and G:
(x xi )T ei 0, x G;
(ii) if xi int G, but there is a constraint gj which is -violated at xi , i.e., is such that

gj (xi ) > max{gj (xi ) + (x xi )T gj (xi )} , (4.3.5)
xG +
then
1
ei =
gj (xi );
|gj (xi )|
(iii) if xi int G and no constraint is -violated at xi , i.e., no inequality (4.3.5) is satised,
then
1
ei = f (xi ).
|f (xi )|
Note that the last formula makes sense only if f (xi ) = 0; if in the case of (iii) we meet with
f (xi ) = 0, then we simply terminate and claim that xi is the result of our activity.
Same as in the cutting plane scheme, let us say that search point xi is productive, if at i-th
step we meet the case (iii), and non-productive otherwise, and let us dene i-th approximate
solution xi as the best (with the smallest value of the objective) of the productive search
points generated in course of the rst i iterations (if no productive search point is generated,
xi is undened).
The eciency of the method is given by the following.
Proposition 4.3.1 Let a problem (p) from the family Pm (G) be solved by the short-step
Subgradient Descent method associated with accuracy , and let N be a positive integer such
that
2 + 12 N 2
j=1 j
N < . (4.3.6)
j=1 j
4.3. UPPER COMPLEXITY BOUND: SUBGRADIENT DESCENT 77
Then either the method terminates in course of N steps with the result being an -solution
to (p), or xN is well-dened and is an -solution to (p).
In particular, if
i /,
then (4.3.6) is satised by
2
N N() = 4 + 1,
2
and with the indicated choice of the stepsizes we can terminate the method after the N-th
step; the resulting method solves any problem from the class within relative accuracy with
the complexity N(), which is exactly the upper complexity bound stated in Theorem 4.2.1.
Proof. Let me make the following crucial observation: let us associate with the method the
localizers
Gi = {x G | (x xj )T ej 0, 1 j i}. (4.3.7)
Then the presented method ts our generic cutting plane scheme for problems with functional
constraints, up to the fact that Gi now should not necessarily be solids (they may possess
empty interior or even be themselves empty) and xi should not necessarily be an interior point
of Gi1 . But all these particularities were not used in the proof of the general proposition on
the rate of convergence of the scheme (Proposition 3.3.1), and in fact there we have proved
the following:
Proposition 4.3.2 Assume that we are generating a sequence of search points xi Rn and
associate with these points vectors ei and approximate solutions xi in accordance to (i)-(iii).
Let the sets Gi be dened by the pairs (xi , ei ) according to (4.3.7), and let Size be a size.
Assume that in course of N steps we either terminate due to vanishing of the subgradient of
the objective at a productive search point, or this is not the case, but
Size(GN ) < Size(G)
(if GN is not a solid, then, by denition, Size(GN ) = 0). In the rst case the result formed
at the termination is an -solution to the problem; in the second case such a solution is xN
(which is for sure well-dened).
Now let us apply this latter proposition to our short-step Subgradient Descent method and
to the size
InnerRad(Q) = max{r | Q contains an Euclidean ball of radius r}.
We know in advance that G contains an Euclidean ball Vin of the radius R/, so that
InnerRad(G) R/. (4.3.8)
Now let us estimate from above the size of i-th localizer Gi , provided that the localizer is
well-dened (i.e., that the method did not terminate in course of the rst i steps due to
vanishing the subgradient of the objective at a productive search point). Assume that Gi
contains an Euclidean ball of certain radius r > 0, and let x+ be the center of the ball. Since
V is contained in Gi , we have
(x xj )T ej 0, x V, 1 j i,
whence
(x+ xj )T ej + hT ej 0, |h| r, 1 j i,
and since ej is a unit vector, we come to
(x+ xj )T ej r, 1 j i. (4.3.9)
Now let us write down the cosine theorem:
|xj+1 x+ |2 = |xj x+ |2 + 2(xj x+ )T (xj+1 xj ) + |xj+1 xj |2 =
= |xj x+ |2 + 2j R(x+ xj )T ej + j2 R2 |xj x+ |2 2j Rr + j2 R2 .

We see that the squared distance from xj to x+ is decreased with j at least by the quantity
2j R j2 R2 at each step; since the squared distance cannot become negative, we come to
i
i

R(2r j j2 R) |x1 x+ |2 4R2
j=1 j=1
(we have used the fact that G is contained in the Euclidean ball Vout of the radius R). Thus,
we come to the estimate
2 + 12 ij=1 j2
r i R
j=1 j
This bound acts for the radius r of an arbitrary Euclidean ball contained in Gi , and we come
to
2 + 12 ij=1 j2
InnerRad(Gi ) i R. (4.3.10)
j=1 j
Combining this inequality with (4.3.8), we come to

InnerRad(Gi ) 2 + 12 ij=1 j2
i , (4.3.11)
InnerRad(G) j=1 j
and due to the denition of N, we come to
InnerRad(GN )
< ,
InnerRad(G)
Thus, the conclusion of the Theorem follows from Proposition 4.3.2.

4.4. THE LOWER BOUND 79
4.4 The lower bound

The lower bound in (4.2.2) is given by a simple reasoning which is in fact already known to us.
Due to similarity reasons, we without loss of generality may assume that G is contained in the
Euclidean ball of the radius R = 12 centered at the origin and contains the ball of the radius r = 2
1
with the same center. It, of course, suces to establish the lower bound for the case of problems
without functional constraints. Besides this, due to monotonicity of the complexity in , it suces
to prove that if (0, 1) is such that
1
M = 0 n,
(2)2
then the complexity A() is at least M . Assume that this is not the case, so that there exists a
method M which solves all problems from the family in question in no more than M 1 step. We
may assume that M solves any problem exactly in M steps, and the result always is the last search
point. Let us set
1
= ,
2 M
so that > 0 by denition of M . Now consider the family F0 comprised of functions
f (x) = max (i xi + di )
1iM
where i = 1 and 0 < di < . Note that these functions are well-dened, since M n and
therefore we have enough coordinates in x.
Now consider the following M -step construction.
The rst step:
let x1 be the rst search point generated by M; this point is instance-independent. Let i1 be the
index of the largest in absolute value of the coordinates of x1 , i1 be the sign of the coordinate and
let di1 = /2. Let F1 be comprised of all functions from F with i1 = i1 , di1 = di1 and di /4
for all i = i1 . It is clear that all the functions of the family F1 possess the same local behavior at
x1 and are positive at this point.
The second step:
let x2 be the second search point generated by M as applied to a problem from the family F1 ; this
point does not depend on the representative of the family, since all these representatives have the
same local behavior at the rst search point x1 . Let i2 be the index of the largest in absolute value
of the coordinates of x2 with indices dierent from i1 , let i2 be the sign of the coordinate, and let
di2 = /4. Let F2 be comprised of all functions from F1 such that i2 = i2 , di2 = di2 and di /8
for all i dierent from i1 and i2 . Note that all functions from the family coincide with each other
in a neighborhood of the two-point set {x1 , x2 } and are positive at this set.
Now it is clear how to proceed. After k steps of the construction we have a family Fk comprised
of all functions from F with the parameters i and di being set to certain xed values for k values
i1 , ..., ik of the index i and all di for the remaining i being 2(k+1) ; the family satises the
following predicate
Pk : the rst k points x1 , ..., xk of the trajectory of M as applied to any function from the family
do not depend on the function, and all the functions from the family coincide with each other in
certain neighborhood of the k-point set {x1 , ..., xk } and are positive at this set.
From Pk it follows that the (k + 1)-th search point xk+1 generated by M as applied to a function
from the family Fk is independent of the function. At the step k + 1 we
nd the index ik+1 of the largest in absolute value of the coordinates of xk+1 with indices
dierent from i1 , ..., ik ,
dene ik+1 as the sign of the coordinate,
set dik+1 = 2(k+1) ,
and
dene Fk+1 as the set of those functions from Fk for which ik+1 = ik+1 , dik+1 = dik+1 and
di 2(k+2) for i dierent from i1 , ..., ik+1 .
It is immediately seen that the resulting family satises the predicate Pk+1 , and we may proceed
in the same manner.
Now let us look what will be found after M step of the construction. We will end with a family
FM which consists of exactly one function
f = max (i xi + di )
1iM
such that f is positive along the sequence x1 , ..., xM of search points generated by M as applied
to the function. On the other hand, G contains the ball of the radius r = 1/(2) centered at the
origin, and, consequently, contains the point
M
i
x = ei ,
i=1 2 M
ei being the basic orths in Rn . We clearly have
1
f min f (x) f (x ) < +
xG 2 M
(the concluding inequality follows from the denition of ). On the other hand, f clearly is Lipschitz
continuous with constant 1 on G, and G is contained in the Euclidean ball of the radius 1/2, so
that the variation (maxG f minG f ) of f over G is 1. Thus, we have
f (xM ) f > 0 () = (max f min f );

G G
since, by construction, xM is the result obtained by M as applied to f , we conclude that M does

not solve the problem f within relative accuracy , which is the desired contradiction with the
origin of M .
4.5 Subgradient Descent for Lipschitz-continuous con-

vex problems
For the sake of simplicity we restrict ourselves to problems without functional constraints:
(f ) minimize f (x) s.t. x G Rn .

4.5. SUBGRADIENT DESCENT FOR LIPSCHITZ-CONTINUOUS CONVEX PROBLEMS81
From now on we assume that G is a closed and bounded convex subset in Rn , possibly, with
empty interior, and that the objective is convex and Lipschitz continuous on G:
|f (x) f (y)| L(f )|x y|, x, y G,
where L(f ) < and | | is the usual Euclidean norm in Rn . Note that the subgradient
set of f at any point from G is nonempty and contains subgradients of norms not exceeding
L(f ); from now on we assume that the oracle in question reports such a subgradient at any
input point x G.
We would like to solve the problem within absolute inaccuracy , i.e., to nd x G
such that
f (x) f f (x) min f .
G
The simplest way to solve the problem is to apply the standard Subgradient Descent method
which generates the sequence of search points {xi }
i=1 according to the rule
xi+1 = G (xi i g(xi )), g(x) = f (x)/|f (x)|, (4.5.12)
where x1 G is certain starting point, i > 0 are positive stepsizes and
G (x) = argmin{|x y| | y G}
is the standard projector onto G. Of course, if we meet a point with f (x) = 0, we terminate
with optimal solution at hands; from now on I ignore this trivial case.
As always, i-th approximate solution xi found by the method is the best - with the
smallest value of f - of the search points x1 , ..., xi ; note that all these points belong to G.
It is easy to investigate the rage of convergence of the aforementioned routine. To this
end let x be the closest to x1 optimal solution to the problem, and let
di = |xi x |.
We are going to see how di vary. To this end let us start with the following simple and
important observation (cf Exercise 4.7.3):
Lemma 4.5.1 Let x Rn , and let G be a closed convex subset in Rn . Under projection
onto G, x becomes closer to any point u of G, namely, the squared distance from x to u
decreases at least by the squared distance from x to G:
|G (x) u|2 |x u|2 |x G (x)|2 . (4.5.13)
From Lemma 4.5.1 it follows that
d2i+1 |xi+1 x |2 = |G (xi i g(xi )) x |2 |xi g(xi ) x |2 =
= |xi x |2 2i (xi x )T f (xi )/|f (xi )| + i2

d2i 2i (f (xi ) f )/|f (xi )| + i2
(the concluding inequality is due to the convexity of f ). Thus, we come to the recurrence
d2i+1 d2i 2i (f (xi ) f )/|f (xi )| + i2 ; (4.5.14)
in view of the evident inequality
f (xi ) f f (
xi ) f i ,
and since |f (x)| L(f ), the recurrence implies that

i
d2i+1 d2i 2i + i2 . (4.5.15)
L(f )
The latter inequality allows to make several immediate conclusions.
1) From (4.5.15) it follows that
N
N

2 i i L(f ) d21 + i2 ;
i=1 i=1
since i clearly do not increase with i, we come to

N
L(f ) |x1 x |2 + 2
i=1 i
N N N . (4.5.16)
2 i=1 i
The right hand side in this inequality clearly tends to 0 as N , provided that

i = , i 0, i
i=1
(why?), which gives us certain general statement on convergence of the method as applied to
a Lipschitz continuous convex function; note that we did not use the fact that G is bounded.
Of course, we would like to choose the stepsizes resulting in the best possible estimate
(4.5.16). Note that our basic recurrence (4.5.14) implies that for any N M 1 one has
N
N
N

2N i L(f ) d2M + i2 L(f ) D + 2
i2 .
i=M i=M i=M
Whence
L(f ) D 2 + N 2
i=M i
N M ;
2 i=N i
with D being an a priori upper bound on the diameter of G; M = N/2 and
i = Di1/2 (4.5.17)
the right hand side in the latter inequality does not exceed O(1)DN 1/2 . This way we come
to the optimal, up to an absolute constant factor, estimate
L(f )D
N O(1) , N = 1, 2, ... (4.5.18)
N
4.5. SUBGRADIENT DESCENT FOR LIPSCHITZ-CONTINUOUS CONVEX PROBLEMS83
(O(1) is an easily computable absolute constant). I call this rate optimal, since the lower
complexity bound of Section 4.4 says that if G is an Euclidean ball of diameter D in Rn
and L is a given constant, then the complexity at which one can minimize over G, within
absolute accuracy , an arbitrary Lipschitz continuous with constant L convex function f is
at least
LD 2
min n; O(1) ,

so that in the large-scale case, when
2
LD
n ,

the lower complexity bound coincides, within absolute constant factor, with the upper bound
given by (4.5.18).
Thus, we can choose the stepsizes i according to (4.5.17) and obtain dimension-independent
rate of convergence (4.5.18); this rate of convergence does not admit signicant uniform
in the dimension improvement, provided that G is an Euclidean ball.
2) The stepsizes (4.5.17) are theoretically optimal and more or less reasonable from the
practical viewpoint, provided that you deal with a domain G of reasonable diameter, i.e.,
the diameter of the same order of magnitude as the distance from the starting point to the
optimal set. If the latter assumption is not satised (as it often is the case), the stepsizes
should be chosen more carefully. A reasonable idea here is as follows. Our rate-of-convergence
proof in fact was based on a very simple relation
d2i+1 d2i 2i (f (xi ) f )/|f (xi )| + i2 ;
let us choose as i the quantity which results in the strongest possible inequality of this type,
namely, that one which minimizes the right hand side:
f (xi ) f
i = . (4.5.19)
|f (xi )|
Of course, this choice is possible only when we know the optimal value f . Sometimes this
is not a problem, e.g., when we reduce a system of convex inequalities
fi (x) 0, i = 1, ..., m,
to the minimization of
f (x) = max fi (x);
i
here we can take f = 0. In more complicated cases people use some on-line estimates of f ;

I would not like to go in details, so that I assume that f is known in advance. With the
stepsizes (4.5.19) (proposed many years ago by B.T. Polyak) our recurrence becomes
d2i+1 d2i (f (xi ) f )2 |f (xi )|2 d2i 2i L2 (f ),

N
whence i=1 2i L2 (f )d21, and we immediately come to the estimate
N L(f )|x1 x |N 1/2 . (4.5.20)
This estimate seems to be the best one, since it involves the actual distance |x1 x | to the
optimal set rather than the diameter of G; in fact G might be even unbounded. Typically,
whenever one can use the Polyak stepsizes, this is the best possible tactics for the Subgradient
Descent method.
We can now present a small summary: we see that the Subgradient Descent, which
we were exploiting in order to obtain an optimal method for large scale convex minimization
over Euclidean ball, can be applied to minimization of a convex Lipschitz continuous function
over an arbitrary n-dimensional closed convex domain G; if G is bounded, then, under
appropriate choice of stepsizes, one can ensure the inequalities
N min f (xi ) f O(1)L(f )D(G)N 1/2 , (4.5.21)

1iN
where O(1) is a moderate absolute constant, L(f ) is the Lipschitz constant of f and D(G) is
the diameter of G. If the optimal value of the problem is known, then one can use stepsizes
which allow to replace D(G) by the distance |x1 x | from the starting point to the optimal
set; in this latter case, G should not necessarily be bounded. And the rate of convergence is
optimal, I mean, it cannot be improved by more than an absolute constant factor, provided
that G is an n-dimensional Euclidean ball and n > N.
Note also that if G is a simple set, say, an Euclidean ball, or a box, or the standard
simplex
n

{x Rn+ | xi = 1},
i=1
then the method is computationally very cheap - a step costs only O(n) operations in addition
to those spent by the oracle. Theoretically all it looks perfect. It is not a problem to speak
about an upper accuracy bound O(N 1/2 ) and about optimality of this bound in the large
scale case; but in practice such a rate of convergence would result in thousands of steps,
which is too much for the majority of applications. Note that in practice we are interested
in typical complexity of a method rather than in its worst case complexity and worst case
optimality. And from this practical viewpoint the Subgradient Descent is far from being
optimal: there are other methods with the same worst case theoretical complexity bound,
but with signicantly better typical performance; needless to say that these methods are
more preferable in actual computations. What we are about to do is to look at a certain
family of methods of this latter type.
4.6 Bundle methods

Common sense says that the weak point in the Subgradient Descent is that when running
the method, we almost loose previous information on the objective; the whole prehistory
is compressed to the current iterate (this was not the case with the cutting plane methods,
4.6. BUNDLE METHODS 85
where the prehistory was memorized in the current localizer). Generally speaking, what we
actually know about the objective after we have formed a sequence of search points xj G,
j = 1, ..., i? All we know is the bundle - the sequence of ane forms
f (xj ) + (x xj )T f (xj )
reported by the oracle; we know that every form from the sequence underestimates the
objective and coincides with it at the corresponding search point. All these ane forms can
be assembled into a single piecewise linear convex function - i-th model of the objective
fi (x) = max {f (xj ) + (x xj )T f (xj )}.

1ji
This model underestimates the objective:
fi (x) f (x), x G, (4.6.22)
and is exact at the points x1 , ..., xi :
fi (xj ) = f (xj ), j = 1, ..., i. (4.6.23)
And once again - the model accumulates all our knowledge obtained so far; e.g., the infor-
mation we possess does not contradict the hypothesis that the model is exact everywhere.
Since the model accumulates the whole prehistory, it is reasonable to formulate the search
rules for a method in terms of the model. The most natural and optimistic idea is to trust
in the model completely and to take, as the next search point, the minimizer of the model:
xi+1 Argmin fi .
G
This is the Kelley cutting plane method - the very rst method proposed for nonsmooth
convex optimization. The idea is very simple - if we are lucky and the model is good
everywhere, not only along the previous search points, we would improve signicantly the
best value of the objective found so far. On the other hand, if the model is bad, then it will be
corrected at the right place. From compactness of G one can immediately derive that the
method does converge and is even nite if the objective is piecewise linear. Unfortunately, it
turns out that the rate of convergence of the method is a disaster; one can demonstrate that
the worst-case number of steps required by the Kelley method to solve a problem f within
absolute inaccuracy (G is the unit n-dimensional ball, L(f ) = 1) is at least
(n1)/2
1
O(1) .

We see how dangerous is to be too optimistic, and it is clear why: even in the case of smooth
objective the model is close to the objective only in a neighborhood of the search points; until
the number of these points becomes very-very large, this neighborhood covers a negligible
part of the domain G, so that the global characteristic of the model - its minimizer - is very
unstable and until the termination phase has small in common with the actual optimal set.
It should be noted that the Kelley method in practice is much better than one could think
looking at its worst-case complexity (a method with practical complexity like this estimate
simply could not be used even in the dimension 10), but the qualitative conclusions from the
estimate are more or less valid also in practice - the Kelley method sometimes is too slow.
A natural way to improve the Kelley method is as follows. We can only hope that the
model approximates the objective in a neighborhood of the search points. Therefore it is
reasonable to enforce the next search point to be not too far from the previous ones, more
exactly, from the most perspective, the best of them, since the latter, as the method goes
on, hopefully will become close to the optimal set. To forbid the new iterate to move far
away, let us choose xi+1 as the minimizer of the penalized model:
di
xi+1 = argmin{fi (x) + |x x+ 2
i | },
G 2
where x+ i is what is called the current prox center, and the prox coecient di > 0 is certain
parameter. When di is large, we enforce xi+1 to be close to the prox center, and when it
is small, we act almost as in the Kelley method. What is displayed, is the generic form of
the bundle methods; to specify a method from this family, one need to indicate the policies
of updating the prox centers and the prox coecients. There is a number of reasonable
policies of this type, and among these policies there are those resulting in methods with
very good practical performance. I would not like to go in details here; let me say only
that, rst, the best theoretical complexity estimate for the traditional bundle methods is
something like O(3); although non-optimal, this upper bound is incomparably better than
the lower complexity bound for the method of Kelley. Second, there is more or less unique
reasonable policy of updating the prox center, in contrast to the policy for updating the
prox coecient. Practical performance of a bundle algorithm heavily depends on this latter
policy, and sensitivity to the prox coecient is, in a sense, the weak point of the bundle
methods. Indeed, even without addressing to computational experience we can guess in
advance that the scheme should be sensitive to di - since in the limiting cases of zero and
innite prox coecient we get, respectively, the Kelley method, which can be slow, and the
method which simply does not move from the initial point. Thus, both small and large
prox coecients are forbidden; and it is unclear how to choose the golden middle - our
information has nothing in common with any quadratic terms in the model, these terms are
invented by us.
4.6.1 The Level method

We present now the Level method, a rather recent method from the bundle family due
to Lemarechal, Nemirovski and Nesterov (1991). This method, in a sense, is free from
the aforementioned shortcomings of the traditional bundle scheme. Namely, the method
possesses the optimal complexity bound O(2), and the diculty with tuning the prox
coecient in the method is resolved in a funny way - this problem does not occur at all.
To describe the method, we introduce several simple quantities. Given i-th model fi (),
we can compute its optimum, same as in the Kelley method; but now we are interested not
in the point where the optimum is attained, but in the optimal value
fi = min fi
G
of the model. Since the model underestimates the objective, the quantity fi is a lower
bound for the actual optimal value; and since the models clearly increase with i at every
point, their minima also increase, so that
f1 f2 ... f . (4.6.24)
On the other hand, let fi+ be the best found so far value of the objective:
fi+ = min f (xj ) = f (

xi ), (4.6.25)
1ji
where xi is the best (with the smallest value of the objective) of the search point generated
so far. The quantities fi+ clearly decrease with i and overestimate the actual optimal value:
f1+ f2+ ... f . (4.6.26)
It follows that the gaps

i = fi+ fi
are nonnegative and nonincreasing and bound from above the inaccuracy of the best found
so far approximate solutions:
xi ) f i , 1 2 ... 0.
f ( (4.6.27)
Now let me describe the method. Its i-th step is as follows:

1) solve the piecewise linear problem
minimize fi (x) s.t. x G
to get the minimum fi of the i-th model;

2) form the level
li = (1 )fi + fi+ fi + i ,
(0, 1) being the parameter of the method (normally, = 1/2), and dene the new iterate
xi+1 as the projection of the previous one xi onto the level set
Qi = {x G | fi (x) li }
of the i-th model, the level set being associated with li :
xi+1 = Qi (xi ). (4.6.28)

Computationally, the method requires solving two auxiliary problems at each iteration.
The rst is to minimize the model in order to compute fi ; this problem arises in the Kelley
method and does not arise in the bundle ones. The second auxiliary problem is to project
xi onto Qi ; this is, basically, the same quadratic problem which arises in bundle methods
and does not arise in the Kelley one. If G is a polytope, which normally is the case, the
rst of these auxiliary problems is a linear program, and the second is a convex linearly
constrained quadratic program; to solve them, one can use the traditional ecient simplex-
type technique.
Let me note that the method actually belongs to the bundle family, and that for this
method the prox center always is the last iterate. To see this, let us look at the solution
d
x(d) = argmin{fi (x) + |x xi |2 }
xG 2
of the auxiliary problem arising in the bundle scheme as at a function of the prox coecient
d. It is clear that x(d) is the closest to xi point in the set {x G | fi (x) fi (x(d))}, so
that x(d) is the projection of xi onto the level set
{x G | fi (x) li (d)}
of the i-th model associated with the level li (d) = fi (x(d)) (this latter relation gives us
certain equation which relates d and li (d)). As d varies from 0 to , x(d) moves along
certain path which starts at the closest to xi point in the optimal set of the i-th model and
ends at the prox center xi ; consequently, the level li (d) varies from fi to f (xi ) fi+ and
therefore, for certain value di of the prox coecient, we have li (di ) = li and, consequently,
x(di ) = xi+1 . Note that the only goal of this reasoning was to demonstrate that the Level
method does belong to the bundle scheme and corresponds to certain implicit control of the
prox coecient; this control exists, but is completely uninteresting for us, since the method
does not require knowledge of di .
Now let me formulate and prove the main result on the method.
Theorem 4.6.1 Let the Level method be applied to a convex problem (f ) with Lipschitz
continuous, with constant L(f ), objective f and with a closed and bounded convex domain
G of diameter D(G). Then the gaps i converge to 0; namely, for any positive one has
2
L(f )D(G)
i > c() i , (4.6.29)

where
1
c() = .
(1 )2 (2 )
In particular, 2
L(f )D(G)
i > c() xi ) f .
f (

Proof. The in particular part is an immediate consequence of (4.6.29) and (4.6.27), so

that all we need is to verify (4.6.29). To this end assume that N is such that N > , and
let us bound N from above.
10 . Let us partition the set I = {1, 2, ..., N} of iteration indices in groups I1 , ..., Ik as
follows. The rst group ends with the index i(1) N and contains all indices i i(1) such
that
i (1 )1 N (1 )1 i(1) ;
since, as we know, the gaps never increase, I1 is certain nal segment of I. If it diers from
the whole I, we dene i(2) as the largest of those i I which are not in I1 , and dene I2 as
the set of all indices i i(2) such that
i (1 )1 i(2) .
I2 is certain preceding I1 segment in I. If the union of I1 and I2 is less than I, we dene

i(3) as the largest of those indices in I which are not in I1 I2 , and dene I3 as the set of
those indices i i(3) for which
i (1 )1 i(3) ,
and so on.
With this process, we partition the set I of iteration indices into sequential segments
I1 ,..,Ik (Is follows Is+1 in I). The last index in Is is i(s), and we have
i(s+1) > (1 )1 i(s) , s = 1, ..., k 1 (4.6.30)
(indeed, if the opposite inequality would hold, then i(s + 1) would be included into the group
Is , which is not the case).
20 . The main (and very simple) observation is as follows:
the level sets Qi of the models corresponding to certain group of iterations Is have a
point in common, namely, the minimizer, us , of the last, i(s)-th, model from the group.
Indeed, since the models increase with i, and the best found so far values of the objective
decrease with i, for all i Is one has
+
fi (us ) fi(s) (us ) = fi(s) = fi(s) i(s) fi+ i(s) fi+ (1 )i li
(the concluding in the chain follows from the fact that i Is , so that i (1 )1 i(s) ).
30 . The above observation allows to estimate from above the number Ns of iterations in
the group Is . Indeed, since xi+1 is the projection of xi onto Qi and us Qi for i Is , we
conclude from Lemma 4.5.1 that
|xi+1 us |2 |xi us |2 |xi xi+1 |2 , i Is ,
whence, denoting by j(s) the rst element in Is ,

|xi xi+1 |2 |xj(s) us |2 D 2 (G). (4.6.31)
iIs
Now let us estimate from below the steplengths |xi xi+1 |. At the point xi the i-th model fi
equals to f (xi ) and is therefore fi+ , and at the point xi+1 the i-th model is, by construction
of xi+1 , less or equal (in fact is equal) to li = fi+ (1 )i . Thus, when passing from
xi to xi+1 , the i-th model varies at least by the quantity (1 )i , which is, in turn, at
least (1 )i(s) (the gaps may decrease only!). On the other hand, fi clearly is Lipschitz
continuous with the same constant L(f ) as the objective (recall that, according to our
assumption, the oracle reports subgradients of f of the norms not exceeding L(f )). Thus,
at the segment [xi , xi+1 ] the Lipschitz continuous with constant L(f ) function fi varies at
least by (1 )i(s) , whence
|xi xi+1 | (1 )i(s) L1 (f ).
From this inequality and (4.6.31) we conclude that the number Ns of iterations in the group
Is satises the estimate
Ns (1 )2 L2 (f )D 2 (G)2
i(s) .
40 . We have i(1) > (the origin of N) and i(s) > (1 )s+1 i(1) (see (4.6.30)), so
that the above estimate of Ns results in
Ns (1 )2 L2 (f )D 2 (G)(1 )2(s1) 2 ,
whence
N= Ns c()L2 (f )D 2 (G)2 ,
s
as claimed.
4.6.2 Concluding remarks

The theorem we have proved says that the level method is optimal in complexity, provided
that G is an Euclidean ball of a large enough dimension. In fact there exists computational
evidence, based on many numerical tests, that the method is also optimal in complexity in a
xed dimension. Namely, as applied to a problem of minimization of a convex function over
an n-dimensional solid G, the method nds an approximate solution of absolute accuracy
in no more than
V (f )
c(f ) n ln

iterations, where V (f ) = maxG f minG f is the variation of the objective over G and c(f ) is
certain problem-dependent constant which never is greater than 1 and typically is something
around 0.2 - 0.5. We stress that this is an experimental fact, not a theorem. Strong doubts
exist that a theorem of this type can be proved, but empirically this law is supported by
hundreds of tests.
To illustrate this point, let me present numerical results related to one of the standard test
problems called MAXQUAD. This is a small, although dicult, problem of maximizing the
maximum of 5 convex quadratic forms of 10 variables. In the below table you see the results
obtained on this problem by the Subgradient Descent and by the Level methods. In the
Subgradient Descent the Polyak stepsizes were used (to this end, the method was equipped
with the exact optimal value, so that the experiment was in favor of the Subgradient Descent).
The results are as follows:
Subgradient Descent: 100,000 steps, best found value -0.8413414 (absolute inaccuracy
0.0007), running time 54 ;
Level: 103 steps, best found value -0.8414077 (absolute inaccuracy < 0.0000001), running
time 2 , complexity index c(f ) = 0.47.
Runs:
Subgradient Descent Level
i fi+ i fi+
1 5337.066429 1 5337.066429
2 98.595071
6 98.586465
8 6.295622 16 7.278115
31 0.198568
39 0.674044
41 0.221810
54 0.811759
73 0.841058
81 0.841232
103 0.841408
201 0.801369

4001 0.839771
5001 0.840100
17001 0.841021
25001 0.841144
50001 0.841276
75001 0.841319
100000 0.841341
marks the result obtained by the Subgradient Descent after 2 (the total CPU time of the
Level method); to that moment the Subgradient Descent has performed 4,000 iterations, but
has restored the solution within 2 accuracy digits rather then 6 digits given by Level.
The Subgradient Descent with the default stepsizes i = O(1)i1/2 in the same 100,000
iterations was unable to achieve value of the objective less than -0.837594, i.e., found the
solution within a single accuracy digit.
4.7 Exercises
Exercise 4.7.1 Implement the level method. Try to test it on the MAXQUAD test problem1)
which is as follows:
min f (x), X = {x R10 | |xi | 1},
xX
where
f (x) = max xT A(i) x + xT b(i)
i=1,...,5
with
(i) (i)
Akj = Ajk = ej/k cos(jk) sin(i), j < k;
(i) j (i)
Ajj = | sin(i)| + |Ajk |,
10 k=j
(i)
bj = ej/i sin(ij).
Take the initial point x0 = 0.
When implementing the level method you will need a Quadratic Programming solver. For
SCILAB implementation you can use QUAPRO internal solver, when working with MAT-
LAB, you can use the conic solver from SDPT3 (http://www.math.nus.edu.sg/ mattohkc/sdpt3.html).
4.7.1 Around Subgradient Descent

The short-step version of the Subgradient Descent presented above (I hope you are familiar
with the lecture) is quite appropriate for proving the upper complexity bound; as a com-
putational scheme, it is not too attractive. The most unpleasant property of the scheme is
that it actually is a short-step procedure: one should from the very beginning tune it to the
desired accuracy , and the stepsizes i = / associated with the upper complexity bound
stated in Theorem 4.2.1 should be of order of . For the sake of simplicity, let us assume
for a moment that G is an Euclidean ball of radius R, so that = 1. With the choice of
stepsizes i = / = , the method will for sure be very slow, since to pass from the starting
point x1 to a reasonable neighborhood of the optimal solution it normally requires to cover
a distance of order of the diameter of the ball G, i.e., of order of R, and it will take at
least M = O(1/) steps (since a single step moves the point by i R = R) even in the ideal
case when all directions ei look directly to the optimal solution. The indicated observation
does not contradict the theoretical optimality of the method in the large-scale case, where
the worst-case complexity, as we know, is at least is O(2), which is much worse that the
above M, namely, something like M 2 . Note, anyhow, that right now we were comparing the
best possible worst-case complexity of a method, the complexity of the family, and the best
possible ideal-case complexity of the short-step Subgradient Descent. And there is nothing
good in the fact that even in the ideal case the method is slow.
1 )
this problem is probably due to C. Lemarechal and R. Miin
4.7. EXERCISES 93
There are, anyhow, more or less evident possibilities to make the method computationally
more reasonable. The idea is not to tune the method to the prescribed accuracy in advance,
thus making the stepsizes small from the very beginning, but to start with large stepsizes
and then decrease them at a reasonable rate. To implement the idea, we need an auxiliary
tool (which is important an interesting in its own right), namely, projections.
Let Q be a closed and nonempty convex subset in Rn . The projection Q (x) of a point
x Rn onto Q is dened as the closest, with respect to the usual Euclidean norm, to x point
of Q, i.e., as the solution to the following optimization problem
(Px ) : minimize |x y|22 over y Q.
#
Exercise 4.7.2 Prove that Q (x) does exist and is unique.
Exercise 4.7.3 # Prove that a point y Q is a solution to (Px ) if and only if the vector
x y is such that
(u y)T (x y) 0 u Q. (4.7.32)
Derive from this observation the following important property:
|Q (x) u|22 |x u|22 |x Q (x)|22 . (4.7.33)
Thus, when we project a point onto a convex set, the point becomes closer to any point u of
the set, namely, the squared distance to u is decreased at least by the squared distance from
x to Q.
Derive from (4.7.32) that the mappings x Q (x) and x x Q (x) are Lipschitz
continuous with Lipschitz constant 1.
Now consider the following modication of the Subgradient Descent scheme:

given a solid Q which covers G, a starting point x1 int Q and a pair of sequences {i > 0}
(stepsizes) and {i (0, 1)} (tolerances), form the sequence
xi+1 = Q (xi i ei ), (4.7.34)
where 2 is the Euclidean diameter of Q and the unit vector ei is dened as follows:
(i) if xi int G, then ei is an arbitrary unit vector which separates xi and G:
(x xi )T ei 0, x G;
(ii) if xi int G, but there is a constraint gj which is i -violated at xi , i.e., is such that

gj (xi ) > i max{gj (xi ) + (x xi )T gj (xi )} , (4.7.35)
xG +
then
1
ei =
gj (xi );
|gj (xi )|2
(iii) if xi int G and no constraint is i -violated at xi , i.e., no inequality (4.7.35) is

satised, then
1
ei = f (xi ).
|f (xi )|2
If in the case of (iii) f (xi ) = 0, then ei is an arbitrary unit vector.
After ei is chosen, loop.
The modication, as we see, is in the following:
1) We add projection onto a covering G solid Q into the rule dening the updating
xi xi+1 and use the half of the diameter of Q as the scale factor in the steplength (in the
basic version of the method, there was no projection, and the scale factor was the radius of
the ball Vout );
2) We use time-dependent tactics to distinguish between search points which almost
satisfy the functional constraints and those which signicantly violate a constraint.
3) If we meet with a productive point xi with vanishing subgradient of the objective,
we choose as ei an arbitrary unit vector and continue the process. Note that in the initial
version of the method in the case in question we terminate and claim that xi is an -solution;
now we also could terminate and claim that xi is an i -solution, which in fact is the case,
but i could be large, not the accuracy we actually are interested in.
Exercise 4.7.4 # Prove the following modication of Proposition 4.3.1:

Let a problem (p) from the family Pm (G) be solved by the aforementioned Subgradient
Descent method with nonincreasing sequence of tolerances {i }. Assume that for a pair of
positive integers N > N one has
1 N
2+ 2 j=N j2 rin
(N : N) N < N , (4.7.36)
j=N j
where rin is the maximal of radii of Euclidean balls contained in G. Then among the search
points xN , xN +1 , ..., xN there were productive ones, and the best of them (i.e., that one with
the smallest value of the objective) point xN ,N is an N -solution to (p).
Derive from this result that in the case of problems without functional constraints (m = 0),
where i do not inuence the process at all, the relation
(N) min {(M : M)/rin } < 1 (4.7.37)

N M M 1
implies that the best of the productive search points found in course of the rst N steps is
well-dened and is an (N)-solution to (p).
Hint: follow the line of argument of the original proof of Proposition 4.3.1. Namely, apply the
proof to the shifted process which starts at xN and uses at its i-th iteration, i 1, the stepsize
i+N 1 and the tolerance i+N 1 . This process diers from that one considered in the lecture in
two issues:
(1) presence of time-varying tolerance in detecting productivity and an arbitrary step, instead
of termination, when a productive search point with vanishing subgradient of the objective is met;
4.7. EXERCISES 95
(2) exploiting the projection onto Q G when updating the search points.
To handle (1), prove the following version of Proposition 3.3.1 (Lecture 3):
Assume that we are generating a sequence of search points xi Rn and associate with these
points vectors ei and approximate solutions x
i in accordance to (i)-(iii). Let
Gi = {x G | (x xj )T ej 0, 1 j i},
and let Size be a size. Assume that for some M
Size(GM ) < M Size(G)
(if GM is not a solid, then, by denition, Size(GM ) = 0). Then among the search points x1 , ..., xM
there were productive ones, and the best (with the smallest value of the objective) of these pro-
ductive points is a 1 -solution to the problem.
To handle (2), note that when estimating InnerRad(GN ), we used the equalities
|xj+1 x+ |22 = |xj x+ |22 + ...
and would be quite satised if = in these inequalities would be replaced with ; in view of Exercise
4.7.3, this replacement is exactly what the projection does.
Looking at the statement given by Exercise 4.7.4, we may ask ourselves what could be a
reasonable way to choose the stepsizes i and the tolerances i . Let us start with the case
of problems without functional constraints, where we can forget about the tolerances - they
do not inuence the process. What we are interested in is to minimize over stepsizes the
quantities (N). For a given pair of positive integers M M the minimum of the quantity
1 M
2+ 2 j=M j2
(N : N) = M
j=M j
2 2
over positive j is attained when j = M M +1 , M j M, and is equal to
;
M M +1
2
thus, to minimize (N) for a given i, one should set j = N , j = 1, ..., N, which would
result in

(N) = 2N 1/2 .
rin
This is, basically, the choice of stepsizes we used in the short-step version of the Subgradient
Descent; an unpleasant property of this choice is that it is tied to N, and we would like
to avoid necessity to x in advance the number of steps allowed for the method. A natural
idea is to use the recommendation j = 2N 1/2 in the sliding way, i.e., to set
j = 2j 1/2 , j = 1, 2, ... (4.7.38)
Let us look what will be the quantities (N) for the stepsizes (4.7.38).
#
Exercise 4.7.5 Prove that for the stepsizes (4.7.38) one has

(N) (]N/2[: N) N 1/2
rin rin
with certain absolute constant . Compute the constant.
We see that the stepsizes (4.7.38) result in optimal, up to an absolute constant factor, rate of
convergence of the quantities (N) to 0 as N . Thus, when solving problems without
functional constraints, it is reasonable to use the aforementioned Subgradient Descent with
stepsizes (4.7.38); according to the second statement of Exercise 4.7.4 and Exercise 4.7.5, for
all N such that

(N) N 1/2 <1
rin
the best of the productive search points found in course of the rst N steps is well-dened
and solves the problem within relative accuracy (N).
Now let us look at problems with functional constraints. It is natural to use here the
same rule (4.7.38); the only question now is how to choose the tolerances i . A reasonable
policy would be something like

i = min{0.9999, 1.01i1/2 }, (4.7.39)
rin
Exercise 4.7.6 # Prove that the Subgradient Descent with stepsizes (4.7.38) and tolerances
(4.7.39), as applied to a problem (p) from the family Pm (G), possesses the following conver-
gence properties: for all N such that

N 1/2 < 0.99
rin
among the search points x]N/2[ , x]N/2[+1 , ..., xN there are productive ones, and the best (with
the smallest value of the objective) of these points solves (p) within relative inaccuracy not
exceeding

]N/2[ N 1/2 ,
rin
being an absolute constant.
Note that if one chooses Q = Vout (i.e., = R, so that /rini = is the asphericity of
G), then the indicated rate of convergence results in the same (up to an absolute constant
factor) as for the basic short-step Subgradient Descent complexity of solving problems from
the family within relative accuracy .
4.7.2 Mirror Descent

Looking at the 3-line convergence proof for the standard Subgradient Descent:
xi+1 = G (xi i f (xi )/|f (xi )|) |xi+1 x |2 |xi x |2 2i (xi x )T f (xi )/|f (xi )| + i2
|xi+1 x |2 |xi x |2 2i (f (xi ) f )/|f (xi )| + i2

|x1 x |2 + N

2
i=1 i
min f (xi ) f N
iN 2 i=1 i
4.7. EXERCISES 97
one should be surprised. Indeed, all of us know the origin of the gradient descent: if f is
smooth, a step in the antigradient direction decreases the rst-order expansion of f and
therefore, for a reasonably chosen stepsize, increases f itself. Note that this standard rea-
soning has nothing in common with the above one: we deal with a nonsmooth f , and it
should not decrease in the direction of an anti-subgradient independently of how small is the
stepsize; there is a subgradient in the subgradient set which actually possesses the desired
property, but this is not necessarily the subgradient used in the method, and even with the
good subgradient you could say nothing about the amount the objective can be decreased
by. The correct reasoning deals with algebraic structure of the Euclidean norm rather
than with local behavior of the objective, which is very surprising; it is a kind of miracle.
But we are interested in understanding, not in miracles. Let us try to understand what is
behind the phenomenon we have met.
First of all, what is a subgradient? Is it actually a vector? The answer, of course, is no.
Given a convex function f dened on an n-dimensional vector space E and an interior point
x of the domain of f , you can dene a nonempty set of support functionals - linear forms
f (x)[h] of h E which are support to f at x, i.e., such that
f (y) f (x) + f (x)[y x], y Dom f ;
these forms are intrinsically associated with f and x. Now, having chosen somehow an
Euclidean structure (, ) on E, you may associate with linear forms f (x)[h] vectors f (x)
from E in such a way that
f (x)[h] = (f (x), h), h Rn ,
thus coming from support functionals to subgradients-vectors. The crucial point is that
these vectors are not dened by f and x only; they also depend on what is the Euclidean
structure on E we use. Of course, normally we think of an n-dimensional space as of the
coordinate space Rn with once for ever xed Euclidean structure, but this habit sometimes
is dangerous; the problems we are interested in are dened in ane terms, not in the metric
ones, so why should we always look at the problems via certain once for ever xed Euclidean
structure which has nothing in common with the problem? Developing systematically this
evident observation, one may come to the most advanced and recent convex optimization
methods like the polynomial time interior point ones. Our goal now is much more modest,
but we also shall get prot from the aforementioned observation. Thus, once more: the
correct objects associated with f and x are not vectors from E, but elements of the dual
to E space E of linear forms on E. Of course, E is of the same dimension as E and
therefore it can be identied with E; but there are many ways to identify these spaces, and
no one of them is natural, more preferable than others.
Since the support functionals f (x)[h] live in the dual space, the Gradient Descent
cannot avoid the necessity to identify somehow the initial - primal - and the dual space, and
this is done via the Euclidean structure the method is related to - as it was already explained,
this is what allows to associate with a support functional - something which actually exists,
but belongs to the dual space - a subgradient, a vector belonging to the primal space; in
a sense, this vector is a phantom - it depends on the Euclidean structure on E. Now, is

a Euclidean structure the only way to identify the dual and the primal spaces? Of course,
no, there are many other ways. What we are about to do is to consider certain family of
identications of E and E which includes, as particular cases, all identications given by
Euclidean structures. This family is as follows. Let V () be a smooth (say, continuously
dierentiable) convex function on E ; its support functional V ()[] at a point is a linear
functional on the dual space. Due to the well-known fact of Linear Algebra, every linear
form L[] on the dual space is dened by a vector l from the primal space, in the sense that
L[] = [l]
for all E ; this vector is uniquely dened, and the mapping L l is a linear isomorphism
between the space (E ) dual to E and the primal space E. This isomorphism is canonical
- it does not depend on any additional structures on the spaces in question, like inner
products, coordinates, etc., it is given by intrinsic nature of the spaces. In other words, in
the quantity [x] we may think of being xed and x varying over E, which gives us a
linear form on E (this is the origin of the quantity); but we also can think of x as being xed
and varying over E , which gives us a linear form on E ; and Linear Algebra says to us
that every linear form on E can be obtained in this manner from certain uniquely dened
x E. Bearing in mind this symmetry, let us from now on renotate the quantity [x] as
, x, thus using more symmetric notation.
Thus, every linear form on E corresponds to certain x E; in particular, the linear
form V ()[] on E corresponds to certain vector V () E:
V ()[] , V () , E .
We come to certain mapping
V () : E E;
this mapping, under some mild assumptions, is a continuous one-to-one mapping with con-
tinuous inverse, i.e., is certain identication (not necessarily linear) of the dual and the
primal spaces.
Exercise 4.7.7 Let (, ) be certain Euclidean structure on the dual space, and let
1
V () = (, ).
2
The aforementioned construction associates with V the mapping V () : E E ; on
the other hand, the Euclidean structure in question itself denes certain identication of the
dual and the primal space, the identication I given by the identity
[x] = (, I 1 x), E , x E.
Prove that I = V .
Assume that {ei }ni=1 is a basis in E and {ei } is the biorthonormal basis in E (so that
ei [ej ] = ij ), and let A be the matrix which represents the inner product in the coordinates
of E related to the basis {ei }, i.e., Aij = (ei , ej ). What is the matrix of the associated
mapping V taken with respect to the {ei }-coordinates in E and the {ei }-coordinates in E?
4.7. EXERCISES 99
We see that all standard identications of the primal and the dual spaces, i.e., those given
by Euclidean structures, are covered by our mappings V (); the corresponding V s are,
up to the factor 1/2, squared Euclidean norms. A natural question is what are the mappings
associated with other squared norms.
Exercise 4.7.8 Let be a norm on E, let
= max{[x] | x E, x 1}
be the conjugate norm on and assume that the function

1
V () = 2
2
is continuously dierentiable. Then the mapping V () is as follows: you take a linear
form = 0 on E, maximize it on the -unit ball of E; the maximizer is unique and is
exactly V (); and, of course, V (0) = 0. In other words, V () is nothing but the direction
of the fastest growth of the functional , where, of course, the rate of growth in a direction
is dened as the progress in per -unit step in the direction.
Prove that the mapping is a continuous mapping form E onto E . Prove that
this mapping is a continuous one-to-one correspondence between E and E if and only if the
function
1
W (x) = x2 : E R
2
is continuously dierentiable, and in this case the mapping

x W (x)

is nothing but the inverse to the mapping given by V
.
Now, every Euclidean norm on E induces, as we know, a Subgradient Descent method

for minimization of convex functions over closed convex domains in E. Let us write down
this method in terms of the corresponding function V . For the sake of simplicity let us
ignore for the moment the projector G , thus looking at the method for minimizing over the
whole E. The method, as it is easily seen, would become
i+1 = i i f (xi )/f (xi ) , xi = V (i ), V V . (4.7.40)
Now, in the presented form of the Subgradient Descent there is nothing from the fact that
is a Euclidean norm; the only property of the norm which we actually need is the
dierentiability of the associated function V . Thus, given a norm on E which induces
a dierentiable outside 0 conjugate norm on the conjugate space, we can write down certain
method for minimizing convex functions over E. How could we analyze the convergence
properties of the method? In the case of the usual Subgradient Descent the proof of con-
vergence was based on the fact that the anti-gradient direction f (x) is a descent direction
for certain Lyapunov function, namely, |x x |2 , x being a minimizer of f . In fact our

reasoning was as follows: since f is convex, we have
f (x), x x f (x) f (x ) 0, (4.7.41)
and the quantity f (x), x x is, up to the constant factor 2, the derivative of the function
|x x |2 in the direction f (x). Could we say something similar in the general case, where,
according to (4.7.40), we should deal with the situation x = V ()? With this substitution,
the left hand side of (4.7.41) becomes
d
f (x), x V () = |t=0 V + ( tf (x)), V + () = V () , x .
dt
Thus, we can associate with (4.7.40) the function
+
V + () V
() = V () , x , (4.7.42)
x being a minimizer of f , and the derivative of this function in the direction f (V ()) of
the trajectory (4.7.40) is nonpositive:
' (
f (V ()), (V + ) () f (x ) f (x) 0, E . (4.7.43)
Now we may try to reproduce the reasoning which leads to the rate-of-convergence estimate
for the Subgradient Descent for our now situation, where we speak about process (4.7.40)
associated with an arbitrary norm on E (the norm should result, of course, in a continuously
dierentiable V ).
For the sake of simplicity, let us restrict ourselves with the simple case when V possesses
a Lipschitz continuous derivative. Thus, from now on let be a norm on E such that the
mapping

V() V
() : E E
is Lipschitz continuous, and let
V() V( )
L L = sup{ | = , , E }.

For the sake of brevity, from now on we write V instead of V .
Exercise 4.7.9 Prove that
V ( + ) V () + , V() + LV (), , E . (4.7.44)
Now let us investigate process (4.7.40).

Exercise 4.7.10 Let f : E R be a Lipschitz continuous convex function which attains
its minimum on E at certain point x . Consider process (cf. (4.7.40))
i+1 = i i f (xi )/|f (xi )| , xi = V (i ), 1 = 0, (4.7.45)

4.7. EXERCISES 101
and let xi be the best (with the smallest value of f ) of the points x1 , ..., xi and let i =
xi ) minE f . Prove that then
f (

|x |2 + L N 2
i=1 i
N L (f ) N , N = 1, 2, ... (4.7.46)
2 i=1 i
where L (f ) is the Lipschitz constant of f with respect to the norm . In particular, the
method converges, provided that

i = , i 0, i .
i
Hint: use the result of exercise 4.7.9 and (4.7.43) to demonstrate that
L 2
V + (i+1 ) V + (i ) 2i (f (xi ) f )/|f (xi )| + , V + () = V () , x ,
2 i
and then act exactly as in the case of the usual Subgradient Descent. Note that the basic
result explains what is the origin of the Subgradient Descent miracle which motivated our
considerations; as we see, this miracle comes not from the very specic algebraic structure of
the Euclidean norm, but from certain robust analytic property of the norm (the Lipschitz
continuity of the derivative of the conjugate norm), and we can fabricate similar miracles
for arbitrary norms which share the indicated property. In fact you could use the outlined
Mirror Descent scheme, developed by Nemirovski and Yudin, with necessary (and more or
less straightforward) modications, in order to extend everything what we know about the
usual - Euclidean - Subgradient Descent (I mean, the versions for optimization over a
domain rather than over the whole space and for optimization over solids under functional
constraints) onto the general non-Euclidean case, but we skip here these issues.
4.7.3 Stochastic Approximation

We shall speak here about a problem which is quite dierent from those considered so
far about stochastic optimization. The simplest (and, perhaps, the most important in
applications) single-stage Stochastic Programming program is as follows:

minimize f (x) = F (x, )dP () s.t. x G. (4.7.47)

Here
F (x, ) are functions of the design vector x Rn and parameter w
G is a subset in Rn
P is a probability distribution on the set .

Thus, a stochastic program is a Mathematical Programming program where the objective

and the constraints are expectations of certain functions depending both of the design vector
x and random parameter .
The main source of programs of the indicated type is optimization in stochastic systems,
like Queing Networks, where the processes depend not only on the design parameters, like
performances and numbers of serving devices of dierent types), but also on random factors.
As a result, the characteristics of such a system (e.g., time of serving a client, cost of service,
etc.) are random variables depending, as on parameters, on the design parameters of the
system. It is reasonable to measure the quality of the system by the expected values of the
indicated random variables (in the dynamic systems, we should speak about the steady-state
expectations). These expected values have the form of the objective and the constraints in
(4.7.47), so that to optimize a system in question over its design parameters is a program of
the type (4.7.47).
Looking at stochastic program (4.7.47), you can immediately ask what are, if any, the
specic features of these problems and why these problems need specic treatment. (4.7.47)
is, nally, nothing but the usual Mathematical Programming problem; we know how to solve
tractable (convex) Mathematical Programming programs, and to the moment we never took
into account what is the origin of the objective and the constraints, are they given by explicit
simple formulae, or are they intergals, as in (4.7.47), or solutions to dierential equations,
or whatever else. All what was important for us were the properties of the objective and the
constraints convexity, smoothness, etc., but not their origin.
This indierence to the origin of the problem indeed was the feature of our approach,
but it was its weak point, not the strong one. In actual computations, the performance
of an algorithm heavily depends not only on the quality of the method we apply to solve
the problem, but also on the computational eort needed to provide the algorithm by the
information on the problem instance the job which in our model of optimization was the
task of the oracle. We did not think how to help the oracle to solve his task and took care
only of total number of oracle calls we did our best to reduce this number. In the most
general case, when we have no idea of what is the internal structure of the problem, this is
the only possible approach. But the more we know about the structure of the problem, the
more should we think of how to simplify the task of the oracle in order to reduce the overall
computational expenses.
Stochastic Programming is an example when we know something about the structure of
the problem in question. Namely, let us look at a typical stochastic system, like Queing
Network. Normally the function F (x, ) associated with the system is algorithmically
simple given x and , we can more or less easily compute the quantity F (x, ) and
even its derivative with respect to x; to this end it suces to create a simulation model of
the system and run it at the given by realization of the random parameters (arrival and
service times, etc.); even for a rather sophisticated system, a single simulation of this type
is relatively fast. Thus, typically we have no diculties with simulating realizations of the
random quantity F (x, ). On the other hand, even for relatively simple systems it is, as
a rule, impossible to compute the expected values of the indicated quantities in a closed
analytic form, and the only way to evaluate these expected values if to use a kind of the
4.7. EXERCISES 103
Monte-Carlo method: to run not a single, but many simulations, for a xed value of the
design parameters, and to take the empiric average of the observed random quantities as
an estimate of their expected values. According to the well-known results on the rate of
convergence of the Monte-Carlo method, to estimate the expected values within inaccuracy
it requires O(1/2 ) simulations, and this is just to get the estimate of the objective and the
constraints of problem (4.7.47) at a single point! Now imagine that we are going to treat
(4.7.47) as a usual black-box represented optimization problem and intend to imitate the
usual rst-order oracle for it via the afrementioned Monte-Carlo estimator. In order to get
an -solution to the problem, we, even in good cases, need to estimate within accuracy O()
the objective and the constraints along the search points. It means that the method will
require much more simulations than the aforementioned O(1/2): this quantity should be
multiplied by the information-based complexity of the optimization method we are going to
use. As a result, the indicated approach in most of the cases results in unappropriately long
computations.
An extremely surprising thing is that there exists another way to solve the problem.
This way, under reasonable convexity assumptions, results in overall number of O(1/2 )
computations only as if there were no optimization at all and the only goal were to estimate
the objective and the constraints at a given point. The subject of our today lecture is this
other way Stochastic Approximation.
To get a convenient framework for presenting Stochastic Approximation, it is worthy to
modify a little the way we are looking at our problem. Assume that when solving it, we
are allowed to generate a random sample 1 , 2 ,... of random factors involved into the
problem; the elements of the sample are assumed to be mutually independent and distributed
according to P . Assume also that given x and , we are able to compute the value F (x, )
and the gradient x F (x, ) of the integrant in (4.7.47). Note that under mild regularity
assumptions the dierentiation with respect to x and taking expectation are interchangeable:

f (x) = F (x, )dP (), f (x) = x F (x, )dP (). (4.7.48)

It means that the situation is covered by the following model of an optimization method
solving (4.7.47):
At a step i, we (the method) form i-th search point xi and forward it to the oracle which
we have in our disposal. The oracle returns the quantities
F (xi , i ), x F (xi , i )
(in our previous interpretation it means that a single simulation of the stochastic system
in question is performed), and this answer is the portion of information on the problem we
get on the step in question. Using the information accumulated so far, we generate the new
search point xi+1 , again forward it to the oracle, enrich our accumulated information by its
answer, and so on.
The presented scheme is a very natural denition of a method based on stochastic rst
order oracle capable to provide the method with random unbiased (see (4.7.48)) estimates of
the values and the gradients of the objective and the constraints of (4.7.47). Note that the
estimates are not only unbiased, but also form a kind of Markov chain: the distribution of
the answers of the oracle at a point depends only on the point, not on the previous answers
(recall that {i } are assumed to be independent).
Now, for our further considerations it is completely unimportant that the observation
of f (x) comes from the value, and the observation of f (x) comes from the gradient. It
suces to postulate the following:
The goal is to solve the Convex Programming program2

f0 (x) min, | x G Rn (4.7.49)
[G is closed and convex, f is convex and Lipschitz continuous on G]
The information obtained by an optimization method at i-th step, i = 1, 2, ..., comes
from a stochastic oracle, i.e., is a real and a vector
(xi , i ) R, (xi , i ) Rn ,
where
{i } is sequence of independent identically distributed, according to certain prob-
ability distribution P , random parameters taking values in certain space ;
xi is i-th search point generated by the method; this point may be an arbitrary
deterministic function of the information obtained by the method so far (i.e., at
the steps 1, ..., i 1)
It is assumed (and it is crucial) that the information obtained by the method is unbi-
ased:
E(x, ) = f (x), f (x) E(x, ) f (x), x G; (4.7.50)
here E is the expectation with respect to the distribution of the random parameters in
question.
To get reasonable complexity results, we need to bound somehow the magnitude of the
random noise in the process in the stochastic oracle (as it is always done in all statistical
considerations). Mathematically, the most convenient way to do it is as follows: let
1/2
L = sup E|(x, f )|2 . (4.7.51)
xG
From now on we assume that the oracle is such that L < . The quantity L will be
called the intensity of the oracle at the problem in question; in what follows it plays the
same role as the Lipschitz constant of the objective in large-scale minimization of Lipschitz
continuous convex functions.
2
Of course, the model we are about to present makes sense not only for convex programs; but the methods
we are interested in will, as always, work well only in convex case, so that we loose nothing when imposing
the convexity assumption from the very beginning
4.7. EXERCISES 105
4.7.4 The Stochastic Approximation method

The method we are interested in is completely similar to the Subgradient Descent method
from Section 4.5. Namely, the method generates search points according to the recurrency
(cf. (8.1.1))
xi+1 = G (xi i (xi , i )), i = 1, 2, ..., (4.7.52)
where
x1 G is an arbitrary starting point;
i are deterministic positive stepsizes;
G (x) = argminyG |x y| is the standard projector on G.
The only dierence with (8.1.1) is that now we replace the direction g(xi ) = f (xi )/|f (xi )|
of the subgradinet of f at xi by the random estimate (xi , i ) of the subgradient.
Note that the search trajectory of the method is governed by the random variables i
and is therefore random:
xi = xi ( i1 ), s = (1 , ..., s ).
Recurrency (4.7.52) denes the sequence of the search points, not that one of the approx-
imate solutions. In the deterministic case of Section 4.5 we extracted from the search points
the approximate solutions by choosing the best (with the smallest value of the objective) of
the search points generated so far. The same could work in the stochastic case as well, but
here we meet with the following obstacle: we do not see the values of the objective, and
therefore cannot say which of the search point is better. To resolve the diculty, we use the
following trick: dene i-th approximate solution as the sliding average of the search points:
1

i i i1
x x ( )= i t xt . (4.7.53)
i/2ti i/2ti
The eciency of the resulting method is given by the following

Theorem 4.7.1 For the aforementined method one has, for all positive integer N,

N D 2 + L2 N/2iN i2
N E[f (x ) min f (x)] , (4.7.54)
xG 2 N/2iN i
G being the diameter of G.
In particular, for i chosen accoding to
D
i = (4.7.55)
L i
one has
LD
N O(1) (4.7.56)
N
with appropriately chosen absolute constant O(1).
Exercise 4.7.11 Prove Theorem 4.7.1.
Hint: follow the lines of the proof of the estimate (4.5.16) in Section 4.5; substitute d2i+1
with vi E|xi+1 ( i ) x |2 .
Comments: The statement and the proof of Theorem 4.7.1 are completely similar to the
related deterministic considerations of Section 4.5. The only dierence is that now we are
estimating from above the expected inaccuracy of N-th approximate solution; this is quite
natural, since the stochastic nature of the process makes it impossible to say something
reasonable about the quality of every realization of the random vector xN = xN ( N 1 )
It turns out that the rate of convergence established in (4.7.56) connot be improved.
Namely, it is not dicult to prove the following statement.
Proposition 4.7.1 For every L > 0, any D > 0, any positive integer N and any stochastic-
oracle-based N-step method M of minimizing univariate convex functions over the segment
G = [0, D] on the axis there exists a linear function f and a stochastic oracle with intensity
L on the function such that
LD
E[f (xN ) min f ] O(1) ,
G N
xN being the result formed by the method as applied to the problem f . Here O(1) is properly
chosen positive absolute constant.

Note that in the deterministic case the rate of convergence O(1/ N) was unimprovable only
in the large-scale case; in contrast to this, in the stochastic case this rate becomes optimal
already when we are minimizing univariate linear functions.

Convergence rate O(1/ N) can be improved only if the objective is strongly convex. The
simplest and the most important result here is as follows (the proof is completely similar to
that one of Theorem 4.7.1):
Proposition 4.7.2 Assume that convex function f on Rn attains its minimum on G at a

unique point x and is such that

f (x ) + |x x |2 f (x) f (x ) + |x x |2 , x G, (4.7.57)
2 2
with certain positive and . Consider process (4.7.52) with the stepsizes

i = , (4.7.58)
i
being a positive scale factor satisfying the relation
> 1. (4.7.59)
4.7. EXERCISES 107
Then
D 2 + 2 L2
N E(f (xN ) min f ) c() , (4.7.60)
G N
where D is the diameter of G, L is the intensity of the oracle at the problem in question and
c() is certain problem-independent function on (1, ).
Exercise 4.7.12 Prove Proposition 4.7.2.
The algorithm (4.7.52) with the stepsizes (4.7.58) and the approximate solutions identical to
the search points is called the classical Stochastic Approximation; it originates from Kiefer
and Wolfovitz.
A good news about the algorithm is its rate of convergence: O(1/N) instead
of O(1/ N). A bad news is that this better rate is ensured only in the case of problems
satisfying (4.7.57) and that the rate of convergence is very sensitive to the choice of the scale
factor in the stepsize formula (4.7.58): if this scale factor does not satisfy (4.7.59), the
rate of convergence may become worse in order. To see this, consider the following simple
example: the problem is
1
f (x) = x2 , [ = = 1]; G = [1, 1]; x1 = 1,
2
the observations are given by
(x, ) = x + [= f (x) + ],
and i are the standard Gaussian random variables (Ei = 0, Ei2 = 1); we do not specify
(x, ), since it is not used in the algorithm. In this example, the best choice of is = 1 (in
this case one can make (4.7.59) an equality rather than strict inequality due to the extreme
simplicity of the objective). For this choice of one has
1
Ef (xN +1 ) E[f (xN +1 ) min
x
f (x)] , N 1.
2N
In particular, it takes no more than 50 steps to reach expected inaccuracy not exceeding
0.01.
Now assume that when solving the problem we overestimate the quantity and choose
stepsizes according to (4.7.58) with = 0.1. How many steps do we need in this case to
reach the same expected inaccuracy 0.01 500, 5000, or what? The answer is astonishing:
approximately 1,602,000 steps. And with = 0.05 (20 times less than the optimal value of
the parameter) the same accuracy costs more than 5.2 1014 steps!
We see how dangerous the classical rule (4.7.58) for the stepsizes is: underestimating
of ( overestimating of ) may kill the procedure completely. And where from, in more or
less complicated cases, could we take a reasonable estimate of ? It should be said that there
exist stable versions of the classical Stochastic Approximation (they, same as our version of
the routine, use large, as compared to O(1/i), stepsizes and take, as approximate solutions,
certain averages of the search points). These stable version of the method are capable to
reach (under assumptions similar to (4.7.57)) the O(1/N)-rate of convergence, even with
the asymptotically optimal coecient at 1/N. Note, anyhow, that the nondegeneracy
assumption (4.7.57) is crucial for the O(1/N)-rate of convergence;if it is removed, the best
possible rate, as we know from Proposition 4.7.1, becomes O(1/ N ), and this is the rate
given by our robust Stochastic Approximation with large steps and averaging.
Lecture 5
Nonlinear programming:
Unconstrained Minimization
(Relaxation; Gradient method; Rate of convergence; Newton method; Gradient Method and
Newton Method: What is dierent? Idea of Variable Metric; Variable Metric Methods;
Conjugate Gradient Methods
Nous discutons ici des methodes classiques de programmation non-lineaire. Ces methodes
constituent le point de depart de la theorie doptimisation - cest ici que lhistoire doptimisation
a commence. Notre objectif sera aussi dobtenir un avant-go ut de la nouvelle vie que
certaines idees tr`es anciennes ont retrouve en optimisation convexe, ce qui a donne des
techniques algorithmiques les plus avancees actuellement disponibles.
5.1 Relaxation
We have already mentioned in the rst lecture of this course that the main goal of the general
nonlinear programming is to nd a local solution to a problem dened by dierentiable
functions. In general, the global structure of these problems is not much simpler than that
of the problems dened by Lipshitz continuous functions. Therefore, even for such restricted
goals, it is necessary to follow some special principles, which guarantee the convergence of
the minimization process.
The majority of the nonlinear programming methods are based on the idea of relaxation:
We call the sequence {ak }
k=0 a relaxation sequence if ak+1 ak for all k 0.
In this sections we consider several methods for solving the unconstrained minimization
problem
minn f (x), (5.1.1)
xR
where f (x) is a smooth function. To solve this problem, we can try to generate a relaxation
sequence {f (xk )}
k=0 :
f (xk+1 ) f (xk ), k = 0, 1, . . . .
If we manage to do that, then we immediately have the following important consequences:
109
110LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION
1. If f (x) is bounded from below on Rn then the sequence {f (xk )}

k=0 converges.
2. In any case we improve the initial value of the objective function.
5.2 Gradient method

Now we are completely prepared for the analysis of the unconstrained minimization methods.
Let us start from the simplest scheme. We already know that the antigradient is a direction
of the locally steepest descent of a dierentiable function (cf. Exercise 5.6.2). Since we are
going to nd a local minimum of such function, the following scheme is the rst to be tried:
0). Choose x0 Rn .
1). Iterate (5.2.2)
xk+1 = xk hk f (xk ), k = 0, 1, . . . .
This is a scheme of the gradient method. The gradients factor in this scheme, hk , is called
the step size. Of course, it is reasonable to choose the step size positive.
There are many variants of this method, which dier one from another by the step size
strategy. Let us consider the most important ones.
1. The sequence {hk }
k=0 is chosen in advance, before the gradient method starts its job.
For example,
hk = h > 0, (constant step)
hk = h .
k+1
2. Full relaxation:
hk = arg min f (xk hf (xk )).
h0
3. Goldstein-Armijo rule: Find xk+1 = xk hf (xk ) such that
f (xk ), xk xk+1 f (xk ) f (xk+1 ), (5.2.3)
f (xk ), xk xk+1 f (xk ) f (xk+1 ), (5.2.4)

where 0 < < < 1 are some xed parameters.
Comparing these strategies, we see that the rst strategy is the simplest one. Indeed,
it is often used, but only in Convex Optimization, where the behavior of functions is much
more predictable than in the general nonlinear case.
The second strategy is completely theoretical. It is never used in practice since even in
one-dimensional case we cannot nd an exact minimum of a function in nite time.
The third strategy is used in the majority of the practical algorithms. It has the following
geometric interpretation. Let us x x Rn . Consider the function of one variable
(h) = f (x hf (x)), h 0.
5.2. GRADIENT METHOD 111
Then the step-size values acceptable for this strategy belong to the part of the graph of ,
which is located between two linear functions:
1 (h) = f (x) h f (x) 2 , 2 (h) = f (x) h f (x) 2 .
Note that (0) = 1 (0) = 2 (0) and (0) < 2 (0) < 1 (0) < 0. Therefore, the ac-
ceptable values exist unless (h) is not bounded from below. There are several very fast
one-dimensional procedures for nding a point satisfying the conditions of this strategy, but
their description is not so important for us now.
Let us estimate now the performance of the gradient method. Consider the problem
min f (x),
xRn
with f CL1,1 (Rn ). The latter means (cf the denition in Section 5.6.4 that f is continuously
dierentiable on Rn and its derivative is Lipschitz continuous on Rn with the constant L:
f (x) f (y) L x y
for all x, y Rn . Let us also assume that f (x) is bounded from below on Rn .
Let us evaluate rst the result of one step of the gradient method. Consider y = xhf (x).
Then, in view of (5.6.32), we have:
L
f (y) f (x) + f (x), y x + 2
y x 2
(5.2.5)
2

= f (x) h f (x) 2
+ h2 L 2
f (x) = f (x) h(1 h
2
L) 2
f (x) .
Thus, in order to get the best estimate for the possible decrease of the objective function,
we have to solve the following one-dimensional problem:
h
(h) = h(1 L) min .
2 h
Computing the derivative of this function, we conclude that the optimal step size must satisfy
the equation (h) = hL 1 = 0. Thus, it could be only h = L1 , and that is a minimum of
(h) since (h) = L > 0.
Thus, our considerations prove that one step of the gradient method can decrease the
objective function as follows:
1
f (y) f (x) f (x) 2 .
2L
Let us check what is going on with our step-size strategies.
Let xk+1 = xk hk f (xk ). Then for the constant step strategy, hk = h, we have:
f (xk ) f (xk+1 ) h(1 12 Lh) f (xk ) 2 .

2
Therefore, if we choose hk = L
with (0, 1), then
2
f (xk ) f (xk+1 ) (1 ) f (xk ) 2 .
L
Of course, the optimal choice is hk = L1 .
For full relaxation strategy we have
1
f (xk ) f (xk+1 ) f (xk ) 2
2L
since the maximal decrease cannot be less than that with hk = L1 .
Finally, for Goldstein-Armijo rule in view of (5.2.4) we have:
f (xk ) f (xk+1) f (xk ), xk xk+1 = hk f (xk ) 2 .
From (5.2.5) we obtain:

hk
f (xk ) f (xk+1 ) hk (1 L) f (xk ) 2 .
2
Therefore hk L2 (1 ). Further, using (5.2.3) we have:
f (xk ) f (xk+1 ) f (xk ), xk xk+1 = hk f (xk ) 2 .
Combining this inequality with the previous one, we conclude that

2
f (xk ) f (xk+1 ) (1 ) f (xk ) 2 .
L
Thus, we have proved that in all cases we have

f (xk ) f (xk+1 ) f (xk ) 2 , (5.2.6)
L
where is some constant.
Now we are ready to estimate the performance of the gradient scheme. When taking the
sum of the inequalities (5.2.6) for k = 0, . . . , N we obtain:
N

f (xk ) 2 f (x0 ) f (xN ) f (x0 ) f , (5.2.7)
L k=0
where f is the optimal value of the problem (5.1.1). As a simple conclusion of (5.2.7) we
have:
f (xk ) 0 as k .
However, we can say something about the convergence rate. Indeed, denote

gN = min gk ,
0kN
where gk = f (xk ) . Then, in view of (5.2.7), we come to the following inequality:

1/2
1 1
gN L(f (x0 ) f ) . (5.2.8)
N +1

The right hand side of this inequality describes the rate of convergence of the sequence {gN }
to zero. Note that we cannot say anything about the rate of convergence of the sequences
{f (xk )} or {xk }.
Recall, that in general nonlinear optimization our goal is rather moderate: We want
to nd only a local minimum of our problem. Nevertheless, even this goal, in general, is
unreachable for the gradient method. Let us consider the following example.
Example 5.2.1 Consider the function of two variables:
1 1 1
f (x) f (x(1) , x(2) ) = (x(1) )2 + (x(2) )4 (x(2) )2 .
2 4 2
The gradient of this function is
f (x) = (x(1) , (x(2) )3 x(2) )T .
Therefore there are only three points which can be a local minimum of this function:
x1 = (0, 0), x2 = (0, 1), x3 = (0, 1).
Computing the Hessian of this function,

1 0
f (x) = (2) 2 ,
0 3(x ) 1
we conclude that x2 and x3 are the isolated local minima1 , but x1 is only a stationary point
of our function. Indeed, f (x1 ) = 0 and
4 2
f (x1 + e2 ) = < 0
4 2
for small enough.
Now, let us consider the trajectory of the gradient method, which starts from x0 = (1, 0).
Note that the second coordinate of this point is zero. Therefore, the second coordinate of
f (x0 ) is also zero. Consequently, the second coordinate of x1 is zero, etc. Thus, the entire
sequence of points, generated by the gradient method will have the second coordinate equal
to zero. This means that this sequence can converge to x1 only.
To conclude our example, note that this situation is typical for all rstorder uncon-
strained minimization methods. Without additional very strict assumptions, it is impossible
to guarantee the global convergence of the minimizing sequence to a local minimum, only to
a stationary point.
1
In fact, in our example they are the global solutions.
We can now write down some (upper) complexity estimates for a certain class of opti-
mization problems. Let us look at the following example.
Example 5.2.2 Consider the following problem class:
Problem class: 1. Unconstrained minimization.

2. f CL1,1 (Rn ).
3. f (x) is bounded from below.
Oracle: First order black box.
solution: x) f (x0 ), f (
f ( x) .
Recall that the function class CL1,1 (Rn ) is dened as follows (cf. the denition in Section
5.6.4):
any f CL1,1 (Rn ) is continuously dierentiable on Rn .
Its 1st derivative is Lipshitz continuous on Rn with the constant L:
f (x) f (y) L x y
for all x, y Rn .
Note, that (5.2.8) can be used to obtain an upper bound for the number of steps (= calls of
the oracle), which is necessary to nd a point with a small norm of the gradient. For that,
let us write out the following inequality:
1/2
1 1
gN L(f (x0 ) f ) .
N +1
Therefore, if N + 1 L2 (f (x0 ) f ), we necessarily have gN

.
L
Thus, we can use the value 2 (f (x0 ) f ) as an upper complexity estimate for our
problem class. Comparing this estimate with the result of Theorem 1.2.2, we can see that it
is much better; at least it does not depend upon n.
To conclude this example, note that the lower complexity bounds for the class under
consideration are not known.
Let us check, what can be said about the local convergence of the gradient method. Let
us consider the unconstrained minimization problem
min f (x)
xRn
under the following assumptions:

2,2
1. f CM (Rn ) (the class of twice dierentiable functions with Lipschitz continuous
2,2
Hessian. Recall that for f CM (Rn ) we have
f (x) f (y) M x y
for all x, y Rn ).
2. There exists a local minimum of function f at which the Hessian is positive denite.
3. We know some bounds 0 < l L < for the Hessian at x :
lIn f (x ) LIn . (5.2.9)
4. Our starting point x0 is close enough to x .
Consider the process: xk+1 = xk hk f (xk ). Note that f (x ) = 0. Hence,
1

f (xk ) = f (xk ) f (x ) = f (x + (xk x ))(xk x )d = Gk (xk x ),
0
-1
where Gk = f (x + (xk x ))d . Therefore
0
xk+1 x = xk x hk Gk (xk x ) = (I hk Gk )(xk x ).
There is a standard technique for analyzing the processes of this type, which is based on
contracting mapping. Recall that for a process
a0 Rn , ak+1 = Ak ak ,
where Ak are (n n) matrices such that Ak 1 q, q (0, 1), we can estimate the rate
of convergence of the sequence {ak } to zero:
ak+1 (1 q) ak (1 q)k+1 a0 0.
Thus, in our case we need to estimate In hk Gk . Denote rk = xk x . In view

of Exercise 5.6.6, we have:
f (x ) Mrk In f (x + (xk x )) f (x ) + Mrk In .
Therefore, using our assumption (5.2.9), we obtain:

rk rk
(l 2
M)In Gk (L + 2
M)In .
Hence,
rk rk
(1 hk (L + 2
M))In In hk Gk (1 hk (l 2
M))In .
Thus,
In hk Gk max{ak (hk ), bk (hk )},
(5.2.10)
rk rk
ak (h) = 1 h(l 2
M), bk (h) = h(L + 2
M) 1.
2l
Note that ak (0) = 1 and bk (0) = 1. Therefore, if rk < r M , then ak (h) is a strictly
decreasing function of h and we can ensure In hk Gk < 1 for small enough hk . In this
case we will have rk+1 < rk .
As usual, many step-size strategies are possible. For example, we can choose hk = L1 . Let
us consider the optimal strategy consisting of minimizing the right hand side of (5.2.10):
max{ak (h), bk (h)} min .

h
Let us assume that r0 < r. Then, if we form the sequence {xk } using this strategy, we can be
sure that rk+1 < rk < r. Therefore, the optimal step size hk can be found from the equation:
rk rk
ak (h) = 1 h(l M) = h(L + M) 1 = bk (h).
2 2
Hence
2
hk = . (5.2.11)
L+l
Under this choice we obtain:
(L l)rk Mrk2
rk+1 + .
L+l L+l
2l M
Let us estimate the rate of convergence. Denote q = L+l
and ak = r
L+l k
(< q). Then
ak (1 (ak q)2 ) ak
ak+1 (1 q)ak + a2k = ak (1 + (ak q)) = .
1 (ak q) 1 + q ak
1 1+q
Therefore ak+1
ak
1, or

q q(1 + q) q
1 q 1 = (1 + q) 1 .
ak+1 ak ak
Hence,

q q 2l L+l r
1 (1 + q)k 1 = (1 + q)k 1 = (1 + q)k 1 .
ak a0 L + l r0 M r0
Thus,
k
qr0 qr0 1 qr0 (1 q)k
ak .
r r0 )
r0 + (1 + q)k ( r r0 1+q r r0
This proves the following theorem.
5.3. NEWTON METHOD 117
Theorem 5.2.1 Let function f (x) satisfy our assumptions and let the starting point x0 be
close enough to a local minimum:
2l
r0 = x0 x < r = .
M
Then the gradient method with the optimal step size (5.2.11) converges with the following
rate: k
rr0 Ll
xk x .
r r0 L + l
We call his rate of convergence linear.
5.3 Newton method

Initially, the Newton method has been proposed for nding a root of a function of one
variable (t), t R1 :
(t ) = 0.
For that, it uses the idea of linear approximation. Indeed, assume that we have some t close
enough to t . Note that
(t + t) = (t) + (t)t + o(| t |).
Therefore the equation (t + t) = 0 can be approximated by the following linear equation:
(t) + (t)t = 0.
We can expect that the solution of this equation, the displacement t, is a good approxi-
mation to the optimal displacement t = t t. Converting this idea in the algorithmic
form, we obtain the following process:
(tk )
tk+1 = tk .
(tk )
This scheme can be naturally extended onto the problem of nding solution to the system
of nonlinear equations:
F (x) = 0,
where x Rn and F (x) : Rn Rn . For that we have to dene the displacement x as a
solution to the following linear system:
F (x) + F (x)x = 0
(it is called the Newton system). The corresponding iterative scheme looks as follows:
xk+1 = xk [F (xk )]1 F (xk ).

Finally, in view of Theorem 5.6.1 (necessary condition of the minimum), we can replace
the unconstrained minimization problem by the following nonlinear system
f (x) = 0. (5.3.12)
(This replacement is not completely equivalent, but it works in the nondegenerate situations.)
Further, for solving (5.3.12) we can apply the standard Newton method for the system of
nonlinear equations. In the optimization case, the Newton system looks as follows:
f (x) + f (x)x = 0,
Hence, the Newton method for optimization problems appears in the following form:
xk+1 = xk [f (xk )]1 f (xk ). (5.3.13)
Note that we can come to the process (5.3.13), using the idea of quadratic approximation.
Consider this approximation, computed at the point xk :
(x) = f (xk ) + f (xk ), x xk + 12 f (xk )(x xk ), x xk .
Assume that f (xk ) > 0. Then we can choose xk+1 as a point of minimum of the quadratic
function (x). This means that
(xk+1 ) = f (xk ) + f (xk )(xk+1 xk ) = 0,
and, we come again to the Newton process (5.3.13).

We will see that the convergence of the Newton method in a neighborhood of a strict
local minimum is very fast. However, this method has two serious drawbacks. First, it can
break down if f (xk ) is degenerate. Second, the Newton process can diverge. Let us look at
the following example.
Example 5.3.1 Let us apply the Newton method to nding a root of the following function
of one variable:
t
(t) = .
1 + t2
Clearly, t = 0. Note that
1
(t) = .
[1 + t2 ]3/2
Therefore the Newton process looks as follows:
(tk ) tk
tk+1 = tk = tk $ [1 + t2k ]3/2 = t3k .
(tk ) 2
1 + tk
Thus, if | t0 |< 1, then this method converges and the convergence is extremely fast. Point
t0 = 1 is an oscillation point of this method. If | t0 |> 1, then the method diverge.
In order to escape from the possible divergence, in practice we can apply a Damped
Newton method:
xk+1 = xk hk [f (xk )]1 f (xk ),
where hk > 0 is a step-size parameter. At the initial stage of the method we can use the
same step size strategies as for the gradient method. At the nal stage it is reasonable to
chose hk = 1.
Let us study the local convergence of the Newton method. Consider the problem
min f (x)
xRn
under the following assumptions:

2,2
1. f CM (Rn ).
2. There exists a local minimum of function f with positive denite Hessian:
f (x ) lIn , l > 0. (5.3.14)
3. Our starting point x0 is close enough to x .
Consider the process: xk+1 = xk [f (xk )]1 f (xk ). Then, using the same reasoning as
for the gradient method, we obtain the following representation:
xk+1 x = xk x [f (xk )]1 f (xk )
-1
= xk x [f (xk )]1 f (x + (xk x ))(xk x )d
0
= [f (xk )]1 Gk (xk x ),
-1
where Gk = [f (xk ) f (x + (xk x ))]d . Denote rk = xk x . Then
0
-1
Gk = [f (xk ) f (x + (xk x ))]d
0
-1
f (xk ) f (x + (xk x )) d
0
-1 rk
M(1 )rk d = 2
M.
0
In view of Exercise 5.6.6, and (5.3.14), we have:
f (xk ) f (x ) Mrk In (l Mrk )In .

Therefore, if rk < Ml then f (xk ) is positive denite and [f (xk )]1 (l Mrk )1 . Hence,
2l
for rk small enough (rk < 3M ), we have
Mrk2 M 2
rk+1 ( r < rk /3).
2(l Mrk ) 6l k
The rate of convergence of this type is called quadratic.

Thus, we have proved the following theorem.
Theorem 5.3.1 Let function f (x) satisfy our assumptions. Suppose that the initial starting
point x0 is close enough to x :
2l
x0 x < r = .
3M
Then xk x < r for all k and the Newton method converges quadratically:
M xk x 2
xk+1 x .
2(l M xk x )
Comparing this result with the rate of convergence of the gradient method, we see that
the Newton method is much faster. Surprisingly enough, the region of quadratic convergence
of the Newton method is almost the same as the region of the linear convergence of the
gradient method. This means that the gradient method is worth to use only at the initial
stage of the minimization process in order to get close to a local minimum. The nal job
should be performed by the Newton method.
In this section we have seen several examples of the convergence rate. Let us compare
these rates in terms of complexity. As we have seen in Example 5.2.2, the upper bound for
the analytical complexity of a problem class is an inverse function of the rate of convergence.
1. Sublinear rate. This rate is described in terms of a power function of the iteration
counter. For example, we can have rk ck . In this case the complexity of this scheme
is c2 /2 .
Sublinear rate is rather slow. In terms of complexity, each new right digit of the answer
takes the amount of computations comparable with the total amount of the previous
work. Note also, that the constant c plays a signicant role in the corresponding
complexity estimate.
2. Linear rate. This rate is given in terms of an exponential function of the iteration
counter. For example, it could be like that: rk c(1q)k . Note that the corresponding
complexity bound is 1q (ln c + ln 1 ).
This rate is fast: Each new right digit of the answer takes a constant amount of
computations. Moreover, the dependence of the complexity estimate in constant c is
very weak.
3. Quadratic rate. This rate has a form of the double exponential function of the iteration
counter. For example, it could be as follows: rk+1 crk2 . The corresponding complexity
estimate depends double-logarithmically on the desired accuracy: ln ln 1 .
This rate is extremely fast: Each iteration doubles the number of right digits in the
answer. The constant c is important only for the starting moment of the quadratic
convergence (crk < 1).
5.3.1 Gradient Method and Newton Method: What is dierent?

In the previous section we have considered two local methods for nding a strict local mini-
mum of the following unconstrained minimization problem:
min f (x),
xRn
where f CL2,2 (Rn ). Those were the gradient method
xk+1 = xk hk f (xk ), hk > 0.
and the Newton Method:

xk+1 = xk [f (xk )]1 f (xk ).
Recall that the local rate of convergence of these method was dierent. We have seen, that
the gradient method has a linear rate and the Newton method converges quadratically. What
is the reason for this dierence?
If we look at the analytic form of these methods, we can see at least the following formal
dierence: In the gradient method the search direction is the antigradient, while in the
Newton method we multiply the antigradient by some matrix, that is the inverse Hessian.
Let us try derive these directions using some universal reasoning.
Let us x some x Rn . Consider the following approximation of the function f (x), we
will call it the model:
1
x) + f (
1 (x) = f ( x), x x + x x 2 ,
2h
where the parameter h is positive. The rst-order optimality condition provides us with the
following equation for x1 , the unconstrained minimum of the function 1 (x):
1
1 (x1 ) = f (
x) + (x1 x) = 0.
h
Thus, x1 = x hf (x). That is exactly the iterate of the gradient method. Note, that if
h (0, L1 ] then the function 1 (x) is the global upper approximation of f (x):
f (x) 1 (x), x Rn ,
(see Lemma 5.6.4). This fact is responsible for the global convergence of the gradient method.
Further, consider the quadratic approximation (the model) of the function f (x):
1
x) + f (
2 (x) = f ( x), x x + f (
x)(x x), x x. (5.3.15)
2
We have already seen, that the minimum of this function is
x2 = x [f (
x)]1 f (
x),
and that is the iterate of the Newton method.

Thus, we can try to use some approximations of the function f (x), which are better than
1 (x) and which are less expensive than 2 (x).
Let G be a positive denite n n-matrix. Denote
1
x) + f (
G (x) = f ( x), x x + G(x x), x x.
2
Computing its minimum from the equation
G (xG ) = f (
x) + G(xG x) = 0,
we obtain
xG = x G1 f (
x). (5.3.16)
The rst-order methods which form a sequence
{Gk } : Gk f (x )
(or {Hk } : Hk G1 1
k [f (x )] ) are called the variable metric methods. (Sometimes
the name Quasi-Newton methods is used.) In these methods only the gradients are involved
in the process of generating the sequences {Gk } or {Hk }.
The reasons explaining the step of the form (5.3.16) are so important for optimization,
that we will provide it with one more interpretation. It will also shed some light on the
variable metric name of the algorithm family.
We have already used the gradient and the Hessian of a nonlinear function f (x). However,
note that they are dened with respect to the standard Euclidean inner product on Rn :
n

x, y = x(i) y (i) , x, y Rn ,
i=1
x = x, x1/2 .
Indeed, the denition of the gradient is as follows:
f (x + h) = f (x) + f (x), h + o( h ).
From that equation we derive its coordinate form

f (x) f (x)
f (x) = ,..., .
x1 xn
Let us introduce now a new inner product. Consider a symmetric positive denite n n-
matrix A. For x, y Rn denote
x, yA = Ax, y, x A = Ax, x1/2 .
The function x A is a new metric on Rn dened by the matrix A. Note that topologically
this new metric is equivalent to :
1 (A)1/2 x x A n (A)1/2 x ,
where 1 (A) and n (A) are the smallest and the largest eigenvalue of the matrix A. However,
the gradient and the Hessian are changing:
f (x + h) = f (x) + f (x), h + 12 f (x)h, h + o( h )
= f (x) + A1 f (x), hA + 12 A1 f (x)h, hA + o( h A )
= f (x) + A1 f (x), hA + 14 [A1 f (x) + f (x)A1 ]h, hA + o( h A ).
Hence, fA (x) = A1 f (x) is the new gradient and fA (x) = 12 [A1 f (x) + f (x)A1 ] is the
new Hessian (with respect to the metric dened by A).
Thus, the direction used in the Newton method can be interpreted as the gradient com-
puted with respect to the metric dened by A = f (x). Note that the Hessian of f (x) at x
computed with respect to A = f (x) is the unit matrix.
Example 5.3.2 Consider the quadratic function
f (x) = + a, x + 12 Ax, x,
where A = AT > 0. Note that f (x) = Ax + a, f (x) = A and f (x ) = Ax + a = 0 for

x = A1 a.
Let us compute the Newton direction at some x Rn :
dN (x) = [f (x)]1 f (x) = A1 (Ax + a) = x + A1 a.
Therefore for any x Rn we have:
x dN (x) = A1 a = x .
Thus, the Newton method converges for a quadratic function in one step. Note also that
1
f (x) = + A1 a, xA + 2
x 2A ,
fA (x) = A1 f (x) = dN (x), fA (x) = A1 f (x) = In .

Let us write out the general scheme of the variable metric methods.
General scheme.
0. Choose x0 Rn . Set H0 = In . Compute f (x0 ) and f (x0 ).
1. k-th iteration (k 0).
a). Set pk = Hk f (xk ).
b). Find xk+1 = xk hk pk (see section 2.1.2 for the step-size rules).
c). Compute f (xk+1 ) and f (xk+1 ).
d). Update the matrix Hk : Hk Hk+1
The variable metric schemes dier one from another only in the implementation of Step
1d), which updates the matrix Hk . For that, they use the new information, accumulated at
Step 1c), namely the gradient f (xk+1 ).
The idea of this update can be explained with a quadratic function. Let
f (x) = + a, x + 12 Ax, x, f (x) = Ax + a.
Then, for any x, y Rn we have f (x) f (y) = A(x y). This identity explains the origin
of the following Quasi-Newton rule:
Choose Hk+1 such that Hk+1(f (xk+1 ) f (xk )) = xk+1 xk .
Naturally, there are many ways to satisfy this relation. Let us present several examples of
the variable metric schemes, which are recognized as the most ecient ones.
Example 5.3.3 Denote Hk = Hk+1 Hk ,
k = f (xk+1 ) f (xk ), k = xk+1 xk .
Then all of the following rules satisfy the Quasi-Newton relation:
1. One-rank correction scheme.
(k Hk k )(k Hk k )T
Hk = .
k Hk k , k
2. Davidon-Fletcher-Powell scheme (DFP).
k kT Hk k kT Hk
Hk = .
k , k Hk k , k
3. Broyden-Fletcher-Goldfarb-Shanno scheme (BFGS).
Hk k kT + k kT Hk Hk k kT Hk
Hk = k ,
Hk k , k Hk k , k
where k = 1 + k , k /Hk k , k .
Clearly, there are many other possibilities. From the computational point of view, BFGS
is considered as the most stable scheme.
5.4. NEWTON METHOD AND SELF-CONCORDANT FUNCTIONS 125
For quadratic functions the variable metric methods usually terminate in n iterations.
In the neighborhood of a strict minimum they have the superlinear rate of convergence: for
any x0 Rn there exists a number N such that for all k N we have
xk+1 x const xk x xkn x
(the proofs are very long and technical). As far as the global convergence is concerned, these
methods are not better than the gradient method (at least, from the theoretical point of
view).
Note that in these methods it is necessary to store and update a symmetric n n-
matrix. Thus, each iteration needs O(n2 ) auxiliary arithmetic operations. During a long
time this feature was considered as the main drawback of the variable metric methods. That
stimulated the interest to the conjugate gradient schemes, which have much lower complexity
of each iteration (we will consider these schemes in Section 5.5). However, in view of the
tremendous growth of the computer power, these arguments are not so important now.
5.4 Newton Method and Self-Concordant Functions

The traditional results on the Newton method establish no more than its fast asymptotical
convergence; global eciency estimates can be proved only for modied versions of the
method, and these global estimates never are better than those for the Gradient Descent. In
this section we consider a family of objectives the so called self-concordant ones where
the Newton method admits excellent global eciency estimates. This family underlies the
most advanced Interior Point methods for large-scale Convex Optimization.
5.4.1 Preliminaries
The traditional starting point in the theory of the Newton method Theorem 5.3.1
possesses an evident drawback (which, anyhow, remained unnoticed by generations of
researchers). The Theorem establishes local quadratic convergence of the Basic Newton
method as applied to a function f with positive denite Hessian at the minimizer, this is
ne; but what is the quantitative information given by the Theorem? What indeed is
the region of quadratic convergence of the method, let us denote it G, the set of those
starting points from which the method converges quickly to x ? The proof provides us with
certain constructive description of G, but look this description involves dierential char-
acteristics of f like the magnitude M of the third order derivatives of f in a neighborhood
of x and the bound on the norm of inverted Hessian in this neighborhood (which depends
on M, the radius of the neighborhood and the smallest eigenvalue l of f (x )). Besides this,
the fast convergence of the method is described in terms of the behavior of the standard
Euclidean distances xt x . All these quantities magnitudes of third-order derivatives of
f , norms of the inverted Hessian, distances from the iterates to the minimizer are frame-
dependent: they depend on the choice of Euclidean structure on the space of variables,
on what are the orthonormal coordinates used to compute partial derivatives, Hessian ma-
trices and their norms, etc. When we vary the Euclidean structure (pass from the original
coordinates to another coordinates via a non-orthogonal linear transformation), all these
quantities somehow vary, same as the description of G given by Theorem 5.3.1. On the other
hand, when passing from one Euclidean structure on the space of variables to another, we
do not vary neither the problem, nor the Basic Newton method. Indeed, the latter method
is independent of any a priori coordinates, as it is seen from the following coordinateless
description of the method (cf (5.3.15)):
To nd the Newton iterate xt+1 of the previous iterate xt , take the second order Taylor
expansion of f at xt and choose, as xt+1 , the minimizer of the resulting quadratic form.
Thus, the coordinates are responsible only for the point of view we use to investigate the
process and are absolutely irrelevant to the process itself. And the results of Theorem 5.3.1
in their quantitative part (same as other traditional results on the Newton method) reect
this point of view, not the actual properties of the Newton process! This dependence on
viewpoint is a severe drawback: how can we get correct impression of actual abilities of
the method looking at the method from an occasionally chosen position? This is exactly
the same as to try to get a good picture of a landscape directing the camera in a random
manner.
5.4.2 Self-concordance
When the drawback of the traditional results is realized, could we choose a proper point of
view to orient our camera properly, at least for good objectives? Assume, e.g., that our
objective f is convex with nondegenerate Hessian. Then at every point x there is a natural,
intrinsic for the objective, Euclidean structure on the space of variables, namely, the one
given by the Hessian of the objective at x; the corresponding norm is

$ d2
|h|f,x = hT f (x)h |t=0 f (x + th). (5.4.17)
dt2
Note that the rst expression for |h|f,x seems to be frame-dependent it is given in terms
of coordinates used to compute inner product and the Hessian. But in fact the value of this
expression is frame-independent, as it is seen from the second representation of |h|f,x .
Now, from the standard results on the Newton method we know that the behavior of the
method depends on the magnitudes of the third-order derivatives of f . Thus, these results
are expressed in terms of upper bounds
. .
. d3 .
. .
. 3 |t=0 f (x + th).
. dt .
on the third-order directional derivatives of the objective, the derivatives being taken along
unit in the standard Euclidean metric directions h. What happens if we impose similar
upper bound on the third-order directional derivatives along the directions of the unit | |f,x
length rather than along the directions of the unit usual length? In other words, what
happens if we assume that
. .
. d3 .
. .
|h|f,x 1 . 3 |t=0 f (x + th).
. dt .
?
Since the left hand side of the concluding inequality is of homogeneity degree 3 with respect
to h, the indicated assumption is equivalent to the one
. .
. d3 .
. .
. 3 |t=0 f (x + th).
. dt .
|h|3f,x x h.
Now, the resulting inequality, qualitatively, remains true when we scale f replace it by f
with positive constant , but the value of varies: 1/2 . We can use this property
to normalize the constant factor , e.g., to set it equal to 2 (this is the most technically
convenient normalization).
Thus, we come to the main ingredient of the notion of a
self-concordant function: a three times continuously dierentiable convex function f sat-
isfying the inequality
. . 3/2
. d3 . d2
. .
. 3 |t=0 f (x + th). 2|h|3f,x 2 |t=0 f (x + th) h Rn . (5.4.18)
. dt . dt2
We do not insist on f to be dened everywhere; it suces to assume that the domain of f

is an open convex set Gf Rn , and that (5.4.18) is satised at every point x Gf . The
second part of the denition of a self-concordant function is that
Gf is a natural domain of f , so that f possesses the barrier property with respect to
Gf blows up to innity when a sequence of interior points of Gf approaches a boundary
point of the domain:
{xi Gf } : xi x Gf , i f (xi ) , i . (5.4.19)
Of course, the second part of the denition imposes something on f only when the domain
of f is less than the entire Rn .
Note that the denition of a self-concordant function is coordinateless it imposes
certain inequality between third- and second-order directional derivatives of the function
and certain behavior of the function on the boundary of its domain; all notions involved are
frame-independent.
5.4.3 Self-concordant functions and the Newton method

It turns out that the Newton method as applied to a self-concordant function f possesses
extremely nice global convergence properties. Namely, one can more or less straightforwardly
prove the following statements:
Proposition 5.4.1 [Self-concordant functions and the Newton method]

Let f be strongly self-concordant, and let f be nondegenerate at some point of Gf (this for
sure is the case when Gf does not contain lines, e.g., is bounded). Then
(i) [Nondegeneracy] f (x) is positive denite at every point x Gf ;
(ii) [Existence of minimizer] If f is below bounded (which for sure is the case when Gf is
bounded), then f attains its minimum on Gf , the minimizer being unique;
(iii) [Damped Newton method] When started at arbitrary point x0 Gf the process
1 $
xt+1 = xt [f (xt )]1 f (xt ), (f, x) = (f (x))T [f (x)]1 f (x) (5.4.20)
1 + (f, xt )
the Newton method with particular stepsizes
1
t+1 =
1 + (f, xt )
possesses the following properties:
(iii.1) The process keeps the iterates in Gf and is therefore well-dened;
(iii.2) If f is below bounded on Gf (which for sure is the case if (f, x) < 1 for
some x Gf ) then {xt } converges to the unique minimizer xf of f on Gf ;
(iii.3) Each step of the process (5.4.20) decreases f signicantly, provided that
(f, xt ) is not too small:
f (xt ) f (xt+1 ) (f, xt ) ln (1 + (f, xt )) ; (5.4.21)
(iii.4) For every t, one has

2 (f, xt )
(f, xt ) < 1 f (xt ) f (xf ) ln (1 (f, xt )) (f, xt ) (5.4.22)
2
and
22 (f, xt )
(f, xt+1 ) . (5.4.23)
1 (f, xt )
The indicated statements demonstrate extremely nice global convergence properties of
the Damped Newton method (5.4.20) as applied to a self-concordant function f . Namely,
assume that f is self-concordant with nondegenerate Hessian at certain (and then, as it was
mentioned in the above proposition, at any) point of Gf . Assume, besides this, that f is
below bounded on Gf (and, consequently, attains its minimum on Gf by (ii)). According
to (iii), the Damped Newton method keeps the iterates in Gf . Now, we may partition the
trajectory into two parts:
the initial phase: from the beginning to the rst moment, let it be called t , when
(f, xt ) 1/4;
the nal phase: starting from the moment t .
According to (iii.3), at every step of the initial phase the objective is decreased at least by
absolute constant
1 5 1
= ln (> );
4 4 32
consequently,
the initial phase is nite and is comprised of no more than
f (x0 ) minGf f
Nini =

iterations.
Starting with t = t , we have in view of (5.4.23):
22 (f, xt ) 1
(f, xt+1 ) (f, xt );
1 (f, xt ) 2
thus,
starting with t = t , the quantities (f, xt ) converge quadratically to 0 with objective-

independent rate.
According to (5.4.23),
starting with t = t , the residuals in terms of the objective f (xt )minGf f also converge
quadratically to zero with objective-independent rate.
Combining the above observations, we observe that
the number of steps of the Damped Newton method required to reduce the residual
f (xt ) min f in the value of a self-concordant below bounded objective to a prescribed
value < 0.1 is no more than

1
N() O(1) [f (x0 ) min f ] + ln ln , (5.4.24)

O(1) being an absolute constant.
It is also worthy of note what happens when we apply the Damped Newton method to
a below unbounded self-concordant f . The answer is as follows:
for a below unbounded f one has (f, x) 1 for every x (see (iii.2)), and, consequently,
every step of the method decreases f at least by the absolute constant 1 ln 2 (see
(iii.3)).
The indicated picture gives a frame- and objective-independent description of the

global behavior of the Damped Newton method as applied to a below bounded self-concordant
function. Note that the quantity (f, x) used to describe the behavior of the method at the
rst glance is coordinate dependent (see (5.4.20)), but in fact this quantity is coordinate-
less. Indeed, one can easily verify that
2 (f, x)
= f(x) min f(y),
2 y
where
1
f(y) = f (x) + (y x)T f (x) + (y x)T f (x)(y x)
2
is the second-order Taylor expansion of f at x. This is a coordinateless denition of (f, x).
Note that the region of quadratic convergence of the Damped Newton method as applied
to a below bounded self-concordant function f is, according to (iii.4), the set
1
Gf = {x Gf | (f, x) }. (5.4.25)
4
5.4.4 Self-concordant functions: applications

At the rst glance, the family of self-concordant functions is rather thin the functions are
given by certain strict dierential inequality, and a general, even convex and smooth,
f hardly may happen to be self-concordant. Thus, what for these elegant results on the
behavior of the Newton method on self-concordant functions?
The answer is as follows: it turns out that (even constrained) Convex Programming
problems of reasonable (in principle of arbitrary) analytic structure can be reduced to a
small series of problems of minimizing self-concordant functions. Applying to these auxil-
iary problems the Damped Newton method, we come to the theoretically most ecient (and
extremely ecient in practice) Interior Point methods for Convex Optimization. Appear-
ance of these methods denitely was one of the most important events in optimization. I
am going to speak about Interior Point methods in more detail in the next lecture. What
should be stressed now is that the crucial point in the design of such methods is our ability to
construct good self-concordant functions with prescribed domains. To this end it is worth
to note how to construct self-concordant functions. Here the following raw materials and
combination rules are useful:
Basic examples of self-concordant functions. For the time being, the following exam-
ples will be sucient:
[Convex quadratic (e.g., linear) form] The convex quadratic function
1
f (x) = xT Ax bT x + c
2
(A is symmetric positive semidenite n n matrix) is self-concordant on Rn ;
[Logarithm] The function

ln(x)
is self-concordant with the domain R+ = {x R | x > 0};
[Extension of the previous example: Logarithmic barrier, linear/quadratic case] Let
G = {x Rn | j (x) < 0, j = 1, ..., m}
be a nonempty set in Rn given by m strict convex quadratic (e.g., linear) inequalities.

Then the function m
f (x) = ln(i (x))
i=1
is self-concordant with the domain equal to G.
Combination rules: simple operations with functions preserving self-concordance
[Linear combination with coecients 1] Let fi , i = 1, ...m, be self-concordant func-

tions with the domains Gfi , let these domains possess a nonempty intersection G, and
let i 1, i = 1, ..., m, be given reals. Then the function
m

f (x) = i fi (x)
i=1
is self-concordant with the domain equal to G.

In particular, the sum of a self-concordant function and a convex quadratic function
(e.g., a linear one) is self-concordant;
[Ane substitution] Let f (x) be self-concordant with the domain Gf Rn , and let
x = A + b be an ane mapping from Rk into Rn with the image intersecting Gf .
Then the composite function
g() = f (A + b)
is self-concordant with the domain
Gg = { | A + b Gf }
being the inverse image of Gf under the ane mapping in question.
To justify self-concordance of the indicated functions, same as the validity of the combina-
tion rules, only minimal eort is required; at the same time, these examples and rules give
almost all required to establish excellent global eciency estimates for Interior Point meth-
ods as applied to Linear Programming and Convex Quadratically Constrained Quadratic
programming.
After we know examples of self-concordant functions, let us look how our new under-
standing of the behavior of the Newton method on such a function diers from the one
given by Theorem 5.3.1. To this end consider a particular self-concordant function the
logarithmic barrier
f (x) = ln( x1 ) ln( + x1 ) ln(1 x2 ) ln(1 + x2 )
for the 2D rectangle

D = {x R2 | |x1 | < , |x2 | < 1};
in what follows we assume that the rectangle is wide, i.e., that
1.
This function indeed is self-concordant (see the third of the above raw material examples).
The minimizer of the function clearly is the origin; the region of quadratic convergence of
the Damped Newton method is given by
x21 x22 1
G = {x D | 2 2
+ 2
}
+ x1 1 + x2 32
(see (5.4.25)). We see that the region of quadratic convergence of the Damped Newton
method is large enough it contains, e.g., 8 times smaller than D concentric to D rectangle
D . Besides this, (5.4.24) says that in order to minimize f to inaccuracy, in terms of the
objective, , starting with a point x0 D, it suces to perform no more than

1 1
O(1) ln + ln ln
x0
steps, where O(1) is an absolute constant and
|x1 |
x = max{ , |x2 |}.

Now let us look what Theorem 5.3.1 says. The Hessian f (0) of the objective at the minimizer
is 2
2 0
H= ,
0 2
and H 1 = O( 2 ); in, say, 0.5-neighborhood U of x = 0 we also have [f (x)]1 = O( 2 ).
The third-order derivatives of f in U are of order of 1. Thus, in the notation from the proof
of Theorem 5.3.1 we have M = O(1) (this is the magnitude of the third order derivatives
of f in U), r = 0.5 and the upper bound on the norm of the inverted Hessian of f in U is
O( 2 ). According to the proof, the region U of quadratic convergence of the Newton method
is r-neighborhood of x = 0 with
r = O( 2).
Thus, according to Theorem 5.3.1, the region of quadratic convergence of the method be-
comes the smaller the larger is , while the actual behavior of this region is quite opposite.
5.5. CONJUGATE GRADIENTS METHOD 133
In this simple example, the aforementioned drawback of the traditional approach its
frame-dependence is clearly seen. Applying Theorem 5.3.1 to the situation in question,
we used extremely bad frame Euclidean structure. If we were clever enough to scale the
variable x1 before applying Theorem 5.3.1 to divide it by it would become absolutely
clear that the behavior of the Newton method is absolutely independent of , and the region
of quadratic convergence of the method is a once for ever xed fraction of the rectangle
D.
5.5 Conjugate gradients method

The conjugate gradient method was initially proposed for minimizing a quadratic func-
tion. Therefore we will describe rst the conjugate gradient schemes for the problem
min f (x),
xRn
with f (x) = + a, x + 12 Ax, x and A = AT > 0. We have already seen that the
solution of this problem is x = A1 a. Therefore, our quadratic objective function
can be written in the following form:
f (x) = + a, x + 12 Ax, x = Ax , x + 12 Ax, x
= 12 Ax , x + 12 A(x x ), x x .
Thus, f = 12 Ax , x and f (x) = A(x x ).

Let we are given by a starting point x0 . Consider the linear subspaces
Lk = Lin {A(x0 x ), . . . , Ak (x0 x )}.
The sequence of the conjugate gradient method is dened as follows:
xk = arg min{f (x) | x x0 + Lk }, k = 1, 2, . . . . (5.5.26)
This denition looks rather abstract. However, we will see that this method can be
written in much more algorithmic form. The above representation is convenient for
the theoretical analysis.
Lemma 5.5.1 Lk = Lin {f (x0 ), . . . , f (xk1 )}.
Proof:
For k = 1 we have f (x0 ) = A(x0 x ). Let the statement of the lemma is true for
some k 1. Note that
k

xk = x0 + i Ai (x0 x )
i=1
with some i R1 . Therefore

k

f (xk ) = A(x0 x ) + i Ai+1 (x0 x ) = y + k Ak+1 (x0 x ),
i=1
where y Lk . Thus,
Lk+1 = Lin {Lk , Ak+1 (x0 x )} = Lin {Lk , f (xk )} = Lin {f (x0 ), . . . , f (xk )}.
The next result is important for understanding the behavior of the minimization
sequence of the method.
Lemma 5.5.2 For any k = i we have f (xk ), f (xi ) = 0.
Proof:
Let k > i. Consider the function

k

= (1 , . . . , k ) = f x0 +
() j f (xj1 ) .
j=1
such that
In view of Lemma 5.5.1, there exists
k

xk = x0 + j f (xj1 ).
j=1
However, in view of its denition, xk is the point of minimum of f (x) on Lk . Therefore

(
) = 0. It remains to compute the components of the gradient:
)
(
0= = f (xk ), f (xi ).
i
Corollary 5.5.1 The sequence generated by the conjugate gradient method is nite.
(Since the number of the orthogonal directions cannot exceed n.)
Corollary 5.5.2 For any p Lk we have f (xk ), p = 0.
The last result we need explains the name of the method. Denote i = xi+1 xi .
It is clear that
Lk = Lin {0 , . . . , k1 }.
Lemma 5.5.3 For any k = i we have Ak , i = 0.
(Such directions are called conjugate with respect to A.)

Proof:
Without loss of generality, we can assume that k > i. Then
Ak , i = A(xk+1 xk ), xi+1 xi = f (xk+1 ) f (xk ), xi+1 xi = 0
since i = xi+1 xi Li+1 Lk .

5.5. CONJUGATE GRADIENTS METHOD 135
Let us show, how we can write out the conjugate gradient method in more algo-
rithmic form. Since Lk = Lin {0 , . . . , k1 }, we can represent xk+1 as follows:
k1

xk+1 = xk hk f (xk ) + j j .
j=0
That is
k1

k = hk f (xk ) + j j . (5.5.27)
j=0
Let us compute the coecients of this representation. Multiplying (5.5.27) by A and

i , 0 i k 1, and using Lemma 5.5.3 we obtain:
k1

0 = Ak , i = hk Af (xk ), i + j Aj , i
j=0
= hk Af (xk ), i + i Ai , i
= hk f (xk ), f (xi+1 ) f (xi ) + i Ai , i .

Therefore, in view of Lemma 5.5.2, i = 0 for all i < k 1. For i = k 1 we have:
hk f (xk ) 2 hk f (xk ) 2
k1 = = .
Ak1 , k1 f (xk ) f (xk1 ), k1
Thus, xk+1 = xk hk pk , where
f (xk ) 2 k1 f (xk ) 2 pk1
pk = f (xk ) = f
(x k )
f (xk ) f (xk1 ), k1 f (xk ) f (xk1 ), pk1
since k1 = hk1 pk1 by the denition of {pk }.
Note that we managed to write down the conjugate gradient scheme in terms
of the gradients of the objective function f (x). This provides us with possibility to
apply formally this scheme for minimizing a general nonlinear function. Of course,
such extension destroys all properties of the process, specic for the quadratic func-
tions. However, we can hope that asymptotically this method could be very fast in
the neighborhood of a strict local minimum, where the objective function is close to
be quadratic.
Let us present the general scheme of the conjugate gradient method for minimizing
some nonlinear function f (x)
Conjugate gradient method
0. Choose x0 Rn . Compute f (x0 ) and f (x0 ). Set p0 = f (x0 ).
1. kth iteration (k 0).
a). Find xk+1 = xk +hk pk (using an exact line search procedure).
b). Compute f (xk+1 ) and f (xk+1 ).
c). Compute the coefficient k .
d). Set pk+1 = f (xk+1 ) k pk .
In the above scheme we did not specify yet the coecient k . In fact, there are many
dierent formulas for this coecient. All of them give the same result for a quadratic
function, but in the general nonlinear case they generate dierent algorithmic schemes.
Let us present three most popular expressions.
f (xk+1 ) 2
1. k = f (xk+1 )f (xk ),pk .
f (xk+1 ) 2
2. Fletcher-Rieves: k = f (xk ) 2 .
f (xk+1 ),f (xk+1 )f (xk )
3. Polak-Ribbiere: k = f (xk ) 2 .
Recall that in the quadratic case the conjugate gradient method terminates in n
iterations (or less). Algorithmically, this means that pn+1 = 0. In the nonlinear case
that is not true, but after n iteration this direction could loose any sense. Therefore,
in all practical schemes there is a restart strategy, which at some point sets k = 0
(usually after every n iterations). This ensures the global convergence of the scheme
(since we have a normal gradient step just after the restart), and a local n-step local
quadratic convergence:
xn+1 x const x0 x 2 ,
provided that x0 is close enough to the strict minimum x . Note, that this local
convergence is slower than that of the variable metric methods. However, the conjugate
gradient schemes have an advantage of the very cheap iteration. As far as the global
convergence is concerned, the conjugate gradients in general are not better than the
gradient method.
5.6. EXERCISES 137
5.6 Exercises
5.6.1 Implementing Gradient method
Exercise 5.6.1
Let us consider the implementation of the Armijo-Goldstein version of the Gradient Descent.
Recall that at each step of the algorithm we are looking for xk+1 = xk hf (xk ) and the
step-size h such that
f (xk )T (xk xk+1 ) f (xk ) f (xk+1 ), (5.6.28)

f (xk )T (xk xk+1 ) f (xk ) f (xk+1 ), (5.6.29)
where 0 < < < 1 are xed parameters of the algorithm.

Note that a point xk+1 which satisfy the system of inequalities above surely exists (why?).
To nd it we can proceed as follows: let us choose an arbitrary positive h = h0 and test
whether it satises (5.6.28). If it is the case, let us replace this value subsequently by
h1 = h0 , h2 = h1 , etc., each time verifying whether the new value of h satises (5.6.28).
This cannot last forever: for certain s 1 hs for sure fails to satisfy (5.6.28). When it
happens for the rst time, the quantity hs1 turns out to satisfy (5.6.28), while the quantity
hs = hs1 fails to satisfy (5.6.28), which means that h = hs1 passes the Armijo-Goldstein
test.
If the initial h0 does not satisfy (5.6.28), we replace this value subsequently by h1 = h0 ,
h2 = 1 , etc., each time verifying whether the new value of h still does not satisfy (5.6.28).
Again, this cannot last forever: for certain s 1, hs for sure satises (5.6.28). When it
happens for the rst time, hs turns out to satisfy (5.6.28), while hs1 = hs fails to satisfy
(5.6.28), and h = hs passes the Armijo test.
Write the code implementing the Armijo-Goldstein Gradient Descent method and apply
it to the following problems:
Rosenbrock problem
f (x) = 100(x2 x21 )2 + (1 x1 )2 min | x = (x1 , x2 ) R2 ,
the starting point is x0 = (1.2, 1).

The Rosenbrock problem is a well-known test example: it has a unique critical point
x = (1, 1) (the global minimizer of f ); the level lines of the function are banana-shaped
valleys, and the function is nonconvex and rather badly conditioned.
Quadratic problem
f (x) = x21 + x22 min | x = (x1 , x2 ) R2 .
Test the following values of

101 ; 104 ; 106
and for each of these values test the starting points

(1, 1); ( , 1); (, 1).
How long it takes to reduce the initial inaccuracy, in terms of the objective, by factor
0.1?
Quadratic problem
1
f (x) = xT Ax bT x, x R4 ,
2
with
0.78 0.02 0.12 0.14 0.76
0.02 0.86 0.04 0.06 0.08

A= , b= , x0 = 0.
0.12 0.04 0.72 0.08 1.12
0.14 0.06 0.08 0.74 0.68
Run the method until the norm of the gradient at the current iterate is becomes less
than 106 . Is the convergence fast or not?
Those using MATLAB can compute the spectrum of A and to compare the theoretical
upper bound on convergence rate with the observed one.
Experiments with Hilbert matrix. Let H (n) be the n n Hilbert matrix:

1
(H (n) )ij = , i, j = 1, ..., n.
i+j1
- 1 n
This is a symmetric positive denite matrix (since xT H (n) x = 0 ( i=1 xi ti1 )2 dt 0,
the inequality being strict for x = 0).
For n = 2, 3, 4, 5 perform the following experiments:
choose somehow n-dimensional nonzero vector x , e.g., x = (1, ..., 1)T ;

compute b = H (n) x ;
Apply your Gradient Descent code to the quadratic function
1
f (x) = xT H (n) x bT x,
2
the starting point being x0 = 0. Note that x is the unique minimizer of f .
Terminate the method when you will get |xN x | 104, not allowing it, anyhow,
to run more than 10,000 steps.
What will be the results?

Those using MATLAB can try to compute the condition number of the Hilbert matrices
in question.
You can play with the parameters and of the method to get the best possible convergence.
5.6. EXERCISES 139
5.6.2 Regular functions

The exercises of this section are critical to understand the content of the next lecture. You
are invited to work on them or, at least, to work on the proposed solutions.
We are going to use a fundamental principle of numerical analysis, we mean approxima-
tion. In general,
To approximate an object means to replace the initial complicated object by a

simplied one, close enough to the original.
In nonlinear optimization we usually apply the local approximations based on the derivatives
of the nonlinear function. Those are the rst- and the second-order approximations (or, the
linear and quadratic approximations). Let f (x) be dierentiable at x. Then for y Rn we
have:
x) + f (
f (y) = f ( x), y x + o( y x ),
where o(r) is some function of r 0 such that
1
lim o(r) = 0, o(0) = 0.
r0 r
The linear function f ( x) + f (

x), y x is called the linear approximation of f at x. Recall

that the vector f (x) is called the gradient of function f at x. Considering the points
yi = x + ei , where ei is the ith coordinate vector in Rn , we obtain the following coordinate
form of the gradient:
f (x) f (x)
f (x) = ,...,
x1 xn
Let us look at some important properties of the gradient.
5.6.3 Properties of the gradient

Denote by Lf () the sublevel set of f (x):
Lf () = {x Rn | f (x) }.
Consider the set of directions tangent to Lf () at x, f (

x) = :

n yk x
x) = s R | s =
Sf ( lim .
x,f (yk )= yk x
yk
+
Exercise 5.6.2 x) then f (
If s Sf ( x), s = 0.
Let s be a direction in Rn , s = 1. Consider the local decrease of f (x) along s:
1
x + s) f (
(s) = lim [f ( x)].
0
Note that
f ( x) = f (
x + s) f ( x), s + o().
Therefore (s) = f (
x), s. Using the Cauchy-Schwartz inequality:
x y x, y x y ,
we obtain:
(s) = f (
x), s f (
x) .
Let us take s = f (
x)/ f (
x . Then
s) = f (
( x), f (
x)/ f (
x) = f (
x) .
Thus, the direction f (
x) (the antigradient) is the direction of the fastest local decrease of
f (x) at the point x.
The next statement, which is already known to us, is probably the most fundamental
fact in optimization.
Theorem 5.6.1 (First-order optimality condition; Ferm`a theorem.)
Let x be a local minimum of the dierentiable function f (x). Then f (x ) = 0.
Note that this only is a necessary condition of a local minimum. The points satisfying
this condition are called the stationary points of function f .
Let us look now at the second-order approximation. Let the function f (x) be twice
dierentiable at x. Then
x) + f (
f (y) = f ( x), y x + 12 f (
x)(y x), y x + o( y x 2 ).
The quadratic function
x) + f (
f ( x), y x + 12 f (
x)(y x), y x
is called the quadratic (or second-order) approximation of function f at x. Recall that the
(n n)-matrix f (x):
2 f (x)
(f (x))i,j = ,
xi xj
is called the Hessian of function f at x. Note that the Hessian is a symmetric matrix:
f (x) = [f (x)]T .
This matrix can be seen as a derivative of the vector function f (x):
f (y) = f (
x) + f (
x)(y x) + o( y x ),
where o(r) is some vector function of r 0 such that
1
o(r) = 0, o(0) = 0.
lim
r0 r
Using the secondorder approximation, we can write out the secondorder optimality
conditions. In what follows notation A 0, used for a symmetric matrix A, means that A is
positive semidenite; A > 0 means that A is positive denite. The following result supplies
a necessary condition of a local minimum:
5.6. EXERCISES 141
Theorem 5.6.2 (Second-order optimality condition.)

Let x be a local minimum of a twice dierentiable function f (x). Then
f (x ) = 0, f (x ) 0.
Again, the above theorem is a necessary (secondorder) characteristics of a local mini-

mum. We already know a sucient condition.
Proof:
Since x is a local minimum of function f (x), there exists r > 0 such that for all y B2 (x , r)
f (y) f (x ).
In view of Theorem 5.6.1, f (x ) = 0. Therefore, for any y from B2 (x , r) we have:
f (y) = f (x ) + f (x )(y x ), y x + o( y x 2 ) f (x ).
Thus, f (x )s, s 0, for all s, s = 1.
Theorem 5.6.3 Let function f (x) be twice dierentiable on Rn and let x satisfy the fol-
lowing conditions:
f (x ) = 0, f (x ) > 0.
Then x is a strict local optimum of f (x).
(Sometimes, instead of strict, we say the isolated local minimum.)

Proof:
Note that in a small neighborhood of the point x the function f (x) can be represented as
follows:
f (y) = f (x ) + 12 f (x )(y x ), y x + o( y x 2 ).
Since 1r o(r) 0, there exists a value r such that for all r [0, r] we have
r
| o(r) | 1 (f (x )),
4
where 1 (f (x )) is the smallest eigenvalue of matrix f (x ). Recall, that in view of our

assumption, this eigenvalue is positive. Therefore, for any y Bn (x , r) we have:
f (y) f (x ) + 12 1 (f (x )) y x 2 +o( y x 2 )
f (x ) + 14 1 (f (x )) y x 2 > f (x ).
5.6.4 Classes of dierentiable functions

It is well-known that any continuous function can be approximated by a smooth function with
an arbitrary small accuracy. Therefore, assuming the dierentiability only, we cannot derive
any reasonable properties of the minimization processes. For that we have impose some
additional assumptions on the magnitude of the derivatives of the functional components of
our problem. Traditionally, in optimization such assumptions are presented in the form of
Lipschitz condition for a derivative of certain order.
Let Q be a subset of Rn . We denote by CLk,p (Q) the class of functions with the following
properties:
any f CLk,p(Q) is k times continuously dierentiable on Q.
Its pth derivative is Lipschitz continuous on Q with the constant L:
f (p) (x) f (p) (y) L x y
for all x, y Q.
Clearly, we always have p k. If q k then CLq,p (Q) CLk,p(Q). For example, CL2,1 (Q)
CL1,1 (Q). Note also that these classes possess the following property:
if f1 CLk,p
1
(Q), f2 CLk,p
2
(Q) and , R1 , then
f1 + f2 CLk,p
3
(Q)
with L3 =| | L1 + | | L2 .
We use notation f C k (Q) for f , which is k times continuously dierentiable on Q.
The most important class of the above type is CL1,1 (Q), the class of functions with Lip-
schitz continuous gradient. In view of the denition, the inclusion f CL1,1 (Rn ) means
that
f (x) f (y) L x y (5.6.30)
for all x, y Rn . Let us give a sucient condition for that inclusion.
Exercise 5.6.3 +
Function f (x) belongs to CL2,1 (Rn ) if and only if
f (x) L, x Rn . (5.6.31)
This simple result provides us with many representatives of the class CL1,1 (Rn ).
Example 5.6.1 1. Linear function f (x) = + a, x belongs to C01,1 (Rn ) since
f (x) = a, f (x) = 0.
2. For quadratic function
f (x) = + a, x + 12 Ax, x, A = AT ,

5.6. EXERCISES 143
we have:
f (x) = a + Ax, f (x) = A.
Therefore f (x) CL1,1 (Rn ) with L = A .

3. Consider the function of one variable f (x) = 1 + x2 , x R1 . We have:
x 1
f (x) = , f (x) = 1.
1 + x2 (1 + x2 )3/2
Therefore f (x) C11,1 (R).

Next statement is important for the geometric interpretation of functions from the class
CL1,1 (Rn ).
Exercise 5.6.4 +
Let f CL1,1 (Rn ). Then for any x, y from Rn we have:
L
| f (y) f (x) f (x), y x | y x 2 . (5.6.32)
2
Geometrically, this means the following. Consider a function f from CL1,1 (Rn ). Let us x
some x0 Rn and form two quadratic functions
L
1 (x) = f (x0 ) + f (x0 ), x x0 + 2
x x0 2 ,
L
2 (x) = f (x0 ) + f (x0 ), x x0 2
x x0 2 .
Then, the graph of the function f is located between the graphs of 1 and 2 :
1 (x) f (x) 2 (x), x Rn .
Let us consider the similar result for the class of twice dierentiable functions. Our main
2,2
class of functions of that type will be CM (Rn ), the class of twice dierentiable functions
2,2
with Lipschitz continuous Hessian. Recall that for f CM (Rn ) we have
f (x) f (y) M x y (5.6.33)
for all x, y Rn .
Exercise 5.6.5 +
Let f CL2,2 (Rn ). Then for any x, y from Rn we have:
M
f (y) f (x) f (x)(y x) y x 2 . (5.6.34)
2
We have the following corollary of this result:
+ 2,2
Exercise 5.6.6 Let f CM (Rn ) and y x = r. Then
f (x) MrIn f (y) f (x) + MrIn ,
where In is the unit matrix in Rn .

Recall that for matrices A and B we write A B if A B 0 (positive semidenite). Here

is an example of a quite unexpected use of optimization (I would rather say, analytic)
technique. This is one of the facts which makes the beauty of the mathematics. I propose
you to provide an extremely short proof of The Principal Theorem of Algebra, i.e. you are
going to show that a non-constant polynomial has a (complex) root.2
Exercise 5.6.7 Let p(x) be a polynomial of degree n > 0. Without loss of generality we
can suppose that p(x) = xn + ..., i.e. the coecient of the highest degree monomial is 1.
Now consider the modulus |p(z)| as a function of the complex argument z C. Show
that this function has a minimum, and that minimum is zero.
Hint: Since |p(z)| + as |z| +, the continuous function |p(z)| must attain a
minimum on a complex plan.
Let z be a point of the complex plan. Show that for small complex h
p(z + h) = p(z) + hk ck + O(|h|k+1)
for some k, 1 k n and ck = 0. Now, if p(z) = 0 there a choice (which one?) of h small
such that |p(z + h)| < |p(z)|.
2
This proof is tentatively attributed to Hadamard.
Lecture 6
Constrained Minimization
(Penalty Function Methods, Barrier Function Methods, Self-Concordant Barriers, Tracing

the path, Interior-Point Methods)
This lecture is devoted to the penalty and the barrier methods; as far as the underlying
ideas are concerned, these methods implement the simplest approach to constrained opti-
mization approximate a constrained problem by unconstrained ones. Let us look how it is
done.
6.1 Penalty and Barrier function methods

The problem we deal with is as follows:
f (x) min,
(ICP) (6.1.1)
gi (x) 0, i = 1, . . . , m.
where gi (x) are smooth functions. For example, we can consider gi (x) from CL1,1 (Rn ).
Since the components of the problem (6.1.1) are general nonlinear functions, we cannot
expect it would be easier to solve than the unconstrained minimization problem. Indeed,
even the troubles with stationary points, which we have in unconstrained minimization,
appears in (6.1.1) in much stronger form. Note that the stationary points of this problem
(what ever it is?) can be infeasible for the functional constraints and any minimization
scheme, attracted by such point, should admit that it fails to solve the problem.
Therefore, the following reasoning looks almost as the only way to proceed:
1. We have several ecient methods for unconstrained minimization. (?)1

1
In fact, that is not absolutely true. We will see, that in order to apply the unconstrained minimization
method to solving the constrained problems, we need to be sure that we are able to nd at least a strict
local minimum. And we have already seen (Example 5.2.1), that this could be a problem.
145
146 LECTURE 6. CONSTRAINED MINIMIZATION
2. An unconstrained minimization problem is simpler than a constrained one. (?)2
3. Therefore, let us try to approximate the solution of the constrained problem (6.1.1) by
a sequence of solutions of some auxiliary unconstrained problems.
These ideology was implemented in the methods of Sequential Unconstrained Minimiza-

tion. There are two main groups of such method: the penalty function methods and the
barrier methods. Let us describe the basic ideas of this approach.
We start from penalty function methods.
Denition 6.1.1 A continuous function (x) is called a penalty function for a closed set
G if
(x) = 0 for any x G,
(x) > 0 for any x

/ G.
(Sometimes the penalty function is called just penalty.)

The main property of the penalty functions is as follows: If 1 (x) is a penalty function
for G1 and 2 (x) is a penalty function for G2 then 1 (x) + 2 (x) is a penalty function for
/
the intersection G1 G2 .
Let us give several examples of the penalty functions.
Example 6.1.1 Let G = {x Rn | gi (x) 0, i = 1, . . . , m}. Then the following functions

are the penalties for the set G:
m

1. Quadratic penalty: (x) = (gi (x))2+ .
i=1
m

2. Nonsmooth penalty: (x) = (gi (x))+
i=1
(we denote (a)+ = max{a, 0}). The reader can easily continue this list.
The general scheme of the penalty function method is as follows
Penalty Function Method
0. Choose x0 Rn . Choose a sequence of penalty coefficients:
0 < tk < tk+1 , tk .

2
We are not going to discuss the correctness of this statement for the general nonlinear problems. Note
that we have already observed that it is note true for convex problems.
6.1. PENALTY AND BARRIER FUNCTION METHODS 147
1. kth iteration (k 0): Find a point

xk+1 = arg minn {f (x) + tk (x)}
xR
using xk as a starting point.
It is easy to prove the convergence of this scheme assuming that xk+1 is a global minimum
of the auxiliary function.3 Denote
k (x) = f (x) + tk (x), k = minn k (x)
xR
(k is the global optimal value of k (x)).

Let us make the following assumption.
Assumption 6.1.1 There exists t > 0 such that the set
S = {x Rn | f (x) + t(x) f }
is bounded.
Theorem 6.1.1 If the problem (6.1.1) satises Assumption 6.1.1 then
lim f (xk ) = f , lim (xk ) = 0.
k k
Proof:
Note that k k (x ) = f . Further, for any x Rn we have: k+1(x) k (x). Therefore
k+1 k . Thus, there exists a limit
lim k f .
k
If tk > t then
f (xk ) + t(xk ) f (xk ) + tk (xk ) f .
Therefore, the sequence {xk } has limit points.
For any limit point x we have (x ) = 0. Therefore x G and
= f (x ) + (x ) = f (x ) f .
Remark 6.1.1 Note that this result is very general, but not too informative. There are
still many questions, which should be answered. For example, we dont know what kind
of penalty function should we use. What should be the rules for choosing the penalty
coecients? What should be the accuracy for solving the auxiliary problems? The main
feature of this questions is that they cannot be answered in the framework of the general
nonlinear programming theory. Therefore, traditionally, they are considered to be answered
by the computational practice.
3
If we assume that it is a strict local minimum, then the result is much weaker.
Now it is time to look what are our abilities to solve the unconstrained problems
(Pt ) [(x) = f (x) + t(x)] min
which, as we already know, for large t are good approximations of the constrained problem in
question. In principle we can solve these problems by any one of unconstrained minimization
methods we know, and this is denitely a great advantage of the approach.
Remark 6.1.2 There is, anyhow, a severe weak point of the construction to approximate
well the constrained problem by unconstrained one, we must deal with large values of the
penalty parameter, and this, as we shall see in a while, unavoidably makes the unconstrained
problem (Pt ) ill-conditioned and thus very dicult for any unconstrained minimization
methods sensitive to the conditioning of the problem. And all the methods for unconstrained
minimization we know, except, possibly, the Newton method, are sensitive to conditioning
(e.g., in the Gradient Descent the number of steps required to achieve an -solution is,
asymptotically, inverse proportional to the condition number of the Hessian of objective
at the optimal point). Even the Newton method, which does not react on the conditioning
explicitly it is self-scaled suers a lot as applied to an ill-conditioned problem, since here
we are enforced to invert ill-conditioned Hessian matrices, and this, in actual computations
with their rounding errors, causes a lot of troubles. The indicated drawback ill-conditioness
of auxiliary unconstrained problems is the main disadvantage of the straightforward
penalty scheme, and because of it the scheme is not that widely used now and is in many
cases replaced with more smart modied Lagrangian scheme.
Let us consider now the barrier methods.
Denition 6.1.2 A continuous function F (x) is called a barrier function for a closed set G
with nonempty interior if F (x) when x approaches the boundary of the set G.
(Sometimes a barrier function is called barrier for short.)

Similarly to the penalty functions, the barriers possess the following property: If F1 (x)
is a barrier for G1 and F2 (x) is a barrier for G2 then F1 (x) + F2 (x) is a barrier for the
/
intersection G1 G2 .
In order to apply the barrier approach, our problem must satisfy the Slater Condition:

x: gi (
x) < 0, i = 1, . . . , m.
Let us look at some examples of the barriers.
Example 6.1.2 Let G = {x Rn | gi (x) 0, i = 1, . . . , m}. Then all of the following

functions are barriers for G:
m
1
1. Power-function barrier: F (x) = (gi (x))p
, p 1.
i=1
m

2. Logarithmic barrier: F (x) = ln(gi (x)).
i=1
6.1. PENALTY AND BARRIER FUNCTION METHODS 149
m

1
3. Exponential barrier: F (x) = exp gi (x)
.
i=1
The reader can continue this list.
The scheme of the barrier method is as follows.
Barrier Function Method
0. Choose x0 int G. Choose a sequence of penalty coefficients:
0 < tk < tk+1 , tk .
1. kth iteration(k 0): Find a point

1
xk+1 = arg min{f (x) + F (x)}
xG tk
using xk as a starting point.
Let us prove the convergence of this method assuming that xk+1 is a global minimum of
the auxiliary function. Denote
1
Fk (x) = f (x) + F (x), Fk = min Fk (x),
tk xG
(Fk is the global optimal value of Fk (x)).
Assumption 6.1.2 The barrier F (x) is bounded from below: F (x) F for all x G.
Theorem 6.1.2 Let (6.1.1) satises Assumption 6.1.2. Then
lim Fk = f .
k
Proof:
Let x int G. Then
1
lim Fk lim f (
x) + F (
x) = f (
x).
k k tk
Therefore lim Fk f . Further,
k

1 1 1
Fk = min f (x) + F (x) min f (x) + F = f + F .
xG tk xG tk tk
Thus, lim Fk = f .
k
Same as with the penalty functions method, there are many questions to be answered.
We dont know how to nd the starting point x0 and how to choose the best barrier function.
We dont know the rules for updating the penalty coecients and the acceptable accuracy
of the solutions to the auxiliary problems. Finally, we have no idea about the eciency
estimates of this process. And the reason is not in the lack of the theory. Our problem
(6.1.1) is just too complicated.
If I were writing this lecture, say, 20 years ago, I would probably stop here, or added
some complaints about the fact that, in the same way as for the penalty method, the
problems of minimization of normally (when the solution to the original problem is on
the boundary of G; otherwise the problem actually is unconstrained) become the more ill-
conditioned the larger is t, so that the diculties of their numerical solution grow with the
penalty parameter. When indeed writing this lecture, I would say something quite opposite:
there exists important situations when the diculties in numerical minimization of do not
increase with the penalty parameter, and the overall scheme turns out to be theoretically
ecient and, moreover, the best known so far. This change in evaluation of the scheme is the
result of recent interior point revolution in Optimization which I have already mentioned
in Lecture 5.
6.2 Self-concordant barriers and path-following scheme

Assume from now on that our problem (ICP) is convex (the revolution we are speaking
about deals, at least directly, with Convex Programming only). It is well-known that convex
program (ICP) can be easily rewritten as a program with linear objective; indeed, it suces
to extend the vector of design variables by one variable, let it be called t, more and to rewrite
(ICP) as the problem
t min | gj (x) 0, j = 1, ..., k, f (x) t 0.
The resulting problem still is convex and has linear objective.
To save notation, assume that the outlined conversion is carried out in advance, so that
already the original problem has linear objective and is therefore of the form
(P) f (x) cT x min | x G Rn .
Here the feasible set G of the problem is convex; we also assume from now on that it is
closed, bounded and possesses a nonempty interior.
Let us denote
1
(Pt ) [Ft (x) f (x) + F (x)] min
t
Our abilities to solve (P) eciently by an interior point method depend on our abilities to
point out a good interior penalty F for the feasible domain. What we are interested in is
a -self-concordant barrier F ; the meaning of these words is given by the following
Denition 6.2.1 [Self-concordant barrier] Let 1. We say that a function F is -self-
concordant barrier for the feasible domain D of problem (P), if
6.2. SELF-CONCORDANT BARRIERS AND PATH-FOLLOWING SCHEME 151
F is self-concordant function on int G (Section 5.4, Lecture 5), i.e., three times con-
tinuously dierentiable convex function on int G possessing the barrier property (i.e.,
F (xi ) along every sequence of points xi int G converging to a boundary point
of G) and satisfying the dierential inequality
d3 3/2
T
| | t=0 F (x + th)| 2 h F (x)h x int G h Rn ;
dt3
F satises the dierential inequality

$
|hT F (x)| hT F (x)h x int G h Rn . (6.2.2)
An immediate example is as follows (cf. Section 5.4.4, Lecture 5):
Example 6.2.1 [Logarithmic barrier for a polytope] Let
G = {x Rn | aTj x bj , j = 1, ..., m}
be a polytope given by a list of linear inequalities satisfying the Slater condition (i.e., there
exists x such that aTj x < bj , j = 1, ..., m). Then the function
m

F (x) = ln(bj aTj x)
j=1
is an m-self-concordant barrier for G.
In a moment we will justify this example and will consider the crucial issue of how to nd
a self-concordant barrier for a given feasible domain. For the time being, let us focus on
another issue: how to solve (P), given a -self-concordant barrier for the feasible domain of
the problem.
What we intend to do is to use the path-following scheme associated with the barrier
certain very natural implementation of the barrier method.
6.2.1 Path-following scheme

When speaking about the barrier scheme, we simply claimed that the minimizers of the
aggregate Ft approach, as t , the optimal set of (P). The immediate recommendation
coming from this claim could be: choose large enough value of the penalty and solve,
for the chosen t, the problem (Pt ) of minimizing Ft , thus coming to a good approximate
solution to (P). It makes, anyhow, sense to come to the solution of (Pt ) gradually, by solving
sequentially problems (Ptk )along an increasing sequence of values of the penalty parameter.
Namely, assume that the barrier F we use is nondegenerate, i.e., F (x) is positive denite at
every point x int G (note that this indeed is the case when F is self-concordant barrier for
a bounded feasible domain, see Proposition 5.4.1.(i)). Then the optimal set of Ft for every
positive t is a singleton (we already know that it is nonempty, and the uniqueness of the
minimizer follows from convexity and nondegeneracy of Ft ). Thus, we have a path
x (t) = argmin Ft ();
int G
as we know from Theorem 6.1.2, this path converges to the optimal set of (P) as t ; be-
sides this, it can be easily seen that the path is continuous (even continuously dierentiable)
in t. In order to approximate x (t) with large values of t via the path-following scheme, we
trace the path x (t), namely, generate sequentially approximations x(ti ) to the points x (ti )
along certain diverging to innity sequence t0 < t1 < ... of values of the parameter. This is
done as follows:
given tight approximation x(ti ) to x (ti ), we update it into tight approximation
x(ti+1 ) to x (ti+1 ) as follows:
rst, choose somehow a new value ti+1 > ti of the penalty parameter
second, apply to the function Fti+1 () a method for unconstrained minimization started
at x(ti ), and run the method until closeness to the new target point x (ti+1 ) is restored,
thus coming to the new iterate x(ti+1 ) close to the new target point of the path.
Our hope is that since x (t) is continuous in t and x(ti ) is close to x (ti ), for not too
large ti+1 ti the point x(ti ) will be not too far from the new target point x (ti+1 ), so
that the unconstrained minimization method we use will quickly restore closeness to the new
target point. With this gradual movement, we may hope to arrive near x (t) with large t
faster than by attacking the problem (Pt ) directly.
All this was known for many years; and the progress during last decade was in trans-
forming these qualitative ideas into exact quantitive recommendations.
Namely, it turned out that
A. The best possibilities to carry this scheme out are when the barrier F is -self-
concordant; the less is the value of , the better;
B. The natural measure of closeness of a point x int G to the point x (t) of the
path is the Newton decrement of the self-concordant function
t (x) = tFt (x) tcT x + F (x)
at the point x, i.e., the quantity
$
(t , x) = [t (x)]T [t (x)]1 t (x)
(cf. Proposition 5.4.1.(iii)). More specically, the notion x is close to x (t) is conve-
nient to dene as the relation
(t , x) 0.05 (6.2.3)
(in fact, 0.05 in the right hand side could be replaced with arbitrary absolute constant
< 1, with slight modication of subsequent statements; I choose this particular value
for the sake of simplicity)
Now, what do all these words the best possibility and natural measure actually mean?
It is said by the following two statements.
C. Assume that x is close, in the sense of (6.2.3), to a point x (t) of the path x ()
associated with a -self-concordant barrier for the feasible domain G of problem (P).
Let us increase the parameter t to the larger value

+ 0.08
t = 1+ t (6.2.4)

and replace x by its damped Newton iterate (cf. (5.4.20), Lecture 5)
1
x+ = x [+ (x)]1 t+ (x). (6.2.5)
1 + (t+ , x) t
Then x+ is close, in the sense of (6.2.3), to the new target point x (t+ ) of the path.
C. says that we are able to trace the path (all the time staying close to it in the sense of
B.) increasing the penalty parameter linearly in the ratio (1 + 0.081/2 ) and accompanying
each step in the penalty parameter by a single Newton step in x. And why we should be
happy with this, it is said by
D. If x is close, in the sense of (6.2.3), to a point x (t) of the path, then the inaccuracy,
in terms of the objective, of the point x as of an approximate solution to (P) is bounded
from above by 2t1 :
2
f (x) min f (x) . (6.2.6)
xG t
D. says that the inaccuracy of the iterates x(ti ) formed in the above path-following
procedure goes to 0 as 1/ti , while C. says that we are able increase ti linearly, at the cost of
a single Newton step per each updating of t. Thus, we come to the following
Theorem 6.2.1 Assume that we are given
(i) -self-concordant barrier F for the feasible domain G of problem (P)
(ii) starting pair (x0 , t0 ) with t0 > 0 and x0 being close, in the sense of (6.2.3), to the
point x (t0 ).
Consider the path-following method (cf. (6.2.4) - (6.2.5))

0.08 1
ti+1 = 1+ ti ; xi+1 = xi [ (xi )]1 ti+1 (xi ). (6.2.7)
1 + (ti+1 , xi ) ti+1
Then the iterates of the method are well-dened, belong to the interior of G and the method
possesses linear global rate of convergence:
t
2 0.08
f (xi ) min f 1+ . (6.2.8)
G t0
In particular, to make the residual in f less than a given > 0, it suces to perform no
more that
20
N() 20 ln 1 + (6.2.9)
t0
Newton steps.
We see that the parameter of the self-concordant barrier underlying the method is respon-
sible for the Newton complexity of the method the factor at the log-term in the complexity
bound (6.2.9).
Remark 6.2.1 The presented result does not explain how to start tracing the path how
to get initial pair (x0 , t0 ) close to the path. This turns out to be a minor diculty: given in
advance a strictly feasible solution x to (P), we could use the same path-following scheme
(applied to certain articial objective) to come close to the path x (), thus arriving at a
position from which we can start tracing the path. In our very brief outline of the topic, it
makes no sense to go in these details of initialization; it suces to say that the necessity to
start from approaching x () basically does not violate the overall complexity of the method.
It makes sense if not to prove the aforementioned statements the complete proofs,
although rather simple, go beyond the scope of our today lecture but at least to motivate
them to explain what is the role of self-concordance and magic inequality (6.2.2) in
ensuring properties C. and D. (this is all we need the Theorem, of course, is an immediate
consequence of these two properties).
Let us start with C. this property is much more important. Thus, assume we are at a
point x close, in the sense of (6.2.3), to x (t). What this inequality actually says?
Let us denote by
hH 1 = (hT H 1 h)1/2
the scaled Euclidean norm given by the inverse to the Hessian matrix
H t (x) = F (x)
(the equality comes from the fact that t and F dier by a linear function tf (x) tcT x).
Note that by denition of (, ) one has
(s , x) = s (x)H 1 tc + F (x)H 1 .
Due to the last formula, the closeness of x to x (t) (see (6.2.3)) means exactly that
t (x)H 1 tc + F (x)H 1 0.05,
whence, by the triangle inequality,

tcH 1 0.05 + F (x)H 1 0.05 + (6.2.10)
(the concluding inequality here is given by (6.2.2) 4) , and this is the main point where this
component of the denition of a self-concordant barrier comes into the play).
From the indicated relations
(t+ , x) = t+ c + F (x)H 1 (t+ t)cH 1 + tc + F (x)H 1 =
t+ t
= tcH 1 + (t , x)
t
[see (6.2.4), (6.2.10)]
0.08 1
(0.05 + ) + 0.05 0.134 ( )
4
(note that 1 by Denition 6.2.1). According to Proposition 5.4.1.(iii.3), Lecture 5,
the indicated inequality says that we are in the domain of quadratic convergence of the
damped Newton method as applied to self-concordant function t+ ; namely, the indicated
Proposition says that
2(0.134)2
(t+ , x+ ) < 0.05.
1 0.134
as claimed in C.. Note that this reasoning heavily exploits self-concordance of F
To establish property D., it requires to analyze in more details the notion of a self-
concordant barrier, and I am not going to do it here. Just to demonstrate where comes
from, let us prove an estimate similar to (6.2.6) for the particular case when, rst, the
barrier in question is the standard logarithmic barrier given by Example 6.2.1 and, second,
the point x is exactly the point x (t) rather than is close to the latter point. Under the
outlined assumptions we have
x = x (t) t (x) = 0
[substitute expressions for t and F ]
m
aj
tc + T
=0
j=1 bj aj x
[take inner product with x x , x being an optimal solution to (P)]

m
aTj (x x)
tcT (x x ) =
j=1 bj aTj x
[take into account that aTj (x x) = aTj x aTj x bj aTj x due to x G]
m,
whence
m
cT (x x )

t t
(for the case in question = m). This estimate is twice better than (6.2.6) this is because
we have considered the case of x = x (t) rather than the one of x close to x (t).
4
indeed, for a positive denite symmetric matrix H it clearly is the same (why?) to say that |g|H 1
and to say that |g T h| |h|H for all h
6.2.2 Applications
Linear Programming. The most famous (although, I believe, not the most important)
application of Theorem 6.2.1 deals with Linear Programming, when G is a polytope and
F is the standard logarithmic barrier for this polytope
(see Example 6.2.1). For this case,
5)
the Newton complexity of the method is O( m), m being the # of linear inequalities
involved into the description of G. Each Newton step costs, as it is easily seen, O(mn2 )
arithmetic operations, so that the arithmetic cost per accuracy digit number of arithmetic
operations required to reduce current inaccuracy by absolute constant factor turns out
to be O(m1.5 n2 ). Thus, we get a polynomial time solution method for LP with very nice
complexity characteristics, typically (for m and n of the same order) better than those, e.g.,
for the Ellipsoid method. Note also that with certain smart implementation of Linear
Algebra, the above arithmetic cost can be reduced to O(mn2 ); this is the best known so far
cubic in the size of the problem upper complexity bound for Linear Programming.
To increase list of application examples, note that our abilities to solve in the outlined
style a convex program of a given structure are limited only by our abilities to point out self-
concordant barrier for the corresponding feasible domain. In principle, there are no limits
at all it can be proved that every closed convex domain in Rn admits a self-concordant
barrier with the value of parameter at most O(n). This universal barrier is given by
certain multivariate integral and is too complicated for actual computations; recall that we
should form and solve Newton systems associated with our barrier, so that we need it to be
explicitly computable.
Thus, we come to the following important question:
How to construct explicit self-concordant barriers. There are many cases when we
are clever enough to point out explicitly computable self-concordant barriers for convex
domains we are interested in. We already know one example of this type Linear Program-
ming (although we do not know to the moment why the standard logarithmic barrier for a
polytope given by m linear constraints is m-self-concordant). What helps us to construct
self-concordant barriers and to evaluate their parameters are the following extremely sim-
ple combination rules, completely similar to those for self-concordant functions (see Section
5.4.4, Lecture 5):
[Linear combination with coecients 1] Let Fi , i = 1, ...m, be i -self-concordant

barriers for the closed convex domains Gi , let the intersection G of these domains
possess a nonempty interior Q, and let i 1, i = 1, ..., m, be given reals. Then the
function
m

F (x) = i Fi (x)
i=1
m
is ( i=1 i i )-self-concordant barrier for G.
5
recall that it is the factor at the logarithmic term in (6.2.9), i.e., the # of Newton steps sucient
to reduce current inaccuracy by an absolute constant factor, say, by factor 2; cf. with the stories about
polynomial time methods from Lecture 7
[Ane substitution] Let F (x) be -self-concordant barrier for the closed convex domain
G Rn , and let x = A + b be an ane mapping from Rk into Rn with the image
intersecting int G. Then the composite function
F + () = F (A + b)
is -self-concordant barrier for the closed convex domain
G+ = { | A + b G}
which is the inverse image of G under the ane mapping in question.
The indicated combination rules can be applied to the raw materials as follows:
[Logarithm] The function
ln(x)
is 1-self-concordant barrier for the nonnegative ray R+ = {x R | x > 0};
[the indicated property of logarithm is given by 1-line computation]
[Extension of the previous example: Logarithmic barrier, linear/quadratic case] Let
G = cl{x Rn | j (x) < 0, j = 1, ..., m}
be a nonempty set in Rn given by m convex quadratic (e.g., linear) inequalities satis-
fying the Slater condition. Then the function
m

f (x) = ln(i (x))
i=1
is m-self-concordant barrier for G.

Note that for the case when all the functions fi are linear, the conclusion immediately follows
from the Combination rules (a polyhedral set given by m linear inequalities is intersection
of m half-spaces, and a half-space is inverse image of the nonnegative axis under ane
mapping; applying the combination rules to the barrier ln x for the nonnegative ray, we
get, without any computations, that the standard logarithmic barrier is for a polyhedral
set is m-self-concordant). For the case when there are also quadratic forms among fi , one
needs several lines of computations. Note that the case of linear fi covers the entire Linear
Programming, while the case of convex quadratic fi covers much wider family of quadratically
constrained convex quadratic problems. Furthermore, the following examples seem much
more important:
The function
F (t, x) = ln(t2 xT x)
is 2-self-concordant barrier for the ice-cream cone
Kn+ = {(t, x) R Rn | t |x|};
the function
F (X) = ln Det X
is n-self-concordant barrier for the cone Sn+ of positive denite symmetric nn matrices.
One hardly could imagine how wide is the class of applications from Combinatorial op-
timization to Structural Design and Stability Analysis/Synthesis in Control of the latter
two barriers, especially of the Log-Det-one.
6.2.3 Concluding remarks

The path-following scheme results in the most ecient, in terms of the worst-case complexity
analysis, interior point methods for Linear and Convex Programming. From the practical
viewpoint, anyhow, this scheme, in its aforementioned form, looks bad. The severe practical
drawback of the scheme is its short-step nature according to the scheme, the penalty
parameter should be updated by the programmed rule (6.2.4), and this makes the actual
performance of the method more or less close to the one given by (6.2.9). Thus, in the
straightforward implementation of the scheme the complexity estimate (6.2.9) will be not just
the theoretical upper bound on the worst-case complexity of the method, but the indication
of the typical performance of the algorithm. And a method actually working according
to the complexity estimate (6.2.9) could be ne theoretically, but it denitely will be of
very restricted practical interest in the large-scale case. E.g., in LP program with m
105 inequality constraints and n 104 variables (these are respectable, but in no sense
outstanding sizes for a practical LP program) estimate (6.2.9) predicts something like
hundreds of Newton steps with Newton systems of the size 104 104 ; even in the case
of good sparsity structure of the systems, such a computation would be much more time
consuming than the one given by the classical Simplex Method.
In order to get practical path-following methods, we need a long-step tactics rules
for on-line adjusting the stepsizes in the penalty parameter to, let me say, local curvature
of the path, rules which allow to update parameter as fast as possible possible from the
viewpoint of the actual numerical circumstances rather than from the viewpoint of very
conservative theoretical worst-case complexity analysis.
Today there are ecient long-step policies of tracing the paths, policies which are
both ne theoretically (i.e., satisfy complexity bound (6.2.9)) and very ecient computa-
tionally. Extremely surprising phenomenon here is that for good long-step path-following
methods as applied to convex problems of the most important classes (Linear Programming,
Quadratically Constrained Convex Quadratic Programming and some other) it turns out
that
the actually observed number of Newton iterations required to solve the prob-
lem within reasonable accuracy is basically independent of the sizes of the prob-
lem and is within 30-50, even 20 iteration for most situations.
This empirical fact (which can be only partly supported by theoretical considerations, not
proved completely) is extremely important for applications; it makes polynomial time interior
point methods the most attractive (and sometimes - the only appropriate) optimization tool
in many important large-scale applications.
I should add that the ecient long-step implementations of the path-following scheme
are relatively new, and for a long time6) the only interior point methods which demonstrated
the outlined data- and size-independent convergence rate were the so called potential re-
duction interior point methods. In fact, the very rst interior point method the method
of Karmarkar for LP which initialized the entire interior point revolution, was a potential
reduction algorithm, and what indeed caused the revolution was outstanding practical per-
formance of this method. The method of Karmarkar possesses a very nice (and in fact very
simple) geometry and is closely related to the interior penalty scheme; anyhow, time limita-
tions enforce me to skip description of this wonderful, although an old-fashioned, algorithm.
The concluding remark I would like to make is as follows: all polynomial time implemen-
tations of the penalty/barrier scheme known so far are those of the barrier scheme (which
is reected in the name of these implementations: interior point methods); numerous at-
tempts to do something similar with the penalty approach failed to be successful. It is a
pity, due to some attractive properties of the scheme (e.g., here you do not meet with the
problem of nding a feasible starting point, which, of course, is needed to start the barrier
scheme).
6
if I could qualify as long a part of the story which the entire story started in 1984

Cours D'optimisation

Enviado por

Dados do documento

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Cours D'optimisation

Enviado por

Direitos autorais:

Formatos disponíveis

1

EFFICIENT METHODS IN OPTIMIZATION

2 When everything is simple: 1-dimensional Convex Optimization 19

3 Methods with linear convergence 37

4 Large-scale optimization problems 73

5 Nonlinear programming: Unconstrained Minimization 109

6 Constrained Minimization 145

1.1 General formulation of the problem

1.1.1 Problem formulation and terminology

s.t.: gj (x) & 0, j = 1 . . . m, (1.1.1)

where & could be , or =.

is called the feasible set of the problem (1.1.1).1

Smooth problems: all gj (x) are dierentiable.

Nonsmooth problems: there is a nondierentiable component gk (x),

Linearly constrained problems: all functional constraints are linear:

(here ,  stands for the inner product in Rn ), and G is a polyhedron.

Problem (1.1.1) is called feasible if Q = .

Finally, we distinguish dierent types of solutions to (1.1.1):

x is called the optimal global solution to (1.1.1) if

f (x ) f (x) for all x Q

(global minimum). Then f (x ) is called the optimal value of the problem.

x is called a local solution to (1.1.1) if

f (x ) f (x) for all x int Q

Thus, we come up with the problem:

Then we can consider the problem:

(may be with some additional constraints on x).

Thus, we could treat also the {0,1} Programming Problems:

Nonlinear Optimization is very important and perspective application theory.

In general, optimization problems are unsolvable.

1.1.2 Performance of Numerical Methods

Performance of M on P is the total amount of computational eorts, which is

To nd an approximate solution to M with a small accuracy > 0.

First-order oracle: the value f (x) and the gradient f (x).

1.2 Complexity bounds for Global Optimization

min f (x). (1.2.4)

In our terminology, that is an unconstrained minimization problem without functional con-

The objective function f (x) is Lipshitz continuous on Bn :

with some constant L (Lipshitz constant).

Scheme of the method G(p). (1.2.5)

1. Form (p + 1)n points

(here and in the sequel we write x y for x, y Rn if and only if xi yi , i = 1, . . . , n).

They are based on the Black Box concept.

We need: 32 000 years.

then we have to form the discrete sum

If f (x) is Lipshitz continuous then the result can be used as an approximation to I:

1.3 Identity cards of the elds

Description of the goals.

Description of the problem class.

Description of the oracle.

Name: Global Optimization. Goals: Find a global minimum.

Name: Nonlinear Programming.

Name: Convex Optimization.

Desired properties: Convergence to a global minimum. Rate of convergence

Name: Interior-Point Polynomial Methods.(Would be a nice subject for an ad-

1.4 Rules of the game

1.5 Suggested reading

J. Cea: Optimisation, theorie et algorithmes Dunod, Paris (1971).

A. Auslender: Optimisation. Methodes numeriques, Masson, Paris (1976).

P.J. Laurent: Approximation et Optimisation, Hermann (1972)

A. Ben-Tal, A. Nemirovski, Lectures on Modern Convex Optimization, MPS-SIAM

Yu. Nesterov Introductory Lectures On Convex Optimization: A Basic Course,

S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press

The latter book is available on the S. Boyds page: http://www.stanford.edu/ boyd/index.html

(here , stands for the inner product in Rn ), and G is a polyhedron.

i.e., means that |x|A x . The inclusion B W , by similar reasons, implies

x O(p, x) = (f (x)(, f (x); g1 (x), g1 (x); ...; gm (x), gm