Escolar Documentos
Profissional Documentos
Cultura Documentos
CONVEX OPTIMIZATION
A Basic Course
Applied Optimization
Volume 87
Series Editors:
Panos M. Pardalos
University ofFlorida, US.A.
Donald W. Heam
University ofFlorida, US.A.
INTRODUCTORY LECTURES ON
CONVEX OPTIMIZATION
A Basic Course
By
Yurii Nesterov
Center of Operations Research and Econometrics, (CORE)
Universite Catholique de Louvain (UCL)
Louvain-la-Neuve, Belgium
''
~
Nesterov, Yurri
Introductory Lectures on Convex Optimization: A Basic Course
ISBN 978-1-4613-4691-3
ISBN 978-1-4419-8853-9 (eBook)
DOI 10.1007/978-1-4419-8853-9
All rights reserved. No part ofthis publication may be reproduced, stored in a retrieval system
or transmitted in any form or by any means, electronic, mechanical, photo-copying,
microfilming, recording, or otherwise, without the prior written perrnission ofthe publisher, with
the exception ofany material supplied specifrcally for the purpose ofbeing entered and executed
on a computer system, for exclusive use by the purchaser ofthe work.
Permissions for books published in the USA: permj ssi onswkap com
Permissions for books published in Europe: permissions@wkap.nl
Printedon acid-free paper.
Contents
Preface
Acknowledgments
Introduction
1. NONLINEAR OPTIMIZATION
1.1 World of nonlinear optimization
1.1.1 General formulation of the problern
1.1.2 Performance of numerical methods
1.1.3 Complexity bounds for global optimization
1.1.4 Identity cards of the fields
1.2 Local methods in unconstrained minimization
1.2.1 Relaxation and approximation
1.2.2 Classes of differentiable functions
1.2.3 Gradient method
1.2.4 Newton method
lX
xiii
XV
1
1
1
4
7
13
15
15
20
25
32
37
42
46
51
51
51
58
63
66
1.3
37
vi
2.1.5
Gradient method
2.2 Optimal Methods
2.2.1 Optimal methods
2.2.2 Convex sets
2.2.3 Gradient mapping
2.2.4 Minimization methods for simple sets
2.3 Minimization problern with smooth components
2.3.1 Minimax problern
2.3.2 Gradient mapping
2.3.3 Minimization methods for minimaxproblern
2.3.4 Optimization with functional constraints
2.3.5 Method for constrained minimization
68
71
71
81
86
87
90
90
93
96
100
105
111
111
111
117
121
124
126
130
135
135
138
141
144
146
149
156
157
158
160
164
4. STRUCTURAL OPTIMIZATION
4.1 Self-concordant functions
4.1.1 Black box concept in convex optimization
4.1.2 What the Newton method actually does?
4.1.3 Definition of self-concordant function
171
171
171
173
175
vii
Contents
4.1.4
4.1.5
Main inequalities
Minimizing the self-concordant function
4.2
Self-concordant barriers
4.2.1 Motivation
4.2.2 Definition of self-concordant barriers
4.2.3 Main inequalities
4.2.4 Path-following scheme
4.2.5 Finding the analytic center
4.2.6 Problems with functional constraints
4.3
181
187
192
192
193
196
199
203
206
210
210
213
216
220
224
227
Bibliography
231
References
233
Index
235
Preface
It was in the middle of the 1980s, when the seminal paper by Karmarkar opened a new epoch in nonlinear optimization. The importance
of this paper, containing a new polynomial-time algorithm for linear optimization problems, was not only in its complexity bound. At that time,
the most surprising feature of this algorithm was that the theoretical prediction of its high efficiency was supported by excellent computational
results. This unusual fact dramatically changed the style and directions of the research in nonlinear optimization. Thereafter it became
more and more common that the new methods were provided with a
complexity analysis, which was considered a better justification of their
efficiency than computational experiments. In a new rapidly developing field, which got the name "polynomial-time interior-point methods",
such a justification was obligatory.
Afteralmost fifteen years of intensive research, the main results of this
development started to appear in monographs [12, 14, 16, 17, 18, 19].
Approximately at that time the author was asked to prepare a new course
on nonlinear optimization for graduate students. The idea was to create
a course which would reflect the new developments in the field. Actually,
this was a major challenge. At the time only the theory of interior-point
methods for linear optimization was polished enough to be explained to
students. The general theory of self-concordant functions had appeared
in print only once in the form of research monograph [12]. Moreover,
it was clear that the new theory of interior-point methods represented
only a part of a general theory of convex optimization, a rather involved
field with the complexity bounds, optimal methods, etc. The majority
of the latter results were published in different journals in Russian.
The book you see now is a result of an attempt to present serious
thingsinan elementary form. As is always the case with a one-semester
course, the most difficult problern is the selection of the material. For
Louvain-la-Neuve, Belgium
May, 2003.
To my wife Svetlana
Acknowledgments
This book is a refiection of the main achievements in convex optimization, the field in which the author has worked for more than twenty five
years. During all these years the author has had the exceptional opportunity to communicate and collaborate with the top-level scientists in
the field. I am greatly indebted to many of them.
I was very lucky to start my scientific career in Moscow at the time of
decline of the Soviet Union, which managed to gather in a single city the
best brains of a 300-million population. The contacts with A. Antipin,
Yu. Evtushenko, E. Golshtein, A. Ioffe, V. Karmanov, L. Khachian,
R. Polyak, V. Pschenichnyj, N. Shor, N. Tretiakov, F. Vasil'ev, D. Yudin,
and, of course, with A. Nemirovsky and B. Polyak, were invaluable in
forming the directions and priorities of my research.
I was very lucky to move to the West at a very important moment
in time. For nonlinear optimization that was the era of interior-point
methods. That was the time, when a new paper was announced almost
every day, and a time of open contacts and interesting conferences. I am
very thankful to my colleges Kurt Anstreicher, Freddy Auslender, Rony
Ben-Tal, Rob Freund, Jean-Louis Goffin, Don Goldfarb, Osman Guller,
Florian Jarre, Ken Kortanek, Claude Lemarechal, Olvi Mangasarian,
Florian Potra, Jim Renegar, Kees Roos, Tamas Terlaky, Mike Todd,
Levent Tuncel and Yinyu Ye for interesting discussions and cooperation.
Special thanks to Jean-Philippe Vial, the author of the idea of writing
this book.
Finally, I was very lucky to find myself at the Center of Operations
Research and Econometrics (CORE) in Louvain-la-Neuve, Belgium. The
excellent working conditions of this research center and the exceptional
environment were very helpful during all these years. It is impossible to
overestimate the importance of the spirit of research, which is created
and maintained here by my colleagues Vincent Blondel, Yves Genin,
xiv
Michel Gevers, Etienne Laute, Yves Poches, Yves Smeers, Paul Van
Dooren and Laurence Wolsey, both coming from CORE and CESAME,
a research center of the Engineering department ofUniversite Catholique
de Louvain (UCL). The research activity of the author during many
years was supported by the Belgian Program on Interuniversity Poles of
Attraction initiated by the Belgian State, Prime Minister's Office and
Science Policy Programming.
Introduction
Optimization problems arise naturally in different fields of applications. In many situations, at some point we get a craving to arrange
things in a best possible way. This intention, converted into a mathematical form, becomes an optimization problern of a certain type. Depending on the field of interest, it could be an optimal design problem, an
optimal control problem, an optimal location problem, an optimal diet
problem, etc. However, the next step, finding a solution to the mathematical model, is far from trivial. At first glance, everything Iooks very
simple: many commercial optimization packages are easily available and
any user can get a "solution" to the model just by clicking on an icon
on the screen of his/her personal computer. The question is, what do
we actually get? How much can we trust the answer?
One of the goals of this course is to show that, despite their attraction,
the proposed "solutions" of general optimization problems very often
cannot satisfy the expectations of a naive user. In our opinion, the main
fact, which should be known to any person dealing with optimization
models, is that in general optimization problems are unsolvable. This
Statement, which is usually missing in standard optimization courses,
is very important for an understanding of optimization theory and its
development in the past and in the future.
In many practical applications the process of creating a model can take
a Iot of time and effort. Therefore, the researchers should have a clear
understanding of the properties of the model they are constructing. At
the stage of modelling, many different tools can be used to approximate
the real situation. And it is absolutely necessary to understand the
computational consequences of each decision. Very often we have to
xv1
1 More
INTRODUCTION
xvii
The structure of the book is as follows. It consists of four relatively independent chapters. Each chapter includes three sections, each of which
corresponds approximately to a two-hour lecture. Thus, the contents of
the book can be directly used for a standard one-semester course.
Chapter 1 is devoted to generat optimization problems. In Section 1.1 we introduce the terminology, the notions of oracle, black box,
functional model of an optimization problern and the complexity of general iterative schemes. We prove that global optimization problems are
"unsolvable" and discuss the main features of different fields of optimization theory. In Section 1.2 we discuss two main local unconstrained
minimization schemes: the gradient method and the Newton method.
We establish their local rates of convergence and discuss the possible difficulties {divergence, convergence to a saddle point). In Section 1.3 we
compare the formal structures of the gradient and the Newton method.
This analysis leads to the idea of a variable metric. We describe quasiNewton methods and conjugate gradients schemes. We conclude this section with an analysis of sequential unconstrained minimization schemes.
In Chapter 2 we consider smooth convex optimization methods. In
Section 2.1 we analyze the main reason for the difficulties encountered
in the previous chapter and from this analysis derive two good functional classes, the class of smooth convex functions and that of smooth
strongly convex functions. For corresponding unconstrained minimization problems we establish the lower complexity bounds. We conclude
this section with an analysis of a gradient scheme, which demonstrates
that this method is not optimal. The optimal schemes for smooth convex minimization problems are discussed in Section 2.2. We start from
the unconstrained minimization problem. After that we introduce convex sets and define a notion of gradient mapping for a minimization
problern with simple constraints. We show that the gradient mapping
can formally replace a gradient step in the optimization schemes. In
Section 2.3 we discuss more complicated problems, which involve several smooth convex functions, namely, the minimax problern and the
constrained minimization problem. For both problems we introduce the
notion of gradient mapping and present the optimal schemes.
Chapter 3 is devoted to the theory of nonsmooth convex optimization. Since we do not assume that the reader has a background in convex
analysis, the chapter is started by Section 3.1, which contains a compact
presentation of all necessary facts. The final goal of this section is to
justify the rules for computing the subgradients of a convex function.
The next Section 3.2 starts from the lower complexity bounds for nonsmooth optimization problems. After that we present a general scheme
for the complexity analysis of the corresponding methods. We use this
xviii
Chapter 1
NONLINEAR OPTIMIZATION
1.1
1.1.1
Let us start by fixing the mathematical form of our main problern and
the standard terminology. Let x be an n-dimensional real vector:
X=
(1.1.1)
s,
Q = {x ES
/j(x)::; 0, j
= 1 ... m}
=Rn.
= (aj,x) +bj, j
= 1 ... m,
i=l
Nonlinear Optimization
EXAMPLE
(1.1.2)
EXAMPLE
sin(7rx(i)) = 0,
i = 1 ... n.
:5
/j(x)
:5 bj,
j = 1 ... m,
XE S,
. ( 7l'X (i))
Sill
= 0, . = 1 ... n.
~
1.1.2
Nonlinear Optimization
always precise for particular problern classes. Now we can give a formal
definition of the problern dass:
:F
=(2:, 0, 1;).
Nonlinear Optimization
usually can be easily obtained frorn the analytical cornplexity and the
cornplexity of the oracle. Therefore, in this course we will speak rnainly
about bounds on the analytical cornplexity for some problern classes.
There is one standard assumption on the oracle, which allows us to
obtain the majority of the results on the analytical cornplexity of optimization problems. This assurnption is called the local black box concept
and it Iooks as follows:
1.1.3
I f(x)-
f(y)
I~
II X- Y lloo
Vx, Y E Bn,
(1.1.5)
uniform grid method. This method g(p) has one integer input parameter
p. Its scheme is as follows.
Method Q(p)
1. Form (p + 1) n points
X
c . .)T
:!J.!2.
!n.
p'p''p
'
(1.1.6)
2. Among all points X(i 1,... ,in) find a point x, which has
the minimal value of objective function.
3. Return the pair (x, f(x)) as a result.
Thus, this method forms a uniform grid of the test points inside the
box Bn, computes the minimum value of the objective over this grid and
returns this value as an approximate solution to problern (1.1.4). In our
terminology, this is a zero-order iterative method without any inuence
Nonlinear Optimization
1.1.1 Let
f*
Then
f(x)- !* ::; ~
+l,t2+1,
-< x* -< X (tJ
... ,tn)
= X (tJ ,t2,
- y
... ,tn+l)
X -
Denote
= {
x<i)
lt is clear that
I xSince
'
x*
otherwise.
= 1 ... n. Therefore
Let us finish the definition of our problern dass. Define our goal as
follows:
(1.1. 7)
Find x E Bn : j(x)- j*::; .
Then we immediately get the following result.
1.1.1 Analytical complexity of the problern class {1.1.4),
{1.1.5), {1.1. 7) for method g is at most
COROLLARY
A(Q) = (
l~J +
2r.
10
Thus, A(Q) justifies an upper complexity bound for our problern class.
This result is quite informative, but we still have some questions.
Firstly, it may happen that our proof is too rough and the real performance of Q(p) is much better. Secondly, we still cannot be sure that
Q(p) is a reasonable method for solving (1.1.4). There may exist other
schemes with much higher performance.
In order to answer these questions, we need to derive lower complexity
bounds for the problern dass (1.1.4), (1.1.5), (1.1.7). The rnain features
of such bounds are as follows.
They are based on the black box concept.
These bounds are valid for all reasonable iterative schernes. Thus,
they provide us with a lower estirnate for analytical complexity on
the problern dass.
Very often such bounds ernploy the idea of the resisting oracle.
For us only the notion of the resisting orade is new. Therefore, Iet us
discuss it in rnore detail.
A resisting oracle tries to create a worst problern for each particular
rnethod. It starts frorn an "ernpty" function and it tries to answer each
call of the rnethod in the warst possible way. However, the answers rnust
be compatible with the previous answers and with the description of the
problern dass. Then, after terrnination of the method it is possible to
reconstruct a problern, which fits cornpletely the final inforrnation set
accurnulated by the algorithrn. Moreover, if we launch this rnethod on
this problern, it will reproduce the sarne sequence of the test points since
it will have the sarne sequence of answers frorn the orade.
Let us show how that works for problern (1.1.4). Consider the class
of problems C defined as follows:
Model:
rnin f(x),
xEBn
f*
E.
11
Nonlinear Optimization
r.
THEOREM
1.1. 2 For
liiJ
Proof: Derrote p =
(~ 1). Assurne that there exists a method,
which needs N < pn calls of oracle to solve any problern from C. Let us
apply this method to the following resisting strategy:
Oracle returns f(x)
= 0 at any test
point x.
x,
1
B
+ Pe
E n,
= (1, ... , 1) T
](x)
f(x) = 0. However,
E Rn,
= {x I x ~ x
x + ~e}.
*),
Lower bound: (
liiJ) n.
Thus, if E = 0(
the lower and upper bounds coincide up to a constant
multiplicative factor. This implies that Q(p) is an optimal method for C.
At the same time, Theorem 1.1.2 supports our initial claim that the
general optimization problems are unsolvable. Let us Iook at the following example.
EXAMPLE 1.1.4 Consider the problern dass :F defined by the following
parameters:
L = 2, n = 10, E = 0.01.
12
Note that the size of the problern is very small and we ask only for 1%
accuracy.
The lower complexity bound for this dass is (
n. Let us compute
it for our example.
fe)
Lower bound:
Complexity of oracle:
Total complexity:
Work station:
Total time:
One year:
We need
ymrs.
We should note, that the lower complexity bounds for problems with
smooth functions, or for high-arder methods are not much better than
those of Theorem 1.1.2. This can be proved using the same arguments
and we leave the proof as an exercise for the reader. Camparisan of
the above results with the upper bounds for NP-hard problems, which
are considered as a dassical example of very difficult problems in combinatorial optimization, is also quite disappointing. Hard combinatorial
problems need 2n a.o. only!
To conclude this section, let us compare our situation with one in some
other fields of numerical analysis. lt is well known, that the uniform grid
approach is a standard tool in many domains. For example, if we need
13
Nonlinear Optimization
J
1
I=
f(x)dx,
Sn=
1
N
"L f(xi),
Xi
i=l
= ;.,, i = 1 ... N.
Note that in our terminology this is exactly the uniform grid approach.
Moreover, that is a standard way for approximating the integrals. The
reason why it works here lies in the dimension of problems. For integration the standard dimensions are very small (up to three), and in
optimization sometimes we need to solve problems with several millions
of variables.
1.1.4
14
15
Nonlinear Optimization
1.2
1.2.1
Vk
0.
xERn
(1.2.1)
16
verges.
+ (!'(x), y- x) + o(ll
Ii),
0 such that
= 0,
y- x
II II
o(O)
= 0.
The linear function f(x) + (f'(x), y- x) is called the linear approximation of f at x. Recall that the vector f'(x) is called the gradient of
function f at x. Considering the points Yi = x + Eei, where ei is the ith
coordinate vector in Rn, and taking the limit in -+ 0, we obtain the
following coordinate representation of the gradient:
17
Nonlinear Optimization
.Cf (0:) = {X E Rn
I J (X) S 0:}
J(yk) = f(x)
J(x).
+ o:s)- f(x)
= a(f'(x), s)
+ o(o:).
Therefore
we obtain ~(s)
Then
~(s) = -(f'(x),J'(x))/
II
J'(x)
II=- II
J'(x)
II.
Thus, the direction - J' (x) (the antigradient) is the direction of the
fastest local decrease of J(x) at point x.
The next statement is probably the most fundamental fact in optimization.
THEOREM 1. 2.1 (First-order optimality condition.)
Let x* be a local minimum of differentiable function f(x). Then
f'(x*) = 0.
18
= 0,
Vs,
II) 2 f(x*).
II s II= 1.
x E C, := {x ERnlAx = b} =J 0,
where A is an m x n-matrix and b E Rm, m
vector of multipliers ). * such that
< n.
{1.2.2)
= x(y) = x* + LY(i)ui,
y E Rk.
i=l
~!~?l
and (1.2.2) follows.
= (f'(x*), ui) = 0,
= 1 ... k,
0
Note that we have proved only a necessary condition of a local minimum. The points satisfying this condition are called the stationary
points of function f. In order to see that such points are not always the
local minima, it is enough to look at function f(x) = x 3 , x E Rl, at
x=O.
19
Nonlinear Optimization
f(x)
(f"(x))(i,j) = 8:c:{J~L>.
It is called the Hessian of function f at x. Note that the Hessian is a
symmetric matrix:
f"(x) = [f"(x)f.
f'(y) = f'(x)
Using the second-order approximation, we can write down the secondorder optimality conditions. In what follows notation A t 0, used for a
symmetric matrix A, means that A is positive semidefinite:
(Ax, x)
2: 0 Vx
ERn.
Notation A >- 0 means that Ais positive definite (above inequality must
be strict for x =/= 0).
THEOREM 1.2.2 (Second-order optimality condition.)
f'(x*) = 0,
f(y)
>0
f(x*).
f(y) = f(x*)
II s II= 1.
~ f(x*).
0
20
f'(x*) = 0,
J"(x*) >- 0.
f(y) = f(x*)
Since ~ --+ 0, there exists a value f such that for all r E [0, f] we have
f(y)
;::=:
1.2.2
21
Nonlinear Optimization
for all x, y E Q.
Clearly, we always have p ~ k. If q 2: k, then CfP(Q) ~ c1P(Q). For
1 (Q) ~
1 (Q). Note also that these classes possess the
example,
following property: if ft E C1~(Q), /2 E c1:(Q) and a, E R 1 , then
for
L3 =I a I L1 + I I L2
Cl'
Cz
cl
Cl
II
II x- Y II
(1.2.3)
for all x, y E Rn. Let us give a sufficient condition for that inclusion.
LEMMA
if
II
f"(x) II~ L,
Cz
1 (Rn) C
Cl'
1 (Rn)
if and only
{1.2.4)
Vx ERn.
= f'(x)
+f
= J'(x)
J"(x
(l
II
f'(y)- f'(x)
II
(l
<
Jf"(x+r(y-x))dr
f"(x
+ T(y- x))dT)
0
1
~I
II
~L
II y- XII-
f"(x+r(y-x))
(y- x)
lly-xll
II drll
y-x
II
22
(!
J"(x + Ts)dT) s
=II
f'(x
f'(x)
= a,
f"(x)
= 0.
=II A II
f'(x)
Therefore f(x) E
J"(x)
+ x2 ,
x E R 1 . We
= (1 + ~2)3/2 ~ 1.
Ci' (R).
1
ci
f II Y- x 11 2
{1.2.5}
f(y)
= f(x)
= f(x)
x)dT.
23
Nonlinear Optimization
Therefore
=I
f(f'(x
+ (f'(xo),x- xo) + ~
II x- xo 11 2,
</J2(x) = f(xo)
+ (J'(xo), x- xo)- ~
II x- xo 11 2 .
<h(x)
Let us prove a similar result for the dass of twice differentiable functions. Our main dass of functions ofthat type will be dj:/(Rn), the
dass of twice differentiable functions with Lipschitz continuous Hessian.
Recall that for f E C'j;/(Rn) we have
II f"(x)- f"(y) II:S M II x- Y II
(1.2.6)
1.2.4 Let f E
Cz' (Rn).
2
(1.2. 7}
::; 't II y
(1.2.8}
- X 11 3 .
24
Therefore
II
= II
f(f"(x
II
<
f rM
II
y- x
11 2
dr =
!vf II y- x
11 2
CoROLLARY
II
y- x
II= r.
Then
+ Mrln,
i = 1 ... n.
25
Nonlinear Optimization
1.2.3
Gradient method
Gradient method
(1.2.9)
Choose xo E Rn.
Iterate Xk+I = Xk- hkf'(xk), k = 0, 1, ....
> 0,
hk
hk
v'k+T"
(constant step)
2. Full relaxation:
(1.2.11)
26
h 2:: 0.
Then the step-size values acceptable for this strategy belang to the part
of the graph of cp that is located between two linear functions:
c/J1(h) = f(x)- ah
II
f'(x) 11 2,
c/J2(h) = f(x)- h
II
f'(x) 11 2 .
Note that cp(O) = cfJ1(0) = cp2 (0) and cp'(O) < cp~(O) < cp~(O) < 0. Therefore, the acceptable values exist unless cp(h) is not bounded below. There
are several very fast aue-dimensional procedures for finding a point satisfying the conditions of this strategy, but their description is not so
important for us now.
Let us estimate the performance of the gradient method. Consider
the problern
min f(x),
xERn
with f E Ci' 1 (Rn). And assume that f(x) is bounded below on Rn.
Let us evaluate a result of one gradient step. Consider y = x- hf'(x).
Then, in view of (1.2.5), we have
f(y)
II
f'(x)
11 2 +h22
II
f'(x)
II
f'(x)
11 2
11 2
(1.2.12)
11 2 .
Thus, in order to get the best estimate for possible decrease of the objective function, we have to solve the following one-dimensional problem:
D. (h)
= - h (1 -
~ L)
--t
m~n .
1:,
A II f'(x)
11 2 .
27
Nonlinear Optimization
= h,
i.
t.
j;(l- ).
4; L)
II f'(xk)
11 2
Xk+I)
= ahk II f'(xk) 11 2
where f* is the optimal value of the problern (1.2.1). As a simple consequence of (1.2.14) we have
II f'(xk) II-+
as
k-+ oo.
28
However, we can also say something about the convergence rate. lndeed,
denote
g* - min g
N- o-:::k-:::N k,
II
Y'N :::;
JN+l
[1
(1.2.15)
j(x)
The gradient of this function is f'(x) = (x(l), (x< 2)) 3 - x< 2))T. Therefore
there are only three points which can pretend tobe a local minimum of
this function:
x!
= (0, 0),
x;
= (0, -1),
xj
= (0, 1).
!"(x)
~ ( ~ 3(x(2~2 _ 1 )
we conclude that x2 and x3 are the isolated local minima 1 , but xi is only
a stationary point of our function. Indeed, f(xi) = 0 and f(xi + e 2 ) =
44 - ~2 < 0 for small enough.
Now, let us consider the trajectory of the gradient method, which
starts from xo = (1, 0). Note that the second coordinate of this point
is zero. Therefore, the second coordinate of f'(xo) is also zero. Consequently, the second coordinate of x 1 is zero, etc. Thus, the entire
sequence of points, generated by the gradient method will have the second coordinate equal to zero. This means that this sequence converges
to xi.
1 In
29
Nonlinear Optimization
To condude our example, note that this situation is typical for all
first-order unconstrained minimization methods. Without additional
rather strict assumptions it is impossible to guarantee their global con0
vergence to a local minimum, only to a stationary point.
Note that inequality (1.2.15) provides us with an example of a new
notion, that is the rate of convergence of minimization process. How
can we use this notion in the complexity analysis? Rate of convergence
delivers the upper complexity bounds for a problern dass. These bounds
are always justified by some numerical methods. lf there exists a method,
for which its upper complexity bounds are proportional to the lower
complexity bounds of the problern dass, we call this method optimal.
Recall that in Section 1 we have already seen an example of optimal
method.
Let us look at an example of upper complexity bounds.
EXAMPLE
Model:
1. Unconstrained minimization.
Cl'
1 (Rn).
2. f E
3. f(x) is bounded below.
(1.2.16)
Oracle:
-
solution:
f(x)
f(xo),
II
f'(x)
II~ .
g'N ~
1
[1
vfN+l -wL(f(xo)-
f*) ] 1/2 ~
30
Let us check, what can be said about the local convergence of the
gradient method. Consider the unconstrained minimization problern
min f(x)
xERn
E c;;/(Rn).
= Xk -
f'(xk)
where Gk =
f
0
f'(xk) - f'(x*)
=f
0
f"(x*
+ T(Xk
= 0.
- x*))(xk - x*)dT
II ak+l
II~ (1- q)
II ak
II ao ll-7 o.
II Denote rk =II xk -x* 11.
+ T(Xk- x*))
f"(x*)
+ TMrkln.
31
Nonlinear Optimization
+ '!fM)In
Hence, (1- hk(L + ItM))In ::S In- hkGk ::S (1- hk(l- ItM))In and we
conclude that
(1.2.18)
for small enough hk. In this case we will have rk+l < rk.
As usual, many step-size strategies are available. For example, we
can choose hk =
Let us consider the "optimal" strategy consisting in
minimizing the right-hand side of (1.2.18}:
t.
Assurne that ro < f. Then, if we form the sequence {xk} using the
optimal strategy, we can be sure that rk+l < rk < f. Further, the
optimal step size h'k can be found from the equation:
ak(h)
= bk(h)
1- h(l- ~M)
<=?
= h(L + ~M)- 1.
Hence
2
h*(1.2.19)
k- L+l'
(Surprisingly enough, the optimal step does not depend on M.) Under
this choice we obtain
<
Tk+l -
(L-th
L+l
Mrz
L+l'
+ a2k =
ak(1
1- > lli - 1, or
Therefore -ak+l
ak
+ (ak- q))
= ak(l-(ak-~) 2 )
1-(ak-q
<
-
ak
l+q-ak
32
Hence,
Thus,
a
< ro+(l+q)
qr~ (r-ro) -< ..!l!!L
(-1-)k
f-ro l+q
k -
THEOREM
ro
=II xo -
x*
II < f = i} .
Then the gradient method with step size (1.2.19} converges as follows:
(1- L~3l)k
1.2.4
Newton method
cjJ(t*) = 0.
The Newton method is basedonlinear approximation. Assurne that we
get some t close enough to t*. Note that
Therefore the equation cjJ(t + t:.t) = 0 can be approximated by the following linear equation:
cjJ(t)
+ c/J'{t)b.t =
0.
We can expect that the solution of this equation, the displacement t:.t,
is a good approximation to the optimal displacement b.t* = t* - t.
Converting this idea in an algorithmic form, we get the process
t k+l = t k
!Pl!:JJ..
tP'(tk).
33
Nonlinear Optimization
F(x) = 0,
where x ERn and F(x) :Rn~ Rn. In this case we have to define the
displacernent tl.x as a solution to the following systern of linear equations:
F(x)
+ F'(x)tl.x =
f'(x)
+ f"(x)ilx = 0.
</>(x) = f(xk)
Assurne that f"(xk) >- 0. Then we can choose xk+ 1 as a point of minimum of the quadratic function <f>(x). This means that
34
has two serious drawbacks. Firstly, it can break down if f"(xk) is degenerate. Secondly, the Newton process can diverge. Let us look at the
following example.
EXAMPLE 1.2.4 Let us apply the Newton method for finding a root of
the following function of one variable:
qy(t) = yl~t2.
Clearly, t* = 0. Note that
qy'(t) =
[l+t;j3/2.
t k+l -- t k
- !fi!:JJ_
"''(t ) -'I'
tk
__lk.__
f177I .
yl+tk
[1 + t2]3/2
-k
t3k
E c'~;/(Rn).
Nonlinear Optimization
35
Xk+1 - x*
where Gk
II Gk I = I
1
<
J[f"(xk)- f"(x*
+ T(xk- x*))]dT II
II
+ T(Xk- x*)) II dT
f"(xk)- f"(x*
II
THEOREM
II
xo- x*
II< r = 3~.
Then II Xk- x* II < f for alt k and the Newton method converges quadratically:
36
-Jk
The corresponding complexity estimate depends on a double logarithm of the desired accuracy: In ln ~.
This rate is extremely fast: Each iteration doubles the number of
right digits in the answer. The constant c is important only for the
starting moment of the quadratic convergence (crk < 1).
37
Nonlinear Optimization
1.3
1.3.1
In the previous section we have considered two local methods for finding a local minimum in the simplest minimization problern
min f(x),
xERn
with
Ci'2 (Rn).
hk > 0.
</JI(x) = f(x)
+ (J'(x),x- x) + 21 II
x- x
11 2 ,
= f'(x) + *(xi- x) = 0.
38
(see Lemma 1.2.3). This fact is responsible for global convergence of the
gradient method.
Further, consider a quadratic approximation of function f(x):
<P2(x) = f(x)
x2 =
x- [f"(x)t 1 j'(x),
</>c(x) = J(x)
<Pa(xa)
we obtain
= J'(x) + G(xa- x) = o,
x(; = x-
a- f
(1.3.1)
1 1(x).
{Gk} : Gk
--7
J"(x*)
(or {Hk} : Hk := GJ; 1 --7 [f"(x*)t 1), are called the variable metric
methods. (Sometimes the name quasi-Newton methods is used.) In these
methods only the gradients are involved in the process of generating the
sequences {Gk} or {Hk}
The updating rule (1.3.1) is very common in optimization. Let us
provide it with one more interpretation.
Note that the gradient and the Hessian of a nonlinear function f(x)
are defined with respect to a standard Euclidean inner product on Rn:
(x,y)
= L:x(i)y(i),
x,y ERn,
i=l
f(x
+ h) =
f(x}
+ (f'(x), h) + o(ll
h II),
39
Nonlinear Optimization
Let us introduce now a new inner product. Consider a symmetric positive definite n x n-matrix A. For x, y E Rn derrote
(x,y)A = (Ax,y),
II x
The function II x IIA is a new norm on Rn. Note that topologically this
new metric is equivalent to the old one:
where An (A) and A1 ( A) are the smallest and the largest eigenvalues of
the matrix A. However, the gradient and the Hessian, computed with
respect to the new inner product are changing:
f(x
+ h)
= f(x)
= f(x)
f' (x*)
= Ax*
+a =
+ (A- 1 a, x)A + ~ II
f(x)
f~(x)
A- 1f'(x) = dN(x),
x II~,
40
o.
1. kth iteration (k
0).
Hk+l
The variable metric schemes differ one from another only in implementation of Step ld), which updates matrix Hk For that, they use new
information, accumulated at Step lc), namely the gradient f'(xk+I)
The idea is justified by the following property of a quadratic function.
Let
f(x) = a + (a, x) + !(Ax, x), J'(x) = Ax + a.
Then, for any x, y ERnwehave f'(x)- f'(y) = A(x- y). This identity
explains the origin of the so-called quasi-Newton rule.
Quasi-Newton rule
Actually, there are many ways to satisfy this relation. Below we present
several examples of the schemes that usually are recommended as the
most efficient ones.
41
Nonlinear Optimization
EXAMPLE
1.3.2 Denote
1:1Hk
Hk'Yk'YfHk
(Hk"fk, 'Yk) .
3. Broyden-Fletcher-Goldfarb-Shanno scheme (BFGS).
Note that for quadratic functions the variable metric methods usually
terminate in n iterations. In a neighborhood of strict minimum they
have a superlinear rate of convergence: for any xo E Rn there exists a
number N such that for all k ~ N we have
II Xk+l -
x*
(the proofs are very long and technical). As far as global convergence is
concerned, these methods are not better than the gradient method (at
least, from the theoretical point of view).
Note that in the variable metric schemes it is necessary to store and
update a symmetric n x n-matrix. Thus, each iteration needs O(n2 )
auxiliary arithmetic operations. During many years this feature was
considered as one of the main drawbacks of the variable metric methods.
That stimulated the interest in so-called conjugate gradients schemes,
which have much lower complexity of each iteration (see Section 1.3.2).
However, in view of an amazing growth of computer power in the last
decades, these objections are not so important anymore.
42
1.3.2
Conjugate gradients
The conjugate gradients methods were initially proposed for minimizing a quadratic function. Consider the problern
(1.3.2)
min f(x),
xERn
f(x)
o:
o:- ~(Ax*,x*)
+ (a,x) +
~(Ax,x)
k;:::: 1,
(1.3.3)
This definition Iooks quite artificial. However, later we will see that
this method can be written in a pure "algorithmic" form. We need
representation (1.3.3) only for theoretical analysis.
= A(xo - x*).
Xk = xo
+L
A,(i) Ai(xo-
x*)
i=l
= y
i=l
.Ck+l
43
Nonlinear Optimization
The next result helps to understand the behavior ofthe sequence {xk}.
1.3.2 For any k, i ~ 0, k
LEMMA
Proof: Let k
i= i
E_x~) J'(xj_t).
j=l
However, by definition, x k is the point of minimum of f (x) on .Ck. Therefore 4>'(-X*) = 0. It remains to compute the components of the gradient:
0
COROLLARY
COROLLARY
6i
The last auxiliary result explains the name of the method. Denote
- Xi. It is clear that .Ck = Lin {6o, ... , 6k-l }.
= Xi+l
LEMMA
i= i
> i. Then
0
Let us show how we can write down the conjugate gradients method in
a more algorithmic form. Since .Ck = Lin {60 , ... , 6k-l }, we can represent
Xk+I as follows:
k-l
+L
j=O
_x(j)dj.
44
(1.3.4)
= k -1 we have
45
Nonlinear Optimization
+ hkPk
a).
Find
b).
c).
d).
Xktl
= xk
In that scheme we did not specify yet the coefficient k In fact, there
are many different formulas for this coefficient. All of them give the
same result on quadratic functions, but in a general nonlinear case they
generate different sequences. Let us present three of the most popular
1.
2.
3 Polak-Ribbiere
- k -
(J'(xk+t),J'(xkd-f'(xk))
II!' (xk)ll 2
46
Note, that this local convergence is slower than that of the variable
metric methods. However, the conjugate gradients schemes have an
advantage of a very cheap iteration. As far as the global convergence is
concerned, the conjugate gradients, in general, are not better than the
gradient method.
1.3.3
Constrained minimization
Let us discuss briefly the main ideas underlying the methods of general
constrained minimization. The problern we deal with is as follows:
fo(x) -+ min,
(1.3.5)
fi(x)
0, i = 1 ... m.
where fi(x) are smooth functions. For example, we can consider fi(x)
1 (Rn).
frorn
Since the cornponents of the problern (1.3.5) are general nonlinear
functions, we cannot expect that this problern is easier than an unconstrained minimization problem. Indeed, even the standard difficulties
with stationary points, which we have in unconstrained minimization,
appear in (1.3.5) in a much stronger form. Note that a stationary point
of this problern (whatever it is) can be infeasible for the systern of functional constraints. Hence, any minimization scheme attracted by such a
point should accept that it fails even to find a feasible solution to (1.3.5).
Therefore, the following reasoning Iooks quite convincing.
Cl'
47
Nonlinear Optimization
<P(x)
> 0 for
any x ~
Q.
nQ2.
Q = {x ERn lfi(x)
~ 0, i = 1. ..
m}.
L: (!i(x))~.
i=l
L.: (fi(x))+
i=l
+ tkci>(x)}
48
fo(x)
Xk+I
is bounded. Then
lim f(xk) = fo(x*),
k---'too
lim <P(xk) = 0.
k---'too
Proof: Note that lllk :::; IJ!k(x*) = fo(x*). At the same time, for any
x ERnwehave 'l!k+l(x) ~ 'l!k(x). Therefore 'l!k+l ~ 'l!k. Thus, there
exists a Iimit lim 'l!k lll* :::; f*. If tk > l then
k---'too
Therefore, the sequence {xk} has Iimit points. Since lim tk = +oo, for
k---'too
any such point x. we have <P{x.) = 0 and fo(x.) :::; fo(x*). Thus x.
and
w* = fo(x.)
+ <P(x.) = fo(x.)
fo(x*).
0
Note that this result is very general, but not too informative. There
are still many questions, which should be answered. For example, we
do not know what kind of penalty function we should use. What should
be the rules for choosing the penalty coefficients? What should be the
accuracy for solving the auxiliary problems? The main feature of these
questions is that they can be hardly addressed in the framework of general nonlinear optimization theory. Traditionally, they are considered as
questions to be answered by computational practice.
Let us Iook at the barrier methods.
1.3.2 Let Q be a closed set with nonempty interior. A
continuous function F(x) is called a barrier function for Q if F(x) ~ oo
when x approaches the boundary of Q.
DEFINITION
4 If
we assume that it is a strict local minimum, then the result is much weaker.
49
Nonlinear Optimization
then Fl (X)
+ F2 (X)
nQ2.
In order to apply the barrier approach, the problern (1.3.5) must satisfy the Slater condition:
fi(x) < o,
3x :
i = 1. .. m.
i=l
In(- fi(x)).
i~ exp (-i(x)).
coefficients:
0 < tk
Find a point
using
Xk
Xk+l
=arg min{fo(x)
xEQ
as a starting point.
+ f-F(x)}
k
wk(x) = fo(x)
t:
+ F(x),
Xk+I
is a
50
k-too
w;. = f*.
'llj.
f* be the optimal
k-too
J*.
+ /k F(x)]
= fo(x).
= min
{!o(x) + f-F(x)}
~ min
{!o(x) + lk F*} = !* + lk F*.
xeQ
k
xEQ
J*.
The same as with the penalty functions method, there are many questions to be answered. We do not know how to find the starting point xo
and how to choose the best barrier function. We do not know the rules
for updating the penalty coefficients and the acceptable accuracy of the
solutions to the auxiliary problems. Finally, we have no idea about the
efficiency estimates of this process. And the reason is not in the lack
of the theory. Our problern (1.3.5) is just too complicated. We will see
that all of the above questions get precise answers in the framework of
convex optimization.
We have finished our brief presentation of general nonlinear optimization. It was really very short and there are many interesting theoretical topics that we did not mention. That is because the main goal of
this book is to describe the areas of optimization in which we can obtain some clear and complete results on the performance of numerical
methods. Unfortunately, the general nonlinear optimization is just too
complicated to fit the goal. However, it is impossible to skip this field
since a lot of basic ideas, underlying the convex optimization methods,
have their origin in general nonlinear optimization theory. The gradient
method and the Newton method, sequential unconstrained minimization
and barrier functions were originally developed and used for general optimization problems. But only the framework of convex optimization
allows these ideas to get their real power. In the next chapters of this
book we will see many examples of the second birth of these old ideas.
Chapter 2
2.1
Strongly
convex functions. Lower complexity bounds s:l(Rn); Gradient method.)
2.1.1
xERn
{2.1.1)
where the function j(x) is smooth enough. Recall that in the previous
chapter we were trying to solve this problern under very weak assumptions on function f. And we have seen that in this general situation we
cannot do too much: It is impossible to guarantee convergence even to a
local minimum, impossible to get acceptable bounds on the global performance of minimization schemes, etc. Let us try to introduce some reasonable assumptions on function f to make our problern more tractable.
For that, Iet us try to determine the desired properties of a class of
differentiable functions F we want to work with.
From the results of the previous chapter we can get an impression
that the main reasons of our troubles is the weakness of the first-order
optimality condition (Theorem 1.2.1). Indeed, we have seen that, in
general, the gradient method converges only to a stationary point of
function f (see inequality (1.2.15) and Example 1.2.2). Therefore the
first additional property we definitely need is as follows.
2.1.1 For any f E F the first-order optimality condition
is sufficient for a point to be a global solution to (2.1.1}.
AssuMPTION
52
0, then afi
+ h E :F.
The reason for the restriction on the sign of coefficients in this assumption is evident: We would like to see x 2 in our dass, but function -x 2
is not suitable for our goals.
Finally, let us add in :F some basic elements.
ASSUMPTION
Note that the linear function f(x) perfectly fits Assumption 2.1.1. Indeed, f'(x) = 0 implies that this function is constant and any point in
Rn is its global minimum.
It turns out that we have assumed enough to specify our functional
class. Consider f E :F. Let us fix some x 0 E Rn and consider the
function
cf>(y) = f(y)- (f'(xo),y}.
Then cf> E :F in view of Assumptions 2.1.2 and 2.1.3. Note that
cf>'(y)
ly=xo=
f'(xo)- f'(xo)
= 0.
cf>(y) ~ cf>(xo)
= f(xo)- (J'(xo),xo}.
f(y) ~ f(x)
+ (f'(x), y- x}.
(2.1.2}
1 This is not a description of the whole set of basic elements. We just say that we want to
have linear functions in our class.
53
2.1.1 lf f
Thus, we get what we want in Assumption 2.1.1. Let us check Assumption 2.1.2.
2.1.1 If fi and h belong to :F1 (Rn) and a, 2::0 thenfunction
f = afl + h also belongs to :F1 (Rn).
LEMMA
ft(y)
+ (J~(x),y- x}.
and
0
54
</J(y) = f(y)
fj =
Ay + b. Since
<P(x)
<P(x)
+ (<P'(x), y- x).
0
f(ax
+ (1- a)y)
af(x)
+ (1- a)f(y).
(2.1.3)
f(xa)
f(xa)
f(x)
+ (f'(xa), X- X
0 )
= f(y)
+ a(f'(xa), Y- x),
Multiplying first inequality by (1- a}, the second one by a and adding
the results, we get {2.1.3).
Let (2.1.3} be true for all x, y ERn and a E [0, 1]. Let us choose some
a E [0, 1). Then
f(y)
f(x)
class
:F1 (Rn)
(f'(x)- J'(y),x- y) ~ 0.
belongs to the
{2.1.4)
2 Note that inequality (2.1.3) without assumption on differentiability of /, serves as a definition of geneml convex functions. We will study these functions in detail in the next chapter.
55
Proof: Let
f(y)
f(x)
f(x)
+ (f'(x),y- x) + f(f'(x
7 )-
f'(x),y- x)dr
f(x)
+ (f'(x), y- x) + f
f"(x}
<
be-
(2.1.5)
C::: 0.
r > 0. Then,
~ f(f"(x
0
+ >.s)s, s)d>.,
f(y)
f(x)
+ (f'(x), y- x)
1 T
56
+ (a,x)
is convex.
f(x) = a
+ (a, x) + ~(Ax, x)
f(x}
f(x)
= I X IP,
f(x)
f(x)
= I x I -ln(1+ I x 1).
ex,
p
> 1,
x2
1-lxl'
f(x) =
L ea:;+(a;,x)'
m
i=l
f(x)
is convex too.
.r'l
:Fz'
t II x- y 11
2,
{2.1.6}
57
f(x) + (f'(x), Y- x) +
II
A II f'(x)- f'(y)
{2.1.8}
11 2 ,
{2.1.9}
(J'(x)- f'(y), x- y) ~ L
af(x)
+ (1 -
a)f(y)
+ (1 -
II
x- y
+ o(~Lo) II
af(x)
{2.1. 7}
11 2 ~ f(y),
f'(x)- f'(y)
11
(2.1.10}
2,
+ (1- a)y)
+a(1-a)~ II x-y 11 2
{2.1.11}
A II ifl'(y) 11
!Lily- xll 2
f(xo:)
+ (1
- a)y.
f(x)
f(y)
11 2
11 2 ,
58
II 91 -
11 2
+(1 - a)
II 92 -
u 11 2 ~ a(1 - a)
II 91 - 92 11 2 ,
f(x)
f(y)
11 2 ,
11 2 .
THEOREM
0 j J"(x) j Lln.
be-
(2.1.12}
2.1.2
Oracle:
min /(x),
xERn
f E :Fi'l(Rn).
~ t.
59
In order to make our considerations simpler, let us introduce the following assumption on iterative processes.
AssUMPTION 2.1.4 An iterative method M generates a sequence of test
points {Xk} such that
Xk E xo
k ~ 1.
fk(x) =
~ { ~((x(ll) 2 + ~~>x(i) -
x(i+Il) 2
+ (x(k)) 2 ] - x(l)}
ur(x)s, s)
~ ~ [(s(ll)' + :~>(i)
s(i+ll) 2 + (s('i)']
0,
and
< LL:(s(i)) 2 .
i=l
60
= Akx- e1 = 0
xk - {
1--i
k+l'
i = 1. .. k,
k+1~ i
0,
n.
fi = ~
[!(Akxbxk)- (el,xk)]
= -t(el!xk)
(2.1.13)
t(-1+k~l).
~ 2 - k(k+1){2k+l}
6
L..~-
i=l
<
-
{k+1)3
3
(2.1.14)
Therefore
(2.1.15)
< k-
_2_. k{k+l}
k+l
2
+ {k+1)2
1
. {k+l)l - l(k
3
- 3
+ 1) .
p = k ... n.
n.
we have Ck
Rk,n.
61
COROLLARY
.Ck we have
THEOREM
62
II
Xk - x*
11 2 >
=
Xk E Rk,n
and
xo
= 0, we
L::
L::
1 2k+1 .
k+l
't
i=k+l
L::
+1-
2k+1 2
Z
i==k+l
+ 4(k+I)2 L::
L::
i2
i==k+l
t(k
II
Xk _ x* ll2
> k + 1 __1_ .
k+l
k -
>
(3k+2)(k+l)
2
+ (2k+1)(7k+6)
24(k+l)
2k 2+7k+6
- 2- 24(k+l)
2k2+7k+6
l(k+l)2
II
Xo- X2k+1
112>
_
S1
II
XQ- X *
112 .
0
The above theorem is valid only under assumption that the number
of steps of the iterative scheme is not too large as compared with the
dimension of the space (k :::; ~(n- 1)). The complexity bounds ofthat
type are called uniform in the dimension of variables. Clearly, they
are valid for very large problems, in which we cannot wait even for n
iterates of the method. However, even for problems with a moderate
dimension, these bounds also provide us with some information. Firstly,
they describe the potential performance of numerical methods on the
initial stage of the minimization process. And secondly, they warn us
that without a direct use of finite-dimensional arguments we cannot get
better complexity for any numerical scheme.
To conclude this section, Iet us note that the obtained lower bound for
the value of the objective function is rather optimistic. Indeed, after one
hundred iterations we could decrease the initial residual in 104 times.
However, the result on the behavior of the minimizing sequence is quite
disappointing: The convergence to the optimal point can be arbitrarily
slow. Since that is a lower bound, this conclusion is inevitable for our
problern dass. The only thing we can do is to try to find problern
63
classes in which the situation could be better. That is the goal of the
next section.
2.1.3
.rl
1 (Rn),
Thus, we are looking for a restriction of the functional class
for which we can guarantee a reasonable rate of convergence to a unique
solution of the minimization problern
Recall, that in Section 1.2.3 we have proved that in a small neighborhood of a nondegenerate local minimum the gradient method converges
linearly. Let us try to make this non-degeneracy assumption global.
Namely, Iet us assume that there exists some constant 1-t > 0 such that
for any x with f'(x) = 0 and any x ERnwehave
+ ~1-t II
x- x
11 2
+ (f'(x),y- x} + ~1-t
f(y);:::: f(x)
II
y- x
{2.1.16}
11 2 .
2.1.8 lf f
+ !~-t II x- x*
11 2
f(x)
f(x*)
+ !~-t II x- x*
11 2
11 2
64
2.1.4 lf fi
Sh 1 (Rn),
= afi
f2
>
!I(x)
+ (fi(x),y- x) + !J.Ll
II
y- x 11 2,
h(y)
>
h(x)
+ UHx), Y- x) + !J.L2
II
Y- x 11 2 .
Note that the class SJ(Rn) coincides with F 1 (Rn). Therefore addition
of a convex function to a strongly convex function gives a strongly convex
function with the same convexity parameter.
Let us give several equivalent definitions of strongly convex functions.
2.1.9 Let f be continuously differentiable. Both conditions
below, holding for all x, y ERn and a E (0, 1], are equivalent to inclusion
THEOREM
E S~(Rn):
?: J.L II x- y
(f'(x) - f'(y), x- y)
af(x)
+ (1- a)f(y) ?:
f(ax
(2.1.17)
11 2 ,
+ (1- a)y)
+a(1- a)~
II
(2.1.18}
x- y
11 2 .
The proof of this theorem is very similar to the proof of Theorem 2.1.5
and we leave it as an exercise for the reader.
The next statement sometimes is useful.
THEOREM
2 .1.1 0 If f
f(y) S f(x)
+ (f'(x), Y- x) + 2~
(f'(x) - f'(y), x- y) S ~
II
II
f'(x) - f'(y)
f'(x) - f'(y)
11 2
11 2 ,
(2.1.19)
(2.1.20)
65
rp(x)
minrp(v)
;::: min[<P(y)
V
V
<P(y)- 2~11<P'(y)li 2 ,
and that is exactly (2.1.19). Adding two copies of (2.1.19) with x and y
interchanged we get (2.1.20).
D
Finally, the second-order characterization of the class Sh(Rn) is as
follows.
2 .1.11 Two times continuously differentiable function
longs to the class s~ (Rn) if and only if X E Rn
THEOREM
f be-
{2.1.21)
f"(x) !: flln.
j(x) = a+ (a,x)
+ !(Ax,x)
s::l(Rn) C S!;l(Rn)
since f"(x) = A.
Other examples can be obtained as a sum of convex and strongly
D
convex functions.
For us the most interesting functional class is s!:l(Rn). This class is
described by the following inequalities:
x- Y
11 2 ,
(2.1.22)
(2.1.23)
x- Y II
The value Qf = L / fl ;::: 1 is called the condition number of function f.
It is important that the inequality (2.1.22) can be strengthened using
the additional information (2.1.23).
66
THEOREM
II
x- Y 11 2
{2.1.24}
Proof: Derrote cp(x) = f(x)- ~1111xll 2 . Then cp'(x) = f'(x)- 11x; hence,
by (2.1.22) and (2.1.9) cjJ E :Fl~M(Rn). If 11 = L, then (2.1.24) is proved.
If 11 < L, then by (2.1.8) we have
(cp'(x)- <P'(y),y- x)
~ L~J.LII<P'(x)-
</>'(y)ll 2 ,
2.1.4
Let us get the lower complexity bounds for unconstrained minimization of functions from the dass sr;:J}(Rn) c s~:l(Rn). Consider the
following problern dass.
min J(x),
Model:
xERn
s;:l(Rn),
Oracle:
Approximate solution:
x: f(x) - !*
E,
II x- x*
1-l
11 2 ~
> 0.
E.
As in the previous section, we consider the methods satisfying Assumption 2.1.4. We are going to find the lower complexity bounds for our
problern in terms of condition number Qf = ~
Note that in the description of our problern dass we do not say anything about the dimension of the space of variables. Therefore formally,
this dass includes also the infinite-dimensional problems.
We are going to give an example of some bad function defined in the
infinite-dimensional space. We could do that also in a finite dimension,
but the corresponding reasoning is more complicated.
Consider gX) l2 , the space of all sequences x = {x(i)}~ 1 with finite
norm
67
Denote
-i
(
A=
>
0 and Qf
-~ -~ ~
2
0 -1
>
This means that J11 ,Q 1 E s::,~~ 1 (R 00 ). Note that the condition number
of function f 11 ,Q1 is
Q!Jl.,Qf -_
ttQr _
-
11
, (x)
! J-L,IlQf
= (tt(Qr 1) A
can be written as
+ u1) X-
J-L(Qr 1) e = 0
4
f""'
(A + Q/- 1 ) x = e1.
= 1,
+ x(k-l)
= 0,
Qr1
Qrl
(2.1.25)
k = 2, ....
that is q =
1-1
~QQt+1
2 Q I+ 1 q + 1 = 0
Qrl
'
68
THEOREM
II
Xk- x*
11 2 2
( ~~~)
k II
xo- x*
11 2 ,
~ ~ ( ~~~) 2k II xo- x* 11 2 ,
f(xk)- !*
= 0.
xo- x*
11 2
=f
i=l
((x*)(i)]2 =
fq
i=l
2i
= ,S.
1
q
II Xk- x* 11 2 ~
00
( ")
00
L: [(x*) ~ J2 = L:
i=k+ 1
i=k+ 1
q2~
11 2
The second bound of the theorem follows from the first one and The0
orem 2.1.8.
Gradient method
2.1.5
with f E
follows.
:t'l' (Rn).
1
xERn
Gradient method
0. Choose x 0 ERn.
69
THEOREM
f(xk)Proof: Denote rk
r~+l
=II
f* ::; 2llxo-!~11~x~t~;~r~~uc~:)-/*)
Xk- x*
II
Then
Xk - x* - hf'(xk)
11 2
II
<
+ h2 II
f'(xk)
11 2
(we use (2.1.8) and f'(x*) = 0). Therefore rk ::; r 0 . In view of (2.1.6)
we have
= f(xk)- f*.
II
Then
f'(xk)
II .
>
} + r~(k
+ 1).
'-"O
0
0
f(
Xk -
f*
<
2L(f(xo)-f*)llxo-x*l! 2
- 2LJixo-xIJ2+k(/(xo)- /*)
(2.1.26)
70
f(xo)
f* + t II xo - x*
11 2
11 2 .
2.1.2 lf h =
and jE
2LIIxo-xll 2
f( Xk ) _ f* <
k+4
.
{2.1.27}
THEOREM
Xk - x* 11 2 ~ ( 1 -
II
lf h =
1-L!L'
~_fJ;) k II
1-L!L'
xo- x*
11 2 .
then
llxk-x*ll
< (~;~i)kllxo-x*ll,
f(xk)- f*
11 2 ,
where Qf = LfJ-L.
Proof: Denote rk
rf+l
=II Xk- x*
11.
Then
Xk- x* - hf'(xk)
11 2
II
r~ - 2h(f'(xk), Xk -
f'(xk)
11 2
<
11 2
x*} + h2 II
(we use (2.1.24) and f'(x*) = 0). The last inequality in the theorem
0
follows from the previous one and (2.1.6).
Recall that we have seen already the step-size rule h = 1-L!L and
the linear rate of convergence of the gradient method in Section 1.2.3,
Theorem 1.2.4. But that was only a local result.
71
2.2
Optimal Methods
2.2.1
Optimal methods
xeRn
2LIIxo-x*ll 2
k+4
'
These estimates differ from our lower complexity bounds (Theorem 2.1.7
and Theorem 2.1.13) by an order of magnitude. Of course, in general
this does not mean that the gradient method is not optimal since the
lower bounds might be too optimistic. However, we will see that in our
case the lower bounds are exact up to a constant factor. We prove that
by constructing a method that has corresponding efficiency bounds.
Recall that the gradient method forms a relaxation sequence:
72
{2.2.1)
The next statement explains why these objects could be useful.
LEMMA 2.2.1
{2.2.2}
then f(xk)-
f*
Proof: Indeed,
Thus, for any sequence {xk}, satisfying (2.2.2) we can derive its rate
of convergence directly from the rate of convergence of sequence { Ak}.
However, at this moment we have two serious questions. Firstly, we do
not know how to form an estimate sequence. And secondly, we do not
know how we can ensure (2.2.2). The first question is simpler, so let us
answer it.
LEMMA 2.2.2
Assurne that:
S~:l(Rn),
{yk}~ 0
73
(1 - ak)Ak,
{2.2.3}
(1- ak)4>k(x)
II
x- Yk
= 4>o(x).
11 2],
Further, let
+ akf(x)
(1- Ak+l)f(x)
+ Ak+I4>o(x).
Thus, the above statement provides us with some rules for updating
the estimate sequence. Now we have two control sequences, which can
help to ensure inequality (2.2.2). Note that we arealso free in the choice
of initial function c/Jo(x). Let us choose it as a simple quadratic function.
Then we can obtain the exact description of the way c/J'k varies.
2.2.3 Let 4>0 (x) = 4>0+ ~ II x- vo 11 2 . Then the process {2.2.3}
preserves the canonical form of functions {4>k(x)}:
LEMMA
cPk(x)
= c/J'k + 1f II x- Vk 11 2 ,
{2.2.4}
where the sequences {'yk}, {Vk} and {c/Jk} are defined as follows:
'Yk+l =
(1 - ak)'Yk
+ akp.,
cPk+I =
(1- ak)4>k
+ akf(Yk)- 21:~ 1
+ak(~~:;bk (~ II Yk- Vk
11 2
II
f'(yk)
11 2
+(f'(yk),vk- Yk))
74
+ akJ.Lln =
((1- akhk
+ akJ.L)ln
='Yk+1In.
if;k+l(x) =
+ 1t
II
x- Vk
11 2 )
II
x- Yk
11 2].
From that we get the equation for the point vk+ 1 , which is the minimum
of the function cpk+l(x).
Finally, let us compute if;'ic+l In view of the recursion rule for the
sequence {cpk(x)}, we have
cpk+1
+ 'Ykt
II
Yk - Vk+1
11 2 =
cpk+l (Yk)
(2.2.5)
Therefore
'Ykr
II
Vk+l - Yk
11 2
2"f!+l
II
Vk - Yk
11 2
It remains to substitute this relation into (2.2.5) noting that the factor
for the term II Yk- Vk 11 2 in this expression is as follows:
_1_{1- ak)2'Y2
(1- ak)'lk.2
2'Yk+l
k
'Yk+l
75
Now the situation is more clear and we are close to getting an algorithmic scheme. Indeed, assume that we already have Xk:
+ (f'(yk), Xk
2: f(Yk)- 21~~1
II
f'(Yk)
11 2
c/>k+l 2:
f(xk+l) Recall,
in many different ways. The simplest one is just to take the gradient
step
with hk
= (see
2
0
Then 2""lk+l
k
following:
Xk+l
= Yk- hkf'(xk)
Now we can use our freedom in the choice of Yk Let us find it from the
equation:
~krk ( Vk - Yk)
lk+l
That is
+ Xk
- Yk = 0.
76
> 0.
Set vo = xo.
2': 0).
+ akfJ
La~ = {1 - akhk
Set lk+l = {1- akhk
b). Choose
1
Yk-
+ akfJ
(2.2.6)
Qk/'kVk+l'kt!Xk
l'k+akJL
Xk+l
suchthat
'Yktl
>
b replaces L
satisfying
in the equation of
that
f(xk) - f* :S Ak [f(xo) - f*
where Ao
Xk+l
=1
and )..k
+f
II
xo - x*
11 2 ] ,
77
{2.2. 7)
Proof: Indeed, if /k ;:::: J-t, then /k+l = La~ = (1 - akhk + akJ-t ;:::: 1-t
Since /O ;:::: J-t, we conclude that this inequality is valid for all /k Hence,
ak ;::::
and we have proved the first inequality in (2.2.7).
Further, let us prove that /k ;:::: 1o>.k. Indeed, since /o = /o>.o, we can
use induction:
jii
k.
ak+l- ak
=
>
,;>:;- .;>:;:;
vf..\k..\k+l
..\k-..\ktt
-2..\k~
Ak- Akt 1
= vf..\k..\ktt(V'Xk+vf.Xkd
= Ak-(1-ak)..\k =
2.xk~
ct&
>1
/JQ,
2~-2VT
THEOREM
f(xk)-
!*
II
xo- x* 11 2 .
78
where Qf = L/~-t and R =II xo- x* II Therefore, the worst case bound
for finding Xk satisfying f(xk) - f* ~ E cannot be better than
k>- .,fQi4
-l
[In 1 + In !!2
+ 2In R] .
f(xk)-
-)ij).
sJ:l
Let us analyze a variant of the scheme {2.2.6), which uses the gradient
step for finding the point Xk+I
Constant Step Scheme, I
+ CtkJ-t
La~ = (1 - ak)rk
(2.2.8)
"l'k+akJl
Vk+I = ;;l--[{1ak)rkvk
lkl
+ CtkJ-tYk- akf'(Yk)].
Yk- tf'(Yk),
=
~[(1ak)rkvk
lkl
+ CikJ.tYk- akf'(Yk)].
79
Therefore
_ _1_ { (1-akhk
'Yk+t
ak
Yk
+ J-LYk
} _ 1-a~sx _ -2LJ'( )
ak
k
'Yk+l
Yk
Hence,
Xk
+1
+ Clktl'Yktl(Vkl-Xktl)
= Xk + 1 + fJ,f.lk(Xk + 1 _
'Yk+l +ak+tJJ
Xk)
'
where
Thus, we managed to get rid of {vk}. Let us do the same with 'Yk We
have
Therefore
_
'Yk 1 1-ak)
_ ak(l-ak)
- ak 'Yk+l +ak+l L) - af +akl
+ qak+l
(1- aoho
+ J-Lao.
The latter relation means that 'Yo can be seen as a function of ao. Thus,
we can completely eliminate the sequence {'Yk}. Let us write down the
corresponding scheme.
80
{2.2.9)
= Yk - tf'(Yk)
a1(l-ak)
ak+ak+l'
then
f (xk) - f* S min
{ (1 -
.ft) k , ( Jt1Zy'1o')2}
x [f(xo) - !*
where
"'O
1
+ lll xo -
x*
11 2 ] ,
= no(noL-tL).
1-ao
We do not need to prove this theorem since the initial scheme is not
changed. We change only notation. In Theorem 2.2.3 condition {2.2.10)
is equivalent to /o ~ Jl-
Scheme {2.2.9) becomes very simple if we choose ao =
{this corresponds to /o = 11-). Then
.ft
81
0. Choose Yo = xo E Rn.
(2.2.11)
Xk+l
Yk - tf'(Yk),
However, note that this process does not work for p. = 0. The choice
'Yo = L (which changes corresponding value of ao) is safer.
2.2.2
Convex sets
f(ax
[x,y]
= {z = ax + (1- a)y,
a E [0, 1]}.
82
f(ax
and y belong to
+ (1- a)y)
:$ af(x)
+ {1 -
and
a)f(y) :$ .
0
J(X)
:$ T}
is a convex set.
Proof: lndeed, let z1 = (x1,rt) E Cf and z2 = (x2,r2) E Cf Then for
any a E [0, 1] we have
Za
Thus, z 0 E Cf
be a linear operator:
83
5. Convex hull
Conv (Q I, Q2) = { z E Rn
z = ax + (1 - a),
y,x E QI, y E Q2, a E [0, 1]}.
y = A(x), x E QI}
A(x) E Q2}.
(yi, Y2), YI
QI,
= a1x1 +
(1- a)2x2
= 2x2,
X2 E QI, 2 ;::: 0,
= 1(ax1 +
(1- a)x2),
+ (1 -
a)z2 =
a(1x1
+ (1 -
l)x2)
a(1x1
+ (1- i)Yd
= a(Ax1 +
b) + (1- a)(Ax2 + b)
= A(ax1 +
(1- a)x2) + b.
84
A(x(a)) = A(ax1
+ (1- a)x2) + b
2. Polytope {x E Rn I (ai, x)
intersection of convex sets.
(a,x}
} is convex since
(Ax, x) ~ r 2} is
0
f'(x) = 0
does not work here.
EXAMPLE 2.2.2 Consider the one-dimensional problem:
minx.
x~O
THEOREM 2.2.5 Let f E .1'1 (Rn) and Q be a closed convex set. The
point x* is a solution of (2.2.12} if and only if
(J'(x*),x- x*} ~ 0
(2.2.13}
85
for all x E Q.
f(x) 2 f(x*)
for all x E Q.
Let x* be a solution to (2.2.12). Assurne that there exists some x E Q
suchthat
(J'(x*), x- x*) < 0.
Consider the function cf;(a) = f(x*
cf;(O)
= f(x*),
cj;'{O)
+ a(x- x*)), a
f(x*
+ a(x- x*))
= cf;(a)
< cf;(O)
= f(x*).
That is a contradiction.
THEOREM
x E
f(x) ~ f(xo)}.
Q}.
{2.2.14)
Q we have
+ (f'(xo), x- xo) + ~ II x- xo
11 2
f*
= f(xi)
2 f(x*) + (J'(x*), xi
2 !* + ~ II xi - x*
- x*) + ~ II xi -
x*
11 2
11 2
xi =
x*.
86
2.2.3
Gradient mapping
.r'l'
lr,
II
f'(x)
f'(x)
II
11 2
11 2
Denote
+ (f'(x), x- x) + ~
XQ(x; 'Y)
9Q(x; 'Y)
xEQ
r > 0.
II
x- x
11 2] ,
For Q
=Rn we have
xQ(x; r) = x- ~f'(x),
9Q(x; r) = J'(x).
Thus, the value ~ can be seen as a step size for the "gradient" step
x--+ XQ(X;[).
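As an illustration, the following Python sketch computes the gradient mapping for a box $Q$ (for which the auxiliary minimization reduces to clipping the gradient step) and then iterates $x \to x_Q(x; L)$, the constrained gradient step used in scheme (2.2.18) below; the quadratic objective and the box bounds are assumptions made only for this example.

import numpy as np

def gradient_mapping(x_bar, grad_at_x_bar, gamma, lo, hi):
    # x_Q(x_bar; gamma) minimizes f(x_bar) + <f'(x_bar), x - x_bar> + gamma/2 ||x - x_bar||^2
    # over the box Q = {lo <= x <= hi}; for a box this is a simple projection.
    x_q = np.clip(x_bar - grad_at_x_bar / gamma, lo, hi)
    g_q = gamma * (x_bar - x_q)            # g_Q(x_bar; gamma)
    return x_q, g_q

# assumed smooth strongly convex objective f(x) = 0.5 <Ax, x> - <b, x> on Q = [0, 1]^2
A = np.array([[2.0, 0.0], [0.0, 8.0]]); b = np.array([1.0, -3.0]); L = 8.0
x = np.array([0.9, 0.9])
for _ in range(50):
    x, _ = gradient_mapping(x, A @ x - b, L, lo=0.0, hi=1.0)   # x_{k+1} = x_Q(x_k; L)
print(x)   # approaches the constrained minimizer [0.5, 0]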
Note that the gradient mapping is well defined in view of Theorem 2.2.6. Moreover, it is defined for all $\bar{x} \in \mathbb{R}^n$, not necessarily from $Q$.
Let us write down the main property of the gradient mapping.
S~:l(Rn),
x E Q we have
f(x)
+2~
II
r ;::: L
and
xE
+ (gQ(x; 'Y), x- x)
9Q(x;r)
11 2
+~
(2.2.15)
II
x-x
11 2 .
87
+ (f'(x), x- x) + ~
+ r(x- x),
II
x- x
11 2
f(x)- ~
II
x- x
11 2
f(x)
cfJ(xQ)- ~
cfJ(xQ)- 2~
II
9Q
11 2
+(gQ, x- XQ)
cfJ(xQ)
+ 2~
II
9Q
11 2
+(gQ, x- x),
XQ- x
II
11 2
COROLLARY
r~L
and
x ERn.
Then
(2.2.16)
(2.2.17)
2.2.4
Let us show how we can use the gradient mapping for solving the
following problem:
min J(x),
xEQ
88
(2.2.18)
1. kth iteration (k
0).
2.2.8 Let
II
S!;l(Rn).
Xk - x* 11 2 ~
lf in scheme {2.2.18) h =
!,
then
11 2 .
Proof: Denote rk =II xk- x* II, 9Q = 9Q(Xki L). Then, using inequality
(2.2.17), we obtain
Tf+l
<
+ h2
II 9Q 11 2
t we have
Consider now the optimal schemes. We give only a sketch of the justification since it is very similar to that of Section 2.2.1.
First of all, we define the estimate sequence. Assume that $x_0 \in Q$. Define
4>o(x) =
cPk+I(x) =
f(xo)
+ lll x- xo
{1- ak)cPk(x)
11 2 ,
+(gQ(Yk;L),x- Yk)
+~
II x- Yk 11 2 ).
11 2
89
Note that the form of the recursive rule for $\phi_k(x)$ has changed. The reason is that now we use inequality (2.2.15) instead of (2.1.16). However, this modification does not change the analytical form of the recursion, and therefore it is possible to keep all the convergence results of Section 2.2.1.
Similarly, it is easy to see that the estimate sequence $\{\phi_k(x)\}$ can be written as
c/J'k:
'Yk+I
+ ( E- 2-y:~l)
II
gQ(Yki L)
+ak(~~:thk (~ II Yk -Vk
Further, assuming that c/J'k
f(xk)
II
+(gQ(Yk;L),vk -yk))
f(xQ(Yki L})
+A
11 2
11
gQ(Yki L)
11 2
+~
Xk- Yk
II
11 2 ],
c/J'k+l
(1- ak)f(xk)
+(
+ akf(xQ(Yki L))
n- 2-y:~~)
11
YQ(Yki L)
11
27:~ 1 )
2
II
gQ(Yki L)
11 2
90
Yk
= XQ(Yki L),
'Yk+letk/L
(O!k"fkVk
+ 'Yk+IXk)
z.
1. kth iteration (k
2 0).
Clearly, the rate of convergence of this method is given by Theorem 2.2.3. In this scheme only the points $\{x_k\}$ are feasible for $Q$. The sequence $\{y_k\}$ is used for computing the gradient mapping and may be infeasible.
2.3
(Minimax problem: gradient mapping, gradient method, optimal methods; Problem with functional constraints; Methods for constrained minimization.)
2.3.1
Minimax problem
Very often the objective function of an optimization problem is composed of several components. For example, the reliability of a complex system is usually defined as the minimal reliability of its parts. A constrained minimization problem with functional constraints provides another example of the interaction of several nonlinear functions.
The simplest problem of that type is called the minimax problem. In this section we deal with the smooth minimax problem:
$$\min_{x \in Q}\ \Big[ f(x) = \max_{1 \le i \le m} f_i(x) \Big] \qquad (2.3.1)$$
Function
+~
II
x- x
11 2 ,
{2.3.2}
f(x) ~ f(x; x)
+t
II
x- x
11 2
{2.3.3}
Proof: Indeed,
+ (Ji(x), x- x) + ~
II
x- x
11 2
fi(x) ~ fi(x)
(see (2.1.6)).
+ (Jf(x), x- x) + t II x- x
11 2
92
Let us write down the optimality conditions for problem (2.3.1) (compare with Theorem 2.2.5).
THEOREM
for any x
2.3.1 A point x*
Q we have
{2.3.4)
</>i(a) = fi(x*
Note that for all i, 1 ~ i
fi(x*)
i = 1 ... m.
m, we have
</>i(O)
+ a(x- x*)),
= m_ax fi(x*).
l~t~m
= f(x*),
4>~(0)
fi(x*
for all i, 1 ~ i
+ a(x- x*))
m. That is a contradiction.
COROLLARY
f(x) ~ f(x*)
+~
II
x- x*
11 2
for all x E Q.
> f(x*;x) + ~ II x- x*
>
f(x*;x*)
+~
II
x- x*
Q we
11 2
11 2 =
f(x*)
+~
II
x- x*
11 2
93
>
Q is bounded:
for any x E
I x E Q}.
(2.3.5)
Q we have
II
x- x
11 2 ,
consequently,
~ II x-
x 11 2 ~11 J'(x)
II II
x- x II
+J(x) -fi(x).
2.3.2
xi
+ ~ II xi -
x* 11 2 ~ f(x*)
+ ~ II xi -
= x*.
x*
11 2
Gradient mapping
In Section 2.2.3 we introduced the gradient mapping, which replaces the gradient for a constrained minimization problem over a simple set. Since the linearization of a max-type function behaves similarly to the linearization of a smooth function, we can try to adapt the notion of gradient mapping to our particular situation.
Let us fix some $\gamma > 0$ and $\bar{x} \in \mathbb{R}^n$. Consider a max-type function $f(x)$. Denote
$$f_\gamma(\bar{x}; x) = f(\bar{x}; x) + \frac{\gamma}{2}\,\|x - \bar{x}\|^2.$$
The following definition is an extension of Definition 2.2.3.
DEFINITION
2.3.2 Define
f*(x;'Y)
min/'Y(x; x),
XJ(Xi/)
YJ(x; 1)
/(X- Xf(Xi/)).
xEQ
xeQ
94
fi(x)
+ (Jf(x), x- x) + ~
II x- x II 2 E S~;~(Rn),
i = 0 ... m.
Xf
+ (gJ(x; 1'), x- x) + 2~
= Xf(X;')'),
9!
II 9J(x; 1') 11 2
(2.3.6}
= 9J(x;')').
xEQ
f(x;x)
! 1 (x;x)-
~ II x- x 11 2
II x-
x 11 2 )
f*(x;')')
+ 1-(x- x 1,2(x- x) + x- x 1)
f*(x; 1')
+ (91, x- x) + 2~
II 9! 11 2
COROLLARY
f (X)
;::::
X-
X 11 2 .
(2.3. 7)
95
2. lf x E Q, then
J(x,(x;'Y)) ~ J(x)- 2~
3. For any
x E Rn
II
9J(x;'Y)
{2.3.8}
11 2 ,
we have
{2.3.9}
Proof: Assumption 'Y 2: L implies that f*(x; 1) 2: f(xJ(x; 'Y)). Therefore (2.3.7) follows from (2.3.6) since
f(x) 2: f(x;x}
+ ~ II x- x
11 2
= x*,
0
Proof: Denote
we have
J(x;x}
Xi
/2
+ ~ II x- x
11 2
In view of (2.3.6),
2: f*(x;'Yd + (91,x- x)
+ 2~1
II
91 11 2
+~
II X -
(2.3.10}
X 11 2
f*(x; 12) =
f(x; x2)
+~
II
x2-
x 11 2
>
J*(x;,l)
+ (91,x2- x) + 2~ 1
f*(x;11)
+ 2~ 1
II
91 11 2 -_;2 (91,92)
> J*(x;11) + 2~ 1
II
91 11 2 - 2~2 II 91 11 2 .
II
91 11 2 +~ II x2- x 11 2
+ 2;2
II
9211 2
96
2.3.3
> 0.
(2.3.11)
2 0).
THEOREM
then
II
Xk- x* 11 2 $ (1 - p,h)k
II
f,
xo- x* 11 2
Proof: Denote rk =II Xk- x* II, g = gf(Xki L). Then, in view of (2.3.9)
we have
r~+l
= II
t)
II g 11 2 $
11 2
(1 -p,h)r~.
0
we have
Comparing this result with Theorem 2.2.8, we see that for the minimax problem the gradient method has the same rate of convergence as it has in the smooth case.
Let us check what the situation is with the optimal methods. Recall that in order to develop an optimal method we need to introduce an estimate sequence with some recursive updating rules. Formally, the minimax problem differs from the unconstrained minimization problem only by the form of the lower approximation of the objective function. In the case of unconstrained minimization, inequality (2.1.16) was used for
97
c/Jo(x) =
II
x- xo
11 2 ,
(1- ak)<f>k(x)
cPk+I(x) =
+~
II
X- Yk
11 2 ].
Comparing these relations with (2.2.3), we find the difference only in the constant term (it is in the frame). In (2.2.3) this place was taken by $f(y_k)$. This difference leads to a trivial modification in the results of Lemma 2.2.3: all occurrences of $f(y_k)$ must be formally replaced by the expression in the frame, and $f'(y_k)$ must be replaced by $g_f(y_k; L)$. Thus, we come to the following lemma.
LEMMA
=<PZ + 1f
II
x- Vk
11 2 ,
where the sequences bk}, { vk} and {<Pk} are defined as follows: vo = xo,
<Po = f(xo) and
'Yk+l
(1- O:'khk
o2
+ O:'kiJ.,
I (
+~I 9! YkiL)
II 2
Now we can proceed exactly as in Section 2.2. Assume that $\phi_k^* \ge f(x_k)$. Inequality (2.3.7) with $x = x_k$ and $\bar{x} = y_k$ becomes
f(xk)
2:: f(xJ(YkiL))
+ (gJ(YkiL),xk- Yk)
+A II 9J(Yki L) 1 2 +~ II Xk- Yk
11 2
98
Hence,
II
9J(Yki L)
11 2
+ Xk- Yk).
Xj(Yki L),
Xk+l
La~
Yk
(1- ak)rk
+ akt-t = rk+l
1
1 d akf.k (akrkVk
+ rk+lxk)
Let us write down the resulting scheme in the form of (2.2.!)), with
eliminated sequences {vk} and {rk}.
(2.3.12)
b). Compute ak+l E (0, 1) from equation
ak(l-ak)
a%+ak+l'
+ qak+l
99
Jii, then
f(xk)- f* S min{
x [f(xo)-
where "' =
tO
!* + ~ II xo- x*
11 2],
ao(aoL-JL).
1-ao
Note that the scheme (2.3.12) works for all $\mu \ge 0$. Let us write down the method for solving (2.3.1) with strictly convex components.
0. ehoose xo E Q. Set Yo
VI-{Ji
= xo, = VL+,fii'
(2.3.13}
THEOREM
f(xk) - f* :5 2 ( 1
-{f) k (f(x 0) -
(2.3.14}
f*).
Jli.
100
l~t~m
II
x- xo
11 2 }.
fi(xo)
+ (ff(xo), x- xo)
t(i)
i=l
s. t.
+ ~ II x- xo
11 2 }
~ t(i), i = 1 ... m,
(2.3.15)
xEQ, tERm,
2.3.4
Optimization with functional constraints
Let us show that the methods described in the previous section can be used for solving a constrained minimization problem with smooth functional constraints. Recall that the analytical form of such a problem is as follows:
$$\min\ f_0(x), \quad \text{s.t.}\ f_i(x) \le 0,\ i = 1 \dots m, \quad x \in Q, \qquad (2.3.16)$$
where the functions $f_i$ are convex and smooth and $Q$ is a closed convex set. In this section we assume that $f_i \in S^{1,1}_{\mu,L}(\mathbb{R}^n)$, $i = 0 \dots m$, with some $\mu > 0$.
The relation between problem (2.3.16) and minimax problems is established by a special function of one variable. Consider the parametric max-type function
$$f(t; x) = \max\{ f_0(x) - t;\ f_i(x),\ i = 1 \dots m \}, \qquad t \in \mathbb{R}^1,\ x \in Q. \qquad (2.3.17)$$
101
Note that the components of the max-type function $f(t; \cdot)$ are strongly convex in $x$. Therefore, for any $t \in \mathbb{R}^1$ the solution of problem (2.3.17), $x^*(t)$, exists and is unique in view of Theorem 2.3.2.
We will try to get close to the solution of (2.3.16) using a process based on approximate values of the function $f^*(t)$. This approach can be seen as a variant of sequential quadratic optimization, and it can also be applied to nonconvex problems.
Let us establish some properties of the function $f^*(t)$.
LEMMA
>
< t*.
0 for all t
= max{fo(x*) - t; ]i(x*)}
t < t*,
0.
Ii (y)
0, i = 1 ... m.
j*(t)-!:::..
j*(t + !:::..)
f*(t).
102
Proof: Indeed,
f*(t
+ )
min
xEQ
< min
xeQ
f*(t
+ )
~ax
{fo(x)- t- ; fi(x)}
~ax
l~z~m
l~z~m
> min
xeQ
l~z~m
~ax
l~z~m
In other words, function f*(t) decreases in t and it is Lipschitz continuous with constant equal to 1.
f*(tl - ) ~ f*(ti)
0 we have
+ f*(t~l=f:(t2).
(2.3.18}
f*(ti)
+ ax*(t2).
Wehave
<
m.ax {(1- a)(fo(x*(to))- to) + a(fo(x*(t2))- t2);
< ~~~~m
(1- a)fi(x*(to))
+ afi(x*(t2))}
103
Note that Lemmas 2.3.5 and 2.3.6 are valid for any parametric max-type functions, not necessarily formed by the functional components of problem (2.3.16).
Let us now study the properties of the gradient mapping for a parametric max-type function. To do so, let us first introduce a linearization of the parametric max-type function $f(t; x)$:
+! II x-i 11 2 ,
f'Y(t; i; x)
f(t;i;x)
f*(t; i; 'Y) =
min f,y(t; i; x)
Xf(t; i; "f)
argmin J'Y(t;i;x)
g 1(t;x;,)
'Y(i- x 1(t;x;,)).
xEQ
xEQ
fi(x)
II
x- x
11 2 ,
+ (JI(x), x- x) - t +! II x- x 11 2 ,
i = 1 ... m.
Moreover, f'Y(t; x; x) E S~;~(Rn). Therefore, for any t E R 1 the constrained gradient mapping is weil defined in view of Theorem 2.3.2.
Since f(t; x) E S~;L(Rn), we have
(2.3.20)
There are two values of "(, which are important for us. Theseare 'Y = L
and "( = 1-t Applying Lemma 2.3.2 to max-type function J'Y(t;i;x) with
104
/l
= L and 1 2 = J.L,
11 2
(2.3.21)
t*(x, t)
x ERnwehave
t-t
t (x,t)-t
1 ~,J*(f; x; J.L).
Thus, f*(t; x; J.L) > 0 and, since f*(t; x; J.L) decreases in t, we get
t*(x, l) > l.
Denote ~ = f- t. Then, in view of (2.3.20), we have
f*(t;
x;
>
(1- "") ( 1 +
+ t(x~)-tf*(t; x; J.L)
t(x~)-t) f*(t; x; L)
0
105
2.3.5
(2.3.22)
=arg min
Xk+!
= XJ(tkiXk,j(k)iL).
O<
'< '(k)
_)_)
j*(tkiXk 3;L),
'
2.3.8
L)<to-t[
1
]k .
f *(tx
k, k+1,
- 1-tt 2(1-tt)
Proof: Denote = 2 ( 1 ~~~:) ( < 1) and
106
<
K)f*(tkiXk,j(k)iL)
.jtk+l-tk
-
f*(tk-liXk-l,j(k-l)iL))
.jtk-tk-1
= kJ* (t o; Xo,j(O); L)
t,.+t-tk
tt -to
f*(tk; xk,j(k)i L)
/*(to;xo,j(O)il')
= f*(tk; xk,j*(k)i L)
S t* + E,
fi(x*)
(2.3.23)
,
i = 1 .. . m.
t -t
(2.3.24)
full iterations of the master process (the last iteration of the process, in
general, is not full since it is terminated by the Global stop rule). Note
that in this estimate K is an absolute constant {for example, K = :\).
107
Let us analyze the complexity of the internal process. Let the sequence $\{x_{k,j}\}$ be generated by (2.3.13) with the starting point $x_{k,0} = x_k$. In view of Theorem 2.3.6, we have
f*(tk))
~ 2e-u-i(J(tkiXk)- f*(tk))
:5 2e-uj j(tk;xk),
jli.
where a =
Denote by N the number of full iterations of the process (2.3.22)
(N ~ N(E)). Thus, j(k) is defined for all k, 0 ~ k ~ N. Note that
tk = t*(xk-1,j(k-1) tk-1) > tk-1 Therefore
k 2: 1,
= f(to; xo).
(2.3.25)
A II 9j(tk; Xk,ji L 11 2
<
f*(tkiXk,jiJ.t)
11 2
108
And that is the termination criterion of the internal process in Step 1a) of (2.3.22).
The above result, combined with the estimate of the rate of convergence for the internal process, provides us with the total complexity estimate of the constrained minimization scheme.
LEMMA 2.3.10
For alt k, 0
J'(k)
N, we have
fi{.
where a =
VL
<
k+l
COROLLARY
2.3.3
(N + 1) [1 + fi In~] + fi ln _Au_,
L: j(k) <
LlN+l
V Ii
KJL
V Ii
k=O
0
2.3.11
'*
< 1 + fi.
VIi
J -
}n 2(L-JL)LlNtl.
KJL
Proof: The proof is very similar tothat of Lemma 2.3.10. Suppose that
j* _ l
<
109
That is a contradiction.
COROLLARY 2.3.4
j*
+ Ej(k)::;
k==O
(N
+ 2) [1 +
lf In 2(~/>] + lf In~.
Let us put all things together. Substituting the estimate (2.3.24) for the number of full iterations $N$ into the estimate of Corollary 2.3.4, we come to the following bound for the total number of internal iterations in the process (2.3.22):
t -t
1
[ ln[2{1-!~:))ln
(1o-,.)e
+V'fiIi ln
+ 2] . [1 + V!IJj ln ~]
"11(2.3.26)
Note that method (2.3.13), which implements the internal process, calls the oracle of problem (2.3.16) only once at each iteration. Therefore, we conclude that estimate (2.3.26) is an upper complexity bound for problem (2.3.16) with the $\epsilon$-solution defined by (2.3.23). Let us check how far this estimate is from the lower bounds.
The principal term in estimate (2.3.26) is of the order
$$\sqrt{\frac{L}{\mu}}\;\ln\frac{t_0 - t^*}{\epsilon}\cdot\ln\frac{L}{\mu}.$$
This value differs from the lower bound for an unconstrained minimization problem by a factor of $\ln\frac{L}{\mu}$. This means that the scheme (2.3.22) is at least suboptimal for constrained optimization problems. We cannot say more since a specific lower complexity bound for constrained minimization is not known.
To conclude this section, let us answer two technical questions. Firstly, in scheme (2.3.22) we assume that we know some estimate $t_0 < t^*$. This assumption is not binding since we can choose $t_0$ equal to the optimal value of the minimization problem
$$\min_{x \in Q}\ \Big[ f(x_0) + \langle f'(x_0), x - x_0 \rangle + \frac{\mu}{2}\,\|x - x_0\|^2 \Big].$$
110
fo(x)
+ (!(x),x- x} + ~ II x- x
11 2
-t,
fi(x)
+ (fi(x),x- x) + ~ II x- x
11 2 ,
i = 1. .. m.
In view of Lemma 2.3.4, it is the optimal value of the following minimization problem:
min [fo(x)
s.t.
fi(x)
XE
+ U(x), x- x) + ~ II x- x
+ (ff(x), x- x) + ~ II x- x
11 2 ],
11 2 ~ 0, i = 1 ... m,
Q.
This problem is not a quadratic optimization problem, since the constraints are not linear. However, it can be solved in finite time by a simplex-type process, since the objective function and the constraints have the same Hessian. This problem can also be solved by interior-point methods.
Chapter 3
3.1
{Equivalent definitions; Closed functions; Continuity of convex functions; Separation theorems; Subgradients; Computation rules; Optimality conditions.)
3.1.1
In this chapter we consider methods for solving the general convex minimization problem
$$\min\ f_0(x), \quad \text{s.t.}\ f_i(x) \le 0,\ i = 1 \dots m, \quad x \in Q, \qquad (3.1.1)$$
where $Q$ is a closed convex set and $f_i(x)$, $i = 0 \dots m$, are general convex functions. The term general means that these functions can be nondifferentiable. Clearly, such a problem is more difficult than a smooth one.
Note that nonsmooth minimization problems arise frequently in different applications. Quite often some components of a model are composed of max-type functions:
112
i= 0.
DEFINITION
+ (1- a)y)
f(ax
:::; af(x)
+ (1- a)J(y).
At this point we are not ready to speak about any method for solving (3.1.1). In the previous chapter our optimization methods were based on the gradients of smooth functions. For nonsmooth functions such objects do not exist and we have to find something to replace them. However, in order to do that, we should first study the properties of general convex functions and justify the possibility of defining a generalized gradient. That is a long way, but we have to pass through it.
A straightforward consequence of Definition 3.1.1 is as follows.
LEMMA 3.1.1 (Jensen inequality) For any $x_1, \dots, x_m \in \mathrm{dom}\, f$ and coefficients $\alpha_1, \dots, \alpha_m$ such that
m
O:i
= 1,
O:i;:::
0, i = 1 ... m,
(3.1.2)
i=l
we have
m+l
L
i=l
O:iXi =
O:!Xl
113
where i =
clearly,
~
l~a 1
Li
= 1,
2 0, i = 1. .. m.
i=l
0
m
The point x =
2:: aiXi
i=l
x1, ..
,xm.
Then
E ai =
i=l
1,
0
CoROLLARY
3.1.2 Let
i=l
i=l
Lai= 1}.
0
THEOREM
f(y
+ (y- x)) ?
f(y)
+ (f(y)- f(x)).
{3.1.3)
114
= -dRI
and u
= y + (y- x).
Then
y= l~(u+x)=(1-a)u+ax.
Therefore
+ af(x)
= !hf(u)
+ !RIJ(x).
1]. Derrote
= 1 ~ 0 and u = ax
+ (f(u)- J(y))
I t 2::
f(x)}
is a convex set.
Proof: Indeed, if (x1, tr) E epi (!) and (x2, t2) E epi (!), then for any
a E [0, 1] we have
at1
+ (1- a)t2 2:: af(xl) + {1- a)j(x2) 2:: j(ax1 + (1- a)x2).
(x1,J(xl)) E epi{f),
Therefore (ax1
(x1,/(x2) E epi{f).
j(ax1
+ (1- a)x2)
:::; af(xl)
+ (1- a)j(x2).
D
Cr()
{x E domf
f(x):::; }
115
+ (1- a)x2)
af(xi)
+ (1- a)j(x2)
~a+
(1- a) = .
0
a closed set.
THEOREM
Proof: By its definition, (.CJ(),) = epi (f) n{(x, t) I t = }. Therefore, the epigraph .Cf () is closed and convex as an intersection of two
closed convex sets.
0
Note that, if f is convex and continuous and its domain dom f is
closed, then f is a closed function. However, in general, a closed convex
function is not necessarily continuous.
Let us look at some examples of convex functions.
EXAMPLE
2. f(x)
I t?. x,
t?. -x},
116
5. Function f(x)
f(a.xl
a.
II Xl II +(1 -
a.)
II X2 II
for any Xt,X2 E Rn and a. E (0, 1]. The most important norms in
numerical analysis are so-called lp-norms:
p~l.
Among them there are three norms, which are commonly used:
-
i=l
II x llt=
2.
f: I xCi) I, p = 1.
The h-norm:
i=l
II X lloo=
ffi!iX
l::S~::Sn
I X(i) I .
II$ r},
0,
Bp(xo,r) = {x ERn
111
x- xo
llp$ r}.
117
6. Up to now, all our examples did not show up any pathological behavior. However, let us Iook at the following function of two variables:
0,
f(x, y) = {
cp(x, y),
where cp(x, y) is an arbitrary nonnegative function defined on a unit
circle. The domain ofthis function is the unit Euclidean disk, which is
closed and convex. Moreover, it is easy to see that f is convex. However, it has no reasonable properties on the boundary of its domain.
Definitely, we want to exclude such functions from our considerations.
That was the reason for introducing the notion of closed function. It
is clear that f (x, y) is not closed unless cjJ( x, y) = 0.
D
3.1.2
In the previous section we have seen several examples of convex functions. Let us describe a set of invariant operations, which allow us to
create more complicated objects.
THEOREM 3 .1. 5 Let functions /I and h be closed and convex and let
2:: 0. Then alt functions below are closed and convex:
1. f(x)
2. f(x)
2. For all
Xl,X2
Ci
E [0, 1] we have
k-too
Xk
x E domj,
lim tk = l.
k-too
118
Since
fi(x),
f2(x).
Therefore
k--too
~ f(x).
{(x,t)
It~
fi(x) t
f2(x) x E (domfi)n(domf2)}
epifi nepif2.
Let function ifJ(y), y E Rm, be convex and closed. Consider a linear operator
THEOREM 3 .1. 6
A(x) = Ax + b:
Rn -t Rm.
I A(x) E domifJ}.
ifJ(ayl
= A(xl), Y2 = A(Y2)
Then
+ (1 - a)y2)
af(xl)
+ (1 -
a)j(x2).
1 It is important to understand that a similar property for convex sets is not valid. Consider the following two-dimensional example: $Q_1 = \{(x,y): y \ge \frac{1}{x},\ x > 0\}$, $Q_2 = \{(x,y): y = 0,\ x \le 0\}$. Both of these sets are convex and closed. However, their sum $Q_1 + Q_2 = \{(x, y) : y > 0\}$ is convex and open.
119
Thus, j(x) is convex. The closedness of its epigraph follows from continuity of the linear operator A(x).
D
The next theorem is one of the main suppliers of convex functions
with implicit structure.
THEOREM
3.1. 7 Let
f(x) = sup{cp(y,x)
y
yE
~}.
Suppose that for any fixed y E ~ the function cp(y, x) is closed and convex
in x. Then f(x) is a closed and convex function with domain
domf={xE
n domcp(y,)l
:3/:c/J(y,x)S/VYE~}.
(3.1.4)
yEt:.
yEt:.
Therefore
closed.
ID!lX
l:St:Sn
f(x) = sup
L .>.(i) fi(x),
>.Et:. i=l
c/J>.(x)
L >,(i) fi(x)
i=l
120
are convex and closed. Thus, j(x) is closed and convex in view of
Theorem 3.1.7. Note that we did not assume anything about the
structure of the set b..
3. Let Q be a convex set. Consider the function
g E Q}.
'1/JQ(tx) = t'I/JQ(x),
E domQ,
~ 0.
4. Let Q be a set in Rn. Consider the function '1/J (g, 'Y) = sup cp(y, g, 'Y),
yEQ
where
c/J(y, g, 1) = (g, y) - ~ II Y 11 2
The function '1/J(g, 'Y) is closed and convex in (g, 'Y) in view ofTheorem
3.1.7. Let us look at its properties.
'1/J(g, 'Y) = {
if g
= 0, 'Y = 0,
if 'Y
> 0,
1Wf
2')' '
with the domain dom'!f; = (Rn x {'Y > 0}) U{O,O). Note that this
is a convex set, which is neither closed nor open. Nevertheless, '1/J
is a closed convex function. At the same time, this function is not
continuous at the origin:
0
121
3.1.3
In the previous sections we have seen that the behavior of convex functions at the boundary points of its domain can be rather disappointing
(see Examples 3.1.1(6), 3.1.2(4)). Fortunately, this is the only bad news
about convex functions. In this section we will see that the structure of
convex functions in the interior of its domain is very simple.
LEMMA 3.1.2 Let function f be convex and xo E int (domf). Then f
is locally upper bounded at xo.
>
0 such that
xo
Eei
Jn
x = x0
i=l
i=l
Indeed, consider
+ Lhiei, L(hi)2:::; E.
x =
xo
+ "f:
i=l
hiei = xo
+~
"f: hiEei
i=l
max f(x)
= xEB2(xo,l)
l:Sz:Sn
Eei).
122
= ~ II y- xo II,
II z - xo II= ~ II
It is clear that
y = az + (1- a)x 0 Hence,
= xo + ~(y- xo).
y - xo II= E. Therefore
z
a :::; 1 and
f(y)
f(xo)
+ M-{(xo) II Y- xo II
f(y)
M-{(xo}
II Y- Xo II
M-!(xo)
II Y- xo II
(3.1.5)
a.(.O
a > 0.
f(x
+ ap)
+ Ep
E domf.
+ f(x + ap).
Therefore
<!J(a)
123
f(y) 2: f(x)
+ f'(x;y-
x).
(3.1.6)
lim i-[f(x
a.j.O
+ Tap)-
T lim -1 [f(x
.j.O
f(x)]
+ p)- f(x)]
= T f'(xo;p).
f'(x; P1
+ {1 -
)P2)
<
lim i-{[f(x
a.j.O
+ api)- f(x)]
f'(x;pl)
+ {1- )f'(x;p2)
f(y)
124
3.1.4
Separation theorems
tl(g,'"'f)
= {x ERn
(g,x}
= '"'f},
g =f 0,
(3.1. 7)
?rq(xo) = argmin{ll x- xo
II:
Q}.
jection ?rq(xo).
Proof: Indeed, 1rq(x 0 ) = arg min{ <P(x) I x E Q}, where the function
<P(x) = II X- Xo 11 2 belongs to st:i(Rn). Therefore 1l'q(xo) is unique
and well defined in view of Theorem 2.2.6.
D
x E Q we have
2.2.5 we have
(<P1(11'q(xo)),x- ?rq(xo)) ~ 0
125
x -1l"Q(xo)
11 2
Q we have
+ II 11"Q(xo)- xo
11 2 ~11 x- xo
11 2
x- 11"Q(xo)
11 2 - II
x- xo
11 2
< -
II
xo - 11"Q(xo)
11 2
Now we can prove the separation theorems. We will need two of them. The first one describes our possibilities in strict separation.
THEOREM
g = xo- 11"Q(xo)-:/= 0,
'Y
II
xo- 11"Q(xo)
11 2
(g, x}
dom'I/JQ 2 and 'I/JQ 1 (g) > 'I/JQ 2 (g). That is a
~
Q2 and Q2
Q1 . Therefore,
126
THEOREM
(gk! x)
However,
(3.1.9)
(Lemma 3.1.5)
k-too
3.1.5
Subgradients
have
f(x)
f(xo)
+ (g, x- xo).
(3.1.10}
f(y)
=I x I,
x E R 1 For all y E R 1
127
f(y)
= f(yo:)
J(x)
-aT
+ (d, x) ::::;
-aj(xo)
+ (d, xo)
(3.1.11)
II d 112 +a2 =
1.
(3.1.12)
Since for all $\tau \ge f(x_0)$ the point $(x_0, \tau)$ belongs to $\mathrm{epi}\,(f)$, we conclude that $a \ge 0$.
Recall that a convex function is locally upper bounded in the interior of its domain (Lemma 3.1.2). This means that there exist some $\epsilon > 0$ and $M > 0$ such that $B_2(x_0, \epsilon) \subseteq \mathrm{dom}\, f$ and
II x- xo I
128
II x- xo II
II
II
f(x) 2 f(xo)
+ (g, x- xo)
f(x) - f(xo) :S M
II
II
we
II x- xo II= M .
Let us show that the conditions of the above theorem cannot be relaxed.
EXAMPLE 3.1.4 Consider the function $f(x) = -\sqrt{x}$ with the domain $\{x \in \mathbb{R}^1 \mid x \ge 0\}$. This function is convex and closed, but the subdifferential does not exist at the boundary point $x = 0$.
f'(xo;p) = max{(g,p)
For any xo E
g E j(xo)}.
f'(xo;p) = ~N ~[f(xo
+ ap)- f(xo)] 2
(g,p),
(3.1.13)
129
the other hand, since f'(x 0 ;p) is convex in p, in view of Lemma 3.1.3,
for any y E dom f we have
f(y) 2: f(xo)
(3.1.14)
--+ 0, we obtain
J'(xo;p)- (gp,p) ~ 0.
(3.1.15)
min
xEdomf
0 E 8f(x*).
(g,xo- x 2: 0 Vx E LJ(f(xo))
Proof: Indeed, if f(x)
f(xo)
={x
E domf:
f(x) ~ f(xo)}.
+ (g, x- xo)
f(x)
f(xo).
0
130
= argmin{f(x) I
x E Q}.
3.1.6
Computing subgradients
(f'(x),p}
= f'(x;p) 2:
(g,p).
Changing the sign of p, we conclude that (f'(x),p) = (g, p) for all g from
f(x). Finally, considering p = ek, k = 1 ... n, we get g = f'(x).
0
Let us provide all operations with convex functions, described in Section 3.1.2, with corresponding rules for updating subgradients.
LEMMA 3 .1. 8 Let function f (y) be closed and convex with dom f
Consider a linear operator
A(x)
= Ax + b:
Rm.
Rn---+ Rm.
cfi'(xo,p)
= f'(yo;Ap) =
=
max{(g,Ap) I g E j(yo)}
max{ (g,p) I g E AT j(yo)}.
131
jor any
n int (domf2).
Proof: In view of Theorem 3.1.5, we need to prove only the relation for
int (dom h). Then,
the subdifferentials. Consider Xo E int (dom /I)
for any p E Rn we have
max{(91,a1p)
I 91
E 8/I(xo)}
+ a292,p) I 91
max{ (a191
max{(9,p) I 9 E at8ft(xo)
+ a2h(xo)}.
Note that both fi(xo) and fi(xo) are bounded. Hence, using Theoo
rem 3.1.14 and Corollary 3.1.3, we get (3.1.16).
3.1.10 Let functions fi(x), i = 1 ... m, be closed and convex.
Then function f(x) = m_ax fi(x) is also closed and convex. For any
LEMMA
XE
int (dom/) =
1$t$m
i=l
(3.1.17}
n
m
i=l
= 1$i$k
max
ff(x;p)
= 1$i$k
max
max{(gi,p) I 9i E fi(x)}.
132
where D.k
= Pi
Therefore,
f'(x;p)
0,
L: Ai = 1},
i==l
The last rule can be useful for computing some elements of the subdifferential.
LEMMA 3.1.11 Let be a set and f(x) = sup{4>(y,x) I y E }.
Suppose that for any fixed y E D. the function 4>(y, x) is closed and
convex in x. Then f(x) is closed convex.
Moreover, for any x from
domf = {x ERn
we have
where I(x)
3')': 4>(y,x)
')'Vy
= {y I 4>(y,x) = f(x)}.
+ (g,x- xo)
= f(xo)
+ (g,x- xo).
=I x I, x E R 1 .
f(x) =
Then 8f(O)
max g x.
-1:599
= [-1, 1] since
2. Consider function
Then j(x) =
f (x)
133
m
I: I (ai, x) -
i=l
bi
h (X)
I o(x)
{i : (ai, x) - bi = 0}.
I:
ai-
I:
iEL(x)
iElo(x)
=II x II
{x/
I 1::; i::; n}
0}.
=D.n.
II x 1!},
x =/= 0.
I:
iEl+(x)
where I+(x) = {i
x(i) =
x(i)
we have
'
{i :
8](0) = B 00 (0 1) = {x ERn
{i
[-ai,ai].
l<~<n
+ I:
ai
j(x)
< 0},
L(x)
iEI+(x)
Derrote
ei-
I x(i) >
I:
iEL(x)
ei
max
l:Si:Sn
+ I:
iElo(x)
0}, L(x) = {i
I x(i) I< 1}
-
'
I x(i) <
0} and Io(x) =
134
fi(x)
O,i = 1. .. m}
(3.1.18)
f~(x*)
+ L Adf(x*) =
0,
iEl*
where I*
= {i
E [1, m) : fi(x*)
= 0}.
5.of~(x*)
+L
~df(x*)
iEl"
= 0,
5.o
+L
~i
= 1.
iEl*
iEl*
iEl*
This contradicts the Slater condition. Therefore 5. 0 > 0 and we can take
Ai = J..if 5.o, i E J*.
0
Theorem 3.1.17 is very useful for solving simple optimization problems.
LEMMA
3.1.12 Let A
>- 0. Then
Proof: Note that all conditions of Theorem 3.1.17 are satisfied and
the solution x* of the above problern is attained at the boundary of the
feasible set. Therefore, in accordance with Theorem 3.1.17 we have to
solve the following equations:
c = AAx*,
(Ax*, x*} = 1.
0
135
3.2
3.2.1
(3.2.1)
xERn
where
Model:
1. Unconstrained rninirnization.
2. f is convex on Rn and Lipschitz
continuous on a bounded set.
Oracle:
Approximate
solution:
Find x E Rn : f(x) - j*
Methods:
(3.2.2)
~ .
136
+ ~ I x 1 2,
k = 1. . . n.
I(x)
{J. I 1 < J.
-
<k
'
Therefore for any x, y E B2(0, p), p > 0, and 9k(Y) E fk(Y) we have
fk(Y)- fk(x)
< (gk(y), y- x}
< I 9k(Y) 1111 y-x
II~ (pp+!)
II
y-x
II
Rk
Let us describe now a resisting oracle for function fk(x). Since the
analytical form of this function is fixed, the resistance of this oracle
consists in providing us with the worst possible subgradient at each test
137
Input: $x \in \mathbb{R}^n$.
Main Loop: $f := -\infty$; $i^* := 0$;
  for $j := 1$ to $m$ do
    if $x^{(j)} > f$ then { $f := x^{(j)}$; $i^* := j$ };
  $f := \gamma f + \frac{\mu}{2}\|x\|^2$;  $g := \gamma e_{i^*} + \mu x$;
Output: $f_k(x) := f$, $g_k(x) := g \in \mathbb{R}^n$.
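A small Python sketch of this resisting oracle; here the function is assumed to be $f(x) = \gamma\max_{1\le j\le m} x^{(j)} + \frac{\mu}{2}\|x\|^2$, and the constants $\gamma$, $\mu$, $m$ as well as the generic subgradient step in the usage example are illustrative.

import numpy as np

def resisting_oracle(x, m, gamma, mu):
    # returns f(x) and the "worst" subgradient gamma * e_{i*} + mu * x,
    # where i* is the *first* maximal coordinate among x^(1), ..., x^(m)
    f_val, i_star = -np.inf, 0
    for j in range(m):
        if x[j] > f_val:
            f_val, i_star = x[j], j
    g = mu * x.copy()
    g[i_star] += gamma
    return gamma * f_val + 0.5 * mu * x @ x, g

# any scheme started at x0 = 0 can only move inside the subspace spanned by
# the coordinate vectors already returned by the oracle
x = np.zeros(5)
for k in range(3):
    f, g = resisting_oracle(x, m=5, gamma=1.0, mu=0.1)
    x = x - 0.5 * g            # a generic subgradient step
    print(k, f, np.round(x, 3))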
At first glance there is nothing special in this scheme. Its main loop is just a standard process for finding the maximal coordinate of a vector from $\mathbb{R}^n$. However, the main feature of this loop is that we always form the subgradient as a coordinate vector. Moreover, this coordinate corresponds to $i^*$, which is the first maximal component of the vector $x$. Let us check what happens with a minimizing sequence that uses such an oracle.
Let us choose the starting point $x_0 = 0$. Denote
Rp,n = {x ERn
x(i) =
0, p
+1 ~ i
~ n}.
+ Iei,
= 0.
138
> 0.
THEOREM
f(xk)- f* ~
2(l:Jkr)
= ~'
J1.
= (l+v'~+l)R"
Then
II xo -
x*
~-tR+"Y
= M.
D
3.2.2
Main lemma
x E Q},
(3.2.3)
139
(3.2.4)
This simple inequality leads to two consequences, which form a basis for any nonsmooth minimization method. Namely:
The distance between $x$ and $x^*$ is decreasing in the direction $-g(x)$.
Inequality (3.2.4) cuts $\mathbb{R}^n$ into two half-spaces. Only one of them contains $x^*$.
Nonsmooth minimization methods cannot employ the ideas of relaxation or approximation. There is another concept underlying all these schemes: the concept of localization. However, to go forward with this concept, we have to develop some special technique which allows us to estimate the quality of an approximate solution to problem (3.2.3). That is the main goal of this section.
Let us fix some x ERn. For x ERn with g(x) =J 0 define
(g(x),x- y)
and II y- x II= VJ(x, x). Thus, Vj(X, x) is a distance from the point x to
hyperplane {y: (g(x), x- y) = 0}.
Let us introduce a function that measures the variation of function f
with respect to the point x. For t :2: 0 define
< 0,
we set w1 (x; t) = 0.
140
Wf
f(x) - f(x)
wf(x; II x- x*
II).
f(x)- f(x)
{3.2.5)
{3.2.6)
M(vt(x; x))+
R.
= 0 and II
f(y)
f(x)
y- x
II= vf(x;x).
+ (g(x), y- x)
Therefore
= f(x),
and
f(x)- f(x)
f(y)- f(x)
wf(x; II y- x
II) =
If
f(x)- f(x)
f(y)- f(x)
Let us fix some x*, a solution to problern (3.2.3). The values VJ(x*;x)
allow us to estimate the quality of localization sets.
DEFINITION
3.2.1 Let
{xi}~ 0
be a sequence in Q. Define
141
We call this set the localization set of problern {3. 2. 3) generated by sequence {xi}~o
Note that in view of inequality (3.2.4), for all k ;:::: 0 we have x* E Sk.
Denote
vz- VJ(x*' x)
z (>
- 0) '
vk*
= O<i<k
mm
Vi
Thus,
vZ
LEMMA
;:::: 0, i
= 0 ... k,
O$i::;k
Vx E B2(x*, r)}.
f*
WJ(x*; vk).
WJ(x*;vZ)
= O$i$k
min WJ(x*;vi) > min [f(xi)- J*] = fj.- f*.
- O~i~k
0
3.2.3
Subgradie nt method
x E Q},
(3.2.7)
142
direction g(x)/
II g(x) II
hk
> o,
hk ~
00
o, E
hk = oo.
k=O
(3.2.8)
THEOREM
R2 +
f k*- f* -< M
h~
(3.2.9}
i=O
L: h;
i=O
Proof: Denote
ri
=II Xi- x* II
<
= rf - 2hivi
Thus,
k
v*
<
k-
R2+Eh~
i=O
2L: h;
i=O
+ hf.
143
i=D
.k
00
2::
i=D
i = 0 ... N.
(3.2.10)
f*
< ../N+l.
MR
Comparing this result with the lower bound of Theorem 3.2.1, we conclude:
The subgradient method (3.2.8), (3.2.10) is optimal for problem (3.2.7) uniformly in the dimension $n$.
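A minimal Python sketch of this method, using the normalized subgradient direction of (3.2.8) and the fixed-horizon steps $h_i = R/\sqrt{N+1}$; the $\ell_1$ test function, the box $Q$ and the parameter values are assumptions made for the example only.

import numpy as np

def subgradient_method(subgrad, project, x0, R, N):
    # x_{k+1} = pi_Q( x_k - h_k g(x_k)/||g(x_k)|| ),  h_k = R / sqrt(N+1)
    x, h, history = x0.copy(), R / np.sqrt(N + 1), [x0.copy()]
    for _ in range(N + 1):
        g = subgrad(x)
        nrm = np.linalg.norm(g)
        if nrm == 0.0:                 # x is already optimal
            break
        x = project(x - h * g / nrm)
        history.append(x.copy())
    return history

# assumed test problem: f(x) = ||x - c||_1 over Q = [-1, 1]^3
c = np.array([0.3, -0.7, 0.5])
f = lambda x: np.abs(x - c).sum()
hist = subgradient_method(lambda x: np.sign(x - c), lambda x: np.clip(x, -1, 1),
                          np.zeros(3), R=2.0, N=400)
print(min(f(x) for x in hist))         # the record value approaches f* = 0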
If we do not want to fix the number of iterations apriori, we can choose
hi
= vf+P
= 0, ....
t:;.k
144
3.2.4
and
fi,
(3.2.11)
Vx,y E Q.
I x E Q,
(3.2.12)
Note that we can easily compute a subgradient g(x) of function j, provided that we can do so for functions fi (see Lemma 3.1.10).
Let us fix some x*, a solution to (3.2.11). Note that J(x*) = 0 and
Vj(x*;x) ~ 0 for all x ERn. Therefore, in view ofLemma 3.2.1 we have
If
J(x):::; M Vj(x*;x).
Let us write down a subgradient minimization scheme for constrained
minimization problern (3.2.12). We assume that R is known.
hk =
v'k~0.5.
(3.2.13)
g(xk), if j(xk)
(A),
II hk,
(B).
145
THEOREM
M2 = m~
1::;3::;m
{II g II:
Then for any k 2: 3 there exists a number i', 0 :5 i' :5 k, such that
f(xi') -
f* :5
1M1R f(xi')
- :5 1M
2R
k-1.5
k-1.5'
Proof: Note that for direction Pk, chosen in accordance to rule (B), we
have
II g(xk) II hk :5 ](xk) :5 (g(xk), Xk - x*).
Hence, in this case vf( x*; x k) 2: hk.
Let k'
vVJ(x*' x)
tt '
Then for all i, k'
:5 i :5 k, we have
f/. Ik,
then r[+l
+ hr,
r~, +
i=k'
iElk
iri.Jk
2 _
1 2: Ji.'I .L hi - .L
t=k'
t=k'
1
i+0.5
k+l
2: f
k'
dT
r+0.5 -
2k
ln 2kq 1 2: ln 3.
146
3.2.5
Let us look at the unconstrained minimization problem again, assuming that its dimension is relatively small. This means that our computational resources allow us to perform a number of iterations of a minimization method proportional to the dimension of the space of variables. What will the lower complexity bounds be in this case?
In this section we obtain a finite-dimensional lower complexity bound for a problem which is closely related to the minimization problem. This is the feasibility problem:
(3.2.14)
Find x* E Q,
where Q is a convex set. We assume that this problern is endowed with
an oracle, which answers our request at point x E Rn in the following
way:
Either it reports that x E Q.
Or, it returns a vector g, separating x from Q:
(g, x - x)
\lx E Q.
Q = {(t, x)
E Rn+l
I t ~ f(x),
t ~ J* + l, x E Q}.
147
Denote by $e \in \mathbb{R}^n$ the vector of all ones. The oracle starts from the following settings:
$$a_0 := -Re, \quad b_0 := Re, \quad m := 0, \quad i := 1.$$
Its input is an arbitrary $x \in \mathbb{R}^n$.
2. lf k
m := m
+ 1;
i := i
+ 1;
If i > n then i := 1.
Return 9m
This oracle implements a very simple strategy. Note that the next box $B_{m+1}$ is always a half of the last box $B_m$. The box $B_m$ is divided into two parts by a hyperplane which passes through its center and which corresponds to the active coordinate $i$. Depending on which part of the box $B_m$ contains the test point $x$, we choose the sign of the separation vector $g_{m+1} = \pm e_i$. After creating the new box $B_{m+1}$, the index $i$ is increased by 1. If this value exceeds $n$, we return again to $i = 1$. Thus, the sequence of boxes $\{B_k\}$ possesses two important properties:
$\mathrm{vol}_n\, B_{k+1} = \frac{1}{2}\,\mathrm{vol}_n\, B_k$.
For any $k$
Note also that the number of generated boxes does not exceed the number of calls of the oracle.
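A Python sketch of this box-halving resisting oracle; the handling of a test point that falls outside the current box (it is simply separated from the box, without shrinking it) is an assumption added to keep the sketch self-contained.

import numpy as np

class BoxOracle:
    # keeps the current box [a, b] and the active coordinate i;
    # every answer is a separator +/- e_i, and the box is halved along coordinate i
    def __init__(self, n, R):
        self.a, self.b, self.i = -R * np.ones(n), R * np.ones(n), 0

    def __call__(self, x):
        n = len(x)
        if np.any(x < self.a) or np.any(x > self.b):       # assumed out-of-box handling
            j = int(np.argmax(np.maximum(self.a - x, x - self.b)))
            g = np.zeros(n); g[j] = 1.0 if x[j] > self.b[j] else -1.0
            return g
        c = 0.5 * (self.a[self.i] + self.b[self.i])         # center along coordinate i
        g = np.zeros(n)
        if x[self.i] >= c:
            g[self.i], self.b[self.i] = 1.0, c               # cut off the half containing x
        else:
            g[self.i], self.a[self.i] = -1.0, c
        self.i = (self.i + 1) % n                            # next active coordinate
        return g

oracle = BoxOracle(n=3, R=1.0)
print(oracle(np.array([0.2, -0.1, 0.4])))                    # a separator (first coordinate vector)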
LEMMA
B2(ck, rk)
Bk,
{3.2.15}
Bk
:J
Bn = {x
Cn- ~Re :S
:S Cn +~Re}
:J
B2(cn, ~R).
Therefore, for such k we have Bk :J B2(cb ~R) and (3.2.15) holds. Further, let k = nl + p with some p E [0, ... , n- 1]. Since
bk- ak =
we conclude that
Bk
:J B2
~ R ( ~) -!.
THEOREM
149
THEOREM
.rt
3.2.6
I x E Q},
(3.2.16)
x,
if x E Q,
a separator of x from Q, if x
fl. Q.
Q = {x ERn
](x) S 0}.
In this case, for $x \notin Q$ the oracle has to provide us with any subgradient $g \in \partial \bar f(x)$. Clearly, $g$ separates $x$ from $Q$ (see Theorem 3.1.16).
Let us present the main property of finite-dimensional localization sets.
Consider a sequence $X = \{x_i\}_{i=0}^\infty$ belonging to the set $Q$. Recall that the localization sets generated by this sequence are defined as follows:
Q,
So(X)
sk+l (X)
{X E Sk(X)
(g(xk), Xk - x) ~ 0}.
150
0 we have
*<D
vk-
[voln Sk(X)]
voln Q
.
~ B2 (x*,
D) we have the
B2(x*, vZ).
+ aQ
+ aQ] =
anvoln Q.
2 Q.
a) Choose Yk E Ek
b) If Yk E Q then compute f(Yk), g(yk) If Yk f/; Q,
then compute g(yk), which separates Yk from Q.
(3.2.17)
c) Set
9k = {
g(yk), if Yk E Q,
g(yk), if Yk f/; Q.
d) Choose Ek+l 2 {x E Ek
(gk,Yk- x} ~ 0}.
< k, suchthat Yi
Q.
151
> 0, then X f 0.
3.2.4 For any k;::: 0, we have Si(k)
Thus, if i(k)
LEMMA
Ek
::) {x E Ek
Ek+l
(g(yk),Yk- x) 2: 0}
::) {x E Ek
Ek+l
::) {X E si(k)
(g(yk), Yk - x) 2: 0}
(g(yk), Yk - x) 2: 0}
= si(k)+l
0
since Yk = xi(k)
V~(k)
2. lf voln Ek
(X)< D
-
[volnSi(k)(X)];- <
voln Q
> 0, we have
I
[volnEk];
voln Q
Proof: We have already proved the first statement. The second one
follows from the inclusion Q = So = Si(k) ~ Ek, which is valid for all k
0
such that i(k) = 0.
Thus, if we manage to ensure voln Ek -t 0, then we obtain a convergent scheme. Moreover, the rate of decrease of the volume automatically
defines the rate of convergence of the method. Clearly, we should try to
decrease voln Ek as fast as possible.
Historically, the first nonsmooth minimization method, implementing
the idea of cutting planes, was the center of gravity method. It is based
on the following geometrical fact.
Consider a bounded convex set S C Rn, int S f 0. Define the center
of gravity of this set as
cg(S) =
voll S
n
J xdx.
152
The following result demonstrates that any cut passing through the center of gravity divides the set into two proportional pieces.
LEMMA
s+ = {x
Es I
(g, cg(S)- x)
0}.
Then
(g(xk), Xk- x)
~ 0}.
Jt. =
min f(xj)
0$j$k
3.2.7 lf f is Lipschitz continuous on B2(x*,D) with a constant M, then for any k ~ 0 we have
THEOREM
J:.-f*~MD(l-~)--n.
Proof: The statement follows from Lemma 3.2.2, Theorem 3.2.6 and
Lemma 3.2.5.
0
Comparing this result with the lower complexity bound of Theorem 3.2.5, we see that the center-of-gravity method is optimal in finite dimension. Its rate of convergence does not depend on any individual characteristics of our problem such as the condition number, etc. However, we should accept that this method is absolutely impractical, since the computation of the center of gravity in a multi-dimensional space is a more difficult problem than our initial one.
153
Let us look at another method, which uses a possibility of approximation of the localization sets. This method is based on the following
geometrical observation.
Let H be a positive definite symmetric n x n matrix. Consider the
ellipsoid
E(H,x) = {x ERn
Let us choose a direction g E Rn and consider a half of the above ellipsoid, defined by corresponding hyperplane:
1 (g,x- x) 2: 0}.
E+ = {x E E(H,x)
lt turns out that this set belongs to another ellipsoid, which volume is
strictly smaller than the volume of E(H, x}.
LEMMA
3.2.6 Denote
x+
x-
1
Hg
n+l (Hg,g)l/2'
(a + n:_1. dt~~g)).
II
X-
X+
llb+
n~-;1 (11
X-
X+
II X -X+ II~
(g, x - x+) 2 =
(g, x} 2 + n! 1(g, x}
+ (n~l)2
II X - X+ II~+ =
n:2 1 (II
II x lla:s; 1.
+ (g, x)
Therefore
= (g, x)(1
+ (g, x))
::::; 0.
= 1.
154
Hence,
I X- X+ II~+~
n:2 1
_
-
2 )
n+1
~] ~ -< [n2-1
n2
1 ]
[ n 2 (n 2 +n-2) ] ~ _ [
- 1 - (n+l)2
n(n-l)(n+l)2
)]
2
1 - n(n+l)
It turns out that the ellipsoid $E(H_+, \bar x_+)$ is the ellipsoid of minimal volume containing the half $E_+$ of the initial ellipsoid.
Our observations can be implemented in the following algorithmic scheme of the ellipsoid method.
Ellipsoid method
0. Choose Yo ERn and R
Set Ho = R 2 In.
gk
g(yk), if Yk E Q,
(3.2.18)
g(yk), if Yk ~ Q,
Yk+1
Ek = {x ERn
155
Denote
J; =
min f(xj)
O~j~k
THEOREM
~ _ f* < MR ( 1 f~(k)
-
1 ) 2 . [voln Bo(xo,R)];;.
(n+1)2
voln Q
Proof: The proof follows from Lemma 3.2.2, Corollary 3.2.1 and Lemma 3.2.6.
D
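A minimal Python sketch of the ellipsoid method for unconstrained minimization (so that the oracle always returns a subgradient and every $y_k$ counts as feasible); it uses the update formulas of Lemma 3.2.6 with $H_0 = R^2 I_n$. The nonsmooth test function is an assumption.

import numpy as np

def ellipsoid_method(f, subgrad, y0, R, n_iters=300):
    # y_{k+1} = y_k - H_k g / ((n+1) <H_k g, g>^{1/2}),
    # H_{k+1} = n^2/(n^2-1) * (H_k - 2/(n+1) * H_k g g^T H_k / <H_k g, g>)
    n = len(y0)
    y, H = y0.astype(float), R ** 2 * np.eye(n)
    best_x, best_f = y.copy(), f(y)
    for _ in range(n_iters):
        g = subgrad(y)
        Hg = H @ g
        gHg = g @ Hg
        if gHg <= 1e-16:                      # degenerate subgradient, stop
            break
        y = y - Hg / ((n + 1) * np.sqrt(gHg))
        H = n ** 2 / (n ** 2 - 1.0) * (H - 2.0 / (n + 1) * np.outer(Hg, Hg) / gHg)
        if f(y) < best_f:
            best_x, best_f = y.copy(), f(y)
    return best_x, best_f

# assumed test function: f(x) = ||x - c||_1
c = np.array([0.5, -0.25])
print(ellipsoid_method(lambda x: np.abs(x - c).sum(),
                       lambda x: np.sign(x - c), np.zeros(2), R=2.0))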
We need additional assumptions to guarantee $X \ne \emptyset$. Assume that there exist some $\rho > 0$ and $\bar x \in Q$ such that
(3.2.19)
Then
< (1 _ (n+1)2
1 )
.!
[ voln E&] n
voln Q
-
.!
2 [voln B2 xo,R ] n
voln
In view of Corollary 3.2.1, this implies that i(k) > 0 for all
k
lf i(k)
> 0, then
~
fz(k)
_ f* <
lMR2. e-2<n+t>2
- p
156
calls of the oracle. This efficiency estimate is not optimal (see Theorem 3.2.5), but it has a polynomial dependence on $\ln\frac{1}{\epsilon}$ and a polynomial dependence on the logarithms of the class parameters $M$, $R$ and $\rho$. For problem classes whose oracle has polynomial complexity, such algorithms are called (weakly) polynomial.
To conclude this section, let us mention that there are several methods that work with localization sets in the form of a polytope:
3.3
(Model of nonsmooth function; Kelley method; ldea of level method; Unconstrained minimization; Efficiency estimates; Problems with functional constraints.)
157
3.3.1
where $f$ is a Lipschitz continuous convex function and $Q$ is a closed convex set. We have seen that the optimal method for problem (3.3.1) is the subgradient method (3.2.8), (3.2.10). Note that this conclusion is valid for the whole class of Lipschitz continuous functions. However, when we are going to minimize a particular function from that class, we can expect that it is not too bad. We can hope that the real performance of the minimization method will be much better than a theoretical bound derived from a worst-case analysis. Unfortunately, as far as the subgradient method is concerned, these expectations are too optimistic. The scheme of the subgradient method is very rigid and in general it cannot converge faster than in theory. It can also be shown that the ellipsoid method (3.2.18) inherits this drawback of the subgradient scheme. In practice it works more or less in accordance with its theoretical bound even when applied to a very simple function like $\|x\|^2$.
In this section we will discuss algorithmic schemes which are more flexible than the subgradient and the ellipsoid methods. These schemes are based on the notion of a model of a nonsmooth function.
DEFINITION
3.3.1 Let X=
{xk}~ 0
be a sequence in Q. Denote
+ (g(xi), x- xi)],
f.
Xi,
0 :S i :S k, we have
158
3.3.2
Kelley method
Kelley method
0. Choose $x_0 \in Q$.
1. $k$th iteration ($k \ge 0$). Find $x_{k+1} \in \mathrm{Arg}\min_{x \in Q} \hat f_k(X; x)$.   (3.3.2)
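A Python sketch of the Kelley method for a box $Q$: the model minimization (3.3.2) becomes the epigraph linear program $\min\{t : f(x_i) + \langle g(x_i), x - x_i\rangle \le t,\ x \in Q\}$, solved here with scipy.optimize.linprog. The two-dimensional test function mimics the example discussed below and is an assumption.

import numpy as np
from scipy.optimize import linprog

def kelley(f, subgrad, x0, lo, hi, n_iters=25):
    # minimize the piecewise-linear model max_i [f(x_i) + <g_i, x - x_i>] over the box [lo, hi]
    n = len(x0)
    A_ub, b_ub, x = [], [], np.asarray(x0, float)
    for _ in range(n_iters):
        fx, g = f(x), subgrad(x)
        A_ub.append(np.concatenate(([-1.0], g)))      # -t + <g, x>  <=  <g, x_i> - f(x_i)
        b_ub.append(g @ x - fx)
        res = linprog(c=np.concatenate(([1.0], np.zeros(n))),   # minimize t over (t, x)
                      A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=[(None, None)] + list(zip(lo, hi)))
        x = res.x[1:]
    return x

# assumed test function: f(x) = max(|x_1|, ||x||^2) on the box [-1, 1]^2
f = lambda x: max(abs(x[0]), x @ x)
def subgrad(x):
    return np.array([1.0 if x[0] >= 0 else -1.0, 0.0]) if abs(x[0]) >= x @ x else 2.0 * x

print(kelley(f, subgrad, x0=[1.0, 0.0], lo=[-1.0, -1.0], hi=[1.0, 1.0]))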
f(y,x)
max{j Y j, II x 11 2 },
y E R 1 , x ERn,
A(
of model A(Z; z), and by jz = jk(Z"k) the optimal value of the model.
Let us choose zo = (1, 0). Then the initial model of function f is
fo(Z; z) = y. Therefore, the first point, generated by the Kelley method
is z1 = (-1, 0). Hence, the next model of the function f is as follows:
159
Clearly,
/i = 0.
fk..
fk. S f(z*) = 0.
Thus, for all consequent models with k 2: 1 we will have /; = 0 and
Zk. = (0, Xk), where
xz = {x E B2(0, 1): I
Xi
Let us estimate efficiency of the cuts for the set Xk.. Since Xk+l can
be an arbitrary point from Xk., at the first stage of the method we can
choose Xi with the unit norms: II Xi II= 1. Then the set Xk. is defined as
follows:
Xk. = {x E B2(0, 1) I (xi,x) S ~,i = 0 ... k}.
We can do that if
f(zi)
=f(O, xi)
= 1.
Let us estimate a possible length of this stage using the following fact.
Let d be a direction in
S(o:)
= {x
Then v(o:)
n-1
2
At the first stage, each step cuts from the sphere S2 (0, 1) at most the
n-1
[ ]
segment S( ~) . Therefore, we can continue the process if k
s Ja
= ./J ,
This means that after N iterations we can repeat our process with the
ball B2(0, ~), etc. Note that f(O, x) = ~ for all x from B 2(0, ~ ).
160
Thus, we prove the following lower bound for the Kelley method
(3.3.2):
f(xk)-
f* ~
{!)
[.ii]
n-1
~:-solution
~ [ 2 ]n-1
2ln2
'7a
calls of the oracle. It remains to compare this lower bound with the
upper complexity bounds of other methods:
Ellipsoid method:
0 (n2 ln~)
Gradient method:
0 (e\-)
0
3.3.3
Level method
09~k
The first value is called the minimal value of the model, and the second one the record value of the model. Clearly, $\hat f_k^* \le f^* \le f_k^*$.
Let us choose some $\alpha \in (0, 1)$. Denote
161
Level method
0. Choose point xo E Q, accuracy
cient a E (0, 1).
1. kth iteration (k
c). Set
0).
J; and J;.
a). Compute
b). If !Z-
jz ~ e,
then STOP.
Xk+l = 7r.ck(a)(Xk)
jz
t,
min
s.t.
f(xi)
XE
+ (g(xi), x- Xi)
t, i = 0 ... k,
Q.
II
x- Xk
11 2 ,
xEQ.
Jz ~ Jz+l ~ r
Denote D..k = [JZ, !Z] and
]k(X;x). Then
6k
~ JZ+l ~ JZ.
= JZ - jz.
We call
6k
162
The next result is crucial for the analysis of the level method.
LEMMA 3.3.1 Assurne that for some p
Then for all i, k
k we have Op
(1 - a)ok.
p,
(1-a)c5k
(1-a)<>i Therefore
Let us show that the steps of the level method are large enough. Denote
$$M_f = \max\{\|g\| : g \in \partial f(x),\ x \in Q\}.$$
LEMMA 3.3.2 For the sequence $\{x_k\}$ generated by the level method we have
II Xk+l -
Xk II>
-
(1-a)6k
MI
Proof: Indeed,
+1-
D.
M2D2
:s; (1-~)26r
II Xi+l -
x; 11 2
< II Xi -
x; 11 2 -
< II X,
Xp
* 112 -
II Xi+l
- Xi 11 2
<II Xz
(1-a)26l
M2
_
I
* 112 - (1-a)26~
Mz
Xp
163
(p + 1- k) ( 1 -;~25~ ~II
Xk-
x; 11 2 ~ D
THEOREM
N =
lf2a(1~I~~2-a) J+
J;- f*
~ E.
= [p(j), k(j)],
p(O)
= N,
k(j)
Clearly, for j
p(j
p(j) ~ k(j),
+ 1) = k(j) + 1,
= 0 ... m,
k(m)
= 0,
= t5p(j+l)
0 we have
t5
>~
>
1-a -
P(J+l) -
n (J ~
p(O)
>
(1-a)J+i -
+1-
(1-a)J+i
k(j) is bounded:
MJD 2
MJD2 ( 1
(1-o)22 . ~ f2(1-a)2
p(J)
) 2j
Therefore
Let us discuss the above efficiency estimate. Note that we can obtain
the optimal value of the Ievel parameter a from the following maximization problem:
--+
max.
oE[O,l]
164
J2.
1
2+
3.3.4
Constrained minimization
Let us demonstrate how we can use the models for solving constrained minimization problems. Consider the problem
$$\min\ f(x), \quad \text{s.t.}\ f_j(x) \le 0,\ j = 1 \dots m, \quad x \in Q, \qquad (3.3.4)$$
where $Q$ is a bounded closed convex set, and the functions $f(x)$, $f_j(x)$ are Lipschitz continuous on $Q$.
Let us rewrite this problem as a problem with a single functional constraint. Denote $\bar f(x) = \max_{1 \le j \le m} f_j(x)$. Then we obtain the equivalent problem
$$\min\ f(x), \quad \text{s.t.}\ \bar f(x) \le 0, \quad x \in Q. \qquad (3.3.5)$$
Note that $f(x)$ and $\bar f(x)$ are convex and Lipschitz continuous. In this section we will try to solve (3.3.5) using models for both of them.
Let us define the corresponding models. Consider a sequence $X = \{x_k\}_{k=0}^\infty$. Denote
165
j(t; x)
f*(t)
rninf(t;x).
xEQ
fk(X;t,x)
J;(X;t)
xEQ
3.3.4
tk(X)
Proof: Denote by
= rnin{}k(X;
x)
I /k(X; x)
xt:
~ 0, x E Q}.
iz
i'k =
A(X; y) ~ 0.
166
:::;
> 0.
Then
{3.3.6}
= tk(X),
:~:::::~ E
+ at2
Xa =
+ afZ(X; t2)
(1- a)xk(to)
+ axk(t2).
(3.3.7)
Then we
+ af;(x; t2),
+ A}- A]
167
Now we areready to present a constrained minimization scheme (compare with constrained minimization scheme of Section 2.3.5).
Q, t 0 < t*,
/'i,
E (0,
! ) and accuracy
> 0.
(3.3.8)
0, we have
!J(k)(X;tk)
:S
t~=~ [2(1~~~:)r.
Proof: Denote
= 2(1~~~:)
Since tk+l
O"k-1
(<
1).
168
kj*j(O) (X ; t 0 )
tkl-tk
lt-to
0
~ .
Then
Therefore we have
Since tk
<
/(Xj)
< E.
t*+~:,
(3.3.9)
-t
t)
Mt= max{ll g
111
g E
8f(x)
U8j(x), x E Q}.
169
K- 2 (/j(k) (X ;tk
)f2a(1-a)2 (2-a)
iterations of the internal process. Since at the full step !J(k)(X; tk)) ~ e,
we conclude that
M2D2
fJ-1 (X; tk)- fJ-1 (X; tk) ~ ~fJ-1 (X; tk) ~ ~e.
Therefore, in view of Theorem 3.3.1, the number of iterations at the last
step does not exceed
K.22a(l-a)2(2-a)
[1 + In(2(1-K-))
1
1 J.o..=.L...]
n
MrD
,.2e2a(l-a)2(2-a}
M2D2Jn
I
(1-1;}
2(to-t)
~:2a(1-a)2(2-a)K.2Jn[2(1-K.))'
It can be shown that the reasonable choice for the parameters of this
scheme is a = ~ = 2+\!2'.
The principal term in the above complexity estimate is on the order
of ~ ln 2(to;t). Thus, the constrained level method is suboptimal (see
Theorem 3.2.1).
In this method, at each iteration of the master process we need to
find the root tj(k)(X). In view of Lemma 3.3.4, that is equivalent to the
following problem:
min{A(X;x)
/k(X;x) ~ 0, x E Q}.
170
s.t.
rnm
t,
f(xj)
+ (g(xj}, x- Xj)
t,
j = 0 ... k,
/(xj)
+ (g(xj}, x- Xj)
~ 0,
j = 0 ... k,
XE
Q.
If $Q$ is a polytope, this problem can be solved by finite linear programming methods (simplex method). If $Q$ is more complicated, we need to use interior-point schemes.
To conclude this section, let us note that we can use a better model for the functional constraints. Since
where $g_i(x_j) \in \partial f_i(x_j)$. In practice, this complete model significantly accelerates the convergence of the process. However, clearly each iteration becomes more expensive.
As far as the practical behavior of this scheme is concerned, we note that usually the process is very fast. There are some technical problems related to the accumulation of too many linear pieces in the model. However, in all practical schemes there exists some strategy for dropping the old elements of the model.
Chapter 4
STRUCTURAL OPTIMIZATION
4.1
Self-concordant functions
(Do we really have a black box? What the Newton method actually does?
Definition of self-concordant functions; Main properties; Minimizing the selfconcordant function.)
4.1.1
In this chapter we are going to present the main ideas underlying the modern polynomial-time interior-point methods in nonlinear optimization. In order to start, let us look first at the traditional formulation of a minimization problem.
Suppose we want to solve a minimization problem in the following form:
$$\min_{x \in \mathbb{R}^n}\ \{ f_0(x) \mid f_j(x) \le 0,\ j = 1 \dots m \}.$$
We assume that the functional components of this problem are convex. Note that all standard convex optimization schemes for solving this problem are based on the black-box concept. This means that we assume our problem to be equipped with an oracle, which provides us with some information on the functional components of the problem at some test point $x$. This oracle is local: if we change the shape of a component far enough from the test point, the answer of the oracle does not change. These answers comprise the only information available for numerical methods.$^1$
However, if we look carefully at the above situation, we can see a certain contradiction. Indeed, in order to apply the convex optimization
$^1$ We have discussed this concept and the corresponding methods in the previous chapters.
172
Ax=b.
We can proceed as follows:
1. Check that A is symmetric and positive definite. Sometimes this is
3 However,
173
Structural optimization
A=LLT,
where L is a lower-triangular matrix. Form an auxiliary system
Ly
= b,
LT X
= y.
4.1.2
174
II
II x- y II for all x
and y ERn.
We assume also that the starting point of the Newton process xo is close
enough to x*:
(4.1.1)
II xo- x* II< r = 32f.t
Then we can prove that the sequence
k ~
II Xk+l -X * II<-
o,
(4.1.2)
M!lxk-x*ll 2
2(l-Mjjxk-xli)'
What is wrong with this result? Note that the description of the region of quadratic convergence (4.1.1) for this method is given in terms of the standard inner product
$$\langle x, y \rangle = \sum_{i=1}^n x^{(i)} y^{(i)}.$$
If we choose a new basis in $\mathbb{R}^n$, then all objects in our description change: the metric, the Hessians, the bounds $l$ and $M$. But let us look at what happens with the Newton process. Namely, let $A$ be a nondegenerate $(n \times n)$-matrix. Consider the function
$$\phi(y) = f(Ay).$$
The following result is very important for understanding the nature of the Newton method.
4.1.1 Let {xk} be a sequence, generated by the Newton method
for function f:
LEMMA
k ~ 0.
175
Structural optimization
Let
Vx,y ERn.
II f"'(x)[u]II:S: M II u II .
This means that at any point x E Rn we have
(i 111 (x)[u]v,v}
:S: M II u II II v 11 2
Vu,v ERn.
Note that the value in the left-hand side of this inequality is invariant with respect to affine transformations of variables. However, the right-hand side does not possess this property. Therefore the most natural way to improve the situation is to find an affine-invariant replacement for the standard norm $\|\cdot\|$. The main candidate for such a replacement is rather evident: it is the norm defined by the Hessian $f''(x)$ itself, namely,
$$\|u\|_{f''(x)} = \langle f''(x)u, u \rangle^{1/2}.$$
This choice gives us the class of self-concordant functions.
4.1.3
D f(x)[u]
D 2 J(x)[u, u]
D 3 f(x)[u,u,u]
176
DEFINITION
D 3 !(x)[u1,
u2, u3]
1::; MJ IT II Ui llrcx)
(4.1.3)
i=l
We accept this statement without proof since it requires some special facts from the theory of trilinear symmetric forms.
In what follows we very often use Definition 4.1.1 in order to prove that some $f$ is self-concordant. On the contrary, Lemma 4.1.2 is useful for establishing the properties of self-concordant functions.
Let us consider several examples.
EXAMPLE
Then
f'(x)
a+ (a,x),
= a,
J"(x)
domf =Rn.
= 0,
j 111 (x)
= 0,
domf =Rn,
f'(x) = a + Ax,
and we conclude that M1 = 0.
f"(x)
= A,
/ 111 (x)
= 0,
177
Structural optimization
$$f(x) = -\ln x, \qquad \mathrm{dom}\, f = \{x \in \mathbb{R}^1 \mid x > 0\}.$$
Then
$$f'(x) = -\frac{1}{x}, \qquad f''(x) = \frac{1}{x^2}, \qquad f'''(x) = -\frac{2}{x^3}.$$
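For this example the self-concordance inequality can be verified directly (a one-line check, in the normalization with constant 2):
$$|f'''(x)| = \frac{2}{x^3} = 2\Big(\frac{1}{x^2}\Big)^{3/2} = 2\big(f''(x)\big)^{3/2}, \qquad x > 0,$$
so $f(x) = -\ln x$ satisfies the defining inequality with equality.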
Dj(x)[u] =
- t/>(~)[(a,u}- (Ax,u}],
D 2f(x)[u, u] =
D 3 j(x)[u,u,u] =
Then
Derrote
w1
D 2 f(x)[u, u] =
I D 3 f(x)[u, u, u] I
The only nontrivial case is
w1
<
/D 3f(x)(u,u,u)l
(D2J(x)[u,u])3 2 -
w~
+ w2 2 0,
l2w~
+ 3wlw21 .
f(x)
= ex,
f(x)
= x1v,
x > 0, p > 0,
f(x)
=I x IP, p > 2.
0
178
MJ = max { .}aMt,
.M2}
Wi
<
ID 3 f(x)[u,u,uji
[D2 f(x)[u,u]]3 2
<
ctM1w~/ 2 +M2w~/ 2
[aw1 +w2]3f 2
The right-hand side of this inequality does not change when we replace
(w1,w2) by (tw 1, tw2) with t > 0. Therefore we can assume that
aw1 +w2 = 1.
Denote e = O:Wt. Then the right-hand side of the above inequality
becomes equal to
*'e1
+ ~(1- e) 312 ,
eE [o, 11.
e.
CoROLLARY
cf>(x) = a
179
Structural optimization
THEOREM
Proof: The function <P(x) is closed and convex in view ofTheorem 3.1.6.
Let us fix some X E dom <P = {X : A(x) E dom!} and u E nn. Denote
y = A(x), v =Au. Then
D<P(x)[u] =
D 2<P(x)[u, u] D 3 <P(x)[u, u, u]
= D 3 f(y)[v, v, v].
Therefore
I D 3 <P(x)[u,u,u] I
I D 3 f(y)[v,v,v]
I~ MJ(J"(y)v,v) 312
The next statement demonstrates that some local properties of a self-concordant function somehow reflect the global properties of its domain.
THEOREM 4.1.3 Let the function $f$ be self-concordant. If $\mathrm{dom}\, f$ contains no straight line, then the Hessian $f''(x)$ is nondegenerate at any $x$ from $\mathrm{dom}\, f$.
e<o)
o,
t/J'(a)
2'1/;(a) 312
e'(a) = 0.
e(a),
180
</J(a)
a>.
J(x)
0 0
+ a(J'(x), u).
Assurne that there exists such that Ya E fJ( dom f). Consider a sequence {ak} such that ak t . Then
Note that Zk E epi f, but z rt. epi f since Ya rt. dom f. That is a contradietioll since function f is closed. Considering direction -u, and assuming
that this ray intersects the boundary, we come to a contradiction again.
Therefore we conclude that Ya E dom f for all a. However, that is a
contradiction with the assumptions of the theorem.
D
Finally, Iet us describe the behavior of self-concordant function near
the boundary of its domain.
THEOREM
x E 8( dom f)
{xk} C domf:
we have f(xk) -+
Xk-+
+oo.
f(xk) ~ f(xo)
181
Structural optimization
4.1.4
Main inequalities
II v 11;
([f"(x)tlv, v)l/2'
AJ(x)
Let us fix x E dom f and u E Rn, u =/= 0. Consider the function of one
variable
<fJ(t) = (f"(x+t~)u,u)l/2
with the domain dom</J = {t E R 1
x +tuE domf}.
I::; 1.
Proof: Indeed,
"'-'( ) _ _ f'"(x+tu)[u,u,ul
2(! 11 (x+tu)u,u)3)2
'+' t -
Therefore I <P'(t)
COROLLARY
(-cp(O), cp(O)).
Proof: Since f(x+tu) --+ oo as x+tu approaches the boundary of dom f
(see Theorem 4.1.4), the function (f"(x + tu)u, u} cannot be bounded.
t I c/J( t) > 0}. It remains to note that
Therefore dom cp
={
<P(t)
</J()- I t
4 Sometimes
182
W 0 (x;r)
W(x; r)
= {y ERn I II y-
= cl (W 0 (x; r))
llx< r},
THEOREM
3. If
II y- x llx< 1,
y-
lly-xll.,
II Y-> l+liy-xll.,.
II
y-
lly-xll.,
I Y-< 1-ily-xl!.,'
then
(4.1.5)
{y =
+ tu I t 2 I u II; < 1}
= IIY!xlly'
4>(0)
= iiy!xu.,'
(4.1.6)
1/;(t) = (f"(x
+ t(y- x))u,u),
t E [0, 1].
11/J'(t) I
y-
IIYtll u ll;t
183
Structural optimization
Therefore
2{ln(1- t
(1- II Y-
llx) 2 ~ ~f~~ ~
(1-lly:xiJx)2
4.1.4 Let x
estimate the matrix
OROLLARY
domf and r
=II y- x llx<
1. Then we can
f"(x
G=
+ T(y- x))dT
as follows:
f"(x
(1 - r
+ ~r 2 )f"(x),
1
= l~rf"(x).
0
Rn
to dom f.
is almost
for all y E W (x; r). Choosing r small enough, we can make the
quality of the quadratic approximation acceptable for our goals.
184
These two facts form the basis for almost all consequent results.
We conclude this section with the results describing the variation of
a self-concordant function with respect to a linear approximation.
THEOREM 4.1. 7 For any x, y E dom f we have
) -> I+JJy-xJJx'
JJy-xl!~
(4.1. 7}
f(y) ~ f(x)
(4.1.8}
y- x llx Then,
(f'(y)- f'(x), y- x)
2:
J (l_;rr)2dT = r 0J (l~t)2dt =
0
1~r
IIYr-xlli d _ J1 rr 2 d
0 r(l+JJyr-xJJx) T - 0 l+rr T
> Jl
r
= J f~t = w(r).
0
II
y- x
llx<
1. Then
+ (f'(x), Y-
x)
+ w*(ll
y- x llx),
(4.1.9}
(4.1.10}
185
Structural optimization
=II
y-
llx
Since
I(f"(yr)(y-x),y-x)dT
0
r2 d
(1-rrF 7
= r I0
1 d
(1-t)2 t
r2
= 1-r
= I
0
f~tt
= w.(r).
D
THEOREM
Definition 4.1.1
Let us prove the implication (4.1.8) => Definition 4.1.1. Let x E dom f
and x - au E dom f for a E [0, E). Consider the function
a E [0, ~:).
186
Denote r = llullx [cp"(o)Jl/ 2 Assuming that {4.1.8} holds for all x and
y from dom f, we have
a,l.O
> lim
~ [w(ar)- !a2 r 2]
a,l.O a:
= lim
a,I.O
p
0
[w'(ar)- ar]
w 1(t)
w~(r)
= 1 ~r 2 0,
2 0,
"(t)
(1+t)2
> 0'
w~(r)
c1 !r)2
> 0.
Therefore, w(t) and w*(r) are convex functions. In what follows we often
use different relations between these functions. Let us fix this notation
for future references.
LEMMA
w'(w~(r))
= r,
w(t)
w*(r) =
w~(w'(t))
= t,
+ w.(r) 2 rt,
rw~(r)- w(w~(r)),
We leave the proof of this lemma as an exercise for the reader. For an advanced reader we should note that the only reason for the above relations is that the functions $\omega(t)$ and $\omega_*(\tau)$ are conjugate.
Let us prove two more inequalities.
187
Structural optimization
THEOREM
f(y) ~ f(x)
If in addition llf'(y)-
f'(x)ll; < 1,
(4.1.11)
then
(4.1.12)
z E Q.
f(x)- (f'(x),x) =
</J(x) = min</J(z)
zEQ
<
min[<!J(y)
</J(y)- w(II<P'(y)ll;)
zEQ
4.1.5
Ix
E domf}.
(4.1.13)
THEOREM
f(y)
f(x)- AJ(x) II Y-
188
f(y)
f(x)} we have
+ w(ll Y- xj llxj)
f(y) ~ f(xj)
variable
$$f_\epsilon(x) = \epsilon x - \ln x, \qquad x > 0.$$
This function is self-concordant in view of Example 4.1.1 and Corollary 4.1.1. Note that
$$f_\epsilon'(x) = \epsilon - \frac{1}{x}, \qquad f_\epsilon''(x) = \frac{1}{x^2}.$$
Therefore $\lambda_{f_\epsilon}(x) = |1 - \epsilon x|$. Thus, for $\epsilon = 0$ we have $\lambda_{f_0}(x) = 1$ for any $x > 0$. Note that the function $f_0$ is not bounded below.
If $\epsilon > 0$, then $x^*_{f_\epsilon} = \frac{1}{\epsilon}$. Note that we cannot recognize the existence of the minimizer at the point $x = 1$ even if $\epsilon$ is arbitrarily small.
(4.1.14)
189
Structural optimization
THEOREM
2 0 we have
(4.1.15}
.>.2
w'(>.).
+ w*(w'(>.))
f(xk)- 1+.>.
f(xk)- >.w'(>.)
+ w*(w'(>.))
= f(xk)- w(>.).
D
Thus, for all $x \in \mathrm{dom}\, f$ with $\lambda_f(x) \ge \beta > 0$, one step of the damped Newton method decreases the value of $f(x)$ at least by a constant $\omega(\beta) > 0$. Note that the result of Theorem 4.1.12 is global. It can be used to obtain a global efficiency estimate of the process.
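A minimal Python sketch of the damped Newton step $x_{k+1} = x_k - \frac{1}{1+\lambda_f(x_k)}[f''(x_k)]^{-1} f'(x_k)$, applied to the one-dimensional self-concordant function $f_\epsilon(x) = \epsilon x - \ln x$ from the example above; the value of $\epsilon$, the starting point and the iteration count are assumptions.

import math

def damped_newton_1d(fp, fpp, x0, n_iters=25):
    # x_{k+1} = x_k - f'(x_k) / ( f''(x_k) * (1 + lambda_f(x_k)) ),
    # lambda_f(x) = |f'(x)| / sqrt(f''(x))   (local norm of the gradient)
    x = x0
    for _ in range(n_iters):
        g, h = fp(x), fpp(x)
        lam = abs(g) / math.sqrt(h)
        x = x - g / (h * (1.0 + lam))    # the damping keeps x inside dom f
    return x

eps = 0.1                                 # assumed value; the minimizer is x* = 1/eps
x_hat = damped_newton_1d(lambda x: eps - 1.0 / x, lambda x: 1.0 / x ** 2, x0=1.0)
print(x_hat)                              # close to 10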
Let us now describe the local convergence of the standard Newton method:
0. Choose $x_0 \in \mathrm{dom}\, f$.
1. Iterate $x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k)$, $k \ge 0$.   (4.1.16)
defined by the minimum itself. Let us prove that locally all these measures are equivalent.
190
< 1. Then
{4.1.17}
(4.1.18}
(4.1.19}
J
1
G=
J"(xj
+ T(x- xj)}dT,
and
A.j(x)
Therefore
II
11::;
Thus, $\lambda_f(x) \le \omega_*'(r)$. Applying $\omega'(\cdot)$ to both sides, we get the remaining part of (4.1.18).
Finally, inequalities (4.1.19) follow from (4.1.8) and (4.1.10).
Let us estimate the local rate of convergence of the standard Newton method (4.1.16). It is convenient to do that in terms of $\lambda_f(x)$, the local norm of the gradient.
THEOREM 4.1.14 Let x E domf and AJ(x)
x+ = x- [f"(x)t 1 f'(x)
191
Structural optimization
Af
X+ ::=;
( ,x 1 (x)
1->.,(x)
)2 '
Proof: Denote p =X+- x, >. = >.t(x). Then II p llx= >. < 1. Therefore
x+ E domf (see Theorem 4.1.5). Note that in view of Theorem 4.1.6,
1 ~_x,
3_
yg = 0.3819 ... ,
_\)2
First stage: >.t(xk) ~ , where E (0, 3;). At this stage we apply the
damped Newton method. At each iteration of this method we have
N ~ wfy[f(xo)- f(xi)].
192
Af
Xk+l ::;
( AJ(Xk) ) 2
>.J(Xk)
1->.f(Xk)
::; (l-)2
\ ( )
< Af
Xk
4.2
Self-concordant barriers
4.2.1
Motivation
I x E Q},
(4.2.1}
where Q is a closed convex set. We assume also that we know a selfconcordant function f such that Dom f = Q.
Let us introduce a parametric penalty function
f(t; x) = t(c, x)
+ f(x)
193
Structural optimization
This trajectory is called the centrat path of the problern (4.2.1). Note
that we can expect x*(t) ---+ x* as t ---+ oo (see Section 1.3.3}. Therefore
we are going to follow this trajectory.
Recall that the standard Newton method, as applied to minimization
of function f(t; x), has a local quadratic convergence (Theorem 4.1.14}.
Moreover, we have an explicit description of the region of quadratic
convergence:
~
)..f(t;)(x)~<>.= -2
> 0.
<.X.
Note that the update t---+ t+ does not change the Hessian of the barrier
function:
f"(t + ~;x} = f"(t;x).
Therefore it is easy to estimate how can be big the step ~. lndeed, the
first order optimality condition provides us with the following centrat
path equation:
(4.2.2)
tc + f'(x*(t)) = 0.
Since tc + J'(x) = 0, we obtain
)..f(t+;)(x)
f'(x) llx~ .
4.2.2
uER"
(1,.2.3}
194
for all x E dom F. The value v is called the parameter of the barrier.
Note that we do not assume F"(x) tobe nondegenerate. However, if
this is the case, then the inequality (4.2.3) is equivalent to
(4.2.4)
{4.2.5)
(To see that for u with (F"(x)u, u} > 0, replace u in (4.2.3) by .Xu and
find the maximum of the left-hand side in .X.) Note that the condition
(4.2.5) can be written in a matrix notation:
F"(x) t tF'(x)F'(xf.
{4.2.6)
Let us check now which self-concordant functions given by Example 4.1.1 arealso self-concordant barriers.
4.2.1 1. Linear function: f(x) = a + (a,x}, domf = Rn.
Clearly, for a =/::. 0 this function is not a self-concordant barrier since
f"(x) = 0.
EXAMPLE
f(x) = a
(Ax,x)- 2(a,x)
+ (A- 1 a,a).
F(x)
= -lnx,
domF
= {x E R 1 I x > 0}.
> 0.
1
F"fX} = X2" X
Therefore
= 1.
195
Structural optimization
> 0} with v
= 1.
>- 0.
(F'(x),u} =
(F"(x}u,u} =
Denote
WI
Then
-rt>lx)[(a,u}- (Ax,u)],
r/> 2tx}[(a,u}-
(Ax,u}]2
+ rt>(~)(Au,u).
(F"(x)u, u) = w~ + w2 ~
wi.
tion (c, x}
4.2.1 Let F(x) be a self-concordant barrier. Then the funcis a self-concordant function on domF.
+ F(x)
THEOREM
nDom F2
196
Proof: In view of Theorem 4.1.1, Fis a standard self-concordant function. Let us fix x E dom F. Then
max [2(F'(x)u, u}- (F"(x)u, u)]
ueRn
<
uERn
uERn
+ uERn
max [2(F2(x)u, u)- (F2'(x)u, u)]
~ v1
+ v2.
0
Domq_) = {x ERn
I A(x) E DomF}.
(q>'(x), u)
= (F'(y), Au),
(q>"(x)u, u)
= (F"(y)Au, Au).
Therefore
max [2(<I>'(x), u)- (<P"(x)u, u)]
uERn
veRm
v.
0
4.2.3
Main inequalities
THEOREM
(F'(x),y- x} < v.
(4.2. 7)
197
Structural optimization
(F'Cxtrx>2
)>
- v-(F' x ,y-x)
(4.2.8)
Vx, y E domF.
(4.2.9)
<jJ(t) = (F'(x+t(y-x)),y-x),
t E [0,1].
<P'(t) = (F"(x
Therefore <jJ(t) increases and it is positive for t E [0, 1]. Moreover, for
any t E [0, 1] we have
v~~~(b)
- </J(O) =
<
y for
'1/l'(x)
'1/J"(x)
-te-~F(x) F'(x),
'1/J(y):::; 'lj;(x)
+ ('1/J'(x),y -x)
198
(F'(x),y- x} ~ 0,
we have
II Y- X
llx~
V+
(4.2.10}
2y'i/.
(4.2.11}
Proof: Denote r =II y - x llx Let r > .,fii. Consider the point
Ya = x + a(y - x) with a =
< 1. In view of our assumption
(4.2.10) and inequality (4.1.7) we have
=(F'(Ya),y- x}
>
l .
IIYo -xll~
a l+IIYo-xll~
= l+a
al!rxll = ~
ly-x]x I+Tv'
2
Thus,
( 1-~)~
r
l+vfv <v
- '
and that is exactly (4.2.11).
is called the analytic center of convex set Dom F, generated by the barrier
F(x).
4.2.6 Assurne that the analytic center of a v-self-concordant
barrier F(x) exists. Then for any x E Dom F we have
THEOREM
II
X -
Xp
llxj,. ~ V + 2..jV.
199
Structural optimization
II
x -
Xp llx;, ~
1 we have
Proof: The first staternent follows frorn Theorem 4.2.5 since F'(x}) =
D
0. The second staternent follows frorn Theorem 4.1.5.
Thus, the asphericity of the set Dorn F with respect to x}, cornputed
in the metric II llxj.., does not exceed v + 2yfi/. It is well known that for
any convex set in Rn there exists a metric in which the asphericity of this
set is less than or equal to n (John Theorem). However, we rnanaged to
estirnate the asphericity in terrns of the parameter of the barrier. This
value does not depend directly on the dirnension of the space.
Note also, that if Dorn F contains no straight line, the existence of x}
irnplies the boundedness of DomF. (Since then F"(x}) is nondegenerate, see Theorem 4.1.3).
COROLLARY
and v E Rn we have
rnax{(v,u)
(F"(x)u,u) ~ 1}.
=
~
{y
E Rn
I II Y - Xp llx::; ll + 2y0} := B*
II v 11;
= rnax{(v,y- x)
= (v,x}- x)
y E B} ~ rnax{(v,y- x)
y E B*}
+ (v + 2yfi/) II v 11;.
F
11;.
Note that
II v II;= !I
4.2.4
Path-following scheme
-v
x E Q}
(4.2.12)
200
with bounded closed convex set Q = DomF, which has nonempty interior, and which is endowed with a v-self-concordant barrier F(x).
Recall that we are going to solve (4.2.12) by tracing the central path:
xEdomF
f(t; x),
(4.2.13)
tc+ F'(x*(t)) = 0.
(4.2.14)
Since the set Q is bounded, the analytic center of this set, x'F, exists and
x*(O) = x}.
(4.2.15)
In order to follow the central path, we are going to update the points,
satisfying an approximate centering condition:
AJ(t;)(x)
=II
f'(t; x)
II;= I tc + F'(x)
11;~
(4.2.16)
!f,
(4.2.17}
where c* is the optimal value of (4. 2.12). If a point x satisfies the centering condition (4.2.16), then
l
(c ' x)- c* <
- t
(v + (+vvl).
1-
(4.2.18)
x* (t)) ~ ![.
t(c,x-x*(t))
X-
x*(t)
llx
(+Vvl
1-
201
Structural optimization
in view of (4.2.4), Theorem 4.1.13 and (4.2.16).
(4.2.19)
THEOREM
4.2.8 Let
satisfy (4.2.16):
II tc + F'(x)
with < 5.
= 3-l'g.
11;~
I I I~
111-
(4.2.20}
AI
= II
t+c + F'(x)
11;
and
A+
~ (l~~J2
=[w~(.AI)j2.
II c 11;~ f( + JV).
(4.2.21)
202
II c 11;=11
11; + II
F'(x) 11;~ +
JV.
'V I -
_:fl_
l+...[- --
5
35
(4 222)
We have proved that it is possible to follow the central path, using the
rule (4.2.19). Note that we can either increase or decrease the current
value of t. The lower estimate for the rate of increasing t is
t+
(1 + 4+3~\fV). t,
t+
~ ( 1 - 4+3~\fV) . t.
Thus, the general scheme for solving the problern (4.2.12) is as follows.
(4.2.23)
0). Set
+ (i!gl.
THEOREM
203
Structural optimization
~
h
Thus,
tk
1-II
c II*xo< -1-ro
II c II*xj;.-< 1..=1L
1=211 II c II*xj;.
-y{l 2)
2:: ( 1 -~lcll;.
1 + 1JV
)k-1
Let us discuss now the above complexity estimate. The main term in
the complexity is
vl/cl/*.
7. 2.,fo In ---fL.
Note that the value V II c II;. estimates the variation of the linear funcF
tion (c, x} over the set Dom F (see Theorem 4.2.6). Thus, the ratio
4.2.5
Ix
E domF},
{4.2.24)
II F'(x) llis ,
for certain E {0, 1}.
In order to reach our goal, we can apply two different mmimization schemes. The first one is a Straightforward implementation of the
204
Yk+l - Yk
2. Stop the process if
(4.2.25)
_ [F"(Yk)]- 1 F'(UkJ
l+IIF'(yk)IIYk
THEOREM
w(l)
yEdomF
[-t(F'(yo), y)
+ F(y)],
F'(y*(t)) = tF'(yo).
(4.2.26)
Therefore it connects two points, the starting point y0 and the analytic
center xF:
y*(1) = Yo, y*(O) = xF.
We can follow this trajectory by the process (4.2.19) with decreasing t.
Let us estimate the rate of convergence of the auxiliary central path
y* (t) to the analytic center.
205
Structural optimization
LEMMA
0 we have
t k+l -- t k
0). Set
1
IIF'(vo)ll;k
'
II
F'(Yk)
(4.2.27)
llvk::; 1 ~.
>.k
206
"'-&
-_ lgl
I-
1+
-5
36'
lv'ii)
II
F'(yk)
ll;k
(tkF'(xo)
ll;k~ +tk(v+2y'i))
II
F'(xo)
II;F.
tk(ll + 2y'V)
II
F'(xo)
II;.F-l+J
< ~- = 'Y
II
F'(xo)
II;.]
F
4.2.6
xEQ,
0, j = l ... m,
(4.2.28}
207
Structural optimization
where Q is a simple bounded closed convex set with nonempty interior and all functions fi, j = 0 ... m, are convex. We assume that the
problern satisfies the Slater condition: There exists x E int Q such that
/j(x) < 0 for all j = 1 ... m.
Let us assume that we know an upper bound f suchthat fo(x) < f
for all x E Q. Then, introducing two additional variables T and "'' we
can rewrite this problern in the standard form:
minT,
s.t fo(x) ~
T,
(4.2.29)
T ~
f,
K,
0.
j=l
f)
VQ
+ v0 + L vi + 2,
(4.2.30)
j=l
vo
where
are the parameters of the corresponding barriers.
Note that it could be still difficult to find a starting point from dom.F.
This domain is an intersection of the set Q with the epigraphs of the objective function and the constraints and with two additional constraints
208
z ES,
(4.2.31)
(d, z} ::; 0,
where z = (x, r, ~), (c, z} = r, (d, z} = ~ and S is the feasible set of the
problern (4.2.29) without the constraint K. ::; 0. Note that we know a
self-concordant barrier F(z) for the set Sand we can easily find a point
zo Eint S. Moreover, in view of our assurnptions, the set
S(a) = {z ES
(d,z}::; a}
I z E S(a)}.
209
Structv.ral optimization
In view of the Slater condition for problern (4.2.31), the optimal value
of this problern is strictly negative.
The goal of this stage consists in finding an approximation to the
analytic center of the set
S = {z
E S(a)
(d,z) ~ 0},
F'(z*)- (d,~.) = 0.
Therefore z* is a point of the central path z(t). The corresponding value
of the penalty parameter t* is
t* = - (d,~.) > 0.
This stage ends up with a point
Ap(z)
z, computed at the
1 2
~ .
210
4.3
(Bounds on parameters of self-concordant barriers; Linear and quadratic optimization; Semidefinite optimization; Extremal ellipsoids; Separable problems;
Geometrie optimization; Approximation in lp norms; Choice of optimization
scheme.)
4.3.1
(4.3.1)
xEQ
LEMMA
v 2:::
/'b
= sup
tE{a,)
iB.ill:
f"(t) 2:::
1.
Proof: Note that v 2::: /'b by definition. Let us assume that /'b < 1. Since
f(t) is a barrier for (a., ), there exists a value E (a., ) such that
f'(t) > 0 for all t E [, ).
211
Structural optimization
2f 1(t)-
f I (t)
> 0,
fo)f I (t).
Hence, for all t E [a, ) we obtain cjJ(t) ~ c/J(a) + 2(1- fo)(f(t)- f()).
This is a contradiction since f (t) is a barrier and c/J( t) is bounded from
0
above.
COROLLARY
Then v
1.
x + api E Q
THEOREM
Va ~ 0.
x-
i Pi ~ int Q,
i = 1 ... k.
x - E aiPi
i=l
Q, then the
v>
-
E
i=l
ili.
{3;
212
(since otherwise the function f(t) = F(x + tp) attains its minimum; see
Theorem 4.1.11).
Note that x- i Pi f!. Q. Therefore, in view of Theorem 4.1.5, the
norm of the direction Pi is large enough: i II Pi llx~ 1. Hence, in view
of Theorem 4.2.4, we obtain
v ~ (F'(x),
y- x) = (F'(x),- E aiPi)
i=l
E ai II Pi
i=l
llx~
i=l
P(x) = {s ERn
(s,x- x) 51,
Vx E Q}.
It can be proved that for any x E int Q the set P(x) is a bounded closed
convex set with nonempty interior. Denote V(x) = voln P(x).
THEOREM
c1
function
U(x) = c1 In V(x}
is a (c2 n)-self-concordant barrier for Q.
Function U(x) is called the universal barrier for the set Q. Note that
the analytical complexity of problern (4.3.1}, equipped with a universal
barrier, is 0 ( yn ln '7) . Recall that such efficiency estimate is impossible, if we use a local black-box oracle (see Theorem 3.2.5).
The above result has mainly a theoretical interest. In general, the universal barrier U(x) cannot be easily computed. However, Theorem 4.3.2
demonstrates that such barriers, in principle, can be found for any convex set. Thus, the applicability of our approach is restricted only by
abilities of constructing a computable self-concordant barrier, hopefully
with a small value of the parameter. The process of creating the barrier
model of the initial problem, can be hardly described in a formal way.
For each particular problern there could be many different barrier models, and we should choose the best one, taking into account the value of
the parameter of the self-concordant barrier, the complexity of its gradient and Hessian, and the complexity of solution of the Newton system.
In the rest of this section we will see how that can be done for some
standard problern classes of convex optimization.
213
Structural optimization
4.3.2
xERn
(4.3.2)
s.t Ax = b,
. - 1 .. . n,
_ 0 , ~x (i) >
F(x) =- l:Inx(i),
v = n,
i=l
(see Example 4.2.1 and Theorem 4.2.2). This barrier is called the standard logarithmic barrier for R+..
In order to solve the problern (4.3.2), we have to use a restriction
of the barrier F(x) onto affine subspace {x : Ax = b}. Since this
restriction is an n-self-concordant barrier (see Theorem 4.2.3), the complexity estimate for the problern (4.3.2) is 0 (.,fii ln 7) iterations of a
path-following scheme.
Let us prove that the standard logarithmic barrier is optimal for R+..
LEMMA 4.3.2 Parameter v of any self-concordant barrier for R+. satisfies the inequality v ~ n.
x = e := ( 1, ... , 1) T
Pi
E int R+.,
i = 1 ... n,
= ei,
2:: lli
i=l ;
n.
0
Note that the above lower bound is valid only for the entire set R~.
The lower bound for intersection {x E R~ I Ax = b} can be smaller.
214
xeRn
s.t Qi(x) =
O:i
i = 1 ... m,
s. t
qo(x) :S:
XE
nn,
T,
TE
(4.3.4}
R1.
The feasible set of this problern can be equipped with the following selfconcordant barrier:
m
L ln(i- Qi(x)),
v= m
+ 1,
i=l
(see Example 4.2.1, and Theorem 4.2.2). Thus, the complexity bound
for problern (4.3.3) is 0 ( ym + 1 ln ':) iterations of a path-following
scheme. Note this estimate does not depend on n.
In many applications the functional components of the problern include a nonsmooth quadratic term of the form II Ax - b II Let us show
that we can treat such terms using interior-point technique.
LEMMA 4.3.3
The function
F(x, t) = -ln(t2- II x 11 2 )
II}.
215
Structural optimization
at a = 0. Denote
cpO
( = 2(t7- (x,h)),
II-
</;-
t:_
(.t)2~'
~
II h 11 2 ),
cjJ111
E'E" _
_
-37
(.t) 3.
2 ~
satisjies inequality v
z=
P2 = ( -h, 1),
P1 = (h, 1),
2.
II
II=
1.
'_"'_1
"
<-<1 - <-<2- 2
z- iPi =
(~h,!)
z- a1P1- a2P2 =
rt int K2,
( -~h
+ ~h, 1- ~- ~) = 0 E K2.
>
-
Q.J..
+ !:!2.
2
= 2.
0
216
4.3.3
Semidefinite optimization
(X, Y)F =
LL
x(i,i)y(i,j),
i=l j=l
II X IIF= (X,X)F1/2 .
= f:
f:
x(i,j)
i=lj=l
=
f: y(i,k)y(i,k) = f: f: f:
k=l
Ln nL
k=lj=l
.n
y(k,J)
x(i,i)y(i,k)y(j,k)
i=lj=lk=l
i=l
...
X(J,l)y(l,k) =
Ln nL
k=lj=l
y(k,J)(XY)(J,k)
(4.3.6)
(4.3.7)
X EPn,
F(X) = -ln
I1 ~i(X),
i=l
Structural optimization
LEMMA
direction
217
= -X- 1 .
For any
Trace (rx-l/2.x-I/2f)'
-2(In, (X-1/2Llx-1/2j3)F
F(X
+ b.) - F(X)
THEOREM
218
(F'(X), )F =
LAi,
i=l
D 3 F(X)[, , ]
E\,
(F"(X), )F =
i=l
-2
i=l
Ai.
~=1
we obtain
(F'{X), )} < n(F"(X}, )F,
I D 3 F(X)[, , ] I <
2(F"(X), )~ 2
0
Let us prove that F(X) = -In det X is the optimal barrier for Pn.
4.3.6 Parameter v of any self-concordant barrier for cone Pn
satisfies inequality v ;::: n.
LEMMA
In -
E eie[ =
i=l
0 E Pn
i=l
r. = n.
'
219
Structural optimization
Let us estimate the arithmetical cost of each iteration of a pathfollowing scheme (4.2.23) as applied to the problern (4.3.7). Note that
we work with a restriction of the barrier F(X) onto the set C. In view of
Lemma 4.3.5, each Newton step consists in solving the following problern:
min{ (U, ~)F
.6.
(Ai, ~)F
= 0, i = 1 ... m},
0,
i = 1 ... m.
~=X
[-u
f:
>.(i) Ajl
(4.3.9)
X.
J=l
L >.(j)(Ai,XAjX)F = (Ai,XUX)p,
i = 1 ... m,
(4.3.10)
j=l
d(i) =
(U,XAjX)p,
i,j = 1 ... n.
s- 1d.
(4.3.11)
220
However, if the matrices Aj possess a certain structure, then this estimate can be significantly improved. For example, if all Aj are of rank 1:
Aj = ajaJ,
aj E Rn,
j = 1 ... m,
(4.3.12)
{(x, t)
4.3.4
Extremal ellipsoids
4.3.4.1
Circumscribed ellipsoid
1},
VO n
1 B 2 (0 ' 1) d et H -1 =
VO n
voln B2(0,1)
det H
221
Structural optimization
H,v,T
s.t.
-lndetH~T,
(4.3.13)
II Hai -v II~ 1,
i = 1 ... m,
4.3. 7 Function
the set
R 1 I T ~ -lndetH, HE 'Pn}
0
E ln{1- II Hai- v 11 2 ),
i=l
v=m+n+l.
The corresponding cornplexity bound is 0 ( Jm
tions of a path-following scheme.
4.3.4.2
+ n + 1 In m;n) itera-
222
<
Inequality (a, x}
b.
<
(Ha, a} ~ (b - (a, v} )2
Proof: In view of Lemma 3.1.12, we have
max{(a,u}
u
max((a,x- v)
(a, v)
xEW
+ (a,v)]
+ max{
(a, u} I
X
+ (Ha,a) 112
This proves our statement since (a, v} < b.
-
(a,v)
(H- 1u, u) ~ 1}
~ b.
D
Note that voln W = voln B2(0, 1)[det Hjll 2. Hence, our problern is as
follows:
minT,
H,T
s.t. -lndet H
H E 'Pn,
~ T,
{4.3.14)
E R1 .
F(H, T) =
v =
m+n+l.
m;n)
223
Structural optimization
4.3.4.3
w -
{x ERn
I II G- 1 (x- v) II:$ 1}
In view of Lemma 4.3.8, the inequality (a, x) :$bis valid for any x E W
if and only if
II Ga 11 2
(G2a, a) :$ (b- (a, v)} 2
II Ga II:$ b- (a,v).
Note that voln W = voln B2(0, 1) det G. Therefore our problern can be
written as follows:
min r,
G,v,r
s.t. -lndet G :$ r,
(4.3.15)
In view of Lemmas 4.3. 7 and 4.3.3, we can use the following selfconcordant barrier:
F(G, v, r)
2m+ n + 1.
224
+ 1 ln m;n) iter-
Separable optimization
4.3.5
In problems of separable optimization all nonlinear terms are presented by univariate functions. A general formulation of such a problern
looks as follows:
mRi~
XE
s.t Qi(x)
qo(x) =
L: ao,jfo,j((ao,j,x) + bo,j),
j==l
(4.3.16)
m;
=E
mo
j==l
ai,jfi,j((ai,j,x) +bi,j)::::; i, i
= l. .. m,
where ai,j are some positive coefficients, ai,j E Rn and fi,j(t) are convex
functions of one variable. Let us rewrite this problern in a standard
form:
min ro,
x,t,r
m;
j=l
where M =
E mi.
i=O
i = 0 ... m,
(4.3.17)
for the feasible set of the problem, we need barriers for epigraphs of
univariate convex functions Ai. Let us point out such barriers for several
important functions.
4.3.5.1
Logarithm and exponent.
Function FI(x, t) = -lnx -ln(lnx + t) is a 2-self-concordant barrier
for the set
Ql = {(x,t) E R2 1 x > 0, t ~ -lnx},
and function F2(x, t) = -ln t -ln(ln t- x) is a 2-self-concordant barrier
for the set
225
Structural optimization
Entropy function.
4.3.5.2
Function Fg(x, t) = -lnx -ln(t- x lnx) is a 2-self-concordant barrier
for the set
Qg = {(x,t) E R 2 1 x 2 0, t 2 xlnx}.
4.3.5.3
Increasing power functions.
Function F4 (x, t) = -2ln t-1n(t 21P -x 2 ) is a 4-self-concordant barrier
for the set
Q4 = {(x, t) E R 2 I t 21 x IP}, p 2 1,
and function F5 (x, t) = -lnx -ln(tP- x) is a 2-self-concordant barrier
for the set
I x 2 0,
Qs = { (x, t) E R 2
tP
2 x },
< p s 1.
4.3.5.4
Decreasing power functions.
Function F6 (X' t) = - ln t -ln( X - r l/p) is a 2-self-concordant barrier
for the set
Q6 = { (x, t) E R 2 I x
> 0, t 2 x~} ,
1,
< p < 1.
We omit the proofs of the above statements since they are rather
technical. It can be also shown that the barriers for all of these sets,
(except maybe Q4 ), are optimal. Let us prove this statement for the sets
Q6 and Q7.
LEMMA
with p
> 0,
satisfies inequality v 2 2.
Then
P2 = e2,
x + eei E Q for
any
= 2 = {,
e;: : 0 and
x- et = (O,r) fj. Q,
x=
a1 = a2 = a
='Y-
x- e2 = {r,O) fj. Q,
E Q.
1.
226
>
-
Ql.
t
+ Q.2.
2 =
21.::.!.
'Y
4.3.5.5
Geometrie optimization.
The initial formulation of such problems is as follows:
min qo(x) =
xERn
s.t qi(x) =
m;
j=l
j=l
ai,j
> 0, j
x(j)
mo
ao,j
(i)
TI (x(J)Y'"0 i,
j=l
(i)
TI (x(3 >yri.j
j=l
~ 1, i = 1 ... m,
(4.3.18)
= 1 ... n,
where ai,j are some positive coefficients. Note that the problern (4.3.18)
is not convex.
Let us introduce the vectors ai,j = (o{~, ... , u~~)) E Rn, and change
the variables: x(i) = eY(i). Then (4.3.18) is transformed into a convex
separable problem.
min
mo
yERn j=l
m;
s.t.
Denote M =
j=l
m
'
'
(4.3.19)
D!i,j exp( (ai,j, y)) ~ 1, i = 1 ... m.
E mi.
i=O
ao 3exp{(ao 3,y)),
lowing scheme is
o(M 1 1 2 ln~).
4.3.5.6
Approximation in lp norms.
The simplest problern of that type is as follows:
(4.3.20)
s.t.
a :::; x :::;
227
Structural optimization
s.t
T(o)
'
I (ai, x} -
(4.3.21)
i=l
a::=;x ::=;,
The complexity bound of this problern is 0 (Jm + n In m;n) iterations
of a path-following scheme.
Wehave discussed the performance of interior-point methods on several pure problern formulations. However, it is important that we can apply these methods to mixed problems. For example, in problems (4.3. 7)
or (4.3.20) we can treat also the quadratic constraints. To do that, we
need to construct a corresponding self-concordant barrier. Such barriers
are known for all important examples we meet in practical applications.
4.3.6
(4.3.22)
s.t
a:::; x:::; ,
228
What scheme should we use? We can derive the answer from the complexity estimates of corresponding methods.
Let us estimate first the performance of the ellipsoid method as applied to problern {4.3.22).
Number of iterations: 0 ( n 2 In
*) ,
s.t.
I (ai, x) -
i=l
T(i)
b(i) IP~
T(i),
i = 1 ... m,
~ ~' a ~X~ ,
(4.3.23)
F(x,T,~))
i=l
mpn ).
229
Structural optimization
Then
E9I(T(il,(ai,x)-b(il)ai- f: [ (il~
F~(x,r,~)= i=l
i=l
X
(ilQ
iil~
(il]ei,
Further, denoting
we obtain
F;(i)x(x,T,~)
h12(T(i), (ai,x)-
b(i))ai,
F~'(i>,r<il(x,r,O
h22(r(i), (ai,x)-
b(i))
F"(il
T
1T
Ul (x, T,
e- .L:m r(i) )
-2
, i
=1=
c(x, T, 0 = -
!=l
F"(i)
'~
T
FJ:e(x, T, 0
= (
i=l
~- i~l T(i)) -2
(e - .I:
z=l
T(i)) -
Si
230
and
A2 = diag(hi2(T(i),si))~ 1 ,
D = diag(h22(T(i),si))~ 1 .
Then, using the notation A = (a 1 , ... , am), e = (1, ... , 1) E Rm, the
Newton system can be written in the following form:
(4.3.24)
[A(Ao
-AA2[D
Using these relations we can find !:l.e from the last equation in (4.3.24).
Thus, the Newtonsystem (4.3.24) can be solved in O(n 3 + mn 2 ) Operations. This implies that the total complexity of the path-following
scheme can be estimated as
0 ( n 2 (m + n) 312 ln
m;n)
Bibliography
232
References
[1] A. Ben-Tal and A. Nemirovskii. Lectures on Moden Convex Optimizatin Analysis, Alogorithms, and Engineering Applications, SIAM, Philadelphia, 2001.
[2] A.B. Conn, N.I.M. Gould and Ph.L. Toint. Trust Region Methods, SIAM,
Philadelphia, 2000.
[3] J.E. Dennis and R.B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations, SIAM, Philadelphia, 1996.
[4] A.V. Fiacco and G.P. McCormick. Nonlinear Programming: Sequential Unconstrained Minimization Techniques, John Wiley and Sons, New York, 1968.
[5] J.-B. Hiriart-Urruty and C. Lemankhal. Convex Analysis and Minimization
Algorithms, vols. I and II. Springer-Verlag, 1993.
[6] C. Lemarechal, A. Nemirovskii and Yu. Nesterov. New variants of bundle methods. Mathematical Programmming, 69(1995), 111-148.
[7] D.G. Luenberger. Linear and Nonlinear Programming, Second Edition, Addison Wesley. 1984.
[8) A.Nemirovsky and D.Yudin. Informational complexity and efficient methods
for solution of convex extremal problems, Wiley, New York, 1983.
[9) Yu.Nesterov. Minimization methods for nonsmooth convex and quaskonvex
functions. Ekonomika i Mat. Metody, v.ll, No.3, 519-531, 1984. (In Russian;
translated as MatEcon.)
[10] Yu.Nesterov. A method for solving a convex programming problern with rate
of convergence 0( ~ ). Soviet Math. Doklady, 1983, v.269, No.3, 543-547. (In
Russian.)
[11] Yu.Nesterov. Efficient methods in nonlinear programming. Radio i Sviaz,
Moscow, 1989. (In Russian.)
[12] Yu. Nesterov and A.Nemirovskii. Interior-Point Polynomial Algorithms in
Convex Programming, SIAM, Philadelphia, 1994.
234
Index
function, 112
set, 81
Cutting plane scheme, 150
Damped Newton method, 34
Dikin ellipsoid, 182
Directional derivative, 122
Domain of function, 112
Epigraph, 82
Estimate sequence, 72
Feasibility problem, 146
Function
barrier, 48, 180
convex, 112
objective, 1
self-concordant, 176
strongly convex, 63
Functional constraints, 1
General iterative scheme, 6
Gradient, 16
mapping, 86
Hessian, 19
Hyperplane
separating, 124
supporting, 124
Inequality
Cauchy-Schwartz, 17
Jensen, 112
Infinity norm, 116
Information set, 6
Inner product, 2
Kelley method, 158
Krylov subspace, 42
236
on a problem, 5
on a problern dass, 5
Polar set, 212
Polynomial methods, 156
Positive orthant, 213
Problem
constrained, 2
feasible, 2
general, 1
linearly constrained, 2
nonsmooth, 2
of approximation in lv-norms, 226-227
of geometric optimization, 226
of integer optimization, 3
of linear optimization, 2, 213
of quadratic optimization, 2
of semidefinite optimization, 216
of separable optimization, 224
quadratically constrained quadratic, 2,
214
smooth, 2
strictly feasible, 2
unconstrained, 2
Projection, 124
Quasi-Newton rule, 40
Recession direction, 211
Relaxation, 15
Restarting strategy, 45
Self-concordant
barrier, 193
function, 176
Sequential unconstrained minimization, 46
Set
convex, 81
feasible, 2
basic, 1
Slater condition, 2, 49
Solution
global, 2
local, 2
Standard
logarithmic barrier, 213
minimization problem, 192
simplex, 132
Stationary point, 18
Step-size strategies, 25
Strict Separation, 124
Structural constraints, 3
Subdifferential, 126
Subgradient, 126
Support function, 120
Supporting vector, 126
Unit ball, 116