
INTRODUCTORY LECTURES ON

CONVEX OPTIMIZATION

A Basic Course

Applied Optimization

Volume 87

Series Editors:

Panos M. Pardalos
University of Florida, U.S.A.

Donald W. Hearn
University of Florida, U.S.A.

INTRODUCTORY LECTURES ON
CONVEX OPTIMIZATION

A Basic Course

By

Yurii Nesterov
Center for Operations Research and Econometrics (CORE)
Université Catholique de Louvain (UCL)
Louvain-la-Neuve, Belgium


Springer Science+Business Media, LLC

Library of Congress Cataloging-in-Publication

Nesterov, Yurii
Introductory Lectures on Convex Optimization: A Basic Course
ISBN 978-1-4613-4691-3
ISBN 978-1-4419-8853-9 (eBook)
DOI 10.1007/978-1-4419-8853-9

Copyright © 2004 by Springer Science+Business Media New York


Originally published by Kluwer Academic Publishers in 2004
Softcover reprint of the hardcover 1st edition 2004

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photo-copying, microfilming, recording, or otherwise, without the prior written permission of the publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Permissions for books published in the USA: permissions@wkap.com
Permissions for books published in Europe: permissions@wkap.nl
Printed on acid-free paper.

Contents

Preface
Acknowledgments
Introduction

1. NONLINEAR OPTIMIZATION
   1.1 World of nonlinear optimization
       1.1.1 General formulation of the problem
       1.1.2 Performance of numerical methods
       1.1.3 Complexity bounds for global optimization
       1.1.4 Identity cards of the fields
   1.2 Local methods in unconstrained minimization
       1.2.1 Relaxation and approximation
       1.2.2 Classes of differentiable functions
       1.2.3 Gradient method
       1.2.4 Newton method
   1.3 First-order methods in nonlinear optimization
       1.3.1 Gradient method and Newton method: What is different?
       1.3.2 Conjugate gradients
       1.3.3 Constrained minimization

2. SMOOTH CONVEX OPTIMIZATION
   2.1 Minimization of smooth functions
       2.1.1 Smooth convex functions
       2.1.2 Lower complexity bounds for F_L^{\infty,1}(R^n)
       2.1.3 Strongly convex functions
       2.1.4 Lower complexity bounds for S_{\mu,L}^{\infty,1}(R^n)
       2.1.5 Gradient method
   2.2 Optimal methods
       2.2.1 Optimal methods
       2.2.2 Convex sets
       2.2.3 Gradient mapping
       2.2.4 Minimization methods for simple sets
   2.3 Minimization problem with smooth components
       2.3.1 Minimax problem
       2.3.2 Gradient mapping
       2.3.3 Minimization methods for minimax problem
       2.3.4 Optimization with functional constraints
       2.3.5 Method for constrained minimization

3. NONSMOOTH CONVEX OPTIMIZATION
   3.1 General convex functions
       3.1.1 Motivation and definitions
       3.1.2 Operations with convex functions
       3.1.3 Continuity and differentiability
       3.1.4 Separation theorems
       3.1.5 Subgradients
       3.1.6 Computing subgradients
   3.2 Nonsmooth minimization methods
       3.2.1 General lower complexity bounds
       3.2.2 Main lemma
       3.2.3 Subgradient method
       3.2.4 Minimization with functional constraints
       3.2.5 Complexity bounds in finite dimension
       3.2.6 Cutting plane schemes
   3.3 Methods with complete data
       3.3.1 Model of nonsmooth function
       3.3.2 Kelley method
       3.3.3 Level method
       3.3.4 Constrained minimization

4. STRUCTURAL OPTIMIZATION
   4.1 Self-concordant functions
       4.1.1 Black box concept in convex optimization
       4.1.2 What the Newton method actually does?
       4.1.3 Definition of self-concordant function
       4.1.4 Main inequalities
       4.1.5 Minimizing the self-concordant function
   4.2 Self-concordant barriers
       4.2.1 Motivation
       4.2.2 Definition of self-concordant barriers
       4.2.3 Main inequalities
       4.2.4 Path-following scheme
       4.2.5 Finding the analytic center
       4.2.6 Problems with functional constraints
   4.3 Applications of structural optimization
       4.3.1 Bounds on parameters of self-concordant barriers
       4.3.2 Linear and quadratic optimization
       4.3.3 Semidefinite optimization
       4.3.4 Extremal ellipsoids
       4.3.5 Separable optimization
       4.3.6 Choice of minimization scheme

Bibliography
References
Index

Preface

It was in the middle of the 1980s, when the seminal paper by Karmarkar opened a new epoch in nonlinear optimization. The importance of this paper, containing a new polynomial-time algorithm for linear optimization problems, was not only in its complexity bound. At that time, the most surprising feature of this algorithm was that the theoretical prediction of its high efficiency was supported by excellent computational results. This unusual fact dramatically changed the style and directions of the research in nonlinear optimization. Thereafter it became more and more common that the new methods were provided with a complexity analysis, which was considered a better justification of their efficiency than computational experiments. In a new rapidly developing field, which got the name "polynomial-time interior-point methods", such a justification was obligatory.

After almost fifteen years of intensive research, the main results of this development started to appear in monographs [12, 14, 16, 17, 18, 19]. Approximately at that time the author was asked to prepare a new course on nonlinear optimization for graduate students. The idea was to create a course which would reflect the new developments in the field. Actually, this was a major challenge. At the time only the theory of interior-point methods for linear optimization was polished enough to be explained to students. The general theory of self-concordant functions had appeared in print only once in the form of research monograph [12]. Moreover, it was clear that the new theory of interior-point methods represented only a part of a general theory of convex optimization, a rather involved field with complexity bounds, optimal methods, etc. The majority of the latter results were published in different journals in Russian.

The book you see now is a result of an attempt to present serious things in an elementary form. As is always the case with a one-semester course, the most difficult problem is the selection of the material. For us the target notions were the complexity of optimization problems and a provable efficiency of numerical schemes supported by complexity bounds. In view of a severe volume limitation, we had to be very pragmatic. Any concept or fact included in the book is absolutely necessary for the analysis of at least one optimization scheme. Surprisingly enough, none of the material presented requires any facts from duality theory. Thus, this topic is completely omitted. This does not mean, of course, that the author neglects this fundamental concept. However, we hope that for the first treatment of the subject such a compromise is acceptable.

The main goal of this course is the development of a correct understanding of the complexity of different optimization problems. This goal was not chosen by chance. Every year I meet Ph.D. students of different specializations who ask me for advice on reasonable numerical schemes for their optimization models. And very often they seem to have come too late. In my experience, if an optimization model is created without taking into account the abilities of numerical schemes, the chances that it will be possible to find an acceptable numerical solution are close to zero. In any field of human activity, if we create something, we know in advance why we are doing so and what we are going to do with the result. Only in numerical modelling is the situation still different.

This course was given during several years at the Université Catholique de Louvain (Louvain-la-Neuve, Belgium). The course is self-contained. It consists of four chapters (Nonlinear optimization, Smooth convex optimization, Nonsmooth convex optimization and Structural optimization (Interior-point methods)). The chapters are essentially independent and can be used as parts of more general courses on convex analysis or optimization. In our experience each chapter can be covered in three two-hour lectures. We assume a reader to have a standard undergraduate background in analysis and linear algebra. We provide the reader with short bibliographical notes which should help in a closer examination of the subject.

YURII NESTEROV

Louvain-la-Neuve, Belgium
May, 2003.

To my wife Svetlana

Acknowledgments

This book is a reflection of the main achievements in convex optimization, the field in which the author has worked for more than twenty-five years. During all these years the author has had the exceptional opportunity to communicate and collaborate with top-level scientists in the field. I am greatly indebted to many of them.

I was very lucky to start my scientific career in Moscow at the time of decline of the Soviet Union, which managed to gather in a single city the best brains of a 300-million population. The contacts with A. Antipin, Yu. Evtushenko, E. Golshtein, A. Ioffe, V. Karmanov, L. Khachian, R. Polyak, V. Pschenichnyj, N. Shor, N. Tretiakov, F. Vasil'ev, D. Yudin, and, of course, with A. Nemirovsky and B. Polyak, were invaluable in forming the directions and priorities of my research.

I was very lucky to move to the West at a very important moment in time. For nonlinear optimization that was the era of interior-point methods. That was the time when a new paper was announced almost every day, and a time of open contacts and interesting conferences. I am very thankful to my colleagues Kurt Anstreicher, Freddy Auslender, Rony Ben-Tal, Rob Freund, Jean-Louis Goffin, Don Goldfarb, Osman Güler, Florian Jarre, Ken Kortanek, Claude Lemaréchal, Olvi Mangasarian, Florian Potra, Jim Renegar, Kees Roos, Tamás Terlaky, Mike Todd, Levent Tunçel and Yinyu Ye for interesting discussions and cooperation. Special thanks to Jean-Philippe Vial, the author of the idea of writing this book.

Finally, I was very lucky to find myself at the Center for Operations Research and Econometrics (CORE) in Louvain-la-Neuve, Belgium. The excellent working conditions of this research center and the exceptional environment were very helpful during all these years. It is impossible to overestimate the importance of the spirit of research, which is created and maintained here by my colleagues Vincent Blondel, Yves Genin, Michel Gevers, Etienne Loute, Yves Pochet, Yves Smeers, Paul Van Dooren and Laurence Wolsey, coming from both CORE and CESAME, a research center of the Engineering department of the Université Catholique de Louvain (UCL). The research activity of the author during many years was supported by the Belgian Program on Interuniversity Poles of Attraction initiated by the Belgian State, Prime Minister's Office and Science Policy Programming.

Introduction

Optimization problems arise naturally in different fields of applications. In many situations, at some point we get a craving to arrange things in the best possible way. This intention, converted into a mathematical form, becomes an optimization problem of a certain type. Depending on the field of interest, it could be an optimal design problem, an optimal control problem, an optimal location problem, an optimal diet problem, etc. However, the next step, finding a solution to the mathematical model, is far from trivial. At first glance, everything looks very simple: many commercial optimization packages are easily available and any user can get a "solution" to the model just by clicking on an icon on the screen of his/her personal computer. The question is, what do we actually get? How much can we trust the answer?

One of the goals of this course is to show that, despite their attraction, the proposed "solutions" of general optimization problems very often cannot satisfy the expectations of a naive user. In our opinion, the main fact which should be known to any person dealing with optimization models is that in general optimization problems are unsolvable. This statement, which is usually missing in standard optimization courses, is very important for an understanding of optimization theory and its development in the past and in the future.

In many practical applications the process of creating a model can take a lot of time and effort. Therefore, the researchers should have a clear understanding of the properties of the model they are constructing. At the stage of modelling, many different tools can be used to approximate the real situation. And it is absolutely necessary to understand the computational consequences of each decision. Very often we have to choose between a "good" model, which we cannot solve,¹ and a "bad" model, which can be solved for sure. What is better?

In fact, computational practice provides us with a hint of an answer to the above question. The most widespread optimization models now are still linear optimization models. It is very unlikely that such models can describe our nonlinear world very well. Thus, the main reason for their popularity is that practitioners prefer to deal with solvable models. Of course, very often the linear approximation is poor. But usually it is possible to predict the consequences of such a choice and make a correction in the interpretation of the obtained solution. It seems that for them this is better than trying to solve a model without any guarantee of success.
Another goal of this course consists in discussing numerical methods for solvable nonlinear models, namely convex optimization problems. The development of convex optimization theory in recent years has been very rapid and very exciting. Now it consists of several competing branches, each of which has some strong and some weak points. We will discuss their features in detail, taking into account the historical aspect. More precisely, we will try to understand the internal logic of the development of each branch of the field. Up to now, the main results of this development could be found only in specialized journals and monographs. However, in our opinion, this theory is ripe for explanation to the final users: industrial engineers, economists and students of different specializations. We hope that this book will be interesting even for experts in optimization theory, since it contains many results which have never been published in English.

In this book we will try to convince the reader that, in order to apply optimization formulations successfully, it is necessary to be aware of some theory which explains what we can and what we cannot do with optimization problems. The elements of this simple theory can be found in each lecture of the course. We will try to show that convex optimization is an excellent example of a complete application theory, which is simple, easy to learn and very useful in practical applications.

In this course we discuss the most efficient modern optimization schemes and establish for them efficiency bounds. The course is self-contained; we prove all necessary results. Nevertheless, the proofs and the reasoning should not be a problem even for graduate students.

¹ More precisely, which we can try to solve.


The structure of the book is as follows. It consists of four relatively independent chapters. Each chapter includes three sections, each of which corresponds approximately to a two-hour lecture. Thus, the contents of the book can be directly used for a standard one-semester course.

Chapter 1 is devoted to general optimization problems. In Section 1.1 we introduce the terminology, the notions of oracle, black box, functional model of an optimization problem and the complexity of general iterative schemes. We prove that global optimization problems are "unsolvable" and discuss the main features of different fields of optimization theory. In Section 1.2 we discuss two main local unconstrained minimization schemes: the gradient method and the Newton method. We establish their local rates of convergence and discuss the possible difficulties (divergence, convergence to a saddle point). In Section 1.3 we compare the formal structures of the gradient and the Newton method. This analysis leads to the idea of a variable metric. We describe quasi-Newton methods and conjugate gradient schemes. We conclude this section with an analysis of sequential unconstrained minimization schemes.

In Chapter 2 we consider smooth convex optimization methods. In Section 2.1 we analyze the main reason for the difficulties encountered in the previous chapter, and from this analysis derive two good functional classes, the class of smooth convex functions and that of smooth strongly convex functions. For the corresponding unconstrained minimization problems we establish the lower complexity bounds. We conclude this section with an analysis of a gradient scheme, which demonstrates that this method is not optimal. The optimal schemes for smooth convex minimization problems are discussed in Section 2.2. We start from the unconstrained minimization problem. After that we introduce convex sets and define a notion of gradient mapping for a minimization problem with simple constraints. We show that the gradient mapping can formally replace a gradient step in the optimization schemes. In Section 2.3 we discuss more complicated problems, which involve several smooth convex functions, namely, the minimax problem and the constrained minimization problem. For both problems we introduce the notion of gradient mapping and present the optimal schemes.

Chapter 3 is devoted to the theory of nonsmooth convex optimization. Since we do not assume that the reader has a background in convex analysis, the chapter starts with Section 3.1, which contains a compact presentation of all necessary facts. The final goal of this section is to justify the rules for computing the subgradients of a convex function. Section 3.2 starts from the lower complexity bounds for nonsmooth optimization problems. After that we present a general scheme for the complexity analysis of the corresponding methods. We use this scheme to establish the convergence rate of the subgradient method, the center-of-gravity method and the ellipsoid method. We also discuss some other cutting plane schemes. Section 3.3 is devoted to the minimization schemes which employ a piece-wise linear model of a convex function. We describe Kelley's method and show that it can be extremely slow. After that we introduce the so-called level method. We justify its efficiency estimates for unconstrained and constrained minimization problems.

Chapter 4 is devoted to convex minimization problems with explicit structure. In Section 4.1 we discuss a certain contradiction in the black box concept as applied to a convex optimization model. We introduce a barrier model of an optimization problem, which is based on the notion of a self-concordant function. For such functions the second-order oracle is not local, and they can be easily minimized by the Newton method. We study the properties of these functions and establish the rate of convergence of the Newton method. In Section 4.2 we introduce self-concordant barriers, the subclass of self-concordant functions which is suitable for sequential unconstrained minimization schemes. We study the properties of such barriers and prove the efficiency estimate of the path-following scheme. In Section 4.3 we consider several examples of optimization problems for which we can construct a self-concordant barrier and which, consequently, can be solved by a path-following scheme. We consider linear and quadratic optimization problems, problems of semidefinite optimization, separable optimization and geometrical optimization, problems with extremal ellipsoids, and problems of approximation in l_p-norms. We conclude this chapter, and the whole course, with a comparison of an interior-point scheme and a nonsmooth optimization method as applied to a particular problem instance.

Chapter 1

NONLINEAR OPTIMIZATION

1.1

World of nonlinear optimization

(General formulation of the problem; Important examples; Black box and iterative methods; Analytical and arithmetical complexity; Uniform grid method; Lower complexity bounds; Lower bounds for global optimization; Rules of the game.)

1.1.1

General formulation of the problem

Let us start by fixing the mathematical form of our main problem and the standard terminology. Let x be an n-dimensional real vector,

x = (x^{(1)}, \dots, x^{(n)})^T \in R^n,

S be a subset of R^n, and f_0(x), \dots, f_m(x) be some real-valued functions of x. In the entire book we deal with different variants of the following general minimization problem:

\min f_0(x),
\mathrm{s.t.}\ f_j(x) \;\&\; 0, \quad j = 1 \dots m,      (1.1.1)
x \in S,

where the sign & could be \le, \ge or =.


We call f_0(x) the objective function of our problem, the vector function

f(x) = (f_1(x), \dots, f_m(x))^T

is called the vector of functional constraints, the set S is called the basic feasible set, and the set

Q = \{ x \in S \mid f_j(x) \le 0, \ j = 1 \dots m \}

is called the feasible set of problem (1.1.1). It is just a convention to consider a minimization problem; instead, we could consider a maximization problem with the objective function -f_0(x).
There is a natural classification of the types of minimization problems:

Constrained problems: Q \subset R^n.

Unconstrained problems: Q \equiv R^n.

Smooth problems: all f_i(x) are differentiable.

Nonsmooth problems: there is a nondifferentiable component f_k(x).

Linearly constrained problems: all functional constraints are linear,

f_j(x) = \sum_{i=1}^{n} a_j^{(i)} x^{(i)} + b_j \equiv \langle a_j, x \rangle + b_j, \quad j = 1 \dots m

(here \langle \cdot, \cdot \rangle stands for the inner product in R^n), and S is a polyhedron.

If f_0(x) is also linear, then (1.1.1) is a linear optimization problem. If f_0(x) is quadratic, then (1.1.1) is a quadratic optimization problem. If all f_i are quadratic, then this is a quadratically constrained quadratic problem.

There is also a classification based on the properties of the feasible set.

Problem (1.1.1) is called feasible if Q \ne \emptyset.

Problem (1.1.1) is called strictly feasible if there exists x \in \mathrm{int}\, Q such that f_j(x) < 0 (or > 0) for all inequality constraints and f_j(x) = 0 for all equality constraints (Slater condition).
Finally, we distinguish different types of solutions to (1.1.1):

x^* is called the optimal global solution to (1.1.1) if f_0(x^*) \le f_0(x) for all x \in Q (global minimum). In this case f_0(x^*) is called the (global) optimal value of the problem.

x^* is called a local solution to (1.1.1) if there is a neighborhood B of x^* such that f_0(x^*) \le f_0(x) for all x \in B \cap Q (local minimum).
Let us now consider several examples demonstrating the origin of optimization problems.

EXAMPLE 1.1.1 Let x^{(1)}, \dots, x^{(n)} be our design variables. Then we can fix some functional characteristics of our decision: f_0(x), \dots, f_m(x). For example, we can consider the price of the project, the amount of required resources, the reliability of the system, etc. We fix the most important characteristic, f_0(x), as our objective. For all others we impose some bounds: a_j \le f_j(x) \le b_j. Thus, we come up with the problem

\min f_0(x),
\mathrm{s.t.:}\ a_j \le f_j(x) \le b_j, \quad j = 1 \dots m,
x \in S,

where S stands for the structural constraints, like nonnegativity or boundedness of some variables. □

EXAMPLE 1.1.2 Let our initial problem be as follows:

Find x \in R^n such that f_j(x) = a_j, \ j = 1 \dots m.      (1.1.2)

Then we can consider the problem

\min_x \sum_{j=1}^{m} (f_j(x) - a_j)^2,

perhaps even with some additional constraints on x. If the optimal value of the latter problem is zero, we conclude that our initial problem (1.1.2) has a solution.

Note that problem (1.1.2) is almost universal. It covers ordinary differential equations, partial differential equations, problems arising in Game Theory, and many others. □
EXAMPLE 1.1.3 Sometimes our decision variables x^{(1)}, \dots, x^{(n)} must be integer. This can be described by the following constraint:

\sin(\pi x^{(i)}) = 0, \quad i = 1 \dots n.

Thus, we can also treat integer optimization problems:

\min f_0(x),
\mathrm{s.t.:}\ a_j \le f_j(x) \le b_j, \quad j = 1 \dots m,
x \in S,
\sin(\pi x^{(i)}) = 0, \quad i = 1 \dots n. □


Looking at these examples, a reader can understand the optimism of the pioneers of nonlinear optimization, which can easily be seen in the papers of the 1950s and 1960s. Our first impression should be, of course, as follows:

Nonlinear optimization is a very important and promising application theory. It covers almost all needs of operations research and numerical analysis.

However, just by looking at the same examples, especially at Examples 1.1.2 and 1.1.3, a more suspicious (or more experienced) reader could come to the following conjecture:

In general, optimization problems are unsolvable.

Indeed, real life is too complicated for us to believe in a universal tool which can solve all problems at once.

However, such suspicions are not so important in science; it is a question of personal taste how much we trust them. Therefore it was definitely one of the most important events in optimization when, in the middle of the 1970s, this conjecture was proved in a strict mathematical sense. The proof is so simple and remarkable that we cannot avoid it in our course. But first of all, we should introduce a special language, which is necessary to speak about such things.

1.1.2

Performance of numerical methods

Let us imagine the following situation: We have a problem P, which we are going to solve. We know that there are many different numerical methods for doing so, and of course, we want to find the scheme that is best for our P. However, it turns out that we are looking for something that does not exist. In fact, maybe it does, but it is definitely not recommended to ask the winner for help. Indeed, consider a method for solving problem (1.1.1) that does nothing except report that x^* = 0. Of course, this method does not work for any problem except those whose solution is indeed the origin, and for the latter problems the "performance" of such a scheme is unbeatable.

Thus, we cannot speak about the best method for a particular problem P, but we can do so for a class of problems F ∋ P. Indeed, numerical methods are usually developed for solving many different problems with similar characteristics. Thus, the performance of a method M on the whole class F is a natural characteristic of its efficiency.


Since we are going to speak about the performance of M on a class F, we should assume that M does not have complete information about a particular problem P.

The known (for the numerical scheme) "part" of problem P is called the model of the problem.

We denote the model by Σ. Usually the model consists of the problem formulation, descriptions of the classes of functional components, etc.

In order to recognize the problem P (and solve it), the method should be able to collect specific information about P. It is convenient to describe the process of collecting the data by the notion of an oracle. An oracle O is just a unit which answers the successive questions of the method. The method M is trying to solve the problem P by collecting and handling the answers.

In general, each problem can be described by different models. Moreover, for each problem we can develop different types of oracles. But let us fix Σ and O. In this case, it is natural to define the performance of M on (Σ, O) as its performance on the worst P_w from (Σ, O). Note that this P_w can be bad only for M.

Further, what is the performance of M on P? Let us start from an intuitive definition:

The performance of M on P is the total amount of computational effort required by method M to solve problem P.

In this definition there are two additional notions to be specified. First of all, what does it mean to solve the problem? In some situations it could mean finding an exact solution. However, in many areas of numerical analysis that is impossible (and optimization is definitely such a case). Therefore,

To solve the problem means to find an approximate solution to P with an accuracy ε > 0.

The meaning of the words with an accuracy ε > 0 is very important for our definitions. However, it is too early to speak about that now. We just introduce the notation T_ε for a stopping criterion; its meaning will always be made precise for particular problem classes. Now we can give a formal definition of the problem class:

F = (Σ, O, T_ε).

In order to solve a problem P ∈ F, we can apply an iterative process, which naturally describes any method M working with the oracle.

General Iterative Scheme.      (1.1.3)

Input: A starting point x_0 and an accuracy ε > 0.

Initialization. Set k = 0, I_{-1} = ∅. Here k is the iteration counter and I_k is the accumulated information set.

Main loop:
1. Call oracle O at x_k.
2. Update the information set: I_k = I_{k-1} ∪ (x_k, O(x_k)).
3. Apply the rules of method M to I_k and form the point x_{k+1}.
4. Check criterion T_ε. If yes, then form an output x̄. Otherwise set k := k + 1 and go to Step 1.
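As an illustration, scheme (1.1.3) can be transcribed directly into code. The following Python sketch is ours, not part of the original text; `oracle`, `method_step` and `stop` are hypothetical placeholders for the oracle O, the rules of method M, and the criterion T_ε.

```python
def general_iterative_scheme(oracle, method_step, stop, x0, eps):
    # Sketch of scheme (1.1.3); the three callables are placeholders
    # for the oracle O, the rules of method M, and the criterion T_eps.
    x, info = x0, []                    # k = 0, the information set is empty
    while True:
        answer = oracle(x)              # Step 1: call the oracle at x_k
        info.append((x, answer))        # Step 2: update the information set
        x_next = method_step(info)      # Step 3: form the point x_{k+1}
        if stop(info, eps):             # Step 4: check criterion T_eps
            return x_next               # form an output and terminate
        x = x_next                      # k := k + 1, go to Step 1
```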
Now we can specify the term computational effort in our definition of performance. In scheme (1.1.3) we can easily find the two most expensive steps. The first one is Step 1, where we call the oracle, and the second one is Step 3, where we form the next test point. Thus, we can introduce two measures of complexity of problem P for method M:

Analytical complexity: the number of calls of the oracle required to solve problem P up to accuracy ε.

Arithmetical complexity: the total number of arithmetic operations (including the work of the oracle and the work of the method) required to solve problem P up to accuracy ε.

Comparing the notions of analytical and arithmetical complexity, we can see that the second one is more realistic. However, for a particular method M as applied to a problem P, the arithmetical complexity usually can be easily obtained from the analytical complexity and the complexity of the oracle. Therefore, in this course we will speak mainly about bounds on the analytical complexity for some problem classes.

There is one standard assumption on the oracle which allows us to obtain the majority of results on the analytical complexity of optimization problems. This assumption is called the local black box concept and it looks as follows:

Local black box

1. The only information available to the numerical scheme is the answer of the oracle.

2. The oracle is local: a small variation of the problem far enough from the test point x does not change the answer at x.

This concept is very useful in numerical analysis. Of course, its first part looks like an artificial wall between the method and the oracle. It seems natural to give the method access to the internal structure of the problem. However, we will see that for problems with rather complicated structure this access is almost useless. For simpler problems it could help; we will see that in the last chapter of the book.

To conclude the section, let us mention that the standard formulation (1.1.1) is called a functional model of optimization problems. Usually, for such models the standard assumptions are related to the smoothness of the functional components. In accordance with the degree of smoothness, we can apply different types of oracle:

Zero-order oracle: returns the value f(x).

First-order oracle: returns f(x) and the gradient f'(x).

Second-order oracle: returns f(x), f'(x) and the Hessian f''(x).

1.1.3

Complexity bounds for global optimization

Let us try to apply the formal language introduced in the previous section to a particular problem class. Consider, for example, the following problem:

\min_{x \in B_n} f(x).      (1.1.4)

In our terminology, this is a constrained minimization problem without functional constraints. The basic feasible set of this problem is B_n, an n-dimensional box in R^n:

B_n = \{ x \in R^n \mid 0 \le x^{(i)} \le 1, \ i = 1 \dots n \}.


Let us measure distances in R^n using the \ell_\infty-norm,

\|x\|_\infty = \max_{1 \le i \le n} |x^{(i)}|.

Assume that, with respect to this norm, the objective function f(x) is Lipschitz continuous on B_n:

|f(x) - f(y)| \le L \|x - y\|_\infty \quad \forall x, y \in B_n,      (1.1.5)

with some constant L (the Lipschitz constant).


Consider a very simple method for solving (1.1.4), which is called the uniform grid method. This method G(p) has one integer input parameter p. Its scheme is as follows.

Method G(p)      (1.1.6)

1. Form (p+1)^n points

x_{(i_1, \dots, i_n)} = \left( \frac{i_1}{p}, \frac{i_2}{p}, \dots, \frac{i_n}{p} \right)^T,

where (i_1, \dots, i_n) \in \{0, \dots, p\}^n.

2. Among all points x_{(i_1, \dots, i_n)} find the point \bar{x} with the minimal value of the objective function.

3. Return the pair (\bar{x}, f(\bar{x})) as a result.
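The scheme is simple enough to state in a few lines of Python; this sketch is ours (the function name and data representation are not from the text):

```python
import itertools

def uniform_grid_method(f, n, p):
    # Method G(p), scheme (1.1.6): evaluate f at all (p + 1)**n points
    # of the uniform grid in the box B_n and keep the best one.
    best_x, best_val = None, float("inf")
    for idx in itertools.product(range(p + 1), repeat=n):
        x = tuple(i / p for i in idx)      # grid point (i_1/p, ..., i_n/p)
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val                # the pair (x_bar, f(x_bar))

# Example: for f(x) = |x_1 - 0.3| + |x_2 - 0.7| and p = 10 the grid
# contains the minimizer (0.3, 0.7), so the method returns it exactly.
x_bar, f_bar = uniform_grid_method(
    lambda x: abs(x[0] - 0.3) + abs(x[1] - 0.7), n=2, p=10)
```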

Thus, this method forms a uniform grid of test points inside the box B_n, computes the minimal value of the objective over this grid, and returns this value as an approximate solution to problem (1.1.4). In our terminology, this is a zero-order iterative method without any influence of the accumulated information on the sequence of test points. Let us find its efficiency estimate.
THEOREM 1.1.1 Let f^* be the global optimal value of problem (1.1.4). Then

f(\bar{x}) - f^* \le \frac{L}{2p}.

Proof: Let x^* be a global minimum of our problem. Then there exist coordinates (i_1, i_2, \dots, i_n) such that

x \equiv x_{(i_1, i_2, \dots, i_n)} \le x^* \le x_{(i_1+1, i_2+1, \dots, i_n+1)} \equiv y

(here and in the sequel we write x \le y for x, y \in R^n if and only if x^{(i)} \le y^{(i)} for all i = 1 \dots n). Note that y^{(i)} - x^{(i)} = \frac{1}{p} for i = 1 \dots n, and x_*^{(i)} \in [x^{(i)}, y^{(i)}], \ i = 1 \dots n.

Denote \hat{x} = (x + y)/2 and form a point \tilde{x} as follows:

\tilde{x}^{(i)} = \begin{cases} y^{(i)}, & \text{if } x_*^{(i)} \ge \hat{x}^{(i)}, \\ x^{(i)}, & \text{otherwise.} \end{cases}

It is clear that |\tilde{x}^{(i)} - x_*^{(i)}| \le \frac{1}{2p}, \ i = 1 \dots n. Therefore

\|\tilde{x} - x^*\|_\infty = \max_{1 \le i \le n} |\tilde{x}^{(i)} - x_*^{(i)}| \le \frac{1}{2p}.

Since \tilde{x} belongs to our grid, we conclude that

f(\bar{x}) - f(x^*) \le f(\tilde{x}) - f(x^*) \le L \|\tilde{x} - x^*\|_\infty \le \frac{L}{2p}. □

Let us finish the definition of our problem class. Define our goal as follows:

Find \bar{x} \in B_n : \ f(\bar{x}) - f^* \le \varepsilon.      (1.1.7)

Then we immediately get the following result.

COROLLARY 1.1.1 The analytical complexity of the problem class (1.1.4), (1.1.5), (1.1.7) for method G is at most

A(G) = \left( \left\lfloor \frac{L}{2\varepsilon} \right\rfloor + 2 \right)^n

(here \lfloor a \rfloor is the integer part of a).

Proof: Take p = \lfloor \frac{L}{2\varepsilon} \rfloor + 1. Then p \ge \frac{L}{2\varepsilon} and, in view of Theorem 1.1.1, we have f(\bar{x}) - f^* \le \frac{L}{2p} \le \varepsilon. Note that we construct (p+1)^n points. □


Thus, A(G) justifies an upper complexity bound for our problem class. This result is quite informative, but we still have some questions. Firstly, it may happen that our proof is too rough and the real performance of G(p) is much better. Secondly, we still cannot be sure that G(p) is a reasonable method for solving (1.1.4); there may exist other schemes with much higher performance.

In order to answer these questions, we need to derive lower complexity bounds for the problem class (1.1.4), (1.1.5), (1.1.7). The main features of such bounds are as follows.

They are based on the black box concept.

They are valid for all reasonable iterative schemes. Thus, they provide us with a lower estimate for the analytical complexity on the problem class.

Very often such bounds employ the idea of the resisting oracle.

For us only the notion of the resisting oracle is new. Therefore, let us discuss it in more detail.

A resisting oracle tries to create a worst problem for each particular method. It starts from an "empty" function and tries to answer each call of the method in the worst possible way. However, the answers must be compatible with the previous answers and with the description of the problem class. Then, after termination of the method, it is possible to reconstruct a problem which fits completely the final information set accumulated by the algorithm. Moreover, if we launch this method on this problem, it will reproduce the same sequence of test points, since it will have the same sequence of answers from the oracle.
Let us show how that works for problem (1.1.4). Consider the class of problems C defined as follows:

Model: \min_{x \in B_n} f(x), where f(x) is \ell_\infty-Lipschitz continuous on B_n.

Oracle: zero-order local black box.

Approximate solution: find \bar{x} \in B_n with f(\bar{x}) - f^* \le \varepsilon.


THEOREM 1.1.2 For \varepsilon < \frac{L}{2}, the analytical complexity of C for zero-order methods is at least \left( \left\lfloor \frac{L}{2\varepsilon} \right\rfloor \right)^n.

Proof: Denote p = \lfloor \frac{L}{2\varepsilon} \rfloor \ (\ge 1). Assume that there exists a method which needs N < p^n calls of the oracle to solve any problem from C. Let us apply this method to the following resisting strategy:

The oracle returns f(x) = 0 at any test point x.

Therefore this method can find only \bar{x} \in B_n with f(\bar{x}) = 0. However, note that there exists \hat{x} \in B_n such that \hat{x} + \frac{1}{p} e \in B_n, where e = (1, \dots, 1)^T \in R^n, and there were no test points inside the box B = \{ x \mid \hat{x} \le x \le \hat{x} + \frac{1}{p} e \}.

Denote x^* = \hat{x} + \frac{1}{2p} e and consider the function

\bar{f}(x) = \min\{ 0, \ L \|x - x^*\|_\infty - \varepsilon \}.

Clearly, this function is \ell_\infty-Lipschitz continuous with constant L, and its global optimal value is -\varepsilon. Moreover, \bar{f}(x) differs from zero only inside the box B' = \{ x \mid \|x - x^*\|_\infty \le \frac{\varepsilon}{L} \}. Since 2p \le \frac{L}{\varepsilon}, we conclude that

B' \subseteq B = \{ x \mid \|x - x^*\|_\infty \le \tfrac{1}{2p} \}.

Thus, \bar{f}(x) is equal to zero at all test points of our method. Since the accuracy of the result of our method is \varepsilon, we come to the following conclusion: if the number of calls of the oracle is less than p^n, then the accuracy of the result cannot be better than \varepsilon. □
Now we can say much more about the performance of the uniform grid method. Let us compare its efficiency estimate with the lower bound:

G(p): \ \left( \left\lfloor \frac{L}{2\varepsilon} \right\rfloor + 2 \right)^n; \qquad \text{lower bound:} \ \left( \left\lfloor \frac{L}{2\varepsilon} \right\rfloor \right)^n.

Thus, if \varepsilon = O\left( \frac{L}{n} \right), the lower and upper bounds coincide up to a constant multiplicative factor. This implies that G(p) is an optimal method for C.

At the same time, Theorem 1.1.2 supports our initial claim that general optimization problems are unsolvable. Let us look at the following example.
general optimization problems are unsolvable. Let us Iook at the following example.
EXAMPLE 1.1.4 Consider the problem class F defined by the following parameters:

L = 2, \quad n = 10, \quad \varepsilon = 0.01.

Note that the size of the problem is very small and we ask only for 1% accuracy.

The lower complexity bound for this class is \left( \lfloor \frac{L}{2\varepsilon} \rfloor \right)^n. Let us compute it for our example.

Lower bound: 10^{20} calls of the oracle.
Complexity of the oracle: at least n arithmetic operations (a.o.).
Total complexity: 10^{21} a.o.
Workstation: 10^6 a.o. per second.
Total time: 10^{15} seconds.
One year: less than 3.2 \cdot 10^7 sec.
We need: about 31 250 000 years.

This estimate is so disappointing that we cannot keep any hope that such problems may become solvable in the future. Let us just play with the parameters of the problem class.

If we change n to n + 1, then the estimate is multiplied by one hundred. Thus, for n = 11 our lower bound is valid for a much more powerful computer.

On the contrary, if we multiply \varepsilon by two, we reduce the complexity by a factor of a thousand. For example, if \varepsilon = 8\%, then we need only two weeks. □
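The arithmetic behind this table is easy to verify; the following sketch (ours) just repeats the computation with the hardware figures assumed in the example:

```python
n = 10                               # dimension
p = 100                              # floor(L / (2 * eps)) for L = 2, eps = 0.01
calls = p ** n                       # lower bound: 100**10 = 10**20 oracle calls
ops = calls * n                      # at least n a.o. per call: 10**21 a.o.
seconds = ops / 10**6                # workstation speed: 10**6 a.o. per second
years = seconds / (3.2 * 10**7)      # one year is less than 3.2 * 10**7 seconds
print(f"{calls:.1e} calls, about {years:,.0f} years")  # ~31,250,000 years
```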

We should note that the lower complexity bounds for problems with smooth functions, or for high-order methods, are not much better than those of Theorem 1.1.2. This can be proved using the same arguments, and we leave the proof as an exercise for the reader. A comparison of the above results with the upper bounds for NP-hard problems, which are considered a classical example of very difficult problems in combinatorial optimization, is also quite disappointing: hard combinatorial problems need only 2^n a.o.!
To conclude this section, let us compare our situation with that in some other fields of numerical analysis. It is well known that the uniform grid approach is a standard tool in many domains. For example, if we need to compute numerically the value of the integral of a univariate function,

I = \int_0^1 f(x) \, dx,

then the standard way to proceed is to form a discrete sum

S_N = \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i = \frac{i}{N}, \ i = 1 \dots N.

If f(x) is Lipschitz continuous with constant L, then this value can be used as an approximation to I:

|I - S_N| \le \frac{L}{2N}.

Note that in our terminology this is exactly the uniform grid approach. Moreover, it is a standard way of approximating integrals. The reason why it works here lies in the dimension of the problems. For integration the standard dimensions are very small (up to three), while in optimization we sometimes need to solve problems with several million variables.
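In code, the analogous one-dimensional integrator is equally simple (a sketch, ours; the test integrand is arbitrary):

```python
import math

def grid_integral(f, N):
    # S_N = (1/N) * sum_{i=1}^{N} f(i/N): the uniform-grid approximation
    # of the integral of f over [0, 1]; for L-Lipschitz f, |I - S_N| <= L/(2N).
    return sum(f(i / N) for i in range(1, N + 1)) / N

approx = grid_integral(math.sin, 10**4)   # close to the exact value
exact = 1.0 - math.cos(1.0)               # integral of sin over [0, 1]
```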

1.1.4

Identity cards of the fields

After the pessimistic results of the previous section, first of all we should understand what could be our goal in the theoretical analysis of optimization problems. It seems everything is clear for general global optimization. But maybe the goals of this field are too ambitious? Maybe in some practical problems we would be satisfied by a much less "optimal" solution? Or maybe there are some interesting problem classes which are not as dangerous as the class of general continuous functions?

In fact, each of these questions can be answered in different ways, and the answers define the style of research (or the rules of the game) in the different fields of nonlinear optimization. If we try to classify these fields, we can easily see that they differ one from another in the following aspects:

Goals of the methods.
Classes of functional components.
Description of the oracle.

These aspects define in a natural way the list of desired properties of the optimization methods. Let us present the "identity cards" of the fields which we are going to consider in the book.

Name: General global optimization (Section 1.1).
Goals: Find a global minimum.
Functional class: Continuous functions.
Oracle: 0-1-2 order black box.
Desired properties: Convergence to a global minimum.
Features: From a theoretical point of view, this game is too short; we always lose it.
Problem sizes: There are examples of solving problems with thousands of variables, but with no guarantee of success even for very small problems.
History: Starts from 1955. Several local peaks of interest related to new heuristic ideas (simulated annealing, neural networks, genetic algorithms).

Name: Nonlinear optimization (Sections 1.2, 1.3).
Goals: Find a local minimum.
Functional class: Differentiable functions.
Oracle: 1-2 order black box.
Desired properties: Convergence to a local minimum; fast convergence.
Features: Great variability of approaches; the most widespread software. The goal is not always acceptable and reachable.
Problem sizes: Up to 1000 variables.
History: Starts from 1955. Peak period: 1965-1985. Theoretical activity now is rather low.

Name: Convex optimization (Chapters 2, 3).
Goals: Find a global minimum.
Functional class: Convex sets and functions.
Oracle: First-order black box.
Desired properties: Convergence to a global minimum; rate of convergence depends on the dimension.
Features: Very rich and interesting theory; comprehensive complexity theory; efficient practical methods. The problem class is sometimes restrictive.
Problem sizes: Up to 1000 variables.
History: Starts from 1970. Peak period: 1975-1985 (terminated by the explosion of interior-point ideas). Theoretical activity now is growing again.

Name: Interior-point polynomial-time methods (Chapter 4).
Goals: Find a global minimum.
Functional class: Convex sets and functions with explicit structure.
Oracle: Second-order black box oracle, which is not local.
Desired properties: Fast convergence to a global minimum; rate of convergence depends on the structure of the problem.
Features: Very new and promising theory that avoids the black box concept. The problem class is practically the same as in convex optimization.
Problem sizes: Sometimes up to 10 000 000 variables.
History: Starts from 1984. Peak period: 1990 - .... Very high theoretical activity just now.

1.2

Local methods in unconstrained minimization

(Relaxation and approximation; Necessary optimality conditions; Sufficient optimality conditions; Class of differentiable functions; Class of twice differentiable functions; Gradient method; Rate of convergence; Newton method.)

1.2.1

Relaxation and approximation

The simplest goal of general nonlinear optimization is to find a local minimum of a differentiable function. In general, the global structure of such a function is not simpler than that of a Lipschitz continuous function. Therefore, even for reaching such a restricted goal, it is necessary to follow some special principles which guarantee the convergence of the minimization process.

The majority of general nonlinear optimization methods are based on the idea of relaxation:

We call a sequence \{a_k\}_{k=0}^{\infty} a relaxation sequence if a_{k+1} \le a_k for all k \ge 0.

In this section we consider several methods for solving the unconstrained minimization problem

\min_{x \in R^n} f(x),      (1.2.1)

where f(x) is a smooth function. In order to solve this problem, we generate a relaxation sequence \{f(x_k)\}_{k=0}^{\infty}:

f(x_{k+1}) \le f(x_k), \quad k = 0, 1, \dots

This strategy has the following important advantages:

1. If f(x) is bounded below on R^n, then the sequence \{f(x_k)\}_{k=0}^{\infty} converges.

2. In any case we improve the initial value of the objective function.

However, it would be impossible to implement the idea of relaxation without employing another fundamental principle of numerical analysis, approximation. In general,

To approximate means to replace an initial complex object by a simplified one which is close by its properties to the original.

In nonlinear optimization we usually apply local approximations based on derivatives of nonlinear functions. These are the first- and second-order approximations (or, the linear and quadratic approximations).

Let f(x) be differentiable at x. Then for y \in R^n we have

f(y) = f(x) + \langle f'(x), y - x \rangle + o(\|y - x\|),

where o(r) is some function of r \ge 0 such that

\lim_{r \downarrow 0} \frac{1}{r} o(r) = 0, \qquad o(0) = 0.

In the sequel we fix the notation \| \cdot \| for the standard Euclidean norm in R^n:

\|x\| = \left( \sum_{i=1}^{n} (x^{(i)})^2 \right)^{1/2}.

The linear function f(x) + \langle f'(x), y - x \rangle is called the linear approximation of f at x. Recall that the vector f'(x) is called the gradient of function f at x. Considering the points y_i = x + \varepsilon e_i, where e_i is the ith coordinate vector in R^n, and taking the limit as \varepsilon \to 0, we obtain the following coordinate representation of the gradient:

f'(x) = \left( \frac{\partial f(x)}{\partial x^{(1)}}, \dots, \frac{\partial f(x)}{\partial x^{(n)}} \right)^T.
Let us mention two important properties of the gradient. Denote by \mathcal{L}_f(\alpha) the level set of f(x):

\mathcal{L}_f(\alpha) = \{ x \in R^n \mid f(x) \le \alpha \}.

Consider the set of directions that are tangent to \mathcal{L}_f(f(x)) at x:

S_f(x) = \left\{ s \in R^n \;\middle|\; s = \lim_{k \to \infty} \frac{y_k - x}{\|y_k - x\|} \text{ for some sequence } y_k \to x \text{ with } f(y_k) = f(x) \right\}.

LEMMA 1.2.1 If s \in S_f(x), then \langle f'(x), s \rangle = 0.

Proof: Since f(y_k) = f(x), we have

f(y_k) = f(x) + \langle f'(x), y_k - x \rangle + o(\|y_k - x\|) = f(x).

Therefore \langle f'(x), y_k - x \rangle + o(\|y_k - x\|) = 0. Dividing this equation by \|y_k - x\| and taking the limit as y_k \to x, we obtain the result. □
Let s be a direction in R^n, \|s\| = 1. Consider the local decrease of f(x) along s:

\Delta(s) = \lim_{\alpha \downarrow 0} \frac{1}{\alpha} [ f(x + \alpha s) - f(x) ].

Note that f(x + \alpha s) - f(x) = \alpha \langle f'(x), s \rangle + o(\alpha). Therefore \Delta(s) = \langle f'(x), s \rangle. Using the Cauchy-Schwarz inequality,

-\|x\| \cdot \|y\| \le \langle x, y \rangle \le \|x\| \cdot \|y\|,

we obtain \Delta(s) = \langle f'(x), s \rangle \ge -\|f'(x)\|. Let us take

\bar{s} = -f'(x) / \|f'(x)\|.

Then

\Delta(\bar{s}) = -\langle f'(x), f'(x) \rangle / \|f'(x)\| = -\|f'(x)\|.

Thus, the direction -f'(x) (the antigradient) is the direction of the fastest local decrease of f(x) at the point x.

The next statement is probably the most fundamental fact in optimization.
THEOREM 1.2.1 (First-order optimality condition.) Let x^* be a local minimum of a differentiable function f(x). Then

f'(x^*) = 0.


Proof: Since x^* is a local minimum of f(x), there exists r > 0 such that for all y with \|y - x^*\| \le r we have f(y) \ge f(x^*). Since f is differentiable, this implies that

f(y) = f(x^*) + \langle f'(x^*), y - x^* \rangle + o(\|y - x^*\|) \ge f(x^*).

Thus, for all s with \|s\| = 1 we have \langle f'(x^*), s \rangle \ge 0. Considering the directions s and -s, we get

\langle f'(x^*), s \rangle = 0 \quad \forall s, \ \|s\| = 1.

Finally, choosing s = e_i, i = 1 \dots n, where e_i is the ith coordinate vector in R^n, we obtain f'(x^*) = 0. □
COROLLARY 1.2.1 Let x^* be a local minimum of a differentiable function f(x) subject to linear equality constraints

x \in \mathcal{L} \equiv \{ x \in R^n \mid Ax = b \} \ne \emptyset,

where A is an m \times n matrix and b \in R^m, m < n. Then there exists a vector of multipliers \lambda^* such that

f'(x^*) = A^T \lambda^*.      (1.2.2)

Proof: Consider some vectors u_i, i = 1 \dots k, that form a basis of the null space of matrix A. Then any x \in \mathcal{L} can be represented as follows:

x = x(y) = x^* + \sum_{i=1}^{k} y^{(i)} u_i, \qquad y \in R^k.

Moreover, the point y = 0 is a local minimum of the function \varphi(y) = f(x(y)). In view of Theorem 1.2.1, \varphi'(0) = 0. This means that

\frac{\partial \varphi(0)}{\partial y^{(i)}} = \langle f'(x^*), u_i \rangle = 0, \quad i = 1 \dots k,

and (1.2.2) follows. □
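A worked instance of (1.2.2) (our illustration, not from the text): minimize f(x) = \tfrac{1}{2}\|x\|^2 subject to Ax = b, with A of full row rank. Here f'(x) = x, so condition (1.2.2) gives

x^* = A^T \lambda^*, \qquad A x^* = A A^T \lambda^* = b \;\Longrightarrow\; \lambda^* = (A A^T)^{-1} b,

and x^* = A^T (A A^T)^{-1} b is the minimum-norm solution of the system Ax = b.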

Note that we have proved only a necessary condition for a local minimum. The points satisfying this condition are called the stationary points of function f. In order to see that such points are not always local minima, it is enough to look at the function f(x) = x^3, x \in R^1, at x = 0.


Let us now introduce the second-order approximation. Let the function f(x) be twice differentiable at x. Then

f(y) = f(x) + \langle f'(x), y - x \rangle + \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle + o(\|y - x\|^2).

The quadratic function

f(x) + \langle f'(x), y - x \rangle + \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle

is called the quadratic (or second-order) approximation of function f at x. Recall that the (n \times n)-matrix f''(x) has the entries

(f''(x))^{(i,j)} = \frac{\partial^2 f(x)}{\partial x^{(i)} \partial x^{(j)}}.

It is called the Hessian of function f at x. Note that the Hessian is a symmetric matrix:

f''(x) = [f''(x)]^T.

The Hessian can be seen as a derivative of the vector function f'(x):

f'(y) = f'(x) + f''(x)(y - x) + o(\|y - x\|),

where o(r) is a vector function such that \lim_{r \downarrow 0} \frac{1}{r} \|o(r)\| = 0 and o(0) = 0.

Using the second-order approximation, we can write down the second-order optimality conditions. In what follows, the notation A \succeq 0, used for a symmetric matrix A, means that A is positive semidefinite:

\langle Ax, x \rangle \ge 0 \quad \forall x \in R^n.

The notation A \succ 0 means that A is positive definite (the above inequality must be strict for x \ne 0).
THEOREM 1.2.2 (Second-order optimality condition.) Let x^* be a local minimum of a twice differentiable function f(x). Then

f'(x^*) = 0, \qquad f''(x^*) \succeq 0.

Proof: Since x^* is a local minimum of function f(x), there exists r > 0 such that for all y with \|y - x^*\| \le r we have

f(y) \ge f(x^*).

In view of Theorem 1.2.1, f'(x^*) = 0. Therefore, for any such y,

f(y) = f(x^*) + \tfrac{1}{2} \langle f''(x^*)(y - x^*), y - x^* \rangle + o(\|y - x^*\|^2) \ge f(x^*).

Thus, \langle f''(x^*) s, s \rangle \ge 0 for all s, \|s\| = 1. □


Again, the above theorem is a necessary (second-order) characterization of a local minimum. Let us now prove a sufficient condition.

THEOREM 1.2.3 Let the function f(x) be twice differentiable on R^n and let x^* satisfy the following conditions:

f'(x^*) = 0, \qquad f''(x^*) \succ 0.

Then x^* is a strict local minimum of f(x). (Sometimes, instead of strict, we say isolated.)

Proof: Note that in a small neighborhood of the point x^* the function f(x) can be represented as

f(y) = f(x^*) + \tfrac{1}{2} \langle f''(x^*)(y - x^*), y - x^* \rangle + o(\|y - x^*\|^2).

Since \frac{o(r)}{r} \to 0, there exists a value \bar{r} > 0 such that for all r \in [0, \bar{r}] we have

|o(r^2)| \le \tfrac{r^2}{4} \lambda_1(f''(x^*)),

where \lambda_1(f''(x^*)) is the smallest eigenvalue of the matrix f''(x^*). Recall that, in view of our assumption, this eigenvalue is positive. Therefore, for any y with \|y - x^*\| \le \bar{r} we have

f(y) \ge f(x^*) + \tfrac{1}{2} \lambda_1(f''(x^*)) \|y - x^*\|^2 + o(\|y - x^*\|^2)
\ge f(x^*) + \tfrac{1}{4} \lambda_1(f''(x^*)) \|y - x^*\|^2 > f(x^*). □
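A one-line example (ours) shows that this sufficient condition is not necessary:

f(x) = x^4: \qquad f'(0) = 0, \quad f''(0) = 0,

yet x^* = 0 is a strict local (in fact, global) minimum; the theorem simply does not apply at this point.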

1.2.2

Classes of differentiable functions

It is well known that any continuous function can be approximated by a smooth function with arbitrarily small accuracy. Therefore, assuming only differentiability of the objective function, we cannot derive any reasonable properties of minimization processes. Hence, we have to impose some additional assumptions on the magnitude of the derivatives. Traditionally, in optimization such assumptions are presented in the form of a Lipschitz condition for a derivative of a certain order.

Let Q be a subset of R^n. We denote by C_L^{k,p}(Q) the class of functions with the following properties: any f \in C_L^{k,p}(Q) is k times continuously differentiable on Q, and its pth derivative is Lipschitz continuous on Q with the constant L:

\| f^{(p)}(x) - f^{(p)}(y) \| \le L \|x - y\|

for all x, y \in Q.

Clearly, we always have p \le k. If q \ge k, then C_L^{q,p}(Q) \subseteq C_L^{k,p}(Q); for example, C_L^{2,1}(Q) \subseteq C_L^{1,1}(Q). Note also that these classes possess the following property: if f_1 \in C_{L_1}^{k,p}(Q), f_2 \in C_{L_2}^{k,p}(Q) and \alpha, \beta \in R^1, then for

L_3 = |\alpha| L_1 + |\beta| L_2

we have \alpha f_1 + \beta f_2 \in C_{L_3}^{k,p}(Q).

We use the notation f \in C^k(Q) for a function f which is k times continuously differentiable on Q.

For us the most important class of functions of the above type will be C_L^{1,1}(R^n), the class of functions with Lipschitz continuous gradient. By definition, the inclusion f \in C_L^{1,1}(R^n) implies that

\| f'(x) - f'(y) \| \le L \|x - y\|      (1.2.3)

for all x, y \in R^n. Let us give a sufficient condition for that inclusion.

LEMMA 1.2.2 A function f(x) belongs to C_L^{2,1}(R^n) \subset C_L^{1,1}(R^n) if and only if

\| f''(x) \| \le L \quad \forall x \in R^n.      (1.2.4)

Proof: Indeed, for any x, y \in R^n we have

f'(y) = f'(x) + \int_0^1 f''(x + \tau(y - x))(y - x) \, d\tau = f'(x) + \left( \int_0^1 f''(x + \tau(y - x)) \, d\tau \right)(y - x).

Therefore, if condition (1.2.4) is satisfied, then

\| f'(y) - f'(x) \| = \left\| \left( \int_0^1 f''(x + \tau(y - x)) \, d\tau \right)(y - x) \right\|
\le \int_0^1 \| f''(x + \tau(y - x)) \| \, d\tau \cdot \|y - x\| \le L \|y - x\|.

On the other hand, if f \in C_L^{1,1}(R^n), then for any s \in R^n and \alpha > 0 we have

\left\| \left( \int_0^\alpha f''(x + \tau s) \, d\tau \right) s \right\| = \| f'(x + \alpha s) - f'(x) \| \le \alpha L \|s\|.

Dividing this inequality by \alpha and letting \alpha \downarrow 0, we obtain (1.2.4). □

This simple result provides us with many examples of functions with Lipschitz continuous gradient.

EXAMPLE 1.2.1 1. The linear function f(x) = \alpha + \langle a, x \rangle belongs to C_0^{1,1}(R^n) since

f'(x) = a, \qquad f''(x) = 0.

2. For the quadratic function f(x) = \alpha + \langle a, x \rangle + \tfrac{1}{2} \langle Ax, x \rangle with A = A^T we have

f'(x) = a + Ax, \qquad f''(x) = A.

Therefore f(x) \in C_L^{1,1}(R^n) with L = \|A\|.

3. Consider the function of one variable f(x) = \sqrt{1 + x^2}, x \in R^1. We have

f'(x) = \frac{x}{\sqrt{1 + x^2}}, \qquad f''(x) = \frac{1}{(1 + x^2)^{3/2}} \le 1.

Therefore f(x) \in C_1^{1,1}(R). □
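A quick numerical sanity check of item 3 (a sketch, ours; the sampling grid is arbitrary):

```python
import numpy as np

# For f(x) = sqrt(1 + x^2), difference quotients of f' never exceed L = 1,
# in agreement with |f''(x)| = (1 + x^2)**(-1.5) <= 1.
fprime = lambda x: x / np.sqrt(1.0 + x**2)
xs = np.linspace(-10.0, 10.0, 2001)
ratios = np.abs(np.diff(fprime(xs))) / np.diff(xs)
assert ratios.max() <= 1.0
```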

The next statement is important for the geometric interpretation of functions from C_L^{1,1}(R^n).

LEMMA 1.2.3 Let f \in C_L^{1,1}(R^n). Then for any x, y from R^n we have

| f(y) - f(x) - \langle f'(x), y - x \rangle | \le \frac{L}{2} \|y - x\|^2.      (1.2.5)

Proof: For all x, y \in R^n we have

f(y) = f(x) + \int_0^1 \langle f'(x + \tau(y - x)), y - x \rangle \, d\tau
= f(x) + \langle f'(x), y - x \rangle + \int_0^1 \langle f'(x + \tau(y - x)) - f'(x), y - x \rangle \, d\tau.

Therefore

| f(y) - f(x) - \langle f'(x), y - x \rangle | = \left| \int_0^1 \langle f'(x + \tau(y - x)) - f'(x), y - x \rangle \, d\tau \right|
\le \int_0^1 | \langle f'(x + \tau(y - x)) - f'(x), y - x \rangle | \, d\tau
\le \int_0^1 \| f'(x + \tau(y - x)) - f'(x) \| \cdot \|y - x\| \, d\tau
\le \int_0^1 \tau L \|y - x\|^2 \, d\tau = \frac{L}{2} \|y - x\|^2. □

Geometrically, we can draw the following picture. Consider a function f from C_L^{1,1}(R^n). Let us fix some x_0 \in R^n and define two quadratic functions

\phi_1(x) = f(x_0) + \langle f'(x_0), x - x_0 \rangle + \frac{L}{2} \|x - x_0\|^2,
\phi_2(x) = f(x_0) + \langle f'(x_0), x - x_0 \rangle - \frac{L}{2} \|x - x_0\|^2.

Then the graph of the function f is located between the graphs of \phi_1 and \phi_2:

\phi_2(x) \le f(x) \le \phi_1(x) \quad \forall x \in R^n.

Let us prove a similar result for the class of twice differentiable functions. Our main class of functions of that type will be C_M^{2,2}(R^n), the class of twice differentiable functions with Lipschitz continuous Hessian. Recall that for f \in C_M^{2,2}(R^n) we have

\| f''(x) - f''(y) \| \le M \|x - y\|      (1.2.6)

for all x, y \in R^n.

LEMMA 1.2.4 Let f \in C_M^{2,2}(R^n). Then for any x, y from R^n we have

\| f'(y) - f'(x) - f''(x)(y - x) \| \le \frac{M}{2} \|y - x\|^2,      (1.2.7)

| f(y) - f(x) - \langle f'(x), y - x \rangle - \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle | \le \frac{M}{6} \|y - x\|^3.      (1.2.8)


Proof: Let us fix some x, y \in R^n. Then

f'(y) = f'(x) + \int_0^1 f''(x + \tau(y - x))(y - x) \, d\tau
= f'(x) + f''(x)(y - x) + \int_0^1 ( f''(x + \tau(y - x)) - f''(x) )(y - x) \, d\tau.

Therefore

\| f'(y) - f'(x) - f''(x)(y - x) \| = \left\| \int_0^1 ( f''(x + \tau(y - x)) - f''(x) )(y - x) \, d\tau \right\|
\le \int_0^1 \| ( f''(x + \tau(y - x)) - f''(x) )(y - x) \| \, d\tau
\le \int_0^1 \| f''(x + \tau(y - x)) - f''(x) \| \cdot \|y - x\| \, d\tau
\le \int_0^1 \tau M \|y - x\|^2 \, d\tau = \frac{M}{2} \|y - x\|^2.

Inequality (1.2.8) can be proved in a similar way. □

COROLLARY 1.2.2 Let f \in C_M^{2,2}(R^n) and \|y - x\| = r. Then

f''(x) - M r I_n \preceq f''(y) \preceq f''(x) + M r I_n,

where I_n is the unit matrix in R^n. (Recall that for matrices A and B we write A \succeq B if A - B \succeq 0.)

Proof: Denote G = f''(y) - f''(x). Since f \in C_M^{2,2}(R^n), we have \|G\| \le M r. This means that the eigenvalues \lambda_i(G) of the symmetric matrix G satisfy

| \lambda_i(G) | \le M r, \quad i = 1 \dots n.

Hence, -M r I_n \preceq G = f''(y) - f''(x) \preceq M r I_n. □


1.2.3 Gradient method

Now we are completely ready for studying the convergence rate of unconstrained minimization methods. Let us start from the simplest scheme. We already know that the antigradient is a direction of locally steepest descent of a differentiable function. Since we are going to find its local minimum, the following scheme is the first to be tried:

Gradient method

Choose $x_0 \in R^n$.
Iterate $x_{k+1} = x_k - h_k f'(x_k)$, $k = 0, 1, \dots$.   (1.2.9)

We will refer to this scheme as a gradient method. The scalar factor of the gradient, $h_k$, is called the step size. Of course, it must be positive. There are many variants of this method, which differ one from another by the step-size strategy. Let us consider the most important examples.

1. The sequence $\{h_k\}_{k=0}^{\infty}$ is chosen in advance. For example,
$$h_k = h > 0 \quad \text{(constant step)}, \qquad h_k = \frac{h}{\sqrt{k+1}}.$$

2. Full relaxation:
$$h_k = \arg\min_{h \ge 0} f(x_k - h f'(x_k)).$$

3. Goldstein-Armijo rule: Find $x_{k+1} = x_k - h f'(x_k)$ such that
$$\alpha \langle f'(x_k), x_k - x_{k+1} \rangle \le f(x_k) - f(x_{k+1}), \qquad (1.2.10)$$
$$\beta \langle f'(x_k), x_k - x_{k+1} \rangle \ge f(x_k) - f(x_{k+1}), \qquad (1.2.11)$$
where $0 < \alpha < \beta < 1$ are some fixed parameters.


Comparing these strategies, we see that the first strategy is the simplest one. Indeed, it is often used, but mainly in the context of convex optimization. In that framework the behavior of functions is much more predictable than in the general nonlinear case.


The second strategy is completely theoretical. It is never used in practice since even in one-dimensional cases we cannot find an exact minimum in finite time.

The third strategy is used in the majority of practical algorithms. It has the following geometric interpretation. Let us fix $x \in R^n$. Consider the function of one variable
$$\phi(h) = f(x - h f'(x)), \qquad h \ge 0.$$
Then the step-size values acceptable for this strategy belong to the part of the graph of $\phi$ that is located between two linear functions:
$$\phi_1(h) = f(x) - \alpha h \| f'(x) \|^2, \qquad \phi_2(h) = f(x) - \beta h \| f'(x) \|^2.$$
Note that $\phi(0) = \phi_1(0) = \phi_2(0)$ and $\phi'(0) < \phi_2'(0) < \phi_1'(0) < 0$. Therefore, acceptable values exist unless $\phi(h)$ is unbounded below. There are several very fast one-dimensional procedures for finding a point satisfying the conditions of this strategy, but their description is not so important for us now.
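To make the scheme concrete, here is a minimal Python sketch of the gradient method (1.2.9) with a Goldstein-Armijo step size (our illustration; the parameters $\alpha = 0.25$, $\beta = 0.75$, the doubling/halving search and the test function are our own choices):

```python
import numpy as np

def gradient_method(f, grad, x0, iters=100, alpha=0.25, beta=0.75):
    """Gradient method (1.2.9) with a Goldstein-Armijo step size:
    accept h when alpha*h*||g||^2 <= f(x) - f(x - h*g) <= beta*h*||g||^2,
    which is conditions (1.2.10)-(1.2.11) for x_{k+1} = x - h*g."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        gg = g @ g
        if gg < 1e-16:                       # already (almost) stationary
            break
        h = 1.0
        for _ in range(60):                  # doubling/halving search for h
            decrease = f(x) - f(x - h * g)
            if decrease < alpha * h * gg:    # step too long: (1.2.10) fails
                h *= 0.5
            elif decrease > beta * h * gg:   # step too short: (1.2.11) fails
                h *= 2.0
            else:
                break
        x = x - h * g
    return x

# Example: a smooth function with Lipschitz continuous gradient
f = lambda x: 0.5 * (x @ x) + np.log(1.0 + x[0]**2)
g = lambda x: x + np.array([2.0 * x[0] / (1.0 + x[0]**2), 0.0])
print(gradient_method(f, g, [3.0, -2.0]))    # -> approximately [0, 0]
```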
Let us estimate the performance of the gradient method. Consider the problem
$$\min_{x \in R^n} f(x),$$
with $f \in C_L^{1,1}(R^n)$, and assume that $f(x)$ is bounded below on $R^n$.

Let us evaluate the result of one gradient step. Consider $y = x - h f'(x)$. Then, in view of (1.2.5), we have
$$f(y) \le f(x) + \langle f'(x), y - x \rangle + \frac{L}{2} \| y - x \|^2 = f(x) - h \| f'(x) \|^2 + \frac{h^2}{2} L \| f'(x) \|^2 = f(x) - h \left( 1 - \frac{h}{2} L \right) \| f'(x) \|^2. \qquad (1.2.12)$$

Thus, in order to get the best estimate for the possible decrease of the objective function, we have to solve the following one-dimensional problem:
$$\Delta(h) = -h \left( 1 - \frac{h}{2} L \right) \to \min_h.$$
Computing the derivative of this function, we conclude that the optimal step size must satisfy the equation $\Delta'(h) = hL - 1 = 0$. Thus, it is $h^* = \frac{1}{L}$, which is a minimum of $\Delta(h)$ since $\Delta''(h) = L > 0$.

Thus, our considerations prove that one step of the gradient method decreases the value of the objective function at least as follows:
$$f(y) \le f(x) - \frac{1}{2L} \| f'(x) \|^2.$$


Let us check what happens with the above step-size strategies. Let $x_{k+1} = x_k - h_k f'(x_k)$. Then for the constant step strategy, $h_k = h$, we have
$$f(x_k) - f(x_{k+1}) \ge h \left( 1 - \tfrac{1}{2} L h \right) \| f'(x_k) \|^2.$$
Therefore, if we choose $h_k = \frac{2\alpha}{L}$ with $\alpha \in (0, 1)$, then
$$f(x_k) - f(x_{k+1}) \ge \frac{2}{L} \alpha (1 - \alpha) \| f'(x_k) \|^2.$$
Of course, the optimal choice is $h_k = \frac{1}{L}$.

For the full relaxation strategy we have
$$f(x_k) - f(x_{k+1}) \ge \frac{1}{2L} \| f'(x_k) \|^2,$$
since the maximal decrease is not worse than that for $h_k = \frac{1}{L}$.

Finally, for the Goldstein-Armijo rule, in view of (1.2.11) we have
$$f(x_k) - f(x_{k+1}) \le \beta \langle f'(x_k), x_k - x_{k+1} \rangle = \beta h_k \| f'(x_k) \|^2.$$
From (1.2.12) we obtain
$$f(x_k) - f(x_{k+1}) \ge h_k \left( 1 - \frac{h_k}{2} L \right) \| f'(x_k) \|^2.$$
Therefore $h_k \ge \frac{2}{L}(1 - \beta)$. Further, using (1.2.10) we have
$$f(x_k) - f(x_{k+1}) \ge \alpha \langle f'(x_k), x_k - x_{k+1} \rangle = \alpha h_k \| f'(x_k) \|^2.$$
Combining this inequality with the previous one, we conclude that
$$f(x_k) - f(x_{k+1}) \ge \frac{2}{L} \alpha (1 - \beta) \| f'(x_k) \|^2.$$
Thus, we have proved that in all cases we have
$$f(x_k) - f(x_{k+1}) \ge \frac{\omega}{L} \| f'(x_k) \|^2, \qquad (1.2.13)$$
where $\omega$ is some positive constant.


Now we areready to estimate the performance of the gradient scheme.
Let us sum up the inequalities {1.2.13) for k = 0 ... N. We obtain
(1.2.14)

where f* is the optimal value of the problern (1.2.1). As a simple consequence of (1.2.14) we have

II f'(xk) II-+

as

k-+ oo.


However, we can also say something about the convergence rate. Indeed, denote
$$g_N^* = \min_{0 \le k \le N} g_k,$$
where $g_k = \| f'(x_k) \|$. Then, in view of (1.2.14), we come to the following inequality:
$$g_N^* \le \frac{1}{\sqrt{N+1}} \left[ \frac{1}{\omega} L (f(x_0) - f^*) \right]^{1/2}. \qquad (1.2.15)$$
The right-hand side of this inequality describes the rate of convergence of the sequence $\{g_N^*\}$ to zero. Note that we cannot say anything about the rate of convergence of the sequences $\{f(x_k)\}$ and $\{x_k\}$.
Recall that in general nonlinear optimization our goal is quite moderate: We want to find only a local minimum of our problem. Nevertheless,
even this goal is unreachable for a gradient method. Let us consider the
following example.
EXAMPLE 1.2.2 Let us look at the following function of two variables:
$$f(x) = f(x^{(1)}, x^{(2)}) = \tfrac{1}{2} (x^{(1)})^2 + \tfrac{1}{4} (x^{(2)})^4 - \tfrac{1}{2} (x^{(2)})^2.$$
The gradient of this function is $f'(x) = (x^{(1)}, (x^{(2)})^3 - x^{(2)})^T$. Therefore there are only three points which can pretend to be a local minimum of this function:
$$x_1^* = (0, 0), \qquad x_2^* = (0, -1), \qquad x_3^* = (0, 1).$$
Computing the Hessian of this function,
$$f''(x) = \begin{pmatrix} 1 & 0 \\ 0 & 3 (x^{(2)})^2 - 1 \end{pmatrix},$$
we conclude that $x_2^*$ and $x_3^*$ are isolated local minima¹, but $x_1^*$ is only a stationary point of our function. Indeed, $f(x_1^*) = 0$ and $f(x_1^* + \epsilon e_2) = \frac{\epsilon^4}{4} - \frac{\epsilon^2}{2} < 0$ for $\epsilon$ small enough.

Now, let us consider the trajectory of the gradient method, which starts from $x_0 = (1, 0)$. Note that the second coordinate of this point is zero. Therefore, the second coordinate of $f'(x_0)$ is also zero. Consequently, the second coordinate of $x_1$ is zero, etc. Thus, the entire sequence of points generated by the gradient method will have the second coordinate equal to zero. This means that this sequence converges to $x_1^*$.

¹ In fact, in our example they are the global solutions.


To conclude our example, note that this situation is typical for all first-order unconstrained minimization methods. Without additional rather strict assumptions it is impossible to guarantee their global convergence to a local minimum, only to a stationary point. □
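The trajectory of Example 1.2.2 is easy to reproduce numerically. The following Python fragment (our illustration; the constant step $0.1$ is an arbitrary admissible choice) runs the gradient method from $x_0 = (1, 0)$:

```python
import numpy as np

# f(x) = 0.5*(x1)^2 + 0.25*(x2)^4 - 0.5*(x2)^2 from Example 1.2.2
grad = lambda x: np.array([x[0], x[1]**3 - x[1]])

x = np.array([1.0, 0.0])       # starting point x0 = (1, 0)
h = 0.1                        # a constant step size
for k in range(200):
    x = x - h * grad(x)

print(x)   # approximately [0, 0]: the stationary point x1*, not a local minimum
```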
Note that inequality (1.2.15) provides us with an example of a new notion, namely the rate of convergence of a minimization process. How can we use this notion in the complexity analysis? The rate of convergence delivers the upper complexity bounds for a problem class. These bounds are always justified by some numerical method. If there exists a method whose upper complexity bounds are proportional to the lower complexity bounds of the problem class, we call this method optimal. Recall that in Section 1.1 we have already seen an example of an optimal method.
Let us look at an example of upper complexity bounds.

EXAMPLE 1.2.3 Consider the following problem class:

Model: 1. Unconstrained minimization. 2. $f \in C_L^{1,1}(R^n)$. 3. $f(x)$ is bounded below.   (1.2.16)

Oracle: First-order black box.

Approximate solution: $\bar{x} \in R^n$ with $f(\bar{x}) \le f(x_0)$, $\| f'(\bar{x}) \| \le \epsilon$.

Note that inequality (1.2.15) can be used in order to obtain an upper bound for the number of steps (= calls of the oracle) which is necessary to find a point with a small norm of the gradient. For that, let us write down the following inequality:
$$g_N^* \le \frac{1}{\sqrt{N+1}} \left[ \frac{1}{\omega} L (f(x_0) - f^*) \right]^{1/2} \le \epsilon.$$
Therefore, if $N + 1 \ge \frac{L}{\omega \epsilon^2} (f(x_0) - f^*)$, we necessarily have $g_N^* \le \epsilon$.

Thus, we can use the value $\frac{L}{\omega \epsilon^2} (f(x_0) - f^*)$ as an upper complexity bound for our problem class. Comparing this estimate with the result of Theorem 1.1.2, we can see that it is much better; at least it does not depend on $n$. The lower complexity bound for the class (1.2.16) is not known. □


Let us check what can be said about the local convergence of the gradient method. Consider the unconstrained minimization problem
$$\min_{x \in R^n} f(x)$$
under the following assumptions:

1. $f \in C_M^{2,2}(R^n)$.

2. There exists a local minimum $x^*$ of function $f$ at which the Hessian is positive definite.

3. We know some bounds $0 < l \le L < \infty$ for the Hessian at $x^*$:
$$l I_n \preceq f''(x^*) \preceq L I_n. \qquad (1.2.17)$$

4. Our starting point $x_0$ is close enough to $x^*$.

Consider the process $x_{k+1} = x_k - h_k f'(x_k)$. Note that $f'(x^*) = 0$. Hence,
$$f'(x_k) = f'(x_k) - f'(x^*) = \int_0^1 f''(x^* + \tau(x_k - x^*))(x_k - x^*)\, d\tau = G_k (x_k - x^*),$$
where $G_k = \int_0^1 f''(x^* + \tau(x_k - x^*))\, d\tau$. Therefore
$$x_{k+1} - x^* = x_k - x^* - h_k G_k (x_k - x^*) = (I_n - h_k G_k)(x_k - x^*).$$

There is a standard technique for analyzing processes of this type, which is based on contracting mappings. Let a sequence $\{a_k\}$ be defined as follows:
$$a_0 \in R^n, \qquad a_{k+1} = A_k a_k,$$
where $A_k$ are $(n \times n)$ matrices such that $\| A_k \| \le 1 - q$ with $q \in (0, 1)$. Then we can estimate the rate of convergence of the sequence $\{a_k\}$ to zero:
$$\| a_{k+1} \| \le (1 - q) \| a_k \| \le (1 - q)^{k+1} \| a_0 \| \to 0.$$
In our case we need to estimate $\| I_n - h_k G_k \|$. Denote $r_k = \| x_k - x^* \|$. In view of Corollary 1.2.2, we have
$$f''(x^*) - \tau M r_k I_n \preceq f''(x^* + \tau(x_k - x^*)) \preceq f''(x^*) + \tau M r_k I_n.$$


Therefore, using assumption (1.2.17), we obtain
$$\left( l - \frac{r_k}{2} M \right) I_n \preceq G_k \preceq \left( L + \frac{r_k}{2} M \right) I_n.$$
Hence, $\left(1 - h_k \left(L + \frac{r_k}{2} M\right)\right) I_n \preceq I_n - h_k G_k \preceq \left(1 - h_k \left(l - \frac{r_k}{2} M\right)\right) I_n$ and we conclude that
$$\| I_n - h_k G_k \| \le \max\{ a_k(h_k), b_k(h_k) \}, \qquad (1.2.18)$$
where $a_k(h) = 1 - h \left( l - \frac{r_k}{2} M \right)$ and $b_k(h) = h \left( L + \frac{r_k}{2} M \right) - 1$.

Note that $a_k(0) = 1$ and $b_k(0) = -1$. Therefore, if $r_k < \bar{r} \equiv \frac{2l}{M}$, then $a_k(h)$ is a strictly decreasing function of $h$ and we can ensure $\| I_n - h_k G_k \| < 1$ for small enough $h_k$. In this case we will have $r_{k+1} < r_k$.

As usual, many step-size strategies are available. For example, we can choose $h_k = \frac{1}{L}$. Let us consider the "optimal" strategy consisting in minimizing the right-hand side of (1.2.18):
$$\max\{ a_k(h), b_k(h) \} \to \min_h.$$

Assume that $r_0 < \bar{r}$. Then, if we form the sequence $\{x_k\}$ using the optimal strategy, we can be sure that $r_{k+1} < r_k < \bar{r}$. Further, the optimal step size $h_k^*$ can be found from the equation
$$a_k(h) = b_k(h) \quad \Leftrightarrow \quad 1 - h \left( l - \frac{r_k}{2} M \right) = h \left( L + \frac{r_k}{2} M \right) - 1.$$
Hence
$$h_k^* = \frac{2}{L + l}. \qquad (1.2.19)$$
(Surprisingly enough, the optimal step does not depend on $M$.) Under this choice we obtain
$$r_{k+1} \le \frac{(L - l) r_k}{L + l} + \frac{M r_k^2}{L + l}.$$
Let us estimate the rate of convergence of the process. Denote $q = \frac{2l}{L+l}$ and $a_k = \frac{M}{L+l} r_k$ $(< q)$. Then
$$a_{k+1} \le (1 - q) a_k + a_k^2 = a_k (1 + (a_k - q)) = \frac{a_k (1 - (a_k - q)^2)}{1 - (a_k - q)} \le \frac{a_k}{1 + q - a_k}.$$
Therefore $\frac{1}{a_{k+1}} \ge \frac{1+q}{a_k} - 1$, or
$$\frac{q}{a_{k+1}} - 1 \ge (1 + q) \left( \frac{q}{a_k} - 1 \right) \ge (1 + q)^{k+1} \left( \frac{q}{a_0} - 1 \right) = (1 + q)^{k+1} \cdot \frac{\bar{r} - r_0}{r_0}.$$
Hence,
$$a_k \le \frac{q r_0}{r_0 + (1 + q)^k (\bar{r} - r_0)} \le \frac{q r_0}{\bar{r} - r_0} \left( \frac{1}{1 + q} \right)^k.$$
This proves the following theorem.


1.2.4 Let function f(x) satisfy our assumptions and let the
starting point xo be close enough to a local minimum:

THEOREM

ro

=II xo -

x*

II < f = i} .

Then the gradient method with step size (1.2.19} converges as follows:

II Xk- x* lls-; r~~o

(1- L~3l)k

This rate of convergence is called linear.

1.2.4 Newton method

The Newton method is widely known as a technique for finding a root of a function of one variable. Let $\phi(t): R \to R$. Consider the equation
$$\phi(t^*) = 0.$$
The Newton method is based on linear approximation. Assume that we have some $t$ close enough to $t^*$. Note that
$$\phi(t + \Delta t) = \phi(t) + \phi'(t) \Delta t + o(|\Delta t|).$$
Therefore the equation $\phi(t + \Delta t) = 0$ can be approximated by the following linear equation:
$$\phi(t) + \phi'(t) \Delta t = 0.$$
We can expect that the solution of this equation, the displacement $\Delta t$, is a good approximation to the optimal displacement $\Delta t^* = t^* - t$. Converting this idea into an algorithmic form, we get the process
$$t_{k+1} = t_k - \frac{\phi(t_k)}{\phi'(t_k)}.$$


This scheme can be naturally extended to the problem of finding a solution to a system of nonlinear equations,
$$F(x) = 0,$$
where $x \in R^n$ and $F(x): R^n \to R^n$. In this case we have to define the displacement $\Delta x$ as a solution to the following system of linear equations:
$$F(x) + F'(x) \Delta x = 0$$
(it is called the Newton system). If the Jacobian $F'(x)$ is nondegenerate, we can compute the displacement $\Delta x = -[F'(x)]^{-1} F(x)$. The corresponding iterative scheme looks as follows:
$$x_{k+1} = x_k - [F'(x_k)]^{-1} F(x_k).$$
Finally, in view of Theorem 1.2.1, we can replace the unconstrained minimization problem by a problem of finding roots of the nonlinear system
$$f'(x) = 0. \qquad (1.2.20)$$
(This replacement is not completely equivalent, but it works in nondegenerate situations.) Further, for solving (1.2.20) we can apply the standard Newton method for systems of nonlinear equations. In this case, the Newton system looks as follows:
$$f'(x) + f''(x) \Delta x = 0.$$
Hence, the Newton method for optimization problems appears to be in the form
$$x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k). \qquad (1.2.21)$$
Note that we can obtain the process (1.2.21) using the idea of quadratic approximation. Consider this approximation, computed with respect to the point $x_k$:
$$\phi(x) = f(x_k) + \langle f'(x_k), x - x_k \rangle + \tfrac{1}{2} \langle f''(x_k)(x - x_k), x - x_k \rangle.$$
Assume that $f''(x_k) \succ 0$. Then we can choose $x_{k+1}$ as the point of minimum of the quadratic function $\phi(x)$. This means that
$$\phi'(x_{k+1}) = f'(x_k) + f''(x_k)(x_{k+1} - x_k) = 0,$$
and we come again to the Newton process (1.2.21).


We will see that the convergence of the Newton method in a neighborhood of a strict local minimum is very fast. However, this method

34

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

has two serious drawbacks. Firstly, it can break down if f"(xk) is degenerate. Secondly, the Newton process can diverge. Let us look at the
following example.
EXAMPLE 1.2.4 Let us apply the Newton method to finding a root of the following function of one variable:
$$\phi(t) = \frac{t}{\sqrt{1 + t^2}}.$$
Clearly, $t^* = 0$. Note that
$$\phi'(t) = \frac{1}{(1 + t^2)^{3/2}}.$$
Therefore the Newton process looks as follows:
$$t_{k+1} = t_k - \frac{\phi(t_k)}{\phi'(t_k)} = t_k - \frac{t_k}{\sqrt{1 + t_k^2}} \cdot (1 + t_k^2)^{3/2} = -t_k^3.$$
Thus, if $|t_0| < 1$, then this method converges and the convergence is extremely fast. The points $\pm 1$ are the oscillation points of this method. If $|t_0| > 1$, then the method diverges. □
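A few lines of Python make this threshold behavior visible (a sketch for illustration; the starting values are arbitrary points on both sides of the boundary $|t_0| = 1$):

```python
# Newton's iteration for phi(t) = t / sqrt(1 + t^2) reduces to t_{k+1} = -t_k^3.
def newton_orbit(t0, steps=6):
    t, orbit = t0, [t0]
    for _ in range(steps):
        t = -t**3
        orbit.append(t)
    return orbit

print(newton_orbit(0.5))   # converges to 0 extremely fast
print(newton_orbit(1.0))   # oscillates between 1 and -1
print(newton_orbit(1.1))   # diverges
```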

In order to avoid a possible divergence, in practice we can apply a damped Newton method:
$$x_{k+1} = x_k - h_k [f''(x_k)]^{-1} f'(x_k),$$
where $h_k > 0$ is a step-size parameter. At the initial stage of the method we can use the same step-size strategies as for the gradient scheme. At the final stage it is reasonable to choose $h_k = 1$.
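The following Python sketch shows one possible implementation (our illustration: the simple halving rule for $h_k$ is only one of the admissible step-size strategies mentioned above):

```python
import numpy as np

def damped_newton(f, grad, hess, x0, iters=50):
    """Damped Newton sketch: try the full step h = 1, halve it until f decreases."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        p = np.linalg.solve(hess(x), grad(x))    # Newton direction
        h = 1.0
        while f(x - h * p) > f(x) and h > 1e-8:  # crude backtracking on h_k
            h *= 0.5
        x = x - h * p
    return x

# Example: minimize f(x) = sqrt(1 + x^2); the pure Newton method diverges
# from |x0| > 1 (Example 1.2.4), while the damped version converges.
f = lambda x: np.sqrt(1.0 + x[0]**2)
g = lambda x: np.array([x[0] / np.sqrt(1.0 + x[0]**2)])
H = lambda x: np.array([[(1.0 + x[0]**2) ** (-1.5)]])
print(damped_newton(f, g, H, [10.0]))            # -> approximately [0.]
```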
Let us study the local convergence of the Newton method. Consider the problem
$$\min_{x \in R^n} f(x)$$
under the following assumptions:

1. $f \in C_M^{2,2}(R^n)$.

2. There exists a local minimum $x^*$ of function $f$ with positive definite Hessian:
$$f''(x^*) \succeq l I_n, \qquad l > 0. \qquad (1.2.22)$$

3. Our starting point $x_0$ is close enough to $x^*$.


Consider the process $x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k)$. Then, using the same reasoning as for the gradient method, we obtain the following representation:
$$x_{k+1} - x^* = x_k - x^* - [f''(x_k)]^{-1} f'(x_k) = x_k - x^* - [f''(x_k)]^{-1} \int_0^1 f''(x^* + \tau(x_k - x^*))(x_k - x^*)\, d\tau = [f''(x_k)]^{-1} G_k (x_k - x^*),$$
where $G_k = \int_0^1 [f''(x_k) - f''(x^* + \tau(x_k - x^*))]\, d\tau$. Denote $r_k = \| x_k - x^* \|$. Then
$$\| G_k \| = \left\| \int_0^1 [f''(x_k) - f''(x^* + \tau(x_k - x^*))]\, d\tau \right\| \le \int_0^1 \| f''(x_k) - f''(x^* + \tau(x_k - x^*)) \|\, d\tau \le \int_0^1 M (1 - \tau) r_k\, d\tau = \frac{r_k}{2} M.$$
In view of Corollary 1.2.2 and (1.2.22), we have
$$f''(x_k) \succeq f''(x^*) - M r_k I_n \succeq (l - M r_k) I_n.$$
Therefore, if $r_k < \frac{l}{M}$, then $f''(x_k)$ is positive definite and $\| [f''(x_k)]^{-1} \| \le (l - M r_k)^{-1}$. Hence, for $r_k$ small enough ($r_k < \frac{2l}{3M}$), we have
$$r_{k+1} \le \frac{M r_k^2}{2 (l - M r_k)} < r_k.$$
The rate of convergence of this type is called quadratic.

Thus, we have proved the following theorem.
THEOREM 1.2.5 Let function $f(x)$ satisfy our assumptions. Suppose that the initial starting point $x_0$ is close enough to $x^*$:
$$\| x_0 - x^* \| < \bar{r} = \frac{2l}{3M}.$$
Then $\| x_k - x^* \| < \bar{r}$ for all $k$ and the Newton method converges quadratically:
$$\| x_{k+1} - x^* \| \le \frac{M \| x_k - x^* \|^2}{2 (l - M \| x_k - x^* \|)}.$$


Comparing this result with the rate of convergence of the gradient


method, we see that the Newton method is much faster. Surprisingly
enough, the region of quadratic convergence of the Newton method is
almost the same as the region of the linear convergence of the gradient
method. This justifies a standard recommendation to use the gradient
method only at the initial stage of the minimization process in order to
get close to a local minimum. The final job should be performed by the
Newton method.
In this section we have seen several examples of the convergence rate.
Let us make a correspondence between these rates and the complexity
bounds. As we have seen in Example 1.2.3, the upper bound for the
analytical complexity of a problem class is an inverse function of the
rate of convergence.
1. Sublinear rate. This rate is described in terms of a power function of the iteration counter. For example, we can have $r_k \le \frac{c}{\sqrt{k}}$. In this case the upper complexity bound of the corresponding problem class justified by this scheme is $\left( \frac{c}{\epsilon} \right)^2$.

Sublinear rate is rather slow. In terms of complexity, each new right digit of the answer takes an amount of computations comparable with the total amount of the previous work. Note also that the constant $c$ plays a significant role in the corresponding complexity estimate.

2. Linear rate. This rate is given in terms of an exponential function of the iteration counter. For example,
$$r_k \le c (1 - q)^k.$$
Note that the corresponding complexity bound is $\frac{1}{q} \left( \ln c + \ln \frac{1}{\epsilon} \right)$.

This rate is fast: Each new right digit of the answer takes a constant amount of computations. Moreover, the dependence of the complexity estimate on the constant $c$ is very weak.

3. Quadratic rate. This rate has the form of a double exponential function of the iteration counter. For example,
$$r_{k+1} \le c r_k^2.$$
The corresponding complexity estimate depends on the double logarithm of the desired accuracy: $\ln \ln \frac{1}{\epsilon}$.

This rate is extremely fast: Each iteration doubles the number of right digits in the answer. The constant $c$ is important only for the starting moment of the quadratic convergence ($c r_k < 1$).


1.3 First-order methods in nonlinear optimization

(Gradient method and Newton method: What is different? Idea of variable metric; Variable metric methods; Conjugate gradient methods; Constrained minimization: Penalty functions and penalty function methods; Barrier functions and barrier function methods.)

1.3.1 Gradient method and Newton method: What is different?

In the previous section we have considered two local methods for finding a local minimum in the simplest minimization problem
$$\min_{x \in R^n} f(x),$$
with $f \in C_L^{2,2}(R^n)$. Those are the gradient method
$$x_{k+1} = x_k - h_k f'(x_k), \qquad h_k > 0,$$
and the Newton method:
$$x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k).$$
Recall that the local rate of convergence of these methods is different. We have seen that the gradient method has a linear rate and the Newton method converges quadratically. What is the reason for this difference? If we look at the analytic form of these methods, we can see at least the following formal difference: In the gradient method the search direction is the antigradient, while in the Newton method we multiply the antigradient by some matrix, namely the inverse Hessian. Let us try to derive these directions using some "universal" reasoning.

Let us fix some $\bar{x} \in R^n$. Consider the following approximation of the function $f(x)$:
$$\phi_1(x) = f(\bar{x}) + \langle f'(\bar{x}), x - \bar{x} \rangle + \frac{1}{2h} \| x - \bar{x} \|^2,$$
where the parameter $h$ is positive. The first-order optimality condition provides us with the following equation for $x_1^*$, the unconstrained minimum of the function $\phi_1(x)$:
$$\phi_1'(x_1^*) = f'(\bar{x}) + \frac{1}{h} (x_1^* - \bar{x}) = 0.$$
Thus, $x_1^* = \bar{x} - h f'(\bar{x})$. That is exactly the iterate of the gradient method. Note that if $h \in (0, \frac{1}{L}]$, then the function $\phi_1(x)$ is a global upper approximation of $f(x)$:
$$f(x) \le \phi_1(x), \qquad \forall x \in R^n$$


(see Lemma 1.2.3). This fact is responsible for the global convergence of the gradient method.

Further, consider a quadratic approximation of function $f(x)$:
$$\phi_2(x) = f(\bar{x}) + \langle f'(\bar{x}), x - \bar{x} \rangle + \tfrac{1}{2} \langle f''(\bar{x})(x - \bar{x}), x - \bar{x} \rangle.$$
We have already seen that the minimum of this function is
$$x_2^* = \bar{x} - [f''(\bar{x})]^{-1} f'(\bar{x}),$$
and that is exactly the iterate of the Newton method.

Thus, we can try to use approximations of function $f(x)$ which are better than $\phi_1(x)$ and which are less expensive than $\phi_2(x)$. Let $G$ be a positive definite $n \times n$-matrix. Denote
$$\phi_G(x) = f(\bar{x}) + \langle f'(\bar{x}), x - \bar{x} \rangle + \tfrac{1}{2} \langle G(x - \bar{x}), x - \bar{x} \rangle.$$
Computing its minimum from the equation
$$\phi_G'(x_G^*) = f'(\bar{x}) + G(x_G^* - \bar{x}) = 0,$$
we obtain
$$x_G^* = \bar{x} - G^{-1} f'(\bar{x}). \qquad (1.3.1)$$
The first-order methods which form a sequence of matrices
$$\{G_k\}: \quad G_k \to f''(x^*)$$
(or $\{H_k\}$: $H_k \equiv G_k^{-1} \to [f''(x^*)]^{-1}$) are called variable metric methods. (Sometimes the name quasi-Newton methods is used.) In these methods only the gradients are involved in the process of generating the sequences $\{G_k\}$ or $\{H_k\}$.

The updating rule (1.3.1) is very common in optimization. Let us provide it with one more interpretation. Note that the gradient and the Hessian of a nonlinear function $f(x)$ are defined with respect to the standard Euclidean inner product on $R^n$:
$$\langle x, y \rangle = \sum_{i=1}^{n} x^{(i)} y^{(i)}, \qquad x, y \in R^n, \qquad \| x \| = \langle x, x \rangle^{1/2}.$$
Indeed, the definition of the gradient is
$$f(x + h) = f(x) + \langle f'(x), h \rangle + o(\| h \|),$$
and from this equation we derive its coordinate representation:
$$f'(x) = \left( \frac{\partial f(x)}{\partial x^{(1)}}, \dots, \frac{\partial f(x)}{\partial x^{(n)}} \right)^T.$$


Let us now introduce a new inner product. Consider a symmetric positive definite $n \times n$-matrix $A$. For $x, y \in R^n$ denote
$$\langle x, y \rangle_A = \langle Ax, y \rangle, \qquad \| x \|_A = \langle Ax, x \rangle^{1/2}.$$
The function $\| x \|_A$ is a new norm on $R^n$. Note that topologically this new metric is equivalent to the old one:
$$\lambda_n(A)^{1/2} \| x \| \le \| x \|_A \le \lambda_1(A)^{1/2} \| x \|,$$
where $\lambda_n(A)$ and $\lambda_1(A)$ are the smallest and the largest eigenvalues of the matrix $A$. However, the gradient and the Hessian, computed with respect to the new inner product, change:
$$f(x + h) = f(x) + \langle f'(x), h \rangle + \tfrac{1}{2} \langle f''(x) h, h \rangle + o(\| h \|) = f(x) + \langle A^{-1} f'(x), h \rangle_A + \tfrac{1}{2} \langle A^{-1} f''(x) h, h \rangle_A + o(\| h \|_A).$$
Hence, $f_A'(x) = A^{-1} f'(x)$ is the new gradient and $f_A''(x) = A^{-1} f''(x)$ is the new Hessian.

Thus, the direction used in the Newton method can be seen as the gradient computed with respect to the metric defined by $A = f''(x)$. Note that the Hessian of $f(x)$ at $x$ computed with respect to $A = f''(x)$ is $I_n$.
EXAMPLE 1.3.1 Consider the quadratic function
$$f(x) = \alpha + \langle a, x \rangle + \tfrac{1}{2} \langle Ax, x \rangle,$$
where $A = A^T \succ 0$. Note that $f'(x) = Ax + a$, $f''(x) = A$ and
$$f'(x^*) = Ax^* + a = 0$$
for $x^* = -A^{-1} a$. Let us compute the Newton direction at some $x \in R^n$:
$$d_N(x) = [f''(x)]^{-1} f'(x) = A^{-1}(Ax + a) = x + A^{-1} a.$$
Therefore for any $x \in R^n$ we have $x - d_N(x) = -A^{-1} a = x^*$. Thus, for a quadratic function the Newton method converges in one step. Note also that
$$f(x) = \alpha + \langle A^{-1} a, x \rangle_A + \tfrac{1}{2} \| x \|_A^2, \qquad f_A'(x) = A^{-1} f'(x) = d_N(x). \qquad \square$$


Let us write down a general scheme of the variable metric methods.

Variable metric method

0. Choose $x_0 \in R^n$. Set $H_0 = I_n$. Compute $f(x_0)$ and $f'(x_0)$.

1. $k$th iteration ($k \ge 0$).
   a) Set $p_k = H_k f'(x_k)$.
   b) Find $x_{k+1} = x_k - h_k p_k$ (see Section 1.2.3 for step-size rules).
   c) Compute $f(x_{k+1})$ and $f'(x_{k+1})$.
   d) Update the matrix $H_k$: $H_k \to H_{k+1}$.

The variable metric schemes differ one from another only in the implementation of Step 1d), which updates the matrix $H_k$. For that, they use the new information accumulated at Step 1c), namely the gradient $f'(x_{k+1})$. The idea is justified by the following property of a quadratic function. Let
$$f(x) = \alpha + \langle a, x \rangle + \tfrac{1}{2} \langle Ax, x \rangle, \qquad f'(x) = Ax + a.$$
Then, for any $x, y \in R^n$ we have $f'(x) - f'(y) = A(x - y)$. This identity explains the origin of the so-called quasi-Newton rule.

Quasi-Newton rule

Choose $H_{k+1}$ such that
$$H_{k+1} (f'(x_{k+1}) - f'(x_k)) = x_{k+1} - x_k.$$

Actually, there are many ways to satisfy this relation. Below we present several examples of the schemes that are usually recommended as the most efficient ones.

EXAMPLE 1.3.2 Denote $\Delta H_k = H_{k+1} - H_k$, $\gamma_k = f'(x_{k+1}) - f'(x_k)$ and $\delta_k = x_{k+1} - x_k$. Then the quasi-Newton relation is satisfied by the following rules.

1. Rank-one correction scheme:
$$\Delta H_k = \frac{(\delta_k - H_k \gamma_k)(\delta_k - H_k \gamma_k)^T}{\langle \delta_k - H_k \gamma_k, \gamma_k \rangle}.$$

2. Davidon-Fletcher-Powell scheme (DFP):
$$\Delta H_k = \frac{\delta_k \delta_k^T}{\langle \gamma_k, \delta_k \rangle} - \frac{H_k \gamma_k \gamma_k^T H_k}{\langle H_k \gamma_k, \gamma_k \rangle}.$$

3. Broyden-Fletcher-Goldfarb-Shanno scheme (BFGS):
$$\Delta H_k = \frac{H_k \gamma_k \delta_k^T + \delta_k \gamma_k^T H_k}{\langle H_k \gamma_k, \gamma_k \rangle} - \beta_k \frac{H_k \gamma_k \gamma_k^T H_k}{\langle H_k \gamma_k, \gamma_k \rangle},$$
where $\beta_k = 1 + \langle \gamma_k, \delta_k \rangle / \langle H_k \gamma_k, \gamma_k \rangle$.

Clearly, there are many other possibilities. From the computational point of view, BFGS is considered as the most stable scheme. □
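For illustration, here is a minimal Python sketch of the variable metric method with the BFGS update written above (a sketch only; the Armijo-style backtracking and the test problem are our own choices for the demo):

```python
import numpy as np

def bfgs(f, grad, x0, iters=50):
    """Variable metric method with the BFGS update of H_k written above."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                       # step 0: H_0 = I_n
    g = grad(x)
    for _ in range(iters):
        p = H @ g                            # a) p_k = H_k f'(x_k)
        h = 1.0
        while f(x - h * p) > f(x) - 1e-4 * h * (g @ p) and h > 1e-12:
            h *= 0.5                         # b) crude Armijo-type backtracking
        x_new = x - h * p
        g_new = grad(x_new)                  # c) new information
        delta, gamma = x_new - x, g_new - g
        Hg = H @ gamma
        beta = 1.0 + (gamma @ delta) / (Hg @ gamma)
        H = H + (np.outer(Hg, delta) + np.outer(delta, Hg)
                 - beta * np.outer(Hg, Hg)) / (Hg @ gamma)   # d) BFGS update
        x, g = x_new, g_new
    return x

# Example: a convex quadratic; the iterates approach x* = A^{-1} b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * (x @ A @ x) - b @ x
g = lambda x: A @ x - b
print(bfgs(f, g, [5.0, 5.0]), np.linalg.solve(A, b))
```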

Note that for quadratic functions the variable metric methods usually terminate in $n$ iterations. In a neighborhood of a strict minimum they have a superlinear rate of convergence: for any $x_0 \in R^n$ there exists a number $N$ such that for all $k \ge N$ we have
$$\| x_{k+1} - x^* \| \le \text{const} \cdot \| x_k - x^* \| \cdot \| x_{k-n} - x^* \|$$
(the proofs are very long and technical). As far as global convergence is concerned, these methods are not better than the gradient method (at least, from the theoretical point of view).

Note that in the variable metric schemes it is necessary to store and update a symmetric $n \times n$-matrix. Thus, each iteration needs $O(n^2)$ auxiliary arithmetic operations. For many years this feature was considered one of the main drawbacks of the variable metric methods. That stimulated the interest in so-called conjugate gradient schemes, which have a much lower complexity of each iteration (see Section 1.3.2). However, in view of the amazing growth of computer power in the last decades, these objections are not so important anymore.


1.3.2 Conjugate gradients

The conjugate gradient methods were initially proposed for minimizing a quadratic function. Consider the problem
$$\min_{x \in R^n} f(x), \qquad (1.3.2)$$
with $f(x) = \alpha + \langle a, x \rangle + \frac{1}{2} \langle Ax, x \rangle$ and $A = A^T \succ 0$. We have already seen that the solution of this problem is $x^* = -A^{-1} a$. Therefore, our objective function can be written in the following form:
$$f(x) = \alpha + \langle a, x \rangle + \tfrac{1}{2} \langle Ax, x \rangle = \alpha - \langle Ax^*, x \rangle + \tfrac{1}{2} \langle Ax, x \rangle = \alpha - \tfrac{1}{2} \langle Ax^*, x^* \rangle + \tfrac{1}{2} \langle A(x - x^*), x - x^* \rangle.$$
Thus, $f^* = \alpha - \frac{1}{2} \langle Ax^*, x^* \rangle$ and $f'(x) = A(x - x^*)$.

Suppose we are given a starting point $x_0$. Consider the linear Krylov subspaces
$$\mathcal{L}_k = \text{Lin}\, \{ A(x_0 - x^*), \dots, A^k(x_0 - x^*) \}, \qquad k \ge 1,$$
where $A^k$ is the $k$th power of matrix $A$. The sequence of points $\{x_k\}$ generated by the conjugate gradient method is defined as follows:
$$x_k = \arg\min \{ f(x) \mid x \in x_0 + \mathcal{L}_k \}, \qquad k \ge 1. \qquad (1.3.3)$$
This definition looks quite artificial. However, later we will see that this method can be written in a pure "algorithmic" form. We need representation (1.3.3) only for theoretical analysis.

LEMMA 1.3.1 For any $k \ge 1$ we have $\mathcal{L}_k = \text{Lin}\, \{ f'(x_0), \dots, f'(x_{k-1}) \}$.

Proof: For $k = 1$ the statement is true since $f'(x_0) = A(x_0 - x^*)$. Suppose that it is true for some $k \ge 1$. Then
$$x_k = x_0 + \sum_{i=1}^{k} \lambda^{(i)} A^i (x_0 - x^*)$$
with some $\lambda \in R^k$. Therefore
$$f'(x_k) = A(x_0 - x^*) + \sum_{i=1}^{k} \lambda^{(i)} A^{i+1} (x_0 - x^*) = y + \lambda^{(k)} A^{k+1} (x_0 - x^*)$$
for certain $y$ from $\mathcal{L}_k$. Thus,
$$\mathcal{L}_{k+1} = \text{Lin}\, \{ \mathcal{L}_k, A^{k+1}(x_0 - x^*) \} = \text{Lin}\, \{ \mathcal{L}_k, f'(x_k) \} = \text{Lin}\, \{ f'(x_0), \dots, f'(x_k) \}. \qquad \square$$


The next result helps to understand the behavior of the sequence $\{x_k\}$.

LEMMA 1.3.2 For any $k, i \ge 0$, $k \ne i$, we have $\langle f'(x_k), f'(x_i) \rangle = 0$.

Proof: Let $k > i$. Consider the function
$$\phi(\lambda) = f\left( x_0 + \sum_{j=1}^{k} \lambda^{(j)} f'(x_{j-1}) \right).$$
In view of Lemma 1.3.1, for some $\lambda^*$ we have $x_k = x_0 + \sum_{j=1}^{k} \lambda_*^{(j)} f'(x_{j-1})$. However, by definition, $x_k$ is the point of minimum of $f(x)$ on $\mathcal{L}_k$. Therefore $\phi'(\lambda^*) = 0$. It remains to compute the components of the gradient:
$$\frac{\partial \phi(\lambda^*)}{\partial \lambda^{(i+1)}} = \langle f'(x_k), f'(x_i) \rangle = 0. \qquad \square$$

COROLLARY 1.3.1 The sequence generated by the conjugate gradient method for (1.3.2) is finite.

Proof: The number of orthogonal directions in $R^n$ cannot exceed $n$. □

COROLLARY 1.3.2 For any $p \in \mathcal{L}_k$ we have $\langle f'(x_k), p \rangle = 0$.

The last auxiliary result explains the name of the method. Denote $\delta_i = x_{i+1} - x_i$. It is clear that $\mathcal{L}_k = \text{Lin}\, \{ \delta_0, \dots, \delta_{k-1} \}$.

LEMMA 1.3.3 For any $k \ne i$ we have $\langle A \delta_k, \delta_i \rangle = 0$. (Such directions are called conjugate with respect to $A$.)


Proof: Without loss of generality we can assume that $k > i$. Then
$$\langle A \delta_k, \delta_i \rangle = \langle A(x_{k+1} - x_k), \delta_i \rangle = \langle f'(x_{k+1}) - f'(x_k), \delta_i \rangle = 0,$$
since $\delta_i \in \mathcal{L}_{i+1} \subseteq \mathcal{L}_k$ (Corollary 1.3.2). □

Let us show how we can write down the conjugate gradient method in a more algorithmic form. Since $\mathcal{L}_k = \text{Lin}\, \{ \delta_0, \dots, \delta_{k-1} \}$, we can represent $x_{k+1}$ as follows:
$$x_{k+1} = x_k - h_k f'(x_k) + \sum_{j=0}^{k-1} \lambda^{(j)} \delta_j.$$


In our notation that is
$$\delta_k = -h_k f'(x_k) + \sum_{j=0}^{k-1} \lambda^{(j)} \delta_j. \qquad (1.3.4)$$
Let us compute the coefficients of this representation. Multiplying (1.3.4) by $A$ and $\delta_i$, $0 \le i \le k - 1$, and using Lemma 1.3.3 we obtain
$$0 = \langle A \delta_k, \delta_i \rangle = -h_k \langle A f'(x_k), \delta_i \rangle + \lambda^{(i)} \langle A \delta_i, \delta_i \rangle = -h_k \langle f'(x_k), f'(x_{i+1}) - f'(x_i) \rangle + \lambda^{(i)} \langle A \delta_i, \delta_i \rangle.$$
Hence, in view of Lemma 1.3.2, $\lambda^{(i)} = 0$ for $i < k - 1$. For $i = k - 1$ we have
$$\lambda^{(k-1)} = h_k \frac{\| f'(x_k) \|^2}{\langle A \delta_{k-1}, \delta_{k-1} \rangle}.$$
Therefore $x_{k+1} = x_k - h_k p_k$, where
$$p_k = f'(x_k) - \frac{\lambda^{(k-1)}}{h_k} \delta_{k-1} = f'(x_k) - \beta_k p_{k-1},$$
since $\delta_{k-1} = -h_{k-1} p_{k-1}$ by the definition of $\{p_k\}$.

Note that we managed to write down the conjugate gradient scheme in terms of the gradients of the objective function $f(x)$. This provides us with the possibility to apply this scheme formally for minimizing a general nonlinear function. Of course, such an extension destroys all properties of the process which are specific for quadratic functions. However, in a neighborhood of a strict local minimum the objective function is close to quadratic. Therefore asymptotically this method can be fast.


Let us present a general scheme of the conjugate gradient method for minimizing a nonlinear function.

Conjugate gradient method

0. Let $x_0 \in R^n$. Compute $f(x_0)$, $f'(x_0)$. Set $p_0 = f'(x_0)$.

1. $k$th iteration ($k \ge 0$).
   a) Find $x_{k+1} = x_k - h_k p_k$ (by "exact" line search).
   b) Compute $f(x_{k+1})$ and $f'(x_{k+1})$.
   c) Compute the coefficient $\beta_k$.
   d) Set $p_{k+1} = f'(x_{k+1}) - \beta_k p_k$.

In this scheme we did not yet specify the coefficient $\beta_k$. In fact, there are many different formulas for this coefficient. All of them give the same result on quadratic functions, but in the general nonlinear case they generate different sequences. Let us present three of the most popular ones.
1. Fletcher-Reeves: $\beta_k = -\dfrac{\| f'(x_{k+1}) \|^2}{\| f'(x_k) \|^2}$.

2. Hestenes-Stiefel: $\beta_k = \dfrac{\langle f'(x_{k+1}), f'(x_{k+1}) - f'(x_k) \rangle}{\langle f'(x_{k+1}) - f'(x_k), p_k \rangle}$.

3. Polak-Ribière: $\beta_k = -\dfrac{\langle f'(x_{k+1}), f'(x_{k+1}) - f'(x_k) \rangle}{\| f'(x_k) \|^2}$.

Recall that in the quadratic case the conjugate gradient method terminates in $n$ iterations (or less). Algorithmically, this means that $p_{n+1} = 0$. In the nonlinear case that is not true. However, after $n$ iterations this direction loses any interpretation. Therefore, in all practical schemes there exists a restarting strategy, which at some moment sets $\beta_k = 0$ (usually after every $n$ iterations). This ensures the global convergence of the scheme (since we have a usual gradient step just after the restart and all other iterations decrease the function value). In a neighborhood of a strict minimum the conjugate gradient schemes have a local $n$-step quadratic convergence:
$$\| x_{n+1} - x^* \| \le \text{const} \cdot \| x_0 - x^* \|^2.$$
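A compact Python sketch of this scheme (our illustration: it uses the Polak-Ribière coefficient from the list above, a crude sampled line search in place of the "exact" one, and a restart every $n$ iterations):

```python
import numpy as np

def conjugate_gradient(f, grad, x0, iters=20):
    """Nonlinear CG: Polak-Ribiere coefficient, restart every n iterations."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    g = grad(x)
    p = g.copy()                              # p_0 = f'(x_0)
    for k in range(iters):
        hs = np.linspace(1e-4, 1.0, 400)      # crude stand-in for exact line search
        h = hs[np.argmin([f(x - h * p) for h in hs])]
        x_new = x - h * p
        g_new = grad(x_new)
        if (k + 1) % n == 0:
            beta = 0.0                        # restart: a pure gradient step follows
        else:                                 # Polak-Ribiere (item 3 above)
            beta = -(g_new @ (g_new - g)) / (g @ g)
        p = g_new - beta * p                  # step d) of the scheme
        x, g = x_new, g_new
    return x

# Example: a convex quadratic, where CG terminates in about n steps
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
f = lambda x: 0.5 * (x @ A @ x) - b @ x
g = lambda x: A @ x - b
print(conjugate_gradient(f, g, [4.0, -3.0]), np.linalg.solve(A, b))
```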


Note that this local convergence is slower than that of the variable metric methods. However, the conjugate gradient schemes have the advantage of a very cheap iteration. As far as global convergence is concerned, the conjugate gradient method, in general, is not better than the gradient method.

1.3.3 Constrained minimization

Let us discuss briefly the main ideas underlying the methods of general constrained minimization. The problem we deal with is as follows:
$$f_0(x) \to \min, \qquad f_i(x) \le 0, \quad i = 1, \dots, m, \qquad (1.3.5)$$
where $f_i(x)$ are smooth functions. For example, we can consider $f_i(x)$ from $C_L^{1,1}(R^n)$.

Since the components of the problem (1.3.5) are general nonlinear functions, we cannot expect this problem to be easier than an unconstrained minimization problem. Indeed, even the standard difficulties with stationary points, which we have in unconstrained minimization, appear in (1.3.5) in a much stronger form. Note that a stationary point of this problem (whatever it is) can be infeasible for the system of functional constraints. Hence, any minimization scheme attracted by such a point should accept that it fails even to find a feasible solution to (1.3.5). Therefore, the following reasoning looks quite convincing.

1. We have efficient methods for unconstrained minimization. (?)²

2. Unconstrained minimization is simpler than constrained minimization. (?)³

3. Therefore, let us try to approximate a solution to the problem (1.3.5) by a sequence of solutions to some auxiliary unconstrained minimization problems.

This philosophy is implemented by the schemes of Sequential Unconstrained Minimization. There are two main groups of such methods: the penalty function methods and the barrier methods. Let us describe the basic ideas of these approaches.

² In fact, that is not absolutely true. We will see that, in order to apply an unconstrained minimization method for solving constrained problems, we need to be able to find at least a strict local minimum. And we have already seen (Example 1.2.2) that this could pose a problem.

³ We are not going to discuss the correctness of this statement for general nonlinear problems. We just prevent the reader from extending it onto other problem classes. In the next chapters we will have a possibility to see that this statement is not always true.


We start with penalty function methods.

DEFINITION 1.3.1 A continuous function $\Phi(x)$ is called a penalty function for a closed set $Q$ if
$$\Phi(x) = 0 \ \text{for any } x \in Q, \qquad \Phi(x) > 0 \ \text{for any } x \notin Q.$$
Sometimes a penalty function is called just a penalty.

The main property of penalty functions is as follows: If $\Phi_1(x)$ is a penalty for $Q_1$ and $\Phi_2(x)$ is a penalty for $Q_2$, then $\Phi_1(x) + \Phi_2(x)$ is a penalty for the intersection $Q_1 \cap Q_2$.

Let us give several examples of such functions.


EXAMPLE 1.3.3 Denote $(a)_+ = \max\{a, 0\}$. Let
$$Q = \{ x \in R^n \mid f_i(x) \le 0, \ i = 1, \dots, m \}.$$
Then the following functions are penalties for $Q$:

1. Quadratic penalty: $\Phi(x) = \sum_{i=1}^{m} (f_i(x))_+^2$.

2. Nonsmooth penalty: $\Phi(x) = \sum_{i=1}^{m} (f_i(x))_+$.

The reader can easily continue the list. □

The general scheme of a penalty function method is as follows.

Penalty function method

0. Choose $x_0 \in R^n$. Choose a sequence of penalty coefficients: $0 < t_k < t_{k+1}$ and $t_k \to \infty$.

1. $k$th iteration ($k \ge 0$). Find a point
$$x_{k+1} = \arg\min_{x \in R^n} \{ f_0(x) + t_k \Phi(x) \}$$
using $x_k$ as a starting point.
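Here is a minimal Python sketch of this scheme (our own toy instance: a quadratic objective with one linear constraint, the quadratic penalty from Example 1.3.3, and the inner problems solved by plain gradient descent with warm starts; all of these choices are assumptions made for the demo):

```python
import numpy as np

# Toy instance of (1.3.5): minimize f0(x) = ||x||^2 s.t. f1(x) = 1 - x[0] <= 0.
# Solution: x* = (1, 0). Quadratic penalty (Example 1.3.3): Phi(x) = (f1(x))_+^2.
f1 = lambda x: 1.0 - x[0]

def psi_grad(x, t):                     # gradient of Psi_k(x) = f0(x) + t*Phi(x)
    g = 2.0 * x
    if f1(x) > 0.0:
        g = g + t * 2.0 * f1(x) * np.array([-1.0, 0.0])
    return g

x = np.array([0.0, 0.0])
for t in [1.0, 10.0, 100.0, 1000.0]:    # increasing penalty coefficients t_k
    for _ in range(2000):               # inner problem: gradient descent, warm start
        x = x - 0.4 / (1.0 + t) * psi_grad(x, t)
print(x)   # the minimizers (t/(1+t), 0) approach the solution (1, 0)
```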


It is easy to prove the convergence of this scheme assuming that $x_{k+1}$ is a global minimum of the auxiliary function.⁴ Denote
$$\Psi_k(x) = f_0(x) + t_k \Phi(x), \qquad \Psi_k^* = \Psi_k(x_{k+1})$$
($\Psi_k^*$ is the global optimal value of $\Psi_k(x)$). Denote by $x^*$ the global solution to (1.3.5).

THEOREM 1.3.1 Let there exist a value $\bar{t} > 0$ such that the set
$$S = \{ x \in R^n \mid f_0(x) + \bar{t}\, \Phi(x) \le f_0(x^*) \}$$
is bounded. Then
$$\lim_{k \to \infty} f_0(x_k) = f_0(x^*), \qquad \lim_{k \to \infty} \Phi(x_k) = 0.$$

Proof: Note that $\Psi_k^* \le \Psi_k(x^*) = f_0(x^*)$. At the same time, for any $x \in R^n$ we have $\Psi_{k+1}(x) \ge \Psi_k(x)$. Therefore $\Psi_{k+1}^* \ge \Psi_k^*$. Thus, there exists a limit $\lim_{k \to \infty} \Psi_k^* \equiv \Psi^* \le f_0(x^*)$. If $t_k > \bar{t}$, then
$$f_0(x_{k+1}) + \bar{t}\, \Phi(x_{k+1}) \le f_0(x_{k+1}) + t_k \Phi(x_{k+1}) \le f_0(x^*),$$
that is, $x_{k+1} \in S$. Therefore, the sequence $\{x_k\}$ has limit points. Since $\lim_{k \to \infty} t_k = +\infty$, for any such point $x_*$ we have $\Phi(x_*) = 0$ and $f_0(x_*) \le f_0(x^*)$. Thus $x_* \in Q$ and
$$\Psi^* = f_0(x_*) + \Phi(x_*) = f_0(x_*) = f_0(x^*). \qquad \square$$

Note that this result is very general, but not too informative. There are still many questions which should be answered. For example, we do not know what kind of penalty function we should use. What should be the rules for choosing the penalty coefficients? What should be the accuracy for solving the auxiliary problems? The main feature of these questions is that they can hardly be addressed in the framework of general nonlinear optimization theory. Traditionally, they are considered as questions to be answered by computational practice.

Let us look at the barrier methods.
DEFINITION 1.3.2 Let $Q$ be a closed set with nonempty interior. A continuous function $F(x)$ is called a barrier function for $Q$ if $F(x) \to \infty$ when $x$ approaches the boundary of $Q$.

⁴ If we assume that it is a strict local minimum, then the result is much weaker.


Sometimes a barrier function is called a barrier for short. Similarly to the penalty functions, the barriers possess the following property: If $F_1(x)$ is a barrier for $Q_1$ and $F_2(x)$ is a barrier for $Q_2$, then $F_1(x) + F_2(x)$ is a barrier for the intersection $Q_1 \cap Q_2$.

In order to apply the barrier approach, the problem (1.3.5) must satisfy the Slater condition:
$$\exists \bar{x}: \quad f_i(\bar{x}) < 0, \quad i = 1, \dots, m.$$
Let us look at some examples of barrier functions.

EXAMPLE 1.3.4 Let $Q = \{ x \in R^n \mid f_i(x) \le 0, \ i = 1, \dots, m \}$. Then all functions below are barriers for $Q$:

1. Power-function barrier: $F(x) = \sum_{i=1}^{m} \dfrac{1}{(-f_i(x))^p}$, $p \ge 1$.

2. Logarithmic barrier: $F(x) = -\sum_{i=1}^{m} \ln(-f_i(x))$.

3. Exponential barrier: $F(x) = \sum_{i=1}^{m} \exp\left( \dfrac{1}{-f_i(x)} \right)$.

The reader can easily extend this list. □

The scheme of a barrier method is as follows.

Barrier function method

0. Choose $x_0 \in \text{int}\, Q$. Choose a sequence of penalty coefficients: $0 < t_k < t_{k+1}$ and $t_k \to \infty$.

1. $k$th iteration ($k \ge 0$). Find a point
$$x_{k+1} = \arg\min_{x \in Q} \left\{ f_0(x) + \frac{1}{t_k} F(x) \right\}$$
using $x_k$ as a starting point.

Let us prove the convergence of this method assuming that $x_{k+1}$ is a global minimum of the auxiliary function. Denote
$$\Psi_k(x) = f_0(x) + \frac{1}{t_k} F(x), \qquad \Psi_k^* = \Psi_k(x_{k+1})$$
($\Psi_k^*$ is the global optimal value of $\Psi_k(x)$). And let $f^*$ be the optimal value of the problem (1.3.5).
value of the problern (1.3.5).
THEOREM 1.3.2 Let barrier $F(x)$ be bounded below on $Q$. Then
$$\lim_{k \to \infty} \Psi_k^* = f^*.$$

Proof: Let $F(x) \ge F^*$ for all $x \in Q$. For arbitrary $\bar{x} \in \text{int}\, Q$ we have
$$\limsup_{k \to \infty} \Psi_k^* \le \lim_{k \to \infty} \left[ f_0(\bar{x}) + \frac{1}{t_k} F(\bar{x}) \right] = f_0(\bar{x}).$$
Therefore $\limsup_{k \to \infty} \Psi_k^* \le f^*$. On the other hand,
$$\Psi_k^* = \min_{x \in Q} \left\{ f_0(x) + \frac{1}{t_k} F(x) \right\} \ge \min_{x \in Q} \left\{ f_0(x) + \frac{1}{t_k} F^* \right\} = f^* + \frac{1}{t_k} F^*.$$
Thus, $\lim_{k \to \infty} \Psi_k^* = f^*$. □

As with the penalty function method, there are many questions to be answered. We do not know how to find the starting point $x_0$ and how to choose the best barrier function. We do not know the rules for updating the penalty coefficients and the acceptable accuracy of the solutions to the auxiliary problems. Finally, we have no idea about the efficiency estimates of this process. And the reason is not in the lack of theory. Our problem (1.3.5) is just too complicated. We will see that all of the above questions get precise answers in the framework of convex optimization.

We have finished our brief presentation of general nonlinear optimization. It was really very short and there are many interesting theoretical topics that we did not mention. That is because the main goal of this book is to describe the areas of optimization in which we can obtain clear and complete results on the performance of numerical methods. Unfortunately, general nonlinear optimization is just too complicated to fit this goal. However, it is impossible to skip this field since a lot of the basic ideas underlying convex optimization methods have their origin in general nonlinear optimization theory. The gradient method and the Newton method, sequential unconstrained minimization and barrier functions were originally developed and used for general optimization problems. But only the framework of convex optimization allows these ideas to reach their real power. In the next chapters of this book we will see many examples of the second birth of these old ideas.

Chapter 2

SMOOTH CONVEX OPTIMIZATION

2.1 Minimization of smooth functions

(Smooth convex functions; Lower complexity bounds for $\mathcal{F}_L^{\infty,1}(R^n)$; Strongly convex functions; Lower complexity bounds for $\mathcal{S}_{\mu,L}^{\infty,1}(R^n)$; Gradient method.)

2.1.1 Smooth convex functions

In this section we deal with the unconstrained minimization problem
$$\min_{x \in R^n} f(x), \qquad (2.1.1)$$
where the function $f(x)$ is smooth enough. Recall that in the previous chapter we were trying to solve this problem under very weak assumptions on function $f$. And we have seen that in this general situation we cannot do too much: It is impossible to guarantee convergence even to a local minimum, impossible to get acceptable bounds on the global performance of minimization schemes, etc. Let us try to introduce some reasonable assumptions on function $f$ to make our problem more tractable. For that, let us try to determine the desired properties of a class of differentiable functions $\mathcal{F}$ we want to work with.

From the results of the previous chapter we can get the impression that the main reason for our troubles is the weakness of the first-order optimality condition (Theorem 1.2.1). Indeed, we have seen that, in general, the gradient method converges only to a stationary point of function $f$ (see inequality (1.2.15) and Example 1.2.2). Therefore the first additional property we definitely need is as follows.

ASSUMPTION 2.1.1 For any $f \in \mathcal{F}$ the first-order optimality condition is sufficient for a point to be a global solution to (2.1.1).


Further, the main feature of any tractable functional class $\mathcal{F}$ is the possibility to verify the inclusion $f \in \mathcal{F}$ in a simple way. Usually that is ensured by a set of basic elements of the class and by a list of possible operations with elements of $\mathcal{F}$ which keep the result in the class (such operations are called invariant). An excellent example is the class of differentiable functions: In order to check whether a function is differentiable or not, we need just to look at its analytical expression.

We do not want to restrict our class too much. Therefore, let us introduce only one invariant operation for our hypothetical class $\mathcal{F}$.

ASSUMPTION 2.1.2 If $f_1, f_2 \in \mathcal{F}$ and $\alpha, \beta \ge 0$, then $\alpha f_1 + \beta f_2 \in \mathcal{F}$.

The reason for the restriction on the sign of the coefficients in this assumption is evident: We would like to see $x^2$ in our class, but the function $-x^2$ is not suitable for our goals.

Finally, let us add to $\mathcal{F}$ some basic elements.

ASSUMPTION 2.1.3 Any linear function $f(x) = \alpha + \langle a, x \rangle$ belongs to $\mathcal{F}$.¹

Note that the linear function $f(x)$ perfectly fits Assumption 2.1.1. Indeed, $f'(x) = 0$ implies that this function is constant and any point in $R^n$ is its global minimum.
Rn is its global minimum.
It turns out that we have assumed enough to specify our functional class. Consider $f \in \mathcal{F}$. Let us fix some $x_0 \in R^n$ and consider the function
$$\phi(y) = f(y) - \langle f'(x_0), y \rangle.$$
Then $\phi \in \mathcal{F}$ in view of Assumptions 2.1.2 and 2.1.3. Note that
$$\phi'(y) \big|_{y = x_0} = f'(x_0) - f'(x_0) = 0.$$
Therefore, in view of Assumption 2.1.1, $x_0$ is the global minimum of function $\phi$ and for any $y \in R^n$ we have
$$\phi(y) \ge \phi(x_0) = f(x_0) - \langle f'(x_0), x_0 \rangle.$$
Hence, $f(y) \ge f(x_0) + \langle f'(x_0), y - x_0 \rangle$.

This inequality is very well known in optimization. It defines the class of differentiable convex functions.

DEFINITION 2.1.1 A continuously differentiable function $f(x)$ is called convex on $R^n$ (notation $f \in \mathcal{F}^1(R^n)$) if for any $x, y \in R^n$ we have
$$f(y) \ge f(x) + \langle f'(x), y - x \rangle. \qquad (2.1.2)$$

¹ This is not a description of the whole set of basic elements. We just say that we want to have linear functions in our class.


If $-f(x)$ is convex, we call $f(x)$ concave.

In what follows we also consider the classes of convex functions $\mathcal{F}_L^{k,l}(Q)$ with the same meaning of the indices as for the classes $C_L^{k,l}(Q)$.

Let us check our assumptions, which now become the properties of our functional class.

THEOREM 2.1.1 If $f \in \mathcal{F}^1(R^n)$ and $f'(x^*) = 0$ then $x^*$ is the global minimum of $f(x)$ on $R^n$.

Proof: In view of inequality (2.1.2), for any $x \in R^n$ we have
$$f(x) \ge f(x^*) + \langle f'(x^*), x - x^* \rangle = f(x^*). \qquad \square$$

Thus, we get what we want in Assumption 2.1.1. Let us check Assumption 2.1.2.

LEMMA 2.1.1 If $f_1$ and $f_2$ belong to $\mathcal{F}^1(R^n)$ and $\alpha, \beta \ge 0$, then the function $f = \alpha f_1 + \beta f_2$ also belongs to $\mathcal{F}^1(R^n)$.

Proof: For any $x, y \in R^n$ we have
$$f_1(y) \ge f_1(x) + \langle f_1'(x), y - x \rangle, \qquad f_2(y) \ge f_2(x) + \langle f_2'(x), y - x \rangle.$$
It remains to multiply the first inequality by $\alpha$, the second one by $\beta$, and add the results. □

Thus, for differentiable functions our hypothetical class coincides with the class of convex functions. Let us present their main properties.

The next statement significantly increases our possibilities in constructing convex functions.

LEMMA 2.1.2 If $f \in \mathcal{F}^1(R^m)$, $b \in R^m$ and $A$ is an $m \times n$ matrix, then
$$\phi(x) = f(Ax + b) \in \mathcal{F}^1(R^n).$$


Proof: Indeed, let $x, y \in R^n$. Denote $\bar{x} = Ax + b$, $\bar{y} = Ay + b$. Since $\phi'(x) = A^T f'(Ax + b)$, we have
$$\phi(y) = f(\bar{y}) \ge f(\bar{x}) + \langle f'(\bar{x}), \bar{y} - \bar{x} \rangle = \phi(x) + \langle f'(\bar{x}), A(y - x) \rangle = \phi(x) + \langle A^T f'(\bar{x}), y - x \rangle = \phi(x) + \langle \phi'(x), y - x \rangle. \qquad \square$$

In order to simplify the verification of the inclusion $f \in \mathcal{F}^1(R^n)$, we provide this class with several equivalent definitions.

THEOREM 2.1.2 A continuously differentiable function $f$ belongs to the class $\mathcal{F}^1(R^n)$ if and only if for any $x, y \in R^n$ and $\alpha \in [0, 1]$ we have²
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y). \qquad (2.1.3)$$

Proof: Denote $x_\alpha = \alpha x + (1 - \alpha) y$. Let $f \in \mathcal{F}^1(R^n)$. Then
$$f(x_\alpha) \le f(y) - \langle f'(x_\alpha), y - x_\alpha \rangle = f(y) - \alpha \langle f'(x_\alpha), y - x \rangle,$$
$$f(x_\alpha) \le f(x) - \langle f'(x_\alpha), x - x_\alpha \rangle = f(x) + (1 - \alpha) \langle f'(x_\alpha), y - x \rangle.$$
Multiplying the first inequality by $(1 - \alpha)$, the second one by $\alpha$ and adding the results, we get (2.1.3).

Let (2.1.3) be true for all $x, y \in R^n$ and $\alpha \in [0, 1]$. Let us choose some $\alpha \in [0, 1)$. Then
$$f(y) \ge \frac{1}{1 - \alpha} [f(x_\alpha) - \alpha f(x)] = f(x) + \frac{1}{1 - \alpha} [f(x_\alpha) - f(x)] = f(x) + \frac{1}{1 - \alpha} [f(x + (1 - \alpha)(y - x)) - f(x)].$$
Tending $\alpha$ to 1, we get (2.1.2). □


THEOREM 2.1.3 A continuously differentiable function $f$ belongs to the class $\mathcal{F}^1(R^n)$ if and only if for any $x, y \in R^n$ we have
$$\langle f'(x) - f'(y), x - y \rangle \ge 0. \qquad (2.1.4)$$

² Note that inequality (2.1.3), without the assumption of differentiability of $f$, serves as a definition of general convex functions. We will study these functions in detail in the next chapter.


Proof: Let $f$ be a convex continuously differentiable function. Then
$$f(x) \ge f(y) + \langle f'(y), x - y \rangle, \qquad f(y) \ge f(x) + \langle f'(x), y - x \rangle.$$
Adding these inequalities, we get (2.1.4).

Let (2.1.4) hold for all $x, y \in R^n$. Denote $x_\tau = x + \tau(y - x)$. Then
$$f(y) = f(x) + \int_0^1 \langle f'(x + \tau(y - x)), y - x \rangle\, d\tau = f(x) + \langle f'(x), y - x \rangle + \int_0^1 \langle f'(x_\tau) - f'(x), y - x \rangle\, d\tau = f(x) + \langle f'(x), y - x \rangle + \int_0^1 \tfrac{1}{\tau} \langle f'(x_\tau) - f'(x), x_\tau - x \rangle\, d\tau \ge f(x) + \langle f'(x), y - x \rangle. \qquad \square$$

Sometimes it is more convenient to work with functions from the class $\mathcal{F}^2(R^n) \subset \mathcal{F}^1(R^n)$.

THEOREM 2.1.4 A twice continuously differentiable function $f$ belongs to $\mathcal{F}^2(R^n)$ if and only if for any $x \in R^n$ we have
$$f''(x) \succeq 0. \qquad (2.1.5)$$

Proof: Let $f \in C^2(R^n)$ be convex. Denote $x_\tau = x + \tau s$, $\tau > 0$. Then, in view of (2.1.4), we have
$$0 \le \frac{1}{\tau^2} \langle f'(x_\tau) - f'(x), x_\tau - x \rangle = \frac{1}{\tau} \langle f'(x_\tau) - f'(x), s \rangle = \frac{1}{\tau} \int_0^\tau \langle f''(x + \lambda s) s, s \rangle\, d\lambda,$$
and we get (2.1.5) by tending $\tau$ to zero.

Let (2.1.5) hold for all $x \in R^n$. Then
$$f(y) = f(x) + \langle f'(x), y - x \rangle + \int_0^1 \int_0^\tau \langle f''(x + \lambda(y - x))(y - x), y - x \rangle\, d\lambda\, d\tau \ge f(x) + \langle f'(x), y - x \rangle. \qquad \square$$


Let us look at some examples of differentiable convex functions.

EXAMPLE 2.1.1
1. Linear function $f(x) = \alpha + \langle a, x \rangle$ is convex.

2. Let a matrix $A$ be symmetric and positive semidefinite. Then the quadratic function
$$f(x) = \alpha + \langle a, x \rangle + \tfrac{1}{2} \langle Ax, x \rangle$$
is convex (since $f''(x) = A \succeq 0$).

3. The following functions of one variable belong to $\mathcal{F}^1(R)$:
$$f(x) = e^x, \qquad f(x) = |x|^p, \quad p > 1, \qquad f(x) = \frac{x^2}{1 - |x|}, \qquad f(x) = |x| - \ln(1 + |x|).$$
We can check that using Theorem 2.1.4.

Therefore, the function arising in geometric optimization,
$$f(x) = \sum_{i=1}^{m} e^{\alpha_i + \langle a_i, x \rangle},$$
is convex (see Lemma 2.1.2). Similarly, the function arising in $l_p$-norm approximation problems,
$$f(x) = \sum_{i=1}^{m} | \langle a_i, x \rangle - b_i |^p,$$
is convex too. □

As with general nonlinear functions, differentiability itself cannot ensure any special topological properties of convex functions. Therefore we need to consider problem classes with Lipschitz continuous derivatives of a certain order. The most important class of that type is $\mathcal{F}_L^{1,1}(R^n)$, the class of convex functions with Lipschitz continuous gradient. Let us provide it with several necessary and sufficient conditions.

THEOREM 2.1.5 All conditions below, holding for all $x, y \in R^n$ and $\alpha$ from $[0, 1]$, are equivalent to the inclusion $f \in \mathcal{F}_L^{1,1}(R^n)$:
$$0 \le f(y) - f(x) - \langle f'(x), y - x \rangle \le \frac{L}{2} \| x - y \|^2, \qquad (2.1.6)$$


$$f(x) + \langle f'(x), y - x \rangle + \frac{1}{2L} \| f'(x) - f'(y) \|^2 \le f(y), \qquad (2.1.7)$$
$$\frac{1}{L} \| f'(x) - f'(y) \|^2 \le \langle f'(x) - f'(y), x - y \rangle, \qquad (2.1.8)$$
$$\langle f'(x) - f'(y), x - y \rangle \le L \| x - y \|^2, \qquad (2.1.9)$$
$$\alpha f(x) + (1 - \alpha) f(y) \ge f(\alpha x + (1 - \alpha) y) + \frac{\alpha (1 - \alpha)}{2L} \| f'(x) - f'(y) \|^2, \qquad (2.1.10)$$
$$\alpha f(x) + (1 - \alpha) f(y) \le f(\alpha x + (1 - \alpha) y) + \alpha (1 - \alpha) \frac{L}{2} \| x - y \|^2. \qquad (2.1.11)$$

Proof: Indeed, (2.1.6) follows from the definition of convex functions and Lemma 1.2.3. Further, let us fix $x_0 \in R^n$. Consider the function
$$\phi(y) = f(y) - \langle f'(x_0), y \rangle.$$
Note that $\phi \in \mathcal{F}_L^{1,1}(R^n)$ and its optimal point is $y^* = x_0$. Therefore, in view of (2.1.6), we have
$$\phi(y^*) \le \phi\left( y - \tfrac{1}{L} \phi'(y) \right) \le \phi(y) - \frac{1}{2L} \| \phi'(y) \|^2.$$
And we get (2.1.7) since $\phi'(y) = f'(y) - f'(x_0)$.

We obtain (2.1.8) from inequality (2.1.7) by adding two copies of it with $x$ and $y$ interchanged. Applying now to (2.1.8) the Cauchy-Schwarz inequality, we get $\| f'(x) - f'(y) \| \le L \| x - y \|$.

In the same way we can obtain (2.1.9) from (2.1.6). In order to get (2.1.6) from (2.1.9) we apply integration:
$$f(y) - f(x) - \langle f'(x), y - x \rangle = \int_0^1 \langle f'(x + \tau(y - x)) - f'(x), y - x \rangle\, d\tau \le \tfrac{1}{2} L \| y - x \|^2.$$

Let us prove the last two inequalities. Denote $x_\alpha = \alpha x + (1 - \alpha) y$. Then, using (2.1.7) we get
$$f(x) \ge f(x_\alpha) + \langle f'(x_\alpha), (1 - \alpha)(x - y) \rangle + \frac{1}{2L} \| f'(x) - f'(x_\alpha) \|^2,$$
$$f(y) \ge f(x_\alpha) + \langle f'(x_\alpha), \alpha(y - x) \rangle + \frac{1}{2L} \| f'(y) - f'(x_\alpha) \|^2.$$


Adding these inequalities multiplied by $\alpha$ and $(1 - \alpha)$ respectively, and using the inequality
$$\alpha \| g_1 - u \|^2 + (1 - \alpha) \| g_2 - u \|^2 \ge \alpha (1 - \alpha) \| g_1 - g_2 \|^2,$$
we get (2.1.10). It is easy to check that we get (2.1.7) from (2.1.10) by tending $\alpha \to 1$.

Similarly, from (2.1.6) we get
$$f(x) \le f(x_\alpha) + \langle f'(x_\alpha), (1 - \alpha)(x - y) \rangle + \frac{L}{2} \| (1 - \alpha)(x - y) \|^2,$$
$$f(y) \le f(x_\alpha) + \langle f'(x_\alpha), \alpha(y - x) \rangle + \frac{L}{2} \| \alpha(y - x) \|^2.$$
Adding these inequalities multiplied by $\alpha$ and $(1 - \alpha)$ respectively, we obtain (2.1.11). And we get back to (2.1.6) as $\alpha \to 1$. □
Finally, let us give a characterization of the class $\mathcal{F}_L^{2,1}(R^n)$.

THEOREM 2.1.6 A twice continuously differentiable function $f$ belongs to $\mathcal{F}_L^{2,1}(R^n)$ if and only if for any $x \in R^n$ we have
$$0 \preceq f''(x) \preceq L I_n. \qquad (2.1.12)$$

Proof: The statement follows from Theorem 2.1.4 and (2.1.9). □

2.1.2 Lower complexity bounds for $\mathcal{F}_L^{\infty,1}(R^n)$

Before we go forward with optimization methods, let us check our possibilities in minimizing smooth convex functions. In this section we obtain the lower complexity bounds for optimization problems with objective functions from $\mathcal{F}_L^{\infty,1}(R^n)$ (and, consequently, $\mathcal{F}_L^{1,1}(R^n)$).

Recall that our problem class is as follows.

Model: $\min_{x \in R^n} f(x)$, $f \in \mathcal{F}_L^{\infty,1}(R^n)$.

Oracle: First-order local black box.

Approximate solution: $\bar{x} \in R^n$, $f(\bar{x}) - f^* \le \epsilon$.


In order to make our considerations simpler, let us introduce the following assumption on iterative processes.

ASSUMPTION 2.1.4 An iterative method $\mathcal{M}$ generates a sequence of test points $\{x_k\}$ such that
$$x_k \in x_0 + \text{Lin}\, \{ f'(x_0), \dots, f'(x_{k-1}) \}, \qquad k \ge 1.$$

This assumption is not absolutely necessary and it can be avoided by a more sophisticated reasoning. However, it holds for the majority of practical methods.

We can prove the lower complexity bounds for our problem class without developing a resisting oracle. Instead, we just point out the "worst function in the world" (that means, in $\mathcal{F}_L^{\infty,1}(R^n)$). This function appears to be difficult for all iterative schemes satisfying Assumption 2.1.4.

Let us fix some constant $L > 0$. Consider the following family of quadratic functions
$$f_k(x) = \frac{L}{4} \left\{ \frac{1}{2} \left[ (x^{(1)})^2 + \sum_{i=1}^{k-1} (x^{(i)} - x^{(i+1)})^2 + (x^{(k)})^2 \right] - x^{(1)} \right\}$$
for $k = 1, \dots, n$. Note that for all $s \in R^n$ we have
$$\langle f_k''(x) s, s \rangle = \frac{L}{4} \left[ (s^{(1)})^2 + \sum_{i=1}^{k-1} (s^{(i)} - s^{(i+1)})^2 + (s^{(k)})^2 \right] \ge 0,$$
and
$$\langle f_k''(x) s, s \rangle \le \frac{L}{4} \left[ (s^{(1)})^2 + \sum_{i=1}^{k-1} 2 \left( (s^{(i)})^2 + (s^{(i+1)})^2 \right) + (s^{(k)})^2 \right] \le L \sum_{i=1}^{n} (s^{(i)})^2.$$
Thus, $0 \preceq f_k''(x) \preceq L I_n$. Therefore $f_k(x) \in \mathcal{F}_L^{\infty,1}(R^n)$, $1 \le k \le n$.


Let us compute the minimum of function fk It is easy to see that
f/:(x) = fAk with

60

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

where Ok,p is a (k x p) zero matrix. Therefore the equation


fk(x)

= Akx- e1 = 0

has the following unique solution:


-(i) -

xk - {

1--i

k+l'

i = 1. .. k,

k+1~ i

0,

n.

Hence, the optimal value of function fk is

fi = ~

[!(Akxbxk)- (el,xk)]

= -t(el!xk)
(2.1.13)

t(-1+k~l).

Note also that
$$\sum_{i=1}^{k} i^2 = \frac{k(k+1)(2k+1)}{6} \le \frac{(k+1)^3}{3}. \qquad (2.1.14)$$
Therefore
$$\| \bar{x}_k \|^2 = \sum_{i=1}^{k} \left( 1 - \frac{i}{k+1} \right)^2 = k - \frac{2}{k+1} \sum_{i=1}^{k} i + \frac{1}{(k+1)^2} \sum_{i=1}^{k} i^2 \le k - \frac{2}{k+1} \cdot \frac{k(k+1)}{2} + \frac{1}{(k+1)^2} \cdot \frac{(k+1)^3}{3} = \frac{1}{3}(k+1). \qquad (2.1.15)$$

Denote $R^{k,n} = \{ x \in R^n \mid x^{(i)} = 0, \ k + 1 \le i \le n \}$; that is the subspace of $R^n$ in which only the first $k$ components of a point can differ from zero. From the analytical form of the functions $\{f_k\}$ it is easy to see that for all $x \in R^{k,n}$ we have
$$f_p(x) = f_k(x), \qquad p = k, \dots, n.$$
Let us fix some $p$, $1 \le p \le n$.

LEMMA 2.1.3 Let $x_0 = 0$. Then for any sequence $\{x_k\}_{k=0}^{p}$ satisfying the condition
$$x_k \in \mathcal{L}_k = \text{Lin}\, \{ f_p'(x_0), \dots, f_p'(x_{k-1}) \},$$
we have $\mathcal{L}_k \subseteq R^{k,n}$.


Proof: Since $x_0 = 0$, we have $f_p'(x_0) = -\frac{L}{4} e_1 \in R^{1,n}$. Thus $\mathcal{L}_1 \subseteq R^{1,n}$. Let $\mathcal{L}_k \subseteq R^{k,n}$ for some $k < p$. Since $A_p$ is tri-diagonal, for any $x \in R^{k,n}$ we have $f_p'(x) \in R^{k+1,n}$. Therefore $\mathcal{L}_{k+1} \subseteq R^{k+1,n}$, and we can complete the proof by induction. □

COROLLARY 2.1.1 For any sequence $\{x_k\}_{k=0}^{p}$ such that $x_0 = 0$ and $x_k \in \mathcal{L}_k$ we have
$$f_p(x_k) \ge f_k^*.$$

Now we are ready to prove the main result of this section.


2.1.7 For any k, 1 ~ k ~ ~(n- 1), and any xo ERn there
exists a junction f E :FJ: 1 (Rn) suchthat for any first-order method M
satisjying Assumption 2.1.4 we have

THEOREM

where x* is the minimum of j(x) and J* = j(x*).


Proof: It is clear that the methods of this type are invariant with
respect to a simultaneaus shift of all objects in the space of variables.
Thus, the sequence of iterates, which is generated by such a method for
function f(x) starting from x 0 , is just a shift of the sequence generated
for j (x) = f (x + xo) starting from the origin. Therefore, we can assume
that xo = 0.
Let us prove the first inequality. For that, let us fix k and apply M to
minimizing f(x) = hk+I(x). Then x* = X2k+I and f* = f2k+I Using
Corollary 2.1.1, we conclude that

Hence, since xo = 0, in view of (2.1.13) and (2.1.15) we get the following


estimate:


Let us prove the second inequality. Since $x_k \in R^{k,n}$ and $x_0 = 0$, we have
$$\| x_k - x^* \|^2 \ge \sum_{i=k+1}^{2k+1} \left( \bar{x}_{2k+1}^{(i)} \right)^2 = \sum_{i=k+1}^{2k+1} \left( 1 - \frac{i}{2k+2} \right)^2 = k + 1 - \frac{1}{k+1} \sum_{i=k+1}^{2k+1} i + \frac{1}{4(k+1)^2} \sum_{i=k+1}^{2k+1} i^2.$$
In view of (2.1.14), we have
$$\sum_{i=k+1}^{2k+1} i^2 = \frac{1}{6} \left[ (2k+1)(2k+2)(4k+3) - k(k+1)(2k+1) \right] = \frac{1}{6}(k+1)(2k+1)(7k+6).$$
Therefore, using (2.1.15) we finally obtain
$$\| x_k - x^* \|^2 \ge k + 1 - \frac{3k+2}{2} + \frac{(2k+1)(7k+6)}{24(k+1)} = \frac{2k^2 + 7k + 6}{24(k+1)} \ge \frac{2k^2 + 7k + 6}{16(k+1)^2} \| x_0 - \bar{x}_{2k+1} \|^2 \ge \frac{1}{8} \| x_0 - x^* \|^2. \qquad \square$$

The above theorem is valid only under the assumption that the number of steps of the iterative scheme is not too large as compared with the dimension of the space ($k \le \frac{1}{2}(n-1)$). Complexity bounds of this type are called uniform in the dimension of variables. Clearly, they are valid for very large problems, in which we cannot wait even for $n$ iterations of the method. However, even for problems with a moderate dimension, these bounds also provide us with some information. Firstly, they describe the potential performance of numerical methods at the initial stage of the minimization process. And secondly, they warn us that without a direct use of finite-dimensional arguments we cannot get better complexity for any numerical scheme.

To conclude this section, let us note that the obtained lower bound for the value of the objective function is rather optimistic. Indeed, after one hundred iterations we could decrease the initial residual by a factor of $10^4$. However, the result on the behavior of the minimizing sequence is quite disappointing: The convergence to the optimal point can be arbitrarily slow. Since that is a lower bound, this conclusion is inevitable for our problem class. The only thing we can do is to try to find problem classes in which the situation could be better. That is the goal of the next section.
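The slow convergence predicted by this lower bound is easy to observe experimentally. The following Python sketch (our illustration; the values $n = 200$, $L = 4$ and the printed iterations are arbitrary choices) builds the "worst function in the world" $f_n$, runs the gradient method with $h = 1/L$ on it from $x_0 = 0$, and prints the slowly decaying residuals $f(x_k) - f^*$:

```python
import numpy as np

n, L = 200, 4.0
# Worst function f_n: f(x) = (L/4)*(0.5*<A x, x> - x[0]) with the tri-diagonal
# matrix A (2 on the diagonal, -1 next to it); then f''(x) = (L/4)*A, 0 <= f'' <= L*I.
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.eye(n)[0]
f = lambda x: (L / 4.0) * (0.5 * (x @ A @ x) - x[0])
grad = lambda x: (L / 4.0) * (A @ x - e1)

x_star = 1.0 - np.arange(1, n + 1) / (n + 1.0)   # the minimizer, cf. (2.1.13)
f_star = f(x_star)

x = np.zeros(n)                                   # x0 = 0
for k in range(1, 101):
    x = x - (1.0 / L) * grad(x)                   # gradient step with h = 1/L
    if k in (10, 50, 100):
        print(k, f(x) - f_star)    # decays slowly, in accordance with the bound
```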

2.1.3 Strongly convex functions

Thus, we are looking for a restriction of the functional class $\mathcal{F}_L^{\infty,1}(R^n)$ for which we can guarantee a reasonable rate of convergence to the unique solution of the minimization problem
$$\min_{x \in R^n} f(x).$$
Recall that in Section 1.2.3 we proved that in a small neighborhood of a nondegenerate local minimum the gradient method converges linearly. Let us try to make this non-degeneracy assumption global. Namely, let us assume that there exists some constant $\mu > 0$ such that for any $\bar{x}$ with $f'(\bar{x}) = 0$ and any $x \in R^n$ we have
$$f(x) \ge f(\bar{x}) + \frac{1}{2} \mu \| x - \bar{x} \|^2.$$

Using the same reasoning as in Section 2.1.1, we obtain the class of strongly convex functions.

DEFINITION 2.1.2 A continuously differentiable function $f(x)$ is called strongly convex on $R^n$ (notation $f \in \mathcal{S}_\mu^1(R^n)$) if there exists a constant $\mu > 0$ such that for any $x, y \in R^n$ we have
$$f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{1}{2} \mu \| y - x \|^2. \qquad (2.1.16)$$
The constant $\mu$ is called the convexity parameter of function $f$.


We will also consider the classes s~t(Q) with the same meaning of
the indices k, l and L as for the class C~ 1 (Q).
Let us fix some properties of strongly convex functions.
THEOREM

2.1.8 lf f

Sb(Rn) and f'(x*) = 0, then

f(x) ;:::: f(x*)

+ !~-t II x- x*

11 2

for all x ERn.

Proof: Since f'(x*) = 0, in view of inequality (2.1.16), for any x E Rn


we have

f(x)

> f(x*) + (f'(x*), x- x*) + !~-t II x- x*


-

f(x*)

+ !~-t II x- x*

11 2

11 2

64

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

The following result justifies the addition of strongly convex functions.


LEMMA

2.1.4 lf fi

Sh 1 (Rn),

= afi

f2

Sh 2 (Rn) and a, ?: 0, then

+ h E S~ll- 1 +11- 2 (Rn).

Proof: For any x, y E Rn we have


JI(y)

>

!I(x)

+ (fi(x),y- x) + !J.Ll

II

y- x 11 2,

h(y)

>

h(x)

+ UHx), Y- x) + !J.L2

II

Y- x 11 2 .

It remains to add these equations multiplied respectively by a and

Note that the class SJ(Rn) coincides with F 1 (Rn). Therefore addition
of a convex function to a strongly convex function gives a strongly convex
function with the same convexity parameter.
Let us give several equivalent definitions of strongly convex functions.
2.1.9 Let f be continuously differentiable. Both conditions
below, holding for all x, y ERn and a E (0, 1], are equivalent to inclusion

THEOREM

E S~(Rn):

?: J.L II x- y

(f'(x) - f'(y), x- y)
af(x)

+ (1- a)f(y) ?:

f(ax

(2.1.17)

11 2 ,

+ (1- a)y)

+a(1- a)~

II

(2.1.18}

x- y

11 2 .

The proof of this theorem is very similar to the proof of Theorem 2.1.5
and we leave it as an exercise for the reader.
The next statement sometimes is useful.
THEOREM

2 .1.1 0 If f

f(y) S f(x)

Sh (Rn), then for any x and y from Rn we have

+ (f'(x), Y- x) + 2~

(f'(x) - f'(y), x- y) S ~

II

II

f'(x) - f'(y)

f'(x) - f'(y)

11 2

Proof: Let us fix some x E Rn. Consider the function


cp(y) = f(y)- (f'(x), y) E S~(Rn).

11 2 ,

(2.1.19)
(2.1.20)

65

Smooth convex optimization

Since rp'(x) = 0, in view of (2.1.16) for any y ERn we have that

rp(x)

minrp(v)
;::: min[<P(y)
V
V

<P(y)- 2~11<P'(y)li 2 ,

+ (<P'(y), v- y) + !ltllv- Yll 2]

and that is exactly (2.1.19). Adding two copies of (2.1.19) with x and y
interchanged we get (2.1.20).
D
Finally, the second-order characterization of the class Sh(Rn) is as
follows.
2 .1.11 Two times continuously differentiable function
longs to the class s~ (Rn) if and only if X E Rn

THEOREM

f be-

{2.1.21)

f"(x) !: flln.

Proof: Apply (2.1.17).

Note we can Iook at examples of strongly convex functions.


EXAMPLE

2.1.2 1. f(x) = !II x 11 2 belongs to Sf(Rn) since f"(x) =In.

2. Let symmetric matrix A satisfy the condition: ttln :S A :S Lln. Then

j(x) = a+ (a,x)

+ !(Ax,x)

s::l(Rn) C S!;l(Rn)

since f"(x) = A.
Other examples can be obtained as a sum of convex and strongly
D
convex functions.
For us the most interesting functional class is s!:l(Rn). This class is
described by the following inequalities:

(f'(x)- J'(y),x- y};::: flll


II !'(x)- !'(y) II:S L II

x- Y

11 2 ,

(2.1.22)

(2.1.23)
x- Y II
The value Qf = L / fl ;::: 1 is called the condition number of function f.
It is important that the inequality (2.1.22) can be strengthened using
the additional information (2.1.23).

66

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

THEOREM

2.1.12 lf f E S~:l(Rn), then for any x, y ERn we have

(f'(x)- f'(y), x- Y) 2: Jl-~L

II

x- Y 11 2

+ Jl-~L II f'(x)- f'(y) 11 2

{2.1.24}

Proof: Derrote cp(x) = f(x)- ~1111xll 2 . Then cp'(x) = f'(x)- 11x; hence,
by (2.1.22) and (2.1.9) cjJ E :Fl~M(Rn). If 11 = L, then (2.1.24) is proved.
If 11 < L, then by (2.1.8) we have

(cp'(x)- <P'(y),y- x)

~ L~J.LII<P'(x)-

</>'(y)ll 2 ,

and that is exactly (2.1.24).

2.1.4

Lower complexity bounds for sc;,l(Rn)

Let us get the lower complexity bounds for unconstrained minimization of functions from the dass sr;:J}(Rn) c s~:l(Rn). Consider the
following problern dass.
min J(x),

Model:

xERn

s;:l(Rn),

Oracle:

First-order local black box.

Approximate solution:

x: f(x) - !*

E,

II x- x*

1-l

11 2 ~

> 0.

E.

As in the previous section, we consider the methods satisfying Assumption 2.1.4. We are going to find the lower complexity bounds for our
problern in terms of condition number Qf = ~
Note that in the description of our problern dass we do not say anything about the dimension of the space of variables. Therefore formally,
this dass includes also the infinite-dimensional problems.
We are going to give an example of some bad function defined in the
infinite-dimensional space. We could do that also in a finite dimension,
but the corresponding reasoning is more complicated.
Consider gX) l2 , the space of all sequences x = {x(i)}~ 1 with finite
norm

67

Smooth convex optimization

Let us choose some parameters f-l


following function

Denote

-i
(

A=

>

0 and Qf

-~ -~ ~
2

0 -1

>

1, which define the

Then f"(x) = tt(Qf 1) A + f-ll, where I is the unit operator in R 00 In


the previous section we have already seen that 0 :::5 A :::5 4/. Therefore

This means that J11 ,Q 1 E s::,~~ 1 (R 00 ). Note that the condition number
of function f 11 ,Q1 is

Q!Jl.,Qf -_

ttQr _
-

11

Let us find the minimum of function f 11 ,11 Q1 The first-order optimality


condition

, (x)
! J-L,IlQf

= (tt(Qr 1) A

can be written as

+ u1) X-

J-L(Qr 1) e = 0
4

f""'

(A + Q/- 1 ) x = e1.

The coordinate form of this equation is as follows:


2 Q r+l x(l) - x( 2)

= 1,

+ x(k-l)

= 0,

Qr1

x(k+l)- 2Qr+ 1 x(k)

Qrl

(2.1.25)
k = 2, ....

Let q be the smallest root of the equation


q2 -

that is q =

1-1
~QQt+1

2 Q I+ 1 q + 1 = 0
Qrl

Then the sequence (x*)(k)

'

= qk, k = 1, 2, ... , satisfies

the system (2.1.25). Thus, we come to the following result.

68

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

2.1.13 For any xo E R 00 and any constants 1-" > 0, Qf > 1


there exists a function f E s:,~~l (R00 ) such that for any first-order
method M satisfying Assumption 2.1.4, we have

THEOREM

II

Xk- x*

11 2 2

( ~~~)

k II

xo- x*

11 2 ,

~ ~ ( ~~~) 2k II xo- x* 11 2 ,

f(xk)- !*

where x* is the minimum of function f and f* = f(x*).

= 0.

Proof: Indeed, we can assume that xo


JIJ.,!J.Qf(x). Then
II

xo- x*

11 2

=f

i=l

((x*)(i)]2 =

Let us choose f(x)

fq

i=l

2i

= ,S.
1
q

Since J:.IJ.QJ(x) is a three-diagonal operator and f~,IJ.QJ(O) = e 1 , we


conclude that Xk E Rk,oo. Therefore

II Xk- x* 11 2 ~

00

( ")

00

L: [(x*) ~ J2 = L:

i=k+ 1

i=k+ 1

q2~

= T-:r = q2k II xo- x*


2(k+l)

11 2

The second bound of the theorem follows from the first one and The0
orem 2.1.8.

Gradient method

2.1.5

Let us check how the gradient method works on the problern


min f(x)

with f E
follows.

:t'l' (Rn).
1

xERn

Recall that the scheme of the gradient method is as

Gradient method
0. Choose x 0 ERn.

1. kth iteration (k ~ 0).

a). Compute f(xk) and f'(xk).


b). Find Xk+l = Xk- hkf'(xk) (see Section 2 for
step-size rules).

69

Smooth convex optimization

In this section we analyze the simplest variant of the gradient scheme


with hk = h > 0. It is possible to show that for all other reasonable
step-size rules the rate of convergence of this method is similar. Denote
by x* the optimal point of our problern and f* = f(x*).

2.1.14 Let jE Fi, 1 (Rn) and 0 < h <


Then the gradient
method generates a sequence {xk}, which converges as follows:

THEOREM

f(xk)Proof: Denote rk
r~+l

=II

f* ::; 2llxo-!~11~x~t~;~r~~uc~:)-/*)
Xk- x*

II

Then

Xk - x* - hf'(xk)

11 2

II

r~- 2h(f'(xk), Xk - x*)

<

r~- h(f- h) II f'(xk) 11 2

+ h2 II

f'(xk)

11 2

(we use (2.1.8) and f'(x*) = 0). Therefore rk ::; r 0 . In view of (2.1.6)
we have

where w = h(l- th). Denote .k

= f(xk)- f*.

.k::; (f'(xk), Xk- x*) ::; ro

II

Then

f'(xk)

II .

Therefore .k+l ::; .k - ;%-ll.~. Thus,


0

Summing up these inequalities, we get


~
'-"k+l

>
} + r~(k
+ 1).
'-"O
0
0

In order to choose the optimal step size, we need to maximize the


function 4J(h} = h(2- Lh) with respect to h. The first-order optimality
In
condition 4J'(h} = 2 - 2Lh = 0 provides us with the value h* =
this case we get the following efficiency estimate of the gradient method:

f(

Xk -

f*

<

2L(f(xo)-f*)llxo-x*l! 2

- 2LJixo-xIJ2+k(/(xo)- /*)

(2.1.26)

70

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Further, in view of (2.1.6) we have

< f* + (f'(x*), xo- x*} + t II xo- x*

f(xo)

f* + t II xo - x*

11 2

11 2 .

Since the right-hand side of inequality (2.1.26) is increasing in f(xo)- f*,


we obtain the following result.
COROLLARY

2.1.2 lf h =

and jE

.rt' (Rn), then


1

2LIIxo-xll 2
f( Xk ) _ f* <
k+4
.

{2.1.27}

Let us estimate the performance of the gradient method on the dass


of strongly convex functions.
2.1.15 lf jE S~:l(Rn) and 0 < h ~
method generates a sequence {xk} such that

THEOREM

Xk - x* 11 2 ~ ( 1 -

II

lf h =

1-L!L'

~_fJ;) k II

1-L!L'

xo- x*

then the gradient

11 2 .

then

llxk-x*ll

< (~;~i)kllxo-x*ll,

f(xk)- f*

< ~ (~;~i) 2 k II xo- x*

11 2 ,

where Qf = LfJ-L.

Proof: Denote rk

rf+l

=II Xk- x*

11.

Then

Xk- x* - hf'(xk)

11 2

II

r~ - 2h(f'(xk), Xk -

f'(xk)

11 2

<

(1- ~-ff) r~ + h(h- 1-L!L) II f'(xk)

11 2

x*} + h2 II

(we use (2.1.24) and f'(x*) = 0). The last inequality in the theorem
0
follows from the previous one and (2.1.6).
Recall that we have seen already the step-size rule h = 1-L!L and
the linear rate of convergence of the gradient method in Section 1.2.3,
Theorem 1.2.4. But that was only a local result.

71

Smooth convex optimization

Comparing the rate of convergence of the gradient method with the


lower complexity bounds (Theorems 2.1.7 and 2.1.13), we can see that
the gradient method is far from being optimal for classes Fz 1 (Rn) and
s~:i(Rn). We should also note that on theseproblern classes the Standard unconstrained minimization methods (conjugate gradients, variable
metric) have a similar global efficiency bound. The optimal methods for
minimizing smooth convex and strongly convex functions will be considered in the next section.

2.2

Optimal Methods

(Optimal methods; Convex sets; Constrained minimization problem; Gradient


mapping; Minimization methods over a simple set.)

2.2.1

Optimal methods

In this section we consider an unconstrained minimization problern


min f(x),

xeRn

with f being strongly convex: f E s~;i_(Rn), t.t 2: 0. Formally, this


family of classes contains also the dass of convex functions with Lipschitz
gradient (S~z(Rn) = Fz 1 (Rn)).
In the pr~vious section we proved the following efficiency estimates
for the gradient method:
!( Xk ) _ J* <
-

2LIIxo-x*ll 2

k+4

'

These estimates differ from our lower complexity bounds (Theorem 2.1.7
and Theorem 2.1.13) by an order of magnitude. Of course, in general
this does not mean that the gradient method is not optimal since the
lower bounds might be too optimistic. However, we will see that in our
case the lower bounds are exact up to a constant factor. We prove that
by constructing a method that has corresponding efficiency bounds.
Recall that the gradient method forms a relaxation sequence:

This fact is crucial for justification of its convergence rate (Theorem


2.1.14). However, in convex optimization the optimal methods never
rely on relaxation. Firstly, for some problern classes this property is too
expensive. Secondly, the schemes and efficiency estimates of optimal

72

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

methods are derived from some global topological properties of convex


functions. From this point of view, relaxation is a too "microscopic"
property to be useful.
The schemes and efficiency bounds of optimal methods are based on
the notion of estimate sequence.

A pair of sequences {1>k(x)}~ 0 and {Ak}~ 0 , >.k 2: 0


is called an estimate sequence of function f(x) if
DEFINITION 2.2.1

and for any x E Rn and all k 2: 0 we have

{2.2.1)
The next statement explains why these objects could be useful.
LEMMA 2.2.1

lf for some sequence {xk} we have

{2.2.2}
then f(xk)-

f*

>.k[<!>o(x*)- !*] --+ 0.

Proof: Indeed,

Thus, for any sequence {xk}, satisfying (2.2.2) we can derive its rate
of convergence directly from the rate of convergence of sequence { Ak}.
However, at this moment we have two serious questions. Firstly, we do
not know how to form an estimate sequence. And secondly, we do not
know how we can ensure (2.2.2). The first question is simpler, so let us
answer it.
LEMMA 2.2.2

Assurne that:

S~:l(Rn),

2 <Po (x) is an arbitmry function an Rn,


3

{yk}~ 0

is an arbitrary sequence in Rn,

73

Smooth convex optimization

4 {ak}r=o: ak E (O, 1),


5 Ao = 1.

Then the pair of sequences {4>k(x)}r=o {Ak}k:o recursively defined by t:


Ak+l =
4>k+I(x) =

(1 - ak)Ak,
{2.2.3}

(1- ak)4>k(x)

+ak[f(Yk) + (J'(yk), x- Yk) + ~


is an estimate sequence.
Proof: Indeed, 4>o(x) ~ (1 - Ao)f(x) + Ao4>o(x)
(2.2.1) hold for some k ~ 0. Then
4>k+I(x) ~ (1- ak)4>k(x)
=

II

x- Yk

= 4>o(x).

11 2],

Further, let

+ akf(x)

(1- (1- ak)Ak)f(x)

+ (1- ak)(cPk(x)- (1- Ak)f(x))

< (1- {1- ak)Ak)f(x) + (1- ak)AkcPo(x)


=

(1- Ak+l)f(x)

+ Ak+I4>o(x).

It remains to note that condition 4) ensures Ak -+ 0.

Thus, the above statement provides us with some rules for updating
the estimate sequence. Now we have two control sequences, which can
help to ensure inequality (2.2.2). Note that we arealso free in the choice
of initial function c/Jo(x). Let us choose it as a simple quadratic function.
Then we can obtain the exact description of the way c/J'k varies.
2.2.3 Let 4>0 (x) = 4>0+ ~ II x- vo 11 2 . Then the process {2.2.3}
preserves the canonical form of functions {4>k(x)}:
LEMMA

cPk(x)

= c/J'k + 1f II x- Vk 11 2 ,

{2.2.4}

where the sequences {'yk}, {Vk} and {c/Jk} are defined as follows:
'Yk+l =

(1 - ak)'Yk

+ akp.,

cPk+I =

(1- ak)4>k

+ akf(Yk)- 21:~ 1

+ak(~~:;bk (~ II Yk- Vk

11 2

II

f'(yk)

11 2

+(f'(yk),vk- Yk))

74

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Proof: Note that cp~(x) = 'Yoln. Let us prove that cf;%(x)


k ~ 0. Indeed, if that is true for some k, then

cf;%+1 (x) = (1- ak)cf;%(x)

+ akJ.Lln =

((1- akhk

= 'Ykln for all

+ akJ.L)ln

='Yk+1In.

This justifies the canonical form (2.2.4) of functions cpk(x).


Further,

if;k+l(x) =

(1- ak) (if;ic

+ 1t

II

x- Vk

11 2 )

+ ak[f(yk) + (f'(yk), x- Yk) + ~

II

x- Yk

11 2].

Therefore the equation cpk+l (x) = 0, which is the fi.rst-order optimality


condition for function cpk+l(x), Iooks as follows:

From that we get the equation for the point vk+ 1 , which is the minimum
of the function cpk+l(x).
Finally, let us compute if;'ic+l In view of the recursion rule for the
sequence {cpk(x)}, we have

cpk+1

+ 'Ykt

II

Yk - Vk+1

11 2 =

cpk+l (Yk)

(2.2.5)

Note that in view of the relation for vk+ 1,

Therefore

'Ykr

II

Vk+l - Yk

11 2

2"f!+l

[(1 - ak) 2 'Y~

II

Vk - Yk

11 2

It remains to substitute this relation into (2.2.5) noting that the factor
for the term II Yk- Vk 11 2 in this expression is as follows:

_1_{1- ak)2'Y2
(1- ak)'lk.2
2'Yk+l
k

{1- ak)'lk. (1- (1-akhk)


2

'Yk+l

Smooth convex optimization

75

Now the situation is more clear and we are close to getting an algorithmic scheme. Indeed, assume that we already have Xk:

Then, in view of the previous Iemma,

Since f(xk) 2: f(Yk)


cf>*k+l

+ (f'(yk), Xk

- Yk), we get the following estimate:

2: f(Yk)- 21~~1

II

f'(Yk)

11 2

+(1- ak)(f'(yk ), ~(vk- Yk) + Xk- Yk)


lk+l
Let us Iook at this inequality. We want to have
that we can ensure the inequality

c/>k+l 2:

f(xk+l) Recall,

in many different ways. The simplest one is just to take the gradient
step
with hk

= (see
2

0
Then 2""lk+l
k
following:

Xk+l

= Yk- hkf'(xk)

(2.1.6)). Let us define ak as follows:

2~ and we can replace the previous inequality by the

Now we can use our freedom in the choice of Yk Let us find it from the
equation:
~krk ( Vk - Yk)
lk+l

That is

+ Xk

- Yk = 0.

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

76

Thus, we come to the following method.

General scheme of optimal method


0. Choose xo ERn and 10
1. kth iteration (k

> 0.

Set vo = xo.

2': 0).

a). Compute ak E {0, 1) from equation

+ akfJ

La~ = {1 - akhk
Set lk+l = {1- akhk
b). Choose
1

Yk-

+ akfJ

(2.2.6)

Qk/'kVk+l'kt!Xk
l'k+akJL

and compute f(Yk) and f'(Yk)


c). Find

Xk+l

suchthat

(see Section 1.2.3 for the step-size rules}.


d). Set Vk = (l-akhkvk+akllYk-akf'(Yk).
+l

'Yktl

Note that in Step 1c} ofthis scheme we can choose any


the inequality
with some w
Step 1a).
THEOREM

>

0. Then the constant

b replaces L

satisfying

in the equation of

2.2.1 The scheme (2.2.6} generates a sequence {xk}~ 0 such

that

f(xk) - f* :S Ak [f(xo) - f*
where Ao

Xk+l

=1

and )..k

+f

II

xo - x*

11 2 ] ,

= ll~01 {1- ai).

Proof: Indeed, let us choose </Jo(x) = f(xo) + ~ II x- vo 11 2 . Then


f(xo) = <Po and we get f(xk) ::; <P'k by construction of the scheme. It

remains to use Lemma 2.2.1.

77

Smooth convex optimization

Thus, in order to estimate the rate of convergence of (2.2.6), we need


to understand how fast >.k goes to zero.
LEMMA

2.2.4 If in the scheme (2.2.6} /O;:::: J-t, then

{2.2. 7)

Proof: Indeed, if /k ;:::: J-t, then /k+l = La~ = (1 - akhk + akJ-t ;:::: 1-t
Since /O ;:::: J-t, we conclude that this inequality is valid for all /k Hence,
ak ;::::
and we have proved the first inequality in (2.2.7).
Further, let us prove that /k ;:::: 1o>.k. Indeed, since /o = /o>.o, we can
use induction:

jii

/k+l ;:::: (1- akhk? (1- ak)/o>.k = /O)..k+l


Therefore La~ = /k+l ;:::: /o>.k+l
Denote ak =
Since {>.k} is a decreasing sequence, we have

k.

ak+l- ak

=
>

,;>:;- .;>:;:;
vf..\k..\k+l

..\k-..\ktt

-2..\k~

Ak- Akt 1

= vf..\k..\ktt(V'Xk+vf.Xkd

= Ak-(1-ak)..\k =
2.xk~

ct&

>1

/JQ,

2~-2VT

Thus, ak;:::: 1 + ~fl and the Iemma is proved.

Let us present an exact statement on optimality of (2.2.6).


2.2.2 Let us take in (2.2.6} /o = L. Then this scheme generates a sequence {xk}r:o such that

THEOREM

f(xk)-

!*

~ Lmin { (1-/f/, (k;2)2}

II

xo- x* 11 2 .

This means that {2.2. 6) is optimal for unconstrained minimization of


the functions from s~;i(Rn), 1-t ;:::: 0.

Proof: We get the above inequality using f(xo)- f* ~ ~ II xo- x* 11 2


and Theorem 2.2.1 with Lemma 2.2.4.
Let J-t > 0. From the lower complexity bounds for the dass (see
Theorem 2.1.13) we have

(&-1)2k R2 >- l!exp


(- 4k ) R2
2
..jQ/-1
'

f( xk)- f* > 1:!:.


- 2 ..jQ/+1

78

INTRODUCTORY LEGTURES ON GONVEX OPTIMIZATION

where Qf = L/~-t and R =II xo- x* II Therefore, the worst case bound
for finding Xk satisfying f(xk) - f* ~ E cannot be better than

k>- .,fQi4

-l

[In 1 + In !!2

+ 2In R] .

For our scheme we have

f(xk)-

f* ~ LR2 (1-lf)k ~ LR 2exp (

Therefore we guarantee that k ~

-)ij).

J([j [ln -: + ln L + 2ln R] . Thus, the

main term in this estimate, JQi In ~, is proportional to the lower bound.


The same reasoning can be used for the dass
(Rn).
D

sJ:l

Let us analyze a variant of the scheme {2.2.6), which uses the gradient
step for finding the point Xk+I
Constant Step Scheme, I

0. Choose xo ERn and 'Yo > 0. Set vo = xo.


1. kth iteration (k ~ 0).

a). Compute ak E {0, 1) from the equation

+ CtkJ-t

La~ = (1 - ak)rk

(2.2.8)

Set rk+l = (1- akhk + CtkJ-t


b). Choose y = ak"J'kvk+"l'k+lxk.
k

"l'k+akJl

Compute f(Yk) and f'(yk)

c). Set Xk+I = Yk --j;J'(yk) and

Vk+I = ;;l--[{1ak)rkvk
lkl

+ CtkJ-tYk- akf'(Yk)].

Let us demonstrate that this scheme can be rewritten in a simpler


form. Note that
Yk = 'Yk+1akJl (akrkVk + 'Yk+lxk),

Yk- tf'(Yk),
=

~[(1ak)rkvk
lkl

+ CikJ.tYk- akf'(Yk)].

79

Smooth convex optimization

Therefore

_ _1_ { (1-akhk

'Yk+t

ak

Yk

+ J-LYk

} _ 1-a~sx _ -2LJ'( )
ak
k
'Yk+l
Yk

Hence,

Xk

+1

+ Clktl'Yktl(Vkl-Xktl)
= Xk + 1 + fJ,f.lk(Xk + 1 _
'Yk+l +ak+tJJ

Xk)

'

where

Thus, we managed to get rid of {vk}. Let us do the same with 'Yk We
have

Therefore

_
'Yk 1 1-ak)
_ ak(l-ak)
- ak 'Yk+l +ak+l L) - af +akl

Note also that a~+l = (1- ak+I)a~


a5L =

+ qak+l

(1- aoho

with q = J-L/ L, and

+ J-Lao.

The latter relation means that 'Yo can be seen as a function of ao. Thus,
we can completely eliminate the sequence {'Yk}. Let us write down the
corresponding scheme.

80

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Constant Step Scheme, li


0. Ghoose xo ERn and ao E {0, 1).
Set Yo = xo and q = 1;.
1. kth iteration (k ;::: 0).

a). Garnpute f(Yk) and f'(Yk) Set


Xk+l

{2.2.9)

= Yk - tf'(Yk)

b). Garnpute ak+l E {0, 1) from equation

a~+l = {1- ak+l)a~ + qak+l,


and set =
k

a1(l-ak)
ak+ak+l'

The rate of convergence of the above scheme can be derived from


Theorem 2.2.1 and Lemma 2.2.4. Let us write down the corresponding
statement in terms of a 0 .
THEOREM

2.2.3 lf in scheme (2.2.9}


{2.2.10}

then

f (xk) - f* S min

{ (1 -

.ft) k , ( Jt1Zy'1o')2}

x [f(xo) - !*
where

"'O
1

+ lll xo -

x*

11 2 ] ,

= no(noL-tL).
1-ao

We do not need to prove this theorem since the initial scheme is not
changed. We change only notation. In Theorem 2.2.3 condition {2.2.10)
is equivalent to /o ~ Jl-
Scheme {2.2.9) becomes very simple if we choose ao =
{this corresponds to /o = 11-). Then

.ft

81

Smooth convex optimization

for all k ;:::: 0. Thus, we corne to the following process.


Constant step scheme, 111

0. Choose Yo = xo E Rn.
(2.2.11)

1. kth iteration (k;:::: 0).

Xk+l

Yk - tf'(Yk),

However, note that this process does not work for p. = 0. The choice
'Yo = L (which changes corresponding value of ao) is safer.

2.2.2

Convex sets

Let us try to understand which constrained rninirnization problern we


can solve. Let us start frorn the sirnplest problern of this type, the
problern without functional constraints:
rnin f(x),
xEQ

where Q is sorne subset of Rn. In order to rnake our problern tractable,


we should irnpose sorne assurnptions on the set Q. And first of all, let
us answer the following question: Which sets fit naturally the class of
convex functions? Frorn definition of convex function,

f(ax

+ (1- a)y) :::; af(x) + (1- a)f(y), Vx, y

ERn, a E [0, 1],

we see that it is irnplicitly assurned that it is possible to check this


inequality at any point of the segment [x, y]:

[x,y]

= {z = ax + (1- a)y,

a E [0, 1]}.

Thus, it would be natural to consider a set that contains the whole


segrnent [x, y] provided that the end points x and y belang to the set.
Such sets are called convex.
DEFINITION 2.2.2 Set Q is called convex if for any x, y E Q and a
from [0, 1] we have
ax + (1- a)y E Q.

82

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

The point ax + {1- a)y with a E [0, 1] is called a convex combination


of these two points.
In fact, we have already met some convex sets.
LEMMA 2.2.5 If f(x) is a convex function, then for any E R 1 its level
set

is either convex or empty.


Proof: Indeed, let
J(y) ~ . Therefore

f(ax

c,(). Then J(x)

and y belong to

+ (1- a)y)

:$ af(x)

+ {1 -

and

a)f(y) :$ .
0

LEMMA 2.2.6 Let J(x) be a convex function. Then its epigraph


Cf = {(X, T) E Rn+ 1 I

J(X)

:$ T}

is a convex set.
Proof: lndeed, let z1 = (x1,rt) E Cf and z2 = (x2,r2) E Cf Then for
any a E [0, 1] we have
Za

=az1 + {1- a)z2 = (ax1 + (1- a)x2, ar1 + (1- a)r2),

Thus, z 0 E Cf

Let us look at some properties of convex sets.


THEOREM 2.2.4 Let Q1 ~ Rn and Q2 ~ Rm be convex sets and A(x)

be a linear operator:

Then all sets below are convex:

= n}: Ql nQ2 = {x ERn I XE Ql, XE Q2}


2. Sum (m = n): Q1 + Q2 = {z = x + y I x E Qt, y E Q2}.
3. Direct sum: Q1 x Q2 = {(x, y) E Rn+m I x E Q1, y E Q2}
4 Conic hull: K:(Ql) = {z ERn I z = x, x E Qb ~ 0}.
1. Intersection (m

83

Smooth convex optimization

5. Convex hull

Conv (Q I, Q2) = { z E Rn

z = ax + (1 - a),
y,x E QI, y E Q2, a E [0, 1]}.

6. Affine image: A(QI) = {y E Rm

y = A(x), x E QI}

7. Inverse affine image: A-I(Q2) = {x ERn

A(x) E Q2}.

Proof: 1. If XI E QinQ2, XI E QlnQ2, then [xl,X2] c QI and


[xi, x2] c Q2. Therefore [xl, X2] c Ql n Q2.
2. If ZI = Xl + X2, XI E Ql, X2 E Q2 and Z2 = YI + Y2, YI E Q1,
Y2 E Q2, then
azi + (1- a)z2 = (ax1 + (1- a)yi)l + (ax2 + (1- a)y2)2,
where (h E Q1 and ()2 E Q2.
3. If ZI = (xi, X2), XI E QI, X2 E Q2 and Z2
Y2 E Q2, then

(yi, Y2), YI

QI,

azi + (1- a)z2 = ((axi + (1- a)yi)1, (ax2 + (1- a)y2)2),


where (h E Q1 and ()2 E Q2.
4. If ZI = iXI, XI E QI, i ;::: 0, and Z2
then for any a E [0, 1J we have

azi + (1- a)z2

= a1x1 +

(1- a)2x2

= 2x2,

X2 E QI, 2 ;::: 0,

= 1(ax1 +

(1- a)x2),

where 1 = al + (1- a)2, and a = ad'Y E [0, 1].


5. If z1 = ixi + (1 - dx2, x1 E Q1, x2 E Q2, 1 E [0, 1], and
Z2 = 2Y1 + (1 - 2)y2, Yl E Ql, Y2 E Q2, 2 E [0, 1], then for any
a E [0, 1] we have
az1

+ (1 -

a)z2 =

a(1x1

+ (1 -

l)x2)

+(1 - a)(2YI + (1 - 2)y2)


=

a(1x1

+ (1- i)Yd

+(1 - a)(/32x2 + (1 - 2)Y2),


where a = a1 + (1- a)2 and 1 = ai/a, 2 = a(1- i)/(1- a).
6. If YI, Y2 E A(QI) then YI = Ax1 +band Y2 = Ax2 + b for some x1,
x2 E Q1. Therefore, for y(a) = ay1 + (1- a}y2, 0 ~ a ~ 1, we have
y(a)

= a(Ax1 +

b) + (1- a)(Ax2 + b)

= A(ax1 +

(1- a)x2) + b.

84

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Thus, y(a) E A(Ql).


7. If x1, x2 E A- 1(Q2) then Ax1 + b = Y1 and Ax2 + b = Y2 for some
YI, Y2 E Q2. Therefore, for x(a) = ax1 + (1- a)x2, 0 ~ a ~ 1, we have

A(x(a)) = A(ax1

+ (1- a)x2) + b

= a(Ax1 + b) + (1- a)(Ax2 + b) = ay1 + (1- a)y2 E Q2.


0

Let us give several examples of convex sets.


EXAMPLE 2.2.1 1. Half-space {x E Rn
linear function is convex.

2. Polytope {x E Rn I (ai, x)
intersection of convex sets.

(a,x}

} is convex since

bi, i = 1 ... m} is convex as an

3. Ellipsoid. Let A =AT!:: 0. Then the set {x ERn


convex since function (Ax, x} is convex.

(Ax, x) ~ r 2} is
0

Let us write down the optimality conditions for the problern


(2.2.12)
where Q is a closed convex set. It is clear that the old condition

f'(x) = 0
does not work here.
EXAMPLE 2.2.2 Consider the one-dimensional problem:
minx.
x~O

Here x E Rl, Q = {x : x ~ 0} and f(x) = x. Note that x* = 0 but


f'(x*) = 1 > 0.
0

THEOREM 2.2.5 Let f E .1'1 (Rn) and Q be a closed convex set. The
point x* is a solution of (2.2.12} if and only if
(J'(x*),x- x*} ~ 0

(2.2.13}

85

Smooth convex optimization

for all x E Q.

Proof: Indeed, if (2.2.13) is true, then

f(x) 2 f(x*)

+ (J'(x*), x- x*) 2 f(x*)

for all x E Q.
Let x* be a solution to (2.2.12). Assurne that there exists some x E Q
suchthat
(J'(x*), x- x*) < 0.
Consider the function cf;(a) = f(x*

cf;(O)

= f(x*),

cj;'{O)

+ a(x- x*)), a

[0, 1]. Note that

= (J'(x*), x- x*} < 0.

Therefore, for a small enough we have

f(x*

+ a(x- x*))

= cf;(a)

< cf;(O)

= f(x*).

That is a contradiction.

2.2.6 Let f E S~(Rn) and Q be a closed convex set. Then


there exists a unique solution x* of problern {2.2.12}.

THEOREM

Proof: Let xo E Q. Consider the set Q = {x E Q


Note that problern (2.2.12) is equivalent to
min{f(x)
However,

Q is bounded: for all x

f(xo) 2 f(x) 2 f(xo)

x E

f(x) ~ f(xo)}.

Q}.

{2.2.14)

Q we have

+ (f'(xo), x- xo) + ~ II x- xo

11 2

Hence, II x- xo II~ ~ II f'(xo) II


Thus, the solution x* of (2.2.14) (::: (2.2.12)) exists. Let us prove that
it is unique. Indeed, if xi is also a solution to (2.2.12), then

f*

= f(xi)

2 f(x*) + (J'(x*), xi

2 !* + ~ II xi - x*

- x*) + ~ II xi -

x*

11 2

11 2

(we have used Theorem 2.2.5). Therefore

xi =

x*.

86

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

2.2.3

Gradient mapping

In the constrained minimization problern the gradient of the objective


function should be treated differently as compared to the unconstrained
situation. In the previous section we have already seen that its role in
optimality conditions is changing. Moreover, we cannot use it anymore in
a gradient step since the result could be infeasible, etc. If we Iook at the
1 (Rn),
main properties of the gradient, which we have used for f E
we can see that two of them are of the most importance. The first
one is that the gradient step decreases the function value by an amount
comparable with the squared norm of the gradient:

.r'l'

lr,

f(x- tf'(x)) ~ f(x)-

And the second one is the inequality


{f'(x), x- x*) ;:::

II

f'(x)

f'(x)
II

11 2

11 2

It turns out that for constrained minimization problems we can in-

troduce an object that inherits the most important properties of the


gradient.
DEFINITION 2.2.3 Let us fix some

Denote

+ (f'(x), x- x) + ~

XQ(x; 'Y)

=arg min [f(x)

9Q(x; 'Y)

= r(x- xQ(x; 'Y)).

xEQ

r > 0.

II

x- x

11 2] ,

We call gQ(r, x) the gradient mapping of f on Q.

For Q

=Rn we have

xQ(x; r) = x- ~f'(x),

9Q(x; r) = J'(x).

Thus, the value ~ can be seen as a step size for the "gradient" step

x--+ XQ(X;[).
Note that the gradient mapping is weil defined in view of Theorem
2.2.6. Moreover, it is defined for all x ERn, not necessarily from Q.
Let us write down the main property of gradient mapping.
S~:l(Rn),

THEOREM 2.2. 7 Let f E

x E Q we have
f(x)

;::: f(xQ(x; 'Y))

+2~

II

r ;::: L

and

xE

Rn. Then for any

+ (gQ(x; 'Y), x- x)

9Q(x;r)

11 2

+~

(2.2.15)
II

x-x

11 2 .

87

Smooth convex optimization

Proof: Derrote XQ = xQ(/,x), 9Q = 9Q(/,x) and let


cp(x) = J(x)
Then cp'(x) = f'(x)

+ (f'(x), x- x) + ~

+ r(x- x),

II

x- x

11 2

and for any x E Q we have

(f'(x)- 9Q,x- XQ) = (cp'(xQ),x- XQ) ~ 0.


Hence,

f(x)- ~

II

x- x

11 2

> f(x) + (f'(x),x- x)


=

f(x)

+ (f'(x), xQ- x) + (f'(x), x- xQ)

> J(x) + (f'(x), xQ- x) + (gQ, x- xQ)


+(gQ, x- XQ)

cfJ(xQ)- ~

cfJ(xQ)- 2~

II

9Q

11 2

+(gQ, x- XQ)

cfJ(xQ)

+ 2~

II

9Q

11 2

+(gQ, x- x),

XQ- x

II

11 2

and cp(xQ) ~ f(xQ) since 1 ~ L.

COROLLARY

2.2.1 Let jE S~;i(Rn),

r~L

and

x ERn.

Then

(2.2.16)
(2.2.17)

Proof: Indeed, using (2.2.15) with x = x, we get (2.2.16). Using (2.2.15)


0
with x = x*, we get (2.2.17) since f(xQ(x; r)) ~ f(x*).

2.2.4

Minimization methods for simple sets

Let us show how we can use the gradient mapping for solving the
following problem:
min J(x),
xEQ

88

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

where f E s!:i(Rn) and Q is a closed convex set. We assume that


the set Q is simple enough, so the gradient mapping can be computed
explicitly. This assumption is valid for positive orthant, n dimensional
box, simplex, Euclidean ball and some other sets.
Let us start from the gradient method:

Gradient method for simple sets


0. Choose xo E Q.

(2.2.18)

1. kth iteration (k

0).

Xk+l = Xk- hgQ(Xk;L).

The efficiency analysis of this scheme is very similar to that of the


unconstrained version. Let us give an example of such a reasoning.
THEOREM

2.2.8 Let
II

S!;l(Rn).

Xk - x* 11 2 ~

lf in scheme {2.2.18) h =

(1- z)k II Xo- x*

!,

then

11 2 .

Proof: Denote rk =II xk- x* II, 9Q = 9Q(Xki L). Then, using inequality
(2.2.17), we obtain
Tf+l

II Xk - x* - hgQ 11 2 = Tf - 2h(gQ, Xk - x*)

<

(1-ht-t)rf+h(h-t) IIga 11 2 = (1-r)rf.

+ h2

II 9Q 11 2

Note that for the step size h =


Xk+l

t we have

= Xk- t9Q(Xkj L) = XQ(Xki L).

Consider now the optimal schemes. We give only a sketch of justification since it is very similar to that of Section 2.2.1.
First of all, we define the estimate sequence. Assurne that x 0 E Q.
Define
4>o(x) =
cPk+I(x) =

f(xo)

+ lll x- xo

{1- ak)cPk(x)

11 2 ,

+ ak[f(xQ(Yki L)) + A II 9Q(Yki L)

+(gQ(Yk;L),x- Yk)

+~

II x- Yk 11 2 ).

11 2

89

Smooth convex optimization

Note that the form ofthe recursive rule for cfJk(x) is changed. The reason
is that now we use inequality (2.2.15} instead of (2.1.16). However,
this modification does not change the analytical form of recursion and
therefore it is possible to keep all convergence results of Section 2.2.1.
Similarly, it is easy to see that the estimate sequence {cfJk(x)} can be
written as

with the following recursive rules for 'Yk, vk and

c/J'k:

'Yk+I

+ ( E- 2-y:~l)

II

gQ(Yki L)

+ak(~~:thk (~ II Yk -Vk
Further, assuming that c/J'k
f(xk)

II

+(gQ(Yk;L),vk -yk))

f(xk) and using the inequality

f(xQ(Yki L})

+A

11 2

11

+ (gQ(Yki L), Xk- Yk)

gQ(Yki L)

11 2

+~

Xk- Yk

II

11 2 ],

we come to the following lower bound:

c/J'k+l

(1- ak)f(xk)

+(

+ akf(xQ(Yki L))

n- 2-y:~~)

11

YQ(Yki L)

> f(xQ(Yki L)) + ( A-

11

27:~ 1 )

2
II

gQ(Yki L)

11 2

+(1- ak}(gQ(Yki L), ~=+~ (vk- Yk) + Xk- Yk)

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

90

Thus, again we can choose


Xk+I

Yk

= XQ(Yki L),

'Yk+letk/L

(O!k"fkVk

+ 'Yk+IXk)

Let us write down the corresponding variant of scheme (2.2.9).


Constant Step Scheme, II. Simple sets.

0. Choose xo ERn and ao E (0, 1).


Set Yo = xo and q =

z.

1. kth iteration (k

2 0).

a). Compute f(Yk) and f'(Yk) Set


(2.2.19)
b). Compute ak+l E (0, 1) from equation

a~+l = {1- ak+I)a~ + qak+ll

Clearly, the rate of convergence of this method is given by Theorem 2.2.3. In this scheme only points {xk} are feasible for Q. The
sequence {yk} is used for computing the gradient mapping and may be
infeasible.

2.3

Minimization problern with smooth


components

(Minimax problem: grodient mapping, gradient method, optimal methods; Problem with functional constraints; Methods for constrained minimization.)

2.3.1

Minimax problern

Very often the objective function of an optimization problern is composed by several components. For example, the reliability of a complex

91

Smooth convex optimization

system usually is defined as a minimal reliability of its parts. A constrained minimization problern with functional constraints provides us
with an example of interaction of several nonlinear functions, etc.
The siruplest problern of that type is called the minimax problem. In
this section we deal with the smooth minimax problem:
min
xEQ

[!(x)

= m.ax fi(x)]
l:O:::l:O:::m

{2.3.1)

where Ii E s~;}.(Rn), i = 1 ... m and Q is a closed convex set. We call


the function f(x) max-type function composed by the components fi(x).
We write f E S~;i_(Rn) if all components of function f belangtothat
class.
Note that, in general, f(x) is not differentiable. However, provided
that all fi are differentiable functions, we can introduce an object, which
behaves exactly as a linear approximation of a smooth function.
DEFINITION

2.3.1 Let f be a max-type function:

f(x) = m.ax fi(x).


1:5z:Sm

Function

f(x; x) = m.ax [fi(x) + (Jf{x), x- x)],


1:5z:5m

is called the linearization of f(x) at x.


Campare the following result with inequalities (2.1.16) and (2.1.6).
LEMMA

2.3.1 For any x ERn we have

f(x) ;:::: f(x; x)

+~

II

x- x

11 2 ,

{2.3.2}

f(x) ~ f(x; x)

+t

II

x- x

11 2

{2.3.3}

Proof: Indeed,

fi(x) ;:::: fi(x)

+ (Ji(x), x- x) + ~

II

x- x

11 2

(see (2.1.16)). Taking the maximum ofthis inequality in i, we get (2.3.2).


For {2.3.3) we use inequality

fi(x) ~ fi(x)
(see (2.1.6)).

+ (Jf(x), x- x) + t II x- x

11 2

92

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Let us write down the optimality conditions for problern {2.3.1) (compare with Theorem 2.2.5).
THEOREM

for any x

2.3.1 A point x*
Q we have

Q is a solution to {2.3.1} if and only if

f(x*; x) ~ f(x*; x*) = f(x*).

{2.3.4)

Proof: Indeed, if (2.3.4) is true, then


f(x) ~ f(x*; x) ~ f(x*; x*) = f(x*)
for all x E Q.
Let x* be a solution to (2.3.1). Assurne that there exists x E Q such
that f(x*; x) < f(x*). Consider the functions

</>i(a) = fi(x*
Note that for all i, 1 ~ i

fi(x*)

i = 1 ... m.

m, we have

+ (!}(x*), x- x*) < f(x*)

Therefore either </>i(O)

</>i(O)

+ a(x- x*)),

= m_ax fi(x*).
l~t~m

=fi(x*) < f(x*), or

= f(x*),

4>~(0)

= (ff(x*), x- x*) < 0.

Therefore, for a small enough we have

fi(x*
for all i, 1 ~ i

+ a(x- x*))

</>i(a) < f(x*)

m. That is a contradiction.

2.3.1 Let x* be a minimum of a max-type function f(x)


on the set Q. If f belongs to S~(Rn), then

COROLLARY

f(x) ~ f(x*)

+~

II

x- x*

11 2

for all x E Q.

Proof: Indeed, in view of (2.3.2) and Theorem 2.3.1, for any x


have
f(x)

> f(x*;x) + ~ II x- x*

>

f(x*;x*)

+~

II

x- x*

Q we

11 2
11 2 =

f(x*)

+~

II

x- x*

11 2

93

Smooth convex optimization

Finally, Iet us prove the existence theorem.


THEOREM

2.3.2 Let max-type function f(x) belong to S~(Rn) with J-t

>

0, and Q be a closed convex set. Then there exists a unique optimal


solution x* to the problern (2.3.1}.
Proof: Let x E Q. Consider the set Q = {x E Q I f(x) ~ f(x)}. Note
that the problern (2.3.1) is equivalent to
min{f(x)
But

Q is bounded:

for any x E

I x E Q}.

(2.3.5)

Q we have

J(x) 2:: /i(x) 2:: /i(x) + (ff(x), x- x) + ~

II

x- x

11 2 ,

consequently,
~ II x-

x 11 2 ~11 J'(x)

II II

x- x II

+J(x) -fi(x).

Thus, the solution x* of (2.3.5) (and of (2.3.1)) exists.


If xi is another solution to (2.3.1), then
f(x*) = f(xi) ~ f(x*; xi)

(by (2.3.2) ). Therefore

2.3.2

xi

+ ~ II xi -

x* 11 2 ~ f(x*)

+ ~ II xi -

= x*.

x*

11 2

Gradient mapping

In Section 2.2.3 we have introduced the gradient mapping, which replaces the gradient for a constrained minimization problern over a simple set. Since linearization of a max-type function behaves similarly to
linearization of a smooth function, we can try to adapt the notion of
gradient mapping to our particular situation.
Let us fix some 1 > 0 and x E Rn. Consider a max-type function
f(x). Denote
J'Y(x; x) = f(x; x) +! II x- x 11 2
The following definition is an extension of Definition 2.2.3.
DEFINITION

2.3.2 Define
f*(x;'Y)

min/'Y(x; x),

XJ(Xi/)

arg min/'Y(x; x),

YJ(x; 1)

/(X- Xf(Xi/)).

xEQ

xeQ

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

94

We call 9J(x; ')') gradient mapping of max-type function f on Q.


For m = 1 this definition is equivalent to Definition 2.2.3. Similarly,
the point of linearization x does not necessarily belong to Q.
It is clear that f 1 (x; x) is a max-type function composed by the components

fi(x)

+ (Jf(x), x- x) + ~

II x- x II 2 E S~;~(Rn),

i = 0 ... m.

Therefore the gradient mapping is well defined (Theorem 2.3.2).


Let us prove the main result of this section, which highlights the similarity between the properties of the gradient mapping and the properties
of the gradient (compare with Theorem 2.2.7).
THEOREM

2.3.3 Let f E S~:i(Rn). Then for all XE Q we have

f(x; x) ;:::: f*(x; 1')


Proof: Denote

Xf

+ (gJ(x; 1'), x- x) + 2~

= Xf(X;')'),

9!

II 9J(x; 1') 11 2

(2.3.6}

= 9J(x;')').

S~;~(Rn) and it is a max-type function.

It is clear that f 1 (x;x) E


Therefore all results of the

previous section can be applied also to f,y.


Since Xf = argmin j 1 (x;x), in view of Corollary 2.3.1 and Theorem
2.3.1 we have

xEQ

f(x;x)

! 1 (x;x)-

~ II x- x 11 2

> J,(x;xJ) + 1-(11 x- Xf 11 2

II x-

x 11 2 )

f*(x;')')

+ 1-(x- x 1,2(x- x) + x- x 1)

f*(x; 1')

+ (91, x- x) + 2~

II 9! 11 2

In what follows we often refer to the following corollary to Theorem


2.3.3.
2.3.2 Let f E S~;i(Rn) and ')';:::: L. Then:
1. For any x E Q and x E Rn we have

COROLLARY

f (X)

;::::

f (X f (X; ')')) + (9 f (X; ')') , X - X)


+2\ II 9J(X;')') 11 2 +~ II

X-

X 11 2 .

(2.3. 7)

95

Smooth convex optimization

2. lf x E Q, then

J(x,(x;'Y)) ~ J(x)- 2~
3. For any

x E Rn

II

9J(x;'Y)

{2.3.8}

11 2 ,

we have

{2.3.9}

Proof: Assumption 'Y 2: L implies that f*(x; 1) 2: f(xJ(x; 'Y)). Therefore (2.3.7) follows from (2.3.6) since

f(x) 2: f(x;x}

+ ~ II x- x

11 2

for all x ERn (see Lemma 2.3.1).


Using (2.3.7) with x = x, we get (2.3.8), and using (2.3.7) with x
we get (2.3.9) since f(x,(x;"(})- f(x*) 2: 0.

= x*,
0

Finally, Iet us estimate the variation of f* (x; 1) as a function of 1.


LEMMA

2.3.2 For any /1,

Proof: Denote
we have

J(x;x}

Xi

/2

> 0 and x ERn we have

= x J(x; /i), 9i = 9J(x; 1i), i = 1, 2.

+ ~ II x- x

11 2

In view of (2.3.6),

2: f*(x;'Yd + (91,x- x)

+ 2~1

II

91 11 2

+~

II X -

(2.3.10}
X 11 2

for all x E Q. In particular, for x = x2 we obtain

f*(x; 12) =

f(x; x2)

+~

II

x2-

x 11 2

>

J*(x;,l)

+ (91,x2- x) + 2~ 1

f*(x;11)

+ 2~ 1

II

91 11 2 -_;2 (91,92)

> J*(x;11) + 2~ 1

II

91 11 2 - 2~2 II 91 11 2 .

II

91 11 2 +~ II x2- x 11 2

+ 2;2

II

9211 2

96

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

2.3.3

Minimization methods for minimax problern

As usual, we start a presentation of numerical methods for problern


(2.3.1) from a "gradient" method with constant step:
Gradient method for minimaxproblern
0. Choose x 0 E Q and h
1. kth iteration (k

> 0.

(2.3.11)

2 0).

Xk+l = Xk- hgt(Xki L).

THEOREM

then

2.3.4 Let j E S~:l(Rn). lf in {2.3.11} we choose h $

II

Xk- x* 11 2 $ (1 - p,h)k

II

f,

xo- x* 11 2

Proof: Denote rk =II Xk- x* II, g = gf(Xki L). Then, in view of (2.3.9)
we have
r~+l

= II

Xk- x*- hgQ

11 2 = r~- 2h(g,xk- x*) + h 2 II

< (1- hJ-L)r~ + h (h-

t)

II g 11 2 $

11 2

(1 -p,h)r~.
0

Note that with h =


Xk+l

we have

= Xk- tgf(Xki L) = Xf(Xki L).

Forthis step size, the rate of convergence of scheme (2.3.11) is as follows:

Goroparing this result with Theorem 2.2.8, we see that for minimax
problern the gradient method has the same rate of convergence, as it has
in the smooth case.
Let us check, what the situation is with the optimal methods. Recall,
that in order to develop an optimal method, we need to introduce an
estimate sequence with some recursive updating rules. Formally, the
minimax problern differs from the unconstrained minimization problern
only by the form of lower approximation of the objective function. In
the case of unconstrained minimization, inequality (2.1.16) was used for

97

Smooth convex optimization

updating the estimate sequence. Now it must be replaced by inequality


(2.3.7).
Let us introduce an estimate sequence for problern (2.3.1). Let us
fix some xo E Q and 'Yo > 0. Consider the sequences {Yk} C Rn and
{ak} C (0, 1). Define
f(xo) + ~

c/Jo(x) =

II

x- xo

11 2 ,

(1- ak)<f>k(x)

cPk+I(x) =

+ak[' f(xJ(YkiL)) + fr II 9J(YkiL) 11 2 1


+(gj(Yki L), X- Yk}

+~

II

X- Yk

11 2 ].

Comparing these relations with (2.2.3), we can find the difference only
in the constant term (it is in the frame). In (2.2.3) this place was taken
by f(Yk) This difference Ieads to a trivial modification in the results
of Lemma 2.2.3: All inclusions of f(Yk) must be formally replaced by
the expression in the frame, and f'(Yk) must be replaced by 9J(Yki L).
Thus, we come to the following lemma.
LEMMA

2.3.3 For all k 2:: 0 we have


<l>k(x)

=<PZ + 1f

II

x- Vk

11 2 ,

where the sequences bk}, { vk} and {<Pk} are defined as follows: vo = xo,
<Po = f(xo) and
'Yk+l

(1- O:'khk

o2

+ O:'kiJ.,

I (

+~I 9! YkiL)

II 2

+ ok(~~:~hk {~ II Yk- Vk 11 2 +(gj(Yki L), Vk- Yk})


D

Now we can proceed exactly as in Section 2.2. Assurne that <Pie 2::
f(xk). Inequality (2.3.7) with x = Xk and x = Yk becomes
f(xk)

2:: f(xJ(YkiL))

+ (gJ(YkiL),xk- Yk)

+A II 9J(Yki L) 1 2 +~ II Xk- Yk

11 2

98

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Hence,

~ f(xJ(Yki L)) + ( A - 2:~ 1 )

II

+(1- ak)(gj(Yki L), ~(vkYk)


lk+l

9J(Yki L)

11 2

+ Xk- Yk).

Thus, again we can choose

Xj(Yki L),

Xk+l
La~

Yk

(1- ak)rk

+ akt-t = rk+l

1
1 d akf.k (akrkVk

+ rk+lxk)

Let us write down the resulting scheme in the form of (2.2.!)), with
eliminated sequences {vk} and {rk}.

Constant Step Scheme, II. Minimax.


0. Choose xo ERn and ao E (0, 1).
Set Yo = xo and q = t.
1. kth iteration (k 2 0).
a). Compute {fi(yk)} and {fi(yk)}. Set

(2.3.12)
b). Compute ak+l E (0, 1) from equation

a~+l = (1- ak+l)a~


and set k =

ak(l-ak)

a%+ak+l'

+ qak+l

99

Smooth convex optimization

The convergence analysis of this scheme is completely identical to that


of scheme (2.2.9}. Let us just give the result.
THEOREM

2.3.5 Let the max-type function f belong to S~;l(Rn). lf in

{2.3.12} we take ao 2::

Jii, then

f(xk)- f* S min{

(1- [i)k, ( 2 VI~Zy"YY) 2 }

x [f(xo)-

where "' =
tO

!* + ~ II xo- x*

11 2],

ao(aoL-JL).
1-ao

Note that the scheme (2.3.12) works for all11- ~ 0. Let us write down
the method for solving (2.3.1) with strictly convex components.

0. ehoose xo E Q. Set Yo

VI-{Ji
= xo, = VL+,fii'

(2.3.13}

1. kth iteration (k 2:: 0).

Compute {fi(Yk)} and {![(yk)}. Set

THEOREM

2.3.6 For scheme (2.3.13} we have

f(xk) - f* :5 2 ( 1

-{f) k (f(x 0) -

(2.3.14}

f*).

Jli.

Proof: Scheme (2.3.13) is a variant of {2.3.12) with a 0 =


Under
this choice, ro = 1-" and we get (2.3.14) from Theorem 2.3.5 since, in view
of Corollary 2.3.1, ~ II xo- x* 11 2 :5 f(xo)- f*.
0

100

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

To conclude this section, let us look at an auxiliary problern, which we


need to solve in order to cornpute the gradient rnapping of the rninirnax
problem. Recall, that this problern is as follows:
rnin { rn.ax [fi(xo) + (fi(xo), x- xo)] +
xEQ

l~t~m

II

x- xo

11 2 }.

Introducing the additional variables t E Rm, we can rewrite this problern


in the following way:
min {

fi(xo)

+ (ff(xo), x- xo)

t(i)

i=l

s. t.

+ ~ II x- xo

11 2 }

~ t(i), i = 1 ... m,

(2.3.15)

xEQ, tERm,

Note that if Q is a polytope, then the problern (2.3.15) is a quadratic


optirnization problem. This problern can be solved by sorne special finite
methods (simplex-type algorithms). It can be also solved by interior
point rnethods. In the latter case, we can treat rnuch rnore cornplicated
nonlinear structure of the set Q.

2.3.4
Optimization with functional constraints
Let us show that rnethods described in the previous section can be
used for solving a constrained minimization problern with smooth functional constraints. Recall, that the analytical form of such a problern is
as follows:
min fo(x),
s.t.

fi(x)

0, i = 1 ... m,

(2.3.16)

XE Q,

where the functions fi are convex and srnooth and Q is a closed convex
(Rn), i = 0 ... m, with sorne
set. In this section we assume that fi E
J1. > 0.
The relation between the problern (2.3.16) and rninirnax problerns
is established by sorne special function of one variable. Consider the
parametric rnax-type function

s!:l

f(t; x) = max{fo(x)- t; fi(x), i = 1 ... m},

t ER\ x E Q.

Let us introduce the function

f*(t) = rnin f(t; x).


xEQ

(2.3.17)

101

Smooth convex optimization

Note that the components of max-type function f(t; ) are strongly convex in x. Therefore, for any t E R 1 the solution of problern (2.3.17),
x*(t), exists and is unique in view of Theorem 2.3.2.
We will try to get close to the solution of (2.3.16) using a process
based on approximate values of function f* (t). This approach can be
seen as a variant of sequential quadratic optimization. It can be applied
also to nonconvex problems.
Let us establish some properties of function f* (t).
LEMMA

2.3.4 Let t* be an optimal value of problern (2.3.16}. Then

f* (t) < 0 for alt t ?:. t*,


f*(t)

>

< t*.

0 for all t

Proof: Let x* be a solution to (2.3.16). 1ft?:. t*, then


f*(t)

< f(t; x*)

= max{fo(x*) - t; ]i(x*)}

< max{ t* - t; fi(x*)}


Suppose that t

< t* and f*(t)


fo (y)

t < t*,

0.

0. Then there exists y E Q suchthat

Ii (y)

0, i = 1 ... m.

Thus, t* cannot be an optimal value of (2.3.16).

Thus, the smallest root of function f* (t) corresponds to the optimal


value of the problern (2.3.16). Note also, that using the methods of
the previous section, we can compute an approximate value of function
f*(t). Hence, our goal is to form a process of finding the root, based on
that information. However, forthat we need to know more properties of
function f*(t).
LEMMA

2.3.5 For any!:::..?:. 0 we have

j*(t)-!:::..

j*(t + !:::..)

f*(t).

102

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Proof: Indeed,

f*(t

+ )

min
xEQ

< min
xeQ

f*(t

+ )

~ax

{fo(x)- t- ; fi(x)}

~ax

{fo(x)- t; fi(x)} = f*(t),

l~z~m

l~z~m

min m.ax {fo(x)- t; fi(x) + }-


xEQ

> min
xeQ

l~z~m

~ax

l~z~m

{fo(x)- t; fi(x)}- = f*(t)- .


D

In other words, function f*(t) decreases in t and it is Lipschitz continuous with constant equal to 1.

LEMMA 2.3.6 For any t1 < t2 and b..

f*(tl - ) ~ f*(ti)

0 we have

+ f*(t~l=f:(t2).

(2.3.18}

Proof: Denote to = t1 - b.., a = t 2 ~to - t 2 -~+ E [0, 1]. Then


t1 = (1 - a)to + at2 and (2.3.18) can be written as
(2.3.19)
Let x 0 = (1- a)x*(to)

f*(ti)

+ ax*(t2).

Wehave

<
m.ax {(1- a)(fo(x*(to))- to) + a(fo(x*(t2))- t2);
< ~~~~m
(1- a)fi(x*(to))

+ afi(x*(t2))}

< (1- a) m.ax {fo(x*(to))- to; fi(x*(to))}


1:5~~m

(1- a)f*(to) + aj*(t2),

and we get (2.3.18).

103

Smooth convex optimization

Note that Lemmas 2.3.5 and 2.3.6 are valid for any parametric maxtype functions, not necessarily formed by functional components of problern (2.3.16).
Let us study now the properties of gradient mapping for a parametric
max-type function. To do that, Iet us introduce first a linearization of a
parametric max-type function f(t; x):

f(t; i; x) = m_ax {fo(i)


l~z~m

+ (J~(x), x- x)- t; fi(x) + (Jf(x), x- i) }.

Now we can introduce a gradient mapping in a standard way. Let us fix


some 'Y > 0. Denote

+! II x-i 11 2 ,

f'Y(t; i; x)

f(t;i;x)

f*(t; i; 'Y) =

min f,y(t; i; x)

Xf(t; i; "f)

argmin J'Y(t;i;x)

g 1(t;x;,)

'Y(i- x 1(t;x;,)).

xEQ

xEQ

We call 9t(t; i; "() the constrained gradient mapping of problern (2.3.16).


As usual, the point of linearization x is not necessarily feasible for Q.
Note that function J'Y(t; i; x) itself is a max-type function composed
by the components
fo(x) + (JfJ(x),x- x)- t + ~

fi(x)

II

x- x

11 2 ,

+ (JI(x), x- x) - t +! II x- x 11 2 ,

i = 1 ... m.

Moreover, f'Y(t; x; x) E S~;~(Rn). Therefore, for any t E R 1 the constrained gradient mapping is weil defined in view of Theorem 2.3.2.
Since f(t; x) E S~;L(Rn), we have

!JL(t; x; x) :::; f(t; x) :::; fL(t; x; x)


for all x E Rn. Therefore f*(t; i; ~-t) :::; f*(t) :::; f*(t; i; L). Moreover,
using Lemma 2.3.6, we obtain the following result:
For any i E Rn, "( > 0, !:l. ;:::: 0 and t1 < t2 and we have

f*(tt- !:l.; x; 'Y) ;:::: J*(t1; x; 'Y)

+ t 2 ~t 1 (J*(t1; x; 'Y)- j*(t2; x; 'Y)).

(2.3.20)

There are two values of "(, which are important for us. Theseare 'Y = L
and "( = 1-t Applying Lemma 2.3.2 to max-type function J'Y(t;i;x) with

104
/l

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

= L and 1 2 = J.L,

we obtain the following inequality:

f*(t; x; J.L) 2': f*(t; x; L)- ~;f il.9t(t; x; L)

11 2

(2.3.21)

Since we are interested in finding a root of the function f*(t), let us


describe the behavior of the roots of function j*(t;x;1), which can be
seen as an approximation of f*(t).
Denote

t*(x, t)

root tU*(t; x; J.L))

(notation root t (-) means the root in t of function ()).


LEMMA

2.3.7 Let x ERn and f < t* besuchthat

J*(f;x;J.L) 2: (1- "")f*(l;x;L)


for some ""E {0, 1). Then l < t*(x, f) S t*. Moreover, for any t < t and

x ERnwehave

f*(t; x; L) 2: 2(1- "")f*(t; x; L)

t-t

t (x,t)-t

Proof: Since t < t*, we have

o < f*(l) s f*(t; x; L) s

1 ~,J*(f; x; J.L).

Thus, f*(t; x; J.L) > 0 and, since f*(t; x; J.L) decreases in t, we get

t*(x, l) > l.
Denote ~ = f- t. Then, in view of (2.3.20), we have

f*(t;

x;

L) > f*(t) 2: f*(l; x; J.L) 2: f*(l; x; J.L)

>

(1- "") ( 1 +

+ t(x~)-tf*(t; x; J.L)

t(x~)-t) f*(t; x; L)
0

105

Smooth convex optimization

2.3.5

Method for constrained minimization

Now we areready to analyze the following process.


Constrained minimization scheme
0. Choose xo E Q, K E (0, !), to

< t* and accuracy E > 0.

1. kth iteration (k ~ 0).

a). Generate sequence {xk,j} by method (2.3.13) as


applied to f(tk; x) with starting point Xk,o = Xk lf

f*(tk; Xk,ji Jl.) ~ (1- K)j*(tk; Xk,ji L)

(2.3.22)

then stop the internal process and set j(k) = j,


j*(k)

=arg min

Xk+!

= XJ(tkiXk,j(k)iL).

O<
'< '(k)
_)_)

j*(tkiXk 3;L),
'

Global stop, if at some iteration of the internal


scheme we have j*(tkiXk,j;L) $ E.
b). Set tk+I = t*(xk,j(k) tk)
This is the first time in our course we meet a two-level process. Clearly,
its analysis is rather complicated. Firstly, we need to estimate the rate
of convergence of the upper-level process in (2.3.22) (it is called a master process). Secondly, we need to estimate the total complexity of the
internal processes in Step 1a). Since we are interested in the analytical complexity of this method, the arithmetical cost of computation of
t*(x, t) and j*(t; x, 'Y) is not important for us now.
Let us describe convergence of the master process.
LEMMA

2.3.8

L)<to-t[
1
]k .
f *(tx
k, k+1,
- 1-tt 2(1-tt)
Proof: Denote = 2 ( 1 ~~~:) ( < 1) and

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

106

Since tk+l = t*(xk,j(k) tk), in view of Lemma 2.3.7 for k ~ 1 we have


2(1-

<

K)f*(tkiXk,j(k)iL)
.jtk+l-tk
-

f*(tk-liXk-l,j(k-l)iL))
.jtk-tk-1

Thus, 6k $ bk-1 and we obtain

f*(tk; xk,j(k)i L) = 6k..Jtk+1- tk $ k6o..Jtk+l- tk

= kJ* (t o; Xo,j(O); L)

t,.+t-tk
tt -to

Further, in view of Lemma 2.3.5, we have t1 - to ~ f*(to; xo,j(o)i J.t).


Hence,

f*(tk; xk,j(k)i L)

< k f*(to; xo,j(O)i L)

/*(to;xo,j(O)il')

< ~V j*(to)(to- t*).


It remains to note that f*(to) $ to - t* (Lemma 2.3.5) and
j*(tk; xk+li L)

= f*(tk; xk,j*(k)i L)

$ f*(tk; xk,j(k)i L).


0

The above result provides us with an estimate for the nurober of


upper-level iterations, which are necessary to find an -Solution of problern (2.3.16). Indeed, let f*(tk; xk,ji L) $ . Then for x. = Xf(tk; xk,ji L)
we have

f(tk;x.) = 1 ~~{fo(x.) -tk;fi(x*)} $ f*(tkiXk,jiL) $


Since tk

s t*, we conclude that


fo(x.)

S t* + E,

fi(x*)

(2.3.23)
,

i = 1 .. . m.

In view of Lemma 2.3.8, we can get (2.3.23) at most in


1

t -t

N() = ln[2(1-~~:)] In (t~~:)e

(2.3.24)

full iterations of the master process (the last iteration of the process, in
general, is not full since it is terminated by the Global stop rule). Note
that in this estimate K is an absolute constant {for example, K = :\).

107

Smooth convex optimization

Let us analyze the complexity of the internal process. Let the sequence
{Xk,j} be generated by {2.3.13) with the starting point Xk,o = Xk In view
of Theorem 2.3.6, we have

j(tk; Xk,j)- j*(tk)

~ 2(1- [f)j (J(tkj Xk)-

f*(tk))

~ 2e-u-i(J(tkiXk)- f*(tk))

:5 2e-uj j(tk;xk),

jli.

where a =
Denote by N the number of full iterations of the process (2.3.22)
(N ~ N(E)). Thus, j(k) is defined for all k, 0 ~ k ~ N. Note that
tk = t*(xk-1,j(k-1) tk-1) > tk-1 Therefore

f(tk; Xk) ~ f(tk-1i Xk) ~ j*(tk-1i Xk-I,j(k-1) L).


Denote

k = f*(tk-Ii xk-l,j(k-1) L),

k 2: 1,

= f(to; xo).

Then, for all k 2: 0 we have

LEMMA 2.3.9 For all k, 0 ~ k ~ N, the intemal process works no


longer as the following condition is satisfied:

(2.3.25)

Proof: Assurne that (2.3.25) is satisfied. Then, in view of (2.3.8), we


have

A II 9j(tk; Xk,ji L 11 2

<

f(tk; Xk,j)- f(tk; Xj(tk; Xk,ji L))

< f(tkj Xk,j) - f*(tk)


Therefore, using (2.3.21), we obtain

f*(tkiXk,jiJ.t)

> f*(tkjXk,jiL)- ~;r "g,(tkiXk,jiL

11 2

> f*(tkiXk,jiL)- L?(J(tkiXk,j)- f*(tk))

> (1- K)j*(tk; Xk,ji L).

108

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

And that is the termination criterion of the internal process in Step la)
0
in (2.3.22).
The above result, combined with the estimate of the rate of convergence for the internal process, provide us with the total complexity
estimate of the constrained minimization scheme.
LEMMA 2.3.10

For alt k, 0
J'(k)

N, we have

< 1 + !I. ln 2(L-fL)Llk.


KJLLlk+l
VIi
-

Proof: Assurne that

fi{.

min f*(tk; Xk Ji L). Note that


'
O"Sj"Sj(k)
the stopping criterion ofthe internal process did not work for j = j(k)1. Therefore, in view of Lemma 2.3.9, we have

where a =

VL

Recall that k+l =

f*(tk; Xk,ji L) ~ LJL-;_H(f(tk; Xk,j)- f*(tk)) ~ 2LJL-;_He-uj k

<

k+l

That is a contradiction with the definition of k+ 1

COROLLARY

2.3.3

(N + 1) [1 + fi In~] + fi ln _Au_,
L: j(k) <
LlN+l
V Ii
KJL
V Ii
k=O
0

It remains to estimate the number of internal iterations in the last


step of the master process. Denote this number by j*.
LEMMA

2.3.11
'*

< 1 + fi.
VIi

J -

}n 2(L-JL)LlNtl.

KJL

Proof: The proof is very similar tothat of Lemma 2.3.10. Suppose that
j* _ l

> fi. ln 2(L-JL)LlNtl.


KJLf
V Ii

Note that for j = j*- 1 we have


E

<

J*(tN+liXN+l,jiL) ~ LJL7(f(tN+liXN+!,j)- f*(tN+I))

109

Smooth convex optimization

That is a contradiction.

COROLLARY 2.3.4

j*

+ Ej(k)::;
k==O

(N

+ 2) [1 +

lf In 2(~/>] + lf In~.

Let us put all things together. Substituting estimate (2.3.24) for the
number of full iterations N into the estimate of Corollary 2.3.4, we come
to the following bound for the total nurnber of internal iterations in the
process (2.3.22):
t -t
1
[ ln[2{1-!~:))ln
(1o-,.)e

+V'fiIi ln

+ 2] . [1 + V!IJj ln ~]
"11(2.3.26)

(t rn.ax {fo(xo) - to; fi(xo) }) .


I::;)$m

Note that method (2.3.13), which implernents the internal process, calls
the oracle of problern (2.3.16) at each iteration only once. Therefore,
we conclude that estirnate (2.3.26) is an upper cornplexity bound for the
problern {2.3.16) with E-solution defined by {2.3.23). Let us check, how
far this estirnate is frorn the lower bounds.
The principal terrn in estirnate {2.3.26) is of the order
ln to-t .
e

/I. In L.p,'

VJi

This value differs frorn the lower bound for an unconstrained minimizaThis rneans, that scherne (2.3.22) is at
tion problern by a factor of In
least suboptimal for constrained optimization problems. We cannot say
rnore since a specific lower complexity bound for constrained minimization is not known.
To conclude this section, let us answer two technical questions. Firstly, in scheme (2.3.22) we assume that we know some estimate t_0 < t*. This assumption is not binding, since we can choose t_0 equal to the optimal value of the minimization problem

min_{x∈Q} [ f_0(x_0) + ⟨f_0'(x_0), x − x_0⟩ + (μ/2) ‖x − x_0‖² ].

Clearly, this value is less than or equal to t*.

Secondly, we assume that we are able to compute t*(x̄, t). Recall that t*(x̄, t) is the root of the function

f*(t; x̄; μ) = min_{x∈Q} f_μ(t; x̄; x),

where f_μ(t; x̄; x) is the max-type function composed of the components

f_0(x̄) + ⟨f_0'(x̄), x − x̄⟩ + (μ/2) ‖x − x̄‖² − t,

f_i(x̄) + ⟨f_i'(x̄), x − x̄⟩ + (μ/2) ‖x − x̄‖²,  i = 1…m.

In view of Lemma 2.3.4, it is the optimal value of the following minimization problem:

min [ f_0(x̄) + ⟨f_0'(x̄), x − x̄⟩ + (μ/2) ‖x − x̄‖² ],
s.t. f_i(x̄) + ⟨f_i'(x̄), x − x̄⟩ + (μ/2) ‖x − x̄‖² ≤ 0, i = 1…m,
     x ∈ Q.

This problem is not a quadratic optimization problem, since the constraints are not linear. However, it can be solved in finite time by a simplex-type process, since the objective function and the constraints have the same Hessian. This problem can also be solved by interior-point methods.

Chapter 3

NONSMOOTH CONVEX OPTIMIZATION

3.1 General convex functions

(Equivalent definitions; Closed functions; Continuity of convex functions; Separation theorems; Subgradients; Computation rules; Optimality conditions.)

3.1.1 Motivation and definitions

In this chapter we consider methods for solving the general convex minimization problem

min f_0(x),
s.t. f_i(x) ≤ 0, i = 1…m,   (3.1.1)
     x ∈ Q,

where Q is a closed convex set and f_i(x), i = 0…m, are general convex functions. The term general means that these functions can be nondifferentiable. Clearly, such a problem is more difficult than a smooth one.

Note that nonsmooth minimization problems arise frequently in different applications. Quite often some components of a model are composed of max-type functions:

f(x) = max_{1≤j≤p} φ_j(x),

where the φ_j(x) are convex and differentiable. In the previous section we have seen that such a function can be treated by a gradient mapping. However, if the number of smooth components p in this function is very big, the computation of the gradient mapping becomes too expensive. Then it is reasonable to treat this max-type function as a general convex function. Another source of nondifferentiable functions is the situation when some components of the problem (3.1.1) are given implicitly, as solutions of auxiliary problems. Such functions are called functions with implicit structure. Very often these functions appear to be nonsmooth.

Let us start our considerations with the definition of a general convex function. In the sequel the term "general" is often omitted.

Denote by

dom f = {x ∈ R^n : |f(x)| < ∞}

the domain of the function f. We always assume that dom f ≠ ∅.

DEFINITION 3.1.1 A function f(x) is called convex if its domain is convex and for all x, y ∈ dom f and α ∈ [0, 1] the following inequality holds:

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).

We call f concave if −f is convex.

At this point, we are not ready to speak about any methods for solving (3.1.1). In the previous chapter, our optimization methods were based on gradients of smooth functions. For nonsmooth functions such objects do not exist, and we have to find something to replace them. However, in order to do that, we should first study the properties of general convex functions and justify the possibility of defining a generalized gradient. That is a long way, but we have to pass through it.

A straightforward consequence of Definition 3.1.1 is as follows.
LEMMA 3.1.1 (Jensen inequality) For any x_1, …, x_m ∈ dom f and coefficients α_1, …, α_m such that

Σ_{i=1}^m α_i = 1,  α_i ≥ 0, i = 1…m,   (3.1.2)

we have

f(Σ_{i=1}^m α_i x_i) ≤ Σ_{i=1}^m α_i f(x_i).

Proof: Let us prove this statement by induction in m. Definition 3.1.1 justifies the inequality for m = 2. Assume it is true for some m ≥ 2. For a set of m + 1 points we have

Σ_{i=1}^{m+1} α_i x_i = α_1 x_1 + (1 − α_1) Σ_{i=1}^m β_i x_{i+1},

where β_i = α_{i+1}/(1 − α_1). Clearly,

Σ_{i=1}^m β_i = 1,  β_i ≥ 0, i = 1…m.

Therefore, using Definition 3.1.1 and our inductive assumption, we have

f(Σ_{i=1}^{m+1} α_i x_i) ≤ α_1 f(x_1) + (1 − α_1) f(Σ_{i=1}^m β_i x_{i+1}) ≤ Σ_{i=1}^{m+1} α_i f(x_i). □

The point x = Σ_{i=1}^m α_i x_i with coefficients α_i satisfying (3.1.2) is called a convex combination of the points x_i.


Let us point out two important consequences of the Jensen inequality.

COROLLARY 3.1.1 Let x be a convex combination of points x_1, …, x_m. Then

f(x) ≤ max_{1≤i≤m} f(x_i).

Proof: Indeed, in view of the Jensen inequality and since α_i ≥ 0, Σ_{i=1}^m α_i = 1, we have

f(x) = f(Σ_{i=1}^m α_i x_i) ≤ Σ_{i=1}^m α_i f(x_i) ≤ max_{1≤i≤m} f(x_i). □

COROLLARY 3.1.2 Let

Δ = Conv{x_1, …, x_m} ≡ {x = Σ_{i=1}^m α_i x_i | α_i ≥ 0, Σ_{i=1}^m α_i = 1}.

Then max_{x∈Δ} f(x) ≤ max_{1≤i≤m} f(x_i). □
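The Jensen inequality is also easy to test numerically. A minimal sketch (assuming NumPy; the function f(x) = ‖x‖_1 and the random data are our own illustrative choices, not taken from the text):

    import numpy as np

    # Numerical check of the Jensen inequality (Lemma 3.1.1) for the convex
    # function f(x) = ||x||_1; the points and coefficients are random examples.
    def f(x):
        return np.sum(np.abs(x))

    rng = np.random.default_rng(0)
    m, n = 5, 3
    xs = rng.normal(size=(m, n))               # points x_1, ..., x_m
    alpha = rng.random(m)
    alpha /= alpha.sum()                       # coefficients satisfying (3.1.2)

    lhs = f(alpha @ xs)                        # f(sum_i alpha_i x_i)
    rhs = alpha @ np.array([f(x) for x in xs])
    assert lhs <= rhs + 1e-12                  # the Jensen inequality holds
    print(lhs, rhs)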

Let us give two equivalent definitions of convex functions.

THEOREM 3.1.1 A function f is convex if and only if for all x, y ∈ dom f and β ≥ 0 such that y + β(y − x) ∈ dom f, we have

f(y + β(y − x)) ≥ f(y) + β(f(y) − f(x)).   (3.1.3)

Proof: Let f be convex. Denote α = β/(1 + β) and u = y + β(y − x). Then

y = (1/(1 + β))(u + βx) = (1 − α)u + αx.

Therefore

f(y) ≤ (1 − α)f(u) + αf(x) = (1/(1 + β)) f(u) + (β/(1 + β)) f(x).

Multiplying this inequality by 1 + β and rearranging, we get (3.1.3).

Let (3.1.3) hold. Let us fix x, y ∈ dom f and α ∈ (0, 1]. Denote β = (1 − α)/α and u = αx + (1 − α)y. Then

x = (1/α)(u − (1 − α)y) = u + β(u − y).

Therefore, in view of (3.1.3),

f(x) ≥ f(u) + β(f(u) − f(y)) = (1/α) f(u) − ((1 − α)/α) f(y),

which gives αf(x) + (1 − α)f(y) ≥ f(u). □

THEOREM 3.1.2 A function f is convex if and only if its epigraph

epi(f) = {(x, t) ∈ dom f × R | t ≥ f(x)}

is a convex set.

Proof: Indeed, if (x_1, t_1) ∈ epi(f) and (x_2, t_2) ∈ epi(f), then for any α ∈ [0, 1] we have

αt_1 + (1 − α)t_2 ≥ αf(x_1) + (1 − α)f(x_2) ≥ f(αx_1 + (1 − α)x_2).

Thus, (αx_1 + (1 − α)x_2, αt_1 + (1 − α)t_2) ∈ epi(f).

Let epi(f) be convex. Note that for x_1, x_2 ∈ dom f

(x_1, f(x_1)) ∈ epi(f),  (x_2, f(x_2)) ∈ epi(f).

Therefore (αx_1 + (1 − α)x_2, αf(x_1) + (1 − α)f(x_2)) ∈ epi(f). That is

f(αx_1 + (1 − α)x_2) ≤ αf(x_1) + (1 − α)f(x_2). □

We also need the following property of level sets of convex functions.

THEOREM 3.1.3 If a function f is convex, then all its level sets

L_f(β) = {x ∈ dom f : f(x) ≤ β}

are either convex or empty.

Proof: Indeed, if x_1 ∈ L_f(β) and x_2 ∈ L_f(β), then for any α ∈ [0, 1] we have

f(αx_1 + (1 − α)x_2) ≤ αf(x_1) + (1 − α)f(x_2) ≤ αβ + (1 − α)β = β. □

We will see that the behavior of a general convex function on the boundary of its domain is sometimes out of any control. Therefore, let us introduce one convenient notion, which will be very useful in our analysis.

DEFINITION 3.1.2 A convex function f is called closed if its epigraph is a closed set.

As an immediate consequence of the definition we have the following result.

THEOREM 3.1.4 If a convex function f is closed, then all its level sets are either empty or closed.

Proof: By definition, (L_f(β), β) = epi(f) ∩ {(x, t) | t = β}. Therefore, the level set L_f(β) is closed and convex as an intersection of two closed convex sets. □

Note that if f is convex and continuous and its domain dom f is closed, then f is a closed function. However, in general, a closed convex function is not necessarily continuous.
Let us look at some examples of convex functions.

EXAMPLE 3.1.1 1. A linear function is closed and convex.

2. f(x) = |x|, x ∈ R^1, is closed and convex since its epigraph is

{(x, t) | t ≥ x, t ≥ −x},

the intersection of two closed convex sets (see Theorem 3.1.2).

3. All differentiable convex functions on R^n belong to the class of general closed convex functions.

4. The function f(x) = 1/x, x > 0, is convex and closed. However, its domain dom f = int R^1_+ is open.

5. The function f(x) = ‖x‖, where ‖·‖ is any norm, is closed and convex:

f(αx_1 + (1 − α)x_2) = ‖αx_1 + (1 − α)x_2‖ ≤ ‖αx_1‖ + ‖(1 − α)x_2‖ = α‖x_1‖ + (1 − α)‖x_2‖

for any x_1, x_2 ∈ R^n and α ∈ [0, 1]. The most important norms in numerical analysis are the so-called l_p-norms:

‖x‖_p = [Σ_{i=1}^n |x^{(i)}|^p]^{1/p},  p ≥ 1.

Among them, three norms are commonly used:

- The Euclidean norm: ‖x‖ = [Σ_{i=1}^n (x^{(i)})²]^{1/2}, p = 2.
- The l_1-norm: ‖x‖_1 = Σ_{i=1}^n |x^{(i)}|, p = 1.
- The l_∞-norm (Chebyshev norm, uniform norm, infinity norm): ‖x‖_∞ = max_{1≤i≤n} |x^{(i)}|.

Any norm defines a system of balls,

B_{‖·‖}(x_0, r) = {x ∈ R^n | ‖x − x_0‖ ≤ r},  r ≥ 0,

where r is the radius of the ball and x_0 ∈ R^n is its center. We call the ball B_{‖·‖}(0, 1) the unit ball of the norm ‖·‖. Clearly, these balls are convex sets (see Theorem 3.1.3). For l_p-balls of radius r we use the notation

B_p(x_0, r) = {x ∈ R^n | ‖x − x_0‖_p ≤ r}.

Note the following relation between Euclidean and l_1-balls:

B_1(x_0, r) ⊂ B_2(x_0, r) ⊂ B_1(x_0, r√n).

That is true because of the standard inequalities:

‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2.

6. Up to now, none of our examples has shown any pathological behavior. However, let us look at the following function of two variables:

f(x, y) = 0 if x² + y² < 1;  f(x, y) = φ(x, y) if x² + y² = 1,

where φ(x, y) is an arbitrary nonnegative function defined on the unit circle. The domain of this function is the unit Euclidean disk, which is closed and convex. Moreover, it is easy to see that f is convex. However, it has no reasonable properties on the boundary of its domain. Definitely, we want to exclude such functions from our considerations. That was the reason for introducing the notion of closed function. It is clear that f(x, y) is not closed unless φ(x, y) ≡ 0. □
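The standard inequalities above are easy to spot-check. A minimal sketch (assuming NumPy; the dimension and the sample size are arbitrary choices of ours):

    import numpy as np

    # Spot-check of the norm inequalities behind the ball inclusions
    # B_1(x0, r) c B_2(x0, r) c B_1(x0, r*sqrt(n)).
    rng = np.random.default_rng(1)
    n = 10
    for _ in range(1000):
        x = rng.normal(size=n)
        l1 = np.sum(np.abs(x))
        l2 = np.sqrt(np.sum(x**2))
        linf = np.max(np.abs(x))
        assert l2 <= l1 <= np.sqrt(n) * l2 + 1e-12   # standard inequalities
        assert linf <= l2 + 1e-12                    # and l_inf <= l_2 as well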

3.1.2 Operations with convex functions

In the previous section we have seen several examples of convex functions. Let us describe a set of invariant operations which allow us to create more complicated objects.

THEOREM 3.1.5 Let the functions f_1 and f_2 be closed and convex and let β ≥ 0. Then all functions below are closed and convex:

1. f(x) = βf_1(x), dom f = dom f_1.

2. f(x) = f_1(x) + f_2(x), dom f = (dom f_1) ∩ (dom f_2).

3. f(x) = max{f_1(x), f_2(x)}, dom f = (dom f_1) ∩ (dom f_2).

Proof:
1. The first item is evident:

f(αx_1 + (1 − α)x_2) ≤ β(αf_1(x_1) + (1 − α)f_1(x_2)) = αf(x_1) + (1 − α)f(x_2).

2. For all x_1, x_2 ∈ (dom f_1) ∩ (dom f_2) and α ∈ [0, 1] we have

f_1(αx_1 + (1 − α)x_2) + f_2(αx_1 + (1 − α)x_2)
   ≤ αf_1(x_1) + (1 − α)f_1(x_2) + αf_2(x_1) + (1 − α)f_2(x_2)
   = α(f_1(x_1) + f_2(x_1)) + (1 − α)(f_1(x_2) + f_2(x_2)).

Thus, f(x) is convex. Let us prove that it is closed. Consider a sequence {(x_k, t_k)} ⊂ epi(f):

lim_{k→∞} x_k = x̄ ∈ dom f,  lim_{k→∞} t_k = t̄.

Since f_1 and f_2 are closed, we have

lim inf_{k→∞} f_1(x_k) ≥ f_1(x̄),  lim inf_{k→∞} f_2(x_k) ≥ f_2(x̄).

Therefore

t̄ = lim_{k→∞} t_k ≥ lim inf_{k→∞} f_1(x_k) + lim inf_{k→∞} f_2(x_k) ≥ f(x̄).

Thus, (x̄, t̄) ∈ epi f.¹

3. The epigraph of the function f(x) is as follows:

epi f = {(x, t) | t ≥ f_1(x), t ≥ f_2(x), x ∈ (dom f_1) ∩ (dom f_2)} = epi f_1 ∩ epi f_2.

Thus, epi f is closed and convex as an intersection of two closed convex sets. It remains to use Theorem 3.1.2. □

¹ It is important to understand that a similar property for convex sets is not valid. Consider the following two-dimensional example: Q_1 = {(x, y) : y ≥ 1/x, x > 0}, Q_2 = {(x, y) : y = 0, x ≤ 0}. Both of these sets are convex and closed. However, their sum Q_1 + Q_2 = {(x, y) : y > 0} is convex and open.
The following theorem demonstrates that convexity is an affine-invariant property.

THEOREM 3.1.6 Let the function φ(y), y ∈ R^m, be convex and closed. Consider a linear operator

A(x) = Ax + b: R^n → R^m.

Then f(x) = φ(A(x)) is a closed convex function with domain

dom f = {x ∈ R^n | A(x) ∈ dom φ}.

Proof: For x_1 and x_2 from dom f denote y_1 = A(x_1), y_2 = A(x_2). Then for α ∈ [0, 1] we have

f(αx_1 + (1 − α)x_2) = φ(αy_1 + (1 − α)y_2) ≤ αφ(y_1) + (1 − α)φ(y_2) = αf(x_1) + (1 − α)f(x_2).

Thus, f(x) is convex. The closedness of its epigraph follows from the continuity of the linear operator A(x). □
The next theorem is one of the main suppliers of convex functions with implicit structure.

THEOREM 3.1.7 Let Δ be some set and

f(x) = sup_y {φ(y, x) | y ∈ Δ}.

Suppose that for any fixed y ∈ Δ the function φ(y, x) is closed and convex in x. Then f(x) is a closed and convex function with domain

dom f = {x ∈ ∩_{y∈Δ} dom φ(y, ·) | ∃γ: φ(y, x) ≤ γ ∀y ∈ Δ}.   (3.1.4)

Proof: Indeed, if x belongs to the right-hand side of equation (3.1.4), then f(x) < ∞ and we conclude that x ∈ dom f. If x does not belong to this set, then there exists a sequence {y_k} such that φ(y_k, x) → ∞. Therefore x does not belong to dom f.

Finally, it is clear that (x, t) ∈ epi f if and only if for all y ∈ Δ we have

x ∈ dom φ(y, ·),  t ≥ φ(y, x).

This means that

epi f = ∩_{y∈Δ} epi φ(y, ·).

Therefore f is convex and closed since each epi φ(y, ·) is convex and closed. □

Now we are ready to look at more sophisticated examples of convex functions.

EXAMPLE 3.1.2 1. The function f(x) = max_{1≤i≤n} {x^{(i)}} is closed and convex.

2. Let λ = (λ^{(1)}, …, λ^{(m)}) and let Δ be a set in R^m_+. Consider the function

f(x) = sup_{λ∈Δ} Σ_{i=1}^m λ^{(i)} f_i(x),

where the f_i are closed and convex. In view of Theorem 3.1.5, the epigraphs of the functions

φ_λ(x) = Σ_{i=1}^m λ^{(i)} f_i(x)

are convex and closed. Thus, f(x) is closed and convex in view of Theorem 3.1.7. Note that we did not assume anything about the structure of the set Δ.

3. Let Q be a convex set. Consider the function

ψ_Q(x) = sup{⟨g, x⟩ | g ∈ Q}.

The function ψ_Q(x) is called the support function of the set Q. Note that ψ_Q(x) is closed and convex in view of Theorem 3.1.7. This function is homogeneous of degree one:

ψ_Q(τx) = τψ_Q(x),  x ∈ dom ψ_Q, τ ≥ 0.

If the set Q is bounded, then dom ψ_Q = R^n.

4. Let Q be a set in R^n. Consider the function ψ(g, γ) = sup_{y∈Q} φ(y, g, γ), where

φ(y, g, γ) = ⟨g, y⟩ − (γ/2) ‖y‖².

The function ψ(g, γ) is closed and convex in (g, γ) in view of Theorem 3.1.7. Let us look at its properties.

If Q is bounded, then dom ψ = R^{n+1}. Consider the case Q = R^n. Let us describe the domain of ψ. If γ < 0, then for any g ≠ 0 we can take y_α = αg. Clearly, along this sequence φ(y_α, g, γ) → ∞ as α → ∞. Thus, dom ψ contains only points with γ ≥ 0.

If γ = 0, the only possible value for g is zero, since otherwise the function φ(y, g, 0) is unbounded.

Finally, if γ > 0, then the point maximizing φ(y, g, γ) with respect to y is y*(g, γ) = (1/γ)g, and we get the following expression for ψ:

ψ(g, γ) = ‖g‖²/(2γ).

Thus,

ψ(g, γ) = 0, if g = 0, γ = 0;  ψ(g, γ) = ‖g‖²/(2γ), if γ > 0,

with the domain dom ψ = (R^n × {γ > 0}) ∪ (0, 0). Note that this is a convex set which is neither closed nor open. Nevertheless, ψ is a closed convex function. At the same time, this function is not continuous at the origin: for any fixed g ≠ 0,

ψ(√γ g, γ) = (1/2) ‖g‖²  ↛ 0 = ψ(0, 0)  as γ ↓ 0. □
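This last effect is easy to observe numerically. A sketch of ψ for Q = R^n (assuming NumPy; the two paths to the origin are our own choice):

    import numpy as np

    # The closed convex function psi(g, gamma) = ||g||^2/(2*gamma) from
    # Example 3.1.2(4), with psi(0, 0) = 0, evaluated along two paths to the
    # origin.  Points with gamma = 0, g != 0 lie outside dom psi.
    def psi(g, gamma):
        if gamma == 0.0:
            return 0.0 if np.allclose(g, 0.0) else np.inf
        return np.dot(g, g) / (2.0 * gamma)

    c = np.array([1.0, 0.0])
    for gamma in [1e-1, 1e-2, 1e-3]:
        print(psi(gamma * c, gamma),           # -> 0: equals gamma/2
              psi(np.sqrt(gamma) * c, gamma))  # -> 1/2: stays away from psi(0,0)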

3.1.3 Continuity and differentiability

In the previous sections we have seen that the behavior of convex functions at the boundary points of the domain can be rather disappointing (see Examples 3.1.1(6), 3.1.2(4)). Fortunately, this is the only bad news about convex functions. In this section we will see that the structure of convex functions in the interior of the domain is very simple.

LEMMA 3.1.2 Let the function f be convex and x_0 ∈ int(dom f). Then f is locally upper bounded at x_0.

Proof: Let us choose some ε > 0 such that x_0 ± εe_i ∈ int(dom f), i = 1…n, where the e_i are the coordinate vectors of R^n. Denote

Δ = Conv{x_0 ± εe_i, i = 1…n}.

Let us show that Δ ⊇ B_2(x_0, ε̄) with ε̄ = ε/√n. Indeed, consider

x = x_0 + Σ_{i=1}^n h_i e_i,  Σ_{i=1}^n (h_i)² ≤ ε̄².

We can assume that h_i ≥ 0 (otherwise, in the above representation we can choose −e_i instead of e_i). Then

Σ_{i=1}^n h_i ≤ √n [Σ_{i=1}^n (h_i)²]^{1/2} ≤ √n ε̄ = ε.

Therefore for h̄_i = (1/ε) h_i we have

x = x_0 + Σ_{i=1}^n h̄_i · εe_i,  Σ_{i=1}^n h̄_i ≤ 1,

which is a convex combination of x_0 and the points x_0 + εe_i. Thus, using Corollary 3.1.2, we obtain

M ≡ max_{x∈B_2(x_0, ε̄)} f(x) ≤ max_{x∈Δ} f(x) ≤ max_{1≤i≤n} f(x_0 ± εe_i). □

Remarkably enough, this result implies the continuity of a convex function at any interior point of its domain.

THEOREM 3.1.8 Let f be convex and x_0 ∈ int(dom f). Then f is locally Lipschitz continuous at x_0.

Proof: Let B_2(x_0, ε) ⊆ dom f and sup{f(x) | x ∈ B_2(x_0, ε)} ≤ M (M is finite in view of Lemma 3.1.2). Consider y ∈ B_2(x_0, ε), y ≠ x_0. Denote

α = (1/ε) ‖y − x_0‖,  z = x_0 + (1/α)(y − x_0).

It is clear that α ≤ 1 and ‖z − x_0‖ = (1/α) ‖y − x_0‖ = ε. Therefore y = αz + (1 − α)x_0. Hence,

f(y) ≤ αf(z) + (1 − α)f(x_0) ≤ f(x_0) + α(M − f(x_0)) = f(x_0) + ((M − f(x_0))/ε) ‖y − x_0‖.

Further, denote u = x_0 + (1/α)(x_0 − y). Then ‖u − x_0‖ = ε and y = x_0 + α(x_0 − u). Therefore, in view of Theorem 3.1.1 we have

f(y) ≥ f(x_0) + α(f(x_0) − f(u)) ≥ f(x_0) − α(M − f(x_0)) = f(x_0) − ((M − f(x_0))/ε) ‖y − x_0‖.

Thus, |f(y) − f(x_0)| ≤ ((M − f(x_0))/ε) ‖y − x_0‖. □

Let us show that convex functions possess a property which is very close to differentiability.
DEFINITION 3.1.3 Let x ∈ dom f. We call f differentiable in a direction p at the point x if the following limit exists:

f'(x; p) = lim_{α↓0} (1/α)[f(x + αp) − f(x)].   (3.1.5)

The value f'(x; p) is called the directional derivative of f at x.

THEOREM 3.1.9 A convex function f is differentiable in any direction at any interior point of its domain.

Proof: Let x ∈ int(dom f). Consider the function

φ(α) = (1/α)[f(x + αp) − f(x)],  α > 0.

Let γ ∈ (0, 1] and let α ∈ (0, ε) be small enough to have x + εp ∈ dom f. Then

f(x + γαp) = f((1 − γ)x + γ(x + αp)) ≤ (1 − γ)f(x) + γf(x + αp).

Therefore

φ(γα) = (1/(γα))[f(x + γαp) − f(x)] ≤ (1/α)[f(x + αp) − f(x)] = φ(α).

Thus, φ(α) decreases as α ↓ 0. Let us choose γ > 0 small enough to have x − γp ∈ dom f. Then, in view of (3.1.3), we have

φ(α) ≥ (1/γ)[f(x) − f(x − γp)].

Hence, the limit in (3.1.5) exists. □

Let us prove that the directional derivative provides us with a global lower support of the convex function.

LEMMA 3.1.3 Let f be a convex function and x ∈ int(dom f). Then f'(x; p) is a convex function of p, which is homogeneous of degree 1. For any y ∈ dom f we have

f(y) ≥ f(x) + f'(x; y − x).   (3.1.6)

Proof: Let us prove that the directional derivative is homogeneous. Indeed, for p ∈ R^n and τ > 0 we have

f'(x; τp) = lim_{α↓0} (1/α)[f(x + ταp) − f(x)] = τ lim_{β↓0} (1/β)[f(x + βp) − f(x)] = τ f'(x; p).

Further, for any p_1, p_2 ∈ R^n and β ∈ [0, 1] we obtain

f'(x; βp_1 + (1 − β)p_2) ≤ lim_{α↓0} (1/α){β[f(x + αp_1) − f(x)] + (1 − β)[f(x + αp_2) − f(x)]}
                        = βf'(x; p_1) + (1 − β)f'(x; p_2).

Thus, f'(x; p) is convex in p. Finally, let α ∈ (0, 1], y ∈ dom f and y_α = x + α(y − x). Then, in view of Theorem 3.1.1, we have

f(y) = f(y_α + (1/α)(1 − α)(y_α − x)) ≥ f(y_α) + (1/α)(1 − α)[f(y_α) − f(x)],

and we get (3.1.6) taking the limit as α ↓ 0. □
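The monotonicity of the quotient φ(α) established in the proof of Theorem 3.1.9 can be observed numerically. A small sketch (assuming NumPy; the function, point and direction are arbitrary choices of ours):

    import numpy as np

    # For a convex f, phi(a) = (f(x + a*p) - f(x))/a is nondecreasing in a,
    # so it decreases as a -> 0 and approaches f'(x; p).
    f = lambda x: np.max(x) + 0.5 * np.dot(x, x)   # convex: max plus quadratic
    x = np.array([1.0, -2.0, 0.5])
    p = np.array([-1.0, 3.0, 2.0])

    phi = lambda a: (f(x + a * p) - f(x)) / a
    vals = [phi(a) for a in [1.0, 0.1, 0.01, 0.001, 1e-6]]
    assert all(vals[i] >= vals[i + 1] - 1e-9 for i in range(len(vals) - 1))
    print(vals)   # a decreasing sequence approaching f'(x; p)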

3.1.4 Separation theorems

Up to now we have been describing properties of convex functions in terms of function values. We did not introduce any directions which could be useful for constructing minimization schemes. In convex analysis such directions are defined by separation theorems, which form the subject of this section.

DEFINITION 3.1.4 Let Q be a convex set. We say that the hyperplane

H(g, γ) = {x ∈ R^n | ⟨g, x⟩ = γ},  g ≠ 0,

is supporting to Q if any x ∈ Q satisfies the inequality ⟨g, x⟩ ≤ γ.

We say that the hyperplane H(g, γ) separates a point x_0 from Q if

⟨g, x⟩ ≤ γ ≤ ⟨g, x_0⟩   (3.1.7)

for all x ∈ Q. If the second inequality in (3.1.7) is strict, we call the separation strict.

The separation theorems can be derived from the properties of projection.

DEFINITION 3.1.5 Let Q be a closed set and x_0 ∈ R^n. Denote

π_Q(x_0) = arg min{‖x − x_0‖ : x ∈ Q}.

We call π_Q(x_0) the projection of the point x_0 onto the set Q.


THEOREM 3.1.10 If Q is a convex set, then there exists a unique projection π_Q(x_0).

Proof: Indeed, π_Q(x_0) = arg min{φ(x) | x ∈ Q}, where the function φ(x) = (1/2) ‖x − x_0‖² belongs to S^{1,1}_{1,1}(R^n). Therefore π_Q(x_0) is unique and well defined in view of Theorem 2.2.6. □

It is clear that π_Q(x_0) = x_0 if and only if x_0 ∈ Q.

LEMMA 3.1.4 Let Q be a closed convex set and x_0 ∉ Q. Then for any x ∈ Q we have

⟨π_Q(x_0) − x_0, x − π_Q(x_0)⟩ ≥ 0.   (3.1.8)

Proof: Note that π_Q(x_0) is a solution to the minimization problem min_{x∈Q} φ(x) with φ(x) = (1/2) ‖x − x_0‖². Therefore, in view of Theorem 2.2.5 we have

⟨φ'(π_Q(x_0)), x − π_Q(x_0)⟩ ≥ 0

for all x ∈ Q. It remains to note that φ'(x) = x − x_0. □

Finally, we need a kind of triangle inequality for the projection.

LEMMA 3.1.5 For any x ∈ Q we have

‖x − π_Q(x_0)‖² + ‖π_Q(x_0) − x_0‖² ≤ ‖x − x_0‖².

Proof: Indeed, in view of (3.1.8), we have

‖x − π_Q(x_0)‖² − ‖x − x_0‖² = ⟨x_0 − π_Q(x_0), 2x − π_Q(x_0) − x_0⟩ ≤ −‖x_0 − π_Q(x_0)‖². □
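For a simple set, the projection and inequality (3.1.8) are easy to check in code. A sketch for a box Q = {l ≤ x ≤ u} (assuming NumPy; the box bounds and the sample are examples of ours):

    import numpy as np

    # Projection onto a box is coordinate-wise clipping; we then verify the
    # variational inequality (3.1.8) at random points of Q.
    def project_box(x0, l, u):
        return np.minimum(np.maximum(x0, l), u)

    rng = np.random.default_rng(2)
    l, u = -np.ones(4), np.ones(4)
    x0 = rng.normal(scale=3.0, size=4)
    p = project_box(x0, l, u)
    for _ in range(100):
        x = rng.uniform(l, u)                     # an arbitrary point of Q
        assert np.dot(p - x0, x - p) >= -1e-12    # inequality (3.1.8)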

Now we can prove the separation theorems. We will need two of them. The first one describes our possibilities in strict separation.

THEOREM 3.1.11 Let Q be a closed convex set and x_0 ∉ Q. Then there exists a hyperplane H(g, γ) which strictly separates x_0 from Q. Namely, we can take

g = x_0 − π_Q(x_0) ≠ 0,  γ = ⟨x_0 − π_Q(x_0), π_Q(x_0)⟩.

Proof: Indeed, in view of (3.1.8), for any x ∈ Q we have

⟨x_0 − π_Q(x_0), x⟩ ≤ ⟨x_0 − π_Q(x_0), π_Q(x_0)⟩ = ⟨x_0 − π_Q(x_0), x_0⟩ − ‖x_0 − π_Q(x_0)‖² < ⟨x_0 − π_Q(x_0), x_0⟩. □

Let us give an example of an application of the above theorem.

COROLLARY 3.1.3 Let Q_1 and Q_2 be two closed convex sets.

1. If for any g ∈ dom ψ_{Q_2} we have ψ_{Q_1}(g) ≤ ψ_{Q_2}(g), then Q_1 ⊆ Q_2.

2. Let dom ψ_{Q_1} = dom ψ_{Q_2}, and for any g ∈ dom ψ_{Q_1} let ψ_{Q_1}(g) = ψ_{Q_2}(g). Then Q_1 = Q_2.

Proof: 1. Assume that there exists x_0 ∈ Q_1 which does not belong to Q_2. Then, in view of Theorem 3.1.11, there exists a direction g such that

⟨g, x_0⟩ > γ ≥ ⟨g, x⟩

for all x ∈ Q_2. Hence, g ∈ dom ψ_{Q_2} and ψ_{Q_1}(g) > ψ_{Q_2}(g). That is a contradiction.

2. In view of the first statement, Q_1 ⊆ Q_2 and Q_2 ⊆ Q_1. Therefore, Q_1 = Q_2. □

The next separation theorem deals with boundary points of convex sets.

THEOREM 3.1.12 Let Q be a closed convex set and let x_0 belong to the boundary of Q. Then there exists a hyperplane H(g, γ) supporting to Q and passing through x_0.

(Such a vector g is called supporting to Q at x_0.)

Proof: Consider a sequence {y_k} such that y_k ∉ Q and y_k → x_0. Denote

g_k = (y_k − π_Q(y_k))/‖y_k − π_Q(y_k)‖,  γ_k = ⟨g_k, π_Q(y_k)⟩.

In view of Theorem 3.1.11, for all x ∈ Q we have

⟨g_k, x⟩ ≤ γ_k ≤ ⟨g_k, y_k⟩.   (3.1.9)

However, ‖g_k‖ = 1 and the sequence {γ_k} is bounded:

|γ_k| = |⟨g_k, π_Q(y_k) − x_0⟩ + ⟨g_k, x_0⟩| ≤ ‖π_Q(y_k) − x_0‖ + ‖x_0‖ ≤ ‖y_k − x_0‖ + ‖x_0‖,

where the middle step uses Lemma 3.1.5. Therefore, without loss of generality we can assume that there exist g* = lim_{k→∞} g_k and γ* = lim_{k→∞} γ_k. It remains to take the limit in (3.1.9). □

3.1.5 Subgradients

Now we are completely ready to introduce an extension of the notion of gradient.

DEFINITION 3.1.6 Let f be a convex function. A vector g is called a subgradient of the function f at a point x_0 ∈ dom f if for any x ∈ dom f we have

f(x) ≥ f(x_0) + ⟨g, x − x_0⟩.   (3.1.10)

The set of all subgradients of f at x_0, ∂f(x_0), is called the subdifferential of the function f at the point x_0.

The necessity of the notion of subdifferential is clear from the following example.

EXAMPLE 3.1.3 Consider the function f(x) = |x|, x ∈ R^1. For all y ∈ R^1 and g ∈ [−1, 1] we have

f(y) = |y| ≥ g·y = f(0) + g·(y − 0).

Therefore, the subgradient of f at x = 0 is not unique. In our example it is the whole segment [−1, 1]. □

The whole set of inequalities (3.1.10), x ∈ dom f, can be seen as a set of linear constraints defining the set ∂f(x_0). Therefore, by definition, the subdifferential is a closed convex set.

Note that the subdifferentiability of a function implies convexity.

LEMMA 3.1.6 Let the subdifferential ∂f(x) be nonempty for any x ∈ dom f. Then f is a convex function.

Proof: Indeed, let x, y ∈ dom f, α ∈ [0, 1]. Consider y_α = x + α(y − x). Let g ∈ ∂f(y_α). Then

f(y) ≥ f(y_α) + ⟨g, y − y_α⟩ = f(y_α) + (1 − α)⟨g, y − x⟩,
f(x) ≥ f(y_α) + ⟨g, x − y_α⟩ = f(y_α) − α⟨g, y − x⟩.

Adding these inequalities multiplied by α and (1 − α) respectively, we get

αf(y) + (1 − α)f(x) ≥ f(y_α). □

On the other hand, we can prove a converse statement.

THEOREM 3.1.13 Let f be closed and convex and x_0 ∈ int(dom f). Then ∂f(x_0) is a nonempty bounded set.

Proof: Note that the point (f(x_0), x_0) belongs to the boundary of epi(f). Hence, in view of Theorem 3.1.12, there exists a hyperplane supporting to epi(f) at (f(x_0), x_0):

−ατ + ⟨d, x⟩ ≤ −αf(x_0) + ⟨d, x_0⟩   (3.1.11)

for all (τ, x) ∈ epi(f). Note that we can take

‖d‖² + α² = 1.   (3.1.12)

Since for all τ ≥ f(x_0) the point (τ, x_0) belongs to epi(f), we conclude that α ≥ 0.

Recall that a convex function is locally upper bounded in the interior of its domain (Lemma 3.1.2). This means that there exist some ε > 0 and M > 0 such that B_2(x_0, ε) ⊆ dom f and

f(x) − f(x_0) ≤ M ‖x − x_0‖

for all x ∈ B_2(x_0, ε). Therefore, in view of (3.1.11), for any x from this ball we have

⟨d, x − x_0⟩ ≤ α(f(x) − f(x_0)) ≤ αM ‖x − x_0‖.

Choosing x = x_0 + εd we get ‖d‖² ≤ Mα‖d‖. Thus, in view of the normalizing condition (3.1.12) we obtain

α ≥ 1/(1 + M²)^{1/2}.

Hence, choosing g = d/α we get

f(x) ≥ f(x_0) + ⟨g, x − x_0⟩

for all x ∈ dom f.

Finally, if g ∈ ∂f(x_0), g ≠ 0, then choosing x = x_0 + εg/‖g‖ we obtain

ε ‖g‖ = ⟨g, x − x_0⟩ ≤ f(x) − f(x_0) ≤ M ‖x − x_0‖ = Mε.

Thus, ∂f(x_0) is bounded. □

Let us show that the conditions of the above theorem cannot be relaxed.

EXAMPLE 3.1.4 Consider the function f(x) = −√x with the domain {x ∈ R^1 | x ≥ 0}. This function is convex and closed, but the subdifferential does not exist at x = 0. □

Let us establish an important relation between the subdifferential and the directional derivative of a convex function.
the directional derivative of convex function.
THEOREM 3.1.14 Let f be a closed convex function. For any x_0 ∈ int(dom f) and p ∈ R^n we have

f'(x_0; p) = max{⟨g, p⟩ | g ∈ ∂f(x_0)}.

Proof: Note that

f'(x_0; p) = lim_{α↓0} (1/α)[f(x_0 + αp) − f(x_0)] ≥ ⟨g, p⟩,   (3.1.13)

where g is an arbitrary vector from ∂f(x_0). Therefore, the subdifferential of the function f'(x_0; ·) at p = 0 is not empty and ∂f(x_0) ⊆ ∂_p f'(x_0; 0). On the other hand, since f'(x_0; p) is convex in p, in view of Lemma 3.1.3, for any y ∈ dom f we have

f(y) ≥ f(x_0) + f'(x_0; y − x_0) ≥ f(x_0) + ⟨g, y − x_0⟩,

where g ∈ ∂_p f'(x_0; 0). Thus, ∂_p f'(x_0; 0) ⊆ ∂f(x_0), and we conclude that ∂f(x_0) = ∂_p f'(x_0; 0).

Consider g_p ∈ ∂_p f'(x_0; p). Then, in view of inequality (3.1.6), for all v ∈ R^n and τ > 0 we have

τ f'(x_0; v) = f'(x_0; τv) ≥ f'(x_0; p) + ⟨g_p, τv − p⟩.

Considering τ → ∞, we conclude that

f'(x_0; v) ≥ ⟨g_p, v⟩,   (3.1.14)

and, considering τ → 0, we obtain

f'(x_0; p) − ⟨g_p, p⟩ ≤ 0.   (3.1.15)

However, inequality (3.1.14) implies that g_p ∈ ∂_p f'(x_0; 0). Therefore, comparing (3.1.13) and (3.1.15), we conclude that ⟨g_p, p⟩ = f'(x_0; p). □
To conclude this section, let us point out several properties of subgradients which are of main importance for optimization. Let us start with the optimality condition.

THEOREM 3.1.15 We have f(x*) = min_{x∈dom f} f(x) if and only if 0 ∈ ∂f(x*).

Proof: Indeed, if 0 ∈ ∂f(x*), then f(x) ≥ f(x*) + ⟨0, x − x*⟩ = f(x*) for all x ∈ dom f. On the other hand, if f(x) ≥ f(x*) for all x ∈ dom f, then 0 ∈ ∂f(x*) in view of Definition 3.1.6. □

The next result forms a basis for cutting plane optimization schemes.

THEOREM 3.1.16 For any x_0 ∈ dom f, all vectors g ∈ ∂f(x_0) are supporting to the level set L_f(f(x_0)):

⟨g, x_0 − x⟩ ≥ 0  ∀x ∈ L_f(f(x_0)) = {x ∈ dom f : f(x) ≤ f(x_0)}.

Proof: Indeed, if f(x) ≤ f(x_0) and g ∈ ∂f(x_0), then

f(x_0) + ⟨g, x − x_0⟩ ≤ f(x) ≤ f(x_0). □

COROLLARY 3.1.4 Let Q ⊆ dom f be a closed convex set, x_0 ∈ Q and

x* = arg min{f(x) | x ∈ Q}.

Then for any g ∈ ∂f(x_0) we have ⟨g, x_0 − x*⟩ ≥ 0.

3.1.6 Computing subgradients

In the previous section we introduced subgradients, the objects we are going to use in minimization schemes. However, in order to apply such schemes in practice, we need to be sure that these objects are computable. In this section we present the corresponding computation rules.

LEMMA 3.1.7 Let f be closed and convex. Assume that it is differentiable on its domain. Then ∂f(x) = {f'(x)} for any x ∈ int(dom f).

Proof: Let us fix some x ∈ int(dom f). Then, in view of Theorem 3.1.14, for any direction p ∈ R^n and any g ∈ ∂f(x) we have

⟨f'(x), p⟩ = f'(x; p) ≥ ⟨g, p⟩.

Changing the sign of p, we conclude that ⟨f'(x), p⟩ = ⟨g, p⟩ for all g from ∂f(x). Finally, considering p = e_k, k = 1…n, we get g = f'(x). □

Let us provide all operations with convex functions described in Section 3.1.2 with corresponding rules for updating subgradients.

LEMMA 3.1.8 Let the function f(y) be closed and convex with dom f ⊆ R^m. Consider a linear operator

A(x) = Ax + b: R^n → R^m.

Then φ(x) = f(A(x)) is a closed convex function with domain dom φ = {x | A(x) ∈ dom f}. For any x ∈ int(dom φ) we have

∂φ(x) = A^T ∂f(A(x)).

Proof: We have already proved the first part of this lemma in Theorem 3.1.6. Let us prove the relation for the subdifferential. Indeed, let y_0 = A(x_0). Then for all p ∈ R^n we have

φ'(x_0; p) = f'(y_0; Ap) = max{⟨g, Ap⟩ | g ∈ ∂f(y_0)} = max{⟨ḡ, p⟩ | ḡ ∈ A^T ∂f(y_0)}.

Using Theorem 3.1.14 and Corollary 3.1.3, we get ∂φ(x_0) = A^T ∂f(y_0). □
LEMMA 3.1.9 Let f_1(x) and f_2(x) be closed convex functions and α_1, α_2 ≥ 0. Then the function f(x) = α_1 f_1(x) + α_2 f_2(x) is closed and convex, and

∂f(x) = α_1 ∂f_1(x) + α_2 ∂f_2(x)   (3.1.16)

for any x from int(dom f) = int(dom f_1) ∩ int(dom f_2).

Proof: In view of Theorem 3.1.5, we need to prove only the relation for the subdifferentials. Consider x_0 ∈ int(dom f_1) ∩ int(dom f_2). Then, for any p ∈ R^n we have

f'(x_0; p) = max{⟨g_1, α_1 p⟩ | g_1 ∈ ∂f_1(x_0)} + max{⟨g_2, α_2 p⟩ | g_2 ∈ ∂f_2(x_0)}
          = max{⟨α_1 g_1 + α_2 g_2, p⟩ | g_1 ∈ ∂f_1(x_0), g_2 ∈ ∂f_2(x_0)}
          = max{⟨g, p⟩ | g ∈ α_1 ∂f_1(x_0) + α_2 ∂f_2(x_0)}.

Note that both ∂f_1(x_0) and ∂f_2(x_0) are bounded. Hence, using Theorem 3.1.14 and Corollary 3.1.3, we get (3.1.16). □
LEMMA 3.1.10 Let the functions f_i(x), i = 1…m, be closed and convex. Then the function f(x) = max_{1≤i≤m} f_i(x) is also closed and convex. For any x ∈ int(dom f) = ∩_{i=1}^m int(dom f_i) we have

∂f(x) = Conv{∂f_i(x) | i ∈ I(x)},   (3.1.17)

where I(x) = {i : f_i(x) = f(x)}.

Proof: Again, in view of Theorem 3.1.5, we need to justify only the rule for the subdifferentials. Consider x ∈ ∩_{i=1}^m int(dom f_i). Assume that I(x) = {1, …, k}. Then for any p ∈ R^n we have

f'(x; p) = max_{1≤i≤k} f_i'(x; p) = max_{1≤i≤k} max{⟨g_i, p⟩ | g_i ∈ ∂f_i(x)}.

Note that for any set of values a_1, …, a_k we have

max_{1≤i≤k} a_i = max{Σ_{i=1}^k λ_i a_i | {λ_i} ∈ Δ_k},

where Δ_k = {λ : λ_i ≥ 0, Σ_{i=1}^k λ_i = 1} is the k-dimensional standard simplex. Therefore,

f'(x; p) = max{⟨g, p⟩ | g = Σ_{i=1}^k λ_i g_i, g_i ∈ ∂f_i(x), {λ_i} ∈ Δ_k}
        = max{⟨g, p⟩ | g ∈ Conv{∂f_i(x), i ∈ I(x)}}. □

The last rule can be useful for computing some elements of the subdifferential.

LEMMA 3.1.11 Let Δ be a set and f(x) = sup{φ(y, x) | y ∈ Δ}. Suppose that for any fixed y ∈ Δ the function φ(y, x) is closed and convex in x. Then f(x) is closed and convex. Moreover, for any x from

dom f = {x ∈ R^n | ∃γ: φ(y, x) ≤ γ ∀y ∈ Δ}

we have

∂f(x) ⊇ Conv{∂_x φ(y, x) | y ∈ I(x)},

where I(x) = {y | φ(y, x) = f(x)}.

Proof: In view of Theorem 3.1.7, we have to prove only the inclusion. Indeed, for any x_0 ∈ dom f, y_0 ∈ I(x_0) and g ∈ ∂_x φ(y_0, x_0) we have

f(x) ≥ φ(y_0, x) ≥ φ(y_0, x_0) + ⟨g, x − x_0⟩ = f(x_0) + ⟨g, x − x_0⟩. □

Now we can look at some examples of subdifferentials.

EXAMPLE 3.1.5 1. Let f(x) = |x|, x ∈ R^1. Then ∂f(0) = [−1, 1] since

f(x) = max_{−1≤g≤1} g·x.

2. Consider the function f(x) = Σ_{i=1}^m |⟨a_i, x⟩ − b_i|. Denote

I_−(x) = {i : ⟨a_i, x⟩ − b_i < 0},
I_+(x) = {i : ⟨a_i, x⟩ − b_i > 0},
I_0(x) = {i : ⟨a_i, x⟩ − b_i = 0}.

Then

∂f(x) = Σ_{i∈I_+(x)} a_i − Σ_{i∈I_−(x)} a_i + Σ_{i∈I_0(x)} [−a_i, a_i].

3. Consider the function f(x) = max_{1≤i≤n} x^{(i)}. Denote I(x) = {i : x^{(i)} = f(x)}. Then ∂f(x) = Conv{e_i | i ∈ I(x)}. For x = 0 we have

∂f(0) = Conv{e_i | 1 ≤ i ≤ n} = Δ_n.

4. For the Euclidean norm f(x) = ‖x‖ we have

∂f(0) = B_2(0, 1) = {x ∈ R^n | ‖x‖ ≤ 1},  ∂f(x) = {x/‖x‖}, x ≠ 0.

5. For the l_1-norm f(x) = ‖x‖_1 = Σ_{i=1}^n |x^{(i)}| we have

∂f(0) = B_∞(0, 1) = {x ∈ R^n | max_{1≤i≤n} |x^{(i)}| ≤ 1},

∂f(x) = Σ_{i∈I_+(x)} e_i − Σ_{i∈I_−(x)} e_i + Σ_{i∈I_0(x)} [−e_i, e_i],  x ≠ 0,

where I_+(x) = {i : x^{(i)} > 0}, I_−(x) = {i : x^{(i)} < 0} and I_0(x) = {i : x^{(i)} = 0}.

We leave the justification of these examples as an exercise for the reader. □
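Example 3.1.5(2) translates directly into a computation rule. A minimal sketch (assuming NumPy; A, b and the test point are placeholders of ours), which also re-checks the produced vector against Definition 3.1.6:

    import numpy as np

    # One subgradient of f(x) = sum_i |<a_i, x> - b_i|: take sign(r_i)*a_i for
    # nonzero residuals and, from [-a_i, a_i], the zero vector otherwise.
    def subgrad(A, b, x, tol=1e-10):
        r = A @ x - b
        return A.T @ np.where(r > tol, 1.0, np.where(r < -tol, -1.0, 0.0))

    A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
    b = np.array([0.0, 1.0, -1.0])
    x = np.array([0.0, 1.0])
    g = subgrad(A, b, x)

    # Check the subgradient inequality (3.1.10) at random points.
    f = lambda z: np.sum(np.abs(A @ z - b))
    rng = np.random.default_rng(3)
    for _ in range(100):
        y = x + rng.normal(size=2)
        assert f(y) >= f(x) + g @ (y - x) - 1e-12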


We conclude this section with an example of application of the above technique for deriving an optimality condition for a smooth minimization problem with functional constraints.

THEOREM 3.1.17 (Kuhn-Tucker) Let the f_i be differentiable convex functions, i = 0…m. Suppose that there exists a point x̄ such that f_i(x̄) < 0 for all i = 1…m (Slater condition).

A point x* is a solution to the problem

min{f_0(x) | f_i(x) ≤ 0, i = 1…m}   (3.1.18)

if and only if it is feasible and there exist nonnegative numbers λ_i, i ∈ I*, such that

f_0'(x*) + Σ_{i∈I*} λ_i f_i'(x*) = 0,

where I* = {i ∈ [1, m] : f_i(x*) = 0}.

Proof: In view of Lemma 2.3.4, x* is a solution to (3.1.18) if and only if it is a global minimizer of the function

φ(x) = max{f_0(x) − f*; f_i(x), i = 1…m}.

In view of Theorem 3.1.15, this is the case if and only if 0 ∈ ∂φ(x*). Further, in view of Lemma 3.1.10, this is true if and only if there exist nonnegative λ̄_i such that

λ̄_0 f_0'(x*) + Σ_{i∈I*} λ̄_i f_i'(x*) = 0,  λ̄_0 + Σ_{i∈I*} λ̄_i = 1.

Thus, we need to prove only that λ̄_0 > 0. Indeed, if λ̄_0 = 0, then

0 > Σ_{i∈I*} λ̄_i f_i(x̄) ≥ Σ_{i∈I*} λ̄_i [f_i(x*) + ⟨f_i'(x*), x̄ − x*⟩] = 0.

This contradicts the Slater condition. Therefore λ̄_0 > 0, and we can take λ_i = λ̄_i/λ̄_0, i ∈ I*. □

Theorem 3.1.17 is very useful for solving simple optimization problems.

LEMMA 3.1.12 Let A ≻ 0. Then

max_x {⟨c, x⟩ : ⟨Ax, x⟩ ≤ 1} = ⟨A^{−1}c, c⟩^{1/2}.

Proof: Note that all conditions of Theorem 3.1.17 are satisfied and the solution x* of the above problem is attained at the boundary of the feasible set. Therefore, in accordance with Theorem 3.1.17, we have to solve the following equations:

c = λAx*,  ⟨Ax*, x*⟩ = 1.

This gives x* = A^{−1}c/⟨A^{−1}c, c⟩^{1/2} and ⟨c, x*⟩ = ⟨A^{−1}c, c⟩^{1/2}. □
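Lemma 3.1.12 is convenient to verify numerically. The sketch below (assuming NumPy; the matrix A and the vector c are random test data of ours) checks that x* = A^{−1}c/⟨A^{−1}c, c⟩^{1/2} lies on the boundary of the ellipsoid and is not beaten by random boundary points:

    import numpy as np

    rng = np.random.default_rng(4)
    B = rng.normal(size=(3, 3))
    A = B @ B.T + 3 * np.eye(3)            # positive definite A
    c = rng.normal(size=3)

    w = np.linalg.solve(A, c)              # w = A^{-1} c
    opt = np.sqrt(c @ w)                   # <A^{-1}c, c>^{1/2}
    x_star = w / opt
    assert abs(x_star @ (A @ x_star) - 1.0) < 1e-10   # active constraint

    # No random feasible boundary point does better:
    for _ in range(1000):
        z = rng.normal(size=3)
        z /= np.sqrt(z @ (A @ z))          # on the boundary of the ellipsoid
        assert c @ z <= opt + 1e-10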

3.2 Nonsmooth minimization methods

(General lower complexity bounds; Main lemma; Localization sets; Subgradient method; Constrained minimization scheme; Optimization in finite dimension and lower complexity bounds; Cutting plane scheme; Center of gravity method; Ellipsoid method; Other methods.)

3.2.1 General lower complexity bounds

In the previous section we introduced a class of general convex functions. These functions can be nonsmooth, and therefore the corresponding minimization problem can be quite difficult. As for smooth problems, let us try to derive lower complexity bounds, which will help us to evaluate the performance of numerical methods.

In this section we derive such bounds for the following unconstrained minimization problem:

min_{x∈R^n} f(x),   (3.2.1)

where f is a convex function. Thus, our problem class is as follows:

Model:  1. Unconstrained minimization.
        2. f is convex on R^n and Lipschitz continuous on a bounded set.

Oracle:  First-order black box: at each point x̄ we can compute f(x̄) and g(x̄) ∈ ∂f(x̄), where g(x̄) is an arbitrary subgradient.

Approximate solution:  Find x̄ ∈ R^n : f(x̄) − f* ≤ ε.

Methods:  Generate a sequence {x_k}: x_k ∈ x_0 + Lin{g(x_0), …, g(x_{k−1})}.   (3.2.2)

As in Section 2.1.2, for deriving a lower complexity bound for our problem class, we will study the behavior of numerical methods on some function which appears to be very difficult for all of them.

Let us fix some constants μ > 0 and γ > 0. Consider the family of functions

f_k(x) = γ max_{1≤i≤k} x^{(i)} + (μ/2) ‖x‖²,  k = 1…n.

Using the rules of subdifferential calculus described in Section 3.1.6, we can write down an expression for the subdifferential of f_k at x:

∂f_k(x) = μx + γ Conv{e_i | i ∈ I(x)},  I(x) = {j | 1 ≤ j ≤ k, x^{(j)} = max_{1≤i≤k} x^{(i)}}.

Therefore for any x, y ∈ B_2(0, ρ), ρ > 0, and any g_k(y) ∈ ∂f_k(y) we have

f_k(y) − f_k(x) ≤ ⟨g_k(y), y − x⟩ ≤ ‖g_k(y)‖ · ‖y − x‖ ≤ (μρ + γ) ‖y − x‖.

Thus, f_k is Lipschitz continuous on B_2(0, ρ) with Lipschitz constant M = μρ + γ.

Further, consider the point x*_k with the coordinates

(x*_k)^{(i)} = −γ/(μk) for 1 ≤ i ≤ k,  (x*_k)^{(i)} = 0 for k + 1 ≤ i ≤ n.

It is easy to check that 0 ∈ ∂f_k(x*_k), and therefore x*_k is the minimum of the function f_k (see Theorem 3.1.15). Note that

R_k ≡ ‖x*_k‖ = γ/(μ√k),  f*_k = f_k(x*_k) = −γ²/(2μk).

Let us now describe a resisting oracle for the function f_k(x). Since the analytical form of this function is fixed, the resistance of this oracle consists in providing us with the worst possible subgradient at each test point. The algorithmic scheme of this oracle is as follows:

Input:  x ∈ R^n.

Main Loop:  f̄ := −∞; i* := 0;
            for j := 1 to k do
              if x^{(j)} > f̄ then { f̄ := x^{(j)}; i* := j };
            f̄ := γ f̄ + (μ/2) ‖x‖²;  ḡ := γ e_{i*} + μx;

Output:  f_k(x) := f̄,  g_k(x) := ḡ ∈ R^n.

At first glance, there is nothing special in this scheme. Its main loop is just a standard process for finding the maximal coordinate of a vector from R^n. However, the main feature of this loop is that we always form the subgradient from a coordinate vector. Moreover, this coordinate corresponds to i*, the first maximal component of the vector x. Let us check what happens with a minimizing sequence which uses such an oracle.

Let us choose the starting point x_0 = 0. Denote

R^{p,n} = {x ∈ R^n | x^{(i)} = 0, p + 1 ≤ i ≤ n}.

Since x_0 = 0, the answer of the oracle is f_k(x_0) = 0 and g_k(x_0) = γe_1. Therefore the next point of the sequence, x_1, necessarily belongs to R^{1,n}. Assume now that the current test point of the sequence, x_i, belongs to R^{p,n}, 1 ≤ p ≤ k. Then the oracle returns a subgradient

ḡ = μx_i + γe_{i*},

where i* ≤ p + 1. Therefore, the next test point x_{i+1} belongs to R^{p+1,n}.

This simple reasoning proves that for all i, 1 ≤ i ≤ k, we have x_i ∈ R^{i,n}. Consequently, for i, 1 ≤ i ≤ k − 1, we cannot improve the starting value of the objective function:

f_k(x_i) ≥ γ max_{1≤j≤k} x_i^{(j)} ≥ 0 = f_k(x_0).
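The oracle above is short enough to implement directly. Below is a sketch (assuming NumPy; the method driving it is a naive subgradient step of our own choosing) showing how each call fills in at most one new coordinate, exactly as in the argument above:

    import numpy as np

    # Resisting oracle for f_k(x) = gamma*max_{i<=k} x^(i) + (mu/2)||x||^2.
    # np.argmax returns the FIRST maximal coordinate, which is the whole
    # point of the construction.
    def oracle(x, k, gamma, mu):
        i_star = int(np.argmax(x[:k]))
        f = gamma * x[:k].max() + 0.5 * mu * (x @ x)
        g = mu * x.copy()
        g[i_star] += gamma
        return f, g

    n, k, gamma, mu = 10, 8, 1.0, 1.0
    x = np.zeros(n)
    for t in range(5):                   # any method of type (3.2.2)
        f, g = oracle(x, k, gamma, mu)
        x = x - 0.1 * g                  # illustrative step
        print(t, f, np.flatnonzero(x))   # nonzeros grow by at most one per call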

Let us convert this observation into a lower complexity bound. Let us fix some parameters of our problem class P(x_0, R, M), that is, R > 0 and M > 0. In addition to (3.2.2) we assume that

- the solution of problem (3.2.1), x*, exists and x* ∈ B_2(x_0, R);
- f is Lipschitz continuous on B_2(x_0, R) with constant M > 0.

THEOREM 3.2.1 For any class P(x_0, R, M) and any k, 0 ≤ k ≤ n − 1, there exists a function f ∈ P(x_0, R, M) such that

f(x_k) − f* ≥ MR/(2(1 + √(k+1)))

for any optimization scheme which generates a sequence {x_k} satisfying the condition

x_k ∈ x_0 + Lin{g(x_0), …, g(x_{k−1})}.

Proof: Without loss of generality we can assume that x_0 = 0. Let us choose f(x) = f_{k+1}(x) with

γ = M√(k+1)/(1 + √(k+1)),  μ = M/((1 + √(k+1))R).

Then

‖x_0 − x*‖ = R_{k+1} = γ/(μ√(k+1)) = R,

and f(x) is Lipschitz continuous on B_2(x_0, R) with constant μR + γ = M. Note that x_k ∈ R^{k,n}. Hence,

f(x_k) − f* ≥ −f* = γ²/(2μ(k+1)) = MR/(2(1 + √(k+1))). □

The lower complexity bound presented in Theorem 3.2.1 is uniform in the dimension of the space of variables. As the lower bound of Theorem 2.1.7, it can be applied to problems with very large dimension, or to the efficiency analysis of the starting iterations of a minimization scheme (k ≤ n − 1).

We will see that our lower estimate is exact: there exist minimization methods with rate of convergence proportional to this lower bound. Comparing this bound with the lower bound for smooth minimization problems, we can see that now the possible convergence rate is much slower. However, we should remember that we are working now with the most general class of convex problems.

3.2.2 Main lemma

At this moment we are interested in the following problem:

min{f(x) | x ∈ Q},   (3.2.3)

where Q is a closed convex set and f is a function which is convex on R^n. We are going to study some methods for solving (3.2.3) which employ subgradients g(x) of the objective function. As compared with the smooth problem, our goal now is much more complicated. Indeed, even in the simplest situation, when Q ≡ R^n, the subgradient seems to be a poor replacement for the gradient of a smooth function. For example, we cannot be sure that the value of the objective function is decreasing in the direction −g(x). We cannot expect that g(x) → 0 as x approaches a solution of our problem, etc.

Fortunately, there is one property of subgradients that makes our goals reachable. We have proved this property in Corollary 3.1.4:

At any x ∈ Q the following inequality holds:

⟨g(x), x − x*⟩ ≥ 0.   (3.2.4)

This simple inequality leads to two consequences which form a basis for any nonsmooth minimization method. Namely:

- The distance between x and x* is decreasing in the direction −g(x).
- Inequality (3.2.4) cuts R^n into two half-spaces. Only one of them contains x*.

Nonsmooth minimization methods cannot employ the idea of relaxation or approximation. There is another concept underlying all these schemes: the concept of localization. However, to go forward with this concept, we have to develop a special technique which allows us to estimate the quality of an approximate solution to problem (3.2.3). That is the main goal of this section.

Let us fix some x̄ ∈ R^n. For x ∈ R^n with g(x) ≠ 0 define

v_f(x̄, x) = (1/‖g(x)‖) ⟨g(x), x − x̄⟩.

If g(x) = 0, then define v_f(x̄, x) = 0. Clearly, v_f(x̄, x) ≤ ‖x − x̄‖.

The values v_f(x̄, x) have a natural geometric interpretation. Consider a point x such that g(x) ≠ 0 and ⟨g(x), x − x̄⟩ ≥ 0. Let us look at the point y = x̄ + v_f(x̄, x) g(x)/‖g(x)‖. Then

⟨g(x), x − y⟩ = ⟨g(x), x − x̄⟩ − v_f(x̄, x) ‖g(x)‖ = 0

and ‖y − x̄‖ = v_f(x̄, x). Thus, v_f(x̄, x) is the distance from the point x̄ to the hyperplane {y : ⟨g(x), x − y⟩ = 0}.

Let us introduce a function that measures the variation of the function f with respect to the point x̄. For t ≥ 0 define

ω_f(x̄; t) = max{f(x) − f(x̄) | x ∈ dom f, ‖x − x̄‖ ≤ t}.

If t < 0, we set ω_f(x̄; t) = 0.

Clearly, the function ω_f possesses the following properties:

- ω_f(x̄; 0) = 0 for all x̄;
- ω_f(x̄; t) is a nondecreasing function of t, t ∈ R^1;
- f(x) − f(x̄) ≤ ω_f(x̄; ‖x − x̄‖).

It is important that in the convex situation the last inequality can be strengthened.

LEMMA 3.2.1 For any x ∈ R^n we have

f(x) − f(x̄) ≤ ω_f(x̄; v_f(x̄; x)).   (3.2.5)

If f(x) is Lipschitz continuous on B_2(x̄, R) with some constant M, then

f(x) − f(x̄) ≤ M (v_f(x̄; x))_+   (3.2.6)

for all x ∈ R^n with v_f(x̄; x) ≤ R.

Proof: If ⟨g(x), x − x̄⟩ ≤ 0, then f(x̄) ≥ f(x) + ⟨g(x), x̄ − x⟩ ≥ f(x). This implies that v_f(x̄; x) ≤ 0. Hence, ω_f(x̄; v_f(x̄; x)) = 0 and (3.2.5) holds.

Let ⟨g(x), x − x̄⟩ > 0. For

y = x̄ + v_f(x̄; x) g(x)/‖g(x)‖

we have ⟨g(x), y − x⟩ = 0 and ‖y − x̄‖ = v_f(x̄; x). Therefore

f(y) ≥ f(x) + ⟨g(x), y − x⟩ = f(x),

and

f(x) − f(x̄) ≤ f(y) − f(x̄) ≤ ω_f(x̄; ‖y − x̄‖) = ω_f(x̄; v_f(x̄; x)).

If f is Lipschitz continuous on B_2(x̄, R) and 0 ≤ v_f(x̄; x) ≤ R, then y ∈ B_2(x̄, R). Hence,

f(x) − f(x̄) ≤ f(y) − f(x̄) ≤ M ‖y − x̄‖ = M v_f(x̄; x). □

Let us fix some x*, a solution to problem (3.2.3). The values v_f(x*; x) allow us to estimate the quality of localization sets.

DEFINITION 3.2.1 Let {x_i}_{i=0}^∞ be a sequence in Q. Define

S_k = {x ∈ Q | ⟨g(x_i), x_i − x⟩ ≥ 0, i = 0…k}.

We call this set the localization set of problem (3.2.3) generated by the sequence {x_i}_{i=0}^∞.

Note that, in view of inequality (3.2.4), for all k ≥ 0 we have x* ∈ S_k. Denote

v_i = v_f(x*; x_i) (≥ 0),  v*_k = min_{0≤i≤k} v_i.

Thus,

v*_k = max{r | ⟨g(x_i), x_i − x⟩ ≥ 0, i = 0…k, ∀x ∈ B_2(x*, r)}.

LEMMA 3.2.2 Let f*_k = min_{0≤i≤k} f(x_i). Then f*_k − f* ≤ ω_f(x*; v*_k).

Proof: Using Lemma 3.2.1, we have

ω_f(x*; v*_k) = min_{0≤i≤k} ω_f(x*; v_i) ≥ min_{0≤i≤k} [f(x_i) − f*] = f*_k − f*. □

3.2.3 Subgradient method

Now we are ready to analyze the behavior of some minimization schemes. Consider the problem

min{f(x) | x ∈ Q},   (3.2.7)

where f is a function convex on R^n and Q is a simple closed convex set. The term "simple" means that we can solve explicitly some simple minimization problems over Q. In accordance with the goals of this section, we have to be able to find in a reasonably cheap way a Euclidean projection of any point onto Q.

We assume that problem (3.2.7) is equipped with a first-order oracle, which at any test point x̄ provides us with the value of the objective function f(x̄) and with one of its subgradients g(x̄).

As usual, we try first a version of the gradient method. Note that for nonsmooth problems the norm of the subgradient, ‖g(x)‖, is not very informative. Therefore in the subgradient scheme we use the normalized direction g(x)/‖g(x)‖.

Subgradient method. Unconstrained minimization

0. Choose x_0 ∈ Q and a sequence {h_k}_{k=0}^∞:

h_k > 0,  h_k → 0,  Σ_{k=0}^∞ h_k = ∞.   (3.2.8)

1. kth iteration (k ≥ 0). Compute f(x_k), g(x_k) and set

x_{k+1} = π_Q(x_k − h_k g(x_k)/‖g(x_k)‖).

Let us estimate the rate of convergence of this scheme.

THEOREM 3.2.2 Let f be Lipschitz continuous on B_2(x*, R) with constant M, and let x_0 ∈ B_2(x*, R). Then

f*_k − f* ≤ M (R² + Σ_{i=0}^k h_i²) / (2 Σ_{i=0}^k h_i).   (3.2.9)

Proof: Denote r_i = ‖x_i − x*‖ and v_i = v_f(x*; x_i). Then, in view of Lemma 3.1.5, we have

r²_{i+1} ≤ ‖x_i − h_i g(x_i)/‖g(x_i)‖ − x*‖² = r_i² − 2h_i v_i + h_i².

Summing up these inequalities for i = 0…k, we get

2 Σ_{i=0}^k h_i v_i ≤ r_0² + Σ_{i=0}^k h_i².

Thus,

v*_k ≤ (R² + Σ_{i=0}^k h_i²) / (2 Σ_{i=0}^k h_i).

It remains to use Lemma 3.2.2. □

Thus, Theorem 3.2.2 demonstrates that the rate of convergence of the subgradient method (3.2.8) depends on the values

Δ_k = (R² + Σ_{i=0}^k h_i²) / (2 Σ_{i=0}^k h_i).

We can easily see that Δ_k → 0 if the series Σ_{i=0}^∞ h_i diverges. However, let us try to choose h_k in an optimal way.

Let us assume that we have to perform a fixed number of steps of the subgradient method, say N. Then, minimizing Δ_k as a function of {h_k}_{k=0}^N, we find that the optimal strategy is as follows:²

h_i = R/√(N + 1),  i = 0…N.   (3.2.10)

In this case Δ_N = R/√(N + 1), and we obtain the following rate of convergence:

f*_N − f* ≤ MR/√(N + 1).

Comparing this result with the lower bound of Theorem 3.2.1, we conclude:

The subgradient method (3.2.8), (3.2.10) is optimal for problem (3.2.7) uniformly in the dimension n.

If we do not want to fix the number of iterations a priori, we can choose

h_i = r/√(i + 1),  i = 0, ….

Then it is easy to see that Δ_k is proportional to (R² + r² ln(k + 1))/(r√(k + 1)), and we can classify the rate of convergence of this scheme as sub-optimal.

Thus, the simplest method for solving problem (3.2.3) appears to be optimal. This indicates that the problems from our class are too complicated to be solved efficiently. However, we should remember that our conclusion is valid uniformly in the dimension of the problem. We will see that a moderate dimension of the problem, taken into account in a proper way, helps to develop much more efficient schemes.

² From Example 3.1.2(4) we can see that Δ_k is a convex function of {h_i}.
3.2.4 Minimization with functional constraints

Let us apply a subgradient method to a constrained minimization problem with functional constraints. Consider the problem

min{f(x) | x ∈ Q, f_i(x) ≤ 0, i = 1…m},   (3.2.11)

with convex f and f_i, and a simple bounded closed convex set Q:

‖x − y‖ ≤ R  ∀x, y ∈ Q.

Let us form an aggregate constraint f̄(x) = (max_{1≤j≤m} f_j(x))_+. Then our problem can be written as follows:

min{f(x) | x ∈ Q, f̄(x) ≤ 0}.   (3.2.12)

Note that we can easily compute a subgradient ḡ(x) of the function f̄, provided that we can do so for the functions f_j (see Lemma 3.1.10).

Let us fix some x*, a solution to (3.2.11). Note that f̄(x*) = 0 and v_{f̄}(x*; x) ≥ 0 for all x ∈ R^n. Therefore, in view of Lemma 3.2.1 we have

f̄(x) ≤ ω_{f̄}(x*; v_{f̄}(x*; x)).

If the f_i are Lipschitz continuous on Q with constant M, then for any x from R^n we have the estimate

f̄(x) ≤ M v_{f̄}(x*; x).

Let us write down a subgradient minimization scheme for the constrained minimization problem (3.2.12). We assume that R is known.

Subgradient method. Functional constraints   (3.2.13)

0. Choose x_0 ∈ Q and a sequence {h_k}_{k=0}^∞:  h_k = R/√(k + 0.5).

1. kth iteration (k ≥ 0).

a) Compute f(x_k), g(x_k), f̄(x_k), ḡ(x_k) and set

p_k = g(x_k), if f̄(x_k) < ‖ḡ(x_k)‖ h_k,   (A),
p_k = ḡ(x_k), if f̄(x_k) ≥ ‖ḡ(x_k)‖ h_k,   (B).

b) Set x_{k+1} = π_Q(x_k − h_k p_k/‖p_k‖).

THEOREM 3.2.3 Let f be Lipschitz continuous on B_2(x*, R) with constant M_1, and let

M_2 = max_{1≤j≤m} {‖g‖ : g ∈ ∂f_j(x), x ∈ B_2(x*, R)}.

Then for any k ≥ 3 there exists a number i', 0 ≤ i' ≤ k, such that

f(x_{i'}) − f* ≤ √3 M_1 R/√(k − 1.5),   f̄(x_{i'}) ≤ √3 M_2 R/√(k − 1.5).

Proof: Note that for the direction p_k chosen in accordance with rule (B) we have

‖ḡ(x_k)‖ h_k ≤ f̄(x_k) ≤ ⟨ḡ(x_k), x_k − x*⟩.

Hence, in this case v̄_k ≡ v_{f̄}(x*; x_k) ≥ h_k.

Let k' = ⌊k/3⌋ and I_k = {i ∈ [k', …, k] : p_i = g(x_i)}. Denote r_i = ‖x_i − x*‖, v_i = v_f(x*; x_i) and v̄_i = v_{f̄}(x*; x_i). Then for all i, k' ≤ i ≤ k, we have:

if i ∈ I_k, then r²_{i+1} ≤ r_i² − 2h_i v_i + h_i²;
if i ∉ I_k, then r²_{i+1} ≤ r_i² − 2h_i v̄_i + h_i².

Summing up these inequalities for i ∈ [k', …, k], we get

r²_{k'} + Σ_{i=k'}^k h_i² ≥ r²_{k+1} + 2 Σ_{i∈I_k} h_i v_i + 2 Σ_{i∉I_k} h_i v̄_i.

Recall that for i ∉ I_k we have v̄_i ≥ h_i (Case (B)).

Assume that v_i ≥ h_i for all i ∈ I_k. Then r²_{k'} ≥ Σ_{i=k'}^k h_i² and, since r_{k'} ≤ R,

1 ≥ Σ_{i=k'}^k (h_i/R)² = Σ_{i=k'}^k 1/(i + 0.5) ≥ ∫_{k'}^{k+1} dτ/(τ + 0.5) = ln ((2k + 3)/(2k' + 1)) ≥ ln 3.

That is a contradiction. Thus, I_k ≠ ∅ and there exists some i' ∈ I_k such that v_{i'} < h_{i'}. Clearly, for this number we have v_{i'} ≤ h_{k'} and, consequently, (v_{i'})_+ ≤ h_{k'}.

Thus, we conclude that f(x_{i'}) − f* ≤ M_1 h_{k'} (see Lemma 3.2.1) and, since i' ∈ I_k, we also have the estimate

f̄(x_{i'}) < ‖ḡ(x_{i'})‖ h_{i'} ≤ M_2 h_{k'}.

It remains to note that k' ≥ k/3 − 1 and therefore h_{k'} = R/√(k' + 0.5) ≤ √3 R/√(k − 1.5). □

Comparing the result of Theorem 3.2.3 with the lower complexity bound of Theorem 3.2.1, we see that scheme (3.2.13) has an optimal rate of convergence. Recall that this lower complexity bound was obtained for an unconstrained minimization problem. Thus, our result proves that, from the viewpoint of analytical complexity, general convex unconstrained minimization problems are not easier than constrained ones.
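The switching logic of scheme (3.2.13) is easy to render in code. The sketch below (assuming NumPy) is illustrative only: the linear objective, the constraint f̄(x) = (‖x‖_∞ − 1)_+ and the set Q = B_2(0, R) are placeholder choices of ours, not data from the text:

    import numpy as np

    # Switching subgradient scheme (3.2.13): step (A) improves the objective
    # at nearly feasible points, step (B) reduces the constraint violation.
    def switching_subgradient(f, g_f, fbar, g_fbar, proj, x0, R, K):
        x = x0.copy()
        candidates = []                            # nearly feasible iterates
        for k in range(K):
            h = R / np.sqrt(k + 0.5)
            gb = g_fbar(x)
            nb = np.linalg.norm(gb)
            if nb == 0.0 or fbar(x) < nb * h:      # case (A)
                p = g_f(x)
                candidates.append(x.copy())
            else:                                  # case (B)
                p = gb
            x = proj(x - h * p / np.linalg.norm(p))
        return min(candidates, key=f)

    R, n = 2.0, 3
    c = np.array([1.0, -2.0, 0.5])
    f = lambda x: c @ x
    g_f = lambda x: c
    fbar = lambda x: max(np.max(np.abs(x)) - 1.0, 0.0)   # (||x||_inf - 1)_+
    def g_fbar(x):
        g = np.zeros(len(x))
        j = int(np.argmax(np.abs(x)))
        if np.abs(x[j]) > 1.0:
            g[j] = np.sign(x[j])
        return g
    proj = lambda x: x if np.linalg.norm(x) <= R else R * x / np.linalg.norm(x)

    x_best = switching_subgradient(f, g_f, fbar, g_fbar, proj, np.zeros(n), R, 2000)
    print(x_best, f(x_best), fbar(x_best))   # approaches min of c^T x over the cube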

3.2.5 Complexity bounds in finite dimension

Let us look at the unconstrained minimization problem again, assuming that its dimension is relatively small. This means that our computational resources allow us to perform a number of iterations of a minimization method proportional to the dimension of the space of variables. What are the lower complexity bounds in this case?

In this section we obtain a finite-dimensional lower complexity bound for a problem which is closely related to the minimization problem. This is the feasibility problem:

Find x* ∈ Q,   (3.2.14)

where Q is a convex set. We assume that this problem is endowed with an oracle, which answers our request at a point x̄ ∈ R^n in the following way:

- either it reports that x̄ ∈ Q,
- or it returns a vector ḡ separating x̄ from Q:

⟨ḡ, x̄ − x⟩ ≥ 0  ∀x ∈ Q.

To estimate the complexity of this problem, we introduce the following assumption.

ASSUMPTION 3.2.1 There exists a point x* ∈ Q such that for some ε > 0 the ball B_2(x*, ε) belongs to Q.

For example, if we know an optimal value f* for problem (3.2.3), we can treat this problem as a feasibility problem with

Q̄ = {(t, x) ∈ R^{n+1} | t ≥ f(x), t ≤ f* + ε̄, x ∈ Q}.

The relation between the accuracy parameters ε̄ and ε in Assumption 3.2.1 can be easily obtained, assuming that the function f is Lipschitz continuous. We leave this reasoning as an exercise for the reader.

Let us now describe a resisting oracle for problem (3.2.14). It forms a sequence of boxes {B_k}_{k=0}^∞, B_{k+1} ⊂ B_k, defined by their lower and upper bounds:

B_k = {x ∈ R^n | a_k ≤ x ≤ b_k}.

For each box B_k, k ≥ 0, denote by c_k = (1/2)(a_k + b_k) its center. For the boxes B_k, k ≥ 1, the oracle creates an individual separating vector g_k. Up to a sign change, this is always a coordinate vector.

In the scheme below we use two dynamic counters:

- m is the number of generated boxes;
- i is the active coordinate.

Denote by e ∈ R^n the vector of all 1s. The oracle starts from the following settings:

a_0 := −Re, b_0 := Re, m := 0, i := 1.

Its input is an arbitrary x ∈ R^n.

Resisting oracle. Feasibility problem

If x ∉ B_0 then return a separator of x from B_0, else

1. Find the maximal k ∈ [0, …, m] : x ∈ B_k.

2. If k < m then return g_{k+1}, else {Create a new box}:

   If x^{(i)} ≥ c_m^{(i)} then
      a_{m+1} := a_m, b_{m+1} := b_m + (c_m^{(i)} − b_m^{(i)}) e_i, g := e_i,
   else
      a_{m+1} := a_m + (c_m^{(i)} − a_m^{(i)}) e_i, b_{m+1} := b_m, g := −e_i;

   m := m + 1; g_m := g; i := i + 1; If i > n then i := 1.

   Return g_m.

This oracle implements a very simple strategy. Note that the next box B_{m+1} is always a half of the last box B_m. The box B_m is divided into two parts by a hyperplane which passes through its center and corresponds to the active coordinate i. Depending on the part of the box B_m containing the test point x, we choose the sign of the separation vector g_{m+1} = ±e_i. After creating a new box B_{m+1}, the index i is increased by 1. If this value exceeds n, we return again to i = 1. Thus, the sequence of boxes {B_k} possesses two important properties:

- vol_n B_{k+1} = (1/2) vol_n B_k;
- for any k ≥ 0 we have b_{k+n} − a_{k+n} = (1/2)(b_k − a_k).

Note also that the number of generated boxes does not exceed the number of calls of the oracle.
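A direct implementation of this oracle is a useful exercise. The following sketch (assuming NumPy; the class and method names are our own) follows the scheme above literally:

    import numpy as np

    # Resisting oracle for the feasibility problem (3.2.14): never declares a
    # point feasible; halves the current box through its center along a cyclic
    # coordinate, always cutting off the half that contains the query point.
    class ResistingOracle:
        def __init__(self, n, R):
            self.n = n
            self.a = [-R * np.ones(n)]       # lower bounds of the boxes
            self.b = [R * np.ones(n)]        # upper bounds of the boxes
            self.g = [None]                  # g[k] separates from box k
            self.i = 0                       # active coordinate

        def query(self, x):
            inside = [j for j in range(len(self.a))
                      if np.all(self.a[j] <= x) and np.all(x <= self.b[j])]
            if not inside:                   # x is outside B_0
                g = np.zeros(self.n)
                j = int(np.argmax(np.abs(x)))
                g[j] = np.sign(x[j])
                return g
            k = max(inside)
            if k < len(self.a) - 1:          # an existing separator works
                return self.g[k + 1]
            c = 0.5 * (self.a[k] + self.b[k])
            a, b = self.a[k].copy(), self.b[k].copy()
            g = np.zeros(self.n)
            if x[self.i] >= c[self.i]:       # cut off the upper half
                b[self.i] = c[self.i]; g[self.i] = 1.0
            else:                            # cut off the lower half
                a[self.i] = c[self.i]; g[self.i] = -1.0
            self.a.append(a); self.b.append(b); self.g.append(g)
            self.i = (self.i + 1) % self.n
            return g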
LEMMA 3.2.3 For all k ≥ 0 we have the inclusion

B_2(c_k, r_k) ⊆ B_k,  with r_k = (R/2)(1/2)^{k/n}.   (3.2.15)

Proof: Indeed, for all k ∈ [0, …, n − 1] the half-widths of the box B_k with respect to its center c_k are at least R/2; in particular,

B_n = {x | c_n − (R/2)e ≤ x ≤ c_n + (R/2)e} ⊇ B_2(c_n, (1/2)R).

Therefore, for such k we have B_k ⊇ B_2(c_k, (1/2)R), and (3.2.15) holds. Further, let k = nl + p with some p ∈ [0, …, n − 1]. Since

b_k − a_k = (1/2)^l (b_p − a_p),

we conclude that

B_k ⊇ B_2(c_k, (R/2)(1/2)^l).

It remains to note that r_k ≤ (R/2)(1/2)^l. □

Lemma 3.2.3 immediately leads to the following complexity result.

THEOREM 3.2.4 Consider a class of feasibility problems (3.2.14) which satisfy Assumption 3.2.1 and for which the feasible sets Q belong to B_∞(0, R). The lower analytical complexity bound for this class is n ln (R/(2ε)) calls of the oracle.

Proof: Indeed, we have seen that the number of generated boxes does not exceed the number of calls of the oracle. Moreover, in view of Lemma 3.2.3, after k iterations the last box contains the ball B_2(c_{m_k}, r_k), which can still contain a feasible set satisfying Assumption 3.2.1 as long as r_k ≥ ε. □

The lower complexity bound for the minimization problem (3.2.3) can be obtained in a similar way. However, the corresponding reasoning is more complicated. Therefore we present here only the conclusion.

THEOREM 3.2.5 The lower bound for the analytical complexity of the problem class formed by minimization problems (3.2.3) with Q ⊆ B_∞(0, R) and f ∈ F^0_M(B_∞(0, R)) is n ln (MR/(2ε)) calls of the oracle. □

3.2.6 Cutting plane schemes

Let us now look at the following constrained minimization problem:

min{f(x) | x ∈ Q},   (3.2.16)

where f is a function convex on R^n and Q is a bounded closed convex set such that

int Q ≠ ∅,  diam Q = D < ∞.

We assume that Q is not simple and that our problem is equipped with a separating oracle. At any test point x̄ ∈ R^n this oracle returns a vector ḡ which is

- a subgradient of f at x̄, if x̄ ∈ Q;
- a separator of x̄ from Q, if x̄ ∉ Q.

An important example of such a problem is a constrained minimization problem with functional constraints (3.2.11). We have seen that this problem can be rewritten as a problem with a single functional constraint (see (3.2.12)), which defines the feasible set

Q = {x ∈ R^n | f̄(x) ≤ 0}.

In this case, for x̄ ∉ Q the oracle has to provide us with any subgradient ḡ ∈ ∂f̄(x̄). Clearly, ḡ separates x̄ from Q (see Theorem 3.1.16).

Let us present the main property of finite-dimensional localization sets.

Consider a sequence X ≡ {x_i}_{i=0}^∞ belonging to the set Q. Recall that the localization sets generated by this sequence are defined as follows:

S_0(X) = Q,
S_{k+1}(X) = {x ∈ S_k(X) | ⟨g(x_k), x_k − x⟩ ≥ 0}.

Clearly, for any k ≥ 0 we have x* ∈ S_k(X). Denote

v*_k(X) = min_{0≤i≤k} v_f(x*; x_i).

Denote by vol_n S the n-dimensional volume of a set S ⊂ R^n.

THEOREM 3.2.6 For any k ≥ 0 we have

v*_k(X) ≤ D [vol_n S_k(X)/vol_n Q]^{1/n}.

Proof: Denote α = v*_k(X)/D (≤ 1). Since Q ⊆ B_2(x*, D), we have the following inclusion:

(1 − α)x* + αQ ⊆ (1 − α)x* + αB_2(x*, D) = B_2(x*, v*_k(X)).

Since Q is convex, we conclude that

(1 − α)x* + αQ = [(1 − α)x* + αQ] ∩ Q ⊆ B_2(x*, v*_k(X)) ∩ Q ⊆ S_k(X).

Therefore vol_n S_k(X) ≥ vol_n [(1 − α)x* + αQ] = α^n vol_n Q. □

Quite often the set Q is rather complicated, and it is difficult to work directly with the sets S_k(X). Instead, we can update some simple upper approximations of these sets. The process of generating such approximations is described by the following cutting plane scheme.

General cutting plane scheme   (3.2.17)

0. Choose a bounded set E_0 ⊇ Q.

1. kth iteration (k ≥ 0).

a) Choose y_k ∈ E_k.

b) If y_k ∈ Q, then compute f(y_k), g(y_k). If y_k ∉ Q, then compute ḡ(y_k), which separates y_k from Q.

c) Set

g_k = g(y_k), if y_k ∈ Q;  g_k = ḡ(y_k), if y_k ∉ Q.

d) Choose E_{k+1} ⊇ {x ∈ E_k | ⟨g_k, y_k − x⟩ ≥ 0}.
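In R^1, scheme (3.2.17) with the E_k chosen as intervals and y_k as their midpoints is just bisection. A minimal sketch (the objective below is an arbitrary example of ours):

    import numpy as np

    # One-dimensional instance of the cutting plane scheme (3.2.17):
    # E_k are intervals, y_k is the midpoint, each oracle answer halves E_k.
    f = lambda x: abs(x - 0.3) + 0.1 * x
    g = lambda x: (1.0 if x > 0.3 else -1.0) + 0.1   # a subgradient of f
    lo, hi = -1.0, 1.0                                # E_0 contains Q = [-1, 1]
    for k in range(50):
        y = 0.5 * (lo + hi)                           # step a): choose y_k
        gk = g(y)                                     # steps b), c): oracle call
        if gk >= 0:                                   # step d): keep {x: gk*(y-x) >= 0}
            hi = y
        else:
            lo = y
    print(lo, hi)   # brackets the minimizer x* = 0.3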

Let us estimate the performance of the above process. Consider the sequence Y = {y_k}_{k=0}^∞ involved in this scheme. Denote by X the subsequence of feasible points of the sequence Y: X = Y ∩ Q. Let us introduce the counter

i(k) = number of points y_j, 0 ≤ j < k, such that y_j ∈ Q.

151

Nonsmooth convex optimization

> 0, then X f 0.
3.2.4 For any k;::: 0, we have Si(k)

Thus, if i(k)
LEMMA

Ek

Proof: Indeed, if i(O) = 0, then So = Q ~ Eo. Let us assume that


Si(k) ~ Ek for some k 2: 0. Then, at the next iteration there are two
possibilities:
a) i(k + 1) = i(k). This happens if and only if Yk rt Q. Then

::) {x E Ek

Ek+l

(g(yk),Yk- x) 2: 0}

::) {X E si(k+l) I (g(yk), Yk - x) ;::: 0} = si(k+l)


since Si(k+l) ~ Q and g(yk) separates Yk from Q.
b) i(k + 1) = i(k) + 1. In this case Yk E Q. Then

::) {x E Ek

Ek+l

::) {X E si(k)

(g(yk), Yk - x) 2: 0}

(g(yk), Yk - x) 2: 0}

= si(k)+l
0

since Yk = xi(k)

The above results immediately Iead to the following important conclusion.


COROLLARY

3.2.1 1. For any k such that i(k)


~

V~(k)

2. lf voln Ek

(X)< D
-

[volnSi(k)(X)];- <
voln Q

> 0, we have
I

[volnEk];
voln Q

< voln Q, then i(k) > 0.

Proof: We have already proved the first statement. The second one
follows from the inclusion Q = So = Si(k) ~ Ek, which is valid for all k
0
such that i(k) = 0.

Thus, if we manage to ensure voln Ek -t 0, then we obtain a convergent scheme. Moreover, the rate of decrease of the volume automatically
defines the rate of convergence of the method. Clearly, we should try to
decrease voln Ek as fast as possible.
Historically, the first nonsmooth minimization method, implementing
the idea of cutting planes, was the center of gravity method. It is based
on the following geometrical fact.
Consider a bounded convex set S C Rn, int S f 0. Define the center
of gravity of this set as

cg(S) =

voll S
n

J xdx.

152

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

The following result demonstrates that any cut passing through the center of gravity divides the set on two proportional pieces.
LEMMA

3.2.5 Let g be a direction in Rn. Define

s+ = {x

Es I

(g, cg(S)- x)

0}.

Then

(We accept this result without proof.)

This observation naturally Ieads to the following minimization scheme.

Method of centers of gravity


0. Set So= Q.
1. kth iteration (k ~ 0).

a) Choose Xk = cg(Sk) and compute f(xk), g(xk)


b) Set sk+l = {x E sk

(g(xk), Xk- x)

~ 0}.

Let us estimate the rate of convergence of this method. Denote

Jt. =

min f(xj)

0$j$k

3.2.7 lf f is Lipschitz continuous on B2(x*,D) with a constant M, then for any k ~ 0 we have

THEOREM

J:.-f*~MD(l-~)--n.
Proof: The statement follows from Lemma 3.2.2, Theorem 3.2.6 and
Lemma 3.2.5.
0
Comparing this result with the lower complexity bound of Theorem 3.2.5, we see that the center-of-gravity method is optimal in finite
dimension. lts rate of convergence does not depend on any individual
characteristics of our problern like condition number, etc. However, we
should accept that this metbad is absolutely impractical, since the computation of the center of gravity in multi-dimensional space is a more
difficult problern than our initial one.

153

Nonsmooth convex optimization

Let us look at another method, which uses a possibility of approximation of the localization sets. This method is based on the following
geometrical observation.
Let H be a positive definite symmetric n x n matrix. Consider the
ellipsoid
E(H,x) = {x ERn

(H- 1 (x- x),x- x)::::; 1}.

Let us choose a direction g E Rn and consider a half of the above ellipsoid, defined by corresponding hyperplane:

1 (g,x- x) 2: 0}.

E+ = {x E E(H,x)

lt turns out that this set belongs to another ellipsoid, which volume is
strictly smaller than the volume of E(H, x}.
LEMMA

3.2.6 Denote

x+

x-

1
Hg
n+l (Hg,g)l/2'

voln E(H+, x+) ::::; ( 1-

(n~ 1 F) 2 voln E(H, x).

Proof: Denote G = H- 1 and G+ = H+1 . It is clear that


G+ = n:-;1

(a + n:_1. dt~~g)).

Without loss of generality we can assume that x = 0 and (Hg, g)


Suppose x E E+ Note that x+ =- n~ 1 Hg. Therefore

II

X-

X+

llb+

n~-;1 (11

X-

X+

llb +n:1 (g,x- x+)2)'

II X -X+ II~

II X II~ + n!l (g, x} + (n~lF'

(g, x - x+) 2 =

(g, x} 2 + n! 1(g, x}

+ (n~l)2

Putting all terms together, we obtain

II X - X+ II~+ =

n:2 1 (II

Note that (g, x) ::::; 0 and


(g, x} 2

XII~+ n:_1 (g, X) 2 + n:_1 (g, x} + n2:_1)

II x lla:s; 1.

+ (g, x)

Therefore

= (g, x)(1

+ (g, x))

::::; 0.

= 1.

154

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Hence,

I X- X+ II~+~

n:2 1

(11 XII~ +nLl) ~ 1.

Thus, we have proved that E+ c E(H+, x+)


Let us estimate the volume of E(H+, x+)
= [( !12 )n n- 1]1/2
[ detHt]1/2
n+1
n2-l
detH
n2 ( 1
[ nL1
-

_
-

2 )
n+1

~] ~ -< [n2-1
n2

1 ]
[ n 2 (n 2 +n-2) ] ~ _ [
- 1 - (n+l)2
n(n-l)(n+l)2

)]
2
1 - n(n+l)

It turnsout that the ellipsoid E(H+, x+) is the ellipsoid ofthe minimal
volume containing the half of the initialellipsoid E+.
Our observations can be implemented in the following algorithmic
scheme of the ellipsoid method.

Ellipsoid method
0. Choose Yo ERn and R
Set Ho = R 2 In.

> 0 suchthat B2(Yo, R) 2 Q.

1. kth iteration (k ~ 0).

gk

g(yk), if Yk E Q,

(3.2.18)

g(yk), if Yk ~ Q,

Yk+1

This method can be seen as a particular implementation of general


scheme (3.2.17) by choosing

Ek = {x ERn

(Hj; 1 (x- Yk),x- Yk) ~ 1}

155

Nonsmooth convex optimization

and Yk being the center of this ellipsoid.


Let us present an efficiency estimate for the ellipsoid method. Denote
Y = {Yk}~ 0 and Iet X be a feasible subsequence of the sequence Y:

Denote

J; =

min f(xj)

O~j~k

3.2.8 Let f be Lipschitz continuous on B2(x*, R) with some


constant M. Then for i(k) > 0, we have

THEOREM

~ _ f* < MR ( 1 f~(k)
-

1 ) 2 . [voln Bo(xo,R)];;.
(n+1)2
voln Q

Proof: The proof follows from Lemma 3.2.2, Corollary 3.2.1 and Lemma 3.2.6.
D
We need additional assumptions to guarantee X =/=
there exists some p > 0 and x E Q such that

0. Assurne that
(3.2.19)

Then

< (1 _ (n+1)2
1 )

.!
[ voln E&] n
voln Q
-

.!

2 [voln B2 xo,R ] n

voln

< le- 2(n+1)2 R.


- p

In view of Corollary 3.2.1, this implies that i(k) > 0 for all
k

lf i(k)

> 2(n + 1) 2 In~.

> 0, then
~
fz(k)

_ f* <
lMR2. e-2<n+t>2
- p

In order to ensure that(3.2.19) holds for a constrained minimization


problern with functional constraints, it is enough to assume that all
constraints are Lipschitz continuous and there is a feasible point, at
which all functional constraints are strictly negative (Slater condition).
We leave the details of the proof as an exercise for the reader.
Let us discuss now the complexity of ellipsoid method (3.2.18). Each
iteration of this method is rather cheap; it takes only O(n 2 ) arithmetic
operations. On the other hand, in order to generate an e-solution of
problern (3.2.16), satisfying assumption (3.2.19), this method needs

156

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

calls of the oracle. This efficiency estimate is not optimal (see Theorem
3.2.5), but it has a polynomial dependence on In~ and a polynomial dependence on logarithms ofthe dass parameters M, Rand p. For problern
classes, whose oracle has a polynomial complexity, such algorithms are
called (weakly) polynomial.
To conclude this section, let us mention that there are several methods
that work with localization sets in the form of a polytope:

Let us list the most important methods of this type:

Inscribed Ellipsoid Method. The point Yk in this scheme is chosen as


follows:

Yk =Center of the maximal ellipsoid Wk: Wk

Ek.

Analytic Center Method. In this method the point Yk is chosen as


the minimum of the analytic barrier
mk

Fk(x) =- I:In(bj- (aj,x)).


j=l

Volumetrie Center Method. This is also a barrier-type scheme. The


point Yk is chosen as the minimum of the volumetric barrier

where Fk(x) is the analytic barrier for the set Ek


All these methods are polynomial with complexity bound

where p is either 1 or 2. However, the complexity of each iteration in


these methods is much !arger (n3 + n 4 arithmetic operations). In the
next chapter we will see that the test point Yk for these schemes can be
computed by interior-point methods.

3.3

Methods with complete data

(Model of nonsmooth function; Kelley method; ldea of level method; Unconstrained minimization; Efficiency estimates; Problems with functional constraints.)

157

Nonsmooth convex optimization

3.3.1

Model of nonsrnooth function

In the previous section we studied several methods for solving the


following problem:
min f(x),
(3.3.1)
xEQ

where f is a Lipschitz continuous convex function and Q is a closed convex set. We have seen that the optimal method for problern (3.3.1) is the
subgradient method (3.2.8), (3.2.10). Note, that this conclusion is valid
for the whole class of Lipschitz continuous functions. However, when
we are going to minimize a particular function from that dass, we can
expect that it is not too bad. We can hope that the real performance of
the minimization method will be much better than a theoretical bound
derived from a worst-case analysis. Unfortunately, as far as the subgradient method is concerned, these expectations are too optimistic. The
scheme of the subgradient method is very strict and in general it cannot
converge faster than in theory. It can be also shown that the ellipsoid
method (3.2.18), inherits this drawback of subgradient scheme. In practice it works more or less in accordance to its theoretical bound even
when it is applied to a very simple function like II x 11 2 .
In this section we will discuss the algorithmic schemes, which are more
flexible than the subgradient and the ellipsoid methods. These schemes
are based on the notion of a model of nonsmooth function.
DEFINITION

3.3.1 Let X=

{xk}~ 0

be a sequence in Q. Denote

ik(X; x) = max [f(xi)


O<i<k

+ (g(xi), x- xi)],

where g(xi) are some subgradients of f at Xi


The function jk (X; x) is called the model of convex function

f.

Note that f k (X; x) is a piece-wise linear function of x. In view of


inequality (3.1.10) we always have

for all x E Rn. However, at all test points

Xi,

0 :S i :S k, we have

The next model is always better than the previous one:

for all x ERn.

158

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

3.3.2

Kelley method

Model A(X;x) represents the complete information on function f


accumulated after k calls of the oracle. Therefore it seems natural to
develop a minimization scheme, based on this object. Perhaps the most
natural method of this type is as follows:

Kelley method
0. Choose xo E Q.
1. kth iteration (k

Find

Xk+I E

(3.3.2)

2 0).

Argmin A(X; x).


xEQ

Intuitively, this scheme looks very attractive. Even the presence of


a complicated auxiliary problern is not too disturbing, since it can be
solved by linear optimization methods in finite time. However, it turns
out that this method cannot be recommended for practical applications.
And the main reason for that is its instability. Note that the solution
of auxiliary problern in the method (3.3.2) may be not unique. Moreover, the whole set Arg min jk (X; x) can be unstable with respect to an
xeQ

arbitrary small variation of data {f(xi), g(xi)}. This feature results in


unstable practical behavior of the scheme. Moreover, this feature can
be used for constructing an example of a problem, in which the Kelley
method has a very bad lower complexity bound.
EXAMPLE

3.3.1 Consider the problern (3.3.1) with

f(y,x)

max{j Y j, II x 11 2 },

y E R 1 , x ERn,

Q = {z=(y,x): y2 +11xll 2::;1}.


Thus, the solution of this problern is z* = (y*, x*) = (0, 0), and the
Z; z), the optimal set
optimal value f* = 0. Denote by Zk = Arg min
zeQ

A(

of model A(Z; z), and by jz = jk(Z"k) the optimal value of the model.
Let us choose zo = (1, 0). Then the initial model of function f is
fo(Z; z) = y. Therefore, the first point, generated by the Kelley method
is z1 = (-1, 0). Hence, the next model of the function f is as follows:

A(Z;z) = max{y, -y} =I Y I

159

Nonsmooth convex optimization

Clearly,

/i = 0.

Note that /k,+ 1 2:

fk..

On the other hand,

fk. S f(z*) = 0.
Thus, for all consequent models with k 2: 1 we will have /; = 0 and
Zk. = (0, Xk), where

xz = {x E B2(0, 1): I

Xi

1 2 +(2xi,X -xi) s 0, i = o... k}.

Let us estimate efficiency of the cuts for the set Xk.. Since Xk+l can
be an arbitrary point from Xk., at the first stage of the method we can
choose Xi with the unit norms: II Xi II= 1. Then the set Xk. is defined as
follows:
Xk. = {x E B2(0, 1) I (xi,x) S ~,i = 0 ... k}.
We can do that if

As far as this is possible, we can have

f(zi)

=f(O, xi)

= 1.

Let us estimate a possible length of this stage using the following fact.

nn, II d II= 1. Consider a surface


ERn 111 x II= 1, (d,x) 2: o:}, o: E [~, 1].

Let d be a direction in

S(o:)

= {x

Then v(o:)

= voln-1 (S(o:)) s v(O) [1- o: 2 ] -

n-1
2

At the first stage, each step cuts from the sphere S2 (0, 1) at most the
n-1
[ ]
segment S( ~) . Therefore, we can continue the process if k

s Ja

During all these iterations we still have f(zi) = 1.


Since at the first stage of the process the cuts are (xi, x)
n-1
[ ]
k, 0 S k S N
we have

= ./J ,

s ~' for all

This means that after N iterations we can repeat our process with the
ball B2(0, ~), etc. Note that f(O, x) = ~ for all x from B 2(0, ~ ).

160

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Thus, we prove the following lower bound for the Kelley method
(3.3.2):

f(xk)-

f* ~

This means that we cannot get an

{!)

[.ii]

n-1

~:-solution

of our problern less than in

~ [ 2 ]n-1
2ln2

'7a

calls of the oracle. It remains to compare this lower bound with the
upper complexity bounds of other methods:
Ellipsoid method:

0 (n2 ln~)

Optimal methods: 0 (nln~)

Gradient method:

0 (e\-)
0

3.3.3

Level method

Let us show that it is possible to work with models in a stable way.


Denote
J; = minA(X;x), fZ = min f(xi)
xEQ

09~k

The first value is called the minimal value of the model, and the second
one the record value of the model. Clearly }'; ~ f* ~ fZ.
Let us choose some a E (0, 1). Denote

lk(a) = (1- a)jk + afk


Consider the Ievel set

Clearly, Ck(a) is a closed convex set.


Note that the set Ck(a) is of a certain interest for an optimization
scheme. Firstly, inside this set clearly there are no test points of the
current model. Secondly, this set is stable with respect to a small variation of the data. Let us present a minimization scheme, which deals
directly with this level set.

161

Nonsmooth convex optimization

Level method
0. Choose point xo E Q, accuracy
cient a E (0, 1).
1. kth iteration (k

c). Set

> 0 and Ievel coeffi(3.3.3)

0).

J; and J;.

a). Compute
b). If !Z-

jz ~ e,

then STOP.

Xk+l = 7r.ck(a)(Xk)

In this scheme there are two quite expensive Operations. We need to


compute an optimal value
of the current model. lf Q is a polytope,
then this value can be obtained from the following linear programming
problem:

jz

t,

min
s.t.

f(xi)
XE

+ (g(xi), x- Xi)

t, i = 0 ... k,

Q.

We also need to compute projection 7r.ck(a)(Xk) If Q is a polytope, then


this is a quadratic programming problem:
min

II

x- Xk

11 2 ,

xEQ.

Both these problems are solvable either by a standard simplex-type


method, or by interior-point schemes.
Let us Iook at some properties of the Ievel method. Recall, that the
optimal values of the model decrease, and the records values increase:

Jz ~ Jz+l ~ r
Denote D..k = [JZ, !Z] and
]k(X;x). Then

6k

~ JZ+l ~ JZ.

= JZ - jz.

We call

6k

the gap of the model

162

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

The next result is crucial for the analysis of the Ievel method.
LEMMA 3.3.1 Assurne that for some p
Then for all i, k

k we have Op

(1 - a)ok.

p,

Proof: Note that for such i we have c5p

(1-a)c5k

li(a) = ft- (1- a)di ~ 1;- (1- a)di =

(1-a)<>i Therefore

J; + dp- (1- a)di ~ J;.

Let us show that the steps of the Ievel method are Iarge enough.
Denote
MJ = max{ll g 111 g E oj(x), x E Q}.
LEMMA 3.3.2 For the sequence {xk} generated by the Ievel method we
have

II Xk+l -

Xk II>
-

(1-a)6k

MI

Proof: Indeed,

> jk(Xk+I) ~ f(xk) + (g(xk), Xk+l - Xk)


> f(xk)- MJ II Xk+l- Xk II

Finally, we need to show that the gap in the model is decreasing.


LEMMA 3.3.3 Let Q in the problern (3.3.1} be bounded: diamQ
If for some p ~ k we have Op ~ (1- a)ok, then

+1-

D.

M2D2

:s; (1-~)26r

Proof: Denote xic E Argmin jk(X;x). In view ofLemma 3.3.1 we have


xEQ

Ji(X; x;) S Jp(X; x;) = J; S li(a)


for all i, k ~ i ~ p. Therefore, in view of Lemma 3.1.5 and Lemma 3.3.2
we obtain the following:

II Xi+l -

x; 11 2

< II Xi -

x; 11 2 -

< II X,

Xp

* 112 -

II Xi+l

- Xi 11 2

<II Xz

(1-a)26l
M2
_
I

* 112 - (1-a)26~
Mz

Xp

163

Nonsmooth convex optimization

Summing up these inequalities in i = k ... p we get

(p + 1- k) ( 1 -;~25~ ~II

Xk-

x; 11 2 ~ D

Note that the value p + 1 - k is equal to the number of indices in


the segment [k,p]. Now we can prove the efficiency estimate of the Ievel
method.
3.3.1 Let diamQ = D. Then the scheme of the Level method
terminates no sooner than after

THEOREM

N =

lf2a(1~I~~2-a) J+

iterations. Termination criterion of the method guarantees

Proof: Assurne that k ~ E, 0 ~ k


the groups in the decreasing order

J;- f*

~ E.

N. Let us divide the indices on

{N, ... , 0} = I(O) U 1(2) U ... U I(m),


suchthat
I(j)

= [p(j), k(j)],

p(O)

= N,

k(j)

~ 1~a0p(j) < t5k(j)+1

Clearly, for j

p(j

p(j) ~ k(j),

+ 1) = k(j) + 1,

= 0 ... m,

k(m)

= 0,

= t5p(j+l)

0 we have
t5

>~
>
1-a -

P(J+l) -

In view of Lemma 3.3.2, n(j) = p(j)


")

n (J ~

p(O)

>

(1-a)J+i -

+1-

(1-a)J+i

k(j) is bounded:

MJD 2
MJD2 ( 1
(1-o)22 . ~ f2(1-a)2
p(J)

) 2j

Therefore

Let us discuss the above efficiency estimate. Note that we can obtain
the optimal value of the Ievel parameter a from the following maximization problem:

--+

max.

oE[O,l]

164

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Its solution is o:* =

J2.

1
2+

Under this choice we have the following ef-

ficiency estimate of the Ievel method: N ~ ~ M} D 2 Comparing this


result with Theorem 3.2.1 we see that the Ievel rnethod is optimal uniformly in the dimension of the space of variables. Note that the analytical cornplexity bound of this rnethod in finite dirnension is not known.
One of the advantages of this rnethod isthat the gap 8k = JZ- J;
provides us with an exact estirnate of current accuracy. Usually, this gap
converges to zero rnuch faster than in the worst case situation. For the
rnajority of reallife optirnization problems the accuracy = w- 4 - w- 5
is obtained by the rnethod after 3n - 4n iterations.

3.3.4

Constrained minimization

Let us dernonstrate how we can use the rnodels for solving constrained
rninirnization problems. Consider the problern

f(x),

min
s.t.

/j(x)

s 0,

j = 1 ... m,

(3.3.4)

xEQ,

where Q is a bounded closed convex set, and functions f(x), fi(x) are
Lipschitz continuous on Q.
Let us rewrite this problern as a problern with a single functional
constraint. Denote f(x) = rn~x /j(x). Then we obtain the equivalent
1$J$m

problern

f(x),

rnin
s.t.

j(x) ~ 0,
XE

(3.3.5)

Q.

Note that f(x) and /(x) are convex and Lipschitz continuous. In this
section we will try to solve (3.3.5) using the rnodels for both of thern.
Let us define the corresponding rnodels. Consider a sequence X =
{xk}k::, 0 . Denote

Jk(X; x) =

[f(xj) + (g(xj), x- Xj}]


0~~k
_)_

~ f(x),

165

Nonsmooth convex optimization

As in Section 2.3.4, our scherne is based on the parametric function


rnax{f(x)- t, ](x)},

j(t; x)
f*(t)

rninf(t;x).
xEQ

Recall that f*(t) is nonincreasing in t. Let x* be a solution to (3.3.5).


Denote t* = J(x*). Then t* is the srnallest root of function f*(t).
Using the rnodels for the objective function and the constraint, we
can introduce a rnodel for the pararnetric function. Denote

fk(X;t,x)

= rnax{jk(X;x)- t,A(X;x)} ~ f(t;x),


= rninfk(X;t,x) ~ j*(t).

J;(X;t)

xEQ

Again, J;(x; t) is nonincreasing in t. It is clear that its srnallest root


tA;(X} does not exceed t*.
We will need the following characterization of the root tA;(X).
LEMMA

3.3.4

tk(X)
Proof: Denote by

= rnin{}k(X;

x)

I /k(X; x)

xt:

~ 0, x E Q}.

the solution of the rninirnization problern in the


right-hand side of the above equation. And let
= j~(X; xt:). Then

iz

Thus, we always have ik ;::: tk(X).


Assurne that i'k > t'k(X). Then there exists a point y such that

A(X; y)- t'k(X) ~ 0,


However, in this case
a contradiction.

i'k =

A(X; y) ~ 0.

jk(X; x'J.:) :::; /k(X; y) :::; tk(X) < i'k. That is


D

In our analysis we will need also the function

fk(X; t) = rnin fk(X; t, Xj),


OS,jS,k

the record value of our pararnetric rnodel.

166

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

LEMMA 3.3.5 Let t 0 < t 1


tk(X) > t1 and

:::;

> 0.

t*. Assurne that J;(X; tt)

Then

{3.3.6}

Proof. Denote xk(t)

E Arg minfk(X; t, x), t2

[0, 1]. Then


t1 = (1 - a)to

= tk(X),

:~:::::~ E

+ at2

and inequality (3.3.6) is equivalent to the following:

f;(x; h) ::; (1- a)fZ(X; to)


(note that J;(X; t2) = 0). Let
have

Xa =

+ afZ(X; t2)

(1- a)xk(to)

+ axk(t2).

(3.3.7)
Then we

Jk,(X; tt) ::=; max{!k(X; Xa)- t1; Jk(X; Xa)}


:::; max{(1- a)(!k(X; xk(to))- to)

+ a(!k(X; xk(t2))- t2);

(1- a)!k(X; xA;(to)) + ajk(X; xA;(t2))}


:::; (1 - a) max{!k(X; xk(to)) - to; fk(X; xk(to))}

+amax{jk(X; xk(t2))- t2; fk(X; xk(t2))}


= (1- a)/k,(X; to)

+ af;(x; t2),

and we get (3.3.7).

We also need the following statement (compare with Lemma 2.3.5).


LEMMA

3.3.6 For any A ~ 0 we have

f*(t) - A :::; f*(t + A),

/;(X; t)- A :::; jz(X; t + A)


Proof. Indeed, for f*(t) we have
f*(t + A) = min [max{f(x)- t; f(x)
xeQ

+ A}- A]

2:: min [max{f(x)- t; f(x)}- A] = J*(t)- A.


xEQ

167

Nonsmooth convex optimization

The proof of the second inequality is similar.

Now we areready to present a constrained minimization scheme (compare with constrained minimization scheme of Section 2.3.5).

Constrained Ievel method


0. Choose x 0 E

Q, t 0 < t*,

/'i,

E (0,

! ) and accuracy

> 0.

1. kth iteration (k ~ 0).

a). Keep generating sequence X = {xj}~ 0 by the


Ievel method as applied to function f(tk; x). If the
internal termination criterion

(3.3.8)

holds, then stop the internal process and set j(k) = j.


Global stop: fj(X; tk) :SE.
b). Set tk+1 = tj(k)(X).
We are interested in an analytical complexity bound for this method.
Therefore the complexity of computation of the root tj(X) and of the
value ]J(X; t) is not important for us now. We need to estimate the rate
of convergence of the master process and the complexity of Step 1a).
Let us start from the master process.
LEMMA

3.3. 7 For all k

0, we have

!J(k)(X;tk)

:S

t~=~ [2(1~~~:)r.

Proof: Denote

= 2(1~~~:)
Since tk+l

O"k-1

(<

1).

= tj(k)(X) andin view of Lemma 3.3.5, for all k :2: 1, we have


= y'tk~tk_Jj(k-1)(X; tk-d ~ y'tk~tk-l Jj(k)(X; tk-d
>
2
JA* (X t ) > 2(1-~~:) f* (X t ) 5!..Jr..
- y'tk+t-tk j(k) j k - y'tk+t -tk j(k) j k = .

168

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Thus, ak :::; ak-l and we obtain

kj*j(O) (X ; t 0 )

tkl-tk

lt-to

Further, in view of Lemma 3.3.6, t1- to 2: Jj(o)(X; to). Therefore

:S /:.k~~:Ji;(o)(X;to)(tk+l-tk) :S /:t~~:Jf*(to)(to -t*).


It remains to note that f*(to) ~to-t* (see Lemma 3.3.6).

Let Global stop condition in (3.3.8) be satisfied: fj (X; tk)


there exist j* such that

0
~ .

Then

Therefore we have

Since tk

t*, we conclude that


j(Xj)

<

/(Xj)

< E.

t*+~:,

(3.3.9)

In view of Lemma 3.3.7, we can get (3.3.9) at most in


1

-t

N(~:) = ln[2(1-~~:)]ln (1-~~:)~


full iterations of the master process. (The last iteration of the process is
terminated by the Global stop rule). Note that in the above expression
r;, is an absolute constant (for example, we can take r;, =
Let us estimate the complexity of the internal process. Denote

t)

Mt= max{ll g

111

g E

8f(x)

U8j(x), x E Q}.

We need to analyze two cases.


1. Full step. At this step the internal process is terminated by the
rule

169

Nonsmooth convex optimization

The corresponding inequality for the gap is as follows:

fj(k)(X; tk)- ]J(k)(X; tk) ~ ~JJ(k)(X; tk).


In view of Theorem 3.3.1, this happens at most after
M2D2

K- 2 (/j(k) (X ;tk

)f2a(1-a)2 (2-a)

iterations of the internal process. Since at the full step !J(k)(X; tk)) ~ e,
we conclude that
M2D2

j(k)- j(k- 1) ~ K.2~:2a(1~a)2(2-a)


for any full iteration of the master process.
2. Last step. The internal process of this step was terminated by
Global stop rule:
fj(X; tk) ~ e.
Since the normal stopping criterion did not work, we conclude that

fJ-1 (X; tk)- fJ-1 (X; tk) ~ ~fJ-1 (X; tk) ~ ~e.
Therefore, in view of Theorem 3.3.1, the number of iterations at the last
step does not exceed
K.22a(l-a)2(2-a)

Thus, we come to the following estimate of total complexity of the


constrained level method:

[1 + In(2(1-K-))
1
1 J.o..=.L...]
n

MrD
,.2e2a(l-a)2(2-a}

M2D2Jn
I

(1-1;}

2(to-t)

~:2a(1-a)2(2-a)K.2Jn[2(1-K.))'

It can be shown that the reasonable choice for the parameters of this
scheme is a = ~ = 2+\!2'.
The principal term in the above complexity estimate is on the order
of ~ ln 2(to;t). Thus, the constrained level method is suboptimal (see
Theorem 3.2.1).
In this method, at each iteration of the master process we need to
find the root tj(k)(X). In view of Lemma 3.3.4, that is equivalent to the
following problem:

min{A(X;x)

/k(X;x) ~ 0, x E Q}.

170

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

In other words, we need to solve the problern

s.t.

rnm

t,

f(xj)

+ (g(xj}, x- Xj)

t,

j = 0 ... k,

/(xj)

+ (g(xj}, x- Xj)

~ 0,

j = 0 ... k,

XE

Q.

If Q is a polytope, this problern can be solved by finite linear prograrnrning rnethods (sirnplex rnethod}. lf Q is rnore cornplicated, we need to
use interior-point schernes.
To conclude this section, let us note that we can use a better rnodel
for the functional constraints. Since

/(x) = rn.ax fi(x),


1$t$m

it is possible to work with

where 9i(Xj) E fi(xj) In practice, this complete rnodel significantly accelerates the convergence of the process. However, clearly each iteration
becomes more expensive.
As far as practical behavior of this scherne is concerned, we note that
usually the process is very fast. There are some technical problems,
related to accurnulation of too many linear pieces in the model. However,
in all practical schemes there exists sorne strategy for dropping the old
elernents of the rnodel.

Chapter 4

STRUCTURAL OPTIMIZATION

4.1

Self-concordant functions

(Do we really have a black box? What the Newton method actually does?
Definition of self-concordant functions; Main properties; Minimizing the selfconcordant function.)

4.1.1

Black box concept in convex optimization

In this chapter we are going to present the main ideas underlying the
modern polynomial-time interior-point methods in nonlinear optimization. In order to start, let us look first at the traditional formulation of
a minimization problem.
Suppose we want to solve a minimization problern in the following
form:
min{fo(x) I /j(x) ~ 0, j = l ... m}.
xeRn

We assume that the functional cornponents of this problern are convex. Note that all standard convex optimization schemes for solving
this problern are based on the black-box concept. This means that we
assume our problern to be equipped with an oracle, which provides us
with some inforrnation on the functional components of the problern at
some test point x. This oracle is local: lf we change the shape of a
component far enough from the test point, the answer of the oracle does
not change. These answers comprise the only inforrnation available for
numerical methods. 1
However, if we look carefully at the above situation, we can see a
certain contradiction. Indeed, in order to apply the convex optirnization
1 We

have discussed this concept and the corresponding methods in the previous chapters.

172

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

methods, we need to be sure that our functional components are convex.


However, we can check convexity only by analyzing the structure of these
functions 2 : If our function is obtained from the basic convex functions
by convex operations (summation, maximum, etc.), we conclude that it
is convex.
Thus, the functional components of the problern are not in a black
box at the moment we check their convexity and choose a minimization
scheme. But we put them in a black box for numerical methods. That is
the main conceptual contradiction of the standard convex optimization
theory. 3
The above Observation gives us hope that the structure of the problern
can be used to improve the performance of convex minimization schemes.
Unfortunately, structure is a very fuzzy notion, which is quite difficult
to formalize. One possible way to describe the structure is to fix the
analytical type of functional components. For example, we can consider
the problems with linear functions /j(x) only. This works, but note that
this approach is very fragile: Ifwe addjust a single functional component
of different type, we get another problern dass and all theory must be
done from scratch.
Alternatively, it is clear that having the structure at hand we can
play a lot with the analytical form of the problem. We can rewrite the
problern in many equivalent forms using nontrivial transformation of
variables or constraints, introducing additional variables, etc. However,
this would serve no purpose until the moment we realize the final goal
of such transformations. So, let us try to find the goal.
At this moment, it is better to Iook at classical examples. In many
situations the sequential reformulations of the initial problern can be
seen as a part of the numerical scheme. We start from a complicated
problern 'P and, step by step, we simplify its structure up to the moment
wegetatrivial problern (or, a problern which we know how to solve):

'P ----t ... ----t (!*' x*).


Let us Iook at the standard approach for solving a system of linear
equations, namely,

Ax=b.
We can proceed as follows:
1. Check that A is symmetric and positive definite. Sometimes this is

clear from the origin of matrix A.


2A

numerical verification of convexity is a hopeless problem.


the conclusions of the theory concerning the oracle-based minimization schemes
remain valid.

3 However,

173

Structural optimization

2. Compute the Cholesky factorization of the matrix:

A=LLT,
where L is a lower-triangular matrix. Form an auxiliary system

Ly

= b,

LT X

= y.

3. Salve the auxiliary system.


This process can be seen as a sequence of equivalent transformations of
the initial problern
Imagine for a moment that we do not know how to solve systems
of linear equations. In order to discover the above scheme we should
perform the following steps:
1. Find a dass of problems which can be solved very efficiently (linear
systems with triangular matrices in our example).
2. Describe the transformation rules for converting our initial problern
into the desired form.
3. Describe the dass of problems for which these transformation rules
are applicable.
We are ready to explain the way it works in optimization. First of
all, we need to find a basic numerical scheme and a problern formulation
at which this scheme is very efficient. We will see that for our goals the
most appropriate candidate is the Newton method (see Section 1.2.4) as
applied in the framework of Sequential Unconstrained Minimization (see
Section 1.3.3).
In the succeeding section we will highlight some drawbacks of the
standard analysis of Newton method. From this analysis we derive a
family of very special convex functions, the self-concordant functions
and self-concordant barriers, which can be efficiently minimized by the
Newton method. We use these objects in a description of a transformed
version of our initial problem. In the sequel we refer to this description as
to a barrier model of our problem. This model will replace the standard
functional model of optimization problern used in the previous chapters.

4.1.2

What the Newton method actually does?

Let us Iook at the standard result on local convergence of the Newton


method (we have proved it as Theorem 1.2.5). We are trying to find an
unconstrained local minimum x* of twice differentiable function f(x).
Assurne that:

174

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

f"(x*) t lln with some constant l > 0,

II

f"(x)- f"(y) II~ M

II x- y II for all x

and y ERn.

We assume also that the starting point of the Newton process xo is close
enough to x*:
(4.1.1)
II xo- x* II< r = 32f.t
Then we can prove that the sequence

xk+l = xk- [f"(xkt 1 J'(xk),

k ~

is well defined. Moreover, II Xk- x* II< f for all k


method (4.1.2) converges quadratically:

II Xk+l -X * II<-

o,

(4.1.2)

0 and the Newton

M!lxk-x*ll 2
2(l-Mjjxk-xli)'

What is wrong with this result? Note that the description of the
region of quadratic convergence (4.1.1) for this method is given in terms
of the standard inner product

(x, y) =

L x(i}y(i).
i=l

If we choose a new basis in Rn, then all objects in our description change:
the metric, the Hessians, the bounds l and M. But Iet us look what
happens with the Newton process. Namely, let A be a nondegenerate
(n x n)-matrix. Consider the function

</>(y) = f(Ay).
The following result is very important for understanding the nature of
Newton method.
4.1.1 Let {xk} be a sequence, generated by the Newton method
for function f:
LEMMA

Xk+l = Xk- [f"(xk)t 1J'(xk),

k ~ 0.

Consider a sequence {Yk}, generated by the Newton method for function


</>:
Yk+l = Yk- [<f>"(Yk)t 1</>'(yk), k ~ 0,
with Yo = A- 1xo. Then Yk = A- 1xk for all k ~ 0.
Proof: Let Yk = A- 1xk for some k ~ 0. Then
Yk+l

Yk- [4>"(yk)t 1 4>'(Yk)

= Yk- [AT f"(Ayk)At 1AT f'(Ayk)


0

175

Structural optimization

Thus, the Newton method is affine invariant with respect to affine


transformation of variables. Therefore its real region of quadratic convergence does not depend on a particular inner product. It depends only
on the local topological structure of function f (x).
Let us try to understand what was bad in our assumptions. The main
assumption we used is the Lipschitz continuity of Hessians:

Let

II f"(x)- f"(y) II:S: M II x- Y II,


us assume that f E 0 3 (Rn). Denote

Vx,y ERn.

f 111 (x)[u] = lim l[J"(x + au)- J"(x)].


a~oa

Note that the object in the right-hand side is an (n x n)-matrix. Then


our assumption is equivalent to

II f"'(x)[u]II:S: M II u II .
This means that at any point x E Rn we have

(i 111 (x)[u]v,v}

:S: M II u II II v 11 2

Vu,v ERn.

Note that the value in the left-hand side of this inequality is invariant
with respect to affine transformation of variables. However, the righthand side does not possess this property. Therefore the most natural
way to improve the situation is to find an affine-invariant replacement
for the standard norm II II The main candidate for such a replacement
is rather evident: That is the norm defined by the Hessian f" (x) itself,
namely,
II u IIJ''(x)= (f"(x)u,u) 112
This choice gives us the dass of self-concordant functions.

4.1.3

Definition of self-concordant function

Let us consider a closed convex function f (x) E 0 3 ( dom f) with open


domain. Let us fix a point x E dom f and a direction u E Rn. Consider
the function
1J(x; t) = f(x +tu),
as a function of variable t E dom 1J(x; ) ~ R 1 . Denote

D f(x)[u]
D 2 J(x)[u, u]

D 3 f(x)[u,u,u]

= 4J'(x; t) = (f'(x), u),

= 4J"(x; t) = (f"(x)u, u) =II u ll}"(x)'

= 1J111 (x;t) = (D 3f(x)[u]u,u).

176

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

4.1.1 We call lunction I self-concordant il there exists a


constant MJ 2:: 0 such that the inequality

DEFINITION

holds lor any x E dom I and u E Rn.


Note that we cannot expect these functions to be very widespread.
But we need them only to construct a barrier model of our problem. We
will see very soon that such functions are easy to be minimized by the
Newton method.
Let us point out an equivalent definition of self-concordant functions.
LEMMA 4 .1. 2 A lunction f is sell-concordant il and only il lor any
x E doml and any u1, u2, U3 ERn we have

D 3 !(x)[u1,

u2, u3]

1::; MJ IT II Ui llrcx)

(4.1.3)

i=l

We accept this statement without proof since it needs some special facts
from the theory of three-linear symmetric forms.
In what follows, we very often use Definition 4.1.1 in order to prove
that some I is self-concordant. On the contrary, Lemma 4.1.2 is useful
for establishing the properties of self-concordant functions.
Let us consider several examples.
EXAMPLE

4.1.1 1. Linear function. Consider the function


l(x) =

Then

f'(x)

a+ (a,x),

= a,

J"(x)

domf =Rn.

= 0,

j 111 (x)

= 0,

and we conclude that M1 = 0.


2. Convex quadratic function. Consider the function
f(x) = a

+ (a, x} + ~(Ax, x},

domf =Rn,

where A =AT t 0. Then

f'(x) = a + Ax,
and we conclude that M1 = 0.

f"(x)

= A,

/ 111 (x)

= 0,

177

Structural optimization

3. Logarithmic barrier for a ray. Consider a function of one variable

f(x)

= -lnx,

domf

= {x E R 1 I x > 0}.

Then

f'(x) = -~,

f"(x) = ~'

f"'(x) = -~.

Therefore f(x) is self-concordant with MJ = 2.


4. Logarithmic barrier for a second-order region. Let A = AT t 0.
Consider the concave function

if>(x) = a + {a, x} - !(Ax, x).


Define f(x) = -lnif>(x), with domf = {x ERn

I if>(x) > 0}.

Dj(x)[u] =

- t/>(~)[(a,u}- (Ax,u}],

D 2f(x)[u, u] =

t/>2{x) [(a, u)- (Ax, u)J2

D 3 j(x)[u,u,u] =

Then

+ tf>(~) (Au, u),

- t/> 3Cx)[(a,u)- {Ax,u)] 3


- tjl2(x)[{a,u)- (Ax,u}](Au,u}.

Derrote

w1

= Df(x)[u] and w2 = t/>(~)(Au,u}. Then

D 2 f(x)[u, u] =

I D 3 f(x)[u, u, u] I
The only nontrivial case is

w1

<

/D 3f(x)(u,u,u)l
(D2J(x)[u,u])3 2 -

w~

+ w2 2 0,

l2w~

+ 3wlw21 .

=f 0. Derrote a = w2jw~. Then

2/w1r+3/w1!w2 _ 2(1+ta) < 2


(w 1 +w 2)3f2 - (l+a)3f2

Thus, this function is self-concordant and M 1 = 2.


5. lt is easy to verify that none of the following functions of one variable
is self-concordant:

f(x)

= ex,

f(x)

= x1v,

x > 0, p > 0,

f(x)

=I x IP, p > 2.
0

Let us Iook now at the main properties of self-concordant functions.

178

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

THEOREM 4.1.1 Let functions Ii be self-concordant with constants Mi,


i = 1, 2, and let a, > 0. Then the function f(x) = aft (x) + h(x) is

self-concordant with constant

MJ = max { .}aMt,

.M2}

and domf = domftndom/2.


Proof: In view of Theorem 3.1.5, f is a closed convex function. Let us
fix some x E dom f and u E Rn. Then

I D 3 fi(x)[u, u, u]l~ Mi [ D 2 fi(x)[u, u] ] 3/2 , i = 1, 2.


Denote

Wi

= D 2 fi(x)[u, u] ;:=: 0. Then

<

ID 3 f(x)[u,u,uji
[D2 f(x)[u,u]]3 2

<

aiD 3 b(x)[u,u,u)i+iD3 b(x)[u,u,u]l


[aDl b (x)[u,u)+D2 f2(x)[u,u]]3/2

ctM1w~/ 2 +M2w~/ 2
[aw1 +w2]3f 2

The right-hand side of this inequality does not change when we replace
(w1,w2) by (tw 1, tw2) with t > 0. Therefore we can assume that

aw1 +w2 = 1.
Denote e = O:Wt. Then the right-hand side of the above inequality
becomes equal to

*'e1

+ ~(1- e) 312 ,

eE [o, 11.

e.

This function is convex in


Therefore it attains its maximum at the
D
end points of the interval (see Corollary 3.1.1).

4.1.1 Let function f be self-concordant with some constant


MJ. If A =AT~ 0, then the function

CoROLLARY

cf>(x) = a

+ (a, x) + ~(Ax, x) + j(x)

is also self-concordant with constant MI/I= MJ.


Proof: We have seen that any convex quadratic function is self-concordant with the constant equal to zero.
D

179

Structural optimization

CoROLLARY 4.1.2 Let function f be self-concordant with some constant


M1 and a > 0. Then the function <P(x) = af(x) is also self-concordant
with the constant MI/J = )oMJ.
0

Let us prove now that self-concordance is an affine-invariant property.


4.1.2 Let A(x) = Ax + b: Rn -t Rm, be a linear operator.
Assurne that function J(y) is self-concordant with constant M1. Then
the function <P(x) = J(A(x)) is also self-concordant and MI/J = MJ

THEOREM

Proof: The function <P(x) is closed and convex in view ofTheorem 3.1.6.
Let us fix some X E dom <P = {X : A(x) E dom!} and u E nn. Denote
y = A(x), v =Au. Then

(J'(A(x)), Au) = (J'(y), v),

D<P(x)[u] =

(J"(A(x))Au, Au) = (J"(y)v, v),

D 2<P(x)[u, u] D 3 <P(x)[u, u, u]

D 3 J(A(x))[Au, Au, Au]

= D 3 f(y)[v, v, v].

Therefore

I D 3 <P(x)[u,u,u] I

I D 3 f(y)[v,v,v]

MJ(D 2<P(x)[u, u]) 312

I~ MJ(J"(y)v,v) 312

The next statement demonstrates that some local properties of a selfconcordant function reflect somehow the global properties of its domain.
4.1.3 Let function f be self-concordant. If dom f contains
no straight line, then the Hessian J"(x) is nondegenerate at any x from
domf.
THEOREM

Proof: Assurne that (J"(x)u,u) = 0 for some x E domfand u ERn,


u =/:- 0. Consider the points Ya = x + o:u E dom f and the function

'1/J(a) = (J"(y 0 )u, u).


Note that
Since '1/;(a) ~ 0, we conclude that '1/J'(O) = 0. Therefore this function is
a part of the solution of the following system of differential equations:
t/J(O) =

e<o)

o,

t/J'(a)

2'1/;(a) 312

e'(a) = 0.

e(a),

180

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

However, this system has a unique trivial solution. Therefore 'lj;(a) = 0


for all feasible a.
Thus, we have shown that the function </J(a) = f(Ya) is linear:

</J(a)

a>.

f(x) + (J'(x), Ya- x) + f f(J"(Yr )u, u)drd)..

J(x)

0 0

+ a(J'(x), u).

Assurne that there exists such that Ya E fJ( dom f). Consider a sequence {ak} such that ak t . Then

Note that Zk E epi f, but z rt. epi f since Ya rt. dom f. That is a contradietioll since function f is closed. Considering direction -u, and assuming
that this ray intersects the boundary, we come to a contradiction again.
Therefore we conclude that Ya E dom f for all a. However, that is a
contradiction with the assumptions of the theorem.
D
Finally, Iet us describe the behavior of self-concordant function near
the boundary of its domain.
THEOREM

4.1.4 Let f be a self-concordant function. Then for any point

x E 8( dom f)

and any sequence

{xk} C domf:
we have f(xk) -+

Xk-+

+oo.

Proof: Note that the sequence {f(xk)} is bounded below:

f(xk) ~ f(xo)

+ (J'(xo), Xk- xo).

Assurne that it is bounded from above. Then it has a Iimit point f. Of


course, we can think that this is a unique Iimit point of the sequence.
Therefore
Zk = (xk, f(xk)) -+ z = (x, f).
Note that Zk E epi f, but z ~ epi f since x ~ dom f. That is a contradiction since function f is closed.
D
Thus, we have proved that f (x) is a barrier function for cl (dom f)
(see Section 1.3.3).

181

Structural optimization

4.1.4

Main inequalities

Let us fix some self-concordant function f(x). We assume that its


constant M 1 = 2 (otherwise we can scale it, see Corollary 4.1.2). We call
such functions the standard self-concordant. We assume also that dom f
contains no straight line (this implies that all f" (x) are nondegenerate,
see Theorem 4.1.3).
Denote:
(J"(x)u,u) 112 ,
1/ u llx =

II v 11;

([f"(x)tlv, v)l/2'

AJ(x)

([f"(x)t 1f'(x), f'(x)) 112.

Clearly, I (v, u) 1::;11 v 11; II u llx We call II u


direction u with respect to x, and AJ(x) =II f'(x)

llx the local norm of


11; is called the local

norm of the gradient J'(x). 4

Let us fix x E dom f and u E Rn, u =/= 0. Consider the function of one
variable
<fJ(t) = (f"(x+t~)u,u)l/2
with the domain dom</J = {t E R 1

x +tuE domf}.

LEMMA 4 .1. 3 For alt feasible t we have I </J' (t)

I::; 1.

Proof: Indeed,
"'-'( ) _ _ f'"(x+tu)[u,u,ul
2(! 11 (x+tu)u,u)3)2
'+' t -

Therefore I <P'(t)

COROLLARY

1::; 1 in view of Definition 4.1.1.

4.1.3 Domain of function <jJ(t) contains the interval

(-cp(O), cp(O)).
Proof: Since f(x+tu) --+ oo as x+tu approaches the boundary of dom f
(see Theorem 4.1.4), the function (f"(x + tu)u, u} cannot be bounded.
t I c/J( t) > 0}. It remains to note that
Therefore dom cp

={

<P(t)

</J()- I t

in view of Lemma 4.1.3.

4 Sometimes

AJ(X) is called the Newton decrement of function f at x.

182

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Let us consider the following ellipsoid:

W 0 (x;r)
W(x; r)

= {y ERn I II y-

= cl (W 0 (x; r))

llx< r},

={y ERn I II y- x llxS r}.

This ellipsoid is called the Dikin ellipsoid of function f at x.


4.1.5 1. For any XE domf we have W 0 (x; 1) ~ domf.
2. For all x, y E dom f the following inequality holds:

THEOREM

3. If

II y- x llx< 1,

y-

lly-xll.,
II Y-> l+liy-xll.,.

II

y-

lly-xll.,
I Y-< 1-ily-xl!.,'

then
(4.1.5)

Proof: 1. In view of Corollary 4.1.3, dom f contains the set

{y =

+ tu I t 2 I u II; < 1}

(since </>(0) = 1/ II u llx) That is exactly W 0 (x; 1).


2. Let us choose u = y- x. Then
4>( 1)

= IIY!xlly'

4>(0)

= iiy!xu.,'

and </>(1) :$ 4>(0) + 1 in view of Lemma 4.1.3. That is (4.1.4}.


3. lf II y- x llx< 1, then 4>(0) > 1, and in view of Lemma 4.1.3
4>(1} 2:: 4>(0}- 1. That is (4.1.5}.
0
THEOREM

4.1.6 Let XE domf. Then for any y E W 0 (x; 1} we have

(1- II Y- x llx) 2 f"(x) ~ f"(y) ~ (t-IIY_:xll.,)2 f"(x).

(4.1.6)

Proof: Let us fix some u E Rn, u ::j:. 0. Consider the function

1/;(t) = (f"(x

+ t(y- x))u,u),

t E [0, 1].

Denote Yt = x + t(y- x). Then, in view of Lemma 4.1.2 and (4.1.5}, we


have

11/J'(t) I

I D 3 f(YtHY- x,u,u] IS 211

y-

IIYtll u ll;t

183

Structural optimization

Therefore
2{ln(1- t

II y- x llx))'::; (ln1f.l(t))'::; -2(ln(1- t II y- X llx))'.

Let us integrate this inequality in t E [0, 1]. We get:

(1- II Y-

llx) 2 ~ ~f~~ ~

(1-lly:xiJx)2

That is exactly (4.1.6).

4.1.4 Let x
estimate the matrix

OROLLARY

domf and r

=II y- x llx<

1. Then we can

f"(x

G=

+ T(y- x))dT

as follows:

(1- r + r32 )f"(x) ~ G ~ l~rf"(x).

Proof: Indeed, in view of Theorem 4.1.6 we have


G =
=

f"(x

(1 - r

+ T(y- x))dT?: f"(x) I(1- Tr) 2dT


0

+ ~r 2 )f"(x),
1

-< f"(x) I (l!;r)2


0

= l~rf"(x).
0

Let us look again at the most important facts we have proved.


At any point x E dom f we can point out an ellipsoid
W 0 (x; 1) = {x
b~longing

Rn

(f"(x)(y- x), y- x)) < 1},

to dom f.

Inside the ellipsoid W(x; r) with r E [0, 1) function


quadratic:
(1- r) 2 f"(x) ~ f"(y) ~ (l!r)2f"(x)

is almost

for all y E W (x; r). Choosing r small enough, we can make the
quality of the quadratic approximation acceptable for our goals.

184

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

These two facts form the basis for almost all consequent results.
We conclude this section with the results describing the variation of
a self-concordant function with respect to a linear approximation.
THEOREM 4.1. 7 For any x, y E dom f we have

(! '( Y) - J'( X )'Y -

) -> I+JJy-xJJx'
JJy-xl!~

+ (f'(x), Y- x) + w(ll Y- x llx},


where w(t) = t -ln(l + t).
Proof: Denote Yr = x + T(y- x), T E [0, 1], and r =II

(4.1. 7}

f(y) ~ f(x)

in view of {4.1.4) we have

(4.1.8}

y- x llx Then,

(f'(y)- f'(x), y- x)

f(f"(yr)(y- x), y- x)dT

2:

J (l_;rr)2dT = r 0J (l~t)2dt =
0

1~r

Further, using (4.1.7), we obtain

f(y)- f(x)- (f'(x),y- x)

f(f'(yr)- f'(x),y- x)dT

~(f'(Yr)- f'(x),yr- x)dT

IIYr-xlli d _ J1 rr 2 d
0 r(l+JJyr-xJJx) T - 0 l+rr T

> Jl
r

= J f~t = w(r).
0

THEOREM 4.1.8 Let x E dom f and

(! '( Y) f(y) S:: f(x)

II

y- x

llx<

1. Then

J'( X )'Y - X ) -< 1-JJy-xJix'


JJy-xl!~

+ (f'(x), Y-

x)

+ w*(ll

y- x llx),

(4.1.9}

(4.1.10}

185

Structural optimization

where w.(t) = -t -ln(l- t).

Proof: Derrote Yr = x + T(y- x), T E [0, 1], and r


II Yr- x II< 1, in view of (4.1.5) we have
(f'(y)-J'(x),y-x)

=II

y-

llx

Since

I(f"(yr)(y-x),y-x)dT
0

r2 d
(1-rrF 7

= r I0

1 d
(1-t)2 t

r2
= 1-r

Further, using (4.1.9), we obtain


J(y)- f(x)- (f'(x), y- x)

I(f'(Yr)- f'(x), y- x)dT


0
1

I ~(J'(Yr)- f'(x),Yr- x)dT

= I
0

f~tt

= w.(r).
D

4.1.9 /nequalities (4.1.4}, (4.1.5), (4.1. 7}, (4.1.8}, (4.1.9}


and (4.1.1 0) are necessary and sufficient characteristics of standard selfconcordant functions.

THEOREM

Proof: We have justified two sequences of implications:


Definition 4.1.1

=> (4.1.4) => (4.1.7) => (4.1.8),

Definition 4.1.1

=> (4.1.5) => {4.1.9) => {4.1.10).

Let us prove the implication (4.1.8) => Definition 4.1.1. Let x E dom f
and x - au E dom f for a E [0, E). Consider the function

1/J(a) = f(x- au),

a E [0, ~:).

186

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Denote r = llullx [cp"(o)Jl/ 2 Assuming that {4.1.8} holds for all x and
y from dom f, we have

'1/J(a)- '1/J(O)- '1/J'(O)a- !'l/J"(O)a2 ~ w(ar)- !a2 r 2


Therefore
~'1/J/11 (0)

lim ['1/J(a)- '1/J(O)- '1/J'(O)a- !'l/J"(O)a2 ]

a,l.O

> lim
~ [w(ar)- !a2 r 2]
a,l.O a:

= lim
a,I.O

p
0

[w'(ar)- ar]

Thus, D3 f(x)[u,u,u] = -'1/J"(O) ~ 'l/J111 (0) ~ 2['1/J"(O)j31 2 and that is


Definition 4.1.1 with MJ = 2. Implication (4.1.10) => Definition 4.1.1
D
can be proved by a similar reasoning.
The above theorems are written in terms of two auxiliary functions
w(t) = t -ln(1 + t) and w*(r) = -T -ln(1- r). Note that
t
1+t

w 1(t)

w~(r)

= 1 ~r 2 0,

2 0,

"(t)

(1+t)2

> 0'

w~(r)

c1 !r)2

> 0.

Therefore, w(t) and w*(r) are convex functions. In what follows we often
use different relations between these functions. Let us fix this notation
for future references.
LEMMA

4.1.4 For any t

0 and TE [0, 1) we have

w'(w~(r))

= r,

w(t) = max [t- w*()],


0~{<1

w(t)
w*(r) =

w~(w'(t))

= t,

w*(r) = max[r- w()],


{~0

+ w.(r) 2 rt,

rw~(r)- w(w~(r)),

w(t) = tw'(t)- w*(w'(t)).

We leave the proof of this Iemma as an exercise for the reader. For
an advanced reader we should note that the only reason for the above
relationsisthat functions w(t) and w*(t) are conjugate.
Let us prove two more inequalities.

187

Structural optimization
THEOREM

4.1.10 For any x and y from Q we have

f(y) ~ f(x)

+ (f'(x), Y- x) + w(llf'(y)- f'(x)ll;).

If in addition llf'(y)-

f(y) ::; f(x)

f'(x)ll; < 1,

(4.1.11)

then

+ (f'(x), Y- x) + w*(llf'(y)- f'(x)ll;).

(4.1.12)

Proof: Let us fix an arbitrary x and y from Q. Consider the function

<P(z) = f(z)- (f'(x), z),

z E Q.

Note that this function is self-concordant and <P'(x) = 0. Therefore,


using inequality (4.1.10) we get

f(x)- (f'(x),x) =

</J(x) = min</J(z)
zEQ

+ (<!J'(y), z- y) + w*(llz- yjjy)]

<

min[<!J(y)

</J(y)- w(II<P'(y)ll;)

f(y) - (f'(x), y) - w(llf'(y) - f'(x) II;),

zEQ

and that is (4.1.11). In order to prove inequality (4.1.12) we use a similar


reasoning with (4.1.8).
D

Minimizing the self-concordant function

4.1.5

Let us consider the following minimization problem:


min{f(x)

Ix

E domf}.

(4.1.13)

The next theorem provides us with a sufficient condition for existence of


its solution. Recall that we assume that f is a standard self-concordant
function and dom f contains no straight line.
4 .1.11 Let >.1 (x) < 1 for some x E dom f. Then the solution
of problern (4.1.13), xj, exists and is unique.

THEOREM

Proof: Indeed, in view of (4.1.8), for any y E domf we have

f(y)

> f(x) + (f'(x), Y- x) + w(ll Y- x llx)


> f(x)- II f'(x) u;. 'II y- X llx +w(ll y- X llx)
=

f(x)- AJ(x) II Y-

llx +w(ll Y- X llx)

188

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Therefore for any y E .Cj(J(x)) = {y ERn

f(y)

f(x)} we have

lly!xllx w(ii Y- X iix) ~ Aj(X) < 1.


Note that the function tw(t) = 1- t ln(1 + t) is strictly increasing in t.
Hence, II y- x llx~ f, where f is a unique positive root of the equation
(1- AJ(x))t = ln(1 + t).
Thus, CJ(f(x)) is bounded and therefore xj exists. It is unique since in
view of {4.1.8) for all y E domf we have

+ w(ll Y- xj llxj)

f(y) ~ f(xj)

Thus, we have proved that a local condition AJ(x) < 1 provides us


with some global information on function f, that is the existence of
the minimum xj. Note that the result of Theorem 4.1.11 cannot be
strengthened.
EXAMPLE

4.1.2 Let us fix some

variable
ff(x}

=EX

> 0. Consider a function of one

-lnx,

x > 0.

This function is self-concordant in view of Example 4.1.1 and Corollary 4.1.1. Note that

f: (X) = E -

~,

f:' = ~.

Therefore AJ,(x) =11- EX I Thus, for E = 0 we have AJ0 (x) = 1 for any
x > 0. Note that the function fo is not bounded below.
If E > 0, then xj. = ~ Note that we can recognize the existence of
the minimizer at point x = 1 even if E is arbitrary small.
D

Let us consider now a scheme of the damped Newton method:


Damped Newton method
0. Choose xo E dorn f.
1. Iterate Xk+l = Xk- l+A~(xk) [f"(xk)]- 1f'(xk), k ~ 0.

(4.1.14)

189

Structural optimization

THEOREM

4.1.12 For any k

2 0 we have
(4.1.15}

= AJ(Xk) Then II Xk+I - Xk llx= 1 ~..\


Therefore, in view of (4.1.10) and Lemma 4.1.4, we have

Proof: Denote >.

.>.2

w'(>.).

+ w*(w'(>.))

f(xk)- 1+.>.

f(xk)- >.w'(>.)

+ w*(w'(>.))

= f(xk)- w(>.).
D

Thus, for all x E domf with AJ(x) 2 > 0 one step of the damped
Newton method decreases the value of f(x) at least by a constant w() >
0. Note that the result of Theorem 4.1.12 is global. It can be used to
obtain a global efficiency estimate of the process.
Let us describe now the local convergence of the standard Newton
method:

Standard Newton method


(4.1.16)

0. Choose xo E dom f.
1. Iterate

Xk+l

Xk-

[f"(xk)]- 1 /'(xk), k 2 0.

Note that we can measure the convergence of this process in different


ways. We can estimate the rate of convergence for the functional gap
f(xk)- f(xj), or for the local norm ofthe gradient AJ(xk) =II f'(xk) ll;k,
or for the local distance to the minimum II Xk - xj llxk. Finally, we can
look at the distance to the minimum in a fixed metrics

defined by the minimum itself. Let us prove that locally all these measures are equivalent.

190

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

< 1. Then

THEOREM 4.1.13 Let AJ(x)

w(A.J(x)}::; f(x}- f(xj)::; w*(A.J(x}},

{4.1.17}
(4.1.18}

w(r*(x)) ::; f(x)- f(xj)::; w*(r*(x)},

(4.1.19}

where the last inequality is valid for r * (x) < 1.


Proof: Denote r =II x- xj llx and ).. = AJ(x}. Inequalities (4.1.17}
follow from Theorem 4.1.10. Further, in view of (4.1.7} we have

1~r::; (J'(x),x- xj)::; .Ar.


That is the right-hand side of inequality (4.1.18). If r ~ 1 then the
left-hand side of this inequality is trivial. Suppose that r < 1. Then
f'(x) = G(x- xj) with

J
1

G=

J"(xj

+ T(x- xj)}dT,

and

A.j(x)

([J"(x)t 1 G(x- xj}, G(x- xj)) ::;11 H 11 2 r 2 ,

where H = [f"(x)J- 112 G[f"(x)J- 112 In view of Corollary 4.1.4, we have


1-f"(x).
G -<
1-r
- -

Therefore

II

11::;

1 ~r and we conclude that


A.}(x) ::; ( 1 ~~)2 = (w~(r)) 2 .

Thus, AJ(x) ::; w~(r}. Applying w'() to both sides, we get the remairring
part of (4.1.18).
0
Finally, inequalities (4.1.19) follow from (4.1.8) and (4.1.10).
Let us estimate the local rate convergence of the standard Newton
method (4.1.16). It is convenient to do that in terms of >..1(x), the local
norm of the gradient.
THEOREM 4.1.14 Let x E domf and AJ(x)

< 1. Then the point

x+ = x- [f"(x)t 1 f'(x)

191

Structural optimization

belongs to dom f and we have


, (

Af

X+ ::=;

( ,x 1 (x)

1->.,(x)

)2 '

Proof: Denote p =X+- x, >. = >.t(x). Then II p llx= >. < 1. Therefore
x+ E domf (see Theorem 4.1.5). Note that in view of Theorem 4.1.6,

>-t(x+) = ([/"(x+)t 1 f'(x+), /'(x+)) 112

~ 1-liPII:r II f'(x+) llx= 1~-X II f'(x+) llx


Further,

f'(x+) = f'(x+)- f'(x)- f"(x)(x+- x) = Gp,


1

where G = J[f"(x + rp)- f"(x)]dr. Therefore


0

II !'(x+) II;= ([J"(x)t 1 Gp, Gp) ::;II H 11 2 . II p II;,


where H = [f"(x)t 112 G[f"(x)t 112 In view of Corollary 4.1.4,

(->. + k>. 2 )f"(x) j G j 1 ~_xf"(x).


Therefore

II H 11:: ; max:{t~_x,>.- k>- 2 }


>.](x+) :::; (l!.x)2

1 ~_x,

and we conclude that

II f'(x+) lli:::; c 1 ~~)4

Theorem 4.1.14 provides us with the following description of the region


of quadratic convergence of scheme (4.1.16):

>.t(x) < >. =

3_

yg = 0.3819 ... ,

_\)2

where ); is the root of the equation (1


= 1. In this case we can
guarantee that >.J(x+) < >.J(x).
Thus, our results Iead to the following strategy for solving the initial
problern (4.1.13).

First stage: >.t(xk) ~ , where E (0, 3;). At this stage we apply the
damped Newton method. At each iteration of this method we have

Thus, the number of steps of this stage is bounded:

N ~ wfy[f(xo)- f(xi)].

192

INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION

Second stage: AJ(Xk) ::; . At this stage we apply the standard


Newton method. This process converges quadratically:
\ (

Af

Xk+l ::;

( AJ(Xk) ) 2
>.J(Xk)
1->.f(Xk)
::; (l-)2

\ ( )
< Af
Xk

It can be shown that the local convergence of the damped Newton


method (4.1.14) is also quadratic:
(4.1.20)
However, we prefer to use the above switching strategy since it gives
better complexity bounds. Relation (4.1.20) can be justified in the same
way as it was clone in Theorem 4.1.14. We leave the reasoning as an
exercise for the reader.

4.2 Self-concordant barriers

(Motivation; Definition of self-concordant barriers; Main properties; Standard minimization problem; Central path; Path-following method; How to initialize the process? Problems with functional constraints.)

4.2.1 Motivation

In the previous section we have seen that the Newton method is very efficient in minimizing a standard self-concordant function. Such a function is always a barrier for its domain. Let us check what can be proved about the sequential unconstrained minimization approach (Section 1.3.3), which uses such barriers.

In what follows we deal with constrained minimization problems of a special type. Denote $\mathrm{Dom}\, f = \mathrm{cl}(\mathrm{dom}\, f)$.
DEFINITION 4.2.1 We call a constrained minimization problem standard if it has the form

$$\min\{ \langle c, x \rangle \mid x \in Q \}, \qquad (4.2.1)$$

where $Q$ is a closed convex set. We assume also that we know a self-concordant function $f$ such that $\mathrm{Dom}\, f = Q$.

Let us introduce a parametric penalty function

$$f(t; x) = t \langle c, x \rangle + f(x)$$

with $t \ge 0$. Note that $f(t; x)$ is self-concordant in $x$ (see Corollary 4.1.1). Denote

$$x^*(t) = \arg\min_{x \in \mathrm{dom}\, f} f(t; x).$$


This trajectory is called the central path of the problem (4.2.1). Note that we can expect $x^*(t) \to x^*$ as $t \to \infty$ (see Section 1.3.3). Therefore we are going to follow this trajectory.

Recall that the standard Newton method, as applied to the minimization of the function $f(t; x)$, has a local quadratic convergence (Theorem 4.1.14). Moreover, we have an explicit description of the region of quadratic convergence:

$$\lambda_{f(t;\cdot)}(x) \le \bar\lambda = \frac{3 - \sqrt 5}{2}.$$

Let us study our possibilities assuming that we know exactly $x = x^*(t)$ for some $t > 0$. Thus, we are going to increase $t$:

$$t_+ = t + \Delta, \qquad \Delta > 0.$$

However, we need to keep $x$ in the region of quadratic convergence of the Newton method for the function $f(t + \Delta; \cdot)$:

$$\lambda_{f(t+\Delta;\cdot)}(x) \le \beta < \bar\lambda.$$

Note that the update $t \to t_+$ does not change the Hessian of the barrier function:

$$f''(t + \Delta; x) = f''(t; x).$$

Therefore it is easy to estimate how big the step $\Delta$ can be. Indeed, the first-order optimality condition provides us with the following central path equation:

$$tc + f'(x^*(t)) = 0. \qquad (4.2.2)$$

Since $tc + f'(x) = 0$, we obtain

$$\lambda_{f(t+\Delta;\cdot)}(x) = \| t_+ c + f'(x) \|^*_x = \Delta \| c \|^*_x = \frac{\Delta}{t}\, \| f'(x) \|^*_x \le \beta.$$

Hence, if we want to increase $t$ at a linear rate, we need to assume that the value

$$\lambda_f^2(x) \equiv \bigl( \| f'(x) \|^*_x \bigr)^2 = \bigl\langle [f''(x)]^{-1} f'(x),\, f'(x) \bigr\rangle$$

is uniformly bounded on $\mathrm{dom}\, f$.

Thus, we come to a definition of a self-concordant barrier.

4.2.2 Definition of self-concordant barriers

DEFINITION 4.2.2 Let $F(x)$ be a standard self-concordant function. We call it a $\nu$-self-concordant barrier for the set $\mathrm{Dom}\, F$, if

$$\sup_{u \in R^n} \bigl[ 2 \langle F'(x), u \rangle - \langle F''(x)u, u \rangle \bigr] \le \nu \qquad (4.2.3)$$

for all $x \in \mathrm{dom}\, F$. The value $\nu$ is called the parameter of the barrier.

Note that we do not assume $F''(x)$ to be nondegenerate. However, if this is the case, then inequality (4.2.3) is equivalent to

$$\bigl\langle [F''(x)]^{-1} F'(x),\, F'(x) \bigr\rangle \le \nu. \qquad (4.2.4)$$

We will also use another equivalent form of inequality (4.2.3):

$$\langle F'(x), u \rangle^2 \le \nu \langle F''(x)u, u \rangle \quad \forall u \in R^n. \qquad (4.2.5)$$

(To see that for $u$ with $\langle F''(x)u, u \rangle > 0$, replace $u$ in (4.2.3) by $\lambda u$ and find the maximum of the left-hand side in $\lambda$.) Note that condition (4.2.5) can be written in matrix notation:

$$F''(x) \succeq \frac{1}{\nu}\, F'(x) F'(x)^T. \qquad (4.2.6)$$

Let us check now which of the self-concordant functions of Example 4.1.1 are also self-concordant barriers.

EXAMPLE 4.2.1 1. Linear function: $f(x) = \alpha + \langle a, x \rangle$, $\mathrm{dom}\, f = R^n$. Clearly, for $a \neq 0$ this function is not a self-concordant barrier since $f''(x) = 0$.

2. Convex quadratic function. Let $A = A^T \succ 0$. Consider the function

$$f(x) = \alpha + \langle a, x \rangle + \tfrac12 \langle Ax, x \rangle, \qquad \mathrm{dom}\, f = R^n.$$

Then $f'(x) = a + Ax$ and $f''(x) = A$. Therefore

$$\bigl\langle [f''(x)]^{-1} f'(x),\, f'(x) \bigr\rangle = \langle A^{-1}(Ax + a),\, Ax + a \rangle = \langle Ax, x \rangle + 2 \langle a, x \rangle + \langle A^{-1} a, a \rangle.$$

Clearly, this value is unbounded from above on $R^n$. Thus, a quadratic function is not a self-concordant barrier.

3. Logarithmic barrier for a ray. Consider the following function of one variable:

$$F(x) = -\ln x, \qquad \mathrm{dom}\, F = \{ x \in R^1 \mid x > 0 \}.$$

Then $F'(x) = -\frac1x$ and $F''(x) = \frac{1}{x^2} > 0$. Therefore

$$\frac{(F'(x))^2}{F''(x)} = \frac{1}{x^2} \cdot x^2 = 1.$$

Thus, $F(x)$ is a $\nu$-self-concordant barrier for $\{ x > 0 \}$ with $\nu = 1$.

4. Logarithmic barrier for a second-order region. Let $A = A^T \succeq 0$. Consider the concave quadratic function

$$\phi(x) = \alpha + \langle a, x \rangle - \tfrac12 \langle Ax, x \rangle.$$

Define $F(x) = -\ln \phi(x)$, $\mathrm{dom}\, F = \{ x \in R^n \mid \phi(x) > 0 \}$. Then

$$\langle F'(x), u \rangle = -\frac{1}{\phi(x)} \bigl[ \langle a, u \rangle - \langle Ax, u \rangle \bigr],$$

$$\langle F''(x)u, u \rangle = \frac{1}{\phi^2(x)} \bigl[ \langle a, u \rangle - \langle Ax, u \rangle \bigr]^2 + \frac{1}{\phi(x)} \langle Au, u \rangle.$$

Denote $\omega_1 = \langle F'(x), u \rangle$ and $\omega_2 = \frac{1}{\phi(x)} \langle Au, u \rangle$. Then

$$\langle F''(x)u, u \rangle = \omega_1^2 + \omega_2 \ge \omega_1^2.$$

Therefore $2 \langle F'(x), u \rangle - \langle F''(x)u, u \rangle \le 2\omega_1 - \omega_1^2 \le 1$. Thus, $F(x)$ is a $\nu$-self-concordant barrier with $\nu = 1$. $\Box$

Let us present some simple properties of self-concordant barriers.

THEOREM 4.2.1 Let $F(x)$ be a self-concordant barrier. Then the function $\langle c, x \rangle + F(x)$ is a self-concordant function on $\mathrm{dom}\, F$.

Proof: Since $F(x)$ is a self-concordant function, we just apply Corollary 4.1.1. $\Box$

Note that this property is important for path-following schemes.

THEOREM 4.2.2 Let $F_i$ be $\nu_i$-self-concordant barriers, $i = 1, 2$. Then the function

$$F(x) = F_1(x) + F_2(x)$$

is a self-concordant barrier for the convex set $\mathrm{Dom}\, F = \mathrm{Dom}\, F_1 \cap \mathrm{Dom}\, F_2$ with the parameter $\nu = \nu_1 + \nu_2$.

Proof: In view of Theorem 4.1.1, $F$ is a standard self-concordant function. Let us fix $x \in \mathrm{dom}\, F$. Then

$$\max_{u \in R^n} \bigl[ 2 \langle F'(x), u \rangle - \langle F''(x)u, u \rangle \bigr]$$
$$= \max_{u \in R^n} \bigl[ 2 \langle F_1'(x), u \rangle - \langle F_1''(x)u, u \rangle + 2 \langle F_2'(x), u \rangle - \langle F_2''(x)u, u \rangle \bigr]$$
$$\le \max_{u \in R^n} \bigl[ 2 \langle F_1'(x), u \rangle - \langle F_1''(x)u, u \rangle \bigr] + \max_{u \in R^n} \bigl[ 2 \langle F_2'(x), u \rangle - \langle F_2''(x)u, u \rangle \bigr] \le \nu_1 + \nu_2. \qquad \Box$$

Finally, let us show that the value of the parameter of a self-concordant barrier is invariant with respect to affine transformations of variables.

THEOREM 4.2.3 Let $\mathcal A(x) = Ax + b$ be a linear operator, $\mathcal A(x): R^n \to R^m$. Assume that the function $F(y)$ is a $\nu$-self-concordant barrier. Then the function $\Phi(x) = F(\mathcal A(x))$ is a $\nu$-self-concordant barrier for the set

$$\mathrm{Dom}\, \Phi = \{ x \in R^n \mid \mathcal A(x) \in \mathrm{Dom}\, F \}.$$

Proof: The function $\Phi(x)$ is a standard self-concordant function in view of Theorem 4.1.2. Let us fix $x \in \mathrm{dom}\, \Phi$. Then $y = \mathcal A(x) \in \mathrm{dom}\, F$. Note that for any $u \in R^n$ we have

$$\langle \Phi'(x), u \rangle = \langle F'(y), Au \rangle, \qquad \langle \Phi''(x)u, u \rangle = \langle F''(y)Au, Au \rangle.$$

Therefore

$$\max_{u \in R^n} \bigl[ 2 \langle \Phi'(x), u \rangle - \langle \Phi''(x)u, u \rangle \bigr] = \max_{u \in R^n} \bigl[ 2 \langle F'(y), Au \rangle - \langle F''(y)Au, Au \rangle \bigr]$$
$$\le \max_{v \in R^m} \bigl[ 2 \langle F'(y), v \rangle - \langle F''(y)v, v \rangle \bigr] \le \nu. \qquad \Box$$

4.2.3 Main inequalities

Let us show that the local characteristics of a self-concordant barrier (the gradient and the Hessian) provide us with global information about the structure of the domain.

THEOREM 4.2.4 1. Let $F(x)$ be a $\nu$-self-concordant barrier. For any $x$ and $y$ from $\mathrm{dom}\, F$, we have

$$\langle F'(x), y - x \rangle < \nu. \qquad (4.2.7)$$

Moreover, if $\langle F'(x), y - x \rangle \ge 0$, then

$$\langle F'(y) - F'(x),\, y - x \rangle \ge \frac{\langle F'(x), y - x \rangle^2}{\nu - \langle F'(x), y - x \rangle}. \qquad (4.2.8)$$

2. A standard self-concordant function $F(x)$ is a $\nu$-self-concordant barrier if and only if

$$F(y) \ge F(x) - \nu \ln\Bigl( 1 - \tfrac1\nu \langle F'(x), y - x \rangle \Bigr) \quad \forall x, y \in \mathrm{dom}\, F. \qquad (4.2.9)$$

Proof: 1. Let $x, y \in \mathrm{dom}\, F$. Consider the function

$$\phi(t) = \bigl\langle F'(x + t(y - x)),\, y - x \bigr\rangle, \qquad t \in [0, 1].$$

If $\phi(0) \le 0$, then (4.2.7) is trivial. If $\phi(0) = 0$, then (4.2.8) is trivial. Suppose that $\phi(0) > 0$. Note that in view of (4.2.5) we have

$$\phi'(t) = \bigl\langle F''(x + t(y - x))(y - x),\, y - x \bigr\rangle \ge \tfrac1\nu \phi^2(t).$$

Therefore $\phi(t)$ increases and it is positive for $t \in [0, 1]$. Moreover, for any $t \in [0, 1]$ we have

$$-\frac{1}{\phi(t)} + \frac{1}{\phi(0)} \ge \frac{t}{\nu}.$$

This implies that $\langle F'(x), y - x \rangle = \phi(0) < \frac{\nu}{t}$ for all $t \in (0, 1]$. Thus, (4.2.7) is proved. Moreover,

$$\phi(t) - \phi(0) \ge \frac{\nu \phi(0)}{\nu - t \phi(0)} - \phi(0) = \frac{t \phi^2(0)}{\nu - t \phi(0)}, \qquad t \in [0, 1].$$

Taking $t = 1$, we get (4.2.8).

2. Denote $\psi(x) = e^{-\frac1\nu F(x)}$. Then

$$\psi'(x) = -\tfrac1\nu e^{-\frac1\nu F(x)} F'(x),$$
$$\psi''(x) = -\tfrac1\nu e^{-\frac1\nu F(x)} \Bigl[ F''(x) - \tfrac1\nu F'(x) F'(x)^T \Bigr].$$

Thus, in view of Theorem 2.1.4 and definition (4.2.6), the function $\psi(x)$ is concave if and only if the function $F(x)$ is a $\nu$-self-concordant barrier. It remains to note that (4.2.9) is the same as

$$\psi(y) \le \psi(x) + \langle \psi'(x), y - x \rangle$$

up to a logarithmic transformation of both sides of the inequality. $\Box$

THEOREM 4.2.5 Let $F(x)$ be a $\nu$-self-concordant barrier. Then for any $x \in \mathrm{dom}\, F$ and $y \in \mathrm{Dom}\, F$ such that

$$\langle F'(x), y - x \rangle \ge 0, \qquad (4.2.10)$$

we have

$$\| y - x \|_x \le \nu + 2\sqrt\nu. \qquad (4.2.11)$$

Proof: Denote $r = \| y - x \|_x$ and let $r > \sqrt\nu$ (otherwise (4.2.11) is trivial). Consider the point $y_\alpha = x + \alpha(y - x)$ with $\alpha = \frac{\sqrt\nu}{r} < 1$. In view of our assumption (4.2.10) and inequality (4.1.7) we have

$$\omega \equiv \langle F'(y_\alpha), y - x \rangle \ge \langle F'(y_\alpha) - F'(x),\, y - x \rangle = \frac1\alpha \langle F'(y_\alpha) - F'(x),\, y_\alpha - x \rangle$$
$$\ge \frac1\alpha \cdot \frac{\| y_\alpha - x \|_x^2}{1 + \| y_\alpha - x \|_x} = \frac{\alpha \| y - x \|_x^2}{1 + \alpha \| y - x \|_x} = \frac{\sqrt\nu\, r}{1 + \sqrt\nu}.$$

On the other hand, in view of (4.2.7), we obtain

$$(1 - \alpha)\, \omega = \langle F'(y_\alpha),\, y - y_\alpha \rangle \le \nu.$$

Thus,

$$\Bigl( 1 - \frac{\sqrt\nu}{r} \Bigr) \frac{\sqrt\nu\, r}{1 + \sqrt\nu} \le \nu,$$

and that is exactly (4.2.11). $\Box$

We conclude this section by studying the properties of one special point of a convex set.

DEFINITION 4.2.3 Let $F(x)$ be a $\nu$-self-concordant barrier for the set $\mathrm{Dom}\, F$. The point

$$x^*_F = \arg\min_{x \in \mathrm{dom}\, F} F(x)$$

is called the analytic center of the convex set $\mathrm{Dom}\, F$, generated by the barrier $F(x)$.

THEOREM 4.2.6 Assume that the analytic center of a $\nu$-self-concordant barrier $F(x)$ exists. Then for any $x \in \mathrm{Dom}\, F$ we have

$$\| x - x^*_F \|_{x^*_F} \le \nu + 2\sqrt\nu.$$

On the other hand, for any $x \in R^n$ such that $\| x - x^*_F \|_{x^*_F} \le 1$, we have $x \in \mathrm{Dom}\, F$.

Proof: The first statement follows from Theorem 4.2.5 since $F'(x^*_F) = 0$. The second statement follows from Theorem 4.1.5. $\Box$

Thus, the asphericity of the set $\mathrm{Dom}\, F$ with respect to $x^*_F$, computed in the metric $\| \cdot \|_{x^*_F}$, does not exceed $\nu + 2\sqrt\nu$. It is well known that for any convex set in $R^n$ there exists a metric in which the asphericity of this set is less than or equal to $n$ (John theorem). However, we managed to estimate the asphericity in terms of the parameter of the barrier. This value does not depend directly on the dimension of the space.

Note also that if $\mathrm{Dom}\, F$ contains no straight line, then the existence of $x^*_F$ implies the boundedness of $\mathrm{Dom}\, F$ (since then $F''(x^*_F)$ is nondegenerate; see Theorem 4.1.3).
COROLLARY 4.2.1 Let $\mathrm{Dom}\, F$ be bounded. Then for any $x \in \mathrm{dom}\, F$ and $v \in R^n$ we have

$$\| v \|^*_x \le (\nu + 2\sqrt\nu)\, \| v \|^*_{x^*_F}.$$

Proof: By Lemma 3.1.12 we get the following representation:

$$\| v \|^*_x = \bigl\langle [F''(x)]^{-1} v, v \bigr\rangle^{1/2} = \max\{ \langle v, u \rangle \mid \langle F''(x)u, u \rangle \le 1 \}.$$

On the other hand, in view of Theorem 4.1.5 and Theorem 4.2.6, we have

$$B \equiv \{ y \in R^n \mid \| y - x \|_x \le 1 \} \subseteq \mathrm{Dom}\, F \subseteq \{ y \in R^n \mid \| y - x^*_F \|_{x^*_F} \le \nu + 2\sqrt\nu \} \equiv B^*.$$

Therefore, using again Theorem 4.2.6, we get the following:

$$\| v \|^*_x = \max\{ \langle v, y - x \rangle \mid y \in B \} \le \max\{ \langle v, y - x \rangle \mid y \in B^* \}$$
$$= \langle v, x^*_F - x \rangle + \max\{ \langle v, y - x^*_F \rangle \mid y \in B^* \} = \langle v, x^*_F - x \rangle + (\nu + 2\sqrt\nu)\, \| v \|^*_{x^*_F}.$$

Note that $\| v \|^*_x = \| -v \|^*_x$. Therefore we can assume $\langle v, x^*_F - x \rangle \le 0$. $\Box$

4.2.4 Path-following scheme

Now we are ready to describe a barrier model of the minimization problem. This is the standard minimization problem

$$\min\{ \langle c, x \rangle \mid x \in Q \} \qquad (4.2.12)$$

with a bounded closed convex set $Q = \mathrm{Dom}\, F$ which has nonempty interior and which is endowed with a $\nu$-self-concordant barrier $F(x)$.

Recall that we are going to solve (4.2.12) by tracing the central path:

$$x^*(t) = \arg\min_{x \in \mathrm{dom}\, F} f(t; x), \qquad (4.2.13)$$

where $f(t; x) = t \langle c, x \rangle + F(x)$ and $t \ge 0$. In view of the first-order optimality condition, any point of the central path satisfies the equation

$$tc + F'(x^*(t)) = 0. \qquad (4.2.14)$$

Since the set $Q$ is bounded, the analytic center of this set, $x^*_F$, exists and

$$x^*(0) = x^*_F. \qquad (4.2.15)$$

In order to follow the central path, we are going to update points satisfying an approximate centering condition:

$$\lambda_{f(t;\cdot)}(x) \equiv \| f'(t; x) \|^*_x = \| tc + F'(x) \|^*_x \le \beta, \qquad (4.2.16)$$

where the centering parameter $\beta$ is small enough.

Let us show that this is a reasonable goal.

THEOREM 4.2.7 For any $t > 0$ we have

$$\langle c, x^*(t) \rangle - c^* \le \frac{\nu}{t}, \qquad (4.2.17)$$

where $c^*$ is the optimal value of (4.2.12). If a point $x$ satisfies the centering condition (4.2.16), then

$$\langle c, x \rangle - c^* \le \frac1t \Bigl( \nu + \frac{(\beta + \sqrt\nu)\beta}{1 - \beta} \Bigr). \qquad (4.2.18)$$

Proof: Let $x^*$ be a solution to (4.2.12). In view of (4.2.14) and (4.2.7) we have

$$\langle c, x^*(t) - x^* \rangle = \frac1t \bigl\langle F'(x^*(t)),\, x^* - x^*(t) \bigr\rangle \le \frac{\nu}{t}.$$

Further, let $x$ satisfy (4.2.16). Denote $\lambda = \lambda_{f(t;\cdot)}(x)$. Then

$$t \langle c, x - x^*(t) \rangle = \bigl\langle f'(t; x) - F'(x),\, x - x^*(t) \bigr\rangle \le (\lambda + \sqrt\nu)\, \| x - x^*(t) \|_x \le (\lambda + \sqrt\nu)\, \frac{\lambda}{1 - \lambda} \le \frac{(\beta + \sqrt\nu)\beta}{1 - \beta}$$

in view of (4.2.4), Theorem 4.1.13 and (4.2.16). $\Box$

Let us analyze now one step of a path-following scheme. Namely, assume that $x \in \mathrm{dom}\, F$. Consider the following iterate:

$$t_+ = t + \frac{\gamma}{\| c \|^*_x}, \qquad x_+ = x - [F''(x)]^{-1} \bigl( t_+ c + F'(x) \bigr). \qquad (4.2.19)$$

THEOREM 4.2.8 Let $x$ satisfy (4.2.16),

$$\| tc + F'(x) \|^*_x \le \beta,$$

with $\beta < \bar\lambda = \frac{3 - \sqrt5}{2}$. Then for $\gamma$ such that

$$| \gamma | \le \frac{\sqrt\beta}{1 + \sqrt\beta} - \beta, \qquad (4.2.20)$$

we again have $\| t_+ c + F'(x_+) \|^*_{x_+} \le \beta$.

Proof: Denote $\lambda_0 = \| tc + F'(x) \|^*_x \le \beta$, $\lambda_1 = \| t_+ c + F'(x) \|^*_x$ and $\lambda_+ = \| t_+ c + F'(x_+) \|^*_{x_+}$. Then

$$\lambda_1 \le \lambda_0 + | \gamma | \le \beta + | \gamma |,$$

and in view of Theorem 4.1.14 we have

$$\lambda_+ \le \Bigl( \frac{\lambda_1}{1 - \lambda_1} \Bigr)^2 = \bigl[ \omega^{*\prime}(\lambda_1) \bigr]^2.$$

It remains to note that inequality (4.2.20) is equivalent to

$$\beta + | \gamma | \le \omega'(\sqrt\beta) = \frac{\sqrt\beta}{1 + \sqrt\beta}$$

(recall that $\omega'(\omega^{*\prime}(r)) = r$; see Lemma 4.1.4). $\Box$

Let us prove now that the increase of $t$ in the scheme (4.2.19) is sufficiently large.

LEMMA 4.2.1 Let $x$ satisfy (4.2.16). Then

$$\| c \|^*_x \le \frac1t (\beta + \sqrt\nu). \qquad (4.2.21)$$

Proof: Indeed, in view of (4.2.16) and (4.2.4), we have

$$t\, \| c \|^*_x = \| f'(t; x) - F'(x) \|^*_x \le \| f'(t; x) \|^*_x + \| F'(x) \|^*_x \le \beta + \sqrt\nu. \qquad \Box$$
Let us fix now some reasonable values of the parameters in the scheme (4.2.19). In the rest of this chapter we always assume that

$$\beta = \frac19, \qquad \gamma = \frac{\sqrt\beta}{1 + \sqrt\beta} - \beta = \frac{5}{36}. \qquad (4.2.22)$$

We have proved that it is possible to follow the central path using the rule (4.2.19). Note that we can either increase or decrease the current value of $t$. The lower estimate for the rate of increasing $t$ is

$$t_+ \ge \Bigl( 1 + \frac{5}{4 + 36\sqrt\nu} \Bigr) \cdot t,$$

and the upper estimate for the rate of decreasing $t$ is

$$t_+ \le \Bigl( 1 - \frac{5}{4 + 36\sqrt\nu} \Bigr) \cdot t.$$
Thus, the general scheme for solving the problem (4.2.12) is as follows (a small computational sketch is given after the scheme).

Main path-following scheme (4.2.23)

0. Set $t_0 = 0$. Choose an accuracy $\epsilon > 0$ and $x_0 \in \mathrm{dom}\, F$ such that

$$\| F'(x_0) \|^*_{x_0} \le \beta.$$

1. $k$th iteration ($k \ge 0$). Set

$$t_{k+1} = t_k + \frac{\gamma}{\| c \|^*_{x_k}}, \qquad x_{k+1} = x_k - [F''(x_k)]^{-1} \bigl( t_{k+1} c + F'(x_k) \bigr).$$

2. Stop the process if $\epsilon\, t_k \ge \nu + \frac{(\beta + \sqrt\nu)\beta}{1 - \beta}$.

Let us give a complexity bound for the above scheme.

THEOREM 4.2.9 The scheme (4.2.23) terminates after no more than $N$ steps, where

$$N \le O\Bigl( \sqrt\nu\, \ln \frac{\nu\, \| c \|^*_{x^*_F}}{\epsilon} \Bigr).$$
Moreover, at the moment of termination we have $\langle c, x_N \rangle - c^* \le \epsilon$.

Proof: Note that $r_0 \equiv \| x_0 - x^*_F \|_{x_0} \le \frac{\beta}{1 - \beta}$ (see Theorem 4.1.13). Therefore, in view of Theorem 4.1.6 we have

$$\| c \|^*_{x_0} \le \frac{1}{1 - r_0}\, \| c \|^*_{x^*_F} \le \frac{1 - \beta}{1 - 2\beta}\, \| c \|^*_{x^*_F}.$$

Thus,

$$t_k \ge \frac{\gamma (1 - 2\beta)}{(1 - \beta)\, \| c \|^*_{x^*_F}} \Bigl( 1 + \frac{\gamma}{\beta + \sqrt\nu} \Bigr)^{k-1}$$

for all $k \ge 1$. $\Box$

Let us discuss now the above complexity estimate. The main term in the complexity is

$$7.2 \sqrt\nu\, \ln \frac{\nu\, \| c \|^*_{x^*_F}}{\epsilon}.$$

Note that the value $\nu\, \| c \|^*_{x^*_F}$ estimates the variation of the linear function $\langle c, x \rangle$ over the set $\mathrm{Dom}\, F$ (see Theorem 4.2.6). Thus, the ratio $\frac{\epsilon}{\nu \| c \|^*_{x^*_F}}$ can be seen as a relative accuracy of the solution.

The process (4.2.23) has one serious drawback: sometimes it is difficult to satisfy its starting condition

$$\| F'(x_0) \|^*_{x_0} \le \beta.$$

In such cases we need an additional process for finding an appropriate starting point. We analyze the corresponding strategies in the next section.

4.2.5 Finding the analytic center

Thus, our goal now is to find an approximation to the analytic center of the set $\mathrm{Dom}\, F$. Let us look at the following minimization problem:

$$\min\{ F(x) \mid x \in \mathrm{dom}\, F \}, \qquad (4.2.24)$$

where $F$ is a $\nu$-self-concordant barrier. In view of the needs of the previous section, we have to find an approximate solution $\bar x \in \mathrm{dom}\, F$ of this problem which satisfies the inequality

$$\| F'(\bar x) \|^*_{\bar x} \le \beta,$$

for certain $\beta \in (0, 1)$.

In order to reach our goal, we can apply two different minimization schemes. The first one is a straightforward implementation of the damped Newton method. The second one is based on the path-following approach.

Consider the first scheme (a sketch in code follows the complexity bound below).

Damped Newton method for analytic centers (4.2.25)

0. Choose $y_0 \in \mathrm{dom}\, F$.

1. $k$th iteration ($k \ge 0$). Set

$$y_{k+1} = y_k - \frac{[F''(y_k)]^{-1} F'(y_k)}{1 + \| F'(y_k) \|^*_{y_k}}.$$

2. Stop the process if $\| F'(y_k) \|^*_{y_k} \le \beta$.

THEOREM 4.2.10 The process (4.2.25) terminates no later than after

$$\frac{1}{\omega(\beta)} \bigl( F(y_0) - F(x^*_F) \bigr)$$

iterations.

Proof: Indeed, in view of Theorem 4.1.12, we have

$$F(y_{k+1}) \le F(y_k) - \omega(\lambda_F(y_k)) \le F(y_k) - \omega(\beta).$$

Therefore $F(y_0) - k\, \omega(\beta) \ge F(y_k) \ge F(x^*_F)$. $\Box$

The implementation of the path-following approach is a little bit more complicated. Let us choose some $y_0 \in \mathrm{dom}\, F$. Define the auxiliary central path as follows:

$$y^*(t) = \arg\min_{y \in \mathrm{dom}\, F} \bigl[ -t \langle F'(y_0), y \rangle + F(y) \bigr],$$

where $t \ge 0$. Note that this trajectory satisfies the equation

$$F'(y^*(t)) = t\, F'(y_0). \qquad (4.2.26)$$

Therefore it connects two points, the starting point $y_0$ and the analytic center $x^*_F$:

$$y^*(1) = y_0, \qquad y^*(0) = x^*_F.$$

We can follow this trajectory by the process (4.2.19) with decreasing $t$.

Let us estimate the rate of convergence of the auxiliary central path $y^*(t)$ to the analytic center.

LEMMA 4.2.2 For any $t \ge 0$ we have

$$\| F'(y^*(t)) \|^*_{y^*(t)} \le (\nu + 2\sqrt\nu)\, \| F'(y_0) \|^*_{x^*_F}\; t.$$

Proof: This estimate follows from (4.2.26) and Corollary 4.2.1. $\Box$

Let us look now at the corresponding algorithmic scheme.

Auxiliary path-following scheme (4.2.27)

0. Choose $y_0 \in \mathrm{dom}\, F$. Set $t_0 = 1$.

1. $k$th iteration ($k \ge 0$). Set

$$t_{k+1} = t_k - \frac{\gamma}{\| F'(y_0) \|^*_{y_k}}, \qquad y_{k+1} = y_k - [F''(y_k)]^{-1} \bigl( t_{k+1} F'(y_0) + F'(y_k) \bigr).$$

2. Stop the process if $\| F'(y_k) \|^*_{y_k} \le \frac{\sqrt\beta}{1 + \sqrt\beta}$. Set $\bar x = y_k - [F''(y_k)]^{-1} F'(y_k)$.

Note that the above scheme follows the auxiliary central path $y^*(t)$ as $t_k \to 0$. It updates the points $\{ y_k \}$ satisfying the approximate centering condition

$$\| t_k F'(y_0) + F'(y_k) \|^*_{y_k} \le \beta.$$

The termination criterion of this process,

$$\lambda_k \equiv \| F'(y_k) \|^*_{y_k} \le \frac{\sqrt\beta}{1 + \sqrt\beta},$$

guarantees that $\| F'(\bar x) \|^*_{\bar x} \le \bigl( \frac{\lambda_k}{1 - \lambda_k} \bigr)^2 \le \beta$ (see Theorem 4.1.14).


Let us derive a complexity estimate for this process.

THEOREM 4.2.11 The process (4.2.27) terminates no later than after

$$\frac{1}{\gamma} (\beta + \sqrt\nu)\, \ln \Bigl[ \frac{1}{\gamma} (\nu + 2\sqrt\nu)\, \| F'(y_0) \|^*_{x^*_F} \Bigr]$$

iterations.

Proof: Recall that we have fixed the parameters

$$\beta = \frac19, \qquad \gamma = \frac{\sqrt\beta}{1 + \sqrt\beta} - \beta = \frac{5}{36}.$$

Note that $t_0 = 1$. Therefore, in view of Theorem 4.2.8 and Lemma 4.2.1, we have

$$t_{k+1} \le \Bigl( 1 - \frac{\gamma}{\beta + \sqrt\nu} \Bigr) t_k, \qquad\text{so that}\qquad t_k \le \exp\Bigl( -\frac{\gamma k}{\beta + \sqrt\nu} \Bigr).$$

Further, in view of Lemma 4.2.2, we obtain

$$\| F'(y_k) \|^*_{y_k} = \| ( t_k F'(y_0) + F'(y_k) ) - t_k F'(y_0) \|^*_{y_k} \le \beta + t_k \| F'(y_0) \|^*_{y_k} \le \beta + t_k (\nu + 2\sqrt\nu)\, \| F'(y_0) \|^*_{x^*_F}.$$

Thus, the process is terminated no later than when the following inequality holds:

$$t_k (\nu + 2\sqrt\nu)\, \| F'(y_0) \|^*_{x^*_F} \le \frac{\sqrt\beta}{1 + \sqrt\beta} - \beta = \gamma. \qquad \Box$$

Now we can discuss the complexity of both schemes. The principal term in the complexity of the auxiliary path-following scheme is

$$7.2 \sqrt\nu \Bigl[ \ln \nu + \ln \| F'(y_0) \|^*_{x^*_F} \Bigr],$$

and for the auxiliary damped Newton method it is $O( F(y_0) - F(x^*_F) )$. We cannot compare these estimates directly. However, a more sophisticated analysis demonstrates the advantages of the path-following approach. Note also that its complexity estimate naturally fits the complexity of the main path-following process. Indeed, if we apply (4.2.23) with (4.2.27), we get the following complexity bound for the whole process:

$$7.2 \sqrt\nu \Bigl[ 2 \ln \nu + \ln \| F'(y_0) \|^*_{x^*_F} + \ln \| c \|^*_{x^*_F} + \ln \frac1\epsilon \Bigr].$$

To conclude this section, note that for some problems it is difficult even to point out a starting point $y_0 \in \mathrm{dom}\, F$. In such cases we should apply one more auxiliary minimization process, which is similar to the process (4.2.27). We discuss this situation in the next section.

4.2.6 Problems with functional constraints

Let us consider the following minimization problem:

$$\min\; f_0(x),$$
$$\text{s.t. } f_j(x) \le 0, \quad j = 1 \ldots m, \qquad (4.2.28)$$
$$x \in Q,$$

where $Q$ is a simple bounded closed convex set with nonempty interior and all functions $f_j$, $j = 0 \ldots m$, are convex. We assume that the problem satisfies the Slater condition: there exists $\bar x \in \mathrm{int}\, Q$ such that $f_j(\bar x) < 0$ for all $j = 1 \ldots m$.

Let us assume that we know an upper bound $\bar\tau$ such that $f_0(x) < \bar\tau$ for all $x \in Q$. Then, introducing two additional variables $\tau$ and $\kappa$, we can rewrite this problem in the standard form:

$$\min\; \tau,$$
$$\text{s.t. } f_0(x) \le \tau, \qquad (4.2.29)$$
$$f_j(x) \le \kappa, \quad j = 1 \ldots m,$$
$$x \in Q, \quad \tau \le \bar\tau, \quad \kappa \le 0.$$

Note that we can apply interior-point methods to a problem only if we are able to construct a self-concordant barrier for the feasible set. In the current situation this means that we should be able to construct the following barriers:

A self-concordant barrier $F_Q(x)$ for the set $Q$.

A self-concordant barrier $F_0(x, \tau)$ for the epigraph of the objective function $f_0(x)$.

Self-concordant barriers $F_j(x, \kappa)$ for the epigraphs of the functional constraints $f_j(x)$.

Let us assume that we can do that. Then the resulting self-concordant barrier for the feasible set of the problem (4.2.29) is as follows:

$$\hat F(x, \tau, \kappa) = F_Q(x) + F_0(x, \tau) + \sum_{j=1}^m F_j(x, \kappa) - \ln(\bar\tau - \tau) - \ln(-\kappa).$$

The parameter of this barrier is

$$\hat\nu = \nu_Q + \nu_0 + \sum_{j=1}^m \nu_j + 2, \qquad (4.2.30)$$

where $\nu_Q$, $\nu_0$, $\nu_j$ are the parameters of the corresponding barriers.

Note that it could still be difficult to find a starting point from $\mathrm{dom}\, \hat F$. This domain is the intersection of the set $Q$ with the epigraphs of the objective function and the constraints and with the two additional constraints $\tau \le \bar\tau$ and $\kappa \le 0$. If we have a point $x_0 \in \mathrm{int}\, Q$, then we can choose large enough $\tau_0$ and $\kappa_0$ to guarantee

$$f_0(x_0) < \tau_0 < \bar\tau, \qquad f_j(x_0) < \kappa_0, \quad j = 1 \ldots m,$$

but then the constraint $\kappa \le 0$ could be violated.

In order to simplify our analysis, let us change notation. From now on we consider the problem

$$\min\; \langle c, z \rangle,$$
$$\text{s.t. } z \in S, \qquad (4.2.31)$$
$$\langle d, z \rangle \le 0,$$

where $z = (x, \tau, \kappa)$, $\langle c, z \rangle = \tau$, $\langle d, z \rangle = \kappa$ and $S$ is the feasible set of the problem (4.2.29) without the constraint $\kappa \le 0$. Note that we know a self-concordant barrier $F(z)$ for the set $S$ and we can easily find a point $z_0 \in \mathrm{int}\, S$. Moreover, in view of our assumptions, the set

$$S(\alpha) = \{ z \in S \mid \langle d, z \rangle \le \alpha \}$$

is bounded and has nonempty interior for $\alpha$ large enough.


The process of solving the problem (4.2.31) consists of three stages.

1. Choose a starting point $z_0 \in \mathrm{int}\, S$ and an initial gap $\Delta > 0$. Set $\alpha = \langle d, z_0 \rangle + \Delta$. If $\alpha \le 0$, then we can use the two-stage process described in Section 4.2.5. Otherwise we do the following. First, we find an approximate analytic center of the set $S(\alpha)$, generated by the barrier

$$\tilde F(z) = F(z) - \ln( \alpha - \langle d, z \rangle ).$$

Namely, we find a point $\tilde z$ satisfying the condition

$$\lambda_{\tilde F}(\tilde z) \equiv \Bigl\langle \tilde F''(\tilde z)^{-1} \Bigl( F'(\tilde z) + \frac{d}{\alpha - \langle d, \tilde z \rangle} \Bigr),\; F'(\tilde z) + \frac{d}{\alpha - \langle d, \tilde z \rangle} \Bigr\rangle^{1/2} \le \beta.$$

In order to generate such a point, we can use the auxiliary schemes discussed in Section 4.2.5.

2. The next stage consists in following the central path $z(t)$ defined by the equation

$$td + \tilde F'(z(t)) = 0, \qquad t \ge 0.$$

Note that the previous stage provides us with a reasonable approximation to the analytic center $z(0)$. Therefore we can follow this path using the process (4.2.19). This trajectory leads us to the solution of the minimization problem

$$\min\{ \langle d, z \rangle \mid z \in S(\alpha) \}.$$

In view of the Slater condition for the problem (4.2.31), the optimal value of this problem is strictly negative.

The goal of this stage consists in finding an approximation to the analytic center of the set

$$\bar S = \{ z \in S(\alpha) \mid \langle d, z \rangle \le 0 \},$$

generated by the barrier

$$\bar F(z) = \tilde F(z) - \ln( -\langle d, z \rangle ).$$

This point, $z^*$, satisfies the equation

$$\tilde F'(z^*) - \frac{d}{\langle d, z^* \rangle} = 0.$$

Therefore $z^*$ is a point of the central path $z(t)$. The corresponding value of the penalty parameter $t^*$ is

$$t^* = -\frac{1}{\langle d, z^* \rangle} > 0.$$

This stage ends up with a point $\bar z$ satisfying the condition

$$\lambda_{\tilde F}(\bar z) \equiv \Bigl\langle \tilde F''(\bar z)^{-1} \Bigl( \tilde F'(\bar z) - \frac{d}{\langle d, \bar z \rangle} \Bigr),\; \tilde F'(\bar z) - \frac{d}{\langle d, \bar z \rangle} \Bigr\rangle^{1/2} \le \beta.$$

3. Note that $\bar F''(z) \succeq \tilde F''(z)$. Therefore, the point $\bar z$ computed at the previous stage satisfies the inequality

$$\lambda_{\bar F}(\bar z) = \Bigl\langle \bar F''(\bar z)^{-1} \Bigl( \tilde F'(\bar z) - \frac{d}{\langle d, \bar z \rangle} \Bigr),\; \tilde F'(\bar z) - \frac{d}{\langle d, \bar z \rangle} \Bigr\rangle^{1/2} \le \beta.$$

This means that we have a good approximation of the analytic center of the set $\bar S$, and we can apply the main path-following scheme (4.2.23) to solve the problem

$$\min\{ \langle c, z \rangle \mid z \in \bar S \}.$$

Clearly, this problem is equivalent to (4.2.31).
We omit the detailed complexity analysis of the above three-stage scheme. It could be done in a way similar to the analysis of Section 4.2.5. The main term in the complexity of this scheme is proportional to the product of $\sqrt{\hat\nu}$ (see (4.2.30)) and the sum of the logarithm of the desired accuracy $\epsilon$ and the logarithms of some structural characteristics of the problem (the size of the region, the depth of the Slater condition, etc.).

Thus, we have shown that we can apply efficient interior-point methods to all problems for which we can point out self-concordant barriers for the basic feasible set $Q$ and for the epigraphs of the functional constraints. Our main goal now is to describe the classes of convex problems for which such barriers can be constructed in a computable form. Note that we have an exact characteristic of the quality of a self-concordant barrier, namely the value of its parameter: the smaller it is, the more efficient the corresponding path-following scheme will be. In the next section we discuss our possibilities in applying the developed theory to particular convex problems.

4.3 Applications of structural optimization

(Bounds on parameters of self-concordant barriers; Linear and quadratic optimization; Semidefinite optimization; Extremal ellipsoids; Separable problems; Geometric optimization; Approximation in lp norms; Choice of optimization scheme.)

4.3.1 Bounds on parameters of self-concordant barriers

In the previous section we discussed a path-following scheme for solving the following problem:

$$\min_{x \in Q} \langle c, x \rangle, \qquad (4.3.1)$$

where $Q$ is a closed convex set with nonempty interior for which we know a $\nu$-self-concordant barrier $F(x)$. Using such a barrier, we can solve (4.3.1) in $O\bigl( \sqrt\nu \ln \frac\nu\epsilon \bigr)$ iterations of a path-following scheme. Recall that the most difficult part of each iteration is the solution of a system of linear equations.

In this section we study the limits of applicability of this approach. We discuss the lower and upper bounds for the parameters of self-concordant barriers; we also discuss some classes of convex problems for which the model (4.3.1) can be created in a computable form.
Let us start from lower bounds on barrier parameters.

LEMMA 4.3.1 Let $f(t)$ be a $\nu$-self-concordant barrier for the interval $(\alpha, \beta) \subset R^1$, $\alpha < \beta < \infty$. Then

$$\nu \ge \kappa \equiv \sup_{t \in (\alpha, \beta)} \frac{f'(t)^2}{f''(t)} \ge 1.$$

Proof: Note that $\nu \ge \kappa$ by definition. Let us assume that $\kappa < 1$. Since $f(t)$ is a barrier for $(\alpha, \beta)$, there exists a value $\bar\alpha \in (\alpha, \beta)$ such that $f'(t) > 0$ for all $t \in [\bar\alpha, \beta)$.

Consider the function $\phi(t) = \frac{f'(t)^2}{f''(t)}$, $t \in [\bar\alpha, \beta)$. Then, since $f'(t) > 0$, $f(t)$ is self-concordant and $\phi(t) \le \kappa < 1$, we have

$$\phi'(t) = 2 f'(t) - \Bigl( \frac{f'(t)}{f''(t)} \Bigr)^2 f'''(t) \ge 2 f'(t) \bigl( 1 - \sqrt{\phi(t)} \bigr) \ge 2 ( 1 - \sqrt\kappa )\, f'(t),$$

where we used the self-concordance inequality $f'''(t) \le 2 [f''(t)]^{3/2}$.

Hence, for all $t \in [\bar\alpha, \beta)$ we obtain $\phi(t) \ge \phi(\bar\alpha) + 2(1 - \sqrt\kappa)\bigl( f(t) - f(\bar\alpha) \bigr)$. This is a contradiction since $f(t)$ is a barrier and $\phi(t)$ is bounded from above. $\Box$

COROLLARY 4.3.1 Let $F(x)$ be a $\nu$-self-concordant barrier for $Q \subset R^n$, $Q \neq R^n$. Then $\nu \ge 1$.

Proof: Indeed, let $x \in \mathrm{int}\, Q$. Since $Q \neq R^n$, there exists a nonzero direction $u \in R^n$ such that the line $\{ y = x + tu,\; t \in R^1 \}$ intersects the boundary of the set $Q$. Therefore, considering the function $f(t) = F(x + tu)$ and using Lemma 4.3.1, we get the result. $\Box$
Let us prove a simple lower bound for parameters of self-concordant barriers for unbounded sets.

Let $Q$ be a closed convex set with nonempty interior. Consider $\bar x \in \mathrm{int}\, Q$. Assume that there exists a nontrivial set of recession directions $\{ p_1, \ldots, p_k \}$ of the set $Q$:

$$\bar x + \alpha p_i \in Q \quad \forall \alpha \ge 0.$$

THEOREM 4.3.1 Let positive coefficients $\{ \beta_i \}_{i=1}^k$ satisfy the condition

$$\bar x - \beta_i p_i \notin \mathrm{int}\, Q, \qquad i = 1 \ldots k.$$

If for some positive $\alpha_1, \ldots, \alpha_k$ we have $\bar y = \bar x - \sum_{i=1}^k \alpha_i p_i \in Q$, then the parameter $\nu$ of any self-concordant barrier for $Q$ satisfies the inequality

$$\nu \ge \sum_{i=1}^k \frac{\alpha_i}{\beta_i}.$$

Proof: Let $F(x)$ be a $\nu$-self-concordant barrier for the set $Q$. Since $p_i$ is a recession direction, we have

$$\langle F'(\bar x), p_i \rangle \le -\| p_i \|_{\bar x}$$

(since otherwise the function $f(t) = F(\bar x + t p_i)$ attains its minimum; see Theorem 4.1.11).

Note that $\bar x - \beta_i p_i \notin \mathrm{int}\, Q$. Therefore, in view of Theorem 4.1.5, the norm of the direction $p_i$ is large enough: $\beta_i \| p_i \|_{\bar x} \ge 1$. Hence, in view of Theorem 4.2.4, we obtain

$$\nu \ge \langle F'(\bar x),\, \bar y - \bar x \rangle = \Bigl\langle F'(\bar x),\, -\sum_{i=1}^k \alpha_i p_i \Bigr\rangle \ge \sum_{i=1}^k \alpha_i \| p_i \|_{\bar x} \ge \sum_{i=1}^k \frac{\alpha_i}{\beta_i}. \qquad \Box$$

Let us present now an existence theorem for self-concordant barriers. Consider a closed convex set $Q$, $\mathrm{int}\, Q \neq \emptyset$, and assume that $Q$ contains no straight line. Define the polar set of $Q$ with respect to some point $\bar x \in \mathrm{int}\, Q$:

$$P(\bar x) = \{ s \in R^n \mid \langle s, x - \bar x \rangle \le 1 \;\; \forall x \in Q \}.$$

It can be proved that for any $\bar x \in \mathrm{int}\, Q$ the set $P(\bar x)$ is a bounded closed convex set with nonempty interior. Denote $V(\bar x) = \mathrm{vol}_n\, P(\bar x)$.

THEOREM 4.3.2 There exist absolute constants $c_1$ and $c_2$ such that the function

$$U(\bar x) = c_1 \ln V(\bar x)$$

is a $(c_2 \cdot n)$-self-concordant barrier for $Q$.

The function $U(x)$ is called the universal barrier for the set $Q$. Note that the analytical complexity of the problem (4.3.1), equipped with the universal barrier, is $O\bigl( \sqrt n \ln \frac n\epsilon \bigr)$. Recall that such an efficiency estimate is impossible if we use a local black-box oracle (see Theorem 3.2.5).

The above result is mainly of theoretical interest: in general, the universal barrier $U(x)$ cannot be computed easily. However, Theorem 4.3.2 demonstrates that such barriers, in principle, can be found for any convex set. Thus, the applicability of our approach is restricted only by our ability to construct a computable self-concordant barrier, hopefully with a small value of the parameter. The process of creating a barrier model of the initial problem can hardly be described in a formal way. For each particular problem there could be many different barrier models, and we should choose the best one, taking into account the value of the parameter of the self-concordant barrier, the complexity of computing its gradient and Hessian, and the complexity of solving the Newton system. In the rest of this section we will see how that can be done for some standard problem classes of convex optimization.

4.3.2 Linear and quadratic optimization

Let us start from the linear optimization problem:

$$\min_{x \in R^n} \langle c, x \rangle,$$
$$\text{s.t. } Ax = b, \qquad (4.3.2)$$
$$x^{(i)} \ge 0, \quad i = 1 \ldots n,$$

where $A$ is an $(m \times n)$-matrix, $m < n$. The inequalities in this problem define the positive orthant in $R^n$. This set can be equipped with the following self-concordant barrier:

$$F(x) = -\sum_{i=1}^n \ln x^{(i)}, \qquad \nu = n$$

(see Example 4.2.1 and Theorem 4.2.2). This barrier is called the standard logarithmic barrier for $R^n_+$.

In order to solve the problem (4.3.2), we have to use the restriction of the barrier $F(x)$ onto the affine subspace $\{ x : Ax = b \}$. Since this restriction is an $n$-self-concordant barrier (see Theorem 4.2.3), the complexity estimate for the problem (4.3.2) is $O\bigl( \sqrt n \ln \frac n\epsilon \bigr)$ iterations of a path-following scheme (a sketch of the corresponding Newton step is given below).
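Working with the restriction of $F$ onto $\{x : Ax = b\}$ amounts to computing Newton directions from a KKT system. The following sketch (Python/NumPy; the simplex data at the bottom is our own example, not from the book) shows one such centering step for $f(t; x) = t\langle c, x \rangle - \sum_i \ln x^{(i)}$.

```python
import numpy as np

# Sketch of one Newton step for the standard logarithmic barrier restricted
# to {x : Ax = b}.  The direction dx solves the KKT system
#     f''(x) dx + A^T y = -f'(t;x),    A dx = 0,
# with f'(t;x) = t*c - 1/x and f''(x) = diag(1/x^2).

def newton_step(A, c, x, t):
    n, m = x.size, A.shape[0]
    g = t * c - 1.0 / x                    # f'(t;x)
    H = np.diag(1.0 / x**2)                # f''(x)
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, np.zeros(m)])
    return np.linalg.solve(K, rhs)[:n]     # dx (Lagrange multipliers dropped)

A = np.array([[1.0, 1.0, 1.0]])            # example: simplex {x >= 0, sum x = 1}
b = np.array([1.0])
c = np.array([1.0, 2.0, 3.0])
x = np.array([1/3, 1/3, 1/3])              # strictly feasible starting point
x = x + newton_step(A, c, x, t=1.0)
print(x, A @ x)                            # A x = b is preserved by the step
```

The $(n+m)\times(n+m)$ KKT solve is the "most difficult part of each iteration" referred to above; for sparse $A$ it is usually reduced to an $m \times m$ system with the matrix $A H^{-1} A^T$.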
Let us prove that the standard logarithmic barrier is optimal for $R^n_+$.

LEMMA 4.3.2 The parameter $\nu$ of any self-concordant barrier for $R^n_+$ satisfies the inequality $\nu \ge n$.

Proof: Let us choose

$$\bar x = e \equiv (1, \ldots, 1)^T \in \mathrm{int}\, R^n_+, \qquad p_i = e_i, \quad i = 1 \ldots n,$$

where $e_i$ is the $i$th coordinate vector of $R^n$. Clearly, the conditions of Theorem 4.3.1 are satisfied with $\alpha_i = \beta_i = 1$, $i = 1 \ldots n$. Therefore

$$\nu \ge \sum_{i=1}^n \frac{\alpha_i}{\beta_i} = n. \qquad \Box$$

Note that the above lower bound is valid only for the entire set $R^n_+$. The lower bound for the intersection $\{ x \in R^n_+ \mid Ax = b \}$ can be smaller.

Let us look now at a quadratically constrained quadratic optimization problem:

$$\min_{x \in R^n} q_0(x) = \alpha_0 + \langle a_0, x \rangle + \tfrac12 \langle A_0 x, x \rangle,$$
$$\text{s.t. } q_i(x) = \alpha_i + \langle a_i, x \rangle + \tfrac12 \langle A_i x, x \rangle \le \beta_i, \quad i = 1 \ldots m, \qquad (4.3.3)$$

where $A_i$ are some positive semidefinite $(n \times n)$-matrices. Let us rewrite this problem in a standard form:

$$\min_{x, \tau} \tau,$$
$$\text{s.t. } q_0(x) \le \tau, \quad q_i(x) \le \beta_i, \quad i = 1 \ldots m, \qquad (4.3.4)$$
$$x \in R^n, \quad \tau \in R^1.$$

The feasible set of this problem can be equipped with the following self-concordant barrier:

$$F(x, \tau) = -\ln( \tau - q_0(x) ) - \sum_{i=1}^m \ln( \beta_i - q_i(x) ), \qquad \nu = m + 1$$

(see Example 4.2.1 and Theorem 4.2.2). Thus, the complexity bound for the problem (4.3.3) is $O\bigl( \sqrt{m+1} \ln \frac m\epsilon \bigr)$ iterations of a path-following scheme. Note that this estimate does not depend on $n$.
In many applications the functional components of the problem include a nonsmooth quadratic term of the form $\| Ax - b \|$. Let us show that we can treat such terms using the interior-point technique.

LEMMA 4.3.3 The function

$$F(x, t) = -\ln( t^2 - \| x \|^2 )$$

is a 2-self-concordant barrier for the convex set⁵

$$K_2 = \{ (x, t) \in R^{n+1} \mid t \ge \| x \| \}.$$

Proof: Let us fix a point $z = (x, t) \in \mathrm{int}\, K_2$ and a nonzero direction $u = (h, \tau) \in R^{n+1}$. Denote $\xi(\alpha) = ( t + \alpha\tau )^2 - \| x + \alpha h \|^2$. We need to compare the derivatives of the function

$$\phi(\alpha) = F(z + \alpha u) = -\ln \xi(\alpha)$$

at $\alpha = 0$. Denote $\phi^{(\cdot)} = \phi^{(\cdot)}(0)$ and $\xi^{(\cdot)} = \xi^{(\cdot)}(0)$. Then

$$\xi' = 2( t\tau - \langle x, h \rangle ), \qquad \xi'' = 2( \tau^2 - \| h \|^2 ),$$

$$\phi' = -\frac{\xi'}{\xi}, \qquad \phi'' = \Bigl( \frac{\xi'}{\xi} \Bigr)^2 - \frac{\xi''}{\xi}, \qquad \phi''' = -2 \Bigl( \frac{\xi'}{\xi} \Bigr)^3 + 3\, \frac{\xi' \xi''}{\xi^2}.$$

Note that the inequality $2\phi'' \ge (\phi')^2$ is equivalent to $(\xi')^2 \ge 2 \xi \xi''$. Thus, we need to prove that for any $(h, \tau)$ we have

$$( t\tau - \langle x, h \rangle )^2 \ge ( t^2 - \| x \|^2 )( \tau^2 - \| h \|^2 ). \qquad (4.3.5)$$

Clearly, we can restrict ourselves to $| \tau | > \| h \|$ (otherwise the right-hand side of the above inequality is nonpositive). Moreover, in order to minimize the left-hand side, we should choose $\mathrm{sign}\, \tau = \mathrm{sign}\, \langle x, h \rangle$ (thus, let $\tau > 0$) and $\langle x, h \rangle = \| x \| \cdot \| h \|$. Substituting these values in (4.3.5), we get a valid inequality.

Finally, denoting $s = \frac{2\xi\xi''}{(\xi')^2} \le 1$ and using the inequality $[1 - s]^{3/2} \ge 1 - \frac32 s$, a direct computation with the expressions above shows that $| \phi''' | \le 2 (\phi'')^{3/2}$, which gives self-concordance. Together with (4.3.5) this proves the statement. $\Box$

⁵ Depending on the field, this set has different names: Lorentz cone, ice-cream cone, second-order cone.

Let us prove that the barrier described in the above statement is optimal for the second-order cone.

LEMMA 4.3.4 The parameter $\nu$ of any self-concordant barrier for the set $K_2$ satisfies the inequality $\nu \ge 2$.

Proof: Let us choose $\bar z = (0, 1) \in \mathrm{int}\, K_2$ and some $h \in R^n$ with $\| h \| = 1$. Define

$$p_1 = (h, 1), \qquad p_2 = (-h, 1), \qquad \alpha_1 = \alpha_2 = \beta_1 = \beta_2 = \tfrac12.$$

Note that for all $\gamma \ge 0$ we have $\bar z + \gamma p_i = ( \pm\gamma h,\, 1 + \gamma ) \in K_2$, while

$$\bar z - \tfrac12 p_i = \bigl( \mp\tfrac12 h,\; \tfrac12 \bigr) \notin \mathrm{int}\, K_2,$$

$$\bar z - \alpha_1 p_1 - \alpha_2 p_2 = \bigl( -\tfrac12 h + \tfrac12 h,\; 1 - \tfrac12 - \tfrac12 \bigr) = 0 \in K_2.$$

Therefore, the conditions of Theorem 4.3.1 are satisfied and

$$\nu \ge \frac{\alpha_1}{\beta_1} + \frac{\alpha_2}{\beta_2} = 2. \qquad \Box$$

4.3.3 Semidefinite optimization

In semidefinite optimization the decision variables are matrices. Let $X = \{ X^{(i,j)} \}_{i,j=1}^n$ be a symmetric $n \times n$-matrix (notation: $X \in S^{n \times n}$). The linear space $S^{n \times n}$ can be provided with the following inner product: for any $X, Y \in S^{n \times n}$ define

$$\langle X, Y \rangle_F = \sum_{i=1}^n \sum_{j=1}^n X^{(i,j)} Y^{(i,j)}, \qquad \| X \|_F = \langle X, X \rangle_F^{1/2}.$$

Sometimes the value $\| X \|_F$ is called the Frobenius norm of the matrix $X$. For symmetric matrices $X$ and $Y$ we have the following identity:

$$\langle X, Y \cdot Y \rangle_F = \sum_{i=1}^n \sum_{j=1}^n X^{(i,j)} \sum_{k=1}^n Y^{(j,k)} Y^{(i,k)} = \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n X^{(i,j)} Y^{(i,k)} Y^{(j,k)}$$
$$= \sum_{k=1}^n \sum_{j=1}^n Y^{(k,j)} (XY)^{(j,k)} = \sum_{k=1}^n (YXY)^{(k,k)} = \mathrm{Trace}\,(YXY) = \langle YXY, I_n \rangle_F. \qquad (4.3.6)$$

In semidefinite optimization problems a nontrivial part of the constraints is formed by the cone of positive semidefinite $n \times n$-matrices $\mathcal P_n \subset S^{n \times n}$. Recall that $X \in \mathcal P_n$ if and only if $\langle Xu, u \rangle \ge 0$ for any $u \in R^n$. If $\langle Xu, u \rangle > 0$ for all nonzero $u$, we call $X$ positive definite. Such matrices form the interior of the cone $\mathcal P_n$. Note that $\mathcal P_n$ is a closed convex set.

The general formulation of the semidefinite optimization problem is as follows:

$$\min\; \langle C, X \rangle_F,$$
$$\text{s.t. } \langle A_i, X \rangle_F = b_i, \quad i = 1 \ldots m, \qquad (4.3.7)$$
$$X \in \mathcal P_n,$$

where $C$ and $A_i$ belong to $S^{n \times n}$. In order to apply a path-following scheme to this problem, we need a self-concordant barrier for $\mathcal P_n$.

Let a matrix $X$ belong to $\mathrm{int}\, \mathcal P_n$. Denote $F(X) = -\ln\det X$. Clearly

$$F(X) = -\ln \prod_{i=1}^n \lambda_i(X),$$

where $\{ \lambda_i(X) \}_{i=1}^n$ is the set of eigenvalues of the matrix $X$.

LEMMA 4.3.5 The function $F(X)$ is convex and $F'(X) = -X^{-1}$. For any direction $\Delta \in S^{n \times n}$ we have

$$\langle F''(X)\Delta, \Delta \rangle_F = \mathrm{Trace}\,\bigl( X^{-1/2} \Delta X^{-1/2} \bigr)^2, \qquad D^3 F(X)[\Delta, \Delta, \Delta] = -2 \bigl\langle I_n,\, ( X^{-1/2} \Delta X^{-1/2} )^3 \bigr\rangle_F.$$

Proof: Let us fix some $\Delta \in S^{n \times n}$ and $X \in \mathrm{int}\, \mathcal P_n$ such that $X + \Delta \in \mathcal P_n$. Then

$$F(X + \Delta) - F(X) = -\ln\det(X + \Delta) + \ln\det X = -\ln\det\bigl( I_n + X^{-1/2} \Delta X^{-1/2} \bigr) \ge \langle -X^{-1}, \Delta \rangle_F,$$

since $\ln(1 + \lambda) \le \lambda$ for each eigenvalue $\lambda$ of $X^{-1/2}\Delta X^{-1/2}$. Thus, $-X^{-1} \in \partial F(X)$. Therefore $F$ is convex (Lemma 3.1.6) and $F'(X) = -X^{-1}$ (Lemma 3.1.7).

Further, consider the function $\varphi(\alpha) \equiv \langle F'(X + \alpha\Delta), \Delta \rangle_F$, $\alpha \in [0, 1]$. Then

$$\varphi(\alpha) - \varphi(0) = \bigl\langle X^{-1} - (X + \alpha\Delta)^{-1},\, \Delta \bigr\rangle_F = \bigl\langle (X + \alpha\Delta)^{-1} \bigl( (X + \alpha\Delta) - X \bigr) X^{-1},\, \Delta \bigr\rangle_F = \alpha \bigl\langle (X + \alpha\Delta)^{-1} \Delta X^{-1},\, \Delta \bigr\rangle_F.$$

Thus, $\varphi'(0) = \langle F''(X)\Delta, \Delta \rangle_F = \langle X^{-1} \Delta X^{-1}, \Delta \rangle_F$.

The last expression can be proved in a similar way by differentiating the function $\psi(\alpha) \equiv \langle (X + \alpha\Delta)^{-1} \Delta (X + \alpha\Delta)^{-1},\, \Delta \rangle_F$. $\Box$

THEOREM 4.3.3 The function $F(X)$ is an $n$-self-concordant barrier for $\mathcal P_n$.

Proof: Let us fix $X \in \mathrm{int}\, \mathcal P_n$ and $\Delta \in S^{n \times n}$. Denote $Q = X^{-1/2} \Delta X^{-1/2}$ and $\lambda_i = \lambda_i(Q)$, $i = 1 \ldots n$. Then, in view of Lemma 4.3.5 we have

$$\langle F'(X), \Delta \rangle_F = -\sum_{i=1}^n \lambda_i, \qquad \langle F''(X)\Delta, \Delta \rangle_F = \sum_{i=1}^n \lambda_i^2, \qquad D^3 F(X)[\Delta, \Delta, \Delta] = -2 \sum_{i=1}^n \lambda_i^3.$$

Using the two standard inequalities

$$\Bigl| \sum_{i=1}^n \lambda_i^3 \Bigr| \le \Bigl( \sum_{i=1}^n \lambda_i^2 \Bigr)^{3/2}, \qquad \Bigl( \sum_{i=1}^n \lambda_i \Bigr)^2 \le n \sum_{i=1}^n \lambda_i^2,$$

we obtain

$$\langle F'(X), \Delta \rangle_F^2 \le n\, \langle F''(X)\Delta, \Delta \rangle_F, \qquad \bigl| D^3 F(X)[\Delta, \Delta, \Delta] \bigr| \le 2\, \langle F''(X)\Delta, \Delta \rangle_F^{3/2}. \qquad \Box$$

Let us prove that $F(X) = -\ln\det X$ is the optimal barrier for $\mathcal P_n$.

LEMMA 4.3.6 The parameter $\nu$ of any self-concordant barrier for the cone $\mathcal P_n$ satisfies the inequality $\nu \ge n$.

Proof: Let us choose $\bar X = I_n \in \mathrm{int}\, \mathcal P_n$ and the directions $p_i = e_i e_i^T$, $i = 1 \ldots n$, where $e_i$ is the $i$th coordinate vector of $R^n$. Note that for any $\gamma \ge 0$ we have $I_n + \gamma p_i \in \mathrm{int}\, \mathcal P_n$. Moreover,

$$I_n - e_i e_i^T \notin \mathrm{int}\, \mathcal P_n, \qquad I_n - \sum_{i=1}^n e_i e_i^T = 0 \in \mathcal P_n.$$

Therefore, the conditions of Theorem 4.3.1 are satisfied with $\alpha_i = \beta_i = 1$, $i = 1 \ldots n$, and we obtain

$$\nu \ge \sum_{i=1}^n \frac{\alpha_i}{\beta_i} = n. \qquad \Box$$

As in the linear optimization problem (4.3.2), in the problem (4.3.7) we need to use the restriction of $F(X)$ onto the set

$$\mathcal L = \{ X : \langle A_i, X \rangle_F = b_i, \; i = 1 \ldots m \}.$$

This restriction is an $n$-self-concordant barrier in view of Theorem 4.2.3. Thus, the complexity bound of the problem (4.3.7) is $O\bigl( \sqrt n \ln \frac n\epsilon \bigr)$ iterations of a path-following scheme. Note that this estimate is very encouraging since the dimension of the problem (4.3.7) is $\frac12 n(n+1)$.

Let us estimate the arithmetical cost of each iteration of a path-following scheme (4.2.23) as applied to the problem (4.3.7). Note that we work with the restriction of the barrier $F(X)$ onto the set $\mathcal L$. In view of Lemma 4.3.5, each Newton step consists in solving the following problem:

$$\min_\Delta \Bigl\{ \langle U, \Delta \rangle_F + \tfrac12 \langle X^{-1} \Delta X^{-1}, \Delta \rangle_F :\; \langle A_i, \Delta \rangle_F = 0, \; i = 1 \ldots m \Bigr\},$$

where $X \succ 0$ belongs to $\mathcal L$ and $U$ is a combination of the cost matrix $C$ and the gradient $F'(X)$. In accordance with Corollary 1.2.1, the solution of this problem can be found from the following system of linear equations:

$$U + X^{-1} \Delta X^{-1} + \sum_{j=1}^m \lambda^{(j)} A_j = 0, \qquad (4.3.8)$$
$$\langle A_i, \Delta \rangle_F = 0, \quad i = 1 \ldots m.$$

From the first equation in (4.3.8) we get

$$\Delta = X \Bigl[ -U - \sum_{j=1}^m \lambda^{(j)} A_j \Bigr] X. \qquad (4.3.9)$$

Substituting this expression into the second equation in (4.3.8), we get the linear system

$$\sum_{j=1}^m \lambda^{(j)} \langle A_i, X A_j X \rangle_F = -\langle A_i, X U X \rangle_F, \quad i = 1 \ldots m, \qquad (4.3.10)$$

which can be written in matrix form as $S\lambda = d$ with

$$S^{(i,j)} = \langle A_i, X A_j X \rangle_F, \qquad d^{(i)} = -\langle U, X A_i X \rangle_F, \qquad i, j = 1 \ldots m.$$

Thus, a straightforward strategy of solving the system (4.3.8) consists in the following steps (see the sketch below).

Compute the matrices $X A_j X$, $j = 1 \ldots m$. Cost: $O(m n^3)$ operations.

Compute the elements of $S$ and $d$. Cost: $O(m^2 n^2)$ operations.

Compute $\lambda = S^{-1} d$. Cost: $O(m^3)$ operations.

Compute $\Delta$ by (4.3.9). Cost: $O(m n^2)$ operations.

Taking into account that $m \le \frac{n(n+1)}{2}$, we conclude that the complexity of one Newton step does not exceed

$$O\bigl( n^2 (m + n)\, m \bigr) \text{ arithmetic operations.} \qquad (4.3.11)$$
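A compact sketch of these four steps in Python/NumPy (random data generated only for illustration; `frob` stands for the inner product $\langle \cdot, \cdot \rangle_F$):

```python
import numpy as np

# Straightforward computation of the Newton step (4.3.8)-(4.3.10) for the
# barrier -ln det X restricted to {X : <A_i, X>_F = b_i}.

rng = np.random.default_rng(0)
n, m = 5, 3
frob = lambda P, Q: np.trace(P @ Q)          # <P, Q>_F for symmetric matrices

As = [(lambda M: (M + M.T) / 2)(rng.standard_normal((n, n))) for _ in range(m)]
U = np.eye(n)        # stands for the combination of C and F'(X) in the text
X = np.eye(n)        # current iterate, X > 0

XAX = [X @ A @ X for A in As]                                  # O(m n^3)
S = np.array([[frob(As[i], XAX[j]) for j in range(m)]
              for i in range(m)])                              # O(m^2 n^2)
d = np.array([-frob(U, XAX[i]) for i in range(m)])
lam = np.linalg.solve(S, d)                                    # O(m^3)
Delta = X @ (-U - sum(l * A for l, A in zip(lam, As))) @ X     # (4.3.9)

print([abs(frob(A, Delta)) < 1e-10 for A in As])   # <A_i, Delta>_F = 0 holds
```

The final check confirms that $\Delta$ stays in the tangent space of $\mathcal L$, exactly as required by the second equation in (4.3.8).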

However, if the matrices $A_j$ possess a certain structure, then this estimate can be significantly improved. For example, if all $A_j$ are of rank 1,

$$A_j = a_j a_j^T, \qquad a_j \in R^n, \quad j = 1 \ldots m,$$

then the computation of the Newton step can be done in

$$O\bigl( (m + n)^3 \bigr) \text{ arithmetic operations.} \qquad (4.3.12)$$

We leave the justification of this claim as an exercise for the reader.


To conclude this section, note that in many important applications we can use the barrier $-\ln\det(\cdot)$ for treating some functions of eigenvalues. Consider, for example, a matrix $\mathcal A(x) \in S^{n \times n}$ which depends linearly on $x$. Then the convex region

$$\Bigl\{ (x, t) \;\Bigl|\; \max_{1 \le i \le n} \lambda_i( \mathcal A(x) ) \le t \Bigr\}$$

can be described by the self-concordant barrier

$$F(x, t) = -\ln\det\bigl( t I_n - \mathcal A(x) \bigr).$$

The value of the parameter of this barrier is equal to $n$.

4.3.4 Extremal ellipsoids

In some applications we are interested in approximating polytopes by ellipsoids. Let us consider the most important examples.

4.3.4.1 Circumscribed ellipsoid

Given a set of points $a_1, \ldots, a_m \in R^n$, find an ellipsoid $W$ which contains all points $\{ a_i \}$ and whose volume is as small as possible.

Let us pose this problem in a formal way. First of all, note that any bounded ellipsoid $W \subset R^n$ can be represented as

$$W = \{ x \in R^n \mid x = H^{-1}(v + u),\; \| u \| \le 1 \},$$

where $H \in \mathrm{int}\, \mathcal P_n$ and $v \in R^n$. Then the inclusion $a \in W$ is equivalent to the inequality $\| Ha - v \| \le 1$. Note also that

$$\mathrm{vol}_n\, W = \mathrm{vol}_n\, B_2(0,1)\, \det H^{-1} = \frac{\mathrm{vol}_n\, B_2(0,1)}{\det H}.$$

Thus, our problem is as follows:

$$\min_{H, v, \tau} \tau,$$
$$\text{s.t. } -\ln\det H \le \tau, \qquad (4.3.13)$$
$$\| H a_i - v \| \le 1, \quad i = 1 \ldots m,$$
$$H \in \mathcal P_n.$$

In order to solve this problem by an interior-point scheme, we need to find a self-concordant barrier for the feasible set. At this moment we know such barriers for all components of this problem except the first inequality.

LEMMA 4.3.7 The function

$$-\ln\det H - \ln( \tau + \ln\det H )$$

is an $(n+1)$-self-concordant barrier for the set

$$\{ (H, \tau) \in S^{n \times n} \times R^1 \mid \tau \ge -\ln\det H,\; H \in \mathcal P_n \}. \qquad \Box$$

Thus, we can use the following barrier:

$$F(H, v, \tau) = -\ln\det H - \ln( \tau + \ln\det H ) - \sum_{i=1}^m \ln\bigl( 1 - \| H a_i - v \|^2 \bigr), \qquad \nu = m + n + 1.$$

The corresponding complexity bound is $O\bigl( \sqrt{m + n + 1}\, \ln \frac{m+n}{\epsilon} \bigr)$ iterations of a path-following scheme.

4.3.4.2 Inscribed ellipsoid with fixed center

Let $Q$ be a convex polytope defined by a set of linear inequalities:

$$Q = \{ x \in R^n \mid \langle a_i, x \rangle \le b_i,\; i = 1 \ldots m \},$$

and let $v \in \mathrm{int}\, Q$. Find an ellipsoid $W$, centered at $v$, such that $W \subseteq Q$ and whose volume is as big as possible.

Let us fix some $H \in \mathrm{int}\, \mathcal P_n$. We can represent the ellipsoid $W$ as

$$W = \{ x \in R^n \mid \langle H^{-1}(x - v),\, x - v \rangle \le 1 \}.$$

We need the following simple result.

LEMMA 4.3.8 Let $\langle a, v \rangle < b$. The inequality $\langle a, x \rangle \le b$ is valid for any $x \in W$ if and only if

$$\langle Ha, a \rangle \le ( b - \langle a, v \rangle )^2.$$

Proof: In view of Lemma 3.1.12, we have

$$\max_u \{ \langle a, u \rangle \mid \langle H^{-1} u, u \rangle \le 1 \} = \langle Ha, a \rangle^{1/2}.$$

Therefore we need to ensure

$$\max_{x \in W} \langle a, x \rangle = \max_{x \in W} \bigl[ \langle a, x - v \rangle + \langle a, v \rangle \bigr] = \langle a, v \rangle + \max_u \{ \langle a, u \rangle \mid \langle H^{-1} u, u \rangle \le 1 \} = \langle a, v \rangle + \langle Ha, a \rangle^{1/2} \le b.$$

This proves our statement since $\langle a, v \rangle < b$. $\Box$

Note that $\mathrm{vol}_n\, W = \mathrm{vol}_n\, B_2(0,1)\, [\det H]^{1/2}$. Hence, our problem is as follows:

$$\min_{H, \tau} \tau,$$
$$\text{s.t. } -\ln\det H \le \tau, \qquad (4.3.14)$$
$$\langle H a_i, a_i \rangle \le ( b_i - \langle a_i, v \rangle )^2, \quad i = 1 \ldots m,$$
$$H \in \mathcal P_n, \quad \tau \in R^1.$$

In view of Lemma 4.3.7, we can use the following self-concordant barrier:

$$F(H, \tau) = -\ln\det H - \ln( \tau + \ln\det H ) - \sum_{i=1}^m \ln\bigl[ ( b_i - \langle a_i, v \rangle )^2 - \langle H a_i, a_i \rangle \bigr], \qquad \nu = m + n + 1.$$

The efficiency estimate of the corresponding path-following scheme is $O\bigl( \sqrt{m + n + 1}\, \ln \frac{m+n}{\epsilon} \bigr)$ iterations.

4.3.4.3 Inscribed ellipsoid with free center

Let $Q$ be a convex polytope defined by a set of linear inequalities:

$$Q = \{ x \in R^n \mid \langle a_i, x \rangle \le b_i,\; i = 1 \ldots m \},$$

and let $\mathrm{int}\, Q \neq \emptyset$. Find an ellipsoid $W \subseteq Q$ which has the maximal volume.

Let $G \in \mathrm{int}\, \mathcal P_n$ and $v \in \mathrm{int}\, Q$. We can represent $W$ as follows:

$$W = \{ x \in R^n \mid \| G^{-1}(x - v) \| \le 1 \} = \{ x \in R^n \mid \langle G^{-2}(x - v),\, x - v \rangle \le 1 \}.$$

In view of Lemma 4.3.8, the inequality $\langle a, x \rangle \le b$ is valid for any $x \in W$ if and only if

$$\| Ga \|^2 \equiv \langle G^2 a, a \rangle \le ( b - \langle a, v \rangle )^2.$$

That gives a convex region for $(G, v)$:

$$\| Ga \| \le b - \langle a, v \rangle.$$

Note that $\mathrm{vol}_n\, W = \mathrm{vol}_n\, B_2(0,1)\, \det G$. Therefore our problem can be written as follows:

$$\min_{G, v, \tau} \tau,$$
$$\text{s.t. } -\ln\det G \le \tau, \qquad (4.3.15)$$
$$\| G a_i \| \le b_i - \langle a_i, v \rangle, \quad i = 1 \ldots m,$$
$$G \in \mathcal P_n, \quad \tau \in R^1.$$

In view of Lemmas 4.3.7 and 4.3.3, we can use the following self-concordant barrier:

$$F(G, v, \tau) = -\ln\det G - \ln( \tau + \ln\det G ) - \sum_{i=1}^m \ln\bigl[ ( b_i - \langle a_i, v \rangle )^2 - \| G a_i \|^2 \bigr], \qquad \nu = 2m + n + 1.$$

The corresponding efficiency estimate is $O\bigl( \sqrt{2m + n + 1}\, \ln \frac{m+n}{\epsilon} \bigr)$ iterations of a path-following scheme.

4.3.5 Separable optimization

In problems of separable optimization all nonlinear terms are represented by univariate functions. A general formulation of such a problem looks as follows:

$$\min_{x \in R^n} q_0(x) = \sum_{j=1}^{m_0} \alpha_{0,j}\, f_{0,j}\bigl( \langle a_{0,j}, x \rangle + b_{0,j} \bigr),$$
$$\text{s.t. } q_i(x) = \sum_{j=1}^{m_i} \alpha_{i,j}\, f_{i,j}\bigl( \langle a_{i,j}, x \rangle + b_{i,j} \bigr) \le \beta_i, \quad i = 1 \ldots m, \qquad (4.3.16)$$

where $\alpha_{i,j}$ are some positive coefficients, $a_{i,j} \in R^n$ and $f_{i,j}(t)$ are convex functions of one variable. Let us rewrite this problem in a standard form:

$$\min_{x, t, \tau} \tau_0,$$
$$\text{s.t. } f_{i,j}\bigl( \langle a_{i,j}, x \rangle + b_{i,j} \bigr) \le t_{i,j}, \quad j = 1 \ldots m_i, \; i = 0 \ldots m,$$
$$\sum_{j=1}^{m_i} \alpha_{i,j}\, t_{i,j} \le \tau_i, \quad i = 0 \ldots m, \qquad (4.3.17)$$
$$\tau_i \le \beta_i, \quad i = 1 \ldots m,$$

where $M = \sum_{i=0}^m m_i$ is the total number of nonlinear terms. Thus, in order to construct a self-concordant barrier for the feasible set of this problem, we need barriers for the epigraphs of the univariate convex functions $f_{i,j}$. Let us point out such barriers for several important functions (they are also collected in code form after the list).
4.3.5.1 Logarithm and exponent

The function $F_1(x, t) = -\ln x - \ln( \ln x + t )$ is a 2-self-concordant barrier for the set

$$Q_1 = \{ (x, t) \in R^2 \mid x > 0,\; t \ge -\ln x \},$$

and the function $F_2(x, t) = -\ln t - \ln( \ln t - x )$ is a 2-self-concordant barrier for the set

$$Q_2 = \{ (x, t) \in R^2 \mid t \ge e^x \}.$$

4.3.5.2 Entropy function

The function $F_3(x, t) = -\ln x - \ln( t - x \ln x )$ is a 2-self-concordant barrier for the set

$$Q_3 = \{ (x, t) \in R^2 \mid x \ge 0,\; t \ge x \ln x \}.$$

4.3.5.3 Increasing power functions

The function $F_4(x, t) = -2\ln t - \ln( t^{2/p} - x^2 )$ is a 4-self-concordant barrier for the set

$$Q_4 = \{ (x, t) \in R^2 \mid t \ge | x |^p \}, \qquad p \ge 1,$$

and the function $F_5(x, t) = -\ln x - \ln( t^p - x )$ is a 2-self-concordant barrier for the set

$$Q_5 = \{ (x, t) \in R^2 \mid x \ge 0,\; t^p \ge x \}, \qquad 0 < p \le 1.$$

4.3.5.4 Decreasing power functions

The function $F_6(x, t) = -\ln t - \ln( x - t^{-1/p} )$ is a 2-self-concordant barrier for the set

$$Q_6 = \Bigl\{ (x, t) \in R^2 \;\Bigl|\; x > 0,\; t \ge \frac{1}{x^p} \Bigr\}, \qquad p \ge 1,$$

and the function $F_7(x, t) = -\ln x - \ln( t - x^{-p} )$ is a 2-self-concordant barrier for the set

$$Q_7 = \Bigl\{ (x, t) \in R^2 \;\Bigl|\; x > 0,\; t \ge \frac{1}{x^p} \Bigr\}, \qquad 0 < p < 1.$$
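For reference, the seven barriers above collected as plain Python functions (this catalog is our own convenience, not from the book; the arguments are assumed to lie in the corresponding sets $Q_1, \ldots, Q_7$, and no domain checking is done):

```python
from math import log

# Univariate epigraph barriers of Sections 4.3.5.1-4.3.5.4.
F1 = lambda x, t:    -log(x) - log(log(x) + t)            # t >= -ln x
F2 = lambda x, t:    -log(t) - log(log(t) - x)            # t >= e^x
F3 = lambda x, t:    -log(x) - log(t - x * log(x))        # t >= x ln x
F4 = lambda x, t, p: -2 * log(t) - log(t**(2/p) - x**2)   # t >= |x|^p, p >= 1
F5 = lambda x, t, p: -log(x) - log(t**p - x)              # t^p >= x, 0 < p <= 1
F6 = lambda x, t, p: -log(t) - log(x - t**(-1/p))         # t >= x^{-p}, p >= 1
F7 = lambda x, t, p: -log(x) - log(t - x**(-p))           # t >= x^{-p}, 0 < p < 1

print(F3(1.0, 1.0), F4(0.5, 2.0, 3))                      # sample evaluations
```

By Theorem 4.2.2, summing such terms over all nonlinear components of (4.3.17) gives a self-concordant barrier for the whole feasible set, with parameter equal to the sum of the individual parameters.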

We omit the proofs of the above statements since they are rather technical. It can also be shown that the barriers for all of these sets (except maybe $Q_4$) are optimal. Let us prove this statement for the sets $Q_6$ and $Q_7$.
LEMMA 4.3.9 The parameter $\nu$ of any self-concordant barrier for the set

$$Q = \Bigl\{ ( x^{(1)}, x^{(2)} ) \in R^2 \;\Bigl|\; x^{(1)} > 0,\; x^{(2)} \ge \frac{1}{(x^{(1)})^p} \Bigr\},$$

with $p > 0$, satisfies the inequality $\nu \ge 2$.

Proof: Let us fix some $\gamma > 1$ and choose $\bar x = (\gamma, \gamma) \in \mathrm{int}\, Q$. Denote

$$p_1 = e_1, \qquad p_2 = e_2, \qquad \beta_1 = \beta_2 = \gamma, \qquad \alpha_1 = \alpha_2 = \alpha = \gamma - 1.$$

Then $\bar x + \xi e_i \in Q$ for any $\xi \ge 0$, and

$$\bar x - \beta_1 e_1 = (0, \gamma) \notin Q, \qquad \bar x - \beta_2 e_2 = (\gamma, 0) \notin Q,$$

$$\bar x - \alpha( e_1 + e_2 ) = ( \gamma - \alpha,\, \gamma - \alpha ) = (1, 1) \in Q.$$

Therefore, the conditions of Theorem 4.3.1 are satisfied and

$$\nu \ge \frac{\alpha_1}{\beta_1} + \frac{\alpha_2}{\beta_2} = 2\, \frac{\gamma - 1}{\gamma}.$$

This proves the statement since $\gamma$ can be chosen arbitrarily big. $\Box$

Let us conclude our discussion with two examples.

4.3.5.5 Geometric optimization

The initial formulation of such problems is as follows:

$$\min_{x \in R^n} q_0(x) = \sum_{j=1}^{m_0} \alpha_{0,j} \prod_{k=1}^n \bigl( x^{(k)} \bigr)^{\sigma^{(k)}_{0,j}},$$
$$\text{s.t. } q_i(x) = \sum_{j=1}^{m_i} \alpha_{i,j} \prod_{k=1}^n \bigl( x^{(k)} \bigr)^{\sigma^{(k)}_{i,j}} \le 1, \quad i = 1 \ldots m, \qquad (4.3.18)$$
$$x^{(j)} > 0, \quad j = 1 \ldots n,$$

where $\alpha_{i,j}$ are some positive coefficients. Note that the problem (4.3.18) is not convex.

Let us introduce the vectors $\sigma_{i,j} = ( \sigma^{(1)}_{i,j}, \ldots, \sigma^{(n)}_{i,j} ) \in R^n$ and change the variables: $x^{(i)} = e^{y^{(i)}}$. Then (4.3.18) is transformed into a convex separable problem:

$$\min_{y \in R^n} \sum_{j=1}^{m_0} \alpha_{0,j} \exp\bigl( \langle \sigma_{0,j}, y \rangle \bigr),$$
$$\text{s.t. } \sum_{j=1}^{m_i} \alpha_{i,j} \exp\bigl( \langle \sigma_{i,j}, y \rangle \bigr) \le 1, \quad i = 1 \ldots m. \qquad (4.3.19)$$

Denote $M = \sum_{i=0}^m m_i$. The complexity of solving (4.3.19) by a path-following scheme is

$$O\Bigl( M^{1/2} \ln \frac M\epsilon \Bigr).$$
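The change of variables is easy to check numerically. The following short sketch (Python/NumPy; the coefficients and exponent vectors are arbitrary illustrative data) evaluates one posynomial term in both forms:

```python
import numpy as np

# The substitution x = exp(y) turns the posynomial
#     q(x) = sum_j alpha_j * prod_k x_k^{sigma_j[k]}
# into the convex separable form
#     q(y) = sum_j alpha_j * exp(<sigma_j, y>).

alpha = np.array([0.5, 0.25])                 # positive coefficients alpha_j
sigma = np.array([[1.0, -2.0], [-0.5, 1.0]])  # exponent vectors sigma_j

def q(x):                                     # original (nonconvex) form
    return sum(a * np.prod(x ** s) for a, s in zip(alpha, sigma))

def q_convex(y):                              # convex separable form in y
    return sum(a * np.exp(s @ y) for a, s in zip(alpha, sigma))

x = np.array([2.0, 3.0])
print(q(x), q_convex(np.log(x)))              # the two forms agree
```

Each term $\alpha_j \exp(\langle \sigma_j, y \rangle)$ is then handled through the epigraph barrier $F_2$ of Section 4.3.5.1, which is what produces the parameter $M$ in the complexity bound above.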
4.3.5.6 Approximation in lp norms

The simplest problem of that type is as follows:

$$\min_{x \in R^n} \sum_{i=1}^m \bigl| \langle a_i, x \rangle - b^{(i)} \bigr|^p,$$
$$\text{s.t. } \alpha \le x \le \beta, \qquad (4.3.20)$$

where $p \ge 1$. Clearly, we can rewrite this problem in an equivalent standard form:

$$\min_{x, \tau} \tau^{(0)},$$
$$\text{s.t. } \bigl| \langle a_i, x \rangle - b^{(i)} \bigr|^p \le \tau^{(i)}, \quad i = 1 \ldots m,$$
$$\sum_{i=1}^m \tau^{(i)} \le \tau^{(0)}, \qquad (4.3.21)$$
$$\alpha \le x \le \beta.$$

The complexity bound of this problem is $O\bigl( \sqrt{m + n}\, \ln \frac{m+n}{\epsilon} \bigr)$ iterations of a path-following scheme.

We have discussed the performance of interior-point methods on several pure problem formulations. However, it is important that we can apply these methods to mixed problems as well. For example, in problems (4.3.7) or (4.3.20) we can also treat quadratic constraints. To do that, we need to construct a corresponding self-concordant barrier. Such barriers are known for all important examples we meet in practical applications.

4.3.6 Choice of minimization scheme

We have seen that many convex optimization problems can be solved by interior-point methods. However, we know that the same problems can be solved by another general technique, the nonsmooth optimization methods. In general, we cannot say which approach is better, since the answer depends on the individual structure of a particular problem. However, the complexity estimates for optimization schemes often help to make a reasonable choice. Let us consider a simple example.

Assume we are going to solve a problem of finding the best approximation in lp-norms:

$$\min_{x \in R^n} \sum_{i=1}^m \bigl| \langle a_i, x \rangle - b^{(i)} \bigr|^p,$$
$$\text{s.t. } \alpha \le x \le \beta, \qquad (4.3.22)$$

where $p \ge 1$. And let us have two numerical methods available:

The ellipsoid method (Section 3.2.6).

The interior-point path-following scheme.

What scheme should we use? We can derive the answer from the complexity estimates of the corresponding methods.

Let us estimate first the performance of the ellipsoid method as applied to the problem (4.3.22).

Complexity of the ellipsoid method

Number of iterations: $O\bigl( n^2 \ln \frac1\epsilon \bigr)$.

Complexity of the oracle: $O(mn)$ operations.

Complexity of the iteration: $O(n^2)$ operations.

Total complexity: $O\bigl( n^3 (m + n) \ln \frac1\epsilon \bigr)$ operations.


The analysis of a path-following scheme is more involved. First of all,
we should form a barrier model of the problem:
~'
min
x,r,{

s.t.

I (ai, x) -

i=l

T(i)

b(i) IP~

T(i),

i = 1 ... m,

~ ~' a ~X~ ,
(4.3.23)

F(x,T,~))

Ej(T(i),(ai,x) -b(i)) -In(~- i=l


ET(i))

i=l

where f(y, t) = -2ln t -ln(t21P- y 2 ).


Wehaveseen that the parameter of barrier F(x, T, e) is V = 4m+n+ 1.
Therefore, the number of iterations of a path-following scheme can be
estimated as 0 ( v'4m + n + 1 In
At each iteration of the path-following scheme we need to compute
the gradient and the Hessian of barrier F(x, T, e). Denote

mpn ).

9I(y, t) = f~(y, t),

92(Y, t) = J:(y, t).

Then

$$F'_x(x, \tau, \xi) = \sum_{i=1}^m g_1( \tau^{(i)}, s_i )\, a_i - \sum_{j=1}^n \Bigl[ \frac{1}{x^{(j)} - \alpha^{(j)}} - \frac{1}{\beta^{(j)} - x^{(j)}} \Bigr] e_j,$$
$$F'_{\tau^{(i)}}(x, \tau, \xi) = g_2( \tau^{(i)}, s_i ) + \Bigl( \xi - \sum_{i=1}^m \tau^{(i)} \Bigr)^{-1}, \qquad F'_\xi(x, \tau, \xi) = -\Bigl( \xi - \sum_{i=1}^m \tau^{(i)} \Bigr)^{-1}.$$

Further, denoting $h_{11} = f''_{yy}$, $h_{12} = f''_{yt}$ and $h_{22} = f''_{tt}$, we obtain

$$F''_{xx}(x, \tau, \xi) = \sum_{i=1}^m h_{11}( \tau^{(i)}, s_i )\, a_i a_i^T + \sum_{j=1}^n \Bigl[ \frac{1}{( x^{(j)} - \alpha^{(j)} )^2} + \frac{1}{( \beta^{(j)} - x^{(j)} )^2} \Bigr] e_j e_j^T,$$
$$F''_{\tau^{(i)} x}(x, \tau, \xi) = h_{12}( \tau^{(i)}, s_i )\, a_i,$$
$$F''_{\tau^{(i)} \tau^{(i)}}(x, \tau, \xi) = h_{22}( \tau^{(i)}, s_i ) + \Bigl( \xi - \sum_{i=1}^m \tau^{(i)} \Bigr)^{-2}, \qquad F''_{\tau^{(i)} \tau^{(j)}}(x, \tau, \xi) = \Bigl( \xi - \sum_{i=1}^m \tau^{(i)} \Bigr)^{-2}, \quad i \neq j,$$
$$F''_{\tau^{(i)} \xi}(x, \tau, \xi) = -\Bigl( \xi - \sum_{i=1}^m \tau^{(i)} \Bigr)^{-2}, \qquad F''_{\xi\xi}(x, \tau, \xi) = \Bigl( \xi - \sum_{i=1}^m \tau^{(i)} \Bigr)^{-2}.$$

Thus, the complexity of the second-order oracle in the path-following scheme is $O(m n^2)$ arithmetic operations.
Let us estimate now the complexity of each iteration. The main source of computations at each iteration is the solution of the Newton system. Denote

$$\kappa = \Bigl( \xi - \sum_{i=1}^m \tau^{(i)} \Bigr)^{-1}, \qquad s_i = \langle a_i, x \rangle - b^{(i)}, \quad i = 1 \ldots m,$$

and

$$\Lambda_1 = \mathrm{diag}\bigl( h_{11}( \tau^{(i)}, s_i ) \bigr)_{i=1}^m, \qquad \Lambda_2 = \mathrm{diag}\bigl( h_{12}( \tau^{(i)}, s_i ) \bigr)_{i=1}^m, \qquad D = \mathrm{diag}\bigl( h_{22}( \tau^{(i)}, s_i ) \bigr)_{i=1}^m.$$

Then, using the notation $A = (a_1, \ldots, a_m)$ and $e = (1, \ldots, 1) \in R^m$, the Newton system in the unknowns $(\Delta x, \Delta\tau, \Delta\xi)$ can be written in the following block form:

$$( A \Lambda_1 A^T + B )\, \Delta x + A \Lambda_2\, \Delta\tau = -F'_x,$$
$$\Lambda_2 A^T \Delta x + ( D + \kappa^2 e e^T )\, \Delta\tau - \kappa^2 e\, \Delta\xi = -F'_\tau, \qquad (4.3.24)$$
$$-\kappa^2 e^T \Delta\tau + \kappa^2\, \Delta\xi = -t - F'_\xi,$$

where $t$ is the penalty parameter and $B$ is the diagonal contribution of the box terms to $F''_{xx}$. From the second equation in (4.3.24) we can express $\Delta\tau$ through $\Delta x$ and $\Delta\xi$. Substituting this expression into the first equation, we can express $\Delta x$ through $\Delta\xi$ by solving an $n \times n$ system with the matrix

$$A\bigl( \Lambda_1 - \Lambda_2 [ D + \kappa^2 e e^T ]^{-1} \Lambda_2 \bigr) A^T + B.$$

Using these relations, we can find $\Delta\xi$ from the last equation. Thus, the Newton system (4.3.24) can be solved in $O( n^3 + m n^2 )$ operations. This implies that the total complexity of the path-following scheme can be estimated as

$$O\Bigl( n^2 (m + n)^{3/2} \ln \frac{m + n}{\epsilon} \Bigr)$$

arithmetic operations. Comparing this estimate with that of the ellipsoid method, we conclude that the interior-point methods are more efficient if $m$ is not too large, namely if $m \le O(n^2)$ (a small numeric illustration is given below).
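A quick numeric illustration of this comparison (a sketch of ours, ignoring constants and the different logarithmic factors, so it is only an order-of-magnitude check):

```python
# Leading terms of the two total-complexity estimates derived above.
def ellipsoid_cost(n, m):       return n**3 * (m + n)
def path_following_cost(n, m):  return n**2 * (m + n)**1.5

for n, m in [(100, 100), (100, 10_000), (100, 1_000_000)]:
    print(n, m, ellipsoid_cost(n, m), path_following_cost(n, m))
# For m <= O(n^2) the path-following estimate is smaller; for much larger m
# the ellipsoid method wins.
```

At $m = n^2$ the two estimates coincide up to constants, which is exactly the threshold stated above.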

Bibliography

Chapter 1. Nonlinear optimization

Section 1.1. Complexity theory for black-box optimization schemes was developed in [8]. In this monograph the reader can find different examples of resisting oracles and lower complexity bounds similar to that of Theorem 1.1.2.

Sections 1.2 and 1.3. There exist several classic monographs [2, 3, 7] which treat different aspects of nonlinear optimization and numerical schemes. For sequential unconstrained minimization the best source is still [4].

Chapter 2. Smooth convex optimization

Section 2.1. The lower complexity bounds for smooth convex and strongly convex functions can be found in [8]. However, the proof used in this section seems to be new.

Section 2.2. Gradient mapping was introduced in [8]. The optimal method for smooth and strongly convex functions was proposed in [10]. A constrained variant of this scheme is taken from [11].

Section 2.3. Optimal methods for minimax problems were developed in [11]. The approach of Section 2.3.5 seems to be new.

Chapter 3. Nonsmooth convex optimization

Section 3.1. A comprehensive treatment of different topics of convex analysis can be found in [5]. However, the classic [15] is still very useful.

Section 3.2. Lower complexity bounds for nonsmooth minimization problems can be found in [8]. The scheme of the proof of the convergence rate was suggested in [9]. See [13] for detailed bibliographical comments on the history of nonsmooth minimization schemes.

Section 3.3. The example for the Kelley method is taken from [8]. The presentation of the level method is close to [6].

Chapter 4. Structural optimization

This chapter contains an adaptation of the main concepts from [12]. We added several useful inequalities and slightly simplified the path-following scheme. We refer the reader to [1] for numerous applications of interior-point methods, and to [14], [16], [18] and [19] for detailed treatment of different theoretical aspects.

References

[1] A. Ben-Tal and A. Nemirovskii. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, SIAM, Philadelphia, 2001.
[2] A.B. Conn, N.I.M. Gould and Ph.L. Toint. Trust Region Methods, SIAM, Philadelphia, 2000.
[3] J.E. Dennis and R.B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations, SIAM, Philadelphia, 1996.
[4] A.V. Fiacco and G.P. McCormick. Nonlinear Programming: Sequential Unconstrained Minimization Techniques, John Wiley and Sons, New York, 1968.
[5] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, vols. I and II. Springer-Verlag, 1993.
[6] C. Lemaréchal, A. Nemirovskii and Yu. Nesterov. New variants of bundle methods. Mathematical Programming, 69 (1995), 111-148.
[7] D.G. Luenberger. Linear and Nonlinear Programming, Second Edition, Addison Wesley, 1984.
[8] A. Nemirovsky and D. Yudin. Informational Complexity and Efficient Methods for Solution of Convex Extremal Problems, Wiley, New York, 1983.
[9] Yu. Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Ekonomika i Mat. Metody, v. 11, No. 3, 519-531, 1984. (In Russian; translated as MatEcon.)
[10] Yu. Nesterov. A method for solving a convex programming problem with rate of convergence $O(\frac{1}{k^2})$. Soviet Math. Doklady, 1983, v. 269, No. 3, 543-547. (In Russian.)
[11] Yu. Nesterov. Efficient Methods in Nonlinear Programming. Radio i Sviaz, Moscow, 1989. (In Russian.)
[12] Yu. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming, SIAM, Philadelphia, 1994.
[13] B.T. Polyak. Introduction to Optimization. Optimization Software, New York, NY, 1987.
[14] J. Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization, MPS-SIAM Series on Optimization, SIAM, 2001.
[15] R.T. Rockafellar. Convex Analysis, Princeton Univ. Press, Princeton, NJ, 1970.
[16] C. Roos, T. Terlaky and J.-Ph. Vial. Theory and Algorithms for Linear Optimization: An Interior Point Approach. John Wiley, Chichester, 1997.
[17] R.J. Vanderbei. Linear Programming: Foundations and Extensions. Kluwer Academic Publishers, Boston, 1996.
[18] S. Wright. Primal-Dual Interior Point Methods. SIAM, Philadelphia, 1996.
[19] Y. Ye. Interior Point Algorithms: Theory and Analysis, John Wiley and Sons, Inc., 1997.

Index

Analytic center, 198


Antigradient, 17
Approximate centering condition, 200
Approximation, 16
first order, 16
global upper, 37
in lp-norms, 226
linear, 16
quadratic, 19
second order, 19
Barrier
analytic, 156
self-concordant, 193
universal, 212
volumetric, 156
Black box concept, 7
Center
analytic, 198
of gravity, 151
Central path, 193
auxiliary, 204
equation, 193
Class of problems, 6
Complexity
analytical, 6
arithmetical, 6
lower bounds, 10
upper bounds, 10
Computational efforts, 6
Condition number, 65
Cone of
positive semidefinite matrices, 216
second order, 214
Conjugate directions, 43
Contracting mapping, 30
Convex
combination, 82, 113
differentiable function, 52

function, 112
set, 81
Cutting plane scheme, 150
Damped Newton method, 34
Dikin ellipsoid, 182
Directional derivative, 122
Domain of function, 112
Epigraph, 82
Estimate sequence, 72
Feasibility problem, 146
Function
barrier, 48, 180
convex, 112
objective, 1
self-concordant, 176
strongly convex, 63
Functional constraints, 1
General iterative scheme, 6
Gradient, 16
mapping, 86
Hessian, 19
Hyperplane
separating, 124
supporting, 124
Inequality
Cauchy-Schwartz, 17
Jensen, 112
Infinity norm, 116
Information set, 6
Inner product, 2
Kelley method, 158
Krylov subspace, 42


Level set, 17, 82


Localization set, 141
Matrix
positive definite, 19
positive semidefinite, 19
Max-type function, 91
Method of
analytic centers, 156
barrier functions, 49
centers of gravity, 151
conjugate gradients, 42
ellipsoids, 154
gradient, 25
inscribed ellipsoids, 156
penalty functions, 47
uniform grid, 8
variable metric, 38, 40
volumetric centers, 156
Method
optimal, 29
path-following, 202
quasi-Newton, 38
Minimax problem, 91
Minimum
global, 2
local, 2
Model of
convex function, 157
minimization problem, 5
barrier, 173, 199
functional, 7
Newton method
damped, 34, 188
standard, 33, 189
Newton system, 33
Norm
l∞, 8, 116
l1, 116
Euclidean, 16
Frobenius, 216
local, 181
Optimality condition
constrained problem, 84
first order, 17
minimax problem, 92
second order, 19
Oracle
local black box, 7
resisting, 10-11
Parameter
barrier, 194
centering, 200
Performance

on a problem, 5
on a problem class, 5
Polar set, 212
Polynomial methods, 156
Positive orthant, 213
Problem
constrained, 2
feasible, 2
general, 1
linearly constrained, 2
nonsmooth, 2
of approximation in lp-norms, 226-227
of geometric optimization, 226
of integer optimization, 3
of linear optimization, 2, 213
of quadratic optimization, 2
of semidefinite optimization, 216
of separable optimization, 224
quadratically constrained quadratic, 2,
214
smooth, 2
strictly feasible, 2
unconstrained, 2
Projection, 124
Quasi-Newton rule, 40
Recession direction, 211
Relaxation, 15
Restarting strategy, 45
Self-concordant
barrier, 193
function, 176
Sequential unconstrained minimization, 46
Set
convex, 81
feasible, 2
basic, 1
Slater condition, 2, 49
Solution
global, 2
local, 2
Standard
logarithmic barrier, 213
minimization problem, 192
simplex, 132
Stationary point, 18
Step-size strategies, 25
Strict separation, 124
Structural constraints, 3
Subdifferential, 126
Subgradient, 126
Support function, 120
Supporting vector, 126
Unit ball, 116
