
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 01
Vectors and Matrices – Linear Independence and Rank

Hello. Welcome to this module in this massive open online course. So, let us start with the mathematical preliminaries that are required to understand the framework of optimization, that is, the preliminaries which form the basis for building the framework of optimization and its various tools and techniques, ok.

(Refer Slide Time: 00:37)

And now, if these elements x 1, x 2, ..., x n belong to the real field, that is, if these are real numbers, then we say that this is an n-dimensional real vector, that is, x bar belongs to the set of n-dimensional real vectors, all right. So, this is the space of n-dimensional real vectors.

(Refer Slide Time: 03:16)

So, we want to start with the mathematical preliminaries, the notation and so on that we
are going to use frequently in our treatment of optimization in order to illustrate or in
order to basically describe the various concepts of optimization.

Now, the first mathematical construct that we are going to use is that of a vector, as you must all be familiar with: a vector x bar, which is denoted by a bar on top of the quantity. So, let us start with the concept of vectors: a vector is denoted by a quantity like this, with a bar on top. So, this is basically a vector. The vector x bar is an n-dimensional object which contains n components; these are its elements.
And so what we have is, if you consider x bar now, x bar is a column vector and therefore, x bar transpose will similarly be a row vector. So, it will be 1 cross n. So, this is basically your row vector, x bar transpose. So, x bar is a column vector of dimension n cross 1. Now x bar transpose, as you can see, is a row vector of dimension 1 cross n, that is, 1 row and n columns. And further, x bar transpose x bar is basically your x 1, x 2, ..., x n, the row vector, times the column vector x 1, x 2, ..., x n.

(Refer Slide Time: 04:29)

Now, on the other hand, similar to the real vector, if x 1, x 2, ..., x n belong to the set of complex numbers, we get the notion of a complex vector. So, for a complex vector, if the elements x 1, x 2, ..., x n belong to C, that is, if these are complex numbers, then this implies that the vector x bar belongs to the set of n-dimensional complex vectors C n, that is, the space of n-dimensional complex vectors.

(Refer Slide Time: 07:18)


And you can see, this is basically equal to x 1 square, since these are real quantities, plus x 2 square plus ... plus x n square, which is also denoted by the norm square. In fact, we will see this is a specific norm, the l 2 norm. So, this is the square of the l 2 norm of the vector, the norm of x bar, and this l 2 norm is the default.

So, this is the l 2 norm, which is the default norm that we will use. So, if the norm is not specified explicitly, it indicates the l 2 norm, all right. And the l 2 norm of a vector is something that you are already very familiar with: it is simply the length of the vector in n-dimensional space, ok. So, norm x bar is simply something most of you are very familiar with, that is, the square root of x 1 square plus x 2 square plus ... plus x n square, which is basically the length of the vector, all right.
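As a small aside (not from the lecture itself), here is a minimal Python/numpy sketch that checks this numerically; the vector used is an arbitrary example.

```python
import numpy as np

# A quick numerical check of the l2 norm: for a real vector x,
# x^T x equals the sum of squared entries, and its square root is the length.
x = np.array([3.0, 4.0, 12.0])

norm_from_inner_product = np.sqrt(x @ x)   # sqrt(x1^2 + x2^2 + ... + xn^2)
norm_from_numpy = np.linalg.norm(x)        # numpy's default vector norm is the l2 norm

print(norm_from_inner_product)  # 13.0
print(norm_from_numpy)          # 13.0
```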
Now, x bar Hermitian is basically equal to x 1 conjugate, x 2 conjugate, ..., x n conjugate. That is, when you take the Hermitian of a vector (or of a matrix, in fact), in this case the Hermitian of a vector, the column vector becomes a row vector and, in addition, you take the complex conjugate of each complex element. So, that is basically your x bar Hermitian, all right. So, there are two steps: you form the row vector and take the complex conjugate of the elements. And x bar Hermitian into x bar is equal to x 1 conjugate, x 2 conjugate, ..., x n conjugate times x 1, x 2, ..., x n, and this is equal to, look at this, x 1 conjugate into x 1, that is, magnitude x 1 square, plus magnitude x 2 square, plus so on up to magnitude x n square, which once again is equal to the norm square, in fact, the l 2 norm of x bar, squared.

(Refer Slide Time: 09:09)

For real numbers, you can simply replace the magnitude squared by the square of the element. So, in that sense this is a general definition, applicable both for real and complex vectors. Now, a special kind of vector is obtained as follows: x tilde equals x bar divided by the norm of x bar. That is, you are taking the vector x bar and dividing it by its norm, and that gives a unit norm vector. So, this vector x tilde is basically a unit norm vector, because you can show that the norm of x tilde is unity. So, this vector x tilde has an interesting property: x tilde is a unit norm vector, and we can show that very easily as follows.

You can also write a 2 in the subscript to indicate that it is the l 2 norm.

And therefore, once again you see that in this case the norm of a complex vector x bar is the square root of magnitude x 1 square plus magnitude x 2 square plus ... plus magnitude x n square. That is, we have replaced x i square with magnitude x i square. In fact, this is a general definition: magnitude x 1 square plus magnitude x 2 square plus ... plus magnitude x n square, and the square root of that quantity. This definition is general; it works for both real as well as complex vectors, ok.
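A similar quick check for the complex case, again just an illustrative sketch with an arbitrary vector:

```python
import numpy as np

# Sketch of the general (complex) definition: ||x||^2 = sum |x_i|^2 = x^H x.
x = np.array([1 + 1j, 2 - 1j, 3j])

norm_sq_hermitian = np.vdot(x, x).real     # x^H x (vdot conjugates its first argument)
norm_sq_magnitudes = np.sum(np.abs(x)**2)  # |x1|^2 + |x2|^2 + ... + |xn|^2

print(norm_sq_hermitian)      # 16.0
print(norm_sq_magnitudes)     # 16.0
print(np.linalg.norm(x)**2)   # 16.0, same definition
```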
(Refer Slide Time: 11:27)

In fact, if you look at x tilde Hermitian x tilde, that is, x bar Hermitian divided by norm x bar, times x bar divided by norm x bar, this is basically x bar Hermitian x bar, that is, norm x bar square, divided by norm x bar square, which is 1. So, this implies that x tilde Hermitian x tilde, which is nothing but norm x tilde square, equals 1, and this implies that the norm of x tilde equals 1, ok.

So, x tilde is basically a unit norm vector. You can also say this is the unit norm vector in the direction of x bar. So, if you think of this n-dimensional vector x bar as representing a particular direction in n-dimensional space, the unit norm vector can be thought of as a unit vector pointing in that direction, the direction given by the vector x bar, ok. So, x bar and x tilde are aligned, except that x tilde is a unit norm vector, that is, it has norm equal to unity.

Let us take a simple example to understand this.

(Refer Slide Time: 12:49)

For instance, let us consider the vector x bar; let us consider this to be the n-dimensional all-ones vector. Then, we have norm of x bar equal to the square root of 1 square plus 1 square plus ... plus 1 square, n times, that is, equal to the square root of n. In fact, norm x bar square, remember we are talking about the l 2 norm, norm x bar square equals n.

And in fact, x tilde equals x bar divided by the norm of x bar, that is, 1 over square root of n into the all-ones vector. So, this is basically the corresponding unit norm vector, all right. So, that basically completes a brief summary of the various properties of vectors, and most of you might already be familiar with many of these aspects, but this presents a brief summary to quickly refresh your memory and remind you of several of these aspects. So now, let us look at matrices.
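Before moving on, a short numerical sketch of the all-ones example above (an illustration only, not part of the lecture):

```python
import numpy as np

# The all-ones example: ||x|| = sqrt(n), and x / ||x|| has unit norm.
n = 5
x = np.ones(n)

print(np.linalg.norm(x))          # sqrt(5) ~ 2.2360...
x_tilde = x / np.linalg.norm(x)   # unit norm vector in the direction of x
print(np.linalg.norm(x_tilde))    # 1.0
```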
(Refer Slide Time: 14:39)

Once again, this is a brief review of various concepts in linear algebra and matrices. So, let us consider an m cross n matrix A. This implies A has m rows and n columns, and you can represent A as the matrix a 1 1, a 1 2, ..., a 1 n; a 2 1, a 2 2, ..., a 2 n; and so on, where the mth row is a m 1, a m 2, ..., a m n. So, you can see there are m rows and there are n columns.

(Refer Slide Time: 17:36)

And when the number of rows is equal to the number of columns, that is, m equals n, then the matrix A becomes a square matrix, ok. So, if m equals n, then the matrix A is a square matrix, that is, when the number of rows is equal to the number of columns. Let us now look at an important concept, the row space and the column space. Now, to first understand this concept of the row space and column space of a matrix, you have to understand what we mean by the space spanned by a set of vectors and what we mean by the rank of a set of vectors.

(Refer Slide Time: 16:05)

So, let us start with this notion of rank. Consider the vectors W bar 1, W bar 2, ..., W bar m. This is a set of m vectors. Now, when are these vectors linearly independent? This is an important concept.

These vectors are linearly independent if there do not exist C 1, C 2, ..., C m, not all 0 (that is, all of them cannot be 0), such that C 1 W bar 1 plus C 2 W bar 2 plus ... plus C m W bar m equals 0. That is, there cannot be a set of constants C 1, C 2, ..., C m, not all zero, such that C 1 W bar 1 plus C 2 W bar 2 plus ... plus C m W bar m equals 0, all right; such an expression is known as a linear combination. So, no linear combination of the vectors W bar 1, W bar 2, ..., W bar m with weights that are not all zero can equal 0.

So, this is basically a linear combination, that is, you are weighting the vectors by coefficients and adding them. And note that, going back to the matrix A, the i jth element a i j is the element in the ith row and the jth column.

(Refer Slide Time: 20:01)

Or, let us look at the concept of linear dependence. The vectors are linearly dependent if there exist C 1, C 2, ..., C m, not all 0, such that C 1 W bar 1 plus C 2 W bar 2 plus ... plus C m W bar m equals 0.

So, if there exists these weights C 1, C 2, C m such that not all of them are 0 and the
linear combination of the vectors W bar 1, W bar 2, W bar m is 0, then these vectors W
bar 1, W bar 2, W bar m are linearly dependent ok. So, this is basically a linear
combination and these vectors are therefore, linearly dependent ok. For instance, let us
take a very simple example to understand this.

(Refer Slide Time: 22:49)

So, this is basically a linear combination, ok. So, for linear independence, there cannot be a linear combination of the vectors W bar 1, W bar 2, ..., W bar m with coefficients, or weights, C 1, C 2, ..., C m, not all of them 0, such that this linear combination is 0; if no such combination exists, they are linearly independent, all right.

(Refer Slide Time: 20:40)

Consider the vectors W bar 1 equal to 1, 1, 1 and W bar 2 equal to minus 2, minus 2, minus 2. Then you can easily see that two times W bar 1 plus one times W bar 2 equals 0, which implies that W bar 1 and W bar 2 are linearly dependent.
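A quick numerical check of this example (an illustration, not part of the lecture):

```python
import numpy as np

# Checking the example: w1 = (1,1,1), w2 = (-2,-2,-2).
# 2*w1 + 1*w2 = 0, so the vectors are linearly dependent; equivalently, the
# matrix with these vectors as columns has rank 1 (< 2).
w1 = np.array([1.0, 1.0, 1.0])
w2 = np.array([-2.0, -2.0, -2.0])

print(2 * w1 + 1 * w2)                                   # [0. 0. 0.]
print(np.linalg.matrix_rank(np.column_stack([w1, w2])))  # 1 -> linearly dependent
```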

Now, what about linear independence? If no such weights exist, that is, if the vectors are not linearly dependent, then they are linearly independent.
(Refer Slide Time: 23:44)

Now, on the other hand, consider another example: W bar 1 equal to 1, 1, 1 and W bar 2 equal to 1, 2, 3. You can quickly verify that W bar 1 and W bar 2 are linearly independent, all right; this implies that there do not exist C 1 and C 2, not both 0 (that is, one of them can be 0, but not both), such that C 1 W bar 1 plus C 2 W bar 2 equals 0, ok.

There do not exist such weights for which the linear combination is 0, ok. So, this is basically the concept of linear dependence and linear independence of a set of vectors. Now, if you go back and look at the matrix A, one uses this concept of linear independence to define the rank of the matrix A.

(Refer Slide Time: 25:34)

So, let us go back and look at the matrix A; remember, it is an m cross n matrix, so you can either look at it as n columns or look at it as m rows, ok. So, let a bar 1, a bar 2, ..., a bar n denote the columns and a 1 tilde, a 2 tilde, ..., a m tilde denote the rows. So, these are basically your n columns and these are basically your m rows. And now, the column rank of A equals the maximum number of linearly independent columns among a bar 1, a bar 2, ..., a bar n. That is, the maximum number of columns that you can choose from A such that there does not exist any linear combination of them, with weights not all zero, which is 0.

(Refer Slide Time: 27:24)
So, the column rank is the maximum number of linearly independent columns of A. Now, similarly, the row rank of A equals the maximum number of linearly independent rows of A.

So, this is the column rank of A and this is the row rank of A. So, you have this notion of row rank and the notion of column rank, and one of the fundamental results in linear algebra, or matrix theory, is that the row rank of any matrix equals its column rank, and this common quantity is simply denoted as the rank of the matrix A, ok.

(Refer Slide Time: 28:45)

So, the rank of any matrix is less than or equal to the minimum of the number of rows and columns of the matrix, and this is a fundamental property of the rank, ok.

So, this is one of the fundamental properties of matrices, which again some of you might already be familiar with, all right. So, we have this notion of column rank, which is the maximum number of linearly independent columns, and the row rank, which is the maximum number of linearly independent rows, and the fundamental theorem is that the row rank of any matrix is equal to its column rank, which is simply denoted as the rank of the matrix A.
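As an illustration of these rank facts (not from the lecture), here is a small numpy sketch with an arbitrary 3 x 4 example matrix:

```python
import numpy as np

# Row rank equals column rank, and rank(A) <= min(m, n).
# The third row below is the sum of the first two, so the rank is 2.
A = np.array([[1., 2., 3., 4.],
              [0., 1., 1., 0.],
              [1., 3., 4., 4.]])

r = np.linalg.matrix_rank(A)
print(r)                            # 2
print(np.linalg.matrix_rank(A.T))   # 2, the transpose has the same rank
print(r <= min(A.shape))            # True: rank <= min(m, n)
```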
So, we have the fundamental result, and this should be available in any standard textbook on linear algebra, that the row rank of any matrix A equals its column rank, and this is therefore simply denoted as the rank of the matrix A. In addition, this also satisfies the property that, let me just write this again, the rank of the matrix A is less than or equal to the minimum of the number of columns and rows of A. So, the rank of A is less than or equal to the minimum of m and n, where m, remember, is the number of rows and n is the number of columns.

And further, this rank has to be less than or equal to the minimum of the number of rows and columns of the matrix, all right. So, we have covered some of the mathematical preliminaries required to develop the various tools and techniques for optimization. We will continue this discussion in the subsequent modules.

Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 02
Eigenvectors and Eigenvalues of Matrices and their Properties

Hello, welcome to another module in this massive open online course.

(Refer Slide Time: 00:36)

This is known as the eigenvalue, and this vector x bar is known as the eigenvector. And now, since A x bar equals lambda x bar, that implies A x bar equals lambda times the identity matrix times x bar, which implies A x bar minus lambda I times x bar equals 0, which basically implies that (A minus lambda I) times x bar equals 0. Now, what this means is that the matrix A minus lambda I is a singular matrix.

(Refer Slide Time: 03:14)


Let us continue our discussion regarding the mathematical preliminaries for the framework of convex optimization by looking at another very important concept, that of the eigenvectors and eigenvalues of square matrices, right.

So, we are going to start talking about the concepts of eigenvectors and eigenvalues. And the notion of an eigenvector and eigenvalue is defined for a square matrix, correct.

So, for a square matrix A, x bar is an eigenvector if we have A x bar, that is, the product of the matrix A with the vector x bar, equal to a multiple, lambda times x bar, all right. This is the fundamental equation for the eigenvector, and this lambda is known as the eigenvalue.
There exists a nonzero vector x bar such that (A minus lambda I) times x bar equals 0, which implies that the determinant of A minus lambda I equals 0. So, we use this notation to denote the determinant. So, if lambda is an eigenvalue of A, that means the determinant of A minus lambda I equals 0. Now, by evaluating the determinant you can derive an equation that is known as the characteristic equation corresponding to the matrix A, and the roots of this equation in lambda give the eigenvalues of the matrix A, all right.

(Refer Slide Time: 06:09)

So, this basically gives the characteristic polynomial, correct: the determinant of A minus lambda I gives you the characteristic polynomial of A in terms of lambda, and the roots of the characteristic polynomial are the eigenvalues of A. Let us look at an example.

(Refer Slide Time: 05:04)

Now, given A, let us start by considering A minus lambda I: that will be equal to 1 1 1 minus 1, minus lambda times the identity 1 0 0 1, which is equal to 1 minus lambda, 1, 1, minus 1 minus lambda, all right. And now we have to consider the determinant of this, and set the determinant of A minus lambda I equal to 0. If you compute the determinant of A minus lambda I, you will see that this implies (1 minus lambda) into (minus 1 minus lambda) minus 1 equals 0, which basically implies that minus of (1 minus lambda) into (1 plus lambda) equals 1.

(Refer Slide Time: 07:46)

Let us say A is the matrix 1 1 1 minus 1 let us take this simple 2 cross 2 matrix all right.
So, this is a square matrix A and now we want to find the eigenvalues ok, for this given
matrix A 2 cross 2 matrix, we want to find the eigenvalues and also the corresponding
eigenvectors. So, find eigenvalues and the eigenvectors of A.
Which implies that lambda square minus 1 equals 1, which implies that lambda equals plus or minus square root of 2. So, these are the eigenvalues of A: lambda equals plus or minus square root of 2. So, we have got 2 eigenvalues, that is, lambda equals either plus square root of 2 or minus square root of 2. Now, let us find the corresponding eigenvectors of the matrix A, corresponding to both these eigenvalues.

(Refer Slide Time: 08:33)

Now, to find the eigenvector, what we are going to do is this: we have 1 1 1 minus 1 times x bar, that is, A times x bar, equal to square root of 2 times x bar; this is from the definition of the eigenvalue. Now, you can write square root of 2 times x bar as square root of 2 times an identity matrix times x bar.

Which implies, now you bring this to the left. So, this will become A minus square root of 2 times identity, which is 1 minus square root of 2, 1, 1, minus 1 minus square root of 2, into x bar; now I will write x bar as a vector, it is a two-dimensional vector x 1, x 2, and this equals 0. Now, this implies that you get two equations in x 1 and x 2: (1 minus square root of 2) x 1 plus x 2 equal to 0, and x 1 minus (1 plus square root of 2) x 2 equal to 0. Now, if you multiply the second equation by 1 minus square root of 2, you will realize that you get again the same equation.

So, this will be (1 minus square root of 2) x 1 minus (1 plus square root of 2) into (1 minus square root of 2) x 2; the second factor is minus of minus 1, that is, plus 1; so you get (1 minus square root of 2) x 1 plus x 2 equal to 0. This implies that basically the first and the second equation are identical, all right. So, basically you have just one equation and therefore, there are an infinite number of solutions, and that is kind of obvious, because the eigenvector corresponding to an eigenvalue is not unique. That is, if x bar is an eigenvector, then x bar scaled by any constant k is also an eigenvector corresponding to the same eigenvalue.

And therefore, there are an infinite number of eigenvectors in that sense. So, this means that these two equations are basically the same.

(Refer Slide Time: 11:10)

Now, what we will do to derive a solution, or one such solution, is set x 1 equal to 1; this implies x 2 equals minus of (1 minus root 2), that is, root 2 minus 1. And this you can verify is an eigenvector; that is, we look at x bar equal to x 1, x 2, that is, 1, root 2 minus 1, and this you can check. This is one of the eigenvectors of A.
(Refer Slide Time: 12:18)

This is one of the eigenvectors of the matrix, and you can check this, ok; let us do that. If you look at A times x bar, that is equal to 1 1 1 minus 1 times x bar, that is, 1, root 2 minus 1. So, the first component will be 1 into 1 plus root 2 minus 1, which is root 2, and the second component will be 1 into 1 minus (root 2 minus 1).

(Refer Slide Time: 13:19)

So, the second component will be 2 minus root 2. Pulling out the constant square root of 2, A x bar becomes square root of 2 times the vector 1, square root of 2 minus 1. And this is basically nothing but, if you call this your x bar, lambda times x bar, where lambda equals square root of 2.

(Refer Slide Time: 13:57)

And we have already seen that x bar is 1, square root of 2 minus 1; so this verifies that square root of 2 equals the eigenvalue and that 1, square root of 2 minus 1 equals an eigenvector. So, this verifies both facts: that square root of 2 is an eigenvalue of this matrix A and that 1, square root of 2 minus 1 is the corresponding eigenvector.

(Refer Slide Time: 14:51)

Now, similarly one can find the other eigenvector, corresponding to the eigenvalue minus square root of 2. For the eigenvalue minus square root of 2, we have 1 1 1 minus 1 times x bar equal to minus square root of 2 times x bar, which implies that (1 1 1 minus 1 plus square root of 2 times the identity) into x bar equals 0.

(Refer Slide Time: 15:49)

This implies, if you look at it, that the matrix 1 plus square root of 2, 1, 1, minus 1 plus square root of 2, into the vector x bar, that is, x 1, x 2, is equal to 0. This implies basically (1 plus square root of 2) times x 1 plus x 2 equal to 0, and you can see both equations will reduce to the same thing.

(Refer Slide Time: 16:26)

And now, once again, what we will do is set x 1 equal to 1; that implies x 2 equals minus of (1 plus square root of 2), or minus 1 minus root 2. And therefore, the eigenvector is x bar equal to 1, minus 1 minus root 2. So, this is the other eigenvector, corresponding to the eigenvalue minus square root of 2.

(Refer Slide Time: 17:09)

This is the other eigenvector, corresponding to the other eigenvalue minus square root of 2, all right. So, that is a brief introduction to the concept of eigenvectors and eigenvalues of a matrix. Let us look at another important concept, that of symmetric and Hermitian symmetric matrices, that is, Hermitian matrices, all right.
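As a numerical aside (not part of the lecture), the eigenvalues and eigenvectors of this 2 x 2 example can be checked with numpy:

```python
import numpy as np

# The 2 x 2 example from the lecture: A = [[1, 1], [1, -1]].
# Its eigenvalues are +sqrt(2) and -sqrt(2); eigenvectors are only defined up
# to scaling, so numpy returns unit-norm versions of [1, sqrt(2)-1] and [1, -1-sqrt(2)].
A = np.array([[1.0, 1.0],
              [1.0, -1.0]])

eigvals, eigvecs = np.linalg.eigh(A)   # eigh is meant for symmetric (Hermitian) matrices
print(eigvals)                         # [-1.4142...,  1.4142...]

# Each column of eigvecs satisfies A v = lambda v:
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))  # True, True

# For a symmetric matrix the eigenvectors come out orthonormal: V^T V = I.
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))  # True
```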
(Refer Slide Time: 17:57)

So, what we want to look at now is basically the notion of what are known as symmetric and Hermitian matrices. So, let us say A is a real n cross n matrix. We say A is symmetric if A equals A transpose. That implies that any element a i j is equal to a j i, for all pairs i, j. So, basically, for a symmetric matrix we must have A equal to A transpose. And naturally that implies this must be a square matrix, all right, because symmetry is only preserved if the matrix is a square matrix.

(Refer Slide Time: 19:24)

And A is Hermitian, or Hermitian symmetric, if A equals A Hermitian. And what is A Hermitian? Now, let us say A is our matrix a 1 1, a 1 2, a 2 1, a 2 2 and so on. For the Hermitian, what you have to do is take the transpose and take the complex conjugate of each element. So, the first entry will be a 1 1 conjugate; since you are taking the transpose, the (1, 2) entry becomes a 2 1 and its conjugate, the (2, 1) entry becomes a 1 2 and its conjugate, and the last entry is a 2 2 and its conjugate.

(Refer Slide Time: 20:25)

So, basically, for the Hermitian you take the transpose plus the conjugate of each element. Now, A equal to A Hermitian, that is, A being a Hermitian matrix, implies that a i j equals a j i conjugate; that is, it is Hermitian symmetric if a 1 2 equals a 2 1 conjugate and so on. Such a matrix is known as a Hermitian matrix.

So, this basically is a Hermitian matrix. Now, there are several interesting properties of Hermitian and symmetric matrices; one of the most interesting is that the eigenvalues of both symmetric and Hermitian matrices are real, all right.
(Refer Slide Time: 21:27)

So, the first property is that the eigenvalues of Hermitian and symmetric matrices are real; these are real quantities, all right.

(Refer Slide Time: 22:10)

And the second property is another interesting property: eigenvectors corresponding to distinct eigenvalues, that is, different eigenvalues, not the same eigenvalue, are orthogonal, and this is an important property. This implies that if V 1 bar, V 2 bar are the eigenvectors corresponding to distinct eigenvalues lambda 1, lambda 2, then for a symmetric matrix, V 1 bar Hermitian V 2 bar equals 0. This is the meaning of vectors being orthogonal: if x 1 bar, x 2 bar are two real vectors and x 1 bar transpose x 2 bar is 0, or if they are complex vectors and x 1 bar Hermitian x 2 bar equals 0, then the vectors are said to be orthogonal.

(Refer Slide Time: 23:42)

So, this is orthogonality of vectors; in general, orthogonality of vectors is a very important property. Now, let us go back to our earlier example to illustrate this fact. For instance, you have probably realized that our previous matrix, 1 1 1 minus 1, is a symmetric matrix; you can see that we have A equal to A transpose. And the eigenvalues, if you look at them, equal plus or minus square root of 2, and these are real quantities, ok; we have already seen this, all right.

(Refer Slide Time: 24:43)

So, you can see that the eigenvalues of this symmetric matrix are real, as given by the property. And now let us look at the eigenvectors, and we will show that the eigenvectors are orthogonal. The eigenvectors are 1, root 2 minus 1, let us call this your V 1 bar, and the other eigenvector is 1, minus 1 minus root 2, call it V 2 bar. Now, since these vectors are real, we can simply take the transpose, because the transpose and the Hermitian give the same thing for real vectors.

(Refer Slide Time: 25:23)

Now, if you look at V 1 bar transpose into V 2 bar, that gives 1, square root of 2 minus 1 times 1, minus 1 minus root 2, which is equal to, as you can clearly see, 1 plus (root 2 minus 1) into (minus 1 minus root 2), that is, 1 minus (2 minus 1), which is basically 0. So, this shows you that V 1 bar transpose V 2 bar equals 0, which implies these vectors are orthogonal, all right.

That is a very interesting property, and it arises because the matrix is symmetric, all right. So, in this module we have looked at the very interesting and also very important concepts of eigenvalues, eigenvectors and symmetric matrices. These are very important, because they are going to be used frequently in our discussion and the development of the framework of optimization for various applications.

Thank you.


Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 03
Positive Semidefinite (PSD) and Positive Definite (PD) Matrices and their
Properties

Hello, welcome to another module in this massive open online course. So, we are looking at the mathematical preliminaries for optimization, all right. We have looked at eigenvectors and eigenvalues, and now we will start looking at a different type of matrices, known as positive semidefinite and positive definite matrices.

(Refer Slide Time: 00:35)

So, these are also defined only for square matrices, ok. Now, consider a square matrix A. For the real case, if x bar transpose A x bar is greater than or equal to 0 for all x bar, then A is a positive semidefinite matrix, ok. So, x bar transpose A x bar is greater than or equal to 0. Now, if x bar transpose A x bar is strictly greater than 0 for all nonzero x bar, then A is a positive definite matrix. So, these are the positive semidefinite and positive definite matrices: if x bar transpose A x bar is greater than or equal to 0 for all vectors x bar, then A is positive semidefinite, and if it is strictly greater than 0 for all nonzero x bar, then A is positive definite.

Now, these are for real vectors and real matrices; for complex vectors and matrices, the definition replaces the transpose by the Hermitian, as we will see shortly.

So, we are going to look at the definition and the properties of positive semidefinite as well as positive definite matrices. Positive semidefinite is often abbreviated as PSD, and positive definite as PD. So, there can be positive semidefinite matrices and positive definite matrices, and of course, both of these are defined, once again, only for square matrices, all right. Similar to the concept of eigenvalues and eigenvectors, these are defined only for square matrices, ok.
(Refer Slide Time: 03:45)

Let us take a simple example to understand this. Consider a square matrix A equal to 2, 6, 6, 18; let us consider this 2 cross 2 square matrix.

Now, let us look at x bar transpose A x bar. This is a 2 cross 2 matrix, so the vector x bar will be two-dimensional: x 1, x 2 times 2 6 6 18 into x 1, x 2. When you multiply this out, this will be equal to twice x 1 square plus 18 x 2 square plus 12 times x 1 x 2. And this, you can easily check, will be equal to twice (x 1 plus 3 x 2) square, which is a scaled perfect square and therefore always greater than or equal to 0.

(Refer Slide Time: 06:26)

Now, for complex vectors and matrices, as we have seen before, we have to replace the transpose by the Hermitian. So, x bar Hermitian A x bar greater than or equal to 0 for all x bar implies a positive semidefinite matrix, and further, x bar Hermitian A x bar strictly greater than 0 for all nonzero x bar implies a positive definite matrix. This is the definition for complex matrices and vectors.

(Refer Slide Time: 04:59)

But now, you can see that this is not strictly greater than 0, because twice (x 1 plus 3 x 2) square equals 0 if x 1 equals minus 3 x 2, all right. So, this is only greater than or equal to 0: if x 1 plus 3 x 2 equals 0, then x bar transpose A x bar will be 0; otherwise, it is greater than 0.

So, therefore, in general it is greater than or equal to 0. Hence, the matrix A is positive semidefinite, since x bar transpose A x bar is greater than or equal to 0 for all x bar. Let us now look at a property of positive semidefinite matrices.
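As an illustrative check (not from the lecture), the quadratic form can be verified numerically for a few random vectors; the eigenvalue line at the end anticipates the property discussed next:

```python
import numpy as np

# The example matrix A = [[2, 6], [6, 18]]: x^T A x = 2*(x1 + 3*x2)^2 >= 0.
A = np.array([[2.0, 6.0],
              [6.0, 18.0]])

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.standard_normal(2)
    q = x @ A @ x
    print(np.isclose(q, 2 * (x[0] + 3 * x[1])**2), q >= 0)  # True True

# The form vanishes on x1 = -3*x2, so A is PSD but not PD; consistently,
# its eigenvalues (computed next in the lecture) are 0 and 20.
print(np.linalg.eigvalsh(A))  # [ 0. 20.]
```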
(Refer Slide Time: 07:58)

Let us look at a very interesting property. Now, consider the eigenvalues, since these are square matrices. If A is positive definite, then the eigenvalues lambda i of A, denoting them by lambda i (A), are strictly greater than 0; all the eigenvalues have to be greater than 0. On the other hand, if A is PSD, then the eigenvalues are greater than or equal to 0; that is, some of the eigenvalues can be 0 and the rest of them are greater than 0, ok.

So, this is an interesting point: for a PD matrix the eigenvalues are strictly greater than 0, and for a PSD matrix the eigenvalues are greater than or equal to 0. Again, let us verify this property on the previous example. So, let us look at A.

(Refer Slide Time: 09:36)

Again, let us look at the example A equal to our matrix 2 6 6 18. Now, to calculate the eigenvalues, one has to consider the characteristic polynomial, the determinant of A minus lambda I, which is the determinant of 2 minus lambda, 6, 6, 18 minus lambda, and this can be simplified. The determinant can be simplified as (2 minus lambda) times (18 minus lambda) minus 36 equal to 0, and this is the characteristic equation, ok. Remember the characteristic polynomial and the characteristic equation for the matrix. This implies the following.

(Refer Slide Time: 10:55)

Now, if you simplify this, this implies 36 minus 20 lambda plus lambda square minus 36 equal to 0. This implies lambda square minus 20 lambda equal to 0, that is, lambda square equals 20 lambda, which implies lambda equals 0 or 20. These are the two eigenvalues, ok.

So, what you can see, very interestingly, is that one of the eigenvalues is in fact 0, correct. If you call this lambda 1, then lambda 1 is in fact 0; one of the eigenvalues is 0.

(Refer Slide Time: 12:13)

In fact, you can also see, and we have also checked, that the matrix is PSD, positive semidefinite. Now, for a symmetric matrix, one can also conclude the reverse, that is, not only the forward property but also the converse: for a symmetric matrix, if the eigenvalues are greater than or equal to 0, the matrix is positive semidefinite, and if the eigenvalues are greater than 0, the matrix is positive definite. So, the interesting property is: for a symmetric matrix A, if the eigenvalues lambda i (A) are greater than or equal to 0, then A is positive semidefinite, and if the lambda i (A) are greater than 0, then A is a positive definite matrix.

Let us now continue by looking at another important concept, that of the Gaussian random variable, which we are also going to use frequently in our framework of optimization.

(Refer Slide Time: 14:02)

So, we want to look at the basic concepts of Gaussian Random Variables, which are central to our discussion on optimization. So, what is a Gaussian random variable? Well, X is a Gaussian Random Variable with mean equal to mu and variance equal to sigma square, and it is denoted by the following notation. This is also known as a Normal Random Variable, and it is denoted by the notation script N with mean mu and variance sigma square, ok.
(Refer Slide Time: 15:11)

Many of you might be familiar with the shape of this probability density function: it is given by a bell-shaped curve, with the peak occurring at x equal to mu, that is, the mean, and the spread controlled by the variance sigma square. So, the peak occurs at the mean.

So, the peak shifts with the mean in the Gaussian probability density function, and it is symmetric about the mean. It is a bell-shaped curve, and the spread is controlled by the variance. For instance, if the variance decreases, then the spread decreases and the Gaussian probability density function becomes more and more peaked, that is, more and more concentrated around the mean, ok. So, as the variance decreases, it becomes more and more concentrated around the mean.

(Refer Slide Time: 18:13)

And every random variable has a probability density function; for the Gaussian Random Variable it is given as 1 over the square root of 2 pi sigma square, times e raised to minus (x minus mu) whole square divided by 2 sigma square. What is this? This is basically your PDF, or Probability Density Function, of the Gaussian Random Variable with mean mu and variance equal to sigma square, ok. So, this is the probability density function, and you might also recall its shape.
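As a small illustration (not part of the lecture), the Gaussian PDF can be evaluated in Python; note that scipy's norm.pdf takes the standard deviation, not the variance:

```python
import numpy as np
from scipy.stats import norm

# Evaluating the Gaussian PDF with an arbitrary mean and variance.
mu, sigma = 1.0, 2.0          # variance sigma^2 = 4
x = np.array([-1.0, 1.0, 3.0])

pdf_manual = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)   # scale is the standard deviation

print(np.allclose(pdf_manual, pdf_scipy))      # True
print(pdf_manual)                              # peak value at x = mu is 1/sqrt(2*pi*sigma^2)
```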

(Refer Slide Time: 16:11)

As the variance decreases, it becomes concentrated around the mean. And now, one can define a new random variable X tilde by subtracting the mean and dividing by the square root of the variance, that is, by sigma, which is the standard deviation, ok. So, you can see that this X tilde is derived by subtracting the mean and dividing by the standard deviation sigma. Now, this X tilde is also a Gaussian R V, and there is something interesting about X tilde: the mean of X tilde will be 0, that is, the expected value of X tilde is 0.

And the variance, that is, the expected value of (X tilde minus mu) square, or simply X tilde square in this case since the mean is 0, is equal to 1. So, the mean equals 0 and X tilde is a Gaussian R V with mean 0 and variance unity, ok. So, this is a Gaussian R V with mean equal to 0 and variance equal to 1.

(Refer Slide Time: 19:22)

Such a Gaussian R V, with mean equal to 0 and variance equal to 1, is termed the standard normal random variable, ok. And the standard normal random variable is used to define what is known as the Q function.

(Refer Slide Time: 20:04)

Now, the probability density function of the standard normal random variable is also simple to derive, because it has mean mu equal to 0 and variance 1. So, you substitute mu equal to 0 and sigma equal to 1 in the earlier expression for the probability density function of the Gaussian Random Variable, and what you get is the PDF of this Standard Gaussian Random Variable, given as 1 over the square root of 2 pi, since sigma square equals 1, times e raised to minus X tilde square divided by 2; remember mu is 0, so X tilde minus mu is simply X tilde, and once again sigma square equals 1. This is the PDF of the Standard Normal, and one can define the Q function from the PDF of the Standard Normal: the probability that this Standard Normal Gaussian Random Variable X tilde is greater than or equal to a quantity x naught is denoted by Q of x naught. This is also termed the Q function, or the Gaussian Q function.

(Refer Slide Time: 21:16)

This is the probability that the standard normal variable X tilde is greater than or equal to x naught, which is given by the integral from x naught to infinity of the probability density function of the Standard Normal Variable, 1 over square root of 2 pi e raised to minus x tilde square divided by 2, d x tilde. This is also equal to the probability that X tilde belongs to the interval x naught to infinity.

That is, basically, what we are asking is the question: what is the probability that X tilde, the Standard Normal Random Variable, takes a value greater than or equal to x naught, or basically, that it lies in the interval x naught to infinity. And remember, the probability that a random variable lies in a particular interval is given by the integral of the probability density function of the random variable over that interval. So therefore, this probability is given by the integral from x naught to infinity of 1 over square root of 2 pi e raised to minus X tilde square divided by 2, d X tilde. And what does it denote?

(Refer Slide Time: 22:55)

I have already told you: it denotes the probability that this Gaussian Random Variable with mean 0 and variance equal to 1 is greater than this quantity x naught. So, it denotes the probability of, or the area under the PDF to the right of, x naught. This is also known as the tail probability of the Standard Normal Random Variable.

This is the probability under the tail starting from x naught, and it is also known as the Complementary Cumulative Distribution Function. The Cumulative Distribution Function gives the probability that the random variable takes values less than x naught; the complement of that, or 1 minus the CDF, gives the probability that it is greater than or equal to x naught. This is therefore known as the Complementary Cumulative Distribution Function, or the CCDF, of the Standard Normal Random Variable.
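As an aside (not from the lecture), the Q function is available numerically as the survival function of the standard normal, or via the complementary error function:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import erfc

# The Gaussian Q function is the tail probability of the standard normal,
# i.e. the complementary CDF; it also equals 0.5*erfc(x0/sqrt(2)).
x0 = np.array([0.0, 1.0, 2.0])

q_ccdf = norm.sf(x0)                    # sf = survival function = 1 - CDF
q_erfc = 0.5 * erfc(x0 / np.sqrt(2.0))

print(np.allclose(q_ccdf, q_erfc))      # True
print(q_ccdf)                           # [0.5, 0.1586..., 0.0227...]
```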

Now, let us come to a Multivariate Gaussian Random Variable, or a Gaussian Random Vector: a Gaussian Random Vector with multiple components, each of them individually Gaussian and all of them being jointly Gaussian, ok. So, we will talk about a Multivariate Gaussian R V.

(Refer Slide Time: 25:01)

Now, a Multivariate Gaussian R V is given by x bar equal to x 1, x 2, up to x n, and this is a Gaussian random vector x bar. So, this has n components, and each of the components is going to have a mean.

(Refer Slide Time: 25:54)

So, the mean is going to be a vector, that is, the expected value of x bar equals the expected value of each of the components. So, this is going to be an n-dimensional vector, which we will denote as follows: let the mean of the various components be mu 1, mu 2, up to mu n. So, this is basically your mean vector. So, the mean is going to be a vector.

So, you can also call this the mean vector of the Multivariate Gaussian random variable. And further, instead of the variance, we will have the covariance matrix, which captures the variance of each component and also the covariances, that is, the cross-correlations corresponding to each pair of elements of this random vector. The covariance matrix is defined as follows.

(Refer Slide Time: 27:08)

That is, R equals the expected value of (x bar minus mu bar) into (x bar minus mu bar) transpose, and remember, this is termed the covariance matrix; it is an n cross n matrix.

The probability density function of the Multivariate Gaussian Random Variable: you can also denote this again by script N, that is, you can denote it as a Gaussian Random Variable with mean vector mu bar and covariance matrix R, and the probability density function is given as 1 over the square root of (2 pi raised to the power of n times the determinant of the covariance matrix R), times e raised to minus half (x bar minus mu bar) transpose R inverse (x bar minus mu bar). And this is basically your PDF of the Multivariate Gaussian Random Vector with mean equal to mu bar and covariance matrix equal to R, all right; that is the expression for the PDF of the Multivariate Gaussian Random Vector.

Let us now look at an example to understand this, and let us look at an interesting special case of this Multivariate Gaussian Random Vector, which is when the different components of this Gaussian Random Vector are independent.
(Refer Slide Time: 30:04)

So, consider a Multivariate Gaussian with the expected value of x bar equal to mu bar equal to the 0 vector, and the expected value of x i into x j, for the different components, equal to 0 if i is not equal to j and sigma square if i equals j, alright. So, what we are seeing is that the cross correlation, the expected value of x i x j for i not equal to j, is 0, all right, and the variance of each element is sigma square. So, basically, these are known as uncorrelated random variables, because the correlation is 0.

So, these are uncorrelated Gaussian Random Variables: the cross covariance is 0. These are uncorrelated Gaussian Random Variables, ok.

(Refer Slide Time: 31:50)

And now, if you compute the covariance matrix of this, that will be given as, well, we have already seen, the expected value of (x bar minus mu bar) into (x bar minus mu bar) transpose. Now, mu bar is 0, so this will simply be the expected value of x bar into x bar transpose, which, now, let me write in terms of its vector components.

(Refer Slide Time: 32:17)
This is going to be the expected value of x bar, which is the vector x 1, x 2, ..., x n, into its transpose, which is the row vector; remember, the transpose of a column vector is a row vector. And now, once you compute this, what you will observe is that you will have diagonal entries such as x 1 square, x 2 square, and off-diagonal entries such as x 1 x 2, x 2 x 1, and so on.

And now, if you look at the expected value of this matrix: the expected value of each element on the diagonal, x 1 square, x 2 square, x 3 square and so on, is sigma square, while the expected values of the off-diagonal entries x 1 x 2, x 2 x 1 and so on are 0, because we have considered the random variables to be uncorrelated. And therefore, the covariance matrix, you can see, basically reduces to sigma square times the identity.

(Refer Slide Time: 33:29)

All the diagonal elements are sigma square and the off-diagonal elements are 0, and therefore this is sigma square times the identity.

And now, once you compute this, our covariance matrix equals sigma square times the identity; remember, this is our covariance matrix. And therefore, the multivariate Gaussian Probability Density Function is given as follows. Now, if you look at the determinant: R is sigma square times the identity, so the determinant of R is (sigma square) raised to the power of n, which is sigma to the power of 2 n. And therefore, the probability density function is, let me just write it separately, 1 over the square root of (2 pi raised to the power of n times sigma raised to the power of 2 n), that is, the determinant, times e raised to minus half (x bar minus mu bar) transpose, that is, x bar transpose, R inverse x bar; since R is sigma square times the identity, R inverse is the identity divided by sigma square.
(Refer Slide Time: 35:19)

Now, you can simplify this as 1 over (2 pi sigma square) raised to the power of n by 2, times e raised to minus 1 over 2 sigma square times x bar transpose x bar, since x bar transpose identity x bar is x bar transpose x bar. But recall that x bar transpose x bar equals norm of x bar square, which is also equal to x 1 square plus x 2 square plus ... plus x n square for a real vector x bar. And therefore, this is also equal to 1 over (2 pi sigma square) to the power of n by 2, times e raised to minus 1 over 2 sigma square times norm x bar square.

(Refer Slide Time: 37:06)

And this is also equal to 1 over (2 pi sigma square) to the power of n by 2, times e raised to minus 1 over 2 sigma square times the summation from i equal to 1 to n of x i square. And interestingly, you can also write this as the product from i equal to 1 to n of 1 over the square root of 2 pi sigma square, times e raised to minus x i square divided by 2 sigma square. And this is the product symbol, similar to the summation symbol.

(Refer Slide Time: 36:11)

This is your product symbol, the product from i equal to 1 to n. And now, you see, these are the individual PDFs, the individual Gaussian PDFs of the various random variables X i with mean equal to 0 and variance equal to sigma square. And therefore, what you are seeing is that when these component Gaussian Random Variables are uncorrelated, the joint Gaussian, that is, the Multivariate Gaussian PDF, equals the product of the individual PDFs.

So, this means that these random variables are not only uncorrelated, but they are also independent, and this is a unique property of the Gaussian Random Variable: if two Gaussian Random Variables are uncorrelated, they are also independent. This is not true for a general random variable; it is an interesting property that is applicable only to Gaussian Random Variables.
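A quick numerical illustration of this factorization (not part of the lecture), using an arbitrary dimension and sigma:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# With mean zero and covariance sigma^2 * I, the multivariate Gaussian PDF
# factors into the product of the individual N(0, sigma^2) PDFs.
n, sigma = 3, 1.5
mu = np.zeros(n)
R = sigma**2 * np.eye(n)

rng = np.random.default_rng(1)
x = rng.standard_normal(n)

joint = multivariate_normal(mean=mu, cov=R).pdf(x)
product_of_marginals = np.prod(norm.pdf(x, loc=0.0, scale=sigma))

print(np.allclose(joint, product_of_marginals))  # True
```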
(Refer Slide Time: 38:15)

So, this implies that the Gaussian R V's are also independent, ok. So, for Gaussian R V's, uncorrelated implies that they are independent. However, this is not true for a general random variable, and this is the important property. The other way round is always true: for any random variables, if they are independent, then they are going to be uncorrelated.

However, it is only for Gaussian Random Variables that uncorrelated implies independent; this is not true for a general random variable, all right. So, this small example illustrates this interesting property of the Multivariate Gaussian Random Vector, alright.

So, we will stop here and continue with other aspects in the subsequent modules.

Thank you.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 04
Inner Product Space and its Properties: Linearity, Symmetry and Positive Semi-definite

Hello, welcome to another module in this massive open online course. We are looking at the mathematical preliminaries for optimization; let us continue our discussion with another concept, namely an inner product space, ok.

(Refer Slide Time: 00:30)

So, we want to start looking at the concept of what is known as an inner product space. Now, what is an inner product space? The inner product of a real vector space is an assignment of a real number.

(Refer Slide Time: 01:50)


So, this basically is the linearity property: the inner product of the linear combination a u bar plus b v bar with the vector w bar is the corresponding linear combination of the inner products, a times the inner product of u bar with w bar plus b times the inner product of v bar with w bar; this is the first property. Then we have the symmetric property, or symmetry, which is very simple: the inner product of u bar with v bar equals the inner product of v bar with u bar. Then, if you call these property number 1, linearity, and property number 2, symmetry, we have the positive semidefinite property.

(Refer Slide Time: 05:07)

For any 2 vectors u bar, v bar, it is denoted by this notation: the inner product of u bar and v bar. This is defined for any 2 vectors u bar, v bar, in this case real vectors, and the inner product is a real number in the case of a real vector space, which satisfies the following properties.

(Refer Slide Time: 02:32)

The positive semidefinite property: it must be the case that for any u bar, that is, for any u bar that is an element of the vector space V, the inner product of u bar with itself is greater than or equal to 0. And, more importantly, the inner product of u bar with itself equals 0 if and only if u bar is itself 0. So, the inner product is 0 if and only if the vector u bar is 0.

So, this is the definition: it is an assignment of a real number, for a real vector space, that satisfies the linearity, symmetry and positive semidefinite properties, ok. Let us look at a simple example to understand this: for instance, let us consider the standard dot product between 2 vectors, alright.
The inner product satisfies the following properties. The first property is linearity: the inner product of a linear combination a u bar plus b v bar with w bar is the corresponding linear combination of the inner products, that is, a times the inner product of u bar with w bar plus b times the inner product of v bar with w bar, ok.
(Refer Slide Time: 06:43)

Consider 2 vectors; the dot product is defined between 2 vectors. Consider u bar equal to u 1, u 2, up to u n and v bar equal to v 1, v 2, up to v n. Now, these are real n-dimensional vectors, that is, you can say they belong to the space of n-dimensional real vectors, denoted by the bold R, R to the power n. This is also termed the Euclidean n space.

(Refer Slide Time: 08:10)

And the inner product in this Euclidean n space between vectors u bar and v bar is defined as u bar transpose v bar, which is basically, if you look at it, u 1, u 2, ..., u n, a row vector, times the column vector v 1, v 2, ..., v n, which is basically u 1 v 1 plus u 2 v 2 plus ... plus u n v n, ok. So, this is the definition of the dot product, ok. So, this is the dot product between 2 n-dimensional real vectors.

(Refer Slide Time: 09:16)

This is something that you might well have seen in your vector algebra or vector calculus class in high school, all right. This is the standard dot product, which is also denoted by the dot operator, that is, u bar dot v bar, ok. This is the dot product between 2 vectors, in fact, more specifically between real n-dimensional vectors. Now we will show that the dot product is an inner product.

Now, that is very simple to see. First, let us look at the linearity property. If you look at (a u bar plus b v bar) dot product with w bar, that is simply (a u bar plus b v bar) transpose times w bar, as we have seen; that is the definition of the dot product.
Which is a times u bar transpose w bar plus b times v bar transpose w bar, which is a times the inner product of u bar with w bar plus b times the inner product of v bar with w bar, ok; so it satisfies the linearity property. Now, coming to the second property, symmetry: we have u bar dot product v bar, which is equal to u bar transpose v bar, which is basically the same as v bar transpose u bar, which is v 1 u 1 plus v 2 u 2 plus ... plus v n u n, which is the inner product of v bar with u bar. And third:

(Refer Slide Time: 10:45)

So, it satisfies the symmetric property. Now we have to look at positive semidefiniteness. For the positive semidefinite property, you can see that the inner product of u bar with itself is u bar transpose u bar, which is u 1 square plus u 2 square plus ... plus u n square, which is greater than or equal to 0. Further, in fact, we have seen this is nothing but what is used to define the l 2 norm square, norm u bar square, which is greater than or equal to 0, and in fact equal to 0 if and only if the sum of the squares is 0, which means each of the components is 0, u 1 equals u 2 equals ... equals u n equals 0, which means u bar equals 0.

So, it is positive semidefinite, that is, the inner product of u bar with itself is always greater than or equal to 0, and it is 0 only when the vector u bar is identically 0, all right. And therefore, the standard dot product is an inner product. In fact, this is also termed the standard inner product on R n, that is, the Euclidean n space, or the n-dimensional space of real vectors; on the Euclidean n space, the dot product is an inner product.
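As a numerical aside (not from the lecture), these three properties of the dot product can be spot-checked with random vectors:

```python
import numpy as np

# Numerical spot-check that the dot product behaves like an inner product:
# linearity in the first argument, symmetry, and positive semidefiniteness.
rng = np.random.default_rng(2)
u, v, w = rng.standard_normal((3, 4))
a, b = 2.0, -3.0

# Linearity: <a*u + b*v, w> = a*<u, w> + b*<v, w>
print(np.isclose((a*u + b*v) @ w, a*(u @ w) + b*(v @ w)))  # True

# Symmetry: <u, v> = <v, u>
print(np.isclose(u @ v, v @ u))                            # True

# Positive semidefiniteness: <u, u> = ||u||^2 >= 0, and 0 only for u = 0
print(u @ u >= 0, np.zeros(4) @ np.zeros(4) == 0)          # True True
```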

(Refer Slide Time: 11:54)

Let us now consider another example, for 2-dimensional vectors, for instance.


(Refer Slide Time: 15:01)

We have x bar equal to x 1, x 2 and y bar equal to y 1, y 2. So, both of these are basically 2-d vectors, that is, they belong to the 2-dimensional Euclidean space. And let us define the assignment (x bar, y bar) for these two 2-d vectors as twice x 1 y 1 minus x 1 y 2 minus x 2 y 1 plus 5 x 2 y 2. Now, what we want to do is show that this assignment is a valid inner product, and this can be shown as follows; as usual, we start with the linearity property.

(Refer Slide Time: 16:16)

Now, let us consider a vector a x bar plus b x tilde, that is, a times x 1, x 2, the 2-dimensional vector, plus b times x 1 tilde, x 2 tilde, all right. So, we are taking a linear combination of the 2 vectors x bar and x tilde. Now, for this to be an inner product, let us consider a x bar plus b x tilde, the inner product with any vector y bar.

(Refer Slide Time: 17:12)

Now, this can be shown as follows: we can see this is equal to twice (a x 1 plus b x 1 tilde) into y 1, minus (a x 1 plus b x 1 tilde) into y 2, minus (a x 2 plus b x 2 tilde) into y 1, plus 5 times (a x 2 plus b x 2 tilde) into y 2. And now you can clearly see this is linear; that is, if you separate the terms, you can easily see that this equals a times the inner product of x bar with y bar plus b times the inner product of x tilde with y bar, and therefore, it is linear, ok. So, this implies that it is linear, alright. So, we have shown that the assignment is linear; now let us look at symmetry, the symmetric property, and this can be shown as follows.
(Refer Slide Time: 18:51)

Coming now to the aspect of symmetry, you can see that x bar comma y bar, this is equal to twice x 1 y 1 minus x 1 y 2 minus x 2 y 1 plus 5 x 2 y 2, ok. And now this can also be written, without any effort, interchanging the factors in each term, as twice y 1 x 1 minus y 1 x 2 minus y 2 x 1 plus 5 y 2 x 2; that is, x 1 y 2 I will write as y 2 x 1 and x 2 y 1 as y 1 x 2.

(Refer Slide Time: 20:08)

And you can readily see that this is nothing but the assignment of y bar comma x bar. So, we have x bar comma y bar equals y bar comma x bar, and hence this implies that it satisfies the symmetry property. And finally, now coming to the positive semi definite property.

(Refer Slide Time: 20:36)

Coming now to the positive semi definite property, if you look at x bar comma x bar, that will be twice x 1 square minus 2 x 1 x 2 plus 5 x 2 square, which I can write as the sum of 2 terms: x 1 square plus x 2 square plus 2 x 1 x 2, plus x 1 square plus 4 x 2 square minus 4 x 1 x 2.

(Refer Slide Time: 21:51)

And this is equal to x 1 plus x 2 whole square plus x 1 minus 2 x 2 whole square, which is greater than or equal to 0. So, it is a sum of 2 perfect squares, and this is greater than or equal to 0; so this satisfies the positive semi definite property. And you can see this is equal to 0 only if both squares are 0, because it is a sum of perfect squares: x 1 plus x 2 must be 0 and x 1 minus 2 x 2 must also be 0, and this implies that x 1 equals x 2 equals 0. So, it is only 0 when x 1 equals x 2 equals 0, which implies x bar equals 0. So, we have the PSD property, which is that x bar comma x bar, the assignment, is greater than or equal to 0 and equal to 0 only if x bar equals 0.

(Refer Slide Time: 23:16)

That is what we have shown; therefore it also satisfies the PSD property. Hence, because it satisfies the linearity, symmetry and PSD properties, this is a valid inner product, all right. So, finally, what we have is that x bar comma y bar equals 2 x 1 y 1 minus x 1 y 2 minus x 2 y 1 plus 5 x 2 y 2, and we can now claim that this is an inner product. And this is not a coincidence. In fact, such an inner product can be constructed by observing the following: I can write this as follows.

(Refer Slide Time: 24:15)

This is equal to the row vector (x 1, x 2) times the matrix with rows (2, minus 1) and (minus 1, 5) times the column vector (y 1, y 2), which is nothing but x bar transpose A y bar, where A is the 2 cross 2 matrix given as 2, minus 1, minus 1, 5. You can see that this is identical to, this is basically another way of writing, the inner product that we have just defined. And now you can show a very interesting property of this matrix: in fact, this matrix A is a positive definite matrix.
And you can see this as follows. First see that this matrix is symmetric: we have A equals A transpose. Further, look at the eigenvalues; that is, if you set the determinant of A minus lambda I equal to 0, then what we have is the determinant of the matrix with rows (2 minus lambda, minus 1) and (minus 1, 5 minus lambda) equal to 0. This implies (2 minus lambda) into (5 minus lambda) minus 1 equal to 0, that is, lambda square minus 7 lambda plus 10 minus 1 equal to 0.

(Refer Slide Time: 25:24)

This implies lambda square minus 7 lambda plus 9 equal to 0, which implies lambda equals 7 plus or minus the square root of 49 minus 36, that is, 7 plus or minus square root of 13, divided by 2. And you can see both the eigenvalues are greater than 0, all right; it has 2 eigenvalues and the eigenvalues are strictly greater than 0. So, symmetric plus eigenvalues greater than 0 implies A is positive definite, ok. So, A is a positive definite matrix.

So, A is a positive definite matrix. Now, in fact, there is an interesting property, and you can easily show this: if we define an inner product between x bar and y bar as x bar transpose A y bar, where A is a symmetric positive definite matrix, then this is an inner product.
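
As a small numerical sketch of this calculation (NumPy; the checks simply restate what was computed above), one can confirm that A is symmetric, that its eigenvalues are (7 plus or minus square root of 13) by 2, and that x bar transpose A x bar stays positive for nonzero x bar:

import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 5.0]])

print(np.allclose(A, A.T))                           # symmetric: A = A^T
print(np.linalg.eigvalsh(A))                         # eigenvalues of A
print((7 - np.sqrt(13)) / 2, (7 + np.sqrt(13)) / 2)  # (7 -/+ sqrt(13))/2, both positive

rng = np.random.default_rng(1)
for _ in range(5):
    x = rng.standard_normal(2)                       # an arbitrary nonzero vector
    print(x @ A @ x > 0)                             # <x, x> = x^T A x > 0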

(Refer Slide Time: 26:45)

This is an inner product: x bar transpose A y bar, where A is a symmetric positive definite matrix, is an inner product, all right. And one of the other interesting aspects of the inner product is that it can also be used to define a norm; that is one of the most interesting and important aspects of the inner product. It induces a norm, and that norm is given as follows.

(Refer Slide Time: 29:43)

So, the inner product can be used to define a norm; the norm can be defined using this concept of the inner product. And in fact, it can be defined as follows: we have norm of u bar square, and this is equal to the inner product of u bar with itself.

(Refer Slide Time: 30:24)

Which means, basically, the norm of the vector u bar is equal to the square root of the inner product of u bar with itself. And in fact, the unit norm vector can now be defined as u hat equals u bar divided by norm of u bar, that is, u bar divided by the square root of the inner product of u bar with itself. This process is also termed as normalization; that is, when you divide a vector by its norm, you can say that the vector is normalized. And you can see this is true for the standard inner product on R n, that is, if you look at x bar comma x bar, where x bar belongs to R n and this is the standard inner product.

(Refer Slide Time: 31:49)

We have already seen that the inner product of x bar with itself is basically x 1 square plus x 2 square plus so on up to x n square, which is nothing but the l 2 norm square, ok. And therefore, this is the square of the norm, and we have already seen this for the standard inner product. What this result says is that not just the standard inner product, which is given by the dot product of 2 vectors, but any inner product on these 2 vectors x bar, y bar can be used to define a norm; that is, the norm is given as the square root of the inner product of x bar with itself. For instance, in the previous example we have seen x bar transpose A y bar, which means the norm of x bar under that inner product is given as the square root of x bar transpose A x bar, ok. So, the norm of x bar is equal to the square root of x bar transpose A x bar, and this is for the previous example. Let us look at other examples of inner products.
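
As a quick sketch of this induced norm and of normalization (NumPy; the vector below is an arbitrary choice), using the positive definite matrix A from the previous example:

import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 5.0]])             # the symmetric positive definite matrix from the example

def a_norm(x):
    # norm induced by the inner product <x, y> = x^T A y
    return np.sqrt(x @ A @ x)

x = np.array([3.0, -1.0])               # an arbitrary vector
x_hat = x / a_norm(x)                   # normalization: divide the vector by its norm
print(a_norm(x), a_norm(x_hat))         # the normalized vector has unit norm under this inner product

# with A equal to the identity, this reduces to the usual l 2 norm
print(np.isclose(np.sqrt(x @ x), np.linalg.norm(x)))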
(Refer Slide Time: 33:17)

So, we have some other examples. We have already seen u bar transpose v bar, that is the standard inner product on R n; this is an inner product, this we have already seen. Now, another interesting application of the inner product: let us consider the space of continuous functions on an interval a comma b, denoted by C of a comma b, that is, the set of continuous functions on the interval a comma b. And let us say we have 2 functions F comma g which belong to C of a comma b, that is, they are continuous functions on the interval a comma b.

(Refer Slide Time: 34:41)

Then the assignment defined as F comma g equals the integral over the interval a comma b of F of x g of x d x is an inner product, and this is a very interesting application. In fact, this can now be used to define a norm. So, this is an inner product for functions F comma g, and in fact, the norm that arises is basically nothing but norm of F square equals the inner product of F with itself, that is, the integral on the interval a comma b of F square of x d x, and this is nothing but the energy of the signal, that is, if you look at this as a signal in time.

(Refer Slide Time: 35:56)

If I think of this as a signal in time, this is basically the energy of the signal in the interval a comma b, all right. So, basically, one can also define an inner product on the space of continuous functions, and therefore define the norm as well as the norm square. In fact, the norm square of the function, or the signal, is the energy of the signal in that particular interval.
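
One way to get a feel for this inner product on functions is a small numerical sketch: approximate the integral of F of x times g of x over the interval by a simple quadrature, and read off the induced norm square as the energy (the functions and the interval below are arbitrary choices):

import numpy as np

a, b = 0.0, 2.0 * np.pi                      # the interval [a, b]
x = np.linspace(a, b, 10001)

def inner(f, g):
    # <f, g> = integral over [a, b] of f(x) g(x) dx, approximated by a Riemann sum
    dx = (b - a) / (len(x) - 1)
    return np.sum(f(x) * g(x)) * dx

f, g = np.sin, np.cos
print(inner(f, g))                           # approximately 0: sin and cos are orthogonal on this interval
print(inner(f, f))                           # ||f||^2, the energy of the signal, here approximately pi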
Another interesting example of this inner product space: consider the space of m cross n matrices.

(Refer Slide Time: 36:58)

So, we have the space of m cross n matrices; for example, m equal to 3, n equal to 2 implies we have 3 cross 2 matrices. And for instance, the 3 cross 2 matrix A equals a 1 1, a 1 2, a 2 1, a 2 2, a 3 1, a 3 2, and we have the matrix B equals b 1 1, b 1 2, b 2 1, b 2 2, b 3 1, b 3 2. And the inner product A comma B is defined as the trace of B transpose A; now, this involves an interesting concept, that is, the trace of a square matrix.

(Refer Slide Time: 38:21)

This is the trace of a square matrix: the trace operator is defined as the sum of the diagonal elements of a square matrix. And this assignment, the trace of B transpose A, can be shown to be an inner product; this can also be shown to be an inner product, all right. So, what we have done in this module is we have looked at the inner product, its definition, the various properties, when an assignment is an inner product, and several examples. We will continue this discussion in the next module.

Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 05
Inner Product Space and its Properties: Cauchy Schwarz Inequality

Welcome to another module in this massive open online course. So, we are looking at the inner product of matrices, and we have defined the inner product of two matrices.

(Refer Slide Time: 00:24)

In fact, we defined the inner product of two matrices of the same size, A comma B, as the trace of B transpose A. In fact, this can be seen as follows: for our 3 cross 2 case, considering 3 cross 2 matrices, I can write B as two columns b 1 bar, b 2 bar, because remember a 3 cross 2 matrix has two columns, each of size 3, and the matrix A, just for the purpose of illustration, as a 1 bar, the first column, and a 2 bar, the second column. And if you write B transpose into A, that will be, well, the transpose of B — remember, columns become rows — so the first row will become b 1 bar transpose and the second row will become b 2 bar transpose, times A, which is the first column a 1 bar and the second column a 2 bar.

(Refer Slide Time: 01:54)

And now look at this: that will be a 2 cross 2 matrix. So, the first entry will be b 1 bar transpose a 1 bar, then b 1 bar transpose a 2 bar, b 2 bar transpose a 1 bar and b 2 bar transpose a 2 bar, and you can check the various entries in this 2 cross 2 matrix; the first entry, for instance, will be b 1 1 a 1 1 plus b 2 1 a 2 1 plus b 3 1 a 3 1.

(Refer Slide Time: 02:21)

And therefore, now you can check the various entries, but if you take the trace of this B transpose A, that will be b 1 bar transpose a 1 bar plus b 2 bar transpose a 2 bar, which will be nothing but — b 1 bar transpose, if you look from above, is (b 1 1, b 2 1, b 3 1) — (b 1 1, b 2 1, b 3 1) into a 1 bar, that is (a 1 1, a 2 1, a 3 1), plus b 2 bar, that is (b 1 2, b 2 2, b 3 2), times (a 1 2, a 2 2, a 3 2).

(Refer Slide Time: 03:45)

And this is equal to, well, if you simplify this, what you will get is b 1 1 a 1 1 plus b 2 1 a 2 1 plus b 3 1 a 3 1 plus b 1 2 a 1 2 plus b 2 2 a 2 2 plus b 3 2 a 3 2, and well, this is the inner product; this is your inner product. This simple example illustrated it for the 3 cross 2 case; in general, this can be generalized for m cross n matrices.

(Refer Slide Time: 04:47)

So, for instance, we can have A an element of the set of real m cross n matrices, and further B also belongs to this set; these are two m cross n matrices. Then the trace of B transpose A will be the sum of the diagonal elements of B transpose A. You can look at this: the columns become rows, and m cross n matrices will have n columns of m elements each. So, this will be b 1 bar transpose, b 2 bar transpose, up to b n bar transpose, times a 1 bar, a 2 bar, up to a n bar, and if you take the trace of that, that will be a summation.
(Refer Slide Time: 06:04)

That is, the summation over i equals 1 to n of b i bar transpose a i bar, and remember, n equals the number of columns.

(Refer Slide Time: 06:29)

The various quantities here: each b i bar equals the i-th column of matrix B, and a i bar equals the i-th column of matrix A. And now, let us come to the norm that is induced by this inner product. Remember, every inner product induces a norm, all right, and the norm induced by this inner product can be used to define the norm of the matrix, that is, norm A square is the inner product of A comma A, that is, basically your trace of A transpose A.

(Refer Slide Time: 07:37)

Which is equal to, and you can see this from the expression above, the summation of i equals 1 to n of a i bar transpose a i bar. Now remember, a i bar transpose a i bar is nothing but the norm of a i bar square, so this is the sum of the squares of the norms of all the columns, which is basically norm of a 1 bar square plus norm of a 2 bar square plus so on up to norm of a n bar square. The sum of squares of the norms of all columns is also equal to, you can say, the sum of the magnitude squared of all the elements; you can check clearly that this is the sum of the magnitude squared of all the elements of A. And this is termed as the Frobenius norm.
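
A short NumPy sketch of this matrix inner product and the norm it induces (the 3 cross 2 matrices below are arbitrary random choices):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))            # two arbitrary 3 x 2 matrices
B = rng.standard_normal((3, 2))

inner = np.trace(B.T @ A)                  # <A, B> = trace(B^T A)
print(np.isclose(inner, np.sum(A * B)))    # equals the sum of products of corresponding elements

# induced norm: ||A||^2 = <A, A> = trace(A^T A) = sum of the squared column norms
norm_sq = np.trace(A.T @ A)
print(np.isclose(norm_sq, np.sum(np.linalg.norm(A, axis=0)**2)))
print(np.isclose(np.sqrt(norm_sq), np.linalg.norm(A, 'fro')))   # the Frobenius norm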
(Refer Slide Time: 09:00)

In fact, this is termed as the matrix Frobenius norm square. So, this F denotes the Frobenius norm; this is the matrix Frobenius norm, which is basically nothing but the sum of the norm squared of all the columns of the matrix, or the sum of the magnitude squared of all the elements of the matrix. And this is the norm that is induced by this particular definition of the inner product corresponding to matrices. The next important aspect is what is known as the Cauchy Schwarz inequality; that is, the inner product satisfies the Cauchy Schwarz inequality.

(Refer Slide Time: 10:13)

This is a fundamental property and arises frequently; the Cauchy Schwarz inequality is a very fundamental property of an inner product. And what it states is that u bar comma v bar square, that is, the square of the inner product of u bar and v bar, is less than or equal to the inner product of u bar with itself times the inner product of v bar with itself, which is basically nothing but the norm of u bar square into the norm of v bar square. Which basically implies that the magnitude of the inner product of u bar, v bar is less than or equal to norm of u bar times norm of v bar; this is the Cauchy Schwarz inequality.

(Refer Slide Time: 11:59)

This is valid for any inner product; that is, it can be the inner product for vectors, or the inner product for the functions that we had defined previously — correct, we had considered continuous functions on the interval a to b, and it is valid for that inner product — and it is also valid for the inner product of the m cross n matrices that we have defined above, and so on. So, this is valid for any general definition of the inner product.
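
Before going through the proof, here is a quick numerical sketch of the inequality for two of the inner products we have just seen (random vectors and random 3 cross 2 matrices; the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(3)

# standard inner product on R^n
u, v = rng.standard_normal((2, 5))
print(abs(u @ v) <= np.linalg.norm(u) * np.linalg.norm(v))

# trace inner product on 3 x 2 matrices, together with the Frobenius norm it induces
A, B = rng.standard_normal((2, 3, 2))
lhs = abs(np.trace(B.T @ A))
rhs = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro')
print(lhs <= rhs)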
(Refer Slide Time: 12:41)

And now this can be proved as follows; the Cauchy Schwarz inequality can be proved as follows. I consider a function y of t, which is a function of a parameter t and which involves any two vectors u bar, v bar: I consider the inner product of u bar plus t v bar with itself. So, this is the inner product formed with the vector u bar plus t v bar, and I am considering its inner product with itself. So, this t is a parameter, and now remember, this is the inner product of a vector with itself, the vector u bar plus t v bar; so this is always greater than or equal to 0, and it is 0 only if u bar plus t v bar equals 0. And now this can be simplified as the inner product of u bar with u bar plus t v bar, plus the inner product of t v bar with u bar plus t v bar, which is basically t times the inner product of v bar with u bar plus t v bar. Let me write one more step to simplify this: it is the inner product of u bar with u bar, plus t times the inner product of u bar with v bar, plus t times the inner product of v bar with u bar, plus t square times the inner product of v bar with v bar. And now you can see that the inner product of v bar with u bar is nothing but the inner product of u bar with v bar.

(Refer Slide Time: 15:07)

And therefore, this is equal to the inner product of u bar comma u bar, plus twice t times the inner product of u bar with v bar, plus t square times the inner product of v bar with v bar, which, if you view this as a function of t for fixed vectors u bar and v bar, is a quadratic expression in t. So, this is quadratic in t; now observe that this quadratic in t is always greater than or equal to 0 for all values of t, right, because, as we have seen, this is the inner product of u bar plus t v bar with itself, and the inner product of a vector with itself is always greater than or equal to 0, all right.

So, this quadratic in t is greater than or equal to 0 for all values of t, and this holds true only when the discriminant of the quadratic equation is less than or equal to 0. This implies, if you calculate the discriminant of this quadratic, that b square minus 4 a c is less than or equal to 0, where, for this quadratic, the coefficient of t square is your a, the coefficient of t is your b, and the constant term is your c.

So, we have b square minus 4 a c less than or equal to 0, which means — in fact with a factor of 4, because b is twice the inner product of u bar comma v bar — we can write it as follows.
(Refer Slide Time: 17:44)

So, 4 times u bar comma v bar square minus 4 into u bar comma u bar into v bar comma v bar is less than or equal to 0. This implies u bar comma v bar square is less than or equal to u bar comma u bar times v bar comma v bar, which is basically your norm u bar square into norm v bar square. And this implies, now taking the square root, that the magnitude of u bar comma v bar is less than or equal to norm u bar times norm v bar, and this is basically your CS inequality, the Cauchy Schwarz inequality.

So, that basically proves the Cauchy Schwarz inequality, which states the important property that the magnitude of the inner product between two vectors u bar, v bar is less than or equal to the product of the norms, that is, norm of u bar into norm of v bar, all right. You might have also seen this in the context of vectors in high school vector calculus, which states the following.

(Refer Slide Time: 19:32)

For the dot product, if you look at the magnitude of the dot product, the magnitude of u bar dot v bar is less than or equal to the product of the norms of the two vectors, that is, norm u bar into norm v bar. So, this is for your standard dot product, all right. And in fact, from the above you can also conclude that minus norm of u bar into norm of v bar is less than or equal to u bar comma v bar, which is less than or equal to norm u bar into norm v bar; which implies, dividing throughout by norm u bar into norm v bar, that minus 1 is less than or equal to u bar comma v bar divided by norm u bar into norm v bar, which is less than or equal to 1. And this quantity, look at this, lies between minus 1 and 1.
(Refer Slide Time: 20:45)

So, this quantity can be defined as the cosine of an angle, cosine theta, because cosine theta lies between minus 1 and 1, and this is basically defined as cosine theta. So, we have cosine theta equals u bar comma v bar divided by norm u bar into norm v bar, and this is the angle between the two vectors; this is the definition of the angle between the two vectors, all right.

So, now you can also see that if theta equals 90 degrees, that implies the vectors are perpendicular; if the angle between two vectors is 90 degrees, then the vectors u bar and v bar are perpendicular. This implies that cosine theta equals 0, and this implies that the inner product between u bar and v bar equals 0. And therefore, the interesting property is that two vectors are perpendicular, that is, u bar is perpendicular to v bar.

(Refer Slide Time: 22:23)

Remember, this is the perpendicular symbol: u bar is perpendicular to v bar if the inner product u bar comma v bar equals 0, and this is once again a concept that is valid for any general inner product. And therefore, this can once again be used to define perpendicularity in general. So, we know when two vectors are perpendicular, and this can similarly be used to define a notion of perpendicularity for functions as well as matrices; that is, when the inner product between two functions is 0, the functions are perpendicular, or when the inner product between two matrices is 0, then the matrices are perpendicular, and so on. So, this concept of an inner product is a very interesting and powerful concept, which has a large number of applications and yields several interesting insights. So, we will stop here and continue with other aspects in the subsequent modules.
Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 06
Properties of Norm, Gaussian Elimination, Echelon form of matrix

Hello. Welcome to another module in this massive open online course. So, we are looking at the concept of the inner product and its various properties. In particular, we have also looked at the Cauchy-Schwarz inequality, all right.

(Refer Slide Time: 00:25)

So, what we are looking at is basically this concept of the inner product, the notion of an inner product and the various properties of the inner product. And in particular, we have looked at the Cauchy-Schwarz inequality, which states that the inner product of two vectors u bar comma v bar is less than or equal to norm u bar times norm of v bar.

(Refer Slide Time: 01:36)

And in fact, in the derivation of the Cauchy-Schwarz inequality, what you have seen is that we have employed the property of the discriminant: that is, we have looked at the inner product of u bar plus t v bar with itself, and we have said that this inner product is always greater than or equal to 0. This implies the quadratic in t is greater than or equal to 0, which implies the discriminant has to be less than or equal to 0; and the discriminant is nothing but 4 times u bar comma v bar square minus 4 times u bar comma u bar times v bar comma v bar, and this is less than or equal to 0.

(Refer Slide Time: 02:34)
In fact, if the discriminant is 0, all right, that is, b square minus 4 a c equal to 0, this means that the quadratic equation has a unique root.

Now, if the discriminant equals 0, this implies that the quadratic equation in t has a unique root, which implies that at that root — let us denote it by, say, t tilde, ok — we will have the inner product of u bar plus t tilde v bar with u bar plus t tilde v bar equal to 0. All right, at that root, because the quadratic is nothing but the inner product of u bar plus t v bar with itself, ok; so at that value t tilde this inner product vanishes, and we know that the inner product of any vector with itself vanishes only if the vector is 0.

So, this implies that this is possible only if u bar plus t tilde v bar equals 0, which implies u bar equals minus t tilde v bar for some t tilde; you can denote it as some constant k times v bar.

(Refer Slide Time: 04:20)

So, what this shows is that if b square minus 4 a c equals 0, that is, b square equals 4 a c, this implies that u bar comma v bar square equals norm u bar into norm v bar, the whole square. Or, this implies the magnitude of u bar comma v bar equals norm u bar into norm v bar only if there exists a k, that is, the discriminant is 0 only if there exists a constant k for which u bar equals k v bar.

(Refer Slide Time: 05:34)

So, the magnitude of the inner product is equal to the product of the norms of the two vectors only if the vector u bar is a constant times v bar. What this means basically is that if we have a vector v bar, then the vector u bar must be k times v bar, which means that the vector u bar must lie along v bar; that is, u bar is simply a scaled version of v bar, ok. So, the magnitude of the inner product between two vectors equals the product of the norms only if u bar lies along v bar, or u bar can also be exactly opposite to the direction of v bar — in fact, because we are looking simply at the magnitude, u bar and v bar can also be anti-parallel, ok. So, this means either the angle theta equals 0 degrees or theta equals 180 degrees, ok.
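
A small numerical sketch of this equality condition (the numbers below are arbitrary): for u bar equal to k times v bar, the magnitude of the inner product matches the product of the norms, while for a generic pair it is strictly smaller:

import numpy as np

v = np.array([1.0, -2.0, 3.0])
u_aligned = -2.5 * v                           # u = k v, here with k = -2.5
u_generic = np.array([1.0, 0.0, 1.0])          # not a multiple of v

for u in (u_aligned, u_generic):
    lhs = abs(u @ v)
    rhs = np.linalg.norm(u) * np.linalg.norm(v)
    print(lhs, rhs, np.isclose(lhs, rhs))      # equality only in the aligned case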
(Refer Slide Time: 07:05)

So, in both the cases, that is theta equal to 0 degrees or 180 degrees, we will have the magnitude of u bar comma v bar equal to norm u bar into norm v bar. Of course, if theta equals 180 degrees the inner product between u bar, v bar is negative, so the inner product between u bar and v bar is minus norm u bar times norm v bar; and if theta equals 0 degrees, then the inner product is simply norm u bar times norm v bar, ok. So, for the magnitude of the inner product to equal the product of the norms, these vectors have to be either aligned or exactly anti-aligned, that is, the angle between them is 180 degrees, ok.

(Refer Slide Time: 08:06)

And this is an interesting property related to the inner product, which we will again frequently invoke at instances during our discussion on optimization. Let us now look at another important aspect, that is, the properties of this norm, the properties of the norm, and these are as follows. The first property is that norm v bar is greater than or equal to 0. In fact, norm v bar square equals the inner product of v bar comma v bar, which is greater than or equal to 0.

And norm v bar equals the square root of the inner product between v bar and v bar, which is also greater than or equal to 0. In fact, norm v bar equals 0 if and only if the inner product of v bar with itself equals 0, and we have seen that the inner product of v bar with itself is 0 only when v bar is 0.

(Refer Slide Time: 09:45)

Now, the second aspect is that the norm of c v bar, a constant times v bar, equals the magnitude of the constant times the norm of v bar. In fact, you can see the norm of c v bar square equals the inner product of c v bar with c v bar, which is c times the inner product of v bar with c v bar, which is c square times the inner product of v bar with itself, which is c square times the norm of v bar square. This implies, taking the square root, that the norm of c v bar equals the magnitude of c times the norm of v bar, ok.
(Refer Slide Time: 10:51)

And the third property is an important property; this is known as the triangle inequality. The triangle inequality states that norm of u bar plus v bar is less than or equal to norm u bar plus norm v bar, and this can be seen as follows. If you look at norm of u bar plus v bar square, that is the inner product of u bar plus v bar with itself, which is basically the inner product of u bar with u bar plus v bar, plus the inner product of v bar with u bar plus v bar; which is equal to the inner product of u bar with u bar, plus the inner product of u bar with v bar, plus the inner product of v bar with u bar, plus the inner product of v bar with v bar; which equals norm u bar square plus twice the inner product between u bar and v bar — because the inner product of v bar with u bar is the same as the inner product of u bar with v bar — plus norm v bar square.

(Refer Slide Time: 12:21)

Now, applying the Cauchy-Schwarz inequality, we know that the inner product of u bar, v bar is less than or equal to the product of the norms, norm u bar into norm v bar. So, this is less than or equal to norm u bar square plus twice norm u bar into norm v bar plus norm v bar square, which is equal to norm u bar plus norm v bar, the whole square. And therefore, taking the square root, since all the quantities are positive, this implies norm of u bar plus v bar is less than or equal to norm of u bar plus norm of v bar. This is basically the triangle inequality of the norm. This is valid for any norm, in particular for the norm induced by the inner product, ok.
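
As a quick sketch, these three properties of the norm can be checked numerically for arbitrary vectors:

import numpy as np

rng = np.random.default_rng(4)
u, v = rng.standard_normal((2, 6))                  # two arbitrary vectors
c = -3.0                                            # an arbitrary constant

norm = np.linalg.norm
print(norm(v) >= 0)                                 # non-negativity
print(np.isclose(norm(c * v), abs(c) * norm(v)))    # ||c v|| = |c| ||v||
print(norm(u + v) <= norm(u) + norm(v))             # triangle inequality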
So, that basically concludes our discussion on the various aspects and properties of the norm that is induced by the inner product. So, now, let us start doing some examples on these mathematical preliminaries.
(Refer Slide Time: 14:04)

For instance, let us start by considering a simple example. There is an important aspect of matrices that we have not discussed previously, which is row elimination, that is, the reduction to what is known as the row echelon form, or simply the echelon form. So, in this example, what we want to do is find the rank of a matrix via Gaussian elimination. We have seen the concept of the rank of a matrix, and we also want to discuss this notion of an echelon form.

(Refer Slide Time: 15:19)

So, let us say we have a 3 cross 3 matrix X whose columns are (2, 4, minus 4), (1, 3, 2) and (1, 4, 2), that is, whose rows are (2, 1, 1), (4, 3, 4) and (minus 4, 2, 2). What we are going to do is perform row operations on this; the first thing we are going to do is perform Gaussian elimination.

So, first perform R 2, that is row 2, minus twice row 1. What we are doing is basically using the leading element of the first row as what is known as a pivot: we are multiplying the first row, scaling it appropriately, and subtracting it from the second row so as to reduce the element below the pivot in the second row to 0.

(Refer Slide Time: 16:26)

So, this will give you an equivalent representation: the first row will remain as it is, that is 2, 1, 1; the second row will be 0, 1, 2; and the third row will remain minus 4, 2, 2. So, we have reduced the element below the pivot to 0, ok. Now, similarly, you can also reduce the element below the pivot in the third row to 0 by performing R 3 plus 2 R 1, and that will give you: the first row remains as it is, the second row is 0, 1, 2, and the third row will be minus 4 plus 4, that is 0, then 2 plus 2, that is 4, and 2 plus 2, that is 4.
(Refer Slide Time: 18:21)

So, what we have done is we have chosen the pivot and we have reduced the corresponding entries in the column below the pivot to 0, ok. And now, similarly, we can use the second-row element 1 as a pivot and perform R 3 minus 4 R 2.

(Refer Slide Time: 19:02)

And that equivalently yields: the first row as it is, the second row also as it is, and the third row will become 0, 0 and 4 minus 8, that is minus 4. And if you look at this now, what you will observe is that this is an upper triangular matrix.

And this is termed as the echelon form of the matrix. These are your pivots, which are used to scale and subtract from the rows below, and below each pivot you have a column of zeros. And the interesting property here is, if you look at the number of nonzero rows — in this case you have 3 nonzero rows — the number of nonzero rows is equal to the rank of the matrix.

(Refer Slide Time: 20:58)

This is an interesting number: the number of nonzero rows equals the rank of the matrix. So, this is how one can deduce the rank of the matrix via this pivoting and this process of Gaussian elimination, using the pivots and subtracting the scaled versions of the rows from the rows beneath, ok.
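
A minimal NumPy sketch of the same elimination steps on this matrix, with the rank cross-checked against numpy's own routine:

import numpy as np

X = np.array([[ 2., 1., 1.],
              [ 4., 3., 4.],
              [-4., 2., 2.]])

U = X.copy()
U[1] -= 2 * U[0]          # R2 <- R2 - 2 R1
U[2] += 2 * U[0]          # R3 <- R3 + 2 R1
U[2] -= 4 * U[1]          # R3 <- R3 - 4 R2
print(U)                  # echelon (upper triangular) form with 3 nonzero rows

nonzero_rows = int(np.sum(np.any(np.abs(U) > 1e-12, axis=1)))
print(nonzero_rows, np.linalg.matrix_rank(X))   # both equal 3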
Let us now consider another example: let us consider again a matrix X with rows 1, 2, 1; 2, 4, 5; and 3, 6, 1.

(Refer Slide Time: 22:13)

And if you look at this, now perform R 2 minus 2 R 1; you have this leading 1 as your pivot.

(Refer Slide Time: 22:54)

That gives you 0, then 4 minus 4 which is 0, and 5 minus 2 which is 3; so the second row becomes 0, 0, 3. Now perform R 3 minus 3 R 1, which gives 0, 0 and 1 minus 3, that is minus 2, in the third row. And now, you can interestingly see that all the possible pivots in the second column are also 0.

(Refer Slide Time: 23:49)

So, these entries which could have been used as possible pivots are zeros. And now what we do is we perform R 3 plus 2 by 3 R 2, and that gives us something interesting: what you will see is you have 1, 2, 1, then 0, 0, 3, and finally a row of all 0s.

(Refer Slide Time: 24:22)

So, in this echelon form, you will see you have an all zero row. So, the number of nonzero rows equals 2, and this implies the rank of the matrix equals 2, ok.

And in this case, you can clearly see the pivots that we have used: 1 and 3, these are the pivots. And note this interesting property of the pivots, that each pivot has to lie to the right of the pivot in the row above. And therefore, the rank of the matrix is 2, and this is obtained via Gaussian elimination.

And this is the echelon form of this matrix, and this is how the simple procedure of Gaussian elimination using pivoting can be used to determine the rank of a matrix and reduce it to the echelon form; this also makes it much easier to solve a system of linear equations. We will stop here.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 07
Gram Schmidt Orthogonalization Procedure

Hello, welcome to another module in this massive online open course. We are looking at examples to understand the mathematical preliminaries of optimization. Let us look at another example, and this is something that is very important and has a lot of practical utility; this is termed as the Gram Schmidt Orthonormalization Process, ok.

So, we are looking at examples, correct?

(Refer Slide Time: 00:39)

And as part of the second example, what we want to look at is the Gram Schmidt procedure for Orthonormalization.

(Refer Slide Time: 01:20)


And what this does is, for a given set of linearly independent vectors u 1 bar, u 2 bar, up to u n bar — so, this is a given set of linearly independent vectors — what the Gram Schmidt Orthonormalization procedure does is it creates an orthonormal set (I am just going to explain in a moment what this means), an orthonormal set of vectors that spans the same subspace, in the sense that linear combinations of these vectors can be used to generate any vector in that subspace, all right. So, both the subspaces spanned by these sets are the same. And what is the meaning of this term orthonormal? An orthonormal set of vectors is a set in which each vector has unit norm — that is the normal part; we said the process of making the norm of a vector unity is normalization — and orthogonal represents the fact that all the vectors in this set are pairwise orthogonal to each other. So, that makes it an orthonormal set of vectors.

(Refer Slide Time: 03:22)

So, we have an orthonormal basis, which is frequently very convenient to consider as the basis of a vector space. So, V 1 bar, V 2 bar, up to V n bar: this set being orthonormal implies that each vector has norm V i bar equal to 1 — this is basically your normal property — and if you look at any pair of vectors, V i bar transpose V j bar equals 0 for all i not equal to j, all right, and this basically represents the orthogonal property.

So, they are orthonormal in the sense that all the vectors in the set are orthogonal to each other and each vector has unit norm, all right, and they span the same subspace. So, how does this procedure work? Well, the Gram Schmidt Orthonormalization procedure works in various steps, and it can be described as follows, ok.
So, we start with the first vector; this is the first step. You can think of this as step one: what we do is we create a unit norm vector, that is, we create V 1 bar equals u 1 bar divided by norm u 1 bar. So, in each step we create a set of orthonormal vectors, and the first step is to create the vector V 1 bar equals u 1 bar divided by norm u 1 bar. You can observe that this is unit norm because, in fact, what we are doing is normalizing the vector u 1 bar by its norm, so this is a unit norm vector, ok.
(Refer Slide Time: 05:41)

So, that implies norm V 1 bar equals 1, so it satisfies the criterion for Gram Schmidt Orthonormalization. Now, in step 2, what we do is we want to create a vector that is orthogonal to V 1 bar. So, we look at V 2 tilde equals u 2 bar minus the projection of u 2 bar on V 1 bar; that is what we are doing, we subtract the projection of u 2 bar on V 1 bar.

And what we are doing here is the following: we have these 2 vectors — remember, let us say this is V 1 bar, and now you have this vector u 2 bar — and u 2 bar can be represented as the sum of 2 components: one is the projection, which we can term the parallel component, written as u 2 bar P, and the other is the perpendicular component, which can be written as u 2 bar perp.

(Refer Slide Time: 07:40)

So, what we are doing is, we are subtracting this component along V 1 bar, the component that is parallel to V 1 bar, and retaining whatever is perpendicular to V 1 bar; that is basically what leads to the orthogonality property, ok. And that is what the Gram Schmidt Orthonormalization procedure is achieving. And of course, now we have to ensure unit norm; so, V 2 bar equals V 2 tilde divided by norm of V 2 tilde, and this makes it unit norm.

And you can see the orthogonality as follows; consider just a quick demonstration of orthogonality.

(Refer Slide Time: 08:45)

Consider V 2 bar; we have already seen V 2 bar is unit norm. V 2 bar transpose V 1 bar equals V 2 tilde transpose divided by norm V 2 tilde, times V 1 bar, because V 2 bar equals V 2 tilde divided by norm V 2 tilde, something that we have just seen. Now, this is 1 over norm V 2 tilde times V 2 tilde transpose times V 1 bar, and V 2 tilde transpose, as we have just seen, is u 2 bar transpose minus the inner product of u 2 bar comma V 1 bar into V 1 bar transpose.

(Refer Slide Time: 09:41)

Now, if you expand this, what you obtain is 1 over norm of V 2 tilde times: u 2 bar transpose V 1 bar, that is the inner product of u 2 bar comma V 1 bar, minus the inner product of u 2 bar comma V 1 bar times V 1 bar transpose V 1 bar; but V 1 bar transpose V 1 bar is norm V 1 bar square, which is equal to 1. Since this is equal to 1, this is 1 over norm V 2 tilde times the inner product of u 2 bar with V 1 bar minus the same inner product, which you can now see is equal to 0; this implies V 1 bar is orthogonal to V 2 bar.

And this can also be represented as V 1 bar perpendicular to V 2 bar, because remember we said the cosine of the angle between these 2 vectors is related to the inner product: if the inner product is 0, the cosine of the angle is 0, all right, which means the angle theta is 90 degrees, and therefore the vectors are perpendicular to each other, ok.

(Refer Slide Time: 11:14)

And now, basically, we have created V 1 bar, V 2 bar. So, at every step we are creating a set of orthonormal vectors. Now, in step 3, you can clearly see how we can generate the next vector.

We have V 3 tilde equals u 3 bar minus — remove the projection of u 3 bar on V 1 bar — minus — remove the projection of u 3 bar on V 2 bar. So, each inner product with the unit norm vector is basically giving the projection on that unit norm vector, ok. So, these are basically the projections on V 1 bar and V 2 bar, and these are being subtracted. And now we create a unit norm vector by V 3 bar equals V 3 tilde divided by norm V 3 tilde.

So, what this does is it creates a unit norm vector, ok. And now you can quickly check the orthogonality property once again.

(Refer Slide Time: 12:41)

If you do V 3 bar transpose V 1 bar, let us say, or V 3 bar transpose V 2 bar, just to make sure it is orthogonal to the previous vectors: I have 1 over norm V 3 tilde times V 3 tilde transpose V 2 bar. This is 1 over norm V 3 tilde times u 3 bar minus the inner product of u 3 bar comma V 1 bar into V 1 bar, minus the inner product of u 3 bar comma V 2 bar into V 2 bar — of course, all of this transposed, because we have to take the transpose — times V 2 bar.

And now you can see this will be 1 over norm V 3 tilde times: u 3 bar transpose V 2 bar, which is the inner product of u 3 bar with V 2 bar, minus the inner product of u 3 bar comma V 1 bar into V 1 bar transpose V 2 bar — observe that V 1 bar transpose V 2 bar equals 0, so this term is 0 — minus the inner product of u 3 bar comma V 2 bar into V 2 bar transpose V 2 bar, that is norm V 2 bar square, which is once again 1.

(Refer Slide Time: 14:23)

So, this is basically 1 over norm V 3 tilde times the inner product of u 3 bar comma V 2 bar minus the inner product of u 3 bar comma V 2 bar, which is equal to 0. And you can similarly verify the orthogonality of V 3 bar to V 1 bar as well. So, this implies V 1 bar, V 2 bar, V 3 bar is an orthonormal set, and this procedure can similarly be continued.
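
A compact sketch of the procedure described so far (plain NumPy; the vectors are stored as rows and are assumed to be linearly independent, as the procedure requires):

import numpy as np

def gram_schmidt(U):
    # orthonormalize the rows of U (assumed linearly independent)
    V = []
    for u in U:
        w = np.array(u, dtype=float)
        for v in V:
            w -= (u @ v) * v               # subtract the projection of u on each previous v
        V.append(w / np.linalg.norm(w))    # normalize to unit norm
    return np.array(V)

U = np.random.default_rng(5).standard_normal((3, 4))   # 3 arbitrary (almost surely independent) vectors
V = gram_schmidt(U)
print(np.allclose(V @ V.T, np.eye(3)))                  # orthonormal set: V V^T = I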
(Refer Slide Time: 15:48)

And what do we do in step n? If you look at the n-th step: in step n, what we do is we form V n tilde, which will be u n bar minus the inner product of u n bar comma V 1 bar into V 1 bar, minus the projection of u n bar on V 2 bar — remove that as well — and so on and so forth. The last one is, you remove the projection along V n minus 1 bar, ok. And finally, we have V n bar equals V n tilde divided by norm V n tilde, ok, and this generates the unit norm vector. Remember, the orthogonality property is not affected by the normalization; so, all we have to do is take the vector and simply divide it by its norm to get the corresponding unit norm vector. That basically summarizes the Gram Schmidt Orthonormalization procedure.

Let us look at a specific instance, an application of this procedure, considering a set of vectors to demonstrate how this procedure actually works in practice. So, you can think of this as an example inside an example, a practical illustration of the Gram Schmidt Orthonormalization procedure.

(Refer Slide Time: 17:24)

So, consider 3 vectors: u 1 bar, that is the vector (1, 1, minus 2); u 2 bar equals the vector (1, 2, minus 3); and u 3 bar equals the vector (0, 1, 1). And now we have V 1 bar equals u 1 bar divided by norm u 1 bar.

(Refer Slide Time: 18:42)

Now observe that norm u 1 bar equals the square root of 1 square plus 1 square plus 2 square, which equals square root of 6. So, V 1 bar equals 1 over square root of 6 times (1, 1, minus 2). So, this is your unit norm vector, ok.

Observe that this is a unit norm vector; this is the first step. In step one we do not remove any projection, because there is no vector in the set V yet, ok. Remember, at step n we have to remove the projections on the previously chosen vectors V 1 bar, V 2 bar, up to V n minus 1 bar. So, from step 2 onwards we remove the projection; that is, V 2 tilde equals u 2 bar minus the projection of u 2 bar on V 1 bar, which is given by this inner product, since V 1 bar is a unit norm vector.

And this is basically u 2 bar, the vector (1, 2, minus 3), minus the projection.
(Refer Slide Time: 20:16)

You can check that the inner product u 2 bar transpose V 1 bar is 9 over square root of 6. So, V 2 tilde is (1, 2, minus 3) minus 9 over square root of 6 times 1 over square root of 6 into (1, 1, minus 2), and this will basically be (minus half, half, 0). And the norm of V 2 tilde is the square root of 1 by 4 plus 1 by 4 plus 0, that is the square root of half, that is 1 over root 2.

So, V 2 bar equals V 2 tilde divided by norm of V 2 tilde. This is basically square root of 2 times (minus half, half, 0); taking the factor of half outside, this becomes 1 over square root of 2 times (minus 1, 1, 0).

(Refer Slide Time: 21:22)

And you can clearly see this is a unit norm vector, and you can also see this will be orthogonal to V 1 bar. So, V 2 bar transpose V 1 bar is 1 over square root of 2 times (minus 1, 1, 0) times 1 over square root of 6 times (1, 1, minus 2). And you can clearly see this is minus 1 plus 1, which is 0, plus 0, which is 0.

So, it is 1 over square root of 12 times 0, so this is 0. So, this implies these are orthogonal, that is, V 2 bar is perpendicular to V 1 bar, or, the same thing, V 2 bar and V 1 bar are orthogonal. So, that completes your step 2: remove the projection along V 1 bar from u 2 bar and then divide by the norm, that is, you obtain V 2 tilde and divide by the norm of V 2 tilde to obtain the orthonormal vector, alright. And V 2 bar, you can see, is also orthogonal to V 1 bar; therefore, this makes it an orthonormal set.

(Refer Slide Time: 22:59)

Now, step 3 is again very similar. So, V 3 tilde equals u 3 bar minus the projection of u 3 bar along V 1 bar into V 1 bar, minus the inner product of u 3 bar comma V 2 bar into V 2 bar. And you can check this: u 3 bar, as you have seen, is the vector (0, 1, 1); the inner product with V 1 bar is minus 1 over square root of 6, so we subtract minus 1 over square root of 6 times 1 over square root of 6 times (1, 1, minus 2), that is your V 1 bar term; and the projection along V 2 bar is 1 over square root of 2 times 1 over square root of 2 times (minus 1, 1, 0), all right.

(Refer Slide Time: 24:11)

And if you simplify this, this will be (0, 1, 1) plus 1 over 6 times (1, 1, minus 2) minus 1 over 2 times (minus 1, 1, 0). So, the first entry will be 0 plus 1 by 6 plus half, that is 4 by 6, that is 2 by 3; you can also check that the second entry will be 2 by 3 and the third entry will also be 2 by 3; so that is 2 by 3 into (1, 1, 1). And this is basically your V 3 tilde, ok. And the norm of V 3 tilde you can also compute very easily: that is the square root of 4 by 9 plus 4 by 9 plus 4 by 9, which is 2 by 3 times square root of 3. And therefore, finally, V 3 bar equals V 3 tilde divided by norm V 3 tilde, which equals 2 by 3 times (1, 1, 1) into 1 over (2 by 3 square root of 3), and that will be equal to 1 over square root of 3 times (1, 1, 1).

(Refer Slide Time: 25:59)

So, finally, we have the orthonormal set of vectors, which spans the same space as u 1 bar, u 2 bar, u 3 bar, let me remind you. So, V 1 bar is 1 over square root of 6 times (1, 1, minus 2), V 2 bar equals 1 over square root of 2 times (minus 1, 1, 0), and V 3 bar equals 1 over square root of 3 times (1, 1, 1). This is your orthonormal set of vectors V, ok.
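
As a quick numerical check of this worked example, stacking V 1 bar, V 2 bar, V 3 bar as the columns of a matrix V and computing V transpose V should give the identity matrix (unit norms on the diagonal, zero inner products off the diagonal):

import numpy as np

V = np.column_stack([
    np.array([1., 1., -2.]) / np.sqrt(6),    # V1
    np.array([-1., 1., 0.]) / np.sqrt(2),    # V2
    np.array([1., 1., 1.]) / np.sqrt(3),     # V3
])
print(np.round(V.T @ V, 10))                 # identity matrix: an orthonormal set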

(Refer Slide Time: 27:19)

And you can once again check: if you do V 3 bar transpose V 1 bar, let us do a quick check, that would be 1 over square root of 3 times (1, 1, 1), the row vector, times 1 over square root of 6 times (1, 1, minus 2), which is basically 1 plus 1 minus 2 divided by square root of 18, and you can see this is equal to 0.

And therefore, V 1 bar and V 3 bar are orthogonal, and you can similarly check the inner product V 3 bar transpose V 2 bar; in fact, you can readily see that they are also orthogonal. Therefore, V 1 bar, V 2 bar, V 3 bar is an orthonormal set. And remember, the important thing about this is that they span the same subspace as the original vectors u 1 bar, u 2 bar, u 3 bar.

So, that is the connection between this set, the new set V, and the old set u. And several times it is very convenient to find an orthonormal span: although both the given sets span the same subspace, it is very convenient to deal with V rather than u, because V is an orthonormal set of vectors that spans the same subspace. And in fact, this can be used not only to find the orthonormal span for a vector subspace — remember, this can be used for any inner product space, all right, and we have already said that the set of continuous functions on the interval a and b forms an inner product space.

So, given a set of linearly independent basis functions which span a subspace of that space, one can similarly determine an orthonormal set of functions, that is, functions which are orthogonal to each other and have unit norm and that span the same subspace of continuous functions on the interval a, b.

So, this Gram Schmidt Orthonormalization procedure is something that is very convenient, very handy, and it is very popular and highly applicable in practice: first, because it is a low complexity procedure, and second, it has immense utility in terms of simplification, be it deriving the span of a subspace or the representation of a new vector — representing it in this subspace can be much more easily done using the orthonormal span for the same subspace. So, we will stop here and we will continue with other aspects in the subsequent modules.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 08
Null Space and Trace of Matrices

Hello, welcome to another module in this massive open online course. So, we are looking at example problems to better understand the preliminaries required for optimization, all right. In this module, let us start looking at another important concept, that is, the Null Space of a Matrix.

(Refer Slide Time: 00:29)

So, what we want to understand, as part of our examples for the mathematical preliminaries, is this concept of what is known as the null space of a matrix. Now, what do we mean by the null space of a matrix? Consider a matrix A, an m cross n matrix; then the null space of A is denoted by the symbol N of A, and this denotes the null space.

(Refer Slide Time: 01:30)


any linear combination alpha x 1 bar beta x 2 bar then A times alpha x 1 bar plus beta x 2
bar equals, alpha times A x 1 bar plus beta times A x 2 bar, but remember x 1 bar and x 2
bar are both elements of the null space.

So, A x 1 bar a 2 bar is 0 which means this is alpha time 0 plus beta time 0, which is 0
and this implies this implies that alpha x 1 bar plus beta x 2 bar also belongs to the null
space of A, which means that the null space of A indeed is a this is a vector space. That is
because if x 1 bar belongs to what is the definition of vector space if x 1 bar belongs to
the vector space x 2 bar belongs to the vector space, then any linear combination of these
vectors was also belong to that set that is known as a space or a subspace a vector space
or a vector subspace all right.

The null space of A; this comprises of all vectors x bar the null space of A comprises of x And therefore, the null set that is the set of all vectors x bar such that A x bar equal 0 is

bar such that A x bar equal to 0 that is all set of all vectors A x bar x bar such that A x bar also a vector space. Because if you take any 2 vectors belonging to this set their linear

that is mean multiplied by A; A x bar equal 0 all right. combination also belongs to the set, also if this is known as the null space this vector
subspace is known as the null space ok.
So, that that is basically that set all right in fact, it is the space that space of all vectors x
bar is called the null space of the matrix A. In fact, this is the space vector space you can (Refer Slide Time: 04:33)

see this as follows.

(Refer Slide Time: 02:31)

Now, example let us point for instance, let us point let us consider an example within this
example; this is example number this null space this is the example number 3 that we are
looking at example number 3 ok. And for instance let us consider a matrix A that is equal
If x bar; so, observe that this is a vector space, this can be seen as follows if x 1 bar to minus 1 1 2 4 2 0 minus 1 7 and what we want to do. So, this is a 2 cross 4 matrix
belongs to the null space of A x 2 bar belongs to the null space of A then if you perform belongs to set of 2 cross for real matrices, and what we want to do is we want to find the
null space of this matrix find the null space of this matrix A and this can be found as What we will do now is we will perform row operations on the matrix A; perform we
follows. will perform row operations on the matrix A.

(Refer Slide Time: 05:43) So, we have 1 minus 1 minus 2 minus 4 2 0 minus 1 or 2 0 1 minus 7 first what we are
going to do is we are going to perform reduce the pivot to 1 I am sorry. So, 1 I am sorry
this is minus 1 1 2 4 minus 1 1 2 4 2 0 1 minus 1.

(Refer Slide Time: 07:29)

So, you want to find the set of all vectors remember minus 1 1 2. So, you want to find the
set of all vectors such that A x bar equal 0 which is minus 1 1 2 4 into 2 0 1 minus 7
times well remember. So, 2 cross 4 matrix so, you have to multiply this by a 4
dimensional vector x 2 x 3 x 4 equal 0 this is your x bar.
So, first we will divide R 1 goes to R 1 divided by minus 1. So, this will become
(Refer Slide Time: 06:29) equivalently this will equivalently become well when you divide by minus 1 this
becomes 1 minus 1 minus 2 minus 4 2 0 minus 1 7. Now what we are going to do is we
are going to perform R 2 minus twice R 1 the row operation, this will become the matrix
this for first we will remain as it is minus 1 minus 2 minus 4 2 minus 2 0 this will be 2
this will be 5 minus 1 in this will be minus 1 minus 4 this will be basically you can see
that this will be 2 R 2 minus 2 R 1 I am sorry this is 2 0 1 minus 7. So, 2 0 1 minus also 1
plus 4 this will be 5 and minus 7 plus minus 7 plus 8 this will be 1. So, this is what you
get.
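As a quick cross check of this row reduction, here is a minimal sketch in Python (assuming SymPy is available; rref() reduces further than the partial reduction above, but the null space is unchanged):

```python
import sympy as sp

# The 2 x 4 matrix A from Example 3.
A = sp.Matrix([[-1, 1, 2, 4],
               [2, 0, 1, -7]])

# rref() carries the elimination all the way to reduced row echelon form;
# it is a further-reduced version of the matrix obtained above,
# but it has exactly the same null space.
R, pivots = A.rref()
print(R)        # Matrix([[1, 0, 1/2, -7/2], [0, 1, 5/2, 1/2]])
print(pivots)   # (0, 1) -> x1, x2 are pivot variables; x3, x4 are free
```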

(Refer Slide Time: 09:21)

This is the equivalent matrix after the row operations, and what you can show is that finding the null space of the original matrix A is equivalent to finding the null space of this row reduced matrix. So, we will find the null space of the row reduced matrix, that is, we solve

[ 1  -1  -2  -4 ; 0  2  5  1 ] [ x 1, x 2, x 3, x 4 ] transpose = 0.

(Refer Slide Time: 10:22)

This gives 2 equations: x 1 - x 2 - 2 x 3 - 4 x 4 = 0 from the first row, and 2 x 2 + 5 x 3 + x 4 = 0 from the second row. So, observe that here we have 2 equations in 4 unknowns, the unknown variables x 1, x 2, x 3, x 4. What this means is that we have 2 free variables; we can set 2 of the unknown variables, or 2 of the unknown parameters, as free variables. So, let us set x 3 and x 4 as the free variables, and we will express the rest of them, x 1 and x 2, in terms of x 3 and x 4.

(Refer Slide Time: 11:26)

From the second equation, 2 x 2 + 5 x 3 + x 4 = 0, which implies x 2 = -(5/2) x 3 - (1/2) x 4. And from the first equation, x 1 - x 2 - 2 x 3 - 4 x 4 = 0, which implies x 1 = x 2 + 2 x 3 + 4 x 4.

(Refer Slide Time: 13:08)

Now, substitute for x 2: x 1 = -(5/2) x 3 - (1/2) x 4 + 2 x 3 + 4 x 4, which is, as you can see, (2 - 5/2) x 3 + (4 - 1/2) x 4, that is, x 1 = -(1/2) x 3 + (7/2) x 4.

(Refer Slide Time: 13:24)

And therefore, the general structure of a vector x bar belonging to the null space of A will be of the form

x bar = [ -(1/2) x 3 + (7/2) x 4, -(5/2) x 3 - (1/2) x 4, x 3, x 4 ] transpose.

That is, if we choose any 2 parameters x 3 and x 4, we can form a vector x bar with this structure, and that vector x bar will belong to the null space of the matrix A. So, this is the general structure, or the general expression, of any vector x bar that belongs to the null space of the matrix A.

(Refer Slide Time: 14:52)

Let us simplify this a little further. We can write x bar as the sum of 2 components: x 3 times the vector [ -1/2, -5/2, 1, 0 ] transpose plus x 4 times the vector [ 7/2, -1/2, 0, 1 ] transpose.

(Refer Slide Time: 15:46)

So, I am writing this as x bar = x 3 u 1 bar + x 4 u 2 bar, where u 1 bar = [ -1/2, -5/2, 1, 0 ] transpose and u 2 bar = [ 7/2, -1/2, 0, 1 ] transpose. Observe that this is a linear combination of the 2 vectors u 1 bar and u 2 bar. Therefore, the null space of A is formed by all possible linear combinations of these vectors u 1 bar and u 2 bar, and hence u 1 bar and u 2 bar are basis vectors for the null space of the matrix A. In fact, you can quickly verify that u 1 bar and u 2 bar themselves belong to the null space.

(Refer Slide Time: 17:47)

For instance, consider A u 1 bar: multiply the matrix with rows [ -1, 1, 2, 4 ] and [ 2, 0, 1, -7 ] by the vector [ -1/2, -5/2, 1, 0 ] transpose. The first row gives 1/2 - 5/2 + 2 + 0 = 0, and the second row gives -1 + 0 + 1 + 0 = 0, so you can clearly see this is nothing but the zero vector. This implies A u 1 bar = 0, so u 1 bar itself lies in the null space, and naturally u 1 bar, u 2 bar form a basis of that space.

(Refer Slide Time: 19:03)

Similarly, you can check that A u 2 bar = 0. So, that is the concept of the null space of a matrix: it is the space of all vectors x bar such that A x bar = 0, and we justified that this is actually a vector space, because if we take any 2 vectors belonging to this space, their linear combination also lies in this space.
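And a small numerical verification of this basis (a sketch assuming NumPy; u1 and u2 are the vectors u 1 bar and u 2 bar derived above):

```python
import numpy as np

A = np.array([[-1., 1., 2., 4.],
              [ 2., 0., 1., -7.]])

u1 = np.array([-0.5, -2.5, 1.0, 0.0])   # u_1 bar derived above
u2 = np.array([ 3.5, -0.5, 0.0, 1.0])   # u_2 bar derived above

print(A @ u1)                   # [0. 0.] -> u1 lies in the null space
print(A @ u2)                   # [0. 0.] -> u2 lies in the null space

# Any linear combination x3*u1 + x4*u2 is also mapped to zero by A.
x3, x4 = 2.0, -3.0
print(A @ (x3 * u1 + x4 * u2))  # [0. 0.]
```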

(Refer Slide Time: 19:44)

Let us look at another example, example number 4, and this is regarding the trace of a matrix. Recall briefly that the trace is defined only for a square matrix, and it is the sum of the diagonal elements. So, if A is an n cross n square matrix, the trace of A is the sum of the diagonal elements of the matrix A.

(Refer Slide Time: 21:16)

So, trace of A equals a 11 + a 22 + ... + a nn, that is, trace(A) = sum over i = 1 to n of a ii, which can also be written as the sum over i = 1 to n of A(i, i), where A(i, i) denotes the (i, i)th element of the matrix A. So, the trace of a square matrix A is simply the sum of its diagonal elements a 11 + a 22 + a 33 + ... and so on.

(Refer Slide Time: 22:34)

Now, the property that we are interested in is the following. Consider 2 matrices A and B, not of the same size, but such that A belongs to the set of m cross n matrices and B belongs to the set of n cross m matrices; rather, A transpose and B have the same size. So, A is m cross n and B is n cross m, so that we can multiply A and B. Then we can show that trace(AB) = trace(BA), which is an interesting property: for two matrices A of size m cross n and B of size n cross m, the sum of the diagonal elements of AB is the same as the sum of the diagonal elements of BA.

Now, note that this does not imply that AB equals BA; one should not confuse the two. Simply, trace(AB) = trace(BA) does not imply that AB itself is equal to BA. In fact, in general AB is typically not equal to BA: if you look at the sizes, AB is of size m cross m while BA is of size n cross n, so if m is not equal to n the sizes themselves are different, and AB cannot equal BA. Now let us start with the proof of this very interesting property, which is very helpful in several simplifications.

(Refer Slide Time: 24:22)

First, let us start by looking at the (i, j)th element of the matrix product, (AB) ij. You can show that this is equal to the sum over k = 1 to n of a ik b kj; that is, the (i, j)th element of the product is obtained by taking the ith row of A and the jth column of B, doing an element by element multiplication, and then taking the sum.

(Refer Slide Time: 26:09)

Now, what is (AB) ii, that is, the ith diagonal element? It is simply the sum over k = 1 to n of a ik b ki, because we are looking at the (i, i)th element; so this uses the ith row of A and the ith column of B.

(Refer Slide Time: 27:22)

And now, recall that the trace of AB is nothing but the sum of the diagonal elements. So, trace(AB) = sum over i = 1 to m of (AB) ii. Substituting the expression for (AB) ii, this equals the double sum, over i = 1 to m and k = 1 to n, of a ik b ki. Now, what I am going to do is simply write the product in a different order: a ik b ki is b ki times a ik, so this is the sum over i = 1 to m and k = 1 to n of b ki a ik.

(Refer Slide Time: 28:30)

Now, I am going to interchange the order of summation, so this becomes the sum over k = 1 to n of the sum over i = 1 to m of b ki a ik. Look at the inner sum: b ki is an element of the kth row of B, and a ik is the corresponding element of the kth column of A; you are taking an element wise product of the kth row of B and the kth column of A and summing. So, this is nothing but the (k, k)th element of BA, that is, the sum over i = 1 to m of b ki a ik = (BA) kk.

(Refer Slide Time: 29:18)

And therefore, what you can see is that the whole expression is the sum over k = 1 to n of (BA) kk: you are taking the sum of all the diagonal elements of the matrix product BA, which is nothing but the trace of BA.

And therefore, what we have is the very handy property trace(AB) = trace(BA). This is a very handy property because, in general, matrices do not commute, that is, AB is not generally equal to BA; but the trace of AB equals the trace of the matrix BA. This is an interesting property of matrices, which will come in handy in several optimization problems where you have to manipulate matrices or products of matrices, all right.
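Here is a quick numerical illustration of this trace property (a minimal sketch assuming NumPy; the sizes m = 3, n = 5 and the random matrices are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))   # A is m x n
B = rng.standard_normal((n, m))   # B is n x m

AB = A @ B                        # m x m
BA = B @ A                        # n x n

# AB and BA are not even the same size, yet their traces agree.
print(AB.shape, BA.shape)                        # (3, 3) (5, 5)
print(np.trace(AB), np.trace(BA))                # equal up to floating point
print(np.isclose(np.trace(AB), np.trace(BA)))    # True
```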
So, let us stop here and we will continue with some other problems in the subsequent
modules.

Thank you very much.


Applied Optimization for Wireless, Machine Learning, Big Data.
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 09
Eigenvalue Decomposition of Hermitian Matrices and Properties

Hello, welcome to another module in this massive open online course. We are looking at examples to understand the mathematical framework, or the preliminaries, required for developing the various optimization problems. Let us continue our discussion in this module, and let us start by looking at the eigenvalues of a Hermitian symmetric matrix.

(Refer Slide Time: 00:35)

So, we want to look at the eigenvalues and the eigenvalue decomposition of a Hermitian matrix; this is our example number 5. Recall that a matrix is Hermitian symmetric if A equals A Hermitian.

(Refer Slide Time: 01:49)

Remember, for any m cross n complex matrix, the Hermitian is obtained as follows: you first take the transpose of A and then take the complex conjugate. So, the Hermitian involves two steps, transpose plus conjugate: first you take the transpose, so the m cross n matrix becomes an n cross m matrix, and then you take the complex conjugate of each element. If we have a matrix with the property that A equals A Hermitian, it is known as a Hermitian symmetric matrix, or sometimes simply a Hermitian matrix. Naturally, for a matrix to be Hermitian symmetric it must be a square matrix, so that an n cross n matrix remains an n cross n matrix when you take its transpose. So, the Hermitian property essentially makes sense only for a square matrix.

Now, we want to prove some of the properties of Hermitian symmetric matrices in this example. The first property is that the eigenvalues of a Hermitian matrix are REAL. To prove this, let x bar be an eigenvector of A and lambda the corresponding eigenvalue, so that A x bar = lambda x bar.

(Refer Slide Time: 04:02)

Now, what we are going to do is take the Hermitian on the left and right hand sides. So, we have (A x bar) Hermitian = (lambda x bar) Hermitian, and we will use the property that the Hermitian of a product, (A B) Hermitian, equals B Hermitian times A Hermitian. This implies that x bar Hermitian times A Hermitian equals x bar Hermitian times lambda Hermitian. But lambda is a scalar, simply a number, and for a number the Hermitian is simply the complex conjugate; so I can write this as x bar Hermitian A Hermitian = lambda conjugate x bar Hermitian.

Now we multiply this on the right by x bar. This implies x bar Hermitian A Hermitian x bar = lambda conjugate x bar Hermitian x bar; but look at this, x bar Hermitian x bar is norm x bar square, so the right hand side is lambda conjugate times norm x bar square. Further, A Hermitian = A and A x bar = lambda x bar, so the left hand side is x bar Hermitian lambda x bar = lambda norm x bar square. This implies lambda norm x bar square = lambda conjugate norm x bar square, and since norm x bar square is not equal to 0, this implies lambda = lambda conjugate, which leads us to the conclusion that any eigenvalue lambda of a Hermitian symmetric matrix is a real quantity.

So, this implies that the eigenvalues of a Hermitian symmetric matrix are real. Now, how about the eigenvectors of a Hermitian symmetric matrix?

(Refer Slide Time: 05:56)

The eigenvectors of a Hermitian symmetric matrix satisfy an interesting property: eigenvectors corresponding to distinct eigenvalues are orthogonal, that is, their inner product is zero. We will demonstrate this fact. So, this is property number 2, another very interesting property; both these properties of Hermitian symmetric matrices are very interesting and have immense utility.

The second property, then, is that eigenvectors of a Hermitian symmetric matrix corresponding to distinct eigenvalues are orthogonal, and this is a very important and interesting property. So, let us consider two eigenvectors x i bar and x j bar, and remember, these correspond to distinct eigenvalues lambda i and lambda j, that is, lambda i is not equal to lambda j.

(Refer Slide Time: 08:02)

And now, what we want to show is that the inner product x j bar Hermitian x i bar equals zero. The proof proceeds as follows. We have A x i bar = lambda i x i bar, and if you multiply by x j bar Hermitian, we have x j bar Hermitian A x i bar = x j bar Hermitian lambda i x i bar, which is lambda i x j bar Hermitian x i bar. Let us call this the first result: x j bar Hermitian A x i bar = lambda i x j bar Hermitian x i bar.

(Refer Slide Time: 10:11)

Now, we also have that x j bar is an eigenvector corresponding to the eigenvalue lambda j, which implies A x j bar = lambda j x j bar. Taking the Hermitian, let us write that one step: (A x j bar) Hermitian = (lambda j x j bar) Hermitian, which implies x j bar Hermitian A Hermitian = lambda j conjugate x j bar Hermitian, because once again lambda j is an eigenvalue, simply a scalar quantity, and the Hermitian of a scalar is its conjugate.

(Refer Slide Time: 12:26)

So, this is lambda j conjugate; but remember, the eigenvalues of a Hermitian matrix are real, which implies that lambda j conjugate equals lambda j. We will use that property here, so this is equal to lambda j times x j bar Hermitian. And again realize that A is Hermitian symmetric, so A Hermitian is simply A. So, we have x j bar Hermitian A = lambda j x j bar Hermitian. Now, if you multiply by x i bar on the right, we have x j bar Hermitian A x i bar = lambda j x j bar Hermitian x i bar, and this we can denote as result 2.

(Refer Slide Time: 15:08)

Now, from result 1, x j bar Hermitian A x i bar = lambda i x j bar Hermitian x i bar, and from result 2, x j bar Hermitian A x i bar = lambda j x j bar Hermitian x i bar. So, from results 1 and 2, this implies lambda i x j bar Hermitian x i bar = lambda j x j bar Hermitian x i bar, which implies (lambda i minus lambda j) times x j bar Hermitian x i bar = 0.

And now, since lambda i is not equal to lambda j (remember, that is the key point, otherwise lambda i minus lambda j could be zero), this implies x j bar Hermitian x i bar = 0. So, this finally verifies the fact that eigenvectors corresponding to distinct eigenvalues of a Hermitian symmetric matrix are orthogonal. These are two important and interesting properties of Hermitian symmetric matrices that one uses frequently during the development of various techniques for optimization, all right.

(Refer Slide Time: 16:31)
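Both properties are easy to observe numerically; the following is a minimal sketch (assuming NumPy, with a randomly generated Hermitian matrix; eigh is NumPy's eigen-solver for Hermitian matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = M + M.conj().T                 # A equals A Hermitian by construction

lam, V = np.linalg.eigh(A)         # eigh is designed for Hermitian matrices

print(lam)                                        # all eigenvalues are real numbers
print(np.allclose(V.conj().T @ V, np.eye(n)))     # True: eigenvectors are orthonormal
```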


Let us look at another example: the eigenvalue decomposition of Hermitian matrices. Consider an n cross n complex Hermitian matrix, that is, A equals A Hermitian, and let lambda 1, lambda 2, ..., lambda n be the eigenvalues and V 1 bar, V 2 bar, ..., V n bar be the corresponding eigenvectors.

(Refer Slide Time: 18:25)

Now, observe that we can take these eigenvectors to be unit norm: remember, if you normalize an eigenvector it still remains an eigenvector, because you are simply dividing it by its norm, that is, scaling it by a constant. So, let us consider the eigenvectors to be unit norm, that is, norm V 1 bar = norm V 2 bar = ... = norm V n bar = 1. Further, from the property shown previously, let us assume that the eigenvalues are distinct, which implies V i bar Hermitian V j bar = 0 for i not equal to j; that is, if the eigenvalues are distinct, then the eigenvectors satisfy V i bar Hermitian V j bar = 0.

(Refer Slide Time: 20:29)

Now, consider the matrix of eigenvectors V = [ V 1 bar, V 2 bar, ..., V n bar ]. If you now form V Hermitian V, what you are going to have is the stack of rows V 1 bar Hermitian, V 2 bar Hermitian, ..., V n bar Hermitian times the columns V 1 bar, V 2 bar, ..., V n bar. If you look at the entries, V 1 bar Hermitian V 1 bar is 1, V 1 bar Hermitian V 2 bar is 0, V 2 bar Hermitian V 1 bar is 0, V 2 bar Hermitian V 2 bar is 1, and so on. So, this you can see is simply the identity matrix, which implies V Hermitian is the inverse of V and V is the inverse of V Hermitian. So, we have something interesting: V Hermitian V = identity.

(Refer Slide Time: 20:57)

So, V Hermitian V = I implies V Hermitian = V inverse, and since the inverse of a square matrix is unique, this also implies V = (V Hermitian) inverse. Further, since for square matrices A B = I implies B A = I, this also implies that V V Hermitian = I. Such a matrix V, a square matrix which satisfies V V Hermitian = V Hermitian V = I, is termed a unitary matrix.

(Refer Slide Time: 22:21)

Now, let us look at the product A V, that is, A times the matrix of eigenvectors [ V 1 bar, V 2 bar, ..., V n bar ]. You can see this is nothing but [ A V 1 bar, A V 2 bar, ..., A V n bar ]. But these are eigenvectors, so A V 1 bar is lambda 1 times V 1 bar, A V 2 bar is lambda 2 times V 2 bar, and A V n bar is lambda n times V n bar; these are the various columns, which you can now write differently, remembering that we are looking at A times the matrix V.

(Refer Slide Time: 22:57)

So, A times the matrix V equals [ lambda 1 V 1 bar, lambda 2 V 2 bar, ..., lambda n V n bar ], and now you can write this as [ V 1 bar, V 2 bar, ..., V n bar ] times the diagonal matrix with lambda 1, lambda 2, ..., lambda n on the diagonal. The first factor is nothing but your matrix V, and the diagonal matrix we denote by capital Lambda; this is the diagonal matrix of eigenvalues.

(Refer Slide Time: 24:32)

So, we have A times V = V times Lambda. Multiplying on both sides on the right by V Hermitian, A V V Hermitian = V Lambda V Hermitian, and since V V Hermitian is the identity, this implies that the Hermitian symmetric matrix A can be expressed as A = V Lambda V Hermitian, where V is the matrix of eigenvectors and Lambda is the diagonal matrix of eigenvalues. This is termed the Eigenvalue Decomposition.

(Refer Slide Time: 25:37)

This Eigenvalue Decomposition of A has many interesting properties. For instance, if you want to compute A times A, which is A square, this will be V Lambda V Hermitian times V Lambda V Hermitian; now V Hermitian V is the identity, so this is equal to V Lambda Lambda V Hermitian, which equals V Lambda square V Hermitian.

(Refer Slide Time: 26:52)
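A short numerical illustration of the decomposition A = V Lambda V Hermitian and of the power property discussed next (a sketch assuming NumPy; the random Hermitian matrix and the exponent are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = M + M.conj().T                      # Hermitian symmetric matrix

lam, V = np.linalg.eigh(A)              # real eigenvalues, unitary V
Lam = np.diag(lam)

print(np.allclose(A, V @ Lam @ V.conj().T))      # True: A = V Lambda V^H
print(np.allclose(V @ V.conj().T, np.eye(n)))    # True: V is unitary

m = 3
print(np.allclose(np.linalg.matrix_power(A, m),
                  V @ np.diag(lam**m) @ V.conj().T))   # True: A^m = V Lambda^m V^H
```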

So, that gives A square. Similarly, you have many other interesting properties; for instance, you can generalize this: A raised to the power m, for a Hermitian symmetric matrix, is V Lambda to the power m V Hermitian. And Lambda to the power m is easy to compute, because Lambda is a diagonal matrix: Lambda to the power m is simply the diagonal matrix with lambda 1 to the power m, lambda 2 to the power m, ..., lambda n to the power m on the diagonal. So, this is very easy to compute, and frequently you will see that this is a very interesting property as well as a very handy tool to perform several matrix manipulations. That is the eigenvalue decomposition of a matrix, and it is also one of the fundamental decompositions, or you can call it one of the fundamental properties, of a matrix, all right. So, we will stop here and continue with other aspects in the subsequent modules.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 10
Matrix Inversion Lemma (Woodbury identity)

Hello, welcome to another module in this massive open online course. We are looking at the mathematical preliminaries and the examples for the various mathematical preliminaries. Let us continue our discussion and look at another important principle that comes in handy several times; this is known as the Matrix Inversion Lemma or the Matrix Inversion Identity.

(Refer Slide Time: 00:33)

So, in this module on linear algebra and matrix preliminaries for optimization, we want to look at the matrix inversion identity. There are many names for this: it is also termed the matrix inversion lemma, and it is also popularly known as the Woodbury matrix identity. This is our example number 7. What is it? It is a very convenient property, you can also say a trick, for computing the inverse of a matrix. What it states is the following: suppose I have to compute a matrix inverse of the form (A + UCV) inverse.

(Refer Slide Time: 02:27)

To compute the inverse of the matrix A + UCV, the identity states that

(A + UCV) inverse = A inverse - A inverse U (C inverse + V A inverse U) inverse V A inverse.

This is especially convenient if the inverse of A is already known, let us say A is a large matrix for which the inverse is already known or can be computed rather easily, and the quantity UCV is a low rank matrix. So, this is very handy if A inverse is known, since the right hand side only requires A inverse, and when UCV is of low rank, that is, not a full rank matrix; although it can be used even when UCV is full rank, it is most handy in the low rank case.

(Refer Slide Time: 04:32)

For instance, to understand this better, let us look at a simple example, or let us call it an illustration. Consider computing the inverse of I + x bar x bar transpose, where x bar is a vector [ x 1, x 2, ..., x n ] transpose. So, we want to compute (I + x bar x bar transpose) inverse, and you can see that x bar x bar transpose has rank equal to 1: if you compute x bar x bar transpose, you will realize it is a rank 1 matrix, or basically a rank deficient matrix. And if you look at I, this is a matrix for which you can easily compute the inverse; I inverse is nothing but I. So, this is a very illustrative case where the matrix inversion identity can be used. In fact, if you compare with the identity above, I is our A, x bar is our U, there is no C, so C is basically the constant 1, and V equals x bar transpose. So, in our matrix inversion lemma we have A = I, U = x bar, C = 1 and V = x bar transpose. With these settings, we can now apply the matrix inversion lemma.
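Before specializing to this rank 1 case, here is a quick numerical check of the identity itself (a minimal sketch assuming NumPy; the sizes and the random matrices are arbitrary, chosen so that the required inverses exist):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2                              # U C V is a low-rank (rank-2) update
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + U @ C @ V)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv

print(np.allclose(lhs, rhs))             # True
```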

(Refer Slide Time: 06:42)

Now, note that A equals the identity, so A inverse is also the identity. Substituting into the lemma, with C inverse = 1, U = x bar, V = x bar transpose and A inverse = I, we get

(I + x bar x bar transpose) inverse = I - I x bar (1 + x bar transpose I x bar) inverse x bar transpose I.

Now, x bar transpose x bar is simply norm x bar square, and what you can see is that the quantity 1 + norm x bar square is a scalar, simply a number, because norm x bar is the length of the vector x bar, so norm x bar square is also a number. So, 1 + norm x bar square is a number, and the inverse of a number is simply its reciprocal, that is, 1 over that number. Once you realize that, and of course the remaining factors are simply identity matrices, I can simply write this as

(I + x bar x bar transpose) inverse = I - x bar x bar transpose / (1 + norm x bar square).

So, this is basically your expression for the inverse of I + x bar x bar transpose, which you can now readily compute.

(Refer Slide Time: 09:00)
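A quick numerical check of this expression (a sketch assuming NumPy; the vector x is an arbitrary random choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
x = rng.standard_normal(n)

direct = np.linalg.inv(np.eye(n) + np.outer(x, x))
formula = np.eye(n) - np.outer(x, x) / (1.0 + x @ x)

print(np.allclose(direct, formula))      # True
```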


Now, if you look at this quantity x bar transpose x bar that is equal to norm of x bar
square. So, this is therefore, equal to I plus x bar x bar transpose minus x bar x bar
transpose by 1 plus norm x bar square minus x bar x bar transpose x bar transpose x bar
is norm x bar square it is a scalar which comes out norm x bar square times x bar x bar
transpose by 1 plus norm x bar square now if you look at these 2 terms you have x bar x
bar transpose 1 plus norm x bar square and norm x bar square into x bar x bar transpose
by 1 plus norm x bar square.

(Refer Slide Time: 11:36)

So, we have used this handy property that is the matrix we have demonstrated this using
the matrix inversion identity or the Woodberry matrix inversion or the Woodberry matrix
inversion lemma all right. So, basically that completes the example.

Thank you very much.

So, this is simply I plus x bar x bar transpose minus x bar x bar transpose times 1 plus
norm x bar square correct 1 plus norm x bar square divided by 1 plus norm x bar square,
which is equal to I plus x bar x bar transpose minus x bar x bar transpose which is indeed
equal to identity and therefore what we have checked is that I plus x bar x bar transpose
inverse is indeed I minus x bar x bar transpose by 1 plus norm of x bar square.

(Refer Slide Time: 12:17)


Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 11
Introduction to Convex Sets and Properties

Hello. Welcome to another module in this massive open online course. Let us start our discussion on optimization by looking at some of the fundamental building blocks of optimization, first looking at the notion of a convex set and the various properties of convex sets. So, we want to start our discussion of optimization.

(Refer Slide Time: 00:36)

One of the important concepts to understand in convex optimization is a convex set, its definition and its properties. So, to define a convex set, let us start with the following setup: consider 2 points x 1 bar and x 2 bar in n dimensional space, which means these are vectors, general points, in n dimensional space. So, we are considering 2 points in n dimensional space; let me describe these points. I have point 1, let us say this is your x 1 bar, and point 2, this is your x 2 bar, and this is the line segment that is joining them, the line segment between x 1 bar and x 2 bar.

(Refer Slide Time: 02:13)

Now, consider a linear combination of the form theta x 1 bar + (1 - theta) x 2 bar, such that 0 is less than or equal to theta, which is less than or equal to 1. So, we are considering a combination of x 1 bar and x 2 bar with the weights theta and 1 - theta, that is, theta times x 1 bar plus (1 - theta) times x 2 bar. Another important aspect to note here is that we are not allowing any value of theta, but only values of theta lying between 0 and 1.

And now, for instance, let us take a look at the various points generated by this combination. If theta equals 1, let us denote this point by x bar: we have x bar = 1 times x 1 bar + 0 times x 2 bar, so this is x 1 bar. If theta equals 0, on the other hand, x bar is simply, as you can check, 0 times x 1 bar + 1 times x 2 bar, so this is x 2 bar. If theta equals half, this is half times x 1 bar + half times x 2 bar, that is, (x 1 bar + x 2 bar)/2, which you can see is the midpoint of the line segment between x 1 bar and x 2 bar. So, what you observe is that as theta varies from 0 to 1, the combination theta x 1 bar + (1 - theta) x 2 bar traces the line segment between x 1 bar and x 2 bar.
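A tiny numerical illustration of these values of theta (a sketch assuming NumPy; the two points are arbitrary example choices):

```python
import numpy as np

x1 = np.array([1.0, 2.0])     # x 1 bar
x2 = np.array([4.0, 6.0])     # x 2 bar

for theta in (1.0, 0.0, 0.5):
    point = theta * x1 + (1 - theta) * x2
    print(theta, point)
# theta = 1.0 -> [1. 2.]     (the point x 1 bar)
# theta = 0.0 -> [4. 6.]     (the point x 2 bar)
# theta = 0.5 -> [2.5 4. ]   (the midpoint of the segment)
```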
(Refer Slide Time: 04:39)

So, as theta varies from 0 to 1, theta times x 1 bar plus (1 - theta) times x 2 bar traces the line segment between x 1 bar and x 2 bar, for the various values 0 less than or equal to theta less than or equal to 1. That is the first basic concept.

(Refer Slide Time: 05:13)

Now, what is the definition of a convex set; when is a set known as a convex set?

(Refer Slide Time: 06:11)

A set S is termed a convex set if, whenever any 2 points x 1 bar and x 2 bar belong to S, the entire line segment between x 1 bar and x 2 bar belongs to S; this symbol means "belongs to", all right. So, writing this mathematically, what it means is the following: x 1 bar, x 2 bar belonging to S implies that the line segment, which, remember, we just demonstrated is described by theta x 1 bar + (1 - theta) x 2 bar for 0 less than or equal to theta less than or equal to 1, also belongs to S, and this must hold for all 0 less than or equal to theta less than or equal to 1. What this says is that if you pick any 2 points x 1 bar, x 2 bar belonging to the set S, then the entire line segment between x 1 bar and x 2 bar belongs to S; and if this is true for any such pair of points x 1 bar, x 2 bar, such a set is known as a convex set.
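As a small numerical illustration of this definition, the sketch below samples the segment between two points of a set and tests membership (assuming NumPy; the unit disc is used here purely as an example of a convex set):

```python
import numpy as np

def in_unit_disc(p):
    """Membership test for the example set S = { p : ||p|| <= 1 }."""
    return np.linalg.norm(p) <= 1.0

x1 = np.array([0.6, 0.3])     # two points chosen inside the disc
x2 = np.array([-0.5, 0.7])
assert in_unit_disc(x1) and in_unit_disc(x2)

# Every point theta*x1 + (1 - theta)*x2, 0 <= theta <= 1, should remain in S.
thetas = np.linspace(0.0, 1.0, 101)
print(all(in_unit_disc(t * x1 + (1 - t) * x2) for t in thetas))   # True
```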

If you look at the combination theta x 1 bar + (1 - theta) x 2 bar, it is also termed a convex combination; it denotes a point on the line segment between x 1 bar and x 2 bar for any particular value of theta lying between 0 and 1. And the mathematical way of stating the definition is that if you pick any 2 points x 1 bar, x 2 bar in S and construct theta times x 1 bar plus (1 - theta) times x 2 bar, which represents a point on the line segment for the various values of theta between 0 and 1, then this point, represented by
theta times x 1 bar plus 1 minus theta times x 2 bar must belong to x must belong to the (Refer Slide Time: 10:02)
set S ok.

It is a very simple example; it is a very simple definition.

(Refer Slide Time: 08:51)

For instance, you can also quickly check many other sets such as hexagon for instance.

These need not be regular shapes for instance a hexagon you take any 2 points; you join
them by the line segment. So, the hexagon you can clearly see is a hexagon is also a
convex set ok. On the other hand, if you have a region like this, you can clearly see this
For instance, you can readily see that an ellipse is a convex set. This can be formally also
kind of a set this kind of a set is not convex, because if you take any 2 points and join it
shown which we will show later. That is, you can choose any 2 points and the entire line
by a line segment then the line segment does not is not entirely contained in S. The line
segment joining the 2 points you can check lies in the set. So, for any 2 points, entire line
segment you can quickly verify the line segment part of the line segment lies part of the
segment is contained in set. So, for any 2 points in set entire line segment is contained.
line segment lies outside the set s. So, if this is your set S, S is not convex ok.
For any 2 points in the set the entire line segment between them is contained in the set.
So, you take 2 points and join them part of the line segment lies outside the set S that is a
line segment is not entirely. Remember that it is not just a few points of the line segment
the entire line segment has to be contained in S. Only then and that has to be true for any
set of points in S.

Now, for instance you can look at this even though if you choose these 2 points here then
the line segment is contained in S right, but it is not only for a particular set of points this
has to be true for any set of points. And the entire line segment between that chosen set
of points all right that has to be completely contained in S.
(Refer Slide Time: 12:40) (Refer Slide Time: 14:19)

Let us come to the notion of a convex combination another very useful notion is that of a Now, I can what I can do is, I can generalize starting from here I can generalize. I can
convex combination. And a convex combination is as follows that is if we have generalize this notion of convex combination to include n points. So, you consider k
remember we said, theta times x 1 bar plus 1 minus theta times x 2 bar consider this points.
combination, I can equivalently represent this as theta 1 times x 1 bar plus theta 2 times x
(Refer Slide Time: 14:49)
2 bar, but they have to satisfy the property remember theta 1 plus because remember we
have theta and 1 minus theta. So, the implies theta 1 plus theta 2 equals to 1 and 0 less
than theta lies between 0 and 1, which means we have to have the property 0 less than or
equal to theta 1 comma theta 2 less than or equal to 1. Or you can equivalently say that
theta 1 comma theta 2 theta 2 comma theta 1 greater than or equal to 0 and theta 1 plus
theta 2 equals one ok.

So, a convex such a combination is known as a convex combination. This is known as a


termed as a convex combination of x 1 bar and x 2 bar. Now I can generalize this notion
of convex combination.

K points in n dimensional space that is x 1 bar x 2 bar, so on up to x k bar. And you


consider theta 1 theta 2 theta k such that, theta 1 plus theta 2 plus theta k equals 1 and
each theta i is greater than or equal to 0 and perform the combination theta 1 x 1 bar plus
theta 2 x 2 bar plus so, on theta k x k bar. This is termed as a convex combination of x 1 (Refer Slide Time: 18:36)
bar x 2 bar and up to x k bar.

So, this is a convex combination this is a convex combination of x 1 bar x 2 bar up to x k


bar. So, let us now look at the convex hull of a set.

(Refer Slide Time: 16:32)

Such as this kind of region that we looked this is a non-convex set for a non-convex set
the convex hull simply fills this region to make it a convex set.

So, the earlier one is a non-convex set. So now, once you fill this what you get that is this
entire set that you get is now a convex hull. So, the convex hull that makes it basically
that converts that you will see, you can say including includes all the convex
So, the other concept we want to look at is the notion of a convex hull. And the convex
combinations of all the original points in the set S to convert it into a convex set. That is
hull of a set is simply it is nothing but basically this is the set of. So, given set S for
the convex hull of a given set S alright.
convex the convex hull is the set of all of convex combination set of all convex
combinations set of all the convex combinations all the convex combinations of points in So, let us stop this module here. And we will look at other aspects in the subsequent
S. modules.

So, you take a set for any given set consider the set of all the convex combinations of the Thank you very much.
points in S, S and that gives the convex hull of the set. Now naturally observe that for
any convex set S the convex hull is the set itself because S if S is a convex set; then it
already contains all the convex combinations of the points in S ok. So, for a convex set,
for instance we saw yesterday that the hexagon is a convex set correct for a convex set
convex hull equals the set itself, because it already contains all the combinations convex
combinations for a non-convex set.
Applied Optimization for Wireless, Machine Learning, Big Data 0 and 1 however, in this case there is no such restriction and theta can take any real value
Prof. Aditya K. Jagannatham
ok.
Department of Electrical Engineering
Indian Institute of Technology, Kanpur
(Refer Slide Time: 01:55)
Lecture -12
Affine Set Examples - Line, Halfspace, Hyperplane and Application - Power
Allocation for Users in Wireless Communication

Hello welcome to another module in this massive open online course. So, we are looking
at the building blocks and the various fundamental definitions required to develop the
optimization techniques and we have previously looked at the definition notion of the
convex set the convex combination of the set of points and also the concept of a convex
hull. So, let us continue this discussion by look at looking at something slightly different
today that is the definition of an affine set.

(Refer Slide Time: 00:41)

And this is an important observe there is no restriction on theta. In fact, theta can take. In
fact, theta can take any real value and such a combination now this is an affine
combination and basically you can say that for various values of theta this represent the
entire line represents the line through x 1 bar comma x 2 bar.

So, previously when 0 is less than equal to theta is less than equal to 1 it simply
represented the line segment between x 1 bar and x 2 bar. Now if you remove that
restriction on theta, it represents the entire line that is any point on the line is captured by
this combination theta times x 1 bar plus1 minus theta times x 2 bar. Now if this belongs
to the set S whenever x 1 bar and x 2 bar belong to the set S that is the entire line all
right, entire line joining the points x 1 bar and x 2 bar belongs to the set S for any two
So, what you want to look at is the notion or the you want to learn the concept of an points x1 bar x2 bar belonging to the set S such a set is known as an affine set.
affine set. And this is very simple the affine set in the previous module we have seen the
notion of a convex set. Now what an affine set is that is if you consider any two points x
1 bar and x 2 bar ok. So, similar to the definition of convex set consider two n
dimensional points x1 bar comma x2 bar and now perform the combination theta x1 bar
plus 1 minus theta x2 bar, but the theta can take and real value theta can take any
remember in the convex for a convex combination we had restricted theta to lie between
(Refer Slide Time: 03:38) (Refer Slide Time: 06:27)

That is set S tilde is affine if x 1 bar comma x 2 bar belongs to S tilde ok. Remember we All convex sets are not affine and that is very easy to see take a simple example for
saw we seen this symbol belongs to S tilde implies theta times x1 bar plus 1 minus theta instance if you consider a circle correct.
times x2 bar also belongs to x tilde for all real values of theta, that is for any real value of
(Refer Slide Time: 06:51)
theta. That is for any two points, for any to given any two points in S tilde entire line
joining the two points entire line joining the two points the entire line joining the two
points lies in S lies in S tilde.

And note that the affine set is convex and that you can note this is an interesting property
every affine set every affine set is convex correct. The reason is very simple because if it
contains the entire line joining the two points, naturally it contains the line segment that
is for any x1 bar x2 bar belonging to S tilde since it contains the entire line or if it is
affine, it naturally also contains the line segment all right.

So, the convex set is a special case of an affine set all right. So, every affine set is a is
also a convex. I am sorry affine set is a special case of a convex set all right. So, every
affine set is a convex set, but note that every convex set need not be an affine set all right
note that also, but all convex sets are not affine.
Which we saw yesterday is a convex set because if you took take any two points correct.
The line segment joining the two points which is contained, but if you extend that to
form the line the you can see that the entire line.

So, the line segment is contained. So, this is your S line segment is contained in S this is
these are your points x1 bar x2 bar the line segment belongs to the S , but the line does
not belong to S that is it does not the entire line does not belong to S it is a very simple (Refer Slide Time: 09:11)
thing ok. So, every affine set is convex, but every convex set is not affine ok. So, these
are the, that these are the this is the interesting relation between affine sets and convex
sets ok. All right now let us look at some examples to understand these better examples.

(Refer Slide Time: 08:02)

For x1equals for x2 equals 0 x1 equals you can see this is 3 and when x1 equal to 0 x2
equals 2. So, if you plot the line that will look something like it will look something like
this ok.

So, this is the line 2 x1 plus 3 x2 equal to 6 and remember line is a trivial example of an
Examples of convex sets and affine, let us look at some examples of convex and affine affine set correct because if you take any two points on the line all right if you take the
sets and consider a simple line for if simple example for instance; consider a simple line line as a set if you take any two points on the line and join the line correct. Naturally the
in 2 dimensions the line let us say is given by the equation 2 x1 plus 3 x2 equal to 6 ok. entire line, which is the same line belongs to that set all right. So, the line is a trivial
Now if you plot that line it looks something like this. example. So, any line is a trivial example of an affine set ok. So, let us note that. So, this
line is affine line is affine and it is also convex because every affine set is convex the line
is also line is also convex.

Now, the interesting thing occurs when you look at these regions now the line is
partitioning this plane into two regions if you look at this ok. Now this region is the
region 2 x1 plus 3 x2 greater than or equal to 6 and this region is the region 2 x1 plus 3
x2 is less than or equal to 6 and these regions are known as half spaces.
(Refer Slide Time: 11:20) (Refer Slide Time: 14:01)

So, the line divides the plane into two regions 2 x1 plus 3 x2 greater than or equal to six So, a bar transpose x bar equals b this is a hyper plane a general equation for a hyper
or 2 x1 plus 3 x2 less than or equal to 6 and these are known as half spaces these are plane this is an equation for a hyper plane and you can see that this is affine a hyper
known as half spaces ok. So, we have a line and the line divides the plane into two plane is affine which also implies that this is convex as well right for instance if x1 bar
regions. So, the line is convex and also affine. In fact, and it divides the plane into two comma x2 bar belong to the set S that is you can quickly verify this that is a bar
regions are half spaces and note that half spaces are only convex they are not affine ok. transpose x1 bar equals b a bar transpose x2 bar equals b this.

So, these half spaces these are convex and these are not affine. So, a line is affine which (Refer Slide Time: 15:05)
implies it is also a convex, but half space is only convex and not affine now if we
generalize this. So, n dimensions in n dimensions one can consider an n dimensional
equation which is of the form a1 x 1 plus a 2 x 2 plus an xn equals b which implies if I
write it in vector notation that we are familiar with a one.

I can write it as the row vector a one times the column vector x1 x2 xn equals b which
implies now I can denote this by a bar transpose and this by x bar this I can denote by x
bar. So, I can write this as a bar transpose x bar equals b and this equation in n
dimensions this represents what is known as a hyper plane in n dimensions this is a hyper
plane which is in fact, you can see it is affine ok.

Now, consider theta times x1 bar plus 1 minus theta times x 2 bar theta times x1 bar plus
1 minus theta times x2 bar this equals theta times a bar transpose x 1 bar plus 1 minus
theta times a bar transpose x2 bar which is theta times b plus 1 minus theta times b that is So, the hyper plane divides it into two regions correct a bar transpose x bar less than
equal to b note that there is no restriction on theta valid for any theta. equal to b. These are half spaces that are. In fact, the general equation of half space you
can always remember represented by a bar transpose x bar less than equal to b for
(Refer Slide Time: 15:51)
instance. Example you have 2 x1 plus 3 x2 less than equal to 6, you also have the other
half space that is 2 x1 plus 3 x2 greater than equal to 6, which basically implies minus 2
x1 minus you take the negative minus 2 x1 minus 3 x 2 less than or equal to 6.

So, the general equation of a half space which is of the form again a bar transpose x bar
less than or equal to b ok.

(Refer Slide Time: 17:50)

Element on R implies this as affine. If it is only valid for zero less than equal to theta less
than equal to 1 it is convex in this case there is no restriction on theta. So, this is affine.

So, you can see that hyper plane is an affine set, now this hyper plane divides the space
into 2 the n dimensional space into two regions a bar transpose x bar greater or equal to b
a bar transpose x bar less than equal to b these two regions are known as half spaces ok.

(Refer Slide Time: 16:28)


So, the general equation of a half space. So, a bar transpose x bar less than equal to b is
the general expression for general representation of a half space ok. Thus these half
planes this hyper planes and half spaces are complex therefore, the important thing to
realize here therefore, hyper planes and half spaces. Hyper planes are affine as well, but
for our purposes it is enough to note that hyper planes as well as half spaces are convex
ok. Hyper planes as well as half spaces are convex ok.
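As a small numerical illustration that a half space is convex, the sketch below uses the half space 2 x1 + 3 x2 less than or equal to 6 from the earlier example (assuming NumPy; the two chosen points are arbitrary):

```python
import numpy as np

a = np.array([2.0, 3.0])      # a bar
b = 6.0                       # the half space is { x : a^T x <= b }

def in_halfspace(x):
    return a @ x <= b

x1 = np.array([0.0, 0.0])     # two points in the half space
x2 = np.array([3.0, 0.0])
assert in_halfspace(x1) and in_halfspace(x2)

# Every convex combination of x1 and x2 stays in the half space.
thetas = np.linspace(0.0, 1.0, 11)
print(all(in_halfspace(t * x1 + (1 - t) * x2) for t in thetas))   # True
```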

Now, what we want to do is explore a practical application, because remember, we also want to explore practical applications of the concepts that we learn for optimization. So, let us look at one of the practical applications of the concepts that we have just learned regarding convexity, and how these influence practical optimization problems that arise in wireless communication scenarios.

(Refer Slide Time: 19:34)

So, what we want to look at is a practical application: for instance, consider a wireless system with multiple users. We want to start by considering a wireless system with multiple users, and for instance, let us say you have a base station.

(Refer Slide Time: 20:28)

And you are transmitting signals to multiple users: this is, let us say, user 1; you have another, let us say user 2, somewhere; and so on and so forth, at some other point you have user k. So, we are considering a downlink scenario in which the base station is transmitting to different users.

(Refer Slide Time: 22:32)

Now, let P 1 denote the power to user 1, P 2 denote the power to user 2, and so on and so forth, with P k denoting the power to user k. So, P i equals the signal power, you can say the power, allocated to user i by the base station. Then, P 1, P 2, ..., P k are the powers that are allocated to the different users 1 to k; but the total power allocated to the different users has to be less than or equal to the total power, the maximum power, available at the base station. So, that is the constraint that we have in a practical wireless scenario.

(Refer Slide Time: 20:28)

So, the power that is allocated to the different users that is P 1 P 2 P k these are the
powers allocated to different users. These are the powers that are allocated to different
users. Now when we consider P 1 plus P 2 plus P k this has to be less than or equal to P
all right. So, the sum power of all the users sum of powers of all users has to be less than
or equal to P which is the total power of the base station so, that has the total base station (Refer Slide Time: 25:20)
powers.

(Refer Slide Time: 23:37)

That is you are less than or equal to by equal to. So, this is a equality this is an equality
power constraint and note that this represents the hyper plane so all the feasible powers
lie on a hyper plane represents a k dimensional. This represents a k dimensional hyper
So, the sum of the powers of all the users or basically if you look at sigma P i summation
plane all right.
P i i equal to 1 to k that has to be less than or equal to P. And you can see this constraint
is basically P 1 plus P 2 plus P k less than equal to P this is nothing, but half space So, we have this power constraint in a wireless communication system that can either
constraint because it is a linear combination of P 1 plus P 2 plus P k all right all right and you have any equality that is a sum total of the powers of the different users is less than
you can consider the weighting coefficients a 1 a 2 a k to be unity that is one and or equal to P that is the total power of the base station. That is basically half space
therefore, we have P 1 plus P 2 plus up to P k less than equal to p this. In fact, represents constraint and when your equality power constraint that is sum total of powers of all the
a half space. users has to be equal to the power of the base station that basically, represents a hyper
plane which means the set of all feasible powers corresponds to a hyper plane in k
So, this is a very important constraint in wireless communication this is nothing, but a
dimensional space all right.
half space. So, basically the set of all feasible powers possible powers that satisfy this
constraint, lie in a half space that is the interesting interpretation that is that one can So, this is an interesting practical perspective to the theoretical concepts of convex sets
make here all right. So, the set of all feasible powers in the wireless scenario this is an and affine sets that we have just seen. And we will explore several more links between
important notions set of all feasible, feasible in the sense that satisfy the constraint. the various theoretical concepts or the theoretical building blocks of optimization and it
is relation to practical applications in several fields as wireless communications be it
Set of all feasible powers lie in a half space the set of all feasible powers lie in a half
signal processing or so on. So, we will stop here and continue in the subsequent modules
space. Now you can also have an equality power constraint, that is you do not want to
waste any power and you want to set the power of all the users equal to p that is P 1 plus Thank you very much.
P 2 plus P k equal to P and this is an equality power constraint.
Applied Optimization for Wireless, Machine Learning, Big data (Refer Slide Time: 02:21)
Prof. Adithya K Jagannatham
Department of Electrical Engineering
Indian Institute Of Technology, Kanpur

Lecture – 13
Norm Ball and its Practical Applications: Multiple Antenna Beamforming

Hello, welcome to another module in this massive open online course. So, we are
looking at the basics of convex optimization. In particular, we have looked at the concept
of convex set hyper planes and hyper spaces and we are looking at applications of these
concepts in wireless communication, that is practical applications of this concept, alright.
So, today, let us look at another application of the same concept that is in Multi Antenna
Beamforming, ok.

(Refer Slide Time: 00:41)


And, this multi antenna system what we have is we have this channel coefficients h 1, h
2, up to h 1, h 2 up to h L, these are the channel coefficients. These are also known as the
feeding channel coefficients because the wireless channel is typically a fading channel
that is received power of as a channel is varying with time, it is increasing and
decreasing. So, the wireless channel coefficients are also known as fading channel
coefficients, and h 1, h 2, h L denote the L channel coefficients corresponding to the L
antennas in this multi antenna receiver.

And, this is also known as a single input multiple output or a SIMO receiver, ok. So, this
is also typically known as such a system is also known as a SIMO or a single input
multiple output in the sense we have multiple antennas. Single input multiple output
system and let us now assume the combining it h 1, h 2, h L are the channel coefficients.

So, what I want to look at is I want to introduce another application for of the concept of
hyper planes and half spaces and this is in the context of multi antenna beam forming.
And, what happens in multi antenna beam forming is basically you have receiver with
multiple antennas, correct and let us say this is a receiver, the wireless receiver in
wireless communication system, and you have multiple antennas each of these is an
antenna and you have the signals that are coming with various channel coefficients, ok.
So, you have the various channel coefficients corresponding to the antennas 1, 2 up to let
us say L. So, there are total of L antennas.
(Refer Slide Time: 03:53) (Refer Slide Time: 06:18)

What we are going to do is we are going to combine the signals with weights W 1, W 2, This is the vector h bar transpose times the vector if you look at this W 1, W 2 W L this
W L, so, W 1, W 2, W L you can think of these as the weights of the combiner; these are is equal to 1 you can call this as the beam forming vector or this is basically also your
the weights of the combiner, these are the weights of combiner or you can also think of receive beam former. So, h bar transpose W bar where W bar is a vector of combining
this as a beam former, weights of the combiner or also beam former and what we are weights or the combiner or the receive beam former is 1 to ensure unity gain.
performing is we would weigh the received signals and add them.
(Refer Slide Time: 06:54)
So, we are performing a linear combination of the signals and therefore, if you look at
the effective gain of the signal gain across at the output of the combiner that will be the
weights times the corresponding channel coefficient and the sum. So, you can think of
this as the effective you can think of this as the effective gain what the effective signal
gain, effective signal gain at the output of the combiner. And, to normalize this effective
signal gain what we do is we set this equal to 1. So, typically what we have what this
implies is that we want to design a system such that this effective signal gain at the
output of the combiner that is we look at the output of the combiner the effective signal
gain is unity, alright.

So, you like to design such a combiner, this is typically a constraint in multiple antenna
processing one of the types of constraints that can be employed in multiple antenna
processing, ok. So, what we have is this can be written as h 1, h 2, up to h L that is the
Which basically implies h bar transpose W bar equals 1 and this you can see now is
channel coefficients.
nothing, but a hyper plane constraint, ok. So, this is a practical application of the concept
of hyper plane in a wireless communications which says that all this beam forming
vectors W bar lie on this hyper plane described by h bar transpose W bar equals 1. This is So, this describes the interior of a circle which is nothing, but a 2 dimensional ball or a
the hyperplane constraint and what is this is doing is basically this is ensuring unity this sphere interior of circle of radius equal to r and with center origin.
is ensuring unity gain for the desired user or desired signal, that is what your doing is
(Refer Slide Time: 10:36)
your ensuring that the gain signal gain corresponding to a particular desired user or
signal is unity at the output of the combiner.

So, this signal is unity and then what you can do is you can either suppress, you can
typically either suppress the noise or suppress the interfering signals of the interfering
users. So, this is typically constraint that is employed in multi antenna signal processing
in a wireless communication system, alright. So, let us so, we have seen the definition
the notions of hyperplanes and half spaces. Let us know move on to different key type or
different types of convex sets in particular let us look at spherical balls or the norm what
are also known as norm balls, ok.

(Refer Slide Time: 09:05)

Now, if the center is not the origin then you can simply shifted to origin by considering
norm x bar minus x c bar less than or equal to r this is the general this is a circle or this is
the circle or a sphere. In fact, you can if x bar is n dimensional vector if you consider as
n dimensional vector, this is the sphere or a ball with center at x c bar, ok.

So, this describes interior of n dimensional ball with centre x c bar and radius r, and this
can be seen to be convex and that can be briefly justified as follows that is this region is
convex for sake of simplicity, let us consider simply the ball with centre at origin.
Remember if the ball with centre at origin is convex then naturally if you shift it to any
centre x c bar it is also going to be convex because shifting does not affect the convexity,
alright. So, if norm x bar less than or equal to r is convex then norm x bar minus x c bar
So, the next type of convex set we want to look at is that of a norm ball or basically a less than equal to r is also convex because a translation does not affect the convexity of
Euclidean ball. So, what we want to look at is this constraint of a norm ball in particular the object convexity of the region, or the set, ok.
a Euclidean ball. And, if you look at this, for instance you look at ball in 2 dimensions is
And, consider two points to verify this simply consider two points x 1 bar. So, we have
basically a circle with a certain centre let us say at the origin 0 and this is has a radius r,
to demonstrate that given any two points x 1 bar x 2 bar their convex combination lies in
then if you look at any point in the interior of the circle x bar, if you look at norm that is
the set.
the length of this vector which is norm of x bar that has to naturally be less than or equal
to r which is the radius, ok.
(Refer Slide Time: 12:51) (Refer Slide Time: 14:02)

Consider x 1 bar, x 2 bar let us denote this set by B, belong to B. Then what we have is Now, remember theta is positive because this is the convex combination. So, we have 0
we have norm by definition since they belong to the interior of the ball norm x 1 bar less less than equal to theta less than or equal to 1 which basically implies that theta coma 1
than equal to r, norm x 2 bar less than or equal to r. Now, let us consider the convex minus theta are both greater than equal to 0. So, a norm of theta times x 1 bar is theta
combination we have theta times norm theta times x 1 bar plus 1 minus theta times x 2 times norm of x 1 bar because theta is greater than or equal to 0 plus 1 minus theta times
bar, we have to show that the norm of this is less than equal to r. So, that this also lies in norm of x 2 bar since, 1 minus theta is also greater than equal to 0. Now, observe that
the interior of the ball. norm x 1 bar norm x 2 bar are less than or equal to r. Remember, both these quantities lie
in the interior of the ball therefore, they are less than or equal to r. So, this is less than or
Now, you can readily see what needs to be an first we can use the triangle inequality that
equal to theta times r plus 1 minus theta times r which is nothing but r, ok.
is norm A bar plus B norm of A bar plus B bar is less than equal to norm A bar plus norm
B bar. So, that gives me this is less than equal to theta times x 1 bar norm plus norm of 1 (Refer Slide Time: 15:01)
minus theta times x 2 bar.
So, that implies you have norm of theta times x 1 bar plus 1 minus theta times x 2 bar direction or u bar norm of u bar is less than or equal to 1, ok. So, the norm of u bar is less
less than or equal to r implies theta times x 1 bar plus 1 minus theta times x 2 bar less than or equal to 1, ok.
than equal to r which implies this is essentially belongs to the set B, which implies B is
So, what that means is, that is what this and therefore, now you can see and you can
convex, ok. That completes the proof, and that is obvious what we have been able to
write readily verify this. This implies that norm of x bar minus x c bar equals norm of r
show that if x 1 bar for any two points x 1 bar, x 2 bar belong to the ball their convex
of u bar, since r is positive this is r times norm of u bar and norm u bar less than or equal
combinations, all their convex combination also belong to the ball. Therefore, the norm
to 1 which means this is less than or equal to 1, which is the same thing which is setting
ball is convex, ok.
the same thing in a different way that is your finding a vector u bar and which is norm
(Refer Slide Time: 16:04) less than or equal to 1 and your saying that x bar minus x c bar equals r times u bar. And,
this is true for any such vector you bar since this norm of x bar minus x c bar less than
equal to r. So, such point x c x bar will lie in the interior of the ball.

(Refer Slide Time: 19:19)

And, this norm ball now remember another equivalent so, we want to a de another
equivalent way an interesting way to represent this is now we can represent this norm
ball as B of x c bar comma r. So, this denotes a ball for norm ball its center equal to x bar
c, I am sorry, this is not r bar, but this is r which is the radius equal to r. So, this is the
Which implies now, that if you look at x bar equals x bar c plus r times u bar such that
norm ball another equivalent way to represent the norm ball that is equivalent
norm u bar less than equal to 1, this lies in the, well this lies in the interior of this interior
representation is as follows.
of the ball this implies that I can represent the ball with center x c bar comma radius r
The equivalent representation of the norm ball is as follows. that would be remember also in the following form that is equal to the set of all vectors x bar c plus r u bar such
norm of x bar minus x c bar equal to r and this implies, you can write this as x bar minus that norm u bar is less than or equal to.
x c bar is some is r times some vector u, correct? Where, norm of u bar is less than or
So, this is an alternate representation of the norm ball or this is basically an alternative
equal to 1. For instance I can always write this if you look at this x bar, I can always
representation of the norm ball. This is alternative representation of the norm ball,
write this x bar equals r times some vector u bar, correct where u bar is unit vector in this
alright. That is x bar x c bar that is center plus r radius times u bar, where u bar is any
vector such that norm u bar is less than or equal to 1, ok. So, this is an alternative
representation which is very convenient to represent often times represent the norm ball (Refer Slide Time: 22:47)
ok, alright.

(Refer Slide Time: 21:23)

Remember or recall that these are the beamforming weights; these are the beamforming
weights.

So, let us look at the another application similar to if your seen for the hyper planes and (Refer Slide Time: 23:04)
half spaces. Let us again look at the application of the concept of norm ball for wireless
application of the concept of norm ball in a wireless system and this can be seen as
follows.

For instance, again let us go back to our multi antenna beamforming problem and we
again have the different signals. So, this is again your multi antenna receiver and these
are antennas 1, 2 up to antennas L and which we are combining using our beamforming
using the weights W 1, W 2 up to W L.

And, what we will ensure that if you look at the energy of the beam former that is W 1
square plus W 2 square because this is also this also influences the power output power
of the combiner. So, we restrict the energy of the beam former which is equal to the
energy of the beam former and often also called the power of the beam former because
this is what is applied at every time instant energy slash power of beam former.
Now, in any wireless communication system this has to be should be restricted because concept of a norm ball to a wireless communication scenario to design the constraint for
this influences the power of the signal at the output of the beam former. If the energy of a receive beam former, alright.
the beam former unbounded then the power of the output signal can also be unbounded
So, we have seen various other concepts in this module namely that of the norm ball or
therefore, to ensure stable beam former restrict this energy of the beam former typically
the Euclidian ball and its application in the context of a wireless communication system.
to unity, alright. So, in any wireless communication system energy of the beam former
So, we will stop here and continue in the subsequent modules.
has to be limited. Let us call this as the power because that is typically the nomenclature
that is used. The power of the beam former has to be re represented limited. Thank you very much.

Now, if you can look at this, this is nothing, but norm of W bar square. This is norm of W
bar square. Typically, this is less set less than or equal to unity. So, we have the constraint
norm of W bar square less than or equal to unity which implies that if you look at this,
this implies that norm of W bar has to be less than equal to unity and this is nothing, but
a norm ball constraint.

(Refer Slide Time: 25:12)

So, this is the beam former power constraint which is a norm ball constraint norm of W
bar that is constraints the set of all beam formers to a norm ball with radius 1, interior of
a norm ball with radius 1. So, this is normal ball constraint. So, this is the constraint on
the beam former power constraint. This can also be thought of as the this can also be
thought of as the beam former power constraint which is basically nothing, but a norm
ball or a Euclidean ball constraint, alright. So, that is an interesting application of the
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:33)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 14
Ellipsoid and its Practical Applications: Uncertainty Modeling for Channel State
Information

Hello, welcome to another module in this massive open online course. So, we are
looking at convex sets, let us continue our discussion by looking at another very
important convex set this is the ellipse or the ellipsoid, alright.

(Refer Slide Time: 00:29)

Now, this can be simplified as follows to get the general expression for an ellipse or an
ellipsoidal region. I can write this as well, let us write this also or let us write this instead
of y square by b square let us write this as a x 1 square by a square plus x 2 square by b
square is less than or equal to 1, ok, where x 1 and x 2.

So, x 1 is denoting your conventional x coordinate and x 2 is denoting your conventional


y coordinate, and now I can write this as. So, I can denote this by vector x bar equals x 1
and x 2, x 1, x 2, two components. So, this will help me generalize it to n dimensions.
So, this will be x 1, x 2 times 1 over a, 0, 1 over b. In fact, let me just write one more
step I can write this as 1 over a square, 1 over b square times x 1, x 2 less than or equal to
1, writing this in vector and matrix notation.
So, we want to look at the ellipse or also in n dimension also on ellipsoid or an
ellipsoidal region, alright. And, an ellipse as you know from knowledge of a high school
is look something like this and it is typically described by the equation. We are going to
come to the general model in a little bit, but first look at a very simple equation for an
ellipse described by the equation x square by a square plus y square by b square equals 1.

So, this is an ellipse. Well, this is the equation of the ellipse and the interior of the ellipse
including the boundary is described by this inequality that is x square by a square plus y
square by b square equals 1, this describes the interior of the ellipse.
(Refer Slide Time: 03:07) diagonal. So, this is A inverse and I can write this as a inverse transpose because this is a
diagonal matrix, the matrix is equal to its transpose. So, A inverse and A inverse
transpose. So, I can simplify this now interestingly as x bar transpose A inverse transpose
into A inverse into x bar less than or equal to A inverse into x bar less than or equal to 1.
Remember this is our matrix A, that is a diagonal matrix with a and b on the diagonal.

(Refer Slide Time: 05:15)

So, this implies we have x 1, x 2 times 1 over a, 0, 0, 1 over b into the matrix itself
because it is a diagonal matrix it to itself will give me 1 over a square or 1 over a square
and 1 over b square, 1 over a, 0, 0, 1 over b into x 1, x 2; well this is less than or equal to
this is less than or equal to 1, ok. And, this you can see now this is basically nothing, but
transpose of the vector x bar.

(Refer Slide Time: 04:06) And, this implies and of course, you can see that this implies A inverse equals simply 1
over a, 0, 0, 1 over b, ok. Now, the above inequality implies now, I can write this as
follows: I can write this as A inverse x bar transpose into A inverse x bar less than or
equal to 1 and now, you can clearly see the vector transpose itself is nothing, but the
norm of the vectors vector space, that is, if u bar is a vector we have already seen that u
bar transpose u bar is basically norm u bar square.

So, I can write this now very interestingly as norm A inverse x bar square less than or
equal to 1 that implies A inverse x bar is norm of A inverse x bar is less than or equal to
less than or equal to 1 and this is the equation of ellipse equation of ellipse above.

So, this is x bar transpose, this is the vector x bar and you can see if I call this matrix as
A inverse, remember I can define A as the matrix diagonal matrix A small a and b on the
(Refer Slide Time: 06:22) (Refer Slide Time: 07:53)

And, now you can generalize as you know ellipsoid by considering in n dimensional And, now the alternative representation of an ellipse or an ellipsoid now the alternative
vector so, alright. So, you can generalize this n dimensions by considering by representation similar to that of a norm ball can be derived as follows. Well, we have
considering n n dimensional vector x bar. So, as you consider instead of x 1, x 2 if you norm A inverse x bar is less than or equal to 1 implies I can set A inverse x bar as a
consider a n dimensional vector x 1, x 2 up to x n norm A inverse x bar less than or equal vector u bar with norm u bar less than or equal to 1 that implies x bar equals A times u
to norm A inverse x bar less than or equal to 1, this becomes an ellipsoid, an n bar with norm u bar less than or equal to 1. Now, this is for centre as origin ellipse with
dimensional ellipsoid, ok. Generalize this to n dimensions, so, that becomes an ellipsoid, remember this equation here we have started with this is a centre has centre is origin.
in n dimensions.
Now, if centre is not the origin then I can simply modify this to include the appropriate
centre as x bar equals a times u bar plus x bar say. So, this is the centre of the ellipse or
the ellipsoid this is the centre of the ellipsoidal region, ok.
(Refer Slide Time: 09:52) (Refer Slide Time: 11:23)

And therefore, the ellipsoid can now be represented as the ellipsoid with the Again, we will look at a multi antenna wireless system. Let us consider a multi antenna
corresponding to matrix A and centre x c bar is the set of all vectors x bar x c bar plus a u wireless system again similar to what we have seen before. Remember, multi antenna
bar such that norm of u bar is less than or equal to 1, this is the alternative representation wireless system basically has multiple antennas to over improved performance of such
of the ellipsoidal region. This is alternative representation of the ellipsoidal region, system. So, I have multiple antennas and corresponding to this multiple antennas I have
alright ellipsoidal region corresponding to a matrix A and the centre x bar c, ok. multiple channel coefficients h 1, h 2 to h L. So, these are the L antennas. So, these are
let us say L antennas. So, this is your receiver, in the wireless communication system we
Similar to the previous cases let us look at a practical application of this. So, another
have the L antennas. So, h 1, h 2, h L are the channel coefficients.
interesting one of the aspects of this course is also look at is to also look at a practical
applications of these concepts, is to also look at a practical application. Now, these channel coefficients also in wireless communication systems the knowledge
of these channel coefficient, this is also termed as channel state information alright. So,
the channel coefficient characterize the channel state and knowledge of this channel that
is knowing this channel coefficients, having the values of these channel coefficients is
also termed as channel state information in the wireless communication system. So, the
knowledge of these channel coefficients this is also termed as this is a frequent term, this
is termed as channel state information, ok. Knowledge of these channel coefficients is
termed as channel state information now this knowledge is important.

Now, to develop enhanced signal processing scheme we need knowledge of this channel
coefficients or we need the channel state information at the receiver to develop improved
or to basically develop schemes that yield improved performance after signal processing
at the receiver, ok.
(Refer Slide Time: 14:00) (Refer Slide Time: 15:39)

So, this knowledge of CSI knowledge of CSI is required for accurate performance So, only the estimates exact CSI not known only estimates only the estimates or
improved performance; however, frequently the exact. So, frequently the exact channel basically, you can think of these also approximate values only estimates of channel
state information that is frequently the exact channel state information is not known in, coefficients or CSI or CSI is known. This implies that there is uncertainty in the CSI,
this is not known in practice. Now, what is known because remember these channel state implies this is termed as uncertainty CSI uncertainty this is uncertainty in the CSI arising
coefficients have to be estimated and whenever you estimate them there is going to be an from the estimation errors. There is uncertainty in the channel state information.
estimation error.
(Refer Slide Time: 16:38)
So, only approximate channel values of the channel state channel state information or
approximate values of these channel coefficients are known, that is, the corresponding to
the approximate values or the estimates of these channel coefficients are frequently
known in practice, ok.

So, we have this estimate. So, we have these true channel coefficients. The true
underlying channel coefficients, these are not known and what are known are there
estimates that are denoted by this hats h 1 hat, h 2 hat, up to h L hat. These are the these (Refer Slide Time: 18:59)
are the estimates these are the estimates of the channel coefficients, and therefore, we
have our true channel vector h bar this is h 1 hat h 2 hat up to h L, and this is your true
channel vector true channel vector meaning the actual channel vector in the wireless
system.

(Refer Slide Time: 17:52)

So, h hat we know is approximately equal to h bar, but h hat is not exactly equal to h bar.
And, this is an important this is an important consideration in practice because in practice
the perfect channel state information is very difficult I mean estimating the underlying
channel state coefficients without that is with 100 percent accuracy without any
estimation error is impossible, alright. So, in all practical scenarios the channel state
information or the channel coefficients are only approximately known, alright.
And, you have the estimated channel vector h hat which is equal to comprises of the
estimates and this is the true CSI or what is also known as perfect CSI. This is the So, one has to characterize this phenomena this phenomenon of uncertainty in this in the
estimated channel coefficient vector which is also known as this is the estimated channel CSI has to be suitably characterize to design signal processing schemes that take into
vector. This is also termed as the imperfect CSI, ok. This is all the, this is known as account this uncertainty into CSI and yield improved performance, ok. So, we have to
imperfect CSI. Now, we know that this imperfect CSI is close to the actual CSI that is h have. So, h hat h hat h h hat is close to h bar, but h hat is not equal to h bar. So, how to
hat the estimate is close to h bar, but it is not exactly equal to h bar, ok. characterize this uncertainty? How do we characterize the important question now is how
to characterize now how to characterize this uncertainty?

And, therefore, what one can say is that h hat this estimate lies close to h bar or h bar the
true channel lies close to the estimate h hat, we can say that h bar h bar lies in a region of
uncertainty around h hat and this region is frequently modeled as an ellipsoid, ok. So,
what we have and this where the application of the ellipsoid comes in we say that if you
consider an ellipsoidal region with the known estimate as the centre then h bar the true
channel lies in a region uncertainty regions.
So, h bar equals the true channel it lies in a region of uncertainty around h hat which is And, asymptotically you can see when the estimation error become 0, h bar the true
the estimate and this uncertainty region typically modeled. So, this is an uncertainty channel coincides with h hat, that is, for a large number of pilot symbols that is when the
region, this is typically modeled as an ellipsoid in n dimensions this is typically modeled SNR; SNR of estimation tends towards infinity alright. And therefore, now you have an
as an. This uncertainty region is typically modeled as an ellipsoid. So, we say that a true interesting model to characterize the true channel vector I can represent h bar as A times
channel vector true channel lies in an ellipsoid lies somewhere in an ellipsoid around h u bar plus remember this ellipsoid has centre e which is nothing, but it h hat. So, this
hat. forms your centre of the ellipsoid; h hat is centre of the uncertainty ellipsoid. So, h form.
So, h hat is nothing, but the centre of the uncertainty centre of the uncertainty ellipsoid,
(Refer Slide Time: 22:23)
ok.

(Refer Slide Time: 24:55)

Now, obviously, if the ellipsoid is large; that means, the uncertainty region is large,
which means the estimation error is high, alright. Now, if the estimation error is low that
is a estimation process is very good then the ellipsoid will suitably small; that is you can And, therefore, h bar belongs to this uncertainty ellipsoid which is given as A u bar plus
localize h bar to a much smaller region around h hat. So, if estimation is accurate that h hat such that norm u bar is less than or equal to 1, and this is termed as I already said
imply that implies ellipsoid that is the size of ellipse is small, ok. this is termed as on the uncertainty ellipsoid. This is termed as for this practical scenario
this is termed as the uncertainty ellipsoid.
On other hand, inaccurate estimation or poor estimation, the estimation poor implies
ellipsoid is large. So, one can characterize the ellipsoid based on the estimation process
also, because estimation process what results in the estimation errors. If the estimation
errors is large then the uncertainty will be large so, the size of the ellipsoid will be large,
that is, there is a lot of uncertainty in where h bar can lie. If the estimation is of good
quality then naturally h bar will be close to h hat. So, the size of the ellipsoid will be
much smaller.
(Refer Slide Time: 25:42) Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 15
Norm Cone, Polyhedron and its Applications: Base Station Cooperation

Hello, welcome to another module in this massive open online course. So, we are
looking at various types of convex sets and their relevance to practical applications
especially in context of wireless communications and signal processing. Let us continue
our discussion by looking at yet another convex set or class of convex sets that is the
convex cone, ok.

(Refer Slide Time: 00:38)

This is termed as uncertainty ellipsoid and now, signal processing techniques that
concerned this that consider this uncertainty. Signal processing ellipsoid that consider
uncertainty are termed as robust. These are termed as a robust since they are not sensitive
to the uncertainty in the channel state information as they are not sensitive or less
sensitive, as they are less sensitive to uncertainty in the channel state information, or they
are not sensitive to errors; not sensitive to estimation errors as they are not sensitive to
estimation errors, alright.

So, an interesting application of this ellipsoid or ellipsoidal region in wireless


communication or for that matter signal processing and various other applications is the
following. Several quintiles have to be estimated such as, the channel coefficients or
even a signal processing alright and underlying filter has to be estimated, alright. So, the
true coefficients we do not know where that what the true coefficients are, but we know So, what is also start looking at a convex cone or a norm cone this is another prominent
that they lie close to the estimated values. So, these can be considered to lie in an class of convex sets and well, what is a norm cone?
ellipsoid is region around their corresponding around the respective estimated values
Well, let us consider a 2-dimensional scenario let us consider a 2-dimensional vector x 1,
alright and that ellipsoidal region is known as the uncertainty ellipsoid in the context of
x 2 and if you consider the set of all vectors that is the set of all vectors such that
wireless communication this arises because there is uncertainty in the CSI channel state
magnitude of x 1 is less than or equal to x 2, ok. So, you want to look at the set of all
information or the channel coefficients, alright.
vectors x bar such that magnitude of x 1 is less than or equal to x 2, alright and this can
So, let us stop here and consider other aspects in the subsequent modules. be represented as follows. This is your x 1, this is your x 2 and this line represents
magnitude of x 1 equals x 2 or these two lines. So, this is magnitude of x 1 equals x 2,
Thank you very much.
ok.
And, if you look at this region this region basically represents the region magnitude of x (Refer Slide Time: 03:53)
1 is less than equal to x 2 and you can see this region is a cone and this is basically also
convex. So, this is termed as a simply as a cone or also convex cone that is you are
looking at a 2-dimensional vector. So, 2-dimensional plane in which we considering all
the vectors such that magnitude of x 1 that is the first quadrant magnitude of x 1 is less
than or equal to x 2, and this is a convex region, alright.

(Refer Slide Time: 02:57)

And, if you plot that if you plot that in 3-dimensions this is your x 1, x 2 x 3 that looks
similar to the cone the shape of a classical cone that we are all familiar with this is the
region, and this is the 3D cone, the classical conical shape that we are all very familiar
with. And, you can clearly see that this region, if you look at the interior of this cone this
is convex and it is also reasonably easy to show that the cone is a convex region.

(Refer Slide Time: 04:42)


And, now we can similarly form it for and this is also termed as a norm cone,. A convex
cone or a norm cone similarly you can look at this for a 3-dimensional scenario that is
you have x bar equals x 1, x 2, x 3 and you consider the set of all points such that if we
take the first two coordinates that is norm of x 1, x 2 less than equal to x 3, which
basically implies that square root of x 1 square plus x 2 square less than equal to x 3, ok.

And, in general now you get the idea to generalize this we have x tilde equals let us say
we have n dimensional n plus 1 dimensional vector of which we form the first n
dimensional. So, this is your n plus 1 dimensional vector. This is your n plus 1 These have significant utility in wireless communication and signal processing. These
dimensional vector and this is x of n plus 1, this is x of n plus 1, ok. So, this is a n are second order cone programs. Especially in the context of robust signal processing
dimensional vector x bar and this is x of n plus 1 this is an additional. So, we have a x n robust similar to what we have seen previously in the context of robust for instance
plus 1 dimensional vector x tilde. Now, if you consider the set s of all vectors x tilde such estimation or robust signal processing. One of the most prominent applications of this
that norm of x bar less than or equal to x x n plus 1. SOCP paradigm is in the context of robust signal processing for instance for instance you
can look at applications such as robust beam forming for the same beam forming
Well, this represents an n plus 1 dimensional norm cone this represents a n plus 1
problem, that we can look at in a multiple antenna wireless communication system.
dimensional norm cone that is a convex region which is the relevance to us is that this is
a convex region, this is a convex region and it is a fairly important class of convex If you make it robust as we have seen robust is making it resilient to the channel
regions, it is very interesting and some more sophisticated convex region and right it is estimation errors it becomes a robust beam forming problem and all such applications
slightly difficult to describe a practical application of the convex cone in the context of can be formulated using the second order cone program SOCP will basically involves
resistance wireless communications or signal processing at this point. conic sets, alright and, we will look at them. In fact, we look at these kind of problems in
quite some detail as we proceed through the later modules or in the later stages of this
But, we will note that or I would like you to note that the convex cone in fact, has a very
course, ok.
interesting and very prominent applications which will explore during this course it is
just that it is a little difficult to setup the problem right now. (Refer Slide Time: 09:28)

(Refer Slide Time: 07:18)

. So, now let us move on to other interesting convex sets which also arrives very
frequently and these are termed as polyhedral. Now, polyhedra are basically formed from
We will explore problems what are known as problems that are known as SOCP or as
the intersection of hyperplanes and half spaces. These are formed from the intersection of
second order cone programs and these have significant application and utility. So, these
hyperplanes and half spaces. For instance, in fact, a finite intersection we can note that it
are these are SOCP problems second order cone programs that have cone conic
is a certain point, but these are finite formed by finite intersection of hyperplanes and
constraints and these have significant utility in the context of wireless communication
half spaces.
and signal processing.
So, for instance we have seen that a half spaces can be represented as follows a 1 bar (Refer Slide Time: 12:37)
transpose x bar less than equal to b 1, a 2 bar transpose x bar less than equal to b 2, so on
a n bar transpose x bar less than equal to b n. This is a collection of this is an intersection,
let us put it this way this is an intersection of half spaces, correct. This is an intersection
of half spaces.

(Refer Slide Time: 11:14)

So, now denoting this by the matrix A and this by the matrix b bar I can represent this
intersection of half spaces as A x bar A x bar less than or equal to b bar, ok. So, this
basically represent an intersection of your intersection of half spaces.

Now, similarly remember this is an intersection of half spaces now similarly remember I
can also formulate the hyperplanes as follows that is your C 1 bar transpose x bar equals
I can write this as follows I can put all these concatenate this in a matrix and I can write d 1, C 2 bar transpose x bar equal d 2, C n bar transpose x bar equals d m, let us put this
this as a 1 bar a 2 bar transpose a n bar transpose times x bar. Let me write the right hand make this as m. This is a collection of intersection of remember each represents a
side b 1, b 2 up to b n and what I can use here is what is known as a component wise hyperplane this is an intersection of m hyperplanes, ok. So, this is an intersection of m
inequality, which means that a 1 bar transpose x bar less than b 1 that each component on hyperplanes. Again, I can concatenate this system.
the left is less than each component on the right. So, this is known as a component wise
inequality. This is known as a component wise inequality it means each component that
is we take two vectors vector a vector let us say u bar less than v bar. which means each
component of vector u bar is less than or equal to each component of this vector v bar,
ok.
(Refer Slide Time: 14:01) That is vector x bar that satisfies A x bar component wise inequality less than equal to b
bar C x bar equals d bar this represents this region represents what you known as a
polyhedral. This represents that intersection of a intersection of a finite intersection of a
finite number of intersection of a finite number of hyperplanes and half spaces basically
represents a polyhedron.

(Refer Slide Time: 16:00)

I can represent this as follows C 1 bar transpose C 2 bar transpose C m bar transpose
times x bar equals b 1, b 2, up to b n. So, we have C this is a matrix C this is a matrix let
me call this as d yeah, this is d 1, d 2, in fact, as mentioned above this is d 1, d 2, d m.
So, we have C into x bar equals d bar this is your intersection of hyperplanes. It is a
compact way of representing an intersection of hyperplanes.

And, now if you put them together you have an intersection of hyperplanes and half So, it can be just simply shown as follows. So, you can imagine having a large number
spaces. for instance of a half spaces and for instance you can think of this as one half space
corresponding to this and you can think as this as another half space corresponding to
(Refer Slide Time: 15:21)
this and you can imagine the various half spaces and now, if you look at this region that
lies in the intersection of all these half space. This intersection region of half spaces in
fact, you can also through hyperplane into that intersection of half spaces plus
hyperplanes, this is your polyhedron, ok. So, that is roughly you can see described by
this region that is described by this region so, that basically forms your polyhedron, ok.
(Refer Slide Time: 17:53) is basically the strip is the region that is x 1 greater than equal to minus 1 minus 1, 1 less
than equal to 1 and if you look at x 2 this is the region pertaining to x 2 equal to minus 1
and so, this is the hyperplane for x 2 equal to 1.

So, this is the hyperplane x 2 equal to 1, this is the hyperplane x 2 equals minus 1 and if
you look at this square region, now you can see the square, which is formed by the
intersection of these hyperplanes this square is a polyhedron which is convex that is the
important thing. So, square is a polyhedron and so, worth noting especially that
polyhedron is important because it is convex.

Now, in general intersection of convex sets is convex. The hyperplanes and half spaces
are convex sets. So, their intersection is also convex, in particular such a region is known
as a polyhedron, and it is very handy and it arrives frequently in several applications,. So,
this square is a this square region is basically your polyhedron, which formed by the
And, for instance you can take a simple example again you can look you consider intersection of a finite number of half spaces and hyperplanes, ok.
polyhedron form by this four hyperplanes that is x 1 greater than equal to minus 1, x 1
less than equal to 1, again x 2 greater than equal to minus 1, x 2 less than equal to 1. (Refer Slide Time: 20:30)

(Refer Slide Time: 18:11)

To understand this better we can start looking at an example which we might continue in
the next module. So, let us look at an application of this concept of your polyhedron and
And, if you look at that region that is something that everyone will immediately this will be in the context of cooperative wireless communication that is, if you look or
recognize that is if you take the two points minus 1, 1 or x 1 equal to minus 1, x 2 equal cooperative base station transmission or base station cooperation. So, again once again
to. So, these are the two hyperplanes these are x 1 equals 1. So, this is your x 1, this is looking at a wireless plus base station cooperation also known as quality point
your x 2 and this is x 1 equals minus 1. So, this region is x 1 and this is the region which cooperative multipoint.
So, base station cooperation, and what happens in this scenario is we have several base (Refer Slide Time: 23:46)
stations typically in wireless cellular network what happens is conventionally a single
base station transmits to a single mobile, but in some scenarios you can have several base
stations cooperating to transmit to a single user or a group of users and this is especially
possible. If the users are at the edge of the cell or in the region between multiple cells,
where they can be simultaneously served by several base stations, ok.

(Refer Slide Time: 22:12)

And, we have M total users, that is 1, 2 up to M these are the number of users these are
the number of users.

Now, what we will also have is we will have now remember whenever you have a
wireless transmission scenario there is a channel between the transmitter and receiver
and this channel is fading channel, because the wireless channel is fading in nature. So,
that is characterized by fading channel coefficient. So, you will have a fading channel
So, each of these towers is let us say a base station and you have a mobile that is being coefficient h ij equals the fading coefficient fading channel coefficient and h ij is the
simultaneously served. In fact, you have several such mobiles let us say which are coefficient between the i-th user. So, h ij channel between the i-th. Let us put this
simultaneously being served by the. So, each mobile is being served each user j. So, what between the i-th this is the fading channel coefficient between the i-th base station and
we are saying is each mobile being served by multiple base stations i, that is, let say we the j-th user, ok.
have K base stations 1 less than equal to i less than equal to K, ok. So, we have base
So, this is the fading channel coefficient between i-th base station comma, this is the
stations 1, 2 up to K, these are the base stations. Base station 1, base station 2 so, K
fading channel coefficient which means if you look at h ij is the fading channel
equals the number of base stations, ok.
coefficient if you look at magnitude h ij square this represents the power gain this
represents what is conventionally known as simply as the gain, simply this is also
sometimes referred to simply as the gain. I mean there can be an amplitude gain which is
given by magnitude h ij, this is the power gain magnitude h ij, h ij square which is the
power gain which means if magnitude h ij square is strong then this received signal at the
user j corresponding to the signal transmitted by base station i is going to be strong, but
if this channel is in a deep fade, alright.
Which means, there is a lot of interference in the channel and as a result of this if Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
magnitude h ij square is very low then the power received, alright the received power by
Department of Electrical Engineering
user j corresponding to the signal transmitted by base station i is going to be very low Indian Institute of Technology, Kanpur
that can be attributed to the fading, alright. So, this is the fading nexist arises this varying
Lecture – 16
power level at each user corresponding to the signals transmitted by the base stations it is Applications: Cooperative Cellular Transmission
this varying power level arises due to the fading nature of the wireless of a typical
wireless channel, alright. Hello, welcome to another module in this massive of an online course. So, we are
looking at a wireless base station cooperation scenario, in which several base stations are
So, naturally what is the power that has to be transmitted by the various base stations
cooperating to transmit to a single user or group of users alright.
what is so that each user receives the desired amount of power or what is the power that
has to be transmitted by the base stations corresponding to the power constraint at each (Refer Slide Time: 00:26)
base stations. So, all these are various optimization problems and these can be these are
these are things that we going to look at during the course my intention here is to
formulate a basic problem related to this and demonstrate the applicability of the
polyhedron that is the convex set that we have just seen, alright.

So, we will stop here and continue this discussion this practical application in the context
of a cooperative wireless cooperative base station transmitting cooperative base station
transmission in the next module.

Thank you very much.

So, we are looking at this scenario, which is very practical example in cellular network
contrast termed as base station cooperation ok. And what happens in base station
cooperation, if you look at it we describe it in the previous module that is we have K
base stations ok. K is the number of base stations which are cooperating to transmit to M
users ok.
(Refer Slide Time: 00:54) (Refer Slide Time: 01:52)

And these are the base stations and these users are typically located in a region where Let say P i j. P i j is the power transmitter by ith base station to jth user ok. So, P i j
they can receive the signals from the multiple users such that at the intersection of these equals power transmitted by ith base station to the jth user. So, P i j is the loads a
various cells. So, we have three cells in the intersection you have some of these user, and transmitter power magnitude h i j square is a power gain, which implies that if I multiply;
these can be served by multiple base stations not just a single base stations. So, the base so, this is the transmit power. So, if I multiply P i j by magnitude h i j square, now this
stations can cooperate with other to enhance the signal to noise modulation at each user quantity this denotes, this quantity this denotes the power received by user j. So, this is
ok. the power, user j from the power received by user j from the ith base station. So, this
quantity is given by P i j magnitude h i j square alright.
And h i j this quantity denotes the fading channel coefficient, fading channel coefficient
between the ith base station and the jth user and therefore, magnitude h i j square this is So, now, what we can look at is, let us look at the total power received by user j any
the power gain. And what this means is, if you look at the power received by j user jth particular user j from all the base stations. So, to compute the total power at any
user from the ith base station. So, let say so, we already said i think or let us say that now particular user, we have to sum the power that is received from all base station and that is
we have another quantity. given as.
(Refer Slide Time: 03:49) From all base stations and not we will say is this has to be greater than equal to some
quantity P j tilde. This is the minimum power that is desired by user j, this is the
minimum power that is desired by user j. So, the total power at user j received at user j
has to be greater than or equal to some quantity P j tilde, which is the minimum power
alright the minimum desirable you can say the minimum desirable signal quality at that
particular user. This is known as a QoS constraint or a Quality of Service constraint for
that particular user ok. So, this constraint is also termed as a QoS or a Quality of Service
constraint its quality.

(Refer Slide Time: 05:50)

So, what is the total power of user j? The total power of user j I mean total power of
signal received by user j that is basically the sum of power from all base stations, from
all base stations and that is basically you remember the base station index is i. So, you
have to sum over all i. So, i equal to 1 to K, K is the total number of base stations, P i j
magnitude h i j square. This is the total power received of total power received by user j
from all base stations.

(Refer Slide Time: 04:35)

So, a signal quality has to be such that the received power has to be at least greater than
equal to P j tilde. So, we will have one quality of service constraint for each user.
Remember we have M such users therefore; we will have M quality of service
constraints if you remember correctly where M, M is a number of users. So, therefore,
we will have M such quality of service constraints; what are those quality of service
constrains?
(Refer Slide Time: 06:31)

For user 1 we must have summation i equal to 1 to K of P i 1 times magnitude h i 1 square greater than or equal to P 1 tilde. Similarly, for user 2 we can write summation i equal to 1 to K of P i 2 times magnitude h i 2 square — where h i 2, remember, is the channel coefficient between base station i and user 2 — greater than or equal to the minimum desirable power P 2 tilde of user 2, and so on and so forth. Finally, for the Mth user you will have summation i equal to 1 to K of P i M times magnitude h i M square greater than or equal to P M tilde. These are the M QoS constraints; remember these.

So, what are these? These are the M QoS, or quality of service, constraints; there is one constraint for each user. And if you look at each constraint, it is a linear combination of the powers, that is, it is of the form a bar transpose x bar less than or equal to b (after flipping the sign of the inequality), where x bar is basically nothing but the vector of powers. So, each constraint represents — I am sorry, not a hyperplane but a half space — each constraint represents a half space, which means that the feasible set is an intersection of half spaces.

(Refer Slide Time: 09:31)

And this implies that it is a polyhedron. Remember, as we have said, an intersection of half spaces and hyperplanes is nothing but a polyhedron, and this is an interesting practical application. So, the set of all possible powers which meet the quality of service constraints of these different users in this base station cooperation setup — this cooperative multi-cell setup — lies in a polyhedron, obtained as the intersection of the half spaces given above. So, this is very important.
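To make this feasible set concrete, here is a minimal numerical sketch in Python/NumPy (my own illustration, not from the lecture); the channel gains h, the power allocation P[i, j] from base station i to user j, and the thresholds P_min below are all made-up values.

    import numpy as np

    K, M = 3, 4                        # K base stations, M users (illustrative sizes)
    rng = np.random.default_rng(0)
    h = rng.rayleigh(size=(K, M))      # |h_ij|: assumed channel gain magnitudes
    P = rng.uniform(0.5, 2.0, (K, M))  # P_ij: power from base station i to user j
    P_min = np.full(M, 1.0)            # P_j tilde: minimum received power per user

    # Received power at user j: sum_i P_ij * |h_ij|^2 (one linear expression per user)
    received = np.sum(P * h**2, axis=0)

    # Each QoS constraint is a half space in the K*M power variables, so the
    # feasible set is the intersection of these M half spaces, i.e. a polyhedron.
    qos_satisfied = received >= P_min
    print(received, qos_satisfied.all())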
(Refer Slide Time: 10:30)

So, in this practical optimization problem, the set of all powers that meet the QoS constraints lies in a polyhedron, or a polyhedral region. So, to optimize the powers that are transmitted to the different users by the base stations, what one has to consider is this set of all possible powers, which lies inside a polyhedron, for the optimization problem.

Now, similarly, remember each base station also has a total power constraint. So, you can consider that also. Now, looking at it from the perspective of each base station, let us look at the base station power constraint.

(Refer Slide Time: 11:40)

Now, if you look at the base station power constraint, each base station has a certain maximum power. Let us call this maximum transmit power P i bar, that is, P i bar equals the maximum transmit power of base station i. Now, what does it mean? It means that the total power transmitted to all the M users by base station i has to be less than or equal to this quantity P i bar, because this is the maximum possible transmit power, correct?

(Refer Slide Time: 12:56)

So, this implies that if I look at any particular base station i and sum the power over all users, I must have the summation of the transmit powers P i j less than or equal to P i bar. And what is this sum? This is the total transmit power of base station i to all the M users. So, this has to be less than or equal to P i bar, and we can write one such constraint for each base station.

(Refer Slide Time: 13:46)

So, what does that mean? You will have summation over j equal to 1 to M of P 1 j — the total transmit power of base station 1 to all users — less than or equal to P 1 bar; summation over j equal to 1 to M of P 2 j — the total transmit power of base station 2 to all users — less than or equal to P 2 bar; and so on and so forth. You can write K constraints, one constraint for each base station, up to summation over j equal to 1 to M of P K j less than or equal to P K bar. What are these? Well, these are K power constraints. Now, you can see again that each is a half space, so you have the intersection of K half spaces, a finite number of half spaces. So, this also represents a polyhedron, ok.

(Refer Slide Time: 15:13)

So, there is an intersection of K power constraints, one for each base station, which is equal to a polyhedron. So, either you look at the QoS constraints, one for each user — the intersection of M constraints, that is, M half spaces, which is a polyhedron — or you look at the total power constraint of each base station: for each base station i we have a total power constraint, and the intersection of these K half spaces is also a polyhedron. That is, the set of all possible powers that meet the total transmit power constraint at each base station, the set satisfying all these constraints, is also a polyhedron.

So, therefore, this polyhedron — which basically is a region formed by the intersection of hyperplanes or half spaces, and which is convex — has significant utility and arises frequently in various optimization problems, especially in the context of signal processing and communication, and this illustrates one such simple application scenario, alright. So, we will stop here and continue in the subsequent modules.

Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 17
Positive Semi Definite Cone And Positive Semi Definite (PSD) Matrices

Hello. Welcome to another module in this massive open online course. So, we are looking at various convex sets and their relevance to practical applications. Let us continue our discussion by looking at the set of Positive Semi Definite Matrices, which is also known as the Positive Semi Definite Cone.

(Refer Slide Time: 00:30)

So, what we want to look at is the positive semi definite cone. Well, let us first consider the set S n. S n equals the set of all symmetric n cross n matrices — remember, we have defined symmetric matrices before; that is, a real matrix A is symmetric if A equals A transpose.

(Refer Slide Time: 01:40)

And let us now define another set, S n plus, as the set of all matrices X belonging to S n, that is, the set of symmetric matrices, such that X is greater than or equal to 0 — I will describe what it means to say a matrix is greater than or equal to 0 — where this notation, X with this curved greater than or equal to sign, denotes that X is a positive semi definite, or PSD, matrix, ok.

And so, S n plus is the set of all symmetric positive semi definite matrices of size n cross n. Remember, the definition of a positive semi definite matrix X is that for any vector Z bar it must satisfy the property Z bar transpose X Z bar greater than or equal to 0; that is the definition of a positive semi definite matrix, ok.
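As a small sketch of this definition (illustrative only; the matrix X below is just a randomly constructed example, not anything from the lecture slides), one can check the property Z bar transpose X Z bar greater than or equal to 0 numerically:

    import numpy as np

    rng = np.random.default_rng(1)
    B = rng.standard_normal((4, 4))
    X = B @ B.T                      # B B^T is symmetric positive semi definite by construction

    # Check z^T X z >= 0 for many random vectors z
    z = rng.standard_normal((4, 1000))
    quad_forms = np.einsum('ik,ij,jk->k', z, X, z)
    print(quad_forms.min() >= -1e-10)   # True, up to round-off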
(Refer Slide Time: 03:40)

So, PSD implies Z bar transpose X Z bar greater than or equal to 0 for all Z bar, where Z bar is any n dimensional vector. Now, we can show — and it is not very difficult — that this set S n plus is a convex set; this is a very important convex set. The set of all symmetric positive semi definite matrices is a convex set, alright; that is, S n plus, which we have just defined, is a convex set, and that is not very difficult to see. We go back to the fundamental definition of a convex set: we take any two elements, in this case two matrices that are positive semi definite. So, let us say X 1 and X 2; X 1 belongs to S n plus, that is, X 1 is positive semi definite, and X 2 belongs to S n plus, which implies Z bar transpose X 1 Z bar is greater than or equal to 0 and Z bar transpose X 2 Z bar is greater than or equal to 0.

(Refer Slide Time: 05:10)

Now, we want to show that if you take a convex combination theta times X 1 plus 1 minus theta times X 2, with 0 less than or equal to theta less than or equal to 1, this is also PSD, that is, it belongs to the set S n plus. Very simple: you take Z bar transpose times theta X 1 plus 1 minus theta X 2 times Z bar, and that equals theta times Z bar transpose X 1 Z bar plus 1 minus theta times Z bar transpose X 2 Z bar. Now, remember, X 1 is positive semi definite, so Z bar transpose X 1 Z bar is greater than or equal to 0, and X 2 is also positive semi definite, so Z bar transpose X 2 Z bar is greater than or equal to 0. Further, theta and 1 minus theta are both greater than or equal to 0, because 0 less than or equal to theta less than or equal to 1 — remember, this is a convex combination, so theta lies between 0 and 1, ok.
(Refer Slide Time: 06:27)

This implies that the whole quantity above is greater than or equal to 0, which implies that the convex combination theta X 1 plus 1 minus theta X 2 belongs to the set of all n cross n positive semi definite matrices, ok. So, if we take any convex combination theta times X 1 plus 1 minus theta times X 2 with 0 less than or equal to theta less than or equal to 1, the convex combination also belongs to the set of positive semi definite matrices, and therefore S n plus is a convex set, ok.

And in fact, this is not just a convex set; you can show that it is a convex cone. The definition of a convex cone is very simple: similar to the convex set, if X 1 bar and X 2 bar belong to S, then theta 1 X 1 bar plus theta 2 X 2 bar also belongs to S for any theta 1, theta 2 both greater than or equal to 0. Remember, there is no restriction that theta 1 plus theta 2 equal 1; that restriction is there both in convex and in affine combinations, right. So, if you relax that restriction — if this holds true for any theta 1, theta 2 greater than or equal to 0 — such a set is known as a convex cone. And if you look at this definition, any convex cone is also a convex set, because in particular theta times X 1 bar plus 1 minus theta times X 2 bar belongs to S, that is, theta 1 X 1 bar plus theta 2 X 2 bar belongs to S with theta 1 plus theta 2 equal to 1.

(Refer Slide Time: 08:27)

Then, I can simply set theta 1 equals theta and theta 2 equals 1 minus theta, and for 0 less than or equal to theta less than or equal to 1, theta 1 and theta 2 are both greater than or equal to 0. This implies that theta X 1 bar plus 1 minus theta X 2 bar also belongs to S, which implies it is a convex set. So, what this means is that any convex cone is also a convex set — the convex cones are a subclass of the convex sets, ok. So, a convex cone is also a convex set, but not the other way around: not every convex set is a convex cone, alright. So, the set of positive semi definite matrices is a convex cone; it is known as a convex cone for this particular reason — not just a convex set, but a convex cone.
that is if theta 1 X 1 bar plus 1 minus theta 1 plus theta 2, X 2 bar belongs to S.
(Refer Slide Time: 09:52)

Because, if you take any theta 1 X 1 plus theta 2 X 2, where X 1 and X 2 are both positive semi definite, and form Z bar transpose times this combination times Z bar, you have theta 1 times Z bar transpose X 1 Z bar plus theta 2 times Z bar transpose X 2 Z bar. Now, again, each of these quadratic forms is greater than or equal to 0, and theta 1, theta 2 are greater than or equal to 0 by assumption, which implies that the whole quantity is greater than or equal to 0. This implies that theta 1 times X 1 plus theta 2 times X 2 belongs to the set S n plus, which implies that S n plus — the set of all n cross n symmetric positive semi definite matrices — is a cone, ok. And therefore, for our purposes, it is important to remember that the set of positive semi definite matrices is a convex cone and, more importantly, a convex set, ok. That is, if we take the convex combination of any two positive semi definite matrices, it is also in turn a positive semi definite matrix.

(Refer Slide Time: 11:36)

Now, PSD, that is, Positive Semi Definite, matrices are very important in signal processing and communication as well. The set of all positive semi definite matrices has a lot of applications; such matrices arise very frequently in signal processing and communication, and a simple application can be demonstrated as follows. Let us consider a simple application of this concept of a positive semi definite matrix. For instance, let us consider a discrete signal vector given as follows.
(Refer Slide Time: 12:50)

The vector is given as x bar, which is a vector of samples — or, you can also say, symbols — x 1, x 2, up to x n, of size n, ok. So, this is a signal vector, which can arise in any scenario, ok. And let us further consider this to be a random signal vector; let us say this is an n dimensional random signal vector, and let us look at its average value, that is, its mean.

(Refer Slide Time: 13:35)

That is, if you look at the mean of x, mu bar x, the expected value of x bar, it is equal to 0, which means this is a zero mean signal, ok. This is not critical, but just for convenience of analysis we are setting this to be a zero mean signal. Now, this can arise in several scenarios. For instance, let us go back to our multi antenna system.

(Refer Slide Time: 13:59)

This you must now be very familiar with: if you look at your multiple antenna system, let us say we have a set of multiple transmit antennas which are transmitting the signal, ok. So, this is your transmitter, and let us say in the wireless communication system you are transmitting symbols x 1, x 2, and so on. Let us say you have n transmit antennas, that is, the number of transmit antennas equals n, and the transmit symbols are given by x 1, x 2, up to x n, ok.

So, we have n symbols x 1, x 2, up to x n; x 1 is transmitted from the first transmit antenna, x 2 is transmitted from the second transmit antenna, and so on, and x n is transmitted from the nth transmit antenna. Therefore, x bar also denotes the transmit vector, or the vector of transmit signals; that is, we have x bar equals x 1, x 2, up to x n, the vector of transmit symbols, which is also known as the transmit vector, ok.
(Refer Slide Time: 15:20)

So, you have x bar, which is equal to x 1, x 2, up to x n; this is your vector of transmitted symbols from the multiple transmit antennas, which can also be called the vector of transmit symbols. These are the symbols transmitted from the n transmit antennas.

(Refer Slide Time: 16:45)

So, the transmit covariance is given as follows: if we denote the transmit covariance by this matrix — we can also call it the covariance matrix of the transmit vector x bar — it is the expected value of x bar minus mu bar x times x bar minus mu bar x transpose. This is the expression for the covariance matrix.

(Refer Slide Time: 17:14)

Now, in this case we have already seen that mu bar x, the mean, is 0. So, this is basically the expected value of x bar times x bar transpose, and it is denoted, as I already said, by R x; it is also termed the covariance matrix of x bar. This is the covariance matrix of the transmit vector x bar, or, as it is simply known in practice and frequently in the literature and research, the transmit covariance, ok. So, this is also simply known as the transmit covariance, which is the expected value of x bar x bar transpose.
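As a hedged sketch of how this quantity can be estimated in practice (the symbol statistics below are assumed — simply i.i.d. unit-variance samples), the transmit covariance can be approximated by a sample average:

    import numpy as np

    rng = np.random.default_rng(3)
    n, num_samples = 4, 100000
    # Zero-mean transmit vectors x_bar (illustrative: i.i.d. unit-variance symbols)
    X = rng.standard_normal((n, num_samples))

    # Sample estimate of R_x = E[x_bar x_bar^T]
    R_x = (X @ X.T) / num_samples
    print(np.round(R_x, 2))        # close to the identity for this choice of symbols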
(Refer Slide Time: 18:10)

And we can show that any such covariance matrix, that is, R x, is positive semi definite; it can be shown very easily, and in fact we will do just that — this transmit covariance matrix, which arises frequently, is a positive semi definite matrix, that is, a PSD matrix, ok. So, this is a very important property which arises very frequently, namely positive semi definiteness, and one of the most important types of matrices that we are going to see are the covariance matrices of these random vectors. Every such covariance matrix — for instance the transmit covariance matrix, which is the covariance of the transmitted vector — is a positive semi definite matrix; this can be shown simply as follows.

(Refer Slide Time: 19:07)

So, we want to show that R x, the transmit covariance, is a positive semi definite matrix. We form Z bar transpose R x Z bar and we have to show that this is greater than or equal to 0. This is equal to Z bar transpose times the expected value of x bar x bar transpose times Z bar which, on taking the Z bar inside the expectation, becomes the expected value of Z bar transpose x bar times x bar transpose Z bar, that is, the expected value of Z bar transpose x bar times the transpose of Z bar transpose x bar. Now, you can see that Z bar transpose x bar is a scalar quantity, ok — the transpose of a vector times another vector — so this is a scalar quantity. This implies that Z bar transpose x bar equals the transpose of Z bar transpose x bar, because for a scalar quantity, that is, when the quantity is simply a real number, the transpose of the quantity is itself.
(Refer Slide Time: 20:34)

Therefore, this is simply equal to the expected value of Z bar transpose x bar times itself, which is basically Z bar transpose x bar whole square, and this is the expected — the average — value of a non-negative quantity, so it is greater than or equal to 0. Because Z bar transpose x bar whole square is always greater than or equal to 0, its mean, the expected value, is also always greater than or equal to 0, which means that Z bar transpose R x Z bar is always greater than or equal to 0. This implies that R x, the covariance matrix — in fact, for that matter, any covariance matrix — is positive semi definite, ok. So, that basically completes the proof. This shows that not just the transmit covariance, but the receive covariance, or any covariance matrix of a random vector, is a positive semi definite matrix, and the covariance matrix has an important role to play.

(Refer Slide Time: 21:43)

In fact, the covariance matrix is related to the transmit power of the signal; this is a very important property. If you look at the covariance matrix, let us just expand it a little bit to give you a better idea. If you write it as the expected value of the column vector x 1, x 2, up to x n times the row vector x 1, x 2, up to x n, this is equal to the expected value of the matrix whose entries are x 1 square, x 1 x 2, and so on in the first row, then x 1 x 2, x 2 square, and so on. And now, let us look at the trace of this — the trace being the sum of the diagonal elements, correct.
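Coming back to the property just established — that any such covariance matrix is positive semi definite — here is a small numerical check (with an arbitrary correlated signal model assumed below, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    n, num_samples = 5, 50000
    A = rng.standard_normal((n, n))
    X = A @ rng.standard_normal((n, num_samples))   # correlated zero-mean signal vectors

    R_x = (X @ X.T) / num_samples                   # sample covariance (transmit covariance)

    # PSD check: z^T R_x z >= 0 for a random z, equivalently all eigenvalues >= 0
    z = rng.standard_normal(n)
    print(z @ R_x @ z >= 0, np.linalg.eigvalsh(R_x).min() >= -1e-10)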
(Refer Slide Time: 23:03)

And so, if you look at the trace — this is basically your R x, your covariance matrix, ok — the trace of R x is, remember, the sum of the diagonal elements of this square matrix.

(Refer Slide Time: 23:27)

That will be equal to the expected value of x 1 square plus the expected value of x 2 square plus so on plus the expected value of x n square, and this is nothing but the power of each symbol: the expected value of x i square is the power of symbol x i. So, this sum is basically the total transmit power. So, the trace of the covariance matrix is nothing but the total transmit power, and that has to be less than or equal to the maximum transmit power at the transmitter. Therefore, we will have this constraint in a practical communication system: if we denote the maximum transmit power by P T, the constraint is that the trace of R x is less than or equal to P T. So, this is an alternative way of writing the transmit power constraint, where P T is the maximum transmit power.

(Refer Slide Time: 24:50)

And this is basically your trace of R x, which is the total transmit power. So, we have that the total transmit power is less than or equal to the maximum possible transmit power. So, this covariance matrix has a prominent role to play in wireless communications; in fact, we will frequently encounter this notion of the transmit covariance, the receive covariance, the interference covariance matrix and so on, and it all has to do with the power of a particular signal, alright. It gives an indication of what is basically the power of the signal, which is indeed a random vector, ok, alright.
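A short numerical illustration of this relation — trace of R x equals the total transmit power — with assumed per-antenna powers and an assumed limit P T, might look like this:

    import numpy as np

    rng = np.random.default_rng(5)
    n, num_samples = 3, 200000
    powers = np.array([1.0, 2.0, 0.5])                      # assumed per-antenna symbol powers
    X = np.sqrt(powers)[:, None] * rng.standard_normal((n, num_samples))

    R_x = (X @ X.T) / num_samples
    total_power = np.mean(np.sum(X**2, axis=0))             # estimate of E[ sum_i x_i^2 ]

    P_T = 4.0                                               # assumed maximum transmit power
    print(np.trace(R_x), total_power, np.trace(R_x) <= P_T) # trace(R_x) ~ 3.5 <= P_T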
(Refer Slide Time: 25:50)

And let us now move on to another important concept, which is to explore the properties of convex sets. So, we want to explore this notion of the properties of convex sets, or basically operations that preserve convexity — you can also think of these as operations that can be performed on convex sets that preserve their convexity. Now, the first property is very simple.

(Refer Slide Time: 26:41)

The first property is that intersection preserves convexity; that is, if we look at the intersection of two convex sets, the result is convex. What this means is that if S 1 is a convex set and S 2 is a convex set — both S 1 and S 2 are convex — then S 1 intersection S 2 is also convex. In fact, this can be extended to any arbitrary number of sets: if each set is convex, then the intersection of all these sets is also convex. That is an interesting property, and it is also very simple to verify; you can also prove it formally.

(Refer Slide Time: 27:56)

For instance, take two circles and look at the intersection region; this looks nothing like a circle, but yet you can see that it is a convex set — if you take any two points in it, the line segment between them stays inside. So, the two circular regions are convex, and the intersection of these two circles is also convex, ok.

And we have already seen an example in this regard: the intersection of hyperplanes and half spaces is also convex.
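As a rough empirical sketch of this two-circle example (a sampling check, not a proof; the centres and radii are arbitrary assumptions), one can test that random convex combinations of points in the intersection stay in the intersection:

    import numpy as np

    rng = np.random.default_rng(6)
    # Two convex sets: discs of radius 1.5 centred at (0, 0) and (1, 0)
    in_s1 = lambda p: np.linalg.norm(p - np.array([0.0, 0.0])) <= 1.5
    in_s2 = lambda p: np.linalg.norm(p - np.array([1.0, 0.0])) <= 1.5
    in_both = lambda p: in_s1(p) and in_s2(p)

    # Sample pairs of points in the intersection and test random convex combinations
    ok = True
    for _ in range(2000):
        p, q = rng.uniform(-2, 3, 2), rng.uniform(-2, 3, 2)
        if in_both(p) and in_both(q):
            theta = rng.uniform()
            ok &= in_both(theta * p + (1 - theta) * q)
    print(ok)   # stays True: the intersection is convex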
(Refer Slide Time: 28:41)

For instance, if you look at the intersection of hyperplanes and half spaces, it is also convex. In fact, each hyperplane is convex, each half space is convex, and so the intersection of these is convex. That intersection has a special name: it is termed a polyhedron; we have already seen this, ok. So, this intersection of hyperplanes and half spaces, which is convex, has a special name — it is termed a polyhedron, alright.

We have looked at several interesting aspects: first we looked at positive semi definite matrices and verified that the set of symmetric positive semi definite matrices is a convex set, and we also started looking at the properties of convex sets. We will continue this discussion further in the subsequent modules.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 18
Introduction to Affine functions and examples: Norm cones l 2, l p, l 1, norm balls

Hello. Welcome to another module in this massive open online course. So, we are looking at the properties of convex sets, or basically operations on convex sets that preserve convexity, alright. Let us continue our discussion and look at another important operation that preserves convexity, which is known as an affine function, ok.


(Refer Slide Time: 00:34)

So, the next important transformation that preserves convexity, and one that arises fairly frequently, is what is known as an Affine Function. What is an affine function? If x bar is a vector, an affine function is a function of the form A x bar plus b bar; that is, x bar is multiplied by the matrix A and translated by the vector b bar. A function of this form is termed an Affine Function.
(Refer Slide Time: 01:50)

Now, under an affine function, the interesting property — its relevance with respect to convex sets — is that if S is convex, this implies that F of S, that is, the image of S under the affine transformation, is convex, ok. So, the affine transformation of all elements of S also results in a convex set, ok. Typically, for instance, if you take a convex set and you rotate it and translate it, which corresponds basically to an affine transformation, the resulting set is also convex.

Now, interestingly, one can also show that an affine pre composition also results in a convex set, alright. What is the meaning of that? That is, if S is convex, then F inverse of S is also convex.

(Refer Slide Time: 02:50)

Now, what is F inverse of S? F inverse of S, the inverse image of the set under this affine map, is the set of all vectors x bar such that F of x bar belongs to S. This is known as an affine pre composition. So, we have F of S, which is the affine composition, and F inverse of S, which is the Affine Pre composition. For instance, an application can be demonstrated as follows.

(Refer Slide Time: 0:44)

Consider the following simple example. We have already seen a Norm Cone; let us go back to our illustration of the Norm Cone. What we have seen in the Norm Cone is that we have this vector x tilde, which is of the form x bar stacked with x n plus 1: an n dimensional vector x bar and another, the n plus 1th, element x n plus 1. The norm cone is described by the set where the norm of x bar is less than or equal to x n plus 1, which implies that norm x bar square is less than or equal to x n plus 1 square, which in turn implies that x bar transpose x bar is less than or equal to x n plus 1 square, correct — because remember norm x bar square is simply x bar transpose x bar.
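A tiny sketch of this membership condition (the test vectors are arbitrary examples of my own) is:

    import numpy as np

    def in_norm_cone(x_tilde):
        """x_tilde = [x_bar; x_{n+1}]; member iff ||x_bar||_2 <= x_{n+1}."""
        x_bar, x_last = x_tilde[:-1], x_tilde[-1]
        return np.linalg.norm(x_bar) <= x_last

    print(in_norm_cone(np.array([3.0, 4.0, 6.0])),   # ||(3, 4)|| = 5 <= 6  -> True
          in_norm_cone(np.array([3.0, 4.0, 4.0])))   # ||(3, 4)|| = 5 >  4  -> False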

So, x bar transpose x bar less than or equal to x n plus 1 square is an alternative representation of the Norm Cone. Now, let us see what the affine pre composition corresponds to. Let us consider x bar equals P times another vector V bar, and x n plus 1 equals C bar transpose V bar. Then I can write x tilde — which, as we have already seen, is the vector x bar stacked with x n plus 1 — as the stacked matrix consisting of P and C bar transpose, times V bar. So, this stacked matrix is your matrix A and b bar is 0; this is an affine transformation, or rather an Affine Function.

(Refer Slide Time: 04:44)

Now, we want to find F inverse of S, where S is our convex set, the norm cone. F inverse of S will be the set of all V bar such that F of V bar belongs to S. Now, F of V bar belongs to S implies — well, we have already seen — x bar transpose x bar less than or equal to x n plus 1 square. Substituting for x bar and x n plus 1, we have P times V bar, transposed, times P times V bar less than or equal to the square of C bar transpose V bar — remember, x n plus 1 is C bar transpose V bar, and we take the square of that. This implies that V bar transpose P transpose P V bar is less than or equal to C bar transpose V bar whole square, which implies V bar transpose V tilde V bar is less than or equal to C bar transpose V bar whole square, ok.

(Refer Slide Time: 06:14)


(Refer Slide Time: 07:45)

And this matrix V tilde is defined as P transpose P, and you can see that this is a positive semi definite matrix. So, what you can see is that this set of V bar which satisfies this relation — by the property of the affine pre composition, since we said F of V bar, that is, x bar, belongs to S, the norm cone — also forms a convex set. So, the set of all V bar such that F of V bar belongs to S, which is characterized by this relation, also forms a convex set, ok. And in fact, this is a convex cone; we can think of this as a general expression for a convex cone given by the affine pre composition, ok, alright. So, these are very interesting properties. The first one is rather simple and basically says that if two sets — or any number of sets — are convex, their intersection is also convex. And further, if you consider an affine function F and a convex set S, then both F of S and F inverse of S are also convex, alright.

(Refer Slide Time: 09:59)

Let us now move on to another interesting aspect and revisit the concept of Norm Balls that we have seen previously. What is a norm ball? Remember, the norm ball was defined as follows. I have the 2 norm, also known as the l 2 norm, which you can write as the square root of magnitude x 1 square plus magnitude x 2 square plus so on up to magnitude x n square; this is the l 2 norm. The corresponding l 2 norm ball is given by the l 2 norm of x bar being less than or equal to some radius r, let us say equal to 1, ok. So, this is your l 2 norm ball.

(Refer Slide Time: 11:15)

And we have also seen that this l 2 norm ball, for instance in two dimensions, corresponds to a circle, and in n dimensions it is a sphere, ok. So, this is your l 2 norm ball, ok.

(Refer Slide Time: 11:48)

Now, in general, one can define what is known as an l P norm. What is this l P norm? If you take a vector x bar, the l P norm, indicated by this subscript P, is given as magnitude x 1 to the power of P plus magnitude x 2 to the power of P plus so on up to magnitude x n to the power of P, the whole raised to the power of 1 over P. Now, you can see that if you set P equal to 2, it reduces to the l 2 norm; therefore it is general. For P equal to 2, it reduces to magnitude x 1 square plus magnitude x 2 square and so on up to magnitude x n square, the whole raised to the power 1 over 2 — that is, the square root of the whole thing — which is nothing but the l 2 norm. Now, this can be used to construct other very interesting norms.

(Refer Slide Time: 13:14)

So, for instance, the l 1 norm, which is one of the most fundamental and widely applied norms, is norm of x bar with subscript 1. You can see that it simply reduces to magnitude x 1 plus magnitude x 2 — each term raised to the power of P, which is 1 — plus so on up to magnitude x n, the whole raised to the power of 1 over P, which is again 1. So, this is simply magnitude x 1 plus magnitude x 2 plus so on up to magnitude x n; this is the l 1 norm. And the l 1 norm sphere, or the l 1 norm ball, is given by the l 1 norm of x bar being less than or equal to 1. To look at this, let us consider a 2 D example, the 2 dimensional case.
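These norms are easy to compute directly; the following sketch (an illustrative vector of my own, and a large finite P standing in for the limiting case discussed shortly) compares the formula with NumPy's built-in norms:

    import numpy as np

    x = np.array([1.0, -2.0, 3.0])

    def lp_norm(x, p):
        return np.sum(np.abs(x)**p)**(1.0 / p)

    print(lp_norm(x, 1), np.linalg.norm(x, 1))        # 6.0 = |1| + |2| + |3|
    print(lp_norm(x, 2), np.linalg.norm(x, 2))        # sqrt(14)
    print(lp_norm(x, 50), np.linalg.norm(x, np.inf))  # large P approaches max |x_i| = 3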
(Refer Slide Time: 14:22)

If x bar is the vector with components x 1 and x 2, then norm x bar 1 less than or equal to 1 implies magnitude x 1 plus magnitude x 2 less than or equal to 1, ok. Now, how do we find this norm ball? You can consider four cases. One is x 1, x 2 both greater than or equal to 0, in which case magnitude x 1 is nothing but x 1 and magnitude x 2 is nothing but x 2, so the condition is x 1 plus x 2 less than or equal to 1. This corresponds to the first quadrant.

(Refer Slide Time: 15:25)

In the second quadrant we have x 1 less than 0 and x 2 greater than or equal to 0; this corresponds to the case where magnitude x 1 is minus x 1, so the condition will be minus x 1 plus x 2 less than or equal to 1 — I am sorry, this is 1, not 0. This is the second quadrant. Then in the third quadrant both are negative, so you will have minus x 1 minus x 2 less than or equal to 1. And in the fourth quadrant you will have x 1, because x 1 is greater than or equal to 0, minus x 2, because x 2 is less than 0, less than or equal to 1. So, these are the four cases, and if you plot them, you will find something very interesting.

(Refer Slide Time: 16:17)

If you plot the l 1 norm ball, what you will observe is that the first quadrant corresponds to x 1 plus x 2 less than or equal to 1, which is basically this region: x 1, x 2 greater than or equal to 0 and x 1 plus x 2 less than or equal to 1. Similarly, there will be corresponding regions in the second, third and fourth quadrants. Therefore, if you look at this, what you will observe is the region corresponding to the l 1 norm ball, and it is very interesting. It is very different from the l 2 norm ball, in the sense that you can see that it has pointed edges — something very interesting. This simple observation, which means the ball's boundary is non differentiable at those points, in fact leads to profound implications.

So, if you look at the l 2 norm ball, you can see it is smooth; it has no (Refer Time: 17:47) or edges. So, the l 2 norm is very amenable to analysis, that is, it can be easily differentiated and so on, whereas if you look at the l 1 norm ball, it is something very interesting: it is a square with the diagonals along the axes. So, it is a tilted square, and being a square it has sharp edges at which it is not differentiable — something that is very interesting.

It is a very interesting shape. This is not what you usually think of when you think of a ball: it is basically a tilted square, its angles are 90 degrees, it is symmetric, and the diagonals are aligned with — lie on — the axes, that is, your x and y axes, or your x 1 and x 2 axes, ok. So, this is the l 1 norm ball. Now, related to this — we have seen the l 1 norm ball — something very interesting is what is known as the l infinity norm, that is, what happens when P tends to infinity.

(Refer Slide Time: 19:15)

So, the l infinity norm, that is, norm of x bar infinity, is defined as the limit as P tends to infinity of the l P norm of x bar, which is the limit as P tends to infinity of magnitude x 1 raised to the power of P plus magnitude x 2 raised to the power of P plus so on up to magnitude x n raised to the power of P, the whole raised to the power of 1 over P. This can be shown to be the maximum of magnitude x i over 1 less than or equal to i less than or equal to n, which is simply the maximum of magnitude x 1, magnitude x 2, and so on up to magnitude x n.

So, this is the l infinity norm, something that is very interesting. And now one can correspondingly derive the norm ball corresponding to the l infinity norm, which is naturally given as norm of x bar infinity less than or equal to 1. So, this is an interesting norm.

(Refer Slide Time: 20:15)

So, the infinity norm of a vector is simply the maximum of the absolute values of the components of that vector, and the l infinity norm ball is simply the region corresponding to the infinity norm of a vector x bar being less than or equal to some radius; in this particular case, you can say the radius is equal to 1, alright.

So, we will stop here and continue with this discussion in the subsequent module.

Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 19
Norm balls and Matrix Properties: Trace, Determinant

Hello, welcome to another module in this massive open online course. So, we are looking at the l infinity ball, alright; we have defined the l infinity norm, which is the maximum of the magnitudes of the different components of a vector x bar, and let us now continue our discussion with the l infinity ball.

(Refer Slide Time: 00:35)

So, the l infinity ball is simply defined as the infinity norm of a vector being less than or equal to 1. Let us consider a simple scenario: a 2 dimensional vector x bar which has 2 components, x 1 and x 2. The l infinity norm will then simply be the maximum of magnitude x 1 and magnitude x 2. Now, the l infinity norm being less than or equal to 1 implies that the maximum of magnitude x 1 and magnitude x 2 is less than or equal to 1, and since the maximum of the 2 quantities is less than or equal to 1, each of the quantities has to be less than or equal to 1.

(Refer Slide Time: 02:03)

Now, magnitude x 1 less than or equal to 1 implies that minus 1 less than or equal to x 1 less than or equal to 1, that is, x 1 has to lie between minus 1 and 1; and further, magnitude x 2 less than or equal to 1 implies minus 1 less than or equal to x 2 less than or equal to 1.

(Refer Slide Time: 02:26)

And so this is the intersection of these 4 — in fact, you can see that there are 4 half spaces. One is given by x 1 less than or equal to 1, where x 1 is the first coordinate; so, let us say this corresponds to the hyperplane x 1 equal to 1, and on the opposite side this corresponds to the hyperplane x 1 equals minus 1, and the strip in between denotes the region minus 1 less than or equal to x 1 less than or equal to 1. Similarly, this corresponds to the hyperplane x 2 equal to 1 and this corresponds to the hyperplane x 2 equals minus 1, and the region between these hyperplanes — this polyhedron — is in fact your l infinity ball; that is, the square is the l infinity ball.
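A quick sampling check of this picture — that the l infinity ball in 2 D is exactly the intersection of the four half spaces x 1 less than or equal to 1, minus x 1 less than or equal to 1, x 2 less than or equal to 1 and minus x 2 less than or equal to 1 — could be (illustrative sample size, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(7)
    pts = rng.uniform(-2, 2, (10000, 2))

    in_linf_ball = np.max(np.abs(pts), axis=1) <= 1
    # The same region as the intersection of the four half spaces
    in_half_spaces = ((pts[:, 0] <= 1) & (-pts[:, 0] <= 1) &
                      (pts[:, 1] <= 1) & (-pts[:, 1] <= 1))
    print(np.array_equal(in_linf_ball, in_half_spaces))   # True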

(Refer Slide Time: 03:38)

So, the l infinity ball equals a square — a square with sides that are parallel to the axes. Remember, the l 1 square had its diagonals along the axes, whereas the l infinity square is the normal square that you would imagine, with sides parallel to the axes. So, the l infinity ball looks like a square, ok. This is something interesting: normally when you think of balls, you think of circles and spheres, but the l 1 norm ball is a tilted square and the l infinity norm ball is essentially a square, alright. So, these are also norm balls; that is, when you generalize the definition of a norm ball, you can derive these kinds of norm balls, which have very interesting shapes, ok.

(Refer Slide Time: 05:14)

And we also have this notion of an l 0 norm, which is very interesting. What is the l 0 norm? The definition is that norm x bar 0 equals the number of non-zero elements of x bar; this is very interesting — the number of non-zero elements of x bar. So, if you minimize the l 0 norm — and this is a very important idea — if you minimize norm x bar 0, what you will observe is that a large number of the components of x bar will be 0. So, you will typically get a vector x bar in which a large number of its components are 0, because the l 0 norm is the number of non-zero elements.
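The "l 0 norm" of a vector is simply a count of non-zero entries; a one-line sketch with an arbitrary example vector:

    import numpy as np

    x = np.array([0.0, 3.0, 0.0, 0.0, -1.5, 0.0])
    l0 = np.count_nonzero(x)        # "l0 norm": number of non-zero entries
    print(l0)                       # 2 -> x is sparse: only 2 of its 6 entries are non-zero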
(Refer Slide Time: 06:52)

Such a vector x bar, which has a large number of zero elements, has a very interesting name: it is known as a sparse vector, alright. Similar to the usage of a sparsely populated area, which essentially means that there are very few people, a sparse vector basically denotes a vector that is sparsely populated with components — only very few non-zero components, and a large number of components equal to 0.

(Refer Slide Time: 07:32)

So, it brings to mind some vector like this, which has some components zero and some non-zero, alright. So, what you have is a vector in which a large number of elements equal 0; such a vector is termed a sparse vector, and this is a very interesting property.

(Refer Slide Time: 08:10)

And what is interesting about this is that you can show that most naturally occurring signals — be it music or video or images — although they are not themselves sparse, are sparse in some appropriate domain; for instance, when you look at either the Fourier transform or the wavelet transform of these signals, they are very sparse, and that is a very important property.

So, most naturally occurring signals, such as a music signal or video or images, are sparse in some suitable domain, for example either the wavelet domain or the Fourier domain — that is, when you take the Fourier transform, or, you can also say, in the frequency domain. So, they are sparse in some domain, and this is a very important idea which can be used in signal processing to improve performance.
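As a toy illustration of this idea (a made-up three-tone signal standing in for a "naturally occurring" one), a signal that is dense in time can be very sparse in the Fourier domain:

    import numpy as np

    n = 512
    t = np.arange(n)
    # A toy "music-like" signal: a sum of three tones; dense in the time domain...
    x = (np.sin(2 * np.pi * 13 * t / n) +
         0.5 * np.sin(2 * np.pi * 40 * t / n) +
         0.2 * np.sin(2 * np.pi * 97 * t / n))

    X = np.fft.rfft(x)                          # ...but sparse in the Fourier domain
    significant = np.sum(np.abs(X) > 1e-6 * np.abs(X).max())
    print(np.count_nonzero(np.abs(x) > 1e-6), significant)   # nearly all time samples vs only a few bins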
(Refer Slide Time: 09:53)

This important property, that is, sparsity, can be exploited for signal processing. One of the important areas where this idea is used is known as compressive sensing, also abbreviated as CS. Compressive sensing is a relatively new field — in fact, a path breaking innovation, I must say — which has gained a lot of popularity in the recent past, wherein you exploit the knowledge that this vector x bar, which corresponds to a naturally occurring signal, is sparse in a certain suitable domain. That can be used in signal processing to further improve the performance in comparison to other schemes, alright. So, this is in fact a breakthrough paradigm: the compressive sensing framework, or the framework that exploits the sparsity of signal vectors, is a breakthrough framework that can be used for enhanced signal processing, alright. So, with that let us complete this discussion and move on to looking at some problems to better understand the concepts. Let us start with a few problems related to determinants and the positive semi definite property. So, I want to start with some examples.

(Refer Slide Time: 12:03)

To start with, let us look at some simple examples related to matrices and their properties. So, let A be an n cross p matrix and B a p cross n matrix; the first thing we want to show is that — I am sorry — the determinant of I n plus A B is the determinant of I p plus B A. Let us make this an n cross m matrix, that is, A is n cross m and B is m cross n; I think that will make life much simpler. Naturally, if you are multiplying them, they have to have dimensions that match up. So, I n is the n cross n identity matrix and, on the other hand, I m is the m cross m identity matrix; now the solution is as follows.
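Before walking through the block-matrix argument, here is a quick numerical spot check of the claimed identity (random matrices of assumed sizes; a check, not a proof):

    import numpy as np

    rng = np.random.default_rng(8)
    n, m = 5, 3
    A = rng.standard_normal((n, m))
    B = rng.standard_normal((m, n))

    lhs = np.linalg.det(np.eye(n) + A @ B)
    rhs = np.linalg.det(np.eye(m) + B @ A)
    print(np.isclose(lhs, rhs))     # True: det(I_n + AB) = det(I_m + BA)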
(Refer Slide Time: 13:56)

Now, let us consider the matrix P given in block form, with I n and minus A in the first block row, and B and I m in the second block row. I can perform row operations — in this case block row operations, ok. Let us call the first block row R 1 and the second block row R 2.

(Refer Slide Time: 14:34)

Now, let us obtain P tilde by performing R 2 goes to R 2 minus B times R 1, that is, block row 2 minus B times block row 1. So, you obtain the matrix P tilde in which the first block row remains unchanged, and the second block row is B minus B times I n, which is 0, and I m minus B times minus A, that is, I m plus B A. Now, look at the determinant of P tilde, which is equal to the determinant of P; this holds since — remember, you might have seen this property of determinants before — the determinant remains unchanged under such row operations. So, the determinant of P tilde equals the determinant of P. Now, you see the lower left block is all 0s, which means the determinant of P tilde is simply the determinant of I n times the determinant of I m plus B A. The determinant of I n is 1, since it is an identity matrix, so that factor equals 1, and this is simply the determinant of I m plus B A. So, the determinant of P equals the determinant of I m plus B A; let us call this our result 1.

(Refer Slide Time: 16:32)

Now, let us form the matrix P hat by performing R 1 goes to R 1 plus A times R 2 on P — here, for P tilde, we had R 2 goes to R 2 minus B times R 1. This gives the matrix, you can check: the first block row becomes I n plus A times B — let me just check this: A is an n cross m matrix and B is m cross n, so A times B is n cross n — so this will be I n plus A times B, and the other entry will be minus A plus A times I m, which is 0. The second block row remains unchanged; that will be B and I m. And now, once again.
(Refer Slide Time: 18:35)

So, this is your P hat, and now, if you once again compute the determinant of P hat, that is going to be — because this upper right block is 0 — simply the determinant of I n plus A B times the determinant of I m. And, by the way, once again this is equal to the determinant of P, because you are performing row operations on P. The determinant of I m is 1, so this becomes the determinant of I n plus A B; call this result 2. Finally, from 1 and 2 it follows that the determinant of I plus B A equals the determinant of I plus A B, alright. So, this follows from the results in 1 and 2.

(Refer Slide Time: 19:56)

Let us look at another interesting example, example number 2: we want to show that the trace of a (Refer Time: 20:03) square matrix A is the sum of the eigenvalues of A, and the determinant of A is the product of the eigenvalues. So, the trace is equal to the sum of the eigenvalues, and the determinant is basically the product of the eigenvalues, ok. To show this, you can start with the following property: a general (diagonalizable) matrix A can be expressed as U lambda U inverse, alright, where U equals the matrix of eigenvectors.

(Refer Slide Time: 20:58)

Lambda is the diagonal matrix of eigenvalues.

(Refer Slide Time: 21:33)

Lambda equals the diagonal matrix of eigenvalues, and lambda has the following structure, ok — this assumes that A is an n cross n square matrix, alright. Now of course, we have already seen that if A is PSD, then the decomposition becomes U lambda U Hermitian. If A is a PSD matrix — while the expression U lambda U inverse above is valid for a general matrix A — it becomes U lambda U Hermitian, because U is orthogonal; U is a unitary matrix for a PSD matrix A, and U inverse is simply U Hermitian.

(Refer Slide Time: 22:39)

This is because U Hermitian U equals U U Hermitian equals identity, which implies that U Hermitian equals U inverse, ok; but this holds only for a PSD matrix, ok. However, we can show the above property — that the trace of the matrix A equals the sum of its eigenvalues — for any general matrix, and that is as follows. If you consider the trace of A, you use the eigenvalue decomposition and write the trace as the trace of U lambda U inverse. Now, we know that the trace of C D equals the trace of D C — that is, you can interchange the order of the product inside the trace (using C and D here so as not to confuse with A). So, this becomes the trace of lambda times U inverse U; it is very simple. Now, U inverse U is the identity, so this is the trace of lambda, which is the diagonal matrix of eigenvalues, and this is nothing but the summation over i of lambda i.
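A short numerical check of the two claims of this example, for an arbitrary (generally non-symmetric) square matrix of my own choosing:

    import numpy as np

    rng = np.random.default_rng(9)
    A = rng.standard_normal((6, 6))          # a general square matrix

    eigvals = np.linalg.eigvals(A)           # possibly complex eigenvalues
    print(np.isclose(np.trace(A), eigvals.sum().real),          # trace = sum of eigenvalues
          np.isclose(np.linalg.det(A), np.prod(eigvals).real))  # det = product of eigenvalues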
(Refer Slide Time: 24:03)

Now, similarly, one can also show the determinant property. If you look at the determinant of A, that is the determinant of U lambda U inverse; the determinant of a product of matrices is the product of the determinants, so this is equal to the determinant of lambda times the determinant of U times the determinant of U inverse. Now, look at the determinant of U times the determinant of U inverse: because U U inverse is the identity and the determinant of the identity is 1, the determinant of U times the determinant of U inverse is 1. This implies that the determinant of A equals simply the determinant of lambda, which is nothing but the product of the eigenvalues — the product over i equals 1 to n of lambda i — and this is a very interesting property that is frequently used. So, you can say the determinant of A equals the product of its eigenvalues, and the trace of a matrix equals the sum of its eigenvalues, ok.

(Refer Slide Time: 25:32)

Now, similarly, you can also exploit this property. If you look at the trace of a matrix A raised to the power of n, that equals the trace of A times A times so on, the product taken n times. We have already seen that, for a positive semi definite matrix, this becomes U lambda U inverse times U lambda U inverse and so on, times U lambda U inverse, which becomes — I am sorry — the trace of U lambda raised to the n U inverse, which is nothing but the trace of lambda raised to the power of n times U inverse U, and U inverse U is the identity. So, this is the trace of lambda raised to the power of n, which is the summation over i of lambda i raised to the power of n; that is the trace of A raised to the power n.

And in a similar way, you can show that the determinant of A raised to the n is the product over i equals 1 to n of lambda i raised to the power of n; this is nothing but the determinant of U lambda raised to the power of n U inverse, which is how we get this result, ok. So, these are some interesting properties — in fact, very interesting properties that come in handy frequently during manipulations: the trace of a square matrix A is the sum of its eigenvalues, and the determinant of a square matrix A is basically the product of its eigenvalues, good.

So, we will stop here and continue this discussion by looking at other examples in the subsequent modules.

Thank you very much.


Applied Optimization for Wireless, Machine Learning, Big data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture-20
Inverse of a Positive Definite Matrix, Eigenvalue Properties and Relation between different norms

Hello, welcome to another module in this massive open online course. So, we are looking at examples on matrices and also convex sets. Let us continue the discussion and look at another example related to Positive Definite Matrices, alright.

(Refer Slide Time: 00:28)

So, let us continue looking at examples related to matrices and convex sets. In this example, number 3, what we have to show is that if A is a PD matrix, that is, positive definite, then A inverse is also positive definite. We want to show that if A is positive definite, A inverse is also positive definite, and this can be shown as follows. A can be expressed, as we have already seen, as U lambda U Hermitian, where U is a unitary matrix satisfying U U Hermitian equals U Hermitian U equals identity, and lambda is a diagonal matrix of eigenvalues. Further, we have seen that the eigenvalues of any positive definite matrix have to be greater than 0: the eigenvalues of a positive semi definite matrix are greater than or equal to 0, whereas for a positive definite matrix they have to be strictly greater than 0 — it cannot have any eigenvalues equal to 0. And therefore, if you now look at A inverse, it is rather easy to see that A inverse equals the inverse of U lambda U Hermitian, which is basically U Hermitian inverse times lambda inverse times U inverse.

(Refer Slide Time: 02:44)

But we have seen that U is a unitary matrix, which implies U Hermitian U equals U U Hermitian equals identity; well, this implies that the inverse of U Hermitian is U itself. So, what this implies is that if you look at U Hermitian inverse, that is U itself — because U U Hermitian equals identity — times lambda inverse, and U inverse is U Hermitian, because again U U Hermitian equals identity. Therefore, this again has the same structure, U lambda inverse U Hermitian, except, as you can see, with eigenvalues 1 over lambda 1, 1 over lambda 2, up to 1 over lambda n.
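A small numerical sketch of this conclusion — with A constructed to be positive definite in an assumed, illustrative way — checks both that A inverse is positive definite and that its eigenvalues are the reciprocals:

    import numpy as np

    rng = np.random.default_rng(10)
    B = rng.standard_normal((4, 4))
    A = B @ B.T + np.eye(4)                  # positive definite by construction

    A_inv = np.linalg.inv(A)
    lam = np.sort(np.linalg.eigvalsh(A))
    lam_inv = np.sort(np.linalg.eigvalsh(A_inv))

    print((lam_inv > 0).all(),                            # A^{-1} is also positive definite
          np.allclose(lam_inv, np.sort(1.0 / lam)))       # its eigenvalues are 1/lambda_i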
(Refer Slide Time: 03:29)

And therefore, now you can see that if lambda i is greater than 0, this also implies that 1 over lambda i is greater than 0. So, the eigenvalues of A inverse are also greater than 0; in effect, A inverse can be expressed as U lambda inverse U Hermitian, and therefore it has positive eigenvalues, alright, and therefore it is also a positive definite matrix. And you can also check this as follows. For any real vector Z bar, consider Z bar transpose A inverse Z bar; since this is a real vector, I can write it as Z bar Hermitian A inverse Z bar, and A inverse, we have seen, is U lambda inverse U Hermitian. Now, if you set U Hermitian Z bar equal to Z tilde, then this will become Z tilde Hermitian lambda inverse Z tilde.

(Refer Slide Time: 05:09)

Which is equal to the row vector Z tilde 1, Z tilde 2, up to Z tilde n — conjugated, since this is the Hermitian of the vector, ok; we are setting U Hermitian Z bar equal to Z tilde — times the diagonal matrix with entries 1 over lambda 1, 1 over lambda 2, up to 1 over lambda n, times the column vector Z tilde 1, Z tilde 2, up to Z tilde n. And if you look at this, it is nothing but the summation over i equals 1 to n of 1 over lambda i times Z tilde i conjugate times Z tilde i, that is, 1 over lambda i times magnitude Z tilde i square.

(Refer Slide Time: 06:03)

Now, 1 over lambda i is greater than 0 and magnitude Z tilde i square is greater than or equal to 0. Therefore, this implies that Z bar transpose A inverse Z bar is greater than 0 for all non-zero vectors Z bar, and this implies that A inverse is a positive definite matrix. So, if A is a positive definite matrix, A inverse is also a positive definite matrix. In fact, the eigenvalues of A inverse are the inverses: if lambda i is an eigenvalue of A, then the corresponding eigenvalue of A inverse is 1 over lambda i. The eigenvalues of A are strictly greater than 0 if A is a positive definite matrix, and similarly the eigenvalues of A inverse are also strictly greater than 0, correct — since lambda is greater than 0, 1 over lambda is also greater than 0, alright. Let us continue our discussion and look at another example, example number 4.

(Refer Slide Time: 07:55)

What we want to show is that if A and B are 2 invertible n cross n matrices, then A B and B A have the same eigenvalues; we want to show this property, that the eigenvalues of A B are equal to the eigenvalues of B A. Well, we start with the characteristic polynomial — remember, to compute the eigenvalues of any matrix, in this case the matrix A B, we start with the characteristic polynomial of A B, which is obtained by looking at the determinant of A B minus lambda I; remember, the eigenvalues are computed as the roots of the characteristic polynomial.

The characteristic polynomial of a matrix A is the determinant of A minus lambda I; we want to look at the characteristic polynomial of the matrix A B, and therefore that will be the determinant of A B minus lambda I. So, this is equal to — I can write this as — the determinant of A B A A inverse minus lambda A A inverse, because remember A is an invertible matrix; both A and B are invertible matrices, so A A inverse equals identity.

(Refer Slide Time: 10:02)

So, I can always write this as the determinant of A B A A inverse minus lambda A A inverse, and now I can factor out A on the left and A inverse on the right, so this becomes the determinant of A times B A minus lambda I times A inverse. The determinant of a matrix product is the product of the determinants, that is, the determinant of A times the determinant of B A minus lambda I times the determinant of A inverse. The determinant of A inverse is basically 1 over the determinant of A, because A times A inverse is the identity. So, this is the determinant of B A minus lambda I times the determinant of A times 1 over the determinant of A, and these cancel. So, this becomes the determinant of B A minus lambda I. Therefore, we have this interesting property: the determinant of A B minus lambda I equals the determinant of B A minus lambda I, which implies that the characteristic polynomial of A B — the determinant of A B minus lambda I — equals the characteristic polynomial of B A.
(Refer Slide Time: 12:06) equals some V times lambda times V inverse because remember their eigenvalues are
equal. So, that diagonal matrix of eigenvalues will be the equal or identical similar
eigenvalues they have the same eigenvalues.

This is the same eigenvalues the diagonal matrices lambda right, the diagonal matrix of
eigenvalues will be the same for both A B and B A in their eigenvalue decomposition ok.
So now, this implies that B A equals V lambda V inverse so, this implies V times or V
inverse times B A into V equals lambda. Now, substitute lambda in the first one this
implies A B equals U times lambda, but lambda is V inverse B A into U inverse V and
this is nothing, but U V inverse let us call this as U tilde B A. If U inverse U V inverse is
U tilde then U inverse B becomes U tilde inverse.

So, I can write the matrix A B as some matrix U tilde I can write this as you tilde times B
A times U tilde inverse such matrices are said to be similar matrices. So, A B implies A B
The characteristic polynomial of A B equals the characteristic polynomial of B A this is similar to B A. In general C similar to D if there exist M such that C equals M inverse
implies the roots are equal roots are equal or identical and this implies. So, characteristic D M. So, if there exists a matrix M, such that you can write C equals M inverse D into
polynomials are equal implies the roots are identical. And this implies therefore, M, then the matrices then the matrices C and D are said to be similar matrices. So, in this
eigenvalues of A B equal eigenvalues of this implies eigenvalues of A B equals this case you can see these 2 matrices A B and B A in fact, which of the same eigenvalues
implies that eigenvalues of A B equal eigenvalues of B A ok. alright these are similar matrices alright.

(Refer Slide Time: 12:58) (Refer Slide Time: 14:43)

And in fact, you can see, if you write A B equals U lambda U inverse remember this
eigenvalue decomposition I can write these are eigenvalues are equal I can write B A
Let us now look at another interesting property that is the eigenvalues of unitary matrix. (Refer Slide Time: 18:25)
So, this is our example number 5 again another simple property, what can we say about
the eigenvalues of a unitary matrix, now let U be a unitary matrix.

(Refer Slide Time: 16:46)

So, this implies basically that eigen so, this basically shows a very interesting property
eigenvalues of unitary matrix, eigenvalues of a unitary matrix have unit magnitude that is
the interesting property that this shows alright. And now similarly, if you consider the
determinant of the unit the magnitude of the determinant let us consider the magnitude of
Remember the unitary matrix is defined by the property U Hermitian U equals U U
the determinant. Remember we have seen that the determinant is nothing, but the product
Hermitian equals identity. This is the property of the unitary matrix now let, x bar be the
of the eigenvalues so, the magnitude of the product of the eigenvalues, which is nothing,
eigen vector and lambda equals corresponding eigenvalue. Now this implies, what this
but the product of the magnitudes of the eigenvalues.
implies is that, U x bar equals lambda times x bar correct, which implies now you can
multiply U x bar Hermitian U x bar that will be equal to lambda x bar Hermitian. (Refer Slide Time: 19:40)
Because U x bar equals lambda x bar U x bar Hermitian equals lambda x bar Hermitian
multiplied by lambda x bar.

And this implies x bar Hermitian U Hermitian U x bar equals lambda Hermitian, but
lambda is a number so lambda Hermitian is simply lambda conjugate x bar Hermitian
lambda into x bar. Now U Hermitian U is identity because, U is unitary matrix that,
leaves x bar Hermitian x bar which is remember norm of x bar square this is equal to
lambda conjugate lambda, that is magnitude lambda square times x bar Hermitian x bar,
which is again once again norm x bar square which implies cancelling the norm x bar
square on both side. This implies magnitude lambda square equals 1 which means
magnitude lambda equals 1.
And each eigenvalue is unit magnitude so this is equal to the product of 1’s which is one (Refer Slide Time: 22:05)
which shows this ancillary property or you can also think of this as an axiom that the
determinant of a unitary matrix is identity. All the eigenvalues of a unitary matrix are
magnitude 1 and the determinant of unitary matrix has the magnitude of the magnitude
of the determinant of a unitary matrix is 1 as well alright. Let us continue a discussion let
us start with another example let us consider the norm relation between the 1 norm of a
vector we want to show that the 1 norm of a vector x bar is less than or equal to square
root of n times the 2 norm.

(Refer Slide Time: 21:12)

So, I am constructing two different vectors magnitude x 1 magnitude x 2 so on,


magnitude x n and v bar is a vector n dimensional vector of all 1’s. Now what I going to
do is, I am going to apply the Cauchy–Schwarz inequality, remember we have seen the
Cauchy–Schwarz inequality which states that the inner product square u bar v bar that is
less than u bar, that is norm u bar square into norm b bar square.

This implies that if you look at u bar transpose v bar square, that is less than or equal to
norm u bar square, norm v bar square. And this also implies that u bar transpose v bar is
less than or equal to norm u bar into norm v bar we know this property.
Now, if x bar is a vector with elements x 1, x 2 up to x n. Now, remember the 1 norm,
this is simply the sum of the magnitudes, the magnitude of magnitude x 1 plus magnitude
x 2, magnitude x n and the 2 norm is the square root of magnitude x 1 square plus
magnitude x 2 square plus so on, plus magnitude x n square. Now, to show the property
about what we will do is, we will consider two different vectors will construct 2 vectors
u bar and components the elements of u bar are magnitude x 1 magnitude x 2.
(Refer Slide Time: 23:30) these in 1, this basically yields norm x bar 1 that is u transpose v bar less than or equal to
norm v bar that is square root of n into norm u bar that is a 2 norm of x bar.

So, this is an interesting property that we have ok. So, this is the property or the relation
you can say characterizes the relation between the 1 norm and the 2 norm. In fact you
can also show something between the relation between the 2 norm the infinity norm.

(Refer Slide Time: 25:45)

Now, all we have to do is substitute the definition from the definition above substitute u
bar and v bar, you can see u bar transpose v bar is nothing but, magnitude x 1 plus
magnitude x 2 so on, up to magnitude x n. Which is basically norm x bar of 1 and norm u
bar that is the 2 norm remember the 2 norm u bar is square root of magnitude x 1 square
plus so on magnitude x n square which is nothing, but the 2 norm of x bar.

(Refer Slide Time: 24:29)


You can also show; that the infinity norm that is if you look at the 2 norm, now this is
equal to well we have seen magnitude x 1 square plus magnitude x 2 square plus
magnitude x n square, this is a sum of the squares of the magnitude is all the elements.
Now, this is greater than or equal to you simply take the maximum of the maximum of
the magnitude correct. This is the sum of the squares of the magnitude of all the
elements, which is greater than equal to the square of simply the magnitude of the
maximum of these elements, which is equal to now you take the square root all you are
left with is the maximum of magnitude x i which is nothing, but the l infinity norm ok.

So, therefore, this shows that so, this shows that basically your this thing is greater than
equal to the a l 2 norm is great. So, this basically shows that your l 2 norm is greater than
or equal to the l infinity norm alright. So, let us stop here and we will continue with other
aspects in the subsequent modules.
And finally, the 2 norm l 2 norm of v bar is square root of this is 1 plus 1 plus 1 n times
this is nothing, but square root of n. And now using this property using 1 substituting all Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:53)
Prof. Aditya K. Jagannatham
Department of electrical engineering
Indian Institute of Technology, Kanpur

Lecture – 21
Example Problems: Property of Norms, Problems on Convex Sets

Hello welcome to another module in this massive open online course. So, we are looking
at example problems related to matrices and convex sets. Let us continue our discussion.

(Refer Slide Time: 00:24)

Which you can expand as follows which is basically well it is magnitude x 1 square plus
magnitude x 2 square plus so, on magnitude x n square plus summation over all
combinations of i comma j the product magnitude x i into magnitude x j.

Now, this quantity this products cross products, the sum of all cross products, this is
greater than or equal to 0. Because the magnitudes are positive implies that this is greater
than or equal to magnitude x 1 square plus magnitude x 2 square plus magnitude x n
square and this is nothing, but the 2 norm square norm x 2 bar square.

(Refer Slide Time: 02:50)


So, you are looking at example problems in particular in the previous module we started
looking at the properties of norms, correct in and in particular we have seen that for
instance the l 2 norm is greater than the l infinity norm.

Now, in the same way and one can also show that the l 1 norm for a vector is greater than
or equal to the l 2 norm ok. And this can be shown simply as follows. If you look at the l
1 norm that is simply for an n dimensional vector, the sum of the absolute values of the
magnitudes that is what we have seen that is the l 1 norm ok. And if you look at the
square of the l 1 norm the square of this quantity that is simply magnitude x 1 plus
magnitude x 2 plus magnitude x n whole square.
So, we basically have what we have is that norm x bar the 1 norm square is greater than (Refer Slide Time: 04:29)
or equal to the 2 norms square and therefore, this implies both these quantities are
positive the square of 1 is greater than the square of the other this means the 1 norm is
greater than or equal to the 2 norm.

And we are already seen that the 2 norm is greater than the infinity norm. Therefore,
putting all these things together, you have the 1 norm is greater than or equal to the 2
norm is greater than or equal to the infinity norm.

(Refer Slide Time: 03:23)

So, what we want to look at next is convex sets, if you want to explore different types of
convex sets their definition classification and so, on ok. Now, for instance let us start
with the first problem, problem number let us call this as problem number the previous
one as problem number we had this, let us call this as problem number. So, we had
problem number I think 6. So, let us call this as problem number 5 6. So, let us call this
as problem number 7, I think we can start problem number or let us forget this we can
start problem number 7 over here.

(Refer Slide Time: 05:16)


This is the result that we have the 1 norm of a vector is greater than or equal to the 2
norm is greater than or equal to that infinitely norm. So, 1 norm is sum of the magnitude
values of the element. The 2 norm is you can think of this as the length the Euclidian the
length of the vector in Euclidian space. And the infinity norm is the maximum value of
the magnitudes of the different elements of the vector. These are the different norms ok.
Let us continue our discussion, now let us move on to look at example problems related
to convex sets and their applications alright. So, we have defined convex sets look at
their properties. Let us do some example to understand these things better ok.
Let us call this as problem number 7 and what we want to show is the set of let us say we (Refer Slide Time: 07:45)
can consider two vectors or two points in n dimensional space, denoted by the vectors a
bar and the vector b bar and we want to find. So, these are two points in n dimensional
space ok, these are all points to points in n dimensional space. Now, we want to find the
points set of points closer to a bar set of points that are closer to a bar than b bar.

(Refer Slide Time: 06:34)

Now, is this set convex well this implies that norm x bar minus a bar whole square less
than or equal to norm, x bar minus b bar whole square. Now, remember the norm square
of a vector is a vector transpose times itself, this implies norm x bar minus a bar
transpose norm x bar minus b bar less than or equal to or norm x bar minus a bar, vector
transpose times itself is less than or equal to norm x bar minus b bar transpose norm x
bar minus b bar.
And what we want to see is if we call this set S we want to show, we want to find is S
convex is this set S of all points which are closer to a bar that is where our two points in (Refer Slide Time: 08:29)
n dimensional space a bar and b bar. It is the set of points x bar which is which are closer
to a bar right, than b bar is this set of points convex. That is if you look at the set of
points, the distance of any point x bar between a bar we know that distance is the 2 norm,
norm of x bar minus a bar. And the distance to b bar is norm the 2 norm x bar minus b
bar, we want to see what is x bar belongs to this set if this distance that is this distance to
a bar is less than or equal to distance to.
This implies that if you split this or if you expand this will be x bar transpose x bar minus We have an equation of the form c bar transpose x bar less than or equal to well d and, if
a bar transpose x bar x bar transpose a bar. Now, since they are scalar a bar transpose x you look at this nothing, but a this is nothing, but a equation of the half space alright c
bar and x bar transpose a bar are the same. So, this will be minus 2 x bar transpose a bar, bar transpose x bar into equals d bar that is a hyper plane. And c bar transpose x bar less
plus a bar transpose a bar less than or equal to again the same thing similar x bar than equal to d some constant d that is the half space.
transpose a bar minus 2. I can also write it as b bar transpose x bar x bar transpose b bar
So, this set of all points x bar which are closer to a bar, then b bar represents half space
or b bar transpose x bar both are same plus b bar transpose b bar.
and therefore, the set is convex that completes the proof. So, if you look at this set it
And now you can see x bar transpose x bar x bar transpose x bar cancels this implies that represents a half space, implies this is convex.
bring the b bar over the other side this implies, let us write this as a bar transpose x bar.
(Refer Slide Time: 11:46)
Both of them are same x bar transpose a bar note that a bar transpose x bar equals,
because it is a number a bar transpose x bar transpose which is equal to x bar transpose a
bar because it is a number, it is a scalar quantity.

(Refer Slide Time: 09:59)

And interestingly if you look at that if you are wondering what that set is, if you look at
these points a bar and b bar and the line joining this points a bar b bar. And if you look at
the perpendicular bisector of this ok, if you look at the perpendicular bisector of the line
joining a bar b bar. All the points that are closer to a bar than b bar lie on basically one
So, this implies b bar minus a bar transpose times x bar less than or equal to well b bar side of this perpendicular bisector and this is the half space that you are talking about
transpose b bar, that is your norm b bar square or there is a two factor of 2 here minus essentially this is the set of all points closer to a bar, this is a set of all points that are
norm a bar square which implies that b bar minus a bar transpose x bar less than or equal closer to a bar than b bar.
to half norm b bar square minus norm a bar square. And therefore, if you call this as if
you call this vector as c bar b bar minus a bar, if you denoted by c bar and if you denote
this quantity by d that is your c bar equals b bar minus a bar and this constant d equals
half, norm b bar square minus norm a bar square.
(Refer Slide Time: 12:55) (Refer Slide Time: 13:49)

And therefore, this is equal to this is indeed you can see clearly this is a half space is a Let us look at another problem let us look at another interesting set. So, we have S equals
set of all points. It is we look at the perpendicular bisector of the line joining a bar and b the set of vectors x bar. So, set of two dimensional vectors x bar equals x 1 x 2 such that
bar. This is the high half space that lies towards a bar this is a half space that is the x 1 x 2 greater than or equal to 1. And we want to ask the question again is S convex is
perpendicular bisector, divides a bar divides it into two half spaces the half space, that the set x convex. Now, if you look at this two dimensional plane ok, if you plot x 1
includes a bar that is the set of all points which are closer to a bar than b bar, which is equals x 2 that will be this hyperbola, correct x 1 equals x 2 or x 1 x 2 equals 1. And of
indeed the half space and therefore, it is indeed convex ok course, we are considering the positive let us also maintain that let us also impose this
additional constraint x 1 greater than equal to 0 x 2 greater than or equal to 0 ok.
So, that basically complete the proof of course, we have proved it analytically and that
should leave no question and this is basically an insight into the convex set ok. And now if you look at this now if you look at this so, this is your x 1 this is your x 2 and
x 1 x 2 greater than or equal to 1, that includes this set ok. So, this is a set of all points
such that x 1 x 2 greater equal to 1 and you can see visually see that this is convex.
(Refer Slide Time: 15:22) (Refer Slide Time: 16:41)

We are going to take an alternative approach what we are going to show is that this half So, we are considering any point consider any point so, let us start with this consider a
space this convex set can be represented as a intersection of a infinite number of infinite point x 1 equals alpha this implies x 2 equals 1 over alpha since x 1 into x 2 equals 1. So,
number of half spaces. And since it is the intersection of half spaces it is essentially a we consider this point alpha comma 1 over alpha with alpha greater than 0. And now if
polyhedral, it is well it is an infinite number of half spaces its essentially a convex set, you look at this quantity d y b y d x d x 2 by d x 1,
because is the intersection of several convex sets that half spaces basically.
That is if you look at the tangent to the curve, if look at the slope of the tangent. So, this
And you can clearly see that if you draw that tangent and, if you look at the is your tangent and the slope of the tangent to the curve equals d x 2 by d x 1 that is at
corresponding half space ok. And if you draw all tangent all such tangents the infinite that point the tangent has the same slope as the curve. So, d x 2 by d x 1 which is
number of tangents, you can represent this as the intersection of an infinite number of remember x 2 equals 1 over x 1 which implies d x 2 over d x 1 is minus derivative of x 2
such half spaces. Now, let us look at any point well where x 1 equal to alpha implies x 2 that is 1 over x 1 with respect to x 1 is minus 1 over x whole square.
x 1 x 2 equal to 1 so, x 2 equals 1 minus alpha.
(Refer Slide Time: 18:22) alpha 1 over alpha with slope minus 1 over Alpha Square. So, that will be given as well x
2 minus x 2 minus 1 over alpha divided by x 1 minus alpha the slope is nothing, but
difference of the y coordinate by difference of the x coordinate.

So, x 2 minus 1 over alpha divided by x 1 over alpha divided by x 1 over alpha equals
minus 1 over alpha square, this implies x 2 minus 1 over alpha equals minus 1 over alpha
square x 1 minus alpha this implies alpha square x 2 alpha square x 2 minus alpha equals
well minus x 1 plus alpha this implies, if you can see this implies alpha square x 2 plus x
1 equals 2 alpha.

(Refer Slide Time: 20:04)

And at this point evaluated at x 1 equals alpha, so, this will be minus 1 over alpha 1
square. And so, this minus 1 over alpha square what is this is basically the slope of the
tangent, why is this the slope of the tangent at that point alpha 1 over alpha that point x 1
comma x 2. If you look at the derivative correct, if you look at the derivative of the curve
that itself gives the slope of the tangent ok.

(Refer Slide Time: 18:46)

So, this is the tangent to this curve x 1 x 2 equal to 1 at alpha come 1 over alpha tangent
ok.

And therefore, we are in the slope and we have the point the point is 1 over 1 over alpha
and the slope equals minus 1 over alpha square. So, the tangent will be point through
(Refer Slide Time: 20:42) the intersection of all the sets intersection over what intersection over any point alpha
greater than 0 such that x bar such that alpha square x 2 plus x 1 greater than or equal to
2 alpha.

So this is alpha square x 2 plus x 1 equal to 2 alpha, this is the tangent which implies
alpha square x 2 plus x 1 greater than equal to 12 alpha this is the half space ok. So, this
convex set x 1 x 2 x 1 x 2 greater than equal to 1. This intersection of all these half
spaces the intersection of an infinite number of half spaces implies, this is convex. So,
the set x 1 x 2 greater than equal to 1, we represented that as the intersection of an
infinite number of half spaces and therefore, indeed it is convex let. So, these are
interesting applications interesting problems which demonstrate which show you how to
demonstrate the convexity of set and how to visualize the different properties or different
aspects of a convex set alright. So, we will stop here and continue in the subsequent
modules.
So, basically what we have is you have you have this curve x 1 x 2 equal to 1 and you
have this particular tangent you have this half space. Now, you take all such tangents and Thank you.
you take the intersection of these half spaces and what you will see, you will end up
getting this original convex set.

(Refer Slide Time: 21:14)

So, set x 1 x 2 so, if you look at the set x 1 that is x bar such that x 1 x 2 greater than or
equal to 1 comma x 1 greater than or equal to 0 x 2 greater than or equal to 0. If you look
at this set, if you look at this set this set is equivalent to the set, that is this it is equal to
Applied Optimization for Wireless, Machine Learning, Big data If you look at this point of intersection of the normal vector with the hyperplane, that is if
Prof. Aditya K. Jagannatham
you call this points as x 1 bar and x 2 bar this is from your follows from a simple
Department of Electrical Engineering
Indian Institute of Technology, Kanpur knowledge of high school level geometry your coordinate geometry, that is these two
hyperplanes are parallel look, if you look at the point of intersection of this normal with
Lecture-22
Problems on Convex Sets (contd.) these two hyperplanes. And if you look at the distance of these two points of intersection
the distance between these two points of intersection that is the distance between these
Hello welcome to another module in this massive open online course. So, we are looking two hyperplanes. So, the distance between these two hyperplanes is the distance between
at examples for convex sets, and various properties of matrices let us continue our these two points of intersection of the normal a bar with a hyperplane.
discussion.
(Refer Slide Time: 03:12)
(Refer Slide Time: 00:25)

So, distance between the hyperplanes is the distance between, these points of intersection
And what you want to look at in today’s module is you want to look at the properties of ok. And now what are these points of intersection remember the points of intersection are
hyperplane. So, this is example number let us call this is example number 9. So, consider along the normal. So, we have x so, we have let us look at the first hyperplane that is
two hyperplanes given by a bar transpose x bar equals b 1 and a bar transpose x bar your a bar transpose x bar equals b 1 the point along the normal, if you call that as k
equals b 2, recall that this is equation of hyperplane these are two hyperplanes. times a bar some constant times a bar.

And in fact, these hyperplanes you can see these are parallel you will realize that these Then this implies a bar transpose k times a bar equals b 1 which implies k times, now a
are two parallel hyperplanes, these are parallel hyperplsanes; so, that you can draw a bar transpose a bar that is norm of a bar square equals b 1, this implies the constant k
figure to denote this ok. So, if you represent this pictorially you find these are hyperplane equals b 1 divided by norm of a bar square.
one and this is your hyperplane two and both have the same normal vector, this a bar this
vector is the normal vector correct this vector a bar is the normal to both the hyperplanes
these are the normal, this is the normal vector to both hyperplanes and in fact, the
distance between both these hyperplanes can now be calculated as follows.
(Refer Slide Time: 04:21) (Refer Slide Time: 06:08)

And this implies point of intersection of a bar, that is a point of intersection x 1 bar is this The distance norm of x 1 bar minus x 2 bar equals norm of b 1 a bar divided by norm a
is your point of intersection x 1 bar is k times a bar which is b 1 divided by norm a bar bar square minus b 2 a bar divided by norm a bar square, the norm of this which is equal
square into that is k times a bar times a bar. to norm of b 1 minus b 2 times a bar divided by norm of a bar square which is equal to
magnitude of b 1 minus b 2 times norm of a bar divided by norm of a bar square, which
So, this is the point of intersection of the normal vector a bar with the first hyperplane.
is magnitude of b 1 minus b 2 times norm of a bar divided by norm of a bar square,
Similarly the point of intersection with the 2nd hyperplane, that is if you look at x 2 bar x
which is equal to magnitude of b 1 minus b 2 divided by norm of a bar. So, you have this
2 bar is all you have to do is replace b 1 by b 2 that is b 2 divided by norm a bar square
a bar cancelling with this a bar square in the and you have norm magnitude of b 1 minus
into a bar ok. And therefore, we have found the two points of intersection with these
b 2 divided by norm of a bar.
hyperplanes of the normal vector a bar of this hyperplane and therefore, the distance
between the hyperplanes is the distance between these point of intersection. (Refer Slide Time: 07:51)
This is the distance between the between the parallel, this is the distance between the set 10 example number 10, you want to show the set S is convex if and only if its
of your parallel hyperplanes, which have the same normal vector a bar ok. And this is intersection with every line is convex, what this means is that if a set alright consider any
important because this has a lot of applications this interesting property. So, if you look set convex set ok.
at the normal if you look at these two hyperplanes, these two hyperplanes can be used for
(Refer Slide Time: 11:00)
classifications. So, you can have a set of points on one side a set of points on the other
side and you can use these two hyperplanes to separate these sets of points alright. So,
this is a know classifier in like this is the basis for what is known as the support vector
machine classifier. So, this forms this simple principle of maximizing the distance
between hyperplanes this forms the basis for the SVM.

That is your support vector machine this forms the basis for the support vector machine.
And this basically this maximizes maximizing the distance between this hyperplanes,
basically makes the classifier more effective thereby effectively separating these two
different classes of objects alright. So, been such problems your interested in maximizing
the distance between the hyperplanes, which is given by you can see magnitude b 1
minus b 2 divided by norm of a bar. And therefore, if you want to maximize the distance
between hyperplanes you have to minimize norm of a bar.
Now, if its intersection with every line is convex that is you take any line and, if you look
(Refer Slide Time: 09:55)
at its intersection with the line, you can clearly see that the intersection with every line is
a line segment which is itself is a convex set ok. So, the intersection with any line is a
convex set.

On the other hand if you take if you take for instance a region like this our non convex
set. And if you take any line, then the intersection with respect to this is this two
disjointed line segments and this is not convex. So, this is an if and only if statement, but
it says something very interesting. If the intersection of a set the set is convex then it
intersection with every line is convex.

So, if you minimize to maximize distance that is if b 1 and b 2 are fixed you have to
minimize normal if in fact, this is a very interesting property that pertains to
classification ok. Now, let us look at another in very interesting problem this is number
(Refer Slide Time: 12:27) Now, if you look at any convex combination that has to naturally belong to this
intersection, because of the convex combination does not belong to the intersection; that
means, the intersection is not convex combination.

(Refer Slide Time: 15:21)

Similarly, if the intersection of set with every line is convex, then the set as is also
convex and this is easy to verify you can see this as follows, let us start in one direction
if S is convex. And the intersection and the, if S is convex now consider any line given
any line, now S intersection line is convex if S is convex.
Therefore, the convex combination of x 1 bar, because x 1 bar x 2 bar belong to the
The set S intersection the line is convex, this is true because S is convex given that S is intersection. The convex combination of x 1 bar x 2 bar must also belong to the
convex any line is a convex set we know that. So, convex intersection convex this is intersection correct, x 1 bar x 2 bar implies their convex combination must also belong to
convex this is trivia ok. So, S is convex the set S is convex, then a line any line is also a the intersection, otherwise the intersection is not going to be convex, which intern
convex set. So, S intersection with the line convex set intersection with another convex implies that the convex combination belongs to S.
so, that is also convex, let us move it in the other direction. If S intersection with any line
Because the convex combination belongs to the intersection, which implies that S is the
is convex ok, consider any two points x 1 bar x 2 bar
which implies that S a convex set. So, the convex set if you consider an intersection with
This implies S intersection with line through x 1, because that is also a line a line any line is convex. And the other direction on other hand for conversely, if the
through consider a line through x 1 bar comma x 2 bar, this is also a line this implies the intersection of set S with any line is convex, then the set it S itself is a must be a convex
intersection with the line through x 1 bar plus x 2 bar is convex. This implies that if you set alright, this is a very interesting property and often very useful in demonstrating the
look at any convex combination theta times x 1 bar plus 1 minus theta times x 2 bar 0 convexity of sets alright.
less than equal to theta less than equal to 1. This must belongs must belong to as the
(Refer Slide Time: 17:04)
reason is the following, because x 1 bar x 2 bar belongs to S, x 1 bar x 2 belongs to the
line through x 1 bar x 2 bar. Therefore, x 1 bar x 2 bar belongs to the intersection of us
with the line alright.
(Refer Slide Time: 18:56)

Let us look at another interesting problem that pertains to probabilities problem number
11. That is let X be a random variable and it will take values a 1 a 2 up to a n and the
And, if you denote this by the vector P bar P 1 P 2 P n and then remember, you can use
probability that X takes the value a I, this is given by the this is equal to P i. So, this is
the component wise inequality symbol to say this vector P bar, each component
the probability X random variable x takes the value a i probability X takes the value ok.
remember this is our component wise inequality, this is the component wise inequality.
Now, naturally if you can look at the set of all this probabilities. So, we have
Each component of this vector P bar must be greater than equal to 0. Now, I want to
probabilities a 1 a 2 a n corresponding probability P 1 P 2 P n.
examine some properties of this set that contains of this set of vectors P bar let us look at
Now, one naturally the sum of the probabilities must be one first this probability each of the first one ok.
these properties, because their probabilities they must be greater than equal to 0. And
(Refer Slide Time: 19:45)
further the sum of all this probabilities must be equal to 1. Therefore, we must have
summation i equal to n P i equals to 1, which is basically also represented by P 1 plus P 2
plus P n equals to 1. And we must also have each P i is greater than or equal to 0.
Let us look at all the probability vectors P bar, such that alpha is less than or equal to the (Refer Slide Time: 21:36)
expected value of a random variable X less than or equal to beta. Set of all P bar that
satisfy this is this set convex, we would not ask this question is the set of all probabilities
P bar, such that the expected value of a random variable X lies between alpha and beta is
this set convex well that is easy to figure out.

If you look at the expected value of a random variable X, that when we calculated as
follows that a summation i equals to 1 to n. The expected value of a random variable is
probability that it takes the value a i into a i which is summation i equals to 1 to n P i a i.

(Refer Slide Time: 20:48)

And therefore, if you look at this problem alpha is less than or equal to expected value of
X less than or equal to beta this implies alpha is less than or equal to a bar transpose P
bar less than or equal to beta. So, this is the intersection of two hyperplanes, you can
readily see that this is the first hyper plane is a bar transpose P bar less than or equal to
beta.

Second hyper plane is minus a bar a bar transpose P bar greater than equal to alpha,
which can be written as a bar transpose P bar less than equal to minus a bar transpose P
bar less than equal to minus alpha. So, this is the intersection of two this is the in fact, the
intersection of two half spaces, I am sorry intersection of two this is the intersection of
Which is basically a 1 times P 1 plus a 2 times P 2 plus a n times P n, which you can find
two half spaces implies that this is convex.
as a 1 bar a 2 bar a n bar time P 1 P 2 P n, you can given this as a bar transpose this is
your P bar. So, expected value of the random variable X, now becomes a bar transpose P
bar.
(Refer Slide Time: 22:48) or equal to another constant beta or property of X greater than alpha is less than or equal
to beta. Is this set of all P bar that satisfies this is the resulting set of P bar convex well
what is the probability that X is greater than alpha the probability that X is greater than
alpha.

A simply summation of all probabilities P i such that the corresponding a i are greater
than alpha. And this since this probability X greater than alpha has to be less than equal
to beta, this is less than equal to beta this is probability X greater than alpha that is you
simply have to sum the probabilities of corresponding to all a i S. That is greater than
alpha and you can clearly see, this is the linear sum this is the half space, this is the half
space implies, once again this is convex.

(Refer Slide Time: 24:54)

Each is half space expected value of X is less than equal to alpha, that is can be
represented as the half space a bar transpose P bar is less than equal to beta, expected
value of x greater than equal to alpha is another half space. So, this forms the intersection
of two half spaces and therefore, this is indeed this set is indeed this is indeed a convex
set alright.

(Refer Slide Time: 23:23)

For example, let us take an example to understand this better for instance let us say you
have a 1 a 2 a 2 a 4 or up to a 5 a 6, take the simple example n equal to 6. Now, let us say
your alpha is here lies between a 3 and a 4. So, these are less than alpha so, these are
greater than alpha and these are less than alpha. So, the probability X is greater than
alpha equals the probability X equals either a 4 or a 5 or a 6 equals P 4 plus P 5 plus P 6.
So, probability X greater than alpha less than beta implies P 4 plus P 5 plus P 6 less than
beta.

Let us look at another set probability of X greater than alpha, that is the set of all
probability vectors such that the probability of X greater than equal to alpha, is less than
(Refer Slide Time: 26:02) Let us look at the set of all vectors P bar such that expected value of X square is less than
or equal to alpha. Now, we want to ask the question is this convex well, what is the
expected value of X square, this might seem a little confusing, because X square is non-
linear ok. But, what is but look at expected value of X square this is summation i equals
1 to n probability X equals, well probability X equals probability X equals a i into a i
square that is the expected value of X square ok, which is equal to P 1 times a 1 square
probability X equal to a 2 times a 2 square plus P n times a n square ok, which is now
you can again write it as a different vector transpose times P bar.

(Refer Slide Time: 28:09)

This implies basically 0 0 0 1 1 1.This is your a bar transpose times P 1 P 2 up to P six


this is I am sorry less than equal to beta, this is your a bar transpose this implies you can
write this as a bar transpose P bar less than equal to beta and this is a convex set ok. So,
the set of all probability vectors P bar such that the probability X is greater than alpha
some quantity fixed constant alpha is less than equal to beta is a convex set ok. Now,
what about the second moment what about expected value of X square and this is an
interesting aspect.

(Refer Slide Time: 27:01)


So, I can write this as a 1 square it is very interesting, I can write this as a 1 square a 2
square times a n square times the vector P bar P 1 P 2 up to P n ok.
(Refer Slide Time: 28:46) therefore, once again the set of all probability vectors once again the set of all probability
vectors P bar such that expected value of X square is less than or equal to alpha the set of
all such vectors P bar is once again convex alright.

So, these are some interesting applications of the notion of convexity, convex sets some
of the properties of convex sets and so, on which have heavy application or which are
going to be used very frequently in our discussion on optimization theory, on our
discussion on the practical applications of optimization bit in the context of wireless
communication, or signal processing or several other fields.

So, these form the these principles these examples that we are so, far seen from the basic
building blocks of several large problems, or several large how do you put it several
large paradigms or frameworks, that we are going to explore in the future with respect to
optimization and its application in several areas of interest.
Now, this if you look at this, this is a different vector you call this as well let us call this
as u bar transpose, this is your vector P bar. where what is u bar; u bar is this vector, we Thank you very much.

are now calling this vector a 1 square a 2 square so, on a n square by this vector u bar.
So, expected value of X square if you look at that is, u bar transpose P bar and expected
value of X square less than equal to alpha.

(Refer Slide Time: 29:11)

This implies u bar transpose P bar less than equal to alpha. And you can see this is once
again corresponds to a half space implies, this is yes this is therefore, convex. And
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:51)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 23
Introduction to Convex and Concave Functions

Hello, welcome to another module in this Massive Open Online Course. In this module
let us start looking at a new topic and that is of Convex Functions alright.

(Refer Slide Time: 00:23)

If you have two points x 1 bar and x 2 bar. And you take a convex combination theta
times x 1 bar and 1 minus theta time x 2 bar. And evaluate the function at that convex
combination that has to be less than or equal to the convex combination of the values of
the function itself at x bar x 1 bar and x 2 bar.

So, that would be 1 minus theta F of x 1 bar plus 1 minus theta F of x 2 bar. So, what we
must have is that F to be convex it must be the case that F of for two points x 1 bar x 2
bar that has be that belong to the domain of F. F of theta time x 1 bar plus 1 minus theta
times x 2 must be less than equal to theta times F of x 1 bar plus 1 minus theta times F of
So, want to start looking at another important aspect or building block of the x 2 bar. This can be represented pictorially as follows to better understand this look at the
optimization framework that we want to eventually build. And therefore, you want to following picture ok.
start looking at convex functions because, these are going main important for part of the
optimization problems that we are going to consider. Now what is the convex function?
A convex function for instance can be defined as follows consider a function F of x bar.

Now, x bar now x bar this can be simply F of x. So, that can be x bar in general it can be
a vector ok. So, you can either you can also have F of x, but in general you can have F of
x bar that is a function of a vector. Now, let also the domain of F that is the set over
which it is defined domain of F is a convex we already seen what is a convex set. So, the
domain of F is a convex set plus F is now F is convex, if the following properties
satisfied.
(Refer Slide Time: 03:11) always lies above the function. So, if you take x 2 points x 1 x 2 join the chord right
between F of x 1 and F of x 2 the chord joining F of x 1 F of x 2 always lies above the
function between these two points x 1 and x 2 that is a very that is an intuitive definition
of convex.

It is for convex function curves upward it looks like a bowl alright, your regular bowl are
typically a convex function looks like a bowl that curves upwards although that is not
always the case. But typically you when you think of convex functions the prototype of a
convex function is a bowl kind of function which curves upwards.

(Refer Slide Time: 07:12)

It can be seen pictorially as below I am just going to draw a diagram that will better
illustrate this. So, let us say we have function that looks like this. Let us take 2 points and
let us join now this is your F of x consider a simple 1 dimensional x let these to be the
points let this be your point x 1 correct. So, this is your point x 1 and this is your point x
2.

And let us take a convex combination of x 1 and x 2 that lies along the lines. So, this is
your theta times x 1 plus 1 minus theta time x 2 ok. And this value, if you look at this
value this is your F of theta time x 1 that is the value of the function at theta times x 1,
this is the value of the function.
So, this basically shows that your. F of theta times x 1 plus 1 minus theta times x 2 less
This is your F of theta times x 1 plus 1 minus theta times x 2. Now on the other hand than or equal to theta times F of x 1 plus 1 minus theta times F of x 2 ok. So, this is your
look at this, this point is F of x 1 that is one end of this chord and this point is F of x 2. definition of convexity and this is what you can see from so, this is your convex
And we know now that if we take a convex combination that represents the line segment function. Typically convex function looks like a bowl that curves upward, it looks like a
correct. So, therefore, we now consider the line segment that is joining this points F of x bowl. And what this means intuitively, the intuition is that the chord joining F of x 1
1 and F of x 2 and theta times F of x 1 plus 1 minus theta times about x 2 is this point comma F of x 2 lies above the function.
theta times F of x 1 plus 1 minus theta times F of x is this point. And you can see
therefore, that theta times F of theta times x 1 plus 1 minus theta times x 2 is less than or
equal to theta times F of x 1 plus 1 minus theta times F of x 2.

Which in essence is basically, saying that if you think about this what this is essentially
saying, that if you take two points on the function, join them by a chord, the chord
(Refer Slide Time: 08:56) (Refer Slide Time: 11:03)

The curve or the plot of the function between x 1 and x 2 alright so, the chord that is Naturally this implies that air force F of theta times x bar plus 1 minus theta times x bar
joining x 1 and x 2 lies. The chord joining F ox x 1 and F of x 2 lies above the function is greater than equal to theta times F of x bar plus 1 minus I am sorry x 1 bar x 2 bar
between x 1 and x 2 ok. And this is a very important class of functions that we will theta times x 1 bar plus 1 minus theta times x 2 bar F of that is greater than equal to theta
frequently encounter. In fact, most of our optimization problems will be built on convex times F of x 1 bar plus 1 minus theta times F of x 2 bar. And naturally a concave function
functions already they felts very important to understand. The definition of a convex curves downwards.
function and also the various properties of convex function alright.
So, curvature is down so it looks like this and you can clearly see that the chord, if you
On the our hand now, naturally a concave function is 1, which is now for a concave look at the chord joining any two points the chord lies a function lies above the chord.
function again it is important to remember that if you look at a function of F x bar. The So, this is function lies above the chord for a concave function, this is your concave ok.
domain has to still be convex, either concave or convex the domain that is the set over
which it is defined is convex. And, if minus F is convex then F of x bar F of x bar is
concave alright so, F of x bar is concave if the domain is convex and minus of F is
convex.
(Refer Slide Time: 12:27) (Refer Slide Time: 13:37)

And we cannot look at several examples to understand, this notion of concave and A convex function or what you can also say he is a strictly convex function that is curves
convex functions better. The simplest example is that the most simplest example is that upwards. And very important class of such functions is the exponential function if you
of a straight line, if you look at a straight line you take any two points on the straight look at this; this is e to the power of a x, where a greater than 0 ok. So, also known as the
line, the chord itself lies on the function the functions of the function. exponential this is exponential function. And you can clearly see once again you take any
two points, the chord lies above the function.
So, the chord either lies above or below the chord coincides with the functions. So, we
can say the straight line is convex as well as concave. So, the straight line is in fact, And therefore, this is of course, we are going see later a rigorous test to establish
convex. And it is also concave for that same matter because, of the function coincides convexity and concavity, but right now based on our intuition you can clearly see this is a
with the curve coincide with the chord joining any two points. So, this is convex and this convex function and this exponential function there is an increasing exponential which is
is also concave so, any straight line is convex and also concave ok. convex function; it is an important class of convex functions. And in fact, one can also
consider a decreasing exponential that e raise to the power of a x that arises for a less
than 0. And we will also we will also see you can also quickly see just by visual
inspection that that is also a convex function.
(Refer Slide Time: 15:13) (Refer Slide Time: 16:22)

So, what happens? If it is not very difficult to see if you plot e raise to the power of this So, one can conclude that e raise to x e raise to x for any value of a the exponential
is your e raise to the power of a x a less than 0, also an exponential. And you can again function for any value of a is a convex function. If a is less than 0, it is decreasing if a is
see that take two points join the two points the chord ok. You take any two points and greater than 0 it is increasing ok, but in any case it is a convex function.
you join the two points, the chord lies above the function and therefore, this is a convex
(Refer Slide Time: 16:48)
function.

And of course, if a equals 0 this becomes e raise to a e raise to the power of 0 which is 1
which is again a convex function. So, one can say e raise to the a x in general for any real
value of a is a convex function.

Let us look at the power x to the power of alpha. If alpha is greater than or equal to 1.
This is defined for x greater than equal to 0 alpha greater than equal to 1 or alpha less
than or equal to 0. In both cases you can see it is convex for instance. If you look at F of
x equals x square; you can see you are well familiar that this is a classic bowl shape
function. In fact, F of x equal to x square you can define it for entire minus infinity less (Refer Slide Time: 19:59)
than x less than infinity and it looks something like this correct.

And at x equal to 0, it is 0 and this is your classic example of a convex function. This is F
of x equals x square and you can clearly see if you join any two points by chord it lies
above the function. So, this is a classic example of a convex function. In fact, it looks
like a perfect bowl this is your F of x equals to x square.

(Refer Slide Time: 18:27)

That is any norm x bar base p, p greater than or equal to 1. This is this is of course, of a
function of a vector, this is convex field keep this for later, but this is important.

That is the norm, p norm for p greater than or equal to one this is convex ok. So, we have
seen several examples of convex functions very straight forward mostly one dimensional
function mostly functions of single variable. So, we have seen that the exponential
function is convex the straight line of course, is convex and concave exponential
function for all values of a is convex, x square is convex over the entire real line, x cube
And now on the other hand if you look at F of x equal to x cube, now, you might be
is convex only if x is greater than or equal to 0. Let us look at the examples of concave
under the misconception that F of x equal to x to the power of alpha. For alpha greater
functions.
than equal to 1 is always convex, but if you look at F of x equal to x cube for x less than
0 this is negative; for x less than 0 this is negative. And this is negative and x greater than
0 it is positive, but you can see here x greater than 0 the chord lies above the function.

But for x less than 0 the chord lies below the functions. So, the x less than 0, it is
concave and for x greater than 0 it is convex. So, this is only convex for x greater than or
equal to 0 F of x cube. So, that is why it is important to consider the suitable domain. So,
this is now, convex F of x equal to x cube for x greater than equal to 0 is convex in this
region x less than equal to 0 it is concave. Further one can show that well we can check
this later.
(Refer Slide Time: 20:59) In fact, the right way to do it is you have to look at F of minus of F of x. So, if you look
at minus of F of x so, this is your F of x. Minus of F of x because as for a definition
remember minus of F of x, if minus of F of x is convex then F of x is con concave.

(Refer Slide Time: 22:36)

The classical example of concave function; concave function the classic example is l n x
always you can use log of x the natural logarithm is also known as a natural logarithm
log x to the base e. And if you draw this it is only defined for of course, x greater than
equal to 0, if you draw this at x equal to 1 it becomes 0. And at x equal to 0 tends to
And if you plot minus of F of x then that looks like this; this is 0 at x equal to 1 as x
minus infinity and as x tends to infinity tends to infinity and if you look at any two points
tends to 0, x tends to infinity as x tends to infinity tends to minus infinity. So, this is
join the chord the function always lies about the chord this is concave.
minus of the natural logarithm of x chord lies above the function so this is convex
(Refer Slide Time: 22:12) correct. So, minus natural logarithm is convex and this implies that the natural logarithm
of x equals is a concave function. That is the technically the correct way to demonstrate
concavity.

(Refer Slide Time: 23:38)


Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 24
Properties of Convex Functions with examples

Hello welcome to another module in this massive open online course. So, we are looking
at complex functions let us continue our discussion.

(Refer Slide Time: 00:21)

Another important class of concave functions is F of x equals x power alpha for alpha for
0 less than alpha less than 1 ok. And in fact, for example, let us say alpha equal to half,
we have F of x equals square root of x and naturally this is defined only for x greater
than or equal to 0 correct. So, we will restrict the domain here x greater than equal to 0.
And if you plot square root of x for x greater than equal to 0, it looks something like this
and you can once again see the function lies above the chord so, this is square root of x
correct.

Square root of x and this is convex, this is your function F of x equals square root of x
for x greater than equal to 0 it is convex alright. So, in this module we have seen a very
important definition that of convex and concave functions; I urge you to go through this
To we want to continue our discussion on convex functions ok. And well now let us look
once again and understand it thoroughly because we are going to invoke this and use this
at it test for convexity let us consider first function of a single variable x y equals F of x,
property very frequently.
this is a scalar variable, we are consider will comes to functions of vectors later, so this is
Thank you very much. scalar variable. And well this is convex if d square y by d x square equals d square F of x
by d x square that is the second derivative is greater than or equal to 0, and remember
this is for a function of a scalar there is a function of a of a one dimensional variable x of
a single dimensional variable single variable x all right.

So, the second derivative is greater than or equal to 0, then the function is convex ok..
(Refer Slide Time: 02:05)

And this can be understood as follows. For instance, if you plot a convex function, what you will see is the following if you look at the derivative of the function at different points; the derivative is nothing but the slope. So, this is your x and this is your F of x. And if you look at the slope of the function, which is given by the slope of the tangent at each point, you can see that the slope of the tangent, or the derivative, is increasing for a convex function; the slope of the tangent, which equals d F x by dx, is increasing.

Now, we know the test for an increasing function: a function is monotonically increasing if its derivative is greater than or equal to 0. Here the slope itself is monotonically increasing, which means that the derivative of the slope must be greater than or equal to 0.

(Refer Slide Time: 03:55)

Which implies, since the slope is constantly increasing, that the derivative of the slope, d by dx of d F x by dx, must be greater than or equal to 0 for a convex function, and that gives us the result d square F of x by dx square greater than or equal to 0. This is basically nothing but the condition that the slope, or the derivative, that is, the slope of the tangent, is monotonically increasing for a convex function.

(Refer Slide Time: 05:07)

Let us take an example. Consider F of x equals e raised to the power of a x, where a can be less than 0, equal to 0 or greater than 0. Then d F x by dx equals d over dx of e raised to a x; we know the derivative of the exponential, this is a e raised to a x. Now, if you take the second derivative, d square F x by dx square equals the derivative of the derivative, that is, d over dx of a e raised to a x, which is a square e raised to a x.

Now, we know that the exponential e raised to a x is always greater than 0, and a square is greater than or equal to 0; this implies that a square e raised to a x is greater than or equal to 0, all right, because a square is always non negative.

(Refer Slide Time: 06:31)

So, this implies that for any value of a, d square F x by dx square is greater than or equal to 0, which implies e raised to a x is convex, and it is convex irrespective of a: if a is negative it is a decreasing exponential, if a is positive it is an increasing exponential, and both are convex. So, e raised to a x is always convex, and the second order derivative test confirms that, all right.

(Refer Slide Time: 07:13)

Let us look at another function that we have seen yesterday, that is F of x equals x square; you can recall that it looks something like this, this is your x and this is your x square. If you take this, it is very straightforward: d F x by dx equals 2 x and d square F x by dx square equals 2, which is greater than 0, which implies that x square is a convex function.

(Refer Slide Time: 07:51)

On the other hand, if you take F of x equals x cube, you will notice that d F x by dx equals 3 x square and the second order derivative d square F of x by dx square equals 6 x, which is greater than or equal to 0 only if x is greater than or equal to 0.

(Refer Slide Time: 08:34)

So, this is greater than or equal to 0 only for x greater than or equal to 0, which implies that x cube is convex only for x greater than or equal to 0. In fact, that is what we had seen in the previous module: if you plot x cube, just to refresh your memory, it looks something like this; for x greater than or equal to 0 this part is convex, and for x less than 0 it is concave, all right.

Now, how about concave functions? We have to test for the convexity of the negative of the function, correct.

(Refer Slide Time: 09:44)

So, let us consider a classic example of a concave function, that is, the natural logarithm of x. Consider minus F of x equals minus ln of x; since to demonstrate concavity we have to demonstrate the convexity of minus F of x, ok. First differentiate minus the natural logarithm of x: this is minus 1 over x. Now, take the second derivative: d square over dx square of minus ln of x equals 1 over x square, which is greater than or equal to 0, which implies that minus ln x is convex, which, as we had seen in the previous module, implies that ln x, the natural logarithm of x, is concave.
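To make the second derivative test concrete, here is a minimal numerical sketch in Python (not part of the lecture) that checks the sign of the second derivative on a grid for the functions discussed above; the step size, grids and tolerance are arbitrary illustrative choices.

# Minimal numerical sanity check of the second-derivative test.
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

tests = {
    "exp(2x)":  (lambda x: np.exp(2.0 * x),  np.linspace(-2.0, 2.0, 9)),
    "exp(-3x)": (lambda x: np.exp(-3.0 * x), np.linspace(-2.0, 2.0, 9)),
    "x^2":      (lambda x: x**2,             np.linspace(-2.0, 2.0, 9)),
    "x^3":      (lambda x: x**3,             np.linspace(0.1, 2.0, 9)),   # convex only for x >= 0
    "-ln(x)":   (lambda x: -np.log(x),       np.linspace(0.1, 2.0, 9)),   # convex, so ln(x) is concave
}

for name, (f, grid) in tests.items():
    ok = np.all(second_derivative(f, grid) >= -1e-6)   # small tolerance for round-off
    print(f"{name:8s} second derivative >= 0 on grid: {ok}")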
negative of that function correct.
(Refer Slide Time: 11:05)

Again, for square root of x: F of x equals square root of x, so minus F of x equals minus square root of x. Then d F x by dx equals, well, minus 1 over twice square root of x. And d square F x by dx square is equal to 1 over 4 x to the power 3 by 2, which is greater than 0, which implies that minus square root of x is convex, which implies that square root of x is concave, ok. And that is what we had seen yesterday: if you plot square root of x, it looks like this and it is concave. So, this is square root of x, all right.

Let us now come to the norm; again, for the norm one cannot use the second derivative test. So, let us first look at another interesting function; in fact, let us look at a practical application. This is an interesting aspect and we have also seen this before.

(Refer Slide Time: 12:49)

So, let us come to an application; let us look at F of x equals Q of x. So, your Q of x is the Gaussian Q function; remember, this is the CCDF, the complementary cumulative distribution function, of the standard normal random variable, that is, a Gaussian random variable with mean 0 and variance 1. This is also the tail probability of the standard Gaussian random variable, ok.

(Refer Slide Time: 13:45)

You can recall this is given as follows: if you have the PDF of the standard normal random variable, with mean equal to 0 and variance equal to 1, this is the CCDF, that is, the probability that X is greater than or equal to x, ok. So, this is the tail probability, also the CCDF, the complementary cumulative distribution function, which is basically the probability that X is greater than or equal to x; this is your Q of x, where X is the standard normal random variable, denoted by N of 0, 1, that is, a Gaussian random variable with mean equal to 0 and variance equal to 1.

(Refer Slide Time: 15:01)

So, for a Gaussian random variable with mean equal to 0 and variance equal to unity, the CCDF of that random variable, the complementary cumulative distribution function of that random variable, is basically the Q function. And remember the expression for the Q function; we have also seen it, or you must be familiar with it: it is basically 1 over square root of 2 pi times the integral of the Gaussian probability density function from x to infinity, since it is the probability that X is greater than or equal to x, that is, the integral from x to infinity of e raised to minus x square by 2 dx, which, if you take the constant outside and use a different variable of integration, you can also write as the integral from x to infinity of e raised to minus t square by 2 dt, ok.

Now, this Q function is very interesting and it frequently arises in communication and signal processing, because this Q function represents the bit error rate. You might have seen the expression Q of square root of 2 E b over N naught, which one can also write as Q of square root of SNR, or of two times SNR, depending on how you define it. So, this denotes the bit error rate; this is the bit error rate of BPSK, binary phase shift keying, over an additive white Gaussian noise channel. This is the bit error rate over an AWGN channel and, in fact, it has a lot of applications and arises quite frequently, as I said, in communications as well as signal processing, all right. So, now, what we want to show is that this Q function, which has a lot of practical applications, is convex.

(Refer Slide Time: 17:43)

We want to demonstrate that this is convex and, in fact, that it is convex for x greater than or equal to 0 — a slight qualification: it is not convex over the entire real line, it is convex only for x greater than or equal to 0.
(Refer Slide Time: 18:01)

In fact, we start with the definition: Q of x equals 1 over square root of 2 pi times the integral from x to infinity of e raised to minus t square by 2 dt. To find the first derivative of this, we have 1 over square root of 2 pi times: the derivative with respect to the top limit, which contributes 0 because the top limit is a constant, minus the derivative of the bottom limit, which is the derivative of x, that is 1, times the integrand evaluated at the bottom limit, that is, e raised to minus x square by 2. That is it. So, the derivative of Q of x is minus 1 over square root of 2 pi times e raised to minus x square over 2.

Now, this means that if you take the second derivative, d square Q x by dx square, that will be minus 1 over square root of 2 pi times minus 2 x over 2 times e raised to minus x square over 2.

(Refer Slide Time: 19:43)

Which is equal to, if you look at this, x over square root of 2 pi times e raised to minus x square by 2. And now you can see that e raised to minus x square by 2 is always greater than 0, so when x is greater than or equal to 0 this quantity is greater than or equal to 0. So, we have the second derivative d square Q x by dx square greater than or equal to 0 if x is greater than or equal to 0, which implies that the Q function is convex if x is greater than or equal to 0, and this is in fact a very interesting property.
(Refer Slide Time: 20:41)

That we will again use, in fact, quite a bit in the future. If you look at a plot of the Q function, the CCDF is constantly decreasing towards 0, and we have Q of 0 equals half, because this is the CCDF and x equal to 0 is the point of symmetry, correct. Q of 0 is the probability that X is greater than or equal to 0, which is half, which is equal to the probability that the random variable X is less than or equal to 0 for this standard Gaussian random variable with mean 0 and variance 1, ok. So, Q of 0 equals half equals the probability that X is greater than or equal to 0; Q of infinity equals the probability that X is greater than or equal to infinity, which equals 0, so at infinity it is 0.

(Refer Slide Time: 21:35)

Q of minus infinity equals the probability that X is greater than or equal to minus infinity — in fact, greater than or greater than or equal to does not matter here — and the probability that X is greater than minus infinity is 1. Therefore, the Q function starts at 1 and it decreases. And therefore, if you look at the portion x greater than or equal to 0, where Q of x starts at half, this portion is convex, and the portion where x is less than 0 is concave, ok. So, this is your Q function; the Q function is basically convex for x greater than or equal to 0, ok. So, that is what we have seen, all right.
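As a quick cross-check of this property, here is a small sketch (an illustration, not from the lecture) that evaluates the closed-form second derivative of the Q function derived above and compares it with a finite-difference estimate built from scipy.stats.norm.sf, which plays the role of Q; the grid and tolerances are arbitrary.

import numpy as np
from scipy.stats import norm

def q_second_derivative(x):
    # Closed form derived above: Q''(x) = x / sqrt(2*pi) * exp(-x**2 / 2)
    return x / np.sqrt(2.0 * np.pi) * np.exp(-x**2 / 2.0)

x = np.linspace(0.0, 6.0, 61)
print("closed-form Q''(x) >= 0 for x >= 0:", np.all(q_second_derivative(x) >= 0.0))

# Cross-check with a finite-difference second derivative of Q(x) = norm.sf(x).
h = 1e-4
fd = (norm.sf(x + h) - 2.0 * norm.sf(x) + norm.sf(x - h)) / h**2
print("finite-difference Q''(x) >= 0 :", np.all(fd >= -1e-6))
print("Q(0) =", norm.sf(0.0))   # 0.5, as noted in the lecture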

And finally, coming now to the norm: it is straightforward to show that the 2 norm is convex. Well, first we note the triangle inequality. So, first consider the norm.
(Refer Slide Time: 23:07)

Now, what we want to do is consider any two vectors x 1 bar and x 2 bar, and we want to find the norm of theta x 1 bar plus 1 minus theta x 2 bar, ok — the norm, or the value of the function, at the convex combination. Now, using the triangle inequality, we know that this is less than or equal to the following. Observe that, of course, for any convex combination, 0 is less than or equal to theta, which is less than or equal to 1.

So, the norm of theta x 1 bar plus 1 minus theta x 2 bar is less than or equal to the norm of theta x 1 bar plus the norm of 1 minus theta x 2 bar, from the triangle inequality, ok. We are using the property that the norm of a bar plus b bar is less than or equal to the norm of a bar plus the norm of b bar. Now, theta and 1 minus theta are greater than or equal to 0, because 0 is less than or equal to theta, which is less than or equal to 1. So, this is simply theta times the norm of x 1 bar plus 1 minus theta times the norm of x 2 bar, which is basically theta times F of x 1 bar plus 1 minus theta times F of x 2 bar, where we have F of x bar equals the norm of x bar, the 2 norm. So, all of these are the 2 norm, and the left hand side is your F of theta x 1 bar plus 1 minus theta times x 2 bar.

So, what we have shown is that F of theta times x 1 bar plus 1 minus theta times x 2 bar is less than or equal to theta times F of x 1 bar plus 1 minus theta times F of x 2 bar, which implies that F of x bar equals the 2 norm is convex, ok. So, this is a straightforward way to show that the l 2 norm is convex. However, this might be a slightly cumbersome way to show it, especially for a general function of a vector, that is, a function of more than one variable. So, one has to devise a general test, similar to the second order derivative test shown earlier for a function of a scalar, a single variable x; one has to arrive at a similar test for a function of a vector, which is something that we are going to develop and look at in subsequent modules.
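A small numerical illustration of this triangle-inequality argument is sketched below (assuming nothing beyond numpy): it draws random points and a random theta and verifies the defining inequality of convexity for the 2 norm. The dimension and sample count are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
violations = 0
for _ in range(10000):
    x1 = rng.normal(size=5)
    x2 = rng.normal(size=5)
    theta = rng.uniform()
    lhs = np.linalg.norm(theta * x1 + (1.0 - theta) * x2)
    rhs = theta * np.linalg.norm(x1) + (1.0 - theta) * np.linalg.norm(x2)
    violations += lhs > rhs + 1e-12
print("convexity inequality violations for the 2-norm:", violations)   # expected 0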
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 25
Test for Convexity: Positive Semidefinite Hessian Matrix, example problems
Hello, welcome to another module in this massive open online course. So, we are
looking at convex functions and tests for convexity. We have seen the test for a function
of a single variable, correct, when y equals F of x and x is a one-dimensional variable.
Let us now extend the test to functions of a vector, or a multidimensional variable, ok.

(Refer Slide Time: 00:38)

So, we want to find the test for convexity, ok. And we want to consider y equals F of x bar. So, x bar is a vector now, ok; x bar is an n dimensional vector with components x 1, x 2, up to x n. So, this is an n dimensional vector.

And so we want to find the test for convexity for this function of a vector. For this we want to first define the gradient, that is, the gradient with respect to x bar; I am just going to simply write this as the gradient of F of x bar, since it is clear from the context that the function is of the n dimensional vector x bar. This is simply defined as the vector which contains the partial derivative with respect to each component of x bar: the partial derivative of F with respect to x 1, the partial derivative of F with respect to x 2, and so on and so forth, up to the partial derivative of F with respect to x n. So, the gradient is also an n dimensional vector of partial derivatives, one with respect to each component of the vector x bar.

(Refer Slide Time: 02:54)

Now, what is the Hessian? We want to consider next the Hessian, ok. This is denoted by del square F of x bar. This is a matrix of second order partial derivatives: the first row of the Hessian of F is dou square F by dou x 1 square, dou square F by dou x 1 dou x 2, and so on up to dou square F by dou x 1 dou x n; the second row starts with dou square F by dou x 2 dou x 1 and dou square F by dou x 2 square; and so on and so forth, until the last row ends with dou square F by dou x n square. And this is basically the Hessian, ok. So, you can think of this as the second order derivative; this is the Hessian.
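As a side note, both objects can be approximated numerically; the sketch below (illustrative only, with an arbitrary quadratic test function not taken from the lecture) builds finite-difference versions of the gradient vector and Hessian matrix just defined.

import numpy as np

def numerical_gradient(F, x, h=1e-5):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (F(x + e) - F(x - e)) / (2.0 * h)
    return g

def numerical_hessian(F, x, h=1e-4):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (F(x + ei + ej) - F(x + ei - ej)
                       - F(x - ei + ej) + F(x - ei - ej)) / (4.0 * h**2)
    return H

F = lambda x: x[0]**2 + 3.0 * x[0] * x[1] + 2.0 * x[1]**2
x0 = np.array([1.0, -2.0])
print("gradient:", numerical_gradient(F, x0))    # analytically [2*x1 + 3*x2, 3*x1 + 4*x2] = [-4, -5]
print("Hessian:\n", numerical_hessian(F, x0))    # analytically [[2, 3], [3, 4]]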
(Refer Slide Time: 04:43)

If you look at any i j-th entry of the Hessian, del square F, the i j-th entry is the second order partial derivative of F with respect to x i and x j. So, it is a matrix whose i j-th entry is that second order partial derivative. And naturally, needless to say, this is an n cross n matrix for an n dimensional vector x bar. So, if you consider the vector x bar to be n dimensional, then this Hessian is an n cross n matrix with the i j-th entry corresponding to the second order partial derivative of F with respect to x i and x j.

(Refer Slide Time: 05:44)

Now, the condition, the test for convexity of this function of the vector x bar, is that the Hessian with respect to x bar has to be — remember this symbol — a positive semi definite matrix. So, if this is positive semi definite, that is, if del square F, the Hessian, is a positive semi definite matrix, this implies that F of x bar is convex. So, if the Hessian is positive semi definite, we have that F of x bar is a convex function of the vector x bar. So, we have to compute the Hessian of this function of the vector x bar and check if the Hessian is positive semi definite; and if it is positive semi definite, one can conclude that F of x bar is convex.

(Refer Slide Time: 07:28)

Let us understand that by looking at a simple example. So, first let us start with a two-dimensional vector x bar, so x bar has components x 1 and x 2, ok. And we have F of x bar equals x 1 square divided by x 2; let me write this clearly: F of x bar equals x 1 square divided by x 2.
(Refer Slide Time: 08:22)

First, let us consider the gradient with respect to x bar; that, we know, is the partial derivative of F with respect to x 1 and the partial derivative with respect to x 2, ok. Now, for the partial derivative with respect to x 1, treat x 2 as a constant and differentiate with respect to x 1; that gives us 2 x 1 by x 2. For the partial derivative with respect to x 2, treat x 1 as a constant and differentiate with respect to x 2; that will be minus x 1 square by x 2 square, correct. So, you are treating x 1 as a constant and differentiating with respect to x 2, ok.

(Refer Slide Time: 09:19)

Now, consider the Hessian with respect to x bar. Just to refresh your memory, that will be dou square F by dou x 1 square, dou square F by dou x 1 dou x 2, dou square F by dou x 2 dou x 1 and, finally, dou square F by dou x 2 square, which you can see is the following thing. Dou square F by dou x 1 square will be 2 over x 2; dou square F by dou x 1 dou x 2 will be the same as dou square F by dou x 2 dou x 1, which will be minus 2 x 1 divided by x 2 square; and finally, the second order partial derivative with respect to x 2 is 2 x 1 square divided by x 2 cube.

(Refer Slide Time: 10:46)

And now, if you take 2 over x 2 cube as common, we have x 2 square, minus x 1 x 2, minus x 1 x 2, and the last entry is x 1 square, which you can now decompose as follows. You can write this as the vector with entries x 2 and minus x 1 times its transpose; and if I call this vector u bar, this will be u bar u bar transpose. So, the Hessian will be 2 by x 2 cube times u bar u bar transpose; that is the Hessian.
(Refer Slide Time: 12:14)

So, this is the Hessian, ok, where u bar is the vector with entries x 2 and minus x 1. And now, here you can see that if any matrix P can be decomposed as A A transpose, or for that matter U U transpose, then it is positive semi definite. And this can simply be seen as follows; many of you might already be familiar with it. If you have any matrix P which can be written as a matrix U times U transpose, then x bar transpose P x bar equals x bar transpose U U transpose x bar, which is U transpose x bar, transposed, times U transpose x bar, which is basically the norm of U transpose x bar square.

So, the moment you have any matrix P which is U U transpose, P automatically becomes a PSD matrix. Which implies that this 2 by x 2 cube times u bar u bar transpose is automatically a PSD matrix, all right — for instance, if we restrict the domain such that x 2 is greater than 0. So, what we have is: this quantity 2 over x 2 cube is greater than or equal to 0, and u bar u bar transpose is a positive semi definite matrix, which implies the following.

(Refer Slide Time: 14:34)

So, this implies — let me just write it clearly — that this Hessian, which is basically 2 by x 2 cube times u bar u bar transpose, is positive semi definite. So, del square F is positive semi definite, which implies that x 1 square by x 2, a simple function of the two-dimensional vector x bar with components x 1, x 2, is convex, since the Hessian is positive semi definite. So, we have evaluated the Hessian of this function and demonstrated that the Hessian is a positive semi definite matrix; therefore the function is a convex function.
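A quick numerical confirmation of this example is sketched below: it samples points with x 2 positive and checks, via eigenvalues, that the Hessian 2 by x 2 cube times u bar u bar transpose is positive semi definite there. The sample ranges are arbitrary illustrative choices.

import numpy as np

def hessian_of_example(x1, x2):
    # Hessian of F(x1, x2) = x1**2 / x2, as derived above.
    return (2.0 / x2**3) * np.array([[x2**2, -x1 * x2],
                                     [-x1 * x2, x1**2]])

rng = np.random.default_rng(1)
all_psd = True
for _ in range(1000):
    x1 = rng.uniform(0.0, 5.0)
    x2 = rng.uniform(0.1, 5.0)          # keep x2 > 0, as in the domain restriction above
    eigvals = np.linalg.eigvalsh(hessian_of_example(x1, x2))
    all_psd &= np.all(eigvals >= -1e-9)
print("Hessian PSD at all sampled points with x2 > 0:", bool(all_psd))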
(Refer Slide Time: 15:53)

Let us proceed on to look at another example, and we will look at an interesting practical application of this convexity of a function of a vector. So, what I want to look at now is to develop a practical application of this; and in fact, in this practical application we want to look at a MIMO, Multiple Input Multiple Output, wireless communication system. Consider a practical application with a MIMO wireless system, where MIMO, as some of you might already be familiar, stands for multiple input multiple output, ok. So, this implies that you have multiple transmit and multiple receive antennas. And having such a multiple input multiple output system significantly increases the communication rate, or the data rate, of a wireless communication system; so, it is considered to be one of the revolutionary technologies in wireless communication.

(Refer Slide Time: 17:46)

And if you look at this MIMO communication system, you have, as I already said, a transmitter and a receiver, with many possible channels between each transmit and receive antenna. So, this is your t transmit antennas — let us represent this by small t transmit antennas — and this is r receive antennas. And therefore, what we can do is transmit the symbols x 1, x 2, up to x t from the t transmit antennas, and receive the symbols y 1, y 2, up to y r.

(Refer Slide Time: 19:32)

So, we can write this as y bar, the received vector corresponding to the symbols y 1, y 2, up to y r; this is the received vector, or you can say this is the vector of received symbols. And for that matter, if you look at x bar, that is x 1, x 2, up to x t, because you are transmitting t symbols, one from each transmit antenna; so, this is your transmit vector.

(Refer Slide Time: 20:21)
And our model for this MIMO system is therefore given as y bar equals H x bar plus n bar, where y bar is an r cross 1 received vector and x bar is a t cross 1 transmit vector, which implies that H must be an r cross t matrix. This is known as the MIMO channel matrix. So, this r cross t matrix H, where r is the number of receive antennas and t is the number of transmit antennas, is the MIMO, multiple input multiple output, channel matrix. Of course, this n bar is an r dimensional noise vector; you can see this is an r cross 1 noise vector. This is also called the additive noise vector, because it is simply adding to H x bar, ok.

(Refer Slide Time: 21:40)

And if you look at this channel matrix H, it has the following structure: the first row has h 1 1, h 1 2, up to h 1 t; the second row has h 2 1, h 2 2, and so on; and the last row is h r 1 up to h r t. So, this is an r cross t matrix; r equals the number of rows, which is the number of receive antennas, and t equals the number of columns, which is the number of transmit antennas. And if you look at h i j, each of these quantities is basically a fading channel coefficient; these are fading wireless channel coefficients, because the wireless channel is a fading channel. In particular, if you look at this quantity h i j, this is the channel coefficient between the i-th receive antenna and the j-th transmit antenna, ok.

(Refer Slide Time: 23:43)

And now our challenge, the problem that we want to address, which we will continue in the subsequent module, is the following. Remember, at the receiver we have this vector y bar, which we know at the receiver, all right. Now, with this received vector y bar, one has to decode, or one has to estimate, the vector x bar that has been transmitted; that is the problem of communication. In a communication system you transmit some symbols, and these symbols have to be recovered at the receiver, correct.

So, in a MIMO system you have this received vector y bar, but at the receiver you do not know the vector x bar that has been transmitted, which comprises the t transmit symbols x 1, x 2, up to x t, correct. So, one has to estimate this transmitted vector x bar, and that forms the problem of MIMO receiver design. So, one has to design a suitable algorithm, or a technique, for the MIMO receiver which recovers this transmitted vector, the transmit vector x bar.

So, what we want to look at is this problem: given the vector y bar, how to estimate x bar, where, remember, x bar comprises your vector of t transmit symbols. So, how do you detect this vector x bar that comprises the t transmit symbols, how do you recover the t transmit symbols from the given vector y bar at the receiver? That forms the problem of MIMO receiver design, which we will look at in the subsequent module. The problem of MIMO receiver design and its relation to convex optimization is something that we want to explore in the subsequent module, all right.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 26
Application: MIMO Receiver Design as a Least Squares Problem

Hello, welcome to another module in this massive open online course. So, we are
looking at applications of convexity, convexity of a function of a vector, and we are
looking at a practical application for a MIMO communication system or a MIMO
wireless system. So, let us continue our discussion.

(Refer Slide Time: 00:30)

So, we are looking at application of convexity in MIMO wireless system. And what we
have said is the following thing I have this model y bar equals H times x bar plus n bar
this is my r cross 1 received vector, this is the r cross t channel matrix, this is the t cross 1
transmit vector and this is an r cross 1 noise vector. And the problem is: given y bar, we have to recover x bar at the receiver, all right. One has to estimate, or recover, x bar, which is basically your transmit vector; given the receive vector, estimate the transmit vector.
(Refer Slide Time: 01:50)

Now, let us go back for a minute and look at a simple scenario; let us ignore the noise for a little bit. What you can see is that if you ignore the noise, this reduces to y bar equals H x bar, and this is basically a system of linear equations, a linear system of equations. The number of equations is r, and the number of unknowns is the number of elements of this x bar, which are the transmitted symbols; so, t equals the number of unknowns.

Now, let us start by considering a simple scenario with r equals t, ok. If r equals t, the number of equations equals the number of unknowns, and therefore the matrix H is a square matrix: H is r cross t and, if r equals t, it is t cross t, which implies it is a square matrix. And if H is invertible, then I can simply find the transmit vector as x bar equals H inverse y bar; remember, this is an approximation, in the sense that we are neglecting the impact, or the influence, of the noise.

(Refer Slide Time: 03:56)

If it is invertible — remember, the inverse is not guaranteed to exist — then I can find x hat, the estimate of the vector x, as H inverse into y bar; x hat equals H inverse y bar, and this is the estimate of the transmit vector x. Now, on the other hand, consider another scenario where r is strictly greater than t. What happens if r is strictly greater than t? The number of equations is greater than the number of unknowns, which means that the system is over determined. Now, for an over determined system, typically you cannot solve the system of equations y bar equals H x bar exactly; you can only solve it approximately, right. You cannot find an x bar such that y bar equals H x bar, which means you have to find the best vector x bar such that the approximation error y bar minus H x bar is minimized.
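A minimal sketch of this square, noise-free case is given below (illustrative values only, assuming H is invertible, which a random square matrix is with probability one); solving the linear system is preferable to forming the inverse explicitly.

import numpy as np

rng = np.random.default_rng(2)
t = 4
H = rng.normal(size=(t, t))                 # square t x t channel (assumed invertible)
x = rng.choice([-1.0, 1.0], size=t)         # e.g. BPSK-like symbols
y = H @ x                                   # noise ignored, as in the discussion above

x_hat = np.linalg.solve(H, y)               # equivalent to H^{-1} y, but numerically better
print("recovered x:", np.round(x_hat, 6))
print("true x     :", x)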
(Refer Slide Time: 05:54)

Now, what this means is that, since you cannot solve y bar equal to H x bar exactly, you can look at the error vector; by e bar we denote the error vector of this over determined system. So, you cannot find an x bar such that y bar equals H x bar, which implies: find x bar such that the error is minimized.

(Refer Slide Time: 07:05)

Now, what is the error? We have the error vector e bar, and the error is the norm of e bar, or the norm of e bar square, ok. You can think of this as the energy of the error vector; it is the total energy of the error that is minimized, which basically implies that we want to minimize the norm of y bar minus H x bar square. So, we want to find x bar such that the norm of y bar minus H x bar square, the error, is minimized.

(Refer Slide Time: 07:56)

So, we want to find the best vector x bar that minimizes the error, that is, the norm of y bar minus H x bar square. This is known as the least squares problem, or simply the LS problem: you want to find the x bar which gives you the least squared error, in this case the squared norm of the error, simply known as the squared error. So, we want to find the vector x bar, the estimate, which gives you the least squared error, and that is known as the least squares estimate. And this is a very important problem that arises frequently in both communications as well as signal processing.
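The over-determined problem set up above can be sketched numerically as follows (dimensions, noise level and symbols are made-up test values); numpy's least-squares solver returns the x bar that minimizes the squared error.

import numpy as np

rng = np.random.default_rng(3)
r, t = 8, 3
H = rng.normal(size=(r, t))                      # tall r x t channel matrix
x_true = rng.choice([-1.0, 1.0], size=t)
y = H @ x_true + 0.1 * rng.normal(size=r)        # received vector with additive noise

x_ls, residual, rank, _ = np.linalg.lstsq(H, y, rcond=None)
print("least-squares estimate:", np.round(x_ls, 3))
print("squared error ||y - H x_ls||^2:", np.sum((y - H @ x_ls)**2))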
(Refer Slide Time: 09:10)

Now, we want to simplify this; let us start by simplifying this cost function, the norm of y bar minus H x bar square. Remember, we said this is the norm of the error vector square. The norm of the error vector square is e bar transpose e bar, the vector transposed times itself, that is the norm square of the vector, which we can now write as y bar minus H x bar, transposed, into y bar minus H x bar, which is equal to y bar transpose minus x bar transpose H transpose, into y bar minus H x bar. Now multiply this out: y bar transpose y bar, minus x bar transpose H transpose y bar, minus y bar transpose H x bar, plus x bar transpose H transpose H x bar. The two middle terms are the transposes of each other; remember, a real number which is the transpose of itself is equal to itself. So, these two quantities are real numbers, simply scalar quantities, and they are transposes of each other, and therefore they are equal. So, I am going to simply write this as y bar transpose y bar minus twice x bar transpose H transpose y bar plus x bar transpose H transpose H into x bar. Now, this is what we get.

(Refer Slide Time: 10:45)

This is the cost function, also known as the least squares cost function. Now, let us denote this least squares cost function by f of x bar; remember, this is a function of the vector x bar, ok. Now, to demonstrate that this is convex, we have to consider the Hessian. So, first let us look at the gradient, the properties of the gradient, and then we will look at the Hessian, ok.

So, let us consider a simple function C bar transpose x bar and look at its gradient. Now, remember, C bar transpose x bar you can write as the row vector c 1, c 2, up to c t times the column vector x 1, x 2, up to x t, which is equal to c 1 x 1 plus c 2 x 2 plus so on up to c t x t. And now, for the gradient of this C bar transpose x bar, you can see the derivative with respect to x 1 is c 1, with respect to x 2 is c 2, and with respect to x t is c t. So, the gradient is simply C bar.
(Refer Slide Time: 12:19)

Similarly, remember that C bar transpose x bar is equal to x bar transpose C bar; therefore, the gradient of C bar transpose x bar is the gradient of x bar transpose C bar, which is C bar. So, this is what we have over here, ok. Now, what about the Hessian? If you look at the Hessian, that is, the second order derivative, of x bar transpose C bar, or C bar transpose x bar, that will be the gradient of the gradient, that is, the gradient of C bar; but C bar is a constant, so the Hessian of x bar transpose C bar, for any constant vector C bar, is going to be 0, because remember this term is linear in x bar; if you differentiate it twice, it is going to be 0. So, the Hessian of the term x bar transpose C bar is 0.

(Refer Slide Time: 13:46)

On the other hand, now look at a term of the form x bar transpose P x bar with P equal to P transpose, ok; so, P is symmetric. First we start with the gradient: you can show that the gradient of x bar transpose P x bar is twice P times x bar. Now, take the Hessian of x bar transpose P x bar: for the Hessian you differentiate once to get the gradient, and then differentiate that gradient again. So, this is the gradient of twice P x bar; and since each component of twice P x bar is of the form of a constant row times x bar, that is, of the form C bar transpose x bar, you can check that this Hessian will be twice P.
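The two identities just stated can be checked numerically; the sketch below (with arbitrary test values for P and x bar, not from the lecture) compares a finite-difference gradient of x bar transpose P x bar with twice P x bar.

import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
P = A + A.T                                   # symmetric P
x = rng.normal(size=3)
f = lambda z: z @ P @ z

h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
print("max |grad_fd - 2 P x| :", np.max(np.abs(grad_fd - 2 * P @ x)))   # close to 0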
(Refer Slide Time: 15:17)

So, the gradient of x bar transpose P x bar will be twice the matrix P times x bar; this is your quadratic term. Now, we go back to our original least squares cost function, f of x bar, and we compute its gradient and its Hessian.

(Refer Slide Time: 15:51)

So, if you go back to the least squares cost function, you will see that the cost function is the norm of y bar minus H x bar square, which is y bar transpose y bar minus twice x bar transpose H transpose y bar plus x bar transpose H transpose H into x bar. Now, let us first take the gradient of this. Observe that the gradient of y bar transpose y bar is 0, because, given the vector y bar, y bar transpose y bar is a constant. Then we have minus twice the gradient of x bar transpose H transpose y bar; here H transpose y bar plays the role of your constant vector C bar, so this gradient is simply going to be C bar, that is, H transpose y bar. Plus we have the gradient of x bar transpose H transpose H into x bar; now, remember, H transpose H is your matrix P, which, as we can see, is symmetric.

(Refer Slide Time: 17:27)

So, this last gradient will simply be twice P x bar, or twice H transpose H into x bar, and that is it. So, now you have minus 2 times C bar, which is H transpose y bar, plus twice P, which is H transpose H, into x bar; this is the gradient. And the Hessian will now be obtained if you differentiate this again: of course, the first piece, minus twice H transpose y bar, is a constant term, so its gradient is 0; and corresponding to the term twice P into x bar, we have already seen that if you take its gradient, that will be twice P. So, this will be twice H transpose H. So, the Hessian of the least squares cost function reduces to twice H transpose H.
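Here is a similar finite-difference check (again with made-up test data) for the gradient just derived, minus twice H transpose y bar plus twice H transpose H x bar.

import numpy as np

rng = np.random.default_rng(5)
H = rng.normal(size=(6, 3))
y = rng.normal(size=6)
x = rng.normal(size=3)

f = lambda z: np.sum((y - H @ z)**2)          # least-squares cost ||y - H z||^2
analytic = -2 * H.T @ y + 2 * H.T @ H @ x

h = 1e-6
fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
print("max abs difference:", np.max(np.abs(fd - analytic)))   # should be very small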
(Refer Slide Time: 18:40)

And now, if you call this matrix H transpose H as P, first you can see that P equals P transpose, because the transpose of H transpose H — using the rule that A B transpose is B transpose A transpose — is H transpose into H transpose transpose, but H transpose transpose is H itself. So, first, this is a symmetric matrix. And further, if you look at x bar transpose H transpose H into x bar, what is this? This will be equal to H x bar, transposed, times H x bar, which is a vector transposed times itself, that is, the norm of H x bar square, which is greater than or equal to 0. This means x bar transpose P x bar, or x bar transpose H transpose H x bar, is always greater than or equal to 0 for any vector x bar, which means that this matrix P equals H transpose H is always positive semi definite, which implies that twice H transpose H, because you are multiplying by a positive constant, is also going to be positive semi definite. Therefore, the Hessian is positive semi definite, which implies that the least squares cost function is convex; that is a very important property which helps us design the receiver in this MIMO system. In fact, that is what we are going to do when we solve the convex optimization problem.

(Refer Slide Time: 20:20)

So, what you realize here is that H transpose H is a positive semi definite matrix; multiplying it by the positive constant 2 implies that twice H transpose H is positive semi definite, which implies that del square f of x bar, that is, this Hessian, is positive semi definite, which implies that the norm of y bar minus H x bar square is a convex function of x bar. So, this f of x bar is convex. And solving this convex optimization problem, one obtains the estimate of the transmit vector x bar. And this is going to be important when we talk about the receiver design for a MIMO system, ok.
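A quick numerical confirmation of this conclusion, for an arbitrary test matrix H (not from the lecture), is to look at the eigenvalues of twice H transpose H:

import numpy as np

rng = np.random.default_rng(6)
H = rng.normal(size=(8, 3))
hessian = 2.0 * H.T @ H
eigvals = np.linalg.eigvalsh(hessian)
print("eigenvalues of 2 H^T H:", np.round(eigvals, 4))
print("all non-negative:", np.all(eigvals >= -1e-10))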
(Refer Slide Time: 21:43)

So, solving this can be used to design the receiver for the MIMO system, all right. So, this is important. This least squares problem, in fact, occurs in several different scenarios, and one application of the least squares problem is to define an efficient receiver for a MIMO system. We are trying to find the best transmit vector x bar corresponding to a received vector y bar that minimizes the approximation error, that is, the best estimate x hat which best explains, or which is the best approximation for, the received vector y bar in this MIMO system, all right. So, we will stop here and continue with other aspects.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 27
Jensen's Inequality and Practical Application: BER calculation in Wired and Wireless Scenario

Hello. Welcome to another module in this massive open online course. So, we are looking at convex functions, and in particular convex functions of a vector variable, correct, and we have worked out the test for convexity for a function of a vector, alright. So, let us change tracks a little bit and look at something known as Jensen's inequality, which has significant applications in various areas, ok.
various areas, ok.

(Refer Slide Time: 00:49)

So, what we want to look at in this module is Jensen's inequality, which holds for convex as well as concave functions and which is a handy tool that arises in several scenarios. In fact, we will try to justify this by looking at a practical application. So, this is Jensen's inequality, and the Jensen's inequality is very interesting; it is as follows.
(Refer Slide Time: 01:17)

Remember, if you go back and look at the definition of a convex function: it is this bowl shaped function, we said. If you look at 2 points x 1 and x 2 — let us make these vectors, x 1 bar and x 2 bar — and you take any convex combination theta 1 x 1 bar plus theta 2 x 2 bar, then there is the value of the function at this combination, that is, F of theta 1 x 1 bar plus theta 2 x 2 bar, and there is the corresponding point on the chord, which is theta 1 times F of x 1 bar plus theta 2 times F of x 2 bar.

Here we can set theta 1 as theta and, remember, theta 2 equals 1 minus theta; or you can simply say theta 1 and theta 2 are greater than or equal to 0 and satisfy the condition that theta 1 plus theta 2 equals 1. So, this is a convex combination of 2 points, theta 1 times x 1 bar plus theta 2 times x 2 bar, where theta 1 and theta 2 are two scalars, two numbers, which are greater than or equal to 0, that is, non-negative.

And their sum theta 1 plus theta 2 equals 1, ok; being non negative and summing to 1 automatically implies that both theta 1 and theta 2 lie in the interval 0 to 1, ok. Further, the basic inequality is that F of theta 1 x 1 bar plus theta 2 x 2 bar is less than or equal to theta 1 F of x 1 bar plus theta 2 F of x 2 bar; this is the inequality for a convex function F of x bar.

(Refer Slide Time: 02:18)

Now, go back and look at these quantities: theta 1 and theta 2 are greater than or equal to 0 and theta 1 plus theta 2 equals 1. This should remind you of something — it should remind you of probabilities. The reason being, you have 2 quantities, theta 1 and theta 2, which are non-negative, greater than or equal to 0, and they sum to 1.

So, this should remind you of a probability distribution, or a probability mass function in this case. You can consider a distribution where the probability that X equals x 1 bar is theta 1 and the probability that X equals x 2 bar is theta 2; that is, the random variable X takes the value x 1 bar with probability theta 1 and the value x 2 bar with probability theta 2. The probabilities are greater than or equal to 0 and they sum to 1 as well.

(Refer Slide Time: 05:01)

And now, therefore, if you look at the quantity theta 1 times x 1 bar plus theta 2 times x 2 bar, what is this equal to? Well, this is the probability that X equals x 1 bar, which is theta 1, times x 1 bar, plus the probability that X equals x 2 bar times x 2 bar, ok. The probability that X equals x 1 bar is theta 1 and the probability that X equals x 2 bar is theta 2; that is what we said above.

And therefore, this is nothing but the expected value of the random variable X. Because, remember, what is the expected value? The expected value is nothing but: you take each possible value, multiply it by the corresponding probability and sum, right. That is the expected value. This is nothing but the summation over i of the probability that X equals x bar i times x bar i. This is your definition of the expected value of X, ok.

So, this is the definition of the expected value of X. Further, if you look at the quantity F of theta 1 x 1 bar plus theta 2 x 2 bar, you can now write that as F of, simply, the expected value of X, ok — F of the expected value of the random variable X — because, remember, we have shown that theta 1 x 1 bar plus theta 2 x 2 bar is basically equal to the expected value of X.

(Refer Slide Time: 07:10)

Now, on the other hand, if you look at the quantity theta 1 F of x 1 bar plus theta 2 F of x 2 bar, well, this is equal to — again, I can write this similarly — F of x 1 bar times the probability that X equals x 1 bar, which is theta 1, plus F of x 2 bar times the probability that X equals x 2 bar. And this is equal to the expected value of F of X.

And, therefore, what do we have now? What we have just shown is that theta 1 F of x 1 bar plus theta 2 F of x 2 bar, the convex combination of the 2 points, that is, a point on the chord, is nothing but the expected value of F of X. And for a convex function, we know that F of theta 1 x 1 bar plus theta 2 x 2 bar is less than or equal to theta 1 F of x 1 bar plus theta 2 F of x 2 bar.
(Refer Slide Time: 08:58)

What is this? We have seen that the first quantity is the expected value of X and the second is the expected value of F of X. Therefore, we can represent the same thing as: F of the expected value of X is less than or equal to the expected value of F of X, and this is basically your Jensen's inequality for a convex function — very important, and a very handy tool, as we will see in a practical application. This is Jensen's inequality and it is frequently used in signal processing and communications. Especially if you look at information theory, there are several instances where Jensen's inequality is fairly handy for proving various results, all right. It states that, for a convex function, the function of the expected value of a random variable is less than or equal to the expected value of the function of that random variable.

(Refer Slide Time: 10:28)

So, this is again: F of the expected value of X is less than or equal to the expected value of F of X for a convex function. For a concave function, it is the other way; we will naturally have the reverse of this, because the chord lies below the curve. So, F of the expected value of X is greater than or equal to the expected value of F of X, and this is for a concave function. So, you can think of these as Jensen's inequality for a convex function and Jensen's inequality for a concave function, ok, all right.
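Before generalizing, here is a small Monte Carlo sketch of the inequality (the distribution and the convex function used are arbitrary choices, not from the lecture):

import numpy as np

rng = np.random.default_rng(7)
X = rng.exponential(scale=2.0, size=100000)     # arbitrary non-negative random variable
f = lambda x: x**2                              # convex function

lhs = f(np.mean(X))            # sample estimate of f(E[X])
rhs = np.mean(f(X))            # sample estimate of E[f(X)]
print("f(E[X]) =", round(lhs, 3), " E[f(X)] =", round(rhs, 3), " Jensen holds:", lhs <= rhs)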
And you can generalize this too: well, we have simply considered a random variable that takes 2 values; you can generalize this as follows.

(Refer Slide Time: 11:50)

So, you can consider theta 1, theta 2, up to theta n as the probabilities and x 1 bar, x 2 bar, up to x n bar as the various values of the random variable. Then, again for a convex function, you have F of the expected value of X, which equals F of the summation over i from 1 to n of theta i x i bar, is less than or equal to the expected value of F of X, which is basically the summation over i from 1 to n of theta i F of x i bar, ok, all right.

So, this is the Jensen's inequality. In fact, it holds even for a continuous random variable X; in fact, that is what makes this a very interesting and very powerful inequality, all right. And, in fact, that is what we are going to see shortly in a practical application in the context of a communication system, alright. So, what we want to see now is a practical application of this to demonstrate the applicability.

(Refer Slide Time: 13:33)

So, you want to see a practical application, ok. And therefore, consider a communication channel; this practical application is in the context of the BER, what is termed as the bit error rate. I think all of you are familiar with this, or most of you are familiar — those who are familiar with the properties of a communication system or with the performance analysis of communication systems. This is basically the bit error rate, which is also a probability: it denotes the probability with which a bit is received in error over a communication channel, all right.

(Refer Slide Time: 14:50)
So, consider a very simple channel, y equals x plus n, where y is the received symbol, x is the transmitted symbol and n is the noise, correct. So, y is your received symbol, that is, the channel output; x is your transmitted symbol; and n is your noise, ok, which is typically assumed to be white Gaussian, correct — white Gaussian with mean 0 and variance sigma square. And x can be a symbol; typically take this as plus or minus square root of P, that is, a BPSK symbol. BPSK stands for binary phase shift keying, and the power is P. The noise is white Gaussian and it is additive in nature, correct; therefore, this is termed as an additive white Gaussian noise channel.

(Refer Slide Time: 16:36)

If you are transmitting BPSK symbols, plus or minus square root of P, then the SNR is nothing but the signal power divided by the noise power, that is, P over sigma square, all right. And we can denote this by gamma. And the interesting thing about this is, if you look at the bit error rate for the transmission of BPSK modulated digital symbols over an additive white Gaussian noise channel, that is given by the well-known expression: the bit error rate is Q of the square root of the SNR, that is, Q of square root of gamma. This is the bit error rate of BPSK over AWGN.
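As an aside, this bit error rate expression is easy to evaluate numerically; the sketch below uses scipy.stats.norm.sf as the Q function and the Q of square root of SNR form quoted here (the Q of square root of 2 E b over N naught convention mentioned earlier differs only in the scaling of the argument). The SNR values are arbitrary.

import numpy as np
from scipy.stats import norm

snr_db = np.array([0.0, 5.0, 10.0])
snr = 10.0 ** (snr_db / 10.0)
ber = norm.sf(np.sqrt(snr))        # Q(sqrt(SNR)); use np.sqrt(2*snr) under the Eb/N0 convention
for d, b in zip(snr_db, ber):
    print(f"SNR = {d:4.1f} dB  ->  BER = {b:.3e}")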
Now, this additive white Gaussian noise model is generally associated with a conventional digital communication system in which there is a wire medium between the transmitter and the receiver. This is also known as a wired channel, all right — a wire based channel such as the twisted copper pair or a coaxial cable, or your conventional telephone, where there is a wire that connects the telephone to the local exchange and so on, all right. So, this is a model for a conventional communication system, or what we call a wire line system. Now, on the other hand, something interesting happens if you look at a wireless system. In a wireless system, you have a base station which is transmitting to, let us say, a mobile in the cell; then, in addition to this signal, which you can call the direct path,

(Refer Slide Time: 19:14)

there are also going to be several reflections, or what are also known as scattered paths. These come from scatterers such as large buildings, and they give rise to the NLOS, non line of sight, scattered components; so, there is a line of sight path and there are scattered paths. Now, what happens when the signals from these multiple paths superimpose, correct?

So, when these signals superimpose — multiple signals from multiple paths, which is also known as a multipath environment — the superposition of the multiple paths leads to interference, correct, and that is the problem, right. And once you have interference, the interference may not only be destructive; interference can also be constructive, alright.

But the moment you have interference, there is uncertainty in the received signal level, all right. The signal level can dip if the interference is destructive, or the signal level can rise if the interference is constructive. So, in general, the level of the signal, or the power of the received signal, is varying with time, all right — unlike a conventional wire line communication system, where there is no phenomenon of multipath reflection. In a wireless system, because of this multipath reflection phenomenon, the resulting interference leads to a time varying power for the received signal; this process is termed as fading, all right, and this channel, the wireless channel, is known as a fading channel, ok.

(Refer Slide Time: 21:21)

So, this leads to variation in the received signal, which is termed as fading, which implies that the wireless channel is a fading channel, as we have seen in some examples before. Now, therefore, I cannot simply use the AWGN channel model; for the wireless channel, I have to use a different model. For instance, the wire line or conventional channel model is additive white Gaussian noise, that is, your received signal y equals the transmitted signal x plus noise. In addition, in a wireless channel, I will have the presence of a multiplicative factor h, which is a coefficient termed as the fading channel coefficient, ok.

(Refer Slide Time: 22:38)

So, this is termed as the fading channel coefficient and this is a random variable. The important thing to note is that, since this fading channel coefficient is a random variable, the received power is random in nature, or the received signal level is random in nature. And therefore, now you can see that the SNR is influenced by this fading channel coefficient. So, one can write the SNR, which was previously P over sigma square, as now multiplied by magnitude h square; so, it is P over sigma square — remember, this is gamma — times magnitude h square. So, this is gamma times magnitude h square, and you can think of this as the SNR of the wireless channel.

And, therefore, now what we can do is look at the resulting bit error rate depending on this SNR of the fading wireless channel, apply Jensen's inequality and derive a suitable conclusion, alright. So, this is the practical scenario in which one can apply the Jensen's inequality, alright. And we will continue this discussion in the subsequent module.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 28
Jensen's Inequality application: Relation between Average BER of Wireless and Wired system and Principle of Diversity

Hello. Welcome to another module in this massive open online course. So, we are looking at a practical application of Jensen's inequality in the context of a wireless communication scenario — in fact, at a comparison of the performance of a wireless communication system with that of a conventional digital, or wire line, communication system, alright.

(Refer Slide Time: 00:40)

And what we have seen is that the relevant channel model for your wireless communication system is y equals h x plus n, where the signal has power P and the noise has power sigma square. Now, because of the fading channel coefficient — this h is your fading channel coefficient, or fading coefficient — what has happened is that your SNR of the wireless channel is magnitude h square times P over sigma square, which is magnitude h square times gamma.
(Refer Slide Time: 01:20)

So, to perform a fair comparison between these 2 systems, we choose the probability density function of the channel coefficient such that the average SNR for both scenarios is the same. It is not the case that the wireless system has a much higher SNR than the conventional communication system; we are using the same average SNR for both.

(Refer Slide Time: 01:51)

Now, if you look at the probability of bit error for the wireless communication system, this will be Q of square root of the SNR of the wireless communication system, as we have already seen, which is nothing but Q of square root of magnitude h square times gamma. So, this is the instantaneous bit error rate, or probability of bit error, of the wireless communication system.

(Refer Slide Time: 02:54)

Further, we will set the expected value, that is, the average value, of magnitude h square equal to 1. What happens because of this? The average SNR of the wireless communication system is the expected value of magnitude h square times gamma, which is gamma times the expected value of magnitude h square, and since the expected value of magnitude h square equals 1, this is simply gamma. But gamma is nothing but the SNR of the AWGN channel, or let us call it the SNR of your conventional wire-line communication system. So, this is a fair comparison.
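(A quick illustrative sketch, not part of the original lecture: the Python snippet below computes these quantities for one assumed Rayleigh fading realization with the expected value of magnitude h square equal to 1, using scipy's erfc to implement the Q function.)

import numpy as np
from scipy.special import erfc

def Q(x):
    # Gaussian Q-function: Q(x) = 0.5 * erfc(x / sqrt(2))
    return 0.5 * erfc(x / np.sqrt(2))

gamma = 10.0                                   # assumed SNR P / sigma^2
rng = np.random.default_rng(0)

# Rayleigh fading: h ~ CN(0, 1), so E[|h|^2] = 1 (the fair-comparison assumption)
h = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)

snr_wireless = np.abs(h) ** 2 * gamma          # instantaneous SNR |h|^2 * gamma
print(Q(np.sqrt(snr_wireless)))                # instantaneous BER of the fading link
print(Q(np.sqrt(gamma)))                       # BER of the AWGN / wire-line system at SNR gamma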
(Refer Slide Time: 04:06)

And now, we want to look at this function. We have already seen what Q of x is. Remember, Q of x is the CCDF of the standard Gaussian random variable, given as the integral from x to infinity of 1 over square root of 2 pi, e raised to minus t square by 2, dt. Now, consider the function Q of square root of x: simply replacing x by square root of x, this is the integral from square root of x to infinity of 1 over square root of 2 pi, e raised to minus t square by 2, dt. If you denote this Q of square root of x by F of x, then F of x equals Q of square root of x, which is that integral from square root of x to infinity.

(Refer Slide Time: 05:16)

We want to show that this function F of x is a convex function, and that is easy to show. We already know that Q of x is a convex function; we now want to show that Q of square root of x is also convex, which follows if you differentiate, that is, compute dF by dx. Remember, the derivative of such an integral involves the derivative of the top limit, but the top limit is infinity, a constant, so that contribution is 0, minus the integrand at the bottom limit times the derivative of the bottom limit. If you differentiate square root of x you get 1 over 2 square root of x, 1 over square root of 2 pi is a constant, and you substitute the lower limit in the integrand, which gives e raised to minus x by 2.

So, dF by dx equals minus 1 over 2 square root of 2 pi, times 1 over square root of x, times e raised to minus x by 2. And now, d square F by dx square is simple to evaluate using the product rule: it is minus 1 over 2 square root of 2 pi times the derivative of 1 over square root of x, which is minus half x to the power of minus 3 by 2, times e raised to minus x by 2, minus 1 over 2 square root of 2 pi times 1 over square root of x times the derivative of e raised to minus x by 2, which is minus half e raised to minus x by 2.
(Refer Slide Time: 06:59)

And now, if you simplify this, you can see that this second derivative is nothing but 1 over 4 square root of 2 pi, times 1 over x raised to the power of 3 by 2, times e raised to minus x over 2, plus 1 over 4 square root of 2 pi, times 1 over square root of x, times e raised to minus x by 2. And essentially, the important thing is that this second order derivative is greater than or equal to 0 for x greater than or equal to 0. So, d square F by dx square is greater than or equal to 0, which implies that F of x equals Q of square root of x is convex. So, that shows that this bit error rate function Q of square root of x is a convex function. So, Q of x is convex, as well as Q of square root of x, and we are demonstrating this for Q of square root of x because that is what is relevant in this context.
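(A small symbolic cross-check, not part of the original lecture: using sympy and the identity Q(u) = erfc(u/sqrt(2))/2, one can reproduce the first and second derivatives of F(x) = Q(sqrt(x)) and confirm the sign of the second derivative for x > 0.)

import sympy as sp

x = sp.symbols('x', positive=True)

# Q(u) = 0.5*erfc(u/sqrt(2)), so F(x) = Q(sqrt(x))
F = sp.Rational(1, 2) * sp.erfc(sp.sqrt(x) / sp.sqrt(2))

F1 = sp.simplify(sp.diff(F, x))     # equals -exp(-x/2) / (2*sqrt(2*pi*x))
F2 = sp.simplify(sp.diff(F, x, 2))  # a sum of non-negative terms for x > 0, so F is convex there

print(F1)
print(F2)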

(Refer Slide Time: 09:12)

Now, what does Jensen's inequality tell us? Let us recollect. Jensen's inequality tells us that, for a convex function F and any random variable X, F of the expected value of X is less than or equal to the expected value of F of X. Now, apply this to our function F of x equals Q of square root of x, with X equal to the SNR of the wireless channel, magnitude h square times gamma. Then the expected value of Q of square root of magnitude h square gamma, which is nothing but the expected value of F of X, is greater than or equal to F of the expected value of X, that is, F of the expected value of magnitude h square gamma. And now, you observe something interesting. The expected value of magnitude h square gamma is gamma times the expected value of magnitude h square, which is equal to gamma, because the expected value of magnitude h square equals 1; that is precisely the condition we set, so that both these communication systems have the same SNR on average. Therefore, what you can see is that this quantity is greater than or equal to F of gamma.
(Refer Slide Time: 11:07)

Now, you can see this is greater than or equal to F of gamma, which is nothing but Q of square root of gamma. And Q of square root of gamma is basically the bit error rate of the conventional, or wire-line, system. And what is the quantity on the left? Remember, Q of square root of magnitude h square gamma is the instantaneous bit error rate of the wireless communication system, which implies that its expected value is the average bit error rate of the wireless communication system.

So, this shows theoretically and rigorously that, for the same average SNR, the average bit error rate of a wireless communication system is higher. In fact, you will see in practice that it is significantly higher than that of a conventional, or wire-line, communication system.

(Refer Slide Time: 12:47)

So, this tells us one of the very interesting, fundamental results that underpins the entire study of wireless communication systems: the average BER of a wireless communication system is greater than the average bit error rate of a wire-line, or you can also say a conventional, communication system, that is, a communication system in which there is a wire or a guided propagation medium between the transmitter and the receiver. In fact, in practice it is not just greater, it is significantly greater; the average bit error rate of a wireless communication system is significantly greater than that of a wire-line or wired communication system, which implies that communication in a wireless system is highly unreliable.
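(To make the comparison concrete, here is a small Monte Carlo sketch, an illustration rather than part of the lecture, that estimates the average BER of a Rayleigh fading link with the expected value of magnitude h square equal to 1 and compares it with the wire-line BER Q(sqrt(gamma)) at the same average SNR.)

import numpy as np
from scipy.special import erfc

def Q(x):
    return 0.5 * erfc(x / np.sqrt(2))

rng = np.random.default_rng(1)
gamma = 10.0                                   # assumed common average SNR

# Rayleigh fading: h ~ CN(0, 1), so E[|h|^2] = 1
h = (rng.standard_normal(200_000) + 1j * rng.standard_normal(200_000)) / np.sqrt(2)

ber_wireless = np.mean(Q(np.sqrt(np.abs(h) ** 2 * gamma)))   # E[ Q(sqrt(|h|^2 gamma)) ]
ber_wireline = Q(np.sqrt(gamma))                             # Q(sqrt(gamma))

print("average wireless BER ~", ber_wireless)
print("wire-line BER        ~", ber_wireline)
# Jensen's inequality guarantees ber_wireless >= ber_wireline; in practice it is much larger.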

Because the probability of bit error is very high, this communication is highly unreliable. And why does this arise? That is important. All of this arises because of the fundamental difference between a wire-line and a wireless communication system: the fading nature of the wireless
channels, the fading channel coefficients. So, all this is arising because of the random nature of the fading channel coefficient.

(Refer Slide Time: 15:24)

So, this is arising because of the fading nature of the wireless channel coefficient. And therefore, fading has a significant impact on the nature and, in fact, the performance of communication. Fading leads to a severe degradation in the performance of a wireless communication system; that is the challenge in wireless communication. So, fading is basically a challenge.

(Refer Slide Time: 16:46)

This is a challenge that one has to overcome or surmount, and that is why we need technologies to overcome this fading nature. One of the most important technologies to overcome the fading nature, or the degradation that arises due to fading, is termed diversity. Diversity basically means that we have multiple received signals at the receiver, and you combine these received copies to enhance the signal to noise power ratio as well as the reliability of wireless communication. This principle is known as diversity. For instance, in a multiple antenna system, where you have multiple receive antennas, you receive multiple copies of the signal and combine the signal copies to enhance the performance of the wireless communication system.
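(As a rough numerical illustration of why combining helps, and again only a sketch that is not part of the lecture, the snippet below extends the earlier Monte Carlo to L independent Rayleigh branches combined by maximal ratio combining, one common diversity scheme in which the effective SNR is the sum of the branch SNRs.)

import numpy as np
from scipy.special import erfc

def Q(x):
    return 0.5 * erfc(x / np.sqrt(2))

rng = np.random.default_rng(2)
gamma, L, trials = 10.0, 4, 200_000            # assumed SNR and number of receive antennas

# L independent Rayleigh branches, each with E[|h_l|^2] = 1
h = (rng.standard_normal((trials, L)) + 1j * rng.standard_normal((trials, L))) / np.sqrt(2)

snr_single = np.abs(h[:, 0]) ** 2 * gamma          # no diversity, one antenna
snr_mrc = np.sum(np.abs(h) ** 2, axis=1) * gamma   # maximal ratio combining of all L branches

print("avg BER, single antenna:", np.mean(Q(np.sqrt(snr_single))))
print("avg BER, MRC with L = 4:", np.mean(Q(np.sqrt(snr_mrc))))
# The combined link has a much lower average BER, illustrating the diversity benefit.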
(Refer Slide Time: 17:49)

So, this fundamental property of wireless communication leads to technologies that overcome fading, and an example of diversity is the use of multiple antennas.

(Refer Slide Time: 18:13)

What happens in a multiple antenna system? We have already seen this schematically: you have the transmitter and the receiver, and from the transmitter you receive multiple copies of the signal. So, you have these multiple copies; diversity means several, right, that is the English meaning of the word diversity, and these multiple signal copies you combine to enhance the signal to noise (Refer Time: 19:17) power ratio as well as the reliability.

(Refer Slide Time: 19:25)

Hence, diversity is a key principle in a wireless communication system and, in fact, one of its key technologies; diversity is a very important technology innovation for wireless communication systems. At the root of all this is the fading nature of the wireless channel and the resulting poor performance of the wireless communication system, the very high bit error rate, which can in fact be demonstrated using Jensen's inequality.

And therefore, Jensen's inequality has a lot of applications, especially for establishing fundamental results like this one. As I have already noted, it can also be used to derive and prove several results in the context of information theory. So, that basically summarizes Jensen's inequality and also shows its application in a very interesting context, namely the performance comparison of a wireless versus a conventional wired communication system.

So, we will stop here and continue in subsequent modules.

Thank you very much.


Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 29
Properties of Convex Functions: Operations that preserve Convexity

Hello. Welcome to another module in this massive open online course. In this module, let us start looking at the various properties of convex functions, or the operations on convex functions that preserve convexity.

(Refer Slide Time: 00:27)

So, what we want to look at are properties, or you can also say operations, that preserve convexity.

(Refer Slide Time: 01:30)

The first one is simple. If F is convex, then alpha times F of x bar, provided alpha is greater than or equal to 0, is also convex. This is very simple: you have a convex function and you are scaling it by a factor alpha. So, this is F, and this is alpha times F. Naturally, once you scale a convex function, and it is important that you scale it by a non-negative number, it remains a convex function, and the same holds for translation. In fact, alpha F plus any constant is also convex: if you have alpha times F of x bar plus c, this is also a convex function. So, you can scale and translate a convex function and it remains a convex function.
(Refer Slide Time: 02:51)

Also, it is simple to see that if you have several functions F i of x bar, each of which is convex, for i equals 1 to n say, then their sum is convex. So, if you take several functions, each of which is convex, for instance here we are considering n functions F 1, F 2, up to F n, each of which is convex, their sum is also a convex function. In fact, the interesting thing is that this extends even to infinite sums and, more importantly, to integrals, which are nothing but continuous sums, similar to what we saw for convex sets. We will see an application of this later.

(Refer Slide Time: 04:39)

The next property is composition with affine functions. If F is convex, then the composition F of A x bar plus b bar, and remember A x bar plus b bar is an affine function, is also convex. Recall that a composition means F of g of x bar, that is, F composed with g; here you are taking the composition of F with the affine function A x bar plus b bar. So, the composition of a convex function with an affine function is also convex. For instance, a simple example: we have seen that norm of x bar is convex; therefore, if we consider the 2 norm, norm of A x bar plus b bar is also convex.
(Refer Slide Time: 06:36)

Another interesting property is the pointwise maximum, or you can simply call it the maximum, and this has a lot of interesting applications. If you take functions F 1, F 2, up to F m which are all convex, this implies that their maximum, taken pointwise, is also convex.

For instance, you have 2 convex functions: this one is convex and this one is convex. Now, if you take the maximum of these 2, you can see that the maximum is this, which is also a convex function. So, the maximum of 2, in fact of several, convex functions is also convex. For instance, you can take the maximum of a set of linear functions, that is, the maximum over 1 less than or equal to i less than or equal to m of a i bar transpose x bar plus b i, and this is known as a piecewise linear function.

(Refer Slide Time: 09:09)

So, if you take the maximum of several linear functions, what you get is a piecewise linear function.

(Refer Slide Time: 09:29)

For instance, here you take several linear functions and you take the maximum; you can see you get something like this, which is convex, and further it is a piecewise linear function.

(Refer Slide Time: 11:56)

So, you take several linear functions, which are basically hyperplanes, and you take their maximum; what you get is a piecewise linear function which is also convex, and that follows from the property that we have just seen. Let us now look at another concept, that is, composition.
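(A tiny numeric sketch of this pointwise maximum, not from the lecture: the function below evaluates the maximum of a few affine functions with arbitrary illustrative coefficients and verifies the midpoint convexity condition at random pairs of points.)

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(5)        # slopes a_i (illustrative values)
b = rng.standard_normal(5)        # offsets b_i

def f(x):
    # pointwise maximum of the affine functions a_i * x + b_i
    return np.max(a * x + b)

# midpoint check: f((x+y)/2) <= (f(x)+f(y))/2 must hold for a convex function
for _ in range(1000):
    x, y = rng.uniform(-5, 5, size=2)
    assert f(0.5 * (x + y)) <= 0.5 * (f(x) + f(y)) + 1e-12

print("midpoint convexity check passed for the piecewise linear maximum")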

(Refer Slide Time: 10:48)

So, F of x equals h of g of x is convex if g is convex and h is convex and non-decreasing, that is, h is either increasing or at least non-decreasing. In other words, if we are looking at h of g of x, and g is convex and h is both convex and non-decreasing, then F of x is convex. Alternatively, if g is concave and h is convex and non-increasing, that is, h is a decreasing function or at least does not increase, then F of x is also convex.

(Refer Slide Time: 13:41)

Let us look at a simple case, the composition with scalar functions. That is, let us say we have a function F of x equals h of g of x, the composition of h with g, and we want to determine when this composition is convex. The rule is fairly simple to see; we are going to use the derivative test to demonstrate it, assuming the functions are differentiable. You have F of x equals h of g of x. If you take the first order derivative, using the chain rule you can write F prime of x as h prime of g of x times g prime of x.

(Refer Slide Time: 14:53)

Further, for F double prime of x, the second derivative of F, that is, the derivative of F prime of x, you use the product rule. So, it is h double prime of g of x into g prime of x into g prime of x, that is, times g prime of x squared, plus h prime of g of x into the derivative of g prime of x, which is g double prime of x. Now, there are 4 components here.

First, you can see that g prime of x squared is always greater than or equal to 0. Next, since h is convex, h double prime is greater than or equal to 0; remember the second order derivative test, for a convex function the second order derivative is greater than or equal to 0. Now, this is interesting: h is non-decreasing, which implies that its first order derivative is non-negative, so h prime of g of x is greater than or equal to 0. And the last condition we have is that g of x is convex, which implies that the second order derivative g double prime of x is greater than or equal to 0.

(Refer Slide Time: 16:46)

So, you can see that all the quantities in the sum are non-negative, which implies that the second order derivative of F is greater than or equal to 0, and this implies that F of x is convex. So, if g of x is convex, and h is convex and non-decreasing, then the composition h of g of x is convex.

(Refer Slide Time: 17:36)
Now, similarly, you can show it easily for the other condition. Start again with F double prime of x: the second derivative is h double prime of g of x times g prime of x squared, plus h prime of g of x times g double prime of x. You can see that the quantity g prime of x squared is always greater than or equal to 0. In the second condition, h of x is again convex, and convex implies that the second derivative h double prime is greater than or equal to 0. Now, coming to g: g is concave, which implies that its second order derivative g double prime is less than or equal to 0; and h is non-increasing, which implies that h prime of g of x is less than or equal to 0. So, together, g double prime of x is less than or equal to 0 and h prime of g of x is less than or equal to 0.

(Refer Slide Time: 19:35)

So, since both of these are less than or equal to 0, their product, h prime of g of x into g double prime of x, is greater than or equal to 0. So now, both the quantities in the sum are non-negative: h double prime of g of x into g prime of x squared is non-negative, and h prime of g of x into g double prime of x is greater than or equal to 0. Their sum is therefore greater than or equal to 0, which means the second order derivative of F is greater than or equal to 0, and this implies that F of x is convex. So, we have seen these 2 conditions; these are the 2 conditions that ensure that the composition F of x, obtained as h of g of x, is convex. Let us look at a simple example.

(Refer Slide Time: 20:52)

We can look at a simple example to understand this. For instance, take F of x equals e raised to x square. This is h of g of x with g of x equals x square and h of x equals e raised to x. You can see that g of x equals x square is convex, and h of x is convex and, in fact, increasing. So, g of x is convex, and h of x is convex and non-decreasing, which implies that F of x, which is h of g of x, is convex.

And, you can just check this directly: if you take F prime of x, you get 2 x e raised to x square, and F double prime of x is 2 e raised to x square plus 4 x square e raised to x square, which is greater than or equal to 0, implying that F of x is convex. Similarly, one can derive results for concavity, that is, conditions under which F of x is concave, given that it is a composition h of g of x.
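(For readers following along in code, here is a short sympy check of this example; it is an illustration, not part of the lecture.)

import sympy as sp

x = sp.symbols('x', real=True)
g = x**2            # convex
h = sp.exp(x)       # convex and non-decreasing
F = h.subs(x, g)    # composition F(x) = e^(x^2)

F2 = sp.simplify(sp.diff(F, x, 2))
print(F2)           # equals 2*(2*x**2 + 1)*exp(x**2), which is >= 0 for all real x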
(Refer Slide Time: 22:52)

Similarly, one can derive conditions for the concavity of F of x, that is, of the composition h of g of x. So, we looked at several properties of convex functions. We will stop here and continue in the subsequent modules.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 30
Conjugate Function and Examples to prove Convexity of various Functions

Hello, welcome to another module in this massive open online course. So, we are looking at convex functions; we have completed the discussion of the basic aspects, including the illustration of several practical applications. Let us now focus on some examples to understand these concepts better.

(Refer Slide Time: 00:29)

So, what we want to start looking at are examples involving convex functions. In the first example, we want to look at a new concept, that of the conjugate. This is a very interesting concept for a reason I am going to describe shortly. The definition is as follows: given a function F of x bar, the conjugate function F conjugate of y bar is given as the maximum over the vectors x bar of y bar transpose x bar minus F of x bar. It has a rather interesting definition.
(Refer Slide Time: 01:54)

For instance, if you look at F conjugate of y bar, let me just write this again: it is the maximum over x bar of y bar transpose x bar minus F of x bar. Well, you can see that, for each value of x bar, y bar transpose x bar minus F of x bar is a linear function of y bar, which implies that it is convex in y bar. So, when you take the maximum, you are taking the maximum of a set of convex functions, one function for each x bar, each of which is convex in y bar; and from the property that we have discussed for convex functions, the maximum of a set of convex functions is convex. Therefore, the resulting function, the maximum, is convex.
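(The following sketch, not from the lecture, illustrates this numerically: it approximates the conjugate of the non-convex function F(x) = sin(x) by taking the maximum over a finite grid of x values, and then checks the midpoint convexity condition for the resulting F conjugate.)

import numpy as np

x = np.linspace(-10, 10, 4001)          # grid over which the max is taken
F = np.sin(x)                           # a non-convex function, chosen for illustration

def conjugate(y):
    # F*(y) = max_x ( y*x - F(x) ), approximated on the grid
    return np.max(y * x - F)

ys = np.linspace(-2, 2, 201)
Fstar = np.array([conjugate(y) for y in ys])

# midpoint convexity check on the sampled conjugate
mid = np.array([conjugate(0.5 * (a + b)) for a, b in zip(ys[:-1], ys[1:])])
print(np.all(mid <= 0.5 * (Fstar[:-1] + Fstar[1:]) + 1e-9))   # expected: True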

(Refer Slide Time: 03:10)

And so, this is the definition of the conjugate function, and the interesting aspect is that the conjugate function F conjugate of y bar is convex even when F of x bar, the original function, is not convex; that is the important point. So, corresponding to every function F of x bar, convex or non-convex, one can construct an associated convex function, namely its conjugate, and that is very easy to see.

(Refer Slide Time: 05:11)

Let us take an example of this conjugate function. Consider the quadratic function F of x bar equals half x bar transpose Q x bar, with Q symmetric positive semi definite, which we have also written as Q greater than or equal to 0. Now, to construct the conjugate function, we have F conjugate of y bar equals the maximum over x bar of y bar transpose x bar minus F of x bar, which is y bar transpose x bar minus half x bar transpose Q x bar.
(Refer Slide Time: 06:35)

And now, to maximize this, let us differentiate and set the derivative equal to 0. We are going to differentiate with respect to x bar, that is, consider the gradient of this quantity; let us call it g of x bar. Differentiating g with respect to the vector x bar: remember, the derivative of y bar transpose x bar, which is of the form c bar transpose x bar, is c bar, which here is y bar; and the derivative of half x bar transpose Q x bar is Q x bar, since the derivative of x bar transpose P x bar is twice P x bar. So, we have y bar minus Q x bar, and setting this equal to 0 implies y bar equals Q x bar, which implies x bar equals Q inverse y bar.

(Refer Slide Time: 07:57)

And the conjugate function F conjugate of y bar equals y bar transpose x bar minus half x bar transpose Q x bar, evaluated at x bar equals Q inverse y bar. Substituting, this is y bar transpose Q inverse y bar minus half times Q inverse y bar, transpose, times Q times Q inverse y bar, which is equal to y bar transpose Q inverse y bar minus half y bar transpose Q inverse Q Q inverse y bar.

(Refer Slide Time: 09:17)

So, Q inverse and Q cancel, the second term becomes minus half y bar transpose Q inverse y bar, and overall this is half y bar transpose Q inverse y bar, which is the conjugate function. And you can see this conjugate function is also quadratic and, in fact, convex, because Q is positive semi definite, which implies Q inverse is also positive semi definite; remember, that is an important property of positive semi definite matrices: if a matrix is positive semi definite, its inverse is also positive semi definite. And now you have a quadratic function with a positive semi definite matrix Q inverse, and therefore half y bar transpose Q inverse y bar is a convex function.

(Refer Slide Time: 10:49)

And so, therefore, this is convex, and it is the conjugate function of half x bar transpose Q x bar. So, one can readily derive the conjugate function. Let us now look at another example, a slightly more complicated one, for which we want to prove the convexity of the following function: F of x bar equals log of the summation from k equals 1 to n of e raised to x k; this is referred to as the log-sum of exponentials.

(Refer Slide Time: 12:30)

So, this is a very interesting function that arises in several applications, and what we want to show is that this function is convex. For this, what we will do is compute the Hessian; remember, to check convexity, whenever the function is differentiable, you can compute the Hessian and demonstrate that it is a positive semi definite matrix. That is what we want to do in this problem: compute the Hessian and demonstrate that it is indeed a positive semi definite matrix.

The process is slightly involved, so I urge you to be patient. First, for convenience, let us set e raised to x k equal to z k. So, I have F of x bar equals the log-sum of exponentials, which is simply log of z 1 plus z 2 plus, up to, z n, which I can write as log of 1 bar transpose z bar, where z bar is the vector with entries z 1, z 2, up to z n and 1 bar is the vector of all 1's. So, when you multiply 1 bar transpose z bar, you are simply taking the sum of all the elements of z bar.
(Refer Slide Time: 13:56) (Refer Slide Time: 15:36)

And now, if you differentiate, that is if you differentiate this with respect to each x i all So, you have if you look at the gradient; which is basically the derivative of F of x with
right, that is basically you are computing the gradient with respect to x of F of x bar or respect to each component of derivative of F of x with respect to each component that is
rather the ith element, if you are looking at the gradient, you are looking at the ith equal to well z 1 z 2 up to z n 1 divided by 1 bar transpose z bar, which is basically you
element of the gradient ok. Now, that would be d by d x or rather d by d x derivative with can see this is z bar transpose z bar over 1 bar transpose z bar, that is the gradient of z bar
respect to i of your log 1 bar transpose z bar. The derivative of log is log x is 1 over; so 1 over 1 bar transpose z bar that is the gradient.
over 1 bar transpose, times the derivative of 1 bar transpose z bar with respect to each x
(Refer Slide Time: 16:35)
i.

Now in z bar, you can see only z i which is e raise to x i depends on x i. So, this is 1 over
1 bar transpose z bar into derivative of z i with respect to x i, but recall z i is e raise to x i
so, e raise to x i.

So, derivative with respect to x i is e raise to x i only which is basically z I, so this


reduces to 1 over 1 bar transpose z bar times z i ok.

Now let us compute the Hessian, Hessian means the second order derivative. So, now,
for the Hessian first let us look at the derivative with respect to each component x j; that
is we have compute the second order components of the form derivative with partial with (Refer Slide Time: 19:25)
respect to x i, partial with respect to x j.

This is nothing but partial with respect to x j of partial of F of x bar with respect to x I,
which is partial with respect to x j of partial with respect to x j of well, what is this
quantity? This is we note the partial with respect to x i that is your z i divided by 1 bar
transpose z bar.

(Refer Slide Time: 17:58)

Now, if you differentiate the numerator we already see in derivative of z i with respect
x i is z i itself. So, that will be z i divided by well that will be z i divided by 1 bar
transpose z bar minus, now derivative of the numerator. So, 1 bar derivative of the
denominator that is minus 1 over 1 bar transpose z bar square z i times the derivative of z
i with respect to x i or 1 bar transpose z bar with respect to x i which is again z i.

So, this is you have a z i square in the numerator. So you will have z i divided by 1 bar
transposes z bar minus z i square divided by 1 bar transfer z bar whole square, ok.
Well partial with respect to x j of z i is 0; so, this is simply you can write this as minus 1
over 1 bar transpose z bar square into partial of 1 bar transpose z bar with respect x j that And now, if I will therefore, so only, so these terms are only present in terms of the form
leads on this x j z j, which depends on x j. dou square F x bar dou x i square all right. They are not present if you can look at it they
are not present, when you are in the second order terms of the form partial with respect
So, that is when you differentiate z j that is remember z j equals e raise to x j, when you
to x i x j ok.
differentiate with respect to x j you are left with e raise to x j which is again z j. So, this
is this quantity component is minus z i z j by 1 over transpose z bar square ok. (Refer Slide Time: 21:03)

Now on the other hand if you compute the second order derivative with respect to x i
itself, that is terms of the form partial with second order partial with respect to x i that is
dou square F by dou x i square that would be second order partial with respect to x i of
the first order partial with respect to x I, which is dou by dou i z i divided by 1 bar
transpose z bar.
So, z 1 z 2 up to z n minus terms of the forms z i z j, which can be represented as z bar z
bar transpose divided by 1 bar transpose z bar square. So, this is the hessian, Hessian of
F of x bar; this is the Hessian of the F x bar.

Now we have to demonstrate to show that this is a positive semi; so I can also write this
as just one last example; this is a diagonal metrics; diagonal metrics that contains z bar,
the vector z bar on its diagonal. I can simply write this as diagonal z bar. So, this is
diagonal z bar divided by 1 bar transpose z bar minus z bar z bar transpose divided by 1
bar transpose z bar whole square all right.
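(As a numerical cross-check, and not part of the lecture, one can evaluate this expression, diagonal of z bar over 1 bar transpose z bar minus z bar z bar transpose over the square of 1 bar transpose z bar, at a random point and confirm that its smallest eigenvalue is non-negative up to round-off.)

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)                     # an arbitrary test point
z = np.exp(x)
s = z.sum()                                    # 1_bar^T z_bar

H = np.diag(z) / s - np.outer(z, z) / s**2     # Hessian of log-sum-exp at x
print(np.min(np.linalg.eigvalsh(H)))           # smallest eigenvalue; expected >= 0 up to round-off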

(Refer Slide Time: 23:24)

So, when you write you can readily see that if you look at the Hessian that will be well
what is the hessian? Hessian is partial of F with respect to x 1 square partial of F with
respect to x 1 x 2 partial of x with respect to x 2, partial with respect to x 1 and you have
the partial with respect to dou square F by dou x 2 square and so on.

And if you put all these terms together what you will see is you will see that this is given
as 1 over 1 bar transpose z bar, times the diagonal terms, remember the diagonal terms of
the forms z 1 z 2 which are only there for the partial, that is terms of the from dou square
F by dou x i square.

(Refer Slide Time: 22:13)

This is a Hessian and we have to show to show that this is positive semi definite, to show
this is indeed PSD, for convexity of F of x bar.

That is we want to show that is this is indeed a positive semi definite metrics to verify
the complexity of this function F of x all right, which we will do in the subsequent
module; so, we will stop here.

Thank you very much.


Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:30)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 31
Example Problems: Operations preserving Convexity (log-sum-exp, average) and
Quasi-Convexity

Hello. Welcome to another module in this massive open online course. So, we are
looking at example problems for convex function, all right let us continue our discussion.

(Refer Slide Time: 00:23)

So, by default, if you are not mentioning anything, then you can assume that it is log to
the base e. We have computed the hessian of this function and that has an interesting
structure. So, if you look at the ,hessian, we have seen that this is diagonal of Z bar
divided by 1 bar transpose Z bar minus Z bar Z bar transpose divided by 1 bar transpose
Z bar whole square.

And, now what we want to do is, we want to show that this is positive semi definite that
the hessian is positive semi definite ok. Remember, we have already defined the symbol
to which basically indicates that this matrix the hessian is a positive semi definite matrix.
Now, to show that we employed the straightforward approach, that is consider any vector
So, what we are looking at is examples or rather example problems and in particular, we
V bar and multiply this V bar transpose the hessian times V bar.
are considering this interesting function which is the log sum exponential; that is, you
have F of x bar is log of summation k equals 1 to n e raise to x k.

And if we denote this by Z k; that is e raise to x k by Z k, then you can write this as log 1
bar transpose Z bar. Just to be just to clarify this, this is the natural logarithm. You also
you can write this as ln ok, log to the base this is log to the base e that we are considering
alright.
(Refer Slide Time: 03:03) (Refer Slide Time: 04:39)

And, this you can see this is therefore, substituting the hessian, it is diagonal of Z bar by And now if you look at this quantity V bar transpose diagonal Z bar into V bar you can
1 bar transpose Z bar square minus Z bar Z bar transpose divided by 1 bar transpose Z clearly see or you can it is very easy to see that this will be nothing but the summation
bar square times V bar which is well, this is V bar transpose diagonals Z bar into V bar over k V k square times Z k divided by 1 bar transpose Z bar minus V bar transpose Z
divided by 1 bar transpose Z bar, I am sorry. There is no square here. 1 bar transpose Z bar square which is summation over k V k V k Z k whole square divided by 1 bar
bar correct, 1 bar transpose Z bar minus V bar transpose Z bar times well Z bar transpose transpose Z bar whole square.
V bar. But, you can think of this as V bar transpose Z bar transpose and this is a scalar
And, if you simplify this therefore, now further what you have is in the denominator you
quantity, right.
will have 1 bar transpose Z bar times summation k, k square Z k into 1 bar transpose Z
So, V bar transpose Z bar is the same thing as Z bar transpose V bar or in other words, Z bar which is nothing but summation k over Z k minus summation k V k Z k whole square
bar V bar transpose Z bar transpose is the same thing as V bar transpose Z bar, all right. and this is the quantity.
So, it is a scalar quantity V bar transpose Z bar times itself or rather it is basically V bar
And now, to demonstrate that this is positive semi definite we have to demonstrate that V
transpose Z bar square, ok. So, this quantity is V bar transpose Z bar square.
bar transpose the hessian times V bar is greater than equal to 0 for any vector V bar
which in term implies that the numerator of this expression has to be greater than equal
to 0 and that is something that we are going to show in a straight forward fashion now.
So, we want to show that this numerator quantity to show where the numerator is greater
than equal to 0.
(Refer Slide Time: 06:57) (Refer Slide Time: 09:13)

Now, for that what we are going to do is, we are going to define 3 vectors we have a bar Which means, now you can simplify this as summation k V k square Z k times
equals V 1 square root of Z 1 V 2 square root of Z 2, so on V n square root of Z n. We summation k Z k minus summation k V k Z k square this is greater than equal to 0
want to define another vector b bar which is square root of Z 1 square root of Z 2, so on implies the numerator is greater than equal to 0 or numerator of one let us call this
square root of Z n. These are 2 n dimensional vectors or you can also say these are n expression this is what we have said to prove. So, let us call this expression as one
cross 1 real vectors. And now, employ the Cauchy Schwarz inequality or the inequality implies numerator of 1 is greater than equal to 0 implies V bar transpose the hessian
for the inner product of vectors which says that a bar transpose b bar whole square that is times V bar greater than equal to 0 implies the hessian is positive semi definite, alright.
less than or equal to norm a bar square into norm b bar square, all right. This is the chain of arguments.

And, now if you look at norm a bar transpose b bar whole square, that is nothing but So, implies the hessian; the hessian is positive semi definite implies F of x bar equals
summation k V 1 square root of Z 1 into square root of Z 1, that is nothing but V 1 Z 1 convex function ok. So, F of x bar which is the log some exponential is convex because
similarly V 2 Z 2 and so on. So, this is summation over k V k Z k whole square this is we have demonstrated that there is hessian first we have derived the hessian and again in
less than or equal to norm a bar square which is V 1 square Z 1 plus V 2 square Z 2 plus turn demonstrated the hessian is positive semi definite.
V n square Z n.
So, the proof is a little lengthy and tedious, but it has some very interesting aspects that
So, this will be summation k V k square Z k times norm b square which is Z 1 plus Z 2 can be used in general to demonstrate the convexity of functions especially the especially
plus Z n which is summation k Z k. So, this quantity is less than quantity on the right is the convexity of functions of vectors and sometimes these proofs indeed tend to be
less than equal to quantity on the left. slightly involved.
(Refer Slide Time: 11:16) (Refer Slide Time: 12:29)

Let us proceed to the next example and in fact, this is several very interesting We want to demonstrate that if F of x is convex, we want to show that 1 over x integral 0
applications this is used for in if you look at the log sum exponential this is the original to x F of t d t this is convex. We want to show that this is convex or that we will use a
function we can this has a lot of. In fact, this can be used to logistic regression that is to simple procedure F of x equals convex implies the fine p composition implies F of s x
fit a curve to a given set of points, all right. And this has applications in machine learning equals convex for each s.
and classification as we are going to see later in this course to classify to classify a set of
Now, we will use the property of the sum that is a function is convex functions of convex
data points divided them into 2 sets; one which gives corresponds to a response of 1,
right, several if you have several functions which is convex their sum is convex. In fact,
other corresponds to response of 0.
here we are going to use a continuous sum. So, this implies that if you take the integral, I
So, this can be used for machine learning or classification of data sets ok, all right. On can treat this as one function for each s ok. This is one function for each value of s.
that note, let us move to the next example which is the following.
So, this implies that integral 0 to 1 F of s x dx d s. This is convex. Why is this convex?
Because, this is a continuous sum one function for each s continuous sum over s for s
lying in the interval 0 to s. So,,this is basically a continuous. So, instead of having a
discrete sum you have a continuous sum, all right.
(Refer Slide Time: 14:45) So, this demonstrates that for if F is a convex function 1 over x integral 0 to x F of t d t is
also a convex function. Let us go to the next example and in this, we want to look at an
interesting concept and this is the concept of quasi convexity. And quasi concavity, we
have seen the definition of convexity similarly one can define a set of functions which
are quasi convex.

Quasi convex basically means, so, F is quasi convex. If we define the set S of t equals the
set of all x or x bar such that F of x bar is less than or equal to t ok. This is called a
sublevel set with respect to t. If the sublevel set with respect to t, if is convex for all t,
then F of x bar is a convex function that is, if you look at the sublevel sets of this
function, what is the sublevel set.

That is, if you look at any parameter value t, consider the set of all points x bar such that
F of x takes values less than or equal to t. And, if these sublevel sets with respect to each
The integral is nothing but a continuous sum and this implies that. Now, in this you set s p are convex the function is said to be quasi convex. This is important because, there are
x equals t which implies that x times d s equals d t. So, this implies now substituting this. several functions which are not necessarily convex. But, qualify as quasi convex and can
So, what you have is basically integral 0 to upper limit becomes s times x which is which also and also have a lot of utility in practical applications for instance, let us take a
is t becomes equal to 1 times x. So, this is 0 to x F of s x is t ds is d t by x is this implies simple example.
that this is convex which in turn implies that 1 over x because x is a constant integral is
with respect to t 0 to x F of t d t. This is convex, alright. (Refer Slide Time: 18:36)
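(A concrete instance of this property, offered as a sketch rather than lecture material: take F(t) = t squared, which is convex; then 1 over x times the integral from 0 to x of t squared dt equals x squared over 3, which is again convex, as the short sympy check below confirms.)

import sympy as sp

x, t = sp.symbols('x t', positive=True)
f = t**2                                    # a convex choice of F, for illustration

g = sp.integrate(f, (t, 0, x)) / x          # g(x) = (1/x) * integral_0^x f(t) dt
print(sp.simplify(g))                       # x**2/3
print(sp.simplify(sp.diff(g, x, 2)))        # 2/3 >= 0, so g is convex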

(Refer Slide Time: 16:09)

Now, if you look at square root of magnitude of x, you can clearly see this is not convex. Because, if you take 2 points and join them by the chord, part of the function lies above the chord and part of it lies below; the function straddles the chord. And therefore, this is neither convex nor concave.

(Refer Slide Time: 20:40)

Remember, we said that if the chord lies below the function it is concave, and if the chord lies above the function it is convex. So, this is neither convex nor concave. However, if you look at the sublevel sets, that is, for any t you take the set of all points such that F of x is less than or equal to t, you can see that the sublevel set is convex. For instance, S of t equals the set of all x such that square root of magnitude of x is less than or equal to t, which is the set of all x such that magnitude of x is less than or equal to t square, which is the set of all x with minus t square less than or equal to x less than or equal to t square. Therefore, this S of t is simply the closed interval from minus t square to t square, which is a convex set; remember, for a convex set, if you take any 2 points in the set and join them by a line segment, it should lie in the set. So, the interval from minus t square to t square, in fact any closed interval, is a convex set, and therefore S of t is convex. The sublevel sets are convex, and hence square root of magnitude of x is quasi-convex; quasi means not exactly, but something that can pass for, alright.

(Refer Slide Time: 22:20)

So, something that has a quasi (Refer Time: 22:28) property. This is a quasi-convex function: it is not, strictly speaking, a convex function, but it has some properties similar to those of a convex function, namely that the sublevel sets are convex. Let us look at another example. For instance, look at a bar transpose x bar plus b, divided by c bar transpose x bar plus d.

Now, this is not a convex function, but consider the sublevel set, where this ratio is less than or equal to t. This implies that a bar transpose x bar plus b is less than or equal to t times c bar transpose x bar plus t times d, which implies that a bar minus t c bar, transpose, times x bar, plus b minus t d, is less than or equal to 0. Now, if you look at this, it is of the form a tilde transpose x bar plus b tilde less than or equal to 0; in fact, a tilde depends on the parameter t. And this level set is nothing but a half space. We know that a half space is convex. So, all the sublevel sets are convex, and therefore this is a quasi-convex function.
therefore, this implies S of t is convex. And therefore, the sublevel sets are convex S of t
(Refer Slide Time: 24:29)

So, this implies that S of t is convex, which implies that a bar transpose x bar plus b, divided by c bar transpose x bar plus d, is a quasi-convex function. Similarly, one can come up with several other examples of quasi-convex functions. So, we will stop here and continue in the subsequent module.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 32
Example Problems: verify Convexity, Quasi-Convexity and Quasi-Concavity of functions

Hello. Welcome to another module in this massive open online course. So, we are looking at example problems in convex functions and convexity. Let us continue our discussion.

(Refer Slide Time: 00:23)

So, we are looking at example problems in convex functions and well, let us look at this
is problem number 5. F of x bar equals x 1 x 2 for the region where both x 1 x 2 are
greater than or equal to 0 ah. We want to ask the question is F of x bar of course, x bar
we can think of this as the 2 dimensional vector. This is a function of 2 variables x 1 x 2.
We want to ask is this convex or concave? Convex concave or neither.

Remember it, function need not need not be only either convex or concave, but can be
neither convex nor concave, all right. So, that is an important point to keep in mind.
(Refer Slide Time: 01:44) (Refer Slide Time: 03:23)

Let us start by looking at once again, remember we have a simple test for convexity, that So, this matrix here, the hessian is a symmetric matrix, but this is not positive semi
is the hessian which in this case is simply dou square F by dou x 1 square dou square F definite, this is not a positive semi definite. In fact, if you look at the determinant of this
by dou x 2 square. These are the 2 diagonal elements and the off diagonal elements are is 0 minus 1 equals minus 1. This is negative, ok. So, the determinant is negative.
dou square F by dou x 1 partial with respect to x 1 x 2 partial with respect to x 1 and x 2.
Remember, the determinant of a positive semi definite matrix has to be a positive
And this is equal to if you look at this partial with respect to the first partial with respect
quantity. Because, the determinant is a product of the eigenvalues, all of these
to x 1 is x 2 partial with respect to x x 1 of x 2.
eigenvalues are either are non-negative. Therefore, the determinant has to be greater than
So, you can evaluate this as follows do partial second order partial with respect to x 1 is or equal to 0 for a positive semi definite matrix, all right.
partial with respect to x 1 of partial with respect to x 1, but partial with respect x 1 is x 2.
(Refer Slide Time: 04:24)
So, this is partial with respect to x 1 of x 2 which is 0. So, you can see this is 0 partial
with respect to x 1 x 2 you can see this is one partial with respect to x 1 x 2 x 1 the
second order partial with respect to x 2 is also 0. Now, first thing you can see is this is a
symmetric matrix correct;
In fact, if you compute the eigenvalues, that is find the characteristic polynomial minus (Refer Slide Time: 06:39)
lambda times I and take the determinant. This is equal to determinant of 0 1 1 0 minus
lambda times 1 0 0 1 and you take the determinant of this that is equal to minus lambda
minus lambda 1 1 minus lambda. And if you take the determinant of this, that is basically
lambda square minus 1 and lambda square minus 1 equals 0 implies basically lambda
equals plus or minus 1. And you can see the eigenvalues are both positive and negative.
It has a positive eigenvalue and a negative eigenvalue and negative.

So, this implies that the matrix is not positive semi definite; remember, a positive semi definite matrix has only non-negative eigenvalues, that is, eigenvalues greater than or equal to 0. Here you have a negative eigenvalue, minus 1, which implies the Hessian is not positive semi definite, and therefore F of x bar is not convex. Now, what about concavity? For that, remember F of x bar is concave if minus F of x bar is convex. So, we consider F tilde of x bar equals minus F of x bar equals minus x 1 x 2.

(Refer Slide Time: 05:27)

Now, consider the Hessian of F tilde. The Hessian of F tilde is simply minus the Hessian of F, so it is the matrix with entries 0, minus 1, minus 1, 0. Again, check the eigenvalues: the determinant of the Hessian of F tilde minus lambda I is the determinant of the matrix with entries minus lambda, minus 1, minus 1, minus lambda, which is once again lambda square minus 1; and lambda square minus 1 equals 0 implies lambda equals plus or minus 1.

Again, the Hessian of F tilde is not positive semi definite, which means F tilde is not convex, and therefore F is not concave. Remember, F is concave only if F tilde, that is minus F, is convex. So, F tilde of x bar equals minus F of x bar is not convex, which implies that F of x bar is not concave.
(Refer Slide Time: 04:56)

So, x 1 x 2 is very interesting: it is neither convex nor concave. That shows that if a function is not convex, it does not automatically mean that it is concave; there can be functions which are neither convex nor concave, and that is easy to see, because a function can look something like this, neither convex nor concave. So, these kinds of functions are neither convex nor concave; that is something to keep in mind.

(Refer Slide Time: 09:14)

How about quasi convexity? For quasi convexity, remember, you have to look at the sublevel sets S of t equals the set of x bar such that x 1 x 2 is less than or equal to t. Now, if you plot the set x 1 x 2 less than or equal to t, that will be this region: this is the curve x 1 x 2 equals t, and this is the area where x 1 x 2 is less than or equal to t. And you can see that this set is not convex.

(Refer Slide Time: 09:34)

On the other hand, if you look at the super level set, that is, x 1 x 2 greater than or equal to t, this is a convex set, because if you take any 2 points and join them by a line segment, it lies in the set. So, the sublevel sets are not convex, but the super level sets are convex; remember, the sublevel set is the set where x 1 x 2 is less than or equal to t and the super level set is the set where x 1 x 2 is greater than or equal to t, for any value of the parameter t. Since the super level sets are convex, this is a quasi-concave function.
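(A quick numeric cross-check of this example, not from the lecture: the Hessian of x 1 x 2 and the Hessian of its negative both have one positive and one negative eigenvalue, so neither the function nor its negative is convex.)

import numpy as np

H = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # Hessian of f(x1, x2) = x1*x2

print(np.linalg.eigvalsh(H))        # [-1.  1.]  -> f is not convex
print(np.linalg.eigvalsh(-H))       # [-1.  1.]  -> -f is not convex, so f is not concave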
(Refer Slide Time: 11:16) (Refer Slide Time: 12:52)

So, x 1 x 2 neither convex nor concave; it is a quasi-concave option ok. Similarly, let us And ah, now the hessian will be take the first element differentiated with respect to x 1.
now move on to the next example. Let us now consider the reciprocal of the previous 1 F So, that will be 2 over x 1 cube x 2 and now differentiate the first element with respect to
of x bar equals 1 over x 1 x 2. Again, we want to ask the same question is this function 1 x 2. So, this will be 1 over x 1 square x 2 square. This will be 1 over symmetric.
over x 1 x 2 is it convex or is it concave and you can once again find the hessian of this.
So, this element will also be 1 over x 1 square of x 2 square the 2 cross 2 element will be
The hessian of this will be well, let us first start with the gradient x bar is again the 2 well that will be 2 over x 1 x 2 q ok. And now, we have to see what s is this matrix
dimensional vector. This is a function of 2 variables x 1 x 2. So, the gradient with respect positive semi definite and you can simplify this by bringing x 1 cube x 2 cube outside.
to x 1 x 2 will be well this will be derivative with respect to x 1 that will be minus 1 over This will be 1 over x 1 cube x 2 cube times well times. This is twice x 2 square twice x 1
x 1 square x 2 derivative with respect to x 2 will be minus 1 over x 1 x 2 square. square and this will be x 1 x 2 x 1 x 2.

Now, we want to ask the question is this matrix positive semi definite. This is the hessian
and we want to ask the question is this matrix positive semi definite. Now, there are
many ways to show this. If the matrix is positive semi definite, one of the methods is of
course, to compute the eigenvalues which might be slightly tedious what we are going to
do here is, we are going to decompose this in the form of factors which are a times a
transpose, alright.
(Refer Slide Time: 15:00) (Refer Slide Time: 16:35)

Now, remember each such factor is positive semi definite. We said that if a matrix can be expressed as A A transpose, it is positive semi definite, and the sum of such positive semi definite matrices is positive semi definite. So, we are going to factorize this into a sum of matrices which can be expressed in this form.

And, you can clearly see it is not very difficult. You can first write this as 1 over x 1 cube x 2 cube times the matrix with entries x 2 square, x 1 x 2, x 1 x 2, x 1 square, plus 1 over x 1 cube x 2 cube times the diagonal matrix with entries x 2 square and x 1 square, and now I can decompose this into factors.

Now, you can see the first term will be 1 over x 1 cube x 2 cube times the vector x 2, x 1 times the transpose of x 2, x 1, that is, it is of the form a bar into a bar transpose. And the next one, you can easily see, is 1 over x 1 cube x 2 cube times the diagonal matrix with entries x 2, x 1 times its transpose, and this will be a matrix of the form B B transpose. So, we have decomposed it into the sum of matrices which are factorized as A A transpose.

So, each of these component matrices is positive semi definite. Therefore, the sum is positive semi definite. You can easily see that if 2 compatible matrices are positive semi definite and you sum them, you get another positive semi definite matrix, ok. So, each term here is positive semi definite.
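As a quick cross-check of this argument, here is a minimal numerical sketch in Python, assuming numpy is available; the sampling range is an arbitrary illustrative choice and the tolerance only accounts for floating-point round-off:

import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    x1, x2 = rng.uniform(0.1, 5.0, size=2)
    H = np.array([[2/(x1**3 * x2), 1/(x1**2 * x2**2)],
                  [1/(x1**2 * x2**2), 2/(x1 * x2**3)]])      # hessian of 1/(x1*x2)
    a = np.array([x2, x1])
    B = np.diag([x2, x1])
    H_fact = (np.outer(a, a) + B @ B.T) / (x1**3 * x2**3)    # a a^T + B B^T form
    assert np.allclose(H, H_fact)
    assert np.linalg.eigvalsh(H).min() >= -1e-8 * np.linalg.norm(H)   # PSD up to round-off
print("hessian matches the a a^T + B B^T decomposition and is PSD at all sampled points")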
(Refer Slide Time: 17:38) (Refer Slide Time: 19:49)

And therefore, this implies that your matrix is the sum of 2 positive semi definite matrices. Remember, this is only for x 1, x 2 greater than 0, which means the factor 1 over x 1 cube x 2 cube is also greater than 0. Therefore, this is a sum of positive semi definite matrices weighted by positive factors.

So, the resulting matrix is positive semi definite, which means the hessian is positive semi definite, ok. The hessian is positive semi definite, which implies that F of x bar is convex, and it can also be seen that since it is convex, it is also quasi convex, because any convex function is also quasi convex, ok.

So, this implies that F of x bar, which is 1 over x 1 x 2, is also quasi convex, alright. So, 1 over x 1 x 2 is basically convex and hence, it is also quasi convex.

Let us look at another example, and this one you can treat as a practical application. In fact, it is very interesting: we are going to look at the entropy function. The entropy of a source H can be defined as minus summation i equals 1 to n of x i log x i, with the natural logarithm, where x i equals the probability of the i'th symbol.

This is the probability of the i'th symbol, and this entropy denotes the information content of the source. That is, given a source with n symbols which have probabilities x 1, x 2 up to x n, what is the average information content per symbol of this source? That is given by the entropy, that is, minus summation x i log x i.

And therefore, the higher the entropy, the higher the information content of the source. And therefore, we would like to maximize this entropy quantity, and that has very important applications in information theory.
(Refer Slide Time: 22:02) (Refer Slide Time: 23:16)

So, we would like to maximize the entropy or the information content, ok. So, this has applications in information theory; it is a very important quantity in information theory and, by extension, also in wireless communication and signal processing. In fact, it arises in many fields, machine learning also, since information theory has widespread applications in various fields, alright. So, therefore, this quantity entropy is very important. Now, what we want to show is that this entropy is a concave function, and that is relatively easy to show.

We start by considering F of x equals x log x, with the natural logarithm, and we demonstrate that this is convex. Take the first derivative; this is very simple using the product rule: the derivative of x, which is 1, times log x, plus x into the derivative of log x, which is 1 over x; that gives log x plus 1. And now, if you look at the second derivative, that will be the derivative of log x, which is 1 over x, plus 0, which is 1 over x; that is greater than or equal to 0 for x greater than 0.

So, this implies the second derivative of x log x is greater than or equal to 0, which implies x log x is convex, which implies that minus x log x is concave.
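A quick numerical spot-check of this result, sketched in Python assuming numpy is available (illustrative values only): a midpoint convexity test for x log x on x greater than 0, and the corresponding concavity test for the entropy built from such terms:

import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x * np.log(x)

for _ in range(1000):
    a, b = rng.uniform(0.01, 10.0, size=2)
    lam = rng.uniform()
    assert f(lam * a + (1 - lam) * b) <= lam * f(a) + (1 - lam) * f(b) + 1e-12

def entropy(p):
    return -np.sum(p * np.log(p))    # -sum_i p_i log p_i

for _ in range(1000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    lam = rng.uniform()
    assert entropy(lam * p + (1 - lam) * q) >= lam * entropy(p) + (1 - lam) * entropy(q) - 1e-12
print("x log x convexity and entropy concavity checks passed")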
(Refer Slide Time: 24:42)

And therefore, now, if you look at the entropy, you can see it is the sum of concave functions, which implies it is concave; the entropy is the sum of concave functions and therefore, it is in turn concave. And therefore, one can maximize the entropy, thus maximizing the average information content of the given source, alright. So, we will stop here and continue in the subsequent modules.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 33
Example Problems: Perspective function, Product of Convex functions, Pointwise
Maximum is Convex

Hello, welcome to another module in this massive open online course. So, we are
looking at Example Problems for Convex Functions. Let us continue our discussion,
alright.

(Refer Slide Time: 00:23)

We are looking at example problems, and let us look at problem number 8. We want to consider the function norm x bar square by t, which you can also write as (remember, the norm of a vector squared is the vector transpose times itself) x bar transpose x bar divided by t; you can think of this as a function of an n plus 1 dimensional vector.

So, we have if you define the vector x tilde equals well the vector x bar t augmented with
t which is basically you think of this as x 1, x 2, up to x n and then one additional
element. So, this is basically your n plus 1 dimensional vector ok. This is an n plus 1
dimensional vector we have this function F of x tilde ok. So, this is x tilde is the vector x
bar that is augmented with t and we can consider further that t is greater than 0, that is t is
a positive quantity.
Now, we want to show that this function is indeed a convex function. We will follow the And, well, if you look at this well you can simplify this x bar transpose x bar so, F of x
approach that we have shown before that is the test for convexity which is to evaluate the tilde norm x bar square by t which is also x 1 square plus x 2 square plus x n square
Hessian and demonstrate that it is indeed a positive semi definite matrix. divided by t you can see the partial with respect to x 1 is simply 2x 1. 2x 1 divided by t
partial with respect to x 2 is 2x 2 divided by t partial with respect to x n is 2x n divided
(Refer Slide Time: 02:19)
by t; and the partial with respect to t is norm of x bar square times the derivative of 1 over t with respect to t, that is, minus 1 over t square, ok.

So, the partial with respect to t is minus norm x bar square divided by t square. So, this is
your partial with respect to partial derivative with respect to with respect to t. And, now,
we have to compute the Hessian for this, ok. This is basically the gradient we have to
compute the Hessian, ok.

(Refer Slide Time: 04:37)

So, what we want to do is we want to evaluate the Hessian of this. First let us start with
the gradient; gradient with respect to x tilde of F of x tilde which will contain first all the
partials with respect to all the x’s followed of course, by the partial with respect to t.

(Refer Slide Time: 03:05)

And, that is also it has an interesting structure it is fairly straightforward you just have to
pay attention to each element. Now, first well the 1 cross 1 element is the partial with
respect to x 1 that is a second order partial with respect to x 1 square. So, you take 2x 1
over t divide it with differentiate with respect to x 1. So, that gives you 2 over t. Now,
you take 2x 1 over t differentiate with respect to x 2 that gives you 0 differentiate with
respect to x 3 0. In fact, differentiate it with respect to. Now, the last element will be the
derivative that is 1 comma n plus 1-th element will be the derivative with respect to t and
that will be minus 2x 1 over t square.
Similarly, you have all these elements are 0 and the last element will be once again And, now you can divide this into the sum of several matrices each matrix with respect
minus 2x 1 over t square. Now, look at the 2 cross 2 element that is the derivative partial to one of the x i’s. So, the first matrix will be with respect to x 1. So, you can take this
second order partial with respect to x 2. So, we take 2x 2 or t differentiate with respect to and this has the particular structure. So, this will be 2 over t last element will be minus
x 2 that is 2 over t of course, rest of the elements will be 0 and this last element will be 2x 1 or t square 0 0 minus 2x 1 over t square and the last element out of the norm x bar
once again minus 2x 2 or t square this element here will be minus 2x 2 or t square you square you simply take x 1 square. So, this will be 2x 1 square divided by t cube. So, this
will have so on and so forth, so on and so forth and here the last element here will be you can think of this as a corresponding to with respect to x 1 or corresponding to the x
partial with respect to second order partial with respect to t. 1.

So, this will be minus norm x bar square derivative over 1 over t square that is 2 over or Similarly, corresponding to x 2 you will have a matrix which is of the form 2 cross 2
minus 2 over t cube. So, that will be 2 over t q. This is basically the Hessian. So, you can element is 2 over t the last element is minus 2x 2 the last element in the second row is
see this is basically you can see each element. So, this is basically the 1 cross 1 element minus 2x 2 over t square this is 0 n plus 1 comma 2. The last the second element in the
this is basically the 2 cross 2 element this is your 1 I am sorry I should say 1 comma 1 last row that is again minus 2x 2 over t square and again you take n plus 1 comma n plus
element 1 comma 1 element. This is the 2 comma 2 element that is the diagonal element 1-th element you take the component corresponding to x 2. So, this will be 2x 2 square
this is your 1 comma n plus 1 element and so on, and this last element here this is your n over t cube plus so on. So, total of n you have a total of n such matrices, ok.
plus 1 comma n plus 1 element, ok.
(Refer Slide Time: 10:03)
So, this is a Hessian it has an interesting structure. So, the first n diagonal elements are
all 2 over t the last n plus 1 comma n plus 1-th element is 2 norm x bar square divided by
t cube. And, along the last row and the last column you have the elements which are of
the form entries of the form minus 2 x i over t square. So, that is the structure of the
Hessian.

(Refer Slide Time: 07:59)

Well, for the i-th matrix you will have if you look at the structure of the i-th matrix ok,
the i-th matrix corresponding to x i we will have 2 over t in the i comma i-th element this
will be minus 2x i over t square. This is n plus 1 comma i similarly here you will have
minus 2x i over t square and n plus 1 comma n plus 1 element that will be 2x i square
divided by t cube and the rest of all ok. So, this is the matrix you can think of this as the
i-th matrix this is basically your i comma n plus one-th element this is your i comma i-th
element, this is your n plus 1 comma i-th element and this is your n plus 1 comma n plus (Refer Slide Time: 13:25)
one-th element, ok. So, this is the structure you can decompose it into such matrices.

(Refer Slide Time: 11:35)

And, now you can see I can decompose this as summation i equals 1 to n of 2 over t cube times a vector which is all zeros except for a t in the i-th position and minus x i in the n plus one-th position, times the transpose of that same vector, which again has zeros everywhere except a t in the i-th position and minus x i in the n plus one-th position.
And, now so, I can write it as the summation. So, I can write this or rather let us now I
can write this as the summation over i of such matrices i equals 1 to n you have minus 2x So, this t is in the i-th a position this minus x i is in the n plus one-th n plus one-th
i or t square minus 2x i over t square and 2x i square over t cube, and the rest all the rest position and now, you can see I am writing this basically decomposing this as a i bar a i
of the entries are zeros. bar transpose, ok. So, a i bar is this vector a i bar is basically your vector which has t in
the i-th position and minus x i in the in the in the n plus 1-th position.
And, now, if I take 1 over that is 2 over t cube as common so, I can write this as i equals
1 to n take 2 over t cube as common this will become again very simple. So, if you take 2 (Refer Slide Time: 15:03)
or t cube common this will be t square this will be minus 2x i times t minus 2x i times t
and this will be well, this will simply be I am sorry the 2 will go because we have taken
the 2 outside. So, minus x i t and this will be simply x i square and rest of the entries are
0, rest of the entries are 0.
And therefore, I can write this as summation i equals 1 to t 2 over t cube a bar a bar show that h of x equals F of x into g of x is also a convex function, and that is easy to
transpose. Now, each of this is a positive semi definite matrix. Remember, whenever a show.
matrix can be decomposed as a transpose correct, it is a positive semi definite matrix that
(Refer Slide Time: 18:17)
is what we have seen you are weighing it by a positive coefficient because remember t is
greater than 0. So, 2 over t cube is greater than 0. So, sum of positive semi definite
matrices weighted by positive coefficients, the resulting resultant matrix is also positive
semi definite. Therefore, the Hessian is positive semi definite and hence the function is
convex ok.
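Here is a short numerical sanity check of this conclusion, a sketch in Python assuming numpy is available; it simply tests Jensen's inequality for norm x bar square over t at random points with t greater than 0, with illustrative dimensions and ranges:

import numpy as np

rng = np.random.default_rng(3)
def f(x, t):
    return x @ x / t          # ||x||^2 / t

for _ in range(10000):
    x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
    t1, t2 = rng.uniform(0.1, 5.0, size=2)
    lam = rng.uniform()
    lhs = f(lam * x1 + (1 - lam) * x2, lam * t1 + (1 - lam) * t2)
    rhs = lam * f(x1, t1) + (1 - lam) * f(x2, t2)
    assert lhs <= rhs + 1e-10
print("convexity (Jensen) check passed for ||x||^2 / t with t > 0")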

So, this is greater than 0, which implies the weighted sum of PSD matrices is PSD, which implies that the Hessian of F of x tilde is a PSD matrix, which implies that F of x tilde is indeed a convex function, alright. It is a slightly involved and lengthy proof, but as we have seen, some of these tend to be a bit involved, ok. So, we have demonstrated that norm x bar square divided by t, that is, x bar transpose x bar divided by t, considered as a function of the n plus 1 dimensional vector x bar augmented with t, is a convex
function. Let us proceed to the next problem that is problem number 9. So, we consider first the first order derivative of h of x which we denote by h prime of x
using the product rule that is F prime of x g of x plus g prime of x into F of x. Now,
(Refer Slide Time: 17:17)
considering the second order derivative, it is h double prime of x, which is F double prime of x g of x plus F prime of x g prime of x plus g double prime of x F of x plus g prime of x into F prime of x, which you can now simplify as follows.

Given two functions F of x g of x which are convex and these are greater than 0, that is
F of x comma g of x greater than 0 and further these are non-decreasing. We want to
(Refer Slide Time: 19:31) (Refer Slide Time: 21:33)

So, this implies h double prime of x is greater than or equal to 0, which implies h of x is convex, which is nothing but F of x into g of x. This implies that F of x into g of x is convex. Let us now move on to another problem, problem number 10.

You can write this as, well, F double prime of x g of x plus, combining the two middle terms, twice F prime of x g prime of x, plus g double prime of x F of x. Now, let us dissect this term by term: if you look at this quantity
here you can see F of x is convex. So, this implies F double prime of x is greater than
(Refer Slide Time: 22:33)
equal to 0. Now, g of x is given to be greater than or equal to 0 or greater than 0.

So, this implies F double prime of x into g of x is greater than or equal to 0. Now, F of x
and g of x are non-decreasing are non decreasing this implies F prime of x comma g
prime of x greater than equal to 0 this implies the product that is since they are non
decreasing both of them are non-negative. So, the product is also non-negative and now
the last term is similar g is convex. This implies g double prime of x is greater than equal
to 0 given F of x greater than 0. So, this implies g double prime x into F of x greater than
equal to 0.

So, all the three components in the sum are non-negative. Therefore, the sum is non-negative, which implies the second order derivative h double prime of x is non-negative, that is, greater than or equal to 0, which implies essentially that h of x is convex, or the product F of x into g of x is convex.
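A small numerical illustration of this result, sketched in Python assuming numpy is available; the particular choices f of x equals exp x and g of x equals 1 plus x square on x greater than or equal to 0 are hypothetical examples that satisfy the stated conditions (convex, positive, non-decreasing):

import numpy as np

rng = np.random.default_rng(4)
h = lambda x: np.exp(x) * (1.0 + x**2)   # product of two convex, positive, non-decreasing functions

for _ in range(10000):
    a, b = rng.uniform(0.0, 3.0, size=2)
    lam = rng.uniform()
    assert h(lam * a + (1 - lam) * b) <= lam * h(a) + (1 - lam) * h(b) + 1e-9
print("h(x) = exp(x) * (1 + x^2) passes the midpoint convexity check on x >= 0")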
Now, consider a set of variables x 1, x 2, x n or a vector or let us say we have x bar this is
an n dimensional vector, and we have x j is the j-th largest of x 1, x 1, x 2 up to x n which
implies that basically you are sorting this. So, what this means is the largest is x 1 which
is greater than if you sort this x 2 greater than or equal to x n; n with the square bracket (Refer Slide Time: 25:33)
remember this is actually different from x n.

So, x 1, x square bracket or x subscript bracket you can say x subscript square bracket
one this is the maximum or the largest and this is the this is the largest and this is the
minimum, and x subscript square bracket j that is the j largest j-th largest that is you
arrange them in descending order x subscript square bracket one is the largest followed
by x subscript square bracket 2 and so on.

(Refer Slide Time: 24:35)

Now, this is an interesting function first you can see what you are doing is you are taking
the r largest alright. This is very interesting and very complicated function. So, you are
taking r largest plus you are taking the linear combination and this is a highly non-linear
function because, when you look at the maximum, the maximum of the maximum is
basically a non-linear function correct.

So, we are taking the r largest and taking a linear combination of them. So, this is basically a highly non-linear function, because although you are taking a linear combination, you are looking at the largest of these elements, right. So, this is a highly non-linear, interesting function; yet you can demonstrate that this function is convex, and that can be done as follows. In fact, it is very simple.

And, let us assume non negative coefficients alpha 1 up to alpha r, which are arranged in decreasing order such that alpha 1 is greater than or equal to alpha 2, and so on, greater than or equal to alpha r, greater than or equal to 0. Now, what we want to show is that the function F of x bar, which is alpha 1 times x square bracket 1 plus so on plus alpha r times x square bracket r, that is, the weighted sum of the r largest components, is convex, and this can be seen as follows.
(Refer Slide Time: 26:35) (Refer Slide Time: 28:59)

If you consider the function F i of x bar which is defined as alpha 1 x i 1 plus alpha 2 x i What is the total number of such functions that is the total number of permutations of r
2 index i 2 plus alpha r x i r where i 1 i r these belong to the set 1, 2 up to n and none of objects from a set of n objects that is n P r. So, total number of permutations will be n P r.
these two are equal or no two of these are equal or all of them are distinct no two are You might have seen this is in this is in high school this is n factorial by n minus r
equal. So, which implies i 1 i 2 up to i r are distinct now how many ways can you choose factorial, ok. So, that is the total number of such functions F total number of such
these indices i 1, i 2, i r. functions F m of x bar.

In fact, I can call this as F of let us say some index not I because we are using well F of Now, you can see that each of these is a hyper plane each of these is a that is F m of x bar
let us say m just one particular combination. So, basically depends on total number of not equals alpha 1 x i 1 plus alpha r x i r each of these very interestingly this is a which
even combinations because remember for I alpha 1 you have to choose one index i 1, implies this is convex. So, each of these functions is a hyper plane each of this functions
alpha 2 you have to this is basically a problem of permutations, how many ways that is corresponding to a permutation of this r x r variables x i 1, x i 2, x r this is a hyper plane.
basically you are choosing the ordered pairs i 1 i 2 up to i r. So, this is basically the So, each such function is convex.
permutation and what is the total number of such you want to ask the questions what is
the total number of such functions.
(Refer Slide Time: 30:55) hyper planes and therefore, the resulting function is also indeed convex and that basically
completes the proof for this interesting problem alright.
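Here is a small numerical illustration of this pointwise-maximum argument, a sketch in Python assuming numpy and itertools are available; n, r and the alphas are arbitrary illustrative values:

import itertools
import numpy as np

rng = np.random.default_rng(5)
n, r = 5, 3
alpha = np.array([3.0, 2.0, 0.5])          # alpha_1 >= alpha_2 >= alpha_3 >= 0

for _ in range(200):
    x = rng.standard_normal(n)
    # maximum over all ordered r-tuples of distinct indices (the n P r hyper planes)
    best = max(alpha @ x[list(idx)] for idx in itertools.permutations(range(n), r))
    # weighted sum of the r largest components of x
    sorted_sum = alpha @ np.sort(x)[::-1][:r]
    assert np.isclose(best, sorted_sum)
print("pointwise max over the n P r permutations equals the weighted sum of the r largest")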

So, we will stop here and starting from the next module we will start looking at various
convex optimization problems, the practical applications in various domains.

Thank you very much.

And, now, therefore, if I take the maximum of this for all the you take the maximum of
this for each x bar for all the m less than n P r that is you take the maximum of all these n
P r or n factorial by n minus r factorial functions you can see that the maximum is
nothing, but alpha 1 times x of x subscript square bracket 1 alpha 2 x subscript square
bracket 2 plus so on alpha, r x subscript square bracket. That is a maximum occurs,
remember we have said these coefficients alphas are arranged in the decreasing order so,
alpha 1 is greater than equal to alpha 2 is greater than equal to alpha 3 greater than equals
so on up to alpha r.

So, maximum occurs when the maximum alpha 1 is associated with the largest. So, this
is the largest alpha this is the maximum x i, this is the second largest and second largest
alpha i, this is the second largest x i and so on. Therefore, the maximum is nothing, but
alpha 1 x subscript square bracket 1 plus so on up to alpha r x subscript square bracket r, and this is nothing but our F of x bar.

Now, you can see that F of x bar is the maximum of set of in fact, this is point wise
maximum that is for each x you are taking the point wise maximum of a set of n P r
convex functions, in fact, hyper planes implies that F of x bar is convex alright. So,
basically you are taking n P r that is n factorial by n minus r factorial convex functions or
hyper planes and you are take the maximum the point wise maximum of these n P r
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology Kanpur

Lecture - 34
Beamforming in Multi-antenna Wireless Communication

Hello, welcome to another module in this Massive Open Online Course. So far we
have been looking at the mathematical preliminaries or the concepts that form the
foundation of the optimization framework all right. Namely convex sets we have looked
at convex functions and now let us start looking at the practical applications of these
concepts in the form of optimization problems that arise in practice and that can be
solved using the framework of convex optimization.

(Refer Slide Time: 00:43)


So, this has a lot of applications in the context of wireless communication, you can think
of 3g, 4g wireless communication systems and so on wherever you have multiple
antennas and the Beamforming problem is the following that is at a very high level if you
have a multiple antenna system.

(Refer Slide Time: 02:08).

So, what we want to start looking at is convex optimization problems, various


optimization problems that have practical application convex optimization. And now
what happens in a convex optimization problem is well let us look at it through an
example the first take the example that I want to look at through is an excellent and a
very simple example. But yet very practically relevant and that is the example of
Beamforming, which is one of the most important components or you can say one of the Let us say you have multiple antenna system each of these represents an antenna you
most important techniques in a modern day wireless communication system. have a transmitter or for that matter receiver let us say and beam forming can be done
both at the transmitter and receiver and what do you want to do and what you want to do (Refer Slide Time: 04:51)
is, so this is let us say user 1 transmitter 1 and this is your transmitter 2.

(Refer Slide Time: 02:55)

This process of formation of this beam this is termed as beam forming and as you can see
this is unique to a setup with multiple antenna. So, you have multiple antennas what that
helps you it helps you to form a signal beam in the direction of s of a desired user, while
What you want to do in beam forming is essentially you want to form a beam or
suppressing the signal in the direction of the interfering user or the other unintended
maximize the signal to noise power ratio in a particular direction that is let us say user
users. And this significantly improves the energy efficiency why because, you are
one is the desired user. While minimizing the power so, you want to maximize the
transmitting now energy only in a particular direction which was previously transmitted
reception or transmission it is symmetric. Maximize reception or transmission; maximize
spread out in a diffused fashion over all directions.
the reception or transmission in a particular direction.
So, by focusing the energy you are significantly improving the energy efficiency and you
Maximize in a particular direction and minimize in the other direction the undesired is let
are significantly improving the SNR of the system. So, Beamforming the main idea
us say user twos and minimize in other directions and therefore, you are forming a beam
behind Beamforming is that Beamforming significantly improves the energy efficiency
in a particular direction and this is what is termed as beam forming. So, what you are
of the system by focusing the energy in a particular direction and not transmitting in all
doing is you are forming a beam in a particular direction and this implies and this is what
directions by focusing energy in a by forming a beam. You are basically forming a beam
is termed that this forming this beam in this particular direction.
by focusing the energy in a energy so it simply improves the SNR.

So, Beamforming improves the SNR; ultimately, the main aim of Beamforming is to improve the SNR, and therefore it is very important, as a higher SNR improves the performance of the communication link, alright. So, Beamforming in that sense is very important in communication, and it is important even in radar, right; that is another interesting application where Beamforming is heavily used.
(Refer Slide Time: 07:28) (Refer Slide Time: 08:48)

So, it is communication systems as well as radar. So, these are, wherever you have signal For simplicity h 1 h 2 h L these are the channel coefficients, fading channel coefficients,
transmission wireless signal transmission and reception one can use beam form. So, how remember in a wireless communication system the channel is fading in nature because of
do we build a model for this Beamforming communication system that is what we want the multipath scattering. Now, the transmitted symbol is x from the single antenna the
to develop and we have already seen this to some extent and the model is very simple, received symbols at the different antennas are y 1 y 2 y L at the L antennas correct.
which we probably have seen in earlier modules also. That is you have the receiver let us
(Refer Slide Time: 09:26)
say and remember this is for the receiver you can do something exactly identical for the
transmitter.

So you have single transmit antenna multiple receive antennas this is also known as a
single input, multiple output system or this is known as receive antenna diversity and so
on. So, you have L receive antennas, L equals the number of receive antenna for
simplicity number of transmit antennas equal to 1, otherwise you can generalize this.

So, you have x equals the transmitted symbol, y i equals the received symbol on antenna i, and h i is the channel coefficient corresponding to antenna i; you can also say this is the fading channel coefficient. Therefore, I can express this system model as y 1 equals h 1 x, the fading channel coefficient times the transmitted symbol x, plus n 1; y 2 equals h 2 x plus n 2, and so on.

In fact, I can write this as a vector: y 1, y 2, up to y L equals the vector h 1, h 2, up to h L times x plus n 1, n 2, up to n L. What this means is basically y 1 equals h 1 times x plus n 1, and similarly the received symbol on antenna 2, y 2, equals h 2 times x plus n 2, and so on. So, you have the L dimensional received vector y bar, which is equal to the channel vector h bar times x, the transmitted symbol, plus n bar; these are L dimensional vectors, and y bar is the received vector.
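A minimal simulation sketch of this receive model in Python, assuming numpy is available; L, the noise level and the weights are arbitrary illustrative values, and the weights here are not yet optimized:

import numpy as np

rng = np.random.default_rng(6)
L = 4                                   # number of receive antennas
h = rng.standard_normal(L)              # fading channel coefficients h_1 .. h_L
x = 1.0                                 # transmitted symbol
n = 0.1 * rng.standard_normal(L)        # noise at each antenna
y = h * x + n                           # received vector, y_i = h_i x + n_i

w = rng.standard_normal(L)              # some beamforming weights (not yet optimized)
output = w @ y                          # beamformer output w^T y
print("received vector:", y)
print("combined output:", output)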

(Refer Slide Time: 11:35)

So, our system model is y bar equals h bar times x plus n bar. Now what do we want to do? Remember, at the receiver we have y 1, y 2, up to y L. What we want to do is beam form, that is, combine these symbols using a weighted combination: w 1 y 1 plus w 2 y 2 plus so on. So, what are we doing? We are performing a weighted combination of the received symbols. I can also write this as w 1, w 2, up to w L, the row vector, times y 1, y 2, up to y L, the column vector. This is basically w bar transpose times y bar, the received vector, the same thing that we have seen before.

(Refer Slide Time: 14:12)

Y bar is the received vector, h bar is the channel vector, x is the transmitted symbol, and n bar is the noise vector.

(Refer Slide Time: 12:21)


Now, what is this vector w bar? W bar is the vector of weights w 1 w 2 up to, you can Now, what happens in mechanical steering? Mechanical steering is basically where you
think of this as a vector of weights this vector w bar is basically what is known as the rotate the array or you tilt the array in a particular direction to change the direction of the
beam forming vector. This process of multiplying the various samples receive samples y beam. This was what was that conventionally and as you can see this mechanical steering
1 y 2 y L with these weights and then combining this process is nothing but beam is something that is extremely time consuming and also energy inefficient because it to
forming. What you are doing is by choosing the weights you are focusing the beam in a tilt to the array and tilt it with precision is both something that is time consuming and it
particular direction while suppressing the interference or not focusing the beam in other is also energy inefficient. So, mechanical steering this is basically time consuming plus
direction. expensive not to forget the cost plus inefficient, its energy it requires energy and of
course it is time inefficient it is also energy inefficient. As against electronic steering the
So, this vector w bar this is known as the; this is known as the beam forming vector and
biggest advantage is it is easy is low complexity and precise because you do it in the
the beam forming problem is basically to choose this vector w bar. The beam forming
digital domain plus precision.
problem is basically to choose; Beamforming problem is to choose this vector w bar.
How do you choose the vector w bar that we have already said to maximize the SNR. So, not to forget that is the most important thing; for instance you want to tilt the array in
The Beamforming problem is to choose find this vector, optimal vector and that is direction 23.65 degrees mechanically that would require a lot of skill, but electronically
known as the optimum the optimization problem to find the optimal vector w bar. that can be done with very good precision. So, that is the advantage of electronics
steering and now how do we do that by simply adjusting the weights of the Beamformer
That is the optimization problem to find the optimal vector w bar and this also what you
by choosing all we are doing is choosing w bar that is what we are doing, by choosing w
are doing is basically by choosing these weights, by simply choosing this weights you
bar, by simply choosing the weights or simply adapting.
are steering the beam in a particular direction, this is known as electronic steering. In
literature this is known as electronic steering where you are simply electronically doing (Refer Slide Time: 18:53)
this, by choosing the this is in opposite or contrast or you can say this is not equal to
mechanical or manual steering.

(Refer Slide Time: 17:00)

In fact you can also say this is adaptive, an important way in which this is done is by
adaptive, adaptively changing w bar in a time varying scenario. So, this has a lot of
applications including adaptive signal processing and the process is very simple. So, we
perform y bar transpose w bar transpose y bar this is your beam former, substituting the transpose w bar equals 1. So, you can also write this as h bar transpose w bar equals so
expression for y bar, you have h bar times x plus n bar which is w bar times h into x bar on, all vectors w bar lie on the hyper plane.
plus w bar into n bar. Now if you can look at this component, now x is the transmitter
(Refer Slide Time: 22:30)
symbol this component is basically the signal gain, w bar transpose h bar is the signal
gain and w bar transpose n bar is the noise at the output of Beamforming, noise at output
of; this is the noise at the output of the beam former. Now, what we want to do? We want
to maximize the signal to noise power ratio.

Now, there are 2 ways of doing this: either keep the signal gain constant and minimize the noise power, or keep the noise power constant and maximize the signal gain, all right. We will choose the first approach, that is, keep the signal gain constant: a constant gain for the signal, and minimize the noise power; that will result in maximizing the signal to noise power ratio. So, SNR maximization can be achieved as follows: we keep the signal gain, or the signal power, constant and minimize the noise power.

(Refer Slide Time: 21:23)


So, we can set w bar transpose h bar equals 1; this is known as the constraint in the
optimization problem, this is the constraint for the optimization problem. So, this is equal
to 1 this implies unit gain in signal direction, this maintains unit gain in signal direction
all right.

This is the constraint, this is the constraint for our optimization problem and now what
we have to formulate is here to formulate the SNR maximization all right, which will
form the objective function for our optimization. From then one can from the
optimization problem and one can solve the optimized resulting optimization problem
which we are going to do in the next module.

Thank you very much.

Now, whatever the signal gain is, w bar transpose h bar, you can see from here this is the signal gain. So, keeping the signal gain constant means we can set w bar transpose h bar equals 1, and this we have already seen: it is an affine constraint, or basically it represents a hyper plane. So, we have a convex constraint, and all the vectors w bar with w bar transpose h bar equals 1 lie on this hyper plane.
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:37)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 35
Practical Application: Maximal Ratio Combiner for Wireless Systems

Hello, welcome to another module in this massive open online course. So, we are looking at the problem of beam forming, that is, to find the beam former or the combiner which maximizes the signal to noise power ratio for the user in a particular direction, right, and we set the constraint for the signal gain.

(Refer Slide Time: 00:31)

So, this is summation over i of w i n i. Now, we have to assume something about the statistical properties of these noise samples; let us assume a very commonly employed model, that is, all the noise samples are statistically identical: they have 0 mean and variance such that the expected value of n i square equals sigma square. That is, the noise power is sigma square, and they are 0 mean. These noise samples n i are 0 mean, and in addition let us also assume that if you take 2 distinct noise samples, the expected value of n i times n j is equal to 0, that is, for i not equal to j the distinct noise samples are uncorrelated.

(Refer Slide Time: 02:47)

So, we are looking at the beam forming problem, and the constraint for the signal gain is
w bar transpose h bar equals 1. This is unit gain for desired user or unit gain for signal.
That is what we said we said the signal gain to be unity and minimize the noise power.

Now, coming to the noise power and the noise power can be calculated as follows. We
know the noise component is w bar transpose n bar which is basically your row vector w
1 w 2 w l times the column vector of noise which is n 1 n 2 up to n l.
So, the noise samples are uncorrelated. In addition, if these are Gaussian, that is, if the n i are Gaussian, then uncorrelated implies that they are also independent; however, this does not hold for non-Gaussian noise. Typically the noise samples are assumed to be additive white Gaussian, which means that the noise samples at the different antennas are independent and identically distributed, in the sense that they have 0 mean and identical variance sigma square. So, we say the noise samples n i are i.i.d., that is, independent and identically distributed.

(Refer Slide Time: 04:08)

Now if you look at the noise power, that is, the expected value of the square of summation over i w i n i, this becomes, writing the quantity times itself, the expected value of summation over i w i n i times summation over j w j n j, which is equal to, multiplying it out, the expected value of summation over i summation over j of w i w j n i n j; and now it is easy to see what this reduces to. Take the expected value inside: summation over i, j of the expected value of w i w j n i n j. In fact, w i and w j are constants, so you can simply write this as w i w j times the expected value of n i n j, and we have seen this is equal to 0 if i is not equal to j and equal to sigma square if i is equal to j.

So, only when i is not equal to j, the expected value of n i n j is basically 0; the noise samples at two different antennas are uncorrelated, all right. So, only the terms where i equals j survive, and in that case the expected value of n i n j is sigma square. When i equals j, w i times w j is w i square. So, taking the sigma square outside, this will be sigma square times summation over i of w i square, which is nothing but sigma square norm w bar square, and this is basically your noise power. This is basically the noise power, ok.
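A quick Monte Carlo check of this noise power expression, sketched in Python assuming numpy is available; the dimension, sigma and the weight vector are illustrative values:

import numpy as np

rng = np.random.default_rng(7)
L, sigma = 4, 0.5
w = rng.standard_normal(L)

samples = sigma * rng.standard_normal((200000, L))     # i.i.d. zero-mean noise vectors n
empirical = np.mean((samples @ w) ** 2)                # estimate of E[(w^T n)^2]
theoretical = sigma**2 * np.sum(w**2)                  # sigma^2 * ||w||^2
print(empirical, theoretical)                          # the two should agree closely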
(Refer Slide Time: 06:39) (Refer Slide Time: 08:52)

Therefore, now I can formulate my optimization problem. Now, my optimization Well, minimizing sigma square norm w bar square, this is equivalent to minimizing norm
problem will be the following. The optimization problem, remember the optimization w bar square because sigma square is a constant. So, if I minimize not w bar square I will
problem for beam forming is to minimize the noise power. We have just found the noise also minimize sigma square norm w bar square and now, look at this norm w bar square
power that is minimized sigma square nor w bar square that is what are you doing is, you is simply w 1 square plus so on w l square sum of convex functions. So, this is a convex
are minimizing the noise power, minimizing the noise power subject to the constraint w objective affine constraint is also a convex constraint. So, you have a convex objective
bar transpose h bar equal to 1. That is your signal gain equals constant or constant gain in function and you have a convex constraint. You can have more than one convex cancer;
the direction of desire. So, what you are doing is you are minimizing the noise power you can have a set of convex constraint. So, convex objective plus convex constraints
keeping constant signal power. So, what does that does is that maximizes the signal to that makes a convex optimization problem that is a special subclass of optimization
noise power ratio that is what we said and therefore, if you look at this optimization problems which we are going to focus on.
problem, it has two parts. First what you are trying to minimize. This is termed as the
So, this is going to be the template of a convex optimization problem. So, convex
objective or the optimization objective or simply the objective function and this is termed
objective plus convex constraints in it lies implies a convex optimization problem. So,
as the constraint because remember we have to ensure w bar transpose h bar equals 1.
this implies convex also we have a convex objective.
This is term constraint or in fact, you can have more than one constraint, you can have
constraints. And if you see this constraint is an affine constraint it is an affine equality
constraint and if you look at the objective, the objective is convex sigma square norm w
bar square.
(Refer Slide Time: 10:47) (Refer Slide Time: 13:21)

We have a convex objective and we have convex constraints. This implies we have a convex optimization problem; in fact, a constrained convex optimization problem. The constraints are an integral part: you have not just the objective function, but also a set of constraints that the desired solution has to satisfy. How do we solve this? Let me rewrite the optimization problem. I am dropping the sigma square because it is a constant; so, I am going to simply minimize norm w bar square subject to the constraint w bar transpose h bar equals 1.

What I am now going to do is something that you must have seen in an early course on calculus: to solve a constrained optimization problem, one uses Lagrange multipliers. So, I am going to form the Lagrangian; by the way, norm w bar square equals w bar transpose w bar, so I can write the Lagrangian as w bar transpose w bar plus lambda times 1 minus w bar transpose h bar. This quantity lambda is the Lagrange multiplier; it is a new parameter of the optimization problem, and this Lagrange multiplier also has to be determined as part of the solution, ok. So, now I am going to differentiate this with respect to w bar and set it equal to 0: differentiating w bar transpose w bar gives 2 w bar, plus lambda times the derivative of 1, which is 0, minus lambda times the derivative of w bar transpose h bar, which is h bar. So, set it equal to 0.

So, this implies 2 w bar equals lambda h bar, which implies w bar equals lambda over 2 times h bar, and this is the optimal beam former. Why is it the optimal beam former? Because it maximizes the signal to noise power ratio, and you can see that the optimal beam former is proportional to h bar. So, this is also like a matched filter; in fact, a matched filter in space, because typically you have a matched filter in time, while this is a matched filter across the antennas, a spatially matched filter. So, w bar is proportional to h bar, which implies that this is the analogue of the matched filter that you employ in a digital communication system.

Now, how do we determine lambda? To determine this Lagrange multiplier, we have to use the constraint.
(Refer Slide Time: 15:11)

Therefore, it implies the optimal beam former, which you can now denote as w star: w star equals lambda by 2 times h bar, and substituting lambda equals 2 over norm h bar square, this is h bar divided by norm of h bar square.

So, this is your optimal beam former: h bar divided by norm h bar square. It is also termed the maximal ratio combiner because it maximizes the signal to noise power ratio, ok.
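Here is a short numerical sketch in Python, assuming numpy is available, of the maximal ratio combiner just obtained; L, p and sigma are illustrative values. It checks the unit-gain constraint, compares against other feasible weight vectors, and evaluates the output SNR:

import numpy as np

rng = np.random.default_rng(9)
L, p, sigma = 4, 2.0, 0.5
h = rng.standard_normal(L)
w_star = h / (h @ h)                         # maximal ratio combiner h / ||h||^2
assert np.isclose(w_star @ h, 1.0)           # unit gain in the signal direction

for _ in range(1000):                        # compare against other feasible vectors
    d = rng.standard_normal(L)
    d -= (d @ h) / (h @ h) * h               # component orthogonal to h, so (w*+d)^T h = 1
    assert (w_star + d) @ (w_star + d) >= w_star @ w_star - 1e-12

snr_out = p * (w_star @ h) ** 2 / (sigma**2 * (w_star @ w_star))
print(snr_out, (h @ h) * p / sigma**2)       # the two values should match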

(Refer Slide Time: 17:27)

What is the constraint? Remember the constraint is w bar transpose h bar equals 1.
Substitute the value of w bar, that is, lambda by 2 h bar: lambda by 2 h bar transpose h bar equals 1, which implies lambda by 2 norm h bar square equals 1, which implies lambda equals 2 over norm h bar square. So, lambda equals 2 over norm h bar square, correct, and that is what we have.

(Refer Slide Time: 16:04)

So, that gives us the optimal beam former, which is h bar divided by norm h bar square, that is, the maximal ratio combiner. It is known as the maximal ratio combiner because it maximizes the SNR, correct; employing this beam former maximizes the SNR at the output of the beam former. What is the SNR? That is easy to see: if you have y bar equals h bar times x plus n bar, let the symbol power be p, that is, the expected value of x square equals p, ok.
(Refer Slide Time: 18:30) (Refer Slide Time: 19:43)

Now, what am I doing? I am beam forming with the maximal ratio combiner that is h bar So, expected x square is p divided by sigma square norm w bar square remember w bar
transpose divided by norm h bar square into y bar. Remember this is a maximal ratio equals h bar by norm h bar square norm w bar square is norm h bar square divided by
combiner which is h bar transpose h bar divided by norm h bar square times x plus h bar norm h bar power 4 equals p divided by sigma square into norm h bar square. So, SNR at
transpose n bar divided by norm h bar square. Now, if you see this h bar transpose h bar the output is norm h bar square into p divided by sigma square. You can also write this as
that is norm h bar square divided by norm h bar square which is 1 and that is indeed true norm h bar square into rho where rho equals p divided by rho equals the transmitted.
because we ensure that the signal gain is 1. So, this is h plus h bar transpose n bar
(Refer Slide Time: 20:39)
divided by norm h bar square. Therefore, the SNR at the output of the maximal ratio
combiner is simply signal power that is expected value of x square by noise power, but
we know the noise power is sigma square times norm w bar square; we have already derived that for any combiner.

So, p equals the symbol power, sigma square equals the noise power, and rho equals p divided by sigma square, that is, signal power by noise power. You can think of this as the transmit power because this is
the power of the transmitted symbols. So, rho equals I am sorry transmit SNR p over Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
sigma square which is the transmitted SNR. So, this is the interesting thing. So, this is
Department of Electrical Engineering
the first convex optimization problem which is rather simple application of the Indian Institute of Technology, Kanpur
optimization framework which we have seen and you can clearly see it is very powerful
Lecture - 36
and very handy, right. You can formulate a neat optimization problem where in you are Practical Application: Multi-antenna Beamforming with Interfering User
trying, today you are trying to design the optimal beam former. You set the beam forming
gain or the gain of the beam former in the desired direction and the direction of the signal Hello, welcome to another module in this massive open online course. So, we are
to be unity that gives you your constraint and you minimize the noise power which is looking at practical applications of optimization in particular we have looked at
sigma square norm w bar square to basically essentially maximize the signal to noise beamforming that is how to focus the beamforming, the beam in the direction of a
power ratio. We used the Lagrange multiplier framework to formulate the Lagrangian particular user correct a system that is beamforming. Let us now extend this paradigm to
basically differentiate it set it equal to 0. This is also known as the KKT frame, but will include to look at beamforming with interference that is what happens when you have a
look at it in more detail as we go in the subsequent modules, but initially if this illustrates desired user and also an interfering user.
to you in a very simple fashion through a practical example how the optimization
(Refer Slide Time: 00:40)
framework and more specifically the convex optimization framework can be used to
solve practical problems alright.

So, we will stop here and continue in the subsequent modules.

Thank you so much.

So, we want to look at beamforming with interference, something that we have not looked at so far; so far we have simply considered a single user without interference. So, what happens when you have a desired user as well as an interfering user?

(Refer Slide Time: 01:12)


the secondary receiver. Secondary user the receiver or you can also think of this as a
secondary base station.

At the secondary base station you want to get the signal correct, you have to beamform
such that you receive the signal from the secondary user while rejecting the interference
from the primary user, because the primary user is the licensed user hand the primary
user has priority for the transmission. So, when there is an ongoing primary transmission,
how do you make sure that this does not impact the signal reception the signal quality at
the secondary user. So, this has many applications, there are several applications in fact
beamforming in the presence of interfering users.

(Refer Slide Time: 04:54)

So, you have again as usual you have this multiple antenna array. And now, you have a
user to which you want to transmit let us call this, let us or user who is the desired user
whose signal you want to receive and you have another user who is an interfering user
whose signal you want to reject. So, this is the receiver you want to form a beam in the
direction of the desired user; at the same time you want to reject correct, you want to
reject the signal from an undesired user. So, this is let us say your interfering user. This is
your interfering or this is your desired user actually. So, this is your desired user, and this
is your undesired or you could call this as your interfering user.

So, you want to receive the signal or focus your energy to in the direction of the desired
user while the same time rejecting correct, rejecting the signal from the undesired or the
interfering user. And this can occur in several scenarios. For instance, you have several
users, you want to focus on one particular user; or you have for instance of cognitive The received signal or remember the signal model is as follows previously, we had
radio scenario in which case you have the secondary user and then you have the without interference we had y bar equals h bar x plus you can think of this as n bar. So,
interference from the primary user all right. So, you want to reject the interference from desired signal plus noise, now you will also have the interference. So, previously we
the primary user while at the same time focusing your energy while transmission or simply had h bar x plus n bar, now you have this additional component which is basically
receiving the signal from your secondary user. your interference from the interfering user. So, this is as usual this is your noise all right,
additive white Gaussian noise; this is your desired signal.
So, for instance, you can think of this as an interesting application in the evolving
paradigm of cognitive radio in which you have T X 1 equals your secondary user. So, And so therefore, g bar x i plus n is the noise plus interference. You can also call this as n
you want to receive the signal from the secondary user, but there might be an ongoing tilde, this is equal to you can call this as n tilde; this is your noise plus you can also think
transmission of the primary user T X 2 that is undesired user that is at the secondary of this as multiuser interference because this interference is arising because of other
receiver is the primary user. So, you want to reject at the secondary. So, this R X can be
users. So, this is noise unlike for instance, it is a inter symbol interference, this is your
multiuser interference, interference arising from the other users.

(Refer Slide Time: 06:37)

Now, this can be simplified as follows taking the expected value inside this is g bar g bar
by the way g bar now remember h bar is the channel vector of the desired user, as per our
this is all according to previous notation channel vector of desired user. And similarly g
bar which is the L dimensional vector corresponding to L antennas this is the channel
So, you have noise plus interference, and now we want to compute the noise plus vector of the interfering user. This is the channel vector of the interfering user, so that is
interference covariance, the covariance of the noise plus interference. Now, this is simply your g bar.
your expected value of n tilde n tilde transpose, remember this is the definition of the
So, you have g bar g bar transpose times expected value of x i square plus g bar expected
covariance matrix. Assuming that expected and tilde equals 0, this is the typical
value of the symbol of the interfering user that is x i times n bar transpose. Remember x i
assumption that is the noise mean we already seen that the noise is zero mean also
this is basically your signal or you can think of this as symbol of the interfering symbol
assume that the interference is zero means or the noise and plus interference will have
of interfering user plus you have expected value of x i into n bar into g bar transpose plus
zero mean. And then the covariance matrix will be expected value of n tilde plus n tilde
expected value of this is the noise covariance that is what we have previously seen n bar
transpose which is expected value of well that will be g bar x i plus n tilde or g bar x i
n bar transpose.
plus n bar times g bar x i plus n bar transpose which is the expected value of, value you
have g bar g bar transpose x i square plus let me write this as follows g bar x i n bar Now, what we will do is we will assume the symbol power to be p which means
transpose plus x i into n bar into g bar transpose plus n bar n bar transpose. expected value of x i square our p i or sigma i square let me just write this as sigma i
square. So, we are assuming that the signal power or the interference power is sigma i
(Refer Slide Time: 08:00)
square, expect value expected value of x i square is sigma i square. Now, we have this
quantity which is interesting, the cross correlation this is expected value of x i n bar
transpose expected value of x i into n bar that is it looks at the correlation between the
signal and the noise.
Typically the signal and the noise are uncorrelated because the noise arises from the identity. So, this is your noise plus interference covariance matrix. This is your noise plus
system and the signal from the information that intense is the information that interference covariance matrix corresponding to this scenario of multiuser beamforming
corresponds to the particular either desired user or interfering so user. So, the signal and with interference. Now, again we want to find the beamforming vector W bar. How are
noise are typically uncorrelated, in fact, they are independent. So, if they are both zero we going to find the beam forming vector W bar?
mean then expected value of x i into n bar expected value of x i into n bar transpose both
(Refer Slide Time: 13:37)
are 0. So, this is the other assumption, which is of course very intuitive, but nevertheless worth clarifying: the expected value of x i n bar and the expected value of x i n bar transpose (these are not the same object; one is a column vector and the other is a row vector) are both 0. So, the expected value of x i n bar transpose is 0 and the expected value of x i n bar equals 0; both these quantities are 0.

(Refer Slide Time: 11:45)

So, our intention is to now perform beam forming and for that we need to have the beam
forming vector. How are we going to find the beam forming vector well let us say W bar
is the beam forming vector, W bar transpose y bar you are doing beam forming
remember it is also known as electronic steering that is you are simply linearly
combining the samples of the received signal it is a weighted combination a weighted
linear combination of the samples of the received signal that is W bar transpose y bar. W
And the reason is the same because signal and noise are independent. In fact, what we bar is the beam forming vector, W bar transpose substituting y bar this is y bar times h x
need is simply uncorrelated. And that signal and noise are uncorrelated typically that is plus.
what you have. And therefore, what you have is you will have sigma i square. And this
Now, n tilde with n tilde is the noise plus interference noise plus interference, this is well
we know noise covariance this is simply sigma square times identity matrix, because the
again this is W bar transpose h into x plus W bar transpose n tilde. This is the signal, this
noise samples are we have assumed the noise samples are the different antennas to be
is the signal gain in fact W bar transpose h bar. And this is the noise plus interference
IID - Independent Identically Distributed.
now not simply the noise. So, this is the noise plus interference. Now, what we want to
So, if you look at the covariance that is simply sigma square times identity. So, finally, do is we want to minimize the effect of this noise plus interference at the output. So, we
what you will get is if you call this noise plus interference covariance matrix as R, what have to calculate its power.
you will get as sigma i square times g bar g bar transpose plus sigma square times
So, while setting the gain in the direction of the signal, now this is something that you And now you have something interesting if you take the expected value inside you have
have to pay attention to the reason being the follows. If you minimize, simply minimize W bar transpose expected value of n tilde n tilde transpose W bar which is W bar
the noise plus interference without consider this optimization problem, it is simply transpose expected value of n tilde n tilde, this is nothing but the noise plus interference
minimize the noise plus interference or the noise without paying attention to the signal covariance. So, this is W bar transpose R expected value of n tilde n tilde transpose this
all right. So, all you want to do is minimize W bar transpose n tilde, then the optimal is the noise plus interference. So, this is the net noise plus interference power at the
value of W tilde or W bar beam forming vector will simply be 0, because if you use output of the beam former. So, this is your noise plus interference power at the output,
beam forming vector to be set it to be 0, then the output noise is 0. But the problem with noise plus interference power at the output of the beam former. And now what is our
that is the output signal is also 0 all right, and therefore, and you cannot have that all optimization, now again we want to have remember signal gain W bar transpose h bar
right. this has to be 1. So, signal gain has to be unity.

So, therefore, you have to minimize the noise or noise plus interference while restricting (Refer Slide Time: 18:22)
it constraining it in such a way that the signal is not affected, that the signal gain is still
unity all right, so that is the important aspect to pay attention here.

(Refer Slide Time: 16:22)

So, what is our modified optimization problem for this beam forming with interference.
So, the optimization problem for beam forming with interference, so the optimization
problem for beamforming with interference is minimize that is what we have seen
minimize the noise plus interference power which is what we have calculated W bar
So, want to minimize the noise plus interference what is the noise plus interference
transpose R W bar. Previously this was simply W bar transpose W bar if you remember.
power that is W bar transpose n tilde will come calculate this in a compact fashion
Previously this was simply W bar transpose W bar equals norm W bar square with the
expected W. Now, this is a scalar quantity. So, I can do all kinds of manipulations that is
noise when you simply add noise.
W bar transpose n tilde times itself scalar quantity is scalar quantity transpose. So, I am
going to simply write it as W bar transpose n tilde transpose which is nothing but W bar Now, you have the noise plus interference therefore, is W bar transpose R W bar where R
transpose n tilde square. Now, this is expected value of W bar transpose n tilde n tilde is the noise plus interference covariance that is the only difference. Subject to the
transpose into W bar. constraint, what is your constraint? Constraint is well W bar transpose h bar equals 1
gain in the direction of the signal equals unity. All you have to do is now form the out the value of lambda by 2, lambda by 2 is 1 over h bar transpose R inverse h substitute
Lagrangian F which is W bar transpose R W bar. Now, again remember the covariance this value of lambda by 2 above.
matrix is positive semi definite this is a positive semi definite matrix implies.
(Refer Slide Time: 22:44)
Now, that is always important to check implies W bar transpose W bar is convex, so that
is an important thing. So, this is a convex optimization problem because the objective
function is convex that is something that you have to verify at each and every stage that
you are solving indeed solving the right kind of problem. So, this is W bar transpose R W
bar and then usually have you; as usual you have the Lagrange multiplier 1 minus W bar
transpose h bar 1 minus W bar transpose h bar. And now differentiating this W bar
transpose R W bar because R is symmetric matrix this is simply twice R W bar minus
lambda derivative one is 0, W bar transpose h bar derivative is h bar set it equal to 0
which implies the optimal beam former is lambda by 2.

(Refer Slide Time: 21:20)

And therefore, your optimal beam former you can call that as W star equals lambda by 2
1 over h bar transpose R inverse h bar times R inverse h bar. So, this is your optimal
beam former with interference. Now, you have also incorporated. So, this is the optimal
beam former which maximizes signal in the desired direction or you can say this
maximizes the signal power while minimizing noise plus interference and that is the
important aspect; while minimizing noise plus interference that is important aspect.
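As a quick illustration of the closed form just derived, here is a minimal numerical sketch in Python/NumPy (not part of the original lecture); the antenna count L and the powers sigma i square and sigma square are assumed values chosen only for the example.

```python
import numpy as np

# Minimal sketch of the beamformer with interference derived above:
#   R  = sigma_i^2 * g g^T + sigma^2 * I   (noise-plus-interference covariance)
#   W* = R^{-1} h / (h^T R^{-1} h)
# L, sigma_i^2, sigma^2 and the channels are illustrative assumptions.

rng = np.random.default_rng(0)
L = 4                          # number of receive antennas (assumed)
sigma_i2, sigma2 = 2.0, 0.5    # interference and noise powers (assumed)

h = rng.standard_normal(L)     # channel vector of the desired user
g = rng.standard_normal(L)     # channel vector of the interfering user

R = sigma_i2 * np.outer(g, g) + sigma2 * np.eye(L)
Rinv_h = np.linalg.solve(R, h)
W = Rinv_h / (h @ Rinv_h)      # optimal beamformer with interference

print("signal gain W^T h =", W @ h)          # equals 1 (unity gain constraint)
print("output N+I power W^T R W =", W @ R @ W)
```

The printed gain confirms the constraint W bar transpose h bar equals 1, while W bar transpose R W bar is the minimized noise plus interference power.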

And therefore you can see, with a slight twist, by slightly modifying the objective function, how you can make the formulation even more comprehensive, so that it tackles a much more general problem. Previously we only had the noise power; now you have the noise plus interference, where the interference can arise for a variety of reasons. It can be due to different users in a cellular scenario. The interference can also be due to a harmful or malicious user trying to interfere with the base station, or it can arise in a cognitive radio scenario where there is an ongoing transmission of the primary user, which causes interference at the secondary user. So, this has several practical applications in that sense. So, we will stop here, and continue in the subsequent modules.
Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 37
Practical Application: Zero-Forcing (ZF) Beamforming with Interfering User

Hello. Welcome to another module in this massive open online course. So, we are looking at convex optimization and practical applications of various optimization problems in the context of beamforming. We have looked at different kinds of beamforming: the original beamforming problem, and we have also seen beamforming with interference. In this module let us look at yet another kind of beamforming, which is known as zero-forcing beamforming.

(Refer Slide Time: 00:43)


What happens in zero-forcing beamforming? Well, we have seen beamforming where you have a multiple antenna array; you have these multiple antennas, and let us see: this is the receiver, you have the desired user, and you have your interfering user.

This is the desired user, and what you do is again maximize the signal gain in this direction, the direction of the desired user. While in the direction of the interfering user, what you do is simply place a null; that is, you make the signal gain equal to 0. So, the gain equals 0 in the direction of the interferer, or you can say the interference is nulled. Therefore, you are nulling the interference.

This is termed interference nulling: you are nulling the interference by basically ensuring that the gain in the direction of the interfering user is 0. This is also termed placing a null in the direction of the interferer, so it can also be thought of as placing a null. There are various kinds of nomenclature for this: placing a NULL in the direction of the interferer, or interference nulling.

(Refer Slide Time: 02:02)

So, what we want to look at is a different kind of beamforming which is termed zero-forcing beamforming; you can also call this ZF, ZF for zero-forcing. Now what happens in zero-forcing beamforming? In zero-forcing beamforming the interference is nulled, the interference is made 0. We are forcing the interference to 0, and that is why this is known as zero-forcing beamforming; it is also known as interference nulling, or nulling the interference.

(Refer Slide Time: 04:59)

Let us look at the procedure to do this. Let us again go back to the system model, that is, y bar equals h bar x plus g bar x i plus n bar; this model is similar to the model that, you might remember, we had seen before. Here, once again, h bar is the channel vector of the desired user and g bar is the channel vector of the interfering user.

(Refer Slide Time: 05:51)

And you take w bar as the beamforming vector, so you are performing beamforming similar to what you have done before. So, we have w bar transpose y bar equals w bar transpose times h bar x plus g bar x i plus n bar, where n bar is the noise, the additive white Gaussian noise as usual. So, this is w bar transpose h bar x, plus w bar transpose g bar into x i, plus again the noise output, that is, w bar transpose n bar.

(Refer Slide Time: 07:25)

Now, the first term is the signal part, and you ensure that the signal gain equals 1; you can also term this the gain in the direction of the desired user. So, this ensures unity signal gain. Now, to null the interference, we set the interference term to 0, that is, we set w bar transpose g bar equal to 0. This basically nulls the interference. By setting w bar transpose g bar equals 0 you are placing a null in the direction of the interferer; you are not just suppressing but effectively zeroing whatever signal is received from the interferer. So, that is ensured by the condition w bar transpose g bar equal to 0.
And therefore, now, your optimization problem: the resulting zero-forcing beamformer, the optimization problem for the ZF beamformer.

(Refer Slide Time: 09:04)

And by the way, this is your zero-forcing condition, in case you are wondering what zero-forcing is: you are forcing the interference to 0.

In the resulting optimization problem for zero-forcing beamforming, you minimize the noise power as usual, that is, sigma square norm w bar square, which we said is equivalent to minimizing norm w bar square because sigma square is a constant, and which is basically nothing but w bar transpose w bar. So, you minimize w bar transpose w bar, subject to the constraints; now you have 2 constraints, whereas previously you had only one constraint. You have w bar transpose h bar equals 1, unit gain again in the direction of the signal, and now you have another constraint, that is, w bar transpose g bar equals 0; this is your ZF constraint. The objective function is again convex, and both constraints are affine, that is, linear. So, this is again a convex optimization problem, similar to before: the objective is convex, the constraints are affine, and linear constraints are convex.

However, now you see that you have 2 constraints, correct? Unlike the previous one, where you had only a single constraint. In a general optimization problem you can have multiple constraints, not just one constraint; the previous problems were rather simple. So now you have 2 constraints, and you can in fact have multiple constraints. In fact, in this scenario itself you can see that you may have more than one interfering user; so let us consider a schematic where you have more than one interfering user.

(Refer Slide Time: 09:44)

So, you have tx 3, and again you want to place a NULL along the direction of tx 3. So, depending on the number of interfering users, you can see how many constraints you have in this scenario: if k is the number of interfering users, you have k plus 1 constraints, one for the unity signal gain, plus k null constraints for the k interfering users. So, the number of constraints grows with the number of interfering users.
(Refer Slide Time: 12:36)

And therefore, now, this optimization problem: again, I can write the objective as w bar transpose w bar, subject to the constraints. Well, I can write w bar transpose h bar as h bar transpose w bar, so the constraints are h bar transpose w bar equals 1 and g bar transpose w bar equals 0.

And now I can stack these as a matrix and a vector: I can call this matrix C transpose, so that C transpose w bar equals the vector e bar 1, where C is the matrix whose first column is h bar and whose second column is g bar, and e bar 1 is the vector which has a one in the first position and a 0 in the second.

(Refer Slide Time: 14:06)

So, I can write this optimization problem as: minimize w bar transpose w bar, subject to C transpose w bar equals e bar 1. And this is basically the optimization problem for zero-forcing beamforming. In fact, this is known as a quadratic program. See what you have here: a quadratic objective function, and linear constraints, or rather affine constraints.
(Refer Slide Time: 15:41)

And therefore, this is basically termed a quadratic program; this type of optimization problem is termed a quadratic program. And now what we want to do is solve this quadratic program to obtain the zero-forcing beamformer.

So, I form the Lagrangian, which is F of w bar comma lambda bar; we will see that it is a function of a vector lambda bar. It is w bar transpose w bar plus, now you see there are 2 constraints, so I will need one Lagrange multiplier for each constraint. So, it is lambda 1, lambda 2 times C transpose w bar minus e bar 1, where C transpose is the matrix with rows h bar transpose and g bar transpose. The first constraint is h bar transpose w bar minus 1 and the second is g bar transpose w bar equals 0. So, you have 2 constraints and therefore 2 Lagrange multipliers, one for each constraint.

(Refer Slide Time: 17:23)

So, you have one Lagrange multiplier for each constraint, and this row vector of multipliers can be written as lambda bar transpose, where lambda bar is the vector containing the 2 Lagrange multipliers, that is, lambda 1 and lambda 2.

And therefore, the Lagrangian will be w bar transpose w bar plus lambda bar transpose times C transpose w bar minus e bar 1, which is equal to w bar transpose w bar plus lambda bar transpose C transpose w bar minus lambda bar transpose e bar 1. That is your Lagrangian.
(Refer Slide Time: 18:35)

This is your Lagrangian function. Now we are going to differentiate it, that is, compute the gradient of this Lagrangian function with respect to w bar. So, I compute the derivative of w bar transpose w bar with respect to w bar, which is twice w bar. You can also think of w bar transpose w bar as w bar transpose identity w bar, and if you differentiate this with respect to w bar you get twice identity into w bar, which is nothing but twice w bar. In any case, you can see w bar transpose w bar is w 1 square plus w 2 square and so on up to w L square; if you differentiate with respect to each w i you get 2 w i, so the gradient is the vector 2 w bar.

Next, the term lambda bar transpose C transpose w bar is of the form c bar transpose w bar, where c bar equals C into lambda bar; the derivative of c bar transpose w bar with respect to w bar is c bar, that is, C lambda bar. The derivative of lambda bar transpose e bar 1 with respect to w bar is 0. And then we set the gradient equal to 0, as for any optimization problem.

(Refer Slide Time: 20:31)

Right, as you do for any optimization problem to find the extrema. And this implies that twice w bar equals minus C times lambda bar. Note that here you cannot interchange C and lambda bar: previously lambda was a scalar, so you could simply bring it out, but here lambda bar is a vector, so you have to write it as C times lambda bar. And therefore, this implies that the optimal vector w bar equals minus half C lambda bar.

So, this is the expression for the optimal vector w bar. And again, to find lambda bar, use the constraint. What is our constraint? Our constraint is C transpose w bar equals e bar 1. Substituting for w bar, this gives C transpose times minus half C lambda bar equals e bar 1, which implies minus half C transpose C lambda bar equals e bar 1.
(Refer Slide Time: 22:14)

Because we only need minus lambda bar over 2, this implies that minus lambda bar over 2 equals C transpose C inverse times e bar 1. That is the expression for the Lagrange multiplier vector, or rather for minus lambda bar divided by 2. Now substitute this in the expression for the beamformer; if you call that equation one, substitute this in one. You have w star, the optimal zero-forcing beamforming vector: you can take the factor of minus half inside, and since minus lambda bar divided by 2 is the expression above, w star is simply C times, C transpose C inverse, times e bar 1. So, that is your optimal zero-forcing beamformer, which basically places a null in the direction of the interferer, in the sense that it completely blocks the interference from the interfering user, or it zeros the interference from the interfering users.
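To make the closed form concrete, here is a small illustrative sketch in Python/NumPy (not part of the lecture); the channel vectors and the number of antennas are assumed toy values.

```python
import numpy as np

# Sketch of the zero-forcing beamformer derived above:
#   C = [h g],  e1 = [1, 0]^T,  w* = C (C^T C)^{-1} e1
# The dimension L and the channel vectors are assumed for illustration.

rng = np.random.default_rng(1)
L = 4
h = rng.standard_normal(L)            # desired user's channel
g = rng.standard_normal(L)            # interfering user's channel

C = np.column_stack((h, g))           # first column h bar, second column g bar
e1 = np.array([1.0, 0.0])
w = C @ np.linalg.solve(C.T @ C, e1)  # zero-forcing beamformer

print("w^T h =", w @ h)   # unity gain towards the desired user
print("w^T g =", w @ g)   # (numerically) zero gain towards the interferer
```

Since C transpose w equals e bar 1 by construction, the two printed values confirm the unity gain and the zero-forcing constraints.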

And in fact, this is also a very popular technique that is employed in practical wireless communication systems, especially in the presence of a large number of interfering users. As I already told you, you can also use this in a cognitive radio scenario, where you have a secondary user and there is an ongoing primary user transmission; the secondary user receiver can block the interference caused by the primary transmitter by using zero-forcing beamforming. Also, because of its low complexity, it tends to be one of the popular beamforming techniques, along with the beamforming in the presence of interference that we have seen, which is also termed the Capon beamformer. So, the zero-forcing beamformer is one of the popular modes of beamforming employed in practical scenarios, and this gives you a neat procedure to derive its expression.

So, we will stop here and continue in the subsequent (Refer Time: 25:11).

Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 38
Practical Application: Robust Beamforming With Channel Uncertainty for Wireless
(Refer Slide Time: 02:23)
looking at various types of Beamforming and in this module let us look at, another very,
important and very interesting and in fact, a very practical format of beamforming that is
termed as Robust Beamforming. And it is going to take, it is a little involved. So, it is
going to take a little time to explain this, but nevertheless let us start this topic of robust.

(Refer Slide Time: 00:38)

This is also termed as, Channel State Information. Now, the thing about Channel State
Information that is this knowledge of this channel vector h, it is not available a priori all
right, it is not available in the beginning which means, this has to be, somehow obtained
all right because the channel is something that is, varying with time. It depends on
.
several things, it depends on the scattering environment, it depends on the location of the
Now, let us go back and look at what we are doing in beamforming. Now, if you go back base station location of the user. In general, it depends on the environment and it is
and look we have this, multiple antenna arrays at the receiver. In fact, this can also be changing. So, this knowledge of this channel vector has to be acquired or in other words
there at the transmitter. So, you have a multiple antenna, array and you have a transmitter this channel vector h bar which we have been assuming implicitly to be known has to be
and you can have several interference also; that is what we have seen or you can have estimated initially.
secondary users and primary users. But essentially what we are doing is the following
So, that is the important aspects. So, this Channel State Information CSI which is,
thing. We have this, channel coefficient, corresponding to the l antennas which were
broadly termed as CSI has to be estimated and what this means is whenever there is an
given by h 1, h 2 so on up to h n.
estimation process, there is always going to be, Estimation Error ok. So, no estimation
process is hundred percent accurate which means, there is always the there is especially
in practical sonorous. There is always a receive dual Estimation Error that depends again So, we say this is your estimate h bar p and this is the uncertainty region, typically,
on various, various settings you can say. modelled as an ellipsoid. So, this is also known as an uncertainty ellipsoid. That is E and
what you say is that the true channel vector lies somewhere in this uncertainty ellipsoid
For instance, how high is the signal to noise power ratio, how fast is this average you
all right.
can. So, in general there is, Estimation Error all right there is error in the, available
knowledge of the channel state information or in general there is error in the available So, this is an ellipsoidal, uncertainty region all right. Around the channel estimate h bar e
estimate of the channel vector. So, you have your channel vector h bar ok. So, your and the true channel vector h bar lies somewhere in this uncertainty ellipsoid and
channel vector is, h bar, but this channel vector is, not known exactly frequently. So, depending on the nature of the uncertainty. The uncertainty is severe than the ellipsoid is
what is known is this estimate h bar and the channel vector is this estimate plus some you larger the uncertainty is smaller than the ellipsoid shrinks which means, the truth channel
can, think of this as error. vector is actually very close to the available estimate h bar ok. So, the True channel
vector h bar lies somewhere in the uncertainty ellipsoid.
So, you have this Error in the CSI estimate that is, Error in the estimate of the channel
state information. So, what is known is, this is the estimate or this is also known as the True channel vector h bar lies in this uncertainty ellipsoid and so, h bar is not known
Nominal, C S I something that is available on the face of it. So, this is Nominal channel exactly, but h bar lies in this Uncertainty Ellipsoid. So, how do we model this? So, we
state information or an estimate of the channel state information. And this is the true model this as follows.
underlying channel which is unknown
(Refer Slide Time: 08:12)
(Refer Slide Time: 05:27)

.
.
The true channel vector h bar belongs, to this, uncertainty ellipsoid. We know, how to
The true channel the True channel vectors are known, but what is known is an estimate model this Ellipsoid we have already seen this. So, the ellipsoid with centre h, bar e can
and what we know is in general, this underlying channel the True channel vector h bar is be modelled as h bar e plus some matrix P times u.
close to the estimate that is all we know. Now, how close that has to be characterized at?
Such that, that is this is a set of all h bar e plus matrix P times u such that norm of u bar
One of the ways to characterize that is as we have seen again we have seen this before is
is less than or equal to 1. This is the model for your uncertainty ellipsoid. This is the
to basically look at a region; around the estimate.
model for the uncertainty ellipsoid and you can see this is clearly an uncertainty. For (Refer Slide Time: 10:30)
instance, you can write this as this implies, what does this imply? This implies that your
h bar equals h bar e plus P times u bar which implies that, h bar minus h bar e equals P
times u bar which implies that, P inverse h bar, minus, h bar e equals u bar. Now, note
that norm u bar is less than or equal to 1.

(Refer Slide Time: 09:34)

Which implies that, h bar, minus, h bar e transpose P inverse transpose P minus transpose
P inverse into h bar minus h bar e less than or equal to 1; which implies that now you can
see that now, P minus. So, this you can think of this as P P transpose inverse.

So, I can write this as, h bar minus h bar e transpose some matrix A inverse h bar minus h
.
bar e less than equal to 1where this matrix A equals P P transpose and A is therefore, you
Which implies, norm P inverse, times h bar, minus h bar e, less than or equal to 1. Which can see this is a, P S D metric. In fact, is a P D matrix, positive, definite matrix because
implies that, norm square of P inverse h bar minus h bar is square norm square less than you are looking at A inverse.
equal to norm square l vector is nothing, but the vector transpose times it itself.
And, therefore, this you can clearly see therefore, this is the ellipsoid. In fact, this is the
ellipsoid ok; this is the ellipsoid, or h bar or rather the uncertainty ellipsoidFor h bar.
And, that actual vector, h bar lies somewhere in this Ellipsoid. Now, let us go back let us
revisit our original beamforming problem ok.
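The two descriptions of the ellipsoid above, the image of the unit ball under P and the quadratic form with A equal to P P transpose, can be checked numerically; the following is a small sketch with assumed values, not taken from the lecture.

```python
import numpy as np

# Check that a point h = h_e + P u with ||u|| <= 1 satisfies the equivalent
# quadratic-form description (h - h_e)^T A^{-1} (h - h_e) <= 1, A = P P^T.
# h_e, P and the dimension are assumed toy values; P is taken square and invertible.

rng = np.random.default_rng(2)
L = 3
h_e = rng.standard_normal(L)              # nominal (estimated) channel
P = 0.1 * rng.standard_normal((L, L))     # ellipsoid shape matrix
A = P @ P.T

u = rng.standard_normal(L)
u = rng.uniform(0, 1) * u / np.linalg.norm(u)   # a point with ||u|| <= 1
h = h_e + P @ u                                  # a channel inside the ellipsoid

d = h - h_e
print("||u|| =", np.linalg.norm(u))
print("(h - h_e)^T A^{-1} (h - h_e) =", d @ np.linalg.solve(A, d))  # equals ||u||^2
```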
(Refer Slide Time: 11:54)

So, our original beamforming problem is y bar equals h bar x plus n bar. Let us keep it general and say n bar contains the noise plus interference, N plus I, with covariance; remember, we said that if you have noise plus interference, instead of a white covariance you characterize it by a covariance matrix R, the expected value of n bar n bar Hermitian, or rather the expected value of n bar n bar transpose, is R. This is the noise plus interference covariance.

Now, what we have been doing so far is to assume h bar to be known exactly, and when we beamform we set W bar transpose h bar equal to 1, which, we said, basically ensures signal gain equals 1, and this is a very convenient framework. So, when you beamform with W bar, you ensure that the signal gain is constant; you have a constant signal gain while minimizing the noise part, because, we said, otherwise the solution is the trivial beamformer, that is, W bar equals 0. Therefore, we said W bar transpose h bar equals 1.

Now, the problem with this approach is that, if h bar is not known, then how are you going to enforce this condition, right? We do not know h bar. So, it is meaningless to say W bar transpose h bar equals 1, because h bar, the actual channel vector, the underlying CSI, is unknown. So, this is not possible when h bar is unknown; the first thing you have to realize is that this is only possible when the underlying channel vector h bar is known.

And therefore, now, what do we do? There is no way to ensure this condition; the previous one was W bar transpose h bar equals 1, which ensures unity gain for the true channel vector. But now what we do is modify this as follows: W bar transpose h bar greater than or equal to 1, for all h bar belonging to the uncertainty ellipsoid.

So, what this says is: you take the uncertainty ellipsoid, and for any h bar belonging to that uncertainty ellipsoid you are ensuring a minimum gain of unity. So, instead of just fixing the gain to unity for one particular channel vector h bar, you are ensuring that for all the channel vectors that belong to the uncertainty ellipsoid the minimum gain is unity.

(Refer Slide Time: 15:40)

So, thereby this ensures a minimum gain equal to unity for all channel vectors belonging to the uncertainty ellipsoid, and therefore, in that sense, it is robust! Now, why is this robust; remember, what is the definition of robust? Something is robust implies that it is strong, something that cannot be swayed very easily. When you say a person is robust, that means the person is resilient: even if the person is under an attack or something of that sort, the person has the ability to withstand it.

So, in that sense this optimization problem is robust, meaning that even if there is uncertainty in the channel vector h bar, which there is, this formulation is able to withstand it, because you are ensuring a minimum gain of unity not just for a single value of the channel vector, but for all the channel vectors that belong to this particular uncertainty ellipsoid. So, this robustness criterion makes sure that the designed beamformer is resilient, or that it can withstand the challenge of this uncertainty, this kind of impairment that arises because of the estimation error.

(Refer Slide Time: 18:06)

So, the robust framework ensures that the beamformer can tolerate uncertainty, which implies it is strong or robust, unlike what was done previously, which does not take the uncertainty into account; this one can withstand uncertainty, and therefore it is robust. And therefore, the robust beamforming problem can be formulated as follows. We similarly minimize the noise plus interference power, W bar transpose R W bar. So, once again you minimize the noise plus interference power, but now the constraint, instead of W bar transpose h bar equals 1, becomes W bar transpose h bar greater than or equal to 1, for all h bar belonging to the uncertainty ellipsoid. So, this is an interesting, you can say novel, robust beamforming problem: it ensures that you are minimizing the noise plus interference power while at the same time ensuring a minimum signal gain for all channel vectors that belong to the uncertainty set. So, this is your robust beamforming problem.

(Refer Slide Time: 20:02)

So, very interesting, and not just interesting, it has a lot of practical applications. Of course, all of the beamforming paradigms that we have seen so far have immense practical utility, but this one especially has significant practical utility because it takes into account the practical artefacts, the practical effects that arise in systems, such as the channel estimation error; this further enhances the practical utility. So, this has a significantly higher practical utility, since it takes into account practical effects such as the channel estimation error, and therefore it is robust.

And of course, as you must have already noticed, formulating the problem is one thing, but then we also have to solve the problem to derive the optimal beamformer. In that sense the problem has also become significantly more complicated; naturally, when you try to build increased capability into something, the paradigm becomes more complex. So, the robust beamforming problem is more involved than the previous beamforming paradigms. We are slowly building up the complexity: first we have seen beamforming, then beamforming with interference, then zero-forcing beamforming, and now robust beamforming, which in a certain sense encompasses all these paradigms and generalizes them to a scenario where the vector h bar is not known precisely. This is significantly more complex, and we are going to see the solution to this in the subsequent modules.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 39
Practical Application: Robust Beamformer Design for Wireless Systems

Hello. Welcome to another module in this massive open online course. So, we are looking at Robust Beamforming as an application of convex optimization, or the optimization framework that we have seen so far. So, let us continue our discussion.

(Refer Slide Time: 00:28)

We are looking at robust beamforming for a multiple antenna system. Remember, what beamforming does is focus the wireless signal in a particular direction, that is, form a beam in a particular direction. And robust beamforming is the paradigm where the channel is not known precisely; there is uncertainty in the channel knowledge, and we want to design a beamformer that is robust to that uncertainty. We said the robust beamformer can be designed as the solution to the following optimization problem: minimize w bar transpose R w bar, where w bar is the beamformer and R is the noise plus interference covariance matrix, subject to the constraint that w bar transpose h bar is greater than or equal to 1 for all h bar belonging to this ellipsoid, which is also termed the uncertainty ellipsoid.
(Refer Slide Time: 01:52)

And this ellipsoid is described as follows: it is the ellipsoid which has centre h bar e, that is, the nominal channel or the estimated channel. So, it is the set of h bar e plus P u bar such that norm u bar is less than or equal to 1. And now we want to solve this optimization problem to determine the optimal beamformer, the optimal robust beamformer if you will.

(Refer Slide Time: 02:30)

And for that, first of all, let us simplify this optimization problem, and this is where it can be done in a very interesting fashion, as described below. So, let us look at the constraint; the constraint can be simplified as follows. Remember, the constraint ensures a minimum gain of unity for all vectors h bar belonging to the uncertainty ellipsoid.

So, what is the constraint? The constraint is w bar transpose h bar greater than or equal to 1 for all h bar belonging to the uncertainty ellipsoid. Now you substitute for h bar: we have seen that h bar is simply h bar e, the estimated channel, plus P u bar, so the condition is w bar transpose times h bar e plus P u bar, greater than or equal to 1, for all h bar belonging to E. And "for all h bar belonging to E" becomes "for all vectors u bar such that norm u bar is less than or equal to 1", because the equivalent description is h bar equals h bar e plus P u bar with norm u bar less than or equal to 1. Now, this is the interesting part: this has to be true for all vectors u bar such that norm u bar is less than or equal to 1.

This has to be greater than or equal to 1, which implies that it also has to hold at the value of u bar where the left hand side is minimum. And conversely, if the minimum over all such u bar is greater than or equal to 1, that automatically implies that it is greater than or equal to 1 for all u bar with norm u bar less than or equal to 1. So, this can be written equivalently, and you can convince yourself of this: the minimum, over u bar such that norm u bar is less than or equal to 1, of w bar transpose times h bar e plus P u bar, must be greater than or equal to 1.

(Refer Slide Time: 05:01)
If this holds at the minimum, then it holds for all u bar, and this implies that I can now simplify further: take the minimum over all u bar such that norm u bar is less than or equal to 1, and expand the quantity as w bar transpose h bar e plus w bar transpose P u bar; the minimum is over this quantity, and it has to be greater than or equal to 1.

Now, w bar transpose h bar e is a constant, it does not depend on u bar, so it comes out of the minimization. This implies: w bar transpose h bar e, plus the minimum over norm u bar less than or equal to 1 of w bar transpose P u bar, greater than or equal to 1. Now, what we are going to do is set w bar transpose P as w tilde transpose, which implies that P transpose w bar is w tilde.

(Refer Slide Time: 06:32)

So, I will write this as w bar transpose h bar e plus the minimum, over all u bar such that norm u bar is less than or equal to 1, of w tilde transpose u bar, greater than or equal to 1. And now this is very interesting: if you observe, this is nothing but the dot product of w tilde and u bar. So, we have w tilde, we have this vector u bar, and we have their dot product. Now, when is the dot product between w tilde and u bar minimum? Remember, the dot product is maximum when u bar is perfectly aligned with w tilde, and the dot product is minimum when u bar is at 180 degrees, that is, completely in the opposite direction, a direction opposite to that of w tilde. So, the dot product is minimum when u bar forms a 180 degree angle with w tilde; let us mark that point with a star. So, we say w tilde transpose u bar is minimum when u bar is opposite, that is, forms a 180 degree angle with w tilde.

(Refer Slide Time: 08:35)

And therefore, what we say is that u bar star will be equal to minus w tilde, because the vector that is exactly opposite to w tilde is minus w tilde. However, we need norm u bar to be less than or equal to 1; therefore, we normalize by the norm of w tilde, that is it. So, u bar star is the unit norm vector that is opposite to w tilde, and this is precisely the u bar at which the minimum, over norm u bar less than or equal to 1, of w tilde transpose u bar is achieved. This is where the minimum occurs: when u bar is a unit norm vector exactly opposite in direction to w tilde. Therefore, the inner product is a negative number; the cosine of 180 degrees is minus 1.
(Refer Slide Time: 09:46)

So, this implies that the minimum of w bar transpose h bar e plus w tilde transpose u bar, over u bar such that norm u bar is less than or equal to 1, occurs when u bar equals minus w tilde divided by norm w tilde. Substituting that value of u bar, the second term becomes w tilde transpose times minus w tilde divided by norm w tilde, and the whole expression has to be greater than or equal to 1; if this is greater than or equal to 1, then, by implication, following this argument, the gain is greater than or equal to 1 for all vectors h bar belonging to that uncertainty ellipsoid. And now you see, minus w tilde transpose w tilde is minus norm w tilde square, and divided by norm w tilde that gives minus norm w tilde. So, this implies w bar transpose h bar e minus norm w tilde greater than or equal to 1. And now we substitute for w tilde: w tilde is nothing but, as we have seen earlier, P transpose w bar. So, this implies w bar transpose h bar e minus norm of P transpose w bar greater than or equal to 1. This is the equivalent constraint, or you can say the simplified constraint.

(Refer Slide Time: 11:30)

So, we have w bar transpose h bar e minus norm of P transpose w bar greater than or equal to 1, and this can also be rearranged as: norm of P transpose w bar less than or equal to w bar transpose h bar e minus 1. And now, if you look at this, it is very interesting: you can recall that the left hand side is a norm and the right hand side is affine, that is the affine portion. So, we have a norm less than or equal to something that is affine; this is, you can recall from the notes, a conic constraint. In fact, this constraint represents a conic region, or basically a cone; it is also known as a conic constraint. It is a very interesting constraint: it reduces to a cone, or a conic constraint. And therefore, the equivalent optimization problem to find the robust beamformer can now be formulated.
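As a quick numerical sanity check of the simplification just derived (with assumed toy data, not from the lecture), the worst-case gain over the ellipsoid, the minimum over norm u bar less than or equal to 1 of w bar transpose times h bar e plus P u bar, should equal w bar transpose h bar e minus norm of P transpose w bar:

```python
import numpy as np

# Verify min_{||u|| <= 1} w^T (h_e + P u) = w^T h_e - ||P^T w||, attained at
# u* = -P^T w / ||P^T w||.  All quantities below are assumed toy values.

rng = np.random.default_rng(3)
L = 4
w = rng.standard_normal(L)
h_e = rng.standard_normal(L)
P = rng.standard_normal((L, L))

w_tilde = P.T @ w
closed_form = w @ h_e - np.linalg.norm(w_tilde)

# brute-force the minimum over many random unit-norm u (the minimizer lies on the sphere)
U = rng.standard_normal((L, 100000))
U = U / np.linalg.norm(U, axis=0)
brute_force = (w @ h_e + w_tilde @ U).min()

print("closed form :", closed_form)
print("brute force :", brute_force)   # approaches the closed form from above
```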
(Refer Slide Time: 13:28)

The equivalent optimization problem for robust beamforming is: minimize w bar transpose R w bar, such that norm of P transpose w bar is less than or equal to w bar transpose h bar e minus 1. The objective is a second order, quadratic objective, and the constraint is a conic constraint; so this is known as a second order cone problem, basically an SOCP, a second order cone program, which is a very interesting aspect. So, the robust beamforming problem reduces to, or belongs to, a very interesting class of optimization problems termed second order cone programs, in which the objective function is second order and the constraint is a conic constraint. So, this is known as an SOCP problem, and it is a very interesting problem.

(Refer Slide Time: 15:09)

And it can be solved, and I will demonstrate it separately because it is a little involved. So, let us note that the robust beamforming problem is an SOCP. Thus, the robust beamforming problem is an SOCP, and the robust beamformer w bar, it can be shown (in fact, I will show this in a subsequent module), is minus lambda times, R plus lambda Q, inverse, of h bar e; this is the robust beamformer. Here lambda is the Lagrange multiplier, which has to be determined suitably.
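For completeness, here is a hedged sketch (not from the lecture) of how such an SOCP could be handed to a generic convex solver via the CVXPY modelling package, assuming it is installed; the data R, h bar e and P are toy values, and the cp.SOC constraint encodes norm of P transpose w less than or equal to h bar e transpose w minus 1.

```python
import numpy as np
import cvxpy as cp

# Sketch: solve  minimize w^T R w  s.t.  ||P^T w|| <= h_e^T w - 1  (an SOCP).
# R, h_e and P below are illustrative assumptions only.

rng = np.random.default_rng(4)
L = 4
g = rng.standard_normal(L)
R = 2.0 * np.outer(g, g) + 0.5 * np.eye(L)     # noise-plus-interference covariance
h_e = 3.0 * rng.standard_normal(L)             # nominal channel estimate
P = 0.05 * rng.standard_normal((L, L))         # small uncertainty ellipsoid

w = cp.Variable(L)
prob = cp.Problem(cp.Minimize(cp.quad_form(w, R)),
                  [cp.SOC(h_e @ w - 1, P.T @ w)])   # second-order cone constraint
prob.solve()

print("status:", prob.status)
print("worst-case gain h_e^T w - ||P^T w|| =",
      h_e @ w.value - np.linalg.norm(P.T @ w.value))   # should be about 1
```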
(Refer Slide Time: 16:54)

And the matrix Q depends on P and h bar e: basically, Q equals P P transpose minus h bar e h bar e transpose. Of course, you notice that P is the matrix corresponding to the uncertainty ellipsoid and h bar e is the nominal estimate of the channel. And this is the solution to the robust beamforming problem, that is, w bar, you can say w bar star: w bar star equals minus lambda times, R plus lambda Q, inverse, into h bar e.

So, I will conclude this module with this. The robust beamforming problem is a very interesting problem, which can be shown to be an SOCP, a Second Order Cone Program, and this is its solution. The derivation is slightly involved, and I will illustrate it in a separate module.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 40
Practical Application: Detailed Solution for Robust Beamformer Computation in Wireless Systems

Hello, welcome to another module in this Massive Open Online Course. So, we are looking at Robust Beamforming. We also looked at the optimization problem for robust beamforming and at the structure of the solution, although we could not derive it in detail. What I am going to do in this module is derive the optimal robust beamformer; as I have already told you, the derivation is slightly involved.

So, I advise you that, if you are going through this for the first time and you are not interested in knowing the intricacies of the derivation, you can skip this module. If you are interested in delving deeper, you can follow the derivation and understand how the robust beamformer is derived. I am going to illustrate the step by step procedure to derive this robust beamformer W bar, that is, the exact structure of the robust beamformer as well as the procedure to determine the Lagrange multiplier lambda.

(Refer Slide Time: 01:14)


So, we start with the robust beamforming problem. Again, let us just title this: this is your robust beamforming problem, minimize W bar transpose R W bar subject to the constraint that the norm of P transpose W bar is less than or equal to h bar e transpose W bar minus 1; this is our robust beamforming problem, we said. W bar transpose R W bar is the quadratic objective; of course, R is a positive semi definite matrix, it is the N plus I, noise plus interference, covariance. And this is a conic constraint: norm of P transpose W bar less than or equal to h bar e transpose W bar minus 1.

Now, what this implies, first of all, if you look at the constraint, is that P transpose W bar, norm square, is less than or equal to h bar e transpose W bar minus 1, whole square.
(Refer Slide Time: 03:06)

So, now, the Lagrangian can be formulated as follows: it will be W bar transpose R W bar plus lambda times the constraint, that is, norm P transpose W bar square minus, h bar e transpose W bar minus 1, whole square. This is the Lagrangian; it is obviously a function of W bar and lambda, and you can write it as F of W bar comma lambda. Now, differentiate, that is, take the gradient with respect to W bar; the gradient of F with respect to W bar.

(Refer Slide Time: 04:31)

The first term is W bar transpose R W bar, and this we know: its gradient is twice R W bar. Now, even before we take the gradient, let us simplify the rest a little further. The norm of a vector squared, P transpose W bar square, is the transpose of the vector times itself, so that will be W bar transpose P into P transpose W bar. Then there is minus the square of h bar e transpose W bar minus 1; expanding this square and distributing the minus sign gives minus h bar e transpose W bar whole square, minus 1, plus twice h bar e transpose W bar.

And further, h bar e transpose W bar is a scalar quantity, so I can write its square as the transpose of the quantity times itself, that is, W bar transpose h bar e times h bar e transpose W bar; basically, this is the magnitude square, and both these quantities are equal.

(Refer Slide Time: 06:11)

So, the objective, as I have said, is slightly involved, and one has to be careful to write each and every term at every stage: it is W bar transpose R W bar plus lambda times, W bar transpose P P transpose W bar, minus W bar transpose h bar e h bar e transpose W bar, minus 1, plus 2 h bar e transpose W bar. And now we take the gradient of this Lagrangian with respect to W bar.

This will be equal to: twice R W bar, we know this; plus lambda times, twice P P transpose W bar, minus twice h bar e h bar e transpose W bar, plus twice h bar e; of course, when you differentiate the minus 1 it gives 0. And this we set equal to 0.

(Refer Slide Time: 08:19)

And now, if you simplify this, setting the gradient of the Lagrangian with respect to W bar equal to 0: the factor of 2 cancels, so you can remove it. This gives R W bar plus lambda times, P P transpose minus h bar e h bar e transpose, into W bar, equals minus lambda h bar e. And now, if you set this matrix P P transpose minus h bar e h bar e transpose equal to Q, this implies R plus lambda Q, into W bar, equals minus lambda h bar e. And this gives us the equation W bar equals minus lambda times, R plus lambda Q, inverse, into h bar e. This, you can see, you can mark with a star: it is the optimal robust beamformer, which we have already seen.
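A quick way to convince yourself of this closed form is to plug it back into the stationarity condition numerically; the following sketch uses assumed toy data and a trial value of lambda (the correct lambda still has to come from the constraint, as discussed next).

```python
import numpy as np

# Check that w(lambda) = -lambda (R + lambda Q)^{-1} h_e, with Q = P P^T - h_e h_e^T,
# makes the Lagrangian gradient 2 R w + lambda (2 P P^T w - 2 h_e h_e^T w + 2 h_e) vanish.
# R, h_e, P and the trial lambda are assumed values for illustration only.

rng = np.random.default_rng(5)
L = 4
g = rng.standard_normal(L)
R = 2.0 * np.outer(g, g) + 0.5 * np.eye(L)
h_e = 3.0 * rng.standard_normal(L)
P = 0.05 * rng.standard_normal((L, L))
Q = P @ P.T - np.outer(h_e, h_e)

lam = 0.7                                          # trial Lagrange multiplier (assumed)
w = -lam * np.linalg.solve(R + lam * Q, h_e)

grad = 2 * R @ w + lam * (2 * P @ (P.T @ w) - 2 * h_e * (h_e @ w) + 2 * h_e)
print("norm of the gradient:", np.linalg.norm(grad))   # ~0 up to round-off
```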

(Refer Slide Time: 10:23)


So, this is the formal derivation; this is the optimal robust beamformer. And this, of course, depends on the Lagrange multiplier lambda, and therefore what we have to do is determine the Lagrange multiplier lambda to complete the derivation; that part is a little bit involved. So, the Lagrange multiplier lambda has to be determined. How to find the Lagrange multiplier lambda? For that we use the constraint.

(Refer Slide Time: 11:47)

And what is the constraint? Remember, our constraint is that norm P transpose W bar square, minus, h bar e transpose W bar minus 1, whole square, equals 0. This implies that W bar transpose P P transpose W bar, minus W bar transpose h bar e h bar e transpose W bar, plus twice h bar e transpose W bar, equals 1. So, basically this is the constraint. Now, again, you can combine P P transpose minus h bar e h bar e transpose: what you get is W bar transpose times, P P transpose minus h bar e h bar e transpose, times W bar, which is nothing but our matrix Q that we had seen; plus, of course, the other term, twice h bar e transpose W bar, equals 1.

(Refer Slide Time: 13:18)

So, now you can substitute this; this implies that W bar transpose Q W bar plus twice h bar e transpose W bar equals 1, very good. Now, what we are going to do is substitute the solution for the optimal beamformer, that is, remember, W bar optimal is minus lambda times, R plus lambda Q, inverse, h bar e. Once you substitute this expression for the optimal beamformer, what you get is the following: lambda square, h bar e transpose, R plus lambda Q inverse, Q, R plus lambda Q inverse, h bar e (it is a slightly lengthy expression), minus twice lambda, h bar e transpose, R plus lambda Q inverse, h bar e, minus 1, equal to 0; you can easily verify it by substituting.

Now, what we are going to do is substitute R equals G G transpose. Remember, R is the noise plus interference covariance matrix, so it is a PSD matrix, a covariance matrix, and every positive semi definite matrix can be decomposed as some matrix G times G transpose.
(Refer Slide Time: 15:28)

Now, substituting this above, that is, employing this in the above equation, what we have gets further involved: lambda square h bar e transpose G minus transpose (I plus lambda G inverse Q G minus transpose) inverse G inverse Q G minus transpose (I plus lambda G inverse Q G minus transpose) inverse G inverse h bar e minus twice lambda h bar e transpose G minus transpose (I plus lambda G inverse Q G minus transpose) inverse G inverse h bar e minus 1 equal to 0. Here G minus transpose means the transpose of G inverse; you can think of it as G transpose inverse or G inverse transpose, both are the same. So, this is the equation that you get after you substitute R equal to G G transpose.

(Refer Slide Time: 17:45)

Now, we are going to do a further substitution, which will actually simplify this, and we are going to see how. If you look at the matrix G inverse Q G minus transpose, it looks like this matrix is the key, because it appears repeatedly. So, for G inverse Q G minus transpose we employ its eigenvalue decomposition, that is, V gamma V transpose. This is the eigenvalue decomposition of the matrix G inverse Q G inverse transpose, and the matrix gamma is the diagonal matrix of eigenvalues: gamma 1, gamma 2, and so on, up to whatever is the dimension, which is gamma L.
(Refer Slide Time: 19:04)
Now, this can be simplified as follows. If you look at the vector V inverse G inverse h bar e, we give it a name: we set h bar r equals V inverse G inverse h bar e. And this implies that the equation above reduces to the following: lambda square h bar r transpose (I plus lambda gamma) inverse gamma (I plus lambda gamma) inverse h bar r minus twice lambda h bar r transpose (I plus lambda gamma) inverse h bar r minus 1 equal to 0.

And now you can see this has a very simple structure, because all these matrices are diagonal. I plus lambda gamma, if you look at it, is a diagonal matrix, since gamma is a diagonal matrix; its entries will be 1 plus lambda gamma 1, 1 plus lambda gamma 2, and so on. So, all these matrices are diagonal matrices, and you can basically multiply everything out and simplify this.
This is the diagonal matrix of eigenvalues.

(Refer Slide Time: 22:44)

So, once you employ the substitution that we outlined above, namely G inverse Q G minus transpose, that is, G inverse Q G inverse transpose, equals V gamma V transpose, this reduces to: lambda square h bar e transpose G minus transpose V minus transpose (I plus lambda gamma) inverse gamma (I plus lambda gamma) inverse V inverse G inverse h bar e minus twice lambda h bar e transpose G minus transpose V minus transpose (I plus lambda gamma) inverse V inverse G inverse h bar e minus 1 equal to 0.

(Refer Slide Time: 21:02)

And once you simplify it, the resulting equation that you get will be the following. You get the final equation for the Lagrange multiplier lambda, and interestingly it is given as: lambda square times the summation over i equal to 1 to L of h r i square gamma i divided by (1 plus lambda gamma i) square, minus twice lambda times the summation over i equal to 1 to L of h r i square divided by (1 plus lambda gamma i), minus 1, equal to 0. Here h r i denotes the ith component of the vector h bar r, and all the (1 plus lambda gamma i) terms are coming from the (I plus lambda gamma) inverse; you can clearly see that. And this is the equation for lambda.

So, we solve this equation to determine lambda; h r i, as already explained, equals the ith element of the vector h bar r which we defined above. Once you find lambda, you substitute it in your earlier equation for the optimal beamformer, which is nothing but W bar star equals minus lambda (R plus lambda Q) inverse h bar e.
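As a sanity check on this algebra, the scalar equation in lambda is exactly the constraint W bar transpose Q W bar plus twice h bar e transpose W bar minus 1 evaluated at W bar equals minus lambda (R plus lambda Q) inverse h bar e. The short sketch below (plain NumPy, not from the lecture; R, Q, h bar e and the test value of lambda are arbitrary stand-ins) verifies this numerically.

    import numpy as np

    # Illustrative stand-ins (not lecture data): a positive definite covariance R,
    # a symmetric Q = P P^T - h_e h_e^T, and an estimated channel h_e for L antennas.
    rng = np.random.default_rng(0)
    L = 4
    G0 = rng.standard_normal((L, L))
    R = G0 @ G0.T + L * np.eye(L)
    P = rng.standard_normal((L, 2))
    h_e = rng.standard_normal(L)
    Q = P @ P.T - np.outer(h_e, h_e)

    # R = G G^T (Cholesky), eigendecompose G^{-1} Q G^{-T} = V diag(gamma) V^T,
    # and form h_r = V^{-1} G^{-1} h_e (V is orthonormal, so V^{-1} = V^T).
    G = np.linalg.cholesky(R)
    Ginv = np.linalg.inv(G)
    gamma, V = np.linalg.eigh(Ginv @ Q @ Ginv.T)
    h_r = V.T @ (Ginv @ h_e)

    def scalar_equation(lam):
        """Left hand side of the final equation in lambda derived above."""
        return (lam**2 * np.sum(h_r**2 * gamma / (1 + lam * gamma)**2)
                - 2 * lam * np.sum(h_r**2 / (1 + lam * gamma)) - 1)

    def constraint_residual(lam):
        """W^T Q W + 2 h_e^T W - 1 at the beamformer W = -lambda (R + lambda Q)^{-1} h_e."""
        W = -lam * np.linalg.solve(R + lam * Q, h_e)
        return W @ Q @ W + 2 * h_e @ W - 1

    lam_test = 0.3   # arbitrary test value
    print(scalar_equation(lam_test), constraint_residual(lam_test))  # the two values coincide

So root-finding the scalar equation in lambda (for instance by a simple bisection) and substituting the root back into W bar star gives the robust beamformer numerically.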
(Refer Slide Time: 25:13)

And this gives you the optimal beamformer. So, that basically gives the equation for lambda; you substitute its solution to get the optimal robust beamformer. In fact, this is not just any beamformer, this is the optimal robust beamformer, which is robust to the uncertainty in the channel state information, that is, the uncertainty in the knowledge of the channel coefficients or the channel vector in the multi antenna system. All right, so we will stop here and continue with other aspects in the subsequent modules.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 41
Linear modeling and Approximation Problems: Least Squares

Hello, welcome to another module in this massive open online course. So, in this module, let us look at another class of optimization problems, specifically pertaining to linear modeling and approximation, which arise very frequently in various applications in both engineering and science. This is a very important class of problems.

(Refer Slide Time: 00:38)

This is termed as linear modeling: wherever you have models, most frequently they are modeled as linear models. So, we have linear modeling and approximation problems. Now consider the linear model; the general linear model can be described as follows. Consider the linear model y bar equals A x bar, where A is in general an m cross n matrix, which implies of course that x bar is an n cross 1 vector and y bar is naturally an m cross 1 vector.
(Refer Slide Time: 02:18)

Now, what we typically assume in this model is that the vector x bar is unknown and has to be determined. So, x bar is unknown and it has to be determined. Now, let us start by considering a simple example. Let us say A is a square matrix, that is, m is equal to n, where m is the number of rows and n is the number of columns of A. If you look at this system, m is basically the number of equations, since it is the dimension of y; so m equals the number of equations, and n equals the number of unknowns.

(Refer Slide Time: 03:31)

Now, let us assume a simple scenario to begin with: if m equals n and A is invertible, then I can determine x hat, the estimate of x, as A inverse y bar. This, I think, is the typical solution for a linear system, which most students would know.

Now, however, frequently what we have is that y bar is not exactly equal to A x bar; actually, it is A x bar plus n bar, where n bar is the noise. Still, I write the linear model as y bar equals A x bar, where A is your m cross n matrix with m greater than n. So, what that means is that the number of equations is greater than the number of unknowns.
(Refer Slide Time: 04:57)

And this also implies that this is basically an over determined system, because the number of equations is greater than the number of unknowns. Now, for an over determined system, frequently no solution exists unless the vector y bar belongs to the column space of A; and typically, because of the noise in the system, a solution does not exist. For instance, you can take a simple example. Let us take m equal to 3 and n equal to 2, so A is a 3 cross 2 matrix. Then we have y 1 equals a 11 x 1 plus a 12 x 2; y 2 equals a 21 x 1 plus a 22 x 2; y 3 equals a 31 x 1 plus a 32 x 2.

(Refer Slide Time: 06:31)

Now, you can see these represent three lines. So, when you plot these, you have three lines: each equation represents a line. Now, there is no solution unless all the lines intersect at a common point. So, you have three lines, that is, three equations in two unknowns, and there is no solution unless all the lines intersect at a single point.

So, you have three equations and basically two unknowns, and frequently, if you take three lines at random, they will naturally not intersect; it is highly unlikely that they all intersect at a single point, which means there will be no solution. So, in such a scenario you will have to find an approximate solution, by which I mean some solution that best fits the model or best explains the observed vector y. This is also known as the maximum likelihood vector x, so it is interesting.
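To see this concretely, the following small sketch (NumPy, with random stand-in coefficients; it is only an illustration, not lecture code) builds three random equations in two unknowns, confirms that no exact solution exists, and then computes the best-fit point discussed next.

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.standard_normal((3, 2))      # three equations (three lines), two unknowns
    y = rng.standard_normal(3)

    # With random coefficients the three lines almost surely do not meet at one point,
    # so the augmented matrix has higher rank than A and no exact solution exists.
    print(np.linalg.matrix_rank(np.column_stack([A, y])) > np.linalg.matrix_rank(A))  # True

    x_best, residual, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(x_best, residual)              # best-fit x and its nonzero squared error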
(Refer Slide Time: 08:00)

So, typically this happens because of the noise in the model. Typically, what happens is that you have y bar equals A x bar plus n bar, where n bar is your model noise. And because of this model noise, your y bar is not equal to A x bar, which means that if you form y bar minus A x bar, then, because there is no solution, for any x bar there is always an error. This is basically your model error, or your approximation error vector. So, you can also call this the error vector, which is basically the approximation error.

(Refer Slide Time: 09:11)

And we have to find the best x bar, that is, the one which best explains y bar. So, what do you do? You minimize the approximation error. Now, the error is of course a vector. What does it mean to minimize the error? We will simply minimize the norm of this error vector, which is basically equal to minimizing the norm of y bar minus A x bar, which in turn is equivalent to minimizing the norm square of y bar minus A x bar.

And this is the two norm square, the l two norm. And when you minimize the norm square, the vector x bar which gives you the least squared error norm is known as the least squares solution, and this is known as the least squares problem.
(Refer Slide Time: 10:55)

And this is very popular in communication; this arises very frequently in communication and signal processing applications, and we are going to show some examples (Refer Time: 11:39). In fact, if you look at it, this is nothing but a quadratic objective function, or it is also termed a quadratic program. So, this problem, minimize norm of y bar minus A x bar square, has a quadratic objective function; it is termed a quadratic program, or basically a QP. And finding the solution of this QP gives the best estimate; this is also known as the maximum likelihood estimate. So, the solution of the least squares problem gives the maximum likelihood estimate; in fact, strictly speaking, the maximum likelihood estimate in Gaussian noise.

(Refer Slide Time: 12:28)

So, if one asks the question, what is the vector x bar which best explains y, which has the maximum likelihood of having occurred, then that is the vector x bar which is the solution to the least squares problem, that is, the one which gives the least squared norm of the error vector. And this can be solved as follows; it is not very complicated. So, we want to find the least squares solution, and that can be found as follows. We want to minimize norm y bar minus A x bar square.

(Refer Slide Time: 14:12)
So, what we do is the following: remember, the norm square of a vector is nothing but the vector transpose times itself. So, this is (y bar minus A x bar) transpose times (y bar minus A x bar), which is equal to (y bar transpose minus x bar transpose A transpose) into (y bar minus A x bar), which is equal to y bar transpose y bar minus x bar transpose A transpose y bar minus y bar transpose A x bar plus x bar transpose A transpose A x bar.

Now, if you look at the two middle quantities, they are the same: they are scalar quantities and transposes of each other, x bar transpose A transpose y bar and y bar transpose A x bar. So, you can take twice one of these, and this is finally simplified as y bar transpose y bar minus twice x bar transpose A transpose y bar plus x bar transpose A transpose A x bar.

(Refer Slide Time: 15:42)

Now, call this the objective function F; here you have no constraint, you have only the objective function of x bar. So, you take the gradient of F with respect to x bar. The gradient of y bar transpose y bar with respect to x bar is 0. For the term minus twice x bar transpose A transpose y bar, you can treat A transpose y bar as a vector c bar; the gradient of x bar transpose c bar is simply c bar, so this term contributes minus twice A transpose y bar. For x bar transpose A transpose A x bar, you can treat A transpose A as a matrix P, which is positive semi definite; the gradient of x bar transpose P x bar is twice P x bar, that is, twice A transpose A x bar.

(Refer Slide Time: 17:16)

Now, you set this gradient equal to 0 to find the optimal value. The 2's cancel, and what you get is A transpose A x bar equals A transpose y bar.

And what this implies is that the optimal value of x, which you can call x hat as is typically done in the context of estimation, is x hat equal to (A transpose A) inverse into A transpose y bar, assuming A transpose A to be invertible. We can see that A transpose A will always be a square matrix, because A is m cross n and A transpose is n cross m, so A transpose A will be n cross n, and (A transpose A) inverse will also be n cross n. So, this is the least squares solution.
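This closed form is easy to verify numerically. The sketch below (NumPy; the matrix sizes, true x bar and noise level are arbitrary stand-ins, not from the lecture) solves the normal equations A transpose A x hat = A transpose y bar and cross-checks the answer against a library least squares routine.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 50, 3                                   # over determined: more equations than unknowns
    A = rng.standard_normal((m, n))
    x_true = np.array([1.0, -2.0, 0.5])
    y = A @ x_true + 0.1 * rng.standard_normal(m)  # linear model y = A x + noise

    # Least squares solution from the normal equations: x_hat = (A^T A)^{-1} A^T y.
    x_hat = np.linalg.solve(A.T @ A, A.T @ y)

    # Cross-check with NumPy's built-in least squares solver.
    x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(x_hat, x_ls)                             # the two estimates coincide up to round-off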
The solution is very well known, and it is taught in a lot of courses; in fact, the analysis of this problem forms one of the staples of several courses, alright. And I think this is one of the most important optimization problems, with various applications that we are going to encounter in this course. So, we will stop here, and continue in other modules.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 42
Geometric Intuition for Least Squares


Hello, welcome to another module in this massive open online course. So, we are looking at
the least squares optimization problem and we also derived the least squares solution that
is the solution to the least squares optimization problem.

(Refer Slide Time: 00:26)

So, let us continue our discussion so, we are looking at the least squares which is a very
important optimization problem. You can also think this is an approximation or modeling
problem.

And what we have seen is that when we have an over determined system of equations y bar equals A x bar, with A being an m cross n matrix and m greater than n, this is over determined, correct. Therefore, to summarize, this cannot be solved exactly, and so what we do is minimize norm y bar minus A x bar square; this is termed as the least squares problem.

And the solution to this least square solution we have derived.


(Refer Slide Time: 01:30)

So, A transpose A is n cross n and A transpose is n cross m, which implies that if you look at (A transpose A) inverse into A transpose, you can easily see that it is an n cross m matrix, with m greater than n; so it has more columns than rows. But the interesting thing here is that A itself, being a non square matrix for m greater than n, is not invertible.

Now, we are considering a scenario in which m is greater than n, which means the number of rows is much greater than the number of columns, so A looks tall. This is also known as a tall matrix: the height, that is, the number of rows of the matrix, is much larger than the number of columns, so the matrix looks like a tall matrix. Now, a non square matrix obviously does not have an inverse, but you can observe a very interesting aspect.

(Refer Slide Time: 04:14)

We derived this previously, that is, x hat equals (A transpose A) inverse A transpose into y bar, ok. Now, let us look at some salient aspects of this solution; consider the matrix (A transpose A) inverse into A transpose.

Now, consider the size of this matrix. A is m cross n, so A transpose is n cross m, and A transpose A is therefore n cross n; its inverse (A transpose A) inverse is also n cross n, and multiplying it by A transpose, which is n cross m, the overall matrix (A transpose A) inverse into A transpose is n cross m.

(Refer Slide Time: 03:05)

That is, if you look at this matrix (A transpose A) inverse into A transpose and you take its product with A: A itself, for m greater than n, is not invertible, but look at what happens when you multiply (A transpose A) inverse into A transpose with A.

We have (A transpose A) inverse times A transpose A, which is the identity. So, even though A is not invertible, it is as if this matrix (A transpose A) inverse into A transpose is acting, is behaving, as an inverse of A: when multiplied on the left with A, it gives the identity.

(Refer Slide Time: 06:48)

It is not really an inverse, because A is not invertible when m is greater than n, but it is behaving as an inverse; this is therefore known as the pseudo inverse of A. Pseudo describes a quantity which is not actually the quantity, but gives the appearance of that quantity, ok.
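A short numerical check of this left inverse behaviour (a NumPy sketch with arbitrary random data, not from the lecture): for a tall A, the matrix (A transpose A) inverse A transpose times A is the identity, while the product in the other order is not; the same ingredients also give the projection matrix that appears later in this lecture.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((6, 3))              # tall matrix, m > n, so A itself has no inverse
    A_dagger = np.linalg.inv(A.T @ A) @ A.T      # pseudo (left) inverse (A^T A)^{-1} A^T

    print(np.allclose(A_dagger @ A, np.eye(3)))  # True: behaves as a left inverse of A
    print(np.allclose(A @ A_dagger, np.eye(6)))  # False: it is not a right inverse

    # The reverse product is the projection matrix P_A = A (A^T A)^{-1} A^T onto the
    # column space of A; it is idempotent, which is the property discussed later on.
    P_A = A @ A_dagger
    print(np.allclose(P_A @ P_A, P_A))           # True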

(Refer Slide Time: 05:40)

So, this quantity (A transpose A) inverse into A transpose is termed the pseudo inverse of A, and remember it is also a left inverse, because the property holds only when you multiply it on the left; it is a left inverse of the matrix. Now, to explore this nature further and to get some intuition behind the solution, what we want to do is get some intuition behind the least squares solution, and the intuition is very interesting.

The intuition behind the least squares solution is the following: if you look at our problem, we have y bar equals A x bar. But we know that there does not exist any x bar, that is, this does not have any solution, which implies that no matter which x bar you choose, it will not satisfy y bar equals A x bar. This means y bar minus A x bar will always be non-zero, and it can be denoted by the vector e bar.

So, there is no vector x bar; this is an over determined system. Remember, that is what we said: if you consider 3 equations in 2 variables, you have 3 lines, and unless they all intersect at a point, there is no solution. So, there will always be an approximation error here; let us denote that by e bar. So, this is the approximation error.
So, this is the approximation error. Now, therefore, y bar minus A x bar equals e bar this
is a approximation error now let me write it a little explicitly. So, this is will be y bar
minus A is an m cross n matrix which means it has n columns. So, these I can denote by
a 1 bar a 2 bar upto a n bar times x 1 x 2 up to x n. So, this is your matrix A and this is x
bar.
(Refer Slide Time: 08:38) (Refer Slide Time: 10:27)

So, these are the n columns n columns of the matrix A and so, this is equal to e bar and. So, which means therefore, this lies in the subspace therefore, this lies in the subspace
So, this implies that y bar minus now we should multiply this out. So, you will get a 1 spanned by columns of A. what you have over here now to represent this pictorially if
bar times x 1 times a 1 bar plus x 2 times a 2 bar. you take let us say this plane remember a subspace a plane is nothing, but a subspace.
So, let us say you have the subspace which is spanned just to give you an idea I am
So, this minus x 1 times a 1 bar plus x 2 times a n a 2 bar plus so on x n times a n bar this
presenting the subspace by a plane.
is equal to e bar where a vector and now if you look at this, if you look at this x 1 times a
1 bar explicit x 2 times a 2 bar times so, on until x n times m. And now this is nothing, So, let us say this is the subspace spanned by columns of A. Let us say this subspace by
but a linear combination of the columns of matrix A. What this is? Is this represents a columns of A and you have your vector y bar which does not necessarily lie in the
linear combination a linear combination of the columns of the matrix A. subspace. And you are trying to form an approximation which lies in this subspace. So,
this is your approximation let us denote this approximation this approximation by y hat.
So, now, you have this linear combination of the columns which implies that this
approximation that is A times x bar always lies in the subspace spanned by the columns Ah this approximation let us say this approximation is let us say you denote this by y hat
of A. Remember linear combination you can consider all the linear combinations of the and this approximation and let us say you are let me just. So, this is your a 2 bar and this
these columns you get the subspaces. approximation is let us say you denote this by y hat. Now, this y minus y hat this is a if
you look at this, this is the corresponding error.
(Refer Slide Time: 12:51) (Refer Slide Time: 15:35)

Ok, So, this is your corresponding error vector ok. So, this y minus y hat this is your So, this e bar is orthogonal when is this orthogonal to the subspace this implies that e bar
corresponding error. And therefore, now e bar equals y bar minus y hat and what is this? has to be orthogonal to each of the vectors in the subspace this means e bar has to be
This is the distance from y bar to the subspace or you can say the plane, the plane that is perpendicular to a 1 bar, a 2 bar, up to a n bar. So, e bar has to be perpendicular to the
spanned plane or plane let us make it simple plane containing a 1 bar a 2 bar up to. So, subspace and.
what do you think of this as basically you have a vector y bar and you have this plane
We know the condition for orthogonality the condition 2 vectors are orthogonal when
that contains y, a 1 bar a 2 bar a n bar already in this plane contain different possible
their inner product is 0. Which means we must have we must have we must have a 1 bar
approximations y hat.
transpose e 1 bar equal to 0, a 1 bar transpose or a 1 bar transpose e bar sorry there is
Now, what is the error? Error is the distance between this vector y bar and y hat which only e bar, e bar equal to 0, a 2 bar transpose e bar equal to 0.
lies in the plane. And we want to find the error the vector error I mean we want to
minimize this error that is we want to make the error vector e bar such that it has the
minimum. Or in other words the distance of y bar from this plane has to be minimum.

And now you can see this error is the distance of y bar from this plane is minimum when
the error is perpendicular to the plane that is the whole point. So, this error which is
nothing, but the distance geometrically you can see this error is minimum when e bar is
perpendicular to the. This is the important ideas so this error vector is minimum when it
is perpendicular to the subspace that is spanned by a 1 bar a 2 bar up to a n bar. Or we
can also say that this error vector is orthogonal to the subspace this is a big idea.
(Refer Slide Time: 16:23) But, e bar is y bar minus A x bar so, this implies A transpose times y bar minus A x bar
equal to 0. This implies that now you observe something interesting A transpose y bar
equals A transpose A into x bar which is a condition that we have already seen which
implies that x bar or the best vector x bar that minimizes the error that is x hat equals A
transpose A inverse A transpose into y bar ok. So, this implies that x hat equals A
transpose A inverse into A transpose into y bar. So, this implies this is nothing, but again
you get the least square solution which is basically exactly.

So, intuitively what the least square solution is doing is basically finding the vector y hat
which is the best approximation to y bar in the subspace that is spanned by the vectors a
1 bar a 2 bar up to a n bar which are basically nothing, but the columns of this matrix A.
And therefore, what now what is this so, therefore, that so, therefore, which implies the
error vector e bar which is the distance of y bar to y hat or basically distance of y bar to
the plane is minimum when the error vector is perpendicular to the plane which implies
So, on a n bar transpose e bar equal to 0, that is e bar has to be orthogonal to all these
that the error vector has to be perpendicular to all the vector a 1 bar, a 2 bar, a n bar
vector a 1 bar, a 2 bar upto a n bar. And now you can write this you can put write this as
which spanned the plane.
a matrix. So, this implies basically now you can concatenate these condition as a matrix.
So, this implies a 1 bar transpose, a 2 bar transpose so, on upto a n bar transpose into e Now, in addition what is y hat? Now, we can see y hat that is.
bar equal to 0.
(Refer Slide Time: 19:15)
And this implies well this is nothing, but now you can see this in nothing, but the matrix
A transpose. So, this implies A transpose e bar equal to 0.

(Refer Slide Time: 17:13)

Approximation to y, y hat is nothing, but a times x hat which is equal to A times A


transpose A inverse A transpose into y. Now, what is y bar? Now, what is y hat?
Remember, y hat is the best approximation. Now, if you look at this plane again, go back
and look at this plane again this is your vector y bar, this is your vector y hat and the look P A times P A that gives you A A transpose A inverse A transpose times multiplied
error vector is orthogonal. And the resulting error vector is orthogonal ok. by this A A transpose A inverse into A transpose.

And therefore, what is y hat now y hat you can see is the best approximation to y bar in So, this is A multiplying it by P A and you can see now you have A transpose A transpose
the plane or you can also say y hat is the projection of y bar in the plane or subspace A inverse. So, these things cancel and what you are left with is again you can see A A
containing. So, y hat equals projection in the best approximation, in the subspace of y. y transpose A inverse into A transpose which is P A. So, this satisfies the property P A
hat is a projection of y bar in the subspace or spanned by a 1 bar a 2 bar a n bar this square equals P A.
matrix which is giving you the projections. So, when you multiply this matrix by y bar
In fact, P A square equals P A, P A raise to the power of n correct any integer n right any
you get the projection. So, this implies that this matrix is the projection matrix.
integer n greater than or equal to 1 is P A. So, this is the projection matrix and this is
This implies that A into A transpose A inverse A transpose is the projection matrix and basically the intuitive or the intuition behind the least square solution which sheds which
the projection matrix for what? Projection matrix for the subspace that is spanned by a 1 basically very conducive to sort of intuitively understanding the reasoning and the
bar, a 2 bar, a n bar that is it. methodology behind the least square solution all right. So, we will stop here and
continue in the subsequent modulus.
(Refer Slide Time: 21:40)
Thank you very much.

So, this matrix which is very interesting, this matrix which is the projection matrix that is
spanned this is basically given as A equals A transpose A inverse into A transpose. This is
the projection matrix corresponding to the subspace that is spanned by the columns of
the matrix A ok.

So, this is basically your projection matrix and this you can see this is very interesting
properties. One of the most interesting properties of the projection matrix is that if you
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:39)
Prof. Aditya K Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 43
Practical Application: Multi antenna channel estimation

Hello. Welcome to another module in this Massive Open Online Course. So, we are
looking at least square. So, we are looked at several aspects including the intuition
behind this least square. Let us look at some Practical Applications of this least squares.

(Refer Slide Time: 00:25)

Now, if you look at this typically what we said is now let us consider now let us consider
a system with multiple transmit antennas. So, you have a transmitter and this has
multiple antennas. So, have a transmitter let us say you have L antennas at the transmitter
and for simplicity let us say you have a single receive antenna. Now, you have the
channel coefficient since there L antennas there are L channel coefficient. So, you have h
1, h 2 up to h L these are the L channel coefficients, alright. This is a channel and this
channel coefficients are unknown alright comprises the alright and we also said this is
the channel state information or CSI which is unknown and has to be estimated.

So this channel vector if you call this as h bar this is this channel vector which basically
In fact, what we have said is this something that arises very frequently in signal constitutes the CSI, the channel state information and that has to be estimated. So, this is
processing and communication. So, we have the practical application of least squares and basically your h bar equals CSI.
what we want to do in this is; well, let us consider the first problem is related to wireless
communication. Of course, one of the interesting very interesting problems is that in a
wireless communication system is that of channel estimation in particular let us look at
the problem of multi antenna channel. Let us look at the problem of multi antenna
channel estimation.
(Refer Slide Time: 03:22) (Refer Slide Time: 05:10)

This is the channel state information and the CSI has to be estimated the CSI has to be What is x i of k? Now, remember x i of k you can think of this x i of k, equals pilot
estimated. Now, to estimate the channel state information we transmit pilot symbols, symbol on the i-th antenna i-th transmit antenna pilot symbol transmitted on i-th antenna,
alright. Now, this pilot symbols are symbols that are known at the receiver. So, implies. pilot symbol transmitted on i-th antenna at time instant k. So, therefore, I can get this as
So, to estimate the CSI implies we have to transmit pilot symbols. So, in this case pilot this implies y k equals, well I can write this as the vector. So, this is your h bar transpose
vectors you have to transmit pilot vectors. channel vector is h bar times x bar k. So, h bar transpose x bar k plus the noise sample n
of k.
Now, let us say x k is the k-th pilot vector or pilot vector at time instant k; pilot vector at
time instant k. So, this implies that you are transmitting pilot vector time instant k. So, (Refer Slide Time: 06:32)
what we have is I can write the receive symbol y of k at time instant k equals the channel
coefficients h 1, h 2, h l times x 1 k, x 2 k, up to x l k plus well the noise sample plus n of
k.
Therefore, now if you transmit let us say M pilot vectors, which you can also try and (Refer Slide Time: 09:08)
now remember considering real vector this can also be written as x bar transpose times h
bar plus n k, I can write it as h bar transpose x bar k or x bar k transpose times h bar.
Now, considering the transmission of M pilot vectors so, let us say you transmit M pilot
vectors.

(Refer Slide Time: 07:14)

So, next what you have is, this is your model for channel estimation. So, what you will
have is you will have y bar equals X times h bar plus n bar. So, this is the model for
channel estimation. In fact, channel estimation model for a; this is the channel estimation
model for the multi antenna system and in fact, this X is the pilot matrix you can denote
it by X or X p the same thing. So, this X equals the pilot matrix and this is of size M
cross L.
I can write the equivalent system as y 1 equals x bar transpose 1 into h bar plus n 1
similarly, y 2 equals x bar transpose 2 into h bar plus n 2 so on y M equals x bar And, this vector h bar remember this is the CSI, this is unknown. This vector h bar which
transpose M into h bar plus n of m. And, now what I can do is I can make a matrix out of represent the CSI this is unknown and now to estimate this h bar remember we do not
this. So, this will become a vector this will become your M cross n, you can call this as know there are of course, there is noise. This is the noisy observation model. So, you
your M cross n output vector this will call you can call this as your pilot matrix. In fact, have to look at the best vector h bar which explains the observation observed vector y
this has M rows each of size L. bar corresponding to the transmitted pilot symbols in the matrix X.

So, this will be pilot matrix x will be of size M cross L h bar is a channel vector which is
of size L cross 1 and now, once you concatenate this noise elements, you will have this
will be n bar of size also again M cross 1.
(Refer Slide Time: 10:49) greater than equal to L for estimation of this multi antenna channel. So, this is an
interesting application. In fact, one of the most how; what are the most I would say one
of the most popular application, in fact, one of the most practical viable practically
prevalent applications of the least square solution in the especially in the context of
signal processing for a practical wireless communication system, right. So, we stop here
and look at other application is the subsequent modules.

Thank you very much.

So, therefore now one can formulate for estimation of h. So, to estimate h formulate the
least square estimation problem, let us minimize y bar minus minimize y bar minus x h
bar norms. So, this is the least square form and we know the solution. So, this is your LS
problem for channel estimation, in fact, for multi antenna channel estimation.

And the solution h hat is given by the least squares solution (X transpose X) inverse X transpose y bar; this is also known as the least squares channel estimate, where X is your pilot matrix. So, we have the observation vector y bar and the corresponding pilot vectors. The other thing that we are assuming is that this is an over determined system, that is, the number of pilot symbols M is greater than or equal to the number of transmit antennas. Remember, this works only when the system is over determined.
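A compact simulation sketch of this estimator (NumPy; the antenna count, number of pilots, pilot values and noise level are arbitrary stand-ins, not lecture data):

    import numpy as np

    rng = np.random.default_rng(3)
    L_ant, M = 4, 20                                  # L transmit antennas, M >= L pilot vectors
    h = rng.standard_normal(L_ant)                    # unknown channel vector (the CSI)
    X = rng.choice([-1.0, 1.0], size=(M, L_ant))      # pilot matrix, e.g. +/-1 pilot symbols
    y = X @ h + 0.1 * rng.standard_normal(M)          # received pilot observations

    h_hat = np.linalg.solve(X.T @ X, X.T @ y)         # LS channel estimate (X^T X)^{-1} X^T y
    print(np.round(h, 3))
    print(np.round(h_hat, 3))                         # close to the true channel

With M at least L, and pilots chosen so that X transpose X is invertible, the estimate improves as more pilots are sent or as the noise variance shrinks.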

So, when we are doing this in simply we are assuming that M number of pilot symbols is
greater than or equal to typically M is much larger than. So, this implies this is a over
determined system. So, you can also say that for channel estimation using the least
square technique the number of pilot minimum number of pilot symbols are required is
at least that the number of transmit antenna. So, this is an another interesting result.

So, this implies minimum number of pilot symbols required; another interesting result
that is you need at least transmit pilot number of pilot symbols L that is M as to be
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:24)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 44
Practical Application: Image deblurring

Hello welcome to another module in this massive open online course. We are looking at
various applications of this paradigm of least squares that is the different applications
where least squares can be employed. And in general least squares has, is a very flexible
and powerful paradigm and is that can be applied in a variety of scenarios. In this module
let us look at another interesting application and that will be in the context of image
processing specifically in the context of Image deblurring ok.

(Refer Slide Time: 00:46)


Now just before we formulate the mathematical problem let just look at what it the
physical relevance of this; what happens is when you trying to capture the image let say
you are trying to capture you are trying to make a figure of a person over here and when
you try to capture an image of a person. And so, this is an impression of my camera and I
am trying to capture the image of this person. And now, if this person or this object is in
let say.

So, this is basically image captured device which is your camera or can be in like this
can be extended to the context of videos also. So, this is not necessarily applicable only
for images, but can be applied also for videos. If this object which you are trying to
image or which whose image you are trying to capture is in motion then this leads to a
blur alright. So, this remember when you have typically you might have seen. In fact, the
blur effect that is artificially applied for instance on things typically on the images of let
So, you want to look at yet another interesting application and. In fact, there are tin us say cars or vehicles that I have captured to give the impression that something is in
number of applications of the least squares paradigm. So, just to illustrate the versatility motion that an object is in motion and when the object is in motion that naturally gives
of this framework, let us look at an application in the context of image processing rise to blur.
specifically in the context of image deblurring ok. So, we want to look at the deblurring
So, the blur effect is basically associated with motion and blur can also it is basically
of image.
degrading effect of an image and it can also arise from several other factors. So, just
environmental factors, atmospheric factors, motion of the wind motion etcetera. So, in
general motion leads to blur and this is termed as motion blur; motion is one of the
predominant causes of blur and therefore, to get clean image implies image has to be And so, what you can see here is that each so, this is the output pixel. So, this y k this is
deblurred. Now one way to model the motion blur is the following let us say you have your output pixel and these are your input or original you can think of this as input pixels
the output pixel y of k. or original image pixels ok. These are the original image pixels and what you can see is
that the each output pixel is a combination is a linear combination of several input pixels.
(Refer Slide Time: 03:41)
So, you cannot you are not getting the crisp original pixels, but each pixel is sort of
merged right, each pixels are off mashed or combined along with other pixels. And, that
is what gives the blurring effect that is when you combine that is when you are not
getting the clear individual pixels, but rather your getting a combination of these pixels
that is x y of 2 the combination x of 2 x of 1 and x of 0. So, you combining these pixels
aright and that is what gives the blur effect.

(Refer Slide Time: 07:17)

So, the model for blur model let us look at it. The blur model can be described as
follows; let say you have an output pixel y of k that can be described as l equal to 0 to L
minus 1. The input pixel or the blur kernel h of l times x of k plus l ok; so, let me just
write this equation on for instance we have y of 0 equals h of 0 x of 0 plus h of 1 x of 1
plus h of L minus 1 x of L minus 1 ok. And you can also write this as y of 1 or you can
just or you can make this that is fine it does not matter, you can write this as y of 1 equals
h of 1 times x of 0 times x of 1.
So, the linear combination of pixels is what is giving you the blur effect. The linear
Or let me just write this as follows I can write this is as h of 0 times x of 0 y of 1 is h of 0
combination of pixels is giving you the blur effect.
times x 1 just to make this system causal or those it does not matter it does not really
matter because image processing can be non causal. So, this is h of 0 times x of 1 plus h
of 1 times x of 0 and I can write y of 2 equals h of 0 times x of 2 plus h of 1 times x of 1
plus h of 2 times x of 0, you can take the past pixels here to be 0. So, you can write this
as h of l x of k minus l. And you can assume that the past pixels for instance x of minus 1
x of minus 2 so, on equals 0 that is this is a causal that is this is a signal which is 0 for n
less than 0 alright.
(Refer Slide Time: 07:47) (Refer Slide Time: 10:22)

And therefore the input output blur model can be represented as follows let us say you And this is basically your filter matrix, this is also known as a filter matrix. Because you
have a group of output pixels that is which you are representing by the victor y of 0 y of are representing the blur by this filter this linear this filter represented by this linear
1 up to y of let us say M minus 1 that is total number of pixels is M. So, you call this transformation characterized by this matrix, I can call this matrix as matrix H. This is
your output pixel vector. We are considering a single of column of an image. So, this is your filter matrix or your blur matrix. And this filter which you are repeating along the
your output pixel vector this is equal to well h of 0 h of 1 times h of 0 and so on and so row this is also called the Kernel or this is basically your blur kernel alright. So, I can
forth. represent the blurring effect in the image as this linear system.

And, this matrix has an interesting structure this is in fact, this is x of 0 x of 1 up to x of So, the blur model can be the blur effect. In fact, this can this model can also be
M minus 1 and this second row is h of 2 h of 1 h of 0 and of course, you can also have introduced used to introduce blur alright to get the blur effect in images; for instance you
noise. Now, in addition you can also have noise, but let me just ignore this for a little bit want get the effect of an object being in motion such as a car being in motion this linear
just to simply; this model little bit although in technique in practice you can also have transformation can be used alright. So, this can use both ways either to recover the
noise or let us make this. So, let us add the noise it does not matter ok. And now what original image from the given output vector y bar or given input vector y bar x bar to
you can see y of 0 is h of 0 times x of 0, y of 1 is h of 0 h of 0 times x of 1 plus h of 1 introduce the blur effect one can use this linear input output system model. Alright, now
times x. So, it also depend so, it also combines both x 1 and x 2 x 0. the point that in the problem that we are considering is the other way not specially that is
given a blur image how do deblur it alright.
Similarly, y of 2 combines x 0 x 1 x 2. So, each pixel is a combination a linear
combination of the pixel itself and the original pixel and some neighbouring pixel and And now once you formulate this problem. So, this is your original image I am
that is what gives the blur effect. And this matrix which so, this is the output pixel this is representing a single column although this can be easily extended to a 2 dimensional
the original input pixel vector we are considering as I said a single column of pixels. original image and this can be easily extended to 2 dimensional. In fact, three
dimensional images also which is nothing, but video that is your x axis y axis and in a
time alright. So, one can have 3 dimensional blurring effect so, as to speak so, alright.
So, you can have original image. And this can be extended, I will just note here can be (Refer Slide Time: 13:58)
extended easily to 2 D images plus video and therefore, how to reconstruct the original
we know that.

(Refer Slide Time: 12:26)

And this x hat is now your deblurred or reconstructed deblurred or reconstructed image.
This is your deblurred or reconstructed image and therefore, what you can see now is
that yet another interesting applications of the least squares paradigm, is a very
interesting application. It can be applied we already seen another application that is for
Now, we have this model output image y bar the equals the blur matrix H x bar plus the
channel estimation in a wireless multi antenna system.
noise vector. So, this is your blur matrix and to reconstruct the original image or to
recover we now apply the least squares alright. And therefore, what we do is minimize It can also be used for deblurring of images in image processing. And therefore,
norm of y bar minus H x bar square. Implies the estimate or the reconstructed image or reconstructing or recovering the original images alright. And therefore, the least square
the deblurred image is the pseudo inverse of H. paradigm in general has many applications several applications; it is one that arises very
frequently. In fact, in several different areas signal processing and communication and
We are not introduce this notations. So, far let me just describe this is nothing, but H
also other scientific disciplines alright. We stop here and continue in the subsequent
transpose H inverse H transpose y bar, where we are denoting this matrix by H dagger.
modules.
This is known as H dagger; as we already said, this acts as a left inverse, this is
the pseudo inverse of H not the inverse, but the pseudo inverse. This is the left inverse of Thank you very much.
H.
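As an illustration of this deblurring recipe, here is a small sketch (NumPy only; the kernel, the column of pixels and the noise level are arbitrary stand-ins, not from the lecture) that builds the lower triangular blur/filter matrix H from a causal kernel, blurs a 1-D pixel column, and recovers it with the pseudo inverse (H transpose H) inverse H transpose:

    import numpy as np

    rng = np.random.default_rng(4)
    M = 64
    x = rng.uniform(0.0, 1.0, M)             # original column of image pixels
    h = np.array([0.6, 0.3, 0.1])            # assumed causal blur kernel h[0..L-1]

    # Blur (filter) matrix: y[k] = sum_l h[l] x[k - l], with past pixels taken as 0,
    # so H carries h[l] on its l-th subdiagonal.
    H = np.zeros((M, M))
    for l, hl in enumerate(h):
        H += hl * np.eye(M, k=-l)

    y = H @ x + 0.01 * rng.standard_normal(M)        # blurred, noisy observation

    H_dagger = np.linalg.inv(H.T @ H) @ H.T          # pseudo inverse of H
    x_hat = H_dagger @ y                             # deblurred / reconstructed column
    print(np.max(np.abs(x_hat - x)))                 # small reconstruction error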
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology Kanpur

Lecture – 45
Least Norm Signal Estimation

This is your matrix A and this is your vector x bar, and this implies that the number of rows is much less than the number of columns. One can call this a wide matrix, just like when the number of rows is more than the number of columns we call that a tall matrix, ok.

(Refer Slide Time: 02:17)

Hello, welcome to another module in this massive open online course. So, we are looked
at the least squares paradigm; let us look at it is analogue or a counterpart or something
that interestingly related to that which is known as the Least Norm Paradigm and that can
be described as follows.

(Refer Slide Time: 00:28)

Remember previously in the least squares we had a tall matrix ok; number of rows is
more than the, this is number of rows is more than the number of columns. Now we have
a wide matrix that is number of rows that is m is less than the number of columns ok.
Which basically implies number of equations as you remember this is the number of
rows each row is an equation.

Remember row equals equation and each of this is an unknown each element of x
because x is the unknown signal you can say is unknown. So, each column you can say
And these two always go hand in hand this is we so, far we have seen the least squares
each column of a correspondence to an unknown ok. So, rows equal equations equals m
paradigm. What you want to do now is the least norm framework and this can also be
columns equals unknowns equals m. So, this implies that for this kind of system when m
used for Signal Recovery or you can also think of this as single estimation ok. And the
is less than n; this implies that number of equations is smaller than number of unknowns
least known paradigm is as follows so, consider the following problem where in we have
which implies that system is under determined, there are not enough constraints, under
again y bar equals A times x bar and A is an m cross n matrix similar to what we have
determined not enough constraints implies not enough constraints on x bar.
seen previously.

But while previously m is greater than n in the least squares framework we will consider
a framework where m is less than n. That is the number of rows is much lower than the
number of columns so if you look at this, it will look like this which is this is your matrix
(Refer Slide Time: 04:16) (Refer Slide Time: 06:47)

And when the system is under determined that is there is not enough constraints not It basically minimizes the energy of x bar which implies that the solution is energy
enough equations there are only n equations, but n unknowns. This means that typically efficient; implies that you are trying to find an which implies you are trying to find a
there is no unique solution or typically there are more than one possible there are infinite energy efficient solution. That is what is it is a out of all the so remember we said it is
number of solutions. This implies typically again let me qualify this typically with verify says infinitely many solutions all it x bar is not unique infinitely many possible values of
propelled typically more than one solution or infinite or an infinite number of solution. x bar because it is an undetermined system. Out of all these x bar find the one that is
Now, therefore, how to determine now therefore there is an infinite number of solution minimum norm that has the minimum energy, that is the solution that we desired then
infinite number of possible solutions. that is how we are constraining this problem.

So, there are fewer constraints alright there are fewer constraints on x bar ok. And At this it is precisely known as the least norm problems. So, this means out of infinitely
therefore, how do you determine x bar which means you have to additionally constrain x out of infinite solutions find the one that has least norm. So, therefore, this is known as
bar naturally if there are fewer constraints the only way to fix x bar or determine the the least norm framework least norm or minimum norm, least norm you can also say min
possible value of x bar is to introduce additional constraints ok. So, therefore, to norm least norm the thing is the previous one was least square as remember we had no
determine x bar, one have to introduce additional constraints, one has to introduce solution. So, find the one that minimizes the approximation error y bar minus A x bar or
additional constraints. And therefore, one such difficult constraint is to find the energy y bar minus h x bar. Here we have more than one solution we have infinitely many
efficient solution or if you look at norm x bar minimize the norm of x bar so this is your solutions. So, find the one that has minimum norm ok. And I think I am using yeah the
additional constraint. matrix A and strainley straight forward. And therefore, the relevant optimization problem
for this least norm solution can be formulated as follows.
(Refer Slide Time: 08:32)

And this can be solved as follows so what I can do is I can write this equivalently solving
this is fairly straightforward we can use earlier techniques minimum norm x bar square
The relevant optimization problem for this is as you have already seen the objective
subject to the constraints y bar equals A x bar. Therefore, one can form the lagrangian
function is to minimize norm x bar and the constraints now is y bar equals A x bar. Even
which is equal to objective function x bar norm x bar square is nothing, but x bar
also justify this minimum normal that naturally occurring signals have do not have an
transpose x bar plus lambda times A x bar minus y or y minus A x bar A x bar minus y
infinite number of energy infinite amount of energy, they are typically limited in terms of
bar. In fact, this has to be lambda bar transpose remember because how many constraints
energy. So, therefore, we want to make sure that the signal corresponds to something that
we have we have m constraints; each row is an equation. So, there are m equation so m
is naturally occurring which means such energy is bounded.
constraints so there has to be one Lagrange multiplier for each constraints. So, your

So, this is justified because naturally occurring signals have limited. In fact, we will see lambda bar will in fact be a vector so that is basically lambda bar.
an interesting version of this later when naturally occurring signals will say have a
(Refer Slide Time: 11:09)
sparsity. They are naturally sparse in nature, but for us to begin with let us look at the
minimum norm solution. In fact, this is the minimum two norm and this is basically this
linear system this is your constraints; so, minimum norm that is objectives. So this is the
constraint for our optimization problem.

(Refer Slide Time: 09:59)


So, this is one lagrange multiplier for each constraint it is one lagrange multiplier for (Refer Slide Time: 13:29)
each constraint. And now when you take the gradient of this so your F equals x bar
transpose x bar plus lambda bar transpose A x bar minus y bar. Now we take the gradient
we have done this before x bar transpose x bar is nothing, but x bar transpose I identity
types x bar. So, this is twice x bar plus lambda bar transpose A into x bar is c bar
transpose x bar so the gradient of this is c bar. So, this will be a transpose lambda bar a
transpose lambda bar minus lambda bar transpose y bar gradient with respect to x bar is
0 and this is equal to 0 setting gradient. So, you are setting the gradient setting the
gradient equal to 0.

(Refer Slide Time: 12:29)

So, if you call this equation 2, the expression for lambda bar, and the earlier one equation 1, what we are able to do is substitute lambda bar from 2 into 1. We have already seen that x bar equals minus half A transpose lambda bar; substituting lambda bar equals minus 2 (A A transpose) inverse y bar, the minus half and the minus 2 cancel. So, this will be x hat equals A transpose (A A transpose) inverse into y bar; this is your signal estimate that has the least norm, correct. So, this is basically your least norm solution.
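A quick numerical sketch of this least norm estimate (NumPy, arbitrary stand-in data, not from the lecture): for a wide matrix A, x hat equals A transpose (A A transpose) inverse y bar satisfies A x hat = y bar exactly, and any other solution, obtained by adding a null-space component, has a larger norm.

    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 3, 8                                      # under determined: fewer equations than unknowns
    A = rng.standard_normal((m, n))
    y = rng.standard_normal(m)

    x_min = A.T @ np.linalg.solve(A @ A.T, y)        # least norm solution A^T (A A^T)^{-1} y
    print(np.allclose(A @ x_min, y))                 # True: the constraint holds exactly

    # Add a component in the null space of A: still a solution, but with a larger norm.
    z = rng.standard_normal(n)
    z -= A.T @ np.linalg.solve(A @ A.T, A @ z)       # project z onto the null space of A
    x_other = x_min + z
    print(np.allclose(A @ x_other, y))               # True
    print(np.linalg.norm(x_min) < np.linalg.norm(x_other))  # True: x_min has the smallest norm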

(Refer Slide Time: 15:10)


And once you solve this thing this implies you will get something interesting this
implies that x bar equals minus half a transpose lambda bar remember lambda bar is a
vector. So, I cannot simply get it or manipulate it otherwise so I simply have to write x
bar equals minus half lambda minus half a bar transpose a transpose lambda. How to
determine lambda bar? Use the constraint this similar to what we have to determine
lambda bar use the constraint. Remember or constraint is A x bar equals y bar substitute
x bar which implies A minus half A transpose lambda bar equals y bar which implies
minus half A, A transpose lambda bar equals y bar which implies that lambda bar equals
minus twice AA transpose inverse into y bar.
What we have obtained is the least norm signal estimate, and this is also known as the least norm solution. So, you can write this as x hat equals A transpose (A A transpose) inverse into y bar, ok. This is also known as the least norm solution or the minimum norm solution (there are many names), and it gives you the solution x hat which has the minimum 2 norm. And as I already told you, this is well suited for scenarios where we have an under constrained system, that is, fewer equations than unknowns, which means there are infinitely many possible solutions. So, we have to introduce additional constraints: we want to find the one solution which has the minimum norm, or the minimum energy. And this expression gives you a closed form expression for the minimum norm solution, alright. We will stop here.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 46
Regularization: Least Squares + Least Norm

Hello, welcome to another module in this massive open online course. So, we are looking at various convex optimisation problems, namely the least squares and least norm problems. To round up this discussion, let us look at another problem which is essentially a combination of both of these, that is, the least squares and the least norm problems, which is termed regularization.

(Refer Slide Time: 00:37)
expression for the minimum norm solution alright we will stop here. (Refer Slide Time: 00:37)

Thank you very much.

Or you can also think of this as a generalization of the least squares and the least norm frameworks (Refer Time: 00:50); it is termed as regularization, ok.

And as I already said, what regularization does is combine the least squares and the least norm frameworks, that is, your LS (Least Squares) plus LN (Least Norm). What it essentially does is the following. Remember, we said we use the least squares framework when we have a linear model, typically an over-determined linear model. So now consider this linear model A x bar equals y bar; it can be over-determined or under-determined, I will come to this in a moment.
(Refer Slide Time: 02:23)

So, this is A x bar equals y bar; this is your linear model. This vector x bar is an unknown vector, similar to what we had previously; this vector y bar is your observation vector, and the matrix A is assumed to be known, that is, it is the matrix pertaining to your linear model. So, y bar is your observation vector; similar to the channel estimation problem, you can think of A as built from the pilot symbols and y bar as the output vector, that is, the vector of observed samples received at the receiver in the wireless system, ok.

So now, we have a linear model, and in addition we also desire something more. We want to minimise the model error, that is, it is desirable to minimise the approximation error; and at the same time, we would also like to minimise the norm of x bar, or basically minimise the energy. This can be thought of in various ways: it is basically to guarantee an energy efficient solution, or, since it is known that vectors which have lower energy have a higher probability, we are taking the prior probability of such vectors arising in the problem into account. So, you can think of this either as maximizing the efficiency of the solution, or as trying to maximize the probability of x bar; in natural systems, vectors x bar which have lower energy have a higher probability, and typically that is the scenario.

So, we would like to maximize the probability of x bar, or, rather than probability, you can think of it as the likelihood of x bar; I think that is a better word, maximize the likelihood. And therefore, we would want a trade off between the accuracy of the linear model and this probability slash energy efficiency, ok.

(Refer Slide Time: 05:16)

So, in summary, what we would like to do is not either-or; we would like to do both, trading off the accuracy of the linear model against energy efficiency. This implies we want an objective function which reflects both, that is, an optimisation objective that achieves both of the above objectives, and therefore one can formulate the following optimisation problem. Remember, previously, to obtain the solution x bar which best explains y bar, or which minimises the approximation error, you minimised the norm of A x bar minus y bar whole square; that is the least squares problem, and it minimises the model error. Now, to get an energy efficient solution, we would like to minimise the norm of x bar square. What we do now is combine both of these.

So, the first term is the approximation error and the second is the energy of the solution, and now you are minimising a linear combination of these: norm of A x bar minus y bar whole square plus lambda times norm of x bar whole square. Here lambda is a weighting factor; lambda here is not a Lagrange multiplier, it is simply a weight, and therefore you have a weighted objective, a weighted combination of these 2 objectives.

One is the approximation error, one is the energy, and you are minimising a weighted combination of the two. This process is known as regularization. So, you are not just minimising the approximation error, but you are regularising the objective function by the addition of a component that is proportional to the energy of the solution.

(Refer Slide Time: 08:04)

So, this basically encourages solutions that have lower energy. This process is termed as regularization, and this factor lambda is termed as the regularization parameter. In a scenario where we would like to achieve both, or a trade off between accuracy and an energy efficient solution, one can apply this approach. The procedure to solve it is similar to what we have seen before, that is, expand the objective function; once you have understood the previous paradigms, this one is going to be rather simple. So, the solution to the optimisation problem above is found as follows.

(Refer Slide Time: 09:52)

You can also think of this as the regularized least squares, and the solution to the regularized LS can be found as follows. We have the objective function F of x bar, which is norm of A x bar minus y bar whole square plus lambda times norm of x bar whole square. Since the norm square of a real vector is the vector transpose times the vector, I can write this as (A x bar minus y bar) transpose times (A x bar minus y bar) plus lambda times x bar transpose x bar, which, similar to what we have seen many times before, equals (x bar transpose A transpose minus y bar transpose) times (A x bar minus y bar) plus lambda x bar transpose x bar. Expanding, and using the fact that y bar transpose A x bar and x bar transpose A transpose y bar are equal to each other, because these are scalar quantities and one is the transpose of the other, this becomes x bar transpose A transpose A x bar minus 2 x bar transpose A transpose y bar plus y bar transpose y bar plus the regularization term lambda x bar transpose x bar. This is your F of x bar.

(Refer Slide Time: 11:36)
Now, if you take the gradient of this with respect to x bar, what does that give us? Well, A transpose A is symmetric, so the first term has the form x bar transpose P x bar with P equal to A transpose A, and its gradient is twice P x bar, that is, twice A transpose A x bar. The second term, minus 2 x bar transpose c bar with c bar equal to A transpose y bar, has gradient minus 2 c bar (Refer Time: 12:06), so this will be minus twice A transpose y bar. The term y bar transpose y bar does not depend on x bar, so its derivative with respect to x bar is 0. Finally, lambda x bar transpose x bar can be written as lambda x bar transpose identity matrix times x bar, so its gradient is twice lambda times x bar. Now we set the gradient equal to 0 to find the optimum. The factors of 2 cancel, and you can write this as (A transpose A plus lambda times identity matrix) x bar equals A transpose y bar, which implies that the solution, call it x hat, equals (A transpose A plus lambda I) inverse A transpose y bar, ok.

(Refer Slide Time: 13:30)

And therefore, this is basically the solution to your regularized LS, and it is a trade off between the two criteria; (Refer Time: 13:50) as you can see, it combines the properties of both the least squares and the least norm, LS plus LN; it is the solution of the weighted optimisation problem.

Now, a brief note regarding this lambda, the regularization parameter: it needs to be chosen appropriately. Remember, the solution is easy, but we have not yet talked about how to choose lambda. So, this lambda, the regularizing parameter, has to be chosen appropriately, or rather it requires what is known as tuning, in the sense that one has to play around with it a little bit to get the best solution, alright.
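A minimal numerical sketch of this closed form (editor's illustration; the data below is an arbitrary assumption) that also shows why lambda requires tuning, by sweeping a few values and comparing the model error against the solution energy:

```python
import numpy as np

np.random.seed(1)
m, n = 20, 5
A = np.random.randn(m, n)                      # assumed known model matrix
x_true = np.array([1.0, -0.5, 0.0, 2.0, 0.3])  # hypothetical ground truth
y = A @ x_true + 0.1 * np.random.randn(m)      # noisy observations

def regularized_ls(A, y, lam):
    """x_hat = (A^T A + lambda I)^{-1} A^T y, the regularized LS solution."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

for lam in [0.0, 0.1, 1.0, 10.0]:
    x_hat = regularized_ls(A, y, lam)
    err = np.linalg.norm(A @ x_hat - y)        # approximation (model) error
    energy = np.linalg.norm(x_hat)             # energy of the solution
    print(f"lambda={lam:5.1f}  model error={err:.3f}  ||x_hat||={energy:.3f}")
# Larger lambda trades a slightly larger model error for a lower-energy solution;
# choosing the value that "best suits" the system is the tuning described above.
```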
So, you might need to adaptively change this to obtain the best solution. This lambda can be chosen heuristically, or it can be chosen adaptively, adapted to a changing environment, and typically one needs to have a slight intuition about the system, based on prior experience. That is, lambda can be determined over time, by observing the system, or observing the model over several time instances, and eventually determining what value of lambda, that is, what value of the regularization parameter, gives the best solution, alright. So, that is the tuning process, ok.

(Refer Slide Time: 16:12)

So, this requires tuning: either choose it heuristically or tune it appropriately by adapting it to a changing environment, and it requires some intuition, or prior knowledge we could put it that way, or prior experience with the operation of the system. As we determine the lambda, that is, the regularizing parameter, we find the best solution, alright. So, that basically completes our discussion on the regularised least squares, which combines both the least squares and the least norm flavours of this framework.

Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Dept. of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 47
Convex Optimization Problem representation: Canonical form, Epigraph form

Welcome to another module in this massive open online course. So far, we have looked at various optimization problems somewhat informally. What we are going to do in this module, and in a few of the subsequent modules, is to lay down the formal framework for the formulation of a Convex Optimization Problem, alright.

(Refer Slide Time: 00:42)

So, let us begin our discussion, so to speak, on the formal framework of a convex optimization problem. And now, a convex optimization problem, as we have seen, can be written in a standard form; this can be thought of as the canonical form or the standard form of a convex optimization problem, a textbook convex optimization problem, and it can be stated as follows: you minimise, which is also sometimes written as min (in fact, we frequently simply write min followed by a period), meaning minimize an objective function, alright.

The objective function g naught of x bar can be a function of a vector or of a scalar. So, we minimise g naught of x bar subject to some constraints, like we have seen so far, written s.t. (subject to), and these constraints can be g i of x bar less than or equal to 0; you can have any number of such constraints.

And now, this objective function g naught has to be convex; remember, for a convex optimization problem the objective function has to be convex. I can have these inequality constraints, and each such constraint g i is also a convex function, for i equals 1, 2, up to l. In addition, I can have equality constraints g j tilde of x bar equals 0, for j equals 1, 2, up to m, and these have to be affine constraints. So, the equality constraints have to be affine in nature, which basically implies that they are hyperplanes. That is, you have constraints of the form a j bar transpose x bar equals b j. So, these are affine constraints, or basically hyperplanes; those are the equality constraints. The inequality constraints can simply be convex functions, and of course, the objective itself is a convex function, alright. So, this is the standard form of a convex optimization problem; you can think of it as a standard form convex optimization problem.

(Refer Slide Time: 04:12)

And what is the advantage of a convex optimization problem formulated this way? The important advantage, as you might already know, is the following: when you look at a convex objective function and you minimise it, the optimal value is unique. The minimizer need not be unique, but the optimal value, which you can think of as the minimum, is unique, ok.

(Refer Slide Time: 05:14)

Now, as against this, if the objective is nonconvex, what happens is that you have the concept of what is known as a local minimum. For instance, a point can appear like a minimum, but it is only a minimum locally, that is, in a certain neighbourhood; it is not the minimum globally. The global minimum, that is, the minimum over the entire domain, lies elsewhere. So, you have the concept of a global minimum and of local minima when the objective is nonconvex. However, for a convex problem, any local minimum is the global minimum; that is the advantage of convexity. In the nonconvex case there can be very many local minima and only one global minimum, and the problem is that the optimization algorithms you employ can get trapped in these local minima and yield spurious solutions which are not actually the minimum, which are not actually the optimal values of the objective function. So, for a nonconvex problem, the optimization routine or optimization algorithm gets trapped, that is the terminology used frequently, trapped in local minima.
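As a concrete illustration of this standard form, here is a small sketch using the cvxpy modelling package (editor's example; the particular objective, constraint and data are arbitrary assumptions, chosen only so that the objective is convex, the inequality constraint is convex and the equality constraints are affine):

```python
import numpy as np
import cvxpy as cp

np.random.seed(0)
n = 4
A = np.random.randn(2, n)          # each row a_j^T gives an affine equality a_j^T x = b_j
b = np.random.randn(2)

x = cp.Variable(n)
objective = cp.Minimize(cp.sum_squares(x - 1.0))   # g0(x): convex
inequalities = [cp.norm(x, 1) - 3.0 <= 0]          # g1(x) <= 0 with g1 convex
equalities = [A @ x == b]                          # affine equality constraints (hyperplanes)

prob = cp.Problem(objective, inequalities + equalities)
prob.solve()
print("optimal value:", prob.value)                # the optimal value is unique
print("a minimizer  :", x.value)                   # a minimizer (need not be unique in general)
```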
(Refer Slide Time: 07:28)

This implies you have spurious solutions, or non optimal solutions, and this is precisely because, for a convex objective function, that is, for a convex optimization problem, whatever is a local minimum is a global minimum. So, this problem of spurious minima, or of getting trapped in local minima, is entirely avoided by a convex optimization problem. This problem does not arise in a convex optimization problem, and that is the advantage of the convex optimization framework: because there are no local minima, only global minima, the algorithm does not get trapped in local minima.

(Refer Slide Time: 09:28)

Now, a related and convenient reformulation of a convex optimization problem is what is known as the epigraph form, which I am going to discuss shortly. A convex optimization problem can be recast in what is known as its epigraph form, and that is the following. Remember, we said you have a convex optimization problem: minimize the objective g naught of x bar subject to the constraints g i of x bar less than or equal to 0 for 1 less than or equal to i less than or equal to l, and the affine equality constraints g j tilde of x bar equals 0, which I can write directly as a j bar transpose x bar equals b j, these are hyperplanes, for 1 less than or equal to j less than or equal to m. Now, I can write this in epigraph form; it can be equivalently expressed as follows. What I am going to do is introduce an additional optimization variable. The original optimization is with respect to x bar; I am going to introduce an additional variable t, so that we minimise over x bar comma t.

Now, I am going to minimise t, and to the constraints I am going to add one additional constraint, that is, g naught of x bar less than or equal to t, ok. The rest of the constraints remain the same, that is, g i of x bar less than or equal to 0 for i equals 1, 2, up to l, and a j bar transpose x bar equals b j for j equals 1, 2, up to m. So, these constraints remain the same; my optimization objective now becomes simply t, and I have one additional constraint.

And the point is that this is still a convex optimization problem because, if you look at the objective, it is simply linear, simply t; so it is convex, a simple function of t. And we already said the optimization objective g naught of x bar is a convex function, so g naught of x bar less than or equal to t is a convex constraint: a convex function less than or equal to t is allowed in a convex optimization problem. So, this is a convex constraint and therefore this is still a convex optimization problem. It is a very simple and elegant reformulation that simplifies many complex convex optimization problems.
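Continuing the cvxpy sketch from above (same assumed data), the epigraph reformulation can be checked numerically: introducing t and the extra constraint g0(x) less than or equal to t leaves the optimal value unchanged.

```python
import numpy as np
import cvxpy as cp

np.random.seed(0)
n = 4
A = np.random.randn(2, n)
b = np.random.randn(2)

x = cp.Variable(n)
t = cp.Variable()                                   # additional epigraph variable
constraints = [cp.sum_squares(x - 1.0) <= t,        # g0(x) <= t, convex <= affine
               cp.norm(x, 1) - 3.0 <= 0,            # original convex inequality
               A @ x == b]                          # original affine equalities
epi = cp.Problem(cp.Minimize(t), constraints)       # objective is simply t
epi.solve()
print("epigraph optimal value:", epi.value)         # matches the original problem's optimal value
```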

(Refer Slide Time: 13:17)

So, this modified optimization problem is still convex, and it yields the same solution; that is, the optimal solution x bar of the second problem is the same as the optimal solution x bar that you obtain from the first problem. However, in the second problem you are optimising both with respect to x bar and t, and this is termed as the epigraph form of the problem, ok. And, as I already told you, the advantage of the epigraph form is basically that it is helpful in recasting convex optimization problems in a more interesting or intuitive form.

(Refer Slide Time: 15:19)

Now, let us look at a simple example to understand this better, a simple example for this epigraph form, which is the following. Consider this problem: minimize (sometimes this is even written omitting the period after min) the infinity norm of x bar, subject to the constraint that x bar belongs to the set S, which is, let us say, some combination of linear and affine constraints. So, x bar must belong to the set S, which is a convex set; I am not too worried about the exact constraints. Now, the set is convex, so the constraints are convex, and remember this is the infinity norm, which is a convex norm, so the objective is a convex objective, ok. Therefore, this is in fact a convex optimization problem. And now, what we want to formulate is the equivalent epigraph form, and that can be derived as follows.
(Refer Slide Time: 17:41)

So, I can write the epigraph form as follows; remember, the epigraph form is straightforward: minimize t, the objective function is always t, subject to the original objective being at most t. Of course, this minimization is over x bar comma t: the infinity norm of x bar is less than or equal to t, and the original constraints remain, that is, x bar belongs to the set S. So, this is the epigraph form.

Now, I can simplify it slightly. Look at the infinity norm of x bar: x bar is an n-dimensional vector, and the infinity norm is nothing but the maximum of magnitude x 1, magnitude x 2, up to magnitude x n. When we say the infinity norm is less than or equal to t, this constraint basically implies that the maximum of magnitude x 1 up to magnitude x n is less than or equal to t. Now, the maximum of n quantities is less than or equal to t only if each of the quantities is less than or equal to t. So, this in turn leads to something interesting: it implies that magnitude x 1 is less than or equal to t, magnitude x 2 is less than or equal to t, and so on, up to magnitude x n less than or equal to t.

(Refer Slide Time: 19:41)

Now, magnitude x 1 less than or equal to t implies that minus t is less than or equal to x 1, which is less than or equal to t. Similarly, magnitude x 2 less than or equal to t implies minus t less than or equal to x 2 less than or equal to t, and so on: magnitude x n less than or equal to t implies minus t less than or equal to x n less than or equal to t. And therefore, the epigraph form of the optimization problem above can be simplified as: minimise t, with respect to x bar and t, subject to minus t less than or equal to each x i. Let me write it explicitly to illustrate: minus t less than or equal to x 1 less than or equal to t, minus t less than or equal to x 2 less than or equal to t.

(Refer Slide Time: 21:09)

And so on, up to minus t less than or equal to x n less than or equal to t; and the original constraint is always there, that is, x bar belongs to the set S. And this is something that is more intuitive: these are in fact some sort of box constraints. You can think of these constraints as requiring x bar to lie in a box of dimension 2t.

(Refer Slide Time: 21:49)

In fact, that is what you will see if you consider a 2-dimensional scenario: if this is your x 1 and this is your x 2, then x 1 has to lie between minus t and t, and x 2 has to lie between minus t and t. So, that effectively limits the region: the point (x 1, x 2) has to lie in this box, alright. You can think of these as your box constraints; so, basically, the reformulation introduces a box constraint for the original optimization problem, ok.

And this gives you either better intuition or an interesting interpretation of the original optimization problem, which is in fact an equivalent optimization problem, but which in its original form is sort of opaque, that is, not amenable to deriving insights. The modified optimization problem, on the other hand, is more interesting and easy to interpret, and it can probably also be analysed, or at least interpreted, without rigorous analytical tools, simply by looking at the optimization problem, alright. So, this is important; in fact, one can use this, and we are going to use it from time to time, to simplify or recast optimization problems, alright. So, we will stop here and continue in the subsequent modules.

Thank you very much.
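As an editor's aside, the box-constraint reformulation derived above can be handed directly to an LP solver. The sketch below is illustrative only: the set S is assumed, for concreteness, to be a single affine equality constraint, and the variables are stacked as (x, t) so that scipy.optimize.linprog can minimise t:

```python
import numpy as np
from scipy.optimize import linprog

# Minimize ||x||_inf subject to x in S, with S = {x : a^T x = b} assumed for illustration.
n = 3
a = np.array([1.0, 2.0, -1.0])   # hypothetical data defining S
b = 4.0

# Decision vector z = [x_1, ..., x_n, t]; objective: minimize t.
c = np.zeros(n + 1); c[-1] = 1.0

# Box constraints -t <= x_i <= t, written as  x_i - t <= 0  and  -x_i - t <= 0.
A_ub = np.zeros((2 * n, n + 1))
for i in range(n):
    A_ub[2 * i, i], A_ub[2 * i, -1] = 1.0, -1.0           #  x_i - t <= 0
    A_ub[2 * i + 1, i], A_ub[2 * i + 1, -1] = -1.0, -1.0  # -x_i - t <= 0
b_ub = np.zeros(2 * n)

A_eq = np.concatenate([a, [0.0]]).reshape(1, -1)          # a^T x = b (t does not appear)
b_eq = np.array([b])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * n + [(0, None)])    # x free, t >= 0
print("optimal t (= minimum infinity norm):", res.fun)
print("optimal x:", res.x[:n])
```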
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 48
Linear Program Practical Application: Base Station Co-operation

Hello, welcome to another module in this massive open online course. So, we are looking at convex optimization problems. We have looked at the canonical or the standard form of a convex optimization problem. In this module, let us look at an important subclass of convex optimization problems, which is the class of linear programs, ok.

(Refer Slide Time: 00:37)

So, what we want to look at is basically linear programs, which are also referred to as LPs. As I already said, we can think of this as a special case or type of convex optimization problem, a subclass of convex optimization problems. And now, a linear program, also referred to as an LP, can be expressed as follows. Remember, any optimization problem has an objective; here it is: minimize a bar transpose x bar plus b, which is basically a linear, or rather affine, objective function, subject to constraints which are also affine.

(Refer Slide Time: 02:14)

That is, c i bar transpose x bar less than or equal to d i, for i equals 1, 2, up to l, and equality constraints, which have to be affine in any case for a convex optimization problem as we have seen in the previous module, that is, c tilde j transpose x bar equals d tilde j, for j equals 1, 2, up to m, alright.

So, basically the constraints, the equality as well as the inequality constraints, are all affine: it has affine equality and/or inequality constraints. The objective is linear and, of course, an affine function is a convex function, so the objective function is convex, and an affine constraint is also convex. Therefore, this is a special class of convex optimization problems. So, a linear program is a special subclass of convex optimization problems in which the objective function as well as the constraints, equality as well as inequality, are all affine in nature, alright. And you can say this is the simplest class, because everything is affine; this is the simplest class or category of convex optimization problems.
(Refer Slide Time: 04:51)

Now, just for convenience, I can write this in matrix form; representing the LP in matrix form is not very difficult. I have: minimize the objective, which is a bar transpose x bar plus b, subject to the constraints. The inequality constraints I can represent as a matrix: stack c 1 bar transpose, c 2 bar transpose, up to c l bar transpose as rows and multiply by x bar. The inequality is component wise, that is, each component of the vector on the left has to be less than or equal to the corresponding component of the vector on the right; this is termed the component wise inequality. So, I can write this as the matrix C times x bar, component wise less than the vector d bar, alright.

(Refer Slide Time: 06:23)

So, I can represent it in a compact form using matrices. Similarly, the equality constraints I can represent as c 1 tilde transpose, c 2 tilde transpose, up to c m tilde transpose stacked into a matrix, times x bar, equal to d 1 tilde up to d m tilde. So, this becomes C tilde x bar equals d tilde. And therefore, the LP can be written in compact form as follows: minimize a bar transpose x bar plus b, subject to C x bar component wise less than d bar, and C tilde x bar equals d tilde. So, this is the compact form for the linear program; it expresses your linear program in a very compact form using vectors and matrices, ok.
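This matrix form maps directly onto standard LP solvers. A minimal sketch (editor's illustration; the a bar, C, d bar, C tilde, d tilde below are arbitrary assumptions) using scipy.optimize.linprog, which expects exactly this layout:

```python
import numpy as np
from scipy.optimize import linprog

a = np.array([1.0, 2.0, 0.5])          # objective vector a_bar (the constant b does not affect the minimizer)
C = np.array([[1.0, 1.0, 1.0],         # rows are c_i^T : component-wise C x <= d
              [-1.0, 2.0, 0.0]])
d = np.array([10.0, 4.0])
C_eq = np.array([[0.0, 1.0, -1.0]])    # rows are c_tilde_j^T : C_tilde x = d_tilde
d_eq = np.array([1.0])

res = linprog(a, A_ub=C, b_ub=d, A_eq=C_eq, b_eq=d_eq,
              bounds=[(0, None)] * 3)  # keep the toy variables non-negative so this small LP is bounded
print("status:", res.message)
print("x_hat :", res.x)
print("a^T x :", res.fun)
```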
(Refer Slide Time: 08:05)
Let us look at an example for this linear program, one that we have already introduced before, that is, the example of base station cooperation. There are several base stations in a cellular communication scenario, and a user at the edge of several cells, at their intersection, or in a region that is covered by several cells, can be served by several base stations. Typically, when the user is at the edge of a cell, or at the edge of several cells, the user can be served by several base stations belonging to the different cells which overlap at that particular point. So, we consider a scenario in which this particular user is being served by several base stations, not just one; in fact, several users can be served by different base stations cooperating with each other, alright.

So, let us take a look at our example of base station cooperation, or cooperative transmission. We have already seen what happens in base station cooperation: we have a group of cells that cooperate to transmit to one or many users. So, I have different users, and the base stations are transmitting to the various users in a cooperative fashion. Normally, we have a single base station serving any particular user, but in this scenario base stations can cooperate to serve the various users, thereby enhancing the SNR and the reliability of communication in a wireless communication scenario, alright. So, consider a scenario in which we have k base stations, and these k base stations are cooperating in this cooperative cellular scenario to serve m users, ok.

(Refer Slide Time: 10:55)

And in this scenario, let P ij denote the power transmitted by base station i to user j, ok. And we will denote by h ij the fading channel coefficient, so that magnitude h ij square equals g ij, which represents the power gain between base station i and user j. So, P ij denotes the power transmitted by base station i to user j, h ij is the fading channel coefficient, and magnitude h ij square, which is g ij, represents the power gain from base station i to user j, ok. And now, let us look at the power that is received by each user.

(Refer Slide Time: 12:34)
So, the power that is received by each user in this cooperative fashion is the sum of the powers transmitted by all the base stations. For user 1, I have summation over i equals 1 to k of P i1 g i1; this is the power received at user 1, the sum of the powers received from all the base stations, alright. And remember, we said this has to be greater than or equal to P 1 tilde, which is the minimum power desired by user 1.

So, the sum of the powers received from all base stations has to be greater than or equal to P 1 tilde. This has to hold similarly for the other users. So, at user 2 we must have the sum of the powers received from all the base stations, that is, summation over i equals 1 to k of P i2 g i2, greater than or equal to P 2 tilde.

(Refer Slide Time: 14:23)

And so on and so forth; at user m, summation over i equals 1 to k of P im g im has to be greater than or equal to P m tilde, ok. So, these are your constraints, and you can see these are basically affine constraints.

And now, we need an objective. One objective that one can consider is the following: we want to meet the desired power level at each user, but simultaneously we also want to transmit the minimum amount of power. So, what is the minimum total power that can be transmitted by all base stations to all users while meeting these desired criteria at the various users? The objective function can be: minimize the total power transmitted by all the base stations to all the users, which makes this a convex optimization problem. That is, minimize summation i equal to 1 to k, summation j equal to 1 to m, over all base stations and over all users, of P ij, the power transmitted by the base stations to all the users. This is your linear objective, remember, it is simply a sum; therefore, it is a linear objective, and it represents the total power transmitted by all base stations to all users, alright.

So, what is our optimization problem? It is to minimize the total transmit power of all the base stations to all the users, subject to these constraints, that is, the power received at each user is greater than a minimum threshold, denoted by P j tilde: at user 1 it is P 1 tilde, at user 2 it is P 2 tilde, and so on, at user m it is P m tilde. And you can clearly see the objective is linear and the constraints are linear; this is a linear program. So, for cooperative base station transmission, or base station cooperation, one can formulate this linear program, and you can see the variety of interesting applications of this simple yet very flexible and powerful optimization framework, the linear program. It can be used to minimize the total power that has to be transmitted by all the base stations to all the users.

Now, also note that in the standard form convex optimization problem the inequalities have to be less than or equal to. One can readily convert this problem to the standard form by simply introducing a minus sign: put a minus in front of everything and the inequalities become less than or equal to. This is now your standard form linear program. So, this is your LP, to simply minimize the total transmit power of the base stations to all the users.
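A small numerical sketch of this base station cooperation LP (editor's illustration; the gains g_ij and thresholds P_j tilde below are made-up numbers), again via scipy.optimize.linprog; the decision variables are the powers P_ij flattened into a single vector:

```python
import numpy as np
from scipy.optimize import linprog

k, m = 3, 2                                   # k base stations, m users (assumed sizes)
rng = np.random.default_rng(0)
g = rng.uniform(0.2, 1.0, size=(k, m))        # hypothetical power gains g_ij
P_min = np.array([1.0, 1.5])                  # thresholds P_j tilde at each user

# Variables: P_ij flattened as p[i*m + j]; objective: minimize sum of all P_ij.
c = np.ones(k * m)

# Constraint for user j: sum_i P_ij g_ij >= P_min[j]  ->  -sum_i g_ij P_ij <= -P_min[j].
A_ub = np.zeros((m, k * m))
for j in range(m):
    for i in range(k):
        A_ub[j, i * m + j] = -g[i, j]
b_ub = -P_min

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (k * m))
print("total transmit power:", res.fun)
print("P_ij:", res.x.reshape(k, m))
```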
(Refer Slide Time: 18:28)

Now, an interesting variation on this problem can be the following. If you look at the transmit power of any single base station, say base station i, it can be represented as P i equals summation over all users j of P ij, that is, the power transmitted by base station i summed over all the users. For instance, P 1, the total power transmitted by base station 1, equals summation over j equals 1 to M of P 1j, that is, the power transmitted by base station 1 to each user j, summed over all the users, ok. So, this is the total transmit power of base station 1; that is what this is, ok.

(Refer Slide Time: 19:46)

And now, what we want to do is consider an interesting optimization objective, in which we want to minimize the maximum of the powers transmitted by the various base stations. So, for the k base stations you have transmit powers P 1 up to P k, and you want to minimize the maximum of these. What this is doing is minimizing the maximum power transmitted by any base station.

Now, what happens typically in this cooperative cellular scenario is that there are several base stations, and when you minimize the total power transmitted by all the base stations together, that might result in an undue burden on a single base station. So, one particular base station, which probably has good channel conditions, whose channels to the different users are good, can be over-burdened or over-taxed in comparison
to others. So, the previous total power minimization does not ensure that the power load, that is, the transmit power burden, is uniformly levied on all the base stations, alright; there might be base stations which are levied more, or which have to transmit more power in comparison to others.

But when you minimize the maximum transmit power, what this does is ensure a sort of fairness in the power burden: it ensures that the power burden of serving the different users is fairly, rather uniformly, distributed over all the base stations. So, this is an important property of such problems. The min max criterion basically ensures fairness in the power burden, or the power distribution, by minimizing the maximum power.

And the constraints are all there as usual, that is, the power received at each user has to be greater than the threshold. So, the constraints remain. Now, if you look at this min max problem, each P i, as we have already seen, is summation over j equals 1 to M of P ij, and you can see this is affine, hence convex, ok. So, each of these powers P 1, P 2, up to P k is convex, alright. And we take the maximum of these; remember, the maximum of a finite or infinite set of convex functions is convex, alright.
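The lecture goes on to show that this min-max problem can be recast as a linear program with one extra variable t, the epigraph trick. As an editor's sketch of that reformulation, reusing the made-up gains from the previous snippet (hypothetical data, not from the lecture):

```python
import numpy as np
from scipy.optimize import linprog

k, m = 3, 2
rng = np.random.default_rng(0)
g = rng.uniform(0.2, 1.0, size=(k, m))        # hypothetical power gains g_ij
P_min = np.array([1.0, 1.5])                  # per-user thresholds P_j tilde

# Variables: z = [P_11, ..., P_km, t]; objective: minimize t.
n_var = k * m + 1
c = np.zeros(n_var); c[-1] = 1.0

rows, rhs = [], []
# Per-base-station constraints: sum_j P_ij - t <= 0   (k affine constraints).
for i in range(k):
    row = np.zeros(n_var)
    row[i * m:(i + 1) * m] = 1.0
    row[-1] = -1.0
    rows.append(row); rhs.append(0.0)
# Per-user constraints: -sum_i g_ij P_ij <= -P_min[j]  (m affine constraints).
for j in range(m):
    row = np.zeros(n_var)
    for i in range(k):
        row[i * m + j] = -g[i, j]
    rows.append(row); rhs.append(-P_min[j])

res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
              bounds=[(0, None)] * n_var)
print("min of the maximum per-base-station power:", res.fun)
print("P_ij:", res.x[:-1].reshape(k, m))
```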

And therefore, the maximum of this set is convex. So, this implies the maximum of this So, we have minimize now we will use the epigraph form. So, I can write this as

is convex and therefore, this is basically convex and therefore, you are minimizing a minimize t subject to the object to maximum of P 1, P 2, P k this is less than equal to t

convex optimization, convex objective function, constraints are as usual, I mean similar and the rest of the constraints are as usual there. That is summation i equal to 1 to k P i 1

to the previous one they are convex. So, this is also a convex optimization problem, ok. greater than or equal to P 1 tilde and so on and so forth, summation i equal to 1 to k P i M

And this is also therefore, convex optimization problem. Now, what is this relation to a greater than or equal to P M tilde. Now, the maximum of P 1 P 2 up to P k less than or

linear program? Now, at present it seems unrelated to a linear program, but we are going equal to t, the maximum of a set of quantity is less than equal to t if and only if each of

to demonstrate that this in is in fact, can be written as an equivalent linear program and these is less than or equal to t.

for that we will use the epigraph trick that we have seen in the previous module.
(Refer Slide Time: 26:07)

So, how can this be written as a linear program.


So, this implies each is has to be each P. This implies each P i has to be less than or equal there as usual and if the maximum is less than t then we have each P i is less than t,
to t. which implies that is why I can write this as.

(Refer Slide Time: 26:30) So, now we have this is basically the k constraints and you can see each is an affine
constraint ok. And these are your previous M constraints. So, the mini max optimization
problem has k. This has a total of K plus M constraints alright. So, this has a total of so,
this has a so, you can write the min max. So, this is also an LP. So, you can see each of
these all these constraints are affine. So, therefore, the mini max problem is also a linear
program and it is not obvious the first instance, but using a clever trick alright or by
manipulating this you can write the min max problem as an equivalent linear
programming.

And therefore, the linear program has can be written in various forms and has very
interesting application. So, not only can be it be used to minimize the total transmit
power, but you can also be used to as I have said, it also it can also be used to minimize
the maximum transmit power maximum power transmitted by any base station any of
these base stations ah, thereby ensuring that this power burden or the transmitted power
And now, I can write the equivalent optimization problem. Therefore, this is optimization
burden is fairly or it is sort of evenly distributed of all the cooperating base stations,
problem is equivalent to minimize t subject to, now, we need that each P i is less than or
alright. So, that basically introduces the linear program and basically demonstrates it is
equal t. So, this means P 1 equals summation over j equals 1 to M P 1 j less than or equal
demonstrates it is application in a practical scenario in a practical wireless scenario for
to t j equals to 1 to M P 2 j summation less than or equal to t, this is the power
base station corporation, alright. We will stop here and continue in the subsequent
transmitted by base station 2 to all users and finally, summation j equals 1 to M P k j less
modules.
than or equal to t. These are the base station power constraints. And you have the user
constraints as well similar to previous. So, summation i equal to 1 to k P i 1 greater than Thank you very much.
or equal to P 1 tilde summation i equal to 1 to k P i M greater than or equal to P M tilde. I
am sorry this has to be ah, I am missing I am missing here ah, missing the channel gains
the channel gains are very much there. So, this has to be g i 1 g i M greater than equal to
P 1 tilde P M tilde.

And similarly this has to be P i 1 g i 1 so on forth, summation P i M g i M greater than


equal to P M tilde. Ah, similarly here also, and summation i equal to 1 to k, P i M g i M
greater than equal to P M tilde. Just, let me make sure that we are not missing this at any
point yes. In fact, I think that so that ah, so, want to minimize the maximum correct using
the epigraph form I can write this as minimize t such that the maximum is less than or
equal to t, the desired power at each user has to be greater than or equal to P j tilde that is
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 49
Stochastic Linear Program, Gaussian Uncertainty

Hello, welcome to another module in this Massive Open Online Course. We are looking at convex optimization problems, and more specifically at the class of linear programs, ok. We have looked at linear programs and also demonstrated a practical application of the linear program, all right. In this module, let us start looking at a variation, or an extension, of it, known as the robust linear program, all right.

(Refer Slide Time: 00:49)

So, what we want to look at in this module is another interesting extension of a linear program, which is also very useful practically, that is, the robust linear program. A linear program, if you remember, can be formulated as follows: we want to minimise a bar transpose x bar, which is our affine objective function, subject to the affine constraints c bar i transpose x bar less than or equal to d i, for i equals 1 to l.

Now, while formulating this linear program, we have assumed that these vectors c bar i are perfectly known. Remember, these c bar i, the vectors which characterize the problem, depend on the problem; for instance, in our base station cooperation example they are basically the power gains, the gains between the base stations and the various users, all right. So, in practice these have to be estimated for any particular problem, which naturally means that there can be a certain level of uncertainty in them. That is what happens, as we have seen many times before, in practical scenarios, and it gives this formulation a lot of practical relevance.

(Refer Slide Time: 02:49)

So, there can be uncertainty in this c bar i, and there can be various levels of uncertainty and various models for it. One interesting and practically useful model for such scenarios, where there is uncertainty in c bar i, is to assume that these c bar i's are random in nature, all right. This gives rise to a stochastic linear program, a stochastic version of the linear program. In particular, you can assume that the c bar i's are Gaussian random vectors, with their nominal value as their mean, mean equal to mu bar i, and with covariance given by the expected value of (c bar i minus mu bar i) times (c bar i minus mu bar i) transpose, which is equal to R i; this is the covariance matrix.
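A tiny sketch of this uncertainty model (editor's illustration; the nominal vector and covariance below are made-up), just to make the Gaussian assumption concrete: we draw samples of c bar i around its nominal value mu bar i with covariance R i and look at how often a given constraint holds for a fixed candidate x bar.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_i = np.array([1.0, 0.5])                   # nominal (mean) value mu_bar_i, assumed
R_i = np.array([[0.05, 0.01],
                [0.01, 0.02]])                # covariance R_i, assumed
x = np.array([2.0, 1.0])                      # some candidate solution x_bar
d_i = 3.0                                     # constraint level d_i

# Draw realizations of the uncertain constraint vector c_bar_i ~ N(mu_i, R_i).
c_samples = rng.multivariate_normal(mu_i, R_i, size=100000)
lhs = c_samples @ x                           # c_bar_i^T x_bar for each realization
print("fraction of realizations with c_i^T x <= d_i:", np.mean(lhs <= d_i))
```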
(Refer Slide Time: 04:29)

Now, c bar i is Gaussian and x bar is a vector, correct. Therefore, c bar i transpose x bar is also random, because c bar i is a random vector; so this quantity is random, ok. And in fact, you realize something interesting: we can also say that it is Gaussian, the reason being that c bar i is a Gaussian random vector and c bar i transpose x bar is a linear combination of the components of this vector c bar i. When you linearly combine Gaussian random variables, you get another Gaussian random variable; a linear transformation of Gaussian random variables yields another Gaussian random variable. Therefore, c bar i transpose x bar is in fact a Gaussian random variable, all right.

(Refer Slide Time: 07:01)

So, and remember the covariance characterizes the spread around the mean all right. So,
this is a vector I am simplifying this for a scalar, for instance we have a Gaussian all right
and this is your mean mu i. And as the variance decreases, as the variance decreases, it
becomes more and more concentrated on the mean all right. So, C bar i is a Gaussian
random vector the nominal value or let us say the estimated value of this is mu bar i that
is the vector. And the covariance matrix characterizes the spread around that. If the
covariance matrix let us say the eigenvalues or if the variances of these different
elements of C bar i are low, it means that it is very close to the mu; if the variances are
large it means that it has a large spread around the mu ok.

So, basically that is what characterizes this problem. And of course, when the covariance matrix tends to 0, or becomes very close to 0, it reduces to the previous version, because in that case c bar i becomes equal to the mean mu bar i with probability 1, and we are again reduced to the deterministic linear program that we have seen before, ok.

So, now, for this stochastic linear program, that is, the linear program in which the c bar i's are random, how do you formulate it? Let us go back and take a look at the constraint: the constraint is c bar i transpose x bar less than or equal to d i. Now, c bar i is random, in fact Gaussian, and therefore c bar i transpose x bar is a Gaussian random variable, as we just saw, and it can take any value in minus infinity to infinity; that is the other interesting thing. Remember, c bar i transpose x bar was previously deterministic, and now it is random, so it can take any value between minus infinity and infinity. This means that it need not be less than or equal to d i: there can always exist some realizations of the vector c bar i for which it exceeds d i. Because of the random nature of this constraint, one cannot hope that the constraint is always satisfied in the optimization problem, alright.

So, because this quantity is random and varies randomly, the constraint cannot be satisfied all the time; it only holds with a certain probability. The modified problem, in which each constraint is only required to hold with a certain probability, for i equals 1, 2, up to l, is your stochastic LP, because the constraints are random in nature and hold with a certain probability; you can also call it a robust LP, because you are taking the uncertainty in c bar i into account. Let us say this probability is eta i, or call it beta i; what do we mean by that?

(Refer Slide Time: 11:47)

So, what or eta i, let us say the probability that C bar transpose i x bar less than or equal
to d i has to be greater than or equal to eta i which means the, what we are saying is very
interesting the probability with which the constraint holds that this you can only talk in
terms of probability. So, this constraint holds with probability eta i that a what we are
saying is this constraint need not hold all the time, but it can hold with a very high
probability that is the probability with which this constraint holds is has to be greater
than or equal to eta i. For instance, eta i can be let us say 95 percent just take a simple
example which means that this constraint has to hold 95 percent of the time ok. So, eta i
can be let us make it 0.95 95 percent or 0.95 which means this constraint has to hold;
which means this constraint which means this constraint holds 95 percent of the time..

(Refer Slide Time: 09:59)


Now, let us modify this problem further. Now, let us look at this quantity C bar i
transpose x bar; if you look at this quantity C bar i transpose x bar, we have already said
that this quantity this is a Gaussian random variable. Now, let us find the mean and
variance of this Gaussian random variable, we have expected mean is simple expected
value of C bar i transpose x bar this is expected value of C bar i transpose times x bar or
which is equal to expected value of C bar i is mu bar i. So, this is mu bar i transpose x
bar which is equal to x bar transpose mu bar i. So, this is the mean, this is the mean.

Now, what is the variance of this, what is the variance? Now, variance is the expected
value of the random variable that is C bar i transpose x bar or you can write it as x bar
transpose C bar i. So, x bar transpose C bar i minus its mean, which is x bar transpose
mu bar i whole square.

And therefore, now I can modify this linear program, now I can modify this optimization
problem. I can write this as minimise a bar transpose x bar plus b subject to the
constraint that the probability C bar i transpose x bar less than or equal to d i greater than
(Refer Slide Time: 13:23) Which is basically expected value of I can take the x bar outside. So, this is x bar
transpose and now you have something interesting expected value of C bar i minus mu
bar i time C bar i minus mu bar i transpose into x bar. And this is equal to x bar transpose
R i into x bar, where R i you can see this is the covariance that is expected value of C bar
i minus mu bar i into C bar i minus mu bar i transpose. So, basically the variance of this
C bar i transpose x bar or x bar transpose C bar i, this is x bar transpose R i into x bar
that is the variance of this quantity. And therefore, and therefore, now what we have seen
is x bar transpose C bar i, this is Gaussian with mean x bar transpose mu bar i and
variance x bar transpose R i into x bar ok.

And now let us find remember our constraint involves the probability. So, now, let us
find what is the probability that x bar transpose mu bar i is less than or equal to. So,
therefore, x bar transpose C bar i is Gaussian alright. Now, let us go back to the
constraint and find what is the probability that this let us simplify the probability that x
Now, I can write this as expected value of x bar transpose C bar i minus mu bar i whole
bar transpose C bar i is indeed less than or equal to d i.
square. Now, this is a scalar quantity. So, I can write this as expected, so I can write it as
a quantity into itself or quantity into its transpose because this is a scalar quantity x bar (Refer Slide Time: 16:33)
transpose C bar i minus mu bar i this is the scalar quantity or basically it is simply a
number. So, this is simply a number. So, this is therefore equal to expected value of x bar
transpose C bar i minus mu bar i times its transpose which is C bar i minus mu bar i
transpose times x bar.

(Refer Slide Time: 14:27)

Now, if you look at that the probability, probability x bar transpose C bar i less than
equal to d i. This is equal to the probability, now x bar transpose let us me a subtract the
mean x bar transpose C bar i minus x bar transpose mu bar i less than or equal to d i
minus x bar transpose mu bar i alright. I am subtracting the mean x bar transpose mu bar
i. Now, I am going to denote divide by the variance, remember the divided by the square (Refer Slide Time: 19:29)
root of the variance.

(Refer Slide Time: 17:21)

So, this is equal to 1 minus the probability that this quantity x bar transpose C bar i
minus x bar transpose mu bar i divided by square root x bar transpose R i x bar is greater
than or equal to x bar transpose mu bar i divided by x bar transpose R i into x bar. And
So, this is the variance and standard deviation, you can think of this as sigma square. So,
now, we have probability that the standard the standard normal random variable
sigma equals standard deviation equals square root of x bar transpose R i into x bar. And
Gaussian random variable mean 0, variance 1 is greater than equal to some threshold that
so therefore, this is equal to the probability divide both sides by the standard deviation.
is given by the Q function alright.
This is equal to the probability that x bar transpose C bar i minus x bar transpose mu bar
i divided by square root of x bar transpose R i x bar is less than or equal to d i minus x So, this quantity is nothing but the tail probability of the standard normal which is equal
bar transpose mu bar i divided by square root of x bar transpose R i into x bar. to Q of d minus x bar or d i minus x bar transpose mu bar i divided by under root x bar
transpose R i x bar variance. And this quantity is equal to 1 minus that, 1 minus the tail
And now if you look at this, if you look at this quantity, what we have done is from a
problem.
Gaussian random variable we have basically subtracted the mean and divided by the
standard deviation, so that gives us a zero mean unit variance Gaussian random variable
which is nothing but the standard normal random variable. So, this gives us this
manipulation has given us the standard normal, what we call the standard normal R
means that is Gaussian random variable mean equal to 0. Now, the probability that this is
less than equal to this is 1 minus the probability that this is greater than equal to this.
(Refer Slide Time: 21:05) The probability x bar transpose C bar i less than or equal to d i greater than equal to eta
this implies the Q of di minus x bar transpose mu i divided by x bar transpose R i into x
bar square root, this has to be or 1 minus Q I am sorry 1 minus Q has to be greater than
or equal to eta ok. And this implies that Q of x bar transpose mu i divided by square root
x bar transpose R i into x bar is less than or equal to 1 minus eta.

(Refer Slide Time: 23:13)

And therefore, the probability and that is what we are saying right. This is the tail
probability that is a probability standard normal if you remember Q of x equals
probability x greater than or equal to x, where x is your standard normal that is Gaussian
random variable with mean 0 and variance unity. And we need this probability to be
greater than or equal to eta. So, this means this quantity is greater than equal to eta. So,
therefore what we have is, if you look at it what we have is the following thing.
Now, the Q function is a decreasing function implies, this implies di minus x bar T
(Refer Slide Time: 21:57) transpose mu bar i, sorry this is has to be all mu bar i over mu bar i over x bar transpose
R i x, bar this has to be greater than equal to q inverse 1 minus eta ok. This is the Q
function is the decreasing function the equality gets reversed. So, if this has to be less
than equal to this that implies that this quantity has to be greater than equal to Q inverse
of 1 minus eta.

And note that if eta is greater than 0.5 that is the reliability with which this has to be hold
is greater than 50 percent then this Q inverse 1 minus eta this is greater than equal to 0.
So, this quantity is greater than equal to 0, eta is greater than equal to 0.5
(Refer Slide Time: 24:27) (Refer Slide Time: 26:23)

Let me just simplify this now further into something that is interesting. Remember R i is And now if you look at this, you will have the something interesting. You have here a
a covariance matrix. So, R i is a positive semi definite matrix. So, I can factor it as R i norm and here you have something that is affine. So, you have norm of vector x bar
tilde R i tilde transpose this implies x bar x bar transpose R i x bar equals x x bar something all right, norm less than or equal to affine function of x bar, remember this
transpose R i tilde R i tilde transpose x bar. And if you look at this; this is nothing but R i represents the cone. So, this is the conic constraint ok. So, this represents a cone, and this
tilde x bar norm square. And therefore, square root of x bar transpose R i x bar equals is equal to a conic. So, this will become a cone program. So, this implies work with
norm of R i tilde transpose x bar. robust linear program that we are talking about, this becomes a cone program implies or
the stochastic LP, you can call it the stochastic LP equals a cone program. It has conic
And therefore, the condition above this implies that d i minus x bar transpose mu bar i
constraints ok.
over square root of x bar transpose R i x bar which is nothing but norm R i tilde
transpose x bar is greater than or equal to Q inverse 1 minus eta which basically implies (Refer Slide Time: 27:29)
that again Q inverse 1 minus eta is positive. So, you can write this as R i tilde transpose x
bar is less than or equal to d i minus x bar transpose mu bar i divided by Q inverse of 1
minus eta ok.
And therefore, now you can formulate this stochastic LP as follows that is basically Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
minimise a bar transpose x bar plus b subject to the constraint that your norm R i tilde
Department of Electrical Engineering
transpose x bar is less than or equal to d i minus x bar transpose mu bar i Q inverse 1 Indian Institute of Technology, Kanpur
minus eta for i equal to 1, 2 to l. In fact, this can be eta i does not really matter, this can
Lecture - 50
be eta i. I am just going to correct this over all places it can be a common eta or it can be Practical Application: Multiple Input Multiple Output (MIMO) Beamforming
eta i. You might want different reliabilities for different constraints alright.
Hello, welcome to another module in this massive open online course. So, we are
So, this can be there you go. So, it is eta i and you can correct it everywhere to eta i; and
looking at various convex optimization problems, and especially focusing on their
this has to hold for i equal to 1, 2 up to l. So, this is your linear program. You can think of
practical applications all right. We are looking at this from an applied perspective all
this as a robust version for the scenario when these vector C bar i are random in nature
right. In this module, let us look at another interesting problem with a very important,
all right, so that is another interesting flavour.
and I would like to say novel practical application that is of beamforming in MIMO

In fact, I can say this is a practical flavour of the traditional linear program. The linear systems.

program itself has several practical applications, this in that sense it makes in more
(Refer Slide Time: 00:40)
practical or immensely enhances its utility by making more relevant because several, in
several scenarios these coefficients C bar i might not be known; might not be known
accurately or might be known only with the certain degree of accuracy. In particular
when they are random and they can be modelled as Gaussian random variables one can
use this interesting framework to formulate the equivalent either stochastic or you can
call it with the stochastic version of the robust stochastic LP or robust LP or the
stochastic version of the robust. We will stop here.

Thank you very much.

So, what you want to look at in this module is the problem of MIMO. Now, what
happens in a MIMO, now what we have looked at previously, we have seen only
multiple antenna previously we have seen when you have multiple beamforming, when
you have multiple receive antennas. You might recall that we consider a system, where
which is modelled as y bar equals h bar times x plus n bar. And then we derived the
beamformer for this system, the beamformer for this system is given by W bar equals h
bar divided by norm of h bar; the system the MRC or the maximal ratio of beamformer
ok, the maximal ratio combiner. We have seen this before.
Now, what we want to do is we want to extend this to a MIMO system. Remember, you (Refer Slide Time: 03:44)
might recall MIMO system is a system, which has not just multiple receiver antennas,
but also multiple transmitter antennas. So, your multiple antennas is the transmitter, as
well as the receiver that is why it is known as the MIMO system or the multiple input
multiple output system.

(Refer Slide Time: 02:34)

And one can represent this is from let us just, so you have let us say this is your
transmitter, you have multiple transmit antennas. And the same time, you have multiple
receive antennas. Now, these need not be equal ok. And this is your transmitter, this is
your receiver, and you have the different possible paths, you can say or links between the
transfer. Yeah, let us say t antennas at the transmitter, these are your t transmit antennas,
and this are your r equals to the number of receive antennas. And this is your transmitter,
So, what we want to consider is this interesting problem of beamforming. And this is not
this is your receiver.
so trivious straight forward beamforming for MIMO ok, which basically simply stands
for Multiple Input and Multiple Output multiple input, which basically implies that you (Refer Slide Time: 04:54)
have multiple transmit and multiple receive antennas multiple T X plus you have plus
you have multiple R X antennas multiple receiver, yeah multiple transmit and multiple
receive antennas.
And the system model can be represented as follows. You are received vector of symbols interference at the receiver. So, beamforming has many uses and very useful practically.
y 1, y 2 up to y r, since there are r receive antennas. You are going to have r receive Now, how do we do beamforming in a MIMO system that is what we want to explore.
symbols y 1, y 2, y r equals H, this is your channel matrix times x 1, x 2 up to x t,
(Refer Slide Time: 08:17)
because your t have t transmit antennas, therefore you can transmit t symbols plus you
have your noise vector.

Now, as you can see this is y bar, which has r symbols ok, which is basically r cross 1
vector. This is x bar, which is t cross 1 that is it has t symbols, because you have t
transmit antennas. And now naturally, you have r dimensional output, t dimensional
input, so the channel matrix that transforms the t dimensional input, r dimensional
output, this is r cross t. This is known as the H equals the r cross t that is r rows and t
columns in the MIMO channel matrix. So, this is your r cross t MIMO channel matrix.
And n bar that is the noise vector, this is an r cross 1, this is an r cross 1; this is an r cross
1 noise vector.

And what we want to do is we want to explore, how can we use beamforming?


Remember, beamforming means transmitting in a particular direction that is focusing the
So, let us say we want to transmit the vector x bar, we want to use a beamforming.
transmit power in the particular direction, and receive beamforming means focusing that
Remember, beamforming vector is central to the idea of beamforming. So, this v bar this
is looking in a particular direction to receive the maximum amount of energy that is what
is the vector of so this is your beamforming vector. And this is more specifically since
we mean by beamforming all right.
you are doing this at the transmitter, this is also known as the, this can also be known as

So, beamforming in a multiple antenna system remember that is basically you are the transmit the transmit beamformer or the transmit.

looking in a particular direction, and at the transmitter also you have transmitting in a
So, what you are doing is you are using v bar at the transmitter all right. To focus the
particular direction. So, this is your transmit beam so as to speak, and this is your
energy or transmit the energy or transmit the signal in a particular direction, this is
received beam. Now, there can be multiple beams, but when people talk about
known as transmit beamforming. And this v bar gains the transmit beamformer ok. And x
beamforming that typically talk about a single beam.
is the transmitted symbol, which is transmitted with the aid of this transmit beamform.

So, you have a transmit beam that is you are focusing the energy in the particular Remember, this is the concept of electronic steering, where you are adjusting the

direction, and you are have a received beam that is you are looking for the signal that is weights, such that you focus the signal in a particular direction. So, v bar x; this is the

you are receiving the signal in a particular direction. By doing that, you are achieving transmit.

several things as we have seen in multiple antenna beam forming that is one is you are
Now, what we now we can have the transmit power equals can be P that is this is the
maximizing the signal to noise power ratio. Two by using only a particular direction, you
transmit power. Now, the beamformer itself is simply focuses the signal, therefore it
can be you can avoid the interference that is the interference, which is caused by the
should not amplify it or attenuate the signal. So, what we will do is we will fix the power
interfering uses. By avoiding those other directions, you can basically suppress the
of the beamformer that is norm beam w square (Refer Time: 09:52) so the transmit
power is fixed. So, the energy of the beamformer there is norm (Refer Time: 09:55) bar
square is fixed, so that basically this imposes a transmit power constraint. So, transmit (Refer Slide Time: 12:47)
power is limited to P. So, this limits the power that has to be transmit.

So, this is also known as the transmit power constraint. So, norm v bar square equal to 1
ok. So, this is to basically this constraint is to limit, the transmit power we want to limit it
to a, we want to make limit it to a unit norm beamforming. So, now remember, we have
this system model here, which is basically your y bar equals H times x bar plus n bar, this
is your MIMO, this is your MIMO system model.

(Refer Slide Time: 11:02)

Now, this becomes your u bar equals to your. Well, you can think of this as the receive
filter, you are combining with u bar, so you are performing u bar Hermitian y bar. And
also, we can restrict without loss of generality, we can restrict norm u bar to 1 that is the
norm of the received beamform is also being restricted or limited to 1.

And therefore, now if I substitute this model, I have u bar Hermitian y bar. Now, I am
going to substitute expression for y bar equals u bar Hermitian H x bar plus n bar equals
u bar Hermitian. Now H x bar of course, x bar we replaced by H times v bar into x plus n
bar, which is now if we simplify this.
Now, what we want to do is we have y bar equals H x bar plus n bar. Now, here we want
to substitute x bar equals v bar times x that is what we said, v bar is a transmit
beamformer, x is the transmitted symbol, v bar is a transmit beamformer, x is the
transmit symbol. And therefore, now this is going to be H times v bar x plus n bar, all I
am doing is of so instead of x bar, I am substituting x bar equals v bar times x. Now this
is y bar.

And now, what we are going to do at the receiver is at the receiver, we are going to
employ a combiner. So, this is your received symbol vector. At receiver at the receiver,
we employ a combiner. Now, at the receiver, one can employ a combiner, which is of the
form that is you perform u bar Hermitian y bar. So, this is basically your receive
beamformer.
(Refer Slide Time: 14:01) the challenging problem that one has to address, which we will solve in the subsequent
module.

Thank you very much.

This is u bar Hermitian H v bar into x plus u bar Hermitian. And now what we have to do
is we have to determine both u bar and v bar that max. So, this is my receive
beamformer, and this problem is challenging, because you can see you have to determine
not one, but two beamformer. So, this is your beam for receive beamformer, this is your
transmit beamformer. And so, now we have to jointly, so now it what is the problem in
MIMO beamforming, we have to jointly determine both R X and T X beamformer that is
u bar and v bar v bar to maximize the SNR all right.

So, what we want to do is basically we have to determine both these beam formers that is
both your u bar and also v bar. That is remember normally in a multiple antenna
beamforming, when you have multiple antennas only at the receiver, you are simply
determining the single beamformer, there is beamformer at the receiver that is what we
have done so far.

But, here because you have multiple antennas both at the transmitter as well as the
receiver, one has to determine both the optimal beamformer that is v bar that is the
optimal beamformer of the transmitter, and u bar that is optimal beamformer at the
receiv[er]. And these problems are interlinked, because depending on v bar, one has to
choose u bar; and depending on u bar, one has to choose v bar. So, what is the what are
the optimal beamformers u bar and v bar that maximize the SNR at the receiver that is
Applied Optimization for Wireless, Machine Learning, Big data (Refer Slide Time: 01:35)
Prof Aditya K. Jagnnatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 51
Practical Application: Multiple Input Multiple Output (MIMO) Beamformer
Design

Hello welcome to another module in this massive open online course, so we are looking
at MIMO beam forming; how to design the optimal transmit and receive Beamformers in
a multiple input multiple output wireless communication system. So let us continue our
discussion.

(Refer Slide Time: 00:29)

We will set this quantity equal to h bar and now you will observe something interesting
this becomes u bar Hermitian h bar into x plus u bar Hermitian times n bar. Now if you
see this h bar this effectively becomes a single input multiple output system by setting H
v bar equal to h bar this effectively becomes us a SIMO system or a multiple simply
multiple receiver antenna system.

For which we already know the optimal beamformer effectively becomes your multiple
RX antenna system and once it become an RX antenna system what you have is that this
is equal to. Now, therefore, now you know what is optimal beam former u bar, the
optimal beam former u bar for this multiple antenna system optimal RX beamformer is
the maximal ratio combiner; that is u bar equals h bar divided by norm h bar ok. And
therefore, we know this is the maximal ratio combiner, this is your optimal beamformer
So, what we want to look at is we want to look at MIMO beamforming and we have said
maximum this is a maximal ratio combiner.
after combining with u bar Hermitian u bar is at receive beamformer we have u bar
Hermitian H v bar times x plus u bar Hermitian noise this is the noise vector. Now, what
we have to do is this is your receive beamformer and u bar and v bar is a transmit beam
former and we want to design this to maxi jointly design transmit and receive
beamformers to maximization are first what we will do is we will set H v bar.
(Refer Slide Time: 03:25) (Refer Slide Time: 04:49)

And the output SNR of the maximal ratio combiner is given as norm h bar square P over So, we have to maximize this quantity where what is G, G equals H Hermitian H G
sigma square remember P is the transmit power sigma is the noise power. So, this equals H Hermitian H and we have to maximize. So, optimisation problem, so the net
quantity is constant P over sigma square because, P is the transmit power sigma square is optimisation problem becomes maximize v Hermitian G v subject to the constraint
the noise power implies that we have to maximize norm h bar square. So, to maximize remember the constraint is still there unit norm constraint that is the transit beam former
the SNR we have to maximize norm h bar square ok. So, in order to maximize SNR, so energy of the transmit beamformer or power of the transmit beam former norm v bar is
that is optimisation problem maximize or let us say maximize output SNR we have to less than or equal to 1. So, this is the resulting problem for optimal beam forming alright
maximize norm h bar square. we have substituted H capital H bar equals H times v bar and from that you derive this in
terms of the optimal this optimisation problem for the optimal beam former.
But h bar equals we have seen h bar equals h times v bar or capital H times v bar. So, h
bar equals H times v bar which basically implies substituting this we have to maximize And a now to just to simplify it what I am going today is I am going to assume real
norm H times v bar square which is basically norm H times v bar square. And what is vectors I am going to place this Hermitian by transpose. So, I am going to say to simplify
norm H times v bar square H times v bar square is the vector Hermitian itself for a consider real vectors. So, this becomes v bar transpose G v bar subject to the constraint
complex vector which is basically v bar Hermitian H Hermitian H into v bar which is norm v bar less than one or norm v bar square less than equal to 1 both these constraints
equal to v bar Hermitian G into v bar. are equal equivalent and observe that interestingly this is a non-convex problem.
(Refer Slide Time: 07:01) (Refer Slide Time: 08:39)

This is one of the few and very interesting non-convex ones because if you look at this v That is F of v bar coma lambda remember you have the objective v bar transpose G v bar
bar transpose G v bar is a convex function correct. However, we are maximizing it we plus lambda times 1 minus norm v bar square objective plus eigen plus Lagrange
are not minimising remember standard form convex optimisation problem we have a multiplier times constraint. This is equal to v bar transpose G v bar plus lambda times 1
convex objective, but your minimising it here a convex objective your maximizing. So, minus v bar transpose v bar now, differentiate it take its gradient with respect to v bar
the problem is a non-convex problem although the objective is convex because your this v bar transpose G v bar twice G v bar plus gradient of lambda with respect to v bar is
maximizing it is it so non-convex. 0 minus lambda derivative of v bar transpose v bar is twice v bar minus twice v bar or it
write minus twice lambda v bar this is equal to 0.
In fact, if you minimise it you have to take the minimum and minimiser, but once you
take the minimum the convex objective becomes a concave objectives as and therefore, it (Refer Slide Time: 09:41)
is a non-convex problem ok. And therefore, is because G is a positive semi definite
matrix G equals a PSD matrix and v bar transpose G v bar equals convex however, what
your doing is your minimising a convex objective. Your performing minimization over a
convex objective function implies that this is non-convex and therefore, now what will
do is we will form the Lagrangian the same thing that we have done before.
Now, if you solve this the two is cancel and therefore this implies that G v bar equals lambda max of G which is lambda max of H Hermitian H G is a H Hermitian H. And
lambda times v bar and then you observe something very interesting what you observe is that is something that is extremely interesting what you have what it says is the transmit
the G v bar equals lambda v bar. And recall you have already seen this kind of equation beamformer H Hermitian H corresponds to the max eigenvalue corresponding to the
this equation before this is the nothing, but with the definition of the eigenvector of G maximum eigenvector of H Hermitian H that is the transmit beamformer v bar unit norm
that is any vector v bar which satisfy this property that is G times v bar vector equal transmit beamformer v bar which maximizes the SNR at the receiver.
simply a scaling factor lambda times v bar v bar is the eigenvector and lambda is the
(Refer Slide Time: 12:18)
eigenvalue. So, that is the interesting property so what this shows is that the optimal
transmit vector v bar equals eigenvalue of G equals now you can write H Hermitian H
that was just for simplicity.

So, there is a eigenvector of H Hermitian H there is an eigenvector of H Hermitian H


right, but correct right. So, this has many eigenvalues now which eigenvalue now how to
find which eigenvalue or you can say how to find the Lagrange multiplier lambda ah. So,
we can say that this will be v transpose G v bar now G v bar equals lambda v bar so this
because v transpose lambda into V bar which is lambda times v transpose v bar which is
lambda times norm v bar square, but norm v bar square equal to 1 so this becomes
lambda.

(Refer Slide Time: 11:15)

So, choose v bar unit norm choose v bar equals unit norm eigenvector corresponding to
maximum eigenvalue of corresponding to the maximum eigenvalue of H Hermitian H
that is the interesting aspect. This is also termed this Eigen vector corresponding to the
largest eigenvalue this also termed as the principal eigenvector, transmit beam former v
Hermitian transmit beamformer v bar is the principal eigenvector corresponding to H
Hermitian H.

So, this is lambda and you want to maximize this want to maximize v bar transpose G v
bar implies choose the maximum lambda or choose eigenvector corresponding to
maximum lambda eigenvalue of H Hermitian H. That is let us say we can denote this by
(Refer Slide Time: 13:46) (Refer Slide Time: 15:28)

Now whatever u bar the receive beam former remember u bar receive Beamformer we And now u bar is simply u tilde divided by norm of u tilde which is principal eigenvector
still have to find that that is H v bar divided by norm H v bar. Now, this norm is simply with unit norm remember you can simply scale the eigenvector by any quantity it will
normalization so for the time being ignore this u tilde equals H v bar now look at this still be an eigenvector. So, this is the you can say principal eigenvector of H H Hermitian
now perform H H Hermitian u tilde equals H H Hermitian into H v bar, but v bar is with this is a principal eigenvector of H H Hermitian with unit norm great.
eigenvector of H Hermitian H correct.
And therefore, that basically gives us both the transmit and receive beamformers and we
So, this will become H now look at this H Hermitian H v bar is equal to lambda V bar so have very interesting expressions for them the transmit beamformer u bar optimal
this is lambda times H v bar, but H v bar is u tilde. So, what we have here is we have transmit beamformer.
shown something very interesting H H Hermitian u tilde equals lambda times u tilde
(Refer Slide Time: 16:32)
implies u tilde is the Eigen that is the receive beam former u tilde is the eigenvector of
the matrix H H Hermitian and that is something that is interesting u tilde. Again you can
see u tilde or now you can say u tilde equals principal eigenvector of H H Hermitian that
is eigenvector corresponding to largest eigenvalue of H H Hermitian.
So, to summarise you have the MIMO Beamforming problem y bar equals H x bar plus n singular and this is a key phrase not eigenvectors, but singular vectors of H from the
bar x bar equals. So, this is v bar times x plus n bar that is your y bar and at the receiver SVD what we call not the EVD from the SVD which is called the singular value
you perform u bar Hermitian v bar y bar which is u bar Hermitian H v bar x plus u bar decomposition. From the singular value decomposition and further from the singular
Hermitian n bar. value decomposition and remember dominant singular values means corresponding to
largest singular value.
And remember what you are doing here is as I already told you have to perform
beamforming at both the ends in the MIMO system. So, you have the transmitter you (Refer Slide Time: 20:02)
have the receiver your transmitting from the transmitter in a particular direction at the
receiver your also collecting or your also processing the signal your steering the receiver
antenna array in a particular direction. So, this is basically you are transmit steering
remember these all electronic steering so you do not need to physically steer and this is
your receive steering.

(Refer Slide Time: 18:03)

Dominant singular vector means corresponding to the largest singular value and in this
context also we have seen a very interesting optimization problem that is if you take a
positive semi definite matrix x bar transpose A x maximise subject to the constraint norm
x bar equal to 1 or norm x bar less than or equal to 1. Then x bar equals principal
eigenvector provided A is positive semi definite remember x bar transpose x provided A
is PSD positive semi definite matrix provided x bar is a positive semi definite matrix this
is a principal eigenvector of A that is a maximum A bar.
And u bar u bar equals eigenvector of H H Hermitian corresponding to larger eigenvalue
or principal H H Hermitian and v bar equals principal eigenvector of H Hermitian H. Now similarly if you minimise, now this is another interesting analogue. Now, this
Now later what we will see is we will see what is known as the singular value problem is convex minimise x bar transpose x bar such that norm x bar of course, this is
decomposition of the channel matrix and it will turn out that u bar and v bar are in fact, not convex again in the sense that the constraint is not convex norm x bar greater than or
the dominant left singular and right u bar is the dominant left singular vector which equal to 1, then x bar equals eigenvector corresponding to smallest eigenvalue.
means singular vector corresponding to the larger singular value.

And similarly v bar is the dominant right singular vector singular vector corresponding to
larger singular value. In fact, we will see later that is u bar comma v bar are the dominant
(Refer Slide Time: 21:57) Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 52
Practical Application: Co-operative Communication, Overview and Various
Protocols Used

Hello, welcome to another module in this Massive Open Online Course. We are looking
at several optimal optimization paradigms, in fact several practical applications of the
optimization theory that we have seen so far. Let us we look at yet another interesting
and novel application of the optimization framework and that is with respect to
cooperative communication.

(Refer Slide Time: 00:36)


So, there are there is a analog this problem eigenvector corresponding to x bar equals
eigenvector corresponding to the smallest eigenvalue. And therefore, this is a very
interesting application as we have seen here that is basically with respect to beam
forming in MIMO system to determine the top table transmit and receive the unit norm
transmit and receive beamformers which are given as v bar optimal transmit beamformer
is the principal eigenvector of H Hermitian H or you can also say the dominant left
singular vector of H.

And u bar is the principal eigenvector of H H Hermitian that is the dominant left singular
vector of H, the channel matrix H is the channel matrix of the MIMO wireless system
alright. So, with this interesting observation or this completing after completing this
interesting example we will stop here and we will continue in the subsequent modules.

Thank you very much. So, starting this module in this and probably the next couple of modules, we want to look
at another very novel communication paradigm or communication framework and that is
of cooperative communication or corporation in wireless communication. This is in fact
a very recent idea and is emerging to be very popular modern wireless communication
systems especially for high data rate and you can say next generation wireless
communication. So, very important you can say a mode very important technique. And
this is also something that is latest it is probably not more than about 5 to 10 years also
important technique for modern wireless for modern wireless communications systems.
(Refer Slide Time: 01:59) some basic signal processing operations or complex and advanced signal processing
operations that depends on the nature of the relay and then it relays the signal forward it
forwards this signal to the destination. So, the relay, so what happens is the source relay,
so this is your relay.

And again there can be a single or multiple relay there can be a single relay or more
relay. So, this relay receives the signal it makes some signal processing operations on it
and forwards it. So, relay receives and retransmits the signal and this is the important
operation; relay receive and of course, relay receives and retransmits after suitable
operation after processing some processing some after some processing. Now, so relay
retransmits the signal basically that is the point. Now, this can significantly enhance.

(Refer Slide Time: 05:28)

Now, what happens in cooperative communication system, now typically if most of you
already know, what typical wireless communication system at a very high level, what
happens is you have a base station correct. And we must have seen this in many many
times before and even in this course that is you have a base station. And then you have a
mobile which is also termed as the user equipment and so on. These can basically be
termed as the source of the communication signal that is S. And this is the destination
and typically what we have is we have communication from the source to destination.

So, in typically we have or frequently in a wireless communication system, we have


there is wireless communication between the source and the; the source and the
destination node. Now, what happens is in addition what you can have to enhance the
quality of communication is to have what is known as a relay, remember what a relay
So, the relaying this process, because now you have an additional copy of the signal you
does or the meaning of this, what relay is basically something that takes something and
have you have an additional copy of the transmit signal, which results in significantly;
relays it relays in relay in the sense that it hands it over all right that is the principle of
this significantly enhances the quality of communication or we can say enhances the
relay. And it is it is not with respect communication, but that is a general definition of a
reliability. We are going to define that in a little bit this significantly enhances the
relay. For instance, you have a relay race format in which one of the runners takes the
reliability of communication in the wireless communication system. What do you mean
baton carries it forward and then hands it over to the next runner and so on that is the
by reliability, we are going to define that what is the metric for reliability, but you can
concept of relay.
see that you have an additional copy. And this process what the relay is doing this is
Specifically, in the context of wireless communication what a relay does is it receives the termed as cooperation; relay is cooperating, what is the relay doing, the relay is
signal from the source; can perform, can manipulate it can perform some operations on it cooperative communication cooperating.
Hence, this is termed as this paradigm is hence typically termed as hence termed as d is source destination, h s r is source relay, h r d is relay destination. So, h s d equals the
cooperative that is the title of this module. Hence, this is frequently termed as this channel coefficient of the direct or S D link h s r is for the SR link the channel
paradigm, where we have a relay takes gets the signal from the source. Manipulates it coefficient. And h r d is for the RD link it is self explanatory all right. Just writing it out
performs some operations on it and transmits to the destination there by aiding in this explicitly, but then the notation is self explanatory.
process or cooperating with the source in this process of communication wireless
(Refer Slide Time: 10:10)
communication is termed as cooperative communication. Now, what is the relation of
cooperative communication to optimization we are going to see that shortly all right, but
first we have to set up this framework for cooperative communication.

(Refer Slide Time: 07:47)

And let us say these are relay fading and let us make a simplest let us make simplistic
assumption I mean these can be a follow any distribution when depending on the various
kind of distribution you will have different results, but let us make assume that these are
Rayleigh fading in nature, which implies basically that these coefficients are symmetric
Now, so schematically, let me just describe my cooperative communication system as the complex Gaussian implies that the channel coefficients are 0 mean symmetric complex
combination of these three nodes. So, I have the source which is S, destination source Gaussian in nature. These are symmetric 0 mean complex Gaussian.
sorry this is your relay and this is the destination, which is D. So, source to destination
And let us say, these average powers expected magnitude h s d square equals again these
this is also termed as the direct link, this is termed as the direct SD link it can be present,
are self explanatory that is the source destination links has average power delta s d
sometimes it can be absent, if it is for the shadowed and so on, then you have a link
square, source relay link has the average power delta s r square, and relay destination
where that is source to relay and then the relay to destination.
link has average power r d square. And these powers can vary depending on the distance
And let us denote the channel coefficients, remember each of this is a wireless link between source destination distance.
therefore each of these will be characterized by a fading wireless channel coefficient
For instance, if the source relay link is has a very the source and relay are very far or the
corresponding to each link source destination source relay and relay destination. And
source and destination are very far, let us say then the source destination average gain
therefore, each of these will have a channel coefficient all right. Let us, denote the
will be much smaller compared to for, let us say source relay and relay destination all
channel coefficients by the terms or by the symbols h. And these are self explanatory h s
right. So, there can be several scenarios depending on where the relay is located with
respect to source, and where the destination is located with respect to the source, and (Refer Slide Time: 13:30)
where the destination is located with respect the relay.

(Refer Slide Time: 12:00)

For instance, the distribution of this will be given as if you look at source destination
equals one over the average gain delta s d square w power minus beta s d divided by
delta s d square. So, this is what we mean by exponential distribution of the source
Similarly, you have expected magnitude h s r square equals delta s r square, and expected
definition. This is the exponential distribution for the source destination gain. So, if you
magnitude h r d square equals delta r d square. And we said these are, now if you denote
denote that random variable magnitude s d square by beta s d, we are denoting that by
these quantities by now these are of course, random in nature, because the fading channel
beta s d and beta s d is exponentially distributed.
coefficient is random. So, if you denote the gains of this fading channels by betas that is
h s d, h r d square equals beta r d gain of the relay destination link, and gain of the source Similarly, if you denote the magnitude s r square by beta s r, beta s r is also exponentially
relay link by beta s r. distributed, magnitude of h r d square by beta r d that is also exponentially distributed
and these can be given similarly. So, what we have is and these will come in handy later.
Now, these will have an exponential distribution, because these are Rayleigh fading,
So, I am just setting up its slightly as you can see elaborate, and as a result and that is,
which means the amplitude is Rayleigh fading, the gain that is magnitude, so the
because most of these things are these are the latest, and these are fairly sophisticated
magnitude is Rayleigh fading the gain that is magnitude square is exponentially fading
ideas. So, these require some amount of analysis.
random variable. These are exponentially fading I am going to give the exponential
distribution, these are exponential random variables.
(Refer Slide Time: 15:25) (Refer Slide Time: 16:58)

So, minus beta s r so this is the source relay link, which is average, which is exponential So, relay performs can perform amplify and forward this is known as the AF amplify and
with average power delta s r square. And similarly beta r d 1 by delta r d square this is forward protocol or the relay can perform decode and forward the system as the decode
exponential with average gain delta r d square. So these are the relevant exponential and forward protocol. Now, what we are going to do, so there can be many many
distribution. So, exponential distribution for source destination s d, exponential protocols, and the analysis for each protocol is going to be different. We are going to
distribution for source relay, exponential distribution for relay destination. look at one simple protocol which is termed as the selective decode and forward
protocol. What happens in the selective decode and forward protocol, it is a version or an
Now, there are several protocols remember we said that the relay does some processing.
extension of the decode and forward in which a relay receives the signal, it decodes the
So, when you source destination and relay, so the relay processes the signal perform
signal it forwards it selectively only if it is able to decode the signal or the symbol
some signal processing operation on the (Refer Time: 16:34) perform some operations on
correctly, that is why this is known as selective decode and forward.
the signal. So, it performs some it performs some signal processing. Now, depending on
the nature of the signal processing you have a different protocol. For instance, relay can
simply amplify the signal and forward it that is termed as amplify and forward; the relay
can simply decode the signal and forward that is termed as decode and forward.
(Refer Slide Time: 18:47) source transmits to destination and relay. So, destination is receiving it we are assuming
there is a direct source destination this relay link.

In phase 2 relay retransmits, if it able to decode correctly. So, if relay transmits in phase
2, then destination is copies from both source and relay. Otherwise it simply as the copy
from the source and that is similar to what you have in conventional wireless
communication or the classical form of wireless communication non cooperative. And
therefore, now you can see that this performance get typically critically depends on the
decoding at the relay.

(Refer Slide Time: 22:02)

So, relay forwards selectively what do you mean by selectively, only if r relay decodes
only if relay decodes the symbol correctly, this is known as selective DF or a selective
decode and forward this is known as the SDF protocol. So, this is a very simple protocol
which is of particular interest was this is known as the SDF or the selective. The
selective it is also one of the popular protocol. There are several protocols and one can
analyse several protocols one can consider and construct or develop optimization
problems for several protocols. In particular, we are interested in this selective decode
and forward protocol.

What and we have said as I have already said in selective decode, and protocol selective
So, if phi denotes the error phi denotes that the event error event at relay, e denotes the
decode and forward protocol, the relay forwards the symbol to the destination, only if the
error at destination, now node that probability of e that is the this is known as the
relay is able to decode the transmitted signal by the source accurately. So, this happens in
probability of error at the destination or this is also known as the end to end probability
two phases so the SDF so any communication in general happens over multiple phases.
of error. This is equal to probability of e, now we will use the total probability rule here
So, here this SDF phase 1 source transmits or rather source broadcast symbol, which is
we have to know little bit about probability. So, this is probability of intersection phi plus
received by destination and relay. So, the source in the first phase broadcasts the symbol
probability of intersection phi bar, because phi bar union phi union phi bar is the total set
to the destination and the relay.
total event space S.
In the second phase phase-2, thus relay retransmits or forwards relay retransmits or
So, I can write it. So, we have a partition mutually exclusive and exhaustive all right. So,
forwards to destination, only if it is able to decode correctly, only if the decoding at the
phi and phi bar are mutually exclusive and is an exhaustive partition. So, I can write
really success only if it is able to decode correctly all right. So, this is the 2 phases of the
using the total probability, I can write this as probability of the error at destination is
SDF, so these are the two phases of the SDF protocol. Phase 1, and phase 2, phase 1
probability of error intersection phi, and probability of error intersection phi bar right.
And it is also intuitive this is a probability of error with probability of error at destination (Refer Slide Time: 25:18)
and error at relay or probability of error or plus probability of error at destination and no
error at relay all right that is basically the total probability rule.

(Refer Slide Time: 23:55)

So, this implies 1 minus probability of phi approximately equal to 1, this is probability of
phi bar. So, what we have is probability of phi bar is approximately equal to 1, which
means using the above approximation you can approximate the probability of error
approximately as probability of e given phi into probability of phi plus probability of e
And now using the conditional probability I can write this probability of e intersection
given phi bar. What is probability of e given phi, probability of e that is error at
phi is probability of e given phi into probability of phi. This is the definition of
destination given that there is also error remember phi denotes the error even at relay. So,
conditional probability plus probability e given phi bar into probability of phi bar. So,
e given phi means error at destination given error at relay.
this follows, the first one follows basically from so total probability rule. So, this is
probability rule. Times probability of phi, what is the probability that error occurs at relay plus probability
of error given phi bar that is probability of error at destination given that there is no error
Now, probability of phi remember is the probability of error at destination. So,
at the relay in which case the relay retransmits. And this is an approximation, but this is a
probability of phi is close to 0 probability of error at relay, so this is close to 0 at high
good approximation which is very tight at high SNR, remember this is known as a high
SNR. When the signal to noise power ratio is high, probability of phi that is a probability
SNR approximation all right.
of error at the relay is close to 0, which means 1 minus probability that is probability of
phi bar which is 1 minus probability of phi that is approximately equal to 1. So, I am So, this is tight or is going to be very close to the actual P r. So, this approximation is
going to make that approximation. tight the approximation is tight at high SNR all right. So, now we have to calculate these
various quantities all right. So, now what we have to do is we have to calculate these
various quantities to basically derive the expression for the overall probability of error.
And this is something that we are going to look at in the subsequent modules.

Thank you very much.


Applied Optimization for Wireless, Machine Learning, Big Data Therefore, in selective decode and forward, the relay simply does not retransmit ok. So,
Prof. Aditya K. Jagannatham
the relative destination link that does not exist, because the relay is not if there is an
Department of Electrical Engineering
Indian Institute of Technology, Kanpur when there is an error at the relay, the relay does not retransmit and selective decoded
form. So, only source destination link exist that is the destination use uses only the
Lecture - 53
Practical Applications Probability of Error Computation for Co-operative symbol or the signal received from the source all right. So, in this case there is only the
Communication
source destination symbol. So, the decoding at the destination takes place on the signal
received from the source.
Hello, welcome to another module in this massive open online course. So, we are
looking at co-operative communication, and specifically we would like to eventually (Refer Slide Time: 02:18)
look at optimization for co-operative communication all right. So, let us continue our
discussion on co-operative communication ok.

(Refer Slide Time: 00:30)

Now, let us try to model this link. This link can be modelled as the received symbol y
corresponding to transmission by the y s d, because this is the source destination link,
equals well square root of P 1. Let us assume that the source transmits with power P 1
square root of P 1, this is the source power times h s d channel. We have already seen
So, let us title this as optimization for optimization for co-operative communication. And
this, this is a flat this is the fading channel coefficient between source and destination
what we want to look at more specifically is we want to look at optimization means
times x, which is the transmit symbol plus n s d.
minimization of something, we want to look at minimization of the error rate ok. And if
we have a cooperative communication system, what we are looking as what we want to Let us consider this x to be a BPSK symbol that is this is plus or minus 1. So, this h s d is
try to find is what is the error rate of this. And so this is my cooperative communication the fading channel coefficient, P 1 is the power, and x is b plus or minus 1 which is
system with the your source relay and destination nodes. And if you in the event of phi BPSK or Binary Phase Shift Keying. And this N is Gaussian noise with mean 0, and
that is when you have an error at the relay, which means the relay is not able to decode additive white Gaussian noise with mean 0 in variance or power sigma square.
the symbol correctly.
(Refer Slide Time: 03:42) (Refer Slide Time: 04:56)

And now if you look at this the SNR at the receiver, the output SNR this is P 1 times Now, what is the bit error rate? Remember for the for b p since we are considering BPSK
magnitude hsd square times x is plus or minus 1, so magnitude x square is 1 divided this modulation, the bit error rate has a very simple expressions. The bit error rate is the Q
is the power signal power divided by the noise power, which is sigma square. I can write function of square root of SNR, which is equal to Q function of square root of rho 1
this as P 1 divided by sigma square times magnitude h s d square, we have already seen times beta s d. And now remember, we have beta s d, which is a random quantity. And
this is equal to beta s d all right. And I will further define P 1 over sigma square as rho 1, we have seen in the previous module that this is exponentially distributed, it is average
so I can write this as rho 1 times beta s d. So, this is your SNR all right at the output all power delta S e d square e raised to minus B s d by delta s d square. This is a probability
right. When the relay is decoding in error, there is only so only the source destination density function of beta s d, which is the channel gain and distributed exponential.
link exist. There is the destination can use only the signal that has been transmitted by
(Refer Slide Time: 05:49)
the source ok.
And therefore, the average bit error rate. So, to find the average bit error rate, why do we tail probability of the standard normal random variable, but this is an alternative
have to find the average bit error rate? Remember beta s d that is the gain of the channel definition of the Q function, which is convenient in this scenario.
coefficient the channel coefficient, the fading channel coefficient is a random quantity.
So, using this I have the average bit error rate is well, I will replace the expression for the
So, it is changing from time to time, so varying with time. So, naturally this bit error rate,
Q function 1 by pi is a constant 1 over pi. So, I am going to take that out, so that will
which depends on beta is going to also change from slot to slot or from time to time all
give me 0 to pi over 2 e raise to minus x square, which in this case is rho 1 beta s d by 2
right.
sin square theta times 1 over delta s d square e power minus beta s d divided by delta s d
Therefore, we want to find what is the average bit error rate, corresponding to these square d beta s d. So, I am integrating the bit error rate over the probability density
observations or these decoded symbols that are decoded symbols, decoded symbols at function of random variable beta s d, which is the gain of the s d channel.
the destination over a long period of time ok. And that average bit error rate is given as
Now, what I am going to do is I am going to interchange the integrals ok. This is this
well to compute the average of a random quantity, you multiply this a bit error rate Q of
integral is first with respect of course this has to be theta, so there has to be d theta; this
square root of rho 1 beta s d, you multiply it by its probability density function F of beta
integral first with respect to theta. Next with respect to B s d, do, I am interchanging the
s d, and you integrate ok. So, multiply the probability by the probability density function
order. So, first now I am going to make the inner integral with respect to beta s d, the
and integrate over its domain that is from 0 to infinity ok. And here, we are going to use
outer integral with respect to theta.
the formula for the Q function, Q of x equals 1 over pi integral 0 to pi by 2 e raised to
minus x square by 2 sin square theta d theta, this is also known as the craigslist formula. (Refer Slide Time: 09:34)

(Refer Slide Time: 07:38)

So, this is going to become 1 over pi 0 to pi by 2 0 to infinity, and combine the inner
terms 1 over delta s d square e raised to minus beta s d divided by delta s d square into 1
And using this, this is an alternate you can think of this as an alternative definition of the
plus rho 1 deltas s d square divided by 2 sin square theta d beta s s d into d theta. Now, if
Q function normally. The Q function is defined as x to infinity 1 over square root of 2 pi
you look at the inner integral just pay attention to the inner integral, inner integral is of
e raised to minus x square by 2 d x I am sorry e raise to minus t square by 2 d t that is the
the form 1 over delta s d square e raised to minus B s d beta s d divided by delta s d
square into some constant K times d beta s d integrated from 0 to infinity. Why is this
constant K? Because remember this K it depends on theta, but theta is a constant. When (Refer Slide Time: 13:34)
you are looking at the inner integral, the quantity theta is a constant is it is K ok.

So, if you can denote this by K, and therefore this integral is simply if you evaluate this
inner integral, this is simply equal to 1 over K, where K is what is K 1 plus beta 1 plus I
am sorry this is not 1 plus beta 1, this 1 plus rho 1 ok. K is 1 plus rho 1 delta s d square
divided by 2 sin square theta. And therefore, this now reduces to 1 over pi integral 0 to pi
over 2 1 over 1 plus rho 1 delta s d square divided by 2 sin square theta d theta. There is
a convenient way to evaluate this.

(Refer Slide Time: 11:56)

Times integral and integral 0 to pi by 2 sin square theta is easily evaluated that is 1 minus
cosine 2 theta divided by 2, so that is integral of cosine 2 theta, obviously between 0 to
root of pi by 2 is 0. So, this is 1 over 2 times pi by 2. And if you (Refer Time: 13:49), if
you multiply 1 over pi 2 over rho 1 delta s d square into 1 over 2 pi by 2 the pi terms
cancel, the 2’s cancel. And what you have is basically at that point you have 1 over 2 rho
1 delta s d square, which is your probability of error given phi remember. This is the
quantity that we are talking we were talking about earlier.

What is this? this is the probability of error. The probability is e given phi remember phi
is the error at the relay. In the event of which there is no relay to destination
Now, observe that if you at high SNR at high SNR equal to rho 1 that is when rho 1 is
transmission, so there is only source destination transmission. And this is the probability
very high. 1 plus rho 1 delta s d square divided by this is approximately equal to rho 1
of error at the destination for decoding the BPSK symbol transmitted by the source given
delta s d square divided by. So, I can neglect this one all right, because when rho 1
the error at the relay.
becomes high, the second term dominates in the sum. So, I can simply approximate this
by rho 1 delta s d square over 2 sin square theta, which means this integral now
approximately becomes 1 over pi 0 to pi by 2, so the 1 goes, so this simply becomes 2 sin
square theta divided by rho 1 delta s d square d theta, which is well 1 over pi 2 over rho 1
1 over pi 1 over pi 2 over rho 1 delta s d square.
(Refer Slide Time: 14:49) Remember the source to relay link also a fading link. So, I can model this is as y of s
comma r, source power is P 1 square root of P 1 h of s comma r times x, which is the
same symbol n of s comma r. Now, what is the SNR, SNR is P 1 magnitude h s, r square
divided by rho sigma square, which is nothing but rho 1 beta s, r. So, SNR is basically
exactly same, as that of source destination link with beta s d replaced by beta s, r.

Remember beta s, r is also exponential random variable with average power delta s, r
square ok. So, this recall that this is also exponential with average power. This is also
exponential average power delta s, r square. So, therefore the error rate probability of
error is simply 1 over if you look at the earlier expression, 1 over 2 rho 1 delta s, d
square I simply have to replace delta s, d square by delta s, r square corresponding to the
source relay link.

(Refer Slide Time: 17:38)


So, this is probability of so describe it in detail, probability of error at destination in
event of ok. So, this is your probability of e given phi error at destination given phi that
is error at relay. Now, the other thing that we need is the probability of phi, similarly
probability of phi. Now, what is probability of phi? And similarly probability of; Now,
what is probability of phi, this is error at relay or rather probability of and what is the
probability of an error at relay.

(Refer Slide Time: 16:03)

So, this is simply going to be 1 over 2 rho 1 delta s, r square; I hope this is clear. What
we are saying is we are simply replacing in the source destination SNR beta s, d with
beta s, r, which is exponential with average power delta s, r square all right. So, in the bit
error rate expression, one can simply replace delta s, d square by delta s, r square. And
you will get the appropriate bit error rate expression for the source relay link ok. And
therefore, this is your probability of phi; this is the probability of error. This is the
probability of error at the relay ok, so that is two components remember, we have
probability of e, we want to find out which we have written in terms of probability of e
given phi and probability of phi. The next thing that we want to find is the key (Refer Slide Time: 19:25)
component, which is probability of e given phi bar.

(Refer Slide Time: 18:51)

So, probability of e given phi bar; what is this? This is the error at destination given that
the relay decodes accurately. So, in this case what happens is if you look at the diagram,
you have the source. In this case, what happens is you have the source, relay, destination.
Remember we are talking about if you go all the way back, go back to the previous
Relay is decoding accurately, so source transmits relay also transmits. So, destination
module we have this probability of e error at the destination, which you want to find
you have two signals, signal received by the source, signal received from this relay.
eventually. Now, that depends on probability of e given phi, it is approximately
probability of e given phi into probability of phi plus probability of e given phi bar. So, Now, how does a destination decoded, naturally the destination has to employee some
we have found out probability of e given phi probability of phi, what needs to be what kind of combining. And we already know, what is the optimal combining structure so.
remains to be found is probability e given phi bar. What is probability of e given phi bar And this is something that I am going to something that is very interesting shows the
that is the probability of error at destination given that given phi bar that is no error at broad applicability of the optimization principles that we have seen so far. You can treat
relay that is the really decodes accurately, in which case relay also retransmit. this in fact as a beam forming problem, and that is very interested. It is a beam forming
with the multiple nodes rather than multiple and so.
(Refer Slide Time: 20:48) (Refer Slide Time: 23:09)

In this case, the relay retransmits. Phi bar implies, no implies, no error at relay. We already know, what is the optimal combiner. Optimal combiner for this optimal
Remember phi implies error at relay, so phi bar implies no error at relay, implies relay combiner at destination equals the MRC that is the Maximal Ratio Combiner. What is the
retransmits, because relay remember selective decode and forward, relay retransmits maximal ratio of combiner? That is you have your beam former W bar equals sorry h bar
only fits able to decode. So, relay retransmits, so now what happens? You have two divided by norm h bar, we know this. And what is the SNR, SNR is norm h bar square P
symbols y s, d, which you already had square root of P 1 h s, d x plus n s, d. Now, you divided by but P is 1 that is power of x is 1 into 1 divided by sigma square. And what is
will have transmission by the relay. So, y r d equals square root of P 2 that is the relay norm h bar square remember, norm h bar square is P 1 magnitude h s, d square plus P 2
power h r d fading channel coefficient between relay and destination plus n r d ok. magnitude h r, d square ok.

And so this is the original transmission by source, and this is the transmission by the You can see h bar, h bar is the vector square root of P 1 h s, d square root of P 2 h r, d ok,
relay and this is the transmission by the relay. And now you have these two symbols so that is norm h bar square equals P 1 magnitude h s, d square plus P 2 magnitude h r, d
received at the destination; one from the source, one from the relay. Now, what are you square divided by sigma square. Now, we know P 1 divided by sigma square is rho 1. So,
suppose to do, obviously you have to combine them in some kind of an optimal fashion. this is rho 1 magnitude h s, d square is beta s, d plus P 2 divided by sigma square is rho
And what is the optimal combiner; we know what is the optimal combiner. The optimal 2, magnitude h r, d square is beta r d ok. So, this is your SNR.
combiner, the optimal beamforming this is similar to the beam forming problem with
You can see, this is the coherent combining combines the SNRs corresponding to both
multiple receive antennas. And we know that the optimal combiner is the maximal ratio
the source destination transmission and the relay destination. And thereby you are
combiner all right. And therefore, if you treat this two received symbols as your receive
enhancing the reliability. This is where you get the gain from co-operative
vector y bar, and this as your channel vector h bar, and this as your noise vector n bar.
communication, because you have the signals that are transmitted by two different
sources. One is the source original source and other is the relay, which is acting as the
replica of the source, so that in hand and now you see the co-operative diversity aspect
emerging, because there is transmission by the source, there is transmission by the relay.
So, they are co-operating you have two signal copies, and that gives diversity in a (Refer Slide Time: 26:39)
wireless communication system, which leads to a significant decrease in the bit error rate
that is what we are going to see all right.

(Refer Slide Time: 25:33)

And rho 1, we know already find rho 1. Rho 2 is also similar rho 1 equals P 1 that is
source power by sigma square root 2 equals P 2 by sigma square. Well what is P 1, P 1
equals source. What is P 2, P 2 equals relay power, because the source and relay need not
transmit with equal power. Source power can be very different from the relay power.
And therefore, now again you can follow the same procedure. The bit error rate for
Relay can be have very low power depending on the relay can have high power or low
QPSK will be Q of square root of rho 1 beta s d plus rho 2 beta r, d. You can average this
power and so on.
over the probability density function. And it can be shown that the average bit error rate
for this, which is nothing but your probability of e given phi bar. It can be shown I am And you know, and this is where you can see this is where you have this is coherent
not going to explicitly show it, maybe in a different module separate module, because it combining, you have this signal copies, and you have this two signal copies. And this is
is not necessary for a preliminary discussion this is given as 3 or 4 rho 1 rho 2 delta s, d where you have this co-operative diversity really coming actually if you realize this. And
square delta r, d square. We know what delta r d square is delta r d square is the expected diversity is an important principle in wireless communication system, which results in a
value of average gain of the relay destination link. significant decrease in the error rate ok. And therefore, this is your P r of e given phi bar.
So, now we have this elegant expression for the probability of error given phi bar that is
correct decoding at the relay or no error at the relay all right.

So, now putting all these components together, one can derive the probability of error
that is the final expression for the end to, this is also known as the end to end error in this
that is probability of E ok. And using that now one can come up with a framework for
optimal power distribution between the source and relay that is our ultimate aim. The
optimization problem pertains to how to distribute the power optimally between the
source and relay in a wireless communication system, which we will deal with in the Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
next module.
Department of Electrical Engineering
Indian Institute of Technology, Kanpur
Thank you very much.
Lecture - 54
Practical Application: Optimal power allocation factor determination for Co-
operative Communication

Hello, welcome to another module in this massive open online course. So, we are
looking at optimization for co-operative communication all right. And in this module, we
are going to get to the optimization problem all right.

(Refer Slide Time: 00:25)

So, we are looking at optimization for your co-operative communication system. And
well what we have is, we have a source, we have the relay, and we have a destination.
So, this is my source, my relay and destination.
(Refer Slide Time: 01:13) (Refer Slide Time: 02:00)

We also derive what is the expression for e given phi bar that is probability of error at And now probability of error, so this implies and now we already know the expression
relay error at destination given, given no error at the relay. for the error at the destination. So, probability of error at the destination, we have seen
this is probability of e given phi times probability of phi is a proxy and plus probability
(Refer Slide Time: 01:21)
of e given phi bar. We said this is an approximations, but this is a good approximation we
just tied at highest you know.

Now, what we are going to do is, we are going to substitute the expressions, we have
derived for each of this quantities and derived the probability of error at the destination.
This is also known as the end to end error that is the end is one of the end of ends is the
source, the other end is the relay, another end is the destination. So, this is the probability
of the error at the destination that is for the end to end communication ok, because the
communication happens in two phases, one is the source to destination, then source to
relay, relay to destination ok. So, this is the probability of end to end error, this is the
probability of end to end error.

And now substitute each of these quantities, so this is equal to, so first I am going to
substitute probability of e given phi that is the error at destination given error at relay.
That is phi is the event of error at relay, phi bar s the event of no error at relay. And this
So, this is 1 over 2 rho delta s d square times probability of error at relay, this is 1 over 2
we can see is we have seen is 3 by 4 rho 1 rho 2 delta s d square delta r d square. We are
rho 1 delta s r square plus probability of error given phi bar, this is 3 over 4 rho 1 rho 2
not explicitly derived this or you said that this can be shown ok, so phi bar implies no
delta s d square delta r d square.
error at relay ok.
Now, what is rho 1? Remember rho 1 equals P 1 by sigma square. And what is P 1? P 1 (Refer Slide Time: 06:25)
equals power of source; rho 2 equals P 2 y sigma square; P 2 equals power of relay all
right this is the transmit power of relay. And now what we want to do is now we want to
make an optimization problem, where we want to minimize the bit error rate the end to
end rate of course, we cannot just simply minimize. And because minimizing the error
rate simply means increasing the power infinitely; and of course, when the power
becomes infinite the bit error rate becomes 0. So, naturally that is not what we want we
want to minimize this subject to constraint, because the transmit power cannot increase
infinitely. So, what will impose is will impose up power budget on this corporative
communication system.

(Refer Slide Time: 05:18)

Where of course, because rho 1 comma rho 2 are greater than or equal to 0, this implies 0
less than equal to alpha must be less than or equal to 1 this. Now, is what is the what is
this alpha you can think of this alpha as the power allocation factor, this is the power
allocation factor which lies between 0 and 1; and rho 1 equals alpha times rho, rho 2
equals 1 minus alpha times rho. Now, substituting this values ok, substituting what are
we substituting rho 1 equals alpha times rho, rho 2 equals 1 minus alpha times rho. And,
if we call this expression, if we call this as star that is the probability of end to end error.

(Refer Slide Time: 07:27)

What is the power budget? The power budget is that the power of the source plus the
power of that relay, this is the constraint. So, this is the constraint which is the power
budget for the system. So, this is the power budget which is the constraint for this co-
operative wireless communication system. Now, P 1 plus P 2 equals P. Now if you divide
this by all sides by sigma square, we can get rho 1 plus rho 2 equals well P over sigma
square equals rho equals P over sigma square. Further to simplify this, because there only
two parameters, I can set rho 1 equals alpha times rho, rho 2 equals then becomes 1
minus alpha times rho because rho 1 plus rho 2 is equal rho.
So, substituting in star what we get is the probability of error probability of end to end (Refer Slide Time: 10:12)
error, this equals 1 over 4 alpha square rho square delta s d square delta s r square plus 1
over or not 1 over in fact, this is 3 over 4 alpha into 1 minus alpha into rho square delta s
d square delta r d square.

(Refer Slide Time: 08:15)

However, so this is basically you can say this is the 1 over SNR decrease, this decreases
as 1 over SNR.

(Refer Slide Time: 10:24)

And now if you take this 1 over rho square common or in fact I can take 1 over 4 rho
square delta or 4 square delta s d square I can take this common into 1 over alpha square
s r delta square plus 3 over alpha into 1 minus alpha into delta r d square. Now, if you
look at this bit error rate expression, now if we attention to this bit error rate expression,
you will notice something interesting you will notice that the effective end to end bit
error rate decreases as 1 over rho square. So, so or the bit error rate at destination equal
to 1 over SNR square, what is SNR, SNR equals P over sigma square.

So, what you are observed is normally in a wireless communication system, we have
seen in the absence. Now, if you go back ok, and if you go back and simply look at the
source destination link I urge you to look at the source destination link. If you look at
simply the source destination link, the probability of error decreases 1 over rho 1, it only
However, now once you are adding a relay in this co-operative communication system,
decreases as 1 over SNR all right. Therefore, this is known as diversity order 1, which is
the bit error rate in co-operative communication system, BER decreases as 1 over SNR
the exponent of the SNR, is simply 1 in the bit error rate expression.
square. And this is very important, because the bit error rate is decreasing as 1 over
square of SNR. So, the BER bit rate decreases much faster, this is the impact of
corporative communication. Thus corporative communication leads to a significant
decrease in the bit error rate of a wireless communication system thereby improving the (Refer Slide Time: 13:00)
reliability ok. So, this implies BER of co-operative decreases, this is implies bit error rate
to co-operative communication decreases significantly faster.

(Refer Slide Time: 11:36)

And now therefore, what you want to do, we want to find the optimal power factor alpha
to minimize naturally, what you want to minimize is we want to minimize the bit error
rate at the definition or probability of error. So, my optimization problem is, now
minimize probability of error remember the constraint is now incorporated in alpha,
This implies co-operative communication significantly co-operative communication
because constraint was P 1 plus P 2 equal to p, which have written in terms of alpha rho
improves reliability, improves the reliability of a wireless system. So, this significantly
and 1 minus alpha rho by writing P 1 and P 2 in terms of alpha and alpha rho and 1
improves the reliability of the wireless system. So, BER decreases significantly faster,
minus alpha rho or P rho 1 and rho 2 in terms of alpha rho and 1 minus alpha rho ok.
because remember 1 over SNR versus 1 over SNR square for a co-operative
communication system. So, these 2 implies diversity order equals 2; and this is also And, so this is minimize probability of error, there has to be a dot minimize, which
termed as co-operative diversity. So, co-operative diversity helps to improve the means minimize your earlier relay expression that we derived 1 over 4 rho square delta s
reliability for wireless communication system by making the bit error rate at the d square times 1 over alpha square delta s r square plus 3 over alpha into 1 minus alpha
destination decrease significantly faster. Then it would have happened in the presence of delta d square. Now, observe that this is a constraint rho is fixed, P is fixed, rho is fixed,
only a source destination link that is when the relay is absent so that is the important delta d square delta s d square is fixed. So, this implies this is a constant.
point to realise here ok.
(Refer Slide Time: 14:40) into alpha square plus alpha 3 delta s r square minus 4 delta r d square plus 2 delta r d
square equal to 0.

(Refer Slide Time: 17:00)

So, we need to only minimize this part which is therefore, equivalent to minimization of
1 over alpha square delta s r square plus 3 over alpha into 1 minus alpha delta r d square.
Let us call this as a F of alpha d F of alpha over d alpha this is equal to minus 2 over
And you can solve this; it is a simple quadratic equation. And what you will get is alpha
alpha cube delta s r square plus or minus rather minus 3 1 minus 2 alpha by alpha square
star is equal to delta s r square minus 4 by 3 delta r d square plus delta s r into square root
1 minus alpha square delta r d square. Now, equate to 0 to find the optimal value.
of delta s r square plus 8 by 3 delta r d square 4 delta s r square minus 1 by 3 delta r d
(Refer Slide Time: 15:51) square. And this is basically your optimal you can say this is alpha star equals optimal
power allocation factor for minimum, minimum probability of error at destination.

(Refer Slide Time: 18:28)

This implies minus 2 1 minus alpha square delta r d square minus 3 into 1 minus 2 alpha
alpha delta s r square equal to 0; this implies the 2 delta d square minus 6 delta s r square
This implies that your rho star P 1 star or your rho 1 star equals alpha star rho optimal Thank you very much.
power optimal SNR. So, this implies also that P 1 star equals alpha star P. What is P 1
star optimal source power, and this also implies rho 2 star equals 1 minus alpha star into
rho implies P 2 star equals 1 minus alpha star into P, and what is P 2 star this is equal to
optimal relay power, so that gives the optimal source relay power allocation ok. So, what
we have obtained is optimal source relay power allocation.

(Refer Slide Time: 19:44)

So, this is the optimization connect, so optimization arises everywhere in signal


processing and communication. What we are saying is only a few salient and most
relevant modern exams optimal source relay power allocation for co-operative
communication, which minimum co-optimal in the sense that it minimizes the end-to-
end bit error rate all right. So, the basically that completes our discussion on this co-
operative communication system source, relay and destination nodes all right. In the
relay implies that that is like decode and forward there is a protocol very transmits only
if it is able to decode that respected simple correctly all right.

And we have shown that because of corporative diversity the bit error rate decreases as 1
over SNR square, which implies that the bit error rate of co-operative communication is
significantly lower and therefore, the reliability significantly higher in comparison to that
of having only a source destination link. This is co-operative diversity. And we have also
derived the optimal source relay power allocation or power distribution.
Applied Optimization for Wireless, Machine Learning, Big Data And this has to be therefore either estimated or recovered or reconstructed. If it is an
Prof. Aditya K. Jagannatham
image, then we say that the image has to be reconstructed.
Department of Electrical Engineering
Indian Institute of Technology, Kanpur
(Refer Slide Time: 02:30)
Lecture – 55
Practical Application : Compressive Sensing

Keywords: Compressive Sensing

Hello, welcome to another module in this massive open online course, so let us start
looking at another new, in fact revolutionary and path breaking development or
technology and that is of Compressive Sensing.

(Refer Slide Time: 00:30)

So let us say this x is an N dimensional signal vector and we have to make some
measurements for this unknown signal vector x in order to estimate the signal vector. So
we are sensing the signal vector in order to estimate. And therefore, we have y =fx . So

this y is your observation vector and f is your sensing matrix.

(Refer Slide Time: 04:14)

So let us start by considering a signal x which is unknown.

(Refer Slide Time: 01:39)

Let us say, we are making M observations as shown in slide.


(Refer Slide Time: 04:55) (Refer Slide Time: 09:24)

So these are the rows of the sensing matrix we are making these M observations y1 y2 yM We have these N observations y1 y2 up to yN which is simply the identity matrix. So
through this sensing matrix. So you can think of each observation as a projection of this typically what you have is you are simply sampling the signal at these N different

vector x on a row of this sensing matrix f . instants, we get N measurements and from those measurements we recover the signal.

(Refer Slide Time: 11:12)


(Refer Slide Time: 06:51)

So as in slide these are the samples, so this is temporal or spatial sampling. We are
So there are M observations and N unknowns and this matrix f is an M ´ N matrix. So
making one measurement per sample or signal value. And therefore, to uniquely
this is a system of equations or M is the number of equations and N is the number of determine the signal with N samples we need at least M ³ N , we can choose M = N.
unknowns. And from linear algebra we know that in order to recover x which is vector of Now let us take a simple example, consider a typical image for instance.
size N you need at least N equations to uniquely determine x. So to uniquely determine
x here we need M ³ N .
(Refer Slide Time: 12:47) (Refer Slide Time: 16:35)

Now, this is a small image it is a 25 6 ´ 2 56 pixel image and let us say, it is a color image For example, you have JPEG, GIF, these are various formats for compression. So what
implies for each of these RGB components you need 1 byte for each pixel which means we are doing is the we are first sensing the image and if it is a large image this implies
the total number of bits per image is 2 5 6 ´ 2 5 6 ´ 3 ´ 8 = 1 .5 8 M b . you require a large number of sensors, the number of sensors required is N, so you are
taking one one sensor per sample. However, after sensing you are compressing that is
(Refer Slide Time: 14:13)
you throw away significant amount of data. You are throwing away a significant amount
of data to compress it which means basically you are using a large number of sensors but
at the same time you are throwing away a large amount of data, because the compression
is coming after the sensing process .

(Refer Slide Time: 19:35)

But the size of a typical image is let us say, 50 to 60 Kb only. So how is it that you are
able to store an image at such a small size even though the raw image has so many bits,
the obvious answer to this is that instead of storing a raw image, this image is being
significantly compressed in size in terms of the number of bits.

In this conventional paradigm, you have compression well after the sensing process. This
leads to a wastage with large number of sensors and hence the resulting system is
extremely expensive.
(Refer Slide Time: 20:59) Applied Optimization for Wireless, Machine Learning Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 56
Practical Application

Keywords: Compressive Sensing, Sparsity

Hello welcome to another module in this Massive Open Online Course. So we are
looking at compressive sensing where we try to compress during the sensing process
itself by making much fewer number of measurements in comparison to the dimension of
the signal and then later try to reconstruct the signal from the very few measurements

Now instead consider this paradigm where you have an image, you perform M made.

measurements much less than N that is number of sensors M is much less than N, this is
(Refer Slide Time: 00:41)
termed as compressive sensing. So basically while the sensing process itself you are
compressing. So you are compressing while sensing and then you can perform signal
recovery to extract the original signal. So this requires very few sensors.

(Refer Slide Time: 23:08)

So we have a measurement vector y =fx and there are M measurements as shown in


slide. The sensing matrix is M ´ N and we make significantly fewer measurements that
is M N . Now if you view this as a system of equation then we have M number of
equations and N number of unknowns. So simple linear algebra tells us that one cannot
So this results in a significant saving in terms of cost and in terms of the number of reconstruct the vector x of length N from M equations, since the number of equations is
sensors, because you are making very few measurements in comparison to the size of the much lower than the number of unknowns. So this is an underdetermined system.
signal. Now since the number of observations is less than the number of signal samples, Therefore, this sensing system has to satisfy certain special properties in order to recover
one cannot uniquely determine the signal vector x . So therefore, one has to come up with x .
some new techniques to reconstruct the original signal from this compressed or
compressively sensed signal y . So we will stop here. Thank you very much.
(Refer Slide Time: 03:37) (Refer Slide Time: 05:55)

Now the first condition states that the measurements are not simply in the time or space.
T T T

Rather, they have to employ noise like or one can say pseudo noise like waveform. If you look at each row of the sensing matrix which we are denoting by f1 , f 2 , f M this
has to be a noise like waveform which means that it has to be something very random,
(Refer Slide Time: 04:59)
either can be a random sequence of - 1, 1 or it has to be some random noise like
waveform such as Gaussian noise.

(Refer Slide Time: 06:57)

(Refer Slide Time: 05:35)

So these rows have to look like independent realization of the noise waveforms. And
when you are making the measurement you are taking the projections of the signal on
this noise like waveform.
(Refer Slide Time: 08:19) Now the vector x itself has to satisfy an important property that is x has to be sparse
and this is a very important property. Now x is sparse implies that a large number of
entries of x are 0 and only very few entries are non-zero.

(Refer Slide Time: 12:13)

(Refer Slide Time: 08:53)

So x is sparse and typically x = ja .

(Refer Slide Time: 13:15)

T
So each fi x is a projection of x and we are taking the linear combination of x using
this noise like waveform.

(Refer Slide Time: 10:31)

So x is N ´ 1, this a is N ´ 1 and j is an N ´ N basis such that a is a sparse vector.


For instance, we take an image and if you look at the wavelet coefficients of an image
then they are sparse.
(Refer Slide Time: 14:43) (Refer Slide Time: 17:13)

So x is sparse can be expressed in terms of a which is sparse and therefore, now if you So once you get a use x = ja to obtain the estimate x from a and this f matrix has to

substitute this, the sensing model becomes y = f x = fj a = fa . contain noise like waveforms.

(Refer Slide Time: 18:13)


(Refer Slide Time: 15:43)

Now to reconstruct x enforce sparsity which implies that we have to ensure that the
Now f becomes your effective sensing matrix. Now, it is as if you are trying to sense
reconstructed vector x is such that a large number of elements are 0’s and only some
the vector a which is in the wavelet domain. Now once you get the wavelet coefficients elements are non-zero. This is precisely what we call the l0 norm that is if you denote the
you can reconstruct the image because image and wavelet have a 1 to 1 correspondence.
l0 norm of a vector this equals the number of non-zero elements of x .
But the wavelet coefficient is sparse and that is very amenable to compressive sensing.
(Refer Slide Time: 20:05) (Refer Slide Time: 22:01)

So we want to minimize the number of non-zero elements of x which is l0 norm.


We are not concerned with the values of the non-zero elements we just have to consider
(Refer Slide Time: 20:51) the number of non-zero elements.

(Refer Slide Time: 23:17)

For instance, let us say you have the vector x as shown in slide and there are 8 elements,
but only 2 non-zero elements, which implies the l0 norm of x is 2.
m in x
Therefore, the optimization problem for reconstruction of x can be given as 0
.
s .t y = f x

And we need this because M N and therefore, one has to exploit sparsity. So you are
trying to find the sparsest vector which satisfies this observation model. Now the
problem with this optimization problem is that not only the objective is non-
differentiable, this optimization problem is highly non convex.
(Refer Slide Time: 27:07) (Refer Slide Time: 31:17)

So this implies it is very difficult to solve the optimization problem. So the point is that Therefore, one has to come up with other intelligent techniques to solve this optimization
although it is very simple to state the optimization problem it is an extremely problem and that forms the basis for compressive sensing. So let us stop here and
complicated one. continue in the subsequent modules. Thank you very much.

(Refer Slide Time: 30:13)

Now, if you look at l0 norm x £1 that comprises only of the axis, so it is highly non
0

convex.
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 02:24)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 57
Practical Application: Orthogonal Matching Pursuit (OMP) algorithm for
Compressive Sensing

Keywords: Orthogonal Matching Pursuit (OMP)

Hello, welcome to another module in this massive open online course. So we are looking
at compressive sensing and we have seen that the cost function for the compressive
sensing problem is highly non convex and therefore, we have to come up with intelligent
techniques to solve this and hence we are going to look at the orthogonal matching
pursuit. The name itself implies matching that is we are looking for the column that closely
matches the vector x , which means you have to find the projection of x on each of these
(Refer Slide Time: 00:38)
columns of the matrix f and choose the one that has the maximum value. So we are

trying to find the one which has the largest projection on y so the way to do that is to

find the column of f that will have the largest correlation or basically projection with y .
T
So the way to do that is we have i (1) = a rg m a x f j
y .
j

(Refer Slide Time: 05:02)

So this is one of the schemes for sparse signal recovery and this is also abbreviated as
OMP.

So choose the column j that has the maximum projection. And now we start building the

basis matrix that is A (1 ) = é f i (1 ) ù . So at this stage, this is a single column matrix. So this is
ë û

by the way the first iteration of the algorithm and now we find try to find the best
(1 )
2
(Refer Slide Time: 09:08)
estimate of the vector x . So y - A (1 ) x that is we are trying to minimize the least

(1 ) -1
squares norm, such that you find the best vector x = (A (1 )
T
A (1 ) )
T
A (1 ) y in the first

iteration that minimizes this error. So what we are doing is we are trying to estimate the
columns of f which are present in the linear combination that give rise to only few

elements of x that are non-zero. So that is what we are trying to find by this orthogonal
matching pursuit. So we take the projection of each column on y , finding the one that
has a maximum projection, choosing that column as the basis, then finding the best type
for optimization to the y based on that basis, that is what we are doing here by solving
this least squares problem. Now we find the residue that is left after getting this best
So in the second iteration we take the projection of this on the residue that is we find the
possible approximation.
column which has the maximum projection on the residue after the first iteration. So we
(Refer Slide Time: 08:00) find the projection of each column of f on this residue and choose the column which
now has the maximum projection on this residue.

(Refer Slide Time: 11:17)

(1 )
So this is r (1) = y - A (1 ) x .

Now, you augment your basis matrix and we get A(2) as shown in slide. Once again you
find the best estimate x via least squares as shown in slide.
(Refer Slide Time: 12:14) (Refer Slide Time: 18:17)

Now we find the residue after the second iteration and then subsequently we do the third And now x will simply be a vector that mostly contains zeros, except corresponding to
iteration and keep repeating this process until the residue stops decreasing, that is you the location i(1), i(2) and i(K), let us say so. So this is a sparse vector that is estimated
repeat until such a stage that is let us say, you have K iterations, that is using the orthogonal matching pursuit. So we will stop here and we will look at an
example in the subsequent module. Thank you very much.
r ( K ) - r (K - 2 £ e which is some threshold.

(Refer Slide Time: 15:54)

So this is termed as the stopping criteria. Let us say, you stop after K iterations and after
(K )
this you have x , which means you basically obtained a fairly good estimate of
approximation to y and the residue is not decreasing any further.
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 02:41)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 58
Example Problem: Orthogonal Matching Pursuit (OMP) algorithm

Keywords: Orthogonal Matching Pursuit (OMP) algorithm

Hello, welcome to another module in this massive open online course. So we are looking
at techniques for compressive sensing and we have seen that orthogonal matching pursuit
can be used for sparse signal recovery, so let us now look at an example to understand
this better.

(Refer Slide Time: 00:41) In this problem we have M = 4 which is basically the number of equations and N = 6
which is the number of unknowns and M < N implies number of equations is less than
number of unknowns. Therefore, to estimate x you cannot use conventional linear
algebra, because in linear algebra you need the number of equations at least equal to the
number of unknowns to uniquely determine the unknown vector x . And therefore, one
has to enforce sparsity, so we assume that x is sparse and then we want to estimate this
sparse vector. This is basically termed as sparse signal recovery.

(Refer Slide Time: 04:41)

So let us consider the following example, we have y =fx and we have to estimate the

é x1 ù
ê ú
é0ù é1 0 1 0 0 1 ù x
ê 2ú
ê ú ê ú ê x3 ú
2 0 111 0 0
vector x . So let y = ê ú , the matrix f = ê ú and x = ê ú .
ê3 ú ê1 0 0 11 0 ú ê x4 ú
ê ú ê ú êx ú
ë5 û ë 0 1 0 0 11 û 5
ê ú
êë x 6 úû

The algorithm for sparse signal recovery is OMP and this can be done as shown in slide.
(Refer Slide Time: 07:03) (Refer Slide Time: 09:11)

T é3 ù
So we perform f y and this is as shown in slide.
ê ú
7
ê ú
ê2ú
(Refer Slide Time: 07:33) So when we compute this f
T
y we will get the vector ê ú and the maximum is at the 5th
ê5 ú
ê8 ú
ê ú
ë5 û
T
entry which is equal to f5 y .

(Refer Slide Time: 10:53)

So each of these entries equals the projection of y on each column of f . Now the other
thing that you must have observed is if you look at these rows, you can see that these
rows are random 0’s and 1. So these are noise like waveforms. So each measurement is a
projection of y on this noise like waveform.

Therefore we form the basis matrix using this column, so we have A (1) = é f 5 ù .
ë û
(Refer Slide Time: 11:43) (Refer Slide Time: 14:45)

é0ù é0 ù
ê ú ê ú
0 2 (1) 2
ê ú
(1 )
This is nothing but ê ú and now you solve the least squares problem y - A (1 ) x , this The residue is r (1) = y - A (1 ) x which will basically be and this is what we carry
ê1 ú ê - 1ú
ê ú ê ú
ë1 û ë1 û

(1 ) -1 over to the second iteration. So we subsequently find the projections of the columns of f
is the first iteration. So the solution to this is x = (A (1 )
T
A (1 ) )
T
A (1 ) y . So on evaluating
on this residue, choose the one that has the maximum and perform the least square
(1 )
this as shown in slide, we get x = 4 . solution, find the residue and repeat the process.

(Refer Slide Time: 13:14) (Refer Slide Time: 16:24)

This corresponds to the index of the column that is chosen in the first iteration that is
column number 5. So this corresponds to the 5th column or the 5th entry of the vector x . So proceeding the same way, we get the residue as shown in the slides below.
Now we find the residue for the first iteration.
(Refer Slide Time: 18:03) (Refer Slide Time: 21:41)

(Refer Slide Time: 19:37)


(Refer Slide Time: 22:55)

(Refer Slide Time: 20:40) (Refer Slide Time: 23:28)


So here the residue is exactly 0 which basically means that you are exactly able to use this OMP algorithm for similar scenarios. So let us stop here and continue in the
approximate y in the second iteration. subsequent modules. Thank you very much.

(Refer Slide Time: 25:08)

(2) é2ù
So no further iterations are needed which means x = ê ú and the components 2 and 3
ë3 û

corresponds to f2 and f5 which are basically the second and fifth columns of the matrix
f respectively.

(Refer Slide Time: 26:04)

And therefore, now you can reconstruct the sparse vector x as follows, only the second
th
entry and the 5 entry will be 2 and 3 respectively and the rest of the entries are 0. This
is a simple example, but problems in practice are frequently more complex. But you can
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:56)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 59
Practical Application: L1 norm minimization and Regularization approach for
Compressive Sensing Optimization Problem

Keywords: L1 norm minimization, Regularization

Hello, welcome to another module in this massive open online course. We were looking
at compressive sensing and we have discussed the orthogonal matching pursuit. Let us
look at another completely different and radical approach to tackle this compressive
sensing problem.
m in x
The compressive sensing problem is the following thing that is 0
. One of the
(Refer Slide Time: 00:34) s .t y = f x

fundamental results in compressive sensing is that this l0 norm minimization can be


replaced by l1 norm minimization and still you can recover the sparse vector x .

(Refer Slide Time: 02:54)

So we want to look at compressive sensing via l1 norm minimization. We have seen this
l1 norm of a vector that is if you have an N-dimensional vector x the l1 norm is simply
the sum of the magnitudes of the components of x . That is we have

x = x 1 + x 2 + ..... + x .
1 N

m in x
So we have 1
what this says is that the l1 norm also enforces sparsity. And it can
s .t y = f x

be shown that for a large number of scenarios or with very high probability x that is
obtained as a solution of both these above optimization problems is the same. The
significant advantage is, the l1 norm is convex in nature and the l0 norm is highly non-
convex. So therefore if you look at this optimization problem, the objective is convex
and the constraint is an affine constraint. So we are converting a problem which was (Refer Slide Time: 10:56)
previously highly non-convex into something that is convex. So this is much easier to
solve and determine the sparse vector x .

(Refer Slide Time: 06:48)

Now, if you replace it by the l2 norm this approach fails because the l2 norm does not
enforce sparsity.

(Refer Slide Time: 12:12)


So this is one of the path breaking developments in compressive sensing that is
demonstrating that the l0 norm minimization is equivalent in a large number of scenarios

to the l1 norm minimization. If you look at the l0 norm ball that is x £1 , it is simply
0

along the axis, so this is highly non-convex.

(Refer Slide Time: 09:05)

The l2 norm is smooth with no pointed edges and if you look at this affine constraint it
intersects at a point which is not sparse.

However the l1 norm ball looks like a diamond shaped object which is a convex shape.
And now if you enforce this affine constraint, which is nothing but a line, it intersects at
one of these pointed edges, implies the solution is sparse.
(Refer Slide Time: 15:07) (Refer Slide Time: 17:01)

m i n t 1 + t 2 + ... + t N

-t £ x £ t
s .t 1 1 1
The l1 norm minimization problem can be further simplified using the epigraph form as -t £ x £ t
2 2 2
shown in the slides below. .
Therefore, this optimization problem can be written as and this set of
.
(Refer Slide Time: 16:02) .
-t £ x £ t
N N N
y =fx

linear inequalities is also known as box constraint. So these are your linear inequality
constraints.

(Refer Slide Time: 18:42)


m in 1 t
T
is the approximation or fit error or the observation model error and this additional term
So this can also be written as -t £ x £ t
. So this is the component wise inequality and enforces sparsity.
s .t
y =fx
(Refer Slide Time: 22:59)
you can see these are linear inequalities, affine constraint and linear objective. So the
compressive sensing problem to estimate the sparse vector x reduces to a linear program
for which there are efficient techniques to solve.

(Refer Slide Time: 20:30)

So basically you are adding an l1 regularization component and this l is termed as the
regularization parameter. This has to be determined for the problem under consideration.
So here you are recovering the sparse vector as well as at the same time minimizing the
approximation error. So basically this is known as the regularized version of the previous

(Refer Slide Time: 21:18) problem. So we will stop here. Thank you very much.

Now so far we have considered a noiseless observation model. Now in the presence of
noise you have the observation model y =fx+ n and the vector x is sparse. Now

previously you minimize the least squares. Now here you have m in y - f x + l x that
2 1
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 02:17)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 60
Practical Application of Machine Learning and Artificial Intelligence: Linear
Classification, Overview and Motivation

Keywords: Machine Learning, Artificial Intelligence, Classifier, Spectrum sensing,


Linear classification

Hello, welcome to another module in this massive open online course. So we are looking
at various aspects of optimization. Let us look at another important application of convex
optimization that is in the latest evolving field of machine learning and artificial
intelligence. So we are going to focus on the problem of classification and these problems can lay the
foundation for the development of very complicated and sophisticated machine learning
(Refer Slide Time: 00:39)
algorithm. We have to design or develop as a classifier. And this is an important problem
in machine learning or artificial intelligence that is to automatically classify a set of
objects belonging to either two different sets or multiple sets. So let us say you have a
video or an image, from that video you are trying to classify the objects into various
categories. So let us look at a typical example in modern wireless communication and
that is of spectrum sensing.

(Refer Slide Time: 04:22)

So machine learning is abbreviated as ML and artificial intelligence as AI. And of course


this is a huge field, if you look at machine learning or artificial intelligence, there are
large number of problems with several interesting applications.

So the spectrum sensing problem arises in one of the latest wireless technologies which
is known as cognitive radio and what happens in a cognitive radio technology is that the
wireless device is embedded with intelligence and therefore ML or AI is an important
aspect. So cognition is an important aspect of the human brain and the idea is to embed
cognition or embed this kind of intelligence in wireless devices or wireless radios. So the (Refer Slide Time: 11:54)
ability for the radio to sense the environment and adapt itself is a very important aspect
of cognitive radio. And therefore, machine learning or artificial intelligence which
basically is concerned with extracting learning rules based on sets of data and in using
them later with a very high degree of probability is an important aspect of cognitive
radio.

(Refer Slide Time: 08:24)

And this is an important aspect of cognitive radio, because the cognitive radio has to
sense the wireless environment and adapt the radio process. And naturally this spectrum
sensing process at the secondary user is going to be based on some measurements that
are done by the secondary user of the environment, based on the signal let us say x that
is sensed by the secondary user. So let us say xi is the signal measured by secondary user

at time i. Now, we need to classify based on this xi in the sense xi implies PU present or
For instance, a simple problem in cognitive radio can be the following and that is when or absent.
you have a primary user or the user who is licensed to transmit in a certain spectrum and
(Refer Slide Time: 14:17)
in addition you have a secondary user. Now, this is the unlicensed user and this can be
communicating with the different base station but causes interference at the primary base
station. So this is as shown in slide. So naturally the secondary user can transmit and it
also causes interference to the primary user(PU), therefore the secondary user(SU) can
only transmit, when the primary user is not transmitting or the PU is absent. And
therefore the SU has to sense the environment and this process is termed the spectrum
sensing.

Now let us consider hypothesis 0 that is primary user is absent and this is termed as the
Null hypothesis and primary user is present termed as the alternative hypothesis or
hypothesis 1. So we have to assess which of these hypotheses is true and this is known as
a binary hypothesis testing problem. So the idea is to classify xi as belonging to one of (Refer Slide Time: 20:55)

these hypothesis. Now we have to build a classifier and we will build that classifier
initially on the base of some data that is available with us.

(Refer Slide Time: 17:00)

So let us plot this, let us say we have a 2 dimensional plot. So if all of them occur in a
single cluster, then it would be difficult to classify or separate them. So you expect to see
some logical separation between these points. So to separate both the set of points we
can use a hyper plane and this is basically termed as linear separation or linear

So this building of classifier is based on a test data set. Let us consider that this test data classification. This hyper plane which is separating them is known as a linear classifier
or a linear discriminant. One set lies on one side of the hyper plane and the other set lies
or training data set be x 1 , x 2 , ..., x M + N and the M points corresponds to the absence of
on the other side of the hyper plane. So the problem can be represented as follows.
primary user that is it simply implies noise as shown in slide.
(Refer Slide Time: 24:22)
(Refer Slide Time: 19:07)

So we have H 0
: xi = hi and H1 : x i = si + h i . So in the test data set M
i = 1, 2 , ..., M i = M + 1, ..., M + N T T
So we have a xi + b > 0 , a xi + b < 0 . And finding such a classifier is nothing but
points corresponds to absence of the primary user, the null hypothesis and N points i = 1, 2 , ..., M i = M + 1, .., M + N

correspond to the presence of the primary user or the alternative hypothesis and we have finding the hyper plane which is characterized by these parameters a and b. Therefore
to build our classifier based on this test data set. this implies that we have to estimate a and b. So basically it is very interesting that we
have boiled down this machine learning or artificial intelligence problem of classification Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
in this cognitive radio system into the design of a hyper plane which achieves this
Department of Electrical Engineering
separation such that all the points belonging to the presence of the primary user lie on Indian Institute of Technology, Kanpur
one side of the hyper plane and all the points belonging to the absence of the primary
Lecture – 61
user lie on the other side of the hyper plane. So if you find these parameters a and b Practical Application: Linear Classifier ( Support Vector Machine ) Design
corresponding to such a hyper plane, then we would build this classifier and this Keywords: Linear Classifier, Support Vector Machine
classifier can then be used to basically classify the further things. So let us say you make
Hello welcome to another module in this massive open online course. So we are looking
T
a measurement xi at time M + N +1. Now we have a xM + N +1 +b > 0 implies the primary at convex optimization and its application for machine learning and let us continue our
T
user is absent and a xM + N +1 +b < 0 implies primary user is present. discussion.

(Refer Slide Time: 29:08) (Refer Slide Time: 00:23)

So we are looking at applications of convex optimization or machine learning and in


So thereby you build a classifier and then you can subsequently use the classifier to
particular we are looking at building the optimal classifier.
classify the measure signal and to eventually sense the spectrum that decides, if the
primary user is present or if the primary user is absent. So we will stop here and we will (Refer Slide Time: 00:55)
see how convex optimization helps in building the classifier in the subsequent modules.
Thank you very much.
If you have two sets of points corresponding to hypothesis 0 and hypothesis 1 and this is (Refer Slide Time: 05:01)
the hyperplane that is separating them and since this is linear you can also call this as a
linear classifier.

(Refer Slide Time: 01:53)

So I can simply set the optimization objective to 1, any constant does not matter so we
m in 1

T
a xi + b > 0
have i = 1, 2 , .., M . So this is a trivial optimization objective, the objective is
T s .t
H :a +b > 0 T
So we have
o
. a xi + b < 0

H1 :a
T
+b < 0 i = M + 1, .., M + N

constant which means it cannot be minimized any further. So this will return any feasible
(Refer Slide Time: 02:24)
point in the sense any a and b which satisfy the set of constraints and which are able to
separate these two sets of points. This type of optimization problem is termed as the
feasibility problem.

(Refer Slide Time: 07:28)

So we have the constraints and we do not have an optimization objective. Now, how do
you formulate the optimization problem in this context and you will realize something
interesting that given this constraint we do not need an optimization objective. Any a

and b satisfying this set of constraints is fine which means we can formulate an
So if there is any point which satisfies the constraint, the problem is feasible otherwise
optimization problem with a trivial optimization objective and that is as follows.
the problem is infeasible. So we are trying to check if the problem is feasible and once
you have the a and b you can build the classifier and thereby solve this. Now the other (Refer Slide Time: 13:40)
problem is that these constraints are strict inequalities. So you cannot have these strict
inequalities in a convex optimization problem. So we can modify the optimization
problem as follows.

(Refer Slide Time: 09:58)

Now, we have these two sets of points, hypothesis H1 and hypothesis H0 and we design
two hyperplanes and that is the novel solution. So we want to design two parallel
hyperplanes. In fact, what you are doing is you are fitting a slab, not just a hyperplane,
T
but this is a continuous slab. So this hyperplane is characterized by a x+b =1 and the
T T
other one is a x + b = -1 . All the points in hypothesis H0 will satisfy a x+b ³1 and all
m in 1
T
T these points in hypothesis H1 will satisfy a x + b £ -1. Now, we are fitting a slab
a xi + b ³ 0
So our modified optimization problem is the following, we have i = 1, 2 , .., M . between these two sets of points in the training data set. So this avoids the trivial
s .t
T
a xi + b £ 0 solution. Now we want to fit the thickest slab that is we want to maximize the separation
i = M + 1, .., M + N
between the hyperplanes to make it robust.
Now you do not have strict inequalities anymore, these are half spaces. So these are
affine and therefore, these are convex functions. So this is a convex optimization (Refer Slide Time: 17:29)

problem. The problem arises in the fact that it is a feasibility problem, we do not have
strict inequalities and if you set a = 0, b = 0 that trivially satisfies this problem. So now
this feasibility problem will always have the trivial solution. So even if the points are not
separable it will simply yield a = 0, b = 0 . So you will have to work out another approach
which does not yield the trivial solution, but yields actually a hyperplane that separates
these two sets of points. And to do that we will now further modify this optimization
problem as follows.
Now, we know that if we have two parallel hyperplanes
T T
a x = b1 and a x = b 2 , the (Refer Slide Time: 21:49)

b1 - b 2
distance between these two hyperplanes is and therefore, to maximize the
a
2

separation implies we have to maximize the distance between the hyperplanes.

(Refer Slide Time: 19:05)

Therefore, the optimization problem for maximum which is a robust separation problem

m in a
2

T
a x+b ³1
is . And this is a convex optimization problem. And more
i = 1, 2 , .., M
s .t
T
a x + b £ -1
i = M + 1, .., M + N
So this is done as shown in these slides.
importantly there is no trivial solution. So we have a convex optimization problem,
(Refer Slide Time: 20:38) avoided the trivial solution and we are finding the hyperplanes such that we are fitting
the thickest possible slab or you have the set of hyperplanes with the maximum possible
separation between them separating these two sets of points. So as the separation
becomes smaller and smaller there is a high chance that because of noise you might have
points from one set crossing over into another set. So the moment you are maximizing
the separation between two hyperplanes the probability of error becomes minimum. So
this is the linear classifier and linear classification into two sets is termed binary linear
classifier and is also termed as a linear SVM where SVM stands for support vector.
(Refer Slide Time: 25:04) Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 62
Practical Application: Approximate Classifier Design

Keywords: Approximate Classifier Design

Hello, welcome to another module in this Massive Open Online Course. We are looking
at the linear classifier and we have seen how the support vector machine for linear
classification of two sets of points can be formulated in a convex optimization problem.
So in this module, let us explore the possibility of approximately classifying two sets of
So this is in fact the cutting edge or one of the efficient mechanisms for linear separation. points.
But in this current form, it is simply a linear SVM which can be employed as a binary
(Refer Slide Time: 00:50)
linear classifier to classify two sets of points. All such binary classification problems can
be handled by the linear SVM. So we will stop here and continue in the subsequent
modules. Thank you very much.

So let us look at building an approximate classifier in the sense that this has some
classification error.
(Refer Slide Time: 01:34) 2
We said the distance of separation is . So if you want maximize the distance we want
a

m in a

T
a xi + b ³ -1
to have and this is your convex optimization problem for the
i = 1, 2 , ..., M
s .t
T
a xi + b £ -1
i = M + 1, .., M + N

linear SVM and now, the problem arises if the sets of points are not linearly separable.

(Refer Slide Time: 04:26)

So far we have seen two sets of points and we are trying to separate them. So this set
corresponds to hypothesis H0 and hypothesis H1 as shown in slide and we said we can
separate them using a hyper plane. But that results in the trivial solution. So what we said
was we are going to fit the thickest possible slab and maximize the separation or between
the parallel hyper planes. The optimization problem for this can be formulated as
follows.

(Refer Slide Time: 03:02)

For instance, you have a situation where you have some points belonging to hypothesis
H0 and at the same time you have some points belonging to hypothesis H1. But if you try
to separate them by any plane, you are going to have some classification errors. So you
might get misclassified points and these can also arise due to noisy data. So sometimes
there might be noise in the system and some of the H0 observations are closely clustered
with H1 and some of the H1 observations are closely clustered with H0. So it is not
possible to find a plane or it is not possible to fit a nice slab between them which means
one has to tolerate a certain amount of classification error.
(Refer Slide Time: 07:24) (Refer Slide Time: 10:27)

So this implies that our classifier is only going to be approximate. So it is going to be an So you allow for a certain slack in the constraint that is the constraint need not be exactly

approximate classifier, but we want to have a good approximation. So we want to design satisfied.

an approximate classifier that minimizes the number of misclassified points or minimizes


(Refer Slide Time: 12:10)
the classification error.

(Refer Slide Time: 09:28)

(Refer Slide Time: 13:30)

So this is given as shown in slides.


So you are allowing a slack in this constraint or basically you are allowing some of the (Refer Slide Time: 18:00)
points to be misclassified, in the sense that some of the points have slack that is large
enough, so that they cross over one side of this hyper plane to the other side. Now, for a
perfect classifier there is no classification error. However, there is classification error,
even when you want to allow the possibility of point being misclassified, again due to
noisy data perfect separation is not possible.

(Refer Slide Time: 16:05)

And now, if you write this as a vector you can combine all these things and this is the
component wise inequality. So you can simply minimize the total slack. Now, the best
approximate classifier is basically the one which minimizes the total slack at the same
time, we do not want to allow too much of slack or too much of tolerance, we want to
keep the tolerance as low as possible which means we have to minimize the total slack.

M +N
T

So by introducing the slack in these constraints, you are allowing the possibility for few m in å ui = 1 u
i =1

of the points to be misclassified. So you want to build an approximate classifier, but the T
a xi + b ³ 1 - ui
m in a i = 1, 2 , ..., M
So this is given as . So that is basically the approximate classifier
ui ³ 0
T s .t
a xi + b ³ 1 - ui T
a xi + b £ -1 + ui
i = 1, 2 , ..., M
best approximate classifier. So we have . i = M + 1, .., M + N
ui ³ 0
s .t ui ³ 0
T
a xi + b £ -1 + ui
i = M + 1, .., M + N or you can also think as the soft margin classifier.
ui ³ 0
(Refer Slide Time: 20:21) (Refer Slide Time: 22:43)

So you can regularize this. Now, if linear classification is possible, then these components of the vector u will either
be 0 or close to 0. But of course, if linear classification is not possible, then several of
(Refer Slide Time: 21:12)
these u’s will be greater than 0, in fact, some of these might even be greater than or equal
to 1 which shows that basically some of points are misclassified. However, we want
these elements to be as few of them to be greater than or equal to 0 as possible that is you
want the slacks in general to be as low as possible, as close to 0 as possible, that is why
we are minimizing the total set. In fact, in this case we are minimizing a weighted
combination of the distance between the separating hyper planes along with the slack. So
that fits the thickest slack, while allowing certain amount of misclassification and you
are minimizing the linear combination of these two objective functions. This is basically
the regularized minimization or you think of this as a regularized classifier. So we will
stop here and continue in the subsequent module, so thank you very much.

Previously, we wanted to fit the thickest slab. So there are two objective functions and
you can consider a combination of them. So we use the regularization parameter l , and

( )
T
m in a +l 1 u

T
a xi + b ³ 1 - ui
this is given as i = 1, 2 , ..., M .
s .t
T
a xi + b £ -1 + ui
i = M + 1, .., M + N

u ³ 0
Applied Optimization for Wireless, Machine Learing, Big data m in g0 (x)
Prof. Aditya K Jagannatham
Department of Electrical Engineering g (x) £ 0
i
Indian Institute of Technology, Kanpur So we have i = 1, 2 , .., l . And we have seen that the objective function is convex,
s.t
g j (x) = 0
Lecture - 63
j = 1, 2 , .., m
Concept of Duality
the inequality constraints are convex and the equality constraints are affine so it becomes
Keywords: Duality
a convex optimization problem.
Hello, welcome to another module in this massive open online course. So we are looking
at different topics and concepts in convex optimization and particularly from an applied (Refer Slide Time: 03:53)

perspective. In this module, let us start with a new topic and that is Duality.

(Refer Slide Time: 00:30)

Now for this optimization problem, the Lagrangian function can be formulated as
l m

(
L x , l ,n )=g 0
( x) + å li g i ( x) + ån j
g j(x) .
i =1 j =1
So what this does is it formalizes the framework of Lagrange multipliers. So recall a
standard form optimization problem given as follows. (Refer Slide Time: 05:09)

(Refer Slide Time: 01:56)

Now these quantities are the Lagrange multipliers.


(Refer Slide Time: 05:55) (Refer Slide Time: 09:59)

æ l m
ö
So this is a weighted sum of the objective function and the constraints and the weights So this can be again written as m in ç g 0 ( x ) +
x
å li g i ( x) + ån j
g j ( x) ÷ . Now it is
è i =1 j =1 ø
are basically the Lagrange multipliers.
important to remember that we have started with the standard form optimization problem
(Refer Slide Time: 08:00) which is not necessarily convex. This Lagrangian dual function has a very interesting
property that is this can be shown to be concave in nature, irrespective of the original
optimization problem which need not be convex.

(Refer Slide Time: 11:54)

Now, the Lagrange dual function is


x
(
g d ( l , n ) = m in L x , l ,n ) so this is a function of the

Lagrange multipliers.

And if you closely observe this function you can observe that even though this is a
complicated function of x , this is affine in the Lagrange multipliers which are nothing
but the weights.
(Refer Slide Time: 14:50) Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture-64
Relation between optimal value of Primal and Dual problems, concepts of Duality
gap and Strong Duality

Keywords: Primal problem, Dual problem, Strong Duality, Duality Gap

(Refer Slide Time: 00:23)

So this is affine in the sense that this is a hyper plane and therefore, this is a concave
function. And what is the dual doing, this is taking the minimum over x . So this is
concave.

(Refer Slide Time: 16:10)

Hello, welcome to another module in this massive open online course and we are looking
at the concept of duality for optimization. So let us go back to the original possibly not
necessarily convex optimization problem.

(Refer Slide Time: 01:19)

So this Lagrangian dual function is a concave function. So even when the original
problem is not necessarily convex one can convert a standard, possibly non convex
optimization problem into an equivalent concave optimization problem. So that is the
power of the duality framework. So in fact, can use this to simplify several possibly non
convex optimization problems, as we are going to see subsequently. Thank you very
much.
m in g0 (x) So let us try to demonstrate this, so let us start with the following. Let x be any feasible
g (x) £ 0 point, feasible point means, it satisfies the constraint in the sense that you have
i
So we have i = 1, 2 , .., l . Now, let P* denote the optimal value of this original
s.t g i ( x ) £ 0 i = 1, 2 , .., l and g j ( x ) = 0 j = 1, 2 , .., m .
g j (x) = 0

j = 1, 2 , .., m
(Refer Slide Time: 06:07)
optimization problem.

(Refer Slide Time: 02:43)

l m

So the dual function is L ( x , l ,n )= g 0


(x) + å li g i ( x ) + ån j
g j ( x) and since each li ³ 0
i =1 j =1

Now we want to show that if this vector of Lagrange multipliers associated with the and this is shown in slide.
*
inequality constraints that is if l ³ 0 Þ li ³ 0 then the dual that is g d ( l ,n ) £ P . This is
i = 1 , 2 . ., l (Refer Slide Time: 07:27)
a very important property of the dual function.

(Refer Slide Time: 04:37)

So we have finally (
L x , l ,n )£ g0(x) . So this holds for any feasible point and as long as

all the li ³ 0 .
(Refer Slide Time: 09:07) (Refer Slide Time: 12:43)

Now this can be rewritten as g 0 ( x ) ³ g d ( l ,n ) as shown in slide. *


So we have g d ( l , n ) £ P fo r l ³ 0 . So this means that this Lagrange dual function forms
*
(Refer Slide Time: 10:45) a lower bound for this P which is the optimal value of the original optimization
problem.

(Refer Slide Time: 14:49)

Now, if you take the minimum of this for any feasible x that is nothing but P* which is
*
the optimal value of the original optimization problem. So we have P ³ g d ( l ,n ) .
(Refer Slide Time: 15:43) (Refer Slide Time: 19:03)

Now the best lower bound is the maximum value of this lower bound which is as close as The original problem is termed as the primal problem and even if the primal problem is
the optimal value P*, so that this gap between the lower bound and P* is minimized. And non-convex, the equivalent dual problem that is derived from the primal problem is
m a x g d ( l ,n ) convex. And therefore, one can conveniently use all the techniques of convex
that is basically given as .
s .t l ³ 0 optimization to solve the Lagrange dual problem.

(Refer Slide Time: 17:21) (Refer Slide Time: 20:07)

Now we can see that the best lower bound is also an optimization problem, although the Since you are taking the best lower bound that is going to give you something that is as
m in - g d ( l , n ) close as possible to the optimal value P*, but still it is going to be lower than P*. So what
original problem need not be convex. So I can equivalently write this as .
s .t -l ³ 0 you get by solving the dual optimization problem is always going to be the best lower
So you can use all the techniques of convex optimization to conveniently solve the dual bound, but still it is lower than P*.
problem.
(Refer Slide Time: 21:35) (Refer Slide Time: 23:23)

So if P* = d*, that implies P* - d* = 0, that implies duality gap is 0.


So if we take the maximum for some optimal value of l ,n , this is still going to be less

than or equal to P*. And therefore if you call this optimal value as d*, we have d
*
£ P
*
. (Refer Slide Time: 24:05)
* *
So d is the optimal value of the dual optimization problem and P is the optimal value of
the primal problem

(Refer Slide Time: 22:59)

When this happens, it is said that strong duality holds for the problem.

And this gap between d* and P* is the duality gap.


(Refer Slide Time: 24:25) And typically the strong duality holds for any convex problem, although one can form
the dual optimization problem, solve the dual optimization problem for any possibly non-
convex problem. So the primal or dual, they always go hand in hand for any optimization
problem in particular for a convex optimization problem, because the duality gap is 0. So
we will stop here and continue in the subsequent modules. Thank you very much.

Otherwise, if d* £ P*, this is weak duality, it always holds.

(Refer Slide Time: 24:41)

(Refer Slide Time: 26:09)


Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:50)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 65
Example problem on Strong Duality

Keywords: Strong Duality

Hello welcome to another module in this massive open online course. So we are looking
at duality and we have seen the concept of strong duality that is for any optimization
problem written in the standard form, one can come up with an equivalent dual
optimization problem which is convex, you can solve that and to obtain the optimal point
d* and usually d * £ P* where P* is the optimal value of the original primal problem, but Here this matrix A has m rows as shown in slide and therefore there are m equality

when strong duality holds which is usually true for a convex optimization problem we constraints, one for each row of the matrix A. Therefore you need to have one Lagrange

have d* = P*. And now let us understand that through an example. multiplier for each equality constraint, so you have a vector n where each ni is for
T
a i x = bi .
(Refer Slide Time: 00:54)

(Refer Slide Time: 03:30)

m in x
So let us look at the minimum norm problem. So we have . Here there are only
s .t A x = b Now, the Lagrangian can be formulated as shown in slide and on solving it as shown we
equality constraints, there is no inequality constraint. get x = -
1 T
A n . So this is the x for which the minimum is achieved for the Lagrangian
2

corresponding to the original optimization problem. Now to get the dual optimization
problem we substitute this.
(Refer Slide Time: 06:38) (Refer Slide Time: 08:59)

So after substitution and further simplification as shown in the slides we get the So this will always give a lower bound that is g d (n ) £ P
*
. Now this is a concave
1 T
T
T
Lagrange Dual function as g (n ) = - n AA n -n b . function.
4

(Refer Slide Time: 11:36)


(Refer Slide Time: 08:06)

Now the best lower bound is given by the maximum value. So we have m a x g d (n ) .
(Refer Slide Time: 12:59) (Refer Slide Time: 14:58)

T -1
So d* that is the optimal value of the dual problem is obtained as ( AA )
1 *
= -2 ( AA )
T
So on solving this we get n T
b and for this value the Lagrange dual function is d = b b .

maximized. Now to find the optimal value d*, simply substitute n in the dual problem,.
(Refer Slide Time: 15:54)

(Refer Slide Time: 13:58)

So this is d* is always less than or equal to P*. Now we need to find P* that is optimal
2 T
m in x = x x
value of the primal problem. So we have .
s .t A x = b
(Refer Slide Time: 16:52) (Refer Slide Time: 20:02)

-1
And we already know that the optimal solution for this is x = A
T
( AA )
T
b from the Let us look at another interesting problem and that is a linear program. So we have
T
T
*
previous modules. And now P = x x and we substitute x in P * and if you simplify it m in c x
. This can be written as a standard form convex optimization problem as
T -1 Ax = b
we get P
*
= b ( AA )
T
b = d
*
. s .t
x ³ 0
T
m in c x
(Refer Slide Time: 18:05)
.
Ax = b
s .t
-x £ 0

(Refer Slide Time: 20:46)

Therefore, strong duality holds and the dual objective and the primal objective are
coinciding at the same point which is the maximum value of the dual objective function
as well as the optimal value of the primal objective. So this is one of the simplest and
most elegant optimization problems.
Now, the Lagrangian of this can be formulated as

( A x - b ) + l ( - x ) . So this comprises of the Lagrange multiplier for


T T T
L ( x , l ,n ) = c x + n

the equality constraint and one Lagrange multiplier for each inequality constraint. Now,
we have to take the minimum of the Lagrangian and typically for that we differentiate it
with respect to the vector x , but since this is an affine function we will follow a slightly (Refer Slide Time: 24:02)
different approach.

(Refer Slide Time: 22:14)

m in L ( x , l , n )

So with that observation we have ìï T T T . So this is the


- ¥ if c + n A - l ¹ 0
g d ( l ,n ) = í
T T T T
If you separate the terms, you can see this is the equation of a hyperplane. Now this is an ïî -n b if c + n A - l = 0

affine function, it is like a line. Lagrange dual function and the best lower bound is available, when you maximize this.

(Refer Slide Time: 23:08) (Refer Slide Time: 25:38)

T
m ax -n b
So if this line has a slope, then the minimum value of this will always be equal to - ¥ , So the dual optimization problem can be equivalently written as as shown
T
s .t A n + c ³ 0
only if the line is parallel, then the minimum value is a constant.
in the slides.
(Refer Slide Time: 26:53) Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 66
Karush-Kuhn-Tucker (KKT) conditions

Keywords: Karush-Kuhn-Tucker (KKT) conditions, Complimentary Slackness

Hello, welcome to another module in this massive open online course. In this module
you want to start looking at KKT conditions, the Karush-Kuhn-Tucker conditions, which
are convenient for solving any optimization problem.

(Refer Slide Time: 00:28)


And since the original problem is a linear program, the dual optimization problem is also
a linear program. Therefore, strong duality holds. So we will stop here and continue in
the subsequent modules. Thank you very much.

(Refer Slide Time: 01:38)


So now consider again the original or the primal optimization problem that is we have (Refer Slide Time: 04:04)
m in g0 (x)

g (x) £ 0
i
i = 1, 2 , .., l .
s.t
g j (x) = 0

j = 1, 2 , .., m

(Refer Slide Time: 02:13)

* * *
Now, let x is the primal optimal solution and let l ,n be the solution of the dual
* * *
problem. Then by strong duality, P* = d*, g0(x ) = P
*
and g d ( l ,n ) = d
*
.

(Refer Slide Time: 05:38)

And in addition assume that strong duality holds which implies P* = d*.

(Refer Slide Time: 03:29)

* * *
So this implies g 0 ( x ) = g d ( l ,n ) .

So P* is the optimal value of primal problem and d* is the optimal value of the dual
problem.
(Refer Slide Time: 06:39) This implies all the intermediate quantities which are sandwiched in between must be
*
also equal to g0(x ) .

(Refer Slide Time: 12:04)

* *
Now pn solving this as shown in the slides we have g0(x ) £ g0(x ) .

(Refer Slide Time: 08:28)


And therefore, this implies proceeding further as shown in slides, we have li g i ( x ) = 0 .

(Refer Slide Time: 12:59)

(Refer Slide Time: 09:38)

So this is the very interesting property, this is termed as complimentary slackness.


(Refer Slide Time: 14:18) (Refer Slide Time: 16:41)

So these complement each other and this is termed as complimentary. This property is
So this implies two things either l i = 0 , which means gi (x) < 0 or li > 0 and gi(x) = 0 .
termed as complimentary slackness. And this is a unique aspect of the KKT conditions.
(Refer Slide Time: 15:04)
(Refer Slide Time: 18:24)

So in the first case the constraint is slack and Lagrange multiplier is tight. In the second
case, the constraint is tight and Lagrange multiplier is slack. So this is the meaning of the
complimentary slack that is either the Lagrange multiplier is slack or the constraint is
slack. It cannot happen that both the Lagrange multiplier and the constraint are slack.
(Refer Slide Time: 18:55) (Refer Slide Time: 20:34)

And the KKT conditions can be finally stated as follows, if x, l , v are optimal and strong
g (x) £ 0
i
duality holds, this implies that g 0 ( x ) = g d ( l ,n ) . i = 1, 2 , .., l
Then it must be that, first, the primal constraints are basically .
g j (x) = 0
(Refer Slide Time: 20:05) j = 1, 2 , .., m

(Refer Slide Time: 21:13)

l ³ 0
Then the dual constraints l ³ 0 must hold.
i
i = 1, 2 , .., l
(Refer Slide Time: 21:45) Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 67
Application of KKT conditions : Optimal MIMO Power allocation (Waterfilling)

Keywords: Karush-Kuhn-Tucker (KKT) conditions, Optimal MIMO Power allocation


Waterfilling Algorithm

Hello, welcome to another module in this massive open online course. So we have
looked at the KKT conditions to solve an optimization problem. Let us look at an
application to better understand how one can use the KKT conditions to solve an
optimization problem.
The third condition is complimentary slackness that is l g (x) = 0 that is either li > 0
i i
i = 1, 2 , .. , l
(Refer Slide Time: 00:34)
or gi(x) < 0 but not both. And finally, since at x you have the minimum of the

Lagrangian function, the gradient with respect to x of the Lagrangian must vanish at this
point x and this is as shown in the slide and we have Ñ x L ( x , l ,n ) = 0 .

(Refer Slide Time: 24:32)

(Refer Slide Time: 01:14)

So these are the four KKT conditions that must be satisfied by the solution of the primal
optimization problem and the dual optimization. So let us stop here and continue in the
subsequent modules. Thank you very much.
So let us say you have a set of parallel channels. So we have yi a i xi + ni where this quantity ni is the additive white Gaussian noise, with
2 2
(Refer Slide Time: 01:35) mean 0 and variance s . So noise power is s for each channel. The SNR for channel i
a i Pi
is 2
.
s

(Refer Slide Time: 05:29)

So let assume that these are arranged in decreasing order of gains. So you can transmit at
a certain bit rate over each of these communication channels and bit rate depends on the
power that is allocated to that particular channel. So let us say the power allocated for
first channel is P1, second channel is P2 and so on and for nth channel is Pn. Let us say the
And now the maximum information rate is given by the Shannon’s formula for the
gain of first channel is a1 , gain of channel 2 is a2 and so on gain of channel n is an . So
æ a i Pi ö
capacity of the channel. So this is given as lo g 2 ç 1 + 2 ÷
.
as shown in the slide, this is the input and this is the output or you can think of it as a è s ø

transmitter and the receiver. The received power across channel 1 will be a 1 P1 , similarly
(Refer Slide Time: 06:37)
across channel 2 will be a 2 P2 and so on across channel n will be a n Pn . Now in addition
for every communication at the receiver we will have thermal noise or Gaussian noise
which is typically modelled as additive Gaussian noise.

(Refer Slide Time: 03:57)

So this is the maximum rate at which information can be transmitted over the channel i.
And therefore the maximum sum rate of information transmitted across all these n
parallel channels will be given by the sum of the individual rates across each of these n
parallel channels.
(Refer Slide Time: 07:43) (Refer Slide Time: 10:27)

So the maximum sum rate corresponding to powers P1, P2,..,Pn is to be calculated. n


æ a i Pi ö
m ax å lo g ç 1 +
s
2 ÷
i =1 è ø
(Refer Slide Time: 08:10) n

So the optimization problem is s .t å Pi = P . This log is a concave function and


i =1

Pi ³ 0

the sum of log is also a concave function. So this is the maximization of concave
objective function.

(Refer Slide Time: 12:01)

n
æ a i Pi ö
We want to maximize this sum rate, so we have m ax å lo g 2 ç 1 +
s
2 ÷
. Now we are
i =1 è ø
n
æ a i Pi ö
making a minor modification here m ax å lo g 2 ç 1 +
s ø
2 ÷
lo g 2 e . So this then becomes
i =1 è

a i Pi ö
n
æ
simply the natural logarithm, so we have m ax å lo g ç 1 +
s
2 ÷
, so instead of
i =1 è ø

maximizing the objective function times a constant, we can simply ignore the constant
factor. Now the constraint is that the total transmit power is a fixed quantity.
a i Pi ö ( ) = 0 is
n
æ Now Ñ P L P , l ,n one of the KKT conditions. So on solving this we get
m ax - å lo g ç 1 +
s
2 ÷
i =1 è ø
n ai
So this can equivalently be written as s .t å Pi = P . So this is the convex
n =
s
2
+ li .
i =1
a P
- Pi £ 0 1 + i 2i
s
optimization problem for optimal power allocation. You are allocating the powers
optimally and hence it is termed as optimal power allocation. (Refer Slide Time: 16:35)

(Refer Slide Time: 13:11)

(Refer Slide Time: 17:52)

Now we will use the KKT conditions to solve this, let us start with the Lagrangian. So
a i Pi ö
n
æ æ ö
( ) = lo g ç 1 +
T
we have L P , l ,n 2 ÷
+ n ç å Pi - P ÷ - l P . So you have one Lagrange
è s ø è i =1 ø

multiplier for each inequality constraint.

(Refer Slide Time: 15:43)

And now from the complementary slackness we have l i Pi = 0 that is either the constraint
is slack or the Lagrange multiplier is slack, but not both. So let us consider these two
conditions. If Pi > 0, that is power allocated to a channel is non-negative, then li = 0 .
(Refer Slide Time: 19:09) (Refer Slide Time: 20:58)

2
1 s
So this implies that = + Pi . On the other hand, if you consider the case 2, if li > 0 , that is Lagrange multiplier is
n ai
2
1 s
slack which implies that Pi = 0. Now this implies < as shown in the slide.
n ai
(Refer Slide Time: 19:51)

(Refer Slide Time: 22:14)

2 2
1 s 1 s
This implies that Pi = - Þ ³ and the corresponding eigen value is 0. So this is
n ai n ai So there are two cases and therefore if you summarize it we have

the optimal power allocated to the ith channel. ì 1 s 2 1 s


2
ï - if ³
ïn a n a
Pi = í i i .
0 2
ï 1 s
if <
ï n a
î i
(Refer Slide Time: 23:57) (Refer Slide Time: 26:05)

æ 2 ö
+
ì 2 ü
Let us let us assume that these are ordered as the first channel is the strongest and the last
1 s ï1 s ï
And therefore, you can write this Pi as Pi = ç - ÷ . So Pi = m a x í - ,0ý . one is the weakest.
çn a ÷ n a
è i ø ï
î i ï
þ

(Refer Slide Time: 26:54)


(Refer Slide Time: 25:28)

Now what this implies is that more power is allocated to the stronger channel as shown
in slides.

+
æ 2 ö
1 s
n

So to find nu solve å ç - ÷ = P that is the total power constraint.


i =1 çn a ÷
è i ø
(Refer Slide Time: 28:24) (Refer Slide Time: 31:27)

Now, let us look at this representation, for instance a sort of bowl or you can call it an
Therefore, this scheme is known as the optimal water filling algorithm. So you can think
area with this kind of pillars. So the first pillar is corresponding to first channel and then
of this as a water level. So it is a solution of a convex optimization problem derived or
1
it decreases. Now if you draw here the level , you can think of this as a water level, obtained using the KKT conditions and the complementary slackness plays a very key
n
role. So this is a nice scheme or this is the optimal scheme to allocate power across the
now the power allocated to the first channel is basically the amount of water.
parallel channels that maximizes the sum rate of communication between the transmitter
(Refer Slide Time: 30:40) and the receiver. So we will stop here and continue in the subsequent module. Thank you
very much.

This is as shown in slide.


Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:55)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 68
Optimal MIMO Power allocation(Waterfilling)-II

Keywords: Optimal MIMO Power allocation, Waterfilling Algorithm, Singular Value


Decomposition (SVD)

Hello welcome to another module in this massive open online course. So we are looking
at KKT conditions to solve an optimization problem, we have looked at a specific
application of KKT conditions. Now, let us look at an example that is the application in
MIMO optimal power allocation. So we want to allocate power optimally to the various modes of this MIMO channel. The
modes are given by the singular value decomposition. Let us consider total power = 4
(Refer Slide Time: 00:33) 2
and the noise power s = 3dB = 2 .

(Refer Slide Time: 03:10)

So let us consider the following MIMO system, each MIMO channel can be represented
by the equivalent channel matrix. So we have the MIMO channel matrix as
é1 2ù
H = ê ú , so this an r´t , this implies r = 2, t = 2. So the number of receive antennas
ë2 - 2û

as well as the number of transmit antennas equals 2.


(Refer Slide Time: 03:21) (Refer Slide Time: 04:18)

So as shown in slide, we have r receive antennas and t transmit antennas. Now as we


n
æ a i Pi ö have seen the optimal power allocation problem, we have a total transmit power P. We
m ax - å lo g ç 1 +
s
2 ÷
i =1 è ø
have a set of parallel channels and we want to allocate the power optimally amongst
n

So we have s .t å Pi = P . these parallel channels so as to maximize the total bit rate that can be transmitted across
i =1

- Pi £ 0 this channel. Now, for that first we have to see how this MIMO channel can be
decomposed into a set of parallel channels, because only then one can talk about optimal
(Refer Slide Time: 03:29) power allocation.

(Refer Slide Time: 07:06)

So MIMO stands for Multiple Input Multiple Output system, which means that basically
you have a wireless communication system with multiple transmit antennas and multiple
For that we have the received symbol vector y = H x + n for this MIMO system, where
receive antennas.
y has y1 y2…yr and these are the r received symbols across the r receive antennas and n

is the additive white Gaussian noise samples at the r receive antennas. This MIMO
channel matrix H is of dimension r ´ t. For this MIMO channel matrix you have the
coefficients h11 h12 h21 so on and finally, the last row last column will be hrt. So hij is the
channel coefficient between ith receive antenna and jth transmit antenna. And now the key (Refer Slide Time: 12:54)
to understand this decomposition of MIMO channel into a set of parallel channels is the
singular value decomposition.

(Refer Slide Time: 10:24)

This S is a diagonal matrix of what are known as singular values, s1 ³ s 2


³ .... ³ s t with

each of this s i
is non-negative and these are arranged in decreasing order.

(Refer Slide Time: 13:53)


Now, in singular value decomposition given this channel matrix H, you decompose this
H
as a product of three matrices, that is H = U SV . Consider now for the sake of
simplicity this H to be an r ´ t matrix with r³ t . Now, this matrix U is an r ´ t matrix,
H
S is a t ´ t diagonal matrix and V is again a t ´ t unitary matrix.

(Refer Slide Time: 11:53)

So this matrix can be written as shown in slides.

Now, U has orthonormal columns which implies that the columns are orthogonal to each
H
other and are unit norm and therefore, if you perform U U = I . Now, V is a unitary
H H
matrix implies that V V = VV = I , so V is a unitary matrix.
(Refer Slide Time: 14:54) (Refer Slide Time: 17:43)

Now, we are going to do similarly at the transmitter even before transmission of x , we


Now, we have seen that u1 is the optimal receive beam former for the MIMO system and
are going to employ a preprocessing operation or this is also known as a pre coding
in fact, it is the principle eigenvector that is an eigenvector corresponding to the largest
operation. So in case of MIMO, processing can be done at both ends, one is at the
H
singular value of HH and v1 is the optimal transmit beam former, that is also equal to transmitter and the other at the receiver.
H
principle eigenvector of H H .
(Refer Slide Time: 19:36)
(Refer Slide Time: 16:49)

So we have x which is to be pre-coded, x =V x , this is the pre coding operation.

Now let us look at the following MIMO transmission scheme where you have
H
y = H x+ n and we have the singular value decomposition as y = U SV x+ n . Now the
H H
first step is, at the receiver I am going to process with U . So we have y =U y and
this is the receive processing, so we

have y =U
H
(U S V H
)
x + n = SV
H
x+U
H
n = SV
H
x + n .
(Refer Slide Time: 20:02) (Refer Slide Time: 21:52)

(Refer Slide Time: 20:16)


So the singular value decomposition is what gives us our set of parallel channels and this
is shown in slides above.

(Refer Slide Time: 22:56)

H
Now we have y = SV V x + n = Sx + n . So this is as shown in slide. (Refer Slide Time: 24:37)
And therefore, optical power allocation can be done as shown. Now power allocated to (Refer Slide Time: 28:04)
2
Pi s
ith channel is Pi and noise power is s
2
. So SNR of ith channel is 2
i
.
s

(Refer Slide Time: 25:44)

So water filling technique can be used for optimal MIMO power allocation and these
parallel channels are already arranged in the decreasing order. So naturally first channel
is allocated a larger fraction of the power compared to the next channels. And there
t
æ Pis i ö
2 might be some channels which are below the water level and which are not allocated any
m ax å lo g ç 1 +
s
2 ÷
i =1 è ø power.
t

The sum rate will be s .t å Pi = P .


(Refer Slide Time: 30:16)
i =1

- Pi £ 0

(Refer Slide Time: 27:13)

So we will stop here and continue in the next lecture. Thank you very much.

*
æ 1 s 2
ö
And we have already solved this problem and Pi = ç - 2 ÷ , this is given by the water
èn s i ø

filling power allocation. This is the optimal power allocated to the ith mode of the MIMO
channel.
Applied Optimization for Wireless, Machine Learning, Big Data We have a total power P = 4 and noise power s
2
= 3dB = 2 . And now we have to
Prof. Aditya K. Jagannatham
Department of Electrical Engineering optimally allocate this total power. So for optimal power allocation which maximizes the
Indian Institute of Technology, Kanpur sum rate we have to first start with the singular value decomposition.

Lecture – 69
(Refer Slide Time: 03:06)
Example problem on Optimal MIMO Power allocation (Waterfilling)

Keywords: Optimal MIMO Power allocation, Waterfilling Algorithm, Singular Value


Decomposition(SVD)

Hello, welcome to another module in this massive open online course. So we are looking
at Optimal MIMO Power Allocation. Now let us do an example to understand this better.

(Refer Slide Time: 00:27)

H
Now this can be written as H = U SV where U contains orthonormal columns and these
columns are orthogonal to each other. So all we have to do is we have to simply
normalize them and this is as shown in slide. So we will get
é 1 ù
2
ê úé 2 0 ù
2 2 2
H = ê úê ú .
ê 2 -2 ú ê0 2 2 úû
ë
ê ú
ë 2 2 2 û
é1 2ù
So now consider the MIMO channel matrix H = ê ú .
ë2 - 2û (Refer Slide Time: 05:31)

(Refer Slide Time: 01:38)

Now we have orthonormal columns and this satisfies the property of U. And in fact this
matrix can be possibly S because this is a diagonal matrix and these are non-negative, so
these are the possible singular values. And now we need the V matrix which is a unitary (Refer Slide Time: 10:27)
matrix. So I can simply use the identity matrix in this case as unitary matrix.

(Refer Slide Time: 07:04)

é 1 1 ù
ê ú é2 2 0 ù é0 1ù
2 2
So finally we have H = ê ú ê úê ú and now we have to do optimal
ê -1 1 ú êë 0 2 úû ë1 0û
ê ú
ë 2 2 û
Now the only problem is the that the singular values should be arranged in decreasing power allocation that is you can decompose this using pre coding as the combination of
order which is not possible in this obtained matrix. two parallel channels.

(Refer Slide Time: 09:02) (Refer Slide Time: 12:19)

So this is not a valid SVD. So we need the singular values to be ordered in decreasing
order. So we have to somehow switch these values and this is as shown in slide. So this So you are transmitting x1 through the first channel that has gain s1 and noise n1 to give
H
is possible if I basically interchange the columns of U and the rows of V and then I can y1 . And similarly x2 through the second channel that has gain s 2
and noise n2 to give
flip the singular values.
y2 , so these are the parallel channels for the given MIMO channel. So this is a 2 ´ 2
MIMO channel.
(Refer Slide Time: 14:00) (Refer Slide Time: 17:03)

(Refer Slide Time: 14:37)


So finally to find n we have to use the total power constraint. So this is basically a non-
linear equation and this is proceeded as shown in slide.

(Refer Slide Time: 18:53)

Now for the optimal power allocation we substitute the required quantities as shown in
slide.

(Refer Slide Time: 15:53)


1
So start with the assumption, ³1 .
n
(Refer Slide Time: 19:30) (Refer Slide Time: 22:15)

19 13
So finally we get P1 = > 0 , P2 = > 0 . Now, if one of the powers would have been negative that implies our original assumption
8 8
is incorrect. So the power is negative implies that that corresponding channel is above
(Refer Slide Time: 20:56) the water level. So power is not allocated. So in the corresponding channel the power has
to be set to 0 and the problem has to be repeated with the total power constraint. So this
is the procedure alright and now you also observe that more power is allocated to the
stronger channel.

(Refer Slide Time: 23:46)

So the original assumption holds and we get the optimal powers.

So what this shows is that to maximize capacity more power is allocated to the stronger
channel that is the one with the largest singular value.. So we will stop here and continue
in the subsequent modules. Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data (Refer Slide Time: 01:53)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 70
Examples: Linear objective with box constraints, Linear Programming

Keywords: Linear objective, Box constraints, Linear Programming

Hello, welcome to another module in this massive open online course. So we are looking
at example problems in convex optimization.

(Refer Slide Time: 00:24)

Each li £ x i £ u i , so this is also known as box constraints and this is as shown in slide.

(Refer Slide Time: 02:36)

So we are looking at example problems and let us look at problem number 1. So we have
T
m in c x
which means that if you have the elements of are component wise less than
s .t l £ x £ u
é x1 ù
or equal to the other elements. So therefore, ê ú are confined to this box as shown in slide and hence this is also
ë x2 û

termed as box type constraint. In fact, it is a simple linear program. So the solution for
this is fairly straight forward.
(Refer Slide Time: 05:22) (Refer Slide Time: 09:09)

ï li i f c i ³ 0
ì
T
n
And therefore, the optimal value of each xi is í .
So you have c x = å ci xi . Now, these box type constraints make sense only if li £ u i . ï
u if c < 0
î i i
i =1

So we assume here that li £ u i . If this is not so, then the problem becomes infeasible.
(Refer Slide Time: 10:16)
Now, consider any xi such that li £ x i £ u i .

(Refer Slide Time: 07:05)

T
n n ìï c if c ³ 0
+
ci = m ax {ci , 0} = í i
+ - i
So m in c x = m i n å ci xi = å c i li + c i u i where
0 o th e r w i s e
and
i =1 i =1 ïî

ìï c if c < 0
Now, if ci ³ 0 , this implies c i li £ c i x i £ c iu i . So minimum value for x i lying in this box is -
ci = í i i .
0 o th e r w is e
ïî
cili which occurs when x i = li. On the other hand, if ci < 0 this implies c i l i ³ c i x i ³ c iu i .

Now the minimum value is ciu i .


(Refer Slide Time: 11:08) (Refer Slide Time: 13:53)

n
T -1
+ - + Now we substitute, Ax = y Þ x = A y .
You can also write this as m in c x = å c i li + c i u i where ci contains all positive
i =1
s .t l £ x £ u

-
(Refer Slide Time: 15:45)
elements of c and ci contains only negative elements of c and the rest are 0. So that is
the optimal value of this problem.

(Refer Slide Time: 12:13)

Now, we will write the equivalent optimization problem in terms of y . So we have the
T T T
-1 -1
objective c x = c A y = c y Þ A c = c .

Let us proceed to a slightly more sophisticated example for which the solution might not
T
m in c x
be very obvious and that is problem number 2 where we have . This is a linear
s .t A x £ b

program, but slightly more sophisticated and the solution depends on the nature of A
which is a square full rank matrix. This implies that A is invertible.
(Refer Slide Time: 17:13) (Refer Slide Time: 20:15)

T
n
Now, if all c i £ 0 Þ y i £ bi Þ c i y i ³ c i bi . So the minimum occurs for y i = bi .
m in c y T
So the objective becomes . So now, we have c y = å c i yi .
s .t y £ b i =1
(Refer Slide Time: 21:01)

(Refer Slide Time: 18:07)

And therefore, the net minimum implies


n
T T T
-1 -T

And the constraint will be component wise constraint, this implies that each component
m in c y = å c ibi = c b = c A b if c £ 0 Þ A c £ 0 .
i =1

of this vector y is less than or equal to each component of vector b . Now we consider if

any c i > 0 Þ y i £ bi Þ c i y i £ c i bi . So this implies that c i yi ® -¥ as yi ® -¥ . So


objective becomes -¥ . So it is unbounded below.
(Refer Slide Time: 22:11) Applied Optimization of Wireless, Machine Learning, Big data
Prof. Adtiya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 71
Examples: l1 minimization with l ¥ norm constraints, Network Flow problem

Keywords: l1 minimization, l ¥ norm constraints, Network Flow problem

Hello welcome to another module in this massive open online course. So we are doing
examples on convex optimization, so let us continue the discussion.

(Refer Slide Time: 00:27)


ì T -1
c A b
ï -T
T ï if A c £ 0
Therefore, the minimum value is m i n c x = í .
-¥ o t h e r w is e
s .t A x £ b ï
ïî

(Refer Slide Time: 23:32)

So we are doing a few examples which will illustrate how to formulate these problems as
convex optimization problems so that we can use the convex solver to solve these
optimization problems. So let us look at another example, example number 3 that is

m in A x - b
1
. Now, this is a convex optimization problem because if you look at this l1
s .t x £1
¥

So that is basically the solution to this optimization problem. So let us stop here. Thank norm, this is convex, so we have convex objective, convex inequality constraint and this
you very much. is the convex optimization problem so you can directly solve it. But we want to recast it
into a form that is more intuitive or more amenable to analysis.
(Refer Slide Time: 02:38) (Refer Slide Time: 04:19)

T
And now we want to introduce the constraint that is a i x - bi £ y i . So this implies that
So let us assume that A is an m ´ n matrix.
T
- y i £ a i x - bi £ y i . So using the epigraph form you can write this as
(Refer Slide Time: 02:48)
m in y 1 + y 2 + .. + y m
T .
s .t - y i £ a i x - b i £ y i
i = 1 , 2 ,.. , m

(Refer Slide Time: 06:09)

Then you can rewrite this as shown in slide. So this matrix A has m rows and you are
taking the l1 norm of this which is nothing but the sum of the magnitudes of these
T T T
And now you still have the other constraint as it is. Now we also know that this l ¥ norm
elements of this vector, which is given as a 1 x - b 1 + a 2 x - b 2 + .... + a m x - b m .
constraint can also be written as a set of linear constraints and this is as shown in slide.
(Refer Slide Time: 07:20) m i n y 1 + y 2 + .. + y m
So the equivalent optimization problem can be written as - y £ Ax - b £ y . Here the
s .t
-1 £ x £ 1

objective is linear and constraints are also linear, so this is a Linear Program, LP.

(Refer Slide Time: 11:12)

This can be written in a compact fashion as - 1 £ x £1.

(Refer Slide Time: 08:17)

Let us look at another problem which is termed as network flow. This is one of the most
important kind of optimization problems that arises in various fields such as for logistics
management, supply chain management etc. Consider a network of hubs or sort facilities.
So this is our network, so each of these nodes is a hub or a sort facility, for instance in
the distribution network of Etailer, in an E commerce company, to distribute these
products you need a network of hubs or sort facilities, where you have a lot of these
products that are brought into, sorted and dispatched to other hubs and ultimately
delivered to the end user.
So this reduces to this equivalent representation as shown in slide.
(Refer Slide Time: 12:41)
(Refer Slide Time: 08:41)
So you have this network of connected hubs and we have flows between these hubs. So (Refer Slide Time: 18:44)
xij denotes the number of items that represents the flow between load i and load j and
each flow has a cost Cij which is the cost per item of link or the path between i and j. So
these are basically your sort facilities. Now in addition you will have external supply bi.
So if you look at each hub or each load you might see supply flows that are coming from
other loads, flows going to other loads and in addition you might have an independent
supply bi. So this bi indicates the supply that is coming into load i, if bi is positive and on
the other hand if bi is negative it means that the commodities are leaving from that load.
So let us say n equals the number of loads in the distribution network. Let us denote the
upper bound and lower bound, since each sort facility has a certain limit in terms of the
total outflow in flow products. So we want to formulate this network flow problem which is minimize the total cost of
the network and it is very simple.
(Refer Slide Time: 16:16)
(Refer Slide Time: 19:42)

So we have the lower bound as lij and uij equals the upper bound which basically implies
that each flow xij has to be between lij and uij. So we will enforce another condition that
m in åå x ij C i j
is if we look at the total external flow for all nodes, the total external supply equals the i j
j ¹1

total external demand. So it cannot happen that a large number of commodities are n n

So we have s .t b i + å x ji - å x ij = 0 . So this basically says that the total external supplies


entering and only a few commodities are leaving, which means that these commodities j =1 j =1

are getting lost or it does not mean that only some commodities are entering and a large l ij £ x i j £ u ij

number of commodities are leaving which means commodities are somehow being at each hub and the total flow from all the other hubs to a particular node i must be equal
magically generated. So it just means that whatever commodities entering at the various to the total flow of commodities or goods from node i to the other nodes. So this must
hubs are eventually leaving the network at possibly the same or different hubs that hold for all particular loads or all particular hubs in your distribution unit.
depends on the flow. So the net external supply must be 0 which means that the total
external supply equals the total external demand.
(Refer Slide Time: 21:59) Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 72
Examples on Quadratic Optimization

Keywords: Quadratic Optimization

Hello, welcome to another module in this massive open online course. So we are looking
at example problems for Convex Optimization. Let us look at another problem that is
Quadratic Optimization.

(Refer Slide Time: 00:25)


So this is your optimization problem.

(Refer Slide Time: 22:38)

T
m in c x
So the quadratic optimization objective is as follows T
and we will consider
s .t x A x £ 1

two cases for this, that is when A is positive definite and when A is not positive definite.
This is a linear objective and we have affine constraints. So you can have a large
Let us start with case 1, A is positive definite.
distribution network with tens or thousands of hubs as long as the total external supplies
is equal to the total external demand and this is is a linear program. So this is a very
practical problem and there are several such problems which has significant practical
relevance. So we will stop here and continue in the subsequent modules. Thank you very
much.
(Refer Slide Time: 02:03) (Refer Slide Time: 04:48)

1
T T
When A is positive definite, you can write A = LL where L = A2 , this is obtained by m in c y
So we can write the optimization problem as . Now the maximum occurs for
the Cholesky decomposition. So for a positive definite matrix in addition, this L is s .t y £ 1
T T T 2
T T
invertible. So we have y = L x and x A x = x LL x = y y = y . c
y = which is given as shown in the following slides.
c
(Refer Slide Time: 04:04)
(Refer Slide Time: 06:10)

-1
Now as shown in slide we have x = (L ) T
y = L
-T
y . Therefore,
T

(L c)
T T T
-T -1
c x = c L y = y = c y .
(Refer Slide Time: 06:43) (Refer Slide Time: 08:52)

T
-1

(Refer Slide Time: 07:16) So the optimal value of objective equals c A c as shown in slide.

(Refer Slide Time: 09:30)

-1 -1
L c A c
So you have the optimal y = and the optimal x = as shown in slides.
T T
-1 -1
c A c c A c
But remember that this entire case is when A is PD. Now when A is not positive definite,
T
(Refer Slide Time: 08:18) then it cannot be decomposed as LL .

(Refer Slide Time: 10:17)


So in that scenario let us say you have an eigenvalue decomposition of A which is (Refer Slide Time: 13:48)
T
A = QLQ . So this can be written as a matrix of eigenvectors as shown in slide.

(Refer Slide Time: 11:36)

(Refer Slide Time: 14:12)

(Refer Slide Time: 12:27)

n
T T T

å
T 2
And therefore, now if we look at x Ax = x QLQ x = y L y = li y i
. So this is the
i =1

objective function.
n
T
Now you can multiply it out and you can write this as A = å li qi qi . Now, since A is
i =1

T -T
not PD, you have some eigenvalue lj < 0 . In this case let us set y = Q xÞ x = Q y .
(Refer Slide Time: 15:23) (Refer Slide Time: 17:48)

2
Now, since l j
is negative implies ljy j
equals negative implies the constraint is
n
T T T
-T
Now we have c x = c Q y = b y = å bi y i . So I can recast this optimization problem.
satisfied.
i =1

(Refer Slide Time: 16:15) (Refer Slide Time: 19:30)

n
So basically by setting y j = -¥ we can make the optimization objective as small as
m in å bi y i
possible. Now, consider another scenario, when lj < 0 and if bj < 0 , then set yj to a
i =1
So we have . Now we are assuming that one particular lj < 0 , since A is
n

å
2
s .t li y i
£1 large positive value. This again implies that the constraint is always satisfied as shown in
i =1
slide. Let us look at another example, example number 6.
not positive definite. Now if bj > 0, then set yi to be a very large negative value.
(Refer Slide Time: 21:40) (Refer Slide Time: 25:21)

T
So show that x x £ yz, y, z ³ 0 . This implies that This can be stacked in the form of a matrix which implies that Ax - b > 0 .
2 2 2 2 2
x £ yz Þ 4 x £ 4 yz Þ 4 x £ (y + z) - (y - z) . So this implies (Refer Slide Time: 26:54)

2 é2 x ù
4 x + (y - z)
2
£ (y + z) Þ
2
ë û £ y + z y, z ³ 0 .
y - z

(Refer Slide Time: 23:54)

æ m 1 ö
Now this is equivalent to m in ç å ÷ because everything is non-negative. Now,
ç i =1 a T x - b ÷
è i i ø

let us write this in an epigraph form.

T
The condition that x x £ yz , can be equivalently written as shown in the above slide.
-1
æ m
1 ö
m ax ç
ç
å ÷
x - b i ÷ø
T
Now, let us say we want to maximize the harmonic mean that is è i =1 ai .
T
s .t a i x - b i > 0
(Refer Slide Time: 27:31) (Refer Slide Time: 30:07)

So the equivalent optimization problem can be written as


é2 ù
( )
1 T T
So we have T
£ ti Þ 1 £ a i x - bi t i Þ ê T ú £ a i x - bi + t i . m

a i x - bi ëê a i x - b i - t i ûú m in å ti
i =1

é2 ù T
(Refer Slide Time: 29:14) s .t ê T ú £ a i x - bi + t i
. So once you write this as a second order cone
ëê a i x - b i - t i úû
ti ³ 0
T
a i x - bi ³ 0
i = 1 , 2 ,.., m

program, you can use the convex solvers readily available to solve this. So let us stop
here and continue in the subsequent module. Thank you very much.

So we will have m constraints, one for each i and this is a second order conic constraint.
So the resulting optimization problem will be a second order cone program.
Applied Optimisation for Wireless, Machine Learning, Big Data (Refer Slide Time: 02:12)
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 73
Examples on Duality: Dual Norm, Dual of Linear Program (LP)

Keywords: Dual Norm, Linear Program

Hello, welcome to another module, in this massive open online course. Let us continue
looking at examples and in this module let us start looking at examples pertaining to
duality.

(Refer Slide Time: 00:25)


So let us look at some examples to understand this. Let us consider the dual norm of the
T
m ax z u
{z } . Now
T
l2 norm that is z *
= m ax u | u £1 is the pertinent optimization
2 2
s .t u £1
2

problem and this is convex in nature because, this is a linear objective, this is a convex
T
constraint and now this is easy to solve. In fact, we know that, z u £ z u . So this

follows from the Cauchy Schwarz Inequality.

(Refer Slide Time: 04:16)

Let us start with the first example that is for instance if you have a vector x this x is
l

the l norm, for instance, l can be 1, 2 and so on. Now, the dual norm of this is denoted by

{z } . So this is basically the dual norm.


T
z *
= m ax u | u £1
l l
Now, we know that this u £1 , which basically implies
T
z u £ z u £ z , which (Refer Slide Time: 07:01)

T
implies that z u £ z and the maximum occurs when u is aligned with z which implies

z
u = .
z

(Refer Slide Time: 05:35)

Now, we want to find the dual norm of the l¥ norm. This is simply

{ } . Now,
T
z
¥
*
= m ax z u | u
¥
£1 u
¥
£ 1 Þ m ax {u }£1.
i

(Refer Slide Time: 08:34)

T z
So the maximum is z = z .
2
z

(Refer Slide Time: 06:19)

n
T
Now, assume z and u to be n dimensional vectors. Now z u = å z iu i is simply the dot
i =1

product between these two. This is as shown in slide.

Therefore the dual norm of the l2 norm is l2 norm itself.


(Refer Slide Time: 09:29) (Refer Slide Time: 11:56)

(Refer Slide Time: 12:49)


n
T
Now we have z u £ å zi .
i =1

(Refer Slide Time: 10:25)

Let us look at another problem to derive the dual optimal problem corresponding to
T
m in c x

general LP. So consider the general linear program, that is s .t G x £ h . Now this is a
So the maximum occurs when u i = 1 for each i and sg n ( u i ) = sg n ( z i ) as shown in slide. Ax = b

The maximum value is nothing but the l1 norm and therefore the dual norm of the general LP implies it has inequality constraints and equality constraints.

infinity norm is the l1 norm.


(Refer Slide Time: 14:31) (Refer Slide Time: 18:00)

So we want to find the dual problem for this, so we have And now we have to find (
g l,m ) = m in L ( x , l , m ) . This implies that
x
the minimum is

( )=c (G x - h ) +n ( A x - b ) .
T T
L x, l , m x+l These are the Lagrange multipliers for the -¥ if T T
c+G l + A n ¹ 0 .
both the inequality constraints and the equality constraints.
(Refer Slide Time: 20:15)
(Refer Slide Time: 15:57)

Now this is also a lower bound for the original problem, but it is not very useful. So

Now this is simplified as shown in the slide. So this can be simply rewritten as instead we want a certain lower bound, which is more useful and that will be obtained if
T T
c+G l + A n = 0 .
( )= (c + G ) - (h ).
T T T
T T
L x, l , m x l + A n l +b n
(Refer Slide Time: 21:40) which is the optimum value d*, where d
*
£ p
*
. But in this case d* will be exactly equal
to p* because this is a linear program. So in general for a convex optimization problem
strong duality holds implies that d* = p*. So we will stop here and continue with other
examples in the subsequent modules. Thank you very much.

( ) reduces ( )
T T
Therefore, in this case g l,m to the constant, which is - h l + b n and

therefore this is a lower bound. This means that all the Lagrange multipliers associated
with the inequality constraint have to be greater than or equal to 0 and this is a lower
bound for the original optimization. So the best lower bound is the maximum value. So
this is the primal optimal and this d*, which is the dual optimal and this is what we call as
the best lower bound because, it is the one that is closest to the optimum value p* of the
primal optimization problem and if d* = p* implies that the duality gap is 0.

(Refer Slide Time: 24:35)

The resulting optimal value is d*, where d* ≤ p*. But in this case d* is exactly equal to p*, because this is a linear program; in general, for a convex optimization problem strong duality holds, which implies that d* = p*. So we will stop here and continue with other examples in the subsequent modules. Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 74
Examples on Duality: Min-Max problem, Analytic Centering

Keywords: Min-Max problem, Analytic Centering

(Refer Slide Time: 00:24)

Hello, welcome to another module in this massive open online course. So we are looking at example problems in duality. Let us look at example problem number 9, where we want to find the dual of the problem

min_x  max_{1 ≤ i ≤ m} (a_i^T x + b_i),

and this is known as a piecewise linear model.

(Refer Slide Time: 01:27)

For instance, each of these terms represents a line; therefore, if you look at these m different lines and take the maximum, it will look something like as shown in slide. So this is piecewise linear, and now we want to find the dual problem.

(Refer Slide Time: 02:34)

So using the epigraph form, this can be equivalently written as

min  t
s.t. max_{1 ≤ i ≤ m} (a_i^T x + b_i) ≤ t.

This implies that each of these terms is less than or equal to t, which is as shown in slide.

(Refer Slide Time: 03:59)


(Refer Slide Time: 04:29) (Refer Slide Time: 06:55)

So I can write this basically as an equivalent optimization problem

min  t
s.t. a_i^T x + b_i ≤ t,  i = 1, 2, ..., m.

The dual of this problem is obtained as follows: first you form the Lagrangian, with one Lagrange multiplier for each constraint,

L(x, t, λ) = t + Σ_{i=1}^m λ_i (a_i^T x + b_i − t).

(Refer Slide Time: 05:39)

Now, this is affine in t, x, which means it is a hyperplane. So the dual function is obtained by taking the minimum of L(x, t, λ) with respect to t, x, and this is as shown.

Now, we want to group all the terms corresponding to each variable, and this is as shown in slide. And therefore, if you simplify this as shown, we have

L(x, t, λ) = (1 − 1^T λ) t + x^T (Σ_{i=1}^m a_i λ_i) + λ^T b.

(Refer Slide Time: 07:56)
(Refer Slide Time: 09:08) (Refer Slide Time: 12:47)

Now we proceed as shown in slides.

(Refer Slide Time: 11:03)

And therefore the dual problem can be formulated as

max  b^T λ
s.t. 1^T λ = 1
     Aλ = 0
     λ ≥ 0.

(Refer Slide Time: 13:41)

(Refer Slide Time: 12:16)

This is the dual problem for the given original min max problem.
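As a sketch, the piecewise linear min-max problem and the dual just derived can be compared numerically. The data below are random, hypothetical values; A is taken as the matrix whose columns are the a_i (matching the term x^T Σ a_i λ_i above), the pairs ±a_i are included so that the maximum stays bounded below, and the CVXPY package is assumed.

import numpy as np
import cvxpy as cp

np.random.seed(1)
n, k = 3, 4
A0 = np.random.randn(n, k)
A = np.hstack([A0, -A0])          # columns are the a_i; including +/- pairs keeps the problem bounded
m = A.shape[1]
b = np.random.randn(m)

# Primal: minimize max_i (a_i^T x + b_i)
x = cp.Variable(n)
p_star = cp.Problem(cp.Minimize(cp.max(A.T @ x + b))).solve()

# Dual derived above: max b^T lambda  s.t.  1^T lambda = 1,  A lambda = 0,  lambda >= 0
lam = cp.Variable(m)
d_star = cp.Problem(cp.Maximize(b @ lam),
                    [cp.sum(lam) == 1, A @ lam == 0, lam >= 0]).solve()

print(p_star, d_star)             # the two optimal values coincide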
(Refer Slide Time: 14:34) (Refer Slide Time: 17:08)

Let us look at another interesting application: we want to derive the dual of

min  −Σ_{i=1}^m log(b_i − a_i^T x).

This problem is termed as the analytic centering problem. The domain of this is b_i > a_i^T x, so it is an intersection of half spaces, which is a polyhedron.

And then we proceed as shown in slides below.

(Refer Slide Time: 18:26)

(Refer Slide Time: 16:12)

So to develop the dual, we will use a simple substitution: y_i = b_i − a_i^T x. The optimization problem can then be equivalently written as

min  −Σ_{i=1}^m log y_i
s.t. y_i = b_i − a_i^T x,  i = 1, 2, ..., m.

So we have the Lagrangian

L(x, y, ν) = −Σ_{i=1}^m log y_i + Σ_{i=1}^m ν_i (y_i − b_i + a_i^T x).
(Refer Slide Time: 19:44) (Refer Slide Time: 23:04)

Now, once again collecting all the terms, we proceed as shown in slides. And now we have to take the infimum with respect to the primal variables x, y.

(Refer Slide Time: 24:05)

(Refer Slide Time: 21:07)

Now, we proceed as shown in the slides.

(Refer Slide Time: 25:36)


(Refer Slide Time: 22:18)
(Refer Slide Time: 26:02)

So these are some examples of various convex optimization problems and how to formulate their dual problems, which often yields very useful insights and is often very useful in practice. So let us stop here and continue in the subsequent modules. Thank you very much.

(Refer Slide Time: 27:05)

And therefore the dual problem of this analytic centering problem is

max  m + Σ_{i=1}^m log ν_i − b^T ν
s.t. Aν = 0.
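A small numerical sketch of this primal-dual pair is given below. The data are hypothetical: the rows of A_rows are the vectors a_i^T (so the constraint Aν = 0 above becomes Σ_i ν_i a_i = 0, written here as A_rows.T @ nu == 0), box-type rows are included so that the polyhedron is bounded and the analytic center exists, and the CVXPY package with a solver supporting the exponential cone is assumed.

import numpy as np
import cvxpy as cp

np.random.seed(2)
n = 3
A_rows = np.vstack([np.eye(n), -np.eye(n), np.random.randn(2, n)])  # rows are a_i^T
m = A_rows.shape[0]
b = 1.0 + np.abs(np.random.randn(m))        # x = 0 is strictly feasible since every b_i > 0

# Primal analytic centering: min -sum_i log(b_i - a_i^T x)
x = cp.Variable(n)
p_star = cp.Problem(cp.Minimize(-cp.sum(cp.log(b - A_rows @ x)))).solve()

# Dual derived above: max m + sum_i log(nu_i) - b^T nu  s.t.  sum_i nu_i a_i = 0
nu = cp.Variable(m)
d_star = cp.Problem(cp.Maximize(m + cp.sum(cp.log(nu)) - b @ nu),
                    [A_rows.T @ nu == 0]).solve()

print(p_star, d_star)                        # strong duality: the two values coincide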

(Refer Slide Time: 28:04)


Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 75
Semi Definite Program (SDP) and its application: MIMO symbol vector decoding

Keywords: Semi Definite Program

Hello, welcome to another module in this massive open online course. So we are looking at convex optimization problems and their applications. Let us start looking at what is known as Semi Definite Programming.

(Refer Slide Time: 00:43)

So it is a very important and powerful class of problems, and a semi definite program is the following, where you are minimizing a seemingly simple objective; that is, the objective is still a linear objective.

(Refer Slide Time: 01:51)

The equality constraint is something that is very similar. But the inequality constraint is a weighted combination of matrices, or rather an affine combination.

(Refer Slide Time: 03:25)

You are saying this matrix has to be greater than or equal to 0, which implies that this matrix has to be positive semi definite.

(Refer Slide Time: 06:50)

For two symmetric matrices we say A ⪰ B, the generalized inequality on the set of positive semi definite matrices, if and only if x^T A x ≥ x^T B x for all x, which is equivalent to A − B being PSD.
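Since the slide itself is not reproduced here, one common way of writing such a semi definite program (following the standard convention; the symbols F_0, F_1, ..., F_n below are illustrative, not taken from the slide) is

min  c^T x
s.t. F_0 + x_1 F_1 + ... + x_n F_n ⪰ 0
     Ax = b,

where the F_i are symmetric matrices; the constraint is exactly the affine combination of matrices mentioned above, required to be positive semi definite, and this is the linear matrix inequality.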
(Refer Slide Time: 08:45) (Refer Slide Time: 11:46)

Now, similarly, if you have a strict inequality A ≻ B, naturally this implies A − B is positive definite. So this is the notion of the generalized inequality on the set of positive semi definite matrices.

(Refer Slide Time: 09:43)

So let us look at an application of SDP, and the application is as follows. Let us consider a MIMO system, and we want to perform MIMO symbol decoding. So in this MIMO system, you have multiple TX antennas and multiple RX antennas. And I can represent this MIMO system model as y = Hx + n, where you have t transmit symbols and r output symbols.

(Refer Slide Time: 12:49)

Now, if you look at each symbol x_i, this has to be drawn from a constellation; that is, it cannot take any arbitrary value. So each x_i is drawn from a suitable digital constellation for a digital wireless system, for example BPSK, binary phase shift keying, which implies there are two phases. So each x_i can be plus or minus 1, that is, each x_i has two possible values.
(Refer Slide Time: 14:43) (Refer Slide Time: 19:10)

So basically the size of this set of vector constellations is 2^t, which is growing exponentially in the number of transmit antennas t.

(Refer Slide Time: 20:07)

(Refer Slide Time: 16:42)

Now, this search increases with the constellation size. For instance, if you have 16-QAM, then the number of vectors becomes 16^t.

So now, let us say we have the set S, which is the set of all possible digital transmit vectors x; for BPSK this set has 2^t elements. Therefore, the problem is that at the receiver, once you receive y, we have to decode x. So you use the best possible decoder, which is known as the ML decoder:

min_{x ∈ S}  ‖y − Hx‖².

The problem is that you have to search over all 2^t possible vectors to find the x which minimizes this error, and this is known as the maximum likelihood decoder.
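To make the exponential search concrete, here is a small brute-force ML decoder sketch for BPSK; the antenna counts, the random channel and the noise level are all hypothetical, and NumPy is assumed.

import numpy as np
from itertools import product

np.random.seed(3)
t, r = 4, 6                                  # hypothetical numbers of transmit / receive antennas
H = np.random.randn(r, t)                    # channel matrix
x_true = np.random.choice([-1.0, 1.0], t)    # transmitted BPSK vector
y = H @ x_true + 0.1 * np.random.randn(r)    # received vector with noise

# Exhaustive ML search over all 2^t BPSK vectors: min_{x in S} ||y - Hx||^2
best_x, best_cost = None, np.inf
for bits in product([-1.0, 1.0], repeat=t):  # 2^t candidate vectors
    x = np.array(bits)
    cost = np.sum((y - H @ x) ** 2)
    if cost < best_cost:
        best_x, best_cost = x, cost

print(best_x, x_true)                        # with low noise these typically coincide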
(Refer Slide Time: 21:44) (Refer Slide Time: 24:22)

So this implies that this search is impossible or next to impossible, and therefore we have to come up with low complexity techniques.

(Refer Slide Time: 22:58)

Hence, one such technique is what is termed as SDP relaxation: we relax this ML decoder problem as a Semi-Definite Program. We are going to formulate this MIMO ML decoder as a Semi-Definite Program, which has significantly lower complexity, so this is very useful for practical implementation. It has significantly lower complexity in comparison to the optimal ML decoder, which has exponential complexity and is virtually impossible for a large number of transmit antennas and large constellation sizes. So the resulting approximate ML decoder has a significantly lower complexity. We will stop here and look at this in the subsequent module. Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture – 76
Application: SDP for MIMO Maximum Likelihood (ML) Detection

Keywords: Semi Definite Program, MIMO Maximum Likelihood Detection

Hello, welcome to another module in this Massive Open Online Course. So we are looking at SDP, that is Semi Definite Programming, and its application in the context of MIMO detection, that is, how to reduce the complexity of the MIMO detector. SDP employs a positive semi definite constraint, that is, a linear combination of matrices has to be positive semi definite, and this is termed as a linear matrix inequality, LMI.

(Refer Slide Time: 01:01)

(Refer Slide Time: 00:30)

So SDP enforces a linear matrix inequality that is what is novel about SDP.

(Refer Slide Time: 01:38)

So we are looking at SDP for MIMO ML detection.

(Refer Slide Time: 00:52)

We have the MIMO detection problem as min_{x ∈ S} ‖y − Hx‖².
(Refer Slide Time: 02:46) (Refer Slide Time: 04:39)

And so basically this is exponential in the number of transmit antennas, which is of very high complexity. This can be written as shown in slide: I am making a column vector by stacking x along with the number 1, so we have a t + 1 dimensional vector s.

(Refer Slide Time: 03:12) (Refer Slide Time: 05:49)

Now the cost function is simplified as shown in slide. So we have the simplified cost function

y^T y − x^T H^T y − y^T H x + x^T H^T H x.

Now the matrix L is given as

L = [ H^T H    −H^T y ]
    [ −y^T H    y^T y ].
(Refer Slide Time: 07:08) (Refer Slide Time: 10:47)

So we have written the cost function as s^T L s, where L can be thought of as a weighting matrix. So let us say this is a BPSK constellation.

(Refer Slide Time: 08:59)

Now the above problem can be equivalently written as

min_{s_i ∈ {±1}, 1 ≤ i ≤ t}  s^T L s.

Now this s^T L s is a scalar quantity, so I can write it as s^T L s = Tr(L S), where S = s s^T, since the trace is the sum of the diagonal elements of a square matrix and a single number is a special case of a square matrix, so the trace yields the number itself; using the cyclic property of the trace, Tr(s^T L s) = Tr(L s s^T) = Tr(L S).

(Refer Slide Time: 12:30)

Now this proceeds as shown in slide.

Here S = s s^T. So the equivalent problem becomes

min  Tr(L S)
s.t. diag(S) = 1
     S ⪰ 0
     S = s s^T.
(Refer Slide Time: 14:00) (Refer Slide Time: 17:02)

S is a positive semi definite matrix, and of all the constraints the last one is the most difficult: it is a non-convex constraint, known as a rank-1 constraint, because S = s s^T. Since this is very difficult to impose, we simplify the problem and in this case simply ignore it. This is known as an SDP relaxation: the rank-1 constraint makes the problem non-convex, so it makes it non-SDP, and we relax it as an SDP, that is, we ignore the rank-1 constraint.

(Refer Slide Time: 15:31)

The relaxed problem is significantly lower in complexity and therefore very amenable to implementation in practice. The only thing is that it yields an approximate solution close to the ML solution. Now, once you find S, how do you find s? The point is that, because we have ignored the rank-1 constraint, S is not guaranteed to be of the form S = s s^T.

(Refer Slide Time: 19:09)

So in this context we need to find s, and the key here is to find the best rank-1 approximation to S; for that, we use the Eigenvalue decomposition of S.

So this is termed as SDP relaxation. So our ML decoder can be approximately rewritten as

min  Tr(L S)
s.t. diag(S) = 1
     S ⪰ 0,

and this yields an approximate solution.
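A minimal sketch of this relaxation using the CVXPY package is shown below; the channel, noise and dimensions are hypothetical, and the matrix L is formed from H and y as described earlier.

import numpy as np
import cvxpy as cp

np.random.seed(3)
t, r = 4, 6
H = np.random.randn(r, t)
x_true = np.random.choice([-1.0, 1.0], t)
y = H @ x_true + 0.1 * np.random.randn(r)

# L = [[H^T H, -H^T y], [-y^T H, y^T y]]
Hty = H.T @ y
L = np.block([[H.T @ H,       -Hty[:, None]],
              [-Hty[None, :],  np.array([[y @ y]])]])

# SDP relaxation: min Tr(LS)  s.t.  diag(S) = 1,  S >= 0 (the rank-1 constraint is dropped)
S = cp.Variable((t + 1, t + 1), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(L @ S)),
                  [cp.diag(S) == 1, S >> 0])
prob.solve()
print(S.value)                                # relaxed solution, not necessarily rank 1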
(Refer Slide Time: 20:23) (Refer Slide Time: 22:42)

The Eigenvalue decomposition is as follows: we can write S = QΛQ^T, where Q is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues, and this is then proceeded as shown in slide.

(Refer Slide Time: 23:11)

Then, the best rank-1 approximation is obtained by simply keeping the term corresponding to the largest eigenvalue.

(Refer Slide Time: 21:17)

So this is as shown in slide, and the final step is s = [x; 1] (the vector x stacked above the entry 1), so by choosing the first t symbols of s you get the transmitted symbols.


Now, since S is positive semi definite, note that λ_i ≥ 0, so I can always arrange the eigenvalues in decreasing order.
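Continuing the sketch above, the recovery of the symbol vector from the relaxed solution via the eigenvalue decomposition can be written as follows; S_hat stands for the matrix returned by the SDP solver, and the helper function and its name are illustrative.

import numpy as np

def recover_symbols(S_hat, t):
    # Eigen decomposition S_hat = Q Lambda Q^T; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(S_hat)
    q1 = eigvecs[:, -1]                              # principal eigenvector (largest eigenvalue)
    s_approx = np.sqrt(max(eigvals[-1], 0.0)) * q1   # best rank-1 factor of S_hat
    s_approx = s_approx * np.sign(s_approx[-1])      # fix the overall sign using the last entry (which should be +1)
    return np.sign(s_approx[:t])                     # first t entries, sliced back to the BPSK alphabet

# Example usage with the CVXPY sketch above:
# x_hat = recover_symbols(S.value, t)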
(Refer Slide Time: 24:04)

So you take the original ML decoder, recast it in a different form, and then you relax the rank-1 constraint; that makes it a semi definite program, and this process is known as SDP relaxation. From the SDP relaxation you get S, which is a positive semi definite matrix; from that you perform the Eigenvalue decomposition, thereby getting the best rank-1 approximation, which will be nothing but the principal Eigenvector of S, and from that you take the top t symbols. So let us stop here and continue in the subsequent modules. Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 77
Introduction to Big Data: Online Recommender System (Netflix)

Keywords: Big Data, Online Recommender System

Hello, welcome to another module in this massive open online course. So in this module let us start looking at yet another innovative application of optimization, and this is in the field of Big Data. In fact, Big Data is something that has gathered a significant amount of attention of late because of the tremendous rise in the amount of data that is being generated each day on various websites and online services. Big Data has several applications, and we are going to look at one very specific and very relevant technique for Big Data known as a Recommender System.

(Refer Slide Time: 01:39)

A Recommender System is, as the name implies, a system that recommends other items based on your purchase or viewing history. For instance, if you go to any E-commerce website, you see several product recommendations based on your viewing history of the items that you have browsed, or even on your past purchase history. Or when you go to a video streaming site like YouTube and are watching the current video, the website automatically comes up with a recommendation of other videos that are very similar or that you will be highly interested in.
And similarly, music websites like Pandora come up with a set of music videos or music albums or songs that would be of very high
interest to you and some of these you might not be aware of. So by coming up with this
(Refer Slide Time: 06:02)
highly specific set of recommendations, it is a win-win situation: you cannot browse the practically infinite number of products on an E-commerce website yourself, and it is also beneficial for the website, because enticing customers with the set of items they are likely to be interested in increases its business. So it is a win-win situation for everyone; it saves your time, increases the business of the website, and so on. For this part, this module will be referring in particular to the very interesting book by Professor Mung Chiang of Princeton University, titled “Networked Life: 20 Questions and Answers”, and the chapter that we are talking about is Chapter 4, “How does Netflix recommend movies?”.

(Refer Slide Time: 06:22)

So all such systems which basically recommend various options for you to buy or browse are known as recommender systems. The more closely your recommendations

match the interests of the consumer, the better your recommender system is and the better the efficiency of your website is going to be. So the idea is to design the best
recommender system which comes up with a very specific and highly interesting set of
recommendations.

(Refer Slide Time: 05:04)

One particularly interesting application that we are going to talk about is that of Netflix, which started in 1997 as an online DVD rental site. The model of Netflix is that DVDs you order online are sent to you by regular mail for a fee; once you watch the DVDs, you send them back and get a new set of DVDs, according to the model specific to that particular website.

For instance, here is a simple snapshot from one of the E-commerce websites: you have a
book that you are interested in buying. This is the book you are interested in and the
website comes up with an alternative set of recommendations. So you look at these
alternatives or set of recommendations. So the recommender systems look at patterns of different users and their purchase histories, and come up with recommendations based on what other people who had similar interests have purchased or viewed, and so on.
(Refer Slide Time: 07:18) (Refer Slide Time: 09:52)

Now, the key innovation that we are talking about is basically that the website generates personalized movie recommendations for each user, based on your past viewing history. It collectively mines the viewing patterns of an extremely large number of users, that is, all the movies you have seen and the movies that a large number of other users have seen, to extract information; based on this mining of the collective data of users and movies, it comes up with a specific set of recommendations for each user. So it comes up with a set of recommendations, or a set of predicted ratings for you for the movies which you have never seen. That is in essence what the problem is. It recommends movies to you based on what Netflix thinks are movies that you are going to rate highly, which means it has to generate a predicted rating for you for the movies that you have not seen, and then choose a set of movies based on what Netflix thinks you would rate very highly.

(Refer Slide Time: 11:00)

So in 2009, Netflix had about 100,000 titles and in 2013 about 33 million subscribers. In 2005, it was already shipping close to 1 million DVDs a day. This is a large amount of data, which implies this is a Big Data problem. So from this large amount of data, how do you mine the patterns? That is what is to be found out.

(Refer Slide Time: 09:15)

Now, in 2006, Netflix rolled out an interesting challenge, termed the Netflix challenge, to the research community. In that challenge it made available 100 million ratings; 100 million is 100 × 10^6 = 10^8 ratings, which could fit into the memory of a standard desktop. At that point it had about 480,000 users, that is, roughly half a million users, and approximately 20,000 movies. So the approximate number of possible ratings is 10^10, while the actual number of ratings is only 10^8. The reason for that is very simple: you have half a million users and twenty thousand movies, but each user has not seen every movie; each user has probably seen only a fraction of the movies.

(Refer Slide Time: 13:37)

Now you have a set of movies A, B, C, D, E, F, G, H and you have users 1, 2, 3, 4, 5, 6. For instance, user 1 has not seen movie A, but he has seen and rated movie B; user 1 has also seen and rated movie D. User 2 has seen and rated movies A, C and D. So these are the ratings that are available; the empty blocks are the missing ratings. For instance, user 1 has not seen some of the movies, which implies those ratings have to be predicted. You treat this as a matrix: each row corresponds to a user, each column corresponds to a movie. Some users have rated some movies, therefore some entries of this matrix are filled and the rest of the entries are vacant, and we have to complete this matrix. This problem is known as a Matrix Completion Problem. With half a million users and 20,000 movies, the size of the matrix is humongous.
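To make this setup concrete, here is a tiny sketch of such a partially observed ratings matrix; the sizes match the small example above, while the ratings, the observation pattern and the 40 percent fill rate are made-up, hypothetical values.

import numpy as np

np.random.seed(4)
n_users, n_movies = 6, 8                          # users 1..6, movies A..H as in the example
R = np.random.randint(1, 6, size=(n_users, n_movies)).astype(float)
mask = np.random.rand(n_users, n_movies) < 0.4    # only some ratings are observed
R[~mask] = np.nan                                 # vacant entries that have to be completed

print(int(np.isnan(R).sum()), "missing entries out of", R.size)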

(Refer Slide Time: 19:20)

So only about 1 percent of the ratings are available, which means we have to predict the rest of the ratings. Based on these predictions, you come up with movie recommendations for each user. On an E-commerce website, if every person had bought every item, there would be nothing left to predict or recommend. The challenge arises because each person has bought only a few items, and different people have bought different items, not the same ones. So from this checkerboard kind of matrix, we have to come up with the ratings and recommendations for each user. Each movie was rated by approximately 5,000 users, so each user has seen about a percent of the movies. Now, we have to develop a program to predict the ratings and provide recommendations for you. So this was the Netflix challenge.

(Refer Slide Time: 16:05)

The way the contest was organized was that the training set was made available to the public, and there is a probe set, and one has to eventually test the performance of the algorithm that is being proposed. The quiz set and the test set are hidden, and finally the algorithm is tested on the quiz and test sets. Whichever algorithm performs the best, that is, gives recommendations on the test set which are closest to the ratings given by the users, is selected.

So we will stop here and start looking at exploring this problem in the subsequent
modules. Thank you very much.
Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 78
Matrix Completion Problem in Big Data: Netflix-I

Keywords: Matrix Completion Problem

Hello, welcome to another module in this massive open online course. So we are looking
at an application of Convex Optimization to Big Data, in particular the Netflix problem.
Let us consider a very simple version of that problem.

(Refer Slide Time: 00:48)


There are 3 movies A B C and 3 users. In total 8 ratings are available, one is missing,
that is the rating of user 2 for movie B and this which we have to predict. Now let us
consider a simple predictor and the best predictor is nothing but the mean that is the
average rating of each movie per user.

(Refer Slide Time: 05:12)

The Netflix problem as I said involves about half a million users and about 20000
movies. We are going to consider an extremely simple version of that which can be
generalized. Let us say that you have 3 movies A B C and 3 users 1 2 3, similar to the
table that we had seen yesterday. Let us consider again a simple example, where user 1
has rated movies A B C, the ratings are 3, 5, 3. User 3 has also rated movies A B C,
ratings are 2, 5, 4. But user 2 has only rated movies A and C and his ratings are 4 and 3.
Now, what you can clearly see is that one rating is missing; if you look at this as a matrix, this entry is missing. Therefore, we have to predict this rating to complete this matrix of users and ratings, which is why this is also known as a matrix completion problem. So we have the average rating

r̄ = (1/|S|) Σ_{(u,m) ∈ S} r_{u,m},

where S is the set of all (user, movie) pairs for which ratings are available and r_{u,m} is the rating of user u for movie m.
(Refer Slide Time: 06:28) (Refer Slide Time: 09:17)

Now, this is also termed as a Lazy Predictor, as this is only the average rating.

(Refer Slide Time: 08:01)

The reason this is too crude is that each user and each movie is unique. Each user has a certain bias and each movie has a certain bias: by and large, a large number of people think some movies are better than other movies, so some movies are consistently rated higher, and in the same way some users are more lenient in their ratings while some users might be very harsh. So to each user, each movie is either good or bad.

(Refer Slide Time: 12:09)

And for this example, the average is ra = 3.625 as shown in slide. Its performance is
going to be very poor because this is a gross oversimplification.

So we can model the rating of each user for each movie as r_{u,m} = r̄ + b_u + b_m, which is the average plus the bias of each user and the bias of each movie, respectively. So this is a slightly more refined model and probably a more natural model for capturing the behaviour.
(Refer Slide Time: 13:46)

Accordingly, we can have various equations for the above problem, as shown in slide.

(Refer Slide Time: 14:46)

So there are 8 such equations, as shown in slide.

(Refer Slide Time: 15:54)

In fact, this is a system of linear equations which can be written in the form of a matrix, as shown in slide.

(Refer Slide Time: 18:29)

So here we have more equations than unknowns.

(Refer Slide Time: 20:38)


We know that this is an overdetermined system, and hence we solve this overdetermined system using least squares: we cannot solve r = Ab exactly, so we solve it approximately by finding the best vector b that minimizes the squared norm of the approximation error, ‖r − Ab‖².
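A small numerical sketch of this least squares step for the 3-user, 3-movie example above is given below; the way the design matrix A is indexed here is one straightforward choice and is not necessarily the exact matrix shown on the slide.

import numpy as np

r_bar = 3.625                        # average of the 8 available ratings, as computed earlier
ratings = {                          # (user, movie): rating; the rating of user 2 for movie B is missing
    (1, 'A'): 3, (1, 'B'): 5, (1, 'C'): 3,
    (2, 'A'): 4,              (2, 'C'): 3,
    (3, 'A'): 2, (3, 'B'): 5, (3, 'C'): 4,
}

users, movies = [1, 2, 3], ['A', 'B', 'C']
col = {('u', u): i for i, u in enumerate(users)}
col.update({('m', m): len(users) + j for j, m in enumerate(movies)})

# One equation per available rating: r_{u,m} - r_bar = b_u + b_m
A = np.zeros((len(ratings), len(col)))
r = np.zeros(len(ratings))
for k, ((u, m), val) in enumerate(ratings.items()):
    A[k, col[('u', u)]] = 1.0
    A[k, col[('m', m)]] = 1.0
    r[k] = val - r_bar

b, *_ = np.linalg.lstsq(A, r, rcond=None)   # least squares solution (via the pseudoinverse)
b1, b2, b3, bA, bB, bC = b
print(round(r_bar + b2 + bB, 2))            # predicted rating of user 2 for movie B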

(Refer Slide Time: 21:49)

So this is your least squares problem, which implies b̂ = (A^T A)^{-1} A^T r. From this you get b_1, b_2, b_3, b_A, b_B, b_C. So we have completed the first stage: we started with the average and refined it, but we are not done yet. We are going to refine this model further to predict the ratings as closely as possible, and that we are going to do in the subsequent module. Thank you very much.

Applied Optimization for Wireless, Machine Learning, Big Data
Prof. Aditya K. Jagannatham
Department of Electrical Engineering
Indian Institute of Technology, Kanpur

Lecture - 79
Matrix Completion Problem in Big Data: Netflix-II

Keywords: Matrix Completion Problem

Hello, welcome to another module in this massive open online course.

(Refer Slide Time: 00:29)

So we are looking at the Netflix problem that is how to predict the user ratings for
movies which he or she has not seen. So this is basically a specific case of matrix
completion problems. And we have seen that you can express the rating of each movie as
shown in slide. Now let us refine this a little bit because the movie trends change with
time. So a more refined model can be expressed as a fixed bias along with a time varying
bias as shown in slide.
(Refer Slide Time: 01:55) (Refer Slide Time: 05:05)

So user preferences are also varying with time.

(Refer Slide Time: 03:25)

Let us look at a simple example and refine our previous model itself, comprising 3 movies A, B, C with user 3's ratings, if you remember, 2, 5, 4. Now let us say he or she rated those movies in January, and let us say this corresponds to batch 1, so these are the batches. So some person who is using the same account as user 3 rated movies A and C in January in the first batch, while possibly another person rated movie B in the same month, but in another batch. So now you can develop the model again as shown in slide.

(Refer Slide Time: 06:51)

So we have the bias terms b_{u,n} + b_u(n) + s_u(n). For any rating system, the same account can be used by multiple users: one set of ratings can be given by a male member of the family, another set by a child, another set by a female member of the family, and so on; different users of the same account have different preferences. So this accounts for the short-term bias resulting from different users giving ratings in a batch. So this is done as shown in slides below.
(Refer Slide Time: 07:49) (Refer Slide Time: 11:18)

So now you have many more biases, so you can form a least squares problem again, similar to the previous one, and solve for the biases. Now, we can refine this model slightly further by subtracting the biases. After removing the biases, we can call what remains the unbiased rating or the innovation, something that we have not been able to predict.
(Refer Slide Time: 09:42)
(Refer Slide Time: 12:46)

Now, at this stage you can once again predict by substituting the biases; the innovation is

θ_{u,m} = r_{u,m} − r̄ − b_{u,n} − b_{m,n}.

And now you can form the innovations corresponding to the different ratings that are available, as shown in slide, and compute the missing innovation. As a simple estimate, we have

θ_{2B} = (θ_{2A} + θ_{2C}) / 2.
(Refer Slide Time: 14:27) (Refer Slide Time: 18:56)

(Refer Slide Time: 19:13)

This is a simple model; once you do this, you generate the rating as r_{2B} = θ_{2B} + r̄ + b_{2,n} + b_{B,n}, and this is your prediction model. However, this model ignores

the correlation between movies A, B and movies B, C. Because if A and B are very similar, then you have to give a higher weightage to A; on the other hand, if B and C are very similar, then you have to give a higher weightage to C. So one has to give adequate
weightage depending on the correlation.

(Refer Slide Time: 16:37)

Then the correlation between these two can be defined as

( Σ_i θ_{u_i, A} θ_{u_i, B} ) / ( ‖θ_A‖ ‖θ_B‖ ).

We call this the similarity coefficient or measure of similarity: what we are doing is taking the innovation vectors of the users who have rated both A and B, computing the inner product between them, and dividing by the norms of these two vectors.

Let us consider now the innovation vectors as shown in slide.


(Refer Slide Time: 20:5) (Refer Slide Time: 22:59)

In fact, this is exactly the cosine of the angle between the two vectors. And similarly, we now evaluate d_{B,C}, which is the correlation or measure of similarity between B and C.

(Refer Slide Time: 21:43) (Refer Slide Time: 23:50)

So the cosine of the angle θ between these two vectors is nothing but what we have defined as the similarity coefficient. And you can clearly see that if θ = 0 the vectors are aligned, that is, movies A and B are similar; as θ increases, movies A and B become dissimilar. If θ = π/2 they are perpendicular; in fact, A has no bearing on B, which implies they are orthogonal, that is, there is no correlation and the innovations are uncorrelated.

Now, the weighted innovation or weighted estimate is

θ_{2B} = (d_{BA} θ_{2A} + d_{BC} θ_{2C}) / (d_{BA} + d_{BC}),

and this is the final step.
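A short sketch of these last two steps is given below; the innovation vectors and the innovations of user 2 are made-up, hypothetical numbers, since in practice they come from the users who rated both movies in each pair.

import numpy as np

def similarity(a, b):
    # similarity coefficient: inner product divided by the product of the norms (cosine of the angle)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical innovation vectors over the users who rated both movies in each pair
theta_A,   theta_B_1 = np.array([0.5, -0.3, 0.4]), np.array([0.6, -0.2, 0.3])
theta_B_2, theta_C   = np.array([0.6, -0.2]),      np.array([0.1, -0.05])

d_BA = similarity(theta_B_1, theta_A)
d_BC = similarity(theta_B_2, theta_C)

theta_2A, theta_2C = 0.2, -0.1          # hypothetical innovations of user 2 for movies A and C
theta_2B = (d_BA * theta_2A + d_BC * theta_2C) / (d_BA + d_BC)   # weighted innovation estimate
print(theta_2B)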
(Refer Slide Time: 24:51)

And now, once you compute this innovation estimate, you can add it back to the biases as r_{2B} = θ_{2B} + r̄ + b_{2,n} + b_{B,n}, and that is basically the final step in this procedure. This is your final prediction of the rating of user 2 for movie B. So this brings across various ideas in both linear algebra as well as optimization. We will stop here and continue in the subsequent modules. Thank you very much.
