COMP 761, Winter 2009
Lecture 2: January 8
Lecturer: Patrick Hayden
In this lecture, we introduce some functions that are very useful for tackling questions in
information theory. To fix notation, upper-case letters denote random variables (e.g.
$X$, $Y$) and the corresponding lower-case letters denote possible values of these random
variables (e.g. $x$, $y$). Moreover, the random variables considered take values in finite
sets, usually denoted $\mathcal{X}$. The distribution of a random variable $X$ is denoted by
$P(X = x) = p(x)$.
Recall from last lecture that we defined the entropy of a random variable $X$ as
$$H(X) = -\sum_x p(x) \log p(x).$$
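As a minimal sketch, the entropy of a finite distribution can be computed directly from its definition. The helper name `entropy`, the dictionary representation, and the two example coins are assumptions made for illustration; logarithms are taken base 2, so the result is in bits.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a distribution given as {value: probability}."""
    # Outcomes with p(x) = 0 are skipped, following the convention 0 log 0 = 0.
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

# Hypothetical example distributions.
fair_coin = {"heads": 0.5, "tails": 0.5}
biased_coin = {"heads": 0.9, "tails": 0.1}

print(entropy(fair_coin))    # 1.0 bit: the most uncertain two-outcome distribution
print(entropy(biased_coin))  # strictly less than 1 bit
```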
2.1
The conditional entropy of $X$ given $Y$ is defined as
$$H(X|Y) = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(y)} = \sum_y p(y) H(X|Y = y).$$
Here $p(y)$ is the marginal distribution of $Y$, so $p(y) = \sum_x p(x, y)$. It is easy to see that the
conditional entropy can be written as $H(X|Y) = H(X, Y) - H(Y)$. For example, if $X = Y$,
then $H(X|Y) = 0$.
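The identity $H(X|Y) = H(X, Y) - H(Y)$ and the special case $X = Y$ can be checked numerically. The sketch below assumes a joint distribution stored as a dictionary keyed by pairs `(x, y)`; the helper names and the example distributions are hypothetical.

```python
import math
from collections import defaultdict

def entropy(p):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

def marginal_y(joint):
    """Marginal p(y) = sum_x p(x, y) of a joint distribution {(x, y): probability}."""
    p_y = defaultdict(float)
    for (x, y), pxy in joint.items():
        p_y[y] += pxy
    return dict(p_y)

def conditional_entropy(joint):
    """H(X|Y) = -sum_{x,y} p(x, y) log2(p(x, y) / p(y))."""
    p_y = marginal_y(joint)
    return sum(-pxy * math.log2(pxy / p_y[y])
               for (x, y), pxy in joint.items() if pxy > 0)

# Hypothetical joint distribution of (X, Y).
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
h_cond = conditional_entropy(joint)
h_diff = entropy(joint) - entropy(marginal_y(joint))
print(h_cond, h_diff)  # the two values agree

# When X = Y, all the mass sits on the diagonal and H(X|Y) vanishes.
diagonal = {(0, 0): 0.5, (1, 1): 0.5}
print(conditional_entropy(diagonal))
```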
Observe that because $H(X|Y = y) \geq 0$, the conditional entropy is always non-negative:
$H(X|Y) \geq 0$. In other words,
$$H(Y) \leq H(X, Y),$$
which seems intuitive, as the uncertainty of the whole has to be at least as big as the
uncertainty of a part. This remark is nevertheless important because it fails for quantum
information: there, $H(X|Y) < 0$ is possible.
Figure 2.1. Representation of the entropy functions family (taken from Wikipedia entry on conditional
entropy)
Now we introduce a complementary quantity that measures the uncertainty common to $X$ and $Y$. The mutual information is defined as
$$I(X : Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y).$$
One can see from the last expression that the mutual information is symmetric in $X$ and $Y$.
This symmetry means that the information we gain about $X$ when knowing $Y$ is the same as
the information we gain about $Y$ when knowing $X$. To illustrate this property, consider $X$
uniformly distributed on the set $\{1, \ldots, 2k\}$, and let $Y$ be the parity of $X$. Knowing $X$
determines $Y$, so it gives exactly 1 bit of information about $Y$. Conversely, knowing $Y$ gives
the parity of $X$, so it also gives 1 bit of information about $X$.
The Venn diagram in Figure 2.1 illustrates the relations between the different functions we
introduced.
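The parity example can be reproduced numerically. This is only a sketch: the value `k = 4`, the helper names, and the dictionary representation of the joint distribution are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def entropy(p):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

def marginals(joint):
    """Marginals p(x) and p(y) of a joint distribution {(x, y): probability}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), pxy in joint.items():
        p_x[x] += pxy
        p_y[y] += pxy
    return dict(p_x), dict(p_y)

def mutual_information(joint):
    """I(X : Y) = H(X) + H(Y) - H(X, Y), in bits."""
    p_x, p_y = marginals(joint)
    return entropy(p_x) + entropy(p_y) - entropy(joint)

k = 4                      # hypothetical choice; any k >= 1 behaves the same way
n = 2 * k
# X is uniform on {1, ..., 2k} and Y = X mod 2 is its parity.
joint = {(x, x % 2): 1 / n for x in range(1, n + 1)}
i_xy = mutual_information(joint)

# Swapping the roles of X and Y leaves the mutual information unchanged.
swapped = {(y, x): p for (x, y), p in joint.items()}
i_yx = mutual_information(swapped)
print(i_xy, i_yx)  # both are (numerically) 1 bit
```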
2.2
Relative entropy
Another function that will be useful is the relative entropy, which is a measure of closeness
between probability distributions over the same set:
$$D(p\|q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.$$
Note that the relative entropy is not symmetric in p and q. We give some examples of relative
entropies.
If $u(x)$ is the uniform distribution on $\mathcal{X}$, then
$$D(p\|u) = \sum_x p(x) \log \frac{p(x)}{1/|\mathcal{X}|} = \log |\mathcal{X}| - H(X).$$
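Taking the two distributions to be a joint distribution $p(x, y)$ and the product $p(x)p(y)$ of its marginals gives the mutual information:
$$\begin{aligned}
D(p(x, y)\|p(x)p(y)) &= \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \\
&= \sum_{x,y} p(x, y) \log p(x, y) - \sum_{x,y} p(x, y) \log p(x) - \sum_{x,y} p(x, y) \log p(y) \\
&= -H(X, Y) + H(X) + H(Y) = I(X : Y).
\end{aligned}$$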
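A small numerical sketch of the relative entropy: it checks the identity for the uniform distribution, $D(p\|u) = \log |\mathcal{X}| - H(X)$, and shows that swapping the two arguments changes the value. The distributions `p` and `q` and the helper names are hypothetical, and returning infinity when $q(x) = 0 < p(x)$ is a common convention that the notes do not spell out.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

def relative_entropy(p, q):
    """D(p||q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue               # 0 log 0 = 0 by convention
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf        # assumed convention when q(x) = 0 < p(x)
        total += px * math.log2(px / qx)
    return total

# Hypothetical distributions over the same three-element set.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.8, "b": 0.1, "c": 0.1}
u = {x: 1 / len(p) for x in p}     # uniform distribution on the same set

d_pu = relative_entropy(p, u)
print(d_pu, math.log2(len(p)) - entropy(p))               # the two values agree
print(relative_entropy(p, q), relative_entropy(q, p))     # not symmetric
```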
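The relative entropy is non-negative: $D(p\|q) \geq 0$, with equality if and only if $p = q$. To see this, use $\ln t \leq t - 1$, with equality if and only if $t = 1$. Summing over the $x$ with $p(x) > 0$,
$$\begin{aligned}
-D(p\|q) &= -\sum_x p(x) \log \frac{p(x)}{q(x)} = \frac{1}{\ln 2} \sum_x p(x) \ln \frac{q(x)}{p(x)} \\
&\leq \frac{1}{\ln 2} \sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right) = \frac{1}{\ln 2} \left( \sum_x q(x) - \sum_x p(x) \right) \leq 0.
\end{aligned}$$
Equality holds if and only if $q(x)/p(x) = 1$ for all such $x$, that is, $p = q$.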
Some consequences:
- As $D(p\|u) = \log |\mathcal{X}| - H(X)$, where $u$ is the uniform distribution on $\mathcal{X}$, any $X$
taking values in $\mathcal{X}$ satisfies $H(X) \leq \log |\mathcal{X}|$. Moreover, by the equality condition, the uniform
distribution is the unique maximum of the entropy function among the distributions
on $\mathcal{X}$.
- As we expressed the mutual information as a relative entropy, $I(X : Y) = H(X) +
H(Y) - H(X, Y) \geq 0$, which is sometimes called the subadditivity property. Furthermore, equality holds if and only if $p(x, y) = p(x)p(y)$, which means that $X$ and $Y$
are independent. This means that $I(X : Y)$ captures all kinds of relations between $X$
and $Y$, not only linear dependencies as the correlation coefficient does, for example.
- Rewriting the non-negativity of mutual information as $H(X|Y) \leq H(X)$, one can interpret it as: conditioning reduces entropy.
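The consequences listed above can be illustrated on a case where the correlation coefficient sees nothing. The example is an assumption chosen for illustration: $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, so the covariance vanishes although $Y$ is a function of $X$. Helper names are hypothetical.

```python
import math
from collections import defaultdict

def entropy(p):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return sum(-px * math.log2(px) for px in p.values() if px > 0)

def marginals(joint):
    """Marginals p(x) and p(y) of a joint distribution {(x, y): probability}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), pxy in joint.items():
        p_x[x] += pxy
        p_y[y] += pxy
    return dict(p_x), dict(p_y)

# Hypothetical example: X uniform on {-1, 0, 1} and Y = X^2.
joint = {(x, x * x): 1 / 3 for x in (-1, 0, 1)}
p_x, p_y = marginals(joint)

# The covariance is zero, so X and Y are uncorrelated.
mean_x = sum(x * p for x, p in p_x.items())
mean_y = sum(y * p for y, p in p_y.items())
covariance = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

# The mutual information is strictly positive, so X and Y are not independent.
i_xy = entropy(p_x) + entropy(p_y) - entropy(joint)

# Conditioning reduces entropy, and H(X) is at most log |X|.
h_x = entropy(p_x)
h_x_given_y = entropy(joint) - entropy(p_y)
print(covariance, i_xy, h_x_given_y, h_x, math.log2(len(p_x)))
```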
2.3
When given many random variables, it is often useful to decompose the entropy of the joint
distribution in the following way, using the chain rule:
$$H(X_1, \ldots, X_n | Y) = \sum_{j=1}^n H(X_j | Y, X_1, \ldots, X_{j-1}).$$
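As a sanity check, the chain rule can be verified numerically for $n = 2$, assuming it takes the standard form $H(X_1, X_2 | Y) = H(X_1 | Y) + H(X_2 | Y, X_1)$. Each conditional entropy is computed from its definition rather than as a difference of joint entropies, so the agreement is not built in. The joint distribution and the helper names are hypothetical.

```python
import math
from collections import defaultdict

def marginal(joint, idx):
    """Marginal distribution of the coordinates whose positions are listed in idx."""
    out = defaultdict(float)
    for outcome, prob in joint.items():
        out[tuple(outcome[i] for i in idx)] += prob
    return dict(out)

def cond_entropy(joint, a, b):
    """H(A|B) = -sum_{a,b} p(a, b) log2(p(a, b) / p(b)), in bits."""
    p_ab = marginal(joint, a + b)
    p_b = marginal(joint, b)
    total = 0.0
    for ab, prob in p_ab.items():
        if prob > 0:
            # ab holds the A coordinates first, then the B coordinates.
            total -= prob * math.log2(prob / p_b[ab[len(a):]])
    return total

# Hypothetical joint distribution over outcomes (x1, x2, y).
joint = {
    (0, 0, 0): 0.30, (0, 1, 0): 0.10, (1, 0, 0): 0.05, (1, 1, 0): 0.05,
    (0, 0, 1): 0.05, (0, 1, 1): 0.15, (1, 0, 1): 0.10, (1, 1, 1): 0.20,
}
X1, X2, Y = [0], [1], [2]   # coordinate positions inside each outcome tuple

lhs = cond_entropy(joint, X1 + X2, Y)
rhs = cond_entropy(joint, X1, Y) + cond_entropy(joint, X2, X1 + Y)
print(lhs, rhs)  # equal up to floating-point rounding
```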
One can also give a similar formula for the mutual information, but first we have to define
the conditional mutual information:
$$I(X : Y | Z) = \sum_z p(z) I(X : Y | Z = z) = \sum_{x,y,z} p(x, y, z) \log \frac{p(x, y|z)}{p(x|z)p(y|z)}.$$
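The chain rule for mutual information then reads
$$I(X_1, \ldots, X_n : Y) = \sum_{j=1}^n I(X_j : Y | X_1, \ldots, X_{j-1}).$$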
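The conditional mutual information can be computed from its definition, using $p(x, y|z) / (p(x|z)p(y|z)) = p(x, y, z)p(z) / (p(x, z)p(y, z))$. The sketch below also checks a chain rule for $n = 2$, under the assumption that it takes the standard form $I(X_1, X_2 : Y) = I(X_1 : Y) + I(X_2 : Y | X_1)$. The joint distribution and the helper names are hypothetical.

```python
import math
from collections import defaultdict

def marginal(joint, idx):
    """Marginal distribution of the coordinates whose positions are listed in idx."""
    out = defaultdict(float)
    for outcome, prob in joint.items():
        out[tuple(outcome[i] for i in idx)] += prob
    return dict(out)

def cond_mutual_information(joint, a, b, c):
    """I(A : B | C) in bits; pass c = [] for the unconditional I(A : B)."""
    p_abc = marginal(joint, a + b + c)
    p_ac = marginal(joint, a + c)
    p_bc = marginal(joint, b + c)
    p_c = marginal(joint, c)
    na, nb = len(a), len(b)
    total = 0.0
    for abc, prob in p_abc.items():
        if prob == 0:
            continue
        # Split the outcome into its A, B and C parts.
        va, vb, vc = abc[:na], abc[na:na + nb], abc[na + nb:]
        total += prob * math.log2(prob * p_c[vc] / (p_ac[va + vc] * p_bc[vb + vc]))
    return total

# Hypothetical joint distribution over outcomes (x1, x2, y).
joint = {
    (0, 0, 0): 0.30, (0, 1, 0): 0.10, (1, 0, 0): 0.05, (1, 1, 0): 0.05,
    (0, 0, 1): 0.05, (0, 1, 1): 0.15, (1, 0, 1): 0.10, (1, 1, 1): 0.20,
}
X1, X2, Y = [0], [1], [2]   # coordinate positions inside each outcome tuple

lhs = cond_mutual_information(joint, X1 + X2, Y, [])
rhs = (cond_mutual_information(joint, X1, Y, [])
       + cond_mutual_information(joint, X2, Y, X1))
print(lhs, rhs)  # equal up to floating-point rounding
```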
2.4
$$\frac{p(x, y, z)}{p(y)} = \frac{p(x)p(y|x)p(z|y)}{p(y)} = \frac{p(x, y)}{p(y)} p(z|y) = p(x|y)p(z|y).$$
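The chain of equalities above starts from a joint distribution of the form $p(x)p(y|x)p(z|y)$ and concludes that $p(x, y, z)/p(y)$, which is the conditional distribution $p(x, z|y)$, factorizes as $p(x|y)p(z|y)$. A numerical sanity check is sketched below; the specific probability tables and helper names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical building blocks of the factorized joint distribution.
p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p_y_given_x[x][y]
p_z_given_y = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}   # p_z_given_y[y][z]

# Joint distribution p(x, y, z) = p(x) p(y|x) p(z|y).
joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in product((0, 1), repeat=3)}

def marginal(joint, idx):
    """Marginal distribution of the coordinates whose positions are listed in idx."""
    out = defaultdict(float)
    for outcome, prob in joint.items():
        out[tuple(outcome[i] for i in idx)] += prob
    return dict(out)

p_y = marginal(joint, [1])
p_xy = marginal(joint, [0, 1])
p_yz = marginal(joint, [1, 2])

# Compare p(x, y, z) / p(y) with p(x|y) p(z|y) for every outcome.
max_gap = 0.0
for (x, y, z), p_xyz in joint.items():
    lhs = p_xyz / p_y[(y,)]
    rhs = (p_xy[(x, y)] / p_y[(y,)]) * (p_yz[(y, z)] / p_y[(y,)])
    max_gap = max(max_gap, abs(lhs - rhs))
print(max_gap)  # zero up to floating-point rounding
```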