by
Dmitry Panchenko
Contents
1 Probability spaces
2 Random variables
7 Stopping times, Wald's identity, Markov property. Another proof of the SLLN
9 Characteristic functions
14 Stopping times
Probability spaces.
$(\Omega, \mathcal{A}, \mathbb{P})$ is a probability space if $(\Omega, \mathcal{A})$ is a measurable space and $\mathbb{P}$ is a measure on $\mathcal{A}$ such that $\mathbb{P}(\Omega) = 1$. Let us recall some definitions from measure theory. A pair $(\Omega, \mathcal{A})$ is a measurable space if $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\Omega$. A collection $\mathcal{A}$ of subsets of $\Omega$ is called an algebra if:
(i) $\Omega \in \mathcal{A}$,
(ii) $C, B \in \mathcal{A} \implies C \cup B,\ C \cap B \in \mathcal{A}$,
(iii) $B \in \mathcal{A} \implies \Omega \setminus B \in \mathcal{A}$.
A $\sigma$-algebra is an algebra that is, in addition, closed under countable unions.
Elements of the $\sigma$-algebra $\mathcal{A}$ are often called events. $\mathbb{P}$ is a probability measure on the $\sigma$-algebra $\mathcal{A}$ if
(1) $\mathbb{P}(\Omega) = 1$,
(2) $\mathbb{P}(A) \geq 0$ for all events $A \in \mathcal{A}$,
(3) $\mathbb{P}$ is countably additive: given any disjoint $A_i \in \mathcal{A}$ for $i \geq 1$, i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$,
$$\mathbb{P}\Big( \bigcup_{i=1}^{\infty} A_i \Big) = \sum_{i=1}^{\infty} \mathbb{P}(A_i).$$
(3') Condition (3) can be replaced by: $\mathbb{P}$ is finitely additive and continuous, i.e. for any decreasing sequence of events $B_n \in \mathcal{A}$, $B_n \supseteq B_{n+1}$,
$$B = \bigcap_{n \geq 1} B_n \implies \mathbb{P}(B) = \lim_{n \to \infty} \mathbb{P}(B_n).$$
Proof. First, let us show that (3) implies (3'). If we denote $C_n = B_n \setminus B_{n+1}$ then $B_n$ is the disjoint union $\bigcup_{k \geq n} C_k \cup B$ and, by (3),
$$\mathbb{P}(B_n) = \mathbb{P}(B) + \sum_{k \geq n} \mathbb{P}(C_k).$$
Since the last sum is the tail of a convergent series, $\lim_{n \to \infty} \mathbb{P}(B_n) = \mathbb{P}(B)$. Next, let us show that (3') implies (3). If, given disjoint sets $(A_n)$, we define $B_n = \bigcup_{i \geq n+1} A_i$, then
$$\bigcup_{i \geq 1} A_i = A_1 \cup A_2 \cup \cdots \cup A_n \cup B_n,$$
so finite additivity gives $\mathbb{P}\big( \bigcup_{i \geq 1} A_i \big) = \sum_{i \leq n} \mathbb{P}(A_i) + \mathbb{P}(B_n)$. Since the sets $A_i$ are disjoint, each $\omega$ belongs to at most one of them, so $\bigcap_{n \geq 1} B_n = \emptyset$ and, by (3'), $\mathbb{P}(B_n) \to 0$. Letting $n \to \infty$ proves countable additivity. $\square$
Let us give several examples of probability spaces. The most basic example of a probability space is $([0,1], \mathcal{B}([0,1]), \lambda)$, where $\mathcal{B}([0,1])$ is the Borel $\sigma$-algebra on $[0,1]$ and $\lambda$ is the Lebesgue measure. Let us quickly recall how this measure is constructed. More generally, let us consider the construction of the Lebesgue-Stieltjes measure on $\mathbb{R}$ corresponding to a non-decreasing right-continuous function $F(x)$. One considers the algebra of finite unions of disjoint intervals
$$\mathcal{A} = \Big\{ \bigcup_{i \leq n} (a_i, b_i] \;:\; n \geq 1,\ \text{all } (a_i, b_i] \text{ disjoint} \Big\}$$
and defines the measure $\lambda_F$ on the sets in this algebra by (we slightly abuse the notation here)
$$\lambda_F\Big( \bigcup_{i \leq n} (a_i, b_i] \Big) = \sum_{i=1}^{n} \big( F(b_i) - F(a_i) \big).$$
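The definition above is easy to sketch in code: the measure of a finite union of disjoint half-open intervals is just the sum of the increments of $F$. The particular c.d.f.s below are illustrative choices, not from the notes.

```python
# A small sketch of the definition above: lambda_F of a finite union of
# disjoint intervals (a_i, b_i] is the sum of the increments F(b_i) - F(a_i).

def stieltjes_measure(F, intervals):
    """lambda_F of a finite union of disjoint half-open intervals (a, b]."""
    return sum(F(b) - F(a) for a, b in intervals)

# F(x) = x (clipped to [0, 1]) recovers the Lebesgue measure on [0, 1].
F_id = lambda x: min(max(x, 0.0), 1.0)
m_id = stieltjes_measure(F_id, [(0.0, 0.25), (0.5, 0.75)])

# A c.d.f. with a jump of size 1/2 at 0 puts mass 1/2 on any (a, b] with b >= 0 > a,
# consistent with the right-closed convention for the intervals.
F_jump = lambda x: 0.0 if x < 0 else 0.5 + F_id(x) / 2
m_jump = stieltjes_measure(F_jump, [(-1.0, 0.0)])
```

Note how the jump of $F$ at $0$ is picked up by the interval $(-1, 0]$ precisely because intervals are closed on the right.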
One then proves that $\lambda_F$ is countably additive on this algebra,
$$\lambda_F\Big( \bigcup_{i \geq 1} A_i \Big) = \sum_{i \geq 1} \lambda_F(A_i),$$
whenever all $A_i$ and $\bigcup_{i \geq 1} A_i$ are finite unions of disjoint intervals. The proof is exactly the same as in the case of $F(x) = x$ corresponding to the Lebesgue measure. Once countable additivity is proved on the algebra, it remains to appeal to the following key result. Recall that, given an algebra $\mathcal{A}$, the $\sigma$-algebra $\sigma(\mathcal{A})$ generated by $\mathcal{A}$ is the smallest $\sigma$-algebra that contains $\mathcal{A}$.
Theorem 1 (Caratheodory's extension theorem) A countably additive probability measure on an algebra $\mathcal{A}$ can be extended uniquely to a probability measure on $\sigma(\mathcal{A})$.
Therefore, $\lambda_F$ above can be uniquely extended to a measure on the $\sigma$-algebra $\sigma(\mathcal{A})$ generated by the algebra of finite unions of disjoint intervals. This is the $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ of Borel sets on $\mathbb{R}$. Clearly, $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda_F)$ will be a probability space if, in addition to being (1) non-decreasing and right-continuous, $F$ satisfies (2) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.
The reason we required $F$ to be right-continuous corresponds to our choice that the intervals $(a, b]$ in the algebra are closed on the right, so the two conventions agree and the measure $\lambda_F$ is continuous, as it should be, e.g.
$$\lambda_F\big( (a, b] \big) = \lambda_F\Big( \bigcap_{n \geq 1} \big( a, b + \tfrac{1}{n} \big] \Big) = \lim_{n \to \infty} \lambda_F\big( \big( a, b + \tfrac{1}{n} \big] \big).$$
In Probability Theory, functions satisfying properties (1) and (2) above are called cumulative distribution functions, or c.d.f. for short, and we will give an alternative construction of the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda_F)$ in the next section.
Other basic ways to define a probability are through a probability function or a density function. If the measurable space is such that all singletons are measurable, we can simply assign some weights $p_i = \mathbb{P}(\{\omega_i\})$ to a sequence of distinct points $\omega_i \in \Omega$, such that $\sum_{i \geq 1} p_i = 1$, and let
$$\mathbb{P}(A) = \sum_{i \,:\, \omega_i \in A} p_i = \mathbb{P}\big( A \cap \{\omega_i\}_{i \geq 1} \big).$$
Alternatively, given a measure $Q$ on $(\Omega, \mathcal{A})$ and a measurable function $f \geq 0$ with $\int_\Omega f\, dQ = 1$, we can let $\mathbb{P}(A) = \int_A f\, dQ$. The function $f$ is called the density function of $\mathbb{P}$ with respect to $Q$ and, in a typical setting when $\Omega = \mathbb{R}^k$ and $Q$ is the Lebesgue measure $\lambda$, $f$ is simply called the density function of $\mathbb{P}$.
Example. (1) A probability measure on the non-negative integers with the probability function
$$p_i = \mathbb{P}(\{i\}) = e^{-\lambda} \frac{\lambda^i}{i!}$$
for integer $i \geq 0$ is called the Poisson distribution with the parameter $\lambda > 0$. (Notation: given a set $A$, we will denote by $I(x \in A)$ or $I_A(x)$ the indicator that $x$ belongs to $A$.) (2) A probability measure on $\mathbb{R}$ corresponding to the density function $f(x) = \lambda e^{-\lambda x} I(x \geq 0)$ is called the exponential distribution with the parameter $\lambda > 0$. (3) A probability measure on $\mathbb{R}$ corresponding to the density function
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
is called the standard normal, or standard Gaussian, distribution on $\mathbb{R}$. $\square$
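As a quick numerical sanity check (a sketch, not part of the notes), the three distributions above have total mass 1; the parameter values below are illustrative.

```python
import math

# (1) Poisson(lam): sum e^{-lam} lam^i / i! over i >= 0, built multiplicatively
# to avoid huge factorials.
lam = 3.0
term, poisson_total = math.exp(-lam), 0.0
for i in range(200):
    poisson_total += term
    term *= lam / (i + 1)

# (2) Exponential(lam): midpoint Riemann sum of lam e^{-lam x} on [0, 10]
# (the tail beyond 10 is negligible for lam = 3).
dx = 1e-3
exp_total = sum(lam * math.exp(-lam * (k + 0.5) * dx) * dx for k in range(int(10 / dx)))

# (3) Standard normal: Phi(x) = (1 + erf(x / sqrt(2))) / 2, and
# Phi(10) - Phi(-10) equals 1 up to negligible tails.
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
normal_total = Phi(10.0) - Phi(-10.0)
```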
Recall that a measure $\mathbb{P}$ is called absolutely continuous with respect to another measure $Q$, $\mathbb{P} \ll Q$, if for all $A \in \mathcal{A}$,
$$Q(A) = 0 \implies \mathbb{P}(A) = 0,$$
in which case the existence of the density is guaranteed by the following classical result from measure theory.
Theorem 2 (Radon-Nikodym) On a measurable space $(\Omega, \mathcal{A})$ let $\mu$ be a $\sigma$-finite measure and $\nu$ be a finite measure absolutely continuous with respect to $\mu$, $\nu \ll \mu$. Then there exists the Radon-Nikodym derivative $h \in L^1(\Omega, \mathcal{A}, \mu)$ such that
$$\nu(A) = \int_A h(\omega)\, d\mu(\omega)$$
for all $A \in \mathcal{A}$. Such $h$ is unique modulo $\mu$-a.e. equivalence.
Of course, the Radon-Nikodym theorem also applies to finite signed measures $\nu$, which can be decomposed into $\nu = \nu^+ - \nu^-$ for some finite measures $\nu^+, \nu^-$, the so-called Hahn-Jordan decomposition. Let us recall a proof of the Radon-Nikodym theorem for convenience.
Proof (von Neumann's argument, for finite $\mu$; the $\sigma$-finite case follows by decomposing $\Omega$ into sets of finite measure). On the Hilbert space $H = L^2(\Omega, \mathcal{A}, \mu + \nu)$, $T(f) = \int f\, d\nu$ is a continuous linear functional and, by the Riesz-Frechet theorem, $\int f\, d\nu = \int f g\, d(\mu + \nu)$ for some $g \in H$. This implies $\int f\, d\mu = \int f (1 - g)\, d(\mu + \nu)$. Now $g(\omega) \geq 0$ for $(\mu + \nu)$-almost all $\omega$, which can be seen by taking $f(\omega) = I(g(\omega) < 0)$, and similarly $g(\omega) \leq 1$ for $(\mu + \nu)$-almost all $\omega$. Therefore, we can take $0 \leq g \leq 1$. Let $E = \{\omega : g(\omega) = 1\}$. Then
$$\mu(E) = \int I(\omega \in E)\, d\mu(\omega) = \int I(\omega \in E)\big( 1 - g(\omega) \big)\, d(\mu + \nu)(\omega) = 0,$$
and, since $\nu \ll \mu$, also $\nu(E) = 0$. Subtracting the two identities above gives $\int f (1 - g)\, d\nu = \int f g\, d\mu$ and, taking $f = I_A / (1 - g)$ (via monotone convergence), we get
$$\nu(A) = \int_A \frac{g}{1 - g}\, d\mu,$$
so $h = \frac{g}{1-g}\, I(\omega \notin E)$ works. $\square$
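On a finite space the Radon-Nikodym derivative is simply a ratio of point masses, which makes the theorem easy to check directly. The measures below are illustrative choices, not from the notes.

```python
from fractions import Fraction

# A finite-space sketch of the theorem: on Omega = {0, 1, 2} with nu << mu,
# h(w) = nu({w}) / mu({w}) satisfies nu(A) = integral over A of h d(mu).
mu = {0: Fraction(1, 5), 1: Fraction(3, 10), 2: Fraction(1, 2)}
nu = {0: Fraction(1, 10), 1: Fraction(3, 5), 2: Fraction(3, 10)}
h = {w: nu[w] / mu[w] for w in mu}

def integrate(f, m):
    """Integral of f with respect to a measure m on a finite set."""
    return sum(f(w) * m[w] for w in m)

A = {1, 2}
nu_A = integrate(lambda w: h[w] * (w in A), mu)   # equals nu(A)
```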
Let us now write down some important properties of $\sigma$-algebras and probability measures.
Lemma 2 (Approximation property) If $\mathcal{A}$ is an algebra of sets then for any $B \in \sigma(\mathcal{A})$ there exists a sequence $B_n \in \mathcal{A}$ such that $\lim_{n \to \infty} \mathbb{P}(B \triangle B_n) = 0$.
Proof. Let $\mathcal{D}$ be the collection of all $B \in \sigma(\mathcal{A})$ that can be approximated in this way. We will prove that $\mathcal{D}$ is a $\sigma$-algebra and, since $\mathcal{A} \subseteq \mathcal{D}$, this will imply that $\sigma(\mathcal{A}) \subseteq \mathcal{D}$. One can easily check that
$$d(B, C) := \mathbb{P}(B \triangle C) = \int \big| I_B(\omega) - I_C(\omega) \big|\, d\mathbb{P}(\omega)$$
satisfies (a) $d(B^c, C^c) = d(B, C)$, (b) $d\big( \bigcup_i B_i, \bigcup_i C_i \big) \leq \sum_i d(B_i, C_i)$ and (c) the triangle inequality. In particular, by (a), $\mathcal{D}$ is closed under complements. Given $D_i \in \mathcal{D}$ for $i \geq 1$, choose $C_i^n \in \mathcal{A}$ such that
$$\lim_{n \to \infty} \mathbb{P}(C_i^n \triangle D_i) = 0;$$
then, by the properties (a)-(c), $C_N^n = \bigcup_{i \leq N} C_i^n$ approximates $D_N = \bigcup_{i \leq N} D_i$, which means that $D_N \in \mathcal{D}$. Let $D_\infty = \bigcup_{i \geq 1} D_i$. Since $\mathbb{P}(D_\infty \setminus D_N) \to 0$ as $N \to \infty$, it is clear that $D_\infty \in \mathcal{D}$, so $\mathcal{D}$ is a $\sigma$-algebra. $\square$
Dynkin's theorem. We will now describe a tool, the so-called Dynkin's theorem, or $\pi$-$\lambda$ theorem, which is often quite useful in checking various properties of probabilities.
$\pi$-systems: A collection of sets $\mathcal{P}$ is called a $\pi$-system if it is closed under taking intersections, i.e.
1. if $A, B \in \mathcal{P}$ then $A \cap B \in \mathcal{P}$.
$\lambda$-systems: A collection of sets $\mathcal{L}$ is called a $\lambda$-system if
1. $\Omega \in \mathcal{L}$,
2. if $A \in \mathcal{L}$ then $A^c \in \mathcal{L}$,
3. if $A_n \in \mathcal{L}$ are disjoint for $n \geq 1$ then $\bigcup_{n \geq 1} A_n \in \mathcal{L}$.
Given any collection of sets $\mathcal{C}$, by analogy with the $\sigma$-algebra $\sigma(\mathcal{C})$ generated by $\mathcal{C}$, we will denote by $\mathcal{L}(\mathcal{C})$ the smallest $\lambda$-system that contains $\mathcal{C}$. It is easy to see that the intersection of all $\lambda$-systems that contain $\mathcal{C}$ is again a $\lambda$-system that contains $\mathcal{C}$, so this intersection is precisely $\mathcal{L}(\mathcal{C})$.
Theorem 3 (Dynkin) If a $\pi$-system $\mathcal{P}$ is contained in a $\lambda$-system $\mathcal{L}$ then $\sigma(\mathcal{P}) \subseteq \mathcal{L}$.
Proof. First of all, it should be obvious that a collection of sets which is both a $\pi$-system and a $\lambda$-system is a $\sigma$-algebra. Therefore, if we can show that $\mathcal{L}(\mathcal{P})$ is a $\pi$-system then it is a $\sigma$-algebra and
$$\mathcal{P} \subseteq \sigma(\mathcal{P}) \subseteq \mathcal{L}(\mathcal{P}) \subseteq \mathcal{L},$$
which proves the result. Let us prove that $\mathcal{L}(\mathcal{P})$ is a $\pi$-system. For a fixed set $A \subseteq \Omega$, let us define
$$\mathcal{G}_A = \big\{ B \subseteq \Omega : B \cap A \in \mathcal{L}(\mathcal{P}) \big\}.$$
Step 1. Let us show that if $A \in \mathcal{L}(\mathcal{P})$ then $\mathcal{G}_A$ is a $\lambda$-system. Obviously, $\Omega \in \mathcal{G}_A$. If $B \in \mathcal{G}_A$ then $B \cap A \in \mathcal{L}(\mathcal{P})$ and, since $A^c \in \mathcal{L}(\mathcal{P})$ is disjoint from $B \cap A$,
$$B^c \cap A = \big( (B \cap A) \cup A^c \big)^c \in \mathcal{L}(\mathcal{P}).$$
This means that $B^c \in \mathcal{G}_A$. Finally, if $B_n \in \mathcal{G}_A$ are disjoint then $B_n \cap A \in \mathcal{L}(\mathcal{P})$ are disjoint and $\big( \bigcup_n B_n \big) \cap A = \bigcup_n (B_n \cap A) \in \mathcal{L}(\mathcal{P})$, so $\bigcup_n B_n \in \mathcal{G}_A$.
Step 2. Next, let us show that if $A \in \mathcal{P}$ then $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_A$. Since $\mathcal{P} \subseteq \mathcal{L}(\mathcal{P})$, by Step 1, $\mathcal{G}_A$ is a $\lambda$-system. Also, since $\mathcal{P}$ is a $\pi$-system, closed under taking intersections, $\mathcal{P} \subseteq \mathcal{G}_A$. This implies that $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_A$. In other words, we showed that if $A \in \mathcal{P}$ and $B \in \mathcal{L}(\mathcal{P})$ then $A \cap B \in \mathcal{L}(\mathcal{P})$.
Step 3. Finally, let us show that if $B \in \mathcal{L}(\mathcal{P})$ then $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_B$. By Step 2, $\mathcal{G}_B$ contains $\mathcal{P}$ and, by Step 1, $\mathcal{G}_B$ is a $\lambda$-system. Therefore, $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{G}_B$. We showed that if $B \in \mathcal{L}(\mathcal{P})$ and $A \in \mathcal{L}(\mathcal{P})$ then $A \cap B \in \mathcal{L}(\mathcal{P})$, so $\mathcal{L}(\mathcal{P})$ is a $\pi$-system. $\square$
Example. Suppose that $\Omega$ is a topological space with the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets. Given two probability measures $\mathbb{P}_1$ and $\mathbb{P}_2$ on $(\Omega, \mathcal{B})$, the collection of sets
$$\mathcal{L} = \big\{ B \in \mathcal{B} : \mathbb{P}_1(B) = \mathbb{P}_2(B) \big\}$$
is trivially a $\lambda$-system, by the properties of probability measures. On the other hand, the collection $\mathcal{P}$ of all open sets is a $\pi$-system and, therefore, if we know that $\mathbb{P}_1(B) = \mathbb{P}_2(B)$ for all open sets then, by Dynkin's theorem, this holds for all Borel sets $B \in \mathcal{B}$. Similarly, one can see that a probability on the Borel $\sigma$-algebra on the real line is determined by the probabilities of the sets $(-\infty, t]$ for all $t \in \mathbb{R}$. $\square$
Regularity of measures. Let us now consider the case of $\Omega = S$ where $(S, d)$ is a metric space, and let $\mathcal{A}$ be the Borel $\sigma$-algebra generated by open (or closed) sets. A probability measure $\mathbb{P}$ on this space is called closed regular if
$$\mathbb{P}(A) = \sup\big\{ \mathbb{P}(F) : F \subseteq A,\ F \text{ closed} \big\} \tag{1.0.1}$$
for all $A \in \mathcal{A}$. It is a standard result in measure theory that every finite measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is regular. In the setting of complete separable metric spaces, this is known as Ulam's theorem, which we will prove below.
Theorem 4 Every probability measure $\mathbb{P}$ on a metric space $(S, d)$ is closed regular.
Proof. Consider the collection
$$\mathcal{L} = \big\{ A \in \mathcal{A} : \text{both } A \text{ and } A^c \text{ satisfy (1.0.1)} \big\}. \tag{1.0.3}$$
A closed set $F$ satisfies (1.0.1) trivially, while its complement $U = F^c$ is the increasing union of the closed sets $F_n = \{x : d(x, F) \geq 1/n\}$, so, by the continuity of measure, $\mathbb{P}(U) = \lim_{n \to \infty} \mathbb{P}(F_n)$. This proves that $U = F^c$ satisfies (1.0.1) and $F \in \mathcal{L}$. Next, one can easily check that $\mathcal{L}$ is a $\lambda$-system, which we will leave as an exercise below. Since the collection of all closed sets is a $\pi$-system, and closed sets generate the Borel $\sigma$-algebra $\mathcal{A}$, by Dynkin's theorem, all measurable sets are in $\mathcal{L}$. This proves that $\mathbb{P}$ is closed regular. $\square$
Theorem 5 (Ulam) If $(S, d)$ is a complete separable metric space then every probability measure $\mathbb{P}$ on the Borel $\sigma$-algebra is regular, i.e. (1.0.1) holds with closed replaced by compact sets.
Proof. First, let us show that, for any $\varepsilon > 0$, there exists a compact set $K \subseteq S$ such that $\mathbb{P}(S \setminus K) \leq \varepsilon$. Consider a sequence $\{s_1, s_2, \ldots\}$ that is dense in $S$. For any $m \geq 1$, $S = \bigcup_{i=1}^{\infty} B(s_i, \frac{1}{m})$, where $B(s_i, \frac{1}{m})$ is the closed ball of radius $1/m$ centered at $s_i$. By the continuity of measure, for large enough $n(m)$,
$$\mathbb{P}\Big( S \setminus \bigcup_{i=1}^{n(m)} B\big( s_i, \tfrac{1}{m} \big) \Big) \leq \frac{\varepsilon}{2^m}.$$
If we take
$$K = \bigcap_{m \geq 1} \bigcup_{i=1}^{n(m)} B\big( s_i, \tfrac{1}{m} \big)$$
then
$$\mathbb{P}(S \setminus K) \leq \sum_{m \geq 1} \frac{\varepsilon}{2^m} = \varepsilon.$$
Obviously, by construction, $K$ is closed and totally bounded. Since $S$ is complete, $K$ is compact. By the previous theorem, given $A \in \mathcal{A}$, we can find a closed subset $F \subseteq A$ such that $\mathbb{P}(A \setminus F) \leq \varepsilon$. Therefore, $\mathbb{P}\big( A \setminus (F \cap K) \big) \leq 2\varepsilon$, and since $F \cap K$ is compact, this finishes the proof. $\square$
Exercise. Let $\mathcal{F} = \{F \subseteq \mathbb{N} : \mathbb{N} \setminus F \text{ is finite}\}$ be the collection of all sets in $\mathbb{N}$ with finite complements. $\mathcal{F}$ is a filter, which means that (a) $\emptyset \notin \mathcal{F}$, (b) if $F_1, F_2 \in \mathcal{F}$ then $F_1 \cap F_2 \in \mathcal{F}$, and (c) if $F \in \mathcal{F}$ and $F \subseteq G$ then $G \in \mathcal{F}$. It is well known (by Zorn's lemma) that $\mathcal{F} \subseteq \mathcal{U}$ for some ultrafilter $\mathcal{U}$, which in addition to (a)-(c) also satisfies: (d) for any set $A \subseteq \mathbb{N}$, either $A$ or $\mathbb{N} \setminus A$ is in $\mathcal{U}$. If we define $\mathbb{P}(A) = 1$ for $A \in \mathcal{U}$ and $\mathbb{P}(A) = 0$ for $A \notin \mathcal{U}$, show that $\mathbb{P}$ is finitely additive, but not countably additive, on $2^{\mathbb{N}}$. If $\mathbb{P}$ could be interpreted as a probability, suppose that two people pick a number in $\mathbb{N}$ according to $\mathbb{P}$, and whoever picks the bigger number wins. If one shows his number first, what is the probability that the other one wins?
Exercise. Suppose that $\mathcal{C}$ is a class of subsets of $\Omega$ and $B \in \sigma(\mathcal{C})$. Show that there exists a countable class $\mathcal{C}_B \subseteq \mathcal{C}$ such that $B \in \sigma(\mathcal{C}_B)$.
Exercise. Check that $\mathcal{L}$ in (1.0.3) is a $\lambda$-system. (In fact, one can check that it is a $\sigma$-algebra.)
Random variables.
Consider a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and a measurable space $(\mathcal{S}, \mathcal{B})$. A function $X : \Omega \to \mathcal{S}$ is called measurable if, for all $B \in \mathcal{B}$,
$$X^{-1}(B) = \big\{ \omega : X(\omega) \in B \big\} \in \mathcal{A}.$$
In Probability Theory, such functions are called random variables, especially when $(\mathcal{S}, \mathcal{B}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Depending on the target space $\mathcal{S}$, $X$ may be called a random vector, sequence, or, more generally, a random element in $\mathcal{S}$. Recall that measurability can be checked on the sets that generate the $\sigma$-algebra $\mathcal{B}$ and, in particular, the following holds.
Lemma 3 $X : \Omega \to \mathbb{R}$ is a random variable if and only if, for all $t \in \mathbb{R}$,
$$\{X \leq t\} := \big\{ \omega : X(\omega) \in (-\infty, t] \big\} \in \mathcal{A}.$$
Proof. It is enough to show that
$$\mathcal{D} = \big\{ D \subseteq \mathbb{R} : X^{-1}(D) \in \mathcal{A} \big\}$$
is a $\sigma$-algebra. Since the sets $(-\infty, t] \in \mathcal{D}$ generate the Borel $\sigma$-algebra, this will imply that $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{D}$. The fact that $\mathcal{D}$ is a $\sigma$-algebra follows simply because taking the pre-image preserves set operations. For example, if we consider a sequence $D_i \in \mathcal{D}$ for $i \geq 1$ then
$$X^{-1}\Big( \bigcup_{i \geq 1} D_i \Big) = \bigcup_{i \geq 1} X^{-1}(D_i) \in \mathcal{A},$$
because $X^{-1}(D_i) \in \mathcal{A}$ and $\mathcal{A}$ is a $\sigma$-algebra. Therefore, $\bigcup_{i \geq 1} D_i \in \mathcal{D}$. Other properties can be checked similarly, so $\mathcal{D}$ is a $\sigma$-algebra. $\square$
Given a random element $X$ on $(\Omega, \mathcal{A}, \mathbb{P})$ with values in $(\mathcal{S}, \mathcal{B})$, let us denote the image measure on $\mathcal{B}$ by $\mathbb{P}_X = \mathbb{P} \circ X^{-1}$, which means that for $B \in \mathcal{B}$,
$$\mathbb{P}_X(B) = \mathbb{P}(X \in B) = \mathbb{P}\big( X^{-1}(B) \big) = \big( \mathbb{P} \circ X^{-1} \big)(B).$$
$(\mathcal{S}, \mathcal{B}, \mathbb{P}_X)$ is called the sample space of a random element $X$, and $\mathbb{P}_X$ is called the law of $X$, or the distribution of $X$. Clearly, on this space a random variable $\pi : \mathcal{S} \to \mathcal{S}$ defined by the identity $\pi(s) = s$ has the same law as $X$. When $\mathcal{S} = \mathbb{R}$, the function $F(t) = \mathbb{P}(X \leq t)$ is called the cumulative distribution function (c.d.f.) of $X$. Clearly, this function satisfies the properties that already appeared in the previous section. On the other hand, any such function is a c.d.f. of some random variable, for example, of the random variable $X(x) = x$ on the space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda_F)$ constructed in the previous section, since
$$\mathbb{P}\big( x : X(x) \leq t \big) = \lambda_F\big( (-\infty, t] \big) = F(t).$$
Another construction can be given on the probability space $([0,1], \mathcal{B}([0,1]), \lambda)$ with the Lebesgue measure $\lambda$, using the so-called quantile transformation. Given a c.d.f. $F$, let us define a random variable $X : [0,1] \to \mathbb{R}$ by the quantile transformation (see Figure 2.1):
$$X(x) = \inf\big\{ s \in \mathbb{R} : F(s) \geq x \big\}.$$
What is the c.d.f. of $X$? Notice that, since $F$ is right-continuous, the infimum is achieved, so $X(x) \leq s$ if and only if $F(s) \geq x$. Therefore,
$$\lambda\big( x : X(x) \leq s \big) = \lambda\big( \{ x : x \leq F(s) \} \big) = F(s),$$
and the c.d.f. of $X$ is $F$. This means that, to define the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda_F)$, we can start with $([0,1], \mathcal{B}([0,1]), \lambda)$ and let $\lambda_F = \lambda \circ X^{-1}$ be the image of the Lebesgue measure under the quantile transformation, or the law of $X$ on $\mathbb{R}$. A related inverse property is left as an exercise below.
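The quantile transformation is exactly how inverse-transform sampling works in practice. A minimal sketch, using the exponential c.d.f. $F(s) = 1 - e^{-s}$ as an illustrative choice whose quantile has the closed form $X(x) = -\log(1-x)$:

```python
import math, random

# Apply the quantile transformation to uniform samples on [0, 1]; the results
# should follow the exponential distribution with parameter 1.
rng = random.Random(0)
samples = [-math.log(1.0 - rng.random()) for _ in range(200_000)]

# The empirical c.d.f. at s = 1 should be close to F(1) = 1 - e^{-1}.
ecdf_at_1 = sum(s <= 1.0 for s in samples) / len(samples)
```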
Given a random element $X : (\Omega, \mathcal{A}) \to (\mathcal{S}, \mathcal{B})$, the $\sigma$-algebra
$$\sigma(X) = \big\{ X^{-1}(B) : B \in \mathcal{B} \big\}$$
is called the $\sigma$-algebra generated by $X$. It is obvious that this collection of sets is, indeed, a $\sigma$-algebra.
Example. Consider a random variable $X$ on $([0,1], \mathcal{B}([0,1]), \lambda)$ defined by
$$X(x) = \begin{cases} 0, & 0 \leq x \leq 1/2, \\ 1, & 1/2 < x \leq 1. \end{cases}$$
Then the $\sigma$-algebra generated by $X$ consists of the sets
$$\sigma(X) = \Big\{ \emptyset,\ \big[ 0, \tfrac{1}{2} \big],\ \big( \tfrac{1}{2}, 1 \big],\ [0, 1] \Big\},$$
and $\mathbb{P}(X = 0) = \mathbb{P}(X = 1) = 1/2$. $\square$
Lemma 4 Consider a probability space $(\Omega, \mathcal{A}, \mathbb{P})$, a measurable space $(\mathcal{S}, \mathcal{B})$ and random elements $X : \Omega \to \mathcal{S}$ and $Y : \Omega \to \mathbb{R}$. Then the following are equivalent:
1. $Y = g(X)$ for some measurable function $g : \mathcal{S} \to \mathbb{R}$;
2. $Y$ is measurable with respect to the $\sigma$-algebra $\sigma(X)$.
It should be obvious from the proof that $\mathbb{R}$ can be replaced by any separable metric space.
Proof. The fact that 1 implies 2 is obvious, since for any Borel set $B \subseteq \mathbb{R}$ the set $B' = g^{-1}(B) \in \mathcal{B}$ and, therefore, $Y^{-1}(B) = X^{-1}(B') \in \sigma(X)$. Let us show that 2 implies 1. For all integers $n$ and $k$, consider the sets
$$A_{n,k} = \Big\{ \omega : Y(\omega) \in \Big[ \frac{k}{2^n}, \frac{k+1}{2^n} \Big) \Big\} = Y^{-1}\Big( \Big[ \frac{k}{2^n}, \frac{k+1}{2^n} \Big) \Big).$$
Independence. Events $A_1, \ldots, A_n \in \mathcal{A}$ are called independent if, for any subset of indices $I \subseteq \{1, \ldots, n\}$,
$$\mathbb{P}\Big( \bigcap_{i \in I} A_i \Big) = \prod_{i \in I} \mathbb{P}(A_i);$$
in particular, $\mathbb{P}(A_1 \cap \cdots \cap A_n) = \prod_{i \leq n} \mathbb{P}(A_i)$.
Example. Consider a tetrahedral die with four equally likely faces: one colored r (red), one b (blue), one g (green), and one carrying all three colors rbg. If we roll this die then the colors provide an example of pairwise independent random variables that are not independent, since
$$\mathbb{P}(r) = \mathbb{P}(b) = \mathbb{P}(g) = \frac{1}{2} \quad\text{and}\quad \mathbb{P}(rb) = \mathbb{P}(rg) = \mathbb{P}(bg) = \frac{1}{4},$$
while
$$\mathbb{P}(rbg) = \frac{1}{4} \neq \mathbb{P}(r)\, \mathbb{P}(b)\, \mathbb{P}(g) = \Big( \frac{1}{2} \Big)^3 = \frac{1}{8}. \qquad\square$$
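The example above is small enough to verify by exact enumeration; a minimal sketch with four equally likely faces:

```python
from itertools import combinations
from fractions import Fraction

# Four equally likely faces: r, b, g, and a face carrying all three colors.
faces = [{"r"}, {"b"}, {"g"}, {"r", "b", "g"}]

def prob(event):
    """P(event) under the uniform measure on the four faces."""
    return Fraction(sum(1 for f in faces if event(f)), len(faces))

P1 = {c: prob(lambda f, c=c: c in f) for c in "rbg"}                       # singles
P2 = {pq: prob(lambda f, pq=pq: set(pq) <= f) for pq in combinations("rbg", 2)}  # pairs
P3 = prob(lambda f: {"r", "b", "g"} <= f)                                  # all three
```

Every pair factorizes ($1/4 = 1/2 \cdot 1/2$), but the triple does not ($1/4 \neq 1/8$).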
First of all, independence can be checked on generating $\pi$-systems.
Lemma 6 If $\pi$-systems $\mathcal{C}_1, \ldots, \mathcal{C}_n$ are independent then the $\sigma$-algebras $\sigma(\mathcal{C}_1), \ldots, \sigma(\mathcal{C}_n)$ are independent.
Proof. Let $\mathcal{C}$ be the collection of all sets $C$ such that
$$\mathbb{P}(C \cap C_2 \cap \cdots \cap C_n) = \mathbb{P}(C)\, \mathbb{P}(C_2) \cdots \mathbb{P}(C_n)$$
for all $C_i \in \mathcal{C}_i$ for $2 \leq i \leq n$. It is obvious that $\mathcal{C}$ is a $\lambda$-system, and it contains $\mathcal{C}_1$ by assumption. Since $\mathcal{C}_1$ is a $\pi$-system, by Dynkin's theorem, $\mathcal{C}$ contains $\sigma(\mathcal{C}_1)$. This means that we can replace $\mathcal{C}_1$ by $\sigma(\mathcal{C}_1)$ in the statement of the lemma and, similarly, we can continue to replace each $\mathcal{C}_i$ by $\sigma(\mathcal{C}_i)$. $\square$
Lemma 7 Consider random variables $X_i : \Omega \to \mathbb{R}$ on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$.
(a) The random variables $(X_i)_{i \leq n}$ are independent if and only if, for all $t_i \in \mathbb{R}$,
$$\mathbb{P}(X_1 \leq t_1, \ldots, X_n \leq t_n) = \prod_{i=1}^{n} \mathbb{P}(X_i \leq t_i). \tag{2.0.1}$$
(b) If the laws of $X_i$ have densities $f_i$ on $\mathbb{R}$ then these random variables are independent if and only if a joint density $f$ on $\mathbb{R}^n$ of the vector $(X_i)_{i \leq n}$ exists and
$$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i).$$
Proof. (a) This is obvious by Lemma 6, because the collection of sets $(-\infty, t]$ for $t \in \mathbb{R}$ is a $\pi$-system that generates the Borel $\sigma$-algebra on $\mathbb{R}$.
(b) Let us start with the "if" part. If we denote $X = (X_1, \ldots, X_n)$ then, for any $A_i \in \mathcal{B}(\mathbb{R})$,
$$\mathbb{P}\Big( \bigcap_{i=1}^{n} \{X_i \in A_i\} \Big) = \mathbb{P}(X \in A_1 \times \cdots \times A_n) = \int_{A_1 \times \cdots \times A_n} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n = \prod_{i=1}^{n} \int_{A_i} f_i(x_i)\, dx_i = \prod_{i=1}^{n} \mathbb{P}(X_i \in A_i).$$
For the "only if" part, it is enough to show that
$$\mathbb{P}(X \in A) = \int_A \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n$$
for all $A$ in the Borel $\sigma$-algebra on $\mathbb{R}^n$, which would mean that the joint density exists and is equal to the product of the individual densities. One can prove the above equality for all $A \in \mathcal{B}(\mathbb{R}^n)$ by appealing to the Monotone Class Theorem from measure theory, or the Caratheodory Extension Theorem 1, since the above equality, obviously, can be extended from the semi-algebra of measurable rectangles $A_1 \times \cdots \times A_n$ to the algebra of disjoint unions of measurable rectangles, which generates the Borel $\sigma$-algebra. However, we can also appeal to Dynkin's theorem, since the family $\mathcal{L}$ of sets $A$ that satisfy the above equality is a $\lambda$-system by the properties of measures and integrals, and it contains the $\pi$-system $\mathcal{P}$ of measurable rectangles $A_1 \times \cdots \times A_n$ that generates the Borel $\sigma$-algebra, $\mathcal{B}(\mathbb{R}^n) = \sigma(\mathcal{P})$. $\square$
More generally, a collection of $\sigma$-algebras $\mathcal{A}_t \subseteq \mathcal{A}$ indexed by $t \in T$ for some set $T$ is called independent if any finite subset of these $\sigma$-algebras is independent. Let $T = T_1 \cup \cdots \cup T_n$ be a partition of $T$ into disjoint sets. In this case, the following holds.
Lemma (Grouping lemma) The $\sigma$-algebras $\mathcal{B}_i = \sigma(\mathcal{A}_t,\ t \in T_i)$ for $i \leq n$ are independent.
Proof. Let $\mathcal{C}_i$ be the collection of all finite intersections $A_{t_1} \cap \cdots \cap A_{t_k}$ with distinct $t_1, \ldots, t_k \in T_i$ and $A_{t_j} \in \mathcal{A}_{t_j}$. It is obvious that $\mathcal{B}_i = \sigma(\mathcal{C}_i)$ since $\mathcal{A}_t \subseteq \mathcal{C}_i$ for all $t \in T_i$, each $\mathcal{C}_i$ is a $\pi$-system, and $\mathcal{C}_1, \ldots, \mathcal{C}_n$ are independent by the definition of independence of the $\sigma$-algebras $\mathcal{A}_t$ for $t \in T$. Using Lemma 6 finishes the proof. (Of course, one should recognize from measure theory that $\mathcal{C}_i$ is a semi-algebra that generates $\mathcal{B}_i$.) $\square$
If we would like to construct finitely many independent random variables $(X_i)_{i \leq n}$ with arbitrary distributions $(P_i)_{i \leq n}$ on $\mathbb{R}$, we can simply consider the space $\Omega = \mathbb{R}^n$ with the product measure
$$\mathbb{P} = P_1 \times \cdots \times P_n$$
and define a random variable $X_i$ by $X_i(x_1, \ldots, x_n) = x_i$. The main result in the next section will imply that one can construct an infinite sequence of independent random variables with arbitrary distributions on the same probability space, and here we will give a sketch of another construction on the space $([0,1], \mathcal{B}([0,1]), \lambda)$. We will write $\mathbb{P} = \lambda$ to emphasize that we think of the Lebesgue measure as a probability.
Step 1. If we write the dyadic decomposition of $x \in [0, 1]$,
$$x = \sum_{n \geq 1} 2^{-n} \varepsilon_n(x),$$
then it is easy to see that $(\varepsilon_n)_{n \geq 1}$ are independent random variables with the distribution $\mathbb{P}(\varepsilon_n = 0) = \mathbb{P}(\varepsilon_n = 1) = 1/2$, since for any $n \geq 1$ and any $a_i \in \{0, 1\}$,
$$\mathbb{P}(\varepsilon_1 = a_1, \ldots, \varepsilon_n = a_n) = 2^{-n} = \prod_{i \leq n} \mathbb{P}(\varepsilon_i = a_i),$$
since fixing the first $n$ coefficients in the dyadic expansion places $x$ into an interval of length $2^{-n}$.
Step 2. Let us consider injections $k_m : \mathbb{N} \to \mathbb{N}$ for $m \geq 1$ such that their ranges $k_m(\mathbb{N})$ are all disjoint, and let us define
$$X_m = \sum_{n \geq 1} 2^{-n} \varepsilon_{k_m(n)}.$$
It is an easy exercise to check that each $X_m$ is well defined and has the uniform distribution on $[0, 1]$, which can be seen by looking at the dyadic intervals first. Moreover, by the Grouping Lemma above, the random variables $(X_m)_{m \geq 1}$ are all independent, since they are defined in terms of disjoint groups of independent random variables.
Step 3. Given a sequence of probability distributions $(P_m)_{m \geq 1}$ on $\mathbb{R}$, let $(F_m)_{m \geq 1}$ be the sequence of the corresponding c.d.f.s and let $(Q_m)_{m \geq 1}$ be their quantile transforms. We have seen above that each $Y_m = Q_m(X_m)$ has the distribution $P_m$ on $\mathbb{R}$, and they are obviously independent of each other. Therefore, we constructed a sequence of independent random variables $Y_m$ on the space $([0,1], \mathcal{B}([0,1]), \lambda)$ with arbitrary distributions $P_m$. $\square$
Expectation. If $X : \Omega \to \mathbb{R}$ is a random variable on $(\Omega, \mathcal{A}, \mathbb{P})$ then the expectation of $X$ is defined as
$$\mathbb{E} X = \int_\Omega X(\omega)\, d\mathbb{P}(\omega).$$
In other words, expectation is just another term for the integral with respect to a probability measure and, as a result, expectation has all the usual properties of integrals in measure theory: convergence theorems, the change of variables formula, Fubini's theorem, etc. Let us write down some special cases of the change of variables formula.
(3) If the distribution of $X : \Omega \to \mathbb{R}^n$ has the density function $f(x)$ on $\mathbb{R}^n$ then, for any measurable function $g : \mathbb{R}^n \to \mathbb{R}$,
$$\mathbb{E}\, g(X) = \int_{\mathbb{R}^n} g(x) f(x)\, dx.$$
Proof. All these properties follow by making the change of variables $x = X(\omega)$,
$$\mathbb{E}\, g(X) = \int_\Omega g(X(\omega))\, d\mathbb{P}(\omega) = \int g(x)\, d\big( \mathbb{P} \circ X^{-1} \big)(x) = \int g(x)\, d\mathbb{P}_X(x),$$
where $\mathbb{P}_X$ is the law of $X$. $\square$
Lemma 10 If $X, Y : \Omega \to \mathbb{R}$ are independent and $\mathbb{E}|X|, \mathbb{E}|Y| < \infty$ then $\mathbb{E} XY = \mathbb{E} X\, \mathbb{E} Y$.
Proof. Independence implies that the distribution of $(X, Y)$ on $\mathbb{R}^2$ is the product measure $P \times Q$, where $P$ and $Q$ are the distributions of $X$ and $Y$ on $\mathbb{R}$ and, therefore, by Fubini's theorem,
$$\mathbb{E} XY = \int_{\mathbb{R}^2} xy\, d(P \times Q)(x, y) = \int_{\mathbb{R}} x\, dP(x) \int_{\mathbb{R}} y\, dQ(y) = \mathbb{E} X\, \mathbb{E} Y. \qquad\square$$
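For discrete laws, the product-measure computation in the proof is a finite double sum, which can be checked exactly; the two distributions below are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Independent X ~ P and Y ~ Q: the joint law is the product measure, so the
# double sum over the product measure factorizes and EXY = EX * EY.
P = {-1: Fraction(1, 4), 0: Fraction(1, 4), 2: Fraction(1, 2)}   # law of X
Q = {1: Fraction(1, 3), 3: Fraction(2, 3)}                        # law of Y

EX = sum(x * p for x, p in P.items())
EY = sum(y * q for y, q in Q.items())
EXY = sum(x * y * p * q for (x, p), (y, q) in product(P.items(), Q.items()))
```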
Exercise. If a random variable $X$ has continuous c.d.f. $F(t)$, show that $F(X)$ is uniform on $[0, 1]$, i.e. the law of $F(X)$ is the Lebesgue measure on $[0, 1]$.
Exercise. If $F$ is a continuous distribution function, show that $\int_{\mathbb{R}} F(x)\, dF(x) = 1/2$.
Exercise. $\mathrm{ch}(\lambda)$ is the moment generating function of a random variable $X$ with the distribution $\mathbb{P}(X = \pm 1) = 1/2$, since
$$\mathbb{E}\, e^{\lambda X} = \frac{e^{\lambda} + e^{-\lambda}}{2} = \mathrm{ch}(\lambda).$$
Does there exist a (bounded) random variable $X$ such that $\mathbb{E}\, e^{\lambda X} = \mathrm{ch}^m(\lambda)$ for $0 < m < 1$? (Hint: compute several derivatives at zero.)
Assume that both sides are well-defined, for example, that f is bounded.
Exercise. Suppose $X$ is a random variable and $g : \mathbb{R} \to \mathbb{R}$ is measurable. Prove that if $X$ and $g(X)$ are independent then $\mathbb{P}(g(X) = c) = 1$ for some constant $c$.
Exercise. Let $e_1, \ldots, e_n$ be i.i.d. exponential random variables with the parameter $\lambda$, and let $e_{(1)} \leq \cdots \leq e_{(n)}$ be the order statistics (the random variables arranged in increasing order). Prove that the spacings $e_{(k+1)} - e_{(k)}$ are independent exponential random variables, and $e_{(k+1)} - e_{(k)}$ has the parameter $(n - k)\lambda$.
Exercise. Suppose that $(e_n)_{n \geq 1}$ are i.i.d. exponential random variables with the parameter $\lambda = 1$. Let $S_n = e_1 + \cdots + e_n$ and $R_n = S_{n+1}/S_n$ for $n \geq 1$. Prove that $(R_n)_{n \geq 1}$ are independent and $R_n$ has the density $n x^{-(n+1)} I(x \geq 1)$. Hint: let $R_0 = e_1$ and compute the joint density of $(R_0, R_1, \ldots, R_n)$ first.
Exercise. Let $N$ be a Poisson random variable with the mean $\lambda$, i.e. $\mathbb{P}(N = j) = \lambda^j e^{-\lambda} / j!$ for integer $j \geq 0$. Then consider $N$ i.i.d. random variables, independent of $N$, taking values $1, \ldots, k$ with probabilities $p_1, \ldots, p_k$. Let $N_j$ be the number of these random variables taking value $j$, so that $N_1 + \cdots + N_k = N$. Prove that $N_1, \ldots, N_k$ are independent Poisson random variables with means $\lambda p_1, \ldots, \lambda p_k$.
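The claim of this exercise is easy to probe by simulation (a sketch with illustrative parameters $\lambda = 4$, $k = 2$, $p_1 = 1/4$): the two thinned counts should have means $\lambda p_1$ and $\lambda p_2$ and be uncorrelated.

```python
import math, random

rng = random.Random(7)
lam, p1 = 4.0, 0.25

def poisson(rng, lam):
    """Knuth's method: count uniforms until their product drops below e^{-lam}."""
    prod, k = rng.random(), 0
    while prod > math.exp(-lam):
        prod *= rng.random()
        k += 1
    return k

counts1, counts2 = [], []
for _ in range(20_000):
    N = poisson(rng, lam)
    N1 = sum(rng.random() < p1 for _ in range(N))   # thin N with probability p1
    counts1.append(N1)
    counts2.append(N - N1)

mean1 = sum(counts1) / len(counts1)   # should be near lam * p1 = 1
mean2 = sum(counts2) / len(counts2)   # should be near lam * (1 - p1) = 3
cov = sum(a * b for a, b in zip(counts1, counts2)) / len(counts1) - mean1 * mean2
```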
Additional exercise. Suppose that a measurable subset $P \subseteq [0, 1]$ and the interval $I = [a, b] \subseteq [0, 1]$ are such that $\lambda(P) = \lambda(I)$, where $\lambda$ is the Lebesgue measure on $[0, 1]$. Show that there exists a measure-preserving transformation $T : [0, 1] \to [0, 1]$, i.e. $\lambda \circ T^{-1} = \lambda$, such that $T(I) \subseteq P$ and $T$ is one-to-one (injective) outside a set of measure zero. (Additional exercises never need to be turned in.)
In this section we will describe a typical way to construct an infinite family of random variables $(X_t)_{t \in T}$ on the same probability space, namely, when we are given all their finite dimensional marginals. This means that, for any finite subset $N \subseteq T$, we are given
$$\mathbb{P}_N(B) = \mathbb{P}\big( (X_t)_{t \in N} \in B \big)$$
for all $B$ in the Borel $\sigma$-algebra $\mathcal{B}_N = \mathcal{B}(\mathbb{R}^N)$. Clearly, these laws must satisfy a natural consistency condition,
$$\mathbb{P}_M\big( B \times \mathbb{R}^{M \setminus N} \big) = \mathbb{P}_N(B),$$
for any finite subsets $N \subseteq M$ and any Borel set $B \in \mathcal{B}_N$. (Of course, to be careful, we should define these probabilities for ordered subsets and also make sure they are consistent under rearrangements, but the notation for unordered sets is clear and should not cause any confusion.)
Our goal is to construct a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and random variables $X_t : \Omega \to \mathbb{R}$ that have $(\mathbb{P}_N)$ as their finite dimensional distributions. We take
$$\Omega = \mathbb{R}^T = \big\{ \sigma : T \to \mathbb{R} \big\}$$
to be the space of all real-valued functions on $T$, and let $X_t$ be the coordinate projection
$$X_t = X_t(\sigma) = \sigma(t).$$
For the coordinate projections to be measurable, the collection of sets
$$\mathcal{A} = \big\{ B \times \mathbb{R}^{T \setminus N} : B \in \mathcal{B}_N,\ N \subseteq T \text{ finite} \big\}$$
must be contained in the $\sigma$-algebra. It is easy to see that $\mathcal{A}$ is, in fact, an algebra of sets, and it is called the cylindrical algebra on $\mathbb{R}^T$. We will then take $\sigma(\mathcal{A})$, the smallest $\sigma$-algebra on which all coordinate projections are measurable; this is the so-called cylindrical $\sigma$-algebra on $\mathbb{R}^T$. A set $B \times \mathbb{R}^{T \setminus N}$ is called a cylinder. As we already agreed, the probability $\mathbb{P}$ on the sets in the algebra $\mathcal{A}$ is given by
$$\mathbb{P}\big( B \times \mathbb{R}^{T \setminus N} \big) = \mathbb{P}_N(B).$$
Given two finite subsets $N \subseteq M \subseteq T$ and $B \in \mathcal{B}_N$, the same set can be represented as two different cylinders,
$$B \times \mathbb{R}^{T \setminus N} = \big( B \times \mathbb{R}^{M \setminus N} \big) \times \mathbb{R}^{T \setminus M}.$$
However, by the consistency condition, the definition of $\mathbb{P}$ does not depend on the choice of the representation. To finish the construction, we need to show that $\mathbb{P}$ can be extended from the algebra $\mathcal{A}$ to a probability measure on the $\sigma$-algebra $\sigma(\mathcal{A})$. By the Caratheodory Extension Theorem 1, we only need to show that the following holds.
Claim: $\mathbb{P}$ is countably additive on the algebra $\mathcal{A}$. By the equivalence (3) and (3') above, it is enough to show that if cylinders $B_n = C_n \times \mathbb{R}^{T \setminus N_n} \in \mathcal{A}$ decrease and $\mathbb{P}(B_n) \geq \varepsilon > 0$ for all $n$, then $\bigcap_{n \geq 1} B_n \neq \emptyset$. By the regularity of the measures $\mathbb{P}_{N_i}$ proved in the previous section, we can find compact sets $K_i \subseteq C_i$ such that $\mathbb{P}_{N_i}(C_i \setminus K_i) \leq \varepsilon / 2^{i+1}$ and, therefore,
$$\mathbb{P}\Big( \bigcap_{i \leq n} C_i \times \mathbb{R}^{T \setminus N_i} \setminus \bigcap_{i \leq n} K_i \times \mathbb{R}^{T \setminus N_i} \Big) \leq \mathbb{P}\Big( \bigcup_{i \leq n} (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i} \Big) \leq \sum_{i \leq n} \mathbb{P}\big( (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i} \big) \leq \sum_{i \leq n} \frac{\varepsilon}{2^{i+1}} \leq \frac{\varepsilon}{2}.$$
Since $\mathbb{P}\big( \bigcap_{i \leq n} C_i \times \mathbb{R}^{T \setminus N_i} \big) = \mathbb{P}(B_n) \geq \varepsilon$, this implies that
$$\mathbb{P}\Big( \bigcap_{i \leq n} K_i \times \mathbb{R}^{T \setminus N_i} \Big) \geq \frac{\varepsilon}{2} > 0, \quad\text{where}\quad \bigcap_{i \leq n} K_i \times \mathbb{R}^{T \setminus N_i} = \bar{K}^n \times \mathbb{R}^{T \setminus N_n}, \qquad \bar{K}^n = \bigcap_{i \leq n} \big( K_i \times \mathbb{R}^{N_n \setminus N_i} \big),$$
and, therefore, there exists a point $\sigma^n = (\sigma^n(t))_{t \in N_n} \in \bar{K}^n$. By construction, we also have the following inclusion property: for $m > n$,
$$\sigma^m \in \bar{K}^m \subseteq \bar{K}^n \times \mathbb{R}^{N_m \setminus N_n}$$
and, therefore, $(\sigma^m(t))_{t \in N_n} \in \bar{K}^n$. Any sequence on a compact set has a converging subsequence. Let $(n^1_k)_{k \geq 1}$ be such that
$$\big( \sigma^{n^1_k}(t) \big)_{t \in N_1} \to \big( \sigma(t) \big)_{t \in N_1} \in \bar{K}^1$$
as $k \to \infty$. Then we can take a subsequence $(n^2_k)_{k \geq 1}$ of the sequence $(n^1_k)_{k \geq 1}$ such that
$$\big( \sigma^{n^2_k}(t) \big)_{t \in N_2} \to \big( \sigma(t) \big)_{t \in N_2} \in \bar{K}^2.$$
This result gives us another way to construct a sequence $(X_n)_{n \geq 1}$ of independent random variables with given distributions $(P_n)_{n \geq 1}$, as the coordinate projections on the infinite product space $\mathbb{R}^{\mathbb{N}}$ with the cylindrical (product) $\sigma$-algebra and the infinite product measure $\mathbb{P} = \bigotimes_{n \geq 1} P_n$.
Exercise. Does the set $C([0,1], \mathbb{R})$ of continuous functions on $[0, 1]$ belong to the cylindrical $\sigma$-algebra on $\mathbb{R}^{[0,1]}$? Hint: an exercise in Section 1 might be helpful.
When we toss an unbiased coin many times, we expect the proportion of heads or tails to be close to a half, a phenomenon called the law of large numbers. More generally, if $(X_n)_{n \geq 1}$ is a sequence of independent identically distributed (i.i.d.) random variables, we expect that their average
$$\overline{X}_n = \frac{S_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
is, in some sense, close to the expectation $\mu = \mathbb{E} X_1$, assuming that it exists. In the next section, we will prove a general qualitative result of this type, but we begin with more quantitative statements in special cases. Let us begin with the case of i.i.d. Rademacher random variables $(\varepsilon_n)_{n \geq 1}$ such that
$$\mathbb{P}(\varepsilon_n = 1) = \mathbb{P}(\varepsilon_n = -1) = \frac{1}{2},$$
which is, basically, equivalent to tossing a coin with heads and tails replaced by $\pm 1$. We will need the following.
For any $\lambda \geq 0$, by Markov's inequality applied to $e^{\lambda X}$,
$$\mathbb{P}(X \geq t) \leq e^{-\lambda t}\, \mathbb{E}\, e^{\lambda X}.$$
Theorem 8 (Hoeffding's inequality for Rademacher sums) For any $a_1, \ldots, a_n \in \mathbb{R}$ and $t \geq 0$,
$$\mathbb{P}\Big( \sum_{i=1}^{n} \varepsilon_i a_i \geq t \Big) \leq \exp\Big( -\frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Big).$$
Proof. By the above, for any $\lambda \geq 0$,
$$\mathbb{P}\Big( \sum_{i=1}^{n} \varepsilon_i a_i \geq t \Big) \leq e^{-\lambda t}\, \mathbb{E} \exp\Big( \lambda \sum_{i=1}^{n} \varepsilon_i a_i \Big) = e^{-\lambda t} \prod_{i=1}^{n} \mathbb{E}\, e^{\lambda \varepsilon_i a_i},$$
where in the last step we used Lemma 10. One can easily check the inequality
$$\frac{e^x + e^{-x}}{2} \leq e^{x^2/2},$$
for example, using the Taylor expansions, and, therefore,
$$\mathbb{E} \exp(\lambda \varepsilon_i a_i) = \frac{1}{2} e^{\lambda a_i} + \frac{1}{2} e^{-\lambda a_i} \leq e^{\lambda^2 a_i^2 / 2}$$
and
$$\mathbb{P}\Big( \sum_{i=1}^{n} \varepsilon_i a_i \geq t \Big) \leq \exp\Big( -\lambda t + \frac{\lambda^2}{2} \sum_{i=1}^{n} a_i^2 \Big).$$
Optimizing over $\lambda \geq 0$, i.e. taking $\lambda = t / \sum_{i=1}^{n} a_i^2$, finishes the proof. $\square$
We can also apply the same inequality to $(-\varepsilon_i)$ and, combining both cases,
$$\mathbb{P}\Big( \Big| \sum_{i=1}^{n} \varepsilon_i a_i \Big| \geq t \Big) \leq 2 \exp\Big( -\frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Big).$$
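For Rademacher sums with $a_i = 1$ the tail probability can be computed exactly from binomial counts, so the Hoeffding bound can be checked numerically; the choice $n = 20$ is illustrative.

```python
import math

n = 20

def exact_tail(t):
    """P(sum eps_i >= t): the sum equals 2k - n when exactly k signs are +1."""
    return sum(math.comb(n, k) for k in range(n + 1) if 2 * k - n >= t) / 2 ** n

bound = lambda t: math.exp(-t * t / (2 * n))   # Hoeffding with sum a_i^2 = n
checks = [(t, exact_tail(t), bound(t)) for t in range(0, n + 1, 2)]
```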
Theorem (Hoeffding-Chernoff inequality) Let $X_1, \ldots, X_n$ be i.i.d. random variables with values in $[0, 1]$ and mean $\mu = \mathbb{E} X_1$. Then, for any $t \geq 0$,
$$\mathbb{P}\Big( \sum_{i=1}^{n} X_i \geq n(\mu + t) \Big) \leq e^{-n D(\mu + t \,\|\, \mu)}, \quad\text{where}\quad D(p \,\|\, \mu) = p \log\frac{p}{\mu} + (1 - p) \log\frac{1 - p}{1 - \mu}.$$
Proof. Notice that the probability is zero when $\mu + t > 1$, since the average cannot exceed 1, so $\mu + t \leq 1$ is not really a constraint here. Using the convexity of the exponential function, we can write, for $x \in [0, 1]$,
$$e^{\lambda x} = e^{\lambda (x \cdot 1 + (1 - x) \cdot 0)} \leq x e^{\lambda} + (1 - x) e^{\lambda \cdot 0} = 1 - x + x e^{\lambda},$$
so that $\mathbb{E}\, e^{\lambda X_i} \leq 1 - \mu + \mu e^{\lambda}$ and, as before,
$$\mathbb{P}\Big( \sum_{i=1}^{n} X_i \geq n(\mu + t) \Big) \leq \Big( e^{-\lambda(\mu + t)} \big( 1 - \mu + \mu e^{\lambda} \big) \Big)^n.$$
Setting the derivative of the logarithm in $\lambda$ to zero,
$$-(\mu + t)\big( 1 - \mu + \mu e^{\lambda} \big) + \mu e^{\lambda} = 0, \qquad e^{\lambda} = \frac{(1 - \mu)(\mu + t)}{\mu (1 - \mu - t)} \geq 1,$$
so the critical point $\lambda \geq 0$, as required. Substituting this back into the bound,
$$\mathbb{P}\Big( \sum_{i=1}^{n} X_i \geq n(\mu + t) \Big) \leq \Big( \Big( \frac{\mu}{\mu + t} \Big)^{\mu + t} \Big( \frac{1 - \mu}{1 - \mu - t} \Big)^{1 - \mu - t} \Big)^n = \exp\Big( -n \Big( (\mu + t) \log\frac{\mu + t}{\mu} + (1 - \mu - t) \log\frac{1 - \mu - t}{1 - \mu} \Big) \Big) = e^{-n D(\mu + t \| \mu)}. \qquad\square$$
These inequalities show that, no matter how small $t > 0$ is, the probability that the average $\overline{X}_n$ deviates from the expectation $\mu$ by more than $t$ in either direction decreases exponentially fast with $n$. Of course, the same conclusion applies to any bounded random variables, $|X_i| \leq M$, by shifting and rescaling the interval $[-M, M]$ onto the interval $[0, 1]$.
Even though the Hoeffding-Chernoff inequality applies to all bounded random variables, in real-world applications in engineering, computer science, etc., one would like to improve the control of the probability by incorporating other measures of closeness of the random variable $X$ to the mean $\mu$, for example, the variance $\sigma^2 = \mathbb{E}(X - \mu)^2$. Let us define
$$\varphi(x) = (1 + x) \log(1 + x) - x.$$
We will center the random variables $(X_n)$ and instead work with $Z_n = X_n - \mu$.
Theorem 9 (Bennett's inequality) Let us consider i.i.d. $(Z_n)_{n \geq 1}$ such that $\mathbb{E} Z = 0$, $\mathbb{E} Z^2 = \sigma^2$ and $|Z| \leq M$. Then, for all $t \geq 0$,
$$\mathbb{P}\Big( \frac{1}{n} \sum_{i=1}^{n} Z_i \geq t \Big) \leq \exp\Big( -\frac{n \sigma^2}{M^2}\, \varphi\Big( \frac{t M}{\sigma^2} \Big) \Big).$$
Proof. For $\lambda \geq 0$,
$$\mathbb{E}\, e^{\lambda Z} = \mathbb{E} \sum_{k=0}^{\infty} \frac{(\lambda Z)^k}{k!} = \sum_{k=0}^{\infty} \frac{\lambda^k\, \mathbb{E} Z^k}{k!} = 1 + \sum_{k=2}^{\infty} \frac{\lambda^k}{k!}\, \mathbb{E} Z^2 Z^{k-2} \leq 1 + \sum_{k=2}^{\infty} \frac{\lambda^k}{k!}\, \sigma^2 M^{k-2} = 1 + \frac{\sigma^2}{M^2} \sum_{k=2}^{\infty} \frac{(\lambda M)^k}{k!} = 1 + \frac{\sigma^2}{M^2} \big( e^{\lambda M} - 1 - \lambda M \big) \leq \exp\Big( \frac{\sigma^2}{M^2} \big( e^{\lambda M} - 1 - \lambda M \big) \Big),$$
where in the last inequality we used $1 + x \leq e^x$. Therefore,
$$\mathbb{P}\Big( \sum_{i=1}^{n} Z_i \geq n t \Big) \leq \exp\Big( n \Big( -\lambda t + \frac{\sigma^2}{M^2} \big( e^{\lambda M} - 1 - \lambda M \big) \Big) \Big).$$
Setting the derivative in $\lambda$ to zero,
$$-t + \frac{\sigma^2}{M^2} \big( M e^{\lambda M} - M \big) = 0, \qquad e^{\lambda M} = \frac{t M}{\sigma^2} + 1, \qquad \lambda = \frac{1}{M} \log\Big( 1 + \frac{t M}{\sigma^2} \Big).$$
Since this $\lambda \geq 0$, plugging it into the above bound gives
$$\mathbb{P}\Big( \sum_{i=1}^{n} Z_i \geq n t \Big) \leq \exp\Big( -n \Big( \Big( \frac{t}{M} + \frac{\sigma^2}{M^2} \Big) \log\Big( 1 + \frac{t M}{\sigma^2} \Big) - \frac{t}{M} \Big) \Big) = \exp\Big( -\frac{n \sigma^2}{M^2} \Big( \Big( 1 + \frac{t M}{\sigma^2} \Big) \log\Big( 1 + \frac{t M}{\sigma^2} \Big) - \frac{t M}{\sigma^2} \Big) \Big) = \exp\Big( -\frac{n \sigma^2}{M^2}\, \varphi\Big( \frac{t M}{\sigma^2} \Big) \Big),$$
which finishes the proof. $\square$
To simplify the bound in Bennett's inequality, one can notice that (we leave it as an exercise)
$$\varphi(x) \geq \frac{x^2}{2(1 + x/3)},$$
which implies that
$$\mathbb{P}\Big( \frac{1}{n} \sum_{i=1}^{n} Z_i \geq t \Big) \leq \exp\Big( -\frac{n t^2}{2 (\sigma^2 + t M / 3)} \Big).$$
Combining with the same inequality for $(-Z_i)$, and recalling that $Z_i = X_i - \mu$, we obtain another classical inequality, Bernstein's inequality:
$$\mathbb{P}\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} X_i - \mu \Big| \geq t \Big) \leq 2 \exp\Big( -\frac{n t^2}{2 (\sigma^2 + t M / 3)} \Big).$$
For small $t$, the denominator is of order $2\sigma^2$, and we get better control of the probability when the variance is small.
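The advantage of variance-aware bounds is easy to see numerically. A sketch comparing the two-sided Hoeffding-type bound $2e^{-2nt^2}$ (valid for variables in $[0,1]$, so $M = 1$ after centering) with Bernstein's bound; the values of $n$, $t$ and $\sigma^2$ are illustrative:

```python
import math

def hoeffding(n, t):
    return 2.0 * math.exp(-2.0 * n * t * t)

def bernstein(n, t, sigma2, M=1.0):
    return 2.0 * math.exp(-n * t * t / (2.0 * (sigma2 + t * M / 3.0)))

n, t = 1000, 0.05
small_var = bernstein(n, t, sigma2=0.01)   # variance much smaller than 1/4
large_var = bernstein(n, t, sigma2=0.25)   # variance at the maximum for [0, 1]
h = hoeffding(n, t)
```

With a small variance Bernstein's bound is dramatically smaller, while at the maximal variance $1/4$ the Hoeffding-type bound is the better of the two.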
Consider independent random variables $X_1, \ldots, X_n$ and a function $f = f(X_1, \ldots, X_n)$ that satisfies the stability condition
$$\big| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \big| \leq a_i$$
for some constants $a_1, \ldots, a_n$. This means that modifying the $i$-th coordinate of $f$ can change its value by not more than $a_i$. Let us begin with the following observation: if a random variable $X$ takes values in $[-1, 1]$ and $\mathbb{E} X = 0$ then, by the convexity of the exponential function,
$$e^{\lambda X} \leq \frac{1 + X}{2}\, e^{\lambda} + \frac{1 - X}{2}\, e^{-\lambda},$$
and taking expectations we get $\mathbb{E}\, e^{\lambda X} \leq \mathrm{ch}(\lambda) \leq e^{\lambda^2/2}$. $\square$
Using this, we will now prove the following analogue of Hoeffding's inequality.
Theorem 10 (Azuma's inequality) Under the above stability condition, for any $t \geq 0$,
$$\mathbb{P}\big( f - \mathbb{E} f \geq t \big) \leq \exp\Big( -\frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Big).$$
Proof. For $i = 1, \ldots, n$, let $\mathbb{E}_i$ denote the expectation in $X_{i+1}, \ldots, X_n$ with the random variables $X_1, \ldots, X_i$ fixed. One can think of $(X_1, \ldots, X_n)$ as defined on a product space with the product measure, and $\mathbb{E}_i$ denotes the integration over the last $n - i$ coordinates. Let us denote $Y_i = \mathbb{E}_i f - \mathbb{E}_{i-1} f$ and note that $\mathbb{E}_n f = f$ and $\mathbb{E}_0 f = \mathbb{E} f$. Then we can write $f - \mathbb{E} f = \sum_{i=1}^{n} Y_i$ (this is called the martingale-difference representation) and, as before, for $\lambda \geq 0$,
$$\mathbb{P}\big( f - \mathbb{E} f \geq t \big) = \mathbb{P}\Big( \sum_{i=1}^{n} Y_i \geq t \Big) \leq e^{-\lambda t}\, \mathbb{E}\, e^{\lambda(Y_1 + \cdots + Y_n)}.$$
Notice that $\mathbb{E}_{i-1} Y_i = \mathbb{E}_{i-1} f - \mathbb{E}_{i-1} f = 0$. Also, the stability condition implies that $|Y_i| \leq a_i$. Since $Y_1, \ldots, Y_{n-1}$ do not depend on $X_n$ (only $Y_n$ does), if we average in $X_n$ first, we can write
$$\mathbb{E}\, e^{\lambda(Y_1 + \cdots + Y_n)} = \mathbb{E}\big( e^{\lambda(Y_1 + \cdots + Y_{n-1})}\, \mathbb{E}_{n-1} e^{\lambda Y_n} \big) \leq e^{\lambda^2 a_n^2 / 2}\, \mathbb{E}\, e^{\lambda(Y_1 + \cdots + Y_{n-1})},$$
where we used the above observation applied to $Y_n / a_n$. Proceeding by induction on $n$, we get $\mathbb{E}\, e^{\lambda(Y_1 + \cdots + Y_n)} \leq e^{\frac{\lambda^2}{2} \sum_{i=1}^{n} a_i^2}$ and
$$\mathbb{P}\Big( \sum_{i=1}^{n} Y_i \geq t \Big) \leq \exp\Big( -\lambda t + \frac{\lambda^2}{2} \sum_{i=1}^{n} a_i^2 \Big).$$
Optimizing over $\lambda \geq 0$ finishes the proof. $\square$
Example. Consider an Erdos-Renyi random graph $G(n, p)$ on $n$ vertices, where each edge is present with probability $p$ independently of the other edges. Let $f = \chi(G(n, p))$ be the chromatic number of this graph, which is the smallest number of colors needed to color the vertices so that no two adjacent vertices share the same color. Let us denote the vertices by $v_1, \ldots, v_n$ and let $X_i$ denote the randomness in the set of possible edges between the vertex $v_i$ and $v_1, \ldots, v_{i-1}$. In other words, $X_i = (X_1^i, \ldots, X_{i-1}^i)$, where $X_k^i$ is 1 if the edge between $v_k$ and $v_i$ is present and 0 otherwise. Notice that the vectors $X_1, \ldots, X_n$ are independent and the chromatic number is clearly a function $f = f(X_1, \ldots, X_n)$. To apply Azuma's inequality, we need to determine the stability constants $a_1, \ldots, a_n$. Observe that changing the set of edges connected to one vertex $v_i$ can only affect the chromatic number by at most 1 because, in the worst case, we can assign a new color to this vertex. This means that $a_i = 1$ and Azuma's inequality implies that
$$\mathbb{P}\big( \big| \chi(G(n, p)) - \mathbb{E}\chi(G(n, p)) \big| \geq t \big) \leq 2 e^{-\frac{t^2}{2n}}$$
(we simply apply the inequality to $f$ and $-f$ here). For example, if we take $t = \sqrt{2 n \log n}$, we get the bound $2/n$, which means that with high probability the chromatic number will be within $\sqrt{2 n \log n}$ of its expected value $\mathbb{E}\chi(G(n, p))$. When $p$ is fixed, it is known (but non-trivial) that this expected value is close to $c_p\, n / \log n$ with $c_p = \frac{1}{2} \log\frac{1}{1 - p}$, so the deviation $\sqrt{2 n \log n}$ is of much smaller order.
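Since computing chromatic numbers is expensive, here is a cheaper empirical sketch of the same bounded-differences phenomenon, with an illustrative function: the number of distinct values among $n$ independent draws changes by at most 1 when one draw changes, so $a_i = 1$ and Azuma gives $\mathbb{P}(|f - \mathbb{E} f| \geq t) \leq 2 e^{-t^2/(2n)}$. The sample sizes are illustrative choices.

```python
import math, random

rng = random.Random(3)
n, alphabet, reps = 50, 20, 4000

# f = number of distinct symbols among n draws from a 20-letter alphabet.
vals = [len({rng.randrange(alphabet) for _ in range(n)}) for _ in range(reps)]
mean_f = sum(vals) / reps

t = 15.0
empirical = sum(abs(v - mean_f) >= t for v in vals) / reps
azuma = 2.0 * math.exp(-t * t / (2 * n))   # two-sided Azuma bound with a_i = 1
```

In simulations the empirical deviation frequency is far below the Azuma bound, reflecting that the worst-case constants $a_i = 1$ are rarely tight.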
Exercise (Hoeffding-Chernoff's inequality). Prove that for $0 < \mu \leq 1/2$ and $0 \leq t < 1 - \mu$,
$$D(\mu + t \,\|\, \mu) \geq \frac{t^2}{2 (1 - \mu)}.$$
Exercise. Let $X_1, \ldots, X_n$ be independent flips of a fair coin, i.e. $\mathbb{P}(X_i = 0) = \mathbb{P}(X_i = 1) = 1/2$. If $\overline{X}_n$ is their average, show that for $t \geq 0$,
$$\mathbb{P}\big( |\overline{X}_n - 1/2| > t \big) \leq 2 e^{-2 n t^2}.$$
Exercise. Suppose that the random variables $X_1, \ldots, X_n, X_1', \ldots, X_n'$ are independent and, for all $i \leq n$, $X_i$ and $X_i'$ have the same distribution. Prove that
$$\mathbb{P}\Big( \sum_{i=1}^{n} (X_i - X_i') \geq 2 t \Big( \sum_{i=1}^{n} (X_i - X_i')^2 \Big)^{1/2} \Big) \leq e^{-2 t^2}.$$
Hint: think about a way to introduce Rademacher random variables $\varepsilon_i$ into the problem and then use Hoeffding's inequality.
Exercise (Bernstein's inequality). Let $X_1, \ldots, X_n$ be i.i.d. random variables such that $|X_i| \leq M$, $\mathbb{E} X = \mu$ and $\mathrm{var}(X) = \sigma^2$. If $\overline{X}_n$ is their average, make a change of variables in the Bernstein inequality to show that for $t > 0$,
$$\mathbb{P}\Big( \overline{X}_n \geq \mu + \sqrt{\frac{2 \sigma^2 t}{n}} + \frac{2 M t}{3 n} \Big) \leq e^{-t}.$$
In this section, we will study two types of convergence of the average to the mean: in probability and almost surely. Consider a sequence of random variables $(Y_n)_{n \geq 1}$ on some probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We say that $Y_n$ converges in probability to a random variable $Y$ (and write $Y_n \xrightarrow{\,p\,} Y$) if, for all $\varepsilon > 0$,
$$\lim_{n \to \infty} \mathbb{P}\big( |Y_n - Y| \geq \varepsilon \big) = 0.$$
Theorem 11 (Weak law of large numbers) Consider a sequence of random variables $(X_n)_{n\ge1}$ that are centered, $EX_n = 0$, have uniformly bounded second moments, $EX_n^2 \le K < \infty$, and are uncorrelated, $EX_i X_j = 0$ for $i \ne j$. Then
$$ \bar X_n = \frac{1}{n} \sum_{i \le n} X_i \to 0 $$
in probability.
Proof. By Chebyshev's inequality,
$$ P\big( |\bar X_n - 0| \ge \varepsilon \big) = P\big( \bar X_n^2 \ge \varepsilon^2 \big) \le \frac{E \bar X_n^2}{\varepsilon^2} = \frac{1}{n^2 \varepsilon^2}\, E(X_1 + \cdots + X_n)^2 = \frac{1}{n^2 \varepsilon^2} \sum_{i=1}^n E X_i^2 \le \frac{nK}{n^2 \varepsilon^2} = \frac{K}{n \varepsilon^2} \to 0, $$
which finishes the proof. ⊓⊔
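The Chebyshev bound $K/(n\varepsilon^2)$ from this proof can be watched in action with a quick simulation; here the $X_i = \pm 1$ signs (centered, uncorrelated, $K = 1$) and all constants are arbitrary illustrative choices:

```python
import random

random.seed(1)

def avg_of_signs(n):
    # X_i = +/-1 with probability 1/2: centered, E X_i^2 = K = 1, uncorrelated.
    return sum(random.choice((-1, 1)) for _ in range(n)) / n

eps, trials = 0.1, 1000
probs = {}
for n in (100, 1000):
    hits = sum(abs(avg_of_signs(n)) >= eps for _ in range(trials))
    probs[n] = hits / trials
# Chebyshev bound from the proof: P(|avg| >= eps) <= K / (n * eps^2).
bounds = {n: 1 / (n * eps ** 2) for n in probs}
```

The empirical probabilities decrease with $n$ and stay below the corresponding bounds, consistent with convergence in probability.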
Of course, if (Xn )n1 are independent then they are automatically uncorrelated, since
EXi X j = EXi EX j = 0.
Before we move on to the almost sure convergence, let us give one more application of the above argument to the
problem of approximation of continuous functions. Consider an i.i.d. sequence $(X_n)_{n\ge1}$ with a distribution $P_\theta$ on $\mathbb{R}$ that depends on some parameter $\theta \in \mathbb{R}$, and suppose that
$$ E X_i = \theta, \qquad \mathrm{Var}(X_i) = \sigma^2(\theta). $$
$$ \prod_{i\ge1} (1 - p_i) = 0 \iff \sum_{i\ge1} p_i = +\infty. $$
$\Longrightarrow$: We can assume that $p_i \le 1/2$ for $i \ge m$ for large enough $m$ because, otherwise, the series obviously diverges. Since $1 - p \ge e^{-2p}$ for $p \le 1/2$, we have
$$ \prod_{i \ge m} (1 - p_i) \ge \exp\Big( -2 \sum_{i \ge m} p_i \Big), $$
the event that $A_n$ occurs infinitely often, which consists of all $\omega$ that belong to infinitely many events in the sequence $(A_n)_{n\ge1}$. Then the following holds.
by Lemma 13, since $\sum_{m\ge n} P(A_m) = +\infty$. Therefore, $P(B_n) = 1$ and $P(A_n \text{ i.o.}) = P\big( \bigcap_{n\ge1} B_n \big) = 1$. ⊓⊔
Let us show how this implies the strong law of large numbers for bounded random variables. Recall that, in the case when
$$ \mu = EX_1, \qquad \sigma^2 = \mathrm{Var}(X_1) \qquad \text{and} \qquad |X_1| \le M, $$
the average satisfies Bernstein's inequality proved in the last section,
$$ P\big( |\bar X_n - \mu| \ge t \big) \le 2 \exp\Big( -\frac{n t^2}{2(\sigma^2 + tM/3)} \Big). $$
$$ P\big( |\bar X_n - \mu| \ge t_n \big) \le 2 \exp\Big( -\frac{n t_n^2}{4\sigma^2} \Big) = \frac{2}{n^2}. $$
Since the series $\sum_{n\ge1} 2n^{-2}$ converges, the Borel–Cantelli lemma implies that
$$ P\big( |\bar X_n - \mu| \ge t_n \text{ i.o.} \big) = 0. $$
This means that, for $P$-almost all $\omega$, the difference $|\bar X_n(\omega) - \mu|$ will become smaller than $t_n$ for large enough $n \ge n_0(\omega)$. If we recall the definition of almost sure convergence, this means that $\bar X_n$ converges to $\mu$ almost surely: the so-called strong law of large numbers. Next, we will show that this holds even for unbounded random variables under the minimal assumption that the expectation $\mu = EX_1$ exists.
Strong law of large numbers. The following simple observation will be useful: if a random variable $X \ge 0$ then
$$ EX = \int_0^\infty P(X \ge x)\, dx. $$
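This tail-integral identity is easy to verify numerically. The sketch below (using an Exponential(1) random variable, an arbitrary choice with $P(X \ge x) = e^{-x}$ and $EX = 1$) compares a Monte Carlo estimate of $EX$ with a Riemann sum of the tail integral:

```python
import math
import random

random.seed(2)

# Check E X = integral_0^inf P(X >= x) dx for X ~ Exponential(1),
# where P(X >= x) = e^{-x} and both sides equal 1.
samples = [random.expovariate(1.0) for _ in range(100_000)]
mean_estimate = sum(samples) / len(samples)

# Riemann sum of the tail integral of e^{-x} on a truncated grid.
dx = 0.001
tail_integral = sum(math.exp(-k * dx) for k in range(int(20 / dx))) * dx
```

Both quantities come out close to $1$, as the identity predicts.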
As before, let $(X_n)_{n\ge1}$ be i.i.d. random variables on the same probability space.
$$ \bar X_n = \frac{1}{n} \sum_{i=1}^n X_i \to EX_1 \quad \text{almost surely.} $$
Step 1. First, without loss of generality we can assume that $X_i \ge 0$. Indeed, for signed random variables we can decompose $X_i = X_i^+ - X_i^-$, so that
$$ \frac{1}{n} \sum_{i=1}^n X_i = \frac{1}{n} \sum_{i=1}^n X_i^+ - \frac{1}{n} \sum_{i=1}^n X_i^- \to E X_1^+ - E X_1^- = E X_1. $$
Step 2. (Truncation) Next, we can replace $X_i$ by $Y_i = X_i\, I(X_i \le i)$ using the Borel–Cantelli lemma. Since
Step 3. (Limits over subsequences) We will first prove almost sure convergence along the subsequences of the type $n(k) = \lfloor \alpha^k \rfloor$ for $\alpha > 1$. For any $\varepsilon > 0$,
$$ \sum_{k\ge1} P\big( |T_{n(k)} - ET_{n(k)}| \ge \varepsilon\, n(k) \big) \le \frac{1}{\varepsilon^2} \sum_{k\ge1} n(k)^{-2}\, \mathrm{Var}(T_{n(k)}) = \frac{1}{\varepsilon^2} \sum_{k\ge1} n(k)^{-2} \sum_{i \le n(k)} \mathrm{Var}(Y_i) $$
$$ \le \frac{1}{\varepsilon^2} \sum_{k\ge1} n(k)^{-2} \sum_{i \le n(k)} E Y_i^2 = \frac{1}{\varepsilon^2} \sum_{i\ge1} E Y_i^2 \sum_{k:\, n(k) \ge i} n(k)^{-2}. $$
Since $n(k) = \lfloor \alpha^k \rfloor \ge \alpha^k / 2$, the inner sum can be bounded by
$$ \sum_{k:\, n(k) \ge i} n(k)^{-2} \le \sum_{k:\, \alpha^k \ge i} 4\, \alpha^{-2k} \le \frac{4}{1 - \alpha^{-2}}\, \frac{1}{i^2} =: \frac{K_\alpha}{i^2}, $$
and, since $E Y_i^2 = \int_0^i x^2\, dF(x)$,
$$ \sum_{k\ge1} P\big( |T_{n(k)} - ET_{n(k)}| \ge \varepsilon\, n(k) \big) \le \frac{K_\alpha}{\varepsilon^2} \sum_{i\ge1} \frac{E Y_i^2}{i^2} = \frac{K_\alpha}{\varepsilon^2} \sum_{i\ge1} \frac{1}{i^2} \int_0^i x^2\, dF(x). $$
We proved that
$$ \sum_{k\ge1} P\big( |T_{n(k)} - ET_{n(k)}| \ge \varepsilon\, n(k) \big) < \infty $$
Exercise. Suppose that random variables $(X_n)_{n\ge1}$ are i.i.d. such that $E|X_1|^p < \infty$ for some $p > 0$. Show that $\max_{i\le n} |X_i| / n^{1/p}$ goes to zero in probability.
Exercise. (Weak LLN for U-statistics) If $(X_n)_{n\ge1}$ are i.i.d. such that $EX_1 = \mu$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$, show that
$$ \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} X_i X_j \to \mu^2 $$
in probability as $n \to \infty$.
Exercise. If $u : [0,1]^k \to \mathbb{R}$ is continuous then show that
$$ \sum_{0 \le j_1, \ldots, j_k \le n} u\Big( \frac{j_1}{n}, \ldots, \frac{j_k}{n} \Big) \prod_{i \le k} \binom{n}{j_i}\, x_i^{j_i} (1 - x_i)^{n - j_i} \to u(x_1, \ldots, x_k). $$
Exercise. Suppose that $(X_n)_{n\ge1}$ are i.i.d. with $EX_1^+ < \infty$ and $EX_1^- = \infty$. Show that $S_n/n \to -\infty$ almost surely.
Additional exercise. Suppose $(X_n)_{n\ge1}$ are i.i.d. standard normal. Prove that
$$ P\Big( \limsup_{n\to\infty} \frac{|X_n|}{\sqrt{2 \log n}} = 1 \Big) = 1. $$
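A closely related quantity, the running maximum $\max_{i\le n} |X_i|$, is typically close to $\sqrt{2\log n}$, which makes the scaling in this exercise plausible. The following sketch (sample size and seed are arbitrary) illustrates this:

```python
import math
import random

random.seed(3)

# The largest of the first n standard normals is typically close to
# sqrt(2 log n), matching the scaling in the lim sup statement.
n = 100_000
running_max = max(abs(random.gauss(0.0, 1.0)) for _ in range(n))
ratio = running_max / math.sqrt(2 * math.log(n))
```

For $n = 10^5$ the normalizer $\sqrt{2\log n} \approx 4.8$, and the observed ratio lands near $1$.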
Consider a sequence $(X_i)_{i\ge1}$ of random variables on the same probability space $(\Omega, \mathcal{F}, P)$ and let $\sigma((X_i)_{i\ge1})$ be the $\sigma$-algebra generated by this sequence. An event $A \in \sigma((X_i)_{i\ge1})$ is called a tail event if $A \in \sigma((X_i)_{i\ge n})$ for all $n \ge 1$. In other words, any
$$ A \in \mathcal{T} = \bigcap_{n\ge1} \sigma\big( (X_i)_{i\ge n} \big) $$
is a tail event. It turns out that, when $(X_i)_{i\ge1}$ are independent, all tail events have probability 0 or 1.
Theorem 14 (Kolmogorov's 0–1 law) If $(X_i)_{i\ge1}$ are independent then $P(A) = 0$ or $1$ for all $A \in \mathcal{T}$.
Proof. For a finite subset $F = \{i_1, \ldots, i_n\} \subseteq \mathbb{N}$, let us denote $X_F = (X_{i_1}, \ldots, X_{i_n})$. The $\sigma$-algebra $\sigma((X_i)_{i\ge1})$ is generated by the algebra of sets
This is an algebra, because any set operations on a finite number of such sets can again be expressed in terms of finitely many random variables $X_i$. By the Approximation Lemma, we can approximate any set $A \in \sigma((X_i)_{i\ge1})$ by sets in $\mathcal{A}$. Therefore, for any $\varepsilon > 0$, there exists a set $A_0 \in \mathcal{A}$ such that $P(A \,\triangle\, A_0) \le \varepsilon$ and, therefore,
By definition, $A_0 \in \sigma(X_1, \ldots, X_n)$ for large enough $n$ and, since $A$ is a tail event, $A \in \sigma((X_i)_{i\ge n+1})$. The Grouping Lemma implies that $A$ and $A_0$ are independent and $P(A \cap A_0) = P(A) P(A_0)$. Finally, we get
This set is independent of $A_n$, so $P(A_n \cap A_n') = P(A_n) P(A_n')$. Given $x = (x_1, x_2, \ldots) \in \mathbb{R}^{\mathbb{N}}$, let us define an operator
$$ \pi x = (x_{n+1}, \ldots, x_{2n}, x_1, \ldots, x_n, x_{2n+1}, \ldots) $$
that switches the first $n$ coordinates with the second $n$ coordinates. Denote $X = (X_i)_{i\ge1}$ and recall that $A = \{X \in B\}$ for some symmetric set $B \subseteq \mathbb{R}^{\mathbb{N}}$ that, by definition, satisfies $\{x : \pi x \in B\} = B$. Now we can write
$$ P(A_n' \,\triangle\, A) = P\big( \{(X_{n+1}, \ldots, X_{2n}) \in B_n\} \,\triangle\, \{X \in B\} \big) = P\big( \{(X_{n+1}, \ldots, X_{2n}) \in B_n\} \,\triangle\, \{\pi X \in B\} \big) $$
$$ \{\text{using that the } X_i\text{'s are i.i.d.}\} = P\big( \{(X_1, \ldots, X_n) \in B_n\} \,\triangle\, \{X \in B\} \big) = P(A_n \,\triangle\, A) \le \varepsilon. $$
This implies that $P\big( (A_n \cap A_n') \,\triangle\, A \big) \le 2\varepsilon$ and we can conclude that
Theorem 16 (Kolmogorov's inequality) Suppose that $(X_i)_{i\ge1}$ are independent and $S_n = X_1 + \ldots + X_n$. If, for all $j \le n$,
$$ P\big( |S_n - S_j| \ge a \big) \le p < 1, \tag{6.0.1} $$
then, for $x > a$,
$$ P\Big( \max_{1 \le j \le n} |S_j| \ge x \Big) \le \frac{1}{1-p}\, P\big( |S_n| > x - a \big). $$
The equality in the middle holds because the events $\{|S_j| \ge x\}$ and $\{|S_n - S_j| < a\}$ are independent, since the first depends only on $X_1, \ldots, X_j$ and the second only on $X_{j+1}, \ldots, X_n$. The last inequality holds because, if $|S_n - S_j| < a$ and $|S_j| \ge x$ then, by the triangle inequality, $|S_n| > x - a$.
To deal with the maximum, instead of looking at an arbitrary partial sum $S_j$, we will look at the first partial sum that crosses the level $x$. We define this first time by
$$ \tau = \min\big\{ j \le n : |S_j| \ge x \big\} $$
and let $\tau = n+1$ if all $|S_j| < x$. Notice that the event $\{\tau = j\}$ also depends only on $X_1, \ldots, X_j$, so we can again write
$$ (1-p)\, P(\tau = j) \le P\big( |S_n - S_j| < a \big)\, P(\tau = j) = P\big( |S_n - S_j| < a,\ \tau = j \big) \le P\big( |S_n| > x - a,\ \tau = j \big). $$
Proof. The "only if" direction is obvious, so we only need to prove the "if" part. Since the sequence $M_n$ is decreasing, it converges to some limit $M_n \downarrow M \ge 0$ everywhere. Since, for all $\varepsilon > 0$,
$$ P(M \ge \varepsilon) \le P(M_n \ge \varepsilon) \to 0 \quad \text{as } n \to \infty, $$
this means that $P(M = 0) = 1$ and $M_n \to 0$ almost surely. Of course, this implies that $Y_n \to Y$ almost surely. ⊓⊔
Proof. Suppose that the partial sums $S_n$ converge to some random variable $S$ in probability, i.e., for any $\varepsilon > 0$, for large enough $n \ge n_0(\varepsilon)$ we have $P(|S_n - S| \ge \varepsilon) \le \varepsilon$. If $k \ge j \ge n \ge n_0(\varepsilon)$ then
Next, we apply Kolmogorov's inequality with $x = 4\varepsilon$, $a = 2\varepsilon$ and $p = 2\varepsilon$ to the partial sums $X_{n+1} + \ldots + X_j$ to get
$$ P\Big( \max_{n \le j \le k} |S_j - S_n| \ge 4\varepsilon \Big) \le \frac{1}{1 - 2\varepsilon}\, P\big( |S_k - S_n| \ge 2\varepsilon \big) \le \frac{2\varepsilon}{1 - 2\varepsilon} \le 3\varepsilon, $$
for small $\varepsilon$. The events $\{ \max_{n \le j \le k} |S_j - S_n| \ge 4\varepsilon \}$ are increasing as $k \uparrow \infty$ and, by the continuity of measure,
$$ P\Big( \max_{n \le j} |S_j - S_n| \ge 4\varepsilon \Big) \le 3\varepsilon. $$
This means that the maximum $\max_{n \le j} |S_j - S_n| \to 0$ in probability and, by the previous lemma, $S_n \to S$ almost surely. ⊓⊔
Let us give one easy-to-check criterion for convergence of random series. Again, we will need one auxiliary result.
Lemma 16 A random sequence $(Y_n)_{n\ge1}$ converges in probability to some limit $Y$ if and only if it is Cauchy in probability, which means that
$$ \lim_{n,m\to\infty} P\big( |Y_n - Y_m| \ge \varepsilon \big) = 0 $$
for all $\varepsilon > 0$.
Proof. Again, the "only if" direction is obvious and we only need to prove the "if" part. Given $\varepsilon = l^{-2}$, we can find $m(l)$ large enough such that, for $n, m \ge m(l)$,
$$ P\Big( |Y_n - Y_m| \ge \frac{1}{l^2} \Big) \le \frac{1}{l^2}. \tag{6.0.2} $$
Without loss of generality, we can assume that $m(l+1) \ge m(l)$ so that
$$ P\Big( |Y_{m(l+1)} - Y_{m(l)}| \ge \frac{1}{l^2} \Big) \le \frac{1}{l^2}. $$
Then,
$$ \sum_{l\ge1} P\Big( |Y_{m(l+1)} - Y_{m(l)}| \ge \frac{1}{l^2} \Big) \le \sum_{l\ge1} \frac{1}{l^2} < \infty $$
and, by the Borel–Cantelli lemma,
$$ P\Big( |Y_{m(l+1)} - Y_{m(l)}| \ge \frac{1}{l^2} \text{ i.o.} \Big) = 0. $$
As a result, for large enough (random) $l$ and for $k > l$,
$$ |Y_{m(k)} - Y_{m(l)}| \le \sum_{i \ge l} \frac{1}{i^2} \le \frac{1}{l-1} < \varepsilon. $$
This means that, with probability one, $(Y_{m(l)})_{l\ge1}$ is a Cauchy sequence and there exists an almost sure limit $Y = \lim_{l\to\infty} Y_{m(l)}$. Together with (6.0.2) this implies that $Y_n \to Y$ in probability. ⊓⊔
as $n, m \to \infty$, since the series $\sum_{i\ge1} E X_i^2$ converges. This means that $(S_n)_{n\ge1}$ is Cauchy in probability and, by the previous lemma, $S_n$ converges to some limit $S$ in probability. ⊓⊔
Example. Consider the random series $\sum_{i\ge1} \varepsilon_i / i^\alpha$ where $P(\varepsilon_i = \pm 1) = 1/2$. We have
$$ \sum_{i\ge1} E\Big( \frac{\varepsilon_i}{i^\alpha} \Big)^2 = \sum_{i\ge1} \frac{1}{i^{2\alpha}} < \infty \quad \text{if } \alpha > \frac{1}{2}, $$
This follows immediately from Theorem 18 and the following well-known calculus lemma.
Lemma 17 (Kronecker's lemma) Suppose that a sequence $(b_i)_{i\ge1}$ is such that all $b_i > 0$ and $b_i \uparrow \infty$. Given another sequence $(x_i)_{i\ge1}$, if the series $\sum_{i\ge1} x_i / b_i$ converges then $\lim_{n\to\infty} b_n^{-1} \sum_{i=1}^n x_i = 0$.
Proof. Because the series converges, $r_n := \sum_{i \ge n+1} x_i / b_i \to 0$ as $n \to \infty$. Notice that we can write $x_n = b_n (r_{n-1} - r_n)$ and, therefore,
$$ \sum_{i=1}^n x_i = \sum_{i=1}^n b_i (r_{i-1} - r_i) = \sum_{i=1}^{n-1} (b_{i+1} - b_i) r_i + b_1 r_0 - b_n r_n. $$
Since $r_i \to 0$, given $\varepsilon > 0$, we can find $n_0$ such that for $i \ge n_0$ we have $|r_i| \le \varepsilon$ and $\big| \sum_{i=n_0+1}^{n-1} (b_{i+1} - b_i) r_i \big| \le \varepsilon b_n$. Therefore,
$$ \Big| \frac{1}{b_n} \sum_{i=1}^n x_i \Big| \le \frac{1}{b_n} \Big| \sum_{i=1}^{n_0} (b_{i+1} - b_i) r_i \Big| + \varepsilon + \frac{1}{b_n}\, b_1 |r_0| + |r_n|. $$
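Kronecker's lemma is easy to watch numerically. In the sketch below the choices $b_i = i$ and $x_i = \cos(i)$ are arbitrary illustrations: the series $\sum_i \cos(i)/i$ converges by Dirichlet's test, so the lemma predicts that the plain averages of $\cos(i)$ tend to $0$:

```python
import math

# Numeric illustration of Kronecker's lemma with b_i = i and x_i = cos(i):
# sum_i cos(i)/i converges (Dirichlet's test), so the lemma predicts
# (1/n) * sum_{i<=n} cos(i) -> 0 as n grows.
def kronecker_average(n):
    return sum(math.cos(i) for i in range(1, n + 1)) / n

values = [abs(kronecker_average(n)) for n in (10, 100, 1000, 10_000)]
```

Here the partial sums of $\cos(i)$ stay bounded, so the averages decay like $O(1/n)$, faster than the lemma alone guarantees.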
A couple of examples of application of Kolmogorov's strong law of large numbers will be given in the exercises below.
Exercise. Let $\{S_n : n \ge 0\}$ be a simple random walk which starts at zero, $S_0 = 0$, and at each step moves to the right with probability $p$ and to the left with probability $1-p$. Show that the event $\{S_n = 0 \text{ i.o.}\}$ has probability 0 or 1. (Hint: Hewitt–Savage 0–1 law.)
Exercise. In the setting of the previous problem, show: (a) if $p \ne 1/2$ then $P(S_n = 0 \text{ i.o.}) = 0$; (b) if $p = 1/2$ then $P(S_n = 0 \text{ i.o.}) = 1$. Hint: use the fact that the events
$$ \Big\{ \liminf_{n\to\infty} S_n \le -1/2 \Big\}, \qquad \Big\{ \limsup_{n\to\infty} S_n \ge 1/2 \Big\} $$
are exchangeable.
Exercise. Suppose that $(X_n)_{n\ge1}$ are i.i.d. with $EX_1 = 0$ and $EX_1^2 = 1$. Prove that, for $\varepsilon > 0$,
$$ \frac{1}{n^{1/2} (\log n)^{1/2+\varepsilon}} \sum_{i=1}^n X_i \to 0 $$
almost surely.
Exercise. Let $(X_n)$ be i.i.d. random variables with a continuous distribution $F$. We say that $X_n$ is a record value if $X_n > X_i$ for $i < n$. Let $I_n$ be the indicator of the event that $X_n$ is a record value.
(a) Show that the random variables $(I_n)_{n\ge1}$ are independent and $P(I_n = 1) = 1/n$. Hint: if $R_n \in \{1, \ldots, n\}$ is the rank of $X_n$ among the first $n$ random variables $(X_i)_{i\le n}$, prove that $(R_n)$ are independent.
(b) If $S_n = I_1 + \ldots + I_n$ is the number of records up to time $n$, prove that $S_n / \log n \to 1$ almost surely. Hint: use Kolmogorov's strong law of large numbers.
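The record-counting claim in part (b) can be previewed by simulation; the sketch below (uniform samples, arbitrary size and seed) counts records in one long run and compares with $\log n$:

```python
import math
import random

random.seed(4)

# Count record values in an i.i.d. continuous sample. Part (a) gives
# E S_n = sum_{i<=n} 1/i, which is close to log n, and part (b) says
# S_n / log n -> 1 almost surely.
n = 100_000
records, best = 0, float("-inf")
for _ in range(n):
    x = random.random()
    if x > best:
        records += 1
        best = x
ratio = records / math.log(n)
```

Even for $n = 10^5$ only about a dozen records occur, so the ratio fluctuates noticeably around $1$; the almost sure convergence is slow, of order $1/\sqrt{\log n}$.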
Exercise. Suppose that two sequences of random variables $X_n : \Omega_1 \to \mathbb{R}$ and $Y_n : \Omega_2 \to \mathbb{R}$ for $n \ge 1$, on two different probability spaces, have the same distributions in the sense that all their finite-dimensional distributions are the same, $\mathcal{L}((X_i)_{i\le n}) = \mathcal{L}((Y_i)_{i\le n})$ for all $n \ge 1$. If $X_n$ converges almost surely to some random variable $X$ on $\Omega_1$ as $n \to \infty$, prove that $Y_n$ also converges almost surely on $\Omega_2$.
In this section, we will have our first encounter with two concepts, stopping times and Markov property, in the setting of
the sums of independent random variables. Later, stopping times will play an important role in the study of martingales,
and Markov property will appear again in the setting of the Brownian motion.
Consider a sequence $(X_i)_{i\ge1}$ of independent random variables and an integer-valued random variable $\tau \in \{1, 2, \ldots\}$. We say that $\tau$ is independent of the future if $\{\tau \le n\}$ is independent of $\sigma((X_i)_{i\ge n+1})$. Suppose that $\tau$ is independent of the future and $E|X_i| < \infty$ for all $i \ge 1$. We can formally write
$$ E S_\tau = \sum_{k\ge1} E\, S_\tau I(\tau = k) = \sum_{k\ge1} E\, S_k I(\tau = k) = \sum_{k\ge1} \sum_{n \le k} E\, X_n I(\tau = k) \overset{(*)}{=} \sum_{n\ge1} \sum_{k \ge n} E\, X_n I(\tau = k) = \sum_{n\ge1} E\, X_n I(\tau \ge n). $$
In $(*)$ we can interchange the order of summation if, for example, the double sequence is absolutely summable, by the Fubini–Tonelli theorem. Since $\tau$ is independent of the future, the event $\{\tau \ge n\} = \{\tau \le n-1\}^c$ is independent of $\sigma(X_n)$ and we get
$$ E S_\tau = \sum_{n\ge1} E X_n\, P(\tau \ge n). \tag{7.0.1} $$
Theorem 20 (Wald's identity) If $(X_i)_{i\ge1}$ are i.i.d., $E|X_1| < \infty$ and $E\tau < \infty$, then $E S_\tau = E X_1\, E\tau$.
The reason we can interchange the order of summation in $(*)$ is that, under our assumptions, the double sequence is absolutely summable, since
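Wald's identity lends itself to a quick Monte Carlo check. In the sketch below the particular distributions (Exponential(1) summands and a geometric $\tau$ drawn independently of the $X_i$, so that $\tau$ is independent of the future) are arbitrary choices for illustration:

```python
import random

random.seed(5)

# Monte Carlo check of Wald's identity E S_tau = E X_1 * E tau, with
# X_i ~ Exponential(1) (so E X_1 = 1) and tau ~ Geometric(p) drawn
# independently of the X_i (hence independent of the future).
p = 0.2          # E tau = 1 / p = 5
trials = 20_000
total = 0.0
for _ in range(trials):
    tau = 1
    while random.random() > p:   # geometric number of summands
        tau += 1
    total += sum(random.expovariate(1.0) for _ in range(tau))
estimate = total / trials        # should be close to E X_1 * E tau = 5
```

The empirical mean of $S_\tau$ lands close to $E X_1 \cdot E\tau = 5$, as the identity predicts.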
Theorem 21 (Markov property) Suppose that $(X_i)_{i\ge1}$ are i.i.d. and $\tau$ is a stopping time. Then the sequence $T = (X_{\tau+1}, X_{\tau+2}, \ldots)$ is independent of the $\sigma$-algebra $\mathcal{F}_\tau$ and
$$ T \overset{d}{=} (X_1, X_2, \ldots), $$
where $\overset{d}{=}$ means equality in distribution.
In words, this means that the sequence $T = (X_{\tau+1}, X_{\tau+2}, \ldots)$ after the stopping time is an independent copy of the entire sequence, also independent of everything that happens before the stopping time.
$$ A \cap \{\tau = n\} = A \cap \{\tau \le n\} \setminus A \cap \{\tau \le n-1\} \in \sigma(X_1, \ldots, X_n). $$
On the other hand, $\{T_n \in B\} \in \sigma(X_{n+1}, \ldots)$ and, therefore, it is independent of $A \cap \{\tau = n\}$. Using this and the fact that $(X_i)_{i\ge1}$ are i.i.d.,
$$ P\big( A \cap \{T \in B\} \big) = \sum_{n\ge1} P\big( A \cap \{\tau = n\} \big)\, P(T_n \in B) = \sum_{n\ge1} P\big( A \cap \{\tau = n\} \big)\, P(T_1 \in B) = P(A)\, P(T_1 \in B), $$
Let us give one interesting application of the Markov property and Wald's identity that will yield another proof of the Strong Law of Large Numbers.
Theorem 22 Suppose that $(X_i)_{i\ge1}$ are i.i.d. such that $EX_1 > 0$. If $Z = \inf_{n\ge1} S_n$ then $P(Z > -\infty) = 1$.
This means that the partial sums cannot drift down to $-\infty$ if the mean $EX_1 > 0$. Of course, this is obvious by the strong law of large numbers, but we want to prove it independently, since this will give another proof of the SLLN.
and recursively,
$$ \tau_n = \min\big\{ k \ge 1 : S^{(n)}_k \ge 1 \big\}, \qquad Z_n = \min_{k \le \tau_n} S^{(n)}_k, \qquad S^{(n+1)}_k = S^{(n)}_{\tau_n + k} - S^{(n)}_{\tau_n}. $$
We mentioned above that $\tau_1$ is a stopping time. It is easy to check that $Z_1$ is $\mathcal{F}_{\tau_1}$-measurable and, by the Markov property, it is independent of the sequence $T_1 = (X_{\tau_1+1}, X_{\tau_1+2}, \ldots)$, which has the same distribution as the original
[Figure: a path of the walk $S_k$ up to the stopping time $\tau_1$, with the levels $1$, $0$, $-1$ and the minimum $Z_1$ marked.]
sequence. Since $\tau_2$ and $Z_2$ are defined exactly the same way as $\tau_1$ and $Z_1$, only in terms of this new sequence $T_1$, $Z_2$ is an independent copy of $Z_1$. Now, it should be obvious that $(Z_n)_{n\ge1}$ are i.i.d. random variables. Clearly,
$$ Z = \inf_{k\ge1} S_k = \inf\big\{ Z_1,\ S_{\tau_1} + Z_2,\ S_{\tau_1+\tau_2} + Z_3,\ \ldots \big\}, $$
Therefore,
By Wald's identity,
$$ E|Z_1| \le E \sum_{i \le \tau_1} |X_i| = E|X_1|\, E\tau_1 < \infty $$
if we can show that $E\tau_1 < \infty$. This is left as an exercise below. We proved that $P(Z < -N) \to 0$ as $N \to \infty$, which, of course, implies that $P(Z > -\infty) = 1$. ⊓⊔
This result gives another proof of the Strong Law of Large Numbers.
Theorem 23 If $(X_i)_{i\ge1}$ are i.i.d. and $EX_1 = 0$ then $S_n/n \to 0$ almost surely.
Proof. Given $\varepsilon > 0$ we define $X_i^\varepsilon = X_i + \varepsilon$ so that $EX_1^\varepsilon = \varepsilon > 0$. By the above result, $\inf_{n\ge1} (S_n + n\varepsilon) > -\infty$ with probability one. This means that, for all $n \ge 1$, $S_n + n\varepsilon \ge M > -\infty$ for some random variable $M$. Dividing both sides by $n$ and letting $n \to \infty$ we get
$$ \liminf_{n\to\infty} \frac{S_n}{n} \ge -\varepsilon $$
with probability one. We can then let $\varepsilon \downarrow 0$ over some sequence. Similarly, we prove that
$$ \limsup_{n\to\infty} \frac{S_n}{n} \le 0 $$
with probability one, which finishes the proof. ⊓⊔
Exercise. Let $S_0 = 0$, $S_n = \sum_{i=1}^n X_i$ be a random walk with i.i.d. $(X_i)$, $P(X_i = +1) = p$ and $P(X_i = -1) = q = 1 - p$ for $p > 1/2$. Consider an integer $b \ge 1$ and let $\tau = \min\{n \ge 1 : S_n = b\}$. Show that for $0 < s \le 1$,
$$ E s^\tau = \bigg( \frac{1 - (1 - 4pqs^2)^{1/2}}{2qs} \bigg)^b $$
and compute $E\tau$.
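Differentiating the generating function at $s = 1$ gives $E\tau = b/(p - q)$, which can be sanity-checked by simulation; the parameters below ($p = 0.7$, $b = 1$, so $E\tau = 2.5$) are arbitrary choices:

```python
import random

random.seed(6)

# Monte Carlo check of E tau = b / (p - q) for the hitting time of
# level b by a right-drifting simple random walk.
p, b, trials = 0.7, 1, 20_000
total_steps = 0
for _ in range(trials):
    position, steps = 0, 0
    while position < b:
        position += 1 if random.random() < p else -1
        steps += 1
    total_steps += steps
estimate = total_steps / trials   # should be close to 1 / 0.4 = 2.5
```

Since $p > 1/2$, the walk hits $b$ almost surely and the loop terminates; for $p \le 1/2$ the expected hitting time is infinite and this simulation would not converge.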
Exercise. Suppose that we play a game with i.i.d. outcomes $(X_n)_{n\ge1}$ such that $E|X_1| < \infty$. If we play $n$ rounds, we gain the largest of the first $n$ outcomes. In addition, to play each round we have to pay an amount $c > 0$, so after $n$ rounds our total profit (or loss) is
$$ Y_n = \max_{1 \le m \le n} X_m - cn. $$
In this problem we will find the best strategy to play the game, in some sense.
1. Given $a \in \mathbb{R}$, let $p = P(X_1 > a) > 0$ and consider the stopping time $T = \inf\{n : X_n > a\}$. Compute $EY_T$. (Hint: sum over the sets $\{T = n\}$.)
2. Consider $\gamma$ such that $E(X_1 - \gamma)^+ = c$. For $a = \gamma$ show that $EY_T = \gamma$.
3. Show that $Y_n \le \gamma + \sum_{m=1}^n \big( (X_m - \gamma)^+ - c \big)$ (for any $\gamma$, actually).
4. Use Wald's identity to conclude that, for any stopping time $\tau$ such that $E\tau < \infty$, we have $EY_\tau \le \gamma$.
This means that stopping at the time $T$ results in the best expected profit $\gamma$.
In this section we begin the discussion of weak convergence of distributions on metric spaces. Let (S, d) be a metric
space with the metric $d$. Consider a measurable space $(S, \mathcal{B})$ with the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets and let $(P_n)_{n\ge1}$ and $P$ be some probability distributions on $\mathcal{B}$. We define
$$ C_b(S) = \big\{ f : S \to \mathbb{R} : f \text{ continuous and bounded} \big\}. $$
Since $f_m \in C_b(S)$, by the monotone convergence theorem we get that $P(U) = Q(U)$ and, by Dynkin's theorem, $P = Q$. We will also say that random variables $X_n \to X$ in distribution if their laws converge weakly. Notice that in this definition
the random variables need not be defined on the same probability space, as long as they take values in the same metric
space S. We will come back to the study of convergence on general metric spaces later in the course, and in this section
we will prove only one general result, the Selection Theorem. Other results will be proved only on R or Rn to prepare
us for the most famous example of convergence of laws - the Central Limit Theorem (CLT). First of all, let us notice
that on the real line the convergence of probability measures can be expressed in terms of their c.d.f.s, as follows.
Theorem 24 If $S = \mathbb{R}$ then $P_n \to P$ weakly if and only if $F_n(t) = P_n\big( (-\infty, t] \big) \to F(t) = P\big( (-\infty, t] \big)$ for any point of continuity $t$ of the c.d.f. $F$.
Proof. $\Longrightarrow$: Suppose that (8.0.1) holds. Let us approximate the indicator $I(x \le t)$ by continuous functions as in Fig. 8.1 below. Obviously, $\varphi_1, \varphi_2 \in C_b(\mathbb{R})$. Then, using (8.0.1) for $\varphi_1$ and $\varphi_2$,
[Fig. 8.1: continuous approximations $\varphi_1$, $\varphi_2$ of the indicator $I(x \le t)$, with transitions over $[t-\varepsilon, t]$ and $[t, t+\varepsilon]$.]
More carefully, we should write $\liminf$ and $\limsup$ but, since $t$ is a point of continuity of $F$, letting $\varepsilon \downarrow 0$ proves that the limit $\lim_{n\to\infty} F_n(t)$ exists and is equal to $F(t)$.
$\Longleftarrow$: Let $PC(F)$ be the set of points of continuity of $F$. Since $F$ is monotone, the set $PC(F)$ is dense in $\mathbb{R}$. Take $M$ large enough such that both $-M, M \in PC(F)$ and $P\big( (-M, M]^c \big) \le \varepsilon$. Clearly, for large enough $n \ge 1$ we have $P_n\big( (-M, M]^c \big) \le 2\varepsilon$. For any $k > 1$, consider a sequence of points $-M = x_1^k \le x_2^k \le \ldots \le x_k^k = M$ such that all $x_i^k \in PC(F)$ and $\max_i |x_{i+1}^k - x_i^k| \to 0$ as $k \to \infty$. Given a function $f \in C_b(\mathbb{R})$, consider an approximating function
$$ f_k(x) = \sum_{1 < i \le k} f(x_i^k)\, I\big( x \in (x_{i-1}^k, x_i^k] \big) + 0 \cdot I\big( x \notin (-M, M] \big). $$
Since $f$ is continuous,
$$ \Delta_k(M) = \sup_{|x| \le M} \big| f_k(x) - f(x) \big| \to 0, \quad k \to \infty. $$
Letting $n \to \infty$, then $k \to \infty$ and, finally, $\varepsilon \to 0$ (or $M \to \infty$), proves that $\int f\, dF_n \to \int f\, dF$. ⊓⊔
Example. If $P_n(\{n^{-1}\}) = 1$ and $P(\{0\}) = 1$ then $P_n \to P$ weakly, but their c.d.f.s, obviously, do not converge at the point of discontinuity $x = 0$.
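This example can be made concrete in a few lines. Below, $\cos$ stands in as an arbitrary bounded continuous test function: the integrals $\int f\, dP_n = f(1/n)$ converge to $f(0) = \int f\, dP$ even though the c.d.f.s disagree at $0$:

```python
import math

# For P_n = delta_{1/n} and P = delta_0: F_n(0) = 0 for all n while
# F(0) = 1, yet for every bounded continuous f we have
# integral f dP_n = f(1/n) -> f(0) = integral f dP, i.e. P_n -> P weakly.
f = math.cos                      # an arbitrary bounded continuous test function
values = [f(1.0 / n) for n in (1, 10, 100, 1000)]
gap = abs(values[-1] - f(0.0))    # f(1/1000) is already very close to f(0)
F_n_at_zero = 0.0                 # P_n((-inf, 0]) = 0, since 1/n > 0
F_at_zero = 1.0                   # P((-inf, 0]) = 1
```

This is exactly why Theorem 24 only demands convergence of the c.d.f.s at continuity points of $F$.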
Our typical strategy for proving the convergence of probability measures will be based on the following elementary
observation.
Lemma 18 If for any sequence (n(k))k1 there exists a subsequence (n(k(r)))r1 such that Pn(k(r)) P weakly then
Pn P weakly.
Proof. Suppose not. Then, for some $f \in C_b(S)$ and for some $\varepsilon > 0$, there exists a subsequence $(n(k))$ such that
$$ \Big| \int f\, dP_{n(k)} - \int f\, dP \Big| > \varepsilon. $$
But this contradicts the fact that for some subsequence $P_{n(k(r))} \to P$ weakly. ⊓⊔
1. First, we show that the sequence $(P_n)$ is such that one can always find a converging subsequence of any subsequence $(P_{n(k)})$.
This describes a sort of relative compactness of the sequence (Pn ) and will be a consequence of uniform
tightness and the Selection Theorem proved below.
2. Second, we need to show that any subsequential limit is always the same, P, and this will be done by some ad
hoc methods, for example, using the method of characteristic functions in the case of the CLT.
Theorem 25 (Selection Theorem) If (Pn )n1 is a uniformly tight sequence of laws on the metric space (S, d) then there
exists a subsequence (n(k)) such that Pn(k) converges weakly to some probability law P.
Lemma 19 (Cantor's diagonalization) Let $A$ be a countable set and $f_n : A \to \mathbb{R}$, $n \ge 1$. Then there exists a subsequence $(n(k))$ such that $f_{n(k)}(a)$ converges for all $a \in A$, possibly to $\pm\infty$.
Proof. Let $A = \{a_1, a_2, \ldots\}$. Take $(n_1(k))$ such that $f_{n_1(k)}(a_1)$ converges. Take $(n_2(k)) \subseteq (n_1(k))$ such that $f_{n_2(k)}(a_2)$ converges. Recursively, take $(n_l(k)) \subseteq (n_{l-1}(k))$ such that $f_{n_l(k)}(a_l)$ converges. Now consider the diagonal sequence $(n_k(k))$. Clearly, $f_{n_k(k)}(a_l)$ converges for any $l$ because, for $k \ge l$, $n_k(k) \in \{n_l(k)\}$ by construction. ⊓⊔
Proof of Theorem 25. We will prove the Selection Theorem for arbitrary metric spaces, since this result will be useful
to us later when we study the convergence of laws on general metric spaces. However, when S = R one can see this in
a much more intuitive way, as follows.
(The case $S = \mathbb{R}$.) Let $A$ be a countable dense set of points in $\mathbb{R}$. Given a sequence of probability measures $P_n$ on $\mathbb{R}$ and their c.d.f.s $F_n$, by Cantor's diagonalization, there exists a subsequence $(n(k))$ such that $F_{n(k)}(a) \to F(a)$ for all $a \in A$. For $x \in \mathbb{R} \setminus A$, we can extend $F$ by
$$ F(x) = \inf\big\{ F(a) : x < a,\ a \in A \big\}. $$
Obviously, $F(x)$ is non-decreasing but not necessarily right-continuous. However, at the points of discontinuity, we can redefine it to be right-continuous. Then $F(x)$ will be a cumulative distribution function, since the fact that the $P_n$ are uniformly tight ensures that $F(x) \to 0$ or $1$ as $x \to -\infty$ or $+\infty$. In order to prove weak convergence of $P_{n(k)}$ to the measure $P$ with the c.d.f. $F$, let $x$ be a point of continuity of $F(x)$ and let $a, b \in A$ be such that $a < x < b$. We have
$$ F(a) = \lim_{k} F_{n(k)}(a) \le \liminf_{k} F_{n(k)}(x) \le \limsup_{k} F_{n(k)}(x) \le \lim_{k} F_{n(k)}(b) = F(b), $$
and, letting $a \uparrow x$ and $b \downarrow x$, this proves that $F_{n(k)}(x) \to F(x)$ for all such $x$. By Theorem 24, this means that the laws $P_{n(k)}$ converge to $P$. Let us now prove the general case on general metric spaces.
(The general case). If K is a compact then, obviously, Cb (K) = C(K). Later in these lectures, when we deal in more
detail with convergence on general metric spaces, we will prove the following fact, which is well-known and is a
consequence of the Stone-Weierstrass theorem.
Since the $P_n$ are uniformly tight, for any $r \ge 1$ we can find a compact $K_r$ such that $P_n(K_r) > 1 - 1/r$ for all $n \ge 1$. Let $C_r \subseteq C(K_r)$ be a countable dense subset of $C(K_r)$. By Cantor's diagonalization, there exists a subsequence $(n(k))$ such that $P_{n(k)}(f)$ converges for all $f \in C_r$ for all $r \ge 1$. Since $C_r$ is dense in $C(K_r)$, this implies that $P_{n(k)}(f)$ converges for all $f \in C(K_r)$ for all $r \ge 1$. Next, for any $f \in C_b(S)$,
$$ \Big| \int f\, dP_{n(k)} - \int_{K_r} f\, dP_{n(k)} \Big| \le \int_{K_r^c} |f|\, dP_{n(k)} \le \|f\|_\infty\, P_{n(k)}(K_r^c) \le \frac{\|f\|_\infty}{r}. $$
This implies that the limit
$$ I(f) := \lim_{k\to\infty} \int f\, dP_{n(k)} \tag{8.0.2} $$
$$ f, g \in \mathcal{L} \implies cf + g \in \mathcal{L} \ \text{ for } c \in \mathbb{R}, \quad \text{and} \quad f \vee g,\ f \wedge g \in \mathcal{L}. $$
A vector lattice $\mathcal{L}$ is called a Stone vector lattice if $f \wedge 1 \in \mathcal{L}$ for any $f \in \mathcal{L}$. For example, any vector lattice that contains constants is automatically a Stone vector lattice.
A functional $I : \mathcal{L} \to \mathbb{R}$ is called a pre-integral if
1. $I$ is linear,
2. $f \ge 0 \implies I(f) \ge 0$,
3. $f_n \downarrow 0$, $\|f_n\|_\infty < \infty \implies I(f_n) \to 0$.
See R. M. Dudley, Real Analysis and Probability, for a proof of the following:
(Stone–Daniell theorem) If $\mathcal{L}$ is a Stone vector lattice and $I$ is a pre-integral on $\mathcal{L}$ then $I(f) = \int f\, d\mu$ for some unique measure $\mu$ on the minimal $\sigma$-algebra on which all functions in $\mathcal{L}$ are measurable.
We will use this theorem with $\mathcal{L} = C_b(S)$ and $I$ defined in (8.0.2). The first two properties are obvious. To prove the third one, let us consider a sequence such that
$$ f_n \downarrow 0, \qquad 0 \le f_n(x) \le f_1(x) \le \|f_1\|_\infty. $$
for some unique measure $P$ on $\sigma(C_b(S))$. The choice of $f = 1$ shows that $I(f) = 1 = P(S)$, which means that $P$ is a probability measure. Finally, let us show that $\sigma(C_b(S))$ is the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets. Since any $f \in C_b(S)$ is measurable on $\mathcal{B}$, we get $\sigma(C_b(S)) \subseteq \mathcal{B}$. On the other hand, let $F \subseteq S$ be any closed set and take the function $f(x) = \min(1, d(x, F))$. We have $|f(x) - f(y)| \le d(x, y)$, so $f \in C_b(S)$ and
However, since $F$ is closed, $f^{-1}(\{0\}) = \{x : d(x, F) = 0\} = F$ and this proves that $\mathcal{B} \subseteq \sigma(C_b(S))$. ⊓⊔
Conversely, the following holds (we will prove this result later for any complete separable metric space).
For $n$ large enough, $n \ge n_0$, we get $P_n(|x| > 2M) \le 2\varepsilon$. For $n < n_0$ choose $M_n$ so that $P_n(|x| > M_n) \le 2\varepsilon$. Take $M' = \max\{M_1, \ldots, M_{n_0-1}, 2M\}$. As a result, $P_n(|x| > M') \le 2\varepsilon$ for all $n \ge 1$. ⊓⊔
Finally, let us relate convergence in distribution to other forms of convergence. Consider random variables X and Xn
on some probability space (, A , P) with values in a metric space (S, d). Let P and Pn be their corresponding laws
on Borel sets B in S. Convergence of Xn to X in probability and almost surely is defined exactly the same way as for
S = R by replacing |Xn X| with d(Xn , X).
Lemma 20 $X_n \to X$ in probability if and only if for any sequence $(n(k))$ there exists a subsequence $(n(k(r)))$ such that $X_{n(k(r))} \to X$ almost surely.
Proof. $\Longleftarrow$: Suppose $X_n$ does not converge to $X$ in probability. Then, for some small $\varepsilon > 0$, there exists a subsequence $(n(k))_{k\ge1}$ such that $P\big( d(X, X_{n(k)}) \ge \varepsilon \big) \ge \varepsilon$. This contradicts the existence of a subsequence $X_{n(k(r))}$ that converges to $X$ almost surely.
Proof. By Lemma 20, for any subsequence $(n(k))$ there exists a subsequence $(n(k(r)))$ such that $X_{n(k(r))} \to X$ almost surely. Given $f \in C_b(\mathbb{R})$, by the dominated convergence theorem,
$$ E f(X_{n(k(r))}) \to E f(X), $$
Exercise. Let Xn be random variables on the same probability space with values in a metric space S. If for some point
s S, Xn s in distribution, show that Xn s in probability.
Exercise. For the following sequences of laws $P_n$ on $\mathbb{R}$ with densities $f_n$, which are uniformly tight? (a) $f_n = I(0 \le x \le n)/n$, (b) $f_n = n e^{-nx} I(x \ge 0)$, (c) $f_n = e^{-x/n}/n\, I(x \ge 0)$.
Exercise. Suppose that $X_n \to X$ in distribution on $\mathbb{R}$ and $Y_n \to c \in \mathbb{R}$ in probability. Show that $X_n Y_n \to cX$ in distribution, assuming that $X_n, Y_n$ are defined on the same probability space.
Exercise. Suppose that random variables (Xn ) are independent and Xn X in probability. Show that X is almost surely
constant.
Characteristic functions.
In the next section, we will prove one of the most classical results in Probability Theory, the central limit theorem, and one of the main tools will be the so-called characteristic functions. Let $X = (X_1, \ldots, X_k)$ be a random vector on $\mathbb{R}^k$ with the distribution $P$ and let $t = (t_1, \ldots, t_k) \in \mathbb{R}^k$. The characteristic function of $X$ is defined by
$$ f(t) = E e^{i(t,X)} = \int e^{i(t,x)}\, dP(x). $$
In this section, we will collect various important and useful facts about characteristic functions. The standard normal distribution $N(0,1)$ on $\mathbb{R}$ with the density
$$ p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} $$
will play a central role, so let us start by computing its characteristic function. First of all, notice that this is indeed a density, since
$$ \Big( \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-x^2/2}\, dx \Big)^2 = \frac{1}{2\pi} \int_{\mathbb{R}^2} e^{-(x^2+y^2)/2}\, dx\, dy = \frac{1}{2\pi} \int_0^{2\pi} \!\! \int_0^\infty e^{-r^2/2}\, r\, dr\, d\theta = 1. $$
If $X$ has the standard normal distribution $N(0,1)$ then, obviously, $EX = 0$ and
$$ \mathrm{Var}(X) = EX^2 = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x^2 e^{-x^2/2}\, dx = -\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x\, d\big( e^{-x^2/2} \big) = 1, $$
by integration by parts. To motivate the computation of the characteristic function, let us first notice that for $\lambda \in \mathbb{R}$,
$$ E e^{\lambda X} = \frac{1}{\sqrt{2\pi}} \int e^{\lambda x} e^{-\frac{x^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}\, \frac{1}{\sqrt{2\pi}} \int e^{-\frac{(x-\lambda)^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}\, \frac{1}{\sqrt{2\pi}} \int e^{-\frac{x^2}{2}}\, dx = e^{\frac{\lambda^2}{2}}. $$
For complex $\lambda = it$, we begin similarly by completing the square,
$$ E e^{itX} = e^{-\frac{t^2}{2}}\, \frac{1}{\sqrt{2\pi}} \int e^{-\frac{(x-it)^2}{2}}\, dx = e^{-\frac{t^2}{2}} \int_{-it+\mathbb{R}} \varphi(z)\, dz, $$
where we denoted
$$ \varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}} \quad \text{for } z \in \mathbb{C}. $$
Since $\varphi$ is analytic, by Cauchy's theorem, the integral over a closed path is equal to 0. Let us take a closed path $-it + x$ for $x$ from $-M$ to $+M$, then $M + iy$ for $y$ from $-t$ to $0$, then $x$ from $M$ to $-M$ and, finally, $-M + iy$ for $y$ from $0$ to $-t$. For large $M$, the function $\varphi(z)$ is very small on the intervals $\pm M + iy$, so letting $M \to \infty$ we get
$$ \int_{-it+\mathbb{R}} \varphi(z)\, dz = \int_{\mathbb{R}} \varphi(z)\, dz = 1, $$
$$ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(x-\mu)^2}{2\sigma^2} \Big). $$
This is the so-called normal distribution $N(\mu, \sigma^2)$ and its characteristic function is given by
$$ E e^{itY} = E e^{it(\mu + \sigma X)} = e^{it\mu - t^2\sigma^2/2}. \tag{9.0.2} $$
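Formula (9.0.2) can be checked by Monte Carlo: averaging $e^{itY}$ over samples of $Y \sim N(\mu, \sigma^2)$ should reproduce $e^{it\mu - t^2\sigma^2/2}$. The parameters below are arbitrary illustrative choices:

```python
import cmath
import random

random.seed(7)

# Monte Carlo check of (9.0.2): for Y ~ N(mu, sigma^2),
# E e^{itY} = exp(i*t*mu - t^2 * sigma^2 / 2).
mu, sigma, t, n = 1.0, 2.0, 0.5, 100_000
samples = (random.gauss(mu, sigma) for _ in range(n))
estimate = sum(cmath.exp(1j * t * y) for y in samples) / n
exact = cmath.exp(1j * t * mu - t ** 2 * sigma ** 2 / 2)
error = abs(estimate - exact)
```

The estimate is an average of unit-modulus complex numbers, so its Monte Carlo error is of order $1/\sqrt{n}$, well below the tolerance used here.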
Our next very important observation is that integrability of $X$ is related to smoothness of its characteristic function.
Lemma 22 If $X$ is a real-valued random variable such that $E|X|^r < \infty$ for an integer $r \ge 1$ then $f(t) \in C^r(\mathbb{R})$ and $f^{(j)}(t) = E(iX)^j e^{itX}$ for $j \le r$.
by the dominated convergence theorem. This means that $f \in C(\mathbb{R})$, i.e. characteristic functions are always continuous. If $r = 1$, $E|X| < \infty$, we can use
$$ \Big| \frac{e^{itX} - e^{isX}}{t-s} \Big| \le |X| $$
and the dominated convergence theorem to show that
$$ f'(t) = \lim_{s \to t} E\, \frac{e^{itX} - e^{isX}}{t-s} = E\, iX e^{itX}. $$
Also, by the dominated convergence theorem, $E\, iX e^{itX} \in C(\mathbb{R})$, which means that $f \in C^1(\mathbb{R})$. We proceed by induction. Suppose that we proved that
$$ f^{(j)}(t) = E(iX)^j e^{itX} $$
and that $r = j+1$, $E|X|^{j+1} < \infty$. Then, we can use that
$$ \Big| \frac{(iX)^j e^{itX} - (iX)^j e^{isX}}{t-s} \Big| \le |X|^{j+1}, $$
so that, by the dominated convergence theorem, $f^{(j+1)}(t) = E(iX)^{j+1} e^{itX} \in C(\mathbb{R})$. ⊓⊔
Next, we want to show that the characteristic function uniquely determines the distribution. This is usually proved using convolutions. Let $X$ and $Y$ be two independent random vectors on $\mathbb{R}^k$ with the distributions $P$ and $Q$. We denote by $P * Q$ the convolution of $P$ and $Q$, which is the distribution $\mathcal{L}(X+Y)$ of the sum $X + Y$. We have
$$ P * Q(A) = E\, I(X + Y \in A) = \iint I(x + y \in A)\, dP(x)\, dQ(y) = \iint I(x \in A - y)\, dP(x)\, dQ(y) = \int P(A - y)\, dQ(y). $$
Let us denote by $N(0, \sigma^2 I)$ the distribution of the random vector $X = (X_1, \ldots, X_k)$ of i.i.d. random variables with the normal distribution $N(0, \sigma^2)$. The density of $X$ is given by
$$ \prod_{i=1}^k \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x_i^2}{2\sigma^2}} = \Big( \frac{1}{\sqrt{2\pi}\,\sigma} \Big)^k e^{-\frac{|x|^2}{2\sigma^2}}, $$
and the density of the convolution $P * N(0, \sigma^2 I)$ equals
$$ p_\sigma(x) = \Big( \frac{1}{\sqrt{2\pi}\,\sigma} \Big)^k \int e^{-\frac{|x-y|^2}{2\sigma^2}}\, dP(y). $$
Using (9.0.1), we can write
$$ e^{-\frac{1}{2\sigma^2}(x_i - y_i)^2} = \frac{1}{\sqrt{2\pi}} \int e^{i\frac{1}{\sigma}(x_i - y_i) z_i}\, e^{-\frac{1}{2} z_i^2}\, dz_i $$
and, taking a product over $i \le k$, we get
$$ e^{-\frac{1}{2\sigma^2}|x-y|^2} = \Big( \frac{1}{\sqrt{2\pi}} \Big)^k \int e^{i\frac{1}{\sigma}(x-y,z)}\, e^{-\frac{1}{2}|z|^2}\, dz. $$
Then we can continue
$$ p_\sigma(x) = \Big( \frac{1}{2\pi\sigma} \Big)^k \iint e^{i\frac{1}{\sigma}(x-y,z)}\, e^{-\frac{1}{2}|z|^2}\, dz\, dP(y) = \Big( \frac{1}{2\pi\sigma} \Big)^k \iint e^{i\frac{1}{\sigma}(x-y,z)}\, e^{-\frac{1}{2}|z|^2}\, dP(y)\, dz = \Big( \frac{1}{2\pi\sigma} \Big)^k \int f\Big( -\frac{z}{\sigma} \Big)\, e^{i\frac{1}{\sigma}(x,z)}\, e^{-\frac{1}{2}|z|^2}\, dz. $$
Making the change of variables $z = \sigma t$ finishes the proof. ⊓⊔
This immediately implies that the characteristic function uniquely determines the distribution.
Theorem 27 (Uniqueness) If
$$ \int e^{i(t,x)}\, dP(x) = \int e^{i(t,x)}\, dQ(x) \quad \text{for all } t \in \mathbb{R}^k, $$
then $P = Q$.
This immediately implies the following stability property of the normal distribution.
Lemma 24 If $X_j$ for $j = 1, 2$ are independent and have normal distributions $N(\mu_j, \sigma_j^2)$ then $X_1 + X_2$ has the normal distribution $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
Proof. By independence and (9.0.2), the characteristic function of the sum is equal to
$$ E e^{it(X_1+X_2)} = E e^{itX_1}\, E e^{itX_2} = e^{it\mu_1 - t^2\sigma_1^2/2}\, e^{it\mu_2 - t^2\sigma_2^2/2} = e^{it(\mu_1+\mu_2) - t^2(\sigma_1^2+\sigma_2^2)/2}. $$
This is the characteristic function of the normal distribution $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$, so the claim follows from the Uniqueness Theorem. One can also prove this by a straightforward computation using the formula (9.0.4) for the density of the convolution, which is left as an exercise below. ⊓⊔
Lemma 23 gives some additional information when the characteristic function is integrable.
Lemma 25 (Fourier inversion formula) If $\int |f(t)|\, dt < \infty$ then $P$ itself has the density
$$ p(x) = \Big( \frac{1}{2\pi} \Big)^k \int f(t)\, e^{-i(t,x)}\, dt. $$
the integrability of $f(t)$ implies, by the dominated convergence theorem, that $p_\sigma(x) \to p(x)$. Since $P_\sigma \to P$ weakly, for any $g \in C_b(\mathbb{R}^k)$,
$$ \int g(x)\, p_\sigma(x)\, dx \to \int g(x)\, dP(x). $$
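The inversion formula can be checked numerically in the one-dimensional normal case, where $f(t) = e^{-t^2/2}$ and the recovered density should be $e^{-x^2/2}/\sqrt{2\pi}$. The truncation $T$ and grid size below are arbitrary numerical choices:

```python
import math

# Numeric check of the inversion formula for N(0,1), where f(t) = e^{-t^2/2}:
# p(x) = (1/2pi) * integral f(t) e^{-itx} dt should equal e^{-x^2/2}/sqrt(2pi).
def inverted_density(x, T=10.0, steps=20_000):
    dt = 2 * T / steps
    # The imaginary part cancels by symmetry, so the integrand reduces
    # to the real function e^{-t^2/2} cos(tx).
    total = sum(math.exp(-(-T + k * dt) ** 2 / 2) * math.cos((-T + k * dt) * x)
                for k in range(steps))
    return total * dt / (2 * math.pi)

exact = lambda x: math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)
errors = [abs(inverted_density(x) - exact(x)) for x in (0.0, 0.5, 1.0)]
```

The Gaussian tails make the truncation error at $|t| = 10$ negligible, so the Riemann sum matches the exact density to high accuracy.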
Because of the uniqueness theorem, the convergence of characteristic functions implies convergence of distributions
under some mild additional assumptions.
Lemma 26 If $(P_n)$ is uniformly tight on $\mathbb{R}^k$ and the characteristic functions converge to some function $f$,
$$ f_n(t) = \int e^{i(t,x)}\, dP_n(x) \to f(t), $$
then $f(t) = \int e^{i(t,x)}\, dP(x)$ is a characteristic function of some distribution $P$ and $P_n \to P$ weakly.
as $r \to \infty$ and, therefore, $f$ is the characteristic function of $P$. By the uniqueness theorem, the distribution $P$ does not depend on the sequence $(n(k))$. By Lemma 18, $P_n \to P$ weakly. ⊓⊔
Even though in many cases uniform tightness of (Pn ) is not difficult to check directly, it also follows automatically
from the continuity of the limit of characteristic functions at zero. To show this, we will need the following bound.
Theorem 28 (Lévy's continuity theorem) Let $(X_n)$ be a sequence of random variables on $\mathbb{R}^k$. Suppose that the characteristic functions $f_n(t) = E e^{i(t,X_n)}$ converge to $f(t)$ for every $t$, and $f(t)$ is continuous at $0$ along each axis. Then there exists a probability distribution $P$ such that
$$ f(t) = \int e^{i(t,x)}\, dP(x) $$
Proof. By Lemma 26, we only need to show that $\{\mathcal{L}(X_n)\}$ is uniformly tight. If we denote $X_n = (X_{n,1}, \ldots, X_{n,k})$ then consider the characteristic functions of the $i$th coordinates:
Since $f_n(0) = 1$, we have $f(0) = 1$. By the continuity of $f^i$ at 0, for any $\varepsilon > 0$ we can find $\delta > 0$ such that, for all $i \le k$, $|f^i(t_i) - 1| \le \varepsilon$ if $|t_i| \le \delta$. By the dominated convergence theorem,
$$ \lim_{n\to\infty} \frac{1}{\delta} \int_0^\delta \big( 1 - \mathrm{Re}\, f_n^i(t_i) \big)\, dt_i = \frac{1}{\delta} \int_0^\delta \big( 1 - \mathrm{Re}\, f^i(t_i) \big)\, dt_i \le \frac{1}{\delta} \int_0^\delta \big| 1 - f^i(t_i) \big|\, dt_i \le \varepsilon. $$
Lemma 28 (Continuous mapping) Suppose that $P_n \to P$ weakly on $X$ and $G : X \to Y$ is a continuous map. Then $P_n \circ G^{-1} \to P \circ G^{-1}$ on $Y$. In other words, if $Z_n \to Z$ in distribution then $G(Z_n) \to G(Z)$ in distribution.
Proof. This is obvious because, for any $f \in C_b(Y)$, we have $f \circ G \in C_b(X)$ and, therefore, $E f(G(Z_n)) \to E f(G(Z))$, which is the definition of the convergence $G(Z_n) \to G(Z)$ in distribution. ⊓⊔
u
By Lemma 26, it remains to show that $(P_n \times Q_n)$ is uniformly tight. By Theorem 26, since $P_n \to P$, $(P_n)$ is uniformly tight. Therefore, there exists a compact $K \subseteq \mathbb{R}^k$ such that $P_n(K) > 1 - \varepsilon$. Similarly, for some compact $K' \subseteq \mathbb{R}^m$, $Q_n(K') > 1 - \varepsilon$. We have $P_n \times Q_n(K \times K') > 1 - 2\varepsilon$ and $K \times K'$ is a compact in $\mathbb{R}^{k+m}$. ⊓⊔
Proof. Since the function G : Rk+k Rk given by G(x, y) = x + y is continuous, by the continuous mapping lemma,
Pn Qn = (Pn Qn ) G1 (P Q) G1 = P Q,
We have seen in the Hoeffding-Chernoff inequality in Section 4 that if X_1, \ldots, X_n are independent flips of a fair coin, P(X_i = 0) = P(X_i = 1) = 1/2, and \bar{X}_n is their average, then

P(|\bar{X}_n - 1/2| \ge t) \le 2 e^{-2nt^2}.

If we denote

Z_n = 2\sqrt{n}(\bar{X}_n - 1/2) = \frac{S_n - n\mu}{\sigma\sqrt{n}},

where \mu = E X_1 = 1/2 and \sigma^2 = \mathrm{Var}(X_1) = 1/4, then the Hoeffding-Chernoff inequality can be rewritten as

P(|Z_n| \ge t) \le 2 e^{-t^2/2}.

We will see that this inequality is rather accurate for large sample size n, since we will show that

\lim_n P(|Z_n| \ge t) = \frac{2}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2} \, dx \sim \frac{2}{t\sqrt{2\pi}} e^{-t^2/2}.
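As a quick numerical sanity check, the limit above can be compared with the Hoeffding-Chernoff bound by simulation. The sketch below (with illustrative choices of n, t and the number of trials) estimates P(|Z_n| \ge t) for fair coin flips by Monte Carlo and prints it next to the Gaussian tail and the bound 2 e^{-t^2/2}.

```python
import math
import random

random.seed(0)

def tail_estimate(n, t, trials=20000):
    """Monte Carlo estimate of P(|Z_n| >= t) for fair-coin averages,
    where Z_n = 2*sqrt(n)*(mean - 1/2)."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n))
        z = 2 * math.sqrt(n) * (heads / n - 0.5)
        if abs(z) >= t:
            hits += 1
    return hits / trials

t = 2.0
p_mc = tail_estimate(200, t)                # empirical P(|Z_n| >= t)
hoeffding = 2 * math.exp(-t * t / 2)        # Hoeffding-Chernoff bound
gauss = math.erfc(t / math.sqrt(2))         # limiting value P(|N(0,1)| >= t)
print(p_mc, gauss, hoeffding)
```

For moderate n the empirical tail is already close to the Gaussian limit, while the Hoeffding bound stays a comfortable factor above both.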
The asymptotic equivalence as t \to \infty is a simple exercise, and the large n limit is a consequence of a more general result,

\lim_n P(Z_n \le z) = \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^z e^{-x^2/2} \, dx, \qquad (10.0.1)

for all z \in \mathbb{R}. In other words, Z_n converges in distribution to the standard normal distribution with the density e^{-x^2/2}/\sqrt{2\pi} and the c.d.f. \Phi(z). Moreover, this holds for any i.i.d. sequence (X_n) with \mu = E X_1 and \sigma^2 = \mathrm{Var}(X_1) < \infty, and

Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{X_i - \mu}{\sigma}. \qquad (10.0.2)
Before we use characteristic functions to prove this result as well as its multivariate version, let us describe a more intuitive approach via the so-called Lindeberg's method, which emphasizes a bit more explicitly the main reason behind the central limit theorem, namely, the stability property of the normal distribution proved in Lemma 24.
Theorem 29 Consider an i.i.d. sequence (X_i)_{i \ge 1} such that E X_1 = \mu, \mathrm{Var}(X_1) = \sigma^2 and E|X_1|^3 < \infty. Then the distribution of Z_n defined in (10.0.2) converges weakly to the standard normal distribution N(0,1).
Remark. One can easily modify the proof we give below to get rid of the unnecessary assumption E|X_1|^3 < \infty. This will appear as an exercise in the next section, where we will state and prove a more general version of Lindeberg's CLT for non-i.i.d. random variables.
Proof. First of all, notice that the random variables (X_i - \mu)/\sigma have mean 0 and variance 1. Therefore, it is enough to prove the result for

Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i

with E X_1 = 0 and E X_1^2 = 1.
Let (g_i) be i.i.d. standard normal random variables independent of (X_i). If we denote

S_m = \frac{1}{\sqrt{n}} \big( g_1 + \ldots + g_{m-1} + X_{m+1} + \ldots + X_n \big)

then T_m = S_m + X_m/\sqrt{n} and T_{m+1} = S_m + g_m/\sqrt{n}. Suppose now that f has uniformly bounded third derivative. Then, by Taylor's formula,

\Big| f(T_m) - f(S_m) - \frac{f'(S_m) X_m}{\sqrt{n}} - \frac{f''(S_m) X_m^2}{2n} \Big| \le \frac{\|f'''\|_\infty |X_m|^3}{6 n^{3/2}}

and

\Big| f(T_{m+1}) - f(S_m) - \frac{f'(S_m) g_m}{\sqrt{n}} - \frac{f''(S_m) g_m^2}{2n} \Big| \le \frac{\|f'''\|_\infty |g_m|^3}{6 n^{3/2}}.

Notice that S_m is independent of X_m and g_m and, therefore,

E f'(S_m) X_m = E f'(S_m) E X_m = 0 = E f'(S_m) E g_m = E f'(S_m) g_m

and

E f''(S_m) X_m^2 = E f''(S_m) E X_m^2 = E f''(S_m) = E f''(S_m) E g_m^2 = E f''(S_m) g_m^2.

As a result, taking expectations and subtracting the above inequalities, we get

\big| E f(T_m) - E f(T_{m+1}) \big| \le \frac{\|f'''\|_\infty \big( E|X_1|^3 + E|g_1|^3 \big)}{6 n^{3/2}}.
f(t) = f(0) + f'(0) t + \frac{1}{2} f''(0) t^2 + o(t^2) \text{ as } t \to 0.

Since

f(0) = 1, \quad f'(0) = E\, iX e^{i0X} = i E X = 0, \quad f''(0) = E(iX)^2 = -E X^2 = -1,

we get

f(t) = 1 - \frac{t^2}{2} + o(t^2).

Finally,

E e^{\frac{itS_n}{\sqrt{n}}} = f\Big(\frac{t}{\sqrt{n}}\Big)^n = \Big( 1 - \frac{t^2}{2n} + o\Big(\frac{1}{n}\Big) \Big)^n \to e^{-\frac{1}{2} t^2}

as n \to \infty. The result then follows from Levy's continuity theorem in the previous section. Alternatively, we could also use Lemma 26, since it is easy to check that the sequence of laws \big( L(S_n/\sqrt{n}) \big)_{n \ge 1} is uniformly tight.
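The convergence f(t/\sqrt{n})^n \to e^{-t^2/2} can be observed numerically for a concrete distribution. For random signs P(X = \pm 1) = 1/2 (an illustrative choice) the characteristic function is \cos(t), and the sketch below compares \cos(t/\sqrt{n})^n with the Gaussian limit.

```python
import math

def cf_sign(t):
    # characteristic function of a random sign: E e^{itX} = cos(t)
    return math.cos(t)

def cf_normalized_sum(t, n):
    # E e^{it S_n / sqrt(n)} = f(t / sqrt(n))^n by independence
    return cf_sign(t / math.sqrt(n)) ** n

t = 1.7
limit = math.exp(-t * t / 2)
approx = cf_normalized_sum(t, 10_000)
print(approx, limit)
```

The error is of order t^4/n, so even moderate n gives several matching digits.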
Next, we will prove a multivariate version of the central limit theorem. For a random vector X = (X_1, \ldots, X_k) \in \mathbb{R}^k, let E X = (E X_1, \ldots, E X_k) denote its expectation and, for a centered vector,

\mathrm{Cov}(X) = \big( E X_i X_j \big)_{1 \le i,j \le k}

its covariance matrix. Again, the main step is to prove convergence of characteristic functions.
Theorem 31 Let (X_i)_{i \ge 1} be a sequence of i.i.d. random vectors on \mathbb{R}^k such that E X_1 = 0 and E|X_1|^2 < \infty. Then the distribution of S_n/\sqrt{n} converges weakly to the distribution P with the characteristic function

f_P(t) = e^{-\frac{1}{2}(Ct,t)}, \qquad (10.0.3)

where C = \mathrm{Cov}(X_1).
Proof. Consider any t \in \mathbb{R}^k. Then Z_i = (t, X_i) are i.i.d. real-valued random variables and, by the central limit theorem on the real line,

E \exp i\Big( t, \frac{S_n}{\sqrt{n}} \Big) = E \exp \frac{i}{\sqrt{n}} \sum_{i=1}^n Z_i \to \exp\Big( -\frac{\mathrm{Var}(Z_1)}{2} \Big) = \exp\Big( -\frac{1}{2}(Ct,t) \Big)

as n \to \infty, since

\mathrm{Var}(Z_1) = \mathrm{Var}\big( t_1 X_{1,1} + \cdots + t_k X_{1,k} \big) = \sum_{i,j} t_i t_j E X_{1,i} X_{1,j} = (Ct,t).
The sequence \{L(S_n/\sqrt{n})\} is uniformly tight on \mathbb{R}^k since

P\Big( \Big| \frac{S_n}{\sqrt{n}} \Big| \ge M \Big) \le \frac{1}{M^2} E \Big| \frac{S_n}{\sqrt{n}} \Big|^2 = \frac{1}{nM^2} E |(S_{n,1}, \ldots, S_{n,k})|^2 = \frac{1}{nM^2} \sum_{i \le k} E S_{n,i}^2 = \frac{1}{M^2} E|X_1|^2 \to 0

as M \to \infty, so we can use Lemma 26 to finish the proof. Alternatively, we can just apply Levy's continuity theorem. ⊓⊔
The unique distribution with the characteristic function in (10.0.3) is called a multivariate normal distribution with the covariance C and is denoted by N(0,C). It can also be defined more constructively as follows. Consider an i.i.d. sequence g_1, \ldots, g_n of standard normal N(0,1) random variables and let g = (g_1, \ldots, g_n)^T. Given a k \times n matrix A, the covariance matrix of Ag \in \mathbb{R}^k is

\mathrm{Cov}(Ag) = E (Ag)(Ag)^T = A \, E g g^T A^T = A A^T =: C.

Since we know the characteristic function of the standard normal random variables g_l, the characteristic function of Ag is

E \exp i(t, Ag) = E \exp i(A^T t, g) = \prod_{l \le n} E \exp i (A^T t)_l g_l = \prod_{l \le n} \exp\big( -((A^T t)_l)^2/2 \big)
= \exp\Big( -\frac{1}{2} |A^T t|^2 \Big) = \exp\Big( -\frac{1}{2} t^T A A^T t \Big) = \exp\Big( -\frac{1}{2} (Ct,t) \Big),
which means that Ag has the distribution N(0,C). On the other hand, given a symmetric non-negative definite matrix C, one can always find A such that C = AA^T. For example, let C = QDQ^T be its eigenvalue decomposition, for an orthogonal matrix Q and a diagonal matrix D. Since C is non-negative definite, the elements of D are non-negative. Then one can take n = k and A = C^{1/2} := QD^{1/2}Q^T, or A = QD^{1/2}. This means that any normal distribution can be generated as a linear transformation of a vector of i.i.d. standard normal random variables.
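This construction translates directly into code. The sketch below (using numpy, with an illustrative 2x2 covariance C) forms A = Q D^{1/2} Q^T from the eigenvalue decomposition, checks AA^T = C, and verifies that the empirical covariance of X = Ag matches C.

```python
import numpy as np

rng = np.random.default_rng(0)

# target covariance (symmetric, positive definite) -- an illustrative choice
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

# eigenvalue decomposition C = Q D Q^T, then A = Q D^{1/2} Q^T gives A A^T = C
eigvals, Q = np.linalg.eigh(C)
A = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
assert np.allclose(A @ A.T, C)

# X = A g has distribution N(0, C) when g has i.i.d. N(0,1) coordinates
g = rng.standard_normal((2, 100_000))
X = A @ g
emp_cov = X @ X.T / X.shape[1]
print(emp_cov)
```

Using A = Q D^{1/2} instead of Q D^{1/2} Q^T would work equally well, since only AA^T = C matters for the distribution.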
Density in the invertible case. Suppose \det(C) \ne 0. Take any invertible A such that C = AA^T, so that Ag \sim N(0,C). Since the density of g is

\prod_{l \le k} \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{x_l^2}{2} \Big) = \Big( \frac{1}{\sqrt{2\pi}} \Big)^k \exp\Big( -\frac{1}{2} |x|^2 \Big),

the change of variables y = Ax, with |\det A| = \sqrt{\det(C)} and |A^{-1} y|^2 = y^T C^{-1} y, gives

P(Ag \in \Omega) = \int_\Omega \Big( \frac{1}{\sqrt{2\pi}} \Big)^k \frac{1}{\sqrt{\det(C)}} \exp\Big( -\frac{1}{2} y^T C^{-1} y \Big) \, dy.

This means that

\Big( \frac{1}{\sqrt{2\pi}} \Big)^k \frac{1}{\sqrt{\det(C)}} \exp\Big( -\frac{1}{2} y^T C^{-1} y \Big)

is the density of the distribution N(0,C) when the covariance matrix C is invertible. ⊓⊔
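The density formula can be sanity-checked numerically: for a diagonal covariance it must factor into one-dimensional normal densities. The sketch below evaluates the formula for an illustrative C = diag(4, 9) and compares against the product of univariate densities.

```python
import math

def mvn_density(y, C_inv, det_C):
    """Density of N(0, C) at y for invertible C, i.e. the formula above."""
    k = len(y)
    quad = sum(y[i] * C_inv[i][j] * y[j] for i in range(k) for j in range(k))
    return (2 * math.pi) ** (-k / 2) / math.sqrt(det_C) * math.exp(-quad / 2)

def norm1d(x, var):
    # one-dimensional centered normal density with variance var
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

# diagonal C = diag(4, 9), so C^{-1} = diag(1/4, 1/9) and det C = 36
C_inv = [[1 / 4, 0.0], [0.0, 1 / 9]]
det_C = 36.0
y = (1.0, -2.0)
val = mvn_density(y, C_inv, det_C)
print(val, norm1d(1.0, 4.0) * norm1d(-2.0, 9.0))
```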
General case. If C = QDQ^T is the eigenvalue decomposition of C, let us take, for example, X = QD^{1/2} g for an i.i.d. standard normal vector g, so that X \sim N(0,C). If q_1, \ldots, q_k are the column vectors of Q then

X = QD^{1/2} g = (\lambda_1^{1/2} g_1) q_1 + \ldots + (\lambda_k^{1/2} g_k) q_k,

where \lambda_1, \ldots, \lambda_k are the diagonal entries of D. In the coordinates corresponding to the non-zero eigenvalues \lambda_1, \ldots, \lambda_n, the density is

\prod_{l \le n} \frac{1}{\sqrt{2\pi\lambda_l}} \exp\Big( -\frac{x_l^2}{2\lambda_l} \Big). ⊓⊔
Lemma 32 Let Z = (X,Y), where X = (X_1, \ldots, X_i) and Y = (Y_1, \ldots, Y_j), and suppose that Z is normal on \mathbb{R}^{i+j}. Then X and Y are independent if and only if \mathrm{Cov}(X_m, Y_n) = 0 for all m, n.
Proof. One way is obvious. The other way around, suppose that

C = \mathrm{Cov}(Z) = \begin{pmatrix} D & 0 \\ 0 & F \end{pmatrix}.
Exercise. Finish the proof of Theorem 29. Namely, show that (10.0.1) holds if \lim_n E f(Z_n) = E f(Z) for all functions f with bounded third derivative.
Exercise. 24% of the residents in a community are members of a minority group, but among the 96 people called for jury duty only 13 are. Does this data indicate that minorities are less likely to be called for jury duty?
Exercise. Given a vector of i.i.d. standard normal random variables g = (g_1, \ldots, g_k)^T and an orthogonal k \times k matrix V, show that Y = V g also has i.i.d. standard normal coordinates.
Exercise. If g is standard normal, prove that E g F(g) = E F'(g) for nice enough functions F : \mathbb{R} \to \mathbb{R}. More generally, if g = (g_1, \ldots, g_k) is a centered Gaussian vector, show that

E g_1 F(g) = \sum_{i \le k} C_{1i} \, E \frac{\partial F}{\partial g_i}(g),

where C_{1i} = E g_1 g_i, for nice enough functions F : \mathbb{R}^k \to \mathbb{R}. Hint: what can you say about g_1 and g_i - (C_{1i}/C_{11}) g_1 for 1 \le i \le k?
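The one-dimensional identity E g F(g) = E F'(g) is easy to test by Monte Carlo. The sketch below uses the illustrative choice F(x) = x^3, for which both sides equal E g^4 = 3.

```python
import math
import random

random.seed(1)
N = 200_000
g = [random.gauss(0.0, 1.0) for _ in range(N)]

# F(x) = x^3, F'(x) = 3 x^2; both sides should equal E g^4 = 3
lhs = sum(x * x**3 for x in g) / N      # estimate of E[g F(g)]
rhs = sum(3 * x**2 for x in g) / N      # estimate of E[F'(g)]
print(lhs, rhs)
```

The left side has a much larger Monte Carlo variance (it involves fourth moments), which is one practical reason the integration-by-parts form is often preferred.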
Instead of considering i.i.d. sequences, for each n \ge 1 we will now consider a vector (X_1^n, \ldots, X_n^n) of independent random variables, not necessarily identically distributed. This setting is called triangular arrays for obvious reasons.
Theorem 32 For each n \ge 1, consider a vector (X_i^n)_{1 \le i \le n} of independent random variables such that

E X_i^n = 0, \qquad \mathrm{Var}(S_n) = \sum_{i=1}^n E (X_i^n)^2 = 1.

Suppose also that Lindeberg's condition holds: for every \varepsilon > 0,

\sum_{i=1}^n E (X_i^n)^2 I(|X_i^n| > \varepsilon) \to 0 \text{ as } n \to \infty.

Then the distribution of S_n = \sum_{i \le n} X_i^n converges weakly to N(0,1).
Remark. We will prove this result again using characteristic functions, but one can also use the same argument as in Theorem 29 in the previous section. We will leave it as an exercise below. Also, notice that for i.i.d. random variables this implies the CLT without the assumption E|X_1|^3 < \infty, since in that case X_i^n = X_i/\sqrt{n}.
Proof. First of all, the sequence L\big( \sum_{i \le n} X_i^n \big) is uniformly tight, because by Chebyshev's inequality

P\Big( \Big| \sum_{i \le n} X_i^n \Big| > M \Big) \le \frac{1}{M^2} \le \varepsilon

for large enough M. It remains to show that the characteristic function of S_n converges to e^{-\frac{\lambda^2}{2}}. For simplicity of notation, let us omit the upper index n and write X_i instead of X_i^n. Since

E e^{i\lambda S_n} = \prod_{i=1}^n E e^{i\lambda X_i},
as n \to \infty, by (11.0.1), and then we let \varepsilon \to 0. As a result, to prove (11.0.2) it remains to show that

\sum_{i=1}^n \big( E e^{i\lambda X_i} - 1 \big) \to -\frac{\lambda^2}{2}.

Using (11.0.3) for m = 1, on the event |X_i| > \varepsilon,

\big| e^{i\lambda X_i} - 1 - i\lambda X_i \big| I(|X_i| > \varepsilon) \le \frac{\lambda^2 X_i^2}{2} I(|X_i| > \varepsilon)

and, therefore,

\Big| e^{i\lambda X_i} - 1 - i\lambda X_i + \frac{\lambda^2 X_i^2}{2} \Big| I(|X_i| > \varepsilon) \le \lambda^2 X_i^2 I(|X_i| > \varepsilon). \qquad (11.0.5)

Using (11.0.3) for m = 2, on the event |X_i| \le \varepsilon,

\Big| e^{i\lambda X_i} - 1 - i\lambda X_i + \frac{\lambda^2 X_i^2}{2} \Big| I(|X_i| \le \varepsilon) \le \frac{|\lambda|^3 |X_i|^3}{6} I(|X_i| \le \varepsilon) \le \frac{|\lambda|^3 \varepsilon}{6} X_i^2. \qquad (11.0.6)
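The bound (11.0.3) used above is the standard Taylor estimate |e^{is} - \sum_{j \le m} (is)^j/j!| \le |s|^{m+1}/(m+1)! (proved by integrating, as the exercise below hints). A quick numerical check of this inequality for a few values of s and m:

```python
import cmath
import math

def taylor_remainder(s, m):
    """| e^{is} - sum_{j<=m} (is)^j / j! |, the left side of the bound."""
    partial = sum((1j * s) ** j / math.factorial(j) for j in range(m + 1))
    return abs(cmath.exp(1j * s) - partial)

# the right side is |s|^{m+1} / (m+1)!; check the inequality on a small grid
ok = all(
    taylor_remainder(s, m) <= abs(s) ** (m + 1) / math.factorial(m + 1) + 1e-12
    for m in (1, 2)
    for s in (-2.0, -0.3, 0.1, 0.5, 1.0, 3.0)
)
print(ok)
```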
We will now use this version of the CLT to prove the three series theorem for random series. Let us begin with the
following observation.
The condition P * Q = P implies that f_P(t) f_Q(t) = f_P(t). Since f_P(0) = 1 and f_P(t) is continuous, for small enough |t| \le \delta we have |f_P(t)| > 0 and, as a result, f_Q(t) = 1. Since

f_Q(t) = \int \cos(tx) \, dQ(x) + i \int \sin(tx) \, dQ(x),

for |t| \le \delta this implies that \int \cos(tx) \, dQ(x) = 1 and, since \cos(s) \le 1, this can happen only if

Q\big( \{ x : xt = 0 \bmod 2\pi \} \big) = 1 \text{ for all } |t| \le \delta.

Take s, t such that |s|, |t| \le \delta and s/t is irrational. For x to be in the support of Q we must have xs = 2\pi k and xt = 2\pi m for some integers k, m. This can happen only if x = 0. ⊓⊔
We proved in Section 6 that convergence of random series in probability implies almost sure convergence. It turns out that this result can be strengthened, and convergence in distribution also implies convergence in probability and, therefore, almost sure convergence.
Theorem 33 (Levy's equivalence theorem) If (X_i) is a sequence of independent random variables then the series \sum_{i \ge 1} X_i converges almost surely, in probability, or in distribution, at the same time.
Proof. Of course, we only need to show that convergence in distribution implies convergence in probability. Suppose that L(S_n) \to P. Convergence in law implies that \{L(S_n)\} is uniformly tight, which implies that \{L(S_n - S_k)\}_{n,k \ge 1} is uniformly tight. We will now show that this implies that, for any \varepsilon > 0,

P(|S_n - S_k| > \varepsilon) \le \varepsilon \qquad (11.0.7)

for n \ge k \ge N for large enough N. Suppose not. Then there exist \varepsilon > 0 and sequences (n(l)) and (n'(l)) such that n(l) \le n'(l) and

P\big( |S_{n'(l)} - S_{n(l)}| > \varepsilon \big) \ge \varepsilon.

Let us denote Y_l = S_{n'(l)} - S_{n(l)}. Since \{L(Y_l)\} is uniformly tight, by the selection theorem, there exists a subsequence (l(r)) such that L(Y_{l(r)}) \to Q. Since S_{n'(l(r))} = S_{n(l(r))} + Y_{l(r)}, with the two terms on the right independent, letting r \to \infty we get that P = P * Q. By the above Lemma, Q(\{0\}) = 1, which implies that P(|Y_{l(r)}| > \varepsilon) < \varepsilon for large r, a contradiction. By Lemma 16, (11.0.7) implies that S_n converges in probability. ⊓⊔
Theorem 34 (Three series theorem) Let (X_i)_{i \ge 1} be a sequence of independent random variables and let

Z_i = X_i I(|X_i| \le 1).

Then the series \sum_{i \ge 1} X_i converges almost surely if and only if the following three conditions hold:

(1) \sum_{i \ge 1} P(|X_i| > 1) < \infty;
(2) the series \sum_{i \ge 1} E Z_i converges;
(3) \sum_{i \ge 1} \mathrm{Var}(Z_i) < \infty.
\Rightarrow. If \sum_{i \ge 1} X_i converges almost surely then P(|X_i| > 1 i.o.) = 0 and, since (X_i) are independent, by the Borel-Cantelli lemma,

\sum_{i \ge 1} P(|X_i| > 1) < \infty.

This proves (1). We will prove (3) by contradiction. If \sum_{i \ge 1} X_i converges then, obviously, \sum_{i \ge 1} Z_i converges. Let

S_{mn} = \sum_{m \le k \le n} Z_k.

Since S_{mn} \to 0 as m, n \to \infty, P(|S_{mn}| > \varepsilon) \le \varepsilon for any \varepsilon > 0 for m, n large enough. Suppose that \sum_{i \ge 1} \mathrm{Var}(Z_i) = \infty. Then

\sigma_{mn}^2 = \mathrm{Var}(S_{mn}) = \sum_{m \le k \le n} \mathrm{Var}(Z_k) \to \infty

as n \to \infty for any fixed m. Intuitively, this should not happen, since S_{mn} \to 0 in probability but their variance goes to infinity. In principle, one can construct such a sequence of random variables, but in our case it will be ruled out by Lindeberg's CLT, as follows. Let us show that, by Lindeberg's theorem,

T_{mn} = \frac{S_{mn} - E S_{mn}}{\sigma_{mn}} = \sum_{m \le k \le n} \frac{Z_k - E Z_k}{\sigma_{mn}} \to N(0,1)

in distribution as n \to \infty.
The central limit theorem describes when the distribution of a sum of independent random variables can be approximated by a normal distribution. For a change, we will now give a simple example of a different approximation. Consider a triangular array of independent Bernoulli random variables X_i^n for i \le n such that P(X_i^n = 1) = p_i^n and P(X_i^n = 0) = 1 - p_i^n. If p_i^n = p > 0 then, by the central limit theorem,

L\Big( \frac{S_n - np}{\sqrt{np(1-p)}} \Big) \to N(0,1)

weakly. However, if p = p_n = p_i^n \to 0 fast enough then one can check that Lindeberg's conditions will be violated. We will now show that, under certain conditions, the sum can be approximated in distribution by some Poisson distribution. Recall that, for \lambda > 0, the Poisson distribution \pi_\lambda has probability function

\pi_\lambda(\{k\}) = \frac{\lambda^k}{k!} e^{-\lambda} \text{ for } k = 0, 1, 2, \ldots

The following holds.
S_n = X_1 + \ldots + X_n and \lambda = p_1 + \ldots + p_n.
Proof. The proof is based on a construction on the same probability space. Let us construct a Bernoulli r.v. X_i \sim B(p_i) and a Poisson r.v. \tilde{X}_i \sim \pi_{p_i} on the same probability space as follows. Let us consider the standard probability space ([0,1], B, P) with the Lebesgue measure P. Let us fix i for a moment and define a random variable

X_i = X_i(x) = \begin{cases} 0, & 0 \le x \le 1 - p_i, \\ 1, & 1 - p_i < x \le 1, \end{cases}

and, with c_j = \pi_{p_i}(\{0, \ldots, j\}) = e^{-p_i} \sum_{l \le j} p_i^l / l!, let

\tilde{X}_i = \tilde{X}_i(x) = \begin{cases} 0, & 0 \le x \le c_0, \\ 1, & c_0 < x \le c_1, \\ 2, & c_1 < x \le c_2, \\ \ldots \end{cases}

Clearly, \tilde{X}_i has the Poisson distribution \pi_{p_i}. When do we have X_i \ne \tilde{X}_i? Since 1 - p_i \le e^{-p_i} = c_0, this can only happen for 1 - p_i < x \le c_0 and c_1 < x \le 1, which implies that

P(X_i \ne \tilde{X}_i) \le \big( e^{-p_i} - 1 + p_i \big) + \big( 1 - e^{-p_i}(1 + p_i) \big) \le p_i^2.

Then we construct pairs (X_i, \tilde{X}_i) for i \le n on separate coordinates of the product space [0,1]^n with the product Lebesgue measure, thus making them independent for i \le n. It is well known (and easy to check) that a sum of independent Poisson random variables is again Poisson with the parameter given by the sum of individual parameters and, therefore, \tilde{S}_n = \sum_{i \le n} \tilde{X}_i has the Poisson distribution \pi_\lambda, where \lambda = p_1 + \ldots + p_n. Finally, we use the union bound to conclude that

P(S_n \ne \tilde{S}_n) \le \sum_{i=1}^n P(X_i \ne \tilde{X}_i) \le \sum_{i=1}^n p_i^2,
Exercise. Adapt the argument of Theorem 29 in the previous section to prove Theorem 32. Hint: when using the Taylor
expansion in that proof, mimic equations (11.0.5) and (11.0.6).
Exercise. Prove (11.0.3). (Hint: integrate the inequality to make the induction step.)
Exercise. Consider the random series \sum_{n \ge 1} \varepsilon_n n^{-\alpha} where (\varepsilon_n) are i.i.d. random signs, P(\varepsilon_n = \pm 1) = 1/2. Give the necessary and sufficient condition on \alpha for this series to converge.
Exercise. Let (X_n)_{n \ge 1} be i.i.d. random variables such that E X_1 = 0 and 0 < E X_1^2 < \infty. Show that \sum_{n \ge 1} X_n/a_n converges almost surely if and only if \sum_{n \ge 1} 1/a_n^2 < \infty.
\frac{x^{k-1} e^{-x/\theta}}{\theta^k \Gamma(k)} I(x > 0).
Exercise. Let (X_n)_{n \ge 1} be i.i.d. random variables with the Poisson distribution with mean \lambda = 1. Can you find a_n and b_n such that a_n \sum_{\ell=1}^n X_\ell \log \ell - b_n converges in distribution to a standard Gaussian?
Exercise. Suppose that the chances of winning the jackpot in a lottery are 1 : 139,000,000. Assuming that 100,000,000 people played independently of each other, estimate the probability that 3 of them will have to share the jackpot. Give a bound on the quality of your estimate.
The topic of this section can be viewed as a vast abstract generalization of the Fubini theorem for product spaces.
For example, given a pair of random variables (X,Y ), how can we average a function f (X,Y ) by fixing X = x and
averaging with respect to Y first? Can we define some distribution of Y given X = x that can be used to compute this
average? We will begin by defining these conditional averages, or conditional expectations, and then we will use them
to define the corresponding conditional distributions.
Let (\Omega, A, P) be a probability space and X : \Omega \to \mathbb{R} be a random variable such that E|X| < \infty. Let B be a \sigma-subalgebra of A, B \subseteq A. A random variable Y : \Omega \to \mathbb{R} is called the conditional expectation of X given B if Y is B-measurable and

E Y I_B = E X I_B \text{ for all } B \in B.
This conditional expectation is denoted by Y = E(X|B). If X, Z are random variables then the conditional expectation of X given Z is defined by

Y = E(X|Z) = E(X|\sigma(Z)).

Since Y is measurable on \sigma(Z), Y = f(Z) for some measurable function f. Let us go over a number of properties of conditional expectation, most of which follow easily from the definition.
By definition, Y = E(X|B). Another way to think about conditional expectations is as follows. Notice that L^2(\Omega, B, P) is a linear subspace of L^2(\Omega, A, P) for B \subseteq A. If f \in L^2(\Omega, A, P) then the conditional expectation g = E(f|B) is simply the orthogonal projection of f onto that subspace, since such a projection is B-measurable and the orthogonal projection is defined by E f h = E g h for all h \in L^2(\Omega, B, P). In particular, one can take h = I_B for any B \in B. When f \in L^1(\Omega, A, P), the conditional expectation can be defined by approximating f by functions in L^2(\Omega, A, P).
2. (Uniqueness) Suppose there exists another Y' = E(X|B) such that P(Y \ne Y') > 0, i.e. either P(Y > Y') > 0 or P(Y < Y') > 0. Since both Y, Y' are measurable on B, the set B = \{Y > Y'\} \in B. If P(B) > 0 then E(Y - Y') I_B > 0. On the other hand,

E(Y - Y') I_B = E X I_B - E X I_B = 0,

a contradiction, so P(Y > Y') = 0. Similarly, P(Y < Y') = 0. Of course, the conditional expectation can be redefined on a set of measure zero, so uniqueness is understood in this sense.
3. (Linearity) E(cX + Y|B) = c E(X|B) + E(Y|B), just like for the usual integral. It is obvious that the right hand side satisfies the definition of the conditional expectation E(cX + Y|B), and the equality follows by uniqueness. Again, the equality holds almost surely.
4. (Smoothing) If \sigma-algebras C \subseteq B \subseteq A then

E(X|C) = E(E(X|B)|C).

5. E(X|A) = X and E(X|\{\emptyset, \Omega\}) = E X.
6. (Independence) If X is independent of B then E(X|B) = E X.
7. (Monotone convergence) If 0 \le X_n \uparrow X then E(X_n|B) \uparrow E(X|B). To prove this, let us first note that, since E(X_n|B) \le E(X_{n+1}|B) \le E(X|B), there exists a limit

g = \lim_n E(X_n|B).

Since E(X_n|B) are measurable on B, so is the limit g. It remains to check that, for any set B \in B, E g I_B = E X I_B. Since X_n I_B \uparrow X I_B and E(X_n|B) I_B \uparrow g I_B, the usual monotone convergence theorem implies that

E I_B E(X_n|B) \to E g I_B \quad \text{and} \quad E X_n I_B \to E X I_B.

Since E I_B E(X_n|B) = E X_n I_B, this implies that E g I_B = E X I_B and, therefore, g = E(X|B) almost surely.
8. (Dominated convergence) If |X_n| \le Y, E Y < \infty, and X_n \to X almost surely then E(X_n|B) \to E(X|B) almost surely. To see this, write

-Y \le g_n = \inf_{m \ge n} X_m \le X_n \le h_n = \sup_{m \ge n} X_m \le Y

and apply the monotone convergence property to g_n + Y \uparrow X + Y and Y - h_n \uparrow Y - X.
9. (Taking out what is known) If Y is B-measurable and XY is integrable then E(XY|B) = Y E(X|B). First of all, it is enough to prove this for non-negative X, Y \ge 0 by decomposing X = X^+ - X^- and Y = Y^+ - Y^-. Since Y is B-measurable, we can find a sequence of simple functions Y_n of the form \sum_k w_k I_{C_k}, where C_k \in B, such that 0 \le Y_n \uparrow Y. By the monotone convergence theorem, it is enough to prove the statement for indicators I_C with C \in B, which follows directly from the definition.
10. (Jensen's inequality) If f is convex and f(X) is integrable then

f(E(X|B)) \le E(f(X)|B).

Let h(x) be some monotone (thus, measurable) function such that h(x) is in the subgradient \partial f(x) of the convex function f for all x. Then, by convexity,

f(X) \ge f(E(X|B)) + h(E(X|B)) \big( X - E(X|B) \big).

If we ignore any integrability issues then, taking conditional expectations of both sides,

E(f(X)|B) \ge f(E(X|B)) + h(E(X|B)) \, E\big( X - E(X|B) \,\big|\, B \big) = f(E(X|B)),

where we used the previous property, since h(E(X|B)) is B-measurable. Let us now make this simple idea rigorous. Let B_n = \{\omega : |E(X|B)| \le n\} and X_n = X I_{B_n}. Notice that B_n \in B and, by property 9, E(X_n|B) = I_{B_n} E(X|B) \in [-n, n]. Since both f and h are bounded on [-n, n], now we don't have any integrability issues and, taking conditional expectations of both sides as above, we get

f(E(X_n|B)) \le E(f(X_n)|B).

Now, let n \to \infty. Since E(X_n|B) = I_{B_n} E(X|B) \to E(X|B) and f is continuous, f(E(X_n|B)) converges to f(E(X|B)). On the other hand, since f(X_n) = f(X) I_{B_n} + f(0) I_{B_n^c},

E(f(X_n)|B) = I_{B_n} E(f(X)|B) + f(0) I_{B_n^c} \to E(f(X)|B),

which completes the proof.
Conditional distributions. Again, let (\Omega, A, P) be a probability space and let B \subseteq A be a sub-\sigma-algebra of A. Let (Y, \mathcal{Y}) be a measurable space and f be a measurable function from (\Omega, A) into (Y, \mathcal{Y}). A conditional distribution of f given B is a function P_{f|B} : \mathcal{Y} \times \Omega \to [0,1] such that

(i) for each \omega \in \Omega, P_{f|B}(\cdot, \omega) is a probability measure on \mathcal{Y};
(ii) for each C \in \mathcal{Y}, P_{f|B}(C, \cdot) is a version of the conditional expectation E(I(f \in C)|B).
Theorem 36 If (Y, \mathcal{Y}) is regular then a conditional distribution P_{f|B} exists. Moreover, if P'_{f|B} is another conditional distribution then

P_{f|B}(\cdot, \omega) = P'_{f|B}(\cdot, \omega)

for P|_B-almost all \omega, i.e. the conditional distribution is unique modulo P|_B-almost everywhere equivalence.
Proof. The Borel \sigma-algebra \mathcal{Y} is generated by a countable collection \mathcal{Y}_0 of sets in \mathcal{Y}, for example, by open balls with rational radii and centers in a countable dense subset of (Y, d). The algebra C generated by \mathcal{Y}_0 is countable, since it can be written as a countable union of increasing finite algebras corresponding to finite subsets of \mathcal{Y}_0. Since (Y, d) is complete and separable, by Ulam's theorem (Theorem 5), for any set C \in \mathcal{Y} one can find a sequence of compact subsets K_1 \subseteq K_2 \subseteq \ldots \subseteq C such that P(f \in K_n) \to P(f \in C). If K = \bigcup_{n \ge 1} K_n then, by the conditional monotone convergence theorem,

E(I(f \in K_n)|B) \uparrow E(I(f \in K)|B) = E(I(f \in C)|B) \qquad (12.0.1)

almost surely, where the last equality holds because P(f \in C \setminus K) = 0. Let D be the countable algebra generated by the union of C and the set of all K_n for all C \in C. For each D \in D, let P_{f|B}(D, \omega) be any fixed version of the conditional expectation E(I(f \in D)|B)(\omega). We have

(d) for any C \in C and the specific sequence K_n as chosen, (12.0.1) holds a.s.
Since there are countably many conditions in (a) - (d), there exists a set N \in B such that P(N) = 0 and conditions (a) - (d) hold for all \omega \notin N. We will now show that for all such \omega, P_{f|B}(\cdot, \omega) can be extended to a probability measure on \mathcal{Y}, and this extension is such that for any C \in \mathcal{Y}, P_{f|B}(C, \cdot) is a version of the conditional expectation, i.e. (ii) holds.
First, we need to show that for all \omega \notin N, P_{f|B}(\cdot, \omega) is countably additive on C or, equivalently, for any sequence C_n \in C such that C_n \downarrow \emptyset we have P_{f|B}(C_n, \omega) \downarrow 0. If not, then there exists a sequence C_n \downarrow \emptyset such that P_{f|B}(C_n, \omega) \ge \delta > 0. Using (d), take compact K_n \in D such that P_{f|B}(C_n \setminus K_n, \omega) \le \delta/3^n. Then

P_{f|B}(K_1 \cap \ldots \cap K_n, \omega) \ge P_{f|B}(C_n, \omega) - \sum_{1 \le i \le n} \frac{\delta}{3^i} > \frac{\delta}{2}
for P|_B-almost all \omega. Therefore, (12.0.2) holds for all simple functions. For g \ge 0, take a sequence of simple functions 0 \le g_n \uparrow g. Since E g(f) < \infty, by the monotone convergence theorem for conditional expectations,

E(g_n(f)|B) \uparrow E(g(f)|B)

for all \omega \notin N for some N \in B with P(N) = 0. Assume also that for \omega \notin N, (12.0.2) holds for all functions in the sequence (g_n). By the usual monotone convergence theorem,

\int g_n(y) \, P_{f|B}(dy, \omega) \uparrow \int g(y) \, P_{f|B}(dy, \omega)

for all \omega and, therefore, (12.0.2) holds for all \omega \notin N. This proves the claim for g \ge 0, and the general case follows. ⊓⊔
Product space case. Consider measurable spaces (X, \mathcal{X}) and (Y, \mathcal{Y}) and let (\Omega, A) be the product space, \Omega = X \times Y and A = \mathcal{X} \otimes \mathcal{Y}, with some probability measure P on it. Let

h(x,y) = x \quad \text{and} \quad f(x,y) = y

be the coordinate projections. Let B be the \sigma-algebra generated by the first coordinate projection h, i.e.

B = h^{-1}(\mathcal{X}) = \mathcal{X} \times Y.

Suppose that a conditional distribution P_{f|B} of f given B exists, for example, if (Y, \mathcal{Y}) is regular. For a fixed C \in \mathcal{Y}, P_{f|B}(C, \omega) is B-measurable, by definition, and since B = h^{-1}(\mathcal{X}),

P_{f|B}(C, \omega) = P_x(C), \text{ where } x = h(\omega),

for some \mathcal{X}-measurable function P_x(C). In the product space setting, P_x is called a conditional distribution of y given x. Notice that for any set D \in \mathcal{X}, \{\omega : h(\omega) \in D\} \in B and

P(D \times C) = E\, I(h \in D) \, P_{f|B}(C, \cdot) = \int_D P_x(C) \, dP \circ h^{-1}(x).

Of course, (2) and (3) imply that P_x(C) defines a version of the conditional expectation of I(y \in C) given the first coordinate x. Moreover, (3) implies the following more general statement.
Proof. This coincides with (3) for the indicator of the rectangle D \times C and, as a result, holds for indicators of disjoint unions of rectangles. Then, using the monotone class theorem (or Dynkin's theorem) as in the proof of Fubini's theorem, (12.0.4) can be extended to indicators of all measurable sets in the product \sigma-algebra A = \mathcal{X} \otimes \mathcal{Y}, and then to simple functions, positive measurable functions and all integrable functions. ⊓⊔
This result is a generalization of Fubini's theorem to the case when the measure is not a product measure. One can also view this as a way to construct more general measures on product spaces than product measures. One can start with any function P_x(C) that satisfies properties (1) and (2) and define a measure P on the product space as in property (3). Such functions satisfying properties (1) and (2) are called probability kernels or transition functions. In the case when P_x(C) = \nu(C) for a fixed probability measure \nu, we recover a product measure.
A special version of the product space case is the so-called disintegration theorem. Let (Y, \mathcal{Y}) be regular and let \mu be a probability measure on \mathcal{Y}. Consider a measurable function \pi : (Y, \mathcal{Y}) \to (X, \mathcal{X}) and let \nu = \mu \circ \pi^{-1} be the push-forward measure on \mathcal{X}. Let P be the push-forward measure on the product \sigma-algebra \mathcal{X} \otimes \mathcal{Y} by the map y \mapsto (\pi(y), y), so that P has marginals \nu and \mu.
Proof. To prove (12.0.5), just take g(x,y) = h(x) f(y) in (12.0.4). If we replace h(x) with h(x) g(x) and use (12.0.5) twice, we get

\int \Big( \int f(y) h(\pi(y)) \, dP_x(y) \Big) g(x) \, d\nu(x) = \int f(y) h(\pi(y)) g(\pi(y)) \, d\mu(y)
= \int \Big( \int f(y) \, dP_x(y) \Big) h(x) g(x) \, d\nu(x)
= \int \Big( \int f(y) h(x) \, dP_x(y) \Big) g(x) \, d\nu(x).

Since g is arbitrary, for \nu-almost all x,

\int f(y) h(\pi(y)) \, dP_x(y) = \int f(y) h(x) \, dP_x(y).

This also holds for \nu-almost all x simultaneously for countably many functions f. We can choose these f to be the indicators of sets in the algebra C in the proof of Theorem 36, and this choice obviously implies that h(\pi(y)) = h(x) for P_x-almost all y. ⊓⊔
Example. (Same space case) Suppose that now X = Y, \mathcal{X} \subseteq \mathcal{Y}, and \pi : Y \to X is the identity map, \pi(y) = y. Then \nu = \mu|_{\mathcal{X}} is just the restriction of \mu to the smaller \sigma-algebra \mathcal{X}. In this case, (12.0.5) becomes

\int f(y) h(y) \, d\mu(y) = \int \Big( \int f(y) \, dP_x(y) \Big) h(x) \, d\mu|_{\mathcal{X}}(x),
Exercise. Let (X,Y) be a random variable on \mathbb{R}^2 with the density f(x,y). Show that if E|X| < \infty then

E(X|Y) = \int_{\mathbb{R}} x f(x,Y) \, dx \Big/ \int_{\mathbb{R}} f(x,Y) \, dx,
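This formula can be checked numerically for a simple joint density. In the sketch below (an illustrative density, not from the text) Y is exponential and X given Y is uniform on (0, Y), so that E(X|Y=y) = y/2; the ratio of numerical integrals recovers this value.

```python
import math

def cond_exp_x_given_y(f, y, xs):
    """Numerically evaluate E(X | Y = y) = (int x f(x,y) dx) / (int f(x,y) dx)
    with simple Riemann sums on the grid xs."""
    dx = xs[1] - xs[0]
    num = sum(x * f(x, y) for x in xs) * dx
    den = sum(f(x, y) for x in xs) * dx
    return num / den

# illustrative joint density: Y ~ Exp(1) and, given Y = y, X uniform on (0, y),
# so f(x, y) = e^{-y} / y for 0 < x < y, and E(X | Y = y) = y / 2
def f(x, y):
    return math.exp(-y) / y if 0 < x < y else 0.0

y = 3.0
xs = [i * 0.001 for i in range(1, 4000)]
val = cond_exp_x_given_y(f, y, xs)
print(val)  # close to y / 2 = 1.5
```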
Exercise. In the setting of the Disintegration Theorem, if X = Y = [0,1], \mu is the Lebesgue measure on Y, and \pi(y) = 2y I(y \le 1/2) + 2(1-y) I(y > 1/2), what are \nu and P_x?
Exercise. In the setting of the Disintegration Theorem, show that if X is a separable metric space then, for \nu-almost every x, P_x(\pi^{-1}(\{x\})) = 1.
Exercise. Suppose that X, Y and U are random variables on some probability space such that U is independent of (X,Y) and has the uniform distribution on [0,1]. Prove that there exists a measurable function f : \mathbb{R} \times [0,1] \to \mathbb{R} such that (X, f(X,U)) has the same distribution as (X,Y) on \mathbb{R}^2.
Let (\Omega, B, P) be a probability space and let (T, \le) be a linearly ordered set. We will mostly work with countable sets T, such as \mathbb{Z} or subsets of \mathbb{Z}. Consider a family of random variables X_t : \Omega \to \mathbb{R} and a family of \sigma-algebras (B_t)_{t \in T} such that B_t \subseteq B_u \subseteq B for t \le u. A family (X_t, B_t)_{t \in T} is called a martingale if the following hold:

(1) E|X_t| < \infty for all t \in T;
(2) X_t is B_t-measurable;
(3) E(X_u|B_t) = X_t for all t \le u.

If the last equality is replaced by E(X_u|B_t) \le X_t then the process is called a supermartingale, and if E(X_u|B_t) \ge X_t then it is called a submartingale. If (X_t, B_t) is a martingale and X_t = E(X|B_t) for some random variable X then the martingale is called right-closable. If some t_0 \in T is an upper bound of T (i.e. t \le t_0 for all t \in T) then the martingale is called right-closed. Of course, in this case X_t = E(X_{t_0}|B_t). Clearly, a right-closable martingale can be made into a right-closed martingale by adding an additional point, say t_0, to the set T so that t \le t_0 for all t \in T, and defining X_{t_0} := X. Let us give several examples of martingales.
Example 1. Consider a sequence (X_n)_{n \ge 1} of independent random variables such that E X_i = 0 and let S_n = \sum_{i \le n} X_i. If B_n = \sigma(X_1, \ldots, X_n) then (S_n, B_n)_{n \ge 1} is a martingale since, for m \le n,

E(S_n|B_m) = S_m + \sum_{m < i \le n} E(X_i|B_m) = S_m + \sum_{m < i \le n} E X_i = S_m.
Therefore,

S_n = E(S_n|B_n) = \sum_{1 \le k \le n} E(X_k|B_n) = n E(X_1|B_n) \quad \text{and} \quad Z_n := \frac{S_n}{n} = E(X_1|B_n).
Theorem 40 (Doob's decomposition) If (X_n, B_n)_{n \ge 0} is a submartingale then it can be uniquely decomposed as X_n = Z_n + Y_n, where (Y_n, B_n) is a martingale, Z_0 = 0, Z_n \le Z_{n+1} almost surely and Z_n is B_{n-1}-measurable.
The sequence (Z_n) is called predictable, since it is a function of X_1, \ldots, X_{n-1}, so we know it at time n - 1. Since an increasing sequence is always convergent, the question of convergence for submartingales is reduced to the question of convergence for martingales.
Proof. Let D_n = X_n - X_{n-1}, G_n = E(D_n|B_{n-1}) and

H_n = D_n - G_n, \quad Y_n = X_0 + H_1 + \ldots + H_n, \quad Z_n = G_1 + \ldots + G_n.

Since (X_n) is a submartingale, G_n \ge 0, so Z_n \le Z_{n+1}, and E(H_n|B_{n-1}) = 0 and, therefore, E(Y_n|B_{n-1}) = Y_{n-1}. Uniqueness follows by construction. Suppose that X_n = Z_n + Y_n with all the stated properties. First, since Z_0 = 0, Y_0 = X_0. By induction, given a unique decomposition up to n - 1, we can write

Z_n = Z_{n-1} + E(X_n - X_{n-1}|B_{n-1}) \quad \text{and} \quad Y_n = X_n - Z_n,

since Z_n is B_{n-1}-measurable and E(Y_n - Y_{n-1}|B_{n-1}) = 0.
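A concrete instance of Doob's decomposition: for a \pm 1 random walk, X_n = S_n^2 is a submartingale with compensator Z_n = n and martingale part Y_n = S_n^2 - n. The sketch below verifies the martingale property of Y exactly by enumerating all sign paths (an illustrative check, not part of the proof).

```python
from itertools import product

# Submartingale X_n = S_n^2 for a +/-1 random walk: Doob decomposition
# X_n = Z_n + Y_n with predictable compensator Z_n = n and martingale
# part Y_n = S_n^2 - n.
def Y(path, n):
    s = sum(path[:n])
    return s * s - n

n_steps = 4
for prefix in product((-1, 1), repeat=n_steps - 1):
    # average Y_n over the two equally likely continuations of the path
    avg = sum(Y(prefix + (eps,), n_steps) for eps in (-1, 1)) / 2
    assert avg == Y(prefix, n_steps - 1)   # martingale property, exactly
print("martingale property verified")
```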
When we study convergence properties of martingales, an important role will be played by their uniform integrability properties. We say that a collection of random variables (X_t)_{t \in T} is uniformly integrable if

\lim_{M \to \infty} \sup_{t \in T} E|X_t| I(|X_t| > M) = 0.

For example, when |X_t| \le Y for all t \in T and E Y < \infty then, clearly, (X_t)_{t \in T} is uniformly integrable. Other basic criteria of uniform integrability will be given in the exercises below. The following criterion of L^1-convergence, which can be viewed as a strengthening of the dominated convergence theorem, will be useful.
Lemma 34 Consider random variables (X_n) and X such that E|X_n| < \infty, E|X| < \infty. The following are equivalent:
1. E|X_n - X| \to 0 as n \to \infty.
2. (X_n)_{n \ge 1} is uniformly integrable and X_n \to X in probability.
\limsup_n E|X_n - X| \le \varepsilon + 2 \sup_n E|X_n| I(|X_n| > K) + 2 E|X| I(|X| > K).

Now, first letting \varepsilon \to 0, and then letting K \to \infty and using that (X_n) is uniformly integrable, proves the result.
For the converse, if we take K such that E|X| I(|X| \ge K) \le \varepsilon/2 then it is enough to take \delta = \varepsilon/(2K). Now, given \varepsilon > 0, take \delta as above and take M > 0 large enough, so that for all n \ge 1,

P(|X_n| > M) \le \frac{E|X_n|}{M} \le \delta.

We showed that E|X| I(|X_n| > M) \le \varepsilon for such \delta and, therefore,

E|X_n| I(|X_n| > M) \le E|X_n - X| + E|X| I(|X_n| > M) \le 2\varepsilon \text{ for } n \ge n_0.

We can also choose M large enough so that E|X_n| I(|X_n| > M) \le 2\varepsilon for n \le n_0, and this finishes the proof. ⊓⊔
Next, we will prove some uniform integrability properties for martingales and submartingales, but first let us make the
following simple observation.
Lemma 35 Let f : \mathbb{R} \to \mathbb{R} be a convex function such that E|f(X_t)| < \infty for all t. Suppose that either of the following two conditions holds:
1. (X_t, B_t) is a martingale;
2. (X_t, B_t) is a submartingale and f is increasing.
Then (f(X_t), B_t) is a submartingale. Indeed, under the first condition, by Jensen's inequality for conditional expectations, f(X_t) = f(E(X_u|B_t)) \le E(f(X_u)|B_t) for t \le u. Under the second condition, since X_t \le E(X_u|B_t) for t \le u and f is increasing,

f(X_t) \le f(E(X_u|B_t)) \le E(f(X_u)|B_t).
Proof. 1. Since there exists an integrable random variable X such that X_t = E(X|B_t), and \{|X_t| > M\} \in B_t,

E|X_t| I(|X_t| > M) \le E\, E(|X|\,|B_t) I(|X_t| > M) = E|X| I(|X_t| > M)

and, therefore,

E|X_t| I(|X_t| > M) \le K P(|X_t| > M) + E|X| I(|X| > K) \le K \frac{E|X|}{M} + E|X| I(|X| > K).

Letting M \to \infty and then K \to \infty proves uniform integrability.
2. Since X_t \le E(X|B_t) and \{X_t > M\} \in B_t, as above this implies that E X_t I(X_t > M) \le E X I(X_t > M). Also, since the function \max(a, x) is convex and increasing in x, by part 2 of the previous lemma, (\max(X_t, a), B_t) is a submartingale. Finally, if we take M > |a| then the inequality |\max(X_t, a)| > M can hold only if \max(X_t, a) = X_t > M. Combining all these observations,

E|\max(X_t, a)| I(|\max(X_t, a)| > M) = E X_t I(X_t > M) \le E X I(X_t > M)
\le K P(X_t > M) + E|X| I(|X| > K)
\le K \frac{E \max(X_t, 0)}{M} + E|X| I(|X| > K)
\le \{ \text{by (13.0.2)} \} \; K \frac{E \max(X, 0)}{M} + E|X| I(|X| > K).

Letting M \to \infty and then K \to \infty proves that (\max(X_t, a))_{t \in T} is uniformly integrable. ⊓⊔
Exercise. In the setting of Example 3 above, prove carefully that E(X_1|B_n) = E(X_k|B_n) for 1 \le k \le n.
Exercise. Let (X_n)_{n \ge 1} be i.i.d. random variables and, given a bounded measurable function f : \mathbb{R}^2 \to \mathbb{R} such that f(x,y) = f(y,x), consider

S_n = \frac{2}{n(n-1)} \sum_{1 \le l < l' \le n} f(X_l, X_{l'})

(called the U-statistics) and the \sigma-algebras F_n = \sigma\big( X_{(1)}, \ldots, X_{(n)}, (X_l)_{l > n} \big), where X_{(1)}, \ldots, X_{(n)} are the order statistics of X_1, \ldots, X_n. Prove that F_{n+1} \subseteq F_n and S_n = E(S_2|F_n), i.e. (S_n, F_n)_{n \ge 1} is a right-closed martingale.
Exercise. Suppose that (X_n)_{n \ge 1} are i.i.d. and E|X_1| < \infty. If S_n = X_1 + \ldots + X_n, show that the sequence (S_n/n)_{n \ge 1} is uniformly integrable.
Exercise. Show that if \sup_{t \in T} E|X_t|^{1+\varepsilon} < \infty for some \varepsilon > 0 then (X_t)_{t \in T} is uniformly integrable.
Exercise. Show that (X_t)_{t \in T} is uniformly integrable if and only if (a) \sup_t E|X_t| < \infty and (b) for any \varepsilon > 0 there exists \delta > 0 such that \sup_{t \in T} E|X_t| I_A \le \varepsilon if P(A) \le \delta.
Exercise. Suppose that random variables X_n \ge 0 for n \ge 0, X_n \to X_0 in probability and E X_n \to E X_0. Prove that \lim_n E|X_n - X_0| = 0. (Hint: consider (X_0 - X_n)^+.)
Stopping times.
Consider a probability space (\Omega, B, P) and a sequence of \sigma-algebras B_n \subseteq B for n \ge 0 such that B_n \subseteq B_{n+1}. An integer-valued random variable \tau \in \{0, 1, 2, \ldots\} is called a stopping time if \{\tau \le n\} \in B_n for all n. Of course, in this case we also have

\{\tau = n\} = \{\tau \le n\} \setminus \{\tau \le n-1\} \in B_n.

Given a stopping time \tau, let us consider the \sigma-algebra

B_\tau = \big\{ B \in B : B \cap \{\tau \le n\} \in B_n \text{ for all } n \big\}

consisting of all events that depend on the data up to the stopping time \tau.
1. If a sequence of random variables (X_n) is adapted to (B_n), i.e. X_n is B_n-measurable, then random variables such as X_\tau or \sum_{k=1}^{\tau} X_k are B_\tau-measurable. For example,

\{\tau \le n\} \cap \{X_\tau \in A\} = \bigcup_{k \le n} \{\tau = k\} \cap \{X_k \in A\} = \bigcup_{k \le n} \big( \{\tau \le k\} \setminus \{\tau \le k-1\} \big) \cap \{X_k \in A\} \in B_n.

2. The stopping time \tau is B_\tau-measurable. To see this, we need to show that \{\tau \le k\} \in B_\tau for all integers k \ge 0. This is true, because

\{\tau \le k\} \cap \{\tau \le n\} = \{\tau \le k \wedge n\} \in B_{k \wedge n} \subseteq B_n,

by the definition of a stopping time.
3. \tau is a stopping time if and only if \{\tau > n\} \in B_n for all n. This is obvious, because \{\tau > n\} = \{\tau \le n\}^c and \sigma-algebras are closed under taking complements.
4. If (\tau_k)_{k \ge 1} are all stopping times then their minimum \inf_{k \ge 1} \tau_k and maximum \sup_{k \ge 1} \tau_k are also stopping times. This is true, because

\Big\{ \inf_{k \ge 1} \tau_k > n \Big\} = \bigcap_{k \ge 1} \{\tau_k > n\} \in B_n, \qquad \Big\{ \sup_{k \ge 1} \tau_k \le n \Big\} = \bigcap_{k \ge 1} \{\tau_k \le n\} \in B_n.

5. If \tau_1 and \tau_2 are stopping times then the sum \tau_1 + \tau_2 is a stopping time. This is true, because

\{\tau_1 + \tau_2 \le n\} = \bigcup_{k=0}^{n} \{\tau_1 = k\} \cap \{\tau_2 \le n - k\} \in B_n.

We will leave other properties, concerning two stopping times \tau_1 and \tau_2, as an exercise.
We will leave other properties, concerning two stopping times 1 and 2 , as an exercise.
8. If 1 2 then B1 B2 .
A martingale (Xn, Bn) often represents a fair game: the average value E(Xm | Bn) at some future time m > n, given the data Bn at time n, is equal to the value Xn at time n. A stopping time represents a strategy to stop the game based only on the information available at any given time, so that the event {τ ≤ n} that we stop the game before time n depends only on the data Bn up to that time. A classical result that we will now prove states that, under some mild integrability conditions, the average value under any such strategy is the same, so there is no winning strategy on average.
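This "no winning strategy on average" statement is easy to check numerically. The following sketch is our illustration, not part of the notes: it plays a fair ±1 coin game with the strategy "stop the first time the running sum reaches +1, or at time 50 at the latest". Since this stopping time is bounded, optional stopping predicts that the expected stopped value is still 0, not 1.

```python
import random

def stopped_value(max_time=50):
    """Fair +/-1 game; stop when the running sum first hits +1,
    or at max_time, whichever comes first (a bounded stopping time)."""
    s = 0
    for _ in range(max_time):
        s += random.choice((-1, 1))
        if s == 1:  # the "quit while ahead" strategy
            break
    return s

random.seed(0)
n = 20000
avg = sum(stopped_value() for _ in range(n)) / n
print(avg)  # close to 0: the strategy gains nothing on average
```

Although the walk ends at +1 most of the time, the rare runs that never get ahead end deeply negative, and the two effects cancel exactly in expectation.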
Theorem 41 (Optional Stopping) Let (Xn, Bn) be a martingale and τ1, τ2 < ∞ be stopping times such that
E|Xτ2| < ∞ and lim_{n→∞} E|Xn| I(n ≤ τ2) = 0.  (14.0.1)
Then, for any A ∈ Bτ1,
E Xτ2 I_A I(τ1 ≤ τ2) = E Xτ1 I_A I(τ1 ≤ τ2).
For example, if τ1 = 0 and A = Ω then EXτ2 = EX0 if the condition (14.0.1) is satisfied. For example, if the stopping time τ2 is bounded then (14.0.1) is obviously satisfied. Let us note that without some kind of integrability condition, Theorem 41 cannot hold, as the following example shows.
Example. Consider an i.i.d. sequence (Xn) such that P(Xn = ±2^n) = 1/2. If Bn = σ(X1, …, Xn) then (Sn, Bn) is a martingale. Let τ1 = 1 and τ2 = min{k ≥ 1 : Sk > 0}. Clearly, Sτ2 = 2 because, if τ2 = k, then
Sτ2 = Sk = −2 − 2² − … − 2^{k−1} + 2^k = 2.
However, 2 = ESτ2 ≠ ESτ1 = 0. Notice that the second condition in (14.0.1) is violated, since
P(τ2 = n) = 2^{−n},  P(τ2 ≥ n + 1) = Σ_{k≥n+1} 2^{−k} = 2^{−n}
and, therefore,
E|Sn| I(n ≤ τ2) = 2 P(τ2 = n) + (2^{n+1} − 2) P(n + 1 ≤ τ2) = 2,
which does not go to zero. □
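The example is easy to confirm by simulation; the sketch below (ours, not from the notes) draws the signs at random and stops at τ2, and the stopped sum always comes out equal to 2.

```python
import random

def stopped_sum(rng):
    """S_k = sum of X_j with P(X_j = +2^j) = P(X_j = -2^j) = 1/2;
    stop at the first k with S_k > 0."""
    s, k = 0, 0
    while True:
        k += 1
        s += rng.choice((-1, 1)) * 2 ** k
        if s > 0:
            return s

rng = random.Random(1)
values = {stopped_sum(rng) for _ in range(1000)}
print(values)  # always {2}, even though every E S_n = 0
```

Because each new term 2^k dominates the sum of all earlier ones, the first positive value is deterministic, which is exactly why the "double or nothing" martingale evades the mean-zero property.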
Proof of Theorem 41. Consider a set A ∈ Bτ1. The result is based on the following formal computation, with the middle step (*) proved below:
E Xτ2 I_A I(τ1 ≤ τ2) = Σ_{n≥1} E Xτ2 I_{An} I(n ≤ τ2)
(*) = Σ_{n≥1} E Xn I_{An} I(n ≤ τ2) = E Xτ1 I_A I(τ1 ≤ τ2),
where An = A ∩ {τ1 = n} ∈ Bn. We begin by writing
E Xn I_{An} I(n ≤ τ2) = E Xn I_{An} I(τ2 = n) + E Xn I_{An} I(n < τ2) = E Xτ2 I_{An} I(τ2 = n) + E Xn I_{An} I(n < τ2).
Since An ∩ {n < τ2} = An ∩ {τ2 ≤ n}^c ∈ Bn, the martingale property gives
E Xn I_{An} I(n < τ2) = E Xn+1 I_{An} I(n < τ2) = E Xn+1 I_{An} I(n + 1 ≤ τ2)
and, therefore,
E Xn I_{An} I(n ≤ τ2) = E Xτ2 I_{An} I(τ2 = n) + E Xn+1 I_{An} I(n + 1 ≤ τ2).
Iterating this identity over n, n + 1, …, m − 1 gives
E Xn I_{An} I(n ≤ τ2) = E Xτ2 I_{An} I(n ≤ τ2 < m) + E Xm I_{An} I(m ≤ τ2).  (14.0.3)
It remains to let m → ∞ and use the assumptions in (14.0.1). By the second assumption, the last term
|E Xm I_{An} I(m ≤ τ2)| ≤ E|Xm| I(m ≤ τ2) → 0
as m → ∞. Since
Xτ2 I_{An} I(n ≤ τ2 < m) → Xτ2 I_{An} I(n ≤ τ2)
almost surely as m → ∞ and E|Xτ2| < ∞, by the dominated convergence theorem, the first term in (14.0.3) converges to E Xτ2 I_{An} I(n ≤ τ2), which proves the step (*).
which gives a clean condition to check if we would like to show that EXτ = EX0. □
Example. (Hitting times of simple random walk) Given p ∈ (0, 1), consider i.i.d. random variables (Xi)i≥1 such that
P(Xi = 1) = p, P(Xi = −1) = 1 − p,
and consider a random walk S0 = 0, Sn+1 = Sn + Xn+1. Consider two integers a ≥ 1 and b ≥ 1 and the stopping time
τ = min{k ≥ 1 : Sk = a or −b}.
lim_{n→∞} E [e^{λSτ}/φ(λ)^τ] I(τ ≤ n) = E [e^{λSτ}/φ(λ)^τ] I(τ < ∞)
by the monotone convergence theorem. Since Y0 = 1, (14.0.4) implies in this case that
1 = E [e^{λSτ}/φ(λ)^τ] I(τ < ∞)   if   lim_{n→∞} E [e^{λSn}/φ(λ)^n] I(τ > n) = 0.  (14.0.5)
The identity on the left hand side is called the fundamental Wald identity. In some cases one can use this to compute the Laplace transform of a stopping time and, thus, its distribution. □
Example. (Symmetric random walk) Let S0 = 0, P(Xi = ±1) = 1/2, Sn = Σ_{k≤n} Xk. Given an integer z ≥ 1, let
τ = min{k : Sk = z or −z}.
Since φ(λ) = ch(λ) ≥ 1 and |Sn| < z on the event {τ > n},
E [e^{λSn}/φ(λ)^n] I(τ > n) ≤ [e^{|λ|z}/ch(λ)^n] P(τ > n) → 0
and the right hand side of (14.0.5) holds. Therefore,
1 = E [e^{λSτ}/ch(λ)^τ] = e^{λz} E ch(λ)^{−τ} I(Sτ = z) + e^{−λz} E ch(λ)^{−τ} I(Sτ = −z) = [(e^{λz} + e^{−λz})/2] E ch(λ)^{−τ}
by symmetry. Therefore,
E ch(λ)^{−τ} = 1/ch(λz)  and  E e^{−ντ} = 1/ch(ch^{−1}(e^ν) z)
by the change of variables e^{−ν} = 1/ch(λ). □
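A Monte Carlo sketch (ours, not in the notes) of the last formula: for the symmetric walk with z = 3, the sample average of ch(λ)^{−τ} should be close to 1/ch(λz).

```python
import math, random

def hit_time(z, rng):
    """First time the simple symmetric random walk hits +z or -z."""
    s, t = 0, 0
    while abs(s) < z:
        s += rng.choice((-1, 1))
        t += 1
    return t

rng = random.Random(0)
z, lam, n = 3, 0.4, 20000
est = sum(math.cosh(lam) ** -hit_time(z, rng) for _ in range(n)) / n
exact = 1 / math.cosh(lam * z)
print(est, exact)  # the two numbers agree to about two decimals
```

Since ch(λ)^{−τ} ∈ (0, 1], the estimator has small variance, so even a modest number of runs reproduces the Laplace-transform identity well.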
For more general stopping times the condition on the right hand side of (14.0.5) might not be easy to check. We will now show another approach that is helpful to verify the fundamental Wald identity. If P is the distribution of X1, let Pλ be the distribution with the Radon-Nikodym derivative with respect to P given by
dPλ/dP = e^{λx}/φ(λ).
This is, indeed, a density with respect to P, since
∫_R (e^{λx}/φ(λ)) dP(x) = φ(λ)/φ(λ) = 1.
For convenience of notation, we will think of (Xn) as the coordinates on the product space (R^∞, B^∞, P^∞) with the cylindrical σ-algebra. Therefore, {τ = n} ∈ σ(X1, …, Xn), i.e. it corresponds to a Borel set on R^n. We can write
E [e^{λSτ}/φ(λ)^τ] I(τ < ∞) = Σ_{n=1}^∞ E [e^{λSn}/φ(λ)^n] I(τ = n) = Σ_{n=1}^∞ ∫_{{τ=n}} [e^{λ(x1+…+xn)}/φ(λ)^n] dP(x1) … dP(xn)
= Σ_{n=1}^∞ ∫_{{τ=n}} dPλ(x1) … dPλ(xn) = Pλ^∞(τ < ∞).
This means that we can think of the random variables Xn as having the distribution Pλ and, to prove the Wald identity, we need to show that τ < ∞ with probability one.
Example. (Crossing a growing boundary) Suppose that we have a boundary given by a sequence (f(k)) that changes with time and a stopping time (crossing time)
τ = min{k : Sk ≥ f(k)}.
Under the distribution Pλ,
Eλ Xi = ∫ x (e^{λx}/φ(λ)) dP(x) = φ′(λ)/φ(λ).
If the boundary satisfies
lim sup_n f(n)/n < φ′(λ)/φ(λ)
then, obviously, the random walk Sn will cross it with probability one. By Hölder's inequality, φ(λ) is log-convex and, therefore, φ′(λ)/φ(λ) is increasing, which means that if this condition holds for λ0 then it holds for λ > λ0. □
Exercise. Let (Xn, Bn) be a martingale and τ1 ≤ τ2 ≤ … be a non-decreasing sequence of bounded stopping times. Show that (Xτn, Bτn) is a martingale.
Exercise. Let (Xn, Bn)n≥0 be a martingale and τ a stopping time such that P(τ < ∞) = 1. Suppose that E|Xn|^{1+δ} ≤ c for some c, δ > 0 and for all n. Prove that EXτ = EX0.
Exercise. Given 0 < p < 1, consider i.i.d. random variables (Xi)i≥1 such that P(Xi = 1) = p, P(Xi = −1) = 1 − p, and consider a random walk S0 = 0, Sn+1 = Sn + Xn+1. Consider two integers a ≥ 1 and b ≥ 1 and define a stopping time
τ = min{k ≥ 1 : Sk = a or −b}.
Prove that
E [e^{λSτ}/φ(λ)^τ] I(τ < ∞) = 1.
In this section, we will study convergence of martingales, and we begin by proving two classical inequalities.
Theorem 42 (Doob's inequality) If (Xn, Bn)n≥1 is a submartingale and Yn = max_{1≤k≤n} Xk then, for any M > 0,
P(Yn ≥ M) ≤ (1/M) E Xn I(Yn ≥ M) ≤ (1/M) E Xn^+.  (15.0.1)
Proof. Define a stopping time
τ1 = min{k ≤ n : Xk ≥ M} if such k exists, and τ1 = n otherwise.
Since τ1 ≤ n are bounded stopping times, the optional stopping theorem for submartingales gives E Xτ1 I_A ≤ E Xn I_A for any A ∈ Bτ1. Let us average the inequality Xτ1 ≥ M over the set A = {Yn = max_{1≤k≤n} Xk ≥ M}, which belongs to Bτ1, because
A ∩ {τ1 ≤ k} = {max_{1≤i≤k∧n} Xi ≥ M} ∈ B_{k∧n} ⊆ Bk.
This gives M P(A) ≤ E Xτ1 I_A ≤ E Xn I_A = E Xn I(Yn ≥ M). This is precisely the first inequality in (15.0.1), and the second inequality is obvious. □
Example. (Second Kolmogorov's inequality) If (Xi) are independent and EXi = 0 then Sn = Σ_{1≤i≤n} Xi is a martingale and Sn² is a submartingale. Therefore, by Doob's inequality,
P(max_{1≤k≤n} |Sk| ≥ M) = P(max_{1≤k≤n} Sk² ≥ M²) ≤ (1/M²) E Sn² = (1/M²) Σ_{1≤k≤n} Var(Xk).
This is a big improvement on Chebyshev's inequality, since we control the maximum max_{1≤k≤n} |Sk| instead of the one sum |Sn|. □
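A quick numerical sanity check (ours, not part of the notes) of the maximal inequality for the simple ±1 walk, where Σ Var(Xk) = n:

```python
import random

def max_abs_partial_sum(n, rng):
    """max_{k<=n} |S_k| for a +/-1 random walk (EX = 0, Var X = 1)."""
    s, m = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        m = max(m, abs(s))
    return m

rng = random.Random(0)
n, M, trials = 100, 25, 5000
freq = sum(max_abs_partial_sum(n, rng) >= M for _ in range(trials)) / trials
bound = n / M ** 2  # Kolmogorov's bound: sum of variances over M^2
print(freq, "<=", bound)
```

The empirical frequency comes out far below the bound n/M², which is expected: Kolmogorov's inequality is a worst-case bound over all mean-zero increment distributions.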
Doob's upcrossing inequality. Let (Xn, Bn)n≥1 be a submartingale. Given two real numbers a < b, we will define an increasing sequence of stopping times (τn)n≥1 at which Xn crosses a downward and b upward, as in figure 15.1. More specifically, we define these stopping times by
τ1 = min{n ≥ 1 : Xn ≤ a},  τ2 = min{n > τ1 : Xn ≥ b},
and, recursively, τ_{2k+1} = min{n > τ_{2k} : Xn ≤ a}, τ_{2k+2} = min{n > τ_{2k+1} : Xn ≥ b}.
(Figure 15.1: a path of (Xn) with its upcrossings of the interval [a, b].)
Let ν(a, b, n) = max{k : τ_{2k} ≤ n}, which is the number of upward crossings of the interval [a, b] before time n. Then the following upcrossing inequality holds:
E ν(a, b, n) ≤ E(Xn − a)^+ / (b − a).  (15.0.2)
Proof. Since x ↦ (x − a)^+ is an increasing convex function, Zn = (Xn − a)^+ is also a submartingale. Clearly, the number of upcrossings of the interval [a, b] by (Xn) equals the number of upcrossings of [0, b − a] by (Zn),
νX(a, b, n) = νZ(0, b − a, n),
which means that it is enough to prove (15.0.2) for nonnegative submartingales. From now on we can assume that 0 ≤ Xn and we would like to show that
E ν(0, b, n) ≤ E Xn / b.
Let us define a sequence of random variables θj for j ≥ 1 by
θj = 1 if τ_{2k−1} < j ≤ τ_{2k} for some k, and θj = 0 otherwise,
i.e. θj is the indicator of the event that at time j the process is crossing [0, b] upward. Define X0 = 0. Then, clearly,
b ν(0, b, n) ≤ Σ_{j=1}^{n} θj (Xj − Xj−1) = Σ_{j=1}^{n} I(θj = 1)(Xj − Xj−1).
Notice that {θj = 1} ∈ Bj−1, i.e. the fact that at time j we are crossing upward is determined completely by the sequence up to time j − 1. Then
b E ν(0, b, n) ≤ Σ_{j=1}^{n} E I(θj = 1)(Xj − Xj−1) = Σ_{j=1}^{n} E [I(θj = 1) E(Xj − Xj−1 | Bj−1)]
= Σ_{j=1}^{n} E I(θj = 1) (E(Xj | Bj−1) − Xj−1).
Since (Xn) is a submartingale, E(Xj | Bj−1) − Xj−1 ≥ 0 and, therefore,
I(θj = 1)(E(Xj | Bj−1) − Xj−1) ≤ E(Xj | Bj−1) − Xj−1.
Therefore,
b E ν(0, b, n) ≤ Σ_{j=1}^{n} E (E(Xj | Bj−1) − Xj−1) = Σ_{j=1}^{n} E(Xj − Xj−1) = E Xn.
This finishes the proof for nonnegative submartingales and, therefore, the general case. □
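The upcrossing count is simple to compute along a path, and the inequality can be checked by simulation for the symmetric ±1 walk, which is in particular a submartingale. The helper below is our illustration, not part of the notes.

```python
import random

def upcrossings(path, a, b):
    """Number of completed upward crossings of [a, b] by the path."""
    count, below = 0, False
    for x in path:
        if x <= a:
            below = True
        elif x >= b and below:
            count, below = count + 1, False
    return count

rng = random.Random(0)
a, b, n, trials = -2, 2, 200, 2000
mean_up, mean_bound = 0.0, 0.0
for _ in range(trials):
    s, path = 0, []
    for _ in range(n):
        s += rng.choice((-1, 1))
        path.append(s)
    mean_up += upcrossings(path, a, b) / trials
    mean_bound += max(path[-1] - a, 0) / (b - a) / trials
print(mean_up, "<=", mean_bound)  # matches E nu <= E(X_n - a)^+ / (b - a)
```

The averaged upcrossing count stays below the averaged bound E(Xn − a)^+/(b − a), in line with (15.0.2).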
Finally, we will now use Doob's upcrossing inequality to prove our main result about the convergence of submartingales, from which the convergence of martingales will follow.
Proof. Let us note that Xn converges to a limit as n → ∞ (here, the limit may be +∞ or −∞) if and only if lim sup Xn = lim inf Xn. Therefore, Xn diverges if and only if the following event occurs:
{lim sup Xn > lim inf Xn} = ∪_{a<b} {lim sup Xn ≥ b > a ≥ lim inf Xn},
where the union is taken over all rational numbers a < b. This means that P(Xn diverges) > 0 if and only if there exist rational a < b such that
P(lim sup Xn ≥ b > a ≥ lim inf Xn) > 0.
If we recall that ν(a, b, n) denotes the number of upcrossings of [a, b], this is equivalent to
P(lim_{n} ν(a, b, n) = ∞) > 0.
By Doob's upcrossing inequality,
E νY(a, b, n) ≤ E(Yn − a)^+ / (b − a) ≤ E(X1 − a)^+ / (b − a) < ∞.
Since 0 ≤ νY(a, b, n) ↑ ν(a, b) almost surely as n → +∞, by the monotone convergence theorem, E ν(a, b) < ∞. Therefore, P(ν(a, b) = ∞) = 0, which means that the sequence (Xn) cannot have infinitely many upcrossings of [a, b] and, therefore, the limit
X−∞ = lim_{n→−∞} Xn
exists with probability one. By Fatou's lemma and the submartingale property,
E X−∞^+ ≤ lim inf E Xn^+ ≤ E X1^+ < ∞.
It remains to show that (Xn, Bn)_{−∞≤n<∞} is a submartingale. First of all, X−∞ = lim_{n→−∞} Xn is measurable on Bn for all n and, therefore, measurable on B−∞ = ∩n Bn. Let us take a set A ∈ B−∞. Finally, since the function x ↦ x ∨ a is convex and non-decreasing, (Xn ∨ a, Bn)_{−∞<n<∞} is a submartingale by Lemma 35 in Lecture 13 and, therefore, for any n ≤ m, we have E(Xn ∨ a) I_A ≤ E(Xm ∨ a) I_A since A ∈ Bn. This proves that E(X−∞ ∨ a) I_A ≤ E(Xm ∨ a) I_A. Letting a → −∞, by the monotone convergence theorem, E X−∞ I_A ≤ E Xm I_A.
Proof of 2. If ν(a, b, n) is the number of upcrossings of [a, b] by the sequence X1, …, Xn then, by Doob's upcrossing inequality, E ν(a, b, n) ≤ E(Xn − a)^+/(b − a) is bounded uniformly in n. Therefore, E ν(a, b) < ∞ for ν(a, b) = lim_{n} ν(a, b, n) and, as above, the limit X+∞ = lim_{n→+∞} Xn exists. Since all Xn are measurable on B+∞, so is X+∞. Finally, by Fatou's lemma,
E X+∞^+ ≤ lim inf E Xn^+ < ∞.
Notice that a different assumption, sup_n E|Xn| < ∞, would similarly imply that E|X+∞| ≤ lim inf E|Xn| < ∞.
Proof of 3. By 2, the limit X+∞ exists. We want to show that for any m and for any set A ∈ Bm,
E Xm I_A ≤ E X+∞ I_A.
We already mentioned above that (Xn ∨ a) is a submartingale and, therefore, E(Xm ∨ a) I_A ≤ E(Xn ∨ a) I_A for m ≤ n. Clearly, if (Xn^+)_{−∞<n<∞} is uniformly integrable then (Xn ∨ a)_{−∞<n<∞} is uniformly integrable for all a ∈ R. Since Xn ∨ a → X+∞ ∨ a almost surely, by Lemma 34,
E(Xn ∨ a) I_A → E(X+∞ ∨ a) I_A,
and this shows that E(Xm ∨ a) I_A ≤ E(X+∞ ∨ a) I_A. Letting a → −∞ and using the monotone convergence theorem we get that E Xm I_A ≤ E X+∞ I_A. □
Convergence of martingales. If (Xn, Bn) is a martingale then both (Xn, Bn) and (−Xn, Bn) are submartingales and one can apply the above theorem to both of them. For example, this implies the following.
1. If sup_n E|Xn| < ∞ then the almost sure limit X+∞ = lim_{n→+∞} Xn exists and E|X+∞| < ∞.
2. If (Xn)_{−∞<n<∞} is uniformly integrable then (Xn, Bn)_{−∞<n≤+∞} is a right-closed martingale.
In other words, a martingale is right-closable if and only if it is uniformly integrable. Of course, in this case we also conclude that lim_{n→+∞} E|Xn − X+∞| = 0. For a martingale of the type E(X|Bn) we can identify the limit as follows.
Theorem 45 (Lévy's convergence theorem) Let (Ω, B, P) be a probability space and X : Ω → R be a random variable such that E|X| < ∞. Given a sequence of σ-algebras
B1 ⊆ … ⊆ Bn ⊆ … ⊆ B+∞ := σ(∪n Bn) ⊆ B,
we have E(X|Bn) → E(X|B+∞) almost surely and in L1.
Proof. (Xn, Bn)_{1≤n<∞} is a right-closable martingale since Xn = E(X|Bn). Therefore, it is uniformly integrable and the limit X+∞ := lim_{n→+∞} E(X|Bn) exists. It remains to show that X+∞ = E(X|B+∞). Consider a set A ∈ ∪n Bn. Since A ∈ Bm for some m and the martingale converges in L1,
E X+∞ I_A = lim_{n} E Xn I_A = E X I_A.
Since ∪n Bn is an algebra, we get E X+∞ I_A = E X I_A for all A ∈ B+∞ = σ(∪n Bn) (by Dynkin's theorem or monotone
under the permutations of finitely many of the (Xn), by the Hewitt-Savage 0-1 law, the probability of each such event is 0 or 1, i.e. B−∞ consists of ∅ and Ω up to sets of measure zero. Therefore, Z−∞ is a constant a.s. and, since (Zn)n≥1 is also a martingale, EZ−∞ = EX1. Therefore, we proved that Sn/n → EX1 almost surely.
Example. (Improved Kolmogorov's law of large numbers) Consider a sequence (Xn)n≥1 such that
E(Xn+1 | X1, …, Xn) = 0.
We do not assume independence here. This assumption, obviously, implies that EXkXl = 0 for k ≠ l. Let us show that if a sequence (bn) is such that
bn ≤ bn+1,  lim_{n} bn = ∞  and  Σ_{n=1}^{∞} EXn²/bn² < ∞
then Sn/bn → 0 almost surely. Indeed, Yn = Σ_{k≤n} Xk/bk is a martingale and (Yn) is uniformly integrable, since, by orthogonality of the increments,
E|Yn| I(|Yn| > M) ≤ (1/M) E|Yn|² = (1/M) Σ_{k=1}^{n} EXk²/bk²
so that
sup_n E|Yn| I(|Yn| > M) ≤ (1/M) Σ_{k=1}^{∞} EXk²/bk² → 0
as M → ∞. By the martingale convergence theorem, the limit Y = lim_{n} Yn exists almost surely, and Kronecker's lemma implies that Sn/bn → 0 almost surely. □
Exercise. Let (Xn)n≥1 be a martingale with EXn = 0 and EXn² < ∞ for all n. Show that
P(max_{1≤k≤n} Xk ≥ M) ≤ EXn² / (EXn² + M²).
Exercise. Show that for any random variable Y,
E|Y|^p = ∫_0^∞ p t^{p−1} P(|Y| ≥ t) dt.
Hint: represent |Y|^p = ∫_0^{|Y|} p t^{p−1} dt and switch the order of integration.
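The identity in this exercise can be checked numerically for a simple discrete random variable; the helper below is our illustration, not part of the exercise.

```python
def tail_integral(values, probs, p, steps=100000):
    """Midpoint-rule approximation of the integral of
    p * t^(p-1) * P(|Y| >= t) over [0, max |Y|]."""
    top = max(values)
    h = top / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        tail = sum(q for y, q in zip(values, probs) if y >= t)
        total += p * t ** (p - 1) * tail * h
    return total

values, probs, p = [0.5, 1.0, 2.0], [0.2, 0.5, 0.3], 3
moment = sum(q * y ** p for y, q in zip(values, probs))
print(moment, tail_integral(values, probs, p))  # both close to 2.925
```

Here E|Y|³ = 0.2·0.125 + 0.5·1 + 0.3·8 = 2.925, and the tail integral reproduces the same number up to discretization error.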
Exercise. Let X, Y be two non-negative random variables such that, for every t > 0,
P(Y ≥ t) ≤ t^{−1} ∫ X I(Y ≥ t) dP.
For p > 1, with ‖f‖_p = (∫ |f|^p dP)^{1/p} and 1/p + 1/q = 1, show that ‖Y‖_p ≤ q‖X‖_p. Hint: use the previous exercise, switch the order of integration to integrate in t first, then use Hölder's inequality and solve for ‖Y‖_p.
Exercise. Given a non-negative submartingale (Xn, Bn), let Xn* := max_{j≤n} Xj. Prove that for any p > 1 and 1/p + 1/q = 1, ‖Xn*‖_p ≤ q‖Xn‖_p. Hint: use the previous exercise and Doob's maximal inequality.
Exercise. (Pólya's urn model) Suppose we have b blue and r red balls in the urn. We pick a ball randomly and return it together with c balls of the same color. Let us consider the sequence
Yn = #(blue balls after n iterations) / #(total balls after n iterations).
Prove that the limit Y = lim_{n} Yn exists almost surely.
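A simulation sketch (ours) of Pólya's urn shows the fractions settling down, as the martingale convergence theorem predicts; the limit is random and varies from run to run.

```python
import random

def polya_fractions(b, r, c, n, rng):
    """Fraction of blue balls after each of n draws from Polya's urn."""
    blue, total, out = b, b + r, []
    for _ in range(n):
        if rng.random() < blue / total:  # a blue ball is drawn
            blue += c
        total += c                       # ball returned with c extra
        out.append(blue / total)
    return out

ys = polya_fractions(b=2, r=3, c=1, n=5000, rng=random.Random(0))
late_wiggle = max(abs(y - ys[-1]) for y in ys[4000:])
print(ys[-1], late_wiggle)  # the late fluctuations are tiny
```

The per-step change of Yn is of order c/(total balls), so the trajectory becomes nearly flat, which is the almost sure convergence in the exercise.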
Exercise. (Improved Kolmogorov's 0-1 law) Consider arbitrary random variables (Xi)i≥1 and consider the σ-algebras Bn = σ(X1, …, Xn) and B+∞ = σ((Xn)n≥1). Consider any set A ∈ B+∞ that is independent of Bn for all n (also called a tail event). By considering conditional expectations E(I_A | Bn), prove that P(A) = 0 or 1.
Exercise. Suppose that a random variable ξ takes values 0, 1, 2, 3, …. Suppose that Eξ = µ > 1 and Var(ξ) = σ² < ∞. Let ξnk be independent copies of ξ. Let X1 = 1 and define recursively Xn+1 = Σ_{k=1}^{Xn} ξnk. Prove that Sn = Xn/µ^n converges almost surely.
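A simulation sketch (ours; the specific offspring law below is our choice, not from the exercise) of the normalized population Xn/µ^n, whose mean stays equal to 1:

```python
import random

def branching_ratio(n, rng):
    """One Galton-Watson run of n generations, returning X_n / mu^n.
    Offspring law: P(0) = 0.25, P(1) = 0.25, P(2) = 0.5, so mu = 1.25."""
    def offspring():
        u = rng.random()
        return 0 if u < 0.25 else (1 if u < 0.5 else 2)
    x = 1
    for _ in range(n):
        if x == 0:
            break                       # extinct: ratio stays 0
        x = sum(offspring() for _ in range(x))
    return x / 1.25 ** n

rng = random.Random(3)
runs = 2000
avg = sum(branching_ratio(20, rng) for _ in range(runs)) / runs
print(avg)  # close to 1, the constant mean of the martingale
```

Individual runs either die out (ratio 0) or grow roughly geometrically, but the average of Xn/µ^n stays near 1, consistent with the martingale property behind the exercise.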
Exercise. Let (Xn)n≥1 be i.i.d. random variables and, given a bounded measurable function f : R² → R such that f(x, y) = f(y, x), consider the U-statistics
Sn = [2/(n(n−1))] Σ_{1≤l<l′≤n} f(Xl, Xl′).
Consider the σ-algebras Fn = σ(X(1), …, X(n), Xn+1, Xn+2, …), where X(1), …, X(n) are the order statistics of X1, …, Xn. Prove that lim_{n} Sn = E f(X1, X2) almost surely. Hint: notice that ∩_{n≥1} Fn consists of symmetric events.
We will now study the class of the so-called bounded Lipschitz functions on a metric space (S, d), which will play an important role in the next section, when we deal with convergence of laws on metric spaces. For a function f : S → R,
let us define a Lipschitz semi-norm by
‖f‖_L = sup_{x≠y} |f(x) − f(y)| / d(x, y).
Clearly, || f ||L = 0 if and only if f is constant so || f ||L is not a norm, even though it satisfies the triangle inequality.
Let us define a bounded Lipschitz norm by
‖f‖_BL = ‖f‖_L + ‖f‖_∞,
and, therefore,
‖fg‖_BL ≤ ‖f‖_∞‖g‖_∞ + ‖f‖_∞‖g‖_L + ‖g‖_∞‖f‖_L ≤ ‖f‖_BL‖g‖_BL,
which finishes the proof. □
Proof. (1) It is enough to consider k = 2. For specificity, take the operation ∨ = max. Given x, y ∈ S, suppose that
84
Theorem 46 (Extension theorem) Given a set A ⊆ S and a bounded Lipschitz function f ∈ BL(A, d) on A, there exists an extension h ∈ BL(S, d) such that f = h on A and ‖h‖_BL = ‖f‖_BL.
Proof. Let us first find an extension such that ‖h‖_L = ‖f‖_L. We will start by extending f to one point x ∈ S \ A. The value y = h(x) must satisfy
|y − f(s)| ≤ ‖f‖_L d(x, s) for all s ∈ A
or, equivalently,
sup_{s∈A} (f(s) − ‖f‖_L d(x, s)) ≤ y ≤ inf_{s∈A} (f(s) + ‖f‖_L d(x, s)).
Such y exists because, for any s, s′ ∈ A, the triangle inequality gives f(s′) − ‖f‖_L d(x, s′) ≤ f(s) + ‖f‖_L d(x, s).
It remains to apply Zorn's lemma to show that f can be extended to the entire S. Define the order by inclusion:
f1 ≺ f2 if f1 is defined on A1, f2 on A2, A1 ⊆ A2, f1 = f2 on A1 and ‖f1‖_L = ‖f2‖_L.
For any chain {fα}, f = ∪α fα is an upper bound. By Zorn's lemma, there exists a maximal element h. It is defined on the entire S because, otherwise, we could extend it to one more point. To extend f preserving the bounded Lipschitz norm, take
h0 = (h ∧ ‖f‖_∞) ∨ (−‖f‖_∞).
By part (1) of the previous lemma, it is easy to see that ‖h0‖_BL = ‖f‖_BL. □
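The one-point condition above has a well-known closed-form global version, the McShane extension h(x) = inf_{s∈A}(f(s) + ‖f‖_L d(x, s)), which gives an explicit formula instead of a Zorn's-lemma argument. A small numerical sketch (ours) on a finite subset of the real line:

```python
def lipschitz_constant(points, values):
    """Smallest L with |f(x) - f(y)| <= L |x - y| on the finite set."""
    return max(abs(values[i] - values[j]) / abs(points[i] - points[j])
               for i in range(len(points)) for j in range(i))

def extend(points, values, x):
    """McShane extension: h(x) = min_s (f(s) + L * d(x, s))."""
    L = lipschitz_constant(points, values)
    return min(v + L * abs(x - p) for p, v in zip(points, values))

points, values = [0.0, 1.0, 3.0], [0.0, 1.0, 0.0]
print([extend(points, values, p) for p in points])  # agrees with f on A
print(extend(points, values, 2.0))                  # an interpolated value
```

The extension agrees with f on the original points and keeps the same Lipschitz constant on the whole line.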
Let (C(S), d∞) denote the space of continuous real-valued functions on S with d∞(f, g) = sup_{x∈S} |f(x) − g(x)|. To prove the next property of bounded Lipschitz functions, let us first recall the following famous generalization of the Weierstrass theorem. We will give the proof for convenience.
Theorem 47 (Stone-Weierstrass theorem) Let (S, d) be a compact metric space and let F ⊆ C(S) be such that
1. F is closed under linear combinations and under the lattice operations ∧ and ∨,
2. F separates points: for any x ≠ y there exists g ∈ F with g(x) ≠ g(y),
3. F contains constants.
Then F is dense in (C(S), d∞).
Proof. Given x ≠ y and any c, d ∈ R, take g ∈ F with g(x) ≠ g(y); then the system
a g(x) + b = c,  a g(y) + b = d
has a solution a, b. The function f = ag + b takes the prescribed values at x, y and it is in F by 1 and 3. Take h ∈ C(S) and fix x. For any y let fy ∈ F be such that
fy(x) = h(x),  fy(y) = h(y).
By continuity of fy , for any y S there exists an open neighborhood Uy of y such that
Since (Uy ) is an open cover of the compact S, there exists a finite subcover Uy1 , . . . ,UyN . Let us define a function
By construction, h0 (s) h(s) + and h0 (s) h(s) for all s S which means that d (h0 , h) . Since h0 F , this
proves that F is dense in (C(S), d ). t
u
The above results can now be combined to prove the following property of bounded Lipschitz functions.
Corollary 2 If (S, d) is a compact space then the bounded Lipschitz functions BL(S, d) are dense in (C(S), d∞).
Proof. We apply the Stone-Weierstrass theorem with F = BL(S, d). Property 3 is obvious, property 1 follows from Lemma 37 and property 2 follows from the extension Theorem 46, since a function defined on two points x ≠ y such that f(x) ≠ f(y) can be extended to a bounded Lipschitz function on the entire S. □
We will also need another well-known result from analysis. A set A ⊆ S is totally bounded if for any ε > 0 there exists a finite ε-cover of A, i.e. a set of points a1, …, aN such that A ⊆ ∪_{i≤N} B(ai, ε), where B(a, ε) = {y ∈ S : d(a, y) ≤ ε} is the ball of radius ε centered at a.
Theorem 48 (Arzelà-Ascoli) If (S, d) is a compact metric space then a subset F ⊆ C(S) is totally bounded in the d∞ metric if and only if F is equicontinuous and uniformly bounded.
Equicontinuous means that for any ε > 0 there exists δ > 0 such that if d(x, y) ≤ δ then, for all f ∈ F, |f(x) − f(y)| ≤ ε.
The following fact was used in the proof of the Selection Theorem, which was proved for general metric spaces.
Proof. By the above theorem, BL(S, d) is dense in C(S). For any integer n ≥ 1, the set {f : ‖f‖_BL ≤ n} is, obviously, uniformly bounded and equicontinuous. By the Arzelà-Ascoli theorem, it is totally bounded and, therefore, separable, which can be seen by taking the union of finite 1/m-covers for all m ≥ 1. The union
∪_{n≥1} {f : ‖f‖_BL ≤ n} = BL(S, d)
is then separable as a countable union of separable sets.
Exercise. If F is a finite set {x1, …, xn} and P is a law with P(F) = 1, show that for any law Q the supremum
sup{∫ f d(P − Q) : ‖f‖_L ≤ 1}
can be restricted to functions of the form f(x) = min_{1≤i≤n} (ci + d(x, xi)).
Let (S, d) be a metric space and B the Borel σ-algebra generated by open sets. Let us recall that Pn → P weakly if
∫ f dPn → ∫ f dP
for all f ∈ Cb(S), the real-valued bounded continuous functions on S. For a set A ⊆ S, we denote by Ā the closure of A, by int A the interior of A, and by ∂A = Ā \ int A the boundary of A. The set A is called a continuity set of P if P(∂A) = 0.
The following statements are equivalent:
(1) Pn → P weakly.
(2) For any open set U ⊆ S, lim inf_n Pn(U) ≥ P(U).
(3) For any closed set F ⊆ S, lim sup_n Pn(F) ≤ P(F).
(4) For any continuity set A of P, lim_n Pn(A) = P(A).
Proof. (1) ⟹ (2). Let U be an open set and F = U^c. Consider a sequence of functions in Cb(S),
fm(s) = min(1, m d(s, F)).
Since F is closed, d(s, F) = 0 only if s ∈ F and, therefore, fm(s) ↑ I(s ∈ U) as m → ∞. Since, for each m ≥ 1, fm ≤ I(· ∈ U),
Pn(U) ≥ ∫ fm dPn → ∫ fm dP.
Now, letting m → ∞ and using the monotone convergence theorem implies that
lim inf_n Pn(U) ≥ ∫ I(s ∈ U) dP(s) = P(U).
(4) ⟹ (1). Consider f ∈ Cb(S) and let Fy = {s ∈ S : f(s) = y} be a level set of f. There exist at most countably many y such that P(Fy) > 0. Therefore, for any ε > 0 we can find a sequence a1 < … < aN such that
Next, we will introduce two metrics on the set of all probability measures on (S, d) with the Borel σ-algebra B and, under some mild conditions, prove that they metrize weak convergence. For a set A ⊆ S, let us denote by
A^ε = {y ∈ S : d(x, y) < ε for some x ∈ A}
the open ε-neighborhood of A, and define the Lévy-Prokhorov metric by
ρ(P, Q) = inf{ε > 0 : P(A) ≤ Q(A^ε) + ε for all A ∈ B}.
Proof. (1) First, let us show that ρ(Q, P) = ρ(P, Q). Suppose that ρ(P, Q) > ε. Then there exists a set A such that P(A) > Q(A^ε) + ε. Taking complements and noticing that ((A^ε)^c)^ε ⊆ A^c gives
Q((A^ε)^c) > P(A^c) + ε ≥ P(((A^ε)^c)^ε) + ε.
Therefore, for the set B = (A^ε)^c, Q(B) > P(B^ε) + ε. This means that ρ(Q, P) > ε and, therefore, ρ(Q, P) ≥ ρ(P, Q). By symmetry, ρ(Q, P) ≤ ρ(P, Q), so ρ(Q, P) = ρ(P, Q).
(2) Next, let us show that if ρ(P, Q) = 0 then P = Q. For any set F and any n ≥ 1,
P(F) ≤ Q(F^{1/n}) + 1/n.
If F is closed then F^{1/n} ↓ F as n → ∞ and, by the continuity of measure,
P(F) ≤ lim_{n} Q(F^{1/n}) = Q(F).
Similarly, Q(F) ≤ P(F) and, therefore, P(F) = Q(F) for all closed sets and, therefore, for all Borel sets.
which means that ρ(P, R) ≤ x + y, and minimizing over such x, y proves the triangle inequality. □
Given probability distributions P, Q on the metric space (S, d), we define the bounded Lipschitz distance between them by
β(P, Q) = sup{∫ f dP − ∫ f dQ : ‖f‖_BL ≤ 1}.
Proof. It is obvious that β(P, Q) = β(Q, P) and the triangle inequality holds. It remains to prove that if β(P, Q) = 0 then P = Q. Given a closed set F and U = F^c, it is easy to see that
fm(x) = (m d(x, F)) ∧ 1 ↑ I_U
as m → ∞. Obviously, ‖fm‖_BL ≤ m + 1 and, therefore, β(P, Q) = 0 implies that ∫ fm dP = ∫ fm dQ. By the monotone convergence theorem, letting m → ∞ proves that P(U) = Q(U), so P = Q. □
Let us now show that on a separable metric space the metrics ρ and β metrize weak convergence. Before we prove this, let us recall the statement of Ulam's theorem proved in Theorem 5 in Section 1. Namely, every probability law P on a complete separable metric space (S, d) is tight, which means that for any ε > 0 there exists a compact K ⊆ S such that P(S \ K) ≤ ε.
Theorem 50 Let (S, d) be a metric space and let Pn, P be laws on S such that either (S, d) is separable or P is tight. The following are equivalent.
(1) Pn → P weakly.
(2) For all f ∈ BL(S, d), ∫ f dPn → ∫ f dP.
(3) β(Pn, P) → 0.
(4) ρ(Pn, P) → 0.
Remark. We will prove the implications (1) ⟹ (2) ⟹ (3) ⟹ (4) ⟹ (1), and the assumption that (S, d) is separable or P is tight will be used only in one step, to prove (2) ⟹ (3).
Proof. (1) ⟹ (2). This follows by definition, since BL(S, d) ⊆ Cb(S).
(3) ⟹ (4). In fact, we will prove that
ρ(Pn, P) ≤ 2 √(β(Pn, P)).  (17.0.1)
Given a Borel set A ⊆ S and ε > 0, consider the function
f(x) = max(0, 1 − d(x, A)/ε), so that I_A ≤ f ≤ I_{A^ε}.
Obviously, ‖f‖_BL ≤ 1 + 1/ε and we can write
Pn(A) ≤ ∫ f dPn = ∫ f dP + (∫ f dPn − ∫ f dP)
≤ P(A^ε) + (1 + 1/ε) sup{ |∫ g dPn − ∫ g dP| : ‖g‖_BL ≤ 1 } = P(A^ε) + (1 + 1/ε) β(Pn, P).
Taking ε = √(β(Pn, P)) gives Pn(A) ≤ P(A^{2√β}) + β + √β ≤ P(A^{2√β}) + 2√β whenever β(Pn, P) ≤ 1, which proves (17.0.1).
(4) ⟹ (1). Suppose that ρ(Pn, P) → 0, which means that there exists a sequence εn → 0 such that, for all Borel sets A, Pn(A) ≤ P(A^{εn}) + εn.
(2) ⟹ (3). First, let us consider the case when (S, d) is complete and separable. Then, by Ulam's theorem, again, we can find a compact K such that P(S \ K) ≤ ε. If (S, d) is not separable then we assume that P is tight and, by definition, we can again find a compact K such that P(S \ K) ≤ ε. If we consider the function
f(x) = max(0, 1 − d(x, K)/ε), so that ‖f‖_BL ≤ 1 + 1/ε,
then
Pn(K^ε) ≥ ∫ f dPn → ∫ f dP ≥ P(K) ≥ 1 − ε.
This uniform approximation can also be extended to K^ε. Namely, for any x ∈ K^ε take y ∈ K such that d(x, y) ≤ ε. Then
Finally,
β(Pn, P) = sup_{‖f‖_BL ≤ 1} |∫ f dPn − ∫ f dP| ≤ max_{1≤j≤k} |∫ fj dPn − ∫ fj dP| + 12ε,
which means that both statements in (2) and (3) hold simultaneously on (S, d) and (T, e). This proves that (2) ⟺ (3) on all separable metric spaces. □
The focus of the above theorem was to show that the metrics ρ and β metrize weak convergence. Another important aspect of this result is that, on separable spaces, weak convergence can be checked on countably many functions f ∈ Cb(S), which will be demonstrated in the following example. Let (Xi)i≥1 be i.i.d. random variables with distribution µ on (S, d) and consider the empirical measures
µn(A)(ω) = (1/n) Σ_{i=1}^{n} I(Xi(ω) ∈ A), A ∈ B.
For each fixed f ∈ Cb(S), by the strong law of large numbers,
∫ f dµn = (1/n) Σ_{i=1}^{n} f(Xi) → E f(X1) = ∫ f dµ a.s.
However, the set of measure zero where this convergence is violated depends on f and it is not immediately obvious that the convergence holds for all f ∈ Cb(S) with probability one. We will need the following lemma.
Lemma 41 If (S, d) is separable then there exists a metric e on S such that (S, e) is totally bounded, and e and d define the same topology, i.e. e(sn, s) → 0 if and only if d(sn, s) → 0.
Proof. First of all, it is easy to check that c = d/(1 + d) is a metric (that defines the same topology) taking values in [0, 1]. If {sn} is a dense subset of S, consider the map
T(s) = (c(s, sn))_{n≥1} : S → [0, 1]^N.
The key point here is that tm → t in S, i.e. lim_{m} c(tm, t) = 0, if and only if lim_{m} c(tm, sn) = c(t, sn) for each n ≥ 1. In other words, tm → t if and only if T(tm) → T(t) in the Cartesian product [0, 1]^N equipped with the product topology. This topology is compact and metrizable, for example by the metric
r(u, v) = Σ_{n≥1} 2^{−n} |un − vn|,
so we can now define the metric e on S by e(s, s′) = r(T(s), T(s′)). Because of what we said above, this metric defines the same topology as d. Moreover, it is totally bounded, because the image T(S) of our metric space S under the map T is totally bounded in ([0, 1]^N, r), since this space is compact. □
Proof. Let e be the metric in the preceding lemma. Clearly, Cb(S, d) = Cb(S, e) and weak convergence of measures is not affected by this change of metric. If (T, e) is the completion of (S, e) then (T, e) is compact. By the Arzelà-Ascoli theorem, BL(T, e) is separable with respect to the d∞ norm and, therefore, BL(S, e) is also separable. Let (fm) be a dense subset of BL(S, e). Then, by the strong law of large numbers,
∫ fm dµn = (1/n) Σ_{i=1}^{n} fm(Xi) → E fm(X1) = ∫ fm dµ a.s.
Therefore, on a set of probability one, ∫ fm dµn → ∫ fm dµ for all m ≥ 1 and, since (fm) is dense in BL(S, e), on the same set of probability one this convergence holds for all f ∈ BL(S, e). Since (S, e) is separable, the previous theorem implies that µn → µ weakly. □
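For a single fixed f this is just the ordinary strong law, which a short simulation (ours, not part of the notes) illustrates for uniform samples on [0, 1]:

```python
import math, random

rng = random.Random(0)
xs = [rng.random() for _ in range(100000)]  # i.i.d. uniform on [0, 1]

f = math.sin                                # a bounded Lipschitz function
emp = sum(f(x) for x in xs) / len(xs)       # integral of f against mu_n
exact = 1 - math.cos(1.0)                   # integral of sin over [0, 1]
print(emp, exact)
```

The theorem above upgrades this to a single almost sure event on which the empirical integrals converge for every bounded continuous f simultaneously.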
Convergence and uniform tightness. Next, we will make several connections between convergence of measures and uniform tightness on general metric spaces, which are similar to the results in the Euclidean setting. The sequence of laws (Pn) is uniformly tight if, for each ε > 0, there exists a compact set K ⊆ S such that Pn(S \ K) ≤ ε for all n ≥ 1. First, we will show that, in some sense, uniform tightness is necessary for convergence of laws.
Theorem 52 If Pn → P0 weakly and each Pn is tight for n ≥ 0, then (Pn)n≥0 is uniformly tight.
In particular, by Ulam's theorem, any convergent sequence of laws on a complete separable metric space is uniformly tight.
Proof. Since Pn → P0 and P0 is tight, by Theorem 50, the Lévy-Prokhorov metric ρ(Pn, P0) → 0. Given ε > 0, let us take a compact K such that P0(K) > 1 − ε. By the definition of ρ,
1 − ε < P0(K) ≤ Pn(K^{ρ(Pn,P0)+1/n}) + ρ(Pn, P0) + 1/n
and, therefore,
a(n) = inf{δ > 0 : Pn(K^δ) > 1 − ε − δ} ≤ ρ(Pn, P0) + 1/n → 0.
A probability measure is always closed regular, so any measurable set A can be approximated from inside by a closed subset F. Since Pn is tight, we can choose a compact set of measure close to one and, intersecting it with the closed subset F, we can approximate any set A by a compact subset. Therefore, there exists a compact Kn ⊆ K^{2a(n)} such that Pn(Kn) > 1 − 2ε. If we take L = K ∪ (∪_{n≥1} Kn) then
Pn(L) ≥ Pn(Kn) > 1 − 2ε
for all n ≥ 0. It remains to show that L is compact. Consider a sequence (xn) in L. There are two possibilities. First, if there exists an infinite subsequence (xn(k)) that belongs to one of the compacts Kj (or to K) then it has a converging subsubsequence in Kj and, as a result, in L. If not, then there exists a subsequence (xn(k)) such that xn(k) ∈ Km(k) and m(k) → ∞ as k → ∞. Since
Km(k) ⊆ K^{2a(m(k))},
there exists yk ∈ K such that
d(xn(k), yk) ≤ 2a(m(k)).
Since K is compact, the sequence yk ∈ K has a converging subsequence yk(r) → y ∈ K which, since a(m(k)) → 0, implies that d(xn(k(r)), y) → 0, i.e. xn(k(r)) → y ∈ L. Therefore, L is compact. □
We already know by the Selection Theorem that any uniformly tight sequence of laws on any metric space has a
converging subsequence. Under additional assumptions on (S, d) we can complement the Selection Theorem and
make some connections to the metrics defined above.
Theorem 53 Let (S, d) be a complete separable metric space and P be a subset of probability laws on S. Then the following are equivalent.
(1) P is uniformly tight.
(2) For any sequence Pn ∈ P there exists a converging subsequence Pn(k) → P, where P is a law on S.
(3) P has compact closure in the space of probability laws equipped with the Lévy-Prokhorov or bounded Lipschitz metrics ρ or β.
(4) P is totally bounded with respect to ρ or β.
Remark. In other words, we are showing that, on complete separable metric spaces, total boundedness on the space of probability measures is equivalent to uniform tightness. The rest is just basic properties of metric spaces. Also, the implications (1) ⟹ (2) ⟹ (3) ⟹ (4) hold without the completeness assumption and the only implication where completeness will be used is (4) ⟹ (1).
Proof. (1) ⟹ (2). Any sequence Pn ∈ P is uniformly tight and, by the Selection Theorem, there exists a converging subsequence.
(2) ⟹ (3). Since (S, d) is separable, by Theorem 50, Pn → P if and only if ρ(Pn, P) → 0 or β(Pn, P) → 0. Every sequence in the closure of P can be approximated by a sequence in P. That sequence has a converging subsequence which, obviously, converges to an element of the closure, and this means that the closure of P is compact.
(3) ⟹ (4). Compact sets are totally bounded and, therefore, if the closure of P is compact, the set P is totally bounded.
(4) ⟹ (1). Since ρ ≤ 2√β, we will only deal with ρ. For any ε > 0, there exists a finite subset P0 ⊆ P such that every P ∈ P is within ρ-distance ε of some Q ∈ P0. Since (S, d) is complete and separable, by Ulam's theorem, for each Q ∈ P0 there exists a compact KQ such that Q(KQ) > 1 − ε. Therefore, the compact set K = ∪_{Q∈P0} KQ satisfies Q(K) > 1 − ε for all Q ∈ P0.
Let F be a finite set such that K ⊆ F^ε (here we will denote by F^ε the closed ε-neighborhood of F). Since every P ∈ P is within ρ-distance ε of some Q ∈ P0, we get
P(F^{2ε}) ≥ Q(F^ε) − ε ≥ Q(K) − ε > 1 − 2ε.
Thus, P(F^{2ε}) ≥ 1 − 2ε for all P ∈ P. Given ε > 0, take εm = ε/2^{m+1} and find finite sets Fm as above, i.e.
P(Fm^{ε/2^m}) ≥ 1 − ε/2^m for all P ∈ P.
Then
P(∩_{m≥1} Fm^{ε/2^m}) ≥ 1 − Σ_{m≥1} ε/2^m = 1 − ε.
Finally, L = ∩_{m≥1} Fm^{ε/2^m} is compact because it is closed and totally bounded by construction, and S is complete. □
Theorem 54 (Prokhorov's theorem) The set of probability laws on a complete separable metric space is complete with respect to the metrics ρ and β.
Proof. If a sequence of laws is Cauchy with respect to ρ or β then it is totally bounded and, by the previous theorem, it has a converging subsequence. Obviously, a Cauchy sequence then converges to the same limit. □
Exercise. Prove that the space of probability laws Pr(S) on a compact metric space (S, d) is compact with respect to the metrics ρ and β.
Exercise. If (S, d) is separable, prove that Pr(S) is also separable with respect to the metrics ρ and β. Hint: think about probability measures concentrated on finite subsets of a dense countable set in S, and use the metric ρ.
Exercise. (Lévy's metric λ) (a) Show that λ is a metric metrizing convergence of laws on R. (b) Show that λ ≤ ρ, but that there exist laws Pn, Qn with λ(Pn, Qn) → 0 while ρ(Pn, Qn) ↛ 0.
Exercise. Define a specific finite set of laws F on [0, 1] such that for every law P on [0, 1] there exists Q ∈ F with ρ(P, Q) < 0.1, where ρ is Prokhorov's metric. Hint: reduce the problem to the metric λ.
Exercise. Let Xj be i.i.d. N(0, 1) random variables. Let H be a Hilbert space with orthonormal basis {ej}j≥1. Let X = Σ_{j≥1} Xj ej / j. For any ε > 0, find a compact K ⊆ H such that P(X ∈ K) > 1 − ε.
Metric for convergence in probability. Let (Ω, B, P) be a probability space, (S, d) a metric space and X, Y : Ω → S random variables with values in S. The quantity
α(X, Y) = inf{ε ≥ 0 : P(d(X, Y) > ε) ≤ ε}
is called the Ky Fan metric on the set L0(Ω, S) of classes of equivalence of such random variables, where two random variables are equivalent if they are equal almost surely. If we take a sequence
εk ↓ ε = α(X, Y)
then P(d(X, Y) > εk) ≤ εk and, since I(d(X, Y) > εk) ↑ I(d(X, Y) > ε), by the monotone convergence theorem,
P(d(X, Y) > ε) ≤ ε. Thus, the infimum in the definition of α(X, Y) is attained.
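When X and Y take finitely many equally likely joint values, α(X, Y) can be computed exactly by scanning the sorted sample distances; the function below is our illustration of the definition, not part of the notes.

```python
def ky_fan(dists):
    """alpha = inf{eps >= 0 : P(d(X, Y) > eps) <= eps} when each of the
    n sampled distances d(X_i, Y_i) carries probability 1 / n."""
    d, n = sorted(dists), len(dists)
    best = 1.0  # eps = 1 always works, since probabilities are <= 1
    for i in range(n + 1):
        level = d[i - 1] if i > 0 else 0.0
        # for eps = level, at most (n - i) of the distances exceed eps
        best = min(best, max(level, (n - i) / n))
    return best

print(ky_fan([0.0, 0.0, 0.0, 0.0]))   # identical variables: 0.0
print(ky_fan([0.0] * 9 + [1.0]))      # one outlier out of ten: 0.1
```

The second example shows the trade-off built into the definition: a distance of 1 on a single sample point of probability 1/10 contributes exactly 0.1 to the metric.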
Proof. First of all, clearly, α(X, Y) = 0 if and only if X = Y almost surely. To prove the triangle inequality, note that d(X, Z) ≤ d(X, Y) + d(Y, Z) implies
P(d(X, Z) > α(X, Y) + α(Y, Z)) ≤ P(d(X, Y) > α(X, Y)) + P(d(Y, Z) > α(Y, Z)) ≤ α(X, Y) + α(Y, Z),
so that α(X, Z) ≤ α(X, Y) + α(Y, Z). This proves that α is a metric. Next, if αn = α(Xn, X) → 0 then, for any ε > 0 and large enough n such that αn < ε,
(L (X), L (Y )) (X,Y ).
Proof. Take > (X,Y ) so that P(d(X,Y ) ) . For any measurable set A S,
A K-matching f of X into Y is a one-to-one function (injection) f : X → Y such that (x, f(x)) ∈ K. We will need the
following well-known matching theorem.
Theorem 55 (Hall's marriage theorem) If X, Y are finite and, for all A ⊆ X,
card(A_K) ≥ card(A), (18.0.1)
where A_K = {y ∈ Y : (x, y) ∈ K for some x ∈ A}, then there exists a K-matching f of X into Y.
Proof. We will prove the result by induction on m = card(X). The case m = 1 is obvious. For each x ∈ X there
exists y ∈ Y such that (x, y) ∈ K. If there is a matching f of X \ {x} into Y \ {y} then defining f(x) = y extends f to
X. If not, then since card(X \ {x}) < m, by the induction assumption, condition (18.0.1) is violated, i.e. there exists a set
A ⊆ X \ {x} such that card(A_K \ {y}) < card(A). But because we also know that card(A_K) ≥ card(A), this implies that
card(A_K) = card(A). Since card(A) < m, by induction there exists a matching of A onto A_K. If there is a matching of
X \ A into Y \ A_K, we can combine it with the matching of A and A_K. If not then, again by the induction assumption, there
exists D ⊆ X \ A such that card(D_K \ A_K) < card(D). But then
card((A ∪ D)_K) = card(D_K \ A_K) + card(A_K) < card(D) + card(A) = card(D ∪ A),
which contradicts (18.0.1) and finishes the proof. □
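The induction argument above is mirrored by the standard augmenting-path algorithm for constructing a K-matching; the sketch below (our own, not from the text) computes a maximum matching, which covers all of X exactly when Hall's condition (18.0.1) holds.

```python
def find_matching(adj):
    """Kuhn's augmenting-path algorithm. adj[x] is the set of y with
    (x, y) in K; returns a K-matching as a dict x -> y."""
    match_y = {}  # y -> x currently matched to y

    def try_augment(x, visited):
        for y in adj[x]:
            if y not in visited:
                visited.add(y)
                # y is free, or the x occupying y can be re-matched elsewhere
                if y not in match_y or try_augment(match_y[y], visited):
                    match_y[y] = x
                    return True
        return False

    for x in adj:
        try_augment(x, set())
    return {x: y for y, x in match_y.items()}
```

When (18.0.1) holds the returned matching is a K-matching of all of X into Y; when it fails (two points forced onto one neighbor), some x stays unmatched.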
Theorem 56 (Strassen) Suppose that (S, d) is a separable metric space and α, β > 0. Suppose that laws P and Q are
such that, for all measurable sets F ⊆ S,
P(F) ≤ Q(F^α) + β. (18.0.2)
Then for any ε > 0 there exist two non-negative measures μ and ν on S × S such that:
1. λ = μ + ν has marginals P and Q;
2. μ(d(x, y) > α + ε) = 0;
3. ν(S × S) ≤ β + ε;
and ν is a finite sum of product measures.
Remark. Condition (18.0.2) is a relaxation of the definition of the Lévy-Prokhorov metric, since one can take different
α, β > ρ(P, Q). Conditions 1–3 mean that we can construct a measure λ on S × S such that the coordinates x, y have
marginal distributions P, Q and are concentrated within distance α + ε of each other (condition 2), except for a set of measure
at most β + ε (condition 3).
Proof. Case A. The proof will proceed in several steps. We will start with the simplest case which, however, contains
the main idea. Given small ε > 0, take n ≥ 1 such that nε > 1. Suppose that the laws P, Q are uniform on finite subsets
M, N ⊆ S of equal cardinality,
card(M) = card(N) = n, P(x) = Q(y) = 1/n < ε, x ∈ M, y ∈ N.
(i) x ∈ U,
(ii) y ∈ V,
(iii) d(x, y) ≤ α if x ∈ M, y ∈ N.
This means that the small auxiliary sets can be matched with any points, but only close points, with d(x, y) ≤ α, can be
matched in the main sets M and N. Consider a set A ⊆ X with cardinality card(A) = r. If A is not contained in M then, by (i), A_K = Y
and card(A_K) ≥ r. Suppose now that A ⊆ M and, again, we would like to show that card(A_K) ≥ r. Using (18.0.2) and
the fact that, by (iii), A^α ∩ N = A_K ∩ N, we can write
r/n = P(A) ≤ Q(A^α) + β = (1/n) card(A^α ∩ N) + β = (1/n) card(A_K ∩ N) + β.
Therefore,
r = card(A) ≤ nβ + card(A_K ∩ N) ≤ k + card(A_K ∩ N) = card(A_K),
since nβ ≤ k = card(V) and A_K = V ∪ (A_K ∩ N). By the matching theorem, there exists a K-matching f of X into Y. Let
T = {x ∈ M : f(x) ∈ N},
i.e. the points in M that are matched with points in N at distance d(x, f(x)) ≤ α. Clearly, card(T) ≥ n − k, since at most k
points can be matched with a point in V, and for x ∈ T, by (iii), d(x, f(x)) ≤ α. For x ∈ M \ T, redefine f(x) to match
each x with a different point in N not matched with the points in T. This defines a matching of M onto N. We
define measures μ and ν by
μ = (1/n) Σ_{x∈T} δ_{(x, f(x))}, ν = (1/n) Σ_{x∈M\T} δ_{(x, f(x))},
and let λ = μ + ν. First of all, obviously, λ has marginals P and Q because each point in M or N appears in the sum
μ + ν exactly once with weight 1/n. Also,
μ(d(x, y) > α) = 0, ν(S × S) ≤ card(M \ T)/n ≤ k/n < β + ε. (18.0.3)
Finally, both μ and ν are finite sums of point masses, which are product measures of point masses.
Case B. Suppose now that P and Q are concentrated on finitely many points with rational probabilities. Then we can
artificially split all points into smaller points of equal probabilities as follows. Let n be such that nε > 1 and
Let λ, μ, ν be the projections of λ′, μ′, ν′ back onto S × S by the map ((x, i), (y, j)) → (x, y). Then, clearly, λ = μ + ν,
λ has marginals P and Q and ν(S × S) < β + ε. Since
e((x, i), (y, j)) = d(x, y) + ε f(i, j) ≥ d(x, y),
we get
μ(d(x, y) > α + 2ε) ≤ μ′(e((x, i), (y, j)) > α + 2ε) = 0.
Finally, ν is obviously a finite sum of product measures. Of course, we can replace ε by ε/2.
Case C. (General case) Let P, Q be laws on a separable metric space (S, d). Let A be a maximal set such that for
all x ≠ y in A the distance d(x, y) ≥ ε. Such a set A is called an ε-packing and it is countable, because S is separable, so
A = {x_i}_{i≥1}. Since A is maximal, for each x ∈ S there exists y ∈ A such that d(x, y) < ε, since otherwise we could add x to
the set A. Let us create a partition of S using ε-balls around the points x_i:
Consider any set F ⊆ S. For any point x ∈ F, if x ∈ B_k then d(x, x_k) < ε, i.e. x_k ∈ F^ε and, therefore,
P(F) ≤ P′(F^ε).
All the relations above also hold for Q, Q′ and Q″ that are defined similarly, and we can assume that the same
point x₀ plays the role of the auxiliary point for both P″ and Q″. Given F ⊆ S, we can write
μ″(d(x, y) > α + 3ε) = 0, ν″(S × S) ≤ β + 3ε.
Let us also move the points (x₀, x_i) and (x_i, x₀) for i ≥ 0 in the support of μ″ into ν″: since the total weight of these
points is at most 2ε, the total weight of ν″ does not increase much:
ν″(S × S) ≤ β + 5ε.
It remains to redistribute these measures from the sequence {x_i}_{i≥0} to the entire space S in a way that recovers
the marginal distributions P and Q and does not lose much accuracy. Define a sequence of measures on S by
P_i(C) = P(C ∩ B_i)/P(B_i) if P(B_i) > 0, and P_i(C) = 0 otherwise,
and define Q_i similarly. The measures P_i and Q_i are concentrated on B_i. Define
μ = Σ_{i,j≥1} μ″(x_i, x_j) (P_i × Q_j).
Since μ″(x_i, x_j) = 0 unless d(x_i, x_j) ≤ α + 3ε, the measure μ is concentrated on the set {d(x, y) ≤ α + 5ε}, because
for x ∈ B_i, y ∈ B_j,
d(x, y) ≤ d(x, x_i) + d(x_i, x_j) + d(x_j, y) ≤ ε + (α + 3ε) + ε = α + 5ε.
The marginals u and v of μ satisfy
u(C) := μ(C × S) ≤ P(C)
and, similarly,
v(C) := μ(S × C) ≤ Q(C).
If u(S) = v(S) = 1 then μ(S × S) = 1 and μ is a probability measure with marginals P and Q, so we can take ν = 0.
Otherwise, take t = 1 − u(S) = 1 − v(S) and define
ν = (1/t) (P − u) × (Q − v).
It is easy to check that λ = μ + ν has marginals P and Q. For example,
λ(C × S) = μ(C × S) + ν(C × S) = u(C) + (1/t)(P(C) − u(C))(Q(S) − v(S))
= u(C) + (1/t)(P(C) − u(C))(1 − v(S)) = u(C) + P(C) − u(C) = P(C).
Also,
ν(S × S) = t = 1 − μ(S × S) = 1 − μ″(S × S) = ν″(S × S) ≤ β + 5ε.
Finally, by construction, ν is a finite sum of product measures. □
The following relationship between the Ky Fan and Lévy-Prokhorov metrics is an immediate consequence of Strassen's
theorem. We already saw that ρ(L(X), L(Y)) ≤ α(X, Y).
Theorem 57 If (S, d) is a separable metric space and P, Q are laws on S then, for any ε > 0, there exist random
variables X and Y on the same probability space with the distributions L(X) = P and L(Y) = Q such that
α(X, Y) ≤ ρ(P, Q) + ε.
Proof. Denote ρ = ρ(P, Q). By the definition of the Lévy-Prokhorov metric, for any measurable set A ⊆ S,
P(A) ≤ Q(A^{ρ+ε}) + ρ + ε.
By Strassen's theorem, there exists a measure λ on S × S with the marginals P, Q such that
λ(d(x, y) > ρ + 2ε) ≤ ρ + 2ε. (18.0.4)
If a pair (X, Y) has the distribution λ then, by the definition of the Ky Fan metric, α(X, Y) ≤ ρ + 2ε. If P and Q are tight then there exists a compact K such that
P(K), Q(K) ≥ 1 − ε. For ε = 1/n, find λ_n as in (18.0.4). Since λ_n has marginals P and Q, λ_n(K × K) ≥ 1 − 2ε, which
means that the sequence (λ_n)_{n≥1} is uniformly tight. By the Selection Theorem, there exists a converging subsequence
λ_{n(k)} → λ. Obviously, λ again has marginals P and Q. Since, by construction,
λ_n(d(x, y) > ρ + 2/n) ≤ ρ + 2/n
and {d(x, y) > ρ + ε} is an open set in S × S, by the Portmanteau Theorem,
λ(d(x, y) > ρ + ε) ≤ lim inf_{k→∞} λ_{n(k)}(d(x, y) > ρ + ε) ≤ ρ.
Exercise. Suppose that random variables X and Y satisfy
P(X ∈ A) ≤ P(Y ∈ A^ε) + ε
for all measurable sets A ⊆ R, and U is independent of (X, Y) and has the uniform distribution on [0, 1]. Prove that
there exists a measurable function f : R × [0, 1] → R such that f(X, U) has the same distribution as Y and
P(|X − f(X, U)| > ε) ≤ ε.
Kantorovich-Rubinstein Theorem.
Let (S, d) be a separable metric space. Denote by P₁(S) the set of all laws on S such that for some z ∈ S (equivalently,
for all z ∈ S),
∫_S d(x, z) dP(x) < ∞.
Let us denote by
M(P, Q) = {λ : λ is a law on S × S with marginals P and Q}.
For P, Q ∈ P₁(S), the quantity
W(P, Q) = inf{∫ d(x, y) dλ(x, y) : λ ∈ M(P, Q)}
is called the Wasserstein distance between P and Q. A measure λ ∈ M(P, Q) represents a transportation between
the measures P and Q. We can think of the conditional distribution λ(y | x) as a way to redistribute the mass in the
neighborhood of a point x so that the distribution P is redistributed into the distribution Q. If the distance d(x, y)
represents the cost of moving x to y, then the Wasserstein distance gives the optimal total cost of transporting P to Q.
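On the real line, the optimal transportation pairs quantiles, so for two empirical laws with the same number of equally weighted atoms the Wasserstein distance is just the average distance between sorted samples. A minimal sketch (our own illustration, not from the text):

```python
import numpy as np

def w1_1d(xs, ys):
    """W(P, Q) for two empirical laws on R with n equally weighted atoms:
    on the line the monotone (sorted) coupling is optimal for d(x,y) = |x - y|."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    return float(np.mean(np.abs(xs - ys)))
```

For example, shifting a law by a constant c gives optimal transportation cost exactly c.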
Given any two laws P and Q on S, let us define
γ(P, Q) = sup{∫ f dP − ∫ f dQ : ‖f‖_L ≤ 1}
and
m_d(P, Q) = sup{∫ f dP + ∫ g dQ : f, g ∈ C(S), f(x) + g(y) < d(x, y)}.
Notice that, obviously, for P, Q ∈ P₁(S), both γ(P, Q), m_d(P, Q) < ∞. Let us show that these two quantities are equal.
Proof. Given a function f such that ‖f‖_L ≤ 1, let us take a small ε > 0 and set g(y) = −f(y) − ε. Then
f(x) + g(y) = f(x) − f(y) − ε < d(x, y)
and
∫ f dP + ∫ g dQ = ∫ f dP − ∫ f dQ − ε,
which, of course, proves that γ(P, Q) ≤ m_d(P, Q). Let us now consider functions f, g such that f(x) + g(y) < d(x, y).
Define
e(x) = inf_y (d(x, y) − g(y)) = −sup_y (g(y) − d(x, y)),
which means that ‖e‖_L ≤ 1 and, therefore, m_d(P, Q) ≤ γ(P, Q). This finishes the proof. □
Below, we will need the following version of the Hahn-Banach theorem.
Theorem 58 (Hahn-Banach) Let V be a normed vector space, E a linear subspace of V, and U an open convex
set in V such that U ∩ E ≠ ∅. If r : E → R is a linear non-zero functional on E then there exists a linear functional
ℓ : V → R such that ℓ|_E = r and sup_U ℓ(x) = sup_{U∩E} r(x).
Proof. Let t = sup{r(x) : x ∈ U ∩ E} and let B = {x ∈ E : r(x) > t}. Since B is convex and U ∩ B = ∅, the Hahn-
Banach separation theorem implies that there exists a linear functional q : V → R that separates U and B. For any
x₀ ∈ U ∩ E, let F = {x ∈ E : q(x) = q(x₀)}. Since q(x₀) < inf_B q(x), F ∩ B = ∅. This means that the hyperplanes
{x ∈ E : q(x) = q(x₀)} and {x ∈ E : r(x) = t} in the subspace E are parallel, and this implies that q(x) = c r(x) on E
for some c ≠ 0. Let ℓ = q/c. Then r = ℓ|_E and
sup_U ℓ(x) = (1/c) sup_U q(x) ≤ (1/c) inf_B q(x) = inf_B r(x) = t = sup_{U∩E} r(x) ≤ sup_U ℓ(x),
so all the inequalities above are equalities.
Theorem 59 If S is a compact metric space then W(P, Q) = m_d(P, Q) = γ(P, Q) for P, Q ∈ P₁(S).
Proof. We only need to show the first equality. Consider the vector space V = C(S × S) equipped with the ‖·‖∞ norm and
let
U = {f ∈ V : f(x, y) < d(x, y)}.
Obviously, U is convex and open, because S × S is compact and any continuous function on a compact set achieves its
maximum. Consider a linear subspace E of V defined by
E = {φ ∈ V : φ(x, y) = f(x) + g(y), f, g ∈ C(S)}
so that
U ∩ E = {φ ∈ V : φ(x, y) = f(x) + g(y) < d(x, y), f, g ∈ C(S)}.
Define a linear functional r on E by
r(φ) = ∫ f dP + ∫ g dQ if φ(x, y) = f(x) + g(y).
Let us look at the properties of this functional and of its extension ℓ given by Theorem 58. First of all, if a(x, y) ≥ 0 then ℓ(a) ≥ 0. Indeed, for any c ≥ 0,
Since ℓ|_E = r,
∫ (f(x) + g(y)) dλ(x, y) = ∫ f dP + ∫ g dQ,
which implies that λ has marginals P and Q, i.e. λ ∈ M(P, Q). This proves that
m_d(P, Q) = sup_{U∩E} r(φ) = sup{∫ f(x, y) dλ(x, y) : f(x, y) < d(x, y)} = ∫ d(x, y) dλ(x, y) ≥ W(P, Q).
The opposite inequality is easy, because for any f, g such that f(x) + g(y) < d(x, y) and any λ ∈ M(P, Q),
∫ f dP + ∫ g dQ = ∫ (f(x) + g(y)) dλ(x, y) ≤ ∫ d(x, y) dλ(x, y). (19.0.1)
This finishes the proof and, moreover, it shows that the infimum in the definition of W is achieved on some λ. □
Remark. Notice that in the proof of this theorem we never used the fact that d is a metric. The theorem holds for any
d ∈ C(S × S) under the corresponding integrability assumptions. For example, one can consider loss functions of the
type d(x, y)^p for p > 1, which are not necessarily metrics. However, in Lemma 45, the fact that d is a metric was
essential. □
Our next goal will be to show that W = γ on separable and not necessarily compact metric spaces. We start with the
following.
Lemma 46 If (S, d) is a separable metric space then W and γ are metrics on P₁(S).
Proof. Since for the bounded Lipschitz metric we have β(P, Q) ≤ γ(P, Q), γ is also a metric, because if γ(P, Q) = 0
then β(P, Q) = 0 and, therefore, P = Q. As in (19.0.5), it should be obvious that γ(P, Q) = m_d(P, Q) ≤ W(P, Q), and if
W(P, Q) = 0 then γ(P, Q) = 0 and P = Q. The symmetry of W is obvious. It remains to show that W satisfies the
triangle inequality. The idea here is very simple, and let us first explain it in the case when (S, d) is complete. Consider
three laws P, Q, T on S and let λ ∈ M(P, Q) and μ ∈ M(Q, T) be such that
∫ d(x, y) dλ(x, y) ≤ W(P, Q) + ε and ∫ d(y, z) dμ(y, z) ≤ W(Q, T) + ε.
Let us generate a distribution π on S × S × S with marginals P, Q and T and with marginals on the pairs of coordinates (x, y)
and (y, z) given by λ and μ, by gluing λ and μ in the following way. We know that when (S, d) is complete and
separable, there exist regular conditional distributions λ(dx | y) and μ(dz | y) of the coordinates x and z given y. Then
we define the distribution π on S × S × S by first generating y from the distribution Q and, given y, generating the pair
x and z according to the conditional distributions λ(dx | y) and μ(dz | y) independently of each other, i.e. according to
the product measure on S × S
π(dx dz | y) = λ(dx | y) × μ(dz | y).
This is called the conditionally independent coupling of x and z given y. Obviously, by construction, (x, y) has the
distribution λ and (y, z) has the distribution μ. Therefore, the marginals of x and z are P and T, which means that the
pair (x, z) has a distribution ν ∈ M(P, T). Finally,
W(P, T) ≤ ∫ d(x, z) dν(x, z) = ∫ d(x, z) dπ(x, y, z) ≤ ∫ d(x, y) dπ + ∫ d(y, z) dπ
= ∫ d(x, y) dλ + ∫ d(y, z) dμ ≤ W(P, Q) + W(Q, T) + 2ε.
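The gluing step can be checked concretely for discrete laws: if π couples P, Q and ρ couples Q, T, the conditionally independent coupling has mass function π(x, y)ρ(y, z)/Q(y). A small sketch with laws on {0, 1} (the matrices are our own toy data):

```python
import numpy as np

pi = np.array([[0.3, 0.2],
               [0.1, 0.4]])    # pi[x, y]: coupling of P (rows) and Q (columns)
rho = np.array([[0.2, 0.2],
                [0.3, 0.3]])   # rho[y, z]: coupling of Q (rows) and T (columns)
Q = pi.sum(axis=0)             # common marginal of y: [0.4, 0.6]

# x and z conditionally independent given y: mu[x,y,z] = pi[x,y] * rho[y,z] / Q[y]
mu = np.einsum('xy,yz->xyz', pi, rho) / Q[None, :, None]
gamma = mu.sum(axis=1)         # induced coupling of P and T
```

One checks that under mu the pair (x, y) has distribution pi, (y, z) has distribution rho, and gamma has marginals P and T, exactly as in the proof.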
λ¹_{nm}(C) = λ((C ∩ S_n) × S_m)/λ(S_n × S_m), λ²_{nm}(C) = λ(S_n × (C ∩ S_m))/λ(S_n × S_m)
be the marginal distributions of the conditional distribution of λ on S_n × S_m. Define
λ′ = Σ_{n,m} λ(S_n × S_m) λ¹_{nm} × λ²_{nm}.
In this construction, locally on each small box S_n × S_m, the measure λ is replaced by the product measure with the same
marginals. Let us compute the marginals of λ′. Given a set C ⊆ S,
λ′(C × S) = Σ_{n,m} λ(S_n × S_m) λ¹_{nm}(C) λ²_{nm}(S) = Σ_{n,m} λ((C ∩ S_n) × S_m) = λ(C × S) = P(C).
Similarly, λ′(S × C) = Q(C), so λ′ has the same marginals as λ, i.e. λ′ ∈ M(P, Q). It should be obvious that the transportation
cost integral does not change much when we replace λ with λ′. One can visualize this by looking at what happens locally
on each small box S_n × S_m. Let (X_n, Y_m) be a random pair with the distribution λ restricted to S_n × S_m, so that
E d(X_n, Y_m) = (1/λ(S_n × S_m)) ∫_{S_n × S_m} d(x, y) dλ(x, y).
Let Y′_m be an independent copy of Y_m, also independent of X_n. Then the joint distribution of (X_n, Y′_m) is λ¹_{nm} × λ²_{nm} and
E d(X_n, Y′_m) = ∫_{S_n × S_m} d(x, y) d(λ¹_{nm} × λ²_{nm})(x, y).
Then
∫ d(x, y) dλ(x, y) = Σ_{n,m} λ(S_n × S_m) E d(X_n, Y_m),
∫ d(x, y) dλ′(x, y) = Σ_{n,m} λ(S_n × S_m) E d(X_n, Y′_m).
Finally, d(Y_m, Y′_m) ≤ diam(S_m) ≤ ε, so these two integrals differ by at most ε. Therefore,
∫ d(x, y) dλ′(x, y) ≤ W(P, Q) + 2ε.
Similarly, we can define μ′ such that
∫ d(y, z) dμ′(y, z) ≤ W(Q, T) + 2ε.
We will now show that this special simple form of the distributions λ′(x, y), μ′(y, z) ensures that the conditional distributions
of x and z given y are well defined. Let Q_m be the restriction of Q to S_m,
Q_m(C) = Q(C ∩ S_m) = Σ_n λ(S_n × S_m) λ²_{nm}(C).
Obviously, if Q_m(C) = 0 then λ²_{nm}(C) = 0 for all n, which means that the λ²_{nm} are absolutely continuous with respect to
Q_m and the Radon-Nikodym derivatives
f_{nm}(y) = dλ²_{nm}/dQ_m (y)
exist and Σ_n λ(S_n × S_m) f_{nm}(y) = 1 a.s. for y ∈ S_m.
Define, for y ∈ S_m,
λ′(A | y) = Σ_n λ(S_n × S_m) f_{nm}(y) λ¹_{nm}(A).
Notice that for any A ∈ B, λ′(A | y) is measurable in y and λ′(· | y) is a probability distribution on B for Q-almost all
y, because
λ′(S | y) = Σ_n λ(S_n × S_m) f_{nm}(y) = 1 a.s.
Moreover, λ′(A | y) is a regular conditional distribution, since for B ⊆ S_m,
∫_B λ′(A | y) dQ(y) = Σ_n λ(S_n × S_m) λ¹_{nm}(A) λ²_{nm}(B) = λ′(A × B).
The conditional distribution μ′(· | y) can be defined similarly, and we finish the proof as in the case of the complete space
above. □
The next lemma shows that on a separable metric space any law with a first moment, i.e. P ∈ P₁(S), can be approximated
in the metrics W and γ by laws concentrated on finite sets.
Lemma 47 If (S, d) is separable and P ∈ P₁(S) then there exists a sequence of laws P_n such that P_n(F_n) = 1 for some
finite sets F_n and W(P_n, P), γ(P_n, P) → 0.
Proof. For each n ≥ 1, let (S_nj)_{j≥1} be a partition of S such that diam(S_nj) ≤ 1/n. Take a point x_nj ∈ S_nj in each set S_nj
and for k ≥ 1 define a function
f_nk(x) = x_nj if x ∈ S_nj for j ≤ k, and f_nk(x) = x_n1 if x ∈ S_nj for j > k.
We have
∫ d(x, f_nk(x)) dP(x) = Σ_{j≥1} ∫_{S_nj} d(x, f_nk(x)) dP(x) ≤ (1/n) Σ_{j≤k} P(S_nj) + ∫_{S\(S_n1 ∪ ⋯ ∪ S_nk)} d(x, x_n1) dP(x) ≤ 2/n
for k large enough, because P ∈ P₁(S), i.e. ∫ d(x, x_n1) dP(x) < ∞, and the set S \ (S_n1 ∪ ⋯ ∪ S_nk) ↓ ∅.
Let λ_n be the image on S × S of the measure P under the map x → (f_nk(x), x), so that λ_n ∈ M(P_n, P) for some P_n
concentrated on the set of points {x_n1, …, x_nk}. Finally,
W(P_n, P) ≤ ∫ d(x, y) dλ_n(x, y) = ∫ d(f_nk(x), x) dP(x) ≤ 2/n.
Since γ(P_n, P) ≤ W(P_n, P), this finishes the proof. □
Theorem 60 (Kantorovich-Rubinstein theorem) If (S, d) is a separable metric space then W(P, Q) = γ(P, Q) for any
distributions P, Q ∈ P₁(S).
Letting n → ∞ proves that W(P, Q) ≤ γ(P, Q). We saw above that the opposite inequality always holds. □
Wasserstein distance W_p(P, Q) on R^n. We will now prove a version of the Kantorovich-Rubinstein theorem on R^n
in some cases when d(x, y) is not a metric. Given p ≥ 1, let us define the Wasserstein distance
W_p(P, Q) = inf{(∫ |x − y|^p dλ(x, y))^{1/p} : λ ∈ M(P, Q)}
on
P_p(R^n) = {P law on R^n : ∫ |x|^p dP(x) < ∞}.
Even though d(x, y) = |x − y|^p is not a metric for p > 1, W_p is still a metric on P_p(R^n), which can be shown the same way as in
Lemma 46. Namely, given nearly optimal λ ∈ M(P, Q) and μ ∈ M(Q, T), we can construct (X, Y, Z) with marginal distributions P, Q, T such
that (X, Y) ∼ λ and (Y, Z) ∼ μ and, therefore,
W_p(P, T) ≤ (E|X − Z|^p)^{1/p} ≤ (E|X − Y|^p)^{1/p} + (E|Y − Z|^p)^{1/p} ≤ (W_p^p(P, Q) + ε)^{1/p} + (W_p^p(Q, T) + ε)^{1/p}.
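On the real line the sorted (monotone) coupling is optimal for any convex cost |x − y|^p, which gives a quick way to probe numerically that W_p behaves like a metric on empirical laws (a sketch of our own, not from the text):

```python
import numpy as np

def wp_1d(xs, ys, p):
    """W_p between two n-point empirical laws on R via the monotone coupling,
    which is optimal on the line for the convex cost |x - y|^p, p >= 1."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))
```

The triangle inequality for p = 2 can then be checked on any three samples of equal size.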
Proof. We will show below that for any continuous uniformly bounded function d(x, y) on R^n × R^n,
inf{∫ d(x, y) dλ(x, y) : λ ∈ M(P, Q)}
= sup{∫ f dP + ∫ g dQ : f, g ∈ C(R^n), f(x) + g(y) ≤ d(x, y)}. (19.0.4)
This will imply (19.0.3) as follows. Let us take R ≥ 1 large enough so that for K = {|x| ≤ R},
∫_{K^c} |x|^p dP ≤ ε 2^{−p−1}, ∫_{K^c} |x|^p dQ ≤ ε 2^{−p−1}.
We can find such R because P, Q ∈ P_p(R^n). Let d(x, y) = min(|x − y|^p, (2R)^p). Then for any x, y ∈ K we have d(x, y) =
|x − y|^p and for any λ ∈ M(P, Q),
∫ |x − y|^p dλ(x, y) ≤ ∫ d(x, y) dλ(x, y) + ∫_{(K×K)^c} |x − y|^p dλ(x, y).
Let us break the second integral into two integrals over the disjoint sets {|x| ≥ |y|} and {|x| < |y|}. On the first set,
|x − y|^p ≤ 2^p |x|^p and, moreover, x cannot belong to K: otherwise y would have to be in K^c, contradicting |y| ≤ |x| ≤ R. This means that the part of
the second integral over this set is bounded by
∫_{K^c × R^n} 2^p |x|^p dλ(x, y) = 2^p ∫_{K^c} |x|^p dP(x) ≤ ε/2.
The second integral over the set {|x| < |y|} can be similarly bounded by ε/2 and, therefore,
∫ |x − y|^p dλ(x, y) ≤ ∫ d(x, y) dλ(x, y) + ε.
Letting ε → 0 proves the inequality in one direction, and the opposite inequality is always true, as we have seen above,
since for any f, g ∈ C(R^n) such that f(x) + g(y) ≤ |x − y|^p and any λ ∈ M(P, Q),
∫ f dP + ∫ g dQ = ∫ (f(x) + g(y)) dλ(x, y) ≤ ∫ |x − y|^p dλ(x, y). (19.0.5)
Therefore, it remains to prove (19.0.4). Notice that, by adding a constant, we can assume that d ≥ 0.
Again, the inequality ≥ is obvious, so we only need to prove ≤. We will reduce this to the compact case proved
in Theorem 59. Let us take R ≥ 1 large enough and let K = {|x| ≤ R}. Let us consider the measures
P_K(C) = P(C ∩ K)/P(K), Q_K(C) = Q(C ∩ K)/Q(K).
If we take R large enough so that the last term is smaller than ε then, using the Kantorovich-Rubinstein theorem on
compacts, we get
inf{∫ d(x, y) dλ(x, y) : λ ∈ M(P, Q)} ≤ ∫ d(x, y) dλ_K(x, y) + ε
= inf{∫ d(x, y) dλ(x, y) : λ ∈ M(P_K, Q_K)} + ε
= sup{∫ f dP_K + ∫ g dQ_K : f, g ∈ C(K), f(x) + g(y) ≤ d(x, y)} + ε
≤ ∫ f dP_K + ∫ g dQ_K + 2ε, (19.0.6)
for some f, g ∈ C(K) such that f(x) + g(y) ≤ d(x, y). To finish the proof, we would like to estimate the above sum of
integrals by
∫ φ dP + ∫ ψ dQ
for some functions φ, ψ ∈ C(R^n) such that φ(x) + ψ(y) ≤ d(x, y). This can be achieved by improving our choice of the
functions f and g using infimum-convolutions, as we will now explain.
Since we can replace f and g by f − c and g + c without changing the sum of the integrals, we can assume that each integral
is nonnegative. This implies that there exist points x₀, y₀ ∈ K such that f(x₀), g(y₀) ≥ 0. Since f(x) + g(y) ≤ d(x, y),
we can define, for x ∈ R^n,
φ(x) = inf_{y∈K} (d(x, y) − g(y)). (19.0.7)
This is called the infimum-convolution. Clearly, since f(x) ≤ d(x, y) − g(y) for all y ∈ K, we have f(x) ≤ φ(x) for
x ∈ K, which implies that
∫ f dP_K + ∫ g dQ_K ≤ ∫ φ dP_K + ∫ g dQ_K.
Notice that, by definition, φ(x) + g(y) ≤ d(x, y) for all x ∈ R^n, y ∈ K. Also, using (19.0.7) and the fact that g(y₀) ≥ 0,
inf_{y∈K} (d(x, y) − d(x₀, y)) ≤ φ(x) ≤ d(x, y₀) − g(y₀) ≤ d(x, y₀). (19.0.8)
Similarly, define ψ(y) = inf_{x∈R^n} (d(x, y) − φ(x)). Since g(y) ≤ d(x, y) − φ(x) for all x ∈ R^n, we have g(y) ≤ ψ(y) for y ∈ K, which implies that
∫ f dP_K + ∫ g dQ_K ≤ ∫ φ dP_K + ∫ g dQ_K ≤ ∫ φ dP_K + ∫ ψ dQ_K.
By definition, φ(x) + ψ(y) ≤ d(x, y) for all x, y ∈ R^n. Also, using (19.0.8) and the fact that φ(x₀) ≥ f(x₀) ≥ 0,
inf_{x∈R^n} (d(x, y) − d(x, y₀)) ≤ ψ(y) ≤ d(x₀, y) − φ(x₀) ≤ d(x₀, y). (19.0.9)
The estimates in (19.0.8) and (19.0.9) imply that ‖φ‖∞, ‖ψ‖∞ ≤ ‖d‖∞. Therefore, if we write
∫ φ dP = ∫_K φ dP + ∫_{K^c} φ dP = P(K) ∫_K φ dP_K + ∫_{K^c} φ dP
= ∫_K φ dP_K − P(K^c) ∫_K φ dP_K + ∫_{K^c} φ dP,
each of the last two terms can be bounded in absolute value by ‖d‖∞ P(K^c), which shows that, for large enough R ≥ 1,
∫_K φ dP_K ≤ ∫ φ dP + ε.
Similarly, ∫_K ψ dQ_K ≤ ∫ ψ dQ + ε and, since φ(x) + ψ(y) ≤ d(x, y), this finishes the proof. □
Exercise. Consider a map Φ on P₁(R) defined as follows. Consider a random variable π with values in N such that Eπ < ∞ and an independent random variable ξ such
that E|ξ| < ∞. Given a sequence (X_i) of i.i.d. random variables with the distribution P ∈ P₁(R), independent of π and
ξ, let Φ(P) be the distribution of the sum ξ + Σ_{i=1}^π X_i, where the sum Σ_{i=1}^π X_i is zero if π = 0. If Eπ ∈ [0, 1), prove that
Φ has a unique fixed point, Φ(P) = P. Hint: use the Banach Fixed Point Theorem for the Wasserstein metric W₁. (You
need to prove that the metric space (P₁(R), W₁) is complete.)
Exercise. Let P be the set of probability laws on [−1, 1] and define a map Φ : P → P as follows. Consider a random
variable π with values in N and an independent random variable ξ. Given a sequence (X_i) of i.i.d. random variables on
[−1, 1] with the distribution P ∈ P, independent of π and ξ, let Φ(P) be the distribution of cos(ξ + Σ_{i=1}^π X_i), where
the sum Σ_{i=1}^π X_i is zero if π = 0. Prove that Φ has a fixed point, Φ(P) = P. Hint: use the Schauder Fixed Point Theorem
for P equipped with the topology of weak convergence. (Notice that now a fixed point is possibly not unique.)
Exercise. (a) Consider a countable set A and two probability measures P and Q on A. Prove that
inf{∫ I(x ≠ y) dλ(x, y) : λ ∈ M(P, Q)} = (1/2) Σ_{a∈A} |P({a}) − Q({a})|.
Hint: use the Kantorovich-Rubinstein theorem with the metric d(x, y) = I(x ≠ y).
(b) Let (S, d) be a separable metric space and B be the Borel σ-algebra. Given two probability measures P and Q on
B, construct a measure λ ∈ M(P, Q) that witnesses the equality
inf{∫ I(x ≠ y) dλ(x, y) : λ ∈ M(P, Q)} = sup_{A∈B} |P(A) − Q(A)|.
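For countable supports, both sides of the identity in part (a) are easy to compute directly; the sketch below (our own, not a solution given in the text) also exhibits the mismatch probability 1 − Σ_a min(P(a), Q(a)) of the maximal coupling, which attains the infimum.

```python
def tv_discrete(P, Q):
    """(1/2) * sum_a |P(a) - Q(a)| for dicts mapping atoms to probabilities."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(a, 0.0) - Q.get(a, 0.0)) for a in support)

def mismatch_of_maximal_coupling(P, Q):
    """P(X != Y) under the maximal coupling: 1 - sum_a min(P(a), Q(a))."""
    support = set(P) | set(Q)
    return 1.0 - sum(min(P.get(a, 0.0), Q.get(a, 0.0)) for a in support)
```

The two quantities agree, illustrating the claimed identity for the cost I(x ≠ y).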
Exercise. Consider a set A and bounded functions
f₁, …, fₙ : A → R
such that for any convex combination g = λ₁f₁ + ⋯ + λₙfₙ there is a point a ∈ A such that g(a) < 1. Prove that there
is a probability measure P on A such that ∫ fᵢ dP < 1 for all i ≤ n. Hint: use the Hahn-Banach separation theorem.
In this section we will make several connections between the Kantorovich-Rubinstein theorem and other classical
objects. Let us start with the following classical inequality.
Theorem 62 (Brunn-Minkowski inequality on the real line) If λ is the Lebesgue measure and A, B are two bounded
Borel sets on R then λ(A + B) ≥ λ(A) + λ(B), where A + B is the set addition, i.e. A + B = {a + b : a ∈ A, b ∈ B}.
Proof. First, suppose that A and B are open. Since the Lebesgue measure is invariant under translations, let us translate
A and B so that sup A = inf B = 0. Let us check that, in this case, A ∪ B ⊆ A + B. Since A is open, for each a ∈ A
there exists δ > 0 such that (a − δ, a + δ) ⊆ A. Since inf B = 0, there exists b ∈ B such that 0 < b < δ. Then
a ∈ (a + b − δ, a + b + δ) ⊆ A + B, which proves that A ⊆ A + B. One can prove similarly that B is also a subset of
A + B. Since A and B are disjoint, we proved that λ(A) + λ(B) = λ(A ∪ B) ≤ λ(A + B).
Now, suppose that A and B are compact. Then, obviously, A + B is also compact. For δ > 0, let us denote by
C^δ an open δ-neighborhood of the set C. Since A^δ + B^δ ⊆ (A + B)^{2δ}, using the previous case of the open sets, we get
λ(A^δ) + λ(B^δ) ≤ λ(A^δ + B^δ) ≤ λ((A + B)^{2δ}). Since A is closed, A^δ ↓ A as δ ↓ 0 and, by the continuity of measure,
λ(A^δ) → λ(A). The same holds for B and A + B, and we proved the inequality for two compact sets.
Finally, consider arbitrary bounded measurable sets A and B. By the regularity of measure, we can find compacts
C ⊆ A and D ⊆ B such that λ(A \ C) ≤ ε and λ(B \ D) ≤ ε. Using the previous case of the compact sets, we can write
λ(A) + λ(B) − 2ε ≤ λ(C) + λ(D) ≤ λ(C + D) ≤ λ(A + B), and letting ε → 0 finishes the proof. □
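For finite unions of intervals the inequality can be checked mechanically; the helper below (our own sketch, not from the text) computes the measure of the Minkowski sum by forming all pairwise interval sums and merging overlaps.

```python
def minkowski_sum_measure(A, B):
    """Lebesgue measure of A + B, where A and B are lists of intervals (a, b)."""
    sums = sorted((a1 + a2, b1 + b2) for a1, b1 in A for a2, b2 in B)
    total, cur_lo, cur_hi = 0.0, None, None
    for lo, hi in sums:          # sweep the intervals, merging overlaps
        if cur_hi is None or lo > cur_hi:
            if cur_hi is not None:
                total += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total

def measure(A):
    """Lebesgue measure of a finite union of intervals (A + {0} = A)."""
    return minkowski_sum_measure(A, [(0.0, 0.0)])
```

For A = (0, 1) ∪ (2, 3) and B = (0, 1) one gets λ(A + B) = 4 ≥ 2 + 1 = λ(A) + λ(B).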
Using this, we will prove another classical inequality.
Theorem 63 (Prekopa-Leindler inequality) Consider nonnegative integrable functions w, u, v : R^n → [0, ∞) such that
for some α ∈ [0, 1],
w(αx + (1 − α)y) ≥ u(x)^α v(y)^{1−α} for all x, y ∈ R^n.
Then,
∫ w dx ≥ (∫ u dx)^α (∫ v dx)^{1−α}.
Using the Prekopa-Leindler inequality one can prove that the Lebesgue measure λ on R^n satisfies the Brunn-Minkowski
inequality
λ(A)^{1/n} + λ(B)^{1/n} ≤ λ(A + B)^{1/n}. (20.0.1)
We will leave this as an exercise.
Proof. The proof will proceed by induction on n. Let us first show the induction step. Suppose the statement holds for
n and we would like to show it for n + 1. By the assumption, for any x, y ∈ R^n and a, b ∈ R,
and, similarly,
u₂(a) = ∫_{R^n} u₁(x, a) dx and v₂(b) = ∫_{R^n} v₁(x, b) dx.
Then the above inequality can be rewritten as w₂(αa + (1 − α)b) ≥ u₂(a)^α v₂(b)^{1−α},
which finishes the proof of the induction step. It remains to prove the case n = 1. We can assume that u, v, w : R → [0, 1],
because both inequalities in the statement of the theorem behave well under truncation and scaling. Also, we can
assume that u and v are not identically zero, since otherwise there is nothing to prove, and we can scale them by their
‖·‖∞ norms and assume that ‖u‖∞ = ‖v‖∞ = 1. We have the following set inclusion:
{w ≥ a} ⊇ α{u ≥ a} + (1 − α){v ≥ a}.
When a ∈ (0, 1), the sets {u ≥ a} and {v ≥ a} are not empty and the Brunn-Minkowski inequality implies that
λ(w ≥ a) ≥ α λ(u ≥ a) + (1 − α) λ(v ≥ a).
Finally,
∫_R w(z) dz = ∫_R ∫_0^1 I(a ≤ w(z)) da dz = ∫_0^1 λ(w ≥ a) da
≥ α ∫_0^1 λ(u ≥ a) da + (1 − α) ∫_0^1 λ(v ≥ a) da
= α ∫_R u(z) dz + (1 − α) ∫_R v(z) dz ≥ (∫_R u(z) dz)^α (∫_R v(z) dz)^{1−α},
where in the last step we used the arithmetic-geometric mean inequality. This finishes the proof. □
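A quick numerical sanity check of the case n = 1 (our own, with parameter 1/2): for u(x) = e^{−x²} and v(y) = e^{−(y−1)²}, the smallest admissible w is w(z) = e^{−(z−1/2)²} (optimize u(x)^{1/2}v(y)^{1/2} over x + y = 2z), and the two sides of the inequality then agree.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
u = np.exp(-x ** 2)
v = np.exp(-(x - 1.0) ** 2)
w = np.exp(-(x - 0.5) ** 2)   # smallest w with w((x+y)/2) >= sqrt(u(x) v(y))

lhs = w.sum() * dx                                    # integral of w
rhs = (u.sum() * dx) ** 0.5 * (v.sum() * dx) ** 0.5   # geometric mean of integrals
```

All three integrals equal sqrt(pi), so here Prekopa-Leindler holds with equality; strict inequality appears, e.g., if w is enlarged.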
Remark. Another common proof of the last step n = 1 uses transportation of measure, as follows. We can assume that
∫u = ∫v = 1 by rescaling
u → u/∫u, v → v/∫v, w → w/((∫u)^α (∫v)^{1−α}).
Then we need to show that ∫w ≥ 1. Consider the cumulative distribution functions
F(x) = ∫_{−∞}^x u(y) dy and G(x) = ∫_{−∞}^x v(y) dy,
and let x(t) and y(t) be their inverses, e.g. x(t) = inf{x : F(x) ≥ t}. If N is the set where F is not differentiable then
x(t) ∈ N implies t ∈ F(N),
and, since F is absolutely continuous, F(N) has Lebesgue measure zero (Lusin N property). Therefore, for
t outside of a set of measure zero, the derivative x′(t) exists and F′(x(t)) = u(x(t)). Using the chain rule in the equation F(x(t)) = t, for
all such t, u(x(t))x′(t) = 1. Similarly, outside of some set of measure zero, v(y(t))y′(t) = 1. Now, consider the function
z(t) = αx(t) + (1 − α)y(t). This function is strictly increasing and differentiable almost everywhere. Therefore,
∫_R w(z) dz ≥ ∫_0^1 w(z(t)) z′(t) dt = ∫_0^1 w(αx(t) + (1 − α)y(t)) z′(t) dt
and, by assumption,
w(αx(t) + (1 − α)y(t)) ≥ u(x(t))^α v(y(t))^{1−α}.
By the arithmetic-geometric mean inequality, z′(t) = αx′(t) + (1 − α)y′(t) ≥ x′(t)^α y′(t)^{1−α}. Therefore,
∫_R w(z) dz ≥ ∫_0^1 (u(x(t))x′(t))^α (v(y(t))y′(t))^{1−α} dt = ∫_0^1 1 dt = 1.
This finishes the proof. □
Entropy and the Kullback-Leibler divergence. Consider a probability measure P on R^n and a nonnegative measurable
function u : R^n → [0, ∞). We define the entropy of u with respect to P by
Ent_P(u) = ∫ u log u dP − ∫ u dP log ∫ u dP.
Notice that Ent_P(u) ≥ 0 by Jensen's inequality, since u log u is a convex function. Entropy has the following variational
representation:
Ent_P(u) = sup{∫ uv dP : ∫ e^v dP ≤ 1}.
Proof. Take any measurable function v such that ∫ e^v dP ≤ 1. Then, for any λ ≥ 0,
∫ uv dP ≤ ∫ uv dP + λ(1 − ∫ e^v dP) = λ + ∫ (uv − λe^v) dP.
The integrand uv − λe^v is concave in v (since λ ≥ 0) and can be maximized point-wise by taking v such that u = λe^v,
which implies that
∫ uv dP ≤ λ + ∫ u log(u/λ) dP − ∫ u dP.
This bound holds for all v such that ∫ e^v dP ≤ 1 and, therefore,
sup{∫ uv dP : ∫ e^v dP ≤ 1} ≤ λ + ∫ u log(u/λ) dP − ∫ u dP.
The right hand side is convex in λ for λ ≥ 0 (since u is nonnegative) and the minimum is achieved at λ = ∫ u dP. With
this choice of λ the right hand side is equal to Ent_P(u), and we proved that
sup{∫ uv dP : ∫ e^v dP ≤ 1} ≤ Ent_P(u).
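The computation above can be replayed numerically for a discrete law: the supremum is attained at v = log(u / ∫u dP), which satisfies ∫e^v dP = 1 and recovers Ent_P(u) exactly (a sketch of our own; the data are random):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(5)
p /= p.sum()               # a law P on 5 points
u = rng.random(5) + 0.1    # a positive function u

# Ent_P(u) = E[u log u] - E[u] log E[u]  (expectations under P)
ent = np.sum(p * u * np.log(u)) - np.sum(p * u) * np.log(np.sum(p * u))
v = np.log(u / np.sum(p * u))   # optimal v in the variational formula
```

One checks that ent is nonnegative, that e^v integrates to 1, and that ∫uv dP equals ent.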
Transportation inequality for Gaussian measures. Let us consider a non-degenerate normal distribution P = N(0, C)
with the covariance matrix C such that det C ≠ 0. We know that this distribution has density e^{−V(x)}, where
V(x) = (1/2)(C^{−1}x, x) + const.
If we denote A = C^{−1}/2 then, for any t ∈ [0, 1],
tV(x) + (1 − t)V(y) − V(tx + (1 − t)y) = t(Ax, x) + (1 − t)(Ay, y) − (A(tx + (1 − t)y), tx + (1 − t)y)
= t(1 − t)(A(x − y), x − y) ≥ (t(1 − t)/(2λmax(C))) |x − y|², (20.0.5)
where λmax(C) is the largest eigenvalue of C. We will use this to prove the following useful inequality for the Wasserstein
distance W₂ defined at the end of the previous section:
W₂(Q, P) ≤ (2λmax(C) D(Q‖P))^{1/2}. (20.0.6)
Proof. Let us denote C₂ = 1/(2λmax(C)) and, using the Kantorovich-Rubinstein theorem (see (19.0.3)), write
C₂ W₂(Q, P)² = inf{∫ C₂|x − y|² dλ(x, y) : λ ∈ M(P, Q)}
= sup{∫ f dP + ∫ g dQ : f(x) + g(y) ≤ C₂|x − y|²}.
Consider functions f, g ∈ C(R^n) such that f(x) + g(y) ≤ C₂|x − y|². Then, by (20.0.5), for any t ∈ (0, 1),
f(x) + g(y) ≤ (1/(t(1 − t))) (tV(x) + (1 − t)V(y) − V(tx + (1 − t)y))
and
t(1 − t)f(x) − tV(x) + t(1 − t)g(y) − (1 − t)V(y) ≤ −V(tx + (1 − t)y).
If we introduce the functions
u(x) = e^{(1−t)f(x) − V(x)}, v(y) = e^{tg(y) − V(y)} and w(z) = e^{−V(z)},
then the inequality above means exactly that w(tx + (1 − t)y) ≥ u(x)^t v(y)^{1−t}.
Applying the Prekopa-Leindler inequality with parameter t, using ∫ w dz = 1, taking logarithms and letting t → 1, together with the variational characterization of the divergence, one obtains
∫ f dP + ∫ g dQ ≤ D(Q‖P).
Taking the supremum over such f, g gives C₂ W₂(Q, P)² ≤ D(Q‖P), which proves (20.0.6). □
Concentration of Gaussian measure. Given a Borel set A with P(A) > 0, consider the conditional distribution
P_A(C) = P(C ∩ A)/P(A).
Then, obviously, the Radon-Nikodym derivative
dP_A/dP = I_A/P(A)
and the Kullback-Leibler divergence
D(P_A‖P) = ∫_A log(1/P(A)) dP_A = log(1/P(A)).
Since W₂ is a metric, for any two Borel sets A and B,
W₂(P_A, P_B) ≤ W₂(P_A, P) + W₂(P_B, P) ≤ √(2λmax(C)) (log^{1/2}(1/P(A)) + log^{1/2}(1/P(B))),
using (20.0.6). Suppose that the sets A and B are apart from each other by a distance t, i.e. d(A, B) ≥ t > 0. Then any
two points in the supports of the measures P_A and P_B are at distance at least t from each other, so the transportation
distance W₂(P_A, P_B) ≥ t. Therefore,
t ≤ W₂(P_A, P_B) ≤ √(2λmax(C)) (log^{1/2}(1/P(A)) + log^{1/2}(1/P(B))) ≤ 2√(λmax(C)) log^{1/2}(1/(P(A)P(B))),
where in the last step we used √a + √b ≤ √(2(a + b)). Therefore,
P(B) ≤ (1/P(A)) exp(−t²/(4λmax(C))).
Taking B = {x : d(x, A) ≥ t}, this gives
P(d(x, A) ≥ t) ≤ (1/P(A)) exp(−t²/(4λmax(C))).
If the set A is not too small, e.g. P(A) ≥ 1/2, this implies that
P(d(x, A) ≥ t) ≤ 2 exp(−t²/(4λmax(C))).
This shows that the Gaussian measure is exponentially concentrated near any large enough set. The constant 1/4 in
the exponent is not optimal and can be replaced by 1/2; this is just an example of an application of the above ideas. The
optimal result is the famous Gaussian isoperimetric inequality.
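The bound is easy to test in the simplest case (a check of our own, not from the text): in one dimension with C = 1, take A = (−∞, 0], so P(A) = 1/2 and d(x, A) ≥ t exactly when x ≥ t; the standard Gaussian tail indeed sits below 2e^{−t²/4}.

```python
import math

ts = [0.5, 1.0, 2.0, 3.0]
tails = [0.5 * math.erfc(t / math.sqrt(2.0)) for t in ts]  # P(x >= t) for x ~ N(0,1)
bounds = [2.0 * math.exp(-t * t / 4.0) for t in ts]        # concentration bound
```

The comparison also shows how much room the non-optimal constant 1/4 leaves.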
Gaussian concentration via infimum-convolution. If we denote c = 1/λmax(C) then setting t = 1/2 in (20.0.5),
V(x) + V(y) − 2V((x + y)/2) ≥ (c/4)|x − y|².
Given a function f on R^n, let us define its infimum-convolution by
g(y) = inf_x (f(x) + (c/4)|x − y|²).
Then, for all x and y,
g(y) − f(x) ≤ (c/4)|x − y|² ≤ V(x) + V(y) − 2V((x + y)/2). (20.0.7)
If we define
u(x) = e^{−f(x) − V(x)}, v(y) = e^{g(y) − V(y)}, w(z) = e^{−V(z)}
then (20.0.7) implies that
w((x + y)/2)² ≥ u(x) v(y).
The Prekopa-Leindler inequality with α = 1/2 then implies that
∫ e^g dP · ∫ e^{−f} dP ≤ 1. (20.0.8)
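To see (20.0.8) in action, take the one-dimensional standard Gaussian (c = 1) and the test function f(x) = x (our own example, not from the text): then g(y) = inf_x (x + (1/4)(x − y)²) = y − 1, so ∫e^g dP = e^{−1/2} and ∫e^{−f} dP = e^{1/2}, and the product is exactly 1. The infimum-convolution itself can be approximated by a grid search:

```python
import numpy as np

xs = np.linspace(-20.0, 20.0, 400001)

def inf_conv(y, c=1.0):
    """g(y) = inf_x [f(x) + (c/4)|x - y|^2] for f(x) = x, via grid search.
    Analytically the minimizer is x = y - 2 and g(y) = y - 1 when c = 1."""
    return float(np.min(xs + (c / 4.0) * (xs - y) ** 2))
```

The grid values match the closed form y − 1 up to the grid resolution.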
Applying (20.0.8) with f = 0 on a set A and f = +∞ outside of A, so that ∫ e^{−f} dP = P(A) and g(y) = (c/4) d(y, A)², and using Chebyshev's inequality, we again obtain the concentration bound
P(d(x, A) ≥ t) ≤ (1/P(A)) exp(−ct²/4) = (1/P(A)) exp(−t²/(4λmax(C))).
Discrete metric and total variation. The total variation distance between probability measures P and Q on a measurable
space (S, B) is defined by
TV(P, Q) = sup_{A∈B} |P(A) − Q(A)|.
Let us describe some connections of the total variation distance to the Kullback-Leibler divergence and the Kantorovich-
Rubinstein theorem. Let us start with the following simple observation.
Lemma 49 If f is a measurable function on S such that |f| ≤ 1 and ∫ f dP = 0 then, for any λ ∈ R,
∫ e^{λf} dP ≤ e^{λ²/2}.
Proof. Writing
λf = ((1 + f)/2) λ + ((1 − f)/2) (−λ),
by the convexity of e^x, we get
e^{λf} ≤ ((1 + f)/2) e^λ + ((1 − f)/2) e^{−λ} = ch(λ) + f sh(λ).
Therefore,
∫ e^{λf} dP ≤ ch(λ) ≤ e^{λ²/2},
by comparing the Taylor series of ch(λ) and e^{λ²/2}. □
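The bound ch(λ) ≤ e^{λ²/2} and the lemma itself are easy to probe numerically (our own check, not from the text): for a two-point law with |f| ≤ 1 and ∫f dP = 0, the ratio ∫e^{λf} dP / e^{λ²/2} never exceeds 1.

```python
import numpy as np

p = np.array([0.3, 0.7])   # a law P on two points
f = np.array([0.7, -0.3])  # |f| <= 1 and E_P f = 0.3*0.7 - 0.7*0.3 = 0
lams = np.linspace(-3.0, 3.0, 121)
worst = max(np.sum(p * np.exp(l * f)) / np.exp(l ** 2 / 2.0) for l in lams)
```

The maximum ratio is attained at λ = 0, where both sides equal 1.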
Then a 1-Lipschitz function $f$ with respect to the metric $d$, $\|f\|_L \le 1$, is defined by the condition that for all $x, y \in S$, $|f(x) - f(y)| \le d(x,y)$. However, since any uncountable set $S$ is not separable w.r.t. the discrete metric $d$, we cannot apply the Kantorovich-Rubinstein theorem directly. In this case, one can use the Hahn-Jordan decomposition to show that $W$ coincides with the total variation distance, $W(P,Q) = \mathrm{TV}(P,Q)$, and it is easy to construct a measure $\mu \in M(P,Q)$ explicitly that witnesses the above equality. This was one of the exercises at the end of the previous section. One can also easily check directly that the dual Lipschitz quantity equals $\mathrm{TV}(P,Q)$. Thus, for the discrete metric $d$, all these distances coincide.
We have the following analogue of the Kullback-Leibler divergence bound in Theorem 64, which now holds for any measure $P$, not only Gaussian.
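A classical bound of exactly this type, for finite sample spaces, is Pinsker's inequality $\mathrm{TV}(P,Q) \le \sqrt{K(P,Q)/2}$, where $K$ denotes the Kullback-Leibler divergence. The sketch below (helper names are ours) checks it numerically on random distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    # total variation: sup_A |P(A) - Q(A)| = (1/2) * sum_i |p_i - q_i|
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    # Kullback-Leibler divergence sum_i p_i log(p_i / q_i)
    return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

# Pinsker's inequality TV(P, Q) <= sqrt(KL(P, Q) / 2) on random distributions
for _ in range(200):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert tv(p, q) <= np.sqrt(kl(p, q) / 2) + 1e-12
```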
Exercise. Prove the Brunn-Minkowski inequality (20.0.1) on $\mathbb{R}^n$ using the Prekopa-Leindler inequality. Hint: apply the Prekopa-Leindler inequality to the sets $A/\lambda$ and $B/(1-\lambda)$ and optimize over $\lambda \in (0,1)$.
Exercise. Using the Prekopa-Leindler inequality prove that if $\gamma_N$ is the standard Gaussian measure on $\mathbb{R}^N$, $A$ and $B$ are Borel sets and $\lambda \in [0,1]$ then
$$\gamma_N\bigl(\lambda A + (1-\lambda)B\bigr) \ge \gamma_N(A)^{\lambda}\, \gamma_N(B)^{1-\lambda}.$$
Exercise. (Anderson's inequality) If $C$ is convex and symmetric around $0$, i.e. $C = -C$, then for any $z \in \mathbb{R}^N$,
$$\gamma_N(C) \ge \gamma_N(C + z).$$
Exercise. Suppose that $X$ and $Y$ are centered Gaussian random vectors in $\mathbb{R}^N$ such that $\mathrm{Cov}(X) \le \mathrm{Cov}(Y)$ (i.e. their difference is nonnegative definite). If $C$ is convex and symmetric, prove that $P(X \in C) \ge P(Y \in C)$. Hint: use the previous problem.
Exercise. Suppose that $X = (X_1, \ldots, X_N)$ is a vector of i.i.d. standard normal random variables. Suppose that $f : \mathbb{R}^N \to \mathbb{R}$ is a Lipschitz function with $\|f\|_L = L$. If $M$ is a median of $f(X)$, prove that
$$P\bigl(f(X) \ge M + Lt\bigr) \le 2e^{-t^2/4}.$$
Hint: use the Gaussian concentration inequality. Median means that $P(f(X) \le M) \ge 1/2$ and $P(f(X) \ge M) \ge 1/2$.
We have developed a general theory of convergence of laws on (separable) metric spaces, and in the following two sections we will look at some specific examples of convergence on the spaces of continuous functions $(C[0,1], \|\cdot\|_\infty)$ and $(C(\mathbb{R}_+), d)$, where $d$ is a metric metrizing uniform convergence on compacts. These examples will describe certain central limit theorem type results on these spaces, and in this section we will define the corresponding limiting Gaussian laws, namely, the Brownian motion and the Brownian bridge. We will start with basic definitions and basic regularity results in the presence of continuity. Given a set $T$ and a probability space $(\Omega, \mathcal{F}, P)$, a stochastic process is a function
$$X_t(\omega) = X(t, \omega) : T \times \Omega \to \mathbb{R}$$
such that for each $t \in T$, $X_t : \Omega \to \mathbb{R}$ is a random variable, i.e. a measurable function. In other words, a stochastic process is a collection of random variables $X_t$ indexed by a set $T$. A stochastic process is often defined by specifying finite dimensional (f.d.) distributions $P_F = \mathcal{L}(\{X_t\}_{t \in F})$ for all finite subsets $F \subseteq T$. Kolmogorov's theorem then guarantees the existence of a probability space on which the process is defined, under the natural consistency condition that, for $F_1 \subseteq F_2$, the distribution $P_{F_1}$ coincides with the marginal of $P_{F_2}$ on the coordinates in $F_1$.
One can also think of a process as a function on $\Omega$ with values in $\mathbb{R}^T = \{f : T \to \mathbb{R}\}$, because for a fixed $\omega$, $X_t(\omega) \in \mathbb{R}^T$ is a (random) function of $t$. In Kolmogorov's theorem, given a family of consistent f.d. distributions, a process was defined on the probability space $(\mathbb{R}^T, \mathcal{B}^T)$, where $\mathcal{B}^T$ is the cylindrical $\sigma$-algebra generated by the algebra of cylinders $B \times \mathbb{R}^{T \setminus F}$ for Borel sets $B$ in $\mathbb{R}^F$ and all finite $F$. When $T$ is uncountable, some very natural sets, such as
$$\Bigl\{\omega : \sup_{t \in T} X_t > 1\Bigr\} = \bigcup_{t \in T} \{X_t > 1\},$$
might not be measurable in $\mathcal{B}^T$. However, in our examples we will deal with continuous processes that possess additional regularity properties. If $(T,d)$ is a metric space then a process $X_t$ is called sample continuous if for all $\omega$, $X_t(\omega) \in C(T,d)$ - the space of continuous functions on $(T,d)$. The process $X_t$ is called continuous in probability if $X_t \to X_{t_0}$ in probability whenever $t \to t_0$. Let us note that sample continuity is not a property that follows automatically if we know all finite dimensional distributions of the process.
Example. Let $T = [0,1]$, $(\Omega, P) = ([0,1], \lambda)$ where $\lambda$ is the Lebesgue measure. Let $X_t(\omega) = I(t = \omega)$ and $X_t'(\omega) = 0$. Finite dimensional distributions of these processes are the same, because for any fixed $t \in [0,1]$,
$$P(X_t = 0) = P(X_t' = 0) = 1,$$
yet $\sup_t X_t = 1$ while $\sup_t X_t' = 0$, so only $X_t'$ is sample continuous. On the other hand, if $(T,d)$ is separable and a process $X_t$ is sample continuous, then
$$X_t(\omega) : T \times \Omega \to \mathbb{R}$$
is jointly measurable on the product space $(T, \mathcal{B}) \times (\Omega, \mathcal{F})$, where $\mathcal{B}$ is the Borel $\sigma$-algebra on $T$.
Proof. Let $(S_j)_{j \ge 1}$ be a measurable partition of $T$ such that $\mathrm{diam}(S_j) \le \frac{1}{n}$. For each non-empty $S_j$, let us take a point $t_j \in S_j$ and define
$$X^n_t(\omega) = X_{t_j}(\omega) \text{ for } t \in S_j.$$
$X^n_t(\omega)$ is, obviously, measurable on $T \times \Omega$, because for any Borel set $A$ on $\mathbb{R}$,
$$\bigl\{(\omega, t) : X^n_t(\omega) \in A\bigr\} = \bigcup_{j \ge 1} S_j \times \bigl\{\omega : X_{t_j}(\omega) \in A\bigr\}.$$
Since $X_t(\omega)$ is sample continuous, $X^n_t(\omega) \to X_t(\omega)$ as $n \to \infty$ for all $(\omega, t)$. Hence, $X_t$ is also measurable. □
If $X_t$ is a sample continuous process indexed by $T$ then we can think of $X_t$ as an element of the metric space of continuous functions $(C(T,d), \|\cdot\|_\infty)$, rather than simply an element of $\mathbb{R}^T$. We can define measurable events on this space in two different ways. On the one hand, we have the natural Borel $\sigma$-algebra $\mathcal{B}$ on $C(T,d)$ generated by the open (or closed) balls
$$B_g(\varepsilon) = \bigl\{f \in C(T,d) : \|f - g\|_\infty < \varepsilon\bigr\}.$$
On the other hand, if we think of $C(T,d)$ as a subspace of $\mathbb{R}^T$, we can consider a $\sigma$-algebra
$$S_T = \bigl\{B \cap C(T,d) : B \in \mathcal{B}^T\bigr\},$$
which is the intersection of the cylindrical $\sigma$-algebra $\mathcal{B}^T$ with $C(T,d)$. It turns out that these two definitions coincide if $(T,d)$ is separable. An important implication of this is that the law of any random element with values in $(C(T,d), \|\cdot\|_\infty)$ is completely determined by its finite dimensional distributions.
Proof. Let us first show that $S_T \subseteq \mathcal{B}$. Any element of the cylindrical algebra that generates the cylindrical $\sigma$-algebra $\mathcal{B}^T$ is given by
$$B \times \mathbb{R}^{T \setminus F} \text{ for a finite } F \subseteq T \text{ and for some Borel set } B \subseteq \mathbb{R}^F.$$
Then
$$\bigl(B \times \mathbb{R}^{T \setminus F}\bigr) \cap C(T,d) = \bigl\{x \in C(T,d) : (x_t)_{t \in F} \in B\bigr\} = \pi_F^{-1}(B),$$
where $\pi_F : C(T,d) \to \mathbb{R}^F$ is the finite dimensional projection such that $\pi_F(x) = (x_t)_{t \in F}$. The projection $\pi_F$ is, obviously, continuous in the $\|\cdot\|_\infty$ norm and, therefore, measurable on the Borel $\sigma$-algebra $\mathcal{B}$ generated by the open sets in the $\|\cdot\|_\infty$ norm. This implies that $\pi_F^{-1}(B) \in \mathcal{B}$ and, thus, $S_T \subseteq \mathcal{B}$. Let us now show that $\mathcal{B} \subseteq S_T$. Let $T'$ be a countable dense subset of $T$. Then, by continuity, any closed $\varepsilon$-ball in $C(T,d)$ can be written as
$$\bigl\{f \in C(T,d) : \|f - g\|_\infty \le \varepsilon\bigr\} = \bigcap_{t \in T'} \bigl\{f \in C(T,d) : |f(t) - g(t)| \le \varepsilon\bigr\} \in S_T.$$
In the remainder of the section we will define two specific sample continuous stochastic processes.
Brownian motion. Brownian motion is defined as a sample continuous process $X_t$ on $T = \mathbb{R}_+$ such that $X_0 = 0$, each increment $X_t - X_s$ for $s < t$ has the Gaussian distribution $N(0, t-s)$, and for any $t_1 < \ldots < t_n$, the increments $X_{t_1} - X_0, \ldots, X_{t_n} - X_{t_{n-1}}$ are independent.
As a result, we can give an equivalent definition: Brownian motion is a sample continuous centered Gaussian process $X_t$ for $t \in [0, \infty)$ with the covariance $\mathrm{Cov}(X_t, X_s) = \min(t,s)$. A process is Gaussian if all its finite dimensional distributions are Gaussian.
Without the requirement of sample continuity, the existence of such a process follows from Kolmogorov's theorem, since all finite dimensional distributions are consistent by construction. However, we still need to prove that there exists a sample continuous version with these prescribed finite dimensional distributions. We start with a simple estimate: if $g$ is a standard Gaussian random variable then
$$P(|g| \ge c) \le 2e^{-c^2/2}$$
for all $c \ge 0$.
Theorem 66 (Existence of Brownian motion) There exists a sample continuous Gaussian process with the covariance
Cov(Xt , Xs ) = min(t, s).
Proof. It is enough to construct $X_t$ on the interval $[0,1]$. Given a process $X_t$ that has the finite dimensional distributions of the Brownian process, but is not necessarily continuous, let us define for $n \ge 1$,
$$V_k = X_{\frac{k+1}{2^n}} - X_{\frac{k}{2^n}} \text{ for } k = 0, \ldots, 2^n - 1.$$
The sequence of piecewise linear interpolations $X^{(n)}_t$ of the values at dyadic points of level $n$ converges almost surely to some limit $Z_t$ because, by (21.0.1), with probability one,
$$\bigl|X^{(n)}_t - X^{(n-1)}_t\bigr| \le n\, 2^{-n/2}$$
for large enough (random) $n \ge n_0(\omega)$. By construction, $Z_t = X_t$ on the dense subset of all dyadic $t \in [0,1]$. If we can prove that $Z_t$ is sample continuous then all f.d. distributions of $Z_t$ and $X_t$ will coincide, which means that $Z_t$ is a continuous version of the Brownian motion. Take any $t, s \in [0,1]$ such that $|t-s| \le 2^{-n}$. If $t(n) = \frac{k}{2^n}$ and $s(n) = \frac{m}{2^n}$ are the dyadic approximations at level $n$, then $|k - m| \in \{0, 1\}$. As a result, $|X_{t(n)} - X_{s(n)}|$ is either equal to $0$ or one of the increments $|V_k|$ and, by (21.0.1), $|X_{t(n)} - X_{s(n)}| \le n\, 2^{-n/2}$ for large enough $n$. Summing these bounds over levels proves the continuity of $Z_t$. On the event in (21.0.1) of probability zero we set $Z_t = 0$. □
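The dyadic refinement in the proof can be imitated numerically by Lévy's midpoint construction: at each level, midpoints are conditionally Gaussian with variance $2^{-(n+1)}$ around the average of their endpoints. The sketch below (the function name is ours) builds many paths this way and checks the covariance $\min(s,t)$ by Monte Carlo:

```python
import numpy as np

def brownian_paths(levels, n_paths, rng):
    # Levy midpoint refinement: start from W_0 = 0, W_1 ~ N(0,1) and
    # insert conditionally Gaussian midpoints, level by level.
    t = np.array([0.0, 1.0])
    w = np.zeros((n_paths, 2))
    w[:, 1] = rng.standard_normal(n_paths)
    for n in range(1, levels + 1):
        mid_t = (t[:-1] + t[1:]) / 2
        # conditional mean = average of the endpoints, variance = 2^{-(n+1)}
        mid_w = (w[:, :-1] + w[:, 1:]) / 2 + \
            np.sqrt(2.0 ** -(n + 1)) * rng.standard_normal((n_paths, mid_t.size))
        new_t = np.empty(2 * t.size - 1)
        new_t[0::2], new_t[1::2] = t, mid_t
        new_w = np.empty((n_paths, new_t.size))
        new_w[:, 0::2], new_w[:, 1::2] = w, mid_w
        t, w = new_t, new_w
    return t, w

rng = np.random.default_rng(1)
t, w = brownian_paths(levels=4, n_paths=200_000, rng=rng)
# Cov(W_s, W_t) should be min(s, t); s = 0.25 is index 4, t = 0.5 is index 8
cov = np.mean(w[:, 4] * w[:, 8])
```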
Definition. A sample continuous centered Gaussian process $B_t$ for $t \in [0,1]$ is called a Brownian bridge if its covariance is
$$\mathrm{Cov}(B_s, B_t) = s(1-t) \text{ for } s \le t.$$
Such a process exists because if $X_t$ is a Brownian motion then $B_t = X_t - tX_1$ is a Brownian bridge, since for $s < t$,
$$\mathrm{Cov}(X_s - sX_1, X_t - tX_1) = s - st - st + st = s(1-t).$$
Notice that $B_0 = B_1 = 0$. □
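The representation $B_t = W_t - tW_1$ is easy to test numerically; the following sketch checks the bridge covariance $s(1-t)$ on a discrete grid (grid size and seed are our choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, paths = 16, 200_000
dt = 1.0 / n
# Discrete Brownian motion on the grid k/n via Gaussian increments
W = np.cumsum(rng.standard_normal((paths, n)) * np.sqrt(dt), axis=1)
W = np.concatenate([np.zeros((paths, 1)), W], axis=1)
t = np.linspace(0.0, 1.0, n + 1)
B = W - t * W[:, [-1]]          # the bridge B_t = W_t - t W_1
# Cov(B_s, B_t) = s(1 - t) for s <= t; here s = 0.25, t = 0.75
cov = np.mean(B[:, 4] * B[:, 12])
```

Note that $B_1 = 0$ holds exactly path by path, as the definition requires.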
Exercise. (Ciesielski's construction of Brownian motion) Let $(f_{k2^{-n}})$ be the Haar basis on $L^2([0,1], dx)$, i.e. $f_0 \equiv 1$ and, for $n \ge 1$ and $k \in I_n = \{k \text{ odd},\ 1 \le k \le 2^n - 1\}$,
$$f_{k2^{-n}}(x) = 2^{(n-1)/2} \text{ on } \Bigl(\frac{k-1}{2^n}, \frac{k}{2^n}\Bigr], \qquad f_{k2^{-n}}(x) = -2^{(n-1)/2} \text{ on } \Bigl(\frac{k}{2^n}, \frac{k+1}{2^n}\Bigr],$$
and $0$ otherwise. If $(g_{k2^{-n}})$ are i.i.d. standard Gaussian random variables, prove that
$$W_t = g_0 t + \sum_{n \ge 1} \sum_{k \in I_n} g_{k2^{-n}} \int_0^t f_{k2^{-n}}(x)\, dx$$
is a Brownian motion on $[0,1]$. Hint: to show that $\mathbb{E} W_t W_s = \min(t,s)$, use Parseval's identity. To prove continuity, show that
$$e_n = \Bigl\| \sum_{k \in I_n} g_{k2^{-n}} \int_0^t f_{k2^{-n}}(x)\, dx \Bigr\|_\infty = 2^{-(n+1)/2} \max_{k \in I_n} |g_{k2^{-n}}|$$
satisfies $P\bigl(e_n \ge 2^{-(n+1)/2} \cdot 2\sqrt{2\ln 2^n}\bigr) \le 8^{-n}$ and use the Borel-Cantelli lemma.
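The covariance identity in this exercise can be checked numerically. The integrated Haar functions are "tent" (Schauder) functions of height $2^{-(n+1)/2}$ peaked at $k/2^n$; the sketch below (truncation level and helper name are ours) sums the truncated series at two dyadic points, where the truncation is exact:

```python
import numpy as np

rng = np.random.default_rng(3)
paths, levels = 200_000, 6
ts = np.array([0.25, 0.5])   # dyadic points, where the truncated sum is exact

def schauder(t, n, k):
    # integral of the Haar function f_{k 2^{-n}}: a tent of height 2^{-(n+1)/2}
    # peaked at k/2^n, supported on ((k-1)/2^n, (k+1)/2^n)
    return np.maximum(0.0, 2.0 ** (-(n + 1) / 2)
                      - 2.0 ** ((n - 1) / 2) * np.abs(t - k / 2.0 ** n))

W = np.outer(rng.standard_normal(paths), ts)   # the g_0 * t term
for n in range(1, levels + 1):
    for k in range(1, 2 ** n, 2):              # odd k in I_n
        W += np.outer(rng.standard_normal(paths), schauder(ts, n, k))

cov = np.mean(W[:, 0] * W[:, 1])               # should be min(0.25, 0.5) = 0.25
```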
Exercise. If $W_t$ is a Brownian motion, using Doob's inequality prove that $P\bigl(\max_{0 \le t \le T} W_t \ge x\bigr) \le e^{-x^2/(2T)}$.
In this section we will show how Brownian motion $W_t$ arises in a classical central limit theorem on the space of continuous functions on $\mathbb{R}_+$. When working with continuous processes defined on $\mathbb{R}_+$, such as the Brownian motion, the metric $\|\cdot\|_\infty$ on $C(\mathbb{R}_+)$ is too strong. A more appropriate metric $d$ can be defined by
$$d(f,g) = \sum_{n \ge 1} 2^{-n} \frac{d_n(f,g)}{1 + d_n(f,g)}, \quad \text{where } d_n(f,g) = \sup_{0 \le t \le n} |f(t) - g(t)|.$$
It is obvious that $d(f_j, f) \to 0$ if and only if $d_n(f_j, f) \to 0$ for all $n \ge 1$, i.e. $d$ metrizes uniform convergence on compacts. $(C(\mathbb{R}_+), d)$ is also a complete separable space, since any sequence is Cauchy in $d$ if and only if it is Cauchy for each $d_n$. When proving uniform tightness of laws on $(C(\mathbb{R}_+), d)$, we will need a characterization of compacts via the Arzela-Ascoli theorem, which in this case can be formulated as follows. For a subset $K \subseteq C(\mathbb{R}_+)$, let us define by
$$K_n = \bigl\{ f|_{[0,n]} : f \in K \bigr\}$$
the restriction of $K$ to the interval $[0,n]$. Then, it is clear that $K$ is compact with respect to $d$ if and only if each $K_n$ is compact with respect to $d_n$. Therefore, using the Arzela-Ascoli theorem to characterize compacts in $C[0,n]$ we get the following. For a function $x \in C[0,T]$, let us denote its modulus of continuity by
$$m_T(x, \delta) = \sup\bigl\{ |x_a - x_b| : |a - b| \le \delta,\ a, b \in [0,T] \bigr\}.$$
Theorem 67 (Arzela-Ascoli) A set $K$ is compact in $(C(\mathbb{R}_+), d)$ if and only if $K$ is closed, uniformly bounded and equicontinuous on each interval $[0,n]$. In other words,
$$\sup_{x \in K} |x_0| < \infty \quad \text{and} \quad \lim_{\delta \to 0} \sup_{x \in K} m_T(x, \delta) = 0 \text{ for all } T > 0.$$
This implies the following criterion of the uniform tightness of laws on the metric space $(C(\mathbb{R}_+), d)$, which is simply a translation of the Arzela-Ascoli theorem into probabilistic language.
Theorem 68 A sequence of laws $(P_n)_{n \ge 1}$ on $(C(\mathbb{R}_+), d)$ is uniformly tight if and only if
$$\lim_{M \to +\infty} \sup_{n \ge 1} P_n\bigl( |x_0| > M \bigr) = 0 \qquad (22.0.1)$$
and, for all $T, \varepsilon > 0$,
$$\lim_{\delta \to 0} \sup_{n \ge 1} P_n\bigl( m_T(x, \delta) > \varepsilon \bigr) = 0. \qquad (22.0.2)$$
($\Leftarrow$) Fix any $\varepsilon > 0$. For each integer $T \ge 1$, find $M_T > 0$ such that
$$\sup_n P_n\bigl( |x_0| > M_T \bigr) \le \frac{\varepsilon}{2^{T+1}},$$
and, using (22.0.2), moduli controlling $m_T$ with similarly small probabilities; let $A_T$ be the closed set of functions satisfying the corresponding constraints on $[0,T]$. By construction, each set $A_T$ is closed, uniformly bounded and equicontinuous on $[0,T]$. Therefore, by the Arzela-Ascoli theorem, their intersection $A$ is compact in $(C(\mathbb{R}_+), d)$, and each law $P_n$ gives $A$ mass at least $1 - \varepsilon$. Since $\varepsilon > 0$ was arbitrary, this proves that the sequence $(P_n)$ is uniformly tight. □
Remarks. Of course, for the uniform tightness on $(C[0,1], \|\cdot\|_\infty)$ we only need the second condition (22.0.2) for $T = 1$. Also, it will be convenient to slightly relax (22.0.2) and replace it with the asymptotic equicontinuity condition
$$\lim_{\delta \to 0} \limsup_{n \to \infty} P_n\bigl( m_T(x, \delta) > \varepsilon \bigr) = 0. \qquad (22.0.3)$$
If this holds then, given $\varepsilon, \eta > 0$, we can find $\delta_0 > 0$ and $n_0 \ge 1$ such that $P_n\bigl( m_T(x, \delta_0) > \varepsilon \bigr) \le \eta$ for all $n > n_0$. Since $m_T(x, \delta) \to 0$ as $\delta \to 0$ for all $x \in C(\mathbb{R}_+)$, for each $n \le n_0$ we can find $\delta_n > 0$ such that $P_n\bigl( m_T(x, \delta_n) > \varepsilon \bigr) \le \eta$, and (22.0.2) holds with $\delta = \min(\delta_0, \ldots, \delta_{n_0})$.
Using this characterization of uniform tightness, we will now give our first example of convergence on $(C(\mathbb{R}_+), d)$ to the Brownian motion $W_t$. Consider a sequence $(X_i)_{i \ge 1}$ of i.i.d. random variables such that $\mathbb{E} X_i = 0$ and $\sigma^2 = \mathbb{E} X_i^2 < \infty$. Let us consider a continuous partial sum process on $[0, \infty)$ defined by
$$W^n_t = \frac{1}{\sigma\sqrt{n}} \sum_{i \le \lfloor nt \rfloor} X_i + (nt - \lfloor nt \rfloor) \frac{X_{\lfloor nt \rfloor + 1}}{\sigma\sqrt{n}}. \qquad (22.0.4)$$
This implies, for example, that continuous functionals of these processes on $(C(\mathbb{R}_+), d)$ converge in distribution; for instance,
$$\sup_{t \le 1} W^n_t \to \sup_{t \le 1} W_t$$
in distribution.
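This convergence of suprema is easy to observe numerically. A minimal sketch (Rademacher steps, so $\sigma = 1$; parameters are our choices): the running maximum of the rescaled walk should have approximately the law of $\sup_{t \le 1} W_t$, which by the reflection principle of Section 24 equals $|N(0,1)|$ in distribution, with mean $\sqrt{2/\pi} \approx 0.798$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, paths = 1000, 20_000
# Partial-sum process at scale n with Rademacher steps X_i = +-1 (sigma = 1)
X = rng.choice([-1.0, 1.0], size=(paths, n))
# running maximum of S_k / sqrt(n), including the starting value W^n_0 = 0
M = np.maximum(np.cumsum(X, axis=1).max(axis=1), 0.0) / np.sqrt(n)
# sup_{t<=1} W_t has the law of |N(0,1)|, whose mean is sqrt(2/pi)
approx_mean = M.mean()
```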
Proof. Since the last term in $W^n_t$ is of order $n^{-1/2}$, for simplicity of notation we will simply write
$$W^n_t = \frac{1}{\sigma\sqrt{n}} \sum_{i \le nt} X_i,$$
and since $W^n_t$ and $W^n_s - W^n_t$ for $t < s$ are independent, it should be obvious that finite dimensional distributions of $W^n_t$ converge to finite dimensional distributions of the Brownian motion $W_t$. By Lemma 51 in the previous section, this identifies $W_t$ as the unique possible limit of $W^n_t$ and, if we can show that the sequence of laws $(\mathcal{L}(W^n_t))_{n \ge 1}$ is uniformly tight on $(C([0,\infty)), d)$, the Selection Theorem will imply that $W^n_t \to W_t$ weakly. Since $W^n_0 = 0$, we only need to prove the asymptotic equicontinuity (22.0.3). First,
$$m_T(W^n, \delta) = \sup_{t,s \in [0,T],\, |t-s| \le \delta} |W^n_t - W^n_s| \le \frac{1}{\sigma\sqrt{n}} \max_{0 \le k \le nT,\, 0 < j \le n\delta} \Bigl| \sum_{k < i \le k+j} X_i \Bigr|.$$
If instead of maximizing over all $0 \le k \le nT$ we maximize in increments of $n\delta$, i.e. over indices $k$ of the type $k = l n\delta$ for $0 \le l \le m-1$, where $m := \lceil T/\delta \rceil$, then it is easy to check that the maximum will decrease by at most a factor of $3$,
$$\max_{0 \le k \le nT,\, 0 < j \le n\delta} \Bigl| \sum_{k < i \le k+j} X_i \Bigr| \le 3 \max_{0 \le l \le m-1,\, 0 < j \le n\delta} \Bigl| \sum_{l n\delta < i \le l n\delta + j} X_i \Bigr|,$$
because the second maximum over $0 < j \le n\delta$ is taken over intervals of the same size $n\delta$. As a consequence, if $m_T(W^n, \delta) > \varepsilon$ then one of the events
$$\Bigl\{ \max_{0 < j \le n\delta} \frac{1}{\sigma\sqrt{n}} \Bigl| \sum_{l n\delta < i \le l n\delta + j} X_i \Bigr| > \frac{\varepsilon}{3} \Bigr\}$$
occurs for some $0 \le l \le m-1$.
Kolmogorov's inequality, Theorem 16, implies that if $S_n = X_1 + \cdots + X_n$ and $\max_{0 < j \le n} P(|S_n - S_j| > \alpha) \le p < 1$ then
$$P\Bigl( \max_{0 < j \le n} |S_j| > 2\alpha \Bigr) \le \frac{1}{1-p}\, P\bigl( |S_n| > \alpha \bigr).$$
If we take $\alpha = \varepsilon\sigma\sqrt{n}/6$ then, by Chebyshev's inequality, for $j \le n\delta$,
$$P\Bigl( \Bigl| \sum_{l n\delta < i \le l n\delta + j} X_i \Bigr| > \frac{\varepsilon\sigma\sqrt{n}}{6} \Bigr) \le \frac{36\, j \sigma^2}{\varepsilon^2 \sigma^2 n} \le \frac{36\delta}{\varepsilon^2},$$
which bounds $p$, and the probability of each of the $m$ events above can then be controlled well enough to conclude (22.0.3)
for all $T > 0$ and $\varepsilon > 0$. This finishes the proof that $W^n_t \to W_t$ weakly in $(C([0,\infty)), d)$. □
Empirical process and the Kolmogorov-Smirnov test. In this section we show how the Brownian bridge $B_t$ arises in another central limit theorem on the space of continuous functions on $[0,1]$. Let us start with a motivating example from statistics. Suppose that $x_1, \ldots, x_n$ are i.i.d. uniform random variables on $[0,1]$. By the law of large numbers, for any $t \in [0,1]$, the empirical c.d.f. $\frac{1}{n}\sum_{i=1}^n I(x_i \le t)$ converges to the true c.d.f. $P(x_1 \le t) = t$ almost surely and, moreover, by the CLT,
$$X^n_t := \sqrt{n}\Bigl( \frac{1}{n}\sum_{i=1}^n I(x_i \le t) - t \Bigr) \xrightarrow{d} N\bigl(0, t(1-t)\bigr).$$
The stochastic process $X^n_t$ is called the empirical process. The covariance of this process, for $s \le t$,
$$\mathbb{E}\, X^n_s X^n_t = s - st = s(1-t),$$
is the same as the covariance of the Brownian bridge and, by the multivariate CLT, all finite dimensional distributions of the empirical process converge to the f.d. distributions of the Brownian bridge. However, we would like to show the convergence of $X^n_t$ to $B_t$ in some stronger sense that would imply weak convergence of continuous functions of the process on the space $(C[0,1], \|\cdot\|_\infty)$.
The Kolmogorov-Smirnov test in statistics provides some motivation. Suppose that i.i.d. $(X_i)_{i \ge 1}$ have a continuous distribution with c.d.f. $F(t) = P(X_1 \le t)$. Let $F_n(t) = \frac{1}{n}\sum_{i=1}^n I(X_i \le t)$ be the empirical c.d.f. Since $F$ is continuous and $F(\mathbb{R}) = [0,1]$,
$$\sup_{t \in \mathbb{R}} \sqrt{n}\,|F_n(t) - F(t)| = \sup_{t \in \mathbb{R}} \sqrt{n}\,\Bigl| \frac{1}{n}\sum_{i=1}^n I\bigl(F(X_i) \le F(t)\bigr) - F(t) \Bigr| = \sup_{t \in [0,1]} \sqrt{n}\,\Bigl| \frac{1}{n}\sum_{i=1}^n I\bigl(F(X_i) \le t\bigr) - t \Bigr| \stackrel{d}{=} \sup_{t \in [0,1]} |X^n_t|,$$
because $(F(X_i))$ are i.i.d. and have the uniform distribution on $[0,1]$. This means that the distribution of the left hand side does not depend on $F$ and, in order to infer whether the sample $(X_i)_{1 \le i \le n}$ comes from the distribution with the c.d.f. $F$, statisticians need to know only the distribution of the supremum of the empirical process or, as an approximation, the distribution of its limit. Equation (23.0.1) suggests that
$$\sup_{t \in \mathbb{R}} \sqrt{n}\,|F_n(t) - F(t)| \to \sup_{t \in [0,1]} |B_t|$$
in distribution, and the right hand side is called the Kolmogorov-Smirnov distribution that will be computed in the next section. Since $B_t$ is sample continuous, its distribution is the law on the metric space $(C[0,1], \|\cdot\|_\infty)$. Even though $X^n_t$ is
We will now develop an approach to control the expectation of (23.0.3) for general classes of functions $\mathcal{F}$, and we will only use the specific definition (23.0.4) at the very end. This will be done in several steps.
Symmetrization. As the first step, we will replace the empirical process (23.0.3) by a symmetrized version, called the Rademacher process, that will be easier to control. Let $x_1', \ldots, x_n'$ be independent copies of $x_1, \ldots, x_n$ and let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher random variables, such that $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$. Let us define
$$P_n f = \frac{1}{n}\sum_{i=1}^n f(x_i) \quad \text{and} \quad P_n' f = \frac{1}{n}\sum_{i=1}^n f(x_i').$$
Then, using Jensen's inequality, the symmetry, and then the triangle inequality, we can write
$$\mathbb{E} Z = \mathbb{E} \sup_{f \in \mathcal{F}} \bigl( P_n f - \mathbb{E} f \bigr) = \mathbb{E} \sup_{f \in \mathcal{F}} \bigl( P_n f - \mathbb{E} P_n' f \bigr) \le \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \bigl( f(x_i) - f(x_i') \bigr) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i \bigl( f(x_i) - f(x_i') \bigr)$$
$$\le \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i) + \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \bigl( -\varepsilon_i f(x_i') \bigr) = 2\,\mathbb{E} R.$$
Equality in the second line holds because switching $x_i \leftrightarrow x_i'$ arbitrarily does not change the expectation, so the equality holds for any fixed $(\varepsilon_i)$ and, therefore, for any random $(\varepsilon_i)$. □
Covering numbers, Kolmogorovs chaining and Dudleys entropy integral. To control ER for general classes
of functions F , we will need to use some measures of complexity of F . First, we will show how to control the
Both packing and covering numbers measure how many points are needed to approximate any element in the set $F$ within distance $u$. It is a simple exercise (at the end of the section) to show that
$$N(F, u, d) \le D(F, u, d) \le N(F, u/2, d)$$
and, in this sense, packing and covering numbers are closely related. Let $F$ be a subset of the cube $[-1,1]^n$ equipped with the rescaled Euclidean metric
$$d(f,g) = \Bigl( \frac{1}{n}\sum_{i=1}^n (f_i - g_i)^2 \Bigr)^{1/2}.$$
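The left inequality above rests on the observation that a maximal $u$-separated set is automatically a $u$-net. A minimal sketch (greedy construction, our helper name) verifies this on random points in the plane:

```python
import numpy as np

rng = np.random.default_rng(5)

def greedy_packing(points, u):
    # Greedily build a maximal u-separated subset; its size lower-bounds
    # the packing number D(F, u, d).
    chosen = []
    for p in points:
        if all(np.linalg.norm(p - q) > u for q in chosen):
            chosen.append(p)
    return np.array(chosen)

pts = rng.random((300, 2))
for u in (0.1, 0.2, 0.4):
    pack = greedy_packing(pts, u)
    # A maximal u-separated set is a u-net: every point lies within u of it,
    # which is why N(F, u, d) <= D(F, u, d).
    dists = np.linalg.norm(pts[:, None, :] - pack[None, :, :], axis=2).min(axis=1)
    assert dists.max() <= u
```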
Consider the following Rademacher process on $F$,
$$R(f) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f_i.$$
Then we have the following version of the classical Kolmogorov chaining lemma. Suppose that we are given a nested sequence of subsets
$$\{0\} = F_0 \subseteq F_1 \subseteq \ldots \subseteq F_j \subseteq \ldots \subseteq F$$
such that:
1. for all distinct $f, g \in F_j$, $d(f,g) > 2^{-j}$;
2. for all $f \in F$ we can find $g \in F_j$ such that $d(f,g) \le 2^{-j}$.
Denoting by $\pi_j(f)$ a point of $F_j$ with $d(f, \pi_j(f)) \le 2^{-j}$, we can decompose $f$ along the chain
$$f = \pi_0(f) + (\pi_1(f) - \pi_0(f)) + (\pi_2(f) - \pi_1(f)) + \ldots = \sum_{j \ge 1} \bigl( \pi_j(f) - \pi_{j-1}(f) \bigr),$$
since $\pi_0(f) = 0$.
Since $R(f)$ is linear,
$$R(f) = \sum_{j \ge 1} R\bigl( \pi_j(f) - \pi_{j-1}(f) \bigr).$$
We first show how to control $R$ on the set of all links. Assume that $\ell \in L_{j-1,j}$, the set of links $\pi_j(f) - \pi_{j-1}(f)$, so that $d(\ell, 0) \le 2^{-j} + 2^{-j+1} \le 2^{-j+2}$. By Hoeffding's inequality,
$$P\Bigl( R(\ell) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i \ell_i \ge t \Bigr) \le \exp\Bigl( -\frac{t^2}{\frac{2}{n}\sum_{i=1}^n \ell_i^2} \Bigr) \le \exp\Bigl( -\frac{t^2}{2^{-2j+5}} \Bigr).$$
If $|F|$ denotes the cardinality of the set $F$ then $|L_{j-1,j}| \le |F_{j-1}|\,|F_j| \le |F_j|^2$ and, therefore, by the union bound,
$$P\Bigl( \forall \ell \in L_{j-1,j},\ R(\ell) \le t \Bigr) \ge 1 - |F_j|^2 \exp\Bigl( -\frac{t^2}{2^{-2j+5}} \Bigr) = 1 - \frac{1}{|F_j|^2}\, e^{-u}$$
after making the change of variables
$$t = \bigl( 2^{-2j+5} (4\log|F_j| + u) \bigr)^{1/2} \le 2^{7/2}\, 2^{-j} \log^{1/2}|F_j| + 2^{5/2}\, 2^{-j}\sqrt{u}.$$
Hence,
$$P\Bigl( \forall \ell \in L_{j-1,j},\ R(\ell) \le 2^{7/2}\, 2^{-j} \log^{1/2}|F_j| + 2^{5/2}\, 2^{-j}\sqrt{u} \Bigr) \ge 1 - \frac{1}{|F_j|^2}\, e^{-u}.$$
If $F_{j-1} = F_j$ then we can choose $\pi_{j-1}(f) = \pi_j(f)$ and, since in this case $L_{j-1,j} = \{0\}$, there is no need to control these links. Therefore, we can assume that $|F_{j-1}| < |F_j|$, so that $|F_j| \ge j+1$, and taking a union bound over all steps,
$$P\Bigl( \forall j \ge 1,\ \forall \ell \in L_{j-1,j},\ R(\ell) \le 2^{7/2}\, 2^{-j}\log^{1/2}|F_j| + 2^{5/2}\, 2^{-j}\sqrt{u} \Bigr)$$
$$\ge 1 - \sum_{j \ge 1} \frac{1}{|F_j|^2}\, e^{-u} \ge 1 - \sum_{j \ge 1} \frac{1}{(j+1)^2}\, e^{-u} = 1 - (\pi^2/6 - 1)\, e^{-u} \ge 1 - e^{-u}.$$
Given $f \in F$, let the integer $k$ be such that $2^{-(k+1)} < d(0,f) \le 2^{-k}$. Then in the above construction we can assume that $\pi_0(f) = \ldots = \pi_k(f) = 0$, i.e. we will project $f$ on $0$ if possible. Then, with probability at least $1 - e^{-u}$,
$$R(f) = \sum_{j \ge k+1} R\bigl( \pi_j(f) - \pi_{j-1}(f) \bigr) \le \sum_{j \ge k+1} \bigl( 2^{7/2}\, 2^{-j}\log^{1/2}|F_j| + 2^{5/2}\, 2^{-j}\sqrt{u} \bigr)$$
$$\le \sum_{j \ge k+1} 2^{7/2}\, 2^{-j}\log^{1/2} D(F, 2^{-j}, d) + 2^{5/2}\, 2^{-k}\sqrt{u}.$$
Note that $2^{-k} < 2d(f,0)$ and $2^{5/2}\, 2^{-k} < 2^{7/2} d(f,0)$. Finally, since the packing numbers $D(F, \varepsilon, d)$ are decreasing in $\varepsilon$, we can write
$$\sum_{j \ge k+1} 2^{9/2}\, 2^{-(j+1)} \log^{1/2} D(F, 2^{-j}, d) \le 2^{9/2} \int_0^{2^{-(k+1)}} \log^{1/2} D(F, \varepsilon, d)\, d\varepsilon \le 2^{9/2} \int_0^{d(0,f)} \log^{1/2} D(F, \varepsilon, d)\, d\varepsilon. \qquad (23.0.5)$$
The integral in (23.0.5) is called Dudley's entropy integral. We would like to apply the bound of the above theorem to
$$\sqrt{n}\, R = \sup_{f \in F} \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i)$$
for the class of functions $\mathcal{F}$ in (23.0.4). Suppose that $x_1, \ldots, x_n \in [0,1]$ are fixed and let
$$F = \Bigl\{ \bigl( I(s < x_i \le t) \bigr)_{1 \le i \le n} : t, s \in [0,1],\ |t - s| \le \delta \Bigr\} \subseteq \{0,1\}^n.$$
Lemma 53 We have $N(F, u, d) \le K u^{-4}$ for some absolute $K > 0$ independent of the points $x_1, \ldots, x_n$.
This kind of estimate is called a uniform covering number bound, because the bound does not depend on the points $x_1, \ldots, x_n$ that generate the class $F$ from $\mathcal{F}$.
Proof. We can assume that $x_1 \le \ldots \le x_n$. Then the class $F$ consists of all vectors of the type
$$(0 \ldots 0\ 1 \ldots 1\ 0 \ldots 0),$$
i.e. the coordinates equal to $1$ come in blocks. Given $u$, let $F_u$ be the subset of such vectors with blocks of $1$s starting and ending at coordinates that are multiples of $\lfloor nu^2 \rfloor$. Given any vector $f \in F$, let us approximate it by a vector $f' \in F_u$ by choosing the closest starting and ending coordinates for the block of $1$s. The number of different coordinates will be bounded by $2\lfloor nu^2 \rfloor$ and, therefore, the distance between $f$ and $f'$ will be bounded by
$$d(f, f') \le \sqrt{2 n^{-1} \lfloor nu^2 \rfloor} \le \sqrt{2}\, u.$$
The cardinality of $F_u$ is, obviously, of order $u^{-4}$ and this proves that $N(F, \sqrt{2}u, d) \le K u^{-4}$. Making the change of variables $\sqrt{2}u \to u$ proves the result. □
To apply Kolmogorov's chaining bound to this class $F$, let us make a simple observation: if a random variable $X \ge 0$ satisfies $P(X \ge a + bt) \le Ke^{-t^2}$ for all $t \ge 0$ then
$$\mathbb{E} X = \int_0^\infty P(X \ge t)\, dt \le a + \int_0^\infty P(X \ge a + t)\, dt \le a + K\int_0^\infty e^{-t^2/b^2}\, dt \le a + Kb \le K(a + b)$$
for a possibly larger constant $K$.
In our case,
$$D_n^2 = \sup_F d(0,f)^2 = \sup_F \frac{1}{n}\sum_{i=1}^n f(x_i)^2 = \sup_{|t-s| \le \delta} \frac{1}{n}\sum_{i=1}^n I(s < x_i \le t) = \sup_{|t-s| \le \delta} \Bigl( \frac{1}{n}\sum_{i=1}^n I(x_i \le t) - \frac{1}{n}\sum_{i=1}^n I(x_i \le s) \Bigr).$$
Since the integral on the right hand side of (23.0.6) is concave in $D_n$, by Jensen's inequality,
$$\mathbb{E} \sup_F \frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i f(x_i) \le K\Bigl( \int_0^{\mathbb{E} D_n} \sqrt{\log \frac{K}{u}}\, du + \mathbb{E} D_n \Bigr).$$
The right-hand side goes to zero as $\delta \to 0$ and this finishes the proof of asymptotic equicontinuity of $X^n$. As a result, for any continuous function on $(C[0,1], \|\cdot\|_\infty)$ the distribution of $(X^n_t)$ converges to the distribution of $(B_t)$. For example,
$$\sqrt{n} \sup_{0 \le t \le 1} \Bigl| \frac{1}{n}\sum_{i=1}^n I(x_i \le t) - t \Bigr| \to \sup_{0 \le t \le 1} |B_t|$$
in distribution. We will find the distribution of the right hand side in the next section. Notice that the methods we used to prove equicontinuity were quite general, and the main step where we used the specific class of functions $\mathcal{F}$ was to control the covering numbers.
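This limit can be checked numerically against the Kolmogorov-Smirnov distribution $P(\sup_t |B_t| \le x) = 1 - 2\sum_{k \ge 1}(-1)^{k-1}e^{-2k^2x^2}$ (computed in the next section). A minimal sketch, with sample sizes of our choosing:

```python
import numpy as np

rng = np.random.default_rng(6)

def ks_statistic(u):
    # sup_t |F_n(t) - t| for a uniform sample, computed from the order statistics
    u = np.sort(u)
    n = u.size
    i = np.arange(1, n + 1)
    return max((i / n - u).max(), (u - (i - 1) / n).max())

def kolmogorov_cdf(x, terms=100):
    # P(sup_t |B_t| <= x) = 1 - 2 sum_{k>=1} (-1)^{k-1} exp(-2 k^2 x^2)
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * x**2))

n, paths = 500, 5000
stats = np.array([np.sqrt(n) * ks_statistic(rng.random(n)) for _ in range(paths)])
# empirical P(sqrt(n) D_n <= 1) against the limiting c.d.f.
emp = np.mean(stats <= 1.0)
```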
Exercise. If $(x_i)_{i \ge 1}$ are i.i.d. uniform on $[0,1]$, prove that
$$\sup_{t \in [0,1]} \Bigl| \frac{1}{n}\sum_{i=1}^n I(x_i \le t) - t \Bigr| \to 0 \text{ a.s.}$$
Exercise. Suppose that a family $\mathcal{F}$ of measurable functions $f : \Omega \to [0,1]$ is such that, for some $V \ge 1$ and for any $\omega_1, \ldots, \omega_n$,
$$\log D(\mathcal{F}, \varepsilon, d) \le V \log \frac{2}{\varepsilon}, \quad \text{where } d(f,g) = \Bigl( \frac{1}{n}\sum_{i=1}^n \bigl( f(\omega_i) - g(\omega_i) \bigr)^2 \Bigr)^{1/2}.$$
If $(X_n)_{n \ge 1}$ are i.i.d. random elements with values in $\Omega$, prove that
$$\mathbb{E} \sup_{f \in \mathcal{F}} \Bigl| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E} f(X_1) \Bigr| \le K\Bigl( \frac{V}{n} \Bigr)^{1/2}$$
for some absolute constant $K$. (Assume, for example, that $\mathcal{F}$ is countable to avoid any measurability issues.)
We showed that the empirical process converges to the Brownian bridge on $(C([0,1]), \|\cdot\|_\infty)$. As a result, the distribution of a continuous function of the process will also converge; for example, the supremum of its absolute value converges in distribution. We will compute the distribution of this supremum in Theorem 73 below, but first we will prove the so-called strong Markov property of the Brownian motion. Given a Brownian motion $W_t$ on some probability space $(\Omega, \mathcal{B}, P)$, let $B_t = \sigma\bigl( (W_s)_{s \le t} \bigr) \subseteq \mathcal{B}$ be the $\sigma$-algebra generated by the process up to time $t$. Let us consider a family of $\sigma$-algebras $\mathcal{F}_t$ for $t \ge 0$ such that (i) $B_t \subseteq \mathcal{F}_t$, (ii) $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $s \le t$, and (iii) the process $(W_{t+s} - W_t)_{s \ge 0}$ is independent of $\mathcal{F}_t$.
For example, if we consider some $\sigma$-algebra $\mathcal{A}$ independent of the process $W_t$ then $\mathcal{F}_t = \sigma(B_t \cup \mathcal{A})$ satisfies these properties. A random variable $\tau \ge 0$ is called a stopping time if
$$\{\tau \le t\} \in \mathcal{F}_t \text{ for all } t \ge 0.$$
For example, a hitting time $\tau_c = \inf\{t > 0 : W_t = c\}$ for $c > 0$ is a stopping time because, by the sample continuity,
$$\{\tau_c \le t\} = \bigcap_{q < c} \bigcup_{r < t} \{W_r > q\},$$
where the intersection and union are over rational numbers $q, r$. The following is very similar to the property of stopping times for sums of i.i.d. random variables in Section 7.
Theorem 71 (Strong Markov Property) On the event $\{\tau < \infty\}$, the increments $W_t' := W_{\tau + t} - W_\tau$ of the process after the stopping time are independent of the $\sigma$-algebra
$$\mathcal{F}_\tau = \bigl\{ B \in \mathcal{B} : B \cap \{\tau \le t\} \in \mathcal{F}_t \text{ for all } t \ge 0 \bigr\}$$
generated by the data up to the stopping time and, moreover, the process $W_t'$ is again a Brownian motion.
Proof. The main tool in the proof is the approximation of an arbitrary stopping time $\tau$ by the dyadic stopping time
$$\tau_n = \frac{\lfloor 2^n \tau \rfloor + 1}{2^n}.$$
By construction, $\tau_n \downarrow \tau$ and, by the continuity of the process, $W_{\tau_n} \to W_\tau$ almost surely. The process $W_t'$ is, obviously, continuous and we only need to check the statement of the theorem on its finite dimensional distributions. Consider an integer $d \ge 1$, times $0 \le t_1 < \ldots < t_d$ and $f \in C_b(\mathbb{R}^d)$, and let
$$e(\tau) = f\bigl( W_{\tau + t_1} - W_\tau, \ldots, W_{\tau + t_d} - W_\tau \bigr).$$
Decomposing over the values $k2^{-n}$ of $\tau_n$, in the second line one uses the homogeneity of the Brownian motion to write $\mathbb{E}\, e(k2^{-n}) = \mathbb{E}\, e(0)$. Since $e(\tau_n) \to e(\tau)$, this proves that the increments $(W_{\tau + t_1} - W_\tau, \ldots, W_{\tau + t_d} - W_\tau)$ are independent of the $\sigma$-algebra $\mathcal{F}_\tau$ on the event $\{\tau < \infty\}$ and have the same distribution as $(W_{t_1}, \ldots, W_{t_d})$, i.e. the original Brownian motion. This finishes the proof. □
Remark. A random variable $\tau \ge 0$ is called a Markov time if
$$\{\tau < t\} \in \mathcal{F}_t \text{ for all } t \ge 0. \qquad (24.0.1)$$
One can show that a stopping time is always a Markov time and $\mathcal{F}_\tau \subseteq \mathcal{F}_{\tau+}$. With a little bit more work, one can generalize the above strong Markov property to Markov times and $\sigma$-algebras $\mathcal{F}_{\tau+}$. □
Example. As a first example of application of the SMP, let us compute the following probability,
$$P\Bigl( \sup_{t \le b} W_t \ge c \Bigr) = P(\tau_c \le b),$$
for any $c > 0$ and the hitting time $\tau_c$. Since the event $\{\tau_c \le b\} \in \mathcal{F}_{\tau_c}$ is independent of the process $W_t' = W_{\tau_c + t} - W_{\tau_c}$, which is a Brownian motion, we can write (see also the Remark below)
$$P(W_b \ge c) = P\bigl( \tau_c \le b,\ W_b - W_{\tau_c} \ge 0 \bigr) = P(\tau_c \le b)\, P\bigl( W'_{b - \tau_c} \ge 0 \bigr) = \frac{1}{2}\, P(\tau_c \le b). \qquad (24.0.2)$$
Therefore,
$$P\Bigl( \sup_{t \le b} W_t \ge c \Bigr) = P(\tau_c \le b) = 2P(W_b \ge c) = 2\int_{c/\sqrt{b}}^\infty \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx. \qquad (24.0.3)$$
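The identity $P(\sup_{t \le b} W_t \ge c) = 2P(W_b \ge c)$ is easy to confirm by simulation; the sketch below discretizes the Brownian motion on a fine grid (grid size and seed are our choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, paths, b, c = 2000, 40_000, 1.0, 1.0
# Discretized Brownian motion on [0, b]
W = np.cumsum(rng.standard_normal((paths, n)) * np.sqrt(b / n), axis=1)
lhs = np.mean(W.max(axis=1) >= c)       # P(sup_{t<=b} W_t >= c)
rhs = 2.0 * np.mean(W[:, -1] >= c)      # 2 P(W_b >= c) = 2(1 - Phi(c/sqrt(b)))
```

The discrete maximum slightly undershoots the continuous supremum, so the two estimates agree only up to a small discretization bias.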
because one can show, as in the proof of the SMP, that the process $W_t - W_\tau$ for $t \ge \tau$ is independent of $\mathcal{F}_\tau$. Indeed, for the dyadic approximation $\tau_n$ of $\tau$, conditionally on $\tau_n = k2^{-n}$, the increments $W_t - W_{k2^{-n}}$ are independent of $\mathcal{F}_{k2^{-n}}$, so the proof does not change. For example, we can write
$$P\bigl( \tau_n \le b,\ W_b - W_{\tau_n} \ge 0 \bigr) = \sum_{k \ge 0} P\Bigl( \underbrace{\tau_n = k/2^n \le b}_{\text{in } \mathcal{F}_{k/2^n}},\ \underbrace{W_b - W_{k/2^n} \ge 0}_{\text{indep. of } \mathcal{F}_{k/2^n}} \Bigr) = \frac{1}{2} \sum_{k \ge 0} P\Bigl( \tau_n = \frac{k}{2^n} \le b \Bigr) = \frac{1}{2}\, P(\tau_n \le b).$$
Reflection principles. If $W_t$ is a Brownian motion then $B_t = W_t - tW_1$ is the Brownian bridge for $t \in [0,1]$. The next lemma shows that we can think of the Brownian bridge as Brownian motion conditioned to be equal to zero at time $t = 1$ (pinned down Brownian motion). The Brownian motion can be written as a sum $W_t = B_t + tW_1$ of the Brownian bridge and the independent process $tW_1$. Therefore, if we define a random variable $\xi$ with distribution $\mathcal{L}(\xi) = \mathcal{L}\bigl( W_1 \,\big|\, |W_1| < \varepsilon \bigr)$ independent of $B_t$ then
$$\mathcal{L}\bigl( W_t \,\big|\, |W_1| < \varepsilon \bigr) = \mathcal{L}(B_t + t\xi) \to \mathcal{L}(B_t)$$
as $\varepsilon \to 0$. □
Let us first analyze the upper bound. If we define a hitting time $\tau = \inf\{t : W_t = b - \varepsilon\}$ then $W_\tau = b - \varepsilon$ and
$$P\bigl( \exists t : W_t \ge b - \varepsilon,\ |W_1| < \varepsilon \bigr) = P\bigl( \tau \le 1,\ |W_1| < \varepsilon \bigr) = P\bigl( \tau \le 1,\ W_1 - W_\tau \in (-b, -b + 2\varepsilon) \bigr).$$
[Figure 24.1: the Brownian motion reflected after the stopping time $\tau$.]
By the reflection, this probability equals
$$P\bigl( \tau \le 1,\ W_1 - W_\tau \in (b - 2\varepsilon, b) \bigr) = P\bigl( W_1 \in (2b - 3\varepsilon,\ 2b - \varepsilon) \bigr),$$
because the fact that $W_1 \in (2b - 3\varepsilon, 2b - \varepsilon)$ automatically implies that $\tau \le 1$ for $b > 0$ and $\varepsilon$ small enough. We reflected the Brownian motion after the stopping time $\tau$ as in Figure 24.1. Therefore, we proved that
$$P(\exists t : B_t = b) = \lim_{\varepsilon \to 0} \frac{P\bigl( W_1 \in (2b - 3\varepsilon,\ 2b - \varepsilon) \bigr)}{P\bigl( W_1 \in (-\varepsilon, \varepsilon) \bigr)} = e^{-2b^2}.$$
The lower bound can be analyzed similarly. □
and let $\tau_b$ and $\tau_{-b}$ be the hitting times of $b$ and $-b$. By symmetry of the distribution of the process $B_t$,
$$P\Bigl( \sup_{0 \le t \le 1} |B_t| \ge b \Bigr) = P\bigl( \tau_b \le 1 \text{ or } \tau_{-b} \le 1 \bigr) = 2P(A_1, \tau_b < \tau_{-b}).$$
Again, by symmetry,
$$P(A_1, \tau_b < \tau_{-b}) = P(A_1) - P(A_2, \tau_{-b} < \tau_b)$$
and, by induction,
$$P(A_1, \tau_b < \tau_{-b}) = P(A_1) - P(A_2) + \ldots + (-1)^{n-1} P(A_n, \tau_b < \tau_{-b}).$$
As in Theorem 72, reflecting the Brownian motion each time we hit $b$ or $-b$, one can show that
$$P(A_n) = \lim_{\varepsilon \to 0} \frac{P\bigl( W_1 \in (2nb - \varepsilon,\ 2nb + \varepsilon) \bigr)}{P\bigl( W_1 \in (-\varepsilon, \varepsilon) \bigr)} = e^{-\frac{1}{2}(2nb)^2} = e^{-2n^2 b^2},$$
and this finishes the proof. □
Given $a, b > 0$, let us compute the probability that a Brownian bridge crosses one of the levels $-a$ or $b$.
Proof. We have
$$P(\exists t : B_t = -a \text{ or } b) = P\bigl( \exists t : B_t = -a,\ \tau_{-a} < \tau_b \bigr) + P\bigl( \exists t : B_t = b,\ \tau_b < \tau_{-a} \bigr).$$
If we introduce the events
$$C_n = \bigl\{ \exists t_1 < \ldots < t_n : B_{t_1} = b,\ B_{t_2} = -a,\ \ldots \bigr\}$$
and
$$A_n = \bigl\{ \exists t_1 < \ldots < t_n : B_{t_1} = -a,\ B_{t_2} = b,\ \ldots \bigr\}$$
then, as in the previous theorem,
$$P(C_n, \tau_b < \tau_{-a}) = P(C_n) - P(A_{n+1}, \tau_{-a} < \tau_b)$$
and, similarly,
$$P(A_n, \tau_{-a} < \tau_b) = P(A_n) - P(C_{n+1}, \tau_b < \tau_{-a}).$$
By induction,
$$P(\exists t : B_t = -a \text{ or } b) = \sum_{n=1}^\infty (-1)^{n-1} \bigl( P(A_n) + P(C_n) \bigr).$$
Probabilities of the events $A_n$ and $C_n$ can be computed using the reflection principle as above,
$$P(A_{2n}) = P(C_{2n}) = e^{-2n^2(a+b)^2}, \quad P(C_{2n+1}) = e^{-2(na + (n+1)b)^2}, \quad P(A_{2n+1}) = e^{-2((n+1)a + nb)^2}.$$
Proof. First of all, (24.0.4) gives the joint c.d.f. of $(X, Y)$ because
$$F(a,b) = P(X < a,\ Y < b) = P\Bigl( -a < \inf_t B_t,\ \sup_t B_t < b \Bigr) = 1 - P(\exists t : B_t = -a \text{ or } b).$$
If $f(a,b) = \partial^2 F / \partial a\, \partial b$ is the joint p.d.f. of $(X,Y)$ then the c.d.f. of the spread $X + Y$ is
$$P(Y + X \le t) = \int_0^t \int_0^{t-a} f(a,b)\, db\, da.$$
The inner integral can be computed as
$$\int_0^{t-a} f(a,b)\, db = \sum_{n \ge 0} 4n\bigl( (n+1)t - a \bigr) e^{-2((n+1)t - a)^2} + \sum_{n \ge 0} 4(n+1)(nt + a)\, e^{-2(nt + a)^2} - \sum_{n \ge 1} 8n^2 t\, e^{-2n^2 t^2}.$$
Exercise. Prove the strong Markov property for a Markov time $\tau$ and the $\sigma$-algebra $\mathcal{F}_{\tau+}$ in (24.0.1).
Exercise. A family $(\mathcal{F}_t)$ is called right-continuous if $\bigcap_{s > t} \mathcal{F}_s = \mathcal{F}_t$ for all $t \ge 0$. If $(\mathcal{F}_t)$ is right-continuous, show that a Markov time $\tau$ is a stopping time and $\mathcal{F}_{\tau+} = \mathcal{F}_\tau$.
Exercise. If $(\mathcal{F}_t)$ satisfies conditions (i)-(iii) at the beginning of this section, prove that $(\mathcal{F}_{t+})$ satisfies them too.
In this section we will prove another classical limit theorem in probability, called the law of the iterated logarithm, using the method of Skorohod's imbedding, which we will describe first. Let $W_t$ be the Brownian motion.
Proof. Let us start with the case when a stopping time $\tau$ takes a finite number of values,
$$\tau \in \{t_1, \ldots, t_n\}, \quad t_1 < \ldots < t_n.$$
By the optional stopping theorem for martingales, $\mathbb{E} W_\tau = \mathbb{E} W_{t_1} = 0$. Next, let us prove that $\mathbb{E} W_\tau^2 = \mathbb{E}\tau$ by induction on $n$. If $n = 1$ then $\tau = t_1$ and
$$\mathbb{E} W_\tau^2 = \mathbb{E} W_{t_1}^2 = t_1 = \mathbb{E}\tau.$$
To make the induction step from $n-1$ to $n$, define a stopping time $\sigma = \min(\tau, t_{n-1})$ and write
$$\mathbb{E} W_\tau^2 = \mathbb{E} W_\sigma^2 + 2\,\mathbb{E}\, W_\sigma (W_\tau - W_\sigma) + \mathbb{E} (W_\tau - W_\sigma)^2.$$
First of all, by the induction assumption, $\mathbb{E} W_\sigma^2 = \mathbb{E}\sigma$. Moreover, $\sigma \ne \tau$ only if $\tau = t_n$, in which case $\sigma = t_{n-1}$. The event
$$\{\tau = t_n\} = \{\tau \le t_{n-1}\}^c \in \mathcal{F}_{t_{n-1}}$$
and, therefore,
$$\mathbb{E}\, W_\sigma (W_\tau - W_\sigma) = \mathbb{E}\, W_{t_{n-1}} (W_{t_n} - W_{t_{n-1}}) I(\tau = t_n) = 0.$$
Similarly,
$$\mathbb{E} (W_\tau - W_\sigma)^2 = \mathbb{E}\, \mathbb{E}\bigl( I(\tau = t_n)(W_{t_n} - W_{t_{n-1}})^2 \,\big|\, \mathcal{F}_{t_{n-1}} \bigr) = (t_n - t_{n-1})\, P(\tau = t_n).$$
Therefore,
$$\mathbb{E} W_\tau^2 = \mathbb{E}\sigma + (t_n - t_{n-1})\, P(\tau = t_n) = \mathbb{E}\tau$$
and this finishes the proof of the induction step. Next, let us consider the case of a uniformly bounded stopping time $\tau \le M < \infty$. In the previous lecture we defined a dyadic approximation
$$\tau_n = \frac{\lfloor 2^n \tau \rfloor + 1}{2^n},$$
which takes finitely many values and decreases to $\tau$. Since $W_t$ is a martingale, adapted to a corresponding sequence of $\mathcal{F}_t$, and $\tau_n$ and $2M$ are two stopping times such that $\tau_n < 2M$, by the optional stopping theorem, Theorem 41, $W_{\tau_n} = \mathbb{E}(W_{2M} \,|\, \mathcal{F}_{\tau_n})$. By Jensen's inequality,
$$W_{\tau_n}^4 \le \mathbb{E}\bigl( W_{2M}^4 \,\big|\, \mathcal{F}_{\tau_n} \bigr), \quad \mathbb{E} W_{\tau_n}^4 \le \mathbb{E} W_{2M}^4 = 3(2M)^2 < \infty,$$
so the random variables $W_{\tau_n}^2$ are uniformly integrable and we can pass to the limit in $\mathbb{E} W_{\tau_n}^2 = \mathbb{E}\tau_n$.
Finally, we consider the general case. Let us define $\tau(n) = \min(\tau, n)$. For $m \le n$, $\tau(m) \le \tau(n)$ and
$$\mathbb{E}\bigl( W_{\tau(n)} - W_{\tau(m)} \bigr)^2 = \mathbb{E}\bigl( \tau(n) - \tau(m) \bigr),$$
using (25.0.1), (25.0.2) and the fact that $\tau(n), \tau(m)$ are bounded stopping times. Since $\tau(n) \uparrow \tau$, Fatou's lemma and the monotone convergence theorem imply the result in the general case.
Theorem 77 (Skorohod's imbedding) Let $Y$ be a random variable such that $\mathbb{E} Y = 0$ and $\mathbb{E} Y^2 < \infty$. There exists a stopping time $\tau < \infty$ such that $\mathcal{L}(W_\tau) = \mathcal{L}(Y)$.
Proof. Let us start with the simplest case when $Y$ takes only two values, $Y \in \{-a, b\}$ for $a, b > 0$. The condition $\mathbb{E} Y = 0$ determines the distribution of $Y$,
$$pb + (1-p)(-a) = 0 \quad \text{and} \quad p = \frac{a}{a+b}. \qquad (25.0.3)$$
Let $\tau = \inf\{t > 0 : W_t = -a \text{ or } b\}$ be the hitting time of the two-sided boundary $-a, b$. The tail probability of $\tau$ can be bounded by
$$P(\tau > n) \le P\bigl( |W_{j+1} - W_j| < a + b,\ 0 \le j \le n-1 \bigr) = P(|W_1| < a+b)^n = \gamma^n.$$
For a general $Y$ with distribution $\mu$, one constructs a martingale $(Y_j, \mathcal{B}_j)$ of conditional expectations over dyadic partitions, where on an interval $[c,d)$ of the $j$-th partition the value of $Y_j$ is
$$y = \frac{1}{\mu([c,d))} \int_{[c,d)} x\, d\mu(x). \qquad (25.0.4)$$
If $\mu([c,d)) = 0$ we pick any $y \in (c,d)$ as the value of $Y_j$ on $[c,d)$. Since in the $\sigma$-algebra $\mathcal{B}_{j+1}$ the interval $[c,d)$ is split into two intervals, the random variable $Y_{j+1}$ can take only two values on the interval $[c,d)$, say $c \le y_1 < y < y_2 < d$, and, since $(Y_j, \mathcal{B}_j)$ is a martingale,
$$\mathbb{E}(Y_{j+1} \,|\, \mathcal{B}_j) - Y_j = 0. \qquad (25.0.5)$$
We will define stopping times $\tau_n$ such that $\mathcal{L}(W_{\tau_n}) = \mathcal{L}(Y_n)$ iteratively as follows. Since $Y_1$ takes only two values $-a$ and $b$, let $\tau_1 = \inf\{t > 0 : W_t = -a \text{ or } b\}$, and we proved above that $\mathcal{L}(W_{\tau_1}) = \mathcal{L}(Y_1)$. Given $\tau_j$, define $\tau_{j+1}$ as follows: if $W_{\tau_j} = y$ for $y$ in (25.0.4) then
$$\tau_{j+1} = \inf\{t > \tau_j : W_t = y_1 \text{ or } y_2\}.$$
Let us explain why $\mathcal{L}(W_{\tau_j}) = \mathcal{L}(Y_j)$. First of all, by construction, $W_{\tau_j}$ takes the same values as $Y_j$. If $\mathcal{C}_j$ is the $\sigma$-algebra generated by the disjoint sets $\{W_{\tau_j} = y\}$ for $y$ as in (25.0.4), i.e. for possible values of $Y_j$, then $W_{\tau_j}$ is $\mathcal{C}_j$-measurable, $\mathcal{C}_j \subseteq \mathcal{C}_{j+1}$, $\mathcal{C}_j \subseteq \mathcal{F}_{\tau_j}$, and at each step simple sets in $\mathcal{C}_j$ are split in two,
$$\{W_{\tau_j} = y\} = \{W_{\tau_{j+1}} = y_1\} \cup \{W_{\tau_{j+1}} = y_2\}.$$
By the Markov property of the Brownian motion and Theorem 76, $\mathbb{E}(W_{\tau_{j+1}} - W_{\tau_j} \,|\, \mathcal{F}_{\tau_j}) = 0$ and, therefore,
$$\mathbb{E}(W_{\tau_{j+1}} \,|\, \mathcal{C}_j) - W_{\tau_j} = 0.$$
Since on each simple set $\{W_{\tau_j} = y\}$ in $\mathcal{C}_j$ the random variable $W_{\tau_{j+1}}$ takes only two values $y_1$ and $y_2$, this equation allows us to compute the probabilities of these simple sets recursively as in (25.0.3),
$$P(W_{\tau_{j+1}} = y_2) = \frac{y - y_1}{y_2 - y_1}\, P(W_{\tau_j} = y).$$
By (25.0.5), the $Y_j$'s satisfy the same recursive equations and this proves that $\mathcal{L}(W_{\tau_n}) = \mathcal{L}(Y_n)$. The sequence $\tau_n$ is monotone, so it converges $\tau_n \uparrow \tau$ to some stopping time $\tau$. Since
$$\mathbb{E}\tau_n = \mathbb{E} W_{\tau_n}^2 = \mathbb{E} Y_n^2 \le \mathbb{E} Y^2 < \infty,$$
we have $\mathbb{E}\tau = \lim \mathbb{E}\tau_n \le \mathbb{E} Y^2 < \infty$ and, therefore, $\tau < \infty$ almost surely. Then $W_{\tau_n} \to W_\tau$ almost surely by sample continuity and, since $\mathcal{L}(W_{\tau_n}) = \mathcal{L}(Y_n) \to \mathcal{L}(Y)$, this proves that $\mathcal{L}(W_\tau) = \mathcal{L}(Y)$. □
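For the two-point case of the theorem, both $P(W_\tau = b) = a/(a+b)$ and $\mathbb{E}\tau = ab$ (the latter from $\mathbb{E} W_\tau^2 = \mathbb{E}\tau$) can be checked by simulation. The sketch below replaces the Brownian motion by a symmetric random walk with step $\sqrt{dt}$, for which the two-sided hitting problem is exactly the gambler's ruin (parameters are our choices):

```python
import numpy as np

rng = np.random.default_rng(8)
# Embed Y in {-1, 2}: a = 1, b = 2, so P(Y = 2) = a/(a+b) = 1/3, E tau = ab = 2.
# The walk takes steps of size sqrt(dt) = 0.1, so the barriers are at -10, 20.
a_steps, b_steps, dt, paths = 10, 20, 0.01, 20_000
pos = np.zeros(paths, dtype=int)
tau = np.zeros(paths)
active = np.ones(paths, dtype=bool)
step = 0
while active.any():
    step += 1
    pos[active] += rng.choice([-1, 1], size=active.sum())
    done = active & ((pos == -a_steps) | (pos == b_steps))
    tau[done] = step * dt
    active &= ~done

p_hit_b = np.mean(pos == b_steps)   # should be close to a/(a+b) = 1/3
mean_tau = tau.mean()               # should be close to ab = 2
```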
Let us briefly describe the main idea that gives origin to the function $u(t) = \sqrt{2t\,\ell(t)}$, where $\ell(t) = \log\log t$. For $a > 1$, consider a geometric sequence $t = a^k$ and take a look at the probabilities of the following events,
$$P\bigl( W_{a^k} \ge L u(a^k) \bigr) = P\Bigl( \frac{W_{a^k}}{\sqrt{a^k}} \ge \frac{L u(a^k)}{\sqrt{a^k}} \Bigr) \approx \frac{1}{\sqrt{2\pi}}\, \frac{1}{L\sqrt{2\ell(a^k)}}\, \exp\Bigl( -\frac{L^2\, 2a^k \ell(a^k)}{2a^k} \Bigr) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{L\sqrt{2\ell(a^k)}} \Bigl( \frac{1}{k\log a} \Bigr)^{L^2}. \qquad (25.0.6)$$
This series will converge or diverge depending on whether $L > 1$ or $L < 1$. Even though these events are not independent, in some sense they are almost independent, and the Borel-Cantelli lemma would imply that the upper limit of $W_{a^k}$ behaves like $u(a^k)$. Some technical work will complete this main idea. Let us start with the following.
Lemma 55 For small enough ε > 0, with probability one, for all large enough s, if s ≤ t ≤ (1+ε)s then |W_t − W_s| ≤ 4√ε u(s).

Proof. Let δ, ε > 0, t_k = (1+ε)^k and M_k = δ u(t_k). By symmetry, the equation (24.0.3) and the Gaussian tail estimate in Lemma 52,

P( sup_{t_k ≤ t ≤ t_{k+1}} |W_t − W_{t_k}| ≥ M_k ) ≤ 2 P( sup_{0 ≤ t ≤ t_{k+1} − t_k} W_t ≥ M_k )
= 4 N(0, t_{k+1} − t_k)(M_k, ∞) ≤ 4 exp( − M_k² / (2(t_{k+1} − t_k)) )
= 4 exp( − δ² · 2 t_k ℓ(t_k) / (2 ε t_k) ) = 4 ( 1/(k log(1+ε)) )^{δ²/ε}.

If δ² > ε, the sum of these probabilities converges and, by the Borel-Cantelli lemma, for large enough k,

sup_{t_k ≤ t ≤ t_{k+1}} |W_t − W_{t_k}| ≤ δ u(t_k).

It is easy to see that, for small enough ε, u(t_{k+1})/u(t_k) < 1 + 2ε. If k is such that t_k ≤ s ≤ t_{k+1} and s ≤ t ≤ (1+ε)s then, clearly, t_k ≤ s ≤ t ≤ t_{k+2} and, therefore, for large enough k,

|W_t − W_s| ≤ |W_t − W_{t_k}| + |W_s − W_{t_k}| ≤ δ( u(t_{k+1}) + 2u(t_k) ) ≤ 3(1+2ε)δ u(t_k) ≤ 4√ε u(s),

if we take, for example, δ² = ε(1+ε). □

Let us now show that the upper limit of W_t/u(t) is at most one. Applying (25.0.6) with a = 1+ε and L = 1+ε, the series converges and, by the Borel-Cantelli lemma,

W_{t_k} ≤ (1+ε) u(t_k)

for large enough k. If t_k = (1+ε)^k then Lemma 55 implies that, with probability one, for large enough t, if t_k ≤ t < t_{k+1} then

W_t/u(t) = (W_t − W_{t_k})/u(t) + W_{t_k}/u(t) ≤ 4√ε + (1+ε),

and letting ε → 0 proves that lim sup_{t→∞} W_t/u(t) ≤ 1.
To prove that the upper limit is equal to one, we will use the Borel-Cantelli lemma for the independent increments W_{a^k} − W_{a^{k−1}} for large values of the parameter a > 1. If 0 < ε < 1 then, similarly to (25.0.6),

P( W_{a^k} − W_{a^{k−1}} ≥ (1−ε) u(a^k − a^{k−1}) ) ≥ (1/√(2π)) (1/((1−ε)√(2ℓ(a^k − a^{k−1})))) (1/(log(a^k − a^{k−1}))^{(1−ε)²}).

The series diverges and, since these events are independent, they occur infinitely often with probability one. We already proved (by (25.0.6)) that, for ε > 0, for large enough k ≥ 1, W_{a^k}/u(a^k) ≤ 1 + ε and, therefore, by symmetry, W_{a^{k−1}}/u(a^{k−1}) ≥ −(1+ε). This gives

W_{a^k} ≥ (1−ε) u(a^k − a^{k−1}) − (1+ε) u(a^{k−1}) infinitely often,

and

lim sup_{t→∞} W_t/u(t) ≥ lim sup_{k→∞} W_{a^k}/u(a^k) ≥ (1−ε)√(1 − 1/a) − (1+ε)√(1/a).

Letting ε → 0 and a → ∞ over some sequences proves that the upper limit is equal to one. □
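The convergence/divergence dichotomy behind (25.0.6) can be seen numerically. The snippet below is our illustration (the function name `partial_sum` is ours): for L > 1 the partial sums of 1/(k log a)^{L²} stabilize, while for L < 1 they keep growing.

```python
import math

def partial_sum(L, a=2.0, K=100000):
    """Partial sum of the Borel-Cantelli series 1 / (k log a)^(L^2)."""
    return sum(1.0 / (k * math.log(a)) ** (L * L) for k in range(2, K))

s_conv_small = partial_sum(1.5, K=1000)
s_conv_large = partial_sum(1.5, K=100000)
s_div_small = partial_sum(0.5, K=1000)
s_div_large = partial_sum(0.5, K=100000)

# For L = 1.5 the tail adds almost nothing; for L = 0.5 the sum keeps growing.
print(s_conv_large - s_conv_small, s_div_large - s_div_small)
```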
The LIL for Brownian motion implies the LIL for sums of independent random variables via Skorohod's imbedding.
Theorem 79 Suppose that Y_1, ..., Y_n are i.i.d. with E Y_i = 0, E Y_i² = 1. If S_n = Y_1 + ... + Y_n then

lim sup_{n→∞} S_n / √(2n log log n) = 1

almost surely.
Proof. Let us define a stopping time τ(1) such that W_{τ(1)} =^d Y_1. By the strong Markov property, the increment of the process after a stopping time is independent of the process before the stopping time and has the law of Brownian motion. Therefore, we can define τ(2) such that W_{τ(1)+τ(2)} − W_{τ(1)} =^d Y_2 and, by independence, W_{τ(1)+τ(2)} =^d Y_1 + Y_2, and τ(1), τ(2) are i.i.d. By induction, we can define i.i.d. τ(1), ..., τ(n) such that S_n =^d W_{T(n)}, where T(n) = τ(1) + ... + τ(n). We have

S_n/u(n) =^d W_{T(n)}/u(n) = W_n/u(n) + (W_{T(n)} − W_n)/u(n).

By the LIL for Brownian motion,

lim sup_{n→∞} W_n/u(n) = 1.

By the strong law of large numbers, T(n)/n → E τ(1) = E Y_1² = 1 almost surely. For any ε > 0, Lemma 55 implies that, for large n,

|W_{T(n)} − W_n| / u(n) ≤ 4√ε,

and letting ε → 0 finishes the proof. □
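For intuition (a seeded simulation of ours, not part of the notes), one can track S_n/√(2n log log n) for a ±1 random walk. The convergence in Theorem 79 is notoriously slow, so at modest horizons the running maximum simply stays of order one.

```python
import math
import random

rng = random.Random(0)

def u(n):
    """u(n) = sqrt(2 n log log n), the LIL normalization."""
    return math.sqrt(2.0 * n * math.log(math.log(n)))

S, best = 0, 0.0
for n in range(1, 100001):
    S += 1 if rng.random() < 0.5 else -1
    if n >= 10:
        best = max(best, S / u(n))
print(best)  # of order one; the theorem says the limsup is exactly 1
```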
It is easy to check that if W_t is a Brownian motion then tW_{1/t} is also a Brownian motion, and the result follows by the change of variable t → 1/t. To check that tW_{1/t} is a Brownian motion, notice that, for t < s,

E[ tW_{1/t} ( sW_{1/s} − tW_{1/t} ) ] = st · (1/s) − t² · (1/t) = t − t = 0

and E( tW_{1/t} − sW_{1/s} )² = t + s − 2t = s − t.
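The covariance bookkeeping can be verified mechanically. Below is a small check of ours: since E[W_a W_b] = min(a, b), the process X_t = tW_{1/t} has E[X_t X_s] = ts·min(1/t, 1/s) = min(t, s), which is exactly the Brownian covariance.

```python
# Check that Cov(t W_{1/t}, s W_{1/s}) = t*s*min(1/t, 1/s) equals min(t, s).
def inverted_cov(t, s):
    return t * s * min(1.0 / t, 1.0 / s)

pairs = [(0.5, 2.0), (1.0, 3.0), (2.5, 2.5), (0.1, 7.0)]
for t, s in pairs:
    assert abs(inverted_cov(t, s) - min(t, s)) < 1e-12
print("covariance matches min(t, s) on all test pairs")
```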
Let us start by recalling the example in Section 5. Let (X_i) be i.i.d. with the Bernoulli distribution with probability of success θ ∈ [0, 1], i.e. P_θ(X_i = 1) = θ and P_θ(X_i = 0) = 1 − θ, and let u : [0, 1] → R be some continuous function on [0, 1]. Then, by Theorem ?? in Section 5, the Bernstein polynomials

B_n(θ) := Σ_{k=0}^{n} u(k/n) P_θ( Σ_{i=1}^{n} X_i = k ) = Σ_{k=0}^{n} u(k/n) \binom{n}{k} θ^k (1−θ)^{n−k}

converge to u(θ) uniformly on [0, 1].
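The polynomials B_n are straightforward to evaluate. A quick sketch of ours for u(x) = x²: in that special case one can compute B_n(θ) = θ² + θ(1−θ)/n exactly, so the error is visibly of order 1/n.

```python
from math import comb

def bernstein(u, n, theta):
    """Bernstein polynomial B_n(theta) of a function u on [0, 1]."""
    return sum(u(k / n) * comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(n + 1))

u = lambda x: x * x
theta = 0.3
for n in (10, 100, 1000):
    err = bernstein(u, n, theta) - u(theta)
    print(n, err)  # shrinks like theta * (1 - theta) / n
```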
Moment problem. Consider a random variable X taking values in [0, 1] and let μ_k = E X^k be its moments. Given a sequence (c_0, c_1, c_2, ...), let us define the sequence of increments by Δc_k = c_{k+1} − c_k. Then Δμ_k = E X^k (X − 1) and, by induction, (−1)^r Δ^r μ_k = E X^k (1−X)^r ≥ 0, so the numbers

p_k^{(n)} = \binom{n}{k} (−1)^{n−k} Δ^{n−k} μ_k = \binom{n}{k} E X^k (1−X)^{n−k}  (26.0.1)

are non-negative and add up to Σ_{k≤n} p_k^{(n)} = E( X + (1−X) )^n = 1, i.e. they define the distribution P( X^{(n)} = k/n ) = p_k^{(n)} of some random variable X^{(n)}. We showed that E B_n(X) = E u(X^{(n)}) → E u(X) for any continuous function u, which means that X^{(n)} converges to X in distribution. In other words, given finitely many moments of X, this construction gives an explicit approximation of the distribution of X. The condition

(−1)^{n−k} Δ^{n−k} c_k ≥ 0 for all n ≥ 1 and 0 ≤ k ≤ n  (26.0.2)

is called complete monotonicity.
Theorem 80 (Hausdorff) There exists a r.v. X ∈ [0, 1] such that μ_k = E X^k if and only if (26.0.2) holds.
Proof. The idea of the proof is as follows. If μ_k are the moments of some r.v. X, then the discrete distributions defined in (26.0.1) should approximate it. Therefore, our goal will be to show that the condition (26.0.2) ensures that (p_k^{(n)}) is indeed a distribution, and then show that the moments of (26.0.1) converge to μ_k. As a result, any limit of these distributions will be a candidate for the distribution of X. First of all, let us express μ_k in terms of (p_k^{(n)}). Since Δc_k = c_{k+1} − c_k, we have the following inversion formula:

c_k = Σ_{j=0}^{r} \binom{r}{j} (−1)^{r−j} Δ^{r−j} c_{k+j},

which is proved by induction on r. Take r = n − k and recall the definition of p_k^{(n)} above. Then

μ_k = Σ_{j=0}^{n−k} \binom{n−k}{j} \binom{n}{k+j}^{−1} p_{k+j}^{(n)}.
Since

\binom{n−k}{j} \binom{n}{k+j}^{−1} = ( (n−k)! / (j!(n−k−j)!) ) · ( (k+j)!(n−k−j)! / n! ) = \binom{k+j}{k} \binom{n}{k}^{−1},

we can rewrite

μ_k = Σ_{j=0}^{n−k} \binom{k+j}{k} \binom{n}{k}^{−1} p_{k+j}^{(n)} = Σ_{m=k}^{n} \binom{m}{k} \binom{n}{k}^{−1} p_m^{(n)}.
For k = 0, this gives Σ_{m≤n} p_m^{(n)} = μ_0 = 1 and, by the assumption (26.0.2), p_m^{(n)} ≥ 0. Therefore, we can consider a random variable X^{(n)} such that

P( X^{(n)} = m/n ) = p_m^{(n)} for 0 ≤ m ≤ n.
Notice that, for any fixed k,

μ_k = Σ_{m=k}^{n} \binom{m}{k} \binom{n}{k}^{−1} p_m^{(n)} = Σ_{m=k}^{n} ( m(m−1)···(m−k+1) / (n(n−1)···(n−k+1)) ) p_m^{(n)} ≈ Σ_{m=0}^{n} (m/n)^k p_m^{(n)} = E( X^{(n)} )^k
as n → ∞. By the Selection Theorem, one can choose a subsequence X^{(n_i)} that converges to some r.v. X in distribution and, as a result,

E( X^{(n_i)} )^k → E X^k = μ_k,

which means that μ_k are the moments of X. □
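The construction in the proof can be carried out exactly in rational arithmetic. The sketch below is ours: taking the moments μ_k = 1/(k+1) of the uniform distribution, it forms p_m^{(n)} = \binom{n}{m}(−1)^{n−m}Δ^{n−m}μ_m and checks non-negativity, total mass one, and the inversion formula from the proof.

```python
from fractions import Fraction
from math import comb

def finite_diff(mu, k, r):
    """r-th forward difference (Delta^r mu)_k."""
    return sum(comb(r, j) * (-1)**(r - j) * mu(k + j) for j in range(r + 1))

mu = lambda k: Fraction(1, k + 1)          # moments of Uniform[0, 1]
n = 20
p = [comb(n, m) * (-1)**(n - m) * finite_diff(mu, m, n - m)
     for m in range(n + 1)]

assert all(pm >= 0 for pm in p)            # (26.0.2) holds
assert sum(p) == 1                         # a probability distribution

# Exact identity from the proof: mu_k = sum_m C(m,k)/C(n,k) * p_m
k = 3
lhs = sum(Fraction(comb(m, k), comb(n, k)) * p[m] for m in range(k, n + 1))
assert lhs == mu(k)
print("p is a distribution and the inversion formula holds exactly")
```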
de Finetti's theorem. As a consequence of the Hausdorff theorem above, we will now prove the classical de Finetti representation for coin flips. The general case will be considered in the next section. A sequence (X_n)_{n≥1} of {0,1}-valued random variables is called exchangeable if, for any n ≥ 1 and any x_1, ..., x_n ∈ {0,1}, the probability

P(X_1 = x_1, ..., X_n = x_n)

depends only on the number of successes x_1 + ... + x_n and does not depend on the order of 1s and 0s. Another way to say this is that, for any n ≥ 1 and any permutation π of {1, ..., n}, the distribution of (X_{π(1)}, ..., X_{π(n)}) does not depend on π. Then the following holds.

Theorem 81 (de Finetti) If (X_n)_{n≥1} is exchangeable then there exists a distribution F on [0, 1] such that, for any n ≥ 1 and any x_1, ..., x_n ∈ {0,1} with k = x_1 + ... + x_n,

P(X_1 = x_1, ..., X_n = x_n) = ∫_0^1 p^k (1−p)^{n−k} dF(p).
In other words, in order to generate such an exchangeable sequence of 0s and 1s, we first pick p ∈ [0, 1] from some distribution F and then generate a sequence of i.i.d. Bernoulli random variables with probability of success p.

Proof. Let c_k = P(X_1 = 1, ..., X_k = 1). We have

−Δc_k = c_k − c_{k+1} = P(X_1 = ... = X_k = 1) − P(X_1 = ... = X_{k+1} = 1) = P(X_1 = ... = X_k = 1, X_{k+1} = 0) ≥ 0.

Similarly, by induction,

(−1)^r Δ^r c_k = P(X_1 = ... = X_k = 1, X_{k+1} = ... = X_{k+r} = 0) ≥ 0,

so the sequence (c_k) satisfies (26.0.2) and, by the Hausdorff theorem, c_k = ∫_0^1 p^k dF(p) for the distribution F of some random variable on [0, 1]. Since, by exchangeability, changing the order of 1s and 0s does not affect the probability, we get

P(X_1 + ... + X_n = k) = \binom{n}{k} (−1)^{n−k} Δ^{n−k} c_k = ∫_0^1 \binom{n}{k} p^k (1−p)^{n−k} dF(p),

which proves the theorem. □
Example (Polya's urn model). Suppose we have b blue and r red balls in the urn. We pick a ball uniformly at random and return it to the urn together with c additional balls of the same color. Let X_i = 1 if the i-th ball picked is blue and X_i = 0 otherwise. The X_i's are not independent, but it is easy to check that they are exchangeable. For example,

P(bbr) = (b/(b+r)) · ((b+c)/(b+r+c)) · (r/(b+r+2c)) = P(brb) = (b/(b+r)) · (r/(b+r+c)) · ((b+c)/(b+r+2c)).
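This exchangeability can be verified by brute force. The sketch below is ours: it computes the exact probability of every {0,1}-sequence of length 4 under the urn dynamics and checks that it depends only on the number of 1s.

```python
from fractions import Fraction
from itertools import product

def polya_prob(seq, b, r, c):
    """Exact probability of a blue(1)/red(0) draw sequence in Polya's urn."""
    p, nb, nr = Fraction(1), b, r
    for x in seq:
        if x == 1:
            p *= Fraction(nb, nb + nr); nb += c
        else:
            p *= Fraction(nr, nb + nr); nr += c
    return p

b, r, c = 2, 3, 5
for seq in product((0, 1), repeat=4):
    # the probability must match any reordering, e.g. the sorted sequence
    assert polya_prob(seq, b, r, c) == polya_prob(tuple(sorted(seq)), b, r, c)
print("urn probabilities depend only on the number of blue draws")
```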
One can check that, in this case, the mixing distribution F in de Finetti's theorem has the density

( Γ(α+β) / (Γ(α)Γ(β)) ) x^{α−1} (1−x)^{β−1} I(0 ≤ x ≤ 1)

with the parameters α = b/c, β = r/c. By de Finetti's theorem, we can generate the X_i's by first picking p from the distribution Beta(b/c, r/c) and then generating i.i.d. Bernoulli (X_i)'s with the probability of success p. By the strong law of large numbers, the proportion of blue balls in the first n repetitions will converge to this probability of success p, i.e. in the limit it will be random with the Beta distribution. Recall that this example came up in Section 15 on the convergence of martingales. □
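The Beta mixture claim can be tested exactly (a check of ours, not from the notes): the urn probability of k blue draws in n picks should equal the Beta-binomial integral ∫ \binom{n}{k} p^k (1−p)^{n−k} dF(p) = \binom{n}{k} B(α+k, β+n−k)/B(α, β) with α = b/c, β = r/c.

```python
from fractions import Fraction
from itertools import product
from math import comb, gamma

def polya_prob(seq, b, r, c):
    """Exact probability of a blue(1)/red(0) draw sequence in Polya's urn."""
    p, nb, nr = Fraction(1), b, r
    for x in seq:
        if x == 1:
            p *= Fraction(nb, nb + nr); nb += c
        else:
            p *= Fraction(nr, nb + nr); nr += c
    return p

def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

b, r, c, n = 2, 3, 1, 5
alpha, beta = b / c, r / c
for k in range(n + 1):
    urn = sum(float(polya_prob(s, b, r, c))
              for s in product((0, 1), repeat=n) if sum(s) == k)
    mix = comb(n, k) * beta_fn(alpha + k, beta + n - k) / beta_fn(alpha, beta)
    assert abs(urn - mix) < 1e-12
print("urn marginals match the Beta-binomial mixture")
```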
We will begin by proving the classical de Finetti representation for exchangeable sequences. A sequence (s_ℓ)_{ℓ≥1} is called exchangeable if, for any permutation π of finitely many indices, we have the equality in distribution

( s_{π(ℓ)} )_{ℓ≥1} =^d ( s_ℓ )_{ℓ≥1}.  (27.0.1)

Theorem 82 (de Finetti) If the sequence (s_ℓ)_{ℓ≥1} is exchangeable then there exists a measurable function g : [0,1]² → R such that

( s_ℓ )_{ℓ≥1} =^d ( g(w, u_ℓ) )_{ℓ≥1},  (27.0.2)

where w and (u_ℓ) are i.i.d. random variables with the uniform distribution on [0, 1].

Before we proceed to prove the above results, let us recall the following definition.

Definition. A measurable space (Ω, B) is called a Borel space if there exists a one-to-one function φ from Ω onto a Borel subset A ⊆ [0, 1] such that both φ and φ^{−1} are measurable. □
Perhaps the most important examples of Borel spaces are complete separable metric spaces and their Borel subsets (see, e.g., Section 13.1 in R.M. Dudley, Real Analysis and Probability). The existence of the isomorphism φ automatically implies that if we can prove the above results in the case when the elements of the sequence (s_ℓ) or array (s_{ℓ,ℓ′}) take values in [0, 1], then the same representation results hold when the elements take values in a Borel space. Similarly, other
standard results for real-valued or [0, 1]-valued random variables are often automatically extended to Borel spaces. For
example, one can generate any real-valued random variable as a measurable function of a uniform random variable
on [0, 1] using the quantile transform and, therefore, any random element on a Borel space can also be generated by a
function of a uniform random variable on [0, 1].
Let us describe another typical measure theoretic argument that will be used many times below.
Lemma 56 (Coding Lemma) Suppose that a random pair (X, Y) takes values in the product of a measurable space (Ω_1, B_1) and a Borel space (Ω_2, B_2). Then there exists a measurable function f : Ω_1 × [0,1] → Ω_2 such that

(X, Y) =^d ( X, f(X, u) ),  (27.0.3)

where u is uniform on [0, 1] and independent of X. Another way to state this is to say that, conditionally on X, Y can be generated as a function Y = f(X, u) of X and an independent uniform random variable u on [0, 1]. Rather than using the Coding Lemma itself, we will often use the argument in its proof.
Proof. By the definition of a Borel space, one can easily reduce the general case to the case when Ω_2 = [0, 1] equipped with the Borel σ-algebra. Since in this case the regular conditional distribution Pr(x, B) of Y given X exists, for a fixed x ∈ Ω_1, we can define by F(x, y) = Pr(x, [0, y]) the conditional distribution function of Y given x and by

f(x, u) = inf{ t : u ≤ F(x, t) }

its quantile transformation. It is easy to see that f is measurable on the product space Ω_1 × [0, 1] because, for any t ∈ R,

{ (x, u) | f(x, u) ≤ t } = { (x, u) | u ≤ F(x, t) }

and, by the definition of the regular conditional probability, F(x, t) = Pr(x, [0, t]) is measurable in x for a fixed t. If u is uniform on [0, 1] then, for a fixed x ∈ Ω_1, f(x, u) has the distribution Pr(x, ·), and this finishes the proof. □
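The quantile construction is concrete enough to code directly. A sketch of ours for a toy kernel where, given x, Y is Bernoulli(x): then F(x, t) jumps at 0 and 1, and pushing uniform u's through f(x, ·) recovers the right conditional law.

```python
def quantile(F, u, grid):
    """f(x, u) = inf{t : u <= F(t)}, over a discrete grid of candidate t's."""
    for t in grid:
        if u <= F(t):
            return t
    return grid[-1]

# Toy kernel: given x, Y = 1 with probability x, so F(x, t) = 1 - x for
# 0 <= t < 1 and F(x, t) = 1 for t >= 1.
def F_given_x(x):
    return lambda t: (1 - x) if t < 1 else 1.0

x = 0.3
grid = [0, 1]
# Push a fine uniform grid of u's through f(x, .) and measure P(Y = 1).
samples = [quantile(F_given_x(x), (i + 0.5) / 10000, grid) for i in range(10000)]
print(sum(samples) / len(samples))  # recovers x = 0.3
```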
The most important way in which the exchangeability condition (27.0.1) will be used is to say that, for any infinite subset I ⊆ N,

( s_ℓ )_{ℓ∈I} =^d ( s_ℓ )_{ℓ≥1}.  (27.0.4)

Let us describe one immediate consequence of this simple observation. If

F_I = σ( s_ℓ : ℓ ∈ I )  (27.0.5)

is the σ-algebra generated by the random variables s_ℓ for ℓ ∈ I, then the following holds.

Lemma 57 For any infinite subset I ⊆ N and any j ∉ I, the conditional expectations

E( f(s_j) | F_I ) = E( f(s_j) | F_{N\{j}} )

coincide almost surely for any bounded measurable function f.
Proof. First, using the property (27.0.4) for I ∪ {j} instead of I implies the equality in distribution

E( f(s_j) | F_I ) =^d E( f(s_j) | F_{N\{j}} ).

Since F_I ⊆ F_{N\{j}}, the tower property of conditional expectations gives

E( E(f(s_j) | F_{N\{j}}) − E(f(s_j) | F_I) )² = E( E(f(s_j) | F_{N\{j}}) )² − E( E(f(s_j) | F_I) )² = 0,  (27.0.7)

because the two conditional expectations are equal in distribution and, therefore, have the same second moment. This finishes the proof. □
Proof of Theorem 82. Let us take any infinite subset I ⊆ N such that its complement N \ I is also infinite. By (27.0.4), we only need to prove the representation (27.0.2) for (s_ℓ)_{ℓ∈I}. First, we will show that, conditionally on (s_ℓ)_{ℓ∈N\I}, the random variables (s_ℓ)_{ℓ∈I} are independent. This means that, given n ≥ 1, any distinct ℓ_1, ..., ℓ_n ∈ I, and any bounded measurable functions f_1, ..., f_n : R → R,

E( ∏_{j≤n} f_j(s_{ℓ_j}) | F_{N\I} ) = ∏_{j≤n} E( f_j(s_{ℓ_j}) | F_{N\I} ).  (27.0.8)
Take an arbitrary set A ∈ F_{N\I}. Since I_A and f_1(s_{ℓ_1}), ..., f_{n−1}(s_{ℓ_{n−1}}) are F_{N\{ℓ_n}}-measurable, we can write

E( I_A ∏_{j≤n} f_j(s_{ℓ_j}) ) = E( I_A ∏_{j≤n−1} f_j(s_{ℓ_j}) E( f_n(s_{ℓ_n}) | F_{N\{ℓ_n}} ) ) = E( I_A ∏_{j≤n−1} f_j(s_{ℓ_j}) E( f_n(s_{ℓ_n}) | F_{N\I} ) ),

where the second equality follows from Lemma 57. This implies that

E( ∏_{j≤n} f_j(s_{ℓ_j}) | F_{N\I} ) = E( ∏_{j≤n−1} f_j(s_{ℓ_j}) | F_{N\I} ) E( f_n(s_{ℓ_n}) | F_{N\I} )
and (27.0.8) follows by induction on n. Let us also observe that, because of (27.0.4), the distribution of the array (s_ℓ, (s_j)_{j∈N\I}) does not depend on ℓ ∈ I. This implies that, conditionally on (s_j)_{j∈N\I}, the random variables (s_ℓ)_{ℓ∈I} are identically distributed, in addition to being independent. The product space [0,1]^N with the Borel σ-algebra corresponding to the product topology is a Borel space (recall that, equipped with the usual metric, it becomes a complete separable metric space). Therefore, in order to conclude the proof, it remains to generate X = (s_ℓ)_{ℓ∈N\I} as a function of a uniform random variable w on [0, 1], X = X(w), and then use the argument in the Coding Lemma 56 to generate the sequence (s_ℓ)_{ℓ∈I} as ( f(X(w), u_ℓ) )_{ℓ∈I}, where (u_ℓ)_{ℓ∈I} are i.i.d. random variables uniform on [0, 1]. This finishes the proof. □
Comments about de Finetti's theorem. Consider an exchangeable sequence (s_ℓ)_{ℓ≥1} taking values in some complete separable metric space Ω. It is equal in distribution to (g(w, u_ℓ))_{ℓ≥1}. For a fixed w, this is an i.i.d. sequence from the distribution

μ_w = λ ∘ g(w, ·)^{−1},

where λ is the Lebesgue measure on [0, 1]. By the strong law of large numbers for empirical measures (Varadarajan's theorem in Section 17),

μ_w = lim_{n→∞} (1/n) Σ_{ℓ=1}^{n} δ_{g(w, u_ℓ)}.

The limit on the right-hand side is taken in the complete separable metric space P(Ω, B) of probability measures on Ω equipped, for example, with the bounded Lipschitz metric or the Levy-Prokhorov metric, and this limit exists almost surely over (u_ℓ)_{ℓ≥1}. This implies that, in distribution, μ_w coincides with

μ = lim_{n→∞} (1/n) Σ_{ℓ=1}^{n} δ_{s_ℓ}.

The measure μ is called the empirical measure of the sequence (s_ℓ)_{ℓ≥1}. One can now interpret de Finetti's representation as follows. First, we generate μ as a function of a uniform random variable w on [0,1] and then generate s_ℓ as a function of μ and i.i.d. random variables u_ℓ uniform on [0, 1], using the Coding Lemma. Combining the two steps, we generate s_ℓ as a function of w and u_ℓ.
Remark. There are two equivalent definitions of a random probability measure on a complete separable metric space Ω with the Borel σ-algebra B. On the one hand, as above, these are just random elements taking values in the space P(Ω, B) of probability measures on (Ω, B) equipped with the topology of weak convergence or a metric that metrizes weak convergence. On the other hand, we can think of a random measure as a probability kernel, i.e. as a function ν = ν(x, A) of a generic point x ∈ X for some probability space (X, F, Pr) and a measurable set A ∈ B, such that, for a fixed x, ν(x, ·) is a probability measure and, for a fixed A, ν(·, A) is a measurable function on (X, F). It is well known that these two definitions coincide (see, e.g., Lemma 1.37 and Theorem A2.3 in Foundations of Modern Probability).
where k = x_1 + ... + x_n. □
Another piece of information that can be deduced from de Finetti's theorem, and is often useful, is the following. Suppose that, in addition to the sequence (s_ℓ)_{ℓ≥1}, we are given another random element Z taking values in a complete separable metric space, and their joint distribution is not affected by a permutation of the sequence,

( Z, (s_ℓ)_{ℓ≥1} ) =^d ( Z, (s_{π(ℓ)})_{ℓ≥1} ).  (27.0.9)

Theorem 83 Conditionally on μ, the sequence (s_ℓ)_{ℓ≥1} is i.i.d. with the distribution μ and independent of Z.

For example, one can deduce Theorem 86 in the exercise below from this statement rather directly.
Proof of Theorem 83. If we define t_ℓ = (Z, s_ℓ) then (27.0.9) implies that (t_ℓ)_{ℓ≥1} is exchangeable. Let

ν = lim_{n→∞} (1/n) Σ_{ℓ=1}^{n} δ_{t_ℓ} = lim_{n→∞} (1/n) Σ_{ℓ=1}^{n} δ_{(Z, s_ℓ)}

be the empirical measure of this sequence. Obviously, ν = δ_Z × μ. Since, given ν, the sequence (Z, s_ℓ) is i.i.d. from ν, this means that, given ν, the sequence (s_ℓ) is i.i.d. from μ. In other words, given Z and μ, the sequence (s_ℓ) is i.i.d. from μ. The statement then follows from the following simple lemma, which we will leave as an exercise. □
Lemma 58 Suppose that three random elements X, Y and Z take values in some complete separable metric spaces. The conditional distribution of X given the pair (Y, Z) depends only on Y (is σ(Y)-measurable) if and only if X and Z are independent conditionally on Y.
The Aldous-Hoover representation. We will now prove two analogues of de Finetti's theorem for two-dimensional arrays. Let us consider an infinite random array s = (s_{ℓ,ℓ′})_{ℓ,ℓ′≥1}. The array s is called an exchangeable array if, for any permutations π and ρ of finitely many indices, we have the equality in distribution

( s_{π(ℓ),ρ(ℓ′)} )_{ℓ,ℓ′≥1} =^d ( s_{ℓ,ℓ′} )_{ℓ,ℓ′≥1},  (27.0.10)

in the sense that their finite-dimensional distributions are equal. Here is one natural example of an exchangeable array. Given a measurable function σ : [0,1]^4 → R and sequences of i.i.d. random variables w, (u_ℓ), (v_{ℓ′}), (x_{ℓ,ℓ′}) that have the uniform distribution on [0, 1], the array

( σ(w, u_ℓ, v_{ℓ′}, x_{ℓ,ℓ′}) )_{ℓ,ℓ′≥1}  (27.0.11)

is, obviously, exchangeable. It turns out that all exchangeable arrays are of this form.

Theorem 84 (Aldous-Hoover) Any infinite exchangeable array (s_{ℓ,ℓ′})_{ℓ,ℓ′≥1} is equal in distribution to (27.0.11) for some function σ.
Another version of the Aldous-Hoover representation holds in the symmetric case. A symmetric array s is called weakly exchangeable if, for any permutation π of finitely many indices, we have the equality in distribution

( s_{π(ℓ),π(ℓ′)} )_{ℓ,ℓ′≥1} =^d ( s_{ℓ,ℓ′} )_{ℓ,ℓ′≥1}.  (27.0.12)
Notice that, compared to (27.0.10), the diagonal elements now play a somewhat different role from the rest of the array. It will be convenient to index the off-diagonal elements by the unordered pairs {ℓ, ℓ′}, writing s_{{ℓ,ℓ′}} = s_{ℓ,ℓ′}, in which case the weak exchangeability can be written as

( s_{{π(ℓ),π(ℓ′)}} )_{ℓ,ℓ′≥1} =^d ( s_{{ℓ,ℓ′}} )_{ℓ,ℓ′≥1}.  (27.0.13)

One natural example of a weakly exchangeable array is given by

s_{ℓ,ℓ} = g(w, u_ℓ) and s_{ℓ,ℓ′} = f(w, u_ℓ, u_{ℓ′}, x_{{ℓ,ℓ′}}) for ℓ ≠ ℓ′,  (27.0.14)

for any measurable functions g : [0,1]² → R and f : [0,1]^4 → R which is symmetric in its middle two coordinates u_ℓ, u_{ℓ′}, and i.i.d. random variables w, (u_ℓ), (x_{{ℓ,ℓ′}}) with the uniform distribution on [0, 1]. Again, it turns out that such examples cover all possible weakly exchangeable arrays.

Theorem 85 (Aldous-Hoover) Any infinite weakly exchangeable array is equal in distribution to the array (27.0.14) for some functions g and f, with f symmetric in its two middle coordinates.
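As a sanity check (our illustration, with arbitrary choices of g and f), one can build a finite patch of an array of the form (27.0.14) and confirm that it is symmetric, with the diagonal driven by (w, u_ℓ) alone.

```python
import random

rng = random.Random(1)
n = 6
w = rng.random()
u = [rng.random() for _ in range(n)]
x = {frozenset((i, j)): rng.random() for i in range(n) for j in range(i)}

g = lambda w, a: w + a                   # any measurable g
f = lambda w, a, b, c: w + a * b + c     # symmetric in (a, b)

s = [[g(w, u[i]) if i == j else f(w, u[i], u[j], x[frozenset((i, j))])
      for j in range(n)] for i in range(n)]

assert all(s[i][j] == s[j][i] for i in range(n) for j in range(n))
print("the generated patch is symmetric")
```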
After we give a proof of Theorem 85, we will leave a similar proof of Theorem 84 as an exercise (with some hints).
We will then give another, quite different, proof of Theorem 84. The proof of the Dovbysh-Sudakov representation in
the next section will be based on Theorem 84.
The most important way in which the exchangeability condition (27.0.12) will be used is to say that, for any infinite subset I ⊆ N,

( s_{{ℓ,ℓ′}} )_{ℓ,ℓ′∈I} =^d ( s_{{ℓ,ℓ′}} )_{ℓ,ℓ′≥1}.  (27.0.15)

Again, one important consequence of this observation will be the following. Given j, j′ ∈ I such that j ≠ j′, let us now define the σ-algebra

F_I(j, j′) = σ( s_{{ℓ,ℓ′}} : ℓ, ℓ′ ∈ I, {ℓ,ℓ′} ≠ {j,j′} ).  (27.0.16)

In other words, this σ-algebra is generated by all elements s_{{ℓ,ℓ′}} with both indices ℓ and ℓ′ in I, excluding s_{{j,j′}}. The following analogue of Lemma 57 holds.
Lemma 59 For any infinite subset I ⊆ N and any j, j′ ∈ I such that j ≠ j′, the conditional expectations

E( f(s_{{j,j′}}) | F_I(j, j′) ) = E( f(s_{{j,j′}}) | F_N(j, j′) )

coincide almost surely for any bounded measurable function f.

Proof. The proof is almost identical to the proof of Lemma 57. The property (27.0.15) implies the equality in distribution

E( f(s_{{j,j′}}) | F_I(j, j′) ) =^d E( f(s_{{j,j′}}) | F_N(j, j′) )

and, since F_I(j, j′) ⊆ F_N(j, j′),

E( E(f(s_{{j,j′}}) | F_N(j, j′)) − E(f(s_{{j,j′}}) | F_I(j, j′)) )² = E( E(f(s_{{j,j′}}) | F_N(j, j′)) )² − E( E(f(s_{{j,j′}}) | F_I(j, j′)) )² = 0,

which finishes the proof. □
Proof of Theorem 85. Let us take any infinite subset I ⊆ N such that N \ I is also infinite and, for each j ∈ I, define the array

S_j = ( s_{{ℓ,ℓ′}} )_{ℓ,ℓ′ ∈ (N\I)∪{j}}.  (27.0.17)

It is obvious that the weak exchangeability (27.0.13) implies that the sequence (S_j)_{j∈I} is exchangeable, since any permutation of finitely many indices from I in the array (27.0.13) results in the corresponding permutation of the sequence (S_j)_{j∈I}. By de Finetti's Theorem 82 (for Borel-space-valued sequences), we can therefore generate

( S_j )_{j∈I} =^d ( X(w, u_j) )_{j∈I}  (27.0.18)

for some measurable function X and i.i.d. uniform random variables w, (u_j) on [0, 1]. If we define the σ-algebras

F = σ( (S_j)_{j∈I} ), F_{j,j′} = σ( S_j, S_{j′} ) for j, j′ ∈ I, j ≠ j′,  (27.0.19)

then we would like to show that, for any finite set C of indices {ℓ, ℓ′} for ℓ, ℓ′ ∈ I such that ℓ ≠ ℓ′ and any bounded measurable functions f_{ℓ,ℓ′} corresponding to the indices {ℓ,ℓ′} ∈ C, we have

E( ∏_{{ℓ,ℓ′}∈C} f_{ℓ,ℓ′}(s_{{ℓ,ℓ′}}) | F ) = ∏_{{ℓ,ℓ′}∈C} E( f_{ℓ,ℓ′}(s_{{ℓ,ℓ′}}) | F_{ℓ,ℓ′} ).  (27.0.20)
Notice that the definitions (27.0.16) and (27.0.19) imply that F_{j,j′} = F_{(N\I)∪{j,j′}}(j, j′), since all the elements s_{{ℓ,ℓ′}} with both indices in (N\I)∪{j,j′}, except for s_{{j,j′}}, appear as one of the coordinates in the arrays S_j or S_{j′}. Therefore, by Lemma 59,

E( f(s_{{j,j′}}) | F_{j,j′} ) = E( f(s_{{j,j′}}) | F_N(j, j′) ).  (27.0.21)
Let us fix any {j,j′} ∈ C, let C′ = C \ {{j,j′}} and consider an arbitrary set A ∈ F. Since I_A and f_{ℓ,ℓ′}(s_{{ℓ,ℓ′}}) for {ℓ,ℓ′} ∈ C′ are F_N(j, j′)-measurable,

E( I_A ∏_{{ℓ,ℓ′}∈C′} f_{ℓ,ℓ′}(s_{{ℓ,ℓ′}}) f_{j,j′}(s_{{j,j′}}) )
= E( I_A ∏_{{ℓ,ℓ′}∈C′} f_{ℓ,ℓ′}(s_{{ℓ,ℓ′}}) E( f_{j,j′}(s_{{j,j′}}) | F_N(j, j′) ) )
= E( I_A ∏_{{ℓ,ℓ′}∈C′} f_{ℓ,ℓ′}(s_{{ℓ,ℓ′}}) E( f_{j,j′}(s_{{j,j′}}) | F_{j,j′} ) ),
where the second equality follows from (27.0.21). Since F_{j,j′} ⊆ F, this implies that

E( ∏_{{ℓ,ℓ′}∈C} f_{ℓ,ℓ′}(s_{{ℓ,ℓ′}}) | F ) = E( ∏_{{ℓ,ℓ′}∈C′} f_{ℓ,ℓ′}(s_{{ℓ,ℓ′}}) | F ) E( f_{j,j′}(s_{{j,j′}}) | F_{j,j′} )
and (27.0.20) follows by induction on the cardinality of C. By the argument in the Coding Lemma 56, (27.0.20) implies that, conditionally on F, we can generate

( s_{{j,j′}} )_{j≠j′∈I} =^d ( h(S_j, S_{j′}, x_{{j,j′}}) )_{j≠j′∈I}  (27.0.22)
for some measurable function h and i.i.d. uniform random variables x_{{j,j′}} on [0, 1]. The reason why the function h can be chosen to be the same for all {j,j′} is that, by symmetry, the distribution of (S_j, S_{j′}, s_{{j,j′}}) does not depend on {j,j′} and, therefore, the conditional distribution of s_{{j,j′}} given S_j and S_{j′} does not depend on {j,j′}. Also, the arrays (S_j, S_{j′}, s_{{j,j′}}) and (S_{j′}, S_j, s_{{j,j′}}) are equal in distribution and, therefore, the function h is symmetric in the coordinates S_j and S_{j′}. Finally, let us recall (27.0.18) and define the function

f( w, u_j, u_{j′}, x_{{j,j′}} ) = h( X(w, u_j), X(w, u_{j′}), x_{{j,j′}} ),

which is, obviously, symmetric in u_j and u_{j′}. Then the equations (27.0.18) and (27.0.22) imply that

( (S_j)_{j∈I}, (s_{{j,j′}})_{j≠j′∈I} ) =^d ( (X(w, u_j))_{j∈I}, (f(w, u_j, u_{j′}, x_{{j,j′}}))_{j≠j′∈I} ).

In particular, if we denote by g the coordinate of the map X corresponding to the element s_{{j,j}} in the array S_j in (27.0.17), this proves that

( (s_{{j,j}})_{j∈I}, (s_{{j,j′}})_{j≠j′∈I} ) =^d ( (g(w, u_j))_{j∈I}, (f(w, u_j, u_{j′}, x_{{j,j′}}))_{j≠j′∈I} ),

which, together with (27.0.15), finishes the proof. □
Proof of Theorem 84. One proof of Theorem 84, similar to the proof of Theorem 85, is sketched in an exercise below. We will now give a different proof, and the main part of the proof will be based on the following observation. Suppose that we have an exchangeable sequence of pairs ((t_ℓ, s_ℓ))_{ℓ≥1} with coordinates in complete separable metric spaces. Given the sequence (t_ℓ)_{ℓ≥1}, how can we generate the sequence of second coordinates (s_ℓ)_{ℓ≥1}? Consider the empirical measures

μ = lim_{n→∞} (1/n) Σ_{ℓ=1}^{n} δ_{(t_ℓ, s_ℓ)}, μ_1 = lim_{n→∞} (1/n) Σ_{ℓ=1}^{n} δ_{t_ℓ}.

Obviously, μ_1 is the marginal of μ on the first coordinate and, moreover, μ_1 is a measurable function of the sequence (t_ℓ)_{ℓ≥1}, so if we are given this sequence we automatically know μ_1. The following holds.
Lemma 60 Given (t_ℓ)_{ℓ≥1}, we can generate the sequence (s_ℓ)_{ℓ≥1} in distribution as

( s_ℓ )_{ℓ≥1} =^d ( f(μ_1, t_ℓ, v, x_ℓ) )_{ℓ≥1}

for some measurable function f and i.i.d. uniform random variables v and (x_ℓ)_{ℓ≥1} on [0, 1].
Proof. First, let us note how to generate the empirical measure μ given the sequence t = (t_ℓ)_{ℓ≥1}. Given μ, the sequence ((t_ℓ, s_ℓ))_{ℓ≥1} is i.i.d. from μ and, since μ_1 is the first marginal of μ, (t_ℓ)_{ℓ≥1} are i.i.d. from μ_1. This means that, if we consider the triple (μ, μ_1, t), then the conditional distribution of t given (μ, μ_1) depends only on μ_1,

P( t ∈ · | μ, μ_1 ) = P( t ∈ · | μ_1 ).

By Lemma 58, this means that the conditional distribution of μ given (μ_1, t) depends only on μ_1. In other words, to generate μ given t, we can simply compute μ_1 and generate μ given μ_1. By the Coding Lemma, we can generate μ = g(μ_1, v) as a function of μ_1 and an independent uniform random variable v on [0, 1].

Now, recall that, given μ, the sequence ((t_ℓ, s_ℓ))_{ℓ≥1} is i.i.d. with the distribution μ so, given t_ℓ and μ, we can simply generate s_ℓ from the conditional distribution μ( s_ℓ ∈ · | t_ℓ ), independently of each other. Again, using the Coding Lemma, we can generate s_ℓ = h(μ, t_ℓ, x_ℓ) as a function of μ, t_ℓ and i.i.d. uniform random variables x_ℓ on [0, 1]. Finally, recalling that μ = g(μ_1, v), we can write

s_ℓ = h( g(μ_1, v), t_ℓ, x_ℓ ) =: f( μ_1, t_ℓ, v, x_ℓ ),

which finishes the proof. □
Proof of Theorem 84. Let us, for convenience, index the array s_{ℓ,ℓ′} by ℓ ≥ 1 and ℓ′ ∈ Z instead of ℓ′ ≥ 1. Let us denote by

X_{ℓ′} = ( s_{ℓ,ℓ′} )_{ℓ≥1} and X = ( X_{ℓ′} )_{ℓ′≤0}

the ℓ′-th column and the left half of this array. Since the sequence of columns (X_{ℓ′})_{ℓ′∈Z} is exchangeable, we showed in the proof of de Finetti's theorem that, conditionally on X, the columns (X_{ℓ′})_{ℓ′≥1} in the right half of the array are i.i.d. If we describe the distribution of one column X_1 given X, then we can generate all columns (X_{ℓ′})_{ℓ′≥1} independently from this distribution. Therefore, our strategy will be to describe the distribution of X_1 given X, and then combine it with the structure of the distribution of X. Both steps will use exchangeability with respect to permutations of rows, because so far we have only used exchangeability with respect to permutations of columns. Let

Y_ℓ = ( s_{ℓ,ℓ′} )_{ℓ′≤0}

be the ℓ-th row of the left half of the array.
Since the sequence of pairs ((Y_ℓ, s_{ℓ,1}))_{ℓ≥1} is exchangeable, Lemma 60 allows us to generate the first column as

s_{ℓ,1} = f( μ_1, Y_ℓ, v_1, x_{ℓ,1} ),

where μ_1 is the empirical measure of (Y_ℓ) and where, instead of v and (x_ℓ), we wrote v_1 and (x_{ℓ,1}) to emphasize the first column index 1. Since, conditionally on X, the columns (X_{ℓ′})_{ℓ′≥1} in the right half of the array are i.i.d., we can generate

s_{ℓ,ℓ′} = f( μ_1, Y_ℓ, v_{ℓ′}, x_{ℓ,ℓ′} ),

where v_{ℓ′} and x_{ℓ,ℓ′} are i.i.d. uniform random variables on [0, 1]. Finally, since (Y_ℓ)_{ℓ≥1} are i.i.d. given the empirical distribution μ_1, we can generate μ_1 = h(w) as a function of a uniform random variable w on [0, 1] and then, using the Coding Lemma, generate Y_ℓ = Y(μ_1, u_ℓ) = Y(h(w), u_ℓ) as a function of μ_1 and i.i.d. uniform random variables u_ℓ on [0, 1]. Plugging these into f above gives s_{ℓ,ℓ′} = σ(w, u_ℓ, v_{ℓ′}, x_{ℓ,ℓ′}) for some function σ, which finishes the proof. □
Exercise. Give a similar proof of Theorem 84, using only global symmetry considerations. Hints: Since the row and column indices play different roles in this case, the first step will be slightly different (the second step will be essentially the same). One has to consider two sequences indexed by j ∈ I,

S_j^1 = ( (s_{ℓ,j})_{ℓ∈N\I}, (s_{ℓ,ℓ′})_{ℓ,ℓ′∈N\I} ) and S_j^2 = ( (s_{j,ℓ})_{ℓ∈N\I}, (s_{ℓ,ℓ′})_{ℓ,ℓ′∈N\I} ).

Notice that these sequences are not independent. In this case one needs to prove (as a part of the exercise) the following modification of de Finetti's representation, for two sequences (s_ℓ^1)_{ℓ≥1} and (s_ℓ^2)_{ℓ≥1} that are separately exchangeable, i.e. such that

( (s_{π(ℓ)}^1)_{ℓ≥1}, (s_{ρ(ℓ)}^2)_{ℓ≥1} ) =^d ( (s_ℓ^1)_{ℓ≥1}, (s_ℓ^2)_{ℓ≥1} )

for any permutations π and ρ of finitely many indices.

Theorem 86 If the sequences (s_ℓ^1)_{ℓ≥1} and (s_ℓ^2)_{ℓ≥1} are separately exchangeable then there exist measurable functions g_1, g_2 : [0,1]² → R such that

( (s_ℓ^1)_{ℓ≥1}, (s_ℓ^2)_{ℓ≥1} ) =^d ( (g_1(w, u_ℓ))_{ℓ≥1}, (g_2(w, v_ℓ))_{ℓ≥1} ),  (27.0.23)

where w, (u_ℓ) and (v_ℓ) are i.i.d. random variables uniform on [0, 1].
Exercise. Consider two infinite arrays (s_{ℓ,ℓ′}^1)_{ℓ,ℓ′≥1} and (s_{ℓ,ℓ′}^2)_{ℓ,ℓ′≥1} that are separately exchangeable in the first coordinate and jointly exchangeable in the second coordinate, that is,

( (s_{π_1(ℓ),ρ(ℓ′)}^1)_{ℓ,ℓ′≥1}, (s_{π_2(ℓ),ρ(ℓ′)}^2)_{ℓ,ℓ′≥1} ) =^d ( (s_{ℓ,ℓ′}^1)_{ℓ,ℓ′≥1}, (s_{ℓ,ℓ′}^2)_{ℓ,ℓ′≥1} )

for any permutations π_1, π_2, ρ of finitely many coordinates. Show that there exist two functions σ_1, σ_2 such that these arrays can be generated in distribution by

s_{ℓ,ℓ′}^1 = σ_1( w, u_ℓ^1, v_{ℓ′}, x_{ℓ,ℓ′}^1 ) and s_{ℓ,ℓ′}^2 = σ_2( w, u_ℓ^2, v_{ℓ′}, x_{ℓ,ℓ′}^2 ),

where all the arguments are i.i.d. uniform random variables on [0, 1].
Let us consider an infinite symmetric random array R = (R_{ℓ,ℓ′})_{ℓ,ℓ′≥1} which is weakly exchangeable in the sense defined in (27.0.12), i.e. for any permutation π of finitely many indices we have the equality in distribution

( R_{π(ℓ),π(ℓ′)} )_{ℓ,ℓ′≥1} =^d ( R_{ℓ,ℓ′} )_{ℓ,ℓ′≥1}.  (28.0.1)

In addition, suppose that R is positive definite with probability one, where by positive definite we will always mean non-negative definite. Such weakly exchangeable positive definite arrays are called Gram-de Finetti arrays. It turns out that all such arrays are generated essentially as the covariance matrix of an i.i.d. sample from a random measure on a Hilbert space. Let H be the Hilbert space L²([0,1], dv), where dv denotes the Lebesgue measure on [0, 1].

Theorem 87 (Dovbysh-Sudakov) There exists a random probability measure η on H × R_+ such that the array R = (R_{ℓ,ℓ′})_{ℓ,ℓ′≥1} is equal in distribution to

( h_ℓ · h_{ℓ′} + a_ℓ δ_{ℓ,ℓ′} )_{ℓ,ℓ′≥1},  (28.0.2)

where, conditionally on η, ((h_ℓ, a_ℓ))_{ℓ≥1} is a sequence of i.i.d. random variables with the distribution η, and h · h′ denotes the scalar product on H.

In particular, the marginal G of η on H can be used to generate the sequence (h_ℓ) and the off-diagonal elements of the array R.
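The shape of (28.0.2) is easy to sanity-check in finite dimensions (our illustration, not part of the notes): a matrix h_ℓ · h_{ℓ′} + a_ℓ δ_{ℓ,ℓ′} with a_ℓ ≥ 0 is non-negative definite, since the quadratic form equals ‖Σ_ℓ x_ℓ h_ℓ‖² + Σ_ℓ a_ℓ x_ℓ².

```python
import random

rng = random.Random(2)
n, d = 5, 3
h = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]   # h_l in R^d
a = [rng.random() for _ in range(n)]                          # a_l >= 0

dot = lambda x, y: sum(xi * yi for xi, yi in zip(x, y))
R = [[dot(h[i], h[j]) + (a[i] if i == j else 0.0) for j in range(n)]
     for i in range(n)]

# x^T R x = |sum_l x_l h_l|^2 + sum_l a_l x_l^2 >= 0 for every x.
for _ in range(100):
    x = [rng.gauss(0, 1) for _ in range(n)]
    q = sum(R[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    assert q >= -1e-9
print("all sampled quadratic forms are non-negative")
```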
Proof of Theorem 87. Since the array R is positive definite, conditionally on R we can generate a Gaussian vector g in R^N with the covariance equal to R. Now, also conditionally on R, let (g_i)_{i≥1} be independent copies of g. If, for each i ≥ 1, we denote the coordinates of g_i by g_{ℓ,i} for ℓ ≥ 1 then, since the array R = (R_{ℓ,ℓ′})_{ℓ,ℓ′≥1} is weakly exchangeable, it should be obvious that the array (g_{ℓ,i})_{ℓ,i≥1} is exchangeable in the sense of (27.0.10). By Theorem 84, there exists a measurable function σ : [0,1]^4 → R such that

( g_{ℓ,i} )_{ℓ,i≥1} =^d ( σ(w, u_ℓ, v_i, x_{ℓ,i}) )_{ℓ,i≥1},  (28.0.3)

where w, (u_ℓ), (v_i), (x_{ℓ,i}) are i.i.d. random variables with the uniform distribution on [0, 1]. By the strong law of large numbers (applied conditionally on R), for any ℓ, ℓ′ ≥ 1,

(1/n) Σ_{i=1}^{n} g_{ℓ,i} g_{ℓ′,i} → R_{ℓ,ℓ′}

almost surely as n → ∞. Similarly, by the strong law of large numbers (now applied conditionally on w and (u_ℓ)_{ℓ≥1}),

(1/n) Σ_{i=1}^{n} σ(w, u_ℓ, v_i, x_{ℓ,i}) σ(w, u_{ℓ′}, v_i, x_{ℓ′,i}) → E′ σ(w, u_ℓ, v_1, x_{ℓ,1}) σ(w, u_{ℓ′}, v_1, x_{ℓ′,1})

almost surely, where E′ denotes the expectation with respect to the random variables v_1 and (x_{ℓ,1})_{ℓ≥1}. Therefore, (28.0.3) implies that

( R_{ℓ,ℓ′} )_{ℓ,ℓ′≥1} =^d ( E′ σ(w, u_ℓ, v_1, x_{ℓ,1}) σ(w, u_{ℓ′}, v_1, x_{ℓ′,1}) )_{ℓ,ℓ′≥1}.  (28.0.4)
If we define

σ^{(1)}(w, u, v) = ∫_0^1 σ(w, u, v, x) dx and σ^{(2)}(w, u, v) = ∫_0^1 σ(w, u, v, x)² dx

then, since x_{ℓ,1} and x_{ℓ′,1} are independent for ℓ ≠ ℓ′, the off-diagonal and diagonal elements on the right-hand side of (28.0.4) are given by

∫_0^1 σ^{(1)}(w, u_ℓ, v) σ^{(1)}(w, u_{ℓ′}, v) dv and ∫_0^1 σ^{(2)}(w, u_ℓ, v) dv

correspondingly. Notice that, for almost all w and u, the function v → σ^{(1)}(w, u, v) is in H = L²([0,1], dv), since

∫_0^1 σ^{(1)}(w, u, v)² dv ≤ ∫_0^1 σ^{(2)}(w, u, v) dv

and, by (28.0.4), the right-hand side is equal in distribution to R_{1,1} < ∞. Therefore, if we denote

h_ℓ = σ^{(1)}(w, u_ℓ, ·), a_ℓ = ∫_0^1 σ^{(2)}(w, u_ℓ, v) dv − h_ℓ · h_ℓ,

then the right-hand side of (28.0.4) coincides with ( h_ℓ · h_{ℓ′} + a_ℓ δ_{ℓ,ℓ′} ) and we can take η to be the conditional distribution of (h_1, a_1) given w, which proves the representation (28.0.2).
Lemma 61 There exists a measurable function η′ = η′(R) of the array (R_{ℓ,ℓ′})_{ℓ,ℓ′≥1} with values in P(H × R_+) such that η′ = η ∘ (U, id)^{−1} almost surely, for some orthogonal operator U on H that depends on the sequence (h_ℓ)_{ℓ≥1}.
Proof. Let us begin by showing that the norms ‖h_ℓ‖ can be reconstructed almost surely from the array R. Consider a sequence (g_ℓ) in H such that g_ℓ · g_{ℓ′} = R_{ℓ,ℓ′} for all ℓ, ℓ′ ≥ 1. In other words, ‖g_ℓ‖² = ‖h_ℓ‖² + a_ℓ and g_ℓ · g_{ℓ′} = h_ℓ · h_{ℓ′} for all ℓ ≠ ℓ′. Without loss of generality, let us assume that g_ℓ = h_ℓ + √(a_ℓ) e_ℓ, where (e_ℓ)_{ℓ≥1} is an orthonormal sequence orthogonal to the closed span of (h_ℓ). If necessary, we identify H with H ⊕ H to choose such a sequence (e_ℓ). Since (h_ℓ) is an i.i.d. sequence from the marginal G of the measure η on H, with probability one, there are elements in the sequence (h_ℓ)_{ℓ≥2} arbitrarily close to h_1 and, therefore, the length of the orthogonal projection of h_1 onto the closed span of (h_ℓ)_{ℓ≥2} is equal to ‖h_1‖. As a result, the length of the orthogonal projection of g_1 onto the closed span of (g_ℓ)_{ℓ≥2} is also equal to ‖h_1‖, and it is obvious that this length is a measurable function of the array g_ℓ · g_{ℓ′} = R_{ℓ,ℓ′}. Similarly, we can reconstruct all the norms ‖h_ℓ‖ as measurable functions of the array R and, thus, all a_ℓ = R_{ℓ,ℓ} − ‖h_ℓ‖². Therefore,

( h_ℓ · h_{ℓ′} )_{ℓ,ℓ′≥1} and ( a_ℓ )_{ℓ≥1}  (28.0.6)

are both measurable functions of the array R. Given the matrix (h_ℓ · h_{ℓ′}), we can find a sequence (x_ℓ) in H isometric to (h_ℓ), for example, by choosing x_ℓ to be in the span of the first ℓ elements of some fixed orthonormal basis. This means that all x_ℓ are measurable functions of R and that there exists an orthogonal operator U = U((h_ℓ)_{ℓ≥1}) on H such that x_ℓ = U h_ℓ for all ℓ ≥ 1.
Since ((h_ℓ, a_ℓ))_{ℓ≥1} is an i.i.d. sequence from the distribution η, by the strong law of large numbers for empirical measures (Varadarajan's theorem in Section 17),

(1/n) Σ_{1≤ℓ≤n} δ_{(x_ℓ, a_ℓ)} → η ∘ (U, id)^{−1} weakly

almost surely. The left-hand side is, obviously, a measurable function of the array R in the space of all probability measures on H × R_+ equipped with the topology of weak convergence and, therefore, as its limit, η′ = η ∘ (U, id)^{−1} is also a measurable function of R. This finishes the proof. □
Exercise. Consider a sequence of pairs (G_N^1, G_N^2)_{N≥1} of random probability measures on {−1, +1}^N that are not necessarily independent. Let (σ^ℓ)_{ℓ≤0} be an i.i.d. sample of replicas from G_N^1 and (σ^ℓ)_{ℓ≥1} be an i.i.d. sample of replicas from G_N^2 (for convenience of notation, we denote both by σ^ℓ but use different sets of indices ℓ). Let R^N = (R_{ℓ,ℓ′}^N)_{ℓ,ℓ′∈Z} be the array of the so-called overlaps

R_{ℓ,ℓ′}^N = (1/N) Σ_{i=1}^{N} σ_i^ℓ σ_i^{ℓ′}.

Suppose that R^N converges in distribution to an array R = (R_{ℓ,ℓ′})_{ℓ,ℓ′∈Z}. Notice that this array is weakly exchangeable,

( R_{π(ℓ),π(ℓ′)} )_{ℓ,ℓ′∈Z} =^d ( R_{ℓ,ℓ′} )_{ℓ,ℓ′∈Z},

but not under all permutations of the integers Z, only under those permutations π that map positive integers into positive and non-positive into non-positive. Prove that there exists a pair of random measures G^1 and G^2 on a separable Hilbert space H (not necessarily independent) such that

( R_{ℓ,ℓ′} )_{ℓ,ℓ′∈Z} =^d ( h_ℓ · h_{ℓ′} )_{ℓ,ℓ′∈Z},

where (h_ℓ)_{ℓ≤0} is an i.i.d. sample from G^1 and (h_ℓ)_{ℓ≥1} is an i.i.d. sample from G^2.
Poisson processes.
In this section we will introduce and review several important properties of Poisson processes on a measurable space
$(S, \mathcal{S})$. In applications, $(S, \mathcal{S})$ is usually some nice space, such as a Euclidean space $\mathbb{R}^n$ with the Borel $\sigma$-algebra.
However, in general, it is enough to require that the diagonal $\{(s, s') : s = s'\}$ is measurable in the product space $S \times S$
which, in particular, implies that every singleton set $\{s\}$ in $S$ is measurable as a section of the diagonal. This condition
is needed to be able to write $P(X = Y)$ for a pair $(X, Y)$ of random variables defined on the product space. From now
on, every time we consider a measurable space we will assume that it satisfies this condition. Let us also notice that
a product $S \times S'$ of two such spaces will also satisfy this condition, since $\{(s_1, s'_1) = (s_2, s'_2)\} = \{s_1 = s_2\} \cap \{s'_1 = s'_2\}$.
Let $\mu$ and $\mu_n$ for $n \ge 1$ be some non-atomic (not having any atoms) measures on $S$ such that
$$\mu = \sum_{n\ge 1} \mu_n, \quad \mu_n(S) < \infty. \qquad (29.0.1)$$
For each $n \ge 1$, let $N_n$ be a random variable with the Poisson distribution $\Pi(\mu_n(S))$ with the mean $\mu_n(S)$, and let
$(X_{nl})_{l\ge 1}$ be i.i.d. random variables, also independent of $N_n$, with the distribution
$$p_n(B) = \frac{\mu_n(B)}{\mu_n(S)}. \qquad (29.0.2)$$
We assume that all these random variables are independent for different $n \ge 1$. The condition that $\mu$ is non-atomic
implies that $P(X_{nl} = X_{mj}) = 0$ if $n \neq m$ or $l \neq j$. Let us consider the random sets
$$\Pi_n = \{X_{n1}, \ldots, X_{nN_n}\} \quad\text{and}\quad \Pi = \bigcup_{n\ge 1} \Pi_n. \qquad (29.0.3)$$
The set $\Pi$ will be called a Poisson process on $S$ with the mean measure $\mu$. Let us point out a simple observation, which
will be used many times below: if, first, we are given the means $\mu_n(S)$ and the distributions (29.0.2) that were used to
generate the set (29.0.3), then the measure $\mu$ in (29.0.1) can be written as
$$\mu = \sum_{n\ge 1} \mu_n(S)\, p_n. \qquad (29.0.4)$$
We will show in Theorem 91 below that, when the measure $\mu$ is $\sigma$-finite, this definition of a Poisson
process is, in some sense, independent of the particular representation (29.0.1). However, for several reasons it is convenient to think
of the above construction as the definition of a Poisson process. First of all, it allows us to avoid any discussion about
what a "random set" means and, moreover, many important properties of Poisson processes follow from it rather
directly. In any case, we will show in Theorem 92 below that all such processes satisfy the traditional definition of a
Poisson process.
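The construction (29.0.3) is straightforward to simulate. Below is a minimal Python sketch; the choice $S = [0, 2)$ with $\mu_1, \mu_2$ equal to the Lebesgue measure on $[0, 1)$ and $[1, 2)$ is an illustrative assumption, and `sample_poisson` uses Knuth's product-of-uniforms method:

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method: count uniform draws until their product falls below e^{-lam}.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def poisson_process(mean_masses, samplers, rng):
    """Generate Pi as in (29.0.3): for each n, draw N_n ~ Poisson(mu_n(S))
    and then N_n i.i.d. points from the normalized distribution p_n."""
    points = []
    for lam, draw in zip(mean_masses, samplers):
        N = sample_poisson(lam, rng)
        points.extend(draw(rng) for _ in range(N))
    return points

# Toy example: mu_1(S) = mu_2(S) = 1, points uniform on [0,1) and [1,2).
rng = random.Random(0)
pi = poisson_process([1.0, 1.0],
                     [lambda r: r.random(), lambda r: 1.0 + r.random()],
                     rng)

# Monte Carlo check of (29.0.4): E|Pi| = mu(S) = 1 + 1 = 2.
mean_total = sum(
    len(poisson_process([1.0, 1.0],
                        [lambda r: r.random(), lambda r: 1.0 + r.random()],
                        rng))
    for _ in range(20000)) / 20000
```

The list `pi` is one realization of the random set $\Pi$; averaging the total number of points over many realizations recovers $\mu(S)$, in line with (29.0.4).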
One important immediate consequence of the definition in (29.0.3) is the Mapping Theorem. Given a Poisson
process $\Pi$ on $S$ with the mean measure $\mu$ and a measurable map $f : S \to S'$ into another measurable space $(S', \mathcal{S}')$, let
us consider the image set
$$f(\Pi) = \bigcup_{n\ge 1} \{f(X_{n1}), \ldots, f(X_{nN_n})\}.$$
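The standard Mapping Theorem fact, that $f(\Pi)$ is a Poisson process with the mean measure $\mu \circ f^{-1}$ when $f$ does not glue distinct points together, can be probed by simulation. A hedged Python sketch; the map $f(x) = x^2$, the intensity, and the interval used for the check are illustrative assumptions:

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method for sampling a Poisson(lam) random variable.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def mapped_process(lam, f, rng):
    # Poisson process with mean measure lam * Lebesgue on [0,1), then apply f.
    N = sample_poisson(lam, rng)
    return [f(rng.random()) for _ in range(N)]

# f(x) = x^2 pushes the mean measure forward: the expected number of image
# points in [0, t] is lam * sqrt(t), since f^{-1}([0, t]) = [0, sqrt(t)].
rng = random.Random(1)
mean_count = sum(
    sum(1 for y in mapped_process(3.0, lambda x: x * x, rng) if y <= 0.25)
    for _ in range(20000)) / 20000
# mean_count should be close to 3 * sqrt(0.25) = 1.5.
```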
and $\mu := \sum_{m\ge 1} \mu_m = \sum_{l\ge 1} \mu_{m(l)n(l)}$ for any enumeration of the pairs $(m, n)$ by the indices $l \in \mathbb{N}$, we immediately get
the following.
Theorem 89 (Superposition Theorem) If $\Pi_m$ for $m \ge 1$ are independent Poisson processes with the mean measures
$\mu_m$, then their superposition $\Pi = \bigcup_{m\ge 1} \Pi_m$ is a Poisson process with the mean measure $\mu = \sum_{m\ge 1} \mu_m$.
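A quick numerical sanity check of the Superposition Theorem; the intensities 1.0 and 2.0 on $[0, 1)$ are illustrative assumptions. The total number of points of the union should be Poisson with the summed mean, so its sample mean and variance should both be close to 3:

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method for sampling a Poisson(lam) random variable.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def uniform_process(lam, rng):
    # Poisson process with mean measure lam * Lebesgue on [0,1).
    return [rng.random() for _ in range(sample_poisson(lam, rng))]

rng = random.Random(7)
totals = []
for _ in range(30000):
    union = uniform_process(1.0, rng) + uniform_process(2.0, rng)  # superposition
    totals.append(len(union))
mean = sum(totals) / len(totals)
var = sum((t - mean) ** 2 for t in totals) / len(totals)
# For a Poisson variable the mean and variance agree; here both should be near 3.
```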
Another important property of Poisson processes is the Marking Theorem. Consider another measurable space $(S', \mathcal{S}')$
and let $K : S \times \mathcal{S}' \to [0, 1]$ be a transition function (a probability kernel), which means that for each $s \in S$, $K(s, \cdot)$ is
a probability measure on $\mathcal{S}'$ and, for each $A \in \mathcal{S}'$, $K(\cdot, A)$ is a measurable function on $(S, \mathcal{S})$. For each point $s \in \Pi$,
let us generate a point $m(s) \in S'$ from the distribution $K(s, \cdot)$, independently for different points $s$. The point $m(s)$ is
called a marking of $s$, and the random subset of $S \times S'$,
$$\Pi^\bullet = \{(s, m(s)) : s \in \Pi\}, \qquad (29.0.5)$$
is called a marked Poisson process. In other words, $\Pi^\bullet = \bigcup_{n\ge 1} \Pi^\bullet_n$, where
$$\Pi^\bullet_n = \{(X_{n1}, m(X_{n1})), \ldots, (X_{nN_n}, m(X_{nN_n}))\}.$$
Since the distribution $p^\bullet_n(C) = \iint_C K(s, ds')\, p_n(ds)$ of the marked points is obviously non-atomic, by definition, this means that $\Pi^\bullet$ is a Poisson process on $S \times S'$ with the mean
measure
$$\sum_{n\ge 1} \mu_n(S)\, p^\bullet_n(C) = \sum_{n\ge 1} \iint_C K(s, ds')\, \mu_n(ds) = \iint_C K(s, ds')\, \mu(ds).$$
Therefore, the following holds.
Theorem 90 (Marking Theorem) The random subset $\Pi^\bullet$ in (29.0.5) is a Poisson process on $S \times S'$ with the mean
measure
$$\mu^\bullet(C) = \iint_C K(s, ds')\, \mu(ds). \qquad (29.0.6)$$
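A hedged Python sketch of marking; the kernel $K(s, \cdot) = \mathrm{Bernoulli}(s)$ on marks $\{0, 1\}$ and the intensity are illustrative assumptions. By (29.0.6), the expected number of points with mark $1$ is $\int_0^1 K(s, \{1\})\, \lambda\, ds = \lambda / 2$:

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method for sampling a Poisson(lam) random variable.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def marked_process(lam, rng):
    """Poisson process with intensity lam on [0,1); each point s receives a
    mark in {0,1} drawn from the kernel K(s, .) = Bernoulli(s)."""
    N = sample_poisson(lam, rng)
    pts = [rng.random() for _ in range(N)]
    return [(s, 1 if rng.random() < s else 0) for s in pts]

rng = random.Random(5)
lam = 4.0
# Average number of mark-1 points; should be near lam * int_0^1 s ds = 2.0.
mean_marked = sum(
    sum(m for _, m in marked_process(lam, rng))
    for _ in range(20000)) / 20000
```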
The above results give us several ways to generate a new Poisson process from an old one. However, in each case,
the new process is generated in a way that depends on a particular representation of its mean measure. We will now
show that, when the measure $\mu$ is $\sigma$-finite, the random set $\Pi$ can be generated in a way that, in some sense, does not
depend on the particular representation (29.0.1). This point is very important because, often, given a Poisson process
with one representation of its mean measure, we study its properties using another, more convenient, representation.
Suppose that $S$ is equal to a disjoint union $\bigcup_{m\ge 1} S_m$ of sets such that $0 < \mu(S_m) < \infty$, in which case
$$\mu = \sum_{m\ge 1} \mu|_{S_m} \qquad (29.0.7)$$
is another representation of the type (29.0.1), where $\mu|_{S_m}$ is the restriction of $\mu$ to the set $S_m$. The following holds.
For each $m \ge 1$, define the distribution
$$p_{S_m}(B) = \frac{\mu|_{S_m}(B)}{\mu|_{S_m}(S)} = \frac{\mu(B \cap S_m)}{\mu(S_m)}. \qquad (29.0.8)$$
We will show that, given the Poisson process $\Pi$ in (29.0.3) generated according to the representation (29.0.1), the
cardinalities $N(S_m) = |\Pi \cap S_m|$ are independent random variables with the distributions $\Pi(\mu(S_m))$ and, conditionally
on $(N(S_m))_{m\ge 1}$, each set $\Pi \cap S_m$ looks like an i.i.d. sample of size $N(S_m)$ from the distribution $p_{S_m}$. This means that,
if we arrange the points in $\Pi \cap S_m$ in a random order, the resulting vector has the same distribution as an i.i.d. sample
from the measure $p_{S_m}$. Moreover, conditionally on $(N(S_m))_{m\ge 1}$, these vectors are independent over $m \ge 1$. This is
exactly how one would generate a Poisson process using the representation (29.0.7). The proof of Theorem 91 will be
based on the following two properties, which usually appear as the definition of a Poisson process on $S$ with the mean
measure $\mu$:
(i) for any $A \in \mathcal{S}$, the cardinality $N(A) = |\Pi \cap A|$ has the Poisson distribution $\Pi(\mu(A))$ with the mean $\mu(A)$;
(ii) the cardinalities $N(A_1), \ldots, N(A_k)$ are independent, for any $k \ge 1$ and any disjoint sets $A_1, \ldots, A_k \in \mathcal{S}$.
When $\mu(A) = \infty$, it is understood that $\Pi \cap A$ is countably infinite. Usually, one starts with (i) and (ii) as the definition of
a Poisson process, and the set constructed in (29.0.3) is used to demonstrate the existence of such processes, which
explains the name of the following theorem.
Theorem 92 (Existence Theorem) The process $\Pi$ in (29.0.3) satisfies the properties (i) and (ii).
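Properties (i) and (ii) are easy to probe numerically. A hedged Python sketch; the intensity, the disjoint sets $A_1 = [0, 0.3)$ and $A_2 = [0.3, 1)$, and the trial count are illustrative assumptions. The counts on disjoint sets should have the right Poisson means and, being independent, should be uncorrelated:

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method for sampling a Poisson(lam) random variable.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(11)
lam, trials = 5.0, 30000
n1s, n2s = [], []
for _ in range(trials):
    # Poisson process with mean measure lam * Lebesgue on [0,1).
    pts = [rng.random() for _ in range(sample_poisson(lam, rng))]
    n1s.append(sum(1 for x in pts if x < 0.3))    # N(A_1)
    n2s.append(sum(1 for x in pts if x >= 0.3))   # N(A_2)
m1 = sum(n1s) / trials          # should be near lam * 0.3 = 1.5
m2 = sum(n2s) / trials          # should be near lam * 0.7 = 3.5
cov = sum(a * b for a, b in zip(n1s, n2s)) / trials - m1 * m2  # near 0
```

Zero covariance is of course weaker than the independence asserted in (ii), but it is the part that a cheap Monte Carlo check can see.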
Proof. Given $x \ge 0$, let us denote the weights of the Poisson distribution $\Pi(x)$ with the mean $x$ by
$$\pi_j(x) = \Pi(x)(\{j\}) = e^{-x} \frac{x^j}{j!}.$$
Consider disjoint sets $A_1, \ldots, A_k$ and let $A_0 = (\bigcup_{i\le k} A_i)^c$ be the complement of their union. Fix $m_1, \ldots, m_k \ge 0$ and let
$m = m_1 + \cdots + m_k$. Given any set $A$, denote $N_n(A) = |\Pi_n \cap A|$. With this notation, let us compute the probability of the
event
$$\Omega = \{N_n(A_1) = m_1, \ldots, N_n(A_k) = m_k\}.$$
Recall that the random variables $(X_{nl})_{l\ge 1}$ are i.i.d. with the distribution $p_n$ defined in (29.0.2) and, therefore, condi-
tionally on $N_n$ in (29.0.3), the cardinalities $(N_n(A_l))_{0\le l\le k}$ have a multinomial distribution, and we can write
$$P(\Omega) = \sum_{j\ge 0} P(\Omega \mid N_n = m + j)\, P(N_n = m + j)
= \sum_{j\ge 0} P\big(N_n(A_0) = j,\, \Omega \mid N_n = m + j\big)\, \pi_{m+j}(\mu_n(S)).$$
If $\mu(A) < \infty$, this shows that $N(A)$ has the distribution $\Pi(\mu(A))$. If $\mu(A) = \infty$ then $P(N(A) \le r) = 0$ for all $r \ge 0$,
which implies that $N(A)$ is countably infinite. ⊓⊔
Proof of Theorem 91. First, suppose that $\mu(S) < \infty$ and suppose that the Poisson process $\Pi$ was generated as in
(29.0.3) using some sequence of measures $(\mu_n)$. By Theorem 92, the cardinality $N = N(S)$ has the distribution
$\Pi(\mu(S))$. Let us show that, conditionally on the event $\{N = n\}$, the set $\Pi$ looks like an i.i.d. sample of size $n$
from the distribution $p(\cdot) = \mu(\cdot)/\mu(S)$ or, more precisely, if we randomly assign labels $\{1, \ldots, n\}$ to the points in $\Pi$,
the resulting vector $(X_1, \ldots, X_n)$ has the same distribution as an i.i.d. sample from $p$. Let us consider some measurable
sets $B_1, \ldots, B_n \in \mathcal{S}$ and compute
$$P_n\big(X_1 \in B_1, \ldots, X_n \in B_n\big) = P\big(X_1 \in B_1, \ldots, X_n \in B_n \mid N = n\big), \qquad (29.0.9)$$
where we denoted by $P_n$ the conditional probability given the event $\{N = n\}$. To use Theorem 92, we will reduce this
case to the case of disjoint sets as follows. Let $A_1, \ldots, A_k$ be the partition of the space $S$ generated by the sets $B_1, \ldots, B_n$,
obtained by taking intersections of the sets $B_l$ or their complements over $l \le n$. Then, for each $l \le n$, we can write
$$I(x \in B_l) = \sum_{j\le k} \Delta_{lj}\, I(x \in A_j)$$
for some $\Delta_{lj} \in \{0, 1\}$.
If $n_j = |I_j|$ then the last event can be expressed in words by saying that, for each $j \le k$, we observe $n_j$ points of the
random set $\Pi$ in the set $A_j$ and then assign the labels in $I_j$ to the points in $A_j$. By Theorem 92, the probability to observe
$n_j$ points in each set $A_j$, given that $N = n$, equals
$$P_n\big(N(A_j) = n_j,\, j \le k\big) = \frac{P\big(N(A_j) = n_j,\, j \le k\big)}{P(N = n)}
= \prod_{j\le k} \frac{\mu(A_j)^{n_j}}{n_j!} e^{-\mu(A_j)} \Big/ \frac{\mu(S)^n}{n!} e^{-\mu(S)},$$
while the probability to randomly assign the labels in $I_j$ to the points in $A_j$ for all $j \le k$ is equal to $\prod_{j\le k} n_j! / n!$.
Therefore,
$$P_n\big(X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n}\big) = \prod_{j\le k} \Big(\frac{\mu(A_j)}{\mu(S)}\Big)^{n_j} = \prod_{l\le n} p(A_{j_l}). \qquad (29.0.11)$$
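The conclusion of this argument, that conditionally on $N = n$ the labeled points form an i.i.d. sample from $p$, can also be checked by simulation. A hedged Python sketch; the uniform mean measure on $[0, 1)$ and the case $n = 2$ are illustrative assumptions. For an i.i.d. uniform pair, the expected minimum and maximum are $1/3$ and $2/3$:

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method for sampling a Poisson(lam) random variable.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(3)
mins, maxs = [], []
while len(mins) < 20000:
    # Poisson process with intensity 2 on [0,1).
    pts = [rng.random() for _ in range(sample_poisson(2.0, rng))]
    if len(pts) == 2:                       # condition on the event N = 2
        mins.append(min(pts))
        maxs.append(max(pts))
avg_min = sum(mins) / len(mins)             # should be near 1/3
avg_max = sum(maxs) / len(maxs)             # should be near 2/3
```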