
ST3453: Stochastic Models in Space and Time I

Lecturer: Jason Wyse
LaTeX: James O'Donnell

December 3, 2014

Contents

1 Examples of Stochastic Processes
1.1 Importance
1.2 Gambler's Ruin
1.3 Social Mobility
1.4 Fancy a Drink Tonight?
1.5 Modelling Evolutionary Divergence

2 The Markov Property and Markov Chains
2.1 Definition of Markov Chain
2.2 The Markov Property
2.3 Transition Probabilities and the Transition Matrix
2.4 Multistep Transition Probabilities

3 Properties of Markov Chains
3.1 Decomposability
3.2 Periodicity
3.3 Stability
3.4 Long-Run Regularity
3.5 Computing Stable Distributions
3.6 Detailed Balance

4 Poisson Processes
4.1 Assumptions of the Poisson Process
4.2 Probability Law of Poisson Process
4.3 Moments of the Poisson Distribution
4.4 Times of First Arrival
4.5 Memoryless Property of the Exponential Distribution
4.6 Time to Occurrence of rth Event
4.7 Summary of Inter-Arrival Times
4.8 General Poisson Process
4.9 Compound Poisson Processes

5 Some Continuous Time Processes
5.1 Brownian Motion
5.2 Gaussian Processes
5.3 Brownian Motion With Drift
5.4 Finance Applications

6 Applications of Stochastic Processes: Bayesian Model Estimation Through Markov Chain Monte Carlo
6.1 Likelihood and Maximum Likelihood
6.2 Prior Distributions
6.3 Posterior Distributions
6.4 Posterior Quantities of Interest
6.5 MCMC: The Key Ideas
6.6 The Gibbs Sampling Algorithm
6.7 The Metropolis-Hastings Algorithm

7 Spatial Processes

A Tutorials
A.1 Tutorial 1
A.2 Tutorial 2

1 Examples of Stochastic Processes

1.1 Importance

Stochastic models comprise some of the most powerful methods available to data analysts
in the description of observed real-life processes. In this module we will study Markov
models. The importance of these models is highlighted by the fact that:
1. a huge number of physical, biological, economic and social phenomena can be modelled naturally using them;
2. there is well developed theory and methods which allow us to do this modelling (in
a correct way).
On a very crude and granular level one could describe Markov models as models which use information observed in the past to give an idea of what to expect in the present.
Let's begin by considering some examples and giving informal notions of some key concepts.

1.2 Gambler's Ruin

Consider playing a game where on any play of the game you win €1 with probability p = 0.4 or alternatively lose €1 with probability 1 − p = 0.6. Suppose that you decide to stop if your fortune reaches €N. If you reach €0 you can't play anymore. On one play of the game the expected winnings are

(€1)p + (−€1)(1 − p) = €(2p − 1) = −€0.20

so that the casino has a margin on the game.


Let x_n be the amount of money you have after n plays of the game. If you play again, you either have x_n + 1 or x_n − 1 after the next play of the game, so x_{n+1} only depends on x_n:

x_{n+1} = x_n + 1 with probability p,   x_{n+1} = x_n − 1 with probability 1 − p.

So x_n has what we call the Markov property. This means that given the current state, x_n, any other information about the past is irrelevant for predicting the next state x_{n+1}. If you are still playing at time n, i.e. 0 < x_n < N and x_n = i, then

P(x_{n+1} = i + 1 | x_n = i, x_{n−1} = i_{n−1}, . . . , x_0 = i_0) = p = 0.4

where i_{n−1}, . . . , i_0 are past values of your fortune. Note the use here of the conditioning event: it is the probability that x_{n+1} = i + 1 conditional on the past fortunes.
Recall that

P(B|A) = P(B ∩ A) / P(A).

We say that x_n is a discrete time Markov chain (indexed by the non-negative integers) with transition probabilities p(i, j) if for any j, i, i_{n−1}, . . . , i_0:

P(x_{n+1} = j | x_n = i, x_{n−1} = i_{n−1}, . . . , x_0 = i_0) = P(x_{n+1} = j | x_n = i) = p(i, j).


The Markov property means that we can forget about the past; only the present is useful for predicting the future. In formulating the p(i, j) above we have assumed they are temporally homogeneous, that is,

p(i, j) = P(x_{n+1} = j | x_n = i)

does not depend on time.
So the transition probabilities really determine the rules of the game. Usually, we put this information in a matrix. For 0 < i < N:

p(i, i + 1) = 0.4,   p(i, i − 1) = 0.6;

for i = 0 and i = N:

p(0, 0) = 1,   p(N, N) = 1;

and p(i, i ± k) = 0 for k > 1.

If we stop playing when we reach €5 (N = 5), the transition matrix on the states {0, 1, 2, 3, 4, 5} is

P = [ 1    0    0    0    0    0
      .6   0   .4    0    0    0
      0   .6    0   .4    0    0
      0    0   .6    0   .4    0
      0    0    0   .6    0   .4
      0    0    0    0    0    1 ]
This matrix could also be represented pictorially using a graph diagram. [DIAGRAM]
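As a quick illustration (my addition, not part of the original notes), here is a minimal Python sketch that builds this transition matrix and simulates the gambler's ruin chain, with p = 0.4 and N = 5 as above.

import numpy as np

def gamblers_ruin_matrix(N=5, p=0.4):
    """One-step transition matrix for gambler's ruin on states 0..N."""
    P = np.zeros((N + 1, N + 1))
    P[0, 0] = 1.0            # ruined: absorbing state
    P[N, N] = 1.0            # target fortune reached: absorbing state
    for i in range(1, N):
        P[i, i + 1] = p      # win EUR 1
        P[i, i - 1] = 1 - p  # lose EUR 1
    return P

def simulate(P, start, n_steps, rng):
    """Simulate one trajectory of the chain from `start`."""
    path = [start]
    for _ in range(n_steps):
        path.append(rng.choice(len(P), p=P[path[-1]]))
    return path

rng = np.random.default_rng(1)
P = gamblers_ruin_matrix()
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution
print(simulate(P, start=2, n_steps=20, rng=rng))

Running the simulation repeatedly shows the chain absorbing at 0 far more often than at 5, reflecting the casino's margin.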

1.3

Social Mobility

Suppose xn is a familys social class in the nth generation, assuming this to be either 1 =
lower class, 2 =middle, 3 =upper. In this very simplified version of sociology changes
of status are a Markov chain with transition matrix:
2
3
.7 .2 .1
p = 4.3 .5 .25
.2 .4 .4
The graph diagram is as follows:
DIAGRAM
Critiques of this simple model of social mobility:
1. limited as there are only three states;
2. time (temporal) homogeneity we may expect probabilities to change with time;
3. model doesnt learn from the external environment (economy, etc.).
Nonetheless, using this simple model we can ask (and later answer) questions like: does the proportion of families in the three classes approach a limit?
Note that in any transition matrix P the sum of each row is 1,

Σ_j p(i, j) = 1   (a stochastic matrix),

and also p(i, j) ≥ 0, as the entries are all conditional probabilities.
If we consider a family in the middle class in generation n, x_n = 2, then

P(x_{n+1} = 1 | x_n = 2) + P(x_{n+1} = 2 | x_n = 2) + P(x_{n+1} = 3 | x_n = 2) = 1

since the family will either stay middle class or move in the next generation.

1.4 Fancy a Drink Tonight?

Notorious behaviour in the first night of college. Let x_0 be yesterday's state, with states

1 = quiet night,  2 = had a few,  3 = absolute mad one,

and two transition matrices, p_today and p_tomorrow, governing the transition into today and into tomorrow respectively. The two matrices differ, so this is a time inhomogeneous Markov chain: for instance

p_today(1, 1) = .2 = p_tomorrow(1, 1),

but

p_today(2, 2) = .4 while p_tomorrow(2, 2) = .5.

1.5 Modelling Evolutionary Divergence

The fundamental description of all living organisms is their genome sequence. This is a string of 4 characters:

A = adenine, C = cytosine, G = guanine, T = thymine.

In DNA terminology these are the bases. DNA is a double-stranded helix (Watson & Crick), with complementary base pairs: A with T, C with G.
E. coli: 4.6 × 10^6 base pairs. Homo sapiens: 3 × 10^9 base pairs.
Briefly, evolution of organisms occurs because of mutations in these base pairs, which amount to copying errors when DNA replicates. Looking at mutations is the key when talking about evolution. Modern man started our divergence from the apes about 5 to 6 million years ago.
Modern evolutionary models are largely based on Markov processes in continuous time. At any site on the genome we have a stochastic variable x(t) (t is time) taking one of the following values:

{1, 2, 3, 4} ↔ {A, C, T, G}.

The Markov models say

P(x(t + s) = j | x(s) = i) = P(x(t) = j | x(0) = i) = p_{i,j}(t)

so that the probabilities have the Markov property. The transition matrix is

p = [ P(A|A, t)  P(C|A, t)  P(G|A, t)  P(T|A, t)
         .          .          .          .
      P(A|T, t)  P(C|T, t)  P(G|T, t)  P(T|T, t) ]

Using some further assumptions on the structure of the probabilities we obtain the Jukes-Cantor model for base substitution:

p_{i,i}(t) = (1/4)(1 + 3 e^{−4αt}),
p_{i,j}(t) = (1/4)(1 − e^{−4αt})   for j ≠ i,

for a parameter α to be estimated from data.

2 The Markov Property and Markov Chains

(Discrete time, finite number of states)

2.1 Definition of Markov Chain

Consider a stochastic process {x_n, n = 0, 1, 2, . . .} that can take on a finite number of values. Let these values be denoted by the set {1, 2, . . . , k}. The process is in state i at time n if x_n = i. Since the time index n is discrete, we say that x_n is a discrete time process. Since x_n can take a finite number of values, we say it is a finite state process.
If we assume there is a fixed probability p(i, j) that the process will be in state j at time n + 1, given it is in state i at time n:

P(x_{n+1} = j | x_n = i, x_{n−1} = i_{n−1}, . . . , x_0 = i_0) = p(i, j)

for all states i_0, i_1, . . . , i_{n−1}, i, j and for all n ≥ 0. If this is the case, then we say that x_n is a Markov process.

2.2 The Markov Property

If x_n is a Markov chain then it has the Markov property.
This says that the conditional distribution of any future state x_{n+1}, given all past states x_0, x_1, . . . , x_n, depends only on x_n. It is independent of the states x_0, . . . , x_{n−1}, i.e.,

P(x_{n+1} | x_n, x_{n−1}, . . . , x_0) = P(x_{n+1} | x_n).

2.3 Transition Probabilities and the Transition Matrix

The one-step transition probabilities p(i, j) give the probability that the chain x_n goes from state i (x_n = i) to state j in one step (x_{n+1} = j).
As the p(i, j)'s are probabilities, it is clear that

p(i, j) ≥ 0   for all 1 ≤ i, j ≤ k,

and since the chain either stays where it is or transitions to a different state,

Σ_{j=1}^{k} p(i, j) = 1.

Let P denote the matrix of one-step transition probabilities

P = [ p(1, 1)  p(1, 2)  . . .  p(1, k)
      p(2, 1)  p(2, 2)  . . .  p(2, k)
         .        .      .        .
      p(k, 1)  p(k, 2)  . . .  p(k, k) ]

Then we refer to P as the one-step transition matrix of the Markov chain {x_n, n ≥ 0}.

Example:
Let {X_t, t ≥ 1} be independent identically distributed (iid). (Recall that this means:

for all t, E(X_t) = μ ∈ ℝ,
for all t, Var(X_t) = σ² ∈ ℝ,
for all t ≠ k, Cov(X_t, X_k) = 0,)

with P(X_t = ℓ) = a_ℓ, ℓ = 0, 1, 2, . . .
Now suppose S_0 = 0, S_n = Σ_{t=1}^{n} X_t.

Exercise: Show that S_n is a Markov chain (MC).

P(S_{n+1} = j | S_n = i, S_{n−1} = i_{n−1}, . . . , S_0 = 0)
  = P(S_n + X_{n+1} = j | S_n = i, S_{n−1} = i_{n−1}, . . . , S_0 = 0)
  = P(X_{n+1} = j − i | S_n = i, S_{n−1} = i_{n−1}, . . . , S_0 = 0)
  = P(X_{n+1} = j − i)   (iid)
  = a_{j−i}

So S_n satisfies the Markov property. The process S_n is called a random walk.
End Example

A simple random walk (SRW) is a process {S_n, n ≥ 0}, S_0 = 0, where

S_n = Σ_{t=1}^{n} X_t,

with the X_t iid and

P(X_t = 1) = p,   P(X_t = −1) = q = 1 − p,

for 0 < p < 1.
One can show that |S_n| (the distance of the SRW from the origin) is a Markov process.

Consider P(S_n = i | |S_n| = i, |S_{n−1}| = i_{n−1}, . . . , |S_1| = i_1) and let i_0 = 0. Let j = max{k : 0 ≤ k ≤ n, i_k = 0}, implying that S_j = 0. Since we know S_j = 0:

P(S_n = i | |S_n| = i, |S_{n−1}| = i_{n−1}, . . . , |S_1| = i_1) = P(S_n = i | |S_n| = i, |S_{n−1}| = i_{n−1}, . . . , |S_j| = 0).

There are two possible values of the sequence S_{j+1}, . . . , S_n for which |S_{j+1}| = i_{j+1}, . . . , |S_n| = i. Since the process does not cross zero between times j + 1 and n, these are i_{j+1}, . . . , i and −i_{j+1}, . . . , −i.
Assume i_{j+1} > 0. To obtain the first sequence we note that in the n − j steps there are i more up steps (+1) than down steps (−1). Let d be the number of down steps. Then

(d + i) + d = n − j   ⟹   d = (n − j − i)/2,

where d + i is the number of up steps. So the probability of this sequence is

p^{(n−j−i)/2 + i} q^{(n−j−i)/2} = p^{(n−j+i)/2} q^{(n−j−i)/2}.

Similarly, the second sequence has probability

p^{(n−j−i)/2} q^{(n−j+i)/2}.

Thus,

P(S_n = i | |S_n| = i, . . . , |S_{j+1}| = i_{j+1})
  = p^{(n−j+i)/2} q^{(n−j−i)/2} / ( p^{(n−j+i)/2} q^{(n−j−i)/2} + p^{(n−j−i)/2} q^{(n−j+i)/2} )
  = p^i / (p^i + q^i).

Similarly,

P(S_n = −i | |S_n| = i, . . . , |S_{j+1}| = i_{j+1}) = q^i / (p^i + q^i).

From this, conditioning on whether S_n = i or S_n = −i:

P(|S_{n+1}| = i + 1 | |S_n| = i, |S_{n−1}| = i_{n−1}, . . . , |S_1| = i_1)
  = P(S_{n+1} = i + 1 | S_n = i) P(S_n = i | |S_n| = i, . . . , |S_1|)
    + P(S_{n+1} = −(i + 1) | S_n = −i) P(S_n = −i | |S_n| = i, . . . , |S_1|)
  = p · p^i/(p^i + q^i) + q · q^i/(p^i + q^i)
  = (p^{i+1} + q^{i+1}) / (p^i + q^i).

So {|S_n|, n ≥ 1} is a Markov chain, with transition probabilities

p(i, i + 1) = (p^{i+1} + q^{i+1}) / (p^i + q^i),
p(i, i − 1) = ( p^i (1 − p) + q^i (1 − q) ) / (p^i + q^i),   for all i > 0,
p(0, 1) = 1.

2.4 Multistep Transition Probabilities

The probability p(i, j) = P(x_{n+1} = j | x_n = i) gives the probability of going from state i to state j in one step. How do we compute the probability of going from i to j in m steps,

p^m(i, j) = P(x_{n+m} = j | x_n = i)?

Recall the social mobility example with one-step transition matrix

p = [ .7  .2  .1
      .3  .5  .2
      .2  .4  .4 ]

where 1 = lower, 2 = middle, 3 = upper.
If my grandmother was upper class (state 3) and my parents were middle class (state 2), what is the probability that I will be lower class (state 1)?
The Markov property tells us that the probability of this is

p(3, 2) p(2, 1) = (.4)(.3) = .12.
Let's convince ourselves of this:

P(x_2 = 1, x_1 = 2 | x_0 = 3) = P(x_2 = 1, x_1 = 2, x_0 = 3) / P(x_0 = 3)
  = [ P(x_2 = 1, x_1 = 2, x_0 = 3) / P(x_1 = 2, x_0 = 3) ] · [ P(x_1 = 2, x_0 = 3) / P(x_0 = 3) ]
  = P(x_2 = 1 | x_1 = 2, x_0 = 3) P(x_1 = 2 | x_0 = 3)
  = P(x_2 = 1 | x_1 = 2) P(x_1 = 2 | x_0 = 3)   (Markov, so drop x_0 = 3)
  = p(2, 1) p(3, 2)
  = p(3, 2) p(2, 1).
If my parents were middle class, what is the probability that my children will be upper class? To answer this, we have to consider the three possible classes which I could have:

P(x_2 = 3 | x_0 = 2) = Σ_{ℓ=1}^{3} P(x_2 = 3, x_1 = ℓ | x_0 = 2)
  = Σ_{ℓ=1}^{3} p(2, ℓ) p(ℓ, 3)
  = (.3)(.1) + (.5)(.2) + (.2)(.4)
  = .21.
What is the probability that my children will be middle class, given my parents are upper class?

P(x_2 = 2 | x_0 = 3) = Σ_{ℓ=1}^{3} P(x_2 = 2, x_1 = ℓ | x_0 = 3)
  = Σ_{ℓ=1}^{3} p(3, ℓ) p(ℓ, 2)
  = (.2)(.2) + (.4)(.5) + (.4)(.4)
  = .40.

Of course this approach for two-step probabilities applies in general:

p²(i, j) = P(x_{n+2} = j | x_n = i) = Σ_{ℓ=1}^{k} p(i, ℓ) p(ℓ, j).

If we think of transition matrices P in general, the term p²(i, j) can be seen as the dot product of the ith row of P with the jth column of P, i.e. the (i, j)th entry of P².
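A quick numerical check of the two computations above (my addition, not part of the notes); indices are shifted by one because Python arrays start at 0.

import numpy as np

# Social mobility one-step transition matrix (states 1, 2, 3 -> indices 0, 1, 2)
P = np.array([[.7, .2, .1],
              [.3, .5, .2],
              [.2, .4, .4]])

P2 = P @ P  # the two-step transition matrix is the matrix square

print(P2[1, 2])  # P(x2 = 3 | x0 = 2) = 0.21
print(P2[2, 1])  # P(x2 = 2 | x0 = 3) = 0.40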

The Chapman-Kolmogorov Equation:

The Chapman-Kolmogorov equation is crucial in understanding multi-step transition probabilities of Markov chains. It states that:

p^{m+n}(i, j) = Σ_{ℓ=1}^{k} p^m(i, ℓ) p^n(ℓ, j).

Proof:
To prove this equation we break things down according to the state at time m (the chain goes from i at time 0 through some state ℓ at time m to j at time m + n):

P(x_{m+n} = j | x_0 = i) = Σ_{ℓ=1}^{k} P(x_{m+n} = j, x_m = ℓ | x_0 = i).

Now use the definition of conditional probability for the term in the sum:

P(x_{m+n} = j, x_m = ℓ | x_0 = i) = P(x_{m+n} = j, x_m = ℓ, x_0 = i) / P(x_0 = i)
  = [ P(x_{m+n} = j, x_m = ℓ, x_0 = i) / P(x_m = ℓ, x_0 = i) ] · [ P(x_m = ℓ, x_0 = i) / P(x_0 = i) ]
  = P(x_{m+n} = j | x_m = ℓ, x_0 = i) P(x_m = ℓ | x_0 = i).

By the Markov property the first term on the RHS is P(x_{m+n} = j | x_m = ℓ), so that:

P(x_{m+n} = j, x_m = ℓ | x_0 = i) = P(x_{m+n} = j | x_m = ℓ) P(x_m = ℓ | x_0 = i)
  = p^n(ℓ, j) p^m(i, ℓ) = p^m(i, ℓ) p^n(ℓ, j).

Thus,

p^{m+n}(i, j) = Σ_{ℓ=1}^{k} p^m(i, ℓ) p^n(ℓ, j).   QED

Take n = 1 in this equation:

p^{m+1}(i, j) = Σ_{ℓ=1}^{k} p^m(i, ℓ) p(ℓ, j),

which is the ith row of the m-step transition matrix multiplied by the jth column of P. So the (m + 1)-step transition matrix is given by P^{m+1}:
the m-step transition matrix is equal to the 1-step transition matrix to the power of m.
Example:
Let x_n be the weather on day n in Dublin, either rainy = 1 or not rainy = 2, with transition matrix

P = [ .8  .2
      .6  .4 ]

The day after tomorrow? A two-step transition matrix:

P² = [ .76  .24
       .72  .28 ]

so if it is rainy today, there is a 76% chance it is rainy the day after tomorrow. Further,

P^10 = [ .750  .250       P^20 ≈ [ .75  .25
         .749  .251 ] ,            .75  .25 ]   (approx.)

Note the apparent converging behaviour of the entries of the transition matrix as n gets larger and larger. More formally,

lim_{n→∞} P^n = [ 3/4  1/4
                  3/4  1/4 ]

and (3/4, 1/4) is called a stationary distribution of the chain.
End Example
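The convergence is easy to check numerically (my addition, not from the notes):

import numpy as np

P = np.array([[.8, .2],
              [.6, .4]])

for n in (2, 10, 20):
    print(n, np.linalg.matrix_power(P, n))
# Every row approaches (3/4, 1/4), the stationary distribution.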

Consider the general 2 state chain, x_n ∈ {1, 2}, with transition matrix

P = [ 1−a    a
       b    1−b ]

where 0 ≤ a ≤ 1, 0 ≤ b ≤ 1.
What is P^n in general? In other words, what is the limiting behaviour as n → ∞?
If we can write P = QΛQ⁻¹, where Λ is a diagonal matrix and Q is a matrix to be found, then

P^n = (QΛQ⁻¹)(QΛQ⁻¹) · · · (QΛQ⁻¹) = QΛ^nQ⁻¹.

So we need to find the eigendecomposition of P, i.e. the eigenvalues and eigenvectors of P:

|P − λI| = (1 − a − λ)(1 − b − λ) − ab = λ² − (2 − a − b)λ + (1 − a − b) = 0

⟹ λ_{1,2} = [ (2 − a − b) ± √( (2 − a − b)² − 4(1 − a − b) ) ] / 2
          = [ (2 − a − b) ± √( (a + b)² ) ] / 2
          = [ (2 − a − b) ± (a + b) ] / 2,

so λ₁ = 1 and λ₂ = 1 − a − b, giving

Λ = [ 1      0
      0   1−a−b ]

Next step: eigenvectors. For λ₁ = 1,

P (y₁, y₂)ᵀ = (y₁, y₂)ᵀ  ⟹  (1 − a)y₁ + a y₂ = y₁  ⟹  a y₂ = a y₁  ⟹  y₁ = y₂ = y,

so the first eigenvector is (y, y)ᵀ. For the second eigenvector, with λ₂ = 1 − a − b,

(1 − a)z₁ + a z₂ = (1 − a − b)z₁  ⟹  a z₂ = −b z₁  ⟹  z₂ = −(b/a) z₁,

so the second eigenvector is (z, −(b/a)z)ᵀ. Now,

Q = [ y     z
      y  −(b/a)z ] ,   Q⁻¹ = (−a/(yz(a + b))) [ −(b/a)z  −z
                                                  −y       y ]

since det Q = −(b/a)yz − yz = −yz(a + b)/a. With P = QΛQ⁻¹ we have P^n = QΛ^nQ⁻¹. Multiplying out (the y's and z's cancel):

P^n = [ b/(a+b) + (a/(a+b))(1−a−b)^n    a/(a+b) − (a/(a+b))(1−a−b)^n
        b/(a+b) − (b/(a+b))(1−a−b)^n    a/(a+b) + (b/(a+b))(1−a−b)^n ]

What happens as n → ∞? If |1 − a − b| < 1, then (1 − a − b)^n → 0 as n → ∞, and

−1 < 1 − a − b < 1  ⟺  0 < a + b < 2.

Thus, if 0 < a + b < 2, then

lim_{n→∞} P^n = [ b/(a+b)  a/(a+b)
                  b/(a+b)  a/(a+b) ]

We then know that

( b/(a+b), a/(a+b) )

is the stationary distribution of the two state chain. Indeed, by Chapman-Kolmogorov,

p^{n+1}(i, j) = Σ_{ℓ=1}^{2} p^n(i, ℓ) p(ℓ, j),

and letting n → ∞, the long-run probabilities π(j) = lim_n p^n(i, j) must satisfy

π(j) = Σ_{ℓ=1}^{2} π(ℓ) p(ℓ, j),

i.e., in matrix form,

(π(1) π(2)) [ p(1,1)  p(1,2)
              p(2,1)  p(2,2) ] = (π(1) π(2)).

Solving for π gives ( b/(a+b), a/(a+b) ).
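This calculation is easy to mirror numerically; a sketch (my addition, with assumed values a = 0.2, b = 0.6):

import numpy as np

a, b, n = 0.2, 0.6, 50  # illustrative values with 0 < a + b < 2
P = np.array([[1 - a, a],
              [b, 1 - b]])

# Eigendecomposition P = Q diag(lam) Q^{-1}
lam, Q = np.linalg.eig(P)
Pn = Q @ np.diag(lam**n) @ np.linalg.inv(Q)

limit = np.array([[b, a], [b, a]]) / (a + b)
print(np.allclose(Pn, np.linalg.matrix_power(P, n)))  # True
print(np.allclose(Pn, limit, atol=1e-6))              # (1-a-b)^n has vanished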

3 Properties of Markov Chains

In this chapter we will look at properties which can be used to classify the behaviour of Markov chains.

3.1 Decomposability

Definition: Closed Set of States
A set of states A is closed if

P(x_{n+1} ∈ A | x_n = x) = 1

for all states x ∈ A. If A is closed, then starting from any value in A, we always stay in A.
Suppose we have two disjoint closed sets A₁ and A₂. If we start the chain in A₁, i.e. x₀ ∈ A₁, then the states outside of A₁ are immaterial, and the process (chain) can be analysed solely through its movement in A₁. This is the idea of decomposability.

A Markov chain is indecomposable if its set of states does not contain two or more disjoint closed sets of states.
If a transition from state i to state j is possible, then we write i → j, i.e. there is some m such that p^m(i, j) > 0. If additionally there exists n such that p^n(j, i) > 0, so that it is possible to transition either way, we say that i communicates with j and write i ↔ j. (We will show later that communication is an equivalence relation.)
The ideas of communication and indecomposability are closely related:
If for every pair of states i and j, at least one of i → j or j → i holds, then the chain's set of states is indecomposable.
Proof:
Suppose there are two disjoint closed sets of states A₁ and A₂. Take any two states i ∈ A₁ and j ∈ A₂, and suppose that i → j or j → i is possible. Assuming that i → j, then there is an m such that

p^m(i, j) > 0.

But this contradicts the fact that A₁ is closed:

P(x_{n+m} ∈ A₁ | x_n = i ∈ A₁) = 1.

Hence, the chain's states are indecomposable. QED

3.2 Periodicity

Some Markov chains exhibit periodic behaviour. Suppose that the states are indecomposable and consider, say, the two-step transition probabilities p²(i, A) (the probability of going from state i to the set of states A in two steps).
It is possible that the states decompose into two closed sets under this transition probability. That is, there are two disjoint sets B₁ and B₂ such that

p²(i, B₁) = 1   for all i ∈ B₁,
p²(i, B₂) = 1   for all i ∈ B₂.

Example:
In the simple random walk,

p(i, i + 1) = p,   p(i, i − 1) = q = 1 − p.

If i is an odd integer, then the next state will be even, and the state after that will be odd again. Similarly, if i is even, in two steps the state will be even again. So if we let B₁ be the even integers and B₂ be the odd integers, then

p²(i, B₁) = 1   for all i even,
p²(i, B₂) = 1   for all i odd.

End Example
In general, the periodic behaviour can be summarised by the following:
Let d ≥ 1 be the largest integer such that the states can be decomposed into d disjoint subsets, B₁, . . . , B_d, each of which is closed under the d-step transition probability. The Markov chain cycles among the B₁, . . . , B_d: if the starting state is in B₁, the next state will be in, say, B₂, and so on until the chain transitions from B_d back to B₁.
Decomposability, Periodicity → Stability

3.3 Stability

Suppose x₀ = x. We want to know what statements can be made about the chain after a large number of subsequent movements. A key question is that of stability.
Regardless of the initial state, can the states that are visited by a chain be represented by some limiting distribution after a large number of steps? If we think of the m-step transition probabilities p^m(x, A) (the probability of being in A after m steps), the question we're asking is:
Is there a limiting distribution π(A) such that p^m(x, A) → π(A) as m → ∞?
If the answer is yes, we say that the chain is stable. Stability here is a property of the chain's transition probabilities, since p^m(x, A) → π(A) regardless of the arbitrary starting state x.
It can be seen that if the chain is decomposable or periodic, we can't have stability.
Decomposability: let A₁, A₂ be disjoint closed sets of states. Then for every m,

p^m(x, A₁) = 1 if x ∈ A₁, and 0 if x ∈ A₂,

so the limit depends on the starting state. Periodicity: with two periodic classes B₁, B₂ and x ∈ B₁, p^m(x, B₁) is 1 for m even and 0 for m odd, with the reverse for m odd starting from B₂; so there is no limiting value for this probability as m → ∞.
At the very least the chain must be indecomposable and aperiodic to be stable.
Exercise:
Find an example of a decomposable chain and a periodic chain.

3.4 Long-Run Regularity

If the chain is stable it has some long-run regularity properties. No matter what state we start from, the proportion of time the chain spends in the set of states A will be π(A).
Count the number of times x₁, . . . , x_m is in A. Let

f(x_t) = 1 if x_t ∈ A, and 0 otherwise.

Then

(1/m) Σ_{t=1}^{m} f(x_t)

is the proportion of time spent in A. We have

E[f(x_t)] = 1 · p^t(x, A) + 0 · (1 − p^t(x, A)) = p^t(x, A).

So the expected proportion of time spent in A is

(1/m) Σ_{t=1}^{m} p^t(x, A).

As time ticks on, as m → ∞, then since the chain is stable, p^m(x, A) → π(A). If the numbers in a sequence satisfy a_t → a, then

(1/m) Σ_{t=1}^{m} a_t → a,

so that

(1/m) Σ_{t=1}^{m} p^t(x, A) → π(A).

We also have the law of large numbers:

P( | (1/m) Σ_{t=1}^{m} f(x_t) − π(A) | > ε ) → 0

as m gets large(r).

3.5 Computing Stable Distributions

Terminology: stable distribution ↔ stationary distribution; indecomposable ↔ irreducible.

How do we compute a stable distribution? By the Markov property and Chapman-Kolmogorov:

p^{m+1}(x, A) = Σ_{ℓ=1}^{k} p^m(x, ℓ) p(ℓ, A).

By assumption, p^{m+1}(x, A) → π(A) as m → ∞, and also p^m(x, ℓ) → π(ℓ) as m → ∞, so π(·) must satisfy the equation

π(A) = Σ_{ℓ=1}^{k} π(ℓ) p(ℓ, A).

For k states {1, . . . , k}, the stable distribution (π(1), . . . , π(k)) satisfies

(π(1) . . . π(k)) = (π(1) . . . π(k)) P,

i.e., taking the jth column of P and multiplying by π(·),

π(j) = π(1)p(1, j) + π(2)p(2, j) + . . . + π(k)p(k, j).

Solving this matrix equation gives the stable distribution.
Example:
Two state chain:

P = [ 1−a    a
       b    1−b ]

Find the stable distribution (π₁, π₂). From (π₁ π₂)P = (π₁ π₂):

π₁ = (1 − a)π₁ + bπ₂
π₂ = aπ₁ + (1 − b)π₂

⟹ bπ₂ = aπ₁ ⟹ π₂ = (a/b)π₁.

Since π₁ + π₂ = 1,

π₁ + (a/b)π₁ = 1  ⟹  π₁(a + b)/b = 1  ⟹  π₁ = b/(a + b),

and so

π₂ = 1 − b/(a + b) = a/(a + b).

Quick Recap:
Stable (stationary) distributions: for a k × k transition matrix P, π = (π₁ . . . π_k) is stable if

πP = π,

which gives k − 1 linearly independent equations, together with

Σ_{ℓ=1}^{k} π_ℓ = 1.

Example:
Weather on day n in Dublin, 1 = rainy, 2 = not rainy,

P = [ .8  .2
      .6  .4 ]

What is the stable distribution? This is the two state chain with a = .2, b = .6, so

( b/(a+b), a/(a+b) ) = ( .6/.8, .2/.8 ) = ( 3/4, 1/4 ).

From first principles, πP = π gives

.8π₁ + .6π₂ = π₁   (I)
.2π₁ + .4π₂ = π₂   (II)

and π₁ + π₂ = 1 ⟹ π₂ = 1 − π₁. Substituting into (I):

.8π₁ + .6(1 − π₁) = π₁
.8π₁ + .6 − .6π₁ = π₁
.6 = .8π₁
π₁ = 6/8 = 3/4,   π₂ = 1/4.

End Example
Example:
Social mobility. Recall the transition matrix:

P = [ .7  .2  .1
      .3  .5  .2
      .2  .4  .4 ]

Does the proportion of families falling into the three social classes approach a stable limit? We need (π₁ π₂ π₃)P = (π₁ π₂ π₃):

.7π₁ + .3π₂ + .2π₃ = π₁   (I)
.2π₁ + .5π₂ + .4π₃ = π₂   (II)
.1π₁ + .2π₂ + .4π₃ = π₃   (III)

together with π₁ + π₂ + π₃ = 1. Solving yields:

π₁ = 22/47,   π₂ = 16/47,   π₃ = 9/47.

End Example
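Numerically, the stationary distribution is the solution of a small linear system; a sketch (my addition, not part of the notes) for the social mobility chain:

import numpy as np

P = np.array([[.7, .2, .1],
              [.3, .5, .2],
              [.2, .4, .4]])

# Solve pi P = pi together with sum(pi) = 1.
# pi (P - I) = 0 transposes to (P^T - I) pi = 0; one equation is redundant,
# so replace it by the normalisation constraint.
k = P.shape[0]
A = P.T - np.eye(k)
A[-1, :] = 1.0              # last row enforces sum(pi) = 1
b = np.zeros(k)
b[-1] = 1.0
pi = np.linalg.solve(A, b)

print(pi)        # [0.468..., 0.340..., 0.191...]
print(pi * 47)   # [22, 16, 9], matching 22/47, 16/47, 9/47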

Theorem:
For an indecomposable, non-periodic chain with transition probabilities p(x, A) such that any two states x and y communicate, the system of equations

π(j) = Σ_{ℓ=1}^{k} p(ℓ, j) π(ℓ),   j = 1, . . . , k,
Σ_{ℓ=1}^{k} π(ℓ) = 1,

gives a set of k linearly independent equations with unique solution π.

3.6 Detailed Balance

π(·) is said to satisfy detailed balance if

π(x) p(x, y) = π(y) p(y, x).

This is a stronger condition than πP = π. If we sum over x on each side of the above:

Σ_{x=1}^{k} π(x) p(x, y) = Σ_{x=1}^{k} π(y) p(y, x) = π(y) Σ_{x=1}^{k} p(y, x) = π(y).

A good way to think of detailed balance is as follows: imagine a beach where π(x) gives the amount of sand at mound x. A transition of the chain means a fraction p(x, y) of the sand at x is transferred to y. Detailed balance says that the amount of sand going from x to y in one step is completely balanced by the amount going back from y to x:

π(x) p(x, y) = π(y) p(y, x).

In contrast, πP = π says that after all transfers of sand, the amount that ends up on each mound is the same as the amount that started there.
Example:
A graph is defined by giving two things:
1. a set of vertices V (finite);
2. an adjacency matrix A(u, v) which is 1 if there is an edge connecting u and v, and is 0 otherwise.
(In social network analysis, vertices are sometimes called actors.)
For example, with V = {1, 2, 3, 4}:

A(u, v) = [ 0  1  1  0
            1  0  1  0
            1  1  0  0
            0  0  0  0 ]

The adjacency matrix can be used to describe the topology of the graph. By convention A(v, v) = 0 for all v ∈ V.
The degree of any vertex u is equal to the number of neighbours it has,

d(u) = Σ_v A(u, v),

since each neighbour of u contributes 1 to this sum. Now consider a random walk x_n on this graph. Define the transition probability by

p(u, v) = A(u, v) / d(u),

i.e., if x_n = u, then we jump randomly to one of its neighbours at time n + 1 (a symmetric random walk on the graph).
Now, p(u, v) = A(u, v)/d(u) says

d(u) p(u, v) = A(u, v),

and A is a symmetric matrix (non-directed graph), so A(u, v) = A(v, u). If we take π(u) = c d(u) for some positive constant c, then

π(u) p(u, v) = c d(u) p(u, v) = c A(u, v) = c A(v, u) = c d(v) p(v, u) = π(v) p(v, u).

So the RW on the graph satisfies detailed balance. Its stable distribution is π(u) = c d(u), where Σ_{v∈V} π(v) = 1 forces

c = 1 / Σ_{v∈V} d(v)

and so

π(u) = d(u) / Σ_{v∈V} d(v).

End Example
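A short numerical check (my addition; the isolated vertex 4 from the example is dropped so that every vertex has a neighbour and p(u, v) is well defined):

import numpy as np

# Adjacency matrix of the triangle graph on vertices {1, 2, 3}
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

d = A.sum(axis=1)           # vertex degrees
P = A / d[:, None]          # p(u, v) = A(u, v) / d(u)
pi = d / d.sum()            # pi(u) = d(u) / sum_v d(v)

# Detailed balance: pi(u) p(u, v) == pi(v) p(v, u) for all u, v
flows = pi[:, None] * P
print(np.allclose(flows, flows.T))  # True
print(np.allclose(pi @ P, pi))      # pi is stationary: True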

4 Poisson Processes

Think of e-mail messages arriving to a server. This is an example of events arriving randomly in an interval of time. The number of events (e-mail arrivals) that occur over an interval of time, say 1 hour, will be a discrete random variable. This discrete RV is often modelled via a Poisson process. The length of the interval between arrivals will be modelled by an exponential distribution.

4.1 Assumptions of the Poisson Process

First, we assume that we observe the process for a fixed period of time t. The number of events that occur in this fixed interval (0, t] is a random variable X. X will be discrete and its probability law will depend on the manner in which events occur.
We make the following assumptions about the way in which events occur:
1. In a sufficiently short length of time Δt, either 0 or 1 events occur (two or more simultaneous occurrences are impossible).
2. The probability of exactly one event occurring in this short time interval of length Δt is λΔt. So, the probability of exactly one event occurring in the interval is proportional to the length of the interval.
3. Any non-overlapping intervals of length Δt are independent Bernoulli trials.
These three assumptions are the assumptions for a Poisson process with parameter λ.

4.2 Probability Law of Poisson Process

Suppose our interval of length t, (0, t], is divided into n = t/Δt non-overlapping equal length pieces.
By assumption 3, these smaller intervals are independent Bernoulli trials. Each of these has probability of success (an event occurring) equal to p = λΔt = λt/n (from assumption 2). Then the probability of no event occurring in a piece is q = 1 − λΔt.
Then X, the number of events in the interval of length t, is binomial(n, p = λt/n):

P(X = k) = (n choose k) (λt/n)^k (1 − λt/n)^{n−k}
  = [ n!/(k!(n − k)!) ] [ (λt)^k/n^k ] (1 − λt/n)^n (1 − λt/n)^{−k}
  = [ (λt)^k/k! ] [ n(n − 1) · · · (n − k + 1)/n^k ] (1 − λt/n)^n (1 − λt/n)^{−k}.

Now examine the limiting case as Δt → 0, which is the same as n → ∞:

lim_{n→∞} n(n − 1)(n − 2) · · · (n − k + 1)/n^k = lim_{n→∞} 1 · (1 − 1/n)(1 − 2/n) · · · (1 − (k − 1)/n) = 1,
lim_{n→∞} (1 − λt/n)^n = e^{−λt},
lim_{n→∞} (1 − λt/n)^{−k} = 1,

so

lim P(X = k) = [ (λt)^k/k! ] e^{−λt},

i.e. in the limit as Δt → 0 we retrieve the Poisson probability law.

Recall the Taylor expansion of e^x:

e^x = 1 + x + x²/2! + . . . = Σ_{j=0}^{∞} x^j/j!,   valid for x ∈ ℝ.

If we sum the probability law over all possible values we get

Σ_{k=0}^{∞} P(X = k) = Σ_{k=0}^{∞} [ (λt)^k/k! ] e^{−λt} = e^{−λt} e^{λt} = 1,

as we'd expect.
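The binomial-to-Poisson limit is easy to see numerically; a sketch (my addition, with illustrative values λ = 2, t = 3, k = 4):

import math

lam, t, k = 2.0, 3.0, 4

def binom_pmf(n):
    """P(X = k) for X ~ binomial(n, lam * t / n)."""
    p = lam * t / n
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

poisson_pmf = (lam * t)**k / math.factorial(k) * math.exp(-lam * t)

for n in (10, 100, 10_000):
    print(n, binom_pmf(n))
print("Poisson limit:", poisson_pmf)  # binomial pmf approaches this as n grows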

4.3 Moments of the Poisson Distribution

For any integer ℓ ≥ 1 we can show that

E[X(X − 1) · · · (X − ℓ + 1)] = (λt)^ℓ.

To see this, observe that X(X − 1) · · · (X − ℓ + 1) = 0 if X ≤ ℓ − 1, so

E[X(X − 1) · · · (X − ℓ + 1)] = Σ_{k=ℓ}^{∞} k(k − 1) · · · (k − ℓ + 1) [ (λt)^k/k! ] e^{−λt}
  = (λt)^ℓ e^{−λt} Σ_{k=ℓ}^{∞} (λt)^{k−ℓ}/(k − ℓ)!
  = (λt)^ℓ e^{−λt} Σ_{j=0}^{∞} (λt)^j/j!
  = (λt)^ℓ e^{−λt} e^{λt}
  = (λt)^ℓ.

So then,

E[X] = (λt)^1 = λt,
E[X(X − 1)] = (λt)²,
Var[X] = E[X(X − 1)] − (E[X])² + E[X] = (λt)² − (λt)² + λt = λt.

Example:

Assume molecules of a rare gas occur at an average rate of λ per cubic metre. If it is reasonable to assume that these molecules of the gas are distributed independently in the air, then the number of molecules in a cubic metre of air is a Poisson random variable with rate parameter λ. If we wanted to be 100(1 − α)% confident of finding at least one molecule of the gas in a sample of air, what sample size of air would we need to take?
Let the sample size be s cubic metres, and let the number of molecules be X, which is Poisson distributed with rate λs. We would require

P(X ≥ 1) = 1 − P(X = 0) = 1 − (λs)^0 e^{−λs}/0! = 1 − e^{−λs} ≥ 1 − α.

So e^{−λs} ≤ α, i.e. −λs ≤ log α, giving

s ≥ (1/λ) log(1/α)

cubic metres of air as the sample size we would need to take.
End Example
Recall the assumptions for a Poisson process with parameter λ. The exponential random variable is easily defined on this process. In a Poisson process events are occurring independently at random and at a uniform rate λ per unit of time. Assume that we begin to observe the Poisson process at time zero and let T be the time of the first event. T is a continuous random variable and its range is R_T = {t : t ≥ 0}. Let t be any fixed positive number and consider the event {T > t}, that the time to the first event is greater than t. This event occurs if there are zero events in the fixed interval (0, t]. The probability of zero events occurring is

P(X = 0) = (λt)^0 e^{−λt}/0! = e^{−λt}.

These events are equivalent and so have equal probability, so

P(T > t) = e^{−λt} = 1 − F_T(t),

where F_T(t) = P(T ≤ t) is the distribution function and P(T > t) = 1 − F_T(t) is the survival function. From this we find the distribution function for T:

F_T(t) = 1 − e^{−λt},   t > 0,

and its density function

f_T(t) = (d/dt) F_T(t) = λ e^{−λt},

i.e. the exponential density function.

So the time to the first event in a Poisson process is exponentially distributed with parameter λ.
The expected value of an exponential random variable is

E(T) = ∫_0^∞ t λ e^{−λt} dt = 1/λ,

and the moment generating function is

m_T(t) = E(e^{tT}) = ∫_0^∞ e^{ts} λ e^{−λs} ds = λ ∫_0^∞ e^{−s(λ−t)} ds = λ/(λ − t),   for t < λ.

Moments can be read off via

E(T^j) = (d^j/dt^j) m_T(t) |_{t=0}.

The moment generating function above can be used to verify that

Var(T) = 1/λ²,

and so the standard deviation is the same as the mean.
Example:
Students arrive at a lecture at a rate of λ = 2 per minute. If I observe for 3 minutes, what is the probability of no students arriving?

P(X = 0) = (λt)^0 e^{−λt}/0! = e^{−6} ≈ 0.0025.

So the probability of observing no students arriving in this interval is incredibly small.
End Example

4.4 Times of First Arrival

4.5 Memoryless Property of the Exponential Distribution

The exponential probability law has the memoryless property. If T is exponential with parameter λ and a and b are positive constants, then

P(T > a + b | T > a) = P(T > a + b)/P(T > a) = e^{−λ(a+b)}/e^{−λa} = e^{−λb} = P(T > b).

The exponential distribution is the only continuous probability law with the memoryless property. There are some similarities between the exponential and geometric probability distributions. For independent Bernoulli trials X₁, . . . , X_n, the number of trials to the first success is a geometric random variable. The geometric is the number of trials to first success, while the exponential represents the time to first event in a Poisson process. If Y is a geometric RV with parameter p, then

P(Y > n) = (1 − p)^n.

In deriving the Poisson process we set p = λΔt = λt/n, having subdivided (0, t] into n pieces of length Δt. But then the events {Y > n} and {T > t} are equivalent and

P(T > t) = lim_{n→∞} P(Y > n) = lim_{n→∞} (1 − λt/n)^n = e^{−λt},

so the exponential distribution function is the limit of the geometric distribution function.

4.6 Time to Occurrence of rth Event

Suppose we begin observing a Poisson process at time zero and let T_r be the time to the occurrence of the rth event, r ≥ 1. This random variable is analogous to a negative binomial random variable. Again let t be any fixed number and consider the event {T_r > t} (time to rth event greater than t). {T_r > t} is equivalent to the event {X ≤ r − 1}, where X is the number of events in (0, t], since T_r can only exceed t if there are r − 1 or fewer events in (0, t]. X is Poisson with parameter λt, so

P(T_r > t) = P(X ≤ r − 1) = Σ_{k=0}^{r−1} [ (λt)^k/k! ] e^{−λt},

and the distribution function for T_r is

F_{T_r}(t) = P(T_r ≤ t) = 1 − P(T_r > t) = 1 − Σ_{k=0}^{r−1} [ (λt)^k/k! ] e^{−λt}.

T_r is called an Erlang random variable with parameters r and λ. The density function for T_r is

f_{T_r}(t) = (d/dt) F_{T_r}(t)
  = (d/dt) [ 1 − e^{−λt} − λt e^{−λt} − ((λt)²/2!) e^{−λt} − . . . − ((λt)^{r−1}/(r − 1)!) e^{−λt} ].

Differentiating term by term, the sum telescopes and all that remains is

f_{T_r}(t) = [ λ^r t^{r−1}/(r − 1)! ] e^{−λt} = [ λ^r t^{r−1}/Γ(r) ] e^{−λt},   t > 0,

which, since Γ(r) = (r − 1)! for integer r, is the density of the gamma distribution.


(Revise the Gamma distribution.)
The Erlang probability law is a particular case of the gamma distribution. So the time to the rth occurrence in a Poisson process is gamma distributed with shape parameter r and rate λ.
Example:
The instants at which telephone calls are made to a call centre form a Poisson process with λ = 120/hr.
Let T_10 be the time to the tenth call made starting from 9am, using minutes as the unit of time. Then T_10 is gamma distributed with shape r = 10 and rate λ = 2/min. The expected time of the 10th call is

E(T_10) = 10/2 = 5 mins,

so at 9.05am.

The probability that the tenth call occurs before 9.05am is

P(T_10 < 5) = 1 − Σ_{k=0}^{9} [ (5 · 2)^k/k! ] e^{−5·2} = 1 − Σ_{k=0}^{9} (10^k/k!) e^{−10} = .542.

The probability that the tenth call is received between 9.05am and 9.07am is

P(5 < T_10 ≤ 7) = ( 1 − Σ_{k=0}^{9} (14^k/k!) e^{−14} ) − ( 1 − Σ_{k=0}^{9} (10^k/k!) e^{−10} ) = .349.

End Example
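Both numbers are finite sums of Poisson probabilities and are easy to verify (my addition, not from the notes):

import math

def p_rth_event_by(t, r, lam):
    """P(T_r <= t) = 1 - sum_{k=0}^{r-1} (lam*t)^k e^{-lam*t} / k!"""
    mu = lam * t
    return 1 - sum(mu**k / math.factorial(k) for k in range(r)) * math.exp(-mu)

lam, r = 2.0, 10                     # rate 2 calls/min, tenth call
print(p_rth_event_by(5, r, lam))     # ~0.542
print(p_rth_event_by(7, r, lam) - p_rth_event_by(5, r, lam))  # ~0.349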

4.7 Summary of Inter-Arrival Times

We have seen some results concerning the distributions of the times between occurrences in a Poisson process:
1. The distribution of the time to the first event is exponential(λ);
2. Times between events are exponential(λ);
3. The time to the rth event is gamma distributed, shape = r, rate = λ (scale = 1/rate).
So far we have assumed that the rate of occurrence λ is constant. This process is called a time homogeneous Poisson process.
Let X(t) denote the number of events in (0, t]. Then the Poisson process is said to have independent increments.
Let T₁, T₂, T₃, T₄, . . . denote the arrival times of the process, and define T₀ = 0. Then X(T₁) − X(T₀), X(T₂) − X(T₁), X(T₃) − X(T₂), . . . are independent random variables. This is since X(t + s) − X(s), t ≥ 0, is a rate λ Poisson process and is independent of X(r), 0 ≤ r < s.

4.8 General Poisson Process

General Poisson Process

Let X(t) be the number of events in (0, t]. We say that X(t) is a Poisson process with
rate (t) if:
1. X(0) = 0;
2. X(t) has independent increments;
31

3. X(t)

X(s) for s < t, this is a Poisson process with mean


Z t
(r)dr
s

Note that if (r) = , a constant.


In this case the mean of the process X(t) X(s) is
Z t
Z t
(r)dr =
dr = (t
s

s)

which is just the Poisson process we have studied up until now.


For an time homogeneous process we have shown that the times between arrivals
follow an exponential distribution. If (t) depends explicitly on t, and hence in general,
this isnt the case.
Let T1 be the time to the first arrival
P(T1 > t) = P(X(t) = 0)

hR

t
0

= exp

(r)dr
Z

i0

(X(t) Poisson , mean=


exp
0!

R
t

(r)dr
0

(r)dr

(r)dr)
0

Whats the distribution of T1 , the cumulative distribution function.


FT1 (t) = P(T1 t)
= 1
= 1

P(T1 > t)
Z
exp

(r)dr

and the density function of T1


d
FT (t)
dt 1
Z t
Z t

d
(r)dr exp
(r)dr
=
dt 0
0
Z t

= (t) exp
(r)dr

fT1 (t) =

Rt
If we call (t) = 0 (r)dr, then we can see fT1 (t) = (t)e
exponential distribution.
32

(t)

will not in general be an

Aside: Time homogeneous.


(t) =
(t) =

f T1 =

(r)dr =
0

dr = t
0

When λ(t) depends explicitly on t, i.e. is non-constant, we term this a time non-homogeneous Poisson process. An example is a change point:

λ(t) = λ₁ for t < τ,   λ(t) = λ₂ for t ≥ τ.

Showing that a Poisson process satisfies the Markov property in general follows from the independent increments property (2). However, a Poisson process is a continuous time process, so we need to formally say what we mean by the Markov property in continuous time. In discrete time we observe our process at time points 0, 1, 2, 3, . . . , n, n + 1, . . .; for continuous time we observe the process at arbitrary points in time in ℝ₊:

0 = s₀ < s₁ < s₂ < . . . < s_k < s < t < t₁ < . . . < t_n,

with states i₀, i₁, i₂, . . . , i_k, i, j, j₁, . . . , j_n. We say that the Markov property holds if for these arbitrary points in time:

P(X(t) = j, X(t₁) = j₁, . . . , X(t_n) = j_n | X(s₀) = i₀, . . . , X(s_k) = i_k, X(s) = i)
  = P(X(t) = j, X(t₁) = j₁, . . . , X(t_n) = j_n | X(s) = i).

Compare to the discrete time definition. For the Poisson process,

P(X(t) = j | X(s) = i) = P(X(t) = j, X(s) = i) / P(X(s) = i)
  = P(X(t) − X(s) = j − i) P(X(s) = i) / P(X(s) = i)   (independent increments)
  = P(X(t) − X(s) = j − i)
  = [ ∫_s^t λ(r) dr ]^{j−i} exp( −∫_s^t λ(r) dr ) / (j − i)!.

Therefore the Poisson process satisfies the Markov property.
We will denote P(X(t) = j | X(s) = i) by P_{s,t}(i, j) for continuous time processes. In the next chapter we will meet examples where the states form a continuous random variable.

4.9 Compound Poisson Processes

A compound Poisson process associates an independent, identically distributed variable Y_i with each arrival of the Poisson process. The Y_i are assumed independent of the Poisson process of arrivals, and independent of each other.
Example 1:
Consider messages arriving at a central computer before being transmitted over the internet. If we imagine a large number of users at separate terminals, we can assume that messages arrive at the central computer according to a Poisson process. If we let Y_i be the size (in bytes) of the ith message, then again it's reasonable to assume the Y_i's are iid and independent of the Poisson process of arrivals.
End Example
Example 2:
Claims come in to a large insurance company. Assume claims arrive according to a Poisson process and the sizes of claims (Y_i) can be assumed independent of each other. The compound process will give an idea of total liability.
End Example
It is natural to consider the sum of all the Y_i's up to time t. At time t there have been X(t) events of the Poisson process, with associated variables Y₁, Y₂, . . . , Y_{X(t)}, and

S(t) = Y₁ + Y₂ + . . . + Y_{X(t)},

where we set S(t) = 0 if X(t) = 0.
In Example 1, S(t) = total information (bytes) transmitted; in Example 2, S(t) = total liability for the company.
We have the following results:
Theorem:
Let Y₁, . . . , Y_{X(t)} be iid and S(t) = Σ_{i=1}^{X(t)} Y_i. Then:
1. If E[Y_i] < ∞ and E[X(t)] < ∞, then

E[S(t)] = E[X(t)] E[Y].

2. If E[Y_i²] < ∞ and E[X(t)²] < ∞, then

Var[S(t)] = E[X(t)] Var[Y] + Var[X(t)] E[Y]².

Proof:
When X(t) = n, then S(t) = Y₁ + . . . + Y_n, and E[S(t) | X(t) = n] = nE[Y].

Breaking things down according to the value of X(t):

E[S(t)] = Σ_{n=0}^{∞} E[S(t) | X(t) = n] P(X(t) = n)
  = Σ_{n=0}^{∞} n E[Y] P(X(t) = n)
  = E[Y] Σ_{n=0}^{∞} n P(X(t) = n)
  = E[Y] E[X(t)].

For 2, again if X(t) = n, then

Var[S(t) | X(t) = n] = Var[Y₁ + . . . + Y_n] = n Var[Y].

Hence,

E[S(t)²] = Σ_{n=0}^{∞} E[S(t)² | X(t) = n] P(X(t) = n)
  = Σ_{n=0}^{∞} ( n Var[Y] + E[S(t) | X(t) = n]² ) P(X(t) = n)
  = Σ_{n=0}^{∞} ( n Var[Y] + n² E[Y]² ) P(X(t) = n)
  = Var[Y] E[X(t)] + E[Y]² E[X(t)²],

so

Var[S(t)] = E[S(t)²] − E[S(t)]²
  = Var[Y] E[X(t)] + E[Y]² E[X(t)²] − E[Y]² E[X(t)]²
  = Var[Y] E[X(t)] + E[Y]² ( E[X(t)²] − E[X(t)]² )
  = Var[Y] E[X(t)] + E[Y]² Var[X(t)].

QED
Example:
Suppose the number of customers at an off-licence in a day is Poisson with mean 81, and the amount that each customer spends has mean €10 and standard deviation €6. The expected revenue in one day is 81 × 10 = €810. Since for a Poisson count Var[X(t)] = E[X(t)] = 81, the variance of the total revenue is

(81)(6²) + (10²)(81) = 2916 + 8100 = 11016.

End Example
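A simulation makes the theorem concrete; the sketch below (my illustration; a normal spend distribution is assumed for Y, which the notes do not specify, since only its mean and sd matter here) checks both moment formulas for the off-licence example:

import numpy as np

rng = np.random.default_rng(0)
lam_t = 81.0            # E[X(t)]: Poisson mean number of customers per day
mu_y, sd_y = 10.0, 6.0  # mean and sd of individual spend

def one_day():
    n = rng.poisson(lam_t)                       # number of customers
    return rng.normal(mu_y, sd_y, size=n).sum()  # total spend S(t)

days = np.array([one_day() for _ in range(100_000)])
print(days.mean())  # ~ E[X]E[Y] = 810
print(days.var())   # ~ E[X]Var[Y] + Var[X]E[Y]^2 = 81*36 + 81*100 = 11016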

5 Some Continuous Time Processes

5.1 Brownian Motion

Consider the simple symmetric random walk (Section 2.3), which takes a step to either the left or right (down or up) with equal probability, i.e., S_n = Σ_{j=1}^{n} X_j where the X_j's are iid random variables with

X_j = +1 with probability 1/2,   X_j = −1 with probability 1/2.

If we think about speeding up this process, i.e., looking at it in smaller and smaller time intervals with smaller and smaller increments to the left and right, we'll get a continuous time process. [DIAGRAM]

In this regard, consider the symmetric random walk taking steps over short intervals of length Δt, with steps of size Δx. Let X(t) be the value of the process at time t, and we'll imagine we have n = t/Δt time intervals. [DIAGRAM]
Then,

X(t) = Δx X₁ + Δx X₂ + . . . + Δx X_{[t/Δt]} = Δx [ X₁ + X₂ + . . . + X_{[t/Δt]} ].

Consider the mean and variance of X(t):

E[X(t)] = Δx (t/Δt) E[X₁] = 0,

since E[X₁] = (1/2)(1) + (1/2)(−1) = 0, and

Var[X(t)] = (Δx)² (t/Δt) Var(X₁) = (Δx)² (t/Δt),

since E[X₁²] = (1/2)(1)² + (1/2)(−1)² = 1.
Now we want to take the limit as Δx and Δt tend to 0. Let

Δx = c√Δt,

where c is some positive constant, so

Var[X(t)] = c² Δt (t/Δt) = c² t.

The process that we're left with in the limit is Brownian motion.
Observe some more properties of this process:
1. Since X(t) = Δx(X₁ + X₂ + . . . + X_{[t/Δt]}), by the Central Limit Theorem X(t) follows a normal distribution with mean 0 and variance c²t.
2. As the distribution of the change in position of the random walk is independent over non-overlapping time intervals, {X(t), t ≥ 0} has independent increments.
3. The process also has stationary increments, since the change in the process value over a given time interval, X(t + s) − X(s) ~ N(0, c²t), depends only on the length of the interval.
The standard Brownian motion (c = 1) is sometimes called the Wiener process. It is one of the most widely used processes in applied probability.
The independent increments assumption implies that the change in the value of the process between times s and t + s, i.e. X(t + s) − X(s), is independent of the process values before time s:

P(X(t + s) ≤ a | X(s) = x, X(u), 0 ≤ u < s)
  = P(X(t + s) − X(s) ≤ a − x | X(s) = x, X(u), 0 ≤ u < s)
  = P(X(t + s) − X(s) ≤ a − x)   (independence)
  = P(X(t + s) ≤ a | X(s) = x).

So this tells us that Brownian motion satisfies the Markov property (compare the simple random walk, which we showed satisfies the Markov property).
Let X(t) be standard Brownian motion; then X(t) ~ N(0, t), so the density of X(t) is

f_t(x) = (1/√(2πt)) e^{−x²/(2t)}.

Since Brownian motion has stationary and independent increments we can write down the joint distribution of X(t₁), X(t₂), . . . , X(t_n). This is:

f(x₁, . . . , x_n) = f_{t₁}(x₁) f_{t₂−t₁}(x₂ − x₁) f_{t₃−t₂}(x₃ − x₂) · · · f_{t_n−t_{n−1}}(x_n − x_{n−1}).

Using this we can compute many probabilities of interest.
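As an illustration of the limit construction (my sketch, not from the notes), one can simulate X(t) by summing scaled coin-flip steps with Δx = √Δt (c = 1) and check that X(1) is approximately N(0, 1):

import numpy as np

rng = np.random.default_rng(42)
n = 10_000
dt = 1.0 / n
dx = np.sqrt(dt)

steps = rng.choice([-1.0, 1.0], size=n) * dx
X = np.concatenate([[0.0], np.cumsum(steps)])  # one path on [0, 1], X(0) = 0

# The endpoint X(1) should be approximately N(0, 1); repeat to see the spread.
samples = [rng.choice([-1.0, 1.0], size=n).sum() * dx for _ in range(2_000)]
print(np.mean(samples), np.var(samples))  # ~0 and ~1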


Quick Recap: Brownian Motion
Limit of a SRW, speeding it up: {X(t), t ≥ 0} with X(t) ~ N(0, c²t); standard Brownian motion has c = 1, so X(t) ~ N(0, t).
Independent increments:

P(X(t + s) ≤ a | X(s) = x, X(u), 0 ≤ u < s) = P(X(t + s) ≤ a | X(s) = x)

⟹ Brownian motion satisfies the Markov property. For t₁ < t₂ < . . . < t_n,

f(x₁, . . . , x_n) = f_{t₁}(x₁) f_{t₂−t₁}(x₂ − x₁) · · · f_{t_n−t_{n−1}}(x_n − x_{n−1}).

For example, the conditional distribution of X(s) given that X(t) = B, where s < t, is

f_{s|t}(x | B) = f_{s,t}(x, B)/f_t(B) = f_s(x) f_{t−s}(B − x)/f_t(B)
  = [ (2πs)^{−1/2} e^{−x²/2s} (2π(t−s))^{−1/2} e^{−(B−x)²/2(t−s)} ] / [ (2πt)^{−1/2} e^{−B²/2t} ]
  = ( 2π s(t−s)/t )^{−1/2} exp( −(1/2) [ x²/s + (B − x)²/(t − s) − B²/t ] ).

Collecting the terms in the exponent,

x²/s + (B − x)²/(t − s) − B²/t
  = x² (1/s + 1/(t − s)) − 2Bx/(t − s) + B² (1/(t − s) − 1/t)
  = [ t/(s(t − s)) ] ( x² − 2Bsx/t + B²s²/t² )
  = [ t/(s(t − s)) ] ( x − Bs/t )²,

so

f_{s|t}(x | B) = ( 2π s(t−s)/t )^{−1/2} exp( −( x − Bs/t )² / (2 s(t−s)/t) ),

which is the density of a normal distribution with mean Bs/t and variance s(t − s)/t. So this tells us that

E[X(s) | X(t) = B] = Bs/t,
Var[X(s) | X(t) = B] = s(t − s)/t   (independent of B).

Interestingly, the variance here does not depend on B. If we set α = s/t, then since s < t we have 0 < α < 1, the mean is αX(t), and the variance is α(1 − α)t.
When we consider the process only on [0, 1] conditional on X(1) = 0, this new process is known as the Brownian bridge.

X(0) = 0,   X(1) = 0.   [DIAGRAM TO BE FINISHED]
This is used in the analysis of empirical distribution functions.
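The conditional mean and variance above can be checked by brute-force simulation; a sketch (my addition, with assumed values s = 0.3, t = 1, B = 0.8):

import numpy as np

rng = np.random.default_rng(7)
s, t, B = 0.3, 1.0, 0.8
n = 2_000_000

Xs = rng.normal(0.0, np.sqrt(s), n)            # X(s) ~ N(0, s)
Xt = Xs + rng.normal(0.0, np.sqrt(t - s), n)   # add an independent increment

keep = np.abs(Xt - B) < 0.01                   # crude conditioning on X(t) = B
print(Xs[keep].mean(), B * s / t)              # both ~0.24
print(Xs[keep].var(), s * (t - s) / t)         # both ~0.21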

5.2 Gaussian Processes

Any stochastic process {X(t), t ≥ 0} is called a Gaussian process if (X(t₁), . . . , X(t_n)), t₁ < . . . < t_n, has a multivariate normal distribution for all t₁, . . . , t_n.
Recall that the multivariate normal distribution is defined for a random vector x = (X(t₁), . . . , X(t_n)) by

f_x(x) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ) )

where Σ is an n × n covariance matrix and μ = (μ₁, . . . , μ_n) is the mean vector.
Example:
If X₁, . . . , X_n ~ N(μ, σ²) iid, then

μ = (μ, . . . , μ),   Σ = diag(σ², . . . , σ²),   |Σ| = (σ²)ⁿ,   Σ⁻¹ = diag(1/σ², . . . , 1/σ²),

and the quadratic form (the Mahalanobis distance) reduces to

(x − μ)ᵀ Σ⁻¹ (x − μ) = (1/σ²)(x − μ)ᵀ(x − μ) = (1/σ²) Σ_{i=1}^{n} (x_i − μ)²,

so

f_X(x) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (x_i − μ)² ).

This is the likelihood function for an iid normal sample; recall that the likelihood function is

L(μ, σ²) = Π_{i=1}^{n} f(x_i).

End Example
Recall that the joint density function of X(t₁), . . . , X(t_n) for Brownian motion was

f(x₁, . . . , x_n) = f_{t₁}(x₁) f_{t₂−t₁}(x₂ − x₁) f_{t₃−t₂}(x₃ − x₂) · · · f_{t_n−t_{n−1}}(x_n − x_{n−1}).

Each factor is a normal density and the increments are independent, so the joint distribution is multivariate normal. It follows from this that Brownian motion is a Gaussian process.

5.3 Brownian Motion With Drift

We say that {X(t), t ≥ 0} is a Brownian motion process with drift coefficient μ if:
1. X(0) = 0;
2. {X(t), t ≥ 0} has stationary and independent increments;
3. X(t) is normally distributed with mean μt and variance t.
[DIAGRAM TO BE FINISHED]
It can be written as

X(t) = μt + W(t),

where W(t) is a standard Brownian motion.

5.4 Finance Applications

Alas, no time.

6 Applications of Stochastic Processes: Bayesian Model Estimation Through Markov Chain Monte Carlo

6.1 Likelihood and Maximum Likelihood

Likelihood and maximum likelihood were proposed by R. A. Fisher in 1921.
When one assumes a specific probability law/distribution for observed data, we can form what is called the likelihood function. Maximum likelihood finds the parameter values which maximise the likelihood. Assume X₁, . . . , X_n are a random sample of a random variable X, which we assume has density f(x|θ), where θ are the unknown parameter(s) (or, if X is discrete, a probability mass function). Then the likelihood function is

ℓ(x|θ) = f(x₁|θ) f(x₂|θ) · · · f(x_n|θ) = Π_{i=1}^{n} f(x_i|θ).

This can be thought of as the probability of observing the given random sample under parameters θ.
Example:
Suppose that the time to failure of a vital component in an electronic device is exponentially distributed. A sample of n failure times is x = (x₁, . . . , x_n). The likelihood function is:

ℓ(x|λ) = Π_{i=1}^{n} λ e^{−λx_i} = λⁿ e^{−λ Σ_{i=1}^{n} x_i}.

End Example

Maximum likelihood proceeds by maximising the likelihood with respect to the unknown parameter θ. Usually, we work with the log-likelihood

log ℓ(x|θ) = log [ Π_{i=1}^{n} f(x_i|θ) ] = Σ_{i=1}^{n} log f(x_i|θ).

Then take the gradient of the log-likelihood and set this equal to zero:

∇_θ log ℓ(x|θ) = 0.

The value of θ which satisfies this is the maximum likelihood estimate.
Example:
For the exponential sample above,

log ℓ(x|λ) = n log λ − λ Σ_i x_i,
(d/dλ) log ℓ(x|λ) = n/λ − Σ_i x_i = 0
⟹ λ̂ = n / Σ_i x_i = 1/x̄.

End Example
Example:
Assume X₁, . . . , X_n ~ Bernoulli(p). What is the MLE of p?

f(x|p) = p^x (1 − p)^{1−x},

ℓ(x|p) = Π_{i=1}^{n} p^{x_i} (1 − p)^{1−x_i} = p^{Σ_i x_i} (1 − p)^{n − Σ_i x_i},

log ℓ(x|p) = ( Σ_i x_i ) log p + ( n − Σ_i x_i ) log(1 − p).

Differentiating and setting equal to zero:

(d/dp) log ℓ(x|p) = ( Σ_i x_i )/p − ( n − Σ_i x_i )/(1 − p) = 0
⟹ (1 − p) Σ_i x_i = p ( n − Σ_i x_i )
⟹ Σ_i x_i = pn
⟹ p̂ = (1/n) Σ_{i=1}^{n} x_i = x̄.

End Example

Example:
X₁, . . . , X_n ~ N(μ, σ²). What are the maximum likelihood estimates of μ and σ²?

f(x; μ, σ²) = (2πσ²)^{−1/2} exp( −(x − μ)²/(2σ²) ),

ℓ(x; μ, σ²) = Π_{i=1}^{n} (2πσ²)^{−1/2} exp( −(x_i − μ)²/(2σ²) )
  = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (x_i − μ)² ),

log ℓ(x; μ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (x_i − μ)².

MLE for μ:

(∂/∂μ) log ℓ = (1/σ²) Σ_{i=1}^{n} (x_i − μ) = 0
⟹ Σ_i x_i − nμ = 0
⟹ μ̂ = (1/n) Σ_{i=1}^{n} x_i = x̄.

MLE for σ²:

(∂/∂σ²) log ℓ = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (x_i − μ)² = 0
⟹ nσ² = Σ_i (x_i − μ̂)²
⟹ σ̂² = (1/n) Σ_{i=1}^{n} (x_i − x̄)².

This is a biased estimator. Recall that

s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²

has E(s²) = σ², whereas here E(σ̂²) = ((n − 1)/n)σ² ≠ σ². (Exercise: show also that E(s) ≠ σ.)
End Example

Example:
Let X₁, . . . , X_n ~ Gamma(α, β). What are the MLEs of α and β?
With θ = (α, β),

f(x|θ) = (β^α/Γ(α)) x^{α−1} e^{−βx},

ℓ(x|θ) = Π_{i=1}^{n} (β^α/Γ(α)) x_i^{α−1} e^{−βx_i} = ( β^{nα}/Γ(α)ⁿ ) [ Π_{i=1}^{n} x_i ]^{α−1} e^{−β Σ_i x_i},

log ℓ(x|θ) = nα log β − n log Γ(α) + (α − 1) Σ_i log x_i − β Σ_i x_i.

ML for β:

(∂/∂β) log ℓ(x|θ) = nα/β − Σ_i x_i = 0  ⟹  β̂ = nα̂ / Σ_i x_i.

ML for α:

(∂/∂α) log ℓ(x|θ) = n log β − n Γ′(α)/Γ(α) + Σ_i log x_i = 0,

so α̂ is the solution of

n log β̂ + Σ_i log x_i = n Γ′(α̂)/Γ(α̂).

There is no closed form solution for the MLEs; use numerical methods to solve for α̂, β̂. This can be done quite easily using the R function optim.
End Example
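Since the notes point to R's optim, here is an analogous sketch in Python (my illustration, on simulated data) using scipy.optimize.minimize to maximise the gamma log-likelihood numerically:

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.5, scale=1/1.5, size=500)  # data with alpha=2.5, beta=1.5

def neg_loglik(theta):
    """Negative gamma log-likelihood; theta = (log alpha, log beta),
    parameterised on the log scale to keep both parameters positive."""
    a, b = np.exp(theta)
    n = len(x)
    ll = n*a*np.log(b) - n*gammaln(a) + (a - 1)*np.log(x).sum() - b*x.sum()
    return -ll

res = minimize(neg_loglik, x0=np.zeros(2), method="Nelder-Mead")
print(np.exp(res.x))  # (alpha_hat, beta_hat), close to (2.5, 1.5)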

6.2 Prior Distributions

In finding the maximum likelihood estimates in the previous section, only the observed sample values x₁, . . . , x_n are used to construct the estimate of θ. ML does not require any other information to estimate θ other than the sample values. If we did have some prior information about the possible values that θ may take, such as expert opinion, it would have been impossible to incorporate this. In many situations such information will be available. We can use this information to inform a prior distribution for θ, and then use the Bayesian approach for estimation. The prior distribution of a parameter θ is a probability function/density expressing our degree of belief about the value of θ prior to observing a sample of a random variable X, whose distribution function depends on θ. The prior distribution makes use of information available above and beyond what's in the random sample.
Example:
Suppose we have a brand new 50 cent coin and we want to estimate the probability θ of a head. We know θ has to lie between 0 and 1. A prior for θ could be uniform over the interval from 0 to 1:

π(θ) = 1 for θ ∈ (0, 1), and 0 otherwise.

This corresponds to an assumption of total ignorance; we feel that all values of θ are equally likely. On the other hand, one may feel justified in assuming a priori θ ∈ (.4, .6), since the coin appears quite symmetric. Then the following prior corresponds to a belief that any value in (.4, .6) is equally likely:

π(θ) = 5 for θ ∈ (.4, .6), and 0 otherwise.

Finally, we may only allow the values .4, .5, .6, with .5 twice as likely as each of the others, giving the prior

π(.4) = 1/4,   π(.5) = 1/2,   π(.6) = 1/4.

Note in this example that the priors are different and depend on the assumptions we are willing to make regarding the unknown θ. Often these assumptions will be informed using expert opinion on the problem.
End Example
π(θ) expresses prior beliefs about where θ may lie in the parameter space Θ. For instance, for the parameters (μ, σ²) of a Normal, Θ = ℝ × ℝ₊; for a Bernoulli probability θ, Θ = (0, 1).
Prior choice is a subjective task. The final result of a Bayes technique is generally dependent on the prior assumed. Hence, care should be taken when eliciting priors.

6.3 Posterior Distributions

Having observed a sample x = (x₁, . . . , x_n) we can write down the likelihood for x given the value of θ:

likelihood = ℓ(x|θ) = Π_{i=1}^{n} f(x_i|θ).

By taking a prior on θ we are in essence acting as if the probability law of X is itself random, through its dependence on θ. Hence, we speak of the likelihood as the distribution of x conditional on θ.
Given a prior density π(θ) for θ and the conditional density of the elements of a sample (the likelihood) ℓ(x|θ), the joint density for the sample and parameter is simply the product of these two functions,

π(x, θ) = ℓ(x|θ) π(θ),

from the definition of conditional probability:

ℓ(x|θ) = π(x, θ)/π(θ),

i.e. the joint is the product of the likelihood and the prior. Then the marginal density of the sample values, which is independent of θ, is given by the integral of the joint density over the space Θ. Thus,

π(x) = ∫_Θ π(x, θ) dθ = ∫_Θ ℓ(x|θ) π(θ) dθ.

This is called the marginal or the likelihood of the sample.
The posterior density for θ is the conditional density of θ given the sample values. Thus,

π(θ|x) = π(x, θ)/π(x) = ℓ(x|θ) π(θ)/π(x).

The prior density expresses our degree of belief about θ before any experiment, while the posterior expresses our beliefs given the result of the sample. Notice that the marginal likelihood π(x) is the normalising constant of ℓ(x|θ)π(θ), i.e.,

∫_Θ [ ℓ(x|θ) π(θ)/π(x) ] dθ = 1

(the bottom line makes it a proper density), but the marginal doesn't depend explicitly on θ. We will often write:

π(θ|x) ∝ ℓ(x|θ) π(θ),
posterior ∝ likelihood × prior.

In many cases π(x) will not be available analytically. This is what leads us to numerical methods such as Markov chain Monte Carlo (MCMC).
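One way to see "posterior ∝ likelihood × prior" in action when π(x) is awkward is to normalise numerically on a grid; a small sketch (my illustration, not from the notes) for the coin example with the uniform prior, using assumed data of 14 heads in 20 tosses:

import numpy as np

n, h = 20, 14
theta = np.linspace(0.001, 0.999, 1000)

likelihood = theta**h * (1 - theta)**(n - h)   # Bernoulli likelihood
prior = np.ones_like(theta)                    # uniform prior on (0, 1)
unnorm = likelihood * prior

posterior = unnorm / np.trapz(unnorm, theta)   # numerical normalising constant

print(np.trapz(posterior, theta))              # ~1.0: a proper density
print(theta[np.argmax(posterior)])             # MAP estimate ~ h/n = 0.7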
Example:
Suppose X₁, . . . , X_n iid N(μ, σ²). Assume a prior for μ which is N(ν, φ²), and a prior for σ² which is inverse gamma. If

Y ~ Gamma(α, β)   (α = shape, β = rate, scale = 1/rate),

then 1/Y ~ Inv-Gamma(α, β), with density

f_{1/Y}(t) = (β^α/Γ(α)) t^{−(α+1)} e^{−β/t}.

(Note that R parameterises the gamma by rate, with scale = 1/rate.)
It is a good exercise to derive this:

F_{1/Y}(t) = P(1/Y ≤ t) = P(Y ≥ 1/t) = 1 − P(Y ≤ 1/t) = 1 − F_Y(1/t),

so

f_{1/Y}(t) = (d/dt) F_{1/Y}(t) = −(d/dt) F_Y(1/t) = (1/t²) f_Y(1/t)
  = (1/t²) (β^α/Γ(α)) (1/t)^{α−1} e^{−β/t}
  = (β^α/Γ(α)) t^{−(α+1)} e^{−β/t},

which is Inv-Gamma(α, β).
Now, back to the example:

π(μ) = (2πφ²)^{−1/2} exp( −(μ − ν)²/(2φ²) ),
π(σ²) = (β^α/Γ(α)) (σ²)^{−(α+1)} exp( −β/σ² ),
ℓ(x|μ, σ²) = Π_{i=1}^{n} (2πσ²)^{−1/2} exp( −(x_i − μ)²/(2σ²) ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (x_i − μ)² ).

Posterior ∝ Likelihood × Prior

π(μ, σ²|x) ∝ π(x|μ, σ²) π(μ) π(σ²)    (independence of the priors)

∝ (2πσ²)^(-n/2) exp( -(1/(2σ²)) Σ_{i=1}^n (x_i - μ)² ) exp( -(μ - φ)²/(2τ²) ) (σ²)^(-(α+1)) exp( -β/σ² )

∝ (σ²)^(-(n/2 + α + 1)) exp( -(1/(2σ²)) [ Σ_{i=1}^n (x_i - μ)² + 2β ] ) exp( -(μ - φ)²/(2τ²) )

∝ (σ²)^(-(n/2 + α + 1)) exp( -(1/(2σ²)) [ Σ_i x_i² - 2μ Σ_i x_i + nμ² + 2β ] - (1/(2τ²)) [ μ² - 2φμ + φ² ] )

∝ (σ²)^(-(n/2 + α + 1)) exp( -(1/2) [ (n/σ² + 1/τ²) μ² - 2 (Σ_i x_i/σ² + φ/τ²) μ ] - (Σ_i x_i² + 2β)/(2σ²) - φ²/(2τ²) )
End Example
Computing a marginal likelihood π(x) analytically is only possible in the simplest of
cases/models.
Example: X_1, . . . , X_n ~ N(μ, σ²), with σ² known.
Prior for μ which is N(φ, τ²).

π(x|μ, σ²) = ∏_{i=1}^n (2πσ²)^(-1/2) exp( -(x_i - μ)²/(2σ²) )
           = (2πσ²)^(-n/2) exp( -(1/(2σ²)) Σ_{i=1}^n (x_i - μ)² )

π(μ) = (2πτ²)^(-1/2) exp( -(μ - φ)²/(2τ²) )

π(x) = ∫_{-∞}^{+∞} π(x|μ, σ²) π(μ) dμ

     = C ∫_{-∞}^{+∞} exp( -(1/(2σ²)) Σ_{i=1}^n (x_i - μ)² - (μ - φ)²/(2τ²) ) dμ

where C = (2πσ²)^(-n/2) (2πτ²)^(-1/2) collects the constants. Expanding the squares
and pulling the terms free of μ out of the integral,

     = C exp( -Σ_i x_i²/(2σ²) - φ²/(2τ²) ) ∫ exp( -(1/2) [ (n/σ² + 1/τ²) μ² - 2 (Σ_i x_i/σ² + φ/τ²) μ ] ) dμ

Completing the square in μ, the integrand is proportional to a normal density with

Var = (n/σ² + 1/τ²)^(-1) ,   mean = (Σ_i x_i/σ² + φ/τ²)/(n/σ² + 1/τ²)

so the integral over the range of this normal gives

π(x) = C √( 2π/(n/σ² + 1/τ²) ) exp( (Σ_i x_i/σ² + φ/τ²)²/(2(n/σ² + 1/τ²)) - Σ_i x_i²/(2σ²) - φ²/(2τ²) )
6.4 Posterior Quantities of Interest

There are many quantities of interest that we may want to get from a Bayesian analysis.
For example, the mean of the posterior distribution is a widely used Bayesian estimator.
The mode of the posterior is called the maximum a posteriori (MAP) estimate of θ.
If θ is of dimension p, θ = (θ_1, . . . , θ_p), we may be interested in the marginal density
of θ_j:

π(θ_j|x) = ∫ π(θ|x) dθ_{-j} ,   j = 1, . . . , p

where θ_{-j} = (θ_1, . . . , θ_{j-1}, θ_{j+1}, . . . , θ_p) is θ with the j-th element removed.
Consider the posterior expectation of θ, θ̄:

θ̄ = E_{θ|x}[θ] = ∫_Θ θ π(θ|x) dθ = ∫_Θ θ (π(x|θ)π(θ)/π(x)) dθ

This calculation requires knowing π(x), which will be intractable in most cases.
This is a big problem!
We will face these integrals in each problem we look at.
What if we could simulate values of θ, say θ^(1), θ^(2), . . . , θ^(N), from π(θ|x)? Instead of
doing these integrals analytically, we could approximate them numerically:

E_{θ|x}[θ] = ∫_Θ θ π(θ|x) dθ ≈ (1/N) Σ_{k=1}^N θ^(k)

In fact we could use the same approach to approximate the posterior expectation of
any function g(θ) of θ:

E_{θ|x}[g(θ)] = ∫_Θ g(θ) π(θ|x) dθ ≈ (1/N) Σ_{k=1}^N g(θ^(k))

The main idea of Markov chain Monte Carlo is to approximately generate samples
from the posterior π(θ|x), and then use these to approximate integrals.
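For instance (an illustrative sketch, not from the notes): in the rare tractable case where
the posterior is a known distribution we can sample from directly — say a Beta posterior
arising from a Bernoulli likelihood with a uniform prior — the Monte Carlo averages
above are one-liners in R. The Beta(7, 4) parameters below are assumed.

# Sketch: Monte Carlo approximation of posterior expectations,
# assuming draws from a tractable Beta posterior.
set.seed(3)
draws <- rbeta(1e5, shape1 = 7, shape2 = 4)  # theta^(1), ..., theta^(N)
mean(draws)                      # approximates E[theta | x]
mean(draws^2)                    # approximates E[g(theta) | x], g(t) = t^2
quantile(draws, c(.025, .975))   # a 95% posterior interval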

6.5 MCMC: The Key Ideas

The key idea of MCMC is simple. We want to generate samples from π(θ|x) but we
can't do this directly. However, suppose we can construct a Markov chain (through its
transition probabilities) with state space Θ (all values of θ) which is straightforward to
simulate from, and which has stable (stationary) distribution the posterior π(θ|x):

θ^(0), θ^(1), θ^(2), θ^(3), . . . , θ^(t), . . . , θ^(N)

6.6 The Gibbs Sampling Algorithm

Julian Besag (1974) discussion paper in JRSSB.

Let θ = (θ_1, . . . , θ_p) and suppose we want to obtain inferences from π(θ|x), but
sampling isn't easy.
We can recast the problem as one of iterative sampling from appropriate conditional
distributions.
Consider the full conditional densities

π(θ_j|x, θ_{-j}) ,   j = 1, . . . , p

where θ_{-j} = {θ_i : i ≠ j}. These are densities of the individual components given the
data and the specified values of the other components of θ.
They can typically be recognised as standard densities, e.g. normal, gamma, etc., in θ_j.

Suppose we have an arbitrary set of starting values θ^(0) = (θ_1^(0), . . . , θ_p^(0)).
For the unknowns we implement the following iterative procedure:

1st iteration:
  draw θ_1^(1) from π(θ_1|θ_2^(0), . . . , θ_p^(0), x)
  draw θ_2^(1) from π(θ_2|θ_1^(1), θ_3^(0), . . . , θ_p^(0), x)
  draw θ_3^(1) from π(θ_3|θ_1^(1), θ_2^(1), θ_4^(0), . . . , θ_p^(0), x)
  ...
  draw θ_p^(1) from π(θ_p|θ_1^(1), . . . , θ_{p-1}^(1), x)

2nd iteration:
  draw θ_1^(2) from π(θ_1|θ_2^(1), . . . , θ_p^(1), x)
  ...

Now suppose this procedure is continued through t iterations. The resulting sampled
vector θ^(t) = (θ_1^(t), . . . , θ_p^(t)) is a realisation of a Markov chain with transition
probabilities

p(θ^(t), θ^(t+1)) = ∏_{j=1}^p π(θ_j^(t+1) | θ_ℓ^(t+1) for ℓ < j, θ_ℓ^(t) for ℓ > j, x)

This is the transition (Gibbs) kernel.

Then as t → ∞, (θ_1^(t), . . . , θ_p^(t)) tends to the distribution of a random vector whose
joint density is π(θ|x).
(Throw away the initial part of the chain, called the burn-in.)
In particular, θ_j^(t) tends in distribution to a random quantity whose density is π(θ_j|x).
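To make the algorithm concrete, here is a minimal Gibbs sampler sketch in R for the
N(μ, σ²) example of Section 6.3, with μ ~ N(φ, τ²) and σ² ~ Inv-Gamma(α, β) a priori.
The data and hyperparameter values are assumptions for illustration; the two full
conditionals follow from the joint posterior derived earlier, by completing the square in
μ and collecting powers of σ².

# Sketch: Gibbs sampler for X_i ~ N(mu, sigma2), with priors
# mu ~ N(phi, tau2) and sigma2 ~ Inv-Gamma(alpha, beta) (rate form).
set.seed(4)
x <- rnorm(50, mean = 2, sd = 1.5); n <- length(x)
phi <- 0; tau2 <- 100; alpha <- 2; beta <- 2   # assumed hyperparameters
N <- 5000
mu <- numeric(N); sigma2 <- numeric(N)
mu[1] <- mean(x); sigma2[1] <- var(x)          # starting values
for (t in 2:N) {
  # full conditional for mu: normal
  prec <- n / sigma2[t - 1] + 1 / tau2
  m <- (sum(x) / sigma2[t - 1] + phi / tau2) / prec
  mu[t] <- rnorm(1, m, sqrt(1 / prec))
  # full conditional for sigma2: Inv-Gamma(n/2 + alpha, SS/2 + beta)
  sigma2[t] <- 1 / rgamma(1, shape = n / 2 + alpha,
                          rate = sum((x - mu[t])^2) / 2 + beta)
}
burn <- 1:1000                                 # discard the burn-in
c(mean(mu[-burn]), mean(sigma2[-burn]))        # posterior mean estimates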

Example:
A popular application of the Gibbs sampler is in finite mixture models used for
model-based clustering. In R use package mclust (Raftery).

For a Gaussian finite mixture the density of an observation x is given by

f_X(x) = Σ_{g=1}^G w_g f(x|μ_g, σ_g²)

where the w_g are the mixture weights with

Σ_{g=1}^G w_g = 1

and f(x|μ_g, σ_g²) is the N(μ_g, σ_g²) density.

The likelihood for n observations x_1, . . . , x_n is

π(x|θ) = ∏_{i=1}^n Σ_{g=1}^G w_g f(x_i|μ_g, σ_g²)

The likelihood is very difficult to work with. Thus we usually complete the data with
component labels z = (z_1, . . . , z_n), which tell us which component each observation
belongs to: if z_i = g, then x_i arises from a N(μ_g, σ_g²). Of course the labels give the
clustering of the data, but can't be observed directly. We can include these as unknowns
in the Gibbs sampler.
(Compare K-means.)
The likelihood of the complete data is:

π(x, z|θ) = ∏_{g=1}^G ∏_{i: z_i=g} w_g (2πσ_g²)^(-1/2) exp( -(x_i - μ_g)²/(2σ_g²) )
          = ∏_{g=1}^G w_g^{n_g} (2πσ_g²)^(-n_g/2) exp( -(1/(2σ_g²)) Σ_{i: z_i=g} (x_i - μ_g)² )

where n_g = # of i's such that z_i = g.


Priors: weights.
The standard assumption is that the weights follow a symmetric Dirichlet distribution
with parameter δ (recalling Σ_{g=1}^G w_g = 1); for G = 2 this reduces to the Beta
density (Γ(δ + δ)/(Γ(δ)Γ(δ))) w^(δ-1) (1 - w)^(δ-1). In general,

π(w_1, . . . , w_G) = (Γ(δ + δ + · · · + δ)/(Γ(δ)Γ(δ) · · · Γ(δ))) w_1^(δ-1) w_2^(δ-1) · · · w_G^(δ-1)
                    = (Γ(Gδ)/Γ(δ)^G) ∏_{g=1}^G w_g^(δ-1)

Usually one assumes that the means μ_g arise a priori independently from a N(φ, τ²):

π(μ_1, . . . , μ_G) = ∏_{g=1}^G (1/√(2πτ²)) exp( -(μ_g - φ)²/(2τ²) )

Finally, we'll assume that the variances arise independently from an inverse gamma
distribution:

π(σ_1², . . . , σ_G²) = ∏_{g=1}^G (β^α/Γ(α)) (σ_g²)^(-(α+1)) exp( -β/σ_g² )
π(θ, z|x) ∝ π(x, z|θ) π(θ)

∝ ∏_{g=1}^G w_g^{n_g} (2πσ_g²)^(-n_g/2) exp( -(1/(2σ_g²)) Σ_{i: z_i=g} (x_i - μ_g)² )   (likelihood)
  × ∏_{g=1}^G w_g^(δ-1)                                                    (prior: weights)
  × ∏_{g=1}^G exp( -(μ_g - φ)²/(2τ²) )                                     (prior: means)
  × ∏_{g=1}^G (σ_g²)^(-(α+1)) exp( -β/σ_g² )                               (prior: variances)

The next step in implementing a Gibbs sampler for this model is to derive the full
conditionals. We want to iteratively sample the labels, weights, means & variances.
Labels full conditional:

P(z_i = k|everything else) ∝ w_k (2πσ_k²)^(-1/2) exp( -(x_i - μ_k)²/(2σ_k²) )
                           ∝ (w_k/σ_k) exp( -(x_i - μ_k)²/(2σ_k²) )

We compute this for each value of k = 1, . . . , G, then renormalise to get a discrete
distribution for the label, which we can sample from.
Full conditional for the weights:

π(w_1, . . . , w_G|everything else) ∝ ∏_{g=1}^G w_g^{n_g + δ - 1}

which is the form of a Dirichlet(n_1 + δ, n_2 + δ, . . . , n_G + δ) distribution.


Full conditional for the means:

π(μ_g|everything else) ∝ exp( -(1/(2σ_g²)) Σ_{i: z_i=g} (x_i - μ_g)² - (μ_g - φ)²/(2τ²) )

∝ exp( -(1/2) [ (n_g/σ_g² + 1/τ²) μ_g² - 2 (Σ_{i: z_i=g} x_i/σ_g² + φ/τ²) μ_g ] )

∝ exp( -((n_g/σ_g² + 1/τ²)/2) [ μ_g - (Σ_{i: z_i=g} x_i/σ_g² + φ/τ²)/(n_g/σ_g² + 1/τ²) ]² )

So the full conditional for μ_g is

Normal( (Σ_{i: z_i=g} x_i/σ_g² + φ/τ²)/(n_g/σ_g² + 1/τ²) ,  (n_g/σ_g² + 1/τ²)^(-1) )

Finally, the full conditional for the σ_g² is

π(σ_g²|everything else) ∝ (σ_g²)^(-(n_g/2 + α + 1)) exp( -(1/σ_g²) [ (1/2) Σ_{i: z_i=g} (x_i - μ_g)² + β ] )

which is an inverse gamma distribution

Inv-Gamma( n_g/2 + α ,  (1/2) Σ_{i: z_i=g} (x_i - μ_g)² + β )
End Example
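Putting the four full conditionals together, one Gibbs sweep for this mixture can be
sketched in R as below. This is an illustrative sketch, not code from the notes: the
simulated data, G = 2, and the hyperparameter values (δ, φ, τ², α, β) are all
assumptions. A Dirichlet draw is obtained by normalising independent gamma draws.

# Sketch: Gibbs sampler for a G-component Gaussian mixture using the
# full conditionals derived above (hyperparameters are assumed values).
set.seed(5)
x <- c(rnorm(100, -2, 1), rnorm(100, 3, 1)); n <- length(x)
G <- 2; delta <- 1; phi <- 0; tau2 <- 100; alpha <- 2; beta <- 2
w <- rep(1 / G, G); mu <- range(x); sigma2 <- rep(var(x), G)  # crude starts
for (t in 1:500) {
  # labels: discrete full conditional, renormalised per observation
  p <- sapply(1:G, function(k) w[k] * dnorm(x, mu[k], sqrt(sigma2[k])))
  z <- apply(p, 1, function(row) sample(1:G, 1, prob = row))
  ng <- tabulate(z, nbins = G)
  # weights: Dirichlet(n_g + delta) via normalised gamma draws
  g <- rgamma(G, shape = ng + delta); w <- g / sum(g)
  for (k in 1:G) {
    # means: normal full conditional
    prec <- ng[k] / sigma2[k] + 1 / tau2
    m <- (sum(x[z == k]) / sigma2[k] + phi / tau2) / prec
    mu[k] <- rnorm(1, m, sqrt(1 / prec))
    # variances: Inv-Gamma(n_g/2 + alpha, SS/2 + beta)
    sigma2[k] <- 1 / rgamma(1, shape = ng[k] / 2 + alpha,
                            rate = sum((x[z == k] - mu[k])^2) / 2 + beta)
  }
}
rbind(w, mu, sigma2)   # final state of the chain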

6.7 The Metropolis-Hastings Algorithm

This algorithm constructs a Markov chain θ^(1), θ^(2), . . . , θ^(t), . . . by defining the
transition probability from θ^(t) to θ^(t+1) as follows:
Let q(θ, θ′) denote a proposal distribution such that if θ = θ^(t), then θ′ is a proposed
next value for the chain, i.e. θ′ is a proposed value for θ^(t+1). However, a further
randomization then takes place.
With some probability α(θ, θ′) we actually accept θ^(t+1) = θ′; otherwise we set
θ^(t+1) = θ^(t). This construction defines a Markov chain with transition probabilities
given by

p(θ, θ′) = q(θ, θ′) α(θ, θ′) + I(θ′ = θ) [ 1 - ∫ q(θ, θ″) α(θ, θ″) dθ″ ]

where I(·) is an indicator function.
If we now set

α(θ, θ′) = min{ 1 , π(θ′|x) q(θ′, θ) / (π(θ|x) q(θ, θ′)) }

then one can show

π(θ|x) q(θ, θ′) α(θ, θ′) = π(θ′|x) q(θ′, θ) α(θ′, θ)

This is called the detailed balance condition, & it is a sufficient condition to ensure
that π(θ|x) is the stable distribution of the chain. Thus, we only require the functional
form of the posterior (the normalising constant π(x) cancels in the ratio).
In practice we generally take q(θ, θ′) to be a normal distribution

N(θ, σ²_prop I)

where I is the identity matrix. The behaviour of the chain will depend on the value of
σ²_prop; generally we tune this to give an acceptance rate of 25-40%.
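A minimal random-walk Metropolis sketch in R (an illustration, not from the notes).
A standard normal log-density stands in for the log-posterior log π(θ|x); with a
symmetric normal proposal the q terms cancel in the acceptance ratio, and sd_prop
plays the role of σ_prop above.

# Sketch: random-walk Metropolis; log_post is any unnormalised
# log-posterior (here a standard normal stand-in).
set.seed(6)
log_post <- function(theta) dnorm(theta, 0, 1, log = TRUE)
N <- 10000; sd_prop <- 2.4        # tune for 25-40% acceptance
theta <- numeric(N); accepted <- 0
for (t in 2:N) {
  prop <- rnorm(1, theta[t - 1], sd_prop)               # proposal
  log_alpha <- log_post(prop) - log_post(theta[t - 1])  # q terms cancel
  if (log(runif(1)) < log_alpha) {
    theta[t] <- prop; accepted <- accepted + 1
  } else theta[t] <- theta[t - 1]
}
accepted / N                      # empirical acceptance rate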

7 Spatial Processes

A Tutorials

A.1 Tutorial 1

Problems 1
1. 5 white & 5 black balls; 2 urns, five balls in each.
Let X_n = # white balls in the left urn.
At each step pick at random 1 ball from each urn & drop it into the other urn. Then:

P(X_{n+1} = i + 1|X_n = i) = ((5 - i)/5) · ((5 - i)/5) = (5 - i)²/25

(take white from right urn & black from left)

P(X_{n+1} = i - 1|X_n = i) = (i/5) · (i/5) = i²/25

(take black from right urn & white from left)

P(X_{n+1} = i|X_n = i) = (i/5)((5 - i)/5) + ((5 - i)/5)(i/5) = 2i(5 - i)/25

(take 2 of the same colour from the urns)

Thus, we have the transition probabilities of X_n.
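As a quick check (an illustration, not part of the tutorial), the transition matrix can be
assembled in R and its rows verified to sum to 1:

# Sketch: transition matrix of the urn chain, states i = 0, ..., 5
# (number of white balls in the left urn).
P <- matrix(0, 6, 6, dimnames = list(0:5, 0:5))
for (i in 0:5) {
  if (i < 5) P[i + 1, i + 2] <- (5 - i)^2 / 25  # i -> i + 1
  if (i > 0) P[i + 1, i]     <- i^2 / 25        # i -> i - 1
  P[i + 1, i + 1] <- 2 * i * (5 - i) / 25       # i -> i
}
rowSums(P)   # each row should sum to 1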


Extension: Bernoulli-Laplace model of diffusion.
b black balls & 2m - b white balls; draws work in the same way, but now there are m
balls in each urn.
Let X_n = # black balls in the left urn. Then

P(X_{n+1} = i + 1|X_n = i) = ((m - i)/m) · ((b - i)/m)

(draw white from the left urn & black from the right; of the b black balls, b - i are in
the right urn)

Exercise:
P(X_{n+1} = i - 1|X_n = i) = ?   P(X_{n+1} = i|X_n = i) = ?

2. (Gambler's ruin, N = 4)
p(i, i + 1) = .4, p(i, i - 1) = .6.
Stop if i reaches 4 (p(4, 4) = 1) or if i reaches 0 (p(0, 0) = 1).
Since the games are independent:

p³(1, 4) = (.4)³ = .064   (the path X_n = 1 → X_{n+1} = 2 → X_{n+2} = 3 → X_{n+3} = 4)

p³(1, 0) = .6 + .4(.6²) = .744   (either 1 → 0, staying at 0 thereafter, or 1 → 2 → 1 → 0)
Alternative method:

     [ 1   0   0   0   0  ]
     [ .6  0   .4  0   0  ]
P =  [ 0   .6  0   .4  0  ]
     [ 0   0   .6  0   .4 ]
     [ 0   0   0   0   1  ]

(states 0, 1, 2, 3, 4)
Compute P³, the matrix of 3-step transition probabilities, and simply read off the
required values.
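For instance, in R (a small illustrative check, not part of the tutorial):

# Sketch: 3-step transition matrix for the gambler's ruin chain.
P <- matrix(c(1,  0,  0,  0,  0,
              .6, 0,  .4, 0,  0,
              0,  .6, 0,  .4, 0,
              0,  0,  .6, 0,  .4,
              0,  0,  0,  0,  1),
            nrow = 5, byrow = TRUE, dimnames = list(0:4, 0:4))
P3 <- P %*% P %*% P
P3["1", "4"]   # 0.064
P3["1", "0"]   # 0.744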
3. General two-state chain; state space S = {1, 2}.

P = [ 1 - a     a   ]
    [   b     1 - b ]

Use the Markov property to show that

P(X_{n+1} = 1) - b/(a + b) = (1 - a - b) [ P(X_n = 1) - b/(a + b) ]

Now,

P(X_{n+1} = 1) = P(X_n = 1)P(X_{n+1} = 1|X_n = 1) + P(X_n = 2)P(X_{n+1} = 1|X_n = 2)
               = P(X_n = 1)(1 - a) + P(X_n = 2) b
               = P(X_n = 1)(1 - a) + (1 - P(X_n = 1)) b
               = (1 - a - b) P(X_n = 1) + b

so

P(X_{n+1} = 1) - b/(a + b) = (1 - a - b) P(X_n = 1) + (b(a + b) - b)/(a + b)
                           = (1 - a - b) [ P(X_n = 1) - b/(a + b) ]
And hence,

P(X_n = 1) - b/(a + b) = (1 - a - b) [ P(X_{n-1} = 1) - b/(a + b) ]
                       = (1 - a - b)² [ P(X_{n-2} = 1) - b/(a + b) ]
                       = . . .
                       = (1 - a - b)^n [ P(X_0 = 1) - b/(a + b) ]

If |1 - a - b| < 1 ⟺ 0 < a + b < 2, then

lim_{n→∞} P(X_n = 1) = b/(a + b) + lim_{n→∞} (1 - a - b)^n [ P(X_0 = 1) - b/(a + b) ]
                     = b/(a + b)
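This limit is easy to check numerically (an illustration, not part of the tutorial), for
assumed values of a and b:

# Sketch: iterate the two-state chain and compare P(X_n = 1)
# with the limit b / (a + b).
a <- 0.3; b <- 0.5
P <- matrix(c(1 - a, a, b, 1 - b), nrow = 2, byrow = TRUE)
p <- c(1, 0)                  # start in state 1
for (n in 1:50) p <- p %*% P  # distribution after n steps
c(p[1], b / (a + b))          # should agree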

A.2 Tutorial 2

Problems 2
1.
π = (11/47, 19/47, 17/47) ;   π = (1/3, 1/3, 1/3)

2.
π_1 = (.4, .6) ;   π_2 = (6/35, 7/35, 22/35)

3. Machine; shocks of types i = 1, 2, 3 arrive as independent Poisson processes with
rates λ_i.

Part 1 fails on shocks of types 1 & 3; Part 2 fails on shocks of types 2 & 3.

Let U and V be the failure times of parts 1 and 2, respectively.

(a) Find P(U > s, V > t).
For U > s, V > t, we need:
i. no shocks of type 1 before time s;
ii. no shocks of type 2 before time t;
iii. no shocks of type 3 before time max(t, s).
Shocks arrive according to a Poisson process: we know that the time to first arrival is
exponential, i.e. if T ~ exp(λ) then F_T(t) = P(T ≤ t) = 1 - e^(-λt) and the survival
function is S_T(t) = P(T > t) = e^(-λt).
The time to the first shock of type i is exp(λ_i), so:
i. has probability e^(-λ_1 s);
ii. has probability e^(-λ_2 t);
iii. has probability e^(-λ_3 max(t,s)).

⟹ P(U > s, V > t) = e^(-λ_1 s) e^(-λ_2 t) e^(-λ_3 max(t,s)) = e^(-λ_1 s - λ_2 t - λ_3 max(t,s))

(b) U & V are times to first arrival in a Poisson process ⟹ exponential:

U ~ exponential(λ_1 + λ_3) ,   V ~ exponential(λ_2 + λ_3)

(c) Are U & V independent?
If U & V were independent, then f_{U,V}(s, t) = f_U(s) f_V(t), and so

P(U > s, V > t) = ∫_s^∞ ∫_t^∞ f_{U,V}(s_1, t_1) ds_1 dt_1
                = ∫_s^∞ (λ_1 + λ_3) e^(-(λ_1+λ_3)s_1) ds_1 ∫_t^∞ (λ_2 + λ_3) e^(-(λ_2+λ_3)t_1) dt_1
                = e^(-(λ_1+λ_3)s) e^(-(λ_2+λ_3)t)
                = e^(-λ_1 s - λ_2 t - λ_3 (s+t))
                ≠ e^(-λ_1 s - λ_2 t - λ_3 max(t,s)) = P(U > s, V > t)

So U & V are not independent.


4. T_1, . . . , T_n independent exponentials with rates λ_1, . . . , λ_n.

(a) Show T = min(T_1, . . . , T_n) ~ exp(Σ_{j=1}^n λ_j).

The distribution of T ⟺ P(T ≤ t) = F_T(t).
Consider P(T > t): if T > t, then we must have that each of the T_j is greater than t.

P(T > t) = ∏_{j=1}^n P(T_j > t) = ∏_{j=1}^n e^(-λ_j t) = e^(-(Σ_{j=1}^n λ_j) t)

which is the survival function of an exponential RV with rate Σ_{j=1}^n λ_j.

(b) Show

P(T_i < T_j) = λ_i/(λ_i + λ_j) ,   i ≠ j

P(T_i < T_j) = ∫_0^∞ f_{T_i}(t) P(T_j > t) dt
             = ∫_0^∞ λ_i e^(-λ_i t) e^(-λ_j t) dt
             = (λ_i/(λ_i + λ_j)) ∫_0^∞ (λ_i + λ_j) e^(-(λ_i+λ_j)t) dt   (density of exp(λ_i + λ_j))
             = λ_i/(λ_i + λ_j)
(c) T_1, . . . , T_n exponential with rates λ_1, . . . , λ_n. Show

P(T_i = min(T_1, . . . , T_n)) = λ_i / Σ_{j=1}^n λ_j

P(T_i = min(T_1, . . . , T_n)) = ∫_0^∞ f_{T_i}(t) ∏_{j≠i} P(T_j > t) dt
                               = ∫_0^∞ λ_i e^(-λ_i t) ∏_{j≠i} e^(-λ_j t) dt
                               = λ_i ∫_0^∞ e^(-(Σ_{j=1}^n λ_j) t) dt
                               = λ_i [ -e^(-(Σ_j λ_j) t) / Σ_{j=1}^n λ_j ]_0^∞
                               = λ_i / Σ_{j=1}^n λ_j

5. X_1 ~ Poisson(λ_1), X_2 ~ Poisson(λ_2), independent. Show X_1 + X_2 ~ Poisson(λ_1 + λ_2).

P(X_1 + X_2 = k) = Σ_{m=0}^k P(X_1 = m) P(X_2 = k - m)
                 = Σ_{m=0}^k (λ_1^m e^(-λ_1)/m!) (λ_2^(k-m) e^(-λ_2)/(k - m)!)
                 = (e^(-(λ_1+λ_2))/k!) Σ_{m=0}^k (k!/(m!(k - m)!)) λ_1^m λ_2^(k-m)
                 = (e^(-(λ_1+λ_2))/k!) (λ_1 + λ_2)^k

⟹ the probability mass function of Poisson(λ_1 + λ_2).
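Both results are easy to verify by simulation (an illustration, not part of the tutorial):

# Sketch: simulation checks of 4(c) and 5, with assumed rates.
set.seed(7)
lam <- c(1, 2, 3)
# 4(c): P(T_1 = min) should be lam[1] / sum(lam) = 1/6
Tmat <- sapply(lam, function(l) rexp(1e5, rate = l))
mean(apply(Tmat, 1, which.min) == 1)   # approx 0.167
# 5: X1 + X2 should be Poisson(1 + 2) = Poisson(3)
s <- rpois(1e5, 1) + rpois(1e5, 2)
c(mean(s), var(s))                     # both approx 3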


6. Later

7. Later
