1.1 Theory
I develop the theoretical ideas within the context of models with discrete state and action spaces. Models that are continuous in either dimension can be approximated by a discretized version in order to apply the techniques presented here.
For a more in-depth overview, see the quant-econ.net lecture on discrete dynamic programming.
1.1.1 Discrete Dynamic Programs
I first repeat the formal definition of a discrete dynamic program that can be found in the quant-econ.net
lecture.
A discrete dynamic program consists of:

- A finite, discrete set $S = \{1, 2, \ldots, n\}$ of states.
- A finite set of feasible actions $A(s)$ for each state $s \in S$. Denote by $SA := \{(s, a) \,|\, s \in S,\, a \in A(s)\}$ the set of feasible state-action pairs.
- The set $A := \bigcup_{s \in S} A(s)$, which we call the action space.
- A reward function $r : SA \to \mathbb{R}$.
- A transition probability function $Q : SA \to \Delta(S)$, where $\Delta(S)$ is the set of probability distributions over $S$.
- A discount factor $\beta \in (0, 1)$.

A policy is a function $\sigma : S \to A$. A policy is feasible if it satisfies $\sigma(s) \in A(s)$ for all $s \in S$. Let $\Sigma$ denote the set of feasible policies.
I will represent the function $Q$ as $Q(s, a, s')$ where $s \in S$ and $a \in A(s)$. $Q$ satisfies $\sum_{s' \in S} Q(s, a, s') = 1$ and $Q(s, a, s') \in [0, 1]$ for all $s' \in S$.
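To make these objects concrete, here is a minimal sketch (my own illustration, not code from the lecture) that builds a two-state, two-action instance with QuantEcon.jl's DiscreteDP type, assuming its convention that R[s, a] stores r(s, a), Q[s, a, s'] stores the transition probabilities, and infeasible state-action pairs are marked with a reward of -Inf:

using QuantEcon

# A toy discrete dynamic program with n = 2 states and m = 2 actions.
R = [5.0  10.0;
     -1.0 -Inf]             # -Inf marks (s=2, a=2) as infeasible
Q = zeros(2, 2, 2)          # Q[s, a, s'] = Q(s, a, s')
Q[1, 1, :] = [0.5, 0.5]
Q[1, 2, :] = [0.0, 1.0]
Q[2, 1, :] = [0.0, 1.0]
Q[2, 2, :] = [0.5, 0.5]     # never used; the pair is infeasible
beta = 0.95                 # discount factor

ddp = DiscreteDP(R, Q, beta)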
For each $\sigma \in \Sigma$ define

$$v_\sigma := \sum_{t=0}^{\infty} \beta^t Q_\sigma^t r_\sigma,$$

where $r_\sigma \in \mathbb{R}^{|S|}$ is the vector with entries $r(s, \sigma(s))$ and $Q_\sigma$ is the $|S| \times |S|$ matrix with entries $Q(s, \sigma(s), s')$.

Define the Bellman operator $T : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ as

$$(T v)(s) = \max_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} Q(s, a, s')\, v(s') \right\}.$$
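As an illustration of what this operator computes, here is a direct implementation under the R, Q layout above (a sketch, not the QuantEcon.jl implementation):

# Apply the Bellman operator once: returns the vector Tv.
function bellman_operator(R, Q, beta, v)
    n, m = size(R)
    Tv = zeros(n)
    for s in 1:n
        best = -Inf
        for a in 1:m
            ev = 0.0                    # expected continuation value
            for sp in 1:n
                ev += Q[s, a, sp] * v[sp]
            end
            # infeasible actions carry R[s, a] = -Inf, so they
            # never attain the max
            best = max(best, R[s, a] + beta * ev)
        end
        Tv[s] = best
    end
    Tv
end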
These definitions lead to the following theorem, where $v^*$ denotes the optimal value function, the unique fixed point of $T$. For any $v \in \mathbb{R}^{|S|}$:

1. If $v \ge T v$, then $v \ge v^*$.
2. If $v \le T v$, then $v \le v^*$.
3. If $v = T v$, then $v = v^*$.
I will sketch the proof of part (1). Part (2) follows in an analogous way, and part (3) can be obtained by combining parts (1) and (2).
Let $\sigma_0, \sigma_1$ be two feasible policies (not necessarily optimal) and suppose $v \ge T v$. Then

$$
\begin{aligned}
v \ge T v &= \max_{a \in A(s)} \left\{ r(s, a) + \beta \sum_{s' \in S} Q(s, a, s')\, v(s') \right\} \\
&\ge r_{\sigma_0} + \beta Q_{\sigma_0} v \\
&\ge r_{\sigma_0} + \beta Q_{\sigma_0} \left( r_{\sigma_1} + \beta Q_{\sigma_1} v \right) \\
&= r_{\sigma_0} + \beta Q_{\sigma_0} r_{\sigma_1} + \beta^2 Q_{\sigma_0} Q_{\sigma_1} v.
\end{aligned}
$$
For a sequence of feasible policies $\sigma_0, \sigma_1, \ldots$, define $Q_{\sigma,(t)} := Q_{\sigma_0} Q_{\sigma_1} \cdots Q_{\sigma_{t-1}}$, with $Q_{\sigma,(0)} := I$.
We can continue the string of inequalities from above and by induction write

$$v \ge \sum_{j=0}^{t-1} \beta^j Q_{\sigma,(j)} r_{\sigma_j} + \beta^t Q_{\sigma,(t)} v.$$
Consider following some fixed policy $\sigma$, i.e. take $\sigma_0 = \sigma_1 = \cdots = \sigma$, so that $v_\sigma = \sum_{j=0}^{\infty} \beta^j Q_{\sigma,(j)} r_\sigma$. Breaking this infinite summation for $v_\sigma$ into its first $t$ terms and a tail, we can re-write the previous inequality as

$$v \ge v_\sigma + \beta^t Q_{\sigma,(t)} v - \sum_{j=t}^{\infty} \beta^j Q_{\sigma,(j)} r_\sigma$$

$$v \ge \lim_{t \to \infty} \left[ v_\sigma + \beta^t Q_{\sigma,(t)} v - \sum_{j=t}^{\infty} \beta^j Q_{\sigma,(j)} r_\sigma \right]$$
The limit above exists when $r$ is bounded: each $Q_{\sigma,(j)}$ is a stochastic matrix and $\beta \in (0, 1)$, so both $\beta^t Q_{\sigma,(t)} v$ and the tail sum vanish as $t \to \infty$, leaving $v \ge v_\sigma$. The previous inequality is true for any feasible policy, therefore it must also be true for the optimal policy. Thus we have that

$$v \ge v_\sigma \quad \forall \sigma \in \Sigma \quad \Longrightarrow \quad v \ge v^*.$$
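A quick numerical check of part (1), using the toy ddp and the bellman_operator sketch from above (assuming QuantEcon.jl's solve(ddp, PFI), whose result stores the optimal value function in the field v): for any constant $c > 0$, the vector $v = v^* + c$ satisfies $T v = v^* + \beta c \le v$, so the theorem predicts $v \ge v^*$.

# verify both the hypothesis and the conclusion of part (1)
v_star = solve(ddp, PFI).v          # optimal value function
v = v_star .+ 1.0                   # v = v* + c with c = 1
Tv = bellman_operator(R, Q, beta, v)
@assert all(v .>= Tv)               # hypothesis: v >= Tv
@assert all(v .>= v_star)           # conclusion: v >= v*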
1.1.2 Mathematical Programs
Consider the following mathematical program:

$$\min_{v \in \mathbb{R}^{|S|}} \; b' v \quad \text{s.t.} \quad v \ge T v,$$

where $b \in \mathbb{R}^{|S|}$ is a strictly positive vector.
By the theorem above, every vector $v \in \mathbb{R}^{|S|}$ in the constraint set satisfies $v \ge v^*$, and $v^*$ itself is feasible because $v^* = T v^*$. The objective is to minimize the inner product between the value function and a strictly positive vector. Because every feasible $v$ satisfies $v \ge v^*$, the unique solution to this mathematical program is the unique fixed point of the Bellman operator: $v^*$.
The objective function of this program is linear in $v$, but even in our discrete setting the constraint is non-linear because the operator $T$ includes a max.
Notice that $v \ge T v$ is a set of $|S|$ constraints. If we look at the constraint for a particular $s \in S$, we notice the following implication:

$$v(s) \ge r(s, a) + \beta \sum_{s' \in S} Q(s, a, s')\, v(s') \quad \forall a \in A(s) \implies v(s) \ge (T v)(s).$$

This means that for each $s \in S$ the single non-linear constraint $v(s) \ge (T v)(s)$ can be replaced with a system of linear inequalities, one for each feasible action at that state.
This insight motivates the following linear program:
$$\min_{v \in \mathbb{R}^{|S|}} \; b' v \quad \text{s.t.} \quad v(s) \ge r(s, a) + \beta \sum_{s' \in S} Q(s, a, s')\, v(s') \quad \forall (s, a) \in SA.$$
1.2 Numerical Implementation
The Julia library MathProgBase provides a unified interface to many industrial-quality linear programming solvers. The interface is documented here and is accessible via the linprog function.
The signature of the linprog function is: linprog(b, A, ">", lb).
The linear program solved by this method is

$$\min_{v} \; b' v \quad \text{s.t.} \quad A v \ge \text{lb}.$$
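For illustration, here is a toy instance of this form (my own example; the choice of ClpSolver is an assumption, and any MathProgBase-compatible LP solver would do):

using MathProgBase
using Clp

# minimize b'v subject to A*v >= lb (the short form of linprog also
# imposes v >= 0 by default, which is harmless for this instance)
b = [1.0, 1.0]
A = [1.0 0.0;
     0.0 1.0]
lb = [2.0, 3.0]
sol = linprog(b, A, '>', lb, ClpSolver())
sol.sol                     # the optimal v; here [2.0, 3.0]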
I can recast the constraint of my linear program into this framework by following these steps:
$$v(s) \ge r(s, a) + \beta \sum_{s' \in S} Q(s, a, s')\, v(s') \quad \forall (s, a) \in SA$$

$$v \ge r(:, a) + \beta Q(:, a, :)\, v \quad \forall a \in A$$

$$\left( I - \beta Q(:, a, :) \right) v \ge r(:, a) \quad \forall a \in A$$

Stacking the matrices $I - \beta Q(:, a, :)$ vertically across actions gives $A$, and stacking the vectors $r(:, a)$ gives lb.
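The helper linprog_constraints used in the benchmark below is not shown here, so the following is a hypothetical sketch of how such a helper could assemble $A$ and lb from the R and Q arrays (a real implementation would presumably accept a DiscreteDP and use sparse matrices):

# Build A and lb so that A*v >= lb encodes v >= Tv: one row per
# feasible (s, a) pair, equal to e_s - beta * Q[s, a, :], with
# right-hand side r(s, a).
function linprog_constraints(R, Q, beta)
    n, m = size(R)
    A = zeros(0, n)                     # grow row by row (sketch only)
    lb = Float64[]
    for a in 1:m, s in 1:n
        isfinite(R[s, a]) || continue   # skip infeasible pairs
        row = -beta * vec(Q[s, a, :])
        row[s] += 1.0                   # add the e_s part
        A = vcat(A, row')
        push!(lb, R[s, a])
    end
    A, lb
end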
1.3 Example
I will now consider a very simple example of a discrete dynamic program. This example is discussed in detail in this section of the quant-econ.net lecture.
I write a function to create an instance of DiscreteDP for the example growth model from the lecture.
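A sketch of what such a growth_ddp function might look like, reconstructed from the growth model in the lecture (the defaults alpha=0.5 and beta=0.9 are assumptions): the state $s$ is the current stock of goods, the action $a$ is the amount stored, consumption is $s - a$ with utility $c^\alpha$, and next period's stock is $a$ plus a shock distributed uniformly on $\{0, \ldots, B\}$.

function growth_ddp(;B=10, M=5, alpha=0.5, beta=0.9)
    n = B + M + 1                  # states s in {0, ..., B + M}
    m = M + 1                      # actions a in {0, ..., M}
    R = fill(-Inf, n, m)           # -Inf marks infeasible pairs (a > s)
    Q = zeros(n, m, n)
    for a in 0:M, s in 0:(B + M)
        if a <= s
            R[s+1, a+1] = (s - a)^alpha    # u(c) = c^alpha
        end
        for sp in a:(a + B)        # s' = a + shock, shock ~ U{0, ..., B}
            Q[s+1, a+1, sp+1] = 1 / (B + 1)
        end
    end
    DiscreteDP(R, Q, beta)
end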
In [8]: # define helper methods to wrap the PFI, VFI, and MPFI
        # methods from QuantEcon.jl so we isolate the computation time
        # needed to compute the solution from the time needed to do
        # post-solution packaging.
        for (nm, typ) in [(:pfi, PFI), (:vfi, VFI), (:mpfi, MPFI)]
            @eval function $(nm)(ddp::QuantEcon.DDP)
                ddpr = QuantEcon.DPSolveResult{$(typ),Float64}(ddp)
                QuantEcon._solve!(ddp, ddpr, 1000, 1e-7, 200)
                ddpr
            end
        end

        function horse_race(other_solvers,
                            Bs=[10, 20, 50, 100, 250, 500],
                            Ms=[5, 10, 20, 50])
            nB = length(Bs)
            nM = length(Ms)
            times = zeros(Float64, length(other_solvers)+1, nB, nM)
            build_constraints_time = zeros(nB, nM)

            # run once to compile each routine
            ddp = growth_ddp(B=minimum(Bs), M=minimum(Ms))
            linprog(ddp)
            [f(ddp) for f in other_solvers]

            for (i_M, M) in enumerate(Ms), (i_B, B) in enumerate(Bs)
                ddp = growth_ddp(B=B, M=M)

                # handle linprog separately so we can separate the time to
                # build A from the time to run linprog, just as we separated
                # the time to solve via PFI and VFI from the time needed to
                # construct the controlled MarkovChain
                A, lb = linprog_constraints(ddp)
                tic()
                linprog(ddp, A, lb)
                t = toq()
                times[1, i_B, i_M] = t

                for (i_solver, solver) in enumerate(other_solvers)
                    tic()
                    solver(ddp)
                    t = toq()
                    times[i_solver+1, i_B, i_M] = t
                end
            end

            Bs, Ms, times
        end
Out[8]: horse_race (generic function with 3 methods)
In [9]: other_solvers = [pfi, vfi, mpfi]
        Bs, Ms, times = horse_race(other_solvers)
Out[9]: ([10,20,50,100,250,500],[5,10,20,50],
        4x6x4 Array{Float64,3}: <solve times in seconds; row 1 is linprog,
        rows 2-4 are pfi, vfi, and mpfi>)
Notice that policy function iteration consistently outperforms all of the other methods, and that the linear programming approach is almost always the slowest.
This means that without altering the linprog algorithm (e.g. by using approximate linear programming), solving dynamic programs via linear programming is nice theoretically, but impractical.