
Bayesian Structural Time Series Models

Steven L. Scott

August 10, 2015

Welcome!

The goal for the day is to introduce you to:



basic ideas in structural time series modeling,

regression modeling with spike and slab priors, and

the bsts R package.

Course notes and materials at https://goo.gl/VUWUC9


Some good books


For structural time series, and time series in general.

Harvey

Durbin and Koopman

West and Harrison

Chatfield

Brockwell and Davis

Petris et al.


Introduction to time series modeling

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions


Introduction to time series modeling

Strategies for time series models

Regression

ARMA

Smoothing

Structural time series


Introduction to time series modeling

Regression models

Introductory statistics courses teach students to fit models like


$$y_t = \underbrace{\beta_0 + \beta_1 t}_{\text{linear}} + \beta_2 x_t + \epsilon_t.$$

1. The trend probably won't follow a parametric form.


2. Even if it does, the residuals will be autocorrelated.


Introduction to time series modeling

Airline passengers

An example from elementary textbooks

[Figure: monthly AirPassengers counts (left) and the same series on the log10 scale (right), 1949-1960.]

Introduction to time series modeling

Linear time trend doesn't quite fit


See air-passengers-bsts.R

air <- log10(AirPassengers)
time <- 1:length(air)
months <- time %% 12
months[months == 0] <- 12
months <- factor(months, labels = month.name)
reg <- lm(air ~ time + months)

[Figure: residuals plotted against fitted values from the linear trend regression, showing clear curvature.]


Quadratic time trend

Misses serial correlation

reg <- lm(air ~ poly(time, 2) + months)
plot(reg$residuals)
acf(reg$residuals)

[Figure: residuals from the quadratic-trend fit plotted against time.]
Predictions between months 80 and 100 are predictably too low; between months 100 and 120 they are predictably too high.

Serial correlation is cured by locality.

[Figure: autocorrelation function (ACF) of the residuals, showing strong serial correlation.]

Introduction to time series modeling

ARMA models
ARMA(P, Q) models have the form
$$y_t = \sum_{p=1}^{P} \phi_p y_{t-p} + \sum_{q=0}^{Q} \theta_q \epsilon_{t-q}.$$

Some features that make ARMA models difficult:

1. $y_t$ must be stationary. If it is non-stationary, take differences until it becomes stationary.
2. If $y_t$ contains a seasonal component, then seasonal differencing is also required.
3. Harder to think about (regression of $y$ on $x$ vs. of $\nabla_{52}\nabla^2 y$ on $x$).

ARMA models can be written as a special case of state space models.


Introduction to time series modeling

Stationary vs Nonstationary
See code in stationary.R

sample.size <- 1000
number.of.series <- 1000
many.ar1 <- matrix(nrow = sample.size, ncol = number.of.series)
for (i in 1:number.of.series) {
  many.ar1[, i] <- arima.sim(model = list(ar = .95), n = sample.size)
}
many.random.walk <- matrix(nrow = sample.size, ncol = number.of.series)
for (i in 1:number.of.series) {
  many.random.walk[, i] <- cumsum(rnorm(sample.size))
}
par(mfrow = c(1, 2))
plot.ts(many.ar1, plot.type = "single")
plot.ts(many.random.walk, plot.type = "single")

Introduction to time series modeling

What it looks like


Single series

[Figure: one simulated AR(1) series, $y_t = 0.95\, y_{t-1} + \epsilon_t$ (left), and one simulated random walk, $y_t = y_{t-1} + \epsilon_t$ (right).]

Introduction to time series modeling

What it looks like


Many series

[Figure: all 1000 simulated AR(1) series, $y_t = 0.95\, y_{t-1} + \epsilon_t$ (left), and all 1000 simulated random walks, $y_t = y_{t-1} + \epsilon_t$ (right).]

Introduction to time series modeling

Variance
AR(1)
$$y_t = \phi y_{t-1} + \epsilon_t = \phi(\phi y_{t-2} + \epsilon_{t-1}) + \epsilon_t = \cdots = \sum_{i=0}^{t} \phi^i \epsilon_{t-i}.$$
If $|\phi| < 1$ then as $t \to \infty$, $\mathrm{Var}(y_t) \to \mathrm{Var}(\epsilon_t)/(1 - \phi^2)$.

Random walk
$$y_t = \sum_{i=0}^{t} \epsilon_{t-i}, \qquad \mathrm{Var}(y_t) = t\sigma^2.$$
The variance diverges to $\infty$.

Introduction to time series modeling

Smoothing
Exponential smoothing
$$s_t = \alpha y_t + (1 - \alpha) s_{t-1}$$
turns out to be the Kalman filter for the local level model.

Holt-Winters or double exponential smoothing captures a trend:
$$s_t = \alpha y_t + (1 - \alpha)(s_{t-1} + b_{t-1})$$
$$b_t = \beta (s_t - s_{t-1}) + (1 - \beta) b_{t-1}$$
This is the Kalman filter for the local linear trend model.

Triple exponential smoothing can handle seasonality as well, but the formulas are getting ridiculous!

What happens if you want to include a regression component?

Confidence about the smoothed estimate?

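As a point of comparison, these smoothers are available in base R through HoltWinters(); a minimal sketch (not from the original slides, reusing the AirPassengers series from earlier):

y <- log10(AirPassengers)
## Simple exponential smoothing: level only.
ses <- HoltWinters(y, beta = FALSE, gamma = FALSE)
## Double exponential smoothing: level + trend.
des <- HoltWinters(y, gamma = FALSE)
## Triple exponential smoothing: level + trend + seasonal.
tes <- HoltWinters(y)
predict(tes, n.ahead = 24)  ## point forecasts only: no regression component, no state uncertainty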

Introduction to time series modeling

Advantages of structural time series models

All the flexibility of regression models.

The locality of ARMA models and smoothing.

Can handle non-stationarity.

Modular, so easy to combine with other additive components.

All those smoothing parameters become variances that can be estimated from the data.


Structural time series models

Outline
Introduction to time series modeling
Structural time series models
Models for trend
Modeling seasonality
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions

Structural time series models

Structural time series models


State space form

There are two pieces to a structural time series model


Observation equation
$$y_t = Z_t^T \alpha_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, H_t)$$

$y_t$ is the observed data at time $t$.
$\alpha_t$ is a vector of latent variables (the state).
$Z_t$ and $H_t$ are structural parameters (partly known).

Transition equation
$$\alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \qquad \eta_t \sim \mathcal{N}(0, Q_t)$$

$T_t$, $R_t$, and $Q_t$ are structural parameters (partly known).
$\eta_t$ may be of lower dimension than $\alpha_t$.

Structural time series models

Structural time series models are modular


Add your favorite trend, seasonal, regression, holiday, etc. models to the mix

[Diagram: the observation vector $Z_t$ and transition matrix $T_t$ are assembled by stacking trend, seasonal, and regression components into one state vector.]

Structural time series models

Example
The basic structural model with a regression effect and $S$ seasons can be written
$$y_t = \underbrace{\mu_t}_{\text{trend}} + \underbrace{\tau_t}_{\text{seasonal}} + \underbrace{\beta^T x_t}_{\text{regression}} + \epsilon_t$$
$$\mu_t = \mu_{t-1} + \delta_{t-1} + u_t$$
$$\delta_t = \delta_{t-1} + v_t$$
$$\tau_t = -\sum_{s=1}^{S-1} \tau_{t-s} + w_t$$

Local linear trend: level $\mu_t$ + slope $\delta_t$.

Seasonal: $S - 1$ dummy variables with time-varying coefficients. Sums to zero in expectation.


Structural time series models

Models for trend

Some models for trend

Local level

Local linear trend

Generalized local linear trend

Autoregressive models


Structural time series models

Models for trend

Understanding the local level model


The local level model is
$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
$$\mu_t = \mu_{t-1} + \eta_{t-1}, \qquad \eta_t \sim \mathcal{N}(0, \tau^2)$$

A compromise between the random walk model (when $\sigma^2 = 0$) and the constant mean model (when $\tau^2 = 0$).

In the random walk model, your forecast of the future (given data to time $t$) is $y_t$.
In the constant mean model, your forecast is $\bar{y}$.
The larger the ratio $\sigma^2 / \tau^2$, the closer this model is to the constant mean model.

In state space form:
$$T_t = 1, \quad Z_t = 1, \quad R_t = 1, \quad H_t = \sigma^2, \quad Q_t = \tau^2$$

Structural time series models

Models for trend

Simulating the local level model


local-level.R

[Figure: simulated local level series with (tau = 1, sigma = 0), a pure random walk; (tau = 0, sigma = 1), noise around a constant level; and (tau = 1, sigma = 0.5).]
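A hypothetical reconstruction of what local-level.R might contain (the function name is mine, not from the course files):

SimulateLocalLevel <- function(n = 100, tau = 1, sigma = 0.5) {
  ## The level follows a random walk with innovation sd tau;
  ## observations add independent noise with sd sigma.
  mu <- cumsum(rnorm(n, sd = tau))
  mu + rnorm(n, sd = sigma)
}

par(mfrow = c(3, 1))
plot.ts(SimulateLocalLevel(tau = 1, sigma = 0))    ## pure random walk
plot.ts(SimulateLocalLevel(tau = 0, sigma = 1))    ## constant mean plus noise
plot.ts(SimulateLocalLevel(tau = 1, sigma = 0.5))  ## local level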

Structural time series models

Models for trend

Local linear trend


local-linear-trend.R
The model is
$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
$$\mu_t = \mu_{t-1} + \delta_{t-1} + \eta_{\mu,t-1}, \qquad \eta_{\mu,t} \sim \mathcal{N}(0, \sigma_\mu^2)$$
$$\delta_t = \delta_{t-1} + \eta_{\delta,t-1}, \qquad \eta_{\delta,t} \sim \mathcal{N}(0, \sigma_\delta^2)$$

We normally think of a linear trend as $y = \beta_0 + \beta_1 t + \epsilon_t$. With change $\Delta t$, the expected increase in $y$ is $\beta_1 \Delta t$. Now each $\Delta t = 1$, and $\beta_1 = \delta_t$ is a changing slope.

Neat fact! The posterior mean of the local linear trend model is a smoothing spline.


Simulating local linear trend

[Figure: three simulations of the local linear trend model with sigma.level = 1, sigma.slope = 0.25, sigma.obs = 0.5.]

Structural time series models

Modeling seasonality

Modeling seasonality
In the classroom regression model:
We used a dummy variable for each season.
Left one season out (set its coefficient to zero).

In state space models:
$$\tau_t = -\sum_{s=1}^{S-1} \tau_{t-s} + \eta_{t-1}$$
e.g. summer = $-$(spring + winter + fall) + noise.

Mean over the year is zero.
State is $S - 1$ dimensional.
Only one dimension of randomness.
$$Z_t = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \qquad
T_t = \begin{pmatrix} -1 & -1 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \qquad
R_t = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$$

Structural time series models

Modeling seasonality

Example
Modeling the air passengers data

data(AirPassengers)
y <- log10(AirPassengers)
ss <- AddLocalLinearTrend(
  list(),          ## No previous state specification.
  y)               ## Peek at the data to specify default priors.
ss <- AddSeasonal(
  ss,              ## Adding state to ss.
  y,               ## Peeking at the data.
  nseasons = 12)   ## 12 "seasons"
model <- bsts(y, state.specification = ss, niter = 1000)
plot(model)
plot(model, "help")
plot(model, "comp")   ## "components"
plot(model, "resid")  ## "residuals"

Structural time series models

Modeling seasonality

Posterior distribution of state

[Figure: posterior distribution of the state at each time point, plotted over the log10 AirPassengers data, 1950-1960.]

plot(model)

Fuzzy line shows posterior distribution of state at time t.

Blue dots are actual observations.


Structural time series models

Modeling seasonality

Contributions from each component

[Figure: contributions of the trend component (left) and the seasonal.12.1 component (right), plotted on a common scale.]

plot(model, "comp")  ## "components"

Structural time series models

Modeling seasonality

Contributions from each component


[Figure: the same component contributions, each plotted on its own scale.]

plot(model, "comp", same.scale = FALSE)  ## "components"

Evolution of the seasonal component

[Figure: posterior distribution of each of the 12 seasonal effects (Season 1 through Season 12) over 1950-1960, showing how each month's contribution evolves.]

Structural time series models

Modeling seasonality

Setting priors

AddLocalLinearTrend(
  state.specification = NULL,
  y,
  level.sigma.prior = NULL,     # SdPrior
  slope.sigma.prior = NULL,     # SdPrior
  initial.level.prior = NULL,   # NormalPrior
  initial.slope.prior = NULL,   # NormalPrior
  sdy,
  initial.y)

Structural time series models

Modeling seasonality

Priors on standard deviations

SdPrior(sigma.guess,
        sample.size = .01,
        initial.value = sigma.guess,
        fixed = FALSE,
        upper.limit = Inf)

This puts a gamma prior on $1/\sigma^2$.
Shape ($\alpha$) = sample.size / 2.
Scale ($\beta$) = sigma.guess$^2$ $\times$ sample.size / 2.
If you specify an upper limit on $\sigma$ then the support will be truncated.


Structural time series models

Modeling seasonality

What's in the model object

Varies depending on how the function was called.

> names(model)
 [1] "sigma.obs"                  "sigma.trend.level"
 [3] "sigma.trend.slope"          "sigma.seasonal.12"
 [5] "final.state"                "state.contributions"
 [7] "one.step.prediction.errors" "has.regression"
 [9] "state.specification"        "original.series"

MCMC draws of model parameters (each one is named).
Draws of the final state vector (used for forecasting).
Draws of each component's contributions to the state mean.
Draws of the one-step-ahead prediction errors (from the Kalman filter).
A logical value indicating whether the model has a (static) regression component.
The state specification you used to call the model.
A copy of the original data series.


Structural time series models

Modeling seasonality

Prediction
### Predict the next 24 periods.
pred <- predict(model, horizon = 24)

### Plot the prediction along with the last 36 observations
### from the training series.
plot(pred, plot.original = 36)

[Figure: the forecast distribution for the next 24 months, shown after the last 36 observations of the training series.]


MCMC and the Kalman filter

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions


MCMC and the Kalman filter

Gibbs sampling for state space models


1. Simulate $p(\alpha \mid \theta, y)$ using the Kalman filter and simulation smoother.
2. Simulate $p(\theta \mid \alpha, y)$.
3. Go to 1.

Simulating $p(\theta \mid \alpha, y)$ is done on a model-by-model basis, but for most models it is trivial.

For models with only variance parameters, compute the right sum of squared errors and draw the variances.

Example: the local level model.
$$p(\tau^2 \mid \alpha) \propto p(\tau^2)\, \tau^{-T} \exp\!\left( -\frac{1}{2\tau^2} \sum_t (\mu_t - \mu_{t-1})^2 \right)$$
If $p(1/\tau^2) = \mathrm{Ga}(df/2, ss/2)$ then
$$p\!\left(1/\tau^2 \mid \alpha\right) = \mathrm{Ga}\!\left( \frac{df + T}{2},\ \frac{ss + \sum_t (\mu_t - \mu_{t-1})^2}{2} \right)$$
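A small sketch of that variance draw (hypothetical code, not from bsts), for the local level model with a Ga(df/2, ss/2) prior on $1/\tau^2$:

DrawTauSquared <- function(mu, prior.df = 1, prior.ss = 1) {
  increments <- diff(mu)                      ## mu[t] - mu[t-1]
  shape <- (prior.df + length(increments)) / 2
  rate <- (prior.ss + sum(increments^2)) / 2
  1 / rgamma(1, shape = shape, rate = rate)   ## draw the precision, then invert
}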

MCMC and the Kalman filter

The Kalman filter


[Diagram: conditional independence graph of the state space model; each state $\alpha_t$ points to the next state $\alpha_{t+1}$ and to its observation $y_t$.]

The graph shows the conditional independence relationships among the latent and observed variables in the model.



MCMC and the Kalman filter

The Kalman filter


[Diagram: the same conditional independence graph, stepping through time.]

At time $t-1$ we start off knowing the mean and variance of $\alpha_{t-1}$ given $y_1, \dots, y_{t-2}$ (recursion).

Then we observe $y_{t-1}$.

The Kalman filter computes $p(\alpha_t \mid y_1, \dots, y_{t-1})$, and the incremental likelihood $p(y_{t-1} \mid y_1, \dots, y_{t-2})$.


MCMC and the Kalman filter

The Kalman equations


Recall the state space form of the model:
$$y_t = Z_t^T \alpha_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, H_t)$$
$$\alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \qquad \eta_t \sim \mathcal{N}(0, Q_t)$$

The Kalman filter recursively computes $P(\alpha_{t+1} \mid y_1, \dots, y_t) = \mathcal{N}(a_{t+1}, P_{t+1})$:
$$\begin{aligned}
v_t &= y_t - Z_t^T a_t && \text{(1-step prediction error)} \\
F_t &= Z_t^T P_t Z_t + H_t && \text{(forecast variance)} \\
K_t &= T_t P_t Z_t F_t^{-1} && \text{(Kalman gain \dots)} \\
a_{t+1} &= T_t a_t + K_t v_t && \text{(\dots which is a regression coefficient)} \\
P_{t+1} &= T_t P_t (T_t - K_t Z_t^T)^T + R_t Q_t R_t^T
\end{aligned}$$

The derivation is tedious, but elementary. You can use Bayes' rule, or properties of the multivariate normal. See [Durbin and Koopman (2012)] or many other sources.
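A hedged sketch of one filter step, specialized to the scalar local level model ($Z_t = T_t = R_t = 1$, $H_t = \sigma^2$, $Q_t = \tau^2$); this mirrors the equations above and is not the bsts implementation:

KalmanStep <- function(y, a, P, sigma2, tau2) {
  v <- y - a                       ## one-step prediction error
  F <- P + sigma2                  ## forecast variance
  K <- P / F                       ## Kalman gain
  list(a = a + K * v,              ## updated state mean
       P = P * (1 - K) + tau2,     ## updated state variance
       loglik = dnorm(y, a, sqrt(F), log = TRUE))  ## incremental likelihood
}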

MCMC and the Kalman filter

Forward and backward


The Kalman filter marches forward through the data, collecting information.

There are corresponding algorithms that march backward through the data, distributing information.

The Kalman smoother (useful for the EM algorithm) computes $p(\alpha_t \mid y)$.

The simulation smoother draws the full state path from its posterior [Carter and Kohn (1994), Frühwirth-Schnatter (1995), de Jong and Shepard (1995), Durbin and Koopman (2002)].

The output of the Kalman filter + simulation smoother is an exact draw from $p(\alpha \mid y, \theta)$.


MCMC and the Kalman filter

Simulation smoother
[Durbin and Koopman (2002)] thought of a clever way to simulate $p(\alpha \mid y)$.
1. Simulate data with the wrong mean, but the right variance.
2. Subtract off the wrong mean, and put in the right one.

The argument goes like this:
1. For a multivariate normal $(\alpha, y)$, $\mathrm{Var}(\alpha \mid y)$ is independent of $y$.
2. Simulate fake data $(\alpha^*, y^*)$ from the structural time series model. The conditional distribution $(\alpha^* \mid y^*)$ has the same variance as $(\alpha \mid y)$.
3. Subtract $E(\alpha^* \mid y^*)$ from your simulated $\alpha^*$'s, and add $E(\alpha \mid y)$.

[Durbin and Koopman (2012)] (Section 4.6.2) give a fast state smoother that can quickly compute $E(\alpha_t \mid y)$ (without computing each $P_t$).

The DK simulation smoother requires two Kalman filters (for $y$ and $y^*$) and two fast state smoothers.

It works even if $R_t$ is not full rank.


MCMC and the Kalman filter

Break time!

Let's take 15 minutes.


Bayesian regression and spike-and-slab priors

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions


Bayesian regression and spike-and-slab priors

Linear regression
Bayesian regression is just the ordinary linear model
$$y_{n \times 1} \sim \mathcal{N}\!\left(X_{n \times k}\, \beta_{k \times 1},\ \sigma^2 I_{n \times n}\right)$$
with a prior on $\beta$ and $\sigma$.

A convenient prior distribution is $p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\, p(\sigma^2)$, where
$$\beta \mid \sigma^2 \sim \mathcal{N}\!\left(b,\ \sigma^2 \Omega\right), \qquad \frac{1}{\sigma^2} \sim \mathrm{Ga}\!\left(\frac{df}{2}, \frac{ss}{2}\right)$$

This prior is conjugate to the regression likelihood (i.e. the prior and posterior are from the same model family).


Bayesian regression and spike-and-slab priors

Posterior distribution

Write (prior) $\times$ (likelihood), do some algebra, and you get
$$\beta \mid \sigma^2, y \sim \mathcal{N}\!\left(\tilde\beta,\ \sigma^2 V\right), \qquad \frac{1}{\sigma^2} \,\Big|\, y \sim \mathrm{Ga}\!\left(\frac{DF}{2}, \frac{SS}{2}\right)$$
where
$$V^{-1} = X^T X + \Omega^{-1} \qquad\qquad \tilde\beta = V\!\left(X^T y + \Omega^{-1} b\right)$$
$$DF = df + n \qquad\qquad SS = ss + y^T y + b^T \Omega^{-1} b - \tilde\beta^T V^{-1} \tilde\beta$$
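A hedged sketch (not bsts code) that computes these quantities directly; prior.information plays the role of $\Omega^{-1}$:

BayesRegPosterior <- function(X, y, b = rep(0, ncol(X)),
                              prior.information = diag(ncol(X)),
                              df = 1, ss = 1) {
  V.inv <- crossprod(X) + prior.information            ## X'X + Omega^{-1}
  V <- solve(V.inv)
  beta.tilde <- V %*% (crossprod(X, y) + prior.information %*% b)
  DF <- df + length(y)
  SS <- ss + sum(y^2) + drop(t(b) %*% prior.information %*% b) -
    drop(t(beta.tilde) %*% V.inv %*% beta.tilde)
  list(beta.mean = drop(beta.tilde), V = V, DF = DF, SS = SS)
}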

Bayesian regression and spike-and-slab priors

Some useful facts about the posterior distribution


The posterior mean
$$\tilde\beta = V\!\left(X^T y + \Omega^{-1} b\right)$$
is the information-weighted average of the OLS estimate and the prior mean ($X^T y = X^T X \hat\beta$).

The (scaled) posterior information
$$V^{-1} = X^T X + \Omega^{-1}$$
is the sum of the information in the prior ($\Omega^{-1}$) and in the data ($X^T X$).

If $\Omega^{-1}$ is positive definite then so is $V^{-1}$ (and thus $V$). Saves you from perfect collinearity, $k > n$, etc.


Bayesian regression and spike-and-slab priors

Using default values makes prior specification easier


$b = 0$ (helpful to cheat a tiny bit and set $b_0 = \bar y$).

$\Omega^{-1} = \kappa\, \dfrac{X^T X}{n}$. Here $X^T X / n$ is the average information in a single observation, and $\kappa$ is the number of prior observations worth of weight given to the prior.

$\dfrac{ss}{df}$ is your guess at $\sigma^2$, and $df$ is the weight (number of prior observations) given to that guess.

Now specifying the prior means supplying 3 numbers: $\kappa$, $df$, and your guess at $\sigma^2$.

If you don't want to guess at $\sigma^2$, peek at the sample variance of $y$, and guess at $R^2$, where $\sigma^2 = (1 - R^2) \times$ (sample variance).

Some useful default values: $\kappa = 1$, $df = 1$, $R^2 = 0.5$.


Bayesian regression and spike-and-slab priors

The marginal distribution of the data.


Because regression models are Gaussian, we can do some of the hard integrals we can't do in other models.
$$p(\beta, \sigma^2 \mid y) = \frac{p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, p(\sigma^2)}{p(y)} = p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y)$$

Solve for
$$p(y) = \frac{p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, p(\sigma^2)}{p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y)}.$$

Bayesian regression and spike-and-slab priors

Sparse modeling
If there are many predictors, one could expect many of them to have zero coefficients.

Machine learning people like to use penalized log likelihood: lasso, elastic net, Dantzig selector, etc.

Penalties to the log likelihood can be interpreted as log prior distributions. These induce sparsity at the mode, but not in the distribution (zero probability mass at zero).

Spike and slab priors set some coefficients to zero with positive probability.


The lasso (and related priors) are not sparse


They induce sparsity at the mode, but not in the posterior distribution:
$$p(\beta) \propto \exp\!\left(-\lambda \sum_j |\beta_j|\right)$$

[Figure: prior, likelihood, and posterior for a single coefficient under the lasso (double exponential) prior, with a weak likelihood (left) and a stronger likelihood (right); the posterior has no point mass at zero.]

Bayesian regression and spike-and-slab priors

Why is this important?


Penalized methods make a single decision about which variables are included / excluded.

With 100 predictors there are $2^{100}$ models, which is about $10^{30}$.

Avogadro's number is $6 \times 10^{23}$, so if each model was an atom, the space of models would have 1.66 million moles of mass.

A mole of carbon is (by definition) 12g, so that's about 20,000 kg, or 22 (US) tons.

So finding the best model in a space of 100 predictors is analogous to finding the best atom in a 22 ton block of carbon.

This argument absurdly overstates the case (because not all predictors are exchangeable), but any algorithm that claims to find the right model with this many candidates should be viewed with suspicion.
Some variables are obviously helpful.
Some are obviously garbage.
With some you're not sure. This is where the win comes from.


Bayesian regression and spike-and-slab priors

Spike and slab priors


[George and McCulloch (1997)]

We think most elements of $\beta$ are zero.

Let $\gamma_j = 1$ if $\beta_j \neq 0$ and $\gamma_j = 0$ if $\beta_j = 0$, e.g.
$$\gamma = (1, 0, 0, 1, \dots, 1, 0, 0).$$

Now factor the prior distribution
$$p(\beta, \gamma, \sigma^2) = p(\beta_\gamma \mid \gamma, \sigma^2)\, p(\sigma^2 \mid \gamma)\, p(\gamma).$$


Bayesian regression and spike-and-slab priors

A useful parameterization
This prior is conditionally conjugate given $\gamma$.

Notation
$b_\gamma$ means the elements of $b$ where $\gamma = 1$.
$\Omega^{-1}_\gamma$ means the rows and columns of $\Omega^{-1}$ where $\gamma = 1$.

Spike
$$\gamma \sim \prod_j \pi_j^{\gamma_j} (1 - \pi_j)^{1 - \gamma_j}$$

Slab
$$\beta_\gamma \mid \gamma, \sigma^2 \sim \mathcal{N}\!\left(b_\gamma,\ \sigma^2 \left(\Omega^{-1}_\gamma\right)^{-1}\right), \qquad \frac{1}{\sigma^2} \,\Big|\, \gamma \sim \mathrm{Ga}\!\left(\frac{df}{2}, \frac{ss}{2}\right)$$

Bayesian regression and spike-and-slab priors

Prior elicitation

$\pi_j$ = expected model size / number of predictors
$b = 0$ (vector)
$\Omega^{-1} = \kappa\,\{\alpha\, X^T X + (1 - \alpha)\, \mathrm{diag}(X^T X)\}/n$
$ss/df = (1 - R^2_{\text{expected}})\, s_y^2$
$df = 1$

The $\Omega^{-1}$ expression is $\kappa$ observations worth of prior information.
It can help to average $\Omega^{-1}$ with its diagonal.
Prior elicitation is 4 numbers: expected model size, expected $R^2$, beta weight ($\kappa$), and sigma weight ($df$).


Bayesian regression and spike-and-slab priors

Gibbs sampling for spike and slab regression


For each variable $j$, draw $\gamma_j \mid \gamma_{-j}, y$:
$$\gamma \mid y \sim C(y)\, \frac{\left|\Omega^{-1}_\gamma\right|^{1/2}}{\left|V_\gamma^{-1}\right|^{1/2}}\, \frac{p(\gamma)}{SS_\gamma^{DF/2}}$$

Each $\gamma_j$ only assumes the values 0 or 1, so the full conditional of $\gamma_j$ only has 2 values. Compute them both and normalize.

The symbols in this formula are the same as on the posterior distribution slide, but with $\gamma$ subscripts:
$V_\gamma$ is the posterior variance of $\beta_\gamma$.
$SS_\gamma$ is a sum of squares.

A $|\gamma| \times |\gamma|$ matrix needs to be inverted to compute $p(\gamma \mid y)$. Cheap! (if there are lots of 0's).

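Outside of bsts, the same spike-and-slab Gibbs sampler is available for ordinary regression in the BoomSpikeSlab package; a hedged illustration on simulated data (the default plot behavior is an assumption):

library(BoomSpikeSlab)

n <- 500
x <- matrix(rnorm(n * 20), n, 20)         ## 20 candidate predictors
y <- 3 * x[, 1] - 2 * x[, 7] + rnorm(n)   ## only two of them matter

model <- lm.spike(y ~ x, niter = 1000)
summary(model)   ## posterior inclusion probabilities should concentrate on x1 and x7
plot(model)      ## default plot shows inclusion probabilities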

Bayesian regression and spike-and-slab priors

A regression component in a structural time series model


The Kalman filter requires matrix-matrix multiplication at each step.

With $T$ time points and latent state dimension $m$, the complexity is $O(Tm^3)$. With pain, the exponent on $m$ can be lowered, but it is still important to keep the dimension down where possible.

A regression component $\beta^T x_t$ can be added to the Kalman filter at the cost of a single dimension:
$$\alpha_t = 1, \qquad Z_t = \beta^T x_t, \qquad T_t = 1, \qquad R_t = 0$$

Bayesian regression and spike-and-slab priors

MCMC for spike-and-slab bsts


Time series parameters: $\theta$. Regression coefficients: $\beta$.

1. Simulate $p(\alpha \mid \theta, \beta, y)$.
   Note the conditioning on $\beta$. You're effectively subtracting off the regression component, then fitting the state space model to the residuals.
2. Set $y_t^* = y_t - Z_t^T \alpha_t$.
   Simulate $p(\theta \mid \alpha)$.
   Simulate $\beta \mid y^*$.

The components are independent, so the simulation could be done in parallel, but it is so trivial that it usually isn't worth the effort.

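A hedged sketch of fitting such a model with bsts; the simulated data and the expected.model.size value are illustrative assumptions (expected.model.size is forwarded to the default spike-and-slab prior):

library(bsts)

## Simulate a target series plus ten candidate predictors, two of them real.
n <- 200
predictors <- matrix(rnorm(n * 10), n, 10)
colnames(predictors) <- paste0("x", 1:10)
y <- cumsum(rnorm(n, sd = .1)) + predictors[, 1] - .5 * predictors[, 3] + rnorm(n, sd = .2)
my.data <- data.frame(y = y, predictors)

ss <- AddLocalLinearTrend(list(), my.data$y)
model <- bsts(y ~ ., state.specification = ss, data = my.data,
              niter = 1000, expected.model.size = 2)
plot(model, "coef")   ## posterior inclusion probabilities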

Bayesian regression and spike-and-slab priors

Orthogonal Data Augmentation


A neat trick!

If $p(\beta \mid \sigma^2)$ were diagonal and independent of $\sigma^2$, then the only thing keeping $p(\beta \mid \sigma^2, y)$ from being diagonal is $X^T X$.

What if you happened to have a set of $x$'s lying around such that, if you added these $x$'s to your data, $X^T X$ would be diagonal?

You'd need to know the $y$'s that go along with these $x$'s, so you'd have a missing data problem.

Step 1: Find the $x$'s needed to diagonalize $X^T X$.
Step 2: Repeat the following steps:
1. Simulate the missing $y$'s given $\beta$ and $\sigma^2$.
2. Simulate $\beta$ and $\sigma^2$ given the complete data.


Bayesian regression and spike-and-slab priors

Pros and cons of ODA

Pros
The $\gamma_i$'s can be sampled independently, as can $\beta_i \mid \gamma_i$.
This can be done in parallel, and is really, really fast.

Cons
You have to decompose the whole $X^T X$ matrix once at the beginning of the algorithm to find the necessary $x$'s.
Some of the $x$'s can have high leverage.
High leverage points determine where the line goes.
If the missing data determine the line, and the line determines the missing data, then you have slow mixing.

bsts includes support for ODA, but it is still experimental at this point.


Applications

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Nowcasting with Google Trends
Causal Impact
Extensions

Applications

Nowcasting with Google Trends

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Nowcasting with Google Trends
Causal Impact
Extensions

Applications

Nowcasting with Google Trends

Nowcasting
Maintaining real time estimates of infrequently observed time series.

US weekly initial claims for unemployment (ICNSA).

A leading indicator of recessions.

Can we learn this week's number before it is released?

We'd need a real time signal correlated with the outcome.

[Figure: US weekly initial claims for unemployment (ICNSA), in thousands, 2004-2012.]

Google searches are a real time indicator of public interest


Applications

Nowcasting with Google Trends

Google trends public interface


Get it from google.com/trends.

Click the little gear to download as CSV.

Data are the percentage of all search traffic, normalized so the maximum is 100.

You can restrict by type of search, time range, geo, or search category (vertical).

There are 600 verticals.
Hierarchical: the Automotive vertical has a Hybrid and Alternative Vehicles subvertical.
If you compare your search to a vertical and then download as CSV, then you get the vertical's series too.

That's 600 public interest indices you can use to predict YOUR time series!


Applications

Nowcasting with Google Trends

Individual search queries


Google correlate can provide the most highly correlated individual queries (up to 100)


Applications

Nowcasting with Google Trends

Posterior inclusion probabilities


With expected model size = 3, and the top 100 predictors from correlate

plot(model, "coef", inc = .1)

[Figure: posterior inclusion probabilities for the top predictors: unemployment.office, filing.for.unemployment, idaho.unemployment, sirius.internet.radio, sirius.internet.]

Only showing inclusion probabilities > .1.
Shading shows $\Pr(\beta_j > 0 \mid y)$:
White: positive coefficients.
Black: negative coefficients.


Applications

Nowcasting with Google Trends

What got chosen?


plot(model, "predictors", inc = .1)

Solid blue line: actual.
Remaining lines shaded by inclusion probability.

[Figure: scaled values of the selected predictors overlaid on the actual series, 2004-2012, with inclusion probabilities 1 (unemployment.office), 0.94 (filing.for.unemployment), 0.47 (idaho.unemployment), 0.14 (sirius.internet.radio), and 0.11 (sirius.internet).]

Applications

Nowcasting with Google Trends

How much explaining got done?


Dynamic distribution plot shows evolving pointwise posterior distribution of state
components.

plot(model, "components")

[Figure: pointwise posterior distribution of the trend, seasonal.52.1, and regression components, 2004-2012.]

Applications

Nowcasting with Google Trends

Did it help?

Plot shows cumulative absolute one-step-ahead prediction error.

The regressors are not very helpful during normal times.

They help the model to quickly adapt to the recession.

CompareBstsModels(list("pure time series" = model1,
                       "with Google Trends" = model2))

[Figure: cumulative absolute one-step-ahead prediction error for the pure time series model vs. the model with Google Trends regressors, 2004-2012, plotted above the scaled series.]

Applications

Causal Impact

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Nowcasting with Google Trends
Causal Impact
Extensions

Applications

Causal Impact

Measuring advertising effectiveness is a tricky business


"I know that half my advertising dollars are wasted. I just don't know which half."
John Wanamaker

One of the basic promises of online advertising is measurement.

It is supposed to be easy.
Change something (e.g. increase your bid on Google).
Look to see how many incremental ad clicks you get.

Life is never easy.
Ad clicks and native search clicks interact in complicated ways.
It is tough to get incremental clicks attributable to the ad campaign.
- Ad clicks can cannibalize native search clicks.
- Ads have a branding effect that can:
  1. be hard to measure,
  2. drive native search clicks, and
  3. outlast the campaign.

Applications

Causal Impact

Example
Real Google advertiser. 6-week ad campaign. Random shift added to both axes.

[Figure: the advertiser's daily clicks, April through mid-June.]

Applications

Causal Impact

Problem statement
An actor engages in a market intervention:
has a sale,
begins (or modifies) an advertising campaign, or
introduces (or adopts) a new product.

Other similar actors don't engage in the intervention.

We have data on both the actor and the similar actors prior to the intervention.

Question: What was the effect of the intervention?
Total change to the bottom line.
How quickly did changes begin to occur?
How quickly did the effect begin to die out?


Difference in differences
An old trick from econometrics. Only measures at two points.

Applications

Causal Impact

Synthetic controls
A more realistic counterfactual model than DnD

Abadie et al. (2003, 2010) suggested synthetic controls as counterfactuals.


Weighted averages of untreated actors are used to forecast the actor of interest.

Weights ($0 \le w_i \le 1$) are estimated so that the synthetic control series matches the actor's series in the pre-treatment period.

The difference from the forecast is the estimated treatment effect.

Good: Allows multiple controls, captures temporal effects.
Bad: Scaling issues (California vs. Rhode Island), sign constraints (negative correlations?), other time series?

Especially problematic for marketing. You know your sales, but not your competitors' sales.

Applications

Causal Impact

CausalImpact
Extends DnD and synthetic controls using BSTS

Use data in the pre-treatment period to build a flexible time series model for the series of interest.

Forecast the time series over the intervention period given data from the pre-treatment period.
Can use contemporaneous regressors in the forecast.
Model fit is based on pre-treatment data.
Deviations from the forecast are the treatment effect.

Assumes no interference between units. Often violated. Benign if the effect on the untreated is small relative to the effect on the treated.

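This analysis is packaged in the CausalImpact R package, which wraps bsts. A hedged sketch on simulated data (the series below are made up, not the advertiser data from the case study):

library(CausalImpact)

set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10          ## lift during the "campaign"
data <- cbind(y, x1)

pre.period <- c(1, 70)
post.period <- c(71, 100)
impact <- CausalImpact(data, pre.period, post.period)
plot(impact)
summary(impact)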

The picture
Simulated data

[Figure: a simulated example, with the pre-intervention and post-intervention periods marked.]

Applications

Causal Impact

Potential outcomes
Let $y_{jst}$ denote the value of $Y$ for unit $j$ under treatment $s$ at time $t$. $T$ is the time of the market intervention.

What we observe:
Before $T$: we observe $y_{j0t}$ for everyone.
After $T$: we observe $y_{j1t}$ for the actor and $y_{k0t}$ for the potential controls $k \neq j$.

If we could also observe $y_{j0t}$ for the actor, then $y_{j1t} - y_{j0t}$ would be the treatment effect.

For $t > T$ we have a model for $y_{j0t} \mid y_{k0t}$.


Applications

Causal Impact

Case study
A Google advertiser ran a marketing experiment.

Google search ads ran for 6 weeks.

The response is total search-related visits to the site:
native search clicks, and
ad clicks.

95 of 190 designated marketing areas received the ads. (DMAs are areas that can receive distinct TV ads.)


Applications

Causal Impact

This particular advertiser ran an experiment


Plot shows clicks from treated vs. untreated geos. Each dot is a time point.

[Figure: scatterplot of clicks in the treatment region against clicks in the control region, with points distinguished by period (before, during, after).]

Applications

Causal Impact

Case study
Google advertiser. Treated vs. Untreated regions
[Figure: weekly clicks for the treated regions, weeks -4 through 7, with the pre-intervention, intervention, and post-intervention periods marked.]

Applications

Causal Impact

Case study
Google advertiser. Competitors' clicks as predictors.

[Figure: the same analysis with competitors' clicks as predictors, weeks -4 through 7, with the pre-intervention, intervention, and post-intervention periods marked.]

Applications

Causal Impact

Case study
Google advertiser. Untreated regions. Competitors' sales as predictors.

[Figure: the placebo analysis on untreated regions, weeks -4 through 7, with the pre-intervention, intervention, and post-intervention periods marked.]

Applications

Causal Impact

Case study
Summary

Analysis              Clicks      %     95% Interval
vs. Untreated (1)     84,100     20     (15, 26)%
vs. Competitors (2)   84,800     21     (13, 26)%
A-A (placebo) test     8,000      2     (-5, 6)%

Experimental data are needed to do analysis 1.

Analysis 2 is observational, but it replicates the experimental results.

Using Google Trends (instead of competitor information) gives about the same results.
Google Trends data are publicly available, while competitor clicks are not.
There are many more potential controls in Google Trends. Spike-and-slab variable selection / model averaging is useful for selecting appropriate control groups.


Extensions

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions
Normal mixtures
Longer term forecasting

Extensions

Normal mixtures

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions
Normal mixtures
Longer term forecasting

Extensions

Normal mixtures

Relaxing Gaussian assumptions


All model matrices in state space models are subscripted by $t$.

We can replace the Gaussian assumptions with conditionally Gaussian assumptions, where there is a latent variable at each $t$ determining the means and variances.

The MCMC now looks like this:
Draw the latent variables $w = (w_1, \dots, w_T)$ given $\alpha$ and $\theta$.
Draw $p(\alpha \mid y, \theta, w)$.
Draw $p(\theta \mid y, \alpha, w)$.

Example: the T distribution is a mixture of normals (a normal divided by a chi-square):
$$w \sim \mathrm{Ga}(\nu/2, \nu/2), \qquad y \mid w \sim \mathcal{N}(\mu, \sigma^2 / w)$$

Extensions

Normal mixtures

Example: retail sales


RSXFS: retail sales, excluding food service

Monthly data, already seasonally adjusted.

Catastrophic drop in 2008.

The shift in the slope of the local linear trend is too large to be handled by the Gaussian assumption.

[Figure: RSXFS / 1000 (Retail Sales, Excluding Food Service), monthly, 1992-2012, showing the sharp 2008 drop.]

Extensions

Normal mixtures

Local linear trend with student T errors


rsxfs-analysis.R

$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
$$\mu_t = \mu_{t-1} + \delta_{t-1} + \eta_{\mu,t-1}, \qquad \eta_{\mu,t} \sim T(0, \sigma_\mu^2)$$
$$\delta_t = \delta_{t-1} + \eta_{\delta,t-1}, \qquad \eta_{\delta,t} \sim T(0, \sigma_\delta^2)$$

This is an old Bayesian trick to ensure robustness. If you tell the model that occasional large errors are possible, it is not surprised by occasional large errors.

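A hedged sketch of what rsxfs-analysis.R might look like (the data import and column name are assumptions about the file layout; AddStudentLocalLinearTrend is the bsts state component for this robust trend):

library(bsts)
rsxfs <- read.csv("RSXFS.csv")$VALUE / 1000   ## hypothetical file and column name

ss.gaussian <- AddLocalLinearTrend(list(), rsxfs)
model.gaussian <- bsts(rsxfs, state.specification = ss.gaussian, niter = 1000)

ss.student <- AddStudentLocalLinearTrend(list(), rsxfs)
model.student <- bsts(rsxfs, state.specification = ss.student, niter = 1000)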

Extensions

Normal mixtures

Comparing dispersion parameters


Under the Gaussian and T models:

[Figure: posterior distributions of the level and slope standard deviations under the Student and Gaussian versions of the model.]

Because the model is aware that occasional large errors can occur, the
standard deviation parameters can be smaller.


Extensions

Normal mixtures

Impact on predictions

[Figure: the original RSXFS data, and posterior predictive distributions for 2009-2014 under the Gaussian and Student models.]

The extreme quantiles of the predictions under the Student model are
wider than under the Gaussian model.

The central (e.g. 95%, 90%) intervals are narrower.


Extensions

Normal mixtures

More normal mixtures

Similar tricks can be used to model probit, logit, and Poisson responses,
and even dynamic support vector machines by expressing these
distributions as normal mixtures.


Extensions

Longer term forecasting

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions
Normal mixtures
Longer term forecasting

Extensions

Longer term forecasting

Long term predictions


The local linear trend model is focused on detecting short term changes in the trend. It is very flexible, but it forgets the past quickly.

A less flexible, but more robust, trend model is
$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
$$\mu_t = \mu_{t-1} + \delta_{t-1} + \eta_{\mu,t-1}, \qquad \eta_{\mu,t} \sim \mathcal{N}(0, \sigma_\mu^2)$$
$$\delta_t = D + \phi(\delta_{t-1} - D) + \eta_{\delta,t}, \qquad \eta_{\delta,t} \sim \mathcal{N}(0, \sigma_\delta^2)$$

Now the slopes $\delta_t$ follow an AR(1) process instead of a random walk.
If $|\phi| < 1$ then the AR(1) is stationary, so it does not entirely forget the past.

$D$ is the long run trend of the series.
The slope can locally deviate from $D$, but it will eventually return.

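In bsts this trend is available as AddSemilocalLinearTrend (called AddGeneralizedLocalLinearTrend in older versions of the package). A hedged sketch, reusing the AirPassengers series as a stand-in for RSXFS:

library(bsts)
y <- log10(AirPassengers)

ss <- AddSemilocalLinearTrend(list(), y)
model <- bsts(y, state.specification = ss, niter = 1000)
pred <- predict(model, horizon = 60)   ## a longer forecast horizon
plot(pred)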

Extensions

Longer term forecasting

Long term predictions

[Figure: the original RSXFS data and long term predictions for 2009-2014 under the local linear trend model and the AR(1)-slope trend model.]

References

Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika 81, 541-553.

de Jong, P. and Shepard, N. (1995). The simulation smoother for time series models. Biometrika 82, 339-350.

Durbin, J. and Koopman, S. J. (2002). A simple and efficient simulation smoother for state space time series analysis. Biometrika 89, 603-616.

Durbin, J. and Koopman, S. J. (2012). Time Series Analysis by State Space Methods. Oxford University Press.

Frühwirth-Schnatter, S. (1995). Bayesian model discrimination and Bayes factors for linear Gaussian state space models. Journal of the Royal Statistical Society, Series B 57, 237-246.

George, E. and McCulloch, R. (1997). Approaches for Bayesian variable selection. Statistica Sinica 7, 339-374.

