
Bayesian Structural Time Series Models

Steven L. Scott

August 10, 2015

Welcome!

The goal for the day is to introduce you to:



basic ideas in structural time series modeling,

regression modeling with spike and slab priors, and

the bsts R package.

Course notes and materials at https://goo.gl/VUWUC9


Some good books


For structural time series, and time series in general.

Harvey

Durbin and Koopman

West and Harrison

Chatfield

Brockwell and Davis

Petris et al.


Introduction to time series modeling

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions


Introduction to time series modeling

Strategies for time series models

Regression

ARMA

Smoothing

Structural time series


Introduction to time series modeling

Regression models

Introductory statistics courses teach students to fit models like


$$y_t = \underbrace{\beta_0 + \beta_1 t}_{\text{linear}} + \beta_2 x_t + \epsilon_t.$$

1. The trend probably won't follow a parametric form.


2. Even if it does, the residuals will be autocorrelated.


Introduction to time series modeling

Airline passengers

An example from elementary textbooks

[Figure: monthly AirPassengers counts (left) and the same series on the log10 scale (right), 1949-1960.]

Introduction to time series modeling

Linear time trend doesn't quite fit


See air-passengers-bsts.R

air <- log10(AirPassengers)
time <- 1:length(air)
months <- time %% 12
months[months == 0] <- 12
months <- factor(months, labels = month.name)
reg <- lm(air ~ time + months)

[Figure: residuals plotted against fitted values from the linear trend regression, showing clear curvature.]


Quadratic time trend

Misses serial correlation

reg <- lm(air ~ poly(time, 2) + months)
plot(reg$residuals)
acf(reg$residuals)

[Figure: residuals from the quadratic-trend fit plotted against time.]
Predictions between months 80 and 100 are predictably too low; between months 100 and 120 they are predictably too high.

Serial correlation is cured by locality.

[Figure: autocorrelation function (ACF) of the residuals, showing strong serial correlation.]

Introduction to time series modeling

ARMA models
ARMA(P, Q) models have the form
$$y_t = \sum_{p=1}^{P} \phi_p y_{t-p} + \sum_{q=0}^{Q} \theta_q \epsilon_{t-q}.$$

Some features that make ARMA models difficult:

1. $y_t$ must be stationary. If it is non-stationary, take differences until it becomes stationary.
2. If $y_t$ contains a seasonal component, then seasonal differencing is also required.
3. Harder to think about (regression of $y$ on $x$ vs. of $\nabla_{52}\nabla^2 y$ on $x$).

ARMA models can be written as a special case of state space models.


Introduction to time series modeling

Stationary vs Nonstationary
See code in stationary.R

sample.size <- 1000
number.of.series <- 1000
many.ar1 <- matrix(nrow = sample.size, ncol = number.of.series)
for (i in 1:number.of.series) {
  many.ar1[, i] <- arima.sim(model = list(ar = .95), n = sample.size)
}
many.random.walk <- matrix(nrow = sample.size, ncol = number.of.series)
for (i in 1:number.of.series) {
  many.random.walk[, i] <- cumsum(rnorm(sample.size))
}
par(mfrow = c(1, 2))
plot.ts(many.ar1, plot.type = "single")
plot.ts(many.random.walk, plot.type = "single")

Introduction to time series modeling

What it looks like


Single series

[Figure: one simulated AR(1) series, $y_t = 0.95\, y_{t-1} + \epsilon_t$ (left), and one simulated random walk, $y_t = y_{t-1} + \epsilon_t$ (right).]

Introduction to time series modeling

What it looks like


Many series

[Figure: all 1000 simulated AR(1) series, $y_t = 0.95\, y_{t-1} + \epsilon_t$ (left), and all 1000 simulated random walks, $y_t = y_{t-1} + \epsilon_t$ (right).]

Introduction to time series modeling

Variance
AR(1)
$$y_t = \phi y_{t-1} + \epsilon_t = \phi(\phi y_{t-2} + \epsilon_{t-1}) + \epsilon_t = \cdots = \sum_{i=0}^{t} \phi^i \epsilon_{t-i}.$$
If $|\phi| < 1$ then as $t \to \infty$, $\mathrm{Var}(y_t) \to \mathrm{Var}(\epsilon_t)/(1 - \phi^2)$.

Random walk
$$y_t = \sum_{i=0}^{t} \epsilon_{t-i}, \qquad \mathrm{Var}(y_t) = t\sigma^2.$$
The variance diverges to $\infty$.

Introduction to time series modeling

Smoothing
Exponential smoothing
$$s_t = \alpha y_t + (1 - \alpha) s_{t-1}$$
turns out to be the Kalman filter for the local level model.

Holt-Winters or double exponential smoothing captures a trend:
$$s_t = \alpha y_t + (1 - \alpha)(s_{t-1} + b_{t-1})$$
$$b_t = \beta (s_t - s_{t-1}) + (1 - \beta) b_{t-1}$$
This is the Kalman filter for the local linear trend model.

Triple exponential smoothing can handle seasonality as well, but the formulas are getting ridiculous!

What happens if you want to include a regression component?

Confidence about the smoothed estimate?

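As a point of comparison, these smoothers are available in base R through HoltWinters(); a minimal sketch (not from the original slides, reusing the AirPassengers series from earlier):

y <- log10(AirPassengers)
## Simple exponential smoothing: level only.
ses <- HoltWinters(y, beta = FALSE, gamma = FALSE)
## Double exponential smoothing: level + trend.
des <- HoltWinters(y, gamma = FALSE)
## Triple exponential smoothing: level + trend + seasonal.
tes <- HoltWinters(y)
predict(tes, n.ahead = 24)  ## point forecasts only: no regression component, no state uncertainty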

Introduction to time series modeling

Advantages of structural time series models

All the flexibility of regression models.

The locality of ARMA models and smoothing.

Can handle non-stationarity.

Modular, so easy to combine with other additive components.

All those smoothing parameters become variances that can be estimated from the data.


Structural time series models

Outline
Introduction to time series modeling
Structural time series models
Models for trend
Modeling seasonality
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions

Structural time series models

Structural time series models


State space form

There are two pieces to a structural time series model


Observation equation
$$y_t = Z_t^T \alpha_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, H_t)$$

$y_t$ is the observed data at time $t$.
$\alpha_t$ is a vector of latent variables (the state).
$Z_t$ and $H_t$ are structural parameters (partly known).

Transition equation
$$\alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \qquad \eta_t \sim \mathcal{N}(0, Q_t)$$

$T_t$, $R_t$, and $Q_t$ are structural parameters (partly known).
$\eta_t$ may be of lower dimension than $\alpha_t$.

Structural time series models

Structural time series models are modular


Add your favorite trend, seasonal, regression, holiday, etc. models to the mix

[Diagram: the observation vector $Z_t$ and transition matrix $T_t$ are assembled by stacking trend, seasonal, and regression components into one state vector.]

Structural time series models

Example
The basic structural model with a regression effect and $S$ seasons can be written
$$y_t = \underbrace{\mu_t}_{\text{trend}} + \underbrace{\tau_t}_{\text{seasonal}} + \underbrace{\beta^T x_t}_{\text{regression}} + \epsilon_t$$
$$\mu_t = \mu_{t-1} + \delta_{t-1} + u_t$$
$$\delta_t = \delta_{t-1} + v_t$$
$$\tau_t = -\sum_{s=1}^{S-1} \tau_{t-s} + w_t$$

Local linear trend: level $\mu_t$ + slope $\delta_t$.

Seasonal: $S - 1$ dummy variables with time-varying coefficients. Sums to zero in expectation.


Structural time series models

Models for trend

Some models for trend

Local level

Local linear trend

Generalized local linear trend

Autoregressive models


Structural time series models

Models for trend

Understanding the local level model


The local level model is
$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
$$\mu_t = \mu_{t-1} + \eta_{t-1}, \qquad \eta_t \sim \mathcal{N}(0, \tau^2)$$

A compromise between the random walk model (when $\sigma^2 = 0$) and the constant mean model (when $\tau^2 = 0$).

In the random walk model, your forecast of the future (given data to time $t$) is $y_t$.
In the constant mean model, your forecast is $\bar{y}$.
The larger the ratio $\sigma^2 / \tau^2$, the closer this model is to the constant mean model.

In state space form:
$$T_t = 1, \quad Z_t = 1, \quad R_t = 1, \quad H_t = \sigma^2, \quad Q_t = \tau^2$$

Structural time series models

Models for trend

Simulating the local level model


local-level.R

[Figure: simulated local level series with (tau = 1, sigma = 0), a pure random walk; (tau = 0, sigma = 1), noise around a constant level; and (tau = 1, sigma = 0.5).]
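A hypothetical reconstruction of what local-level.R might contain (the function name is mine, not from the course files):

SimulateLocalLevel <- function(n = 100, tau = 1, sigma = 0.5) {
  ## The level follows a random walk with innovation sd tau;
  ## observations add independent noise with sd sigma.
  mu <- cumsum(rnorm(n, sd = tau))
  mu + rnorm(n, sd = sigma)
}

par(mfrow = c(3, 1))
plot.ts(SimulateLocalLevel(tau = 1, sigma = 0))    ## pure random walk
plot.ts(SimulateLocalLevel(tau = 0, sigma = 1))    ## constant mean plus noise
plot.ts(SimulateLocalLevel(tau = 1, sigma = 0.5))  ## local level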

Structural time series models

Models for trend

Local linear trend


local-linear-trend.R
The model is
$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
$$\mu_t = \mu_{t-1} + \delta_{t-1} + \eta_{\mu,t-1}, \qquad \eta_{\mu,t} \sim \mathcal{N}(0, \sigma_\mu^2)$$
$$\delta_t = \delta_{t-1} + \eta_{\delta,t-1}, \qquad \eta_{\delta,t} \sim \mathcal{N}(0, \sigma_\delta^2)$$

We normally think of a linear trend as $y = \beta_0 + \beta_1 t + \epsilon_t$. With change $\Delta t$, the expected increase in $y$ is $\beta_1 \Delta t$. Now each $\Delta t = 1$, and $\beta_1 = \delta_t$ is a changing slope.

Neat fact! The posterior mean of the local linear trend model is a smoothing spline.


Simulating local linear trend

[Figure: three simulations of the local linear trend model with sigma.level = 1, sigma.slope = 0.25, sigma.obs = 0.5.]

Structural time series models

Modeling seasonality

Modeling seasonality
In the classroom regression model:
We used a dummy variable for each season.
Left one season out (set its coefficient to zero).

In state space models:
$$\tau_t = -\sum_{s=1}^{S-1} \tau_{t-s} + \eta_{t-1}$$
e.g. summer = $-$(spring + winter + fall) + noise.

Mean over the year is zero.
State is $S - 1$ dimensional.
Only one dimension of randomness.
$$Z_t = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \qquad
T_t = \begin{pmatrix} -1 & -1 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \qquad
R_t = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$$

Structural time series models

Modeling seasonality

Example
Modeling the air passengers data

data(AirPassengers)
y <- log10(AirPassengers)
ss <- AddLocalLinearTrend(
  list(),          ## No previous state specification.
  y)               ## Peek at the data to specify default priors.
ss <- AddSeasonal(
  ss,              ## Adding state to ss.
  y,               ## Peeking at the data.
  nseasons = 12)   ## 12 "seasons"
model <- bsts(y, state.specification = ss, niter = 1000)
plot(model)
plot(model, "help")
plot(model, "comp")   ## "components"
plot(model, "resid")  ## "residuals"

Structural time series models

Modeling seasonality

Posterior distribution of state

[Figure: posterior distribution of the state at each time point, plotted over the log10 AirPassengers data, 1950-1960.]

plot(model)

Fuzzy line shows posterior distribution of state at time t.

Blue dots are actual observations.


Structural time series models

Modeling seasonality

Contributions from each component

[Figure: contributions of the trend component (left) and the seasonal.12.1 component (right), plotted on a common scale.]

plot(model, "comp")  ## "components"

Structural time series models

Modeling seasonality

Contributions from each component


[Figure: the same component contributions, each plotted on its own scale.]

plot(model, "comp", same.scale = FALSE)  ## "components"

Evolution of the seasonal component

[Figure: posterior distribution of each of the 12 seasonal effects (Season 1 through Season 12) over 1950-1960, showing how each month's contribution evolves.]

Structural time series models

Modeling seasonality

Setting priors

AddLocalLinearTrend(
  state.specification = NULL,
  y,
  level.sigma.prior = NULL,     # SdPrior
  slope.sigma.prior = NULL,     # SdPrior
  initial.level.prior = NULL,   # NormalPrior
  initial.slope.prior = NULL,   # NormalPrior
  sdy,
  initial.y)

Structural time series models

Modeling seasonality

Priors on standard deviations

SdPrior(sigma.guess,
        sample.size = .01,
        initial.value = sigma.guess,
        fixed = FALSE,
        upper.limit = Inf)

This puts a gamma prior on $1/\sigma^2$.
Shape ($\alpha$) = sample.size / 2.
Scale ($\beta$) = sigma.guess$^2$ $\times$ sample.size / 2.
If you specify an upper limit on $\sigma$ then the support will be truncated.


Structural time series models

Modeling seasonality

What's in the model object

Varies depending on how the function was called.

> names(model)
 [1] "sigma.obs"                  "sigma.trend.level"
 [3] "sigma.trend.slope"          "sigma.seasonal.12"
 [5] "final.state"                "state.contributions"
 [7] "one.step.prediction.errors" "has.regression"
 [9] "state.specification"        "original.series"

MCMC draws of model parameters (each one is named).
Draws of the final state vector (used for forecasting).
Draws of each component's contributions to the state mean.
Draws of the one-step-ahead prediction errors (from the Kalman filter).
A logical value indicating whether the model has a (static) regression component.
The state specification you used to call the model.
A copy of the original data series.


Structural time series models

Modeling seasonality

Prediction
### Predict the next 24 periods.
pred <- predict(model, horizon = 24)

### Plot the prediction along with the last 36 observations
### from the training series.
plot(pred, plot.original = 36)

[Figure: the forecast distribution for the next 24 months, shown after the last 36 observations of the training series.]


MCMC and the Kalman filter

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions


MCMC and the Kalman filter

Gibbs sampling for state space models


1. Simulate $p(\alpha \mid \theta, y)$ using the Kalman filter and simulation smoother.
2. Simulate $p(\theta \mid \alpha, y)$.
3. Go to 1.

Simulating $p(\theta \mid \alpha, y)$ is done on a model-by-model basis, but for most models it is trivial.

For models with only variance parameters, compute the right sum of squared errors and draw the variances.

Example: the local level model.
$$p(\tau^2 \mid \alpha) \propto p(\tau^2)\, \tau^{-T} \exp\!\left( -\frac{1}{2\tau^2} \sum_t (\mu_t - \mu_{t-1})^2 \right)$$
If $p(1/\tau^2) = \mathrm{Ga}(df/2, ss/2)$ then
$$p\!\left(1/\tau^2 \mid \alpha\right) = \mathrm{Ga}\!\left( \frac{df + T}{2},\ \frac{ss + \sum_t (\mu_t - \mu_{t-1})^2}{2} \right)$$
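A small sketch of that variance draw (hypothetical code, not from bsts), for the local level model with a Ga(df/2, ss/2) prior on $1/\tau^2$:

DrawTauSquared <- function(mu, prior.df = 1, prior.ss = 1) {
  increments <- diff(mu)                      ## mu[t] - mu[t-1]
  shape <- (prior.df + length(increments)) / 2
  rate <- (prior.ss + sum(increments^2)) / 2
  1 / rgamma(1, shape = shape, rate = rate)   ## draw the precision, then invert
}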

MCMC and the Kalman filter

The Kalman filter


[Diagram: conditional independence graph of the state space model; each state $\alpha_t$ points to the next state $\alpha_{t+1}$ and to its observation $y_t$.]

The graph shows the conditional independence relationships among the latent and observed variables in the model.



MCMC and the Kalman filter

The Kalman filter


[Diagram: the same conditional independence graph, stepping through time.]

At time $t-1$ we start off knowing the mean and variance of $\alpha_{t-1}$ given $y_1, \dots, y_{t-2}$ (recursion).

Then we observe $y_{t-1}$.

The Kalman filter computes $p(\alpha_t \mid y_1, \dots, y_{t-1})$, and the incremental likelihood $p(y_{t-1} \mid y_1, \dots, y_{t-2})$.


MCMC and the Kalman filter

The Kalman equations


Recall the state space form of the model:
$$y_t = Z_t^T \alpha_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, H_t)$$
$$\alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \qquad \eta_t \sim \mathcal{N}(0, Q_t)$$

The Kalman filter recursively computes $P(\alpha_{t+1} \mid y_1, \dots, y_t) = \mathcal{N}(a_{t+1}, P_{t+1})$:
$$\begin{aligned}
v_t &= y_t - Z_t^T a_t && \text{(1-step prediction error)} \\
F_t &= Z_t^T P_t Z_t + H_t && \text{(forecast variance)} \\
K_t &= T_t P_t Z_t F_t^{-1} && \text{(Kalman gain \dots)} \\
a_{t+1} &= T_t a_t + K_t v_t && \text{(\dots which is a regression coefficient)} \\
P_{t+1} &= T_t P_t (T_t - K_t Z_t^T)^T + R_t Q_t R_t^T
\end{aligned}$$

The derivation is tedious, but elementary. You can use Bayes' rule, or properties of the multivariate normal. See [Durbin and Koopman (2012)] or many other sources.
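A hedged sketch of one filter step, specialized to the scalar local level model ($Z_t = T_t = R_t = 1$, $H_t = \sigma^2$, $Q_t = \tau^2$); this mirrors the equations above and is not the bsts implementation:

KalmanStep <- function(y, a, P, sigma2, tau2) {
  v <- y - a                       ## one-step prediction error
  F <- P + sigma2                  ## forecast variance
  K <- P / F                       ## Kalman gain
  list(a = a + K * v,              ## updated state mean
       P = P * (1 - K) + tau2,     ## updated state variance
       loglik = dnorm(y, a, sqrt(F), log = TRUE))  ## incremental likelihood
}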

MCMC and the Kalman filter

Forward and backward


The Kalman filter marches forward through the data, collecting information.

There are corresponding algorithms that march backward through the data, distributing information.

The Kalman smoother (useful for the EM algorithm) computes $p(\alpha_t \mid y)$.

The simulation smoother draws the full state path from its posterior [Carter and Kohn (1994), Frühwirth-Schnatter (1995), de Jong and Shepard (1995), Durbin and Koopman (2002)].

The output of the Kalman filter + simulation smoother is an exact draw from $p(\alpha \mid y, \theta)$.


MCMC and the Kalman filter

Simulation smoother
[Durbin and Koopman (2002)] thought of a clever way to simulate $p(\alpha \mid y)$.
1. Simulate data with the wrong mean, but the right variance.
2. Subtract off the wrong mean, and put in the right one.

The argument goes like this:
1. For a multivariate normal $(\alpha, y)$, $\mathrm{Var}(\alpha \mid y)$ is independent of $y$.
2. Simulate fake data $(\alpha^*, y^*)$ from the structural time series model. The conditional distribution $(\alpha^* \mid y^*)$ has the same variance as $(\alpha \mid y)$.
3. Subtract $E(\alpha^* \mid y^*)$ from your simulated $\alpha^*$'s, and add $E(\alpha \mid y)$.

[Durbin and Koopman (2012)] (Section 4.6.2) give a fast state smoother that can quickly compute $E(\alpha_t \mid y)$ (without computing each $P_t$).

The DK simulation smoother requires two Kalman filters (for $y$ and $y^*$) and two fast state smoothers.

It works even if $R_t$ is not full rank.


MCMC and the Kalman filter

Break time!

Let's take 15 minutes.


Bayesian regression and spike-and-slab priors

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions


Bayesian regression and spike-and-slab priors

Linear regression
Bayesian regression is just the ordinary linear model
$$y_{n \times 1} \sim \mathcal{N}\!\left(X_{n \times k}\, \beta_{k \times 1},\ \sigma^2 I_{n \times n}\right)$$
with a prior on $\beta$ and $\sigma$.

A convenient prior distribution is $p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\, p(\sigma^2)$, where
$$\beta \mid \sigma^2 \sim \mathcal{N}\!\left(b,\ \sigma^2 \Omega\right), \qquad \frac{1}{\sigma^2} \sim \mathrm{Ga}\!\left(\frac{df}{2}, \frac{ss}{2}\right)$$

This prior is conjugate to the regression likelihood (i.e. the prior and posterior are from the same model family).


Bayesian regression and spike-and-slab priors

Posterior distribution

Write (prior) $\times$ (likelihood), do some algebra, and you get
$$\beta \mid \sigma^2, y \sim \mathcal{N}\!\left(\tilde\beta,\ \sigma^2 V\right), \qquad \frac{1}{\sigma^2} \,\Big|\, y \sim \mathrm{Ga}\!\left(\frac{DF}{2}, \frac{SS}{2}\right)$$
where
$$V^{-1} = X^T X + \Omega^{-1} \qquad\qquad \tilde\beta = V\!\left(X^T y + \Omega^{-1} b\right)$$
$$DF = df + n \qquad\qquad SS = ss + y^T y + b^T \Omega^{-1} b - \tilde\beta^T V^{-1} \tilde\beta$$
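A hedged sketch (not bsts code) that computes these quantities directly; prior.information plays the role of $\Omega^{-1}$:

BayesRegPosterior <- function(X, y, b = rep(0, ncol(X)),
                              prior.information = diag(ncol(X)),
                              df = 1, ss = 1) {
  V.inv <- crossprod(X) + prior.information            ## X'X + Omega^{-1}
  V <- solve(V.inv)
  beta.tilde <- V %*% (crossprod(X, y) + prior.information %*% b)
  DF <- df + length(y)
  SS <- ss + sum(y^2) + drop(t(b) %*% prior.information %*% b) -
    drop(t(beta.tilde) %*% V.inv %*% beta.tilde)
  list(beta.mean = drop(beta.tilde), V = V, DF = DF, SS = SS)
}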

Bayesian regression and spike-and-slab priors

Some useful facts about the posterior distribution


The posterior mean
$$\tilde\beta = V\!\left(X^T y + \Omega^{-1} b\right)$$
is the information-weighted average of the OLS estimate and the prior mean ($X^T y = X^T X \hat\beta$).

The (scaled) posterior information
$$V^{-1} = X^T X + \Omega^{-1}$$
is the sum of the information in the prior ($\Omega^{-1}$) and in the data ($X^T X$).

If $\Omega^{-1}$ is positive definite then so is $V^{-1}$ (and thus $V$). Saves you from perfect collinearity, $k > n$, etc.


Bayesian regression and spike-and-slab priors

Using default values makes prior specification easier


$b = 0$ (helpful to cheat a tiny bit and set $b_0 = \bar y$).

$\Omega^{-1} = \kappa\, \dfrac{X^T X}{n}$. Here $X^T X / n$ is the average information in a single observation, and $\kappa$ is the number of prior observations worth of weight given to the prior.

$\dfrac{ss}{df}$ is your guess at $\sigma^2$, and $df$ is the weight (number of prior observations) given to that guess.

Now specifying the prior means supplying 3 numbers: $\kappa$, $df$, and your guess at $\sigma^2$.

If you don't want to guess at $\sigma^2$, peek at the sample variance of $y$, and guess at $R^2$, where $\sigma^2 = (1 - R^2) \times$ (sample variance).

Some useful default values: $\kappa = 1$, $df = 1$, $R^2 = 0.5$.


Bayesian regression and spike-and-slab priors

The marginal distribution of the data.


Because regression models are Gaussian, we can do some of the hard integrals we can't do in other models.
$$p(\beta, \sigma^2 \mid y) = \frac{p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, p(\sigma^2)}{p(y)} = p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y)$$

Solve for
$$p(y) = \frac{p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, p(\sigma^2)}{p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y)}.$$

Bayesian regression and spike-and-slab priors

Sparse modeling
If there are many predictors, one could expect many of them to have zero coefficients.

Machine learning people like to use penalized log likelihood: lasso, elastic net, Dantzig selector, etc.

Penalties to the log likelihood can be interpreted as log prior distributions. These induce sparsity at the mode, but not in the distribution (zero probability mass at zero).

Spike and slab priors set some coefficients to zero with positive probability.


The lasso (and related priors) are not sparse


They induce sparsity at the mode, but not in the posterior distribution:
$$p(\beta) \propto \exp\!\left(-\lambda \sum_j |\beta_j|\right)$$

[Figure: prior, likelihood, and posterior for a single coefficient under the lasso (double exponential) prior, with a weak likelihood (left) and a stronger likelihood (right); the posterior has no point mass at zero.]

Bayesian regression and spike-and-slab priors

Why is this important?


Penalized methods make a single decision about which variables are included / excluded.

With 100 predictors there are $2^{100}$ models, which is about $10^{30}$.

Avogadro's number is $6 \times 10^{23}$, so if each model was an atom, the space of models would have 1.66 million moles of mass.

A mole of carbon is (by definition) 12g, so that's about 20,000 kg, or 22 (US) tons.

So finding the best model in a space of 100 predictors is analogous to finding the best atom in a 22 ton block of carbon.

This argument absurdly overstates the case (because not all predictors are exchangeable), but any algorithm that claims to find the right model with this many candidates should be viewed with suspicion.
Some variables are obviously helpful.
Some are obviously garbage.
With some you're not sure. This is where the win comes from.


Bayesian regression and spike-and-slab priors

Spike and slab priors


[George and McCulloch (1997)]

We think most elements of $\beta$ are zero.

Let $\gamma_j = 1$ if $\beta_j \neq 0$ and $\gamma_j = 0$ if $\beta_j = 0$, e.g.
$$\gamma = (1, 0, 0, 1, \dots, 1, 0, 0).$$

Now factor the prior distribution
$$p(\beta, \gamma, \sigma^2) = p(\beta_\gamma \mid \gamma, \sigma^2)\, p(\sigma^2 \mid \gamma)\, p(\gamma).$$


Bayesian regression and spike-and-slab priors

A useful parameterization
This prior is conditionally conjugate given $\gamma$.

Notation
$b_\gamma$ means the elements of $b$ where $\gamma = 1$.
$\Omega^{-1}_\gamma$ means the rows and columns of $\Omega^{-1}$ where $\gamma = 1$.

Spike
$$\gamma \sim \prod_j \pi_j^{\gamma_j} (1 - \pi_j)^{1 - \gamma_j}$$

Slab
$$\beta_\gamma \mid \gamma, \sigma^2 \sim \mathcal{N}\!\left(b_\gamma,\ \sigma^2 \left(\Omega^{-1}_\gamma\right)^{-1}\right), \qquad \frac{1}{\sigma^2} \,\Big|\, \gamma \sim \mathrm{Ga}\!\left(\frac{df}{2}, \frac{ss}{2}\right)$$

Bayesian regression and spike-and-slab priors

Prior elicitation

$\pi_j$ = expected model size / number of predictors
$b = 0$ (vector)
$\Omega^{-1} = \kappa\,\{\alpha\, X^T X + (1 - \alpha)\, \mathrm{diag}(X^T X)\}/n$
$ss/df = (1 - R^2_{\text{expected}})\, s_y^2$
$df = 1$

The $\Omega^{-1}$ expression is $\kappa$ observations worth of prior information.
It can help to average $\Omega^{-1}$ with its diagonal.
Prior elicitation is 4 numbers: expected model size, expected $R^2$, beta weight ($\kappa$), and sigma weight ($df$).


Bayesian regression and spike-and-slab priors

Gibbs sampling for spike and slab regression


For each variable $j$, draw $\gamma_j \mid \gamma_{-j}, y$:
$$\gamma \mid y \sim C(y)\, \frac{\left|\Omega^{-1}_\gamma\right|^{1/2}}{\left|V_\gamma^{-1}\right|^{1/2}}\, \frac{p(\gamma)}{SS_\gamma^{DF/2}}$$

Each $\gamma_j$ only assumes the values 0 or 1, so the full conditional of $\gamma_j$ only has 2 values. Compute them both and normalize.

The symbols in this formula are the same as on the posterior distribution slide, but with $\gamma$ subscripts:
$V_\gamma$ is the posterior variance of $\beta_\gamma$.
$SS_\gamma$ is a sum of squares.

A $|\gamma| \times |\gamma|$ matrix needs to be inverted to compute $p(\gamma \mid y)$. Cheap! (if there are lots of 0's).

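Outside of bsts, the same spike-and-slab Gibbs sampler is available for ordinary regression in the BoomSpikeSlab package; a hedged illustration on simulated data (the default plot behavior is an assumption):

library(BoomSpikeSlab)

n <- 500
x <- matrix(rnorm(n * 20), n, 20)         ## 20 candidate predictors
y <- 3 * x[, 1] - 2 * x[, 7] + rnorm(n)   ## only two of them matter

model <- lm.spike(y ~ x, niter = 1000)
summary(model)   ## posterior inclusion probabilities should concentrate on x1 and x7
plot(model)      ## default plot shows inclusion probabilities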

Bayesian regression and spike-and-slab priors

A regression component in a structural time series model


The Kalman filter requires matrix-matrix multiplication at each step.

With $T$ time points and latent state dimension $m$, the complexity is $O(Tm^3)$. With pain, the exponent on $m$ can be lowered, but it is still important to keep the dimension down where possible.

A regression component $\beta^T x_t$ can be added to the Kalman filter at the cost of a single dimension:
$$\alpha_t = 1, \qquad Z_t = \beta^T x_t, \qquad T_t = 1, \qquad R_t = 0$$

Bayesian regression and spike-and-slab priors

MCMC for spike-and-slab bsts


Time series parameters: $\theta$. Regression coefficients: $\beta$.

1. Simulate $p(\alpha \mid \theta, \beta, y)$.
   Note the conditioning on $\beta$. You're effectively subtracting off the regression component, then fitting the state space model to the residuals.
2. Set $y_t^* = y_t - Z_t^T \alpha_t$.
   Simulate $p(\theta \mid \alpha)$.
   Simulate $\beta \mid y^*$.

The components are independent, so the simulation could be done in parallel, but it is so trivial that it usually isn't worth the effort.

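A hedged sketch of fitting such a model with bsts; the simulated data and the expected.model.size value are illustrative assumptions (expected.model.size is forwarded to the default spike-and-slab prior):

library(bsts)

## Simulate a target series plus ten candidate predictors, two of them real.
n <- 200
predictors <- matrix(rnorm(n * 10), n, 10)
colnames(predictors) <- paste0("x", 1:10)
y <- cumsum(rnorm(n, sd = .1)) + predictors[, 1] - .5 * predictors[, 3] + rnorm(n, sd = .2)
my.data <- data.frame(y = y, predictors)

ss <- AddLocalLinearTrend(list(), my.data$y)
model <- bsts(y ~ ., state.specification = ss, data = my.data,
              niter = 1000, expected.model.size = 2)
plot(model, "coef")   ## posterior inclusion probabilities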

Bayesian regression and spike-and-slab priors

Orthogonal Data Augmentation


A neat trick!

If $p(\beta \mid \sigma^2)$ were diagonal and independent of $\sigma^2$, then the only thing keeping $p(\beta \mid \sigma^2, y)$ from being diagonal is $X^T X$.

What if you happened to have a set of $x$'s lying around such that, if you added these $x$'s to your data, $X^T X$ would be diagonal?

You'd need to know the $y$'s that go along with these $x$'s, so you'd have a missing data problem.

Step 1: Find the $x$'s needed to diagonalize $X^T X$.
Step 2: Repeat the following steps:
1. Simulate the missing $y$'s given $\beta$ and $\sigma^2$.
2. Simulate $\beta$ and $\sigma^2$ given the complete data.


Bayesian regression and spike-and-slab priors

Pros and cons of ODA

Pros
The $\gamma_i$'s can be sampled independently, as can $\beta_i \mid \gamma_i$.
This can be done in parallel, and is really, really fast.

Cons
You have to decompose the whole $X^T X$ matrix once at the beginning of the algorithm to find the necessary $x$'s.
Some of the $x$'s can have high leverage.
High leverage points determine where the line goes.
If the missing data determine the line, and the line determines the missing data, then you have slow mixing.

bsts includes support for ODA, but it is still experimental at this point.


Applications

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Nowcasting with Google Trends
Causal Impact
Extensions

Applications

Nowcasting with Google Trends

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Nowcasting with Google Trends
Causal Impact
Extensions

Applications

Nowcasting with Google Trends

Nowcasting
Maintaining real time estimates of infrequently observed time series.

US weekly initial claims for unemployment (ICNSA).

A leading indicator of recessions.

Can we learn this week's number before it is released?

We'd need a real time signal correlated with the outcome.

[Figure: US weekly initial claims for unemployment (ICNSA), in thousands, 2004-2012.]

Google searches are a real time indicator of public interest


Applications

Nowcasting with Google Trends

Google trends public interface


Get it from google.com/trends.

Click the little gear to download as CSV.

Data are the percentage of all search traffic, normalized so the maximum is 100.

You can restrict by type of search, time range, geo, or search category (vertical).

There are 600 verticals.
Hierarchical: the Automotive vertical has a Hybrid and Alternative Vehicles subvertical.
If you compare your search to a vertical and then download as CSV, then you get the vertical's series too.

That's 600 public interest indices you can use to predict YOUR time series!


Applications

Nowcasting with Google Trends

Individual search queries


Google correlate can provide the most highly correlated individual queries (up to 100)


Applications

Nowcasting with Google Trends

Posterior inclusion probabilities


With expected model size = 3, and the top 100 predictors from correlate

plot(model, "coef", inc = .1)

[Figure: posterior inclusion probabilities for the top predictors: unemployment.office, filing.for.unemployment, idaho.unemployment, sirius.internet.radio, sirius.internet.]

Only showing inclusion probabilities > .1.
Shading shows $\Pr(\beta_j > 0 \mid y)$:
White: positive coefficients.
Black: negative coefficients.


Applications

Nowcasting with Google Trends

What got chosen?


plot(model, "predictors", inc = .1)

Solid blue line: actual.
Remaining lines shaded by inclusion probability.

[Figure: scaled values of the selected predictors overlaid on the actual series, 2004-2012, with inclusion probabilities 1 (unemployment.office), 0.94 (filing.for.unemployment), 0.47 (idaho.unemployment), 0.14 (sirius.internet.radio), and 0.11 (sirius.internet).]

Applications

Nowcasting with Google Trends

How much explaining got done?


Dynamic distribution plot shows evolving pointwise posterior distribution of state
components.

plot(model, "components")

[Figure: pointwise posterior distribution of the trend, seasonal.52.1, and regression components, 2004-2012.]

Applications

Nowcasting with Google Trends

Did it help?

Plot shows cumulative absolute one-step-ahead prediction error.

The regressors are not very helpful during normal times.

They help the model to quickly adapt to the recession.

CompareBstsModels(list("pure time series" = model1,
                       "with Google Trends" = model2))

[Figure: cumulative absolute one-step-ahead prediction error for the pure time series model vs. the model with Google Trends regressors, 2004-2012, plotted above the scaled series.]

Applications

Causal Impact

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Nowcasting with Google Trends
Causal Impact
Extensions

Applications

Causal Impact

Measuring advertising effectiveness is a tricky business


"I know that half my advertising dollars are wasted. I just don't know which half."
John Wanamaker

One of the basic promises of online advertising is measurement.

It is supposed to be easy.
Change something (e.g. increase your bid on Google).
Look to see how many incremental ad clicks you get.

Life is never easy.
Ad clicks and native search clicks interact in complicated ways.
It is tough to get incremental clicks attributable to the ad campaign.
- Ad clicks can cannibalize native search clicks.
- Ads have a branding effect that can:
  1. be hard to measure,
  2. drive native search clicks, and
  3. outlast the campaign.

Applications

Causal Impact

Example
Real Google advertiser. 6-week ad campaign. Random shift added to both axes.

[Figure: the advertiser's daily clicks, April through mid-June.]

Applications

Causal Impact

Problem statement
An actor engages in a market intervention:
has a sale,
begins (or modifies) an advertising campaign, or
introduces (or adopts) a new product.

Other similar actors don't engage in the intervention.

We have data on both the actor and the similar actors prior to the intervention.

Question: What was the effect of the intervention?
Total change to the bottom line.
How quickly did changes begin to occur?
How quickly did the effect begin to die out?


Difference in differences
An old trick from econometrics. Only measures at two points.

Applications

Causal Impact

Synthetic controls
A more realistic counterfactual model than DnD

Abadie et al. (2003, 2010) suggested synthetic controls as counterfactuals.


Weighted averages of untreated actors are used to forecast the actor of interest.

Weights ($0 \le w_i \le 1$) are estimated so that the synthetic control series matches the actor's series in the pre-treatment period.

The difference from the forecast is the estimated treatment effect.

Good: Allows multiple controls, captures temporal effects.
Bad: Scaling issues (California vs. Rhode Island), sign constraints (negative correlations?), other time series?

Especially problematic for marketing. You know your sales, but not your competitors' sales.

Applications

Causal Impact

CausalImpact
Extends DnD and synthetic controls using BSTS

Use data in the pre-treatment period to build a flexible time series model for the series of interest.

Forecast the time series over the intervention period given data from the pre-treatment period.
Can use contemporaneous regressors in the forecast.
Model fit is based on pre-treatment data.
Deviations from the forecast are the treatment effect.

Assumes no interference between units. Often violated. Benign if the effect on the untreated is small relative to the effect on the treated.

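This analysis is packaged in the CausalImpact R package, which wraps bsts. A hedged sketch on simulated data (the series below are made up, not the advertiser data from the case study):

library(CausalImpact)

set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10          ## lift during the "campaign"
data <- cbind(y, x1)

pre.period <- c(1, 70)
post.period <- c(71, 100)
impact <- CausalImpact(data, pre.period, post.period)
plot(impact)
summary(impact)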

The picture
Simulated data

[Figure: a simulated example, with the pre-intervention and post-intervention periods marked.]

Applications

Causal Impact

Potential outcomes
Let $y_{jst}$ denote the value of $Y$ for unit $j$ under treatment $s$ at time $t$. $T$ is the time of the market intervention.

What we observe:
Before $T$: we observe $y_{j0t}$ for everyone.
After $T$: we observe $y_{j1t}$ for the actor and $y_{k0t}$ for the potential controls $k \neq j$.

If we could also observe $y_{j0t}$ for the actor, then $y_{j1t} - y_{j0t}$ would be the treatment effect.

For $t > T$ we have a model for $y_{j0t} \mid y_{k0t}$.


Applications

Causal Impact

Case study
A Google advertiser ran a marketing experiment.

Google search ads ran for 6 weeks.

The response is total search-related visits to the site:
native search clicks, and
ad clicks.

95 of 190 designated marketing areas received the ads. (DMAs are areas that can receive distinct TV ads.)


Applications

Causal Impact

This particular advertiser ran an experiment


Plot shows clicks from treated vs. untreated geos. Each dot is a time point.

[Figure: scatterplot of clicks in the treatment region against clicks in the control region, with points distinguished by period (before, during, after).]

Applications

Causal Impact

Case study
Google advertiser. Treated vs. Untreated regions
[Figure: weekly clicks for the treated regions, weeks -4 through 7, with the pre-intervention, intervention, and post-intervention periods marked.]

Applications

Causal Impact

Case study
Google advertiser. Competitors' clicks as predictors.

[Figure: the same analysis with competitors' clicks as predictors, weeks -4 through 7, with the pre-intervention, intervention, and post-intervention periods marked.]

Applications

Causal Impact

Case study
Google advertiser. Untreated regions. Competitors' sales as predictors.

[Figure: the placebo analysis on untreated regions, weeks -4 through 7, with the pre-intervention, intervention, and post-intervention periods marked.]

Applications

Causal Impact

Case study
Summary

Analysis              Clicks      %     95% Interval
vs. Untreated (1)     84,100     20     (15, 26)%
vs. Competitors (2)   84,800     21     (13, 26)%
A-A (placebo) test     8,000      2     (-5, 6)%

Experimental data are needed to do analysis 1.

Analysis 2 is observational, but it replicates the experimental results.

Using Google Trends (instead of competitor information) gives about the same results.
Google Trends data are publicly available, while competitor clicks are not.
There are many more potential controls in Google Trends. Spike-and-slab variable selection / model averaging is useful for selecting appropriate control groups.


Extensions

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions
Normal mixtures
Longer term forecasting

Extensions

Normal mixtures

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions
Normal mixtures
Longer term forecasting

Extensions

Normal mixtures

Relaxing Gaussian assumptions


All model matrices in state space models are subscripted by $t$.

We can replace the Gaussian assumptions with conditionally Gaussian assumptions, where there is a latent variable at each $t$ determining the means and variances.

The MCMC now looks like this:
Draw the latent variables $w = (w_1, \dots, w_T)$ given $\alpha$ and $\theta$.
Draw $p(\alpha \mid y, \theta, w)$.
Draw $p(\theta \mid y, \alpha, w)$.

Example: the T distribution is a mixture of normals (a normal divided by a chi-square):
$$w \sim \mathrm{Ga}(\nu/2, \nu/2), \qquad y \mid w \sim \mathcal{N}(\mu, \sigma^2 / w)$$

Extensions

Normal mixtures

Example: retail sales


RSXFS: retail sales, excluding food service

Monthly data, already seasonally adjusted.

Catastrophic drop in 2008.

The shift in the slope of the local linear trend is too large to be handled by the Gaussian assumption.

[Figure: RSXFS / 1000 (Retail Sales, Excluding Food Service), monthly, 1992-2012, showing the sharp 2008 drop.]

Extensions

Normal mixtures

Local linear trend with student T errors


rsxfs-analysis.R

$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
$$\mu_t = \mu_{t-1} + \delta_{t-1} + \eta_{\mu,t-1}, \qquad \eta_{\mu,t} \sim T(0, \sigma_\mu^2)$$
$$\delta_t = \delta_{t-1} + \eta_{\delta,t-1}, \qquad \eta_{\delta,t} \sim T(0, \sigma_\delta^2)$$

This is an old Bayesian trick to ensure robustness. If you tell the model that occasional large errors are possible, it is not surprised by occasional large errors.

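A hedged sketch of what rsxfs-analysis.R might look like (the data import and column name are assumptions about the file layout; AddStudentLocalLinearTrend is the bsts state component for this robust trend):

library(bsts)
rsxfs <- read.csv("RSXFS.csv")$VALUE / 1000   ## hypothetical file and column name

ss.gaussian <- AddLocalLinearTrend(list(), rsxfs)
model.gaussian <- bsts(rsxfs, state.specification = ss.gaussian, niter = 1000)

ss.student <- AddStudentLocalLinearTrend(list(), rsxfs)
model.student <- bsts(rsxfs, state.specification = ss.student, niter = 1000)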

Extensions

Normal mixtures

Comparing dispersion parameters


Under the Gaussian and T models:

[Figure: posterior distributions of the level and slope standard deviations under the Student and Gaussian versions of the model.]

Because the model is aware that occasional large errors can occur, the
standard deviation parameters can be smaller.


Extensions

Normal mixtures

Impact on predictions

[Figure: the original RSXFS data, and posterior predictive distributions for 2009-2014 under the Gaussian and Student models.]

The extreme quantiles of the predictions under the Student model are
wider than under the Gaussian model.

The central (e.g. 95%, 90%) intervals are narrower.


Extensions

Normal mixtures

More normal mixtures

Similar tricks can be used to model probit, logit, and Poisson responses,
and even dynamic support vector machines by expressing these
distributions as normal mixtures.


Extensions

Longer term forecasting

Outline
Introduction to time series modeling
Structural time series models
MCMC and the Kalman filter
Bayesian regression and spike-and-slab priors
Applications
Extensions
Normal mixtures
Longer term forecasting

Extensions

Longer term forecasting

Long term predictions


The local linear trend model is focused on detecting short term changes in the trend. It is very flexible, but it forgets the past quickly.

A less flexible, but more robust, trend model is
$$y_t = \mu_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
$$\mu_t = \mu_{t-1} + \delta_{t-1} + \eta_{\mu,t-1}, \qquad \eta_{\mu,t} \sim \mathcal{N}(0, \sigma_\mu^2)$$
$$\delta_t = D + \phi(\delta_{t-1} - D) + \eta_{\delta,t}, \qquad \eta_{\delta,t} \sim \mathcal{N}(0, \sigma_\delta^2)$$

Now the slopes $\delta_t$ follow an AR(1) process instead of a random walk.
If $|\phi| < 1$ then the AR(1) is stationary, so it does not entirely forget the past.

$D$ is the long run trend of the series.
The slope can locally deviate from $D$, but it will eventually return.

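In bsts this trend is available as AddSemilocalLinearTrend (called AddGeneralizedLocalLinearTrend in older versions of the package). A hedged sketch, reusing the AirPassengers series as a stand-in for RSXFS:

library(bsts)
y <- log10(AirPassengers)

ss <- AddSemilocalLinearTrend(list(), y)
model <- bsts(y, state.specification = ss, niter = 1000)
pred <- predict(model, horizon = 60)   ## a longer forecast horizon
plot(pred)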

Extensions

Longer term forecasting

Long term predictions

[Figure: the original RSXFS data and long term predictions for 2009-2014 under the local linear trend model and the AR(1)-slope trend model.]

References

Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika 81, 541-553.

de Jong, P. and Shepard, N. (1995). The simulation smoother for time series models. Biometrika 82, 339-350.

Durbin, J. and Koopman, S. J. (2002). A simple and efficient simulation smoother for state space time series analysis. Biometrika 89, 603-616.

Durbin, J. and Koopman, S. J. (2012). Time Series Analysis by State Space Methods. Oxford University Press.

Frühwirth-Schnatter, S. (1995). Bayesian model discrimination and Bayes factors for linear Gaussian state space models. Journal of the Royal Statistical Society, Series B 57, 237-246.

George, E. and McCulloch, R. (1997). Approaches for Bayesian variable selection. Statistica Sinica 7, 339-374.

