Bayesian SAE Using Complex Survey Data

Lecture 8: Small Area Estimation
Richard Li
Department of Statistics
University of Washington
1 / 71
Outline
Overview
Inference for SAE
SAE with BRFSS data in R
Binomial smoothing by hand (not weighted)
Accounting for survey designs
Using SUMMER
2 / 71
Overview
3 / 71
Motivation
Small area estimation (SAE) is an important endeavor since many
agencies require estimates of health, economic indices, education and
environmental measures in order to plan and allocate resources and
target interventions.
SAE is an example of domain (sub-population) estimation.
“Small” here refers to the fact that we will typically base our inference on
a small sample from each area (so it is not a description of geographical
size).
In the limit there may be some areas in which there are no data.
4 / 71
Small Area Estimation
Consider a study region partitioned into $n$ disjoint and exhaustive areas,
labeled by $i$, $i = 1, \dots, n$.
As a concrete example, suppose we are interested in a particular
condition so that the response is a binary outcome, $Y_{ik}$, for
$k = 1, \dots, N_i$, individuals in area $i$.
Based on samples that are collected in the areas (though some areas may
contain no samples), the aims of SAE include estimation of:
- The population totals:
  $$T_i = \sum_{k=1}^{N_i} Y_{ik}.$$
- The prevalence of the condition in each area:
  $$\theta_i = \frac{1}{N_i} \sum_{k=1}^{N_i} Y_{ik} = \frac{T_i}{N_i}.$$
5 / 71

Background reading on SAE
The classic text on SAE is Rao (2003), with a more recent edition (Rao
and Molina, 2015); not the easiest book to read, and little material on
spatial smoothing models.
An excellent review of SAE is Pfeffermann (2013).
The SAE literature distinguishes between direct estimation, in which only data
from the area itself are used to provide the estimate in an area, and indirect
estimation, in which data from other areas are also used to provide the
estimate.
6 / 71
Inference for SAE
7 / 71
Design-based inference based on weighted estimators
Suppose we undertake a complex design and obtain outcomes $y_{ik}$ in area
$i$, $k \in s_i$, where $s_i$ is the set of samples that were in area $i$.
Along with the outcome, there is an associated design weight $w_{ik}$.
Under the design-based approach to inference, it is common to use the
weighted estimator of the prevalence:
$$\hat{P}_i = \frac{\sum_{k \in s_i} w_{ik} y_{ik}}{\sum_{k \in s_i} w_{ik}}.$$
There is an associated variance estimate, $\hat{V}_i$, that acknowledges the design.
This variance estimate may be obtained analytically, or through
resampling techniques such as the jackknife.
Asymptotically (that is, in large samples):
$$\hat{P}_i \sim N(P_i, V_i).$$
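As a minimal sketch of the weighted estimator above, the following R code computes it for a single area. The outcomes and weights are made-up toy values, and the helper name weighted_prev is hypothetical (it is not part of the survey package or of the original analysis).

```r
# Toy data (hypothetical): binary outcomes and design weights for one area.
y <- c(1, 0, 0, 1, 0, 0, 0, 1)
w <- c(120, 300, 80, 950, 410, 60, 220, 150)

# Weighted estimator of prevalence: sum(w * y) / sum(w).
weighted_prev <- function(y, w) {
  stopifnot(length(y) == length(w), all(w > 0))
  sum(w * y) / sum(w)
}

p_hat_w <- weighted_prev(y, w)   # design-weighted estimate
p_hat_naive <- mean(y)           # unweighted estimate, for comparison
c(weighted = p_hat_w, naive = p_hat_naive)
```

This sketch only gives the point estimate. The design-based variance needs the full design information (strata, clusters) and is better left to a dedicated package.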
8 / 71
Direct Estimation
The simplest approach is to map the direct estimates $\hat{P}_i$.
To assess the uncertainty, one may map the lower and upper ends of
(say) a 90% confidence interval:
$$\hat{P}_i \pm 1.645 \times \sqrt{\hat{V}_i}.$$
If the samples in each area are large, so that $\hat{V}_i$ is small, then this
approach works well.
With small samples, however, $\hat{V}_i$ is large and the direct estimates are unstable.
Hence, as usual, we would like to carry out some form of smoothing, but
in the case of complex survey sampling, how should we proceed?
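The interval above can be sketched in R as follows. The function name direct_ci and the input values are hypothetical and chosen only for illustration; the estimates and variances would normally come from a design-based analysis.

```r
# Normal-approximation confidence interval for direct estimates.
# p_hat: direct estimates; v_hat: their design-based variance estimates.
direct_ci <- function(p_hat, v_hat, level = 0.90) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.645 for a 90% interval
  cbind(lower = p_hat - z * sqrt(v_hat),
        upper = p_hat + z * sqrt(v_hat))
}

# Hypothetical estimates for three areas with increasing variance.
p_hat <- c(0.08, 0.12, 0.15)
v_hat <- c(0.0004, 0.0009, 0.0025)
ci <- direct_ci(p_hat, v_hat)
ci
```

Note that with a large variance the lower limit of this interval can fall below zero, which is one practical reason for working on the logit scale later on.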
9 / 71
Design effects
The cluster design leads to a loss of information.
The so-called estimated design effect is
$$d_i = \frac{\hat{V}_i}{\hat{P}_i (1 - \hat{P}_i)/n_i},$$
and summarizes the information loss.
Define the effective sample size as
$$n_i^e = \frac{n_i}{d_i} = \frac{\hat{P}_i (1 - \hat{P}_i)}{\hat{V}_i}.$$
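A short R sketch of these two quantities, under the assumption that the direct estimate, its design-based variance and the raw sample size are already available. The function name design_effect and the numbers are hypothetical.

```r
# Estimated design effect and effective sample size for one or more areas.
design_effect <- function(p_hat, v_hat, n) {
  v_srs <- p_hat * (1 - p_hat) / n   # binomial variance under simple random sampling
  d <- v_hat / v_srs                 # design effect d_i
  data.frame(deff = d, n_eff = n / d)
}

# Hypothetical area: 120 respondents, design variance twice the SRS variance.
deff_res <- design_effect(p_hat = 0.10, v_hat = 0.0015, n = 120)
deff_res
```

In this toy case the design effect is 2, so the 120 respondents carry roughly the information of 60 independent observations.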
10 / 71
Smoothed Direct Estimation
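Let $\hat{P}_i$ be the weighted estimator, then consider
$$\hat{\theta}_i = \text{logit}(\hat{P}_i) = \log\left(\frac{\hat{P}_i}{1 - \hat{P}_i}\right),$$
which is on the whole of the real line.
“Data” Model (here the “data” are the estimators $\hat{\theta}_i$):
$$\hat{\theta}_i \sim N(\theta_i, \hat{V}_i),$$
where $\hat{V}_i$, the estimated variance of $\hat{\theta}_i$, is treated as known.
Prior Random Effects Model:
$$\theta_i = \beta_0 + \epsilon_i,$$
where the random effects $\epsilon_i \sim_{iid} N(0, \sigma_\epsilon^2)$.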
11 / 71
This is very similar to the normal-normal model we saw in Lecture 3.
Fay and Herriot (1979) suggested this hierarchical model, in a landmark
paper.
This model acknowledges the design and also smooths, and it is
straightforward to add spatial random effects.
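To make the smoothing concrete, here is a minimal R sketch of the shrinkage implied by the normal-normal structure of this model. It assumes, purely for illustration, that $\beta_0$ and $\sigma_\epsilon^2$ are known; in the actual analysis they are given priors and estimated. The function name fh_shrink and all numbers are hypothetical.

```r
# Conditional posterior of theta_i given beta0 and sigma2 (both assumed known):
#   mean = beta0 + w_i * (theta_hat_i - beta0),  w_i = sigma2 / (sigma2 + v_hat_i)
#   var  = sigma2 * v_hat_i / (sigma2 + v_hat_i)
fh_shrink <- function(theta_hat, v_hat, beta0, sigma2) {
  w <- sigma2 / (sigma2 + v_hat)   # weight placed on the direct estimate
  data.frame(
    weight    = w,
    post_mean = beta0 + w * (theta_hat - beta0),
    post_var  = w * v_hat
  )
}

# Hypothetical logit-scale direct estimates with increasing variances.
theta_hat <- c(-2.0, -2.6, -1.5)
v_hat     <- c(0.05, 0.20, 0.60)
fh_res <- fh_shrink(theta_hat, v_hat, beta0 = -2.3, sigma2 = 0.10)
fh_res
```

Areas with larger first-stage variance receive a smaller weight and are pulled more strongly toward the overall level, which is the behavior the slides describe as smoothing.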
12 / 71
The spatial version of the model has:
“Data” Model:
$$\hat{\theta}_i \sim N(\theta_i, \hat{V}_i),$$
where $\hat{V}_i$ is the known variance.
Prior Model:
$$\theta_i = \beta_0 + \epsilon_i + S_i,$$
with
- $\epsilon_i \sim N(0, \sigma_\epsilon^2)$.
- $S_i \sim \text{ICAR}(\sigma_s^2)$.
Adding a term $x_i^T \beta$ to the prior model allows covariate relationships to be
investigated.
This model has been investigated and applied with simulated and real
data in Chen et al. (2014) and Mercer et al. (2014), and (in a space-time
setting) in Mercer et al. (2014, 2015) and Li et al. (2018).
13 / 71
FYI: Different Models for Binary Responses
- Binomial sampling model: only strictly valid if no stratified sampling
  and no cluster sampling.
- Direct estimates at the area level.
- Smoothed direct estimates at the area level, modeling the logit of
  the direct estimates of the probabilities.
- Binomial GLMM at the area level: only strictly valid if no stratified
  sampling and no cluster sampling.
- Binomial model for responses within each cluster with
  - strata fixed effects,
  - cluster random effects,
  - IID random effects at the area level,
  - spatial random effects at the area level (via an ICAR model).
- Binomial model for responses within each cluster with
  - strata fixed effects,
  - IID cluster random effects,
  - IID household effects?
  - spatial random effects at the cluster level (via a Gaussian process
    model).
14 / 71
SAE with BRFSS data in R
15 / 71
Motivating Example: Diabetes in King County
This example arises out of a joint project between Laina Mercer/Jon Wakefield and
Seattle and King County Public Health, which led to the work reported
in Song et al. (2016).
The aim we will concentrate on here is to estimate the number of individuals
aged 18 years or older with diabetes, by health reporting area (HRA) in King
County in 2011.
HRAs are city-based sub-county areas, with a total of 48 HRAs in King
County. Some of these areas are a single city, some are a group of
smaller cities, and some are unincorporated areas. Larger cities such as
Seattle and Bellevue include more than one HRA.
Data are based on the question, “Has a doctor, nurse, or other health
professional ever told you that you had diabetes?”, in 2011.
16 / 71
Figure: Health reporting areas (HRAs) in King County, WA, each labeled with its 2010 population (N). Auburn, Bellevue, Federal Way, Kent, Renton and Seattle HRAs are divided into neighborhoods. Data source: Interim population estimates, PHSKC, APDE 1/2012. Produced by Public Health-Seattle & King County, Assessment, Policy Development & Evaluation; last modified 08/2012.
17 / 71
Motivating BRFSS Example
Estimates are used for a variety of purposes including summarization for
the local communities and assessment of health needs.
Analysis and dissemination of place-based disparities is of great
importance to allow efficient targeting of place-based interventions.
Because of its demographics, King County looks good compared to other
areas in the U.S., but some of its disparities are among the largest of
major metro areas.
Estimation is based on Behavioral Risk Factor Surveillance System
(BRFSS) data.
The BRFSS is an annual telephone health survey conducted by the
Centers for Disease Control and Prevention (CDC) that has tracked health
conditions and risk behaviors in the United States and its territories since
1984.
18 / 71
Figure: Public Health: Seattle and King County website.
19 / 71
Figure: Summaries from Public Health: Seattle King County.
20 / 71
Figure: Summaries from Public Health: Seattle King County. Map of life expectancy compared to the ten longest-lived countries, by census tract, 2005-2009, King County, WA, shown as calendar years ahead or behind (years behind or ahead are from 2007). Data sources: international life expectancies from the Institute for Health Metrics and Evaluation, University of Washington; local life expectancy from the Washington State Department of Health, Center for Health Statistics death files. Prepared by Assessment, Policy Development & Evaluation, Public Health-Seattle & King County, 10/11/2011. Provisional: subject to revision.
21 / 71
The BRFSS sampling scheme is complex: it uses a disproportionate
stratified sampling scheme.
The Sample Wt is calculated as the product of four terms:
$$\text{Sample Wt} = \text{Strat Wt} \times \frac{1}{\text{No Telephones}} \times \text{No Adults} \times \text{Post Strat Wt},$$
where Strat Wt is the inverse probability of a “likely” or “unlikely”
stratum being selected (stratification based on county and “phone
likelihood”).
Table: Summary statistics for population data, and 2011 King County BRFSS
diabetes data, across health reporting areas.
                    Mean     Std. Dev.   Median   Min     Max      Total
Population (>18)    31,619   10,107      30,579   8,556   56,755   1,517,712
Sample Sizes        62.9     24.3        56.5     20      124      3,020
Diabetes Cases      6.3      3.1         6.3      1       15       302
Sample Weights      494.3    626.7       280.4    48.0    5,461    1,491,880
22 / 71
A total of 3,020 individuals answered the diabetes question.
About 35% of the areas have sample sizes less than 50 (the CDC
recommended cut-off), so that the diabetes prevalence estimates are
unstable in these areas.
We would like to use the totality of the data to aid in estimation in the
data-sparse areas.
The variability in the weights is high, from 48 to 5,461, with mean 494.
The coefficient of variation (CV) of the weights is 1.27.
Therefore, the inefficiency of using the sample weights under the
assumption that the unweighted mean is unbiased is about 62%, calculated
as $\text{CV}^2/(\text{CV}^2 + 1)$ (Korn and Graubard, 1999).
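The arithmetic behind the 62% figure can be checked with a few lines of R. The only inputs are the mean and standard deviation of the sample weights reported in the summary table; the variable names are hypothetical.

```r
# Mean and standard deviation of the sample weights, from the summary table.
w_mean <- 494.3
w_sd   <- 626.7

# Coefficient of variation of the weights.
cv <- w_sd / w_mean

# Inefficiency of weighting when the unweighted mean is unbiased:
# CV^2 / (CV^2 + 1).
inefficiency <- cv^2 / (cv^2 + 1)
round(c(cv = cv, inefficiency = inefficiency), 2)
```

With the raw weights available, the same CV would be computed as sd(w) / mean(w) on the weight vector.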
23 / 71
Modeling BRFSS data
- We take as an example the estimation of the prevalence of Type II
  diabetes in health reporting areas (HRAs) in King County, using
  BRFSS data.
- These survey data are collected using a complex stratified design.
- The design must be acknowledged in the analysis, but we would like
  to use spatial smoothing to obtain estimates with more precision.
- We will work through a case study of analyzing such a dataset using
  various methods.
24 / 71
Outline
To thread together what we have talked about so far, we can perform the
following analyses:
- Naive (i.e. unweighted, unsmoothed)
- Binomial spatial smoothing model, ignoring weighting
- Weighted (unsmoothed)
  - By hand and using the SUMMER package
- Smoothed and weighted
  - By hand (very briefly, but you will see it again in the exercise
    session this afternoon) and using the SUMMER package
25 / 71
Load data
First, we need to read in the King County BRFSS Stata dataset using the
foreign package.
library(foreign)
# kingdata <-
# read.dta(url('http://www.samclark.net/apa-sae/data/ct0913all.dta'))
kingdata <- read.dta("../data/ct0913all.dta")
names(kingdata)
## [1] "age" "pracex" "educau" "zipcode" "sex" "street1"

## [8] "seqno" "year" "hispanic" "mracex" "_ststr" "hracode"
## [15] "rwt_llcp" "genhlth2" "fmd" "obese" "smoker1" "diab2"
## [22] "zipout" "streetx" "ethn" "age4" "ctmiss"
26 / 71
Load map
Next, read in the shape files for King County HRAs
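# Option 1: read the shapefile with maptools
# install.packages('maptools')
library(maptools)
f <- "../data/HRA_ShapeFiles/HRA_2010Block_Clip.shp"
kingshape <- readShapePoly(f)
# Option 2: read the same shapefile with rgdal (this overwrites kingshape)
# Note: maptools and rgdal have since been retired from CRAN, so older
# package versions may be needed to run this code as written.
# install.packages('rgdal')
library(rgdal)
kingshape <- readOGR("../data/HRA_ShapeFiles",
    layer = "HRA_2010Block_Clip")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/zehangli/Dropbox/Teachings/2018APAshanghai/data/HRA_S
## with 48 features
## It has 9 fields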
27 / 71
Initial data cleaning
- Our outcome of interest is Type II diabetes and we will drop
  observations with missing diabetes data.
- Our small area of interest is the HRA. We will also drop observations
  with missing HRA.
# drop observations with a missing outcome or a missing area
kingdata <- subset(kingdata, !is.na(kingdata$diab2))
kingdata <- subset(kingdata, !is.na(kingdata$hracode))
# give the strata variable a syntactically valid name
names(kingdata)[names(kingdata) == "_ststr"] <- "strata"
# remove the trailing space in one HRA name
kingdata$hracode <- as.character(kingdata$hracode)
kingdata[kingdata$hracode == "Fairwood ",
    "hracode"] <- "Fairwood"
n.area <- length(unique(kingdata$hracode))
28 / 71
Naive binomial model
- Let $y_i$ and $m_i$ be the number of individuals flagged as having type II
  diabetes and the denominators in the $i = 1, \dots, n$ areas.
- We form naive estimates
  $$\hat{p}_i = \frac{y_i}{m_i}$$
  with associated standard errors
  $$\sqrt{\frac{\hat{p}_i (1 - \hat{p}_i)}{m_i}}.$$
29 / 71
Naive binomial model
# HRA names, in the order used by the shapefile
hras <- as.character(kingshape$HRA2010v2_)
props <- matrix(NA, nrow = n.area, ncol = 5)
props <- as.data.frame(props)
colnames(props) <- c("hracode", "p.hat",
    "se.p.hat", "y.i", "n.i")
props[, 1] <- hras
for (i in 1:n.area) {
    # diabetes indicators for the respondents in area i
    d.i <- kingdata[kingdata$hracode == props[i, "hracode"], "diab2"]
    props[i, "p.hat"] <- mean(d.i)
    props[i, "y.i"] <- sum(d.i)
    props[i, "n.i"] <- length(d.i)
    # naive binomial variance p(1 - p)/n
    naivevar <- props[i, "p.hat"] * (1 -
        props[i, "p.hat"])/props[i, "n.i"]
    props[i, "se.p.hat"] <- sqrt(naivevar)
}
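The same naive summaries can also be computed without an explicit loop. The sketch below uses a small made-up data frame with the same column names as kingdata (hracode, diab2) and assumes diab2 is coded 0/1; it is an illustrative alternative, not the code used in the original analysis.

```r
# Hypothetical toy data with the same column names as kingdata.
toy <- data.frame(
  hracode = c(rep("A", 4), rep("B", 5)),
  diab2   = c(1, 0, 0, 0, 1, 1, 0, 0, 0)
)

# Per-area counts of cases and respondents.
y.i <- tapply(toy$diab2, toy$hracode, sum)
n.i <- tapply(toy$diab2, toy$hracode, length)
p.hat <- y.i / n.i

naive <- data.frame(
  hracode  = names(y.i),
  y.i      = as.vector(y.i),
  n.i      = as.vector(n.i),
  p.hat    = as.vector(p.hat),
  se.p.hat = as.vector(sqrt(p.hat * (1 - p.hat) / n.i))
)
naive
```

One design point to keep in mind: tapply() returns areas in sorted order, so the rows would still need to be matched to the shapefile order (for example with match()) before fitting a spatial model.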
30 / 71
Naive binomial model: merge into map
Load the plotting packages and merge the naive estimates into the map data
library(ggplot2)
library(viridis)
geo <- fortify(kingshape, region = "HRA2010v2_")
geo1 <- merge(geo, props, by.x = "id", by.y = "hracode")
Plot the prevalence on the map and visually check it
g <- ggplot(geo1)
g <- g + geom_polygon(aes(x = long, y = lat,
group = group, fill = p.hat), color = "gray")
g <- g + theme_void()
g <- g + scale_fill_viridis()
g
31 / 71
Naive binomial model: merge into map
Figure: Map of the naive prevalence estimates (p.hat) by HRA.
32 / 71
Binomial smoothing by hand (not weighted)
33 / 71
Binomial smoothing: the model
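We use the INLA package to fit the following Bayesian hierarchical model:
$$y_i \mid p_i \sim \text{Binomial}(m_i, p_i),$$
$$\theta_i = \log\left(\frac{p_i}{1 - p_i}\right) = \mu + \epsilon_i + s_i,$$
$$\epsilon_i \sim N(0, \sigma_\epsilon^2),$$
$$s_i \mid s_j, j \in \text{ne}(i) \sim N\left(\bar{s}_i, \frac{\sigma_s^2}{n_i}\right),$$
where $m_i$ is the number of respondents in area $i$, $n_i$ is the number of neighbors for area $i$, and
$$\bar{s}_i = \frac{1}{n_i} \sum_{j \in \text{ne}(i)} s_j.$$
Priors are put on $\mu$, $\sigma_\epsilon^2$, $\sigma_s^2$.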
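To illustrate the ICAR conditional distribution, the following R sketch computes the conditional mean and variance of each $s_i$ given its neighbors. The four-area adjacency matrix, the values of s, and the function name icar_conditional are all hypothetical.

```r
# Toy adjacency matrix for 4 areas arranged on a line: 1 - 2 - 3 - 4.
A <- matrix(0, 4, 4)
A[cbind(1:3, 2:4)] <- 1
A <- A + t(A)

# Conditional mean and variance of s_i given its neighbors under the ICAR model.
icar_conditional <- function(s, A, sigma2_s) {
  n_nb <- rowSums(A)                       # n_i: number of neighbors of area i
  stopifnot(all(n_nb > 0))                 # every area needs at least one neighbor
  data.frame(
    cond_mean = as.vector(A %*% s) / n_nb, # average of the neighboring effects
    cond_var  = sigma2_s / n_nb
  )
}

s <- c(0.4, 0.1, -0.2, -0.5)
icar_res <- icar_conditional(s, A, sigma2_s = 0.2)
icar_res
```

Areas with more neighbors have a smaller conditional variance, so their spatial effect is tied more tightly to the local average.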
34 / 71
Binomial smoothing: construct adjacency matrix
To perform spatial smoothing using ICAR, we first need to construct an
adjacency matrix where each row and column is a region.
- Diagonal elements are 0
- Off-diagonal elements are 1 if the two corresponding regions are
  adjacent and 0 otherwise
library(spdep)
# neighbors must share more than a single boundary point (queen = FALSE)
nb.r <- poly2nb(kingshape, queen = FALSE,
    row.names = kingshape$HRA2010v2_)
# binary (style = "B") adjacency matrix
mat <- nb2mat(nb.r, style = "B", zero.policy = TRUE)
colnames(mat) <- rownames(mat)
mat <- as.matrix(mat[1:dim(mat)[1], 1:dim(mat)[1]])
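Before passing an adjacency matrix to a model, it can help to verify the properties listed above. The checker below is a hypothetical helper written for illustration (check_adjacency is not an spdep or INLA function), demonstrated on a made-up three-area matrix.

```r
# Basic sanity checks for a binary adjacency matrix.
check_adjacency <- function(mat) {
  c(
    square        = nrow(mat) == ncol(mat),
    symmetric     = isSymmetric(unname(mat)),
    zero_diagonal = all(diag(mat) == 0),
    binary        = all(mat %in% c(0, 1)),
    no_islands    = all(rowSums(mat) > 0)   # every area has at least one neighbor
  )
}

# Hypothetical example: areas A - B - C on a line.
toy_mat <- matrix(c(0, 1, 0,
                    1, 0, 1,
                    0, 1, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
adj_checks <- check_adjacency(toy_mat)
adj_checks
```

An area without neighbors is not necessarily an error (the code above uses zero.policy = TRUE), but it is worth knowing about before fitting an ICAR model.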
35 / 71
Binomial smoothing: model fitting
Implementation details:
- The index of the areas needs to be in the same order as in the
  adjacency matrix. This is easily missed if the data have been reordered.
- Multiple random effects each need an index variable (unstruct and
  struct below).
# number of areas whose order differs from the adjacency matrix (should be 0)
sum(colnames(mat) != props$hracode)
## [1] 0
props$unstruct <- props$struct <- 1:n.area
36 / 71
Binomial smoothing: model fitting
The following code carries out an unweighted binomial analysis, with
global and spatial smoothing, the latter via the ICAR model.
library(INLA)
# struct: ICAR (besag) spatial effect; unstruct: IID effect
formula <- y.i ~ 1 +
    f(struct, model = 'besag',
      adjust.for.con.comp = TRUE,
      constr = TRUE, graph = mat,
      scale.model = TRUE,
      param = c(0.5, 0.0015)) +
    f(unstruct, model = 'iid',
      param = c(0.5, 0.0015))
# binomial likelihood with n.i trials in each area
fit.naive <- inla(formula,
    family = "binomial",
    data = props, Ntrials = n.i,
    control.predictor = list(compute = TRUE))
37 / 71
Binomial smoothing: organize output
props.smooth <- props

# posterior median
props.smooth[, "p.hat"] <- fit.naive$summary.fitted.values[,
"0.5quant"]
# posterior standard deviations
props.smooth[, "se.p.hat"] <- fit.naive$summary.fitted.values[,
"sd"]
# Post medians of unstructured random
# effects
props.smooth[, "unstruct"] <- fit.naive$summary.random$unstruct[,
"0.5quant"]
# Post medians of spatial random effects
props.smooth[, "struct"] <- fit.naive$summary.random$struct[,
"0.5quant"]
38 / 71
Binomial smoothing: Unstructured random effects
geo2 <- merge(geo, props.smooth, by = "id",

by.y = "hracode")
g <- ggplot(geo2) + geom_polygon(aes(x = long,
y = lat, group = group, fill = unstruct),
color = "gray")
g <- g + theme_void() + scale_fill_viridis()
g
Figure: Map of the posterior medians of the unstructured random effects (unstruct).
39 / 71
Binomial smoothing: Spatial random effects

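g <- ggplot(geo2) + geom_polygon(aes(x = long,
    y = lat, group = group, fill = struct),
    color = "gray")
g
Figure: Map of the posterior medians of the spatial random effects (struct).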
40 / 71
Binomial smoothing: Proportion of variance (recap)
- It could be interesting to evaluate the proportion of variance
  explained by the structured spatial component.
- However, the estimated $\sigma_s^2$ and $\sigma_\epsilon^2$ are not directly comparable.
- Instead, we calculate the posterior marginal variance for the
  structured effect (see Section 6.1.2 of Blangiardo et al. (2015) for
  more details).
Sre <- matrix(NA, 1e4, 48)

for (i in 1:48){
Sre[,i] <- inla.rmarginal(1e4,
fit.naive$marginals.random$struct[[i]])
}
var.Sre <- apply(Sre,1,var)
var.eps <- inla.rmarginal(1e4, inla.tmarginal(function(x){1/x},
fit.naive$marginals.hyper$"Precision for unstruct"))
perc.var.Sre <- mean(var.Sre/(var.Sre+var.eps))
perc.var.Sre
## [1] 0.9610054
41 / 71
Binomial smoothing: Proportion of variance
To see that there is a difference between $\sigma_s^2$ and the posterior marginal
variance for the structured effects:
# use a name other than var to avoid masking stats::var()
var.tab <- matrix(NA, 2, 2)
colnames(var.tab) <- c("S", "Sigma^2")
rownames(var.tab) <- c("median", "mean")
draws1 <- matrix(NA, 10000, 48)
for (i in 1:48) {
    draws1[, i] <- inla.rmarginal(10000,
        fit.naive$marginals.random$struct[[i]])
}
var.tab[1, 1] <- median(apply(draws1, 1, var))
var.tab[2, 1] <- mean(apply(draws1, 1, var))
draws2 <- inla.rmarginal(10000, inla.tmarginal(function(x) 1/x,
    fit.naive$marginals.hyper$"Precision for struct"))
var.tab[1, 2] <- median(draws2)
var.tab[2, 2] <- mean(draws2)
var.tab
##                S    Sigma^2
## median 0.1175084 0.06626365
## mean   0.1180709 0.07019962
42 / 71
Binomial smoothing: predicted prevalence

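g <- ggplot(geo2) + geom_polygon(aes(x = long,
    y = lat, group = group, fill = p.hat),
    color = "gray")
g
Figure: Map of the smoothed prevalence estimates (p.hat, posterior medians).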
43 / 71
Binomial smoothing: SE of prevalence

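g <- ggplot(geo2) + geom_polygon(aes(x = long,
    y = lat, group = group, fill = se.p.hat),
    color = "gray")
g
Figure: Map of the posterior standard deviations of the smoothed prevalence (se.p.hat).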
44 / 71
Binomial smoothing: compare with naive approach
par(mfrow = c(1, 2))

lim1 <- range(c(props$p.hat, props.smooth$p.hat))
plot(props$p.hat, props.smooth$p.hat, xlim = lim1,
ylim = lim1, xlab = "Naive Prevalence",
ylab = "Smoothed prevalence")
abline(c(0, 1), col = "red")
lim2 <- range(c(props$se.p.hat, props.smooth$se.p.hat))
plot(props$se.p.hat, props.smooth$se.p.hat,
xlim = lim2, ylim = lim2, xlab = "Naive Prevalence SE",
ylab = "Smoothed prevalence SE")
45 / 71
Binomial smoothing: compare with naive approach
Figure: Smoothed versus naive estimates. Left panel: smoothed prevalence against naive prevalence. Right panel: smoothed prevalence SE against naive prevalence SE.
46 / 71
Accounting for survey designs
47 / 71
Survey weighted estimates: weights
- BRFSS uses a complex survey design. See
  http://www.cdc.gov/brfss/annual_data/2013/pdf/Weighting_Data.pdf for more
  details of the weighting procedure.
- Raking adjusts for: telephone source (allowing for cell phones),
  race/ethnicity, education, marital status, age group by gender,
  gender by race and ethnicity, age group by race and ethnicity,
  renter/owner status.
- Design weights are
  $$\text{STRWT} \times \frac{1}{\text{NUMPHON2}} \times \text{NUMADULT}.$$
- GEOSTR is the geographical stratum (which in general may be the
  entire state or a geographic subset such as counties, census tracts,
  etc.). DENSTR is the density of the phone numbers for a given
  block of numbers as listed or not listed.
48 / 71
Survey weighted estimates: weights
- NRECSTR is the number of available records and NRECSEL is the
  number of records selected within each geographical stratum and
  density stratum.
- Within each GEOSTR × DENSTR combination, the stratum weight
  (STRWT) is calculated from the average of the NRECSTR and the
  sum of all sample records used to produce the NRECSEL. The
  stratum weight is equal to NRECSTR/NRECSEL, i.e. the reciprocal
  of the selection probability.
- An adjustment is also made for the mostly-cellular-telephone users in
  the dual sampling frame. Weight trimming is also used.
- The final weight rwt_llcp is the raked design weight.
49 / 71
Survey weighted estimates: asymptotic distribution of p̂i
- The survey package will give us survey-weighted estimates of $p_i$, the
  proportion of people with Type II diabetes in small area $i$, and a
  survey-weighted estimate of the standard error, $\widehat{\text{SE}}(\hat{p}_i)$.
- We use the method described in Mercer et al. (2014).
- If we specify $y_i = \log\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right)$ then, by the delta method, the
  asymptotic (sampling) distribution of $y_i$ is:
  $$y_i \mid p_i \sim N\left(\log\left(\frac{p_i}{1 - p_i}\right), \frac{\widehat{\text{var}}(\hat{p}_i)}{\hat{p}_i^2 (1 - \hat{p}_i)^2}\right).$$
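The delta-method variance above follows from the derivative of the logit, $1/\{p(1-p)\}$. The R sketch below checks that derivative numerically and applies the transformation to one made-up estimate; the function names logit and logit_delta and the input values are hypothetical.

```r
logit <- function(p) log(p / (1 - p))

# Logit-scale estimate and delta-method variance from a prevalence
# estimate p_hat and its standard error se_p.
logit_delta <- function(p_hat, se_p) {
  data.frame(
    logit_p = logit(p_hat),
    logit_v = se_p^2 / (p_hat * (1 - p_hat))^2
  )
}

# Numerical check of d logit(p) / dp = 1 / (p (1 - p)) at p = 0.1.
p0 <- 0.1
h  <- 1e-6
num_deriv      <- (logit(p0 + h) - logit(p0 - h)) / (2 * h)
analytic_deriv <- 1 / (p0 * (1 - p0))
c(numerical = num_deriv, analytic = analytic_deriv)

# Hypothetical area: prevalence 0.10 with design-based standard error 0.02.
delta_res <- logit_delta(p_hat = 0.10, se_p = 0.02)
delta_res
```

The approximation relies on a reasonably precise $\hat{p}_i$ that is not too close to 0 or 1, so it deserves some caution in areas with very few respondents.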
50 / 71
Survey weighted estimates: calculation
library(survey)
props.w <- props
kingcounty.des <- svydesign(ids = ~1, weights = ~rwt_llcp,
strata = ~strata, data = kingdata)
weighted <- svyby(~diab2, ~hracode, kingcounty.des,
svymean)
rows <- match(weighted$hracode, props.w$hracode)
props.w[rows, "p.hat"] <- weighted$diab2
props.w[rows, "se.p.hat"] <- weighted$se
props.w[, "logit.p"] <- log(props.w[, "p.hat"]/(1 -
props.w[, "p.hat"]))
props.w[, "logit.v"] <- props.w[, "se.p.hat"]^2/(props.w[,
"p.hat"] * (1 - props.w[, "p.hat"]))^2
props.w[, "logit.prec"] <- 1/props.w[, "logit.v"]
51 / 71
Survey weighted estimates: calculation
We obtain
- The weighted estimators of prevalences: p.hat
- The design standard errors of prevalences: se.p.hat
- The weighted estimators of logits of prevalences: logit.p
- The design variances of logits of prevalences: logit.v
52 / 71
Survey weighted estimates: compare with naive approach

par(mfrow = c(1, 2))
lim1 <- range(c(props$p.hat, props.w$p.hat))
plot(props$p.hat, props.w$p.hat, xlim = lim1,
    ylim = lim1, xlab = "Naive Prevalence",
    ylab = "Survey-weighted prevalence")
lim2 <- range(c(props$se.p.hat, props.w$se.p.hat))
plot(props$se.p.hat, props.w$se.p.hat, xlim = lim2,
    ylim = lim2, xlab = "Naive Prevalence SE",
    ylab = "Survey-weighted prevalence SE")
53 / 71
Survey weighted estimates: compare with naive approach
Figure: Survey-weighted versus naive estimates. Left panel: survey-weighted prevalence against naive prevalence. Right panel: survey-weighted prevalence SE against naive prevalence SE.
54 / 71
Survey weighted estimates: compare with binomial
smoothing

par(mfrow = c(1, 2))
logit.binomial <- fit.naive$summary.linear.predictor[,
    "0.5quant"]
logit.v.binomial <- fit.naive$summary.linear.predictor[,
    "sd"]^2
lim1 <- range(c(logit.binomial, props.w$logit.p))
plot(logit.binomial, props.w$logit.p, xlim = lim1,
    ylim = lim1, xlab = "Posterior median (unweighted)",
    ylab = "Survey-weighted logit prevalence")
lim2 <- range(c(logit.v.binomial, props.w$logit.v))
plot(logit.v.binomial, props.w$logit.v, xlim = lim2,
    ylim = lim2, xlab = "Posterior variance (unweighted)",
    ylab = "Survey-weighted logit prevalence variance")
55 / 71
Survey weighted estimates: compare with binomial
smoothing
Figure: Survey-weighted estimates versus unweighted binomial smoothing, on the logit scale. Left panel: survey-weighted logit prevalence against posterior median (unweighted). Right panel: survey-weighted logit prevalence variance against posterior variance (unweighted).
56 / 71
Weighted and smoothed model
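We use the INLA package to fit the following Bayesian hierarchical model:
$$y_i = \log\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) \sim N(\theta_i, \hat{V}_i),$$
$$\theta_i = \mu + \epsilon_i + s_i,$$
$$\epsilon_i \sim N(0, \sigma_\epsilon^2),$$
$$s_i \mid s_j, j \in \text{ne}(i) \sim N\left(\bar{s}_i, \frac{\sigma_s^2}{n_i}\right),$$
with priors on $\mu$, $\sigma_\epsilon^2$, $\sigma_s^2$.
The key here is that the first stage variance $\hat{V}_i$ is assumed known:
$$\hat{V}_i = \frac{\widehat{\text{var}}(\hat{p}_i)}{\hat{p}_i^2 (1 - \hat{p}_i)^2}.$$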
57 / 71
Weighted and smoothed model: model fitting
props.w$unstruct <- props.w$struct <- 1:n.area
formula <- logit.p ~ 1 +
    f(struct, model = 'besag',
      adjust.for.con.comp = TRUE,
      constr = TRUE, graph = mat,
      scale.model = TRUE,
      param = c(0.5, 0.0015)) +
    f(unstruct, model = 'iid', param = c(0.5, 0.0015))
# The Gaussian precision is fixed at 1 (initial = log(1), fixed = TRUE) and
# multiplied by scale, so each observation has known precision logit.prec
fit.weighted <- inla(formula,
    family = "gaussian", data = props.w,
    control.predictor = list(compute = TRUE),
    control.family = list(hyper = list(prec = list(
        initial = log(1), fixed = TRUE))),
    scale = props.w$logit.prec)
58 / 71
Weighted and smoothed model: compare with weighted
Figure: Weighted and smoothed model versus survey-weighted estimates, on the logit scale. Left panel: posterior median (weighted) against survey-weighted logit prevalence. Right panel: posterior variance (weighted) against survey-weighted logit prevalence variance.
59 / 71
Using SUMMER
60 / 71
Weighted and smoothed model: using SUMMER
The SUMMER (Spatio-Temporal Under-Five Mortality Methods with
Estimation in R) package provides a function fitSpace() that produces
weighted and smoothed estimates.
library(SUMMER)
fit <- fitSpace(data = kingdata, geo = kingshape,
Amat = mat, family = "binomial", responseVar = "diab2",
strataVar = "strata", weightVar = "rwt_llcp",
regionVar = "hracode", clusterVar = "~1",
hyper = NULL, CI = 0.95)
61 / 71
SUMMER: default hyperpriors
For binary variables, the default hyperprior for the precisions is Gamma(0.5,
0.001488), which leads to a 95% prior interval for the residual odds ratio
of [0.5, 2], i.e., for
$$u \mid \tau \sim_{iid} \text{Normal}(0, 1/\tau), \quad \tau \sim \text{Gamma}(0.5, 0.001488),$$
the [2.5%, 97.5%] quantiles of $\exp(u)$ are roughly [0.5, 2]. See Section 9.6.2 of
Wakefield (2013) for more details.
The structured effects are all scaled to have unit generalized marginal
variance, so that the precision parameter has a similar interpretation.
See https://www.math.ntnu.no/inla/r-inla.org/tutorials/inla/scale.model/scale-model-tutorial.pdf
for more details about the scaled models.
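The stated prior interval can be checked numerically. The sketch below assumes the Gamma prior is parameterized by shape and rate; under that assumption the marginal distribution of $u$ is a scaled Student t with $2 \times 0.5 = 1$ degree of freedom and scale $\sqrt{0.001488/0.5}$, so the quantiles are available in closed form. A Monte Carlo version is included as a cross-check. Variable names are hypothetical.

```r
a <- 0.5        # Gamma shape
b <- 0.001488   # Gamma rate (assumed shape/rate parameterization)

# Closed form: marginally, u ~ scale_u * t_{2a}, with scale_u = sqrt(b / a).
scale_u <- sqrt(b / a)
q_u <- qt(c(0.025, 0.975), df = 2 * a) * scale_u
or_interval <- exp(q_u)   # interval for the residual odds ratio exp(u)
or_interval

# Monte Carlo cross-check: draw tau, then u given tau.
set.seed(1)
tau <- rgamma(1e5, shape = a, rate = b)
u   <- rnorm(1e5, mean = 0, sd = 1 / sqrt(tau))
or_interval_mc <- exp(quantile(u, c(0.025, 0.975)))
or_interval_mc
```

Because the marginal prior is heavy-tailed, the simulated quantiles are noisier than the closed-form ones, but both should land close to [0.5, 2].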
62 / 71
SUMMER fit
The fit object from the SUMMER package contains
- HT
  - HT.est.original, HT.variance.original: mean and variance of the
    direct estimates accounting for survey weights.
  - HT.est, HT.sd, HT.variance, HT.prec: mean, and asymptotic
    sd, variance and precision of the logit transformed direct estimates
    accounting for survey weights.
- smooth
  - mean.trans, sd.trans, median.trans, lower.trans,
    upper.trans: mean, sd, median, and posterior credible interval limits
    (level specified by the CI argument in the function call) of the
    posterior prediction on the probability scale.
  - mean, sd, median, lower, upper: mean, sd, median, and posterior
    credible interval limits (level specified by the CI argument in the
    function call) of the posterior prediction on the logit scale.
63 / 71
Easier visualization: merge all results
fit$HT$HT.sd.original <- fit$HT$HT.variance.original^0.5

combined <- merge(fit$HT, fit$smooth, by = "region")
combined <- combined[match(props$hracode,
combined$region), ]
# naive
combined$p.hat.naive <- props$p.hat
combined$se.p.hat.naive <- props$se.p.hat
# binomial smoothing
combined$p.hat.binomial <- props.smooth$p.hat
combined$se.p.hat.binomial <- props.smooth$se.p.hat
64 / 71
Easier visualization
g <- mapPlot(data = combined, geo = kingshape,

variables = c("p.hat.naive", "p.hat.binomial",
"HT.est.original", "median.trans"),
labels = c("Naive", "Unweighted smoothing: posterior median",
"Survey-weighted", "Weighted smoothing: posterior median"),
by.data = "region", by.geo = "HRA2010v2_",
is.long = FALSE)
g <- g + theme_bw() + theme(axis.title = element_blank(),
axis.text = element_blank())
g <- g + scale_fill_viridis("Prevalence")
g
65 / 71
Figure: Maps of prevalence under four approaches: naive, unweighted smoothing (posterior median), survey-weighted, and weighted smoothing (posterior median).
66 / 71
g <- mapPlot(data = combined, geo = kingshape,

variables = c("se.p.hat.naive", "se.p.hat.binomial",
"HT.sd.original", "sd.trans"), labels = c("Naive",
"Unweighted smoothing: posterior SD",
"Survey-weighted", "Weighted smoothing: posterior SD"),
by.data = "region", by.geo = "HRA2010v2_",
is.long = FALSE)
g <- g + theme_bw() + theme(axis.title = element_blank(),
axis.text = element_blank())
g <- g + scale_fill_viridis("SD(Prevalence)")
print(g)
67 / 71
Figure: Maps of SD(Prevalence) under four approaches: naive, unweighted smoothing (posterior SD), survey-weighted, and weighted smoothing (posterior SD).
68 / 71
Conclusion
The last two plots illustrate the effect of the Bayesian smoothing model:
- the estimates are shrunk (both globally and locally), which introduces
  bias,
- the uncertainty is in general reduced, due to the use of all the data.
Overall:
- It is clear we need to consider the weighting.
- The smoothing does increase precision, at the expense of a little bias.
69 / 71
References I
Chen, C., Wakefield, J., and Lumley, T. (2014). The use of sample
weights in Bayesian hierarchical models for small area estimation.
Spatial and Spatio-Temporal Epidemiology, 11:33–43.
Fay, R. and Herriot, R. (1979). Estimates of income for small places: an
application of James–Stein procedure to census data. Journal of the
American Statistical Association, 74:269–277.
Korn, E. and Graubard, B. (1999). Analysis of Health Surveys. John
Wiley and Sons, New York.
Li, Z. R., Hsiao, Y., Godwin, J., Martin, B., Wakefield, J., and Clark,
S. J. (2018). Changes in the spatial distribution of the Under Five
Mortality Rate: small-area analysis of 122 DHS Surveys in 262
subregions of 35 Countries in Africa. Submitted.
Mercer, L., Wakefield, J., Chen, C., and Lumley, T. (2014). A
comparison of spatial smoothing methods for small area estimation
with sampling weights. Spatial Statistics, 8:69–85.
70 / 71
References II
Mercer, L., Wakefield, J., Pantazis, A., Lutambi, A., Masanja, H., and
Clark, S. (2015). Small area estimation of childhood
mortality in the absence of vital registration. Annals of Applied
Statistics, 9:1889–1905.
Pfeffermann, D. (2013). New important developments in small area
estimation. Statistical Science, 28:40–68.
Rao, J. (2003). Small Area Estimation. John Wiley, New York.
Rao, J. and Molina, I. (2015). Small Area Estimation, Second Edition.
John Wiley, New York.
Song, L., Mercer, L., Wakefield, J., Laurent, A., and Solet, D. (2016).
Using small-area estimation to calculate the prevalence
of smoking by subcounty geographic areas in King County, Washington,
Behavioral Risk Factor Surveillance System, 2009–2013. Preventing
Chronic Disease, 13.
71 / 71
