
Bayesian SAE using Complex Survey Data

Lecture 8: Small Area Estimation

Richard Li

Department of Statistics
University of Washington

Outline

Overview

Inference for SAE

SAE with BRFSS data in R

Binomial smoothing by hand (not weighted)

Accounting for survey designs

Using SUMMER

Overview

Motivation

Small area estimation (SAE) is an important endeavor since many
agencies require estimates of health, economic indices, education and
environmental measures in order to plan and allocate resources and
target interventions.

SAE is an example of domain (sub-population) estimation.

“Small” here refers to the fact that we will typically base our inference on
a small sample from each area (so it is not a description of geographical
size).

In the limit there may be some areas in which there are no data.

Small Area Estimation
Consider a study region partitioned into n disjoint and exhaustive areas,
labeled by i, i = 1, . . . , n.

As a concrete example, suppose we are interested in a particular
condition so that the response is a binary outcome, $Y_{ik}$, for
$k = 1, \dots, N_i$ individuals in area $i$.

Based on samples that are collected in the areas (though some areas may
contain no samples), the aims of SAE include estimation of:
- The population totals:
$$T_i = \sum_{k=1}^{N_i} Y_{ik}.$$
- The prevalence of the condition in each area:
$$\theta_i = \frac{1}{N_i} \sum_{k=1}^{N_i} Y_{ik} = \frac{T_i}{N_i}.$$


Background reading on SAE

The classic text on SAE is Rao (2003), with a more recent edition (Rao
and Molina, 2015); not the easiest book to read, and little material on
spatial smoothing models.

An excellent review of SAE is Pfeffermann (2013).

The SAE literature distinguishes between direct estimation, in which data
from the area only is used to provide the estimate in an area, and indirect
estimation, in which data from other areas is used to provide the
estimate.

Inference for SAE

Design based inference based on weighted estimators
Suppose we undertake a complex design and obtain outcomes $y_{ik}$ in area
$i$, $k \in s_i$, where $s_i$ is the set of samples that were in area $i$.

Along with the outcome, there is an associated design weight $w_{ik}$.

Under the design-based approach to inference, it is common to use the
weighted estimator of the prevalence:
$$\hat{P}_i = \frac{\sum_{k \in s_i} w_{ik} y_{ik}}{\sum_{k \in s_i} w_{ik}}.$$

There is an associated variance estimate, $\hat{V}_i$, that acknowledges the design.

This variance estimate may be obtained analytically, or through
resampling techniques such as the jackknife.

Asymptotically (that is, in large samples):
$$\hat{P}_i \sim N(P_i, V_i).$$
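The weighted estimator and a jackknife variance can be sketched in a few lines of base R. The data below are made up, and treating every respondent as its own primary sampling unit is an assumption made only to keep the sketch short; a real design-based analysis would delete whole PSUs within strata (for example via the survey package).

```r
# Hypothetical data for one area: binary outcomes and design weights
y <- c(1, 0, 0, 1, 0, 0, 0, 1)
w <- c(120, 300, 80, 150, 400, 90, 210, 60)

# Weighted (ratio) estimator of the prevalence
weighted_prev <- function(y, w) sum(w * y) / sum(w)
p_hat <- weighted_prev(y, w)

# Delete-one jackknife variance, treating each respondent as its own PSU
# (an assumption for illustration only)
n <- length(y)
p_jack <- sapply(seq_len(n), function(k) weighted_prev(y[-k], w[-k]))
v_hat <- (n - 1) / n * sum((p_jack - mean(p_jack))^2)

c(p_hat = p_hat, se = sqrt(v_hat))
```

The point estimate coincides with `weighted.mean(y, w)`; only the variance calculation depends on the design.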
Direct Estimation

The simplest approach is to map the direct estimates $\hat{P}_i$.

To assess the uncertainty, one may map the lower and upper ends of
(say) a 90% confidence interval:
$$\hat{P}_i \pm 1.645 \times \sqrt{\hat{V}_i}.$$

If the samples in each area are large, so that $\hat{V}_i$ is small, then this
approach works well.

When the samples are small, however, the direct estimates are unstable.
Hence, as usual, we would like to carry out some form of smoothing, but
in the case of complex survey sampling, how should we proceed?

Design effects

The cluster design leads to a loss of information.

The so-called estimated design effect is
$$d_i = \frac{\hat{V}_i}{\hat{P}_i (1 - \hat{P}_i)/n_i},$$
and summarizes the information loss.

Define the effective sample size as
$$n_i^e = \frac{n_i}{d_i} = \frac{\hat{P}_i (1 - \hat{P}_i)}{\hat{V}_i}.$$
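As a small worked illustration of the 90% interval, the design effect and the effective sample size, the following R sketch uses made-up values for a single area (the numbers are hypothetical and not taken from the BRFSS data).

```r
# Hypothetical direct estimate for one area
p_hat <- 0.10     # weighted prevalence estimate
v_hat <- 0.0012   # design-based variance estimate
n_i   <- 100      # number of respondents in the area

# 90% confidence interval for the direct estimate
ci90 <- p_hat + c(-1, 1) * qnorm(0.95) * sqrt(v_hat)

# Design effect: design-based variance over the binomial (SRS) variance
d_i <- v_hat / (p_hat * (1 - p_hat) / n_i)

# Effective sample size
n_eff <- n_i / d_i

c(lower = ci90[1], upper = ci90[2], deff = d_i, n_eff = n_eff)
```

With these inputs the design effect is 4/3 and the effective sample size is 75, so the 100 respondents carry about as much information as 75 would under simple random sampling.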

Smoothed Direct Estimation

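Let $\hat{P}_i$ be the weighted estimator, then consider
$$\hat{\theta}_i = \mathrm{logit}(\hat{P}_i) = \log\left(\frac{\hat{P}_i}{1 - \hat{P}_i}\right),$$
which is on the whole of the real line.

“Data” model (we are taking the estimator as the data):
$$\hat{\theta}_i \sim N(\theta_i, \hat{V}_i),$$
where $\hat{V}_i$, the variance of $\hat{\theta}_i$ (on the logit scale), is treated as known.

Prior random effects model:
$$\theta_i = \beta_0 + \epsilon_i,$$
where the random effects $\epsilon_i \sim_{iid} N(0, \sigma_\epsilon^2)$.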
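To see how this model smooths, it can help to write down the posterior mean of the area-level parameter when the intercept and the random effect variance are held fixed. Fixing them is an assumption made here only for illustration; in the full Bayesian model they receive priors. The standard normal-normal result is:

```latex
E\left[\theta_i \mid \hat{\theta}_i, \beta_0, \sigma_\epsilon^2\right]
  = w_i \, \hat{\theta}_i + (1 - w_i) \, \beta_0,
\qquad
w_i = \frac{\sigma_\epsilon^2}{\sigma_\epsilon^2 + \hat{V}_i}.
```

Areas with a large sampling variance have a small weight on their own direct estimate and are pulled more strongly toward the overall level.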


Smoothed Direct Estimation

This is very similar to the normal-normal model we saw in Lecture 3.

Fay and Herriot (1979) suggested this hierarchical model in a landmark
paper.

This model acknowledges the design and also smooths, and it is
straightforward to add spatial random effects.

Smoothed Direct Estimation
The spatial version of the model has:

“Data” model:
$$\hat{\theta}_i \sim N(\theta_i, \hat{V}_i),$$
where $\hat{V}_i$ is a known variance.

Prior model:
$$\theta_i = \beta_0 + \epsilon_i + S_i,$$
with
- $\epsilon_i \sim N(0, \sigma_\epsilon^2)$,
- $S_i \sim \mathrm{ICAR}(\sigma_s^2)$.

Adding a term $x_i^T \beta$ to the prior model allows covariate relationships to be
investigated.

This model has been investigated and applied with simulated and real
data in Chen et al. (2014) and Mercer et al. (2014), and (in a space-time
setting) in Mercer et al. (2014, 2015) and Li et al. (2018).

FYI, Different Models For Binary Responses
- Binomial sampling model: only strictly valid if no stratified sampling
  and no cluster sampling.
- Direct estimates at the area level.
- Smoothed direct estimates at the area level, modeling the logit of
  the direct estimates of the probabilities.
- Binomial GLMM at the area level: only strictly valid if no stratified
  sampling and no cluster sampling.
- Binomial model for responses within each cluster with
  - strata fixed effects,
  - cluster random effects,
  - IID random effects at the area level,
  - spatial random effects at the area level (via an ICAR model).
- Binomial model for responses within each cluster with
  - strata fixed effects,
  - IID cluster random effects,
  - IID household effects?
  - spatial random effects at the cluster level (via a Gaussian process
    model).

SAE with BRFSS data in R

Motivating Example: Diabetes in King County

This example arises out of a joint project between Laina Mercer/Jon Wakefield and
Seattle and King County Public Health, which led to the work reported
in Song et al. (2016).

The aim we will concentrate on here is to estimate the number of individuals
aged 18 years or older with diabetes, by health reporting area (HRA) in King
County in 2011.

HRAs are city-based sub-county areas, with a total of 48 HRAs in King
County. Some of these areas are a single city, some are a group of
smaller cities, and some are unincorporated areas. Larger cities such as
Seattle and Bellevue include more than one HRA.

Data are based on the question, “Has a doctor, nurse, or other health
professional ever told you that you had diabetes?”, in 2011.

Figure: Health reporting areas (HRAs) in King County, WA, labeled with the
2010 population (N) of each HRA. Auburn, Bellevue, Federal Way, Kent, Renton
& Seattle HRAs are divided into neighborhoods. Data source: interim
population estimates, PHSKC, APDE, 1/2012. Produced by Public Health-Seattle
& King County, Assessment, Policy Development & Evaluation; last modified
08/2012.

Motivating BRFSS Example

Estimates are used for a variety of purposes including summarization for
the local communities and assessment of health needs.

Analysis and dissemination of place-based disparities is of great
importance to allow efficient targeting of place-based interventions.

Because of its demographics, King County looks good compared to other
areas in the U.S., but some of its disparities are among the largest of
major metro areas.

Estimation is based on Behavioral Risk Factor Surveillance System
(BRFSS) data.

The BRFSS is an annual telephone health survey conducted by the
Centers for Disease Control and Prevention (CDC) that has tracked health
conditions and risk behaviors in the United States and its territories since
1984.

Figure: Public Health: Seattle and King County website.


Figure: Summaries from Public Health: Seattle King County.

Figure: Summaries from Public Health: Seattle King County (map of life
expectancy by census tract, 2005-2009, King County, WA, compared to the ten
longest-lived countries).

Motivating BRFSS Example
The BRFSS sampling scheme is complex: it uses a disproportionate
stratified sampling scheme.

The sample weight (Sample Wt) is calculated as the product of four terms:
$$\text{Sample Wt} = \text{Strat Wt} \times \frac{1}{\text{No Telephones}} \times \text{No Adults} \times \text{Post Strat Wt},$$
where Strat Wt is the inverse probability of a “likely” or “unlikely”
stratum being selected (stratification based on county and “phone
likelihood”).
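The product of the four terms can be illustrated with a single hypothetical respondent. All input values below are invented for the sketch; the variable names are placeholders and not BRFSS column names.

```r
# Hypothetical respondent: all four inputs are made-up values
strat_wt      <- 250   # inverse selection probability of the stratum
n_telephones  <- 2     # telephones in the household
n_adults      <- 3     # adults in the household
post_strat_wt <- 1.2   # post-stratification adjustment

# Product of the four terms
sample_wt <- strat_wt * (1 / n_telephones) * n_adults * post_strat_wt
sample_wt
```

Here the respondent stands in for 450 adults: more telephones lower the weight (the household is easier to reach), while more adults raise it (each adult is less likely to be the one interviewed).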

Table: Summary statistics for population data, and 2011 King County BRFSS
diabetes data, across health reporting areas.

                   Mean     Std. Dev.  Median   Min     Max      Total
Population (>18)   31,619   10,107     30,579   8,556   56,755   1,517,712
Sample Sizes       62.9     24.3       56.5     20      124      3,020
Diabetes Cases     6.3      3.1        6.3      1       15       302
Sample Weights     494.3    626.7      280.4    48.0    5,461    1,491,880

Motivating BRFSS Example

A total of 3,020 individuals answered the diabetes question.

About 35% of the areas have sample sizes less than 50 (the CDC
recommended cut-off), so that the diabetes prevalence estimates are
unstable in these areas.

We would like to use the totality of the data to aid in estimation in the
data-sparse areas.

The variability in the weights is high, from 48 to 5,461, with mean 494.

The coefficient of variation (CV) of the weights is 1.27.

Therefore, the inefficiency of using the sample weights under the
assumption that the unweighted mean is unbiased is about 62%, calculated
as $\mathrm{CV}^2/(\mathrm{CV}^2 + 1)$ (Korn and Graubard, 1999).
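The two figures quoted above can be reproduced from the mean and standard deviation of the sample weights reported in the summary table. This is only a check of the arithmetic, using the rounded table values.

```r
# Summary statistics of the sample weights, copied from the summary table
w_mean <- 494.3
w_sd   <- 626.7

cv <- w_sd / w_mean                 # coefficient of variation
inefficiency <- cv^2 / (cv^2 + 1)   # formula from Korn and Graubard (1999)

round(c(cv = cv, inefficiency = inefficiency), 2)
```

This gives a CV of 1.27 and an inefficiency of 0.62, matching the values on the slide.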

Modeling BRFSS data

- We take as an example the estimation of the prevalence of Type II
  diabetes in health reporting areas (HRAs) in King County, using
  BRFSS data.
- These survey data are collected using a complex stratified design.
- The design must be acknowledged in the analysis, but we would like
  to use spatial smoothing to obtain estimates with more precision.
- We will work through a case study of analyzing such a dataset using
  various methods.

Outline

To thread together what we have talked about so far, we can perform the
following analyses:
- Naive (i.e. unweighted, unsmoothed)
- Binomial spatial smoothing model, ignoring weighting
- Weighted (unsmoothed)
  - By hand and using the SUMMER package
- Smoothed and weighted
  - By hand (very briefly, but you will see it again in the exercise
    session this afternoon) and using the SUMMER package

Load data

First, we need to read in the King County BRFSS Stata dataset using the
foreign package.

library(foreign)
# Alternatively, read the file directly from the web:
# kingdata <-
#   read.dta(url('http://www.samclark.net/apa-sae/data/ct0913all.dta'))
kingdata <- read.dta("../data/ct0913all.dta")
names(kingdata)

## [1] "age" "pracex" "educau" "zipcode" "sex" "street1"
## [8] "seqno" "year" "hispanic" "mracex" "_ststr" "hracode"
## [15] "rwt_llcp" "genhlth2" "fmd" "obese" "smoker1" "diab2"
## [22] "zipout" "streetx" "ethn" "age4" "ctmiss"

Load map

Next, read in the shapefiles for the King County HRAs. Either of the
following two approaches can be used.

# Option 1: maptools
# install.packages('maptools')
library(maptools)
f <- "../data/HRA_ShapeFiles/HRA_2010Block_Clip.shp"
kingshape <- readShapePoly(f)

# Option 2: rgdal
# install.packages('rgdal')
library(rgdal)
kingshape <- readOGR("../data/HRA_ShapeFiles",
    layer = "HRA_2010Block_Clip")

## OGR data source with driver: ESRI Shapefile
## Source: "/Users/zehangli/Dropbox/Teachings/2018APAshanghai/data/HRA_S
## with 48 features
## It has 9 fields

Initial data cleaning

- Our outcome of interest is Type II diabetes and we will drop
  observations with missing diabetes data.
- Our small area of interest is the HRA. We will also drop observations
  with missing HRA.

# drop observations with a missing outcome or a missing HRA
kingdata <- subset(kingdata, !is.na(kingdata$diab2))
kingdata <- subset(kingdata, !is.na(kingdata$hracode))
# rename the strata variable to a syntactically valid name
names(kingdata)[names(kingdata) == "_ststr"] <- "strata"
# remove the trailing space in one of the HRA labels
kingdata$hracode <- as.character(kingdata$hracode)
kingdata[kingdata$hracode == "Fairwood ",
    "hracode"] <- "Fairwood"
# number of small areas
n.area <- length(unique(kingdata$hracode))
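The cleaning step above fixes one label by hand. A more general alternative, sketched here on a toy vector rather than on the real data, is to strip leading and trailing whitespace from every label with base R's trimws(). The labels below are illustrative stand-ins.

```r
# Toy stand-in for the hracode column; "Fairwood " carries a trailing space
hracode <- c("Fairwood ", "Burien", "Fairwood", " Kirkland")

# Strip leading and trailing whitespace from every label at once
hracode_clean <- trimws(hracode)

unique(hracode_clean)
```

After cleaning, the two spellings of "Fairwood" collapse into one area, so the count of unique areas is no longer inflated.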

Naive binomial model

- Let $y_i$ and $m_i$ be the number of individuals flagged as having type II
  diabetes and the denominators in the $i = 1, \dots, n$ areas.
- We form naive estimates
$$\hat{p}_i = \frac{y_i}{m_i}$$
  with associated standard errors
$$\sqrt{\frac{\hat{p}_i (1 - \hat{p}_i)}{m_i}}.$$

Naive binomial model

# area labels, in the order used by the shapefile
hras <- as.character(kingshape$HRA2010v2_)
props <- matrix(NA, nrow = n.area, ncol = 5)
props <- as.data.frame(props)
colnames(props) <- c("hracode", "p.hat",
    "se.p.hat", "y.i", "n.i")
props[, 1] <- hras
for (i in 1:n.area) {
    # outcomes of the respondents in area i
    d.i <- kingdata[kingdata$hracode ==
        props[i, "hracode"], "diab2"]
    props[i, "p.hat"] <- mean(d.i)    # naive prevalence
    props[i, "y.i"] <- sum(d.i)       # number of cases
    props[i, "n.i"] <- length(d.i)    # number of respondents
    # binomial variance of the naive prevalence
    naivevar <- props[i, "p.hat"] * (1 -
        props[i, "p.hat"])/props[i, "n.i"]
    props[i, "se.p.hat"] <- sqrt(naivevar)
}
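The same naive summaries can be computed without an explicit loop. The sketch below uses tapply() on a small invented data frame standing in for kingdata; the column names mirror the ones used above, but the values are hypothetical.

```r
# Toy data standing in for kingdata (hypothetical values)
toy <- data.frame(
  hracode = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  diab2   = c(1, 0, 0, 0, 1, 1, 0, 0, 0)
)

y.i   <- tapply(toy$diab2, toy$hracode, sum)      # cases per area
n.i   <- tapply(toy$diab2, toy$hracode, length)   # respondents per area
p.hat <- y.i / n.i                                # naive prevalence
se    <- sqrt(p.hat * (1 - p.hat) / n.i)          # binomial standard error

naive <- data.frame(hracode = names(y.i), y.i = as.vector(y.i),
                    n.i = as.vector(n.i), p.hat = as.vector(p.hat),
                    se.p.hat = as.vector(se))
naive
```

Note that tapply() returns the areas in sorted label order, so the rows would still need to be reordered (for example with match()) to follow the shapefile order used in the rest of the analysis.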

Naive binomial model: merge into map

Convert the shapefile to a data frame for plotting and merge in the estimates

library(ggplot2)
library(viridis)
geo <- fortify(kingshape, region = "HRA2010v2_")
# match the polygon id to the HRA code
geo1 <- merge(geo, props, by.x = "id", by.y = "hracode")

Map the prevalence estimates and visually check them

g <- ggplot(geo1)
g <- g + geom_polygon(aes(x = long, y = lat,
    group = group, fill = p.hat), color = "gray")
g <- g + theme_void()
g <- g + scale_fill_viridis()
g

Figure: Map of the naive prevalence estimates (p.hat) by HRA; the color
scale runs from about 0.05 to 0.20.

Binomial smoothing by hand (not weighted)

Binomial smoothing: the model

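We use the INLA package to fit the following Bayesian hierarchical model:
$$\begin{aligned}
y_i \mid p_i &\sim \mathrm{Binomial}(N_i, p_i), \\
\theta_i = \log\left(\frac{p_i}{1 - p_i}\right) &= \mu + \epsilon_i + s_i, \\
\epsilon_i &\sim N(0, \sigma_\epsilon^2), \\
s_i \mid s_j, j \in \mathrm{ne}(i) &\sim N\left(\bar{s}_i, \frac{\sigma_s^2}{n_i}\right),
\end{aligned}$$
where $N_i$ is the number of respondents in area $i$ (n.i in the code),
$n_i$ is the number of neighbors of area $i$, and
$$\bar{s}_i = \frac{1}{n_i} \sum_{j \in \mathrm{ne}(i)} s_j.$$

Priors are put on $\mu$, $\sigma_\epsilon^2$, $\sigma_s^2$.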

Binomial smoothing: construct adjacency matrix

To perform spatial smoothing using ICAR, we first need to construct an
adjacency matrix where each row and column is a region.
- Diagonal elements are 0
- Off-diagonal elements are 1 if the two corresponding regions are
  adjacent and 0 otherwise

library(spdep)
# queen = FALSE: regions must share more than a single boundary point
nb.r <- poly2nb(kingshape, queen = FALSE,
    row.names = kingshape$HRA2010v2_)
# style = "B" gives a binary (0/1) adjacency matrix
mat <- nb2mat(nb.r, style = "B", zero.policy = TRUE)
colnames(mat) <- rownames(mat)
mat <- as.matrix(mat[1:dim(mat)[1], 1:dim(mat)[1]])
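To make the structure of the adjacency matrix concrete, here is a hand-built example for four hypothetical areas arranged in a line (A, B, C, D), together with the neighbor counts and the neighbor averages that enter the ICAR conditional distribution. The areas and effect values are invented for the sketch.

```r
# Four hypothetical areas arranged in a line: A - B - C - D
areas <- c("A", "B", "C", "D")
adj <- matrix(0, 4, 4, dimnames = list(areas, areas))
adj["A", "B"] <- 1
adj["B", "C"] <- 1
adj["C", "D"] <- 1
adj <- adj + t(adj)            # adjacency is symmetric

n_nb <- rowSums(adj)           # number of neighbors of each area

# ICAR conditional mean of each effect: the average of its neighbors
s <- c(A = 0.4, B = -0.2, C = 0.1, D = 0.3)   # made-up spatial effects
s_bar <- as.vector(adj %*% s) / n_nb

round(s_bar, 2)
```

The end areas A and D have one neighbor each, so their conditional means are simply the neighboring value, and their conditional variances are twice those of the interior areas.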

Binomial smoothing: model fitting

Implementation details:
- The areas in the data need to be in the same order as in the
  adjacency matrix. A mismatch is easy to miss if the data have been
  reordered.
- Multiple random effects each need an index variable (unstruct and
  struct below).

# number of areas whose order differs from the adjacency matrix (should be 0)
sum(colnames(mat) != props$hracode)

## [1] 0

props$unstruct <- props$struct <- 1:n.area

Binomial smoothing: model fitting

The following code carries out an unweighted binomial analysis, with
global and spatial smoothing, the latter via the ICAR model.

library(INLA)
# spatial (ICAR) and unstructured (IID) random effects,
# each with a Gamma(0.5, 0.0015) prior on its precision
formula <- y.i ~ 1 +
    f(struct, model = 'besag',
      adjust.for.con.comp = TRUE,
      constr = TRUE, graph = mat,
      scale.model = TRUE,
      param = c(0.5, 0.0015)) +
    f(unstruct, model = 'iid',
      param = c(0.5, 0.0015))
fit.naive <- inla(formula,
    family = "binomial",
    data = props, Ntrials = n.i,
    control.predictor = list(compute = TRUE))

Binomial smoothing: organize output

props.smooth <- props
# posterior medians of the prevalences
props.smooth[, "p.hat"] <- fit.naive$summary.fitted.values[,
    "0.5quant"]
# posterior standard deviations of the prevalences
props.smooth[, "se.p.hat"] <- fit.naive$summary.fitted.values[,
    "sd"]
# posterior medians of the unstructured random effects
props.smooth[, "unstruct"] <- fit.naive$summary.random$unstruct[,
    "0.5quant"]
# posterior medians of the spatial random effects
props.smooth[, "struct"] <- fit.naive$summary.random$struct[,
    "0.5quant"]

Binomial smoothing: Unstructured random effects

geo2 <- merge(geo, props.smooth, by.x = "id",
    by.y = "hracode")
g <- ggplot(geo2) + geom_polygon(aes(x = long,
    y = lat, group = group, fill = unstruct),
    color = "gray")
g <- g + theme_void() + scale_fill_viridis()
g

Figure: Map of the posterior medians of the unstructured random effects
(unstruct); the color scale runs from about −0.01 to 0.02.

Binomial smoothing: Spatial random effects

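g <- ggplot(geo2) + geom_polygon(aes(x = long,
    y = lat, group = group, fill = struct),
    color = "gray")
g <- g + theme_void() + scale_fill_viridis()
g

Figure: Map of the posterior medians of the spatial random effects
(struct); the color scale runs from about −0.5 to 0.5.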

Binomial smoothing: Proportion of variance (recap)
- It could be interesting to evaluate the proportion of variance
  explained by the structured spatial component.
- However, the estimated $\sigma_s^2$ and $\sigma_\epsilon^2$ are not directly comparable.
- We instead calculate the posterior marginal variance for the
  structured effect (see Section 6.1.2 of Blangiardo et al. (2015) for
  more details).

# 10,000 draws from the posterior marginal of each spatial effect
Sre <- matrix(NA, 1e4, 48)
for (i in 1:48) {
    Sre[, i] <- inla.rmarginal(1e4,
        fit.naive$marginals.random$struct[[i]])
}
# empirical variance of the spatial effects across areas, for each draw
var.Sre <- apply(Sre, 1, var)
# draws of the unstructured variance (inverse of the precision)
var.eps <- inla.rmarginal(1e4, inla.tmarginal(function(x) {1/x},
    fit.naive$marginals.hyper$"Precision for unstruct"))
perc.var.Sre <- mean(var.Sre/(var.Sre + var.eps))
perc.var.Sre

## [1] 0.9610054

Binomial smoothing: Proportion of variance
To see that there is a difference between $\sigma_s^2$ and the posterior marginal
variance for the structured effects:

# the table is called var.tab so that it does not mask the function var()
var.tab <- matrix(NA, 2, 2)
colnames(var.tab) <- c("S", "Sigma^2")
rownames(var.tab) <- c("median", "mean")
draws1 <- matrix(NA, 10000, 48)
for (i in 1:48) {
    draws1[, i] <- inla.rmarginal(10000,
        fit.naive$marginals.random$struct[[i]])
}
var.tab[1, 1] <- median(apply(draws1, 1, var))
var.tab[2, 1] <- mean(apply(draws1, 1, var))
draws2 <- inla.rmarginal(10000, inla.tmarginal(function(x) 1/x,
    fit.naive$marginals.hyper$"Precision for struct"))
var.tab[1, 2] <- median(draws2)
var.tab[2, 2] <- mean(draws2)
var.tab

##                S    Sigma^2
## median 0.1175084 0.06626365
## mean   0.1180709 0.07019962

Binomial smoothing: predicted prevalence

g <- ggplot(geo2) + geom_polygon(aes(x = long,
    y = lat, group = group, fill = p.hat),
    color = "gray")
g <- g + theme_void() + scale_fill_viridis()
g

Figure: Map of the smoothed prevalence estimates (p.hat); the color scale
runs from about 0.10 to 0.15.

Binomial smoothing: SE of prevalence

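g <- ggplot(geo2) + geom_polygon(aes(x = long,
    y = lat, group = group, fill = se.p.hat),
    color = "gray")
g <- g + theme_void() + scale_fill_viridis()
g

Figure: Map of the posterior standard deviations of the smoothed prevalence
(se.p.hat); the color scale runs from about 0.010 to 0.025.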

Binomial smoothing: compare with naive approach

par(mfrow = c(1, 2))
# left panel: smoothed against naive prevalence
lim1 <- range(c(props$p.hat, props.smooth$p.hat))
plot(props$p.hat, props.smooth$p.hat, xlim = lim1,
    ylim = lim1, xlab = "Naive Prevalence",
    ylab = "Smoothed prevalence")
abline(c(0, 1), col = "red")
# right panel: smoothed against naive standard error
lim2 <- range(c(props$se.p.hat, props.smooth$se.p.hat))
plot(props$se.p.hat, props.smooth$se.p.hat,
    xlim = lim2, ylim = lim2, xlab = "Naive Prevalence SE",
    ylab = "Smoothed prevalence SE")
abline(c(0, 1), col = "red")

Figure: Binomial smoothing compared with the naive approach. Left: smoothed
prevalence against naive prevalence. Right: smoothed prevalence SE against
naive prevalence SE.

Accounting for survey designs

Survey weighted estimates: weights

- BRFSS uses a complex survey design. See
  http://www.cdc.gov/brfss/annual_data/2013/pdf/Weighting_Data.pdf for more
  details of the weighting procedure.
- Raking adjusts for: telephone source (allowing for cell phones),
  race/ethnicity, education, marital status, age group by gender,
  gender by race and ethnicity, age group by race and ethnicity,
  renter/owner status.
- Design weights are

  STRWT × 1/NUMPHON2 × NUMADULT.

- GEOSTR is the geographical strata (which in general may be the
  entire state or a geographic subset such as counties, census tracts,
  etc.). DENSTR is the density of the phone numbers for a given
  block of numbers as listed or not listed.

Survey weighted estimates: weights

- NRECSTR is the number of available records and NRECSEL is the
  number of records selected within each geographical stratum and
  density stratum.
- Within each GEOSTR × DENSTR combination, the stratum weight
  (STRWT) is calculated from the average of the NRECSTR and the
  sum of all sample records used to produce the NRECSEL. The
  stratum weight is equal to NRECSTR/NRECSEL, i.e. the reciprocal
  of the selection probability.
- An adjustment is also made for the mostly cellular telephone dual
  sampling frame users. Weight trimming is also used.
- The final weight rwt_llcp is the raked design weight.

Survey weighted estimates: asymptotic distribution of p̂i

- The survey package will give us survey-weighted estimates of $p_i$, the
  proportion of people with Type II diabetes in small area $i$, and a
  survey-weighted estimate of the standard error, $\widehat{SE}(\hat{p}_i)$.
- We use the method described in Mercer et al. (2014).
- If we specify $y_i = \log\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right)$ then, by the delta method, the
  asymptotic (sampling) distribution of $y_i$ is:
$$y_i \mid p_i \sim N\left(\log\left(\frac{p_i}{1 - p_i}\right), \frac{\widehat{\mathrm{var}}(\hat{p}_i)}{\hat{p}_i^2 (1 - \hat{p}_i)^2}\right).$$
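The variance in this distribution follows from a first-order Taylor expansion of the logit function. As a short sketch of that delta-method step, with $g$ introduced here only as shorthand for the logit:

```latex
g(p) = \log\left(\frac{p}{1 - p}\right),
\qquad
g'(p) = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)},
\qquad
\mathrm{var}\left(g(\hat{p}_i)\right) \approx g'(p_i)^2 \, \mathrm{var}(\hat{p}_i)
  = \frac{\mathrm{var}(\hat{p}_i)}{p_i^2 (1 - p_i)^2}.
```

In practice the unknown prevalence and variance are replaced by their survey-weighted estimates, which gives the plug-in variance used in the analysis.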

Survey weighted estimates: calculation

library(survey)
props.w <- props
# stratified design without clustering (ids = ~1), using the final weights
kingcounty.des <- svydesign(ids = ~1, weights = ~rwt_llcp,
    strata = ~strata, data = kingdata)
# survey-weighted prevalence and standard error by HRA
weighted <- svyby(~diab2, ~hracode, kingcounty.des,
    svymean)
rows <- match(weighted$hracode, props.w$hracode)
props.w[rows, "p.hat"] <- weighted$diab2
props.w[rows, "se.p.hat"] <- weighted$se
# logit of the prevalence, with its delta-method variance and precision
props.w[, "logit.p"] <- log(props.w[, "p.hat"]/(1 -
    props.w[, "p.hat"]))
props.w[, "logit.v"] <- props.w[, "se.p.hat"]^2/(props.w[,
    "p.hat"] * (1 - props.w[, "p.hat"]))^2
props.w[, "logit.prec"] <- 1/props.w[, "logit.v"]

Survey weighted estimates: calculation

We obtain
- The weighted estimators of prevalences p.hat
- The design standard errors of prevalences se.p.hat
- The weighted estimators of logits of prevalences logit.p
- The design variances of logits of prevalences logit.v
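The relationship between these four quantities can be checked on a single made-up estimate. The sketch below also maps a 95% interval from the logit scale back to the probability scale, which is an addition for illustration and not a step taken in the slides.

```r
# Hypothetical survey-weighted estimate and design-based standard error
p.hat    <- 0.08
se.p.hat <- 0.02

logit.p    <- log(p.hat / (1 - p.hat))               # same as qlogis(p.hat)
logit.v    <- se.p.hat^2 / (p.hat * (1 - p.hat))^2   # delta-method variance
logit.prec <- 1 / logit.v

# 95% interval on the logit scale, mapped back to the probability scale
ci <- plogis(logit.p + c(-1, 1) * qnorm(0.975) * sqrt(logit.v))
ci
```

Working on the logit scale keeps the interval inside (0, 1), which a symmetric interval on the probability scale does not guarantee for small prevalences.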

Survey weighted estimates: compare with naive approach

par(mfrow = c(1, 2))
# left panel: survey-weighted against naive prevalence
lim1 <- range(c(props$p.hat, props.w$p.hat))
plot(props$p.hat, props.w$p.hat, xlim = lim1,
    ylim = lim1, xlab = "Naive Prevalence",
    ylab = "Survey-weighted prevalence")
abline(c(0, 1), col = "red")
# right panel: survey-weighted against naive standard error
lim2 <- range(c(props$se.p.hat, props.w$se.p.hat))
plot(props$se.p.hat, props.w$se.p.hat, xlim = lim2,
    ylim = lim2, xlab = "Naive Prevalence SE",
    ylab = "Survey-weighted prevalence SE")
abline(c(0, 1), col = "red")

Figure: Survey-weighted estimates compared with the naive approach. Left:
survey-weighted prevalence against naive prevalence. Right: survey-weighted
prevalence SE against naive prevalence SE.

Survey weighted estimates: compare with binomial smoothing

par(mfrow = c(1, 2))
# posterior median and variance of the logit prevalence (unweighted model)
logit.binomial <- fit.naive$summary.linear.predictor[,
    "0.5quant"]
logit.v.binomial <- fit.naive$summary.linear.predictor[,
    "sd"]^2
lim1 <- range(c(logit.binomial, props.w$logit.p))
plot(logit.binomial, props.w$logit.p, xlim = lim1,
    ylim = lim1, xlab = "Posterior median (unweighted)",
    ylab = "Survey-weighted logit prevalence")
abline(c(0, 1), col = "red")
lim2 <- range(c(logit.v.binomial, props.w$logit.v))
plot(logit.v.binomial, props.w$logit.v, xlim = lim2,
    ylim = lim2, xlab = "Posterior variance (unweighted)",
    ylab = "Survey-weighted logit prevalence variance")
abline(c(0, 1), col = "red")

Figure: Survey-weighted estimates compared with binomial smoothing. Left:
survey-weighted logit prevalence against the posterior median (unweighted).
Right: survey-weighted logit prevalence variance against the posterior
variance (unweighted).

Weighted and smoothed model

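We use the INLA package to fit the following Bayesian hierarchical model:
$$\begin{aligned}
y_i = \log\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) &\sim N(\theta_i, \hat{V}_i), \\
\theta_i &= \mu + \epsilon_i + s_i, \\
\epsilon_i &\sim N(0, \sigma_\epsilon^2), \\
s_i \mid s_j, j \in \mathrm{ne}(i) &\sim N\left(\bar{s}_i, \frac{\sigma_s^2}{n_i}\right),
\end{aligned}$$
with priors on $\mu$, $\sigma_\epsilon^2$, $\sigma_s^2$.

The key here is that the first stage variance $\hat{V}_i$ is assumed known:
$$\hat{V}_i = \frac{\widehat{\mathrm{var}}(\hat{p}_i)}{\hat{p}_i^2 (1 - \hat{p}_i)^2}.$$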

Weighted and smoothed model: model fitting

props.w$unstruct <- props.w$struct <- 1:n.area
formula <- logit.p ~ 1 +
    f(struct, model = 'besag',
      adjust.for.con.comp = TRUE,
      constr = TRUE, graph = mat,
      scale.model = TRUE,
      param = c(0.5, 0.0015)) +
    f(unstruct, model = 'iid', param = c(0.5, 0.0015))
# Gaussian likelihood with the precision fixed at 1 and scaled by
# logit.prec, so that the first stage variance is treated as known
fit.weighted <- inla(formula,
    family = "gaussian", data = props.w,
    control.predictor = list(compute = TRUE),
    control.family = list(hyper = list(prec = list(
        initial = log(1), fixed = TRUE))),
    scale = props.w$logit.prec)
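The combination of a fixed precision and the scale argument is what encodes the known first stage variance. As a sketch of the reasoning, assuming the usual INLA parameterization of the Gaussian likelihood in which the observation precision is the product of the scale and the precision hyperparameter (written here as $s_i$ and $\tau$, symbols introduced for this note):

```latex
y_i \mid \theta_i \sim N\left(\theta_i, \frac{1}{s_i \, \tau}\right),
\qquad
\tau = \exp(0) = 1 \ \text{(fixed)},
\qquad
s_i = \frac{1}{\hat{V}_i}
\quad \Longrightarrow \quad
\mathrm{var}(y_i \mid \theta_i) = \hat{V}_i.
```

Because the hyperparameter is specified on the log scale, an initial value of log(1) with fixed = TRUE pins the precision at one, leaving the per-area scale to carry the design-based variance.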

Weighted and smoothed model: compare with weighted

Figure: Left: posterior median (weighted) against survey-weighted logit
prevalence. Right: posterior variance (weighted) against survey-weighted
logit prevalence variance.

Using SUMMER

Weighted and smoothed model: using SUMMER

The SUMMER (Spatio-Temporal Under-Five Mortality Methods for
Estimation in R) package provides a function fitSpace() that produces
weighted and smoothed estimates.

library(SUMMER)
fit <- fitSpace(data = kingdata, geo = kingshape,
    Amat = mat, family = "binomial", responseVar = "diab2",
    strataVar = "strata", weightVar = "rwt_llcp",
    regionVar = "hracode", clusterVar = "~1",
    hyper = NULL, CI = 0.95)

SUMMER: default hyperpriors

For binary variables, the default hyperprior for the precisions is
Gamma(0.5, 0.001488), which leads to a 95% prior interval for the residual
odds ratio of [0.5, 2], i.e., for
$$u \mid \tau \sim_{iid} \mathrm{Normal}(0, 1/\tau), \qquad \tau \sim \mathrm{Gamma}(0.5, 0.001488),$$
the (2.5%, 97.5%) quantiles of $\exp(u)$ are roughly [0.5, 2]. See Section
9.6.2 of Wakefield (2013) for more details.

The structured effects are all scaled to have unit generalized marginal
variance, so that the precision parameter has a similar interpretation.
See https://www.math.ntnu.no/inla/r-inla.org/tutorials/inla/scale.model/scale-model-tutorial.pdf
for more details about the scaled models.
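The prior interval quoted above can be checked numerically. The sketch relies on the standard result that a normal random effect with a Gamma(a, b) prior (shape a, rate b) on its precision is marginally a Student t with 2a degrees of freedom and scale sqrt(b/a); the variable names are introduced here for illustration.

```r
a <- 0.5        # Gamma shape
b <- 0.001488   # Gamma rate

# Marginally, u is a scaled Student t with 2a degrees of freedom
scale_u <- sqrt(b / a)
u_q <- qt(c(0.025, 0.975), df = 2 * a) * scale_u

# Quantiles of the residual odds ratio exp(u)
or_interval <- exp(u_q)
round(or_interval, 2)
```

The resulting 2.5% and 97.5% quantiles round to 0.5 and 2, in agreement with the stated prior interval for the residual odds ratio.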

SUMMER fit

The fit object from the SUMMER package contains
- HT
  - HT.est.original, HT.variance.original: mean and variance of the
    direct estimates accounting for survey weights.
  - HT.est, HT.sd, HT.variance, HT.prec: mean, and asymptotic
    sd, variance and precision, of the logit-transformed direct estimates
    accounting for survey weights.
- smooth
  - mean.trans, sd.trans, median.trans, lower.trans,
    upper.trans: mean, sd, median, and posterior credible interval limits
    (level specified by the CI argument in the function call) of the
    posterior prediction on the probability scale.
  - mean, sd, median, lower, upper: mean, sd, median, and posterior
    credible interval limits (level specified by the CI argument in the
    function call) of the posterior prediction on the logit scale.

Easier visualization: merge all results

# standard deviation of the direct estimates on the probability scale
fit$HT$HT.sd.original <- fit$HT$HT.variance.original^0.5
# combine direct and smoothed estimates, ordered as in props
combined <- merge(fit$HT, fit$smooth, by = "region")
combined <- combined[match(props$hracode,
    combined$region), ]
# naive
combined$p.hat.naive <- props$p.hat
combined$se.p.hat.naive <- props$se.p.hat
# binomial smoothing
combined$p.hat.binomial <- props.smooth$p.hat
combined$se.p.hat.binomial <- props.smooth$se.p.hat

Easier visualization

g <- mapPlot(data = combined, geo = kingshape,
    variables = c("p.hat.naive", "p.hat.binomial",
        "HT.est.original", "median.trans"),
    labels = c("Naive", "Unweighted smoothing: posterior median",
        "Survey-weighted", "Weighted smoothing: posterior median"),
    by.data = "region", by.geo = "HRA2010v2_",
    is.long = FALSE)
g <- g + theme_bw() + theme(axis.title = element_blank(),
    axis.text = element_blank())
g <- g + scale_fill_viridis("Prevalence")
g

Figure: Maps of prevalence by HRA in four panels: Naive; Unweighted
smoothing: posterior median; Survey-weighted; Weighted smoothing: posterior
median. The shared color scale runs from about 0.05 to 0.20.

Easier visualization

g <- mapPlot(data = combined, geo = kingshape,
    variables = c("se.p.hat.naive", "se.p.hat.binomial",
        "HT.sd.original", "sd.trans"),
    labels = c("Naive", "Unweighted smoothing: posterior SD",
        "Survey-weighted", "Weighted smoothing: posterior SD"),
    by.data = "region", by.geo = "HRA2010v2_",
    is.long = FALSE)
g <- g + theme_bw() + theme(axis.title = element_blank(),
    axis.text = element_blank())
g <- g + scale_fill_viridis("SD(Prevalence)")
print(g)

Figure: Maps of SD(Prevalence) by HRA in four panels: Naive; Unweighted
smoothing: posterior SD; Survey-weighted; Weighted smoothing: posterior SD.
The shared color scale runs from about 0.01 to 0.05.

Conclusion

The last two plots illustrate the effect of the Bayesian smoothing model:
- the estimates are shrunk (both globally and locally), which introduces
  bias,
- the uncertainty is in general reduced, due to the use of all the data.
Overall:
- It is clear we need to consider the weighting.
- The smoothing does increase precision, at the expense of a little bias.

References I

Chen, C., Wakefield, J., and Lumley, T. (2014). The use of sample
weights in Bayesian hierarchical models for small area estimation.
Spatial and Spatio-Temporal Epidemiology, 11:33–43.
Fay, R. and Herriot, R. (1979). Estimates of income for small places: an
application of James–Stein procedure to census data. Journal of the
American Statistical Association, 74:269–277.
Korn, E. and Graubard, B. (1999). Analysis of Health Surveys. John
Wiley and Sons, New York.
Li, Z. R., Hsiao, Y., Godwin, J., Martin, B., Wakefield, J., and Clark,
S. J. (2018). Changes in the spatial distribution of the Under Five
Mortality Rate: small-area analysis of 122 DHS Surveys in 262
subregions of 35 Countries in Africa. Submitted.
Mercer, L., Wakefield, J., Chen, C., and Lumley, T. (2014). A
comparison of spatial smoothing methods for small area estimation
with sampling weights. Spatial Statistics, 8:69–85.

References II

Mercer, L., Wakefield, J., Pantazis, A., Lutambi, A., Mosanja, H., and
Clark, S. (2015). Small area estimation of childhood
mortality in the absence of vital registration. Annals of Applied
Statistics, 9:1889–1905.
Pfeffermann, D. (2013). New important developments in small area
estimation. Statistical Science, 28:40–68.
Rao, J. (2003). Small Area Estimation. John Wiley, New York.
Rao, J. and Molina, I. (2015). Small Area Estimation, Second Edition.
John Wiley, New York.
Song, L., Mercer, L., Wakefield, J., Laurent, A., and Solet, D. (2016).
Peer reviewed: Using small-area estimation to calculate the prevalence
of smoking by subcounty geographic areas in King County, Washington,
Behavioral Risk Factor Surveillance System, 2009–2013. Preventing
Chronic Disease, 13.

