ANALYZING COMPLEX
SURVEY DATA
Second Edition
Ronald N. Forthofer
SAGE PUBLICATIONS
International Educational and Professional Publisher
Thousand Oaks London New Delhi
Copyright © 2006 by Sage Publications, Inc.
All rights reserved. No part of this book may be reproduced or utilized in any form or by any
means, electronic or mechanical, including photocopying, recording, or by any information
storage and retrieval system, without permission in writing from the publisher.
1. INTRODUCTION
analytic studies by social and policy scientists have increased, and a variety
of current issues are being examined, using available social survey data, by
researchers who were not involved with the data collection process. This
tradition is known as secondary analysis (Kendall & Lazarsfeld, 1950).
Often, the researcher fails to pay due attention to the development of com-
plex sample designs and assumes that these designs have little bearing on
the analytic procedures to be used.
The increased use of statistical techniques in secondary analysis and the
recent use of log-linear models, logistic regression, and other multivariate
techniques (Aldrich & Nelson, 1984; Goodman, 1972; Swafford, 1980)
have done little to bring design and analysis into closer alignment. These
techniques are predicated on the use of simple random sampling with re-
placement (SRSWR); however, this assumption is rarely met in social surveys
that employ stratification and clustering of observational units along with
unequal probabilities of selection. As a result, the analysis of social surveys
using the SRSWR assumption can lead to biased and misleading results.
Kiecolt and Nathan (1985), for example, acknowledged this problem in
their Sage book on secondary analysis but provided little guidance on
how to incorporate the sample weights and other design features into the
analysis. A recent review of literature in public health and epidemiology
shows that the use of design-based survey analysis methods is gradually
increasing but remains at a low level (Levy & Stolte, 2000).
Any survey that puts restrictions on the sampling beyond those of
SRSWR is complex in design and requires special analytic considerations.
This book reviews the analytic issues raised by the complex sample survey,
provides an introduction to analytic strategies, and presents illustrations
using some of the available software. Our discussion is centered on the use
of the sample weights to correct for differential representations and the
effect of sample designs on estimation of sampling variance with some dis-
cussion of weight development and adjustment procedures. Many other
important issues of dealing with nonsampling errors and handling missing
data are not fully addressed in this book.
The basic approach presented in this book is the traditional way of ana-
lyzing complex survey data. This approach is now known as design-based
(or randomization-based) analysis. A different approach to analyzing com-
plex survey data is the so-called model-based analysis. As in other areas of
statistics, model-based statistical inference has gained more attention in
survey data analysis in recent years. The modeling approaches are introduced
at various steps of survey data analysis: in defining the parameters, defining
estimators, and estimating variances; however, there are no generally
accepted rules for model selection or for validating a specified model.
Nevertheless, some understanding of the model-based approach is essential
blocks. This multistage design satisfies the requirement that all population
elements have a known nonzero probability of being selected.
Types of Sampling
The simplest sample design is simple random sampling, which requires
that each element have an equal probability of being included in the sample
and that the list of all population elements be available. Selection of a
sample element can be carried out with or without replacement. Simple
random sampling with replacement (SRSWR) is of special interest because
it simplifies statistical inference by eliminating any relation (covariance)
between the selected elements through the replacement process. In this
scheme, however, an element can appear more than once in the sample. In
practice, simple random sampling is carried out without replacement
(SRSWOR), because there is no need to collect the information more than
once from an element. Additionally, SRSWOR gives a smaller sampling
variance than SRSWR. However, these two sampling methods are practi-
cally the same in a large survey in which a small fraction of population
elements is sampled. We will use the term SRS for SRSWOR throughout
this book unless otherwise specified.
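The near-equivalence of the two schemes at small sampling fractions follows from the finite population correction (1 − n/N). A brief sketch (Python, with illustrative values; the variance definitions are simplified by ignoring the distinction between the two population-variance divisors, which is negligible for large N):

```python
def var_mean_srswr(pop_var, n):
    """Sampling variance of the sample mean under SRS with replacement."""
    return pop_var / n

def var_mean_srswor(pop_var, n, N):
    """Sampling variance of the sample mean under SRS without replacement:
    the SRSWR form multiplied by the finite population correction (1 - n/N)."""
    return (1 - n / N) * pop_var / n

# With a 1% sampling fraction, the two variances are nearly identical,
# which is why SRSWR theory serves as an approximation in large surveys.
v_wr = var_mean_srswr(100.0, 100)            # population variance 100, n = 100
v_wor = var_mean_srswor(100.0, 100, 10_000)  # N = 10,000, so f = 0.01
print(v_wr, v_wor)
```

With a 1% sampling fraction, the without-replacement variance is only 1% smaller, which is why SRSWR theory is commonly used as an approximation in large surveys.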
The SRS design is modified further to accommodate other theoretical
and practical considerations. The common practical designs include sys-
tematic sampling, stratified random sampling, multistage cluster sampling,
PPS sampling (probability proportional to size), and other controlled selec-
tion procedures. These more practical designs deviate from SRS in two
important ways. First, the inclusion probabilities for the elements (also
the joint inclusion probabilities for sets of the elements) may be unequal.
Second, the sampling unit can be different from the population element of
interest. These departures complicate the usual methods of estimation and
variance calculation and, if proper methods of analysis are not used, can
lead to a bias in estimation and statistical tests. We will consider these
departures in detail, using several specific sampling designs, and examine
their implications for survey analysis.
Systematic sampling is commonly used as an alternative to SRS because
of its simplicity. It selects every k-th element after a random start (between
1 and k). Its procedural tasks are simple, and the process can easily be
checked, whereas it is difficult to verify SRS by examining the results. It is
often used in the final stage of multistage sampling when the fieldworker is
instructed to select a predetermined proportion of units from the listing of
dwellings in a street block. The systematic sampling procedure assigns each
element in a population the same probability of being selected. This ensures
that the sample mean will be an unbiased estimate of the population mean
when the number of elements in the population (N) is equal to k times the
number of elements in the sample (n). If N is not exactly nk, then the equal
probability is not guaranteed, although this problem can be ignored when
N is large. In that case, we can use the circular systematic sampling scheme.
In this scheme, the random starting point is selected between 1 and N (any
element can be the starting point), and every k-th element is selected assum-
ing that the frame is circular (the end of the list is connected to the beginning
of the list). Systematic sampling can give an unrealistic estimate, however,
when the elements in the frame are listed in a cyclical manner with respect to
survey variables and the selection interval coincides with the listing cycle.
For example, if one selects every 40th patient coming to a clinic and the
average daily patient load is about 40, then the resulting systematic sample
would contain only those who came to the clinic at a particular time of the
day. Such a sample may not be representative of the clinic patients.
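The linear and circular selection schemes described above can be sketched as follows (Python; the frame is represented by 0-based indices, and all values are illustrative):

```python
import random

def systematic_sample(N, k, rng):
    """Linear systematic sampling: pick a random start between 1 and k,
    then take every k-th element of the frame (0-based indices returned)."""
    start = rng.randrange(1, k + 1)
    return list(range(start - 1, N, k))

def circular_systematic_sample(N, n, k, rng):
    """Circular systematic sampling: the start can be any of the N elements,
    and selection wraps around the end of the list, so each element keeps
    the same inclusion probability even when N is not a multiple of k."""
    start = rng.randrange(N)
    return [(start + j * k) % N for j in range(n)]

rng = random.Random(1)
print(systematic_sample(N=100, k=10, rng=rng))
print(sorted(circular_systematic_sample(N=103, n=10, k=10, rng=rng)))
```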
Moreover, even when the listing is randomly ordered, unlike SRS, differ-
ent sets of elements may have unequal inclusion probabilities. For example,
the probability of including both the i-th and the (i + k)-th element is 1/k
in a systematic sample, whereas the probability of including both the i-th
and the (i + k + 1)-th is zero. This complicates the variance calculation.
Another way of viewing systematic sampling is that it is equivalent to
selecting one cluster from k systematically formed clusters of n elements
each. The sampling variance (between clusters) cannot be estimated from
the one selected cluster. Thus, variance estimation from a systematic sample
requires special strategies.
A modification to overcome these problems with systematic sampling
is the so-called repeated systematic sampling (Levy & Lemeshow, 1999,
pp. 101–110). Instead of taking a systematic sample in one pass through the
list, several smaller systematic samples are selected, going down the list
several times with a new starting point in each pass. This procedure not only
guards against possible periodicity in the frame but also allows variance
estimation directly from the data. The variance of an estimate from all sub-
samples can be estimated from the variability of the separate estimates from
each subsample. This idea of replicated sampling offers a strategy for esti-
mating variance for complex surveys, which will be discussed further in
Chapter 4.
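The repeated systematic sampling idea can be sketched as follows (Python; the frame values and subsample counts are illustrative). Each pass through the list uses its own random start, and the variance of the overall estimate is obtained from the variability of the subsample means:

```python
import math
import random

def repeated_systematic_means(frame, t, k, rng):
    """Draw t systematic subsamples, each with its own random start,
    and return the t subsample means."""
    means = []
    for _ in range(t):
        start = rng.randrange(k)
        sub = frame[start::k]
        means.append(sum(sub) / len(sub))
    return means

def replicated_se(estimates):
    """SE of the overall estimate from t replicate estimates:
    sqrt( sum (y_i - ybar)^2 / (t * (t - 1)) )."""
    t = len(estimates)
    ybar = sum(estimates) / t
    return math.sqrt(sum((y - ybar) ** 2 for y in estimates) / (t * (t - 1)))

rng = random.Random(42)
frame = [rng.random() for _ in range(2000)]   # stand-in survey variable
reps = repeated_systematic_means(frame, t=10, k=100, rng=rng)
print(round(replicated_se(reps), 4))
```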
Stratified random sampling classifies the population elements into strata
and samples separately from each stratum. It is used for several reasons:
(a) The sampling variance can be reduced if strata are internally homoge-
neous, (b) separate estimates can be obtained for strata, (c) administration
of fieldwork can be organized using strata, and (d) different sampling needs
can be accommodated in separate strata. Allocation of the sample across
the strata is proportionate when the sampling fraction is uniform across
the strata.
The design-based variance estimator is

    V̂_D(Ŷ) = (1 − n/N)(N²/n) Σ[yi − (ȳ/x̄)xi]² / (n − 1).

The model-based estimator is

    V̂_M(Ŷ) = (1 − x/X)(X²/x) Σ[{yi − (ȳ/x̄)xi}/√xi]² / (n − 1),

where x is the sample total and X is the population total of the auxiliary
variable (see Lohr, 1999, sec. 3.4).
The ratio estimate model is valid when (a) the relation between yi and
xi is a straight line through the origin and (b) the variance of yi about this
line is proportional to xi . It is known that the ratio estimate is inferior to the
expansion estimate (without the auxiliary variable) when the correlation
between yi and xi is less than one-half the ratio of coefficient of variation
of xi over the coefficient of variation of yi (Cochran, 1977, chap. 6).
Therefore, the use of ratio estimation in survey analysis would require
checking the model assumptions. In practice, when the data set includes
a large number of variables, ratio estimation would be cumbersome because
different auxiliary variables would have to be selected for different estimates.
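Under SRSWOR, the design-based variance of the ratio estimate of a total can be computed directly from the residuals yi − (ȳ/x̄)xi. A sketch (Python, toy data; the population size N and auxiliary total X are assumed known):

```python
def ratio_total_and_var(y, x, N, X):
    """Ratio estimate of a population total, Yhat = (ybar/xbar) * X, with the
    design-based variance (1 - n/N) * (N^2/n) * sum(e_i^2) / (n - 1),
    where e_i = y_i - (ybar/xbar) * x_i."""
    n = len(y)
    r = sum(y) / sum(x)                 # sample ratio ybar/xbar
    y_hat = r * X
    resid = [yi - r * xi for yi, xi in zip(y, x)]
    v = (1 - n / N) * (N ** 2 / n) * sum(e ** 2 for e in resid) / (n - 1)
    return y_hat, v

# Toy data: y roughly proportional to x, so the residual variance is small.
y = [2.1, 3.9, 6.2, 8.0, 9.8]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat, v = ratio_total_and_var(y, x, N=100, X=300)
print(y_hat, v)
```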
To apply the model-based approach to a real problem, we must first be
able to produce an adequate model. If the model is wrong, the model-based
estimators will be biased. When using model-based inference in sampling,
one needs to check the assumptions of the model by examining the data
carefully. Checking the assumptions may be difficult in many circum-
stances. The adequacy of a model is to some extent a matter of judgment,
and a model adequate for one analysis may not be adequate for another
analysis or another survey.
Two essential aspects of survey data analysis are adjusting for the
differential representation of sample observations and assessing the loss or
gain in precision resulting from the complexity of the sample selection
design. This chapter introduces the concept of weight and discusses the
effect of sample selection design on variance estimation. To illustrate the
versatility of weighting in survey analysis, we present two examples of
developing and adjusting sample weights.
Equation 3.1 shows the use of the expansion weight in the weighted
sum of sample observations. Because the weight is the same for each element
in SRS, the estimator can be simplified to N times the sample mean (the
last quantity in Equation 3.1). Similarly, the estimator of the
population mean is defined as Ȳ^ = Σwi yi / Σwi, which is the weighted
sample mean. In SRS, this simplifies to (N/n) Σyi / N = ȳ, showing that
the sample mean is an estimator for the population mean. However, even
if the weights are not the same (in unequal probability designs), the esti-
mators are still a weighted sum for the population total and a weighted
average for the population mean.
Although the expansion weight appears appropriate for the estimator of
the population total, it may play havoc with the sample mean and other sta-
tistical measures. For example, using the sum of expansion weights in con-
tingency tables in place of relative frequencies based on sample size may
lead to unduly large confidence in the data. To deal with this, the expansion
weight can be scaled down to produce the relative weight, (rw)i, which is
defined to be the expansion weight divided by the mean of the expansion
weights; that is, (rw)i = wi/w̄, where w̄ = Σwi/n. These relative weights for
all elements in the sample add up to n. For the SRS design, (rw)i is 1 for
each element. The estimator for the population total weighted by the
relative weights is

    Ŷ = w̄ Σ(rw)i yi = (N/n) Σyi = N ȳ.                      (3.2)
Note in Equation 3.2 that the relative weighted sum is multiplied by the
average expansion weight, which yields the same simplified estimator for
the case of SRS as in Equation 3.1. Hence, the expansion weight is simpler
to use than the relative weight in estimating the population total. The
relative weight is appropriate in analytic studies, but it is inappropriate in
estimating totals and computing finite population corrections.
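The expansion and relative weights can be illustrated as follows (Python, SRS case with illustrative values); the relative weights sum to n, and both weightings lead to the same estimates:

```python
def expansion_weight(N, n):
    """Expansion weight under SRS: each sample element represents N/n elements."""
    return N / n

def relative_weights(w):
    """Relative weights: expansion weights divided by their mean; they sum to n."""
    wbar = sum(w) / len(w)
    return [wi / wbar for wi in w]

# SRS of n = 4 from N = 100: every expansion weight is 25, every relative weight is 1.
w = [expansion_weight(100, 4)] * 4
y = [2.0, 4.0, 6.0, 8.0]

total_hat = sum(wi * yi for wi, yi in zip(w, y))          # N * ybar = 500
mean_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)  # weighted mean = 5
rw = relative_weights(w)
print(total_hat, mean_hat, sum(rw))
```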
TABLE 3.1
Derivation of Poststratification Adjustment Factor:
General Social Survey, 1984
Demographic      Population        Number of    Weighted Sample    Adjustment
Subgroups        Distribution (1)  Adults (2)   Distribution (3)   Factor (1)/(3)
White, male
18–24 years .0719660 211 .0739832 0.9727346
25–34 .1028236 193 .0676718 1.5194460
35–44 .0708987 277 .0795933 0.8907624
45–54 .0557924 135 .0473352 1.1786660
55–64 .0544026 144 .0504909 1.0774730
65 and over .0574872 138 .0483871 1.1880687
White, female
18–24 years .0705058 198 .0694250 1.0155668
25–34 .1007594 324 .1136045 0.8869317
35–44 .0777364 267 .0936185 0.8303528
45–54 .0582026 196 .0682737 0.8469074
55–64 .0610057 186 .0652174 0.9354210
65 and over .0823047 216 .0757363 1.0867272
Nonwhite, male
18–24 years .0138044 34 .0119215 1.1579480
25–34 .0172057 30 .0105189 1.6356880
35–44 .0109779 30 .0105189 1.0436290
45–54 .0077643 37 .0129734 0.5984774
55–64 .0064683 12 .0042076 1.5372900
65 and over .0062688 18 .0063113 0.9932661
Nonwhite, female
18–24 years .0145081 42 .0145081 .9851716
25–34 .0196276 86 .0301543 .6509067
35–44 .0130655 38 .0133240 .9806026
45–54 .0094590 33 .0115708 .8174890
55–64 .0079636 30 .0105189 .7570769
65 and over .0090016 27 .0094670 .9508398
Total 1.0000000 2,852 1.0000000
SOURCE: U.S. Bureau of the Census, Estimates of the population of the United States, by age, sex, and race,
1980 to 1985 (Current Population Reports, Series P-25, No. 985), April 1986. Noninstitutional population
estimates are derived from the estimated total population of 1984 (Table 1), adjusted by applying the ratio of
noninstitutional to total population (Table A1).
the same as the population distribution. The adjustment factors indicate that
without the adjustment, the GSS sample underrepresents males 25–34 years
of age and overrepresents nonwhite males 45–54 years of age and nonwhite
females 25–34 years of age.
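The derivation of the adjustment factor in column (1)/(3) can be sketched in a few lines (Python; the subgroups and numbers here are hypothetical, not the GSS figures):

```python
def poststrat_factors(pop_props, weighted_counts):
    """Poststratification adjustment factor per subgroup:
    known population proportion / weighted sample proportion."""
    total = sum(weighted_counts.values())
    return {g: pop_props[g] / (weighted_counts[g] / total) for g in pop_props}

# Hypothetical subgroups: the sample underrepresents group "a".
pop_props = {"a": 0.60, "b": 0.40}
weighted_counts = {"a": 500.0, "b": 500.0}
factors = poststrat_factors(pop_props, weighted_counts)
print(factors)  # {'a': 1.2, 'b': 0.8}
```

A factor above 1 inflates the weights of an underrepresented subgroup; a factor below 1 deflates an overrepresented one.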
TABLE 3.2
Comparison of Weighted and Unweighted Estimates in Two Surveys
Weighted Unweighted
Surveys and Variables Estimates Estimates
SOURCE: Data for the Epidemiologic Catchment Areas Survey are from E. S. Lee, Forthofer, and Lorimor
(1986), Table 1.
The adjusted relative weights are then used in the analysis of the data—for
example, to estimate the proportion of adults responding positively to the
question, ‘‘Are there any situations that you can imagine in which you would
approve of a man punching an adult male stranger?’’ As shown in the upper
section of Table 3.2, the weighted overall proportion is 60.0%, slightly larger
than the unweighted estimate of 59.4%. The difference between the weighted
and unweighted estimates is also very small for the subgroup estimates
shown. This may be due primarily to the self-weighting feature reflected in
the fact that most households have two adults and, to a lesser extent, to the
fact that the ‘‘approval of hitting’’ is not correlated with the number of adults
in a household. The situation is different in the National Institute of Mental
Health-Sponsored Epidemiologic Catchment Area (ECA) Survey. In this
survey, the weighted estimates of the prevalence of any disorders and of anxi-
ety disorders are, respectively, 20% and 26% lower than the unweighted
estimates, as shown in Table 3.2.
Finally, the adjusted weights should be examined to see whether there
are any extremely large values. Extreme variation in the adjusted weights
may imply that the sample sizes in some poststrata are too small to be reli-
able. In such a case, some small poststrata need to be collapsed, or some raking
procedure must be used to smooth out the rough edges (Little & Rubin, 1987,
pp. 59–60).
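A minimal sketch of the raking idea (iterative proportional fitting on a 2 × 2 table of weighted counts; Python, hypothetical margins):

```python
def rake(table, row_targets, col_targets, iters=50):
    """Iterative proportional fitting (raking): alternately scale rows and
    columns of a weighted table until its margins match the target margins."""
    t = [row[:] for row in table]
    for _ in range(iters):
        for i, target in enumerate(row_targets):      # match row margins
            s = sum(t[i])
            t[i] = [v * target / s for v in t[i]]
        for j, target in enumerate(col_targets):      # match column margins
            s = sum(row[j] for row in t)
            for i in range(len(t)):
                t[i][j] *= target / s
    return t

# Hypothetical 2x2 weighted sample table raked to known population margins.
raked = rake([[30.0, 20.0], [20.0, 30.0]],
             row_targets=[60.0, 40.0], col_targets=[50.0, 50.0])
print(raked)
```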
TABLE 3.3
Logistic Regression Model for Attrition
Adjustment in a Follow-Up Survey
Logistic Regression Model
Factors    Category Variable    Beta Coefficient    Survey-Related Information
initial survey, suggesting that the procedure worked reasonably well. (If there
were a large discrepancy between the sum of the adjusted weights and the sum
of the weights in the initial survey, there would be a concern about the adjust-
ment process.) The adjusted weights are readjusted to align to the initial sur-
vey. To show the effect of using attrition-adjusted weights, the prevalence
rates of six selected mental disorders were estimated with and without the attri-
tion-adjusted weights, as shown in Table 3.3. Using the adjusted weights, we
see that the prevalence of any disorders (DSM-III defined disorders) is nearly
5 percentage points higher than the unadjusted prevalence.
Because the class sizes are equal, the average number of books read per
student (population mean) is the mean of the N class means. The n sample
classes can be viewed as a random sample of n means from a population of
N means. Therefore, the sample mean (ȳ) is unbiased for the population
mean (Ȳ), and its variance, applying Equation 2.1, is given by

    V̂(ȳ) = (s²_b/n)(1 − f),                                  (3.3)

where s²_b = Σ(ȳi − ȳ)²/(n − 1) is the estimated variance of the cluster
means. Alternately, Equation 3.3 can be expressed in terms of the estimated
ICC (ρ̂) as follows (Cochran, 1977, chap. 9):

    V̂(ȳ) = s²[1 + (M − 1)ρ̂](1 − f)/(nM),                    (3.4)

where s² = ΣΣ(yij − ȳ)²/(nM − 1) is the variance of the elements in the
sample. If this is divided by the variance of the mean from an SRSWOR
sample of size nM [V̂(ȳ) = (s²/(nM))(1 − f), applying Equation 2.1], then
the design
Replicate:            9    8   13   12   14    8   10    7   10    8    Total = 99
Proportion of Boys: .45  .40  .65  .60  .70  .40  .50  .35  .50  .40    Proportion = .495

The overall percentage of boys is 49.5%, and its standard error is 3.54%
(= √(49.5 × 50.5/200)). The standard error estimated from the 10 replicate
estimates using Equation 4.1 is 3.58%. It is easy to get an approximate
estimate of 3.50% by taking one tenth of the range (70% − 35%). The chief
advantage of replication is ease in estimation of the standard errors.
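Using the 10 replicate proportions above, the calculation can be sketched as follows (Python). Equation 4.1 itself is not reproduced in this excerpt; the sketch uses the generic replicated-variance form Σ(pi − p̄)²/[t(t − 1)], so its result approximates, rather than exactly reproduces, the quoted values:

```python
import math

# Replicate proportions of boys from the text (10 replicates).
props = [0.45, 0.40, 0.65, 0.60, 0.70, 0.40, 0.50, 0.35, 0.50, 0.40]

t = len(props)
p_bar = sum(props) / t                                     # overall 0.495
se_rep = math.sqrt(sum((p - p_bar) ** 2 for p in props) / (t * (t - 1)))
se_range = (max(props) - min(props)) / t                   # range/10 shortcut
se_binom = math.sqrt(p_bar * (1 - p_bar) / 200)            # SRS binomial SE

print(round(p_bar, 3), round(se_rep, 4), round(se_range, 4), round(se_binom, 4))
```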
In practice, the fundamental principle of selecting independent replicates
is somewhat relaxed. For one thing, replicates are selected using sampling
without replacement instead of with replacement. For unequal probability
designs, the calculation of basic weights and the adjustment for nonresponse
and poststratification usually are performed only once for the full sample,
rather than separately within each replicate. In cluster sampling, the replicates
often are formed by systematically assigning the clusters to the t replicates in
the same order that the clusters were first selected, to take advantage of strati-
fication effects. In applying Equation 4.1, the sample mean from the full
sample generally is used for the mean of the replicate means. These deviations
from fundamental principles can affect the variance estimation, but the bias is
thought to be insignificant in large-scale surveys (Wolter, 1985, pp. 83–85).
The community mental health survey conducted in New Haven, Connec-
ticut, in 1984 as part of the ECA Survey of the National Institute of Mental
Health (E. S. Lee, Forthofer, Holzer, & Taube, 1986) provides an example
of replicated sampling. The sampling frame for this survey was a geogra-
phically ordered list of residential electric hookups. A systematic sample
was drawn by taking two housing units as a cluster, with an interval of 61
houses, using a starting point chosen at random. A string of clusters in the
sample was then sequentially allocated to 12 subsamples. These subsam-
ples were created to facilitate the scheduling and interim analysis of data
during a long period of screening and interviewing. Ten of the subsamples
were used for the community survey, with the remaining two reserved for
another study. The 10 replicates are used to illustrate the variance estimation
procedure.
These subsamples did not strictly adhere to a fundamental principle of
independent replicated sampling because the starting points were systema-
tically selected, except for the first random starting point. However, the sys-
tematic allocation of clusters to subsamples in this case introduced an
approximate stratification leading to more stable variance estimation and,
TABLE 4.1
Estimation of Standard Errors From Replicates:
ECA Survey in New Haven, 1984 (n = 3,058)
Regression Coefficientsa
SOURCE: Adapted from ‘‘Complex Survey Data Analysis: Estimation of Standard Errors Using Pseudo-
Strata,’’ E. S. Lee, Forthofer, Holzer, and Taube, Journal of Economic and Social Measurement,
copyright © 1986 by the Journal of Economic and Social Measurement. Adapted with permission.
a. The dependent variable (coded as 1 = condition present and 0 = condition absent) is regressed on sex
(1 = male, 0 = female), color (1 = black, 0 = nonblack), and age (continuous variable). This analysis is
used for demonstration only.
b. Percentage with any mental disorders during the last 6 months.
c. Sex difference in the 6-month prevalence rate.
rate, in percent (p), can be calculated from the replicate estimates (pi)
using Equation 4.1:

    v(p) = Σ(pi − 17.17)² / [10(10 − 1)] = 0.3474,

and the standard error is √0.3474 = 0.59. The overall prevalence rate of
17.17% is slightly different from the mean of the 10 replicate estimates
because of the differences in response rates. Note that one tenth of the
range in the replicate estimates (0.61) approximates the standard error
obtained by Equation 4.1. Similarly, standard errors can be estimated for
the odds ratio and regression coefficients. The estimated standard errors
have approximately the same values as those calculated by assuming
simple random sampling (using appropriate formulas from textbooks).
This indicates that design effects are fairly small for these statistics from
this survey.
Although the replicated sampling design provides a variance estimator that
is simple to calculate, a sufficient number of replicates is required to obtain
acceptable precision for statistical inference. But if there is a large number of
replicates and each replicate is relatively small, the use of stratification
within each replicate is severely limited. Most important, it is impractical to implement
replicated sampling in complex sample designs. For these reasons, a repli-
cated design is seldom used in large-scale, analytic surveys. Instead, the
replicated sampling idea has been applied to estimate variance in the data-
analysis stage. This attempt gave rise to pseudo-replication methods for
variance estimation. The next two techniques are based on this idea of
pseudo-replication.
TABLE 4.2
Orthogonal Matrix of Order 44
Rows Columns (44)
1 11111111111111111111111111111111111111111111
2 10100101001110111110001011100000100011010110
3 10010010100111011111000101110000010001101011
4 11001001010011101111100010111000001000110101
5 11100100101001110111110001011100000100011010
6 10110010010100111011111000101110000010001101
7 11011001001010011101111100010111000001000110
8 10101100100101001110111110001011100000100011
9 11010110010010100111011111000101110000010001
10 11101011001001010011101111100010111000001000
11 10110101100100101001110111110001011100000100
12 10011010110010010100111011111000101110000010
13 10001101011001001010011101111100010111000001
14 11000110101100100101001110111110001011100000
15 10100011010110010010100111011111000101110000
16 10010001101011001001010011101111100010111000
17 10001000110101100100101001110111110001011100
18 10000100011010110010010100111011111000101110
19 10000010001101011001001010011101111100010111
20 11000001000110101100100101001110111110001011
21 11100000100011010110010010100111011111000101
22 11110000010001101011001001010011101111100010
23 10111000001000110101100100101001110111110001
24 11011100000100011010110010010100111011111000
25 10101110000010001101011001001010011101111100
26 10010111000001000110101100100101001110111110
27 10001011100000100011010110010010100111011111
28 11000101110000010001101011001001010011101111
29 11100010111000001000110101100100101001110111
30 11110001011100000100011010110010010100111011
31 11111000101110000010001101011001001010011101
32 11111100010111000001000110101100100101001110
33 10111110001011100000100011010110010010100111
34 11011111000101110000010001101011001001010011
35 11101111100010111000001000110101100100101001
36 11110111110001011100000100011010110010010100
37 10111011111000101110000010001101011001001010
38 10011101111100010111000001000110101100100101
39 11001110111110001011100000100011010110010010
40 10100111011111000101110000010001101011001001
41 11010011101111100010111000001000110101100100
42 10101001110111110001011100000100011010110010
43 10010100111011111000101110000010001101011001
44 11001010011101111100010111000001000110101100
SOURCE: Adapted from Wolter (1985, p. 328) with permission of the publisher.
TABLE 4.3
Estimated Proportions Approving One Adult Hitting Another in the
BRR Replicates: General Social Survey, 1984 (n = 1,473)
Estimate (%) Estimate (%)
1. Compute a pseudo sample mean deleting the first sample value, which
   results in ȳ(1) = (5 + 2 + 1 + 4)/4 = 12/4. Now, by deleting the
   second sample value instead, we obtain the second pseudo-mean
   ȳ(2) = 10/4; likewise ȳ(3) = 13/4, ȳ(4) = 14/4, and ȳ(5) = 11/4.

2. Compute the mean of the five pseudo-values: ȳ = Σȳ(i)/n =
   (60/4)/5 = 3, which is the same as the sample mean.

3. The variance can then be estimated from the variability among the
   five pseudo-means, each of which contains four observations:

       v(ȳ) = [(n − 1)/n] Σ(ȳ(i) − ȳ)² = 0.5.                (4.6)
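The worked example above can be verified directly (Python; the sample is {3, 5, 2, 1, 4}, as implied by the pseudo-means):

```python
def jackknife_variance(sample):
    """Delete-one jackknife for the sample mean:
    v(ybar) = ((n - 1)/n) * sum_i (ybar_(i) - ybar)^2,
    where ybar_(i) is the mean with the i-th observation deleted."""
    n = len(sample)
    total = sum(sample)
    pseudo_means = [(total - y) / (n - 1) for y in sample]
    grand = sum(pseudo_means) / n           # equals the sample mean
    v = (n - 1) / n * sum((m - grand) ** 2 for m in pseudo_means)
    return grand, v

# The worked example from the text: sample {3, 5, 2, 1, 4}.
mean, v = jackknife_variance([3, 5, 2, 1, 4])
print(mean, v)  # 3.0 0.5
```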
This estimator has the same form as Equation 4.3 and can be modified to
include one replicate, without averaging with the complement, from each
stratum, as in Equation 4.4 for the BRR method, which gives

    v(ū) = Σ_h (ū_h − ū)²,                                   (4.8)

with the sum taken over the L strata. When r_h replicates are formed
within stratum h, the estimator becomes

    v(ū) = Σ_h [(n_h − 1)/r_h] Σ_i (ū_hi − ū)².              (4.9)
TABLE 4.4
Estimated Proportions Approving One Adult Hitting Another
in the JRR Replicates: General Social Survey, 1984
Estimate (%) Estimate (%)
From a closer examination of data in Table 4.4, one may get an impression
that there is less variation among the JRR replicate estimates than among the
BRR replicate estimates in Table 4.3. We should note, however, that the JRR
represents a different strategy that uses a different method to estimate the var-
iance. Note that Equation 4.3 for the BRR includes the number of replicates
(t) in the denominator, whereas Equation 4.7 for the JRR is not dependent on
the number of replicates. The reason is that in the JRR, the replicate estimates
themselves are dependent on the number of replicates formed. Because the
replicate is formed by deleting one unit, the replicate estimate would be closer to
the overall estimate when a large number of units is available to form the
replicates, compared to the situation where a small number of units is used.
    v(ū) = (1/B) Σ_{i=1..B} (u_i − ū)².                      (4.10)
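The bootstrap variance estimator of Equation 4.10 can be sketched as follows (Python; the sample and number of bootstrap replicates B are illustrative):

```python
import random

def bootstrap_variance(sample, estimator, B, rng):
    """Bootstrap variance in the form of Equation 4.10: resample with
    replacement B times and take v = (1/B) * sum_b (theta_b - theta_bar)^2
    over the B bootstrap estimates."""
    n = len(sample)
    thetas = [estimator(rng.choices(sample, k=n)) for _ in range(B)]
    theta_bar = sum(thetas) / B
    return sum((t - theta_bar) ** 2 for t in thetas) / B

rng = random.Random(7)
sample = [3, 5, 2, 1, 4]
v_boot = bootstrap_variance(sample, lambda s: sum(s) / len(s), B=2000, rng=rng)
print(round(v_boot, 3))   # near the plug-in variance of the mean, (10/5)/5 = 0.4
```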
The same ideas carry over to functions of more than one random variable.
In the case of a function of two variables, the Taylor series expansion
yields

    V[f(x1, x2)] ≅ (∂f/∂x1)² V(x1) + (∂f/∂x2)² V(x2)
                   + 2(∂f/∂x1)(∂f/∂x2) Cov(x1, x2).          (4.12)
Applying Equation 4.12 to a ratio of two variables x and y, that is,
r = y/x, we obtain the variance formula for a ratio estimator:

    V(r) ≅ [V(y) + r² V(x) − 2r Cov(x, y)] / x²
         = r² [V(y)/y² + V(x)/x² − 2 Cov(x, y)/(xy)].
Extending Equation 4.12 to the case of c random variables, the approximate
variance of θ̂ = f(x1, x2, . . . , xc) is

    V(θ̂) ≅ Σ_i Σ_j (∂f/∂xi)(∂f/∂xj) Cov(xi, xj).            (4.13)
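The two algebraically equivalent forms of the ratio variance can be checked numerically (Python; the moments below are illustrative, not survey values):

```python
def taylor_ratio_variance(x, y, var_x, var_y, cov_xy):
    """First-order Taylor (linearization) variance of the ratio r = y/x:
    V(r) ~= [V(y) + r^2 V(x) - 2 r Cov(x, y)] / x^2."""
    r = y / x
    return (var_y + r ** 2 * var_x - 2 * r * cov_xy) / x ** 2

# Illustrative moments (not from the survey): totals x = 50, y = 25.
x, y = 50.0, 25.0
r = y / x
v1 = taylor_ratio_variance(x, y, var_x=4.0, var_y=2.0, cov_xy=1.5)
# Equivalent relative-variance form of the same expansion:
v2 = r ** 2 * (2.0 / y ** 2 + 4.0 / x ** 2 - 2 * 1.5 / (x * y))
print(v1, v2)
```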
TABLE 4.5
Standard Errors Estimated by Taylor Series Method for
Percentage Approving One Adult Hitting Another:
General Social Survey, 1984 (n = 1,473)
Estimate Standard Error Design
Subgroup (%) (%) Effect
be mean imputation for continuous variables, but this procedure will distort
the shape of the distribution. To preserve the shape of the distribution, one
can use hot deck imputation, regression imputation, or multiple imputation.
Some simple illustrations of imputation will be presented in the next chapter
without going into detailed discussion. If imputation is used, the variance
estimators may need some adjustment (Korn & Graubard, 1999, sec. 5.5),
but that topic is beyond the scope of this book.
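A minimal sketch of hot deck imputation (Python; missing values are coded as None, and donors are drawn at random from the observed values, which preserves the shape of the distribution better than mean imputation):

```python
import random

def hot_deck_impute(values, rng):
    """Hot deck imputation: replace each missing value (None) with a value
    drawn at random from the observed (donor) values."""
    donors = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(donors) for v in values]

rng = random.Random(3)
raw = [12.0, None, 15.0, 11.0, None, 14.0]
completed = hot_deck_impute(raw, rng)
print(completed)
```

In practice, donors are usually drawn within imputation classes of similar respondents rather than from the whole sample.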
Prior to any substantive analysis, it is also necessary to examine whether
each of the PSUs has a sufficient number of observations. It is possible that
some PSUs may contain only a few observations, or even none, because of
nonresponse and exclusion of missing values. A PSU with no observations,
or only a few, may be combined with an adjacent PSU within the same
stratum. The stratum with a single PSU as a result of combining PSUs may
then be combined with an adjacent stratum. However, collapsing too many
PSUs and strata destroys the original sample design. The resulting data analy-
sis may be of questionable value, because it is no longer possible to determine
what population is represented by the sample resulting from the combined
PSUs and strata. The number of observations that is needed in each
PSU depends on the type of analysis planned. The required number will
be larger for analytic studies than for estimation of descriptive statistics.
A general guideline is that the number should be large enough to estimate the
intra-PSU variance for the given estimate.
To illustrate this point, we consider the GSS data. An unweighted tabula-
tion by stratum and PSU has shown that the number of observations in the
PSU ranges from 8 to 49, with most of the frequencies being larger than 13,
indicating that the PSUs probably are large enough for estimating variances
for means and proportions. For an analytic study, we may want to investi-
gate the percentage of adults approving of hitting, by education and gender.
For this analysis we need to determine if there are a number of PSUs with-
out observations in a particular education-by-gender category. If there are
many PSUs with no observation for some education-by-gender category,
this calls into question the estimation of the variance–covariance matrix,
which is based on the variation in the PSU totals within the strata. The
education (3 levels) by gender (2 levels) tabulation by PSU showed that
42 of the 84 PSUs had at least one gender-by-education cell with no observa-
tions. Even after collapsing education into two categories, we will have to
combine nearly half of the PSUs. Therefore, we should not attempt to investi-
gate, simultaneously, the gender and education variables in relation to the
question about hitting. However, it is possible to analyze gender or education
alone in relation to hitting without combining many PSUs and strata.
Subgroup analysis of complex survey data cannot be conducted by select-
ing out the observations in the analytic domain. Although the case selection
would not alter the basic weights, it might destroy the basic sample design.
For example, selecting one small ethnic group may eliminate a portion of the
PSUs and reduce the number of observations substantially in the remaining
PSUs. As a result, it would be difficult to assess, from the subset, the design
effect inherent in the basic design. Even if the basic design is not totally
destroyed, selecting out observations from a complex survey sample may
lead to incorrect variance estimation, as explained above in connection
with the handling of missing values. Correct estimation of variance
requires keeping the entire data set in the analysis and assigning weights of
zero to observations outside the analytic domain. The subpopulation analy-
sis procedures available in software packages are designed to conduct a
subgroup analysis without selecting out the observations in the analytic
domain. The subpopulation analysis will be discussed further in Chapter 6.
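The zero-weight device can be illustrated with a minimal sketch (Python, hypothetical data): the point estimate from zero-weighting matches the one from selecting out, while the full file retains the design structure needed for variance estimation.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Full sample: y values, weights, and a domain indicator (1 = in domain).
y      = [25.0, 31.0, 22.0, 28.0, 27.0]
w      = [2.0,  1.0,  3.0,  1.5,  2.5]
domain = [1,    0,    1,    0,    1]

# Zero out the weights outside the domain instead of dropping rows.
w_dom = [wi if d == 1 else 0.0 for wi, d in zip(w, domain)]
mean_zero_weight = weighted_mean(y, w_dom)

# Selecting out gives the same point estimate ...
y_sub = [yi for yi, d in zip(y, domain) if d == 1]
w_sub = [wi for wi, d in zip(w, domain) if d == 1]
mean_selected = weighted_mean(y_sub, w_sub)
# ... but it discards the stratum/PSU structure that the variance
# estimator needs, which is why the zero-weight (subpop) approach
# is preferred for complex survey data.
```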
The first step in a preliminary analysis is to explore the basic distributions
of key variables. The tabulations may point out the need for refining opera-
tional definitions of variables and for combining categories of certain vari-
ables. Based on summary statistics, one may learn about interesting patterns
and distributions of certain variables in the sample. After analyzing the vari-
ables one at a time, we can next investigate the existence of relations to screen
out variables that are clearly not related to one another or to some dependent
variables. It may be possible to conduct a preliminary exploration using
standard SRS-based statistical and graphic methods. Given the role of
weights in survey data, however, a preliminary analysis that ignores the
weights may not accomplish its goal. One way to conduct a preliminary
analysis that takes the weights into account is to select a subsample of
manageable size, with selection probability proportional to the magnitude
of the weights, and to explore the subsample using standard statistical and
graphic methods. This procedure will be illustrated in the first section of
Chapter 6.
TABLE 5.1
Creation of Replicate Weights for BRR and Jackknife Procedure
(A) SUDAAN statements for BRR:

data brr;
input stratum psu beds aids wt w1-w8;
aids2=aids*2501;
datalines;
1 1 72 20 2 0 0 0 4 0 4 4 4
1 2 87 49 6 12 12 12 0 12 0 0 0
2 1 99 38 2 4 0 0 0 4 0 4 4
2 2 48 23 2 0 4 4 4 0 4 0 0
3 1 99 38 2 4 4 0 0 0 4 0 4
3 2 131 78 4 0 0 8 8 8 0 8 0
4 1 42 7 2 0 4 4 0 0 0 4 4
4 2 38 28 2 4 0 0 4 4 4 0 0
5 1 42 26 2 4 0 4 4 0 0 0 4
5 2 34 9 2 0 4 0 0 4 4 4 0
6 1 39 18 4 0 8 0 8 8 0 0 8
6 2 76 20 2 4 0 4 0 0 4 4 0
;
proc ratio design=brr deff;
weight wt;
repwgt w1-w8;
numer aids2;
denom beds;
run;

(B) SUDAAN statements for jackknife method:

data jackknife;
input stratum psu beds aids wt;
aids2=aids*2501;
datalines;
1 1 72 20 2
1 2 87 49 6
2 1 99 38 2
2 2 48 23 2
3 1 99 38 2
3 2 131 78 4
4 1 42 7 2
4 2 38 28 2
5 1 42 26 2
5 2 34 9 2
6 1 39 18 4
6 2 76 20 2
;
proc ratio design=jackknife deff;
nest stratum;
weight wt;
numer aids2;
denom beds;
run;
The jackknife procedure does not require replicate weights. The program
creates the replicates by deleting one PSU in each replicate. The SUDAAN
statements for JRR and the results are shown in Panel (B) of Table 5.1.
The standard error estimated by the jackknife procedure is 141.3, which is
smaller than the BRR estimate. The standard error calculated by the Taylor
series method (assuming with-replacement sampling) was 137.6, slightly less
than the jackknife estimate but similar to the estimate from Fay’s BRR. As
discussed in Chapter 4, BRR and JRR assume with-replacement sampling. If
we assume without-replacement sampling (the finite population correction is
used), the standard error is estimated to be 97.3 for this example.
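For readers who want to see the mechanics, the delete-one-PSU jackknife for the ratio estimate can be reproduced from the Table 5.1 data in a few lines of Python (a simplified stratified JRR with two PSUs per stratum; the with-replacement assumption is implicit):

```python
# Table 5.1 data: (stratum, psu, beds, aids, weight); aids2 = aids * 2501.
rows = [
    (1, 1, 72, 20, 2), (1, 2, 87, 49, 6),
    (2, 1, 99, 38, 2), (2, 2, 48, 23, 2),
    (3, 1, 99, 38, 2), (3, 2, 131, 78, 4),
    (4, 1, 42, 7, 2), (4, 2, 38, 28, 2),
    (5, 1, 42, 26, 2), (5, 2, 34, 9, 2),
    (6, 1, 39, 18, 4), (6, 2, 76, 20, 2),
]

def ratio(data):
    """Weighted ratio estimate: sum(w * aids2) / sum(w * beds)."""
    num = sum(w * aids * 2501 for _, _, beds, aids, w in data)
    den = sum(w * beds for _, _, beds, aids, w in data)
    return num / den

theta = ratio(rows)

# Delete one PSU at a time; double the weight of the remaining PSU in
# that stratum (n_h = 2) and recompute the ratio. The variance is
# sum over strata of ((n_h - 1) / n_h) * sum of squared deviations.
variance = 0.0
for h in range(1, 7):
    for drop_psu in (1, 2):
        replicate = [
            (s, p, beds, aids, 2 * w if s == h else w)
            for s, p, beds, aids, w in rows
            if not (s == h and p == drop_psu)
        ]
        variance += 0.5 * (ratio(replicate) - theta) ** 2

se = variance ** 0.5   # about 141.3, matching the jackknife result quoted above
```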
The third National Health and Nutrition Examination Survey (NHANES III)4
from the National Center for Health Statistics (NCHS) contains the replicate
weights for BRR. The replicate weights were created for Fay’s method with
k = 0.3 incorporating nonresponse and poststratification adjustments at
different stages of sampling. As Korn and Graubard (1999) suggested, a
preferred approach is using Fay’s method of creating replicate weights
TABLE 6.1
Subsample and Total Sample Estimates for Selected Characteristics
of U.S. Adult Population, NHANES III, Phase II
Sample                        Mean Age    Vitamin Use  Hispanic Population  SBP^a       Correlation Between BMI^b and SBP
Total sample (n = 9,920)^c
  Unweighted                  46.9 years  38.4%        26.1%                125.9 mmHg  0.153
  Weighted                    43.6        42.9         5.4                  122.3       0.243
PPS subsample (n = 1,000)
  Unweighted                  42.9        43.0         5.9                  122.2       0.235
The Phase II sample of NHANES III consisted of 9,920 adults. We first sorted the total sample by stratum and PSU
and then selected a PPS subsample systematically using a skipping interval of
9.92 on the scale of cumulated relative weights. The sorting by stratum and
PSU preserved in essence the integrity of the original sample design.
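The systematic PPS selection just described can be sketched as follows (Python; the data and seed are hypothetical, and the records are assumed to be pre-sorted by stratum and PSU):

```python
import random

def pps_systematic_subsample(weights, n_sub, seed=2006):
    """Systematic PPS selection on the scale of cumulated weights:
    walk along the cumulative weight totals with a fixed skipping
    interval and a random start, returning the selected indices.
    Because the records are already sorted by stratum and PSU, the
    subsample preserves the ordering of the original design."""
    total = sum(weights)
    interval = total / n_sub            # e.g., 9.92 on the relative-weight scale
    start = random.Random(seed).uniform(0, interval)
    picks, cum, i = [], 0.0, 0
    for k in range(n_sub):
        target = start + k * interval
        # Advance until unit i's cumulative interval contains the target.
        while cum + weights[i] <= target:
            cum += weights[i]
            i += 1
        picks.append(i)
    return picks

# Hypothetical example: 20 records; the second half carries large weights
# and so dominates the selections.
wts = [1.0] * 10 + [5.0] * 10
chosen = pps_systematic_subsample(wts, 6)
```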
Table 6.1 demonstrates the usefulness of a PPS subsample that can be
analyzed with conventional statistical packages. In this demonstration, we
selected several variables that are most affected by the weights. Because of
oversampling of the elderly and ethnic minorities, the weighted estimates are
different from the unweighted estimates for mean age and percentage of
Hispanics. The weights also make a difference for vitamin use and systolic
blood pressure because they are heavily influenced by the oversampled cate-
gories. The subsample estimates, although not weighted, are very close to the
weighted estimates in the total sample, demonstrating the usefulness of a PPS
subsample for preliminary analysis. A similar observation can be made based
on the correlation between body mass index and systolic blood pressure.
The PPS subsample is very useful for exploring the data without formally
incorporating the weights, especially for students in introductory courses.
It is especially well suited for exploring the data by graphic methods such
as scatterplot, side-by-side boxplot, and the median-trace plot. The real
advantage is that the resampled data are approximately representative of the
population and can be explored ignoring the weights. The point estimates
from the resampled data are approximately the same as the weighted esti-
mates in the whole data set. Any interesting patterns discovered from the
resampled data are likely to be confirmed by a more complete analysis using
Stata or SUDAAN, although the standard errors are likely to be different.
missing for educat and height. We used a hot deck5 procedure to impute
values for these four variables by selecting donor observations randomly with
probability proportional to the sample weights within 5-year age categories
by gender. The same donor was used to impute values when there were
missing values in one or more variables for an observation. Regression
imputation was used for height (3.7% missing; 2.8% based on weight, age,
gender, and ethnicity; and 0.9%, based on age, gender, and ethnicity), weight
(2.8% missing, based on height, age, gender, and ethnicity), sbp (2.5% miss-
ing, based on height, weight, age, gender and ethnicity), and pir (10% miss-
ing, based on family size, educat, and ethnicity). About 0.5% of imputed pir
values were negative, and these were set to 0.001 (the smallest pir value in
the data). Parenthetically, we could have brought other anthropometric mea-
sures into the regression imputation, but our demonstration was based simply
on the variables selected for this analysis. Finally, the bmi values (5.5%
missing) were recalculated based on updated weight and height information.
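A minimal sketch of the weighted hot deck step may clarify the procedure (Python; the records, imputation-cell variables, and weights are hypothetical):

```python
import random

def hot_deck_impute(records, var, cell_vars, seed=1):
    """Weighted hot deck: for each record missing `var` (None), draw a
    donor from the same imputation cell with probability proportional
    to the donors' sample weights, and copy the donor's value."""
    rng = random.Random(seed)
    # Group potential donors (records with a non-missing value) by cell.
    donors = {}
    for r in records:
        if r[var] is not None:
            cell = tuple(r[v] for v in cell_vars)
            donors.setdefault(cell, []).append(r)
    for r in records:
        if r[var] is None:
            cell = tuple(r[v] for v in cell_vars)
            pool = donors[cell]
            donor = rng.choices(pool, weights=[d["wt"] for d in pool])[0]
            r[var] = donor[var]

# Hypothetical records: age-group-by-gender cells, 'educat' to impute.
recs = [
    {"agegrp": 1, "gender": 1, "wt": 2.0, "educat": 12},
    {"agegrp": 1, "gender": 1, "wt": 1.0, "educat": 16},
    {"agegrp": 1, "gender": 1, "wt": 1.5, "educat": None},
    {"agegrp": 1, "gender": 2, "wt": 2.0, "educat": 10},
    {"agegrp": 1, "gender": 2, "wt": 1.0, "educat": None},
]
hot_deck_impute(recs, "educat", ("agegrp", "gender"))
```

Reusing one donor for all of an observation's missing variables, as the text describes, would simply mean drawing the donor once per recipient rather than once per variable.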
To demonstrate that the sample weight and design effect make a difference,
the analysis was performed under three different options: (a) unweighted,
ignoring the data structure; (b) weighted, ignoring the data structure; and
(c) survey analysis, incorporating the weights and sampling features. The first
option assumes simple random sampling, and the second recognizes the
weight but ignores the design effect. The third option provides an appropriate
analysis for the given sample design.
First, we examined the weighted means and proportions and their standard
errors with and without the imputed values. The imputation had an
inconsequential impact on the point estimates and slightly reduced the
estimated standard errors under the third analytic option. The weighted
mean pir without imputed values was 3.198 (standard error = 0.114),
compared with 3.168 (s.e. = 0.108) with imputed values. For bmi, the
weighted mean was 25.948 (s.e. = 0.122) without imputation and 25.940
(s.e. = 0.118) with imputation. For the other variables, the point estimates
and their standard errors were identical to the third decimal place because
there were so few missing values.
The estimated descriptive statistics (using imputed values) are shown
in Table 6.2. The calculation was performed using Stata. The unweighted
statistics in the top panel were produced by the nonsurvey commands
summarize for point estimates and ci for standard errors. The weighted
analysis (second option) in the top panel was obtained by the same nonsurvey
commands with the use of [w=wgt]. The third analysis, incorporating
the weights and the design features, is shown in the bottom panel. It was
conducted using svyset [pweight=wgt], strata(stra) psu(psu) to set the
complex survey features and svymean to estimate the means or
proportions of the specified variables.
TABLE 6.2
Descriptive Statistics for the Variables Selected for Regression
Analysis of Adults 17 Years and Older From NHANES III,
Phase II (n = 9,920): An Analysis Using Stata
(A) Weighted and unweighted statistics, ignoring the design features
The statistics shown in Table 6.2 are the estimated means for the continuous
variables, proportions for the binary variables, and their standard errors.
The differences between the weighted and unweighted means/proportions
are slight for some variables and considerable for others. The weighted
proportion is more than 60% smaller than the unweighted proportion for
blacks and nearly 80% smaller for Hispanics, reflecting the oversampling
of these two ethnic groups. The weighted mean age
is about 3.5 years less than the unweighted mean because the elderly also
were oversampled. On the other hand, the weighted mean is considerably
greater than the unweighted mean for the poverty index and for the number
of years of schooling, suggesting that the oversampled minority groups are
concentrated in the lower ranges of income and schooling. The weighted
estimate for vitamin use is also somewhat greater than the unweighted
estimate; the lower unweighted estimate may reflect lower vitamin use
among the oversampled minority groups.
The bottom panel presents the survey estimates that reflect both the
weights and design features. Although the estimated means and proportions
are exactly the same as the weighted statistics in the top panel, the standard
errors increase substantially for all variables. This difference is reflected in
the design effect in the table (the square of the ratio of standard error in the
bottom panel to that for the weighted statistic in the top panel). The large
design effects for poverty index, education, and age partially reflect the resi-
dential homogeneity with respect to these characteristics. The design effects
of these socioeconomic variables and age are larger than those for the pro-
portion of blacks and Hispanics. The opposite was true in the NHANES II
conducted in 1976–1980 (data presented in the first edition of this book),
suggesting that residential areas are now more homogeneous with respect
to socioeconomic status than with respect to ethnicity.
The bottom panel also shows the 95% confidence intervals for the means
and proportions. The t value used for the confidence limits is not the familiar
value of 1.96 that might be expected from the sample of 9,920 (the sum of the
relative weights). The reason for this is that in a multistage cluster sampling
design, the degrees of freedom are based on the number of PSUs and strata,
rather than the sample size, as in SRS. Typically, the degrees of freedom in
complex surveys are determined as the number of PSUs sampled minus the
number of strata used. In our example, the degrees of freedom are 23
(= 46 − 23) and t(23, 0.975) = 2.0687; this t value is used in all confidence
intervals in Table 6.2. In certain circumstances, the degrees of freedom may
be determined somewhat differently from the above general rule (see Korn &
Graubard, 1999, sec. 5.2).
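The degrees-of-freedom rule and the resulting confidence limits can be written out directly (Python sketch; the t value 2.0687 is taken from the text rather than computed, and the mean and standard error are the weighted bmi estimates reported above):

```python
# Degrees of freedom for a stratified multistage design:
# number of sampled PSUs minus number of strata.
n_psus, n_strata = 46, 23
df = n_psus - n_strata          # 23 for this NHANES III Phase II design

# Confidence limits then use t(df) rather than the normal 1.96;
# t(23, 0.975) = 2.0687 (a t table or a statistics library would
# give the same value).
t_crit = 2.0687

def ci(mean, se, t=t_crit):
    """Two-sided confidence interval: mean +/- t * se."""
    return (mean - t * se, mean + t * se)

# Example with the weighted mean BMI and its design-based standard error.
lo, hi = ci(25.940, 0.118)
```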
In Table 6.3, we illustrate examples of conducting subgroup analysis. As
mentioned in the previous chapter, any subgroup analysis using complex
survey data should be done using the entire data set without selecting out
the data in the analytic domain. There are two options for conducting proper
subgroup analysis in Stata: the use of by or subpop. Examples of conduct-
ing a subgroup analysis for blacks are shown in Table 6.3. In the top panel,
the mean BMI is estimated separately for nonblacks and blacks by using
the by option. The mean BMI for blacks is greater than for nonblacks.
Although the design effect of BMI among nonblacks (5.5) is similar to the
overall design effect (5.0 in Table 6.2), it is only 1.1 among blacks.
Stata also can be used to test linear combinations of parameters. The
equality of the two population subgroup means can be tested using the
lincom command ([bmi]1 - [bmi]0, testing the hypothesis that the population
mean BMI for blacks (black = 1) differs from that for nonblacks
(black = 0)), and the difference is statistically significant based on the
TABLE 6.3
Comparison of Mean Body Mass Index Between Black and
Nonblack Adults 17 Years and Older, NHANES III,
Phase II (n = 9,920): An Analysis Using Stata
(A) . svyset [pweight=wgt], strata(stra) psu(psu)
. svymean bmi, by(black)
t test. The design effect is 1.46, indicating that the t value for this test is
reduced about 20% to compensate for the sample design features.
Alternatively, the subpop option can be used to estimate the mean BMI
for blacks, as shown in the bottom panel. This option uses the entire data set
by setting the weights to zero for those outside the analytic domain. The
mean, standard error, and design effect are the same as those calculated for
blacks using the by option in the top panel. Next, we selected out blacks by
specifying the domain (if black == 1) to estimate the mean BMI. This
approach did not work, because there were no blacks in some of the PSUs.
The tabulation of blacks by stratum and PSU showed that only one PSU
remained in the 13th and 15th strata. When these two strata were collapsed
with adjacent strata, Stata produced a result. Although the point estimate is
the same as before, the standard error and design effect are different. As a
general rule, subgroup analysis with survey data should avoid selecting out
a subset, unlike in the analysis of SRS data.
Besides the svymean command for descriptive analysis, Stata supports the
following descriptive analyses: svytotal (for the estimation of population
total), svyratio (for the ratio estimation), and svyprop (for the estimation of
proportions). In SUDAAN, these descriptive statistics can be estimated by
the DESCRIPT procedure, and subdomain analysis can be accommodated by
the use of the SUBPOPN statement.
This is a linear model in the sense that the dependent variable (Yi) is
represented by a linear combination of the βj's plus εi. The βj is the
coefficient of the independent variable (Xj) in the equation, and εi is the
random error term
in the model that is assumed to follow a normal distribution with a mean of 0
and a constant variance and to be independent of the other error terms.
In regression analysis, the independent variables are either continuous or
discrete variables, and the βj ’s are the corresponding coefficients. In the
ANOVA, the independent variables (Xj's) are indicator variables (under
dummy coding, each category of a factor has a separate indicator variable
coded 1 or 0) that show which effects are added to the model, and the βj's
are the effects.
Ordinary least squares (OLS) estimation is used to obtain estimates of the
regression coefficients or the effects in the linear model when the data result
from a SRS. However, several changes in the methodology are required to
deal with data from a complex sample. The data now consist of the individual
observations plus the sample weights and the design descriptors. As was
discussed in Chapter 3, the subjects from a complex sample usually have
TABLE 6.4
Summary of Multiple Regression Models for Body Mass
Index on Selected Variables for U.S. Adults From
NHANES III, Phase II (n = 9,920):
An Analysis Using Stata
unaccounted for by the model. Other important variables are not included
in this model. Perhaps the satisfactory specification of a model for predict-
ing BMI may not be possible within the scope of NHANES III data.
Both the unweighted and weighted analyses indicate that age is positively
related, and age squared is negatively related, to BMI. This indicates that the
age effect is curvilinear, with a dampening trend for older ages, as one might
expect. The poverty index and education are negatively associated with BMI.
Examining the regression coefficients for the binary variables, both blacks
and Hispanics have positive coefficients, indicating that these two ethnic
groups have greater BMI than their counterparts. The systolic blood pressure
is positively related to BMI, and the vitamin users, who may be more con-
cerned about their health, have a lower BMI than the nonusers. Those who
have ever smoked have BMIs less than half a point lower than those who
never smoked.
There are differences between the unweighted and weighted analyses.
Although the education effect is small (beta coefficient = −0.009) in
the unweighted analysis, it increases considerably in absolute value (beta
coefficient = −0.111) in the weighted analysis. If a preliminary analysis
were conducted without using the sample weights, one could have over-
looked education as an important predictor. This example clearly points to
the advantage of using a PPS subsample for a preliminary analysis that was
discussed at the beginning of this chapter. The negative coefficient for
smoking status dampens slightly, suggesting that the negative effect of
smoking on BMI is more pronounced for the oversampled groups than for
their counterparts. Again, the importance of the sample weights is
demonstrated here.
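The way weights can change a regression coefficient is easy to demonstrate with a one-predictor sketch (Python; the data are contrived so that the heavily weighted group shows a stronger association, as described for education and BMI above):

```python
def slope(x, y, w=None):
    """Least squares slope of y on x; weighted if w is given."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den

# Hypothetical data: a heavily weighted group (second half) in which the
# x-y association is much stronger than in the rest of the sample.
x = [0, 1, 2, 3, 0, 1, 2, 3]
y = [5.0, 5.1, 5.2, 5.3, 5.0, 4.0, 3.0, 2.0]
w = [1, 1, 1, 1, 5, 5, 5, 5]

b_unweighted = slope(x, y)      # modest negative slope
b_weighted = slope(x, y, w)     # considerably more negative
```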
The analytical results taking into account the weights and design features
are shown in the bottom panel. This analysis was done using the svyregress
command. The estimated regression coefficients and R2 are the same as
those shown in the weighted analysis because the same formula is used in
the estimation. However, the standard errors of the coefficients and the t sta-
tistics are considerably different from those in the weighted analysis. The
design effects of the estimated regression coefficients ranged from 0.89 for
Hispanics to 4.53 for poverty-to-income ratio. Again we see that a complex
survey design may result in a larger variance for some variables than for
their SRS counterparts, but not necessarily for all the variables. In this parti-
cular example, the general analytic conclusions that were drawn in the pre-
liminary analysis also were true in the final analysis, although the standard
errors for regression coefficients increased for all but one variable.
Comparing the design effects in Tables 6.2 and 6.4, one finds that the
design effects for regression coefficients are somewhat smaller than for
the means and proportions. So, applying the design effect estimated from the
means and totals to regression coefficients (when the clustering information
is not available from the data) would lead to conclusions that are too conser-
vative. Smaller design effects may be possible in a regression analysis if the
regression model controls for some of the cluster-to-cluster variability. For
example, if part of the reason for people in the same cluster having similar
BMI is similar age and education, then one would expect that adjusting for
age and education in the regression model might account for some of cluster-
to-cluster variability. The clustering effect would then have less impact on
the residuals from the model.
Regression analysis can also be conducted by using the REGRESS
procedure in SUDAAN as follows:
PROC REGRESS DESIGN = wr;
NEST stra psu;
WEIGHT wgt;
MODEL bmi = age agesq black hispanic pir educat sbp
vituse smoker;
RUN;
TABLE 6.5
Comparison of Vitamin Use by Level of Education Among U.S. Adults,
NHANES III, Phase II (n = 9,920): An Analysis Using Stata
TABLE 6.6
Analysis of Gender Difference in Vitamin Use by
Level of Education Among U.S. Adults, NHANES III,
Phase II (n = 9,920): An Analysis Using SAS and SUDAAN
(A) Unweighted analysis by SAS:
proc freq;
tables edu*sex*vituse / nopercent nocol chisq measures cmh;
run;
[Output summarized below]
the males’ odds of taking vitamins are 63% of the females’ odds. The 95%
confidence interval does not include 1, suggesting that the difference is sta-
tistically significant. The odds ratios are consistent across three levels of
education. Because the ratios are consistent, we can combine 2 × 2 tables
across the three levels of education. We can then calculate the Cochran-
Mantel-Haenszel (CMH) chi-square (df = 1) and the CMH common odds
ratio. The education-adjusted odds ratio is 0.64, and its 95% confidence
interval does not include 1.
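The CMH common odds ratio itself is a simple weighted combination of the stratum-specific tables; a sketch (Python, with hypothetical counts, not the NHANES frequencies) is:

```python
def mh_common_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio over K strata.
    Each table is (a, b, c, d) for the 2x2 layout [[a, b], [c, d]],
    e.g., rows = gender and columns = vitamin use within one
    education stratum: OR_MH = sum(ad/n) / sum(bc/n)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Hypothetical counts for three education strata:
tables = [(40, 60, 55, 45), (70, 90, 95, 65), (80, 70, 100, 50)]
or_mh = mh_common_odds_ratio(tables)
```

With a single stratum, the formula reduces to the crude odds ratio ad/bc, which is a convenient check on the implementation.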
The lower panel of Table 6.6 shows the results of using the CROSSTAB
procedure in SUDAAN to perform the same analysis, taking the survey
design into account. On the PROC statement, DESIGN = wr designates
with-replacement sampling, meaning that the finite population correction is
not used. The NEST statement designates the stratum and PSU variables.
The WEIGHT statement gives the weight variable. The SUBGROUP state-
ment declares three discrete variables, and the LEVELS statement specifies
the number of levels in each discrete variable. The TABLES statement
defines the form of contingency table. The PRINT statement requests nsum
(frequencies), wsum (weighted frequencies), rowper (row percent), cor
(crude odds ratio), upcor (upper limit of cor), lowcor (lower limit of cor),
chisq (chi-square statistic), chisqp (p value for chi-square statistic), cmh
(CMH statistic), and cmhpval (p value for CMH).
The weighted percentages of vitamin use are slightly different from the
unweighted percentages. The Wald chi-square values in three separate
analyses are smaller than the Pearson chi-square values in the upper panel
except for the middle level of education. Although the odds ratio
remained almost the same at the lower level of education, it decreased
somewhat at the middle and higher levels of education. The CROSSTAB
procedure in SUDAAN did not compute the common odds ratio, but it can
be obtained from a logistic regression analysis to be discussed in the next
section.
standard likelihood-ratio test for model fit should not be used with the
survey logistic regression analysis. Instead of the likelihood-ratio test,
the adjusted Wald test statistic is used.
The selection and inclusion of appropriate predictor variables for a logistic
regression model can be done similarly to the process for linear regression.
When analyzing a large survey data set, the preliminary analysis strategy
described in the earlier section is very useful in preparing for a logistic regres-
sion analysis.
To illustrate logistic regression analysis, the same data used in Table 6.6
are analyzed using Stata. The analytical results are shown in Table 6.7. The
Stata output is edited somewhat to fit into a table. The outcome variable
is vitamin use (vituse), and explanatory variables are gender (1 = male;
0 = female) and level of education (edu). The interaction term is not included
in this model, based on the CMH statistic shown in Table 6.6. First, we per-
formed standard logistic regression analysis, ignoring the weight and design
features. The results are shown in Panel A. Stata automatically performs the
dummy coding for discrete variables when the xi prefix precedes the logit
command and i. is added in front of the variable name. The output shows
the omitted level of each discrete variable. In this case, the level "male" is
in the model, and its effect is measured from the effect of "female," the
reference level. For education, being less than a high school
graduate is the reference level. The likelihood-ratio chi-square value is 325.63
(df = 3) with p value of < 0.00001, and we reject the hypothesis that gender
and education together have no effect on vitamin usage, suggesting that there
is a significant effect. However, the pseudo R2 suggests that most of the varia-
tion in vitamin use is unaccounted for by these two variables. The parameter
estimates for gender and education and their estimated standard errors are
shown as well as the corresponding test statistics. All factors are significant.
Including the or option in the logit command produces odds ratios instead of
beta coefficients. The estimated odds ratio for males is 0.64, meaning that
the odds of a male taking vitamins are 64% of the odds for a female, after
adjusting for education. This odds ratio is the same as the
CMH common odds ratio shown in Table 6.6. The significance of the odds
ratio can be tested using either a z test or a confidence interval. The odds
ratio for the third level of education suggests that, for a given gender,
persons with some college education are twice as likely to take vitamins as
those with less than 12 years of education. None of the confidence intervals
includes 1, suggesting that all effects are significant.
Panel B of Table 6.7 shows the goodness-of-fit statistic (chi-square with
df = 2). The large p value suggests that the main effects model fits the data
(not significantly different from the saturated model). In this simple situation,
TABLE 6.7
Logistic Regression Analysis of Vitamin Use on Gender and
Level of Education Among U.S. Adults, NHANES III,
Phase II (n = 9,920): An Analysis Using Stata
(A) Standard logistic regression (unweighted, ignoring sample design):
. xi: logit vituse i.male i.edu
(C) Survey logistic regression (incorporating the weights and design features):
. svyset [pweight=wgt], strata (stra) psu(psu)
. xi: svylogit vituse i.male i.edu
the two degrees of freedom associated with the goodness of fit of the model
can also be interpreted as the two degrees of freedom associated with the
gender-by-education interaction. Hence, there is no interaction of gender and
education in relation to the proportion using vitamin supplements, confirming
the CMH analysis shown in Table 6.6.
Panel C of Table 6.7 shows the results of logistic regression analysis for
the same data, with the survey design taken into account. The log likelihood
is not shown because the pseudo likelihood is used. Instead of a likelihood-
ratio statistic, the F statistic is used. Again, the p value suggests that
the main effects model is a significant improvement over the null model.
The estimated parameters and odds ratios changed slightly because of the
sample weights, and the estimated standard errors of beta coefficients
increased as reflected in the design effects. Despite the increased standard
errors, the beta coefficients for gender and education levels are significantly
different from 0. The odds ratio for males adjusted for education decreased
to 0.61 from 0.64. Although the odds ratio remained about the same for
the second level of education, its p value increased considerably, to 0.008
from < 0.0001, because the design was taken into account.
After the logistic regression model was run, the effect of linear combination
of parameters was tested as shown in Panel D. We wanted to test the hypoth-
esis that the sum of parameters for male and the third level of education is zero.
Because there is no interaction effect, the resulting odds ratio of 1.3 can be
interpreted as indicating that the odds of taking vitamins for males with some
college education are 30% higher than the odds for the reference group (females with
less than 12 years of education). SUDAAN also can be used to perform a logis-
tic regression analysis, using its LOGISTIC procedure in the stand-alone ver-
sion or the RLOGIST procedure in the SAS callable version (a different name
used to distinguish it from the standard logistic procedure in SAS).
Finally, the logistic regression model also can be used to build a prediction
model for a synthetic estimation. Because most health surveys are designed to
estimate the national statistics, it is difficult to estimate health characteristics
for small areas. One approach to obtain estimates for small areas is the syn-
thetic estimation utilizing the national health survey and demographic infor-
mation of local areas. LaVange, Lafata, Koch, and Shah (1996) estimated the
prevalence of activity limitation among the elderly for U.S. states and counties
using a logistic regression model fit to the National Health Interview Survey
(NHIS) and Area Resource File (ARF). Because the NHIS is based on a com-
plex survey design, they used SUDAAN to fit a logistic regression model to
activity limitation indicators on the NHIS, supplemented with county-level
variables from ARF. The model-based predicted probabilities were then
extrapolated to calculate estimates of activity limitation for small areas.
Then three binary logistic regression models could be used to fit a separate
model to each of three comparisons. Recognizing the natural ordering of obe-
sity categories, however, we could estimate the ‘‘average’’ effect of explana-
tory variables by considering the three binary models simultaneously, based
on the proportional odds assumption. What is assumed here is that the regres-
sion lines for the different outcome levels are parallel to each other and that
they are allowed to have different intercepts (this assumption needs to be
tested using the chi-square statistic; the test result is not shown in the table).
The following represents the model for j = 1, 2, . . . , c − 1 (c is the number
of categories in the dependent variable):
log[Pr(category ≤ j) / Pr(category ≥ j + 1)] = α_j + Σ_{i=1}^{p} β_i x_i     (6.3)
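Under this model, the cut-point intercepts α_j and a common slope term generate all of the category probabilities; a small sketch (Python, with hypothetical parameter values) shows how:

```python
import math

def cumulative_probs(alphas, beta_x):
    """Category probabilities under the proportional-odds model:
    log[Pr(cat <= j) / Pr(cat >= j + 1)] = alpha_j + beta_x, with a
    common slope term beta_x = sum(beta_i * x_i) and increasing
    cut-point intercepts alpha_1 < ... < alpha_{c-1}."""
    cum = [1 / (1 + math.exp(-(a + beta_x))) for a in alphas]
    cum.append(1.0)                       # Pr(cat <= c) = 1
    # Difference the cumulative probabilities to get category probabilities.
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical cut points for c = 4 obesity levels and one covariate effect:
p = cumulative_probs(alphas=[-1.0, 0.5, 2.0], beta_x=0.3)
```

Changing beta_x shifts all of the cumulative logits by the same amount, which is exactly the parallel-lines (proportional-odds) assumption described in the text.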
TABLE 6.8
Ordered Logistic Regression Analysis of Obesity Levels on Education,
Age, and Ethnicity Among U.S. Adults, NHANES III, Phase II
(n = 9,920): An Analysis Using SUDAAN
TABLE 6.9
Multinomial Logistic Regression Analysis of Obesity on Gender and
Smoking Status Among U.S. Adults, NHANES III, Phase II (n = 9,920):
An Analysis Using SUDAAN
proc multilog design=wr;
nest stra psu;
weight wgt;
reflevel csmok=2 sex=2;
subgroup bmi2 csmok sex;
levels 4 3 2;
model bmi3=age sex csmok;
setenv decwidth=5;
run;
available to the public and point to the need to increase the number of PSUs in
the design of large health surveys. These tests are limited to point estimation,
and therefore their conclusions may not apply to all circumstances. More
detailed discussion of these and related issues is provided by Pfeffermann
(1993, 1996).
The fact that the design-based analysis provides protection against possi-
ble misspecification of the model suggests that the analysis illustrated using
SUDAAN, Stata, and other software for complex survey analysis is appro-
priate for NHANES data. Even in the design-based analysis, a regression
model is used to specify the parameters of interest, but inference takes the
sample design into account. The design-based analysis in this case may be
called a model-assisted approach (Sarndal, Swensson, & Wretman, 1992).
The design-based theory relies on large sample sizes to make inferences
about the parameters. The model-based analysis may be a better option for
a small sample. When probability sampling is not used in data collection,
there is no basis for applying the design-based inference. The model-based
approach would make more sense where substantive theory and previous
empirical investigations support the proposed model.
The idea of model-based analysis is less obvious in a contingency table
analysis than in a regression analysis. The rationale for a design-based
analysis that takes the sampling scheme into account has already been discussed.
As in the regression analysis, it is wise to pay attention to the differences
between the weighted proportions and the unweighted proportions. If
there is a substantial difference, one should explore why they differ. In
Table 6.6, the unweighted and weighted proportions are similar, but the
weighted odds ratios for vitamin use and gender are slightly lower than
the unweighted odds ratios for high school graduates and for those with some
college education; for those with less than a high school education, the
weighted and unweighted odds ratios are about the same. The small
difference for the two higher levels of education may be due to race or
some other factor. If the difference between the unweighted and weighted
odds ratios is much larger and it is due to race, one should examine the
association separately for different racial groups. The consideration of
additional factors in the contingency table analysis can be done using a
logistic regression model.
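As a sketch of this weighted/unweighted comparison, the hypothetical unit-level data below (not the Table 6.6 vitamin-use data) show how unequal sampling weights can move a 2 × 2 odds ratio away from its unweighted value:

```python
def odds_ratio(table):
    """Odds ratio for a 2x2 table [[a, b], [c, d]]: (a*d)/(b*c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def weighted_table(cells, weights):
    """Accumulate a weighted 2x2 table from unit-level (row, col) cell
    memberships and their sampling weights."""
    t = [[0.0, 0.0], [0.0, 0.0]]
    for (i, j), w in zip(cells, weights):
        t[i][j] += w
    return t

# Hypothetical records: each unit's cell in the 2x2 table.
cells = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1)]
unweighted = weighted_table(cells, [1.0] * len(cells))
weighted = weighted_table(cells, [2.0, 1.0, 1.0, 0.5, 1.5, 2.0])
# unweighted OR = 4.0, weighted OR = 8.75: a gap this large would be a
# signal to look for a lurking factor such as race.
```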
The use of a model, and the associated issues, in a logistic regression are
exactly the same as in a linear regression. A careful examination of the
weighted and unweighted analyses provides useful information. In Table 6.7,
the weighted and unweighted estimates of the coefficients are similar; the
weighting appears to affect the intercept more than the coefficients. The analy-
sis shown in Table 6.7 is a simple demonstration of analyzing data using
logistic regression, and no careful consideration is given to choosing an
appropriate model.
7. CONCLUDING REMARKS
NOTES
that with stratified sampling, it is not sufficient to drop FPC factors from
standard design-based variance formulas to obtain appropriate variance
formulas for model-based inference. With cluster sampling, standard
design-based variance formulas can dramatically underestimate model-
based variability, even with a small sampling fraction of the final units.
They conclude that design-based inference is an efficient and reasonably
model-free approach for inference about finite population parameters, but
they suggest simple modifications of design-based variance estimators for
making inferences, under a few model assumptions, about superpopulation
parameters, which frequently are the ones of primary scientific interest.
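The cluster-induced variance inflation this note describes can be illustrated with Kish's approximate design effect, deff = 1 + (m − 1)ρ, a standard formula that is assumed here rather than taken from the note itself; all numbers below are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish's approximate design effect for single-stage cluster sampling
    with equal-sized clusters: deff = 1 + (m - 1) * rho, where m is the
    cluster size and rho is the intraclass correlation."""
    return 1.0 + (cluster_size - 1) * icc

# Even a modest intraclass correlation inflates the variance of an
# estimated proportion well beyond the SRS formula p(1 - p)/n.
srs_var = 0.5 * 0.5 / 1000                         # SRS: p = 0.5, n = 1000
deff = design_effect(cluster_size=20, icc=0.05)    # deff is about 1.95
clustered_var = srs_var * deff
```

An analysis that applied the SRS formula to such clustered data would understate the true variance by nearly half, which is the kind of underestimation the note warns about.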
REFERENCES
Aldrich, J. H., & Nelson, F. D. (1984). Linear probability, logit, and probit models (Quantitative
Applications in the Social Sciences, 07–045). Beverly Hills, CA: Sage.
Alexander, C. H. (1987). A model-based justification for survey weights. Proceedings of the
Section on Survey Research Methods (American Statistical Association), 183–188.
Bean, J. A. (1975). Distribution and properties of variance estimation for complex multistage
probability samples (Vital and Health Statistics, Series 2[65]). Washington, DC: National
Center for Health Statistics.
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex
surveys. International Statistical Review, 51, 279–292.
Brewer, K. R. W. (1995). Combining design-based and model-based inference. In B. G. Cox,
D. A. Binder, B. N. Chinnappa, A. Christianson, M. J. Colledge, & P. S. Kott (Eds.),
Business survey methods (pp. 589–606). New York: John Wiley.
Brewer, K. R. W. (1999). Design-based or prediction-based inference? Stratified random vs.
stratified balanced sampling. International Statistical Review, 67, 35–47.
Brewer, K. R. W., & Mellor, R. W. (1973). The effect of sample structure on analytical
surveys. Australian Journal of Statistics, 15, 145–152.
Brick, J. M., & Kalton, G. (1996). Handling missing data in survey research. Statistical Methods
in Medical Research, 5, 215–238.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data
analysis methods. Newbury Park, CA: Sage.
Chambless, L. E., & Boyle, K. E. (1985). Maximum likelihood methods for complex sample
data: Logistic regression and discrete proportional hazards models. Communications in
Statistics—Theory and Methods, 14, 1377–1392.
Chao, M. T., & Lo, S. H. (1985). A bootstrap method for finite populations. Sankhya, 47(A),
399–405.
Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: John Wiley.
Cohen, S. B. (1997). An evaluation of alternative PC-based software packages developed for
the analysis of complex survey data. The American Statistician, 51, 285–292.
Davis, J. A., & Smith, T. W. (1985). General Social Survey, 1972–1985: Cumulative codebook
(NORC edition). Chicago: National Opinion Research Center, University of Chicago and
the Roper Center, University of Connecticut.
DeMaris, A. (1992). Logit modeling (Quantitative Applications in the Social Sciences,
07–086). Thousand Oaks, CA: Sage.
Deming, W. E. (1960). Sample design in business research. New York: John Wiley.
DuMouchel, W. H., & Duncan, G. J. (1983). Using sample survey weights in multiple
regression analyses of stratified samples. Journal of the American Statistical Association,
78, 535–543.
Durbin, J. (1959). A note on the application of Quenouille’s method of bias reduction to the
estimation of ratios. Biometrika, 46, 477–480.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia:
Society for Industrial and Applied Mathematics.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman &
Hall.
Eliason, S. R. (1993). Maximum likelihood estimation: Logic and practice (Quantitative
Applications in the Social Sciences, 07–096). Beverly Hills, CA: Sage.
Eltinge, J. L., Parsons, V. L., & Jang, D. S. (1997). Differences between complex-design-based
and IID-based analyses of survey data: Examples from Phase I of NHANES III. Stats,
19, 3–9.
Fay, R. E. (1985). A jackknife chi-square test for complex samples. Journal of the American
Statistical Association, 80, 148–157.
Flyer, P., & Mohadjer, L. (1988). The WesVar procedure. Rockville, MD: Westat.
Forthofer, R. N., & Lehnen, R. G. (1981). Public program analysis: A categorical data
approach. Belmont, CA: Lifetime Learning Publications.
Frankel, M. R. (1971). Inference from survey samples. Ann Arbor: Institute of Social Research,
University of Michigan.
Fuller, W. A. (1975). Regression analysis for sample surveys. Sankhya, 37(C), 117–132.
Fuller, W. A. (1984). Least squares and related analyses for complex survey designs. Survey
Methodology, 10, 97–118.
Goldstein, H., & Silver, R. (1989). Multilevel and multivariate models in survey analysis. In
C. J. Skinner, D. Holt, & T. M. F. Smith (Eds.), Analysis of complex survey data
(pp. 221–235). New York: John Wiley.
Goodman, L. A. (1972). A general model for the analysis of surveys. American Journal of
Sociology, 77, 1035–1086.
Graubard, B. I., & Korn, E. L. (1996). Modelling the sampling design in the analysis of health
surveys. Statistical Methods in Medical Research, 5, 263–281.
Graubard, B. I., & Korn, E. L. (2002). Inference for superpopulation parameters using sample
surveys. Statistical Science, 17, 73–96.
Grizzle, J. E., Starmer, C. F., & Koch, G. G. (1969). Analysis of categorical data by linear
models. Biometrics, 25, 489–504.
Gurney, M., & Jewett, R. S. (1975). Constructing orthogonal replications for variance estima-
tion. Journal of the American Statistical Association, 70, 819–821.
Hansen, M. H., Madow, W. G., & Tepping, B. J. (1983). An evaluation of model-dependent
and probability-sampling inferences in sample surveys. Journal of the American Statistical
Association, 78, 776–807.
Heitjan, D. F. (1997). Annotation: What can be done about missing data? Approaches to impu-
tation. American Journal of Public Health, 87(4), 548–550.
Hinkins, S., Oh, H. L., & Scheuren, F. (1994). Inverse sampling design algorithms. Proceed-
ings of the Section on Survey Research Methods (American Statistical Association),
626–631.
Holt, D., & Smith, T. M. F. (1979). Poststratification. Journal of the Royal Statistical Society,
142(A), 33–46.
Holt, D., Smith, T. M. F., & Winter, P. D. (1980). Regression analysis of data from complex
surveys. Journal of the Royal Statistical Society, 143(A), 474–487.
Horton, N. J., & Lipsitz, S. R. (2001). Multiple imputation in practice: Comparison of software
packages for regression models with missing variables. The American Statistician, 55(3),
244–254.
Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: John Wiley.
Judkins, D. R. (1990). Fay's method of variance estimation. Journal of Official Statistics, 6(3), 223–239.
Kalton, G. (1983). Introduction to survey sampling (Quantitative Applications in the Social
Sciences, 07–035). Beverly Hills, CA: Sage.
Kalton, G., & Kasprzyk, D. (1986). The treatment of missing survey data. Survey Metho-
dology, 12(1), 1–16.
Kendall, P. A., & Lazarsfeld, P. F. (1950). Problems of survey analysis. In R. K. Merton &
P. F. Lazarsfeld (Eds.), Continuities in social research: Studies in the scope and method
of ‘‘The American soldier.’’ New York: Free Press.
Kiecolt, K. J., & Nathan, L. E. (1985). Secondary analysis of survey data (Quantitative
Applications in the Social Sciences, 07–053). Beverly Hills, CA: Sage.
Kish, L. (1949). A procedure for objective respondent selection within the household. Journal
of the American Statistical Association, 44, 380–387.
Kish, L. (1965). Survey sampling. New York: John Wiley.
Kish, L., & Frankel, M. R. (1974). Inferences from complex samples. Journal of the Royal
Statistical Society, 36(B), 1–37.
Knoke, D., & Burke, P. J. (1980). Log-linear models (Quantitative Applications in the Social
Sciences, 07–020). Beverly Hills, CA: Sage.
Koch, G. G., Freeman, D. H., & Freeman, J. L. (1975). Strategies in the multivariate analysis
of data from complex surveys. International Statistical Review, 43, 59–78.
Konijn, H. (1962). Regression analysis in sample surveys. Journal of the American Statistical
Association, 57, 590–605.
Korn, E. L., & Graubard, B. I. (1995a). Analysis of large health surveys: Accounting for the
sample design. Journal of the Royal Statistical Society, 158(A), 263–295.
Korn, E. L., & Graubard, B. I. (1995b). Examples of differing weighted and unweighted
estimates from a sample survey. The American Statistician, 49, 291–295.
Korn, E. L., & Graubard, B. I. (1998). Scatterplots with survey data. The American Statistician,
52, 58–69.
Korn, E. L., & Graubard, B. I. (1999). Analysis of health surveys. New York: John Wiley.
Korn, E. L., & Graubard, B. I. (2003). Estimating variance components by using survey data.
Journal of the Royal Statistical Society, 65(B, pt. 1), 175–190.
Kott, P. S. (1991). A model-based look at linear regression with survey data. The American
Statistician, 45, 107–112.
Kovar, J. G., Rao, J. N. K., & Wu, C. F. J. (1988). Bootstrap and other methods to measure
errors in survey estimates. Canadian Journal of Statistics, 16(Suppl.), 25–45.
Krewski, D., & Rao, J. N. K. (1981). Inference from stratified samples: Properties of the lineariza-
tion, jackknife and balanced repeated replication methods. Annals of Statistics, 9, 1010–1019.
LaVange, L. M., Lafata, J. E., Koch, G. G., & Shah, B. V. (1996). Innovative strategies using
SUDAAN for analysis of health surveys with complex samples. Statistical Methods in
Medical Research, 5, 311–329.
Lee, E. S., Forthofer, R. N., Holzer, C. E., & Taube, C. A. (1986). Complex survey data analy-
sis: Estimation of standard errors using pseudo-strata. Journal of Economic and Social
Measurement, 14, 135–144.
Lee, E. S., Forthofer, R. N., & Lorimor, R. J. (1986). Analysis of complex sample survey data:
Problems and strategies. Sociological Methods and Research, 15, 69–100.
Lee, K. H. (1972). The use of partially balanced designs for the half-sample replication method
of variance estimation. Journal of the American Statistical Association, 67, 324–334.
Lehtonen, R., & Pahkinen, E. J. (1995). Practical methods for design and analysis of complex
surveys. New York: John Wiley.
Lemeshow, S., & Levy, P. S. (1979). Estimating the variance of ratio estimates in complex
surveys with two primary sampling units per stratum. Journal of Statistical Computing
and Simulation, 8, 191–205.
Levy, P. S., & Lemeshow, S. (1999). Sampling of populations: Methods and applications.
New York: John Wiley.
Levy, P. S., & Stolte, K. (2000). Statistical methods in public health and epidemiology: A look
at the recent past and projections for the future. Statistical Methods in Medical Research,
9, 41–55.
Liao, T. F. (1994). Interpreting probability models: Logit, probit, and other generalized linear
models (Quantitative Applications in the Social Sciences, 07–101). Beverly Hills, CA: Sage.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York:
John Wiley.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.).
New York: John Wiley.
Lohr, S. L. (1999). Sampling: Design and analysis. New York: Duxbury.
McCarthy, P. J. (1966). Replication: An approach to the analysis of data from complex surveys
(Vital and Health Statistics, Series 2[14]). Washington, DC: National Center for Health Statistics.
Murthy, M. N., & Sethi, V. K. (1965). Self-weighting design at tabulation stage. Sankhya,
27(B), 201–210.
Nathan, G., & Holt, D. (1980). The effects of survey design on regression analysis. Journal of
the Royal Statistical Society, 42(B), 377–386.
National Center for Health Statistics (NCHS). (1994). Plan and operation of the Third National
Health and Nutrition Examination Survey, 1988–94 (Vital and Health Statistics, Series
1[32]). Washington, DC: Government Printing Office.
Nielsen, S. F. (2003). Proper and improper multiple imputation. International Statistical
Review, 71, 593–627.
Nordberg, L. (1989). Generalized linear modeling of sample survey data. Journal of Official
Statistics, 5, 223–239.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. Interna-
tional Statistical Review, 61, 317–337.
Pfeffermann, D. (1996). The use of sampling weights for survey data analysis. Statistical Meth-
ods in Medical Research, 5, 239–261.
Pfeffermann, D., & Holmes, D. J. (1985). Robustness considerations in the choice of method of
inference for regression analysis of survey data. Journal of the Royal Statistical Society,
148(A), 268–278.
Pfeffermann, D., & Nathan, G. (1981). Regression analysis of data from a cluster sample.
Journal of the American Statistical Association, 76, 681–689.
Plackett, R. L., & Burman, J. P. (1946). The design of optimum multi-factorial experiments.
Biometrika, 33, 305–325.
Quenouille, M. H. (1949). Approximate tests of correlation in time series. Journal of the Royal
Statistical Society, 11(B), 68–84.
Rao, J. N. K., Kovar, J. G., & Mantel, H. J. (1990). On estimating distribution functions and
quantiles from survey data using auxiliary information. Biometrika, 77, 365–375.
Rao, J. N. K., & Scott, A. J. (1984). On chi-square tests for multiway contingency tables with
cell proportions estimated from survey data. Annals of Statistics, 12, 46–60.
Rao, J. N. K., & Wu, C. F. J. (1988). Resampling inference with complex survey data. Journal
of the American Statistical Association, 83, 231–241.
Rao, J. N. K., Wu, C. F. J., & Yue, K. (1992). Some recent work on resampling methods for
complex surveys. Survey Methodology, 18(3), 209–217.
Roberts, G., Rao, J. N. K., & Kumar, S. (1987). Logistic regression analysis of sample survey
data. Biometrika, 74, 1–12.
Royall, R. M. (1970). On finite population sampling theory under certain linear regression
models. Biometrika, 57, 377–387.
Royall, R. M. (1973). The prediction approach to finite population sampling theory: Applica-
tion to the hospital discharge survey (Vital and Health Statistics, Series 2[55]). Washington,
DC: National Center for Health Statistics.
Rust, K. F., & Rao, J. N. K. (1996). Variance estimation for complex surveys using replication
techniques. Statistical Methods in Medical Research, 5, 283–310.
Sarndal, C. E. (1978). Design-based and model-based inference in survey sampling. Scandina-
vian Journal of Statistics, 5, 25–52.
Sarndal, C. E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling.
New York: Springer-Verlag.
Shah, B. V., Holt, M. H., & Folsom, R. E. (1977). Inference about regression models from
sample survey data. Bulletin of the International Statistical Institute, 47, 43–57.
Sitter, R. R. (1992). Resampling procedure for complex survey data. Journal of the American
Statistical Association, 87, 755–765.
Skinner, C. J., Holt, D., & Smith, T. M. F. (Eds.). (1989). Analysis of complex survey data.
New York: John Wiley.
Smith, S. S. (1996). The third National Health and Nutrition Examination Survey: Measuring
and monitoring the health of the nation. Stats, 16, 9–11.
Smith, T. M. F. (1976). The foundations of survey sampling: A review. Journal of the Royal
Statistical Society, 139(A), 183–204.
Smith, T. M. F. (1983). On the validity of inferences on non-random samples. Journal of the
Royal Statistical Society, 146(A), 394–403.
Sribney, W. M. (1998). Two-way contingency tables for survey or clustered data. Stata Technical
Bulletin, 45, 33–49.
Stanek, E. J., & Lemeshow, S. (1977). The behavior of balanced half-sample variance esti-
mates for linear and combined ratio estimates when strata are paired to form pseudo strata.
American Statistical Association Proceedings: Social Statistics Section, 837–842.
Stephan, F. F. (1948). History of the uses of modern sampling procedures. Journal of the
American Statistical Association, 43, 12–39.
Sudman, S. (1976). Applied sampling. New York: Academic Press.
Sugden, R. A., & Smith, T. M. F. (1984). Ignorable and informative designs in sampling infer-
ence. Biometrika, 71, 495–506.
Sundberg, R. (1994). Precision estimation in sample survey inference: A criterion for choice
between various estimators. Biometrika, 81, 157–172.
Swafford, M. (1980). Three parametric techniques for contingency table analysis: Non-technical
commentary. American Sociological Review, 45, 664–690.
Tepping, B. J. (1968). Variance estimation in complex surveys. American Statistical Association
Proceedings, Social Statistics Section, 11–18.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathematical
Statistics, 29, 614.
Tukey, J. W. (1986). Sunset salvo. The American Statistician, 40, 72–76.
U.S. Bureau of the Census. (1986, April). Estimates of the population of the United States,
by age, sex, and race, 1980 to 1985 (Current Population Reports, Series P-25, No. 985).
Washington, DC: Author.
Wolter, K. M. (1985). Introduction to variance estimation. New York: Springer-Verlag.
Woodruff, R. S. (1971). A simple method for approximating the variance of a complicated
estimate. Journal of the American Statistical Association, 66, 411–414.
Zhang, P. (2003). Multiple imputation: Theory and method. International Statistical Review,
71, 581–592.
INDEX