
Series/Number 07–071

ANALYZING COMPLEX
SURVEY DATA
Second Edition

Eun Sul Lee


Division of Biostatistics, School of Public Health,
University of Texas Health Science Center—Houston

Ronald N. Forthofer

SAGE PUBLICATIONS
International Educational and Professional Publisher
Thousand Oaks London New Delhi
Copyright © 2006 by Sage Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any
means, electronic or mechanical, including photocopying, recording, or by any information
storage and retrieval system, without permission in writing from the publisher.

For information:

Sage Publications, Inc.


2455 Teller Road
Thousand Oaks, California 91320
E-mail: order@sagepub.com

Sage Publications Ltd.


1 Oliver’s Yard
55 City Road
London EC1Y 1SP
United Kingdom

Sage Publications India Pvt. Ltd.


B-42, Panchsheel Enclave
Post Box 4109
New Delhi 110 017 India

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data

Lee, Eun Sul.


Analyzing complex survey data / Eun Sul Lee, Ronald N. Forthofer.—2nd ed.
p. cm.—(Quantitative applications in the social sciences ; vol. 71)
Includes bibliographical references and index.
ISBN 0-7619-3038-8 (pbk. : alk. paper)
1. Mathematical statistics. 2. Social surveys—Statistical methods. I. Forthofer, Ron N.,
1944– . II. Title. III. Series: Sage university papers series. Quantitative applications in the
social sciences ; no. 07–71.
QA276.L3394 2006
001.4 22—dc22 2005009612

This book is printed on acid-free paper.

05 06 07 08 09 10 9 8 7 6 5 4 3 2 1

Acquisitions Editor: Lisa Cuevas Shaw


Editorial Assistant: Karen Gia Wong
Production Editor: Melanie Birdsall
Copy Editor: A. J. Sobczak
Typesetter: C&M Digitals (P) Ltd
CONTENTS

Series Editor’s Introduction v


Acknowledgments vii

1. Introduction 1

2. Sample Design and Survey Data 3


Types of Sampling 4
The Nature of Survey Data 7
A Different View of Survey Data* 9

3. Complexity of Analyzing Survey Data 11


Adjusting for Differential Representation: The Weight 11
Developing the Weight by Poststratification 14
Adjusting the Weight in a Follow-Up Survey 17
Assessing the Loss or Gain in Precision: The Design Effect 18
The Use of Sample Weights for Survey Data Analysis* 20

4. Strategies for Variance Estimation 22


Replicated Sampling: A General Approach 23
Balanced Repeated Replication 26
Jackknife Repeated Replication 29
The Bootstrap Method 35
The Taylor Series Method (Linearization) 36

5. Preparing for Survey Data Analysis 39


Data Requirements for Survey Analysis 39
Importance of Preliminary Analysis 41
Choices of Method for Variance Estimation 43
Available Computing Resources 44
Creating Replicate Weights 47
Searching for Appropriate Models for Survey Data Analysis* 49

6. Conducting Survey Data Analysis 49


A Strategy for Conducting Preliminary Analysis 50
Conducting Descriptive Analysis 52
Conducting Linear Regression Analysis 57
Conducting Contingency Table Analysis 61
Conducting Logistic Regression Analysis 65
Other Logistic Regression Models 69
Design-Based and Model-Based Analyses* 74

7. Concluding Remarks 78

Notes 80

References 83

Index 88

About the Authors 91


SERIES EDITOR’S INTRODUCTION

When George Gallup correctly predicted Franklin D. Roosevelt as the
1936 presidential election winner, public opinion surveys entered the age
of scientific sampling, and the method used then was quota sampling, a
type of nonprobability sampling meant to be representative of the target population.
The same method, however, incorrectly predicted Thomas Dewey as the
1948 winner though Harry S. Truman actually won. The method failed
because quota sampling is nonprobabilistic and because Gallup’s quota
frames were based on the 1940 census, overlooking the urban migration
during World War II.
Today’s survey sampling has advanced much since the early days, now
relying on sophisticated probabilistic sampling designs. A key feature is
stratification: The target population is divided into subpopulations called strata, and
the sample sizes in the strata are controlled by the sampler and are often pro-
portional to the strata population sizes. Another feature is cluster and multi-
stage sampling: Groups are sampled as clusters in a hierarchy of clusters
selected at various stages until the final stage, when individual elements are
sampled within the final-stage clusters. The General Social Survey, for exam-
ple, uses a stratified multistage cluster sampling design. (Kalton [1983] gives
a nice introduction to survey sampling.)
When the survey design is of this complex nature, statistical analysis of the
data is no longer a simple matter of running a regression (or any other model-
ing) analysis. Surveys today all come with sampling weights to assist with
correct statistical inference. Most texts on statistical analysis, by assuming
simple random sampling, do not include treatment of sampling weights, an
omission that may have important implications for making inferences. Dur-
ing the last two to three decades, statistical methods for data analysis have
also made huge strides. These must have been the reason my predecessor,
Michael Lewis-Beck, who saw through the early stages of the editorial work
in this volume, chose to have a second edition of Analyzing Complex Survey
Data.
Lee and Forthofer’s second edition of the book brings us up to date in
uniting survey sampling designs and survey data analysis. The authors
begin by reviewing common types of survey sample designs and demystify-
ing sampling weights by explaining what they are, and how they are devel-
oped and adjusted. They then carefully discuss the major issues of variance
estimation and of preliminary as well as multivariate analysis of complex
cross-sectional survey data when sampling weights are taken into account.
They focus on the design-based approach that directly engages sample designs
in the analysis (although they also discuss the model-based perspective, which
can augment a design-based approach in some analyses), and they illustrate
the approach with popular software examples. Students of survey analysis will
find the text of great use in their efforts in making sample-based statistical
inferences.

—Tim Futing Liao


Series Editor

ACKNOWLEDGMENTS

We sadly acknowledge that Dr. Ronald J. Lorimor, who had collaborated
on the writing of the first edition of this manuscript, died in 1999. His
insights into survey data analysis remain in this second edition. We are
grateful to Tom W. Smith for answering questions about the sample design
of the General Social Survey and to Barry L. Graubard, Lu Ann Aday, and
Michael S. Lewis-Beck for their thoughtful suggestions for the first edition.
Thanks also are due to anonymous reviewers for their helpful comments for
both editions and to Tim F. Liao for his thoughtful advice for the contents
of the second edition. Special thanks go to many students in our classes at
the University of Texas School of Public Health who participated in discus-
sions of many topics contained in this book.
ANALYZING COMPLEX
SURVEY DATA, SECOND EDITION
Eun Sul Lee
Division of Biostatistics, School of Public Health,
University of Texas Health Science Center—Houston

Ronald N. Forthofer

1. INTRODUCTION

Survey analysis often is conducted as if all sample observations were
independently selected with equal probabilities. This analysis is correct if
simple random sampling (SRS) is used in data collection; however, in prac-
tice the sample selection is more complex than SRS. Some sample observa-
tions may be selected with higher probabilities than others, and some are
included in the sample by virtue of their membership in a certain group
(e.g., household) rather than being selected independently. Can we simply
ignore these departures from SRS in the analysis of survey data? Is it appro-
priate to use the standard techniques in statistics books for survey data
analysis? Or are there special methods and computer programs available
for a more appropriate analysis of complex survey data? These questions
are addressed in the following chapters.
The typical social survey today reflects a combination of statistical theory
and knowledge about social phenomena, and its evolution has been shaped
by experience gained from the conduct of many different surveys during the
last 70 years. Social surveys were conducted to meet the need for information
to address social, political, and public health issues. Survey agencies were
established within and outside the government in response to this need for
information. In the early attempts to provide the required information, how-
ever, the survey groups were mostly concerned with the practical issues in the
fieldwork—such as sampling frame construction, staff training/supervision,
and cost reduction—and theoretical sampling issues received only secondary
emphasis (Stephan, 1948). As these practical matters were resolved, modern
sampling practice developed far beyond SRS. Complex sample designs
came to the fore, and with them, a number of analytic problems.
Because the early surveys generally needed only descriptive statistics,
there was little interest in analytic problems. More recently, demands for
analytic studies by social and policy scientists have increased, and a variety
of current issues are being examined, using available social survey data, by
researchers who were not involved with the data collection process. This
tradition is known as secondary analysis (Kendall & Lazarsfeld, 1950).
Often, the researcher fails to pay due attention to the development of com-
plex sample designs and assumes that these designs have little bearing on
the analytic procedures to be used.
The increased use of statistical techniques in secondary analysis and the
recent use of log-linear models, logistic regression, and other multivariate
techniques (Aldrich & Nelson, 1984; Goodman, 1972; Swafford, 1980)
have done little to bring design and analysis into closer alignment. These
techniques are predicated on the use of simple random sampling with re-
placement (SRSWR); however, this assumption is rarely met in social surveys
that employ stratification and clustering of observational units along with
unequal probabilities of selection. As a result, the analysis of social surveys
using the SRSWR assumption can lead to biased and misleading results.
Kiecolt and Nathan (1985), for example, acknowledged this problem in
their Sage book on secondary analysis, but they provide little guidance on
how to incorporate the sample weights and other design features into the
analysis. A recent review of literature in public health and epidemiology
shows that the use of design-based survey analysis methods is gradually
increasing but remains at a low level (Levy & Stolte, 2000).
Any survey that puts restrictions on the sampling beyond those of
SRSWR is complex in design and requires special analytic considerations.
This book reviews the analytic issues raised by the complex sample survey,
provides an introduction to analytic strategies, and presents illustrations
using some of the available software. Our discussion is centered on the use
of the sample weights to correct for differential representations and the
effect of sample designs on estimation of sampling variance with some dis-
cussion of weight development and adjustment procedures. Many other
important issues of dealing with nonsampling errors and handling missing
data are not fully addressed in this book.
The basic approach presented in this book is the traditional way of ana-
lyzing complex survey data. This approach is now known as design-based
(or randomization-based) analysis. A different approach to analyzing com-
plex survey data is the so-called model-based analysis. As in other areas of
statistics, the model-based statistical inference has gained more attention in
survey data analysis in recent years. The modeling approaches are intro-
duced in various steps of survey data analysis in defining the parameters,
defining estimators, and estimating variances; however, there are no gener-
ally accepted rules for model selection or validating a specified model.
Nevertheless, some understanding of the model-based approach is essential
for survey data analysts to augment the design-based approach. In some
cases, both approaches produce the same results; but different results occur
in other cases. The model-based approach may not be useful in descriptive
data analysis but can be useful in inferential analysis. We will introduce the
model-based perspective where appropriate and provide references for
further treatment of the topics. Proper conduct of model-based analysis
would require knowledge of general statistical models and perhaps some
consultation from survey statisticians. Sections of the book relevant to this
alternative approach and related topics are marked with asterisks (*).
Since the publication of the first edition of this book, the software situation
for the analysis of complex survey data has improved considerably. User-
friendly programs are now readily available, and many commonly used sta-
tistical methods are now incorporated in the packages, including logistic
regression and survival analysis. These programs will be introduced with
illustrations in this edition. These programs are perhaps more open to misuse
than other standard software. The topics and issues discussed in this book will
provide some guidelines for avoiding pitfalls in survey data analysis.
In our presentation, we assume some familiarity with such sampling
designs as simple random sampling, systematic sampling, stratified random
sampling, and simple two-stage cluster sampling. A good presentation of
these designs may be found in Kalton (1983) and Lohr (1999). We also
assume general understanding of standard statistical methods and one of
the standard statistical program packages, such as SAS or Stata.

2. SAMPLE DESIGN AND SURVEY DATA

Our consideration of survey data focuses on sample designs that satisfy
two basic requirements. First, we are concerned only with probability sam-
pling in which each element of a population has a known (nonzero) prob-
ability of being included in the sample. This is the basis for applying
statistical theory in the derivation of the properties of the survey estimators
for a given design. Second, if a sample is to be drawn from a population, it
is necessary to be able to construct a sampling frame that lists suitable sam-
pling units that encompass all elements of the population. If it is not feasible
or is impractical to list all population elements, some clusters of elements
can be used as sampling units. For example, it is impractical to construct
a list of all households in the United States, but we can select the sample in
several stages. In the first stage, counties are randomly sampled; in the
second stage, census tracts within the selected counties are sampled; in
the third stage, street blocks are sampled within the selected tracts. Then, in
the final stage of selection, a list of households is needed only for the selected
blocks. This multistage design satisfies the requirement that all population
elements have a known nonzero probability of being selected.

Types of Sampling
The simplest sample design is simple random sampling, which requires
that each element have an equal probability of being included in the sample
and that the list of all population elements be available. Selection of a
sample element can be carried out with or without replacement. Simple
random sampling with replacement (SRSWR) is of special interest because
it simplifies statistical inference by eliminating any relation (covariance)
between the selected elements through the replacement process. In this
scheme, however, an element can appear more than once in the sample. In
practice, simple random sampling is carried out without replacement
(SRSWOR), because there is no need to collect the information more than
once from an element. Additionally, SRSWOR gives a smaller sampling
variance than SRSWR. However, these two sampling methods are practi-
cally the same in a large survey in which a small fraction of population
elements is sampled. We will use the term SRS for SRSWOR throughout
this book unless otherwise specified.
The SRS design is modified further to accommodate other theoretical
and practical considerations. The common practical designs include sys-
tematic sampling, stratified random sampling, multistage cluster sampling,
PPS sampling (probability proportional to size), and other controlled selec-
tion procedures. These more practical designs deviate from SRS in two
important ways. First, the inclusion probabilities for the elements (also
the joint inclusion probabilities for sets of the elements) may be unequal.
Second, the sampling unit can be different from the population element of
interest. These departures complicate the usual methods of estimation and
variance calculation and, if proper methods of analysis are not used, can
lead to a bias in estimation and statistical tests. We will consider these
departures in detail, using several specific sampling designs, and examine
their implications for survey analysis.
Systematic sampling is commonly used as an alternative to SRS because
of its simplicity. It selects every k-th element after a random start (between
1 and k). Its procedural tasks are simple, and the process can easily be
checked, whereas it is difficult to verify SRS by examining the results. It is
often used in the final stage of multistage sampling when the fieldworker is
instructed to select a predetermined proportion of units from the listing of
dwellings in a street block. The systematic sampling procedure assigns each
element in a population the same probability of being selected. This ensures
that the sample mean will be an unbiased estimate of the population mean
when the number of elements in the population (N) is equal to k times the
number of elements in the sample (n). If N is not exactly nk, then the equal
probability is not guaranteed, although this problem can be ignored when
N is large. To guarantee equal probabilities, we can use the circular systematic sampling scheme.
In this scheme, the random starting point is selected between 1 and N (any
element can be the starting point), and every k-th element is selected assum-
ing that the frame is circular (the end of the list is connected to the beginning
of the list). Systematic sampling can give an unrealistic estimate, however,
when the elements in the frame are listed in a cyclical manner with respect to
survey variables and the selection interval coincides with the listing cycle.
For example, if one selects every 40th patient coming to a clinic and the
average daily patient load is about 40, then the resulting systematic sample
would contain only those who came to the clinic at a particular time of the
day. Such a sample may not be representative of the clinic patients.
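The two selection schemes just described can be sketched in Python; the frame and interval below are invented for illustration:

```python
import random

def systematic_sample(frame, k):
    """Linear systematic sampling: a random start between 1 and k
    (0-based here), then every k-th element to the end of the list."""
    start = random.randrange(k)
    return [frame[i] for i in range(start, len(frame), k)]

def circular_systematic_sample(frame, k, n):
    """Circular scheme: the start can be any element, and selection
    wraps around the end of the list, guaranteeing n selections."""
    N = len(frame)
    start = random.randrange(N)
    return [frame[(start + j * k) % N] for j in range(n)]

frame = list(range(1, 101))                  # hypothetical frame, N = 100
sample = systematic_sample(frame, k=10)      # n = 10 because N = nk exactly
wrapped = circular_systematic_sample(frame, k=10, n=10)
```

When N is not a multiple of k, the linear scheme yields samples of size ⌊N/k⌋ or ⌈N/k⌉ depending on the start, which is exactly the unequal-probability complication noted above; the circular scheme always returns n elements.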
Moreover, even when the listing is randomly ordered, unlike SRS, differ-
ent sets of elements may have unequal inclusion probabilities. For example,
the probability of including both the i-th and the (i + k)-th element is 1/k
in a systematic sample, whereas the probability of including both the i-th
and the (i + k + 1)-th is zero. This complicates the variance calculation.
Another way of viewing systematic sampling is that it is equivalent to
selecting one cluster from k systematically formed clusters of n elements
each. The sampling variance (between clusters) cannot be estimated from
the one selected cluster. Thus, variance estimation from a systematic sample
requires special strategies.
A modification to overcome these problems with systematic sampling
is the so-called repeated systematic sampling (Levy & Lemeshow, 1999,
pp. 101–110). Instead of taking a systematic sample in one pass through the
list, several smaller systematic samples are selected, going down the list
several times with a new starting point in each pass. This procedure not only
guards against possible periodicity in the frame but also allows variance
estimation directly from the data. The variance of an estimate from all sub-
samples can be estimated from the variability of the separate estimates from
each subsample. This idea of replicated sampling offers a strategy for esti-
mating variance for complex surveys, which will be discussed further in
Chapter 4.
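The replicated idea can be sketched as follows (a rough Python illustration with an invented frame, not an example from the text): draw several systematic subsamples, each with its own random start, and estimate the variance of the combined mean from the spread of the subsample means.

```python
import random
from statistics import mean, variance

def repeated_systematic_means(frame, t, k):
    """Draw t systematic subsamples, each with a distinct random start,
    and return the t subsample means."""
    starts = random.sample(range(k), t)       # a new starting point per pass
    return [mean(frame[s::k]) for s in starts]

random.seed(1)
frame = [random.gauss(50, 10) for _ in range(1000)]   # hypothetical frame
sub_means = repeated_systematic_means(frame, t=10, k=100)

overall = mean(sub_means)                     # estimate from all subsamples
v_hat = variance(sub_means) / len(sub_means)  # variance estimated from the
                                              # variability of subsample means
```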
Stratified random sampling classifies the population elements into strata
and samples separately from each stratum. It is used for several reasons:
(a) The sampling variance can be reduced if strata are internally homoge-
neous, (b) separate estimates can be obtained for strata, (c) administration
of fieldwork can be organized using strata, and (d) different sampling needs
can be accommodated in separate strata. Allocation of the sample across
the strata is proportionate when the sampling fraction is uniform across the
strata or disproportionate when, for instance, a higher sampling fraction
is applied to a smaller stratum to select a sufficient number of subjects for
comparative studies. In general, the estimation process for a stratified
random sample is more complicated than in SRS. It is generally described
as a two-step process. The first step is the calculation of the statistics—for
example, the mean and its variance—separately within each stratum. These
estimates are then combined based on weights reflecting the proportion of
the population in each stratum. As will be discussed later, it also can be
described as a one-step process using weighted statistics. The estimation
simplifies in the case of proportionate stratified sampling, but the strata
must be taken into account in the variance estimation.
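The two-step process, and its equivalence to a one-step weighted estimate, can be sketched with two hypothetical strata (all numbers below are invented for illustration):

```python
from statistics import mean

# Hypothetical stratified sample: stratum population sizes N_h and observations
strata = {
    "A": {"N": 800, "y": [10, 12, 11, 13]},
    "B": {"N": 200, "y": [30, 28]},
}
N = sum(s["N"] for s in strata.values())

# Step 1: estimate within each stratum; Step 2: combine with W_h = N_h / N
two_step = sum((s["N"] / N) * mean(s["y"]) for s in strata.values())

# One-step equivalent: weighted statistics with element weights w = N_h / n_h
num = den = 0.0
for s in strata.values():
    w = s["N"] / len(s["y"])
    for y in s["y"]:
        num += w * y
        den += w
one_step = num / den

assert abs(two_step - one_step) < 1e-9    # the two descriptions agree
```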
The formulation of the strata requires that information on the stratifica-
tion variable(s) be available in the sampling frame. When such information
is not available, stratification cannot be incorporated in the design. But stra-
tification can be done after data are collected to improve the precision of the
estimates. The so-called poststratification is used to make the sample more
representative of the population by adjusting the demographic composi-
tions of the sample to the known population compositions. Typically, such
demographic variables as age, sex, race, and education are used in poststra-
tification in order to take advantage of the population census data. This
adjustment requires the use of weights and different strategies for variance
estimation because the stratum sample size is a random variable in the
poststratified design (determined after the data are collected).
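A minimal sketch of the poststratification adjustment (the cells and counts below are hypothetical): each respondent's base weight is multiplied by the ratio of the known population count to the weighted sample count in his or her cell.

```python
# Hypothetical poststratification cells: weighted sample totals vs. census counts
sample_weight_sums = {"<40": 350.0, "40+": 650.0}   # sum of base weights per cell
population_counts = {"<40": 4200.0, "40+": 5800.0}  # known from the census

# Adjustment factor per cell: population count / weighted sample count
post_factor = {g: population_counts[g] / sample_weight_sums[g]
               for g in sample_weight_sums}

# Each respondent's base weight is scaled by the factor for his or her cell
respondents = [("<40", 3.5), ("40+", 6.5), ("<40", 3.5)]  # (cell, base weight)
adjusted = [(g, w * post_factor[g]) for g, w in respondents]
```

After adjustment, the weighted cell totals reproduce the population counts; the stratum sample size is fixed only after the data are collected, which is why the variance estimation strategy must change.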
Cluster sampling is often a practical approach to surveys because
it samples by groups (clusters) of elements rather than by individual ele-
ments. It simplifies the task of constructing sampling frames, and it reduces
the survey costs. Often, a hierarchy of geographical clusters is used, as
described earlier. In multistage cluster sampling, the sampling units are
groups of elements except for the last stage of sampling. When the numbers
of elements in the clusters are equal, the estimation process is equivalent to
SRS. However, simple random sampling of unequal-sized clusters leads to
the elements in the smaller clusters being more likely to be in the sample
than those in the larger clusters. Additionally, the clusters are often strati-
fied to accomplish certain survey objectives and field procedures, for
instance, the oversampling of predominantly minority population clusters.
The use of disproportionate stratification and unequal-sized clusters com-
plicates the estimation process.
One method to draw a self-weighting sample of elements in one-stage
cluster sampling of unequal-sized clusters is to sample clusters with prob-
ability proportional to the size of clusters (PPS sampling). However, this
requires that the true size of clusters be known. Because the true sizes
usually are unknown at the time of the survey, the selection probability is
instead made proportional to the estimated size (PPES sampling). For
example, the number of beds can be used as a measure of size in a survey of
hospital discharges with hospitals as the clusters. One important consequence
of PPES sampling is that the expected sample size will vary from one primary
sampling unit (PSU) to another. In other words, the sample size is not fixed
but varies from sample to sample. Therefore, the sample size, the deno-
minator in the calculation of a sample mean, is a random variable, and, hence,
the sample mean becomes a ratio of two random variables. This type of
variable, a ratio variable, requires special strategies for variance estimation.
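One common way to carry out PPS (or PPES) selection is the cumulative-size systematic method; here is a sketch in Python with hypothetical bed counts as the measure of size (this specific procedure is an illustration, not one prescribed in the text):

```python
import random
from itertools import accumulate

def pps_systematic(sizes, n):
    """Systematic PPS selection on the cumulative-size scale.
    Note: a cluster whose size exceeds the interval k can be hit twice."""
    total = sum(sizes)
    k = total / n                         # selection interval on the size scale
    start = random.uniform(0, k)
    cum = list(accumulate(sizes))
    chosen = []
    for i in range(n):
        p = start + i * k                 # selection points: start, start+k, ...
        for idx, c in enumerate(cum):
            if p < c:                     # first cluster whose cumulative size
                chosen.append(idx)        # exceeds the selection point
                break
    return chosen

beds = [50, 120, 300, 80, 450, 200]       # hypothetical hospital bed counts
random.seed(0)
selected = pps_systematic(beds, n=2)
```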

The Nature of Survey Data


If we are to infer from sample to population, the sample selection process
is an integral part of the inference process, and the survey data must contain
information on important dimensions of the selection process. Considering
the departures from SRS in most social surveys, we need to view the survey
data not only as records of measurements, but also as having different
representation and structural arrangements.
Sample weights are used to reflect the differing probabilities of selection
of the sample elements. The development of sample weights requires keep-
ing track of selection probabilities separately in each stratum and at each
stage of sampling. In addition, it can involve correcting for differential
response rates within classes of the sample and adjusting the sample
distribution by demographic variables to known population distributions
(poststratification adjustment). Moreover, different sample weights may be
needed for different units of analysis. For instance, in a community survey
it may be necessary to develop person weights for an analysis of individual
data and household weights for an analysis of household data.
We may feel secure in the exclusion of the weights when one of the
following self-weighting designs is used. True PPS sampling in a one-stage
cluster sampling will produce a self-weighting sample of elements, as in the
SRS design. The self-weighting can also be accomplished in a two-stage
design when true PPS sampling is used in the first stage and a fixed number
of elements is selected within each selected PSU. The same result will follow
if simple random sampling is used in the first stage and a fixed proportion
of the elements is selected in the second stage (see Kalton, 1983, chaps. 5
and 6). In practice, however, the self-weighting feature is destroyed by nonre-
sponse and possible errors in the sampling frame(s). This unintended self-
selection process can introduce bias, but it is seldom possible to assess the
bias from an examination of the sample data. Two methods employed in an
attempt to reduce the bias are poststratification and nonresponse adjustments.
Poststratification involves assigning weights to bring the sample proportion
in demographic subgroups into agreement with the population proportion
in the subgroups. Nonresponse adjustment inflates the weights for those who
participate in the survey to account for the nonrespondents with similar char-
acteristics. Because of the nonresponse and poststratification adjustments by
weighting, the use of weights is almost unavoidable even when a self-weighting
design is used.
The sample design affects the estimation of standard errors and, hence,
must also be incorporated into the analysis. A close examination of the
familiar formulas for standard errors found in statistics textbooks and incor-
porated into most computer program packages shows that they are based
on the SRSWR design. These formulas are relatively simple because the
covariance between elements is zero, as a result of the assumed independent
selection of elements. It is not immediately evident how the formulas should
be modified to adjust for other complex sampling designs.
To better understand the need for adjustment to the variance formulas, let
us examine the variance formula for several sample designs. We first con-
sider variance for a sample mean from the SRSWOR design. The familiar
variance formula for a sample mean, ȳ (selecting a sample of n elements
from a population of N elements by SRSWR, where the population mean
is Ȳ), in elementary statistics textbooks is σ²/n, where σ² = Σ(Yᵢ − Ȳ)²/N.
This formula needs to be modified for the SRSWOR design because the
selection of an element is no longer independent of the selection of another
element. Because of the condition of not allowing duplicate selection, there
is a negative covariance [−σ²/(N − 1)] between the i-th and j-th sample ele-
ments. Incorporating n(n − 1) times the covariance, the variance of the sam-
ple mean for SRSWOR is (σ²/n) · (N − n)/(N − 1), which is smaller than
that from SRSWR by the factor of (N − n)/(N − 1). Substituting the unbiased
estimator of σ², [(N − 1)s²/N], the estimator for the variance of the sample
mean from SRSWOR is

    V̂(ȳ) = (1 − f) s²/n,                    (2.1)

where s² = Σ(yᵢ − ȳ)²/(n − 1) and f = n/N. Both (N − n)/(N − 1) and
(1 − f ) are called the finite population correction (FPC) factor. In a large
population, the covariance will be very small because the sampling fraction
is small. Therefore, SRSWR and SRSWOR designs will produce practically
the same variance, and these two procedures can be considered equivalent
for all practical purposes.
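Equation 2.1 is straightforward to compute; a small Python sketch with an invented sample:

```python
from statistics import variance

def srswor_var_of_mean(sample, N):
    """Estimated variance of the sample mean under SRSWOR:
    (1 - f) * s^2 / n, with f = n/N the sampling fraction (Equation 2.1)."""
    n = len(sample)
    s2 = variance(sample)        # s^2 uses the n - 1 divisor
    return (1 - n / N) * s2 / n

y = [4, 7, 6, 5, 8]              # invented sample; s^2 = 2.5, n = 5
v_large = srswor_var_of_mean(y, N=1000)   # FPC near 1: about s^2 / n
v_small = srswor_var_of_mean(y, N=10)     # half the population sampled
```

With N = 1000 the result is essentially the SRSWR value s²/n, illustrating why the two designs are practically equivalent when the sampling fraction is small.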
Stratified sampling is often presented as a more efficient design because it
gives, if used appropriately, a smaller variance than that given by a comparable
SRS. Because the covariances between strata are zero, the variance of the
sample estimate is derived from the within-stratum variances, which are
combined based on the stratum sample sizes and the stratum weights. The
value of a stratified sample variance depends on the distribution of the strata
sample sizes. An optimal (or Neyman) allocation produces a sampling
variance less than or equal to that based on SRS except in extremely rare situa-
tions. For other disproportionate allocations, the sampling variance may turn
out to be larger than that based on SRS when the finite population correction
factor (FPC) within strata cannot be ignored. Therefore, it cannot be assumed
that stratification will always reduce sampling variance compared to SRS.
The cluster sampling design usually leads to a larger sampling variance
than that from SRS. This is because the elements within naturally formed
clusters are often similar, which then yields a positive covariance between
elements within the cluster. The homogeneity within clusters is measured
by the intraclass correlation coefficient (ICC)—the correlation between all
possible pairs of elements within clusters. If clusters were randomly formed
(i.e., if each cluster were a random sample of elements), the ICC would be
zero. In many natural clusters, the ICC is positive and, hence, the sampling
variance will be larger than that for the SRS design.
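One common (ANOVA-type) estimator of the ICC for equal-sized clusters, not given in the text but standard, is ρ̂ = (MSB − MSW)/[MSB + (m − 1)MSW]; a sketch with invented clusters:

```python
from statistics import mean

def anova_icc(clusters):
    """ANOVA-type estimate of the intraclass correlation for k equal-sized
    clusters of m elements: (MSB - MSW) / (MSB + (m - 1) * MSW)."""
    k, m = len(clusters), len(clusters[0])
    grand = mean(y for c in clusters for y in c)
    msb = m * sum((mean(c) - grand) ** 2 for c in clusters) / (k - 1)
    msw = sum((y - mean(c)) ** 2 for c in clusters for y in c) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)

# Invented clusters whose elements are very alike within each cluster
homogeneous = [[10, 11, 10], [20, 21, 20], [30, 29, 30]]
rho = anova_icc(homogeneous)     # close to 1: clustering inflates the variance
```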
It is difficult to generalize regarding the relative size of the sampling
variance in a complex design because the combined effects of stratification
and clustering, as well as that of the sample weights, must be assessed.
Therefore, all observations in survey data must be viewed as products of a
specific sample design that contains sample weights and structural arrange-
ments. In addition to the sample weights, strata and cluster identification
(at least PSUs) should be included in sample survey data. Reasons for these
requirements will become clearer later.
One complication in the variance calculation for a complex survey stems
from the use of weights. Because the sum of weights in the denominator of
any weighted estimator is not fixed but varies from sample to sample, the esti-
mator becomes a ratio of two random variables. In general, a ratio estimator
is biased, but the bias is negligible if the variation in the weights is relatively
small or the sample size is large (Cochran, 1977, chap. 6). Thus, the problem
of bias in the ratio estimator is not an issue in large social surveys. Because of
this bias, however, it is appropriate to use the mean square error—the sum of
the variance plus the square of the bias—rather than the variance. However,
because the bias often is negligible, we will use the term ‘‘variance’’ even if
we are referring to the mean square error in this book.

A Different View of Survey Data*


So far, the nature of survey data has been described from the design-based
perspective—that is, sample data are observations sampled from a finite
population using a particular sample selection design. The sampling design
specifies the probability of selection of each potential sample, and a proper
estimator is chosen to reflect the design. As mentioned in the introduction,
the model-based perspective offers an alternative view of sample survey
data. Observations in the finite population are viewed as realizations of a
random variable generated from some model (a random variable that follows
some probability distribution). The assumed probability model supplies the
link between units in the sample and units not in the sample. In the
model-based approach, the sample data are used to predict the unobserved
values, and thus inferences may be thought of as prediction problems (Royall,
1970, 1973).
These two points of view may not make a difference in SRS, where we can
reasonably assume that the sample observations are independent and identically
distributed from a normal distribution with mean μ and variance σ².
From the model point of view, the population total is the sum of the
observations in the sample and the sum of the observations not in the sample;
that is, Y = Σ_{i∈S} yᵢ + Σ_{i∉S} yᵢ. Based on the assumption of a common
mean, the population total can be estimated as Ŷ = nȳ + (N − n)ȳ = Nȳ, where
ȳ is the best unbiased predictor of the unobserved observations under the
model. It turns out to be the same as the expansion estimator in the design-
based approach, namely, Ŷ = (N/n) Σ_{i∈S} yᵢ = Nȳ, where N/n is the
sample weight (the inverse of the selection probability in SRS). Both approaches
lead to the same variance estimate (Lohr, 1999, sec. 2.8).
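The equivalence of the two viewpoints under SRS is easy to verify numerically. The following sketch uses hypothetical values (a population size N and a small sample) and computes the model-based prediction nȳ + (N − n)ȳ alongside the design-based expansion estimator (N/n) Σ yᵢ:

```python
# Illustrative SRS setting (hypothetical values): N population
# elements, n of them observed.
N = 1000
sample = [4.0, 7.0, 5.0, 8.0, 6.0]
n = len(sample)
ybar = sum(sample) / n

# Model-based view: observed total plus the predicted total of the
# N - n unobserved units, each predicted by the common mean ybar.
model_total = n * ybar + (N - n) * ybar

# Design-based view: each observation carries the expansion weight N/n.
design_total = (N / n) * sum(sample)

print(model_total, design_total)  # both equal N * ybar
```

Both quantities reduce to Nȳ, which is why the two perspectives coincide in this simple case.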
If a different model were adopted, however, the variance estimates might
differ. For example, in the case of ratio1 and regression estimation under
SRS, the assumed model is Yi = βxi + εi , where Yi is for a random variable
and xi is an auxiliary variable for which the population total is known.
Under this model, the linear estimate of the population total will be
Ŷ = Σ_{i∈S} yᵢ + Σ_{i∉S} ŷᵢ = nȳ + β̂ Σ_{i∉S} xᵢ. The first part is from the sample,
and the second part is the prediction for the unobserved units based on
the assumed model. If we take β̂ to be the sample ratio ȳ/x̄, then we have
Ŷ = nȳ + (ȳ/x̄) Σ_{i∉S} xᵢ = (ȳ/x̄)(nx̄ + Σ_{i∉S} xᵢ) = (ȳ/x̄)X, where X is the population
total of xᵢ. This is simply the ratio estimate of Y. If we take β̂ to be the esti-
mated regression coefficient, then we have a regression estimation. Although
the ratio estimate is known to be slightly biased from the design-based
viewpoint, it is unbiased from the model-based reasoning if the model is
correct.
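A small numerical sketch (with hypothetical y, x values and a known auxiliary total X) shows that the prediction form nȳ + β̂ Σ_{i∉S} xᵢ with β̂ = ȳ/x̄ reduces to the familiar ratio estimate (ȳ/x̄)X:

```python
# Hypothetical SRS sample of (y, x) pairs; X is the known population
# total of the auxiliary variable x.
y = [2.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0]
X = 20.0
n = len(y)
ybar = sum(y) / n
xbar = sum(x) / n
beta_hat = ybar / xbar        # sample ratio used as the predictor slope

# Prediction form: observed y-total plus beta_hat times the x-total
# of the unobserved units, X - n * xbar.
Y_pred = n * ybar + beta_hat * (X - n * xbar)

# Closed form: the ratio estimate (ybar / xbar) * X.
Y_ratio = (ybar / xbar) * X

print(Y_pred, Y_ratio)  # identical up to floating-point rounding
```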
But the estimate of variance by the model-based approach is slightly
different from the estimate by the design-based approach. The
design-based estimate of variance of the estimated population total is

V̂_D(Ŷ) = (1 − n/N)(N²/n) Σ_{i∈S} [yᵢ − (ȳ/x̄)xᵢ]²/(n − 1).

The model-based estimator is

V̂_M(Ŷ) = (1 − x/X)(X²/x) Σ_{i∈S} [{yᵢ − (ȳ/x̄)xᵢ}/√xᵢ]²/(n − 1),

where x is the sample total and X
is the population total of the auxiliary variable (see Lohr, 1999, sec. 3.4).
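Both variance estimators can be evaluated side by side on a small data set. This sketch uses hypothetical values (y, x, N, and X chosen only for illustration); the model-based version scales each residual by √xᵢ, reflecting the assumed variance structure:

```python
import math

# Hypothetical SRS sample of (y, x) pairs; N is the population size
# and X the population total of the auxiliary variable.
y = [2.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0]
N, X = 10, 20.0
n = len(y)
ybar, xbar = sum(y) / n, sum(x) / n
B = ybar / xbar                       # sample ratio
resid = [yi - B * xi for yi, xi in zip(y, x)]

# Design-based variance of the estimated population total.
V_D = (1 - n / N) * (N ** 2 / n) * sum(r ** 2 for r in resid) / (n - 1)

# Model-based variance: residuals divided by sqrt(x_i) before squaring.
x_tot = sum(x)
V_M = (1 - x_tot / X) * (X ** 2 / x_tot) * \
      sum((r / math.sqrt(xi)) ** 2 for r, xi in zip(resid, x)) / (n - 1)

print(V_D, V_M)  # close, but not identical, as the text notes
```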
The ratio estimate model is valid when (a) the relation between yi and
xi is a straight line through the origin and (b) the variance of yi about this
line is proportional to xi . It is known that the ratio estimate is inferior to the
expansion estimate (without the auxiliary variable) when the correlation
between yi and xi is less than one-half the ratio of coefficient of variation
of xi over the coefficient of variation of yi (Cochran, 1977, chap. 6). There-
fore, the use of ratio estimation in survey analysis would require checking
the model assumptions. In practice, when the data set includes a large
number of variables, ratio estimation becomes cumbersome because different
auxiliary variables must be selected for different estimates.
To apply the model-based approach to a real problem, we must first be
able to produce an adequate model. If the model is wrong, the model-based
estimators will be biased. When using model-based inference in sampling,
one needs to check the assumptions of the model by examining the data
carefully. Checking the assumptions may be difficult in many circum-
stances. The adequacy of a model is to some extent a matter of judgment,
and a model adequate for one analysis may not be adequate for another
analysis or another survey.

3. COMPLEXITY OF ANALYZING SURVEY DATA

Two essential aspects of survey data analysis are adjusting for the
differential representation of sample observations and assessing the loss or
gain in precision resulting from the complexity of the sample selection
design. This chapter introduces the concept of weight and discusses the
effect of sample selection design on variance estimation. To illustrate the
versatility of weighting in survey analysis, we present two examples of
developing and adjusting sample weights.

Adjusting for Differential Representation: The Weight


Two types of sample weights are commonly encountered in the analysis
of survey data: (a) the expansion weight, which is the reciprocal of the
selection probability, and (b) the relative weight, which is obtained by scal-
ing down the expansion weight to reflect the sample size. This section
reviews these two types of weights in detail for several sample designs.
Consider the following SRS situation: A list of N = 4,000 elements in a
population is numbered from 1 to 4,000. A table of random numbers is used
to draw a fixed number of elements (for example, n = 200) from the popula-
tion, not allowing duplicate selection (without replacement). The selection
probability or sampling fraction is f = n/N = .05. The expansion weight
is the reciprocal of the selection probability, wi = 1/f = N/n = 20
(i = 1, . . . , n), which indicates the number of elements represented by a
sample observation in the population. These weights for the n elements
selected sum to N. An estimator of the population total of variable Y based
on the sample elements is
 
Ŷ = Σ wᵢ yᵢ = (N/n) Σ yᵢ = Nȳ.    (3.1)

Equation 3.1 shows the use of the expansion weight in the weighted
sum of sample observations. Because the weight is the same for each
element in SRS, the estimator can be simplified to N times the sample
mean (the last quantity in Equation 3.1). Similarly, the estimator of the
population mean is defined as Ȳ̂ = Σ wᵢ yᵢ / Σ wᵢ, which is the weighted
sample mean. In SRS, this simplifies to (N/n) Σ yᵢ / N = ȳ, showing that
the sample mean is an estimator for the population mean. However, even
if the weights are not the same (in unequal probability designs), the esti-
mators are still a weighted sum for the population total and a weighted
average for the population mean.
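The SRS example above can be sketched directly in code. The weights follow the text (N = 4,000, n = 200, w = 20); the y-values are simulated placeholders:

```python
import random

random.seed(1)
N, n = 4000, 200
f = n / N                 # sampling fraction, .05
w = 1 / f                 # expansion weight, 20 for every element

# Simulated sample observations (placeholder data for illustration).
y = [random.gauss(50, 10) for _ in range(n)]

weights = [w] * n
assert sum(weights) == N  # expansion weights sum to the population size

# Weighted total (Equation 3.1) equals N times the sample mean.
Y_hat = sum(wi * yi for wi, yi in zip(weights, y))
ybar = sum(y) / n
print(Y_hat, N * ybar)
```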
Although the expansion weight appears appropriate for the estimator of
the population total, it may play havoc with the sample mean and other sta-
tistical measures. For example, using the sum of expansion weights in con-
tingency tables in place of relative frequencies based on sample size may
lead to unduly large confidence in the data. To deal with this, the expansion
weight can be scaled down to produce the relative weight, (rw)ᵢ, which is
defined to be the expansion weight divided by the mean of the expansion
weights; that is, (rw)ᵢ = wᵢ/w̄, where w̄ = Σ wᵢ/n. These relative weights for all
elements in the sample add up to n. For the SRS design, (rw)ᵢ is 1 for each
element. The estimator for the population total weighted by the relative
weights is
 
Ŷ = w̄ Σ (rw)ᵢ yᵢ = (N/n) Σ yᵢ = Nȳ.    (3.2)
Note in Equation 3.2 that the relative weighted sum is multiplied by the
average expansion weight, which yields the same simplified estimator for
the case of SRS as in Equation 3.1. Hence, the expansion weight is simpler
to use than the relative weight in estimating the population total. The
relative weight is appropriate in analytic studies, but it is inappropriate in
estimating totals and computing finite population corrections.
Most public-use survey data from government agencies and survey
organizations use the expansion weight, and it can easily be converted to
the relative weight. Such conversion is not necessary, however, because the
user-friendly statistical programs for survey analysis automatically perform
the conversion internally when appropriate.
Let us consider expansion weights in a stratified random sampling design.
In this design, the population of N elements is grouped into L strata based
on a certain variable with N1 , N2 , . . . , NL elements respectively, from
which nh (h = 1, 2, . . . , L) elements are independently selected from the h-th
stratum. A stratified design retains a self-weighting quality when the
sampling fraction in each stratum is the same. If a total of 200 elements are
proportionately selected from two strata of N1 = 600 and N2 = 3,400
elements, then f = 200/4,000 = .05. A proportionate selection (a 5%
sample from each stratum) yields n1 = 30 and n2 = 170 because
f1 = 30/600 = .05 and f2 = 170/3,400 = .05. The weighting scheme is
then exactly the same as in SRS design.
The situation is slightly different with a disproportionate stratified sample
design. For example, if the total sample of 200 were split equally between the
two strata, f1 (= 100/600) and f2 (= 100/3,400) would have different values,
and the expansion weights would be unequal for the elements in the two strata,
with w1i = 6 and w2i = 34. The expansion weights sum to 600 in the first
stratum and to 3,400 in the second with their total being 4,000, the population
size. The mean expansion weight is w̄ = (100 × 6 + 100 × 34)/200 = 20,
and, hence, the relative weights sum to 30 in the first stratum and to 170 in the
second. The sums of relative weights in both strata add up to the total sample
size. Note that the use of either type of weight is equivalent to weighting
stratum means (ȳₕ) using the population distribution across the strata
[i.e., Σ (Nₕ/N) ȳₕ, the standard procedure]. Both the expansion and relative
weights in stratum 1 sum to 15% of their respective total sums, and the first
stratum also contains 15% of the population elements.
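The disproportionate two-stratum example can be verified in a few lines; the numbers below come from the text:

```python
# Two strata with an equal split of the n = 200 sample.
N1, N2 = 600, 3400
n1, n2 = 100, 100

w1 = N1 / n1   # expansion weight in stratum 1: 6
w2 = N2 / n2   # expansion weight in stratum 2: 34

# Expansion weights sum to the stratum sizes, and their total is N.
sum1, sum2 = n1 * w1, n2 * w2
w_bar = (sum1 + sum2) / (n1 + n2)      # mean expansion weight: 20

# Relative weights: expansion weight divided by the mean weight.
rw1, rw2 = w1 / w_bar, w2 / w_bar
print(sum1, sum2, w_bar, n1 * rw1, n2 * rw2)
```

The relative weights sum to 30 and 170, matching the proportionate allocation described in the text.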
Although we have used SRS and stratified sample designs to introduce the
sample weights, the same concept extends easily to more complex designs.
In summary, the sample weight is the inverse of the selection probability,
although it often is further modified by poststratification and nonresponse
adjustments. The assignment of the sample weight to each sample element
facilitates a general estimation procedure for all sample designs. As a general
rule, all estimates take the form of weighted statistics in survey data analysis.
The scale of these weights does not matter in estimating parameters and stan-
dard errors except when estimating totals and computing finite population
corrections.
Next, we present two examples of developing/adjusting the weight. The
first example shows how sample weights are modified by the poststratification
procedure in order to make sample compositions conform to population
compositions. The same procedure can be used to adjust for differential
response rates in demographic subgroups. The poststratification approach
works well when only a few demographic variables are involved. The second
example demonstrates that differential attrition rates in a follow-up survey can
be adjusted for a large number of variables using a logistic regression model.

Developing the Weight by Poststratification


To demonstrate the development of sample weights, we shall work with
the 1984 General Social Survey (GSS). This is a complex sample survey
conducted by the National Opinion Research Center (NORC) to obtain gen-
eral social information from the civilian noninstitutionalized adult popula-
tion (18 years of age and older) of the United States. A multistage selection
design was used to produce a self-weighting sample at the household level.
One adult was then randomly selected from each sampled household
(Davis & Smith, 1985). There were 1,473 observations available for ana-
lysis in the data file. For the purpose of illustration, the expansion weight
for these data at the household level could be calculated by dividing the
number of households in the United States by 1,473. The expansion
weight within the sampled household is the number of adults in the house-
hold. The product of these two weights gives the expansion weight for
sample individuals.
For an analysis of the GSS data, we need to focus only on the weight
within the household, because each household has the same probability of
selection. The relative weight for the individual can be derived by dividing
the number of adults in the household by the average number of adults
(2,852/1,473 = 1.94) per household. This weight reflects the probability
of selection of an individual in the sample while preserving the sample size.
We further modified this weight by a poststratification adjustment in an
attempt to make the sample composition the same as the population compo-
sition. This would improve the precision of estimates and could possibly
reduce nonresponse and sample selection bias to the extent that it is related
to the demographic composition.2 As shown in Table 3.1, the adjustment
factor is derived to cause the distribution of individuals in the sample to
match the 1984 U.S. population by age, race, and sex. Column 1 is the 1984
population distribution by race, sex, and age, based on the Census Bureau’s
estimates. Column 2 shows the weighted number of adults in the sampled
households by the demographic subgroups, and the proportional distribu-
tion is in Column 3. The adjustment factor is the ratio of Column 1 to
Column 3. The adjusted weight is found by multiplying the adjustment
factor by the relative weight, and the distribution of adjusted weights is then
TABLE 3.1
Derivation of Poststratification Adjustment Factor:
General Social Survey, 1984
Demographic          Population      Weighted Number    Sample          Adjustment
Subgroups            Distribution    of Adults          Distribution    Factor
                     (1)             (2)                (3)             (1)/(3)

White, male
18–24 years .0719660 211 .0739832 0.9727346
25–34 .1028236 193 .0676718 1.5194460
35–44 .0708987 277 .0795933 0.8907624
45–54 .0557924 135 .0473352 1.1786660
55–64 .0544026 144 .0504909 1.0774730
65 and over .0574872 138 .0483871 1.1880687
White, female
18–24 years .0705058 198 .0694250 1.0155668
25–34 .1007594 324 .1136045 0.8869317
35–44 .0777364 267 .0936185 0.8303528
45–54 .0582026 196 .0682737 0.8469074
55–64 .0610057 186 .0652174 0.9354210
65 and over .0823047 216 .0757363 1.0867272
Nonwhite, male
18–24 years .0138044 34 .0119215 1.1579480
25–34 .0172057 30 .0105189 1.6356880
35–44 .0109779 30 .0105189 1.0436290
45–54 .0077643 37 .0129734 0.5984774
55–64 .0064683 12 .0042076 1.5372900
65 and over .0062688 18 .0063113 0.9932661
Nonwhite, female
18–24 years .0145081 42 .0147265 0.9851716
25–34 .0196276 86 .0301543 0.6509067
35–44 .0130655 38 .0133240 0.9806026
45–54 .0094590 33 .0115708 0.8174890
55–64 .0079636 30 .0105189 0.7570769
65 and over .0090016 27 .0094670 0.9508398
Total 1.0000000 2,852 1.0000000

SOURCE: U.S. Bureau of the Census, Estimates of the population of the United States, by age, sex, and race,
1980 to 1985 (Current Population Reports, Series P-25, No. 985), April 1986. Noninstitutional population
estimates are derived from the estimated total population of 1984 (Table 1), adjusted by applying the ratio of
noninstitutional to total population (Table A1).

the same as the population distribution. The adjustment factors indicate that
without the adjustment, the GSS sample underrepresents males 25–34 years
of age and overrepresents nonwhite males 45–54 years of age and nonwhite
females 25–34 years of age.
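The adjustment-factor computation in Table 3.1 reduces to a single ratio per subgroup: the population share divided by the weighted sample share. A sketch using the table's first row (white males 18–24 years) as a check:

```python
# Poststratification adjustment factor: population share divided by
# the weighted sample share (values from Table 3.1, white males 18-24).
pop_share = 0.0719660          # Census-based population distribution
weighted_adults = 211          # weighted count of adults in this subgroup
total_adults = 2852            # weighted total over all subgroups

sample_share = weighted_adults / total_adults     # about .0739832
adj_factor = pop_share / sample_share             # about 0.97273

# The adjusted weight for each respondent in the subgroup is the
# relative weight multiplied by adj_factor.
print(round(sample_share, 7), round(adj_factor, 7))
```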
TABLE 3.2
Comparison of Weighted and Unweighted Estimates in Two Surveys
                                  Weighted      Unweighted
Surveys and Variables             Estimates     Estimates

I. General Social Survey


(Percentage approving ‘‘hitting’’)
Overall 60.0 59.4
By sex
Male 63.5 63.2
Female 56.8 56.8
By education
Some college 68.7 68.6
High school 63.3 63.2
Others 46.8 45.2
II. Epidemiologic Catchment Areas Survey
(Prevalence rate of mental disorders)
Any disorders 14.8 18.5
Anxiety disorders 6.5 8.8

SOURCE: Data for the Epidemiologic Catchment Areas Survey are from E. S. Lee, Forthofer, and Lorimor
(1986), Table 1.

The adjusted relative weights are then used in the analysis of the data—for
example, to estimate the proportion of adults responding positively to the
question, ‘‘Are there any situations that you can imagine in which you would
approve of a man punching an adult male stranger?’’ As shown in the upper
section of Table 3.2, the weighted overall proportion is 60.0%, slightly larger
than the unweighted estimate of 59.4%. The difference between the weighted
and unweighted estimates is also very small for the subgroup estimates
shown. This may be due primarily to the self-weighting feature reflected in
the fact that most households have two adults and, to a lesser extent, to the
fact that the ‘‘approval of hitting’’ is not correlated with the number of adults
in a household. The situation is different in the National Institute of Mental
Health-Sponsored Epidemiologic Catchment Area (ECA) Survey. In this
survey, the weighted estimates of the prevalence of any disorders and of anxi-
ety disorders are, respectively, 20% and 26% lower than the unweighted
estimates, as shown in Table 3.2.
Finally, the adjusted weights should be examined to see whether there
are any extremely large values. Extreme variation in the adjusted weights
may imply that the sample sizes in some poststrata are too small to be reli-
able. In such a case, some small poststrata need to be collapsed, or some raking
procedure must be used to smooth out the rough edges (Little & Rubin, 1987,
pp. 59–60).
Adjusting the Weight in a Follow-Up Survey


Follow-up surveys are used quite often in social science research.
Unfortunately, it is not possible to follow up with all initial respondents.
Some may have died, moved, or refused to participate in the follow-up
survey. These events are not randomly distributed, and the differential attri-
tion may introduce selection bias in the follow-up survey. We could make
an adjustment in the same manner as in the poststratification, based on a
few demographic variables, but we can take advantage of a large number of
potential predictor variables available in the initial survey for the attrition
adjustment in the follow-up survey. The stratification strategy may not be
well suited for a large number of predictors. The logistic regression model,
however, provides a way to include several predictors. Based on the initial
survey data, we can develop a logistic regression model for predicting the
attrition (binary variable) by a set of well-selected predictor variables from
the initial survey, in addition to usual demographic variables. The predicted
logit based on characteristics of respondents in the original sample can then
be used to adjust the initial weight for each respondent in the follow-up sur-
vey. This procedure, in essence, can make up for those lost to the follow-up
by differentially inflating the weights of those successfully contacted and,
hence, removing the selection bias to the extent that it is related to the vari-
ables in the model. As more appropriate variables are used in the model, the
attrition adjustment can be considered more effective than the nonresponse
adjustment in the initial survey. Poststratification used to correct for non-
response resulting from ‘‘lost to follow-up’’ (or for other types of nonresponse)
does not guarantee an improved estimate. However, if ‘‘lost to follow-up’’ is
related to the variables used in the adjustment process, the adjusted estimate
should be an improvement over the unadjusted estimate.
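The adjustment step itself is simple once the fitted model yields a predicted attrition logit for each initial respondent. A minimal sketch, with hypothetical weights, logits, and follow-up statuses:

```python
import math

# Hypothetical follow-up records: initial weight, predicted attrition
# logit from the logistic model, and follow-up status.
records = [
    {"w": 80.0, "logit": 0.0,  "interviewed": True},   # p-hat = 0.5
    {"w": 60.0, "logit": -2.0, "interviewed": True},
    {"w": 75.0, "logit": 1.0,  "interviewed": False},  # lost to follow-up
]

for r in records:
    # Predicted proportion of attrition from the logit.
    p_hat = 1 / (1 + math.exp(-r["logit"]))
    # Inflate the weight of retained respondents by 1/(1 - p-hat);
    # respondents lost to follow-up get weight zero.
    r["w_adj"] = r["w"] / (1 - p_hat) if r["interviewed"] else 0.0

print([round(r["w_adj"], 3) for r in records])
```

A respondent with a predicted attrition probability of 0.5 ends up representing twice the initial count, compensating for similar respondents who were lost.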
Let us look at an example from a community mental health survey (one site
in the ECA survey). Only 74.3% of initial respondents were successfully sur-
veyed in the first annual follow-up survey. This magnitude of attrition is too
large to be ignored. We therefore conducted a logistic regression analysis of
attrition (1 = lost; 0 = interviewed) on selected predictors shown in Table 3.3.
The chi-square value suggests that the variables in the model are significantly
associated with attrition. The effect coding requires that one level of each vari-
able be omitted, as shown in the table. Based on the estimated beta coefficients,
the predicted logit (λ̂ᵢ, i = 1, 2, . . . , n) is calculated for each respondent. It is
then converted to the predicted proportion of attrition by p̂ᵢ = 1/(1 + e^(−λ̂ᵢ)).
The initial weight for a successfully interviewed respondent is inflated by
dividing it by (1 − p̂ᵢ), and the weight for a respondent lost to follow-up is
set to zero. As shown in Table 3.3, the adjusted weights for the retained panel
members added up to 300,172, only 59 more than the sum of the weights in the
TABLE 3.3
Logistic Regression Model for Attrition
Adjustment in a Follow-Up Survey

Logistic Regression Model

Factors                Category         Variable   Beta Coefficient

Intercept              —                —          -0.737*
Age                    18–24 yrs        AGE1        0.196
                       25–34 yrs        AGE2        0.052
                       35–44 yrs        AGE3       -0.338*
                       45–54 yrs        AGE4       -0.016
                       55 and over      —
Marital status         Sep./divorced    MAR         0.051
                       Other            —
Gender                 Male             SEX         0.084
                       Female           —
Race                   White            RACE1      -0.668*
                       Black            RACE2      -0.865*
                       Other            —
Socioeconomic status   1st quartile     SES1        0.534*
                       2nd quartile     SES2        0.389*
                       3rd and 4th      —
Family size            One              SIZE1       0.033
                       2–4 members      SIZE2      -0.003
                       5 or more        —
Diagnosis              Cog. Impair.     DX1         0.472*
                       Schizophrenia    DX2        -0.049
                       Antisocial       DX3         0.412*
                       Anorexia         DX4         2.283*
                       No disorder      —

Likelihood ratio chi-square(16) = 198.0, p < 0.00001.

Survey-Related Information

Initial survey
  Design: Multistage sampling
  Sample size: 4,967
  Weighted sum: 300,113

Follow-up survey
  Sample size: 3,690 (74.3%)
  Sum of attrition-adjusted weights: 300,172
  Adjusted sum: 300,113

Comparison of attrition-adjusted and attrition-unadjusted estimates

Disorders         Unadjusted(a)   Adjusted(b)   Diff.

Any disorder          39.2%          43.8%      -4.6%
Major depre.          15.0           18.1       -3.1
Cog. Impair.           1.8            1.3        0.5
Phobias                8.4            7.6       -0.4
Alcohol abuse         13.9           13.7       -4.2
Schizophrenia          0.6            0.6       -0.4

a. Based on the initial weights.
b. Based on attrition-adjusted weights.
* p < .05.

initial survey, suggesting that the procedure worked reasonably well. (If there
were a large discrepancy between the sum of the adjusted weights and the sum
of the weights in the initial survey, there would be a concern about the adjust-
ment process.) The adjusted weights are readjusted to align to the initial sur-
vey. To show the effect of using attrition-adjusted weights, the prevalence
rates of six selected mental disorders were estimated with and without the attri-
tion-adjusted weights, as shown in Table 3.3. Using the adjusted weights, we
see that the prevalence of any disorders (DSM-III defined disorders) is nearly
5 percentage points higher than the unadjusted prevalence.

Assessing the Loss or Gain in Precision: The Design Effect


As shown in the previous chapter, the variance of an SRSWOR sample
mean is the variance of an SRSWR sample mean times the finite population
correction factor (1 − f ). The ratio of the sampling variance of SRSWOR to
the sampling variance of SRSWR is then (1 − f ), which reflects the effect
of using SRSWOR compared to using SRSWR. This ratio comparing the
variance of some statistic from any particular design to that of SRSWR is
called the design effect for that statistic. It is used to assess the loss or gain
in precision of sample estimates from the design used, compared to a
SRSWR design. A design effect less than one indicates that fewer observa-
tions are needed to achieve the same precision as SRSWR, whereas a design
effect greater than one indicates that more observations are needed to yield
the same precision. In the case of SRSWOR, the design effect is less than
one, but it is close to one when the sampling fraction is very small. Because
SRSWOR is customarily used in place of SRSWR, researchers have tended
to base the design effect calculation on SRSWOR instead of SRSWR. In
addition, in complex surveys the design effect is usually calculated based
on the variance of the weighted statistic under the SRSWOR design. We
shall do that throughout the rest of the book.
Relating this notion of the design effect to the sample size, the effective
sample size can be defined as the actual sample size divided by the design
effect. If the design effect is greater than one for a sample design, then, in
effect, the sample size is reduced for a statistical analysis, leading
to a larger sampling error. In other words, when the design effect is greater
than one, the effective sample size is smaller than the actual size.
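For example, with illustrative numbers:

```python
# Effective sample size: actual sample size divided by the design effect.
n_actual = 1000
deff = 2.5          # illustrative design effect for some statistic

n_effective = n_actual / deff
print(n_effective)  # 400.0: the survey carries only as much information
                    # about this statistic as an SRS of 400 observations
```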
Let us examine the design effect in more complex sample designs. The
properties and estimation of the sampling error for stratified random sam-
pling are well known, and so are the conditions under which stratification
will produce a smaller variance than SRS. However, stratification often
is used in conjunction with other design features, such as cluster sam-
pling in several stages within the strata. As discussed earlier, clustering
tends to increase the sampling error. The effect of stratification can be
diluted by the effect of clustering in many practical designs. Unfortunately,
the design effect cannot be determined theoretically from the properties
of stratification and clustering separately; instead, it must be
approximated numerically to account for their combined
effects.
The following example demonstrates the determination of the design
effect in a relatively simple situation. Consider the case of single-stage clus-
ter sampling in which all clusters are of the same size. Suppose that there are
N English classes (clusters) in a high school, with M students in each class.
From the N classes, n classes are selected by SRS, and all the students in the
chosen clusters are asked to report the number of books they have read since
the beginning of the year. The number of students is NM in the popula-
tion and nM in the sample. The sampling fraction is f = nM/NM = n/N.
Because the class sizes are equal, the average number of books read per
student (population mean) is the mean of the N class means. The n sample
classes can be viewed as a random sample of n means from a population of
N means. Therefore, the sample mean (ȳ) is unbiased for the population
mean (Ȳ), and its variance, applying Equation 2.1, is given by

V̂(ȳ) = (s_b²/n)(1 − f),    (3.3)

where s_b² = Σ(ȳᵢ − ȳ)²/(n − 1), the estimated variance of the cluster
means. Alternately, Equation 3.3 can be expressed in terms of the estimated
ICC (ρ̂) as follows (Cochran, 1977, chap. 9):

V̂(ȳ) = s²[1 + (M − 1)ρ̂](1 − f)/(nM),    (3.4)

where s² = Σ(yᵢⱼ − ȳ)²/(nM − 1), the variance of elements in the sample.
If this is divided by the variance of the mean from an SRSWOR sample
of size nM [V̂(ȳ) = (s²/nM)(1 − f), applying Equation 2.1], then the design
effect of the cluster sample is 1 + (M − 1)ρ̂.
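These relationships can be checked numerically. The sketch below uses two toy clusters of equal size (hypothetical data chosen so the between-cluster spread dominates); it computes the cluster-sample variance of the mean, the corresponding SRSWOR variance, and their ratio, then backs out the implied ICC:

```python
# Toy single-stage cluster sample: n clusters of M elements each, drawn
# from a population of Np clusters (hypothetical data).
clusters = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
Np = 10                       # number of clusters in the population
n = len(clusters)             # sampled clusters
M = len(clusters[0])          # elements per cluster
f = n / Np

elements = [y for c in clusters for y in c]
ybar = sum(elements) / (n * M)

# Equation 3.3: variance of the mean based on the n cluster means.
means = [sum(c) / M for c in clusters]
s2_b = sum((m - ybar) ** 2 for m in means) / (n - 1)
v_cluster = (s2_b / n) * (1 - f)

# SRSWOR variance of a mean from nM elements.
s2 = sum((y - ybar) ** 2 for y in elements) / (n * M - 1)
v_srs = (s2 / (n * M)) * (1 - f)

deff = v_cluster / v_srs            # design effect
rho_hat = (deff - 1) / (M - 1)      # implied ICC, since deff = 1 + (M-1)rho
print(deff, rho_hat)
```

For these deliberately ordered clusters the implied ICC is at its maximum of one, so the design effect equals M; clusters formed at random would drive both the ICC and the excess variance toward zero.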


When ICC = 0, the design effect will be one; when ICC > 0, the design
effect will be greater than one. If the clusters were formed at random, then
ICC = 0; when all the elements within each cluster have the same value,
ICC = 1. Most clusters used in community surveys consist of houses in the
same area, and these yield positive ICCs for many survey variables. The
ICC is usually larger for socioeconomic variables than for the demographic
variables, such as age and sex.
The assessment of the design effect for a more complex sample design is
not a routine task that can be performed using the formulas in statistics text-
books; rather, it requires special techniques that utilize unfamiliar strategies.
The next chapter reviews several strategies for estimating the sampling var-
iance for statistics from complex surveys and examines the design effect
from several surveys.

The Use of Sample Weights for Survey Data Analysis*


As discussed above, sample weights are used when computing point
estimates. All point estimates take the form of weighted statistics. This use
of sample weights makes sense from the above discussion on development
and adjustment of sample weights, especially for a descriptive analysis.
However, the use of sample weights in analytical studies is not as clear as in
descriptive analysis. As discussed in the last section of Chapter 2, there are
two different points of view regarding survey data. From the design-based
position, the use of sample weights is essential in both descriptive and
analytical studies. Inferences are based on repeated sampling for the finite
population, and the probability structure used for inference is that defined
by the random variables indicating inclusion in the sample. From the
model-based perspective, however, it can be argued that the sample selec-
tion scheme is irrelevant when making inferences under a specified model.
If the observations in the population really follow the model, then the
sample design should have no effect as long as the probability of selection
depends on the model-specified dependent variable only through the inde-
pendent variables included in the model. Conditions under which the sam-
pling scheme is ignorable for inference have been explored extensively
(Nordberg, 1989; Sugden & Smith, 1984). Many survey statisticians have
debated these two points of view since the early 1970s (Brewer, 1999;
Brewer & Mellor, 1973; Graubard & Korn, 1996, 2002; Hansen, Madow, &
Tepping, 1983; Korn & Graubard, 1995a; Pfeffermann, 1996; Royall, 1970;
Sarndal, 1978; T. M. F. Smith, 1976, 1983).
A good analysis of survey data would require general understanding of
both points of view as well as consideration of some practical issues, espe-
cially in social surveys. The model-based approach is consistent with the
increasing use of model-based inference in other areas of statistical analy-
sis, and it provides some theoretical advantages. Model-based estimates
can be used with relatively small samples and even with nonprobability
samples. In addition, model-based analysis can be done using standard sta-
tistical software such as SAS and SPSS without relying on survey packages
such as SUDAAN and others that are reviewed in this book. The model-
based approach, however, assumes that the model correctly describes the
true state of nature. If the model is misspecified, then the analysis would be
biased and lead to misinterpretation of the data. Unfortunately, theoretically
derived models for all observations in the population seldom exist in social
survey situations. In addition, omission of relevant variables in the model
would be a real concern in secondary analysis of survey data because not all
relevant variables are available to the analysts. Thus, the major challenge to
model-based inference is specifying a correct model for the purpose of the
analysis.
It has been recognized that a weighted analysis is heavily influenced
by observations with extremely large weights (which often result
from nonresponse and poststratification adjustments rather than from the
selection probabilities). Another recognized limitation of weighting is the increase in
variance. In general, the increase is high when the variability of the weights
is large. There is something to lose in using a weighted analysis when it is
actually unnecessary for bias reduction: the weighted analysis is less
efficient than the unweighted analysis. Korn and Graubard
22

(1999, chap. 4) discuss various issues dealing with weighted and


unweighted estimates of population parameters and offer a measure of the
inefficiency of using weighted estimates. They recommend using the
weighted analysis if the inefficiency is not unacceptably large, to avoid
the bias in the unweighted analysis. If the inefficiency is unacceptably
large, they recommend using the unweighted analysis, augmenting the
model with survey design variables, including the weight, to reduce the
bias. Incorporation of design variables into the model is often problematic,
however, because the inclusion of the design variables as additional covari-
ates in the model may contradict the scientific purpose of the analysis. For
example, when the objective of the analysis is to examine associations
between health measures and risk factors, conditioning on the design vari-
ables may interfere with the relational pathway.
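The variance inflation caused by unequal weights can be quantified with Kish's well-known approximation, deff_w ≈ 1 + cv², where cv is the coefficient of variation of the weights. This formula is not given in the text; the following is a minimal sketch in Python, with a function name of our own choosing:

```python
def weighting_design_effect(weights):
    """Kish's approximation to the design effect due to unequal weighting:
    n * sum(w^2) / (sum(w))^2, which equals 1 + cv^2 of the weights."""
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return n * total_sq / (total * total)

# Equal weights carry no penalty; a few large nonresponse or
# poststratification adjustments inflate variances noticeably.
flat = weighting_design_effect([1.0, 1.0, 1.0, 1.0])    # 1.0
skewed = weighting_design_effect([1.0, 1.0, 1.0, 3.0])  # 4 * 12 / 36, about 1.33
```

A value of 1.33 means variances are inflated by about a third relative to an equally weighted sample of the same size.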
In complex large-scale surveys, it is often not possible to include in the
model all the design information, especially when the sample weights are
modified for nonresponse and poststratification adjustments (Alexander,
1987). Another practical problem in incorporating the design variables into
the model is the lack of relevant information in the data set. Not all design-
related variables are available to the analysts. Most public-use survey data
files provide only the PSU (primary sampling unit), leaving out secondary
cluster units (census enumeration districts or telephone exchanges). Often,
provision of secondary units may not be possible because of confidentiality
issues.
In a model-based analysis, one must guard against possible misspecifica-
tion of the model and possible omission of covariates. The use of sample
weights (design-based analysis) can provide protection against model mis-
specification (DuMouchel & Duncan, 1983; Pfeffermann & Holmes, 1985).
Kott (1991) points out that the sampling weights need to be used in linear
regression because the choice of covariates in survey data is limited in most
secondary analyses. The merits and demerits of using sample weights will
be further discussed in the last section of Chapter 6.

4. STRATEGIES FOR VARIANCE ESTIMATION

The estimation of the variance of a survey statistic is complicated not only
by the complexity of the sample design, as seen in the previous chapters, but
also by the form of the statistic. Even with an SRS design, the variance esti-
mation of some statistics requires nonstandard estimating techniques. For
example, the variance of the median is conspicuously absent in the standard
texts, and the sampling error of a ratio estimator (refer again to Note 1) is com-
plicated because both the numerator and denominator are random variables.
Certain variance-estimating techniques not found in standard textbooks are
sufficiently flexible to accommodate both the complexities of the sample
design and the various forms of statistics. These general techniques for var-
iance estimation, to be reviewed in this chapter, include replicated sampling,
balanced repeated replication (BRR), jackknife-repeated replication (JRR),
the bootstrap method, and the Taylor series method.

Replicated Sampling: A General Approach

The essence of this strategy is to facilitate the variance calculation by
selecting a set of replicated subsamples instead of a single sample. It requires
that each subsample be drawn independently using an identical sample selec-
tion design. Then an estimate is made in each subsample by the identical
process, and the sampling variance of the overall estimate (based on all
subsamples) can be estimated from the variability of these independent sub-
sample estimates. This is the same idea as the repeated systematic sampling
mentioned in Chapter 2.
The sampling variance of the mean (ū) of t replicate estimates u₁,
u₂, . . . , uₜ of the parameter U can be estimated by the following simple
variance estimator (Kalton, 1983, p. 51):

v(ū) = Σ(uᵢ − ū)²/[t(t − 1)]  (4.1)

This estimator can be applied to any sample statistic obtained from
independent replicates of any sample design.
In applying this variance estimator, 10 replicates are recommended by
Deming (1960), and a minimum of 4 by others (Sudman, 1976) for descrip-
tive statistics. An approximate estimate of the standard error can be calcu-
lated by dividing the range in the replicate estimates by the number of
replicates when the number of replicates is between 3 and 13 (Kish, 1965,
p. 620). However, because this variance estimator with t replicates is based
on (t − 1) degrees of freedom for statistical inference, a larger number of
replicates may be needed for analytic studies, perhaps 20 to 30 (Kalton,
1983, p. 52).
To understand the replicated design strategy, let us consider a simple
example. Suppose we want to estimate the proportion of boys among 200
newly born babies. We will simulate this survey using the random digits
from Cochran’s book (1977, p. 19), assuming the odd numbers represent
boys. The sample is selected in 10 replicate samples of n = 20 from the
first 10 columns of the table. The numbers of boys in the replicates are as
follows:
Replicate:            9    8   13   12   14    8   10    7   10    8   Total = 99
Proportion of Boys: .45  .40  .65  .60  .70  .40  .50  .35  .50  .40   Proportion = .495

The overall percentage of boys is 49.5%, and its standard error is 3.54%
(= √(49.5 × 50.5/200)). The standard error estimated from the 10 replicate
estimates using Equation 4.1 is 3.58%. It is easy to get an approximate
estimate of 3.50% by taking one tenth of the range (70% − 35%). The chief
advantage of replication is ease in estimation of the standard errors.
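These calculations are easy to reproduce; the Python sketch below (our illustration, using the replicate counts listed above) computes the overall proportion, the Equation 4.1 standard error, the range approximation, and the SRS-based standard error. Because the replicate counts are rounded summaries, the computed figures may differ slightly from those reported in the text:

```python
import math

# Number of boys in each of the 10 replicate samples of n = 20 babies.
boys = [9, 8, 13, 12, 14, 8, 10, 7, 10, 8]
t, n_rep = len(boys), 20

p = [b / n_rep for b in boys]        # replicate proportions
p_bar = sum(p) / t                   # overall proportion = 0.495

# Equation 4.1: v(u) = sum((u_i - u_bar)^2) / [t(t - 1)]
v = sum((pi - p_bar) ** 2 for pi in p) / (t * (t - 1))
se_replicate = math.sqrt(v)

# Quick approximation: one tenth of the range of the replicate estimates.
se_range = (max(p) - min(p)) / t     # (0.70 - 0.35) / 10 = 0.035

# Binomial standard error treating all 200 births as one sample.
se_srs = math.sqrt(p_bar * (1 - p_bar) / (t * n_rep))
```

All three approaches give standard errors in the same neighborhood, which is the point of the replicated design: the estimate of precision comes almost for free.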
In practice, the fundamental principle of selecting independent replicates
is somewhat relaxed. For one thing, replicates are selected using sampling
without replacement instead of with replacement. For unequal probability
designs, the calculation of basic weights and the adjustment for nonresponse
and poststratification usually are performed only once for the full sample,
rather than separately within each replicate. In cluster sampling, the replicates
often are formed by systematically assigning the clusters to the t replicates in
the same order that the clusters were first selected, to take advantage of strati-
fication effects. In applying Equation 4.1, the sample mean from the full
sample generally is used for the mean of the replicate means. These deviations
from fundamental principles can affect the variance estimation, but the bias is
thought to be insignificant in large-scale surveys (Wolter, 1985, pp. 83–85).
The community mental health survey conducted in New Haven, Connec-
ticut, in 1984 as part of the ECA Survey of the National Institute of Mental
Health (E. S. Lee, Forthofer, Holzer, & Taube, 1986) provides an example
of replicated sampling. The sampling frame for this survey was a geogra-
phically ordered list of residential electric hookups. A systematic sample
was drawn by taking two housing units as a cluster, with an interval of 61
houses, using a starting point chosen at random. A string of clusters in the
sample was then sequentially allocated to 12 subsamples. These subsam-
ples were created to facilitate the scheduling and interim analysis of data
during a long period of screening and interviewing. Ten of the subsamples
were used for the community survey, with the remaining two reserved for
another study. The 10 replicates are used to illustrate the variance estimation
procedure.
These subsamples did not strictly adhere to a fundamental principle of
independent replicated sampling because the starting points were systema-
tically selected, except for the first random starting point. However, the sys-
tematic allocation of clusters to subsamples in this case introduced an
approximate stratification leading to more stable variance estimation and,
therefore, may be preferable to a random selection of a starting point for this
relatively small number of replicates. Therefore, we considered these subsamples
as replicates and applied the variance estimator with replicated sampling,
Equation 4.1.

TABLE 4.1
Estimation of Standard Errors From Replicates:
ECA Survey in New Haven, 1984 (n = 3,058)

                                                Regression Coefficients^a
Replicate     Prevalence Rate^b  Odds Ratio^c   Intercept    Gender      Color       Age

Full Sample   17.17              0.990          0.2237      −0.0081      0.0185    −0.0020
1             12.81              0.826          0.2114       0.0228      0.0155    −0.0020
2             17.37              0.844          0.2581       0.0220      0.0113    −0.0027
3             17.87              1.057          0.2426      −0.0005      0.0393    −0.0015
4             17.64              0.638          0.1894       0.0600      0.2842    −0.0029
5             16.65              0.728          0.1499       0.0448     −0.0242    −0.0012
6             18.17              1.027          0.2078      −0.0024     −0.0030    −0.0005
7             14.69              1.598          0.3528      −0.0487     −0.0860    −0.0028
8             17.93              1.300          0.3736      −0.0333     −0.0629    −0.0032
9             17.86              0.923          0.2328      −0.0038      0.0751    −0.0015
10            18.91              1.111          0.3008      −0.0007      0.0660    −0.0043

Range          6.10              0.960          0.2237       0.1087      0.3702     0.0038

Standard error based on:
Replicates     0.59              0.090          0.0234       0.0104      0.0324     0.0004
SRS            0.68              0.097          0.0228       0.0141      0.0263     0.0004

SOURCE: Adapted from "Complex Survey Data Analysis: Estimation of Standard Errors Using Pseudo-Strata," E. S. Lee, Forthofer, Holzer, and Taube, Journal of Economic and Social Measurement, © 1986 by the Journal of Economic and Social Measurement. Adapted with permission.
a. The dependent variable (coded as 1 = condition present and 0 = condition absent) is regressed on sex (1 = male, 0 = female), color (1 = black, 0 = nonblack), and age (continuous variable). This analysis is used for demonstration only.
b. Percentage with any mental disorders during the last 6 months.
c. Sex difference in the 6-month prevalence rate.
Because one adult was randomly selected from each sampled household
using the Kish selection table (Kish, 1949), the number of adults in each
household became the sample case weight for each observation. This
weight was then adjusted for nonresponse and poststratification. Sample
weights were developed for the full sample, not separately within each
subsample, and these were the weights used in the analysis.
Table 4.1 shows three types of statistics calculated for the full sample as
well as for each of the replicates. The estimated variance of the prevalence
rate, in percent (p), can be calculated from the replicate estimates (pᵢ)
using Equation 4.1:

v(p) = Σ(pᵢ − 17.17)²/[10(10 − 1)] = 0.3474,

and the standard error is √0.3474 = 0.59. The overall prevalence rate of
17.17% is slightly different from the mean of the 10 replicate estimates
because of the differences in response rates. Note that one tenth of the
range in the replicate estimates (0.61) approximates the standard error
obtained by Equation 4.1. Similarly, standard errors can be estimated for
the odds ratio and regression coefficients. The estimated standard errors
have approximately the same values as those calculated by assuming
simple random sampling (using appropriate formulas from textbooks).
This indicates that design effects are fairly small for these statistics from
this survey.
Although the replicated sampling design provides a variance estimator that
is simple to calculate, a sufficient number of replicates are required to obtain
acceptable precision for statistical inference. But if there is a large number of
replicates and each replicate is relatively small, it severely limits the use of
stratification in each replicate. Most important, it is impractical to implement
replicated sampling in complex sample designs. For these reasons, a repli-
cated design is seldom used in large-scale, analytic surveys. Instead, the
replicated sampling idea has been applied to estimate variance in the data-
analysis stage. This attempt gave rise to pseudo-replication methods for
variance estimation. The next two techniques are based on this idea of
pseudo-replication.

Balanced Repeated Replication

The balanced repeated replication (BRR) method is based on the
application of the replicated sampling idea to a paired selection design in
which two PSUs are sampled from each stratum. The paired selection design
represents the maximum use of stratification and yet allows the calculation
of variance. In this case, the variance between two units is one half of the
squared difference between them. To apply the replicated sampling idea, we
first divide the sample into random groups to form pseudo-replicates. If it is a
stratified design, it requires all the strata to be represented in each pseudo-
replicate. In a stratified, paired selection design, we can form only two
pseudo-replicates: one containing one of the two units from each stratum and
the other containing the remaining unit from each stratum (complement repli-
cate). Each pseudo-replicate then includes approximately half of the total
sample. Applying Equation 4.1 with t = 2, we can estimate the sampling
variance of the mean of the two replicate estimates, u′ and u″, by

v(ū) = [(u′ − ū)² + (u″ − ū)²]/2.  (4.2)

As seen above, the mean of replicate estimates is often replaced by an
overall estimate obtained from the full sample. However, this estimator is
too unstable to have any practical value because it is based on only two
pseudo-replicates. The BRR method solves this problem by repeating the
process of forming half-sample replicates, selecting different units from dif-
ferent strata. The pseudo-replicated half samples then contain some common
units, and this introduces dependence between replicates, which complicates
the estimation. One solution, which leads to unbiased estimates of variance
for linear statistics, is to balance the formation of pseudo-replicates by using
an orthogonal matrix (Plackett & Burman, 1946). The full balancing requires
that the size of the matrix be a multiple of four and the number of replicates
be greater than or equal to the number of strata. Then the sampling variance
of a sample statistic can be estimated by taking the average of variance
estimates by Equation 4.2 over the t pairs of pseudo-replicates:

v(ū) = Σᵢ[(uᵢ − ū)² + (uᵢ′ − ū)²]/2t = Σᵢ(uᵢ − uᵢ′)²/4t,  (4.3)

where uᵢ and uᵢ′ denote the estimates from the i-th half-sample replicate
and its complement. It is possible to reduce the computation by dropping
the complement half-sample replicates:

v(ū) = Σᵢ(uᵢ − ū)²/t.  (4.4)

This is the estimator originally proposed by McCarthy (1966). This
balancing was shown by McCarthy to yield unbiased estimates of var-
iance for linear estimators. For nonlinear estimators, there is a bias in the
estimates of variance, but numerical studies suggest that it is small. For a
large number of strata, the computation can be further simplified by using
a smaller set of partially balanced replicates (K. H. Lee, 1972; Wolter,
1985, pp. 125–130).
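To see the balancing at work, here is a small Python sketch with synthetic data and function names of our own (Sylvester's construction is used for the orthogonal matrix, so the order is a power of two rather than just a multiple of four). For a linear statistic such as this equal-weight mean, the fully balanced Equation 4.3 estimate reproduces the textbook stratified variance estimator for a paired design exactly:

```python
def sylvester_hadamard(order):
    """Orthogonal +/-1 matrix via Sylvester's doubling construction;
    order must be a power of two."""
    H = [[1]]
    while len(H) < order:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

# Paired selection design: two PSU means per stratum (synthetic values).
psu = [(3.1, 2.7), (4.0, 4.4), (2.5, 2.9), (5.2, 4.6), (3.8, 3.5), (4.1, 4.7)]
L = len(psu)

H = sylvester_hadamard(8)  # 8 balanced replicates for 6 strata
t = len(H)

u_full = sum(a + b for a, b in psu) / (2 * L)  # full-sample mean

# Equation 4.3: average the half-sample/complement variances over t replicates.
v_brr = 0.0
for row in H:
    u = sum(a if e == 1 else b for e, (a, b) in zip(row, psu)) / L
    u_comp = sum(b if e == 1 else a for e, (a, b) in zip(row, psu)) / L
    v_brr += ((u - u_full) ** 2 + (u_comp - u_full) ** 2) / 2
v_brr /= t

# Direct stratified estimator for a paired design: sum of (d_h / 2)^2 / L^2.
v_direct = sum((a - b) ** 2 for a, b in psu) / (4 * L * L)
```

The agreement of v_brr and v_direct is a consequence of the orthogonality of the matrix columns: the cross-stratum terms cancel when summed over the balanced replicates.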
As in replicated sampling, BRR assumes that the PSUs are sampled with
replacement within strata, although in practice sampling without replacement
generally is used. Theoretically, this leads to an overestimation of variance
when applied to a sample selected without replacement, but the overestima-
tion is negligible in practice because the chance of selecting the same unit
more than once under sampling without replacement is low when the sam-
pling fraction is small. The sampling fraction in a paired selection design
(assumed in the BRR method) usually is small because only two PSUs are
selected from each stratum.
When used with a multistage selection design, BRR usually is applied
only to PSUs and disregards the subsampling within the PSUs. Such a prac-
tice is predicated on the fact that the sampling variance can be approximated
adequately from the variation between PSU totals when the first-stage sam-
pling fraction is small. This is known as the ultimate cluster approximation.
As shown in Kalton (1983, chap. 5), the unbiased variance estimator for a
simple two-stage selection design consists of a component from each of
the two stages, but the term for the second-stage component is multiplied by
the first-stage sampling fraction. Therefore, the second-stage contribution
becomes negligible as the first-stage sampling fraction decreases. This short-
cut procedure based only on PSUs is especially convenient in the preparation
of complex data files for public use as well as in the analysis of such data,
because detailed information on complex design features is not required
except for the first-stage sampling.
If the BRR technique is to be applied to other than the paired selection
designs, it is necessary to modify the data structure to conform to the tech-
nique. In many multistage surveys, stratification is carried out to a maximum
and only one PSU is selected from each stratum. In such a case, PSUs can
be paired to form collapsed strata to apply the BRR method. This procedure
generally leads to some overestimation of the variance because some of the
between-strata variability is now included in the within-stratum calculation.
The problem is not serious for the case of linear statistics if the collapsing is
carried out judiciously; however, the collapsing generally is not recommended
for estimating the variance of nonlinear statistics (see Wolter, 1985, p. 48). The
Taylor series approximation method discussed later may be used for the non-
linear statistics. Although it is not used widely, there is a method of construct-
ing orthogonal balancing for three PSUs per stratum (Gurney & Jewett, 1975).
Now let us apply the BRR technique to the 1984 GSS. As introduced in
the previous chapter, it used a multistage selection design. The first-stage
sampling consisted of selecting one PSU from each of 84 strata of counties
or county groups. The first 16 strata were large metropolitan areas and
designated as self-representing (or automatically included in the sample).
To use the BRR technique, the 84 strata are collapsed into 42 pairs of
pseudo-strata. Because the numbering of non–self-representing PSUs in the
data file approximately followed the geographic ordering of strata, pairing
was done sequentially, based on the PSU code. Thus, the 16 self-representing
strata were collapsed into 8 pseudo-strata, and the remaining 68 non–self-
representing strata into 34 pseudo-strata. This pairing of the self-representing
strata, however, improperly includes variability among them. To exclude this
and include only the variability within each of the self-representing strata, the
combined observations within each self-representing pseudo-stratum were
randomly grouped into two pseudo-PSUs.
To balance the half-sample replicates to be generated from the 42
pseudo-strata, an orthogonal matrix of order 44 (see Table 4.2) was used.
This matrix is filled with zeros and ones. To match with the 42 strata, the
first two columns were dropped (i.e., 44 rows for replicates and 42 columns
for pseudo-strata). A zero indicates the inclusion of the first PSU from the
strata, and a one denotes the inclusion of the second PSU. The rows are the
replicates, and the columns represent the strata. For example, the first repli-
cate contains the second PSU from each of the 42 pseudo-strata (because all
the elements in the first row are ones). Using the rows of the orthogonal
matrix, 44 replicates and 44 complement replicates were created.
To estimate the variance of a statistic from the full sample, we needed
first to calculate the statistic of interest from each of the 44 replicates and
complement replicates. In calculating the replicate estimates, the adjusted
sample weights were used. Table 4.3 shows the estimates of the propor-
tion of adults approving the "hitting" for the 44 replicates and their
complement replicates. The overall proportion was 60.0%. The samp-
ling variance of the overall proportion, estimated by Equation 4.3, is
0.000231. Comparing this with the sampling variance of the proportion
under the SRS design [pq/(n − 1) = 0.000163, ignoring FPC], we get the
design effect of 1.42 (= 0.000231/0.000163). The design effect indicates
that the variance of the estimated proportion from the GSS is 42% larger
than the variance calculated from an SRS of the same size. The variance
by Equation 4.4 also gives similar estimates.
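The design effect arithmetic can be checked directly from the rounded figures reported above (a sketch; the BRR variance is taken as given rather than recomputed from Table 4.3):

```python
p, n = 0.60, 1473       # proportion approving and sample size
v_brr = 0.000231        # Equation 4.3 estimate reported above

v_srs = p * (1 - p) / (n - 1)  # SRS variance of a proportion, ignoring FPC
deff = v_brr / v_srs           # design effect, about 1.42
```

Any design effect appreciably above 1 signals that SRS-based formulas would understate the sampling error of this survey.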
In summary, the BRR technique uses a pseudo-replication procedure to
estimate the sampling variance and is primarily designed for a paired selec-
tion design. It also can be applied to a complex survey, which selects one
PSU per stratum by pairing strata, but the pairing must be performed judi-
ciously, taking into account the actual sample selection procedure. In most
applications of BRR in the available software packages, the sample weights
of the observations in the selected PSUs for a replicate are doubled to make
up for the half of PSUs not selected. There is also a variant of BRR for
creating replicate weights, suggested by Fay (Judkins, 1990), which uses
2 − k or k times the original weight, depending on whether or not the PSU is
selected based on the orthogonal matrix (0 ≤ k < 1). This
will be illustrated further in the next chapter.
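Fay's weighting can be sketched as follows (synthetic data and our own notation). The replicate weights use the factor 2 − k for selected PSUs and k for the others, and the variance formula divides by (1 − k)². For a linear statistic such as a weighted total, the result is then the same for any k in [0, 1), so the sketch checks that k = 0.5 reproduces the ordinary BRR (k = 0) answer:

```python
# One (weight, value) observation per PSU, two PSUs per stratum.
strata = [((1.0, 3.1), (1.0, 2.7)), ((2.0, 4.0), (1.0, 4.4)),
          ((1.5, 2.5), (1.0, 2.9)), ((1.0, 5.2), (2.0, 4.6))]

# Orthogonal matrix of order 4 (rows = replicates, columns = strata).
H = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
t = len(H)

def weighted_total(factors):
    """Weighted total with per-PSU replicate weight multipliers."""
    return sum(f1 * w1 * y1 + f2 * w2 * y2
               for (f1, f2), ((w1, y1), (w2, y2)) in zip(factors, strata))

full = weighted_total([(1.0, 1.0)] * len(strata))

def fay_variance(k):
    v = 0.0
    for row in H:
        # Selected PSU keeps factor 2 - k; the other is down-weighted to k.
        factors = [(2 - k, k) if e == 1 else (k, 2 - k) for e in row]
        v += (weighted_total(factors) - full) ** 2
    return v / (t * (1 - k) ** 2)
```

The practical appeal of k > 0 is that no observation is ever dropped entirely from a replicate, which stabilizes the computation for statistics, such as quantiles within small domains, that behave badly when half the sample is discarded.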

Jackknife Repeated Replication

The idea of jackknifing was introduced by Quenouille (1949) as a
nonparametric procedure to estimate the bias, and later Tukey (1958) sug-
gested how that same procedure could be used to estimate variance. Durbin
(1959) first used this method in his pioneering work on ratio estimation. Later,
TABLE 4.2
Orthogonal Matrix of Order 44
Rows Columns (44)

1 11111111111111111111111111111111111111111111
2 10100101001110111110001011100000100011010110
3 10010010100111011111000101110000010001101011
4 11001001010011101111100010111000001000110101
5 11100100101001110111110001011100000100011010
6 10110010010100111011111000101110000010001101
7 11011001001010011101111100010111000001000110
8 10101100100101001110111110001011100000100011
9 11010110010010100111011111000101110000010001
10 11101011001001010011101111100010111000001000
11 10110101100100101001110111110001011100000100
12 10011010110010010100111011111000101110000010
13 10001101011001001010011101111100010111000001
14 11000110101100100101001110111110001011100000
15 10100011010110010010100111011111000101110000
16 10010001101011001001010011101111100010111000
17 10001000110101100100101001110111110001011100
18 10000100011010110010010100111011111000101110
19 10000010001101011001001010011101111100010111
20 11000001000110101100100101001110111110001011
21 11100000100011010110010010100111011111000101
22 11110000010001101011001001010011101111100010
23 10111000001000110101100100101001110111110001
24 11011100000100011010110010010100111011111000
25 10101110000010001101011001001010011101111100
26 10010111000001000110101100100101001110111110
27 10001011100000100011010110010010100111011111
28 11000101110000010001101011001001010011101111
29 11100010111000001000110101100100101001110111
30 11110001011100000100011010110010010100111011
31 11111000101110000010001101011001001010011101
32 11111100010111000001000110101100100101001110
33 10111110001011100000100011010110010010100111
34 11011111000101110000010001101011001001010011
35 11101111100010111000001000110101100100101001
36 11110111110001011100000100011010110010010100
37 10111011111000101110000010001101011001001010
38 10011101111100010111000001000110101100100101
39 11001110111110001011100000100011010110010010
40 10100111011111000101110000010001101011001001
41 11010011101111100010111000001000110101100100
42 10101001110111110001011100000100011010110010
43 10010100111011111000101110000010001101011001
44 11001010011101111100010111000001000110101100

SOURCE: Adapted from Wolter (1985, p. 328) with permission of the publisher.
TABLE 4.3
Estimated Proportions Approving One Adult Hitting Another in the
BRR Replicates: General Social Survey, 1984 (n = 1,473)
Estimate (%) Estimate (%)

Replicate Number Replicate Complement Replicate Number Replicate Complement

1 60.9 59.2 23 61.4 58.6
2 60.1 59.9 24 57.7 62.4
3 62.1 57.9 25 60.4 59.6
4 58.5 61.7 26 61.7 58.2
5 59.0 61.0 27 59.3 60.6
6 59.8 60.2 28 62.4 57.6
7 58.5 61.5 29 61.0 58.9
8 59.0 61.0 30 61.2 58.7
9 61.3 58.8 31 60.9 59.1
10 59.2 60.8 32 61.6 58.5
11 61.7 58.3 33 61.8 58.2
12 60.2 59.8 34 60.6 59.4
13 62.1 58.7 35 58.6 61.5
14 59.7 60.4 36 59.4 60.7
15 58.1 62.0 37 59.8 60.3
16 56.0 64.2 38 62.0 58.1
17 59.8 60.3 39 58.1 61.9
18 58.6 61.3 40 59.6 60.5
19 58.9 61.1 41 58.8 61.2
20 60.8 59.3 42 59.2 60.8
21 63.4 56.5 43 58.7 61.4
22 58.3 61.7 44 60.5 59.5

Overall estimate = 60.0

Variance estimates Variance Standard Error Design Effect
By Equation 4.3 0.000231 0.0152 1.42
By Equation 4.4 0.000227 0.0151 1.40

it was applied to computation of variance in complex surveys by Frankel
(1971) in the same manner as the BRR method and was named the jackknife
repeated replication (JRR). As with BRR, the JRR technique generally is applied
to PSUs within strata.
The basic principle of jackknifing can be illustrated by estimating
sampling variance of the sample mean from a simple random sample.
Suppose n = 5 and the sample values of y are 3, 5, 2, 1, and 4. The sample
mean then is ȳ = 3, and its sampling variance, ignoring the FPC, is

v(ȳ) = Σ(yᵢ − ȳ)²/[n(n − 1)] = 0.5.  (4.5)
The jackknife variance of the mean is obtained as follows.

1. Compute a pseudo sample mean, deleting the first sample value, which
   results in ȳ(1) = (5 + 2 + 1 + 4)/4 = 12/4. Now, by deleting the
   second sample value instead, we obtain the second pseudo-mean,
   ȳ(2) = 10/4; likewise, ȳ(3) = 13/4, ȳ(4) = 14/4, and ȳ(5) = 11/4.

2. Compute the mean of the five pseudo-values: ȳ(·) = Σȳ(i)/n =
   (60/4)/5 = 3, which is the same as the sample mean.

3. The variance can then be estimated from the variability among the
   five pseudo-means, each of which contains four observations,

   v(ȳ) = (n − 1)Σ(ȳ(i) − ȳ(·))²/n = 0.5,  (4.6)

which gives the same result as Equation 4.5.

The replication-based procedures have a distinct advantage: They can be
applied to estimators that are not expressible in terms of formulas, such as
the sample median as well as to formula-based estimators. No formula is
available for the sampling variance of the median, but the jackknife proce-
dure can offer an estimate. With the same example as used above, the sample
median is 3 and the five pseudo-medians are 3, 2.5, 3.5, 3.5, and 2.5
(the mean of these pseudo-medians is 3). The variance of the median is
estimated as 0.8, using Equation 4.6.
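Because the delete-one recipe is the same for any estimator, it is natural to write it once as a generic routine. The short Python sketch below (our illustration) reproduces the variances just derived, 0.5 for the mean and 0.8 for the median of the five observations:

```python
import statistics

def jackknife_variance(data, estimator):
    """Delete-one jackknife variance (Equation 4.6) for any estimator."""
    n = len(data)
    pseudo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(pseudo) / n
    return (n - 1) * sum((p - center) ** 2 for p in pseudo) / n

y = [3, 5, 2, 1, 4]
v_mean = jackknife_variance(y, lambda d: sum(d) / len(d))  # 0.5
v_median = jackknife_variance(y, statistics.median)        # 0.8
```

The same routine works for any estimator that can be recomputed on the reduced samples, which is exactly the flexibility claimed for the jackknife above.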
In the same manner, the jackknife procedure also can be applied to the
replicated sampling. We can remove replicates one at a time and compute
pseudo-values to estimate the jackknife variance, although this does not
offer any computational advantage in this case. But it also can be applied to
any random groups that are formed from any probability sample. For
instance, a systematic sample can be divided into random or systematic sub-
groups for the jackknife procedure. For other sample designs, random
groups can be formed following the practical rules suggested by Wolter
(1985, pp. 31–33). The basic idea is to form random groups in such a way
that each random group has the same sample design as the parent sample.
This requires detailed information on the actual sample design, but unfortu-
nately such information usually is not available in most public-use survey
data files. The jackknife procedure is, therefore, usually applied to PSUs
rather than to random groups.
For a paired selection design, the replicate is formed by removing one PSU
from a stratum and weighting the remaining PSU to retain the stratum’s
proportion in the total sample. The complement replicate is formed in the
same manner by exchanging the removed and retained PSU in the stratum.
A pseudo-value is estimated from each replicate. For a weighted sample, the
sample weights in the retained PSU need to be inflated to account for the
observations in the removed PSU. The inflated weight is obtained by dividing
the sum of the weights in the retained PSU by a factor (1 − wd /wt ), where
wd is the sum of weights in the deleted PSU and wt is the sum of weights in
all the PSUs in that stratum. The factor represents the complement of the
deleted PSU’s proportion of the total stratum weight. Then the variance of a
sample statistic in a paired selection design calculated from the full sample
can be estimated from the pseudo-values uₕ and complement pseudo-values uₕ′
in stratum h by

v(ū) = Σₕ[(uₕ − ū)² + (uₕ′ − ū)²]/2 = Σₕ(uₕ − uₕ′)²/4.  (4.7)

This estimator has the same form as Equation 4.3 and can be modified to
include one replicate, without averaging with the complement, from each
stratum, as in Equation 4.4 for the BRR method, which gives
v(ū) = Σₕ(uₕ − ū)².  (4.8)

The JRR is not restricted to a paired selection design but is applicable to
any number of PSUs per stratum. If we let uₕᵢ be the estimate of U from the
h-th stratum and i-th replicate, nₕ be the number of sampled PSUs in the
h-th stratum, and rₕ be the number of replicates formed in stratum h, then
the variance is estimated by

v(ū) = Σₕ[(nₕ − 1)/rₕ] Σᵢ(uₕᵢ − ū)²,  (4.9)

where the outer sum runs over the L strata and the inner sum over the rₕ
replicates formed in stratum h.

If each of the PSUs in stratum h is removed to form a replicate, rₕ = nₕ in
each stratum, but the formation of nₕ replicates in the h-th stratum is not required.
When the number of strata is large and nₕ is two or more, the computation
can be reduced by using only one replicate in each stratum. However, a suffi-
cient number of replicates must be used in analytic studies to ensure adequate
degrees of freedom.
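A delete-one-PSU JRR for a paired design can be sketched as follows (synthetic data and names of our own; the weight-inflation factor 1/(1 − wd/wt) and Equation 4.7 follow the description above, with a weighted mean as the statistic):

```python
def jrr_variance(strata):
    """Equation 4.7 for a paired design: v = sum_h (u_h - u_h')^2 / 4,
    where each stratum is a pair of PSUs, each a list of (weight, value)."""
    def ratio_mean(groups):
        num = sum(w * y for g in groups for w, y in g)
        den = sum(w for g in groups for w, y in g)
        return num / den

    def replicate(h, drop):
        groups = []
        for i, (a, b) in enumerate(strata):
            if i != h:
                groups += [a, b]
            else:
                dropped, kept = (a, b) if drop == 0 else (b, a)
                wd = sum(w for w, _ in dropped)
                wt = wd + sum(w for w, _ in kept)
                # Inflate the retained weights by 1 / (1 - wd/wt).
                groups.append([(w / (1.0 - wd / wt), y) for w, y in kept])
        return ratio_mean(groups)

    return sum((replicate(h, 0) - replicate(h, 1)) ** 2
               for h in range(len(strata))) / 4

strata = [
    ([(1.0, 3.1), (2.0, 2.9)], [(1.0, 2.7), (1.0, 3.3)]),
    ([(1.5, 4.0)],             [(1.0, 4.4), (2.0, 4.1)]),
    ([(1.0, 2.5), (1.0, 2.2)], [(2.0, 2.9)]),
]
v_jrr = jrr_variance(strata)

# Sanity check: with one equal-weight observation per PSU, Equation 4.7
# reduces to sum(d_h^2) / (4 L^2) for the mean.
simple = [([(1.0, 3.0)], [(1.0, 2.0)]),
          ([(1.0, 5.0)], [(1.0, 4.0)]),
          ([(1.0, 1.0)], [(1.0, 2.0)])]
v_simple = jrr_variance(simple)  # (1 + 1 + 1) / 36
```

The inflation step is what keeps each stratum's share of the total weight intact after a PSU is deleted; without it, the replicate estimates would be biased toward the strata with no deletion.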
Table 4.4 shows the results of applying the JRR technique to the collapsed
paired design of the 1984 GSS used in the BRR computation. Estimated pro-
portions of adults approving "the hitting of other adults" are shown for the
42 jackknife replicates and their complements. Applying Equation 4.7, we
obtain a variance estimate of 0.000238 with a design effect of 1.46, and these
are about the same as those obtained by the BRR technique. Using only the
42 replicates and excluding the complements (Equation 4.8), we obtain a
variance estimate of 0.000275 with a design effect of 1.68.
TABLE 4.4
Estimated Proportions Approving One Adult Hitting Another
in the JRR Replicates: General Social Survey, 1984
Estimate (%) Estimate (%)

Replicate Number Replicate Complement Replicate Number Replicate Complement

1 60.2 59.8 22 60.3 60.0
2 60.2 59.8 23 60.0 60.0
3 60.0 60.0 24 60.4 59.6
4 60.3 59.8 25 60.1 59.8
5 60.0 60.1 26 59.8 60.3
6 59.9 60.1 27 59.9 60.1
7 60.0 60.0 28 60.1 60.0
8 60.0 60.0 29 59.5 60.3
9 59.9 60.2 30 59.9 60.1
10 60.1 60.0 31 59.6 60.2
11 59.8 60.2 32 60.5 59.6
12 59.9 60.1 33 60.1 59.9
13 59.8 60.2 34 60.3 59.8
14 60.0 60.1 35 60.1 59.8
15 59.6 60.5 36 60.2 59.8
16 60.4 59.6 37 60.0 60.0
17 59.9 60.0 38 59.6 60.4
18 59.8 60.2 39 59.9 60.1
19 59.8 60.2 40 60.5 59.6
20 59.9 60.1 41 60.4 59.8
21 60.0 60.0 42 60.7 59.4

Overall estimate = 60.0

Variance estimates Variance Standard Error Design Effect
By Equation 4.7 0.000238 0.0152 1.46
By Equation 4.8 0.000275 0.0166 1.68

From a closer examination of the data in Table 4.4, one may get the impression
that there is less variation among the JRR replicate estimates than among the
BRR replicate estimates in Table 4.3. We should note, however, that the JRR
represents a different strategy that uses a different method to estimate the var-
iance. Note that Equation 4.3 for the BRR includes the number of replicates
(t) in the denominator, whereas Equation 4.7 for the JRR is not dependent on
the number of replicates. The reason is that in the JRR, the replicate estimates
themselves are dependent on the number of replicates formed. Because the
replicate is formed by deleting one unit, the replicate estimate would be closer to
the overall estimate when a large number of units is available to form the
replicates, compared to the situation where a small number of units is used.
Therefore, there is no reason to include the number of replicates in
Equations 4.7 and 4.8. However, the number of replicates needs to be
taken into account when the number of replicates used is smaller than the
total number of PSUs, as in Equation 4.9.
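For concreteness, the two estimators can be sketched in code. The algebraic forms below are assumptions consistent with the discussion above: Equation 4.8 sums the squared deviations of the replicate estimates from the overall estimate, and Equation 4.7 averages that sum with the corresponding sum over the complements; neither is divided by the number of replicates. The numbers are invented, not the GSS replicates.

```python
# Hedged sketch of JRR variance estimation for a paired (two-PSUs-per-
# stratum) design. Assumed forms: Equation 4.8 sums squared deviations of
# the replicate estimates from the overall estimate; Equation 4.7 averages
# in the complements. Neither sum is divided by the number of replicates.

def jrr_variances(replicates, complements, overall):
    """Return (v_eq47, v_eq48) from replicate and complement estimates."""
    v_eq48 = sum((r - overall) ** 2 for r in replicates)
    v_eq47 = 0.5 * (v_eq48 + sum((c - overall) ** 2 for c in complements))
    return v_eq47, v_eq48

# Toy proportions for 3 pseudo-strata (hypothetical values)
reps = [0.602, 0.598, 0.601]
comps = [0.599, 0.603, 0.600]
v_eq47, v_eq48 = jrr_variances(reps, comps, 0.600)
```

Applied to the 42 replicate and complement estimates in Table 4.4, a computation of this kind should correspond to the 0.000238 and 0.000275 reported there, up to the rounding of the tabled percentages.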
In summary, the JRR technique is based on a pseudo-replication method
and can estimate sampling variances from complex sample surveys. No
restrictions on the sample selection design are needed, but forming replicates
requires considerable care and must take into account the original sample
design. As noted, this detailed design information is seldom available to sec-
ondary data analysts. For instance, if more information on ultimate clusters
had been available in the GSS data file, we could have formed more convin-
cing random groups adhering more closely to actual sample design rather
than applying the JRR technique to a collapsed paired design.

The Bootstrap Method


Closely related to BRR and JRR is the bootstrap method popularized by
Efron (1979). The basic idea is to create replicates of the same size and struc-
ture as in the design by repeatedly resampling the PSUs in the observed data.
Applying the bootstrap method to 84 PSUs in 42 pseudo-strata in the GSS
data, one will sample 84 PSUs (using a with-replacement sampling proce-
dure), two from each stratum. In some strata, the same PSU may be selected
twice. The sampling is repeated a large number of times, a minimum of 200
(referred to as B) times (Efron & Tibshirani, 1993, sec. 6.4). However, a
much larger number of replications usually is required to get a less variable
estimate (Korn & Graubard, 1999, p. 33). For each replicate created (ui ), the
parameter estimate is calculated. Then the bootstrap estimate of the variance
of the mean of all replicate estimates is given by

        v(ū) = (1/B) Σ (uᵢ − ū)²,  i = 1, . . . , B.        (4.10)

This estimator needs to be corrected for bias by multiplying it by
(n − 1)/n. When n is small, the bias can be substantial. In our example,
there are two PSUs in each stratum, and the estimated variance needs to be
halved. An alternative approach to correct the bias is to resample (nh − 1)
PSUs in stratum h and multiply the sample weights of the observations in
the resampled PSUs by nh/(nh − 1) (Efron, 1982, pp. 62–63). In our
example, this will produce half-sample replicates as in BRR. The bootstrap
estimate based on at least 200 replicates will then be about the same as the
BRR estimate based on 44 half-sample replicates. Because of the large

number of replications required in the bootstrap method, this method has


not yet been used extensively for variance estimation in complex survey
analysis.
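The resampling scheme just described can be sketched as follows. This is an illustration, not the authors' code: it resamples nh − 1 PSUs per stratum with replacement, rescales the weights by nh/(nh − 1) as in the Efron (1982) variant, and applies Equation 4.10 to the replicate estimates. The data layout and values are hypothetical.

```python
# Hedged sketch of a stratified bootstrap for a weighted proportion.
# Each stratum holds 2 PSUs; each PSU is a list of (weight, y) pairs.
# Resampling n_h - 1 PSUs with weights rescaled by n_h/(n_h - 1) yields
# half-sample-like replicates, as described in the text. Data are invented.
import random

strata = {
    1: [[(1.0, 1), (1.0, 0)], [(1.0, 1), (1.0, 1)]],
    2: [[(1.0, 0), (1.0, 1)], [(1.0, 0), (1.0, 0)]],
}

def bootstrap_variance(strata, B=200, seed=1):
    rng = random.Random(seed)
    estimates = []
    for _ in range(B):
        obs = []
        for psus in strata.values():
            n_h = len(psus)
            scale = n_h / (n_h - 1)           # weight rescaling factor
            for _ in range(n_h - 1):           # resample n_h - 1 PSUs
                for w, y in rng.choice(psus):
                    obs.append((w * scale, y))
        num = sum(w * y for w, y in obs)
        den = sum(w for w, y in obs)
        estimates.append(num / den)
    mean = sum(estimates) / B                  # Equation 4.10
    return sum((e - mean) ** 2 for e in estimates) / B

v = bootstrap_variance(strata)
```

Because the resampling is random, different seeds (or different users) will produce slightly different variance estimates, which is the reproducibility drawback noted below.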
Various procedures of applying the bootstrap method for variance
estimation and other purposes have been suggested (Kovar, Rao, & Wu, 1988;
Rao & Wu, 1988; Sitter, 1992). Although the basic methodology is widely
known, many different competing procedures have emerged in selecting boot-
strap samples. For example, Chao and Lo (1985) suggested duplicating each
observation in the host sample N/n times to create the bootstrap population
for simple random sampling without replacement. For sampling plans with
unequal probability of selection, the replication of the observations needs to
be proportionate to the sample weight; that is, the bootstrap sample should be
selected using the PPS procedure. These options and the possible effects of
deviating from the fundamental assumption of independent and identically
distributed samples have not been thoroughly investigated.
Although it is promising for handling many statistical problems, the
bootstrap method appears less practical than BRR and JRR for estimating
the variance in complex surveys, because it requires such a large number of
replicates. Although BRR and JRR will produce the same results when
applied by different users, the bootstrap results may vary for different users
and at different tries by the same user, because the replication procedure is
likely to yield different results each time. As Tukey (1986, p. 72) put it,
‘‘For the moment, jackknifing seems the most nearly realistic approach to
assessing many of the sources of uncertainty’’ when compared with boot-
strapping and other simulation methods. The bootstrap method is not imple-
mented in the available software packages for complex survey analysis
at this time, although it is widely used in other areas of statistical computing.

The Taylor Series Method (Linearization)


The Taylor series expansion has been used in a variety of situations in
mathematics and statistics. One early application of the series expansion was
designed to obtain an approximation to the value of functions that are hard to
calculate: for example, the exponential (e^x) or logarithmic [log(x)] function.
This application was in the days before calculators had special function keys
and when we did not have access to the appropriate tables. The Taylor series
expansion for e^x involves taking the first- and higher-order derivatives of e^x
with respect to x; evaluating the derivatives at some value, usually zero; and
building up a series of terms based on the derivatives. The expansion for e^x is

        1 + x + x²/2! + x³/3! + x⁴/4! + . . .

This is a specific application of the following general formula, expanded
at a:

        f(x) = f(a) + f′(a)(x − a) + f″(a)(x − a)²/2! + f‴(a)(x − a)³/3! + . . .
In statistics, the Taylor series is used to obtain an approximation to some
nonlinear function, and then the variance of the function is based on the
Taylor series approximation to the function. Often, the approximation
provides a reasonable estimate to the function, and sometimes the
approximation is even a linear function. This idea of variance estimation has
several names in the literature, including the linearization method, the delta
method (Kalton, 1983, p. 44), and the propagation of variance (Kish, 1965,
p. 583).
In statistical applications, the expansion is evaluated at the mean or expec-
ted value of x, written as E(x). If we use E(x) for a in the above general
expansion formula, we have

        f(x) = f[E(x)] + f′[E(x)][x − E(x)] + f″[E(x)][x − E(x)]²/2! + . . .

The variance of f(x) is V[f(x)] = E[f²(x)] − E²[f(x)] by definition,
and using the Taylor series expansion, we have

        V[f(x)] = {f′[E(x)]}² V(x) + . . .        (4.11)

The same ideas carry over to functions of more than one random variable.
In the case of a function of two variables, the Taylor series expansion
yields

        V[f(x₁, x₂)] ≅ Σᵢ Σⱼ (∂f/∂xᵢ)(∂f/∂xⱼ) Cov(xᵢ, xⱼ),  i, j = 1, 2.        (4.12)
Applying Equation 4.12 to a ratio of two variables x and y—that is,
r = y/x—we obtain the variance formula for a ratio estimator

        V(r) = [V(y) + r² V(x) − 2r Cov(x, y)]/x² + . . .
             = r² [V(y)/y² + V(x)/x² − 2 Cov(x, y)/(xy)] + . . .
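As a numeric check of the ratio-estimator variance, the first-order approximation can be evaluated directly; the two algebraically equivalent forms must agree. The moment values below are invented purely for illustration.

```python
# Minimal numeric sketch of the linearized (Taylor series) variance of a
# ratio r = y/x, in both algebraically equivalent forms given in the text.
# The means, variances, and covariance are hypothetical values.

def ratio_variance(y, x, var_y, var_x, cov_xy):
    """First form: [V(y) + r^2 V(x) - 2 r Cov(x, y)] / x^2."""
    r = y / x
    return (var_y + r**2 * var_x - 2 * r * cov_xy) / x**2

def ratio_variance_alt(y, x, var_y, var_x, cov_xy):
    """Second form: r^2 [V(y)/y^2 + V(x)/x^2 - 2 Cov(x, y)/(xy)]."""
    r = y / x
    return r**2 * (var_y / y**2 + var_x / x**2 - 2 * cov_xy / (x * y))

v1 = ratio_variance(50.0, 100.0, 4.0, 9.0, 1.5)
v2 = ratio_variance_alt(50.0, 100.0, 4.0, 9.0, 1.5)
```

With these inputs both forms give 4.75 × 10⁻⁴, confirming the algebra.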
Extending Equation 4.12 to the case of c random variables, the approximate
variance of θ̂ = f(x₁, x₂, . . . , x_c) is

        V(θ̂) ≅ Σᵢ Σⱼ (∂f/∂xᵢ)(∂f/∂xⱼ) Cov(xᵢ, xⱼ).        (4.13)

TABLE 4.5
Standard Errors Estimated by Taylor Series Method for
Percentage Approving One Adult Hitting Another:
General Social Survey, 1984 (n = 1,473)

Subgroup                          Estimate (%)   Standard Error (%)   Design Effect

Overall 60.0 1.52 1.41


Gender Male 63.5 2.29 1.58
Female 56.8 1.96 1.21
Race White 63.3 1.61 1.43
Nonwhite 39.1 3.93 1.30
Education Some college 68.7 2.80 1.06
High school graduate 63.3 2.14 1.55
All others 46.8 2.85 1.27

Applying Equation 4.13 to a weighted estimator f(Ŷ₁, Ŷ₂, . . . , Ŷ_c),
where Ŷⱼ = Σᵢ wᵢ yᵢⱼ, j = 1, 2, . . . , c, involving c variables in a sample
of n observations, Woodruff (1971) showed that

        V(θ̂) ≅ V[ Σᵢ wᵢ Σⱼ (∂f/∂yⱼ) yᵢⱼ ].        (4.14)
This alternative form of the linearized variance of a nonlinear estimator
offers computational advantages because it bypasses the computation of
the c × c covariance matrix in Equation 4.13. This convenience of convert-
ing a multistage estimation problem into a univariate problem is realized by
a simple interchange of summations. This general computational procedure
can be applied to a variety of nonlinear estimators, including regression
coefficients (Fuller, 1975; Tepping, 1968).
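Woodruff's interchange of summations can be sketched for a ratio of two weighted totals, f(Ŷ₁, Ŷ₂) = Ŷ₁/Ŷ₂: each observation is collapsed into one linearized variable z, and the variance of the weighted total of z is then estimated from PSU totals within strata. The data and the with-replacement two-PSUs-per-stratum variance formula below are illustrative assumptions, not the book's worked example.

```python
# Hedged sketch of Woodruff's linearization for a ratio of weighted totals,
# f = Y1_hat / Y2_hat. Each observation is reduced to a single variable
# z = (df/dy1) y1 + (df/dy2) y2, and V(sum of w*z) is estimated with a
# standard with-replacement between-PSU formula. Data are invented.

# (stratum, psu) -> list of (weight, y1, y2) observations
data = {
    (1, 1): [(1.0, 2.0, 4.0), (1.0, 1.0, 3.0)],
    (1, 2): [(1.0, 3.0, 5.0)],
    (2, 1): [(1.0, 2.0, 2.0)],
    (2, 2): [(1.0, 1.0, 4.0), (1.0, 2.0, 3.0)],
}

Y1 = sum(w * y1 for obs in data.values() for w, y1, y2 in obs)
Y2 = sum(w * y2 for obs in data.values() for w, y1, y2 in obs)
d1, d2 = 1.0 / Y2, -Y1 / Y2 ** 2      # partials of y1/y2 at the estimates

# weighted PSU totals of the linearized variable z
z = {key: sum(w * (d1 * y1 + d2 * y2) for w, y1, y2 in obs)
     for key, obs in data.items()}

# between-PSU variance within strata:
# v = sum over h of n_h/(n_h - 1) * sum over i of (z_hi - zbar_h)^2
v = 0.0
for h in {s for s, p in data}:
    zs = [z[key] for key in z if key[0] == h]
    n_h, zbar = len(zs), sum(zs) / len(zs)
    v += n_h / (n_h - 1) * sum((zi - zbar) ** 2 for zi in zs)
```

The computational point is that the c-variable covariance matrix of Equation 4.13 never appears; only the single linearized variable z is carried through the design-based variance formula.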
For a complex survey, this method of approximation is applied to PSU
totals within the stratum. That is, the variance estimate is a weighted combi-
nation of the variation in Equation 4.14 across PSUs within the same stratum.
These formulas are complex but can require much less computing time than
the replication methods discussed above. This method can be applied to any
statistic that is expressed mathematically—for example, the mean or the
regression coefficient—but not to such nonfunctional statistics as the median
and other percentiles.
We now return to the GSS example of estimating the variance of sample
proportions. Table 4.5 shows the results of applying the Taylor series method

to the proportion of adults approving the hitting of other adults, analyzed by


gender, race, and level of education. The proportion is computed as a ratio of
weighted sums of all positive responses to the sum of all the weights. Its stan-
dard error is computed applying Equation 4.14 modified to include the PSUs
and strata. The design effect for the overall proportion is 1.41, which is about
the same as those estimated by using the other two methods, whose results
are shown in Tables 4.3 (BRR) and 4.4 (JRR). The estimated proportion var-
ies by gender, race, and level of education. Because the subgroup sizes are
small, the standard errors for the subgroups are larger than that for the overall
estimate. In addition, the design effects for subgroup proportions are different
from that for the overall estimate.
In this chapter, we presented several methods of estimating variance for
statistics from complex surveys (for further discussion, see Rust and Rao,
[1996]). Examples from GSS and other surveys tend to show that the design
effect is greater than one in most complex surveys. Additional examples can
be found in E. S. Lee, Forthofer, and Lorimor (1986) and Eltinge, Parsons, and
Jang (1997). Examples in Chapter 6 will demonstrate the importance of using
one of the methods reviewed above in the analysis of complex survey data.

5. PREPARING FOR SURVEY DATA ANALYSIS

The preceding chapters have concentrated on the complexity of survey


designs and techniques for variance estimation for these designs. Before
applying the sample weights and the methods for assessing the design effect,
one must understand the survey design and the data requirements for the esti-
mation of the statistics and the software intended to be used. These require-
ments are somewhat more stringent for complex survey data than for data
from an SRS because of the weights and design features used in surveys.

Data Requirements for Survey Analysis


As discussed in Chapter 3, the weight and the design effect are basic
ingredients needed for a proper analysis of survey data. In preparing for an
analysis of survey data from a secondary source, it is necessary to include
the weights and the identification of sampling units and strata in the work-
ing data file, in addition to the variables of interest. Because these design-
related data items are labeled differently in various survey data sources, it is
important to read the documentation or consult with the source agency or
person to understand the survey design and the data preparation procedures.
The weights usually are available in major survey data sources. As noted
earlier, the weights reflect the selection probabilities and the adjustments

for nonresponse and poststratification. The weights generally are expressed


as expansion weights, which add up to the population size, and in certain
analyses it may be more convenient to convert them to the relative weights,
which add up to the sample size. In some survey data, several weight vari-
ables are included to facilitate proper use of the data, which may be segmen-
ted for different target populations or subsampled for certain data items. It is
necessary to choose appropriate weights for different uses of the data,
through a careful review of the documentation. For some surveys, the weight
is not explicitly labeled as such, and it is necessary to study the sample design
to realize the weight. As seen in Chapter 4, in the GSS the weight was derived
from the number of adults in the household. It was also necessary to perform
poststratification adjustments to make the demographic composition of the
sample comparable to the population as in Table 3.1. If the weight is not
available even after contact with the provider of the data, the user cannot
assume a self-weighting sample. If the data are used without the weight, the
user must take responsibility for clearly acknowledging this when reporting
findings. It is hard to imagine analyzing survey data without the weight, even
in model-based analysis (although used differently from the design-based
analysis), when one recognizes the likelihood of unequal selection probabil-
ity and differential response rates in subgroups.
The calculation of the design effect usually requires information on the
first-stage selection procedure, that is, the identification of strata and PSUs,
although secondary sampling units and associated strata may be required
for certain nested designs. If one PSU is selected from each stratum as in
the GSS, the stratum identification is the same as the PSU identification.
If stratification is not used or the stratum identification is not available from
the data, one can perform the analysis assuming an unrestricted cluster sam-
pling. If there is no information on the stratum and PSU, it is important to
investigate whether treating the data as SRS is reasonable for the given
sample design.
When stratification is used and the stratum identification is available, we
need to make sure that at least two PSUs are available in each stratum.
Otherwise, it is not possible to estimate the variance. If only one cluster is
selected from each stratum, it is necessary to pair the strata to form pseudo-
strata. Pairing the strata requires good understanding of the sample design.
An example of a particular strategy for pairing, using the GSS, was pre-
sented in Chapter 3. In the absence of any useful information from the data
document, a random pairing may be acceptable. Stanek and Lemeshow
(1977) have investigated the effect of pairing based on the National Health
Examination Survey and found that variance estimates for the weighted
mean and combined ratio estimate were insensitive to different pairings of
the strata, but this conclusion may not apply to all surveys.

Importance of Preliminary Analysis


Survey data analysis begins with a preliminary exploration to see whether
the data are suitable for a meaningful analysis. One important consideration
in the preliminary examination of a secondary data source is to examine
whether there is a sufficient number of observations available in the various
subgroups to support the proposed analysis. Based on the unweighted
tabulations, the analyst determines whether sample sizes are large enough
and whether categories of the variables need to be collapsed. The unweighted
tabulations also give the number of the observations with missing values and
those with extreme values, which could indicate either measurement errors
or errors of transcription.
Although the unit nonresponse adjustment is handled by the data collec-
tion agency when developing the sample weight, the analyst must deal with
the missing values (item nonresponse). With a small number of missing
values, one can ignore the respondents with missing values for the analysis.
Instead of completely excluding the observations with missing values, how-
ever, the design-based survey analysis requires use of the entire data set
by setting the weights to zero for the observations with missing values. This
is necessary to estimate the variance that is inherent in the sample design.
Although the point estimate would be the same either excluding or setting the
weights to zero for the observations with missing values, the estimated var-
iance would be different. The exclusion tends to underestimate the variance.
If the amount of item nonresponse is not trivial, then ignoring missing
values would set too many weights to zero, and the original weighting
scheme would be destroyed. This can lead to bias, and it will no longer be
possible to refer accurately to the target population. One method of han-
dling this problem is to inflate the weights of the observations without miss-
ing values to compensate for the ignored observations. When performing this
type of adjustment, it generally is assumed that there is no systematic pattern
among the subjects with missing values, but this assumption may not be
valid. For example, if all the subjects with missing values were males or all
fell into a limited age range, then it would be inappropriate simply to inflate
the weight of the remaining observations. An alternative to the weight adjust-
ment is to impute the missing values by some reasonable method, although it
is not necessarily a better solution than the weight adjustment.
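The weight-inflation adjustment described above can be sketched as follows. The adjustment cells (a hypothetical gender variable) and records are invented; the property demonstrated is that each cell's weight total is preserved while cases with missing values receive weight zero.

```python
# Hedged sketch of an item-nonresponse weight adjustment: within each
# adjustment cell, the weights of complete cases are inflated so the
# cell's total weight is preserved, and missing cases get weight zero.
# Records are (weight, cell, y), with y = None marking item nonresponse.

records = [
    (2.0, "M", 1), (1.0, "M", None), (1.0, "M", 0),
    (1.0, "F", 1), (2.0, "F", 0), (1.0, "F", None),
]

def adjust_weights(records):
    cells = {}
    for w, g, y in records:
        tot, resp = cells.get(g, (0.0, 0.0))
        cells[g] = (tot + w, resp + (w if y is not None else 0.0))
    out = []
    for w, g, y in records:
        tot, resp = cells[g]
        out.append((0.0 if y is None else w * tot / resp, g, y))
    return out

adj = adjust_weights(records)
total_before = sum(w for w, g, y in records)
total_after = sum(w for w, g, y in adj)
```

As the text cautions, this treats nonresponse as ignorable within cells; if missingness is systematic (e.g., concentrated in one sex or age range), such inflation would be inappropriate.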
Imputation is not a routine statistical task. There are many ways to
impute missing values (Kalton & Kasprzyk, 1986; Little & Rubin, 2002).
It is essential to understand the mechanism that leads to missing values before
choosing a particular method of imputation. In some situations, simple pro-
cedures can be used. For example, an extra category can be created for
missing values for categorical variables. Another simple procedure would

be mean imputation for continuous variables, but this procedure will distort
the shape of the distribution. To preserve the shape of the distribution, one
can use hot deck imputation, regression imputation, or multiple imputation.
Some simple illustrations of imputation will be presented in the next chapter
without going into detailed discussion. If imputation is used, the variance
estimators may need some adjustment (Korn & Graubard, 1999, sec. 5.5),
but that topic is beyond the scope of this book.
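A minimal hot-deck sketch follows: within each imputation class, a missing value is filled in with the value of a randomly chosen respondent (a "donor") from the same class. The class labels and values are hypothetical, and real applications would choose classes and donors much more carefully.

```python
# Hedged sketch of hot deck imputation within imputation classes.
# values: observations with None marking missing; classes: parallel labels.
import random

def hot_deck(values, classes, seed=42):
    rng = random.Random(seed)
    donors = {}
    for v, c in zip(values, classes):
        if v is not None:
            donors.setdefault(c, []).append(v)   # pool of donors per class
    return [v if v is not None else rng.choice(donors[c])
            for v, c in zip(values, classes)]

filled = hot_deck([10, None, 12, 30, None, 28],
                  ["a", "a", "a", "b", "b", "b"])
```

Unlike mean imputation, each missing value receives an actually observed value from its own class, which helps preserve the shape of the distribution.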
Prior to any substantive analysis, it is also necessary to examine whether
each of the PSUs has a sufficient number of observations. It is possible that
some PSUs may contain only a few observations, or even none, because of
nonresponse and exclusion of missing values. A PSU with none of or only a
few observations may be combined with an adjacent PSU within the same
stratum. The stratum with a single PSU as a result of combining PSUs may
then be combined with an adjacent stratum. However, collapsing too many
PSUs and strata destroys the original sample design. The resulting data analy-
sis may be of questionable value, because it is no longer possible to determine
what population is represented by the sample resulting from the combined
PSUs and strata. The number of observations that is needed in each
PSU depends on the type of analysis planned. The required number will
be larger for analytic studies than for estimation of descriptive statistics.
A general guideline is that the number should be large enough to estimate the
intra-PSU variance for the given estimate.
To illustrate this point, we consider the GSS data. An unweighted tabula-
tion by stratum and PSU has shown that the number of observations in the
PSU ranges from 8 to 49, with most of the frequencies being larger than 13,
indicating that the PSUs probably are large enough for estimating variances
for means and proportions. For an analytic study, we may want to investi-
gate the percentage of adults approving of hitting, by education and gender.
For this analysis we need to determine if there are a number of PSUs with-
out observations in a particular education-by-gender category. If there are
many PSUs with no observation for some education-by-gender category,
this calls into question the estimation of the variance–covariance matrix,
which is based on the variation in the PSU totals within the strata. The
education (3 levels) by gender (2 levels) tabulation by PSU showed that
42 of the 84 PSUs had at least one gender-by-education cell with no observa-
tions. Even after collapsing education into two categories, we will have to
combine nearly half of the PSUs. Therefore, we should not attempt to investi-
gate, simultaneously, the gender and education variables in relation to the
question about hitting. However, it is possible to analyze gender or education
alone in relation to hitting without combining many PSUs and strata.
Subgroup analysis of complex survey data cannot be conducted by select-
ing out the observations in the analytic domain. Although the case selection

would not alter the basic weights, it might destroy the basic sample design.
For example, selecting one small ethnic group may eliminate a portion of the
PSUs and reduce the number of observations substantially in the remaining
PSUs. As a result, it would be difficult to assess, from the subset, the design
effect inherent in the basic design. Even though the basic design is not totally
destroyed, selecting out observations from a complex survey sample may
lead to an incorrect estimation of variance, as explained above in conjunction
with the method of handling missing values. Correct estimation of variance
requires keeping the entire data set in the analysis and assigning weights of
zero to observations outside the analytic domain. The subpopulation analy-
sis procedures available in software packages are designed to conduct a
subgroup analysis without selecting out the observations in the analytic
domain. The subpopulation analysis will be discussed further in Chapter 6.
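The zero-weight device can be sketched as follows: all records are retained, weights outside the analytic domain are set to zero, and every stratum and PSU therefore remains in the file for variance estimation. The records and domain indicator are invented.

```python
# Hedged sketch of domain (subgroup) analysis via zero weights rather
# than subsetting. Records are (stratum, psu, weight, in_domain, y);
# all values are hypothetical.

records = [
    (1, 1, 1.5, True, 4.0), (1, 1, 1.0, False, 9.0),
    (1, 2, 2.0, True, 6.0),
    (2, 1, 1.0, False, 5.0), (2, 2, 1.2, True, 5.0),
]

# keep every record; zero the weight outside the analytic domain
domain = [(s, p, (w if d else 0.0), y) for s, p, w, d, y in records]

# the point estimate uses only in-domain weight
mean = (sum(w * y for s, p, w, y in domain) /
        sum(w for s, p, w, y in domain))

# every PSU is still present, even those with no domain members
psus_kept = {(s, p) for s, p, w, y in domain}
```

The point estimate equals what subsetting would give, but because all strata and PSUs remain (here PSU (2, 1), which has no domain members), the design-based variance can still be computed correctly.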
The first step in a preliminary analysis is to explore the basic distributions
of key variables. The tabulations may point out the need for refining opera-
tional definitions of variables and for combining categories of certain vari-
ables. Based on summary statistics, one may learn about interesting patterns
and distributions of certain variables in the sample. After analyzing the vari-
ables one at a time, we can next investigate the existence of relations to screen
out variables that are clearly not related to one another or to some dependent
variables. It may be possible to conduct a preliminary exploration using the
standard graphic SRS-based statistical methods. Given the role of weights in
survey data, however, any preliminary analysis ignoring the weights may not
accomplish the goal of a preliminary analysis. One way to conduct a prelimin-
ary analysis taking weights into account is to select a subsample of manage-
able size, with selection probability proportional to the magnitude of the
weights, and to explore the subsample using the standard statistical and graphic
methods. This procedure will be illustrated in the first section of Chapter 6.
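One way to implement such a weighted preliminary subsample is to draw records with replacement, with selection probability proportional to the weight; the standard library's weighted sampling is used here, and the records and weights are hypothetical.

```python
# Hedged sketch of a PPS (probability proportional to size) subsample for
# preliminary exploration: records with larger weights are more likely to
# be drawn, so the subsample can be explored with ordinary SRS-based tools.
import random

def pps_subsample(records, weights, m, seed=7):
    """Draw m records with replacement, probability proportional to weight."""
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=m)

sample = pps_subsample(["a", "b", "c"], [5.0, 1.0, 1.0], m=20)
```

Record "a" carries 5/7 of the total weight, so it should dominate the subsample; the subsample is for exploration only, not for final design-based estimates.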

Choices of Method for Variance Estimation


Incorporating the design features into the analysis requires choosing the
method of variance estimation. As discussed in Chapter 4, three methods of
variance estimation (BRR, JRR, and Taylor series approximation) are used
in practice. Several researchers (Bean, 1975; Frankel, 1971; Kish & Frankel,
1974; Lemeshow & Levy, 1979) have evaluated these three general methods
empirically, and Krewski and Rao (1981) have performed some theoretical
comparisons of these approaches. These evaluation studies tend to show that
none of the three methods consistently performs better or worse, and that the
choice may depend in most cases on the availability of and familiarity with
the software. In a few cases, the choice may depend on the type of statistic to
be estimated or the sample design used, as in the paired selection design.

The formula-based Taylor series approximation (linearization) is perhaps


the most widely used method of variance estimation for complex surveys
because it is found in most available software. It may be preferable to the
replication-based methods (BRR and JRR) for practical reasons, but, as
discussed in Chapter 4, it is not applicable for the median or other percen-
tiles and nonparametric statistics. The replication-based methods are more
general and can be applied with these statistics, but they require creating
and handling the replicates. Another advantage of the replication approach
is that it provides a simple way to incorporate adjustments for nonresponse
and poststratification more appropriately. By separately computing the
weighting adjustments for each replicate, it is possible to incorporate the
effects of adjustments in variance estimation.
For small surveys and small domain estimation, the JRR estimate may be
more stable than the BRR estimate because every replicate in JRR includes
most of the full sample, whereas only half of a sample is included in the
BRR replicate. However, BRR is reported to be more reliable than JRR for
the estimation of quartiles (Kovar et al., 1988; Rao, Wu, & Yue, 1992).
A variation of BRR suggested by Fay (Judkins, 1990) is used to stabilize
the variance estimator. In this variation, the replicate weights are to be
inflated by a factor of 2 − k or k instead of 2 or 0, and the variance estimator
is to be modified by multiplying the right side of Equation 4.4 by 1/(1 − k)²
(k can take a value between 0 and 1). Korn and Graubard (1999, pp. 35–36)
showed that Fay’s BRR with k = 0.3 produced about the same result as the
standard BRR but produced a somewhat smaller variance when adjust-
ments for nonresponse and poststratification were incorporated in the repli-
cate weights. Fay’s method can be seen as a compromise between JRR and
BRR. Judkins (1990) demonstrated that for estimation of quartiles and other
statistics, Fay’s method with k = 0.3 performed better than either standard
BRR or JRR in terms of bias and stability.
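Fay's modification can be sketched directly from the description above. The sketch assumes that Equation 4.4 is the usual BRR estimator (the average squared deviation of the replicate estimates from the overall estimate) and rescales it by 1/(1 − k)²; the replicate estimates are invented.

```python
# Hedged sketch of Fay's variant of BRR variance estimation. Replicate
# weights would be perturbed by factors (2 - k) and k rather than 2 and 0;
# here only the final rescaling of the variance estimator is shown.
# Assumed base form of Equation 4.4: (1/t) * sum of squared deviations.

def fay_variance(replicate_estimates, overall, k=0.3):
    t = len(replicate_estimates)
    base = sum((r - overall) ** 2 for r in replicate_estimates) / t
    return base / (1.0 - k) ** 2      # Fay's 1/(1 - k)^2 rescaling

reps = [0.601, 0.598, 0.602, 0.599]
v = fay_variance(reps, 0.600, k=0.3)
```

With k = 0, the formula reduces to the assumed standard BRR estimator, which is why Fay's method can be viewed as a compromise between BRR and JRR.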
The BRR method is designed for a paired selection design. When one
PSU is selected from each stratum, the PSUs must be paired to create
pseudo-strata in order to apply the BRR method. When more than two PSUs
are selected from each stratum, it is difficult to create a paired design, and it
is better to use the JRR or the Taylor series method. Procedurally, the Taylor
series approximation method is the simplest, and the replication-based
methods require extra steps to create replicate weights.

Available Computing Resources


Over the last three decades, several different programs were developed
for complex survey data analysis. Early programs were developed for dif-
ferent purposes in government agencies, survey research organizations, and

universities. Some of these programs evolved into program packages for


general users. Initial program packages were developed for mainframe
computing applications. With the enhanced computing capability of PCs,
more efforts were devoted to developing PC versions, and several new soft-
ware packages emerged.3 Although some failed to implement upgrades,
three program packages kept up with the current state of software standards
and are user-friendly. These are SUDAAN, Stata, and WesVar.
One set of programs that has been around for more than 20 years is the
SUDAAN package, which is available from the Research Triangle Institute
in North Carolina. It has two different versions: a stand-alone version and a
SAS (Statistical Analysis System) version. The latter is especially convenient
to use in conjunction with the SAS data step. As with SAS, the SUDAAN
license needs to be renewed annually.
The default method for variance estimation is the Taylor series method
with options to use BRR and JRR. It can handle practically all types of
sample designs including multiple-layered nesting and poststratified designs.
It has the most comprehensive set of analytical features available for ana-
lysis of complex survey data, but it is more expensive to maintain than the
other two packages. It supports a variety of statistical procedures, including
CROSSTAB, DESCRIPT, RATIO, REGRESS, LOGISTIC, LOGLINK
(log-linear regression), MULTILOG (multinomial and ordered logistic
regression), SURVIVAL (Cox proportional hazards model), and others. Over
the years, the designers have added new procedures and dropped some (e.g.,
CATAN for weighted least-square modeling). The SAS user may find these
procedures easy to implement, but it may be difficult for new users to specify
the design and to interpret the output without help from an experienced user.
Consulting and technical assistance are available on the SAS Web site.
Stata is a general-purpose statistical program package that includes a
survey analysis module. It is available from Stata Corporation, College
Station, Texas. Its survey analysis component supports a variety of analytical
procedures including svymean, svytotal, svyprop (proportion), svyratio
(ratio estimation), svytab (two-way tables), svyregress (regression), svylogit
(logistic regression), svymlogit (multinomial logistic regression), svypois
(Poisson regression), svyprobit (probit models), and others. It uses the
Taylor series method for variance estimation using PSUs (ultimate cluster
approximation). Although it does not support complicated designs such as
multilayered nesting designs, it can be used for analyzing most of the survey
designs used in practice. Most of its survey analysis procedures are parallel to
its general (nonsurvey) statistical procedures, which means that many general
features in its statistical analysis can be integrated easily with the survey ana-
lysis components. The output is relatively easy to understand, and new users
may find Stata easier to learn than SUDAAN.

The WesVar program is developed and distributed by Westat, Inc.,


Rockville, Maryland. It is designed to compute descriptive statistics, linear
regression modeling, and log-linear models using replication methods. Five
different replication methods are available, including JK1 (delete-1 jackknife
for unstratified designs), JK2 (jackknife for 2-per-stratum designs), JKn
(delete-1 jackknife for stratified designs), BRR, and Fay (BRR using Fay’s
method). It was developed at Westat, Inc. (Flyer & Mohadjer, 1988), and
older versions of this package are available for users at no charge. It is now
commercially available, and a student version is also available. Although
data can be imported from other systems, it is designed to be a stand-alone
package. In addition to the sample weights for the full sample, it requires that
each record in the input data file contain the replicate weight. For simple
designs, the program can create the replicate weights before running any
procedure. The program documentation is adequate, but new users may
find instructions for preparing the replicate weights somewhat difficult to
understand without help from experienced users.
Cohen (1997) evaluated early versions of these three software packages
(Release 7 of SUDAAN, Release 5 of Stata, WesVarPC Version 2.02)
with respect to programming effort, efficiency, accuracy, and programming
capability. The evaluation showed that the WesVar procedure consistently
required the fewest programming statements to derive the required survey
estimates, but it required additional data preparation for the creation of
replicate weights necessary for the derivation of variance estimates. Stata
tended to require more program statements to obtain the same results, but it
provided no undue burden to implement them. As far as computational effi-
ciency is concerned, the SUDAAN procedure was consistently superior in
generating the required estimates.
It is difficult to recommend one software package over another.
In choosing a software package, one should consider the method of var-
iance estimation to be used, the cost of maintaining the software, and
the strengths and limitations of the respective packages reviewed here in
the context of one’s analytical requirements. Perhaps more important is the
analyst’s familiarity with statistical packages. For example, for SAS users
it is natural to choose SUDAAN. Stata users probably will choose to use
Stata’s survey analysis component. Detailed illustrations presented in
Chapter 6 utilized primarily Stata (Version 8) and SUDAAN (Release
8.0.1). These illustrations may suggest additional points to consider for
choosing a software package.
The availability of computing resources is getting better all the time, as
more statistical packages incorporate complex survey data analysis proce-
dures. SPSS 13.0 now provides an add-on module for survey data analysis,
SPSS Complex Samples. It includes four procedures: CSDESCRIPTIVES, CSTABULATE (contingency table analysis), CSGLM (regression, ANOVA, ANCOVA), and CSLOGISTIC. The SAS system also provides survey data
analysis capabilities in its latest release, SAS 9.1. The SURVEYFREQ proce-
dure produces one-way and multiway contingency table analyses with tests of
association. The SURVEYLOGISTIC procedure performs logistic regression, and it can also fit other link functions. Survey data analysis can now often be done with statistical packages already in general use, without resorting to special-purpose software.

Creating Replicate Weights


The BRR method requires replicate weights. These weights can be
included in the data set or created prior to running any analysis. Usually H
(a multiple of 4, larger than the number of strata) sets of replicate weights
need to be created. For SUDAAN, the replicate weights are entered as data,
and WesVar can generate them with proper specifications. For example, for
a survey with 6 strata and 2 PSUs in each stratum, 8 sets of replicate weights
need to be created, as shown in Table 5.1. The SUDAAN statements for
BRR for this example are shown in the right side of the table. The input
data consist of stratum, PSU, the number of hospital beds, the number of
AIDS patients, the sample weight (wt), and 8 sets of replicate weights
(w1 through w8). Note that each replicate weight is either twice the sample weight, if the unit is included in the half-sample replicate, or zero, if it is not. The assignment of zero or twice the weight is based on 6 rows of the 8 × 8 orthogonal matrix, similar to Table 4.2. The ratio estimate (refer again to Note 2) of the total number of AIDS patients is calculated by applying the ratio of AIDS/beds to the total number of beds (2,501) in the target area. PROC RATIO with
DESIGN = brr specifies the desired statistic and the method of estimating
variance, and deff requests the design effect. The NEST (specifying strata
and PSU) statement is not needed, because the BRR design is used.
REPWGT designates the replicate weight variables. NUMER and DENOM
specify the numerator and denominator of the ratio estimate. The BRR esti-
mates by SUDAAN are shown in the lower left side of Table 5.1. The esti-
mate is 1,191, and its standard error is 154.2 (the design effect is 2.10). For
the same data, WesVar produced the same point estimate and the standard
error of 155.0 by creating replicate weights based on different rows of the
same orthogonal matrix. The BRR using the alternative replicate weights
(1 or 0 times the sample weights) gave a standard error of 149.5. Fay's variant of BRR uses replicate weights created by taking 2 − k or k (0 ≤ k < 1) times the sample weights, depending on whether or not an observation is in a selected unit. Using k = 0.3, the standard error was estimated to
be 137.7.
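The arithmetic behind these replicate-weight calculations can be sketched in a few lines. The code below is an illustrative Python sketch with hypothetical data, not the SUDAAN or WesVar implementation; the function names are ours. It builds half-sample replicate weights for a design with two PSUs per stratum and computes the BRR variance, with k = 0 giving classical BRR and 0 < k < 1 giving Fay's variant.

```python
# A sketch of BRR replicate weighting for a design with two PSUs per
# stratum (illustrative Python, hypothetical data -- not the SUDAAN or
# WesVar implementation). k = 0 gives classical BRR (2w or 0);
# 0 < k < 1 gives Fay's variant ((2 - k)w or k*w).

def brr_replicate_weights(base_weights, halves, k=0.0):
    """base_weights[s]: (w0, w1), the full-sample weights of the two
    PSUs in stratum s. halves[r][s]: which PSU of stratum s (0 or 1)
    enters the half-sample for replicate r."""
    reps = []
    for row in halves:
        rep = []
        for s, j in enumerate(row):
            w0, w1 = base_weights[s]
            if j == 0:
                rep.append(((2 - k) * w0, k * w1))
            else:
                rep.append((k * w0, (2 - k) * w1))
        reps.append(rep)
    return reps

def brr_variance(theta_full, theta_reps, k=0.0):
    # Average squared deviation of the replicate estimates from the
    # full-sample estimate, rescaled by 1 / (1 - k)^2 for Fay's method.
    R = len(theta_reps)
    return sum((t - theta_full) ** 2 for t in theta_reps) / (R * (1 - k) ** 2)
```

In practice, the statistic of interest is recomputed with each set of replicate weights and the squared deviations from the full-sample estimate are averaged; Fay's method additionally divides by (1 − k)².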

TABLE 5.1
Creation of Replicate Weights for BRR and Jackknife Procedure
(A) SUDAAN statements for BRR: (B) SUDAAN statements for jackknife method:
data brr; Data jackknife;
input stratum psu beds aids wt w1-w8; input stratum psu beds aids wt;
aids2=aids*2501; aids2=aids*2501;
datalines; datalines;
1 1 72 20 2 0 0 0 4 0 4 4 4 1 1 72 20 2
1 2 87 49 6 12 12 12 0 12 0 0 0 1 2 87 49 6
2 1 99 38 2 4 0 0 0 4 0 4 4 2 1 99 38 2
2 2 48 23 2 0 4 4 4 0 4 0 0 2 2 48 23 2
3 1 99 38 2 4 4 0 0 0 4 0 4 3 1 99 38 2
3 2 131 78 4 0 0 8 8 8 0 8 0 3 2 131 78 4
4 1 42 7 2 0 4 4 0 0 0 4 4 4 1 42 7 2
4 2 38 28 2 4 0 0 4 4 4 0 0 4 2 38 28 2
5 1 42 26 2 4 0 4 4 0 0 0 4 5 1 42 26 2
5 2 34 9 2 0 4 0 0 4 4 4 0 5 2 34 9 2
6 1 39 18 4 0 8 0 8 8 0 0 8 6 1 39 18 4
6 2 76 20 2 4 0 4 0 0 4 4 0 6 2 76 20 2
; ;
proc ratio design=brr deff; proc ratio design=jackknife deff;
weight wt; nest stratum;
repwgt w1-w8; weight wt;
numer aids2; numer aids2;
denom beds; denom beds;
run; run;

SUDAAN output for BRR: SUDAAN output for jackknife procedure:

The RATIO Procedure The RATIO Procedure


Variance Estimation Method: BRR Variance Estimation Method: Delete-1 Jackknife
by: Variable, One. by: Variable, One.
---------------------------------------------- ------------------------------------------------
| Variable | | One | Variable | | One
| | | 1 | | | | 1 |
---------------------------------------------- ------------------------------------------------
| AIDS2/BEDS | Sample Size | 12 | | AIDS2/BEDS | Sample Size | 12 |
| | Weighted Size | 32.00 | | | Weighted Size | 32.00 |
| | Weighted X-Sum | 2302.00 | | | Weighted X-Sum | 2302.00 |
| | Weighted Y-Sum | 2741096.00 | | | Weighted Y-Sum | 2741096.00 |
| | Ratio Est. | 1190.75 | | | Ratio Est. | 1190.75 |
| | SE Ratio | 154.15 | | | SE Ratio | 141.34 |
| | DEFF Ratio #4 | 2.10 | | | DEFF Ratio #4 | 1.76 |
---------------------------------------------- ------------------------------------------------

SOURCE: Data are from Levy and Lemeshow (1999, p. 384).

The jackknife procedure does not require replicate weights. The program
creates the replicates by deleting one PSU in each replicate. The SUDAAN
statements for JRR and the results are shown in the right side of Table 5.1.
The standard error estimated by the jackknife procedure is 141.3, which is
smaller than the BRR estimate. The standard error calculated by the Taylor
series method (assuming with-replacement sampling) was 137.6, slightly less
than the jackknife estimate but similar to the estimate from Fay’s BRR. As
discussed in Chapter 4, BRR and JRR assume with-replacement sampling. If
we assume without-replacement sampling (the finite population correction is
used), the standard error is estimated to be 97.3 for this example.
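The delete-1 jackknife just described can also be sketched directly (illustrative Python with hypothetical data, not SUDAAN's code): each replicate drops one PSU and reweights the remaining PSUs in that stratum by n_h/(n_h − 1), and the squared deviations of the replicate estimates are accumulated with the factor (n_h − 1)/n_h.

```python
# A sketch of the delete-1 jackknife for a stratified design
# (illustrative Python, hypothetical data -- not SUDAAN's implementation).

def ratio_estimate(records):
    """records: list of (stratum, psu, weight, y, x) tuples.
    Returns the weighted ratio sum(w*y) / sum(w*x)."""
    num = sum(w * y for _, _, w, y, _ in records)
    den = sum(w * x for _, _, w, _, x in records)
    return num / den

def jackknife_variance(records):
    theta = ratio_estimate(records)
    strata = sorted({s for s, *_ in records})
    var = 0.0
    for h in strata:
        psus = sorted({p for s, p, *_ in records if s == h})
        n_h = len(psus)
        for drop in psus:
            # Delete one PSU and reweight the remaining PSUs in the
            # stratum by n_h / (n_h - 1) (i.e., doubling when n_h = 2).
            rep = [(s, p, w * n_h / (n_h - 1) if s == h else w, y, x)
                   for s, p, w, y, x in records
                   if not (s == h and p == drop)]
            var += (n_h - 1) / n_h * (ratio_estimate(rep) - theta) ** 2
    return var
```

With two PSUs per stratum, as in Table 5.1, deleting one PSU simply doubles the weight of the other.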
The Third National Health and Nutrition Examination Survey (NHANES III; see Note 4)
from the National Center for Health Statistics (NCHS) contains the replicate
weights for BRR. The replicate weights were created for Fay’s method with
k = 0.3 incorporating nonresponse and poststratification adjustments at
different stages of sampling. As Korn and Graubard (1999) suggested, a
preferred approach is using Fay's method of creating replicate weights incorporating adjustments for the nonresponse and poststratification. But such weights usually are not included in many survey data sets, nor is there appropriate information for creating such replicate weights.

Searching for Appropriate Models for Survey Data Analysis*


It has been said that many statistical analyses are carried out with no clear
idea of the objective. Before analyzing the data, it is essential to think through the research question and formulate a clear analytic plan. As discussed
in a previous section, a preliminary analysis and exploration of data are very
important in survey analysis. In a model-based analysis, this task is much
more formidable than in a design-based analysis.
Problem formulation may involve asking questions or carrying out appro-
priate background research in order to get the necessary information for
choosing an appropriate model. Survey analysts often are not involved in collecting the survey data, and it is often difficult to comprehend the data collection design. Asking questions about the initial design alone may not be sufficient; it is also necessary to ask how the design was executed in the field.
Often, relevant design-related information is neither documented nor included
in the data set. Moreover, some surveys have overly ambitious objectives
given the possible sample size. So-called general-purpose surveys cannot pos-
sibly include all the questions that are relevant to all future analysts. Building
an appropriate model including all the relevant variables is a real challenge.
There should also be a check on any prior knowledge, particularly when
similar sets of data have been analyzed before. It is advisable not to fit a
model from scratch but to see if the new data are compatible with earlier
results. Unfortunately, it is not easy to find model-based analyses using com-
plex survey data in social and health science research. Many articles dealing
with the model-based analysis tend to concentrate on optimal procedures for
analyzing survey data under somewhat idealized conditions. For example,
most public-use survey data sets contain only strata and PSUs, and opportu-
nities for defining additional target parameters for multilevel or hierarchical
linear models (Bryk & Raudenbush, 1992; Goldstein & Silver, 1989; Korn &
Graubard, 2003) are limited. The use of mixed linear models for complex
survey data analysis requires further research and will, we hope, stimulate survey designers to bring design and analysis into closer alignment.

6. CONDUCTING SURVEY DATA ANALYSIS

This chapter presents various illustrations of survey data analysis. The emphasis is on the demonstration of the effects of incorporating the weights and the data structure on the analysis. We begin with a strategy for conducting a preliminary analysis of a large-scale, complex survey. Data from Phase II of NHANES III (refer to Note 4) will be used to illustrate various analyses,
including descriptive analysis, linear regression analysis, contingency table
analysis, and logistic regression analyses. For each analysis, some theoretical
and practical considerations required for the survey data will be discussed.
The variables used in each analysis are selected to illustrate the methods
rather than to present substantive findings. Finally, the model-based perspec-
tive is discussed as it relates to analytic examples presented in this chapter.

A Strategy for Conducting Preliminary Analysis


Sample weights can play havoc in the preliminary analysis of complex
survey data, but exploring the data ignoring the weights is not a satisfactory
solution. On the other hand, programs for survey data analysis are not well
suited for basic data exploration. In particular, graphic methods were
not designed with complex surveys in mind. In this section, we present a
strategy for conducting preliminary analyses taking the weights into account.
Prior to the advent of the computer, the weight was handled in various ways
in data analysis. When IBM sorting machines were used for data tabulations,
it was common practice to duplicate the data cards to match the weight value
to obtain reasonable estimates. To expedite the tabulations of large-scale sur-
veys, the PPS procedure was adopted in some surveys (Murthy & Sethi,
1965). Recognizing the difficulty of analyzing complex survey data, Hinkins,
Oh, and Scheuren (1994) advocated an "inverse sampling design algorithm"
that would generate a simple random subsample from the existing complex
survey data, so that users could apply their conventional statistical methods
directly to the subsample. These approaches are no longer attractive to survey
data analysis because programs for survey analysis are now readily available.
However, because there is no need to use the entire data file for preliminary
analysis, the idea of subsampling by the PPS procedure is a very attractive
solution for developing data for preliminary analysis.
The PPS subsample can be explored by the regular descriptive and graphic
methods, because the weights are already reflected in the selection of the sub-
sample. For example, the scatterplot is one of the essential graphic methods
for preliminary data exploration. One way to incorporate the weight in the
scatterplot is the use of bubbles that represent the magnitude of the weight.
Korn and Graubard (1998) examined alternative procedures to scatterplot
bivariate data and showed advantages of using the PPS subsample. In fact,
they found that "sampled scatterplots" are a preferred procedure to "bubble scatterplots."
For a preliminary analysis, we generated a PPS sample of 1,000 from the
adult file of Phase II (1991–1994) of NHANES III (refer to Note 4), which

TABLE 6.1
Subsample and Total Sample Estimates for Selected Characteristics
of U.S. Adult Population, NHANES III, Phase II
Sample                  Mean Age     Vitamin Use    Hispanic Population    SBP^a          Correlation Between BMI^b and SBP

Total sample
(n = 9,920)c
Unweighted 46.9 years 38.4% 26.1% 125.9 mmHg 0.153
Weighted 43.6 42.9 5.4 122.3 0.243
PPS subsample
(n = 1,000)
Unweighted 42.9 43.0 5.9 122.2 0.235

a. Systolic blood pressure
b. Body mass index
c. Adults 17 years of age and older

consisted of 9,920 adults. We first sorted the total sample by stratum and PSU
and then selected a PPS subsample systematically using a skipping interval of
9.92 on the scale of cumulated relative weights. The sorting by stratum and
PSU essentially preserved the integrity of the original sample design.
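The systematic PPS selection described above can be sketched as follows. This is our own illustrative Python version (the function name and data layout are ours), assuming the file is already sorted by stratum and PSU: cumulate the relative weights and take each record whose cumulated weight crosses the next multiple of the skipping interval.

```python
# Systematic PPS subsampling on the scale of cumulated relative weights
# (illustrative Python with hypothetical data; assumes the records are
# already sorted by stratum and PSU, as described in the text).
import random

def pps_systematic_subsample(records, n_sub, seed=None):
    """records: list of (stratum, psu, relative_weight, payload) tuples.
    Selects n_sub records with probability proportional to weight."""
    total = sum(w for _, _, w, _ in records)
    interval = total / n_sub           # the skipping interval (9.92 in the text)
    rng = random.Random(seed)
    target = rng.uniform(0, interval)  # random start within the first interval
    chosen, cum = [], 0.0
    for rec in records:
        cum += rec[2]
        while cum > target:            # record spans the next selection point
            chosen.append(rec)
            target += interval
    return chosen
```

A record whose relative weight exceeds the skipping interval can be selected more than once, which is how the weights are reflected in the subsample.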
Table 6.1 demonstrates the usefulness of a PPS subsample that can be
analyzed with conventional statistical packages. In this demonstration, we
selected several variables that are most affected by the weights. Because of
oversampling of the elderly and ethnic minorities, the weighted estimates are
different from the unweighted estimates for mean age and percentage of
Hispanics. The weights also make a difference for vitamin use and systolic
blood pressure because they are heavily influenced by the oversampled cate-
gories. The subsample estimates, although not weighted, are very close to the
weighted estimates in the total sample, demonstrating the usefulness of a PPS
subsample for preliminary analysis. A similar observation can be made based
on the correlation between body mass index and systolic blood pressure.
The PPS subsample is very useful in exploring the data without formally
incorporating the weights, especially for students in introductory courses.
It is especially well suited for exploring the data by graphic methods such
as scatterplot, side-by-side boxplot, and the median-trace plot. The real
advantage is that the resampled data are approximately representative of the
population and can be explored ignoring the weights. The point estimates
from the resampled data are approximately the same as the weighted esti-
mates in the whole data set. Any interesting patterns discovered from the
resampled data are likely to be confirmed by a more complete analysis using
Stata or SUDAAN, although the standard errors are likely to be different.

Conducting Descriptive Analysis


For a descriptive analysis, we used the adult sample (17 years of age or
older) from Phase II of NHANES III. It included 9,920 observations that
were arranged in 23 pseudo-strata, with 2 pseudo-PSUs in each stratum.
The identifications for the pseudo-strata (stra) and PSUs (psu) were
included in our working data file. The expansion weights in the data file
were converted to relative weights (wgt). To determine whether there were
any problems in the distribution of the observations across the PSUs, an
unweighted tabulation was performed. It showed that the numbers of obser-
vations available in the PSUs ranged from 82 to 286. These PSU sample
sizes seem sufficiently large for further analysis.
We chose to examine the body mass index (BMI), age, race, poverty
index, education, systolic blood pressure, use of vitamin supplements,
and smoking status. BMI was calculated by dividing the body weight (in
kilograms) by the square of the height (in meters). Age was measured in
years, education (educat) was measured as the number of years of school-
ing, the poverty index (pir) was calculated as a ratio of the family income
to the poverty level, and systolic blood pressure (sbp) was measured in
mmHg. In addition, the following binary variables were selected: Black
(1 = black; 0 = nonblack), Hispanic (1 = Hispanic; 0 = non-Hispa-
nic), use of vitamin supplements (vituse) (1 = yes; 0 = no), and smoking
status (smoker) (1 = ever smoked; 0 = never smoked).
We imputed missing values for the variables selected for this analysis to
illustrate the steps of survey data analysis. Various imputation methods have
been developed to compensate for missing survey data (Brick & Kalton,
1996; Heitjan, 1997; Horton & Lipsitz, 2001; Kalton & Kasprzyk, 1986;
Little & Rubin, 2002; Nielsen, 2003; Zhang, 2003). Several software packages
are available (e.g., proc mi and proc mianalyze in SAS/STAT; SOLAS;
MICE; S-Plus Missing Data Library). There are many ways to apply them
to a specific data set. Choosing appropriate methods and their course of
application ultimately depends on the number of missing values, the
mechanism that led to missing values (ignorable or nonignorable), and on
the pattern of missing values (monotone or general). It is tempting to apply
sophisticated statistical procedures, but that may do more harm than good.
It will be more helpful to look at concrete examples (Kalton & Kasprzyk, 1986; Korn & Graubard, 1999, sec. 4.7 and chap. 9) than to read technical manuals. Detailed discussions of these issues are beyond the scope of
this book. The following brief description is for illustrative purposes only.
There were no missing values for age and ethnicity in our data. We first
imputed values for variables with the fewest missing values. There were
fewer than 10 missing values for vituse and smoker and about 1% of values missing for educat and height. We used a hot deck procedure (see Note 5) to impute
values for these four variables by selecting donor observations randomly with
probability proportional to the sample weights within 5-year age categories
by gender. The same donor was used to impute values when there were
missing values in one or more variables for an observation. Regression
imputation was used for height (3.7% missing; 2.8% based on weight, age,
gender, and ethnicity; and 0.9%, based on age, gender, and ethnicity), weight
(2.8% missing, based on height, age, gender, and ethnicity), sbp (2.5% miss-
ing, based on height, weight, age, gender and ethnicity), and pir (10% miss-
ing, based on family size, educat, and ethnicity). About 0.5% of imputed pir
values were negative, and these were set to 0.001 (the smallest pir value in
the data). Parenthetically, we could have brought other anthropometric mea-
sures into the regression imputation, but our demonstration was based simply
on the variables selected for this analysis. Finally, the bmi values (5.5%
missing) were recalculated based on updated weight and height information.
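The hot deck step can be sketched as follows. This is an illustrative Python sketch under our own simplifying assumptions (one variable imputed at a time within a single cell key), not the procedure actually used for NHANES III: donors are drawn within an imputation cell with probability proportional to their sample weights.

```python
# A sketch of weighted hot deck imputation: a missing value is replaced
# by the value of a donor drawn from the same imputation cell with
# probability proportional to the donor's sample weight (illustrative
# Python, hypothetical data).
import random

def hot_deck_impute(records, var, cell_key, rng=None):
    """records: list of dicts containing 'weight', the cell variables,
    and `var` (None when missing). Imputes `var` in place."""
    rng = rng or random.Random()
    donors_by_cell = {}
    for r in records:
        if r[var] is not None:
            donors_by_cell.setdefault(cell_key(r), []).append(r)
    for r in records:
        if r[var] is None:
            donors = donors_by_cell[cell_key(r)]
            weights = [d['weight'] for d in donors]
            r[var] = rng.choices(donors, weights=weights, k=1)[0][var]
    return records
```

In the analysis described in the text, the cells were 5-year age categories by gender, and the same donor supplied values when an observation had several missing variables.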
To demonstrate that the sample weight and design effect make a difference,
the analysis was performed under three different options: (a) unweighted,
ignoring the data structure; (b) weighted, ignoring the data structure; and
(c) survey analysis, incorporating the weights and sampling features. The first
option assumes simple random sampling, and the second recognizes the
weight but ignores the design effect. The third option provides an appropriate
analysis for the given sample design.
First, we examined the weighted means and proportions and their standard
errors with and without the imputed values. The imputation had inconsequen-
tial impact on point estimates and a slight reduction in estimated standard
errors under the third analytic option. The weighted mean pir without
imputed values was 3.198 (standard error = 0.114) compared with 3.168
(s.e. = 0.108) with imputed values. For bmi, the weighted mean was 25.948 (s.e. = 0.122) without imputation and 25.940 (s.e. = 0.118) with imputation. For other variables, the point estimates and their standard errors were
identical to the third decimal point because there were so few missing values.
The estimated descriptive statistics (using imputed values) are shown
in Table 6.2. The calculation was performed using Stata. The unweighted
statistics in the top panel were produced by the nonsurvey commands
summarize for point estimates and ci for standard errors. The weighted
analysis (second option) in the top panel was obtained by the same nonsur-
vey command with the use of [w = wgt]. The third analysis, incorporating
the weights and the design features, is shown in the bottom panel. It was
conducted using svyset [pweight = wgt], strata(stra) psu(psu) for
setting complex survey features and svymean for estimating the means or
proportions of specified variables.
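The difference between the first two analytic options reduces to whether the relative weights enter the estimator; a minimal Python sketch with hypothetical data makes this concrete. The design-effect function implements the definition used in this section, the squared ratio of the complex-design standard error to the standard error that ignores the design.

```python
# The first two analytic options in miniature: unweighted vs. weighted
# estimation (illustrative Python, hypothetical data; w is a relative
# weight). design_effect compares the complex-design standard error
# with the one that ignores the design.

def unweighted_mean(y):
    return sum(y) / len(y)

def weighted_mean(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def design_effect(se_complex, se_srs):
    # Squared ratio of the two standard errors.
    return (se_complex / se_srs) ** 2
```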

TABLE 6.2
Descriptive Statistics for the Variables Selected for Regression
Analysis of Adults 17 Years and Older From NHANES III,
Phase II (n = 9,920): An Analysis Using Stata
(A) Weighted and unweighted statistics, ignoring the design features

Unweighted Analysis Weighted Analysis


Variable | Mean Std. Err. Mean Std. Err. Min Max
------------+----------------------------------------------------------------------
bmi | 26.4465 .05392 25.9402 .05428 10.98 73.16
age | 46.9005 .20557 43.5572 .17865 17 90
black | .2982 .00459 .1124 .00317 0 1
hispanic | .2614 .00441 .0543 .00228 0 1
pir | 2.3698 .01878 3.1680 .02086 0 11.89
educat | 10.8590 .03876 12.3068 .03162 0 17
sbp | 125.8530 .20883 122.2634 .18397 81 244
vituse | .3844 .00488 .4295 .00497 0 1
smoker | .4624 .00501 .5114 .00502 0 1
-----------------------------------------------------------------------------------

(B) Survey analysis, using the weights and design features

. svyset [pweight=wgt], strata(stra) psu(psu)


. svymean bmi age black hispanic pir educat sbp vituse smoker

Survey mean estimation


pweight: wgt Number of obs(*) = 9920
Strata: stra Number of strata = 23
PSU: psu Number of PSUs = 46
Population size = 9920.06
------------------------------------------------------------------------
Mean | Estimate Std. Err. [95% Conf. Interval] Deff
---------+--------------------------------------------------------------
bmi | 25.9402 .11772 25.6946 26.2013 4.9903
age | 43.5572 .57353 42.3708 44.7436 10.3067
black | .1124 .00973 .0923 .1326 9.4165
hispanic | .0543 .00708 .0397 .0690 9.6814
pir | 3.1680 .10779 2.9622 3.4328 25.6331
educat | 12.3068 .12312 12.0565 12.5671 15.0083
sbp | 122.2634 .38543 121.4010 122.980 4.1995
vituse | .4295 .01215 .4043 .4546 5.9847
smoker | .5114 .01155 .4874 .5352 5.2829
------------------------------------------------------------------------

*Some variables contain missing values.

The statistics shown in Table 6.2 are the estimated means for the continuous
variables, proportions for the binary variables, and standard errors. There are slight differences between the weighted and unweighted means/proportions for some variables, and considerable differences for others. The weighted proportion is more than 60% smaller than the unweighted proportion for blacks and nearly 80% smaller for Hispanics, reflecting the oversampling of these two ethnic groups.
is about 3.5 years less than the unweighted mean because the elderly also
were oversampled. On the other hand, the weighted mean is considerably
greater than the unweighted mean for the poverty index and for the number
of years of schooling, suggesting that the oversampled minority groups are
concentrated in the lower ranges of income and schooling. The weighted estimate for vitamin use is also somewhat greater than the unweighted estimate; the lower unweighted estimate may reflect lower vitamin use among the oversampled minority groups.
The bottom panel presents the survey estimates that reflect both the
weights and design features. Although the estimated means and proportions
are exactly the same as the weighted statistics in the top panel, the standard
errors increase substantially for all variables. This difference is reflected in
the design effect in the table (the square of the ratio of standard error in the
bottom panel to that for the weighted statistic in the top panel). The large
design effects for poverty index, education, and age partially reflect the resi-
dential homogeneity with respect to these characteristics. The design effects
of these socioeconomic variables and age are larger than those for the pro-
portion of blacks and Hispanics. The opposite was true in the NHANES II
conducted in 1976–1980 (data presented in the first edition of this book),
suggesting that residential areas are now increasingly becoming more homo-
geneous with respect to socioeconomic status than by ethnic status.
The bottom panel also shows the 95% confidence intervals for the means
and proportions. The t value used for the confidence limits is not the familiar
value of 1.96 that might be expected from the sample of 9,920 (the sum of the
relative weights). The reason for this is that in a multistage cluster sampling
design, the degrees of freedom are based on the number of PSUs and strata,
rather than the sample size, as in SRS. Typically, the degrees of freedom in
complex surveys are determined as the number of PSUs sampled minus the
number of strata used. In our example, the degrees of freedom are 23
(= 46 − 23), and t(23, 0.975) = 2.0687; this t value is used in all confidence
intervals in Table 6.2. In certain circumstances, the degrees of freedom may
be determined somewhat differently from the above general rule (see Korn &
Graubard, 1999, sec. 5.2).
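The degrees-of-freedom rule and the resulting interval can be sketched as follows (illustrative Python; the t quantile 2.0687 is taken from the text rather than computed, since computing t quantiles requires a statistics library):

```python
# Degrees of freedom and confidence limits under a complex design
# (illustrative Python sketch).

def design_df(n_psus, n_strata):
    # Number of PSUs minus number of strata, not n - 1 as under SRS.
    return n_psus - n_strata

def confidence_interval(estimate, std_err, t_crit):
    half_width = t_crit * std_err
    return (estimate - half_width, estimate + half_width)
```

With 46 PSUs and 23 strata, design_df gives 23, and t(23, 0.975) = 2.0687 replaces the familiar 1.96 in the intervals of Table 6.2.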
In Table 6.3, we illustrate examples of conducting subgroup analysis. As
mentioned in the previous chapter, any subgroup analysis using complex
survey data should be done using the entire data set without selecting out
the data in the analytic domain. There are two options for conducting proper
subgroup analysis in Stata: the use of by or subpop. Examples of conduct-
ing a subgroup analysis for blacks are shown in Table 6.3. In the top panel,
the mean BMI is estimated separately for nonblacks and blacks by using
the by option. The mean BMI for blacks is greater than for nonblacks.
Although the design effect of BMI among nonblacks (5.5) is similar to the
overall design effect (5.0 in Table 6.2), it is only 1.1 among blacks.
Stata also can be used to test linear combinations of parameters. The
equality of the two population subgroup means can be tested using the
lincom command ([bmi]1-[bmi]0, testing the hypothesis that the population mean BMI for blacks (black = 1) equals that for nonblacks (black = 0)), and the difference is statistically significant based on the

TABLE 6.3
Comparison of Mean Body Mass Index Between Black and
Nonblack Adults 17 Years and Older, NHANES III,
Phase II (n = 9,920): An Analysis Using Stata
(A) . svyset [pweight=wgt], strata(stra) psu(psu)
. svymean bmi, by (black)

Survey mean estimation


pweight: wgt Number of obs = 9920
Strata: stra Number of strata = 23
PSU: psu Number of PSUs = 46
Population size = 9920.06
------------------------------------------------------------------------------
Mean Subpop. | Estimate Std. Err. [95% Conf. Interval] Deff
---------------+--------------------------------------------------------------
bmi black==0 | 25.7738 .12925 25.5064 26.0412 5.512
black==1 | 27.2536 .17823 26.8849 27.6223 1.071
------------------------------------------------------------------------------

(B) . lincom [bmi]1-[bmi]0, deff

( 1) - [bmi]0 + [bmi]1 = 0.0


------------------------------------------------------------------------------
Mean | Estimate Std. Err. t P>|t| [95% Conf. Interval] Deff
------------+-----------------------------------------------------------------
(1) | 1.4799 .21867 6.77 0.000 1.0275 1.9322 1.462
-------------------------------------------------------------------------------

(C) . svymean bmi, subpop(black)

Survey mean estimation


pweight: wgt Number of obs = 9920
Strata: stra Number of strata = 23
PSU: psu Number of PSUs = 46
Subpop.: black==1 Population size = 9920.06
------------------------------------------------------------------------------
Mean | Estimate Std. Err. [95% Conf. Interval] Deff
---------+--------------------------------------------------------------------
bmi | 27.2536 .17823 26.8849 27.6223 1.071
------------------------------------------------------------------------------

(D) . svymean bmi if black==1


stratum with only one PSU detected

(E) . replace stra=14 if stra==13


(479 real changes made)
. replace stra=16 if stra==15
(485 real changes made)

. svymean bmi if black==1

Survey mean estimation


pweight: wgt Number of obs = 2958
Strata: stra Number of strata = 21
PSU: psu Number of PSUs = 42
Population size = 1115.244
------------------------------------------------------------------------------
Mean | Estimate Std. Err. [95% Conf. Interval] Deff
---------+--------------------------------------------------------------------
bmi | 27.2536 .17645 26.8867 27.6206 2.782

t test. The design effect is 1.46, indicating that the t value for this test is
reduced about 20% to compensate for the sample design features.
Alternatively, the subpop option can be used to estimate the mean BMI
for blacks, as shown in the bottom panel. This option uses the entire data set
by setting the weights to zero for those outside the analytic domain. The
mean, standard error, and design effect are the same as those calculated for
blacks using the by option in the top panel. Next, we selected out blacks by
specifying the domain (if black==1) to estimate the mean BMI. This
approach did not work because there were no blacks in some of the PSUs.
The tabulation of blacks by stratum and PSU showed that only one PSU
remained in the 13th and 15th strata. When these two strata are collapsed
with adjacent strata, Stata produced a result. Although the point estimate is
the same as before, the standard error and design effect are different. As a
general rule, subgroup analysis with survey data should avoid selecting out
a subset, unlike in the analysis of SRS data.
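The logic of the subpop approach can be sketched as follows (illustrative Python with hypothetical records, not Stata's internals): the domain is handled by zeroing the weights of out-of-domain records, so every stratum and PSU remains in the file for variance estimation.

```python
# Domain (subgroup) estimation without subsetting: zero the weights of
# records outside the domain so every stratum and PSU stays in the file
# (illustrative Python, hypothetical data).

def domain_weights(records, in_domain):
    """records: list of dicts with a 'weight' key. Returns copies with
    weight set to 0.0 for records outside the analytic domain."""
    return [dict(r, weight=(r['weight'] if in_domain(r) else 0.0))
            for r in records]

def weighted_mean(records, var):
    num = sum(r['weight'] * r[var] for r in records)
    den = sum(r['weight'] for r in records)
    return num / den
```

Physically subsetting the file instead can leave strata with a single PSU, which is exactly the "stratum with only one PSU detected" error shown in Table 6.3.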
Besides the svymean command for descriptive analysis, Stata supports the
following descriptive analyses: svytotal (for the estimation of population
total), svyratio (for the ratio estimation), and svyprop (for the estimation of
proportions). In SUDAAN, these descriptive statistics can be estimated by
the DESCRIPT procedure, and subdomain analysis can be accommodated by
the use of the SUBPOPN statement.

Conducting Linear Regression Analysis


Both regression analysis and ANOVA examine the linear relation
between a continuous dependent variable and a set of independent vari-
ables. To test hypotheses, it is assumed that the dependent variable follows
a normal distribution. The following equation shows the type of relation
being considered by these methods for i = 1, 2, . . . , n:
Yi = β0 + β1X1i + β2X2i + · · · + βpXpi + εi        (6.1)

This is a linear model in the sense that the dependent variable (Yi) is represented by a linear combination of the βj's plus εi. The βj is the coefficient of
the independent variable (Xj ) in the equation, and εi is the random error term
in the model that is assumed to follow a normal distribution with a mean of 0
and a constant variance and to be independent of the other error terms.
In regression analysis, the independent variables are either continuous or
discrete variables, and the βj ’s are the corresponding coefficients. In the
ANOVA, the independent variables (Xj ’s) are indicator variables (under
effect coding, each category of a factor has a separate indicator variable
coded 1 or 0) that show which effects are added to the model, and the βj ’s
are the effects.
Ordinary least squares (OLS) estimation is used to obtain estimates of the
regression coefficients or the effects in the linear model when the data result
from a SRS. However, several changes in the methodology are required to
deal with data from a complex sample. The data now consist of the individual
observations plus the sample weights and the design descriptors. As was
discussed in Chapter 3, the subjects from a complex sample usually have
58

different probabilities of selection. In addition, in a complex survey the


random-error terms often are no longer independent of one another because
of features of the sample design. Because of these departures from SRS, the
OLS estimates of the model parameters and their variances are biased. Thus,
confidence intervals and tests of hypotheses may be misleading.
A number of authors have addressed these issues (Binder, 1983; Fuller,
1975; Holt, Smith, & Winter, 1980; Konijn, 1962; Nathan & Holt, 1980;
Pfeffermann & Nathan, 1981; Shah, Holt, & Folsom, 1977). They do not
concur on a single approach to the analysis, but they all agree that the use of
OLS as the estimation methodology can be inappropriate. Rather than
providing a review of all these works, we focus here on an approach that
covers the widest range of situations and that also has software available
and widely disseminated. This approach to the estimation of the model
parameters is the design-weighted least squares (DWLS), and its use is sup-
ported in SUDAAN, Stata, and other software for complex survey data
analysis.
The weight in the DWLS method is the sample weight discussed in
Chapter 3. DWLS is slightly different from the weighted least squares
(WLS) method for unequal variances, which derives the weight from an
assumed covariance structure (see Lohr, 1999, chap. 12). To account for the
complexities introduced by the sample design and other adjustments to the
weights, one of the methods discussed in Chapter 4 may be used in the esti-
mation of the variance–covariance matrix of the estimates of the model
parameters. Because these methods use the PSU total rather than the indivi-
dual value as the basis for the variance computation, the degrees of freedom
for this design equal the number of PSUs minus the number of strata,
instead of the sample size. The degrees of freedom associated with the sum
of squares for error are then the number of PSUs minus the number of strata,
minus the number of terms in the model.
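The DWLS point estimates themselves involve only the sample weights. A minimal Python sketch for a single predictor, with hypothetical data, illustrates the idea; the variance estimation, which requires the stratum and PSU structure as discussed above, is deliberately not shown:

```python
def dwls_simple(x, y, w):
    """Design-weighted least squares for one predictor.

    Returns (b0, b1) minimizing sum_i w_i * (y_i - b0 - b1*x_i)^2,
    where w_i is the sample weight. Standard errors are NOT computed
    here: for complex survey data they must be obtained from one of
    the design-based methods (Taylor series, replication) using the
    stratum/PSU information.
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical data; with all weights equal, this reduces to OLS.
b0, b1 = dwls_simple([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8], [1.5, 2.0, 1.0, 2.5])
```

With equal weights the estimates coincide with OLS, which is why SRS is the special case in which no adjustment is needed.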
Table 6.4 presents the results of the multiple regression analysis of BMI on
the selected independent variables under the three options of analysis. For
independent variables, we used the same variables used for descriptive analy-
sis. In addition, age squared is included to account for a possible nonlinear
age effect on BMI. For simplicity, the interaction terms are not considered
in this example, although their inclusion undoubtedly would have increased
the R-squared, apart from a heightened multicollinearity problem. Imputed
values were used in this analysis. The regression coefficients were almost the
same as those obtained from the same analysis without using imputed values.
The standard errors of the coefficients were also similar between the analyses
with and without imputed values.
The top panel shows the results of unweighted and weighted analyses
ignoring the design features. The regress command is used for both the

TABLE 6.4
Summary of Multiple Regression Models for Body Mass
Index on Selected Variables for U.S. Adults From
NHANES III, Phase II (n = 9,920):
An Analysis Using Stata

(A) Unweighted and weighted analysis, ignoring design features

Unweighted analysis Weighted analysis


------------------------------------------- -------------------------------------
Source | SS df MS | SS df MS
----------+------------------------------ -+------------------------------
Model | 33934.57 9 3770.48 | 37811.46 9 4201.27
Residual | 252106.39 9910 25.44 | 236212.35 9910 23.84
----------+------------------------------ -+------------------------------
Total | 286040.68 9919 28.84 | 274023.81 9919 27.63
----------------------------------------- --------------------------------
F( 9, 9910) = 148.21 F( 9, 9910) = 176.26
Prob > F = 0.0000 Prob > F = 0.0000
R-squared = 0.1186 R-squared = 0.1380
Adj R-squared = 0.1178 Adj R-squared = 0.1372
Root MSE = 5.0438 Root MSE = 4.8822
------------------------------------------------ -------------------------------------
bmi | Coef. Std. Err. t P>|t| Coef. Std. Err. t P>|t|
-----------+------------------------------------ -------------------------------------
age | .38422 .01462 26.27 0.000 .39778 .01528 26.03 0.000
agesq | -.00391 .00014 -27.61 0.000 -.00421 .00016 -27.06 0.000
black | 1.15938 .13178 8.80 0.000 .96291 .16108 5.98 0.000
hispanic | .70375 .14604 4.82 0.000 .64825 .22761 2.85 0.004
pir | -.14829 .03271 -4.53 0.000 -.12751 .02758 -4.62 0.000
educat | -.00913 .01680 -0.54 0.587 -.11120 .01865 -5.96 0.000
sbp | .05066 .00313 16.18 0.000 .07892 .00338 23.35 0.000
vituse | -.72097 .10752 -6.71 0.000 -.64256 .10176 -6.31 0.000
smoker | -.47851 .10456 -4.58 0.000 -.34981 .10033 -3.49 0.001
_cons | 12.70443 .49020 25.92 0.000 10.36452 .52213 19.85 0.000
------------------------------------------------ ------------------------------------

(B) Survey analysis using the data features

. svyset [pweight=wgt], strata(stra) psu(psu)


. svyregress bmi age agesq black hispanic pir educat sbp vituse smoker, deff

Survey linear regression


pweight: wgt Number of obs = 9920
Strata: stra Number of strata = 23
PSU: psu Number of PSUs = 46
Population size = 9920.06
F( 9, 15) = 71.84
Prob > F = 0.0000
R-squared = 0.1380
---------------------------------------------------------------------------------------
bmi | Coef. Std. Err. t P>|t| [95% Conf. Interval] Deff
-----------+---------------------------------------------------------------------------
age | .39778 .02110 18.85 0.000 .35412 .44143 2.0539
agesq | -.00421 .00023 -18.02 0.000 -.00469 -.00373 2.3647
black | .96291 .22418 4.30 0.000 .49916 1.42666 1.5778
hispanic | .64825 .20430 3.17 0.004 .22562 1.07087 .8897
pir | -.12751 .05624 -2.27 0.033 -.24855 -.01117 4.5323
educat | -.11203 .02703 -4.11 0.000 -.16712 -.05529 2.1457
sbp | .07892 .00514 15.35 0.000 .06828 .08956 1.8798
vituse | -.64256 .17793 -3.61 0.001 -1.01063 -.27449 3.0546
smoker | -.34982 .20405 -1.71 0.100 -.77192 .07229 4.0343
_cons | 10.36452 .80124 12.94 0.000 8.70704 12.02201 2.3041
---------------------------------------------------------------------------------------

unweighted and weighted analyses, and the weight is specified by [w = wgt]
in the weighted analysis. First, our attention is called to the disappointingly
low R-squared values, 0.12 in the unweighted analysis and 0.14 in the
weighted analysis. These values show that most of the variation in BMI is
unaccounted for by the model. Other important predictors evidently are not
included in this model, and a satisfactory specification of a model for
predicting BMI may not be possible within the scope of the NHANES III data.
Both the unweighted and weighted analyses indicate that age is positively
related, and age squared is negatively related, to BMI. This indicates that the
age effect is curvilinear, with a dampening trend for older ages, as one might
expect. The poverty index and education are negatively associated with BMI.
Examining the regression coefficients for the binary variables, both blacks
and Hispanics have positive coefficients, indicating that these two ethnic
groups have greater BMI than their counterparts. The systolic blood pressure
is positively related to BMI, and the vitamin users, who may be more con-
cerned about their health, have a lower BMI than the nonusers. Those who
have ever smoked have BMIs less than half a point lower than those who
never smoked.
There are some differences between the unweighted and weighted
analyses. Although the education effect is negligible (beta coefficient =
−0.009) in the unweighted analysis, it increases considerably in absolute
value (beta coefficient = −0.111) in the weighted analysis. If a preliminary
analysis had been conducted without the sample weights, one could have
overlooked education as an important predictor. The negative coefficient for
smoking status dampens slightly, suggesting that the negative association of
smoking with BMI is more pronounced for the oversampled groups than for
their counterparts. Again, the importance of the sample weights is
demonstrated here, and the analysis points to the advantage of using a PPS
subsample for preliminary analysis, as discussed at the beginning of this
chapter, rather than relying on an unweighted analysis.
The analytical results taking into account the weights and design features
are shown in the bottom panel. This analysis was done using the svyregress
command. The estimated regression coefficients and R2 are the same as
those shown in the weighted analysis because the same formula is used in
the estimation. However, the standard errors of the coefficients and the t sta-
tistics are considerably different from those in the weighted analysis. The
design effects of the estimated regression coefficients ranged from 0.89 for
Hispanics to 4.53 for poverty-to-income ratio. Again we see that a complex
survey design may result in a larger variance for some variables than for
their SRS counterparts, but not necessarily for all the variables. In this parti-
cular example, the general analytic conclusions that were drawn in the pre-
liminary analysis also were true in the final analysis, although the standard
errors for regression coefficients increased for all but one variable.
Comparing the design effects in Tables 6.2 and 6.4, one finds that the
design effects for regression coefficients are somewhat smaller than for
the means and proportions. So, applying the design effect estimated from the
means and totals to regression coefficients (when the clustering information
is not available from the data) would lead to conclusions that are too conser-
vative. Smaller design effects may be possible in a regression analysis if the
regression model controls for some of the cluster-to-cluster variability. For
example, if part of the reason for people in the same cluster having similar
BMI is similar age and education, then one would expect that adjusting for
age and education in the regression model might account for some of the cluster-
to-cluster variability. The clustering effect would then have less impact on
the residuals from the model.
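Because the design effect is the ratio of the design-based variance to the variance under SRS, it can be roughly checked from printed standard errors as a squared ratio. A Python sketch (note that Stata estimates the SRS variance from the same weighted data, so this ratio of printed standard errors only approximates the reported deff):

```python
def design_effect(se_design, se_srs):
    """Deff = design-based variance / SRS variance = squared ratio of SEs."""
    return (se_design / se_srs) ** 2

# Using the age coefficient in Table 6.4: the design-based SE of 0.02110
# (Panel B) against the unweighted SE of 0.01462 (Panel A) gives roughly
# 2.08, in the neighborhood of the reported deff of 2.05.
rough_deff = design_effect(0.02110, 0.01462)
```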
Regression analysis can also be conducted by using the REGRESS
procedure in SUDAAN as follows:
PROC REGRESS DESIGN = wr;
NEST stra psu;
WEIGHT wgt;
MODEL bmi = age agesq black hispanic pir educat sbp
vituse smoker;
RUN;

Conducting Contingency Table Analysis


The simplest form of studying the association of two discrete variables is
the two-way table. If data came from an SRS, we could use the Pearson chi-
square statistic to test the null hypothesis of independence. For the analysis of
a two-way table based on complex survey data, the test procedure needs to be
changed to account for the survey design. Several different test statistics have
been proposed. Koch, Freeman, and Freedman (1975) proposed using the
Wald statistic,6 and it has been used widely. The Wald statistic usually is
converted to an F statistic to determine the p value. In the F statistic, the
numerator degrees of freedom are tied to the dimension of the table, and the
denominator degrees of freedom reflect the survey design. Later, Rao and
Scott (1984) proposed correction procedures for the log-likelihood statistic,
using an F statistic with non-integer degrees of freedom. Based on a
simulation study (Sribney, 1998), Stata implemented the Rao-Scott corrected
statistic as the default procedure, but the Wald chi-square and the log-linear
Wald statistics are still available as options. SUDAAN, on the other hand, uses the
Wald statistic in its CROSSTAB procedure. In most situations, these two
statistics lead to the same conclusion.
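The conversion from a Wald chi-square to the adjusted F statistic follows a standard degrees-of-freedom adjustment consistent with the output shown in Table 6.5: with k numerator degrees of freedom and d = (number of PSUs) − (number of strata) design degrees of freedom, F = [(d − k + 1)/(dk)] W, referred to an F distribution with k and d − k + 1 degrees of freedom. A Python sketch:

```python
def wald_to_f(wald_chi2, k, n_psu, n_strata):
    """Adjusted F from a Wald chi-square statistic.

    k: numerator degrees of freedom (from the table dimension)
    d = n_psu - n_strata: design degrees of freedom
    Returns (F, numerator df, denominator df).
    """
    d = n_psu - n_strata
    f_stat = (d - k + 1) / (d * k) * wald_chi2
    return f_stat, k, d - k + 1

# For Panel B of Table 6.5: Wald chi2 = 51.9947 with k = 2, 46 PSUs,
# and 23 strata yields approximately F(2, 22) = 24.87, as printed.
f_stat, num_df, den_df = wald_to_f(51.9947, 2, 46, 23)
```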
Table 6.5 presents an illustration of two-way table analysis using Stata.
In this analysis, the association between vitamin use (vituse) and years
of education (edu) coded in three categories (1 = less than 12 years of

TABLE 6.5
Comparison of Vitamin Use by Level of Education Among U.S. Adults,
NHANES III, Phase II (n = 9,920): An Analysis Using Stata

(A) . tab vituse edu, column chi


--------------------------------------------------------
| edu
vituse | 1 2 3 | Total
-----------+---------------------------------+----------
0 | 2840 1895 1372 | 6107
| 68.43 61.89 50.66 | 61.56
-----------+---------------------------------+----------
1 | 1310 1167 1336 | 3813
| 31.57 38.11 49.34 | 38.44
-----------+---------------------------------+----------
Total | 4150 3062 2708 | 9920
| 100.00 100.00 100.00 | 100.00
Pearson chi2(2) = 218.8510 Pr = 0.000
--------------------------------------------------------
(B) . svyset [pweight=wgt], strata(stra) psu(psu)
. svytab vituse edu, column ci pearson wald

pweight: wgt Number of obs = 9920


Strata: stra Number of strata = 23
PSU: psu Number of PSUs = 46
Population size = 9920.06
----------------------------------------------------------------------
| edu
vituse | 1 2 3 Total
----------+-----------------------------------------------------------
0 | .6659 .6018 .4834 .5705
| [.6307,.6993] [.5646,.6379] [.4432,.5237] [.5452,.5955]
1 | .3341 .3982 .5166 .4295
| [.3007,.3693] [.3621,.4354] [.4763,.5568] [.4045,.4548]
Total | 1 1 1 1
----------------------------------------------------------------------
Key: column proportions
[95% confidence intervals for column proportions]
Pearson:
Uncorrected chi2(2) = 234.0988
Design-based F(1.63, 37.46) = 30.2841 P = 0.0000
Wald (Pearson):
Unadjusted chi2(2) = 51.9947
Adjusted F(2, 22) = 24.8670 P = 0.0000
----------------------------------------------------------------------
(C) . svyset [pweight=wgt], strata(stra) psu(psu)
. svytab vituse edu, subpop(hispanic) column ci wald

pweight: wgt Number of obs = 9920


Strata: stra Number of strata = 23
PSU: psu Number of PSUs = 46
Population size = 9920.06
Subpop.: hispanic==1 Subpop. no. of obs = 2593
Subpop. size = 539.043
----------------------------------------------------------------------
| edu
vituse | 1 2 3 Total
----------+-----------------------------------------------------------
0 | .7382 .6728 .5593 .6915
| [.6928,.7791] [.6309,.7122] [.4852,.6309] [.6509,.7293]
1 | .2618 .3272 .4407 .3085
| [.2209,.3072] [.2878,.3691] [.3691,.5148] [.2707,.3491]
Total | 1 1 1 1
----------------------------------------------------------------------
Key: column proportions
[95% confidence intervals for column proportions]
Wald (Pearson):
Unadjusted chi2(2) = 47.1625
Adjusted F(2, 22) = 22.5560 P = 0.0000
----------------------------------------------------------------------

education; 2 = 12 years; 3 = more than 12 years). In Panel A, the ordinary
chi-square analysis is performed ignoring the weights and the data struc-
ture. There is a statistically significant relation between education and use
of vitamins, with those having a higher education being more inclined to
use vitamins. The percentage of vitamin users varies from 32% in the
lowest level of education to 49% in the highest level. Panel B shows the ana-
lysis of the same data taking the survey design into account. The weighted
percentage of vitamin users by the level of education varies slightly more
than in the unweighted percentages, ranging from 33% in the first level of
education to 52% in the third level of education. Note that with the request
of ci, Stata can compute confidence intervals for the cell proportions.
In this analysis, both Pearson and Wald chi-square statistics are requested.
The uncorrected Pearson chi-square, based on the weighted frequencies, is
slightly larger than the chi-square value in Panel A, reflecting the slightly
greater variation in the weighted percentages. However, a proper p value
reflecting the complex design cannot be evaluated based on the uncorrected
Pearson chi-square statistic. A proper p value can be evaluated from the
design-based F statistic of 30.28 with 1.63 and 37.46 degrees of freedom,
which is based on the test procedure as a result of the Rao-Scott correction.
The unadjusted Wald chi-square test statistic is 51.99, but a proper p value
must be determined based on the adjusted F statistic. The denominator
degrees of freedom in both F statistics reflect the number of PSUs and strata
in the sample design. The adjusted F statistic is only slightly smaller than the
Rao-Scott F statistic. Either one of these test statistics would lead to the same
conclusion.
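The uncorrected Pearson chi-square in Panel A can be reproduced directly from the printed cell counts; a Python sketch:

```python
def pearson_chi2(table):
    """Pearson chi-square for an r x c table of counts (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# Cell counts from Panel A of Table 6.5 (vituse by edu).
counts = [[2840, 1895, 1372],   # nonusers
          [1310, 1167, 1336]]   # users
chi2 = pearson_chi2(counts)     # approximately 218.85, as printed
```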
In Panel C, the subpopulation analysis is performed for the Hispanic
population. Note that the entire data file is used in this analysis. The analysis
is based on 2,593 observations, but it represents only 539 people when
the sample weights are considered. The proportion of vitamin users among
Hispanics (31%) is considerably lower than the overall proportion of vitamin
users (43%). Again, there is a statistically significant relation between educa-
tion and use of vitamins among Hispanics, as the adjusted F statistic indicates.
Let us now look at a three-way table. Using the NHANES III, Phase II
adult sample data, we will examine gender difference in vitamin use across
the levels of education. This will be a 2 × 2 × 3 table, and we can perform a
two-way table analysis at each level of education. Table 6.6 shows the ana-
lysis of three 2 × 2 tables using SAS and SUDAAN. The analysis ignoring
the survey design is shown in the top panel of the table. At the lowest level
of education, the percentage of vitamin use for males is lower than for
females, and the chi-square statistic suggests the difference is statistically
significant. Another way of examining the association in a 2 × 2 table is the
calculation of the odds ratio.
In this table, the odds of using vitamins for males are 0.358 [= 0.2634/
(1 − 0.2634)], and for females they are 0.567 [= 0.3617/(1 − 0.3617)]. The
ratio of the male odds to the female odds is 0.63 (= 0.358/0.567), indicating that

TABLE 6.6
Analysis of Gender Difference in Vitamin Use by
Level of Education Among U.S. Adults, NHANES III,
Phase II (n = 9,920): An Analysis Using SAS and SUDAAN
(A) Unweighted analysis by SAS:
proc freq;
tables edu*sex*vituse / nopercent nocol chisq measures cmh;
run;
[Output summarized below]

Level of education: Less than H.S. H.S. graduate Some college


Vitamin use status: (n) User (n) User (n) User

Gender- Male: (1944) 26.34% (1197) 31.91% (1208) 43.54%


Female: (2206) 36.17 (1865) 42.09 (1500) 54.00

Chi-square: 46.29 32.02 29.27


P-value: <.0001 <.0001 <.0001
Odds ratio: 0.63 0.64 0.66
95% CI: (0.56, 0.72) (0.56, 0.75) (0.56, 0.76)

CMH chi-square: 107.26 (p<.0001)


CMH common odds ratio: 0.64 95%CI: (0.59, 0.70)
--------------------------------------------------------
(B) Survey analysis by SUDAAN:
proc crosstab design=wr;
nest stra psu;
weight wgt;
subgroup edu sex vituse;
levels 3 2 2;
tables edu*sex*vituse;
print nsum wsum rowper cor upcor lowcor chisq chisqp cmh cmhpval;
run;
[Output summarized below]

Level of education: Less than H.S. H.S. graduate Some college


Vitamin use status: (n)* User (n)* User (n)* User

Gender- Male: (1274.9) 28.04% (1432.2) 32.66% (2031.1) 45.62%


Female: (1299.7) 38.62 (1879.4) 45.43 (2002.7) 57.37

Chi-square: 19.02 38.01 10.99

P-value: .0002 <.0001 .0030


Odds ratio: 0.62 0.58 0.62
95% CI: (0.50, 0.77) (0.49, 0.69) (0.48, 0.80)

CMH chi-square: 42.55 (p<.0001)



* Weighted sum

the males’ odds of taking vitamins are 63% of the females’ odds. The 95%
confidence interval does not include 1, suggesting that the difference is sta-
tistically significant. The odds ratios are consistent across three levels of
education. Because the ratios are consistent, we can combine 2 × 2 tables
across the three levels of education. We can then calculate the Cochran-
Mantel-Haenszel (CMH) chi-square (df = 1) and the CMH common odds
ratio. The education-adjusted odds ratio is 0.64, and its 95% confidence
interval does not include 1.
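These per-stratum odds ratios and the CMH common odds ratio can be reproduced in Python from cell counts reconstructed from the printed percentages in Panel A, so agreement with the table holds only up to rounding:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a/b) / (c/d) = ad / (bc)."""
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio across strata of 2x2 tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# (male users, male nonusers, female users, female nonusers), one tuple
# per education level, reconstructed from Table 6.6, Panel A.
strata = [(512, 1432, 798, 1408),
          (382, 815, 785, 1080),
          (526, 682, 810, 690)]
per_stratum = [round(odds_ratio(*t), 2) for t in strata]
common = mantel_haenszel_or(strata)   # close to the printed 0.64
```

The per-stratum values come out near the printed 0.63, 0.64, and 0.66, and the common odds ratio near the CMH value of 0.64.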
The lower panel of Table 6.6 shows the results of using the CROSSTAB
procedure in SUDAAN to perform the same analysis, taking the survey
design into account. On the PROC statement, DESIGN = wr designates
with-replacement sampling, meaning that the finite population correction is
not used. The NEST statement designates the stratum and PSU variables.
The WEIGHT statement gives the weight variable. The SUBGROUP state-
ment declares three discrete variables, and the LEVELS statement specifies
the number of levels in each discrete variable. The TABLES statement
defines the form of contingency table. The PRINT statement requests nsum
(frequencies), wsum (weighted frequencies), rowper (row percent), cor
(crude odds ratio), upcor (upper limit of cor), lowcor (lower limit of cor),
chisq (chi-square statistic), chisqp (p value for chi-square statistic), cmh
(CMH statistic), and cmhpval (p value for CMH).
The weighted percentages of vitamin use are slightly different from the
unweighted percentages. The Wald chi-square values in three separate
analyses are smaller than the Pearson chi-square values in the upper panel
except for the middle level of education. Although the odds ratio
remained almost the same at the lower level of education, it decreased
somewhat at the middle and higher levels of education. The CROSSTAB
procedure in SUDAAN did not compute the common odds ratio, but it can
be obtained from a logistic regression analysis to be discussed in the next
section.

Conducting Logistic Regression Analysis


The linear regression analysis presented earlier may not be useful to
many social scientists because many of the variables in social science
research generally are measured in categories (nominal or ordinal). A number
of statistical methods are available for analyzing categorical data, ranging
from basic cross-tabulation analysis, shown in the previous section, to gener-
alized linear models with various link functions. As Knoke and Burke (1980)
observed, the modeling approach revolutionized contingency table analysis
in the social sciences, casting aside most of the older methods for deter-
mining relationships among variables measured at discrete levels. Two
approaches have been widely used by social scientists: log-linear models
using the maximum likelihood approach (Knoke & Burke, 1980; Swafford,
1980) and the weighted least square approach (Forthofer & Lehnen, 1981;
Grizzle, Starmer, & Koch, 1969). The use of these two methods for analyzing
complex survey data was illustrated in the first edition of this book. In these
models, the cell proportions or functions of them (e.g., natural logarithm in
the log-linear model) are expressed as a linear combination of effects that
make up the contingency table. Because these methods are restricted to con-
tingency tables, continuous independent variables cannot be included in
the analysis.
In the past decade, social scientists have begun to use logistic regression
analysis more frequently because of its ability to incorporate a larger
number of explanatory variables, including continuous variables (Aldrich
& Nelson, 1984; DeMaris, 1992; Hosmer & Lemeshow, 1989; Liao, 1994).
Logistic regression and other generalized linear models with different link
functions are now implemented in software packages for complex survey
data analysis. Survey analysts can choose the most appropriate model from
an array of models. The application of the logit model is illustrated below,
using Stata and SUDAAN.
The ordinary linear regression analysis represented by Equation 6.1
examines the relationship between a continuous dependent variable and
one or more independent variables. The logistic regression is a method for
examining the association of a categorical outcome with a number of inde-
pendent variables. The following equation shows the type of modeling a
binary outcome variable Y for i = 1, 2, . . . , n:

log[πi /(1 − πi)] = β0 + β1x1i + β2x2i + ⋯ + βp−1xp−1,i.        (6.2)

In Equation 6.2, πi is the probability that yi = 1. This is a generalized
linear model with a link function of the log odds or logit. Instead of least
squares estimation, the maximum likelihood approach (see Eliason, 1993) is
used to estimate the parameters. Because the simultaneous equations to be
solved are nonlinear with respect to the parameters, iterative techniques are
used. Maximum likelihood theory also offers an estimator of the covariance
matrix of the estimated β’s, assuming individual observations are random
and independent.
Just as in the analysis of variance model, if a variable has l levels, we only
use l − 1 levels in the model. We shall measure the effects of the l − 1 levels
from the effect of the omitted or reference level of the variable. The estimated
β (i.e., β̂) is the difference in logits between the level in the model and the
omitted level, that is, the natural log of the odds ratio of the level in the model
over the omitted level. Thus, taking e^β̂ gives the odds ratio adjusted for
other variables in the model. The results of logistic regression usually are
summarized and interpreted as odds ratios (see Liao, 1994, chap. 3).
With complex survey data, the maximum likelihood estimation needs to
be modified, because each observation has a sample weight. The maximum
likelihood solution incorporating the weights is generally known as pseudo
or weighted maximum likelihood estimation (Chambless & Boyle, 1985;
Roberts, Rao, & Kumar, 1987). Whereas the point estimates are calculated
by the pseudo likelihood procedure, the covariance matrix of the estimated
β̂'s is calculated by one of the methods discussed in Chapter 4. As discussed
earlier, the approximate degrees of freedom associated with this covariance
matrix are the number of PSUs minus the number of strata. Therefore, the
standard likelihood-ratio test for model fit should not be used with the
survey logistic regression analysis. Instead of the likelihood-ratio test,
the adjusted Wald test statistic is used.
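The pseudo likelihood simply weights each observation's contribution to the Bernoulli log-likelihood by its sample weight; with all weights equal to 1, it reduces to the ordinary log-likelihood. A Python sketch with hypothetical outcomes and fitted probabilities:

```python
import math

def pseudo_log_likelihood(y, p, w):
    """Weighted Bernoulli log-likelihood:
    sum_i w_i * [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)].

    y: 0/1 outcomes, p: predicted probabilities, w: sample weights.
    With w_i = 1 for all i, this is the ordinary log-likelihood that
    standard logistic regression maximizes.
    """
    return sum(wi * (yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
               for yi, pi, wi in zip(y, p, w))

# Hypothetical outcomes, fitted probabilities, and sample weights.
y = [1, 0, 1, 1]
p = [0.8, 0.3, 0.6, 0.5]
unweighted = pseudo_log_likelihood(y, p, [1, 1, 1, 1])
weighted = pseudo_log_likelihood(y, p, [2.0, 0.5, 1.5, 1.0])
```

The survey software maximizes the weighted version over the β's; the covariance matrix then comes from a design-based method, not from the likelihood itself.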
The selection and inclusion of appropriate predictor variables for a logistic
regression model can be done similarly to the process for linear regression.
When analyzing a large survey data set, the preliminary analysis strategy
described in the earlier section is very useful in preparing for a logistic regres-
sion analysis.
To illustrate logistic regression analysis, the same data used in Table 6.6
are analyzed using Stata. The analytical results are shown in Table 6.7. The
Stata output is edited somewhat to fit into a table. The outcome variable
is vitamin use (vituse), and explanatory variables are gender (1 = male;
0 = female) and level of education (edu). The interaction term is not included
in this model, based on the CMH statistic shown in Table 6.6. First, we per-
formed standard logistic regression analysis, ignoring the weight and design
features. The results are shown in Panel A. Stata automatically performs the
effect (or dummy) coding for discrete variables with the use of the xi option
preceding the logit statement and adding i. in front of the variable name.
The output shows the omitted level of each discrete variable. In this case,
the level “male” is in the model, and its effect is measured from the effect of
“female,” the reference level. For education, being less than a high school
graduate is the reference level. The likelihood-ratio chi-square value is 325.63
(df = 3) with p value of < 0.00001, and we reject the hypothesis that gender
and education together have no effect on vitamin usage, suggesting that there
is a significant effect. However, the pseudo R2 suggests that most of the varia-
tion in vitamin use is unaccounted for by these two variables. The parameter
estimates for gender and education and their estimated standard errors are
shown as well as the corresponding test statistics. All factors are significant.
Including the or option in the logit statement produces the odds ratios instead of
beta coefficients. The estimated odds ratio for males is 0.64, meaning that
the odds of taking vitamins for a male is 64% of the odds that a female uses
vitamins after adjusting for education. This odds ratio is the same as the
CMH common odds ratio shown in Table 6.6. The significance of the odds
ratio can be tested using either a z test or a confidence interval. The odds
ratio for the third level of education suggests that persons with some college
education are about twice as likely to take vitamins as those with less than
12 years of education for the same gender. None of the confidence intervals
includes 1, suggesting that all effects are significant.
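The printed odds ratios are simply the exponentiated coefficients from the same panel, which is easy to verify in Python:

```python
import math

# Coefficients from Panel A of Table 6.7.
b_male, b_edu2, b_edu3 = -0.4418, 0.2580, 0.7459

or_male = math.exp(b_male)   # about 0.64: male odds are 64% of female odds
or_edu2 = math.exp(b_edu2)   # about 1.29
or_edu3 = math.exp(b_edu3)   # about 2.11: roughly twice the reference odds
```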
Panel B of Table 6.7 shows the goodness-of-fit statistic (chi-square with
df = 2). The large p value suggests that the main effects model fits the data
(not significantly different from the saturated model). In this simple situation,

TABLE 6.7
Logistic Regression Analysis of Vitamin Use on Gender and
Level of Education Among U.S. Adults, NHANES III,
Phase II (n = 9,920): An Analysis Using Stata
(A) Standard logistic regression (unweighted, ignoring sample design):
. xi: logit vituse i.male i.edu

i.male _Imale_0-1 (naturally coded; _Imale_0 omitted)


i.edu _Iedu_1-3 (naturally coded; _Iedu_1 omitted)
Iteration 0: log likelihood = -6608.3602
Iteration 1: log likelihood = -6445.8981
Iteration 2: log likelihood = -6445.544
Iteration 3: log likelihood = -6445.544
Logit estimates Number of obs = 9920
LR chi2(3) = 325.63
Prob > chi2 = 0.0000
Log likelihood = -6445.544 Pseudo R2 = 0.0246
---------------------------------------------------------------------------------------------
vituse | Coef. Std. Err. z P>|z| [95% Conf.Int.] Odds Ratio [95% Conf.Int.]
-----------+--------------------------------------------------------------------------------
_Imale_1 | -.4418 .0427 -10.34 0.000 -.5256 -.3580 .6429 .5912 .6990
_Iedu_2 | .2580 .0503 5.12 0.000 .1593 .3566 1.2943 1.1773 1.4285
_Iedu_3 | .7459 .0512 14.56 0.000 .6455 .8462 2.1082 1.9069 2.3308
_cons | -.5759 .0382 -15.07 0.000 -.6508 -.5010
---------------------------------------------------------------------------------------------

(B) Testing goodness-of-fit:


. lfit

Logistic model for vit, goodness-of-fit test


number of observations = 9920
number of covariate patterns = 6
Pearson chi2(2) = 0.16
Prob > chi2 = 0.9246

(C) Survey logistic regression (incorporating the weights and design features):
. svyset [pweight=wgt], strata (stra) psu(psu)
. xi: svylogit vituse i.male i.edu

i.male _Imale_0-1 (naturally coded; _Imale_0 omitted)


i.edu _Iedu_1-3 (naturally coded; _Iedu_1 omitted)

Survey logistic regression


pweight: wgt Number of obs = 9920
Strata: stra Number of strata = 23
PSU: psu Number of PSUs = 46
Population size = 9920.06
F( 3, 21) = 63.61
Prob > F = 0.0000
-------------------------------------------------------------------------------------------------
vituse | Coef. Std. Err. t P>|t| [95% Conf. Int.] Deff Odds Rat.[95% Conf. Int.]
------------+------------------------------------------------------------------------------------
_Imale_1 | -.4998 .0584 -8.56 0.000 -.6206 -.3791 1.9655 .6066 .5376 .6845
_Iedu_2 | .2497 .0864 2.89 0.008 .0710 .4283 2.4531 1.2836 1.0736 1.5347
_Iedu_3 | .7724 .0888 8.69 0.000 .5885 .9562 2.8431 2.1649 1.8013 2.6019
_cons | -.4527 .0773 -5.86 0.000 -.6126 -.2929 2.8257
-------------------------------------------------------------------------------------------------

(D) Testing linear combination of coefficients:


. lincom _Imale_1+_Iedu_3, or

( 1) _Imale_1 + _Iedu_3 = 0.0


------------------------------------------------------------------------------
vituse | Odds Ratio Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 1.3132 .1518 2.36 0.027 1.0340 1.6681
------------------------------------------------------------------------------

the two degrees of freedom associated with the goodness of fit of the model
can also be interpreted as the two degrees of freedom associated with the
gender-by-education interaction. Hence, there is no interaction of gender and
education in relation to the proportion using vitamin supplements, confirming
the CMH analysis shown in Table 6.6.

Panel C of Table 6.7 shows the results of logistic regression analysis for
the same data, with the survey design taken into account. The log likelihood
is not shown because the pseudo likelihood is used. Instead of a likelihood-
ratio statistic, the F statistic is used. Again, the p value suggests that
the main effects model is a significant improvement over the null model.
The estimated parameters and odds ratios changed slightly because of the
sample weights, and the estimated standard errors of beta coefficients
increased as reflected in the design effects. Despite the increased standard
errors, the beta coefficients for gender and education levels are significantly
different from 0. The odds ratio for males adjusted for education decreased
to 0.61 from 0.64. Although the odds ratio remained about the same for
the second level of education, its p value increased considerably, to 0.008
from < 0.0001, because the design was taken into account.
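The design effects reported in Panel C can be read as variance inflation factors: a design effect of d means the design-based variance is d times what a simple random sample of the same size would give, so the standard error is inflated by √d. A minimal sketch of this arithmetic in Python, using the _Imale_1 row of Panel C:

```python
import math

# deff = (design-based SE / SRS SE)^2, so the design-based SE exceeds
# the SRS SE by a factor of sqrt(deff). Values from Panel C, Table 6.7.
se_design = 0.0584   # design-based SE of the _Imale_1 coefficient
deff = 1.9655        # reported design effect

se_inflation = math.sqrt(deff)     # SE inflation factor due to the design
se_srs = se_design / se_inflation  # SE an equal-sized SRS would give

print(round(se_inflation, 2))  # 1.4 -> SE about 40% larger than under SRS
print(round(se_srs, 4))        # 0.0417
```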
After the logistic regression model was run, the effect of linear combination
of parameters was tested as shown in Panel D. We wanted to test the hypoth-
esis that the sum of parameters for male and the third level of education is zero.
Because there is no interaction effect, the resulting odds ratio of 1.3 can be
interpreted as indicating that the odds of taking vitamin for males with some
college education are 30% higher than the odds for the reference (females with
less than 12 years of education). SUDAAN also can be used to perform a logis-
tic regression analysis, using its LOGISTIC procedure in the stand-alone ver-
sion or the RLOGIST procedure in the SAS callable version (a different name
used to distinguish it from the standard logistic procedure in SAS).
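The combined odds ratio that lincom reports in Panel D can be recovered by hand from the Panel C coefficients: exponentiate the sum of the two log-odds coefficients. A sketch in Python (the printed standard error is not reproducible this way, because it also requires the estimated covariance of the two coefficients, which is not shown):

```python
import math

# Coefficients from Panel C of Table 6.7
b_male = -0.4998   # _Imale_1
b_edu3 = 0.7724    # _Iedu_3

# lincom _Imale_1 + _Iedu_3, or  ->  exp(beta_male + beta_edu3)
combined_or = math.exp(b_male + b_edu3)
print(round(combined_or, 2))  # 1.31, matching the Panel D odds ratio
```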
Finally, the logistic regression model also can be used to build a prediction
model for synthetic estimation. Because most health surveys are designed to
estimate national statistics, it is difficult to estimate health characteristics
for small areas. One approach to obtaining estimates for small areas is syn-
thetic estimation, which utilizes a national health survey and demographic
information about local areas. LaVange, Lafata, Koch, and Shah (1996) estimated the
prevalence of activity limitation among the elderly for U.S. states and counties
using a logistic regression model fit to the National Health Interview Survey
(NHIS) and Area Resource File (ARF). Because the NHIS is based on a com-
plex survey design, they used SUDAAN to fit a logistic regression model to
activity limitation indicators on the NHIS, supplemented with county-level
variables from ARF. The model-based predicted probabilities were then
extrapolated to calculate estimates of activity limitation for small areas.
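The synthetic estimation step can be sketched as follows: fit a logistic model on the national survey, then apply the model's predicted probabilities to a small area's known demographic cell counts. Everything below (coefficients, covariates, and cell counts) is a hypothetical illustration, not taken from the LaVange et al. model:

```python
import math

def expit(z):
    """Inverse logit: converts a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical national model:
# logit Pr(limitation) = b0 + b1*age65plus + b2*female
b0, b_age65, b_female = -2.0, 1.1, 0.3

# Hypothetical county demographic cells: (age65plus, female, population count)
county_cells = [
    (0, 0, 20000),
    (0, 1, 21000),
    (1, 0, 4000),
    (1, 1, 5000),
]

# Synthetic estimate: national model probabilities weighted by local counts
total = sum(n for _, _, n in county_cells)
prevalence = sum(n * expit(b0 + b_age65 * a + b_female * f)
                 for a, f, n in county_cells) / total
print(round(prevalence, 3))  # 0.171 under these illustrative inputs
```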

Other Logistic Regression Models


The binary logistic regression model discussed above can be extended to
deal with more than two response categories. Some such response categories
are ordinal, as in perceived health status: excellent, good, fair, and poor.
Other response categories may be nominal, as in religious preferences. These
ordinal and nominal outcomes can be examined as functions of a set of dis-
crete and continuous independent variables. Such modeling can be applied to
complex survey data, using Stata or SUDAAN. In this section, we present
two examples of such analyses without detailed discussion and interpretation.
For details of the models and their interpretation, see Liao (1994).
To illustrate the ordered logistic regression model, we examined obesity
categories based on BMI. Public health nutritionists use the following criteria
to categorize BMI for levels of obesity: obese (BMI ≥ 30), overweight
(25 ≤ BMI < 30), normal (18.5 ≤ BMI < 25), and underweight (BMI < 18.5).
Based on NHANES III, Phase II data, 18% of U.S. adults are obese, 34%
overweight, 45% normal, and 3% underweight. We want to examine the rela-
tionship between four levels of obesity (bmi2: 1 = obese, 2 = overweight,
3 = normal, and 4 = underweight) and a set of explanatory variables includ-
ing age (continuous), education (edu), black, and Hispanic.
For the four ordered categories of obesity, the following three sets of
probabilities are modeled as functions of explanatory variables:

Pr{obese} versus Pr{all other levels}


Pr{obese plus overweight} versus Pr{normal plus underweight}
Pr{obese plus overweight plus normal} versus Pr{underweight}

Then three binary logistic regression models could be used to fit a separate
model to each of three comparisons. Recognizing the natural ordering of obe-
sity categories, however, we could estimate the ‘‘average’’ effect of explana-
tory variables by considering the three binary models simultaneously, based
on the proportional odds assumption. What is assumed here is that the regres-
sion lines for the different outcome levels are parallel to each other and that
they are allowed to have different intercepts (this assumption needs to be
tested using the chi-square statistic; the test result is not shown in the table).
The following represents the model for j = 1, 2, . . . , c − 1 (c is the number
of categories in the dependent variable):
log[Pr(category ≤ j) / Pr(category ≥ j + 1)] = αj + β1x1 + · · · + βpxp    (6.3)

From this model, we estimate (c − 1) intercepts and a set of β’s.
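Equation 6.3 yields category probabilities by differencing the cumulative probabilities. A sketch in Python using the three intercepts and the age coefficient later reported in Table 6.8, with the remaining covariates held at their reference (zero) values, which is an illustrative simplification:

```python
import math

def expit(z):
    """Inverse logit: converts a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Intercepts (alpha_j) and the age coefficient from Table 6.8; edu, black,
# and hispanic are held at zero here purely for illustration.
alphas = [-2.27467, -0.62169, 2.85489]   # the c - 1 = 3 cut points
b_age = 0.01500
age = 45

# Cumulative probabilities Pr(category <= j) from Equation 6.3
cum = [expit(a + b_age * age) for a in alphas]

# Category probabilities (obese, overweight, normal, underweight)
# obtained by differencing the cumulative probabilities
probs = [cum[0], cum[1] - cum[0], cum[2] - cum[1], 1.0 - cum[2]]
print([round(p, 3) for p in probs])  # [0.168, 0.345, 0.458, 0.028]
```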


Table 6.8 shows the result of the above analysis using SUDAAN. The
SUDAAN statements are shown at the top. The first statement, PROC
MULTILOG, specifies the procedure. DESIGN, NEST, and WEIGHT spe-
cifications are the same as in Table 6.6. REFLEVEL declares the first level
of education as the reference (the last level is used as the reference if not

TABLE 6.8
Ordered Logistic Regression Analysis of Obesity Levels on Education,
Age, and Ethnicity Among U.S. Adults, NHANES III, Phase II
(n = 9,920): An Analysis Using SUDAAN

proc multilog design=wr;


nest stra psu;
weight wgt;
reflevel edu=1;
subgroup bmi2 edu;
levels 4 3;
model bmi2=age edu black hispanic/ cumlogit;
setenv decwidth=5;
run;

Independence parameters have converged in 4 iterations


-2*Normalized Log-Likelihood with Intercepts Only: 21125.58
-2*Normalized Log-Likelihood Full Model : 20791.73
Approximate Chi-Square (-2*Log-L Ratio) : 333.86
Degrees of Freedom : 5

Variance Estimation Method: Taylor Series (WR)


SE Method: Robust (Binder, 1983)
Working Correlations: Independent
Link Function: Cumulative Logit
Response variable: BMI2
----------------------------------------------------------------------
BMI2 (cum-logit),
Independent P-value
Variables and Beta T-Test
Effects Coeff. SE Beta T-Test B=0 B=0
----------------------------------------------------------------------
BMI2 (cum-logit)
Intercept 1 -2.27467 0.11649 -19.52721 0.00000
Intercept 2 -0.62169 0.10851 -5.72914 0.00001
Intercept 3 2.85489 0.11598 24.61634 0.00000
AGE 0.01500 0.00150 9.98780 0.00000
EDU
1 0.00000 0.00000 . .
2 0.15904 0.10206 1.55836 0.13280
3 -0.20020 0.09437 -2.12143 0.04488
BLACK 0.49696 0.08333 5.96393 0.00000
HISPANIC 0.55709 0.06771 8.22744 0.00000
----------------------------------------------------------------------
-------------------------------------------------------
Contrast Degrees of P-value
Freedom Wald F Wald F
-------------------------------------------------------
OVERALL MODEL 8.00000 377.97992 0.00000
MODEL MINUS
INTERCEPT 5.00000 36.82064 0.00000
AGE 1.00000 99.75615 0.00000
EDU 2.00000 11.13045 0.00042
BLACK 1.00000 35.56845 0.00000
HISPANIC 1.00000 67.69069 0.00000
-------------------------------------------------------
-----------------------------------------------------------
BMI2 (cum-logit),
Independent
Variables and Lower 95% Upper 95%
Effects Odds Ratio Limit OR Limit OR
-----------------------------------------------------------
AGE 1.01511 1.01196 1.01827
EDU
1 1.00000 1.00000 1.00000
2 1.17239 0.94925 1.44798
3 0.81857 0.67340 0.99503
BLACK 1.64372 1.38346 1.95295
HISPANIC 1.74559 1.51743 2.00805
-----------------------------------------------------------

specified). The categorical variables are listed on the SUBGROUP statement,
and the number of categories of each of these variables is listed on the
LEVELS statement. The MODEL statement specifies the dependent variable,
followed by the list of independent variables. The keyword CUMLOGIT on
the MODEL statement fits a proportional odds model. Without this keyword,
SUDAAN fits the multinomial logistic regression model that will be dis-
cussed in the next section. Finally, the SETENV statement requests five decimal
places in the printed output.
The output shows three estimates of intercepts and one set of beta coeffi-
cients for independent variables. The statistics in the second box indicate
that main effects are all significant. The odds ratios in the third box can be
interpreted in the same manner as in the binary logistic regression. Hispanics
have 1.7 times higher odds of being obese than non-Hispanics, controlling
for the other independent variables. Before interpreting these results, we must
check whether the proportional odds assumption is met, but the output does
not give any statistic for checking this assumption. To check this assumption,
we ran three ordinary logistic regression analyses (obese vs. all other, obese
plus overweight vs. normal plus underweight, and obese plus overweight plus
normal vs. underweight). The three odds ratios for age were 1.005, 1.012,
and 1.002, respectively, and they are similar to the value of 1.015 shown in
the bottom section of Table 6.8. The odds ratios for other independent vari-
ables also were reasonably similar, and we have concluded that the propor-
tional odds assumption seems to be acceptable.
Stata also can be used to fit a proportional odds model using its svyolog
procedure, but Stata fits a slightly different model. Whereas the sum of the βixi
terms is added to the intercept in Equation 6.3, it is subtracted in the Stata model.
Thus, the estimated beta coefficients from Stata carry the sign opposite
from those of SUDAAN, while the absolute values are the same. This
means that the odds ratios from Stata are the reciprocal of odds ratios esti-
mated from SUDAAN. The two programs give identical intercept estimates.
Stata uses the term cut instead of intercept.
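The sign reversal is easy to verify numerically; a sketch using the EDU = 3 estimate from Table 6.8 (SUDAAN's parameterization):

```python
import math

# SUDAAN models alpha_j + x'beta; the Stata parameterization uses
# alpha_j - x'beta, so betas flip sign and odds ratios become reciprocals.
b_sudaan = -0.20020                 # EDU = 3 coefficient from Table 6.8
or_sudaan = math.exp(b_sudaan)      # 0.819, as printed by SUDAAN

b_stata = -b_sudaan                 # what Stata would report
or_stata = math.exp(b_stata)        # the reciprocal odds ratio

print(round(or_sudaan, 3), round(or_stata, 3))  # 0.819 1.222
```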
For nominal outcome categories, a multinomial logistic regression model
can be used. Using this model, we can examine the relationship between a
multilevel nominal outcome variable (no ordering is recognized) and a set
of explanatory variables. The model designates one level of the outcome as
the base category and estimates the log of the ratio of the probability of being
in the j-th category to that of being in the base category. This ratio is called
the relative risk, and the log of this ratio is known as the generalized logit.
We used the same obesity categories used above. Although we recognized
the ordering of obesity levels previously, we considered it as a nominal vari-
able this time because we were interested in comparing the levels of obesity
to the normal category. Accordingly, we coded the obesity levels differently
[bmi3: 1 = obese, 2 = overweight, 3 = underweight, and 4 = normal (the
base)]. We used three predictor variables including age (continuous vari-
able), sex [1 = male; 2 = female (reference)], and current smoking status
[csmok: 1 = current smoker; 2 = never smoked (reference); 3 = previous
smoker]. The following equations represent the model:
smoker]. The following equations represent the model:
 
log[Pr(obese)/Pr(normal)] = β0,1 + β1,1(age) + β2,1(male)
                            + β3,1(c.smoker) + β4,1(p.smoker)

log[Pr(overweight)/Pr(normal)] = β0,2 + β1,2(age) + β2,2(male)
                                 + β3,2(c.smoker) + β4,2(p.smoker)

log[Pr(underweight)/Pr(normal)] = β0,3 + β1,3(age) + β2,3(male)
                                  + β3,3(c.smoker) + β4,3(p.smoker)    (6.4)
We used SUDAAN to fit the above model, and the results are shown in
Table 6.9 (the output is slightly edited to fit into a single table). The SUDAAN
statements are similar to the previous statements for the proportional odds
model except for omitting CUMLOGIT on the MODEL statement. The
svymlog procedure in Stata can also fit the multinomial regression model.
Table 6.9 shows both beta coefficients and relative risk ratios (labeled as
odds ratios). Standard errors and the p values for testing β = 0 also are
shown. Age is a significant factor in comparing obese versus normal and
overweight versus normal, but not in comparing underweight versus
normal. Although gender makes no difference in comparing obese and nor-
mal, it makes a difference in the other two comparisons. Looking at the table
of odds ratios, the relative risk ratio of being overweight to normal for males
is more than 2 times as great as for females, provided age and smoking status
are the same. The relative risk of being obese to normal for current smokers
is only 68% of that for those who never smoked, holding age and gender constant.
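The relative risk ratios in the bottom panel of Table 6.9 are simply the exponentiated beta coefficients from the top panel; a sketch for the 2-versus-4 (overweight vs. normal) contrast:

```python
import math

# Beta coefficients for the "2 vs 4" (overweight vs. normal) contrast
# in Table 6.9
betas = {"SEX=1": 0.76668, "CSMOK=1": -0.24271, "CSMOK=3": -0.02006}

# Relative risk ratio = exp(beta), holding the other covariates fixed
rrr = {name: round(math.exp(b), 3) for name, b in betas.items()}
print(rrr)  # SEX=1 -> 2.153, matching the printed 2.15260
```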
Available software also supports other statistical models that can be used
to analyze complex survey data. For example, SUDAAN supports Cox’s
regression model (proportional hazard model) for a survival analysis,
although cross-sectional surveys seldom provide longitudinal data. Other
generalized linear models defined by different link functions also can
be applied to complex survey data, using the procedures supported by
SUDAAN, Stata, and other programs.

TABLE 6.9
Multinomial Logistic Regression Analysis of Obesity on Gender and
Smoking Status Among U.S. Adults, NHANES III, Phase II (n = 9,920):
An Analysis Using SUDAAN
proc multilog design=wr;
nest stra psu;
weight wgt;
reflevel csmok=2 sex=2;
subgroup bmi3 csmok sex;
levels 4 3 2;
model bmi3=age sex csmok;
setenv decwidth=5;
run;

Independence parameters have converged in 6 iterations


Approximate Chi-Square (-2*Log-L Ratio) : 587.42
Degrees of Freedom : 12

Variance Estimation Method: Taylor Series (WR)


SE Method: Robust (Binder, 1983)
Working Correlations: Independent
Link Function: Generalized Logit
Response variable: BMI3
-------------------------------------------------------------------------------------------------
| BMI3 (log-odds)| | Independent Variables and Effects |
| | | Intercept | AGE | SEX = 1 | CSMOK = 1 | CSMOK = 3 |
-------------------------------------------------------------------------------------------------
| 1 vs 4 | Beta Coeff. | -1.33334 | 0.01380 | 0.08788 | -0.39015 | -0.27372 |
| | SE Beta | 0.14439 | 0.00214 | 0.12509 | 0.07203 | 0.13206 |
| | T-Test B=0 | -9.23436 | 6.43935 | 0.70251 | -5.41617 | -2.07277 |
| | P-value | 0.00000 | 0.00000 | 0.48941 | 0.00002 | 0.04958 |
-------------------------------------------------------------------------------------------------
| 2 vs 4 | Beta Coeff. | -1.25883 | 0.01527 | 0.76668 | -0.24271 | -0.02006 |
| | SE Beta | 0.13437 | 0.00200 | 0.08275 | 0.11067 | 0.09403 |
| | T-Test B=0 | -9.36835 | 7.64830 | 9.26512 | -2.19307 | -0.21335 |
| | P-value | 0.00000 | 0.00000 | 0.00000 | 0.03868 | 0.83293 |
-------------------------------------------------------------------------------------------------
| 3 vs 4 | Beta Coeff. | -2.07305 | -0.01090 | -1.16777 | 0.33434 | 0.04694 |
| | SE Beta | 0.48136 | 0.00742 | 0.25280 | 0.30495 | 0.26168 |
| | T-Test B=0 | -4.30663 | -1.46824 | -4.61937 | 1.09637 | 0.17936 |
| | P-value | 0.00026 | 0.15558 | 0.00012 | 0.28427 | 0.85923|
-------------------------------------------------------------------------------------------------
-------------------------------------------------------
Contrast Degrees of P-value
Freedom Wald F Wald F
-------------------------------------------------------
OVERALL MODEL 15.00000 191.94379 0.00000
MODEL MINUS INTERCEP 12.00000 68.70758 0.00000
INTERCEPT . . .
AGE 3.00000 22.97518 0.00000
SEX 3.00000 64.83438 0.00000
CSMOK 6.00000 6.08630 0.00063
-------------------------------------------------------
-----------------------------------------------------------------------------------------------
| BMI3(log-odds)| | Independent Variables and Effects |
| | | Intercept | AGE | SEX = 1 | CSMOK = 1 | CSMOK = 3 |
-----------------------------------------------------------------------------------------------
| 1 vs 4 | Odds Ratio | 0.26360 | 1.01390 | 1.09186 | 0.67695 | 0.76054 |
| | Lower 95% Limit | 0.19553 | 1.00941 | 0.84291 | 0.58323 | 0.57874 |
| | Upper 95% Limit | 0.35535 | 1.01840 | 1.41433 | 0.78573 | 0.99946 |
-----------------------------------------------------------------------------------------------
| 2 vs 4 | Odds Ratio | 0.28399 | 1.01539 | 2.15260 | 0.78450 | 0.98014 |
| | Lower 95% Limit | 0.21507 | 1.01120 | 1.81393 | 0.62397 | 0.80689 |
| | Upper 95% Limit | 0.37499 | 1.01959 | 2.55450 | 0.98633 | 1.19059 |
-----------------------------------------------------------------------------------------------
| 3 vs 4 | Odds Ratio | 0.12580 | 0.98916 | 0.31106 | 1.39702 | 1.04805 |
| | Lower 95% Limit | 0.04648 | 0.97408 | 0.18439 | 0.74341 | 0.60994 |
| | Upper 95% Limit | 0.34052 | 1.00447 | 0.52476 | 2.62526 | 1.80086 |
-----------------------------------------------------------------------------------------------

Design-Based and Model-Based Analyses*


All the analyses presented so far relied on the design-based approach, as
sample weights and design features were incorporated in the analysis.
Before relating these analyses to the model-based approach, let us briefly
consider the survey data used for these analyses. For NHANES III, 2,812
PSUs were formed, covering the United States. These PSUs consisted of
individual counties, but sometimes they included two or more adjacent
counties. These are administrative units, and the survey was not designed to
produce separate estimates for these units. From these units, 81 PSUs were
sampled, with selection probability proportional to the sizes of PSUs—13
from certainty strata and 2 from each of 34 strata formed according to
demographic characteristics rather than geographic location. Again, strata
are not designed to define population parameters. The second stage of
sampling involved area segments consisting of city or suburban blocks or
other contiguous geographic areas. Segments with larger minority popula-
tions were sampled with a higher probability. The third stage of sampling
involved the listing of all the households within the sampled area segments
and then sampling them at a rate that depended on the segment characteris-
tics. The fourth stage of sampling was to sample individuals within sampled
households to be interviewed. These secondary units were used to facilitate
sampling rather than to define population parameters. The public-use data
file included only strata and PSUs; identification of secondary sampling
units was not included. Sample weights were calculated based on the selec-
tion probabilities of interviewed persons, with weighting adjustments for
nonresponse and poststratification. Many of the analytic issues discussed in
the previous chapters arise because of the way large-scale social and health
surveys are conducted. Available data are not prepared to support the use
of hierarchical linear models for incorporating the multistage selection
design.
Because of the unequal selection probabilities of interviewed indivi-
duals, coupled with adjustments for nonresponse and poststratification, it
is difficult to justify ignoring the sample weights in descriptive analyses of
NHANES III data. The rationale for weighting in descriptive analysis has
been quite clear. As shown in Table 6.2, the bias in the unweighted estimate
is quite high for age and race-related variables. The standard error of the
weighted estimate is quite similar to that of the unweighted estimate for all
variables, suggesting that the variability in sample weights does not increase
the variance. However, the standard error for the weighted estimates taking
into account PSUs and strata is quite high, as reflected by the design effects.
One way to reduce the variance is to use a model incorporating auxiliary
information. Classical examples are the familiar ratio and regression estimators
(Cochran, 1977, chap. 6). In these estimators, the auxiliary information is in
the form of known population means of concomitant variables that are related
to the target variables. The use of auxiliary information can be extended
to the estimation of distribution functions (Rao, Kovar, & Mantel, 1990);
however, the use of auxiliary information in routine descriptive analysis is
limited because finding suitable auxiliary information is difficult.
The use of a model in regression analysis is quite obvious. The
unweighted estimates in Table 6.4 are based strictly on a model-based
approach; it ignores the sample weights and the design features, but much
of the relevant design information is included among the independent vari-
ables: for example, age (oversampling of elderly persons) and black and
Hispanic (oversampling of minority populations). Indeed, the model-based
estimates of coefficients are similar to the weighted estimates, suggesting
that the model-based analysis is quite reasonable in this case. The one nota-
ble exception is the coefficient of education, which is very different under
the two estimation procedures. Education is highly insignificant in the
model-based analysis but highly significant in the weighted analysis. The
education effect could not be detected by the model-based analysis because
of the diminishing education effect for older ages. Age is included in the
model, but the interaction effect between age and education is not included
in the model. This example suggests that the use of the sample weights
protects against misspecification of the model.
Korn and Graubard (1995b) further illustrate the advantage of using
sample weights in a regression analysis, using data from the 1988 National
Maternal and Infant Health Survey. This survey oversampled low-
birthweight infants. The estimated regression lines of gestational age on
birthweight from the unweighted and weighted analyses turn out to be very
different. Although the unweighted fitting reflects sample observations
equally and does not describe the population, the weighted fitting pulls the
regression line to where the population is estimated to be. The relationship
between the two variables actually is curvilinear. If a quadratic regression
were fit instead, then the unweighted and weighted regressions would show
greater agreement.
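The mechanics of that example can be mimicked with a toy finite population (purely illustrative, not the survey data): the outcome is an exact quadratic function of x, low-x units are oversampled, and the weights are the inverse sampling fractions. The unweighted linear fit is pulled toward the oversampled region, while the weighted fit lands near the population line:

```python
def wls_slope(x, y, w):
    """Weighted least-squares slope of y on x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den

# Toy finite population with a curvilinear relationship y = x^2
pop_x = list(range(100))
pop_y = [xi ** 2 for xi in pop_x]

# Oversample low-x units: take every unit with x < 50 (weight 1) but only
# every fifth unit with x >= 50 (weight 5 = inverse sampling fraction)
sample = [(x, y, 1.0) for x, y in zip(pop_x, pop_y) if x < 50]
sample += [(x, y, 5.0) for x, y in zip(pop_x, pop_y) if x >= 50 and x % 5 == 0]
sx, sy, sw = (list(t) for t in zip(*sample))

slope_pop = wls_slope(pop_x, pop_y, [1.0] * len(pop_x))
slope_unw = wls_slope(sx, sy, [1.0] * len(sx))
slope_wtd = wls_slope(sx, sy, sw)

# The weighted line lands much closer to the population fit
print(round(slope_pop, 1), round(slope_unw, 1), round(slope_wtd, 1))  # 99.0 84.3 96.9
assert abs(slope_wtd - slope_pop) < abs(slope_unw - slope_pop)
```

Fitting a quadratic to the sample instead of a line would bring the weighted and unweighted fits into much closer agreement, which is the point of the Korn and Graubard illustration.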
As discussed above concerning the analytic results in Table 6.4, a careful
examination of the differences between the weighted and unweighted
regressions can sometimes identify important variables or interactions that
should be added to the model. The differences between the unweighted and
weighted estimates suggest that incorporating the sample design provides
protection against the possible misspecification of the population model.
Several statistics for testing the differences between the weighted and
unweighted estimates have been proposed in the literature (DuMouchel &
Duncan, 1983; Fuller, 1984; Nordberg, 1989). Korn and Graubard (1995a)
apply these test statistics to the NHANES I and II data using design-based
variances. They recommend the design-based analysis when the inefficiency
is small. Otherwise, additional modeling assumptions can be incorporated
into the analysis. They have noted that secondary sampling units are not
available to the public and point to the need to increase the number of PSUs in
the design of large health surveys. These tests are limited to point estimation,
and therefore their conclusions may not apply to all circumstances. More
detailed discussion of these and related issues is provided by Pfeffermann
(1993, 1996).
The fact that the design-based analysis provides protection against possi-
ble misspecification of the model suggests that the analysis illustrated using
SUDAAN, Stata, and other software for complex survey analysis is appro-
priate for NHANES data. Even in the design-based analysis, a regression
model is used to specify the parameters of interest, but inference takes the
sample design into account. The design-based analysis in this case may be
called a model-assisted approach (Sarndal, Swensson, & Wretman, 1992).
The design-based theory relies on large sample sizes to make inferences
about the parameters. The model-based analysis may be a better option for
a small sample. When probability sampling is not used in data collection,
there is no basis for applying the design-based inference. The model-based
approach would make more sense where substantive theory and previous
empirical investigations support the proposed model.
The idea of model-based analysis is less obvious in a contingency table
analysis than in a regression analysis. The rationale for design-based ana-
lysis taking into account the sampling scheme already has been discussed.
As in the regression analysis, it is wise to pay attention to the differences
between the weighted proportions and the unweighted proportions. If
there is a substantial difference, one should explore why they differ. In
Table 6.6, the unweighted and weighted proportions are similar, but the
weighted odds ratios for vitamin use and gender are slightly lower than
the unweighted odds ratios for high school graduates and those with some
college education, while the weighted and unweighted odds ratios are
about the same for those with less than high school graduation. The small
difference for the two higher levels of education may be due to race or
some other factor. If the difference between the unweighted and weighted
odds ratios is much larger and it is due to race, one should examine the
association separately for different racial groups. The consideration of
additional factors in the contingency table analysis can be done using a
logistic regression model.
The uses of a model and associated issues in a logistic regression are
exactly the same as in a linear regression. A careful examination of the
weighted and unweighted analysis provides useful information. In Table 6.7,
the weighted and unweighted estimates of coefficients are similar. It appears
that the weighting affects the intercept more than the coefficients. The analy-
sis shown in Table 6.7 is a simple demonstration of analyzing data using
logistic regression, and no careful consideration is given to choosing an
appropriate model. Comparable model-based analyses without using the
weights and sample design were not performed for the ordered logistic regres-
sion model in Table 6.8 and multinomial logistic regression model in Table
6.9, because we feel that an appropriate model including all relevant indepen-
dent variables was not specified.
In summary, analysis of complex survey data benefits from both model-
based and design-based analyses. Design-based methods yield approximately
unbiased estimators of associations, but they can be inefficient, with larger
standard errors. Model-based methods require assumptions in choosing the
model, and wrong assumptions can lead to biased estimators of associations
and standard errors.

7. CONCLUDING REMARKS

In this book, we have discussed the problematic aspects of survey data
analysis and methods for dealing with the problems caused by the use of
complex sample designs. The focus has been on understanding the problems
and the logic of the methods, rather than on providing a technical manual.
We also have presented a practical guide for preparing for an analysis of
complex survey data and demonstrated the use of some of the software avail-
able for performing various analyses. Software for complex survey analysis
is now readily available, and with the increasing computing power of perso-
nal computers, many sophisticated analytical methods can be implemented
easily. Nevertheless, data analysts need to specify the design, to create repli-
cate weights for certain analyses, and to choose appropriate test statistics for
survey analysis. Therefore, the user should have a good understanding of the
sample design and related analytical issues.
Although the material presented on these issues has been addressed
mainly to survey data analysts, we hope that this introduction also stimu-
lates survey designers and data producers to pay more attention to the needs
of the users of survey data. As more analytic uses are made of the survey
data that were initially collected for enumerative purposes, the survey
designers must consider including certain design-related information that
allows more appropriate analysis to be performed as well as easing the
user’s burden. The data producers should develop the sample weights
appropriate to the design, and even replicate weights incorporating adjust-
ments for nonresponse and poststratification in addition to codes for strata
and sampling units.
Finally, we must point out that we have been taking the position of
design-based statistical inference with some introduction to an alternative
approach known as model-based inference. Each has its own strengths.
Briefly, model-based inference assumes that a sample is a convenience set of
observations from a conceptual superpopulation. The population parameters
under the specified model are of primary interest, and the sample selection
scheme is considered secondary to the inference. Consequently, the role of
the sample design is deemphasized here, and statistical estimation uses the
prediction approach under the specified model. Naturally, estimates are sub-
ject to bias if the model is incorrectly specified, and the bias can be substantial
even in large samples. The design-based inference requires taking the sample
design into account, and it is the traditional approach. The finite population
is of primary interest, and the analysis aims at finding estimates that are
design-unbiased in repeated sampling.
We believe that the sample design does matter when inference is made
from sample data, especially in the description of social phenomena, in com-
parison to more predictable physical phenomena. At the same time, the
appropriateness of a model needs to be assessed, and the role of the analytical
model must be recognized in any data analysis. Any inference using both the
design and the model is likely to be more successful than that using either one
alone. We further believe that these two approaches tend to be complemen-
tary and that there is something to be gained by using them in combination.
Similar views are expressed by Brewer (1995, 1999) and Sundberg (1994).7
We cannot lose sight of either the many practical issues that prevail in social
survey designs or lingering problems of nonresponse and other sources of
nonsampling errors. These practical issues and the current state of substantive
theory in social science tend to force us to rely more on the traditional
approach for the time being. The model-based approach should provide an
added perspective in bridging the gap between survey design and analysis.
There is a considerable body of theoretical and practical literature on
complex survey data analysis. In addition to the references cited, more
advanced treatment of the topics discussed in this book and other related
issues are available in single volumes (Korn & Graubard, 1999; Lehtonen &
Pahkinen, 1995; Skinner, Holt, & Smith, 1989). For a secondary use of com-
plex survey data, the analyst should recognize that the analysis may be ser-
iously marred by the limitations or even mistakes made in the sample design.
It will be wise for an analyst with limited sampling knowledge to consult with
an experienced practicing survey statistician.

NOTES

1. The method of ratio estimation is used in estimating the population
ratio of two variables (for example, the ratio of the weight of fruits to the
amount of juice produced). It is also used in obtaining a more accurate esti-
mate of a variable (e.g., current income, y) by forming a ratio to another close-
ly related variable (e.g., previous income at the time of the last census, x). The
sample ratio (y/x, or change in income) is then applied to the previous census
income to obtain the current estimate of income, which is more accurate than
that estimated without using an auxiliary variable. For details, see Cochran
(1977, chap. 6).
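The income adjustment described in this note amounts to scaling the known census mean by the sample ratio; a sketch with hypothetical incomes:

```python
# Hypothetical data for the income example: for each sampled household we
# observe current income (y) and income at the last census (x); the census
# mean of x is known for the entire population.
sample_y = [52000, 61000, 47500, 58000, 64500]   # current income
sample_x = [48000, 57000, 45000, 55000, 60000]   # income at last census

pop_mean_x = 51000   # known population mean of census income

# Ratio estimator: scale the known census mean by the sample ratio y/x
ratio = sum(sample_y) / sum(sample_x)
ratio_estimate = ratio * pop_mean_x

# The plain sample mean ignores the auxiliary variable
plain_estimate = sum(sample_y) / len(sample_y)
print(round(ratio, 4), round(ratio_estimate), round(plain_estimate))  # 1.0679 54464 56600
```

Because these sampled households happen to have above-average census incomes, the ratio estimator pulls the estimate down toward where the population is.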
2. Holt and Smith (1979) characterized poststratification as a robust
technique for estimation. Based on the conditional distribution, they showed
that the self-weighted sample mean is generally biased and poststratification
offers protection against extreme sample configurations. They suggested that
poststratification should be more strongly considered for use in sample
surveys than appears to be the case at present. Poststratification may also
provide some protection against anomalies introduced by nonresponse and
other problems in sample selection.
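A minimal sketch of the poststratified estimator this note discusses: each poststratum's sample mean is weighted by its known population share rather than by the share it happens to have in the sample. The data and stratum labels below are hypothetical.

```python
def poststratified_mean(values, strata, pop_shares):
    """Weight each poststratum's sample mean by its known
    population share (e.g., from census counts)."""
    cells = {}
    for v, h in zip(values, strata):
        cells.setdefault(h, []).append(v)
    return sum(w * sum(cells[h]) / len(cells[h])
               for h, w in pop_shares.items())

# Hypothetical sample that happens to over-represent stratum "M":
values = [10, 12, 14, 20, 22]
strata = ["M", "M", "M", "F", "F"]
pop_shares = {"M": 0.5, "F": 0.5}   # known population proportions

print(poststratified_mean(values, strata, pop_shares))  # -> 16.5
# versus the self-weighted sample mean, 15.6
```

The gap between 16.5 and 15.6 is the correction for the extreme sample configuration that the note describes.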
3. SUPER CARP and PC CARP were available from the Statistical
Laboratory of Iowa State University and were useful to the statistically
inclined users. CPLX was made available by Dr. Fay of the U.S. Bureau of
the Census and was useful for conducting discrete multivariate analysis
using modified BRR and the jackknife method (Fay, 1985). CENVAR and
VPLX programs are now available from the U.S. Bureau of the Census.
The Epi Info system for epidemiological and statistical analysis, developed
by CDC (U.S. Centers for Disease Control and Prevention), includes the
CSAMPLE procedure for complex survey data analysis. The OSIRIS
Statistical Software System from the Institute for Social Research, University
of Michigan, included some procedures for descriptive statistics and regres-
sion analysis (available only for mainframe computers). There were also
other programs written for some special survey projects such as the World
Fertility Survey (CLUSTERS). There are two programs written as a series of
SAS macros for survey data analysis, including GES, available from Statistics
Canada, and CLAN, available from Statistics Sweden.
4. The National Health and Nutrition Examination Survey (NHANES) is
a continuing series of surveys carried out by the National Center for Health
Statistics (NCHS) to assess the health and nutritional status of the U.S.
population. There have been several rounds of NHANES. NHANES I was
conducted in 1971–1973, NHANES II in 1976–1980, and NHANES III in
1988–1994. A special survey of the Hispanic population (Hispanic HANES)
was conducted in 1982–1984. NHANES became a continuing survey, and
data are now released every 2 years (NHANES 1999–2000, 2001–2002, and
2003–2004). NHANES collects information on a variety of health-related
subjects from a large number of individuals through personal interviews
and medical examinations, including diagnostic tests and other procedures
used in clinical practice (S. S. Smith, 1996). NHANES designs are complex
to accommodate the practical constraints of cost and survey requirements,
resulting in a stratified, multistage, probability cluster sample of eligible
persons in households (NCHS, 1994). The PSUs are counties or small groups
of contiguous counties, and the subsequent hierarchical sampling units
include census enumeration districts, clusters of households, households, and
eligible persons. Preschool children, the aged, and the poor are oversampled
to provide sufficient numbers of persons in these subgroups. The sample
weight contained in the public-use micro data files is the expansion weight
(inverse of selection probability adjusted for nonresponse and poststratifica-
tion). NHANES III was conducted in two phases. The multistage sampling
design resulted in 89 sample areas, and these were randomly divided into two
sets: 44 sites were surveyed in 1988–1991, and the remaining 45 sites in
1991–1994. Each phase sample can be considered as an independent sample,
and the combined sample can be used for a large-scale analysis.
5. The hot deck method of imputation borrows values from other obser-
vations in the data set. There are many ways to select the donor observa-
tions. Usually, imputation cells are established by sorting the data by
selected demographic variables and other variables such as stratum and
PSU. The donor is then selected from the same cell as the observation with
the missing value. By imputing individual values rather than the mean
values, this method avoids the underestimation of variances to a large
degree. This method is widely used by the U.S. Bureau of the Census and
other survey organizations. For further details, see Levy and Lemeshow
(1999, pp. 409–411).
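A bare-bones sketch of the cell-based hot deck described above (the records, variable names, and function are hypothetical illustrations; production systems add refinements such as limits on how often a donor may be reused):

```python
import random

def hot_deck_impute(records, cell_vars, target, seed=0):
    """Replace missing `target` values with a value borrowed from a
    randomly chosen donor in the same imputation cell, where a cell is
    defined by matching values of `cell_vars` (e.g., stratum, PSU, and
    demographic variables)."""
    rng = random.Random(seed)
    donors = {}
    for rec in records:
        if rec[target] is not None:
            cell = tuple(rec[v] for v in cell_vars)
            donors.setdefault(cell, []).append(rec[target])
    for rec in records:
        if rec[target] is None:
            cell = tuple(rec[v] for v in cell_vars)
            rec[target] = rng.choice(donors[cell])  # borrow, don't average
    return records

# Hypothetical records: income is missing for one person.
people = [
    {"stratum": 1, "sex": "F", "income": 30},
    {"stratum": 1, "sex": "F", "income": 40},
    {"stratum": 1, "sex": "F", "income": None},
]
hot_deck_impute(people, ["stratum", "sex"], "income")
```

Because the imputed value is an actual observed value from the same cell, rather than a cell mean, the natural variability of the variable is largely preserved.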
6. A Wald statistic with one degree of freedom is basically the square of
an approximately normal estimate, centered at its hypothesized value and
divided by its standard error. For hypotheses involving more than one degree
of freedom, the Wald statistic is the matrix extension of this squared
standardized variable.
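In standard notation (not spelled out in the text), with estimate $\hat{\theta}$, hypothesized value $\theta_0$, and design-based covariance estimate $\widehat{V}$:

```latex
W_1 = \left( \frac{\hat{\theta} - \theta_0}{\widehat{\operatorname{se}}(\hat{\theta})} \right)^{2}
      \;\sim\; \chi^{2}_{1},
\qquad
W_k = (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)'\,
      \widehat{V}^{-1}\,
      (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)
      \;\sim\; \chi^{2}_{k}.
```

For complex surveys, $\widehat{V}$ is obtained from a design-based method such as Taylor series linearization, BRR, or the jackknife.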
7. The difference between design-based and model-based approaches is
well illustrated by Brewer and Mellor (1973). The combined use of differ-
ent approaches is well articulated by Brewer (1995) and illustrated further
based on the case of stratified vs. stratified balanced sampling (Brewer,
1999). Sundberg (1994) expressed similar views in conjunction with
variance estimation. Advantages and disadvantages of moving from design-
based to model-based sample designs and sampling inference are reviewed
and illustrated with examples by Graubard and Korn (2002). They point out
that with stratified sampling, it is not sufficient to drop FPC factors from
standard design-based variance formulas to obtain appropriate variance
formulas for model-based inference. With cluster sampling, standard
design-based variance formulas can dramatically underestimate model-
based variability, even with a small sampling fraction of the final units.
They conclude that design-based inference is an efficient and reasonably
model-free approach to inference about finite population parameters, but
they suggest simple modifications of design-based variance estimators that
permit inference, under a few model assumptions, about superpopulation
parameters, which frequently are the ones of primary scientific interest.
REFERENCES

Aldrich, J. H., & Nelson, F. D. (1984). Linear probability, logit, and probit models (Quantitative
Applications in the Social Sciences, 07–045). Beverly Hills, CA: Sage.
Alexander, C. H. (1987). A model-based justification for survey weights. Proceedings of the
Section on Survey Research Methods (American Statistical Association), 183–188.
Bean, J. A. (1975). Distribution and properties of variance estimation for complex multistage
probability samples (Vital and Health Statistics, Series 2[65]). Washington, DC: National
Center for Health Statistics.
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex
surveys. International Statistical Review, 51, 279–292.
Brewer, K. R. W. (1995). Combining design-based and model-based inference. In B. G. Cox,
D. A. Binder, B. N. Chinnappa, A. Christianson, M. J. Colledge, & P. S. Kott (Eds.),
Business survey methods (pp. 589–606). New York: John Wiley.
Brewer, K. R. W. (1999). Design-based or prediction-based inference? Stratified random vs.
stratified balanced sampling. International Statistical Review, 67, 35–47.
Brewer, K. R. W., & Mellor, R. W. (1973). The effect of sample structure on analytical
surveys. Australian Journal of Statistics, 15, 145–152.
Brick, J. M., & Kalton, G. (1996). Handling missing data in survey research. Statistical Methods
in Medical Research, 5, 215–238.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data
analysis methods. Newbury Park, CA: Sage.
Chambless, L. E., & Boyle, K. E. (1985). Maximum likelihood methods for complex sample
data: Logistic regression and discrete proportional hazards models. Communications in
Statistics—Theory and Methods, 14, 1377–1392.
Chao, M. T., & Lo, S. H. (1985). A bootstrap method for finite populations. Sankhya, 47(A),
399–405.
Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: John Wiley.
Cohen, S. B. (1997). An evaluation of alternative PC-based software packages developed for
the analysis of complex survey data. The American Statistician, 51, 285–292.
Davis, J. A., & Smith, T. W. (1985). General Social Survey, 1972–1985: Cumulative codebook
(NORC edition). Chicago: National Opinion Research Center, University of Chicago and
the Roper Center, University of Connecticut.
DeMaris, A. (1992). Logit modeling (Quantitative Applications in the Social Sciences,
07–086). Thousand Oaks, CA: Sage.
Deming, W. E. (1960). Sample design in business research. New York: John Wiley.
DuMouchel, W. H., & Duncan, G. J. (1983). Using sample survey weights in multiple
regression analyses of stratified samples. Journal of the American Statistical Association,
78, 535–543.
Durbin, J. (1959). A note on the application of Quenouille’s method of bias reduction to the
estimation of ratios. Biometrika, 46, 477–480.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia:
Society for Industrial and Applied Mathematics.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman &
Hall.
Eliason, S. R. (1993). Maximum likelihood estimation: Logic and practice (Quantitative
Applications in the Social Sciences, 07–096). Beverly Hills, CA: Sage.
Eltinge, J. L., Parsons, V. L., & Jang, D. S. (1997). Differences between complex-design-based
and IID-based analyses of survey data: Examples from Phase I of NHANES III. Stats,
19, 3–9.
Fay, R. E. (1985). A jackknife chi-square test for complex samples. Journal of the American
Statistical Association, 80, 148–157.
Flyer, P., & Mohadjer, L. (1988). The WesVar procedure. Rockville, MD: Westat.
Forthofer, R. N., & Lehnen, R. G. (1981). Public program analysis: A categorical data
approach. Belmont, CA: Lifetime Learning Publications.
Frankel, M. R. (1971). Inference from survey samples. Ann Arbor: Institute of Social Research,
University of Michigan.
Fuller, W. A. (1975). Regression analysis for sample surveys. Sankhya, 37(C), 117–132.
Fuller, W. A. (1984). Least squares and related analyses for complex survey designs. Survey
Methodology, 10, 97–118.
Goldstein, H., & Silver, R. (1989). Multilevel and multivariate models in survey analysis. In
C. J. Skinner, D. Holt, & T. M. F. Smith (Eds.), Analysis of complex survey data
(pp. 221–235). New York: John Wiley.
Goodman, L. A. (1972). A general model for the analysis of surveys. American Journal of
Sociology, 77, 1035–1086.
Graubard, B. I., & Korn, E. L. (1996). Modelling the sampling design in the analysis of health
surveys. Statistical Methods in Medical Research, 5, 263–281.
Graubard, B. I., & Korn, E. L. (2002). Inference for superpopulation parameters using sample
surveys. Statistical Science, 17, 73–96.
Grizzle, J. E., Starmer, C. F., & Koch, G. G. (1969). Analysis of categorical data by linear
models. Biometrics, 25, 489–504.
Gurney, M., & Jewett, R. S. (1975). Constructing orthogonal replications for variance estima-
tion. Journal of the American Statistical Association, 70, 819–821.
Hansen, M. H., Madow, W. G., & Tepping, B. J. (1983). An evaluation of model-dependent
and probability-sampling inferences in sample surveys. Journal of the American Statistical
Association, 78, 776–807.
Heitjan, D. F. (1997). Annotation: What can be done about missing data? Approaches to impu-
tation. American Journal of Public Health, 87(4), 548–550.
Hinkins, S., Oh, H. L., & Scheuren, F. (1994). Inverse sampling design algorithms. Proceed-
ings of the Section on Survey Research Methods (American Statistical Association),
626–631.
Holt, D., & Smith, T. M. F. (1979). Poststratification. Journal of the Royal Statistical Society,
142(A), 33–46.
Holt, D., Smith, T. M. F., & Winter, P. D. (1980). Regression analysis of data from complex
surveys. Journal of the Royal Statistical Society, 143(A), 474–487.
Horton, N. J., & Lipsitz, S. R. (2001). Multiple imputation in practice: Comparison of software
packages for regression models with missing variables. The American Statistician, 55(3),
244–254.
Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: John Wiley.
Judkins, D. R. (1990). Fay method of variance estimation. Journal of Official Statistics, 6(3), 233–239.
Kalton, G. (1983). Introduction to survey sampling (Quantitative Applications in the Social
Sciences, 07–035). Beverly Hills, CA: Sage.
Kalton, G., & Kasprzyk, D. (1986). The treatment of missing survey data. Survey Methodology,
12(1), 1–16.
Kendall, P. A., & Lazarsfeld, P. F. (1950). Problems of survey analysis. In R. K. Merton &
P. F. Lazarsfeld (Eds.), Continuities in social research: Studies in the scope and method
of ‘‘The American soldier.’’ New York: Free Press.
Kiecolt, K. J., & Nathan, L. E. (1985). Secondary analysis of survey data (Quantitative
Applications in the Social Sciences, 07–053). Beverly Hills, CA: Sage.
Kish, L. (1949). A procedure for objective respondent selection within the household. Journal
of the American Statistical Association, 44, 380–387.
Kish, L. (1965). Survey sampling. New York: John Wiley.
Kish, L., & Frankel, M. R. (1974). Inferences from complex samples. Journal of the Royal
Statistical Society, 36(B), 1–37.
Knoke, D., & Burke, P. J. (1980). Log-linear models (Quantitative Applications in the Social
Sciences, 07–020). Beverly Hills, CA: Sage.
Koch, G. G., Freeman, D. H., & Freeman, J. L. (1975). Strategies in the multivariate analysis
of data from complex surveys. International Statistical Review, 43, 59–78.
Konijn, H. (1962). Regression analysis in sample surveys. Journal of the American Statistical
Association, 57, 590–605.
Korn, E. L., & Graubard, B. I. (1995a). Analysis of large health surveys: Accounting for the
sample design. Journal of the Royal Statistical Society, 158(A), 263–295.
Korn, E. L., & Graubard, B. I. (1995b). Examples of differing weighted and unweighted
estimates from a sample survey. The American Statistician, 49, 291–295.
Korn, E. L., & Graubard, B. I. (1998). Scatterplots with survey data. The American Statistician,
52, 58–69.
Korn, E. L., & Graubard, B. I. (1999). Analysis of health surveys. New York: John Wiley.
Korn, E. L., & Graubard, B. I. (2003). Estimating variance components by using survey data.
Journal of the Royal Statistical Society, B(65, pt. 1), 175–190.
Kott, P. S. (1991). A model-based look at linear regression with survey data. The American
Statistician, 45, 107–112.
Kovar, J. G., Rao, J. N. K., & Wu, C. F. J. (1988). Bootstrap and other methods to measure
errors in survey estimates. Canadian Journal of Statistics, 16(Suppl.), 25–45.
Krewski, D., & Rao, J. N. K. (1981). Inference from stratified samples: Properties of the lineariza-
tion, jackknife and balanced repeated replication methods. Annals of Statistics, 9, 1010–1019.
LaVange, L. M., Lafata, J. E., Koch, G. G., & Shah, B. V. (1996). Innovative strategies using
SUDAAN for analysis of health surveys with complex samples. Statistical Methods in
Medical Research, 5, 311–329.
Lee, E. S., Forthofer, R. N., Holzer, C. E., & Taube, C. A. (1986). Complex survey data analy-
sis: Estimation of standard errors using pseudo-strata. Journal of Economic and Social
Measurement, 14, 135–144.
Lee, E. S., Forthofer, R. N., & Lorimor, R. J. (1986). Analysis of complex sample survey data:
Problems and strategies. Sociological Methods and Research, 15, 69–100.
Lee, K. H. (1972). The use of partially balanced designs for the half-sample replication method
of variance estimation. Journal of the American Statistical Association, 67, 324–334.
Lehtonen, R., & Pahkinen, E. J. (1995). Practical methods for design and analysis of complex
surveys. New York: John Wiley.
Lemeshow, S., & Levy, P. S. (1979). Estimating the variance of ratio estimates in complex
surveys with two primary sampling units per stratum. Journal of Statistical Computing
and Simulation, 8, 191–205.
Levy, P. S., & Lemeshow, S. (1999). Sampling of populations: Methods and applications.
New York: John Wiley.
Levy, P. S., & Stolte, K. (2000). Statistical methods in public health and epidemiology: A look
at the recent past and projections for the future. Statistical Methods in Medical Research,
9, 41–55.
Liao, T. F. (1994). Interpreting probability models: Logit, probit, and other generalized linear
models (Quantitative Applications in the Social Sciences, 07–101). Beverly Hills, CA: Sage.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York:
John Wiley.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.).
New York: John Wiley.
Lohr, S. L. (1999). Sampling: Design and analysis. New York: Duxbury.
McCarthy, P. J. (1966). Replication: An approach to the analysis of data from complex surveys
(Vital and Health Statistics, Series 2[14]). Washington, DC: National Center for Health Statistics.
Murthy, M. N., & Sethi, V. K. (1965). Self-weighting design at tabulation stage. Sankhya,
27(B), 201–210.
Nathan, G., & Holt, D. (1980). The effects of survey design on regression analysis. Journal of
the Royal Statistical Society, 42(B), 377–386.
National Center for Health Statistics (NCHS). (1994). Plan and operation of the Third National
Health and Nutrition Examination Survey, 1988–94 (Vital and Health Statistics, Series
1[32]). Washington, DC: Government Printing Office.
Nielsen, S. F. (2003). Proper and improper multiple imputation. International Statistical
Review, 71, 593–627.
Nordberg, L. (1989). Generalized linear modeling of sample survey data. Journal of Official
Statistics, 5, 223–239.
Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. Interna-
tional Statistical Review, 61, 317–337.
Pfeffermann, D. (1996). The use of sampling weights for survey data analysis. Statistical Meth-
ods in Medical Research, 5, 239–261.
Pfeffermann, D., & Holmes, D. J. (1985). Robustness considerations in the choice of method of
inference for regression analysis of survey data. Journal of the Royal Statistical Society,
148(A), 268–278.
Pfeffermann, D., & Nathan, G. (1981). Regression analysis of data from a cluster sample.
Journal of the American Statistical Association, 76, 681–689.
Plackett, R. L., & Burman, J. P. (1946). The design of optimum multifactorial experiments.
Biometrika, 33, 305–325.
Quenouille, M. H. (1949). Approximate tests of correlation in time series. Journal of the Royal
Statistical Society, 11(B), 68–84.
Rao, J. N. K., Kovar, J. G., & Mantel, H. J. (1990). On estimating distribution functions and
quantiles from survey data using auxiliary information. Biometrika, 77, 365–375.
Rao, J. N. K., & Scott, A. J. (1984). On chi-square tests for multiway contingency tables with
cell proportions estimated from survey data. Annals of Statistics, 12, 46–60.
Rao, J. N. K., & Wu, C. F. J. (1988). Resampling inference with complex survey data. Journal
of the American Statistical Association, 83, 231–241.
Rao, J. N. K., Wu, C. F. J., & Yue, K. (1992). Some recent work on resampling methods for
complex surveys. Survey Methodology, 18(3), 209–217.
Roberts, G., Rao, J. N. K., & Kumar, S. (1987). Logistic regression analysis of sample survey
data. Biometrika, 74, 1–12.
Royall, R. M. (1970). On finite population sampling theory under certain linear regression
models. Biometrika, 57, 377–387.
Royall, R. M. (1973). The prediction approach to finite population sampling theory: Applica-
tion to the hospital discharge survey (Vital and Health Statistics, Series 2[55]). Washington,
DC: National Center for Health Statistics.
Rust, K. F., & Rao, J. N. K. (1996). Variance estimation for complex surveys using replication
techniques. Statistical Methods in Medical Research, 5, 283–310.
Sarndal, C. E. (1978). Design-based and model-based inference in survey sampling. Scandina-
vian Journal of Statistics, 5, 25–52.
Sarndal, C. E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling.
New York: Springer-Verlag.
Shah, B. V., Holt, M. H., & Folsom, R. E. (1977). Inference about regression models from
sample survey data. Bulletin of the International Statistical Institute, 47, 43–57.
Sitter, R. R. (1992). Resampling procedure for complex survey data. Journal of the American
Statistical Association, 87, 755–765.
Skinner, C. J., Holt, D., & Smith, T. M. F. (Eds.). (1989). Analysis of complex survey data.
New York: John Wiley.
Smith, S. S. (1996). The third National Health and Nutrition Examination Survey: Measuring
and monitoring the health of the nation. Stats, 16, 9–11.
Smith, T. M. F. (1976). The foundations of survey sampling: A review. Journal of the Royal
Statistical Society, 139(A), 183–204.
Smith, T. M. F. (1983). On the validity of inferences on non-random samples. Journal of the
Royal Statistical Society, 146(A), 394–403.
Sribney, W. M. (1998). Two-way contingency tables for survey or clustered data. Stata Technical
Bulletin, 45, 33–49.
Stanek, E. J., & Lemeshow, S. (1977). The behavior of balanced half-sample variance esti-
mates for linear and combined ratio estimates when strata are paired to form pseudo strata.
American Statistical Association Proceedings: Social Statistics Section, 837–842.
Stephan, F. F. (1948). History of the uses of modern sampling procedures. Journal of the
American Statistical Association, 43, 12–39.
Sudman, S. (1976). Applied sampling. New York: Academic Press.
Sugden, R. A., & Smith, T. M. F. (1984). Ignorable and informative designs in sampling infer-
ence. Biometrika, 71, 495–506.
Sundberg, R. (1994). Precision estimation in sample survey inference: A criterion for choice
between various estimators. Biometrika, 81, 157–172.
Swafford, M. (1980). Three parametric techniques for contingency table analysis: Non-technical
commentary. American Sociological Review, 45, 604–690.
Tepping, B. J. (1968). Variance estimation in complex surveys. American Statistical Association
Proceedings, Social Statistics Section, 11–18.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathematical
Statistics, 29, 614.
Tukey, J. W. (1986). Sunset salvo. The American Statistician, 40, 72–76.
U.S. Bureau of the Census. (1986, April). Estimates of the population of the United States,
by age, sex, and race, 1980 to 1985 (Current Population Reports, Series P-25, No. 985).
Washington, DC: Author.
Wolter, K. M. (1985). Introduction to variance estimation. New York: Springer-Verlag.
Woodruff, R. S. (1971). A simple method for approximating the variance of a complicated
estimate. Journal of the American Statistical Association, 66, 411–414.
Zhang, P. (2003). Multiple imputation: Theory and method. International Statistical Review,
71, 581–592.
INDEX

ANOVA, 57

Balanced repeated replication (BRR), 26–29, 44
  jackknife repeated replication (JRR) and, 33–34
  replicate weights and, 47
Binary logistic regression analysis, 65–69
Bootstrap method, 35–36

Cluster sampling, 6, 9
Cochran-Mantel-Haenszel (CMH) chi-square, 64–65
Computer software, 44–47, 51, 56, 61–63, 72–75, 77
Contingency table analysis, 61–65, 77

Data
  imputation, 41–42, 53, 81n5
  missing, 41–42
  preliminary analysis of, 41–43, 50–51
  requirements for survey analysis, 39–40
Descriptive analysis, 52–57
Design-based analysis, 2–3, 75–78, 79, 81–82n7
Design effects, 18–20, 40, 60–61
Design-weighted least squares (DWLS), 58

Errors, estimation of
  sample design and, 8
  Taylor series method for, 38–39
Expansion weights, 11–14

Follow-up surveys, 17–18

General Social Survey (GSS), 14–16
Goodness-of-fit statistic, 67–69

Imputation, 41–42, 53, 81n5
Inference, model-based, 79, 82n7
Intraclass correlation coefficient (ICC), 9

Jackknife repeated replication (JRR), 29–35, 44
  replicate weights and, 48

Linearization, 36–39
Linear regression analysis, 57–61, 66, 78
Logistic regression analysis
  binary, 65–69
  nominal, 72–75
  ordered, 70–72

Maximum likelihood estimations, 66–67
Missing data, 41–42
Model-based analysis, 2–3, 9–11, 75–78, 81–82n7
Model-based inference, 79, 82n7
Multistage sample design, 3–4

National Opinion Research Center (NORC), 14–16
Nature of survey data, 7–9
Nominal logistic regression analysis, 72–75

Ordered logistic regression analysis, 70–72
Ordinary least squares (OLS), 57–58

Pearson chi-square statistic, 61, 63
Poststratification, 14–16, 80n1
Precision, assessing loss or gain in, 18–20
Prediction models, 69
Preliminary analysis of data, 41–43, 50–51
Preparation for survey data analysis
  data requirements and, 39–40
  importance of preliminary analysis in, 41–43
Primary sampling units (PSUs), 7, 9
  balanced repeated replication (BRR) and, 26–29
  bootstrap method and, 35–36
  jackknife repeated replication (JRR) and, 29–35
  preliminary analysis of, 42–43
Probability proportional sampling (PPS sampling), 6–7, 7, 50–51
Problem formulation, 49
Proportional to estimated size (PPES) sampling, 7

Ratio estimation, 80n1
Regression
  analysis, logistic, 65–75
  contingency table analysis compared to, 77
  imputation, 53
  linear, 57–61, 66, 78
  use of models in, 76
Relative weights, 11–14
Repeated systematic sampling, 5
Replication
  balanced repeated (BRR), 26–29, 33–34, 44
  jackknife repeated (JRR), 29–35, 44
  sampling, 23–26
  weights, 47–49

Sample design
  design effect and, 18–20
  multistage, 3–4
  standard error estimation and, 8
  types of sampling in, 4–7
  variance estimation and, 22–39, 43–44
Sampling
  balanced repeated replication (BRR), 26–29, 44, 47
  cluster, 6, 9
  expansion weights in, 11–14
  follow-up, 17–18
  jackknife repeated replication (JRR), 29–35, 44, 48
  probability proportional (PPS), 6–7, 7, 50–51
  proportional to estimated size (PPES), 7
  repeated systematic, 5
  replicated, 23–26
  simple random (SRS), 1, 4
  stratified random, 5–6, 8–9
  systematic, 4–5
  types of, 4–7
  units, primary (PSUs), 7, 9, 26–29
  weights, 7–8, 11–14, 20–22, 79
Simple random sampling (SRS), 1, 4
  with replacement (SRSWR), 2
  without replacement (SRSWOR), 4
Software programs, 44–47, 51, 56, 61–63, 72–75, 77
SPSS 13.0, 46–47
Stata software package, 45, 51, 56, 61–63, 72
Stratified random sampling, 5–6, 8–9, 40
Subgroup analysis, 42–43, 55–57
SUDAAN software package, 45, 46, 51, 61, 72
Survey data analysis, 1–3
  complexity of, 11–22
  computer software for, 44–47, 51, 56, 61–63, 72–75, 77
  contingency table analysis and, 61–65, 77
  data requirements for, 39–40
  descriptive analysis and, 52–57
  design-based, 2–3, 75–78, 81–82n7
  importance of preliminary analysis in, 41–43
  linear regression analysis and, 57–61, 66, 78
  logistic regression analysis and, 65–69
  model-based, 9–11, 75–78, 81–82n7
  nature of, 7–9
  preliminary, 50–51
  preparing for, 39–49
  problem formulation before, 49
  sample design and, 3–11
  searching for appropriate models for, 49
  subgroup analysis in, 42–43, 55–57
  use of sample weights for, 20–22
  variance estimation in, 22–39
Systematic sampling, 4–5

Taylor series method, 36–39, 44, 45

Unweighted versus weighted analysis, 58–60

Variance estimation
  balanced repeated replication (BRR), 26–29, 33–34, 47
  bootstrap method, 35–36
  choosing method for, 43–44
  jackknife repeated replication (JRR), 29–35, 44, 48
  replicated sampling, 23–26
  Taylor series method, 36–39, 44, 45

Wald statistic, 61, 63, 81n6
Weights
  adjusting in follow-up surveys, 17–18
  creating replicate, 47–49
  design effect and, 18–20
  development by poststratification, 14–16
  expansion, 11–14
  sample, 7–8, 79
  for survey data analysis, 20–22, 39–40
  unweighted versus weighted analysis, 58–60
ABOUT THE AUTHORS

Eun Sul Lee is Professor of Biostatistics at the School of Public Health,
the University of Texas Health Science Center at Houston, where he
teaches sampling techniques for health surveys and intermediate biostatis-
tics methods. He received his undergraduate education at Seoul National
University in Korea. His PhD, from North Carolina State University, is in
experimental statistics and sociology. His current research interests involve
sample survey design, analysis of health-related survey data, and the appli-
cation of life table and survival analysis techniques in demography and
public health. He is junior author of Introduction to Biostatistics: A Guide
to Design, Analysis and Discovery (with Ronald Forthofer, 1995).

Ronald N. Forthofer is retired as a professor of biostatistics at the School
of Public Health, the University of Texas Health Science Center at Houston.
He now lives in Boulder County, Colorado. His PhD, from the University
of North Carolina at Chapel Hill, is in biostatistics. His past research has
involved the application of linear models and categorical data analysis tech-
niques in health research. He is senior author of Public Program Analysis:
A New Categorical Data Approach (with Robert Lehnen, 1981) and also
senior author of Introduction to Biostatistics: A Guide to Design, Analysis
and Discovery (with Eun Sul Lee, 1995).
