
MONOGRAPHS ON

STATISTICS AND APPLIED PROBABILITY

General Editors

D.R. Cox, V. Isham, N. Keiding,


N. Reid, and H. Tong

1 Stochastic Population Models in Ecology and Epidemiology M.S. Bartlett (1960)
2 Queues D.R. Cox and W.L. Smith (1961)
3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964)
4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966)
5 Population Genetics W.J. Ewens (1969)
6 Probability, Statistics and Time M.S. Bartlett (1975)
7 Statistical Inference S.D. Silvey (1975)
8 The Analysis of Contingency Tables B.S. Everitt (1977)
9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977)
10 Stochastic Abundance Models S. Engen (1978)
11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979)
12 Point Processes D.R. Cox and V. Isham (1980)
13 Identification of Outliers D.M. Hawkins (1980)
14 Optimal Design S.D. Silvey (1980)
15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981)
16 Classification A.D. Gordon (1981)
17 Distribution-free Statistical Methods, 2nd edition J.S. Maritz (1995)
18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982)
19 Applications of Queueing Theory, 2nd edition G.F. Newell (1982)
20 Risk Theory, 3rd edition R.E. Beard, T. Pentikäinen and E. Pesonen (1984)
21 Analysis of Survival Data D.R. Cox and D. Oakes (1984)
22 An Introduction to Latent Variable Models B.S. Everitt (1984)
23 Bandit Problems D.A. Berry and B. Fristedt (1985)
24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985)
25 The Statistical Analysis of Compositional Data J. Aitchison (1986)
26 Density Estimation for Statistics and Data Analysis B.W. Silverman (1986)
27 Regression Analysis with Applications G.B. Wetherill (1986)
28 Sequential Methods in Statistics, 3rd edition G.B. Wetherill and K.D. Glazebrook (1986)
29 Tensor Methods in Statistics P. McCullagh (1987)
30 Transformation and Weighting in Regression R.J. Carroll and D. Ruppert (1988)
31 Asymptotic Techniques for Use in Statistics O.E. Barndorff-Nielsen and D.R. Cox (1989)
32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989)
33 Analysis of Infectious Disease Data N.G. Becker (1989)
34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989)
35 Empirical Bayes Methods, 2nd edition J.S. Maritz and T. Lwin (1989)
36 Symmetric Multivariate and Related Distributions K.-T. Fang, S. Kotz and K.W. Ng (1990)
37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989)
38 Cyclic and Computer Generated Designs, 2nd edition J.A. John and E.R. Williams (1995)
39 Analog Estimation Methods in Econometrics C.F. Manski (1988)
40 Subset Selection in Regression A.J. Miller (1990)
41 Analysis of Repeated Measures M.J. Crowder and D.J. Hand (1990)
42 Statistical Reasoning with Imprecise Probabilities P. Walley (1991)
43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990)
44 Inspection Errors for Attributes in Quality Control N.L. Johnson, S. Kotz and X. Wu (1991)
45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992)
46 The Analysis of Quantal Response Data B.J.T. Morgan (1993)
47 Longitudinal Data with Serial Correlation: A State-space Approach R.H. Jones (1993)
48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993)
49 Markov Models and Optimization M.H.A. Davis (1993)
50 Networks and Chaos - Statistical and Probabilistic Aspects O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall (1993)
51 Number-theoretic Methods in Statistics K.-T. Fang and Y. Wang (1994)
52 Inference and Asymptotics O.E. Barndorff-Nielsen and D.R. Cox (1994)
53 Practical Risk Theory for Actuaries C.D. Daykin, T. Pentikäinen and M. Pesonen (1994)
54 Biplots J.C. Gower and D.J. Hand (1996)
55 Predictive Inference - An Introduction S. Geisser (1993)
56 Model-Free Curve Estimation M.E. Tarter and M.D. Lock (1993)
57 An Introduction to the Bootstrap B. Efron and R.J. Tibshirani (1993)
58 Nonparametric Regression and Generalized Linear Models P.J. Green and B.W. Silverman (1994)
59 Multidimensional Scaling T.F. Cox and M.A.A. Cox (1994)
60 Kernel Smoothing M.P. Wand and M.C. Jones (1995)
61 Statistics for Long Memory Processes J. Beran (1995)
62 Nonlinear Models for Repeated Measurement Data M. Davidian and D.M. Giltinan (1995)
63 Measurement Error in Nonlinear Models R.J. Carroll, D. Ruppert and L.A. Stefanski (1995)
64 Analyzing and Modeling Rank Data J.I. Marden (1995)
65 Time Series Models - In econometrics, finance and other fields D.R. Cox, D.V. Hinkley and O.E. Barndorff-Nielsen (1996)
66 Local Polynomial Modeling and its Applications J. Fan and I. Gijbels (1996)
67 Multivariate Dependencies - Models, analysis and interpretation D.R. Cox and N. Wermuth (1996)
68 Statistical Inference - Based on the likelihood A. Azzalini (1996)
69 Bayes and Empirical Bayes Methods for Data Analysis B.P. Carlin and T.A. Louis (1996)
70 Hidden Markov and Other Models for Discrete-valued Time Series I.L. MacDonald and W. Zucchini (1997)
71 Statistical Evidence - A likelihood paradigm R. Royall (1997)
72 Analysis of Incomplete Multivariate Data J.L. Schafer (1997)
73 Multivariate Models and Dependence Concepts H. Joe (1997)
74 Theory of Sample Surveys M.E. Thompson (1997)
75 Retrial Queues G. Falin and J.G.C. Templeton (1997)
76 Theory of Dispersion Models B. Jørgensen (1997)
77 Mixed Poisson Processes J. Grandell (1997)
78 Variance Components Estimation - Mixed models, methodologies and applications P.S.R.S. Rao (1997)
79 Bayesian Methods for Finite Population Sampling G. Meeden and M. Ghosh (1997)

(Full details concerning this series are available from the Publishers).
Theory of Sample Surveys

M.E. Thompson

Department of Statistics and Actuarial Science


University of Waterloo
Waterloo, Canada

Springer-Science+Business Media, B.V.


First edition 1997
© 1997 M.E. Thompson
Originally published by Chapman & Hall in 1997.
Apart from any fair dealing for the purposes of research or private study, or
criticism or review, as permitted under the UK Copyright Designs and Patents
Act, 1988, this publication may not be reproduced, stored, or transmitted, in any
form or by any means, without the prior permission in writing of the publishers,
or in the case of reprographic reproduction only in accordance with the terms of
the licences issued by the Copyright Licensing Agency in the UK, or in
accordance with the terms of licences issued by the appropriate Reproduction
Rights Organization outside the UK. Enquiries concerning reproduction outside
the terms stated here should be sent to the publishers at the London address
printed on this page.
The publisher makes no representation, express or implied, with regard to the
accuracy of the information contained in this book and cannot accept any legal
responsibility or liability for any errors or omissions that may be made.

A Catalogue record for this book is available from the British Library

ISBN 978-0-412-31780-4 ISBN 978-1-4899-2885-6 (eBook)


DOI 10.1007/978-1-4899-2885-6
Softcover reprint of the hardcover 1st edition 1997

Printed on permanent acid-free text paper, manufactured in accordance with
ANSI/NISO Z39.48-1992 and ANSI/NISO Z39.48-1984 (Permanence of
Paper).
Contents

PREFACE xiii

1 Introduction 1
1.1 Survey populations and samples 3
1.2 Population quantities 5
1.3 Survey error 6
1.4 Sampling and non-sampling errors 7
1.5 Bias and variability 7
1.6 Focus on sampling error 8

2 The mathematics of probability sampling designs 9


2.1 Randomized sampling designs 9
2.2 Expectations and variances of sample sums; the HT
estimator 12
2.3 Linear estimators for population totals 19
2.4 Sampling strategies and local relative efficiency 21
2.5 Finite population cumulants 25
2.5.1 Cumulants and K statistics 25
2.5.2 Cumulants of the sample sum in SRS 29
2.5.3 Variance of the sample variance in SRS 31
2.6 Multi-stage sampling 31
2.7 Estimation in multi-stage sampling 33
2.8 Implementation of unequal probability designs 37
Exercises 43

3 Distributions induced by random sampling designs 49


3.1 Distribution of sample numbers and proportions in SRS 50
3.2 Confidence limits for population numbers and propor-
tions in SRS 52
3.3 The characteristic function of the sample sum in SRS 56
3.4 The finite population central limit theorem for SRS 58

3.4.1 Distribution of univariate sample sum 58


3.4.2 Distribution of multivariate sample sum 60
3.5 Asymptotic normality and applications 61
3.5.1 Conditions for asymptotic normality 61
3.5.2 Normal-based confidence intervals for totals 64
3.5.3 Sample size determination 64
3.6 Formal Edgeworth expansions 65
3.7 Edgeworth expansions for the distribution of the sample
sum in SRS 68
3.8 Edgeworth expansions for the distribution of the
studentized mean in SRS 71
3.9 Saddlepoint approximations 73
3.10 Saddlepoint approximations for SRS 75
3.10.1 Approximations to the distribution of the sample
sum 75
3.10.2 Approximations to tail probabilities 78
3.11 Use of saddlepoint in constructing confidence intervals
in SRS 80
3.11.1 Construction of the test array 81
3.11.2 Approximation of the tail probabilities 82
3.12 Monetary unit sampling 83
3.13 Bootstrap resampling methods for confidence interval
construction 85
3.13.1 The idea of the bootstrap 86
3.13.2 Bootstrap t confidence intervals 91

4 Design-based estimation for general finite population quantities 93
4.1 Quantities defined as roots of simple estimating
functions 94
4.1.1 Design frequency properties of the estimating
functions 96
4.1.2 Particular cases 98
4.1.3 Intervals based on refined approximations 102
4.1.4 Asymptotic properties of the point estimates 104
4.2 Quantities defined as functions of population totals 106
4.2.1 Linearization of the error 107
4.2.2 Linearization confidence intervals 108
4.2.3 Method of random groups 111
4.2.4 Balanced repeated replication 114

4.2.5 Jackknife methods for variance estimation 121


4.2.6 Bootstrap variance estimation 125
4.2.7 Properties of the methods of variance estimation 126
4.2.8 Bias reduction from resampling methods 130
4.2.9 Bootstrap t confidence intervals 132
4.3 Quantities which are functions of population U -statistics 132
4.4 Quantities defined as roots of estimating functions with
nuisance parameters 135

5 Inference for descriptive parameters 143


5.1 Elements of descriptive sampling inference 143
5.1.1 Classical design-based inference 143
5.1.2 Applicability to survey populations 145
5.1.3 Conditioning and post-sampling stratification 146
5.1.4 Labels and likelihood 147
5.1.5 The role of prior knowledge 148
5.2 Superpopulation models 149
5.2.1 Exchangeable superpopulation models 150
5.2.2 Models with auxiliary variates 151
5.2.3 Time series and spatial process models 153
5.3 Prediction and inference 155
5.4 Randomization as support for statements of inference 157
5.4.1 Inferences based on exchangeability 157
5.4.2 Formal justification of conditioning 158
5.4.3 Survey weights 160
5.5 Evaluation of a sampling strategy 163
5.5.1 General considerations 163
5.5.2 Minimum expected variance criterion 165
5.5.3 Estimating function determination 168
5.6 Use of auxiliary information in estimation of means
and totals 171
5.7 Model-assisted estimation through optimal estimating
functions 172
5.7.1 Estimating totals through means 172
5.7.2 Estimating totals through models: ratio and
regression estimators 173
5.8 GREG estimators 176
5.9 Calibration methods 178
5.10 Predictive approach to regression estimators of totals 181
5.11 The uncertainty in ratio and regression estimators 182

5.11.1 Approximate variance estimators 182


5.11.2 Variance estimators and survey weights 187
5.12 Conditional sampling properties of the ratio and
regression estimators 187
5.12.1 The ratio estimator 188
5.12.2 Inverse Gaussian-based intervals 191
5.12.3 Simple regression estimator 192
5.12.4 Implications for inference 193
5.13 Robustness and mean function fitting 194
5.14 Estimating a population distribution function using
covariates 196

6 Analytic uses of survey data 199


6.1 What is analytic survey inference? 199
6.2 Single-stage designs and the use of weights 202
6.2.1 Likelihood estimation with independent obser-
vations 202
6.2.2 Estimating functions with weighted terms 204
6.2.3 Likelihood analysis under response-dependent
Bernoulli sampling 206
6.2.4 Case control studies 208
6.2.5 Pseudo likelihood constructions in life testing 209
6.2.6 Estimation of the average of a mean function 210
6.3 Estimating functions for vector parameters 211
6.4 The generalized linear model for survey populations 214
6.4.1 The simple generalized linear model 215
6.4.2 Confidence regions 217
6.4.3 Nested models 219
6.4.4 An exponential model form 220
6.4.5 Summary of homogeneous population methods
for generalized linear models 222
6.4.6 Incorporating population heterogeneity 223
6.4.7 Generalized linear mixed model 223
6.4.8 Poisson overdispersion models 226
6.4.9 The marginal linear exponential model 228
6.4.10 Estimation when the population is clustered 230
6.4.11 Estimation of random effects and conditional
means 233
6.5 Sampling and the generalized linear model 236

6.5.1 Estimation based on the model at the sample


level 238
6.5.2 Estimation through a pseudolikelihood 242
6.5.3 Design-effect approach 246

7 Sampling strategies in time and space 251


7.1 Models for spatial populations 252
7.1.1 Models for the noise term in (7.1) 254
7.1.2 Models for the underlying process $\mu_t$ 254
7.1.3 Correlation and semivariogram models 255
7.1.4 Properties of realizations of the z process 257
7.1.5 Discrete population case 258
7.2 Spatial sampling designs 259
7.3 Pointwise prediction and predictive estimation of means
and totals 262
7.3.1 Prediction of $Y_{t_0}$ 264
7.3.2 Prediction of $\mu_{t_0}$ 266
7.3.3 Estimation of the deterministic part of the trend 267
7.3.4 Estimating totals and means 267
7.3.5 Determining coefficients for predictors 268
7.3.6 Estimating the uncertainty of prediction 269
7.3.7 Bayesian prediction 271
7.3.8 Prediction of $\{\mu_t : t \in U\}$ 274
7.4 Global means and totals as integrals 275
7.4.1 One spatial dimension 275
7.4.2 More than one spatial dimension 277
7.5 Choosing purposive samples 279
7.6 Choice of randomized sampling designs 281

References 287

Index 301
PREFACE

This book began about ten years ago as an attempt to fill out the notes
for a graduate course in sampling theory. It has become a monograph,
intended to supplement rather than replace books already available.
For example, there are many good treatments at various levels on the
practical problems of survey design, and on the 'how to' of analysis.
I have dealt with these issues to some extent, but have focused more
on aspects of sampling theory which are not so commonly treated else-
where. Parts of the book can still be used in teaching, supplemented
sufficiently with examples and other material.
The book deals with a subject of great vitality. The theory of survey
methods is developing fast, with more and more connections to other
parts of statistics, as the needs of practitioners change, and as more
uses of survey data become possible. As a consequence of increases
in computing power and capability, data are easier to manipulate, and
computer-intensive methods for analysis can be investigated with reas-
onable assurance that the better ones will soon be practical.
Part of the fascination of the theory of sample surveys has always
lain in its foundational issues. The present book has been written very
much from a foundational perspective. For the most part, one point of
view is taken, but it is far from being the only possible one. It has long
been my belief that as far as the puzzles and paradoxes of inference
are concerned, everyone must come to her or his own account of the
truth.
In arriving at my own account, I have been aided by many others,
particularly by colleagues and students at the University of Waterloo.
By far the greatest debt is to V. P. Godambe, who began looking
critically at the logic of sampling inference in the 1950s, and has had a
profound influence on the subject ever since. It was he who taught me
that the best questions are those which have only partial answers, and
that confusion in the search for clarity is an honourable condition. His
interest in this project, and his great generosity in collaboration over
the years, are much appreciated.

I would like to express thanks to J. N. K. Rao, whose influence


throughout has also been very important, for many helpful suggestions;
to Sir David Cox for thoughtful ideas on the issues and organization,
for detailed comments, and for heartening encouragement at crucial
times; to T. J. DiCiccio for illuminating discussions of approaches to
asymptotic inference.
Students who have provided very welcome assistance with compu-
tation and experimentation are Mary Lynn (French) Benninger, Kar-
Seng Teo, Kathy Bischoping, Michelle Richards, Ronnie Lee, Dianne
Piaskoski, Thierry Duchesne, Julie Horrocks, and Gudrun Wessel (who
also designed the figures).
Special thanks are due also to Ms Lynda Clarke for her expert and
timely typesetting of the book, through the many revisions.
Finally, I would express much gratitude to the editors of Chapman
& Hall, and to supportive friends and tactful family - Carl, Simon,
Andrew and Alan - who long ago stopped asking when the book would
be finished!
Support from the Natural Sciences and Engineering Research Council
of Canada is gratefully acknowledged.
M. E. Thompson
Waterloo, Ontario, 1996
CHAPTER 1

Introduction

The idea of making inferences from samples drawn from populations is


fundamental to statistics. Historically, this idea has arisen in a number
of different contexts, but most concretely in connection with enumera-
tive or descriptive surveys. In fact, one of the earliest apparent refer-
ences to inference from a random sample appears in the Indian epic,
the Mahabharata, as described by Hacking (1975, p. 7):
[A king, Rtuparna] flaunts his mathematical skill by estimating the number
of leaves and of fruit on two great branches of a spreading tree. Apparently
he does this on the basis of a single twig that he examines. There are, he
avers, 2095 fruit. Nala counts all night and is duly amazed by the accuracy of
this guess. Rtuparna, so often right in matters like this, accepts his due ... :

I of dice possess the science


and in number thus am skilled.
Following much development in modern times (Bellhouse, 1988a),
the same method and its successors appear nowadays in a remarkable
variety of contexts, including the compiling of most economic and so-
cial statistical summaries, nationally and internationally. For descriptive
purposes it is desired to estimate some attribute for a population - the
number of current smokers, the total payroll for an industry for the
month, the proportion of adults over age 65 who are literate - and this
is done on the basis of careful measurement of some variate for a small,
objectively chosen subset of population members.
The theory of sample surveys for descriptive aims is the subject of
most of the book. As is traditional, after narrowing the focus to the
study of sampling error, we will concentrate in Chapters 2, 3 and 4 on
the mechanics of design-based estimation, particularly the distributional
properties of estimators under randomized sampling schemes. Then, in
a somewhat non-traditional manner, we will consider inference itself
in Chapter 5, together with the role of models in descriptive inference.
The models just referred to are frequently called superpopulation
models. They are essentially stochastic models for the response variate,
but often the best way of thinking of them is to imagine the population

at hand being chosen randomly from a hypothetical superpopulation of


populations. There are at least three reasons for trying to integrate this
kind of model into a discussion of sampling inference. First, although
traditional estimation techniques make no explicit use of superpopu-
lation models, assumptions very close to models are often implicit,
particularly in the traditional ways of incorporating auxiliary informa-
tion about the population. Second, if the term inference is meaningful
at all, a superpopulation model we happen to believe in must have
bearing on inference from a sample, however that sample is drawn.
The third reason is pragmatic: since the assessment of non-sampling
errors can be handled only through models, models can provide a con-
venient framework for considering both sampling and non-sampling
errors together.
Not all surveys are carried out purely for descriptive purposes. Some-
times it is desired to draw conclusions about what would typically be
seen in populations resembling the one at hand. For example, we may
have a study population of schoolchildren, and may wish to use the
results of a survey on this population to relate school performance with
television viewing habits. We are probably interested in applying the
relationship to children generally, children like the ones in the study
population. This kind of aim is often called analytic. In a survey for
analytic purposes, a superpopulation is not only assumed but is actually
the object of interest.
Techniques for analytic inference were first developed in non-survey
contexts, assuming a sampling scheme which was effectively simple
random sampling. In practice, however, the data being analysed may
come from more complex sampling schemes. In recent years it has
become clearer, as long suspected, that approaching analytic inference
from a suitably adapted survey sampling perspective can be illuminat-
ing. Thus Chapter 6 will be devoted to an examination of analytical
purposes in surveys, and in particular approaches to the generalized
linear model for survey data.
The modelling in Chapter 6 will have bearing also on special de-
scriptive aims such as the estimation of local or small-area attributes,
and accounting at a global level for non-sampling errors. In a similar
spirit, the concluding chapter, Chapter 7, will return primarily to de-
scriptive surveys, particularly surveys for which the temporal or spatial
structure of the population is a significant element in the modelling of
the response variates.
The scope of sampling theory can thus be seen to be very broad.
Still, as is often pointed out, sampling theory addresses only the easiest

aspects of survey design and analysis, those which can readily be for-
mulated in mathematical terms. The difficult aspects are the scientific
questions (such as whether or not a survey can be designed which will
actually provide the answers we seek), the implementation questions
(such as whether we can achieve the response rates which will make for
results we can trust), and the measurement questions (such as how to
design a questionnaire or interview format for accurate measurement of
response variates). The next few sections will describe more explicitly
the total context in which the theory of sampling is applied.

1.1 Survey populations and samples


By a survey, then, is meant the process of measuring characteristics of
some of the members of an actual population, with the purpose of mak-
ing quantitative generalizations about the population as a whole or its
subpopulations (or, sometimes, its superpopulations). This definition is
broad enough to include not only opinion polls, labour force surveys,
market research and social surveys, but also surveys of wildlife, ex-
ploratory drilling for oil or minerals, and quality control sampling of
manufactured items or records undergoing audit.
In a survey, the members of the population whose characteristics are
being measured constitute the sample. Thus the sample is in general a
subset of the population.
In this introductory chapter, the word sample is used in two slightly
different ways, and we will try to make the distinction when it is im-
portant. When the context is the planning of a survey, 'sample' will
mean the intended sample, or the subset of population members whose
characteristics the surveyors intend to measure. When we are talking
about the results of a survey, and estimates from a survey, 'sample'
will mean the achieved sample, or the subset of population members
whose characteristics have actually been measured.
We illustrate the difference by noting that a census, technically,
means a survey for which the intended sample is the entire popula-
tion; the achieved sample consists of all population members the census
takers have been able to find.
For descriptive surveys, particularly in the discussion of survey error,
it is useful also to distinguish several related concepts of population.
We begin with two: the target population is the population to which
the surveyor would like to be able to generalize from the sample; the
represented population is the population to which the surveyor can
legitimately generalize from the sample. In simple cases, such as sam-

pling from a population of records for audit, the two populations may
coincide. In surveys of human populations they generally do not: the
population from which we are actually able to sample (the represented
population) is usually only an approximation to the population about
which information is desired (the target population).
There are essentially two main sources of discrepancy between the
target and represented populations. The first is the inadequacy of the
sampling frame, the list or map from which the units to be sampled
are identified. The second is the possibility of non-response, or more
generally the possibility of inaccessibility of the units to be sampled.
For example, suppose that for a household expenditure survey, the
target population consists of all dwelling units in a city, and the sam-
pling frame is a list of all the dwelling units, compiled three years ago.
Then the represented population does not include newly constructed
dwelling units. Suppose further that the survey requires that for a sam-
pled dwelling unit to respond, some occupant must be at home on a
specified day. Thus membership in the achieved sample, as a subset
of the intended sample, is related to availability of occupants during
the day, which may be related to expenditure. In such a case we might
wish to specify the represented population as consisting of 'all dwelling
units on the three-year-old list at which someone is home (and willing
and able to respond if asked) on the survey day'.
There is clearly some subjectivity in the determination of the repres-
ented population as we have defined it. In the example just discussed,
if it is believed that the potentially responding dwelling units on the list
are representative of the whole list for the purposes of the survey, the
represented population could simply consist of the frame population, or
all dwelling units on the list which are in existence on the survey date.
However, in many situations this kind of assumption is inappropriate,
and it is better to think of the represented population as the respondent
part of the frame population, namely the subpopulation of accessible
members who would have responded. Accordingly, we will identify the
represented population with the respondent part of the frame population,
particularly in the discussion of error components in Section 1.3. We
will think of the intended sample as drawn from the frame population,
but the achieved sample as drawn from the respondent part.
For theoretical discussions it is convenient to think of the frame pop-
ulation as a subset of the target population, consisting of those target
population units incorporated in the frame. However, in practice there
are many possible relationships of the sampling frame and the target

population; Lessler and Kalsbeek (1992) have provided a comprehen-


sive discussion.

1.2 Population quantities


When the descriptive objectives of a survey are made precise, they usu-
ally come down to an attempt to estimate certain population quantities,
such as population proportions or rates, population totals, population
averages or means. (Sometimes more complex quantities such as cor-
relation or regression coefficients may be of interest.) We will now
introduce some notation for the basic quantities.
A survey population (target or other) is usually composed of a finite
number of members, called elementary units. These may be businesses,
dwelling units, individual people, accounts, etc. The population size is
the number of elementary units in the population, and is usually denoted
by N.
The population proportion of members having a certain character-
istic C is
$$P = \frac{M}{N},$$
where N is the population size and M is the number of elementary
units which have the characteristic C. The quantity M is called the
population number having the characteristic C.
Often the quantity of interest is the total or average of some real
variate y. For example, consider a population of small businesses. The
total number of people employed by all businesses in the population
can be written as
$$\sum_{j=1}^{N} y_j,$$
where $y_j$ is the number employed by business $j$. A population total is
defined as a quantity which can be expressed this way. The population
average or population mean of the variate $y$ is
$$\Big(\sum_{j=1}^{N} y_j\Big)\Big/N.$$

More complex population quantities can similarly be defined in terms


of the values of variates $x, y, z, \ldots$ for unit $j$, $j = 1, \ldots, N$.
It is worth noting that the proportion P can be thought of as a
special case of the population mean. If we let $y_j = 1$ for a population
member $j$ with characteristic $C$, and $y_j = 0$ for a member not having
characteristic $C$, then we can write
$$M = \sum_{j=1}^{N} y_j \quad\text{and}\quad P = \Big(\sum_{j=1}^{N} y_j\Big)\Big/N.$$
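As a toy illustration (the numbers are invented for this sketch): if $N = 200$ and $M = 50$ units bear characteristic $C$, the indicator coding gives $\sum_{j=1}^{N} y_j = M = 50$, and hence $P = 50/200 = 0.25$.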

1.3 Survey error


A typical survey objective is to estimate a descriptive population quant-
ity using the values of the appropriate variates for units in the sample.
The total survey error in the estimate is the amount by which the
estimate differs from the true value of the quantity for the target popu-
lation.
The total survey error can be written as a sum of component errors:
total error = response error + sampling error + non-response error
+ coverage error.
For example, suppose a population proportion P were being estimated.
If we denote the estimate by $\hat{P}$, then the decomposition of survey error
could be written as follows:
$$\hat{P} - P_{\mathrm{target}} = (\hat{P} - P_{\mathrm{true}}) + (P_{\mathrm{true}} - P_{\mathrm{resp}}) + (P_{\mathrm{resp}} - P_{\mathrm{frame}}) + (P_{\mathrm{frame}} - P_{\mathrm{target}}).$$

The response error or measurement error $\hat{P} - P_{\mathrm{true}}$ is the amount by
which the estimate $\hat{P}$ differs from the value $P_{\mathrm{true}}$ it would have had if
the relevant variate values ('yes' or 'no'; 1 or 0) in the achieved sample
had been determined correctly; i.e. it is that part of the total error which
results from errors in the responses of individual elementary units in
the achieved sample.
The sampling error $P_{\mathrm{true}} - P_{\mathrm{resp}}$ is the amount by which the estim-
ate from the achieved sample would differ from the true value of the
quantity for the represented population, assuming correct determination
of the relevant variate values; i.e. it is that part of the total error which
results from the fact that it has been possible to observe only a subset
of the represented population, here taken to be the respondent part of
the frame population.
The non-response error is $P_{\mathrm{resp}} - P_{\mathrm{frame}}$, or the amount by which
$P_{\mathrm{resp}}$ differs from $P_{\mathrm{frame}}$, the value of the quantity for the frame pop-
ulation. If the respondent or accessible part of the frame population is

not representative of the whole, this error component is the component


due to the 'atypical' nature of the respondent part.
Finally, the coverage error $P_{\mathrm{frame}} - P_{\mathrm{target}}$ is the amount by which
the quantity for the frame population differs from the quantity for the
target population.
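As a purely hypothetical numerical sketch of this decomposition (none of these figures come from an actual survey), suppose $\hat{P} = 0.52$, $P_{\mathrm{true}} = 0.50$, $P_{\mathrm{resp}} = 0.47$, $P_{\mathrm{frame}} = 0.48$ and $P_{\mathrm{target}} = 0.49$. Then
$$\hat{P} - P_{\mathrm{target}} = \underbrace{0.02}_{\text{response}} + \underbrace{0.03}_{\text{sampling}} + \underbrace{(-0.01)}_{\text{non-response}} + \underbrace{(-0.01)}_{\text{coverage}} = 0.03,$$
and the components may partially cancel, as the last two do here.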

1.4 Sampling and non-sampling errors


Coverage error, non-response error and response or measurement error
are sometimes called non-sampling errors. There are important dif-
ferences between sampling and non-sampling errors. For one thing,
particularly if randomization is used in selecting the sample, the ex-
tent of sampling errors is much easier to estimate than the extent of
non-sampling errors; in contrast, non-sampling errors are often left un-
estimated or unacknowledged in reports of surveys. Also, in principle,
sampling errors can be made small by the choice of a sufficiently large
and well-deployed sample. On the other hand, the only way to con-
trol non-sampling errors is to exercise great care in the planning and
execution of the survey. Research continues on the best ways to go
about this in a changing technological environment. Some useful re-
cent books are those of Groves et al. (1988), Lessler and Kalsbeek
(1992) and Oppenheim (1992).

1.5 Bias and variability


Suppose we define bias in an error component (loosely) as the aver-
age of that error component if the survey were repeated many times
independently under the same conditions. Any of the components of
error above can give rise to a positive or negative bias. Thus we have
response or measurement bias, sampling bias, non-response bias and
coverage bias as possibilities. For example, suppose that among sub-
scribers to the Canadian Journal of Statistics, the willingness to re-
spond to a yearly salary survey tends to increase with age. Then the
median salary of those subscribers who would respond might system-
atically tend to overestimate the median salary of the frame population
(all subscribers on the mailing list). This would lead to a positive non-
response bias, contributing to the overall error in an estimate of median
salary from the survey.
The variability in an error component is the extent to which that
component would vary about its average value if the survey were re-
peated many times independently under the same conditions. Any of

the components of error above can be thought of as having variability.


However, the types most commonly met with are measurement error
variability (because of variability in the measured values of responses
on the same subjects) and sampling variability (because of the fact
that a different sample would be obtained each time the survey was
repeated).

1.6 Focus on sampling error


In Chapters 2-5 of this book, we will be dealing with ways to control
and to assess sampling error for descriptive quantities in the context of
probability sampling schemes. When we do this, we will usually assume
for simplicity that there are no non-sampling errors of consequence:
that we are estimating quantities for a frame population which coincides
with the target population, and that a response is correctly determinable
for every unit which is selected by the surveyor for inclusion in the
sample. Thus the target, frame and represented populations will be
taken to coincide, and the achieved sample will be taken to be the
intended sample. The techniques and formulae we then develop will
apply in real life situations also, but in general only to the sampling
error component of the total error. Sampling bias will be measured by
expected value of sampling error, with respect to a sampling design
or a model; sampling variability will be expressed by a distribution of
sampling error, usually through a mean squared error.
CHAPTER 2

The mathematics of probability sampling designs

In this chapter, we discuss the mathematics traditionally at the heart of


the theory of survey sampling. This is concerned with the estimation
of finite population quantities, the objects of descriptive inference. For
the sake of simplicity, it is assumed that there is no non-sampling error.
The only source of randomness, yielding probabilities and associated
expectations, is a probability sampling design. Estimators of population
quantities for finite populations are functions of the observations in the
sample drawn. Unbiasedness of an estimator in this chapter means
unbiasedness with respect to the design, and it means in effect that the
sampling error has mean zero.
The simplest and in a sense the most fundamental probability sam-
pling design is simple random sampling. However, there are many other
designs in common use, and the aim here is to provide a unified account
of some of their important properties in general terms. Detailed treat-
ments of specific designs and techniques are available in the books of
Cochran (1977), Sukhatme et al. (1984), Thompson (1992) and others.

2.1 Randomized sampling designs

Let us first introduce some terminology and notation. Consider a survey


population U whose units are labelled 1, ... , N. Thus in the notation
of sets
$$U = \{1, \ldots, N\}. \qquad (2.1)$$
Let $y$ be a real- or vector-valued variate, and let $y_j$ be the value of
$y$ associated with unit $j$. For instance, in a population of business
establishments, $y_j$ might be the number of employees of firm $j$. The
population $y$ values may be thought of as being arranged in a population
array
$$\mathbf{y} = (y_1, \ldots, y_N). \qquad (2.2)$$

The array $\mathbf{y} = (y_1, \ldots, y_N)$ is sometimes called the population vector
if $y$ is one-dimensional.
Denote by $\mu_y$ the population mean and $T_y$ the population total:
$$\mu_y = \Big(\sum_{j=1}^{N} y_j\Big)\Big/N, \qquad (2.3)$$
$$T_y = \sum_{j=1}^{N} y_j. \qquad (2.4)$$
(Each of these has the same dimension as y.) These population quanti-
ties are fundamental; most quantities estimated in sample surveys can
be expressed as functions of population means or of population totals.
In particular, as we have seen, proportions can be viewed as population
means, and numbers as population totals.
Functions of y are generally to be estimated from observation of a
sample of the 'coordinates' $y_j$. Often, samples are obtained by drawing
units successively from the population according to some randomized
scheme. We might denote the sample sequence of unit labels from U
by
$$s^* = (j_1, j_2, \ldots, j_n);$$
then if the labels of units are identifiable, the full data from the sampling
experiment would consist of the pair sequence

$$\chi_{s^*} = \big((j_1, y_{j_1}), \ldots, (j_n, y_{j_n})\big),$$


each sampled $y$ value being paired with the label of its unit.
Since there is no response error present, if any label is repeated
in the sequence, so is the corresponding $y$ value. Suppose that we
regard the randomized scheme as providing a family of distributions
for $\chi_{s^*}$ indexed by the parameter $\mathbf{y}$. If the randomized scheme is non-
informative, in the sense that its draw probabilities do not depend on
$\mathbf{y}$, then it can be shown (Basu, 1958) that for any inference about $\mathbf{y}$ the
reduced data
$$\chi_s = \{(j, y_j) : j \in s\} \qquad (2.5)$$
are sufficient in the statistical sense, where $s$ is the set of distinct mem-
bers of the sequence $s^*$; this is consistent with the intuitive notion that
repetitions of pairs in $\chi_{s^*}$ should provide no new information. From
here on, in most of the discussion, we will regard $\chi_s$ of (2.5) as em-
bodying the sample data.
In keeping with this, we consider samples to be subsets s of U =
{1, ... , N}, and denote by S the collection of all subsets s of U.

A sampling design, sometimes referred to as a 'probability sampling


design' or a 'randomized sampling design', is then formally a probabil-
ity function on S. That is, with each sample s is associated a probability
$$p(s) \qquad (2.6)$$
which is interpreted as the probability that s is the sample drawn. Each
p(s) is a number in [0, 1], and

$$\sum_{s \in \mathcal{S}} p(s) = 1.$$

EXAMPLE 2.1: If $N = 3$, then $\mathcal{S} = \{\emptyset, \{1\}, \{2\}, \{3\}, \{1, 2\}, \{1, 3\},
\{2, 3\}, \{1, 2, 3\}\}$. The sampling design which corresponds to the scheme
which selects two units by simple random sampling (SRS) without re-
placement has
$$p(\emptyset) = p(\{1\}) = p(\{2\}) = p(\{3\}) = p(\{1, 2, 3\}) = 0;$$
$$p(\{1, 2\}) = p(\{1, 3\}) = p(\{2, 3\}) = 1/3.$$

EXAMPLE 2.2: Suppose that $N = 3$ and the scheme selects two units
by SRS with replacement. Then the sampling design is given by
$$p(\emptyset) = p(\{1, 2, 3\}) = 0; \quad p(\{1\}) = p(\{2\}) = p(\{3\}) = 1/9;$$
$$p(\{1, 2\}) = p(\{1, 3\}) = p(\{2, 3\}) = 2/9.$$
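The designs of Examples 2.1 and 2.2 are small enough to tabulate by machine. The following sketch (in Python, with exact rational arithmetic; the variable names are ours, not the book's) enumerates both designs and confirms the probabilities above.

    from itertools import combinations, product
    from fractions import Fraction

    U = [1, 2, 3]   # population labels, N = 3

    # Example 2.1: SRS without replacement, n = 2.
    # Each of the C(3,2) = 3 two-element subsets gets probability 1/3;
    # every other subset of U implicitly has probability 0.
    p_wor = {frozenset(s): Fraction(1, 3) for s in combinations(U, 2)}

    # Example 2.2: SRS with replacement, 2 draws.  The design is induced
    # by the 9 equally likely ordered draw sequences, each reduced to the
    # set of distinct units it contains.
    p_wr = {}
    for seq in product(U, repeat=2):
        s = frozenset(seq)
        p_wr[s] = p_wr.get(s, Fraction(0)) + Fraction(1, 9)

    assert sum(p_wor.values()) == 1 and sum(p_wr.values()) == 1
    assert p_wr[frozenset({1})] == Fraction(1, 9)      # p({1}) = 1/9
    assert p_wr[frozenset({1, 2})] == Fraction(2, 9)   # p({1,2}) = 2/9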

Let the sample size
$$n(s) \qquad (2.7)$$
be defined to be the total number of units in the set $s$. A fixed size $(n)$
sampling design is one for which $p(s) = 0$ whenever $n(s)$ is not equal
to $n$. For example, SRS without replacement and $n$ draws gives a fixed
size $(n)$ design:
$$p(s) = \begin{cases} 1\big/\binom{N}{n} & \text{if } n(s) = n \\ 0 & \text{if } n(s) \neq n. \end{cases} \qquad (2.8)$$
On the other hand, the design corresponding to SRS with replacement,
$n$ draws, is not a fixed size design for $n \geq 2$.
Another important set of probabilities associated with a sampling
design is the set of inclusion probabilities. The inclusion probability
of unit j is defined to be the probability that j appears in the sample
drawn. It is denoted by $\pi_j$, and in terms of the $p(s)$ probabilities it is

given by
$$\pi_j = \sum_{s:\, j \in s} p(s). \qquad (2.9)$$
Since the $\pi_j$ are associated with events that are not generally mutually
exclusive, their sum over all units is usually greater than one. In fact,
their sum is the expected sample size, as will be seen later on.
It is also useful to define joint inclusion probabilities for pairs of
units in the population. For distinct units $j$ and $k$, let $\pi_{jk}$ denote the
probability that both $j$ and $k$ appear in the sample.
Then, in terms of the $p(s)$ probabilities,
$$\pi_{jk} = \sum_{s:\, j,k \in s} p(s). \qquad (2.10)$$

EXAMPLE 2.3: The inclusion probabilities for SRS without replace-
ment follow easily from the definition of the design. Since there are
$\binom{N-1}{n-1}$ samples of size $n$ which contain $j$, and each has probability $1\big/\binom{N}{n}$,
then $\pi_j = \binom{N-1}{n-1}\big/\binom{N}{n} = n/N$. Similarly, $\pi_{jk} = n(n-1)/[N(N-1)]$.
More generally, we could define for any set $\alpha$ of population units a set
inclusion probability $\pi_\alpha$, to be the probability that $\alpha$ is contained in the
sample drawn.
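Continuing the sketch above (it reuses p_wor), the definitions (2.9) and (2.10) can be applied directly to any tabulated design; for SRS without replacement with $N = 3$, $n = 2$ the result agrees with Example 2.3.

    def inclusion_probs(p):
        """First- and second-order inclusion probabilities (2.9)-(2.10)
        from a tabulated design p: {sample set: probability}."""
        pi, pi2 = {}, {}
        for s, prob in p.items():
            for j in s:
                pi[j] = pi.get(j, Fraction(0)) + prob
            for j in s:
                for k in s:
                    if j < k:
                        pi2[(j, k)] = pi2.get((j, k), Fraction(0)) + prob
        return pi, pi2

    pi, pi2 = inclusion_probs(p_wor)
    assert all(v == Fraction(2, 3) for v in pi.values())   # pi_j = n/N
    assert all(v == Fraction(1, 3) for v in pi2.values())  # pi_jk = n(n-1)/[N(N-1)]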

2.2 Expectations and variances of sample sums; the HT estimator

Sampling expectations and variances are used in assessing sampling


bias and uncertainty in estimators of population quantities. In this sec-
tion we focus on expressing the moments of sample sums in terms of
the inclusion probabilities defined in Section 2.1.
Let $z_1, \ldots, z_N$ be the values for the units in $U$ of some real- or
vector-valued variate $z$. Let $E$ denote expectation with respect to the
sampling design. The sample sum of $z$ can be written $\sum_{j \in s} z_j$, and its
design expectation, by definition, is given by
$$E\Big(\sum_{j \in s} z_j\Big) = \sum_{s \in \mathcal{S}} p(s)\Big(\sum_{j \in s} z_j\Big). \qquad (2.11)$$
Interchanging the order of summation on the right gives $\sum_{j=1}^{N} z_j
\sum_{s:\, j \in s} p(s)$, and the definition of inclusion probability $\pi_j$ yields

the identity
$$E\Big(\sum_{j \in s} z_j\Big) = \sum_{j=1}^{N} z_j \pi_j. \qquad (2.12)$$
An alternative method of showing (2.12) is to use the fact that
$$\sum_{j \in s} z_j = \sum_{j=1}^{N} z_j I_{js}, \qquad (2.13)$$
where the sample indicator random variable $I_{js}$ is given by
$$I_{js} = \begin{cases} 1 & \text{if } j \in s \\ 0 & \text{if } j \notin s. \end{cases} \qquad (2.14)$$
Since $z_1, \ldots, z_N$ are non-random, (2.13) implies $E(\sum_{j \in s} z_j)
= \sum_{j=1}^{N} z_j E(I_{js})$, and since $E(I_{js}) = \pi_j$, the identity (2.12) follows
at once.
From (2.12) with $z_j \equiv 1$ it follows that
$$E\{n(s)\} = \sum_{j=1}^{N} \pi_j; \qquad (2.15)$$
the expectation of sample size for a given design is the sum of its
inclusion probabilities. In particular, for a fixed size $(n)$ design, the
inclusion probabilities will sum to $n$.
A sampling design is called self-weighting if all its inclusion proba-
bilities are equal. Designs which are both self-weighting and of fixed
size are of particular importance (Kish, 1965), because for these de-
signs, a sample mean is unbiased as an estimator of the corresponding
population mean. This can be seen as follows. If
$$\bar{y}_s = \Big(\sum_{j \in s} y_j\Big)\Big/n \qquad (2.16)$$
is the sample mean, its expectation can be obtained from (2.12) with
$z_j = y_j/n$. If the design $p$ is self-weighting and of fixed size $(n)$, then
(2.15) and (2.12) imply that
(i) $\pi_j = n/N$ for all $j$;
(ii) $E(\bar{y}_s) = \mu_y$ for any population array $\mathbf{y}$.
The consequence (ii) is a mathematical statement of the sampling un-
biasedness of $\bar{y}_s$ as an estimator of the population mean $\mu_y$.
SRS without replacement is clearly self-weighting and of fixed size.
Many other designs used in practice also have these properties. One

example is systematic sampling with sampling interval $K$, if $N = Kn$,
where $n$ is an integer:
select the first unit $j_1$ at random from $1, \ldots, K$, and let the sample be
$\{j_1, j_1 + K, j_1 + 2K, \ldots\}$, i.e. take every $K$th unit from a random start.
Another example is stratified random sampling with proportional allo-
cation (see below).
For a self-weighting design of fixed size, it is also true that
$$E(N\bar{y}_s) = T_y \quad\text{for any population array } \mathbf{y}.$$
That is, the expansion estimator $N\bar{y}_s$ is an unbiased estimator of the
population total $T_y$. This is the reason for the term self-weighting: in the
expansion estimator, each $y_j$ is multiplied by the same raising factor
$N/n$, which may be thought of as the number of population units that
the sampled unit $j$ represents.
The identity (2.12) also yields a simple unbiased estimator for $T_y$
for an arbitrary sampling design, provided it has a positive inclusion
probability for each $j$. It is easy to see by taking $z_j = y_j/\pi_j$ that if all
$\pi_j > 0$, the Horvitz-Thompson estimator (HT estimator)
$$\hat{T}_y = \sum_{j \in s} y_j/\pi_j \qquad (2.17)$$
is unbiased for $T_y$ (Horvitz and Thompson, 1952). (The estimator is
taken to have value 0 if $s$ is empty.) When the design is self-weighting
and of fixed size, the HT estimator reduces to the expansion estimator
$N\bar{y}_s$.
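Design-unbiasedness of (2.17) can be checked by brute-force enumeration, continuing the sketch above (it reuses p_wor and pi); the population array below is arbitrary.

    y = {1: Fraction(3), 2: Fraction(5), 3: Fraction(1)}   # an arbitrary array
    T_y = sum(y.values())

    def E_HT(p, pi, y):
        # E(T_hat) = sum_s p(s) * sum_{j in s} y_j / pi_j, as in (2.11)
        return sum(prob * sum(y[j] / pi[j] for j in s)
                   for s, prob in p.items())

    assert E_HT(p_wor, pi, y) == T_y   # unbiasedness via identity (2.12)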
A simple example of a design which need not be self-weighting, and
for which the HT estimator need not be the expansion estimator, is
provided by stratified random sampling. Here, $U$ is the disjoint union
of strata $S_1, \ldots, S_H$. For example, for a human population the strata
could be age-sex groupings. The sizes $N_1, \ldots, N_H$ of the strata are
known, and
$$\sum_{h=1}^{H} N_h = N.$$
For each $h$ independently, the design prescribes an SRS of $n_h$ draws
without replacement from $S_h$. Thus
$$\pi_j = n_h/N_h \quad\text{for } j \in S_h.$$
The HT estimator for $T_y$ is the sum
$$\sum_h N_h \bar{y}_h,$$

where Yh is the mean of y in the part of the sample coming from Sh; the
hth term in the sum is the expansion estimator for the total of y over
Sh. The corresponding estimator of JLy is the stratified sample mean
H
Yst = L WhYh,
h=1
(2.18)

where Wh = Nh/N. Only if the allocation is proportional, that is only


if nh/Nh = n/N for all h, is the design self-weighting, and Yst equal
to the ordinary sample mean.
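A minimal numerical sketch of (2.18), with invented stratum sizes and sample means:

    # Stratified sample mean (2.18): ybar_st = sum_h (N_h / N) * ybar_h.
    strata_sizes = [40, 60]        # hypothetical N_1, N_2
    stratum_means = [10.0, 4.0]    # hypothetical sample means ybar_h
    N_pop = sum(strata_sizes)
    ybar_st = sum(Nh * yb for Nh, yb in zip(strata_sizes, stratum_means)) / N_pop
    assert ybar_st == 6.4          # (40*10 + 60*4) / 100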
To derive variance formulae for sample sums under general sampling
designs, it is most convenient to return to the representation
$$\sum_{j \in s} z_j = \sum_{j=1}^{N} z_j I_{js}$$
of (2.13). Then for one-dimensional $z$
$$\mathrm{Var}\Big(\sum_{j \in s} z_j\Big) = \sum_{j=1}^{N} z_j^2\, \mathrm{Var}(I_{js}) + \sum\sum_{j \neq k}^{N} z_j z_k\, \mathrm{Cov}(I_{js}, I_{ks}). \qquad (2.19)$$
Since $\mathrm{Var}(I_{js}) = \pi_j(1 - \pi_j)$ and $\mathrm{Cov}(I_{js}, I_{ks}) = \pi_{jk} - \pi_j\pi_k$, this
relation becomes
$$\mathrm{Var}\Big(\sum_{j \in s} z_j\Big) = \sum_{j=1}^{N} z_j^2\, \pi_j(1 - \pi_j) + \sum\sum_{j \neq k}^{N} z_j z_k (\pi_{jk} - \pi_j\pi_k). \qquad (2.20)$$
A more compact formula is obtainable when the design is of fixed
size $(n)$, for then $\sum_{k=1}^{N} I_{ks} = n$ with probability 1, and
$$-\sum_{k \neq j}^{N} \mathrm{Cov}(I_{js}, I_{ks}) = -\mathrm{Cov}(I_{js}, n - I_{js}) = \mathrm{Var}(I_{js}). \qquad (2.21)$$
The first term of (2.19) can be written
$$-\sum_{j=1}^{N} z_j^2 \sum_{k \neq j}^{N} \mathrm{Cov}(I_{js}, I_{ks}) = -\frac{1}{2} \sum\sum_{j \neq k}^{N} (z_j^2 + z_k^2)\, \mathrm{Cov}(I_{js}, I_{ks}),$$
and hence
$$\mathrm{Var}\Big(\sum_{j \in s} z_j\Big) = \frac{1}{2} \sum\sum_{j \neq k}^{N} (z_j - z_k)^2 (\pi_j\pi_k - \pi_{jk}). \qquad (2.22)$$

Now the HT estimator of $T_y$ is $\sum_{j \in s} z_j$, where $z_j = y_j/\pi_j$. Assume
all $\pi_j > 0$. For a fixed size $(n)$ design, (2.22) implies for $y$ real that
$$\mathrm{Var}(\hat{T}_y) = \mathrm{Var}\Big(\sum_{j \in s} y_j/\pi_j\Big) = \frac{1}{2} \sum\sum_{j \neq k}^{N} \Omega_{jk} \Big(\frac{y_j}{\pi_j} - \frac{y_k}{\pi_k}\Big)^2, \qquad (2.23)$$
where $\Omega_{jk} = \pi_j\pi_k - \pi_{jk}$. Similar computations show that, for $x$ and $y$
real,
$$\mathrm{Cov}(\hat{T}_x, \hat{T}_y) = \frac{1}{2} \sum\sum_{j \neq k}^{N} \Omega_{jk} \Big(\frac{x_j}{\pi_j} - \frac{x_k}{\pi_k}\Big) \Big(\frac{y_j}{\pi_j} - \frac{y_k}{\pi_k}\Big), \qquad (2.24)$$
and, for vector-valued $y$,
$$\mathrm{Cov}(\hat{T}_y) = \frac{1}{2} \sum\sum_{j \neq k}^{N} \Omega_{jk} \Big(\frac{y_j}{\pi_j} - \frac{y_k}{\pi_k}\Big) \Big(\frac{y_j}{\pi_j} - \frac{y_k}{\pi_k}\Big)^{\top}, \qquad (2.25)$$
where $\top$ denotes transpose.


Now in analogy with (2.12), for a variable $z_{jk}$ defined for pairs of
population units,
$$E\Big(\sum_{j \in s} \sum_{k \in s,\, k \neq j} z_{jk}\Big) = \sum\sum_{j \neq k}^{N} z_{jk}\, \pi_{jk}. \qquad (2.26)$$
From this it can be shown that if all $\pi_{jk} > 0$ (which implies all $\pi_j > 0$)
and the design has fixed size, an unbiased estimator for $\mathrm{Var}(\hat{T}_y)$ when
$y$ is real is
$$v(\hat{T}_y) = \frac{1}{2} \sum\sum_s w_{jk} \Big(\frac{y_j}{\pi_j} - \frac{y_k}{\pi_k}\Big)^2, \qquad (2.27)$$
where $w_{jk} = (\pi_j\pi_k - \pi_{jk})/\pi_{jk}$ and $\sum\sum_s$ denotes summation over
$j, k \in s$ with $j \neq k$. A design for which $\pi_{jk} > 0$ for all $j, k$ is
sometimes called measurable (Särndal et al., 1992).
It is possible to show that generally no unbiased estimator of $\mathrm{Var}(\hat{T}_y)$
exists if the design is non-measurable, that is if some $\pi_{jk} = 0$. (See, for
example, Liu and Thompson, 1983.) An example of a non-measurable
design is systematic sampling, for which $\pi_{jk} = 0$ if $j, k$ are not sep-
arated by a multiple of the sampling interval $K$. Thus for systematic
sampling no unbiased estimator of $\mathrm{Var}(N\bar{y}_s)$ exists.
The estimator (2.27) is called the Yates-Grundy-Sen variance estima-
tor (Yates and Grundy, 1953; Sen, 1953). The corresponding estimator

of $\mathrm{Cov}(\hat{T}_y)$ in the vector case is
$$\frac{1}{2} \sum\sum_s w_{jk} \Big(\frac{y_j}{\pi_j} - \frac{y_k}{\pi_k}\Big) \Big(\frac{y_j}{\pi_j} - \frac{y_k}{\pi_k}\Big)^{\top}. \qquad (2.28)$$
The Yates-Grundy-Sen variance estimator has the advantage of always
being non-negative if all $w_{jk} > 0$. This is not true of the alternative vari-
ance estimator derived directly from the variance form (2.20), namely
$$\sum_{j \in s} (1 - \pi_j)(y_j/\pi_j)^2 - \sum\sum_s w_{jk} (y_j/\pi_j)(y_k/\pi_k). \qquad (2.29)$$
However, the latter is unbiased more generally, even when the design
is not of fixed size.
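For a design small enough to enumerate, the unbiasedness of (2.27) can again be verified exactly. The sketch below continues the earlier ones (reusing p_wor, pi, pi2 and y); the pair sum runs once over each unordered pair $j < k$ in $s$, which is equivalent to the halved ordered sum in (2.27).

    def var_HT_true(p, pi, y):
        # exact design variance of the HT estimator, by enumeration
        Ey = sum(prob * sum(y[j] / pi[j] for j in s) for s, prob in p.items())
        return sum(prob * (sum(y[j] / pi[j] for j in s) - Ey) ** 2
                   for s, prob in p.items())

    def v_YGS(s, pi, pi2, y):
        # Yates-Grundy-Sen estimator (2.27) for the sample s
        v = Fraction(0)
        for j in s:
            for k in s:
                if j < k:
                    w_jk = (pi[j] * pi[k] - pi2[(j, k)]) / pi2[(j, k)]
                    v += w_jk * (y[j] / pi[j] - y[k] / pi[k]) ** 2
        return v

    E_v = sum(prob * v_YGS(s, pi, pi2, y) for s, prob in p_wor.items())
    assert E_v == var_HT_true(p_wor, pi, y)   # unbiasedness of (2.27)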
The general variance forms of this section do not lend themselves
easily to computation, and it is advisable to reduce them to standard
forms in specific cases. For example, in SRS without replacement,
(2.25) and (2.28) can be used to derive the standard formulae for the
covariance matrix $\mathrm{Cov}(N\bar{y}_s)$ and its usual unbiased estimator. For each
$j, k$, $\pi_j = \pi_k = n/N$, $\pi_{jk} = n(n-1)/[N(N-1)]$ and the factor
$\Omega_{jk} = n(N-n)/[N^2(N-1)]$, so that it follows from (2.25) that
$$\mathrm{Cov}(N\bar{y}_s) = \frac{N^2}{n}\Big(1 - \frac{n}{N}\Big) \frac{1}{2N(N-1)} \sum\sum_{j \neq k}^{N} (y_j - y_k)(y_j - y_k)^{\top}.$$
This becomes the standard formula
$$\mathrm{Cov}(N\bar{y}_s) = \frac{N^2}{n}\Big(1 - \frac{n}{N}\Big) S_y^2 \qquad (2.30)$$
when it is noted that the population covariance matrix
$$S_y^2 = \sum_{j=1}^{N} (y_j - \mu_y)(y_j - \mu_y)^{\top}\big/(N-1) \qquad (2.31)$$
has the alternative form
$$S_y^2 = \sum\sum_{j \neq k}^{N} (y_j - y_k)(y_j - y_k)^{\top}\big/[2N(N-1)] \qquad (2.32)$$
as an average of the quantities $(y_j - y_k)(y_j - y_k)^{\top}/2$.
The corresponding variance estimator for real $y$, derivable from
(2.27), is
$$v(N\bar{y}_s) = \frac{N^2}{n}\Big(1 - \frac{n}{N}\Big) s_y^2, \qquad (2.33)$$

where
$$s_y^2 = \sum_{j \in s} (y_j - \bar{y}_s)^2\big/(n-1) = \sum\sum_s (y_j - y_k)^2\big/[2n(n-1)] \qquad (2.34)$$
is the sample variance. The form $s_y^2 = \big(\sum_{j \in s} y_j^2 - n\bar{y}_s^2\big)/(n-1)$ is the
one normally used in computation of (2.33). See Section 2.5 for some
further discussion of SRS forms.
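For SRS without replacement, the standard form (2.33), with $s_y^2$ of (2.34), should coincide sample by sample with the Yates-Grundy-Sen form (2.27); continuing the enumeration sketch:

    n, N = 2, 3
    for s in p_wor:
        ys = [y[j] for j in s]
        ybar = sum(ys) / n
        s2 = sum((v - ybar) ** 2 for v in ys) / (n - 1)          # (2.34)
        v_std = Fraction(N * N, n) * (1 - Fraction(n, N)) * s2   # (2.33)
        assert v_std == v_YGS(s, pi, pi2, y)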

A note on the HT estimator


Although the HT estimator is unbiased for $T_y$ in general, and will form
the basis of construction of many estimators, it is not always itself a
very good estimator. For example, when the sampling design is not of
fixed size, the coefficients fail to compensate for the variability of the
sample size.

EXAMPLE 2.4: In SRS with replacement of three draws from a pop-
ulation of size $N = 10$, each $\pi_j = 1 - (9/10)^3 = 271/1000$. If
the units drawn happen to be distinct units 7, 2, 4, then $\hat{T}_y$ will be
$\frac{1000}{271}(y_2 + y_4 + y_7)$, as compared with the more natural expansion esti-
mator $\frac{10}{3}(y_2 + y_4 + y_7)$. If the units drawn are 4, 2, 4 then $\hat{T}_y$ will be
$\frac{1000}{271}(y_2 + y_4)$, instead of the more natural estimator $5(y_2 + y_4)$.
Even when the sample size is fixed, the HT estimator may be seri-
ously deficient, as in the following famous example.
EXAMPLE 2.5 (Basu, 1971 with permission):
The circus owner is planning to ship his 50 adult elephants and so he needs a
rough estimate of the total weight of the elephants. As weighing an elephant
is a cumbersome process, the owner wants to estimate the total weight by
weighing just one elephant. Which elephant should he weigh? So the owner
looks back on his records and discovers a list of the elephants' weights taken
3 years ago. He finds that 3 years ago Sambo the middle-sized elephant was
the average (in weight) elephant in his herd. He checks with the elephant
trainer who reassures him (the owner) that Sambo may still be considered
to be the average elephant in the herd. Therefore, the owner plans to weigh
Sambo and take 50y (where y is the present weight of Sambo) as an estimate
of the total weight $Y = Y_1 + \cdots + Y_{50}$ of the 50 elephants. But the circus
statistician is horrified when he learns of the owner's purposive sampling
plan. 'How can you get an unbiased estimate of Y this way?' protests the
statistician. So, together they work out a compromise sampling plan. With
the help of a table of random numbers they devise a plan that allots a
selection probability of 99/100 to Sambo and equal selection probabilities
of 1/4900 to each of the other 49 elephants. Naturally, Sambo is selected

and the owner is happy. 'How are you going to estimate Y?', asks the statist-
ician. 'Why? The estimate ought to be 50y of course,' says the owner. 'Oh!
No! That cannot possibly be right,' says the statistician, 'I recently read
an article in the Annals of Mathematical Statistics where it is proved that
the Horvitz-Thompson estimator is the unique hyperadmissible estimator in
the class of all generalized polynomial unbiased estimators.' 'What is the
Horvitz-Thompson estimate in this case?' asks the owner, duly impressed.
'Since the selection probability for Sambo in our plan was 99/100,' says
the statistician, 'the proper estimate of Y is 100y/99 and not 50y.' 'And,
how would you have estimated Y', inquires the incredulous owner, 'if our
sampling plan made us select, say, the big elephant Jumbo?' 'According to
what I understand of the Horvitz-Thompson estimation method,' says the
unhappy statistician, 'the proper estimate of Y would then have been 4900y,
where y is Jumbo's weight.' That is how the statistician lost his circus job
(and perhaps became a teacher of statistics!).
The reader is invited to try to resolve the statistician's difficulty. One
resolution would make use of some kind of model-assisted estimation,
to be discussed in Chapter 5.

2.3 Linear estimators for population totals


Although the HT estimator is in a sense the simplest unbiased estima-
tor of $T_y$, other linear unbiased estimators can easily be defined for a
specific sampling design. In general the coefficient of $y_j$ depends on $s$
as well as $j$, and the linear estimator takes the form
$$e = \sum_{j \in s} d_{js}\, y_j, \qquad (2.35)$$
with
$$\sum_{s:\, j \in s} p(s)\, d_{js} = 1$$
for each $j = 1, \ldots, N$ (Godambe, 1955). For real $y$ the variance of $e$
takes the form
$$\mathrm{Var}(e) = \sum_{j=1}^{N} a_j y_j^2 + \sum\sum_{j \neq k}^{N} a_{jk}\, y_j y_k, \qquad (2.36)$$
where $a_j = E\{(d_{js} I_{js} - 1)^2\}$ and $a_{jk} = E\{(d_{js} I_{js} - 1)(d_{ks} I_{ks} - 1)\}$,
and $I_{js}$ is the indicator defined in (2.14). Unbiased estimators of $\mathrm{Var}(e)$
can be constructed if all $\pi_{jk} > 0$: one such estimator is
$$v(e) = \sum_{j \in s} \frac{a_j y_j^2}{\pi_j} + \sum\sum_s \frac{a_{jk}\, y_j y_k}{\pi_{jk}}. \qquad (2.37)$$

Rao (1979) has shown that if $e$ is of form (2.35), but not necessarily
unbiased, there is a form of $\mathrm{MSE}(e)$ analogous to (2.23), provided that
for some array $\mathbf{w}$ with no zero elements
$$\sum_{j \in s} d_{js} w_j = T_w \quad\text{for all } s \text{ with } p(s) > 0. \qquad (2.38)$$
In the terminology of Section 5.7, Rao's condition means that the estim-
ator is calibrated to be correct for the array $\mathbf{w}$. In that case
$$\mathrm{MSE}(e) = -\frac{1}{2} \sum\sum_{j \neq k}^{N} a_{jk} \Big(\frac{y_j}{w_j} - \frac{y_k}{w_k}\Big)^2 w_j w_k, \qquad (2.39)$$
where $a_{jk} = E\{(d_{js} I_{js} - 1)(d_{ks} I_{ks} - 1)\}$; in case $e$ is unbiased, an
unbiased estimator of $\mathrm{Var}(e)$ which generalizes the Yates-Grundy-Sen
estimator is
$$v(e) = -\frac{1}{2} \sum\sum_s \Big(d_{js} d_{ks} - \frac{1}{\pi_{jk}}\Big) \Big(\frac{y_j}{w_j} - \frac{y_k}{w_k}\Big)^2 w_j w_k. \qquad (2.40)$$
This variance estimator has the appealing property of being zero when
$\mathbf{y} = \mathbf{w}$, and its non-negativity is relatively easy to establish or refute in
particular cases.
When $y$ is vector-valued, and the same coefficients $d_{js}$ are used in
the estimation of each component, the covariance matrix $\mathrm{Cov}(e)$ may
be estimated by
$$-\frac{1}{2} \sum\sum_s \Big(d_{js} d_{ks} - \frac{1}{\pi_{jk}}\Big) \Big(\frac{y_j}{w_j} - \frac{y_k}{w_k}\Big) \Big(\frac{y_j}{w_j} - \frac{y_k}{w_k}\Big)^{\top} w_j w_k.$$
EXAMPLE 2.6: Returning to the case of scalar $y$, suppose that the size
of the sample is to be $n = 2$, and that the sample $s = \{j, k\}$ is drawn
without replacement, with draw selection probabilities proportional to
probabilities $p_1, \ldots, p_N$, $\sum_{j=1}^{N} p_j = 1$. For example, the probability
of obtaining $s = \{1, 3\}$ would be $p_1 p_3/(1 - p_1) + p_3 p_1/(1 - p_3)$. The
estimator of $T_y$ proposed by Murthy (1957) is
$$e = \frac{1}{2 - p_j - p_k} \Big(\frac{y_j}{p_j}(1 - p_k) + \frac{y_k}{p_k}(1 - p_j)\Big) \qquad (2.41)$$
when $s = \{j, k\}$. This estimator is not the HT estimator, but is of the
form (2.35), and satisfies (2.38) with $w_j = p_j$. It is unbiased, and the
estimator (2.40) of its variance is
$$v(e) = \frac{(1 - p_j)(1 - p_k)(1 - p_j - p_k)}{(2 - p_j - p_k)^2} \Big(\frac{y_j}{p_j} - \frac{y_k}{p_k}\Big)^2. \qquad (2.42)$$
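Murthy's estimator can likewise be checked by exact enumeration. A self-contained sketch, with invented draw probabilities for $N = 3$, $n = 2$, under the successive-draw design described above:

    from fractions import Fraction
    from itertools import combinations

    y = {1: Fraction(3), 2: Fraction(5), 3: Fraction(1)}           # arbitrary array
    p = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}  # draw probabilities

    def p_s(j, k):
        # probability that two successive draws without replacement
        # yield the set {j, k}
        return p[j] * p[k] / (1 - p[j]) + p[k] * p[j] / (1 - p[k])

    def murthy(j, k):
        # Murthy's estimator (2.41) for the sample {j, k}
        return ((y[j] / p[j]) * (1 - p[k]) + (y[k] / p[k]) * (1 - p[j])) \
               / (2 - p[j] - p[k])

    E_e = sum(p_s(j, k) * murthy(j, k) for j, k in combinations(y, 2))
    assert E_e == sum(y.values())   # design-unbiasedness of (2.41)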

2.4 Sampling strategies and local relative efficiency

In finite population sampling, both sampling design and estimator may
be under the control of the statistician. Thus an estimator-design pair
$(e, p)$ is often referred to as a strategy, and the strategy is called unbi-
ased for estimating a population function $\theta(\mathbf{y})$, such as $T_y$ or $\mu_y$, if $e$
is unbiased for $\theta(\mathbf{y})$ under $p$, i.e. if
$$E(e) = \sum_s p(s)\, e(\chi_s) \equiv \theta(\mathbf{y}) \qquad (2.43)$$
for all possible population arrays $\mathbf{y}$. Here as in (2.5) $\chi_s$ stands for
the sample data. In comparison of strategies mean squared errors are
usually used: thus, $(e_1, p_1)$ may be considered to be more efficient than
$(e_2, p_2)$ at $\mathbf{y}$ (for $\theta(\mathbf{y})$ real) if
$$E_{p_1}(e_1 - \theta(\mathbf{y}))^2 < E_{p_2}(e_2 - \theta(\mathbf{y}))^2. \qquad (2.44)$$
If $(e_1, p_1)$ and $(e_2, p_2)$ are unbiased, this condition becomes
$$\mathrm{Var}_{p_1}(e_1) < \mathrm{Var}_{p_2}(e_2). \qquad (2.45)$$
This concept of local (at $\mathbf{y}$) relative efficiency in terms of design
mean squared error is usually not sufficient by itself for deciding among
strategies. Almost any fixed size design strategy will perform well for
some array $\mathbf{y}$. The following example shows that even under SRS with-
out replacement the 'natural' estimator $\bar{y}_s$ of $\mu_y$ can be less efficient
for some arrays $\mathbf{y}$ than one which is much less appealing.

EXAMPLE 2.7: Let N = 3, and consider the estimator given by )is -


YI/3 if s = {I, 2},)is + yt!3 if s = {I, 3}, and)is if s = {2, 3}. Under
SRS without replacement, n = 2 draws, this estimator is easily seen to
be unbiased for the population mean. The variance of the estimator is

~3 [(YI6 + Y2)2 + (5YI + Y3)2 + (Y2 + Y3)2] _ fL2


2 6 2 2 2 y'

If in particular y = (3,5, 1), this variance is zero, and for y close to


(3,5, 1), it will be less than the variance of )is. Thus this new unbiased
22 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

estimator can be more efficient for some y than the sample mean, and
is in fact exact for a particular y. Note that the estimator is of form
LjES djsYj, and can be formed only because the labels of the sample
Y values are available, not just the sample y-values themselves.
In fact it can be shown that, for a given population function fJ, unless
the design is a census it is generally not possible to choose e to minimize
E(e - fJ(y»2, or unbiased e to minimize Var(e), simultaneously for all
y in RN. The first such result was proved by Godambe (1955), and
extensions or refinements of it are surveyed by Cassel et al. (1977).
To make decisions among strategies in practice requires additional
information, not necessarily very formal or precise, about what arrays
y are possible or likely for the population at hand. Such comparisons
are discussed, for example, by Kendall, Stuart and Ord (1983, §39).
Consider again the example of stratified random sampling. With the
notation introduced earlier it is easy to see that in formula (2.23)
rl.jk = nh(Nh - nh)/Nt(Nh - I) > 0 if j. k E Sh.
= 0 if j. k are in different strata.
Then (2.23) becomes

var(~ NhYh) = ~ ~ nIh (1 - ~:) N~~ 1 L L[hl(yj - Yk)2.


(2.46)
where L L[h] represents the sum over j in Sh and kin Sh, j =ft k. Now
suppose that the population is not yet stratified, but that a stratification
appears in order, and that we want to make (2.46) small for efficient
estimation of Ty. It then follows that we should use whatever knowledge
we have to make the strata as homogeneous as possible in the Y values,
so that large values of (yj - Yk)2 in the population will have j. k in
different strata and rl.jk = O. For j. k for which rl.jk > 0 because j
and k are in the same stratum, (yj - Yk)2 will tend to be smaller.
In social surveys stratification by residence area, sex, age group,
income group etc. is used to try to divide the populations into homo-
geneous parts. Populations of establishments may be stratified by mea-
sures of establishment size, or by previous values of the variate of
interest.
A more convenient form for (2.46) is

' " NhYh)


Var(L....
h
_ = 'L....
" -Nt
h nh
( 1 - -N
nh) Sh'
h
2 (2.47)

where St = LjESh(Yj - Ji.h)2/(Nh - 1) and Ji.h = (LjEShYj)/Nh


SAMPLING STRATEGIES AND LOCAL RELATIVE EFFICIENCY 23

are respectively the variance and mean of y in stratum Sh. This is


obtainable directly from the analogous form (2.30) for SRS without
replacement. Correspondingly, for the mean estimator Yst = Lh WhYh,
Wh = NhIN, we have

_ = "~ -W~ ( 1 - -nh) Sh'


Var(.Yst) 2 (2.48)
h nh Nh
For another example, suppose the population is divided into equal-
sized clusters B\, ... , BL, from which I clusters are to be selected by
SRS without replacement. Suppose that once a cluster Br is selected,
every unit j in Br is observed. Then '!fj = I I L for every unit j. If
j, k E Br , then '!fjk = '!fj = '!fk = II L, so that
njk = (lIL)2 - (IlL),
which is negative; if j E Br and k E Bq , q I:- r, then '!fjk = l(l -
1)1 L(L - 1), so that
njk = I(L - 1)IL2(L - 1),
a positive quantity. Thus the best way to divide the population into
clusters is to make the within-cluster variation of Yj as large as poss-
ible. This will give negative coefficients to as many as possible of
the large (yj - yd in the formula (2.23) for the variance of the HT
estimator of Ty. Another way of putting this is to say that if cluster
sampling is to be efficient, we should try to see that each cluster is as
nearly representative as possible of the population.
If clusters are areas composed of geographically contiguous units,
and Y tends to vary slowly with location, the aim of large within-
cluster variation will not be achieved, and each cluster will tend to be
representative of its own location. Nevertheless, in practice area cluster
sampling has been used frequently, since the losses in efficiency can
be compensated for by savings in travel and administrative costs.
Mathematically, systematic sampling is a special case of cluster sam-
pling, with one cluster of form U\, iJ + K, ..• ,j\ + (n - I)K} being
drawn at random. Because the sampled labels are evenly spaced, if Yj
varies slowly with the label, the sample will tend to be highly rep-
resentative of the population with respect to y values, and systematic
sampling will be highly efficient at y.
A convenient standard for the efficiency of a strategy (e, p) for esti-
mating Ty is Var(NYs) under SRS without replacement, with the number
of draws being the same as the expected sample size under p. Exten-
sions of the preceding discussion lead to the following generalizations.
24 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

(i) Stratified random sampling (with the HT estimator NYst) is more


efficient than SRS (with Nys) if the strata are more homogeneous
in y than is the population; stratification by variates related to y
increases efficiency.
(ii) Cluster sampling (with the HT estimator) is less efficient than SRS
(with Nys) if the clusters are more homogeneous in y than is the
population.
(iii) Systematic sampling with sampling interval K is less efficient than
SRS if the y j vary periodically with period K, so that within-sample
variation is small; systematic sampling is more efficient than SRS
if the y j values tend to be similar for units which are close to each
other.
Decisions on complex design strategies generally involve a weighing
of efficiency of estimation with cost considerations. For example, the
decision to carry out area cluster sampling as mentioned above comes
from a judgement based on travel and administrative costs: it is judged
that estimation with a more dispersed sample will be less efficient than
estimation with a cluster sample of the same overall cost, because the
cluster sample size will be larger by an amount sufficient to compensate
for the relative homogeneity of clusters.
Perhaps the best-known examples of integrating local relative effi-
ciency and cost considerations are seen in the optimal allocation rules
for stratified random sampling.
EXAMPLE 2.8: Suppose that the object is to estimate a scalar mean J.Ly
for a stratified population. A stratified random sample is to be used, with
estimator Yst. There is a total budget C for sampling and measurement.
The cost per unit sampled in stratum Sh is Ch, independent of the size
of nh. The problem is to determine the overall sample size n, and the
allocation fractions nh/n, h = 1, ... , H, so as to minimize Var{Yst)
subject to the total cost not exceeding C. That is, we need to minimize
H w:2S2 N w:2S2
Var{Yst} = L ~ - L ~ (2.49)
h=1 nh h=1 Nh
from (2.48) subject to the constraint
H
C= Lnhch. (2.50)
h=1
It is not difficult to show, using a Lagrange multiplier or the Cauchy-
FINITE POPULATION CUMULANTS 25

Schwarz inequality, that this would be achieved by setting


nh WhSh
- ex:-- (2.51 )
n ,.jCh ,
or equivalently

(2.52)

Substituting the resulting expression for nh into (2.50) allows solution


for the overall sample size n. Note that we are optimizing over real
values of {nh} and n; in actual surveys it is necessary to take the nh
and n values to be integers near their optimal values.
The solution to the optimal allocation problem in Example 2.8 de-
pends on the y array at hand, through the relative values of the stratum
variances S~. In fact, the larger the variability as measured by Sh, the
larger proportionally is the allocation for stratum Sh. Thus for gains
in efficiency we need some way of judging at the very least which of
the strata have more variability in y than the others. Sometimes it is
possible to guess at relative values of Sh from covariate information
or from previous data on the same population. Sometimes information
on the range of possible y values in Sh can be used to make guesses
about Sh.
Examples of allocation determination can be found in most sampling
textbooks. A particularly interesting example of an application in pre-
information-age accounting is described by Neter (1972).
Approximate optimal allocation is most useful for single-purpose
surveys, especially where we have stratification covariates which are
closely related to the y values. For multi-purpose surveys, the allocation
may need to be a compromise among those which are optimal for
the various purposes singly. If the purposes are to estimate overall
proportions, a useful compromise is often to take all the Sh to be equal
in (2.51) or (2.52), resulting in proportional allocation if the Ch are also
constant.

2.5 Finite population cumulants


2.5.1 Cumulants and K statistics

This section deals with the estimation of population moments and cu-
mulants for a real variate y, material which will be used further in Chap-
ters 3 and 4 in discussions of Edgeworth expansions and the bootstrap.
26 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

A detailed treatment of cumulants and their generalizations is given by


Stuart and Ord (1987) and Kendall, Stuart and Ord (1983).
Suppose Y is a real random variable with all moments finite and
moment generating function My(s) = eexp(sY) defined on an open
= e
interval containing s O. Here the symbol denotes expectation with
respect to the distribution of Y, to distinguish it from the sampling
design expectation E used elsewhere in this chapter. The cumulant
generating function
Ky(s) = log My(s)
defines the cumulants of the distribution of Y via the expansion

Ky(s) = I>p,'
00

p=1
sP

p.
It is easily seen that
KI =ey. (2.53)
K2 = Var(Y) = f.L2. (2.54)
K3 = f.L3. (2.55)
K4 = f.L4 - 3f.L~ (2.56)
and, in general, for P > 1,

Kp = p! ~(-l)P-I(p _
f=t
1)!L2 (f.LPI
PI!
)rl ... (f.L ps
Pst
__·rs!_
)r. rl!"
(2.57)
In these expressions f.Lp is the pth central moment of Y (f.L1 being 0),
and the second sum L2 in (2.57) extends over all positive integers
PI < ... < Ps and all positive integers rl •...• rs such that
s s
LP;r; =p. L r; =p. (2.58)
;=1 ;=1

Moreover, for P > 1, the moments in formula (2.57) can all be replaced
by the non-central moments f.L~ = e yk .
If YI •..•• YN are finite population Y values which are regarded as
independent observations on a random variable Y, there are at least two
natural ways of defining finite population versions of the cumulants.
One, which might be especially appropriate when with-replacement
sampling is contemplated, would be to define the pth cumu1ant as the
coefficient of sP / p! in the expansion of
N
log(L eSYi / N).
j=1
FINITE POPULATION CUMULANTS 27

the logarithm of the finite population moment generating function.


The other, which seems to be appropriate in the context of without-
replacement sampling, is to define the pth cumulant as the finite pop-
ulation symmetric polynomial K p whose expectation £ K p is K p' The
pth population' K statistic' has the form

where L3 is the sum over all vectors (z" ... ,zp) whose components
are distinct coordinates of (y" ... ,YN)' Note that the number of terms
in L3 is N to p factors, or N(P), given by N(N - 1)··· (N - p + 1).
In fact, K p also has the following expression:

(2.59)

where there is a term in the second sum corresponding to every way of


assigning the subscripts so that the total number of distinct subscripts
used is p. It is easy to see that K, is the finite population mean Jly.
For p > 1, the Y values in (2.5.7) can be corrected by subtracting the
mean Jly or not, as desired. The second, third, and fourth K statistics
are

K2 =

= (2.60)

N
= NL(yj - Jly)3/(N - 1)(N - 2) (2.61)
j='
28 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS
N
[N(N + I) L{Yj - f-Ly)4 - 3(N - I)
j=l
N
x {L{yj - f-Ly)2}2]/(N - 1)(3). (2.62)
j=l

Note that, like the corresponding standardized cumulants of Y, K3/ Ki l2


and K4/ KJ. can be taken as measures of population skewness and (ex-
cess in) kurtosis, respectively.
If the sample from the finite population has n units, n > p, the
sample version of K p is

(2.63)

where the second summation has terms as in (2.59), but only Y values
with labels in the sample are used. It is easy to see that under SRS
without replacement, n draws, Ekp = K p, for the joint inclusion prob-
ability of any set of p distinct units is n (p) / N(p). (Recall that E denotes
expectation with respect to the sampling design.)
If the sampling design is not necessarily SRS without replacement, an
extension of the arguments involving indicators in Section 2.2 can give
an unbiased estimate kp of Kp. For example, from (2.60) an unbiased
estimator of K2 = S;
might be
2
k2 = ~L Yj - 1 L L YjYk. (2.64)
N jES lrj N(N - 1) S lrjk

For some designs this k2' unlike K 2, can actually be negative. Take
N = 3 and let p({I, 2}) = 0.9, p({I, 3}) = p({2, 3}) = 0.05. Then for
s = {I, 3} and Yl = Y3 = 1, k2 = -2.98 < O.
On the other hand, it has been seen in Section 2.2 that K2 has an
alternative, more compact, symmetric form
INN
K2 = 2N(N _ 1) ~# ~{yj - Yk)2, (2.65)

which yields the unbiased estimator

k_ 1 {Yj - Yk)2
(2.66)
2 - 2N(N _I)LLs lrjk
FINITE POPULATION CUMULANTS 29
which is clearly non-negative, and 0 when the sample Y values are all
equal.
There are also compact alternative expressions for K3 and K 4:

K3 =
4
:3'
1
N(N _ 1)(N _ 2) ~ ~~
N N N ( +)3
Yi - Yj 2 Yk
liN N N N (2.67)
K4 = 4-N(4) L L L L[(Yi + Yj - Yk - YI)4
i#j# k# I
-12(yi - Yj)2(yk _ YI)2].
These expressions follow readily from a general formula of Good
(1977), namely that

Kr = r_1_"
N(r) ~
... "(y.
~ JI
+ w J2 + w2y.J3 + ... + wr-'y.)r
tJ •
of J,'
(268)

where w is the rth root of unity e2tri / r , and the r-fold sum is taken over
all sequences of r distinct subscripts between I and N. For example,

K4 = Re {4:(4) L L L L(Yi + iYj - Yk - iYI )4}


= 4:(4) LLLL[(Yi + Yj - Yk - YI)4
-4(yi + Yj - Yk _ YI)3(Yj - YI)
+8(yi + Yj - Yk - y/)(Yj - y/)3 - 4(yj - y/)4]. (2.69)
Expanding all but the first term of the right-hand side of (2.69) in terms
of powers of differences like (Yi - Yj), and performing the summations,
leads to the result of (2.67).

2.5.2 Cumulants of the sample sum in SRS


If Y" ... , Yn are independent and identically distributed with the dis-
tribution of Y, then the cumulants of their sum are n times the corres-
ponding cumulants of Y. The relationship between the finite population
cumulants and the cumulants of the sample sum

as = LYj
jES

under SRS without replacement is more complex. We have already seen


that
30 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

and
Var(as) = n (1 - ~) K2.
It is also not difficult to show that

E(as - n/Ly)3 = n (1 - ~) (1 - ~) K3. (2.70)

Thus each of the first three cumulants of the sample sum is n times a
function of n / N times the corresponding population K statistic. This is
not quite true for the fourth-order cumulant, but it is possible to show
that an approximation to the fourth cumulant of as has a similar form:
N -I
E(as - n/Ly) - 3 N
4
+ 1 (Var(as »2

=n (1 - ~) (1 - N: 1 6 ~ (1 - ~)) K 4• (2.71)

For this quantity when N is large the dependence of the right-hand side
on N is mainly through n / N.
The exact fourth cumulant of as is E (as - n/Ly)4 - 3(Var(as given »2,
by

n (I - ~) (I - N: I6~ (I - ~)) K4 - 1 I- ~)
N: n2 (
2
Ki·
(2.72)
A relatively easy way of verifying (2.71) is based on the fact that any
symmetric fourth-degree polynomial in Yt, ... ,YN which is invariant
when all the Yj are increased by any amount c is a combination of
N N
A= L L(yj - Yk)4/N(2) (2.73)
j i- k

and
N N N N
B = LLLL(Yi - Yj)2(yk - YI)2/N(4). (2.74)
ii-ji-k#
For example, K4 = (A - 3B)/2. It can be shown that

E(as
_
nIL) -
4 _ n(N - n) [(N - n)3
N4 2 +
n3
2
)A
+(~n(N - n)N(N + 6) - ~N3)B] (2.75)
MULTI-STAGE SAMPLING 31

and
2 n2(N-n)2[NA N(4)B]
(Var(as )) = N4 2" + 4(N - 1)2 • (2.76)

and that (2.71) follows.

2.5.3 Variance of the sample variance in SRS


Since the variance of the sample sum as is a constant times S;, the
expression in (2.76) for the square ofVar(as ) can be used to show that

y
1(1+--
.).:=-+-
n4 K4
N
2) 4 N-l
B. (2.77)

Since s~ is the sample analogue of S:. it can then be seen immediately


that

Var(s 2 ) = Es 4 -.).:
y Y
n4 = K4
Y
(1- - -1) +-B(-1- - -1)
n N 2
- .
n-l N-I
(2.78)

2.6 Multi-stage sampling


In many surveys the sampling is conducted in stages. The elementary
units of the population are grouped to begin with into first-stage units
or primary sampling units (PSUs). For example, the households of a
city might be grouped into city blocks of households. At the first stage
of sampling, a sample of PSUs is taken; and subsequently elementary
units are sampled from within the selected PSUs according to some
scheme, which may itself be conducted in stages. In rural areas of
Canada the Canadian Labour Force Survey at one time used four stages
in its sampling of households.
Sampling in stages generally results in samples which are geograph-
ically clustered to some extent. This makes estimation of means and
totals less efficient than for dispersed samples of the same size. How-
ever, for household surveys requiring strict control of survey error,
savings in time and travel costs can be appreciable. In particular, if
sampling takes places in stages, the sampling frame need only be con-
structed one stage at a time, within selected units. The PSUs are listed
first, followed by second-stage units within the selected PSUs, and so
on. Thus the method is practically important, though it adds a great
deal of complexity to the proper analysis of survey results.
One way of setting up notation for discussion of the mathematics of
multi-stage sampling is as follows. Let the population U = {I, ...• N}
32 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

be partitioned into PSUs 131, ••• ,13£ with sizes M 1 , ••• , M£. Then
L;=1 Mr = N.
Assume that at the first stage a sample s B of PSU labels is taken.
Then, independently for each r E SB, a sample Sr of m(sr) elementary
units is selected from 13r according to some scheme. Using this notation,
the total sample is
S = Us"
resB
and n(s) = LresB m(sr)' We shall assume that the scheme for sampling
within a selected PSU 13r is not dependent on the composition of the
rest of SB.
The first-stage inclusion probabilities will be defined by
TIr = P(r E SB),

where P denotes probability; and for j in 13r the conditional inclusion


probability, i.e. the probability that j is in Sr given that r is in SB, will
be denoted by rejlr' Then the unconditional inclusion probability rej can
be computed as
rej = TIrrejlr for j E 13r .
The HT estimator of Ty is
,,1;.
n'
A

Ty = L..,
(2.79)
rESB r

with

being an unbiased estimator of the PSU total Tr = LjeB, Yj·


For example, suppose a fixed number I of PSUs are selected at the
first stage. Then by (2.15),

(2.80)

If the TIr are chosen proportional to Mr (in 'inclusion probability pro-


portional to size' or reps sampling, as described in Section 2.8), then
from (2.80)

(2.81)
r=1
Suppose further that for r E SB, the subsample Sr is chosen by SRS
ESTIMATION IN MULTI-STAGE SAMPLING 33

without replacement, mr draws, from Br . Then for j E Br , the condi-


tional inclusion probability is
'lrjl,. = mr/Mr,
and the unconditional inclusion probability is
'lrj = TIr'lrjlr = Im,./ N. (2.82)
It follows easily that the HT estimator for 1'y under this two-stage design
is Ny, where
-y= T
I~Yn
"
rEss

the mean of the subsample means of y over the subsamples Sr. In the
special case that mr = m for all r (a feature often incorporated in
practice), clearly 'lrj = 1m / N for all j, and the design is self-weighting
and of fixed size (1m) in the elementary units; the mean of subsample
y
means is just the overall sample mean of y.
Note that equation (2.82) will apply not only when SRS is used at
the second stage but also whenever the Sr are selected within PSUs by
designs which are self-weighting and of fixed sizes.

2.7 Estimation in multi-stage sampling


More generally, the estimation of 1'y in multi-stage sampling usually
begins with its expression as

r=1

where Tr is the total of y over the rth PSU Br . When the first-stage
sample is taken without replacement, a typical estimator of Ty is of the
form
(2.83)
rESS

where tr is a function of the Yi> j E s,., and an unbiased estimator of


Tr under the sampling design within Br . The coefficients dr (s B) are
chosen so that for any ~1, .. " gL,
Ldr(SB)gr
rESs

is an unbiased or nearly unbiased estimator of r,;=1


gr with respect
to the first stage of sampling; for simplicity we will assume exact
unbiasedness in the discussion which follows.
34 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

The usual formula for computing expectations by conditioning is


E(e) = E\ (E2(elsB)).
where E2( ISB) denotes expectation with respect to the second-stage
design, conditional on S B being chosen at the first stage of sampling,
and E \ denotes expectation with respect to the first-stage design (s B
varying). Since

clearly
L
E(e) = E\ ( L d,(SB)T,.) = L T,. = Ty.
,eSe ,=\
and e is an unbiased estimator of Ty.
The analogous formula for computing variances by conditioning can
be written
Var(e) = E\ (Var2(elsB)) + Var\ (E2(elsB)) (2.84)
with Var\ and Var2 being defined in analogy with E\ and E2. For an
estimator e of the form L,ese d,(SB)/" this becomes

Var(e) = E\(Ld;(sB)Var2(t,lsB» + Varl(Ld,(SB)T,.). (2.85)


'ese ress
For example, suppose that the design at the first stage of sampling
chooses a fixed number I ofPSUs by rrps sampling. and that ifr E SB,
sampling takes place within B, by SRS without replacement, m, draws.
Then the HT estimator of Ty is Ny, which is of the form considered
above with t, = M, y" and dr (s B) = 1/ TI, = N / 1M,. It is not difficult
to see that in this case (2.84) gives

-
Var(NY)
N ~ TI, (
= 2"
2
L..,- 1- -
m,) S, 2
I ,=\ m, M,

+-2 '"
1 L "'(TI
L TI (
T. T. )2
L.., L.., 'q
- TI rq ) -
TI ' -....!!....
TI • (2.86)
'''' q 'q
adapting formula (2.23) to the notation for the first stage of sampling.
For the problem of variance estimation in multi-stage sampling it
is useful to consider a sort of backward decomposition of expectation
in which conditioning is done not on the first-stage sample but on
the subsequent subsampling. That is, imagine implementing the design
backwards, by first selecting s, from every B" and then picking SB, so
that only the Yj for j E S,' r E SB, are actually kept in the final data.
ESTIMATION IN MULTI-STAGE SAMPLING 35

The formula for computing E(e) can be written


E(e) = E 2(E I (els r , r = 1, ... , L», (2.87)
with the obvious definitions for the expectation symbols. If e is given
by LrEss dr(SB)tr as in (2.83), then
L
EI(els" r = 1, ... , L) = I)r, (2.88)
r=1

and in (2.87) E(e) = 'L;=I E2tr = 'L;=I Tr = Ty.


Similarly, we have a formula for the variance:
Var(e) = E2 Varl (elsr , r = 1, ... , L) + Var2EI (els" r = 1, ... , L).
(2.89)
This form is less intuitive than (2.84), but it yields a derivation of
variance estimates more easily. First of all, Varl (els" r = 1, ... , L) is
a function of tl, ... , tL. For example, if dr(SB) = 1/n r and the nr are
first-stage inclusion probabilities for a fixed size (I) first-stage design,
then from (2.23)

(
Varl (els" r = 1, ... , L) = -I L L
L Lcnrnq - n rq )
tr
- - -
tq )2
2 r '#- q nr nq

Thus from (2.89), it follows that for this choice of dr (s B)

1 L L ( tr
Var(e) = £2 ( "2 LL(nrnq - n rq ) TI
tq )
- IT 2) + L L
V2."
r,#-q r q r=1
where V2,r is the variance of tr with respect to stages of sampling after
the first.
Returning to the general case and (2.89), it is clear that an unbiased
estimate of Var(e) can be formed as follows:
vee) = VI +L d;(SB)V2.r (2.90)
rEss
where V2,r is an unbiased estimate of V2,r from the y values in s" d;(SB)
satisfies the same unbiasedness condition as dr (s B), and VI is an unbi-
ased estimate with respect to the first-stage design of Varl (elsr , r =
1, ... , L). This principle for forming variance estimates was given by
Rao (1975), having been put forward in a less general context by Durbin
(1953).
If as above dr (s B) = I I n" the first-stage design is of fixed size (I),
and each Sr is selected by SRS without replacement, mr draws, from
36 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

Br , then the HT estimator is given by


iy = L tr / TIr, tr = Mr Yr,
rEss
and v(i;,) is

~L Ls s (TIrTIq - TIrq) (~ _ .!L)2 + L v2,r, (2.91)


2 TIrq TIr TIq rEss TIr
where

Suppose in addition that TIr = I Mr / N. Then iy = NY, and an unbiased


estimate of its variance can be computed from

N 2" "
2 L... L...ss
(~)2
I
TIrTIq - TIrq (ji __
TI r Yq
)2 + "L.J V2,rTI . (2.92)
rq rEss r
A simple approximation to (2.91) can be obtained in the case where
I is small compared to L, the total number of PSUs, and sampling at
the first stage is approximately with replacement with constant selection
probabilities TIr / T. In this case TIrq is approximately TIr TIq (l - 1) / I
and
(2.93)
Also, if the TIr are uniformly small and the Mr are bounded, the first
term in (2.91) will predominate, and we have the variance estimate

All ( tr tq
2' . (1- 1) L Lss TIr - TIq
v(Ty) ~
)2 ,tr = Mryro (2.94)

which when TIr <X Mr reduces to

v(Ny) ~ N2 U' 2I(1l_l) LLss<Yr - yq)2]; (2.95)

it is as though y were a sample mean of a SRS with replacement of I


Yrs. We will discuss further the conditions for these simple (and usually
conservative) approximations in the next section, and use them to justify
various techniques for variance and mean squared error estimation in
Section 4.2. By way of abbreviation, we will refer to first-stage fixed
size (I) designs satisfying (2.93), or single-stage fixed size (n) designs
satisfying
(2.96)
IMPLEMENTATION OF UNEQUAL PROBABILITY DESIGNS 37

as approximately with-replacement designs, meaning approximately


with replacement with constant selection probabilities.

2.8 Implementation of unequal probability designs


We first consider the problem of implementing fixed size designs with
desired inclusion probabilities. This is a problem of some practical
importance. In surveys conducted in stages, it is often desirable to
separate the population PSUs into strata, and then to make the sampling
design self-weighting and of fixed size within strata. The more refined
the stratification, the more efficient the estimation of totals is likely to
be, and the smaller the number of PSUs which will be sampled within
strata. It was pointed out in Section 2.6 that the first step in a fixed
size self-weighting design involves 7r ps sampling, or choosing a fixed
number of PSUs with first-stage inclusion probabilities proportional to
the PSU sizes. Thus it is of interest to consider schemes for selecting
I PSUs such that flr = IMr/ N, or more generally flr = lar where
a r are specified positive numbers with 'L,;=l a r = 1. (Note that each
Mr/ N or a r must also be ::::: 1/1.) The case of I = 2 is of special
interest, since it is the least number of PSUs which can be taken from
a stratum, compatible with the requirement that all flrq > O.
Many methods of choosing a first-stage sample with flr = lar have
been proposed in the literature. Comprehensive treatments of these are
given by Brewer and Hanif(1983) and Chaudhuri and Vos (1988). The
few methods described here are chosen for historical or theoretical as
well as practical interest.
For small values of I, particularly I = 2, there is a class of designs
which mimics SRS without replacement, except that the selection prob-
abilities for a given draw are not the same for all units. Two examples
follow.

Selection probabilities constant (successive sampling)


Let PI, P2, ... ,PL be 'selection probabilities' with Pr ::: 0 and 'L,;=l Pr
= 1. Draw I times with these probabilities, without replacement. For
example, if I = 2, draw r at the first draw with probability Pr, and
q at the second draw with probability pq/(l - Pr). This scheme is
sometimes called successive sampling. Hajek (1981, Chapter 9) has
given a thorough discussion of its asymptotic theory.
The inclusion probabilities can be expressed in terms of the selection
probabilities. In the case I = 2, the expression for flr follows from the
38 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

fact that r is included if it is drawn on the first draw, or on the second


draw not having been drawn on the first. Thus

TIr = Pr + LPqPr/O - pq) = PrO + LPq/O- pq)).


qfr qfr

Also, is it easy to see that

TIrq = Prpq/(l - Pr) + pqPr/{l - pq)

= Pr Pq C~ Pr + 1 ~ Pq ) .
To find {Pr} so that TIr = 2ar exactly, an iterative computation may
be used. For example, let

p~ = ar, A~ = 1 + LP~/O - p~),

P; = 2ar/A~, A; = 1 + LP~/O- p~),


and so on. If this procedure converges, the limiting values of the selec-
tion probabilities Pr will produce the required inclusion probabilities.
This approach to implementing rrps sampling was proposed by Narain
(1951).
Whatever the choice of {Pr}, the successive sampling design yields
an alternative unbiased estimator for the population total, namely the
one corresponding to Murthy's estimator in Example 2.6. See also Ex-
ercise 2.6.

Fellegi's method (Fellegi, 1963)


This method is similar to the previous one, but the selection prob-
abilities are different for the second draw, and are chosen in such a way
that the marginal selection probabilities are the same on both draws.
This fact would make the method appropriate for rotating samples, as
for example in a labour force survey in which a fixed proportion of the
sample is replaced each month.
The procedure is to draw r at the first draw with probability pr = an
and then q (without replacement) at the second draw with probability
Pq / (1 - Pr), where PI, ... , PL is another set of selection probabilities
chosen so that
LarPq/{l - Pr) = aq
ri'q
for each q. Subsequent draws are handled analogously. In the case of
a truly rotating sample with I = 2, the sample would always consist
IMPLEMENTATION OF UNEQUAL PROBABILITY DESIGNS 39

of the units drawn in the current and immediately preceding draw, and
the selection probabilities Pr would be used at each draw after the first
one. Units 'rotated out' would become eligible to be drawn again.
Besides using successive draws without replacement, there are other
ways to implement an SRS of fixed size. One is to select units with
replacement, and then to reject the sample if there are duplications.
This same notion can be extended to unequal probability sampling.
The following is one kind of unequal probability rejective sampling.

Sampford's method (Sampford, 1967)


First draw r with probability a r ; in the subsequent I - I draws, carried
out with replacement, use selection probabilities f3r = Kar/(l -Iar)'
where K is a normalizing constant; if there are any duplicates in the
sample, begin again. Clearly, K is given by 1/ Lq[aq/(l-Iaq)]. The
inclusion probability flr can be shown to be equal to I a r •
The probabilities required for Sampford's method are easy to com-
pute, but as I increases, the probability of having to reject a sequence
of draws because of duplications will naturally increase also. Some
methods of 7r ps sampling which avoid this problem and are easy to
implement for general I are based on extensions of systematic sampling
rather than SRS.

Ordered systematic procedure (Madow, 1949)


This method divides the unit interval (0, I] into subintervals of lengths
al, ... , aL respectively. A systematic sample of I points is then drawn
from the unit interval. The PSUs sampled will be those corresponding
to the subintervals in which the systematic sample points fall.
Formally, choose a point v at random from the interval (0, 1/ l]. Put
r in the set s B of first-stage sample labels if

L:aj < v+ I::: L:aj


r-I ~ r
for ~ = 0, 1, ... or I.
j=O j=O

This method gives a fixed size sample with the desired inclusion prob-
abilities provided at most one point of the systematic sample can fall
in any of the subintervals. Thus it works with the proviso mentioned
earlier, that none of the a r is greater than 1/1.
Essentially this method is sometimes used in monetary unit sampling,
where instead of PSUs being sampled with probabilities proportional
40 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

to size, we have records being sampled with probabilities proportional


to their amounts in dollars or other units. See Section 3.12.
One drawback of the ordered systematic procedure is the relatively
small number of possible samples. It is easy to see that in general some
of the joint inclusion probabilities nrq will be 0, and as in the case of
ordinary systematic sampling this implies that there will be no design
unbiased estimator for the variance of the HT estimator. This mayor
may not be seen as a problem, but in any case the design is much 'less
random' than those which mimic SRS. The following method removes
this difficulty.

Random systematic procedure (Goodman and Kish, 1950; Hartley and


Rao, 1962)

This method is the same as the previous one, except that the order
of the subintervals determined by the ar is first rearranged at random,
before the systematic sample is taken. If all the ar happen to be equal,
the procedure is equivalent to an SRS of the PSUs.
When the HT estimator is used with any of the first-stage designs
above (and SRS at the second stage), an unbiased estimator of its
variance is given by (2.92), which necessitates knowledge of the joint
first-stage inclusion probabilities nrq • These are not difficult to work
out and compute for the first three designs, at least when I is sufficiently
small. For the random systematic procedure, exact computation of nrq
is possible but complicated: see Connor (1966). Hartley and Rao (1962)
have given an approximate formula for nrq for use when L, the total
number of PSUs, is large.

Elimination procedures (nile, 1996)


These methods produce a sample of the desired size by successive elim-
ination of population units, with probabilities of selection (for elimina-
tion) redefined at each step.
For all the methods described above except the ordered systematic
procedure, it can be shown that for large L and PSUs of comparable
sizes, nrq should be approximately nr nq (l - 1) / I, and hence that
the first-stage design is an approximately with-replacement design. For
these designs the simple variance estimator given by (2.94) (or (2.95)
if nr ()( Mr) will be approximately unbiased.
Alternatively, there are ways in which simpler unbiased estimation
IMPLEMENTATION OF UNEQUAL PROBABILITY DESIGNS 41

procedures can be devised with the use of other unequal probability


sampling procedures at the first stage. Two examples follow.

Sampling with replacement


Successive PSUs are drawn with replacement with selection prob-
abilities Pr at each draw. An unbiased estimator of the population total
is of the form
I

(L./r(J)/ Pr(J»/ I
j=l
where rU) is the label of the PSU obtained at the Jth draw, and tr(J) is
unbiased with respect to the second-stage design for the corresponding
PSU total 1',.. (The second stage sampling scheme is repeated independ-
ently in PSUs which are drawn more than once.) An exactly unbiased
estimator of the variance of this is

~
I
. Itt
21(l - 1) j=l k=!
(tr(J) _ tr(k)
Pr(J) Pr(k)
)2, (2.97)

j#
which strongly resembles (2.94) and actually reduces to (2.95) when tr
is Mr Yr and no PSUs are duplicated in the sample. Choosing Pr = a r
gives estimators which coincide with the HT estimator with TIr = lar
when I distinct units are drawn.

Ra(T-Hartley-Cochran procedure (Rao et al. 1962)


In this procedure the PSUs are separated randomly into I groups, and
Pj denotes LrEjth group ar. Then one PSU is selected (independently)
from each of the I groups, selection probabilities within the Jth group
being given by a r / Pj .
The usual practice for estimating the total is to use the 'conditional
HT estimator' e = Lj Pjtr(J)/ar(J), where rU) is the label of the PSU
selected from the Jth group.
The groups need not be of equal size, but if they are, an approx-
imately unbiased estimator of the variance of e can be shown to be

vee) = _1_ (1 _~) t


I- 1 L j=l
Pj [tr(J) _
ar(J)
e]2 (2.98)

Finally, two other unequal probability sampling schemes which will


be mentioned in later chapters will now be described.
42 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS

Bernoulli sampling
For each r = 1, ... , L independently, the PSU r is included in the
sample with probability lar and excluded with probability 1 - lar.
The first-stage inclusion probability will be nr = lar and the expected
PSU sample size will be I, but the actual PSU sample size is potentially
any integer between 0 and L. This design is theoretically important (see
Sections 3.3 and 3.4) and mathematically simple, but its variable sample
size may be a disadvantage in practice since it means there is little
control on the information in the sample. When there is non-response
in a single-stage design, the respondents are sometimes assumed to
constitute a Bernoulli subsample of the originally intended sample.

Simple rejective sampling


There are two ways to implement this fixed size design. One is to
.select I units by sampling with replacement with selection probabilities
Pr, r = 1, ... , L, ~d to reject the sample if any of the PSUs in it
have been repeated. The process begins again, and is carried out as
many times as it takes to produce a sample of I distinct units. The
other method is to take a Bernoulli sample, letting PSU r be included
in the sample with probability Ar and excluded with probability 1 - Ar •
Again the sample is rejected if it does not contain precisely I units. The
two methods give the same design if Pr = A).Al - Ar)-I, A being a
normalizing constant.
Hajek (1981, Chapter 7, Problems and Notes) has given approx-
imate expressions for nr and nrq . To a first-order approximation, if
Ar is chosen equal to lar and L maxl=:;r=:;L nr remains bounded, we
have nr ::: lar and the approximate with-replacement property that
(nrnq - nrq)/n rq ::: 1/(1- 1) as L becomes large.
Perhaps the main importance of simple rejective sampling is the fact
that it corresponds to Bernoulli sampling conditioned on the achieved
sample size. However, it has recently been shown by Chen et al. (1994)
that the design has other nice properties. For simple rejective sampling,
fixing the inclusion probabilities {nr} determines the draw probabili-
ties {Pr} uniquely. Moreover, among fixed size (I) (first stage) designs
(P(SB)} with given {nr}, simple rejective sampling maximizes the en-
tropy measure
- LPI(SB) log PI (SB)·
The joint inclusion probabilities satisfy 0 < nrq < nr nq for any
pair r # q, and hence the variance estimator (2.7.9) is generally non-
negative.
EXERCISES 43

Exercises
2.1 Consider a population of size N = 3, and let the sampling de-
sign be equal probability Bernoulli sampling, where each unit in
the population is included in the sample with probability 2/3,
independently of the others. Give pes) for each subset s of the
population U = {I, 2, 3} under this scheme. Find the inclusion
probabilities Jrj, and verify that their sum over all population
units is the expected sample size.
2.2 Recall that if all inclusion probabilities Jrj are positive, then the
HT estimator L j es Yj / Jrj is unbiased for 1'y. Show that if some
Jr j = 0 and if the corresponding Yj is allowed to vary independ-
ently of the other components of the popUlation array y, then there
is no unbiased estimator of 1'y with respect to the sampling design.
2.3 For each of the following sampling schemes, give values or
expressions for the inclusion probabilities Jrj and the expected
sample size En(s). State whether the design is self-weighting
and whether it is of fixed size.
(i) Systematic sampling with K = 3 and N = 7: choose a start-
ing unit il at random from {I, 2, 3}, and let the sample con-
sist of iJ and jl + K, and jl + 2K if this last unit is in the
population.
(ii) Circular systematic sampling with K = 3, N = 7 and n = 3:
choose a starting unit at random from {l, ... , N} and let the
sample be {jl, il +K, ... ,il + (n -1)K} where the unit label
is taken to be its value modulo N.
2.4 Verify (2.39) in the text.
2.5 Show that the estimator in Example 2.6 is sampling unbiased, and
that vee) of (2.42) is of the form (2.40).
2.6 In a two-stage sample, suppose that the size of the first-stage
sample is I = 2, and that the first-stage sample s B = {r, q} is
drawn in successive sampling without replacement, with selec-
tion probabilities proportional to probabilities PI, ... , PL. In the
notation of Section 2.7, the estimator of 1'y corresponding to the
estimator of Murthy (1957) is

e = 1 [ -(1
tr tq
- pq) + -(1 - Pr) ] .
2 - Pr - Pq Pr Pq
Suppose second-stage sampling is carried out by SRS without
44 EXERCISES

replacement and sample sizes mr. Using the result of Exercise


2.5, give an unbiased estimator for the variance of e.
2.7 A national household survey uses a stratified multi-stage sample.
When a stratum consists of a rural part and a small urban area, it
is divided into four PSUs. Each PSU consists of a geographically
connected rural part, and an urban part which is a one-in-four
systematic sample of households in the urban area. Suggest an
explanation for constructing the PSUs this way.
2.8 For Fellegi's method of unequal probability sampling (Section
2.8) with L = 4, I = 2, find PI, P2, P3, P4 and PI, P2, P3, P4 so
that fIl = 0.3, fI2 = 0.4, fI3 = 0.6, fI4 = 0.7. Find fIrq and
(fIrfIq - fIrq)/fI rq for each pair {r, q}.

2.9 Show that for Sampford's rejective sampling method (Section 2.8)
with I = 2, the inclusion probability fIr is equal to 2ar , and give
an expression for fI rq •
2.10 In a sampling method due to Durbin (1967), for I = 2, the
first unit r is selected with probability ar , and the second unit
q without replacement with probability proportional to bqr =
aq {(1 - 2ar )-1 + (1 - 2a q )-I). Show that the inclusion proba-
bility fIr is equal to 2ar , and give an expression for the joint
inclusion probability fI rq .
2.11 McLeod and Bellhouse (1983) describe a method for drawing a
simple random sample without replacement (size n) on a single
pass through a sequentially ordered population of size N. The first
n units of the population are selected as the initial sample. When
the kth unit is encountered, for k = n + 1, ... , N, the sample
remains the same with probability I - n / k; with the remaining
probability n / k a randomly selected member of the current sample
is replaced by unit k. Show that this method does indeed produce a
self-weighting design. Note that N need not be known in advance
for this procedure to be carried out. Chao (1982) gives a method
of 7r ps sampling which is a generalization of this.
2.12 In Midzuno's sampling design (Midzuno, 1952; Rao, 1963) the
first unit j of a single-stage size n sample is selected with prob-
ability P j, and the remaining units are selected with equal prob-
abilities without replacement. Show that if Pj = x j / Tx for some
positive variate x, then the ratio estimator
eR = Tx(LYj)/(LXj)
jES jES
EXERCISES 45

is unbiased for the population total Ty under this design. How


would {p j} be chosen to make lrj (X X j?
2.13 Suppose the Rao-Hartley-Cochran procedure (Section 2.8) is used
at the first stage of sampling, I PSUs being selected from L PSUs.
Show that if k = L / I is an integer, and if all groups are of size k,

It
then

Var (t
j=l
Pj 1',.(j)) = lk(k -
Ctr(j) L(L - 1)
1) Tr2 -
r=l Ctr
I21
y

where 1',. is the total of y in the rth PSU. Hence explain why
v(e) of (2.98) should be an approximately unbiased estimator of
the variance of the conditional HT estimator L~=l Pjtr(j)/Ctr(j)
for large k.

Solutions

2.1 p(s) = (~)n(s)(!)3-n(s) for each subset s of U. Since sample


size is binomial with mean 3 x (2/3) = 2, E{n(s)} = 2; also
L]=l lrj = 3 x (2/3) = 2.

°
2.2 Suppose lrl = 0, so that p(s) = whenever s contains 1. Suppose
also
LP(s)e(xs ) = Ty (*)
SES
for all possible y. Varying Yl but not the other components of y
will make the right-hand side of (*) change, but not the left-hand
side, resulting in a contradiction.
2.3 (i) lrj 1
= for each j; En(s) = t;
design is self-weighting but
not of fixed size.
(ii) lrj = ~ for each j; En(s) = 3; design is self-weighting and
of fixed size (3).
N
2.4 e - Ty = L(djsljs - I)Yj. Thus
j=l

MSE(e) = tY]E(djs l js -l)2+ t tajk (Yj.) (Yk) WjWk.


j=l j -# k w] Wk

It suffices to show that E(djsljs - 1)2w] = - Lk-#j ajkWjWk.


This follows since by (2.38), (djsljs - l)wj = - Lk-#/dkshs -
l)wk.
46 EXERCISES

2.5 Ee = LL PjPk(
N N 2
- Pj - Pk)
j<k 0- Pj)O - Pk)

x 1 [ Yj O-Pk)+Yk(1_pj)]
(2 - Pj - Pk) Pj Pk

= ~
2
t t[Yj~+Yk~]
j # k 1 - Pj 1 - Pk
=Ty.

If S =
{j', k} then dJ·s =
(i-Pk) x -2_1_. Note (2.38) is sat-
Pj -pj-Pk
isfied with W j = Pj for all j. In (2.40),

v(e) = _ (1 - Pj)(1 - Pk)


PkPj

x 1 [1- (2 _ Pj _ Pk)) (YJ _ Yk)2 PjPk,


(2 - P j - pd 2 Pj Pk
which reduces to the expression in (2.42).
2.6 Using the result of Exercise 2.5 and the rule (2.90) we construct
an unbiased estimator of Var(e) as

v(e) = (2- Pr1- 2 (1-Pr)(l-Pq)(1-Pr-Pq)


( tr
---
tq )2
pq) Pr Pq
+ 1 [v2,r(1 - pq) + V2,q(1 - pr)],
2 - Pr - Pq Pr Pq
where V2,r is as in (2.91).
2.7 The method is intended to retain the cost advantages of area PSUs,
but to increase variability of the response variates within PSUs,
thereby decreasing the variability of their means across PSUs, and
reducing the variances of estimators of population totals.

2.8 PI = 0.15, P2 = 0.2, P3 = 0.3, P4 = 0.35;


PI = 0.1183, P2 = 0.1669, P3 = 0.2965, P4 = 0.4183;

{r,q} I {I,2} {1,3} {I,4} {2,3} {2,4} {3,4}


IIrq 0.0568 0.1009 0.1423 0.1424 0.2008 0.3568

2.9 The design is of fixed size (2). Thus to show IIr = 2a" it suffices
to show IIr ()( ar. Now IIr = ar(Lq1'r {3q) + (Lq1'r aq){3r +
(L q aq{3q)II r . Thus
IIr ()( arO - {3r) + (1 - a r ){3r = a r + {3r(1 - 2ar) = (1 + K)ar.
47

= 1 + "" aq
L..q -2a = C;
q q

= 1),.
Generally 7rj = Pj + Lk#jPk (~-:::.D = ~-:::.11 + Pj Z::::~. To make
7rj = nXj/Tx, set Pj = [r - ~-:::.;] Z::::!, provided all these quant-
ities are non-negative.
e
2.13 The quantity = L~=1 PjTrU)/arU), for which the variance is
being calculated, is conditionally unbiased for Ty, given the ran-
dom grouping. Thus its variance is the expectation of its condi-
tional variance, given the random grouping, and is
48 EXERCISES

where Gj is the jth group, and Tj is LrEGJ 1',.. This reduces to

using a special case of (2.23) within groups. Since the probability


that fixed rand q are both in G j is 1~1"=.'?), the expectation of the
conditional variance of ise
I x ~
2
x k(k-l)
L(L - 1)
t
r
tcxrCXq (Tr _ Tq)2,
=I q CXr cxq
which reduces to the given expression. Now the conditional ex-
pectation of

I) ~
- 1 ( 1-- ( TrU")
~Pj - - e
_)2
I- 1 L j=l CXrU)

t)
is 1~1 (1- {L;=l Z- Var(elgrouping) - 1'y2 }. Thus its uncon-
ditional expectation is

-1- ( 1 - -
1-1 L
I) [L-k-l
--1 - 1] Var (-e) = Vi-
ar(e).

According to the rule (2.90), an unbiased estimator of Var(e),


where e = L~=l PjtrU)/cxrU)' would be given by vee) of (2.98)
plus a term which would be negligible for L » I, or k large.
CHAPTER 3

Distributions induced by random


sampling designs

This chapter describes some further mathematical results, namely res-


ults on distributions of estimators induced by the randomization of the
sampling design. In Sections 2.2 and 2.3 expectations, variances and
variance estimators were derived for certain linear estimators of totals
and means. With some more theory on their distributions it will be
possible to derive approximate confidence intervals' (or regions) in a
mathematical sense for population means, totals and derived functions,
based on the randomization distribution.
Because such intervals are based on the artificial randomness of
the survey design, rather than on anything that is known about how
the population values themselves are generated, they should be used
in inference with caution. The elaboration of this remark will be the
subject of much of Chapter 5.
The sampling design for which the most extensive theory is available
is simple random sampling, and we will focus for most of this chapter
on simple random sampling without replacement. From here on the
abbreviation SRS unqualified will mean SRS without replacement.
Sections 3.1 and 3.2 will deal with the estimation of proportions and
numbers in SRS. This special case, which is of practical importance,
provides a setting for the introduction of confidence intervals in Section
3.2. Extension of the proportion case methods to interval estimation of
means and totals relies heavily on the normal approximation to the
distribution of the sample sum. Sections 3.3 and 3.4 establish this ap-
proximation in SRS, while Section 3.5 surveys approximate normality
results for more complex designs.
To some extent, the need for methods tailored to very complex de-
signs is diminishing. Advances in computing technology are making
frame construction and sample selection easier, and it is becoming
more and more economical and feasible to collect data over long dis-
tances, by telephone or other media. Thus some degree of emphasis
on SRS and its simpler generalizations is not misplaced. Sections 3.6-
50 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

3.13 are devoted mainly to estimation under SRS when refinements of


the normal approximation are desirable. The Edgeworth expansions of
Sections 3.6 and 3.7 for the sample sum distribution, and Section 3.8
for the studentized mean distribution, are used mainly in theoretical
assessment of second-order asymptotic properties, rather than in confi-
dence interval construction. The saddlepoint approximations discussed
in Sections 3.9-3.11 are more directly relevant, and can be used in
constructing inverse testing confidence intervals for means and totals.
The inverse testing approach is further illustrated in a monetary unit
sampling context, this time with resampling, in Section 3.12. Then Sec-
tion 3.13 describes a competing approach to refined confidence interval
construction, namely the bootstrap t method.

3.1 Distribution of sample numbers and proportions in SRS

By far, the popUlation quantities most often estimated in surveys are


numbers and proportions. The estimation of these quantities can present
problems when they refer to relatively rare characteristics in the popula-
tion. Thus it is worthwhile as well as instructive to begin by looking at
the special case of the distribution of sample numbers and proportions,
in particular under SRS.
Let M be the number of units in a population U having a certain
characteristic C. Then P = M / N, where N is the population size, is
the proportion with characteristic C. Suppose it is desired to estimate
M or P from a sample of size n from the population.
Let the sample number and proportion with characteristic C be ms
and Ps respectively:
Ps = ms/n.
Recall that M is a population total and P a population mean if the Y
variate is taken to be an indicator of characteristic C:

Yj = 1 if unit j has C,
= 0 otherwise. (3.1)

Thus if the sample is taken by SRS, n draws,

Eps = P, Ems = nP, and E(Nps) = M = E (~ ms).


The estimator N Ps is the expansion estimator of M.
If Yj is given by (3.1), it is easy to see that the population variance
DISTRIBUTION OF SAMPLE NUMBERS AND PROPORTIONS IN SRS 51

of (2.60) is
S2
y
= ~P(l-P)
N -1 '
and hence from (2.30)
1 N-n
Var(ps) = ~ N-l
P(1
-
P)
N-n
Var(m s) = n N _ 1 PO - P) (3.2)

N 2 N-n
Var(Nps) = - - - P O - P).
n N-l
But we can go further in this case, and obtain an exact distribution for

P(m s = m) =
(~)(~=~) (3.3)
(~) , m=O, ... ,n.

This is the hypergeometric distribution, of which M or P may be


regarded as a parameter.
For small populations the hypergeometric distribution is easy to work
with, but for large populations approximations are desirable. The chief
of these is the binomial , under which

P(m s = m) ::= ( ~ ) pm(l - p)n-m, m = 0, ... , n. (3.4)

This is a good approximation if N is very large relative to n, so that


without-replacement and with-replacement sampling are virtually equi-
valent.
If in addition P is thought to be very small, the Poisson approx-
imation is often used:
(np)m -nP
P(m s = m) ::= - - e , m = 0,1,.... (3.5)
m!
This approximation is extremely useful for estimating the proportion
of units with some rare characteristic.
Under other circumstances, because of the SRS central limit theorem,
to be discussed in Section 3.4, the sample number ms is approximately
normally distributed, with mean n P and variance
n(N - n)
---P(1-P).
N-l
52 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

For this approximation to work both nand N -n should be moderately


large, and P should not be too close to 0 or 1. To put it more precisely,
the closer P is to 0 or I, the larger n and N - n must be to guarantee
the usefulness of the normal approximation. Note, however, that N
need not be many times larger than n, as it should be for the binomial
approximation.
Each of these exact or approximate distributions for ms induces a
distribution for Ps. In particular, if ms is approximately normal, so is
Ps·

3.2 Confidence limits for population numbers and proportions in


SRS
In a mathematical sense the distributions in 3.1 yield confidence inter-
vals, both one-sided and two-sided, for M and P. In this section the
construction of upper limits for P will be described. Lower limits for
P and the limits for M are formed similarly.
The upper 100(1 - a)% confidence limit Pu for P is a function
of the observed sample number, which we shall denote by m? here, to
distinguish it/or the moment from the random variable ms. Specifically,
Pu is the largest value of P such that
(3.6)
where P( ; P) denotes design-induced probability when P is the true
population proportion.
EXAMPLE 3.1: Suppose the population has size N = 20, and that in an
SRS of size n = 10 we observe m? = 1. Let p(P) = P(m s ~ 1; P) =
"I (20P)(20(l-P»)/(20)
L.Jm=O m 10-m 10 •
The upper 90% confidence limit Pu would
be the largest value of P such that p(P) ~ 0.1. From the following
table we see that Pu in this example is fa.
P o 1
2li
2
2li
3
2li
4
2li
5
2li
6
2li
p(P) 1 0.763 0.500 0.291 0.152 0.070
If we regard m? as having the sample number distribution, it is
possible to show that the interval [0, Pu ] covers P with probability
at least 100(1 - a)%. Moreover, in a certain sense the interval is no
larger than it needs to be to have this property of at least 100(1 - a)%
coverage probability. Note also the significance interval (or inverse
testing) interpretation, that any value of P higher than Pu would be
rejected by an a-level one-sided test of significance based on m?
CONFIDENCE LIMITS IN SRS 53
As in Example 3.1, the unknown P is actually discrete, since its
possible values are 0, 1/N, 2/N, ... , 1. However, if N is large enough
that P can be taken to be continuous, we can find Pu by solving
P(m s ~ m~; P) =a (3.7)
for P. Thus, using the binomial approximation, we would solve

(3.8)·

If we were using the Poisson approximation, we would solve

(3.9)

In cases where m2 is small, solving (3.9) is made easier by the


fact that its left-hand side is equal to the probability that a chi-square
random variable with 2(m2 + 1) degrees of freedom exceeds 2nP. If
the appropriate X2 percentage point is available, its value divided by
2n will be the upper limit Pu .
EXAMPLE 3.2. Suppose an SRS of size n = 100000 is taken from a
population of size N = 26700000 and that the sample number with
characteristic C is m~ = 5. Since the 0.95 quantile ofax112) variate
is 21.026, the upper 95% confidence limit for P would be this divided
by 200000: that is, Pu = 0.000 105. The corresponding upper limit for
M, the number in the population with characteristic C, would be about
2800.
Similarly, (3.8) has its left-hand side given by

n -mo
P ( F(2(m~+1),2(n-mm > m~ + i 1 -P P ) . (3.10)

Thus,.ifthe appropriate percentage point of the F distribution is avail-


able, we have a simple linear equation to solve for Pu.
A way of using the binomial approximation without F tables was
developed by Pratt (1968). He used the fact, based on the WilSOIr
Hilferty cube root approximation to X2, that
54 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

is approximately standard normal if F is distributed as F(/J.. v) for large


f.L, v. This converts the F tail probability of (3.10) to a normal tail
probability. Solving the resulting version of (3.8) for P gives

Pu = [1+ (m~+ 1 )2{ 81(m~+ 1)(n-m~)-9n-8-3IzaIJA }3]-1


n-m~ 81(m~+1)2-9(m~+1)(2+z~)+1

(3.11 )
where A = 9(m~+1)(n-m~)(9n+5-z~)+n+l, andza is the a quantile
of a standard normal distribution; when the binomial approximation for
the distribution of m~ is a good one, the associated confidence interval
will be close to exact. For example, when n = 10 and m~ = I both
formulae (3.10) and (3.11) give Pu = 0.394 to three decimal places.
A detailed discussion of binomial confidence intervals has been given
by Blyth (1986).
!
When a < and the normal approximation to the distribution of ms
is reliable, (3.7) becomes
o 1
ms + -2 -nP
-;:.======== = Za = -Iza I (3.12)
fn(N - n) PO _ P)
Y N-l
where Za is again the a quantile of a standard normal distribution.
The 112 in the numerator of the left-hand side gives a correction for
continuity. Use of this equation means solving a quadratic equation for
Pu , and yields
z2 / a 2 Z2)
Pu = (a +"2 +zya - --;; + 4" I(n +z2) (3.13)

where a = m~ + 1/2 and Z = IZaIJ(N - n)/(N - 1). When nand


N - n are large, however, it may be expected that

Pu
A

= Ps0 + -2nI + -In


IZal jN - n 0
- - p (l - pO)
N _ ISS
(3.14)

will yield an approximate solution sufficient for practical purposes.


When the binomial and simple normal approximations are inade-
quate, something closer to the exact hypergeometric distribution of
ms is needed. Molenaar (1973) has provided several refined normal,
Poisson and binomial approximations to the Poisson, binomial and hy-
pergeometric distributions. In particular, he has given an expansion (p.
120) for Z where <l>(z) = P(m s :::: m~), <l> being the standard normal
CONFIDENCE LIMITS IN SRS 55
cumulative distribution function. Confidence limits would be obtained
by numerical inversion of the expansion taken to a fixed number of
terms. It is also increasingly realistic to contemplate computing exact
intervals for proportions in SRS, for moderate nand N.
To summarize how to obtain confidence limits for P in practice, the
following may be suggested.
(i) If only a crude interval estimate is needed, if nand N - n are
moderately large and if p2 = m~/n is not very close to 0 or 1,
use (3.14) for Pu . This gives an upper limit for a 100(1 - a)%
one-sided interval. A 100(1 - 2a)% two-sided interval would be
given by [PL. Pu ] where

s
°
P~L = P - - 1 - -ZI-a
2n In
IN -
N-l
n
--p0(1 - pO).
S S
(3.15)

Note that Zl-a= IZal.


(ii) If N is many times larger than n, n is large, and m? is a small
number, use the Poisson approximation and take Pu to be the
solution of (3.9).
(iii) If N is many times larger than n and an accurate upper limit is
desired, solve (3.8) from the binomial approximation, or use the
Pratt formula (3.11).
(iv) For nand N - n moderately large, if p2 is not too close to 0 or 1
and accurate limits are desired, solve (3.l2) based on the normal
approximation. The two roots PL , Pu will form a 100(1 - 2a)%
confidence interval.
(v) In other cases, for accurate limits it may be reasonable to use
numerical inversion of the exact hypergeometric tail probabilities
for ms.
Confidence intervals for P and M based on the binomial approx-
imation will tend to be conservative, because the exact variance of ms
or Ps incorporates the finite population correction (N - n)/(N - 1),
while the binomial approximation does not. Thus for many practical
purposes binomial-based confidence limits will be quite useful, even if
N is not extremely large compared with n.
An important practical problem is the joint estimation of several pro-
portions, arising when each population member is in one of k disjoint
classes C 1••••• Ck. If Mi and Pi denote respectively the population
number and proportion in Ci, then Pi = =
M / N and L~=l Pi 1. The
sample numbers mis will have a multiple hypergeometric distribution
56 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

with probability function


P(mis = mI,···, mks = mk)
(~:) ... (~:) mi 2: 0, ... ,mk 2: 0,
= (3.16)
L~=Imi =n.
(~)
The mean sample numbers are Emis = nPi , and the second-order
moments are given by
N-n
= N _ 1nPi (1 - Pi),
N-n (3.17)
= -N_lnPiP/, ii-I.

In the reporting of survey results, estimates ofthe variances of the mis


are often transformed into standard errors for the sample proportions
Pis, ... ,Pks· These can be transformed in turn into confidence intervals
for PI, ... , Pk taken one at a time. However, when joint consideration
of the proportions is necessary, analysis is usually based on a multivari-
ate normal approximation to the joint distribution of mis, ... , mks, with
means nPI, ... , nPk, and covariance matrix given by (3.17). Thus, for
example, an interval for P2 - PI might be based for large samples on
taking

IN-n
N _ 1n[pis + P2s - (p2s - Pis) ] 2
(3.18)

to be approximately N(O, 1). Ways of defining simultaneous confidence


intervals for Plo ... , Pk were reviewed by Thomas (1989).

3.3 The characteristic function of the sample sum in SRS


The central limit theorem for a sequence of independently distributed
random variates states that, under certain conditions, the standardized
partial sums of the variates have a limiting normal distribution. This
theorem can be applied directly when the design is carried out by in-
dependent draws with replacement. However, in SRS (without replace-
ment) the results of successive draws are dependent, and for this case
rather delicate arguments are needed to obtain the most useful asymp-
totic normality results. These were established independently by Erdos
and Renyi (1959) and by Hajek (1960).
In their proof of a central limit theorem for SRS, Erdos and Renyi
THE CHARACTERISTIC FUNCTION OF THE SAMPLE SUM IN SRS 57

established the following very suggestive representation for the char-


acteristic function of the sample sum of a real variate y.

THEOREM 3.1: Assume without loss of generality that the population


mean JLy = 0, so that

(3.19)

The sample sum characteristic function is by definition

(3.20)

where i 2 = -1, but it can be represented as

(3.21 )

where).. = njN and B = ( ~ ) )..H(1- )..)N-n.


This representation follows from the fact that SRS can be obtained
by carrying out Bernoulli sampling and rejecting the sample if the size
is different from n. In Bernoulli sampling with equal inclusion prob-
abilities ).., each population unit j is in the sample with probability )..,
independently of all the others. The samtle size n(s) has mean N)" = n,
and n(s) - n can be represented as Lj=l (ljs - )..), where Ijs = 1 if j
is included and 0 if not. Similarly, the sample sum Ljes Yj minus its
expectation has representation L~=l (ljs - )")Yj. Thus the joint char-
acteristic function of sample size and sample sum, with expectations
subtracted, is

N
¢(u, t) = D[(1 - )..)e-iA(u+tYj) + )..ei(l-A)(U+tYj )]. (3.22)
j=l

Now the integral J... f71: eiux du is 1 for x = 0 and 0 for x =


271: -71:
±l, ±2, .... Thus if X and Yare discrete variates with joint proba-
bility function p(x, Y), and X is integer-valued, the conditional char-
acteristic function of Y given X = 0 can be computed from the joint
58 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

characteristic function of X and Y as

_1
2JT
l 1C
(L L p(x, y)eiux+itY)du/ P(X = 0)
x Y

l
-1C

1C
= -2
1 ¢x,Y(u, t)du/ P(X = 0). (3.23)
JT -1C

Since B as defined above is the probability that n (s) - n = 0, the right-


hand side of (3.20) is clearly the conditional characteristic function of
the sample sum, given that n(s) = n.

3.4 The finite population central limit theorem for SRS

3.4.1 Distribution of univariate sample sum

When Y is a real variate, the finite population central limit theorem


asserts that under certain conditions on the population array and the
sample size, the standardized sample sum under SRS is asymptotically
normally distributed.
In formulating the conditions precisely it is convenient to think of
an asymptotic framework in which the population size N tends to 00,
with the population array being redefined for each N, so that in facty is
really (Y\ N, ••• , YN N ). Suppose that the sample size n is also regarded
as a function of N. Let h = Yj - J.tY' so that the population mean of the
y variate is O. Let VN = NAO-A)L;=\y;/N stand for (N -l)/N
times the variance of LjES Yj, where A = n/ N, and let D~ = L;=\ y;,
THEOREM 3.2: A necessary and sufficient condition for the distribu-
tion of (LjEsYj - nJ.ty)/JVN to approach the N(O, 1) distribution is
that
lim dN(E)
N-.oo
= 0 for any E > 0, (3.24)

where

d N(E) = -I2 '~" Yj" -2


DN J:!.Vjl>Ev'VN
Condition (3.24) is analogous to the Lindeberg condition for sums of
independent random variates (Feller, 1971, p. 262). It can be shown to
imply that

v = NA(1 - A) = nO - n/ N) -+ 00 as N -+ 00.
THE FINITE POPULATION CENTRAL LIMIT THEOREM FOR SRS 59

A sufficient condition for (3.24) to hold is that

·
11m
N-+oo
QN
VN(2+8)/2 =
° (3.25)

for some 8 > 0, where

L IYiI
N
QN =)..{l -)..) 2+8.
j='
This condition is analogous to Lyapunov's condition (Feller, 1971, p.
286), and would hold, for example, if y" Y2, ... were observations
from independent and identically distributed variates Y" Yz. . .. with
finite (2 + 8)th moment, and v = N)..(l -)..) approached 00 with N.
Alternatively, (3.24) holds if

lim jmax
N-+oo }
Y~
DN
I x (number of j such that IYjl > Ej'V;) = 0.
In the special case where M of the Yj values are 1 and N - M are
0, condition (3.24) can be shown to be equivalent to
1.
1m
M(N - M)n(N - n)
3 = 00, (3.26)
N-+oo N
or
lim vP(l - P) = 00. (3.27)
N-+oo
This implies that we may think of a sample number ms as being ap-
proximately normal if P is not too close to or 1, and if nand N - n
are moderately large.
°
In outline, the proof of the sufficiency of (3.24) from representa-
tion (3.21) proceeds as follows. The details are given in Renyi (1970,
Chapter 8, Section 5).
First note that since 4>(0) must be 1, (3.21) can be rewritten as
4>(t) = <l>(t)/<l>(O), (3.28)

D
where

<l>(t) = JV j-Jr
Jr N
p(u + tYj)du
and

)..(1 -)..) )
exp ( - 2 x2 for x near 0.
60 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

Defining new variates wand r by

u = w/Jv, 1= r/y'V;
makes it possible to rewrite <I> (t) as

I IT n./V
-n./V j=]
p (~+ rYj )
JV../VN
dw
.
(3.29)

Now for any E > 0, the sufficient condition (3.24) implies a negligible
effect as N --+ 00 of factors in which Ir h /../VNI 2: E. If Ir h /../VNI <
E, and also Iw/ JVI < 2E, we can use the approximation

w rYj ) 1).,(1 2- ).,) ( wr.; + rh )2} .


"'v + ",V
p ( r.; rrr-
N
~ exp rrr-
"'V ",VN
If Ir h /../VNI < E, but Iw / JVI 2: 2E, then it can be shown that

k(;+ ~)I
is uniformly less than 1, and that the product of an indefinitely increas-
ing number of such factors becomes negligible. Thus the integral <1>(/)
of (3.29) may be approximated by

J 2E./V
exp -
1).,(1 -).,) L N ( W
-+--
rYj )2} dw
-2E./V 2 j=] JV ../VN
or by

J-00
oo
exp -2 - 2w
{r2 2
}
dw =
r2 }
J2rr exp { -2
for large N. Dividing by the value at 1 = 0 (or r = 0) as in (3.28) gives
the limiting characteristic function of LjES Yj /../VN as the standard
normal characteristic function exp( _r2 /2).

3.4.2 Distribution of multivariate sample sum


When y is vector-valued, Theorem 3.2 has a natural analogue. It is
convenient to express it a little differently: we assume without loss of
generality for practice that the population covariance matrix S; of (2.30)
approaches a constant matrix ~ as N --+ 00. Again let h = Yj - J-ty-
THEOREM 3.3: Under the condition
lim ON(E)
N->oo
= 0 for any E > 0, (3.30)
ASYMPTOTIC NORMALITY AND APPLICATIONS 61

where
ON(E) =-
I
L II h 112
N j:IlYiIl>fNA(l-A)
and II is the Euclidean norm, the distribution of the normalized
sample sum

jEs
approaches the MV N(O, :E) distribution.
The idea of the proof has been given by Rao (1973).

3.5 Asymptotic normality and applications


3.5.1 Conditions for asymptotic normality
Extensions of the finite population central limit theorem to other sam-
pling designs are used frequently, although rigorous justifications are
not always available. In the partial survey of results which follows, y
is taken to be real.
One case of unequal probability sampling where the characteristic
function of an estimator is easy to derive is simple rejective sampling.
Consider Bernoulli sampling with unequal probabilities, where popu-
lation unit j is in the sample with probability Aj, independently of the
others. Assume that n = 2:.7=1 Aj. Then the joint characteristic func-
tion of n(s) - n and the sample sum analogue ALjEsYjjAj (where
A = njN) is

= 0[(1- Aj)e-iAj(U+/YjA/Aj) + AjeiO-Aj)(U+tYjA/Aj)].


N
</J(u, t)
j=1

Now consider an unequal probability sampling design which is im-


plemented by taking a Bernoulli sample and rejecting the sample if
n(s) "# n. The characteristic function of

e = A LYjjAj (3.31)
jes

l
under this design is
1C
I
277: -1C </J(u, t)duj B (3.32)

where B is the probability under the Bernoulli sampling design that


n(s) = n. The integral (3.31), which corresponds to (3.21), can be
approximated analogously.
62 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

As pointed out in Section 2.8, the conditioned Bernoulli sampling


design above is equivalent to simple rejective sampling, which uses
selection probabilities PI, ... , P N in each of n independent draws, and
then rejects the sample if n(s) i= n. The equivalence is established
when Aj is chosen so that Pj ex: Aj/(l - Aj) and L:7=1 Aj = n.
Thus it is possible to formulate an asymptotic normality result for
e of (3.31) for the simple rejective design. Asymptotic normality was
established by Hajek (1964) (using an alternative method, however).
In general, conditions can be formulated under which a standardiza-
tion of the HT estimator

jes

for a fixed size design will be asymptotically normal, either if the


design is well approximated by conditioned Bernoulli sampling, or if
the design can be implemented through selection probabilities which
vary little from draw to draw. Rosen (1972) provided a proof for the
case of successive sampling, where the selection probability for j at
the kth draw is proportional to Pj for those j which have not already
been drawn. The proof is based on theory for the coupon collector's
problem.
It may be recalled from Section 2.7 that single-stage rejective and
successive sampling are also approximately with-replacement designs,
in the sense that their joint inclusion probabilities tend to satisfy
n- 1
rrjk ~ - - r r j r r k
n
when N is much larger than n and the rrj are not too highly variable.
Asymptotic normality of the stratified sample mean under stratified
random sampling has been discussed by Bickel and Freedman (1984).
In their formulation, the number of strata and the stratum sizes Nh
are allowed to depend on N as N ---+ 00, and either the number or the
sizes of the strata may remain bounded. It is shown that if all n h satisfy
2:::: nh :::: Nh - 1, then both the standardized stratified mean
(jist - /Ly)/JVar(jist)
and the studentized stratified mean
(jist - /Ly)/Jv(Yst)
are asymptotically N(O, 1) provided that
[L ah L (yj - /Lh)2/ L ah L (Yj - /Lh)2] ---+ 0, (3.33)
h jeShnA,h h jeSh
ASYMPTOTIC NORMALITY AND APPLICATIONS 63

where JLh is the mean of Y over the hth stratum, ah = Nh (Nh -


nh)/(nh(Nh - 1» and A€h is the set of j E Sh such that

IYj - JLhl > E~h '/Var<Yst).


Krewski and Rao (1981) have given sufficient conditions, based on
the independence of sampling from stratum to stratum, for the asymp-
totic normality of the standardized stratified mean when the nh are
bounded and H, the number of strata, goes to 00. Sampling is with re-
placement within strata. The conditions assume a vector-valued Y with
components Ya, a =1•...• p. The variate Ya has stratum means JLah,
and the stratum covariance matrix of Y is S~. Suppose that
H 1
(i) LWh x -LIYaj - JLahI2+~ remains bounded for some ~ as
h=l Nh jeSh
H -+ 00 (a = 1• ...• p);

(ii) max nh is bounded;


l~h~H

L L
(iv) n LWln;;lS~ -+ 1:, a positive definite matrix, where n = Lnh.
h=l h=l
Then as H -+ 00 the distribution of ..;n<Yst-f.Ly) approaches MVN(O. 1:).
Also, for any linear combination of components Ya, the distribution of
the usual studentized stratified sample mean will approach normality.
Using an approach similar to Rosen's, Sen (1980; 1988) has shown
how to obtain asymptotic normality of a Horvitz-Thompson type es-
timator from a multi-stage sample where the primary sampling units
are obtained by successive sampling. It is assumed that the number L
of PSUs in the population and also the sampled number I approach
infinity at the same rate. The most important sufficient condition for
asymptotic normality of the estimator is a modified Lindeberg condi-
tion on the variates tr - 1',., where tr is an unbiased estimator from the
later stages of sampling of the total 1',. in the rth PSU.
The implications of asymptotic normality results are important both
for estimation and for choice of sampling strategy, as the next two
subsections will illustrate.
64 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

3.5.2 Normal-based corifidence intervals/or totals


Suppose an estimator

e = LdjsYj or e = Ldr(sB)tr
jEs rESB
of the population total Ty is asymptotically normal. That is, suppose
that in a suitable asymptotic framework, the distribution of
e- Ty
(3.34)
JVar(e)
approaches the N(O, 1) distribution as the index N ~ 00. If v(e) is a
design-consistent estimator of Var(e) , or in other words if v(e)/Var(e)
approaches 1 in probability as N ~ 00, then the studentized variate
e - Ty
(3.35)
Jv(e)
also has a limiting N(O, 1) distribution, and approximate two-sided
100(1 - 2a)% confidence intervals for Ty are given by

e ± ZI-aJv(e), (3.36)

where Z\-a is the 1 - a quantile of the N(O, 1) distribution.


To use this construction we need to be able to show consistency of
v(e). In SRS, if e = Nys, then v(e) is generally ~2 (1 - -W)s;, and we
s;
can see from (2.78) that will be a consistent estimator of S; in some
reasonable asymptotic framework satisfying n = n N ~ 00 as N ~
00. Similarly, for stratified random sampling, consistency of the usual
v<Yst) can be established. For more complex designs, consistency of
the Yates-Grundy-Sen estimator (2.27), the other unbiased estimators
(2.29) and (2.40), the multi-stage estimators (2.90) and (2.91), and
the approximately unbiased estimators (2.94) and (2.95) will require
stronger but still realistic conditions tailored to their contexts.

3.5.3 Sample size determination


In the design of sample surveys, many considerations enter into the
determination of sample size. The main limiters of sample size tend to
be the budget and time available for data collection, and the need to
keep non-sampling errors as small as possible. Very often, as well, a
sampling scheme and sample size are chosen because they are 'stan-
dard' for the type of survey being planned. Thus, somewhere between
FORMAL EDGEWORTH EXPANSIONS 65

1000 and 1500 respondents is standard for national opinion polls in


North America.
Nevertheless, it can sometimes be very useful to try to determine a
sample size which will yield specified precision for certain estimators,
or specified strengths for certain analyses. Precision of point estimators
is often expressed in terms of a probability interval for error. In these
cases the limiting distributions of the estimators will play an important
role.
EXAMPLE 3.3: Consider the following statement, taken from numer-
ous newspaper reports of surveys:
Percentages from this survey are considered to be accurate within 3.5 per-
centage points, nineteen times out of twenty.
A reasonable interpretation of this statement is true in SRS if n is such
that
P(lps - PI ~ 0.035) ~ 0.95 (3.37)
for any P in the expected range of population proportions. Suppose P
can range throughout [0,1]. For N large, Var(ps) ~ *P(1-P). Assum-
ing an approximate N (0, 1) distribution for (Ps - P) / JVar(ps) leads
to setting 0.035/ JVar(ps) ~ 1.96, and obtaining n ~ (d.O~s)2 P (1 - P)
for all P, or n ~ (oll36S)2 X 0.25 = 784. Thus the statement is true if n
is about 800 or greater.
In general, asymptotic normality of an estimator makes its variance
a suitable indicator of the widths of its probability intervals. Precision
requirements can then be expressed in terms of bounds on the variance.
If there is sufficient prior knowledge about the array y, these bounds
can be translated into implications for sample structure and sample size.

3.6 Formal Edgeworth expansions


Approximate normality of the sample sum under SRS has been proved
in Theorem 3.2 by showing that its characteristic function approaches
the normal characteristic function in the limit. The argument there did
not place a bound on the error of approximation or the rate of its con-
vergence to zero. In order to address these questions it is natural to ask
whether better approximations to the characteristic function are pos-
sible. If, moreover, these can be translated into better approximations
for tail probabilities for sample sums or functions of sample sums, we
would have improved approximations to coverage probabilities for in-
terval estimates of means and totals. Edgeworth expansions may be
66 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

viewed as one way of obtaining approximations to the characteristic


function and tail probabilities.
The usual Edgeworth expansion of a univariate probability density
f(x) (see for example Reid, 1988) is derived heuristically from the
representation

00
1
f(x) = exp L(-l)rKr Dr)
- , 1/1 (x),
r=3 r.
(3.38)

where Kr is the rth cumulant of the distribution with density f, D


represents differentiation, and 1/1 is the normal density function with

i: i:
the same mean and variance as f. This follows from the key identity

eitx Dr 1/1 (x)dx = (-itl eitx 1/1 (x)dx

= (-itYexp {Klit - ~t2}. (3.39)

i:
For then

eitx exp 1~(-lY:~ Dr ) 1/1 (x)dx

= exp 1f:(ityK~ ) exp {Klit _ K2 t 2 },


r=3 r. 2

which is the representation in terms of cumulants of the characteristic


function J~oo eitx f(x)dx.
Now suppose, as is common in applications of (3.38), that f is
the density of a quantity like .;n(X - /-L)/a for an independently and
identically distributed (i.i.d.) sample, which has mean and variance
constant as n ~ 00, and Kr = O(n l-r/2) for r ~ 3. If we expand the
exponential in (3.38), grouping together terms of the same order in n,
we have
K3 K4
f(x) = 1/I(x) { I + 3! h3(X; K" K2) + 4! h 4 (x; KI, K2)
+ ~h6(X·
2(3!)2 "
KI K2) + ... } (3.40)

where the hj (x; KI, K2) are polynomials called Hermite polynomials
and are defined by
FORMAL EDGEWORTH EXPANSIONS 67

It is easy to show that

and
hl(y; 0,1) Y = h4(y; 0,1) = y4 - 6y2 + 3
h2(Y; 0, 1) = y2 - 1 =
hs(y; 0, 1) yS - 10y3 + 15y.
=
h3(y; 0, 1) y3 - 3y
The expansion of (3.40) can be integrated tenn by tenn to give a
representation for the tail probability:

P(X~x) = \II(X)-1/f(X){K~h2(X;KI'K2)
3.
+K 4. Kr.}-I
4!h3(x,KI,K2) + 2(3!)2hs(X, KJ, K2) +o(n )
(3.41)
where \II(x) and 1/f(x) are the cumulative distribution function (c.d.f.)
and probability density function (p.d.f.) of an N(KI, K2) variate.
The two expansions (3.40) and (3.41) have been written so as to dis-
play the tenns of order n- I / 2 and n-I. Proof of the validity of(3.41) in-
volves showing that n times the difference between left- and right-hand
sides approaches 0 as n -+ 00. Under sufficient regularity conditions
this is so: in fact the error is typically of order n -3/2. Taking further
tenns in the expansion theoretically improves the approximation. When
X is a discrete variate, the expansion (3.40) is regarded as giving an
approximation to P(x ~ X < x + 8)/8 for suitable 8.
If the variate X is ..;n(X - J-L)/u, where XJ, ... , Xn are Li.d., the
validity of (3.41) requires only that the fourth moment of XI exist and
that its distribution not be concentrated on a lattice. If XI were a lattice
random variate and the lattice had span a, the possible values of X
would be shifted multiples of a /..;n, and the largest jumps in its c.d.f.
would have to be of order n- I / 2 • Since the right-hand side of (3.41) is
continuous, the error tenn could not approach zero faster than n -I /2 ,
and even the one-tenn expansion

P(X ~ x) = \II(x) -1/f(x) ;~ h2(X; KI, K2) + o(n- I /2) (3.42)

would need modification; see Feller (1971, p. 540).


If X is an appropriately scaled 'smooth function of means' such as
a studentized mean (as in Section 3.8), it may be the case that its mean
KI is of order n- I /2, its variance K2 is 1 + P where p is of order n- 1,
68 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

and for each r ::: 3 the cumulant Kr is of order n l - r / 2 • In such a case


we may obtain expansions of the same type whose terms are somewhat
more complicated. For the one-term expansion we have

P(X ::5 x) = <I>(x) - fjJ(x) (KI + ;~ h2(X; 0, 1») + o(n- I/2) (3.43)

and for the two-term expansion

P(X::5 x) = <I>(x) - fjJ(x) [ (KI + ;~ h2(x; 0,1))


+( K2 - I)
2
+ Kf hI (x; 0, 1) + (K4 K3KI)
4! + 3! h3(X; 0, 1)

K
3 • _I
+ 2(3!)2
2 hs(X, 0, 1) ) ] + o(n ), (3.44)

where <I>(x) and fjJ(x) are respectively the N(O, 1) c.d.f. and density
function. One way of deriving these expressions is to expand

X - KI)
\I1(x) = P ( Z::5.jK2 , z'" N(O, 1)
and t/I(x) about KI = 0, K2 = 1.
3.7 Edgeworth expansions for the distribution of the sample
sum in SRS
Finding and justifying an Edgeworth expansion for the distribution of
the sum in SRS from a finite population is a complicated task. The
distribution of the sample sum is discrete, and in fact is often a lattice
distribution. Moreover, both the sample size n and the population size N
are considered to be approaching infinity; thus there might conceivably
be more than one way of ordering terms in the expansion, depending
on the relative growth of n and N. Finally, as in the case of the finite
population central limit theorem, conditions on the triangular array of
Y values for the population units must be formulated if the expansion
is to be applied very generally in the absence of models for y.
For simplicity's sake we take y to be real to begin with. We have
seen in Section 2.5 that the first four cumulants of the centred sample
sum LjES Yj - nf.Ly are given as follows in terms of the population K
statistics:
0, n(1 - )")K2, n(1 - )..)(1 - 2)")K3,
n 2 (1 - )..)2 ( 2 K4)
n(l - )..)(1 - 6)..(1 - )"»K4 - 6 N+I K2 - Ii '
EDGEWORTH EXPANSIONS FOR THE SAMPLE SUM IN SRS 69

where).. = n/ N, and (if Yj = Yj -/l-y)


N
K2= Ly;/(N - I),
j=1
N
K3=NLy;/(N -I)(N -2)
j=1

K.=N {(N+l) t,yj-3(N-l)3KliN) j(N-l)(N-2)(N-3).


It seems most natural then to derive an expansion for the distribution of
the standardized variate W = (v'n(1 - )..))-1 Q::::jES Yj - n/l-y), which
has mean zero and variance K2. Note that if the YI, ... , YN came from
a realization of an i.i.d. sequence with finite variance, the variance of
W would be bounded in probability as the population size N increased
indefinitely. The third and fourth cumulants of W are given by
N-1/2Y3K3 and N-1Y4K4 - 6(Ki - K 4/N)/(N + I),
where Y3 = (1-2)..)/,J)..(1 -)..) and Y4 = (1-6)..(1-)..»/)..(1-)..) are
the third and fourth cumulants of a standardized Bernoulli()..) random
variate.
Provided I - ).. remains bounded above zero as n and N tend to
00, and provided K 2 , K3 and K4 remain bounded, the third and fourth
cumulants of Ware of order n- I / 2 and n- I , respectively. Appropriate
further conditions on the higher-order cumulants would yield the one-
term expansion

P(W ~ w) = 'I1(w) -1f;(w)h 2 (w; 0, K2)N- 1/2 ~~ K3 + o(n- I / 2),


(3.45)
where II1(w) and 1f;(w) are the c.d.f. and density for N(O, K 2 ), and the
twa.term expansion

P(W ~ w) =

(3.46)

An approach to establishing these expansions rigorously begins with


the representation of (3.21) of the characteristic function ofthe sample
70 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

sum. A series approximation is derived for the characteristic function

= D[(1 - )..)eiA(u+lYi) + )..ei(I-A)(U+lYi)]


N
¢(U, t)
j=1
of the integrand, and this expansion is then integrated together with
bounds on its error. Robinson (1978) has justified the expansion (3.46)
under a technical condition which ensures that the values of Y 'do not
cluster around too few values'. The condition is satisfied in probability
when YI, ... ,YN is a realization of an i.i.d. sequence having finite fifth
moment and a density; it is not satisfied in the lattice case. The bound
given by Robinson on the error of the ex~ansion is of the form BoAs
where Bo is a function of).. and As = Lj=1 IYjlS /(Lf=1 YJ)S/2.
Noting that Robinson's condition is difficult to interpret and to verify,
Babu and Singh (1985) have proved the validity of the single-term
Edgeworth expansion (3.45) under the conditions that
I
FN(y) = N x (number of Yj :s y)

i:
converges weakly to a strongly non-lattice distribution, the moment

lyl3+ 8dFN(Y)

is bounded, and).. = n/N :s I - ;) as N -+ 00 for some;) > O.


They have also provided a modified one-term expansion for the case
when the population Y values lie on a lattice, and a multivariate one-
term expansion for the non-lattice case when Y is p-dimensional: the
expansion of the density for the p-dimensional variate

w= (In(1 - )..»-1 (LYj - nILy)


jes

is

1/I'EN(W) - N- 1/2 [L
IfJl=3
fJ\ (~ tyj) DfJ1/I'EN(W)] +o(n- I/2),
J=I
(3.47)
where 1/I'EN is the joint density for MV N(O, I: N ) and I:N is the co-
variance matrix of W. Notationally, if fJ = (fJt, ... , fJp) is a vector of
non-negative integers, in the above expression IfJl = fJt + ... + fJp,
{J! = fJl! ... fJp!, DfJ = Dfl ... D~P, xfJ = Xfl ... x~p.
EDGEWORTH EXPANSIONS FOR THE STUDENTIZED MEAN IN SRS 71

3.8 Edgeworth expansions for the distribution of the studentized


mean in SRS

Since confidence intervals for the population mean are usually based on
the studentized sample sum (or sample mean) rather than on the stand-
ardized sample sum, it is also of interest to be able to approximate the
distribution of that quantity in SRS.
We may begin by finding the cumulants of the studentized sample
mean. Denoting the sample mean as before by Ys> let

t = r===vIn==n(ji~-=s=-=/L=y)=== (3.48)
1", -2
(l - J..) n _ 1 4-)Yj - Ys)
]ES

where J.. = n/ N. Then

fiI t ZI (3.49)
y;;-:::t = [ 1 + .JT=I Z2 (1 - J..)ZI] 1/2
(T
vIn - - ---=---=-
(T2 n(T2

h
were (T
2 - .1
-
"N -2 Z _ .;n<Y,-J.Ly) Z _ .;nz,
N ~j=IYj' I -
. _ -2 _ 2
JI-I ' 2 - JI-I' Z] - Yj (T
and Yj = Yj - /Ly. Taking J.. as fixed and expanding (3.48) in powers
of n- I / 2 gives

filt =
y;;-:::t
(3.50)

Taking moments of the right-hand side of (3.50), then converting to


moments of t, gives after much computation

K2 = Var(t)

(3.51)
72 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

K4 = ~(1 + A) + K4 [_1_ - 3(1 - A)]


n a 4n 1 - A
K2
+--+[12 - 61..] + R 4 ,
an
where the terms of RI, ... , R4 are of order O(n- 3/ 2 ) and higher. From
these expressions and (3.44), the Edgeworth expansion of the distribu-
tion of t can be found to be
1 1
pet ::::: x) = <I>(x) + ~ql(x)</J(x) + -q2(X)</J(X)
'\In n
+higher-order terms, (3.52)
where

3!
1
.Jf-=-5:
1- Aa
K3
3 «2 - A)x 2 + 1 - 21..)

q2(X) =
K4
x ( A2(x) a 4
Ki + C
+ B2(X)~ )
2(x)

= (1 3 + A) (1 - A 1) 2
A2(x) 8(1- A) - -8- + -8- - 24(1 _ A) x

B 2 (x) ( 27
72 +
15)" 15)
72 - 72(1 - A)

1 14 10) 2
+ ( -4 + 721.. + 72(1 _ A) x

--1 ( 3-1..+--
1 ) x4
72 1- A

C2(X) = 51..;3 -C:A)x2.


This approximation can be inverted to produce an approximation for
(say) an upper 100(1 - a)% confidence limit for /l-y (see Hall, 1988):
Ys + n-I/2~ S{ZI_a + n- I /2ql (ZI-a) + n- 1q21 (ZI-a)}, (3.53)
where
/ 1 2
q21 (x) = ql (x)ql (x) - 2"x(ql (x» - q2(X)

and ZI-a is the 1 - a quantile of the N(O, 1) distribution. Experience


has shown that limits obtained from (3.53) do not have particularly
good coverage accuracy in small samples, even in the case A = 0,
K3 = K4 = O. However, the cumulant expressions do give an indication
of the aspects of the distribution most likely to affect the adequacy of
SADDLEPOINT APPROXIMATIONS 73

a normal or t approximation to the distribution of t. Clearly K3 has


the most influence since it determines the leading term of the bias KI
and the skewness K3 of t, and occurs in the expressions for K2 and K4.
A non-zero K3 has the effect of increasing the variance and kurtosis
of t. When )" is small a non-zero K4 affects mainly the kurtosis of t,
making it smaller if K4 is greater than zero. When )" is a moderately
large proportion, the bias is diminished because of the factor (1 _),,) 1/2,
and the variance is also less than it would be in the case N = 00.
The Edgeworth expansion in (3.52) is a formal one; that is, no bound
has been placed on the 'higher-order terms'. Babu and Singh (1985)
have shown the validity of the corresponding one-term expansion

pet ~ x)=¢(x)+ ~ K3 «2-),,)x 2+(1-2A))¢(x)+o(n- 1/ 2 )


3!Jn I-A a3
(3.54)
under the conditions that

is bounded and
FN(y)
I
= N(no. f -
0 Yj ~ Y)

converges weakly to a continuous distribution.

3.9 Saddlepoint approximations


The usual saddlepoint approximation to the density of a real variate X
comes from the inversion formula

I(x) = - 1 foo e-ixt¢(t)dt, (3.55)


2rr

i:
-00

where ¢(t) is the characteristic function

¢(t) = eixt I(x)dx = exp{Kx(it)} (3.56)

and KxO is the cumulant generating function of X. Changing the


integration variable to T = it gives I(x) as an integral along the
imaginary axis of the complex plane:

I(x) = -I. fioo


e-X; eKx("r)dT.
2rrl -ioo
If the integrand is analytic and has no singularities in the strip 0 ~
Re T ~ a, (or a < Re T < 0 if a is negative) and if it is negligibly
74 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

small within the strip but far from the real axis, the Cauchy-Goursat
theorem implies that also for the real constant a,

(3.57)

so that
f(x) = _1_ tx)
e-x(a+ib)eKx(a+ib)db, (3.58)
2JT Loo
where here both a and b are real. Expanding Kx(a + ib) about a gives
Kx(a + ib) = Kx(a) + ibK'x(a) - (b 2 /2)K';(a) + ...
and
eKx(a+ib) = eKx(a)+ibKxCa)-p{K;Ca)(I + R(a, b» (3.59)
with R(a, b) being expandable as a series in higher powers of b. In
applications X is typically a member of a sequence of variates with
density depending on a parameter n, and for which R(a, b) tends to 0
as n ~ 00.
If a is now selected to satisfy
K'x(a) = x, (3.60)
the integrand of (3.57) has a saddlepoint at a, and the approximation
following from (3.59) in (3.58) is
f(x) ~ _1_ roo
e-xaeKxCa)-(b2/2)K;Ca)db (3.61)
2JT 1-00
or
1
f(x) ~ f,'L[K';(a)r l/2 exp{Kx(a) - xa}. (3.62)
...,2JT
Note that the value of a selected depends on x, the argument of the
density to be approximated. Note also that the approximation need not
integrate to 1; in fact, it is usually improved by normalization.
The density approximation (3.62) works well when Kx and its
derivatives are of the order of n, for then the main contribution to the
integral (3.61) comes from b in an interval of form (-C/v'n, C/v'n);
over such an interval the terms of R(a, b) are 0(1). For example, if X
is the sum of a sample of n Li.d. observations with cumulant generating
function Kt. then (3.62) gives
1
f(x) ~ f,'L[nK~(a)rl/2 exp{nKI (a) - xa}
...,2JT
where a satisfies
SADDLEPOINT APPROXIMATIONS FOR SRS 75

From this is derived the well-known approximation to the density of


the sample mean:

gn(x) ~ [2rr:;'(a)f /2 exp{n(K\(a) -xa)},


where K;(a) = x. Daniels (1954) showed that the ratio of the true
density gn (x) to its approximation is typically I + 0 (~ ).
An alternative view of the approximation in general comes from the
fact that since
Q(a + ib) = E[exp(a + ib)X] = exp[Kx(a + ib)]
then Kx(a + ib) - Kx(a) as a function of b is the logarithm of the
characteristic function of the variate with 'exponentially tilted' density
v(w) = e aw f(w)/Q(a).
If a is chosen to satisfy the saddlepoint equation (3.60), the approx-
imation (3.61) amounts to taking the tilted density to have mean x,
and to be approximately normal with mean x and variance K~(a). The
normal approximation to v(w) at w = x is then transformed to pro-
duce an approximation for f at x. More refined approximations can
be obtained by similarly transforming higher-order Edgeworth approx-
imations to the density v. Since the choice of a is specific to x, the
performance of these approximations, unlike that of ordinary Edge-
worth expansions, does not depend on how far x is situated from the
centre of the distribution of X.
As in the case of the Edgeworth expansion, here also if X is discrete
the right-hand side of (3.62) can be regarded as an approximation to
P(x ::::: X < x + 0)/0 for suitable o.

3.10 Saddlepoint approximations for SRS


3.10.1 Approximations to the distribution of the sample sum
A modification of the saddlepoint approximation of Section 3.9 for the
distribution of the sample sum under SRS was proposed by Robinson
(1982). It involves using the same (complex Laplace) approximation
technique twice, once to evaluate the integral which defines the charac-
teristic function (see (3.21», and once to invert the result to obtain an
approximate density. A heuristic derivation along the lines of Robin-
son's follows.
Again in this section we assume a real variate y, and let h = Yj - /-Ly.
76 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

To begin with it follows from (3.21) that if a and b are real, and a is
any real number, we have

Q(a + ib) = E(e(a+1'b)"L.je.Yj)


- = q, (a + ib)
-i-

= _1_1" fI
2rr B _" j=1
{(l - A)e-J.[{j+i~jl + Ae(l-J.)[{j+i~jl}du,
(3.63)
where A = n/N and B = (~)An(1 - A)N-n as before, and the real and
imaginary parts of the argument of K are
Sj = ah + c, ~j = bYj + u.
This expression can be written as

Q(a + ib) = 2rr1B 1" J]


_" N exp[K(Sj + i~j)]du, (3.64)

where
K(z) = 10g[(1 - A)e-J.z + Aeo-J.)Z]. (3.65)
Then the integrand exp(E K(sj + i5j)} is approximated by

(In this section E will denote summation over j from 1 to N.) We can
now choose c = c(a) to make the integrator u vanish from the second
term in the exponent. That is, choose c(a) to satisfy
L K' (aYj + c(a» = O.
From this point on in the derivation, Sj = aYj + c(a).
Letting u = 1/f/.;n, we can rewrite the logarithm of the integrand of
(3.64) as approximately

"'" K(T.) _ ~ {.Jr + .;nb Eh K "(Sj)}2 E K"(sj) + ibm _ ~b2a2


L.J '>J 2 'I' EK"(sj) n 2'
where
m = LhK'(sj),
a 2 = [Ly;K"(Sj) - (Lh K"(sj»2/(LK"(Sj))]. (3.66)
Integrating the approximate integrand with respect to Jnd1/f, and taking
SADDLEPOINT APPROXIMATIONS FOR SRS 77

1{r to range from -00 to 00, so that we are in effect integrating over a
normal density, we obtain

Q(a+ib) ~ ~B (LK"(~j»-1/2exp {LK(~j)+ibm-~b2(12}.


(3.67)
Having this expression gives us a way of approximating the charac-
teristic function of a 'tilted' version of the distribution of the centred
sample sum. That is, if
F(w) = P(Ljij ::: w),
jES

let
(3.68)
noting thatQ(a) = eawdF(w). Then the characteristic function for the
tilted distribution V is

Qv(b) = f eibYdV(y) = Q(a + ib)/Q(a).


From (3.67) it is seen that this is approximately the characteristic func-
tion of an N(m, (12) distribution.
Now to find an approximation to the 'density' f of the centred
sample sum, we note that for a given value of w, we have for general
a
aw -(w-m)2/2u 2
f( w) '"'-
_ Q(a)e-~ e .
...,2rr(1
The parameters m and (12 depend on a through ~j.
If we choose a so that m(a) = w we now have the saddlepoint
density approximation for the centred sample sum:
1
f(w) ~ 2rr B(1 (L K"(~j»-1/2 exp{L K(~j) - awl (3.69)

where K is given by (3.65),


(12 = [LjiJK"(~j) - (LjijK"(~j»2/LK"(~j)]

~j ah +c
and a and c satisfy the sadd/epoint equations
L K'(ajij + c) = 0
(3.70)
78 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

The corresponding approximate density for the sample mean Ys


= (LjES Yj)/n is

g(jis):::: 21l'~(1 (LK"(~j))-1/2exp{LK(~j)-na(jis-/Ly)} (3.71)

with the same (12, ~j, c, a; the second saddlepoint equation is written
LyjK'(aYj + c) = n(jis -/Ly).
It is interesting to note that the integral in (3.64) is actually the joint
'complex moment generating function'
E [e(a+ib)(Lje, YJ )+(c+iu)(n(s)-n)]

(cf. (3.21» under Bernoulli sampling, integrated over u from -1l' to 1l'
so as to make it conditional on n(s) = n. It is not difficult to show that
the density approximation (3.69) can be obtained as follows. Find a
bivariate normal approximation to the appropriately tilted joint density
under Bernoulli sampling of LjES Yj and n(s)-n; the tilting parameters
are a and c. Then use the conditional (normal) distribution of the first
component, given that the second one is zero, as the approximation to
the V distribution of (3.68).
Wang (1993) has developed a direct saddlepoint approximation which
is based on a method for computing the cumulant generating function
of LjES Yj exactly for small n. It provides an alternative to Robinson's
method in the approximations to tail probabilities which follow.

3.10.2 Approximations to tail probabilities


There are three possible approaches to obtaining tail probabilities of
the form

or

jES jES
from the density approximations (3.69) and (3.71); see Daniels (1987).
The first, which we shall not describe in further detail, is simply to
integrate the normalized density approximation numerically.
The second approach, used by Robinson (1982), can be motivated
by going back to the step preceding (3.68), and approximating the
tilted distribution V described by (3.68). We consider here the simple
N(m, (12) approximation with m, (12 given by (3.66). Since
dF(w) = Q(a)e-aWdV(w),
SADDLEPOINT APPROXIMATIONS FOR SRS 79

it follows that

F(w) c:::: Q(a)e-ma+a2u2/2<1> (w - ma+ aa 2 )

and

1- F(w) c:::: Q(a)e-ma+a2u2/2 (1 _ 2


<I> (w - ma+ aa ) ) ,

where <I> is the standard normal c.d.f. Choosing a according to

LjijK(~j) = w,
so that m = w, gives

_1_(" K"(~.»-1/2 exp{" K(~.)


~B ~ J ~ J

-aw + a2T
0'2 }
(1 - <I>(aa)), w > O.
(3.72)
Robinson (1982) has shown that this approximation should typically
have an error of order O(n- 1/ 2 ). Taking a one-term Edgeworth expan-
sion for the approximate V distribution improves the order of approx-
imation.
The third approach is based on the fact that the saddlepoint density
approximation for the centred mean is of the form
cb(rJ} exp{l (I])}

where
/(1]) = LK(~j) -nal],

cb(l]) = 27r~a (LK"(~j))-1/2.


Approximating the distribution function uses an approximation of
Temme type, closely related to the Lugananni and Rice (1980) tail prob-
ability approximation; see, for example, DiCiccio and Martin (1991)
and Wang (1993).
The resulting lower tail probability approximation for the sample
80 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

mean is

P(js - JLy :5 '1) ~ <l>(r) + <p(r) [~ ~


- IN)..(l - )..)1/2]' (3.73)
<1(LK"({j»)

where <p and <l> are the standard normal p.d.f. and c.d.f., K is given by
(3.65),
r = r('1) = sgn('1}(2[na'1- LK({j)])I/2,
<12 = Ly;K"({j) - (Lyj K"({j»2/(L K"({j»,
and {j = aYj + C and a and c are calculated from
LK'({j) = 0
LyjK'({j) = n'1.
Extensions of the approximations of this section to the case of vector-
valued y are straightforward.

3.11 Use of saddlepoint in constructing confidence intervals in


SRS
Suppose that X is a real variate whose distribution function G depends
on a real parameter (), in such a way that for fixed x, the value of
G(y; () = P(X :5 x; () decreases as () increases. Then for an observed
value xO we can define a 100(1 - a)% level upper confidence limit u e
as the largest value of () such that
G(xo; () = P(X :5 xO; () ~ a
if this exists; cf. the definition in Section 3.2 on proportion estimation.
e
The limit u , which is a function of the data, is exact in the sense that
P(eu ~ (); () ~ I - a,
and eu is not uniformly greater than any other monotone function of
xO having this property.
If we try to apply this idea to the construction of a confidence in-
terval for the finite population mean JLy, it is natural to try to use the
sample mean Ys as a basis. Unfortunately, in general the distribution
of Ys depends not only on JLy but also otherwise on the unsampled
Yj values. The upper limit for JLy corresponding to u above woulde
USE OF SADDLEPOINT IN CONFIDENCE INTERVALS IN SRS 81

depend not only on the observed Ys but also on the unobserved part of
the population array.
One way of adapting the method to the finite population problem,
which is essentially the estimation of a nonparametric mean, is to ima-
gine for each possible value of 0 = i-Ly a suitable population array y(O)
which is consistent with the observed sample and which has a mean of
O. Given an observed sample mean y~, we would then define eu as the
greatest value of 0 for which
P(jis ~ y~;y(O» ::: a. (3.74)

e
That is, u would be the greatest value of 0 such that the test array y(O)
is not rejected by a one-sided a-level significance test. With appropriate
choice of the conceptual array y(O), we would hope to achieve close to
the nominal coverage probability 1 - a by this method. A saddlepoint
tail probability approximation could be used to provide an estimate of
the left-hand side of (3.74) for each O. Because of its interpretation
e
in terms of significance testing, u would be called an inverse testing
confidence limit.

3.11.1 Construction a/the test array


One way of constructing the population array y(O) would be as follows.
Let yl, ... ,yr be the distinct sampled values of y, occurring respec-
tively ai, ... , ar times. Define population weights WI, ..• , Wr with the
intention that in the array y(O) there would be Wi occurrences of yi for
1 ~ i ~ r. Thus we would aim to have the weights Wi = Wi(O) satisfy
r

(i) LWi = N
i=1
r

(ii) LWiyi = NO.


i=1
We would then require that the population distribution of y be close to
the sample distribution in the sense that

L
r
(iii) the Kullback-Leibler distance Wi log( Wi I ai) is minimized.
i=1
Conditions (i), (ii) and (iii) are satisfied when
r
Wi = NpilLPi, (3.75)
i=1
82 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

where
Pi = ai exp{t(9)/} (3.76)
and t = t(9) is the unique solution of
r r r
9 = Lai/elyi/Laie'Y' = LWi//N. (3.77)
i=1 i=1 i=1
Of course the weights Wi satisfying (3.75)-(3.77) exactly will not in
general correspond to an array y(O); the weights actually used would be
approximations of these which fulfilled the further condition of being
positive integers greater than the corresponding sample frequencies.
Related ideas are seen in the 'scale load' approach of Hartley and
Rao (1968; 1969), the 'empirical likelihood' approach of Chen and Qin
(1993) and the 'conditional maximum likelihood' approach of Bickel,
Nair and Wang (1992).

3.11.2 Approximation of the tail probabilities


The saddlepoint equations for the approximation of the left-hand side
of (3.74) can now be specified. For given 9, the tilting parameters c
and a will satisfy
r
L wiK'(a(/ - 0) + c) = 0,
;=1 (3.78)
r
L Wi(/ - 9)K'(a(/ - 9) + c) = n(Y~ - 0),
i=1
where K(z) = 10g[(1 - ),.)e-lZ + ),.e(l-l)z], and for these values of c
and a the approximation

_ -0 [I I
IN),.{1 - ),.)
P(Ys :::: Ys ;y(O» ~ <I>(r) + tjJ(r) ;: - -;; (7(L~=1 wiK"(~;»1/2
]

r~ sgn(Y~ -8) (2 [na(Y~ -8) - t WiK(~i)]r2,


= L Wi (/ -
r
(72 9)2 K" (~i) (3.79)
i=1

- ( t Wi(yi - 0)K"(~i»)2/(t WiK"(~i»),


1=1 1=1
~i = a(yi - 0) +c
MONETARY UNIT SAMPLING 83

(based on (3.73)) might be used in (3.74). Solving (3.74) for e would


then complete the process of calculating 8u . In practice we would treat
(3.78) and (3.79) as a system of equations for c, a and e.
Alternatively, the left-hand side of (3.74) could in principle be cal-
culated exactly for appropriately chosen 8 by repeated simulation of
SRS from the population with array y(e).
The Edgeworth expansions of Sections 3.6 and 3.7 can be used to
show the approximate validity of the inverse testing method for SRS.
In fact, it can be shown, in arguments like those of DiCiccio and Ro-
mano (1990), that 8u approximates (to within a term op(n-I) under
appropriate conditions) the exact upper limit

(3.80)

based on the c.d.f. of the studentized mean. Here J- I ( ; y) is the inverse


function of the c.d.f.

5,y) = P {
J(t:. In Ys -~ Jly -<
'1'"" 1:.
5,y } (3.81 )
VI - A an
and
~2 n- 1 N 2
an = -n-N _lsy.

Note that since the exact computation of 8EX(1 - a) requires that we


know the unsampled as well as the sampled values of y, it is not strictly
speaking a confidence bound, but an ideal to be approximated.
The possibility of extension of the inverse testing method to designs
other than SRS poses some interesting problems.

3.12 Monetary unit sampling

In monetary unit sampling, the target population consists of a set of


accounts, each of which represents a cluster of dollars or other currency
units. The number of dollars x j nominally represented by account j
can be called the book value of account j. A sample of the accounts is
selected by taking a random or systematic sample of individual dollars
from the union of the account clusters, and then looking at the accounts
in which those dollars appear. For example, we might visualize N = 8
accounts with a total book value of $12 000 as being laid out like this:
84 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

Account no. 2 3 4 5 6 7 8
Book value
in dollars 1000 1000 2000 2000 2000 1000 1000 2000
Dollar 0001 1001 2001 4001 6001 8001 9001 10001
labels -1000 -2000 -4000 --6000 -8000 -9000 -10 000 -12000
For a sample of n = 3 accounts we might take a systematic sample
of three dollars from {l, ... , 12 OOO}. If for example the dollar sam-
ple were {2271, 6271, 10271}, the account sample would consist of
accounts {3, 5, 8}.
If we begin with a random ordering of the accounts, this method
with a systematic dollar sample is essentially the Hart1ey-Rao random
systematic procedure of selecting accounts with inclusion probabilities
proportional to size (Section 2.8).
Now suppose that the nominal Jth account value differs from the
actual value, so that the account has an error (book value - actual
value) of Yj, and we need to estimate the total error

The error values are determined for the sampled accounts. The point
estimator frequently used for Ty in the literature is the 'tainted dollar'
estimate
1 " Yj (3.82)
- ~Tx-,
n jES Xj
where x j is the book value of account J. This reduces in our terminol-
ogy to the HT estimator
1'y = LYj/llj, (3.83)
jES

11:jbeing nx j / Tx. The problem of finding appropriate confidence bounds


for Ty or for ILy = Ty/ N is of long standing, because the distribution of
iy is not likely to be close to normal for small samples. Ideally, at least,
many of the Yj should be 0, with only a few of the accounts having
appreciable errors. For a history of the use of statistics in auditing, see
Tamura (1989). Cox and Snell (1979) have discussed the problem of
estimation of rare errors in this and other areas of application.
Bickel (1992) has examined a number of methods for finding upper
confidence limits, including an inverse testing method which is close in
spirit to those considered in Section 3.11. One way offormulating such
an idea is for each test value ILo of ILy to construct a new population
BOOTSTRAP RESAMPLING METHODS 85

of Tx dollars, and to associate with each new dollar one of the sampled
'taint' values rj = YjJ-Lx/Xj. If Wj denotes the number of dollars now
to be associated with taint value rj, we require that

(i) LWj = Tx ,
jES
(3.84)
(ii) LWjrj/Tx = J-Lo,
jES

and (iii) Wj is close to XjTx/ ~jES Xj = Aj in the sense of minimizing

L Wj log(wj/ A j ).
jES

This gives
Wj = TxAjetTj / L AjetTj , (3.85)
jES
where t is chosen so that (3.84) is satisfied. We then resample repeatedly
at random from the new population of dollars and taint values, suitably
arranged, and each time we compute a new estimate 7;,*. The original
t can then be compared with the empirical distribution of {T/} to see
whether J-Lo should be below or above the confidence bound.

3.13 Bootstrap resampling methods for confidence interval


construction

In the final section of this chapter we discuss the application of boot-


strap resampling as another way of obtaining refined confidence inter-
vals or sets. Since we will return to it in Chapter 4 as a method of con-
structing intervals for complex population functions, we will introduce
the idea in terms of a general population function here. An overview
of the topic as a whole was provided by DiCiccio and Romano (1988).
We will suppose, to start, that the sample data are

as is certainly appropriate for fixed size sampling designs like SRS


without replacement. However, for with-replacement situations analo-
gous arguments can be made in terms of

x: = (U[, Yh)'···' Un, Yjn))'


with Ui' Yj) representing the result of the ith draw.
86 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

3.13.1 The idea of the bootstrap


Let e(y) be a real-valued population function. The bootstrap provides
confidence intervals for e(y) through a sample-based estimate of the
distribution of a root function
R(Xs, e(y)). (3.86)
For example, if e(y) = f.iy, in a simple random sampling context we
might consider R to be the scaled mean difference In<Ys - f.iy), or to
be the studentized mean difference (Ys - f.iy)IJv<ys), where v<Ys) is
an unbiased estimator for Var<ys). Let H denote the c.d.f. of R, so that
for eachy
H(~;y) = P(R(Xs. e(y)) ~~) (3.87)
(cf. J of (3.81». Now let
H-I(a;y) = inf{~ : H(~;y) ~ a}

denote an ath quantile of the distribution of R. Then


lEx = {e: R(Xs, e) ~ H-I(a;y)} (3.88)
can be thought of as an exact (1 - a)-level confidence set for e(y) (cf.
(3.80».
Being a function of y as well as Xs, lEx is not actually a confidence
set in the usual sense. (If it happens that e actually indexes the possible
y arrays, as in the case of estimating a population proportion, then
1* = {e : e E lEx for the y satisfYing e(y) = e} (3.89)
would be an exact (1 - a)-level confidence set which is a function of
XS only.) However, because lEx has the correct coverage probabilities,
it is natural to try to construct an interval estimate for e by estimating
lEx from the sample. This means replacing H-I(a;y) in (3.88) by
an appropriate estimate. If iI(~) denotes a sample-based estimate of
H(~;y) and iI-I (a) is its ath quantile, then the approximate (1- a)-
level confidence set will be
A A_I
1= {e : R(Xs. e) ~ H (a)}. (3.90)
The set j is termed a bootstrap confidence set or interval.
REMARK 3.1: It is clear that a bootstrap confidence set is likely to
work best when the distribution of R(Xs, e(y)) depends very little on
y. Thus when e(y) is the mean f.iy of a real variate y, the use of
the studentized mean difference for R(Xs. e(y)) will usually be more
appropriate than taking R to be the scaled mean difference In <Ys - f.iy).
BOOTSTRAP RESAMPLING METHODS 87

(The 'inverse testing' confidence intervals in Section 3.ll were based


on the mean difference, but represented an attempt to approximate hx
by the interval 1* of (3.89) for a 'least favourable one-dimensional
subfamily' of the possible arrays y.)
The most common way of implementing bootstrapping in the non-
parametric i.i.d. case - which corresponds to simple random sampling
with replacement - is to regard H(;; y) as a function H(;; F) of the
c.d.f. F (or FN in the finite popUlation context) of the y variate on a
single draw. If Fn is the sample c.dJ., we take the estimate of H to be
(3.91)
Confidence sets obtained using f:I are considered to be 'valid' if the
estimated distribution f:I(~) of R and the true distribution H(~; F) of
R both approach the same non-degenerate distribution function in the
limiting case as n ~ 00. This is true in the simple random sampling
with replacement case, under weak conditions, for the scaled mean
difference and the studentized mean difference.
Let us consider the studentized mean for i.i.d. sampling from a distri-
bution F more closely. That is, let Xs' = (XI, ... , xn) be the vector of
n observations, so that for with-replacement sampling Xi would be Yjp
the ith Y value drawn, and F would be the population y distribution.
Let
T(x s" fJ,F) = .jii(i - fJ,F)/Sx,
where s; is the sample variance L:7=, (Xi - i)2/(n - 1) and fJ,F is the
mean of the F distribution. For the distribution function H(~; F) of T
we have the one-term Edgeworth expansion

H(~; F) = <l>(~) + ~_1_[2~2 _ I]EF(x - fJ,F)3 ¢(~) + o(n- I / 2 )


3! In a 3 (F)
(3.92)
where a 2 (F) = VarF(x). Now let Fn be the sample c.d.f., and suppose
x;. = (xi, ... ,X;) is a new i.i.d. sample from Fn. Conditional on
the original sample, the one-term Edgeworth expansion for f:I(~), the
distribution of T(x;.; fJ,ft) = T(x;.; i), is
All E· (x - i)3
H(~) = <l>(~) + -31 r.;[2e - 1] Fn A ¢(;) + o(n- I / 2 ). (3.93)
. 'In a 3 (Fn)
Both H(~; F) and f:I(;) approach the N(O, 1) distribution as n ~ 00.
It is not difficult to see that the difference between H(~; F) and f:I(~) is
typically of order 0 p (n -I) unconditionally. (The difference between the
corresponding distributions of the scaled mean difference is Op(n- I / 2 ).)
88 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

=
Thus in regular i.i.d. cases, using fI(~) H(~; Fn) to provide quant-
iles in (3.90) will give valid confidence intervals for the population
mean, a smaller order of error being obtainable with the studentized
mean difference as root. The distributions fI(~) can be obtained for
example by Monte Carlo (re)sampling, that is repeatedly drawing i.i.d.
samples xi, ... ,x;from Fn.
Coming to the case of SRS without replacement, it is natural again
to try to estimate the distribution of the root from the sample. Let
Ys = {Yj : j E s} be the Y values in the sample, and let jis denote
the sample mean. Here we could write the distribution H(~;y) of the
studentized mean

as H(~; FN). One option for estimating this distribution is simply to


replace FN by the sample c.d.f. Fn. Ifn divides N, this estimated distri-
bution can in principle be found by constructing a 'pseudopopulation'
in which k = N In units have Y value Yj for each j E s, and then
repeatedly taking SRS without-replacement samples (n draws) from
the pseudopopulation. This method is referred to in the literature as
the BWO method, or the without-replacement bootstrap. Booth, But-
ler and Hall (1994) have studied second-order properties of confidence
intervals constructed by extensions of the method.
To check the 'validity' of the without-replacement bootstrap means
essentially to check the closeness of the distributions of the root T
under FN and Fn in a limiting context. We have seen in Section 3.7 that
closeness of these distributions is tied to the closeness of the first few
cumulants of the root, which are functions of the first few cumulants of
the scaled mean difference. Thus it is useful to consider the cumulants
of D = .jn(jis - /Ly), and the cumulants conditional on the original
sample of
D* = .;n(ji* - jis),
where ji* is the mean of Y in an SRS of n draws from the pseudopopu-
lation. Simple arguments show that moments of D* conditional on the
original sample are

ED* = 0
Var(D)* ( 1 - n)
- -N- -n-12
-s
= N N-l n y (3.95)
E(D*)3 ( _ ~) ( _ 2n) N 2 (n -l)(n - 2) k
= 1 N 1 N (N _ l)(N _ 2)n 2 3.
BOOTSTRAP RESAMPLING METHODS 89

where k3 is defined in (2.63). Since unconditionally

ED 0
Var(D) = (l - ; ) S; (3.96)
ED3 = (1-~)(I-~)K3'
the second and third cumulants of D* from (3.95) are biased as estima-
tors of the corresponding cumulants of D. However, the biases in these
Op(1) quantities, and in the fourth cumulant of D* not shown here,
are of order Open-I) for n large. These results suggest that the boot-
strap confidence intervals are indeed valid asymptotically for this case.
A theorem along these lines has been given by Bickel and Freedman
(1984).
The bootstrap technique can be extended in an obvious way to strat-
ified random sampling. However, if the sample sizes nh within strata
are small, as is often the case in practice, the bias in the variance of

as an estimator of the variance of

D = JC;,CYst - iLy),

Fn being a scaling factor analogous to In in the SRS case, can be


substantial. From (3.95), the hth term in the variance of D*, condi-
tional on the sample, is (nh -l)/nh times the hth term in the unbiased
estimator of Var(D). Thus if the nh remain bounded while the number
of strata tends to infinity, the bias will persist even though the total·
sample size approaches infinity.
Several ways of correcting for this bias in stratified random sam-
pling have been proposed. Rao and Wu (1988) have proposed a rescal-
ing technique which will be described in Section 4.2.6. Sitter (1992a;
1992b) has considered adjusting the sizes of the pseudo strata and the
numbers of draws in res amp ling. He has also proposed the 'mirror
match' method, which works in the simplest one-stratum case like this.
Suppose that m = An and A-I = N / n = k are integers. Suppose we
generate a new set of observations y; = (Yf, ... ,y:) of size n from
the original sample by taking k independent without-replacement sam-
ples of size m. Then conditional on the original sample, the first few
90 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

cumulants of ji* are


E(ji*) = ys

11 1
Var(y*) = --(1 - A)S
km
2
y
= -(1 - A)S
n Y
2

(3.97)
1 1
- - ( 1 - A)(1 - 2A)k3
k2 m 2

with
k3 = n L(Yi - ys)3 I(n - l)(n - 2).
iES

These conditional moments are unbiased estimators of the (uncondi-


tional) cumulants of Ys' Similarly, if there are two variates x and y, the
conditional covariance of i* and y* is
1 1
Cov(i*, ji*) --(1 - A)S
km xy
1
-(1 - A)Sxy, (3.98)
n
with
1
Sxy = n _ 1 L(Yi - Ys)(xi - is),
JES

which is an unbiased estimator of the covariance of is and Ys. The


method can be adapted to cases where 1I A is not an integer dividing
n; see Sitter (1992b).
Encouraged by the identities (3.97) and (3.98), and focusing on the
studentized mean as root, we might hope to approximate H (I; ; y) by
the empirically determined distribution of T (y;; Ys) of (3.94) under the
mirror-match sampling scheme. Chen and Sitter (1993) have studied
the relevant Edgeworth expansions for the distribution of the mean (in
stratified random sampling as well as SRS). Looking at the case of the
studentized mean heuristically, we note that, conditional on the original
data Xs. the one-term Edgeworth expansion of the distribution if (I;) of
T(y;; Ys) is

if(l;) = <1>(1;) + _1_,Jf=-Ik3 [31;2 - 1 - 2A (1;2 - l)J ¢(I;);


3!J1l s; 1- A
(3.99)
BOOTSTRAP RESAMPLING METHODS 91

this indeed approximates the distribution of T(ys; /Ly), which by the


argument of Babu and Singh (1985) is
H(~;y) = q,(~)

+~ Jf="I K3 [3~2 _1 - 2\~2 _I)J cf>(~) + o(n- I / 2).


3! In S] I-A
(3.100)
The mirror-match method is clearly adaptable to stratified random
sampling, where the resampling is carried out independently in each
stratum. Modifications of the BWO, rescaling and mirror-match meth-
ods for some multi-stage designs are also available (Rao and Wu, 1988;
Sitter, 1992a; 1992b).

3.13.2 Bootstrap t confidence intervals


Thus for many commonly used probability sampling designs, we arrive
at a prescription for generating refined confidence intervals for /Ly. It
is given here for the SRS case, with a resampling bootstrap. Using
something like a BWO or a mirror match technique, we resamp1e from
the original sample a large number B of times, obtaining samples Yb =
(Ybl' ... , Ybn)' b = 1, ... , B. For each b = 1, ... , B, from the bth
sample compute Tb = T(Yb; Ys) = In<Yb - Ys)/J1 - n/Nsy;" and
order the B values from smallest to largest. Then compute fI-1 (1 - a)
as the (1 - a)B-th value of n, an empirically determined quantile.
Compute fI-l(a) as the aB-th value of n. A 100(1-a)% level lower
confidence bound is now given by

~I ~
Ys - H- (1 - a)V;;(1 - A) Sy. (3.101)

Similarly, a 100(1 - a)% level upper confidence bound for /Ly would
be given by
~I ~
Ys - H- (a)V;;(1 - A) Sy. (3.102)
EXAMPLE 3.4: Suppose a without-replacement SRS of size n = 12 is
drawn from a population of size N = 48. Suppose the sample values
are
50, 57, 50, 52, 49, 39, 36, 50, 58, 47, 58, 49.
The sample mean is Ys = 49.583, while the sample variance is = s;
46.083. A two-sided 95% confidence interval for /Ly assuming an ap-
92 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING

proximate N(O, 1) distribution for T(ys; i-Ly) is


[46.257, 52.910]. (3.103)
If we assume an approximate t(ll) distribution for T(ys; i-Ly) as sug-
gested in some textbooks, we obtain
[45.848, 53.320]. (3.104)
A typical bootstrap t interval, following the mirror-match resampling
procedure above with B = 1000, k = 4 and m = 3, and using
fI- I (0.975) and fI- I (0.025) in (3.101) and (3.102) respectively, is
[45.792, 51.723]. (3.105)
This interval is close to the previous two, but narrower; it is not sym-
metric about jis = 49.583, but skewed to the left, as is the sample
distribution.
As indicated earlier, the bootstrap t method can be extended in prin-
ciple to other sampling designs for which valid resampling methods
are available. Kovar et al. (1988) have reported empirical studies in
which intervals from the bootstrap t methods perform well in stratified
random sampling.
CHAPTER 4

Design-based estimation for general


finite population quantities

Up to this point the emphasis has been on properties of estimators of


finite population totals and means. In Chapter 2 the Horvitz-Thompson
(HT) estimator for the population total was introduced, as well as some
other unbiased strategies. Unbiased estimators of variance of the point
estimators were also presented for very general classes of sampling
designs. Then in Chapter 3 the limiting distributions of standardized and
studentized estimators of totals and means were examined, and it was
seen that the standard normal distribution was often a useful approx-
imation for these. Refinements potentially leading to improved coverage
properties for design-based confidence intervals were explored, mainly
for simple random sampling (SRS).
Now in Chapter 4 we propose to extend the development of design-
based confidence intervals to the estimation of nonlinear population
quantities or parameters under both simple and complex sampling de-
signs. Because we will regard the population quantities as functions
which can be expressed somehow in terms of population totals, unbi-
ased estimators for totals and their associated theory will play an im-
portant role. For the sake of simplicity, we will use the HT estimator
as the basic unbiased estimator for totals, but much of the development
could be carried through analogously for other basic estimators.
The organization of this chapter needs some explanation. First of
all, an estimating function formulation unifies the treatment of means,
ratios, and finite population parameters like quantiles which are defined
implicitly. Thus we begin in Section 4.1 with the theory for a scalar
parameter, defined as the root of a finite popUlation estimating equation.
The construction of interval estimates from a probability sample is
simple and straightforward, being based on the distribution of a sample
scalar-valued estimating function.
The more traditional approach to estimating a nonlinear population
parameter is based on approximations to the distribution of a point es-
timator. In Section 4.2 we consider the case where the finite population
94 DESIGN-BASED ESTIMATION

parameter is a smooth function of totals or means. The point estimator


is the same function of the estimates of the totals or means. We de-
scribe in detail the techniques available for estimating mean squared
errors (MSEs) of estimation. In particular, 'linearization' of the error
as a function of the component total estimates provides a fundamental
basis. Resampling methods can be applied to avoid the calculation of
derivatives and explicit use of joint inclusion probabilities. Confidence
intervals are usually based on assuming approximate normality for the
studentized point estimator, and tend to require large samples for their
validity.
Section 4.3 then treats functions of population U -statistics in a spirit
similar to that of Section 4.2.
The class of parameters treatable by the methods of Section 4.2
contains some quantities, such as variances and regression coefficients,
which are not covered in Section 4.1. Many of these are definable
through systems of estimating equations rather than single estimating
equations. Section 4.4 deals with such systems, when one component
of the parameter is the finite population parameter of interest, and the
other components can be regarded as nuisance parameters. Point estim-
ation is straightforward, but interval estimation is less so, at least in
comparison with the approach of Section 4.2. It requires determining
and using a certain combination of the system estimating functions,
namely the one which changes least with changes in the nuisance para-
meters. For this reason the method is as yet less well developed, and
has been placed at the end of the chapter.

4.1 Quantities defined as roots of simple estimating functions


The focus in this first part of the chapter will be on finite population
parameters which are defined implicitly by a population equation of the
form

L ifJj(Yj. Xj. ON) = O.


N
(4.1)
j=1

Here Yj and Xj are values of observable variates, Yj being real; the


ifJj are known real-valued functions; and ON is a real-valued quantity
defined by (4.1). Extensions to vector-valued parameters are straightfor-
ward. The associated methodology is due to Binder (1983) and Binder
and Patak (1994).
Several important finite population quantities are naturally defined
as in (4.1):
ROOTS OF SIMPLE ESTIMATING FUNCTIONS 95

(i) the population mean ILy is defined by


N
L(yj - ON} = 0; (4.2)
j=!
(li) the population ratio R of Y to x is defined by
N
L(Yj - ONXj} = 0; (4.3)
j=!
(iii) the population c.d.f. evaluated at Y is defined by
N
L(/(Yj ~ y} - ON} = 0 where I(Yj ~ Y} = I if Yj ~ y,
j=!
= 0 ifYj > y; (4.4)

(iv) the population median can be defined as the least value of ON such
that
N
L(/(Yj ~ ON} - 1/2} ~ 0, (4.5)
j=!
and this is approximately of the form (4.1).
In general the population yth quantile is obtained from the definition
(4.5) with 1/2 replaced by y.
When a population function (or parameter) is defined by (4.1), an
estimator for it can be defined as a solution of the sample estimating
equation
L<Pj (yj , Xj' O}/7Cj = 0 (4.6)
jes
where, as before, 7Cj denotes the probability of inclusion of unit j. For
any sampling design, even a so-called informative design where 7Cj
depends on y or x and y, the left-hand side of (4.6) is design-unbiased
for the left-hand side of (4.1). Thus we may expect a solution Os of
(4.6) to be close to ON for large samples when the functions tPj are well
behaved.
In example (i) above, equation (4.6) is Ljes(Yj - O}/7Cj = 0, and
the resulting estimator of ILy has the form

Os = (LYj/7Cj}/(L l/7Cj). (4.7)


jes jes

For any self-weighting design, even one which is not of fixed size, this
estimator is the sample mean Ys. If the design is self-weighting and of
96 DESIGN-BASED ESTIMATION

e
fixed size, s =.vs is the HT estimator divided by N. For simple random
sampling with replacement it is not the HT estimator divided by N, but
is more intuitively appealing because it is unbiased for ILy conditionally
on the sample size n(s) as well as unconditionally. In general, the estim-
ator (4.7) need not be design-unbiased in either sense; however, no
matter how peculiar the choice of the inclusion probabilities 7rj (recall
Example 2.5), it is error-free when all components of y are the same
- a property not shared by the unbiased estimator (L j es Y j / 7rj ) / N.
On balance, then, the use of (4.7) tends to be preferred when the Yj
values are thought not to depend a great deal on the 7rj. This preference
corresponds to a formal optimality property of (4.6), as will be seen at
the end of Section 5.5.
In examples (ii)--{iv) we obtain similarly the estimator
k = (LYj/7rj)j(LXj/7rj) (4.8)
jes jes
for the population ratio, the estimator
frs(y) = [L I(yj ::: Y)/7rj]/[L 1/7rj] (4.9)
jes jes
for the value of the population c.d.f. at y, and the estimator

frs-I (~)
for the population median.
Note that frs(y) of (4.9) behaves like a true distribution function in
that it increases from 0 to 1 as Y increases; this would not always be
the case if Ljes l/7rj in the denominator were replaced by N.
In the interesting case of length-biased sampling, where 7rj is pro-
portional to Yj, we obtain as estimator for the population c.d.f.

frs(y) = [LI(Yj ::: Y)/Yj]/[L l/Yj].


jes jes
an estimator which has also been discussed in the wider statistical
literature, for example by Gill et al. (1988).

4.1.1 Design frequency properties of the estimating functions


Let us introduce for the sample estimating function the convenient
notation
f/ls(9) = L f/lj(Yj, Xj' 9)/7rj. (4.10)
jes
ROOTS OF SIMPLE ESTIMATING FUNCTIONS 97

From the theory in Chapters 2 and 3, it is clear that


N
E{4>s(8)} = L4>j(Yj,xj,8) (4.11)
j=l
for any 8; that

Var{4>s(8)}= LN (4).)
~
2
1l"j(1-1l"j)+ L L
N N ( 4> . ) (4)k
~ -
)
(1l"jk-1l"j1l"k),
j=l 1l"J j "I- k 1l"J 1l"k

where 4>j = 4>j(Yj,xj,8); and that unbiased estimators for (4.11) can
be defined for most cases of practical interest. For example, for fixed
size single-stage designs (if we continue to regard 8 as a freely varying
argument), the function

(4.12)

with
(lljk = (1l"j1l"k - 1l"jk)/1l"jk
is an unbiased estimator for Var{4>s (8)}, and is likely to be consistent
under appropriate conditions for designs of simple structure. Moreover,
for most designs used in practice one can expect for large n approx-
imately a standard normal distribution for the quantity

4>s(8) - E7=1 4>j(Yj, Xj' 8) (4.13)


-v'Var{4>s (8)}
(see Section 3.5). If v(4)s) is a consistent estimator for Var{4>s (8)}, then
in large samples we can take

4>s(8) - E7=1 4>j(Yj. Xj, 8) (4.14)


Jv(4)s)
to be approximately standard normal also. These facts suggest a number
of possibilities for constructing interval estimates for 8N •
For one possibility, let v(4)s) be v(4)s) with 8 replaced by Os, so that
it is calculable from the sample. If 4>s is a monotone function of 8,
we can construct limits for an approximate two-sided 100(1 - 2a)%
confidence interval for 8N as the values of () satisfying

(4.15)
where Zl-a is the (1 - a) quantile of the standard normal distribution.
A second possibility is to retain the dependence on () in v(4)s) and
98 DESIGN-BASED ESTIMATION

to try to find limits which satisfy


¢s«()
Jv(¢s) = ±Zl-a.
(4.16)

This method will not be applicable so generally because (4.16) is less


likely than (4.15) to have exactly two solutions in (). It is suggested
here because the left-hand side of (4.16) may in some cases have a
distribution closer to normality than ¢s(()/Jv(¢s), as it does under
SRS when () is a proportion.
In both cases, the interval for ()N consists of values () for which the
hypothesis H: ()N = () would not be rejected by a corresponding
two-sided significance test at level 2a.

4.1.2 Particular cases


In the case of the population mean, as indicated in the previous sub-
section, the sample estimating function being used is
¢s«() = L(Yj - ()/lTj = t - ()N,
JES

where f = LjES Yj /lTj and N = LjES lilT) are unbiased estimators


of 1'y and N respectively. Thus the point estimator is

Os = f /N.
The variance estimator v,A¢s) for a fixed size single-stage design is

(4.17)

This can be written in an obvious way as vw(T) - 2()covwcT, N) +


()2v w(N), where v denotes estimated variance and cov denotes esti-
mated covariance. If the design is also self-weighting, or more gener-
ally if lTj is constant within strata, the dependence of this estimator on
() disappears, and thus vw(¢s) coincides with vw(¢s).
For simple random sampling without replacement (SRS), the approx-
imate confidence limits from (4.15) and (4.16) are both given by

Ys ±Zl-a ~ (1-~)
n N.
L(Yj - Ys)2/(n -I). (4.18)
JES

For stratified random sampling, N is again equal to N, and here again


ROOTS OF SIMPLE ESTIMATING FUNCTIONS 99

V{J)(tPs) coincides with v{J)(tPs). The approximate confidence intervals


(4.15) and (4.16) are given by

(4.19)

where s~ is the sample variance of Y from the sample in stratum Sh.


In cases of unequal probability sampling within strata the formulae
are different: using (4.15) with v{J) yields

(4.20)
where

= 1
-LLsWjk
2
(Yj
- - -Yk -Os - 1 - -
7rj trk 7rj
1
7rk
A ( ))2
"'2
= ,.. A A

v{J)(T) - 20scov{J)(T, N)
A

+ Os v{J)(N),
A

(4.21)
while using (4.16) yields (after solution of a quadratic equation)

Os - z2f3s ± z JZ 2(f3i - asys) + ys - 2f3s0s + asO;


(4.22)
1 - z2 as 1 - z2 as

where z = Zl-a, as = v{J)(N)/N, f3s = cov{J)(T, N)/N , ys


A "'2 A ,.. "'2
A A2
= v{J)(T)/N .
(If an alternative variance estimator to Va> in (4.17) is used, with
corresponding covariance estimator, formulae corresponding to (4.20)
and (4.22) would apply with V and cov having the alternative forms.)
Note that the smaller the values of as and f3s in (4.22), or the smaller
the variability in N, the closer to symmetric about Os the interval will
be.
The case of two-stage sampling is also interesting. Suppose we select
/ first-stage units with probability proportional to size, and mr secondary
units by SRS from the rth first-stage unit if it is sampled. Then the point
estimator Os of 0 = lLy takes the form T/ N, where N = N, T = NY
and y is the mean of subsample means. The variance estimator for
tPs(O) = N LresD <Yr - 0)/ / takes the form
2
N " " nr nq - n rq - - 2
v(tPs) = 2/2 L LSD nrq <Yr - Yq)

N 2 nr ( mr ) 1 _ 2
+"LrEsB [2 mr 1 - Mr mr _ 1 LjES,(Yj - Yr) , (4.23)
100 DESIGN-BASED ESTIMATION

or approximately

(4.24)

when the first-stage design is an approximately with-replacement de-


sign, as outlined in Section 2.7. The approximate confidence intervals
given by (4.15) and (4.16) are both equal to

Y± ZI-aJV(<Ps)/ N.
However, when the first-stage units are also selected by SRS, then
the full design is no longer self-weighting in general, and the point
estimator Os is the ratio to size estimator

LMrYr/LMr. (4.25)
reSB reSB
or T / N where T = (L / I) LresB Mr Yr and N = (L / I) LresB Mr. The
confidence intervals are

Os ± ZI-aJ v(T) - 20scov(T, N) + O;v(N)/ N (4.26)


from -(4.15), and the same as (4.22) from (4.16). Here in (4.22) and
(4.26)

v(T) = TL2 (
1-
I) 21(l_I)LLsB(MrYr - MqYq)2
L 1

L" M; ( mr ) 1" _ 2
+/ ~resB mr 1 - Mr mr _ 1 X ~jesr (Yj - Yr)
A

v(N) =T
L2( I) 1
1- L 21(l_l)LLsB(Mr - Mq)
2 (4.27)

cov(T, N)
A L2
=T (
1- L
I) 21(1- I)L LSB(Mr-Mq)(MrYr-MqYq)·
1

Similar results can be derived for other population quantities of in-


terest. In fact, it is easy to see that the confidence interval formulae
for the population mean also apply to the population c.d.f. at y if we
replace Yj in the formulae by I{Yj :::: Y) as in (4.9).
For the population ratio, we have two approximate confidence inter-
vals to be considered for use with SRS. It is possible to show that

(4.28)
ROOTS OF SIMPLE ESTIMATING FUNCTIONS 101

has a bias of
1
-2~yn-l/2 - ~(/Lxx//Lxa)(R - {J)n- 1/ 2 + O(n- 1 ),

where
N N
A = n/N, {J = LXjYj/L:>;,
j=l j=l

/Lxx = (tX;)/N' a 2 = ( t Zjr/ N ,


J=l J=l

Y = (tZ] )/Na 3 / 2 and Zj = Yj - RXj.


J=l
On the other hand,
¢s(R)
(4.29)
Jv(¢s)
j~n (1-~) _1_L (z· -zs)2
N n - 1 JES J
has a bias of
1
__ Jf=):"yn- 1/ 2 + O(n- 1).
2
Thus for particular populations the distribution of the second may be
closer to standard normal than the first. The approximate confidence
intervals based on these quantities are respectively

(4.30)
and

(4.31 )

where Z = Zl-a, as = v(is)/i;, bs = cov(i.,ys)/i;, cs = v(Ys)/i;.


For other one-stage sampling designs, approximate confidence intervals
follow a similar pattern, with is and Ys replaced respectively by Tx =
LjEsXj/J"{j and Ty = LjEsYj/J"{j. The limits in (4.31) were used by
Fieller (1932), and have been discussed by Cochran (1977, p. 156).
The population median is a case of particular interest, since the esti-
mating function is not linear in the parameter. In the general one-stage
design case, the variance estimators for

(4.32)
102 DESIGN-BASED ESTIMATION

are
1
+ 4'v(N)
AA A A,.. A

V(f/lS) = v(N Fs«()) - cov(N, N Fs«()) (4.33)

and v(f/ls) = =
v(f/ls) evaluated at () Os. For stratified random sampling
(where N = N), using v(f/ls) yields intervals (from (4.15» approxi-
mately of the form

A_I
Fs [12" ± JZI-a
A
v(Fs «()) 10=9, ] (4.34)

as proposed by Woodruff (1952); using v(f/ls) with () variable yields


intervals of the form

{(): ~ > Fs«() - ZI-aJ v (FS<()), ~ < Fs«()) + ZI-aJV(Fs«())} ,


(4.35)
which have been found to be superior by Francisco and Fuller (1991),
particularly if the confidence limits for FN«() used in (4.35) can be well
approximated by smooth functions monotone increasing in (). Francisco
and Fuller give conditions for both types of interval under which their
coverage probabilities will approach 1 - 2a as popUlation and sample
sizes become large. Their conditions are actually applicable for the
more general case of stratified cluster sampling where the clusters are
selected by SRS. However, when the cluster sizes are variable, their
'inverse testing' intervals (4.35) are not exactly the same as would arise
from the prescription (4.16), since in that case their v(Fs«()) is given
by
veN Fs«()) - 2Fs «())cov(N, N Fs«())) + [Fs«())]2v(N)
N2
while prescription (4.16) would replace v(Fs«()) in (4.35) by v(f/ls«())
fN 2 •

4.1.3 Intervals based on refined approximations


In the previous subsection, interval estimates for ()N were found by
inverting the distribution of
f/ls«()) = L f/l{Yj, Xj, ())/7rj, (4.36)
jes

which was taken to be approximately normal. Since f/ls is a sample sum,


we may hope for improved intervals when better approximations to the
distributions of sample sums are available. We have also seen in Chapter
ROOTS OF SIMPLE ESTIMATING FUNCTIONS 103

3 that such refinements are available under SRS. Particularly good


approximations exist for the case of the c.d.f. value FN(y), because
this is in fact a population proportion (see Sections 3.1 and 3.2). For
a more general population mean My, we have noted the inverse testing
method of Section 3.11, and the bootstrap t method of Section 3.13.
Let us see how these approaches would apply to a general population
parameter eN defined by
N
L ¢(Yj, Xj, eN) = 0
j=!

and its estimating functions ¢s of (4.36). Note that we now require that
the form of¢j of (4.1) be independent of j.
The inverse testing method works by excluding from the confidence
set for eN all values of e which would be rejected by a level-a sig-
nificance test of the hypothesis eN = e. For a given value of e, if the
hypothesis were true the population arrays x, y would satisfy
N
L ¢(Yj, Xj, e) = o. (4.37)
j=!

Thus the first step is to construct artificial population arrays x(e),y(e)


which are consistent with the sample values, which satisfy (4.37) and
which have an associated distribution close to the one found in the
sample. This enables the conceptual imputation of ¢(Yj(e), Xj(e» for
the unseen units of the population. The second step is to approximate
the SRS distribution of ¢s (e) for this artificial population by a sad-
dlepoint approximation or by simulation. In particular, we approximate
the probability that ¢s (e) differs from zero by more than its observed
sample value ¢?(e), given the artificial population. If this probability
is less than or equal to a, the value of e is excluded. Thus if ¢s (e)
were decreasing in e the upper confidence limit eu
might satisfy
0 " ,.. "
A

P(¢s (e u ) ::: ¢s (eu) Ix(e u ),y(eu » = a, (4.38)

while the lower limit eL might satisfy


(4.39)
approximately.
Using this approach to estimating a population ratio would proceed
as follows. (Here we assume that both variates y and x are not known
for the unsampled units.) Noting that (3.74) may be written P<Ys -e :::
104 DESIGN-BASED ESTIMATION

ji~ - e;y(e» ~ a and (3.77) may be written

L L ai el
r r
0= ai(yi - e)e l (IJ)(yi_ 8)/ (yi- 8) (4.40)
i=1 i=1

we follow the procedure of Section 3.11, replacing yi - €I everywhere


by/-Rxi.
The bootstrap t method would aim to find an approximate distribution
for
¢s(e)/y'v(¢s)
or for
¢s(e)/Jv(¢s)
by resampling. It would then solve for €I equations of the form (4.15)
or (4.16) with ±ZI-a replaced by .B- 1(1 - a) and .B- 1(a), the I - a
and a quantiles of the resampling distribution. If the equations had
unique solutions these would serve as end-points for an approximate
confidence interval for €IN.

4.1.4. Asymptotic properties of the point estimates


Although confidence interval estimation for 8N via estimating func-
tions is carried out without direct reference to the properties of 8s ,
the properties of 8s may still be of interest. For example, we might
wish to know that 8s as an estimator of 8N is consistent or asymptoti-
cally design-unbiased, or that it is asymptotically normally distributed.
Showing this from the form T/ if for estimators of the mean or Ty / Tx
for estimators of R would usually be straightforward.
The key to establishing such properties in general is the Taylor series
expansion

-¢s(8) ¢s(Os) - ¢s(e)


_ 8) + ~ a ¢s I (0 _ e)2
- 2-
(4.41 )
= a¢s (0
ae s 2 ae 2 e s ,

e
where ¢s(e) = ¢s(e)/N and is some value between Os and e. This
expansion can be formed when ¢s and a¢s/ae are continuous in €I and
a2¢s/ae 2 exists in the region of interest. Suppose it can be shown in
some asymptotic framework that (8s - €IN )/eN -+ 0 in probability, that
Var{¢s(e)} and Var(a¢s/ae) are of order O(n-I) for some measure of
sample size n, that a¢s/ae and its expectation approach A(e) =1= 0, and
that a2¢s/ae 2 is bounded in probability uniformly near €IN. Then by
ROOTS OF SIMPLE ESTIMATING FUNCTIONS 105

solving (4.41) at e = eN for Bs - eN, we can see that


Bs - eN is of order Op(n- 1/2 )
and
es -eN_-
A
-¢sceN)
_
+ 0 pn
( -I) . (4.42)
E(a¢>s/ae)IIJ=IJN
Then since ¢>s (eN) has expectation 0, the bias in Bs is of order O(n- I ),
while its root mean squared error is of order O(n- 1/2 ). Thus it is
generally the case that Bs is asymptotically design-unbiased, in the sense
that
E(Bs - eN )/! E(Bs - eN)2
approaches 0 asymptotically. Moreover, if conditions are right for the
asymptotic normality of the sample sum ¢s, we can conclude that
In CBs - () N) is also approximately normal, with mean 0 and vari-
ance nVar(¢s (eN»/[Ea¢s/aef IIJ=IJN' This is the basis of the common
practice of estimating the mean squared error of Bs by
(4.43)

The condition that (Bs - eN)/eN ~ 0 in probability is a design


consistency condition, and is usually established case by case. Gen-
eral proofs of the consistency of roots of estimating functions can be
constructed in the manner of Cramer (1946, pp. 500ff).
The approach above must be modified for the case where Bs is the
sample median. Here the estimating function

is not differentiable in e. The most natural approach is to assume that as


N ~ 00 the population c.d.f. FN(y) uniformly (with error O(N- 1/2»
approaches a c.d.f. F(y) which is continuous and has a continuous
positive derivative ! in the neighbourhood of the median. The next
step would be to establish a Bahadur representation for the sample
median Bs:
1 1/2
(4.44)
A A

Bs-BN= !(BN)[Fs(BN)-FN(BN)]+op(n- ).

Francisco and Fuller (1991) have given sufficient conditions for the
representation (4.44) to hold.
106 DESIGN-BASED ESTIMATION

4.2 Quantities defined as functions of population totals


Another class of nonlinear population functions consists of those which
can be represented in the form g(T), where T is a vector of population
totals.
Formally, let Y = (YI, ... , Ya, ... , Yp) be a vector characteristic, and
suppose that Ya has values Yaj, j = 1, ... , N for the population units.
Define
N
Ta = IYa = LYaj, (4.45)
j=1
and let T be the vector (T" ... , Ta , ... , Tp). We consider the problem
of estimating real population quantities of the form () = g(T). Here and
in Section 4.3, we will omit the subscript N on population functions.
Thus () in this section corresponds to ()N of Section 4.1.
The means and ratios considered in Section 4.1 are of the form g( T).
For example, JLy = Ttl T2 if Ylj = Yj and Y2j = 1 for all j. The ratio
R takes the same form with Ylj = Yj and Y2j = Xj. However, there
are other important examples not covered by the theory in Section 4.1.
The population regression coefficient
N N
B = L(yj - JLy)(Xj - JLx)/ L(Xj - JLx)2 (4.46)
j=1 j=1
is perhaps the most important of these. It can be written as a function
of five totals, namely as
(4.47)

where Ylj = Yjo Y2j = Xj' Y3j = YjXj, Y4j = x;,


YSj = 1.
Other functions of this kind include various types of correlation co-
efficient. For illustration of certain issues we will keep in mind also the
population variance (this time with denominator N)
N
Cf2 = L(yj - JLy)2 / N = (T2 - TN T3)/ T3 (4.48)
j=1

with Ylj = Yj, Y2j = Y;, Y3j = 1.


Throughout this section the approach taken to form a point estim-
ate of () = g( T) is to compute the HT (or sometimes other unbiased)
estimator
fa = LYaj/Tlj
jES
FUNCTIONS OF POPULATION TOTALS 107

for each ta , and to substitute into the form for g, so that


e=gCT),
where T = (fl,"" Ta , ... , Tp), is the estimator for (). If g happens
to be a linear function of the components of T, then the estimator is e
e
unbiased. Otherwise will not be unbiased in general.
NOTE: Representation of a quantity as a function of totals is not
necessarily unique. The best representations seem to be those which
are functions of ratios of totals. For example, if N is known, then to
estimate Ty it may be better to represent it as NTd T2, where TI = Ty
and T2 = N, rather than as TI • (To see this, compare the values of Ty
and Niy/ if when YI = ... = YN = Y > 0, and the inclusion probabil-
ities'lrj are variable.) The estimate NTy / if will be biased, usually only
slightly, if if is variable. See Section 5.6.1 for a justification in terms
of model-assisted estimation.

4.2.1 Linearization o/the error


e
The properties of the point estimate = g(T) are studied via a Taylor
series approximation to the error. The first-order approximation of this
type is a 'linearization' of the error, and it looks like this:
~ ~ ~ "ag ~
E«(); () = () - () = geT) - geT) = ~ -(Ta - Ta) + remainder.
a aTa
(4.49)
The remainder may be bounded by the squared length of T- T, times a
bound on the second-order derivatives of g in the region of interest for
T, provided the derivatives and bound exist. If the design and variates
and the function g are such that the linear term

(4.50)

is dominant in (4.49) (which implies that at least one of the partial


derivatives of g is non-zero near the true value of T), and if T - T
appropriately standardized is asymptotically multivariate normal, then
it is clear that
g(T) - geT) e-() (4.51)
----;::==(====a=g=~=) = ,JVar(Er)
Var La-Ta
aTa
108 DESIGN-BASED ESTIMATION

is asymptotically N(O, 1). In such a case it may be useful to estimate


e
the mean square error (MSE) of = g(T) by using an estimate of the
variance of the linearized error, namely

VeEr) = V(LjEsZj!rrj) IT=T (4.52)


where v is a variance estimator form and

(4.53)

v
The symbol is used in (4.52) because computing it involves estimation
of the coefficient ag/a Ta.
Although we are aiming at estimating the MSE of the possibly biased
point estimator e, we will use the term 'variance estimation', in con-
formity with much of the literature, unless the distinction is important.
Thus (4.52) will be called the linearization variance estimator of e.
Specifically, using the Yates-Grundy-8en estimator with a fixed size
one-stage design would give as linearization variance estimator
1 A A)2
A

Vw(EL) = -2LLsWjk ( -rrZjj Zk


- -
rrk
, (4.54)

where Zj = ~a -/f.IT=TYaj. For a stratified multi-stage design with h

!
PSUs selected from stratum Sh, we would have

L ~ ~ ~SBh flr flq - flrq (Mrzr _ MqZ q ) 2

II
h 2 flrq flr flq

+L V2,r (4.55)
rESBh flr T=T
where V2,r estimates the variance of the estimate of the Z total in the
rth PSU. If Lh, the number ofPSUs in stratum Sh, is large for all hand
the first-stage design approximates with-replacement sampling within
strata, then we can use the approximation (2.94), which gives as the
estimate of variance of linearized error
MqZq )21
A

V(E[) = ,,1
h h
1 (MrZr
~ 2l-lLLsBh IT - ---rI
r q
.'
T=T
(4.56)

4,2.2 Linearization confidence intervals


A two-sided lO0(1-2a)% confidence interval for e = geT) which will
clearly be serviceable in many situations is based on the approximate
FUNCTIONS OF POPULATION TOTALS 109

normality of
g(T)-g(T) o-e (4.57)
JV(EL) = JV(Er) ,
and takes the form
o± ZI-aJV(Er) (4.58)
where again ZI-a is the 1 - ex quantile of the N (0, 1) distribution. This
can be justified for large samples when (4.51) is asymptotically normal
and it can be shown that v(Ed is design-consistent for Var(Er), in the
sense that their ratio tends to 1 in probability. See the results ofKrewski
and Rao (1981) described in Section 4.2.7.
Suppose we take as an illustration of the linearization approach the
estimation of the population variance a 2 of (4.48) with a one-stage
design. The point estimator from the prescription above is

a2 = CT2 - t?/t3)/t3=(~YJ/1l"j-(LYj/1l"jY/(~ l/1l"j))

I(~I/1l"j)
lES lES lES

or

(4.59)

The linearization of the error is


1 A 2TI A 1 (2
-(T2 - T2) - - ( TI - Td - - a - -T12) (T3
A
- T3). (4.60)
D ~ D ~
The estimate VeEr) of the MSE ofa 2 would be given by (4.52), where

Z j = N1 (yj2 - 2/-LyYj - a2 + /-Ly)'


2

which evaluated at t reduces to


A 1 [(y j-/-LA)2 -a.
Zj=-;;- A2l (4.61)
N
An approximate 100(1 - 2ex)% confidence interval for a 2 would take
the form of (4.58) with 0 = 2 • a
110 DESIGN-BASED ESTIMATION

For the regression coefficient B of (4.46) the calculations are very


similar. The point estimator is
"'2
= (T3 -
A A A A A A A

Bs Tl T2/ TS)/(T4 - T2/Ts)

= (:?:(Xj - P-x)(Yj - P-y)/1rj ) / (~)Xj - P-x)2/1rj ) (4.62)


jS jS

where

P-x = (:?:Xj/1r j ) /
JES
(L:
JES
1/1rj )

and P-y is defined similarly. The linearization of the error is

I [T2 A

T4 - Tl/N - N(Tl - Td
2BT2 - Tl
+ + (T3 -
A A

N (T2 - T2) T3)


A

-B(T4 - T4) + T2
N
Tl -BT2
N (Ts -
A ]

Ts) .

The estimator of the MSE of Bs would be given by (4.52), where this


time Z j evaluated at T reduces to

L jES (.
Xj
~ J1-x )2/1rj. [(Xj -
A P-x)(Yj - P-y) - Bs(xj - P-x)2]. (4.63)

Two connections with the estimating function approach are worth


noting here. The first is that for the means and ratios treated in Section
4.1, linearization confidence intervals based on (4.58) will coincide with
those based on (4.15) when the same variance estimation forms are used
in both. The second is that for the variance and regression coefficients
treated in this section, linearization confidence intervals will correspond
to intervals derived from a more general kind of sample estimating
function <p.. the type that will in fact be considered in Section 4.4.
The condition that the linear part EL of the error should be dominant
is necessary for the asymptotic normality of
o-() (4.64)
JV(Er) .
An interesting example where it does not hold has been given by Kom
and Graubard (1991). Here we suppose that () = geT) = (J1-y - J1-0)2,
and that J1-0 is a constant which J1-y approaches in such a way that in
our asymptotic framework n 1/2 (J1-y - J1-0) -+ 0 as the index N -+ 00. If
FUNCTIONS OF POPULATION TOTALS 111

we express g(T) as «Til T2 ) - tLO)2 it is easy to see that ag/a TI and


ag/aT2 become negligible as N --+ 00. Korn and Graubard have shown
that if the design is SRS the distribution ofn(O -8) = n(g(i') - g(T»
approaches that of y Z2 where y is a positive constant and Z is an
N(O, 1) random variate. The distribution of (4.64) approaches that of
IZI/2.
The linearization method is applicable very generally. However,
some alternative methods have been proposed, in attempts to make
the estimation of variance simpler computationally, or in efforts to find
functions of 8 = g(T) and the sampled values having improved distri-
butional approximations. Those methods described in Sections 4.2.3-
4.2.6 are based in one way or another on resampling or subsampling
from the sample itself, to gain information about the sampling distri-
bution of an estimator or root.

4.2.3 Method of random groups

The simplest subsampling method has been called by Wolter (1985) the
method of random groups. It was introduced by Mahalanobis (1946),
who called it the method of interpenetrating subsamples. Deming (1956)
and others termed it the method of replicated samples.
There are two kinds of random group method. The 'independent
random group' technique is not strictly speaking a subsampling method.
In this technique, samples Sl, ••. ,SG are drawn independently to begin
with, each according to a sampling design po. Each sample is 'replaced'
after it is drawn. If 'Tv is the HT estimator for the vector T of totals
obtained from the sample sv, we may use as a point estimate for 8 =
g(T) either

(4.65)

where g(sv) is an estimate of g(T) from sv, usually g(Tv), or

IG
= g(T), = G LTv.
A -;;- -;;- A

82 where T (4.66)
v=1

An approximately unbiased estimator for the variance of 01 or O


2 would
be
112 DESIGN-BASED ESTIMATION

or (more conservatively)

The non-independent random group method to some extent mimics


the previous method, but the component subsamples are formed after
the full sample has been drawn. The 'parent' sample is divided ran-
domly into G groups, according to rules which make the parent design
as much like a replicated design with those components as possible,
(see Wolter, 1985, pp. 31ff.). Thus, if the parent design is stratified ran-
dom sampling with sizes nl •...• nh •...• nH, the random groups will
be chosen so that each has nh/ G elements in stratum Sh.
Here the random groups are non-overlapping, and the subsample
estimators cannot be considered to be independent, conditionally on
the parent sample or unconditionally. However, the formulae for the
independent case are usually used in practice.
If the parent sample s is taken by SRS with n draws, the subsamples
SI ••••• SG will constitute a random division of the full sample into
groups of size n / G. The natural overall estimator of JLy is

1 G
Ys = G LYs•.
v=1

which is certainly unbiased. If we let

as in the independent case, we have

1 G
G(G _ 1) ~Var(jis.IS)

= 1 G2 ( 1- -1) s2
G(G - 1) n G y
1 2
= -sY'
n

1 "(y - )2
2
Sy = ---=--1
n
L.J j
jes
- Ys •
FUNCTIONS OF POPULATION TOTALS 113

which implies that E(VRG(Ys» = S;ln. Since actually

-
Var(Ys) = -;;l( 1 - N
n)S2
Y'

it follows that unless N is very large a better estimate of Var(ys) is

V*RG = (1 - ;) VRG(Ys)'
By analogy, we would estimate the variance of OJ of (4.65) or (4.66)
by
G

(1 - ; ) G(G1_ 1) ?;(g(sv) - OJ)2 (4.67)

for an SRS parent sample, i = I, 2.


As an illustration, consider the estimation of S; = (N I(N - 1»0'2
from an SRS of size n = 2m, divided at random into m groups of size 2.
If the vth pair is {j" h}, the most natural candidate for the subsample
estimate is g(sv) = (Yh - jisY + (Yh - jisY = (Yh - YjY 12. Starting
with this unbiased subsample estimate and following the principles of
the random group method gives

It is not difficult to show that if

V*RG = (1 - ~ ) ~ L (yjl -YjY - 0,)2 ,


N m ('jl,j2'J 2
then
E(V*RG) = (1 - ~) ~(A -
N 2n
B)

(in the notation of (2.73), (2.74) while

Var(O,) A
= (
1- -
N
1-
n) -(A
2n
B) (2N
B
+ - - - - -.
N-1
1) 2
Thus V*RG is approximately unbiased for Var(O,). The more efficient
estimator s;,
which can be regarded as analogous to O2 , has variance

Var(s;) = ~ (1 _~) A - 3B + _1_ (1 _ n - 1) ~


n N 2 n-1 N-1 2

(1- ~) ~(A - 2B) for n large.


N 2n
114 DESIGN-BASED ESTIMATION

The random group method is appealing because its principles are


very general. However, it has some drawbacks:
(i) if the design is complex it is difficult to separate the sample into
many subsamples of equal status; for example, a stratified random
sample with two units drawn per stratum can be separated into
just two disjoint stratified subsamples;
(ii) if the number of subsampies is small, the variance estimator (4.67)
will have a small number of terms and will tend to be unstable;
(iii) the variance estimator (4.67) tends to estimate MSE(OI) rather than
MSE(02); these quantities are different in general, and for point
estimation 02 would often be preferred to 01 as being less biased
and more efficient.
The method of balanced repeated replication described next is de-
signed to overcome the first two drawbacks for stratified random sam-
pling by replicating the division of the sample.

4.2.4 Balanced repeated replication


The method of balanced repeated replication (BRR) for variance estim-
ation is applicable generally to estimates B= g(i') of complex quant-
ities under stratified multi-stage designs. However, we describe it first in
the context of the estimation of a population mean J.ty from a stratified
random sample with two units drawn per stratum.
The usual unbiased estimator of J.ty is

Yst = L WhYh,
h

where Yh is the sample mean in the stratum Sh. If nh = 2 for each h,


an unbiased estimator for Var<Yst) is

But
" "
~(yj - 2
- Yh) = 2
1 (Yjhl - Yh2) 2 ,
jESh

where Sh consists of units jhJ. jh2 in order of drawing at random. Thus,


assuming all Nh » nh, where» means 'much larger than',

(4.68)
FUNCTIONS OF POPULATION TOTALS 115

Now, the total sample can be divided into two half samples,
{Jll, hi, ... , jhl, ... } and {J12, h2, ... , jh2, ... },
which can be regarded as two independent replications of stratified ran-
dom sampling, with one draw per stratum. (Here again the assumption
that Nh » nh for all h is being used.) We may form two 'half sample
estimates' of fJ.Y' namely

jist I = L WhYjhl and jist2 = L WhYjh2'


h h

and clearly
1
jist = i(jistl + jist2). (4.69)
Since jist is thus a sample mean of two virtually independent 'observa-
tions', from what are essentially two random groups, another unbiased
estimate for Var(jist) would be

(4.70)

which is much simpler than v(jist) above. But this latter estimate has
only one degree of freedom, and is highly unstable. We can try to
obtain a similar estimate with greater stability by considering a larger
collection of half samples.
In particular, consider the class of all possible half samples formed
by taking one of the units Uhl, jh2} from each stratum. If the number
of strata is H, there are 2H half samples, which may be numbered
I, ... , 2H. Then the vth half-sample estimate of fJ.y is
- = "~ Wh (~(v)
Ystv Uhl Yjhl + Uh2
~(v»)
Yjh2 ' (4.71)
h
where
8(v)
hi = 1 if jhl E vth half sample,
= o if jhl ¢. vth half sample;
8(v)
h2 = 1 - 8(v)
hi .

Clearly the total sample estimator jist is the average of all the half-
sample estimators. But it is also true that if we take the average of
all (jistv - jist)2, as in (4.70) but this time averaging over all 2H half
samples, we recover the variance estimator v(jist):

(4.72)
116 DESIGN-BASED ESTIMATION

To see this, we first write


jistv - jist = (jistv - jist,,)/2
where the vth half sample is the one which is complementary to the
vth, so that
- ~ ur (~(v)
Yst" = ~ rr h 0h2 Yjhl + °hl
~(v»
Yh2 .
h
Thus

Ystv - Yst =

=
where
8(v)
h
= 8(v) _ 8(v)
hi h2
= I if jhl E vth half sample
-I if jh2 E vth half sample.
Then
-
(jistv -
- Yst) 2
= 4"1 ""
~ W2(y
h ihl - YjhJ 2 +:21 ""
~ ""
~ 0h(v) 0h'
(v) Wh Wh'
h hc ~

X (Yjhl - Yih2) (Yh'l - Yh')'

But for fixed h, h' such that h =1= h', the sum Lv 0k V
) ok~) = O. Thus
.p, i
L~:I (jistv - jist)2 = Lh Wl(Yihl - Yih2)2 = v(jist) of (4.68), and it
is indeed possible to express V (jist ) as the average of squared deviations
of jistv from jist for all half samples.
The basis of the BRR method (McCarthy, 1969) is that the same
thing can be done using fewer than 2H half samples, if the set of half
samples chosen is balanced. That is, provided that

(4.73)

where tv denotes summation over the selected half samples, the same
algebra will show that if the number of half samples selected is K, then
1 -
v(jist) = K Lv(jisrv - jist)2. (4.74)

The condition (4.73) implies that if the 0k V ) are arranged in a K x H


matrix, the columns of the matrix are orthogonal. Now it is possible
FUNCTIONS OF POPULATION TOTALS 117

to construct a square matrix of Is and -Is with orthogonal rows if


the dimension of the matrix is a multiple of 4 (Plackett and Burman,
1946). For example if H = 6, since 8 is the next multiple of 4 higher
than 6, we may construct a set of eight balanced half samples (K = 8)
from which v(Ysr) can be computed using (4.74). If we start with the
Hadamard matrix
+1 +1 +1 -1 +1 -1 -1 -1
-1 +1 +1 +1 -1 +1 -1 -1
-1 -1 +1 +1 +1 -1 +1 -1
+1 -1 -1 +1 +1 +1 -1 -1
(4.75)
-1 +1 -1 -1 +1 +1 +1 -1
+1 -1 +1 -1 -1 +1 +1 -1
+1 +1 -1 +1 -1 -1 +1 -1
-1 -1 -1 -1 -1 -1 -1 -1
which has eight orthogonal columns, then since H = 6 we can use the
first six columns. The first half sample will be
Ul\, hi, hi, j42, jSI, j62}
from the first six elements in the top row, and so on. We could alter-
natively use the seventh column in place of one of the others. Using
the eighth would be less satisfactory, for then it would not be true that
Yst = (tvYstv) / K.
What we have just seen is that BRR can be used to compute the
variance estimator for Yst in stratified random sampling using a well-
chosen selection of half samples. As with the random group method,
we now look at how the method can be extended to more complex
designs and nonlinear population functions.
Thus suppose more generally that we have a stratified multi-stage
design, with lh = 2 PSUs being selected from each stratum Sh. Suppose
that the object is to estimate a population function or parameter e =
geT), and that from any sample or half sample this is estimated by
geT), where T is an unbiased or approximately unbiased estimate of
T from the sample or half sample.
We begin by constructing a balanced set SI, ... , SK of half samples
from the PSU labels in the sample, and denote by SI, ... ,SK the set of
complementary half samples, which will also be balanced. Denote by
g(sv), g(sv)
the estimates of geT) from the data in the half samples sv, Sv respect-
ively. In this section g(sv) will always be g(Tv), Tv being the estimate
118 DESIGN-BASED ESTIMATION

of T from Sv. There are two natural choices for the point estimate of
geT), namely the full-sample estimate

o=g(T). (4.76)
and the average of the half-sample estimates:
~ I K
(J = 2K L(g(sv) + g(sv)). (4.77)
v=l

Note that if the components of T are estimated by HT estimators,


and if we take the half-sample inclusion probability of the rth PSU to
be nr /2, then for linear functions g of the components of T, 0 is the
average of g(sv) and g(sv). Also, in general, for large numbers of strata
we should have
(4.78)

for each v. Thus in such cases the two point estimators and should e e
be close in value.
There are correspondingly three estimators of approximate mean
squared error for these estimators. The first is suggested by the right-
hand side in equation (4.74):
K
1", A2 A2
VBRR-S = 2K L)(g(sv) - (J) + (g(sv) - (J) ]. (4.79)
v=l

A variant of this would be


1 K
+ (g(sv) -
A A

- L[(g(sv) - 0)2 0)2].


2K v=l

The third replaces (g(sv) - 0)2 + (g(sv) - 0)2 by the term (g(sv) -
g(sv))2/2, which it would equal if (4.78) were exact:

1 K
VBRR-D = 4K L(g(sv) - g(sv))2. (4.80)
v=l

Now if Lh, the number of PSUs in stratum Sh, is large for all hand
the first-stage design approximates with-replacement sampling within
strata, the estimate of variance from the linearization method will be

'" ~""
~ 2 L..- L..-SBh
(Mrzr _ MqZq
nr nq
)21 ' T=T
(4.81)
h
FUNCTIONS OF POPULATION TOTALS 119

Zj = "~
a
ag Yaj
aT,
a
(from (4.54) with lh = 2). It is not difficult to see that for a set of
balanced half samples, (4.81) is the same as

or

where
(4.84)

is the vth half-sample estimate of Ta , and Tva is the complementary


half-sample estimate. Thus if the error in

g(sv) - g(sv) ~~ (:f It) (Tva - Tva) (4.85)

is appropriately small, the BRR estimators for the MSE of g( i') agree
'to first order' with the linearization estimator. In fact, as has been
pointed out by Rao and Wu (1985), equality holds in (4.85) if g is a
quadratic function of the components of T.
When the number of primary sampling units sampled in each stratum
is more than two, the BRR method must be modified. Three approaches
to this problem have appeared. One is to divide each stratum randomly
into two equal parts, and then to construct balanced half samples using
these as the basic units. However, the random group method gives
inconsistent estimators of variance when the number of strata is finite.
A second approach, which avoids grouping, is to form a balanced
set of K replicates with one unit per stratum according to a mixed
orthogonal array. (This method is due to Wu (1991), adapted from an
approach suggested by Gupta and Nigam (1987).)
Let us suppose that the original design is stratified random sampling
with nh units drawn from stratum Sh, and with Nh » nh. A mixed
orthogonal array of strength 2, denoted here by OA(K, n, x··· x nH),
is a K x H matrix whose hth column has nh symbols such that in
any pair of columns each possible combination of symbols appears
120 DESIGN-BASED ESTIMATION

equally often. The first six columns of the Hadamard matrix of (4.75)
is an example of an OA(8, 26 ). An example which could be used to
select replicates from a stratified random sample for which H = 5 and
nl = 3, n2 = ... = ns = 2 is the following OA(l2, 3 x 24);

I
I 2 I 2
2 I 2 2
2 2 2
2 2 2
2 I 2 2
2 2 I I 2
2 2 2 I I
3 2
3 I 2 2
3 2 I
3 2 2 2 2

Each subsample or replicate would be defined by a row of the matrix.


The first replicate would contain the first unit drawn in each stratum,
while the twelfth replicate would contain the third unit drawn from the
first stratum, and the second unit drawn from each of the others.
In this general case, for the vth replicate let

jistv = jist +L Ph (Y}(v,h) - jih),


h

where Ph = Wh /.J1ih=1, and j (v, h) is the unit from stratum Sh in


the vth replicate. (Note that for nh = 2 this reduces to Lh
Wh(ok~)Yihl +
ok1 Y}h2) as before.) It is not difficult to show that in the general case
the average of jist v over all replicates is jist. and that

K
~ L(.Ystv - jist)2 = v(.Yst).
v=l

For general functions of multivariate population totals, let the estimator


of () = geT) from theA vth replicate be g(sv) = g(Tv). Let the overall
estimator of geT) be () = geT) as before, or
FUNCTIONS OF POPULATION TOTALS 121

The BRR variance estimator is defined as

1 ~ ~ 2
VBRR = K L.,.(g(Sv) - (})
v=!

1 K ~
or K L(g(sv) - 0)2.
v=!
Clearly the formula gives the usual unbiased estimator of variance of
ewhen g is a linear function of the components of T. Wu (1991)
has shown that under appropriate regularity conditions, v BRR and the
linearization estimator are asymptotically equivalent to first order.
The third approach, described by Sitter (1993), is motivated by the
fact that mixed orthogonal arrays with small numbers of replicates may
be difficult to obtain with differing stratum sample sizes. The number
of replicates can be reduced if each is allowed to contain some number
cth > 1 of units in the hth stratum sample. The replicates are thus
described by an array of subsamples, and the technique and results of
Wu (1991) extend naturally to the case where the replicate array is
what is called a balanced orthogonal multi-array.
When the number of strata is very large, the BRR method becomes
computationally unwieldy. In this case what is sometimes suggested is
to subdivide the PSU population into smaller groups of equal numbers
of strata, and to apply the BRR technique within each group. Separate
replicates from the groups are combined to produce a set of overall
replicates which is smaller than that of a fully balanced set of replicates.
See Wolter (1985, pp. 125ff.) for a discussion of the consequences of
this method, which is called partial balancing or partial BRR.

4.2.5 Jackknife methods for variance estimation

The idea of the jackknife was introduced by Quenouille (1949; 1956)


in connection with the reduction of bias in a nonlinear estimator. Tukey
(1958) suggested its use for the estimation of variance or mean squared
error, and Durbin (1959) appears to have been the first to use it in a
finite population context. For a detailed discussion of the jackknife, see
Wolter (1985).
We begin by showing how the technique works for a non-stratified
single-stage design of fixed size n. Suppose that the sample is divided
randomly into G groups, each of size m = n/ G. Let T = Ty be a
univariate total. We denote by T(v) the estimate of T from the sample
122 DESIGN-BASED ESTIMATION

with the vth group omitted. That is,

T(v) = G G_ 1 !--
' " Yj
;-:'
jES(v) j

where S(v) consists of all units in the sample s outside the vth group,
and the probabilities 1'{j are the inclusion probabilities under the original
design. Now clearly
A 1 G A

T = G LT(v).
v=1

Furthermore, if $\hat T_v$ denotes the estimate of T from the vth group only, we can write

$$\hat T_v = G\hat T - (G-1)\hat T_{(v)},$$

or equivalently

$$-(G-1)(\hat T_{(v)} - \hat T) = \hat T_v - \hat T. \qquad (4.86)$$

Now if N is large and the design is approximately with replacement, the random group method gives

$$\frac{1}{G(G-1)}\sum_{v=1}^{G}(\hat T_v - \hat T)^2$$

as an estimate of the variance of $\hat T$. But by (4.86), this is the same as

$$v_J(\hat T) = \frac{G-1}{G}\sum_{v=1}^{G}(\hat T_{(v)} - \hat T)^2, \qquad (4.87)$$

which is called the jackknife estimator of variance. In the commonly


used case of $G = n$, $m = 1$, this estimator coincides with

$$\frac{1}{n(n-1)}\sum_{j\in s}\left(\frac{n y_j}{\pi_j} - \hat T\right)^2.$$

If the design is simple random sampling without replacement this is $N^2 s_y^2/n$, and if N is not very large, multiplying by the finite population correction $(1 - n/N)$ is appropriate.
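
As a quick numerical illustration, the hypothetical Python sketch below (not from the text) computes the delete-one ($G = n$) jackknife (4.87) for a Horvitz-Thompson total under SRS and confirms that it coincides exactly with $N^2 s_y^2/n$.

```python
import numpy as np

# Illustrative check (not from the text): the delete-one (G = n) jackknife
# (4.87) for a Horvitz-Thompson total under SRS equals N^2 * s_y^2 / n.
rng = np.random.default_rng(0)
N, n = 1000, 25
y = rng.gamma(2.0, 5.0, size=n)       # the n sampled y values
pi = np.full(n, n / N)                # SRS inclusion probabilities

T_hat = np.sum(y / pi)
# T_(v): omit unit v and rescale the remaining weights by G/(G-1) = n/(n-1)
T_del = np.array([(n / (n - 1)) * (T_hat - y[v] / pi[v]) for v in range(n)])
v_J = (n - 1) / n * np.sum((T_del - T_hat) ** 2)

print(v_J, N ** 2 * np.var(y, ddof=1) / n)   # identical, up to rounding
```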
For estimation of a general function $\theta = g(T)$ we may use the point estimator $\hat\theta = g(\hat T)$ or alternatively

$$\tilde\theta = \frac{1}{G}\sum_{v=1}^{G} g(\hat T_{(v)}).$$

The MSE of either is estimated by

$$v_J = \frac{G-1}{G}\sum_{v=1}^{G}\left[g(\hat T_{(v)}) - \hat\theta\right]^2 \qquad (4.88)$$

or

$$\frac{G-1}{G}\sum_{v=1}^{G}\left[g(\hat T_{(v)}) - \tilde\theta\right]^2. \qquad (4.89)$$
Because

$$g(\hat T_{(v)}) - g(\hat T) \doteq \left(\frac{\partial g}{\partial T}\bigg|_{\hat T}\right)^t(\hat T_{(v)} - \hat T), \qquad (4.90)$$

these estimators are also close to the linearization variance estimator (4.52). Note that the approximation (4.90) is likely to be very close, since if $G = n$ then $(\hat T_{(v)} - \hat T)/N$ is of order $O_p(1/n)$. In general the jackknife variance estimator is the closest of the 'resampling' variance estimators to the linearization estimator, which is sometimes referred to as the infinitesimal jackknife.
Now consider stratified one-stage sampling. Suppose the sample in
stratum $S_h$ is divided into $G_h$ groups of size $m_h = n_h/G_h$. Let $\hat T_{(hv)}$ be the estimate of a univariate total T from the full stratified sample with the vth group omitted from the sample in stratum $S_h$. If $s_h(v)$ is the part of the sample in $S_h$ omitting the vth group, then

$$\hat T_{(hv)} = \sum_{h'\neq h}\sum_{j\in s_{h'}}\frac{y_j}{\pi_j} + \frac{G_h}{G_h - 1}\sum_{j\in s_h(v)}\frac{y_j}{\pi_j}.$$

For each h we can write

$$\hat T = \frac{1}{G_h}\sum_{v=1}^{G_h}\hat T_{(hv)}.$$

Also, if $\hat T_{hv}$ is the estimate of T when only the vth group is sampled in stratum h, then

$$\hat T_{hv} = G_h\hat T - (G_h - 1)\hat T_{(hv)}.$$

It follows that if each $N_h$ is large and sampling is approximately with replacement within strata, an unbiased estimator of $\mathrm{Var}(\hat T)$ is

$$v_J(\hat T) = \sum_h \frac{G_h - 1}{G_h}\sum_{v=1}^{G_h}(\hat T_{(hv)} - \hat T)^2. \qquad (4.91)$$

If $G_h = n_h$ for each h and the design is SRS within strata, this coincides with the usual estimator

$$v(\hat T) = \sum_h N_h^2\,\frac{s_{yh}^2}{n_h},$$

where $s_{yh}^2$ is the sample variance of y within stratum h.
Now to estimate the MSE of the estimate of a general function $\theta = g(T)$ of a multivariate total T, under a stratified multi-stage design, assume that $L_h$, the number of PSUs in $S_h$, is large for all strata and that the design is approximately with replacement within strata. Divide each stratum sample into 'groups', each group consisting of one PSU, say. (Using groups of several PSUs is also possible.) Let $g(s_{(r)})$ be the estimate of $g(T)$ based on the sample with the rth PSU omitted. Then by analogy with (4.91), an estimate of the MSE of $\hat\theta = g(\hat T)$ would be

$$v_{J2} = \sum_h \frac{l_h - 1}{l_h}\sum_{r\in s_h}\left(g(s_{(r)}) - \hat\theta\right)^2, \qquad (4.92)$$

using the notation of Kovar et al. (1988), where $l_h$ is the number of PSUs sampled in stratum $S_h$. Other ways of forming a jackknife variance estimator would be obtained by replacing $\hat\theta$ in (4.92) by other point estimators such as

$$\hat\theta_h = \frac{1}{l_h}\sum_{r\in s_h} g(s_{(r)}), \qquad (4.93)$$

which would give $v_{J1}$ of Kovar et al. (1988). In the case where all $l_h = 2$, the jackknife technique is termed JRR or jackknife repeated replication by Kish and Frankel (1974). In that case $v_{J2}$ of (4.92) becomes

$$v_{JRR-S} = \sum_h \frac12\left[\left(g(s_{(rh1)}) - \hat\theta\right)^2 + \left(g(s_{(rh2)}) - \hat\theta\right)^2\right],$$

and if we use $\hat\theta_h$ of (4.93) instead of $\hat\theta$ we obtain for $v_{J1}$

$$v_{JRR-D} = \sum_h \frac14\left(g(s_{(rh1)}) - g(s_{(rh2)})\right)^2, \qquad (4.94)$$

where $s_{(rh1)}$ and $s_{(rh2)}$ denote the sample with the first, respectively second, sampled PSU of stratum h omitted. For g a quadratic function of the components of T, in the case $l_h = 2$ for all h the variance estimator $v_{JRR-D}$ coincides with the linearization estimator (4.51), as pointed out by Rao and Wu (1985). For general functions g, it is possible to show that $v_{JRR-S} \geq v_{JRR-D}$ (Wolter, 1985, p. 180).
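
A minimal sketch of the delete-one-PSU jackknife may help fix ideas. The Python below is illustrative only: the data are hypothetical, and it assumes that each stratum's PSU totals already carry their design weights. It computes $v_{J2}$ of (4.92) for a ratio, and then, with $l_h = 2$ everywhere, the JRR difference form (4.94).

```python
import numpy as np

# Illustrative sketch of v_J2 of (4.92) for a ratio theta = T_y / T_x under
# a stratified multi-stage design; psu[h] is the (l_h, 2) array of weighted
# (y, x) PSU totals for stratum h.  Data and names are hypothetical.
def g(totals):
    t_y, t_x = totals
    return t_y / t_x

def v_J2(psu, theta_hat):
    grand = sum(a.sum(axis=0) for a in psu)
    v = 0.0
    for a in psu:                       # loop over strata
        l_h, stratum = len(a), a.sum(axis=0)
        for r in range(l_h):
            # omit PSU r; rescale the rest of stratum h by l_h/(l_h - 1)
            totals = grand - stratum + l_h / (l_h - 1) * (stratum - a[r])
            v += (l_h - 1) / l_h * (g(totals) - theta_hat) ** 2
    return v

rng = np.random.default_rng(3)
psu = [rng.uniform(5.0, 15.0, size=(2, 2)) for _ in range(4)]  # l_h = 2
theta_hat = g(sum(a.sum(axis=0) for a in psu))
print(v_J2(psu, theta_hat))

# For l_h = 2, the JRR difference form (4.94): omitting PSU r_h1 amounts to
# replacing its data by a copy of PSU r_h2, and vice versa.
grand = sum(a.sum(axis=0) for a in psu)
v_d = sum(0.25 * (g(grand - a.sum(axis=0) + 2 * a[1])
                  - g(grand - a.sum(axis=0) + 2 * a[0])) ** 2 for a in psu)
print(v_d)
```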

4.2.6 Bootstrap variance estimation


The idea of bootstrap in connection with SRS was introduced in Section
3.13. Here we discuss an application of the same ideas to more complex
designs. The more complex the design, the more difficult it is to find
a method of resampling producing a bootstrap distribution which is a
good estimator of the real distribution of important statistics. However,
if we focus on using the resampling to provide estimates of variance
or mean squared error only, bootstrapping can in principle be applied
quite widely, as has been shown by Rao and Wu (1988).
As in the previous three subsections, finding a suitable procedure means finding one which will give the usual unbiased estimator of variance for the estimator of a population mean or total, and then applying the same procedure to general functions $g(T)$.

For example, suppose the sampling design is stratified random sampling without replacement, and we wish to estimate the population function $\theta = g(T)$ where T is the total of a vector-valued characteristic y. One 'rescaling' procedure suggested by Rao and Wu (1988) is as follows:
Step 1. Take a simple random sample $\{y^*_{hj}\}_{j=1}^{m_h}$ with replacement of $m_h$ draws from the $n_h$ sampled values of y in stratum $S_h$. Compute the vector-valued 'pseudovalues'

$$\tilde y_{hj} = \bar y_h + m_h^{1/2}(n_h - 1)^{-1/2}(1 - f_h)^{1/2}(y^*_{hj} - \bar y_h), \qquad f_h = n_h/N_h, \qquad (4.95)$$

and then set

$$\tilde y_h = m_h^{-1}\sum_{j=1}^{m_h}\tilde y_{hj} = \bar y_h + m_h^{1/2}(n_h - 1)^{-1/2}(1 - f_h)^{1/2}(\bar y^*_h - \bar y_h),$$

$$\tilde y = \sum_{h=1}^{H} W_h\tilde y_h,$$

and $\tilde\theta = g(N\tilde y)$ as the estimate of $\theta$.


Step 2. Independently replicate step 1 B times and calculate the corresponding estimates $\tilde\theta^{\,1}, \ldots, \tilde\theta^{\,B}$.

The bootstrap point estimator $\mathcal{E}_*(\tilde\theta)$ of $\theta$, where $\mathcal{E}_*$ denotes conditional expectation given the original sample, can be approximated by $\tilde\theta^{\,\cdot} = \sum_b \tilde\theta^{\,b}/B$. Note that the bootstrap point estimator of $\mu_y$ is equal to $\bar y_{st}$, the usual unbiased estimator.

The bootstrap variance estimator of $\tilde\theta^{\,\cdot}$ or of $g(N\tilde y)$ or of $g(N\bar y_{st})$ would be defined by $\mathcal{E}_*(\tilde\theta - \mathcal{E}_*\tilde\theta)^2 = \mathrm{Var}_*(\tilde\theta)$, or by $\mathcal{E}_*(\tilde\theta - \hat\theta)^2$, where $\hat\theta = g(N\bar y_{st})$; and when $\mu_y$ is being estimated this is equal to $v(\bar y_{st})$, the usual unbiased estimator. The Monte Carlo approximation to this would be

$$v_{B1} = \frac{1}{B - 1}\sum_{b=1}^{B}\left(\tilde\theta^{\,b} - \tilde\theta^{\,\cdot}\right)^2 \qquad (4.96)$$

or

$$v_{B2} = \frac{1}{B}\sum_{b=1}^{B}\left(\tilde\theta^{\,b} - \hat\theta\right)^2. \qquad (4.97)$$
Some special cases are of interest. If $m_h = n_h - 1$ and the $N_h$ are very large, $\tilde y_{hj}$ of (4.95) becomes $y^*_{hj}$, and $\tilde y_h = \bar y^*_h$. Steps 1 and 2 then yield what Rao and Wu have called the 'naive bootstrap'. When $n_h = 2$, so that $m_h = 1$, each 'resample' is a half-sample, and the method is closely akin to BRR. On the other hand if $m_h = (n_h - 2)^2/(n_h - 1)$, which is approximately $n_h - 3$ for $n_h \geq 5$, we have matching of the first three conditional cumulants of $\tilde y$ with the estimates of cumulants of $\bar y_{st}$, as in (3.97).
Bootstrapping with rescaling can also be extended to other sampling
designs, for instance, stratified two-stage sampling with large numbers
of PSUs per stratum.
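
The rescaling scheme of Steps 1 and 2 is easily programmed. The sketch below is a hypothetical illustration (data, seed and B are arbitrary; g is applied to the estimated mean rather than to $N\tilde y$, which changes nothing essential), implementing (4.95)-(4.97) with the common choice $m_h = n_h - 1$.

```python
import numpy as np

# Illustrative sketch of the Rao-Wu (1988) rescaling bootstrap for
# stratified SRS without replacement, following (4.95)-(4.97).
rng = np.random.default_rng(7)

def rescaling_bootstrap(samples, N_h, W, g, B=2000):
    """samples[h]: array of the n_h sampled y values in stratum h."""
    theta_hat = g(np.array([s.mean() for s in samples]) @ W)
    reps = np.empty(B)
    for b in range(B):
        ytilde_bar = np.empty(len(samples))
        for h, s in enumerate(samples):
            n_h = len(s)
            m_h = n_h - 1                      # one common choice of m_h
            f_h = n_h / N_h[h]
            star = rng.choice(s, size=m_h, replace=True)
            scale = np.sqrt(m_h / (n_h - 1) * (1 - f_h))
            # pseudovalue mean, as in the second display of Step 1
            ytilde_bar[h] = s.mean() + scale * (star.mean() - s.mean())
        reps[b] = g(ytilde_bar @ W)
    v_B1 = np.sum((reps - reps.mean()) ** 2) / (B - 1)
    v_B2 = np.sum((reps - theta_hat) ** 2) / B
    return theta_hat, v_B1, v_B2

N_h = np.array([400, 600])
W = N_h / N_h.sum()
samples = [rng.normal(20, 4, size=10), rng.normal(30, 6, size=15)]
print(rescaling_bootstrap(samples, N_h, W, g=np.log))
```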

4.2.7 Properties of the methods of variance estimation


In this subsection we review some properties of the methods of variance
estimation, first of all looking at their asymptotic distribution theory,
and then considering briefly the results of several simulation studies.
The asymptotic properties were first studied systematically by Krewski and Rao (1981), who assumed that the $g(T)$ in question was a constant multiple of the same function g evaluated at $\mu_y = T/N$, as is the case for all the examples we have considered. They considered the design to be stratified random sampling with replacement, so that $\theta = g(\mu_y)$ would be estimated by $\hat\theta = g(\hat\mu)$, where

$$\hat\mu = \hat T/N = \sum_h W_h\bar y_h$$

and $\bar y_h$ is the vector-valued mean of the $n_h$ values of y drawn from $S_h$ with replacement.
They considered an asymptotic framework in which the number of strata H would approach $\infty$, while the stratum sample sizes $n_h$ remain bounded. They assumed conditions (i)-(iv) of Section 3.5, which imply that there is a positive definite matrix $\Sigma$ such that the distribution of $n^{1/2}(\hat\mu - \mu_y)$ approaches the $\mathrm{MVN}(0, \Sigma)$ distribution, where $n = \sum_h n_h$ approaches $\infty$ because H does. Under these same conditions, since $\mathrm{Var}(\hat\mu)$ is of order $O(1/n)$, it follows that $\hat\mu - \mu_y$ is of order $O_p(n^{-1/2})$; and if

$$v(\hat\mu) = \sum_h \frac{W_h^2}{n_h}\,\frac{1}{n_h - 1}\sum_{j=1}^{n_h}(y_{hj} - \bar y_h)(y_{hj} - \bar y_h)^t \qquad (4.98)$$

(where $y_{hj}$ is the jth sampled y value from $S_h$) is the usual estimator of $\mathrm{Var}(\hat\mu)$, then

$$v(\hat\mu)/\mathrm{Var}(\hat\mu) \to 1 \qquad (4.99)$$

in probability, so that $v(\hat\mu)$ is a consistent estimator of $\mathrm{Var}(\hat\mu)$ as $H \to \infty$.

Inferring from these the corresponding properties of $\hat\theta = g(\hat\mu)$ requires further conditions; it is convenient to assume also that:

(v) $\mu_y \to$ a finite vector $\mu$;

(vi) the first partial derivatives of $g(\cdot)$ are continuous in a neighbourhood of the limiting vector $\mu$.

If we then define the linearization variance (more properly mean squared error) estimator as in (4.52), or in vector notation

$$v_L = \left(\frac{\partial g}{\partial\mu}\bigg|_{\hat\mu}\right)^t v(\hat\mu)\,\frac{\partial g}{\partial\mu}\bigg|_{\hat\mu},$$

standard arguments will show that

$$n\,v_L \to \sigma^2 = \left(\frac{\partial g}{\partial\mu}\right)^t\Sigma\,\frac{\partial g}{\partial\mu} \qquad (4.100)$$

in probability. Thus $v_L$ is a consistent estimator of the variance of $(\partial g/\partial\mu)^t(\hat\mu - \mu_y)$, and $\sqrt n\,(g(\hat\mu) - g(\mu_y))/\sigma$ and $(g(\hat\mu) - g(\mu_y))/\sqrt{v_L}$ are both asymptotically $N(0, 1)$.
When the third partial derivatives of g are continuous in a neighbourhood of $\mu$, we can write the Taylor series expansion

$$\hat\theta - \theta = g(\hat\mu) - g(\mu_y) = \left(\frac{\partial g}{\partial\mu}\bigg|_{\mu_y}\right)^t(\hat\mu - \mu_y) + \frac12(\hat\mu - \mu_y)^t\,\frac{\partial^2 g}{\partial\mu^2}\bigg|_{\mu_y}(\hat\mu - \mu_y) + O_p(n^{-3/2});$$

and under conditions which allow us to take expectations it follows that the bias in $\hat\theta$ is $\mathrm{tr}\left(\Sigma\,\partial^2 g/\partial\mu^2\right)/2n + O(1/n^2)$, while $\left(\partial g/\partial\mu\right)^t\Sigma\,\left(\partial g/\partial\mu\right)/n$ is the dominating term in the mean squared error of $\hat\theta$. Thus $v_L$ is also a design-consistent estimator of the mean squared error of $\hat\theta$, and is asymptotically design-unbiased in the sense of Section 4.1.4.
Krewski and Rao (1981) established the same properties for the jackknife variance or MSE estimators such as $v_{J2}$ of (4.92) and $v_{J1}$, and the BRR estimators such as $v_{BRR-S}$ and $v_{BRR-D}$ of (4.79), (4.80). The proofs involved showing rigorously the asymptotic equivalence of $v_L$, $v_{J1}$ and $v_{J2}$ suggested by (4.90), and the asymptotic equivalence of $v_L$ with $v_{BRR-S}$ and $v_{BRR-D}$ suggested by (4.85). Under similar conditions the asymptotic equivalence of $v_L$ with the bootstrap estimators $v_{B1}$, $v_{B2}$ of (4.96) and (4.97) was established by Rao and Wu (1988).
As remarked by Rao and Wu (1985), these arguments for the consistency of the variance or MSE estimators, which we denote generally by v, and the asymptotic normality of the quantities $(g(\hat\mu) - g(\mu_y))/\sqrt v$, can be extended to stratified multi-stage designs in which the PSUs are selected with replacement and in which independent samples are selected within those PSUs sampled more than once. Conditions (i)-(iv) of Section 3.5 would be assumed at the PSU level. The results can also be expected to have analogues in cases of without-replacement cluster sampling within strata, when conditions are such that the estimator of a population mean is asymptotically normal, and its variance is of order $1/l$, where l is the total number of clusters sampled.
Thus to first order, all the variance or MSE estimators which have been suggested for $\mathrm{MSE}(g(\hat T))$ or $\mathrm{MSE}(g(\hat\mu))$ in this section are equivalent. However, differences in their behaviour in moderate-sized samples from actual populations have long been observed. The first systematic simulation study was carried out by Frankel (1971). See also the discussion paper by Kish and Frankel (1974). Second-order analyses of the differences in bias of such estimators (e.g. $Ev_L - Ev_{J1}$) were carried out by Rao and Wu (1985) using higher-order Taylor series approximations. They were also able to assess the absolute biases (e.g. $|Ev_L - \mathrm{MSE}(g(\hat\mu))|$) in the particular case of ratio estimation.
For example, in general the absolute bias of $v_L$ could be found approximately by looking at the difference between the expectation of the expansion

$$v_L \doteq g'(\mu)^t v(\hat\mu) g'(\mu) + 2(\hat\mu - \mu)^t g''(\mu) v(\hat\mu) g'(\mu) + (\hat\mu - \mu)^t g''(\mu) v(\hat\mu) g''(\mu)(\hat\mu - \mu)$$

and the expectation of the expansion

$$(g(\hat\mu) - g(\mu))^2 \doteq g'(\mu)^t(\hat\mu - \mu)(\hat\mu - \mu)^t g'(\mu) + g'(\mu)^t(\hat\mu - \mu)(\hat\mu - \mu)^t g''(\mu)(\hat\mu - \mu) + \frac14(\hat\mu - \mu)^t g''(\mu)(\hat\mu - \mu)(\hat\mu - \mu)^t g''(\mu)(\hat\mu - \mu)$$

at $\mu = \mu_y$.
The theoretical results of Rao and Wu (1985) tended to agree with
the results of empirical studies by Kish and Frankel, carried out on the
linearized, jackknife and BRR variance estimators. A subsequent em-
pirical study by Kovar et al. (1988) also included tests of the bootstrap
techniques given by Rao and Wu (1988), and again tended to agree
with theoretical calculations. Here we give a brief summary of some of
the main findings of all these studies. Unless otherwise qualified they
seem to be true generally for the estimation of ratios, simple regression
coefficients and correlation coefficients.
1. The jackknife estimators $v_{J1}$ and $v_{J2}$ are very close to one another, differing by amounts which are $O_p(n^{-3})$ under broad conditions (Rao and Wu, 1985). They tend to be very close to $v_L$ numerically, and for $n_h = 2$ for all h, $v_J$ ($= v_{J1}$ or $v_{J2}$) and $v_L$ are equivalent to higher-order terms. That is,

$$v_J = v_L + O_p(n^{-3}).$$

In empirical studies the mean of $v_L$ tends to be less than that of $v_J$, and for small $n_h$, $v_J$ tends to be slightly negatively biased as an estimator of $\mathrm{MSE}(g(\hat\mu))$. (As the $n_h$ increase, while H remains fixed, the bias of $v_J$ tends to become positive.)
2. The BRR estimators are not as close to $v_L$. When $n_h = 2$ for all h, we have in general only

$$v_{BRR-D} = v_L + O_p(n^{-2}), \qquad v_{BRR-S} = v_L + O_p(n^{-2}).$$

The BRR estimators tend to be positively biased, with $v_{BRR-S}$ being larger than $v_{BRR-D}$. The bootstrap variance estimators behave similarly to the BRR estimators in the $n_h = 2$ case. The extent of the positive bias of both tends to increase with the complexity of the parameter being estimated, from ratios to regression coefficients to correlation coefficients. It also appears to be larger for population arrays under which half-sample estimators would tend to be highly variable.
3. The BRR and bootstrap estimators tend to be less stable (or in other words more highly variable from sample to sample) than the linearization and jackknife estimators.
4. Although all the estimators of MSE seem to have acceptably small biases for actual populations, this fact is not necessarily of great relevance to the performance of confidence intervals based on assuming an $N(0, 1)$ distribution for $(g(\hat\mu) - g(\mu_y))/\sqrt v = (\hat\theta - \theta)/\sqrt v$, or a $t(H)$ distribution as in the study by Kish and Frankel. It has been found that for ratios, one-sided intervals with lower limits may have serious undercoverage, although two-sided intervals tend to have close to nominal values of coverage, in agreement with the theoretical explanation based on Edgeworth expansions given by Rao and Wu (1988). For regression and correlation coefficients, two-sided confidence intervals which use $v_L$ or $v_J$ tend to cover the true value with probability less than the nominal values. For these quantities intervals which use $v_{BRR-S}$ have closer to nominal coverage, in keeping with the fact that $v_{BRR-S}$ is the most positively biased of the MSE estimators. However, even with this estimator, one-sided confidence intervals with upper limits tend to have undercoverage for correlation coefficients. The undercoverage of all the intervals tends to decrease as $n_h$ goes up, in the situations examined by Kovar et al. The asymmetry of coverage also decreases as $n_h$ increases, and is less severe for regression coefficients than for ratios or correlations. See the discussion on ratio and regression estimation of a population mean or total in Sections 5.11 and 5.12.
It may be observed that of all the estimators $v_L$ is the one which can be applied most generally, but in principle it requires knowledge of all joint inclusion probabilities. Of the resampling estimators, $v_J$ is the least trouble to compute. The main difficulty with using any of them for confidence interval estimation is the departure of $(\hat\theta - \theta)/\sqrt v$ from normality for small $n_h$.

4.2.8 Bias reduction from resampling methods


In previous subsections, particularly 4.2.7, we have looked at resampling methods as a way of trying to estimate the MSE of the possibly nonlinear estimator $\hat\theta = g(\hat\mu)$. However, they may also be used to construct alternative point estimators with smaller asymptotic bias than $\hat\theta$. In fact historically this was the first use of the jackknife technique, as proposed by Quenouille. It may be desirable to use these alternatives when it is important to obtain point estimators as free of bias as possible.
Consider first an application of the jackknife in the context of stratified random sampling. As was shown by Rao and Wu (1985), in the asymptotic framework of Krewski and Rao (1981) and under appropriate conditions, the bias in $\hat\theta$ may be written

$$E\hat\theta - \theta = \frac{1}{2}\,\mathrm{tr}\left(\Sigma\,\frac{\partial^2 g}{\partial\mu^2}\right)\Big/n + O(n^{-2}),$$

where $\Sigma$ is the limiting covariance matrix of $n^{1/2}(\hat\mu - \mu_y)$, n is the total sample size and $\partial^2 g/\partial\mu^2$ is the matrix of second partial derivatives $\partial^2 g/\partial\mu_\alpha\partial\mu_\beta$ with respect to the components of $\mu = (\mu_1, \ldots, \mu_\alpha, \ldots, \mu_p)$. Now let $\hat\theta_{(hj)} = g(\hat\mu_{(hj)})$, where $\hat\mu_{(hj)}$ is the estimator of $\mu$ when the jth unit is removed from the sample $s_h$ in stratum $S_h$. If $\hat\theta_h = \sum_{j\in s_h}\hat\theta_{(hj)}/n_h$, it can then be shown that

$$\hat\theta_h = \hat\theta + \frac{W_h^2}{2n_h(n_h - 1)}\,\mathrm{tr}\left(\hat\Sigma_h\,\frac{\partial^2 g}{\partial\mu^2}\bigg|_{\hat\mu}\right) + O_p(n^{-3}),$$

where $\hat\Sigma_h$ is the sample covariance matrix of y within sample $s_h$. Since $\sum_h (W_h^2/n_h)\hat\Sigma_h$ estimates $\Sigma/n$, it follows that

$$\hat\theta_J^{(1)} = (n + 1 - H)\hat\theta - \sum_h (n_h - 1)\hat\theta_h = \hat\theta - \sum_h (n_h - 1)(\hat\theta_h - \hat\theta)$$

(a point estimator due to Jones, 1974) has bias $O(n^{-2})$. Since $\hat\theta = g(\hat\mu)$ itself was seen in Section 4.2.7 to have bias $O(n^{-1})$, the jackknife point estimator $\hat\theta_J^{(1)}$ is a reduced-bias estimator. As pointed out by Rao and Wu (1985), the use of $\hat\theta_J^{(1)}$ in place of $\hat\theta$ in $v_{J2}$ of (4.92) will give an inconsistent estimate of MSE, however.
The BRR technique also provides a means of bias reduction. Wu (1991) showed that if $\hat\mu_v$ is the vth half-sample estimate of $\mu_y$, and $g(s_v) = g(\hat\mu_v)$, then

$$\mathrm{Bias}(g(s_v)) = 2\,\mathrm{Bias}(\hat\theta) + o(n^{-1}).$$

This relationship holds even for $n_h$ possibly greater than 2 when the approach based on mixed orthogonal arrays is used. It then follows that the point estimator

$$\hat\theta_B^{(1)} = 2\hat\theta - \frac{1}{K}\sum_{v=1}^{K} g(s_v)$$

has a lower-order bias than $\hat\theta$.
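
For concreteness, a hypothetical Python sketch of the Jones estimator $\hat\theta_J^{(1)}$ under stratified random sampling follows; the function g and the data are illustrative assumptions, not the book's.

```python
import numpy as np

# Illustrative sketch of the Jones (1974) jackknife point estimator for
# stratified random sampling; the function g and the data are hypothetical.
def jones_estimator(samples, W, g):
    theta_hat = g(np.array([s.mean() for s in samples]) @ W)
    correction = 0.0
    for h, s in enumerate(samples):
        n_h = len(s)
        deleted = []
        for j in range(n_h):
            means = np.array([t.mean() for t in samples])
            means[h] = (s.sum() - s[j]) / (n_h - 1)   # delete-one mean
            deleted.append(g(means @ W))
        # theta_hat_h: average of the delete-one estimates in stratum h
        correction += (n_h - 1) * (np.mean(deleted) - theta_hat)
    return theta_hat - correction  # theta - sum_h (n_h-1)(theta_h - theta)

rng = np.random.default_rng(11)
W = np.array([0.4, 0.6])
samples = [rng.normal(10, 2, size=8), rng.normal(14, 3, size=12)]
print(jones_estimator(samples, W, g=lambda m: m ** 2))  # g(mu) = mu^2
```

For $g(\mu) = \mu^2$ the leading bias of $\hat\theta = g(\hat\mu)$ is $\mathrm{Var}(\hat\mu) > 0$, which is the term the correction removes.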

4.2.9 Bootstrap t confidence intervals

As a method of reducing the asymmetry of coverage of confidence intervals for $\theta = g(T)$ or $g(\mu)$, Rao and Wu (1988) have proposed the use of bootstrapping to approximate the distribution of the studentized quantity

$$t = \frac{\hat\theta - \theta}{\sqrt{v_J}},$$

$v_J$ being a jackknife estimator of the variance of $\hat\theta$, rather than assuming a standard normal distribution.

The bootstrap t method would in this case generate a large number of independent resamplings, computing for each a sample of pseudovalues as for example in Section 4.2.6, a point estimate $\tilde\theta$, the appropriate jackknife variance estimate $\tilde v_J$, and the t statistic

$$\tilde t = \frac{\tilde\theta - \hat\theta}{\sqrt{\tilde v_J}}.$$

Upper and lower confidence limits for $\theta$ would be set equal to

$$\hat\theta - \hat H^{-1}(1 - \alpha)\sqrt{v_J}, \qquad \hat\theta - \hat H^{-1}(\alpha)\sqrt{v_J},$$

where $\hat H^{-1}(\alpha)$ and $\hat H^{-1}(1 - \alpha)$ are the $\alpha$ and $1 - \alpha$ quantiles of the resampling distribution of $\tilde t$.
In their empirical study Kovar et al. (1988) investigated this tech-
nique for stratified random sampling with replacement, and pseudo-
values generated as in Section 4.2.6. They showed that it does tend
to give improved confidence intervals for ratios and correlation coef-
ficients. The intervals in fact tended to be conservative, unlike those
based on linearization or jackknife variance estimators. Sitter (1992a)
extended this investigation to a comparison of these with several boot-
strap methods, concluding that the mirror-match and extended BWO
methods discussed in Section 3.13.1 also perform relatively well, and,
if extendible, would have the advantage in complex situations of better
reflecting the original sampling.
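
The procedure can be sketched in Python for the simple case of SRS with replacement, pairing each resample with its own delete-one jackknife variance. The data, the choice g = log, and the value of B below are illustrative assumptions, not the book's.

```python
import numpy as np

# Illustrative bootstrap-t interval for theta = g(mu) under SRS with
# replacement; each resample carries its own delete-one jackknife variance.
rng = np.random.default_rng(5)

def jackknife_var(y, g):
    n = len(y)
    dele = np.array([g((y.sum() - y[j]) / (n - 1)) for j in range(n)])
    return (n - 1) / n * np.sum((dele - g(y.mean())) ** 2)

def bootstrap_t_interval(y, g, B=2000, alpha=0.05):
    theta_hat, v_J = g(y.mean()), jackknife_var(y, g)
    t_stats = np.empty(B)
    for b in range(B):
        ystar = rng.choice(y, size=len(y), replace=True)
        t_stats[b] = (g(ystar.mean()) - theta_hat) / \
                     np.sqrt(jackknife_var(ystar, g))
    lo_q, hi_q = np.quantile(t_stats, [alpha / 2, 1 - alpha / 2])
    # limits: theta - H^{-1}(1-a) sqrt(v_J) and theta - H^{-1}(a) sqrt(v_J)
    return theta_hat - hi_q * np.sqrt(v_J), theta_hat - lo_q * np.sqrt(v_J)

y = rng.lognormal(3.0, 0.8, size=30)
print(bootstrap_t_interval(y, g=np.log))
```

Because the quantiles of $\tilde t$ are generally asymmetric, the resulting interval is asymmetric about $\hat\theta$, which is the point of the method.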

4.3 Quantities which are functions of population U-statistics

It is interesting to note that, under SRS, linearization and jackknifing


can also be applied in connection with the estimation of population U-statistics, or functions of these.

A U-statistic is a population function or parameter of the form

$$U_y = \binom{N}{m}^{-1}\sum f(y_{j_1}, \ldots, y_{j_m}), \qquad (4.101)$$

where the sum is taken over all sets $\{j_1, \ldots, j_m\}$ of m distinct units of $\{1, \ldots, N\}$ and the 'kernel' f is a symmetric function of m arguments. If m is the smallest number for which $U_y$ has representation (4.101), then m is called the degree of $U_y$. For example, the population mean $\mu_y$ is a U-statistic of degree 1. A population variance

$$S_y^2 = \frac{1}{2N(N-1)}\mathop{\sum\sum}_{j\neq k}(y_j - y_k)^2$$

is a U-statistic of degree 2, with $f(y_j, y_k) = (y_j - y_k)^2/2$.


Suppose the design is SRS with n draws. Then an unbiased estimator for $U_y$ is

$$\hat U = \binom{n}{m}^{-1}\sum_{\{j_1,\ldots,j_m\}\subset s} f(y_{j_1}, \ldots, y_{j_m}). \qquad (4.102)$$
Now suppose $U_y^{(\alpha)}$ and $U_y^{(\beta)}$ are two (not necessarily distinct) population U-statistics with kernels $f_\alpha$ and $f_\beta$ and degrees $m_\alpha$ and $m_\beta$ respectively. It is possible to show under general conditions, given by Nandi and Sen (1963), and generalized by Krewski (1978) and others, that $\sqrt n(\hat U^{(\alpha)} - U_y^{(\alpha)})$ and $\sqrt n(\hat U^{(\beta)} - U_y^{(\beta)})$ are asymptotically jointly normally distributed with means 0 and limiting covariance $m_\alpha m_\beta(1 - \lambda)\zeta_{\alpha\beta}$, where $\lambda = n/N$ and $\zeta_{\alpha\beta}$ is the limiting population covariance of $z^{(\alpha)}$ and $z^{(\beta)}$, the value of $z^{(\alpha)}$ for unit j being

$$z_j^{(\alpha)} = \text{average of } f_\alpha(y_{j_1}, \ldots, y_{j_{m_\alpha}}) \text{ over all } \{j_1, \ldots, j_{m_\alpha}\}\subset\{1, \ldots, N\} \text{ such that } j_1 = j.$$

A consistent estimator for $\zeta_{\alpha\beta}$ is

$$\hat\zeta_{\alpha\beta} = \frac{1}{n-1}\sum_{j\in s}\left(v_j^{(\alpha)} - \hat U^{(\alpha)}\right)\left(v_j^{(\beta)} - \hat U^{(\beta)}\right), \qquad (4.103)$$

where

$$v_j^{(\alpha)} = \binom{n-1}{m_\alpha - 1}^{-1}\sum_{\{j_2,\ldots,j_{m_\alpha}\}\subset s\setminus\{j\}} f_\alpha(y_j, y_{j_2}, \ldots, y_{j_{m_\alpha}});$$

and this can be used to form a covariance estimator for $(\hat U^{(\alpha)}, \hat U^{(\beta)})$.

For example, suppose that we are estimating a univariate $U_y = S_y^2$, and that the point estimator is the usual sample variance $s_y^2$, which can be expressed in the form (4.102) with $f(y_j, y_k) = (y_j - y_k)^2/2$. The variance estimator for $s_y^2$ associated with (4.103) is given by

$$2^2\left(\frac1n - \frac1N\right)\frac{1}{n-1}\sum_{j\in s}\left(\left[\frac{1}{n-1}\sum_{k\in s}(y_j - y_k)^2/2\right] - s_y^2\right)^2,$$

which reduces to

$$\left(\frac1n - \frac1N\right)\left[\frac{(n-2)^2}{(n-1)^2}\,k_4 + \frac{n^{(3)}}{(n-1)^3}\,\frac{b}{2}\right], \qquad (4.104)$$

where $k_4$ is given by (2.63), $n^{(3)} = n(n-1)(n-2)$, and b is the sample version of B of (2.74).


Now suppose $\theta = g(U_y^{(1)}, \ldots, U_y^{(q)})$, where the real-valued g has continuous first partial derivatives $g_\alpha$, $\alpha = 1, \ldots, q$, in a neighbourhood of the true value of $(U_y^{(1)}, \ldots, U_y^{(q)})$. Let $\hat\theta = g(\hat U^{(1)}, \ldots, \hat U^{(q)})$. Then under the conditions for joint asymptotic normality of $\sqrt n(\hat U^{(\alpha)} - U_y^{(\alpha)})$, $\alpha = 1, \ldots, q$, and consistency of the covariance estimators we have

$$\sqrt n(\hat\theta - \theta) \to N\left(0,\ (1 - \lambda)\sum_\alpha\sum_\beta m_\alpha m_\beta g_\alpha g_\beta \zeta_{\alpha\beta}\right).$$

A consistent estimator of $\mathrm{Var}(\sqrt n(\hat\theta - \theta))$ is

$$(1 - \lambda)\sum_\alpha\sum_\beta m_\alpha m_\beta g_\alpha(\hat U) g_\beta(\hat U)\hat\zeta_{\alpha\beta}.$$
Jackknifing to estimate the MSE of $\hat\theta$ can also be carried out in a manner described by Krewski (1978). For example, consider an application to the estimation of the MSE of a function $g(s_y^2)$. Let $s_{y(j)}^2$ be the sample estimator of $S_y^2$ with unit j in the sample removed. Thus

$$s_{y(j)}^2 = \frac{1}{2(n-1)(n-2)}\mathop{\sum\sum}_{k\neq l;\ k,l\in s\setminus\{j\}}(y_k - y_l)^2,$$

and $s_y^2 = \sum_{j\in s} s_{y(j)}^2/n$. Then the quantity

$$\frac{n-1}{n}\sum_{j\in s}\left(s_{y(j)}^2 - s_y^2\right)^2$$

can be written as

$$\frac{k_4}{n} + \frac{b}{2(n-2)}, \qquad (4.105)$$

which implies that for large n and $N \gg n$ it is close to (4.104) and to the unbiased estimator

$$\left(\frac1n - \frac1N\right)k_4 + \left(\frac{1}{n-1} - \frac{1}{N-1}\right)\frac{b}{2}$$

(see (2.78)) of the variance of $s_y^2$. It follows that we might estimate the MSE of $g(s_y^2)$ for a smooth function g by

$$\left(\frac1n - \frac1N\right)(n-1)\sum_{j\in s}\left(g(s_{y(j)}^2) - g(s_y^2)\right)^2. \qquad (4.106)$$
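
A short Python sketch of (4.106) follows (the data and the choice g = log are illustrative assumptions); it also verifies in passing the identity $s_y^2 = \sum_{j\in s}s_{y(j)}^2/n$ used above.

```python
import numpy as np

# Illustrative sketch of the U-statistic jackknife (4.106): estimating the
# MSE of g(s_y^2) under SRS without replacement.
rng = np.random.default_rng(9)
N, n = 2000, 40
y = rng.gamma(3.0, 2.0, size=n)

s2 = np.var(y, ddof=1)
# delete-one sample variances s^2_{y(j)}
s2_del = np.array([np.var(np.delete(y, j), ddof=1) for j in range(n)])
# the identity s_y^2 = mean of the delete-one variances
assert np.isclose(s2, s2_del.mean())

g = np.log
mse_hat = (1 / n - 1 / N) * (n - 1) * np.sum((g(s2_del) - g(s2)) ** 2)
print(mse_hat)
```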

4.4 Quantities defined as roots of estimating functions with nuisance parameters

In Section 4.1 the idea of thinking of population quantities as roots of estimating functions was introduced. This formulation was shown to lead very naturally to methods of interval estimation for general designs. The same approach can be extended to the estimation of quantities which need other parameters and a system of estimating functions for their definition.
Some of the examples which follow have been encountered already.
EXAMPLE 4.1: The variance $\sigma_y^2 = \sum_{j=1}^N (y_j - \mu_y)^2/N$ satisfies

$$\sum_{j=1}^N \left[(y_j - \lambda_N)^2 - \sigma_y^2\right] = 0,$$
$$\sum_{j=1}^N (y_j - \lambda_N) = 0; \qquad (4.107)$$

here $\lambda_N = \mu_y$, another parameter which must be estimated, and which may be regarded as a nuisance parameter.
EXAMPLE 4.2: The mean of a stratified population can be written as

$$\theta_N = \sum_{h=1}^H W_h\theta_h,$$

with stratum mean parameters $\theta_h$ satisfying the system

$$\sum_{j=1}^N \delta_{jh}(y_j - \theta_h) = 0, \qquad h = 1, \ldots, H. \qquad (4.108)$$

The indicator $\delta_{jh} = 1$ if $j\in S_h$, $= 0$ otherwise. Viewing $\theta_N$ as the quantity of interest and $\theta_h$, $h = 1, \ldots, H-1$, as nuisance parameters can lead naturally to a justification of post-stratification (Binder and Patak, 1994); details are given further on in this section.

EXAMPLE 4.3: The regression coefficients $B_N$, $A_N$ satisfy

$$\sum_{j=1}^N x_j(y_j - B_N x_j - A_N) = 0,$$
$$\sum_{j=1}^N (y_j - B_N x_j - A_N) = 0; \qquad (4.109)$$

here it might be the case that $B_N$ is of interest, while $A_N$ is a nuisance parameter.

EXAMPLE 4.4: In the Cox proportional hazards model (Cox, 1972) at the population level, the lifetime of individual j has hazard function $h_j(t)$ of form

$$h_0(t)\exp\{x_j^t(t)\beta\},$$

where $h_0(t)$ is a baseline hazard function, $\beta$ is a vector of coefficients, and $x_j(t)$ is the vector of a set of time-varying covariates for unit j. Unit j is regarded as being observed until event time $t_j$. The indicator $\delta_j$ is set equal to 1 if unit j is observed to die or fail at time $t_j$, and equal to 0 if the observation for j is censored at time $t_j$. The information about $\beta$ in the data is captured in the population partial likelihood

$$\prod_{i=1}^N\left(\exp\{x_i^t(t_i)\beta\}\Big/\sum_{j=1}^N Y_j(t_i)\exp\{x_j^t(t_i)\beta\}\right)^{\delta_i},$$

where the variate $Y_j(t)$ is 1 if unit j is present, i.e. in the risk set, at time t, and 0 if not. If we define the vector $B_N$ to be the value of $\beta$ maximizing the partial likelihood, we can interpret $B_N$ as a finite population analogue of $\beta$. It can be shown to satisfy the system

$$\sum_{i=1}^N \delta_i(x_i(t_i) - A_{Ni}) = 0,$$
$$\sum_{j=1}^N Y_j(t_i)(x_j(t_i) - A_{Ni})\exp\{x_j^t(t_i)B_N\} = 0, \qquad i = 1, \ldots, N. \qquad (4.110)$$

The vector parameter $A_{Ni}$, of the same dimension as the covariate vector $x(t)$, gives the partial derivatives of the logarithm of

$$S(t_i, B_N) = \sum_{j=1}^N Y_j(t_i)\exp\{x_j^t(t_i)B_N\}$$

with respect to the components of $B_N$ (Binder, 1992). This example is different from the others at least in this formulation, since the number of 'nuisance parameters' $A_{Ni}$ is of the order of N.
In general, let us think of a system of population estimating functions

$$\sum_{j=1}^N \phi_{1j}(y_j, x_j; \theta_N, \lambda_N) = 0 \qquad (4.111)$$
$$\sum_{j=1}^N \phi_{2j}(y_j, x_j; \theta_N, \lambda_N) = 0 \qquad (4.112)$$

with (4.111) and (4.112) having the dimensions of $\theta_N$ and $\lambda_N$ respectively. Typically these equations would have the form of population maximum likelihood equations for $\theta_N$ and $\lambda_N$. Suppose that $\theta_N$ is the parameter of interest, while $\lambda_N$ is a nuisance parameter.

The sample version of this estimating function system at a general parameter value $(\theta, \lambda)$ is

$$\phi_{1s}(\theta, \lambda) = \sum_{j\in s}\phi_{1j}(y_j, x_j; \theta, \lambda)/\pi_j = 0 \qquad (4.113)$$
$$\phi_{2s}(\theta, \lambda) = \sum_{j\in s}\phi_{2j}(y_j, x_j; \theta, \lambda)/\pi_j = 0. \qquad (4.114)$$

If $\hat\lambda_\theta$ satisfies $\phi_{2s}(\theta, \hat\lambda_\theta) = 0$, then the estimating equation system to be solved for the estimate $\hat\theta_s$ of $\theta_N$ becomes

$$\phi_{1s}(\theta, \hat\lambda_\theta) = 0. \qquad (4.115)$$

Binder and Patak (1994) have shown that, to a first-order approximation (for real $\theta$), the MSE of $\phi_{1s}(\theta, \hat\lambda_\theta)$ can be estimated by

$$v\left(\sum_{j\in s}\frac{z_{\theta j}}{\pi_j}\right),$$

where v is a variance estimator form as in (4.52) and

$$z_{\theta j} = \phi_{1j}(y_j, x_j; \theta, \hat\lambda_\theta) - \hat J_{1\lambda}\hat J_{2\lambda}^{-1}\phi_{2j}(y_j, x_j; \theta, \hat\lambda_\theta),$$

where

$$\hat J_{1\lambda} = \sum_{j\in s}\frac{1}{\pi_j}\frac{\partial\phi_{1j}}{\partial\lambda}(y_j, x_j; \theta, \hat\lambda_\theta), \qquad \hat J_{2\lambda} = \sum_{j\in s}\frac{1}{\pi_j}\frac{\partial\phi_{2j}}{\partial\lambda}(y_j, x_j; \theta, \hat\lambda_\theta). \qquad (4.116)$$

Note that $\sum_{j\in s}(z_{\theta j}/\pi_j)$ is the combination of the estimating functions in (4.113) and (4.114) which changes least as the nuisance parameter $\lambda$ changes, near $\hat\lambda_\theta$. See Godambe (1991) for related discussion. Interval estimates for $\theta_N$ are then obtainable from an $N(0, 1)$ approximation to the distribution of

$$\frac{\phi_{1s}(\theta, \hat\lambda_\theta)}{\sqrt{v\left(\sum_{j\in s} z_{\theta j}/\pi_j\right)}}. \qquad (4.117)$$

This approximation is likely to be particularly effective if $\phi_{2s}(\theta, \lambda)$ is linear in $\lambda$. In some situations where (4.115) is significantly biased as an estimating function for $\theta$, improvements may be expected from modifications which reduce the bias.

A further alternative would be to use an $N(0, 1)$ approximation to the distribution of

$$\frac{\phi_{1s}(\theta, \hat\lambda_\theta)}{\sqrt{v\left(\sum_{j\in s} z_j/\pi_j\right)}}, \qquad (4.118)$$

where $z_j$ is $z_{\theta j}$ evaluated at $\hat\theta$.


Let us consider again the examples given earlier in the section.

EXAMPLE 4.1 continued: Since $\hat\lambda_\theta = \hat\lambda = \hat T_y/N = \hat\mu_y$, then

$$\hat\sigma^2 = \Big[\sum_{j\in s}(y_j - \hat\mu_y)^2/\pi_j\Big]\Big/N;$$

$$\hat J_{1\lambda} = 0, \qquad \hat J_{2\lambda} = -N;$$

$$z_{\theta j} = (y_j - \hat\mu_y)^2 - \sigma^2, \qquad z_j = (y_j - \hat\mu_y)^2 - \hat\sigma^2.$$

We could obtain interval estimates of $\sigma^2$ by setting (4.117) or (4.118) equal to $N(0, 1)$ quantiles and solving. Using (4.118) would give results equivalent to using the linearization method of Section 4.2.
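
To make the solving step concrete: in this example the sample variance of the $z_{\theta j}$ does not depend on $\sigma^2$ (a shift leaves it unchanged), so (4.117) set equal to $\pm z_\alpha$ can be solved directly. The Python sketch below assumes SRS and uses the usual SRS variance estimator as a stand-in for the book's (4.52); the data are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Illustrative interval estimation for sigma^2 via the estimating function
# of Example 4.1 under SRS; the variance estimator for sum(z_j/pi_j) is the
# usual SRS one, standing in for the book's (4.52).
rng = np.random.default_rng(13)
N, n = 5000, 60
y = rng.normal(50, 8, size=n)
pi = n / N

mu_hat = y.mean()                  # solves sum (y_j - lambda)/pi_j = 0
d = (y - mu_hat) ** 2
sigma2_hat = np.sum(d / pi) / N    # root of phi_1s(sigma^2, mu_hat) = 0

# z_{theta j} = d_j - sigma^2; its sample variance is free of sigma^2,
# so setting (4.117) equal to +/- z_alpha solves linearly for sigma^2
v = N ** 2 * (1 - n / N) * np.var(d, ddof=1) / n
z = norm.ppf(0.975)
print(sigma2_hat - z * np.sqrt(v) / N, sigma2_hat + z * np.sqrt(v) / N)
```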

EXAMPLE 4.2 continued: Here $\lambda = (\theta_1, \ldots, \theta_{H-1})$ and $\hat\lambda = (\hat\theta_1, \ldots, \hat\theta_{H-1})$, where $\hat\theta_h = \hat T_h/\hat N_h$; we could use

$$\phi_{1s}(\theta, \hat\lambda_\theta) = \sum_{j\in s_H}\left(y_j - \frac{\theta}{W_H} + \sum_{h=1}^{H-1}\frac{W_h}{W_H}\,\hat\theta_h\right)\Big/\pi_j = \frac{N\hat N_H}{N_H}(\hat\theta - \theta)$$

as the estimating function, where $\hat\theta = \sum_{h=1}^H N_h\hat\theta_h/N$ is the post-stratified estimator for the mean. Then applying the formula for $z_{\theta j}$ in (4.116) and putting in $\hat\theta$ for $\theta$ gives

$$z_j = \delta_{Hj}(y_j - \hat\theta_H) + \sum_{h=1}^{H-1}\frac{W_h\hat N_H}{W_H\hat N_h}\,\delta_{hj}(y_j - \hat\theta_h),$$

where

$$\delta_{hj} = 1 \text{ if } j\in \text{stratum } S_h, \qquad = 0 \text{ otherwise.}$$

The resulting estimate of the MSE of $\hat\theta = \bar y_{st}$ in the case of SRS would be

$$\left(1 - \frac{n}{N}\right)\frac{n}{n-1}\sum_h \frac{W_h^2}{n_h}\,\frac{(n_h - 1)s_{yh}^2}{n_h},$$

which approximates the usual post-sampling stratification estimator

$$\sum_h \frac{W_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right)s_{yh}^2.$$

EXAMPLE 4.3 continued: In this case if $B_N$ is the parameter of interest, we may think of $A_N$ as playing the role of $\lambda_N$. Thus $\hat\lambda_B$ from the second equation of (4.109) is $\big[\sum_{j\in s}(y_j - Bx_j)/\pi_j\big]/N$; and

$$\phi_{1s}(B, \hat\lambda_B) = \sum_{j\in s} x_j\left[(y_j - \hat\mu_y) - B(x_j - \hat\mu_x)\right]/\pi_j.$$

The corresponding expression for $z_{\theta j}$ is

$$z_{\theta j} = (x_j - \hat\mu_x)\left[(y_j - \hat\mu_y) - B(x_j - \hat\mu_x)\right],$$

and that for $z_j$ is $z_j$ of (4.63) times $\sum_{j\in s}(x_j - \hat\mu_x)^2/\pi_j$. Using (4.117) for confidence intervals means solving a quadratic equation in B for the limits for $B_N$.

EXAMPLE 4.4 continued: The population system (4.110) corresponds to a sample estimating function system in such a way that the second set of equations yields

$$\hat A_{iB} = \frac{\sum_{j\in s} Y_j(t_i)\,x_j(t_i)\exp\{x_j^t(t_i)B\}/\pi_j}{\sum_{j\in s} Y_j(t_i)\exp\{x_j^t(t_i)B\}/\pi_j}$$

for each $i\in s$. The first set of equations, to be solved for $\hat B_s$, is

$$\phi_{1s}(B, \{\hat A_{iB}\}) = \sum_{i\in s}\delta_i\left(x_i(t_i) - \hat A_{iB}\right)/\pi_i = 0.$$

The expression for $z_{\theta j}$ is

$$z_{\theta j} = \delta_j\left(x_j(t_j) - \hat A_{jB}\right) - \sum_{i\in s}\frac{\delta_i}{\pi_i}\,\frac{Y_j(t_i)\exp\{x_j^t(t_i)B\}}{\hat S_i}\left(x_j(t_i) - \hat A_{iB}\right),$$

where

$$\hat S_i = \sum_{k\in s} Y_k(t_i)\exp\{x_k^t(t_i)B\}/\pi_k.$$

The corresponding expression for $z_j$ is $z_{\theta j}$ evaluated at $\hat B_s$. Binder


(1992) has performed a simulation study of the coverage properties
of confidence intervals based on a version of (4.118) with linearized
numerator. For a stratified random sampling design with unequal al-
location the empirical coverage probabilities were close to predicted
values.

As in previous sections, resampling methods can be considered for


purposes of interval estimation. It is clear that the estimate of MSE
of ¢ls(e, ~(}) does not have the same kind of simple form as v(¢s)
of Section 4.1 (e.g. (4.l2». Rather than using the analytic approxima-
tion described above and an approach based on (4.116), we could use
subsample counterparts of the estimating function system to assess the
variability of ¢ls(e, ~(}).
For example, in the context of the JRR methods of Section 4.2.5, we
might test a value eo of eN by comparing

¢ls (eo, ~(}o)


y'VJRR-D
ROOTS OF ESTIMATING FUNCTIONS WITH NUISANCE PARAMETERS 141

with N(O, 1) quantiles, where

VJRR-D = "L...J 4(4)1


1 (S(rhl) , (0) - 4>1 (s(rh2)' (0» 2
h

and 4>1 (S(rhl)' (0) is the value of 4>1s(00, i80) if the data in sampled PSU
rhlare replaced by a copy of the data in sampled PSU rh2.
CHAPTER 5

Inference for descriptive parameters

As we said in the beginning, the object of study in descriptive sampling


inference is a finite population quantity $\theta(y)$, or a collection of such quantities. Population means, totals, numbers and proportions are typical
descriptive quantities or parameters. In practice, statements of inference
tend to be point or interval estimates, or less frequently the results of
tests of hypotheses. In this chapter the word inference will refer to
statements of this type which are also compatible with the investigator's
knowledge or beliefs about the population after sampling. One of the
objectives of this chapter is to try to clarify the distinction between what
is truly inference and what is not, by means of various illustrations.
We begin in Section 5.1 to examine the elements of descriptive
sampling inference. From this examination it will emerge that prior
knowledge about the population plays an important role, even in the
traditional design-based approach. Section 5.2 outlines the use of super-
population models as a means of formal expression of prior knowledge.
In a superpopulation setting, descriptive inference can be regarded as
prediction of functions of unseen responses. Hence in Section 5.3 we
discuss historical approaches to the predictive aspect of inference; and
in Section 5.4 we consider the role of randomization in the design,
appropriately conditioned, as support for statements of inference.
Superpopulation models are used also in the planning of surveys,
in the selection of efficient estimators (or estimating functions) and
sampling designs. Section 5.5 contains a brief discussion of some cri-
teria for evaluating sampling strategies, and a general optimality result
for sample estimating functions. Then inference and efficiency consid-
erations are combined in Sections 5.6-5.14, which deal with ways of
incorporating the knowledge of auxiliary variates in estimation of totals
and means.

5.1 Elements of descriptive sampling inference


5.1.1 Classical design-based inference
An example of a very simple context for sampling inference is the
following. Suppose that there are 100 balls in an urn, distinguishable

only in that some are white and some are black. The number M of white balls is unknown. Suppose that a simple random sample of n = 10 balls is drawn without replacement, and that the observed sample number $m_s^0$ of white balls is 4. What then can be said or inferred about the number of white balls among the 90 unsampled balls (and hence about M)?

The idea behind classical sampling inference (Neyman, 1934; Cochran, 1977) might be summarized as follows.

CSI(i) Since the sampling was done at random without replacement, the sample number $m_s$ of white balls is a hypergeometric random variable, i.e.

$$P(m_s = m) = \binom{M}{m}\binom{N - M}{n - m}\Big/\binom{N}{n}, \qquad m = 0, \ldots, n. \qquad (5.1)$$

CSI(ii) On the basis of this hypergeometric distribution, we can construct confidence intervals for M. For example, for fixed $\alpha$ we can find a rule, assigning to each possible m an interval $I_m = [M_L, M_U]$, such that for any M the probability that $I_{m_s}$ covers M is approximately $1 - 2\alpha$. (See Section 3.2.) In the example with n = 10, N = 100, the value of the usual 'exact' two-sided 95% confidence interval for M when $m = m_s^0 = 4$ is [14, 72].

CSI(iii) (Long-run frequency property.) Such intervals have the property that if the sampling procedure is repeated again and again, the long-run relative frequency of non-coverage will approximate $2\alpha$.

CSI(iv) Since M was unknown to begin with and only $m_s = m_s^0$ was observed, the uncertainty in the 'guess' that M belongs to $I_m$, where $m = m_s^0$, is quantified via CSI(iii): this interval at level $100(1 - 2\alpha)$%, or a collection of such intervals for several levels, is a reasonable expression of inference about M. In intuitive terms, the computed 95% confidence interval [14, 72] for M is compatible with the inference that, while we would guess that M is in some interval around 40, we would be surprised to find it as low as 13 or as high as 73.

The application of CSI(i)-(iv) depends crucially on the fact of having drawn the sample of balls at random, rather than purposively.
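
The exact interval of CSI(ii) can be reproduced by direct computation. The Python sketch below (using scipy's hypergeometric distribution; the acceptance rule is the standard equal-tail construction) recovers [14, 72] for the data above.

```python
from scipy.stats import hypergeom

# Illustrative sketch: reproduce the exact two-sided interval of CSI(ii) by
# collecting the values of M not rejected by either hypergeometric tail.
N, n, m_obs, alpha = 100, 10, 4, 0.025

accepted = [M for M in range(N + 1)
            if hypergeom(N, M, n).cdf(m_obs) > alpha       # P(m_s <= 4 | M)
            and hypergeom(N, M, n).sf(m_obs - 1) > alpha]  # P(m_s >= 4 | M)
print(accepted[0], accepted[-1])    # recovers the interval [14, 72]
```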

5.1.2 Applicability to survey populations

Now suppose we consider applying the same reasoning to a population


of N households. Suppose the object of interest is the number M of
households with teenage children. In principle there is no difficulty in
taking an SRS of any desired size: we simply label the households on
a map from 1 to N and, using random numbers or a lottery device,
select an SRS of household labels. The basis of the argument for the
confidence intervals developed in Section 3.2 is still present, in the
randomization induced by the sampling design. But the intervals them-
selves may no longer be so appealing as an expression of inference.
The reason is that the households, unlike the balls in the urn, may well
be distinguishable on a number of characteristics besides the one of
primary interest. For example, if $N = N_1 + N_2$, and the residences of households 1 to $N_1$ are older and smaller, while those of households $N_1 + 1$ to N are newer and larger, the proportions of households with teenage children are likely to be different in these two strata.

To be specific, if $N_1 = 40$, $N_2 = 60$, N = 100, suppose that from an SRS of size n = 10 we obtain observations on five households in the first stratum and five households in the second. Suppose that the numbers of sampled households with teenage children are one and three in the respective subsamples. Then the total number $m_s^0 = 4$, and if the households had been indistinguishable we would have estimated M as 40. However, with the information on size and age of residence, we would be likely to prefer the estimate

$$40\times\frac15 + 60\times\frac35 = 44,$$

which is expressible as N = 100 times the stratified sample proportion

$$p_{st} = \sum_{h=1}^{2} W_h p_h = \sum_{h=1}^{2}\frac{N_h}{N}\,p_h.$$

Thus, even though the sample has been taken by SRS, we might prefer to base inferences about M on the distribution of $p_{st}$ or its components, rather than on the distribution of $m_s$. The statements CSI(i)-(iii) concerning SRS-based confidence intervals are still true if we replace 'white balls' by 'households with teenage children'; however, the conclusion of CSI(iv), that SRS-based confidence intervals are a reasonable expression of inference, is no longer so appealing.

5.1.3 Conditioning and post-sampling stratification


If inferences are to be based on the distribution of Pst, it is 'natural'
to use the distribution under a stratified sampling design, even though
the original design was simple random sampling. This is a special case
of the very common practice of post-sampling stratification, and here
it amounts to using the original sampling design conditional on the
sample sizes in the two strata being fixed (Holt and Smith, 1979).
Both the conditional and the unconditional analyses are equally justified on the basis of CSI(i)-(iii). However, the conditional analysis may seem a better expression of inference because of the identifiability of the strata. We will suggest a formal justification of the conditional analysis in Section 5.4.2.
In another example, suppose an SRS of n = 50 households from
a population of N = 500 is taken with a view to estimating total
automobile fuel consumption over a certain time period. Suppose the
sample mean consumption is calculated to be $\bar y_s = 47.3$ litres over the period. Thus the expansion estimator for the population total is

$$N\bar y_s = 500\times 47.3 = 23\,650 \text{ litres.}$$

However, suppose it is also noted that only $n_1 = 46$ of the sampled households actually have automobiles, and it is known that this is true for $N_1 = 437$ households in the population. Then it is 'natural' to post-stratify the population into households with automobiles and those without. This yields an estimate of

$$437\times\frac{50\times 47.3}{46} + 63\times 0 = 22\,468 \text{ litres}$$

for total consumption. Both estimators $N\bar y_s$ and $N\bar y_{st}$ are unbiased under the original SRS sampling design; but it is interesting to note that, conditional on $n_1$, the number of households with automobiles in the sample, $N\bar y_s$ is no longer unbiased. This fact is in line with the generally held intuitive preference for the stratified estimator, $N\bar y_{st}$, which is conditionally unbiased in that sense.
These examples begin to illustrate the following point, which un-
derlies much of the historical development of sampling theory: the
more specific and detailed the knowledge of variates associated with
the responses before sampling takes place, the more desirable it is that
genuine statements of inference incorporate this knowledge, and the
less appealing are simple point estimates and design-based confidence
intervals which ignore it.

At the same time, post-sampling stratification should be used with


caution. Rao (1971) has raised implicitly the following question. Is it appropriate to stratify the population on a variate which is known for all units but which may or may not be associated with the response variate of interest? Clearly, if the answer is an unqualified yes, and if knowl-
edge of the population units is very specific and detailed, there will be
the possibility of overconditioning: fixing the sample composition with
respect to all the variates of interest may render the sample unique, so
that design-unbiased estimates and confidence intervals based on the
conditioned design will not exist. In a sense, the more we condition,
and the more we make use of very detailed knowledge, the weaker will
be our capacity to generalize. We will return to the issue of how much
to condition in Section 5.4.2.
The potential for overconditioning actually arises whenever the popu-
lation units are all distinguishable, or labelled, as is typical for survey
populations. We shall see in the next section how the weakened capa-
city to generalize in this case is related to the flatness of the likelihood
function for the array y.

5.1.4 Labels and likelihood


The sampling of households is different from the urn sampling situation
we began with, in part because the households are typically labelled
(by street address, say) while the balls in the urn were not. The role
in inference played by knowledge of the labels of the population units
is in fact somewhat mysterious. To make the urn example more like a
survey, suppose now that the balls are known not only to be black or
white but also to be labelled from 1 to 100. Suppose the data from an
SRS of n = 10 draws are

(23, W) (5, W) (74, B) (62, B) (96, B)
(52, B) (17, W) (36, B) (40, B) (31, W).        (5.2)
As before, four of the balls are white. Now that the balls are distin-
guishable, is the 95% confidence interval [14,72] for the population
number M of white balls still meaningful?
In this situation the classical consensus is not quite so clear. Thomp-
son (1983) has surveyed some of the related literature. Points CSI(i)-
(iii) of the basis for confidence interval inference are still applicable, and
an acceptance of the conclusion of CSI(iv) is implicit in the approach to
this kind of problem prescribed in many textbooks. However, the new
element introduced by the knowledge of the labels becomes apparent

from a likelihood perspective. When the data exclude the labels, (5.1) defines a likelihood function for M, and two-sided confidence intervals based on the hypergeometric distribution as in CSI(ii) can be viewed as approximate likelihood intervals for M. On the other hand, when the data are equivalent to $\{(j, y_j) : j\in s\}$, where $y_j$ equals 1 if ball j is white and equals 0 if ball j is black, the probability of the observation depends not just on M but also on the array $y = (y_1, \ldots, y_{100})$ of indicators. That is, if $\{(j, y_j') : j\in s\}$ is a possible data set,

$$P[\{(j, y_j') : j\in s\}\,|\,y] = p(s) \quad\text{if } y_j' = y_j \text{ for all } j\in s,$$
$$= 0 \quad\text{if } y_j' \neq y_j \text{ for some } j\in s.$$

Since for SRS p(s) has no dependence on y, it follows that the likelihood function for y, the full parameter, is flat over all y which 'agree' with the observation. If the data are as in (5.2), then

$$L(y\,|\,\text{data}) = K \quad\text{if } y_{23} = y_5 = y_{17} = y_{31} = 1,\ y_{74} = y_{62} = y_{96} = y_{52} = y_{36} = y_{40} = 0, \qquad (5.3)$$

and $L(y\,|\,\text{data}) = 0$ otherwise. This likelihood function by itself expresses little information about M, since M ranges from 4 to 94 over all 'possible' y.
REMARK: Since likelihood is defined only up to a multiplicative con-
stant, the same likelihood function (5.3) applies no matter what proba-
bility sampling design has been used to select the sample, provided p(s)
has no dependence on y. It is for this reason that it is sometimes said
that classical sampling inference appears to violate the strong likelihood
principle (Cox and Hinkley, 1974, p. 39): classical sampling estima-
tion formulae usually depend on the sampling design used, while the
likelihood function (5.3) does not (Godambe, 1966).

5.1.5 The role of prior knowledge


More precise inferences about M from (5.3) now necessitate some
further assumption about the relationship between the labels and the Y
values. If we believe the labels to have been assigned in effect randomly
to the units, then estimates and intervals based on the SRS design
may express our beliefs after sampling appropriately. If the Y values
are thought to have some systematic dependence on the labels, other
expressions will be preferable.
Thus emerges one way of possibly coming to terms with the implica-
tions of the flat likelihood function. We might say that, no matter what

sampling design is used, in the absence of prior knowledge (known


or assumed or believed) about y, the inference about y is indeed the
trivial one implied by the likelihood function, namely that y agrees
with the observations. The use of SRS or some other probability de-
sign provides a means of expressing a (mathematical) consequence of
this inference about y via confidence intervals for M and their long-
run coverage frequencies. These intervals are always meaningful in the
sense of CSI(iii). Interpretations in the sense of CSI(iv), as non-trivial
statements of belief about M itself, require a more specific prior be-
lief or assumption about y, leading to posterior beliefs with which the
intervals are consistent.
Another way of putting the last statement is that confidence intervals
for M which are somehow inconsistent with prior information or be-
lief about y will still possess a long-run frequency interpretation as in
CSI(iii), but will not be of much use for inference. One illustration of
this is provided by SRS-based intervals in post-stratifiable populations,
as discussed earlier in this section. For a more extreme example, sup-
pose it is known that the white balls in the urn of 100 balls are labelled
1, ... , M, so that estimating M from the sample amounts to determin-
ing where in the 'population' the transition from white to black occurs.
From an SRS with the data as given in (5.2), it is clear that M belongs
to the set {31, 32, 33, 34, 35}, and the SRS-based confidence interval
[14,72] is irrelevant to inference.
In summary, the role of prior knowledge or belief is to guide us
in forming statements of inference after sampling. We will argue later
what has been hinted at here, that with an appropriate sampling design,
appropriately conditioned, the statements of inference will be reinforced
by long-run frequency properties under repeated use of the sampling
design.

5.2 Superpopulation models


Discussions of a role for prior knowledge or belief lead naturally to
the question of how prior assumptions ought to be formulated. One
obvious way is to assume that the vector y is a realization of a vector
random variate Y = (Y1, Y2, ... , YN ). That is, the population vector y
is itself 'sampled' from a hypothetical superpopulation of values of the
random vector Y.
It is not necessarily desirable to express the prior knowledge in this
way. For example, if the prior knowledge is simply that the white balls
in the urn are labelled I, ... , M, and that the rest are black, there seems

little point in assuming also that M is a random variate. However, in


many situations, provided the prior knowledge or belief is sufficiently
simple, a properly chosen probabilistic model for Y can reflect it well.
It could be said that in descriptive inference, the essential purpose of
modelling Y is to make formal and usable the relationship between the
components Yj and the labels j.

5.2.1 Exchangeable superpopulation models


A conceptually very useful class of models for $Y = (Y_1, \ldots, Y_N)$ consists of what are called exchangeable joint distributions and their generalizations.

We say that $Y_1, \ldots, Y_N$ are exchangeable if their joint distribution is symmetric, that is if

$$F_N(\eta_1, \ldots, \eta_N) = P(Y_1\leq\eta_1, \ldots, Y_N\leq\eta_N) = P(Y_{\sigma(1)}\leq\eta_1, \ldots, Y_{\sigma(N)}\leq\eta_N)$$

for any permutation $\sigma$ of $1, \ldots, N$. If $Y_1, \ldots, Y_N$ are i.i.d., so that $F_N(\eta_1, \ldots, \eta_N) = F(\eta_1)\cdots F(\eta_N)$, then $Y_1, \ldots, Y_N$ are exchangeable. If $Y_1, \ldots, Y_N$ have a distribution which is a mixture of i.i.d. distributions, so that

$$F_N(\eta_1, \ldots, \eta_N) = \int F(\eta_1; \alpha)\cdots F(\eta_N; \alpha)\,dv(\alpha), \qquad (5.4)$$

where $\int dv(\alpha) = 1$, then again $Y_1, \ldots, Y_N$ are exchangeable. For example, suppose that, conditional on $\alpha$, $Y_1, \ldots, Y_N$ are i.i.d. $N(\alpha, \sigma^2)$, and that $\alpha$ is $N(\mu_0, \sigma_0^2)$. Then $Y_1, \ldots, Y_N$ are exchangeable, and in fact they are multivariate normal with common mean $\mu_0$, common variance $\sigma^2 + \sigma_0^2$, and pairwise covariance $\sigma_0^2$. It is a theorem due in its original form to de Finetti (1931) that if $Y_1, \ldots, Y_N$ is the initial segment of an infinite exchangeable sequence, then $F_N$ must be of the form (5.4).

However, not all exchangeable distributions have this mixture form. For example, consider the random permutation model, where $(Y_1, \ldots, Y_N)$ is simply a random permutation of some fixed vector $(a_1, \ldots, a_N)$:

$$P(Y_1 = a_{\sigma(1)}, \ldots, Y_N = a_{\sigma(N)}) = 1/N! \qquad (5.5)$$

for each permutation $\sigma$ of $1, \ldots, N$. Then $Y_1, \ldots, Y_N$ are exchangeable, but the distribution is not generally of the form (5.4).
In a sense, random permutation models are the most basic exchangeable models. Any exchangeable joint distribution of $Y_1, \ldots, Y_N$ can be regarded as a mixture of random permutation models.
An exchangeable model is appropriate if we wish to assume that the
unit label j carries no information about the associated Y values: in ef-
fect, they have been assigned at random to the population. For example,
in a telephone poll carried out by random digit dialling, household la-
bels are effectively telephone numbers. There is often an absence of
knowledge of how the numbers have been assigned, except for a vague
notion that the assignment has had little to do with street address,
housing type, or other variable which might be related to the survey
question. In such a case an exchangeability assumption is natural.
The exchangeability assumption will not be so natural when we do
have some idea of a relationship between labels and y values. In some
such cases, an assumption of partial exchangeability, or invariance
of the distribution under a subgroup of the group of permutations of
1, ... , N, may be appropriate. (See Sugden, 1993; Thompson, 1984.)
For example, suppose the population consists of the patients of L doc-
tors, and suppose (relabelling units by jk instead of j) that Yjk is the
number of physician visits in the preceding 12 months for patient k
on the list of doctor j. We could assume that the population $y_{jk}$ values come from the realization of a sufficiently large array of random variables

$$(Y_{jk}), \qquad j = 1, \ldots, L, \quad k = 1, 2, \ldots,$$
of which the distribution is invariant under 'two-stage' permutations
which first permute the rows, and then permute independently within
rows. Thus in particular we would be assuming one joint distribution
for the visits of two patients on the list of the same doctor, and another
for the case of two patients on the lists of different doctors.
Another kind of partial exchangeability would be exchangeability
within strata, with independence across strata, in a stratified population.

5.2.2 Models with auxiliary variates

Frequently there is an additional real- or vector-valued variate x for


which the values are available or easily measured for all the units of the
population, such that x is thought to be related to the variable Y to some
degree. If it is possible to stratify by the value of x, then a stratified
(partially) exchangeable model for Y may well be appropriate. In some
other cases, it may be possible to assume a (partially) exchangeable
model for some function of Y and x: for example, it might be possible

for a real and positive x to assume $Y_1/x_1, \ldots, Y_N/x_N$ exchangeable.


Such a model has been discussed by Cox and Snell (1979) in connection
with auditing. Suppose the population consists of items or records with
specific nominal monetary values, which may be 'in error' rather rarely.
We might be interested in estimating the total error $T_y = \sum_{j=1}^N y_j$, and a reasonable model for the error $y_j$ in item j might be given by

$$Y_j = Z_j\Delta_j x_j,$$

where $x_j$ is the nominal value of item j, $\Delta_j$ is the relative error in $x_j$ if item j is in error, and $Z_j$ is a 0-1 variate indicating the presence or absence of an error in $x_j$. A reasonable assumption might be to take the variates $Z_j\Delta_j$ or $Y_j/x_j$ to be i.i.d.
In still other situations a regression model or something resembling one may be more suitable. Often when $x_j$ is a measure of size of unit j in some sense, a regression model of the form

$$Y_j = \beta x_j + E_j \qquad (5.6)$$

is used. The $E_j$ are taken to be independent mean-zero errors with variances depending on $x_j$. In most applications x and y are both non-negative, and the variance of y about the line through the origin with slope $\beta$ tends to increase with x. For example, for a population of short-stay hospitals the model (5.6) has been suggested (Herson, 1976), where $x_j$ is the number of beds in hospital j (a size measure), $Y_j$ is the number of discharges from hospital j in a given month, and the error $E_j$ has variance $\sigma^2 x_j$. Royall and Cumberland (1981a) have discussed the consequences of an imperfect fit to the model (see Figure 5.1).

Another situation where regression models like (5.6) or

$$Y_j = \alpha + \beta x_j + E_j \qquad (5.7)$$

are commonly introduced is where $x_j$ is the value of y for unit j on some previous occasion when a census was completed. For example, in a population of universities, $Y_j$ and $x_j$ might respectively represent the number of PhDs granted by institution j this year and last year, as in Figure 5.2. In a population of cities, $Y_j$ and $x_j$ might denote numbers of residents now and at the time of the last national census.
Related models are also sometimes used when there are two ways
of measuring the characteristic of interest. Suppose that one of these is
crude but inexpensive, and can be applied to all population units. The
other is more accurate but also more expensive (or perhaps destructive),
and can be applied to only a few population units. For example, in an
analysis to estimate the number of trees in a wooded area divided into
[Figure 5.1 Scatterplot of patients discharged versus beds for a population of short-stay hospitals (Royall and Cumberland, 1981a, p. 70).]

plots, Xj might be the estimate from an aerial survey of the number of


trees in plot j, and Yj the true number of trees in plot j as determinable
from a count at ground level. It is sometimes assumed that the pairs
(Xj , Yj ) are independent, and that the joint distributions of the Xj and
Yj are such that the mean of Yj conditional on Xj = x j is a fixed linear
function of x j.

5.2.3 Time series and spatial process models


A third very important class of models consists of those where the label
of a unit specifies a time or location in a one-, two- or three-dimensional
space and the y values are thought to vary in some meaningful way
with time or location. We may wish, for example, to estimate the total
input of a contaminant into a body of water over a certain period by
measuring the input in a sample of shorter time intervals. We may
wish to estimate the total number of insect larvae in a stream bed by
counting them in a small number of soil samples. There are many other
examples in the biological and earth sciences.
Specific time series and random field (spatial process) models will
be applicable in such situations. These may or may not incorporate
[Figure 5.2 Scatterplot of mathematics PhDs granted in 1987-88 ($y_j$) versus PhDs granted in 1986-87 ($x_j$) for intermediate strata of a population of universities, stratified by the x variate.]

trends or underlying cycles. One of the simplest models expressing dependence between neighbouring y values assumes that $Y_1, \ldots, Y_N$ form the initial segment of a stationary time series. It may be noted that for the infinite series, stationarity is a special example of partial exchangeability, where the joint distribution of a finite set of terms is invariant under the group of translations of the unit labels.
Sampling from temporal and spatial populations will be the subject
of Chapter 7.

5.3 Prediction and inference


Formally, a superpopulation model like those of Section 5.2 is a class
C = {~} of distributions ~ of the random vector Y = (Y1 , ••• , YN ). De-
pending on the context, C could be a parametric family of distributions,
indexed by some finite-dimensional parameter {3, or a very broad non-
parametric family such as the class of all exchangeable distribu,tions,
or a semiparametric family.
Once a superpopulation model has been assumed, the problem of estimating a population quantity θ(y) can be viewed as the problem of predicting the value of the random variable θ(Y), given observation of a subset of the components of Y, namely {(j, Y_j = y_j) : j ∈ s}. An estimator

    e = e({(j, Y_j) : j ∈ s})

can be called a predictor of θ(Y). The prediction error is e − θ(Y).
Let us suppose that the sampling design probabilities are independent of Y and the model parameters, so that the selection of s as the sample implies no information about the unseen y values. Then it is appropriate as far as use of the model is concerned to regard s as fixed. If ℰ_ξ denotes expectation with respect to a distribution ξ in the superpopulation model, then

    ℰ_ξ(e − θ(Y))

gives the (prior) prediction bias or model bias. Since this is a model expectation of the error for s fixed, we can think of it in some contexts as the bias of e conditional on the sample unit labels. If

    ℰ_ξ(e − θ(Y)) = 0    (5.8)

for all ξ in C, the point estimator or predictor e is called model unbiased or ℰ-unbiased for θ(Y). Where there is no possibility of confusion we shall write ℰ for ℰ_ξ.
The model mean squared error (model MSE) or prediction MSE for e is also defined for s fixed, as

    ℰ_ξ(e − θ(Y))².    (5.9)

From the prediction viewpoint, it is desirable to choose e to be ℰ-unbiased and to have a prediction MSE which is as small as possible under the distributions ξ in C.
In addition, it is sometimes meaningful to construct prediction intervals for θ(y) based on the sample data. Under the model distribution, θ(Y) would belong to these intervals with specified probabilities. For example, if v_e were a model-consistent estimator of the prediction MSE (5.9) and if

    (e − θ(Y)) / √v_e    (5.10)

(for s fixed) were approximately N(0, 1) under distributions ξ ∈ C, then the interval

    e ± z_{1−α} √v_e    (5.11)

would cover θ(y) with approximate ξ probability 1 − 2α.
Other predictive frameworks for descriptive sampling inference have been put forward by Ericson (1969), Scott and Smith (1969), Kalbfleisch and Sprott (1969), and others subsequently. In a Bayesian setting, as adopted by Ericson and by Scott and Smith, C consists of a single prior ξ, usually hierarchical or multi-stage, and inference is expressed in terms of the posterior distribution of Y, or ξ conditioned on the data {(j, Y_j = y_j) : j ∈ s}. Recent applications of this approach have been discussed by many authors, including Cox and Snell (1979), Malec and Sedransk (1985), Stroud (1991) and Ghosh and Rao (1994).
In the fiducial approach of Kalbfleisch and Sprott, C is a parametric family, and inferences are derived from the fiducial distribution of the parameters composed with the conditional distribution of θ(y), given the parameters and the data. The parametric empirical Bayes approach (Ghosh and Meeden, 1986) also takes C to be a parametric family, this time of prior distributions; in the posterior distribution of θ(y) the parameters are then estimated from the data. The point estimate of θ(y) is the estimated posterior mean or estimated posterior mode. The estimated posterior variance of θ(Y) must be adjusted to produce a mean squared error estimate for θ(Y) which incorporates the parameter estimation errors (see, for example, Laird and Louis, 1987; Kass and Steffey, 1989).
The framework (5.8)-(5.11), which could be termed the 'frequentist' predictive approach, was put forward by Brewer (1963) and by Royall (1970); here inferences are constructed through the unconditional (or 'prior') distributions ξ of C. We will use this framework subsequently because it formalizes fairly simply the predictive element in sampling inference.
It should be noted that the justification for thinking of inference in predictive terms depends on the appropriateness of the superpopulation model, and in particular on aspects like model unbiasedness in (5.8) and the approximate normality of (5.10). For this reason, in the frequentist approach, robust predictive methods, those which work for broad classes C (nonparametric or semiparametric), tend to be preferred in practice. We shall see examples of these in the next few sections.

5.4 Randomization as support for statements of inference


In this section it will be seen how statements of inference with basis in
a superpopulation model can be supported by a matching randomization
in the sampling design, so that in a sense the reliance on the model is
decreased.

5.4.1 Inferences based on exchangeability


The most clear-cut example arises when prior knowledge about the y
variate can be summarized in a fairly broad exchangeable or partially
exchangeable superpopulation model. In that case randomization in the
sampling scheme can reinforce quite precisely the inference based on
the model and the observed sample. For example, every exchangeable
model is a mixture of random permutation models like (5.4); and under
the random permutation model (5.4) the distribution of Y_1, ..., Y_n or of any set of n of the variates in Y is the same as the distribution of {Y_j : j ∈ s} under SRS. Thus an appropriate prediction interval for μ_y under exchangeability would be the realized value of

    Ȳ_s ± z_{1−α} √{ (1/n)(1 − n/N) Σ_{j∈s} (Y_j − Ȳ_s)²/(n − 1) },    (5.12)

which is the same as the SRS-justified confidence interval.
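As a concrete illustration, here is a minimal Python sketch of the realized interval (5.12); the function name and the use of numpy/scipy are ours rather than the book's, and the coverage convention follows (5.11), i.e. approximately 1 − 2α.

```python
import numpy as np
from scipy.stats import norm

def srs_prediction_interval(y_sample, N, alpha=0.05):
    """Realized interval (5.12) for the population mean under SRS
    (equivalently, under an exchangeable superpopulation model)."""
    y = np.asarray(y_sample, dtype=float)
    n = len(y)
    s2 = y.var(ddof=1)                      # sum (y_j - ybar)^2 / (n - 1)
    se = np.sqrt((1.0 - n / N) * s2 / n)    # includes finite population correction
    z = norm.ppf(1 - alpha)                 # z_{1-alpha}; coverage approx 1 - 2*alpha
    return y.mean() - z * se, y.mean() + z * se
```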


In thinking of (5.12) as predictive inference, we assume that under-
lying symmetry in the generation of the Y values is provided by nature,
rather than by the statistician's sampling design. However, if the as-
sumption of symmetry is an oversimplification, using SRS as the design
will ensure that at least the repeated sampling coverage frequencies of
the interval will be under control. In this sense, randomization can be
said to support a statement of (predictive) descriptive inference. Look-
ing back to Section 5.1.1, we have a formalization of the relationship
between CSI(iii) and CSI(iv).
If the appropriate model is partially exchangeable, inference will
likewise be supported by a randomization which can be generated by
the permutations in the associated subgroup. Thus, for example, the

estimation statements arising from a stratified exchangeable model will


have the same form as those arising classically for stratified random
sampling, and will be reinforced if a stratified random sampling design
is actually used.

5.4.2 Formal justification of conditioning


Historically, the desirability of conditioning sampling inference on fea-
tures of the sample drawn was put forward by Durbin (1969), and has
been developed by many others subsequently, including Holt and Smith
(1979), Thompson (1984), Rao (1985) and Valliant (1993). The ques-
tion of how far to carry the conditioning is delicate, as we have seen.
For if the sampling design probabilities do not depend on y, the sam-
ple s is ancillary, and the classical conditionality principle (Cox and
Hinkley, 1974, p. 38) would suggest conditioning on s itself. However,
if we do condition on s, we have overconditioned: there is no ran-
domness left in the sampling design with which to support inferences.
Thus if we want to make use of randomization, it would be better to
condition on some function of s, and that is the conditioning principle
we are now able to propose here: we condition to the extent that the
conditioned randomization will support a statement of inference under
an appropriate model.

Post-sampling stratification
Let us return to the stratified population example. It is easy to see that
estimation statements from a stratified exchangeable model will be re-
inforced if SRS is used, as long as the relevant sampling distributions
are taken to be conditioned on the stratum sample sizes. This applica-
tion of our principle gives us in fact a superpopulation justification of
the conditioning of post-sampling stratification, arrived at intuitively in
Section 5.1.3. By defining strata within which the response variate is
assumed to be exchangeably distributed, the model prevents overcon-
ditioning.

Conditioning on sample size


We can use the conditioning principle to justify conditioning on the
realized sample size under certain variable size sampling schemes. For
example, suppose the superpopulation model is exchangeable, and that
a simple random sample of n draws is taken with replacement. Suppose
s is the set of distinct units drawn, and let n(s) be the size of s. The sample mean

    ȳ_s = (1/n(s)) Σ_{j∈s} y_j

is the natural estimator of the population mean μ_y. The associated prediction intervals under the exchangeable model, based on an application of the central limit theorem of Section 3.4, are

    ȳ_s ± z_{1−α} √{ (1/n(s)) (1 − n(s)/N) s_y² },

where

    s_y² = (1/(n(s) − 1)) Σ_{j∈s} (y_j − ȳ_s)².

These are valid with respect to the exchangeable superpopulation


model, with s fixed. At the same time, they also have the appropri-
ate frequency properties as confidence intervals under the sampling
design conditional on n(s), namely SRS without replacement.
In Bernoulli sampling with equal inclusion probabilities (Section
2.8), conditioning on the realized sample size is similarly justified.

Estimation of domain means and totals


The estimation of a domain mean in simple cases can also be set in a conditional framework. The term domain is often used to describe a subpopulation D ⊂ P of which the size N_D is typically unknown before sampling. Membership in the domain is determinable for sampled units, but in general is not known for unsampled units. Theory for estimating domain means and totals under complex sampling designs was developed by Durbin (1958), Hartley (1959) and others. Here we will take the sampling scheme to be SRS. If the population P is sampled by SRS, n draws, then the size

    n_D = n(s ∩ D)

of the part of the sample falling in D is random.

If we regard D as fixed and the variates Y_j, j ∈ D, as exchangeable, we may form the following prediction intervals for the mean μ_D of y in D:

    ȳ_D ± z_{1−α} √{ (1/n_D)(1 − n_D/N_D) s_{D,y}² },    (5.13)

where ȳ_D and s_{D,y}² are respectively the sample mean and variance of y in s ∩ D. If N_D is large compared to n_D, the finite population correction 1 − n_D/N_D in (5.13) can be taken to be 1; otherwise, N_D is usually estimated by N n_D/n, giving an estimated finite population correction of 1 − n/N. The intervals (5.13) have the appropriate frequency properties as confidence intervals under SRS with n_D draws from D. This can be thought of as the actual sampling design (restricted to D), conditional on n_D.

Estimation of the domain total T_D = N_D μ_D is generally approached through the unconditional sampling distribution of the natural estimator

    T̂_D = N ( Σ_{j∈s∩D} y_j ) / n = N̂_D ȳ_D,    (5.14)

which is also

    T̂_D = N z̄_s,  where z̄_s = (1/n) Σ_{j∈s} z_j,    (5.15)

where z_j = I_j y_j and I_j is 1 if j ∈ D, is 0 if j ∉ D (see Cochran, 1977). This can be consistent with our conditioning principle, which may in this case tell us not to condition. In order to 'predict' T_D, we need a model for the location of D within the whole population U (so as to be able to predict N_D) as well as a model for Y_j, j ∈ s. Assuming exchangeability for the variates Y_j, j ∈ U, and independently for the variates

    I_j = 1 if j ∈ D,  I_j = 0 if j ∉ D,    (5.16)

we have exchangeability of the variates Z_j = Y_j I_j, j ∈ U. Noting that T_D = T_z, we are led to an estimator of prediction mean squared error which is the same as the unconditional SRS variance estimator, and which can be put in the form

    v(N z̄_s) = (N²/n) (1 − n/N) [ ((n_D − 1)/(n − 1)) s_{D,y}² + (n_D/(n − 1)) (1 − n_D/n) ȳ_D² ].    (5.17)

Another kind of model for the location and size of D might justify a different estimator of prediction mean squared error.
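Since (5.17) is algebraically just the usual SRS variance estimator applied to the variates z_j = I_j y_j, it can be computed directly from them. The following minimal Python sketch (our own illustration, with hypothetical function names) does so.

```python
import numpy as np

def domain_total_srs(y_sample, in_domain, N):
    """Estimator (5.14)/(5.15) of a domain total under SRS, with the
    unconditional variance estimator, equal to (5.17) after algebra."""
    y = np.asarray(y_sample, dtype=float)
    d = np.asarray(in_domain, dtype=bool)
    n = len(y)
    z = np.where(d, y, 0.0)                       # z_j = I_j * y_j
    t_hat = N * z.mean()                          # N * zbar_s
    v = N**2 * (1.0 - n / N) * z.var(ddof=1) / n  # SRS variance estimator for N*zbar_s
    return t_hat, v
```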

5.4.3 Survey weights


Organizations like Statistics Canada which make survey data available usually provide survey weights for the individual records from the sample. Ideally, the weight w_js for j ∈ s is the number of units in the population represented by unit j (Statistics Canada, 1990). Thus


    Σ_{j∈s} w_js = N̂    (5.18)

would estimate the population size, and can sometimes be constrained to be exactly N;

    Σ_{j∈s} w_js y_j    (5.19)

would estimate the total for y; and

    Σ_{j∈s∩D} w_js = N̂_D    (5.20)

would estimate the size of a domain D.


Traditionally, the weights have been identified with the reciprocals of inclusion probabilities. In fact, if we take w_js = 1/π_j, (5.19) is the HT estimator of T_y, featured in Chapters 2-4. This choice of w_js makes the estimators (5.18)-(5.20) unbiased. Their form is simple, and they possess certain optimality properties, to be described in Section 5.5. However, for estimators to be consistent with inference, there may be better choices of weights, and the usual practice is to adjust the 'basic' weights 1/π_j to produce weights w_js which are thought to incorporate the 'representation' interpretation better. Typically, there are adjustments to account for differing rates of non-response from different PSUs, and adjustments to account for new construction and other frame changes. Finally, there are 'post-sampling stratification' adjustments which guarantee that for certain strata, or domains D_k for which N_{D_k} is known from another source,

    Σ_{j∈s∩D_k} w_js = N_{D_k}    (5.21)

exactly. For example, the D_k could be age-sex groups and economic regions, as in the Canadian Labour Force Survey. The computation of the weights will be discussed further in Section 5.11.2, but the intended effect of the adjustments is to make it plausible that respondent j should be representing w_js people in the same age-sex group and economic region.
If the original sampling design is self-weighting within strata, the process of adjustment generally divides the sample into weight classes c, c = 1, ..., C, within which the weights are constant. In the single-stage sampling case, these may be subsets of well-defined population classes (also denoted by c), defined by variates like age, sex, stratum label, and other variates which might be thought of as influencing response rates and response values. The size of population class c, denoted by N_c, is unknown in general, although certain combinations of the N_c may be known. For example, if the classes c are age-sex groups, it might be the case that the total numbers of males and females in the population are known, but not the numbers within age groups.
If the realized sample size within a class c is m_c, and the weights are constructed with the representation interpretation in mind, we may regard these as being of the form

    w_js = N̂_c / m_c,  j ∈ s and class c.

That is, each weight implies an estimate of the size of its class, namely N̂_c = m_c w_js for any j ∈ s ∩ c. If the classes c = 1, ..., C disjointly cover the whole population, the estimator of the total T_y can then be written

    T̂_y = Σ_c N̂_c ȳ_c,    (5.22)

where ȳ_c is the sample mean of y within weight class c.
In assessing the properties of T̂_y in (5.22), suppose we can assume the following superpopulation model, which might be appropriate for a single-stage, single-stratum design. The population units are randomly assigned to the weight classes c, of sizes N_c, c = 1, ..., C. Moreover, the y_j values are generated by random permutation models (independently) within weight classes. Sampled units in weight class c are assumed to respond with probability θ_c, independently of one another. Then the prediction MSE of T̂_y is the expectation of the conditional MSE, given m_c, c = 1, ..., C, i.e. given {w_js : j ∈ s}; and hence

    MSE(T̂_y) = ℰ[Var(T̂_y | {w_js : j ∈ s})] + ℰ[Bias²(T̂_y | {w_js : j ∈ s})].    (5.23)

With respect to the superpopulation,

    Bias(T̂_y | {w_js : j ∈ s}) = Σ_c (N̂_c − N_c) μ_c,

where μ_c is the mean of y within weight class c. It is not difficult to show from this that the second term of (5.23) is given by

    (N²/n) (1 − n/N) Σ_c (N_c/N) (μ_c − μ̄)²,

where μ̄ = Σ_c N_c μ_c / N.

Thus an approximately unbiased estimator of MSE(T̂_y) would be given by

    (N²/n²) Σ_c (n_c²/m_c) (1 − m_c/N̂_c) s_{c,y}² + (N²/n) (1 − n/N) Σ_c (n_c/n) (ȳ_c − ȳ)²,    (5.24)

where N̂_c = N n_c/n (n_c being the number of sample units falling in class c), ȳ = Σ_c (n_c/n) ȳ_c, and ȳ_c and s_{c,y}² are the sample mean and variance of y for sampled respondents in class c. The estimator (5.24) is close to one obtained from the unconditional distribution of T̂_y based on the assumptions that (i) the original sampling design is SRS with n draws and (ii) the respondents within each weight class form a Bernoulli subsample of the original sample from the weight class (see Särndal et al., 1992, p. 582). An approach using adjustments for conditional design bias has been studied by Valliant (1993).
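A minimal Python sketch of the point estimator (5.22) follows; the function name and interface are ours, and we assume the implied class sizes are estimated as N̂_c = N n_c/n as above.

```python
import numpy as np

def weight_class_total(y_resp, cls_resp, n_c, N, n):
    """Weight-class estimator (5.22): T_hat = sum_c N_hat_c * ybar_c,
    with N_hat_c = N * n_c / n estimated from the full sample."""
    y = np.asarray(y_resp, dtype=float)      # respondent y values
    cls = np.asarray(cls_resp)               # respondent weight-class labels
    t_hat = 0.0
    for c, nc in n_c.items():                # n_c: dict of sampled counts per class
        t_hat += (N * nc / n) * y[cls == c].mean()
    return t_hat
```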
Thus for the superpopulation model above, the weighted analysis
which uses (5.19) and (5.24) has a model justification and some sup-
port from a suitable sampling design, namely SRS. For more complex
situations where the weight class is determined partly by stratum or
PSU label, an approximation to the appropriate analysis may be ex-
pressible in terms of stratified and multi-stage exchangeable models
for the assignment of units to weight classes.

5.5 Evaluation of a sampling strategy


5.5.1 General considerations
In the previous section the role of a randomized sampling design in
descriptive inference was described in terms of support for some ex-
pression of inference appropriate under a superpopulation model. In
that context the choice of strategy, or estimator-design combination, is
essentially determined by the selection of a model-based expression of
inference.
However, in practice the possible strategies may be limited, or we
may not have a very specific model, or we may have a model without
sufficient symmetry to be reinforceable by a randomized design. In such
a case, for estimating a given finite population parameter there may be
no obvious ideal strategy available to the survey statistician. That is, in
many cases where a randomized sampling design is affordable, there
is a real choice to be made among various ways of using auxiliary
information on the population. It is then useful to consider the design
frequency properties of the strategies in their own right.
Section 2.4 discussed the concepts of unbiasedness and local relative efficiency for estimation of a finite population parameter, in terms of unbiasedness and variance or mean squared error of a point estimator
with respect to the sampling design. In that section, the properties of
strategies for the estimation of the population total were compared for
population arrays y considered likely to arise on the basis of prior
information about the population.
This method of evaluation seems fundamental and natural. In the
literature, it is often the case that when a new strategy is proposed,
it is tested and compared to other strategies with respect to specific
arrays y, either artificially generated or taken from real populations. On
each population array the strategy is replicated many times, and the
properties of the estimator under the design are noted.
The notion of strategy can be made more general, so that in addition
to point estimation strategies we can also consider interval estimation
strategies. For these the evaluation is usually in terms of coverage
frequencies. Again, empirical evaluations tend to be conducted using
specific arrays y. For each y the design is replicated many times, and
for each replication an interval is constructed. The coverage probabil-
ity for the strategy on the array y is estimated as the proportion of the
constructed intervals which do include the true value of the finite pop-
ulation parameter. Strategies with the same coverage probabilities on
arrays of interest may also be compared on the basis of average length
or stability of length of the intervals.
Increasingly also, in line with the discussion of Section 5.4.2, it is
recognized that the frequency properties of the strategy conditional on
some aspect of the sample data may be more important than the un-
conditional properties. This is because conditional frequency properties
may be more relevant to statements of inference. Further illustrations
will appear in Section 5.6. In these cases the evaluation looks in prin-
ciple at a separate set of replications for each value of the conditioning
statistic.
Empirical evaluations are very powerful in settling questions about
choice of strategy. If a point estimation method is seen to have non-
negligible bias relative to its standard error on realistic arrays with
realistic sample sizes, and particularly if the bias tends to be in one
direction or the other, we reject the estimation method for producing
input to decisions of consequence. If an interval estimation method
has empirically determined coverage probabilities which are much less
than nominal values, overall or conditional on an appropriate statistic,
we cannot use it for confidence intervals. Nevertheless, we need to
supplement the empirical evaluations with theoretical results in order

to understand how widely the empirical conclusions can be taken to


apply.
In principle, as in Sections 3.7 and 3.8, the moments of various
estimators and the coverage properties of associated intervals can be
approximated for many designs using sufficiently detailed expansions.
The resulting approximations are assessable in terms of properties of the
array y. However, the work involved and the complexity of the resulting
expressions, as well as the asymptotic nature of the approximations,
may make this approach less practical than large-scale simulations,
carefully designed. Moreover, considering performance on individual
arrays y will not necessarily help us to choose: for example, as is
well known, almost any plausible looking point estimation strategy
performs well for some array y (see Section 2.4). Uniformly optimal
point estimation strategies (subject to a limit on sample size) do not
exist in the sense of minimizing sampling MSE. We need simpler ways
of summarizing the performance of candidate strategies, ways that focus
attention on arrays y which are likely to occur as population arrays, if
these can be described at least approximately.

5.5.2 Minimum expected variance criterion


Thus we may be led to considering strategy performance averaged with
respect to some simple superpopulation model for y. We do not ne-
cessarily believe that the real life arrays under consideration have been
generated by this model, but the model is such that for some value of
its parameters the real life array y should be a plausible outcome.
In setting up formal evaluation criteria, let us now denote expectation and variance with respect to the design by E_p and Var_p, and expectation with respect to the superpopulation model by ℰ as before. Let θ(y) denote a real-valued function of y, for simplicity. Let (e, p) denote a strategy for estimating θ(y), and let h(X_s) be a statistic or function of the data X_s on which we plan to condition. The discussion in the next paragraph will also apply to cases where an unconditional analysis is appropriate, taking h to be constant in those cases.
For point estimation, as indicated above, unbiasedness with respect to the design probabilities is very important, and thus it is important that conditional E_p-unbiasedness

    E_p(e − θ(y) | h(X_s)) = 0    (5.25)

be satisfied exactly or approximately for all possible y. (We return to a discussion of how 'approximately' may be understood below.) If the weaker condition that

    ℰ E_p(e − θ(Y) | h(X_s)) = 0    (5.26)

is satisfied, we know only that any bias in e with respect to E_p(·|h(X_s)) averages to zero over the superpopulation. In the special case of ℰ-unbiasedness, namely when

    ℰ(e − θ(Y)) = 0 for every s,    (5.27)

then (5.26) must hold, even though (5.25) may not. The condition (5.27) is important for inference, as we have seen in Section 5.3, but the inference will be more secure if 'supported' by (5.25), and for purposes of comparing strategies we often take (5.25) to be the primary unbiasedness criterion to be satisfied.
Two point estimation strategies (e_1, p_1) and (e_2, p_2) can be compared in efficiency with respect to their (conditional) mean squared errors averaged with respect to the superpopulation distribution, so that (e_1, p_1) is more efficient than (e_2, p_2) if

    ℰ E_{p_1}[(e_1 − θ(y))² | h(X_s)] ≤ ℰ E_{p_2}[(e_2 − θ(y))² | h(X_s)].    (5.28)

With this criterion, in certain circumstances it is possible to find optimal design-unbiased strategies (e, p), strategies for which

    ℰ Var_p(e | h(X_s)) ≤ ℰ Var_{p_2}(e_2 | h(X_s))

holds for all (e_2, p_2) and for all values of the superpopulation parameters (Godambe, 1955).
Perhaps the simplest example of such a result can be shown for the sample mean ȳ_s as an estimator of the population mean μ_y, accompanied by a self-weighting sampling design of size n. Consider any superpopulation model under which the variates Y_1, Y_2, ..., Y_N are symmetrically or exchangeably distributed (see Section 5.2.1). For a given n, consider all estimator-design pairs (e, p) such that p is of fixed size n and the unconditional E_p-unbiasedness condition

    E_p(e − μ_y) = 0

is satisfied. It can be shown that among such strategies, any strategy with e = ȳ_s minimizes

    ℰ E_p(e − μ_y)²

for any choice of parameters in the exchangeable superpopulation model.
From this result it also follows easily that if we consider pairs (e, p) where the sample size n(s) may not be fixed, but conditional E_p-unbiasedness

    E_p(e − μ_y | n(s)) = 0

is satisfied, we can improve them by replacing e with the sample mean and p with a conditionally self-weighting design with the same distribution of n(s).

For other examples of results like these, see Thompson (1984) and references therein.
Exact E_p-unbiasedness may be too strong a requirement for the estimation of parameters defined by estimating functions, as in Section 4.1. A useful concept of approximate E_p-unbiasedness is known as asymptotic design unbiasedness, and can be defined in the same kind of context as the limit results of Sections 3.4 and 3.5 (see also Section 4.1.4). We consider a sequence of population arrays indexed by N, so that y_N is (y_{N1}, ..., y_{NN}). The strategy of interest is embedded in a sequence (e_N, p_N) with sample size n_N(s) tending to increase with N. We could say that this strategy is asymptotically design-unbiased for estimating θ(y) if

    E_{p_N}(e_N − θ(y_N)) / √Var_{p_N}(e_N) → 0    (5.29)

as N → ∞. For example, the statement that 'the bias of the ratio estimator R̂ of (4.8) is of order 1/n' under SRS (cf. Cochran, 1977, p. 160) is an assertion that it is asymptotically design-unbiased: under regularity conditions on y_N and n_N, the standard deviation of R̂ will be of order n_N^{-1/2}, and by the assertion E_{p_N}(R̂ − R_N) will be of order n_N^{-1}, implying the truth of (5.29) for R̂. The 'regularity conditions' will be satisfied in the context of an appropriate model. This type of asymptotic design unbiasedness trivially implies design consistency, namely that

    lim_{N→∞} P( |e_N − θ(y_N)| / |θ(y_N)| > ε ) = 0,

provided Var_{p_N}(e_N) / |θ(y_N)|² → 0 as N → ∞.


Defining an analogous criterion for conditional E_p-unbiasedness is difficult because the way to make the conditioning event depend on N as N → ∞ may not be obvious. Nevertheless this criterion is also important, and closely connected with model unbiasedness of estimating functions, as will be seen in Section 5.12.
In general, it is possible to establish asymptotic design-unbiasedness results for estimators when they arise as solutions of unbiased estimating equations, as in Section 4.1. In the next section we consider finite population criteria for optimal choice of these.

5.5.3 Estimating function determination


Corresponding to the discussion of optimal estimation strategies in Sec-
tion 5.5.2, a theory of optimal estimating functions can be formulated
along the following lines (Godambe and Thompson, 1986).
As has been seen in Section 4.1, very often a finite population parameter is thought of as the root of a population estimating function. For example, a population ratio R_N can be regarded as the solution of the equation

    Σ_{j=1}^N (y_j − R x_j) = 0.    (5.30)

In many contexts for the estimation of ratios and means, the aim is entirely descriptive. However, in other situations, even simple finite population parameters may also have interpretations as estimates of superpopulation parameters. In these cases, where the aim has a partly analytic flavour, it is appropriate that the superpopulation model determine the 'defining' estimating functions (Godambe, 1995).
From the other side, estimating superpopulation parameters through their finite population counterparts has a certain robustness, because the finite population parameters may be meaningful for descriptive inference even if the model is deficient. For example, if the model specifies

    Y_j = β x_j + ε_j,    (5.31)

the ε_j being independent with mean ℰε_j = 0, Var(ε_j) = σ² x_j, x_j > 0, the object might be to estimate β from a sample of the units j. The estimating equation

    Σ_{j=1}^N (y_j − R x_j) x_j / x_j = 0,    (5.32)

which is equivalent to (5.30), defines R = R_N as the (optimal) weighted least squares estimate of β under the model, given all the population values (see Section 5.7.2). If the model is correct and the population is large, estimating R_N from the sample is effectively estimating β. If on the other hand the true superpopulation departs from the model (5.31) in such a way as to make β meaningless, R_N may still be of interest as the finite population ratio.
In general, suppose that we have a superpopulation model describable in terms of a class C = {ξ} of distributions ξ for the population array Y. Let θ = θ(ξ) be a superpopulation parameter, namely a real- or vector-valued function defined on C. If Y_1, ..., Y_N are independent under distributions in C, then in many practically important cases, an 'optimal' estimating function system for θ exists in the form

    Φ*(y, θ) = Σ_{j=1}^N φ_j(y_j, θ),    (5.33)

where each φ_j has the dimension of θ, and

    ℰ{φ_j(Y_j, θ(ξ))} = 0 for all ξ ∈ C.    (5.34)

Godambe and Thompson (1986) have discussed the relevant optimality criteria in detail. For simplicity, let us take θ to be real in the following discussion.

When Φ* of (5.33) is optimal for estimating θ, we regard θ_N, defined by

    Σ_{j=1}^N φ_j(y_j, θ_N) = 0,    (5.35)

as the finite population parameter associated with θ. We then consider


estimating θ_N from the sample by solving equations

    g(X_s, θ) = 0,    (5.36)

where as before

    X_s = {(j, y_j) : j ∈ s}

represents the sample data. When the data are being obtained via a randomized sampling design p, it is natural to require design unbiasedness for the estimating function, namely that

    E_p{g(X_s, θ)} = Σ_{j=1}^N φ_j(y_j, θ)    (5.37)

for each population array y and parameter value θ. In particular, if the inclusion probabilities π_j are all positive, j = 1, ..., N, then the function

    g*(X_s, θ) = Σ_{j∈s} φ_j(y_j, θ)/π_j    (5.38)

satisfies (5.37).
In fact, g* of (5.38) is optimal in senses compatible with the optimality criterion of Section 5.5.2.

THEOREM 5.1: If Y_1, ..., Y_N are independent and (5.34) holds, and if the sampling design is independent of Y, then among all g satisfying (5.37), g* can be shown to minimize each of

    ℰE_p g² / (ℰE_p ∂g/∂θ)²,  ℰE_p g²,  and  ℰE_p ( g − Σ_{j=1}^N φ_j(Y_j, θ) )²,    (5.39)

for all ξ ∈ C.

Proof: The theorem follows easily once it is shown that

    ℰE_p[(g − g*) g*] = 0.    (5.40)

The left-hand side of (5.40) is

    ℰ{ Σ_s p(s) (g − g*) Σ_{j∈s} φ_j/π_j } = ℰ{ Σ_{j=1}^N (φ_j/π_j) Σ_{s: j∈s} p(s) (g − g*) }.

But this equals

    −ℰ{ Σ_{j=1}^N (φ_j/π_j) Σ_{s: j∉s} p(s) (g − g*) },

since Σ_s p(s)(g − g*) = 0. Because of the independence of Y_1, ..., Y_N under ξ, φ_j and Σ_{s: j∉s} p(s)(g − g*) are independent. Thus the left-hand side of (5.40) is

    − Σ_{j=1}^N (ℰ(φ_j)/π_j) ℰ( Σ_{s: j∉s} p(s)(g − g*) ),

which is 0 by (5.34). We have established (5.40) and the optimality of g* = φ_s(θ), an estimating function of the type used in Section 4.1.

For example, suppose that for ξ ∈ C the variates Y_1, ..., Y_N are independent and identically distributed with mean θ(ξ). Then from almost any standpoint, the optimal population estimating function for θ is

    Σ_{j=1}^N (Y_j − θ).

Thus the associated finite population parameter is the solution of

    Σ_{j=1}^N (y_j − θ_N) = 0,

or θ_N = μ_y. The optimal sample estimating function is

    g*(X_s, θ) = Σ_{j∈s} (y_j − θ)/π_j,    (5.41)

leading to the estimator

    θ̂_s = ( Σ_{j∈s} y_j/π_j ) / ( Σ_{j∈s} 1/π_j ),    (5.42)

as in (4.7).
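In Python, solving (5.41) amounts to a single expression; this small sketch (function name ours) returns the estimator (5.42).

```python
import numpy as np

def hajek_mean(y_s, pi_s):
    """Root of the optimal sample estimating function (5.41),
    i.e. the estimator (5.42) of the population mean."""
    y = np.asarray(y_s, dtype=float)
    pi = np.asarray(pi_s, dtype=float)   # inclusion probabilities, all > 0
    return np.sum(y / pi) / np.sum(1.0 / pi)
```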
From the considerations of Sections 5.3 and 5.4 it is clear that the optimality of g* in this section may be merely formal. If in the previous example we really believe in the model C, we might prefer the linear ℰ-unbiased estimator of θ_N having minimal predictive mean squared error, namely the sample mean ȳ_s. In that sense, estimation of θ_N via (5.42) would be inefficient (and rather unappealing) unless the π_j were all equal; the E_p-unbiasedness requirement would pull us away from the estimator which is best in model terms. On the other hand, if the superpopulation distributions ξ are merely a convenient device for averaging over plausible arrays y, the asymmetry in (5.42) is not so glaringly inconsistent with prior beliefs. Moreover, whether the model is meaningful or not, θ̂_s is approximately design-unbiased for the finite population mean θ_N.

5.6 Use of auxiliary information in estimation of means and totals

We now turn to combining considerations of inference and efficiency, in the estimation of means and totals in the presence of information on auxiliary variates. We will describe four approaches in the next four
sections. The first two may be described as 'model-assisted' approaches,
because the form of inference is dictated or influenced by a superpopu-
lation model, but has also prescribed frequency properties under the
sampling design. The third is the 'calibration' approach, which does not
rely explicitly on a model, while the fourth is the 'predictive' approach,
which makes no use of the design probabilities. For the remainder of
Chapter 5, we will assume that the design probabilities do not depend
on y or on the model parameters.
We will see that when the response variate y has a linear relationship
with the auxiliary variate x, the four approaches lead to similar results.
The approach which is closest in spirit to the discussions of Sections
5.1-5.5 is the first, namely model-assisted estimation through optimal

estimating functions. The second approach, namely generalized regres-


sion estimation, is more generally applicable, and can yield the same
results as the first approach when both are implementable.

5.7 Model-assisted estimation through optimal estimating functions
Model-assisted estimation, broadly speaking, is a way of trying to construct estimators with good design-based properties, consistent with a plausible model for (Y_1, ..., Y_N) (see Särndal et al., 1992).

The most easily justified kind of model-assisted estimation is based on expressing a descriptive parameter like T_y = Σ_{j=1}^N y_j in terms of population-based estimates of the parameters of a simple superpopulation model. Let us examine a simple case first, where we estimate T_y through a superpopulation mean.

5.7.1 Estimating totals through means


Suppose it is reasonable to assume that Y_1, ..., Y_N are i.i.d. with mean θ. We are using a model-assisted approach if we write T_y as N μ_y = N θ_N, and estimate it (because of the optimality of (5.41)) by

    T̂_y = N θ̂_s = N ( Σ_{j∈s} y_j/π_j ) / ( Σ_{j∈s} 1/π_j ),    (5.43)

rather than by

    T̃_y = Σ_{j∈s} y_j/π_j.

When the i.i.d. model for Y_1, ..., Y_N is correct, and thus an appropriate basis for inference, we have the justification of model unbiasedness. That is, for T̂_y we have

    ℰ(T̂_y − T_y) = 0;    (5.44)

while for the Horvitz-Thompson estimator T̃_y,

    ℰ(T̃_y − T_y) ≠ 0

unless Σ_{j∈s} 1/π_j is identically N. For this type of model-assisted estimation generally, the ℰ-unbiasedness of (5.44) is typical, as a consequence of ℰg* = 0 for g* of (5.38).
The estimator T̂_y is not always exactly design-unbiased as T̃_y is, but it is approximately so because of condition (5.37), which in this case is expressible as

    E_p{ Σ_{j∈s} (y_j − θ)/π_j } = Σ_{j=1}^N (y_j − θ).

The fact that the underlying sample estimating function g* of (5.41) is best among those which are both model- and design-unbiased means that, in a sense, T̂_y is as efficient as possible subject to the constraints of the two types of unbiasedness.

The next section describes a similar kind of model-assisted estimation under regression models, when y is linearly related to a vector-valued auxiliary variate x.

5.7.2 Estimating totals through models: ratio and regression estimators
We now consider the estimation of a real total

    T_y = Σ_{j=1}^N y_j

when the population array Y, now regarded as an N × 1 column vector, is modelled by

    Y = Xβ + ε.    (5.45)

In (5.45), X is a known N × p matrix of covariate values, β is a p × 1 vector of unknown coefficients, and ε is a vector of independent variates with mean vector ℰε = 0. The model covariance matrix of ε is taken to be

    Σ = diag(σ_1², ..., σ_N²).

It will be convenient to denote by x_j^T the jth row of X, for j = 1, ..., N. This is the 1 × p vector of covariate values for unit j in the population.
NOTE: The assumption that the errors ε_j are independent is natural with a single-stage or 'element' sampling design, but less so when the design is clustered multi-stage. If the situation requires us to think of the population as clustered, then a correlated error model, or a model with random as well as fixed effects, should be contemplated in place of (5.45). We will consider such models in Chapter 6, but from here on in Chapter 5, we will consider the population as one which is to be sampled element by element, and in which an independent error structure is applicable. We will sometimes think of the model as a true expression of belief about Y, and sometimes as a 'working model', being used to capture efficiency gains from whatever relationship exists between X and Y.
The finite population parameter associated with the superpopulation parameter β can be taken to be the solution β_N of the estimating equation system

    X^T Σ^{-1} (y − X β_N) = 0.

More explicitly, this may be written

    Σ_{j=1}^N σ_j^{-2} x_j (y_j − x_j^T β_N) = 0,    (5.46)

so that

    β_N = ( Σ_{j=1}^N σ_j^{-2} x_j x_j^T )^{-1} ( Σ_{j=1}^N σ_j^{-2} x_j y_j ).    (5.47)
A special case is expressed by what we shall call the XΣ condition, namely that the vector (σ_1², ..., σ_N²)^T is in the column space of X.

XΣ condition: There exists a p × 1 vector λ such that

    σ_j² = λ^T x_j,  j = 1, ..., N.    (5.48)

An important consequence of the XΣ condition is that

    T_y = Σ_{j=1}^N x_j^T β_N = 1^T X β_N,    (5.49)

where 1^T is a row vector of N 1s. For if we premultiply (5.46) by λ^T, we obtain

    Σ_{j=1}^N (y_j − x_j^T β_N) = 0.    (5.50)
According to the optimality-based argument of Section 5.7.1, generalized to a vector parameter, this would justify as an estimator of T_y the quantity

    T̂_y = Σ_{j=1}^N x_j^T β̂_s,    (5.51)

where β̂_s satisfies the sample estimating equation

    Σ_{j∈s} (σ_j^{-2}/π_j) x_j (y_j − x_j^T β̂_s) = 0.    (5.52)
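As an illustration of (5.51)-(5.52), here is a minimal Python sketch (our own function names and interface) that solves the weighted sample estimating equation and predicts the total; it assumes the full population covariate matrix is available.

```python
import numpy as np

def model_assisted_total(X_pop, X_s, y_s, pi_s, sigma2_s):
    """Solve (5.52) for beta_s, then form (5.51): sum_j x_j' beta_s."""
    X_s = np.asarray(X_s, dtype=float)
    w = 1.0 / (np.asarray(sigma2_s) * np.asarray(pi_s))  # sigma_j^{-2} / pi_j
    A = X_s.T @ (w[:, None] * X_s)                       # sum_s w_j x_j x_j'
    b = X_s.T @ (w * np.asarray(y_s, dtype=float))       # sum_s w_j x_j y_j
    beta_s = np.linalg.solve(A, b)
    return np.asarray(X_pop, dtype=float).sum(axis=0) @ beta_s   # 1'X beta_s
```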

Special cases

Ratio estimator of T_y

If p = 1, so that x_j^T is a real number x_j, and σ_j² ∝ x_j, then

    β_N = Σ_{j=1}^N y_j / Σ_{j=1}^N x_j = R_N,

the population ratio of y to x. The XΣ condition is certainly satisfied, and in fact it is clear even otherwise that (5.50) holds, namely that T_y = T_x R_N. The model-assisted estimator T̂_y of (5.51) is

    T̂_y = T_x ( Σ_{j∈s} y_j/π_j ) / ( Σ_{j∈s} x_j/π_j ),    (5.53)

proposed by Hájek (1971) for Example 2.5. If the design is self-weighting, like SRS for example, then

    T̂_y = T_x ȳ_s / x̄_s,    (5.54)

the classical ratio estimator.
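A one-line Python sketch of (5.53) follows (function name ours); with equal π_j it reduces to the classical form (5.54).

```python
import numpy as np

def ratio_total(y_s, x_s, pi_s, T_x):
    """Ratio estimator (5.53): T_x * (sum y_j/pi_j) / (sum x_j/pi_j)."""
    y, x, pi = (np.asarray(a, dtype=float) for a in (y_s, x_s, pi_s))
    return T_x * np.sum(y / pi) / np.sum(x / pi)
```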

Simple regression estimator of T_y

If σ_j² = σ² for all j, then

    β_N = ( Σ_{j=1}^N x_j x_j^T )^{-1} Σ_{j=1}^N x_j y_j.    (5.55)

If, furthermore, X contains a column of 1s, so that there is an intercept term in the model, then the XΣ condition is satisfied, and T_y = 1^T X β_N. When

    X^T = ( 1    1   ...  1
            x_1  x_2 ...  x_N ),

the model-assisted estimator T̂_y reduces to the regression estimator

    T̂_yL = N β̂_1s + T_x β̂_2s,

where

    β̂_2s = ( Σ_{j∈s} (x_j − x̄_π) y_j/π_j ) / ( Σ_{j∈s} (x_j − x̄_π) x_j/π_j ),

    β̂_1s = ȳ_π − β̂_2s x̄_π,

and

    x̄_π = ( Σ_{j∈s} x_j/π_j ) / ( Σ_{j∈s} 1/π_j ),
    ȳ_π = ( Σ_{j∈s} y_j/π_j ) / ( Σ_{j∈s} 1/π_j ).
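The following Python sketch (our own names) computes the simple regression estimator just described from the π-weighted sample quantities.

```python
import numpy as np

def regression_total(y_s, x_s, pi_s, N, T_x):
    """Simple regression estimator T_yL = N*b1 + T_x*b2 with
    pi-weighted means, slope and intercept as in Section 5.7.2."""
    y, x, pi = (np.asarray(a, dtype=float) for a in (y_s, x_s, pi_s))
    w = 1.0 / pi
    xbar = np.sum(w * x) / np.sum(w)
    ybar = np.sum(w * y) / np.sum(w)
    b2 = np.sum(w * (x - xbar) * y) / np.sum(w * (x - xbar) * x)
    b1 = ybar - b2 * xbar
    return N * b1 + T_x * b2
```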

In the preceding justification of regression estimators through estimating function optimality, the XΣ condition was important. It does not hold in all cases of interest. A commonly used superpopulation model is given by

    Y_j = β_1 + β_2 x_j + ε_j,    (5.56)

where ℰε_j = 0 and Var(ε_j) = σ² x_j^γ for some γ ≠ 1, 0. In such a case Mantel (1991) has recommended enlarging the set of independent variables for construction of a model-assisted estimator of T_y. The convenient 'working model' is now not (5.56) but

    Y_j = β_1 + β_2 x_j + β_3 x_j^γ + ε_j,    (5.57)

and the XΣ condition holds.

5.8 GREG estimators


An approach to model-assisted estimation which is motivated mainly by unbiasedness, and thus does not depend on the XΣ condition, is the following (Särndal et al., 1989; 1992). This approach yields what is called in the latter work the GREG or generalized regression estimator. We begin by noting that

    T̃_y = Σ_{j∈s} y_j/π_j

is design-unbiased for T_y but not model-unbiased under model (5.45). That is, E_p(T̃_y − T_y) = 0, but

    ℰ(T̃_y − T_y) = Σ_{j∈s} x_j^T β/π_j − Σ_{j=1}^N x_j^T β.

If we knew β we could 'correct' T̃_y, replacing it by

    Σ_{j∈s} y_j/π_j − Σ_{j∈s} x_j^T β/π_j + Σ_{j=1}^N x_j^T β,    (5.58)

which is both design- and model-unbiased. In estimating function terms, we would be estimating T_y as a constant plus the finite population estimating function

    Σ_{j=1}^N (y_j − x_j^T β),

for which the best sample-based estimating function is

    Σ_{j∈s} (y_j − x_j^T β)/π_j.

Now the estimator (5.58) would still be model-unbiased if β were replaced by any model-unbiased estimate of β. If such an estimator were also design-consistent for some population parameter like β_N of (5.47), then (5.58) would also still be approximately design-unbiased for T_y. An obvious candidate for the estimate of β would be β̂_s of (5.52), yielding the estimator

    T̂_y = Σ_{j∈s} y_j/π_j − Σ_{j∈s} x_j^T β̂_s/π_j + Σ_{j=1}^N x_j^T β̂_s.    (5.59)
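A compact Python sketch of the GREG estimator (5.59) (our own function names), reusing the β̂_s of (5.52):

```python
import numpy as np

def greg_total(X_pop, X_s, y_s, pi_s, sigma2_s):
    """GREG estimator (5.59): HT estimator plus a model-based
    bias correction built from beta_s of (5.52)."""
    X = np.asarray(X_s, dtype=float)
    y = np.asarray(y_s, dtype=float)
    pi = np.asarray(pi_s, dtype=float)
    w = 1.0 / (np.asarray(sigma2_s, dtype=float) * pi)
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    resid = y - X @ beta
    return np.sum(resid / pi) + np.asarray(X_pop, dtype=float).sum(axis=0) @ beta
```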

Note that in the GREG approach, apart from the form of the bias correction being provided by the model, the main emphasis is on the role of T_y as a descriptor of the finite population. However, if the XΣ condition does hold and β̂_s is chosen as above, then T̂_y of (5.59) coincides with T̂_y of (5.51), which is obtained by regarding T_y as the population estimator of the superpopulation parameter 1^T X β. This is because the XΣ condition implies that

    Σ_{j∈s} (y_j − x_j^T β̂_s)/π_j = 0,    (5.60)

as can be seen by premultiplying the system

    Σ_{j∈s} (σ_j^{-2}/π_j) x_j (y_j − x_j^T β̂_s) = 0

by λ^T.
Let us apply the GREG approach to the special case of model (5.56) in which β_1 = 0 and γ = 2, so that the variates Y_j/x_j have common mean β = β_2 and common variance σ². Here the XΣ condition is not satisfied. In this case a fixed size (n) design with

    π_j = n x_j / T_x    (5.61)

is often recommended. This gives as HT estimator

    T̃_y = T_x ( Σ_{j∈s} y_j/x_j ) / n,    (5.62)

which is clearly both model- and design-unbiased for T_y. The corrected estimator T̂_y of (5.59) is T̃_y itself, no matter what estimator is used for β. It is interesting that in this case T̃_y takes the appealing form

    T̃_y = T_x β̂_s,    (5.63)

where β̂_s is the minimum variance unbiased estimator for β with respect to the model, s being regarded as fixed. The estimator-design strategy given by (5.61) and (5.62) can be shown to be optimal in the sense of minimizing expected sampling variance, as in Section 5.5.2 (Godambe and Joshi, 1965; Godambe and Thompson, 1973). A consequence is the optimality of the monetary unit sampling of Section 3.12 for the error model of Cox and Snell (1979) in Section 5.2.2.

5.9 Calibration methods


The calibration approach to estimation provides another approach to 'correcting' estimators of T_y to incorporate auxiliary information, this time without necessarily introducing a model at all. A general treatment is provided by Deville and Särndal (1992).

We begin with an estimator constructed without reference to X, such as the HT estimator

    T̃_y = Σ_{j∈s} y_j/π_j.

The N × p matrix X contains as before the values of a p-dimensional covariate for the population members, and we suppose that besides the values x_j for j ∈ s we also have knowledge of

    T_x = 1^T X,

the vector of column totals of X. This allows construction of a new estimator

    T̂_yc = Σ_{j∈s} w_js y_j    (5.64)

where the weights w_js are close to the weights 1/π_j in some sense, and where

    Σ_{j∈s} w_js x_j^T = T_x,    (5.65)

so that the estimator T̂_yc is exact for every y which is in the column space of X.
One important application is where X is a subgroup indicator matrix. Suppose D_1, ..., D_p are subgroups of the population, and define the rth column of X to have 1 at row j if j ∈ D_r and 0 at row j if j ∉ D_r. Then

    T_x = (N_1, ..., N_p),

the vector of subgroup sizes. If the row vectors x_j^T, j ∈ s, have p linearly independent members, we construct {w_js} so that

    N̂_r = Σ_{j∈s∩D_r} w_js = N_r,  r = 1, ..., p.    (5.66)

If the subgroups D_1, ..., D_p are disjoint and all represented in the sample, such weights always exist, and they may exist more generally; if in addition the population is the union of disjoint members of D_1, ..., D_p, then the weights satisfy

    Σ_{j∈s} w_js = N.

They may therefore have a natural representation interpretation as in Section 5.4.3, where w_js is the number of population units 'represented' by the sampled unit j.
For example, suppose the population has N people, where N is known, and that it is known also that: N_1 live in urban areas, N − N_1 in rural areas; N_2 are female, N − N_2 are male; and N_3 are under 25 years of age, N − N_3 are aged 25 years or older. Then we can take the rth column of X to be the indicator for D_r, where D_1 = {urban area dwellers in population}, D_2 = {female members of population}, D_3 = {members of population under 25} and D_4 = U = population. Thus x_j^T = (1, 0, 0, 1) would signify that the jth population member is an urban-dwelling male aged 25 years or older. The vector of column totals would be

    T_x = (N_1, N_2, N_3, N).

If the weights are calibrated to X, then we will use

    N̂_12 = Σ w_js (summing over sampled urban-dwelling females)  and  N̂_12' = Σ w_js (summing over sampled urban-dwelling males)

to estimate the number of urban-dwelling females and the number of urban-dwelling males respectively; and these estimates will automatically satisfy the constraint

    N_1 = N̂_12 + N̂_12'.
In general, there will be many choices of sets of weights {w_js} which will satisfy the calibration constraints (5.65). To select from among these we might try to minimize a measure of distance between {w_js : j ∈ s} and the initial weights {1/π_j : j ∈ s}. For example, the empirical likelihood method suggested by Chen and Qin (1993) is equivalent to minimizing

    D_EL = − Σ_{j∈s} (1/π_j) log( w_js / (1/π_j) ),    (5.67)

subject to (5.65) with

    X^T = ( 1    1   ...  1
            x_1  x_2 ...  x_N ),

in the equal inclusion probability case. This can be shown to be approximately equivalent to minimizing

    D_Q = Σ_{j∈s} (w_js − 1/π_j)² π_j    (5.68)

subject to (5.65). Another possible distance measure is the Kullback-Leibler measure

    D_KL = − Σ_{j∈s} w_js log( w_js / (1/π_j) ).    (5.69)
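For the chi-squared distance (5.68), the constrained minimization has a closed form, leading to GREG-type weights; a minimal Python sketch (our own names, assuming the constraints are feasible):

```python
import numpy as np

def calibrate_chi2(X_s, pi_s, T_x):
    """Weights minimizing D_Q of (5.68) subject to (5.65):
    w_j = d_j (1 + x_j' lam), with lam chosen to meet the constraints."""
    X = np.asarray(X_s, dtype=float)
    d = 1.0 / np.asarray(pi_s, dtype=float)        # initial weights 1/pi_j
    A = X.T @ (d[:, None] * X)                     # sum_j d_j x_j x_j'
    lam = np.linalg.solve(A, np.asarray(T_x, dtype=float) - X.T @ d)
    return d * (1.0 + X @ lam)
```

One can check directly that the returned weights reproduce T_x exactly when applied to the columns of X_s.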
The GREG (regression) estimator (5.59) is a calibration estimator since it satisfies (5.65). It can be shown to arise from minimizing

    D_R = Σ_{j∈s} (w_js − 1/π_j)² σ_j² π_j

subject to that constraint. The representation of the regression estimator in calibration estimator form

    Σ_{j∈s} w_js y_j,    (5.70)

will be important for approximate variance estimation in Section 5.11.


Deville and Särndal (1992) have shown that other calibration estimators can be regarded as asymptotically equivalent to regression estimators. Thus estimators like raking ratio estimators (Deming and Stephan, 1940) which satisfy constraints like (5.66) can be seen as model-assisted in a sense, for a model relating y linearly to the indicator variates of the subgroups. Note that this 'working' model which provides the 'assistance' is not necessarily the one which best represents our idea of the dependence of y on the variates. For example, suppose we know the population sizes in regions h = 1, ..., H and in age groups a = 1, ..., A, but do not know the numbers in the cross classes, which are age groups within regions. Then the covariates used for calibration or model assistance may be the indicators for the regions and for the age groups. A more plausible superpopulation model for inference might well contain also indicators for the cross classes, but we do not use it for calibration because the numbers in the cross classes are unknown.

5.10 Predictive approach to regression estimators of totals


The pure (frequentist) predictive approach to the use of regression mod-
els, as developed by Brewer (1963), Royall (1970) and Royall and
Cumberland (1978), does not involve the sampling design at the anal-
ysis stage. Thus for purposes of estimation the sample is regarded as
fixed, and inference relies on a model which expresses our belief about
the generation of Y. We shall describe this approach here and will see
that its methods give estimator forms which are very close to those of
model-assisted estimation.
A superpopulation model, which we shall take to be that of (5.45), is assumed. For estimating (predicting) T_y, we consider the class of linear predictors

    e = Σ_{j∈s} a_js Y_j,

which are ℰ-unbiased in the sense that

    ℰ(e − T_y) = 0.

Suppose we let a_js = c_js + 1, and let β̃_c be any p × 1 vector of linear combinations of the Y_j, j ∈ s, satisfying

    Σ_{j∈s} c_js Y_j = Σ_{j∉s} x_j^T β̃_c

and

    ℰ(β̃_c − β) = 0.

Then

    e − T_y = Σ_{j∈s} c_js Y_j − Σ_{j∉s} Y_j,

or

    e − T_y = ( Σ_{j∉s} x_j^T )(β̃_c − β) − Σ_{j∉s} (Y_j − x_j^T β),    (5.71)

and the terms in (5.71) are each of mean zero and independent. It follows that the 'best unbiased linear predictor' of T_y, namely the one which minimizes ℰ(e − T_y)², is of the form

    T̂_ym = Σ_{j∈s} Y_j + Σ_{j∉s} x_j^T β̃_s,    (5.72)

where β̃_s is the best unbiased linear estimator of β. (For multidimensional β this means that linear combinations of the components of β̃_s have minimal variance.) The estimator β̃_s is obtained by weighted least squares, and if the XΣ condition is satisfied by the model, then

    Σ_{j∈s} Y_j = Σ_{j∈s} x_j^T β̃_s,

and we have

    T̂_ym = 1^T X β̃_s.

Thus in the special case of the XΣ condition, the predictor has the same form as the model-assisted estimator (5.51), except that β̃_s is optimal in a purely model-based sense. Unlike β̂_s of (5.52) in regular cases, β̃_s need not be design-consistent for β_N of (5.47).

5.11 The uncertainty in ratio and regression estimators


We now turn to estimation of uncertainty for ratio and regression estimators, the relationship of uncertainty estimation with conditioning, and extensions of model-assisted estimation to other contexts. The survey paper of Rao (1994) is recommended for a far-reaching discussion of the issues in the rest of this chapter.

5.11.1 Approximate variance estimators


The problem of associating standard errors and confidence intervals with ratio and regression estimators of totals can be approached in several ways. To begin with an essentially design-based approach, we consider the interpretation of these estimators as calibration estimators. That is, they are of the form

    e = Σ_{j∈s} w_js y_j    (5.73)

as in (5.70), and the weights {w_js} are chosen so as to make the estimator e exact whenever y is in the column space of the N × p 'design matrix' X. Then for any p × 1 vector β,

    e − T_y = Σ_{j∈s} w_js (y_j − x_j^T β) − Σ_{j=1}^N (y_j − x_j^T β)
            = Σ_{j∈s} w_js E_j(β) − Σ_{j=1}^N E_j(β),    (5.74)

where E_j(β) = y_j − x_j^T β. Here again we will assume a single-stage design.
Suppose e is approximately design-unbiased, and we have a formula for estimating its variance. Then we could use this formula, substituting E_j(β) for any β in place of y_j, to estimate the design MSE

    E_p(e − T_y)².

Intuitively, we will make best use of a linear dependence of y on the x variates if we further substitute the appropriate least squares estimate β̂_s for β. Thus if

    w_js = g_js/π_j,    (5.75)

where g_js is a correction factor which should be close to 1, we might suggest the estimator

    v_0(e) = Σ_{j∈s} (1 − π_j)(g_js Ê_j/π_j)² − Σ Σ_{j≠k∈s} w_jk (g_js Ê_j/π_j)(g_ks Ê_k/π_k),    (5.76)

where w_jk is as in (2.29) and

    Ê_j = y_j − x_j^T β̂_s.    (5.77)

This form is proposed by Särndal et al. (1992, p. 246).
Alternatively, if the design is of fixed size (n), we might consider the estimator

    v(e) = Σ Σ_{j<k∈s} ((π_j π_k − π_jk)/π_jk) (g_js Ê_j/π_j − g_ks Ê_k/π_k)².    (5.78)

This is based on the Yates-Grundy-Sen estimator (2.27), and is more likely than v_0(e) to take non-negative values only. Recall that if N/n is very large and the design is approximately with replacement, we have the approximation

    v(e) ≈ (n/(n − 1)) Σ_{j∈s} (u_j − ū)²,    (5.79)

as in (2.95), where

    u_j = g_js Ê_j/π_j = g_js (y_j − x_j^T β̂_s)/π_j  and  ū = (1/n) Σ_{j∈s} u_j.

If e is a regression estimator and the XΣ condition is satisfied, then ū is 0, and

    v(e) ≈ (n/(n − 1)) Σ_{j∈s} (g_js Ê_j/π_j)² = (n/(n − 1)) Σ_{j∈s} (w_js Ê_j)².    (5.80)

For the special case in which

    X^T = ( 1    1   ...  1
            x_1  x_2 ...  x_N ),

and in which we derive a regression estimator assuming σ_1 = ... = σ_N = σ, it can be seen that w_js = g_js/π_j where

    g_js = N/N̂ + N (μ_x − x̄_π)(x_j − x̄_π) / { Σ_{j∈s} (x_j − x̄_π) x_j/π_j },    (5.81)

with N̂ = Σ_{j∈s} 1/π_j and μ_x = T_x/N. Thus the 'g-weights' (Särndal et al., 1992, p. 232) will indeed be close to 1 in large samples, and for these the variance estimators given above should estimate the design MSE of e well.
If we have a fairly strong belief in the superpopulation model of (5.45) as a description of how the Y_j are generated, we might prefer to look at regression estimation from the point of view of prediction.
Again suppose that

    e = Σ_{j∈s} w_js Y_j

is exact when y is a linear combination of the x variates, as is true for T̂_y of (5.51), T̂_y of (5.59) and T̂_ym of (5.72). As before, but now in random variate notation,

    e − T_y = Σ_{j∈s} w_js (Y_j − x_j^T β) − Σ_{j=1}^N (Y_j − x_j^T β).    (5.82)

Clearly if β is the true value of the model parameter, then

    ℰ(e − T_y) = 0

for s fixed. The prediction mean squared error is

    ℰ(e − T_y)² = Σ_{j∈s} (w_js − 1)² Var E_j(β) + Σ_{j∉s} Var E_j(β),    (5.83)

with

    E_j(β) = Y_j − x_j^T β.
It is easily seen that for populations which are large relative to the sample size, the first term of (5.83) will be dominant. Moreover, a robust estimator of the first term of (5.83) is provided by

    Σ_{j∈s} (w_js − 1)² Ê_j²,    (5.84)

which is approximately

    v_m(e) = Σ_{j∈s} (w_js Ê_j)².    (5.85)

The increase in (5.85) over (5.84) will tend to compensate for the omission of an estimator of the second term of (5.83), particularly if the sample is representative with respect to the values of Var{E_j(β)} = σ_j². See Kott (1990) for a related discussion.
The closeness of (5.85) and (5.80), which differ only by a factor of n/(n − 1), has a couple of implications. First, the model-based MSE estimator v_m(e) is seen to have desirable frequency properties under the design, assuming the conditions leading to (5.80) are satisfied. That is, it approximates an unbiased estimate of the design MSE for a certain design which is approximately with replacement. Second, the design-based distribution under which this is so can be taken to be conditional on the sample statistic

    W_s = {(w_js, σ_j²) : j ∈ s},

and approximately on Σ_{j∈s} w_js² σ_j², which is the dominating part of (5.83). For since ℰ(e − T_y)² is the same for all samples with the same values of W_s, it is the same as

    ℰ E_p{(e − T_y)² | W_s}.

The use of confidence intervals based on (5.80) as an estimator of MSE for the regression estimator can also be justified in terms of the estimating function theory of Section 4.4 (see Binder and Patak, 1994). Let us consider the special case where

    X^T = ( 1    1   ...  1
            x_1  x_2 ...  x_N )    (5.86)

and σ_1² = σ_2² = ... = σ_N² = σ². The estimating equation system for β_1, β_2 is

    Σ_{j∈s} (y_j − β_1 − β_2 x_j)/π_j = 0,
    Σ_{j∈s} x_j (y_j − β_1 − β_2 x_j)/π_j = 0.

If we reparametrize to (γ = β_1 + β_2 μ_x, β_2) we can write an equivalent estimating equation system as follows:

    φ_1s = Σ_{j∈s} ( y_j − γ − β_2(x̄_π − μ_x) − β_2(x_j − x̄_π) )/π_j = 0,
    φ_2s = Σ_{j∈s} (x_j − x̄_π)( y_j − γ − β_2(x̄_π − μ_x) − β_2(x_j − x̄_π) )/π_j = 0,    (5.87)

where x̄_π = ( Σ_{j∈s} x_j/π_j )/( Σ_{j∈s} 1/π_j ). Since Σ_{j∈s} (x_j − x̄_π)/π_j = 0, solving the second equation gives the estimate

    β̂_2 = { Σ_{j∈s} y_j (x_j − x̄_π)/π_j } / { Σ_{j∈s} x_j (x_j − x̄_π)/π_j }.

Then the approximately normal pivot for confidence intervals for ȳ_N = T_y/N has as numerator φ_1s evaluated at β̂_2, or

    Σ_{j∈s} ( y_j − γ − β̂_2(x̄_π − μ_x) − β̂_2(x_j − x̄_π) )/π_j.    (5.88)

The denominator suggested by (4.118) is of the form √{ v( Σ_{j∈s} z_j/π_j ) }, where v is a suitable variance estimation form, and

    z_j = [ 1 − N (x_j − x̄_π)(x̄_π − μ_x) / { Σ_{j∈s} x_j (x_j − x̄_π)/π_j } ] Ê_j.    (5.89)
Under conditions appropriate for (5.80), this would lead to approximate confidence intervals for T_y of the form

    T̂_y ± z_{1−α} √{ (n/(n − 1)) Σ_{j∈s} (g_js Ê_j/π_j)² },

where g_js is given by (5.81).
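A minimal Python sketch of such an interval, using the with-replacement form (5.80) of the variance estimator (names and interface ours; residuals, g-weights and the point estimate are assumed already computed):

```python
import numpy as np
from scipy.stats import norm

def greg_interval(t_hat, g, resid, pi, alpha=0.05):
    """Approximate interval t_hat +/- z * sqrt(v), with v as in (5.80)."""
    u = np.asarray(g) * np.asarray(resid) / np.asarray(pi)  # g_js * E_hat_j / pi_j
    n = len(u)
    v = (n / (n - 1)) * np.sum(u**2)
    z = norm.ppf(1 - alpha)            # coverage approx 1 - 2*alpha, as in (5.11)
    return t_hat - z * np.sqrt(v), t_hat + z * np.sqrt(v)
```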

5.11.2 Variance estimators and survey weights


Weights accompanying survey data in public use tapes (see Section
5.4.3) are often calibration weights in the sense of Section 5.9. That is,
weights which incorporate design inclusion probabilities and response
rates are corrected so that the final estimator

t = \sum_{j \in s} w_{js} y_j

will be exact when applied to the columns of a matrix $X$ of auxiliary
variates. If the weights before correction are $1/\pi_j$, $j \in s$, it follows
from Section 5.11.1 that

\frac{n}{n-1} \sum_{j \in s} (w_{js} \hat{\epsilon}_j)^2

of (5.80) might be a reasonable MSE estimate for $t$. To calculate it
requires the ability to calculate

\hat{\epsilon}_j = y_j - x_j^T \hat{\beta}_s,

with $\hat{\beta}_s$ the solution of (5.52); strictly speaking this requires knowledge
of the 'before correction' weights and the variates which have
been used for calibration, as well as the 'after correction' weights $w_{js}$.

5.12 Conditional sampling properties of the ratio and regression estimators

In accordance with the arguments of Section 5.4, it is important to
consider whether descriptive inference based on or assisted by regression
models can be supported by randomization in the sampling design.
Since inference about the regression coefficients in the model is
conditional on the $x_j$ values in the sample, it is intuitively clear that the
supporting design-based distribution should be conditional on appropriate
functions of these. For purposes of illustration we consider the
cases of the ratio estimator and the simple regression estimator.

5.12.1 The ratio estimator

The ratio estimator is obtained as a model-assisted estimator for $T_y$
when the design is SRS and the model is

Y_j = \beta x_j + E_j,   (5.90)

with $\mathcal{E} E_j = 0$, $\mathrm{Var}(E_j) = \sigma^2 x_j$. Here all variates are real, and the
covariate $x$ is positive. As we have seen in Section 5.7, the estimator
has the form

\hat{T}_R = T_x \sum_{j \in s} Y_j \Big/ \sum_{j \in s} x_j.   (5.91)

It is easily seen that $\mathcal{E}(\hat{T}_R - T_y) = 0$ and

\mathcal{E}(\hat{T}_R - T_y)^2 = \sigma^2 T_x \sum_{j \notin s} x_j \Big/ \sum_{j \in s} x_j   (5.92)

for each fixed $s$. Since the prediction MSE depends on the sample $s$
through $\sum_{j \in s} x_j$ or through $\bar{x}_s$, we are led to evaluating the sampling
properties of $\hat{T}_R$ through its distribution conditional on $\bar{x}_s$. This is a
distribution which should support model-based inference about $T_y$ if
the model is correct.
This sort of evaluation is in effect what has been done (with a different
though related purpose) in empirical studies by Royall and Cumberland
(1981a; 1981b; 1985). They considered a collection of real
populations representing the contexts in which ratio estimation has
traditionally been applied. In each case they took a large number of
samples of size $n = 32$ at random, grouped these according to the values
of $\bar{x}_s$ into 20 groups, and in each group observed the bias and MSE
of $\hat{T}_R$, the bias of various estimators of MSE, and the performance of
associated confidence intervals. The grouping effectively resulted in the
measurement of sampling properties conditional on $\bar{x}_s$.

If the population is generated from the model (5.90), then the conditional
sampling bias, denoted symbolically by

E_p(\hat{T}_R - T_y \mid \bar{x}_s),

would be expected to be close to 0 for large or moderately large samples.
However, it is evident from the studies of Royall and Cumberland
that this is seldom true for real populations when $\bar{x}_s$ is not relatively
close to $\mu_x$. One reason is that the ratio estimator is model-biased for
fixed $s$ when $\mathcal{E} Y_j$ is not $\beta x_j$ but $\beta_1 + \beta_2 x_j$, where $\beta_1 \neq 0$. In that case,

\mathcal{E}(\hat{T}_R - T_y) = \beta_1 \Big(\frac{T_x}{\bar{x}_s} - N\Big),   (5.93)

which has the sign of $\beta_1$ if $\bar{x}_s < \mu_x$ and the opposite sign if $\bar{x}_s > \mu_x$.
The sampling bias conditional on $\bar{x}_s$ is an estimate of the model bias,
and hence will exhibit the same behaviour.
The unconditional and conditional biases of $\hat{T}_R$ under SRS can be
assessed for large samples using the theory in Chapter 3. Note first that

\hat{T}_R/N - \mu_y = \bar{z}_s \mu_x / \bar{x}_s,   (5.94)

where $R_N = \sum_{j=1}^N y_j \big/ \sum_{j=1}^N x_j$ and $\bar{z}_s = \bar{y}_s - R_N \bar{x}_s$. The right-hand side of (5.94) can be
rewritten as

\frac{\bar{z}_s}{1 + (\bar{x}_s - \mu_x)/\mu_x} \simeq \bar{z}_s \big(1 - (\bar{x}_s - \mu_x)/\mu_x\big).

It follows that the unconditional bias of $\hat{T}_R/N$
is approximately $-\mathrm{Cov}_p(\bar{z}_s, \bar{x}_s)/\mu_x$, which is of order $1/n$.

If we further assume that conditions are such that $\bar{x}_s, \bar{z}_s$ can be taken
to be approximately bivariate normal, then the approximation

E_p(\bar{z}_s \mid \bar{x}_s) \simeq (B_N - R_N)(\bar{x}_s - \mu_x)   (5.95)

holds, where

B_N = \sum_{j=1}^{N} (y_j - \mu_y)(x_j - \mu_x) \Big/ \sum_{j=1}^{N} (x_j - \mu_x)^2.

From (5.94) and (5.95), it follows that the conditional bias has the
approximation

E_p(\hat{T}_R/N - \mu_y \mid \bar{x}_s) \simeq (B_N - R_N)(\bar{x}_s - \mu_x).   (5.96)

Unless $\bar{x}_s = \mu_x$ or $R_N = B_N$, the conditional bias of $\hat{T}_R/N$ is of order
$n^{-1/2}$ in probability, and is thus actually comparable to the standard
error of $\hat{T}_R/N$.

Figure 5.3 shows how error in the assumption that $(x_j, y_j)$, $j = 1, \ldots, N$, lie near a line through the origin makes $\hat{T}_R/N$ unsuitable as
an estimator of $\mu_y$ if $\bar{x}_s \neq \mu_x$. Robinson (1987) has suggested making
a correction to $\hat{T}_R$ for its bias conditional on $\bar{x}_s$. This correction can

[Figure 5.3 omitted: population scatterplot lying close to a line $\hat{y} = \hat{R}_s x$ through the origin, with $\mu_y$ and $\mu_x$ marked.]

Figure 5.3 Population scatterplot lies close to a line with positive intercept. If
the points marked $\otimes$ are sampled, so that $\bar{x}_s > \mu_x$, $\hat{T}_R/N$ will underestimate $\mu_y$.

also be interpreted as correcting for model bias, under a model with

\mathcal{E}(Y_j) = \beta_1 + \beta_2 x_j.
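A sketch of how this conditional bias can be seen by simulation follows; the population (with $\beta_1 \neq 0$) and all settings are hypothetical, and samples are grouped on $\bar{x}_s$ in the manner of the Royall-Cumberland studies:

    import numpy as np

    # Hypothetical population whose scatterplot has a positive intercept,
    # so that T_R/N is conditionally biased given xbar_s under SRS.
    rng = np.random.default_rng(0)
    N, n, reps = 2000, 32, 20000
    x = rng.gamma(4.0, 1.0, size=N)
    y = 2.0 + 1.5 * x + rng.normal(0, 1, size=N) * np.sqrt(x)  # beta1 = 2 != 0
    mu_y, mu_x = y.mean(), x.mean()

    xbars, errors = np.empty(reps), np.empty(reps)
    for r in range(reps):
        s = rng.choice(N, size=n, replace=False)
        xbars[r] = x[s].mean()
        errors[r] = (y[s].sum() / x[s].sum()) * mu_x - mu_y   # T_R/N - mu_y

    # Conditional bias: average error within groups defined by xbar_s
    edges = np.quantile(xbars, np.linspace(0, 1, 11))
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (xbars >= lo) & (xbars < hi)
        if m.any():
            print(f"xbar_s in [{lo:.2f},{hi:.2f}): mean error {errors[m].mean():+.3f}")

Groups with $\bar{x}_s$ above $\mu_x$ show systematic underestimation and those below show overestimation, in line with (5.93) and (5.96).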
In the empirical study of Royall and Cumberland (1985), confidence
intervals based on a standard normal approximation for

\frac{\hat{T}_R - T_y}{\sqrt{v(\hat{T}_R)}},   (5.97)

where $v(\hat{T}_R)$ is a robust MSE estimator, do not perform well: their actual
coverage frequencies, conditional on $\bar{x}_s$, tend to be below nominal
levels, and the shortfall in coverage tends to be more serious at one
end of the interval than the other. Some of the problem is accounted
for by the conditional bias noted above, as can be seen from the fact
that it is less serious in a comparable study for the simple regression
estimator, reported in the same paper.

Another contributing factor seems to be the skewness of the error distribution
in real populations, which means that lines fitted by (weighted)
least squares, together with estimates of the error variance, do not summarize
the distributions of the $y_j$ very well. In the next section we see
how prediction using the ratio estimator can sometimes be improved
by use of a parametric model which incorporates skewness of the error
distribution.

5.12.2 Inverse Gaussian-based intervals

At least for some populations, basing intervals on assuming an inverse
Gaussian distribution for $Y_j$ appears to allow us to do better. We say
$Y \sim IG(\mu, \lambda)$ if it has p.d.f.

f(y) = (2\pi y^3/\lambda)^{-1/2} \exp\Big\{-\frac{\lambda}{2\mu^2}(y - \mu)^2/y\Big\},  y > 0.   (5.98)

The moments of $Y$ are $\mathcal{E} Y = \mu$, $\mathrm{Var}(Y) = \mu^3/\lambda$. A (superpopulation)
regression model relating $Y$ to a covariate $x$ is given by

Y_j \sim IG(\mu_j, \lambda_j),   (5.99)

where $\mu_j = \beta x_j$, $\lambda_j = \kappa x_j^2$. Then $\mathcal{E} Y_j = \beta x_j$, $\mathrm{Var}(Y_j) = \beta^3 x_j/\kappa$.
From a set of sample $x$ and $y$ values, assuming independence of the
$Y_j$ under the model, the maximum likelihood estimate of $\beta$ is

\hat{\beta}_s = \sum_{j \in s} y_j \Big/ \sum_{j \in s} x_j,

the same as $\hat{R}_s$ of (5.53) in the SRS case. If the estimator is $\hat{T}_R = T_x \hat{\beta}_s$,
the distribution of the prediction error

\hat{T}_R - T_y

can be found explicitly. A $100(1 - 2\alpha)\%$ prediction interval for $T_y$ is
given by

\sum_{j \in s} y_j + T_2 [a \pm \sqrt{a^2 - b}],   (5.100)

where

T_2 = \hat{\beta}_s \sum_{j \notin s} x_j = \hat{\beta}_s v,

a = \frac{2\hat{\kappa}(n-1)uv^2 + unT_2 F}{2[\hat{\kappa}(n-1)uv^2 - vnT_2 F]},

b = \frac{\hat{\kappa}(n-1)uv^2}{\hat{\kappa}(n-1)uv^2 - vnT_2 F},

\hat{\kappa} = n \Big/ \Big\{\sum_{j \in s} x_j^2/y_j - \Big(\sum_{j \in s} x_j\Big)^2 \Big/ \sum_{j \in s} y_j\Big\},

u = \sum_{j \in s} x_j,  v = \sum_{j \notin s} x_j,

and $F$ is the $100(1-\alpha)$ percentage point for an $F_{(1,n-1)}$ distribution
(Whitmore, 1986). The method can be adapted also to take $\lambda_j = \kappa x_j^\nu$, where $\nu$ is not necessarily 2, in which case $\hat{\beta}_s = \sum_{j \in s} x_j^{\nu-2} y_j \big/ \sum_{j \in s} x_j^{\nu-1}$.

For the inverse Gaussian model with $\nu = 2$, although the mean of $Y_j$
is linear in $x_j$, the modal value of $Y_j$ is not. Thus it might be expected
to give reasonable fit to some populations for which a line fit by least
squares to the population scatterplot would not quite pass through the
origin.

Whitmore has provided an example of prediction intervals for a population
total of annual sales for $N$ products in a particular consumer-product
industry. In this example, $x_j$ represents the projected sales for
product $j$ and $y_j$ the actual sales for product $j$. The value of $T_x$ is
known, and the values of $x_j, y_j$ are observed in a sample of $n = 20$
products. The prediction limits for $T_y$ based on (5.100) are asymmetric
about the point estimate, and higher than the limits based on the more
standard normal assumption, accounting for the greater freedom of $Y$
to vary above $\beta x$ than below it. In empirical studies with an SRS design,
the prediction intervals (5.100) often have conditional coverage
probabilities (conditional on $\bar{x}_s$) closer to their nominal values $1 - 2\alpha$
than do intervals based on (5.97).
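A sketch of the computation of (5.100) follows, coded directly from the displayed formulas; the simulated population is hypothetical, and the variable names (u, v, T2, kappa, F) mirror the notation above:

    import numpy as np
    from scipy.stats import f as f_dist

    def ig_prediction_interval(x_s, y_s, x_unsampled, alpha=0.05):
        n = len(y_s)
        beta = y_s.sum() / x_s.sum()             # MLE of beta (nu = 2 case)
        u, v = x_s.sum(), x_unsampled.sum()
        T2 = beta * v                            # prediction of the unsampled total
        kappa = n / (np.sum(x_s**2 / y_s) - x_s.sum()**2 / y_s.sum())
        F = f_dist.ppf(1 - alpha, 1, n - 1)      # 100(1-alpha)% point of F(1, n-1)
        denom = kappa * (n - 1) * u * v**2 - v * n * T2 * F
        a = (2 * kappa * (n - 1) * u * v**2 + u * n * T2 * F) / (2 * denom)
        b = kappa * (n - 1) * u * v**2 / denom
        half = np.sqrt(a**2 - b)
        return y_s.sum() + T2 * (a - half), y_s.sum() + T2 * (a + half)

    rng = np.random.default_rng(2)
    x = rng.uniform(1, 10, size=100)
    y = rng.wald(2.0 * x, 8.0 * x**2)            # IG(beta x_j, kappa x_j^2) responses
    s = rng.choice(100, size=20, replace=False)
    mask = np.zeros(100, bool); mask[s] = True
    print(ig_prediction_interval(x[mask], y[mask], x[~mask]), y.sum())

The resulting limits are asymmetric about the point estimate, as described above.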

5.12.3 Simple regression estimator

As indicated in Section 5.6, the simple regression estimator is obtained
when the design is SRS and the model is

Y_j = \beta_1 + \beta_2 x_j + E_j,

with $\mathcal{E} E_j = 0$, $\mathrm{Var}(E_j) = \sigma^2$. The estimator has form

\hat{T}_L = \sum_{j \in s} w_{js} Y_j,

where

w_{js} = (N/n) + (T_x - N\bar{x}_s)(x_j - \bar{x}_s) \Big/ \sum_{k \in s} x_k (x_k - \bar{x}_s).   (5.101)

It is easily seen that $\mathcal{E}(\hat{T}_L - T_y) = 0$, and that

\mathcal{E}(\hat{T}_L - T_y)^2 = \Big[\frac{N(N - n)}{n} + \frac{(T_x - N\bar{x}_s)^2}{\sum_{k \in s}(x_k - \bar{x}_s)^2}\Big] \sigma^2   (5.102)

for fixed $s$. For a sample which is 'balanced' on $x$, so that $N\bar{x}_s = T_x$,
the estimator reduces to $N\bar{y}_s$, and its prediction MSE is the same as for
a model exchangeable in the $Y_j$s. However, in general, the weights and
the prediction MSE depend on the sample through $\bar{x}_s$ and $(n-1)s_x^2 = \sum_{k \in s}(x_k - \bar{x}_s)^2$.
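A minimal sketch of $\hat{T}_L$ computed through the weights (5.101), on hypothetical data:

    import numpy as np

    # Simple regression estimator via its weights (5.101); illustrative inputs.
    def regression_estimator(x_s, y_s, N, T_x):
        n = len(x_s)
        xbar = x_s.mean()
        denom = np.sum(x_s * (x_s - xbar))       # equals (n-1) s_x^2
        w = N / n + (T_x - N * xbar) * (x_s - xbar) / denom
        return np.sum(w * y_s), w

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 500)
    y = 3 + 2 * x + rng.normal(0, 1, 500)
    s = rng.choice(500, 40, replace=False)
    T_L, w = regression_estimator(x[s], y[s], 500, x.sum())
    print(T_L, y.sum())    # estimator versus the true total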
In evaluating the sampling properties of $\hat{T}_L$, it seems reasonable,
then, to condition at least on $\bar{x}_s$ and $s_x^2$.

To have a better glimpse of the large sample behaviour of $\hat{T}_L$, we
examine the corresponding estimator of the population mean $\mu_y$, and
write the error of estimation as

(\hat{T}_L/N) - \mu_y = \bar{z}_s + \frac{\mu_x - \bar{x}_s}{(n-1)s_x^2} \sum_{j \in s} z_j (x_j - \bar{x}_s),   (5.103)

where $z_j = y_j - \beta_{1N} - \beta_{2N} x_j$, and $\beta_{1N}$ and $\beta_{2N}$ are the population
least squares coefficients. We first look at its properties under SRS,
conditional on $\bar{x}_s$. It can easily be shown that under SRS, $E\bar{z}_s = 0$ and
the covariance of $\bar{x}_s, \bar{z}_s$ is 0. Thus, if conditions are such that $\bar{x}_s, \bar{z}_s$ and
$(\sum_{j \in s} z_j x_j)/n$ are approximately trivariate normal, we might expect
that $\bar{x}_s$ and $\bar{z}_s$ are approximately independent, that $E(\bar{z}_s \mid \bar{x}_s) \simeq 0$, and
that the conditional bias of $\hat{T}_L/N$ given $\bar{x}_s$ should be approximately

\frac{(\mu_x - \bar{x}_s)}{(n-1)E(s_x^2 \mid \bar{x}_s)}\, n E\Big(\sum_{j \in s} x_j z_j/n \,\Big|\, \bar{x}_s\Big).   (5.104)

In (5.104), $E(\sum_{j \in s} x_j z_j/n \mid \bar{x}_s)$ would be approximately a multiple of
$\bar{x}_s - \mu_x$. This suggests that the conditional bias of $\hat{T}_L/N$ given $\bar{x}_s$ is
of order $1/n$ in probability, and this has been shown by Rao and Liu
(1992). However, the conditional bias given $\bar{x}_s$ and $s_x^2$ is of order $n^{-1/2}$
in probability in general, since the covariance of $\bar{z}_s$ with $(\sum_{j \in s} x_j^2/n)$
need not be approximately 0; it will not be, for example, if $y_j$ is a
quadratic function of $x_j$. This observation is borne out by the empirical
study of Royall and Cumberland (1981b).

5.12.4 Implications for inference

Thus for both the ratio estimator and the simple regression estimator
there are possibilities of bias. The model-based inference assumes a
linear relationship between the mean of $Y_j$ and $x_j$. For fixed $s$, there
is a model bias if the mean of $Y_j$ is not in fact appropriately linear
in $x_j$. Correspondingly, there is a bias with respect to the sampling
design, conditional on those functions of $\{(j, x_j) : j \in s\}$ which would
determine the distribution of the estimator if the model held. On the
other hand, if there is no model bias, the conditioned randomization
lends 'support' to inference.

When the linear superpopulation model is not a matter of belief
but a matter of convenience, or simply a means of constructing an
estimator of $T_y$ likely to be somewhat more efficient than $t$, then
the unconditional design frequency properties of the estimator take on
an additional importance: we are assured of approximate unbiasedness
of the estimator, and in fortunate circumstances approximately correct
coverage of confidence intervals based on variance estimation formulae
like those of Section 5.11. These properties are valid in the context of
implementing the sampling procedure many times on the population.
However, the inference content of these estimates and intervals may be
flawed by their inconsistency with our knowledge of the population:
if the dependence of $\mathcal{E} Y_j$ on $x_j$ were thought possibly to contain an
intercept term, we would knowingly be incurring an estimable bias by
using the ratio estimator. Similarly, if the dependence were thought to
be quadratic, we would be incurring a quantifiable bias in the simple
regression estimator, in the sense of the model or in the sense of the
sampling design conditional on $\bar{x}_s$ and $s_x^2$.
5.13 Robustness and mean function fitting

As suggested in early sections of this chapter, there is in principle a way
of using auxiliary information without sampling bias, whether or not the
population dependence of $y$ on $x$ fits a linear model well. The method
of 'sharp stratification' or 'fine stratification' divides the population
into strata according to $x$ value, in such a way that a small number of
units (at least two) will be sampled from each stratum. This in fact was
suggested by Neyman (1934) as an alternative to regression estimation
when the linear relationship is in doubt (see also Godambe, 1982).
Conditioning on the stratum sample sizes will yield an approximation
to conditioning on the set of $x$ values in the sample.

Using the stratification estimator $N\bar{y}_{st}$ for $T_y$ will give unbiased
estimators; these will, however, have less efficiency than regression
estimators if the linear model fits well, and the variance estimator may
for small sample sizes be less stable (because of small stratum sample
sizes) than the regression MSE estimator.
A compromise between the two methods might be to increase the
flexibility of the model in model-assisted estimation, and think in terms
of fitting a 'mean function' for $Y$ which is a relatively smooth function
of $x$. Thus the model (5.45) would become

Y_j = \theta(x_j) + E_j,   (5.105)

where the $E_j$ are independent with $\mathrm{Var}(E_j) = \sigma_j^2$. The form of a model-assisted
estimator for $T_y$ would be

\hat{T}_y = \sum_{j \in s} \frac{y_j - \hat{\theta}(x_j)}{\pi_j} + \sum_{j=1}^{N} \hat{\theta}(x_j),   (5.106)

where $\hat{\theta}(x_T)$ is the sample estimate of the mean function, evaluated at
the argument $x_T$. A variant of this approach, in which $\hat{\theta}(x_T)$ is obtained
from a separate sample, has been described by Gerow and McCulloch
(1994), in the context of estimating the average of the mean function.
With stratified random sampling, $N\bar{y}_{st}$ is the special case where $\hat{\theta}(x_T)$
is the sample mean of $y$ in the stratum for which the auxiliary variate
values include $x_T$. Thus $\hat{\theta}(x_T)$ is an example of a local mean function
estimate. On the other hand, if $\beta_1, \beta_2$ and $x$ are real, $\hat{\theta}(x) = \hat{\beta}_1 + \hat{\beta}_2 x$
would be a global mean function estimate, if $\hat{\beta}_1$ and $\hat{\beta}_2$ were estimates
of $\beta_1, \beta_2$ from the entire sample. In general, using local mean function
estimates can give robustness against departures from a simple global
model, while the use of global estimates would give greater efficiency
if the global model were trustworthy.
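The following sketch illustrates (5.106) with a local mean function estimate, using bins on $x$ in place of strata; the binning and data are hypothetical choices:

    import numpy as np

    # Model-assisted total (5.106) with a local mean function: theta_hat(x) is
    # the sample mean of y among sampled units whose x falls in the same bin.
    def local_mean_assisted_total(x, y, s, pi, bins):
        idx_all = np.digitize(x, bins)
        idx_s = idx_all[s]
        theta = {b: y[s][idx_s == b].mean() for b in np.unique(idx_s)}
        # Fall back to the overall sample mean for bins with no sampled units
        theta_all = np.array([theta.get(b, y[s].mean()) for b in idx_all])
        resid_term = np.sum((y[s] - theta_all[s]) / pi[s])
        return resid_term + theta_all.sum()      # (5.106)

    rng = np.random.default_rng(4)
    N, n = 1000, 100
    x = rng.uniform(0, 10, N)
    y = np.sin(x) * 5 + x + rng.normal(0, 1, N)  # nonlinear mean function
    s = rng.choice(N, n, replace=False)
    pi = np.full(N, n / N)                       # SRS inclusion probabilities
    print(local_mean_assisted_total(x, y, s, pi, np.linspace(0, 10, 11)), y.sum())

A global straight-line fit would be biased for this population; the local estimate tracks the curvature, illustrating the robustness remark above.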
Mean function fitting also has bearing on another kind of robustness,
namely insensitivity to the presence of outliers. The contribution of an
outlier value of $y$ to a total $T_y$ must of course be incorporated into the
estimate, but it is important that it not have an undue effect on the prediction
of the unseen $y$ values. Often in surveys it is possible to guess
which units may have unusual values of the variate of interest (or covariates),
and it may sometimes be possible to include these units in the
sample with probability one, or at least a high probability. Otherwise,
however, it is important to decide how to treat an unexpected outlier,
in terms of the population units it should be allowed to represent. Unintended
effects can be reduced if outlier-insensitive methods are used
for mean function estimation. We will not go further into these important
issues here; some relevant references are Hidiroglou and Srinath
(1981), Chambers (1986) and Gwet and Rivest (1992).
5.14 Estimating a population distribution function using covariates

An interesting further illustration of the use of covariates in descriptive
inference is provided by the estimation of a population c.d.f. That is,
the parameters of interest are the values of the distribution function (cf.
(4.4))

F_N(y) = \frac{1}{N} \sum_{j=1}^{N} I(y_j \le y).   (5.107)

There is an auxiliary variate $x$, and $y_1, \ldots, y_N$ are thought to come
from a superpopulation with model

Y_j = x_j^T \beta + E_j,

where $E_1, \ldots, E_N$ are independent and $\mathcal{E} E_j = 0$, $\mathrm{Var}(E_j) = \sigma_j^2$. Cases of
this problem have been treated by Chambers and Dunstan (1986), Kuk
(1988), Godambe (1989a), Rao et al. (1990), Chambers et al. (1992),
and Rao and Liu (1992).

Let $G$ denote the model c.d.f. of $E_j$. Then under the superpopulation
model,

\sum_{j=1}^{N} \phi_j = \sum_{j=1}^{N} \Big[ I(Y_j \le y) - G\Big(\frac{y - x_j^T \beta}{\sigma_j}\Big) \Big]   (5.108)

is a population estimating function, each of whose terms has expectation
0. It is estimated optimally by

\sum_{j \in s} \frac{1}{\pi_j} \Big[ I(Y_j \le y) - G\Big(\frac{y - x_j^T \beta}{\sigma_j}\Big) \Big]   (5.109)

if $G$ and $\beta$ are specified, and the part of (5.108) depending on $Y_j$ is
$N$ times the desired $F_N(y)$. Thus, following the prescription leading to
(5.59), if $G$ and $\beta$ were known, an appropriate unbiased estimator of
$F_N(y)$ would be

\frac{1}{N} \Big\{ \sum_{j \in s} \frac{1}{\pi_j} \Big[ I(y_j \le y) - G\Big(\frac{y - x_j^T \beta}{\sigma_j}\Big) \Big] + \sum_{j=1}^{N} G\Big(\frac{y - x_j^T \beta}{\sigma_j}\Big) \Big\}.   (5.110)

When $G$ and $\beta$ are unknown, as is usually the case, we would want
to substitute well-behaved estimates from the sample. Note that this
case resembles regression estimation of $T_y$ when the $x_\Sigma$ condition is
not satisfied, since the estimating function (5.109) will not normally be
part of the system used for estimating $G$ and $\beta$.

One suggestion for the estimation of $\beta$ is to use weighted least
squares, which leads to the sample estimating equation system

\sum_{j \in s} \sigma_j^{-2} x_j (y_j - x_j^T \beta)/\pi_j = 0.   (5.111)

For $G$, we note that for $\beta$ known,

I(Y_j \le \eta) - G\Big(\frac{\eta - x_j^T \beta}{\sigma_j}\Big)

has model expectation 0 for each $j$ and real number $\eta$. Thus, setting
$\eta = u\sigma_j + x_j^T \beta$, we see that

I(Y_j \le u\sigma_j + x_j^T \beta) - G(u)

has model expectation 0 for each $j$. This suggests one possibility for
the estimation of $G(u)$, namely

\hat{G}(u) = \Big\{\sum_{i \in s} I(\hat{\epsilon}_i \le u)/\pi_i\Big\} \Big/ \Big\{\sum_{i \in s} 1/\pi_i\Big\},   (5.112)

where $\hat{\epsilon}_i = (y_i - x_i^T \hat{\beta}_s)/\sigma_i$. Then the estimator of $F_N(y)$ would be

\frac{1}{N} \Big[ \sum_{j \in s} \frac{1}{\pi_j} \{ I(y_j \le y) - \hat{G}_j \} + \sum_{j=1}^{N} \hat{G}_j \Big],   (5.113)

where

\hat{G}_j = \hat{G}(u),  u = (y - x_j^T \hat{\beta}_s)/\sigma_j.

Rao et al. (1990) have ensured a kind of conditional design unbiasedness
by replacing the first $\hat{G}_j$ in (5.113) by

\hat{G}_{jc} = \sum_{k \in s} \frac{\pi_j}{\pi_{jk}} I(\hat{\epsilon}_k \le u) \Big/ \sum_{k \in s} \frac{\pi_j}{\pi_{jk}}.

Note that computing (5.113) depends on knowledge of $\sigma_1^2, \ldots, \sigma_N^2$ up
to a proportionality factor.

Improved estimation of $G(u)$ would presumably yield better estimates
of $F_N(y)$. Whether or not the 'model-assisted' estimator (5.113)
for $F_N(y)$ is better than

\hat{F}_N(y) = \Big\{\sum_{j \in s} I(y_j \le y)/\pi_j\Big\} \Big/ \Big\{\sum_{j \in s} 1/\pi_j\Big\}   (5.114)

as in (4.9) depends on the degree of validity of the superpopulation
model. In fact, other improvements might arise with the use of a model
in which $x_j^T \beta$ is replaced by a more general mean function $\theta(x_j)$, as
in Section 5.13, so that local fitting could be employed.
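A sketch of the estimator (5.113), with $\hat{\beta}_s$ from (5.111) for a scalar covariate and $\hat{G}$ from (5.112), follows; the population, the design and the $\sigma_j$ (taken constant) are all hypothetical:

    import numpy as np

    def cdf_estimate(y0, x, y_s, s, pi_s, sigma):
        # Weighted LS estimate of beta as in (5.111), scalar x, no intercept
        beta = np.sum(x[s] * y_s / (sigma[s]**2 * pi_s)) / \
               np.sum(x[s]**2 / (sigma[s]**2 * pi_s))
        eps = (y_s - x[s] * beta) / sigma[s]
        u = (y0 - x * beta) / sigma               # argument of G for every unit
        # G_hat(u) of (5.112), evaluated at each u_j
        G = np.array([np.sum((eps <= uj) / pi_s) for uj in u]) / np.sum(1.0 / pi_s)
        N = len(x)
        return (np.sum((1 * (y_s <= y0) - G[s]) / pi_s) + G.sum()) / N

    rng = np.random.default_rng(5)
    N, n = 800, 80
    x = rng.uniform(1, 5, N)
    y = 2 * x + rng.normal(0, 1, N)
    sigma = np.ones(N)
    s = rng.choice(N, n, replace=False)
    pi_s = np.full(n, n / N)
    print(cdf_estimate(5.0, x, y[s], s, pi_s, sigma), np.mean(y <= 5.0))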

If the model were felt to be reliable, inference about the uncertainty
in the point estimate of $F_N(y)$ might be made through an estimate
of the prediction mean squared error. This would be approximated by
the conditional sampling mean squared error given $\{x_j : j \in s\}$,
meaningful in its own right. One way of approaching this problem
would use a bootstrap resampling approach (see Efron and Tibshirani,
1993). The sample $s$ is considered to be fixed, as are the values $x_j$,
$j \in s$. In the linear model case, take $\beta$ to be fixed at some reasonable
value, and compute the residuals

\hat{\epsilon}_j = (y_j - x_j^T \beta)/\sigma_j

for $j \in s$. For each of a large number of times, generate an i.i.d.
sample $\{\epsilon_j^*, j \in s\}$ from the empirical distribution of $\{\hat{\epsilon}_j, j \in s\}$.
Matching these with sampled units, produce a new set of $y$ values
$y_j^* = x_j^T \beta + \sigma_j \epsilon_j^*$, $j \in s$. Still keeping $\beta$ fixed, use the $\epsilon_j^*$, $j \in s$, to
produce an estimate

\hat{G}^*(u) = \frac{\sum_{i \in s} I(\epsilon_i^* \le u)/\pi_i}{\sum_{i \in s} 1/\pi_i},

and estimate $F_N(y)$ by

\hat{F}^*(y) = \frac{1}{N} \Big[ \sum_{j \in s} \frac{1}{\pi_j} \{ I(y_j^* \le y) - \hat{G}_j^* \} + \sum_{j=1}^{N} \hat{G}_j^* \Big],   (5.115)

where

\hat{G}_j^* = \hat{G}^*\{(y - x_j^T \beta)/\sigma_j\}.

For fixed $y$, the empirical variance of the $\hat{F}^*(y)$ values should approximate
the prediction or model MSE of the point estimate of $F_N(y)$ well,
provided it does not depend greatly on the value of $\beta$ used.
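A sketch of this residual bootstrap follows, holding $\beta$ fixed as described; the helper cdf_from_resid is a stripped-down version of (5.113) for fixed $\beta$, and all inputs are hypothetical:

    import numpy as np

    def cdf_from_resid(y0, x, s, pi_s, sigma, beta, y_s):
        # (5.113) with beta held fixed; G* of the text is computed from the
        # residuals of the (resampled) y values.
        eps = (y_s - x[s] * beta) / sigma[s]
        u = (y0 - x * beta) / sigma
        G = np.array([np.sum((eps <= uj) / pi_s) for uj in u]) / np.sum(1.0 / pi_s)
        return (np.sum((1 * (y_s <= y0) - G[s]) / pi_s) + G.sum()) / len(x)

    rng = np.random.default_rng(6)
    N, n, B, y0 = 800, 80, 500, 5.0
    x = rng.uniform(1, 5, N)
    y = 2 * x + rng.normal(0, 1, N)
    sigma = np.ones(N)
    s = rng.choice(N, n, replace=False)
    pi_s = np.full(n, n / N)
    beta = 2.0                                   # fixed at a reasonable value
    eps_hat = (y[s] - x[s] * beta) / sigma[s]

    F_star = np.empty(B)
    for b in range(B):
        eps_b = rng.choice(eps_hat, size=n, replace=True)  # i.i.d. from the empirical dist.
        y_star = x[s] * beta + sigma[s] * eps_b
        F_star[b] = cdf_from_resid(y0, x, s, pi_s, sigma, beta, y_star)
    print("bootstrap model-MSE estimate:", F_star.var())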
CHAPTER 6

Analytic uses of survey data

In Chapter 1 the difference between descriptive and analytic aims for


a survey was discussed briefly. The current chapter will look at ana-
lytic uses of survey data in greater depth. We begin in Section 6.1 with
a discussion of the meaning of analytic inference, and the role of a
randomized sampling design. In Section 6.2 the issue of incorporating
sampling weights in the inference is considered, with a general estimat-
ing function formulation, and examples involving single-stage designs.
Section 6.3 contains an outline of the theory of estimating functions
for vector parameters. Finally, Sections 6.4 and 6.5 are devoted to the
application of the generalized linear model to complex survey data,
with Section 6.4 describing population-level analyses and Section 6.5
some sample-level analyses.
An excellent reference for this area of sampling theory is the book
by Skinner et al. (1989).

6.1 What is analytic survey inference?

Briefly, inference from survey data is analytic rather than descriptive


when we seek to make statements about the parameters of conceptual
populations, on which the survey population at hand is at best a window.
For example, if we ask smokers in Ontario whether they have smoked
brand A during October of this year, our aim is probably descriptive,
concerning the population of current smokers in Ontario. On the other
hand, if we ask smokers in Ontario whether they have switched brands
in the month prior to the survey, we might really be trying to estimate
the probability that a randomly selected smoker, in Ontario or some
larger geographical entity, will switch brands in some future month
under similar conditions. In this case the aim is analytic.
It is often helpful to think of the difference between descriptive and
analytic aims in terms of the difference in the target populations, one
being actual and definite, and the other being conceptual or hypothetical
and somewhat indefinite. This is essentially the distinction being
made in the previous paragraph. The population distinction is very


useful in the interpretation of analytic parameters. On the other hand,
for estimation of analytic parameters, ambiguity of definition must be
eliminated, and this can be done with the introduction of a specific
stochastic model, of which some of the parameters coincide with the
parameters of interest. Once this is done we can express the difference
between descriptive and analytic inference as a difference in the type of
attribute being estimated. Descriptive inference is about some function
of the values of a variate for unseen units of the survey population, and
analytic inference is about the parameters of the model. The concept of
'superpopulation model' already encountered in Chapter 5 is suggestive
of both ways of making the descriptive-analytic distinction.
It might at first appear that if the object of analytic inference is a
superpopulation parameter, then only the model probabilities would be
relevant for inference. However, we will see that the design probabil-
ities from a randomized sampling design may also have a significant
role. Because of the large size and heterogeneous structure of a typical
survey population, a realistic model for the responses would tend to
be rather complex. A randomized sampling design can 'support' infer-
ence based on a model which may have simpler structure and fewer
parameters. Thus the choice of sampling design may subtly influence
the choice of parameters to be considered, and their interpretation.

Example: interpreting a superpopulation proportion


To illustrate, suppose there is a certain treatment, and for the population
at hand suppose that Yj equals 1 if person j would respond positively
to the treatment at its next application, equals 0 if not. A random
sample is taken, and the treatment applied to each member. The sample
proportion who respond positively may be thought of as an estimate,
not only of the corresponding proportion in the survey population, but
also of a superpopulation proportion $\theta$, which might be interpreted as a
'positive response' probability. Inference about $\theta$ would be an example
of analytic inference.

We could think of the meaning of the superpopulation parameter $\theta$
in different ways. Most simply perhaps, we could think of the population
at hand as a random sample from a huge hypothetical population,
a proportion $\theta$ of whose members would respond positively.
Then $Y_1, \ldots, Y_N$ would be independent Bernoulli variates with success
probability $\theta$. Less naturally, but with more structure, we could think
of each population member as being a 'stochastic subject'. Each member
would represent a hypothetical stochastic sequence of subjects or
subject trials, responding positively or not according to a Bernoulli trial
model with success probability $\theta$. This structure seems artificial, but in
terms of the distribution of $Y_1, \ldots, Y_N$ it would be equivalent to the
previous interpretation. Moreover, its artificial nature points the way to
better formulations.
In this example, if each population member represents a stochastic
sequence, it is inherently unlikely that the stochastic sequences would
be a homogeneous collection, with independent tendencies to respond
positively or not. It would be more natural to think of the sequence
for person $j$ as having its own proportion or probability of positive
response, $\theta_j$.

Then for some purposes $\theta$ might be interpreted as a simple or appropriately
weighted average of $\theta_j$ over all subjects in the survey population
at hand. For other purposes, with a wider conceptual population
in mind, we might think of the $\theta_j$ themselves as random, either independent
or correlated, with common expectation $\theta$.
Under either of these two interpretations, randomization in the sampling
design can remove the effects of nuisance parameters and simplify
the model for certain functions of the sample data: for example, if the
model is composed with simple random sampling with replacement,
the resulting unconditioned model makes the successively drawn (unlabelled)
responses Bernoulli variates with mean $\theta$. Under the second
interpretation above, where the $\theta_j$ are random, we can say something
stronger. Suppose that the $\theta_j$ as random variates are spatially correlated,
with correlation growing weaker as the distance between units
increases. If a sample obtained from any design were sufficiently well-dispersed,
we could take the labelled $Y_j$, $j \in s$, to be approximately
i.i.d. Bernoulli variates. Thus inference about $\theta$ based solely on this
approximate i.i.d. model would be 'supported' by frequency properties
under the population model composed with simple random sampling.
Happily, simple random sampling will tend to produce well-dispersed
samples if the sample size is small relative to the population size.

Example: interpreting a regression coefficient


Another example of analytic inference would involve regression, which
has already been discussed to some extent in Sections 4.2 and 4.4. The
meaning of the finite population least squares slope $B_N$ of (4.109) as a
descriptive parameter is clear. However, its meaning more generally is
not so clear, except in the context of specific stochastic models which
take into account the likely heterogeneity of the survey population.
Several such models have been pursued in the literature. For some
authors (e.g. Pfeffermann and LaVange, 1989) the true regression slope
$\beta_j$ varies from unit to unit, and is a function of 'design variates'.
Thus the value $\beta_j$ might be constant within each cluster in a clustered
design, and across clusters be i.i.d. with mean $\beta$ and some fixed variance
$\sigma^2$. For other authors, $\beta_j$ has a common value $\beta$ for all $j$, but the
errors in the regression model are correlated, either spatially correlated
or partially exchangeable. A two-stage error model, where errors are
conditionally independent within PSUs, with PSU mean levels being
independent with mean 0, has been used by Scott and Holt (1982)
and others subsequently. This model is essentially a generalized linear
mixed model, as will be discussed in Section 6.4.7.

In the case of constant $\beta$ but spatially correlated errors, a simpler uncorrelated
error model within the sample may be appropriate if the sampling
design is a single-stage self-weighting design with well-dispersed
samples. Inference about the regression slope $\beta$ could then be carried
out in accordance with the usual weighted least squares theory, and supported
by design-frequency properties. This approach to model choice
and parametrization will be discussed further in Section 6.5.1.
In the manner above, it is relatively easy to assign a role in ana-
lytic inference to simple random sampling, which is single-stage and
self-weighting. If a randomized sampling design is not self-weighting,
the question arises whether or not to incorporate the weights, or more
generally the design probabilities, into the analysis. This is one of the
more difficult questions of analytic survey inference, and we shall dis-
cuss it in several contexts in the next section. A related discussion is
provided by Godambe (1995).

6.2 Single-stage designs and the use of weights

6.2.1 Likelihood estimation with independent observations

The first context arises, typically with a single-stage design, when the
model specifies the response values $Y_j$ for the sampled units to be independent,
with probability functions known up to a finite-dimensional
parameter $\theta$. Generally, some component(s) of $\theta$ form the 'parameter
of interest'. For simplicity here and in the next section we shall take $\theta$
to be real-valued.

If the sampling design is simple random sampling, then the log likelihood
function for the parameter $\theta$ takes the form

\sum_{j \in s} \log f_j(y_j; \theta),   (6.1)

where $y_j$ is the observed value of $Y_j$ and $f_j$ is the probability function
for the observation at unit $j$. In fact, strictly speaking, (6.1) is the log
likelihood function whenever $p(s)$ from the sampling design is independent
of the parameter, and independent of the array $Y$ of response
values. The score function, a model-unbiased estimating function for
$\theta$, has realized value

\sum_{j \in s} \frac{\partial}{\partial \theta} \log f_j(y_j; \theta).   (6.2)
In the case of a non-self-weighting design, one approach would advocate
the use of

\sum_{j \in s} \frac{1}{\pi_j} \frac{\partial}{\partial \theta} \log f_j(y_j; \theta),   (6.3)

or more generally

\sum_{j \in s} w_{js} \frac{\partial}{\partial \theta} \log f_j(y_j; \theta),   (6.4)

where the $w_{js}$ are weights such that

E_p\Big(\sum_{j \in s} w_{js} z_j\Big) \simeq T_z

for all $z$. Clearly this maximum pseudolikelihood approach (Binder,
1983; Skinner, 1989) is motivated by the idea of estimating a population
score function of the form

\sum_{j=1}^{N} \frac{\partial}{\partial \theta} \log f_j(Y_j; \theta),   (6.5)

as we have seen for general estimating functions in Section 4.1. The justification
is less obvious here, where the emphasis is on the estimation
of a superpopulation parameter, $\theta$ say, rather than its finite population
counterpart. If we believe sufficiently in the model to believe in (6.5)
as a population score function, then we should believe even more in
(6.2) as a sample score function; and there should be no need for the
unbiased estimation of (6.5) provided by the use of weights in (6.3)
or (6.4). On the other hand, (6.3) and (6.4) are still estimating functions
for the superpopulation parameter and the corresponding finite
population parameter. They would be nearly as efficient as the simple
score estimating function (6.2) if the weights were approximately
equal. Moreover, if the sampling design probabilities did depend on $Y$,
and (6.2) could no longer be regarded as the sample score function,
(6.3) and (6.4) would still estimate the population score function (6.5).
Thus in the next section we will discuss the pseudolikelihood approach
formally, in the context of a more general population-level estimating
function with independent terms.
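As a sketch of the maximum pseudolikelihood idea, the weighted score equation (6.3) can be solved numerically; here $f_j$ is a hypothetical Poisson model and the unequal-probability design is illustrative:

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(7)
    N, n = 5000, 250
    size = rng.uniform(1, 4, N)
    pi = n * size / size.sum()            # unequal inclusion probabilities
    y = rng.poisson(3.0, N)
    s = rng.random(N) < pi                # Poisson sampling, for simplicity

    def weighted_score(theta):
        # Sum over s of (1/pi_j) d/dtheta log f_j, with f_j Poisson(theta)
        return np.sum((y[s] / theta - 1.0) / pi[s])

    theta_hat = brentq(weighted_score, 0.1, 50.0)
    print(theta_hat)                      # close to the superpopulation mean 3.0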

6.2.2 Estimating functions with weighted terms

Suppose that we have a model under which the response values $Y_j$
are independent, with distribution depending on a real parameter $\theta$.
Suppose that there is a population-level estimating function

\Phi(Y, \theta) = \sum_{j=1}^{N} \phi_j(Y_j, \theta),   (6.6)

and suppose to begin with that $\mathcal{E}\{\phi_j(Y_j, \theta)\} = 0$ for each $j$. Let

\phi_s^* = \sum_{j \in s} w_{js} \phi_j(Y_j, \theta)   (6.7)

be a sample estimating function, with the $w_{js}$ not depending on $Y$. If
the design probabilities are also independent of $Y$, then

\mathcal{E}(\phi_s^* \mid \text{sample is } s) = \sum_{j \in s} w_{js} \mathcal{E}\{\phi_j(Y_j, \theta)\} = 0;

\mathrm{Var}(\phi_s^* \mid \text{sample is } s) = \sum_{j \in s} w_{js}^2 \mathcal{E}\phi_j^2;

and a model-unbiased estimator of this variance would be

v_m(\phi_s^*) = \sum_{j \in s} w_{js}^2 \phi_j^2.   (6.8)

Under appropriate conditions, approximate confidence limits for $\theta$ could
be based on solving

\frac{\phi_s^*}{\sqrt{v_m(\phi_s^*)}} = \pm z_{1-\alpha}.   (6.9)

If the model were true, therefore, inference which is valid conditional
on the sample could be based on $\phi_s^*$ and $v_m(\phi_s^*)$, without intermediate
consideration of $\Phi(Y, \theta)$, and without involving the design probabilities.
In the example of Section 6.2.1, we would have

\phi_s^* = \sum_{j \in s} w_{js} \frac{\partial}{\partial \theta} \log f_j(y_j; \theta)   (6.10)

and

v_m(\phi_s^*) = \sum_{j \in s} w_{js}^2 \Big( \frac{\partial}{\partial \theta} \log f_j(y_j; \theta) \Big)^2.   (6.11)

If the model were well substantiated, it might be appropriate to replace
the relatively robust $v_m(\phi_s^*)$ by

I_s(\theta) = -\sum_{j \in s} w_{js}^2 \frac{\partial^2}{\partial \theta^2} \log f_j(y_j; \theta)   (6.12)

or by their common expectation $\bar{I}_s(\theta)$. In $v_m(\phi_s^*)$, $I_s(\theta)$ or $\bar{I}_s(\theta)$, $\theta$
could be replaced for large samples by its point estimate $\hat{\theta}$, obtained
by solving $\phi_s^* = 0$.

For some applications we will need to relax the conditions we started
with. In Section 6.2.6 on estimating the average of a mean function, the
$Y_j$ will be independent, and $\mathcal{E}\{\Phi(Y, \theta)\}$ will be 0, but it will not necessarily
be the case that $\mathcal{E}\{\phi_j(Y_j, \theta)\} = 0$ for each $j$. Sections 6.2.3,
6.2.4 and 6.2.5 will deal with examples of response-dependent sampling,
where the design probabilities are informative, in the sense of
being allowed to depend on $Y$. For these designs, if the $w_{js}$ are chosen
to make $\phi_s^*$ of (6.7) design-unbiased for $\Phi(Y, \theta)$, then the $w_{js}$ will typically
depend on $Y$ also, and it may not be true that $\mathcal{E}\{w_{js}\phi_j(Y_j, \theta)\} = 0$
for each $j$.

To be specific, let us now focus on

\phi_s = \sum_{j \in s} \frac{\phi_j(Y_j, \theta)}{\pi_j},   (6.13)

which is a special case of $\phi_s^*$ of (6.7). Clearly $E_p \phi_s = \Phi = \Phi(Y, \theta)$,
and we adopt the weakened assumptions that $Y_1, \ldots, Y_N$ are independent
and $\mathcal{E}\Phi = 0$. It will not generally be true that $\mathcal{E}(\phi_s \mid \text{sample is } s) = 0$.
However, with respect to the compound model-design distribution,
we do have

\mathcal{E} E_p(\phi_s) = \mathcal{E}(\Phi) = 0.   (6.14)

Thus for measuring uncertainty it is natural to consider the compound
mean squared error

\mathcal{E} E_p(\phi_s - \mathcal{E}\Phi)^2 = \mathcal{E} E_p(\phi_s - \Phi + \Phi - \mathcal{E}\Phi)^2
= \mathcal{E} \mathrm{Var}_p(\phi_s) + \mathrm{Var}(\Phi),   (6.15)
where $\mathrm{Var}_p$ and $\mathrm{Var}$ denote variances with respect to design and model
respectively. Accordingly, under the compound distribution, $\phi_s$ will
have mean 0 and variance $\mathcal{E}\mathrm{Var}_p(\phi_s) + \sum_{j=1}^{N} \mathrm{Var}\{\phi_j(Y_j, \theta)\}$, which
can be estimated if desired by

v_c(\phi_s) = v(\phi_s) + \sum_{j \in s} v_j/\pi_j,   (6.16)

where $E_p v(\phi_s) = \mathrm{Var}_p(\phi_s)$ and $\mathcal{E} v_j = \mathrm{Var}\{\phi_j(Y_j, \theta)\}$.
Approximate confidence limits obtained by solving

\frac{\phi_s}{\sqrt{v_c(\phi_s)}} = \pm z_{1-\alpha}   (6.17)

would have appropriate frequency properties under the compound distribution.

Thus we see that the design probabilities have a natural role in estimation
when they are informative, or whenever we cannot guarantee
that $\mathcal{E}(\phi_s \mid \text{sample is } s) = 0$.

Let us return to the simpler initial case where each $\mathcal{E}\{\phi_j(Y_j, \theta)\} = 0$
and the design probabilities are not dependent on $Y$. If $\phi_s$ is chosen as
the sample estimating function, then (6.16) and (6.17) under the compound
distribution provide an alternative to the model-based variance
estimator

v_m(\phi_s) = \sum_{j \in s} \phi_j^2/\pi_j^2   (6.18)

and interval

\frac{\phi_s}{\sqrt{v_m(\phi_s)}} = \pm z_{1-\alpha}   (6.19)

(special cases of (6.8) and (6.9)). Those based on $v_c(\phi_s) = v(\phi_s) + \sum_{j \in s} \phi_j^2/\pi_j$ are less heavily model-dependent, but are valid only 'on
average', rather than conditionally on the sample drawn. If the design
is simple random sampling, then all the $\pi_j$ are equal to $n/N$, and the
estimators $v_m(\phi_s)$ and $v_c(\phi_s)$, evaluated at $\hat{\theta}$, are close to one another,
allowing the design to 'support' model-based inference.

6.2.3 Likelihood analysis under response-dependent Bernoulli sampling

Returning again to the likelihood estimation of Section 6.2.1, suppose
that there is a model with parameter $\theta$, and that the population score
function is

\sum_{j=1}^{N} \frac{\partial}{\partial \theta} \log f_j(Y_j; \theta).

This time, however, suppose that the sampling scheme is length-biased
Bernoulli sampling, for which the inclusion probability of unit $j$ is
proportional to $Y_j$, or more generally response-dependent Bernoulli
sampling, for which the inclusion probability is equal to a function
$\pi_j(Y_j)$.

An approach in the spirit of (6.3) and (6.13) would use

\phi_s = \sum_{j \in s} \frac{\partial}{\partial \theta} \log f_j(y_j; \theta) \Big/ \pi_j(y_j)   (6.20)

as a sample-based estimating function, together with a variance estimator
$v_c(\phi_s)$ and perhaps confidence limits for $\theta$ from (6.17).

When belief in the model is firm, it is worthwhile also to consider
whether the information from the design probabilities can be incorporated
more directly. For response-dependent Bernoulli sampling, the
likelihood function based on the sample data would be

\prod_{j \in s} \frac{\pi_j(y_j)}{\pi_{j0}(\theta)} f_j(y_j; \theta) \prod_{j \in s} \pi_{j0}(\theta) \prod_{j \notin s} (1 - \pi_{j0}(\theta)),   (6.21)

where

\pi_{j0}(\theta) = \int \pi_j(y) f_j(y; \theta)\,dy;

and the score function would be

\sum_{j=1}^{N} I_{js} \frac{\partial}{\partial \theta} \log\Big( \frac{\pi_j(Y_j) f_j(Y_j; \theta)}{\pi_{j0}(\theta)} \Big) + \sum_{j=1}^{N} I_{js} \frac{\partial}{\partial \theta} \log \pi_{j0}(\theta)
+ \sum_{j=1}^{N} (1 - I_{js}) \frac{\partial}{\partial \theta} \log(1 - \pi_{j0}(\theta)),   (6.22)

where $I_{js}$ equals 1 if $j \in s$, and 0 if $j \notin s$. Note that since $\pi_j(y_j) f_j(y_j; \theta)/\pi_{j0}(\theta)$ is the probability function of $Y_j$, given that $j$ is sampled,
the first component of the score function is an unbiased estimating
function conditional on the sample actually chosen. This component
could be isolated and used on its own, for inference which would be
essentially model-based. The remainder, like (6.20), is unbiased on the
average over repetitions of the sampling design, and may be useful,
when $\pi_{j0}(\theta)$ is independent of $j$, for estimating $N$ if $N$ is unknown.
Godambe and Rajarshi (1989) have given a detailed discussion of this
example.

6.2.4 Case control studies

In a simple framework for case control studies (see Scott and Wild,
1986) the response variates $Y_j = 1$ (case) or 0 (control) are observed
for the whole population. There is a vector-valued explanatory variate
$X$, such that the vectors $x_j$ may be taken to be i.i.d. with density
$f(x)$. Conditional on the $x_j$ the responses are taken to be independent,
with distribution depending on $x_j$ and parameter $\beta$. The conditional
probabilities may be denoted

P(1 \mid x_j; \beta) = P(Y_j = 1 \mid X_j = x_j; \beta)
P(0 \mid x_j; \beta) = P(Y_j = 0 \mid X_j = x_j; \beta).   (6.23)

The marginal probability function for $Y_j$ is given by

W_1(\beta) = P(Y_j = 1; \beta) = \int P(1 \mid x, \beta) f(x)\,dx
W_0(\beta) = P(Y_j = 0; \beta) = 1 - W_1(\beta).   (6.24)

If the $x_j$ were also observed for the whole population, the score function
vector for $\beta$ could be written

\sum_{j \in S_1} \frac{\partial \log P(1 \mid x_j; \beta)}{\partial \beta} + \sum_{j \in S_0} \frac{\partial \log P(0 \mid x_j; \beta)}{\partial \beta},   (6.25)

where $S_1 = \{j : Y_j = 1\}$ and $S_0 = \{j : Y_j = 0\}$ are the strata of cases
and controls, with respective random sizes $N_1$ and $N_0$. For the actual
study, however, a sample is selected from each stratum according to
some scheme, and the $x_j$ are observed only for the sampled $j$.

If the scheme is such that $n_1$ and $n_0$ units are selected at random from
$S_1$ and $S_0$ respectively, the inclusion probabilities will be response-dependent,
since $\pi_j = n_1/N_1$ if $j$ is in $S_1$, and $= n_0/N_0$ if $j$ is in $S_0$.
The sample estimating function system analogous to (6.13) is then

\phi_s = \frac{N_1}{n_1} \sum_{j \in s_1} \frac{\partial \log P(1 \mid x_j; \beta)}{\partial \beta} + \frac{N_0}{n_0} \sum_{j \in s_0} \frac{\partial \log P(0 \mid x_j; \beta)}{\partial \beta} = 0,   (6.26)

where $s_1, s_0$ are the samples from $S_1$ and $S_0$ respectively. As in the general
case, (6.26) has the property of being design-unbiased for (6.25),
given the determination of $S_1$ and $S_0$. Thus it is model-unbiased 'on
average' if the model is true. Godambe (1989b) has discussed its optimality
properties.

Here again it is possible to write down the score function from the
sample data, as

N_1 \frac{\partial \log W_1(\beta)}{\partial \beta} + N_0 \frac{\partial \log W_0(\beta)}{\partial \beta} + \sum_{j \in s_1} \frac{\partial}{\partial \beta} \log f_1^*(x_j; \beta)
+ \sum_{j \in s_0} \frac{\partial}{\partial \beta} \log f_0^*(x_j; \beta);   (6.27)

the function $f_1^*(x_j; \beta) = f(x_j) P(1 \mid x_j; \beta)/W_1(\beta)$ is the conditional
density of $x_j$, given $Y_j = 1$, and $f_0^*(x_j; \beta)$ is defined similarly. The last
two terms of (6.27) form a model-unbiased estimating function which
is valid conditional on $S_1$, $S_0$ and the sample drawn. However, for these
components to be usable for model-unbiased inference, it is necessary
to know $f(x)$ to the extent of being able to calculate $W_1(\beta)$ as a
function of $\beta$. When this knowledge is absent, (6.26) is an appealing
alternative.
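A sketch of solving (6.26) by Newton iteration for a logistic model $P(1 \mid x; \beta)$ follows; the population and the stratum sample sizes are hypothetical:

    import numpy as np

    rng = np.random.default_rng(8)
    N = 20000
    x = rng.normal(0, 1, N)
    p = 1 / (1 + np.exp(-(-2.0 + 1.0 * x)))      # logistic model, beta = (-2, 1)
    y = rng.random(N) < p
    S1, S0 = np.where(y)[0], np.where(~y)[0]
    n1, n0 = 300, 300
    s1 = rng.choice(S1, n1, replace=False)
    s0 = rng.choice(S0, n0, replace=False)

    xs = np.concatenate([x[s1], x[s0]])
    ys = np.concatenate([np.ones(n1), np.zeros(n0)])
    w = np.concatenate([np.full(n1, len(S1) / n1), np.full(n0, len(S0) / n0)])
    X = np.column_stack([np.ones_like(xs), xs])

    b = np.zeros(2)
    for _ in range(25):                          # Newton-Raphson on (6.26)
        mu = 1 / (1 + np.exp(-X @ b))
        score = X.T @ (w * (ys - mu))            # weighted score, matrix form
        info = (X * (w * mu * (1 - mu))[:, None]).T @ X
        b = b + np.linalg.solve(info, score)
    print(b)                                     # approximately (-2.0, 1.0)

The stratum weights $N_1/n_1$ and $N_0/n_0$ allow even the intercept to be recovered, which the unweighted case-control likelihood would not do.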

6.2.5 Pseudolikelihood constructions in life testing

Kalbfleisch and Lawless (1988a) have described the situation of a field
performance study in which $y$ is the time to failure for an item and $x$ is
a vector-valued covariate. Suppose for simplicity that there is a single
unknown real parameter $\theta$. There are $N$ items on test, but for any item
the time to failure is observed only if it falls within a warranty period.
That is, the values $y_j, x_j$ are observed only if $y_j \le T_0$ for a fixed time
$T_0$. If all $x_j$ values were observed, the score function would be

\sum_{j: y_j \le T_0} \frac{\partial}{\partial \theta} \log f(y_j \mid x_j; \theta) + \sum_{j: y_j > T_0} \frac{\partial}{\partial \theta} \log \bar{F}(T_0 \mid x_j; \theta),   (6.28)

where

\bar{F}(T_0 \mid x_j; \theta) = \int_{T_0}^{\infty} f(y \mid x_j; \theta)\,dy.

If no $x_j$ values outside the truncated data set were observed, the score
function would be

\sum_{j: y_j \le T_0} \frac{\partial}{\partial \theta} \log f(y_j \mid x_j; \theta) + \sum_{j: y_j > T_0} \frac{\partial}{\partial \theta} \log \int \bar{F}(T_0 \mid x; \theta) g(x)\,dx,   (6.29)

provided the density function $g(x)$ of the covariate were known.

However, the distribution of the covariate would rarely be known
precisely, and Kalbfleisch and Lawless have suggested instead taking a
small subsample $s_0$ of the items which have not failed by time $T_0$,
and observing the corresponding covariate values. This enables the
construction of the pseudoscore

\sum_{j: y_j \le T_0} \frac{\partial}{\partial \theta} \log f(y_j \mid x_j; \theta) + \frac{1}{p_2} \sum_{j \in s_0} \frac{\partial}{\partial \theta} \log \bar{F}(T_0 \mid x_j; \theta),   (6.30)

which is a direct estimate of (6.28) in the form of (6.13), if $p_2$ is the
inclusion probability for an item with unobserved failure time. Like
(6.20) and (6.26), the pseudoscore (6.30) is an unbiased estimating
function for $\theta$ on the average, though not conditional on the actual
subsample $s_0$.

The form of the variance estimator $v_c(\phi_s)$ from (6.13) is simple,
particularly if $s_0$ is chosen by Bernoulli sampling, as indicated by
Kalbfleisch and Lawless (1988a):

\frac{1 - p_2}{p_2^2} \sum_{j \in s_0} \Big( \frac{\partial}{\partial \theta} \log \bar{F}(T_0 \mid x_j; \theta) \Big)^2 + \sum_{j: y_j \le T_0} v_{1j} + \frac{1}{p_2} \sum_{j \in s_0} v_{2j},   (6.31)

where

v_{2j} = \Big( \frac{\partial}{\partial \theta} \log \bar{F}(T_0 \mid x_j; \theta) \Big)^2 \quad \text{or} \quad -\frac{\partial^2}{\partial \theta^2} \log \bar{F}(T_0 \mid x_j; \theta).

For applications of pseudolikelihood constructions in models for disease
incidence and mortality, see also Kalbfleisch and Lawless (1988b).
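A sketch of the pseudoscore (6.30) follows for a hypothetical exponential model with hazard $\theta x_j$, in which the pseudoscore equation happens to have a closed-form root:

    import numpy as np

    # Hypothetical model: f(y|x; theta) exponential with hazard theta*x, so
    # d/dtheta log f = 1/theta - x*y and d/dtheta log Fbar(T0|x) = -x*T0.
    rng = np.random.default_rng(9)
    N, T0, p2, theta_true = 10000, 1.0, 0.05, 0.8
    x = rng.uniform(0.5, 2.0, N)
    y = rng.exponential(1.0 / (theta_true * x))
    failed = y <= T0
    s0 = (~failed) & (rng.random(N) < p2)    # Bernoulli(p2) subsample of survivors

    # Setting the pseudoscore (6.30) to zero and solving for theta:
    theta_hat = failed.sum() / (np.sum(x[failed] * y[failed])
                                + np.sum(x[s0] * T0) / p2)
    print(theta_hat)                         # approximately theta_true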

6.2.6 Estimation of the average of a mean function

The closer the 'analytic' purpose is to a descriptive purpose, the easier
it is to justify incorporating use of the sampling design in inference.
For example, the descriptive purpose of estimating a finite population
mean when the measurement of the $y$ values is subject to zero-mean
response error is formally the same problem as estimating an analytic
parameter $\theta$ which is interpreted as the population average of a 'mean
function',

\theta = \sum_{j=1}^{N} \theta_j/N,

where

\theta_j = E(Y_j), \quad j = 1, \ldots, N.

We saw this form earlier in Section 6.1 as one of the possible interpretations
of a superpopulation proportion $\theta$, an interpretation as the
population average of probabilities $\theta_j$ for stochastic subject sequences.

If we have no model for the $\theta_j$, or if we have one but would prefer
not to rely on it completely, then we might wish in estimating $\theta$ to use
a randomized design and the estimating function

\phi_s = \sum_{j \in s} (Y_j - \theta)/\pi_j.

These are model-unbiased 'on the average under the design' though not
unbiased conditional on the sample drawn. The estimator of $\theta$ from the
equation $\phi_s = 0$ would be

\hat{\theta}_s = \Big(\sum_{j \in s} y_j/\pi_j\Big) \Big/ \Big(\sum_{j \in s} 1/\pi_j\Big).   (6.32)

The variance estimator $v_c(\phi_s)$ from (6.16) would take the form

v_c(\phi_s) = v(\phi_s) + \sum_{j \in s} \frac{v_j}{\pi_j},   (6.33)

where $v(\phi_s)$ is a sampling-unbiased estimator of $\mathrm{Var}_p(\phi_s)$, and is the
only term which would appear if the object of interest were $\mu_y = \sum_{j=1}^{N} Y_j/N$ instead of $\theta$. Since we are estimating $\theta$, there is an additional
term $\sum_{j \in s} v_j/\pi_j$ in (6.33) such that $\mathcal{E} v_j = \mathrm{Var}(Y_j - \theta) = \mathrm{Var}(Y_j - \theta_j)$. (Note that here we take $\theta_j$ to be non-random.) In the
application to estimating a true finite population mean in the presence
of response error, $v_j$ would be an estimate of the response variance for
unit $j$.
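A sketch of (6.32) and (6.33) under SRS follows; here the response variance $\mathrm{Var}(Y_j - \theta_j)$ is taken as a known constant tau2 purely for illustration, standing in for the estimates $v_j$:

    import numpy as np

    rng = np.random.default_rng(10)
    N, n, tau2 = 5000, 200, 0.25
    theta_j = rng.uniform(1.0, 3.0, N)       # unit-level means theta_j
    s = rng.choice(N, n, replace=False)
    y_s = theta_j[s] + rng.normal(0, np.sqrt(tau2), n)   # one response per unit
    pi = np.full(n, n / N)

    theta_hat = np.sum(y_s / pi) / np.sum(1.0 / pi)      # (6.32)
    # v(phi_s): design-unbiased variance estimator of phi_s under SRS
    v_phi = N**2 * (1 - n / N) * np.var(y_s - theta_hat, ddof=1) / n
    v_c = v_phi + np.sum(tau2 / pi)                      # (6.33), with v_j = tau2
    se = np.sqrt(v_c) / np.sum(1.0 / pi)                 # scale for theta_hat
    print(theta_hat, theta_hat - 1.96 * se, theta_hat + 1.96 * se)

The limits follow from solving (6.17) approximately, with $v_c$ evaluated at $\hat{\theta}_s$.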

6.3 Estimating functions for vector parameters

In preparation for discussion in Sections 6.4 and 6.5 of the generalized
linear model, we will summarize some of the theory of optimal
estimating functions for a vector parameter $\theta$. Although the traditional
notation $\theta$ is used here, application will be to $\beta$ or to $\beta$ and $\alpha$ in
Section 6.4. We suppose, then, that the joint distribution of observations
depends on $\theta$, which has dimension $p \times 1$, and that we are interested
in estimating $\theta$ through a system of $p$ unbiased estimating equations,
denoted collectively by

\Phi(\theta) = 0;   (6.34)

here $\Phi(\theta)$ and the zero vector $0$ each have dimension $p \times 1$, and
$\Phi(\theta)$ is a function of the observations as well as of $\theta$.

In particular regular cases the resulting estimates $\hat{\theta}$ will have asymptotic
covariance matrix

\mathcal{I}(\Phi) = \Big( \mathcal{E}\frac{\partial \Phi}{\partial \theta} \Big)^{-1} V(\Phi) \Big[ \Big( \mathcal{E}\frac{\partial \Phi}{\partial \theta} \Big)^{-1} \Big]^T,   (6.35)

where $V(\Phi)$ is the covariance matrix of $\Phi$ and $\partial \Phi/\partial \theta$ is the matrix
with $(u, v)$th component $\partial \Phi_u/\partial \theta_v$. Even without this interpretation, it
is useful to think of (6.35) as a measure of performance of the system
(6.34), to be minimized if possible in some convenient ordering of matrices.
Godambe and Kale (1991) have reviewed the history and use of
this criterion.

Let $f$ denote the density or probability function for the observations,
regarded as a function of $\theta$. One unbiased estimating function system
is given by the multi-dimensional score function $S(\theta)$, for which the
$u$th component is

S_u(\theta) = \partial \log f/\partial \theta_u.

It is well known that if the class of possibilities for $\Phi(\theta)$ includes
$S(\theta)$, then in regular cases

\mathcal{I}(\Phi) - \mathcal{I}(S)

is non-negative definite for all $\Phi$, and in this sense $S$ is optimal. It is
easily shown that

V(S) = -\mathcal{E}\Big( \frac{\partial S}{\partial \theta} \Big),

and hence that

\mathcal{I}(S) = V^{-1}(S).

In many applications, we have a vector $g(\theta)$ whose components are
elementary estimating functions $g_j(\theta)$, $j = 1, \ldots, M$. These components
are functions of $\theta$ and parts of the total observation; the vector
$g(\theta)$ has zero expectation, and covariance matrix $V(g)$. Typically $M$ is
greater than $p$, and we consider for estimating $\theta$ the class of all $p \times 1$
systems $\Phi(\theta) = 0$ with components of $\Phi(\theta)$ being linear combinations
of components of $g(\theta)$. That is, we consider systems like

A g(\theta) = 0,
where $A$ is a $p \times M$ matrix which may depend on $\theta$, but not on the
observations.

If $H$ denotes the $M \times p$ matrix $\mathcal{E}(\partial g/\partial \theta)$, the matrix $\mathcal{I}(\Phi)$ of (6.35)
becomes

\mathcal{I}(Ag) = (AH)^{-1} A V(g) A^T [(AH)^T]^{-1},   (6.36)

and we can show that the best choice of $A$ is

A^* = H^T V^{-1}(g),   (6.37)

and that

\mathcal{I}(A^* g) = (H^T V^{-1}(g) H)^{-1};   (6.38)

see, for example, Heyde (1987). One method of proof notes that the
covariance matrix of $(AH)^{-1} A g - (A^* H)^{-1} A^* g$ is

\mathcal{I}(Ag) - 2(AH)^{-1} A V(g) A^{*T} [(A^* H)^T]^{-1} + \mathcal{I}(A^* g),

which is easily seen to be $\mathcal{I}(Ag) - \mathcal{I}(A^* g)$; it follows that $\mathcal{I}(Ag) - \mathcal{I}(A^* g)$, being a covariance matrix, is non-negative definite.

If the components of $g$ are orthogonal, so that $V(g)$ is a diagonal
matrix, then the $u$th component of $\Phi^*(\theta) = A^* g(\theta)$ can be written in
the familiar quasiscore form

\Phi_u^*(\theta) = \sum_j \mathcal{E}\Big( \frac{\partial g_j}{\partial \theta_u} \Big) \mathrm{Var}^{-1}(g_j)\, g_j(\theta)   (6.39)

(Godambe and Thompson, 1989).

Both in this case and in general, $\Phi_u^*(\theta)$ is optimal for estimating $\theta_u$
alone when all the other components are known. Furthermore,

\mathrm{Cov}(\Phi_u^*(\theta), \Phi_v^*(\theta)) = \mathcal{E}\Big( \frac{\partial \Phi_u^*(\theta)}{\partial \theta_v} \Big) = \mathcal{E}\Big( \frac{\partial \Phi_v^*(\theta)}{\partial \theta_u} \Big)   (6.40)

for all $u, v = 1, \ldots, p$. Thus if $\Phi_u^*(\theta)$ and $\Phi_v^*(\theta)$ are orthogonal in
the sense of being uncorrelated, $\Phi_u^*(\theta)$ changes little with changes in
$\theta_v$ near the true value.

More generally, as considered by Godambe (1991), suppose the
parameter $\theta$ is partitioned so that $\theta^T = (\beta^T, \alpha^T)$, and that $\Phi^*(\theta)$
is correspondingly partitioned as $\Phi^{*T} = (\Phi_1^{*T}, \Phi_2^{*T})$. The subsystem

\Phi_1^*(\theta) = 0

would be optimal for estimating $\beta$ if $\alpha$ were known. Thus its solutions
would depend on $\alpha$ in general. When $\alpha$ is unknown, then for assessing
uncertainty in the estimation of $\beta$, we would prefer to use an estimating
function subsystem as little dependent on $\alpha$ as possible.
In fact, the relations (6.40) suggest the construction of another unbiased
system, with the same solutions overall, such that the first part has
relatively little dependence on the 'nuisance parameter' $\alpha$. The new
system is $(\Phi_1^{**T}, \Phi_2^{*T})$, where

\Phi_1^{**} = \Phi_1^* - \mathcal{E}(\Phi_1^* \Phi_2^{*T}) V^{-1}(\Phi_2^*)\, \Phi_2^*,   (6.41)

and we have $\partial \Phi_1^{**}/\partial \alpha \approx 0$ since (6.40) implies the rough approximation

\Phi_1^{**} \approx \Phi_1^* - \mathcal{E}\Big(\frac{\partial \Phi_1^*}{\partial \alpha}\Big) \Big[\mathcal{E}\Big(\frac{\partial \Phi_2^*}{\partial \alpha}\Big)\Big]^{-1} \Phi_2^*.

Thus it will make sense to use the variability of $\Phi_1^{**}$, at a fixed value
of $\alpha$, in approximate inferences about $\beta$ in the absence of knowledge
of $\alpha$.

Finally, let us suppose that our system of estimating equations can
be written

\Phi(\theta) = \sum_{j=1}^{N} \phi_j(\theta) = 0,   (6.42)

where the vectors $\phi_j(\theta)$ are independent with expectation 0. If

J(\theta) = \sum_{j=1}^{N} \phi_j(\theta) \phi_j^T(\theta),   (6.43)

it is easily seen that $J(\theta)$ estimates the covariance matrix of $\Phi(\theta)$ consistently
under regularity conditions. For large $N$ it is often reasonable
to base confidence regions for $\theta$ on the distributional approximations

\Phi^T(\theta) J^{-1}(\theta) \Phi(\theta) \approx \chi^2_{(p)}   (6.44)

and

\Phi^T(\theta) J^{-1}(\hat{\theta}) \Phi(\theta) \approx \chi^2_{(p)},   (6.45)

$\hat{\theta}$ being a solution of (6.42). See, for example, the discussion by Mach
(1988) in the case of estimating a scalar mean.
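A sketch of a confidence region based on (6.43) and (6.44) follows, in the scalar-mean case mentioned above, with $\phi_j(\theta) = Y_j - \theta$; the data are hypothetical and the whole population is taken as observed:

    import numpy as np

    rng = np.random.default_rng(11)
    y = rng.gamma(2.0, 1.5, size=400)

    def score_stat(theta):
        phi = y - theta
        # Phi' J^{-1} Phi of (6.44), scalar case: Phi^2 / sum of phi_j^2
        return phi.sum() ** 2 / np.sum(phi ** 2)

    # Invert the test: keep theta values with score_stat <= chi2_{1,0.95} = 3.841
    grid = np.linspace(y.mean() - 1, y.mean() + 1, 2001)
    inside = grid[np.array([score_stat(t) for t in grid]) <= 3.841]
    print(y.mean(), inside.min(), inside.max())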

6.4 The generalized linear model for survey populations

An important framework for analytic inference is the generalized linear
model, and it is increasingly applied to survey data. This will be
the subject of Sections 6.4 and 6.5. The variety of interpretations seen
in Section 6.1 for the model and its parameters will be present here
also, and the likely heterogeneity of the population will again play a
role. Thus we will begin in Section 6.4 with a detailed discussion of
the model and its variants. Essentially, we will describe how inference
might be made for superpopulation parameters if the whole finite population
were surveyed. Then in Section 6.5 we will indicate how the
analyses would proceed using data from a sample, obtained through
a simple or a complex sampling design. These analyses will focus in
particular on the special case of logistic regression.

A comprehensive treatment of the generalized linear model is given
by McCullagh and Nelder (1989).

6.4.1 The simple generalized linear model

In its simplest contexts, the generalized linear model may be formulated
as follows, at the level of a small, homogeneous population.

Let $Y$ be a vector of random variates for which the joint distribution
depends on a vector of parameters

\theta = (\theta_j)_{N \times 1}.

If we assume that $\theta$ varies in $R^N$, a linear model for $\theta$ is a statement
that $\theta$ belongs to a linear subspace of $R^N$, or equivalently that

\theta = X\beta,   (6.46)

where $X$ is some known $N \times p$ matrix, $p < N$, and $\beta$ is a $p \times 1$ vector
of parameters $\beta_1, \beta_2, \ldots, \beta_p$.

For example, suppose that the 'response' vector

Y = (Y_1, \ldots, Y_N)^T

has real components, and that these are independently and normally
distributed, with

Y_j \sim N(\theta_j, \sigma_j^2).   (6.47)

Suppose there is a $p$-dimensional covariate $x$, with value for the $j$th
component given by the row vector $x_j^T$. Then the regression model

Y_j = x_j^T \beta + E_j,   (6.48)

with $E_1, \ldots, E_N$ independent and $E_j \sim N(0, \sigma_j^2)$ (cf. (5.45)), can be
expressed as (6.47) together with a linear model for $\theta$, namely (6.46)
where $X$ has $x_j^T$ as its $j$th row.

In another example, $Y$ may be a vector of independent binary responses,
with

P(Y_j = 1) = p_j,
P(Y_j = 0) = 1 - p_j,   (6.49)
for $j = 1, \ldots, N$. If $X$ is a covariate matrix as in the previous example,
a linear model would take the form (6.46), where

\theta_j = f(p_j)

for a suitable transforming function $f$, known as the link function.
Special cases include the probit link function

f(p) = \Phi^{-1}(p),

where $\Phi$ is the standard normal distribution function, and the link function

f(p) = \log[p/(1 - p)],

which corresponds to logistic regression.

In the general case of independent response components, when the
distribution of $Y$ under the linear model is known apart from a low-dimensional
parameter, then $\beta$ of (6.46) is estimated by maximum
likelihood estimation or a generalized least squares procedure. Interval
estimates for $\beta$ are usually based on asymptotic distributions of the
point estimates.

For example, if $\beta$ is the only parameter, and $l(\beta)$ is the log likelihood
function based on the full observation of $Y$, then the score vector
can be written

S_\beta^T = \Big( \frac{\partial l}{\partial \beta_1}, \ldots, \frac{\partial l}{\partial \beta_p} \Big)   (6.50)

and the information function matrix is

I_\beta = \Big(\Big( -\frac{\partial^2 l}{\partial \beta_r \partial \beta_s} \Big)\Big)_{p \times p}.   (6.51)

The maximum likelihood equations are the components of the system

S_\beta = 0.   (6.52)

In the logistic regression case, where the log likelihood function for
$\theta$ would have the form

\log \prod_{j=1}^{N} \Big[ \Big( \frac{e^{\theta_j}}{1 + e^{\theta_j}} \Big)^{Y_j} \Big( \frac{1}{1 + e^{\theta_j}} \Big)^{1 - Y_j} \Big],   (6.53)

the log likelihood for $\beta$ (since $\theta_j = x_j^T \beta$) is

l(\beta) = \sum_{j=1}^{N} Y_j x_j^T \beta - \sum_{j=1}^{N} \log(1 + e^{x_j^T \beta}).   (6.54)
The maximum likelihood equations are

S_\beta = \sum_{j=1}^{N} (Y_j - \mu_j(\beta)) x_j = 0,   (6.55)

or

X^T (Y - \mu(\beta)) = 0,

where

\mu_j(\beta) = \frac{e^{x_j^T \beta}}{1 + e^{x_j^T \beta}}

is the expected value of $Y_j$ under the linear model.

These maximum likelihood equations may be solved iteratively. For
example, in the $v$th iteration of the Newton-Raphson algorithm,

\beta^{(v+1)} = \beta^{(v)} + I_{\beta^{(v)}}^{-1} X^T (Y - \mu(\beta^{(v)})).   (6.56)

This has an expression as an iteratively reweighted least squares (IRLS)
algorithm:

\beta^{(v+1)} = (X^T V_v X)^{-1} X^T V_v z_v,   (6.57)

where the pseudo-observation vector $z_v$ is given by

z_v = X\beta^{(v)} + V_v^{-1} (Y - \mu(\beta^{(v)}))

and

V_v = \mathrm{diag}\Big( e^{(X\beta^{(v)})_j} \Big/ \big(1 + e^{(X\beta^{(v)})_j}\big)^2 \Big).
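A minimal sketch of the IRLS iteration (6.57) on simulated data follows; the starting value and stopping rule are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(12)
    N = 1000
    X = np.column_stack([np.ones(N), rng.normal(0, 1, N)])
    beta_true = np.array([-0.5, 1.2])
    Y = (rng.random(N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

    beta = np.zeros(2)
    for _ in range(25):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))
        Vdiag = mu * (1 - mu)                    # diagonal of V_v
        z = eta + (Y - mu) / Vdiag               # pseudo-observations z_v
        WX = X * Vdiag[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)   # (6.57)
        if np.allclose(beta_new, beta, atol=1e-10):
            beta = beta_new
            break
        beta = beta_new
    print(beta)    # approximately beta_true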

6.4.2 Confidence regions

There are two approaches to the determination of confidence regions
for $\beta$ which have natural extensions to cases of complex sampling
designs. Here again we will assume independent response components,
with $\beta$ as the only unknown parameter of their distributions.

The first approach, and the one we will emphasize, is based on the
distribution of the score vector $S_\beta$. In regular cases, when $\beta$ is the true
value, the expectation of $S_\beta$ is a zero vector, and its covariance matrix
is given by

\mathrm{Cov}(S_\beta) = \mathcal{E}(I_\beta).   (6.58)

If the model is correct and $I_\beta$ is invertible, approximate confidence
regions for $\beta$ may be obtained by inverse testing, using the large-sample
approximation

S_\beta^T I_\beta^{-1} S_\beta \approx \chi^2_{(p)}   (6.59)

or, with usually less computational effort, using

S_\beta^T I_{\hat{\beta}}^{-1} S_\beta \approx \chi^2_{(p)}.   (6.60)

If we have faith in the model only as far as the independence of response
components and the unbiasedness of the score estimating functions is
concerned, we might prefer to use a more robust estimator than $I_\beta$ of
the covariance matrix of $S_\beta$. A common choice is the matrix

J_\beta = \Big(\Big( \sum_{j=1}^{N} \frac{\partial l_j}{\partial \beta_r} \frac{\partial l_j}{\partial \beta_s} \Big)\Big)_{p \times p},   (6.61)

where $l_j$ is the log likelihood term associated with $Y_j$, leading to confidence
regions obtained from the approximations

S_\beta^T J_\beta^{-1} S_\beta \approx \chi^2_{(p)},   (6.62)

S_\beta^T J_{\hat{\beta}}^{-1} S_\beta \approx \chi^2_{(p)}.   (6.63)

The second approach is based on approximations to the distribution
of the maximum likelihood estimates, beginning with the Taylor series
approximation

\hat{\beta} - \beta \approx I_\beta^{-1} S_\beta.   (6.64)
approximation
(6.64)
Carrying this approximation through to the computation of second-order
moments, we obtain when (3 is the true value
NCov(j3) ~ N(£I f3)-'Cov(S f3)(£I f3)-"

which is well known to reduce to


N(£I f3)-'

if the model is correct, because of (6.58). If the model is doubtful,


rather than estimating Cov(jJ) by I;' we would probably prefer to use
something like the 'sandwich estimator'
"L...,f3-
- 1-'
f3 J f3 1-'
f3 (6.65)
and base confidence regions on the approximation
(6.66)

or
(6.67)
All the methods for confidence regions described here can be shown
to be valid (and equivalent) for large N under suitable conditions on
the model and covariates (Cox and Hinkley, 1974, Chapter 9).
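In the logistic case the quantities entering (6.55), (6.61) and (6.65) have closed forms, and the model-based and robust covariance estimates can be computed directly. A sketch under those formulas (the function name is illustrative, not from the text):

import numpy as np

def logistic_robust_quantities(X, y, beta):
    """Score (6.55), information, robust J of (6.61), the sandwich (6.65),
    and the score statistic of (6.62) for logistic regression."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    r = y - mu
    S = X.T @ r                                   # score vector S_beta
    I = X.T @ ((mu * (1.0 - mu))[:, None] * X)    # information matrix I_beta
    J = X.T @ ((r ** 2)[:, None] * X)             # robust estimator J_beta
    I_inv = np.linalg.inv(I)
    sandwich = I_inv @ J @ I_inv                  # 'sandwich estimator' (6.65)
    score_stat = float(S @ np.linalg.solve(J, S)) # left-hand side of (6.62)
    return S, I, J, sandwich, score_stat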

6.4.3 Nested models


In many applications it is of interest to test whether θ actually belongs to a lower-dimensional space than that spanned by the columns of the original covariate matrix. Formally, suppose that under model 1, θ = X_1 γ, while under model 2 (the reduced model), we have θ = X_2 β, where

X_2 = X_1 A.

There are no other unknown parameters. We suppose X_1 (N×p_1) has rank p_1 and X_2 (N×p_2) has rank p_2, with p_1 > p_2. Then if model 2 holds, the approximation

S_γ^T(Aβ̂) J_γ^{−1}(Aβ̂) S_γ(Aβ̂) ≈ χ²_{(p_1−p_2)}    (6.68)

is true very generally for large N, where S_γ(Aβ̂) means S_γ evaluated at γ = Aβ̂, and the other quantities are defined similarly. Thus the 'score statistic' (6.68) can be used to test for model 2 under the assumption of model 1 (Cox and Hinkley, 1974).

Example: equality of stratum probabilities


Suppose that the first N_1 components of Y come from one stratum, and the remaining N_2 = N − N_1 from another. The model that assumes θ_j to be constant within strata (model 1) would take the form

θ = X_1 γ,    (6.69)

where the first column of X_1 is the indicator of the first stratum (1 for the first N_1 components, 0 elsewhere) and the second column is the indicator of the second stratum. The reduced model that assumes θ_j to be constant overall (model 2) is expressed as

θ = X_2 β,

where X_2 is a column of 1s. If, furthermore, model 1 is a logistic regression model for binary responses, we have

e^{β̂}/(1 + e^{β̂}) = Σ_{j=1}^N Y_j / N = p̂.

If model 2 holds, then (6.68) becomes

N_1(p̂_1 − p̂)²/[p̂_1(1 − p̂_1) + (p̂_1 − p̂)²] + N_2(p̂_2 − p̂)²/[p̂_2(1 − p̂_2) + (p̂_2 − p̂)²] ≈ χ²_(1),

where p̂_1 and p̂_2 are the response proportions in the two strata. If we were to use I_γ^{−1} in place of J_γ^{−1}, we would have on the left-hand side of (6.68) the Pearson chi-square statistic, expressible as

N_1(p̂_1 − p̂)²/[p̂_1(1 − p̂_1)] + N_2(p̂_2 − p̂)²/[p̂_2(1 − p̂_2)].
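The two versions of the statistic can be evaluated in a few lines; a small numerical sketch (all names illustrative):

import numpy as np

def stratum_score_test(y1, y2):
    """Score statistic (6.68) and Pearson statistic for testing equality
    of two stratum probabilities, as in the example above."""
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    N1, N2 = len(y1), len(y2)
    p1, p2 = y1.mean(), y2.mean()
    p = (y1.sum() + y2.sum()) / (N1 + N2)
    # robust (J-based) version: approximately chi-square(1) under model 2
    score = (N1 * (p1 - p) ** 2 / (p1 * (1 - p1) + (p1 - p) ** 2)
             + N2 * (p2 - p) ** 2 / (p2 * (1 - p2) + (p2 - p) ** 2))
    # I-based version: the Pearson chi-square statistic
    pearson = (N1 * (p1 - p) ** 2 / (p1 * (1 - p1))
               + N2 * (p2 - p) ** 2 / (p2 * (1 - p2)))
    return score, pearson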

6.4.4 An exponential model form


When further parameters besides β are needed to characterize the response distributions, it is common practice to assume an exponential family form, since this form can incorporate a dispersion parameter φ very tractably. In continuous distribution cases the log likelihood function takes the form

l(β, φ) = Σ_{j=1}^N [ (Y_j ψ_j − b(ψ_j))/(φ a_j) + c_j(Y_j, φ) ]    (6.70)

with ψ_j = f(θ_j), θ_j = x_j^T β and a_j being specified constants (Nordberg, 1989).

Maximum likelihood estimation


The maximum likelihood estimating equation system is

S_β = Σ_{j=1}^N [(Y_j − b′(f(x_j^T β)))/(φ a_j)] f′(x_j^T β) x_j = 0    (6.71)

S_φ = Σ_{j=1}^N [ −(1/φ²)(Y_j f(x_j^T β) − b(f(x_j^T β)))/a_j + ∂c_j(Y_j, φ)/∂φ ] = 0.

Because S_β depends on φ only through a scale factor, estimation of β based on asymptotic theory is unaffected by whether φ is known or unknown. We might base confidence regions for β on the approximation

S_β^T I_β^{−1} S_β ≈ χ²_(p),    (6.72)

where

I_β = Σ_{j=1}^N [b″(f(x_j^T β)) [f′(x_j^T β)]² / (φ a_j)] x_j x_j^T,

or on

S_β^T J_β^{−1} S_β ≈ χ²_(p),    (6.73)

where

J_β = Σ_{j=1}^N [(Y_j − b′(f(x_j^T β)))² / (φ² a_j²)] [f′(x_j^T β)]² x_j x_j^T.    (6.74)

Although S_β, I_β and J_β all involve φ, the score statistics in (6.72) and
(6.73) do not. For testing nested models, (6.68) is applicable.

Example: normal regression


For the normal regression model

Y_j ~ N(θ_j, σ_j²),  θ_j = x_j^T β,

where σ_j² = σ² a_j and a_j is known, the dispersion parameter φ is σ².

The log likelihood function is

l(β, σ²) = −Σ_{j=1}^N [1/(2σ² a_j)](Y_j − x_j^T β)² − Σ_{j=1}^N log(σ² a_j)^{1/2}.

The maximum likelihood equations are

Σ_{j=1}^N [1/(σ² a_j)](Y_j − x_j^T β) x_j = 0,
Σ_{j=1}^N [1/(2σ⁴ a_j)](Y_j − x_j^T β)² − Σ_{j=1}^N 1/(2σ²) = 0.    (6.75)

The approximation (6.72) reduces to

(Y − Xβ)^T V^{−1} X [X^T V^{−1} X]^{−1} X^T V^{−1} (Y − Xβ) ≈ χ²_(p),    (6.76)

where V = diag(a_1, ..., a_N), while the approximation (6.73) becomes

(Y − Xβ)^T V^{−1} X [X^T B X]^{−1} X^T V^{−1} (Y − Xβ) ≈ χ²_(p),    (6.77)

with B = diag((Y_j − x_j^T β)²/(φ a_j)).

Extended quasilikelihood alternatives to likelihood


For the generalized linear model of (6.70), it is easy to show that

E Y_j = b′(ψ_j) and Var(Y_j) = φ b″(ψ_j) a_j,    (6.78)

where ψ_j = f(x_j^T β). If we write E Y_j alternatively as μ_j = μ_j(β) and Var(Y_j) as φ V_j = φ V_j(β), we may write the first component of the score system (6.71) as

S_β = Σ_{j=1}^N [(Y_j − μ_j)/(φ V_j)] ∂μ_j/∂β.    (6.79)

Moreover, it is possible to define an optimal quadratic estimating function system using the theory outlined in Section 6.3: if ψ_j = x_j^T β, the first component of the system is given by (6.79), while the second component is

S_φ = Σ_{j=1}^N A_j[(Y_j − μ_j)² − φ V_j − B_j(Y_j − μ_j)],    (6.80)

where A_j = [φ² V_j (γ_{2j} + 2 − γ_{1j}²)]^{−1}, B_j = γ_{1j}(φ V_j)^{1/2}, and γ_{1j} and γ_{2j} are the skewness and kurtosis of Y_j. Although it is derived from the exponential model, the system (6.79) and (6.80) is unbiased much more generally, as long as the specification of the first two moments of Y_j is the same as for the exponential model. The same is true if the coefficients A_j and B_j are replaced by other approximating weights. Thus in (6.79) and a suitable approximation to (6.80) we have a system of estimating functions which is very widely applicable, and also efficient for the generalized linear exponential model (6.70). The notion of extended quasilikelihood was put forward by Nelder and Pregibon (1987). For discussion of the extended quasilikelihood system above, see, for example, Godambe (1992) and Dean (1991).
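As a concrete instance, the sketch below solves the quasi-score system (6.79) for a loglinear mean with V_j = μ_j and then estimates φ by a Pearson-type moment equation, one of the 'approximating weights' alternatives to (6.80) mentioned above; the divisor N − p is a conventional choice of ours, not prescribed by the text.

import numpy as np

def quasi_score_loglinear(X, y, n_iter=50, tol=1e-10):
    """Quasi-score estimation (6.79) for mu_j = exp(x_j' beta), V_j = mu_j,
    with a Pearson moment estimate standing in for (6.80)."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        # In (6.79), dmu/dbeta = mu_j x_j and phi V_j = phi mu_j, so phi
        # cancels and the system reduces to sum (y_j - mu_j) x_j = 0.
        score = X.T @ (y - mu)
        info = X.T @ (mu[:, None] * X)        # expected negative derivative
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(X @ beta)
    phi = np.sum((y - mu) ** 2 / mu) / (N - p)  # Pearson estimate of phi
    return beta, phi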

6.4.5 Summary of homogeneous population methods for generalized linear models
To this point we have imagined a homogeneous survey population
where the variates Yj are independent, and the log likelihood function
takes the form (6.70):

l(β, φ) = Σ_{j=1}^N [ (Y_j ψ_j − b(ψ_j))/(φ a_j) + c_j(Y_j, φ) ],

with ψ_j = x_j^T β. Letting

E Y_j = μ_j(β) = b′(ψ_j)

and

Var(Y_j) = φ V_j(β) = φ b″(ψ_j) a_j,

we have noted that the component of the score estimating function system associated with β is

S_β = Σ_{j=1}^N [(Y_j − μ_j)/(φ V_j)] ∂μ_j/∂β.

Two estimating functions for φ, namely S_φ of (6.71) and S_φ of (6.80), may be considered.

We have seen in Sections 6.4.1–6.4.4 several suggestions for obtaining confidence regions for β, and will now focus on the one most easily generalized, namely the one which involves the robust estimator J_β of the covariance matrix of S_β. Since φ enters S_β as a scale factor, confidence regions for β using J_β can be based on (6.62) or (6.63) whether φ is known or unknown: in the new notation we have

S_β^T J_β^{−1} S_β ≈ χ²_(p),  S_β^T J_{β̂}^{−1} S_β ≈ χ²_(p),    (6.81)

where S_β is given in (6.79) and

J_β = Σ_{j=1}^N [(Y_j − μ_j)²/(φ² V_j²)] (∂μ_j/∂β)(∂μ_j/∂β)^T.    (6.82)

6.4.6 Incorporating population heterogeneity


The class of generalized linear exponential models is broad, and compatible with methods that are robust to some variations in distributional form. Their parameters are easy to interpret, and their analysis is fairly routine. However, exponential models do place a constraint on the relationship between the mean and variance of the response Y_j, as in (6.78). They also assume independence of responses from unit to unit. Apparent violations of these assumptions may signal population heterogeneity. A more faithful model might incorporate random unit, cluster or location effects, and/or a multi-stage or spatial correlation structure. Depending on the purpose of our parametrization, we might specify a linear exponential model to hold conditional on random effects, or specify a marginal exponential model with dependence structure. These approaches will be outlined in the next few sections.

6.4.7 Generalized linear mixed model


One way of implementing the conditional approach is to introduce a
vector of additive random effects, giving rise to a generalized linear

mixed model (e.g. Breslow and Clayton, 1993). As before, X is an N×p design matrix, and β is a p×1 vector of fixed effects; but there is also another N×q design matrix Z, and a q×1 vector α of random effects. Conditional on α the Y_j are independent, and the density or probability function of Y_j is again

exp{ (Y_j ψ_j − b(ψ_j))/(φ a_j) + c_j(Y_j, φ) },    (6.83)

as in (6.70). It should be noted that φ is equal to 1 in the probability function case, when the responses Y_j are discrete. The a_j are constants as before. However, ψ_j is now a function of both β and α, namely

ψ_j = f(x_j^T β, z_j^T α).


The conditional expectation of Y_j we can denote in two ways: by

E(Y_j | α) = μ_j(β, α) = b′(ψ_j)    (6.84)

and by

μ_j(β, α) = h(ψ_j);    (6.85)

we also have

Var(Y_j | α) = φ V_j(β, α),    (6.86)

where V_j(β, α) = b″(ψ_j) a_j = h′(ψ_j) a_j.
Suppose we are interested primarily in the estimation of β. We specialize to the case where

ψ_j = x_j^T β + z_j^T α,    (6.87)

and consider two lines of attack, namely analysis through joint estimation of β and α, and analysis through marginal moments.

Analysis through joint estimation of β and α

For fixed α and all parameters known apart from β, the score function system for the estimation of β can be written as in (6.71), as

Φ_1(β, α) = Σ_{j=1}^N [(Y_j − μ_j)/(φ a_j)] x_j = 0;    (6.88)

in matrix form, the system is

Φ_1(β, α) = X^T g(β, α) = 0,    (6.89)

where g(β, α) has as its jth component

g_j(β, α) = (Y_j − μ_j)/(φ a_j).    (6.90)

Solving (6.89) for β̂ requires knowledge or estimation of α.
Suppose the random effects α satisfy

E α = 0,  Cov(α) = Σ,    (6.91)

where Σ is invertible. Then for estimating α for fixed β, the optimal combination of α itself and the components of g(β, α) is the estimating equation system

Φ_2(β, α) = −Z^T g(β, α) + Σ^{−1} α = 0    (6.92)

(Godambe, 1994).
Now for emphasis on estimating β, the theory leading to (6.40) suggests that it is useful to consider a system equivalent to (6.89) and (6.92), where Φ_1(β, α) is replaced by Φ_1*(β, α) given by

Φ_1* = Φ_1 − E(Φ_1 Φ_2^T) [Var(Φ_2)]^{−1} Φ_2.    (6.93)
The new first component Φ_1* will have approximate non-dependence on α. If D_1 denotes the diagonal matrix for which the jth diagonal element is E{h′(ψ_j)}/(φ a_j) = Var{g_j(β, α)}, we may write

Φ_1*(β, α) = X^T [I − D_1 Z(Σ^{−1} + Z^T D_1 Z)^{−1} Z^T] g(β, α)
           + X^T D_1 Z(Σ^{−1} + Z^T D_1 Z)^{−1} Σ^{−1} α = 0    (6.94)

as the new first component.
It is readily shown that the covariance matrix of Φ_1*(β, α) is given by X^T D_1 W^{−1} D_1 X, where

W^{−1} = D_1^{−1} − Z(Σ^{−1} + Z^T D_1 Z)^{−1} Z^T = (D_1 + D_1 Z Σ Z^T D_1)^{−1}.    (6.95)

Thus if all parameters of D_1 and Σ were known, then for point estimation of β we could solve the system (6.94) and (6.92), or equivalently the original system (6.89) and (6.92), perhaps by Fisher scoring as indicated by Breslow and Clayton (1993). Suppose we denote the solution by (β̂, α̂). Confidence intervals for β could be based on a suitable distributional assumption for

Φ_1*^T(β, α_0) [X^T D_1 W^{−1} D_1 X]^{−1} Φ_1*(β, α_0),    (6.96)

where α̂ may ultimately be substituted for the fixed value α_0. However, since the parameters of D_1 and Σ are likely to be unknown,

more robust methods will be of interest, and these will be discussed for clustered populations in Section 6.4.10.
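For known Σ and φ a_j = 1, the joint system (6.89) and (6.92) can be solved by a Newton iteration on (β, α). A minimal sketch for the logistic case (names illustrative; Fisher scoring as in Breslow and Clayton would use the expected derivative matrix, which for the canonical logistic link coincides with the blocks below):

import numpy as np

def joint_glmm_logistic(X, Z, y, Sigma, n_iter=50, tol=1e-8):
    """Newton iteration for the joint system (6.89) and (6.92):
       Phi_1 = X'(y - mu) = 0,  Phi_2 = -Z'(y - mu) + Sigma^{-1} alpha = 0,
    with mu_j = expit(x_j' beta + z_j' alpha) and phi a_j = 1."""
    N, p = X.shape
    q = Z.shape[1]
    Sinv = np.linalg.inv(Sigma)
    theta = np.zeros(p + q)                     # stacked (beta, alpha)
    for _ in range(n_iter):
        beta, alpha = theta[:p], theta[p:]
        psi = X @ beta + Z @ alpha
        mu = 1.0 / (1.0 + np.exp(-psi))
        w = mu * (1.0 - mu)                     # h'(psi_j)
        u = np.concatenate([X.T @ (y - mu),
                            -Z.T @ (y - mu) + Sinv @ alpha])
        XtWX = X.T @ (w[:, None] * X)
        XtWZ = X.T @ (w[:, None] * Z)
        ZtWZ = Z.T @ (w[:, None] * Z)
        # Jacobian of the stacked system at the current iterate
        J = np.block([[-XtWX,   -XtWZ],
                      [ XtWZ.T,  ZtWZ + Sinv]])
        step = np.linalg.solve(J, u)
        theta = theta - step                    # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return theta[:p], theta[p:]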

A marginal moment analysis

An alternative analysis would base an estimating function system for β on the marginal moments of the Y_j. That is, under the generalized linear mixed model, let

m_j(β) = E Y_j = E h(ψ_j).    (6.97)

Then for small α we have

Var(Y_j) ≈ E h′(ψ_j) φ a_j + [E h′(ψ_j)]² z_j^T Σ z_j,
Cov(Y_j, Y_k) ≈ [E h′(ψ_j)] z_j^T Σ z_k [E h′(ψ_k)].

If g_m(β) has as its jth component

(Y_j − m_j)/(φ a_j),    (6.98)

it can be shown that the covariance matrix for g_m(β) is approximately

W = D_1 + D_1 Z Σ Z^T D_1,    (6.99)

where D_1 = diag(E h′(ψ_j)/φ a_j). It follows from (6.37) that an approximately optimal combination of the components of g_m(β) for estimating β would be the system

Φ_m*(β) = X^T D_1 W^{−1} g_m(β) = 0.    (6.100)

The covariance matrix of Φ_m*(β) is approximately equal to X^T D_1 W^{−1} D_1 X, the covariance matrix of Φ_1*(β, α). Unlike Φ_1*(β, α_0) or Φ_1*(β, α(β)), where α(β) solves (6.92), the left-hand side of the system (6.100) is exactly unbiased, and if tractable should give good estimates.

Formation of confidence regions for β based on Φ_m*(β) will be discussed in Section 6.4.10 for clustered populations.

6.4.8 Poisson overdispersion models

In some contexts, introducing multiplicative rather than additive random effects may be convenient. A simple example arises when the responses Y_j are counts, and when these counts would be distributed 'locally' as independent Poisson variates, with means determined by a loglinear dependence on x_j. For this model the score function system for β is easily seen to be

S_β = Σ_{j=1}^N (Y_j − μ_j(β)) x_j,

where

μ_j(β) = e^{x_j^T β}.
However, for many populations, count data seem to be Poisson-like but having variation in excess of what would be expected under the Poisson loglinear model; this is the phenomenon of overdispersion. It has been found that such situations can often be modelled by considering a conditional mean for the jth unit response to be a function

μ_j(β, α) = α_j e^{x_j^T β},    (6.101)

where the α_j are independent random variates, or are distributed jointly with some other dependence structure. If each E α_j = 1, the model has the nice feature that the parameter β plays the same role for the marginal mean

m_j = m_j(β) = e^{x_j^T β}    (6.102)

as it does for the conditional mean μ_j(β, α), in both cases measuring loglinear dependence on x. Dean (1991) has given a detailed discussion of this approach.
For example, if the α_j were independent gamma variates with mean 1 and variance λ^{−1} e^{−x_j^T β}, then Y_j would have a negative binomial distribution, with mean m_j of (6.102) and variance m_j(1 + λ^{−1}). The score function system for β would be complicated, but for large λ would be approximately equal to the simple estimating function system

S_β = Σ_{j=1}^N (Y_j − m_j(β)) x_j,    (6.103)

which has the same form as the score function system for the Poisson model. More importantly, the system in (6.103) is unbiased. Its covariance matrix is

Σ_{j=1}^N m_j(1 + λ^{−1}) x_j x_j^T,

which can be estimated by

Σ_{j=1}^N (Y_j − m_j(β̂))² x_j x_j^T,

or by

Σ_{j=1}^N m_j(β̂)(1 + λ̂^{−1}) x_j x_j^T,

where

(1 + λ̂^{−1}) = Σ_{j=1}^N (Y_j − m_j(β̂))² / Σ_{j=1}^N m_j(β̂),

depending on the purpose. We would thus have specified for Y_j a marginal loglinear model which is of Poisson type, except for constant overdispersion in the variance structure.
Alternatively, we might specify that the α_j have mean 1 and some constant variance τ. Then the marginal mean would be m_j = m_j(β) = e^{x_j^T β}, but the marginal variance of Y_j would change to m_j(1 + τ m_j). Thus the overdispersion would be relatively greater for larger values of m_j. Uncertainty estimation for this case when the α_j are constant within clusters is treated in Section 6.4.10.
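A sketch of the constant-overdispersion fit: solve (6.103) by Newton's method, then form the moment estimate of the overdispersion factor as above (function name illustrative; the Poisson information Σ m_j x_j x_j^T serves as the derivative matrix):

import numpy as np

def poisson_overdispersed(X, y, n_iter=50, tol=1e-10):
    """Solve S_beta = sum (y_j - m_j) x_j = 0 of (6.103) for
    m_j = exp(x_j' beta), then estimate 1 + 1/lambda by moments."""
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        m = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (m[:, None] * X), X.T @ (y - m))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    m = np.exp(X @ beta)
    overdisp = np.sum((y - m) ** 2) / np.sum(m)   # estimate of 1 + 1/lambda
    cov_S = overdisp * (X.T @ (m[:, None] * X))   # estimated Cov(S_beta)
    return beta, overdisp, cov_S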

6.4.9 The marginal linear exponential model


Another approach to extending generalized linear models to survey populations is to assume that the marginal distribution of the Y_j has a linear exponential form. Thus the marginal density of Y_j is given by (6.70) with ψ_j = f(x_j^T β). The Y_j may be thought to be dependent because of clustering or spatial correlation.
Before proceeding, let us consider briefly the relationship of these
models with those of Section 6.4.7. Except in the standard Gaussian
case, the generalized linear mixed model implies a marginal model
which is not generally of linear exponential form, and thus the two
approaches to modelling are not precisely compatible. Moreover, the
β parameters in the two models will have differing values and differing
meanings. For example, in the case of logistic regression with a single
explanatory variate, a generalized linear mixed model might specify the

conditional expectation of Y_j (= 1 or 0) given random effects α as

e^{α_j + β_0 + β_1 x_j} / (1 + e^{α_j + β_0 + β_1 x_j}).    (6.104)

The marginal expectation of Y_j is not logistic in form, but it is shown by Neuhaus et al. (1991) that it can be approximated for small β_1 by

e^{β_0 + β_1* x_j} / (1 + e^{β_0 + β_1* x_j}),    (6.105)

where

β_1* = β_1 (1 − Var(p)/[E(p)E(1 − p)])    (6.106)

and

p = e^{α_j + β_0} / (1 + e^{α_j + β_0}).

Thus if we fit to data coming from (6.104) a model with logistic regression marginal expectations as in (6.105), the coefficient of x_j will approximate β_1*, which is 'attenuated' in comparison with β_1.
If in the mixed model the α_j are constant within clusters, (6.106) is expressible as

β_1* = β_1 (1 − ρ(0)),

where ρ(0) is the intracluster correlation among the Y_j when β_1 is 0.
Which of the two approaches to use will depend on the purpose of modelling. In the logistic regression example first discussed, if we want to express the dependence of the response probabilities on x with α held constant, we would be interested in β_1 of (6.104). On the other hand, if we want to be able to express the way in which the response of a randomly selected member of the population would depend on the x variate, we would be interested in estimating the corresponding marginal expectation parameter, which would be closer to β_1* of (6.106). Neuhaus et al. (1991) have distinguished the two purposes as the 'cluster-specific' and 'population-averaged' approaches. Holt (1989) has referred to a similar distinction in terms of 'disaggregation' and 'aggregation'.
Let us proceed to look at an analysis when a marginal linear exponential dispersion model is the basis of parametrization. We will now denote the marginal regression parameter by β*, to distinguish it from the β of Section 6.4.7. If the marginal density were given by (6.70) with ψ_j = x_j^T β*, and if the Y_j were independent, the score function system for β* would be given by

S_{β*} = Σ_{j=1}^N [(Y_j − μ_j)/(φ a_j)] x_j,    (6.107)

where μ_j = μ_j(β*) = h(x_j^T β*) = h(ψ_j) is the mean of Y_j.
Suppose instead that the population is such that although the marginal mean of Y_j is still μ_j = h(x_j^T β*), the Y_j are correlated. The system (6.107) would still be unbiased for β*, and we could write it as

S_{β*} = X^T g(β*) = 0,    (6.108)

where g(β*) is the vector of the elements

g_j(β*) = (Y_j − μ_j)/(φ a_j).

However, if the covariance structure of the Y_j were known or could be approximated, we could construct a more efficient estimating function system using (6.37). In fact, the optimal combination of the g_j(β*) for estimating β* would be of the form

Φ*(β*) = X^T D_1 W^{−1} g(β*) = 0,    (6.109)

where D_1 = diag(h′(ψ_j)/φ a_j) and W is the covariance matrix of g(β*). The covariance matrix of Φ*(β*) is X^T D_1 W^{−1} D_1 X. The formation of confidence regions for β* will be discussed in Section 6.4.10 for clustered populations.

6.4.10 Estimation when the population is clustered


We now consider estimation of β through the joint system (6.92) and (6.94) or the marginal moment system (6.100), and the estimation of β* through (6.109), when the population is clustered. That is, the population is divided into a large number of disjoint clusters, and the Y_j are independent from cluster to cluster, but correlated within clusters.
In the case of the generalized linear mixed model of Section 6.4.7, we suppose that the random effects are essentially cluster effects. Thus if α(r) represents the random effect for the rth cluster, we can write

ψ_j = x_j^T β + α(r),  j ∈ B_r.    (6.110)

In the special case where the α(r) are independent and identically distributed, the covariance matrix Σ of (6.91) will be given by σ² I, where σ² is the common variance of the α(r) and I is the L×L identity matrix. The N×L matrix Z will have as (j, r)th element 1 if j belongs to the rth cluster and 0 otherwise.
THE GENERALIZED LINEAR MODEL FOR SURVEY POPULATIONS 231

The joint system (6.92) and (6.94) for estimation of the α(r) and β reduces to the following:

Φ_{2r}(β, α(r)) = −Σ_{j∈B_r} (Y_j − μ_j)/(φ a_j) + α(r)/σ² = 0,    (6.111)
r = 1, ..., L;

Φ_1*(β, α) = Σ_{r=1}^L Φ_{1r}*(β, α) = 0,    (6.112)

where

Φ_{1r}*(β, α) = Σ_{j∈B_r} [(Y_j − μ_j)/(φ a_j)] x_j − Σ_{k∈B_r} x_k d_{1k} v(r) Φ_{2r}(β, α(r)),    (6.113)

d_{1k} = E h′(ψ_k)/(φ a_k),  v(r) = 1/(σ^{−2} + Σ_{j∈B_r} d_{1j}).

Under suitable regularity conditions and knowledge of α and β, a consistent estimator of the covariance matrix of Φ_1*(β, α) as the number of clusters approaches infinity would be

J(β, α) = Σ_{r=1}^L Φ_{1r}*(β, α) Φ_{1r}*(β, α)^T.    (6.114)

Since Φ_1* of (6.94) is defined so as to be as little dependent on α as possible, a reasonable way of defining confidence regions for β might be based on inverting

Φ_1*(β, α_0)^T J^{−1}(β, α_0) Φ_1*(β, α_0) ≈ χ²_(p),    (6.115)

where α̂ or some 'smoother' estimate of α is used for the fixed value α_0.
Turning to the marginal moment analysis, the system (6.100) can be written

Φ_m*(β) = Σ_{r=1}^L X_r^T D_{1r} W_r^{−1} η_r = 0,    (6.116)

where η_r is the vector of the elements of g_m(β) for j in the rth cluster, and the matrices X_r^T, D_{1r} and W_r^{−1} are the submatrices of X^T, D_1 and W^{−1} determined by the rth cluster.

The covariance matrix of Φ_m*(β) takes the form

Σ_{r=1}^L X_r^T D_{1r} W_r^{−1} E(η_r η_r^T) W_r^{−1} D_{1r} X_r.    (6.117)

As pointed out by Liang and Zeger (1986) in the longitudinal data analysis context, under suitable regularity conditions a consistent estimator of this as the number of clusters approaches infinity would be

J_m* = Σ_{r=1}^L X_r^T D_{1r} W_r^{−1} η_r η_r^T W_r^{−1} D_{1r} X_r.    (6.118)

In the absence of knowledge of the precise covariance structure or its parameters, an approximating 'working' covariance structure could be assumed, specifying D_1 and W, and confidence regions for β based on

Φ_m*(β)^T (J_m*)^{−1} Φ_m*(β) ≈ χ²_(p)    (6.119)

(see Liang and Zeger, 1986).
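A common special case takes the working covariance structure to be independence, so that W_r = D_{1r} and the estimating function reduces to X^T(y − μ); clustering then enters only through the robust estimator (6.118). A sketch for a marginal logistic mean (names illustrative, not from the text):

import numpy as np

def cluster_robust_logistic(X, y, cluster, n_iter=25, tol=1e-10):
    """Working-independence version of (6.116) for a marginal logistic
    model, with the cluster-robust covariance estimator of (6.118)."""
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):                       # solve X'(y - mu) = 0
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    resid = y - mu
    # robust 'meat': outer products of the per-cluster contributions
    Jstar = np.zeros((p, p))
    for r in np.unique(cluster):
        u_r = X[cluster == r].T @ resid[cluster == r]
        Jstar += np.outer(u_r, u_r)
    H = X.T @ ((mu * (1.0 - mu))[:, None] * X)    # model-based 'bread'
    Hinv = np.linalg.inv(H)
    cov_beta = Hinv @ Jstar @ Hinv                # sandwich covariance
    return beta, cov_beta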
Similarly, the system (6.109) for the marginal linear exponential model can be written

Φ*(β*) = Σ_{r=1}^L X_r^T D_{1r} W_r^{−1} η_r = 0,    (6.120)

where η_r is the vector of the elements of g(β*) for j in the rth cluster, and D_{1r} and W_r^{−1} are redefined in the context of the marginal model. Confidence regions for β* would be based on

Φ*(β*)^T (J*)^{−1} Φ*(β*) ≈ χ²_(p),    (6.121)

with

J* = Σ_{r=1}^L X_r^T D_{1r} W_r^{−1} η_r η_r^T W_r^{−1} D_{1r} X_r.    (6.122)
To illustrate the marginal moment technique in another setting, let us suppose that the true model is that for j in the rth cluster, conditional on α(r), Y_j is Poisson with mean α(r) e^{x_j^T β}, and that the α(r) are independent with E α(r) = 1, Var(α(r)) = τ. Then we have g_j(β) = Y_j − m_j, where m_j = e^{x_j^T β}, with Var{g_j(β)} = m_j(1 + τ m_j) and Cov(g_j(β), g_{j′}(β)) = τ m_j m_{j′} if j, j′ are distinct units in the rth cluster. It can be shown that

Φ_m*(β) = Σ_{r=1}^L X_r^T D_{1r} W_r^{−1} η_r
        = Σ_{r=1}^L (1 + τ Σ_{j∈B_r} m_j)^{−1} Σ_{j∈B_r} (Y_j − m_j) x_j.    (6.123)

If τ is unknown, the corresponding system with an approximate or estimated value of τ could be solved for β, and used to define confidence regions.

6.4.11 Estimation of random effects and conditional means


In Section 6.4.7 we concentrated on the estimation of β in the generalized linear mixed model. However, there are contexts in sampling when estimation of functions of the random effects α is also important. The simplest example is the estimation of a superpopulation mean which is thought of as the average of a mean function

θ = Σ_{j=1}^N θ_j / N,

where, conditional on the θ_j, the response values Y_j are thought of as independent, with E(Y_j | θ_j) = θ_j. This problem was treated from one standpoint in Section 6.2.6, where its connection with response error models was noted. Here, however, suppose the θ_j are modelled in such a way as to reflect dependence on covariates and heterogeneity or spatial dependences. Typically, a model for θ_j might be given as in Section 6.4.7, with

θ_j = μ_j(β, α) = h(x_j^T β + z_j^T α).    (6.124)

Then population- or sample-level estimates of θ could take the form

θ̂ = Σ_{j=1}^N h(x_j^T β̂ + z_j^T α̂)/N.    (6.125)

There are other applications for estimating a mean function. An ob-


vious one would be for use in model-assisted inference for a population
total, as suggested in Section 5.13. Closely related would be its use in
predictive inference for small-area totals and means, as a way of 'bor-
rowing strength' from estimation in other parts of the population for
the small area in question. And in a certain sense the estimation of θ_j, j = 1, ..., N, represents a smoothing of the response variate values which may reveal important patterns. The paper of Ghosh and Rao
(1994) reviews the history of what might be called local estimation in
surveys, and outlines several applications in detail. We will specialize
to a particular small-area model for purposes of illustration here.
Suppose the population is divided into L small areas, indexed by i. Let N_i be the size of the ith area. For simplicity, suppose that the mean function is constant within areas. Reflecting this in some new notation, for unit j belonging to the ith area we have

x_j = x_(i);  a_j = a_(i);  ψ_j = ψ(i) = x_(i)^T β + α(i);  μ_j = h(ψ_j) = h(ψ(i)).

Conditional on α(i), i = 1, ..., L, the Y_j will be regarded as independent, with

E(Y_j | α) = h(ψ(i)),    (6.126)
Var(Y_j | α) = φ a_(i) h′(ψ(i))

for j in the ith area. The random effects will be assumed to satisfy

E(α(i)) = 0, i = 1, ..., L;

and if α is the vector of the α(i),

Cov(α) = Σ,    (6.127)

where Σ is a non-singular L×L matrix, possibly diagonal. Let ψ be the vector of the ψ(i), and suppose our interest is in estimating ψ, which is essentially the mean function in this application. Thus it is helpful to transform the parameters of the conditional model from (β, α) to (β, ψ).
Elementary unbiased estimating functions are

Y_j − μ_j, j = 1, ..., N, and ψ(i) − x_(i)^T β, i = 1, ..., L.

For notational convenience, as before, we define an N×1 vector g to have jth component

g_j = (Y_j − μ_j)/(φ a_(i))    (6.128)

for j in the ith area. Note that this modified elementary estimating function is a function only of ψ in the new parametrization. We may then write the system which is 'optimal' for estimating (β, ψ) as

Φ_1(β, ψ) = Z^T g − Σ^{−1}(ψ − AXβ) = 0,
Φ_2(β, ψ) = −X^T A^T Σ^{−1}(ψ − AXβ) = 0.    (6.129)

Here A is an L×N matrix which averages the (identical) rows of X which correspond to each area; Z^T is an L×N matrix which sums elements of g corresponding to each area.
In the special case where h is the identity, Z^T g is Δ^{−1}(Ȳ − ψ), where Δ = diag(φ a_(i)/N_i) and the ith element of Ȳ is the average of Y_j over the ith area. Solving (6.129) gives

β̂ = [X^T A^T V^{−1} A X]^{−1} X^T A^T V^{−1} Ȳ,    (6.130)

where V^{−1} = (Δ + Σ)^{−1}, and

ψ̂ = (Δ^{−1} + Σ^{−1})^{−1} Δ^{−1} Ȳ + (Δ^{−1} + Σ^{−1})^{−1} Σ^{−1} A X β̂.    (6.131)

This estimator of ψ coincides with the best linear unbiased population-level 'predictor' of ψ, and is composite in the sense of being a combination of a direct estimator Ȳ and a regression estimator A X β̂. Thus, provided Δ and Σ are known, we have solved the problem of estimating the mean function.
Since ψ = AXβ + α, the covariance matrix of the error ψ̂ − ψ can be shown to be

(Δ^{−1} + Σ^{−1})^{−1} + (Δ^{−1} + Σ^{−1})^{−1} Σ^{−1} A X Cov(β̂) X^T A^T Σ^{−1} (Δ^{−1} + Σ^{−1})^{−1}.    (6.132)
If the variance components Δ and/or Σ are not known, the problem of estimation of ψ, particularly interval estimation, becomes harder. Suppose we specialize further to the case where Σ = σ_a² I_{L×L}. Then the ith component of ψ̂ of (6.131) has the representation

ψ̂(i) = γ_i Ȳ_i + (1 − γ_i) x_(i)^T β̂,    (6.133)

where

γ_i = σ_a² / (φ a_(i)/N_i + σ_a²)

can be regarded as a measure of intra-area correlation. The mean squared error is

E(ψ̂(i) − ψ(i))² = γ_i (φ a_(i)/N_i) + (1 − γ_i)² K_i,    (6.134)

where K_i = x_(i)^T [Σ_{l=1}^L x_(l) x_(l)^T/(φ a_(l)/N_l + σ_a²)]^{−1} x_(i).

Both the estimator (6.133) and its mean squared error (6.134) depend on the unknown elements γ_i. Several ways of dealing with the situation

have been proposed, in the context of small-area estimation (Ghosh and Rao, 1994).
The empirical best linear unbiased predictor (EBLUP) method replaces the variance parameters in (6.133) by consistent estimators if required. Then the mean squared error in (6.134) needs to be corrected to take into account the additional uncertainty introduced by the use of these estimates. The estimate of ψ(i) becomes

ψ̂(i) = γ̂_i Ȳ_i + (1 − γ̂_i) x_(i)^T β̂.    (6.135)

Assuming φ to be known and writing the right-hand side of (6.134) as g_{1i}(σ_a²) + g_{2i}(σ_a²), it can be shown (Prasad and Rao, 1990; Singh, 1994) that

E(ψ̂(i) − ψ(i))² = g_{1i}(σ_a²) + g_{2i}(σ_a²) + g_{3i}(σ_a²) + o(L^{−1}),    (6.136)

where g_{3i}(σ_a²) = (1 − γ_i)²(σ_a² + φ a_(i)/N_i)^{−1} V̄(σ̂_a²) and V̄(σ̂_a²) is the asymptotic variance of σ̂_a². The first term in (6.136) is of order O(1), while the others are of order O(L^{−1}). If we estimate the mean squared error in (6.136) by substituting σ̂_a² for σ_a², we obtain an estimate with a bias of O(L^{−1}) because E(g_{1i}(σ̂_a²) − g_{1i}(σ_a²)) is of this order. To correct for this bias, Prasad and Rao (1990) suggested the corrected estimator

MSE(ψ̂(i)) = g_{1i}(σ̂_a²) + g_{2i}(σ̂_a²) + 2g_{3i}(σ̂_a²).    (6.137)

Other possible corrections have been discussed by Singh et al. (1993).
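A sketch of the composite estimator, taking the area-level direct means, their sampling variances and an externally supplied estimate of σ_a² as inputs; it returns (6.133)/(6.135) together with the leading MSE terms g_{1i} + g_{2i} of (6.134), omitting the g_{3i} correction (all names illustrative):

import numpy as np

def eblup_area_means(ybar, psi, Xa, sigma_a2):
    """Composite small-area estimates as in (6.133)/(6.135).
    ybar : (L,) direct area means;  psi : (L,) sampling variances
    phi a_(i)/N_i;  Xa : (L, p) area covariates x_(i);  sigma_a2 : an
    externally estimated area-effect variance."""
    v = psi + sigma_a2                        # marginal variances of ybar
    XtVinv = Xa.T / v                         # Xa' V^{-1}
    beta = np.linalg.solve(XtVinv @ Xa, XtVinv @ ybar)   # GLS as in (6.130)
    gamma = sigma_a2 / v                      # gamma_i of (6.133)
    psi_hat = gamma * ybar + (1.0 - gamma) * (Xa @ beta)
    # leading MSE terms g1 + g2 of (6.134)
    K = np.einsum('ip,pq,iq->i', Xa, np.linalg.inv(XtVinv @ Xa), Xa)
    mse_naive = gamma * psi + (1.0 - gamma) ** 2 * K
    return psi_hat, beta, mse_naive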
For the general class of mean function estimation problems, alternatives to this frequentist approach are provided by a Bayesian framework, wherein appropriate prior distributions are assumed for β and the variance parameters. Then β, α and ψ all have similar status, and inferences about them come from their posterior distributions, given the data. If this program is carried out using a full hierarchical Bayes approach, the posterior distributions require significant computational power for their calculation, but this is less and less of an impediment to implementing the method. It has been shown in empirical studies (Ghosh and Rao, 1994; Singh et al., 1993) that the hierarchical Bayes approach and EBLUP properly corrected give comparable results.

6.5 Sampling and the generalized linear model


There are at least three ways of approaching the problem of estimating
the parameters of a generalized linear model from a sample. In outlining
the three ways, we will assume the sample has come from a probability

sampling design, with design probabilities p(s) which do not depend


on any of the variates in the model. The design need not have equal
inclusion probabilities, and may be single-stage or clustered.
In the first approach, we begin by imagining a meaningful model at the population level, as discussed in detail in Section 6.4. This will imply a model for the sample observations, much like the population-level model, but possibly approximable by a model with simpler structure if the sample is well dispersed throughout the population. (See also the examples of Section 6.1.) Inference about the model parameters of interest, say θ (p×1), would then proceed as in Section 6.4, with robust likelihood or estimating function based methods, but this time using sample data rather than population data. The sampling design would not enter the analysis explicitly. However, symmetries in the randomization would tend to support the adoption of the simplified model structure, and in this manner influence the interpretation of the parameters θ.
The second approach follows the path we saw earlier leading to (6.3) and (6.13). It has the same beginning as the first approach, namely a model at the population level, but instead of proceeding to a sample-level model, focuses attention on the population-level system of estimating equations for the model parameters. The solution of these equations is viewed not only as an estimate of the model parameters θ but also as a vector θ_N of finite population quantities of possibly independent interest. As a solution of population-level estimating equations, θ_N can be estimated using design-unbiased sample-level estimating equations as in Section 4.1. Thus the sample provides estimates θ̂_s of θ through approximately unbiased estimation of θ_N. If the population-level estimating equations are truly model-unbiased and efficient, θ̂_s should estimate θ well. If not, θ̂_s will still provide reasonably trustworthy estimation of θ_N. In this approach, the design probabilities will enter the analysis through the requirement for design unbiasedness of the sample-level estimating functions. They may also play a role in constructing robust measures of uncertainty for θ̂_s.
Unlike the first and second approaches, the third approach is one we
have not discussed earlier. The first and second approaches require some
knowledge of the population structure and the corresponding sample
structure. In particular, if sampling is clustered or conducted in stages,
we need to know which sampled units are from the same clusters or the
same PSUs in order to be able to carry out either analysis. Such infor-
mation is often absent from survey microdata files. With this difficulty
and the limitations of software packages, it may happen that the only

analysis available assumes a single-stage design with well-dispersed


samples. Confidence regions and tests of hypotheses in the output will
be misleading without adjustment. Still, we will see in the third ap-
proach that it is sometimes possible to provide validating adjustments
which are fairly simple, based on the notion of 'design effects'.
Sections 6.5.1, 6.5.2 and 6.5.3 will give more details of the three approaches in the context of logistic regression. We will assume a two-stage population, sampled in a variety of ways. That is, we assume the population is made up of clusters or PSUs B_r, r = 1, ..., L, and that we sample elementary units either directly or in stages. The response Y_j is binary, and with each unit j is associated a covariate vector x_j; moreover, a random effect α(r) is present in the rth PSU. The model specifies

P(Y_j = 1 | α(r)) = e^{ψ_j}/(1 + e^{ψ_j}),    (6.138)

where ψ_j = x_j^T β + α(r), j ∈ B_r. Thus

E(Y_j | α(r)) = μ_j(β, α) = h(ψ_j) = b′(ψ_j) = e^{ψ_j}/(1 + e^{ψ_j});    (6.139)
Var(Y_j | α(r)) = h′(ψ_j) = e^{ψ_j}/(1 + e^{ψ_j})²;

and ψ_j is of the form (6.87) if we take z_{jr} to be 1 for j ∈ B_r and 0 otherwise. If α denotes the L×1 vector of the random effects, we will assume that

E α = 0,  Cov(α) = σ² I,    (6.140)

and that σ² is unknown.
As indicated, we will consider a variety of sampling schemes with
design probabilities independent of the parameters and the variates.

6.5.1 Estimation based on the model at the sample level

Sample is dispersed
Suppose first that the sample is taken by simple random sampling, or some other scheme which leads to samples which are dispersed, in the sense that their units may be assumed to belong to different PSUs. Then the Y_j will be independent, with

E(Y_j) = m_j(β, σ²) = E(μ_j(β, α))    (6.141)

and variance

Var(Y_j) = m_j(β, σ²)(1 − m_j(β, σ²)).    (6.142)

If the form of m_j(β, σ²) is known, then in principle a system of combinations of Y_j − m_j(β, σ²) can be used to estimate β and σ². However, it would be simpler to reparametrize the problem, and use as an approximation to m_j(β, σ²) the function

μ_j(β*) = e^{x_j^T β*} / (1 + e^{x_j^T β*});    (6.143)

here β* is the analogue of β_0* and β_1* in (6.105). Using for sample-based estimating functions the same notation as for the population-based estimating functions in Section 6.4, we can write down the score function system

S_{β*} = Σ_{j∈s} (Y_j − μ_j(β*)) x_j = 0.    (6.144)

The point estimate β̂* will be the solution of (6.144). Confidence regions for β* could be based on

S_{β*}^T J_{β*}^{−1} S_{β*} ≈ χ²_(p)    (6.145)

or

S_{β*}^T J_{β̂*}^{−1} S_{β*} ≈ χ²_(p),    (6.146)

where

J_{β*} = Σ_{j∈s} (Y_j − μ_j(β*))² x_j x_j^T.    (6.147)

Alternatively, the approximate normality of β̂* could be used along with the fact that the model covariance matrix of β̂* has the approximation

Cov(β̂*) ≈ [ Σ_{j∈s} (e^{x_j^T β*} / (1 + e^{x_j^T β*})²) x_j x_j^T ]^{−1}.    (6.148)

Sample is two-stage
Now suppose that the sample has been taken in two stages, corresponding to the structure in the population. Let s_B be the first-stage sample of PSU labels, and for each r in s_B let s_r be the sample taken from B_r. Now the sampled Y_j are no longer independent unconditionally.
The approach through joint estimation of β and α can be applied as in Section 6.4.7. If α were known, the conditional score function system for β would be

Φ_1(β, α) = Σ_{r∈s_B} Σ_{j∈s_r} (Y_j − μ_j(β, α)) x_j = 0.    (6.149)

The sample analogue of Φ_{2r}(β, α) of (6.92) has as rth component

Φ_{2r}(β, α(r)) = −Σ_{j∈s_r} (Y_j − μ_j(β, α)) + α(r)/σ² = 0,    (6.150)

and we find the estimate β̂ of β by solving (6.149) and (6.150) simultaneously for α(r) and β, using a known or estimated value for σ².
Correcting Φ_1(β, α) to diminish its dependence on α gives (as in (6.94))

Φ_1*(β, α, σ²) = Σ_{r∈s_B} Σ_{j∈s_r} (Y_j − μ_j(β, α))(x_j − B_r(β, σ²))
               + Σ_{r∈s_B} (α(r)/σ²) B_r(β, σ²),    (6.151)

where

B_r(β, σ²) = (σ^{−2} + Σ_{k∈s_r} E h′(ψ_k))^{−1} Σ_{k∈s_r} E h′(ψ_k) x_k.    (6.152)

Note that for k ∈ s_r, E h′(ψ_k) = Q(x_k^T β), where

Q(x) = ∫ [e^{x+a}/(1 + e^{x+a})²] f(a) da

and f(a) is the p.d.f. of α(r). The covariance matrix of Φ_1* is

V_{β,σ²} = Cov_{β,σ²}(Φ_1*) = Σ_{r∈s_B} Σ_{j∈s_r} E h′(ψ_j) x_j x_j^T − Σ_{r∈s_B} (σ^{−2} + Σ_{k∈s_r} E h′(ψ_k)) B_r B_r^T.    (6.153)

If σ̂² is a consistent estimator of σ², then confidence regions for β might be based as in (6.96) on

Φ_1*(β, α_0, σ̂²)^T V_{β,σ̂²}^{−1} Φ_1*(β, α_0, σ̂²) ≈ χ²_(p),    (6.154)

where α_0 is a plausible value of α. A suitable value could be α̂, obtained by solving the system implied in (6.149) and (6.150), or some 'smoother' estimate of α. Alternatively V_{β,σ̂²} might be replaced in (6.154) by J_0(β, σ̂²) or J_0(β̂, σ̂²), where

J_0(β, σ²) = Σ_{r∈s_B} Φ_{1r}*(β, α_0, σ²) Φ_{1r}*(β, α_0, σ²)^T    (6.155)

and J_0(β̂, σ̂²) is the corresponding estimator with β̂ substituted for β.
Approaching the generalized linear mixed model through marginal moments would yield inferences of very similar form. The estimate β̂ of β would be obtained as in (6.100) by solving

Φ_m*(β, σ̂²) = Σ_{r∈s_B} Σ_{j∈s_r} (Y_j − m_j(β, σ̂²))(x_j − B_r(β, σ̂²)) = 0,    (6.156)

where σ̂² is a consistent estimate of σ². To the degree that the covariance matrix of Φ_m* is approximately V_{β,σ²} of (6.153), confidence regions for β could be based on

Φ_m*(β, σ̂²)^T V_{β,σ̂²}^{−1} Φ_m*(β, σ̂²) ≈ χ²_(p);    (6.157)

for greater robustness, V_{β,σ̂²} might be replaced by J_m(β, σ̂²) or J_m(β̂, σ̂²), where

J_m(β, σ²) = Σ_{r∈s_B} Σ_{j∈s_r} (Y_j − m_j(β, σ²))²(x_j − B_r)(x_j − B_r)^T.    (6.158)

Here again, it might be considered simpler to use the approach of Section 6.4.9, and estimate β* in an approximating model

E(Y_j) = m_j(β, σ²) ≈ e^{x_j^T β*}/(1 + e^{x_j^T β*}) = μ_j(β*).    (6.159)

As indicated in Section 6.4.9, the parameter β* measures effects which are 'population-averaged' rather than 'cluster-specific'.
In the approximating model, the variance of Y_j is

Var(Y_j) = V_j(β*) = μ_j(β*)(1 − μ_j(β*)).    (6.160)

The Y_j and Y_k are uncorrelated if j and k are in different PSUs. When j and k are both in B_r, Y_j and Y_k are correlated. If our two-stage model holds, with random effect variance σ² relatively small, we may look to the form of (6.156), and use as estimating system for β*

Φ*(β*, σ̂²) = Σ_{r∈s_B} Σ_{j∈s_r} (Y_j − μ_j(β*))(x_j − B_r(β*, σ̂²)) = 0,    (6.161)

where

B_r(β*, σ²) = (σ^{−2} + Σ_{k∈s_r} V_k(β*))^{−1} Σ_{k∈s_r} V_k(β*) x_k    (6.162)

and σ̂² is an estimated or assumed value for σ². Confidence regions for β* would be generated from

Φ*(β*, σ̂²)^T J^{−1}(β̂*, σ̂²) Φ*(β*, σ̂²) ≈ χ²_(p)    (6.163)

with

J(β*, σ²) = Σ_{r∈s_B} Σ_{j∈s_r} (Y_j − μ_j(β*))²(x_j − B_r)(x_j − B_r)^T.    (6.164)

An estimated covariance matrix for β̂* would be given by

H^{−1}(β̂*, σ̂²) J(β̂*, σ̂²) [H^T(β̂*, σ̂²)]^{−1},

where H is the matrix of expected derivatives of Φ* with respect to β*, that is

H(β*, σ²) = −Σ_{r∈s_B} Σ_{j∈s_r} V_j(β*)(x_j − B_r(β*, σ²)) x_j^T.    (6.165)

6.5.2 Estimation through a pseudolikelihood


Sample is dispersed
Suppose first that the sample is taken by a single-stage sampling design, such that the samples are dispersed. Inclusion probabilities π_j are allowed to vary from unit to unit. Because the samples are dispersed, then under the model the Y_j are independent with

E(Y_j) = m_j(β, σ²) = E(μ_j(β, α))

and variance

Var(Y_j) = m_j(β, σ²)(1 − m_j(β, σ²)).

Again, an appealing option is to take the linear model of Section 6.4.9 as an approximation, so that the population-level estimating function system is

S_{β*} = Σ_{j=1}^N (Y_j − μ_j(β*)) x_j = 0.    (6.166)

The solution of this system is a finite population parameter β*_N, which is an estimate of the superpopulation parameter β* in the approximating model. A sample s is obtained by a probability sampling scheme with inclusion probabilities π_j, and the sample pseudoscore estimating function system

Ŝ_{β*} = Σ_{j∈s} (Y_j − μ_j(β*)) x_j / π_j = 0    (6.167)

is unbiased in the sense that E_p Ŝ_{β*} = S_{β*}. The solution β̂*_s of Ŝ_{β*} = 0 is an asymptotically unbiased estimator for β*_N, and hence an estimator of β*. Using (6.167) instead of an unweighted system is often justified on the basis that if the model is not correct, β̂*_s is still of use as an estimator of the descriptive parameter β*_N, regardless of its role in analytic inference.
If the model is correct, clearly E Ŝ_{β*} = 0. Because the sample is dispersed, the model covariance matrix of Ŝ_{β*} is

E(Ŝ_{β*} Ŝ_{β*}^T) = Σ_{j∈s} (1/π_j²) [e^{x_j^T β*}/(1 + e^{x_j^T β*})²] x_j x_j^T.    (6.168)

An approximate covariance matrix for β̂*_s is

H^{−1}(β*) E(Ŝ_{β*} Ŝ_{β*}^T) [H^T(β*)]^{−1},    (6.169)

where

H(β*) = −Σ_{j∈s} (1/π_j) [e^{x_j^T β*}/(1 + e^{x_j^T β*})²] x_j x_j^T.    (6.170)

On the assumption that the model is correct, the most relevant measure of uncertainty in β̂*_s is (6.169) evaluated at β̂*_s, which is approximately E-unbiased for E((β̂*_s − β*)(β̂*_s − β*)^T). It may be noted, however, that (6.169) is also E_pE-unbiased for E_pE((β̂*_s − β*)(β̂*_s − β*)^T) = E E_p((β̂*_s − β*)(β̂*_s − β*)^T).
On the other hand, if the model is not correct, we might be more interested in estimating E_p((β̂*_s − β*_N)(β̂*_s − β*_N)^T), which will also be an approximately unbiased estimator of E E_p((β̂*_s − β*)(β̂*_s − β*)^T) if the model is correct. This will be provided by

H^{−1}(β̂*_s) v(Ŝ_{β*}) [H^T(β̂*_s)]^{−1},    (6.171)

where v(Ŝ_{β*}) is a design-unbiased estimator for the design covariance matrix of Ŝ_{β*} (see Section 4.1).

In some situations, the two approaches to uncertainty estimation will agree. For example, we might estimate E(Ŝ_{β*} Ŝ_{β*}^T) of (6.169) in a robust manner by

Ĵ_{β*} = Σ_{j∈s} (1/π_j²)(Y_j − μ_j(β*))² x_j x_j^T.    (6.172)

If the sampling design is of single stage, and approximately with replacement with constant draw probabilities, and N is much larger than the sample size n, we can use (6.171) with

v(Ŝ_{β*}) = [n/(n − 1)] Σ_{j∈s} ((Y_j − μ_j(β*)) x_j/π_j − M_s)((Y_j − μ_j(β*)) x_j/π_j − M_s)^T,    (6.173)

where M_s is the sample average of ((Y_j − μ_j(β*))/π_j) x_j. Confidence intervals for β* can be based on taking

Ŝ_{β*}^T Ĵ_{β̂*}^{−1} Ŝ_{β*} ≈ χ²_(p),    (6.174)

which is close to

Ŝ_{β*}^T [v(Ŝ_{β*})]^{−1} Ŝ_{β*} ≈ χ²_(p)

since M_s is 0 at β* = β̂*_s.
A detailed treatment of the kind of analysis in this section has been given by Roberts et al. (1987).
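The weighted analysis is easy to carry out directly. The sketch below solves (6.167) by Newton's method and forms the sandwich (6.171) with the with-replacement variance estimator (6.173); names are illustrative, not from the text.

import numpy as np

def pseudoscore_logistic(X, y, pi, n_iter=25, tol=1e-10):
    """Design-weighted pseudoscore estimation (6.167) for a marginal
    logistic model, with variance estimator (6.173) and sandwich (6.171)."""
    n, p = X.shape
    w = 1.0 / pi                                  # design weights 1/pi_j
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (w * (y - mu))              # S-hat of (6.167)
        H = X.T @ ((w * mu * (1 - mu))[:, None] * X)   # minus (6.170)
        step = np.linalg.solve(H, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    zmat = (w * (y - mu))[:, None] * X            # terms (y_j - mu_j) x_j/pi_j
    zc = zmat - zmat.mean(axis=0)                 # M_s is ~0 at the solution
    vS = (n / (n - 1.0)) * (zc.T @ zc)            # v(S-hat) of (6.173)
    Hinv = np.linalg.inv(X.T @ ((w * mu * (1 - mu))[:, None] * X))
    cov_beta = Hinv @ vS @ Hinv                   # sandwich (6.171)
    return beta, cov_beta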

Sample is two-stage
Suppose that random effects are present, and that they are essentially cluster effects. Thus in the model (6.138)-(6.140), σ² is non-zero. The sampling design is two-stage, with PSUs coinciding with the clusters. At the population level, we have seen two approaches to the estimation of β, one through joint estimation of β and α, and one through a system for marginal moments. Each of these has a pseudolikelihood analogue at the population level.
In the first approach, an estimating function system which corresponds to the population-level system (6.88) and (6.92) is

Φ̂_1(β, α) = Σ_{r∈s_B} (1/Π_r) Σ_{j∈s_r} (Y_j − μ_j(β, α)) x_j / π_{j|r} = 0    (6.175)

and

Φ̂_{2r}(β, α(r)) = −Σ_{j∈s_r} (Y_j − μ_j(β, α))/π_{j|r} + α(r)/σ² = 0,    (6.176)

where Π_r and π_{j|r} are first- and second-stage inclusion probabilities respectively. The estimating function Φ̂_1 of (6.175) estimates Φ_1 of (6.88). The estimating function Φ̂_{2r} of (6.176) is a suitably chosen combination of its first part and α(r). If simple random sampling is used at both stages, (6.175) and (6.176) reduce to (6.149) and (6.150) respectively.
The estimating function corresponding to the corrected system Φ_1* of (6.94) would be

Φ̂_1*(β, α, σ²) = Σ_{r∈s_B} (1/Π_r) Σ_{j∈s_r} [(Y_j − μ_j(β, α))/π_{j|r}] (x_j − B_{πr}(β, σ²))
              + Σ_{r∈s_B} (1/Π_r)(α(r)/σ²) B_{πr}(β, σ²),    (6.177)

where

B_{πr}(β, σ²) = (σ^{−2} + Σ_{k∈s_r} E h′(ψ_k)/π_{k|r})^{−1} Σ_{k∈s_r} [E h′(ψ_k)/π_{k|r}] x_k.    (6.178)

If σ̂² is a consistent estimator of σ², then confidence regions for β could be based as in (6.154) on

Φ̂_1*(β, α_0, σ̂²)^T J_0^{−1}(β, σ̂²) Φ̂_1*(β, α_0, σ̂²) ≈ χ²_(p),    (6.179)

where α_0 is a plausible value of α, and J_0(β, σ̂²) is a robust estimator of the covariance matrix of Φ̂_1*. A reasonable candidate might be

Σ_{r∈s_B} (1/Π_r²) Φ̂_{1r}*(β, α_0, σ²) Φ̂_{1r}*(β, α_0, σ²)^T,    (6.180)

where Φ̂_{1r}*(β, α_0, σ²) is the contribution of the rth sampled PSU to (6.177), without the 1/Π_r factor.
In the second approach, using a marginal moment analysis, the population-level estimating function for β would have a form like

Φ_m(β, σ̂_N²) = Σ_{r=1}^L Σ_{j∈B_r} (Y_j − m_j(β, σ̂_N²))(x_j − B_r(β, σ̂_N²)) = 0,    (6.181)

where σ̂_N² is a population-level estimate of σ²,

m_j(β, σ²) = E Y_j,

and

B_r(β, σ²) = (σ^{−2} + Σ_{k∈B_r} E h′(ψ_k))^{−1} Σ_{k∈B_r} E h′(ψ_k) x_k.    (6.182)

Solving (6.181) defines a finite population parameter β_N, although β_N may have limited appeal as a descriptive quantity. Under a two-stage probability sampling scheme we could estimate Φ_m via

Φ̂_m(β, σ̂²) = Σ_{r∈s_B} (1/Π_r) Σ_{j∈s_r} [(Y_j − m_j(β, σ̂²))/π_{j|r}] (x_j − B_{πr}(β, σ̂²)),    (6.183)

where σ̂² is a sample-based estimate of σ², and B_{πr}(β, σ²) is given by (6.178). We could solve Φ̂_m(β, σ̂²) = 0 for β, and estimate the covariance matrix of Φ̂_m by

Σ_{r∈s_B} (1/Π_r²) Φ̂_{mr}(β, σ̂²) Φ̂_{mr}(β, σ̂²)^T,    (6.184)

where Φ̂_{mr}(β, σ̂²) is the contribution of the rth sampled PSU to (6.183), without the 1/Π_r factor.
A similar approach can be devised for the estimation of β*, the parameter of an approximate marginal logistic regression model.

6.5.3 Design-effect approach


Suppose that the unweighted sample estimating function system

S_{β*} = Σ_{j∈s} a_j (Y_j − μ_j(β*)) x_j

is being used to estimate β*, the parameter of a marginal logistic regression model. Under an assumption of independence of the Y_j, the model covariance matrix of S_{β*} is

V_0 = Σ_{j∈s} a_j² μ_j(β*)(1 − μ_j(β*)) x_j x_j^T,    (6.185)

and we would have

X² = S_{β*}^T V̂_0^{−1} S_{β*} ≈ χ²_(p)    (6.186)

for confidence regions and hypothesis testing. Here V̂_0 is V_0 with β* replaced by β̂*.
Suppose, however, that the population and sample are two-stage, and that the Y_j are correlated within PSUs. Then the real covariance matrix of S_{β*} (in terms of the model distribution or the design distribution as desired) will be different, V_R say. As noted by Rao and Scott (1981), it follows from standard results on distributions of quadratic forms (Johnson and Kotz, 1970) that the statistic X² of (6.186) will be distributed approximately as

D = Σ_{m=1}^p λ_m Z_m²,    (6.187)

where Z_1, ..., Z_p are independent N(0, 1) variates, and λ_1, ..., λ_p are the eigenvalues of V_0^{−1} V_R.
The distribution of D of (6.187) can be evaluated numerically, given the eigenvalues λ_1, ..., λ_p. Alternatively, it can be approximated in terms of single chi-square variates. For example, since

E D = Σ_{m=1}^p λ_m = p λ̄,

the variate D_c = D/λ̄ will have the same mean as a χ²_(p) distribution. Since the variance

Var(D_c) = 2 Σ_{m=1}^p λ_m²/λ̄²

is greater than 2p, taking D/λ̄ to be χ²_(p) as a first-order approximation will be correct in terms of the mean but will underestimate its variance. For a second approximation, due to Satterthwaite (1946), we note that if

a² = Σ_{m=1}^p (λ_m − λ̄)²/(p λ̄²)

is the square of the coefficient of variation of the λ_m, then

Var(D_c) = 2p(1 + a²).    (6.188)

Define a new statistic

D_s = D_c/(1 + a²).    (6.189)

Since E D_s = p/(1 + a²) and Var(D_s) = 2p/(1 + a²), a 'second-order' approximation will take

D_s ≈ χ²_(ν),  ν = p/(1 + a²).    (6.190)

Note that Σ_{m=1}^p λ_m is the trace of V_0^{−1} V_R, and Σ_{m=1}^p λ_m² is the trace of the square of the same matrix.
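Since only these two traces are needed, the corrections can be computed without an explicit eigendecomposition. A sketch (names illustrative; V̂_0 and V̂_R are assumed to have been estimated already):

import numpy as np

def rao_scott_corrections(X2, V0, VR):
    """First- and second-order Rao-Scott corrections to the statistic
    X^2 of (6.186), using the trace identities noted above."""
    M = np.linalg.solve(V0, VR)                  # V0^{-1} V_R
    p = M.shape[0]
    lam_bar = np.trace(M) / p                    # mean design effect
    lam2_sum = np.trace(M @ M)                   # sum of squared eigenvalues
    a2 = lam2_sum / (p * lam_bar ** 2) - 1.0     # squared CV of the lambdas
    X2_first = X2 / lam_bar                      # first-order, ~ chi2(p)
    X2_second = X2_first / (1.0 + a2)            # D_s of (6.189)
    nu = p / (1.0 + a2)                          # degrees of freedom (6.190)
    return X2_first, X2_second, nu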
To compute the corresponding adjusted versions of X² of (6.186), namely X_c² = X²/λ̄̂ or X_s² = X_c²/(1 + â²), we would need estimates λ̂_1, ..., λ̂_p, and these can be computed from V̂_0^{−1} V̂_R, if V̂_R is available. To estimate V_R we need to know which sample units belong to the same PSUs. However, even if an estimate of V_R is not available, we may be able to make a crude adjustment. This is because, as has been noted by Rao and Scott (1981), each eigenvalue λ_1, ..., λ_p has an interpretation as a 'design effect', or ratio of true mean squared error to SRS mean squared error. That is, each λ_m is the design effect for some combination of the components of S_{β*}. Previous experience with the population may give us an upper bound λ_0 on the design effects for components. Then taking X²/λ_0 to be approximately χ²_(p) will give conservative confidence regions for β*. For another example, if the matrix V_0 is diag(v_1, ..., v_p), then the second-order Rao-Scott correction λ̄(1 + a²), based on the Satterthwaite approximation, will be

(Σ_i Σ_l v_{Ril}²/(v_i v_l)) / (Σ_i v_{Rii}/v_i),    (6.191)

where v_{Ril} is the (i, l)th element of V_R. If the true covariance matrix is such that v_{Rii}/v_i is roughly a constant d_1, and for each off-diagonal element the ratio v_{Ril}/√(v_i v_l) is roughly a constant d_2, the correction will be approximately d_1(1 + (p − 1)d_2²/d_1²). Thus we will have

X²/[d_1(1 + (p − 1)d_2²/d_1²)] ≈ χ²_(ν),  ν = p/(1 + (p − 1)d_2²/d_1²).    (6.192)

Rao and Scott (1981; 1984) have developed their adjustments in practical form for testing goodness-of-fit hypotheses for multinomial proportions, and for analysing loglinear models in contingency tables, in sample survey situations. The technique has been extended to corrections of X² statistics for logit models by Rao and Scott (1987), and to logistic regression contexts by Roberts et al. (1987).
CHAPTER 7

Sampling strategies in time and space

In this chapter we consider survey populations for which the units


are labelled by times or positions in space. Samples taken from these
populations are samples of times or locations, and there are many ap-
plications where the aims are essentially descriptive, like the following.
(a) Mapping. For a population considered at a fixed time point, it may
be of interest to construct a map or a picture of the locations and
extents of various features. For example, we might want to produce
a contour map of elevations in a geographic area from elevations
at a sample of locations, or to reconstruct a photograph of a face
from grey-scale values for a set of pixels.
(b) Searching. We might be interested in sampling to search for certain
features such as the location of an ore body or a school of fish, or
the point of maximum elevation in a geographic area.
(c) Monitoring. For quality control of a production process, we would
want to sample the quality level of the process at selected time
points. For monitoring pollution levels, we might want to sample
the concentration of a contaminant more or less continuously in
time, at selected points in space.
(d) Estimating local or global means. Estimating local means of a time
series or spatial process is akin to mapping via smoothing. Some-
times global totals or means are also of interest: for example, the
integral of a function, expressible as the area under a curve or the
volume under a surface; the total milk yield of a cow over a lac-
tation cycle; the total volume of contaminant flowing past a point
on a river in a day.
(e) Estimating other features. Besides means, measures of variability,
either global or local, may be of interest, as may spatial correlations,
or measures of texture of an image. For assessing compliance with
safety requirements, it may be necessary to estimate a maximum
level, for example of temperature within a reactor (Cox, 1968) or
of radiation exposure at a worksite.

Although modelling the response values will be an integral part of


the methods discussed in this chapter, we will emphasize essentially
these descriptive aims, particularly the prediction of unseen values and
the estimation of population means, either of response variates or of
underlying 'mean functions'.
In many applications, the sample is essentially fixed in time or space
by physical constraints. For example, air ozone may be monitored at
fixed equally spaced time points and at fixed monitoring stations, and
the problem is to estimate or 'predict' its distribution over the whole
area using an appropriate space-time model. In other situations, the
surveyor may have much more freedom in the choice of the sample,
and the problem is not only the estimation of quantities of interest,
but also the choice of an efficient sampling scheme in the light of the
survey objectives. The next two sections will outline the more com-
monly encountered models and ways of sampling in time and space.
Then, focusing on fixed sample contexts, Section 7.3 will give a brief
introduction to the theory of optimal linear prediction for spatial pro-
cesses, and Section 7.4 will discuss other methods based on sample
function approximation. Ways of choosing the points of a fixed sample
will be discussed in Section 7.5. Finally, Section 7.6 will look at the
use of randomized sampling designs in time and space, and criteria for
selecting them.

7.1 Models for spatial populations


Let t be a point in a d-dimensional lattice or in a d-dimensional spatial continuum. If d = 1, t could represent either time or location, along a line or in some narrow band. If d = 2, t could represent location in a geographic area. Larger values of d would be appropriate if the units corresponded to points in a 'design space' of covariate vectors, or the domain of a multiple integral.
The survey population U of interest we will take to be identifiable with some finite or connected subset of the lattice or of continuous d-dimensional space, as appropriate. Thus we might have

U = [0, 1] × [0, 1],

a subset of R², or

U = {(j, k): j = 1, ..., M_1; k = 1, ..., M_2},

a subset of the two-dimensional integer lattice Z².
When a superpopulation model is assumed, let Y_t be a real-valued random variable representing the value of the response variate at time/location t. If t represents time, then {Y_t : t ∈ U} is a time series or stochastic process; if t is a spatial location, {Y_t : t ∈ U} is a random field.
For the models we consider, there will be an underlying process or random field {μ_t : t ∈ U} which may be random or deterministic; the process {Y_t : t ∈ U} will be random, and will have its distribution specified conditionally on {μ_t : t ∈ U}. Thus, for example, we may have

Y_t = μ_t + ε_t,    (7.1)

where {μ_t : t ∈ U} is a random or deterministic trend function which is either known, known up to a few parameters, or unknown, and where {ε_t : t ∈ U} is a zero-mean noise process with known or estimable covariance structure.
As another example, we may have μ_t = I_A(t) where I_A is the indicator of a random set, and Y_t a noisy observation of μ_t, so that

P(Y_t = 0 | μ_t = 1) = a_1,  P(Y_t = 1 | μ_t = 0) = a_2,    (7.2)

where a_1, a_2 may be thought of as misclassification probabilities.


In terms of these processes, the mapping problem becomes estimating
(or 'predicting') unseen values of Y, or of J..t,; monitoring involves
detecting changes in the unconditional mean level of Yt ; and searching
becomes looking for t such that Yt is maximal or above a certain level.
The estimation of response variate means becomes estimation of the
'local' means

L~B Yt/ L
~B
1 or (Ytdv(t)/ ( dv(t),
1B 1B
where B is a fixed subset, typically small, of U, and v is a measure on
U in the continuous case; or the 'global' means

L Y,j L I or (Ytdv(t)/ ( dv(t).


,eU teU 1u 1u
Both of these can be expressed in the forms

L ¢,Y,j L ¢t or (¢,Ytdv(t)/ ( ¢,dv(t) , (7.3)


teU ,eU 1u 1u
where ¢, is the indicator of B for local means, and is identically one
for global means. Other features of interest are expressible in terms of
the distribution of the Yt values over t E B or t E U, or in terms of the
254 SAMPLING STRATEGIES IN TIME AND SPACE

distribution of pairs of these values at nearby points. Examples would


be mliXteU Yt , or the population semivariogram

~ L(Y,+h - yt )2/ L 1,
teU' teU'
where U' = {t E U: t+ hE U}.
In later sections, we will look mainly at the mapping problem and
the estimation of means as in (7.3).
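For a response recorded on a one-dimensional lattice, the population semivariogram above can be computed directly; a small sketch (name illustrative):

import numpy as np

def semivariogram_1d(y, max_lag):
    """Population semivariogram of a one-dimensional lattice response:
    gamma(h) = (1/2) * average of (Y_{t+h} - Y_t)^2 over t with t+h in U,
    as in the display above."""
    y = np.asarray(y, float)
    gamma = np.empty(max_lag)
    for h in range(1, max_lag + 1):
        diffs = y[h:] - y[:-h]
        gamma[h - 1] = 0.5 * np.mean(diffs ** 2)
    return gamma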

7.1.1 Models for the noise term in (7.1)

When the model (7.1) with an additive noise term is appropriate, we usually think of ε_t as a continuous variate with mean 0. We will denote by C_ε(s, t) and ρ_ε(s, t) respectively the conditional covariance and correlation of ε_s and ε_t, given μ = {μ_t : t ∈ U}:

C_ε(s, t) = Cov(ε_s, ε_t | μ),
ρ_ε(s, t) = Corr(ε_s, ε_t | μ) = C_ε(s, t)/[C_ε(s, s)^{1/2} C_ε(t, t)^{1/2}].    (7.4)

If U is discrete and the terms in the noise process are independent, given the underlying mean function, then C_ε(s, t) and ρ_ε(s, t) are zero for s ≠ t. For either discrete or continuous U, if the noise process is stationary, we may write C_ε(s, t) as C_ε(t − s) and ρ_ε(s, t) as ρ_ε(t − s). A more general class of models is the one for which the increments are stationary, so that E(ε_t − ε_s) = 0 and

Var(ε_t − ε_s) = 2Γ_ε(t − s).    (7.5)

The function 2Γ_ε is called the variogram of the process or random field, while Γ_ε is called the semivariogram.
The importance of the semivariogram derives from its interpretability
in terms of variation. Moreover, it is widely applicable, since many
spatial processes which appear non-stationary because of a 'drifting'
level have relatively homogeneous increments.

7.1.2 Models for the underlying process μ_t

As indicated above, when the model is given by (7.1), the mean function is often taken to be deterministic, either a known function f_0(t) or an unknown linear combination of a set of functions f_1(t), ..., f_p(t):

μ_t = β_1 f_1(t) + β_2 f_2(t) + ... + β_p f_p(t).    (7.6)

These mean functions, together with stationary increment noise, are essentially the models of kriging and universal kriging, which have been applied successfully in many geostatistical applications.
When the mean function in (7.1) is thought to be random, a general representation for it can be given by

μ_t = β_1 f_1(t) + ... + β_p f_p(t) + η_t,    (7.7)

where η = {η_t : t ∈ U} is a zero-mean random process with covariances C_η(s, t) = Cov(η_s, η_t) and correlations ρ_η(s, t) = Corr(η_s, η_t). Combining (7.7) and (7.1) gives

Y_t = β_1 f_1(t) + ... + β_p f_p(t) + η_t + ε_t.    (7.8)

Usually, the process η would be taken to be independent of the noise process ε = {ε_t : t ∈ U}, and would have realizations or sample functions which would be smoother or more slowly varying than those of ε.
More specific models in common use for ε and η depend somewhat on whether the space U is continuous or discrete. In the continuous case, the emphasis tends to be on direct modelling of the covariance structures or variogram rather than of joint distributions.

7.1.3 Correlation and semivariogram models

Correlation functions sometimes assumed for stationary processes $z = \{Z_t : t \in U\}$ (where $Z_t = \varepsilon_t$ or $\eta_t$ or $\varepsilon_t + \eta_t$) include the following (Christakos, 1984; Matérn, 1986; Cressie, 1993):
$$\rho_z(h) = \begin{cases} 1 & h = 0 \\ 0 & h \neq 0 \end{cases} \qquad \text{(white noise)} \qquad (7.9)$$
$$\rho_z(h) = \begin{cases} 1 - |h|/a & 0 \le |h| \le a \\ 0 & |h| > a \end{cases} \qquad \text{(triangular; valid only in } R^1\text{)} \qquad (7.10)$$
$$\rho_z(h) = \exp\{-a^2 \|h\|^2\} \qquad (7.11)$$
$$\rho_z(h) = \exp\{-b\|h\|\} \qquad \text{(exponential)} \qquad (7.12)$$
$$\rho_z(h) = \exp\{-b\|h\|^p\}, \quad 0 < p \le 2 \qquad (7.13)$$
$$\rho_z(h) \qquad (7.14)$$
$$\rho_z(h) = (a^2\|h\|/2)^{\nu} K_\nu(a^2\|h\|)/\Gamma(\nu), \quad \nu > 0, \qquad (7.15)$$
where $\|\cdot\|$ represents the Euclidean norm, $\Gamma$ denotes the gamma function and $K_\nu$ denotes the modified Bessel function of the second kind.
All the correlation functions specified so far are isotropic, depending on $h$ through $\|h\|$. Models which are not isotropic are sometimes defined through isotropic models for transformed coordinate systems (Sampson and Guttorp, 1992). In other cases they are defined as products. For example, a valid correlation function in $d = 2$ dimensions can be obtained as a product of one-dimensional correlation functions (Martin, 1979):
$$\rho_z(h_1, h_2) = \rho_1(h_1)\,\rho_2(h_2). \qquad (7.16)$$
When the $z$ process is the sum of a process with a continuous correlation function $\rho_c$ and independent white noise, we have
$$\rho_z(h) = \begin{cases} 1 & h = 0 \\ (1 - c_0)\,\rho_c(h) & h \neq 0, \end{cases} \qquad (7.17)$$
for some $c_0$ between 0 and 1.
If the process $z$ has stationary increments, the semivariogram of $z$ is
$$\gamma_z(h) = \frac{1}{2}\,\mathrm{Var}(Z_{t+h} - Z_t). \qquad (7.18)$$
Then $\gamma_z(h) = 0$ if $h = 0$. If $\gamma_z(h)$ has a positive limit as $h \to 0$, this limit is called in geostatistics a 'nugget effect' (Matheron, 1965), and it represents a combination of measurement error noise and very small-scale variation or roughness in the true response.

When $z$ is stationary, the semivariogram of $z$ has the form
$$\gamma_z(h) = \sigma_z^2\{1 - \rho_z(h)\}, \qquad (7.19)$$
where $\sigma_z^2$ is $\mathrm{Var}(Z_t)$, and $\gamma_z(h)$ approaches the constant level $\sigma_z^2$ as $h$ tends to infinity. In the case of (7.17), we have
$$\gamma_z(h) = \begin{cases} 0 & h = 0 \\ \sigma_z^2\{c_0 + (1 - c_0)(1 - \rho_c(h))\} & h \neq 0, \end{cases} \qquad (7.20)$$
so that the nugget effect $\lim_{h \to 0} \gamma_z(h)$ is $\sigma_z^2 c_0$. Thus corresponding to the correlation forms of (7.10)-(7.15), we have a set of semivariogram forms determined by (7.20).

At the same time, as indicated earlier, the semivariogram is defined also for any process with stationary increments. For example, if $z$ is a Wiener process, the semivariogram is
$$\gamma_z(h) = \sigma^2 |h|/2. \qquad (7.21)$$
Another semivariogram corresponding to a non-stationary model has the power form
$$\gamma_z(h) = c_0 + b\|h\|^\lambda, \quad 0 < \lambda < 2. \qquad (7.22)$$
A semivariogram often used in geostatistics is the spherical semivariogram, valid for $d = 1$, 2 or 3, taking the form (with unit sill)
$$\gamma_z(h) = \begin{cases} \frac{3}{2}\|h\|/a - \frac{1}{2}\|h\|^3/a^3 & \|h\| \le a \\ 1 & \|h\| > a. \end{cases} \qquad (7.23)$$
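As an illustration of how these families behave, the following minimal Python sketch (ours, not from the text) evaluates the exponential correlation (7.12), the nugget form (7.17) and the spherical semivariogram (7.23) on a grid of lags; the parameter values are arbitrary choices for display.

```python
import numpy as np

def rho_exponential(h, b=1.0):
    # Exponential correlation (7.12): rho(h) = exp(-b * |h|)
    return np.exp(-b * np.abs(h))

def rho_nugget(h, rho_c, c0=0.3):
    # Continuous correlation plus independent white noise (7.17)
    h = np.asarray(h, dtype=float)
    return np.where(h == 0.0, 1.0, (1.0 - c0) * rho_c(h))

def gamma_spherical(h, a=2.0):
    # Spherical semivariogram (7.23) with unit sill
    r = np.abs(h) / a
    return np.where(r <= 1.0, 1.5 * r - 0.5 * r**3, 1.0)

lags = np.linspace(0.0, 4.0, 9)
print(rho_exponential(lags))
print(rho_nugget(lags, rho_exponential))
print(gamma_spherical(lags))
```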

7.1.4 Properties of realizations of the z process

A realization of the $z$ process (i.e. of the $\eta$ or $\varepsilon$ process) defines a curve ($d = 1$), a surface ($d = 2$) or a hypersurface ($d \ge 3$) if $U$ is a continuum. The degree of smoothness of the realization depends on the behaviour of the correlation function for points very close together, i.e. on $\rho_z(t, t+h)$ as $h \to 0$, or on $\rho_z(h)$ as $h \to 0$ in the stationary case. In general, we would expect that $\rho_z(t, t+h)$ would possess higher-order derivatives with respect to $h$ at $h = 0$ than would the realization at $t + h$. Cramér and Leadbetter (1967) have given some details of this dependence for $d = 1$, in both stationary and non-stationary cases. In the stationary Gaussian case, a sufficient condition for continuity of the sample paths is that for some $a > 1$,
$$\rho_z(h) = 1 - O\{|\log|h||^{-a}\} \quad \text{as } h \to 0, \qquad (7.24)$$
or
$$\int_0^\infty [\log(1 + \lambda)]^a \, dF_z(\lambda) < \infty, \qquad (7.25)$$
where $F_z$ is the spectral 'distribution function' given by
$$\rho_z(t) = \int_{-\infty}^{\infty} e^{i\lambda t} \, dF_z(\lambda). \qquad (7.26)$$
Again in the stationary Gaussian case, a sufficient condition for some version of the process to have continuous derivatives of order $n$ is that
$$\int_0^\infty \lambda^{2n} [\log(1 + \lambda)]^a \, dF_z(\lambda) < \infty \qquad (7.27)$$
for some $a > 3$. A sufficient condition in terms of the correlation function would be that
$$\rho(h) = 1 - \lambda_2 \frac{h^2}{2} + \lambda_4 \frac{h^4}{24} - \cdots + (-1)^n \lambda_{2n} \frac{h^{2n}}{(2n)!} + O\left(\frac{h^{2n}}{|\log|h||^{a}}\right) \quad \text{as } h \searrow 0. \qquad (7.28)$$

The degree of smoothness of the sample functions will clearly affect the efficiency of methods for estimating unseen values and local and global means from samples. Intuitively, the smoother the curve or surface, the easier it will be to approximate it using values at sampled points.

7.1.5 Discrete population case

In the discrete case, if $U$ is a $d$-dimensional lattice, the underlying process $\{\mu_t : t \in U\}$, or equivalently the $\eta$ process if the model is (7.8), is often modelled as Markovian. Thus if noise or misclassification error is present, the model for the mean function and observable process $\{Y_t : t \in U\}$ together can sometimes be called a 'hidden Markov process'. A time series or stochastic process $\{\mu_t : t \in U\}$ has the Markov property if, conditional on its variate values up to time $t_0$ (the 'past' and 'present'), the distribution of its variate values after time $t_0$ (the 'future') depends only on the value of $\mu_{t_0}$. An example might be a measure of quality which may jump from one constant level to another in response to some random shock from the external environment.

In space, there is no past, present or future. We can define an analogue of the Markov property by generalizing one of its stochastic process consequences, namely that if $t_1 < t_0 < t_2$, then the distribution of $\mu_{t_0}$ conditional on $\{\mu_s : s \le t_1 \text{ or } s \ge t_2\}$ depends only on $\mu_{t_1}$ and $\mu_{t_2}$. In discrete space, we may define for each $t_0$ a neighbourhood $N(t_0)$ of points in $U$, and stipulate that the conditional distribution of $\mu_{t_0}$ given $\mu_s$ for all $s \neq t_0$ depends only on $\{\mu_s : s \in N(t_0)\}$. The property of being neighbours is usually taken to be a symmetric relation between population units, so that $s$ is a neighbour of $t$ if and only if $t$ is a neighbour of $s$.

A very useful subclass of homogeneous Markov random field models on a discrete space expresses the joint probability function of the variates in $\{\mu_t : t \in U\}$ in the 'Gibbs distribution' form with pairwise interactions:
$$P(\mu_t : t \in U) = C(\beta) \exp\Big\{\sum_t A(\mu_t; \beta) + \sum_s \sum_t \Theta_{s,t}(\mu_s, \mu_t; \beta)\Big\},$$
where $\beta$ is a parameter, the function $\Theta$ is the Gibbs energy function, $C(\beta)$ is a normalizing constant, and the double sum is over all pairs $(s, t)$ which are neighbours of each other.

Variations on models such as this can be surprisingly useful as prior distributions for black and white or grey-scale images (see Qian and Titterington, 1991, and references therein).
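To make the pairwise-interaction form concrete, here is a small illustrative Python sketch (ours, not from the text) of Gibbs sampling for a binary $\mu$ field on a lattice with nearest-neighbour interactions; the single interaction parameter `beta`, the $\pm 1$ coding and the sweep scheme are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_sweep(mu, beta):
    """One Gibbs-sampler sweep for a binary (+1/-1) Markov random field
    with nearest-neighbour pairwise interactions of strength beta."""
    n, m = mu.shape
    for i in range(n):
        for j in range(m):
            # Sum of the four nearest neighbours (free boundary).
            s = sum(mu[a, b]
                    for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                    if 0 <= a < n and 0 <= b < m)
            # Conditional distribution of mu[i, j] given its neighbours.
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
            mu[i, j] = 1 if rng.random() < p_plus else -1
    return mu

mu = rng.choice([-1, 1], size=(20, 20))
for _ in range(50):
    mu = gibbs_sweep(mu, beta=0.4)
print(mu[:5, :5])
```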
For applications in which we are concerned with the presence and location of some substance or object in space, contained in a background of some different material, more geometrically based models are often assumed, and invariance of the model under translations and/or rotations is an important consideration. If we let the role of $\mu_t$ be taken by the indicator of the substance or object in question, then we may wish to model it as the indicator of a random closed set generated in some physically plausible way. For example, for an object made up of globular particles, we could think of the set $A$ as consisting of spheres or ellipses with randomly distributed centres and orientations, and principal axis lengths coming from a certain size distribution. For deposits of crystalline material, we could think of the substance as occupying cells in space, such as convex polygons or polyhedra. These could be defined by randomly positioned lines or planes, or by growth from randomly distributed 'nuclei' to form Voronoi polyhedra (constant growth simultaneously for all cells) or more complicated structures (Moran, 1972).

7.2 Spatial sampling designs


At the beginning of the chapter it was seen that there are many possible
objectives for surveys in time or space. Clearly different objectives will
require different approaches to sampling, and some of the considera-
tions relevant to choice of approach will be outlined in Sections 7.5
and 7.6.
All of the types of sampling design we have discussed previously
can be applied to spatial and temporal populations, with divisions like
strata or PSUs usually being made up of units which are contiguous in
space or time. However, when the structure of U is that of a lattice or a
regular continuum, it becomes natural to think of sampling in terms of
the locations and spacings of the units, and regular pattern samples such
as systematic or grid samples are frequently used. In situations where exact sampling unbiasedness is not of great concern, non-random or purposive sampling schemes have strong appeal. However, randomized or probability sampling designs have their advantages, and are often advocated for protection against biases.
When $U$ is finite and discrete, a probability sampling design is, as before, a probability function $p(s)$ on the collection of subsets $s$ of $U$. When $U$ is continuous, we may still think of the sample as a finite set $s$ of points, but if the sampling scheme is not purposive, it is usually defined in terms of draw probability densities on subsets of the space $U$. Thus, for example, the analogue of an SRS would be a sequence of points $t_1, \ldots, t_n$ drawn independently from a distribution with uniform density on $U$; the analogue of unequal probability random sampling designs like the 'approximately with replacement' scheme of Section 2.8 would be to draw points $t_1, \ldots, t_n$ independently from a distribution with density $h(t)$ with respect to some measure on $U$. For systematic sampling from $[0, 1]$, we would draw $t_1$ from a distribution which was uniform on $[0, 1/n]$, and define $t_j = t_1 + (j - 1)/n$, $j = 2, \ldots, n$. For the analogue of Madow's ordered systematic procedure with density $h$ on $[a, b]$, we would draw $s_1, \ldots, s_n$ systematically from $[0, 1]$, and define $t_j = H^{-1}(s_j)$, $j = 1, \ldots, n$, where
$$H(t) = \int_a^t h(u) \, du.$$
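These continuous analogues translate directly into code. The following short Python sketch (our illustration, with an arbitrary density $h$) draws a systematic sample on $[0, 1]$ and a Madow-type sample with density $h$ via numerical inversion of $H$.

```python
import numpy as np

rng = np.random.default_rng(0)

def systematic_sample(n):
    # Draw t1 uniform on [0, 1/n]; t_j = t1 + (j-1)/n.
    t1 = rng.uniform(0.0, 1.0 / n)
    return t1 + np.arange(n) / n

def madow_sample(n, h, a=0.0, b=1.0, grid=10_000):
    # Systematic points s_1, ..., s_n on [0, 1], mapped through H^{-1},
    # where H(t) = integral of h from a to t (normalized to 1 on [a, b],
    # so h need not integrate to one).
    s = systematic_sample(n)
    u = np.linspace(a, b, grid)
    H = np.cumsum(h(u))
    H /= H[-1]                      # numerical normalization
    return np.interp(s, H, u)       # approximate H^{-1}(s)

print(systematic_sample(5))
print(madow_sample(5, lambda t: 1.0 + t))   # density proportional to 1 + t
```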

The most obvious analogue of systematic sampling in $d$ dimensions is to sample at the points of intersection of a randomly located $d$-dimensional lattice and $U$. Other analogues divide the space into cells, and sample one point from each cell in such a way that the resulting set has one-dimensional projections which are unions of systematic samples. See Figure 7.1 as well as Cochran (1977) and Bellhouse (1977) for examples with $d = 2$. In the most general form for $U = [0, 1]^2$, the sample points are $(t_{1jk}, t_{2jk})$, where $t_{1jk} = t_{1j1} + (k - 1)/m_1$, $k = 1, \ldots, m_1$, and $t_{2jk} = t_{21k} + (j - 1)/m_2$, $j = 1, \ldots, m_2$. The coordinates $t_{1j1}$ and $t_{21k}$ are selected from $(0, 1/m_1]$ and $(0, 1/m_2]$ respectively. Lattice or grid sampling corresponds to selecting $t_{111}$, $t_{211}$ randomly, and letting $t_{1j1} = t_{111}$ for all $j$ and $t_{21k} = t_{211}$ for all $k$.

For $d \ge 2$, in applications such as response surface estimation or numerical integration when sampling is expensive, the problem arises of how to choose a small number of points scattered as uniformly as possible over the space $U$. One class of designs proposed for dealing with this problem is Latin hypercube sampling (McKay et al., 1979;
[Figure 7.1 Two-dimensional analogues of systematic samples. (a) Aligned or 'square grid' sample; (b) sample aligned in one direction; (c) unaligned sample.]

Stein, 1987; Owen, 1992; see also Yates, 1960). If $U = [0, 1]^d$, $U$ is partitioned into $m^d$ smaller hypercubes of edge length $1/m$. A Latin hypercube is an $m \times d$ matrix, of which each column is a permutation of $1, \ldots, m$. The sampling scheme consists of randomly generating such a matrix, with entries $U_{jr}$. The $j$th sampled small hypercube is the one for which the $r$th defining edge is $((U_{jr} - 1)/m, U_{jr}/m]$, $r = 1, \ldots, d$. The $j$th sampled point is randomly or purposively selected within the $j$th sampled hypercube. The special case $d = 2$ yields Latin square sampling, an example of which is shown in Figure 7.2. Tang (1993) has shown how improved uniformity of a Latin hypercube sample results from constraining the random permutation matrices with background orthogonal arrays.
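A minimal Python sketch of this scheme (ours, not from the text): each column of the Latin hypercube matrix is an independent random permutation, and a point is drawn uniformly within each selected small hypercube.

```python
import numpy as np

rng = np.random.default_rng(42)

def latin_hypercube(m, d):
    """Draw m points in [0, 1]^d by Latin hypercube sampling."""
    # m x d matrix whose columns are random permutations of 1, ..., m.
    U = np.column_stack([rng.permutation(m) + 1 for _ in range(d)])
    # Point j is uniform in the cell whose r-th edge is ((U[j,r]-1)/m, U[j,r]/m].
    return (U - rng.uniform(size=(m, d))) / m

points = latin_hypercube(m=5, d=2)
print(points)            # each row is one sampled point
```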
In some applications, samples are not finite sets, but are continuous
subsets of the continuum U. The science of stereology (Moran, 1972)
involves trying to infer the properties of a three-dimensional object
by measuring the properties of a randomly selected planar section or
a linear probe with random location and direction. Wildlife sampling
may involve counting all 'sightings' along a set of randomly selected
transects through a wilderness area (Thompson, 1992). Some of the
more challenging problems of spatial sampling theory are found in
trying to use the data from these kinds of sampling to best advantage.

A note on asymptotics

For a discrete population U, such as the set of observation points for a


time series, the natural framework for asymptotic results is similar to the
classical framework for sampling theory (Section 3.4). We think of the
size of U as increasing, and the population response array developing,
as the sample size increases. The spacing between neighbouring units
[Figure 7.2 A Latin square sample of size 3.]

of U remains constant, and the typical spacing between sampled units


does not decrease.
With a continuous population U, the same kind of asymptotic frame-
work may be relevant. However, for many applications it is more nat-
ural to consider 'fixed domain' or 'infill' asymptotics (Cressie, 1993),
where U remains fixed and the sample points become more and more
dense in U as the sample size increases.

7.3 Pointwise prediction and predictive estimation of means and totals

In this section we will focus first on prediction of values $Y_{t_0}$ of the observable process, or values $\mu_{t_0}$ of the mean function process, at single points $t_0$ in $U$. Optimal linear predictors of $Y_{t_0}$ and $\mu_{t_0}$ will lead naturally to linear predictors of population means and totals. The sample will be taken to be a fixed set of points $t_1, \ldots, t_n$, and the criteria to be satisfied by the predictors will be framed in terms of the superpopulation model only. We will look first at optimal prediction. A linear predictor will be optimal if it is model-unbiased and has minimal predictive mean squared error. Variants of these criteria have been found to be very useful in geostatistics, where they have led to a set of techniques designated by the term kriging.

Although the optimal predictors depend only on mean and covariance functions, the criteria of linearity, unbiasedness and minimal mean squared error are most suited to additive models like (7.1) with (7.8), with Gaussian or near-Gaussian random components. Other forms or criteria may be more appropriate when the response variates are non-Gaussian, or if robustness of prediction against outliers is needed. Detailed discussions of modifications for these situations are provided by Cressie (1993).
We will take the model to be given by (7.1) and (7.8), so that for $t \in U$,
$$Y_t = \mu_t + \varepsilon_t, \qquad \mu_t = \sum_{l=1}^{p} \beta_l f_l(t) + \eta_t, \qquad (7.29)$$
with $p < n$. Thus
$$Y_t = \sum_{l=1}^{p} \beta_l f_l(t) + \delta_t, \qquad (7.30)$$
where
$$\delta_t = \eta_t + \varepsilon_t. \qquad (7.31)$$
The functions $f_l(t)$ will be taken to be known functions, and in particular $f_1(t)$ will be identically 1. The process $\eta = \{\eta_t : t \in U\}$ is the 'state process', analogous to a function of random effects, and we think of it as smoother or having slower variation than the 'noise process' $\varepsilon = \{\varepsilon_t : t \in U\}$. The processes $\varepsilon$ and $\eta$ will be taken to be independent with zero mean and general covariance structures to begin with. Thus we have
$$\mathcal{E}\eta_t = 0, \quad \mathcal{E}\varepsilon_t = 0, \quad \mathcal{E}\delta_t = 0;$$
$$C_\delta(s, t) = C_\eta(s, t) + C_\varepsilon(s, t); \qquad (7.32)$$
$$\gamma_\delta(s, t) = \gamma_\eta(s, t) + \gamma_\varepsilon(s, t);$$
where $C_z(s, t) = \mathrm{Cov}(z_s, z_t)$ and $2\gamma_z(s, t) = \mathrm{Var}(z_t - z_s)$, with $z$ standing for $\delta$ or $\eta$ or $\varepsilon$.
7.3.1 Prediction of $Y_{t_0}$

We first consider the prediction of $Y_{t_0}$ by the linear predictor $\sum_{j=1}^n a_j Y_{t_j}$, where the coefficients $a_j$ are chosen to satisfy the condition
$$\sum_{j=1}^n a_j f_l(t_j) = f_l(t_0), \quad l = 1, \ldots, p. \qquad (7.33)$$
This condition implies $\mathcal{E}$-unbiasedness of the predictor. Moreover, since under (7.30) we have
$$Y_{t_0} - \sum_{j=1}^n a_j Y_{t_j} = \delta_{t_0} - \sum_{j=1}^n a_j \delta_{t_j}, \qquad (7.34)$$
we see that the prediction error is free of dependence on the unknown coefficients $\beta_1, \ldots, \beta_p$.

The square of the prediction error can be expressed in two ways, as
$$\Big(Y_{t_0} - \sum_{j=1}^n a_j Y_{t_j}\Big)^2 = \sum_{j=1}^n \sum_{k=1}^n a_j a_k \delta_{t_j}\delta_{t_k} - 2\sum_{j=1}^n a_j \delta_{t_j}\delta_{t_0} + \delta_{t_0}^2 \qquad (7.35)$$
or (since the constraint for $f_1$ implies $\sum_{j=1}^n a_j = 1$) as
$$\Big(Y_{t_0} - \sum_{j=1}^n a_j Y_{t_j}\Big)^2 = -\sum_{j=1}^n \sum_{k=1}^n a_j a_k (\delta_{t_j} - \delta_{t_k})^2/2 + \sum_{j=1}^n a_j (\delta_{t_0} - \delta_{t_j})^2. \qquad (7.36)$$
Thus the mean squared error of prediction $\mathcal{E}(Y_{t_0} - \sum_{j=1}^n a_j Y_{t_j})^2$ is expressible as
$$\sum_{j=1}^n \sum_{k=1}^n a_j a_k C_\delta(t_j, t_k) - 2\sum_{j=1}^n a_j C_\delta(t_j, t_0) + C_\delta(t_0, t_0) \qquad (7.37)$$
or as
$$-\sum_{j=1}^n \sum_{k=1}^n a_j a_k \gamma_\delta(t_j, t_k) + 2\sum_{j=1}^n a_j \gamma_\delta(t_0, t_j). \qquad (7.38)$$
We minimize these expressions subject to the constraints (7.33), which are expressible as
$$F^{\mathrm{T}} a = f_0, \qquad (7.39)$$
where $a$ is the column vector with $j$th element $a_j$, $F$ is the matrix with $(j, l)$th element $f_l(t_j)$, and $f_0$ is the column vector with $l$th element $f_l(t_0)$. Applying the method of Lagrange multipliers yields ultimately
$$a = C_\delta^{-1} c_{0\delta} + C_\delta^{-1} F (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} (f_0 - F^{\mathrm{T}} C_\delta^{-1} c_{0\delta}) \qquad (7.40)$$
or
$$a = \Gamma_\delta^{-1} \gamma_{0\delta} + \Gamma_\delta^{-1} F (F^{\mathrm{T}} \Gamma_\delta^{-1} F)^{-1} (f_0 - F^{\mathrm{T}} \Gamma_\delta^{-1} \gamma_{0\delta}), \qquad (7.41)$$
where $C_\delta$ and $\Gamma_\delta$ have $(j, k)$th elements $C_\delta(t_j, t_k)$ and $\gamma_\delta(t_j, t_k)$ respectively, and $c_{0\delta}$ and $\gamma_{0\delta}$ have $j$th elements $C_\delta(t_j, t_0)$ and $\gamma_\delta(t_j, t_0)$ respectively. For these expressions to be valid we need to assume, and will assume, that all the matrix inverses in them exist.

The optimal predictor can be written as $Y_s^{\mathrm{T}} a$, where the $j$th element of $Y_s$ is $Y_{t_j}$. (Here the subscript $s$ refers to the fact that $Y_s$ is the vector of sampled $Y$ values.) The form (7.40), which is valid even without the assumption that $f_1(t) \equiv 1$, makes it clear that the predictor can be thought of as a basic predictor for the zero trend or no-constraint problem (that is, $Y_s^{\mathrm{T}} C_\delta^{-1} c_{0\delta}$), corrected or calibrated to the constraints (7.39). The predictor can also be written as an estimated trend value plus an appropriate combination of estimated residuals at the sample points:
$$Y_s^{\mathrm{T}} a = \hat{\beta}^{\mathrm{T}} f_0 + (Y_s^{\mathrm{T}} - \hat{\beta}^{\mathrm{T}} F^{\mathrm{T}}) Q_\delta c_{0\delta}, \qquad (7.42)$$
where
$$\hat{\beta} = (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} F^{\mathrm{T}} C_\delta^{-1} Y_s \qquad (7.43)$$
is the generalized least squares estimator of $\beta = (\beta_1, \ldots, \beta_p)^{\mathrm{T}}$, and
$$Q_\delta = C_\delta^{-1} - C_\delta^{-1} F (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} F^{\mathrm{T}} C_\delta^{-1}.$$
The predictive mean squared error of $Y_s^{\mathrm{T}} a$ can be shown to be
$$C_\delta(t_0, t_0) - c_{0\delta}^{\mathrm{T}} C_\delta^{-1} c_{0\delta} + A^{\mathrm{T}} (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} A, \qquad (7.44)$$
where
$$A = f_0 - F^{\mathrm{T}} C_\delta^{-1} c_{0\delta}, \qquad (7.45)$$
or to be
$$\gamma_{0\delta}^{\mathrm{T}} \Gamma_\delta^{-1} \gamma_{0\delta} - B^{\mathrm{T}} (F^{\mathrm{T}} \Gamma_\delta^{-1} F)^{-1} B, \qquad (7.46)$$
where
$$B = f_0 - F^{\mathrm{T}} \Gamma_\delta^{-1} \gamma_{0\delta}. \qquad (7.47)$$
Note that if $t_0$ is one of the sampled points $t_j$, then $C_\delta^{-1} c_{0\delta}$ and $\Gamma_\delta^{-1} \gamma_{0\delta}$ have $j$th element equal to 1 and the other elements 0; it follows easily that $A$ and $B$ are zero vectors, and that the optimal linear predictor of $Y_{t_0}$ is $Y_{t_j}$, which is equal to $Y_{t_0}$ itself. Thus the optimal linear predictor can be regarded in this simple sense as interpolating between the observations at sampled points. We shall see below, however, that the predictor as a function of $t_0$ is not necessarily continuous at the sampled points.
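To make (7.40), (7.44) and (7.45) concrete, here is a minimal Python sketch (ours, not from the text), assuming an exponential covariance for the $\delta$ process and a linear trend basis; it computes the constrained weights, checks the calibration constraint, and evaluates the predictive mean squared error.

```python
import numpy as np

def kriging_weights(t_s, t0, f, cov):
    """Weights a of (7.40) and predictive MSE of (7.44)-(7.45),
    for sample points t_s and target point t0."""
    F = f(t_s)                                   # n x p trend basis
    f0 = f(np.array([t0]))[0]                    # p-vector at t0
    C = cov(t_s[:, None], t_s[None, :])          # n x n covariance of delta
    c0 = cov(t_s, t0)                            # n-vector C_delta(t_j, t0)
    Ci_c0 = np.linalg.solve(C, c0)
    Ci_F = np.linalg.solve(C, F)
    M = np.linalg.inv(F.T @ Ci_F)                # (F' C^{-1} F)^{-1}
    A = f0 - F.T @ Ci_c0                         # (7.45)
    a = Ci_c0 + Ci_F @ (M @ A)                   # (7.40)
    mse = cov(t0, t0) - c0 @ Ci_c0 + A @ M @ A   # (7.44)
    return a, mse

# Assumed ingredients: exponential covariance, linear trend f(t) = (1, t).
cov = lambda s, t: np.exp(-2.0 * np.abs(s - t))
f = lambda t: np.column_stack([np.ones_like(t), t])

t_s = np.array([0.1, 0.3, 0.6, 0.9])
a, mse = kriging_weights(t_s, 0.5, f, cov)
print(a, a.sum(), mse)   # weights sum to 1 because f_1 is identically 1
```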
7.3.2 Prediction of $\mu_{t_0}$

Now we consider the prediction of $\mu_{t_0}$ by $Y_s^{\mathrm{T}} b = \sum_{j=1}^n b_j Y_{t_j}$, subject to an $\mathcal{E}$-unbiasedness constraint which is expressible by
$$\sum_{j=1}^n b_j f_l(t_j) = f_l(t_0), \quad l = 1, \ldots, p, \qquad (7.48)$$
or in matrix form by
$$F^{\mathrm{T}} b = f_0. \qquad (7.49)$$
The error of prediction is
$$\mu_{t_0} - \sum_{j=1}^n b_j Y_{t_j} = -\varepsilon_{t_0} + \delta_{t_0} - \sum_{j=1}^n b_j \delta_{t_j}, \qquad (7.50)$$
and it follows that the mean squared error of prediction is expressible as
$$\sum_{j=1}^n \sum_{k=1}^n b_j b_k C_\delta(t_j, t_k) - 2\sum_{j=1}^n b_j C_\eta(t_j, t_0) + C_\eta(t_0, t_0) \qquad (7.51)$$
or as
$$-\sum_{j=1}^n \sum_{k=1}^n b_j b_k \gamma_\delta(t_j, t_k) + 2\sum_{j=1}^n b_j\{\gamma_\delta(t_0, t_j) + C_\varepsilon(t_0, t_j)\} - C_\varepsilon(t_0, t_0). \qquad (7.52)$$
From the same computations as for prediction of $Y_{t_0}$, it readily follows that the coefficients of the optimal linear predictor are given by
$$b = C_\delta^{-1} c_{0\eta} + C_\delta^{-1} F (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} (f_0 - F^{\mathrm{T}} C_\delta^{-1} c_{0\eta}) \qquad (7.53)$$
or by
$$b = \Gamma_\delta^{-1} \gamma^{*}_{0\delta} + \Gamma_\delta^{-1} F (F^{\mathrm{T}} \Gamma_\delta^{-1} F)^{-1} (f_0 - F^{\mathrm{T}} \Gamma_\delta^{-1} \gamma^{*}_{0\delta}), \qquad (7.54)$$
where $c_{0\eta}$ and $\gamma^{*}_{0\delta}$ have $j$th elements $C_\eta(t_j, t_0)$ and $\gamma_\delta(t_j, t_0) + C_\varepsilon(t_j, t_0)$ respectively. The optimal predictor itself has a form analogous to (7.42):
$$Y_s^{\mathrm{T}} b = \hat{\beta}^{\mathrm{T}} f_0 + (Y_s^{\mathrm{T}} - \hat{\beta}^{\mathrm{T}} F^{\mathrm{T}}) Q_\delta c_{0\eta}. \qquad (7.55)$$
The predictive mean squared error of $Y_s^{\mathrm{T}} b$ can be written as
$$C_\eta(t_0, t_0) - c_{0\eta}^{\mathrm{T}} C_\delta^{-1} c_{0\eta} + A^{*\mathrm{T}} (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} A^{*}, \qquad (7.56)$$
where
$$A^{*} = f_0 - F^{\mathrm{T}} C_\delta^{-1} c_{0\eta}, \qquad (7.57)$$
or as
$$\gamma^{*\mathrm{T}}_{0\delta} \Gamma_\delta^{-1} \gamma^{*}_{0\delta} - B^{*\mathrm{T}} (F^{\mathrm{T}} \Gamma_\delta^{-1} F)^{-1} B^{*} - C_\varepsilon(t_0, t_0), \qquad (7.58)$$
where
$$B^{*} = f_0 - F^{\mathrm{T}} \Gamma_\delta^{-1} \gamma^{*}_{0\delta}. \qquad (7.59)$$
It is interesting to note that if $t_0$ is not one of the $t_j$, and if the noise process components are uncorrelated so that $C_\varepsilon(t_j, t_0) = 0$, then $b$ of (7.53), (7.54) is the same as $a$ of (7.40), (7.41). From the form (7.55), we can see that if $C_\eta(s, t)$ and the $f_l(t)$ are continuous, then so is the optimal linear predictor of $\mu_{t_0}$ as a function of $t_0$. When $t_0$ is one of the sampled points, the predictor of $\mu_{t_0}$ will not in general be equal to $Y_{t_0}$. Thus in the case of uncorrelated noise, continuous $f_l$, and continuous $C_\eta$, the optimal predictor of $Y_{t_0}$ will be discontinuous at the sampled points, but continuous elsewhere, as a function of $t_0$.

7.3.3 Estimation of the deterministic part of the trend

As has already been noted, the criterion of minimum variance $\mathcal{E}$-unbiased estimation for $\beta$ leads to
$$\hat{\beta} = (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} F^{\mathrm{T}} C_\delta^{-1} Y_s. \qquad (7.60)$$
The covariance matrix of the estimator is
$$(F^{\mathrm{T}} C_\delta^{-1} F)^{-1}. \qquad (7.61)$$
The corresponding estimator of the deterministic part of the trend surface at $t_0$, namely the estimator of $\beta^{\mathrm{T}} f_0$, is $\sum_{j=1}^n d_j Y_{t_j}$ or $Y_s^{\mathrm{T}} d$, where
$$d = C_\delta^{-1} F (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} f_0. \qquad (7.62)$$
The mean squared error of estimation of $\sum_{j=1}^n d_j Y_{t_j}$ is
$$f_0^{\mathrm{T}} (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} f_0. \qquad (7.63)$$

7.3.4 Estimating totals and means

As indicated in Section 7.1, estimating totals and means can be reduced to the prediction of quantities
$$\sum_{t \in U} \phi_t Y_t$$
in the case of $U$ discrete, and
$$\int_U \phi_t Y_t \, d\nu(t)$$
in the case of $U$ continuous. The same arguments as used for linear prediction of $Y_{t_0}$ and $\mu_{t_0}$ apply here also. The optimal linear unbiased predictor of $\sum_{t \in U} \phi_t Y_t$ or $\int_U \phi_t Y_t \, d\nu(t)$ will be of the form $Y_s^{\mathrm{T}} a_\phi$, where
$$a_\phi = C_\delta^{-1} c_{\phi\delta} + C_\delta^{-1} F (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} (f_\phi - F^{\mathrm{T}} C_\delta^{-1} c_{\phi\delta}) \qquad (7.64)$$
or
$$a_\phi = \Gamma_\delta^{-1} \gamma_{\phi\delta} + \Gamma_\delta^{-1} F (F^{\mathrm{T}} \Gamma_\delta^{-1} F)^{-1} (f_\phi - F^{\mathrm{T}} \Gamma_\delta^{-1} \gamma_{\phi\delta}), \qquad (7.65)$$
where
$f_\phi$ has $l$th component $\sum_{t \in U} \phi_t f_l(t)$ or $\int_U \phi_t f_l(t) \, d\nu(t)$;
$c_{\phi\delta}$ has $j$th component $\sum_{t \in U} \phi_t C_\delta(t_j, t)$ or $\int_U \phi_t C_\delta(t_j, t) \, d\nu(t)$;
$\gamma_{\phi\delta}$ has $j$th component $\sum_{t \in U} \phi_t \gamma_\delta(t_j, t)$ or $\int_U \phi_t \gamma_\delta(t_j, t) \, d\nu(t)$.
The predictive mean squared error of $Y_s^{\mathrm{T}} a_\phi$ is
$$C_\delta(\phi, \phi) - c_{\phi\delta}^{\mathrm{T}} C_\delta^{-1} c_{\phi\delta} + A_\phi^{\mathrm{T}} (F^{\mathrm{T}} C_\delta^{-1} F)^{-1} A_\phi, \qquad (7.66)$$
where
$$C_\delta(\phi, \phi) = \sum_{t \in U} \sum_{s \in U} \phi_t \phi_s C_\delta(t, s) \quad \text{or} \quad \int_U \int_U \phi_t \phi_s C_\delta(t, s) \, d\nu(s) \, d\nu(t), \qquad (7.67)$$
$$A_\phi = f_\phi - F^{\mathrm{T}} C_\delta^{-1} c_{\phi\delta},$$
or
$$\gamma_{\phi\delta}^{\mathrm{T}} \Gamma_\delta^{-1} \gamma_{\phi\delta} - B_\phi^{\mathrm{T}} (F^{\mathrm{T}} \Gamma_\delta^{-1} F)^{-1} B_\phi, \qquad (7.68)$$
where
$$B_\phi = f_\phi - F^{\mathrm{T}} \Gamma_\delta^{-1} \gamma_{\phi\delta}. \qquad (7.69)$$

7.3.5 Determining coefficients for predictors

To compute the optimal predictors of $Y_{t_0}$, $\mu_{t_0}$ and $\sum_{t \in U} \phi_t Y_t$ or $\int_U \phi_t Y_t \, d\nu(t)$, we would need to know the covariance structures of the $\varepsilon$ and $\eta$ processes. In practice, knowledge of the covariance structures is generally partial at best, and thus optimal predictors can seldom be used. However, even if in (7.40), (7.41) or their analogues, we have only approximations (not data-based) to $C_\delta^{-1}$, $c_{0\delta}$, $\Gamma_\delta^{-1}$, $\gamma_{0\delta}$, etc., the resulting predictors will still be linear and unbiased, and are still likely to be useful for practical purposes. Sometimes little efficiency is lost in these approximations. Stein (1988; 1990b) has studied the fixed domain asymptotic efficiency of linear predictions using an incorrect covariance matrix 'compatible' with the true one. (Covariance functions are compatible if their respective Gaussian measures are mutually absolutely continuous. See also Cressie (1993, p. 353).)

Useful (though not necessarily efficient) approximations may result from the assumption of covariance stationarity or stationary increments for the process, together with a simple form for a correlation function (e.g. (7.9)-(7.15), (7.17)) or a semivariogram (e.g. (7.20)-(7.23)). In one dimension, when the noise process $\varepsilon$ and the trend function $\sum_{l=1}^p \beta_l f_l(t)$ are absent and $\delta = \eta$ is Markov (e.g. an AR(1) process), the coefficients for predicting $Y_{t_0}$ are non-zero only for the sample values surrounding $t_0$. The resulting predictor is much simpler than in the general case. This suggests more generally trying to approximate $C_\delta^{-1} c_{0\delta}$ in (7.40) or $C_\delta^{-1} c_{0\eta}$ in (7.53) by coefficient vectors which have zero entries except for values corresponding to $t_j$ close to $t_0$. This approach is likely to be reasonably efficient if the $\delta$ process has small nugget effect and has approximately a low-order autoregressive correlation function. If the nugget effect is large, as it is if an uncorrelated noise process is the dominant component of $\delta$, a different kind of simplification emerges, since all the entries of $C_\delta^{-1} c_{0\delta}$ will be approximately $1/n$.

7.3.6 Estimating the uncertainty of prediction

Keeping in mind the kind of suboptimal estimators/predictors just discussed, we turn to the problem of estimating error. For a general linear unbiased predictor, an easily interpreted measure of uncertainty is an estimate of the predictive mean squared error. To focus on a specific case, let us suppose $\int_U \phi_t Y_t \, d\nu(t)$ is being estimated or predicted by $\sum_{j=1}^n a_j Y_{t_j}$ or $Y_s^{\mathrm{T}} a$, where the $a_j$ are suitably determined constants satisfying the constraints
$$\sum_{j=1}^n a_j f_l(t_j) = \int_U \phi_t f_l(t) \, d\nu(t), \quad l = 1, \ldots, p, \qquad (7.70)$$
with $f_1(t) \equiv 1$. Then the mean squared error of prediction is expressible either as
$$\sum_{j=1}^n \sum_{k=1}^n a_j a_k C_\delta(t_j, t_k) - 2\sum_{j=1}^n a_j \int_U \phi_t C_\delta(t_j, t) \, d\nu(t) + \int_U \int_U \phi_s \phi_t C_\delta(t, s) \, d\nu(t) \, d\nu(s) \qquad (7.71)$$
or as
$$-\sum_{j=1}^n \sum_{k=1}^n a_j a_k \gamma_\delta(t_j, t_k) + 2\sum_{j=1}^n a_j \int_U \phi_t \gamma_\delta(t_j, t) \, d\nu(t) - \int_U \int_U \phi_s \phi_t \gamma_\delta(t, s) \, d\nu(t) \, d\nu(s). \qquad (7.72)$$
Estimating these forms can in principle be handled through suitable estimates of the covariances $C_\delta(t, s)$ or the semivariograms $\gamma_\delta(t, s)$. In practice, since the sample covers only the spacings among $t_1, \ldots, t_n$, this means assuming simple parametric forms for the covariances or semivariograms, and estimating the parameters from the sample. Techniques for doing this include using quadratic estimating functions, 'modified maximum likelihood' (where the likelihood is based on the joint distributions of contrasts so as to be free of dependence on $\beta$), or cross-validation methods. Since (7.71) and (7.72) involve covariances for members of $U$ which are very close together, it is helpful for mean squared error estimation to have some clustering in the sample; this will obviously aid in the estimation of measurement variances and nugget effects. For detailed discussion of semivariogram estimation techniques see, for example, Stein (1990), Cressie (1993) and Laslett et al. (1987).

With stationary covariance structures, a spectral approach to error estimation may be fruitful, since the variance of a sample mean from a regular sample is expressible as a functional of the spectral density (Stein, 1993).
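As a concrete (and deliberately simple) illustration of the first step, the following Python sketch computes a binned empirical semivariogram from irregularly spaced one-dimensional data; the Matheron-type moment estimator, the bin width and the toy data are our assumptions, not a prescription from the text.

```python
import numpy as np

def empirical_semivariogram(t, y, bin_width=0.1, max_lag=1.0):
    """Binned moment estimator: gamma(h) ~ mean of (y_i - y_j)^2 / 2
    over pairs whose spacing |t_i - t_j| falls in the bin around h."""
    t, y = np.asarray(t), np.asarray(y)
    i, j = np.triu_indices(len(t), k=1)
    lags = np.abs(t[i] - t[j])
    sqdiff = 0.5 * (y[i] - y[j]) ** 2
    edges = np.arange(0.0, max_lag + bin_width, bin_width)
    centres, gammas = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (lags >= lo) & (lags < hi)
        if in_bin.any():
            centres.append(0.5 * (lo + hi))
            gammas.append(sqdiff[in_bin].mean())
    return np.array(centres), np.array(gammas)

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 1, 60))
y = np.sin(4 * t) + 0.2 * rng.standard_normal(60)   # toy data
print(empirical_semivariogram(t, y))
```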
Intuitively, when the purpose is prediction of $\int_U \phi_t Y_t \, d\nu(t)$, the choice of the family of deterministic trend components $f_l(t)$ is important. The mean function process was earlier expressed as
$$\mu_t = \sum_{l=1}^p \beta_l f_l(t) + \eta_t, \qquad (7.73)$$
while the response $Y_t$ was $\mu_t$ plus a noise term $\varepsilon_t$. The more of $\mu_t$ that we can think of as being captured by (fixed or random) combinations of $\{f_l(t), l = 1, \ldots, p\}$, the easier will be the assessment of prediction error, assuming $p$ remains small relative to $n$.

This being said, the difficulty with the preceding approach to error estimation remains. The assumptions and simple parametric forms for the corresponding structures are very likely to be oversimplifications, and oversimplifications that matter. Robust estimation of predictive mean squared error is difficult when the sample is fixed, particularly if it is far from evenly deployed, or not representative of all regions of the population $U$.
A pair of test populations constructed by Laslett et al. (1987) consists of soil pH measurements on a 'sample' 11 × 11 grid, and on an internal 8 × 8 grid consisting of points in the centres of the inner squares (Figure 7.3). The measured values appear to be consistent with the assumption of a stationary random field. Suppose we consider prediction at the internal grid points using the average values of pH for the four nearest neighbours among the sampling points. An average squared error of prediction can be calculated over the 64 internal points because the true values of soil pH at these points are known. It is then possible to test methods of estimating either prediction error, or average squared prediction error, using pH values from the sample grid only.

Methods based on estimating the semivariogram and then estimating (7.72) can be shown to yield underestimates of average squared prediction error in this case. A slight improvement is obtained if we adjust these upwards by a correction factor, the one which would work when we are predicting sampled values from nearest-neighbour sampled values in the same way (Figure 7.4).

Høst et al. (1995) have suggested estimating prediction or 'interpolation' error by local cross-validation, comparing true values with interpolation values at nearby sample points. Particularly when prediction or interpolation is performed locally, for example by averaging over nearest neighbours in a lattice sample, it makes intuitive sense to estimate error locally also.

7.3.7 Bayesian prediction

A Bayesian approach to prediction of response values and means is in


principle much more general than the approach using linear unbiased
predictors. The model components need not be additive, the response
variates need not be near-Gaussian, and prior uncertainty about the
covariance structure can be incorporated without the need for ad hoc
adjustments or approximations.
The Bayesian approach is more and more frequently used. It inspires
much of the methodology of image processing. It also has much poten-
tial for use in geostatistics. Stein (1992) has considered a 'truncated'
Gaussian random field model for spatial data which have a large frac-
tion of zero observations, such as might arise in mining, hydrology,
or pollution monitoring. He has shown that for this model, Bayesian
[Figure 7.3 Sites of soil pH measurement. Values at internal sites (○) are to be predicted from those at sampled sites (△).]

methods are better than kriging-based methods for summarizing the conditional distributions of the responses at unseen locations.

In the context appropriate for kriging, a Bayesian approach is consistent with kriging, and can be used to motivate the optimal linear predictors obtained earlier.

To illustrate, let us temporarily assume as before that the model (7.29) holds, but that the $\varepsilon$ and $\eta$ processes are Gaussian, and that $\beta$ is a priori multivariate normal with mean vector 0 and covariance matrix $a^2 V$. These assumptions determine the prior distribution or measure for $\{Y_t : t \in U\}$. With a squared error loss function, the Bayes predictors of $Y_{t_0}$, $\mu_{t_0}$ and $\sum_{t \in U} \phi_t Y_t$ or $\int_U \phi_t Y_t \, d\nu(t)$ are their posterior means, given the sample data. It can be shown that $\mathcal{E}(\beta \mid Y_s)$ is given
[Figure 7.4 The value at a sampled site (●) can be predicted from values at four diagonal neighbour sites (△), and prediction error noted.]

by
$$(F^{\mathrm{T}} C_\delta^{-1} F + a^{-2} V^{-1})^{-1} F^{\mathrm{T}} C_\delta^{-1} Y_s, \qquad (7.74)$$
that $\mathcal{E}(Y_{t_0} \mid Y_s)$ is given by
$$f_0^{\mathrm{T}} \mathcal{E}(\beta \mid Y_s) + Y_s^{\mathrm{T}} Q_{\delta a} c_{0\delta}, \qquad (7.75)$$
and that $\mathcal{E}(\mu_{t_0} \mid Y_s)$ is given by
$$f_0^{\mathrm{T}} \mathcal{E}(\beta \mid Y_s) + Y_s^{\mathrm{T}} Q_{\delta a} c_{0\eta}, \qquad (7.76)$$
where
$$Q_{\delta a} = C_\delta^{-1} - C_\delta^{-1} F (F^{\mathrm{T}} C_\delta^{-1} F + a^{-2} V^{-1})^{-1} F^{\mathrm{T}} C_\delta^{-1}.$$
As $a^2 \to \infty$, so that the prior for $\beta$ becomes more and more diffuse, these posterior means approach the optimal linear unbiased predictors derived earlier, and provide alternative justification for them.

Like the optimal linear unbiased predictors, the predictors (7.75) and (7.76) are derived assuming known covariance structures. When these are unknown, a Bayesian approach would put a suitable prior distribution on the parameters of $C_\varepsilon(t, s)$ and of $C_\eta(t, s)$, as well as on $\beta$. Again, the point estimators/predictors of $Y_{t_0}$, $\mu_{t_0}$ or $\int_U \phi_t Y_t \, d\nu(t)$ might be their posterior means, and an appropriate measure of uncertainty their posterior variances. With a sensible prior on the covariance structure, the computation of the posterior means and variances may not be easy, but the results should represent inference well (Cressie, 1993, Section 3.4.4).

7.3.8 Prediction of $\{\mu_t : t \in U\}$

The methods of the earlier part of this section are designed for pointwise prediction of $Y_{t_0}$ and $\mu_{t_0}$, and for prediction of global or local totals or means. However, as can be seen from (7.42) and (7.55), the curves or surfaces defined by the predictors of $\mu_{t_0}$ or $Y_{t_0}$ as $t_0$ varies do not actually look like sample functions of the processes $\{\mu_t : t \in U\}$ or $\{Y_t : t \in U\}$, having the smoothness of their covariance functions instead. For some applications, it may be appropriate to regard these surfaces as smoothed estimates of the actual sample functions.

However, for applications where estimating $\{\mu_t : t \in U\}$ as a function is important, it may be better to try to estimate all $\mu_t$ simultaneously, constraining the resulting function to be in some appropriate space. A convenient framework for doing this is a Bayesian approach. The model for $\{\mu_t : t \in U\}$ is regarded as a prior distribution on a space of possible sample functions. Observing the sample data $Y_{t_1} = y_{t_1}, \ldots, Y_{t_n} = y_{t_n}$ leads to a posterior distribution on the space of possible sample functions, and in principle we could estimate the mean function using a sample function which is highly plausible (in some sense) under the posterior distribution. This is essentially what is done in many image-processing applications (Ripley, 1988), where the restoration of the image $\{\mu_t : t \in U\}$ is important.

The choice of sample function to represent the posterior distribution is not entirely straightforward. The posterior mode is often sought in image-processing contexts, where sampling is dense. The posterior mean is typically not variable enough to be a good representative of the posterior function space.

In a related context, Devine et al. (1994) have considered a case that is equivalent to having $U$ finite and discrete, and sampling $Y_t$ for every $t$ in $U$, with the object of estimating $\{\mu_t : t \in U\}$. Under model (7.29) with $\beta$ known, and the assumptions that both the $\eta$ process and the $\varepsilon$ process are Gaussian, the posterior mean for the vector of $\mu_t$ values is
$$\hat{\mu} = F\beta + C_\eta (C_\eta + C_\varepsilon)^{-1} (Y - F\beta). \qquad (7.77)$$
Louis (1984) showed that the elements of the posterior mean vector $\hat{\mu}$ are likely to be not as variable about $F\beta$ as the true underlying components of $\mu$, and suggested addressing this problem by altering the estimator so that the expected sample variance of the elements of $\hat{\mu} - F\beta$ is close to a true residual variance. Following work of Ghosh (1992), Devine et al. (1994) have developed a constrained empirical Bayes estimate of $\mu$ (incorporating estimates of $\beta$, $C_\eta$ and $C_\varepsilon$) which seems better to reflect the underlying variability of $\mu$. The motivation is akin to the motivation for adjusting a smoothing parameter in nonparametric function estimation.

7.4 Global means and totals as integrals

In this section the focus shifts away from model-unbiased or Bayesian prediction of $Y_{t_0}$ or $\int_U \phi_t Y_t \, d\nu(t)$, and towards regarding the realized $Y_{t_0}$ as a value to be interpolated, and the realized $\int_U \phi_t Y_t \, d\nu(t)$ as an integral to be approximated given $Y_{t_j}$, $j = 1, \ldots, n$. The emphasis on serial and spatial autocorrelation is replaced to some extent by an emphasis on their consequences in terms of sample function properties for $\{Y_t : t \in U\}$ or $\{\mu_t : t \in U\}$ - their continuity and smoothness, or their slowness of variation. These are the properties which motivate the use of simple or adjusted interpolation estimators, local averaging or nonparametric smoothing, rather than optimal predictors, for approximating sample functions. Since the estimation techniques are borrowed from numerical approximation and integration methodology, more attention is then concentrated on the best choice of sampling scheme, and on properties of the strategies as the sample becomes increasingly large within a fixed domain. Although many of the ideas will apply as approximations in discrete populations $U$, we will take $U$ to be a bounded continuous subset of $R^d$ in the following survey of some of the literature.

7.4.1 One spatial dimension

For $U = [0, 1]$, assume $\{Y_t : t \in U\}$ to be integrable, and consider the problem of estimating the integral
$$I = \int_0^1 Y_t \, dt \qquad (7.78)$$
from values $Y_{t_j}$, $j = 1, \ldots, n$. The integral $I$ can represent either a total or a mean. If $s = \{t_1, \ldots, t_n\}$ comes from a randomized self-weighting design like simple random sampling or systematic sampling, the sample mean $\bar{Y}_s = \frac{1}{n}\sum_{j=1}^n Y_{t_j}$ is a natural estimate of $I$. In the case of any design with one sample unit in each of the cells $[0, \frac{1}{n}], (\frac{1}{n}, \frac{2}{n}], \ldots, (\frac{n-1}{n}, 1]$, using the sample mean is equivalent to approximating $Y_t$ by the value $Y_{t_j}$ for all $t$ in the cell of $t_j$, and then integrating. In particular, if $\{Y_t : t \in [0, 1]\}$ has a continuous second derivative, and stratified random sampling with one unit per cell is used, it can be shown that the standard deviation of the sample mean $\bar{Y}_s$ is of order $O(n^{-1})$ as $n \to \infty$. Thus in a fixed domain asymptotic sense, $\bar{Y}_s$ with stratified random sampling is more efficient than $\bar{Y}_s$ with simple random sampling; for the latter, the standard deviation is of order $O(n^{-1/2})$.

For slowly varying $Y_t$, it is intuitive that a better approximation to $Y_t$ would be obtained by continuous (e.g. linear) interpolation between the points of a centred systematic sample. For example, if $t_j = (j-1)/(n-1)$, $j = 1, \ldots, n$, then a linear interpolation should be closer than a step-function interpolation to the function $Y_t$. The trapezoidal rule resulting from a linear interpolation would give as an estimate of $I$
$$\hat{I}_{tr} = \frac{1}{n-1}\Big(\frac{1}{2}Y_{t_1} + \sum_{j=2}^{n-1} Y_{t_j} + \frac{1}{2}Y_{t_n}\Big). \qquad (7.79)$$
Although the interpolation is continuous, the estimator $\hat{I}_{tr}$ is close to the sample mean. In fact, the sample mean $\bar{Y}_s$ is equivalent to the modified trapezoidal estimator
$$\frac{Y_{t_1}}{2n} + \frac{1}{n}\Big(\frac{1}{2}Y_{t_1} + \sum_{j=2}^{n-1} Y_{t_j} + \frac{1}{2}Y_{t_n}\Big) + \frac{Y_{t_n}}{2n} \qquad (7.80)$$
associated with $t_j = \frac{1}{2n} + \frac{j-1}{n}$, $j = 1, \ldots, n$.


Use of a self-weighting randomized sampling design can be viewed as implementing a kind of Monte Carlo integration, of which design unbiasedness and general applicability are very appealing features. They can be contrasted with the use of a purposive centred systematic sample with interpolation, which may incur a bias (in a model sense) if $\{Y_t : t \in [0, 1]\}$ has a nonlinear underlying trend. On the other hand, the error in $\hat{I}_{tr}$, conditional on the realized function, will be of order $O(n^{-2})$ as $n \to \infty$ if $\{Y_t : t \in [0, 1]\}$ has a continuous second derivative. Thus the convergence is tighter with a centred design and interpolation than with a randomized design and the sample mean in this case.
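A small simulation sketch in Python (ours; the smooth test function and the value of $n$ are arbitrary) contrasts the three strategies just discussed: SRS with the sample mean, stratified sampling with one point per cell, and the centred systematic design whose sample mean is the modified trapezoidal estimator (7.80).

```python
import numpy as np

rng = np.random.default_rng(7)
g = lambda t: np.sin(2 * np.pi * t) + t**2      # smooth test 'realization'
I_true = 1.0 / 3.0                              # integral of g over [0, 1]

def srs_mean(n):
    return g(rng.uniform(0, 1, n)).mean()

def stratified_mean(n):
    # One uniform point in each cell ((j-1)/n, j/n].
    t = (np.arange(n) + rng.uniform(0, 1, n)) / n
    return g(t).mean()

def centred_systematic_mean(n):
    # t_j = 1/(2n) + (j-1)/n; this estimator is deterministic, so its
    # root mean squared error below is just the absolute model bias.
    t = (2 * np.arange(n) + 1) / (2 * n)
    return g(t).mean()

n, reps = 50, 2000
for est in (srs_mean, stratified_mean, centred_systematic_mean):
    errs = np.array([est(n) - I_true for _ in range(reps)])
    print(est.__name__, np.sqrt((errs**2).mean()))
```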
It is well known that the efficiency of Monte Carlo integration can be increased by making the density of sampled points approximately proportional to the integrand. In the present context, choosing points $t_1, \ldots, t_n$ randomly from $[0, 1]$ with density $h$ leads to a design-unbiased estimator analogous to the HT estimator:
$$\hat{I}_{HT} = \frac{1}{n}\sum_{j=1}^n \frac{Y_{t_j}}{h(t_j)}. \qquad (7.81)$$
Clearly $\hat{I}_{HT}$ is exact if $Y_t \propto h(t)$ on $[0, 1]$. If the proportionality is only approximate, $\hat{I}_{HT}$ will be better than the sample mean, but under random selection of points it too will have a standard deviation of order $O(n^{-1/2})$ as $n \to \infty$.

Benhenni and Cambanis (1992) have studied the use of an analogous purposive design which takes $t_j$ such that
$$\int_0^{t_j} h(t) \, dt = \frac{j - 1}{n - 1}, \quad j = 1, \ldots, n. \qquad (7.82)$$
They have considered improvements involving higher-order derivatives to a modified trapezoidal estimator
$$\hat{I}_n(h) = \frac{1}{n-1}\left(\frac{Y_{t_1}}{2h(t_1)} + \sum_{j=2}^{n-1}\frac{Y_{t_j}}{h(t_j)} + \frac{Y_{t_n}}{2h(t_n)}\right), \qquad (7.83)$$
which itself will converge to $I$ at rate $O(n^{-2})$ when $Y_t$ and $h(t)$ are sufficiently smooth. In the case where $\{Y_t : t \in [0, 1]\}$ is a mean-zero process with continuous covariance function $C(s, t) = \mathcal{E}(Y_s Y_t)$, and has exactly $K$ quadratic mean derivatives ($K = 0, 1, 2, \ldots$), then under further regularity conditions, their improved estimator has predictive mean squared error of order $O(n^{-2K-2})$. Thus even when $\{Y_t : t \in [0, 1]\}$ is a Wiener process, having no quadratic mean derivative, the error in the estimator, coinciding for $K = 0$ with $\hat{I}_{tr}$, will have model standard deviation $O(n^{-1})$.

7.4.2 More than one spatial dimension

When $U$ has more than one dimension, methods for approximating the integral
$$I = \int_U Y_t \, dt \qquad (7.84)$$
through Monte Carlo integration often involve confining each sample point to a hypercube cell of $U$, as with the analogues of systematic sampling or Latin hypercube sampling. For a design which is self-weighting on $[0, 1]^d$, the sample mean will be a design-unbiased estimator of $I$. There is an associated approximation of $Y_t$ within each cell by its value at the sampled point in the cell. A purposive centred systematic sample is also appealing, however. Stein (1993) has shown that if $\{Y_t : t \in [0, 1]^d\}$ is a stationary random field with spectral density $f(\lambda)$, and if $[0, 1]^d$ is divided into $m^d$ cells with one sample point at the centre of each, then the mean squared error of a predictor of $I$ can be made to be of order $O(m^{-p})$, where $d < p$ and $f(\lambda)$ is of order $|\lambda|^{-p}$ as $|\lambda| \to \infty$. The number $p$ is an indicator of smoothness of the sample functions. The predictor used can be the sample mean for $p \le 4$, and is an edge-corrected sample mean for $p > 4$. Both are model-unbiased under the stationarity assumption. Because they are analogous to the trapezoidal estimator (7.80) in one dimension, it may be conjectured that predictors derived from continuous interpolation in higher dimensions would have similar properties. See Laslett et al. (1987) for a discussion of continuous and smooth interpolators for $d = 2$.

Thus prediction (of $Y_{t_0}$ or of $I$) via continuous interpolation works well when the sample functions of $\{Y_t : t \in U\}$ are sufficiently smooth or slowly varying. Continuous interpolation may not be advisable to the same degree when the sample functions are likely to be rough or spiky. As we have seen, the function or surface defined by kriging prediction is discontinuous at the sample points. Laslett et al. (1987) have illustrated the superiority (for pointwise prediction) of kriging and linear spline smoothing over continuous interpolation methods, for grid sampling of a two-dimensional test population of soil pH values with an apparently non-zero nugget effect.

If the sample points are of necessity non-random and irregularly spaced, the idea of integrating a sample function approximation, obtained by interpolating or smoothing, takes on added importance. An example where $U$ is one-dimensional concerns the estimation of the milk yield of a cow over a 305-day lactation cycle, from measurements of yield taken at about eight irregularly spaced days in the cycle. Bartlett (1986) compared the performance, over several test populations, of (i) an interpolation (trapezoidal) estimator with end corrections and (ii) kriging estimators where the trend functions fitted included linear splines with two knots and combinations of gamma curves. Despite the roughness of the population sample functions, the simple interpolation estimator (traditionally used by Agriculture Canada) turned out to be more efficient and robust as an estimator of total yield than the kriging estimators. Essentially, the linear interpolation had greater flexibility to follow the actual trend than either the linear spline or gamma family of functions.
When $d = 2$, an approach sometimes taken is to define an 'area of influence' for each sample point $t_j$, throughout which $Y_t$ is taken to be equal to $Y_{t_j}$. Then the estimator of $I$ of (7.84) is the sample sum of the $Y_{t_j}$, weighted by the measures of their areas. A typical choice of areas is defined by the Voronoi polygons of the sample: the Voronoi polygon for $t_j$ is the set of points $t$ which are closer to $t_j$ than to any other sample point (Okabe et al., 1992). The analogue of this method in one dimension, for $U = [0, 1]$, is equivalent to using a trapezoidal rule like (7.80) for an unevenly spaced sample. With uneven spacing the estimator will be different from the sample mean, and will perform better if $\{Y_t : t \in U\}$ is smooth.
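In one dimension the 'area of influence' weights are just the lengths of the intervals bounded by midpoints between adjacent sample points, as in this short Python sketch (ours, with arbitrary sample points and test function):

```python
import numpy as np

def influence_weights_1d(t):
    """Lengths of the 'areas of influence' on [0, 1] for sample
    points t: cell boundaries are midpoints between neighbours."""
    t = np.sort(np.asarray(t, dtype=float))
    bounds = np.concatenate(([0.0], (t[:-1] + t[1:]) / 2, [1.0]))
    return np.diff(bounds)

t = np.array([0.05, 0.2, 0.55, 0.6, 0.9])
w = influence_weights_1d(t)
y = np.sin(2 * np.pi * t)
print(w, w.sum())          # weights sum to 1
print(np.dot(w, y))        # weighted estimate of the integral
```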

7.5 Choosing purposive samples

When sampling is purposive, the choice of sample will depend on the aim of the study, prior knowledge about the population responses, and the relative importance of efficiency, robustness and error estimation. For searching or monitoring, sequential choice of sample points may be most appropriate. For mapping an unknown territory we might be interested in covering the population as evenly as possible. For estimating population totals or means when there are previous response values available, we might want to sample more densely in the areas where $Y_t$ is larger or more variable.

Making sample points evenly spaced facilitates computation of estimates or predictions, and estimation of stationary covariance structures. Partial clustering of sample points makes estimation and prediction less efficient, but assists the estimation of small-scale variation and noise, as separate from a more slowly varying random drift.

Some formal analyses of optimal sample determination have been carried out. A relatively early example is the work of Blight (1973), in connection with minimizing the model mean squared error of $\bar{Y}_s$ as an estimator of the population mean of $Y$ for a discrete one-dimensional population. For a stationary model with exponential autocorrelation $\rho(|h|) = \exp\{-\lambda|h|\}$, $\lambda > 0$, he showed that the optimal purposive sample was a centred systematic sample, with sampling interval and end interval lengths depending on the value of $\lambda$.

A similar problem in two dimensions has been described by Okabe et al. (1992), in determining optimal observation points for estimating the total quantity in a plane region of a spatial variable like precipitation or NO$_x$ density. They have reported on work by Hori and Nagata (1985), who modelled $Y_t$ as $\mu_t + \varepsilon_t$, where $\{\mu_t : t \in U\}$ is a known function and $\{\varepsilon_t : t \in U\}$ is noise with correlation structure $C_\varepsilon(s, t)$. These authors
developed a method for determining a sample of $n$ points to minimize $\mathcal{E}(\hat{I} - I)^2$, where
$$I = \int_U Y_t \, dt$$
and $\hat{I} = \sum_{j=1}^n A_j Y_{t_j}$, $A_j$ being the area of the Voronoi polygon of $t_j$ with respect to the sample. They applied it to the determination of 16 points for monitoring NO$_x$ density in Kyoto, Japan.
More commonly, the problem of optimal choice of sample for pointwise prediction is considered. Sacks et al. (1989b) have discussed optimality criteria. The integrated mean squared error (IMSE) criterion is the minimization of an integral
$$\int_U \mathcal{E}(\hat{Y}_t - Y_t)^2 \, d\nu(t),$$
where $\hat{Y}_t$ is usually a kriging predictor of $Y_t$ from $Y_s = (Y_{t_1}, \ldots, Y_{t_n})^{\mathrm{T}}$. The maximum mean squared error (MMSE) criterion is the minimization of
$$\max_{t \in U} \mathcal{E}(\hat{Y}_t - Y_t)^2.$$
Sacks and Schiller (1988) have shown how to find the points of samples of given size to minimize the IMSE or the MMSE for discrete populations with two or more dimensions. Yfantis et al. (1987) have applied an MMSE criterion to the comparison of regular grids in two dimensions, for kriging prediction at centre points of grid regions; in their study, as long as the nugget effect was small, the equilateral triangular grid seemed most efficient, and was seen to give the most reliable estimate of the semivariogram; for large nugget effect, a hexagonal grid appeared to be most efficient.

For continuous populations, Sacks et al. (1989b) have stated a preference for the IMSE criterion since the MMSE criterion 'involves a d-dimensional optimization of a function with numerous local optima at every iteration of a given design-optimization algorithm'. Some specific problems have been solved with the IMSE criterion. For example, Sacks et al. (1989a) found sample points in two dimensions minimizing the IMSE for
$$Y_t = \mu_t + \varepsilon_t,$$
where $\mu_t = \sum_{i=1}^p \beta_i f_i(t)$ and the error process was an Ornstein-Uhlenbeck process. Su and Cambanis (1993) considered the more general case in one dimension of a non-random $\mu_t$ and an error process $\{\varepsilon_t : t \in [0, 1]\}$ with zero mean and known covariance function. The mean function process $\{\mu_t : t \in [0, 1]\}$ was either known, or of the form $\sum_{i=1}^p \beta_i f_i(t)$, or unknown. They found samples which were asymptotically optimal as $n \to \infty$ in the sense of minimizing the IMSE, when kriging predictors or simple linear interpolators were used.

7.6 Choice of randomized sampling designs

Randomized sampling designs tend to be most relevant in the context of estimating population functions like totals and means. It was noted by Cochran (1946), in perhaps the first formal use of expected sampling variance (Section 5.5.2) as an optimality criterion, that systematic sampling with $\bar{Y}_s$ is more efficient than SRS with $\bar{Y}_s$ when $(Y_1, \ldots, Y_N)$ is a stationary series with positive serial correlation. Combining several variants and extensions of this result, Hájek (1959) established an optimality property of systematic sampling from a time series. Because of its beautiful proof, we will reproduce the result here. The optimality actually applies to Madow's ordered systematic procedure of Section 2.8, and to ordinary systematic sampling as a special case.

To emphasize the fact that the population $U$ is discrete and labelled by integers, we will temporarily use subscripts $j$ and $k$ instead of $s$ and $t$ to designate population units. As before, $E_p$ will denote expectation with respect to the sampling design, and $\mathcal{E}$ expectation with respect to the model.

THEOREM 7.1: Suppose $a_1, \ldots, a_N$ are known positive numbers, and that $Y_1/a_1, \ldots, Y_N/a_N$ form the initial segment of a stationary process, with
$$\mathcal{E}(Y_j/a_j) = \mu, \quad \mathrm{Cov}(Y_j/a_j, Y_k/a_k) = \sigma^2 \rho(|k - j|), \qquad (7.85)$$
and $\rho(0) = 1$. Suppose that the autocorrelation function $\rho(u)$ is a convex function of integers $u \ge 0$. That is, for every $u \ge 1$,
$$\Delta_u = \rho(u + 1) - 2\rho(u) + \rho(u - 1) \ge 0. \qquad (7.86)$$
We assume that it has been decided that
$$e = \frac{A}{n}\sum_{j \in s} Y_j/a_j \qquad (7.87)$$
will be used as an estimator of $T_y$, where $A = \sum_{j=1}^N a_j$ and $n$ is the size of sample $s$. To guarantee design unbiasedness, we require the sampling design $\{p(s) : s \in \mathcal{S}\}$ to have fixed size $n$ and inclusion probabilities $\pi_j = na_j/A$, and assume all $na_j/A \le 1$. Then the sampling design minimizing $\mathcal{E}E_p(e - T_y)^2$
is Madow's design. That is, the optimal design selects $r$ from a distribution uniform on $[0, 1/n]$, and includes $j$ in the sample $s$ if, for some $l \in \{0, \ldots, n - 1\}$, $j$ is the least integer such that
$$r + \frac{l}{n} \le \sum_{k=1}^j a_k/A. \qquad (7.88)$$

Proof: The first step in Hájek's proof is to show that, for every $j$ and $k$,
$$\rho(|k - j|) = \sum_{u=1}^{\infty} \sum_{w=1}^{N+u-1} \Delta_u q_{juw} q_{kuw}, \qquad (7.89)$$
where
$$q_{juw} = \begin{cases} 1 & \text{if } w - u < j \le w \\ 0 & \text{otherwise,} \end{cases} \qquad (7.90)$$
and $\Delta_u$ is given by (7.86). We leave this step to the reader. Now since
$$e - T_y = \frac{A}{n}\sum_{j=1}^N \frac{Y_j}{a_j}(I_{js} - \pi_j), \qquad (7.91)$$
where
$$I_{js} = \begin{cases} 1 & \text{if } j \in s \\ 0 & \text{if } j \notin s, \end{cases} \qquad (7.92)$$
we have $\mathcal{E}(e - T_y) = (A\mu/n)\sum_{j=1}^N (I_{js} - \pi_j) = 0$ for any sample of size $n$. Thus
$$\mathcal{E}E_p(e - T_y)^2 = E_p\mathcal{E}(e - T_y)^2 = E_p \mathrm{Var}(e - T_y),$$
which can be written as
$$(A^2/n^2)\, E_p\Big(\sum_{j=1}^N \sum_{k=1}^N \sigma^2 \rho(|k - j|)(I_{js} - \pi_j)(I_{ks} - \pi_k)\Big). \qquad (7.93)$$
From (7.93) and (7.89) it is easily shown that
$$\mathcal{E}E_p(e - T_y)^2 = (A^2/n^2)\,\sigma^2 \sum_{u=1}^{\infty} \sum_{w=1}^{N+u-1} \Delta_u \,\mathrm{Var}_p\Big(\sum_{j=1}^N I_{js} q_{juw}\Big). \qquad (7.94)$$
Since each $\Delta_u \ge 0$, we minimize $\mathcal{E}E_p(e - T_y)^2$ if we minimize for each $u$, $w$ the variance of
$$n_{uw}(s) = \sum_{j=1}^N I_{js} q_{juw} = \text{number of } j \text{ in } s \text{ such that } w - u < j \le w.$$
The constraint fixing the inclusion probabilities of the design implies that $E_p n_{uw}(s)$ is determined, as $\lambda_{uw} = \sum_{j=1}^N \pi_j q_{juw}$. The variance $\mathrm{Var}_p n_{uw}(s)$ will clearly be minimized subject to this constraint if $n_{uw}(s)$ takes only the values $[\lambda_{uw}]$ and $[\lambda_{uw}] + 1$, with respective probabilities $1 - \lambda_{uw} + [\lambda_{uw}]$ and $\lambda_{uw} - [\lambda_{uw}]$, where $[\lambda_{uw}]$ is the greatest integer $\le \lambda_{uw}$. Since Madow's design can be seen to satisfy this condition, it is optimal, and the theorem is proved.
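A short Python sketch (ours, with arbitrary size measures) of Madow's ordered systematic selection with inclusion probabilities $\pi_j = n a_j / A$, as described around (7.88):

```python
import numpy as np

rng = np.random.default_rng(11)

def madow_systematic(a, n):
    """Madow's ordered systematic procedure: select n units with
    inclusion probabilities n * a_j / A (assumed all <= 1)."""
    a = np.asarray(a, dtype=float)
    cum = np.cumsum(a) / a.sum()          # cumulative a_k / A
    r = rng.uniform(0.0, 1.0 / n)
    targets = r + np.arange(n) / n        # r + l/n, l = 0, ..., n-1
    # For each target, the least j with cum[j] >= target.
    return np.searchsorted(cum, targets) + 1   # 1-based unit labels

a = np.array([2, 1, 3, 1, 2, 2, 1, 3, 2, 3])
print(madow_systematic(a, n=4))
```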
Harvey (1976, Theorem 3.4.1) has shown that minimizing $\mathrm{Var}_p n_{uw}(s)$ for every $u$ and $w$ implies maximizing the expected sum of distances
$$\sum_{j \in s}\sum_{k \in s} |k - j|$$
between units in the sample, and also maximizing the minimum distance between any two units in any sample in the design. Thus we have established optimality of a design which gives maximal dispersion of sample points in two senses. Hájek's result has no broad generalization to higher dimensions, and this is connected with the fact that in higher dimensions the two maximal dispersion criteria lead to different designs. That is, suppose $U$ is a subset of a two-dimensional integer lattice, and the $Y_t$ have a common mean with
$$\mathrm{Corr}(Y_s, Y_t) = \exp\{-a\|t - s\|\} \qquad (7.95)$$
for some $a > 0$. Then, as Harvey (1976) has shown, the optimal design for unbiased estimation of $T_y$ by $N\bar{Y}_s$ will be the one which maximizes the expected sum of distances between sample points if $a$ is small enough; on the other hand, if $a$ is large, it will be the one which maximizes the overall minimum distance between sample points. Since in two dimensions these criteria yield different designs in general, the optimal design will depend on the value of $a$.

Matérn (1986) compared several sampling designs for $U$ a continuous rectangle ($d = 2$), when $\{Y_t : t \in U\}$ was a stationary isotropic random field with
$$\rho_z(h) = \exp\{-a\|h\|\} \qquad (7.96)$$
for some $a > 0$. In his numerical comparisons (p. 83), systematic sampling at points of an equilateral triangular network tended to give slightly smaller variances than systematic sampling at points of a square grid, with the same density of sample points per unit area. Both these designs were superior to Latin square sampling, and to stratified sampling with square strata, for the values of $a$ examined. However, it is the triangular grid which for a given density maximizes the minimum distance between sample points, and Matérn noted that the triangular grid will therefore be best whenever we have a rapidly decreasing stationary isotropic correlation function (e.g. as in (7.96) with large $a$).
Analogues of Hájek's result in two dimensions have been formulated and proved by Bellhouse (1977). We can think of the 'systematic' designs he considered for a sample of size $n = m_1 m_2$ as dividing a rectangular $U$ with sides $M_1$ and $M_2$ into rectangles of sides $M_1/m_1$ and $M_2/m_2$, and choosing one point in each rectangle, as described in Section 7.2. Considered by itself, each point comes from a distribution uniform in its rectangle. Following Quenouille (1949), Bellhouse (1977) proved optimality results for 'systematic' designs in which the sample points are aligned in both directions, aligned in only one direction, or unaligned (Figure 7.1). The analogues of the convexity condition are fairly restrictive, and are not satisfied, for example, by the exponential correlation function of (7.96). They are satisfied by the correlation functions of 'linear-by-linear' lattice processes (Quenouille, 1949; Martin, 1979), for which
$$\rho(h_1, h_2) = \rho_1(h_1)\,\rho_2(h_2)$$
and $\rho_1$, $\rho_2$ are convex autocorrelation functions. See also Bellhouse (1988b) for a comprehensive survey of systematic sampling and optimality results.
With increased computing power, it is possible to determine restricted
randomized sampling designs for specific purposes. For example, Muk-
erjee and Sengupta (1990) have determined self-weighting fixed size
designs for $U = \{1, \ldots, N\}$ which are 'trend-free' in the sense of sat-
isfying
$$ \sum_{j \in s} j = \frac{n}{2}(N + 1) \qquad (7.97) $$
for every sample $s$ with $p(s) > 0$. For these designs the expansion
estimator $N\bar{y}_s$ of $T_y$ is model-unbiased as well as design-unbiased in
the presence of a linear trend, such that

$$ \mathcal{E} Y_j = \alpha + \beta j, \qquad j = 1, \ldots, N. \qquad (7.98) $$
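Indeed, under (7.98), for any sample $s$ satisfying (7.97) we have
$$ \mathcal{E}(N\bar{y}_s) = \frac{N}{n}\sum_{j \in s}(\alpha + \beta j) = N\alpha + \frac{N\beta}{n}\cdot\frac{n}{2}(N+1) = N\alpha + \beta\,\frac{N(N+1)}{2} = \mathcal{E}\,T_y, $$
whatever the values of $\alpha$ and $\beta$.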

Choosing from among these designs for efficiency and ease of error
estimation is an interesting theoretical problem.

Should a randomized sampling design be chosen?


We have seen that the approach of minimizing model mean squared er-
ror favours purposive sampling, and so to some extent does the sample
function approximation approach of Section 7.4. Thus it is worth re-
viewing some reasons for choosing randomized rather than purposive
designs for estimating spatial population means and totals.
First, randomized sampling is objective and easy to replicate in prin-
ciple. It admits the use of design-unbiased or design-consistent estima-
tors, and if these can be chosen to be model-unbiased as well, then they
will have inferential meaning.
Second, designs such as simple random sampling and stratified ran-
dom sampling admit unbiased or consistent estimators of design vari-
ance or design mean squared error. Purposive sampling does not, nor
does any randomized design which is not 'measurable' in the sense
of having all joint inclusion probabilities $\pi_{jk}$ or $\pi_{st}$ greater than 0.
Thus the optimal designs in this section, and the designs described
in Section 7.2 which are highly efficient because they assign at most
one unit per cell, do not support design-based error estimation. Nor
do they support model-based error estimation very well in the pres-
ence of non-negligible autocorrelation of unknown form. The reason
is closely related, namely that population units which are near to each
other have no chance of being included together in the sample unless
they are in different cells. Thus there is a scarcity of information about
autocorrelation at small distances.
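The point about vanishing joint inclusion probabilities is easy to check numerically. In the small sketch below (our own illustration, not the book's notation), under 1-in-$k$ systematic sampling of $U = \{0, \ldots, N-1\}$ with $k$ dividing $N$, two units appear together only when their labels are congruent modulo $k$, so most $\pi_{jk}$ are zero and unbiased design-based variance estimation is unavailable:

```python
import numpy as np

def joint_inclusion_systematic(N, k):
    # pi[j, l] = P(j and l both sampled) under 1-in-k systematic
    # sampling with a uniform random start r in {0, ..., k-1}
    pi = np.zeros((N, N))
    for r in range(k):                     # each start has probability 1/k
        s = np.arange(r, N, k)
        pi[np.ix_(s, s)] += 1.0 / k
    return pi

pi = joint_inclusion_systematic(N=12, k=3)
off_diag = pi[np.triu_indices(12, k=1)]
print((off_diag == 0).mean())   # about 0.73: most pairs can never co-occur
```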
If simple random sampling is used with a spatial population, the
sample mean $\bar{y}_s$ and the expansion estimator $N\bar{y}_s$ are the simplest
estimators of mean and total respectively, but, as we have seen in
Section 7.4, there may be better point estimators. Furthermore, the usual
estimators of $\mathrm{Var}_p(\bar{y}_s)$ and $\mathrm{Var}_p(N\bar{y}_s)$ will overestimate the uncertainty
inherent in inference from the sampled values. Rather than adopt this
kind of draw-sequential randomization, it seems preferable to look for
measurable designs with samples having some regularity of spacing.
Sometimes geographic stratification should be considered. Stratification
into cells, with at least two units selected per cell, may lead to design-
based intervals which are good approximations to inference. If this
is not practical, another possibility might be stratification into blocks
of cells, followed by two-stage sampling of units, treating the cells as
PSUs. On the other hand, for the purpose of modelling the population
covariance structure, replicated systematic sampling or replicated grid
sampling would be advantageous, and would allow both model- and
design-based error estimation.
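As a closing illustration of the replicated idea, here is a minimal sketch under our own assumptions (equal-probability 1-in-$k$ systematic replicates, with $k$ dividing $N$; names and details are ours): independent random starts give independent design-unbiased estimates of the total, and their spread gives a design-based variance estimate.

```python
import numpy as np

def replicated_systematic_total(y, k, r, rng=None):
    # r independent 1-in-k systematic replicates from y[0..N-1];
    # assumes k divides N, so each unit has inclusion probability 1/k
    # within a replicate, making k * sum a design-unbiased estimator.
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    t_hat = np.array([k * y[rng.integers(k)::k].sum() for _ in range(r)])
    est = t_hat.mean()                     # combined estimate of T_y
    var = t_hat.var(ddof=1) / r            # design-based variance of est
    return est, var
```

The same replicate-to-replicate spread can also be set against a fitted covariance model, which is what makes such designs useful for model-based error estimation as well.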
References

Babu, G. J. and Singh, K. (1985) Edgeworth expansions for sampling without
replacement from finite populations. Journal of Multivariate Analysis 17,
261-278.
Bartlett, R. F. (1986) Sampling a finite population in the presence of trend
and correlation: estimation of total 305-day lactation production in cattle.
Canadian Journal of Statistics 14, 201-210.
Basu, D. (1958) On sampling with and without replacement. Sankhya 20, 287-
294.
Basu, D. (1971) An essay on the logical foundations of survey sampling, part
one. In: V. P. Godambe and D. A. Sprott (eds) Foundations of Statistical
Inference. Holt, Rinehart and Winston, Toronto, 203-242.
Bellhouse, D. R. (1977) Optimal designs for sampling in two dimensions.
Biometrika 64, 605-611.
Bellhouse, D. R. (1988a) A brief history of random sampling methods. In: P.
R. Krishnaiah and C. R. Rao (eds) Handbook of Statistics, Vol. 6. North-
Holland, Amsterdam, 1-14.
Bellhouse, D. R. (1988b) Systematic sampling. In: P. R. Krishnaiah and C.
R. Rao (eds) Handbook of Statistics, Vol. 6. North-Holland, Amsterdam,
125-145.
Benhenni, K. and Cambanis, S. (1992) Sampling designs for estimating inte-
grals of stochastic processes. Annals of Statistics 20, 161-194.
Bickel, P. J. (1992) Inference and auditing: the Stringer bound. International
Statistical Review 60, 197-210.
Bickel, P. J. and Freedman, D. A. (1984) Asymptotic normality and the boot-
strap in stratified sampling. Annals of Statistics 12, 470-482.
Bickel, P. J., Nair, V. N. and Wang, P. C. C. (1992) Nonparametric inference
under biased sampling from a finite population. Annals of Statistics 20, 853-
878.
Binder, D. A. (1983) On the variances of asymptotically normal estimators
from complex surveys. International Statistical Review 51, 279-292.
Binder, D. A. (1992) Fitting Cox's proportional hazards model from survey
data. Biometrika 79, 139-147.
Binder, D. A. and Patak, Z. (1994) Use of estimating functions for estimation
from complex surveys. Journal of the American Statistical Association 89,
1035-1043.
Blight, B. J. N. (1973) Sampling from an autocorrelated finite population.
Biometrika 60, 375-385.
Blyth, C. R. (1986) Approximate binomial confidence limits. Journal of the
American Statistical Association 81, 843-855.
Booth, J. G., Butler, R. W. and Hall, P. (1994) Bootstrap methods for finite
populations. Journal of the American Statistical Association 89, 1282-1289.
Breslow, N. E. and Clayton, D. G. (1993) Approximate inference in generalized
linear mixed models. Journal of the American Statistical Association 88, 9-
25.
Brewer, K. R. W. (1963) Ratio estimation and finite populations: some re-
sults deducible from the assumption of an underlying stochastic process.
Australian Journal of Statistics 5, 93-105.
Brewer, K. R. W. and Hanif, M. (1983) Sampling with Unequal Probabilities.
Springer-Verlag, New York.
Cassel, C. M., Särndal, C. E. and Wretman, J. H. (1977) Foundations of Infer-
ence in Survey Sampling. Wiley, New York.
Chambers, R. L. (1986) Outlier robust finite population estimation. Journal of
the American Statistical Association 81, 1063-1069.
Chambers, R. L. and Dunstan, R. (1986) Estimating distribution functions from
survey data. Biometrika 73, 597-604.
Chambers, R. L., Dorfman, A. H. and Hall, P. (1992) Properties of estimators
of the finite population distribution function. Biometrika 79, 577-582.
Chao, M. T. (1982) A general purpose unequal probability sampling plan.
Biometrika 69, 653-656.
Chaudhuri, A. and Vos, J. W. E. (1988) Unified Theory and Strategies of Survey
Sampling. North-Holland, Amsterdam.
Chen, J. and Qin, J. (1993) Empirical likelihood estimation for finite populations
and the effective use of auxiliary information. Biometrika 80, 107-116.
Chen, J. and Sitter, R. R. (1993) Edgeworth expansion and the bootstrap for
stratified sampling without replacement from a finite population. Canadian
Journal of Statistics 21, 347-357.
Chen, X.-H., Dempster, A. P. and Liu, J. S. (1994) Weighted finite population
sampling to maximize entropy. Biometrika 81, 457-470.
Christakos, G. (1984) On the problem of permissible covariance and variogram
models. Water Resources Research 20, 251-265.
Cochran, W. G. (1946) Relative accuracy of systematic and stratified random
samples for a certain class of populations. Annals of Mathematical Statistics
17, 164-177.
Cochran, W. G. (1977) Sampling Techniques, 3rd edition. Wiley, New York.
Connor, W. S. (1966) An exact formula for the probability that two specified
sample units will occur in a sample drawn with unequal probabilities without
replacement. Journal of the American Statistical Association 61, 384-390.
Cox, D. R. (1968) Some sampling problems in technology. In: N. L. Johnson
and H. Smith (eds) New Developments in Survey Sampling. Wiley, New
York, 506-527.
Cox, D. R. (1972) Regression models and life-tables (with discussion). Journal
of the Royal Statistical Society B 34, 187-220.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. Chapman & Hall,
London.
Cox, D. R. and Snell, E. J. (1979) On sampling and the estimation of rare
errors. Biometrika 66, 125-132.
Cramer, H. (1946) Mathematical Methods of Statistics. Princeton University
Press, Princeton.
Cramer, H. and Leadbetter, M. R. (1967) Stationary and Related Stochastic
Processes. Wiley, New York.
Cressie, N. A. C. (1993) Statistics for Spatial Data, revised edition. Wiley, New
York.
Daniels, H. E. (1954) Saddlepoint approximations in statistics. Annals of Math-
ematical Statistics 25, 631-650.
Daniels, H. E. (1987) Tail probability approximations. International Statistical
Review 55, 37-48.
De Finetti, B. (1931) Funzione caratteristica di un fenomeno aleatorio. Memorie
dell'Accademia Nazionale dei Lincei Ser. 6, 4, 251-299.
Dean, C. B. (1991) Estimating equations for mixed Poisson models. In: V.
P. Godambe (ed.) Estimating Functions. Oxford University Press, Oxford,
35-46.
Deming, W. E. (1956) On simplification of sampling design through replica-
tion with equal probabilities and without stages. Journal of the American
Statistical Association 51, 24-53.
Deming, W. E. and Stephan, F. F. (1940) On a least squares adjustment of
a sampled frequency table when the expected marginal totals are known.
Annals of Mathematical Statistics 11, 427-444.
Deville, J. C. and Särndal, C. E. (1992) Calibration estimators and generalized
raking techniques in survey sampling. Journal of the American Statistical
Association 87, 376-382.
Devine, O. J., Louis, T. A. and Halloran, M. E. (1994) Empirical Bayes esti-
mators for spatially correlated incidence rates. Environmetrics 5, 381-398.
DiCiccio, T. J. and Martin, M. A. (1991) Approximations of marginal tail
probabilities for a class of smooth functions with applications to Bayesian
and conditional inference. Biometrika 78, 891-902.
DiCiccio, T. J. and Romano, J. P. (1988) A review of bootstrap confidence
intervals. Journal of the Royal Statistical Society B 50, 338-354.
DiCiccio, T. J. and Romano, J. P. (1990) Nonparametric confidence limits by
resampling methods and least favorable families. International Statistical
Review 58, 59-76.
Durbin, J. (1953) Some results in sampling theory when the units are selected
with unequal probabilities. Journal of the Royal Statistical Society B 15,
262-269.
Durbin, J. (1958) Sampling theory for estimates based on fewer individuals
than the number selected. Bulletin of the International Statistical Institute
36(3), 113-119.
Durbin, J. (1959) A note on the application of Quenouille's method of bias
reduction to the estimation of ratios. Biometrika 46, 477-480.
Durbin, J. (1967) Design of multi-stage surveys for the estimation of sampling
errors. Applied Statistics 16, 152-164.
Durbin, J. (1969) Inferential aspects of the randomness of sample size in survey
sampling. In: N. L. Johnson and H. Smith (eds) New Developments in Survey
Sampling. Wiley, New York, 629-651.
Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. Chap-
man & Hall, New York,
Erdos, P. and Renyi, A. (1959) On the central limit theorem for samples from
a finite population. Publ. Math. Inst. Hung. Acad. Sci. 4, 49-57.
Ericson, W. A. (1969) Subjective Bayesian models in sampling finite popula-
tions. Journal of the Royal Statistical Society B 31, 195-224.
Fellegi, I. P. (1963) Sampling with varying probabilities without replacement:
rotating and non-rotating samples. Journal of the American Statistical Asso-
ciation 58, 183-201.
Feller, W. (1971) An Introduction to Probability Theory and Its Applications,
Volume II, 2nd edition. Wiley, New York.
Fieller, E. C. (1932) The distribution of the index in a normal bivariate popu-
lation. Biometrika 24, 428-440.
Francisco, C. A. and Fuller, W. A. (1991) Quantile estimation with a complex
survey design. Annals of Statistics 19, 454-469.
Frankel, M. R. (1971) Inference from Survey Samples: an Empirical Investiga-
tion. Institute for Social Research, University of Michigan, Ann Arbor.
Gerow, K. and McCulloch, C. E. (1994) Model-unbiased, unbiased-in-general
estimation of the average of a regression function. Preprint.
Ghosh, M. (1992) Constrained Bayes estimation with applications. Journal of
the American Statistical Association 87, 533-540.
Ghosh, M. and Meeden, G. (1986) Empirical Bayes estimation in finite popu-
lation sampling. Journal of the American Statistical Association 81, 1058-
1062.
Ghosh, M. and Rao, J. N. K. (1994) Small area estimation: an appraisal (with
discussion). Statistical Science 9, 55-93.
Gill, R. D., Vardi, Y. and Wellner, J. A. (1988) Large sample theory of empirical
distributions in biased sampling models. Annals of Statistics 16, 1069-1112.
Godambe, V. P. (1955) A unified theory of sampling from finite populations.
Journal of the Royal Statistical Society B 17, 269-278.
Godambe, V. P. (1966) A new approach to sampling from finite populations I,
II. Journal of the Royal Statistical Society B 28, 310-328.
Godambe, V. P. (1982) Estimation in survey sampling: robustness and opti-
mality. Journal of the American Statistical Association 77, 393-406.
Godambe, V. P. (1989a) Estimation of cumulative distribution of a survey
population. Technical Report STAT-89-17, University of Waterloo.
Godambe, V. P. (1989b) Optimal estimation for response dependent retrospec-
tive sampling. Technical Report STAT-89-25, University of Waterloo.
Godambe, V. P. (1991) Orthogonality of estimating functions and nuisance
parameters. Biometrika 63, 277-284.
Godambe, V. P. (1992) Non-exponentiality and orthogonal estimating func-
tions. In: J. Chen (ed.) Recent Concepts in Statistical Inference. Proceedings
of a Symposium in Honour of Professor V. P. Godambe. University of Wa-
terloo.
Godambe, V. P. (1994) Linear Bayes and optimal estimation. Technical report,
University of Waterloo.
Godambe, V. P. (1995) Estimation of parameters in survey sampling: optimal-
ity. Canadian Journal of Statistics 23, 227-243.
Godambe, V. P. and Joshi, V. M. (1965) Admissibility and Bayes estimation
in sampling finite populations I. Annals of Mathematical Statistics 36, 1707-
1722.
Godambe, V. P. and Kale, B. K. (1991) Estimating functions: an overview. In:
V. P. Godambe (ed.) Estimating Functions. Oxford University Press, Oxford,
3-20.
Godambe, V. P. and Rajarshi, M. B. (1989) Optimal estimation for weighted
distributions. In: Y. Dodge (ed.) Statistical Data Analysis and Inference.
North-Holland, Amsterdam.
Godambe, V. P. and Thompson, M. E. (1973) Estimation in sampling theory
with exchangeable prior distributions. Annals of Statistics 1, 1212-1221.
Godambe, V. P. and Thompson, M. E. (1986) Parameters of superpopulation
and survey population: their relationships and estimation. International Sta-
tistical Review 54, 127-138.
Godambe, V. P. and Thompson, M. E. (1989) An extension of quasi likelihood
estimation. Journal of Statistical Planning and Inference 22, 137-172.
Good, I. J. (1977) A new formula for k-statistics. Annals of Statistics 5, 224-
228.
Goodman, R. and Kish, L. (1950) Controlled selection - a technique in proba-
bility sampling. Journal of the American Statistical Association 45, 350-372.
Groves, R. M., Biemer, P. P., Lyberg, L. E., Massey, J. T., Nicholls, W. L. and
Waksberg, J. (1988) Telephone Survey Methodology. Wiley, New York.
Gupta, V. K. and Nigam, A. K. (1987) Mixed orthogonal arrays for variance es-
timation with unequal numbers of primary selections per stratum. Biometrika
74, 735-742.
Gwet, J.-P. and Rivest, L.-P. (1992) Outlier resistant alternatives to the ratio
estimator. Journal of the American Statistical Association 87, 1174-1182.
Hacking, I. (1975) The Emergence of Probability. Cambridge University Press,
Cambridge.
Hajek, J. (1959) Optimum strategy and other problems in probability sampling.
Casopis pro Pestovani Matematiky 84, 387-423.
Hajek, J. (1960) Limiting distributions in simple random sampling from a finite
population. Pub!. Math. Inst. Hung. Acad. Sci. 5, 361-374.
Hajek, J. (1964) Asymptotic theory of rejective sampling with varying proba-
bilities from a finite population. Annals of Mathematical Statistics 35, 1491-
1523.
Hajek, J. (1971) Discussion of 'An essay on the logical foundations of survey
sampling, part one' by D. Basu. In: V. P. Godambe and D. A. Sprott (eds)
Foundations of Statistical Inference. Holt, Rinehart and Winston, Toronto,
236.
Hajek, J. (1981) Sampling from a Finite Population. Marcel Dekker, New York.
Hall, P. (1988) Theoretical comparison of bootstrap confidence intervals. An-
nals of Statistics 16, 927-985.
Hartley, H. O. (1959) Analytic Studies of Survey Data, volume in honour of
Corrado Gini. Istituto di Statistica, Rome.
Hartley, H. O. and Rao, J. N. K. (1962) Sampling with unequal probabilities
and without replacement. Annals of Mathematical Statistics 33, 350-374.
Hartley, H. O. and Rao, J. N. K. (1968) A new estimation theory for sample
surveys. Biometrika 55, 547-557.
Hartley, H. O. and Rao, J. N. K. (1969) A new estimation theory for sample
surveys, II. In N. L. Johnson and H. Smith (eds) New Developments in Survey
Sampling. Wiley, New York, 147-169.
Harvey, D. J. (1976) Some investigations into sampling from popula-
tions meaningfully arranged in one or more dimensions. PhD thesis,
University of Waterloo.
Herson, J. (1976) An investigation of relative efficiency of least squares pre-
diction to conventional probability sampling plans. Journal of the American
Statistical Association 71, 700-703.
Heyde, C. C. (1987) On combining quasi-likelihood estimating functions.
Stochastic Processes and Applications 25, 267-276.
Hidiroglou, M. A. and Srinath, K. P. (1981) Some estimators of a population
total from simple random samples containing large units. Journal of the
American Statistical Association 76, 690-695.
Holt, D. (1989) Introduction to 'Disaggregated analysis: modelling structured
populations'. In: C. J. Skinner, D. Holt and T. M. F. Smith (eds) Analysis
of Complex Surveys. Wiley, Chichester, 209-220.
Holt, D. and Smith, T. M. F. (1979) Post-stratification. Journal of the Royal
Statistical Society A 142, 33-46.
Hori, H. and Nagata, M. (1985) Examples of optimization methods for envi-
ronment monitoring systems. Report B-266-R-53-2, Environmental Sciences,
Ministry of Education, Japan, 18-29 [in Japanese].
Horvitz, D. G. and Thompson, D. J. (1952) A generalization of sampling with-
out replacement from a finite universe. Journal of the American Statistical
Association 47, 663-685.
Høst, G., Omre, H. and Switzer, P. (1995) Spatial interpolation errors for mon-
itoring data. Journal of the American Statistical Association 90, 853-861.
Johnson, N. L. and Kotz, S. (1970) Continuous Univariate Distributions.
Houghton Mifflin, Boston.
Jones, H. L. (1974) Jackknife estimation of functions of strata means.
Biometrika 61, 343-348.
Kalbfleisch, J. D. and Lawless, J. F. (1988a) Estimation of reliability in field-
performance studies. Technometrics 30, 365-378.
Kalbfleisch, J. D. and Lawless, J. F. (1988b) Likelihood analysis of multi-
state models for disease incidence and mortality. Statistics in Medicine 7,
149-160.
Kalbfleisch, J. D. and Sprott, D. A. (1969) Applications of likelihood and
fiducial probability to sampling finite populations. In: N. L. Johnson and
H. Smith (eds) New Developments in Survey Sampling. Wiley, New York,
358-389.
Kass, R. E. and Steffey, D. (1989) Approximate Bayesian inference in condi-
tionally independent hierarchical models (parametric empirical Bayes mod-
els). Journal of the American Statistical Association 84, 717-726.
Kendall, M. G., Stuart, A. and Ord, J. K. (1983) The Advanced Theory of
Statistics, Volume 3 (4th edition). Griffin, London.
Kish, L. (1965) Survey Sampling. Wiley, New York.
Kish, L. and Frankel, M. R. (1974) Inference from complex samples (with
discussion). Journal of the Royal Statistical Society B 36, 1-37.
Korn, E. L. and Graubard, B. I. (1991) A note on the large sample properties
of linearization, jackknife and balanced repeated replication methods for
stratified samples. Annals of Statistics 19, 2275-2279.
Kott, P. S. (1990) Estimating the conditional variance of a design consistent
regression estimator. Journal of Statistical Planning and Inference 24, 287-
296.
Kovar, J. G., Rao, J. N. K. and Wu, C. F. J. (1988) Bootstrap and other methods
to measure error in survey estimates. Canadian Journal of Statistics 16,
Supplement, 25--45.
Krewski, D. (1978) Jackknifing U-statistics in finite populations. Communica-
tions in Statistics - Theory and Methods A 7(1), 1-12.
Krewski, D. and Rao, J. N. K. (1981) Inference from stratified samples: prop-
erties of linearization, jackknife and balanced repeated replication methods.
Annals of Statistics 9, 1010-1019.
Kuk, A. Y. C. (1988) Estimation of distribution functions and medians under
sampling with unequal probabilities. Biometrika 75, 97-103.
Laird, N. M. and Louis, T. A. (1987) Empirical Bayes confidence intervals
based on bootstrap samples. Journal of the American Statistical Association
82, 739-757.
Laslett, G. M., McBratney, A. B., Pahl, P. J. and Hutchinson, M. F. (1987)
Comparison of several spatial prediction methods for soil pH. Journal of
Soil Science 38, 325-341.
Lessler, J. T. and Kalsbeek, W. D. (1992) Nonsampling Error in Surveys. Wiley,
New York.
Liang, K.-Y. and Zeger, S. L. (1986) Longitudinal data analysis using gener-
alised linear models. Biometrika 73, 13-22.
Liu, T. P. and Thompson, M. E. (1983) Properties of estimators of quadratic
finite population functions: the batch approach. Annals of Statistics 11, 275-
285.
Louis, T. A. (1984) Estimating a population of parameter values using Bayes
and empirical Bayes methods. Journal of the American Statistical Association
79, 393-398.
Lugannani, R. and Rice, S. O. (1980). Saddlepoint approximation for the sum
of independent random variables. Advances in Applied Probability 12, 475-
490.
Mach, L. (1988) The use of estimating functions for confidence interval con-
struction: the case of the population mean. Working Paper No. BSMD-88-
028E, Methodology Branch, Statistics Canada.
Madow, W. G. (1949) On the theory of systematic sampling II. Annals of
Mathematical Statistics 20, 333-354.
Mahalanobis, P. C. (1946) Recent experiments in statistical sampling in the
Indian Statistical Institute. Journal of the Royal Statistical Society 109, 325-
370.
Malec, D. and Sedransk, J. (1985) Bayesian inference for finite population
parameters in multistage cluster sampling. Journal of the American Statistical
Association 80, 897-902.
Mantel, H. (1991) Making use of a regression model for inferences about a finite
population mean. In: V. P. Godambe (ed.) Estimating Functions. Clarendon
Press, Oxford, 217-222.
Martin, R. J. (1979) A subclass of lattice processes applied to a problem in
planar sampling. Biometrika 66, 209-217.
Matheron, G. (1965) Les variables regionalisees et leur estimation. Masson,
Paris.
Matern, B. (1986) Spatial Variation. Lecture Notes in Statistics, No. 36.
Springer-Verlag, New York.
McCarthy, P. J. (1969) Pseudo-replication: half samples. Review of the Inter-
national Statistical Institute 37, 239-264.
McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models, 2nd edition.
Chapman & Hall, London.
McKay, M. D., Conover, W. J. and Beckman, R. J. (1979) A comparison of
three methods for selecting values of input variables in the analysis of output
from a computer code. Technometrics 21, 239-245.
McLeod, A. I. and Bellhouse, D. R. (1983) A convenient algorithm for drawing
a simple random sample. Applied Statistics 32, 182-184.
Midzuno, H. (1952) On the sampling system with probability proportional to
the sum of sizes. Annals of the Institute of Statistical Mathematics 3, 99-107.
Molenaar, W. (1973) Approximations to the Poisson, Binomial and Hypergeo-
metric Distribution Functions. Mathematisch Centrum, Amsterdam.
Moran, P. A. P. (1972) The probabilistic basis of stereology. Special Supplement
to Advances in Applied Probability, 69-91.
Mukerjee, R. and Sengupta, S. (1990) Optimal estimation of a finite population
mean in the presence of linear trend. Biometrika 77, 625-630.
Murthy, M. N. (1957) Ordered and unordered estimators in sampling without
replacement. Sankhya 18, 379-390.
Nandi, H. K. and Sen, P. K. (1963) On the properties of U-statistics when
the observations are not independent. Part Two: Unbiased estimation of the
parameters of a finite population. Calcutta Statistical Association Bulletin
12, 125-148.
Narain, R. D. (1951) On sampling without replacement with varying probabil-
ities. Journal of the Indian Society of Agricultural Statistics 3, 169-175.
Nelder, J. A. and Pregibon, D. (1987) An extended quasi-likelihood function.
Biometrika 74, 221-232.
Neter, J. (1972) How accountants save money by sampling. In: J. M. Tanur, F.
Mosteller, W. H. Kruskal, R. F. Link, R. S. Pieters and G. R. Rising (eds)
Statistics: A Guide to the Unknown. Holden-Day, San Francisco, 203-211.
Neuhaus, J. M., Kalbfleisch, J. D. and Hauck, W. W. (1991) A comparison of
cluster-specific and population-averaged approaches for analyzing correlated
binary data. International Statistical Review 59, 25-36.
Neyman, J. (1934) On the two different aspects of the representative method:
The method of stratified sampling and the method of purposive selection.
Journal of the Royal Statistical Society 97, 558-625.
Nordberg, L. (1989) Generalized linear modelling of sample survey data. Jour-
nal of Official Statistics 5, 223-239.
Okabe, A., Boots, B. and Sugihara, K. (1992) Spatial Tessellations: Concepts
and Applications of Voronoi Diagrams. Wiley, Chichester.
Oppenheim, N. A. (1992) Questionnaire Design, Interviewing, and Attitude
Measurement. St. Martin's Press, New York.
Owen, A. B. (1992) A central limit theorem for Latin hypercube sampling.
Journal of the Royal Statistical Society B 54, 541-551.
Pfeffermann, D. and LaVange, L. (1989) Regression models for stratified multi-
stage cluster samples. In: C. J. Skinner, D. Holt and T. M. F. Smith (eds)
Analysis of Complex Surveys. Wiley, Chichester, 237-260.
Plackett, R. L. and Burman, J. P. (1946) The design of optimum multifactorial
experiments. Biometrika 33, 305-325.
Prasad, N. G. N. and Rao, J. N. K. (1990) The estimation of mean squared
errors of small-area estimators. Journal of the American Statistical Association
85, 163-171.
Pratt, J. W. (1968) A normal approximation for binomial, F, beta, and other
common, related tail probabilities, I. Journal of the American Statistical As-
sociation 63, 1457-1483.
Qian, W. and Titterington, D. M. (1991) Estimation of parameters in hidden
Markov models. Philosophical Transactions of the Royal Society of London
A 337, 407-428.
Quenouille, M. H. (1949) Problems in plane sampling. Annals of Mathematical
Statistics 20, 355-375.
Quenouille, M. H. (1956) Notes on bias in estimation. Biometrika 43, 353-360.
Rao, C. R. (1971) Some aspects of statistical inference in problems of sam-
pling from finite populations. In: V. P. Godambe and D. A. Sprott (eds)
Foundations of Statistical Inference. Holt, Rinehart and Winston, Toronto,
177-202.
Rao, C. R. (1973) Linear Statistical Inference and Its Applications, 2nd edition.
Wiley, New York.
Rao, J. N. K. (1963) On two systems of unequal probability sampling without
replacement. Annals of the Institute of Statistical Mathematics 14, 143-150.
Rao, J. N. K. (1975) Unbiased variance estimation for multistage designs.
Sankhya C 37, 133-139.
Rao, J. N. K. (1979) On deriving mean square errors and their non-negative
unbiased estimators. Journal of the Indian Statistical Association 17, 125-
136.
Rao, J. N. K. (1985) Conditional inference in survey sampling. Survey Method-
ology 16, 3-29.
Rao, J. N. K. (1994) Estimating totals and distribution functions using auxiliary
information at the estimation stage. Journal of Official Statistics 10, 153-166.
Rao, J. N. K. and Liu, J. (1992) On estimating distribution functions from
sample survey data using supplementary information at the estimation stage.
In Saleh, A. K. Md. E. (ed.) Nonparametric Statistics and Related Topics.
Elsevier, Amsterdam, 399-407.
Rao, J. N. K. and Scott, A. 1. (1981) The analysis of categorical data from
complex sample surveys: chi-squared tests for goodness of fit and indepen-
dence in two-way tables. Journal of the American Statistical Association 76,
221-230.
Rao, J. N. K. and Scott, A. J. (1984) On chi-squared tests for multi-way tables
with cell proportions estimated from survey data. Annals of Statistics 12,
46-60.
Rao, J. N. K. and Scott, A. J. (1987) On simple adjustments to chi-square tests
with sample survey data. Annals of Statistics 15, 385-397.
Rao, J. N. K. and Wu, C. F. J. (1985) Inference from stratified samples: second-
order analysis of three methods for nonlinear statistics. Journal of the Amer-
ican Statistical Association 80, 620-630.
Rao, J. N. K. and Wu, C. F. J. (1988) Resampling inference with complex
survey data. Journal of the American Statistical Association 83, 231-241.
Rao, J. N. K., Hartley, H. O. and Cochran, W. G. (1962) On a simple procedure
of unequal probability sampling without replacement. Journal of the Royal
Statistical Society B 24, 482-491.
Rao, J. N. K., Kovar, J. G. and Mantel, H. J. (1990) On estimating distribu-
tion functions and quantiles from survey data using auxiliary information.
Biometrika 77, 365-375.
Reid, N. M. (1988) Saddlepoint methods and statistical inference. Statistical
Science 3, 213-238.
Renyi, A. (1970) Probability Theory. North-Holland, Amsterdam.
Ripley, B. D. (1981) Spatial Statistics. Wiley, New York.
Ripley, B. D. (1988) Statistical Inference for Spatial Processes. Cambridge
University Press, Cambridge.
Roberts, G., Rao, J. N. K. and Kumar, S. (1987) Logistic regression analysis
of survey data. Biometrika 74, 1-12.
Robinson, J. (1978) An asymptotic expansion for samples from a finite popu-
lation. Annals of Statistics 6, 1005-1011.
Robinson, J. (1982) Saddlepoint approximations for permutation tests and con-
fidence intervals. Journal of the Royal Statistical Society B 44, 91-101.
Robinson, J. (1987) Conditioning ratio estimates under simple random sam-
pling. Journal of the American Statistical Association 82, 826-831.
Rosen, B. (1972) Asymptotic theory for successive sampling with varying prob-
abilities without replacement, I and II. Annals of Mathematical Statistics 43,
373-397, 748-776.
Royall, R. M. (1970) On finite population sampling theory under certain linear
regression models. Biometrika 57, 377-387.
Royall, R. M. and Cumberland, W. G. (1978) Variance estimation in finite pop-
ulation sampling. Journal of the American Statistical Association 73, 351-
358.
Royall, R. M. and Cumberland, W. G. (1981a) An empirical study of the ratio
estimator and estimators of its variance. Journal of the American Statistical
Association 76, 66-77.
Royall, R. M. and Cumberland, W. G. (1981b) The finite-population linear
regression estimator and estimators of its variance - an empirical study.
Journal of the American Statistical Association 76, 924-930.
Royall, R. M. and Cumberland, W. G. (1985) Conditional coverage properties
of finite population confidence intervals. Journal of the American Statistical
Association 80, 355-359.
Sacks, J. and Schiller, S. (1988) Spatial designs. In: S. S. Gupta and J.
O. Berger (eds) Statistical Decision Theory and Related Topics IV, Vol. 2.
Springer-Verlag, New York, 385-399.
Sacks, J., Schiller, S. and Welch, W. J. (1989a) Designs for computer experi-
ments. Technometrics 31, 41-47.
Sacks, J., Welch, W. J., Mitchell, T. J. and Wynn, H. P. (1989b) Design and
analysis of computer experiments. Statistical Science 4, 409-423.
Sampford, M. R. (1967) On sampling without replacement with unequal prob-
abilities of selection. Biometrika 54, 499-513.
Sampson, P. D. and Guttorp, P. (1992) Nonparametric estimation of nonsta-
tionary spatial covariance structure. Journal of the American Statistical As-
sociation 87, 108-119.
Särndal, C. E., Swensson, B. and Wretman, J. H. (1989) The weighted residual
technique for estimating the variance of the general regression estimator of
the finite population total. Biometrika 76, 527-537.
Särndal, C. E., Swensson, B. and Wretman, J. H. (1992) Model Assisted Survey
Sampling. Springer-Verlag, New York.
Satterthwaite, F. E. (1946) An approximate distribution of estimates of variance
components. Biometrics 2, 110-114.
Scott, A. J. and Holt, D. (1982) The effect of two stage sampling on ordinary
least squares methods. Journal of the American Statistical Association 77,
848-852.
Scott, A. J. and Smith, T. M. F. (1969) Estimation in multistage surveys.
Journal of the American Statistical Association 64, 830--840.
Scott, A. J. and Wild, C. J. (1986) Fitting logistic models under case-control or
choice based sampling. Journal of the Royal Statistical Society B 48, 170-
182.
Sen, A. R. (1953) On the estimate of the variance in sampling with varying
probabilities. Journal of the Indian Society of Agricultural Statistics 5, 119-
127.
Sen, P. K. (1980) Limit theorems for an extended coupon collector's prob-
lem and for successive sub-sampling with varying probabilities. Calcutta
Statistical Association Bulletin 29, 113-132.
Sen, P. K. (1988) Asymptotics in finite population sampling. In: P. R. Krish-
naiah and C. R. Rao (eds) Handbook of Statistics, Vol. 6. North-Holland,
Amsterdam, 291-331.
Singh, A. C. (1994) Comment on 'Small area estimation: an appraisal' by M.
Ghosh and J. N. K. Rao. Statistical Science 9, 8~7.
Singh, A. C., Stukel, D. M. and Pfeffermann, D. (1993) Bayesian versus fre-
quentist measures of uncertainty for small area estimators. Proceedings of
the Section on Survey Research Methods. American Statistical Association.
Sitter, R. R. (1992a) Comparing three bootstrap methods for survey data. Cana-
dian Journal of Statistics 20, 135-154.
Sitter, R. R. (1992b) A resampling procedure for complex survey data. Journal
of the American Statistical Association 87, 755--764.
Sitter, R. R. (1993) Balanced repeated replications based on orthogonal multi-
arrays. Biometrika 80, 211-222.
Skinner, C. J. (1989) Introduction to Part A: Aggregated analysis: standard
errors and significance tests. In: C. J. Skinner, D. Holt and T. M. F. Smith
(eds) Analysis of Complex Surveys. Wiley, Chichester, 23-58.
Skinner, C. J., Holt, D. and Smith, T. M. F. (eds) (1989) Analysis of Complex
Surveys. Wiley, Chichester.
Statistics Canada (1990) Methodology of the Labour Force Survey. Catalogue
71-526.
Stein, M. L. (1987) Minimum norm quadratic estimation of spatial variograms.
Journal of the American Statistical Association 82, 765-772.
Stein, M. L. (1988) Asymptotically efficient prediction of a random field with
a misspecified covariance function. Annals of Statistics 16, 55-63.
Stein, M. L. (1990a) A comparison of generalized cross validation and modified
maximum likelihood for estimating the parameters of a stochastic process.
Annals of Statistics 18, 1139-1157.
Stein, M. L. (1990b) Bounds on the inefficiency of linear predictions using an
incorrect covariance function. Annals of Statistics 18, 1116-1138.
Stein, M. L. (1992) Prediction and inference for truncated spatial data. Journal
of Computational and Graphical Statistics 1, 91-110.
Stein, M. L. (1993) Asymptotic properties of centered systematic sampling for
predicting integrals of spatial processes. Annals of Applied Probability 3,
874-880.
Stroud, T. W. F. (1991) Hierarchical Bayes predictive means and variances
with application to sample survey inference. Communications in Statistics,
Theory and Methods 20, 13-36.
Stuart, A. and Ord, J. K. (1987) Kendall's Advanced Theory of Statistics, Vol.
1, 5th edition. Oxford University Press, New York.
Su, Y. and Cambanis, S. (1993) Sampling designs for estimation of a random
process. Stochastic Processes and their Applications 46, 47-89.
Sugden, R. A. (1982) Exchangeability and survey sampling inference. In G.
Koch and F. Spizzichino (eds) Exchangeability in Probability and Statistics.
North-Holland, Amsterdam, 321-330.
Sugden, R. A. (1993) Partial exchangeability and survey sampling inference.
Biometrika 80, 451-455.
Sukhatme, P. V., Sukhatme, B. V., Sukhatme, S. and Asok, C. (1984) Sampling
Theory of Surveys with Applications, 3rd edition. Iowa State University Press,
Ames.
Tamura, H. (1989) Statistical models and analysis in auditing. Panel on Non-
standard Mixtures of Distributions. Statistical Science 4, 2-33.
Tang, B. (1993) Orthogonal array-based Latin hypercubes. Journal of the Amer-
ican Statistical Association 88, 1392-1397.
Thomas, D. R. (1989) Simultaneous confidence intervals for proportions under
cluster sampling. Survey Methodology 15, 187-202.
Thompson, M. E. (1983) Labels. In N. L. Johnson and S. Kotz (eds) Encyclo-
pedia of Statistical Sciences 9. Wiley, New York, 427-430.
Thompson, M. E. (1984) Model and design correspondence in finite population
sampling. Journal of Statistical Planning and Inference 10, 323-334.
Thompson, S. K. (1992) Sampling. Wiley, New York.
Tille, Y. (1996) An elimination procedure for unequal probability sampling
without replacement. Biometrika 83, 238-241.
Tukey, J. W. (1958) Bias and confidence in not-quite large samples (abstract).
Annals of Mathematical Statistics 29, 614.
Valliant, R. (1993) Poststratification and conditional variance estimation. Jour-
nal of the American Statistical Association 88, 89-96.
Wang, S. (1993) Saddlepoint expansions in finite population problems.
Biometrika 80, 583-590.
Whitmore, G. A. (1986) Inverse Gaussian ratio estimation. Applied Statistics
35, 8-15.
Wolter, K. M. (1985) Introduction to Variance Estimation. Springer-Verlag,
New York.
Woodruff, R. S. (1952) Confidence intervals for medians and other position
measures. Journal of the American Statistical Association 47, 635-646.
Wu, C. F. J. (1991) Balanced repeated replications based on mixed orthogonal
arrays. Biometrika 78, 181-188.
Yates, F. (1960) Sampling Methods for Censuses and Surveys, 3rd edition. Grif-
fin, London.
Yates, F. and Grundy, P. M. (1953) Selection without replacement within strata
with probability proportional to size. Journal of the Royal Statistical Society
B 15, 235-261.
Yfantis, E. A., Flatman, G. T. and Behar, J. V. (1987) Efficiency of kriging
estimation for square, triangular and hexagonal grids. Mathematical Geology
19, 183-205.
Index

K statistics, 27, 68 calibration, 178, 183, 187
XI; condition, 182 case control studies, 208
XI; condition, 174,176,177,196 census, 3
g-weights, 184 central limit theorem, 56
central limit theorem (finite
accounting, 25 population), 58
analytic purposes, 2, 199 characteristic function, 73
approximately with replacement, 36, characteristic function of sample
37,40, 42, 62, 100, 108, 184, 185 sum, SRS, 57
asymptotic framework, 58, 104, 110, circular systematic sampling, 43
126,262 cluster sampling, 23, 24
asymptotically design unbiased, 105, clustered population, 230
128, 167 composite estimator, 235
auditing, 84, 152 conditionality principle, 158
auxiliary variates, 171, 173 conditioning, 34,42,57,62, 78, 146,
average of a mean function, 211, 233 158, 163-165, 185, 188
conditioning principle, 158
confidence intervals, 52, 64, 97, 186
Bahadur representation, 105 confidence limits, 52, 55, 80, 103,
balanced repeated replication, 114 206
balanced sample, 193
correlation functions, 255
Bayesian, 271
coverage error, 7
Bayesian approach, 156,236,274
coverage probabilities, 52, 130, 164
Bernoulli sampling, 42, 43, 57, 62,
Cox proportional hazards model, 136
163,210
cumulants, 25, 66
bias, 7
cumulants of sample sum, SRS, 29
bias reduction, 130
binomial distribution, 51
bootstrap t, 91, 103, 132 descriptive purposes, 1
bootstrap confidence intervals, 86 design consistent, 64, 105, 109, 128,
bootstrap resampling, 85, 198 167
bootstrap variance estimators, 125 design covariance, 16
BRR,114 design effect, 246, 248
BRR bias reduction, 131 design expectation, 12
BRR variance estimators, 119 design variance, 15
BWO method, 88 dispersion parameter, 220
domain mean, 159 hypergeometric distribution, 51, 144
domain total, 160
inclusion probabilities, 11, 32
EBLUP method, 236 inference, 49, 143, 144, 193
Edgeworth expansion of sample sum, infinitesimal jackknife, 123 .
SRS, 68 informative, 95, 205, 206
Edgeworth expansions, 65 interpolation estimators, 276, 278
efficiency, 21, 166 inverse Gaussian, 191
elementary estimating functions, 212 inverse testing, 52, 81, 83, 84, 102,
elimination procedures, 40 103,217
empirical Bayes, 156,275 isotropic, 256
empirical likelihood, 82, 180
estimating function system, 135 jackknife bias reduction, 130
estimating functions, 94, 110, 168, jackknife repeated replication, 124
186 jackknife variance estimators, 121,
estimator, 9 134
exchangeability, 150, 152 JRR method, 124
exchangeability (partial), 202
expansion estimator, 14, 50
kriging, 255, 263, 278
extended quasilike1ihood, 222
kurtosis, 28

Fellegi's method, 38, 44
labels, 9, 145, 147
fiducial approach, 156
Laplace approximation, 75
finite population parameter, 204
Latin hypercube sampling, 260
finite population parameters, 93
Latin square sampling, 261
fixed size, 11, 43
lattice sampling, 260
frame, 4, 161
length biased sampling, 96, 207
frame population, 4
likelihood function, 148, 203, 207
likelihood function (flatness), 147,
generalized linear mixed model, 202, 148
224 Lindeberg condition, 58, 63
generalized linear model, 214 linear estimator, 19
generalized regression estimator, 176 linearization, 107
Gibbs distribution, 258 linearization variance estimators,
GREG estimator, 176, 180 108,119,123
link function, 216
Hajek estimator, 175 local estimation, 195
Hadamard matrix, 117 local relative efficiency, 21, 164
half samples, 115 logistic regression, 216, 228, 238
Hartley and Rao procedure, 40 Lyapunov's condition, 59
hexagonal grid, 280
Horvitz-Thompson estimator, 14 Madow's ordered systematic
HT estimator, 14, 18, 32 procedure, 39, 260, 281
marginal linear exponential model, partial likelihood, 136
228 Poisson approximation, 51
marginal moment analysis, 226, 231 Poisson overdispersion models, 226
Markov property, 258 population U-statistics, 132
measurable, 16 population array, 9
measurement error, 6 population c.d.f., 95, 100, 196
method of random groups, 111 population covariance, 17
Midzuno's sampling design, 44 population mean, 5, 10, 95, 98
minimum expected variance, 165 population mean function, 195
mirror match bootstrap, 89 population median, 95, 101
misclassification probabilities, 253 population number, 5
mixed orthogonal array, 119 population proportion, 5, 50
model assisted estimation, 107, 172 population ratio, 95, 100, 129, 168
model bias, 155 population regression coefficients,
model parameters, 200 106, 110, 129, 136
model unbiased, 155 population semivariogram, 254
monetary unit sampling, 39, 83, 178 population total, 5, 10
Monte Carlo integration, 276 population variance, 17, 50, 106,
multi-stage sampling, 31, 63 109, 133, 135
multiple proportions, 55 post sampling stratification, 139,
multivariate sample sum, 61 146, 158, 161
Murthy's estimator, 20, 38, 44 Pratt's approximation, 53
prediction and inference, 155
nearest neighbour averages, 271 prediction intervals, 156
nested models, 219 predictive approach, 156, 181
noninformative, 10 primary sampling units, 31
nonresponse, 42 proportional allocation, 15, 25
nonresponse error, 6 proportional to size (1r ps) sampling,
nonresponse rates, 161 32,34,44
normal approximation, 51, 54
nugget effect, 256, 269, 270 pseudoscore, 210
nuisance parameter, 135, 214 PSUs,31
nuisance parameters, 138 purposive sampling schemes, 260,
276, 285

optimal allocation, 24
optimality, 166, 174, 176, 178, 262, quasiscore, 213
279, 281
orthogonal, 213
raking ratio estimator, 181
orthogonal arrays, 261
random effects, 173, 223, 230, 233
outliers, 195
random permutation model, 150, 157
overconditioning, 147
random systematic procedure, 40, 84
randomization roles, 201, 202
partial exchangeability, 151 randomized sampling schemes, 285
Rao-Hartley-Cochran procedure, 41, simple rejective sampling, 42, 61
45 skewness, 28
Rao-Scott correction, 248 small area totals, 233
rare errors, 84 spectral distribution, 257
ratio estimator, 45, 167, 175, 188 spectral methods, 270
regression coefficients, 201 SRS, 11,49
regression estimation, 194 SRS with replacement, 158
regression estimator, 175, 190 stereology, 261
regression model, 152 stochastic subject, 200
rejective sampling, 39 strategy, 21, 163
replication, 111, 285 stratification, 194
representation interpretation, 161, stratified multi-stage design, 108
162, 179 stratified multi-stage sampling, 114,
represented population, 3 128
rescaling technique, 89, 125 stratified population mean, 135
response dependent sampling, 207 stratified random sampling, 14, 22,
response error, 6, 210 24, 89, 98, 126
robust uncertainty estimates, 237 stratified sample mean, 15, 62
robust uncertainty estimators, 185, stratified two-stage sampling, 126
218 strong likelihood principle, 148
root function, 86 studentized sample mean, 67, 71, 87
rotating samples, 38 successive sampling, 37, 62
sufficiency, 10
saddlepoint, 74 superpopulation model, 149, 200
saddlepoint approximation, 73, 103 superpopulation models, 1
saddlepoint equations, 77 superpopulation proportion, 200
Sampford's method, 39, 44 survey, 3
sample, 3, 10 survey error, 6
sample indicator, 13 survey weights, 160, 178, 187
sample mean, 13,22 systematic sampling, 14, 16, 23, 43,
sample median, 105 260,281,284
sample pseudoscore function, 243
sample size, 64 tail probabilities, 67, 69, 78, 82
sample variance, 18,31, 134 target population, 3, 199
sampling design, 11 Temme approximation, 79
sampling error, 6 tilted distribution, 75, 77, 78
sandwich estimator, 218 trapezoidal rule, 276
scale load approach, 82 trend function, 253, 267
score function, 203 triangular grid, 280, 283
score statistic, 219 two-stage sampling, 99
self-weighting, 13, 43, 98, 202
semivariogram, 254
semivariogram forms, 256 uncertainty estimation, 182
simple random sampling, 11 units, 9
variability, 7
variance estimation in multi-stage
sampling, 34
Voronoi polygons, 279, 280
Voronoi polyhedra, 259

warranty period, 209
weight classes, 161
with replacement sampling, 41
without replacement bootstrap, 88

Yates-Grundy-Sen variance
estimator, 16, 20