Escolar Documentos
Profissional Documentos
Cultura Documentos
General Editors
(Full details concerning this series are available from the Publishers).
JOIN US ON THE INTERNET VIA WWW, GOPHER, FTP OR EMAIL:
WWW: http://www.thomson.com
GOPHER: gopher.thomson.com A service of I{!)P®
FTP: ftp.thomson.com
EMAIL: findit@kiosk.thomson.com
Theory of Sample Surveys
M.E. Thompson
A Catalogue record for this book is available from the British Library
PREFACE xiii
1 Introduction 1
1.1 Survey populations and samples 3
1.2 Population quantities 5
1.3 Survey error 6
1.4 Sampling· and non-sampling errors 7
1.5 Bias and variability 7
1.6 Focus on sampling error 8
References 287
Index 301
PREFACE
This book began about ten years ago as an attempt to fill out the notes
for a graduate course in sampling theory. It has become a monograph,
intended to supplement rather than replace books already available.
For example, there are many good treatments at various levels on the
practical problems of survey design, and on the 'how to' of analysis.
I have dealt with these issues to some extent, but have focused more
on aspects of sampling theory which are not so commonly treated else-
where. Parts of the book can still be used in teaching, supplemented
sufficiently with examples and other material.
The book deals with a subject of great vitality. The theory of survey
methods is developing fast, with more and more connections to other
parts of statistics, as the needs of practitioners change, and as more
uses of survey data become possible. As a consequence of increases
in computing power and capability, data are easier to manipulate, and
computer-intensive methods for analysis can be investigated with reas-
onable assurance that the better ones will soon be practical.
Part of the fascination of the theory of sample surveys has always
lain in its foundational issues. The present book has been written very
much from a foundational perspective. For the most part, one point of
view is taken, but it is far from being the only possible one. It has long
been my belief that as far as the puzzles and paradoxes of inference
are concerned, everyone must come to her or his own account of the
truth.
In arriving at my own account, I have been aided by many others,
particularly by colleagues and students at the University of Waterloo.
By far the greatest debt is to V. P. Godambe, who began looking
critically at the logic of sampling inference in the 1950s, and has had a
profound influence on the subject ever since. It was he who taught me
that the best questions are those which have only partial answers, and
that confusion in the search for clarity is an honourable condition. His
interest in this project, and his great generosity in collaboration over
the years, are much appreciated.
xiv PREFACE
Introduction
aspects of survey design and analysis, those which can readily be for-
mulated in mathematical terms. The difficult aspects are the scientific
questions (such as whether or not a survey can be designed which will
actually provide the answers we seek), the implementation questions
(such as whether we can achieve the response rates which will make for
results we can trust), and the measurement questions (such as how to
design a questionnaire or interview format for accurate measurement of
response variates). The next few sections will describe more explicitly
the total context in which the theory of sampling is applied.
pling from a population of records for audit, the two populations may
coincide. In surveys of human populations they generally do not: the
population from which we are actually able to sample (the represented
population) is usually only an approximation to the population about
which information is desired (the target population).
There are essentially two main sources of discrepancy between the
target and represented populations. The first is the inadequacy of the
sampling frame, the list or map from which the units to be sampled
are identified. The second is the possibility of non-response, or more
generally the possibility of inaccessibility of the units to be sampled.
For example, suppose that for a household expenditure survey, the
target population consists of all dwelling units in a city, and the sam-
pling frame is a list of all the dwelling units, compiled three years ago.
Then the represented population does not include newly constructed
dwelling units. Suppose further that the survey requires that for a sam-
pled dwelling unit to respond, some occupant must be at home on a
specified day. Thus membership in the achieved sample, as a subset
of the intended sample, is related to availability of occupants during
the day, which may be related to expenditure. In such a case we might
wish to specify the represented population as consisting of 'all dwelling
units on the three-year-old list at which someone is home (and willing
and able to respond if asked) on the survey day'.
There is clearly some subjectivity in the determination of the repres-
ented population as we have defined it. In the example just discussed,
if it is believed that the potentially responding dwelling units on the list
are representative of the whole list for the purposes of the survey, the
represented population could simply consist of the frame population, or
all dwelling units on the list which are in existence on the survey date.
However, in many situations this kind of assumption is inappropriate,
and it is better to think of the represented population as the respondent
part of the frame population, namely the subpopulation of accessible
members who would have responded. Accordingly, we will identify the
represented population with the respondent part of the frame population,
particularly in the discussion of error components in Section 1.3. We
will think of the intended sample as drawn from the frame population,
but the achieved sample as drawn from the respondent part.
For theoretical discussions it is convenient to think of the frame pop-
ulation as a subset of the target population, consisting of those target
population units incorporated in the frame. However, in practice there
are many possible relationships of the sampling frame and the target
POPULATION QUANTITIES 5
I>j
j=l
where Yj is the number employed by business j. A population total is
defined as a quantity which can be expressed this way. The population
average or population mean of the variate Y is
N
<'LYj)/N.
j=l
and
N
P = CEJYj)/N.
j=l
y=(Y\,···,YN). (2.2)
10 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS
The array y = (y\, ... , YN) is sometimes called the population vector
if Y is one-dimensional.
Denote by f.Ly the population mean and Ty the population total:
N
f.Ly = (LYj)/N, (2.3)
j=1
N
Ty = LYj. (2.4)
j=1
(Each of these has the same dimension as y.) These population quanti-
ties are fundamental; most quantities estimated in sample surveys can
be expressed as functions of population means or of population totals.
In particular, as we have seen, proportions can be viewed as population
means, and numbers as population totals.
Functions of y are generally to be estimated from observation of a
sample of the 'coordinates' Yj. Often, samples are obtained by drawing
units successively from the population according to some randomized
scheme. We might denote the sample sequence of unit labels from U
by
s* = U\,h, ... ,jn);
then if the labels of units are identifiable, the full data from the sampling
experiment would consist of the pair sequence
LP(s) = 1.
seS
EXAMPLE 2.1: If N = 3, then S = {0, {I}, {2}, {3}, {I, 2}, {I, 3},
{2, 3}, {l, 2, 3}}. The sampling design which corresponds to the scheme
which selects two units by simple random sampling (SRS) without re-
placement has
p(0) = p({l}) = p({2}) = p({3}) = p({l, 2, 3}) = 0;
p({l, 2}) = p({I, 3}) = p({2, 3}) = 1/3.
EXAMPLE 2.2: Suppose that N = 3 and the scheme selects two units
by SRS with replacement. Then the sampling design is given by
p(0) = p({I, 2, 3}) = 0; p({I}) = p({2}) = p({3}) = 1/9;
p({l,2}) = p({I, 3D = p({2, 3D = 2/9.
given by
the identity
N
E(Lz) = LZjJrj. (2.12)
jes j=l
An alternative method of showing (2.12) is to use the fact that
N
LZj = LzJjs, (2.13)
jes j=l
where the sample indicator random variable Ijs is given by
Ijs = 1 if j E S
= 0 if j ¢ s. (2.14)
Since ZI, ... ,ZN are non-random, (2.13) implies E(LjesZj)
= Lf=l zjE(Ijs), and since E(Ijs) = Jrj, the identity (2.12) follows
at once.
From (2.12) with Zj == 1 it follows that
N
E{n(s)} = LJrj; (2.15)
j=l
the expectation of sample size for a given design is the sum of its
inclusion probabilities. In particular, for a fixed size (n) design, the
inclusion probabilities will sum to n.
A sampling design is called self-weighting if all its inclusion proba-
bilities are equal. Designs which are both self-weighting and of fixed
size are of particular importance (Kish, 1965), because for these de-
signs, a sample mean is unbiased as an estimator of the corresponding
population mean. This can be seen as follows. If
Ys = (LYj)/n (2.16)
jes
is the sample mean, its expectation can be obtained from (2.12) with
Zj = Yj/n. If the design p is self-weighting and of fixed size (n), then
(2.15) and (2.12) imply that
(i) Jrj = n/ N for all j;
(ii) E <Ys) = fJ-y for any population array y.
The consequence (ii) is a mathematical statement of the sampling un-
biasedness of Ys as an estimator of the population mean fJ-y.
SRS without replacement is clearly self-weighting and of fixed size.
Many other designs used in practice also have these properties. One
14 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS
iy = LYj/Jrj (2.17)
jES
LNhYh,
h
EXPECTATIONS AND VARIANCES OF SAMPLE SUMS 15
where Yh is the mean of y in the part of the sample coming from Sh; the
hth term in the sum is the expansion estimator for the total of y over
Sh. The corresponding estimator of JLy is the stratified sample mean
H
Yst = L WhYh,
h=1
(2.18)
N N N
Var(Lzj} = Lz~Var(ljs} + L LZjZkCOV(ljS, Iks}· (2.19)
jes j=1 j '# k
Since Var(ljs} = 1rj{l - 1rj} and Cov(ljs, Its) = 1rjk - 1rj1rk this
relation becomes
N N N
Var(Lzj} = LZ~1r/l -1rj} + L L Zj Zk(1rjk -1rj1rk). (2.20)
jes j=1 j '# k
A more compact formula is obtainable when the design is of fixed
size (n), for then L:=I Iks = n with probability 1, and
N
- L Cov(ljS, Its} = -Cov(ljs, n - Ijs ) = Var(ljs). (2.21)
k,#j
The first term of (2.19) can be written
N N 1 N N
- LZ~ L Cov(ljso Iks) = -2 L L(z~ + z~)Cov(ljS' Its},
j=1 k,#j j '# k
and hence
1 N N
Var(Lzj) = -2 L L(Zj - Zk)2(1rj 1rk -1rjk). (2.22)
jes j '# k
all 7rj > O. For a fixed size (n) design, (2.22) implies for Y real that
where njk = 7rj7rk -7rjk. Similar computations show that, for x and Y
real,
From this it can be shown that ifall7rjk > 0 (which implies all7rj > 0)
and the design has fixed size, an unbiased estimator for Var(l;,) when
Y is real is
2
1 ( Yj Yk
= -LL W'k - - -) (2.27)
A
v(T.)
y 2 s J 7rj 7rk
where Wjk = (7rj7rk - 7rjk)!7rjk and L Ls denotes summation over
j, k E s with j =/: k. A design for which 7rjk > 0 for all j, k is
sometimes called measurable (Sarndal et al., 1992).
It is possible to show that generally no unbiased estimator ofVar(l;,)
exists if the design is non-measurable, that is if some 7r jk = O. (See, for
example, Liu and Thompson, 1983.) An example of a non-measurable
design is systematic sampling, for which 7rjk = 0 if j, k are not sep-
arated by a multiple of the sampling interval K. Thus for systematic
sampling no unbiased estimator of Var(N}is) exists.
The estimator (2.27) is called the Yates-Grundy--Sen variance estima-
tor (Yates and Grundy, 1953; Sen, 1953). The corresponding estimator
EXPECTATIONS AND VARIANCES OF SAMPLE SUMS 17
However, the latter is unbiased more generally, even when the design
is not of fixed size.
The general variance forms of this section do not lend themselves
easily to computation, and it is advisable to reduce them to standard
forms in specific cases. For example, in SRS without replacement,
(2.25) and (2.28) can be used to derive the standard formulae for the
covariance matrix Cov(Nys) and its usual unbiased estimator. For each
j, k, 7T:j = 7T:k = n/ N, 7T:jk = n(n - 1)/ N(N - 1) and the factor
Qjk = n(N - n)/N2(N - 1), so that it follows from (2.25) that
N2 n I
(1 - -)
NN
Cov(Nys) =- L L(Yj - Yk)(Yj - yd f .
n N 2N(N - 1) j oF k
Cov(NYs) = :2 (1 - ;) s; (2.30)
N2 ( 1 - N
v(Nys) = --;;- n) s;, (2.33)
18 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS
where
s; = L(Yj - Ys)2/(n - 1) = LLs(yj - Yk)2/[2n(n -1)] (2.34)
jES
and the owner is happy. 'How are you going to estimate Y?', asks the statist-
ician. 'Why? The estimate ought to be 50y of course,' says the owner. 'Oh!
No! That cannot possibly be right,' says the statistician, 'I recently read
an article in the Annals of Mathematical Statistics where it is proved that
the Horvitz-Thompson estimator is the unique hyperadmissible estimator in
the class of all generalized polynomial unbiased estimators.' 'What is the
Horvitz-Thompson estimate in this case?' asks the owner, duly impressed.
'Since the selection probability for Sambo in our plan was 99/100,' says
the statistician, 'the proper estimate of Y is 100y /99 and not 50y.' 'And,
how would you have estimated Y', inquires the incredulous owner, 'if our
sampling plan made us select, say, the big elephant Jumbo?' 'According to
what 1 understand of the Horvitz-Thompson estimation method," says the
unhappy statistician, 'the proper estimate of Y would then have been 4900y,
where y is Jumbo's weight.' That is how the statistician lost his circus job
(and perhaps became a teacher of statistics!).
The reader is invited to try to resolve the statistician's difficulty. One
resolution would make use of some kind of model-assisted estimation,
to be discussed in Chapter 5.
e = LdjsYj, (2.35)
jES
with
L p(s)djs = 1
S:jES
(2.37)
Rao (1979) has shown that if e is of form (2.35), but not necessarily
unbiased, there is a form of MSE(e) analogous to (2.23), provided that
for some array w with no zero elements
In the terminology of Section 5.7, Rao's condition means that the estim-
ator is calibrated to be correct for the array w. In that case
1
v(e) = --LLs ( djsdks - - 1 )(YJ
- - -Yk)2 WjWk. (2.40)
2 Kjk Wj Wk
This variance estimator has the appealing property of being zero when
y = w, and its non-negativity is relatively easy to establish or refute in
particular cases.
When Y is vector-valued, and the same coefficients djs are used in
the estimation of each component, the covariance matrix Cov(e) may
be estimated by
EXAMPLE 2.6: Returning to the case of scalar y, suppose that the size
of the sample is to be n = 2, and that the sample s = {j, k} is drawn
without replacement, with draw selection probabilities proportional to
probabilities Ph ... , PN, ,,£7=1 Pj = 1. For example, the probability
of obtaining s = {l, 3} would be PIP3/(1 - PI) + p3pJ/(1 - P3). The
estimator of Ty proposed by Murthy (1957) is
e= 1 (Yj
-(1 - Pk) + -(1
Yk - Pj) ) (2.41)
2 - Pj - Pk Pj Pk
SAMPLING STRATEGIES AND LOCAL RELATIVE EFFICIENCY 21
when s = {j, k}. This estimator is not the HT estimator, but is of the
form (2.35), and satisfies (2.38) with W j = Pj. It is unbiased, and the
estimator (2.40) of its variance is
estimator can be more efficient for some y than the sample mean, and
is in fact exact for a particular y. Note that the estimator is of form
LjES djsYj, and can be formed only because the labels of the sample
Y values are available, not just the sample y-values themselves.
In fact it can be shown that, for a given population function fJ, unless
the design is a census it is generally not possible to choose e to minimize
E(e - fJ(y»2, or unbiased e to minimize Var(e), simultaneously for all
y in RN. The first such result was proved by Godambe (1955), and
extensions or refinements of it are surveyed by Cassel et al. (1977).
To make decisions among strategies in practice requires additional
information, not necessarily very formal or precise, about what arrays
y are possible or likely for the population at hand. Such comparisons
are discussed, for example, by Kendall, Stuart and Ord (1983, §39).
Consider again the example of stratified random sampling. With the
notation introduced earlier it is easy to see that in formula (2.23)
rl.jk = nh(Nh - nh)/Nt(Nh - I) > 0 if j. k E Sh.
= 0 if j. k are in different strata.
Then (2.23) becomes
(2.52)
This section deals with the estimation of population moments and cu-
mulants for a real variate y, material which will be used further in Chap-
ters 3 and 4 in discussions of Edgeworth expansions and the bootstrap.
26 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS
Ky(s) = I>p,'
00
p=1
sP
p.
It is easily seen that
KI =ey. (2.53)
K2 = Var(Y) = f.L2. (2.54)
K3 = f.L3. (2.55)
K4 = f.L4 - 3f.L~ (2.56)
and, in general, for P > 1,
Kp = p! ~(-l)P-I(p _
f=t
1)!L2 (f.LPI
PI!
)rl ... (f.L ps
Pst
__·rs!_
)r. rl!"
(2.57)
In these expressions f.Lp is the pth central moment of Y (f.L1 being 0),
and the second sum L2 in (2.57) extends over all positive integers
PI < ... < Ps and all positive integers rl •...• rs such that
s s
LP;r; =p. L r; =p. (2.58)
;=1 ;=1
Moreover, for P > 1, the moments in formula (2.57) can all be replaced
by the non-central moments f.L~ = e yk .
If YI •..•• YN are finite population Y values which are regarded as
independent observations on a random variable Y, there are at least two
natural ways of defining finite population versions of the cumulants.
One, which might be especially appropriate when with-replacement
sampling is contemplated, would be to define the pth cumu1ant as the
coefficient of sP / p! in the expansion of
N
log(L eSYi / N).
j=1
FINITE POPULATION CUMULANTS 27
where L3 is the sum over all vectors (z" ... ,zp) whose components
are distinct coordinates of (y" ... ,YN)' Note that the number of terms
in L3 is N to p factors, or N(P), given by N(N - 1)··· (N - p + 1).
In fact, K p also has the following expression:
(2.59)
K2 =
= (2.60)
N
= NL(yj - Jly)3/(N - 1)(N - 2) (2.61)
j='
28 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS
N
[N(N + I) L{Yj - f-Ly)4 - 3(N - I)
j=l
N
x {L{yj - f-Ly)2}2]/(N - 1)(3). (2.62)
j=l
(2.63)
where the second summation has terms as in (2.59), but only Y values
with labels in the sample are used. It is easy to see that under SRS
without replacement, n draws, Ekp = K p, for the joint inclusion prob-
ability of any set of p distinct units is n (p) / N(p). (Recall that E denotes
expectation with respect to the sampling design.)
If the sampling design is not necessarily SRS without replacement, an
extension of the arguments involving indicators in Section 2.2 can give
an unbiased estimate kp of Kp. For example, from (2.60) an unbiased
estimator of K2 = S;
might be
2
k2 = ~L Yj - 1 L L YjYk. (2.64)
N jES lrj N(N - 1) S lrjk
For some designs this k2' unlike K 2, can actually be negative. Take
N = 3 and let p({I, 2}) = 0.9, p({I, 3}) = p({2, 3}) = 0.05. Then for
s = {I, 3} and Yl = Y3 = 1, k2 = -2.98 < O.
On the other hand, it has been seen in Section 2.2 that K2 has an
alternative, more compact, symmetric form
INN
K2 = 2N(N _ 1) ~# ~{yj - Yk)2, (2.65)
k_ 1 {Yj - Yk)2
(2.66)
2 - 2N(N _I)LLs lrjk
FINITE POPULATION CUMULANTS 29
which is clearly non-negative, and 0 when the sample Y values are all
equal.
There are also compact alternative expressions for K3 and K 4:
K3 =
4
:3'
1
N(N _ 1)(N _ 2) ~ ~~
N N N ( +)3
Yi - Yj 2 Yk
liN N N N (2.67)
K4 = 4-N(4) L L L L[(Yi + Yj - Yk - YI)4
i#j# k# I
-12(yi - Yj)2(yk _ YI)2].
These expressions follow readily from a general formula of Good
(1977), namely that
Kr = r_1_"
N(r) ~
... "(y.
~ JI
+ w J2 + w2y.J3 + ... + wr-'y.)r
tJ •
of J,'
(268)
•
where w is the rth root of unity e2tri / r , and the r-fold sum is taken over
all sequences of r distinct subscripts between I and N. For example,
as = LYj
jES
and
Var(as) = n (1 - ~) K2.
It is also not difficult to show that
Thus each of the first three cumulants of the sample sum is n times a
function of n / N times the corresponding population K statistic. This is
not quite true for the fourth-order cumulant, but it is possible to show
that an approximation to the fourth cumulant of as has a similar form:
N -I
E(as - n/Ly) - 3 N
4
+ 1 (Var(as »2
=n (1 - ~) (1 - N: 1 6 ~ (1 - ~)) K 4• (2.71)
For this quantity when N is large the dependence of the right-hand side
on N is mainly through n / N.
The exact fourth cumulant of as is E (as - n/Ly)4 - 3(Var(as given »2,
by
n (I - ~) (I - N: I6~ (I - ~)) K4 - 1 I- ~)
N: n2 (
2
Ki·
(2.72)
A relatively easy way of verifying (2.71) is based on the fact that any
symmetric fourth-degree polynomial in Yt, ... ,YN which is invariant
when all the Yj are increased by any amount c is a combination of
N N
A= L L(yj - Yk)4/N(2) (2.73)
j i- k
and
N N N N
B = LLLL(Yi - Yj)2(yk - YI)2/N(4). (2.74)
ii-ji-k#
For example, K4 = (A - 3B)/2. It can be shown that
E(as
_
nIL) -
4 _ n(N - n) [(N - n)3
N4 2 +
n3
2
)A
+(~n(N - n)N(N + 6) - ~N3)B] (2.75)
MULTI-STAGE SAMPLING 31
and
2 n2(N-n)2[NA N(4)B]
(Var(as )) = N4 2" + 4(N - 1)2 • (2.76)
y
1(1+--
.).:=-+-
n4 K4
N
2) 4 N-l
B. (2.77)
Var(s 2 ) = Es 4 -.).:
y Y
n4 = K4
Y
(1- - -1) +-B(-1- - -1)
n N 2
- .
n-l N-I
(2.78)
be partitioned into PSUs 131, ••• ,13£ with sizes M 1 , ••• , M£. Then
L;=1 Mr = N.
Assume that at the first stage a sample s B of PSU labels is taken.
Then, independently for each r E SB, a sample Sr of m(sr) elementary
units is selected from 13r according to some scheme. Using this notation,
the total sample is
S = Us"
resB
and n(s) = LresB m(sr)' We shall assume that the scheme for sampling
within a selected PSU 13r is not dependent on the composition of the
rest of SB.
The first-stage inclusion probabilities will be defined by
TIr = P(r E SB),
Ty = L..,
(2.79)
rESB r
with
(2.80)
(2.81)
r=1
Suppose further that for r E SB, the subsample Sr is chosen by SRS
ESTIMATION IN MULTI-STAGE SAMPLING 33
the mean of the subsample means of y over the subsamples Sr. In the
special case that mr = m for all r (a feature often incorporated in
practice), clearly 'lrj = 1m / N for all j, and the design is self-weighting
and of fixed size (1m) in the elementary units; the mean of subsample
y
means is just the overall sample mean of y.
Note that equation (2.82) will apply not only when SRS is used at
the second stage but also whenever the Sr are selected within PSUs by
designs which are self-weighting and of fixed sizes.
r=1
where Tr is the total of y over the rth PSU Br . When the first-stage
sample is taken without replacement, a typical estimator of Ty is of the
form
(2.83)
rESS
clearly
L
E(e) = E\ ( L d,(SB)T,.) = L T,. = Ty.
,eSe ,=\
and e is an unbiased estimator of Ty.
The analogous formula for computing variances by conditioning can
be written
Var(e) = E\ (Var2(elsB)) + Var\ (E2(elsB)) (2.84)
with Var\ and Var2 being defined in analogy with E\ and E2. For an
estimator e of the form L,ese d,(SB)/" this becomes
-
Var(NY)
N ~ TI, (
= 2"
2
L..,- 1- -
m,) S, 2
I ,=\ m, M,
+-2 '"
1 L "'(TI
L TI (
T. T. )2
L.., L.., 'q
- TI rq ) -
TI ' -....!!....
TI • (2.86)
'''' q 'q
adapting formula (2.23) to the notation for the first stage of sampling.
For the problem of variance estimation in multi-stage sampling it
is useful to consider a sort of backward decomposition of expectation
in which conditioning is done not on the first-stage sample but on
the subsequent subsampling. That is, imagine implementing the design
backwards, by first selecting s, from every B" and then picking SB, so
that only the Yj for j E S,' r E SB, are actually kept in the final data.
ESTIMATION IN MULTI-STAGE SAMPLING 35
(
Varl (els" r = 1, ... , L) = -I L L
L Lcnrnq - n rq )
tr
- - -
tq )2
2 r '#- q nr nq
1 L L ( tr
Var(e) = £2 ( "2 LL(nrnq - n rq ) TI
tq )
- IT 2) + L L
V2."
r,#-q r q r=1
where V2,r is the variance of tr with respect to stages of sampling after
the first.
Returning to the general case and (2.89), it is clear that an unbiased
estimate of Var(e) can be formed as follows:
vee) = VI +L d;(SB)V2.r (2.90)
rEss
where V2,r is an unbiased estimate of V2,r from the y values in s" d;(SB)
satisfies the same unbiasedness condition as dr (s B), and VI is an unbi-
ased estimate with respect to the first-stage design of Varl (elsr , r =
1, ... , L). This principle for forming variance estimates was given by
Rao (1975), having been put forward in a less general context by Durbin
(1953).
If as above dr (s B) = I I n" the first-stage design is of fixed size (I),
and each Sr is selected by SRS without replacement, mr draws, from
36 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS
N 2" "
2 L... L...ss
(~)2
I
TIrTIq - TIrq (ji __
TI r Yq
)2 + "L.J V2,rTI . (2.92)
rq rEss r
A simple approximation to (2.91) can be obtained in the case where
I is small compared to L, the total number of PSUs, and sampling at
the first stage is approximately with replacement with constant selection
probabilities TIr / T. In this case TIrq is approximately TIr TIq (l - 1) / I
and
(2.93)
Also, if the TIr are uniformly small and the Mr are bounded, the first
term in (2.91) will predominate, and we have the variance estimate
All ( tr tq
2' . (1- 1) L Lss TIr - TIq
v(Ty) ~
)2 ,tr = Mryro (2.94)
= Pr Pq C~ Pr + 1 ~ Pq ) .
To find {Pr} so that TIr = 2ar exactly, an iterative computation may
be used. For example, let
of the units drawn in the current and immediately preceding draw, and
the selection probabilities Pr would be used at each draw after the first
one. Units 'rotated out' would become eligible to be drawn again.
Besides using successive draws without replacement, there are other
ways to implement an SRS of fixed size. One is to select units with
replacement, and then to reject the sample if there are duplications.
This same notion can be extended to unequal probability sampling.
The following is one kind of unequal probability rejective sampling.
This method gives a fixed size sample with the desired inclusion prob-
abilities provided at most one point of the systematic sample can fall
in any of the subintervals. Thus it works with the proviso mentioned
earlier, that none of the a r is greater than 1/1.
Essentially this method is sometimes used in monetary unit sampling,
where instead of PSUs being sampled with probabilities proportional
40 MATHEMATICS OF PROBABILITY SAMPLING DESIGNS
This method is the same as the previous one, except that the order
of the subintervals determined by the ar is first rearranged at random,
before the systematic sample is taken. If all the ar happen to be equal,
the procedure is equivalent to an SRS of the PSUs.
When the HT estimator is used with any of the first-stage designs
above (and SRS at the second stage), an unbiased estimator of its
variance is given by (2.92), which necessitates knowledge of the joint
first-stage inclusion probabilities nrq • These are not difficult to work
out and compute for the first three designs, at least when I is sufficiently
small. For the random systematic procedure, exact computation of nrq
is possible but complicated: see Connor (1966). Hartley and Rao (1962)
have given an approximate formula for nrq for use when L, the total
number of PSUs, is large.
(L./r(J)/ Pr(J»/ I
j=l
where rU) is the label of the PSU obtained at the Jth draw, and tr(J) is
unbiased with respect to the second-stage design for the corresponding
PSU total 1',.. (The second stage sampling scheme is repeated independ-
ently in PSUs which are drawn more than once.) An exactly unbiased
estimator of the variance of this is
~
I
. Itt
21(l - 1) j=l k=!
(tr(J) _ tr(k)
Pr(J) Pr(k)
)2, (2.97)
j#
which strongly resembles (2.94) and actually reduces to (2.95) when tr
is Mr Yr and no PSUs are duplicated in the sample. Choosing Pr = a r
gives estimators which coincide with the HT estimator with TIr = lar
when I distinct units are drawn.
Bernoulli sampling
For each r = 1, ... , L independently, the PSU r is included in the
sample with probability lar and excluded with probability 1 - lar.
The first-stage inclusion probability will be nr = lar and the expected
PSU sample size will be I, but the actual PSU sample size is potentially
any integer between 0 and L. This design is theoretically important (see
Sections 3.3 and 3.4) and mathematically simple, but its variable sample
size may be a disadvantage in practice since it means there is little
control on the information in the sample. When there is non-response
in a single-stage design, the respondents are sometimes assumed to
constitute a Bernoulli subsample of the originally intended sample.
Exercises
2.1 Consider a population of size N = 3, and let the sampling de-
sign be equal probability Bernoulli sampling, where each unit in
the population is included in the sample with probability 2/3,
independently of the others. Give pes) for each subset s of the
population U = {I, 2, 3} under this scheme. Find the inclusion
probabilities Jrj, and verify that their sum over all population
units is the expected sample size.
2.2 Recall that if all inclusion probabilities Jrj are positive, then the
HT estimator L j es Yj / Jrj is unbiased for 1'y. Show that if some
Jr j = 0 and if the corresponding Yj is allowed to vary independ-
ently of the other components of the popUlation array y, then there
is no unbiased estimator of 1'y with respect to the sampling design.
2.3 For each of the following sampling schemes, give values or
expressions for the inclusion probabilities Jrj and the expected
sample size En(s). State whether the design is self-weighting
and whether it is of fixed size.
(i) Systematic sampling with K = 3 and N = 7: choose a start-
ing unit il at random from {I, 2, 3}, and let the sample con-
sist of iJ and jl + K, and jl + 2K if this last unit is in the
population.
(ii) Circular systematic sampling with K = 3, N = 7 and n = 3:
choose a starting unit at random from {l, ... , N} and let the
sample be {jl, il +K, ... ,il + (n -1)K} where the unit label
is taken to be its value modulo N.
2.4 Verify (2.39) in the text.
2.5 Show that the estimator in Example 2.6 is sampling unbiased, and
that vee) of (2.42) is of the form (2.40).
2.6 In a two-stage sample, suppose that the size of the first-stage
sample is I = 2, and that the first-stage sample s B = {r, q} is
drawn in successive sampling without replacement, with selec-
tion probabilities proportional to probabilities PI, ... , PL. In the
notation of Section 2.7, the estimator of 1'y corresponding to the
estimator of Murthy (1957) is
e = 1 [ -(1
tr tq
- pq) + -(1 - Pr) ] .
2 - Pr - Pq Pr Pq
Suppose second-stage sampling is carried out by SRS without
44 EXERCISES
2.9 Show that for Sampford's rejective sampling method (Section 2.8)
with I = 2, the inclusion probability fIr is equal to 2ar , and give
an expression for fI rq •
2.10 In a sampling method due to Durbin (1967), for I = 2, the
first unit r is selected with probability ar , and the second unit
q without replacement with probability proportional to bqr =
aq {(1 - 2ar )-1 + (1 - 2a q )-I). Show that the inclusion proba-
bility fIr is equal to 2ar , and give an expression for the joint
inclusion probability fI rq .
2.11 McLeod and Bellhouse (1983) describe a method for drawing a
simple random sample without replacement (size n) on a single
pass through a sequentially ordered population of size N. The first
n units of the population are selected as the initial sample. When
the kth unit is encountered, for k = n + 1, ... , N, the sample
remains the same with probability I - n / k; with the remaining
probability n / k a randomly selected member of the current sample
is replaced by unit k. Show that this method does indeed produce a
self-weighting design. Note that N need not be known in advance
for this procedure to be carried out. Chao (1982) gives a method
of 7r ps sampling which is a generalization of this.
2.12 In Midzuno's sampling design (Midzuno, 1952; Rao, 1963) the
first unit j of a single-stage size n sample is selected with prob-
ability P j, and the remaining units are selected with equal prob-
abilities without replacement. Show that if Pj = x j / Tx for some
positive variate x, then the ratio estimator
eR = Tx(LYj)/(LXj)
jES jES
EXERCISES 45
It
then
Var (t
j=l
Pj 1',.(j)) = lk(k -
Ctr(j) L(L - 1)
1) Tr2 -
r=l Ctr
I21
y
where 1',. is the total of y in the rth PSU. Hence explain why
v(e) of (2.98) should be an approximately unbiased estimator of
the variance of the conditional HT estimator L~=l Pjtr(j)/Ctr(j)
for large k.
Solutions
°
2.2 Suppose lrl = 0, so that p(s) = whenever s contains 1. Suppose
also
LP(s)e(xs ) = Ty (*)
SES
for all possible y. Varying Yl but not the other components of y
will make the right-hand side of (*) change, but not the left-hand
side, resulting in a contradiction.
2.3 (i) lrj 1
= for each j; En(s) = t;
design is self-weighting but
not of fixed size.
(ii) lrj = ~ for each j; En(s) = 3; design is self-weighting and
of fixed size (3).
N
2.4 e - Ty = L(djsljs - I)Yj. Thus
j=l
2.5 Ee = LL PjPk(
N N 2
- Pj - Pk)
j<k 0- Pj)O - Pk)
x 1 [ Yj O-Pk)+Yk(1_pj)]
(2 - Pj - Pk) Pj Pk
= ~
2
t t[Yj~+Yk~]
j # k 1 - Pj 1 - Pk
=Ty.
If S =
{j', k} then dJ·s =
(i-Pk) x -2_1_. Note (2.38) is sat-
Pj -pj-Pk
isfied with W j = Pj for all j. In (2.40),
2.9 The design is of fixed size (2). Thus to show IIr = 2a" it suffices
to show IIr ()( ar. Now IIr = ar(Lq1'r {3q) + (Lq1'r aq){3r +
(L q aq{3q)II r . Thus
IIr ()( arO - {3r) + (1 - a r ){3r = a r + {3r(1 - 2ar) = (1 + K)ar.
47
= 1 + "" aq
L..q -2a = C;
q q
= 1),.
Generally 7rj = Pj + Lk#jPk (~-:::.D = ~-:::.11 + Pj Z::::~. To make
7rj = nXj/Tx, set Pj = [r - ~-:::.;] Z::::!, provided all these quant-
ities are non-negative.
e
2.13 The quantity = L~=1 PjTrU)/arU), for which the variance is
being calculated, is conditionally unbiased for Ty, given the ran-
dom grouping. Thus its variance is the expectation of its condi-
tional variance, given the random grouping, and is
48 EXERCISES
I) ~
- 1 ( 1-- ( TrU")
~Pj - - e
_)2
I- 1 L j=l CXrU)
t)
is 1~1 (1- {L;=l Z- Var(elgrouping) - 1'y2 }. Thus its uncon-
ditional expectation is
-1- ( 1 - -
1-1 L
I) [L-k-l
--1 - 1] Var (-e) = Vi-
ar(e).
Yj = 1 if unit j has C,
= 0 otherwise. (3.1)
of (2.60) is
S2
y
= ~P(l-P)
N -1 '
and hence from (2.30)
1 N-n
Var(ps) = ~ N-l
P(1
-
P)
N-n
Var(m s) = n N _ 1 PO - P) (3.2)
N 2 N-n
Var(Nps) = - - - P O - P).
n N-l
But we can go further in this case, and obtain an exact distribution for
P(m s = m) =
(~)(~=~) (3.3)
(~) , m=O, ... ,n.
(3.8)·
(3.9)
n -mo
P ( F(2(m~+1),2(n-mm > m~ + i 1 -P P ) . (3.10)
(3.11 )
where A = 9(m~+1)(n-m~)(9n+5-z~)+n+l, andza is the a quantile
of a standard normal distribution; when the binomial approximation for
the distribution of m~ is a good one, the associated confidence interval
will be close to exact. For example, when n = 10 and m~ = I both
formulae (3.10) and (3.11) give Pu = 0.394 to three decimal places.
A detailed discussion of binomial confidence intervals has been given
by Blyth (1986).
!
When a < and the normal approximation to the distribution of ms
is reliable, (3.7) becomes
o 1
ms + -2 -nP
-;:.======== = Za = -Iza I (3.12)
fn(N - n) PO _ P)
Y N-l
where Za is again the a quantile of a standard normal distribution.
The 112 in the numerator of the left-hand side gives a correction for
continuity. Use of this equation means solving a quadratic equation for
Pu , and yields
z2 / a 2 Z2)
Pu = (a +"2 +zya - --;; + 4" I(n +z2) (3.13)
Pu
A
s
°
P~L = P - - 1 - -ZI-a
2n In
IN -
N-l
n
--p0(1 - pO).
S S
(3.15)
IN-n
N _ 1n[pis + P2s - (p2s - Pis) ] 2
(3.18)
(3.19)
(3.20)
(3.21 )
N
¢(u, t) = D[(1 - )..)e-iA(u+tYj) + )..ei(l-A)(U+tYj )]. (3.22)
j=l
_1
2JT
l 1C
(L L p(x, y)eiux+itY)du/ P(X = 0)
x Y
l
-1C
1C
= -2
1 ¢x,Y(u, t)du/ P(X = 0). (3.23)
JT -1C
where
v = NA(1 - A) = nO - n/ N) -+ 00 as N -+ 00.
THE FINITE POPULATION CENTRAL LIMIT THEOREM FOR SRS 59
·
11m
N-+oo
QN
VN(2+8)/2 =
° (3.25)
L IYiI
N
QN =)..{l -)..) 2+8.
j='
This condition is analogous to Lyapunov's condition (Feller, 1971, p.
286), and would hold, for example, if y" Y2, ... were observations
from independent and identically distributed variates Y" Yz. . .. with
finite (2 + 8)th moment, and v = N)..(l -)..) approached 00 with N.
Alternatively, (3.24) holds if
lim jmax
N-+oo }
Y~
DN
I x (number of j such that IYjl > Ej'V;) = 0.
In the special case where M of the Yj values are 1 and N - M are
0, condition (3.24) can be shown to be equivalent to
1.
1m
M(N - M)n(N - n)
3 = 00, (3.26)
N-+oo N
or
lim vP(l - P) = 00. (3.27)
N-+oo
This implies that we may think of a sample number ms as being ap-
proximately normal if P is not too close to or 1, and if nand N - n
are moderately large.
°
In outline, the proof of the sufficiency of (3.24) from representa-
tion (3.21) proceeds as follows. The details are given in Renyi (1970,
Chapter 8, Section 5).
First note that since 4>(0) must be 1, (3.21) can be rewritten as
4>(t) = <l>(t)/<l>(O), (3.28)
D
where
<l>(t) = JV j-Jr
Jr N
p(u + tYj)du
and
)..(1 -)..) )
exp ( - 2 x2 for x near 0.
60 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING
u = w/Jv, 1= r/y'V;
makes it possible to rewrite <I> (t) as
I IT n./V
-n./V j=]
p (~+ rYj )
JV../VN
dw
.
(3.29)
Now for any E > 0, the sufficient condition (3.24) implies a negligible
effect as N --+ 00 of factors in which Ir h /../VNI 2: E. If Ir h /../VNI <
E, and also Iw/ JVI < 2E, we can use the approximation
k(;+ ~)I
is uniformly less than 1, and that the product of an indefinitely increas-
ing number of such factors becomes negligible. Thus the integral <1>(/)
of (3.29) may be approximated by
J 2E./V
exp -
1).,(1 -).,) L N ( W
-+--
rYj )2} dw
-2E./V 2 j=] JV ../VN
or by
J-00
oo
exp -2 - 2w
{r2 2
}
dw =
r2 }
J2rr exp { -2
for large N. Dividing by the value at 1 = 0 (or r = 0) as in (3.28) gives
the limiting characteristic function of LjES Yj /../VN as the standard
normal characteristic function exp( _r2 /2).
where
ON(E) =-
I
L II h 112
N j:IlYiIl>fNA(l-A)
and II is the Euclidean norm, the distribution of the normalized
sample sum
jEs
approaches the MV N(O, :E) distribution.
The idea of the proof has been given by Rao (1973).
e = A LYjjAj (3.31)
jes
l
under this design is
1C
I
277: -1C </J(u, t)duj B (3.32)
jes
L L
(iv) n LWln;;lS~ -+ 1:, a positive definite matrix, where n = Lnh.
h=l h=l
Then as H -+ 00 the distribution of ..;n<Yst-f.Ly) approaches MVN(O. 1:).
Also, for any linear combination of components Ya, the distribution of
the usual studentized stratified sample mean will approach normality.
Using an approach similar to Rosen's, Sen (1980; 1988) has shown
how to obtain asymptotic normality of a Horvitz-Thompson type es-
timator from a multi-stage sample where the primary sampling units
are obtained by successive sampling. It is assumed that the number L
of PSUs in the population and also the sampled number I approach
infinity at the same rate. The most important sufficient condition for
asymptotic normality of the estimator is a modified Lindeberg condi-
tion on the variates tr - 1',., where tr is an unbiased estimator from the
later stages of sampling of the total 1',. in the rth PSU.
The implications of asymptotic normality results are important both
for estimation and for choice of sampling strategy, as the next two
subsections will illustrate.
64 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING
e = LdjsYj or e = Ldr(sB)tr
jEs rESB
of the population total Ty is asymptotically normal. That is, suppose
that in a suitable asymptotic framework, the distribution of
e- Ty
(3.34)
JVar(e)
approaches the N(O, 1) distribution as the index N ~ 00. If v(e) is a
design-consistent estimator of Var(e) , or in other words if v(e)/Var(e)
approaches 1 in probability as N ~ 00, then the studentized variate
e - Ty
(3.35)
Jv(e)
also has a limiting N(O, 1) distribution, and approximate two-sided
100(1 - 2a)% confidence intervals for Ty are given by
e ± ZI-aJv(e), (3.36)
00
1
f(x) = exp L(-l)rKr Dr)
- , 1/1 (x),
r=3 r.
(3.38)
i: i:
the same mean and variance as f. This follows from the key identity
i:
For then
where the hj (x; KI, K2) are polynomials called Hermite polynomials
and are defined by
FORMAL EDGEWORTH EXPANSIONS 67
and
hl(y; 0,1) Y = h4(y; 0,1) = y4 - 6y2 + 3
h2(Y; 0, 1) = y2 - 1 =
hs(y; 0, 1) yS - 10y3 + 15y.
=
h3(y; 0, 1) y3 - 3y
The expansion of (3.40) can be integrated tenn by tenn to give a
representation for the tail probability:
P(X~x) = \II(X)-1/f(X){K~h2(X;KI'K2)
3.
+K 4. Kr.}-I
4!h3(x,KI,K2) + 2(3!)2hs(X, KJ, K2) +o(n )
(3.41)
where \II(x) and 1/f(x) are the cumulative distribution function (c.d.f.)
and probability density function (p.d.f.) of an N(KI, K2) variate.
The two expansions (3.40) and (3.41) have been written so as to dis-
play the tenns of order n- I / 2 and n-I. Proof of the validity of(3.41) in-
volves showing that n times the difference between left- and right-hand
sides approaches 0 as n -+ 00. Under sufficient regularity conditions
this is so: in fact the error is typically of order n -3/2. Taking further
tenns in the expansion theoretically improves the approximation. When
X is a discrete variate, the expansion (3.40) is regarded as giving an
approximation to P(x ~ X < x + 8)/8 for suitable 8.
If the variate X is ..;n(X - J-L)/u, where XJ, ... , Xn are Li.d., the
validity of (3.41) requires only that the fourth moment of XI exist and
that its distribution not be concentrated on a lattice. If XI were a lattice
random variate and the lattice had span a, the possible values of X
would be shifted multiples of a /..;n, and the largest jumps in its c.d.f.
would have to be of order n- I / 2 • Since the right-hand side of (3.41) is
continuous, the error tenn could not approach zero faster than n -I /2 ,
and even the one-tenn expansion
P(X ::5 x) = <I>(x) - fjJ(x) (KI + ;~ h2(X; 0, 1») + o(n- I/2) (3.43)
K
3 • _I
+ 2(3!)2
2 hs(X, 0, 1) ) ] + o(n ), (3.44)
where <I>(x) and fjJ(x) are respectively the N(O, 1) c.d.f. and density
function. One way of deriving these expressions is to expand
X - KI)
\I1(x) = P ( Z::5.jK2 , z'" N(O, 1)
and t/I(x) about KI = 0, K2 = 1.
3.7 Edgeworth expansions for the distribution of the sample
sum in SRS
Finding and justifying an Edgeworth expansion for the distribution of
the sum in SRS from a finite population is a complicated task. The
distribution of the sample sum is discrete, and in fact is often a lattice
distribution. Moreover, both the sample size n and the population size N
are considered to be approaching infinity; thus there might conceivably
be more than one way of ordering terms in the expansion, depending
on the relative growth of n and N. Finally, as in the case of the finite
population central limit theorem, conditions on the triangular array of
Y values for the population units must be formulated if the expansion
is to be applied very generally in the absence of models for y.
For simplicity's sake we take y to be real to begin with. We have
seen in Section 2.5 that the first four cumulants of the centred sample
sum LjES Yj - nf.Ly are given as follows in terms of the population K
statistics:
0, n(1 - )")K2, n(1 - )..)(1 - 2)")K3,
n 2 (1 - )..)2 ( 2 K4)
n(l - )..)(1 - 6)..(1 - )"»K4 - 6 N+I K2 - Ii '
EDGEWORTH EXPANSIONS FOR THE SAMPLE SUM IN SRS 69
P(W ~ w) =
(3.46)
i:
converges weakly to a strongly non-lattice distribution, the moment
lyl3+ 8dFN(Y)
is
1/I'EN(W) - N- 1/2 [L
IfJl=3
fJ\ (~ tyj) DfJ1/I'EN(W)] +o(n- I/2),
J=I
(3.47)
where 1/I'EN is the joint density for MV N(O, I: N ) and I:N is the co-
variance matrix of W. Notationally, if fJ = (fJt, ... , fJp) is a vector of
non-negative integers, in the above expression IfJl = fJt + ... + fJp,
{J! = fJl! ... fJp!, DfJ = Dfl ... D~P, xfJ = Xfl ... x~p.
EDGEWORTH EXPANSIONS FOR THE STUDENTIZED MEAN IN SRS 71
Since confidence intervals for the population mean are usually based on
the studentized sample sum (or sample mean) rather than on the stand-
ardized sample sum, it is also of interest to be able to approximate the
distribution of that quantity in SRS.
We may begin by finding the cumulants of the studentized sample
mean. Denoting the sample mean as before by Ys> let
t = r===vIn==n(ji~-=s=-=/L=y)=== (3.48)
1", -2
(l - J..) n _ 1 4-)Yj - Ys)
]ES
fiI t ZI (3.49)
y;;-:::t = [ 1 + .JT=I Z2 (1 - J..)ZI] 1/2
(T
vIn - - ---=---=-
(T2 n(T2
h
were (T
2 - .1
-
"N -2 Z _ .;n<Y,-J.Ly) Z _ .;nz,
N ~j=IYj' I -
. _ -2 _ 2
JI-I ' 2 - JI-I' Z] - Yj (T
and Yj = Yj - /Ly. Taking J.. as fixed and expanding (3.48) in powers
of n- I / 2 gives
filt =
y;;-:::t
(3.50)
K2 = Var(t)
(3.51)
72 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING
3!
1
.Jf-=-5:
1- Aa
K3
3 «2 - A)x 2 + 1 - 21..)
q2(X) =
K4
x ( A2(x) a 4
Ki + C
+ B2(X)~ )
2(x)
= (1 3 + A) (1 - A 1) 2
A2(x) 8(1- A) - -8- + -8- - 24(1 _ A) x
B 2 (x) ( 27
72 +
15)" 15)
72 - 72(1 - A)
1 14 10) 2
+ ( -4 + 721.. + 72(1 _ A) x
--1 ( 3-1..+--
1 ) x4
72 1- A
is bounded and
FN(y)
I
= N(no. f -
0 Yj ~ Y)
i:
-00
small within the strip but far from the real axis, the Cauchy-Goursat
theorem implies that also for the real constant a,
(3.57)
so that
f(x) = _1_ tx)
e-x(a+ib)eKx(a+ib)db, (3.58)
2JT Loo
where here both a and b are real. Expanding Kx(a + ib) about a gives
Kx(a + ib) = Kx(a) + ibK'x(a) - (b 2 /2)K';(a) + ...
and
eKx(a+ib) = eKx(a)+ibKxCa)-p{K;Ca)(I + R(a, b» (3.59)
with R(a, b) being expandable as a series in higher powers of b. In
applications X is typically a member of a sequence of variates with
density depending on a parameter n, and for which R(a, b) tends to 0
as n ~ 00.
If a is now selected to satisfy
K'x(a) = x, (3.60)
the integrand of (3.57) has a saddlepoint at a, and the approximation
following from (3.59) in (3.58) is
f(x) ~ _1_ roo
e-xaeKxCa)-(b2/2)K;Ca)db (3.61)
2JT 1-00
or
1
f(x) ~ f,'L[K';(a)r l/2 exp{Kx(a) - xa}. (3.62)
...,2JT
Note that the value of a selected depends on x, the argument of the
density to be approximated. Note also that the approximation need not
integrate to 1; in fact, it is usually improved by normalization.
The density approximation (3.62) works well when Kx and its
derivatives are of the order of n, for then the main contribution to the
integral (3.61) comes from b in an interval of form (-C/v'n, C/v'n);
over such an interval the terms of R(a, b) are 0(1). For example, if X
is the sum of a sample of n Li.d. observations with cumulant generating
function Kt. then (3.62) gives
1
f(x) ~ f,'L[nK~(a)rl/2 exp{nKI (a) - xa}
...,2JT
where a satisfies
SADDLEPOINT APPROXIMATIONS FOR SRS 75
To begin with it follows from (3.21) that if a and b are real, and a is
any real number, we have
= _1_1" fI
2rr B _" j=1
{(l - A)e-J.[{j+i~jl + Ae(l-J.)[{j+i~jl}du,
(3.63)
where A = n/N and B = (~)An(1 - A)N-n as before, and the real and
imaginary parts of the argument of K are
Sj = ah + c, ~j = bYj + u.
This expression can be written as
where
K(z) = 10g[(1 - A)e-J.z + Aeo-J.)Z]. (3.65)
Then the integrand exp(E K(sj + i5j)} is approximated by
(In this section E will denote summation over j from 1 to N.) We can
now choose c = c(a) to make the integrator u vanish from the second
term in the exponent. That is, choose c(a) to satisfy
L K' (aYj + c(a» = O.
From this point on in the derivation, Sj = aYj + c(a).
Letting u = 1/f/.;n, we can rewrite the logarithm of the integrand of
(3.64) as approximately
1{r to range from -00 to 00, so that we are in effect integrating over a
normal density, we obtain
let
(3.68)
noting thatQ(a) = eawdF(w). Then the characteristic function for the
tilted distribution V is
~j ah +c
and a and c satisfy the sadd/epoint equations
L K'(ajij + c) = 0
(3.70)
78 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING
with the same (12, ~j, c, a; the second saddlepoint equation is written
LyjK'(aYj + c) = n(jis -/Ly).
It is interesting to note that the integral in (3.64) is actually the joint
'complex moment generating function'
E [e(a+ib)(Lje, YJ )+(c+iu)(n(s)-n)]
(cf. (3.21» under Bernoulli sampling, integrated over u from -1l' to 1l'
so as to make it conditional on n(s) = n. It is not difficult to show that
the density approximation (3.69) can be obtained as follows. Find a
bivariate normal approximation to the appropriately tilted joint density
under Bernoulli sampling of LjES Yj and n(s)-n; the tilting parameters
are a and c. Then use the conditional (normal) distribution of the first
component, given that the second one is zero, as the approximation to
the V distribution of (3.68).
Wang (1993) has developed a direct saddlepoint approximation which
is based on a method for computing the cumulant generating function
of LjES Yj exactly for small n. It provides an alternative to Robinson's
method in the approximations to tail probabilities which follow.
or
jES jES
from the density approximations (3.69) and (3.71); see Daniels (1987).
The first, which we shall not describe in further detail, is simply to
integrate the normalized density approximation numerically.
The second approach, used by Robinson (1982), can be motivated
by going back to the step preceding (3.68), and approximating the
tilted distribution V described by (3.68). We consider here the simple
N(m, (12) approximation with m, (12 given by (3.66). Since
dF(w) = Q(a)e-aWdV(w),
SADDLEPOINT APPROXIMATIONS FOR SRS 79
it follows that
and
LjijK(~j) = w,
so that m = w, gives
-aw + a2T
0'2 }
(1 - <I>(aa)), w > O.
(3.72)
Robinson (1982) has shown that this approximation should typically
have an error of order O(n- 1/ 2 ). Taking a one-term Edgeworth expan-
sion for the approximate V distribution improves the order of approx-
imation.
The third approach is based on the fact that the saddlepoint density
approximation for the centred mean is of the form
cb(rJ} exp{l (I])}
where
/(1]) = LK(~j) -nal],
mean is
where <p and <l> are the standard normal p.d.f. and c.d.f., K is given by
(3.65),
r = r('1) = sgn('1}(2[na'1- LK({j)])I/2,
<12 = Ly;K"({j) - (Lyj K"({j»2/(L K"({j»,
and {j = aYj + C and a and c are calculated from
LK'({j) = 0
LyjK'({j) = n'1.
Extensions of the approximations of this section to the case of vector-
valued y are straightforward.
depend not only on the observed Ys but also on the unobserved part of
the population array.
One way of adapting the method to the finite population problem,
which is essentially the estimation of a nonparametric mean, is to ima-
gine for each possible value of 0 = i-Ly a suitable population array y(O)
which is consistent with the observed sample and which has a mean of
O. Given an observed sample mean y~, we would then define eu as the
greatest value of 0 for which
P(jis ~ y~;y(O» ::: a. (3.74)
e
That is, u would be the greatest value of 0 such that the test array y(O)
is not rejected by a one-sided a-level significance test. With appropriate
choice of the conceptual array y(O), we would hope to achieve close to
the nominal coverage probability 1 - a by this method. A saddlepoint
tail probability approximation could be used to provide an estimate of
the left-hand side of (3.74) for each O. Because of its interpretation
e
in terms of significance testing, u would be called an inverse testing
confidence limit.
(i) LWi = N
i=1
r
L
r
(iii) the Kullback-Leibler distance Wi log( Wi I ai) is minimized.
i=1
Conditions (i), (ii) and (iii) are satisfied when
r
Wi = NpilLPi, (3.75)
i=1
82 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING
where
Pi = ai exp{t(9)/} (3.76)
and t = t(9) is the unique solution of
r r r
9 = Lai/elyi/Laie'Y' = LWi//N. (3.77)
i=1 i=1 i=1
Of course the weights Wi satisfying (3.75)-(3.77) exactly will not in
general correspond to an array y(O); the weights actually used would be
approximations of these which fulfilled the further condition of being
positive integers greater than the corresponding sample frequencies.
Related ideas are seen in the 'scale load' approach of Hartley and
Rao (1968; 1969), the 'empirical likelihood' approach of Chen and Qin
(1993) and the 'conditional maximum likelihood' approach of Bickel,
Nair and Wang (1992).
_ -0 [I I
IN),.{1 - ),.)
P(Ys :::: Ys ;y(O» ~ <I>(r) + tjJ(r) ;: - -;; (7(L~=1 wiK"(~;»1/2
]
(3.80)
5,y) = P {
J(t:. In Ys -~ Jly -<
'1'"" 1:.
5,y } (3.81 )
VI - A an
and
~2 n- 1 N 2
an = -n-N _lsy.
Account no. 2 3 4 5 6 7 8
Book value
in dollars 1000 1000 2000 2000 2000 1000 1000 2000
Dollar 0001 1001 2001 4001 6001 8001 9001 10001
labels -1000 -2000 -4000 --6000 -8000 -9000 -10 000 -12000
For a sample of n = 3 accounts we might take a systematic sample
of three dollars from {l, ... , 12 OOO}. If for example the dollar sam-
ple were {2271, 6271, 10271}, the account sample would consist of
accounts {3, 5, 8}.
If we begin with a random ordering of the accounts, this method
with a systematic dollar sample is essentially the Hart1ey-Rao random
systematic procedure of selecting accounts with inclusion probabilities
proportional to size (Section 2.8).
Now suppose that the nominal Jth account value differs from the
actual value, so that the account has an error (book value - actual
value) of Yj, and we need to estimate the total error
The error values are determined for the sampled accounts. The point
estimator frequently used for Ty in the literature is the 'tainted dollar'
estimate
1 " Yj (3.82)
- ~Tx-,
n jES Xj
where x j is the book value of account J. This reduces in our terminol-
ogy to the HT estimator
1'y = LYj/llj, (3.83)
jES
of Tx dollars, and to associate with each new dollar one of the sampled
'taint' values rj = YjJ-Lx/Xj. If Wj denotes the number of dollars now
to be associated with taint value rj, we require that
(i) LWj = Tx ,
jES
(3.84)
(ii) LWjrj/Tx = J-Lo,
jES
L Wj log(wj/ A j ).
jES
This gives
Wj = TxAjetTj / L AjetTj , (3.85)
jES
where t is chosen so that (3.84) is satisfied. We then resample repeatedly
at random from the new population of dollars and taint values, suitably
arranged, and each time we compute a new estimate 7;,*. The original
t can then be compared with the empirical distribution of {T/} to see
whether J-Lo should be below or above the confidence bound.
=
Thus in regular i.i.d. cases, using fI(~) H(~; Fn) to provide quant-
iles in (3.90) will give valid confidence intervals for the population
mean, a smaller order of error being obtainable with the studentized
mean difference as root. The distributions fI(~) can be obtained for
example by Monte Carlo (re)sampling, that is repeatedly drawing i.i.d.
samples xi, ... ,x;from Fn.
Coming to the case of SRS without replacement, it is natural again
to try to estimate the distribution of the root from the sample. Let
Ys = {Yj : j E s} be the Y values in the sample, and let jis denote
the sample mean. Here we could write the distribution H(~;y) of the
studentized mean
ED* = 0
Var(D)* ( 1 - n)
- -N- -n-12
-s
= N N-l n y (3.95)
E(D*)3 ( _ ~) ( _ 2n) N 2 (n -l)(n - 2) k
= 1 N 1 N (N _ l)(N _ 2)n 2 3.
BOOTSTRAP RESAMPLING METHODS 89
ED 0
Var(D) = (l - ; ) S; (3.96)
ED3 = (1-~)(I-~)K3'
the second and third cumulants of D* from (3.95) are biased as estima-
tors of the corresponding cumulants of D. However, the biases in these
Op(1) quantities, and in the fourth cumulant of D* not shown here,
are of order Open-I) for n large. These results suggest that the boot-
strap confidence intervals are indeed valid asymptotically for this case.
A theorem along these lines has been given by Bickel and Freedman
(1984).
The bootstrap technique can be extended in an obvious way to strat-
ified random sampling. However, if the sample sizes nh within strata
are small, as is often the case in practice, the bias in the variance of
D = JC;,CYst - iLy),
11 1
Var(y*) = --(1 - A)S
km
2
y
= -(1 - A)S
n Y
2
(3.97)
1 1
- - ( 1 - A)(1 - 2A)k3
k2 m 2
with
k3 = n L(Yi - ys)3 I(n - l)(n - 2).
iES
~I ~
Ys - H- (1 - a)V;;(1 - A) Sy. (3.101)
Similarly, a 100(1 - a)% level upper confidence bound for /Ly would
be given by
~I ~
Ys - H- (a)V;;(1 - A) Sy. (3.102)
EXAMPLE 3.4: Suppose a without-replacement SRS of size n = 12 is
drawn from a population of size N = 48. Suppose the sample values
are
50, 57, 50, 52, 49, 39, 36, 50, 58, 47, 58, 49.
The sample mean is Ys = 49.583, while the sample variance is = s;
46.083. A two-sided 95% confidence interval for /Ly assuming an ap-
92 DISTRIBUTIONS INDUCED BY RANDOM SAMPLING
(iv) the population median can be defined as the least value of ON such
that
N
L(/(Yj ~ ON} - 1/2} ~ 0, (4.5)
j=!
and this is approximately of the form (4.1).
In general the population yth quantile is obtained from the definition
(4.5) with 1/2 replaced by y.
When a population function (or parameter) is defined by (4.1), an
estimator for it can be defined as a solution of the sample estimating
equation
L<Pj (yj , Xj' O}/7Cj = 0 (4.6)
jes
where, as before, 7Cj denotes the probability of inclusion of unit j. For
any sampling design, even a so-called informative design where 7Cj
depends on y or x and y, the left-hand side of (4.6) is design-unbiased
for the left-hand side of (4.1). Thus we may expect a solution Os of
(4.6) to be close to ON for large samples when the functions tPj are well
behaved.
In example (i) above, equation (4.6) is Ljes(Yj - O}/7Cj = 0, and
the resulting estimator of ILy has the form
For any self-weighting design, even one which is not of fixed size, this
estimator is the sample mean Ys. If the design is self-weighting and of
96 DESIGN-BASED ESTIMATION
e
fixed size, s =.vs is the HT estimator divided by N. For simple random
sampling with replacement it is not the HT estimator divided by N, but
is more intuitively appealing because it is unbiased for ILy conditionally
on the sample size n(s) as well as unconditionally. In general, the estim-
ator (4.7) need not be design-unbiased in either sense; however, no
matter how peculiar the choice of the inclusion probabilities 7rj (recall
Example 2.5), it is error-free when all components of y are the same
- a property not shared by the unbiased estimator (L j es Y j / 7rj ) / N.
On balance, then, the use of (4.7) tends to be preferred when the Yj
values are thought not to depend a great deal on the 7rj. This preference
corresponds to a formal optimality property of (4.6), as will be seen at
the end of Section 5.5.
In examples (ii)--{iv) we obtain similarly the estimator
k = (LYj/7rj)j(LXj/7rj) (4.8)
jes jes
for the population ratio, the estimator
frs(y) = [L I(yj ::: Y)/7rj]/[L 1/7rj] (4.9)
jes jes
for the value of the population c.d.f. at y, and the estimator
frs-I (~)
for the population median.
Note that frs(y) of (4.9) behaves like a true distribution function in
that it increases from 0 to 1 as Y increases; this would not always be
the case if Ljes l/7rj in the denominator were replaced by N.
In the interesting case of length-biased sampling, where 7rj is pro-
portional to Yj, we obtain as estimator for the population c.d.f.
Var{4>s(8)}= LN (4).)
~
2
1l"j(1-1l"j)+ L L
N N ( 4> . ) (4)k
~ -
)
(1l"jk-1l"j1l"k),
j=l 1l"J j "I- k 1l"J 1l"k
where 4>j = 4>j(Yj,xj,8); and that unbiased estimators for (4.11) can
be defined for most cases of practical interest. For example, for fixed
size single-stage designs (if we continue to regard 8 as a freely varying
argument), the function
(4.12)
with
(lljk = (1l"j1l"k - 1l"jk)/1l"jk
is an unbiased estimator for Var{4>s (8)}, and is likely to be consistent
under appropriate conditions for designs of simple structure. Moreover,
for most designs used in practice one can expect for large n approx-
imately a standard normal distribution for the quantity
(4.15)
where Zl-a is the (1 - a) quantile of the standard normal distribution.
A second possibility is to retain the dependence on () in v(4)s) and
98 DESIGN-BASED ESTIMATION
Os = f /N.
The variance estimator v,A¢s) for a fixed size single-stage design is
(4.17)
Ys ±Zl-a ~ (1-~)
n N.
L(Yj - Ys)2/(n -I). (4.18)
JES
(4.19)
(4.20)
where
= 1
-LLsWjk
2
(Yj
- - -Yk -Os - 1 - -
7rj trk 7rj
1
7rk
A ( ))2
"'2
= ,.. A A
v{J)(T) - 20scov{J)(T, N)
A
+ Os v{J)(N),
A
(4.21)
while using (4.16) yields (after solution of a quadratic equation)
N 2 nr ( mr ) 1 _ 2
+"LrEsB [2 mr 1 - Mr mr _ 1 LjES,(Yj - Yr) , (4.23)
100 DESIGN-BASED ESTIMATION
or approximately
(4.24)
Y± ZI-aJV(<Ps)/ N.
However, when the first-stage units are also selected by SRS, then
the full design is no longer self-weighting in general, and the point
estimator Os is the ratio to size estimator
LMrYr/LMr. (4.25)
reSB reSB
or T / N where T = (L / I) LresB Mr Yr and N = (L / I) LresB Mr. The
confidence intervals are
v(T) = TL2 (
1-
I) 21(l_I)LLsB(MrYr - MqYq)2
L 1
L" M; ( mr ) 1" _ 2
+/ ~resB mr 1 - Mr mr _ 1 X ~jesr (Yj - Yr)
A
v(N) =T
L2( I) 1
1- L 21(l_l)LLsB(Mr - Mq)
2 (4.27)
cov(T, N)
A L2
=T (
1- L
I) 21(1- I)L LSB(Mr-Mq)(MrYr-MqYq)·
1
(4.28)
ROOTS OF SIMPLE ESTIMATING FUNCTIONS 101
has a bias of
1
-2~yn-l/2 - ~(/Lxx//Lxa)(R - {J)n- 1/ 2 + O(n- 1 ),
where
N N
A = n/N, {J = LXjYj/L:>;,
j=l j=l
(4.30)
and
(4.31 )
(4.32)
102 DESIGN-BASED ESTIMATION
are
1
+ 4'v(N)
AA A A,.. A
and v(f/ls) = =
v(f/ls) evaluated at () Os. For stratified random sampling
(where N = N), using v(f/ls) yields intervals (from (4.15» approxi-
mately of the form
A_I
Fs [12" ± JZI-a
A
v(Fs «()) 10=9, ] (4.34)
and its estimating functions ¢s of (4.36). Note that we now require that
the form of¢j of (4.1) be independent of j.
The inverse testing method works by excluding from the confidence
set for eN all values of e which would be rejected by a level-a sig-
nificance test of the hypothesis eN = e. For a given value of e, if the
hypothesis were true the population arrays x, y would satisfy
N
L ¢(Yj, Xj, e) = o. (4.37)
j=!
L L ai el
r r
0= ai(yi - e)e l (IJ)(yi_ 8)/ (yi- 8) (4.40)
i=1 i=1
e
where ¢s(e) = ¢s(e)/N and is some value between Os and e. This
expansion can be formed when ¢s and a¢s/ae are continuous in €I and
a2¢s/ae 2 exists in the region of interest. Suppose it can be shown in
some asymptotic framework that (8s - €IN )/eN -+ 0 in probability, that
Var{¢s(e)} and Var(a¢s/ae) are of order O(n-I) for some measure of
sample size n, that a¢s/ae and its expectation approach A(e) =1= 0, and
that a2¢s/ae 2 is bounded in probability uniformly near €IN. Then by
ROOTS OF SIMPLE ESTIMATING FUNCTIONS 105
Bs-BN= !(BN)[Fs(BN)-FN(BN)]+op(n- ).
Francisco and Fuller (1991) have given sufficient conditions for the
representation (4.44) to hold.
106 DESIGN-BASED ESTIMATION
(4.50)
(4.53)
v
The symbol is used in (4.52) because computing it involves estimation
of the coefficient ag/a Ta.
Although we are aiming at estimating the MSE of the possibly biased
point estimator e, we will use the term 'variance estimation', in con-
formity with much of the literature, unless the distinction is important.
Thus (4.52) will be called the linearization variance estimator of e.
Specifically, using the Yates-Grundy-8en estimator with a fixed size
one-stage design would give as linearization variance estimator
1 A A)2
A
!
PSUs selected from stratum Sh, we would have
II
h 2 flrq flr flq
+L V2,r (4.55)
rESBh flr T=T
where V2,r estimates the variance of the estimate of the Z total in the
rth PSU. If Lh, the number ofPSUs in stratum Sh, is large for all hand
the first-stage design approximates with-replacement sampling within
strata, then we can use the approximation (2.94), which gives as the
estimate of variance of linearized error
MqZq )21
A
V(E[) = ,,1
h h
1 (MrZr
~ 2l-lLLsBh IT - ---rI
r q
.'
T=T
(4.56)
normality of
g(T)-g(T) o-e (4.57)
JV(EL) = JV(Er) ,
and takes the form
o± ZI-aJV(Er) (4.58)
where again ZI-a is the 1 - ex quantile of the N (0, 1) distribution. This
can be justified for large samples when (4.51) is asymptotically normal
and it can be shown that v(Ed is design-consistent for Var(Er), in the
sense that their ratio tends to 1 in probability. See the results ofKrewski
and Rao (1981) described in Section 4.2.7.
Suppose we take as an illustration of the linearization approach the
estimation of the population variance a 2 of (4.48) with a one-stage
design. The point estimator from the prescription above is
I(~I/1l"j)
lES lES lES
or
(4.59)
where
P-x = (:?:Xj/1r j ) /
JES
(L:
JES
1/1rj )
I [T2 A
T4 - Tl/N - N(Tl - Td
2BT2 - Tl
+ + (T3 -
A A
-B(T4 - T4) + T2
N
Tl -BT2
N (Ts -
A ]
Ts) .
L jES (.
Xj
~ J1-x )2/1rj. [(Xj -
A P-x)(Yj - P-y) - Bs(xj - P-x)2]. (4.63)
The simplest subsampling method has been called by Wolter (1985) the
method of random groups. It was introduced by Mahalanobis (1946),
who called it the method of interpenetrating subsamples. Deming (1956)
and others termed it the method of replicated samples.
There are two kinds of random group method. The 'independent
random group' technique is not strictly speaking a subsampling method.
In this technique, samples Sl, ••. ,SG are drawn independently to begin
with, each according to a sampling design po. Each sample is 'replaced'
after it is drawn. If 'Tv is the HT estimator for the vector T of totals
obtained from the sample sv, we may use as a point estimate for 8 =
g(T) either
(4.65)
IG
= g(T), = G LTv.
A -;;- -;;- A
82 where T (4.66)
v=1
or (more conservatively)
1 G
Ys = G LYs•.
v=1
1 G
G(G _ 1) ~Var(jis.IS)
= 1 G2 ( 1- -1) s2
G(G - 1) n G y
1 2
= -sY'
n
1 "(y - )2
2
Sy = ---=--1
n
L.J j
jes
- Ys •
FUNCTIONS OF POPULATION TOTALS 113
-
Var(Ys) = -;;l( 1 - N
n)S2
Y'
V*RG = (1 - ;) VRG(Ys)'
By analogy, we would estimate the variance of OJ of (4.65) or (4.66)
by
G
Var(O,) A
= (
1- -
N
1-
n) -(A
2n
B) (2N
B
+ - - - - -.
N-1
1) 2
Thus V*RG is approximately unbiased for Var(O,). The more efficient
estimator s;,
which can be regarded as analogous to O2 , has variance
Yst = L WhYh,
h
But
" "
~(yj - 2
- Yh) = 2
1 (Yjhl - Yh2) 2 ,
jESh
(4.68)
FUNCTIONS OF POPULATION TOTALS 115
Now, the total sample can be divided into two half samples,
{Jll, hi, ... , jhl, ... } and {J12, h2, ... , jh2, ... },
which can be regarded as two independent replications of stratified ran-
dom sampling, with one draw per stratum. (Here again the assumption
that Nh » nh for all h is being used.) We may form two 'half sample
estimates' of fJ.Y' namely
and clearly
1
jist = i(jistl + jist2). (4.69)
Since jist is thus a sample mean of two virtually independent 'observa-
tions', from what are essentially two random groups, another unbiased
estimate for Var(jist) would be
(4.70)
which is much simpler than v(jist) above. But this latter estimate has
only one degree of freedom, and is highly unstable. We can try to
obtain a similar estimate with greater stability by considering a larger
collection of half samples.
In particular, consider the class of all possible half samples formed
by taking one of the units Uhl, jh2} from each stratum. If the number
of strata is H, there are 2H half samples, which may be numbered
I, ... , 2H. Then the vth half-sample estimate of fJ.y is
- = "~ Wh (~(v)
Ystv Uhl Yjhl + Uh2
~(v»)
Yjh2 ' (4.71)
h
where
8(v)
hi = 1 if jhl E vth half sample,
= o if jhl ¢. vth half sample;
8(v)
h2 = 1 - 8(v)
hi .
Clearly the total sample estimator jist is the average of all the half-
sample estimators. But it is also true that if we take the average of
all (jistv - jist)2, as in (4.70) but this time averaging over all 2H half
samples, we recover the variance estimator v(jist):
(4.72)
116 DESIGN-BASED ESTIMATION
Ystv - Yst =
=
where
8(v)
h
= 8(v) _ 8(v)
hi h2
= I if jhl E vth half sample
-I if jh2 E vth half sample.
Then
-
(jistv -
- Yst) 2
= 4"1 ""
~ W2(y
h ihl - YjhJ 2 +:21 ""
~ ""
~ 0h(v) 0h'
(v) Wh Wh'
h hc ~
But for fixed h, h' such that h =1= h', the sum Lv 0k V
) ok~) = O. Thus
.p, i
L~:I (jistv - jist)2 = Lh Wl(Yihl - Yih2)2 = v(jist) of (4.68), and it
is indeed possible to express V (jist ) as the average of squared deviations
of jistv from jist for all half samples.
The basis of the BRR method (McCarthy, 1969) is that the same
thing can be done using fewer than 2H half samples, if the set of half
samples chosen is balanced. That is, provided that
(4.73)
where tv denotes summation over the selected half samples, the same
algebra will show that if the number of half samples selected is K, then
1 -
v(jist) = K Lv(jisrv - jist)2. (4.74)
of T from Sv. There are two natural choices for the point estimate of
geT), namely the full-sample estimate
o=g(T). (4.76)
and the average of the half-sample estimates:
~ I K
(J = 2K L(g(sv) + g(sv)). (4.77)
v=l
for each v. Thus in such cases the two point estimators and should e e
be close in value.
There are correspondingly three estimators of approximate mean
squared error for these estimators. The first is suggested by the right-
hand side in equation (4.74):
K
1", A2 A2
VBRR-S = 2K L)(g(sv) - (J) + (g(sv) - (J) ]. (4.79)
v=l
The third replaces (g(sv) - 0)2 + (g(sv) - 0)2 by the term (g(sv) -
g(sv))2/2, which it would equal if (4.78) were exact:
1 K
VBRR-D = 4K L(g(sv) - g(sv))2. (4.80)
v=l
Now if Lh, the number of PSUs in stratum Sh, is large for all hand
the first-stage design approximates with-replacement sampling within
strata, the estimate of variance from the linearization method will be
'" ~""
~ 2 L..- L..-SBh
(Mrzr _ MqZq
nr nq
)21 ' T=T
(4.81)
h
FUNCTIONS OF POPULATION TOTALS 119
Zj = "~
a
ag Yaj
aT,
a
(from (4.54) with lh = 2). It is not difficult to see that for a set of
balanced half samples, (4.81) is the same as
or
where
(4.84)
is appropriately small, the BRR estimators for the MSE of g( i') agree
'to first order' with the linearization estimator. In fact, as has been
pointed out by Rao and Wu (1985), equality holds in (4.85) if g is a
quadratic function of the components of T.
When the number of primary sampling units sampled in each stratum
is more than two, the BRR method must be modified. Three approaches
to this problem have appeared. One is to divide each stratum randomly
into two equal parts, and then to construct balanced half samples using
these as the basic units. However, the random group method gives
inconsistent estimators of variance when the number of strata is finite.
A second approach, which avoids grouping, is to form a balanced
set of K replicates with one unit per stratum according to a mixed
orthogonal array. (This method is due to Wu (1991), adapted from an
approach suggested by Gupta and Nigam (1987).)
Let us suppose that the original design is stratified random sampling
with nh units drawn from stratum Sh, and with Nh » nh. A mixed
orthogonal array of strength 2, denoted here by OA(K, n, x··· x nH),
is a K x H matrix whose hth column has nh symbols such that in
any pair of columns each possible combination of symbols appears
120 DESIGN-BASED ESTIMATION
equally often. The first six columns of the Hadamard matrix of (4.75)
is an example of an OA(8, 26 ). An example which could be used to
select replicates from a stratified random sample for which H = 5 and
nl = 3, n2 = ... = ns = 2 is the following OA(l2, 3 x 24);
I
I 2 I 2
2 I 2 2
2 2 2
2 2 2
2 I 2 2
2 2 I I 2
2 2 2 I I
3 2
3 I 2 2
3 2 I
3 2 2 2 2
K
~ L(.Ystv - jist)2 = v(.Yst).
v=l
1 ~ ~ 2
VBRR = K L.,.(g(Sv) - (})
v=!
1 K ~
or K L(g(sv) - 0)2.
v=!
Clearly the formula gives the usual unbiased estimator of variance of
ewhen g is a linear function of the components of T. Wu (1991)
has shown that under appropriate regularity conditions, v BRR and the
linearization estimator are asymptotically equivalent to first order.
The third approach, described by Sitter (1993), is motivated by the
fact that mixed orthogonal arrays with small numbers of replicates may
be difficult to obtain with differing stratum sample sizes. The number
of replicates can be reduced if each is allowed to contain some number
cth > 1 of units in the hth stratum sample. The replicates are thus
described by an array of subsamples, and the technique and results of
Wu (1991) extend naturally to the case where the replicate array is
what is called a balanced orthogonal multi-array.
When the number of strata is very large, the BRR method becomes
computationally unwieldy. In this case what is sometimes suggested is
to subdivide the PSU population into smaller groups of equal numbers
of strata, and to apply the BRR technique within each group. Separate
replicates from the groups are combined to produce a set of overall
replicates which is smaller than that of a fully balanced set of replicates.
See Wolter (1985, pp. 125ff.) for a discussion of the consequences of
this method, which is called partial balancing or partial BRR.
T(v) = G G_ 1 !--
' " Yj
;-:'
jES(v) j
where S(v) consists of all units in the sample s outside the vth group,
and the probabilities 1'{j are the inclusion probabilities under the original
design. Now clearly
A 1 G A
T = G LT(v).
v=1
Furthermore, if 'Tv denotes the estimate of T from the vth group only,
we can write
or equivalently
-(G - 1)(T(V) - T) = Tv - T. (4.86)
Now if N is large and the design is approximately with replacement,
the random group method gives
1 ~ 2 A A
G(G _ 1) ~(Tv - T)
G-l~A
A
vJ(T) = -a ~(T(v) -
A2
T) , (4.87)
v=1
n(n _ 1)
jES j
() = G Lg(T(v»).
v=1
FUNCTIONS OF POPULATION TOTALS 123
G-l~ A A2
Vj = ( J L)g(1(v» - OJ (4.88)
v=l
or
G-l~ ~2
( J L.,,[g(T(v» - 0] .
A
(4.89)
v=l
Because
(4.90)
T = - LT(hV)'
Gh v=1
Also, if Thv is the estimate of T when only the vth group is sampled
in stratum h, then
(4.91)
124 DESIGN-BASED ESTIMATION
If G h = nh for each h and the design is SRS within strata, this coincides
with the usual estimator
(4.92)
(4.93)
which would give VJl of Kovar et al. (1988). In the case where all
Ih = 2, the jackknife technique is termed JRR or jackknife repeated
replication by Kish and Frankel (1974). In that case VJ2 of (4.92) be-
comes
Yh = m hl LYhj
j=1
= Yh + m!/2(nh - 1)-1/2(1 - fh)I/2(jiZ - Yh),
H
Y L1
WhYh,
e,
The bootstrap point estimator £*(0) of where £* denotes condi-
tional expectation given the original sample, can be approximated by
O' = LOb / B. Note that the bootstrap point estimator of f..ty is equal to
Yst, the usual unbiased estimator.
The bootstrap variance estimator of O· or of g(Nji") or of g(NYst)
would be defined by £*(0 -£*0)2 = Var*(O), or by £*(0 -8)2, where
126 DESIGN-BASED ESTIMATION
v ( IL~) = "wl
~ - -1- ~(y
~ hj -
_)(yhj - Yh
Yh
-)t" (4.98)
h nh nh - 1 j=1
(where Yhj is the jth sampled Y value from Sh) is the usual estimator
of Var(p-), then
v(p-)/Var(p-) ~ 1 (4.99)
in probability, so that v(P-) is a consistent estimator ofVar(p-) as H ~
e=
00.
Inferring from these the corresponding properties of g(P-) re-
quires further conditions; it is convenient to assume also that:
(v) lLy ~ a finite vector IL;
(vi) the first partial derivatives of g(.) are continuous in a neighbour-
hood of the limiting vector IL.
If we then define the linearization variance (more properly mean
squared error) estimator as V(Er) of (4.52), or in vector notation
1 ~
+2(1L - lLy)
t" 82
a2g I ~
(IL - lLy) + Op(n -3/2 );
IL JLy
l
T
at J,L = J,Ly.
The theoretical results of Rao and Wu (1985) tended to agree with
the results of empirical studies by Kish and Frankel, carried out on the
linearized, jackknife and BRR variance estimators. A subsequent em-
pirical study by Kovar et al. (1988) also included tests of the bootstrap
techniques given by Rao and Wu (1988), and again tended to agree
with theoretical calculations. Here we give a brief summary of some of
the main findings of all these studies. Unless otherwise qualified they
seem to be true generally for the estimation of ratios, simple regression
coefficients and correlation coefficients.
1. The jackknife estimators VJl and VJ2 are very close to one another,
differing by amounts which are Op(n- 3 ) under broad conditions (Rao
and" Wu, 1985). They tend to be very close to VL numerically, and for
nh = 2 for all h, VJ (= VJl or VJ2) and VL are equivalent to higher-order
terms. That is,
other words more highly variable from sample to sample) than the
linearization and jackknife estimators.
4. Although all the estimators of MSE seem to have acceptably small
biases for actual populations, this fact is not necessarily of great relev-
ance to the performance of confidence intervals based on assuming an
N(O, 1) distribution for (g(M - g(/Ly»!Jv = (e - ())!Jv, or a t(H)
distribution as in the study by Kish and Frankel. It has been found that
for ratios, one-sided intervals with lower limits may have serious un-
dercoverage, although two-sided intervals tend to have close to nominal
values of coverage, in agreement with the theoretical explanation based
on Edgeworth expansions given by Rao and Wu (1988). For regression
and correlation coefficients, two sided confidence intervals which use
VL or VJ tend to cover the true value with probability less than the
nominal values. For these quantities intervals which use v BRR-S have
closer to nominal coverage, in keeping with the fact that vBRR-S is
the most positively biased of the MSE estimators. However, even with
this estimator, one-sided confidence intervals with upper limits tend to
have undercoverage for correlation coefficients. The undercoverage of
all the intervals tends to decrease as nh goes up, in the situations exam-
ined by Kovar et al. The asymmetry of coverage also decreases as nh
increases, and is less severe for regression coefficients than for ratios
or correlations. See the discussion on ratio and regression estimation
of a population mean or total in Sections 5.11 and 5.12.
It may be observed that of all the estimators vL is the one which
can be applied most generally, but it requires knowledge of all joint
inclusion probabilities in principle. Of the resampling estimators, VJ
is the least trouble to compute. The main difficulty with using any of
them for confidence interval estimation is the departure of (e - ())! Jv
from normality for small nh.
E(} - () =
1
-tr L
a2g
- 2 In + O(n- 2 ),
2 a/.L
where :E is the limiting covariance matrix of n 1/2({t - /.Ly), n is the total
sample size and a2gla/.L2 is the matrix of second partial derivatives
a2gla/.Laa/.L{3 with respect to the components of /.L = (/.LI, ... , /.La, ... ,
/.Lp). Now let 8(hj) = g({t(hj», where (t(hj) is the estimator of /.L when
the jth unit is removed from the sample Sh in stratum Sh. If 8h
LjESh 8(hj)/nh, it can then be shown that
()h = ()
A A 1
+ -2 nh (nh -
wl 2g I
1) trLh -a2
A a + Op(n-
3
),
/.L [l
(a point estimator due to Jones, 1974) has bias O(n- 2). Since 8 = g({t)
itself was seen in Section 4.2.7 to have bias O(n- I ), the jackknife point
estimator ()jl) is a reduced bias estimator. As pointed out by Rao and
Wu (1985), the use of ()jl) in place of 8 in VJ2 of (4.92) will give an
inconsistent estimate of MSE, however.
The BRR technique also provides a means of bias reduction. Wu
(1991) showed that if (tv is the vth half-sample estimate of /.Ly, and
g(sv) = g({tv), then
e-o
t=--,
.jVJ
VJ being a jackknife estimator of variance of e, rather than assuming
a standard nomal distribution.
The bootstrap t method would in this case generate a large number
of independent resamplings, computing for each a sample of pseudo-
values as for example in Section 4.2.6, a point estimate 0, the appro-
priate jackknife variance estimate VJ, and the t statistic
o" - "1
H- (1 - a)..(iij, 0" - H-
"1
(a)..(iij
where fI-1 (a) and fI-1 (1 - a) are the a and 1 - a quantiles of the
resampling distribution of t.
In their empirical study Kovar et al. (1988) investigated this tech-
nique for stratified random sampling with replacement, and pseudo-
values generated as in Section 4.2.6. They showed that it does tend
to give improved confidence intervals for ratios and correlation coef-
ficients. The intervals in fact tended to be conservative, unlike those
based on linearization or jackknife variance estimators. Sitter (1992a)
extended this investigation to a comparison of these with several boot-
strap methods, concluding that the mirror-match and extended BWO
methods discussed in Section 3.13.1 also perform relatively well, and,
if extendible, would have the advantage in complex situations of better
reflecting the original sampling.
where the sum is taken over all sets UI, ... , jm} of m distinct units of
{l, ... , N} and the 'kernel' f is a symmetric function ofm arguments.
If m is the smallest number for which Uy has representation (4.101),
then m is called the degree of Uy . For example, the population mean
f-Ly is aU-statistic of degree 1. A population variance
1 N N
S2- ""(y._y)2
y - 2N(N _ 1) ~# ~ j k
=( ) ~ f(Yjl"" ,Yjm)'
n UI ,... ,jm les
m
Now suppose U;a) and u)fl) are two (not necessarily distinct) popu-
lation U -statistics with kernels fa and fp and degrees ma and m p
respectively. It is possible to show under general conditions, given by
Nandi and Sen (1963), and generalized by Krewski (1978) and others,
that .J1i(U(a) - U;a» and .J1i(U(P) - u)fl») are asymptotically jointly
normally distributed with means 0 and limiting covariance mamp{l -
),,)~aP' where)" = nj N and ~ap is the limiting population covariance
of zeal and z(P), the value of zeal for unit j being
Zj(a) = average 0 f Ja(Yjp
I'
.. " YjmJ
over all UI> ... , jm«} C {I, ... , N} such that jl = j.
A consistent estimator for ~ap is
(ap = n ~ I ~(Vr
jES
- u(a»)(v}P) - U(P»), (4.103)
where
and this can be used to form a covariance estimator for (Ua , Up).
134 DESIGN-BASED ESTIMATION
22 X (~_~)
n
_1 ,,([_1
N n-1L.."
"(Yj _ Yk)2/ 2 ] _ s2)2,
n-1L.."
JES
y
kES
which reduces to
( ~n _ ~) [(n -
N (n - 1)2
2)2 k4 + n(3)
(n - 1)3 2 '
~] (4.104)
2
Sy(j) = 2(n _ 1
l)(n - 2) 7'# ~(yk - Y/) ,
"" 2
°
8j is set equal to 1 if unit j is observed to die or fail at time tj> and equal
to if the observation for j is censored at time tj. The information
about {3 in the data is captured in the population partial likelihood
nN
i=l
N
([exp{xJ (ti) (3}]Oi /[LYj(ti) exp{xj(ti) (3}]),
j=l
°
where the variate Y/t) is I if unit j is present, i.e. in the risk set,
at time t, and if not. If we define the vector B N to be the value of
{3 maximizing the partial likelihood, we can interpret B N as a finite
population analogue of (3. It can be shown to satisfy the system
°
N
L8i (Xi(ti) - ANi) =
i=l
N
LYj(ti)(Xj(ti) - ANi) exp{xj(ti)B N } = 0, i = 1, ... , N. (4.110)
j=l
The vector parameter ANi, of the same dimension as the covariate
ROOTS OF ESTIMATING FUNCTIONS WITH NUISANCE PARAMETERS 137
N
L 4>2j(Yj, Xj; ON, AN) = 0 (4.112)
j=1
with (4.111) and (4.112) having the dimensions of ON and AN respec-
tively. Typically these equations would have the form of population
maximum likelihood equations for ON and AN. Suppose that ON is the
parameter of interest, while AN is a nuisance parameter.
The sample version of this estimating function system at a general
parameter value (0, A) is
(4.113)
(4.114)
If )./1 satisfies 4>2s (0, )./1) = 0, then the estimating equation system to
be solved for the estimate Os of ON becomes
(4.115)
v (L~~)'
JES J
where
(4.116)
Note that LjES (ZOj /1T:j) is the combination of the estimating functions
in (4.113) and (4.114) which changes least as the nuisance parameter )..
changes, near i o. See Godambe (1991) for related discussion. Interval
estimates for ON are then obtainable from an N(O, 1) approximation to
the distribution of
(4.118)
)V(LjEs Zj/1T:j) ,
JI).. = 0, I n = -N;
ZOj = (yj - A)2
ILy 2
- a; -
Zj = (yj A)2
- ILy - aA2 .
We could obtain interval estimates of a 2 by setting (4.117) or (4.118)
equal to N(O, 1) quantiles and solving. Using (4.118) would give results
equivalent to using the linearization method of Section 4.2.
= NNH(O_O)
NH
as the estimating function, where 0 = L:=I NhOh/ N is the post-
stratified estimator for the mean. Then applying the formula for ZOj
in (4.116) and putting in 0 for 0 gives
where
8hj = 1 if j E stratum Sh
= 0 otherwise.
The resulting estimate of the MSE of 0 = jist in the case of SRS would
be
~- (1 -nh)
"Nt - 2
sh'
h nh Nh Y
where
Si = L Yk(ti) exp{xI (ti)B}j1l"k.
kES
and 4>1 (S(rhl)' (0) is the value of 4>1s(00, i80) if the data in sampled PSU
rhlare replaced by a copy of the data in sampled PSU rh2.
CHAPTER 5
only in that some are white and some are black. The number M of white
balls is unknown. Suppose that a simple random sample of n = 10 balls
is drawn without replacement, and that the observed sample number m~
of white balls is 4. What then can be said or inferred about the number
of white balls among the 90 unsampled balls (and hence about M)?
The idea behind classical sampling inference (Neyman, 1934;
Cochran, 1977) might be summarized as follows.
CSI(i) Since the sampling was done at random without replacement,
the sample number ms of white balls is a hypergeometric ran-
dom variable, i.e.
1 3
40 x - + 60 x - = 44,
5 5
which is expressible as N = 100 times the stratified sample proportion
Nh
L WhPh = L /iPh.
2 2
Pst =
h=\ h=\
Thus, even though the sample has been taken by SRS, we might pre-
fer to base inferences about M on the distribution of Pst or its compo-
nents, rather than on the distribution of ms. The statements CSI(i)-{iii)
concerning SRS-based confidence intervals are still true if we replace
'white balls' by 'households with teenage children'; however, the con-
clusion ofCSI(iv), that SRS-based confidence intervals are a reasonable
expression of inference, is no longer so appealing.
146 INFERENCE FOR DESCRIPTIVE PARAMETERS
from a likelihood perspective. When the data exclude the labels, (5.1)
defines a likelihood function for M, and two-sided confidence intervals
based on the hypergeometric distribution as in CSI(ii) can be viewed
as approximate likelihood intervals for M. On the other hand, when
the data are equivalent to (U,Yj) : j E s}, where Yj equals 1 if ball j
is white and equals 0 if ball j is black, the probability of the observa-
tion depends not just on M but also on the array y = (Yl, ... ,YIOO) of
indicators. That is, if (U, yJ) : j E s} is a possible data set,
where f dv(a) = 1, then again YJ, ... , YN are exchangeable. For ex-
ample, suppose that, conditional on a, Y), ... , YN are LLd. N(a, u 2 ),
and that a is N(ILo, ul). Then Y\, ... , YN are exchangeable, and in fact,
they are multivariate normal with common mean ILo, common variance
u 2 + ul, and pairwise covariance u~. It is a theorem due in its original
form to de Finetti (1931) that if YJ, ... , YN is the initial segment of an
infinite exchangeable sequence, then FN must be of the form (5.4).
However, not all exchangeable distributions have this mixture form.
For example, consider the random permutation model, where (Y\,
... , YN) is simply a random permutation of some fixed vector (a\, ... ,
aN):
(5.5)
for each permutation u of 1, ... , N. Then Y\, ... , YN are exchangeable,
but the distribution is not generally of the form (5.4).
In a sense, random permutation models are the most basic exchange-
SUPERPOPULATION MODELS 151
Yj = fJXj + Ej (5.6)
is used. The Ej are taken to be independent mean-zero errors with
variances depending on x j. In most applications x and yare both non-
negative, and the variance of y about the line through the origin with
slope fJ tends to increase with x. For example, for a population of
short-stay hospitals the model (5.6) has been suggested (Herson, 1976),
where Xj is the number of beds in hospital j (a size measure), Y j is the
number of discharges from hospital j in a given month, and the error
Ej has variance a 2xj. Royall and Cumberland (1981a) have discussed
the consequences of an imperfect fit to the model (see Figure 5.1).
Another situation where regression models like (5.6) or
(5.7)
are commonly introduced is where x j is the value of y for unit j on
some previous occasion when a census was completed. For example, in
a population of universities, Yj and x j might respectively represent the
number of PhDs granted by institution j this year and last year, as in
Figure 5.2. In a population of cities, Y j and Xj might denote numbers
of residents now and at the time of the last national census.
Related models are also sometimes used when there are two ways
of measuring the characteristic of interest. Suppose that one of these is
crude but inexpensive, and can be applied to all population units. The
other is more accurate but also more expensive (or perhaps destructive),
and can be applied to only a few population units. For example, in an
analysis to estimate the number of trees in a wooded area divided into
SUPERPOPULATION MODELS 153
3000
2500
:WOO
~
cII)
«i 1500
0...
1000
500
Y
30
25
20
15
10
5 10 15 20 25 x
(5.9)
From the prediction viewpoint, it is desirable to choose e to be £-
unbiased and to have a prediction MSE which is as small as possible
under the distributions ~ in C.
In addition, it is sometimes meaningful to construct prediction inter-
156 INFERENCE FOR DESCRIPTIVE PARAMETERS
vals for O(y) based on the sample data. Under the model distribution,
O(Y) would belong to these intervals with specified probabilities. For
example, if Ve were a model-consistent estimator of the prediction MSE
(5.9) and if
e - O(Y)
(5.10)
Fe
(for s fixed) were approximately N(O, 1) under distributions ~ E C,
then the interval
e ± ZI-a.y'Ve (5.11)
would cover O(y) with approximate ~ probability I - 2a.
Other predictive frameworks for descriptive sampling inference have
been put forward by Ericson (1969), Scott and Smith (1969),
Kalbfleisch and Sprott (1969), and others subsequently. In a Bayesian
setting, as adopted by Ericson and by Scott and Smith, C consists of
a single prior ~, usually hierarchical or multi-stage, and inference is
expressed in terms of the posterior distribution of Y, or ~ conditioned
on the data (U, Yj = Yj) : j E s}. Recent applications of this approach
have been discussed by many authors, including Cox and Snell (1979),
Malec and Sedransk (1985), Stroud (1991) and Ghosh and Rao (1994).
In the fiducial approach of Kalbfleisch and Sprott, C is a parametric
family, and inferences are derived from the fiducial distribution of the
parameters composed with the conditional distribution of O(y), given
the parameters and the data. The parametric empirical Bayes approach
(Ghosh and Meeden, 1986) also takes C to be a parametric family,
this time of prior distributions; in the posterior distribution of O(y)
the parameters are then estimated from the data. The point estimate
of O(y) is the estimated posterior mean or estimated posterior mode.
The estimated posterior variance of 0 (Y) must be adjusted to produce a
mean squared error estimate for O(Y) which incorporates the parameter
estimation errors (see, for example, Laird and Louis, 1987; Kass and
Steffey, 1989).
The framework (5.8)-(5.11), which could be termed the 'frequentist'
predictive approach, was put forward by Brewer (1963) and by Royall
(1970); here inferences are constructed through the unconditional (or
'prior') distributions ~ of C. We will use this framework subsequently
because it formalizes fairly simply the predictive element in sampling
inference.
It should be noted that the justification for thinking of inference in
predictive terms depends on the appropriateness of the superpopulation
model, and in particular on aspects like model unbiasedness in (5.8) and
RANDOMIZATION AS SUPPORT FOR STATEMENTS OF INFERENCE 157
-
Ys ± Z\-a l( 1- Nn)",
-;; L....(Y - -
Ys)2j(n
j - 1), (5.12)
JES
Post-sampling stratification
Let us return to the stratified population example. It is easy to see that
estimation statements from a stratified exchangeable model will be re-
inforced if SRS is used, as long as the relevant sampling distributions
are taken to be conditioned on the stratum sample sizes. This applica-
tion of our principle gives us in fact a superpopulation justification of
the conditioning of post-sampling stratification, arrived at intuitively in
Section 5.1.3. By defining strata within which the response variate is
assumed to be exchangeably distributed, the model prevents overcon-
ditioning.
sample mean
jES
is the natural estimator of the population mean {.Ly. The associated pre-
diction intervals under the exchangeable model, based on an application
of the central limit theorem of Section 3.4, are
_1_
n(s)
(1 _n(s»)
N
s2,
Y
where
2 1 '"
sy = n(s) _ 1 ~(yj - Ys) .
-2
JES
where YT) and si,y are respectively the sample mean and variance of y
in snD. If NT) is large compared to nT), the finite population correction
160 INFERENCE FOR DESCRIPTIVE PARAMETERS
1'1) = N( L
jEsn1)
Yj)/n = N1)Y1), (5.14)
which is also
(5.15)
JES
Another kind of model for the location and size of 'D might justify
a different estimator of prediction mean squared error.
He
Wis = - , j E s and class c.
me
That is, each weight implies an estimate of the size of its class, namely
He = mewis for any j E s n c. If the classes c = 1, ... , C disjointly
cover the whole population, the estimator of the total Ty can then be
written
(5.22)
e
where Ye is the sample mean of y within weight class c.
In assessing the properties of iy in (5.22), suppose we can assume
the following superpopulation model, which might be appropriate for a
single-stage, single-stratum design. The population units are randomly
assigned to the weight classes c, size Ne , c = 1, ... , C. Moreover, the
Yi values are generated by random permutation models (independently)
within weight classes. Sampled units in weight class c are assumed to
respond with probability Be, independently of one another. Then the
prediction MSE of Ty is the expectation of the conditional MSE, given
me, c = 1, ... , C, i.e. given {wis : j E s}; and hence
:2 (1 _;) ~ ; (/Le _ ~ y
EVALUATION OF A SAMPLING STRATEGY 163
:2
~ ~: (I -;:) S;y + (1 - ;) ~ ~ (ye - ~ y, (5.24)
(5.26)
is satisfied, we know only that any bias in e with respect to Ep('lh(Xs»
averages to zero over the superpopulation. In the special case of £-
unbiasedness, namely when
£(e - /1(Y) =0 for every s, (5.27)
then (5.26) must hold, even though (5.25) may not. The condition (5.27)
is important for inference, as we have seen in Section 5.3, but the in-
ference will be more secure if 'supported' by (5.25), and for purposes
of comparing strategies we often take (5.25) to be the primary unbi-
asedness criterion to be satisfied.
Two point estimation strategies (el, PI) and (e2, P2) can be com-
pared in efficiency with respect to their (conditional) mean squared
errors averaged with respect to the superpopulation distribution, so that
(el, PI) is more efficient than (e2, P2) if
(5.28)
With this criterion, in certain circumstances it is possible to find optimal
design-unbiased strategies e, strategies for which
£Varp(elh(xs» .:::: £Varp(e2I h (xs»
holds for all e2 and for all values of the superpopulation parameters
(Godambe, 1955).
Perhaps the simplest example of such a result can be shown for
the sample mean jis as an estimator of the population mean J-Ly, ac-
companied by a self-weighting sampling design of size n. Consider
any superpopulation model under which the variates YI , Y2, ... , YN
are symmetrically or exchangeably distributed (see Section 5.2.1). For
a given n, consider all estimator-<iesign pairs (e, p) such that p is of
fixed size n and the unconditional E p-unbiasedness condition
Ep(e - J-Ly) = 0
where the sample size n (s) may not be fixed, but conditional Ep-
unbiasedness
Ep(e - !-Lyln(s» = 0
is satisfied, we can improve them by replacing e with the sample mean
and a conditionally self-weighting design with the same distribution of
n(s).
For other examples of results like these, see Thompson (1984) and
references therein.
Exact E p-unbiasedness may be too strong a requirement for the es-
timation of parameters defined by estimating functions, as in Section
4.1. A useful concept of approximate E p-unbiasedness is known as
asymptotic design unbiasedness, and can be defined in the same kind
of context as the limit results of Sections 3.4 and 3.5 (see also Section
4.1.4). We consider a sequence of population arrays indexed by N, so
that YN is (YNl, ... ,YNN). The strategy of interest is embedded in a
sequence (eN, PN) with sample size nN(s) tending to increase with N.
We could say that this strategy is asymptotically design-unbiased for
estimating 8(y) if
(5.29)
as N -+ 00. For example, the statement that 'the bias of the ratio
estimator R, of (4.8) is of order lin' under SRS (cf. Cochran, 1977,
p. 160) is an assertion that it is asymptotically design-unbiased: under
regularity conditions onYN and nN, the standard deviation of R, will be
of order n-;.1/2, and by the assertion EpN(R, -RN) will be of order n,./,
implying the truth of (5.29) for R,. The 'regularity conditions' will be
satisfied in the context of an appropriate model. This type of asymptotic
design unbiasedness trivially implies design consistency, namely that
11m·
N->oo
p(leN-8(YN)1 >€
18(YN)1
)-0
- ,
satisfies (5.37).
In fact, g* of (5.5.14) is optimal in senses compatible with the optim-
ality criterion of Section 5.5.2.
170 INFERENCE FOR DESCRIPTIVE PARAMETERS
THEOREM 5.1: If Y!, ... , YN are independent and (5.34) holds, and
if the sampling design is independent of Y, then among all g satisfying
(5.37), g* can be shown to minimize each of
-£ {t ¢~ L
j=l 7r] s;j'/.s
p(s)(g - g*)} ,
For example, suppose that for gEe the variates Y!, ... , YN are
independent and identically distributed with mean ()(g). Then from
almost any standpoint, the optimal population estimating function for
() is
N
L(Yj - ().
j=!
Thus the associated finite population parameter is the solution of
N
L(yj - ()N) = 0,
j=!
AUXILIARY INFORMATION IN ESTIMATION OF MEANS AND TOTALS 171
as in (4.7).
From the considerations of Section 5.3 and 5,4 it is clear that the
optimality of g* in this section may be merely formal. If in the previous
example we really believe in the model C, we might prefer the linear
E-unbiased estimator of eN having minimal predictive mean squared
error, namely the sample mean Ys. In that sense, estimation of ON
via (5,42) would be inefficient (and rather unappealing) unless the rrj
were all equal; the E p-unbiasedness requirement would pull us away
from the estimator which is best in model terms. On the other hand,
if the superpopulation distributions ; are merely a convenient device
for averaging over plausible arrays y, the asymmetry in (5,42) is not so
glaringly inconsistent with prior beliefs. Moreover, whether the model
e
is meaningful or not, s is approximately design-unbiased for the finite
population mean ON.
jES
When the Li.d. model for Y1, ••• , YN is correct, and thus an appropriate
basis for inference, we have the justification of model unbiasedness.
That is, for iy we have
I" I .f.
is expressible as
= I>if3N
j=1
= PXf3N, (5.49)
where F is a row vector of N 1s. For if we premultiply (5.46) by ~ 1" ,
we obtain
N
L(Yj -xi f3N) = O. (5.50)
j=1
According to the optimality-based argument of Section 5.7.1, general-
ized to a vector parameter, this would justify as an estimator of 1). the
quantity
N
iy = Lxi~s' (5.51)
j=1
where 13s satisfies the sample estimating equation
'"' a-:- 2
~_J_Xj(yj -xjf3s)
A
= O. (5.52)
jes 7rj
OPTIMAL-ESTIMATING FUNCTIONS 175
Special cases
Ratio estimator of Ty
If p = 1, so that xi is a real number x}, and aJ oc x}, then
N N
f3N = LY}I LX} = RN ,
}=l }=l
(5.54)
When
xr_(1Xl X21 ...... XN1) '
the model-assisted estimator t, reduces to the regression estimator
iL = N Pis + TxP2s
where
and
estimating function
N
L(yj - xj,6),
j=1
L(Yj - xj (3)/7Cj.
jES
(5.59)
Note that in the GREG approach, apart from the form of the bias
correction being provided by the model, the main emphasis is on the
role of Ty as a descriptor of the finite population. However, if the Xl:
condition does hold and '/3s is chosen as above, then :;y coincides
with iy of (5.51), which is obtained by regarding Ty as the population
estimator of the superpopulation parameter 1T X f3. This is because the
Xl: condition implies that
(5.60)
"a----;-:-(yj -xjf3s) = 0
~
j
-2
Xj A
jES J
by A'.
Let us apply the GREG approach to the special case of model (5.56)
in which f31 = =
0 and y 2, so that the variates Yj/Xj have common
mean f3 = f32 and common variance a 2. Here the Xl: condition is not
satisfied. In this case a fixed size (n) design with
(5.61)
178 INFERENCE FOR DESCRIPTIVE PARAMETERS
(5.63)
where Ps is the minimum variance unbiased estimator for f3 with respect
to the model, s being regarded as fixed. The estimator-design strategy
given by (5.61) and (5.62) can be shown to be optimal in the sense of
minimizing expected sampling variance, as in Section 5.5.2 (Godambe
and Joshi, 1965; Godambe and Thompson, 1973). A consequence is
the optimality of the monetary unit sampling of Section 3.12 for the
error model of Cox and Snell (1979) in Section 5.2.2.
If the subgroups VI, ... , Vp are disjoint and all represented in the sam-
ple, such weights always exist, and they may exist more generally; if in
addition the population is the union of disjoint members of VI , ... , V p,
then the weights satisfy
LWjs =N.
jes
If
jes
and the weights are calibrated to X, then we will use
DEL = - '~
" -log
I (-
Wjs
-) , (5.67)
)ES
. lr)' l/lr)'
X' = (
Xl
I
X2
... I)
•.• XN '
DKL = - '~wjslog
" (-Wjs
-) . (5.69)
.
)ES
l/lr)'
The GREG (regression) estimator (5.59) is a calibration estimator
since it satisfies (5.65). It can be shown to arise from minimizing
Suppose we let ajs = Cjs + 1, and let {3c be any p x 1 vector oflinear
combinations of the Yj , j E s, satisfying
LCjsYj = L x j{3c
jES Ns
182 INFERENCE FOR DESCRIPTIVE PARAMETERS
and
Then
e - Ty = I:>jsYj - LYj,
jES jlts
or
e - Ty = (Lxi)(,Bc - f3) - L(Yj - xi (3), (5.71)
jlts jlts
and the terms in (5.71) are each of mean zero and independent. It
follows that the 'best unbiased linear predictor' of Ty, namely the one
which minimizes £(e - Ty)2, is of the form
Tym = LYj + Lxi (3s' (5.72)
jES jlts
where (3s is the best unbiased linear estimator of {3. (For multidimen-
sional {3 this means that linear combinations of the components of (3s
have minimal variance.) The estimator (3s is obtained by weighted least
squares, and if the XL condition is satisfied by the model, then
LYj = L xi(3s,
jES jES
and we have
Tym = ITX(3s'
Thus in the special case of the XL condition, the predictor has the
same form as the model-assisted estimator (5.51), except that (3s is
optimal in a purely model-based sense. Unlike ,Bs of (5.52) in regular
cases, (3s need not be design-consistent for {3N of (5.47).
as in (5.70), and the weights {Wjs} are chosen so as to make the estim-
ator e exact whenever Y is in the column space of the N x p 'design
matrix' X. Then for any p x 1 vector f3,
N
e-Ty = LWjs(Yj-xj{3)-L(Yj-xjf3)
JES j=!
N
= L WjsEj(f3) - L Ej(f3), (5.74)
JES j=!
(5.78)
184 INFERENCE FOR DESCRIPTIVE PARAMETERS
vee) ~ --L
n ( Ej
gjS- - -
_)2
gE
(S.79)
n-l. JES
Jr. Jr
j
as in (2.9S), where
vee) ~ _n L (gjSEj)2
n-l.JES Jr.
j
= (S.80)
X' =( 1
Xl X2
... 1)
•.. XN '
(S.81 )
jEs
Thus the 'g-weights' (Sarndal et at., 1992, p. 232) will indeed be close
to 1 in large samples, and for these the variance estimators given above
should estimate the design MSE of ewell.
If we have a fairly strong belief in the superpopulation model of
(S.4S) as a description of how the Yj are generated, we might prefer
to look at regression estimation from the point of view of prediction.
UNCERTAINTY IN RATIO AND REGRESSION ESTIMATORS 185
with
Ej([3) = Yj - xi [3.
It is easily seen that for populations which are large relative to the
sample size, the first term of (5.83) will be dominant. Moreover, a
robust estimator of the first term of (5.83) is provided by
L(WjS - 1)2E;, (5.84)
JES
which is approximately
vm(e) = L(WjSEj)2. (5.85)
JES
The increase in (5.85) over (5.84) will tend to compensate for the
omission of an estimator of the second term of(5.83), particularly if the
sample is representative with respect to the values of Var{(E/[3)} =
aJ. See Kott (1990) for a related discussion.
The closeness of (5.85) and (5.80), which differ only by a factor of
n/(n - 1), has a couple of implications. First, the model-based MSE
estimator vm(e) is seen to have desirable frequency properties under the
design, assuming the conditions leading to (5.80) are satisfied. That is,
it approximates an unbiased estimate of the design MSE for a certain
design which is approximately with replacement. Second, the design-
based distribution under which this is so can be taken to be conditional
on the sample statistic
Ws = {(Wjs, aJ): j E s},
186 INFERENCE FOR DESCRIPTIVE PARAMETERS
Xr =( 1 ... I)
.•• XN
(5.86)
XI
jes
t = LWjsYj
jES
jS j
2
JES
for each fixed s. Since the prediction MSE depends on the sample s
through LjES Xj or through is, we are led to evaluating the sampling
properties of TR through its distribution conditional on is. This is a
distribution which should support model-based inference about Ty if
the model is correct.
This sort of evaluation is in effect what has been done (with a differ-
ent though related purpose) in empirical studies by Royall and Cum-
berland (1981a; 1981b; 1985). They considered a collection of real
populations representing the contexts in which ratio estimation has tra-
ditionally been applied. In each case they took a large number of sam-
ples of size n = 32 at random, grouped these according to the values
of is into 20 groups, and in each group observed the bias and MSE
of TR, the bias of various estimators of MSE, and the performance of
associated confidence intervals. The grouping effectively resulted in the
measurement of sampling properties conditional on is.
If the population is generated from the model (5.90), then the con-
ditional sampling bias, denoted symbolically by
Ep(TR - Tylis ),
would be expected to be close to 0 for large or moderately large sam-
ples. However, it is evident from the studies of Royall and Cumberland
that this is seldom true for real popUlations when is is not relatively
CONDITIONAL SAMPLING PROPERTIES 189
close to /-Lx. One reason is that the ratio estimator is model-biased for
fixed s when £Yj is not f3Xj but f31 + f32Xj, where f31 =1= O. In that case,
-
£(TR - Ty) = f31 (Tx)
is - N , (5.93)
which has the sign of f31 if is < /-Lx and the opposite sign if is > /-Lx.
The sampling bias conditional on is is an estimate of the model bias,
and hence will exhibit the same behaviour.
The unconditional and conditional biases of TR under SRS can be
assessed for large samples using the theory in Chapter 3. Note first that
(5.94)
(5.96)
y
1\
y=Rsx
x
®
Ily - - - - - - - - - -. X
X I
_ X X I
111 IN - - - - - - X- - - -
X
Figure 5.3 Population scatterplot lies close to a line with positive intercept. If
the points ® are sampled. so that is > /-Lx. TR/ N will underestimate /-Ly-
TR -Ty (5.97)
JV(TR ) ,
where VcTR) is a robust MSE estimator, do not perform well: their ac-
tual coverage frequencies, conditional on is> tend to be below nominal
levels, and the shortfall in coverage tends to be more serious at one
end of the interval than the other. Some of the problem is accounted
for by the conditional bias noted above, as can be seen from the fact
that it is less serious in a comparable study for the simple regression
estimator, reported in the same paper.
Another contributing factor seems to be the skewness of the error dis-
tribution in real populations, which means that lines fitted by (weighted)
least squares, together with estimates of the error variance, do not sum-
CONDITIONAL SAMPLING PROPERTIES 191
marize the distributions of the Yj very well. In the next section we see
how prediction using the ratio estimator can sometimes be improved
by use of a parametric model which incorporates skewness of the error
distribution.
U = LXj, V = LXj
jES j<f.s
where
where Zj = Yj - f31N - f32NXj, and f31N and f32N are the population
least squares coefficients. We first look at its properties under SRS,
conditional on is. It can easily be shown that under SRS, Ezs = 0 and
the covariance of is, Zs is O. Thus, if conditions are such that is, Zs and
(LjES zjxj)ln are approximately trivariate normal, we might expect
that is and Zs are approximately independent, that E (zs lis) ::::: 0, and
that the conditional bias of hi N given is should be approximately
(JLx - is) "'" _
(n _ I)E(sx2Iis) nE(~xjzjlnlxs). (5.104)
JES
linear relationship between the mean of Yj and Xj' For fixed s, there
is a model bias if the mean of Yj is not in fact appropriately linear
in Xj' Correspondingly, there is a bias with respect to the sampling
design, conditional on those functions of {U, x j) : j E s} which would
determine the distribution of the estimator if the model held. On the
other hand, if there is no model bias, the conditioned randomization
lends 'support' to inference.
When the linear superpopulation model is not a matter of belief
but a matter of convenience, or simply a means of constructing an
estimator of Ty likely to be somewhat more efficient than t,
then
the unconditional design frequency properties of the estimator take on
an additional importance: we are assured of approximate unbiasedness
of the estimator, and in fortunate circumstances approximately correct
coverage of confidence intervals based on variance estimation formulae
like those of Section 5.11. These properties are valid in the context of
implementing the sampling procedure many times on the population.
However, the inference content of these estimates and intervals may be
flawed by their inconsistency with our knowledge of the population:
if the dependence of £ Yj on x j were thought possibly to contain an
intercept term, we would knowingly be incurring an estimable bias by
using the ratio estimator. Similarly, if the dependence were thought to
be quadratic, we would be incurring a quantifiable bias in the simple
regression estimator, in the sense of the model or in the sense of the
sampling design conditional on Xs and s;.
5.13 Robustness and mean function fitting
for small sample sizes be less stable (because of small stratum sample
sizes) than the regression MSE estimator.
A compromise between the two methods might be to increase the
flexibility of the model in model-assisted estimation, and think in terms
of fitting a 'mean function' for Y which is a relatively smooth function
of x. Thus the model (5.45) would become
(5.105)
where the Ej are independent with Var(Ej) = aJ. The form ofa model-
assisted estimator for 1'y would be
where EJ, •.• , EN are independent and EEj = 0, Var(Ej) = Cases of aJ.
this problem have been treated by Chambers and Dunstan (1986), Kuk
(1988), Godambe (1989a), Rao et al. (1990), Chambers et al. (1992),
and Rao and Liu (1992).
Let G denote the model c.d.f. of Ej. Then under the superpopulation
model,
I:tPj =.?:
N N [
J(Yj :::: y) - G
(Y -ax' (3)]
i (S.108)
;=1 ;=1 J
(S.109)
~N !I:
.
~ [J(Yj :::: Y) -
Tt;
G (Y -xi (3)] + t
aJ . 1
G (Y -X
aJ
i(3 )I.
J~ ;=
(S.1lO)
When G and f3 are unknown, as is usually the case, we would want
to substitute well-behaved estimates from the sample. Note that this
case resembles regression estimation of 1'y when the XI: condition is
not satisfied, since the estimating function (S.109) will not normally be
part of the system used for estimating G and f3.
CONomONAL SAMPLING PROPERTIES 197
N1[,,1
Jes J
"
~ ;-:{I(yj ::: Y) - G j } + ~ G j
f.,,]
J=l
, (5.113)
where
Gj = G(u), U = (y-xj13s)/aj.
Rao et al. (1990) have ensured a kind of conditional design unbiased-
ness by replacing the first Gj in (5.113) by
A* 1 1 * A* N A* ]
F (y) = N [ I:~{I(yj :::y)-Gj }+ I:Gj , (5.115)
JES J J=1
where
OJ = O*{(y -xi (3)/aj}.
For fixed y, the empirical variance of the F*(y) values should approx-
imate the prediction or model MSE of FN(y) well, provided it does
not depend greatly on the value of {3 used.
CHAPTER 6
The first context arises, typically with a single-stage design, when the
model specifies the response values Yj for the sampled units to be in-
dependent, with probability functions known up to a finite-dimensional
parameter f). Generally, some component(s) of f) form the 'parameter
of interest'. For simplicity here and in the next section we shall take f)
to be real-valued.
If the sampling design is simple random sampling, then the log likeli-
SINGLE-STAGE DESIGNS AND THE USE OF WEIGHTS 203
hood function for the parameter () takes the form
Llogjj(yj; (}). (6.1)
jES
where Yj is the observed value of Yj and jj is the probability function
for the observation at unit j. In fact, strictly speaking, (6.1) is the log
likelihood function whenever p(s) from the sampling design is inde-
pendent of the parameter, and independent of the array Y of response
values. The score function, a model-unbiased estimating function for
(), has realized value
(6.2)
(6.3)
jES
or more generally
8
L Wjs 8(} 10gfj(yj; (}). (6.4)
jES
Ep(~ WjSZj) ~ Tz
JES
as we have seen for general estimating functions in Section 4.1. The jus-
tification is less obvious here, where the emphasis is on the estimation
of a superpopulation parameter, () say, rather than its finite population
counterpart. If we believe sufficiently in the model to believe in (6.5)
as a population score function, then we should believe even more in
(6.2) as a sample score function; and there should be no need for the
unbiased estimation of (6.5) provided by the use of weights in (6.3)
or (6.4). On the other hand, (6.3) and (6.4) are still estimating func-
tions for the superpopulation parameter and the corresponding finite
204 ANALYTIC USES OF SURVEY DATA
L ¢j(Yj , (),
N
<p(Y, () = (6.6)
j=l
and suppose to begin with that £ {¢j (Yj , ()} = 0 for each j. Let
¢; = L Wjs¢j(Yj , () (6.7)
jES
be a sample estimating function, with the Wjs not depending on Y. If
the design probabilities are also independent of Y, then
£(¢;Isample is s) =L WjS£{¢j(Yj , ())} = 0;
jES
Var(¢;lsample is s) =L W]s£¢];
jEs
and a model-unbiased estimator of this variance would be
and
Vm(tP;) = ~W]s (aao10gh(yj;0)Y (6.11)
JES
If the model were well substantiated, it might be appropriate to replace
the relatively robust Vm (tP;) by
a2
I s(O) = - ~ W]s a0 2 log h (yj; 0) (6.12)
JES
which is a special case of tP; of (6.7). Clearly EptPs = <I> = <I>(Y, 0),
and we adopt the weakened assumptions that YI, ... , YN are indepen-
dent and £<1> = O. It will not generally be true that £(tPslsample is s) =
O. However, with respect to the compound model-design distribution,
we do have
(6.14)
Thus for measuring uncertainty it is natural to consider the compound
mean squared error
£Ep(tPs - £<1»2 = £Ep(tPs _ <I> + <I> _ £<1»2
= £Varp(tPs) + Var(<I», (6.15)
206 ANALYTIC USES OF SURVEY DATA
where $\mathrm{Var}_p$ and Var denote variances with respect to design and model respectively. Accordingly, under the compound distribution, $\phi_s$ will have mean 0 and variance $\mathcal{E}\mathrm{Var}_p(\phi_s) + \sum_{j=1}^{N}\mathrm{Var}\{\phi_j(Y_j,\theta)\}$, which can be estimated if desired by a combined variance estimator $v_c(\phi_s)$ (6.16). In the score function case, the census estimating function is
$$\sum_{j=1}^{N}\frac{\partial}{\partial\theta}\log f_j(Y_j;\theta).$$
When the inclusion probabilities depend on the responses, the likelihood of the sample data takes the form
$$\prod_{j\in s}\frac{\pi_j(y_j)}{\pi_{j0}(\theta)}\,f_j(y_j;\theta)\;\prod_{j\in s}\pi_{j0}(\theta)\;\prod_{j\notin s}\big(1-\pi_{j0}(\theta)\big),\qquad(6.21)$$
where
$$\pi_{j0}(\theta) = \int\pi_j(y)f_j(y;\theta)\,dy;$$
the corresponding score function contains the term
$$\cdots + \sum_{j=1}^{N}(1-I_{js})\frac{\partial}{\partial\theta}\log\big(1-\pi_{j0}(\theta)\big),\qquad(6.22)$$
and is unbiased 'on the average' if the model is true. Godambe (1989b) has discussed its optimality properties.
Here again it is possible to write down the score function from the sample data, as
$$N_1\frac{\partial\log w_1(\beta)}{\partial\beta} + N_0\frac{\partial\log w_0(\beta)}{\partial\beta} + \sum_{j\in s_1}\frac{\partial}{\partial\beta}\log f^*(y_j\mid x_j;\beta),$$
where
$$F(T_0\mid x_j;\beta) = \int_{T_0}^{\infty}f(y\mid x_j;\beta)\,dy.$$
If no $x_j$ values outside the truncated data set were observed, the score function would be
$$\sum_{j:\,y_j\le T_0}\frac{\partial}{\partial\beta}\log f(y_j\mid x_j;\beta) + \sum_{j:\,y_j>T_0}\frac{\partial}{\partial\beta}\log\int F(T_0\mid x;\beta)g(x)\,dx,\qquad(6.29)$$
$$\frac{1-p_2}{p_2^2}\sum_{j\in s_0}\Big(\frac{\partial}{\partial\theta}\log F(T_0\mid x_j;\theta)\Big)^2 + \sum_{j:\,y_j\le T_0}v_{1j} + \frac{1}{p_2}\sum_{j\in s_0}v_{2j}.\qquad(6.31)$$
$$\theta = \sum_{j=1}^{N}\theta_j/N,$$
where
$$\theta_j = \mathcal{E}(Y_j),\qquad j = 1,\dots,N.$$
We saw this form earlier in Section 6.1 as one of the possible interpretations of a superpopulation proportion $\theta$, an interpretation as the population average of probabilities $\theta_j$ for stochastic subject sequences. If we have no model for the $\theta_j$, or if we have one but would prefer not to rely on it completely, then we might wish in estimating $\theta$ to use a randomized design and the estimating function
$$\phi_s = \sum_{j\in s}\frac{Y_j - \theta}{\pi_j}.$$
These are model-unbiased 'on the average under the design', though not unbiased conditional on the sample drawn. The estimator of $\theta$ from the equation $\phi_s = 0$ would be
$$\hat\theta_s = \Big(\sum_{j\in s}y_j/\pi_j\Big)\Big/\Big(\sum_{j\in s}1/\pi_j\Big).\qquad(6.32)$$
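To make (6.32) concrete, here is a minimal sketch in Python; the data, the inclusion probabilities and the function name are all hypothetical, not from the text.

```python
# A minimal sketch of the estimator (6.32): the root of the estimating
# function phi_s = sum_{j in s} (y_j - theta)/pi_j = 0. All data here are
# made up for illustration.
import numpy as np

def theta_hat(y, pi):
    """Ratio of pi-weighted total of y to pi-weighted population size."""
    w = 1.0 / np.asarray(pi, dtype=float)            # design weights 1/pi_j
    return np.sum(w * np.asarray(y, dtype=float)) / np.sum(w)

y = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])         # sampled responses
pi = np.array([0.2, 0.5, 0.1, 0.4, 0.3, 0.25])       # inclusion probabilities
print(theta_hat(y, pi))
```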
The variance estimator $v_c(\phi_s)$ from (6.16) would take the corresponding weighted form.

$$V(\theta) = -\mathcal{E}\Big(\frac{\partial g}{\partial\theta}\Big),$$
and hence that
$$T(\theta) = V^{-1}(\theta).$$
In many applications, we have a vector $g(\theta)$ whose components are elementary estimating functions $g_j(\theta)$, $j = 1,\dots,M$. These components are functions of $\theta$ and parts of the total observation; the vector $g(\theta)$ has zero expectation, and covariance matrix $V(g)$. Typically M is greater than p, and we consider for estimating $\theta$ the class of all $p\times 1$ systems $\Phi(\theta) = 0$ with components of $\Phi(\theta)$ being linear combinations of components of $g(\theta)$. That is, we consider systems like
$$Ag(\theta) = 0,$$
for all $u, v = 1,\dots,p$. Thus if $\Phi_u^*(\theta)$ and $\Phi_v^*(\theta)$ are orthogonal in the sense of being uncorrelated, $\Phi_u^*(\theta)$ changes little with changes in $\theta_v$ near the true value.
More generally, as considered by Godambe (1991), suppose the parameter $\theta$ is partitioned so that $\theta^T = (\beta^T,\alpha^T)$, and that $\Phi^*(\theta)$ is correspondingly partitioned as $\Phi^{*T} = (\Phi_1^{*T},\Phi_2^{*T})$. The subsystem
$$\Phi_1^*(\theta) = 0$$
would be optimal for estimating $\beta$ if $\alpha$ were known. Thus its solutions would depend on $\alpha$ in general. When $\alpha$ is unknown, then for assessing uncertainty in the estimation of $\beta$, we would prefer to use an estimating function subsystem as little dependent on $\alpha$ as possible.
$$\Phi_1^{**} = \Phi_1^* - E\big(\Phi_1^*\Phi_2^{*T}\big)V^{-1}(\Phi_2^*)\,\Phi_2^*,\qquad(6.41)$$
and we have $\partial\Phi_1^{**}/\partial\alpha \approx 0$, since (6.40) implies the rough approximation
$$\Phi_1^{**} \approx \Phi_1^* - \frac{\partial\Phi_1^*}{\partial\alpha}\Big(\frac{\partial\Phi_2^*}{\partial\alpha}\Big)^{-1}\Phi_2^*.$$
Thus it will make sense to use the variability of $\Phi_1^{**}$, at a fixed value of $\alpha$, in approximate inferences about $\beta$ in the absence of knowledge of $\alpha$.
Finally, let us suppose that our system of estimating equations can be written
$$\Phi(\theta) = \sum_{j=1}^{N}\phi_j(\theta) = 0,\qquad(6.42)$$
the model and its variants. Essentially, we will describe how inference
might be made for superpopulation parameters if the whole finite pop-
ulation were surveyed. Then in Section 6.5 we will indicate how the
analyses would proceed using data from a sample, obtained through
a simple or a complex sampling design. These analyses will focus in
particular on the special case of logistic regression.
A comprehensive treatment of the generalized linear model is given
by McCullagh and Nelder (1989).
has real components, and that these are independently and normally distributed, with

(6.47)

Suppose there is a p-dimensional covariate x, with value for the jth component given by the row vector $x_j^T$. Then the regression model
$$Y_j = x_j^T\beta + \epsilon_j,\qquad(6.48)$$
with $\epsilon_1,\dots,\epsilon_N$ independent and $\epsilon_j\sim N(0,\sigma_j^2)$ (cf. (5.45)), can be expressed as (6.47) together with a linear model for $\theta$, namely (6.46), where X has $x_j^T$ as its jth row.
In another example, Y may be a vector of independent binary responses, with
$$P(Y_j = 1) = \frac{e^{\theta_j}}{1+e^{\theta_j}},\qquad P(Y_j = 0) = \frac{1}{1+e^{\theta_j}}.\qquad(6.49)$$
can be written

(6.51)
In the logistic regression case, where the log likelihood function for $\theta$ would have the form
$$\log\prod_{j=1}^{N}\Big[\frac{e^{\theta_j}}{1+e^{\theta_j}}\Big]^{Y_j}\Big[\frac{1}{1+e^{\theta_j}}\Big]^{1-Y_j},\qquad(6.53)$$
or
$$X^T\big(Y - \mu(\beta)\big) = 0,$$
where
$$\mu_j(\beta) = \frac{e^{x_j^T\beta}}{1+e^{x_j^T\beta}}$$
is the expected value of $Y_j$ under the linear model.
These maximum likelihood equations may be solved iteratively. For example, in the vth iteration of the Newton-Raphson algorithm,
$$\beta^{(v+1)} = \beta^{(v)} + \big(X^TV_vX\big)^{-1}X^T\big(Y - \mu(\beta^{(v)})\big),\qquad(6.56)$$
where
$$V_v = \mathrm{diag}\Big(e^{(X\beta^{(v)})_j}\big/\big(1+e^{(X\beta^{(v)})_j}\big)^2\Big).$$
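As an illustration of the iteration, the following sketch implements (6.56) as reconstructed above; the update matrix $(X^TV_vX)^{-1}$ is our reading of the garbled display, and the simulated data and names are ours.

```python
# A sketch of the Newton-Raphson iteration (6.56) for the logistic maximum
# likelihood equations X^T (Y - mu(beta)) = 0, with V_v as defined above.
import numpy as np

def logistic_mle(X, Y, n_iter=25, tol=1e-10):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))       # mu_j(beta)
        V = mu * (1.0 - mu)                          # diagonal of V_v
        step = np.linalg.solve(X.T @ (V[:, None] * X), X.T @ (Y - mu))
        beta = beta + step                           # the update (6.56)
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * X[:, 1])))
Y = rng.binomial(1, p).astype(float)
print(logistic_mle(X, Y))   # should be near (0.5, 1.2)
```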
$$J_\beta = \Big(\Big(\sum_{j=1}^{N}\frac{\partial l_j}{\partial\beta_r}\,\frac{\partial l_j}{\partial\beta_s}\Big)\Big)_{p\times p},\qquad(6.61)$$
where $l_j$ is the log likelihood term associated with $f_j$, leading to confidence regions obtained from the approximations
$$S_\beta^TJ_\beta^{-1}S_\beta \approx \chi^2_{(p)},\qquad(6.62)$$
$$S_\beta^TJ_{\hat\beta}^{-1}S_\beta \approx \chi^2_{(p)},\qquad(6.63)$$
or
(6.67)
All the methods for confidence regions described here can be shown
to be valid (and equivalent) for large N under suitable conditions on
the model and covariates (Cox and Hinkley, 1974, Chapter 9).
(6.68)

(6.69)
The reduced model that assumes $\theta_j$ to be constant overall (model 2) is expressed as
$$\theta = X_2\beta,$$
where $X_2$ is a column of 1s. If, furthermore, model 1 is a logistic regression model for binary responses, we have
$$\frac{e^{\hat\beta}}{1+e^{\hat\beta}} = \sum_{j=1}^{N}Y_j\big/N = \hat P.$$
If model 2 holds, then (6.68) becomes
$$\frac{N_1(\hat P_1 - \hat P)^2}{\hat P_1(1-\hat P_1) + (\hat P_1-\hat P)^2} + \frac{N_2(\hat P_2-\hat P)^2}{\hat P_2(1-\hat P_2) + (\hat P_2-\hat P)^2} \approx \chi^2_{(1)}.$$
$$l(\beta,\varphi) = \sum_{j=1}^{N}\Big[\frac{Y_j\psi_j - b(\psi_j)}{\varphi a_j} + c_j(Y_j,\varphi)\Big],\qquad(6.70)$$
$$S_\beta = \sum_{j=1}^{N}\frac{Y_j - b'(f(x_j^T\beta))}{\varphi a_j}\,f'(x_j^T\beta)\,x_j = 0,\qquad(6.71)$$
$$S_\varphi = \sum_{j=1}^{N}\Big[-\frac{1}{\varphi^2}\,\frac{Y_jf(x_j^T\beta) - b(f(x_j^T\beta))}{a_j} + \frac{\partial c_j(Y_j,\varphi)}{\partial\varphi}\Big] = 0.$$
or on
$$S_\beta^TJ_\beta^{-1}S_\beta \approx \chi^2_{(p)},\qquad(6.73)$$
where
$$J_\beta = \sum_{j=1}^{N}\frac{\big(Y_j - b'(f(x_j^T\beta))\big)^2}{\varphi^2a_j^2}\,\big[f'(x_j^T\beta)\big]^2\,x_jx_j^T.\qquad(6.74)$$
Although $S_\beta$, $I_\beta$ and $J_\beta$ all involve $\varphi$, the score statistics in (6.72) and (6.73) do not. For testing nested models, (6.68) is applicable.

Here $\psi_j = f(x_j^T\beta)$. If we write $\mathcal{E}Y_j$ alternatively as $\mu_j = \mu_j(\beta)$ and $\mathrm{Var}(Y_j)$ as $\varphi V_j = \varphi V_j(\beta)$, we may write the first component of the score system (6.71) as
$$S_\beta = \sum_{j=1}^{N}\frac{Y_j-\mu_j}{\varphi V_j}\,\frac{\partial\mu_j}{\partial\beta}.\qquad(6.79)$$
$$S_\varphi = \sum_{j=1}^{N}A_j\big((Y_j-\mu_j)^2 - \varphi V_j - B_j(Y_j-\mu_j)\big),\qquad(6.80)$$
where $A_j = [\varphi^2V_j(\gamma_{2j}+2-\gamma_{1j}^2)]^{-1}$, $B_j = \gamma_{1j}(\varphi V_j)^{1/2}$, and $\gamma_{1j}$ and $\gamma_{2j}$ are the skewness and kurtosis of $Y_j$. Although it is derived from
the exponential model, the system (6.79) and (6.80) is unbiased much
more generally, as long as the specification of the first two moments
of Yj is the same as for the exponential model. The same is true if the
coefficients A j and Bj are replaced by other approximating weights.
Thus in (6.79) and a suitable approximation to (6.80) we have a system
of estimating functions which is very widely applicable, and also ef-
ficient for the generalized linear exponential model (6.70). The notion
of extended quasilikelihood was put forward by Nelder and Pregibon
(1987). For discussion of the extended quasilikelihood system above,
see, for example, Godambe (1992) and Dean (1991).
$$l(\beta,\varphi) = \sum_{j=1}^{N}\Big[\frac{Y_j\psi_j - b(\psi_j)}{\varphi a_j} + c_j(Y_j,\varphi)\Big],$$
with $\psi_j = x_j^T\beta$. Letting $\mathcal{E}Y_j = \mu_j(\beta) = b'(\psi_j)$ and $\mathrm{Var}(Y_j) = \varphi V_j$, we have
$$S_\beta = \sum_{j=1}^{N}\frac{Y_j-\mu_j}{\varphi V_j}\,\frac{\partial\mu_j}{\partial\beta}.$$
Two estimating functions for $\varphi$, namely $S_\varphi$ of (6.71) and $S_\varphi$ of (6.80), may be considered.
We have seen in Sections 6.4.1-6.4.4 several suggestions for obtaining confidence regions for $\beta$, and will now focus on the one most easily generalized, namely the one which involves the robust estimator $J_\beta$ of the covariance matrix of $S_\beta$. Since $\varphi$ enters $S_\beta$ as a scale factor, confidence regions for $\beta$ using $J_\beta$ can be based on (6.62) or (6.63) whether $\varphi$ is known or unknown: in the new notation we have
$$S_\beta^TJ_\beta^{-1}S_\beta \approx \chi^2_{(p)},\qquad S_\beta^TJ_{\hat\beta}^{-1}S_\beta \approx \chi^2_{(p)},\qquad(6.81)$$
where $S_\beta$ is given in (6.79) and
$$\exp\Big\{\frac{Y_j\psi_j - b(\psi_j)}{\varphi a_j} + c_j(Y_j,\varphi)\Big\},\qquad(6.83)$$
as in (6.70). It should be noted that $\varphi$ is equal to 1 in the probability function case, when the responses $Y_j$ are discrete. The $a_j$ are constants as before. However, $\psi_j$ is now a function of both $\beta$ and $\alpha$, namely

(6.84)

and by

(6.85)

we also have

(6.86)

where $V_j(\beta,\alpha) = b''(\psi_j)a_j = h'(\psi_j)a_j$.
Suppose we are interested primarily in the estimation of $\beta$. We specialize to the case where
$$\psi_j = x_j^T\beta + z_j^T\alpha,\qquad(6.87)$$
and consider two lines of attack, namely analysis through joint estimation of $\beta$ and $\alpha$, and analysis through marginal moments.
For fixed $\alpha$ and all parameters known apart from $\beta$, the score function system for the estimation of $\beta$ can be written as in (6.71), as
or by

where
$$(1+\hat\lambda) = \sum_{j=1}^{N}\big(Y_j - m_j(\beta)\big)^2\Big/\sum_{j=1}^{N}m_j(\beta),$$
depending on the purpose. We would thus have specified for $Y_j$ a marginal loglinear model which is of Poisson type, except for constant overdispersion in the variance structure.
Alternatively, we might specify that the $a_j$ have mean 1 and some constant variance $\tau$. Then the marginal mean would be $m_j = m_j(\beta) = e^{x_j^T\beta}$, but the marginal variance of $Y_j$ would change to $m_j(1+\tau m_j)$. Thus the overdispersion would be relatively greater for larger values of $m_j$. Uncertainty estimation for this case when the $a_j$ are constant within clusters is treated in Section 6.4.10.
and
$$p = \frac{e^{\alpha_j+\beta_0}}{1+e^{\alpha_j+\beta_0}}.$$
Thus if we fit to data coming from (6.104) a model with logistic regression marginal expectations as in (6.105), the coefficient of $x_j$ will approximate $\beta_1^*$, which is 'attenuated' in comparison with $\beta_1$.
If in the mixed model the $a_j$ are constant within clusters, (6.106) is expressible as
$$\beta_1^* = \beta_1\big(1-\rho(0)\big),$$
where $\rho(0)$ is the intracluster correlation among the $Y_j$ when $\beta_1$ is 0.
Which of the two approaches to use will depend on the purpose of modelling. In the logistic regression example first discussed, if we want to express the dependence of the response probabilities on x with $\alpha$ held constant, we would be interested in $\beta_1$ of (6.104). On the other hand, if we want to be able to express the way in which the response of a randomly selected member of the population would depend on the x variate, we would be interested in estimating the corresponding marginal expectation parameter, which would be closer to $\beta_1^*$ of (6.106). Neuhaus et al. (1991) have distinguished the two purposes as the 'cluster-specific' and 'population-averaged' approaches. Holt (1989) has referred to a similar distinction in terms of 'disaggregation' and 'aggregation'.
Let us proceed to look at an analysis when a marginal linear exponential dispersion model is the basis of parametrization. We will now denote the marginal regression parameter by $\beta^*$, to distinguish it from the $\beta$ of Section 6.4.7. If the marginal density were given by (6.70) with $\psi_j = x_j^T\beta^*$, and if the $Y_j$ were independent, the score function
The joint system (6.92) and (6.94) for estimation of the $\alpha_{(r)}$ and $\beta$ reduces to the following:
$$\Phi_{2r}^*(\beta,\alpha_{(r)}) = \sum_{j\in B_r}\frac{Y_j-\mu_j}{\varphi a_j} - \frac{\alpha_{(r)}}{\sigma^2} = 0,\qquad(6.111)$$
$r = 1,\dots,L$;
$$\Phi_1^{**}(\beta,\alpha) = \sum_{r=1}^{L}\Phi_{1r}^{**}(\beta,\alpha) = 0,\qquad(6.112)$$
where
$$\Phi_{1r}^{**}(\beta,\alpha) = \sum_{j\in B_r}\frac{Y_j-\mu_j}{\varphi a_j}\,x_j - \sum_{k\in B_r}x_kd_{1k}\,V_{(r)}\,\Phi_{2r}^*(\beta,\alpha_{(r)}),\qquad(6.113)$$
with $d_{1k} = \mathcal{E}h'(\psi_k)/\varphi a_k$ and
$$V_{(r)} = \frac{1}{\sigma^{-2} + \sum_{j\in B_r}d_{1j}}.$$
As pointed out by Liang and Zeger (1986) in the longitudinal data analysis context, under suitable regularity conditions a consistent estimator of this as the number of clusters approaches infinity would be
$$J_\beta^* = \sum_{r=1}^{L}X_r^TD_{1r}W_r^{-1}\eta_r\eta_r^TW_r^{-1}D_{1r}X_r.\qquad(6.118)$$

(6.119)

(see Liang and Zeger, 1986).
Similarly, the system (6.109) for the marginal linear exponential model can be written
$$\Phi^*(\beta^*) = \sum_{r=1}^{L}X_r^TD_{1r}W_r^{-1}\eta_r = 0,\qquad(6.120)$$
where $\eta_r$ is the vector of the elements of $g(\beta^*)$ for j in the rth cluster, and $D_{1r}$ and $W_r^{-1}$ are redefined in the context of the marginal model. Confidence regions for $\beta^*$ would be based on

(6.121)

with
$$J_{\beta^*}^* = \sum_{r=1}^{L}X_r^TD_{1r}W_r^{-1}\eta_r\eta_r^TW_r^{-1}D_{1r}X_r.\qquad(6.122)$$
To illustrate the marginal moment technique in another setting, let us suppose that the true model is that for j in the rth cluster, conditional on $\alpha_{(r)}$, $Y_j$ is Poisson with mean $\alpha_{(r)}e^{x_j^T\beta}$, and that the $\alpha_{(r)}$ are independent with $\mathcal{E}\alpha_{(r)} = 1$, $\mathrm{Var}(\alpha_{(r)}) = \tau$. Then we have $g_j(\beta) = Y_j - m_j$, where $m_j = e^{x_j^T\beta}$, with $\mathrm{Var}\{g_j(\beta)\} = m_j(1+\tau m_j)$ and $\mathrm{Cov}\big(g_j(\beta), g_{j'}(\beta)\big) = \tau m_jm_{j'}$ if $j, j'$ are distinct units in the rth cluster. The resulting marginal estimating function is
$$\sum_{r=1}^{L}\Big(1+\tau\sum_{j\in B_r}m_j\Big)^{-1}\sum_{j\in B_r}(Y_j-m_j)x_j.\qquad(6.123)$$
$$\hat\theta = \sum_{j=1}^{N}h\big(x_j^T\hat\beta + z_j^T\hat\alpha\big)\big/N.\qquad(6.125)$$

(6.126)
$$\mathrm{Var}(Y_j\mid\alpha) = \varphi a_jh'(\psi_j)$$
for j in the ith area. The random effects will be assumed to satisfy
$$\mathcal{E}(\alpha_{(i)}) = 0,\qquad i = 1,\dots,L;$$
and if $\alpha$ is the vector of the $\alpha_{(i)}$,
$$\mathrm{Cov}(\alpha) = \Sigma,\qquad(6.127)$$
where $\Sigma$ is a non-singular $L\times L$ matrix, possibly diagonal. Let $\psi$ be the vector of the $\psi_{(i)}$, and suppose our interest is in estimating $\psi$, which is essentially the mean function in this application. Thus it is helpful to transform the parameters of the conditional model from $(\beta,\alpha)$ to $(\beta,\psi)$.
Elementary unbiased estimating functions are
$$Y_j - \mu_j,\quad j = 1,\dots,N,\qquad\text{and}\qquad \psi_{(i)} - x_{(i)}\beta,\quad i = 1,\dots,L.$$
For notational convenience, as before, we define an $N\times 1$ vector g to have jth component

(6.128)

for j in the ith area. Note that this modified elementary estimating function is a function only of $\psi$ in the new parametrization. We may then write the system which is 'optimal' for estimating $(\beta,\psi)$ as
$$\Phi_1^*(\beta,\psi) = -X^TA^T\Sigma^{-1}(\psi - AX\beta) = 0,\qquad(6.129)$$
(6.130)

where $V^{-1} = (\Delta+\Sigma)^{-1}$, and
$$\hat\psi = (\Delta^{-1}+\Sigma^{-1})^{-1}\Delta^{-1}\bar Y + (\Delta^{-1}+\Sigma^{-1})^{-1}\Sigma^{-1}AX\hat\beta.\qquad(6.131)$$
This estimator of $\psi$ coincides with the best linear unbiased population level 'predictor' of $\psi$, and is composite in the sense of being a combination of a direct estimator $\bar Y$ and a regression estimator $AX\hat\beta$. Thus, provided $\Delta$ and $\Sigma$ are known, we have solved the problem of estimating the mean function.
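A small numerical sketch of the composite form of (6.131), under the simplifying assumption that $\Delta$ and $\Sigma$ are diagonal, so that each area's predictor is a scalar shrinkage between a direct mean and a synthetic regression value; all inputs and names below are hypothetical.

```python
# A minimal sketch of a composite (shrinkage) predictor per area:
# gamma_i * (direct mean) + (1 - gamma_i) * (synthetic value x_i' beta_hat),
# which is the diagonal special case of (6.131).
import numpy as np

def composite_estimate(ybar, x, beta_hat, delta, sigma2_a):
    """ybar: direct area means; delta: their sampling variances;
    sigma2_a: random-effect variance. Returns the composite predictor."""
    gamma = sigma2_a / (delta + sigma2_a)     # shrinkage weight per area
    synthetic = x @ beta_hat                  # regression (synthetic) part
    return gamma * ybar + (1.0 - gamma) * synthetic

ybar = np.array([10.2, 8.1, 12.9])
x = np.array([[1.0, 9.0], [1.0, 8.5], [1.0, 12.0]])
beta_hat = np.array([1.0, 1.0])
delta = np.array([2.0, 0.5, 4.0])             # larger -> more shrinkage
print(composite_estimate(ybar, x, beta_hat, delta, sigma2_a=1.5))
```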
Since $\psi = AX\beta + \alpha$, the covariance matrix of the error $\hat\psi - \psi$ can be shown to be

(6.133)

where
$$\gamma_i = \frac{\sigma_a^2}{\varphi a_i/N_i + \sigma_a^2}$$
can be regarded as a measure of intra-area correlation. The mean squared error is
$$\mathcal{E}\big(\hat\psi_{(i)} - \psi_{(i)}\big)^2 = \gamma_i\,\frac{\varphi a_i}{N_i} + (1-\gamma_i)^2\kappa_i,\qquad(6.134)$$
(6.135)
Assuming $\varphi$ to be known and writing the right-hand side of (6.134) as $g_{1i}(\sigma_a^2) + g_{2i}(\sigma_a^2)$, it can be shown (Prasad and Rao, 1990; Singh, 1994) that
$$\mathcal{E}\big(\hat\psi_{(i)} - \psi_{(i)}\big)^2 = g_{1i}(\sigma_a^2) + g_{2i}(\sigma_a^2) + g_{3i}(\sigma_a^2) + o(L^{-1}),\qquad(6.136)$$
where $g_{3i}(\sigma_a^2) = (1-\gamma_i)^2\big(\sigma_a^2 + \frac{\varphi a_i}{N_i}\big)^{-1}\bar V(\sigma_a^2)$ and $\bar V(\sigma_a^2)$ is the asymptotic variance of $\hat\sigma_a^2$.
The first term in (6.136) is of order O(1), while the others are of order $O(L^{-1})$. If we estimate the mean squared error in (6.136) by substituting $\hat\sigma_a^2$ for $\sigma_a^2$, we obtain an estimate with a bias of $O(L^{-1})$, because $\mathcal{E}\big(g_{1i}(\hat\sigma_a^2) - g_{1i}(\sigma_a^2)\big)$ is of this order. To correct for this bias, Prasad and Rao (1990) suggested the corrected estimator
$$\mathrm{MSE}\big(\hat\psi_{(i)}\big) = g_{1i}(\hat\sigma_a^2) + g_{2i}(\hat\sigma_a^2) + 2g_{3i}(\hat\sigma_a^2).\qquad(6.137)$$
Other possible corrections have been discussed by Singh et al. (1993).
For the general class of mean function estimation problems, alternatives to this frequentist approach are provided by a Bayesian framework, wherein appropriate prior distributions are assumed for $\beta$ and the variance parameters. Then $\beta$, $\alpha$ and $\psi$ all have similar status, and inferences about them come from their posterior distributions, given the data. If this program is carried out using a full hierarchical Bayes approach, the posterior distributions require significant computational power for their calculation, but this is less and less of an impediment to implementing the method. It has been shown in empirical studies (Ghosh and Rao, 1994; Singh et al., 1993) that the hierarchical Bayes approach and EBLUP properly corrected give comparable results.
(6.138)

$$\mathcal{E}(Y_j\mid\alpha_{(r)}) = \mu_j(\beta,\alpha) = h(\psi_j) = b'(\psi_j) = \frac{e^{\psi_j}}{1+e^{\psi_j}};\qquad(6.139)$$
$$\mathrm{Var}(Y_j\mid\alpha_{(r)}) = h'(\psi_j) = \frac{e^{\psi_j}}{(1+e^{\psi_j})^2};$$
and $\psi_j$ is of the form (6.87) if we take $z_{jr}$ to be 1 for $j\in B_r$ and 0 otherwise. If $\alpha$ denotes the $L\times 1$ vector of the random effects, we will assume that

(6.140)

and that $\sigma^2$ is unknown.
As indicated, we will consider a variety of sampling schemes with design probabilities independent of the parameters and the variates.

Sample is dispersed

Suppose first that the sample is taken by simple random sampling, or some other scheme which leads to samples which are dispersed, in the sense that their units may be assumed to belong to different PSUs. Then the $Y_j$ will be independent, with
$$\mathcal{E}(Y_j) = m_j(\beta,\sigma^2) = \mathcal{E}\big(\mu_j(\beta,\alpha)\big)\qquad(6.141)$$
and variance

(6.142)
here $\beta^*$ is the analogue of $\beta_0^*$ and $\beta_1^*$ in (6.108). Using for sample-based estimating functions the same notation as for the population-based estimating functions in Section 6.4, we can write down the score function system
$$S_{\beta^*} = \sum_{j\in s}\big(Y_j - \mu_j(\beta^*)\big)x_j = 0.\qquad(6.144)$$
The point estimate $\hat\beta^*$ will be the solution of (6.144). Confidence regions for $\beta^*$ could be based on
$$S_{\beta^*}^TJ_{\beta^*}^{-1}S_{\beta^*} \approx \chi^2_{(p)}\qquad(6.145)$$
or
$$S_{\beta^*}^TJ_{\hat\beta^*}^{-1}S_{\beta^*} \approx \chi^2_{(p)},\qquad(6.146)$$
where

(6.147)
Alternatively, the approximate normality of $\hat\beta^*$ could be used along with the fact that the model covariance matrix of $\hat\beta^*$ has the approximation
$$\mathrm{Cov}(\hat\beta^*) \approx \Big[\sum_{j\in s}\frac{e^{x_j^T\beta^*}}{(1+e^{x_j^T\beta^*})^2}\,x_jx_j^T\Big]^{-1}.\qquad(6.148)$$
Sample is two-stage

Now suppose that the sample has been taken in two stages, corresponding to the structure in the population. Let $s_B$ be the first-stage sample of PSU labels, and for each r in $s_B$ let $s_r$ be the sample taken from $B_r$. Now the sampled $Y_j$ are no longer independent unconditionally.
The approach through joint estimation of $\beta$ and $\alpha$ can be applied as in Section 6.4.7. If $\alpha$ were known, the conditional score function system for $\beta$ would be
$$\Phi_1(\beta,\alpha) = \sum_{r\in s_B}\sum_{j\in s_r}\big(Y_j - \mu_j(\beta,\alpha)\big)x_j = 0.\qquad(6.149)$$
(6.150)

$$\cdots + \sum_{r\in s_B}\frac{\alpha_{(r)}}{\sigma^2}\,B_r(\beta,\sigma^2),\qquad(6.151)$$
where

(6.152)

$$V_{\beta,\sigma^2} = \sum_{r\in s_B}\sum_{j\in s_r}\mathcal{E}h'(\psi_j)\,x_jx_j^T - \sum_{r\in s_B}\Big(\frac{1}{\sigma^2} + \sum_{k\in s_r}\mathcal{E}h'(\psi_k)\Big)^{-1}B_rB_r^T.\qquad(6.153)$$
If $\hat\sigma^2$ is a consistent estimator of $\sigma^2$, then confidence regions for $\beta$ might be based as in (6.97) on

(6.154)

and

(6.156)

where $\hat\sigma^2$ is a consistent estimate of $\sigma^2$. To the degree that the covariance matrix of $\Phi_m^*$ is approximately $V_{\beta,\sigma^2}$ of (6.153), confidence regions for $\beta$ could be based on
$$\Phi_m^{*T}(\beta,\hat\sigma^2)\,V_{\beta,\hat\sigma^2}^{-1}\,\Phi_m^*(\beta,\hat\sigma^2) \approx \chi^2_{(p)}.\qquad(6.157)$$
The $Y_j$ and $Y_k$ are uncorrelated if j and k are in different PSUs. When j and k are both in $B_r$, $Y_j$ and $Y_k$ are correlated. If our two-stage model holds, with random effect variance $\sigma^2$ relatively small, we may look to the form of (6.156), and use as estimating system for $\beta^*$
$$\hat S_{\beta^*} = \sum_{j\in s}\frac{Y_j - \mu_j(\beta^*)}{\pi_j}\,x_j = 0,$$
where $\hat S_{\beta^*}$ is unbiased in the sense that $E_p\hat S_{\beta^*} = S_{\beta^*}$. The solution $\hat\beta_s^*$ of $\hat S_{\beta^*} = 0$ estimates $\beta^*$. Also,
$$\mathcal{E}\big(\hat S_{\beta^*}\hat S_{\beta^*}^T\big) = \sum_{j\in s}\frac{1}{\pi_j^2}\,\frac{e^{x_j^T\beta^*}}{(1+e^{x_j^T\beta^*})^2}\,x_jx_j^T.$$
An approximate covariance matrix for $\hat\beta^*$ is
$$H^{-1}(\beta^*)\,\mathcal{E}\big(\hat S_{\beta^*}\hat S_{\beta^*}^T\big)\,\big[H^T(\beta^*)\big]^{-1},\qquad(6.169)$$
where
$$H(\beta^*) = -\sum_{j\in s}\frac{1}{\pi_j}\,\frac{e^{x_j^T\beta^*}}{(1+e^{x_j^T\beta^*})^2}\,x_jx_j^T.\qquad(6.170)$$
On the assumption that the model is correct, the most relevant measure of uncertainty in $\hat\beta^*$ is (6.169) evaluated at $\hat\beta^*$, which is approximately $\mathcal{E}$-unbiased for $\mathcal{E}\big((\hat\beta^*-\beta^*)(\hat\beta^*-\beta^*)^T\big)$. It may be noted, however, that (6.169) is also $E_p\mathcal{E}$-unbiased for $E_p\mathcal{E}\big((\hat\beta^*-\beta^*)(\hat\beta^*-\beta^*)^T\big) = \mathcal{E}E_p\big((\hat\beta^*-\beta^*)(\hat\beta^*-\beta^*)^T\big)$.
On the other hand, if the model is not correct, we might be more interested in estimating $E_p\big((\hat\beta^*-\beta_N^*)(\hat\beta^*-\beta_N^*)^T\big)$, which will also be an approximately unbiased estimator of $\mathcal{E}E_p\big((\hat\beta^*-\beta^*)(\hat\beta^*-\beta^*)^T\big)$.

(6.171)
$$v(\hat S_{\beta^*}) = \frac{n}{n-1}\sum_{j\in s}\Big(\frac{y_j-\mu_j(\beta^*)}{\pi_j}\,x_j - \bar M_s\Big)\Big(\frac{y_j-\mu_j(\beta^*)}{\pi_j}\,x_j - \bar M_s\Big)^T,\qquad(6.173)$$
where $\bar M_s$ is the sample average of $\big((y_j-\mu_j(\beta^*))/\pi_j\big)x_j$. Confidence intervals for $\beta^*$ can be based on taking
$$\hat S_{\beta^*}^T\,v(\hat S_{\beta^*})^{-1}\,\hat S_{\beta^*} \approx \chi^2_{(p)},\qquad(6.174)$$
which is close to the uncentred form, since $\bar M_s$ is 0 at $\beta^* = \hat\beta^*$.
A detailed treatment of the kind of analysis in this section has been
given by Roberts et al. (1987).
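The following sketch, with simulated data and our own names throughout, assembles the pieces of this section: the $\pi$-weighted score is solved by a Newton-type iteration, and the covariance (6.169) is estimated with $v(\hat S_{\beta^*})$ of (6.173) in place of $\mathcal{E}(\hat S_{\beta^*}\hat S_{\beta^*}^T)$.

```python
# A hedged sketch of the pseudo maximum likelihood analysis: solve the
# weighted score sum_j (y_j - mu_j(beta)) x_j / pi_j = 0, then estimate
# the covariance of beta-hat by the sandwich H^{-1} v(S) H^{-1}.
import numpy as np

def weighted_logistic(X, y, w, n_iter=50):
    """Solve the weighted score equation by Newton-type iteration."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        H = X.T @ ((w * mu * (1 - mu))[:, None] * X)   # -dS/dbeta
        beta += np.linalg.solve(H, X.T @ (w * (y - mu)))
    return beta

def linearization_var(X, y, w, beta):
    """Sandwich covariance with v(S) as in (6.173)."""
    n = len(y)
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    U = (w * (y - mu))[:, None] * X                    # per-unit score terms
    Ubar = U.mean(axis=0)
    vS = n / (n - 1) * (U - Ubar).T @ (U - Ubar)
    H = X.T @ ((w * mu * (1 - mu))[:, None] * X)
    Hinv = np.linalg.inv(H)
    return Hinv @ vS @ Hinv

rng = np.random.default_rng(7)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.8 * X[:, 1])))).astype(float)
pi = rng.uniform(0.1, 0.9, size=n)                     # inclusion probabilities
beta = weighted_logistic(X, y, 1 / pi)
print(beta, np.sqrt(np.diag(linearization_var(X, y, 1 / pi, beta))))
```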
Sample is two-stage

Suppose that random effects are present, and that they are essentially cluster effects. Thus in the model (6.138)-(6.140), $\sigma^2$ is non-zero. The sampling design is two-stage, with PSUs coinciding with the clusters. At the population level, we have seen two approaches to the estimation of $\beta$, one through joint estimation of $\beta$ and $\alpha$, and one through a system for marginal moments. Each of these has a pseudolikelihood analogue at the sample level.
In the first approach, an estimating function system which corresponds to the population-level system (6.88) and (6.92) is
$$\hat\Phi_1^*(\beta,\alpha) = \sum_{r\in s_B}\frac{1}{\Pi_r}\sum_{j\in s_r}\frac{\big(Y_j-\mu_j(\beta,\alpha)\big)x_j}{\pi_{j|r}} = 0\qquad(6.175)$$
and

(6.176)
where $\Pi_r$ and $\pi_{j|r}$ are first- and second-stage inclusion probabilities respectively. The estimating function $\hat\Phi_1^*$ of (6.175) estimates $\Phi_1^*$ of

(6.177)

where

(6.178)
(6.179)

$$\sum_{r\in s_B}\frac{1}{\Pi_r^2}\,\hat\Phi_{1r}^{**}(\hat\beta,\hat\alpha,\hat\sigma^2)\,\hat\Phi_{1r}^{**T}(\hat\beta,\hat\alpha,\hat\sigma^2),\qquad(6.180)$$
where

(6.182)

$$\hat\Phi_m^*(\beta,\hat\sigma^2) = \sum_{r\in s_B}\sum_{j\in s_r}\frac{Y_j - m_j(\beta,\hat\sigma^2)}{\Pi_r\,\pi_{j|r}}\big(x_j - B_{r\pi}(\beta,\hat\sigma^2)\big),\qquad(6.183)$$
where $\hat\sigma^2$ is a sample-based estimate of $\sigma^2$, and $B_{r\pi}(\beta,\sigma^2)$ is given by (6.178). We could solve $\hat\Phi_m^*(\beta,\hat\sigma^2) = 0$ for $\beta$, and estimate the covariance matrix of $\hat\Phi_m^*$ by
$$\sum_{r\in s_B}\frac{1}{\Pi_r^2}\,\hat\Phi_{mr}^*(\hat\beta,\hat\sigma^2)\,\hat\Phi_{mr}^{*T}(\hat\beta,\hat\sigma^2).\qquad(6.184)$$
A similar approach can be devised for the estimation of $\beta^*$, the parameter of an approximate marginal logistic regression model.
(6.187)

where $Z_1,\dots,Z_p$ are independent N(0, 1) variates, and $\lambda_1,\dots,\lambda_p$ are the eigenvalues of $V_0^{-1}V_R$.
The distribution of D of (6.185) can be evaluated numerically, given the eigenvalues $\lambda_1,\dots,\lambda_p$. Alternatively, it can be approximated in terms of single chi-square variates. For example, since
$$ED = \sum_{m=1}^{p}\lambda_m = p\bar\lambda,$$
the variate $D_c = D/\bar\lambda$ will have the same mean as a $\chi^2_{(p)}$ distribution.
Since the variance of D is $2\sum_{m=1}^{p}\lambda_m^2$, a further adjustment can be made to match the second moment as well. Note that $\sum_{m=1}^{p}\lambda_m$ is the trace of $V_0^{-1}V_R$, and $\sum_{m=1}^{p}\lambda_m^2$ is the trace of the square of the same matrix.
To compute the corresponding adjusted versions of $X^2$ of (6.186), namely $X_c^2 = X^2/\bar\lambda$ or $X_{cc}^2 = X_c^2/(1+\hat a^2)$, we would need estimates $\hat\lambda_1,\dots,\hat\lambda_p$, and these can be computed from $\hat V_0^{-1}\hat V_R$, if $\hat V_R$ is available.
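A sketch of the adjustments just described; $V_0$ and $V_R$ below are hypothetical placeholder matrices, and we take $\hat a^2$ to be the squared coefficient of variation of the estimated eigenvalues, an assumption consistent with corrections of this type.

```python
# First- and second-moment adjustments of a chi-square statistic using the
# eigenvalues of V0^{-1} VR (working vs. design-based covariance matrices).
import numpy as np

def adjusted_statistics(X2, V0, VR):
    lam = np.linalg.eigvals(np.linalg.solve(V0, VR)).real
    lam_bar = lam.mean()                  # = trace(V0^{-1} VR) / p
    a2 = lam.var() / lam_bar**2           # squared CV of the eigenvalues
    Xc2 = X2 / lam_bar                    # first-order correction
    Xcc2 = Xc2 / (1.0 + a2)               # second-order correction
    return Xc2, Xcc2, lam

V0 = np.eye(3)
VR = np.diag([1.5, 2.0, 3.1])             # inflated design-based covariance
print(adjusted_statistics(X2=9.4, V0=V0, VR=VR))
```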
" " A
$$\sum_{t\in B}Y_t\Big/\sum_{t\in B}1\qquad\text{or}\qquad\int_BY_t\,dv(t)\Big/\int_Bdv(t),$$
where B is a fixed subset, typically small, of U, and v is a measure on U in the continuous case; or the 'global' means; and quantities such as
$$\frac{1}{2}\sum_{t\in U'}\big(Y_{t+h}-Y_t\big)^2\Big/\sum_{t\in U'}1,$$
where $U' = \{t\in U : t+h\in U\}$.
In later sections, we will look mainly at the mapping problem and
the estimation of means as in (7.3).
$$\rho_z(h)\qquad(7.14)$$

$$\rho_z(h) = \big(a_2\|h\|/2\big)^{\nu}\,2K_\nu\big(a_2\|h\|\big)\big/\Gamma(\nu),\qquad \nu > 0,\qquad(7.15)$$

(7.21)

$$r_z(h) = 1 - \tfrac{3}{2}\|h\|/a + \tfrac{1}{2}\|h\|^3/a^3 \quad\text{if } \|h\|\le a;\qquad = 0 \quad\text{if } \|h\| > a.\qquad(7.23)$$

$$\int_{-\infty}^{\infty}\big[\log(1+\lambda)\big]^a\,dF_z(\lambda) < \infty,\qquad(7.25)$$
where F is the spectral 'distribution function' given by

(7.27)
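For concreteness, here is a sketch of two standard isotropic correlation models of the kind catalogued above, in their conventional normalizations; given how garbled the source displays are, the exact constants in (7.15) and (7.23) should be checked against the original, and this code is only indicative.

```python
# Matern-type correlation (cf. (7.15)) and spherical correlation (cf. (7.23)),
# in standard normalizations; these are conventional forms, assumed here.
import numpy as np
from scipy.special import kv, gamma

def matern(h, a, nu):
    x = a * np.abs(np.asarray(h, dtype=float))
    rho = 2.0 ** (1 - nu) / gamma(nu) * x ** nu * kv(nu, np.maximum(x, 1e-12))
    return np.where(x > 0, rho, 1.0)      # correlation 1 at h = 0

def spherical(h, a):
    h = np.abs(np.asarray(h, dtype=float))
    rho = 1 - 1.5 * h / a + 0.5 * (h / a) ** 3
    return np.where(h <= a, rho, 0.0)     # vanishes beyond range a

h = np.linspace(0, 2, 5)
print(matern(h, a=3.0, nu=0.5))           # nu = 1/2 recovers exp(-a|h|)
print(spherical(h, a=1.5))
```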
interactions:
$$P\{\mu_t : t\in U\} = C(\theta)\exp\Big\{\sum_tA(\mu_t;\theta) + \sum_s\sum_t\Theta_{s,t}(\mu_s,\mu_t;\theta)\Big\},$$
where $\theta$ is a parameter, the function $\Theta$ is the Gibbs energy function, $C(\theta)$ is a normalizing constant, and the double sum is over all pairs (s, t) which are neighbours of each other.
Variations on models such as this can be surprisingly useful as prior
distributions for black and white or grey-scale images (see Qian and
Titterington, 1991, and references therein).
For applications in which we are concerned with the presence and lo-
cation of some substance or object in space, contained in a background
of some different material, more geometrically based models are often
assumed, and invariance of the model under translations and/or rotations is an important consideration. If we let the role of $\mu_t$ be taken
by the indicator of the substance or object in question, then we may
wish to model it as the indicator of a random closed set generated in
some physically plausible way. For example, for an object made up of
globular particles, we could think of the set A as consisting of spheres
or ellipses with randomly distributed centres and orientations, and prin-
cipal axis lengths coming from a certain size distribution. For deposits
of crystalline material, we could think of the substance as occupying
cells in space, such as convex polygons or polyhedra. These could be
defined by randomly positioned lines or planes, or by growth from ran-
domly distributed 'nuclei' to form Voronoi polyhedra (constant growth
simultaneously for all cells) or more complicated structures (Moran,
1972).
Stein, 1987; Owen, 1992; see also Yates, 1960). If $U = [0,1]^d$, U is partitioned into $m^d$ smaller hypercubes of edge length $1/m$. A Latin hypercube is an $m\times d$ matrix, of which each column is a permutation of $1,\dots,m$. The sampling scheme consists of randomly generating such a matrix, with entries $u_{jr}$. The jth sampled small hypercube is the one for which the rth defining edge is $\big(\frac{u_{jr}-1}{m},\frac{u_{jr}}{m}\big]$, $r = 1,\dots,d$. The jth sampled point is randomly or purposively selected within the jth sampled hypercube. The special case d = 2 yields Latin square sampling, an example of which is shown in Figure 7.2. Tang (1993) has shown how improved uniformity of a Latin hypercube sample results from constraining the random permutation matrices with background orthogonal arrays.
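A minimal sketch of the scheme just described: the columns of the defining matrix are independent random permutations of $1,\dots,m$, and each sampled point is drawn uniformly within its selected small hypercube; the implementation details are ours.

```python
# Latin hypercube sampling: one point per row band and per column band
# in each coordinate, as in the construction described above.
import numpy as np

def latin_hypercube(m, d, rng=None):
    rng = np.random.default_rng(rng)
    # column r is a random permutation of 1..m (the defining matrix)
    U = np.column_stack([rng.permutation(m) + 1 for _ in range(d)])
    # uniform position within each selected cell of edge length 1/m
    return (U - rng.uniform(size=(m, d))) / m

pts = latin_hypercube(m=8, d=2, rng=42)
print(pts)
```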
In some applications, samples are not finite sets, but are continuous
subsets of the continuum U. The science of stereology (Moran, 1972)
involves trying to infer the properties of a three-dimensional object
by measuring the properties of a randomly selected planar section or
a linear probe with random location and direction. Wildlife sampling
may involve counting all 'sightings' along a set of randomly selected
transects through a wilderness area (Thompson, 1992). Some of the
more challenging problems of spatial sampling theory are found in
trying to use the data from these kinds of sampling to best advantage.
Figure 7.2 An example of Latin square sampling.

A note on asymptotics
$$Y_t = \mu_t + \epsilon_t,\qquad \mu_t = \sum_{l=1}^{p}\beta_lf_l(t) + \eta_t,\qquad(7.29)$$
where
$$\mathcal{E}\eta_t = 0,\qquad \mathcal{E}\epsilon_t = 0;$$

(7.32)
or
$$a = \Gamma_\delta^{-1}\gamma_{0\delta} + \Gamma_\delta^{-1}F\big(F^T\Gamma_\delta^{-1}F\big)^{-1}\big(f_0 - F^T\Gamma_\delta^{-1}\gamma_{0\delta}\big),\qquad(7.41)$$
where $C_\delta$ and $\Gamma_\delta$ have (j, k)th elements $C_\delta(t_j,t_k)$ and $\gamma_\delta(t_j,t_k)$ respectively, and $c_{0\delta}$ and $\gamma_{0\delta}$ have jth elements $C_\delta(t_j,t_0)$ and $\gamma_\delta(t_j,t_0)$ respectively. For these expressions to be valid we need to assume, and will assume, that all the matrix inverses in them exist.
The optimal predictor can be written as $Y_s^Ta$, where the jth element of $Y_s$ is $Y_{t_j}$. (Here the subscript s refers to the fact that $Y_s$ is the vector of sampled Y values.) The form (7.40), which is valid even without the assumption that $f_1(t)\equiv 1$, makes it clear that the predictor can be thought of as a basic predictor for the zero trend or no-constraint problem (that is, $Y_s^TC_\delta^{-1}c_{0\delta}$), corrected or calibrated to the constraints (7.39). The predictor can also be written as an estimated trend value plus an appropriate combination of estimated residuals at the sample points:

(7.42)

where
$$\hat\beta = \big(F^TC_\delta^{-1}F\big)^{-1}F^TC_\delta^{-1}Y_s\qquad(7.43)$$
is the generalized least squares estimator of $\beta = (\beta_1,\dots,\beta_p)^T$, and
$$Q_\delta = C_\delta^{-1} - C_\delta^{-1}F\big(F^TC_\delta^{-1}F\big)^{-1}F^TC_\delta^{-1}.$$
The predictive mean squared error of $Y_s^Ta$ can be shown to be
$$C_\delta(t_0,t_0) - c_{0\delta}^TC_\delta^{-1}c_{0\delta} + A^T\big(F^TC_\delta^{-1}F\big)^{-1}A,\qquad(7.44)$$
where

(7.45)

or to be

(7.46)

where
$$B = \big(f_0 - F^T\Gamma_\delta^{-1}\gamma_{0\delta}\big).\qquad(7.47)$$
Note that if $t_0$ is one of the sampled points $t_j$, then $C_\delta^{-1}c_{0\delta}$ and $\Gamma_\delta^{-1}\gamma_{0\delta}$ have jth element equal to 1 and the other elements 0; it follows easily that A and B are zero matrices, and that the optimal linear predictor of $Y_{t_0}$ is $Y_{t_j}$, which is equal to $Y_{t_0}$ itself. Thus the optimal linear predictor can be regarded in this simple sense as interpolating between the observations at sampled points. We shall see below, however, that the predictor as a function of $t_0$ is not necessarily continuous at the sampled points.
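As a numerical illustration of the predictor $Y_s^Ta$ and the mean squared error (7.44), here is a sketch using the covariance form of the weights; the exponential covariance function and the data are hypothetical choices of ours.

```python
# Universal kriging weights a = C^{-1}c0 + C^{-1}F(F'C^{-1}F)^{-1}(f0 - F'C^{-1}c0)
# and predictive MSE as in (7.44), for a one-dimensional illustration.
import numpy as np

def cov_exp(s, t, a=2.0):
    return np.exp(-a * np.abs(s - t))

def kriging_weights(ts, t0, F, f0, cov):
    C = cov(ts[:, None], ts[None, :])            # C_delta, n x n
    c0 = cov(ts, t0)                              # c_{0 delta}, n-vector
    Cinv_c0 = np.linalg.solve(C, c0)
    Cinv_F = np.linalg.solve(C, F)
    M = F.T @ Cinv_F                              # F' C^{-1} F
    A = f0 - F.T @ Cinv_c0
    a = Cinv_c0 + Cinv_F @ np.linalg.solve(M, A)  # calibrated weights
    mse = cov(t0, t0) - c0 @ Cinv_c0 + A @ np.linalg.solve(M, A)
    return a, mse

ts = np.array([0.1, 0.3, 0.5, 0.8])               # sampled points
ys = np.array([1.2, 0.7, 1.5, 2.0])
F = np.ones((4, 1))                               # constant trend f_1(t) = 1
a, mse = kriging_weights(ts, 0.6, F, np.array([1.0]), cov_exp)
print(ys @ a, mse)
```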
where

(7.59)

It is interesting to note that if $t_0$ is not one of the $t_j$, and if the noise process components are uncorrelated so that $C_\epsilon(t_j,t_0) = 0$, then b of (7.53), (7.54) is the same as a of (7.40), (7.41). From the form (7.55), we can see that if $C_\eta(s,t)$ and f(t) are continuous, then so is the optimal linear predictor of $\mu_{t_0}$ as a function of $t_0$. When $t_0$ is one of the sampled points, the predictor of $\mu_{t_0}$ will not in general be equal to $Y_{t_0}$. Thus in the case of uncorrelated noise, continuous f, and continuous $C_\eta$, the optimal predictor of $Y_{t_0}$ will be discontinuous at the sampled points, but continuous elsewhere, as a function of $t_0$.
$$\sum_{t\in U}\phi_tY_t\qquad\text{or}\qquad\int_U\phi_tY_t\,dv(t),$$
and
$f_\phi$ has lth component $\sum_{t\in U}\phi_tf_l(t)$ or $\int_U\phi_tf_l(t)\,dv(t)$;
$c_{\phi\delta}$ has jth component $\sum_{t\in U}\phi_tC_\delta(t_j,t)$ or $\int_U\phi_tC_\delta(t_j,t)\,dv(t)$;
$\gamma_{\phi\delta}$ has jth component $\sum_{t\in U}\phi_t\gamma_\delta(t_j,t)$ or $\int_U\phi_t\gamma_\delta(t_j,t)\,dv(t)$.
The predictive mean squared error of $Y_s^Ta_\phi$ is
$$C_\delta(\phi,\phi) - c_{\phi\delta}^TC_\delta^{-1}c_{\phi\delta} + A_\phi^T\big(F^TC_\delta^{-1}F\big)^{-1}A_\phi,\qquad(7.66)$$
where
$$C_\delta(\phi,\phi) = \sum_{t\in U}\sum_{s\in U}\phi_t\phi_sC_\delta(t,s)\qquad\text{or}\qquad\int_U\int_U\phi_t\phi_sC_\delta(t,s)\,dv(s)\,dv(t),\qquad(7.67)$$
(7.70)

$$\cdots + \int_U\int_U\phi_s\phi_tC_\delta(t,s)\,dv(t)\,dv(s),\qquad(7.71)$$
or as
$$\cdots - \int_U\int_U\phi_s\phi_t\gamma_\delta(t,s)\,dv(t)\,dv(s).\qquad(7.72)$$
Estimating these forms can in principle be handled through suitable estimates of the covariances $C_\delta(t,s)$ or the semivariograms $\gamma_\delta(t,s)$. In practice, since the sample covers only the spacings among $t_1,\dots,t_n$, this means assuming simple parametric forms for the covariances or semivariograms, and estimating the parameters from the sample. Techniques for doing this include using quadratic estimating functions, 'modified maximum likelihood' (where the likelihood is based on the joint distributions of contrasts so as to be free of dependence on $\beta$), or cross-validation methods. Since (7.71) and (7.72) involve covariances for members of U which are very close together, it is helpful for mean squared error estimation to have some clustering in the sample; this will obviously aid in the estimation of measurement variances and nugget effects. For detailed discussion of semivariogram estimation techniques see, for example, Stein (1990), Cressie (1993) and Laslett et al. (1987). With stationary covariance structures, a spectral approach to error estimation may be fruitful, since the variance of a sample mean from a regular sample is expressible as a functional of the spectral density (Stein, 1993).
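Fitting parametric forms starts in practice from an empirical estimate; the following sketch computes the classical method-of-moments semivariogram on distance bins. It is an illustration with made-up data, not an estimator prescribed in the text.

```python
# Empirical semivariogram for irregularly spaced one-dimensional data:
# average (Y_s - Y_t)^2 / 2 over pairs whose spacing falls in each bin.
import numpy as np

def empirical_semivariogram(t, y, bins):
    t = np.asarray(t); y = np.asarray(y)
    h = np.abs(t[:, None] - t[None, :])          # pairwise spacings
    sq = 0.5 * (y[:, None] - y[None, :]) ** 2    # half squared differences
    iu = np.triu_indices(len(t), k=1)            # each pair once
    h, sq = h[iu], sq[iu]
    gamma = np.full(len(bins) - 1, np.nan)
    for k in range(len(bins) - 1):
        m = (h >= bins[k]) & (h < bins[k + 1])
        if m.any():
            gamma[k] = sq[m].mean()
    return gamma

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 10, 60))
y = np.sin(t) + rng.normal(scale=0.2, size=60)   # smooth signal plus nugget
print(empirical_semivariogram(t, y, bins=np.linspace(0, 3, 7)))
```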
Intuitively, when the purpose is prediction of $\int_U\phi_tY_t\,dv(t)$, the choice of the family of deterministic trend components $f_l(t)$ is important. The mean function process was earlier expressed as
$$\mu_t = \sum_{l=1}^{p}\beta_lf_l(t) + \eta_t,\qquad(7.73)$$
while the response $Y_t$ was $\mu_t$ plus a noise term $\epsilon_t$. The more of $\mu_t$ that we can think of as being captured by (fixed or random) combinations of $\{f_l(t),\ l = 1,\dots,p\}$, the easier will be the assessment of prediction error, assuming p remains small relative to n.
This being said, the difficulty with the preceding approach to error estimation remains. The assumptions and simple parametric forms for the corresponding structures are very likely to be oversimplifications, and oversimplifications that matter. Robust estimation of predictive mean squared error is difficult when the sample is fixed, particularly if it is
Figure 7.3 Sites of soil pH measurement. Values at internal sites (○) are to be predicted from those at sampled sites (△).
Figure 7.4 The value at a sampled site (•) can be predicted from values at four diagonal neighbour sites (△), and prediction error noted.
by
$$\big(F^TC_\delta^{-1}F + \sigma^{-2}V^{-1}\big)^{-1}F^TC_\delta^{-1}Y_s,\qquad(7.74)$$
that $\mathcal{E}(Y_{t_0}\mid Y_s)$ is given by
$$f_0^T\,\mathcal{E}(\beta\mid Y_s) + Y_s^TQ_{\delta a}c_{0\delta},\qquad(7.75)$$
and that $\mathcal{E}(\mu_{t_0}\mid Y_s)$ is given by
$$f_0^T\,\mathcal{E}(\beta\mid Y_s) + Y_s^TQ_{\delta a}c_{0\eta},\qquad(7.76)$$
where
$$Q_{\delta a} = C_\delta^{-1} - C_\delta^{-1}F\big(F^TC_\delta^{-1}F + \sigma^{-2}V^{-1}\big)^{-1}F^TC_\delta^{-1}.$$
As $\sigma^2\to\infty$, so that the prior for $\beta$ becomes more and more diffuse, these posterior means approach the optimal linear unbiased predictors derived earlier, and provide alternative justification for them.
Like the optimal linear unbiased predictors, the predictors (7.75) and (7.76) are derived assuming known covariance structures. When these are unknown, a Bayesian approach would put a suitable prior distribution on the parameters of $C_\epsilon(t,s)$ and of $C_\eta(t,s)$, as well as on $\beta$. Again, the point estimators/predictors of $Y_{t_0}$, $\mu_{t_0}$ or $\int_U\phi_tY_t\,dv(t)$ might be their posterior means, and an appropriate measure of uncertainty
$$I = \int_0^1Y_t\,dt\qquad(7.78)$$
case of any design with one sample unit in each of the cells $[0,\frac{1}{n}], (\frac{1}{n},\frac{2}{n}],\dots,(\frac{n-1}{n},1]$, using the sample mean is equivalent to approximating $Y_t$ by the value $Y_{t_j}$ for all t in the cell of $t_j$, and then integrating. In particular, if $\{Y_t : t\in[0,1]\}$ has a continuous second derivative, and stratified random sampling with one unit per cell is used, it can be shown that the standard deviation of the sample mean $\bar Y_s$ is of order $O(n^{-1})$ as $n\to\infty$. Thus in a fixed-domain asymptotic sense, $\bar Y_s$ with stratified random sampling is more efficient than $\bar Y_s$ with simple random sampling; for the latter, the standard deviation is of order $O(n^{-1/2})$.
For slowly varying $Y_t$, it is intuitive that a better approximation to $Y_t$ would be obtained by continuous (e.g. linear) interpolation between the points of a centred systematic sample. For example, if $t_j = (j-1)/(n-1)$, $j = 1,\dots,n$, then a linear interpolation should be closer than a step-function interpolation to the function $Y_t$. The trapezoidal rule resulting from a linear interpolation would give as an estimate of I the function
$$\hat I_{tr} = \frac{1}{n-1}\Big(\frac{1}{2}Y_{t_1} + \sum_{j=2}^{n-1}Y_{t_j} + \frac{1}{2}Y_{t_n}\Big).\qquad(7.79)$$
$$\frac{Y_{t_1}}{2n} + \frac{1}{n}\Big(\frac{1}{2}Y_{t_1} + \sum_{j=2}^{n-1}Y_{t_j} + \frac{1}{2}Y_{t_n}\Big) + \frac{Y_{t_n}}{2n}\qquad(7.80)$$
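A sketch of (7.79) and (7.80) in code, checking on a smooth test function that the end-corrected form for a centred systematic sample reproduces the sample mean; the test function is an arbitrary choice of ours.

```python
# Trapezoidal estimators of I = integral of Y_t over [0, 1]:
# (7.79) for points t_j = (j-1)/(n-1), and the end-corrected (7.80)
# for a centred systematic sample t_j = (2j-1)/(2n).
import numpy as np

def I_tr(y):
    """Trapezoidal rule (7.79) on equally spaced points spanning [0, 1]."""
    n = len(y)
    return (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1]) / (n - 1)

def I_centred(y):
    """End-corrected trapezoid (7.80); algebraically equal to y.mean()."""
    n = len(y)
    return y[0] / (2 * n) \
        + (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1]) / n \
        + y[-1] / (2 * n)

f = lambda t: np.exp(t)               # integral over [0, 1] is e - 1
n = 20
print(I_tr(f(np.linspace(0, 1, n))))                          # endpoint design
print(I_centred(f((2 * np.arange(1, n + 1) - 1) / (2 * n))))  # centred design
```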
$$\hat I_{HT} = \frac{1}{n}\sum_{j=1}^{n}\frac{Y_{t_j}}{h(t_j)}.\qquad(7.81)$$

$$\int_0^{t_j}h(t)\,dt = \frac{j-1}{n-1},\qquad j = 1,\dots,n.\qquad(7.82)$$
$$\hat I_n(h) = \frac{1}{n-1}\Big[\frac{Y_{t_1}}{2h(t_1)} + \sum_{j=2}^{n-1}\frac{Y_{t_j}}{h(t_j)} + \frac{Y_{t_n}}{2h(t_n)}\Big],\qquad(7.83)$$
which itself will converge to I at rate $O(n^{-2})$ when $Y_t$ and h(t) are sufficiently smooth. In the case where $\{Y_t : t\in[0,1]\}$ is a mean-zero process with continuous covariance function $C(s,t) = \mathcal{E}(Y_sY_t)$, and has exactly K quadratic mean derivatives ($K = 0,1,2,\dots$), then under further regularity conditions, their improved estimator has predictive mean squared error of order $O(n^{-2K-2})$. Thus even when $\{Y_t : t\in[0,1]\}$ is a Wiener process, having no quadratic mean derivative, the error in the estimator, coinciding for $K = 0$ with $\hat I_{tr}$, will have model standard deviation $O(n^{-1})$.
sample is also appealing, however. Stein (1993) has shown that if $\{Y_t : t\in[0,1]^d\}$ is a stationary random field with spectral density $f(\lambda)$, and if $[0,1]^d$ is divided into $m^d$ cells with one sample point at the centre of each, then the mean squared error of a predictor of I can be made to be of order $O(m^{-p})$, where $d < p$ and $f(\lambda)$ is of order $|\lambda|^{-p}$ as $|\lambda|\to\infty$. The number p is an indicator of smoothness of the sample functions. The predictor used can be the sample mean for $p\le 4$, and is an edge-corrected sample mean for $p > 4$. Both are
model-unbiased under the stationarity assumption. Because they are analogous to the trapezoidal estimator (7.80) in one dimension, it may be conjectured that predictors derived from continuous interpolation in higher dimensions would have similar properties. See Laslett et al. (1987) for a discussion of continuous and smooth interpolators for d = 2.
Thus prediction (of $Y_{t_0}$ or of I) via continuous interpolation works well when the sample functions of $\{Y_t : t\in U\}$ are sufficiently smooth or slowly varying. Continuous interpolation may not be advisable to the same degree when the sample functions are likely to be rough or spiky. As we have seen, the function or surface defined by kriging prediction is discontinuous at the sample points. Laslett et al. (1987) have illustrated the superiority (for pointwise prediction) of kriging and linear spline smoothing over continuous interpolation methods, for grid sampling of a two-dimensional test population of soil pH values with an apparently non-zero nugget effect.
If the sample points are of necessity non-random and irregularly
spaced, the idea of integrating a sample function approximation, ob-
tained by interpolating or smoothing, takes on added importance. An
example where U is one-dimensional concerns the estimation of the
milk yield of a cow over a 305 day lactation cycle, from measure-
ments of yield taken at about eight irregularly spaced days in the cycle.
Bartlett (1986) compared the performance, over several test popula-
tions, of (i) an interpolation (trapezoidal) estimator with end corrections
and (ii) kriging estimators where the trend functions fitted included lin-
ear splines with two knots and combinations of gamma curves. Despite
the roughness of the population sample functions, the simple interpo-
lation estimator (traditionally used by Agriculture Canada) turned out
to be more efficient and robust as an estimator of total yield than the
kriging estimators. Essentially, the linear interpolation had greater flex-
ibility to follow the actual trend than either the linear spline or gamma
family of functions.
When d = 2, an approach sometimes taken is to define an 'area of influence' for each sampled point. One criterion is the mean squared error $\mathcal{E}(\hat I - I)^2$, where
$$I = \int_UY_t\,dt\qquad\text{and}\qquad \hat I = \sum_{j=1}^{n}A_jY_{t_j},$$
$A_j$ being the area of the Voronoi polygon of $t_j$ with respect to the sample. They applied it to the determination of 16 points for monitoring NOx density in Kyoto, Japan.
More commonly, the problem of optimal choice of sample for pointwise prediction is considered. Sacks et al. (1989b) have discussed optimality criteria. The integrated mean squared error (IMSE) criterion is the minimization of an integral
$$\int_U\mathcal{E}\big(\hat Y_t - Y_t\big)^2\,dv(t),$$
Sacks and Schiller (1988) have shown how to find the points of sam-
ples of given size to minimize the IMSE or the MMSE for discrete
populations with two or more dimensions. Yfantis et al. (1987) have
applied an MMSE criterion to the comparison of regular grids in two
dimensions, for kriging prediction at centre points of grid regions; in
their study, as long as the nugget effect was small, the equilateral trian-
gular grid seemed most efficient, and was seen to give the most reliable
estimate of the semivariogram; for large nugget effect, a hexagonal grid
appeared to be most efficient.
For continuous populations, Sacks et al. (1989b) have stated a preference for the IMSE criterion since the MMSE criterion 'involves a d-dimensional optimization of a function with numerous local optima at every iteration of a given design-optimization algorithm'. Some specific problems have been solved with the IMSE criterion. For example, Sacks et al. (1989a) found sample points in two dimensions minimizing the IMSE for
$$Y_t = \mu_t + \epsilon_t,$$
where $\mu_t = \sum_{l=1}^{p}\beta_lf_l(t)$ and the error process was an Ornstein-Uhlenbeck process. Su and Cambanis (1993) considered the more general case in one dimension of a non-random $\mu_t$ and an error process $\{\epsilon_t : t\in[0,1]\}$ with zero mean and known covariance function. The mean function $\{\mu_t : t\in[0,1]\}$ was either known, or of form
is Madow's design. That is, the optimal design selects r from a distribution uniform on [0, 1/n], and includes j in the sample s if, for some $l\in\{0,\dots,n-1\}$, j is the least integer such that
$$r + \frac{l}{n} \le \sum_{k=1}^{j}a_k\big/A.\qquad(7.88)$$
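Before turning to the proof, here is a sketch of the selection rule (7.88); the size measures $a_k$ are hypothetical, and the least-index condition is implemented with a standard searchsorted call.

```python
# Madow's systematic PPS procedure: draw r uniform on [0, 1/n], and for
# each l = 0, ..., n-1 include the least j whose cumulative size measure
# sum_{k <= j} a_k / A reaches r + l/n.
import numpy as np

def madow_sample(a, n, rng=None):
    rng = np.random.default_rng(rng)
    cum = np.cumsum(a) / np.sum(a)           # cumulative a_k / A
    r = rng.uniform(0.0, 1.0 / n)
    targets = r + np.arange(n) / n           # r + l/n, l = 0, ..., n-1
    # least index j (0-based) with cum[j] >= target, for each target
    return np.searchsorted(cum, targets, side='left')

a = np.array([5, 1, 3, 2, 6, 4, 2, 7], dtype=float)
print(madow_sample(a, n=3, rng=0))           # labels of the sampled units
```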
Proof: The first step in Hájek's proof is to show that, for every j and k,
$$\rho(|k-j|) = \sum_{u=1}^{\infty}\sum_{w=1}^{N+u-1}\Delta_uq_{juw}q_{kuw},\qquad(7.89)$$
where
$$q_{juw} = 1 \ \text{ if } w-u < j\le w;\qquad = 0 \ \text{ otherwise},\qquad(7.90)$$
and $\Delta_u$ is given by (7.86). We leave this step to the reader. Now since
$$e - T_y = \frac{A}{n}\sum_{j=1}^{N}\frac{y_j}{a_j}\,(I_{js}-\pi_j),\qquad(7.91)$$
where
$$I_{js} = 1 \ \text{ if } j\in s;\qquad = 0 \ \text{ if } j\notin s,\qquad(7.92)$$
we have $\mathcal{E}(e - T_y) = (A\mu/n)\sum_{j=1}^{N}(I_{js}-\pi_j) = 0$ for any sample of size n. Thus
$$\mathcal{E}E_p(e-T_y)^2 = E_p\mathcal{E}(e-T_y)^2 = E_p\mathrm{Var}(e-T_y),$$
between sample points, and Matérn noted that the triangular grid will therefore be best whenever we have a rapidly decreasing stationary isotropic correlation function (e.g. as in (7.96) with large a).
Analogues of Hájek's result in two dimensions have been formulated and proved by Bellhouse (1977). We can think of the 'systematic' designs he considered for a sample of size $n = m_1m_2$ as dividing a rectangular U,

$$\sum_{j\in s}j = \frac{n}{2}(N+1)\qquad(7.97)$$
JES
for every sample s with p(s) > O. For these designs the expansion
estimator NYs of Ty is model-unbiased as well as design-unbiased in
the presence of a linear trend, such that
Choosing from among these designs for efficiency and ease of error
estimation is an interesting theoretical problem.
York, 506-527.
Cox, D. R. (1972) Regression models and life-tables (with discussion). Journal
of the Royal Statistical Society B 34, 187-220.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. Chapman & Hall,
London.
Cox, D. R. and Snell, E. J. (1979) On sampling and the estimation of rare errors. Biometrika 66, 125-132.
Cramer, H. (1946) Mathematical Methods of Statistics. Princeton University
Press, Princeton.
Cramer, H. and Leadbetter, M. R. (1967) Stationary and Related Stochastic
Processes. Wiley, New York.
Cressie, N. A. C. (1993) Statistics for Spatial Data, revised edition. Wiley, New
York.
Daniels, H. E. (1954) Saddlepoint approximations in statistics. Annals of Mathematical Statistics 25, 631-650.
Daniels, H. E. (1987) Tail probability approximations. International Statistical
Review 55, 37-48.
De Finetti, B. (1931) Funzione caratteristica di un fenomeno aleatorio. Memorie
dell'Accademia Nazionale dei Lincei Ser. 6, 4, 251-299.
Dean, C. B. (1991) Estimating equations for mixed Poisson models. In: V.
P. Godambe (ed.) Estimating Functions. Oxford University Press, Oxford,
35-46.
Deming, W. E. (1956) On simplification of sampling design through replica-
tion with equal probabilities and without stages. Journal of the American
Statistical Association 51, 24-53.
Deming, W. E. and Stephan, F. F. (1940) On a least squares adjustment of
a sampled frequency table when the expected marginal totals are known.
Annals of Mathematical Statistics 11, 427-444.
Deville, J. C. and Särndal, C. E. (1992) Calibration estimators and generalized raking techniques in survey sampling. Journal of the American Statistical Association 87, 376-382.
Devine, O. J., Louis, T. A. and Halloran, M. E. (1994) Empirical Bayes esti-
mators for spatially correlated incidence rates. Environmetrics 5, 381-398.
DiCiccio, T. J. and Martin, M. A. (1991) Approximations of marginal tail probabilities for a class of smooth functions with applications to Bayesian and conditional inference. Biometrika 78, 891-902.
DiCiccio, T. J. and Romano, J. P. (1988) A review of bootstrap confidence intervals. Journal of the Royal Statistical Society B 50, 338-354.
DiCiccio, T. J. and Romano, J. P. (1990) Nonparametric confidence limits by resampling methods and least favorable families. International Statistical Review 58, 59-76.
Durbin, J. (1953) Some results in sampling theory when the units are selected
with unequal probabilities. Journal of the Royal Statistical Society B 15,
262-269.
Høst, G., Omre, H. and Switzer, P. (1995) Spatial interpolation errors for monitoring data. Journal of the American Statistical Association 90, 853-861.
Johnson, N. L. and Kotz, S. (1970) Continuous Univariate Distributions.
Houghton Mifflin, Boston.
Jones, H. L. (1974) Jackknife estimation of functions of strata means.
Biometrika 61, 343-348.
Kalbfleisch, J. D. and Lawless, J. F. (1988a) Estimation of reliability in field-performance studies. Technometrics 30, 365-378.
Kalbfleisch, J. D. and Lawless, J. F. (1988b) Likelihood analysis of multi-state models for disease incidence and mortality. Statistics in Medicine 7, 149-160.
Kalbfleisch, J. D. and Sprott, D. A. (1969) Applications of likelihood and fiducial probability to sampling finite populations. In: N. L. Johnson and H. Smith (eds) New Developments in Survey Sampling. Wiley, New York, 358-389.
Kass, R. E. and Steffey, D. (1989) Approximate Bayesian inference in condi-
tionally independent hierarchical models (parametric empirical Bayes mod-
els). Journal of the American Statistical Association 84, 717-726.
Kendall, M. G., Stuart, A. and Ord, J. K. (1983) The Advanced Theory of Statistics, Volume 3 (4th edition). Griffin, London.
Kish, L. (1965) Survey Sampling. Wiley, New York.
Kish, L. and Frankel, M. R. (1974) Inference from complex samples (with
discussion). Journal of the Royal Statistical Society B 36, 1-37.
Korn, E. L. and Graubard, B. I. (1991) A note on the large sample properties
of linearization, jackknife and balanced repeated replication methods for
stratified samples. Annals of Statistics 19, 2275--2279.
Kott, P. S. (1990) Estimating the conditional variance of a design consistent
regression estimator. Journal of Statistical Planning and Inference 24, 287-
296.
Kovar, J. G., Rao, J. N. K. and Wu, C. F. J. (1988) Bootstrap and other methods to measure error in survey estimates. Canadian Journal of Statistics 16, Supplement, 25-45.
Krewski, D. (1978) Jackknifing U -statistics in finite populations. Communica-
tions in Statistics - Theory and Methods A 7(1), 1-12.
Krewski, D. and Rao, J. N. K. (1981) Inference from stratified samples: properties of linearization, jackknife and balanced repeated replication methods. Annals of Statistics 9, 1010-1019.
Kuk, A. Y. C. (1988) Estimation of distribution functions and medians under
sampling with unequal probabilities. Biometrika 75, 97-103.
Laird, N. M. and Louis, T. A. (1987) Empirical Bayes confidence intervals
based on bootstrap samples. Journal of the American Statistical Association
82, 739-757.
Laslett, G. M., McBratney, A. B., Pahl, P. J. and Hutchinson, M. F. (1987) Comparison of several spatial prediction methods for soil pH. Journal of
71-526.
Stein, M. L. (1987) Minimum norm quadratic estimation of spatial variograms.
Journal of the American Statistical Association 82, 765-772.
Stein, M. L. (1988) Asymptotically efficient prediction of a random field with
a misspecified covariance function. Annals of Statistics 16, 55-63.
Stein, M. L. (1990a) A comparison of generalized cross validation and modified
maximum likelihood for estimating the parameters of a stochastic process.
Annals of Statistics 18, 1139-1157.
Stein, M. L. (1990b) Bounds on the inefficiency of linear predictions using an
incorrect covariance function. Annals of Statistics 18, 1116-1138.
Stein, M. L. (1992) Prediction and inference for truncated spatial data. Journal
of Computational and Graphical Statistics 1, 91-110.
Stein, M. L. (1993) Asymptotic properties of centered systematic sampling for
predicting integrals of spatial processes. Annals of Applied Probability 3,
874-880.
Stroud, T. W. F. (1991) Hierarchical Bayes predictive means and variances
with application to sample survey inference. Communications in Statistics,
Theory and Methods 20, 13-36.
Stuart, A. and Ord, J. K. (1987) Kendall's Advanced Theory of Statistics, Vol. 1, 5th edition. Oxford University Press, New York.
Su, Y. and Cambanis, S. (1993) Sampling designs for estimation of a random
process. Stochastic Processes and their Applications 46, 47-89.
Sugden, R. A. (1982) Exchangeability and survey sampling inference. In G.
Koch and F. Spizzichino (eds) Exchangeability in Probability and Statistics.
North-Holland, Amsterdam, 321-330.
Sugden, R. A. (1993) Partial exchangeability and survey sampling inference.
Biometrika 80, 451-455.
Sukhatme, P. V., Sukhatme, B. V., Sukhatme, S. and Asok, C. (1984) Sampling
Theory of Surveys with Applications, 3rd edition. Iowa State University Press,
Ames.
Tamura, H. (1989) Statistical models and analysis in auditing. Panel on Non-
standard Mixtures of Distributions. Statistical Science 4, 2-33.
Tang, B. (1993) Orthogonal array-based Latin hypercubes. Journal of the Amer-
ican Statistical Association 88, 1392-1397.
Thomas, D. R. (1989) Simultaneous confidence intervals for proportions under
cluster sampling. Survey Methodology 15, 187-202.
Thompson, M. E. (1983) Labels. In N. L. Johnson and S. Kotz (eds) Encyclo-
pedia of Statistical Sciences 9. Wiley, New York, 427-430.
Thompson, M. E. (1984) Model and design correspondence in finite population
sampling. Journal of Statistical Planning and Inference 10, 323-334.
Thompson, S. K. (1992) Sampling. Wiley, New York.
Tillé, Y. (1996) An elimination procedure for unequal probability sampling without replacement. Biometrika 83, 238-241.
Tukey, J. W. (1958) Bias and confidence in not-quite large samples (abstract).
optimal allocation, 24
optimality, 166, 174, 176, 178, 262, 279, 281
orthogonal, 213
orthogonal arrays, 261
outliers, 195
overconditioning, 147
partial exchangeability, 151
quasiscore, 213
raking ratio estimator, 181
random effects, 173, 223, 230, 233
random permutation model, 150, 157
random systematic procedure, 40, 84
randomization roles, 201, 202
randomized sampling schemes, 285
variability, 7
variance estimation in multi-stage sampling, 34
Voronoi polygons, 279, 280
Voronoi polyhedra, 259
Yates-Grundy-Sen variance estimator, 16, 20