
A Solution to the Problem of Optimum Stratification in Multivariate Sampling

Author(s): Carlos M. Jarque


Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 30, No. 2
(1981), pp. 163-169
Published by: Blackwell Publishing for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2346387 .

Australian National University, Canberra
[Received July 1979. Final revision January 1981]

SUMMARY
In this paper the use of cluster analysis in stratification is considered. A clustering or stratification criterion function is suggested which is derived within the context of multivariate stratified sampling. A case study in which the States of Mexico are stratified with respect to nine socio-economic variables is presented.

Keywords: OPTIMUM STRATIFICATION; MULTIVARIATE SAMPLING; CLUSTER ANALYSIS
1. INTRODUCTION
Stratification may be used in surveys for various reasons: for example, for administration, to define domains of study, or to gain in the precision of the estimators. This paper is concerned with precision, and is aimed at defining a stratification criterion that provides efficient estimators and is numerically manageable. Stratification procedures are limited by the amount of information available before the survey, so the following discussion applies mainly to the stratification of first stage sampling units for which there is good prior information.

In this section some notation is introduced, followed by a presentation of existing results for the univariate stratification problem. In Section 2 a stratification procedure is suggested for the multivariate case. Section 3 contains a case study in which the States of Mexico are stratified by several procedures. The conclusions are summarized in Section 4.
It is assumed that a stratified sample of size $n$ is obtained from a population of size $N$ and that the parameters to be estimated are (without loss of generality) the $K$ population means $\theta_1, \ldots, \theta_K$ of a certain set of variables $X_1, X_2, \ldots, X_K$.

Define $L$ = number of strata; $N_h$ = size of stratum $h$; $\bar{x}_{k,h}$ = sample mean of variable $X_k$ for stratum $h$; $\hat{\theta}_k$ = estimator of $\theta_k$ using stratified sampling; and $\sigma^2_{k,h}$ = variance of variable $X_k$ within stratum $h$, defined with divisor $N_h - 1$. Let $W_h = N_h/N$ for $h = 1, 2, \ldots, L$. Then $\hat{\theta}_k$ is given by

$$\hat{\theta}_k = \sum_{h=1}^{L} W_h \bar{x}_{k,h}, \qquad k = 1, 2, \ldots, K, \tag{1}$$

and its variance by

$$V(\hat{\theta}_k) = \sum_{h=1}^{L} W_h^2 \, V(\bar{x}_{k,h}), \qquad k = 1, 2, \ldots, K. \tag{2}$$

Here only stratified random sampling with proportional allocation is considered and hence (2) reduces to

$$V(\hat{\theta}_k) = \frac{N - n}{N n} \sum_{h=1}^{L} W_h \sigma^2_{k,h}.$$

Some comments on other sample allocations are made in Section 4.
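As an illustration of the proportional-allocation variance above, the following minimal Python sketch evaluates it for one variable in a finite population. The function name, its arguments and the treatment of one-unit strata are assumptions made for this example; they do not come from the paper.

```python
from statistics import variance


def stratified_mean_variance(values, strata, n):
    """Variance of the stratified mean under proportional allocation.

    values: one number per population unit; strata: matching stratum labels;
    n: total sample size. Within-stratum variances use divisor N_h - 1.
    """
    N = len(values)
    groups = {}
    for x, h in zip(values, strata):
        groups.setdefault(h, []).append(x)
    total = 0.0
    for xs in groups.values():
        W_h = len(xs) / N
        # Assumption: a one-unit stratum contributes zero variance.
        s2_h = variance(xs) if len(xs) > 1 else 0.0
        total += W_h * s2_h
    return (N - n) / (N * n) * total


# Hypothetical population of six units split into two strata.
population = [1, 2, 3, 11, 12, 13]
print(stratified_mean_variance(population, [0, 0, 0, 1, 1, 1], n=2))
print(stratified_mean_variance(population, [0] * 6, n=2))  # a single stratum
```

Placing every unit in one stratum gives the simple random sampling variance, so the two printed values show the gain from stratifying this toy population.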


Omitting the subscript $k$, the problem of univariate optimum stratification may be stated as finding the strata boundaries $x_{(1)}, x_{(2)}, \ldots, x_{(L-1)}$ (subject to $x_{(0)} < x_{(1)} < \cdots < x_{(L-1)} < x_{(L)}$, where $x_{(0)} = \min\{x\}$ and $x_{(L)} = \max\{x\}$) such that the stratification criterion function $V(\hat{\theta})$ is minimized.

Dalenius (1957) assumes that the probability density function of $X$, $f(x)$, is continuous (for a comment on the implications of this assumption see Dalenius, 1957, p. 175) and shows that $x_{(1)}, \ldots, x_{(L-1)}$ must satisfy the simultaneous equations

$$x_{(h)} - \frac{\int_{x_{(h-1)}}^{x_{(h)}} x f(x)\,dx}{\int_{x_{(h-1)}}^{x_{(h)}} f(x)\,dx} = \frac{\int_{x_{(h)}}^{x_{(h+1)}} x f(x)\,dx}{\int_{x_{(h)}}^{x_{(h+1)}} f(x)\,dx} - x_{(h)}, \qquad h = 1, \ldots, L-1. \tag{3}$$

Therefore, the optimum stratification is such that stratum $h$ is formed by those population elements whose values of $X$ are between $x_{(h-1)}$ and $x_{(h)}$, where the numbers $x_{(h)}$ are obtained from (3) and may be found by an iterative procedure.
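The iterative procedure is not spelled out in the text. One possible version, sketched below in Python, reads equation (3) as the condition that each boundary lies midway between the means of the two adjacent strata and replaces the integrals by stratum means of a finite list of values. The function name, the equally spaced starting boundaries and the stopping rule are assumptions made for illustration; the iteration reaches a stationary point, which need not be the global minimum.

```python
def dalenius_boundaries(x, L, max_iter=100):
    """Iterate until each boundary is the midpoint of the adjacent stratum means."""
    xs = sorted(x)
    lo, hi = xs[0], xs[-1]
    # Starting point (an assumption): equally spaced boundaries over the range.
    b = [lo + (hi - lo) * h / L for h in range(1, L)]
    for _ in range(max_iter):
        edges = [float("-inf")] + b + [float("inf")]
        means = []
        for h in range(L):
            grp = [v for v in xs if edges[h] < v <= edges[h + 1]]
            if not grp:
                raise ValueError("empty stratum; try another start or fewer strata")
            means.append(sum(grp) / len(grp))
        new_b = [(means[h] + means[h + 1]) / 2 for h in range(L - 1)]
        if new_b == b:  # boundaries no longer move
            break
        b = new_b
    return b


# Hypothetical values with two well separated groups.
print(dalenius_boundaries([1, 2, 3, 11, 12, 13], L=2))
```

For a finite population this is the one-dimensional form of the K-means iteration discussed in Section 2.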
The previous result, or the result for two variables obtained by Ghosh (1963) and Sadasivan et al. (1978), may be applied when there are one or two variables of interest. Here, I look at the more general problem when $K$, the number of variables, is greater than two. Of course, stratification cannot be based on the variables to be studied, since their values are unknown. In what follows, the $X_1, \ldots, X_K$ are to be taken as "proxy variables" to the variables of interest.
2. MULTIVARIATE OPTIMUM STRATIFICATION

Techniques of multivariate analysis have been used in forming strata for multivariate surveys. See, for example, the publication of the Greater London Council (1971) for the use of clustering algorithms and Hagood et al. (1945) for the use of principal component analysis. These procedures attempt to define homogeneous groups according to given distance measures. Unfortunately it is not clear how these relate to survey aims. Desirable properties of an estimator are that it is unbiased and has small variance. For a stratified random sample the estimators $\hat{\theta}_1, \ldots, \hat{\theta}_K$ (see equation (1)) are respectively unbiased estimators of $\theta_1, \ldots, \theta_K$ for whatever stratification is used. Hence the main concern here will be with their variances. For the univariate problem, strata are chosen to minimize the variance of the estimator. Here there are $K$ such variances which will not generally be simultaneously minimized, so that the choice of strata must be based on some criterion $F(\cdot)$, the definition of which is now to be discussed.
Suppose that Dalenius's procedure is applied individually to each of the $K$ variables. This provides $K$ stratifications, say $S_1^*, S_2^*, \ldots, S_K^*$. Each one of these $S_k^*$ gives respectively a lower bound for the variance of $\hat{\theta}_k$. Denote these lower bounds by $V^*(\hat{\theta}_1), V^*(\hat{\theta}_2), \ldots, V^*(\hat{\theta}_K)$, and denote by $V_S(\hat{\theta}_k)$ the variance of $\hat{\theta}_k$ obtained when using stratification $S$. Unless the correlations between all pairs of variables are close to one, in absolute value, there will not exist a stratification $S$ that attains simultaneously these $K$ lower bounds. Define

$$d_k(S) = \frac{V_S(\hat{\theta}_k)}{V^*(\hat{\theta}_k)}, \qquad k = 1, 2, \ldots, K.$$

These may be regarded as the reciprocals of the efficiencies of a stratification $S$, in the sense that each $d_k(S)$ measures the closeness of $S$ to the optimum univariate stratification $S_k^*$. By definition $d_k(S) \geq 1$ and stratifications with low values of $d_k(S)$ would be preferred. To obtain an efficient stratification, a criterion suggested is to find $S^*$ such that it minimizes

$$F(S) = \sum_{k=1}^{K} d_k(S).$$

This function has desirable properties. First of all it is scale invariant. Secondly, it penalizes any variable for which there does not exist a stratification that provides a small variance of the corresponding estimator. For example, if there are two variables $X_1$ and $X_2$ measured in the same units, with equal variances, and $V^*(\hat{\theta}_1)$ is much less than $V^*(\hat{\theta}_2)$, variable $X_1$ would have more "importance" than $X_2$ in the minimization of $F(S)$. Thirdly, $F(S)$ is a quadratic function of the original data and so there are clustering algorithms available for its minimization. Two of these are Ward's clustering algorithm and the K-means clustering algorithm (see Ward, 1963; MacQueen, 1967).
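One way to read the third property is that, under proportional allocation, $F(S)$ is a sum of within-stratum variances of the variables after each $X_k$ has been divided by $\sqrt{V^*(\hat{\theta}_k)}$, so a K-means pass on the rescaled data targets approximately the same quantity (K-means uses within-stratum sums of squares, which differ from the variance form by factors $N_h/(N_h - 1)$). The Python sketch below follows that reading. The function name, the choice of the first $L$ units as starting centres and the handling of empty strata are assumptions of this example, not details given in the paper.

```python
def kmeans_strata(rows, v_star, L, max_iter=100):
    """Lloyd-type K-means on variables rescaled by 1/sqrt(V*_k).

    rows: list of K-vectors (one per population unit); v_star: the K
    univariate lower bounds; L: number of strata. Returns stratum labels.
    """
    K = len(v_star)
    z = [[r[k] / v_star[k] ** 0.5 for k in range(K)] for r in rows]
    # Assumption: the first L units serve as starting centres.
    centres = [list(p) for p in z[:L]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(L),
                key=lambda h: sum((p[k] - centres[h][k]) ** 2 for k in range(K)))
            for p in z
        ]
        if new_labels == labels:  # assignments are stable
            break
        labels = new_labels
        for h in range(L):
            members = [p for p, lab in zip(z, labels) if lab == h]
            if members:  # keep the old centre if a stratum empties
                centres[h] = [sum(p[k] for p in members) / len(members)
                              for k in range(K)]
    return labels


# Hypothetical data: five units, two variables, assumed lower bounds 1 and 4.
units = [[1, 10], [2, 11], [20, 50], [21, 52], [1.5, 10.5]]
print(kmeans_strata(units, v_star=[1.0, 4.0], L=2))
```

The result depends on the starting centres, so in practice several starts would be compared using the value of $F(S)$ itself.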
The more general form

$$\sum_{k=1}^{K} v_k d_k(S) \tag{4}$$

could be considered, where the $v_k$ are given weights. In addition, other functional forms involving $d_1(S), \ldots, d_K(S)$ may be used as alternative criteria functions. For example, one may be

$$D(S) = (d(S) - 1)' A (d(S) - 1),$$

where $d(S)$ and $1$ are $K \times 1$ vectors defined as $d(S) = (d_1(S), \ldots, d_K(S))'$ and $1 = (1, \ldots, 1)'$, and $A$ is a $K \times K$ matrix. When $A = I$ this is the square of the Euclidean distance of the vector $d(S)$ to the "ideal vector" of reciprocals of efficiencies $1$. For several exercises computed (e.g. see Table 4 in Section 3) the $S$ that minimized $F(S)$ also minimized $D(S)$ with $A = I$. Although this is not a general result, it encourages the use of $F(S)$. Another criterion may be to minimize the determinant of $\Sigma$ (generalized variance criterion) where $\Sigma$ is defined as the variance-covariance matrix of the estimators $\hat{\theta}_1, \ldots, \hat{\theta}_K$. $F(S)$ is a weighted trace of $\Sigma$ and here it is the preferred criterion due to its computational ease.
So far the discussion of the choice of a stratification has been based on efficiency measures. If the choice is based on $F(S)$ alone, $S^*$ will clearly always be preferred. However, it may be that in practice two stratifications $S_1$ and $S_2$ are such that $F(S_1) < F(S_2)$ and yet $S_2$ is preferred due to a more desirable set of values of the variances. The magnitudes of $F(S)$ (and $D(S)$) may indicate the stratifications that are the more appropriate ones in terms of efficiency. A procedure to arrive at an optimum stratification may be to choose the most desirable stratification among the efficient ones. Alternatively, as noted by one of the referees, if we had target variances $V_T(\hat{\theta}_k)$ we could set $v_k = V^*(\hat{\theta}_k)/V_T(\hat{\theta}_k)$ and choose to minimize (4).
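The criteria of this section are simple functions of the variances, as the short Python sketch below illustrates. The function name and argument layout are assumptions for this example. The input figures are the $L = 2$ variances and lower bounds for $S^*$ as read from Table 1, with decimal points placed where the printed layout indicates.

```python
def efficiency_criteria(var_S, var_star, weights=None):
    """Return (d, F, D): the ratios d_k(S), the weighted sum (4), and D(S) with A = I."""
    d = [vs / v0 for vs, v0 in zip(var_S, var_star)]
    if weights is None:
        weights = [1.0] * len(d)  # unit weights give F(S)
    F = sum(w * dk for w, dk in zip(weights, d))
    D = sum((dk - 1.0) ** 2 for dk in d)
    return d, F, D


# S* with L = 2, as read from Table 1 (variances, then lower bounds).
var_S = [1.21, 0.141, 2.59, 4.06, 3.95, 10.67, 3.55, 0.16, 0.04]
var_star = [0.41, 0.056, 1.19, 1.97, 3.18, 5.0, 1.55, 0.13, 0.016]
d, F, D = efficiency_criteria(var_S, var_star)
print(round(F), round(D))  # prints: 19 14
```

The rounded values agree with the entries reported for $S^*$ and $L = 2$ in Table 4. Passing `weights` equal to $V^*(\hat{\theta}_k)/V_T(\hat{\theta}_k)$ would give the target-variance version of (4).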

3. CASE STUDY

The problem of strata construction may be seen as a classification problem and this has led authors to consider the use of clustering algorithms (e.g. Green et al., 1967; Golder et al., 1973). These algorithms have been devised for special problems, and it is not clear why they should work in different contexts. The discussion of Section 2 suggests that the use of Ward's clustering algorithm or the K-means clustering algorithm, appropriately applied, should yield sensible stratifications. In this section, a case study is presented. The procedure described in Section 2 is applied together with other stratification procedures (some of which use clustering algorithms) for the purpose of efficiency comparison.
It is assumed that the aim is to estimate the means $\theta_1, \theta_2, \ldots, \theta_9$ of nine socio-economic variables $X_1, X_2, \ldots, X_9$ (for their definition see Table 3) using a stratified random sample. The 31 States of Mexico are stratified with respect to these variables. The data refer to the year 1974 and are found in IEPES (1976); the sample size is considered to be $n = 12$. The K-means algorithm was used to minimize $F(S)$, thus providing $S^*$. The resulting variances of the estimators of the means are presented in Table 1 for a number of strata $L$ = 2, 3, 4 and 5. In the table, the row marked RS refers to the variances obtained using simple random sampling, and the numbers in brackets refer to the corresponding lower bounds (e.g. for $L = 4$, $V_S(\hat{\theta}_5) \geq 1.1$ for all $S$).
Frequently in survey practice, stratification is done with respect to a variable that is considered as the main indicator of the variables of interest. Table 2 presents (for $L = 5$) the variances obtained with $S^*$ and those that would be obtained if $X_1$ was considered as "main indicator" of $X_1, \ldots, X_9$ and Dalenius's optimum stratification procedure was applied to $f(X_1)$.

TABLE 1

Variances of the estimators of the means using stratification S* (with lower bounds in brackets)

         V(θ̂1)    V(θ̂2)    V(θ̂3)    V(θ̂4)    V(θ̂5)    V(θ̂6)    V(θ̂7)    V(θ̂8)    V(θ̂9)
RS       1.23     0.144    3.68     5.76     9.82     11.13    5.00     0.35     0.08
L = 2    1.21     0.141    2.59     4.06     3.95     10.67    3.55     0.16     0.04
         (0.41)   (0.056)  (1.19)   (1.97)   (3.18)   (5.0)    (1.55)   (0.13)   (0.016)
L = 3    1.16     0.13     2.81     1.91     4.11     9.25     2.34     0.16     0.04
         (0.22)   (0.03)   (0.56)   (0.95)   (2.0)    (2.6)    (0.53)   (0.06)   (0.01)
L = 4    0.72     0.11     2.59     1.81     4.4      10.5     1.8      0.19     0.06
         (0.10)   (0.02)   (0.35)   (0.47)   (1.1)    (1.3)    (0.31)   (0.05)   (0.005)
L = 5    1.23     0.08     1.51     2.01     2.69     8.03     1.87     0.09     0.06
         (0.074)  (0.011)  (0.25)   (0.33)   (0.51)   (0.76)   (0.18)   (0.033)  (0.003)

TABLE 2

Variances obtained with stratifications S*, S1*, S4* and S9* (L = 5)

Parameter   θ1       θ2      θ3     θ4      θ5      θ6      θ7     θ8      θ9
S*          1.2      0.08    1.5    2.0     2.7     8.0     1.9    0.09    0.06
S1*         (0.074)  0.16    3.7    4.7     10.0    12.2    6.7    0.35    0.09
S4*         1.1      0.16    3.6    0.33    5.7     14.0    5.7    0.25    0.07
S9*         1.13     0.12    3.5    5.0     7.5     12.8    3.6    0.28    0.003

Similarly for $X_4$ and $X_9$ (although $S_j^*$ was computed for $j = 1, \ldots, 9$, here, to preserve space, only the numerical results for $j$ = 1, 4 and 9 are presented). The variances using $S_1^*$ are nearly twice those obtained by using $S^*$, except for $\theta_1$, in which case (by definition) $S_1^*$ must be better. Similar results were obtained, in fact, for $S_j^*$ with $j = 2, \ldots, 9$. This clearly supports the argument that stratification with respect to a main indicator may be an inappropriate procedure.

For five strata, the stratification $S^*$ and some of the characteristics of each stratum are given in Table 3. For example, with respect to the national mean, stratum 1 has high birth and mortality rates and a very high proportion of rural population. Therefore it is not surprising to find that a low percentage of the population lives in houses that have electricity or drinking water (in rural areas living conditions are not as good as in urban areas). The proportion of the population that does not speak Spanish out of those that speak an Indian dialect, and the illiteracy rate, are both high. However, the average rate of unemployment is low.
Other stratification procedures were also used, and some results are reported here for comparing efficiencies (more extensive results are available from the author). The procedures reported (apart from $S^*$, $S_1^*$, $S_4^*$ and $S_9^*$) are:

Ward's clustering algorithm:
SW: Stratification obtained by applying Ward's clustering algorithm to the standardized data (variables measured from their means and divided by their standard deviation).

Single linkage clustering algorithm:
SL: Stratification obtained by applying the single linkage clustering algorithm (see Gower et al., 1969) to the standardized data.

Principal component analysis:
SPC: Stratification using the first principal component from the standardized data (explains 45 per cent of total variance). This procedure was proposed by Hagood et al. (1945).

TABLE 3

Strata and national means

Variable definitions:
X1: Birth rate (per 1000 inhabitants)
X2: Mortality rate (per 1000 inhabitants)
X3: Percentage of rural population (percentage of population that live in towns of less than 500 inhabitants)
X4: Percentage of population with drinking water in household
X5: Percentage of population that live in houses with electricity
X6: Percentage of population that do not speak Spanish out of those that speak an Indian dialect
X7: Percentage of population that is illiterate out of those that are over 10 years of age
X8: Percentage of population with post-primary education out of those over 6 years of age
X9: Rate of unemployment (unemployed over population over 12 years of age)

                    Stratum means                  National
Variable     1       2       3       4       5     mean
X1           46      47      46      43      41    45.5
X2           8.4     5.6     8.2     7       6     7.3
X3           27      24.6    9       19      9     21.1
X4           ...     ...     ...     ...     ...   34.1
X5           41      49      62      65      80    53.7
X6           37      18      10      28      20    26
X7           36      20      25      15      11    24.8
X8           ...     ...     ...     ...     ...   ...
X9           ...     ...     ...     ...     ...   ...

Stratum 1: Chiapas, Guanajuato, Guerrero, Hidalgo, Michoacan, Oaxaca, Puebla, Queretaro, San Luis Potosi, Tabasco, Zacatecas
Stratum 2: Baja California Sur, Campeche, Durango, Nayarit, Quintana Roo, Sinaloa, Veracruz
Stratum 3: Edo de Mexico, Morelos, Tlaxcala, Yucatan
Stratum 4: Aguascalientes, Coahuila, Colima, Chihuahua, Jalisco, Sonora, Tamaulipas
Stratum 5: Baja California Norte, Nuevo León

SPCT: Stratification using the first principal component of the original data measured from their means and divided by their totals (explains 50 per cent of total variance).

Index stratification procedure:
SI: Stratification using a welfare index defined as the sum of the ranks but with each rank multiplied by (+1) or (-1) according to whether the attribute is desirable or not.
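The welfare index used for SI can be sketched in a few lines of Python. The function below only builds the index; how the index values were then cut into strata is not stated in the text, so that step is left out. The function name, the input layout and the rule for ties (ranks assigned in order of appearance) are assumptions of this example.

```python
def welfare_index(rows, desirable):
    """Sum of signed ranks: rank each variable across units, sign by desirability.

    rows: list of K-vectors (one per unit); desirable: K booleans, True when a
    high value of the attribute is desirable.
    """
    n, K = len(rows), len(desirable)
    index = [0] * n
    for k in range(K):
        # Units ordered from smallest to largest value of variable k.
        order = sorted(range(n), key=lambda i: rows[i][k])
        sign = 1 if desirable[k] else -1
        for rank, i in enumerate(order, start=1):
            index[i] += sign * rank
    return index


# Hypothetical data: three units, one desirable and one undesirable attribute.
print(welfare_index([[10, 5], [20, 3], [30, 1]], desirable=[True, False]))
# prints: [-2, 0, 2]
```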
By applying these procedures, a range of quite different stratifications was obtained. For each stratification the functions $F(S)$ and $D(S)$ (defined in Section 2) were evaluated (for $L$ = 2, 3, 4 and 5). The values obtained are given in Table 4. For $L = 2$, four stratification procedures give approximately the same value of $F(S)$ (e.g. $F(S^*) = F(\mathrm{SPC}) = 19$, $F(\mathrm{SPCT}) = 20$ and $F(\mathrm{SW}) = 21$). This is because, in this case, all four stratifications are effectively the same. For $L$ = 3, 4 and 5, however, the difference in the values of $F(S)$ corresponding to these stratification procedures becomes larger. For example, with $L = 5$, $F(S^*) = 79$ and $F(\mathrm{SPCT}) = 97$. It is also interesting to note that the single linkage clustering algorithm (SL) provides worse values of $F(S)$ than if no stratification was carried out. That is, SL gives larger values than those obtained using a simple random sample (see row marked RS). (A reason why this algorithm was so bad may be that elements are assigned hierarchically to the stratum of the closest element, forming a "tree", without regard to the stratum means.) This result shows that clustering algorithms should not be used indiscriminately in stratification.
4. CONCLUSION

In this paper a criterion function has been suggested in order to obtain efficient stratifications in the multivariate estimation problem. The function may be minimized using Ward's clustering algorithm or the K-means clustering algorithm. This procedure was used to stratify the Mexican States and it was found that the variances of the estimators were, in general, less than those obtained by using other procedures. In particular, optimum stratification with respect to a single variable was found inadequate. Also, it was found that the indiscriminate use of clustering algorithms may lead to stratifications that provide larger variances than if there was no stratification at all.

TABLE 4

Values of stratification criterion functions F(S) and D(S) (N = 31, n = 12, K = 9)

                     F(S)                            D(S)
S        L = 2   L = 3   L = 4   L = 5     L = 2   L = 3   L = 4   L = 5
S*       19      33      53      79        14      77      283     648
SW       21      34      56      81        19      81      297     673
SPC      19      34      60      91        16      92      371     934
SPCT     20      38      64      97        18      114     453     1074
SI       23      38      69      105       28      117     498     1202
S9*      25      44      80      117       30      181     719     1544
S4*      24      45      81      131       34      197     773     2194
S1*      27      54      93      150       45      300     1048    2965
RS       28      54      96      153       44      351     928     2500
SL       29      59      107     166       52      357     1237    3133
Although only stratified random sampling has been considered, the derivation of $F(S)$ for any other stratified sampling design may be obtained by the appropriate definition of $d_k(S)$ and by following the same argument as the one presented.

Two additional important aspects in stratification are the determination of the number of strata, $L$, and the sample allocation.

In the univariate estimation problem, under the assumption of a uniform distribution of the stratification variable over each stratum, it can be shown (see Cochran, 1963, p. 133) that the variance of the estimator would reduce at the rate $1/L^2$ with increasing $L$. Hence for $L$ greater than 6 there would be little reduction in terms of variance. For the multivariate estimation problem the function $I(L) = \sum_k V_S(\hat{\theta}_k : L)/V(\hat{\theta}_k : 1)$ may be used to determine $L$, where $V(\hat{\theta}_k : 1)$ is the variance of $\hat{\theta}_k$ using simple random sampling. For several of the stratification procedures used in the case study presented in Section 3, the values of $I(L)$ were computed. Although the numerical results are not presented here, it was found that for $L$ greater than 6 there was still a significant reduction in variance. This suggests that a value of $L$ greater than 6 may be justified in terms of precision gain in multivariate surveys.
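A small Python sketch of $I(L)$ and of the univariate $1/L^2$ benchmark is given below. The function names and the input layout (a mapping from $L$ to the list of $K$ variances) are assumptions made for illustration, and the numbers in the usage lines are hypothetical, not results from the case study.

```python
def reduction_index(var_by_L, var_srs):
    """I(L) = sum over k of V_S(theta_k : L) / V(theta_k : 1), for each L supplied."""
    return {
        L: sum(v / v0 for v, v0 in zip(vs, var_srs))
        for L, vs in var_by_L.items()
    }


def uniform_benchmark(max_L):
    """Relative variance 1/L^2 from the univariate uniform-distribution argument."""
    return {L: 1.0 / L ** 2 for L in range(1, max_L + 1)}


# Hypothetical variances for K = 2 variables.
var_srs = [2.0, 4.0]
var_by_L = {1: [2.0, 4.0], 2: [1.0, 1.0]}
print(reduction_index(var_by_L, var_srs))
print(uniform_benchmark(7)[6], uniform_benchmark(7)[7])
```

Under the benchmark, moving from six to seven strata changes the relative variance only from about 0.028 to about 0.020, which is the sense in which little is gained beyond $L = 6$ in the univariate case.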
The discussion here has been restricted to proportional allocation. It is also of interest to study the performance of $S^*$ (i.e. the stratification that minimizes $F(S)$) when one departs from this allocation. For all stratifications in the case study, the variances of the estimators were also computed under Neyman allocation for variable $X_1$. It was found that $S^*$ produced, in general, smaller variances than those corresponding to other stratifications. The same result was obtained when using Neyman allocation for $X_2, \ldots, X_9$. Although further study is required, this result suggests that $S^*$ may be useful even if one departs from proportional allocation. Kokan (1963) and Chatterjee (1968) consider the problem of sample allocation in multivariate surveys. They use mathematical programming to minimize the total sample size, subject to desired precision levels of the estimators. Their solution is based on a given stratification. Therefore, another solution to the stratification problem may be to follow a two-stage procedure. In the first stage, alternative stratifications are found (e.g. $S^*$) and in the second stage Kokan's solution is applied to each. The stratification that satisfies the desired precision levels with the minimum sample size may then be chosen as optimum.
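For reference, the Neyman allocation mentioned above assigns stratum sample sizes in proportion to $N_h \sigma_h$ for the chosen variable. The Python sketch below computes these sizes for one variable. The function name and arguments are assumptions of this example; the result is left as real numbers, and neither rounding nor the constraint $n_h \leq N_h$ is handled.

```python
from statistics import stdev


def neyman_allocation(values, strata, n):
    """Real-valued Neyman allocation: n_h = n * N_h * s_h / sum_j (N_j * s_j)."""
    groups = {}
    for x, h in zip(values, strata):
        groups.setdefault(h, []).append(x)
    # Assumption: a one-unit stratum is given zero standard deviation.
    weight = {
        h: len(xs) * (stdev(xs) if len(xs) > 1 else 0.0)
        for h, xs in groups.items()
    }
    total = sum(weight.values())
    if total == 0:
        raise ValueError("all strata have zero variance")
    return {h: n * w / total for h, w in weight.items()}


# Hypothetical population: the second stratum is ten times more variable.
print(neyman_allocation([1, 2, 3, 10, 20, 30], [0, 0, 0, 1, 1, 1], n=11))
```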
ACKNOWLEDGEMENTS

This paper contains results that form part of a research project submitted at The London School of Economics and Political Science in April 1978. The author wishes to express his gratitude to Professor A. Stuart for helpful comments. The constructive suggestions of the referees of the paper are also warmly appreciated. Any errors are the sole responsibility of the author.
REFERENCES
CHATTERJEE, S. (1968). Multivariate stratified surveys. J. Amer. Statist. Ass., 63, 530-534.
COCHRAN, W. G. (1963). Sampling Techniques, 2nd ed. New York and London: Wiley International.
DALENIUS, T. (1957). Sampling in Sweden. Contributions to the Methods and Theories of Sample Survey Practice. Stockholm: Almqvist and Wiksell.
GHOSH, S. P. (1963). Optimum stratification with two characters. Ann. Math. Statist., 34, 866-872.
GOLDER, P. A. and YEOMANS, K. A. (1973). The use of cluster analysis for stratification. Appl. Statist., 22, 213-219.
GOWER, J. C. and ROSS, G. J. S. (1969). Minimum spanning trees and single linkage cluster analysis. Appl. Statist., 18, 54-64.
GREATER LONDON COUNCIL (1971). Research Report No. 9. Classification of the London Boroughs.
GREEN, P. E., FRANK, R. E. and ROBINSON, P. J. (1967). Cluster analysis in test market selection. Manag. Sci., 13(8), 387-400.
HAGOOD, M. J. and BERNERT, E. H. (1945). Component indexes as a basis for stratification in sampling. J. Amer. Statist. Ass., 40, 330-341.
IEPES (1976). La Campaña Presidencial en Cifras. Jose Lopez Portillo.
KOKAN, A. R. (1963). Optimum allocation in multivariate surveys. J. R. Statist. Soc. A, 126, 557-565.
MACQUEEN, J. (1967). Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. and Prob. 1, 281-297. University of California Press.
SADASIVAN, G. and AGGARWAL, R. (1978). Optimum points of stratification in bi-variate populations. Sankhya, 40, C, 84-97.
WARD, J. H. (1963). Hierarchical grouping to optimise an objective function. J. Amer. Statist. Ass., 58, 236-244.
