THE MODIFIABLE AREAL UNIT PROBLEM
S. Openshaw
ISSN 0306-6142
ISBN 0 86094 134 5
S. Openshaw
Published by Geo Books, Norwich. Printed by Headley Brothers Ltd, Kent.
CATMOG - Concepts and Techniques in Modern Geography
CATMOG has been created to fill in a teaching need in the field of quantitative
methods in undergraduate geography courses. These texts are admirable guides for
teachers, yet cheap enough for student purchase as the basis of classwork. Each
book is written by an author currently working with the technique or concept he
describes.

CONCEPTS AND TECHNIQUES IN MODERN GEOGRAPHY No.38
THE MODIFIABLE AREAL UNIT PROBLEM

(ii) Non-geographical solutions
(iii) A traditional geographical solution
(iv) Towards a new methodology for spatial study
VI CONCLUSIONS
BIBLIOGRAPHY

ACKNOWLEDGEMENTS
Particular thanks are due to Dr A.C. Gatrell, Dr K. Jones, and two
anonymous referees for making many useful and extremely helpful
suggestions.

The usefulness of many forms of spatial study, quantitative or otherwise,
depends on the nature and intrinsic meaningfulness of the objects that
are under study. Geographers have a long tradition of studying data for
areal units; for example, spatial objects such as zones or places or towns
or regions. The problem is that ever since the demise of 'the region' as
the primary object of geographical study very little concern has been
expressed about the nature and definition of the spatial objects under study.
As Chapman (1977) put it 'Geography has consistently and dismally failed to
tackle its entitation problems, and in that more than anything else lies the
root of so many of its problems' (page 7). In short, insufficient thought is
given to precisely what it is that is being studied.

For many purposes the zones in a zoning system constitute the objects
or geographical individuals that are the basic units for the observation and
measurement of spatial phenomena. It is usual in a scientific experiment that
the definition of the objects of study should precede any attempts to measure
their characteristics. However, this is not the case with areal data where
the spatial objects only exist after data collected for one set of entities
are subjected to an arbitrary aggregation to produce a set of spatial units.
Consider an example about wheat and potato yields. Data for one set of
entities (farms or fields) can be aggregated to produce data for a set of
spatial entities (parishes or counties). In this instance spatial aggregation
is necessary in order to 'create' a relevant data set. As Yule and
Kendall (1950) put it '.. geographical areas chosen for the calculation of
crop yields are modifiable units, and necessarily so. Since it is impossible
(or at any rate agriculturally impracticable) to grow wheat and potatoes on
the same piece of ground simultaneously we must, to give our investigation
any meaning, consider an area containing both wheat and potatoes and this
area is modifiable at choice' (page 312). What they mean is that it is
necessary to use areal units that are larger than the individual field and
include both wheat and potatoes so that some measure of spatial association
can be computed. Obviously at the level of the individual field there is no
spatial association (assuming the fields are either all wheat or all potatoes)
and, therefore, the degree of spatial association depends on the nature of
the areal units that are used. The definition of these geographical objects
is arbitrary and (in theory) modifiable at choice; indeed, different researchers
may well use different sets of units. This process of defining or creating
areal units would be quite acceptable if it were performed using a
fixed set of rules, or so that there was some explicit geographically meaningful
basis for them. However, there are no rules for areal aggregation, no
standards, and no international conventions to guide the spatial aggregation
process. Quite simply, the areal units (zonal objects) used in many geographical
studies are arbitrary, modifiable, and subject to the whims and
fancies of whoever is doing, or did, the aggregating. It is most unfortunate
that there is no standard set of spatial units.
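The wheat and potato argument can be tried out numerically. The sketch below is only an illustration under invented assumptions: the strip of 24 fields, the crop pattern, the fertility gradient, and the function names `make_fields` and `association_by_unit_size` are all hypothetical and do not come from Yule and Kendall. Each field grows one crop only, so an association between wheat and potato output can be computed only after the fields are amalgamated into larger areal units, and the value obtained depends on how large those units are made.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def make_fields(n_fields=24, seed=0):
    """Hypothetical strip of fields; each grows wheat (W) or potatoes (P), never both."""
    rng = random.Random(seed)
    pattern = "WWPWPP"                      # invented crop rotation along the strip
    fields = []
    for i in range(n_fields):
        fertility = 1.0 + i / n_fields      # invented fertility gradient
        output = fertility * rng.uniform(0.8, 1.2)
        fields.append((pattern[i % len(pattern)], output))
    return fields


def association_by_unit_size(fields, sizes=(2, 3, 4, 6)):
    """Correlate wheat and potato output across areal units of several sizes."""
    results = {}
    for size in sizes:
        wheat, potatoes = [], []
        # Amalgamate consecutive (contiguous) fields into units of `size` fields.
        for start in range(0, len(fields), size):
            unit = fields[start:start + size]
            wheat.append(sum(out for crop, out in unit if crop == "W"))
            potatoes.append(sum(out for crop, out in unit if crop == "P"))
        results[size] = pearson(wheat, potatoes)
    return results
```

Calling `association_by_unit_size(make_fields())` returns one correlation per unit size; the individual values depend on the invented data, but they are not the same from one unit size to the next.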
Since any study region over which data are collected is continuous, it
follows that there will be a tremendously large number of different ways by
which it can be divided into non-overlapping areal units for the purpose of
spatial analysis. Viewed as a combinatorial problem, the number of different
zoning systems, each of m zones, to which data for n individuals can be
aggregated becomes incredibly large even for small values of n (Keane, 1975).
For example, there are approximately 10 12 " different aggregations of 1,000
objects into 20 groups. If the aggregation process is constrained so that
the groups consist of internally contiguous objects (i.e. all the objects
assigned to the same group are geographical neighbours) then this huge number
is reduced, but only by a few orders of magnitude. So even with the
imposition of contiguity constraints the combinatorial problem remains totally
unmanageable.

Consider an example based on census data. In Tyne and Wear County there
are about 1.1 million people and 300,000 households. The 1981 census uses
a set of about 2,800 enumeration districts to report the results. Consider
how many different sets of 2,800 zones could be used for reporting the census
characteristics of 300,000 households. Moreover, there are other huge
combinatorial explosions whenever a zoning system of 2,800 zones is reaggregated
to form other zoning systems with fewer zones; for example, the 258 zones used
for transportation modelling and planning. There is a tremendously large
number of alternative 258 zone aggregations that could be used, most (if not
all) of which will yield different results.

This, then, is the crux of the modifiable areal unit problem (MAUP).
There are a large number of different spatial objects that can be defined and
few, if any, sets of non-modifiable units. Whereas census data are collected
for essentially non-modifiable entities (people, households) they are reported
for arbitrary and modifiable areal units (enumeration districts, wards, local
authorities). The principal criteria used in the definition of these units
are the operational requirements of the census, local political considerations,
and government administration. As a result none of these census areas
has any intrinsic geographical meaning. Yet it is possible, indeed very
likely, that the results of any subsequent analyses depend on these
definitions. If the areal units or zones are arbitrary and modifiable, then the
value of any work based upon them must be in some doubt and may not possess
any validity independent of the units which are being studied.

The question is, does it matter? If you change the areal basis does it
have any really significant effect on the results? Do haphazard zoning
systems yield haphazard results? If they do, then what can be done about it?

Consider two more examples. The definition of enterprise zones was
restricted to areas with high levels of unemployment. Unemployment rates were
calculated for a set of statistical reporting units known as 'travel to work
areas' (TTWAs); for details see Coombes and Openshaw (1982). Unfortunately,
for a few areas of the country these particular areal units provide a poor
representation of labour markets and present a biased picture of levels of
unemployment. For example, for some obscure reason South Tyneside (in Tyne
and Wear County) was included in the same TTWA as Washington, Gateshead,
Jarrow, and parts of rural Northumberland. The effect was to mix areas of
very high unemployment with fairly prosperous rural areas which have no strong
journey to work links; the result was to reduce the apparent level of
unemployment on South Tyneside. The total June unemployment rates for the
period 1978-82 were 11.1, 10.7, 12.9, 17.2, 18.7. However, if a more
geographically meaningful definition of the South Tyneside labour market is used
(see Coombes et al, 1982) then the unemployment rates for South Tyneside
become 13.9, 13.3, 15.9, 20.1, 20.7; more than enough to justify an enterprise
zone. The unfortunate use of the 'wrong' set of areal units has therefore
deprived South Tyneside of considerable job opportunities.

A final example concerns the Parliamentary Boundary Commission which
reviews the boundaries of the 520 English constituencies every 15 years. This
is a pure exercise in modifying areal units. The task is to select an
appropriate amalgamation of wards to create constituencies of approximately
equal electoral population size, that conform as far as practicable to county
boundaries and London Boroughs, and that take into account local community
interests. The latest revisions (1983) were performed by manual means and
have been heavily criticised because of inconsistencies in application and
unequal constituency sizes. The average electorate is about 68,000 but it
varies from extremes of 24,000 (Newcastle Central) to 100,000 (Buckingham).
There are even larger discrepancies in neighbouring constituencies; 57,000 in
Finchley but 84,000 in Wood Green. The reason is simply that only a small
proportion of all alternative areal arrangements were identified (Johnston
and Rossiter, 1983). For example, the 26 wards in Camden can be aggregated
to form two constituencies in 878 different ways, with a maximum two percent
deviation from the mean size of 68,000 (Johnston, The Times, March 15th 1983).
If a political geographer had ward-level voting data, then these different
constituency definitions would yield a wide range of results.

If the reader is still unconvinced then he can attempt the following
experiment. Construct an artificial set of areal units, or use a map of a
few neighbouring local authorities. Assign some data to them. Compute either
a few statistics for each zone (eg rates) or an overall statistic (eg
correlation coefficient or mean). Now amalgamate a few zones which are
contiguous, recalculate your statistics for the aggregated data, and examine the
changes. Now try to amalgamate a few more zones with the aim of either
increasing or reducing the magnitude of the changes. Obviously this
experiment would be easier if a microcomputer was used. However, a few hours
experimentation will convince virtually anyone about the severity of the MAUP.
Quite simply, different aggregations yield different results but without any
systematic trends emerging that can be used for prediction or correction
purposes.

What is so surprising about the MAUP is that while geographers know of
its existence they readily assume, in the absence of any knowledge, that it
has no significant effect on their studies. An important reason for this
deliberate neglect is that the validity of many applications of quantitative
analyses of zonal data depends on the assumption that the MAUP does not exist
and that the spatial units under study are given, meaningful, and fixed.
Whilst these may be tolerable assumptions for a statistician, who may know
no better, it is hardly a satisfactory basis for the application and further
development of spatial analysis techniques in geography (Openshaw and Taylor
1981).

Although there is an almost infinite number of different ways by which
a geographical region of interest can be areally divided, data are normally
only presented and analysed for one particular set of units. The choice of
these units is often haphazard, in that considerations such as convenience
rather than geographical meaning are paramount. This uncertainty about the
nature and definition of the zonal objects of spatial study is an important
consequence of the MAUP. It is important because of the effects that the use
of different areal units may have on the results of geographical study and
because it is endemic to all analyses of areal or zonal data. It is a major
geographical problem with ramifications that need to be properly appreciated
by geographers and all others interested in the analysis of spatially
aggregated data. Looked at in this way, the MAUP is today one of the most important
unresolved problems left in spatial analysis. There has been very little
research compared with that afforded to many far less significant problems,
and whilst it appears to be primarily a technical problem, it is also a major
conceptual problem that is central to many aspects of geographical study.

(i) An insoluble problem

One very good reason for ignoring the MAUP is the belief that it is
insoluble. If it really is endemic to the study of all areal data and if it
really is insoluble then why not pretend it does not exist, in order to allow
some analysis to be performed? This CATMOG is dedicated to those who believe
in this fallacy of insolubility.

(ii) A problem that can be assumed away
III ON THE NATURE OF THE MODIFIABLE AREAL UNIT PROBLEM
(i) Definitions
The MAUP obviously includes both these subproblems. The scale problem
arises because of uncertainty about the number of zones needed for a particular
study. The aggregation problem arises because of uncertainty about how
the data are to be aggregated to form a given number of zones. It should be
noted that for any reasonably sized data set there is considerably more
spatial freedom in the choice of aggregation than there is in the choice of
the number of zones.
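The two sub-problems can be separated in a small numerical sketch. Everything below is an assumption made for illustration: the 30 x 30 lattice, the invented trend-plus-noise surfaces produced by `make_surface`, and the function names are hypothetical and are not the data or the method of any study cited here. Changing the block size varies the number of zones (scale); moving the origin of the lattice while keeping the block size fixed varies the aggregation. Incomplete blocks at the edge of the lattice are simply dropped.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))


def make_surface(n=30, seed=1):
    """Two invented variables on an n x n lattice sharing a weak spatial trend."""
    rng = random.Random(seed)
    x = [[r + c + rng.gauss(0, 20) for c in range(n)] for r in range(n)]
    y = [[r + c + rng.gauss(0, 20) for c in range(n)] for r in range(n)]
    return x, y


def block_sums(grid, size, ox, oy):
    """Sum the cells in each complete size x size block, lattice origin at (ox, oy)."""
    n = len(grid)
    return [
        sum(grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size))
        for r0 in range(oy, n - size + 1, size)
        for c0 in range(ox, n - size + 1, size)
    ]


def correlations_by_origin(x, y, size):
    """One correlation per lattice origin: size * size alternative aggregations."""
    return [
        pearson(block_sums(x, size, ox, oy), block_sums(y, size, ox, oy))
        for oy in range(size)
        for ox in range(size)
    ]


def scale_and_aggregation(x, y, sizes=(1, 2, 3, 5)):
    """Rows of (size, r for origin (0, 0), mean r over origins, s.d. over origins)."""
    rows = []
    for size in sizes:
        rs = correlations_by_origin(x, y, size)
        rows.append((size, rs[0], mean(rs), pstdev(rs)))
    return rows
```

For a block size of 5 cells the function `correlations_by_origin` returns 25 values, one for each possible origin. The spread of those values at a fixed size is the aggregation effect, and the change in a single-origin value from one size to the next is the scale effect.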
At this stage it is worth noting that there are two different types of
zonal arrangement. Most geographical studies have employed spatial aggregations
based on contiguous arrangements of zones, something referred to as a
zoning system. However, a zoning system is only a special case of a grouping
system that incorporates a contiguity constraint. The non-contiguous case
is referred to as a grouping system. The use of a contiguity constraint
restricts the degree of aggregational variability but in most practical studies
it is so large anyway that it brings little real advantage other than the
convenience of having zones which are formed of internally connected units.
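The difference between a grouping system and a zoning system can be made concrete by brute force on a very small lattice. The sketch below is illustrative only: it assumes square cells with rook (edge-sharing) contiguity, and the function names are hypothetical. It enumerates every way of dividing the cells into exactly m non-empty groups, counts how many of these have internally connected groups, and provides the Stirling number of the second kind as an independent check on the total number of groupings.

```python
def stirling2(n, m):
    """Number of ways to divide n objects into exactly m non-empty groups."""
    row = [1] + [0] * m                     # row[k] holds S(i, k) for the current i
    for _ in range(n):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            new[k] = k * row[k] + row[k - 1]
        row = new
    return row[m]


def partitions(n, m):
    """Yield each division of n objects into exactly m groups once, as a label tuple."""
    def extend(labels, used):
        if len(labels) == n:
            if used == m:
                yield tuple(labels)
            return
        # A new object may join an existing group or open the next unused one.
        for g in range(min(used + 1, m)):
            labels.append(g)
            yield from extend(labels, max(used, g + 1))
            labels.pop()
    yield from extend([], 0)


def is_connected(cells, rows, cols):
    """True if the cells form one rook-contiguous block on a rows x cols lattice."""
    cells = set(cells)
    start = next(iter(cells))
    seen, stack = {start}, [start]
    while stack:
        r, c = divmod(stack.pop(), cols)
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nb = rr * cols + cc
            if 0 <= rr < rows and 0 <= cc < cols and nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == cells


def count_systems(rows, cols, m):
    """Return (number of grouping systems, number of contiguous zoning systems)."""
    groupings = zonings = 0
    for labels in partitions(rows * cols, m):
        groupings += 1
        groups = [[i for i, g in enumerate(labels) if g == k] for k in range(m)]
        if all(is_connected(grp, rows, cols) for grp in groups):
            zonings += 1
    return groupings, zonings
```

For a 2 x 2 lattice divided into two groups, `count_systems(2, 2, 2)` gives 7 groupings, of which 6 are contiguous; only the diagonal pairing fails. Enumeration of this kind is feasible only for toy cases, but `stirling2` uses exact integer arithmetic and can be evaluated for much larger n and m.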
(1950) provides the conclusive proof that this is in fact the case. He quotes
an example based on the correlation between percentage population 10 years
old and over which is negro and the percentage of the same population that
is illiterate; another example is based on the correlation between nativity
and illiteracy. Table 3 shows the various correlations that were computed
for different levels of spatial aggregation.

Table 3. Individual and ecological correlations (after Robinson, 1950)

level of          number        correlations between:
aggregation       of units      negro and illiteracy   nativity and illiteracy

individual        98 million    .203                   .118
state             48            .773                   .526
census division   9             .946                   .619

The results are quite conclusive. There is a pronounced scale effect in that
the absolute values of the correlations increase as the number of observations
decreases. In addition, the aggregate values bear little resemblance to the
individual values prior to spatial aggregation. Robinson concludes therefore
'..there need be no correspondence between the individual correlation and the
ecological correlation' (page 354). This is an important result which readily
illustrates the dangers of making individual level inferences from analyses
performed at an aggregate level.

A final paper that is of interest in this section is that of Blalock
(1964). He describes the results of a series of experiments designed to
investigate the effects of data aggregation. The correlation coefficient
between differences in income for blacks and whites and percentage blacks for
150 southern USA counties was found to be 0.54. Blalock was interested in
the question of what happens if the counties are grouped into larger units in
various different ways. The results are shown in Table 4.

Table 4. Blalock's aggregation experiments

number of units   random grouping   random zoning
75                .67               .63
30                .61               .70
15                .62               .84
10                .26               .81

With random grouping systems we would expect the correlation coefficients to
show no systematic scale effects. The variability in values will be due to
sampling fluctuation. In this instance sampling fluctuation is in fact the
aggregation component since there are a very large number of ways by which
150 objects can be randomly grouped into 75 groups or less. The apparently
anomalous value of the correlation coefficient for the 10 groups is an
indication of this effect; indeed, it is slightly miraculous that the other
values are so uniform.

By contrast the random zoning systems will be affected by any spatial
autocorrelation present in the data, so that the rising correlations with
increasing scale can be regarded as the result of spatial autocorrelation
whereby the zoning system retains more variance of one variable than of the
other. This is the interpretation put forward by Taylor (1977). It has also
been argued that if the variables are not spatially autocorrelated then the
correlation coefficient will not increase with scale. A problem with this
interpretation is that Blalock ignores aggregation effects and these may
easily dominate any scale effects. Some of the ideas developed by Blalock
(1964) and put into a geographical context by Taylor (1977) have been tested
in Openshaw and Taylor (1979). The expected systematic relationships did
not emerge. The effects of the aggregational variability were simply too
strong; indeed, perhaps rather alarmingly, the authors concluded that 'We
have been able to find a wide range of correlations. We simply do not know
why we have found them. Hence we can make no general statements about
variations in correlation coefficients so that each areal unit problem must be
treated individually for any specific piece of research' (Openshaw and Taylor,
1979; pp 142-143). What is meant is that the aggregational variability is not
susceptible to a statistical approach since no systematic empirical
regularities could be found.

(iii) More recent studies

Apart from the occasional mention, the MAUP seems to have been ignored
until the problem was re-examined in the late 1970s. Openshaw (1977a) was
one of the first to re-emphasise the importance of aggregation effects. An
example readily shows the importance of the aggregation problem and relative
insignificance of the scale problem. The data used here relate to 100 metre
grid-squares for South Shields. These data could be readily aggregated to
200, 300, 400, 500, 600, 700, 800, 900 and 1 km squares. For each of these
scales there are a number of alternative aggregations; for example, shifting
the origin of the 100 metre lattice produces 25 different 500 metre
grid-square aggregations. The resulting distributions of correlation coefficients
are shown in Table 5.

Table 5. Scale and aggregation effects on the correlation between
numbers of Early and Mid-Victorian houses in South Shields

size of squares   scale      aggregation effects
(metres)          effects    mean correlation   standard deviation
100               .08        -                  -
200               .21        .31                .11
300               .43        .43                .06
400               .28        .47                .11
500               .55        .49                .16
600               .45        .52                .16
700               .20        .57                .18
800               .56        .58                .18
900               .66        .60                .19
1 km              .73        .62                .20

The second column shows the effects of increasing scale using only one of the
possible aggregations to each scale of grid-square. The third column shows
the mean correlation based on different aggregations to the same scale
produced by moving the origin of the lattice. The fourth column shows the
standard deviation of the correlation coefficients produced for the different
aggregations to each scale. This example contradicts the claim by Evans (1981)
that with grid-square data the changes in correlation coefficient are usually
consistent across a wide range of scales up to 256 km (page 55). The reason
is that his 1 km squares have already smoothed the data dramatically.
Finally, it is noted that the example shown in Table 5 does not consider the
full extent of the aggregation problem. This would involve an examination of
10,000 alternative 100 metre squares (assuming the data being aggregated have
been grid-referenced at the 1 metre level), 1,000,000 different 1 km squares,
and even larger numbers of alternatives if the zones are not constrained to
be square in shape.

The conclusion that can be drawn from Table 5 is that Yule and Kendall
(1950) were quite correct, although they clearly underestimated the severity
of the problem. Different variables can be affected by aggregation in
different ways so that multivariate techniques based on correlations will tend
to amplify the differences in results caused by the use of different zoning
systems. As a result, the aggregation and scale variability reported for the
correlation coefficient also applies to more complex multivariate methods and
to many other forms of analysis. It is demonstrated later that it is not a
problem that afflicts only the poor correlation coefficient.

The ecological fallacy problem has also been studied further. The
principal problem here is that a detailed investigation requires access to
large spatially referenced individual data sets and it is only quite recently
that sufficiently powerful computers have become available to handle these.
The ecological fallacy problem occurs because areal studies cannot
distinguish between spatial associations created by the aggregation of data and real
associations possessed by the individual data prior to spatial aggregation.
Thus the characteristics of typical deprived urban areas need not be the same
as the characteristics of the individuals who live there.

One consequence of Robinson's work was that many social scientists
interpreted his warning as a rigid taboo on the use of all aggregate data; although
this never extended to geography. Borgatta and Jackson (1980) pointed out
that 'what happened was the assumption that, because use of aggregate data
could be misleading at the individual level, every such interpretation had to
be incorrect' (page 8). It is also possible that Robinson exaggerated the
importance of the problem; in particular he only examined the most gross
levels of aggregation. The question arises as to whether these results are
typical of what might happen with finer spatial scales and more realistic
zoning systems.

Recently, some further insights into this problem have come from the
analysis of a random 10 per cent sample survey of all households in
Sunderland and from the analysis of individual census data for part of Italy
(Openshaw, 1983; Bianchi et al, 1981). The results for Sunderland are
briefly described here. These data can be studied at the
individual level (8,483 households) or aggregated to polling districts (36
zones), 1 km squares (117 zones), and 500 metre squares (348 zones). A set
of 54 typical indicator variables was computed. The simplest way to
investigate the ecological fallacy problem is to cross-tabulate the individual
and zonal correlation coefficients (Table 6).

Table 6. Crosstabulation of individual and ecological correlations
(percentage of row totals)
[Table body not recoverable. Rows: individual correlations; columns: areal
correlations; both in classes of width .2, with row totals. Panels:
Sunderland 1 km squares; Sunderland polling districts; Sunderland 500 m
squares.]
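The gap between individual and ecological correlations can be reproduced with a deliberately artificial population. The sketch below does not use the Sunderland or Robinson data; the nine zones, the 100 individuals per zone, the two binary attributes, and the function names are all invented for illustration. Within every zone the two attributes are statistically independent, yet both become more common from zone to zone at the same rate, so the zonal proportions are perfectly correlated while the pooled individual correlation is weak.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))


def build_population():
    """Invented population: zones z = 1..9 with 100 individuals each.

    In zone z a proportion z/10 has attribute x and z/10 has attribute y,
    and the cell counts are chosen so that x and y are independent
    within the zone (count with both = 100 * (z/10) * (z/10) = z * z).
    """
    people = []                                   # (zone, x, y) records
    for z in range(1, 10):
        both = z * z
        only_one = 10 * z - z * z
        neither = (10 - z) ** 2
        people += [(z, 1, 1)] * both
        people += [(z, 1, 0)] * only_one
        people += [(z, 0, 1)] * only_one
        people += [(z, 0, 0)] * neither
    return people


def individual_and_ecological(people):
    """Return (correlation over individuals, correlation over zonal proportions)."""
    xs = [x for _, x, _ in people]
    ys = [y for _, _, y in people]
    totals = {}                                   # zone -> [count, sum of x, sum of y]
    for z, x, y in people:
        t = totals.setdefault(z, [0, 0, 0])
        t[0] += 1
        t[1] += x
        t[2] += y
    px = [t[1] / t[0] for t in totals.values()]
    py = [t[2] / t[0] for t in totals.values()]
    return pearson(xs, ys), pearson(px, py)
```

With this construction the individual correlation works out at 4/15 (about .27), whereas the correlation between the nine zonal proportions is 1. The example is extreme by design; it shows only that aggregation alone can manufacture a strong areal association out of a weak individual one.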
shows that aggregation has a flattening effect on the frequency distribution
of the individual correlation coefficients. Table 6 clearly demonstrates
the systematic biasing of the ecological correlations from 0 towards 1 and
that the magnitude of the bias increases with scale.

It is noted that the crosstabulations in Table 6 do not give any
indication of aggregational variability since only one aggregation at each scale
was examined. That is to say, these results refer only to scale effects and
it may be expected that the aggregational effects will be somewhat larger.
Both are important since if these phenomena were better understood it might
be possible to design improved areal definitions for reporting census data.
For instance, is there a critical size for census enumeration districts which
may minimise the effects of scale and aggregation on the data being aggregated?
The present size is merely a reflection of the area that can be covered by a
census enumerator in one day; this is hardly a meaningful variable in urban
geography. It is something of a mystery why census data collecting agencies
do not bother to try and resolve these very important practical questions.

These results suggest that perhaps the magnitude of the ecological
fallacy problem is less than the results presented by Robinson (1950) might
indicate. Certainly the changes in the magnitude of the correlation
coefficient are smaller in Table 6 than in Table 3. However, this is slightly
misleading since only a small percentage of all correlations in Table 6 do not
have substantial and systematic biases; for the polling district data the
figure is 16 per cent. Additionally, it is impossible to predict the severity
of the problem without access to individual data. As a result there is no
way of knowing whether a particular areal data set will yield values which
are close to the individual values. A fuller discussion of empirical aspects
is provided in Openshaw (1983), while Williams (1976, 1979) outlines a
theoretical interpretation.

IV THE RESULTS OF SOME AGGREGATION EXPERIMENTS
(i) Random aggregation and the correlation coefficient

The complex nature of the MAUP suggests that further advances in our
understanding of it can be most readily made by empirical experimentation.
It is not denied that a theoretical approach could be rewarding; indeed,
various preliminary studies have been made (Williams, 1976, 1979; Batty and
Sikdar, 1982). However, the problem is proving to be exceptionally complex
and it is most easily investigated by empirical means. Furthermore, the
availability of high-speed computers makes it possible to design aggregation
experiments of a far more comprehensive nature than would be the case if
non-automated methods were being employed. Additionally, entire new numerical
algorithms can be devised to explore different aspects of the aggregation
problem.

The first set of experiments concerns the effects of random aggregation
on the correlation coefficient. Some of the results produced by simple
random aggregation experiments by Gehlke and Biehl (1934) and Blalock (1964)
have already been described. The question is simply what happens if a more
systematic and comprehensive series of experiments is performed. Interest
is focused on purely random aggregations partly because of the historical
connections and partly because it has been suggested that the statistical
distribution of a statistic due to sampling variability and its zoning
distribution due to the choice of different zoning systems are analogous. If
the analogy could be proven then it would be exceptionally convenient because
it would allow the standard formulae for estimating sampling errors for
simple random samples to be used to provide estimates of aggregational
variability, presumably under the assumption of simple random zoning. In this
sampling-zoning analogy the number of zones in the zoning system would be
regarded as equivalent to sample size.

For this study 1970 census data for the 99 counties in the State of Iowa,
USA, are examined. Two variables are selected for analysis; the percentage
vote for Republican candidates in the congressional election of 1968 and the
percentage of the population over 60 years. There is nothing special about
the selection of these data; it merely happened to be convenient! Openshaw
and Taylor (1979) report a range of different correlations that can be
produced for these variables when the 99 counties are aggregated into a number of
arbitrary six zone aggregations; the values ranged from .26 for the
congressional districts to .86 for a simple typology of Iowa into rural-urban
types. The value of the correlation at the 99 county level is 0.34. Since
the 99 counties form a complete population of Iowa counties this value can
be regarded as the population correlation. The question is how well random
samples and simple random zoning systems represent this population value.

Table 7 reports the means and standard deviations of the correlation
coefficient for 10,000 random samples of (i) random zoning systems (randomly
selected areal aggregations) with 6, 12, 18, 24, 30, 36, 42, 48, and 54
zones; and (ii) random samples (random selections of various numbers of zones)
of 6, 12, 18, 24, 30, 36, 42, 48, and 54 counties. The latter provide
results which approximate the values that would be obtained from standard
sampling formulae. Openshaw (1977b) describes the computer algorithm used to
generate the quasi-random zoning systems.

Table 7. Sampling and zoning distributions of the correlation coefficient

           zoning distributions               sampling distributions
number     mean    standard       sample      mean    standard
of zones           deviation      size                deviation
6          .36     .218           6           .31     .429
12         .33     .161           12          .34     .273
18         .33     .139           18          .34     .209
24         .32     .122           24          .34     .172
30         .33     .110           30          .34     .144
36         .33     .102           36          .34     .125
42         .33     .092           42          .34     .109
48         .33     .082           48          .34     .097
54         .33     .073           54          .34     .086
99         .346                               .346

The most interesting discovery here is that scale has no systematic effect
on the mean correlation coefficient. This is because the zoning systems are
chosen at random so that the sample (or more precisely the zoning) estimates
of the correlation coefficient approximate the population value (which for
zonal data is the observed value prior to the current aggregation, ie the
99 zone value). It should also be noted that there is considerable
zoning and sampling variability about the mean values but that this reduces
with increasing numbers of zones or increasing sample sizes. Finally, the
standard deviations of the zoning distributions are considerably smaller than
the corresponding sampling distributions but exhibit a greater degree of bias.

In these results, somewhere, are the effects of spatial autocorrelation.
Most data sets exhibit positive spatial autocorrelation and the Iowa data
are no exception. Spatial autocorrelation only affects the zoning
distributions because aggregation takes place under contiguity restrictions. Normality
or non-normality is not thought to have any important effect on these
experiments.

One way of identifying the effects of spatial autocorrelation is to use
data sets with different levels of spatial autocorrelation and see what
effect this has on the zoning distributions. Openshaw and Taylor (1979)
describe a procedure for generating artificial data for the 99 Iowa counties
with the following properties: zero skewness and kurtosis to ensure normality,
a correlation equal to that observed for the real Iowa data, and regression
slope and intercept parameters also equal to the observed Iowa data. Three
different levels of spatial autocorrelation were considered (autocorrelation
is measured by Moran's I statistic for first order contiguities, see Silk
(1979)): maximum negative spatial autocorrelation, MN, (the best that could
be achieved were values of -.71 for the vote variable and -.57 for the old
age variable), zero autocorrelation, Z, and maximum positive autocorrelation,
MP, (the best that could be managed were values of .82 and .92). The same
sets of 10,000 zoning systems as used for Table 7 are applied to these
artificial data sets with the results shown in Table 8.

aggregational variability due to the use of simple random zoning systems.
An examination of a simple null hypothesis test based on the correlation
coefficient shows the sort of additional risk that is involved. For a standard
type I error significance level of 0.05 the value observed from the Monte
Carlo experiments ranged from .10 to .22, according to the level of spatial
autocorrelation and the particular variable under study.

Other problems with the sampling-zoning analogy concern the fact that
most zonal data sets contain both sampling variability and aggregational
variability. In addition, zonal data are unusual in that the population value
for any statistic can be determined; for aggregated data this is the value of
a statistic for the data prior to the current aggregation. A final problem
concerns the fact that geographers have not previously shown any interest in
studying purely random zoning systems; perhaps they are not thought to be
meaningful entities, although it is possible also that until quite recently
it was difficult to generate random zoning systems.

Another aspect of this discussion concerns the use of inferential
statistical techniques with zonal data. Quite simply, it is seldom clear as to
what is the nature of the hypothesis that is being tested and what, if
anything, the results signify. If random zoning is not being used then in what
way do zonal data constitute a sample, be it simple or complex? What is the
population? A statistical answer to some of these questions is to invent a
'super population'; for example, that the Iowa data is a random sample of
data for Iowa counties because it relates to one, randomly chosen, point in
time. While this is easy to say, it is far less easy to identify what the
significance tests mean. There is also the difficult problem of determining
an appropriate set of sampling error estimation equations. The Iowa data
represents a sample size of 1. Finally, it is not clear as to the
geographical implications of the hypotheses that could be tested. For example, under
what conditions is it possible to compare zonal estimates for one set of
zones with zonal estimates for another set?

(ii) Random aggregation and other statistics

Table 8. Zoning distributions of the correlation coefficient for
three different levels of spatial autocorrelation

number     MN                   Z                    MP
of zones   mean   standard      mean   standard      mean   standard
                  deviation            deviation            deviation
A further consideration is whether or not the results observed for the
6 .31 .443 .61 .294 .60 .247 correlation coefficient also hold good for other statistics. Perhaps the
12 .30 .370 .47 .263 .52 .176 correlation coefficient is a special case. The question is therefore what
18 .29 .350 .42 .227 .48 .142 scale and aggregation variability are likely to be displayed by other un
24 .31 .309 .40 .192 .44 .121 standardised statistics, such as the mean and the regression slope coefficient.
30 .32 .277 .39 .166 .42 .108 Is it possible that these statistics will be less affected and more robust
36 .32 .242 .38 .146 .40 .098 to aggregation effects? For example, the mean has very good large sample
42 .33 .209 .37 .128 .39 .087 properties. Table 9 should dispel any fears in this direction. It illustrates
48 .33 .183 .36 .112 .38 .080 some results from a regression of percentage rate for Republican candidates
54 .33 .160 .36 .100 .34 .072 as a percentage of the population over 60 years of age (see page 17).
The artificial data with negative spatial autocorrelation has the least The mean statistic for the zoning distributions is only very slightly
biased results but the largest standard deviations, whereas increasing posi biased but still has the now characteristic small standard deviation, rela
tive spatial autocorrelation produces results which are increasingly biased tive to the related sampling distributions. The regression coefficient be
but with smaller standard deviations. The zero autocorrelation state confers haves in a similar fashion to the correlation coefficient.
no particular benefits.
(iii) Random aggregation experiments with once aggregated data
The principal conclusion from these experiments is that the sampling
The previous experiments concerned the effects of randomly aggregating
zoning analogy does not hold good. There is an additional risk involved in
zonal data which have already been aggregated at least once previously. Most
using standard error formulae for simple random sampling as estimates of the
18 19
Table 9. Zoning and sampling distributions of a mean and a regression
         slope statistic

number           mean old aged                    regression slope
of           zoning         sampling          zoning          sampling
zones     mean    std     mean    std      mean    std      mean    std

   6      14.5   .263     14.5   1.105     1.55   1.071     1.13   1.944
  12      14.5   .291     14.5    .747     1.34    .689     1.23   1.088
  18      14.5   .278     14.5    .591     1.27    .569     1.23    .800
  24      14.5   .267     14.5    .496     1.25    .484     1.24    .649
  30      14.5   .249     14.5    .422     1.24    .427     1.24    .536
  36      14.5   .230     14.5    .369     1.23    .389     1.24    .460
  42      14.5   .217     14.5    .323     1.23    .346     1.24    .396
  48      14.5   .202     14.5    .291     1.23    .307     1.24    .352
  54      14.5   .186     14.5    .255     1.23    .273     1.25    .312

  99      14.5            14.5             1.25             1.25

Note: std is an abbreviation for standard deviation

data that geographers study are of this type. The question arises, therefore,
as to the effects of aggregating data that have not been previously
aggregated; for example, the aggregation of individual data to a zoning
system. This problem is interesting partly because it is here that ecological
fallacies may be created and partly because aggregation changes the
measurement scale, usually from a nominal to a continuous form. For example,
presence or absence measurements become frequencies or ratios or percentages
after aggregation.

The Sunderland data are used to investigate this problem. The household
data have 100 metre grid references attached to them. For this experiment
the 8,483 households can be regarded as single member zones. Notional
contiguities can be generated by a Thiessen polygon program so that the
individual data zones can be aggregated to form random zoning systems with 25,
50, 75, 100, 150, and 200 zones. Table 10 shows the results that were
obtained for three variables which were selected to show different types of
aggregational behaviour displayed by the correlation coefficient.

Table 10. Zoning distributions for once aggregated data for Sunderland

number          variable 1       variable 2       variable 3
of zones        mean    std      mean    std      mean    std

  25            .79    .045      .93    .015      .94    .014
  50            .82    .034      .92    .015      .92    .017
  75            .83    .026      .92    .015      .91    .016
 100            .84    .026      .92    .015      .90    .020
 150            .83    .022      .91    .016      .88    .013
 200            .82    .022      .91    .015      .87    .018
individual
correlation     .42              .81              .57

Note: std is an abbreviation for standard deviation

These results are superficially similar to those reported for the
reaggregation of already aggregated data. One difference is the smaller
relative sizes of the standard deviations of the zoning distributions in
Table 10. This could well reflect the use of small sample sizes; computer
times for a sample of 100 different individual data aggregations amounted to
2 hours of CPU time on an IBM 370/168. The use of notional contiguities and a
sample data set may also have contributed to reducing the expected range of
aggregation effects. Most of the results for the other 50 variables which were
examined tended to have zoning distribution means of the correlation
coefficient which are similar to the 1 km zonal values. Nevertheless, the
results again show that zonal correlations need not correspond to the
individual level correlations and that a 'good' zoning system for one variable
can be quite 'poor' for another, at least in terms of the differences between
ecological and individual correlation coefficients. It is still confidently
expected that the aggregational variability in the range of possible results
due to the choice of the first zoning system will exceed that of any
subsequent reaggregations of the data, although the current experiment did not
show it. Even if this assumption can be disproven, it is highly likely that
the choice of the first zoning system has a crucial effect on the severity of
any subsequent ecological fallacies and that, as far as practicable, the
design of this zoning system should be optimised to minimise these effects.
It may be that the possible benefits are slight or are offset by the computer
costs that are involved, but until we try we shall never know.

(iv) Identifying the limits of the MAUP

So far attention has been restricted to investigating the variability in
results due to purely random spatial aggregations. The question now arises
as to what are the worst case, or real, limits of aggregation effects if we
are perverse enough to look and know how to find them. The existence of
electoral boundary gerrymandering has been known about in political geography
for over 170 years, ever since the famous 1812 gerrymander (Taylor and
Johnston, 1979; pages 371-374). However, it is only recently that its general
implications for spatial analysis have been recognised (Openshaw, 1977a,
1977c, 1978b). By searching for the approximate limits of the range of
aggregation effects it is possible to demonstrate the magnitude and severity
of the MAUP.

Openshaw (1977a) uses a heuristic procedure, of a type similar to
iterative relocation algorithms in cluster analysis, to optimise any general
function by manipulating the zoning systems. This method provides an
approximate solution to what is termed the Automatic Zoning Problem; the
algorithm is called the Automatic Zoning Procedure (AZP). The basic algorithm
is best described in general terms as consisting of a series of steps.

Step 1. Decide how many regions are required in the final aggregation.
Step 2. Generate a random zoning system with this number of regions.
Step 3. Randomly select one of these regions and proceed around its
        boundary measuring the effects on the objective function of moving
        zones from the bordering regions into it.
Step 4. Once an improvement is recorded for the objective function which
        is being optimised, then check whether the move is possible;
        that is, it must not destroy the internal contiguity of the
        region from which a zone is being moved; either reject or accept
        the move.
Step 5. Once all the members of a region have been examined return to
        step 3 to process another region; if all regions have been
        examined then go to step 6.
Step 6. If one or more moves have been made then return to step 3,
        otherwise stop.

In this algorithm the initial data are assumed to relate to a set of zones
and these zones are to be aggregated into a smaller number of large zones
which for purposes of clarity are termed regions. For example, the 99 Iowa
counties form a set of 99 zones which can be aggregated into 6 regions. The
aggregation is performed in such a way as to approximately optimise an
objective function whilst ensuring that all the zones assigned to the same
region are internally connected or contiguous. The objective function can
be any general function and it need not be continuous. For example, the aim
may be to maximise or minimise a correlation coefficient between two
variables in order to identify the approximate limits of variability due to
the MAUP. The AZP algorithm is a heuristic procedure which experience has
shown can readily solve many types of optimal zoning problems although there
is no guarantee that it will always find the global optimum; indeed with this
type of problem there can be no certainty that there is a unique global
optimum to be found. For most problems it probably gets fairly close to a
'good' local optimum; large problems are easier to solve than small ones. No
doubt the heuristics could be further improved; for example, by the
incorporation of a multiple simultaneous move heuristic; but at present this
is not the most important problem. More important was the discovery of how to
incorporate a constraint handling procedure (Openshaw, 1978b), because
together with fast computers this made possible the application of the AZP
algorithm to a wide range of region building problems.

Returning to the correlation coefficient, this can be used as the
objective function in the AZP and attempts made to seek zoning systems that
either maximise it or minimise it. This can be regarded as an exercise in
applied gerrymandering or, if you prefer, spatial engineering of zoning
systems. The dramatic results are shown in Table 11. Even for the 99 Iowa
zones, a small data set by current standards, a very wide range of results
can be obtained. The amount of aggregational variability, or spatial
freedom, will be even greater with larger data sets and is probably some
exponential function of the aggregation factors involved. Nevertheless, for a
6 region aggregation of the 99 Iowa counties the range of possible
correlations is between -.99 and +.99. It is also possible that many of the
intermediate results can be obtained; for example, a zoning system with a
correlation of 0.5 or 0.334. Different amounts of spatial autocorrelation
have no noticeable effects.

Table 11. Some approximate limits of the correlation coefficient due to
          different aggregations of the Iowa data

number     Iowa data         MN data          Z data           MP data
of zones   min r   max r     min r   max r    min r   max r    min r   max r

   6       -.99    .99       -.99    .99      -.99    .99      -.99    .99
  12       -.99    .99       -.97    .99      -.99    .99      -.98    .99
  18       -.97    .99       -.97    .99      -.97    .99      -.92    .99
  24       -.92    .99       -.98    .99      -.90    .99      -.89    .98
  30       -.73    .98       -.93    .98      -.86    .98      -.78    .95
  36       -.71    .96       -.93    .98      -.80    .98      -.61    .93
  42       -.55    .95       -.92    .97      -.79    .96      -.52    .93
  48       -.50    .90       -.87    .96      -.66    .95      -.39    .89
  54       -.42    .82       -.85    .95      -.52    .91      -.32    .88

Notes: based on best of five different random zoning systems used as starting
       aggregations. MN, Z, MP are the three artificial Iowa data sets
       (see Table 8)

Yule and Kendall (1950), in a prophetic statement, warn against the
development of zonal manipulation procedures of the kind used here. They
write 'the student should not now go to the other extreme and claim that,
since a large range of values of correlation coefficients may be obtained
according to the choice of a modifiable unit, a particular value has no
significance' (page 312). Perhaps they did not realise that such a wide range
of aggregation effects was present or did not know how to find them in a
systematic fashion. Instead what they mean is that the significance of the
correlation coefficient depends on the meaningfulness of the areal units on
which it is based. Perhaps they thought, rather naively, that counties are a
sensible spatial unit for the study of crop yield relationships whereas
arbitrary aggregations of the counties to maximise a correlation coefficient
would not be. It is a shame that Yule and Kendall's work on the modifiable
areal unit problem did not continue past this point. Perhaps it could not,
because the problem rapidly becomes one of trying to assess the degree of
meaningfulness associated with different geographical definitions for a
particular purpose. In general terms it is an impossible problem; for example,
how would we go about determining whether counties are an appropriate unit by
which to study crop yield relationships or indeed anything?

Some critics of the optimal zoning results have suggested that it only
works when applied to correlation coefficients and that in any case the
optimal zoning systems will be of the most peculiar shapes and sizes. This
latter point is examined later. The first is simply incorrect. The performance
and parameter estimates of a variety of linear and nonlinear models have also
been shown to vary between wide limits (Openshaw, 1977c, 1978a, 1978b). Some
models, for instance interaction models, are highly sensitive since the
pattern of trips that these models try to represent depends on the zoning
systems used.

A simple example based on the linear regression model should help emphasise
the importance of the MAUP. The AZP can be used to produce zoning systems
which generate data to either maximise or minimise best statistical estimates
of the slope coefficient in a regression model based on the Iowa data
(Openshaw, 1978a). In this experiment every time a change is made to the
zoning system by the AZP the parameters are re-estimated. Two different
parameter estimation procedures are used: one based on ordinary least squares,
the other on a robust line fitting procedure in the style of Tukey (1977) and
described in McNeil (1977); the purpose is to avoid making normal linear
regression model assumptions. The results are shown in Table 12 and two of the
12 region zoning systems are shown in Figure 2.

Table 12. Approximate limits of regression slope coefficients due to
          different aggregations of the Iowa data

number     Ordinary least squares       robust line fitting
of         estimation of slope          estimation of slope
zones      minimise    maximise         minimise    maximise

   6         -121         27              -84          22
  12          -24         12              -34          42
  18          -12         12              -14          16
  24           -8         10              -11          14
  30           -5          7              -12          12
  36           -4          6               -8          10
  42           -3          5               -5           8
  48           -2          4               -4           6
  54           -1          4               -2           6
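The six AZP steps, with the correlation coefficient as objective function, can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not Openshaw's implementation: the rectangular lattice of zones, rook contiguity, the use of region totals when aggregating, and every function name are hypothetical choices made for the example.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / sqrt(saa * sbb)


def grid_adjacency(rows, cols):
    """Rook contiguity for a hypothetical rows x cols lattice of zones."""
    adj = {z: set() for z in range(rows * cols)}
    for z in adj:
        if (z + 1) % cols:            # zone is not in the last column
            adj[z].add(z + 1)
            adj[z + 1].add(z)
        if z + cols < rows * cols:    # zone is not in the last row
            adj[z].add(z + cols)
            adj[z + cols].add(z)
    return adj


def random_zoning(adj, k, rng):
    """Step 2: grow k contiguous regions outwards from random seed zones."""
    label = {seed: r for r, seed in enumerate(rng.sample(sorted(adj), k))}
    while len(label) < len(adj):
        frontier = [(z, n) for z in label for n in adj[z] if n not in label]
        z, n = rng.choice(frontier)
        label[n] = label[z]
    return label


def is_connected(members, adj):
    """True if the zones in `members` form one non-empty contiguous block."""
    members = set(members)
    if not members:
        return False
    stack = [next(iter(members))]
    seen = set(stack)
    while stack:
        for n in adj[stack.pop()]:
            if n in members and n not in seen:
                seen.add(n)
                stack.append(n)
    return seen == members


def region_correlation(label, x, y, k):
    """Objective function: correlation between region totals of x and y."""
    sx, sy = [0.0] * k, [0.0] * k
    for z, r in label.items():
        sx[r] += x[z]
        sy[r] += y[z]
    return pearson(sx, sy)


def azp(adj, x, y, k, maximise=True, seed=0):
    """Steps 1 to 6 for a correlation objective; k is the Step 1 choice."""
    rng = random.Random(seed)
    sign = 1.0 if maximise else -1.0
    label = random_zoning(adj, k, rng)
    best = sign * region_correlation(label, x, y, k)
    moved = True
    while moved:                                  # Step 6: repeat while moves occur
        moved = False
        order = list(range(k))
        rng.shuffle(order)
        for r in order:                           # Steps 3 and 5: each region in turn
            border = sorted({n for z, lab in label.items() if lab == r
                             for n in adj[z] if label[n] != r})
            for z in border:
                donor = label[z]
                label[z] = r                      # trial move of zone z into region r
                score = sign * region_correlation(label, x, y, k)
                rest = [m for m, lab in label.items() if lab == donor]
                if score > best + 1e-12 and is_connected(rest, adj):
                    best, moved = score, True     # Step 4: accept the move
                else:
                    label[z] = donor              # Step 4: reject the move
    return label, sign * best
```

Running `azp` twice with the same `seed`, once with `maximise=True` and once with `maximise=False`, starts both searches from the same random aggregation, so the two results bracket the correlation of that starting zoning system. Because every accepted move strictly improves the objective, the loop must terminate, although only at a local optimum.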
The propensity that many geographers have shown for attributing
substantive interpretations to the slope coefficients in regression models
should be greatly diminished by these results. For example, the value of the
slope coefficient in distance decay models clearly reflects the zoning system
as well as behaviour patterns. It is likely that more complex models,
including entropy maximising spatial interaction models, will also suffer from
effects similar to those displayed by these linear regression models.
Currently, there is no evidence to the contrary.
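A small worked example may help to show how a fitted slope can reflect the zoning system as well as the data. The sketch below uses six hypothetical zones along a line and two invented three-region zoning systems; the data values, the use of region means for aggregation, and the function names are all assumptions made for illustration and have nothing to do with the Iowa data.

```python
def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def region_means(values, zoning):
    """Average the zone values inside each region of a zoning system."""
    return [sum(values[z] for z in region) / len(region) for region in zoning]


# Hypothetical data for six zones arranged along a line.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

# Two contiguous three-region zoning systems for the same six zones.
zoning_a = [(0, 1), (2, 3), (4, 5)]
zoning_b = [(0,), (1, 2), (3, 4, 5)]

slope_zones = ols_slope(x, y)
slope_a = ols_slope(region_means(x, zoning_a), region_means(y, zoning_a))
slope_b = ols_slope(region_means(x, zoning_b), region_means(y, zoning_b))
print(round(slope_zones, 3), round(slope_a, 3), round(slope_b, 3))
# prints 0.829 1.0 0.687
```

The same six observations therefore yield three different slope estimates, depending only on whether and how they are grouped.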
Figure 2a. Zoning system that minimises the regression slope coefficient
           (-24, r = -.25)

   6      14.8      .02
  12      15.3      .8
  18      15.0      .7
  24      14.3     1.6
  30      12.4     1.9
  36      12.2     2.2
  42      11.5     2.5
  48      10.7     3.2
  54      10.3     3.6
Figure 3 shows the geometry of two 12 zone systems that maximise and
minimise the mean absolute error. In these experiments the objective function
used in the AZP is the mean absolute error goodness of fit statistic and the
model parameters are re-estimated using a robust line fitting procedure every
time the zoning system changes. The range in results reported here is due
solely to the nature of the zoning systems that are used.
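An objective function of this kind, in which a robust line is refitted and a mean absolute error returned for each candidate zoning system, might look like the sketch below. The fitting routine is a simplified single-pass three-group resistant line, which is only one reading of a Tukey-style robust fit; whether it matches the procedure actually used in these experiments is an assumption, as are the use of region means and all function names.

```python
from statistics import median


def resistant_line(x, y):
    """Single-pass three-group resistant line; returns (intercept, slope)."""
    pts = sorted(zip(x, y))
    third = len(pts) // 3
    if third == 0:
        raise ValueError("need at least three points")
    left, right = pts[:third], pts[-third:]
    xl, yl = median(p[0] for p in left), median(p[1] for p in left)
    xr, yr = median(p[0] for p in right), median(p[1] for p in right)
    if xr == xl:
        raise ValueError("outer groups share the same median x")
    slope = (yr - yl) / (xr - xl)
    # Intercept taken as the median residual about the fitted slope.
    intercept = median(b - slope * a for a, b in pts)
    return intercept, slope


def mean_absolute_error(x, y, intercept, slope):
    """Mean absolute deviation of y from the fitted line."""
    return sum(abs(b - (intercept + slope * a)) for a, b in zip(x, y)) / len(x)


def mae_objective(x, y, zoning):
    """Refit the line to region means and return the mean absolute error."""
    rx = [sum(x[z] for z in region) / len(region) for region in zoning]
    ry = [sum(y[z] for z in region) / len(region) for region in zoning]
    intercept, slope = resistant_line(rx, ry)
    return mean_absolute_error(rx, ry, intercept, slope)
```

A function such as `mae_objective` could be passed to a zoning search in place of a correlation coefficient, so that the fit statistic is recomputed after every change to the zoning system.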
Figure 4a. Zoning system that fits a model with arbitrary intercept and
           slope of 41.4 and 2 (actual 42.4 and 1.90)

Figure 4b. Zoning system that fits a model with arbitrary intercept and
           slope of 41.4 and 1.0 (actual 42.0 and .98)

Figure 5a. Zoning system that fits model with arbitrary intercept and
           slope of 50 and 1.25 (actual 48.4 and 1.26)

Figure 5b. Zoning system that fits model with arbitrary intercept and
           slope of 30 and 1.25 (actual 30.3 and 1.31)
Suppose that two sets of runs are performed; the first holds the intercept
at the 99 zone level and systematically varies the slope coefficient; the
second holds the slope coefficient at the 99 zone level and systematically
varies the intercept. An alternative approach to fitting these models is to
minimise the difference between the target parameters and the values
estimated for a particular zoning system. Both sets of results are shown in
Table 14 (page 27) with some of the zones being reproduced in Figures 4 and 5
(pp. 28, 29).

The decisions as to whether an acceptable level of fit is achieved are
arbitrary. Nevertheless, it is suggested that quite reasonable levels of
fit have been achieved. It is particularly noticeable that the 99 zone
intercept and slope parameters (41.46 and 1.25) can be matched at all five
levels of aggregation and that these zoning systems have zero aggregation
effects. A robust data fitting procedure was used for Table 14. Similar
results can be obtained for ordinary least squares regression; indeed rather
more zoning systems would be judged to fit the target parameters.

One use of spatial calibration is to test specific geographical
hypotheses about the nature of the results that may be expected; this is
elaborated upon later. The argument here is that these empirical results
demonstrate that the statistical and geographical aspects of spatial analysis
need to be integrated. Zone design is in many ways a geographical complement
to the statistical process of parameter estimation and with zonal data they
cannot be separated if meaningful geographical results are to be obtained.
This viewpoint is controversial since it implies that a large number of
geographical studies are: (1) inherently non-geographical; (2) based on
haphazard zoning systems with little direct control over aggregation effects;
and (3) otherwise seriously flawed. The logic of this argument leads
inexorably to a very different paradigm for spatial study than that currently
used; this is examined later.

(vi) ... but do the optimal zoning systems look nice?

A final consideration concerns the nature of the optimal zoning systems
shown in Figures 2 to 5. It can be argued, with some justification, that for
reasons not yet investigated or understood, the aggregational properties of
the 'real' zoning systems that geographers use are not as bad as the perverse
optimal zoning systems that the AZP can identify. Perhaps the use of zones
that look 'nice' or are based on regularly shaped units or convenient
administrative definitions may avoid the extremes of the MAUP that have been
identified in the various aggregation experiments. At the limits this is
certainly true but the real problem is that the aggregational properties of
nearly all ad hoc zoning systems are simply unknown. Additionally, it is
difficult to establish any spatial benchmark against which the performance of
alternative zoning systems can be measured. Geometric criteria, shape and
size are not particularly relevant because it is the characteristics of the
data and not the zones themselves that are important. The only absolute
benchmark is the same data at a pre-aggregation or individual level and the
characteristics of the latter are seldom known or available for analysis.

In principle it really does not matter what shape zones have since it
is the relationship between zonal boundaries and the micro-level patterns
which they detect and report that is the subject of spatial analysis. If
the assumption of an isotropic plain were applicable then obviously a
geometrically regular set of zones at a carefully selected scale would be most
relevant. However, given the very uneven, lumpy, and discontinuous nature of
real world patterns it is not at all obvious why zoning systems should
possess geometric regularity, and if they do what advantages this brings over
the sorts of shapes described in Figures 2 to 5.

Likewise it is not apparent why neutral or locationally arbitrary areal
units, for example grid squares, should be of any interest in geography.
Since we have the means to design zoning systems which are optimal for a given
purpose, should we not be seeking to use these zoning systems as a means of
investigating further the relationships under study? An analogy with a
television aerial seems most appropriate. You could use an aerial designed
for a radio and perhaps receive a poor picture. You could build your own to
the most beautiful geometric design and get no picture at all. You could
design an aerial to produce the best possible picture without worrying too
much about aesthetics. The zone design problem is broadly analogous to the
design of an aerial. The zoning system acts as a detector of spatial patterns
and the patterns that are detected and their distinctiveness depend on its
design. Surely no geographer can be content to use zoning systems produced by
others or seek to use nice looking zones purely on aesthetic grounds without
any regard for their performance as pattern detectors.


V POSSIBLE SOLUTIONS

(i) No philosopher's stone

It is not thought likely that a general solution can be found that will
allow existing methods to be used as if the MAUP did not exist. The problem
is far too complex, it is difficult to investigate by analytical means, and
its inherent geographical nature makes it unlikely that a statistical
solution will emerge or, if it does, that it will suffice.

The simplest solution to the MAUP is to pretend it does not exist and
hope that the results being produced for ad hoc zoning systems will still be
meaningful or at least interpretable. This view is implicit, by the lack of
any explicit statements to the contrary, in much geographical work. For
example, the performance of a mathematical model depends partly on its
specification and partly on the zoning system that is used. There is often an
elaborate body of theory to help with the model specification problem but
little or no guidance is available to aid the choice of zoning system.
Likewise many quantitative geography texts describe the existence of the MAUP
but offer little or no advice as to how best to use the techniques that are
described to study data for modifiable units.

It is also fortuitous that ad hoc zoning systems often produce plausible
results despite the neglect afforded to the careful definition of areal
entities. However, it should be noted that the general absence of
comparative studies may have helped disguise the extent to which
zone-dependent regularities are being uncovered. The principal example
sometimes quoted to demonstrate that the choice of zoning system is of little
consequence is that of factorial ecologies where it seems that the major
structural relationships between sets of social variables are relatively free
from zoning effects. Whilst zonal invariance may be useful for some purposes,
it is also slightly
worrying that so many social area analyses should be so similar despite
cultural and other important differences. Perhaps closed number set problems
and correlated denominators have combined to determine the results. It may
also be that more sensitive methods and carefully engineered zoning systems
would detect very different spatial patterns.

The problem is not that geographers have failed to realise that the MAUP
exists, only that they do not know what to do about it. Perhaps mistakenly,
they have opted to concentrate on the more tractable statistical problems
presented by the analysis of spatial data whilst neglecting the more
geographical ones. The pioneering work on spatial autocorrelation by Cliff and
Ord (1973) and on space-time processes by Bennett (1979) are good examples.
They provide elegant solutions to complex statistical problems concerned
with the spatial, and temporal, dependency of zonal data but in so doing they
deny the existence of the MAUP. For example, the expected moments of Cliff
and Ord's spatial autocorrelation statistic can be computed under two
different sets of assumptions, both of which assume that zonal data are fixed.
Yet spatial autocorrelation is a characteristic of zonal data which is
dependent on the choice of a particular zoning system. It can be varied by
manipulating the zoning system.

A final consideration is that when geographers express concern about
zoning systems it is mainly a reflection of problems of data comparability. It
is suggested that the current naive approach depends on two major assumptions
which are both incorrect. First, that the results will be substantially the
same even if different areal units are used. A corollary of this argument
would be that meaningful results can be obtained for virtually any set of
arbitrary areal units; this view is widely held. The aggregation experiments
reported in this section disprove this assumption. Second, that geographers
have little or no control over the zoning systems for which data are
available so that it is not practical to consider zoning systems as anything
other than fixed. This is an oversimplification because it is always possible
to seek to reaggregate zonal data in order to find a 'better' set of areal
units and thus recover from the effects of the initial aggregation. Why not
exploit the modifiable nature of areal units rather than passively accept
whatever zonal manipulations others perform on their behalf? Sadly, it seems
that many geographers are happier if they do not know about the effects of the
zonal manipulations that they or others perform.

(ii) Non-geographical solutions

The most convenient solution is to accept the normal science view that
zoning systems should be independent of the phenomena they are used to
report. This would allow the selection of areal units to be independent of the
subsequent analysis, and would partly justify the status quo. However, this
is at best an inherently non-geographical approach. The areal units being
studied should be meaningful in some way which is relevant to the purpose of
the study; therefore, it is argued that zoning systems cannot logically be
independent of the phenomena they represent. In this context independence
implies irrelevance. Nevertheless, a number of arbitrary zone design criteria
have been suggested; for instance, approximate equality of population and
zone shape compaction (Sammons, 1976; 1979); multiple design criteria (Masser
and Brown, 1978); and information statistics (Batty, 1978; Batty and Sammons,
1978). However, it is not apparent why geographers should only be interested
in areal units of a regular shape and size or in what way an information
statistic is an appropriate measure of the performance of a zoning system.
More to the point, how do you decide which criteria to use? How do you know
if their use is successful?

An example demonstrates the arbitrariness of these and other general
purpose zone design criteria. Table 15 shows the effects on the Iowa
correlation coefficient of the following:

(i)   the equal area, population, and compaction criteria of Sammons (1976);
(ii)  the spatial entropy statistic of Batty and Sammons (1978);
(iii) the minimum within-zone heterogeneity criteria of Cliff et al (1975);
(iv)  the maximum independent variable variance criteria (Cramer, 1964;
      Hannan, 1971);
(v)   the maximum relative variation of the independent variable
      (Blalock, 1964);
(vi)  the minimum standard error of the regression slope coefficient
      (Williams, 1976).

All these criteria were formulated as objective functions for the AZP and
solutions obtained.

Table 15. Effects of different zone design criteria on the Iowa correlation
          coefficient

                                             number of zones
design criteria                   6    12    18    24    30    36    42    48    54

equal area                       .40   .34   .31   .35   .39   .48   .24   .33   .32
equal population                 .88   .72   .63   .56   .59   .47   .50   .40   .55
equal density                    .03   .71   .52   .52   .53   .53   .53   .46   .56
compact zones                    .30   .12   .25   .30   .46   .03   .42   .26   .21
spatial entropy                  .90   .21   .26   .28   .54   .26   .33   .43   .46
zonal homogeneity                .49   .26   .42   .45   .37   .28   .31   .31   .33
independent variable variation   .64   .50   .42   .39   .54   .44   .42   .38   .33
relative variation               .68   .65   .40   .54   .47   .47   .35   .27   .42
standard error of slope          .99   .99   .97   .97   .95   .93   .90   .85   .81

Table 15 demonstrates that different criteria merely produce different
results. At best some of the criteria reduce the systematic effects of scale
but the levels of correlation largely reflect the nature of the criteria.
Since the choice of criteria is arbitrary, then so too are the results.
Worse still, the criteria are independent of any particular purpose so that
the results are largely meaningless.

(iii) A traditional geographical solution

Looked at in another way, the MAUP is fairly trivial. All that is needed
is for geographers to agree upon what constitutes the objects of geographical
enquiry. The MAUP exists because of uncertainty as to what are the spatial
entities which are being studied. Remove that uncertainty and the problem
disappears. Unfortunately, this task of identifying meaningful geographical
entities is a difficult one for many geographers to face because of the
traditional regional geography connotations. Additionally, different
definitions will be needed for different purposes.
The best examples of this approach have been the use of functional region
definitions of urban areas for studying census data (Spence et al, 1982;
Coombes et al, 1982). The justification here is that local authority
definitions are best provided by functional region definitions. This solution
clearly works well only when there is sufficient geographical knowledge to
define with a high degree of precision the sorts of areal units that are most
sensible for a particular purpose. There are many areas in geography where
it cannot be applied. Furthermore, this approach only removes the
aggregational uncertainty; the effects of the MAUP still survive and
condition the results.

(iv) Towards a new methodology for spatial study

Once it is accepted that the results of studying zonal data depend on the
particular zoning system that is being used, then it is no longer possible
to continue using a normal science paradigm. The data are not fixed so that
the results depend, at least in part, on the areal units that are being
studied; units which are essentially arbitrary and modifiable. The selection
of areal units, or zoning systems, cannot therefore be separate from, or
independent of, the purpose and process of a particular spatial analysis;
indeed it must be an integral part of it. This view conflicts with the current
use of scientific methods and statistical techniques in geography, and for
these reasons many geographers would refuse to consider it to be a viable
proposition.

Let us continue with the heresy a little longer. The problem is to invent
a new paradigm for spatial study which can explicitly handle the geography
of the MAUP. The most obvious approach is to reverse the normal science
paradigm. Instead of meekly accepting whatever result the choice of a
haphazard zoning system happens to produce, it is necessary to start by
specifying precisely what outcome is expected. This can take the form of a
hypothesis. If the desired result can be attained by solving the associated
automatic zoning problem, then what are the limits on both the range of
results and the range of zoning systems that produce similar outcomes? What
if anything does the geography of these optimal zoning systems tell us about
the hypothesis being studied? If the desired result cannot be attained
without violating either statistical assumptions or geographical factors,
then the associated hypothesis must be rejected.

The following methodology is suggested as being appropriate for a
geographical solution to the MAUP.

STEP 1. Define the purpose of the study in an explicit fashion. This can be
done by speculating as to what outcome is expected given prior knowledge or
what outcome is desired. For example, does a model that fits data in Nevada
also work in Iowa? This desired result would be expressed as a hypothesis
and set up as an objective function for the AZP. For example, to find out if
a model fits the Iowa data, minimise the model errors using the AZP. If the
question is whether a particular set of parameters can provide an acceptable
level of performance then again solve the associated AZP.

STEP 2. Try to obtain the desired result by identifying zoning systems which
approximately optimise the appropriate objective function using the AZP. For
example, minimise the differences between a set of target factor loadings and
values produced for a particular zoning system; the purpose here might be
to investigate whether a set of social area analysis results for one area
also apply to another.

STEP 3. Decide what the results mean in a statistical sense, if this is
appropriate, as well as in terms of the geography of the optimal zoning
systems. Have the target results been achieved with a tolerable degree of
error? If not, then the associated hypothesis must either be rejected or
changed, so return to STEP 1. If the results are acceptable, then have any
important statistical assumptions been violated? If constraints are needed
then go to STEP 4. What does the zoning system tell us about the geography
of the study area? The zoning system makes visible the interaction between
the data being aggregated and the hypothesis being studied and a study of
the nature of the zones may be very useful. How did the AZP optimise the
objective function? Is there a trivial spatial solution? If the number of
zones is changed what effect does this have? Finally, are the optimal zoning
systems satisfactory from a geographical point of view?

STEP 4. It may be necessary to introduce constraints to impose restrictions
on either the nature of the zones or on the properties of the data they
generate. These constraints are in addition to the usual contiguity
restrictions necessary to ensure that the zones are internally connected.
The AZP can handle either equality or inequality constraints. These
additional constraints represent a potentially important interface between
the geography and statistics of spatial study; for example, constraints to
ensure zero spatially autocorrelated data and a maximum zone size. The
feasibility of the additional constraints is partly related to the
aggregation factors involved; when large numbers of zones are being
aggregated then a large number of complex constraints can often be satisfied.

STEP 5. Now solve the constrained automatic zoning problem. If a
satisfactory result is found then return to STEP 3 for interpretation. If
not, then examine the consequences of failing to satisfy some or all of the
constraints. The extent to which various constraints can or cannot be
satisfied may also provide useful information about the nature of the problem
under study. In most cases an iterative process of experimentation is
probably needed.

Clearly the statistical power of this new approach is considerably less
than that promised by conventional methods. Hypothesis testing is used here
as a device for introducing an explicit purpose into the process of spatial
study. This is necessary so that whatever zonal entities are identified are
both purpose related and geographically meaningful. The new paradigm is
likely to be most useful when comparative studies are being performed.

Consider the previous correlation analysis for Iowa. Simply reporting
the level of correlation is not very useful because the result is zone
dependent. Similarly, it is no use testing whether the correlation
coefficient is significantly different from zero. Consider the related
linear regression model. If no prior information other than a model
specification is available, then a wide range of different results can be
obtained depending on the choice of zoning system. However, as we move
through the new paradigm, the introduction of various statistical and
geographical constraints reduces the range of alternatives, although there
may still be a number of different results that require interpretation and
explanation.

For example, suppose we wish to see whether a correlation of 0.8 between
old age and Republican voters reported from some other study area also
occurs in Iowa. Suppose also that you want the associated linear regression
model to satisfy various statistical assumptions; specifically, that the
residuals have zero spatial autocorrelation, that the spatial autocorrelation
of the predictor variable is zero, that the mean residual is zero, and that
the rank correlation between the absolute residuals and the independent
variable is also zero (a residual homoscedasticity constraint). The zoning
systems shown in Figure 6 satisfy these assumptions and the range of
correlation is still large, between -.928 and +.993. All we can conclude from
this is that there is so much aggregational variability in these data that
the results are not meaningful despite the undoubted high degree of
statistical significance. Clearly additional constraints are needed and there
should be some basis for these restrictions. For example, if zonal population
sizes are restricted to about plus or minus 15 per cent of the average, then
the range of correlations is reduced to between .28 and .94. If area is used
instead of population then the range is slightly wider; .06 to .81. The
problem here is that there is no real basis for these size constraints. The
best strategy would be to combine the Iowa data with another data set (viz.
that for which the correlation of 0.8 was obtained). The AZP would be used to
identify optimal zoning systems for both data sets simultaneously (the
contiguity constraints would keep their zones apart). The resulting map
patterns for a global correlation of 0.8 could then be examined.
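
The suggestion of pooling two study areas in a single zoning problem can be
illustrated with a small sketch. The Python fragment below is an illustration
only and is not the AZP itself: the helper names (combine_areas,
is_connected, global_correlation) and the toy data are hypothetical, and it
is assumed that zonal values are formed by summing base-unit counts. It shows
how the absence of contiguity links between the two areas prevents any zone
from spanning both, while one correlation is still computed over all zones.

```python
from math import sqrt


def combine_areas(areas):
    """Pool study areas given as (neighbours, x, y) triples.

    Unit indices are offset so that no contiguity link joins two areas.
    """
    neighbours, xs, ys = {}, [], []
    for nb, x, y in areas:
        offset = len(xs)
        for unit, adjacent in nb.items():
            neighbours[unit + offset] = {j + offset for j in adjacent}
        xs.extend(x)
        ys.extend(y)
    return neighbours, xs, ys


def is_connected(members, neighbours):
    """True if the units in `members` form one contiguous zone."""
    members = set(members)
    start = next(iter(members))
    seen, stack = {start}, [start]
    while stack:
        for j in neighbours[stack.pop()] & members:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen == members


def pearson(a, b):
    """Product-moment correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / sqrt(saa * sbb)


def global_correlation(zone_of, xs, ys):
    """Correlation across the zones of all areas (zonal values are sums)."""
    m = max(zone_of) + 1
    zx, zy = [0.0] * m, [0.0] * m
    for unit, zone in enumerate(zone_of):
        zx[zone] += xs[unit]
        zy[zone] += ys[unit]
    return pearson(zx, zy)


# Toy data (assumed): area A is a chain of four units, area B of three.
area_a = ({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}, [1, 2, 3, 4], [2, 4, 6, 8])
area_b = ({0: {1}, 1: {0, 2}, 2: {1}}, [5, 6, 7], [1, 1, 1])
neighbours, xs, ys = combine_areas([area_a, area_b])

zone_of = [0, 0, 1, 1, 2, 2, 3]  # two zones in A, two zones in B
r = global_correlation(zone_of, xs, ys)
```

A candidate zone made of the last unit of area A and the first unit of area B
would fail is_connected, which is the sense in which the contiguity
constraints keep the zones of the two study areas apart.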

Figure 6a. Zoning system that minimises the correlation coefficient subject
to constraints (r = -.928, intercept = 5.0, slope = -3.12, spatial
autocorrelation of residuals = 0.0, spatial autocorrelation of
independent variable = 0.0, homoscedasticity = 0.0)

The idea then is to use the optimal zoning approach to test hypotheses
by manipulating the aggregation process. Instead of asking whether a result
obtained in study area A is different from a result for study area B, it is
necessary to consider the range of results that can be produced for both A
and B. Instead of trying to fit a model to an arbitrary zoning system, it is
necessary to consider which zoning systems provide the best results and to
consider what properties they, or the aggregated data they produce, should
have. The map patterns produced by optimal zoning systems for particular
purposes may themselves contribute to the spatial analysis process.
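
The contrast between one reported result and the range of results obtainable
from different zoning systems can be sketched as follows. This is a rough
illustration under stated assumptions, and is neither the AZP nor the random
aggregation procedure of Openshaw (1977b): base units are assumed to lie on
a rectangular grid, zones are grown from random seed units so that each zone
is contiguous, zonal values are sums, and the data are synthetic. All
function names are invented for the example.

```python
import random
from math import sqrt


def grid_neighbours(rows, cols):
    """Rook contiguity for a rows x cols grid of base units."""
    nb = {}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            adj = set()
            if r > 0:
                adj.add(i - cols)
            if r < rows - 1:
                adj.add(i + cols)
            if c > 0:
                adj.add(i - 1)
            if c < cols - 1:
                adj.add(i + 1)
            nb[i] = adj
    return nb


def random_zoning(nb, m, rng):
    """Grow m contiguous zones outwards from m randomly chosen seed units."""
    units = sorted(nb)
    zone_of = {seed: z for z, seed in enumerate(rng.sample(units, m))}
    while len(zone_of) < len(units):
        # Pairs (unassigned unit, adjacent unit already in a zone).
        frontier = [(u, v) for u in units if u not in zone_of
                    for v in nb[u] if v in zone_of]
        u, v = rng.choice(frontier)
        zone_of[u] = zone_of[v]
    return zone_of


def zonal_correlation(zone_of, x, y, m):
    """Correlation between zonal sums of x and y."""
    zx, zy = [0.0] * m, [0.0] * m
    for u, z in zone_of.items():
        zx[z] += x[u]
        zy[z] += y[u]
    mx, my = sum(zx) / m, sum(zy) / m
    sxy = sum((a - mx) * (b - my) for a, b in zip(zx, zy))
    sxx = sum((a - mx) ** 2 for a in zx)
    syy = sum((b - my) ** 2 for b in zy)
    return sxy / sqrt(sxx * syy)


def correlation_range(nb, x, y, m, trials, seed=0):
    """Smallest and largest r found over `trials` random zoning systems."""
    rng = random.Random(seed)
    rs = [zonal_correlation(random_zoning(nb, m, rng), x, y, m)
          for _ in range(trials)]
    return min(rs), max(rs)


# Synthetic data (assumed): 36 base units aggregated into 6 zones.
data_rng = random.Random(1)
nb = grid_neighbours(6, 6)
x = [data_rng.random() for _ in range(36)]
y = [xi + data_rng.random() for xi in x]
lo, hi = correlation_range(nb, x, y, m=6, trials=200, seed=42)
print(f"36 units into 6 zones: r ranges from {lo:.2f} to {hi:.2f}")
```

Random sampling of this kind only describes the spread of results; an
optimising procedure such as the AZP would instead search for the zoning
systems at the extremes, or for those that meet a stated target.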
VI CONCLUSIONS
as having the potential to open up an entirely new approach to the study of
spatial data as well as offering a general methodological framework into
which any existing model or technique can be incorporated. It is argued that
this constitutes the beginning of a new era which will be characterised by
the development of a more relevant and more appropriate core of geographical
analysis techniques. It would seem that the adoption of an exceptionalist
position is a basic prerequisite for this development to take place.

Critics will argue that the 'cure' in the form of AZP appears to be no
better than the disease and will inevitably result in difficulties in making
generalisations outside of a particular zoning system for a particular data
set. The answer to this latter problem is straightforward. All that need be
done is to incorporate several different data sets in the same AZP problem
formulation. They will remain separate entities by virtue of having no
contiguity links but they will be linked through the definition of global
constraints (other than contiguities) and through a common objective function.
The answer to the first point is self-evident. The widespread and serious
impact of the MAUP on spatial study has been convincingly demonstrated so
it is no longer possible to simply ignore it. Thus it would seem that methods
which cannot cope with the MAUP should not be used. Currently there are no
convincing alternative methods for handling spatially grouped data in a
statistically sound framework. So why not investigate more radical
non-statistical frameworks, and what could possibly be better for a
geographer than a purely geographical approach?

The consequences of seriously accepting this challenge may well be
fundamental changes in the manner by which geographers analyse spatial data.
There has to be an admission of an approach to spatial study that is
tantamount to operating the normal science paradigm in reverse. This is
clearly non-scientific according to any contemporary liberal definition. It
would seem that while few geographers would question the utility of using the
AZP to identify ranges of possible results due to the MAUP, few have so far
shown any enthusiasm for going any further, let alone considering the
unimaginable horrors of scientific heresy. However, it is likely that the
former will inexorably lead to the latter. There have been paradigm shifts
before in science so why not a new one designed specially for geographers? It
should be appreciated that the AZP and its associated methodology offer as
yet the only practical working solution to the MAUP. There can be no real
doubts about its geographical nature but perhaps it is too geographical for
many modern geographers.

It is suggested therefore that the prospect is gradually dawning that the
MAUP is not so much an insoluble problem but rather a powerful analytical
tool ideally suited for probing the structure of areal data sets. The growing
speed of computers opens up the tremendous potential offered by heuristic
solution procedures, such as the AZP, to identify the most appropriate zoning
systems for any particular purpose without having to solve currently
intractable theoretical and analytical problems. That is to say, we do not as
yet fully understand the problem and we are certainly nowhere near to being
able to develop a calculus to handle it, but the problem can be solved or
turned around using what are essentially Monte Carlo optimisation methods.
Currently much can be done with small data sets and fairly complex models or
with larger data sets and simple models. Very soon it will be possible to
routinely apply the same methods to any spatial data set and any model or
function, no matter how complex. When this happens often enough then a new
geographical revolution will surely have occurred.

BIBLIOGRAPHY

Batty, M. (1978), Speculations on an information theoretic approach to spatial
representation. in: Spatial representation and spatial interaction, (eds)
I. Masser and P.J.B. Brown, (Martinus Nijhoff: Leiden), pp 115-147.

Batty, M. and Sammons, R. (1978), On searching for the most informative spatial
pattern. Environment and Planning A, 10, pp 747-749.

Batty, M. and Sikdar, P.K. (1982), Spatial aggregation in gravity models.
1. An information-theoretic framework. Environment and Planning A, 14,
pp 377-405.

Bianchi, G., Openshaw, S., Scattoni, P., Sforzi, F. and Wymer, C. (1981),
Analisi dell'area sociale: comparazione delle classificazioni condotte su
dati medi per sezioni di censimento e su data individuali. (Paper presented
at 2nd Italian Regional Science Conference, Napoli, October 19th-21st).

Bennett, R.J. (1979), Spatial time series: analysis, forecasting and
control, (Pion: London).

Blalock, H. (1964), Causal inferences in nonexperimental research,
(University of North Carolina Press: Chapel Hill).

Borgatta, E.F. and Jackson, D.J. (1980), Aggregate data: analysis and
interpretation, (Sage Publications: Beverly Hills).

Chapman, G.P. (1977), Human and environmental systems: a geographer's
appraisal, (Academic Press: New York).

Cliff, A.D. and Ord, J.K. (1973), Spatial autocorrelation, (Pion: London).

Cliff, A.D. and Ord, J.K. (1975), Model building and the analysis of spatial
pattern in human geography. Journal of the Royal Statistical Society Ser B,
37, pp 297-348.

Cliff, A.D., Haggett, P., Ord, J.K., Bassett, K. and Davies, R. (1975),
Elements of spatial structure: a quantitative approach, (Cambridge
University Press: London).

Coombes, M.G., Dixon, J.S., Goddard, J.B., Openshaw, S. and Taylor, P.J.
(1982), Functional regions for the population census of Great Britain. in:
Geography and the Urban Environment, (eds) D.T. Herbert and R.J. Johnston,
(Wiley: London), 5, pp 63-112.

Coombes, M.G. and Openshaw, S. (1982), The use and definition of travel to
work areas in Great Britain: some comments. Regional Studies, 16,
pp 141-149.

Cramer, J.S. (1964), Efficient grouping, regression, and correlation in Engel
curve analysis. Journal of the American Statistical Association, 59,
pp 233-250.

Evans, I.S. (1981), Census data handling. in: Quantitative Geography: a
British View, (eds) N. Wrigley and R.J. Bennett, (Routledge and Kegan Paul:
London), pp 46-59.

Gehlke, C.E. and Biehl, H. (1934), Certain effects of grouping upon the size
of the correlation coefficient in census tract material. Journal of the
American Statistical Association, Supplement, 29, pp 169-170.

Griffith, D.A. (1980), Towards a theory of spatial statistics. Geographical
Analysis, 12, pp 325-339.

Hannan, M.T. (1971), Aggregation and disaggregation in sociology,
(Lexington Books: Lexington, Mass.).

Johnston, R.J. and Rossiter, D.J. (1982), Constituency building, political
representation and electoral bias in urban England. in: Geography and the
Urban Environment, (eds) D.T. Herbert and R.J. Johnston, (Wiley: London), 5,
pp 113-156.

Keans, M. (1975), The size of the region-building problem. Environment and
Planning A, 7, pp 575-577.

Masser, I. and Brown, P.J.B. (1978), Spatial representation and spatial
interaction, (Martinus Nijhoff: Leiden).

McNeil, D.R. (1977), Interactive data analysis, (Wiley: London).

Openshaw, S. (1977a), A geographical solution to scale and aggregation
problems in region-building, partitioning, and spatial modelling.
Transactions of the Institute of British Geographers, New series, 2,
pp 459-472.

Openshaw, S. (1977b), Algorithm 3: a procedure to generate pseudorandom
aggregations of N zones into M zones, where M is less than N. Environment
and Planning A, 9, pp 1423-1428.

Openshaw, S. (1977c), Optimal zoning systems for spatial interaction models.
Environment and Planning A, 9, pp 169-184.

Openshaw, S. (1978a), An empirical study of some zone design criteria.
Environment and Planning A, 10, pp 781-794.

Openshaw, S. (1978b), An optimal zoning approach to the study of spatially
aggregated data. in: Spatial representation and spatial interaction, (eds)
I. Masser and P.J.B. Brown, (Martinus Nijhoff: Leiden).

Openshaw, S. (1981), Le problem de l'aggregation spatiale en geographie.
L'espace Geographique, 1, pp 15-24.

Openshaw, S. (1983), Ecological fallacies and the analysis of areal census
data. Environment and Planning A, (forthcoming).

Openshaw, S. and Taylor, P.J. (1979), A million or so correlation coefficients:
three experiments on the modifiable areal unit problem. in: Statistical
methods in the spatial sciences, (ed) N. Wrigley, (Pion: London), pp 127-144.

Openshaw, S. and Taylor, P.J. (1981), The modifiable areal unit problem. in:
Quantitative geography: a British View, (eds) N. Wrigley and R.J. Bennett,
(Routledge and Kegan Paul: London), pp 60-70.

Robinson, A.H. (1950), Ecological correlation and the behaviour of individuals.
American Sociological Review, 15, pp 351-357.

Sammons, R. (1976), Zoning systems for spatial models, (Reading Geographical
Paper 52, Department of Geography: Reading University).

Silk, J. (1979), Statistical concepts in geography, (Allen and Unwin:
London).

Spence, N., Gillespie, A., Goddard, J.B., Kennett, S., Pinch, S. and
Williams, A. (1982), British cities: an analysis of urban change,
(Pergamon: Oxford).

Taylor, P.J. (1977), Quantitative methods in geography, (Houghton Mifflin:
Boston).

Taylor, P.J. and Johnston, R.J. (1979), Geography of elections, (Penguin:
Harmondsworth).

Tukey, J.W. (1977), Exploratory data analysis, (Addison-Wesley: Reading,
Mass.).

Williams, I.N. (1976), Optimistic theory validation from spatially grouped
regression: theoretical aspects. Transactions of the Martin Centre, 1,
pp 113-145.

Williams, I.N. (1979), Some implications of the use of spatially grouped
data. in: Towards the dynamic analysis of spatial systems, (eds) R.L. Martin,
R.J. Bennett, and N.J. Thrift, (Pion: London), pp 53-64.

Yule, G.U. and Kendall, M.G. (1950), An introduction to the theory of
statistics, (Griffin: London).