
CLUSTER ANALYSIS FOR ARCHAEOMETRIC DATA ANALYSIS

Mike Baxter

A version of this paper is published as:

Baxter, M.J. (2008) Cluster analysis. In Liritzis I. (ed.), New Technologies in the
Archaeognostic Sciences, Gutenberg Press, Athens, Greece, 445-481. (Paper and book are
published in Greek)

Introduction

Cluster analysis (CA) is a generic term that refers to a range of methods aimed at
identifying groups in a set of data. In archaeology, to give only a few examples, CA has
been used to group artefacts on the basis of their chemical compositions; assemblages on
the basis of the similarity of their profiles; and to identify spatial clustering on the basis
of the location of artefacts in space. This paper concentrates on the first of these
applications, though most of the ideas to be discussed have general application.

The intention is to discuss the ideas that underpin different methods of CA in as non-
mathematical a way as possible. Some notation and ideas are unavoidable, and are mostly
discussed in the next section; more complex material is provided in the appendix. The
section on model-based clustering is more demanding than other sections. There are
many good books and articles on CA, and statistical software to implement the methods;
a selective review is provided towards the end of the paper.

The heart of the paper discusses the main types of CA that have either been widely used
in archaeometry, or could be used. Readers familiar with the subject will recognise that I
have been highly selective, but I have tried to comment on what seem to be the most
popular methods, as well as newer approaches that may be worth exploring.

Occasional reference is made to principal component analysis (PCA), and the reader will
need a working knowledge of PCA to follow some of the discussion. As used here, PCA
allows high-dimensional data to be viewed through a series of two-dimensional plots.
Any text on multivariate analysis, some of which are referenced later, should provide an
account.

Notation

For definiteness, assume we have n artefacts whose chemical composition has been measured with respect to p variables. The results may be collected in an n × p table of data, or data matrix, X, with typical element $x_{ij}$. The rows correspond to artefacts, and the term case will be used to refer to a row. Sometimes $\mathbf{x}_i$ will be used to refer to the 1 × p vector of values for case i.

Let $\bar{x}_j$ be the mean of variable j, and $s_j$ its estimated standard deviation. Usually the raw data matrix is modified in some way before CA. If $y_{ij} = x_{ij} - \bar{x}_j$ is used, the variable is said to be centred. If $y_{ij} = (x_{ij} - \bar{x}_j)/s_j$ is used the variable is standardized (the terms auto-scaled and normalized are also used, though the latter is best reserved for transformations that attempt to induce a normal distribution).

The modified data matrix arising from either of these treatments is Y, with typical element $y_{ij}$. In some approaches to CA the raw data are transformed to logarithms (to base 10) before centring or standardization. How the data should be treated prior to CA is a non-trivial issue that is discussed later.

Many methods of cluster analysis result in the identification of G groups, with the hope
that cases in a group are similar to each other and dissimilar from cases in other groups.
This introduces the idea of (dis-)similarity, which is critical to an understanding of how
many methods of CA work.

Many measures of (dis-)similarity can be defined, contributing to the many methods of CA available. A common measure of dissimilarity is Euclidean distance, or its square, the latter defined as

$$d_{ij}^2 = \sum_{k=1}^{p} (y_{ik} - y_{jk})^2 = (\mathbf{y}_i - \mathbf{y}_j)(\mathbf{y}_i - \mathbf{y}_j)^T \qquad (1)$$

where T indicates a vector transpose. Euclidean distance, $d_{ij}$, is just the generalization to p dimensions of distance as customarily measured in the real world.
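In R (the software discussed towards the end of the paper) these distances can be obtained with the dist function. A minimal sketch, using a small randomly generated matrix rather than any data from the paper:

Y <- matrix(rnorm(5 * 3), nrow = 5)  # 5 hypothetical cases, 3 variables
d <- dist(Y, method = "euclidean")   # Euclidean distances d_ij, lower triangle
d2 <- d^2                            # squared distances, as in equation (1)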

The definition in terms of vectors is not really necessary here, but is unavoidable in presenting Mahalanobis distance (MD), important in some archaeometric applications of CA. For a single group, with estimated covariance matrix S, the MD between cases i and j is defined as

$$(\mathbf{y}_i - \mathbf{y}_j) S^{-1} (\mathbf{y}_i - \mathbf{y}_j)^T. \qquad (2)$$

The estimated covariance matrix, S, assumes some importance in later discussions. It is a symmetric p × p matrix. The entry in row i and column j is $s_{ij}$, the estimated covariance between variables i and j. The ith diagonal element may be written as $s_{ii}$ or $s_i^2$ and is the estimated variance of variable i. The matrix is symmetric, so $s_{ij} = s_{ji}$. The notation $S_g$ will sometimes be used to emphasise that the estimate is for a particular group g, with corresponding population covariance matrix, $\Sigma_g$.

The covariance measures the strength of relationship between two variables, but the actual values of the $s_{ij}$ depend on the units of measurement, and can be difficult to interpret. For this reason it is often useful to scale them using $r_{ij} = s_{ij}/(s_i s_j)$ to get correlations, for which $-1 \leq r_{ij} \leq 1$. The p × p correlation matrix, R, has typical element $r_{ij}$, with the ith diagonal element (the correlation of a variable with itself) equal to 1.

If yj in the equation for MD is replaced by the vector of variable means (on the scale
defined by the yij) then we obtain the distance between a case, yi, which may or may not
be a member of the group, and the group centroid. Readers uncomfortable with the
mathematics here should know that the merit of MD, compared to Euclidean distance, is
that it makes allowance for the fact that variables may be correlated (common in
archaeometric data) in a manner that may be beneficial for the clustering process. This is
discussed in more detail later.
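Base R provides a mahalanobis function that returns squared MDs from each case to a given centre. A sketch, again with hypothetical data rather than data from the paper:

Y <- matrix(rnorm(30 * 4), nrow = 30)        # a hypothetical group of 30 cases, 4 variables
ctr <- colMeans(Y)                           # the group centroid
S <- cov(Y)                                  # estimated covariance matrix
md2 <- mahalanobis(Y, center = ctr, cov = S) # squared MD of each case to the centroid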

For some methods of CA it is necessary to have a measure of how good a clustering of G clusters is. Start with a single cluster; ideally we would like this to be as compact as possible, with individual cases close to the centroid (or mean) of the cluster. For a single case, i, in a single cluster, closeness to the centroid, $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)$, can be measured by

$$\sum_{j} (x_{ij} - \bar{x}_j)^2. \qquad (3)$$

If there are $n_g$ cases in cluster g, an overall measure of compactness is

$$S_g = \sum_{i} \sum_{j} (x_{ij} - \bar{x}_j)^2 \qquad (4)$$

where the first summation is over the $n_g$ cases in the cluster. A measure of how good the clustering is can then be defined by summing $S_g$ over the G clusters, to get $S_G$ as an overall measure. This is discussed further in the section on Ward's method.
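As a sketch of how $S_G$ might be computed in R for a given partition (the function name and arguments are illustrative, not from any package):

SG <- function(Y, groups) {
  # sum of S_g over clusters: squared deviations from each cluster's own centroid
  sum(sapply(unique(groups), function(g) {
    Yg <- Y[groups == g, , drop = FALSE]
    sum(sweep(Yg, 2, colMeans(Yg))^2)  # S_g for cluster g
  }))
}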

Some Approaches to Cluster Analysis

Hierarchical Clustering
Hierarchical agglomerative methods of CA are those most commonly used in archaeometry. Each case is initially treated as a single cluster, so there are n in all. The two most similar cases are merged to form a cluster of two cases, giving (n − 1) clusters. Thereafter, clusters are successively merged with each other (treating cases as clusters) on the basis of which are most similar. Eventually all cases are merged into a single cluster.

It is also possible to start by assuming that all cases belong to a single cluster and then to split clusters successively, one case at a time, until all cases are distinct. This method, hierarchical divisive clustering, is not much used in archaeometry, and is not considered further.

To merge clusters, a measure to determine how similar clusters are is needed. Similarity
can be defined in different ways, contributing to the vast number of ways in which a
cluster analysis can be carried out. In single-linkage analysis the similarity of two clusters
is measured by the smallest distance between two cases, one from each cluster. The two
clusters merged are those for which this smallest distance is smallest. In complete-
linkage analysis, similarity is defined by the largest distance between two cases, one from
each cluster, the clusters being merged for which this largest distance is smallest.

Single-linkage CA is rarely used in archaeometry because it tends to produce uninterpretable results unless the structure is obvious. It is sometimes useful for detecting outliers. A criticism of both methods is that the measure of similarity between clusters depends only on two cases, and fails to take account of group structure and other cases.

Average-linkage CA attempts to overcome this problem by defining similarity between clusters as the average distance between all possible pairs of cases, one from each cluster. It has probably been the most widely used method of CA in archaeometry.

The results from a hierarchical CA need to be interpreted. This is almost invariably done
using a dendrogram or tree diagram, an example of which is shown in Figure 1.
[Figure 1 near here: dendrogram; vertical axis labelled "Height", leaves labelled by case number.]
Figure 1: A dendrogram arising from an average-link cluster analysis of standardized data, for a 27 × 11 data matrix of medieval glass compositions. The data used are those given in Baxter (1989) and are a subset taken from Cox and Gillies (1986).

This is from an average-linkage analysis of a 27 × 11 matrix of standardized data of medieval glass compositions. Euclidean distance was used as the measure of dissimilarity. This can be thought of as a tree with "branches" and "leaves" corresponding to the numbered cases on the horizontal axis.

The vertical axis shows the dissimilarity level at which cases or clusters merge. Cases
that merge at a low level (e.g., 4 and 9) show a high level of chemical similarity.

Interpretation of dendrograms is usually subjective. Often what is done is to cut the tree at a level of (dis-)similarity that isolates the main branches, the leaves associated with the branches defining the clusters. This is not always straightforward, even in this rather simple case.

Cutting the tree at a value of 5 results in three clusters; cutting at 4 results in three clusters and an outlier; cutting at just above 2 would result in four clusters and two outliers, with the main cluster to the right being split in two. This interpretive strategy is commonly used, though it is sometimes better, and legitimate, to cut different branches at different levels. There seem to be three fairly clear clusters and a possible outlier (20), but this should be checked if possible. The appearance of a dendrogram depends on the choice of style (see Figure 3), choice of method, and the distance measure used; squared Euclidean, as opposed to Euclidean, distance will often result in dendrograms showing apparently clearer clustering.
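In R an analysis of this kind might be sketched as follows, assuming the 27 × 11 glass data are held in a matrix X (not supplied here); cutree implements the tree-cutting just described:

Y <- scale(X)                             # standardize the data
hc <- hclust(dist(Y), method = "average") # average-linkage CA on Euclidean distances
plot(hc)                                  # draw the dendrogram
cutree(hc, h = 4)                         # cut the tree at height 4
cutree(hc, k = 3)                         # or ask directly for three clusters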

Methods of cluster analysis are designed to identify clusters and may do so even when there are no groups in the data. Some form of cluster validation is, therefore, desirable, and is discussed later. One simple and often effective way of confirming the
interpretation of a dendrogram is to examine principal component (PC) plots on which
the points are labelled. This is done in Figure 2, where case numbering is as in Figure 1.
Inspection shows the three groups and outlier suggested by the CA are consistent with the
PC plots. An alternative would be to label points according to the cluster they are in.
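A sketch of the kind of check just described, assuming Y is the standardized data and cl a vector of cluster labels from the CA:

pc <- prcomp(Y)                  # principal components of the treated data
plot(pc$x[, 1:2], type = "n", xlab = "component 1", ylab = "component 2")
text(pc$x[, 1:2], labels = cl)   # or labels = seq_len(nrow(Y)) for case numbers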

[Figure 2 near here: scatterplot of the first two principal components, axes labelled "component 1" and "component 2", points labelled by case number.]

Figure 2: A plot based on the first two principal components (PCs) from a principal
component analysis (PCA) of the data used to obtain Figure 1. The intention is to show
the fairly clear clustering, but with possible outliers.

Ward's method
Ward's method has been widely used in archaeology, and was particularly popular in the 1970s and 80s. In archaeometric investigations it is usually used as an exploratory hierarchical cluster technique in the same spirit as the linkage methods. It possesses distinctive characteristics, however, that provide a useful lead into other kinds of CA, so it is discussed separately here.

Ward's method is initiated in the same way as the linkage methods to give (n − 1) clusters, often using squared Euclidean distance as a dissimilarity measure. The quality of the clustering can be measured using the term $S_G$, defined in the section on notation, where G = (n − 1). In the further merging of clusters, remembering that individual cases are clusters, the value of G is reduced by 1, while $S_G$ is increased. It is easy to show that any merge will increase $S_G$, and the merge is chosen for which this increase is least.

Users should be aware of several aspects of Ward's method. The linkage methods described are essentially grouping algorithms with no firm basis in statistical theory. That they are widely used is presumably because they have seemed sensible to the people who devised them, and have found favour with practitioners. Ward's method, by contrast, attempts to optimise an explicit objective function, $S_G$.

All the agglomerative methods discussed suffer from the drawback that once a merge is made it cannot be undone. Ward's method is no exception, and the word "attempts" was used above because often $S_G$ will not be optimised. That is, given any specific partition into G clusters produced by Ward's method, it may be possible to improve $S_G$ by reallocating cases between clusters. This is the basis of k-means methodology, discussed in the next section.

Users new to cluster analysis sometimes find that they like Ward's method, compared to other alternatives, because it produces apparently clear and well separated clusters more readily. This can be a delusion; the method can suggest clusters quite clearly, even when none exist. This behaviour can be understood by viewing Ward's method as a special case of a model-based method. Such methods are discussed later.

To illustrate the problem, Figure 3 shows the dendrogram for a Ward's method analysis of standardized data. It is tempting to conclude that there are two clear clusters in the data (note the different style of presentation from that used in Figure 1). The data used were 50 cases randomly generated from a two-dimensional multivariate normal distribution. The observed correlation between the variables is 0.79. The data are plotted in Figure 4, with cases labelled according to which of the clusters they belong to. It is quite clear that the distinctive separation suggested by Figure 3 is misleading.
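A demonstration of this kind is easily sketched in R; the parameter values below are illustrative, not those used for Figures 3 and 4:

library(MASS)                            # for mvrnorm
set.seed(1)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2) # correlated bivariate normal, r = 0.8
X <- mvrnorm(50, mu = c(10, 10), Sigma = Sigma)
hc <- hclust(dist(scale(X)), method = "ward.D2")  # Ward's method in current R
plot(hc)                                 # two apparently clear clusters will often appear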

[Figure 3 near here: dendrogram; vertical axis labelled "Height", leaves labelled by case number (1-50).]
Figure 3: The dendrogram arising from a Ward's method analysis of standardized data generated randomly from a bivariate normal distribution. This suggests, clearly but incorrectly, that there are two distinct clusters.

This is not to say that Ward's method will always produce poor results. It will tend to impose a certain kind of structure on the data, which can be understood in terms of the assumptions it implicitly embodies (see later for details). When these assumptions are satisfied it can work well. Empirical experience suggests that other methods that are not model-based can also impose inappropriate structure on data, but because of their lack of a theoretical grounding the reasons are less well understood than for Ward's method.

K-means and related methods


As already noted, the idea behind Ward's method is to try and minimise a particular objective function, $S_G$, for any given level of clustering, G. In general, the method will not achieve the optimum, because merges that occur during the early stages of clustering cannot be undone.

[Figure 4 near here: scatterplot of the two variables, points labelled 1 or 2 according to the clustering suggested by Figure 3.]

Figure 4: The data used to generate Figure 3, labelled by the apparent clustering
suggested by that figure.

One way round this problem is, given a choice of G, to reallocate cases between clusters
in order to reduce SG. This reallocation proceeds in an iterative manner until no further
reduction is possible. This is a particular example of k-means clustering.

The method has been used in archaeometric applications, but less so than one might think, given that, in the context of Ward's method at least, it can only improve a clustering. One possible reason for the relative lack of use, compared to hierarchical methods, is that a simple representation of the outcome in the form of a dendrogram is not possible.

Choice of the appropriate number of clusters is not straightforward. To apply the method
for fixed G, a starting partition is needed to initiate the iterative reallocation procedure,
and the final partition may be a local rather than global minimum. That is, results may
depend on the starting position, so the choice of a good starting partition is helpful. Using
clusters derived from an application of Ward's method is one possibility.
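A sketch of this strategy, assuming Y is the treated data matrix and hc a Ward's method tree of the kind produced above:

G <- 3
start <- cutree(hc, k = G)                                  # Ward's method partition
centres <- apply(Y, 2, function(v) tapply(v, start, mean))  # G x p matrix of centroids
km <- kmeans(Y, centers = centres)                          # iterative reallocation
km$cluster                                                  # the refined partition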

As described above, k-means is based on squared Euclidean distance as a measure of dissimilarity. The idea can be extended to the use of general measures of dissimilarity. In k-medoids clustering, for example, a cluster centre is defined by a typical value called the "medoid". Cases are allocated to the cluster with the most similar medoid, this process proceeding iteratively. This was developed as a more robust alternative to k-means.

Figure 5 shows some output from a k-medoids analysis, using the partitioning around
medoids (PAM) method of Kaufman and Rousseeuw (1990), with Euclidean distance as
the dissimilarity measure. The data are the same as that used for Figure 1 and have been
standardized.

This is an example of a silhouette plot, shown here for the 3-cluster solution. If a(i) is the average dissimilarity of case i from other cases in its cluster, and b(i) is the average dissimilarity of i from cases in the closest other cluster, then their difference, scaled to a maximum of 1, is the silhouette width for case i. Cases with values near 1 are very well clustered; those with values near zero probably lie between two clusters; and those with negative values (of which there are none here) are probably in the wrong cluster. The silhouette width is given on the horizontal axis of the figure. Cases in clusters 1 and 2 are generally well-clustered, apart from case 22 in cluster 2. Cluster 3 is less well-defined (the silhouette widths are generally smaller) and case 20 is not well clustered.
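In symbols, the usual definition of the silhouette width is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

so that $-1 \leq s(i) \leq 1$.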

Given G, the methods discussed so far lead to each case being assigned to just one cluster. This is sometimes called a crisp clustering. In fuzzy clustering the membership of a case may be spread over several clusters. Slightly more formally, $\mu_{ig}$ is the membership of case i in group g, with $\mu_{ig} \geq 0$. Estimation of the $\mu_{ig}$ can be achieved by a fuzzy k-means algorithm (also called fuzzy c-means), of which there are several. Fuzzy clustering has been little used in archaeology (the appendix gives more technical detail).
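A sketch of a fuzzy analysis using the fanny function from the cluster package (discussed in the Software section), whose default fuzzification corresponds to the m = 2 of the appendix; Y is assumed to hold the treated data:

library(cluster)
fz <- fanny(Y, k = 3)   # fuzzy clustering of the treated data into three groups
head(fz$membership)     # one membership column per cluster, rows summing to 1
fz$clustering           # the nearest crisp clustering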

[Figure 5 near here: silhouette plot for the 3-cluster PAM solution (n = 27). Cluster sizes and average silhouette widths: cluster 1, 6 cases, 0.82; cluster 2, 10 cases, 0.73; cluster 3, 11 cases, 0.49. Overall average silhouette width: 0.65.]

Figure 5: An example of a silhouette plot, used in conjunction with the partitioning around medoids (PAM) method of Kaufman and Rousseeuw (1990).

Model-based methods
A known problem with Ward's method is that it will tend to produce spherical clusters of roughly equal size. The same phenomenon is sometimes observed with other clustering algorithms. This can be a problem with certain kinds of material studied in archaeometry, where the variables can be expected to be correlated (see Harbottle (1976) for a discussion of this), so that any clusters in the data can be expected to be (hyper-)ellipsoidal in shape, with no prior expectation that they are of the same size. Figures 3 and 4 illustrated the problem for a very simple data set showing moderate correlation.

In principle, one way to avoid this problem is to use model-based methods. In such studies, for G clusters, assume that the gth cluster is sampled from a multivariate normal distribution (MVN) with mean $\mu_g$ and covariance matrix $\Sigma_g$, which can be written as MVN($\mu_g$, $\Sigma_g$). A mixture model is obtained by assuming that the observed sample is drawn from a mixture of these multivariate normal distributions.

One approach is to estimate the parameters of the distributions, including the mixing proportions, $\pi_g$, by maximum likelihood. Given the estimates, the relative probabilities of cases belonging to the gth component can be determined, and cases can be assigned to the cluster for which they have the highest probability, if a crisp clustering is needed.

In classification maximum likelihood, cases are associated with labels, initially unknown,
that identify the cluster to which they belong. These labels are estimated and provide a
direct clustering of the data.

Some details are given in the appendix. The methodologies can be implemented in open-
source software discussed later.
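A sketch of a mixture-model analysis using the current mclust package (the successor to the mclust02 package mentioned in the Software section); Y is assumed to hold the treated data:

library(mclust)
mb <- Mclust(Y, G = 1:6) # fit mixtures with 1 to 6 components, model chosen by BIC
summary(mb)
mb$classification        # crisp allocation to the most probable component
mb$z                     # probabilities of component membership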

Bayesian methods of CA add extra structure to the mixture model in the form of prior
distributions for the unknown parameters in the model. This added complexity places the
methodology beyond the reach of the average non-statistical researcher, unless they can
find a suitable statistical collaborator. Bayesian CA has not been widely applied in
archaeometry; references to its use are provided later.

The methods described so far have developed in the statistical literature. The framework
sketched above provides a useful basis for describing methods developed by
archaeometricians, to deal with what they see as the peculiarities of archaeometric data.
The main idea now described, which is quite simple, can be implemented in more than
one way.

Determine a provisional grouping into G clusters, by any method that seems suitable.
Measure the Mahalanobis distance (MD) of every case to each cluster centroid and, if
necessary, re-allocate the case to the group with the nearest centroid. Repeat this process
until a stable clustering is obtained. This is similar to the idea used in k-means clustering as described earlier, but the use of MD means that the potentially ellipsoidal nature of clusters is accounted for.

One refinement, when calculating the MD of a case i to its own group centroid, is to use "leave-one-out" methods, where the centroid is calculated omitting case i. Another refinement
is to assume that, within clusters, data have an MVN distribution. This allows the MDs to
be converted to probabilities, so that the strength of cluster membership can be assessed.
Where the probabilities of cluster membership are sufficiently low for all clusters a case
can be declared an outlier, and possibly omitted from subsequent iterations.
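A minimal sketch of the basic reallocation idea (not any particular laboratory's implementation, and omitting the leave-one-out and outlier refinements); groups are assumed to be labelled 1, ..., G, with every group size comfortably greater than p:

md_realloc <- function(Y, groups, max_iter = 50) {
  for (iter in 1:max_iter) {
    # squared MD of every case to each group centroid
    D <- sapply(sort(unique(groups)), function(g) {
      Yg <- Y[groups == g, , drop = FALSE]
      mahalanobis(Y, colMeans(Yg), cov(Yg))
    })
    new_groups <- max.col(-D)            # group with the nearest centroid
    if (all(new_groups == groups)) break # stop when the partition is stable
    groups <- new_groups
  }
  groups
}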

These more specifically archaeometric approaches have mainly been developed in a small number of laboratories (e.g., Bieber et al. 1976, Glascock 1992, Beier and Mommsen 1994, Neff 2002), and many published applications to be found in the pages of Archaeometry and the Journal of Archaeological Science, among others, involve these authors and their co-workers. Implementation of the basic idea differs, and the reader is referred to the papers cited for fuller details. Baxter (2001a) and Baxter (2003: 97-99) summarise some of the differences in approach.

These methods can be viewed as model-based, to the extent that they depend on the MVN assumption to exploit their full power. My own (unpublished) experience of using these methods is that their application is as much "art" as "science", since a lot of decisions need to be made (e.g., numbers of clusters; starting partitions; outlier identification) that may require judgements that could differ from researcher to researcher.

A lot of practical issues have been avoided in the account given above. Issues concerning
data transformation (to achieve normality), treatment of outliers and the determination of
the numbers of clusters are discussed, in general terms, in the next section.

The main limitation, particularly when a large number of variables (say 20-30 in this
context) are involved, is the sample size requirement. For those methods explicitly based
on MD, a minimum requirement is that group sizes, ng, be greater than p. Unless ng is
somewhat greater than this, results can be unstable because estimates of the covariance
matrices, Sg, are unstable. Guidelines vary but, for example, ng/p > 3 has been suggested
as a minimum.

A little thought will suggest that for many data sets typical in archaeometric analysis, the full power of the methodology is not available. For n = 100, for example, with p = 20 and G = 5, the average group size is only 20 = p, so at least some group sizes will be too small to allow MD to be used.

Similar, and related, difficulties apply to the other model-based methods discussed. For data sets with typical p, constraints have to be imposed on the form of the covariance matrices, as there are otherwise more parameters to estimate than cases. Assuming equal covariance matrices is one common strategy, and this implies that clusters should be ellipsoidal, with the same orientation and similar size. If it is further assumed that the covariance matrices are diagonal, so $S = s^2 I$ where I is the p × p identity matrix, this amounts to an assumption that clusters are spherical and of equal size. It can be shown that this is essentially equivalent to Ward's method, which can thus be viewed as a model-based method that is pre-disposed to finding such clusters.

Some of the methods discussed have associated tests of hypotheses about the number of clusters in the data. They can also, in principle, cope with the
ellipsoidal nature of clusters that typifies some kinds of archaeological data. Apart from
their relative mathematical complexity, the main barriers to their wider use are practical.
They are worth further investigation, but a willingness to engage with the mathematics,
or a suitably qualified collaborator, is desirable.

Issues in Cluster Analysis

In this section a number of related issues that users of CA need to bear in mind, when
carrying out and reporting an analysis, are discussed.

Data transformation
Prior to a CA some form of data treatment is needed. The most commonly used treatments in archaeometry are standardization, or transformation to base 10 logarithms without subsequent standardization. Standardization is undertaken to give variables equal weight, so those with large variances don't, predictably, dominate an analysis. Logarithmic transformation will produce new variables with a similar order of magnitude, and some researchers assert that the transformed variables, particularly if they are trace elements, are more likely to have a normal (Gaussian) distribution within clusters. This can be of importance in model-based analyses where the normality assumption is used in the analysis.
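In R the two treatments amount to no more than the following, X being the raw data matrix:

Y_std <- scale(X)   # standardization: centre, then divide by standard deviations
Y_log <- log10(X)   # base 10 logs; requires strictly positive values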

Standardization of log-transformed data is sometimes used, but will often give similar
results to using standardization without transformation. This is because there is a
monotonic relationship between the raw and logged data, and standardization will tend to
convert values on either scale to a similar range of values. The exception to this
generalization is when transformation either down-weights the effect of cases that are
outliers on the original scale; or creates outliers not present on the original scale from
cases with values close to zero.

Archaeometric data of the kind discussed here are an example of compositional data. For fully compositional data, $x_{ij} \geq 0$ and, for fixed i, the $x_{ij}$ sum to 100%, assuming all measurements are in %. A sub-set of such data is sub-compositional. For j = 1, ..., (p − 1) it is possible to define ratios of the form $x_{ij}/x_{ip}$ and base analyses on these or their logarithms. This has been done, and debated, since the 1960s/70s (Wilson 1978) and, for statistical theoretical reasons, it has been argued that this is the only "correct" approach (Aitchison et al. 2002).
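As a sketch, additive log-ratios with the pth variable as divisor can be formed as follows (X must be strictly positive):

p <- ncol(X)
alr <- log(X[, -p] / X[, p])  # an n x (p - 1) matrix of log-ratios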

For analyses based on trace elements alone, log-ratio analysis (LRA), as it is called, is equivalent to the use of log-transformed data. For analyses that include major and minor elements/oxides the theoretical merits of LRA can be outweighed by poor performance, in terms of the interpretability of the results obtained, compared to the use of standardized data.
Baxter and Freestone (2006) discuss the issues involved, and provide references to
archaeometric applications of LRA.

Choice of method
Methods of hierarchical agglomerative CA are the workhorse of archaeometric
applications. As a minimum, the method used, including the measure of dissimilarity
chosen and decisions about data standardization or transformation, should be reported in
applications. Comment on why a particular approach was preferred is desirable, but is
often omitted. There is, in fact, little theoretical reason for choosing between the more
popular methods of CA, so that pragmatic considerations are acceptable, but they should
be stated.

It is unacceptable to try a variety of methods and report only the one that gives "good" results. It is easy, given the capabilities of modern software, to apply a variety of methods
to a data set so that it is pointless to urge researchers not to do this. What is important is
the honesty with which results are reported. If different analyses lead to similar
conclusions it is worth stressing this, since it tends to strengthen the conclusions. If
different methods lead to different results further investigation is necessary. It is possible
that different aspects of the data are being revealed, but also possible that the apparent
structure revealed by some methods is illusory.

One approach is to do an initial analysis using Ward's method, which tends to produce
more easily interpretable results than other methods. Given an initial clustering
determined in this way, output from other methods can be examined to see whether the
same structure is apparent, albeit in a possibly more noisy fashion.

With the caveats about sample and cluster sizes, output from a hierarchical analysis can
be used as the starting point for the iterative reallocation procedures discussed, or for
suggesting an appropriate number of clusters to investigate in a model-based method.

Variable selection
Baxter and Jackson (2001) discuss the issue of variable selection and only the most
important points are reviewed here. Variable selection is inevitable; the analytical
techniques used to generate compositional data invariably measure only a subset of the
elements in the periodic table. Before statistical analysis some of these are often
discarded, because of poor precision; more-or-less constant values; too many values
below the level of detection etc.

Once this selection has taken place, whether implicit or overt, a common view is that as many
variables as possible should be used in statistical analysis. This is wrong. Assume there
are clusters in a set of data, and use the term structure-carrying to refer to those
variables that help distinguish between at least two clusters (most simply, if there were
two clusters, then a variable with a bimodal distribution associated with the two clusters
would be structure carrying). It is easy to construct artificial examples where the effect of a large number of non-structure-carrying variables overwhelms the influence of the
structure-carrying variables, so that clusters may not be detected. There is no reason why
this should not be an issue with real data.

Identifying a problem is one thing; dealing with it is another. Statistical research on this
topic has not carried over to the archaeometric literature. Mathematically inclined
archaeometricians who find the problem interesting could usefully start with Friedman
and Meulman (2004). The less ambitious should be aware of the issue and be prepared to use their subject-based knowledge to identify variables which are likely to be structure-carrying, or not.

A fairly simple tactic, if tedious with a large number of variables, is to look at all possible
pairwise plots of the variables. This will often provide an indication of whether or not
there are clusters in the data, and which variables are most useful for identifying them. If
there are obvious and large clusters it is often useful to extract them from the data set and
repeat the above procedure on them. This may serve to identify further sub-clusters,
associated with variables not identified as structure-carrying in the first pass through the
data.

Outliers
In CA an outlier can be defined, loosely, as a case that is distant from, or cannot be
comfortably associated with, any of the clusters in the data. The presence of outliers in a
set of data can, in principle, distort the appearance of the dendrogram in hierarchical CA,
and invalidates the normality assumption used in model-based methods. It is sensible,
therefore, to try and identify the more obvious outliers in a set of data and remove these
before further analysis (for separate discussion), if the main aim is to identify groups in a
data set. Where less obvious outliers are identified in the course of analysis, it is often
sensible to remove these as well, and proceed in an iterative fashion. Fuller discussion of
some of the points now raised is given in Baxter (1999).

Given the extensive statistical literature on outlier detection, surprisingly little of it is directly relevant to archaeometric problems. This is because much of it is
concerned with detecting outliers relative to what is otherwise a single cluster of data.
Identifying outliers relative to several clusters, where these are initially unknown, and
where their definition may be affected by outliers, has received much less attention.
Relatively informal methods of outlier detection are often quite effective.

Univariate and bivariate inspection will usually serve to identify gross outliers. Plots
based on the first few principal components (PCs) will also identify the more obvious
outliers, and some less obvious. Since clear outliers can distort the appearance of plots
based on the PCs it helps to remove them and repeat the process iteratively, to identify
more subtle outliers. The principle at work here is that cases that are distant from the
bulk of the data on the first few PCs will be distant in the full p-dimensional space.

The process just described can lead to the identification of a relatively large number of
outliers. In the study by Papageorgiou et al. (2001) of the compositions of 130 specimens of Late Roman Cooking Ware from the Balearic Islands, 22 cases were identified as
outliers and removed prior to the application of various clustering methods.

The detection of outliers is built in to those model-based methods that use Mahalanobis
distance. A good example is provided by the study of Olmec pottery production in
Blomster et al. (2005), in which 188 out of 944 cases were judged to be outliers, or were
not assigned to clusters. Some researchers are uneasy about the removal from the final
analysis of a lot of outliers or of any outliers at all, on the grounds that data are being
ignored or manipulated to get results congenial to the investigator. If a primary aim of an
analysis is to identify the main groups or pattern in a set of data, this concern seems to me
misguided. There is no logical reason why all cases should be assignable, with reasonable
confidence, to a cluster, and no logical reason for expecting only a very small number of
outliers (relative to the main clusters) in a large data set.

Number of clusters and cluster validation


For all the methods discussed, a decision has to be made about the number of clusters, G,
to report and interpret. With obvious structure in a data set, different methods are likely to
lead to the same conclusions (with the caveat that clusters need to be large enough for
some of the model-based methods to be used). Often CA may not even be necessary in
these circumstances.

For the hierarchical methods a decision is often made on the basis of the appearance of
the dendrogram and we have seen, in Figures 3 and 4, that this can be misleading. It is
common to cut the tree at a particular level, but often better to cut at different levels to
isolate the more (visually) distinct branches.

Formal approaches to determining the number of clusters in a set of data are discussed, for example, in Everitt et al. (2001: 177-196), but have been little used in archaeometry. Similarly, some of the model-based methods are associated with tests for the number of clusters, but these have been little used. More informal, graphical methods are often useful.

For example, suppose a range of possible values of G are suggested by a CA, which may be any of those discussed above. Carrying out a principal component analysis (PCA) and producing component (PC) plots, labelled by cluster membership, for competing values of G will often be informative. If G is too small then groups separated on the PC plots may have the same labels; if G is too large, cases within the same group suggested by the PC plots may have different labels.

The more obvious structure will often be apparent on a plot of the first two PCs, but it is
worth inspecting all possible pairwise plots for, say, the first four or five PCs. This is
because some of the clusters suggested by a CA may not obviously separate out on the
first two PCs, but do so using the others.

Another useful tactic is to strip out very obvious outliers and clusters from the data (those
that are suggested by the CA and clearly distinct from the rest of the data) and repeat both the CA and inspection using PCs with what remains. The aim here is to see if the
structure suggested in the original analysis remains, or whether other structure, obscured
in the original analysis, exists.

These informal methods are not foolproof, but often work well in application. Sometimes the iterative application of PCA, independently of CA, is sufficient to reveal the structure in the data, and it can be viewed as an informal approach to CA, if used in this way.

Further Reading

Books written specifically for archaeologists that discuss CA include, in ascending order of difficulty, Shennan (1997), Baxter (1994) and Baxter (2003). Not all the methods discussed in this paper are covered in these books.

General statistical texts, with a wider coverage, include Everitt et al. (2001), which is
devoted to CA, accessible, and includes archaeological examples. Gordon (1999), at a
more advanced level, is devoted to the subject of classification.

Good statistical texts on multivariate analysis, with treatments of CA, abound. They include Manly (2004), Everitt and Dunn (2001), Krzanowski (2000), Krzanowski and Marriott (1995), and Seber (1984). This is in rough order of difficulty, Manly's text being the most introductory.

Articles evaluating or developing aspects of the use of multivariate methods in archaeometry include Bieber et al. (1976), Pollard (1986), Glascock (1992), Beier and Mommsen (1994), Baxter and Buck (2000), Baxter (2001b) and Neff (2002). Several of these outline approaches that have been developed in particular laboratories.

For archaeometric applications of CA, much of it very standard, with an often cursory
discussion of CA, the journals Archaeometry, the Journal of Archaeological Science, and
the published proceedings of Archaeometry conferences are good sources. Baxter (1994: 79-81) has a (now dated) review. Papers written or co-authored by the researchers noted
in the previous paragraph are often of more than average interest.

For newer approaches to cluster analysis, that have mostly had limited application in
archaeometry, Everitt et al. (2001) is probably the most accessible statistical text for a
non-statistical readership and includes material on mixture models, k-medoids and fuzzy
CA. Ripley (1996), Hastie et al. (2001) and Webb (2002) are all good, but at a more
advanced level. They are primarily concerned with methods of supervised pattern
recognition (e.g., discriminant analysis, classification trees, neural networks), but have
chapters on unsupervised pattern recognition that cover many of the newer methods.

Kaufman and Rousseeuw (1990) present a variety of methods for robust CA, including
PAM. Some of the text, particularly computational details, is outdated, but the book provides useful background for the implementation in R (see below). See, also, Struyf et
al. (1996).

Fuzzy CA has been little used, to date, in archaeometry. The examples given in Everitt et
al. (2001) and Baxter (2006) use real archaeometric data, but are essentially illustrative.

Banfield and Raftery (1993) is a useful starting point for a statistical treatment of model-based clustering, and Fraley and Raftery (2007) provide updated computational information. Fraley's website (http://www.stat.washington.edu/fraley/ - accessed
20/02/07) is a useful resource for keeping track of developments. Papageorgiou et al.
(2001) is one of the more detailed explorations of the merits of model-based methods as
applied to archaeometric data. Hall (2004) and Hall and Minyaev (2002) provide other
applications.

Buck et al. (1996) is the best starting point for an exposition of the uses of Bayesian
methods in archaeology, with references to, and applications of, CA to archaeometric
data. Their pioneering work has not been emulated much, exceptions being Dellaportas
and Papageorgiou (2006) and Papageorgiou and Lyritzis (2007).

Software

The general purpose statistical software packages that I am familiar with, of the kind used
for teaching statistics to non-specialists, are typically menu driven and include a range of
hierarchical and k-means methods among their options. Everitt et al. (2001) review what
was current in about 2000; the review is inevitably a bit dated. In particular the open-source
software, R (see below), is not discussed.

Of commercially available software it is worth singling out CLUSTAN, developed by David Wishart and distributed by Clustan Ltd. (http://www.clustan.com/ - accessed 09/03/07). It is a specialised package for CA, with many more options than the general purpose packages. I have no recent personal experience of using it, but earlier versions were quite widely used in archaeology.

Non-specialist, menu-driven, software can be restrictive, both in the options allowed and
the control one has over the presentation of results, which can be unsatisfactory. To obtain
more control over presentation, and explore some of the more complex methodologies
available, more powerful software is needed, and the open-source package, R, would be
the current choice of many statisticians.

R is command-driven (as opposed to menu-driven) with powerful graphical and programming facilities. It is developed and maintained by the R Development Team, and can be obtained from http://cran.r-project.org/ (accessed 09/03/07). It is updated on a regular basis; version 2.4.1 was current at the time of writing (now 2.9.2 when revising this).

For non-statisticians used to menu driven packages, coming to terms with R can be
initially difficult, but it is worth the effort. Apart from the comprehensive documentation,
Dalgaard (2002) is a good general introduction, while Venables and Ripley (2002) is
more advanced and has a section showing how some of the methods discussed here can
be applied.

A major attraction of R is that there are a large number of packages, designed by users for
specific tasks, which can be installed in addition to what comes with R, and this includes
packages for CA. The simpler hierarchical methods are available in the stats package
that comes with R; type ?hclust from within R to get information on what is available.
Similarly, ?kmeans provides help on k-means clustering.

Among available add-on packages, cluster implements the robust methods of Kaufman and Rousseeuw (1990), including k-medoids clustering using the pam function, and a robust version of fuzzy CA using the fanny function. Fuzzy CA is also available in the package e1071. The package mclust02 provides a variety of functions for model-based CA. The approaches to clustering designed by archaeometricians, discussed in the section on model-based clustering, are not immediately available, but could be programmed. I am not aware of R packages for Bayesian CA. Other R packages to do cluster analysis are available, some of which implement quite new methodology.

Example

To illustrate some of the ideas discussed above, a sample of fragments from 34 Romano-British cast glass bowls, measured with respect to the compositions of 11 oxides, will be used. The data are given in Baxter et al. (2005: 64). Oxides will be referred to by their associated element, and Si, obtained by differencing in the original paper, is not used here. Numbering is from 1-34, rather than the identifications in the original paper.

Initial, univariate, data inspection is illustrated in Figures 6 and 7, which are dotplots for
Fe and Na. There is some suggestion of grouping in Figure 6, suggesting there may be
clusters in the data. Grouping is possibly suggested in Figure 7 but is less obvious. Case 9
to the extreme left appears to be an outlier, but is not so extreme as to warrant immediate
exclusion from the analysis. One other case, 5, was a similar kind of outlier for K and Al,
but is also retained.

Figure 6: A dot-plot of Fe for the cast-bowl compositional data.

Figure 7: A dot-plot of Na for the cast-bowl compositional data.

Inspection of other dot-plots showed that Mn, P, Pb and Ti took on few values with no
evidence of clustering, so these were omitted from further analyses. This leaves seven
variables, Al, Ca, Fe, K, Mg, Na and Sb. A pairs plot (also called a draftsman plot, or
scatterplot matrix) is shown for these variables in Figure 8. Such plots show all possible
bivariate plots for the variables selected, the upper triangle of plots being the same as the
lower triangle, except that axes are reversed. There is a suggestion of grouping in several
of these plots, particularly those involving Fe, and a number of outliers are evident.

[Figure 8 near here: pairs plot (scatterplot matrix) of Al, Fe, Mg, Ca, Na, K and Sb.]

Figure 8: A pairs plot for the seven variables used for the cluster analysis of the cast-
bowl compositional data.

[Figure 9 near here: two dendrograms (Ward's method, top; average linkage, bottom); vertical axes labelled "Height", leaves labelled by case number.]

Figure 9: Ward's method (top) and average-linkage (bottom) dendrograms from the analysis of the standardized cast-bowl data, using Euclidean distance as a measure of dissimilarity.

In Figure 9 the dendrograms from some initial cluster analyses are shown: Ward's method for standardized data at the top, and average linkage beneath. Euclidean distance was used as the measure of similarity. Ward's method suggests, most clearly, two clusters, but could be cut to give three or four apparently distinct groups. We have seen how this can mislead.

For the two-cluster solution, the cluster to the left consists of cases (4, 7, 11, 14, 17, 25, 29, 30, 31, 33, 34), that is, 11 cases in all. Call this cluster 1. If the average linkage dendrogram is cut to give two clusters, the same cluster is obtained with the addition of cases 3 and 26, which appeared to the extreme left of the Ward's method dendrogram. These, and case 14, seem somewhat outlying relative to other cases in the cluster.

If the upper dendrogram in Figure 9 is cut to give four clusters, careful inspection of the
lower dendrogram shows that, other than cluster 1, these are not closely matched, though
some groups of cases such as (16, 18, 22, 27) do group similarly in both analyses. It is
usually easier to do this sort of comparison by labelling cases according to the cluster
membership suggested by the first analysis rather than by case number.

These initial analyses suggest that there are, perhaps, two main groups in the data, with a
number of outliers (or not easily clustered cases), indicated in the average linkage
analysis. Analysis could proceed in various ways at this point: for example, by labelling
points on the pairs plot by cluster membership, for G = 2, 3, 4. This approach will be
illustrated shortly, using principal components rather than the original variables, but first
some k-means analysis is undertaken.

Using the kmeans function from R, for G = 2, gave clusters of size 13 and 21, the smaller of these being cluster 1 plus (3, 26). Clustering was initiated using random starts, but these did not affect the results. For G = 3 and G = 4 the random starts did affect results. For 100 random starts the best results produced clusters of size 13, 11 and 10 for G = 3, the first of these being identical with the cluster of 13 for G = 2. For G = 4 the same procedure produced clusters of size 11, 2, 10 and 11, with the first two of these splitting the previous cluster of size 13 into cluster 1 and (3, 26). Carrying out a PCA and looking at the pairs plot for the first three PCs, labelled by the four-cluster solution, produces Figure 10.
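A sketch of the kind of call used (the seed is illustrative; the random starts behind the results above are not reproducible from the text):

set.seed(1)
km4 <- kmeans(scale(X), centers = 4, nstart = 100) # best of 100 random starts
table(km4$cluster)                                 # cluster sizes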

[Figure 10 near here: scatterplot matrix of the first three principal components (Comp.1, Comp.2, Comp.3), points labelled by the four k-means clusters.]

Figure 10: A pairs plot, based on the first three principal components of the standardized
glass bowl data, labelled by the cluster membership derived from a k-means CA with four
clusters.

Cluster 1 is associated with the crosses and (3, 26) with the triangles. The cases in cluster
1 plot tightly together with the exception of an outlier (relative to the cluster) evident on
the third component. This is case 14, which was suggested as unusual in the average-link
analysis. Case 14 has the highest values of Ca and Sb in the total sample, and the lowest
value of K within cluster 1. The affinity of (3, 26) with each other and cluster 1 is
evident, but they plot slightly separately on the plot for the first two components. There is
little evidence that the remaining two clusters are genuinely separate.

Using k-medoids clustering gives the same results as k-means for G = 2 and 3; for G = 4,
case 26 is separated from case 3 and cluster 1, but a silhouette plot suggests that it is
probably not a member of the cluster to which it is assigned. A silhouette plot for G = 3
suggests that cases 3, 14, and 26 lie between clusters, rather than belonging securely to
the group they are assigned to.

Fuzzy CA, implemented using the fanny function with m = 2 in the cluster package, produced a different crisp clustering, for G = 2, from those previously obtained, adding case 12 to cluster 1 and (3, 26). Its membership coefficient was, however, only 0.52 (contrasting with 0.57 for cases 3 and 26, and 0.59 for case 14). Using G = 3 removed case 12 from the cluster.

The overall picture is that there is one fairly tight cluster of 10 cases in the data, and
another more diffuse one, with (3, 26) also forming a small group having affinities with
the tight cluster. A number of cases, including 14, are not readily assigned to either of the
main clusters.

This example has been designed to illustrate some of the ideas mentioned in the text.
Rather than relying on a single method, the aim has been to show how the simpler
methods can be used to explore the data in a relatively informal way. The more complex
methods of CA have not been illustrated, but pose similar problems in assessing the
number and validity of clusters. The data set used is a relatively small one. Larger data
sets, or data with less clear structure, pose more problems but, with care, the ideas used
here can be applied.

Conclusion

Anyone reading this paper with little experience with CA, who thinks it might be useful
for their data, is advised to start with the simpler hierarchical techniques. For more
experienced, or adventurous, users there are several avenues that could be explored.

It would be useful, for example, to have a systematic investigation of how fuzzy CA or robust methods of CA perform across a range of archaeometric data sets. Similar comments apply to some of the more complex model-based methods, though I suspect sample and cluster size will be problematic.

Cluster analysis has been the most widely used method of multivariate analysis in
archaeology. It has not always been used well or interestingly, and some researchers now
treat it as a starting point, rather than end-point, of their statistical analysis. The
widespread use of CA, nevertheless, presumably reflects the fact that archaeologists and
archaeometricians have found it useful. Greater awareness of the potential problems in
applying CA, and in some cases better reporting of the results, would be desirable, but
CA can be a useful tool and is likely to remain as a commonly applied technique.

References

Aitchison, J., Barceló-Vidal, C., and Pawlowsky-Glahn, V. (2002) Some comments on compositional data analysis in archaeometry, in particular the fallacies in Tangri and Wright's dismissal of logratio analysis. Archaeometry, vol. 44, 295-304.

Banfield, J.D. and Raftery, A.E. (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics, vol. 49, 803-821.

Baxter, M.J. (1994) Exploratory multivariate analysis in archaeology, Edinburgh University Press, Edinburgh.

Baxter, M.J. (1999) Detecting multivariate outliers in artefact compositional data. Archaeometry, vol. 41, 321-338.

Baxter, M.J. (2001a) Statistical modelling of artefact compositional data. Archaeometry, vol. 43, 131-147.

Baxter, M.J. (2001b) Multivariate analysis in archaeology. In Handbook of archaeological sciences, D.R. Brothwell and A.M. Pollard (eds.), Wiley, Chichester, 685-694.

Baxter, M.J. (2003) Statistics in archaeology, Arnold, London.

Baxter, M.J. (2006) Supervised and unsupervised pattern recognition in archaeometry. Archaeometry, vol. 48, 671-694.

Baxter, M.J. and Buck, C.E. (2000) Data handling and statistical analysis. In Modern analytical methods in art and archaeology, E. Ciliberto and G. Spoto (eds.), Wiley, New York, 681-746.

Baxter, M.J. and Freestone, I.C. (2006) Log-ratio compositional data analysis in archaeometry. Archaeometry, vol. 48, 511-531.

Baxter, M.J., Cool, H.E.M. and Jackson, C.M. (2005) Further studies in the compositional analysis of colourless Romano-British vessel glass. Archaeometry, vol. 47, 47-68.

Baxter, M.J. and Jackson, C.M. (2001) Variable selection in artefact compositional studies. Archaeometry, vol. 43, 253-268.

Beier, T. and Mommsen, H. (1994) Modified Mahalanobis filters for grouping pottery by chemical composition. Archaeometry, vol. 36, 287-306.

Bieber, A.M., Brooks, D.W., Harbottle, G., and Sayre, E.V. (1976) Application of multivariate techniques to analytical data on Aegean ceramics. Archaeometry, vol. 18, 59-74.

Blomster, J.P., Neff, H. and Glascock, M.D. (2005) Olmec pottery production and export in ancient Mexico determined through elemental analysis. Science, vol. 307, 1068-1072.

Buck, C.E., Cavanagh, W.G. and Litton, C.D. (1996) Bayesian approach to interpreting archaeological data, Wiley, Chichester.

Cox, G.A. and Gillies, K.J.S. (1986) The X-ray fluorescence analysis of medieval durable blue soda glass from York Minster. Archaeometry, vol. 28, 57-68.

Dalgaard, P. (2002) Introductory statistics with R, Springer, New York.

Dellaportas, P. and Papageorgiou, I. (2006) Multivariate mixtures of normals with unknown number of components. Statistics and Computing, vol. 16, 57-68.

Everitt, B.S. and Dunn, G. (2001) Applied multivariate data analysis, 2nd edition, Arnold, London.

Everitt, B.S., Landau, S., and Leese, M. (2001) Cluster analysis, 4th edition, Arnold, London.

Fraley, C. and Raftery, A.E. (2007) Model-based methods of classification: using the mclust software in chemometrics. Journal of Statistical Software, vol. 18, issue 6.

Friedman, J.H. and Meulman, J.J. (2004) Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society B, vol. 66, 815-849.

Glascock, M.D. (1992) Characterization of archaeological ceramics at MURR by neutron activation analysis and multivariate statistics. In Chemical characterization of ceramic pastes in archaeology, H. Neff (ed.), Prehistory Press, Madison, WI, 11-26.

Gordon, A.D. (1999) Classification, 2nd edition, Chapman and Hall/CRC, London.

Hall, M.E. (2004) Pottery production during the Late Jomon period: insights from the chemical analyses of Kasori B pottery. Journal of Archaeological Science, vol. 31, 1439-1450.

Hall, M.E. and Minyaev, S. (2002) Chemical analyses of Xiong-nu pottery: a preliminary study of exchange and trade on the Inner Asian steppes. Journal of Archaeological Science, vol. 29, 135-144.

Harbottle, G. (1976) Activation analysis in archaeology. Radiochemistry, vol. 3, 33-72.

Hastie, T., Tibshirani, R., and Friedman, J. (2001) The elements of statistical learning, Springer, New York.

Kaufman, L. and Rousseeuw, P.J. (1990) Finding groups in data, Wiley, New York.

Krzanowski, W.J. (2000) Principles of multivariate analysis, 2nd edition, Oxford University Press, Oxford.

Krzanowski, W.J. and Marriott, F.H.C. (1995) Multivariate analysis: classification, covariance structures and repeated measurements, Edward Arnold, London.

Manly, B.F.J. (2004) Multivariate statistical methods: a primer, 3rd edition, Chapman and Hall/CRC, Baton Rouge, FL.

Neff, H. (2002) Quantitative techniques for analyzing ceramic compositional data. In Source determination by INAA and complementary mineralogical investigations, D.W. Glowacki and H. Neff (eds.), Monograph 44, The Cotsen Institute of Archaeology at UCLA, Los Angeles, 15-36.

Papageorgiou, I. and Lyritzis, I. (2007) Multivariate mixture of normals with unknown number of components: an application to cluster Neolithic ceramics from the Aegean and Asia Minor. Archaeometry, vol. 49, 795-813.

Papageorgiou, I., Baxter, M. and Cau, M.A. (2001) Model-based cluster analysis of artefact compositional data. Archaeometry, vol. 43, 571-588.

Pollard, A.M. (1986) Multivariate methods of data analysis. In Greek and Cypriot pottery: a review of scientific studies, R.E. Jones (ed.), British School at Athens Fitch Laboratory Occasional Paper 1, Athens, 56-83.

Ripley, B.D. (1996) Pattern recognition and neural networks, Cambridge University Press, Cambridge.

Seber, G.A.F. (1984) Multivariate observations, Wiley, New York.

Shennan, S. (1997) Quantifying archaeology, 2nd edition, Edinburgh University Press, Edinburgh.

Struyf, A., Hubert, M. and Rousseeuw, P.J. (1996) Clustering in an object-oriented environment. Journal of Statistical Software, vol. 1, 1-30.

Venables, W.N. and Ripley, B.D. (2002) Modern applied statistics with S, 4th edition, Springer, New York.

Webb, A.R. (2002) Statistical pattern recognition, 2nd edition, Wiley, New York.

Wilson, A.L. (1978) Elemental analysis of pottery in the study of its provenance: a review. Journal of Archaeological Science, vol. 5, 219-236.

Appendix

Fuzzy CA
As with k-means cluster analysis, in fuzzy clustering an objective function is minimized.
One possibility is to minimize

$$T_G = \sum_{g=1}^{G} \sum_{i=1}^{n} \mu_{ig}^{m} d_{ig}^{2}$$

where $\mu_{ig}$ is the membership of case i in group g, with $\mu_{ig} \geq 0$ and $\sum_{g=1}^{G} \mu_{ig} = 1$; m is a "fuzzification factor"; and $d_{ig}$ is the Euclidean distance of case i to the centroid of group g. If m = 1 a crisp clustering, equivalent to a non-fuzzy k-means clustering, is obtained; as m increases the classification becomes increasingly fuzzy, a totally fuzzy classification being one in which the membership of all G groups is equally likely. The choice of m = 2 is common.

Mixture models
The following short account is based on Papageorgiou et al. (2001). The observed data are $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ where $\mathbf{x}_i$ is a p-valued vector. If $\mathbf{x}_i$ is selected from the gth component of the mixture it is assumed to have probability density $f_g(\mathbf{x}_i; \boldsymbol{\mu}_g, \Sigma_g)$, where

$$f_g(\mathbf{x}_i; \boldsymbol{\mu}_g, \Sigma_g) = (2\pi)^{-p/2} \, |\Sigma_g|^{-1/2} \exp\{-(\mathbf{x}_i - \boldsymbol{\mu}_g)^T \Sigma_g^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_g)/2\}$$

and $\boldsymbol{\mu}_g$ and $\Sigma_g$ are the mean and covariance matrix of the gth component.

In the mixture maximum-likelihood approach the likelihood maximized is

$$\prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g f_g(\mathbf{x}_i; \boldsymbol{\mu}_g, \Sigma_g)$$

where $\sum_{g=1}^{G} \pi_g = 1$. Unless constraints are placed on the parameters, there are $(G/2)(p+1)(p+2) - 1$ parameters to estimate. For many problems the number of parameters to estimate will exceed the sample size, so that constraints must be imposed. Most usually the constraint that the covariance matrices, $\Sigma_g$, are the same is used.

In the classification maximum-likelihood approach, let $\gamma_i = g$ if $\mathbf{x}_i$ belongs to the gth component. Initially the values of $\gamma_i$ are unknown. The likelihood for the data can be written in the form

$$\prod_{i=1}^{n} f_{\gamma_i}(\mathbf{x}_i; \boldsymbol{\mu}_{\gamma_i}, \Sigma_{\gamma_i})$$

and the labels, $\gamma_i$, as well as the $\boldsymbol{\mu}_g$ and $\Sigma_g$, must be estimated. This results in a direct clustering of the observations. It is usually impractical to maximise the above likelihood without some constraints on the parameters.
