
CLUSTER ANALYSIS

V. K. Bhatia
I.A.S.R.I., Library Avenue, New Delhi-110 012
In a multivariate situation, the primary interest of the experimenter is to examine and
understand the relationships amongst the recorded traits. In other words, the simultaneous
study of several variables is of paramount importance. The data on several traits may be
broadly classified in the following two ways and can be studied by various statistical
techniques.
Case I: The set of variables constitutes a mixture of dependent and independent
variables.
In this situation, the relationship among the variables can be examined as follows:
1. Both sets of dependent and independent variables are quantitative:
      Multivariate multiple regression
      Canonical correlation
2. Dependent set of variables quantitative but independent set of variables
   qualitative:
      MANOVA
3. Set of binary (polytomous) dependent variables and set of independent quantitative
   variables:
      Discriminant analysis
      Logistic regression
      Multiple logistic/logit models
Case II: All the variables are of the same status and there is no distinction between
dependent, independent or target variables.
In such a situation, the structure among the variables can be examined as follows:
1. Reduction of variables:
      Principal components
2. Discovery of natural affinity groups:
      Cluster analysis
3. Identification of unobservable underlying factors:
      Factor analysis
From the above description of multivariate techniques, it is clear that cluster analysis
is a methodology used to find similar objects in a set on the basis of several traits.
There are various mathematical methods which help to sort objects into groups of similar
objects, each called a cluster. Cluster analysis is used in diverse research fields. In
biology, cluster analysis is used to identify diseases and their stages. For example, by
examining patients who are diagnosed as depressed, one finds that there are several
distinct sub-groups of patients with different types of depression. In marketing, cluster
analysis is used to identify persons with similar buying habits. By examining their
characteristics, it becomes possible to plan future marketing strategies more efficiently.
Although both cluster and discriminant analysis classify objects into categories,
discriminant analysis requires one to know group membership for the cases used to
derive the classification rule, whereas in cluster analysis group membership for all
cases is unknown. In addition to membership, the number of groups is also generally
unknown. In cluster analysis the units within a cluster are similar to each other but
different from the units in other clusters. The grouping is done on the basis of some
criterion such as a similarity measure. Thus, in cluster analysis the inputs are
similarity measures or the data from which these can be computed.
No generalisation about cluster analysis is possible, as a vast number of clustering
methods have been developed in several different fields, with different definitions of
clusters and similarities. There are many kinds of clusters, namely:
- Disjoint clusters, where every object appears in a single cluster.
- Hierarchical clusters, where one cluster can be completely contained in another
  cluster, but no other kind of overlap is permitted.
- Overlapping clusters.
- Fuzzy clusters, defined by a probability of membership of each object in each
  cluster.
1. Similarity Measures
A measure of closeness is required to form simple group structures from complex data
sets. A great deal of subjectivity is involved in the choice of similarity measures.
Important considerations are the nature of the variables (discrete, continuous or binary),
the scales of measurement (nominal, ordinal, interval, ratio, etc.) and subject matter
knowledge. If items are to be clustered, proximity is usually indicated by some sort of
distance. Variables, however, are grouped on the basis of some measure of association
such as the correlation coefficient. Some of the measures are given below.
Qualitative Variables
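Consider k variables observed on n units. For binary responses, the comparison of the
ith and jth units can be summarised as

                       Jth unit
Ith unit      Yes          No           Total
Yes           K11          K12          K11+K12
No            K21          K22          K21+K22
Total         K11+K21      K12+K22      K

Here K11 is the number of variables on which both units respond Yes, K22 is the number
on which both respond No, and K12 and K21 count the mismatches.

Simple matching coefficient (% matches):

$d_{ij} = (K_{11} + K_{22}) / K$,   $i, j = 1, 2, \ldots, n$

This can easily be generalised to polytomous responses.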
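
As a rough illustration, the simple matching coefficient can be computed by counting the variables on which two units agree, that is, the Yes/Yes and No/No cells of the table. The sketch below is written in Python purely for convenience and is not part of the SAS material; the function name and the two response vectors are hypothetical.

```python
def simple_matching(unit_i, unit_j):
    """Proportion of variables on which two units give the same binary response."""
    if len(unit_i) != len(unit_j):
        raise ValueError("both units must be scored on the same k variables")
    k11 = sum(1 for a, b in zip(unit_i, unit_j) if a == 1 and b == 1)  # Yes/Yes
    k22 = sum(1 for a, b in zip(unit_i, unit_j) if a == 0 and b == 0)  # No/No
    return (k11 + k22) / len(unit_i)


# Hypothetical responses of two units on k = 5 binary variables (1 = Yes, 0 = No).
unit_1 = [1, 0, 1, 1, 0]
unit_2 = [1, 1, 1, 0, 0]
print(simple_matching(unit_1, unit_2))  # 3 agreements out of 5 variables
```

A value of 1 means the two units agree on every variable, and 0 means they disagree on all of them.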


Quantitative Variables
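In the case of k quantitative variables recorded on n cases, the observations can be
expressed as

$$
\begin{pmatrix}
X_{11} & X_{12} & X_{13} & \cdots & X_{1k} \\
X_{21} & X_{22} & X_{23} & \cdots & X_{2k} \\
\vdots & \vdots & \vdots &        & \vdots \\
X_{n1} & X_{n2} & X_{n3} & \cdots & X_{nk}
\end{pmatrix}
$$

Similarity: $r_{ij}$ $(i, j = 1, 2, \ldots, n)$, the correlation between the $X_{ik}$'s and
the $X_{jk}$'s, that is, between the rows for cases i and j (not the same as the
correlation between variables).

Dissimilarity: the Euclidean distance, summed over the variables,

$$d_{ij} = \sqrt{\sum_{k} (X_{ik} - X_{jk})^2}$$

The X's are standardised. The distance can also be calculated for a single variable.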
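
The following Python sketch shows one way the Euclidean distance between cases might be computed after standardising each variable. It assumes standardisation to mean 0 and standard deviation 1 with an n - 1 divisor; the data values and function names are hypothetical, and the code is an illustration rather than the procedure used by any particular package.

```python
import math
import statistics


def standardise(rows):
    """Scale each column to mean 0 and standard deviation 1 (n - 1 divisor)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    # A constant column carries no information, so it is set to 0 (an assumption).
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


def euclidean(x_i, x_j):
    """Square root of the sum over variables of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


# Hypothetical data: n = 4 cases on k = 2 variables with very different scales.
data = [[1.0, 200.0], [2.0, 220.0], [3.0, 180.0], [4.0, 240.0]]
z = standardise(data)
distances = [[euclidean(a, b) for b in z] for a in z]

# The same function works for a single variable.
print(euclidean([3.0], [1.0]))
```

Without standardisation the second variable, which has much larger values, would dominate every distance.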


Hierarchical Agglomeration
Hierarchical clustering techniques proceed by either a series of successive mergers
(agglomerative methods) or a series of successive divisions (divisive methods).
Consider a natural process of grouping:
1. Each unit is an entity to start with.
2. Merge first the two units that are most similar (least $d_{ij}$); the merged pair
   now becomes an entity.
3. Examine the mutual distances between the (n-1) entities.
4. Merge the two that are most similar.
5. Repeat the process and go on merging till all are merged to form one entity.
6. At each stage of the agglomerative process, note the distance between the two
   merging entities.
7. Choose the stage which shows a sudden jump in this distance (since it indicates
   that two very dissimilar entities are being merged). This choice could be
   subjective.
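
The grouping process above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the distance between two entities is taken as the smallest distance between their units (the nearest-neighbour rule), the five points are hypothetical, and the "sudden jump" is simply taken to be the largest increase in merge distance.

```python
import math


def entity_distance(cluster_a, cluster_b, points):
    """Smallest distance between a unit of one entity and a unit of the other."""
    return min(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerate(points):
    """Repeatedly merge the two closest entities; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # each unit starts as an entity
    history = []  # one (merge distance, members of merged entity) pair per stage
    while len(clusters) > 1:
        # Find the pair of entities with the least distance.
        d, a, b = min(
            (entity_distance(clusters[a], clusters[b], points), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
        history.append((d, sorted(merged)))
    return history


def stage_before_largest_jump(history):
    """Index of the last merge before the biggest increase in merge distance."""
    heights = [d for d, _ in history]
    jumps = [heights[s + 1] - heights[s] for s in range(len(heights) - 1)]
    return jumps.index(max(jumps))


# Hypothetical units: two tight pairs and one distant unit.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (20.0, 0.0)]
history = agglomerate(points)
stop = stage_before_largest_jump(history)
for stage, (d, members) in enumerate(history):
    print(stage, d, members)
print("stop after stage", stop)
```

In practice the merge distances would be inspected by eye as well, since the largest jump is only one possible stopping rule.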
Distance between entities
A large number of methods are available for measuring the distance between entities, so
it is not possible to enumerate them all here, but some of them are:
- Single linkage: works on the principle of smallest distance, or nearest neighbour.
- Complete linkage: works on the principle of largest distance, or farthest
  neighbour.
- Average linkage: works on the principle of average distance (the average of the
  distances between each unit of one entity and each unit of the other entity).
- Centroid: this method assigns each item to the cluster having the nearest centroid
  (mean), as in the k-means approach. The process has three steps:
  1. Partition the items into k initial clusters.
  2. Proceed through the list of items, assigning an item to the cluster whose
     centroid (mean) is nearest. Recalculate the centroid (mean) for the cluster
     receiving the new item and for the cluster losing the item.
  3. Repeat step 2 until no more reassignments take place.
- Ward's method
- Two-stage density linkage:
  - Units are assigned to modal entities on the basis of densities (frequencies)
    (kth nearest neighbour).
  - Modal entities are allowed to join later on.
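
To make the difference between the first three linkage rules concrete, the sketch below computes the single, complete and average linkage distances between two small entities. It is an illustrative Python fragment with hypothetical points and function names, using plain (unsquared) Euclidean distances between units.

```python
import math
from itertools import product


def unit_distances(entity_a, entity_b):
    """All distances between a unit of one entity and a unit of the other."""
    return [math.dist(p, q) for p, q in product(entity_a, entity_b)]


def single_link(entity_a, entity_b):
    return min(unit_distances(entity_a, entity_b))  # nearest neighbour


def complete_link(entity_a, entity_b):
    return max(unit_distances(entity_a, entity_b))  # farthest neighbour


def average_link(entity_a, entity_b):
    d = unit_distances(entity_a, entity_b)
    return sum(d) / len(d)  # average over all pairs of units


# Two hypothetical entities of two units each, lying on a line.
entity_a = [(0.0, 0.0), (1.0, 0.0)]
entity_b = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(entity_a, entity_b))
print(complete_link(entity_a, entity_b))
print(average_link(entity_a, entity_b))
```

The same pair of entities is therefore "closer" or "farther" depending on the rule, which is why different linkage methods can produce different trees from the same data.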

SAS Cluster Procedure


The SAS procedures for clustering are oriented towards disjoint or hierarchical clusters
obtained from co-ordinate data, distance data, or a correlation or covariance matrix. The
following procedures are used for clustering:

CLUSTER    Does hierarchical clustering of observations.

FASTCLUS   Finds disjoint clusters of observations using a k-means method applied to
           co-ordinate data. Recommended for large data sets.

VARCLUS    Does both hierarchical and disjoint clustering of variables.

TREE       Draws tree diagrams, or dendrograms, using output from the CLUSTER
           or VARCLUS procedures.
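
The k-means idea mentioned for FASTCLUS follows the three centroid steps listed earlier: start from an initial partition, move each item to the cluster with the nearest centroid, and repeat until nothing moves. The Python sketch below illustrates those steps only; it is not the algorithm FASTCLUS actually implements, and the items, the initial partition and the function names are hypothetical.

```python
import math


def centroid(members):
    """Mean of each co-ordinate over the items in a cluster."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def k_means(items, labels, k, max_passes=100):
    """Item-by-item reassignment to the nearest centroid until nothing moves.

    labels[i] is the initial cluster (0 .. k-1) of items[i].
    """
    labels = list(labels)
    for _ in range(max_passes):
        moved = False
        for i, item in enumerate(items):
            # Recompute centroids from the current partition, so the clusters
            # that gained or lost an item on the previous turn are up to date.
            centroids = {}
            for c in range(k):
                members = [items[j] for j in range(len(items)) if labels[j] == c]
                if members:
                    centroids[c] = centroid(members)
            nearest = min(centroids, key=lambda c: math.dist(item, centroids[c]))
            if nearest != labels[i]:
                labels[i] = nearest
                moved = True
        if not moved:
            break
    return labels


# Hypothetical items forming two obvious groups, with a deliberately poor start.
items = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
initial = [0, 1, 0, 1, 0, 1]
final = k_means(items, initial, k=2)
print(final)
```

The result depends on the initial partition, which is one reason k-means style procedures are usually run with carefully chosen or repeated starting points.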

The TREE Procedure


The CLUSTER and VARCLUS procedures create output data sets giving the results of
hierarchical clustering as a tree structure. The TREE procedure uses these output data
sets to print a diagram.
The following terminology is related to the TREE procedure:

Leaves           Objects that are clustered
Root             The cluster containing all the objects
Branch           A cluster containing at least two objects but not all of them
Node             A general term for leaves, branches and roots
Parent & Child   If A is the union of clusters B and C, then A is the parent and B
                 and C are its children
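
The terminology can be made concrete with a small tree data structure. The following Python sketch is only an illustration of the definitions above; the class, the function and the object names x1 to x4 are hypothetical and have no connection with how PROC TREE stores its data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of the cluster tree: a leaf, a branch or the root."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def leaves(self):
        """Names of the objects (leaves) contained in this cluster."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


def kind(node, root):
    """Classify a node using the terminology of the text."""
    if not node.children:
        return "leaf"
    return "root" if node is root else "branch"


# A is the union of clusters B and C, so A is the parent and B, C its children.
b = Node("B", [Node("x1"), Node("x2")])
c = Node("C", [Node("x3"), Node("x4")])
a = Node("A", [b, c])

print(kind(a, a), a.leaves())
print(kind(b, a), b.leaves())
print(kind(b.children[0], a))
```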

Specifications
The TREE procedure is invoked by the following statements:

PROC TREE <options>;
Optional statements:
   NAME variables;
   HEIGHT variables;
   PARENT variables;
   BY variables;
   COPY variables;
   FREQ variables;
   ID variables;


If the input data set has been created by CLUSTER or VARCLUS, the only statement
required is PROC TREE. The optional statements listed above are described
after the PROC TREE statement.
PROC TREE statement
PROC TREE <options>;
The PROC TREE statement starts the TREE procedure. The options that usually appear in
the PROC TREE statement are:
FUNCTION                               OPTION
Specify data set                       DATA=, DOCK=, LEVEL=, NCLUSTERS=, OUT=
Specify cluster heights                HEIGHT=, DISSIMILAR, SIMILAR
Print horizontal trees                 HORIZONTAL
Control the height axis                INC=, MAXHEIGHT=, MINHEIGHT=, NTICK=, PAGES=,
                                       POS=, SPACES=, TICKPOS=
Control characters printed in trees    FILLCHAR=, JOINCHAR=, LEAFCHAR=, TREECHAR=
Control sort order                     DESCENDING, SORT
Control output                         LIST, NOPRINT, PAGES=


By default, the tree diagram is oriented with the height axis vertical and the object
names at the top of the diagram. For a horizontal height axis, the HORIZONTAL option can
be used.
Example: The following data, given along with the SAS code, record the numbers of
different kinds of teeth for a variety of mammals. The objective of the study is to
identify suitable clusters of mammals based on the eight variables.
data teeth;
   input mammal $ v1 v2 v3 v4 v5 v6 v7 v8;
   cards;
A  2 3 1 1 3 3 3 3
B  3 2 1 0 3 3 3 3
C  2 3 1 1 2 3 3 3
D  2 3 1 1 2 2 3 3
E  2 3 1 1 1 2 3 3
F  1 3 1 1 2 2 3 3
G  2 1 0 0 2 2 3 3
H  2 1 0 0 3 2 3 3
I  1 1 0 0 2 1 3 3
J  1 1 0 0 2 1 3 3
K  1 1 0 0 1 1 3 3
L  1 1 0 0 0 0 3 3
M  1 1 0 0 1 1 3 3
N  3 3 1 1 4 4 2 3
O  3 3 1 1 4 4 2 3
P  3 3 1 1 4 4 3 2
Q  3 3 1 1 4 4 1 2
R  3 3 1 1 3 3 1 2
S  3 3 1 1 4 4 1 2
T  3 3 1 1 3 3 1 2
U  3 3 1 1 4 3 1 2
V  3 2 1 1 3 3 1 2
W  3 3 1 1 3 2 1 1
X  3 3 1 1 3 2 1 1
Y  3 2 1 1 4 4 1 1
Z  3 2 1 1 4 4 1 1
AA 3 2 1 1 3 3 2 2
BB 2 1 1 1 4 4 1 1
CC 0 4 1 0 3 3 3 3
DD 0 4 1 0 3 3 3 3
EE 0 4 0 0 3 3 3 3
FF 0 4 0 0 3 3 3 3
;

proc cluster method=average std pseudo noeigen;
   var v1-v8;
   id mammal;
run;
This performs hierarchical clustering using the average linkage method.
The following PROC TREE statements use the average linkage distance as the height by
default.
proc tree;
run;
The following PROC TREE statements sort the clusters at each branch in order of
formation and use the number of clusters for the height axis.
proc tree sort height=n;
run;
The following PROC TREE statements produce no printed output but create an output
data set indicating the cluster to which each observation belongs at the 6-cluster level in
the tree; the data set is then sorted by cluster and printed by PROC PRINT.
proc tree noprint out=part nclusters=6;
id mammal;
copy v1-v8;
run;
proc sort;
by cluster;
run;
proc print label uniform;
id mammal;
var v1-v8;
format v1-v8 1.;
by cluster;
run;
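
For readers without access to SAS, the idea behind this example (standardise the variables, merge by average linkage, then cut the tree at a chosen number of clusters) can be sketched in Python. The sketch is an assumption-laden illustration and not a reimplementation of PROC CLUSTER or PROC TREE: it uses only eight of the 32 mammals from the data above, cuts at four clusters rather than six, and averages plain Euclidean distances, whereas SAS may treat distances differently (for example by squaring them). All function names are hypothetical.

```python
import math
import statistics

# Eight of the 32 rows of the teeth data (mammal code, v1 to v8).
rows = {
    "A":  [2, 3, 1, 1, 3, 3, 3, 3],
    "C":  [2, 3, 1, 1, 2, 3, 3, 3],
    "I":  [1, 1, 0, 0, 2, 1, 3, 3],
    "J":  [1, 1, 0, 0, 2, 1, 3, 3],
    "N":  [3, 3, 1, 1, 4, 4, 2, 3],
    "O":  [3, 3, 1, 1, 4, 4, 2, 3],
    "CC": [0, 4, 1, 0, 3, 3, 3, 3],
    "DD": [0, 4, 1, 0, 3, 3, 3, 3],
}


def standardise(table):
    """Scale each variable to mean 0 and standard deviation 1 (n - 1 divisor)."""
    columns = list(zip(*table.values()))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    # Variables that are constant in this subset are set to 0 (an assumption).
    return {
        name: [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(values, means, sds)]
        for name, values in table.items()
    }


def average_linkage(a, b, z):
    """Average distance over all pairs with one unit in each cluster."""
    return sum(math.dist(z[i], z[j]) for i in a for j in b) / (len(a) * len(b))


def cut_tree(table, n_clusters):
    """Merge by average linkage until n_clusters remain; return cluster labels."""
    z = standardise(table)
    clusters = [[name] for name in table]
    while len(clusters) > n_clusters:
        _, a, b = min(
            (average_linkage(clusters[a], clusters[b], z), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return {
        name: label
        for label, members in enumerate(clusters, start=1)
        for name in members
    }


membership = cut_tree(rows, n_clusters=4)
for name in sorted(membership, key=membership.get):
    print(membership[name], name, rows[name])
```

Mammals with identical tooth counts, such as I and J, are at distance zero and therefore always fall in the same cluster.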
