Cluster Analysis

Prof. Shelendra K. Tyagi
Cluster Analysis:
A major FMCG company wants to map the profile of its target audience in terms of lifestyle,
attitudes, and perceptions. The company's managers, with the help of their marketing
research team, prepare a set of 15 statements that they feel measure many of the variables of
interest. These 15 statements are given below. Respondents had to agree or disagree with each
statement (1 = strongly agree, 2 = agree, 3 = neither agree nor disagree, 4 = disagree,
5 = strongly disagree).
1. I prefer to use email rather than write a letter.
2. I feel that quality products are always priced high.
3. I think twice before I buy anything.
4. Television is a major source of entertainment.
5. A car is a necessity rather than a luxury.
6. I prefer fast food and ready-to-use products.
7. People are more health-conscious today.
8. Entry of foreign companies has increased the efficiency of Indian companies.
9. Women are active participants in purchase decisions.
10. I believe politicians can play a positive role.
11. I enjoy watching movies.
12. If I get a chance, I would like to settle abroad.
13. I always buy branded products.
14. I frequently go out on weekends.
15. I prefer to pay by credit card rather than in cash.
Cluster Analysis:
Godrej India Ltd wants to map the profile of its target customer in terms of lifestyle, attitude, and
perception. Godrej's marketing managers prepare a set of fifteen statements which, according to the
market research team, will measure many of the variables of interest. Respondents had to agree or
disagree with each statement on a scale of 1 to 5:
1 = completely agree, 2 = agree, 3 = neither agree nor disagree, 4 = disagree, 5 = completely disagree
The following fifteen statements were prepared by the Godrej marketing team:
1. I feel foreign-made products are always superior in quality.
2. I prefer to pay by credit card as a matter of convenience.
3. A computer is a necessity rather than a luxury.
4. The liberalization of the Indian economy has increased the efficiency of Indian companies.
5. I prefer old Hindi songs rather than the latest ones.
6. I feel vegetarian food is more nutritious than non-vegetarian food.
7. I enjoy surfing on the net.
8. Television has become an integral part of Indian urban life.
9. Women's education is an important aspect of the overall development of the country.
10. A movie is a major source of entertainment.
11. People are more conscious about the quality of products.
12. I believe the economic status of India will improve.
13. I prefer readymade wear to tailored clothes.
14. I prefer to eat out every weekend.
15. Basic computer education should be included at the primary education level.
LEARNING OBJECTIVES:
1. Define cluster analysis, its roles and its limitations.
2. Identify the research questions addressed by cluster analysis.
3. Understand how interobject similarity is measured.
4. Distinguish between the various distance measures.
5. Differentiate between clustering algorithms.
6. Understand the differences between hierarchical and nonhierarchical clustering techniques.
7. Describe how to select the number of clusters to be formed.
8. Follow the guidelines for cluster validation.
9. Construct profiles for the derived clusters and assess managerial significance.
Cluster Analysis Defined
Cluster analysis groups objects (respondents, products, firms, variables, etc.) so that each
object is similar to the other objects in the cluster and different from objects in all the
other clusters.
What is Cluster Analysis?
- Cluster analysis is a group of multivariate techniques whose primary purpose is to group
  objects based on the characteristics they possess.
- It has been referred to as Q analysis, typology construction, classification analysis, and
  numerical taxonomy.
- The essence of all clustering approaches is the classification of data as suggested by
  natural groupings of the data themselves.
Both cluster analysis and discriminant analysis are concerned with classification. However,
discriminant analysis requires prior knowledge of the cluster or group membership for each
object or case included, in order to develop the classification rule. In contrast, in cluster
analysis there is no a priori information about the group or cluster membership for any of
the objects. Groups or clusters are suggested by the data, not defined a priori.
Criticisms of Cluster Analysis
The following must be addressed by conceptual rather than empirical support:
- Cluster analysis is descriptive, atheoretical, and noninferential.
- Cluster analysis will always create clusters, regardless of the actual existence of any
  structure in the data.
- The cluster solution is not generalizable because it is totally dependent upon the
  variables used as the basis for the similarity measure.
[Fig. 20.1 and Fig. 20.2: plots with axes Variable 1 and Variable 2.]
What Can We Do With Cluster Analysis?
1. Determine if statistically different clusters exist.
2. Identify the meaning of the clusters.
3. Explain how the clusters can be used.
Stage 1: Objectives of Cluster Analysis
Primary goal: to partition a set of objects into two or more groups based on the similarity
of the objects for a set of specified characteristics (the cluster variate).
There are two key issues:
- The research questions being addressed, and
- The variables used to characterize objects in the clustering process.
Research Questions in Cluster Analysis
Three basic research questions:
- How to form the taxonomy: an empirically based classification of objects.
- How to simplify the data: by grouping observations for further analysis.
- Which relationships can be identified: the process reveals relationships among the
  observations.
Selection of Clustering Variables
Two Issues:
Conceptual considerations, and
Practical considerations.
Rules of Thumb
OBJECTIVES OF CLUSTER ANALYSIS
Cluster analysis is used for:
- Taxonomy description: identifying natural groups within the data.
- Data simplification: the ability to analyze groups of similar observations instead of all
  individual observations.
- Relationship identification: the simplified structure from cluster analysis portrays
  relationships not revealed otherwise.
Theoretical, conceptual, and practical considerations must be observed when selecting
clustering variables for cluster analysis:
- Only variables that relate specifically to the objectives of the cluster analysis are
  included, since irrelevant variables cannot be excluded from the analysis once it begins.
- Variables are selected that characterize the individuals (objects) being clustered.
Stage 2: Research Design in Cluster Analysis
Four questions:
- Is the sample size adequate?
- Can outliers be detected and, if so, should they be deleted?
- How should object similarity be measured?
- Should the data be standardized?
Measuring Similarity
Interobject similarity is an empirical measure of correspondence, or resemblance, between
objects to be clustered. It can be measured in a variety of ways, but three methods dominate
the applications of cluster analysis:
- Correlational measures.
- Distance measures.
- Association measures.
Types of Distance Measures
- Euclidean distance: the square root of the sum of the squared differences in values for
  each variable.
- Squared (or absolute) Euclidean distance.
- City-block (Manhattan) distance: the sum of the absolute differences in values for each
  variable.
- Chebychev distance: the maximum absolute difference in values for any variable.
- Mahalanobis distance (D²): a relative measure of a data point's distance (residual) from a
  common point.
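The first four distance measures above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are assumptions, and the two example profiles are taken from the initial cluster centers reported for clusters 1 and 3 in Table 20.4 later in these notes. Mahalanobis distance is left out because it also needs the covariance matrix of the clustering variables.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared differences across the variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(squared_euclidean(a, b))

def city_block(a, b):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))

# Initial centers of clusters 1 and 3 on V1..V6, as listed in Table 20.4.
center_1 = (4, 6, 3, 7, 2, 7)
center_3 = (7, 2, 6, 4, 1, 3)

print(squared_euclidean(center_1, center_3))    # 60
print(round(euclidean(center_1, center_3), 3))  # 7.746
print(city_block(center_1, center_3))           # 18
print(chebychev(center_1, center_3))            # 4
```

The Euclidean result, 7.746, agrees with the minimum distance between initial centers reported in the nonhierarchical output.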
Rules of Thumb
RESEARCH DESIGN IN CLUSTER ANALYSIS
The sample size required is not based on statistical considerations for inference testing,
but rather on the following:
- Sufficient size is needed to ensure representativeness of the population and its
  underlying structure, particularly small groups within the population.
- Minimum group sizes are based on the relevance of each group to the research question and
  the confidence needed in characterizing that group.
Similarity measures calculated across the entire set of clustering variables allow for the
grouping of observations and their comparison to each other:
- Distance measures are most often used as a measure of similarity, with higher values
  representing greater dissimilarity (distance between cases), not similarity. There are
  many different distance measures, including:
  - Euclidean (straight-line) distance, the most common measure of distance.
  - Squared Euclidean distance, the sum of squared differences, which is the recommended
    measure for the centroid and Ward's methods of clustering.
  - Mahalanobis distance, which accounts for variable intercorrelations and weights each
    variable equally. When variables are highly intercorrelated, Mahalanobis distance is
    most appropriate.
- Less frequently used are correlational measures, where large values do indicate
  similarity.
Given the sensitivity of some procedures to the similarity measure used, the researcher
should employ several distance measures and compare the results from each with other
results or with theoretical or known patterns.
Rules of Thumb Continued . . .
RESEARCH DESIGN IN CLUSTER ANALYSIS
Outliers can severely distort the representativeness of the results if they appear as
structure (clusters) that is inconsistent with the research objectives.
- They should be removed if they represent:
  - Aberrant observations not representative of the population.
  - Observations of small or insignificant segments within the population that are of no
    interest to the research objectives.
- They should be retained if they represent an undersampling or poor representation of
  relevant groups in the population. In this case, the sample should be augmented to ensure
  representation of these groups.
Outliers can be identified based on the similarity measure by:
- Finding observations with large distances from all other observations.
- Graphic profile diagrams highlighting outlying cases.
- Their appearance in cluster solutions as single-member or very small clusters.
Clustering variables should be standardized whenever possible to avoid problems resulting
from the use of different scale values among clustering variables.
- The most common standardization conversion is Z scores.
- If groups are to be identified according to an individual's response style, then
  within-case or row-centering standardization is appropriate.
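The two standardization options above can be illustrated with a short Python sketch. The function names and the small data sets are hypothetical and chosen only to make the arithmetic easy to follow; the Z scores here use the sample standard deviation, which is one common convention.

```python
import statistics

def z_scores(rows):
    """Standardize each variable (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample standard deviation
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def row_center(rows):
    """Within-case standardization: subtract each respondent's own mean rating."""
    return [[x - statistics.mean(row) for x in row] for row in rows]

# Hypothetical data: a 1-5 rating and a monthly income on very different scales.
mixed_scales = [[1, 20000], [3, 50000], [5, 80000]]
print(z_scores(mixed_scales))

# Hypothetical data: two respondents with the same pattern but different response levels.
ratings = [[1, 2, 3], [3, 4, 5]]
print(row_center(ratings))
```

After Z scoring, both variables contribute on the same scale to any distance measure. After row centering, the two respondents have identical profiles, because only their overall tendency to agree or disagree differed.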
Stage 3: Assumptions of Cluster Analysis
- Representativeness of the sample.
- Impact of multicollinearity.
Rules of Thumb
ASSUMPTIONS IN CLUSTER ANALYSIS
Input variables should be examined for substantial multicollinearity and, if it is present:
- Reduce the variables to equal numbers in each set of correlated measures, or
- Use a distance measure that compensates for the correlation, such as Mahalanobis distance.
Stage 4: Deriving Clusters and Assessing Overall Fit
The researcher must:
- Select the partitioning procedure used for forming clusters, and
- Make the decision on the number of clusters to be formed.
A Classification of Clustering Procedures (Fig. 20.4)
Clustering Procedures
  Hierarchical
    Agglomerative
      Linkage Methods: Single, Complete, Average
      Variance Methods: Ward's Method
      Centroid Methods
    Divisive
  Nonhierarchical
    Sequential Threshold
    Parallel Threshold
    Optimizing Partitioning
Two Types of Hierarchical Clustering Procedures
1. Agglomerative Methods (buildup)
2. Divisive Methods (breakdown)
How Do Agglomerative Approaches Work?
- Start with each observation as its own cluster.
- Using the selected similarity measure, combine the two most similar observations into a
  new cluster, now containing two observations.
- Repeat the clustering procedure, using the similarity measure to combine the two most
  similar observations or combinations of observations into another new cluster.
- Continue the process until all observations are in a single cluster.
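The agglomerative steps above can be sketched directly in Python. This is a deliberately naive illustration, not the routine used by any statistical package: the function names and the five two-dimensional points are hypothetical, Euclidean distance is assumed as the similarity measure, and the `linkage` argument is assumed to be `min` (single linkage) or `max` (complete linkage).

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, linkage=min):
    """Merge the two closest clusters repeatedly until one cluster remains.

    Returns the merge schedule as (members_a, members_b, distance) tuples,
    where members are indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]  # every observation starts alone
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Distance between two clusters under the chosen linkage rule.
            d = linkage(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        schedule.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return schedule

# Hypothetical observations on two variables.
points = [(0, 0), (0, 1), (5, 5), (5, 6.5), (10, 0)]
for members_a, members_b, distance in agglomerate(points, linkage=min):
    print(members_a, members_b, round(distance, 3))
```

Each printed row plays the role of one stage in an agglomeration schedule: which clusters were combined and at what distance. Passing `linkage=max` switches the same loop to complete linkage.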
Agglomerative Algorithms
- Single linkage (nearest neighbor).
- Complete linkage (farthest neighbor).
- Average linkage.
- Centroid method.
- Ward's method.
Cluster Linkages:
[Fig. 20.5: Linkage methods of clustering. Panels: single linkage (minimum distance between
Cluster 1 and Cluster 2), complete linkage (maximum distance between Cluster 1 and Cluster 2),
and average linkage (average distance between Cluster 1 and Cluster 2).]
Other Agglomerative Clustering Methods
[Fig. 20.6: Ward's procedure and the centroid method.]
How Do Nonhierarchical Approaches Work?
- Specify cluster seeds.
- Assign each observation to one of the seeds based on similarity.
How Do Nonhierarchical Approaches Work?
The nonhierarchical clustering methods are frequently referred to as k-means clustering.
These methods include sequential threshold, parallel threshold, and optimizing partitioning.
In the sequential threshold method, a cluster center is selected and all objects within a
prespecified threshold value from the center are grouped together. Then a new cluster center
or seed is selected, and the process is repeated for the unclustered points. Once an object
is clustered with a seed, it is no longer considered for clustering with subsequent seeds.
The parallel threshold method operates similarly, except that several cluster centers are
selected simultaneously and objects within the threshold level are grouped with the nearest
center.
The optimizing partitioning method differs from the two threshold procedures in that objects
can later be reassigned to clusters to optimize an overall criterion, such as average
within-cluster distance for a given number of clusters.
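The difference between the sequential and parallel threshold methods can be made concrete with a small Python sketch. The function names, seeds, threshold, and points below are all hypothetical, and Euclidean distance is assumed; unassigned objects are marked `None`.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_threshold(points, seeds, threshold):
    """Take seeds one at a time; each claims the still-unclustered points within the threshold."""
    labels = [None] * len(points)
    for k, seed in enumerate(seeds):
        for i, p in enumerate(points):
            if labels[i] is None and dist(p, seed) <= threshold:
                labels[i] = k  # once claimed, a point is never reconsidered
    return labels

def parallel_threshold(points, seeds, threshold):
    """Consider all seeds at once; each point joins its nearest seed if within the threshold."""
    labels = []
    for p in points:
        d, k = min((dist(p, seed), k) for k, seed in enumerate(seeds))
        labels.append(k if d <= threshold else None)
    return labels

# Hypothetical points, seeds, and threshold.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (9, 9)]
seeds = [(0, 0), (3, 0)]

print(sequential_threshold(points, seeds, threshold=2))  # [0, 0, 0, 1, None]
print(parallel_threshold(points, seeds, threshold=2))    # [0, 0, 1, 1, None]
```

The point (2, 0) shows the contrast: the sequential method gives it to the first seed because that seed is processed first, while the parallel method gives it to the second seed because that seed is nearer. Neither method revisits an assignment, which is what the optimizing partitioning method adds.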
Selecting Seed Points
- Researcher specified.
- Sample generated.
Nonhierarchical Clustering Procedures
- Sequential threshold: selects one seed point and develops a cluster, then selects the next
  seed point and develops a cluster, and so on.
- Parallel threshold: selects several seed points simultaneously, then develops clusters.
- Optimization: permits reassignment of objects.
Rules of Thumb
DERIVING CLUSTERS
Hierarchical clustering methods differ in the method of representing similarity between
clusters, each with advantages and disadvantages:
- Single linkage is probably the most versatile algorithm, but poorly delineated cluster
  structures within the data produce unacceptable snakelike chains for clusters.
- Complete linkage eliminates the chaining problem but considers only the outermost
  observations in a cluster, so it is affected by outliers.
- Average linkage is based on the average similarity of all individuals in a cluster; it
  tends to generate clusters with small within-cluster variation and is less affected by
  outliers.
- Centroid linkage measures the distance between cluster centroids and, like average
  linkage, is less affected by outliers.
- Ward's method is based on the total sum of squares within clusters and is most appropriate
  when the researcher expects somewhat equally sized clusters, but it is easily distorted by
  outliers.
Nonhierarchical clustering methods require that the number of clusters be specified before
assigning observations:
- The sequential threshold method assigns observations to the closest cluster, but an
  observation cannot be reassigned to another cluster following its original assignment.
- Optimizing procedures allow for reassignment of observations based on the sequential
  proximity of observations to clusters formed during the clustering process.
Rules of Thumb continued . . .
DERIVING CLUSTERS
Selection of hierarchical or nonhierarchical methods is based on the following:
- Hierarchical clustering solutions are preferred when:
  - A wide range of alternative clustering solutions, even all of them, is to be examined.
  - The sample size is moderate (under 300-400, not exceeding 1,000) or a sample of the
    larger dataset is acceptable.
- Nonhierarchical clustering methods are preferred when:
  - The number of clusters is known and initial seed points can be specified according to
    some practical, objective, or theoretical basis.
  - There is concern about outliers, since nonhierarchical methods generally are less
    susceptible to outliers.
- A combination approach using a hierarchical approach followed by a nonhierarchical
  approach is often advisable:
  - A hierarchical approach is used first to select the number of clusters and to profile
    cluster centers that serve as initial cluster seeds in the nonhierarchical procedure.
  - A nonhierarchical method then clusters all observations using the seed points to
    provide more accurate cluster memberships.
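The second half of the combination approach, a nonhierarchical pass that starts from seed points and allows reassignment, can be sketched as a basic k-means loop. This is a minimal sketch under stated assumptions: the function names are hypothetical, Euclidean distance is used, and the seeds are made-up values standing in for the cluster centroids that a hierarchical solution would supply.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(rows):
    """Centroid of a group of observations (mean on each variable)."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def kmeans_from_seeds(points, seeds, max_iter=100):
    """Assign each point to its nearest center, update centers, repeat until stable."""
    centers = [tuple(seed) for seed in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:  # no observation changed cluster: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = mean_point(members)
    return labels, centers

# Hypothetical observations; the seeds stand in for centroids from a hierarchical solution.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
seeds = [(1.5, 1.5), (8.5, 8.5)]

labels, centers = kmeans_from_seeds(points, seeds)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Because the number of clusters and the starting centers are fixed by the seeds, the nonhierarchical pass only refines memberships; it does not decide how many clusters there are.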
Stage 5: Interpretation of the Clusters
This stage involves examining each cluster in terms of the cluster variate to name or assign
a label accurately describing the nature of the clusters.
Results of Hierarchical Clustering
Table 20.2: Agglomeration Schedule Using Ward's Procedure

        Clusters combined                      Stage cluster first appears
Stage   Cluster 1   Cluster 2   Coefficient    Cluster 1   Cluster 2   Next stage
1       14          16            1.000000     0           0           6
2       6           7             2.000000     0           0           7
3       2           13            3.500000     0           0           15
4       5           11            5.000000     0           0           11
5       3           8             6.500000     0           0           16
6       10          14            8.160000     0           1           9
7       6           12           10.166667     2           0           10
8       9           20           13.000000     0           0           11
9       4           10           15.583000     0           6           12
10      1           6            18.500000     6           7           13
11      5           9            23.000000     4           8           15
12      4           19           27.750000     9           0           17
13      1           17           33.100000     10          0           14
14      1           15           41.333000     13          0           16
15      2           5            51.833000     3           11          18
16      1           3            64.500000     14          5           19
17      4           18           79.667000     12          0           18
18      2           4           172.662000     15          17          19
19      1           2           328.600000     16          18          0
Results of Hierarchical Clustering
Table 20.2 cont.: Cluster Membership of Cases Using Ward's Procedure

              Number of Clusters
Label case    4    3    2
1             1    1    1
2             2    2    2
3             1    1    1
4             3    3    2
5             2    2    2
6             1    1    1
7             1    1    1
8             1    1    1
9             2    2    2
10            3    3    2
11            2    2    2
12            1    1    1
13            2    2    2
14            3    3    2
15            1    1    1
16            3    3    2
17            1    1    1
18            4    3    2
19            3    3    2
20            2    2    2
Dendrogram Using Ward's Method
[Fig. 20.8: dendrogram.]
Cluster Centroids
Table 20.3: Means of Variables

Cluster No.   V1      V2      V3      V4      V5      V6
1             5.750   3.625   6.000   3.125   1.750   3.875
2             1.667   3.000   1.833   3.500   5.500   3.333
3             3.500   5.833   3.333   6.000   3.500   6.000
Stage 6: Validation and Profiling of the Clusters
Validation:
- Cross-validation.
- Criterion validity.
Profiling: describing the characteristics of each cluster to explain how they may differ on
relevant dimensions. This typically involves the use of discriminant analysis or ANOVA.
Rules of Thumb
DERIVING THE FINAL CLUSTER SOLUTION
There is no single objective procedure to determine the correct number of clusters. Rather,
the researcher must evaluate alternative cluster solutions on the following considerations
to select the best solution:
- Single-member or extremely small clusters are generally not acceptable and should
  generally be eliminated.
- For hierarchical methods, ad hoc stopping rules, based on the rate of change in a total
  similarity measure as the number of clusters increases or decreases, are an indication of
  the number of clusters.
- All clusters should be significantly different across the set of clustering variables.
- Cluster solutions ultimately must have theoretical validity, assessed through external
  validation.
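One ad hoc stopping rule of the kind described above looks at how much the agglomeration coefficient rises at each merge. The Python sketch below applies it to the coefficients from the Ward's agglomeration schedule (Table 20.2). The choice of percentage increase as the criterion, and the function name, are assumptions made for illustration; other rules of the same family are equally defensible.

```python
# Agglomeration coefficients for stages 1..19, as reported in Table 20.2.
coefficients = [
    1.0, 2.0, 3.5, 5.0, 6.5, 8.16, 10.166667, 13.0, 15.583, 18.5,
    23.0, 27.75, 33.1, 41.333, 51.833, 64.5, 79.667, 172.662, 328.6,
]

def percent_increases(coefs):
    """For each merge after the first, pair the number of clusters that existed
    just before the merge with the percentage rise in the coefficient."""
    n_cases = len(coefs) + 1  # n cases need n - 1 merges
    rows = []
    for s in range(1, len(coefs)):
        clusters_before_merge = n_cases - s
        pct = 100.0 * (coefs[s] - coefs[s - 1]) / coefs[s - 1]
        rows.append((clusters_before_merge, pct))
    return rows

increases = percent_increases(coefficients)
suggested_clusters, largest_pct = max(increases, key=lambda row: row[1])

for clusters, pct in increases[-4:]:
    print(f"merging down from {clusters} clusters: coefficient rises {pct:.1f}%")
print("largest relative jump occurs when leaving", suggested_clusters, "clusters")
```

With these coefficients the largest relative jump comes when the three-cluster solution is collapsed into two, which points to three clusters. The rule is only a heuristic: percentage changes are also large in the earliest stages simply because the coefficients there are small, so the result should be weighed against the other considerations in the list.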
Results of Nonhierarchical Clustering
Table 20.4: Initial Cluster Centers

        Cluster
        1    2    3
V1      4    2    7
V2      6    3    2
V3      3    2    6
V4      7    4    4
V5      2    7    1
V6      7    2    3

Iteration History

            Change in Cluster Centers
Iteration   1       2       3
1           2.154   2.102   2.550
2           0.000   0.000   0.000

Convergence achieved due to no or small distance change. The maximum distance by which any
center has changed is 0.000. The current iteration is 2. The minimum distance between
initial centers is 7.746.

Table 20.4 cont.: Cluster Membership

Case Number   Cluster   Distance
1             3         1.414
2             2         1.323
3             3         2.550
4             1         1.404
5             2         1.848
6             3         1.225
7             3         1.500
8             3         2.121
9             2         1.756
10            1         1.143
11            2         1.041
12            3         1.581
13            2         2.598
14            1         1.404
15            3         2.828
16            1         1.624
17            3         2.598
18            1         3.555
19            1         2.154
20            2         2.102

Table 20.4 cont.: Final Cluster Centers

        Cluster
        1    2    3
V1      4    2    6
V2      6    3    4
V3      3    2    6
V4      6    4    3
V5      4    6    2
V6      6    3    4

Distances between Final Cluster Centers

Cluster   1       2       3
1                 5.568   5.698
2         5.568           6.928
3         5.698   6.928

Table 20.4 cont.: ANOVA

        Cluster              Error
        Mean Square   df     Mean Square   df     F        Sig.
V1      29.108        2      0.608         17     47.888   0.000
V2      13.546        2      0.630         17     21.505   0.000
V3      31.392        2      0.833         17     37.670   0.000
V4      15.713        2      0.728         17     21.585   0.000
V5      22.537        2      0.816         17     27.614   0.000
V6      12.171        2      1.071         17     11.363   0.001

The F tests should be used only for descriptive purposes because the clusters have been
chosen to maximize the differences among cases in different clusters. The observed
significance levels are not corrected for this, and thus cannot be interpreted as tests of the
hypothesis that the cluster means are equal.

Number of Cases in each Cluster

Cluster 1   6.000
Cluster 2   6.000
Cluster 3   8.000
Valid       20.000
Missing     0.000
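Each F value in the ANOVA output is the ratio of the cluster (between-groups) mean square to the error (within-groups) mean square for that variable. The short Python check below recomputes the ratios from the mean squares reported in the table; the variable and function names are hypothetical. Small differences from the reported F values are expected because the mean squares are printed to three decimals.

```python
# Mean squares and F values as reported in the ANOVA portion of Table 20.4.
cluster_ms = {"V1": 29.108, "V2": 13.546, "V3": 31.392,
              "V4": 15.713, "V5": 22.537, "V6": 12.171}
error_ms = {"V1": 0.608, "V2": 0.630, "V3": 0.833,
            "V4": 0.728, "V5": 0.816, "V6": 1.071}
reported_f = {"V1": 47.888, "V2": 21.505, "V3": 37.670,
              "V4": 21.585, "V5": 27.614, "V6": 11.363}

def f_ratio(ms_between, ms_within):
    """F statistic: between-cluster mean square over within-cluster mean square."""
    return ms_between / ms_within

computed_f = {v: f_ratio(cluster_ms[v], error_ms[v]) for v in cluster_ms}

for v in cluster_ms:
    print(f"{v}: computed F = {computed_f[v]:.3f}, reported F = {reported_f[v]:.3f}")
```

The degrees of freedom in the table follow from the design: 3 clusters give 3 - 1 = 2 for the cluster term, and 20 cases give 20 - 3 = 17 for the error term. As the note under the table says, these ratios describe how well the variables separate the clusters; they are not valid hypothesis tests.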
Rules of Thumb
INTERPRETING, PROFILING AND VALIDATING CLUSTERS
The cluster centroid, a mean profile of the cluster on each clustering variable, is
particularly useful in the interpretation stage.
- Interpretation involves examining the distinguishing characteristics of each cluster's
  profile and identifying substantial differences between clusters.
- Cluster solutions failing to show substantial variation indicate that other cluster
  solutions should be examined.
- The cluster centroid should also be assessed for correspondence with the researcher's
  prior expectations based on theory or practical experience.
Validation is essential in cluster analysis since the clusters are descriptive of structure
and require additional support for their relevance:
- Cross-validation empirically validates a cluster solution by creating two sub-samples
  (randomly splitting the sample) and then comparing the two cluster solutions for
  consistency with respect to the number of clusters and the cluster profiles.
- Validation is also achieved by examining differences on variables not included in the
  cluster analysis but for which there is a theoretical and relevant reason to expect
  variation across the clusters.
