Cluster Analysis:
A major FMCG company wants to map the profile of its target audience in terms of lifestyle,
attitudes, and perceptions. With the help of their marketing research team, the company's managers
prepare a set of 15 statements that they feel measure many of the variables of interest.
These 15 statements are given below. Respondents rated each statement on a five-point scale
(1 = strongly agree, 2 = agree, 3 = neither agree nor disagree, 4 = disagree, 5 = strongly disagree).
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
Cluster Analysis:
Godrej India Ltd wants to map the profile of its target customer in terms of lifestyle, attitude, and
perception. Godrej's marketing managers prepare a set of fifteen statements which, according to the
market research team, will measure many of the variables of interest. Respondents had to agree or
disagree with each statement on a scale of 1 to 5:
1 = Completely agree, 2 = Agree, 3 = Neither agree nor disagree, 4 = Disagree, 5 = Completely disagree
The following fifteen statements were prepared by the Godrej marketing team:
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
Cluster Analysis
LEARNING OBJECTIVES:
1.
2.
3.
4.
5.
6.
7.
8.
9.
Fig. 20.1 (axes: Variable 1, Variable 2)
Fig. 20.2 (axes: Variable 1, Variable 2)
1. Determine if statistically different clusters exist.
2. Identify the meaning of the clusters.
3. Explain how the clusters can be used.
Two Issues:
Conceptual considerations, and
Practical considerations.
Rules of Thumb
Four Questions:
Is the sample size adequate?
Can outliers be detected and, if so, should they be deleted?
How should object similarity be measured?
Should the data be standardized?
Measuring Similarity
Correlational Measures.
Distance Measures.
Association.
Euclidean distance: the square root of the sum of the squared differences
in values for each variable.
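The definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name and the two respondent profiles are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across variables."""
    if len(a) != len(b):
        raise ValueError("both observations need the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical respondents measured on three variables.
resp_a = [1, 4, 2]
resp_b = [4, 0, 2]

# Squared differences are 9, 16 and 0; their sum is 25.
d = euclidean_distance(resp_a, resp_b)
print(d)
```

A larger value means the two respondents are further apart, that is, less similar.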
Rules of Thumb
RESEARCH DESIGN IN CLUSTER ANALYSIS
The sample size required is not based on statistical considerations for inference testing,
but rather:
Sufficient size is needed to ensure representativeness of the population and its
underlying structure, particularly small groups within the population.
Minimum group sizes are based on the relevance of each group to the research
question and the confidence needed in characterizing that group.
Similarity measures calculated across the entire set of clustering variables allow for the
grouping of observations and their comparison to each other.
Distance measures are most often used as a measure of similarity, with higher
values representing greater dissimilarity (distance between cases), not similarity.
There are many different distance measures, including:
Euclidean (straight line) distance is the most common measure of distance.
Squared Euclidean distance is the sum of the squared differences (no square root is taken) and is the
recommended measure for the centroid and Ward's methods of clustering.
Mahalanobis distance accounts for variable intercorrelations and weights each
variable equally. When variables are highly intercorrelated, Mahalanobis distance is
most appropriate.
Less frequently used are correlational measures, where large values do indicate
similarity.
Given the sensitivity of some procedures to the similarity measure used, the researcher
should employ several distance measures and compare the results from each with other
results or theoretical/known patterns.
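To see why the choice of measure matters, the sketch below compares a distance measure with a correlational measure on three hypothetical response profiles (the data and function names are illustrative assumptions, not taken from the slides). Mahalanobis distance is not shown here.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared differences, with no square root taken."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))

def pearson(a, b):
    """Pearson correlation between two response profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical respondents on five 1-to-5 items.
resp_a = [1, 2, 1, 2, 1]
resp_b = [4, 5, 4, 5, 4]   # same pattern as resp_a, but shifted up by 3
resp_c = [2, 1, 2, 1, 2]   # close to resp_a in level, opposite pattern

for name, other in (("a-b", resp_b), ("a-c", resp_c)):
    print(name,
          "squared Euclidean:", squared_euclidean(resp_a, other),
          "Euclidean:", round(euclidean(resp_a, other), 3),
          "correlation:", round(pearson(resp_a, other), 3))
```

On distance, respondent a is closer to c than to b; on correlation, a and b match perfectly while a and c are opposites. The two families of measures can therefore produce different clusters from the same data.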
Fig. 20.4
Hierarchical
    Agglomerative
        Linkage Methods: Single, Complete, Average
        Variance Methods: Ward's Method
        Centroid Methods
    Divisive
Nonhierarchical
    Sequential Threshold
    Parallel Threshold
    Optimizing Partitioning
1. Agglomerative Methods (buildup)
2. Divisive Methods (breakdown)
Agglomerative Algorithms
Average Linkage.
Centroid Method.
Ward's Method.
Cluster Linkages:
Fig. 20.5 Linkage Methods of Clustering
Single Linkage: minimum distance between Cluster 1 and Cluster 2
Complete Linkage: maximum distance between Cluster 1 and Cluster 2
Average Linkage: average distance between Cluster 1 and Cluster 2
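The three linkage rules differ only in how the pairwise distances between members of two clusters are summarized. A minimal sketch, assuming Euclidean distance and two small hypothetical clusters (none of these values come from the slides):

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linkage_distances(cluster_1, cluster_2):
    """Summarize all between-cluster pairwise distances three ways."""
    pairwise = [euclidean(a, b) for a, b in product(cluster_1, cluster_2)]
    return {
        "single": min(pairwise),                     # minimum distance
        "complete": max(pairwise),                   # maximum distance
        "average": sum(pairwise) / len(pairwise),    # average distance
    }

# Hypothetical clusters of two observations each, on two variables.
cluster_1 = [(0, 0), (0, 1)]
cluster_2 = [(3, 0), (4, 0)]

result = linkage_distances(cluster_1, cluster_2)
for method, value in result.items():
    print(method, round(value, 3))
```

For any pair of clusters the single-linkage value is never larger than the average, and the average is never larger than the complete-linkage value.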
Ward's Procedure
Centroid Method
Researcher specified.
Sample generated.
Nonhierarchical Clustering Procedures
Rules of Thumb
DERIVING CLUSTERS
The sequential threshold method assigns observations to the closest cluster, but an
observation cannot be re-assigned to another cluster following its original
assignment.
Optimizing procedures allow for re-assignment of observations based on the
sequential proximity of observations to clusters formed during the clustering
process.
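The difference between a single assignment pass and an optimizing procedure can be sketched as follows. This is a k-means-style illustration under stated assumptions: the data points, the seed points, and the function names are hypothetical, and the slides do not specify this exact algorithm.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centers):
    """Index of the closest cluster center."""
    return min(range(len(centers)), key=lambda k: euclidean(point, centers[k]))

def optimizing_partition(points, seeds, max_iter=100):
    """Assign to the nearest center, update centers, and re-assign until stable."""
    centers = [list(s) for s in seeds]
    labels = [nearest(p, centers) for p in points]
    for _ in range(max_iter):
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:      # no observation moved: stop
            break
        labels = new_labels           # observations may switch clusters
    return labels, centers

points = [(1, 1), (1, 2), (2, 1), (4, 4), (8, 8), (8, 9), (9, 8)]
seeds = [(1, 1), (4, 4)]

# One pass with no re-assignment: each point stays with its nearest seed.
first_pass = [nearest(p, seeds) for p in points]
labels, centers = optimizing_partition(points, seeds)
print(first_pass)
print(labels)
```

In this example the point (4, 4) starts in the second cluster because it is itself a seed, but the optimizing procedure moves it to the first cluster once the centers have been updated. A procedure that forbids re-assignment would leave it where it was first placed.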
Table 20.2
Agglomeration schedule

        Clusters combined                      Stage cluster first appears
Stage   Cluster 1   Cluster 2   Coefficient    Cluster 1   Cluster 2   Next stage
1       14          16            1.000000     0           0           6
2       6           7             2.000000     0           0           7
3       2           13            3.500000     0           0           15
4       5           11            5.000000     0           0           11
5       3           8             6.500000     0           0           16
6       10          14            8.160000     0           1           9
7       6           12           10.166667     2           0           10
8       9           20           13.000000     0           0           11
9       4           10           15.583000     0           6           12
10      1           6            18.500000     6           7           13
11      5           9            23.000000     4           8           15
12      4           19           27.750000     9           0           17
13      1           17           33.100000     10          0           14
14      1           15           41.333000     13          0           16
15      2           5            51.833000     3           11          18
16      1           3            64.500000     14          5           19
17      4           18           79.667000     12          0           18
18      2           4           172.662000     15          17          19
19      1           2           328.600000     16          18          0
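One way to read the Coefficient column of the schedule above is to look for the stage at which the coefficient rises sharply, since a large jump signals that two dissimilar clusters were merged. The sketch below applies a percentage-increase heuristic to the 19 coefficients listed above; the heuristic and the variable names are assumptions for illustration, not necessarily the rule the slides applied.

```python
# Coefficients for stages 1 to 19, copied from the agglomeration schedule.
coefficients = [
    1.0, 2.0, 3.5, 5.0, 6.5, 8.16, 10.166667, 13.0, 15.583, 18.5,
    23.0, 27.75, 33.1, 41.333, 51.833, 64.5, 79.667, 172.662, 328.6,
]
n_cases = 20  # 20 cases give 19 merge stages

# For each stage from 2 onward: absolute and relative increase over the previous stage.
increases = []
for stage in range(2, len(coefficients) + 1):
    prev, curr = coefficients[stage - 2], coefficients[stage - 1]
    increases.append((stage, curr - prev, (curr - prev) / prev))

# Stage with the largest relative jump.
stage, abs_increase, rel_increase = max(increases, key=lambda t: t[2])

# Number of clusters that existed just before that merge took place.
clusters_before_merge = n_cases - (stage - 1)

print("largest relative jump at stage", stage)
print("clusters before that merge:", clusters_before_merge)
```

Under this heuristic the sharpest relative rise occurs at stage 18, which points to stopping with three clusters. Other stopping rules can give a different answer, so the result is best treated as one input alongside the interpretability of the clusters.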
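Results of Hierarchical Clustering
Table 20.2
Cluster membership of cases

Case   4 clusters   3 clusters   2 clusters
1      1            1            1
2      2            2            2
3      1            1            1
4      3            3            2
5      2            2            2
6      1            1            1
7      1            1            1
8      1            1            1
9      2            2            2
10     3            3            2
11     2            2            2
12     1            1            1
13     2            2            2
14     3            3            2
15     1            1            1
16     3            3            2
17     1            1            1
18     4            3            2
19     3            3            2
20     2            2            2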
Cluster Centroids
Table 20.3
Means of Variables
Cluster No.   V1      V2      V3      V4      V5      V6
1             5.750   3.625   6.000   3.125   1.750   3.875
2             1.667   3.000   1.833   3.500   5.500   3.333
3             3.500   5.833   3.333   6.000   3.500   6.000
Validation:
Cross-validation.
Criterion validity.
Profiling: describing the characteristics of each cluster to explain how they may differ on
relevant dimensions. This typically involves the use of discriminant analysis or ANOVA.
Initial Cluster Centers
        Cluster
        1   2   3
V1      4   2   7
V2      6   3   2
V3      3   2   6
V4      7   4   4
V5      2   7   1
V6      7   2   3
Iteration History
Iteration
1
2
Cluster Membership
Case Number   Cluster   Distance
1             3         1.414
2             2         1.323
3             3         2.550
4             1         1.404
5             2         1.848
6             3         1.225
7             3         1.500
8             3         2.121
9             2         1.756
10            1         1.143
11            2         1.041
12            3         1.581
13            2         2.598
14            1         1.404
15            3         2.828
16            1         1.624
17            3         2.598
18            1         3.555
19            1         2.154
20            2         2.102
Final Cluster Centers
        Cluster
        1   2   3
V1      4   2   6
V2      6   3   4
V3      3   2   6
V4      6   4   3
V5      4   6   2
V6      6   3   4

Distances between Final Cluster Centers
Cluster   1       2       3
1                 5.568   5.698
2         5.568           6.928
3         5.698   6.928
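As a check on how the cluster centers and the distances of 5.568, 5.698 and 6.928 above relate, the sketch below computes Euclidean distances between the three center profiles. It assumes the three columns of six values are the centers of clusters 1 to 3 on V1 to V6, which is a reading of the flattened layout rather than something the slides state outright.

```python
import math
from itertools import combinations

# Center profiles on V1..V6, as printed (whole numbers) in the output above.
final_centers = {
    1: (4, 6, 3, 6, 4, 6),
    2: (2, 3, 2, 4, 6, 3),
    3: (6, 4, 6, 3, 2, 4),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance for every pair of clusters.
center_distances = {
    (i, j): euclidean(final_centers[i], final_centers[j])
    for i, j in combinations(sorted(final_centers), 2)
}

for pair, dist in center_distances.items():
    print(pair, round(dist, 3))
```

The distance between clusters 1 and 2 reproduces the reported 5.568. The other two pairs come out somewhat larger than the reported 5.698 and 6.928, presumably because the printed centers are rounded to whole numbers while the reported distances were computed from unrounded centers.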
        Cluster                Error
        Mean Square   df       Mean Square   df     F        Sig.
V1      29.108        2        0.608         17     47.888   0.000
V2      13.546        2        0.630         17     21.505   0.000
V3      31.392        2        0.833         17     37.670   0.000
V4      15.713        2        0.728         17     21.585   0.000
V5      22.537        2        0.816         17     27.614   0.000
V6      12.171        2        1.071         17     11.363   0.001
The F tests should be used only for descriptive purposes because the clusters have been
chosen to maximize the differences among cases in different clusters. The observed
significance levels are not corrected for this, and thus cannot be interpreted as tests of the
hypothesis that the cluster means are equal.
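Each F value in the output above is the ratio of the cluster mean square to the error mean square. The sketch below recomputes the ratios from the printed mean squares; the dictionary layout and names are illustrative assumptions.

```python
# (cluster mean square, error mean square, reported F) for each variable.
anova = {
    "V1": (29.108, 0.608, 47.888),
    "V2": (13.546, 0.630, 21.505),
    "V3": (31.392, 0.833, 37.670),
    "V4": (15.713, 0.728, 21.585),
    "V5": (22.537, 0.816, 27.614),
    "V6": (12.171, 1.071, 11.363),
}

# F = between-cluster mean square / within-cluster (error) mean square.
f_ratios = {
    var: cluster_ms / error_ms
    for var, (cluster_ms, error_ms, _reported) in anova.items()
}

for var, f_value in f_ratios.items():
    print(var, "recomputed F:", round(f_value, 3), "reported F:", anova[var][2])
```

The recomputed ratios agree with the reported F values to within rounding of the printed mean squares. As the note above says, these values describe which variables separate the clusters most strongly; they are not valid significance tests.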
Number of Cases in Each Cluster
Cluster 1    6.000
Cluster 2    6.000
Cluster 3    8.000
Valid       20.000
Missing      0.000