What is clustering?
A grouping of data objects such that the objects within a
group are similar (or related) to one another and different
from (or unrelated to) the objects in other groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.
Outliers
Outliers are objects that do not belong to any cluster or form clusters of very small cardinality.
[Figure: a cluster and several outlying points]
In some applications we are interested in
discovering outliers, not clusters (outlier analysis)
Why do we cluster?
Clustering: given a collection of data objects, group them so that they are:
similar to one another within the same cluster
dissimilar to the objects in other clusters
Observations to cluster
Usually data objects consist of a set of
attributes (also known as dimensions)
Real-value attributes/variables
e.g., salary, height
Binary attributes
e.g., gender (M/F), has_cancer(T/F)
Ordinal/Ranked attributes
e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Observations to cluster
If all d dimensions are real-valued, then we can visualize each data point as a point in a d-dimensional space
If all d dimensions are binary then
we can think of each data point as a
binary vector
Distance functions
The distance d(i, j) between two objects i and j is a metric if:
d(i, j) ≥ 0 (non-negativity)
d(i, i) = 0 (isolation)
d(i, j) = d(j, i) (symmetry)
d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)
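As a quick numerical illustration (a minimal sketch in plain Python; the helper name `euclidean` is ours, not from the lecture), one can check these four axioms for the Euclidean distance on a handful of points:

```python
import itertools
import math

def euclidean(x, y):
    # L2 distance between two equal-length tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0), (-2.0, 4.0)]

for i, j in itertools.product(points, repeat=2):
    assert euclidean(i, j) >= 0                  # non-negativity
    assert euclidean(i, i) == 0                  # isolation
    assert euclidean(i, j) == euclidean(j, i)    # symmetry

for i, j, h in itertools.product(points, repeat=3):
    # triangle inequality (tiny tolerance for floating point)
    assert euclidean(i, j) <= euclidean(i, h) + euclidean(h, j) + 1e-12
```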
Data Structures
Data matrix (n tuples/objects as rows, d attributes/dimensions as columns):

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1d} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{id} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{nd}
\end{bmatrix}$$

Distance matrix (objects × objects; symmetric, so only the lower triangle is stored):

$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
The L_p (Minkowski) distance:

$$L_p(x, y) = \left(|x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_d - y_d|^p\right)^{1/p} = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$$

The L_1 (Manhattan) distance:

$$L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i|$$

One can also use a weighted distance:

$$d(x, y) = \left(w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \cdots + w_d |x_d - y_d|^2\right)^{1/2}$$
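As a minimal sketch (assuming NumPy; the function names are illustrative, not from the lecture):

```python
import numpy as np

def minkowski(x, y, p):
    # L_p distance: (sum_i |x_i - y_i|^p)^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def weighted_euclidean(x, y, w):
    # sqrt(sum_i w_i * |x_i - y_i|^2)
    return np.sqrt(np.sum(w * (x - y) ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, p=1))   # L1 (Manhattan): 5.0
print(minkowski(x, y, p=2))   # L2 (Euclidean): ~3.606
print(weighted_euclidean(x, y, w=np.array([1.0, 0.5, 2.0])))
```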
Cluster-to-cluster linkage criteria (covered later):
Single Linkage
Complete Linkage
Average Linkage
Partitioning Clustering
K-means
Find a partition into clusters C_1, ..., C_k with centroids c_1, ..., c_k such that the cost

$$\text{Cost}(C) = \sum_{i=1}^{k} \sum_{x \in C_i} L_2(x, c_i)^2 = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|^2$$

is minimized.
Some special cases: k = 1, k = n
k-means algorithm
Finds a local optimum
Often converges quickly (but not always)
The choice of initial points can have a large influence on the result
Problem cases: clusters of different densities, clusters of different sizes
Remedy: multiple runs with different initial points (see the sketch below)
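Below is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in Python with NumPy; it is illustrative, not the lecture's reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (L2)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged to a local optimum
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```

Running it several times with different seeds and keeping the lowest-cost result is the usual remedy for a bad initialization.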
Hierarchical Clustering
Produces a set of nested clusters
organized as a hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the
sequences of merges or splits
[Figure: nested clusters for six points and the corresponding dendrogram; merge heights range from 0 to 0.2]
Strengths of Hierarchical Clustering
No assumptions on the number of
clusters
Any desired number of clusters can be
obtained by cutting the dendrogram at
the proper level
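For instance, with SciPy (a sketch; the data here are random placeholders), the tree is built once and can then be cut at different levels to obtain any desired k:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 2)            # 10 points in 2-D
Z = linkage(X, method='average')     # build the full merge tree

# Cutting the dendrogram at different levels yields different k
labels_k2 = fcluster(Z, t=2, criterion='maxclust')  # 2 clusters
labels_k4 = fcluster(Z, t=4, criterion='maxclust')  # 4 clusters
```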
Hierarchical Clustering
Two main types of hierarchical clustering
Agglomerative:
Start with the points as individual clusters
At each step, merge the closest pair of clusters, until only one cluster (or k clusters) remains
Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster, until each cluster contains a single point (or there are k clusters)
Agglomerative clustering algorithm
Basic algorithm:
1. Compute the distance/proximity matrix
2. Let each data point be a cluster
3. Repeat:
4. Merge the two closest clusters
5. Update the distance/proximity matrix
6. Until only a single cluster (or k clusters) remains
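A naive, quadratic-space sketch of these six steps in Python with NumPy (illustrative only; it uses the single-link rule, i.e., the minimum, when updating the matrix):

```python
import numpy as np

def agglomerative(D, k=1):
    # Single-link agglomerative clustering on a symmetric distance
    # matrix D; stops when k clusters remain.
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)          # never merge a cluster with itself
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        # steps 3-4: find and merge the two closest clusters
        i, j = np.unravel_index(np.argmin(D), D.shape)
        i, j = min(i, j), max(i, j)
        clusters[i].extend(clusters[j])
        del clusters[j]
        # step 5: update the distance matrix (single link: take the minimum)
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
    return clusters
```

In practice one would use a library routine such as scipy.cluster.hierarchy.linkage instead.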
Intermediate State
After some merging steps, we have some clusters.
[Figure: clusters C1–C5 and the current distance/proximity matrix]
Intermediate State
Merge the two closest clusters (C2 and C5) and update the distance matrix.
[Figure: clusters C1–C5, with C2 and C5 selected for merging, and the distance/proximity matrix]
After Merging
How do we update the distance matrix?
[Figure: distance matrix after merging C2 and C5 into C2 ∪ C5; the entries between C2 ∪ C5 and C1, C3, C4 are marked "?"]
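The slide leaves the answer open; the standard update rules for the three linkage criteria listed earlier are (writing d(A, B) for the current cluster-to-cluster distance):

$$d(C_2 \cup C_5,\, C_j) =
\begin{cases}
\min\{d(C_2, C_j),\, d(C_5, C_j)\} & \text{single link} \\
\max\{d(C_2, C_j),\, d(C_5, C_j)\} & \text{complete link} \\
\dfrac{|C_2|\, d(C_2, C_j) + |C_5|\, d(C_5, C_j)}{|C_2| + |C_5|} & \text{average link}
\end{cases}$$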
Single-link clustering: example
The distance between two clusters is determined by one pair of points, i.e., by one link in the proximity graph:

$$d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$$
Similarity matrix:

     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
[Figure: single-link nested clusters and the corresponding dendrogram; merge heights range from 0 to 0.2]
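As a check, SciPy can cluster this matrix directly (a sketch; converting similarities to distances as 1 − sim is our assumption, one common choice, not necessarily the lecture's):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

sim = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])
Z = linkage(squareform(1.0 - sim), method='single')
print(Z)   # each row: cluster a, cluster b, merge distance, new size
```

Under this conversion, I1 and I2 merge first (distance 0.10), then I4 and I5 (0.20), then I3 joins {I1, I2} (0.30), and the two remaining groups merge last (0.35).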
Complete-link clustering: example
The distance between two clusters is determined by the two most distant points in the different clusters:

$$d(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$$
(Same similarity matrix as in the single-link example.)
[Figure: complete-link nested clusters and the corresponding dendrogram; merge heights range from 0 to 0.4]
Average-link clustering: example
The proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

$$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y)$$
(Same similarity matrix as in the single-link example.)
[Figure: average-link nested clusters and the corresponding dendrogram; merge heights range from 0 to 0.25]
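To see how the three criteria differ on the same data, a sketch comparing their merge heights (assuming SciPy, and the same similarity-to-distance conversion as above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

sim = np.array([
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
])
condensed = squareform(1.0 - sim)    # similarities -> condensed distances

for method in ('single', 'complete', 'average'):
    Z = linkage(condensed, method=method)
    print(method, Z[:, 2])           # the sequence of merge heights
```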
Average-link clustering: strengths
Less susceptible to noise and outliers
Thank you