
Clustering

What is clustering?
A grouping of data objects such that the objects within a
group are similar (or related) to one another and different
from (or unrelated to) the objects in other groups:
intra-cluster distances are minimized, while inter-cluster
distances are maximized.

Outliers
Outliers are objects that do not belong to any cluster or
that form clusters of very small cardinality.
In some applications we are interested in discovering
outliers, not clusters (outlier analysis).

Why do we cluster?
Clustering: given a collection of data objects, group them
so that objects are
- similar to one another within the same cluster, and
- dissimilar to the objects in other clusters.

Clustering results are used:
- As a stand-alone tool to get insight into the data
  distribution; visualization of clusters may unveil
  important information.
- As a preprocessing step for other algorithms; efficient
  indexing or compression often relies on clustering.

The clustering task

Group observations so that observations belonging to the
same group are similar, whereas observations in different
groups are different.

Basic questions:
- What does "similar" mean?
- What is a good partition of the objects? I.e., how is
  the quality of a solution measured?
- How do we find a good partition of the observations?

Observations to cluster
Usually data objects consist of a set of attributes (also
known as dimensions):
- Real-valued attributes/variables
  e.g., salary, height
- Binary attributes
  e.g., gender (M/F), has_cancer (T/F)
- Nominal (categorical) attributes
  e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
- Ordinal/ranked attributes
  e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

Observations to cluster
- If all d dimensions are real-valued, then we can
  visualize each data object as a point in a
  d-dimensional space.
- If all d dimensions are binary, then we can think of
  each data object as a binary vector.

Distance functions
The distance d(i, j) between two objects i and j is a
metric if:
- d(i, j) >= 0 (non-negativity)
- d(i, i) = 0 (isolation)
- d(i, j) = d(j, i) (symmetry)
- d(i, j) <= d(i, h) + d(h, j) (triangle inequality)

The definitions of distance functions are usually
different for real, boolean, categorical, and ordinal
variables. Weights may be associated with different
variables based on the application and data semantics.

Data Structures

Data matrix: n tuples/objects (rows) described by d
attributes/dimensions (columns):

    x_11  ...  x_1f  ...  x_1d
     ...        ...        ...
    x_i1  ...  x_if  ...  x_id
     ...        ...        ...
    x_n1  ...  x_nf  ...  x_nd

Distance matrix: an n x n matrix storing the pairwise
distances d(i, j) between objects (only the lower triangle
is needed, since d is symmetric):

       0
    d(2,1)    0
    d(3,1)  d(3,2)    0
      :       :       :
    d(n,1)  d(n,2)   ...     0

Distance functions for binary vectors

Jaccard similarity between binary vectors X and Y:

    JSim(X, Y) = |X ∩ Y| / |X ∪ Y|

Jaccard distance between binary vectors X and Y:

    Jdist(X, Y) = 1 - JSim(X, Y)

Example: if the two vectors agree on one of the six
positions where either of them has a 1, then
    JSim = 1/6
    Jdist = 5/6

Distance functions for real-valued vectors

Lp norms or Minkowski distance:

    L_p(x, y) = (|x_1 - y_1|^p + |x_2 - y_2|^p + ... + |x_d - y_d|^p)^{1/p}
              = (Σ_{i=1}^{d} |x_i - y_i|^p)^{1/p}

where p is a positive integer.

If p = 1, L1 is the Manhattan (or city block) distance:

    L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + ... + |x_d - y_d|
              = Σ_{i=1}^{d} |x_i - y_i|
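
A small Python sketch (illustrative, not from the slides)
of the Lp distance; p = 1 recovers the Manhattan distance
and p = 2 the Euclidean distance:

    def minkowski(x, y, p):
        # (sum_i |x_i - y_i|^p)^(1/p)
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    print(minkowski([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
    print(minkowski([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)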

Distance functions for real-valued vectors

If p = 2, L2 is the Euclidean distance:

    d(x, y) = (|x_1 - y_1|^2 + |x_2 - y_2|^2 + ... + |x_d - y_d|^2)^{1/2}

One can also use a weighted distance:

    d(x, y) = (w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + ... + w_d |x_d - y_d|^2)^{1/2}
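
A corresponding sketch of the weighted Euclidean distance
(illustrative; with all weights equal to 1 it reduces to
the plain L2 distance):

    from math import sqrt

    def weighted_euclidean(x, y, w):
        # sqrt(sum_i w_i * |x_i - y_i|^2)
        return sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

    print(weighted_euclidean([0, 0], [3, 4], w=[1, 1]))  # 5.0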

Algorithms: basic concept

Construct a partition of a set of n objects into a set of
k clusters.
- Hierarchical clustering
  - Single linkage
  - Complete linkage
  - Average linkage
- Partitioning clustering
  - K-means

The k-means problem

Given a set X of n points in a d-dimensional space and an
integer k.
Task: choose a set of k points {c_1, c_2, ..., c_k} in the
d-dimensional space to form clusters {C_1, C_2, ..., C_k}
such that

    Cost(C) = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x - c_i||^2

is minimized, where || · || is the L2 (Euclidean) norm.
Some special cases: k = 1, k = n.
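
As a sketch (not from the slides), the cost of a given
clustering can be computed with NumPy as follows; X is an
n x d array, labels assigns each point to a cluster, and
centers holds the k center vectors:

    import numpy as np

    def kmeans_cost(X, labels, centers):
        # Sum of squared L2 distances from each point to its cluster center.
        return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centers))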

The k-means algorithm

One way of solving the k-means problem:
1. Randomly pick k cluster centers {c_1, ..., c_k}
2. For each i, set the cluster C_i to be the set of points
   in X that are closer to c_i than they are to c_j for
   all j ≠ i
3. For each i, let c_i be the center of cluster C_i (the
   mean of the vectors in C_i)
4. Repeat steps 2-3 until convergence
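
A minimal NumPy sketch of this iteration (illustrative;
empty clusters are not handled):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: randomly pick k distinct data points as the initial centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each point to its closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: move each center to the mean of its cluster.
            new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centers, centers):  # converged
                break
            centers = new_centers
        return labels, centers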

k-means algorithm
- Finds a local optimum.
- Often converges quickly (but not always).
- The choice of initial points can have a large influence:
  - clusters of different densities
  - clusters of different sizes
- Outliers can also cause a problem. (Example?)

Some alternatives to random initialization of the central
points:
- Multiple runs
  Helps, but probability is not on your side.
- Select the initial set of points by methods other than
  random, e.g., pick the most distant (from each other)
  points as cluster centers (k-means++ algorithm).
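
A sketch of k-means++-style seeding (illustrative): the
first center is chosen uniformly, and each further center
is drawn with probability proportional to its squared
distance from the nearest center chosen so far, which
favors far-apart points:

    import numpy as np

    def kmeanspp_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]  # first center: uniform at random
        for _ in range(k - 1):
            # Squared distance of each point to its nearest chosen center.
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)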

What is the right number of clusters?
(Or: who sets the value of k?)
- For n points to be clustered, consider the case where
  k = n. What is the value of the error function?
- What happens when k = 1?
- Since we want to minimize the error, why don't we always
  select k = n?
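
One way to see the answer (an illustrative sketch reusing
kmeans and kmeans_cost from the sketches above; the data
is made up): the cost typically shrinks as k grows and is
exactly 0 at k = n, which is why the raw error cannot be
used to choose k:

    import numpy as np

    X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.]])
    costs = []
    for k in range(1, len(X) + 1):
        labels, centers = kmeans(X, k, seed=k)
        costs.append(kmeans_cost(X, labels, centers))
    print(costs)  # roughly decreasing; the k = n entry is 0.0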

Hierarchical Clustering
- Produces a set of nested clusters organized as a
  hierarchical tree.
- Can be visualized as a dendrogram: a tree-like diagram
  that records the sequences of merges or splits.

[Figure: example dendrogram with merge heights on the
vertical axis]

Strengths of Hierarchical Clustering
- No assumptions on the number of clusters: any desired
  number of clusters can be obtained by cutting the
  dendrogram at the proper level.
- Hierarchical clustering may correspond to meaningful
  taxonomies.

Hierarchical Clustering
Two main types of hierarchical clustering:
- Agglomerative:
  - Start with the points as individual clusters.
  - At each step, merge the closest pair of clusters until
    only one cluster (or k clusters) remains.
- Divisive:
  - Start with one, all-inclusive cluster.
  - At each step, split a cluster until each cluster
    contains a single point (or there are k clusters).

Traditional hierarchical algorithms use a similarity or
distance matrix and merge or split one cluster at a time.

Agglomerative clustering algorithm

Most popular hierarchical clustering technique.

Basic algorithm:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the distance matrix
6. Until only a single cluster remains

The key operation is the computation of the distance
between two clusters. Different definitions of the
distance between clusters lead to different algorithms
(see the sketch below).
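
For concreteness, SciPy ships an implementation of this
algorithm; a sketch (the data here is made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).random((10, 2))  # 10 random 2-D points
    # method='single' | 'complete' | 'average' selects the inter-cluster
    # distance defined on the following slides.
    Z = linkage(X, method='single')
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters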

Input / Initial setting

Start with clusters of individual points and a
distance/proximity matrix with one row and one column per
point p1, p2, p3, p4, p5, ...

[Figure: the points and their distance/proximity matrix]

Intermediate State

After some merging steps, we have some clusters: C1, C2,
C3, C4, C5.

[Figure: the current clusters and their distance/proximity
matrix, with one row and one column per cluster]

Intermediate State

Merge the two closest clusters (C2 and C5) and update the
distance matrix.

[Figure: the same distance/proximity matrix, with C2 and
C5 about to be merged]

After Merging

How do we update the distance matrix? The distances from
the merged cluster C2 ∪ C5 to C1, C3, and C4 are the
entries still to be filled in.

[Figure: the updated matrix, with "?" for each entry
involving C2 ∪ C5]

Distance between two clusters
- Each cluster is a set of points.
- How do we define distance between two sets of points?
- Lots of alternatives.
- Not an easy task.

Distance between two clusters

Single-link distance between clusters C_i and C_j is the
minimum distance between any object in C_i and any object
in C_j, i.e., the distance is defined by the two most
similar objects:

    D_sl(C_i, C_j) = min { d(x, y) : x ∈ C_i, y ∈ C_j }
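
In code (an illustrative sketch; d can be any pairwise
distance function, the points are made up):

    from math import dist  # Euclidean distance between two points

    def single_link(Ci, Cj, d=dist):
        # Minimum pairwise distance across the two clusters.
        return min(d(x, y) for x in Ci for y in Cj)

    print(single_link([(0, 0), (1, 0)], [(3, 0), (5, 0)]))  # 2.0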

Single-link clustering: example

Determined by one pair of points, i.e., by one link in the
proximity graph.

Proximity (similarity) matrix:

          I1    I2    I3    I4    I5
    I1   1.00  0.90  0.10  0.65  0.20
    I2   0.90  1.00  0.70  0.60  0.50
    I3   0.10  0.70  1.00  0.40  0.30
    I4   0.65  0.60  0.40  1.00  0.80
    I5   0.20  0.50  0.30  0.80  1.00

Single-link clustering: example

[Figure: nested clusters (left) and the corresponding
dendrogram (right)]

Distance between two clusters

Complete-link distance between clusters C_i and C_j is the
maximum distance between any object in C_i and any object
in C_j, i.e., the distance is defined by the two most
dissimilar objects:

    D_cl(C_i, C_j) = max { d(x, y) : x ∈ C_i, y ∈ C_j }
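
The corresponding sketch (same illustrative setup as for
single link):

    from math import dist

    def complete_link(Ci, Cj, d=dist):
        # Maximum pairwise distance across the two clusters.
        return max(d(x, y) for x in Ci for y in Cj)

    print(complete_link([(0, 0), (1, 0)], [(3, 0), (5, 0)]))  # 5.0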

Complete-link clustering: example

Distance between clusters is determined by the two most
distant points in the different clusters.

Proximity (similarity) matrix:

          I1    I2    I3    I4    I5
    I1   1.00  0.90  0.10  0.65  0.20
    I2   0.90  1.00  0.70  0.60  0.50
    I3   0.10  0.70  1.00  0.40  0.30
    I4   0.65  0.60  0.40  1.00  0.80
    I5   0.20  0.50  0.30  0.80  1.00

Complete-link clustering: example

[Figure: nested clusters (left) and the corresponding
dendrogram (right)]

Distance between two clusters

Group average distance between clusters C_i and C_j is the
average distance between any object in C_i and any object
in C_j:

    D_avg(C_i, C_j) = (1 / (|C_i| |C_j|)) Σ_{x ∈ C_i, y ∈ C_j} d(x, y)
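
And the corresponding sketch (same illustrative setup):

    from math import dist

    def average_link(Ci, Cj, d=dist):
        # Mean of all |Ci| * |Cj| pairwise distances.
        return sum(d(x, y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))

    print(average_link([(0, 0), (1, 0)], [(3, 0), (5, 0)]))  # 3.5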

Average-link clustering: example

Proximity of two clusters is the average of pairwise
proximity between points in the two clusters.

Proximity (similarity) matrix:

          I1    I2    I3    I4    I5
    I1   1.00  0.90  0.10  0.65  0.20
    I2   0.90  1.00  0.70  0.60  0.50
    I3   0.10  0.70  1.00  0.40  0.30
    I4   0.65  0.60  0.40  1.00  0.80
    I5   0.20  0.50  0.30  0.80  1.00

Average-link clustering: example

[Figure: nested clusters (left) and the corresponding
dendrogram (right)]

Average-link clustering
- A compromise between single and complete link.
- Strengths: less susceptible to noise and outliers.

Thank you
