
Clustering

What is clustering?
A grouping of data objects such that the objects within a
group are similar (or related) to one another and different
from (or unrelated to) the objects in other groups:
intra-cluster distances are minimized, while inter-cluster
distances are maximized.

Outliers
Outliers are objects that do not belong to any cluster or
that form clusters of very small cardinality.
In some applications we are interested in discovering
outliers, not clusters (outlier analysis).

Why do we cluster?
Clustering: given a collection of data objects, group them
so that objects are
- similar to one another within the same cluster, and
- dissimilar to the objects in other clusters.

Clustering results are used:
- As a stand-alone tool to get insight into the data
  distribution; visualization of clusters may unveil
  important information.
- As a preprocessing step for other algorithms; efficient
  indexing or compression often relies on clustering.

The clustering task

Group observations so that observations belonging to the
same group are similar, whereas observations in different
groups are different.

Basic questions:
- What does "similar" mean?
- What is a good partition of the objects? I.e., how is
  the quality of a solution measured?
- How do we find a good partition of the observations?

Observations to cluster
Usually data objects consist of a set of attributes (also
known as dimensions):
- Real-valued attributes/variables
  e.g., salary, height
- Binary attributes
  e.g., gender (M/F), has_cancer (T/F)
- Nominal (categorical) attributes
  e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
- Ordinal/ranked attributes
  e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

Observations to cluster
- If all d dimensions are real-valued, then we can
  visualize each data object as a point in a
  d-dimensional space.
- If all d dimensions are binary, then we can think of
  each data object as a binary vector.

Distance functions
The distance d(i, j) between two objects i and j is a
metric if:
- d(i, j) >= 0 (non-negativity)
- d(i, i) = 0 (isolation)
- d(i, j) = d(j, i) (symmetry)
- d(i, j) <= d(i, h) + d(h, j) (triangle inequality)

The definitions of distance functions are usually
different for real, boolean, categorical, and ordinal
variables. Weights may be associated with different
variables based on the application and data semantics.

Data Structures

Data matrix: n tuples/objects (rows) described by d
attributes/dimensions (columns):

    x_11  ...  x_1f  ...  x_1d
     ...        ...        ...
    x_i1  ...  x_if  ...  x_id
     ...        ...        ...
    x_n1  ...  x_nf  ...  x_nd

Distance matrix: an n x n matrix storing the pairwise
distances d(i, j) between objects (only the lower triangle
is needed, since d is symmetric):

       0
    d(2,1)    0
    d(3,1)  d(3,2)    0
      :       :       :
    d(n,1)  d(n,2)   ...     0

Distance functions for binary vectors

Jaccard similarity between binary vectors X and Y:

    JSim(X, Y) = |X ∩ Y| / |X ∪ Y|

Jaccard distance between binary vectors X and Y:

    Jdist(X, Y) = 1 - JSim(X, Y)

Example: if the two vectors agree on one of the six
positions where either of them has a 1, then
    JSim = 1/6
    Jdist = 5/6

Distance functions for real-valued vectors

Lp norms or Minkowski distance:

    L_p(x, y) = (|x_1 - y_1|^p + |x_2 - y_2|^p + ... + |x_d - y_d|^p)^{1/p}
              = (Σ_{i=1}^{d} |x_i - y_i|^p)^{1/p}

where p is a positive integer.

If p = 1, L1 is the Manhattan (or city block) distance:

    L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + ... + |x_d - y_d|
              = Σ_{i=1}^{d} |x_i - y_i|
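
A small Python sketch (illustrative, not from the slides)
of the Lp distance; p = 1 recovers the Manhattan distance
and p = 2 the Euclidean distance:

    def minkowski(x, y, p):
        # (sum_i |x_i - y_i|^p)^(1/p)
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    print(minkowski([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
    print(minkowski([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)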

Distance functions for real-valued vectors

If p = 2, L2 is the Euclidean distance:

    d(x, y) = (|x_1 - y_1|^2 + |x_2 - y_2|^2 + ... + |x_d - y_d|^2)^{1/2}

One can also use a weighted distance:

    d(x, y) = (w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + ... + w_d |x_d - y_d|^2)^{1/2}
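
A corresponding sketch of the weighted Euclidean distance
(illustrative; with all weights equal to 1 it reduces to
the plain L2 distance):

    from math import sqrt

    def weighted_euclidean(x, y, w):
        # sqrt(sum_i w_i * |x_i - y_i|^2)
        return sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

    print(weighted_euclidean([0, 0], [3, 4], w=[1, 1]))  # 5.0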

Algorithms: basic concept

Construct a partition of a set of n objects into a set of
k clusters.
- Hierarchical clustering
  - Single linkage
  - Complete linkage
  - Average linkage
- Partitioning clustering
  - K-means

The k-means problem

Given a set X of n points in a d-dimensional space and an
integer k.
Task: choose a set of k points {c_1, c_2, ..., c_k} in the
d-dimensional space to form clusters {C_1, C_2, ..., C_k}
such that

    Cost(C) = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x - c_i||^2

is minimized, where || · || is the L2 (Euclidean) norm.
Some special cases: k = 1, k = n.
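
As a sketch (not from the slides), the cost of a given
clustering can be computed with NumPy as follows; X is an
n x d array, labels assigns each point to a cluster, and
centers holds the k center vectors:

    import numpy as np

    def kmeans_cost(X, labels, centers):
        # Sum of squared L2 distances from each point to its cluster center.
        return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centers))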

The k-means algorithm

One way of solving the k-means problem:
1. Randomly pick k cluster centers {c_1, ..., c_k}
2. For each i, set the cluster C_i to be the set of points
   in X that are closer to c_i than they are to c_j for
   all j ≠ i
3. For each i, let c_i be the center of cluster C_i (the
   mean of the vectors in C_i)
4. Repeat steps 2-3 until convergence
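
A minimal NumPy sketch of this iteration (illustrative;
empty clusters are not handled):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: randomly pick k distinct data points as the initial centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each point to its closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: move each center to the mean of its cluster.
            new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centers, centers):  # converged
                break
            centers = new_centers
        return labels, centers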

k-means algorithm
- Finds a local optimum.
- Often converges quickly (but not always).
- The choice of initial points can have a large influence:
  - clusters of different densities
  - clusters of different sizes
- Outliers can also cause a problem. (Example?)

Some alternatives to random initialization of the central
points:
- Multiple runs
  Helps, but probability is not on your side.
- Select the initial set of points by methods other than
  random, e.g., pick the most distant (from each other)
  points as cluster centers (k-means++ algorithm).
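
A sketch of k-means++-style seeding (illustrative): the
first center is chosen uniformly, and each further center
is drawn with probability proportional to its squared
distance from the nearest center chosen so far, which
favors far-apart points:

    import numpy as np

    def kmeanspp_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]  # first center: uniform at random
        for _ in range(k - 1):
            # Squared distance of each point to its nearest chosen center.
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)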

What is the right number of clusters?
(Or: who sets the value of k?)
- For n points to be clustered, consider the case where
  k = n. What is the value of the error function?
- What happens when k = 1?
- Since we want to minimize the error, why don't we always
  select k = n?
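
One way to see the answer (an illustrative sketch reusing
kmeans and kmeans_cost from the sketches above; the data
is made up): the cost typically shrinks as k grows and is
exactly 0 at k = n, which is why the raw error cannot be
used to choose k:

    import numpy as np

    X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.]])
    costs = []
    for k in range(1, len(X) + 1):
        labels, centers = kmeans(X, k, seed=k)
        costs.append(kmeans_cost(X, labels, centers))
    print(costs)  # roughly decreasing; the k = n entry is 0.0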

Hierarchical Clustering
- Produces a set of nested clusters organized as a
  hierarchical tree.
- Can be visualized as a dendrogram: a tree-like diagram
  that records the sequences of merges or splits.

[Figure: example dendrogram with merge heights on the
vertical axis]

Strengths of Hierarchical Clustering
- No assumptions on the number of clusters: any desired
  number of clusters can be obtained by cutting the
  dendrogram at the proper level.
- Hierarchical clustering may correspond to meaningful
  taxonomies.

Hierarchical Clustering
Two main types of hierarchical clustering:
- Agglomerative:
  - Start with the points as individual clusters.
  - At each step, merge the closest pair of clusters until
    only one cluster (or k clusters) remains.
- Divisive:
  - Start with one, all-inclusive cluster.
  - At each step, split a cluster until each cluster
    contains a single point (or there are k clusters).

Traditional hierarchical algorithms use a similarity or
distance matrix and merge or split one cluster at a time.

Agglomerative clustering algorithm

Most popular hierarchical clustering technique.

Basic algorithm:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the distance matrix
6. Until only a single cluster remains

The key operation is the computation of the distance
between two clusters. Different definitions of the
distance between clusters lead to different algorithms
(see the sketch below).
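
For concreteness, SciPy ships an implementation of this
algorithm; a sketch (the data here is made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).random((10, 2))  # 10 random 2-D points
    # method='single' | 'complete' | 'average' selects the inter-cluster
    # distance defined on the following slides.
    Z = linkage(X, method='single')
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters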

Input / Initial setting

Start with clusters of individual points and a
distance/proximity matrix with one row and one column per
point p1, p2, p3, p4, p5, ...

[Figure: the points and their distance/proximity matrix]

Intermediate State

After some merging steps, we have some clusters: C1, C2,
C3, C4, C5.

[Figure: the current clusters and their distance/proximity
matrix, with one row and one column per cluster]

Intermediate State

Merge the two closest clusters (C2 and C5) and update the
distance matrix.

[Figure: the same distance/proximity matrix, with C2 and
C5 about to be merged]

After Merging

How do we update the distance matrix? The distances from
the merged cluster C2 ∪ C5 to C1, C3, and C4 are the
entries still to be filled in.

[Figure: the updated matrix, with "?" for each entry
involving C2 ∪ C5]

Distance between two clusters
- Each cluster is a set of points.
- How do we define distance between two sets of points?
- Lots of alternatives.
- Not an easy task.

Distance between two clusters

Single-link distance between clusters C_i and C_j is the
minimum distance between any object in C_i and any object
in C_j, i.e., the distance is defined by the two most
similar objects:

    D_sl(C_i, C_j) = min { d(x, y) : x ∈ C_i, y ∈ C_j }
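
In code (an illustrative sketch; d can be any pairwise
distance function, the points are made up):

    from math import dist  # Euclidean distance between two points

    def single_link(Ci, Cj, d=dist):
        # Minimum pairwise distance across the two clusters.
        return min(d(x, y) for x in Ci for y in Cj)

    print(single_link([(0, 0), (1, 0)], [(3, 0), (5, 0)]))  # 2.0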

Single-link clustering: example

Determined by one pair of points, i.e., by one link in the
proximity graph.

Proximity (similarity) matrix:

          I1    I2    I3    I4    I5
    I1   1.00  0.90  0.10  0.65  0.20
    I2   0.90  1.00  0.70  0.60  0.50
    I3   0.10  0.70  1.00  0.40  0.30
    I4   0.65  0.60  0.40  1.00  0.80
    I5   0.20  0.50  0.30  0.80  1.00

Single-link clustering: example

[Figure: nested clusters (left) and the corresponding
dendrogram (right)]

Distance between two clusters

Complete-link distance between clusters C_i and C_j is the
maximum distance between any object in C_i and any object
in C_j, i.e., the distance is defined by the two most
dissimilar objects:

    D_cl(C_i, C_j) = max { d(x, y) : x ∈ C_i, y ∈ C_j }
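
The corresponding sketch (same illustrative setup as for
single link):

    from math import dist

    def complete_link(Ci, Cj, d=dist):
        # Maximum pairwise distance across the two clusters.
        return max(d(x, y) for x in Ci for y in Cj)

    print(complete_link([(0, 0), (1, 0)], [(3, 0), (5, 0)]))  # 5.0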

Complete-link clustering: example

Distance between clusters is determined by the two most
distant points in the different clusters.

Proximity (similarity) matrix:

          I1    I2    I3    I4    I5
    I1   1.00  0.90  0.10  0.65  0.20
    I2   0.90  1.00  0.70  0.60  0.50
    I3   0.10  0.70  1.00  0.40  0.30
    I4   0.65  0.60  0.40  1.00  0.80
    I5   0.20  0.50  0.30  0.80  1.00

Complete-link clustering: example

[Figure: nested clusters (left) and the corresponding
dendrogram (right)]

Distance between two clusters

Group average distance between clusters C_i and C_j is the
average distance between any object in C_i and any object
in C_j:

    D_avg(C_i, C_j) = (1 / (|C_i| |C_j|)) Σ_{x ∈ C_i, y ∈ C_j} d(x, y)
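
And the corresponding sketch (same illustrative setup):

    from math import dist

    def average_link(Ci, Cj, d=dist):
        # Mean of all |Ci| * |Cj| pairwise distances.
        return sum(d(x, y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))

    print(average_link([(0, 0), (1, 0)], [(3, 0), (5, 0)]))  # 3.5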

Average-link clustering: example

Proximity of two clusters is the average of pairwise
proximity between points in the two clusters.

Proximity (similarity) matrix:

          I1    I2    I3    I4    I5
    I1   1.00  0.90  0.10  0.65  0.20
    I2   0.90  1.00  0.70  0.60  0.50
    I3   0.10  0.70  1.00  0.40  0.30
    I4   0.65  0.60  0.40  1.00  0.80
    I5   0.20  0.50  0.30  0.80  1.00

Average-link clustering: example

[Figure: nested clusters (left) and the corresponding
dendrogram (right)]

Average-link clustering
- A compromise between single and complete link.
- Strengths: less susceptible to noise and outliers.

Thank you
