
Introduction to Clustering (Challenges in Clustering)

Conceptual:
- Interpretation of clusters is difficult
- No correct clustering is known in advance
Technical:
- Outliers
- Incremental databases (data streams)
Clustering I
1 Introduction to Clustering
2 Dendrogram, Hierarchical Clustering
3 Nearest Neighbor Clustering
4 K-Means Algorithm
5 K-medoids (PAM Algorithm)
6 CLARA
7 CLARANS
8 Questions?
Introduction to Clustering (Notations and Definitions used in Clustering 1/2)

Assume data of one cluster: \{t_1, t_2, \ldots, t_n\} \subset R^m, that is, t_1 = (t^1_1, t^2_1, \ldots, t^m_1).

The centroid of the cluster is:

(C^1, C^2, \ldots, C^m) = \left( \frac{1}{n}\sum_{i=1}^{n} t^1_i,\ \frac{1}{n}\sum_{i=1}^{n} t^2_i,\ \ldots,\ \frac{1}{n}\sum_{i=1}^{n} t^m_i \right)   (1)

The radius of the cluster (average distance from the centroid):

(R^1, R^2, \ldots, R^m) = \left( \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(t^1_i - C^1\right)^2},\ \ldots,\ \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(t^m_i - C^m\right)^2} \right)   (2)

The diameter of the cluster (average distance between any two points in the cluster):

(D^1, D^2, \ldots, D^m) = \left( \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} \left(t^1_i - t^1_j\right)^2},\ \ldots,\ \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} \left(t^m_i - t^m_j\right)^2} \right)   (3)
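
As an illustration, here is a small NumPy sketch of Equations (1)-(3); the function and variable names are my own, not from the slides.

import numpy as np

def cluster_summaries(T):
    """T is an (n, m) array with one row per point t_i."""
    centroid = T.mean(axis=0)                              # Eq. (1): per-dimension mean
    radius = np.sqrt(((T - centroid) ** 2).mean(axis=0))   # Eq. (2): avg. squared distance to the centroid, per dimension
    diffs = T[:, None, :] - T[None, :, :]                  # all pairwise component differences
    diameter = np.sqrt((diffs ** 2).mean(axis=(0, 1)))     # Eq. (3): avg. squared pairwise distance, per dimension
    return centroid, radius, diameter

T = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]])         # a small 2-D cluster
print(cluster_summaries(T))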
Introduction to Clustering (the Idea)
Definition 0.1 Given a database D = {t_1, t_2, ..., t_n} (and a number of clusters k), the clustering problem is to define the mapping f : D → {1, 2, ..., k}, where each t is assigned to one cluster only.

Definition 0.2 Given a database D = {t_1, t_2, ..., t_n} and a set of classes C = {C_1, C_2, ..., C_m}, the classification problem is to define the mapping f : D → C, where each t is assigned to one class only.

Clustering is similar to classification since both methods group data.
Clustering is different from classification since no classes are predefined in clustering.
Dendrogram, Hierarchical Clustering

A dendrogram is a tree data structure which represents the distribution of the DB points into clusters
Leaf nodes of the dendrogram contain the individual DB points
The root contains the whole DB as one cluster
Inner layers of the tree represent new clusters formed by merging the child nodes in the tree
The dendrogram is the basis for hierarchical clustering

A dendrogram can be expressed as a set of triplets {<d, k, K>}, where d is the distance level of the merge, k the number of clusters, and K the set of clusters:

{ <0, 7, {{A}, {B}, {C}, {D}, {E}, {F}, {G}}>,
  <1, 4, {{A}, {B, C}, {D, E}, {F, G}}>,
  <2, 3, {{A, B, C}, {D, E}, {F, G}}>,
  <3, 2, {{A, B, C}, {D, E, F, G}}>,
  <4, 1, {{A, B, C, D, E, F, G}}> }
[Figure: dendrograms over the points A-G, cut at successive distance levels]
Introduction to Clustering (Notations and Definitions used in Clustering 2/2)

Assume data of two clusters: \{t^A_1, t^A_2, \ldots, t^A_{n_A}\} and \{t^B_1, t^B_2, \ldots, t^B_{n_B}\}.

The single link:

\min_{i,j} \{ d(t^A_i, t^B_j) \}   (4)

The complete link:

\max_{i,j} \{ d(t^A_i, t^B_j) \}   (5)

Average distance:

\frac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} d(t^A_i, t^B_j)   (6)

Centroid distance:

d(C^A, C^B)   (7)

where C^A and C^B are the centroids of the two clusters.

An outlier is a cluster with only a few observations.
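
As an illustration, a small NumPy sketch of Equations (4)-(7); the function name and the example data are my own.

import numpy as np

def inter_cluster_distances(A, B):
    """A: (n_A, m) array, B: (n_B, m) array of points from two clusters."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # all pairwise distances d(t^A_i, t^B_j)
    single   = d.min()                                           # Eq. (4): single link
    complete = d.max()                                           # Eq. (5): complete link
    average  = d.mean()                                          # Eq. (6): average distance
    centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # Eq. (7): centroid distance
    return single, complete, average, centroid

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0], [4.0, 4.0]])
print(inter_cluster_distances(A, B))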
Dendrogram, Hierarchical Clustering (Distances)

Dendrograms (agglomerative clustering) can employ different error metrics to cluster the data:

Single link: \min_{i,j} \{ d(t^A_i, t^B_j) \}
Complete link: \max_{i,j} \{ d(t^A_i, t^B_j) \}
Average link: \frac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} d(t^A_i, t^B_j)

Only distances between the data points are needed to build a dendrogram.
The memory complexity of dendrograms is O(n^2). The time complexity of dendrograms is O(n^2).
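
For comparison, a short example of agglomerative clustering with SciPy (assuming scipy and numpy are installed); the 'single', 'complete', and 'average' methods correspond to the three link metrics above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),        # two well-separated blobs
               rng.normal(3, 0.3, (10, 2))])

Z = linkage(X, method='single')                    # the merge table behind the dendrogram
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the dendrogram into 2 clusters
print(labels)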
Introduction to Clustering (Analytical Investigation)
The analytical investigation of the clustering problem is as follows:

Consider all possible partitions of the given database into clusters X_1, X_2, \ldots, X_k.
Choose the partition which minimizes the error between the clusters (single-link error, centroid distance error, or any other error).

The number of possible partitions is the Stirling number of the second kind:

S(n, k) = \frac{1}{k!} \sum_{i=1}^{k} (-1)^{k-i} \binom{k}{i} i^{n}, \qquad S(19, 4) = 11,259,666,950

Most algorithms look at only a small subset of all possible clusterings.
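
A quick standard-library check of the partition count (illustrative only):

from math import comb, factorial

def stirling2(n, k):
    """Number of ways to partition n items into k non-empty clusters."""
    return sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(1, k + 1)) // factorial(k)

print(stirling2(19, 4))   # 11259666950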
Nearest Neighbor Clustering
Input parameters: the number of nearest clustered points l to consider (and the distance threshold t)

Algorithm (a sketch in Python follows the steps below):
1. Initial step: the first DB point forms its own cluster
2. Scan the remaining points. For each DB point X:
   2.1 Find the l nearest clustered points
   2.2 Keep only the clustered points that are closer than t
   2.3 Assign X to the cluster which contains the most of these nearest neighbors
   2.4 If no clustered point is within distance t, X forms a new cluster
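
A minimal sketch of this procedure, assuming Euclidean distance; the parameter names l and t follow the slide, everything else is illustrative.

import math
from collections import Counter

def nn_clustering(points, l, t):
    """points: list of coordinate tuples; returns one cluster label per point."""
    labels = [None] * len(points)
    labels[0] = 0                                            # step 1: the first point forms its own cluster
    next_label = 1
    for i in range(1, len(points)):
        dists = [(math.dist(points[i], points[j]), j) for j in range(i)]
        nearest = sorted(dists)[:l]                          # step 2.1: l nearest clustered points
        nearby = [j for d, j in nearest if d <= t]           # step 2.2: keep those closer than t
        if nearby:                                           # step 2.3: majority cluster among them
            labels[i] = Counter(labels[j] for j in nearby).most_common(1)[0][0]
        else:                                                # step 2.4: otherwise start a new cluster
            labels[i] = next_label
            next_label += 1
    return labels

print(nn_clustering([(0, 0), (0, 1), (5, 5), (5, 6), (0, 0.5)], l=3, t=2.0))   # [0, 0, 1, 1, 0]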
Dendogram, Hierarchical Clustering (Conclusions)
The single link approach can produce chain-like clusters
The complete (or average) link approach produces only spherical clusters
K-Means Algorithm (the Idea)

The k-means algorithm clusters data in 4 steps (a sketch in Python follows below):

1. Initialize k centroids: choose k data points C_1, C_2, \ldots, C_k randomly in space.
2. Assign the data points to clusters. Data point X_i is assigned to cluster l if
   l = \arg\min_{j=1,2,\ldots,k} d(X_i, C_j)
3. Calculate new centroids for the clusters.
4. Iterate steps 2 and 3 until the centroids do not move,
   (or) until no points move from one cluster to another,
   (or) until fewer than p% of the points move from one cluster to another,
   (or) iterate a fixed number of times (e.g., 7 times),
   ...
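
As an illustration, a minimal k-means sketch following steps 1-4 above, assuming NumPy and Euclidean distance; this is not an optimized implementation, and the names are my own.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]      # step 1: k random data points as centroids
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # step 2: assign points to the nearest centroid
        # step 3: recompute centroids (keep the old one if a cluster became empty)
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):                          # step 4: stop when the centroids do not move
            break
        C = new_C
    return labels, C

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
print(kmeans(X, k=2)[1])                                   # the two centroids, near (0, 0) and (5, 5)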
Dendrogram, Hierarchical Clustering (MSP Tree, Divisive Clustering)

The minimum spanning tree (MSP) is a single-link dendrogram.
The tree can also be used for divisive clustering.
Divisive clustering places all items in one cluster and then repeatedly splits clusters until every point forms its own cluster.
[Figure: an example minimum spanning tree with edge weights 1-3]
K-medoids (PAM Algorithm) (Swapping the Medoids)
Let M_1, M_2, \ldots, M_k be the medoids.
Let K_1, K_2, \ldots, K_k be the clusters of the medoids.
We swap a medoid M_i with a non-medoid M^N_i.
The swap requires recalculating the distances between M_i, M^N_i and all the remaining non-medoids X_j.
K-Means Algorithm (Conclusions)

Time complexity is O(tkn) (t iterations, k clusters, n points)
Memory complexity is O(k)
Works well for spherical clusters
Is not robust to outliers
Does not work well for arbitrarily shaped clusters
K-medoids (PAM Algorithm) (Swapping the Medoids)
Swapping medoid M_i with non-medoid M^N_i changes the quality of the clustering in one of four cases:

1. Some of my points run away. Let X_j \in K_i, but another medoid M_\ell is closer to X_j:
   C_{jih} = d(X_j, M_\ell) - d(X_j, M_i)

2. The points stay in the cluster; they are either happy or unhappy about the move. Let X_j \in K_i, and no other medoid M_\ell is closer:
   C_{jih} = d(X_j, M^N_i) - d(X_j, M_i)

3. The points of other clusters ignore the change. X_j \notin K_i, and X_j does not change its medoid:
   C_{jih} = 0

4. Some of the other points want into my cluster. X_j \notin K_i, but X_j is closer to the new medoid M^N_i than to its current medoid M_\ell:
   C_{jih} = d(X_j, M^N_i) - d(X_j, M_\ell)

The cost of replacing M_i with M^N_i is TC_{ih} = \sum_j C_{jih}.
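
As an illustration, the sketch below computes TC_ih directly as the change in total assignment cost, which is equivalent to summing the four cases above; the function and parameter names are my own.

def swap_cost(points, medoids, i, candidate, dist):
    """Cost TC of replacing medoids[i] by `candidate` (a non-medoid point)."""
    old = medoids[i]
    others = [m for m in medoids if m != old]                # medoids that stay
    total = 0.0
    for x in points:
        if x == old or x == candidate or x in others:
            continue
        d_old = min(dist(x, m) for m in medoids)             # current assignment cost of x
        d_new = min(min(dist(x, m) for m in others), dist(x, candidate))  # cost after the swap
        total += d_new - d_old                               # covers cases 1-4 (case 3 contributes 0)
    return total

pts = [0.0, 1.0, 5.0, 6.0, 9.0]                              # points on a line, Euclidean distance
d = lambda a, b: abs(a - b)
print(swap_cost(pts, medoids=[0.0, 5.0], i=1, candidate=6.0, dist=d))   # -1.0: the swap improves the clustering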
K-medoids (PAM Algorithm) (the Idea)
The idea of Partitioning Around Medoids (PAM) is very similar to the idea of k-means: the only difference is that cluster centroids are replaced by cluster representatives (medoids).

K-medoids clusters the data in 4 steps (a sketch in Python follows below):

1. Initialize k medoids. Choose any k database points M_1, M_2, \ldots, M_k as the initial medoids.
2. Compute new medoids for the clusters by swapping a medoid M_i with a non-selected data point X_j. The swap should improve the quality of the clustering, i.e.

   \sum_{i=1}^{k} \sum_{j=1}^{n_i} d(X^i_j, M_i)  decreases,   (8)

   where n_i is the number of points in cluster i.
3. Repeat step 2 until Equation (8) no longer decreases.
4. Assign the non-selected data points to clusters. Data point X_i is assigned to cluster l if
   l = \arg\min_{j=1,2,\ldots,k} d(X_i, M_j)
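
A compact PAM-style sketch following steps 1-4 above, assuming NumPy and an exhaustive swap search (so it is only practical for small n); the names are illustrative.

import numpy as np

def pam(X, k, seed=0):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # full distance matrix
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))        # step 1: k random database points as medoids
    cost = lambda meds: D[:, meds].min(axis=1).sum()             # Eq. (8): total distance to the nearest medoid
    improved = True
    while improved:                                              # step 3: repeat while Eq. (8) decreases
        improved = False
        for i in range(k):
            for h in range(n):                                   # step 2: try swapping medoid i with point h
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    labels = D[:, medoids].argmin(axis=1)                        # step 4: assign points to the nearest medoid
    return [X[m] for m in medoids], labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
print(pam(X, k=2))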
CLARANS (the Idea of Neighbors)
Let S_1 = \{M_1, M_2, \ldots, M_i, \ldots, M_k\} be a set of medoids (a clustering) and S_2 = \{M_1, M_2, \ldots, M_h, \ldots, M_k\} be the clustering after the swap.
We organize the clusterings S_s into a graph G_{n,k}.
S_{s_1} and S_{s_2} are connected by an arc if S_{s_2} is a clustering obtained from S_{s_1} with one swap.
Each clustering S_s has k(n - k) neighbors.
K-medoids searches for the minimum on the graph G_{n,k}.
[Figure: the node (M1, M2, M3) of G_{n,k} and some of its neighbors, e.g. (M1, M2, Xh), (M1, M2, X1), (M1, X1, M3), each obtained by one swap]
K-medoids (PAM Algorithm) (Example)
The four swap cases and the cost C_{jih} from the previous slide are applied to the items A-E with the following distance matrix:
Item   A  B  C  D  E
A      0  1  2  2  3
B      1  0  2  4  3
C      2  2  0  1  5
D      2  4  1  0  3
E      3  3  5  3  0
CLARANS (the Idea)
CLARANS extends CLARA by sampling the neighbors in the calculation of the cost difference.
CLARANS has two input parameters: numlocal and maxneighbor.
numlocal is the number of samples (restarts) to be taken, as in CLARA
maxneighbor is the maximum number of neighbors examined
Rule of thumb:
numlocal = 2
maxneighbor = max{0.0125 * k(n - k), 250}
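
A rough CLARANS-style sketch under the rule of thumb above, assuming NumPy; it is a randomized local search over medoid sets, not a reference implementation, and the names are illustrative.

import numpy as np

def clarans(X, k, numlocal=2, seed=0):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    maxneighbor = int(max(0.0125 * k * (n - k), 250))        # rule of thumb from the slide
    rng = np.random.default_rng(seed)
    cost = lambda meds: D[:, meds].min(axis=1).sum()
    best, best_cost = None, np.inf
    for _ in range(numlocal):                                 # numlocal independent restarts
        current = list(rng.choice(n, size=k, replace=False))
        tried = 0
        while tried < maxneighbor:                            # examine random neighbors (one-swap clusterings)
            i, h = int(rng.integers(k)), int(rng.integers(n))
            if h in current:
                continue
            neighbor = current[:i] + [h] + current[i + 1:]
            if cost(neighbor) < cost(current):                # move to the better neighbor, reset the counter
                current, tried = neighbor, 0
            else:
                tried += 1
        if cost(current) < best_cost:
            best, best_cost = current, cost(current)
    return [X[m] for m in best], best_cost

X = np.random.default_rng(1).normal(size=(60, 2))
print(clarans(X, k=3)[1])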
CLARA
CLARA combines k-medoids and sampling
A sample size of 40 + 2k seems to give good results
A few independent samples can be drawn to improve the quality of the clustering. The sample which results in the best clustering is used to cluster the whole dataset
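
A minimal CLARA-style sketch, assuming NumPy: run an exhaustive-swap k-medoids on a few small samples and keep the medoid set that clusters the whole dataset best. The helper mirrors the PAM sketch on the earlier slide; all names are illustrative, and for brevity the full distance matrix is precomputed, which a real CLARA would avoid.

import numpy as np

def pam_on_sample(D, idx, k, rng):
    """Exhaustive-swap k-medoids restricted to the sample given by the index list idx."""
    meds = list(rng.choice(idx, size=k, replace=False))
    cost = lambda m: D[np.ix_(idx, m)].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in idx:
                if h in meds:
                    continue
                trial = meds[:i] + [h] + meds[i + 1:]
                if cost(trial) < cost(meds):
                    meds, improved = trial, True
    return meds

def clara(X, k, numsamples=5, seed=0):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rng = np.random.default_rng(seed)
    samplesize = min(n, 40 + 2 * k)                           # the rule of thumb from the slide
    best, best_cost = None, np.inf
    for _ in range(numsamples):                               # a few independent samples
        idx = list(rng.choice(n, size=samplesize, replace=False))
        meds = pam_on_sample(D, idx, k, rng)
        total = D[:, meds].min(axis=1).sum()                  # quality measured on the whole dataset
        if total < best_cost:
            best, best_cost = meds, total
    return D[:, best].argmin(axis=1), [X[m] for m in best]

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (30, 2)),
               np.random.default_rng(2).normal(5, 0.5, (30, 2))])
print(clara(X, k=2)[1])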
Questions?
Questions?