
Introduction to Clustering (Challenges in Clustering)

Conceptual:
- Interpretation of clusters is difficult
- No correct clustering is known in advance
Technical:
- Outliers
- Incremental databases (data streams)
Clustering I
1 Introduction to Clustering
2 Dendrogram, Hierarchical Clustering
3 Nearest Neighbor Clustering
4 K-Means Algorithm
5 K-medoids (PAM Algorithm)
6 CLARA
7 CLARANS
8 Questions?
Introduction to Clustering (Notations and Definitions used in Clustering 1/2)

Assume data of one cluster: \{t_1, t_2, \ldots, t_n\} \subset R^m, that is, t_1 = (t^1_1, t^2_1, \ldots, t^m_1).

The centroid of the cluster is:

(C^1, C^2, \ldots, C^m) = \left( \frac{1}{n}\sum_{i=1}^{n} t^1_i,\ \frac{1}{n}\sum_{i=1}^{n} t^2_i,\ \ldots,\ \frac{1}{n}\sum_{i=1}^{n} t^m_i \right)   (1)

The radius of the cluster (average distance from the centroid):

(R^1, R^2, \ldots, R^m) = \left( \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(t^1_i - C^1\right)^2},\ \ldots,\ \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(t^m_i - C^m\right)^2} \right)   (2)

The diameter of the cluster (average distance between any two points in the cluster):

(D^1, D^2, \ldots, D^m) = \left( \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} \left(t^1_i - t^1_j\right)^2},\ \ldots,\ \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} \left(t^m_i - t^m_j\right)^2} \right)   (3)
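
As an illustration, here is a small NumPy sketch of Equations (1)-(3); the function and variable names are my own, not from the slides.

import numpy as np

def cluster_summaries(T):
    """T is an (n, m) array with one row per point t_i."""
    centroid = T.mean(axis=0)                              # Eq. (1): per-dimension mean
    radius = np.sqrt(((T - centroid) ** 2).mean(axis=0))   # Eq. (2): avg. squared distance to the centroid, per dimension
    diffs = T[:, None, :] - T[None, :, :]                  # all pairwise component differences
    diameter = np.sqrt((diffs ** 2).mean(axis=(0, 1)))     # Eq. (3): avg. squared pairwise distance, per dimension
    return centroid, radius, diameter

T = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]])         # a small 2-D cluster
print(cluster_summaries(T))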
Introduction to Clustering (the Idea)
Definition 0.1 Given a database D = {t_1, t_2, ..., t_n} (and a number of clusters k), the clustering problem is to define the mapping f : D → {1, 2, ..., k}, where each t is assigned to one cluster only.

Definition 0.2 Given a database D = {t_1, t_2, ..., t_n} and a set of classes C = {C_1, C_2, ..., C_m}, the classification problem is to define the mapping f : D → C, where each t is assigned to one class only.

Clustering is similar to classification since both methods group data.
Clustering is different from classification since no classes are predefined in clustering.
Dendrogram, Hierarchical Clustering

A dendrogram is a tree data structure which represents the distribution of the DB points into clusters
Leaf nodes of the dendrogram contain the individual DB points
The root contains the whole DB as one cluster
Inner layers of the tree represent new clusters formed by merging the child nodes in the tree
The dendrogram is the basis for hierarchical clustering

A dendrogram can be expressed as a set of triplets {<d, k, K>}, where d is the distance level of the merge, k the number of clusters, and K the set of clusters:

{ <0, 7, {{A}, {B}, {C}, {D}, {E}, {F}, {G}}>,
  <1, 4, {{A}, {B, C}, {D, E}, {F, G}}>,
  <2, 3, {{A, B, C}, {D, E}, {F, G}}>,
  <3, 2, {{A, B, C}, {D, E, F, G}}>,
  <4, 1, {{A, B, C, D, E, F, G}}> }
[Figure: dendrograms over the points A-G, cut at successive distance levels]
Introduction to Clustering (Notations and Definitions used in Clustering 2/2)

Assume data of two clusters: \{t^A_1, t^A_2, \ldots, t^A_{n_A}\} and \{t^B_1, t^B_2, \ldots, t^B_{n_B}\}.

The single link:

\min_{i,j} \{ d(t^A_i, t^B_j) \}   (4)

The complete link:

\max_{i,j} \{ d(t^A_i, t^B_j) \}   (5)

Average distance:

\frac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} d(t^A_i, t^B_j)   (6)

Centroid distance:

d(C^A, C^B)   (7)

where C^A and C^B are the centroids of the two clusters.

An outlier is a cluster with only a few observations.
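
As an illustration, a small NumPy sketch of Equations (4)-(7); the function name and the example data are my own.

import numpy as np

def inter_cluster_distances(A, B):
    """A: (n_A, m) array, B: (n_B, m) array of points from two clusters."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # all pairwise distances d(t^A_i, t^B_j)
    single   = d.min()                                           # Eq. (4): single link
    complete = d.max()                                           # Eq. (5): complete link
    average  = d.mean()                                          # Eq. (6): average distance
    centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # Eq. (7): centroid distance
    return single, complete, average, centroid

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0], [4.0, 4.0]])
print(inter_cluster_distances(A, B))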
Dendrogram, Hierarchical Clustering (Distances)

Dendrograms (agglomerative clustering) can employ different error metrics to cluster the data:

Single link: \min_{i,j} \{ d(t^A_i, t^B_j) \}
Complete link: \max_{i,j} \{ d(t^A_i, t^B_j) \}
Average link: \frac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} d(t^A_i, t^B_j)

Only distances between the data points are needed to build a dendrogram.
The memory complexity of dendrograms is O(n^2). The time complexity of dendrograms is O(n^2).
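
For comparison, a short example of agglomerative clustering with SciPy (assuming scipy and numpy are installed); the 'single', 'complete', and 'average' methods correspond to the three link metrics above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),        # two well-separated blobs
               rng.normal(3, 0.3, (10, 2))])

Z = linkage(X, method='single')                    # the merge table behind the dendrogram
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the dendrogram into 2 clusters
print(labels)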
Introduction to Clustering (Analytical Investigation)
The analytical investigation of the clustering problem is as follows:

Consider all possible partitions of the given database into clusters X_1, X_2, \ldots, X_k.
Choose the partition which minimizes the error between the clusters (single-link error, centroid distance error, or any other error).

The number of possible partitions is the Stirling number of the second kind:

S(n, k) = \frac{1}{k!} \sum_{i=1}^{k} (-1)^{k-i} \binom{k}{i} i^{n}, \qquad S(19, 4) = 11,259,666,950

Most algorithms look at only a small subset of all possible clusterings.
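
A quick standard-library check of the partition count (illustrative only):

from math import comb, factorial

def stirling2(n, k):
    """Number of ways to partition n items into k non-empty clusters."""
    return sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(1, k + 1)) // factorial(k)

print(stirling2(19, 4))   # 11259666950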
Nearest Neighbor Clustering
Input parameters: the number of nearest clustered points l to consider (and the distance threshold t)

Algorithm (a sketch in Python follows the steps below):
1. Initial step: the first DB point forms its own cluster
2. Scan the remaining points. For each DB point X:
   2.1 Find the l nearest clustered points
   2.2 Keep only the clustered points that are closer than t
   2.3 Assign X to the cluster which contains the most of these nearest neighbors
   2.4 If no clustered point is within distance t, X forms a new cluster
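
A minimal sketch of this procedure, assuming Euclidean distance; the parameter names l and t follow the slide, everything else is illustrative.

import math
from collections import Counter

def nn_clustering(points, l, t):
    """points: list of coordinate tuples; returns one cluster label per point."""
    labels = [None] * len(points)
    labels[0] = 0                                            # step 1: the first point forms its own cluster
    next_label = 1
    for i in range(1, len(points)):
        dists = [(math.dist(points[i], points[j]), j) for j in range(i)]
        nearest = sorted(dists)[:l]                          # step 2.1: l nearest clustered points
        nearby = [j for d, j in nearest if d <= t]           # step 2.2: keep those closer than t
        if nearby:                                           # step 2.3: majority cluster among them
            labels[i] = Counter(labels[j] for j in nearby).most_common(1)[0][0]
        else:                                                # step 2.4: otherwise start a new cluster
            labels[i] = next_label
            next_label += 1
    return labels

print(nn_clustering([(0, 0), (0, 1), (5, 5), (5, 6), (0, 0.5)], l=3, t=2.0))   # [0, 0, 1, 1, 0]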
Dendogram, Hierarchical Clustering (Conclusions)
The single link approach can produce chain-like clusters
The complete (or average) link approach produces only spherical clusters
K-Means Algorithm (the Idea)

The k-means algorithm clusters data in 4 steps (a sketch in Python follows below):

1. Initialize k centroids: choose k data points C_1, C_2, \ldots, C_k randomly in space.
2. Assign the data points to clusters. Data point X_i is assigned to cluster l if
   l = \arg\min_{j=1,2,\ldots,k} d(X_i, C_j)
3. Calculate new centroids for the clusters.
4. Iterate steps 2 and 3 until the centroids do not move,
   (or) until no points move from one cluster to another,
   (or) until fewer than p% of the points move from one cluster to another,
   (or) iterate a fixed number of times (e.g., 7 times),
   ...
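
As an illustration, a minimal k-means sketch following steps 1-4 above, assuming NumPy and Euclidean distance; this is not an optimized implementation, and the names are my own.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]      # step 1: k random data points as centroids
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # step 2: assign points to the nearest centroid
        # step 3: recompute centroids (keep the old one if a cluster became empty)
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):                          # step 4: stop when the centroids do not move
            break
        C = new_C
    return labels, C

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
print(kmeans(X, k=2)[1])                                   # the two centroids, near (0, 0) and (5, 5)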
Dendrogram, Hierarchical Clustering (MSP Tree, Divisive Clustering)

The minimum spanning tree (MSP) is a single-link dendrogram.
The tree can also be used for divisive clustering.
Divisive clustering places all items in one cluster and then repeatedly splits clusters until every point forms its own cluster.
[Figure: an example minimum spanning tree with edge weights 1-3]
K-medoids (PAM Algorithm) (Swapping the Medoids)
Let M_1, M_2, \ldots, M_k be the medoids.
Let K_1, K_2, \ldots, K_k be the clusters of the medoids.
We swap a medoid M_i with a non-medoid M^N_i.
The swap requires recalculating the distances between M_i, M^N_i and all the remaining non-medoids X_j.
K-Means Algorithm (Conclusions)

Time complexity is O(tkn) (t iterations, k clusters, n points)
Memory complexity is O(k)
Works well for spherical clusters
Is not robust to outliers
Does not work well for arbitrarily shaped clusters
K-medoids (PAM Algorithm) (Swapping the Medoids)
Swapping medoid M_i with non-medoid M^N_i changes the quality of the clustering in one of four cases:

1. Some of my points run away. Let X_j \in K_i, but another medoid M_\ell is closer to X_j:
   C_{jih} = d(X_j, M_\ell) - d(X_j, M_i)

2. The points stay in the cluster; they are either happy or unhappy about the move. Let X_j \in K_i, and no other medoid M_\ell is closer:
   C_{jih} = d(X_j, M^N_i) - d(X_j, M_i)

3. The points of other clusters ignore the change. X_j \notin K_i, and X_j does not change its medoid:
   C_{jih} = 0

4. Some of the other points want into my cluster. X_j \notin K_i, but X_j is closer to the new medoid M^N_i than to its current medoid M_\ell:
   C_{jih} = d(X_j, M^N_i) - d(X_j, M_\ell)

The cost of replacing M_i with M^N_i is TC_{ih} = \sum_j C_{jih}.
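
As an illustration, the sketch below computes TC_ih directly as the change in total assignment cost, which is equivalent to summing the four cases above; the function and parameter names are my own.

def swap_cost(points, medoids, i, candidate, dist):
    """Cost TC of replacing medoids[i] by `candidate` (a non-medoid point)."""
    old = medoids[i]
    others = [m for m in medoids if m != old]                # medoids that stay
    total = 0.0
    for x in points:
        if x == old or x == candidate or x in others:
            continue
        d_old = min(dist(x, m) for m in medoids)             # current assignment cost of x
        d_new = min(min(dist(x, m) for m in others), dist(x, candidate))  # cost after the swap
        total += d_new - d_old                               # covers cases 1-4 (case 3 contributes 0)
    return total

pts = [0.0, 1.0, 5.0, 6.0, 9.0]                              # points on a line, Euclidean distance
d = lambda a, b: abs(a - b)
print(swap_cost(pts, medoids=[0.0, 5.0], i=1, candidate=6.0, dist=d))   # -1.0: the swap improves the clustering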
K-medoids (PAM Algorithm) (the Idea)
The idea of Partitioning Around Medoids (PAM) is very similar to the idea of k-means: the only difference is that cluster centroids are replaced by cluster representatives (medoids).

K-medoids clusters the data in 4 steps (a sketch in Python follows below):

1. Initialize k medoids. Choose any k database points M_1, M_2, \ldots, M_k as the initial medoids.
2. Compute new medoids for the clusters by swapping a medoid M_i with a non-selected data point X_j. The swap should improve the quality of the clustering, i.e.

   \sum_{i=1}^{k} \sum_{j=1}^{n_i} d(X^i_j, M_i)  decreases,   (8)

   where n_i is the number of points in cluster i.
3. Repeat step 2 until Equation (8) no longer decreases.
4. Assign the non-selected data points to clusters. Data point X_i is assigned to cluster l if
   l = \arg\min_{j=1,2,\ldots,k} d(X_i, M_j)
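
A compact PAM-style sketch following steps 1-4 above, assuming NumPy and an exhaustive swap search (so it is only practical for small n); the names are illustrative.

import numpy as np

def pam(X, k, seed=0):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # full distance matrix
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))        # step 1: k random database points as medoids
    cost = lambda meds: D[:, meds].min(axis=1).sum()             # Eq. (8): total distance to the nearest medoid
    improved = True
    while improved:                                              # step 3: repeat while Eq. (8) decreases
        improved = False
        for i in range(k):
            for h in range(n):                                   # step 2: try swapping medoid i with point h
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    labels = D[:, medoids].argmin(axis=1)                        # step 4: assign points to the nearest medoid
    return [X[m] for m in medoids], labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
print(pam(X, k=2))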
CLARANS (the Idea of Neighbors)
Let S_1 = \{M_1, M_2, \ldots, M_i, \ldots, M_k\} be a set of medoids (a clustering) and S_2 = \{M_1, M_2, \ldots, M_h, \ldots, M_k\} be the clustering after the swap.
We organize the clusterings S_s into a graph G_{n,k}.
S_{s_1} and S_{s_2} are connected by an arc if S_{s_2} is a clustering obtained from S_{s_1} with one swap.
Each clustering S_s has k(n - k) neighbors.
K-medoids searches for the minimum on the graph G_{n,k}.
[Figure: the node (M1, M2, M3) of G_{n,k} and some of its neighbors, e.g. (M1, M2, Xh), (M1, M2, X1), (M1, X1, M3), each obtained by one swap]
K-medoids (PAM Algorithm) (Example)
The four swap cases and the cost C_{jih} from the previous slide are applied to the items A-E with the following distance matrix:
Item   A  B  C  D  E
A      0  1  2  2  3
B      1  0  2  4  3
C      2  2  0  1  5
D      2  4  1  0  3
E      3  3  5  3  0
CLARANS (the Idea)
CLARANS extends CLARA by sampling the neighbors in the calculation of the cost difference.
CLARANS has two input parameters: numlocal and maxneighbor.
numlocal is the number of samples (restarts) to be taken, as in CLARA
maxneighbor is the maximum number of neighbors examined
Rule of thumb:
numlocal = 2
maxneighbor = max{0.0125 * k(n - k), 250}
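
A rough CLARANS-style sketch under the rule of thumb above, assuming NumPy; it is a randomized local search over medoid sets, not a reference implementation, and the names are illustrative.

import numpy as np

def clarans(X, k, numlocal=2, seed=0):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    maxneighbor = int(max(0.0125 * k * (n - k), 250))        # rule of thumb from the slide
    rng = np.random.default_rng(seed)
    cost = lambda meds: D[:, meds].min(axis=1).sum()
    best, best_cost = None, np.inf
    for _ in range(numlocal):                                 # numlocal independent restarts
        current = list(rng.choice(n, size=k, replace=False))
        tried = 0
        while tried < maxneighbor:                            # examine random neighbors (one-swap clusterings)
            i, h = int(rng.integers(k)), int(rng.integers(n))
            if h in current:
                continue
            neighbor = current[:i] + [h] + current[i + 1:]
            if cost(neighbor) < cost(current):                # move to the better neighbor, reset the counter
                current, tried = neighbor, 0
            else:
                tried += 1
        if cost(current) < best_cost:
            best, best_cost = current, cost(current)
    return [X[m] for m in best], best_cost

X = np.random.default_rng(1).normal(size=(60, 2))
print(clarans(X, k=3)[1])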
CLARA
CLARA combines k-medoids and sampling
A sample size of 40 + 2k seems to give good results
A few independent samples can be drawn to improve the quality of the clustering. The sample which results in the best clustering is used to cluster the whole dataset
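
A minimal CLARA-style sketch, assuming NumPy: run an exhaustive-swap k-medoids on a few small samples and keep the medoid set that clusters the whole dataset best. The helper mirrors the PAM sketch on the earlier slide; all names are illustrative, and for brevity the full distance matrix is precomputed, which a real CLARA would avoid.

import numpy as np

def pam_on_sample(D, idx, k, rng):
    """Exhaustive-swap k-medoids restricted to the sample given by the index list idx."""
    meds = list(rng.choice(idx, size=k, replace=False))
    cost = lambda m: D[np.ix_(idx, m)].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in idx:
                if h in meds:
                    continue
                trial = meds[:i] + [h] + meds[i + 1:]
                if cost(trial) < cost(meds):
                    meds, improved = trial, True
    return meds

def clara(X, k, numsamples=5, seed=0):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rng = np.random.default_rng(seed)
    samplesize = min(n, 40 + 2 * k)                           # the rule of thumb from the slide
    best, best_cost = None, np.inf
    for _ in range(numsamples):                               # a few independent samples
        idx = list(rng.choice(n, size=samplesize, replace=False))
        meds = pam_on_sample(D, idx, k, rng)
        total = D[:, meds].min(axis=1).sum()                  # quality measured on the whole dataset
        if total < best_cost:
            best, best_cost = meds, total
    return D[:, best].argmin(axis=1), [X[m] for m in best]

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (30, 2)),
               np.random.default_rng(2).normal(5, 0.5, (30, 2))])
print(clara(X, k=2)[1])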
Questions?
Questions?