
Survey of Clustering Algorithms

PREPARED BY:
DEBABRAT DAS - R010113014
ANIKET ROY - R010113007

GUIDED BY:
JYOTISMITA TALUKDAR
ASST. PROFESSOR, CIT, UTM

CONTENTS
PROBLEM STATEMENT
WHAT IS CLUSTERING?
ALGORITHMS FOR CLUSTERING
  HIERARCHICAL CLUSTERING
  PARTITIONING BASED CLUSTERING (k-means algorithm)
BRIEF INTRODUCTION TO WEKA
CONCLUSION

PROBLEM STATEMENT AND AIM


GIVEN A SET OF RECORDS (INSTANCES, EXAMPLES, OBJECTS, OBSERVATIONS, ...), ORGANIZE THEM INTO CLUSTERS (GROUPS, CLASSES).

CLUSTERING: THE PROCESS OF GROUPING PHYSICAL OR ABSTRACT OBJECTS INTO CLASSES OF SIMILAR OBJECTS.

AIM:
HERE WE USE THE WEKA TOOL TO COMPARE DIFFERENT CLUSTERING ALGORITHMS ON THE IRIS DATA SET AND SHOW THE DIFFERENCES BETWEEN THEM.

WHAT IS CLUSTERING?
CLUSTERING IS THE MOST COMMON FORM OF UNSUPERVISED LEARNING.

CLUSTERING IS THE PROCESS OF GROUPING A SET OF PHYSICAL OR ABSTRACT OBJECTS INTO CLASSES OF SIMILAR OBJECTS.

APPLICATION
Clustering helps marketers discover distinct groups in their customer base, and they can then characterize those groups by their purchasing patterns.
Thematic maps in GIS, built by clustering feature spaces.
WWW:
  Document classification.
  Clustering weblog data to discover groups of similar access patterns.

Clustering is also used in outlier detection applications such as the detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

ALGORITHMS TO BE ANALYSED

HIERARCHICAL CLUSTERING ALGORITHM

PARTITIONING BASED CLUSTERING ALGORITHM

Hierarchical Clustering

Initially, each point in the data set is assigned to its own cluster.
Then we repeatedly combine the two nearest clusters into a single cluster.
The distance function is chosen on the basis of the characteristics we want to cluster on.

Two hierarchical strategies:
Agglomerative (bottom up): initially each point is a cluster; repeatedly combine the two nearest clusters into one.
Divisive (top down): start with one cluster and recursively split it.

Three important questions (and answers):

1. How do you represent a cluster of more than one point?
By its centroid: the centroid's coordinate in the i-th dimension is SUM_i / N, where SUM_i is the i-th component of the vector sum of the cluster's points and N is the number of points.

2. How do you determine the nearness of clusters?
By the nearest inter-cluster distance (Euclidean distance).

3. When do you stop combining clusters?
When min(inter-cluster distance d[i]) for every remaining cluster i exceeds a threshold.
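As a concrete illustration, here is a small Python sketch of the three answers (numpy is assumed; the numbers are taken from the example on the following slides, and all names are illustrative):

import numpy as np

# a cluster of three points
cluster = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]])

# 1) represent the cluster by its centroid: SUM_i / N in each dimension i
centroid = cluster.sum(axis=0) / len(cluster)        # -> array([1., 1.])

# 2) nearness of clusters: Euclidean distance between their centroids
other_centroid = np.array([4.7, 1.3])
d = np.linalg.norm(centroid - other_centroid)        # ~3.71

# 3) stop combining once the minimum inter-cluster distance exceeds a threshold
threshold = 2.0
keep_merging = d <= threshold                        # False here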

Algorithm: Hierarchical Clustering

1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters (r), (s) in the current clustering, i.e. the pair with
   d[(r),(s)] = min d[(i),(j)],
   where the minimum is taken over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].

Hierarchical Clustering (cont.)

4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column for the newly formed cluster.
5. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as
   d[(k),(r,s)] = min( d[(k),(r)], d[(k),(s)] ).
6. If all objects are in one cluster, stop. Else, go to step 2.

EXAMPLE: HIERARCHICAL CLUSTERING

[Figure: six data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0) and (5,3). As clusters merge, centroids (x) appear at (1.5,1.5) for {(1,2),(2,1)}, then (1,1) for {(0,0),(1,2),(2,1)}; on the right, (4.5,0.5) for {(4,1),(5,0)}, then (4.7,1.3) for {(4,1),(5,0),(5,3)}.]

Dendrogram
[Figure: dendrogram of the merge sequence above]
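The merge sequence can also be computed and drawn with off-the-shelf tools; a minimal sketch, assuming scipy and matplotlib are available. Single linkage matches the min rule of step 5 above:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# the six data points from the example
points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]

# 'single' linkage merges the two clusters with the nearest pair of points
Z = linkage(points, method='single')

dendrogram(Z, labels=[str(p) for p in points])
plt.ylabel('merge distance')
plt.show()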

Implementation

# Runnable Python version of the sketch: repeatedly merge the two
# nearest clusters (by centroid distance) until the smallest
# inter-cluster distance exceeds a threshold; clusters that never
# merge remain as singletons (outliers).
import math

def dist(p, q):
    return math.dist(p, q)                     # Euclidean distance

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def hierarchical(points, threshold):
    clusters = [[p] for p in points]           # each point starts as a cluster
    while len(clusters) > 1:
        # scan the distance matrix over current centroids for the nearest pair
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:                      # stopping rule from the Q&A slide
            break
        clusters[i] += clusters[j]             # merge the nearest pair
        del clusters[j]
    return clusters

Complexities
Setting the points into an array: O(n).
Building the distance matrix d[i,j]: O(n^2).
Clustering: O(n^2 log n) with a priority queue of distances (the plain nested-loop version above is O(n^3)).

PARTITIONING BASED CLUSTERING
Suppose we are given a database of n objects, and the partitioning method constructs k partitions of the data, where k <= n. Each partition represents a cluster. That is, the method classifies the data into k groups which satisfy the following requirements:

Each group contains at least one object.

Each object belongs to exactly one group.

FOR A GIVEN NUMBER OF PARTITIONS (SAY K), THE PARTITIONING METHOD WILL CREATE AN INITIAL PARTITIONING.
THEN IT USES AN ITERATIVE RELOCATION TECHNIQUE TO IMPROVE THE PARTITIONING BY MOVING OBJECTS FROM ONE GROUP TO ANOTHER.
GLOBAL OPTIMUM: EXHAUSTIVELY ENUMERATE ALL PARTITIONS (SEE THE SKETCH BELOW).
HEURISTIC METHOD: K-MEANS.
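To see why the exhaustive route is hopeless beyond toy sizes, here is a small sketch using only the Python standard library (the partitions generator below is illustrative, not a library function); the number of partitions of n objects is the n-th Bell number, which grows explosively:

def partitions(items):
    # yield every way to split 'items' into non-empty groups
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        # place 'first' into each existing group in turn...
        for i in range(len(p)):
            yield p[:i] + [p[i] + [first]] + p[i + 1:]
        # ...or give it a group of its own
        yield p + [[first]]

print(sum(1 for _ in partitions(list(range(10)))))   # 115975 partitions for just 10 objects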

K-MEANS CLUSTERING

K-MEANS ALGORITHM(S)
ASSUMES A EUCLIDEAN SPACE/DISTANCE.
START BY PICKING K, THE NUMBER OF CLUSTERS.
INITIALIZE CLUSTERS BY PICKING ONE POINT PER CLUSTER.
(FOR THE MOMENT, ASSUME WE PICK THE K POINTS AT RANDOM.)

POPULATING CLUSTERS

1) FOR EACH POINT, PLACE IT IN THE CLUSTER WHOSE CURRENT CENTROID IS NEAREST.
2) AFTER ALL POINTS ARE ASSIGNED, UPDATE THE LOCATIONS OF THE CENTROIDS OF THE K CLUSTERS.
3) REASSIGN ALL POINTS TO THEIR CLOSEST CENTROID; THIS SOMETIMES MOVES POINTS BETWEEN CLUSTERS.
REPEAT 2 AND 3 UNTIL CONVERGENCE (A RUNNABLE SKETCH FOLLOWS BELOW).
CONVERGENCE: POINTS DON'T MOVE BETWEEN CLUSTERS AND THE CENTROIDS STABILIZE.
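A minimal sketch of this loop (Lloyd's algorithm) in Python with numpy; the function and variable names are illustrative, and points is assumed to be an (n, d) numpy array:

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize by picking k data points at random as centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # steps 1/3: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        # convergence: centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids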


EXAMPLE: ASSIGNING CLUSTERS

[Figure: data points (x) grouped around their current centroids; clusters after round 1.]

EXAMPLE: ASSIGNING CLUSTERS

[Figure: points reassigned to the updated centroids; clusters after round 2.]

EXAMPLE: ASSIGNING CLUSTERS

[Figure: assignments and centroids have stabilized; clusters at the end.]

GETTING THE K RIGHT

HOW TO SELECT K?

[Figure: average distance to centroid plotted against k; the curve falls steeply at first and then flattens, and the bend marks the best value of k. A sketch of this procedure follows below.]
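One way to trace this curve on the Iris data, sketched with scikit-learn (an assumption for illustration; the slides themselves use Weka). KMeans.inertia_ is the total squared distance from each point to its assigned centroid:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# run k-means for increasing k and watch the within-cluster distances
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # look for the 'elbow' where improvement levels off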

EXAMPLE: PICKING K

[Figure, three panels over the same scatter of points:
Too few clusters; many long distances to centroid.
Just right; distances rather short.
Too many clusters; little improvement in average distance.]

ADVANTAGES OF K-MEANS

WITH A LARGE NUMBER OF VARIABLES, K-MEANS MAY BE COMPUTATIONALLY FASTER THAN HIERARCHICAL CLUSTERING (IF K IS SMALL).
K-MEANS MAY PRODUCE TIGHTER CLUSTERS THAN HIERARCHICAL CLUSTERING, ESPECIALLY IF THE CLUSTERS ARE GLOBULAR.

COMPLEXITY: EACH ROUND IS O(KN) FOR N POINTS AND K CLUSTERS.

DISADVANTAGES

IT IS DIFFICULT TO COMPARE THE QUALITY OF THE CLUSTERS PRODUCED (E.G. DIFFERENT INITIAL PARTITIONS OR VALUES OF K AFFECT THE OUTCOME).
THE FIXED NUMBER OF CLUSTERS MAKES IT DIFFICULT TO PREDICT WHAT K SHOULD BE.

WEKA
WAIKATO ENVIRONMENT FOR KNOWLEDGE ANALYSIS (WEKA) IS A
POPULAR SUITE OF MACHINE LEARNING SOFTWARE WRITTEN IN JAVA,
DEVELOPED AT THE UNIVERSITY OF WAIKATO, NEW ZEALAND
IRIS DATA


HIERARCHICAL ALGORITHM


K-MEANS


FURTHER WORK

WE WOULD LIKE TO ADDRESS THE COMPLEXITY PROBLEM OF THE HIERARCHICAL ALGORITHM.
THE SELECTION OF THE NUMBER OF CLUSTERS IN K-MEANS NEEDS TO BE FURTHER AUTOMATED.


CONCLUSION
EVERY ALGORITHM HAS ITS OWN IMPORTANCE, AND WE CHOOSE AMONG THEM BASED ON THE BEHAVIOUR OF THE DATA.

ON THE BASIS OF THIS STUDY WE FOUND THAT THE K-MEANS CLUSTERING ALGORITHM IS THE SIMPLEST COMPARED TO THE OTHER ALGORITHMS.

DEEP KNOWLEDGE OF THE ALGORITHMS IS NOT REQUIRED TO WORK IN WEKA; THAT IS WHY WEKA IS A SUITABLE TOOL FOR DATA MINING APPLICATIONS.


REFERENCES
Äyrämö, S. and Kärkkäinen, T., "Introduction to Partitioning-Based Clustering Methods with a Robust Example", Reports of the Dept. of Math. Inf. Tech. (Series C: Software and Computational Engineering), C1/2006, University of Jyväskylä, 2006.
Berkhin, P. (1998). "Survey of Clustering Data Mining Techniques". Retrieved November 6th, 2015.
E.B. Fowlkes and C.L. Mallows, "A Method for Comparing Two Hierarchical Clusterings", Journal of the American Statistical Association, 78:553-584, 1983.
J. Bezdek and R. Hathaway, "Numerical Convergence and Interpretation of the Fuzzy c-Shells Clustering Algorithms", IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 787-793, Sep. 1992.
Meilă, M. and Heckerman, D. (February 1998), "An Experimental Comparison of Several Clustering and Initialization Methods", Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA.
Mythili, S. et al., International Journal of Computer Science and Mobile Computing, Vol. 3, Issue 1, January 2014, pp. 334-340.
Narendra Sharma, Aman Bajpai and Ratnesh Litoriya, "Comparison the Various Clustering Algorithms of Weka Tools", International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 5, May 2012.
R. Davé, "Adaptive Fuzzy c-Shells Clustering and Detection of Ellipses", IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 643-662, Sep. 1992.
T. Velmurugan and T. Santhanam, "A Survey of Partition Based Clustering Algorithms in Data Mining: An Experimental Approach", Information Technology Journal, 10: 478-484, 2011.
WEKA at http://www.cs.waikato.ac.nz/~ml/weka.
