
Volume 2 Special Issue ISSN 2079-8407

Journal of Emerging Trends in Computing and Information Sciences

©2010-11 CIS Journal. All rights reserved.

 
http://www.cisjournal.org 

Review on Various Clustering Methods for the Image Data


Madhuri A. Tayal¹, M. M. Raghuwanshi²
¹SRKNEC Nagpur, ²NYSS Nagpur; ¹,² Nagpur University,
Nagpur [Maharashtra], INDIA.
¹madhuri_kalpe@rediffmail.com, m_raghuwanshi@rediffmail.com

ABSTRACT
Nowadays, storing information (data) is not a problem; storing that data effectively is the problem. Clustering is the
classification of patterns into groups of similar items: the data within a group are similar to one another but quite
different from the data in other groups. The clustering problem has been addressed in many fields, which shows its wide
usability. In this paper, clustering is applied to image data. Feature values are extracted, and the final solution depends
upon these values, on which the categorization is done. The complexities of the different methods are also given. The
paper ends with some of the difficulties encountered, solutions for them, and the results of the clustering.

Keywords— classification, clustering, feature extraction, feature selection

I. INTRODUCTION

We are living in a world full of data. Every day, people encounter a large amount of information and store or
represent it as data for further analysis and management. One of the vital means of dealing with these data is to
classify or group them into a set of categories or clusters. Clustering refers to the process of grouping samples so
that the samples are similar within each group [1]. Clustering can be considered the most important unsupervised
learning problem; as with every other problem of this kind, it deals with finding a structure in a collection of
unlabeled data. A cluster is therefore a collection of objects which are “similar” to each other and “dissimilar” to
the objects belonging to other clusters. Important survey papers on clustering techniques also exist in the
literature. Starting from a statistical pattern recognition viewpoint, Jain, Murty, and Flynn [2] reviewed clustering
algorithms and other important issues related to cluster analysis.
The purpose of this paper is to provide a comprehensive description of the influential and important clustering
algorithms rooted in statistics, computer science, and machine learning, with emphasis on new advances in recent
years. One issue in cluster analysis, how to choose the number of clusters, is also summarized in the last section.

II. CLUSTERING ALGORITHMS

Different starting points and criteria usually lead to different taxonomies of clustering algorithms [1][2][3]. A
rough but widely agreed frame is to classify clustering techniques as hierarchical clustering and partitional
clustering, based on the properties of the clusters generated. Hierarchical clustering groups data objects with a
sequence of partitions, either from singleton clusters to a cluster including all individuals or vice versa, while
partitional clustering directly divides data objects into some prespecified number of clusters without the
hierarchical structure. We follow this frame in surveying the clustering algorithms in the literature. Beginning with
the discussion on different algorithms, we focus on hierarchical clustering and classical partitional clustering
algorithms in the following sections.

Distance and Similarity Measure
An important component of a clustering algorithm is the distance measure between data points. If the components
of the data instance vectors are all in the same physical units, then the simple Euclidean distance metric may be
sufficient to successfully group similar data instances. The distance between two clusters can be measured by [1]:
1. Euclidean distance
2. City block distance
In addition to this, some of the similarity and dissimilarity measures for quantitative features are listed in Table I [3].

Table I: Similarity and Dissimilarity Measures for Quantitative Features [3]
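To make the two distance measures listed above concrete, the following short Python sketch (added for illustration;
it is not part of the original paper) computes the Euclidean and city block distances between two feature vectors.
The helper names and the sample vectors are assumptions chosen for the example.

import math

def euclidean_distance(a, b):
    # Straight-line distance: square root of the sum of squared component differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block_distance(a, b):
    # Manhattan (city block) distance: sum of absolute component differences.
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical feature vectors (e.g. color code, number of objects, object size).
p = [15, 1, 3]
q = [8, 6, 1]

print("Euclidean:", euclidean_distance(p, q))
print("City block:", city_block_distance(p, q))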

III. CLASSIFICATION

Clustering algorithms may be broadly classified as listed below:

A. Hierarchical
--- Agglomerative
a) Single linkage,
b) Complete linkage,
c) Group average linkage,
d) Median linkage,
e) Centroid linkage,
f) Ward's method,
g) Balanced iterative reducing and clustering using hierarchies (BIRCH),
h) Clustering using representatives (CURE),
i) Robust clustering using links (ROCK)
--- Divisive
Divisive analysis (DIANA), monothetic analysis (MONA)

B. Squared Error-Based (Vector Quantization)
a) K-means

C. Fuzzy
a) Fuzzy c-means (FCM),
b) Mountain method (MM),
c) Possibilistic c-means clustering algorithm (PCM),
d) Fuzzy c-shells (FCS)

D. Neural Networks-Based
a) Learning vector quantization (LVQ),
b) Self-organizing feature map (SOFM),
c) ART,
d) Simplified ART (SART),
e) Hyperellipsoidal clustering network,
f) Self-splitting competitive learning network (SPLL)

E. Kernel-Based
a) Kernel K-means,
b) Support vector clustering (SVC)

F. Data visualization/high-dimensional data
a) Iterative self-organizing data analysis technique (ISODATA),
b) Genetic K-means algorithm (GKA),
c) Partitioning around medoids (PAM)

Similarly, various clustering algorithms and their complexities are mentioned in Table II.

Table II: Computational Complexity of Clustering Algorithms [3]

In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then
it cannot be included in another cluster. A simple example of this is shown in the figure below, where the
separation of points is achieved by a straight line on a two-dimensional plane. On the contrary, the second type,
overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with
different degrees of membership. In this case, data will be associated with an appropriate membership value.
Instead, a hierarchical clustering algorithm is based on the union between the two nearest clusters. The beginning
condition is realized by setting every datum as a cluster. After a few iterations it reaches the final clusters wanted.
Finally, the last kind of clustering uses a completely probabilistic approach.

IV. HIERARCHICAL CLUSTERING ALGORITHM

Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical
clustering is this:

1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each
   containing just one item. Let the distances (similarities) between the clusters be the same as the
   distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now
   you have one cluster less.

3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and
average-linkage clustering.
In single-linkage clustering (also called the connectedness or minimum method), we consider the distance between
one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member
of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another
cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between
one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member
of the other cluster. In average-linkage clustering, we consider the distance between one cluster and another cluster
to be equal to the average distance from any member of one cluster to any member of the other cluster. The result
with image data is shown in a later section.

Single Linkage Algorithm:
The single linkage algorithm is also called the minimum method. It is obtained by defining the distance between
two clusters to be the smallest distance between two points such that one point is in each cluster. If Ci and Cj are
clusters, the distance between them is defined as
    DSL(Ci, Cj) = min d(a, b),  a ∈ Ci, b ∈ Cj
where d(a, b) denotes the distance between the samples a and b.

Complete Linkage Algorithm:
The complete linkage algorithm is also called the maximum method. It is obtained by defining the distance between
two clusters to be the largest distance between two points such that one point is in each cluster. If Ci and Cj are
clusters, the distance between them is defined as
    DCL(Ci, Cj) = max d(a, b),  a ∈ Ci, b ∈ Cj
where d(a, b) denotes the distance between the samples a and b.

Average Linkage Algorithm:
The average linkage algorithm is an attempt to compromise between the extremes of the single and complete
linkage algorithms. It is obtained by defining the distance between two clusters to be the average distance between
two points such that one point is in each cluster. If Ci and Cj are clusters with ni and nj members, the distance
between them is defined as
    DAL(Ci, Cj) = (1 / (ni * nj)) Σ d(a, b),  a ∈ Ci, b ∈ Cj
where d(a, b) denotes the distance between the samples a and b.

The main weaknesses of agglomerative clustering methods are:
 • They do not scale well: the time complexity is at least O(n^2), where n is the total number of objects;
 • They can never undo what was done previously.

V. K-MEANS CLUSTERING

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.
The procedure follows a simple and easy way to classify a given data set through a certain number of clusters
(assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids
should be placed in a cunning way, because different locations cause different results [6]. So the better choice is to
place them as far away from each other as possible. The next step is to take each point belonging to a given data
set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early
grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from
the previous step. After we have these k new centroids, a new binding has to be done between the same data set
points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k
centroids change their location step by step until no more changes are made; in other words, the centroids do not
move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective
function is
    J = Σ (j = 1..k) Σ (i = 1..n) || xi(j) - cj ||^2
where || xi(j) - cj ||^2 is a chosen distance measure between a data point xi(j) and the cluster centre cj, and J is an
indicator of the distance of the n data points from their respective cluster centres.
The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points
   represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate
the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer
move. This produces a separation of the objects
into groups from which the metric to be
minimized can be calculated.
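The four steps above map directly onto a compact implementation. The following Python sketch is an illustration
added for this review (not the authors' code); it follows the steps and also reports the squared error objective J.
The sample data array and the choice of k = 2 are assumptions.

import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: place K points (here: K randomly chosen data points) as initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the position of each centroid as the mean of its group.
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Squared error objective J: sum of squared distances to the assigned centroids.
    J = ((data - centroids[labels]) ** 2).sum()
    return labels, centroids, J

# Hypothetical feature vectors (one row per image pattern).
data = np.array([[15.0, 1, 3], [8.0, 6, 1], [14.0, 1, 2.5], [6.0, 12, 1]])
labels, centroids, J = kmeans(data, k=2)
print(labels, J)

Rerunning the sketch with different seeds illustrates the sensitivity to the initial centroids that is noted under the
disadvantages below.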

Advantages

1. K-means is a simple algorithm that has been adapted to many problem domains.
2. It is more automated than manual thresholding of an image (an illustrative sketch follows this list).
3. It is a good candidate for extension to work with fuzzy feature vectors.

Disadvantages

1. Although it can be proved that the procedure will always terminate, the k-means algorithm does not
   necessarily find the most optimal configuration, corresponding to the global minimum of the objective
   function.
2. The algorithm is also significantly sensitive to the initially randomly selected cluster centers. The k-means
   algorithm can be run multiple times to reduce this effect.

A large number of attempts have been made to estimate the appropriate number of clusters, and some representative
examples are illustrated in the following [6]. Some solutions for this algorithm are:

1. Visualization of the data set. For data points that can be effectively projected onto a two-dimensional Euclidean
space, which are commonly depicted with a histogram or scatterplot, direct observations can provide good insight
into the value of k. However, the complexity of most real data sets restricts the effectiveness of this strategy to a
small scope of applications.

2. Construction of certain indices (or stopping rules). These indices usually emphasize intra-cluster compactness
and inter-cluster isolation, and consider the comprehensive effects of several factors, including the defined squared
error, the geometric or statistical properties of the data, the number of patterns, the dissimilarity (or similarity), and
the number of clusters. Milligan and Cooper compared and ranked 30 indices according to their performance over a
series of artificial data sets.

3. Optimization of some criterion functions under a probabilistic mixture-model framework. In a statistical
framework, finding the correct number of clusters (components) is equivalent to fitting a model to observed data
and optimizing some criterion.

Fig. 1 Different image patterns

The patterns can be clustered using a number of features. The basic features are color, shape and texture. Here one
feature from the basic features is taken, in addition to two more new features, i.e. the number of objects and the
size of the objects. Various methods for detecting size, shape, etc. are available in [7] and in the literature. The
corresponding value for each feature is shown in Table III. The results after the experimentation on clustering are
shown in Figure 2. The results are found to be the same for the single, complete, and average linkage algorithms.
The results are also the same using the Euclidean and city block distances. An illustrative sketch of this clustering
is given after Table III.

Table III: Image Feature Values with Respect to Patterns

Pattern No | Color          | No. of objects | Size of object
1          | 15 (White)     | 1              | 3
2          | 08 (Dark Gray) | 6              | 1
3          | 14 (Yellow)    | 1              | 2.5
4          | 06 (Brown)     | 12             | 1
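As referenced above, here is a rough Python sketch (an addition for illustration, not the authors' implementation)
that clusters the Table III feature vectors with the single, complete, and average linkage criteria using SciPy and
cuts each dendrogram into two flat clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Feature vectors from Table III: (color code, number of objects, object size) per pattern.
features = np.array([
    [15, 1, 3.0],   # pattern 1
    [8, 6, 1.0],    # pattern 2
    [14, 1, 2.5],   # pattern 3
    [6, 12, 1.0],   # pattern 4
])

for method in ("single", "complete", "average"):
    # Build the agglomerative merge tree with the chosen linkage criterion.
    tree = linkage(features, method=method, metric="euclidean")
    # Cut the tree into two flat clusters.
    labels = fcluster(tree, t=2, criterion="maxclust")
    print(method, labels)

For these feature values all three linkage criteria put patterns 1 and 3 in one cluster and patterns 2 and 4 in the
other, and swapping metric="cityblock" for "euclidean" leaves the grouping unchanged, which is consistent with
the results reported above. Calling scipy.cluster.hierarchy.dendrogram(tree) would draw a dendrogram along the
lines of Fig. 2.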

Fig. 2 Dendrogram for the clustering.

Fig. 3 Different clusters for different patterns.

After experimentation it is found that patterns 1 and 3 are categorised into cluster-1 and patterns 2 and 4 fall into
cluster-2, as shown above in Figure 2. So, depending upon the number of features and the corresponding values,
we can separate the patterns into different clusters.

Applications
Clustering algorithms can be applied in many fields, for instance:

 • Marketing: finding groups of customers with similar behaviour, given a large database of customer data
   containing their properties and past buying records;
 • Biology: classification of plants and animals given their features;
 • Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying
   frauds;
 • City-planning: identifying groups of houses according to their house type, value and geographical location;
 • Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
 • WWW: document classification; clustering weblog data to discover groups of similar access patterns; and
   many more.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we studied various clustering methods and their complexities. We studied the K-means clustering
algorithm, its advantages and disadvantages, and the problems encountered with this algorithm. In the literature,
we found that most of the existing methods for clustering depend on image features like gray levels, texture and
color. One more method for clustering can be based on histograms [7]. The presented work consists of the basic
idea and an implementation of some of the basic clustering methods. In future, we will carry out more
implementations of the clustering methods and explore their utilities.

REFERENCES

[1] Earl Gose, Richard Johnsonbaugh, and Steve Jost, Pattern Recognition and Image Analysis (textbook).

[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31,
    No. 3, September 1999.

[3] Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks,
    Vol. 16, No. 3, May 2005.

[4] Anil K. Jain, Robert P. W. Duin, and Jianchang Mao, "Statistical Pattern Recognition: A Review", IEEE
    Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000.

[5] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High
    Dimensional Data for Data Mining Applications," in Proc. ACM SIGMOD Int. Conf. Management of Data,
    1998, pp. 94-105.

[6] Hui Xiong, Junjie Wu, and Jian Chen, "K-Means Clustering Versus Validation Measures: A Data-Distribution
    Perspective", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 39, No. 2,
    April 2009.

[7] Rafael Gonzalez and Richard E. Woods, Digital Image Processing (textbook).

