
1st Conference on Swarm Intelligence and Evolutionary Computation (CSIEC2016), Higher Education Complex of Bam, Iran, 2016

A Clustering Algorithm Based on Integration of K-Means and PSO

Habibollah Agh Atabay
Faculty Member of Computer Department
Gonbad Kavous University, Gonbad Kavous, Iran
atabay@gonbad.ac.ir

Mohammad Javad Sheikhzadeh
Faculty Member of Computer Department
Gonbad Kavous University, Gonbad Kavous, Iran
sheikhzadeh@gonbad.ac.ir

Mehdi Torshizi
Faculty Member of Computer Department
Gonbad Kavous University, Gonbad Kavous, Iran
mtorshizi@gonbad.ac.ir

Abstract: Clustering is one of the key issues in data mining and has attracted much attention. One of the most famous algorithms in this field is K-Means clustering, which has been successfully applied to many problems. But this method has its own disadvantages, such as the dependence of its efficiency on the initialization of cluster centers. To improve the quality of K-Means, hybridization of this algorithm with other methods has been suggested by many researchers. Particle Swarm Optimization (PSO) is one of the Swarm Intelligence (SI) algorithms that has been combined with K-Means in various ways. In this paper, we suggest another way of combining K-Means and PSO that uses the strengths of both algorithms. Most methods introduced in the clustering context that hybridize K-Means and PSO apply them sequentially, but in this paper we apply them intertwined. The results of evaluating this algorithm on a number of benchmark datasets from the UCI Machine Learning Repository reflect the ability of this approach in clustering analysis.

Keywords: Clustering Analysis, K-Means, Particle Swarm Optimization (PSO), Hybridization

978-1-4673-8737-8/16/$31.00 2016 IEEE

I. INTRODUCTION

Clustering analysis is one of the key issues in data mining and has always attracted many researchers. In fact, clustering analysis is an unsupervised classification method that is applied to recognize the essential structure of objects by categorizing them into subsets that are meaningful in context. Usually, every object can be expressed by a series of features in a multi-dimensional vector space, and objects can be split into several clusters using their feature vectors. If the number of clusters, k, is already known, clustering can be specified as a distribution of n objects in the N-dimensional space into k groups, such that objects within a cluster are more similar to each other than to objects in different clusters.

Because of its simplicity of implementation and good performance, the K-Means algorithm is one of the most versatile and popular clustering algorithms, and it has linear time complexity [1]. But this method has several disadvantages. The objective function of K-Means is not convex, so it may have numerous local minima [2]. Additionally, the performance of this algorithm depends on the initial choice of cluster centers. Since Euclidean distance is sensitive to noise and outlier data, the K-Means algorithm is also influenced by these factors [3].

Particle swarm optimization (PSO) is a population-based stochastic optimization algorithm inspired by the behavior of groups of birds [4]. PSO is one of the most important algorithms in the category of Swarm Intelligence (SI), where cooperation and communication between members of the bird community enable the population to imitate the overall pattern of the society. High decentralization, cooperation between particles and simple implementation make this algorithm applicable to many optimization problems [5, 6]. One of the major fields of application of this algorithm is clustering analysis [7, 8].

Many methods have been proposed to solve the drawbacks of K-Means [7]. One category of proposed methods uses a combination of K-Means and PSO, exploiting the strengths of both algorithms to improve the final clustering result. The first paper that combined K-Means and PSO for clustering analysis was presented in 2003 [9], where K-Means was used to obtain initial cluster centers. Every cluster center is considered a particle of the PSO algorithm. Then, using PSO, these cluster centers, along with other randomly generated initial particles, are optimized to obtain the resulting cluster centers. Soon after, in [1], the idea of using a particle to represent a complete clustering solution was introduced, which is more compatible with the structure of the PSO algorithm, because, in general, each particle of PSO must be a complete solution to be optimized. In this form, a particle is composed of the coordinates of the potential cluster centers.

Then, in 2005, another hybridization of PSO and K-Means was proposed [10], named Alternative K-Means and PSO (AKPSO), where a new alternative metric was used instead of Euclidean distance. In that paper, PSO was used to generate a suboptimal solution, and K-Means was then used to improve the outcome. In a similar paper [11], K-Means was used after PSO to cluster documents. Utilizing the local search capabilities of the Nelder-Mead simplex has also been considered to modify the results of hybrid K-Means and PSO, named KNMPSO [12]. Mixing other optimization algorithms such as Ant Colony Optimization (ACO), along with Fuzzy Adaptive PSO (FAPSO) and K-Means, was suggested in [13], where the K-Means algorithm is used to ameliorate the hybridization outcome of FAPSO and ACO. In another work [14], PSO is used for the initial search, K-Means is used to refine the initial search results, and multiclass merging is used to achieve faster convergence and better search functionality in clustering. Recently, in [15], the combination of Improved PSO, K-Means and a Genetic Algorithm (GA) was applied to clustering.
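As a concrete illustration of the particle-as-complete-solution representation from [1], a particle can store the coordinates of all k candidate cluster centers, and its fitness can be the clustering objective (the sum of distances of samples to their nearest candidate center). The sketch below is ours, not code from any of the cited papers; the function names and the random data are illustrative only:

```python
import numpy as np

def fitness(particle, data, k):
    """Clustering fitness of one particle: the particle encodes all k
    candidate centers flattened into one vector; a lower sum of
    nearest-center distances is better."""
    centers = particle.reshape(k, data.shape[1])
    # distance of every sample to every candidate center
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).sum()

rng = np.random.default_rng(0)
data = rng.random((30, 2))        # 30 samples with 2 features
k = 3
particle = rng.random(k * 2)      # one particle = 3 candidate centers
score = fitness(particle, data, k)
```

In a PSO clustering run, every particle in the swarm is scored this way each iteration, so the swarm searches over whole clusterings rather than over single centers.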

This paper presents a new method of combining K-Means and PSO, which offers another perspective on the hybridization of these two algorithms. As noted in the works above, the important disadvantage of K-Means is its sensitivity to initialization, and this issue can be well covered by the PSO algorithm. The strength of the K-Means algorithm, which causes its rapid convergence, is the transition of a cluster center from its previous location to the average location of the points belonging to that cluster in each iteration. Our proposed method uses this advantage of K-Means to improve the convergence and final results of PSO. In this way, in addition to fixing the initialization problem of K-Means, we achieve better convergence of PSO and suitable clustering results. We evaluated our method on several benchmark datasets from the UCI Machine Learning Repository [16]. Our experiments can be compared to the papers [17] and [18], which applied ABC clustering and PSO-based classification on similar datasets.

II. K-MEANS ALGORITHM

The term "K-Means" was proposed by James MacQueen in 1967 [18], but Stuart Lloyd first introduced the standard K-Means algorithm, as a technique for pulse-code modulation, in 1957. This algorithm categorizes the feature vectors of data into k clusters; the number of clusters must be defined initially. The dissimilarity of feature vectors is estimated by Euclidean distance. The result of the K-Means algorithm is such that the similarity between samples in the same cluster is higher than the similarity of samples in different clusters. Each cluster center in K-Means is represented by the mean location of the data vectors that belong to the cluster. The process of K-Means can be summarized as follows:

- Initialize the k cluster centroid vectors randomly.
- Repeat the following steps:
  o For each feature vector, assign it to the nearest cluster, where the distance to the cluster centroid is calculated using the following formula:

    d(x_p, z_c) = \sqrt{\sum_{j=1}^{d} (x_{pj} - z_{cj})^2},    (1)

    where x_p is the feature vector, z_c is the centroid of cluster c and d is the number of features in each vector.
  o Calculate new cluster centers using the following formula:

    z_c = \frac{1}{n_c} \sum_{x_p \in C_c} x_p,    (2)

    where n_c is the number of samples in cluster c and C_c is the set of samples that belong to cluster c.
- Until the stopping criterion is satisfied.

The stopping criterion of this algorithm can be either of the following: the maximum number of iterations has been reached, or the improvement in results is less than a threshold.

III. PARTICLE SWARM OPTIMIZATION

PSO is one of the important algorithms among SI techniques, developed by Kennedy and Eberhart [4]. PSO is inspired by social interactions, such as bird flocking and fish schooling. Like other SI algorithms such as ACO, the PSO algorithm is population based and evolutionary. A particle swarm is a group of particles, each of which can move through the problem space and be attracted to better positions. Each particle is a position in the feature space and a possible solution to the problem. The objective of PSO is to optimize a problem-dependent fitness function, which is used to choose better positions and the best position. For each particle in the population, PSO assigns two vectors, called velocity (v_i) and position (x_i), and holds the best position that the particle has ever visited (p_i). The algorithm uses the velocity of each particle to change its position, then selects the best position among the best-ever-visited positions of all particles as the global best solution (g). The steps of the PSO algorithm can be summarized as follows:

- Randomly generate N particles.
- For each particle, repeat:
  o Update the velocity using:

    v_i(t+1) = w v_i(t) + c_1 r_1 (p_i - x_i(t)) + c_2 r_2 (g - x_i(t)),    (3)

    where c_1 and c_2 are two positive constants, w is an inertia weight, and r_1 and r_2 are uniformly generated random numbers in the range [0, 1].
  o Update the position using:

    x_i(t+1) = x_i(t) + v_i(t+1),    (4)

  o If the value of the objective function indicates that the new position is better than the current p_i:
    - Replace p_i with the new position.
    - If the value of the objective function indicates that the new position is also better than the current g, replace g with the new position.
- Until the stopping criterion is satisfied.

Either of the general criteria mentioned in the previous section can be used as the stopping criterion of the PSO algorithm.

IV. INTERTWINED K-MEANS AND PSO

The K-Means algorithm requires fewer function evaluations, so it converges faster than PSO; but in situations where the choice of initial cluster centers has a large impact on the results, its accuracy decreases. What makes K-Means converge quickly is replacing the old cluster centers with the mean points of the members of the clusters. We can use this advantage in the PSO algorithm to speed up its convergence rate. To do so, we applied a small change to the PSO algorithm, so that after updating the velocity and the position of each particle, if the new position is better than the best position


that has been observed by the particle, the new position is transferred to the centers of the clusters. We call our algorithm Intertwined K-Means and PSO, or IKPSO. The steps of IKPSO can be described as follows:

- Randomly generate N particles.
- For each particle, repeat:
  o Update the velocity using (3).
  o Update the position using (4).
  o If the value of the objective function indicates that the new position is better than the current p_i:
    - Replace p_i with the mean points of the cluster members.
    - If the value of the objective function indicates that the new position is also better than the current g, replace g with the mean points of the cluster members.
- Until the stopping criterion is satisfied.

It should be noted that this process is not applied to every new position, because that would cause repeated calculations. Also, the K-Means algorithm is not applied separately; it is integrated into the PSO algorithm. The objective function of our clustering algorithm, as in K-Means, is the sum of the Euclidean distances of all samples from their cluster centers. The feature vectors may be normalized with respect to the maximum of the range in each dimension.

V. EXPERIMENTAL RESULTS

In this paper, 12 benchmark datasets from the UCI Machine Learning Repository were selected for the evaluation of the proposed algorithm: Balance Scale (BS), Breast Cancer Wisconsin Original (BCW-O), Credit Approval (CA), Dermatology (D), Diabetes (Pima Indians Diabetes) (D-PID), Ecoli (E), Glass Identification (GI), Heart Disease (HD), Horse Colic (HC), Iris (I), Thyroid Disease (TD) and Wine (W). In the case of the Diabetes dataset, we used its old version, named Pima Indians Diabetes, because the format of the data changed dramatically in the new version. The datasets and their corresponding attributes (the number of samples, the number of features and the number of categories) are shown in Table I.

The benchmark datasets are selected similar to [17], but our datasets do not contain exactly the same data, because of changes in the new versions. We used the first 75% of each class of a dataset as the training set, and the remaining 25% as the test set. Some of the above datasets have unknown feature values; we replaced them with the average value of the known values of that feature. Moreover, nominal features were substituted by integer values corresponding to the order of the attribute stated on the dataset's page.

The sizes of the training and test sets are also given in Table I. We set the parameters of the PSO algorithm as: n = 50 particles, a maximum of 1000 iterations, V_max = 10% of the feature space width, and V_min = -V_max. To determine the parameters c_1, c_2 and w, we used the method of [20], which results in c_1 = c_2 = 1.4562 and w = 0.7298. The value of w is decayed by the coefficient 0.99 in each iteration. We used a threshold of 0.00005 as the minimum improvement in results, and a maximum of 50 unchanged results as the stopping criterion; if changes in the optimal value of the objective function stay below the threshold for 50 consecutive iterations, we stop the process.

The accuracy of clustering is measured by the Classification Error Percentage (CEP), which is the percentage of incorrectly classified samples of the test datasets. As in K-Means, in IKPSO the cluster that each sample belongs to is given by the nearest cluster center, but the class of that cluster may be different. To determine the class of a cluster, we first calculate the center of each class by averaging the positions of the samples belonging to that class. Then we choose the class whose center is nearest to the center of the cluster as the class of that cluster, and assign this class to the samples of the cluster as the output of the algorithm. Finally, we compare the output of the algorithm with the desired result (the given class of each sample); if they are not the same, the sample is labeled as misclassified. This process is applied to all test samples, and the percentage of incorrectly classified data is calculated using the following formula:

    CEP = \frac{\text{number of misclassified samples}}{\text{total number of test samples}} \times 100    (5)

The results of IKPSO, along with the results of the PSO clustering algorithm, are given in Table II and Table III, where CEP values are shown as well as the number of iterations needed to achieve the results. We separate the results on normalized and original datasets in order to compare the experiments.

The results of the K-Means algorithm are also shown in Tables II and III. Because the results of the K-Means algorithm on some datasets depend strongly on the initial cluster centers, and the results differed across runs, we executed this algorithm 10 times for each dataset and report the average result. Due to its random nature, the PSO algorithm may also give different results in each run, but the differences were not as large as for K-Means. As indicated in these tables, the proposed method (IKPSO) improved the clustering results in most cases and led to faster convergence in all cases. In the cases where IKPSO could not decrease the error, it still remarkably increased the convergence speed.

TABLE I. ATTRIBUTES OF DATASETS.

Dataset   Total   Train   Test   Features   Category
BS         625     468     157       4          3
BCW-O      699     524     175      10          2
CA         690     517     173      15          2
D          366     274      92      34          6
D-PID      768     576     192       8          2
E          336     252      84       7          8
GI         214     160      54       9          6
HD         303     227      76      13          5
HC         368     276      92      27          2
I          150     112      38       4          3
TD         215     161      54       5          3
W          178     133      45      13          3
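The intertwined update of Section IV can be sketched in code: a standard PSO loop over particles that encode all k centers, where an improving position is transferred to the per-cluster mean points before being stored as the personal best. This is a minimal sketch under our own assumptions (function and variable names, toy data, and swarm size are ours; c_1 = c_2 = 1.4562 and w = 0.7298 follow the paper), not the authors' implementation:

```python
import numpy as np

def objective(particle, data, k):
    """Sum of distances of all samples to their nearest candidate center."""
    centers = particle.reshape(k, -1)
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()

def kmeans_mean_step(particle, data, k):
    """The K-Means step IKPSO borrows: move each candidate center to the
    mean of the samples currently assigned to it."""
    centers = particle.reshape(k, -1).copy()
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    for c in range(k):
        members = data[labels == c]
        if len(members):                # leave empty clusters unchanged
            centers[c] = members.mean(axis=0)
    return centers.ravel()

def ikpso(data, k, n_particles=10, iters=50,
          w=0.7298, c1=1.4562, c2=1.4562, seed=0):
    rng = np.random.default_rng(seed)
    dim = k * data.shape[1]
    x = rng.random((n_particles, dim))   # positions: k centers, flattened
    v = np.zeros((n_particles, dim))     # velocities
    pbest = x.copy()
    pcost = np.array([objective(p, data, k) for p in x])
    gbest, gcost = pbest[pcost.argmin()].copy(), pcost.min()
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(dim), rng.random(dim)
            v[i] = w * v[i] + c1 * r1 * (pbest[i] - x[i]) + c2 * r2 * (gbest - x[i])
            x[i] = x[i] + v[i]
            if objective(x[i], data, k) < pcost[i]:
                # the intertwined step: an improving position is transferred
                # to the mean points of its clusters before being stored
                x[i] = kmeans_mean_step(x[i], data, k)
                pbest[i], pcost[i] = x[i].copy(), objective(x[i], data, k)
                if pcost[i] < gcost:
                    gbest, gcost = pbest[i].copy(), pcost[i]
        w *= 0.99                        # inertia decay, as in the paper
    return gbest.reshape(k, -1)

# two well-separated blobs in [0, 1]^2 as toy data
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(m, 0.05, (20, 2)) for m in (0.2, 0.8)])
centers = ikpso(blobs, k=2)
```

The design choice to apply the mean step only when a position already improves on its personal best is what keeps the number of extra K-Means computations small, as noted in Section IV.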

TABLE II. RESULTS OF CLUSTERING AS CLASSIFICATION ERROR PERCENTAGE WITH NORMALIZED DATASETS.

Dataset   IKPSO   PSO     K-Means   IKPSO Iterations   PSO Iterations
BS        31.85   42.04    30.57           64                 97
BCW-O      1.14    1.14     1.14           60                 99
CA        11.56   17.34    53.76           57                143
D         14.13   35.87    35.87           65                195
D-PID     28.12   27.08    29.17           58                109
E         26.19   34.52    22.62           72                177
GI        38.89   50.00    44.44           69                155
HD        36.84   40.79    39.87           69                162
HC        52.17   52.17    48.26           59                177
I          7.89    7.89     7.89           59                 94
TD        16.67   16.67    17.78           62                119
W          4.44    6.67     4.44           62                146

TABLE III. RESULTS OF CLUSTERING AS CLASSIFICATION ERROR PERCENTAGE WITHOUT NORMALIZING DATASETS.

Dataset   IKPSO   PSO     K-Means   IKPSO Iterations   PSO Iterations
BS        31.21   31.85    39.49           74                 95
BCW-O      1.14    1.14     1.14           61                106
CA        55.49   44.51    55.49           64                158
D         81.52   81.52    80.98          103                224
D-PID     39.58   45.83    39.58           59                138
E         23.81   33.33    22.62           65                143
GI        57.41   59.26    57.22           75                153
HD        69.74   71.05    69.47           79                196
HC        31.52   31.52    31.52           63                199
I          7.89    7.89     7.89           61                133
TD        37.04   40.74    22.96           62                113
W         15.54   13.33    21.56           55                136

The results of the experiments show that the algorithm can be applied successfully to clustering analysis.

Also, the comparison of the results in Tables II and III shows that normalizing the datasets has a positive effect on reducing the clustering error. This influence is quite evident in some datasets, such as Credit Approval and Dermatology. Only in the case of the Horse Colic dataset has normalization increased the error. But it should be noted that this dataset differs from the others in its large number of unknown feature values, which, as mentioned before, were replaced by the average of the known values.
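The preprocessing discussed here (filling unknown values with the column mean, then scaling each feature by its range) can be sketched as min-max scaling; the helper name and toy data below are ours, not from the paper:

```python
import numpy as np

def preprocess(X):
    """Replace unknown values (NaN) with the column mean of the known
    values, then scale every feature to [0, 1] by its range."""
    X = X.astype(float).copy()
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]          # fill unknown features
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid divide-by-zero on constants
    return (X - lo) / span

raw = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
out = preprocess(raw)   # the NaN becomes 2.0, then all columns map to [0, 1]
```

With many unknown values, as in Horse Colic, most entries of a column collapse to the same filled mean, which is one plausible reason normalization helps less there.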


Note that the results shown in Tables II and III differ from the results obtained in [17] and [18], because we do not use the predetermined classes of the samples in the training phase. Those papers treat these datasets as classification problems and use the class of each sample in the objective function they optimize. But, as we know, in a clustering problem we do not know the target classes of the data and try to categorize objects according to the intrinsic characteristics of their features. Thus we ignored the classes of objects in the training phase and used them only for evaluating the final results on the test sets.
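The label-free protocol described above, where clusters are formed without class information and classes enter only at scoring time through the CEP of Section V, can be sketched as follows (the helper and variable names are our own, hypothetical ones):

```python
import numpy as np

def cep(test_X, test_y, centers, class_centers, class_ids):
    """Classification Error Percentage: assign each test sample to its
    nearest cluster center, give each cluster the class whose class
    center is nearest to it, and count mismatches."""
    # class of each cluster = class with the nearest class center
    d = np.linalg.norm(centers[:, None, :] - class_centers[None, :, :], axis=2)
    cluster_class = class_ids[d.argmin(axis=1)]
    # nearest cluster for every test sample
    d = np.linalg.norm(test_X[:, None, :] - centers[None, :, :], axis=2)
    pred = cluster_class[d.argmin(axis=1)]
    return 100.0 * np.mean(pred != test_y)

centers = np.array([[0.0, 0.0], [1.0, 1.0]])        # learned without labels
class_centers = np.array([[0.1, 0.1], [0.9, 0.9]])  # means of labelled samples
class_ids = np.array([7, 8])
test_X = np.array([[0.0, 0.1], [0.9, 1.0], [1.0, 0.9], [0.2, 0.0]])
test_y = np.array([7, 8, 7, 7])
error = cep(test_X, test_y, centers, class_centers, class_ids)  # 25.0
```

Here one of the four test samples lands in a cluster mapped to the wrong class, giving a CEP of 25%.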


As shown in the final results, IKPSO outperforms the PSO algorithm both in the final results and in the number of iterations needed to achieve them. The average CEP for IKPSO and PSO over all datasets is 22.49% and 27.68% respectively with normalization, and 37.66% and 38.50% respectively without normalization. Moreover, IKPSO improves the speed of finding the result by 52.38% on average on normalized datasets and by 52.30% on original datasets, compared to PSO.
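The averages quoted here can be reproduced directly from the CEP columns of Table II; a quick check with the normalized-data values transcribed from the table:

```python
# CEP values (normalized datasets) transcribed from Table II
ikpso = [31.85, 1.14, 11.56, 14.13, 28.12, 26.19,
         38.89, 36.84, 52.17, 7.89, 16.67, 4.44]
pso = [42.04, 1.14, 17.34, 35.87, 27.08, 34.52,
       50.00, 40.79, 52.17, 7.89, 16.67, 6.67]

avg_ikpso = round(sum(ikpso) / len(ikpso), 2)  # 22.49
avg_pso = round(sum(pso) / len(pso), 2)        # 27.68
```

The same arithmetic on the Table III columns yields the 37.66% and 38.50% figures for the original datasets.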

VI. CONCLUSION

In this work, a new method for the hybridization of K-Means and PSO for use in clustering analysis is presented. This algorithm is called Intertwined K-Means and PSO (IKPSO), and it uses both the simplicity and speed of K-Means and the generalization ability and effectiveness of PSO in the clustering analysis of various benchmark datasets from the UCI Machine Learning Repository. The performance of IKPSO is compared with original PSO clustering in terms of accuracy and speed on both original and normalized datasets. The results can also be compared with the ABC-based clustering presented in [17] and the PSO-based classification in [18], but the results are different, because those works used the class of the samples in the objective function of their methods; we did not, because in clustering analysis we suppose that we do not know the classes of the training samples.



REFERENCES

[1] Chen C-Y, Fun Y. Particle swarm optimization algorithm and its application to clustering analysis. IEEE Int C Netw Sens 2004;2:789-94.
[2] Selim SZ, Ismail MA. K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality. IEEE T Pattern Anal 1984;6:81-7.
[3] Wu K-L, Yang M-S. Alternative c-means clustering algorithms. Pattern Recogn 2002;35:2267-78.
[4] Kennedy J, Eberhart RC. Particle swarm optimization. In: IEEE 1995 International Conference on Neural Networks; 1995; Perth, Australia. Piscataway, NJ: IEEE Service Center. pp. 1942-1948.
[5] Niknam T. A new fuzzy adaptive hybrid particle swarm optimization algorithm for non-linear, non-smooth and non-convex economic dispatch problem. Appl Energ 2010;87:327-39.
[6] Zwe-Lee G. A particle swarm optimization approach for optimum design of PID controller in AVR system. IEEE T Energy Conver 2004;19:384-91.
[7] Rana S, Jasola S, Kumar R. A review on particle swarm optimization algorithms and their applications to data clustering. Artif Intell Rev 2011;35:211-22.
[8] Esmin AA, Coelho R, Matwin S. A review on particle swarm optimization algorithm and its variants to clustering high-dimensional data. Artif Intell Rev 2015;44:23-45.
[9] Van Der Merwe DW, Engelbrecht AP. Data clustering using particle swarm optimization. In: IEEE Congress on Evolutionary Computation 2003 (CEC 2003); Canberra, Australia. pp. 215-220.
[10] Fun Y, Ching-Yi C. Alternative KPSO-Clustering Algorithm. Tamkang Journal of Science and Engineering 2005;8:165-74.
[11] Xiaohui C, Thomas EP. Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm. Journal of Computer Sciences 2005;(Special Issue):27-33.
[12] Kao Y-T, Zahara E, Kao IW. A hybridized approach to data clustering. Expert Syst Appl 2008;34:1754-62.
[13] Niknam T, Amiri B. An efficient hybrid approach based on PSO, ACO and k-means for cluster analysis. Appl Soft Comput 2010;10:183-97.
[14] Lin Y, Tong N, Shi M, Fan K, Yuan D, Qu L, Fu Q. K-means Optimization Clustering Algorithm Based on Particle Swarm Optimization and Multiclass Merging. In: Jin D, Lin S, editors. Advances in Computer Science and Information Engineering. Springer Berlin Heidelberg; 2012. pp. 569-78.
[15] Nayak J, Kanungo DP, Naik B, Behera HS. Evolutionary Improved Swarm-Based Hybrid K-Means Algorithm for Cluster Analysis. In: Satapathy SC, Raju KS, Mandal JK, Bhateja V, editors. Second International Conference on Computer and Communication Technologies; 2016; Springer India. pp. 343-52.
[16] Lichman M. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science; 2013.
[17] Karaboga D, Ozturk C. A novel clustering approach: Artificial Bee Colony (ABC) algorithm. Appl Soft Comput 2011;11:652-7.
[18] MacQueen JB. Some Methods for Classification and Analysis of Multivariate Observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; 1967. pp. 281-97.
