
International Conference and Workshop on Emerging Trends in Technology (ICWET 2010) – TCET, Mumbai, India

Improving Ant Colony Optimization Algorithm for Data Clustering
R Tiwari, AZAD IET, Lucknow, UP, INDIA, Tel.: +919415561502, rajgaurang@gmail.com
M Husain, AZAD IET, Lucknow, UP, INDIA, Tel.: +919415459591, mohd.husain90@gmail.com
S Gupta, VIET, G. B. Nagar, UP, INDIA, Tel.: +919717577497, sandeepjicool@gmail.com
A Srivastava, VIET, G. B. Nagar, UP, INDIA, Tel.: +919451003872, arun019@yahoo.com

ABSTRACT
Data mining is a process that uses technology to bridge the gap between data and logical decision-making. The term itself promises organized data manipulation for extracting valuable information and knowledge from large volumes of data, and many techniques have been developed to this end. This paper outlines an ant colony optimization algorithm, recently applied in data mining mostly to data-clustering and data-classification problems, which imitates the way real ants find the shortest path between their nest and a food source. The paper presents an application that clusters a data set with the ant colony optimization algorithm and aims to improve the working performance of the algorithm on the data-clustering problem. We also propose two new techniques and show the performance gained by adding them.

Categories and Subject Descriptors
H.2.8 [Database Management] Database Applications – Data mining

General Terms
Algorithm

Keywords
Data Mining, Knowledge Discovery in Databases, Clustering, Ant Colony Optimization, Data classification

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICWET’10, February 26–27, 2010, Mumbai, Maharashtra, India.
Copyright 2010 ACM 978-1-60558-812-4…$10.00.

1. INTRODUCTION
Data Mining (DM), also known as Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1]. It covers many different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.

Clustering identifies groups in a data set based on some criterion of similarity [2]. It aims to discover a sensible organization of the objects in a given dataset by identifying and quantifying similarities or dissimilarities between the objects [3]. In data mining, clustering is used chiefly as a preprocessing step for another data mining application. We implement a clustering technique that uses ant colony optimization to cluster a data set into a pre-determined number of clusters, and we propose two new techniques that are added to the algorithm.

Real ants are able to find the shortest path from their nest to a food source without any visual cues [4]. Ant colony optimization is built by modeling this behavior of real ants [2].

This paper is structured as follows: Section 2 describes ant colony optimization. Section 3 describes the ant colony optimization algorithm developed for data clustering and our two proposed techniques. Section 4 reports the results of verifying the ACO algorithm and the proposed techniques with an application program on a dataset. Finally, Section 5 reports the conclusions of the current work.

2. ANT COLONY OPTIMIZATION
Ant colony optimization (ACO) [5] mimics the way real ants find the shortest route between a food source and their nest. Figure 1-a shows ants starting from their nest and proceeding along a straight path to the food source.

[Figure 1, panels (a)-(c): the path between Nest and Food Source, (a) without an obstacle, (b) and (c) with an OBSTACLE on the path.]


[Figure 1, panel (d): Nest, OBSTACLE, Food Source.]
Figure 1. Behavior of ants between their nest and food source

If there is an obstacle on the path to the food source (Figure 1-b), an ant that reaches the obstacle cannot continue and has to choose a new outgoing path. In this case, every new direction is equally likely to be selected. In other words, if an ant can turn either right or left, the selection probability of the two directions is identical (Figure 1-c). Suppose two ants leave the nest to find the food: one of them takes the path that turns out to be shorter, while the other takes the longer path. It has been observed that the following ants mostly select the shorter path, because the pheromone concentration is higher on the shorter one.

The ant that selected the shorter path returns to the nest earlier, so more pheromone is deposited on this path than on the longer path. Other ants in the nest therefore have a greater probability of following the shorter route. These ants also deposit their own pheromone on this path. More and more ants are soon attracted to it, and hence the best route from the nest to the food source and back is found very quickly. Such a pheromone-mediated cooperative search process leads to intelligent swarm behavior.

The mechanism ants use to find the shortest path is pheromone. A pheromone is a chemical secretion used by some animals to influence members of their own species. Ants deposit pheromone while moving, and they prefer, probabilistically, the path on which more pheromone has been deposited. Ants leave pheromone on the preferred path while going to the food source, and in this way they help the following ants select the path (Figure 1-d).

3. CLUSTERING WITH ANT COLONY OPTIMIZATION
In this section we use the ant colony optimization algorithm to solve the data-clustering problem, explain the two proposed techniques in detail, and compare the solutions.

We use an Ant Colony Optimization algorithm for data clustering, in which a set of concurrent distributed agents collectively discover a sensible organization of objects for a given dataset [3]. In the algorithm, each agent discovers a possible partition of the objects in the dataset, and the quality of the partitioning is measured with some metric such as Euclidean distance. The information each agent gathers about the clustering of objects is collected in a global information hub (the pheromone trail matrix) and is used by the other agents to build possible clustering solutions and to improve them iteratively. The algorithm runs for a given maximum number of iterations, and the best solution found with regard to the given metric represents an optimal or near-optimal partitioning of the objects of the dataset into subsets.

The aim of data clustering is to obtain an optimal assignment of N objects to one of K clusters, where N is the number of objects and K is the number of clusters [7-8]. The artificial ants used in the algorithm are called software ants or agents, and the number of agents is denoted by R.

Ants begin with empty solution strings, and in the first iteration all elements of the pheromone matrix are initialized to the same value. As the iterations proceed, the pheromone matrix is updated depending on the quality of the solutions produced.

To explain the algorithm in detail, a test data set of 10 samples is used. The samples are taken from UCI's machine learning repository [6]. The test data are shown in Table 1; in the real data set the data are divided into 3 subsets, so K=3.

Table 1. Illustrative dataset to explain ACO algorithm for clustering with N=10 and n=4 (N: number of samples, n: number of attributes)

Sample   Sepal    Sepal   Petal    Petal   Cluster
Number   length   width   length   width
1        5.1      3.5     1.4      0.2     1
2        7        3.2     4.7      1.4     2
3        6.3      3.3     6        2.5     3
4        4.9      3       1.4      0.2     1
5        4.6      3.1     1.5      0.2     1
6        6.4      3.2     4.5      1.5     2
7        6.2      2.9     4.3      1.3     2
8        5.8      2.7     5.1      1.9     3
9        7.1      3       5.9      2.1     3
10       6.3      2.9     5.6      1.8     3

To construct a solution, the agent uses the pheromone trail information to allocate each element of the string S to an appropriate cluster label. At the beginning of the algorithm, each agent (software ant) starts with an empty solution string, and the pheromone matrix τ, which records the association of each element with each cluster, is initialized to some small value τ0. Hence, in the first iteration each element of the solution string S of each agent is assigned arbitrarily to one of the K clusters.

The trail value τij at location (i,j) represents the pheromone concentration of sample i associated with cluster j. Consequently, for the problem of separating N samples into K clusters, the size of the pheromone matrix is N×K, and each sample is associated with K pheromone concentrations. The pheromone trail matrix evolves as we iterate. At any iteration level, each agent uses this pheromone matrix to build a solution, selecting the cluster of each sample according to the probability that the sample belongs to it.
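The construction step described above can be illustrated with a short Python sketch. It assumes that an agent picks the cluster of each sample with probability proportional to that sample's row of the pheromone matrix; the text only says that the choice is probabilistic, so this proportional rule, the function names, and the value of `TAU0` are illustrative assumptions rather than the authors' implementation.

```python
import random

def selection_probabilities(row):
    """Normalize one pheromone row (K concentrations) into probabilities."""
    total = sum(row)
    return [value / total for value in row]

def construct_solution(pheromone, rng):
    """Build a solution string: one cluster label (1..K) per sample.

    Assumption: each label is drawn with probability proportional to the
    pheromone concentration tau_ij of that sample/cluster pair.
    """
    solution = []
    for row in pheromone:
        labels = list(range(1, len(row) + 1))
        solution.append(rng.choices(labels, weights=row, k=1)[0])
    return solution

N, K = 10, 3
TAU0 = 0.01  # illustrative initial pheromone value, not taken from the paper
pheromone = [[TAU0] * K for _ in range(N)]  # N x K matrix, uniform at start

rng = random.Random(0)
agents = [construct_solution(pheromone, rng) for _ in range(10)]  # R = 10 agents
print(agents[0])
```

With a uniform matrix every cluster is equally likely, which matches the arbitrary assignment of the first iteration; once the matrix is updated, labels with higher concentrations are drawn more often.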


After the R agents have generated their solutions, a local search is performed to further improve the fitness of these solutions. The pheromone matrix is then updated depending on the quality of the solutions produced by the agents. The agents then build better solutions based on the updated pheromone matrix, and the above steps are repeated for a fixed number of iterations.

At the end of any iteration level, each agent generates its solution using the information in the updated pheromone matrix. The pheromone matrix at one iteration level for the test dataset is given in Table 2 below.

Table 2. Pheromone trail matrix generated at any iteration level of the ACO algorithm for test dataset

N (Sample No)   K (Cluster No)
                1          2          3
1               0.014756   0.015274   0.009900
2               0.015274   0.009900   0.014756
3               0.015274   0.014756   0.009900
4               0.009900   0.015274   0.014756
5               0.014756   0.015274   0.009900
6               0.009900   0.014756   0.015274
7               0.009900   0.020131   0.009900
8               0.015274   0.014756   0.009900
9               0.009900   0.015274   0.014756
10              0.014756   0.015274   0.009900

The pheromone concentrations for the first sample, as shown in Table 2, are τ11=0.014756, τ12=0.015274 and τ13=0.009900. This indicates that at the current iteration, sample number 1 has the highest probability of belonging to cluster number 2, because τ12 is the highest.

Each agent picks a cluster number, with a probability value, for each element of the string S to build its own solution string S. The quality of a constructed solution string S is measured by the value of the objective function of the data-clustering problem. This objective function is defined as the sum of squared Euclidean distances between each object and the center of the cluster it belongs to. The members of the population, i.e. the agents, are then sorted in increasing order of objective function value, because the lower the objective function value, the higher the fitness; that is, solutions with lower objective function values are closer to the real solution. Table 3 shows the solution strings of the ten agents for the test data set, sorted by decreasing fitness.

Table 3. Solutions generated for the data-clustering problem, sorted by decreasing fitness (increasing objective function value F)

S (Solution   N (Sample No)                   F (objective
String)       1  2  3  4  5  6  7  8  9  10   function value)
1             2  1  1  2  2  3  3  1  2  2    4.003931
2             2  3  1  2  2  3  2  3  2  2    7.172357
3             2  1  1  2  2  3  2  1  2  3    7.864054
4             2  1  3  2  2  3  2  1  2  3    8.455329
5             2  2  1  2  2  3  2  1  2  2    10.36714
6             2  1  1  2  3  3  2  1  1  3    10.92255
7             1  1  1  2  2  3  2  1  2  3    11.94087
8             2  1  1  2  1  3  2  1  1  1    12.00959
9             1  1  2  2  2  3  1  1  2  2    13.26286
10            1  1  2  2  2  3  3  1  2  3    13.33634

Most current ant colony optimization algorithms use a local search procedure to improve the solutions generated by the software ants. Local search helps to generate better solutions when heuristic information is not readily available. It can be applied to all generated solutions or to a fraction of the R solutions. In this work, local search is performed on 20% of the solutions. Consequently, with the ten agents of the test example, local search is applied to the top 2 solutions in Table 3. In the local search procedure, the objective function values of the top 2 agents are calculated again. A new solution is accepted only if it improves the fitness: if the newly computed objective function value is lower than the original value, the newly generated solution replaces the old one.

After the local search procedure, the pheromone trail matrix is updated. This pheromone updating process reflects the usefulness of the dynamic information provided by the software ants. The pheromone matrix used in the ant colony optimization algorithm is a kind of adaptive memory that contains information provided by the previously found superior solutions and is updated at the end of each iteration. The pheromone updating process used in this algorithm includes the best L solutions found by the R agents at iteration level t. These L agents imitate the pheromone deposition of real ants by depositing amounts derived from the values of their solutions. The trail information is updated using the following rule:

$$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \sum_{l=1}^{L} \Delta\tau_{ij}^{l}, \qquad i = 1,\dots,N, \quad j = 1,\dots,K$$

where ρ is the evaporation rate, which lies in [0,1], and (1-ρ) is the persistence of the trail. A higher value of ρ means that the information gathered in past iterations is forgotten more quickly. The amount $\Delta\tau_{ij}^{l}$ is equal to $1/F_l$ if cluster j is assigned to the ith element of the solution created by ant l, and zero otherwise.

An optimal solution is one that minimizes the objective function value. The best solution in memory is replaced by the best solution of the current iteration if the latter has a lower objective function value; otherwise the best solution in memory is kept. This completes one iteration of the algorithm. The algorithm repeats these steps for a given number of iterations, and the solution with the lowest objective function value represents the optimal partitioning of the objects of the dataset into groups.

The flow chart of the ant colony optimization algorithm developed for solving the data-clustering problem, explained in detail above, is given in Figure 2. The flow charts of the first and second techniques proposed to increase the performance of the ACO algorithm are given in Figures 3 and 4, respectively.
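The objective function and the pheromone update rule described in this section can be written out as a small Python sketch. It assumes that the cluster center is the mean of the samples assigned to the cluster and that empty clusters contribute nothing; the text does not spell out either point, and the function names, the value of `rho`, and the initial value 0.01 are illustrative. The objective values it produces are not claimed to reproduce the F column of Table 3.

```python
# Table 1 samples: (sepal length, sepal width, petal length, petal width)
SAMPLES = [
    (5.1, 3.5, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4), (6.3, 3.3, 6.0, 2.5),
    (4.9, 3.0, 1.4, 0.2), (4.6, 3.1, 1.5, 0.2), (6.4, 3.2, 4.5, 1.5),
    (6.2, 2.9, 4.3, 1.3), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
    (6.3, 2.9, 5.6, 1.8),
]
TRUE_LABELS = [1, 2, 3, 1, 1, 2, 2, 3, 3, 3]  # "Cluster" column of Table 1

def objective(samples, solution, k):
    """Sum of squared Euclidean distances between samples and their cluster center."""
    total = 0.0
    for cluster in range(1, k + 1):
        members = [s for s, c in zip(samples, solution) if c == cluster]
        if not members:
            continue  # assumption: an empty cluster adds nothing
        center = [sum(column) / len(members) for column in zip(*members)]
        total += sum((x - m) ** 2 for s in members for x, m in zip(s, center))
    return total

def update_pheromone(pheromone, best, rho):
    """tau(t+1) = (1 - rho) * tau(t) + sum over the best L solutions of delta tau.

    `best` is a list of (solution, F) pairs; each solution deposits 1/F on the
    (sample, cluster) entries it uses and nothing elsewhere.
    """
    updated = [[(1.0 - rho) * value for value in row] for row in pheromone]
    for solution, f_value in best:
        for i, cluster in enumerate(solution):
            updated[i][cluster - 1] += 1.0 / f_value
    return updated

agent_1 = [2, 1, 1, 2, 2, 3, 3, 1, 2, 2]  # first solution string of Table 3
f_1 = objective(SAMPLES, agent_1, 3)
tau = [[0.01] * 3 for _ in SAMPLES]  # illustrative initial matrix
tau = update_pheromone(tau, [(agent_1, f_1)], rho=0.01)
print(round(objective(SAMPLES, TRUE_LABELS, 3), 4), round(f_1, 4))
```

Because the deposit is 1/F, solutions with a lower objective function value reinforce their sample/cluster entries more strongly, which is how better partitions come to dominate later constructions.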


[Figure 2 (flow chart): START; read T and τ0; set t=1 and τ(t)=τ0; send R agents, each with an empty solution string S; for i=1,…,R construct solution Si using the pheromone trail, compute the weights of all test samples and the cluster centers, compute the clustering metric and assign it as objective function value Fi of solution Si; select the best L solutions out of the R solutions using the objective function values; for l=1,…,L let St = Sl, where St is a temporary solution, perform local search on St, compute its objective function value Ft, and if Ft < Fl then Fl = Ft and Sl = St; update the pheromone trail matrix using the best L solutions, τ(t+1)=(1-ρ)τ(t)+ΣΔτ; set t=t+1 and repeat until the termination criterion (t=T) is met; print the best solution; STOP.]
Figure 2. The flow chart of ACO algorithm developed for solving data-clustering problem [3]

[Figure 3 (flow chart): the same steps as Figure 2, with an added check on the iteration counter (labeled "t \ 50 = 1") that sets τ(t) back to its initial value.]
Figure 3. The flow chart of the first technique proposed to increase the performance of ACO
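The steps of Figure 2 can be put together in one short Python sketch run on the ten samples of Table 1. Several details are assumptions made only so that the sketch is runnable: labels are drawn in proportion to the pheromone values, the local search (hypothetical here, since the paper does not specify it) relabels each element at random with a small probability, and the parameter values (`rho`, `tau0`, number of iterations, `seed`) are illustrative.

```python
import random

# Table 1 samples: (sepal length, sepal width, petal length, petal width)
SAMPLES = [
    (5.1, 3.5, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4), (6.3, 3.3, 6.0, 2.5),
    (4.9, 3.0, 1.4, 0.2), (4.6, 3.1, 1.5, 0.2), (6.4, 3.2, 4.5, 1.5),
    (6.2, 2.9, 4.3, 1.3), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
    (6.3, 2.9, 5.6, 1.8),
]

def objective(solution, k):
    """Sum of squared Euclidean distances of samples to their cluster centers."""
    total = 0.0
    for cluster in range(1, k + 1):
        members = [s for s, c in zip(SAMPLES, solution) if c == cluster]
        if not members:
            continue
        center = [sum(column) / len(members) for column in zip(*members)]
        total += sum((x - m) ** 2 for s in members for x, m in zip(s, center))
    return total

def construct(pheromone, rng):
    """Draw one label per sample in proportion to its pheromone row (assumption)."""
    labels = list(range(1, len(pheromone[0]) + 1))
    return [rng.choices(labels, weights=row, k=1)[0] for row in pheromone]

def local_search(solution, k, rng, p_change=0.1):
    """Hypothetical local search: relabel each element with probability p_change."""
    return [rng.randint(1, k) if rng.random() < p_change else c for c in solution]

def aco_cluster(k=3, agents=10, top=2, iterations=100, rho=0.01, tau0=0.01, seed=0):
    rng = random.Random(seed)
    pheromone = [[tau0] * k for _ in SAMPLES]
    best_f, best_s = float("inf"), None
    for _ in range(iterations):
        solutions = [construct(pheromone, rng) for _ in range(agents)]
        scored = sorted(((objective(s, k), s) for s in solutions), key=lambda fs: fs[0])
        elite = []
        for f, s in scored[:top]:  # local search on the best L solutions only
            candidate = local_search(s, k, rng)
            f_candidate = objective(candidate, k)
            if f_candidate < f:  # accept only if the objective improves
                f, s = f_candidate, candidate
            elite.append((f, s))
        for row in pheromone:  # evaporation: (1 - rho) * tau
            for j in range(k):
                row[j] *= 1.0 - rho
        for f, s in elite:  # deposit 1/F for each of the best L solutions
            for i, cluster in enumerate(s):
                pheromone[i][cluster - 1] += 1.0 / f
        f, s = min(elite, key=lambda fs: fs[0])
        if f < best_f:  # keep the best solution found so far in memory
            best_f, best_s = f, s
    return best_s, best_f

best_solution, best_value = aco_cluster()
print(best_solution, round(best_value, 4))
```

Cluster labels are arbitrary, so a good result may match the "Cluster" column of Table 1 only up to a renaming of the three labels.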


Ants follow the path between their nest and the food source according to the amount of pheromone deposited on the path. Following ants decide which path to take depending on the pheromone concentrations on the paths. After a number of iterations, ants start to follow the same path continuously, because its pheromone concentration is far higher than that of the disused paths. This conduct of ants is called stagnation behavior. To avoid this disadvantage, the reference algorithm is improved with the addition of two new techniques, and the solutions are compared with each other.

To increase the working performance of the algorithm developed to cluster data with the ant colony optimization technique, the first proposed technique (Figure 3) resets the pheromone amounts to their initial values every 50 iterations to avoid stagnation behavior.

To minimize the stagnation behavior of ants, the second proposed technique (Figure 4) monitors the pheromone amounts and, if the pheromone concentration of every path has not changed over the last 10 iterations, resets the pheromone amounts to their initial values. In other words, a feedback technique is applied to the algorithm to improve the solution.
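The two reset rules can be sketched in Python as follows. The first follows the fixed 50-iteration period stated above. For the second, the text does not say how "no change" is measured, so the sketch assumes an element-wise comparison with a small tolerance `tol`; the names `reset_every` and `StagnationReset` and the tolerance are illustrative assumptions.

```python
def reset_every(pheromone, tau0, t, period=50):
    """First technique: restore the initial values every `period` iterations."""
    if t % period == 0:
        return [[tau0] * len(row) for row in pheromone], True
    return pheromone, False

class StagnationReset:
    """Second technique: reset when the matrix has not changed for `window` iterations."""

    def __init__(self, tau0, window=10, tol=1e-9):
        self.tau0 = tau0
        self.window = window
        self.tol = tol  # assumption: "no change" means equal within this tolerance
        self.previous = None
        self.unchanged = 0

    def step(self, pheromone):
        """Call once per iteration, after the pheromone update."""
        same = self.previous is not None and all(
            abs(a - b) <= self.tol
            for row_now, row_before in zip(pheromone, self.previous)
            for a, b in zip(row_now, row_before)
        )
        self.unchanged = self.unchanged + 1 if same else 0
        self.previous = [row[:] for row in pheromone]
        if self.unchanged >= self.window:
            self.unchanged = 0
            self.previous = None
            return [[self.tau0] * len(row) for row in pheromone], True
        return pheromone, False

# Usage: a matrix that never changes triggers the feedback reset.
frozen = [[0.2, 0.01, 0.01]] * 4
monitor = StagnationReset(tau0=0.01)
flags = [monitor.step(frozen)[1] for _ in range(11)]
print(flags)
```

The periodic rule resets regardless of the state of the search, whereas the feedback rule resets only when the monitored matrix has actually stopped changing.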
[Figure 4 (flow chart): the same steps as Figure 2, with an added counter k, set to 0 when τ(t) is set to τ0 and incremented after each pheromone update, and a check against k = 10 that returns τ(t) to τ0.]
Figure 4. The flow chart of the second technique proposed to increase the performance of ACO

4. EXPERIMENTAL EVALUATION
To generate the optimal solutions of the presented ACO algorithm for the data-clustering problem and of the two added techniques, an application program was written in "Microsoft Visual Basic 6.0" and applied to the iris database available in the UCI repository [6]. The iris database consists of 150 samples and is stored in a text file. Our aim is to compare the cluster values produced by the ACO algorithm with the values produced after adding the two new techniques, in order to determine the performance of ACO and of our proposed techniques.

The main screen of the application program is given in Figure 5. The number of iterations, clusters, agents and local search agents, the initial pheromone value, the evaporation rate of the pheromone and other values needed by the algorithm are specified on this screen. The program runs the algorithm for the specified number of iterations.

Figure 5. The main screen of the application program

Figure 6 depicts the statistical results of the three methods (the reference algorithm and the two new techniques) run in the application program with 1000 iterations. In Figure 6, '1. Solution' represents our main ant colony optimization algorithm; compared with the real cluster values of the iris database, its performance is 4%. '2. Solution' represents our first proposed technique, and its performance is 52%. '3. Solution' represents our second proposed technique, and its performance is 80%.

Figure 6. Statistical result values of the ACO worked with the criterion specified on Figure 5.

Figure 7. Graph screen illustrating the result values of the ACO worked with the criterion specified in Figure 5 and the real solution.

Figure 7 shows the graph screen for the three methods (the reference algorithm and the two new techniques) run in the application program with 1000 iterations and the given criterion (Figure 6). The straight line in the graph marks the fitness value of the real cluster values of the iris database. The curve denoting '1. Solution' shows the results of the ACO algorithm. Its working performance, derived by comparison with the real cluster values, is only 4% (Figure 7), because the algorithm exhibited stagnation behavior after the 615th iteration (Figure 6). The curve denoting '2. Solution' shows the results of the first proposed technique, whose working performance is 52%, and the curve denoting '3. Solution' shows the results of the second proposed technique, whose working performance is 80% (Figure 7).

5. CONCLUSIONS
In this paper, we proposed two new techniques to boost the working performance of the ant colony optimization algorithm. We also verified the ACO algorithm and the proposed techniques with an application program. The comparison of the three methods shows that the proposed techniques enhance the performance of the reference ACO algorithm and that the best results are obtained with the second proposed technique. Consequently, our two proposed techniques noticeably increased the success of the ACO algorithm developed for solving the data-clustering problem.

6. REFERENCES
[1] DORIGO, M., MANIEZZO, V., COLORNI, A., "The Ant System: Optimization by a colony of cooperating agents", IEEE Transactions on Systems, Man, and Cybernetics-Part B, Vol.26, No.1, pp.1-13, 1996
[2] DI CARO, G., DORIGO, M., "Extending AntNet for Best-effort Quality-of-Services Routing", Ant Workshop on Ant Colony Optimization, http://iridia.ulb.ac.be/ants98/ants98.html, 15-16, 1998
[3] FRAWLEY, W.J., PIATETSKY-SHAPIRO, G., MATHEUS, C.J., "Knowledge Discovery in Databases: An Overview", AI Magazine, 13(3): 57-70, 1992
[4] KUO, R.J., WANG, H.S., HU, T., CHOU, S.H., "Application of Ant K-Means on Clustering Analysis", Computers and Mathematics with Applications 50, p.1709-1724, 2005
[5] MANIEZZO, V., BOSCHETTI, M., JELASITY, M., "An Ant Approach To Membership Overlay Design", ANTS 2004 – Fourth International Workshop On Ant Colony Optimization and Swarm Intelligence, p.37-48, Berlin, 2004
[6] HOLDEN, N., FREITAS, A., "Web Page Classification with an Ant Colony Algorithm", Parallel Problem Solving from Nature, Springer-Verlag, USA, pp. 1092-1102, 2004
[7] PARPINELLI, R.S., LOPES, H.S., FREITAS, A.A., "Classification-Rule Discovery with an Ant Colony Algorithm", Encyclopedia of Information Science and Technology, Idea Group Inc., 2005
[8] SHELOKAR, V.K., JAYARAMAN, V.K., KULKARNI, B.D., "An Ant Colony Approach for Clustering", Analytica Chimica Acta 509, 187–195, 2004
[9] TSAI, C.F., TSAI, C.W., WU, H.C., YANG, T., "ACODF: a novel data clustering approach for data mining in large databases", The Journal of Systems and Software 73, p.133–145, 2004
[10] UCI Repository for Machine Learning Databases, retrieved from the World Wide Web: http://www.ics.uci.edu/~mlearn/MLRepository.htm
