I.
Abstract
In this work the K-means clustering algorithm is applied to Fisher's Iris Plant Dataset. The data
set is known to include three classes of Iris plant data: Setosa, Virginica, and Versicolor, one of
which is linearly separable from the other two. To assess the capabilities of the clustering algorithm,
it is applied to the data set with varied numbers of initial centers and stopping thresholds. It
will be shown that the K-means clustering algorithm is capable of perfectly separating the
Setosa data from the other two classes, as expected, and achieves acceptable recognition of
the other two plant species.
II.
Introduction
K-means clustering is an unsupervised learning algorithm that tries to cluster data based on
their similarity. Unsupervised learning means that there is no outcome to be predicted, and the
algorithm just tries to find patterns in the data. In K-means clustering, we have to specify the
number of clusters that we want the data to be grouped into. The algorithm randomly assigns
each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates
over two steps. The first is to reassign each data point to the cluster whose centroid is
closest, and the second is to calculate the new centroid of each cluster.
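The random starting assignment described above can be sketched in a few lines of MATLAB. This is a minimal illustration on made-up toy data, not the code used in this work; the variable names, the fixed random seed, and the step that forces every cluster to be non-empty are all assumptions added for the example.

```matlab
% Toy data: 6 observations, 2 features (illustrative values, not Iris).
X = [1.0 1.1; 0.9 1.0; 1.2 0.8; 5.0 5.2; 5.1 4.9; 4.8 5.0];
k = 2;                          % number of clusters, chosen by the user
n = size(X, 1);

rng(0);                         % fix the seed so the run is repeatable
labels = randi(k, n, 1);        % random initial cluster for each observation
labels(1:k) = (1:k)';           % assumption: force every cluster to be non-empty

centroids = zeros(k, size(X, 2));
for c = 1:k
    centroids(c, :) = mean(X(labels == c, :), 1);   % centroid = mean of members
end
disp(centroids)
```

Because the starting assignment is random, the initial centroids usually lie close together near the overall mean of the data; the two repeated steps are what pull them apart.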
Figure 1: k-means clustering result for the Iris flower data set and actual species visualized.
III.
Methodology
The K-means algorithm is an unsupervised algorithm that attempts to cluster data into groups
based on a chosen similarity measure. In this work, the similarity measure of choice is
Euclidean distance. To create the clusters, the K-Means algorithm iteratively implements the
following steps:
I. Initialize - Initialize the center of each cluster.
II. Distribute data points - Assign each data point to the cluster whose center is the smallest
distance from the data point.
III. Compute new cluster centers - Set the position of each cluster center to the mean of all
data points belonging to that cluster.
IV. Compare new centers to old centers - If the new centers are the same as the old centers,
the algorithm has converged, and the clusters and centers computed in step III are the final
clusters and centers. If they are not the same, return to step II.
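Steps I to IV can be sketched as a short MATLAB script. This is a simplified illustration on made-up 2-D data rather than the function in Appendix A. The hand-picked initial centers, the iteration cap, and the rule that an empty cluster keeps its old center are assumptions made for the example. Squared Euclidean distance is used for the assignment, which selects the same nearest center as Euclidean distance.

```matlab
% Toy data: two well-separated groups in 2-D (illustrative, not Iris).
X = [1.0 1.0; 1.2 0.8; 0.8 1.1; 5.0 5.0; 5.2 4.9; 4.9 5.3];
k = 2;
n = size(X, 1);

Z = X([1 4], :);                % step I: initial centers (assumption: hand-picked rows)
maxIter = 100;                  % safety cap, not part of the steps above

for iter = 1:maxIter
    % Step II: squared Euclidean distance from every point to every center.
    D = zeros(n, k);
    for c = 1:k
        diffs = X - repmat(Z(c, :), n, 1);
        D(:, c) = sum(diffs .^ 2, 2);
    end
    [~, labels] = min(D, [], 2);    % nearest center per point

    % Step III: move each center to the mean of its assigned points.
    Znew = Z;                       % an empty cluster keeps its old center
    for c = 1:k
        members = (labels == c);
        if any(members)
            Znew(c, :) = mean(X(members, :), 1);
        end
    end

    % Step IV: stop when the centers no longer change.
    if isequal(Znew, Z)
        break;
    end
    Z = Znew;
end

disp(labels')   % cluster index for each point
disp(Z)         % final centers
```

On this toy data the first three points end up in cluster 1 and the last three in cluster 2, and the loop stops on the second pass because the centers repeat exactly.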
In practice, it may not always be possible to find centers that do not change from iteration to
iteration. In other words, the algorithm may not always reach a perfectly stable solution. Some
datasets may lead to centers that oscillate between two values, for example. To avoid an
infinite loop when iterating through the algorithm, a threshold is used as a second stopping
condition. When the new cluster centers are identified at the end of each iteration, the amount
of change in the centers is also taken into consideration. This is done by measuring the distance
between the new centers and the old centers. If this distance is less than the threshold distance, the
algorithm is considered to have converged.
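One way to express this threshold test is sketched below. The center values are made up, and taking the largest per-center movement as "the distance" is an assumption, since the text does not say how the distances for several centers are combined. The code in Appendix A instead compares a norm of the sample-to-center distance matrix between iterations.

```matlab
Zold = [1.00 1.00; 5.00 5.00];      % centers from the previous iteration (made-up values)
Znew = [1.00 0.97; 5.03 5.07];      % centers after the update step (made-up values)
threshold = 0.1;                    % one of the two thresholds used in this work

% Euclidean distance each center moved during the iteration.
shift = sqrt(sum((Znew - Zold) .^ 2, 2));

% Assumption: declare convergence when the largest movement is below the threshold.
converged = max(shift) < threshold;

fprintf('max center shift = %.4f, converged = %d\n', max(shift), converged);
```

With these values the largest movement is about 0.076, so the test passes for a threshold of 0.1 but would fail for a threshold of 0.01, and the algorithm would run at least one more iteration.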
IV.
Experimental Setup
This project was broken up into two tasks. First, the K-means algorithm was coded as a
general function so that the number of centers and the threshold value could be easily varied. Next,
a shell was created to call the function iteratively for each of the three k values and two threshold
values. The results of each run were saved to a cell array. The function can be reviewed in
Appendix A. In addition to making the function more general, the convergence test step was
modified to include the threshold as a stopping condition for the algorithm. The function
takes the following inputs: data, number of centers, and threshold value. Given these
parameters, the Kmeans function returns the following two cell arrays: centers per iteration and
assigned classes per iteration. More in-depth consideration of the pertinent steps is presented
below.
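The calling shell might look roughly like the following sketch. The handle runKmeans is a hypothetical placeholder that only echoes its settings so the example runs on its own; it stands in for the function in Appendix A, whose exact signature is not shown. The k values 2, 3, and 4 are also an assumption, since the three values are not listed in the text, and random numbers stand in for the 150-by-4 Iris measurements.

```matlab
% Placeholder standing in for the K-means function (hypothetical signature).
% It only echoes its settings so that this sketch runs on its own.
runKmeans = @(data, k, thr) struct('k', k, 'threshold', thr);

data = rand(150, 4);            % stand-in for the 150-by-4 Iris measurements
kValues = [2 3 4];              % assumption: the text does not list the three k values
thresholds = [0.01 0.1];        % the two thresholds used in this work

results = cell(numel(thresholds), numel(kValues));
for t = 1:numel(thresholds)
    for j = 1:numel(kValues)
        results{t, j} = runKmeans(data, kValues(j), thresholds(t));
    end
end

disp(size(results))             % one cell per (threshold, k) combination
```

Storing the runs in a 2-by-3 cell array keeps the six simulations addressable by threshold and by number of centers.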
Confusion Matrix
After calling the Kmeans function, the confusion matrix was generated for each
simulation using the Matlab confusionmat function. The results can be seen in the
subsequent section.
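To illustrate what such a confusion matrix contains, the sketch below tallies one by hand with the base MATLAB function accumarray instead of confusionmat. The six labels are made up for the example, and the variable names are assumptions.

```matlab
% Made-up labels for 6 samples: true species index and assigned cluster index.
trueIdx    = [1 1 2 2 3 3]';    % 1 = Setosa, 2 = Versicolor, 3 = Virginica
clusterIdx = [1 1 2 3 3 3]';    % cluster numbers returned by a clustering run

numSpecies  = 3;
numClusters = 3;

% Row = true species, column = assigned cluster, entry = sample count.
C = accumarray([trueIdx clusterIdx], 1, [numSpecies numClusters]);
disp(C)
```

Here the rows come out as 2 0 0, 0 1 1, and 0 0 2: both Setosa samples fall in cluster 1, while one Versicolor sample is grouped with the Virginica samples in cluster 3. Cluster numbers are arbitrary, so the columns may need to be reordered before the matrix is read as a classification result.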
Result
As previously stated, K-means clustering was applied six times to the dataset. The
results are broken up into two groups and presented based on the chosen threshold
value.
Threshold = 0.01
Below, the confusion matrix for the three different choices of number of initial centers
is shown. In each simulation, the stopping threshold was set to 0.01.
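Table 3. Confusion matrix values for the three choices of number of centers, threshold = 0.01
(rows are species; each bracket is one surviving row fragment, in source order).
Setosa:      [50 0]  [50]  [50 0]
Versicolor:  [3 47]  [48]  [0]
Virginica:   [50]  [36 14]  [21 26 1]
Unlabeled:   [21 29]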
It can be seen that in each case, the Setosa plant species was easily separated from the
others. The Versicolor and Virginica datasets, however, were not as easily distinguished
from each other as they were from the Setosa. It is interesting to see that when
there were just 2 centers, the Virginica dataset was perfectly separated from
the Setosa dataset. The Versicolor, however, still straddles the two clusters. It is
mostly clustered with the Virginica dataset, but several samples were
clustered with the Setosa plants.
Threshold = 0.1
Below, the confusion matrix for the three different choices of number of initial centers
is shown. In each simulation, the stopping threshold was set to 0.1.
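Table 6. Confusion matrix values for the three choices of number of centers, threshold = 0.1
(rows are species; each bracket is one surviving row fragment, in source order).
Setosa:      [50]  [50 0]
Versicolor:  [47 3]  [47]  [0]
Virginica:   [14 36]  [17 1]
Unlabeled:   [50]  [25 25 0]  [50 0]  [32]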
Increasing the threshold did not have much of an impact on the final confusion matrices.
Although there is some shifting of the data points, as evidenced by the values shown
in the tables, the overall clustering results are quite similar. In all three runs, the Setosa
species was perfectly separated from the other two species. The other two species
could not, in general, be perfectly separated. However, when there are two centers, the
Setosa and Virginica sets are again easily separated from one another, while the
Versicolor is split (unevenly) between the two clusters.
V.
Conclusions
Using Matlab on a personal computer, the K-means algorithm was applied to the Iris plant
dataset. It was shown that the Setosa dataset was perfectly classified in each case.
The other two species, Versicolor and Virginica, were not as easily separated from each other
as they were from the Setosa plant. These results held across randomly selected initial centers,
different numbers of centers, and different stopping thresholds. Since
these results are typical of the Iris plant dataset and the K-means
clustering algorithm was able to reproduce them, the K-means algorithm was shown to be a
reliable method of clustering.
VI.
Appendix A - Kmeans.m
                end
                Dist_iter{m} = norm(Dist); % norm of the sample-to-center distance matrix
                for N = 1:numSamples % for each sample
                    [~, j] = min(Dist(N,:)); % index of the nearest center (first one on ties)
                    Class(N) = j; % index = class, save index/class
                end
                Class_Iter{m} = Class; % save class assignments for iteration m
                step = 3;
            case 3 % compute new centers
                for k = 1:centers % for each center
                    C = find(Class == k); % find all samples in class k
                    zt = zeros(1, size(x,2)); % running sum, one entry per feature
                    for i = 1:numel(C) % for every sample in class
                        zt = zt + x(C(i),:); % add sample to sum
                    end
                    Znew(k,:) = zt/numel(C); % center = sample mean (NaN if class k is empty)
                end
                Z_iter{m+1} = Znew; % save next centers
                step = 4;
            case 4 % compare new centers to old centers
                if isequal(Z_iter{m+1}, Z_iter{m}) % if new = current
                    NotEq = 0; % algorithm converges
                    break;
                elseif m == 1 % if 1st iter, no prev distance, proceed
                    m = m+1;
                    step = 2;
                else % if not 1st iter and Z ~= Znew
                    % check stopping conditions
                    if abs(Dist_iter{m} - Dist_iter{m-1}) < threshold
                        NotEq = 0;
                        break;
                    else
                        m = m+1; % new iteration
                        step = 2; % go back to step 2
                    end
                end
        end
    end
    clusCenters = Z_iter; % return cluster centers per iteration
    clusData = Class_Iter; % return clusters per iteration
end