
A Fast Greedy

K-Means
Algorithm

N. Hussein


A Fast Greedy
K-Means Algorithm
N. Hussein
Masters Thesis
Nr: 9668098
University of Amsterdam
Faculty of Mathematics, Computer Science,
Physics and Astronomy
Euclides Building
Plantage Muidergracht 24
1018 TV Amsterdam
The Netherlands
November 2002


First of all I would like to thank Nikos Vlassis for his guidance and support, both during the stage of developing ideas and during the writing of this thesis. Sjaak Verbeek was more than helpful in pushing me forward at the beginning of the graduation project. I also thank Mike for correcting my English. Furthermore, I would like to thank my family for everything they had to cope with while I was writing this thesis and for their support and love throughout these years. They make my life meaningful. I am also grateful to the perfecthousing team for their support and understanding.


Table of Contents

Abstract
1 Introduction
2 Definitions and Related Work
  2.1 Naïve k-means algorithm
  2.2 Kd-trees
  2.3 The Greedy K-means Algorithm
  2.4 Accelerating the naïve K-Means Algorithm
    2.4.1 The Blacklisting Algorithm
    2.4.2 Variants of k-means using kd-trees
3 Accelerating the Global k-means
  3.1 The overall algorithm
  3.2 Restricted filtering Algorithm
    3.2.1 Computing Distortion for the current set of centers
    3.2.2 Computing the distortion for the updated list of centers
  3.3 Distortion computation for the set of candidate center locations
4 Implementation
  4.1 Computing the new Centroids
  4.2 Computing the distortion
  4.3 Computing distortion for added center
  4.4 Maintaining the Tree
5 Experimental results
  5.1 Dataset generated using MATLAB
    5.1.1 Variation of number of insertion points
    5.1.2 Variation of threshold1
    5.1.3 Variation of threshold2
    5.1.4 Effect of number of points
    5.1.5 Effect of dimensionality
    5.1.6 Effect of number of centers
  5.2 Data generated using Dasgupta [15]
    5.2.7 Variation in dimensionality
    5.2.8 Variation in number of points
    5.2.9 Variation in Cluster Separation
  5.3 Dataset from UCI ML Repository
6 Conclusion
7 Further Research
8 References


Abstract
In this paper, two algorithms for image segmentation are studied: k-means and an Expectation Maximization (EM) algorithm. Each is considered for its speed, complexity, and utility. The implementation of each algorithm is then discussed, and finally the experimental results of each algorithm are presented and discussed.
In clustering, we are given a set of N points in d-dimensional space R^d and we have to arrange them into a number of groups (called clusters). In k-means clustering, the groups are identified by a set of points called the cluster centers. The data points belong to the cluster whose center is closest. Existing algorithms for k-means clustering suffer from two main drawbacks: (i) the algorithms are slow and do not scale to large numbers of data points, and (ii) they converge to different local minima depending on the initialization. We present a fast greedy k-means algorithm that attacks both of these drawbacks. The algorithm is fast, deterministic, and an approximation of a global strategy to obtain the clusters. The method is also simple to implement, requiring only two variants of kd-trees as the major data structures. Assuming a global clustering for k-1 centers, we introduce an efficient method to compute the global clustering for k clusters. As the global clustering for k = 1 is obvious (the cluster center at the mean of all points), we can find the global solution for k clusters inductively. As a result, the algorithm we present can also be used to find the minimum number of centers that satisfies a given criterion. We also present a number of empirical studies on both synthetic and real data.
Segmentation is used to identify the objects in an image that we are interested in. There are three approaches: the first is edge detection, the second is thresholding, and the third is region-based segmentation. These three methods cannot solve all of the problems that we meet, but they are the basic approaches in segmentation.


1 Introduction

The clustering problem has been formulated in various ways in the machine learning, pattern recognition, optimization and statistics literature. The fundamental problem in segmentation is to partition an image into regions. Segmentation algorithms for monochrome images are generally based on one of two basic categories: edge-based segmentation and region-based segmentation. Another method is thresholding, which belongs to edge-based segmentation. The fundamental clustering problem is that of grouping together (clustering) data items that are similar to each other.
The most general approach to clustering is to view it as a density estimation
problem. Because of its wide application, several algorithms have been devised to
solve the problem. Notable among these are the EM algorithm, neural nets, SVM and
k-means. Clustering the data acts as a way to parameterize the data so that one does
not have to deal with the entire data in later analysis, but only with these
parameters that describe the data. Sometimes clustering is also used to reduce the
dimensionality of the data so as to make the analysis of the data simpler. Both of

the algorithms under consideration date back over 30 years. A K-Means algorithm was introduced by name in 1967 [1], while an EM algorithm was formalized some ten years later, in 1977 [2]. The dates attached to these algorithms can be somewhat confusing, since the EM algorithm can be considered a generalized K-Means algorithm. It is safe to assume that both algorithms existed previously without formalization or a complete explanation of the theory involved. In order to understand how to implement each algorithm, the class notes and Wikipedia [3, 4] were relied on heavily. In addition, a few of the papers found in the references section were of limited assistance.
There are three goals that we want to achieve:

1. Speed: we need to save time in segmentation in order to give the more complicated compression step more time to process.
2. Good shape matching, even under reduced computation time.
3. The result of segmenting a shape should be intact rather than fragmentary, which means that we want good connectivity.


Edge-Based Segmentation
The focus of this section is on segmentation methods based on the detection of sharp, local changes in intensity. The three types of image feature in which we are interested are isolated points, lines, and edges. Edge pixels are pixels at which the intensity of the image function changes abruptly.

The K-Means algorithm and the EM algorithm can both be used to find natural clusters within given data based upon varying input parameters. Clusters can be formed for images based on pixel intensity, color, texture, location, or some combination of these. For both algorithms the starting locations of the partitions are very important and can make the difference between an optimal and a non-optimal solution, since both are susceptible to terminating at a local optimum as opposed to the global optimum.
Figure 2.1: The intensity histogram of an image.
From Figure 2.1 we draw the following conclusions:

1. First-order derivatives generally produce thicker edges in an image.
2. Second-order derivatives have a stronger response to fine detail, such as thin lines, isolated points, and noise.
3. Second-order derivatives produce a double-edge response at ramp and step transitions in intensity.
4. The sign of the second derivative can be used to determine whether a transition into an edge is from light to dark or from dark to light.


Isolated Points
Point detection is based on the conclusions reached in the preceding section. We know that point detection should be based on the second derivative, so we expect a Laplacian mask.

Figure 2.2: The mask used for isolated-point detection.

We can use the mask to scan all points of the image and compute the response at every point. If the response at a point is greater than T (the threshold we set), we set that point to 1 (light); if not, we set it to 0 (dark):

$$ G(x, y) = \begin{cases} 1 & \text{if } R(x, y) \geq T \\ 0 & \text{otherwise} \end{cases} $$
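As an illustration only (not part of the thesis), the thresholding step can be sketched in a few lines of Python with NumPy and SciPy. The 8-neighbour Laplacian mask is an assumption on our part, since the mask of Figure 2.2 is not reproduced here, and the function name detect_isolated_points is ours.

import numpy as np
from scipy.ndimage import convolve

def detect_isolated_points(image, T):
    # Assumed 8-neighbour Laplacian mask; the thesis refers to the mask of
    # Figure 2.2, which is not reproduced in this sketch.
    laplacian = np.array([[-1, -1, -1],
                          [-1,  8, -1],
                          [-1, -1, -1]], dtype=float)
    # Response R(x, y) of the mask at every pixel.
    R = convolve(image.astype(float), laplacian, mode='nearest')
    # G(x, y) = 1 where the response reaches the threshold T, 0 otherwise.
    return (R >= T).astype(np.uint8)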

Line Detection
As discussed in the preceding section, the second-order derivative has a stronger response and produces thinner lines than the first derivative. We can define masks for four different directions.


Figure 2.3: Line detection masks.

Let us discuss how to use the four masks to decide which direction of mask responds best. Let $R_1$, $R_2$, $R_3$ and $R_4$ denote the responses of the masks in Figure 2.3. If, at a point in the image, $R_k > R_j$ for all $j \neq k$, that point is said to be more likely associated with a line in the direction of mask $k$. For example, if $R_1 > R_j$ for $j = 2, 3, 4$, the point is more likely associated with a line in the direction of the first mask.
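Since Figure 2.3 is not reproduced here, the following sketch assumes the commonly used 3x3 line-detection masks (horizontal, +45 degrees, vertical, -45 degrees); the variable and function names are ours.

import numpy as np

# Assumed standard 3x3 line-detection masks; the thesis uses the masks of Figure 2.3.
line_masks = {
    'horizontal': np.array([[-1, -1, -1], [ 2,  2,  2], [-1, -1, -1]]),
    '+45':        np.array([[ 2, -1, -1], [-1,  2, -1], [-1, -1,  2]]),
    'vertical':   np.array([[-1,  2, -1], [-1,  2, -1], [-1,  2, -1]]),
    '-45':        np.array([[-1, -1,  2], [-1,  2, -1], [ 2, -1, -1]]),
}

def likely_line_direction(responses):
    # Given the responses R1..R4 at a pixel, return the index k of the mask
    # with the largest response, i.e. the most likely line direction.
    return int(np.argmax(responses))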

Edge Detection

Figure 2.4: (a) Two regions of constant intensity separated by an ideal vertical ramp edge. (b) Detail near the edge, showing a horizontal intensity profile.

We conclude from this observation that the magnitude of the first derivative can be used to detect the presence of an edge at a point in an image. The second derivative has two properties: (1) it produces two values for every edge in an image; (2) its zero crossings can be used to locate the centers of thick edges.

K-Means can be thought of as an algorithm relying on hard assignment of information to a given set of partitions. At every pass of the algorithm, each data value is assigned to the nearest partition based upon some similarity parameter such as the Euclidean distance of intensity. The partitions are then recalculated based on these hard assignments. With each successive pass, a data value can switch partitions, thus altering the values of the partitions at every pass. K-Means algorithms typically converge to a solution very quickly compared to other clustering algorithms.

In one of its forms, the clustering problem can be defined as follows: given a dataset of N records, each having dimensionality d, partition the data into subsets such that a specific criterion is optimized. The most widely used criterion for optimization is the distortion criterion. Each record is assigned to a single cluster, and distortion is the average squared Euclidean distance between a record and the corresponding cluster center. Thus this criterion minimizes the sum of the squared distances of each record from its corresponding center. K-means clustering minimizes the above-mentioned term by partitioning the data into k non-overlapping regions identified by their centers. K-means is defined only over numeric or continuous data, since it requires the ability to compute Euclidean distances as well as the mean of a set of records.
Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center. No efficient exact methods are known for any of these problems, and some formulations are NP-hard. An asymptotically efficient approximation for the k-means clustering problem is presented in [11], but the large constant factors suggest that it is not a good candidate for practical implementation.

The k-means algorithm is based on the simple observation that the optimal placement of a center is at the centroid of the associated cluster. Thus, given any set of k centers C, for each center c ∈ C, let V(c) denote the Voronoi region of the point c, that is, the region of space that is closer to c than to any other center. At every stage of k-means, the algorithm moves each center c ∈ C to the centroid of the points in V(c) and then updates V(c) by recomputing the distance from each point to its nearest center. These steps are repeated until some convergence condition is met. For points in general position (in particular, if no point is equidistant from two centers), the algorithm will converge to a configuration where further stages of the algorithm will not change the position of any center. This would be an ideal convergence condition. However, converging to the point where no center is further updated might take very long. Thus the algorithm is stopped when the change in distortion is less than a threshold. This saves a lot of time, as in the last stages the centers move very little in every stage. Even without the approximation of convergence based on distortion, the results obtained are not necessarily a global minimum. The results obtained vary greatly based on the initial set of centers chosen. The algorithm is deterministic after the initial centers are determined.
The attractiveness of k-means lies in its simplicity and flexibility. In spite of other algorithms being available, k-means continues to be an attractive method because of its convergence properties. However, it suffers from major shortcomings that have been a cause for it not being implemented on large datasets. The most important among these are:
(i) k-means is slow and scales poorly with respect to the time it takes for large numbers of points;
(ii) the algorithm might converge to a solution that is a local minimum of the objective function.
Much of the related work does not attempt to confront both of the aforementioned issues directly. Because of these shortcomings, k-means is at times used as a hill-climbing method rather than a complete clustering algorithm, where the centroids are initialized to the centers obtained from some other method.
The clusters obtained depend heavily on the initial centers. A discussion on how to choose the initial centers is given in [9]. To treat the problem of choosing the initial centers, several other techniques using stochastic global optimization methods (e.g. simulated annealing, genetic algorithms) have been developed. Different methods of sub-sampling and approximation have also been proposed to counter the above. However, none of these methods has gained wide acceptance, and in many practical applications the clustering method that is used is based on running the k-means algorithm with multiple restarts.
There has been some related work in which one of the above two issues has been addressed. The methods in [2, 3] and [4] try to address the problem of efficient implementation of the k-means algorithm. In [1] a method is proposed that addresses the problem of the local convergence of the k-means algorithm: an iterative algorithm that comes closer to the global solution by running a set of deterministic k-means algorithms one by one. But the proposed strategy is very slow and has little practical application even in the case of medium-sized datasets.
Before proceeding, we explicitly define the problem that we are looking at. In the traditional k-means algorithm (for which we develop an improved solution), each of the points is associated with only one partition, also called a cluster. The number of partitions is pre-specified. Each of the partitions is identified by a cluster center, which is the mean of all the points in the partition. All the points in a partition are closer to its center than to any other cluster center. The accuracy of a clustering approach is defined by the distortion criterion, which is the mean squared distance of a point from its cluster center. Thus the objective is to get the clustering with minimum distortion for the specified number of clusters.
In this thesis we address the issues of scaling the algorithm to large numbers of points and of convergence to local optima. We present a fast greedy k-means algorithm. The proposed algorithm is entirely deterministic and so avoids the problem of having to initialize the centers. In addition to combining ideas from [2, 3], [4], [1] and the standard k-means algorithm, the proposed algorithm also introduces some new techniques to make k-means more tractable. The implementation is also simple and only requires variants of a kd-tree.
The rest of the thesis is organized as follows. In the next chapter, we introduce the various concepts that are related to this algorithm. In chapter 3 we outline our algorithm, its variants and the new ideas implemented. Chapter 4 explains the implementation issues so that similar results can be obtained in independent studies later. Experimental results on both synthetic and real data are reviewed in chapter 5. Chapter 6 gives the conclusions of the research and chapter 7 points to certain directions in which further research could be undertaken. The thesis closes with relevant references in chapter 8.


2 Definitions and Related Work

In this chapter we look at relevant work and definitions that are important for the understanding of the thesis. The ideas covered in this chapter are the standard k-means algorithm, also referred to as Lloyd's algorithm in [4]; a brief introduction to kd-trees; the global k-means algorithm as in [1], including the greedy approximations and other methods suggested in the same paper for improving the performance of the method; and the blacklisting/filtering algorithm as in [4] and [2]. Several definitions given in these papers, which we will use in this thesis, are also included. We do not, however, give any proofs of the theorems or lemmas that are already proved in these papers.


2.1 Naïve k-means algorithm

One of the most popular heuristics for solving the k-means problem is based on a
simple iterative scheme for finding a locally optimal solution. This algorithm is often
called the k-means algorithm. There are a number of variants of this algorithm, so to clarify which version we are using, we will refer to it as the naïve k-means algorithm, as it is much simpler compared to the other algorithms described here. This algorithm is also referred to as Lloyd's algorithm in [4].
The naive k-means algorithm partitions the dataset into k subsets such that all
records, from now on referred to as points, in a given subset "belong" to the same
center. Also the points in a given subset are closer to that center than to any other
center. The partitioning of the space can be compared to that of Voronoi partitioning
except that in Voronoi partitioning one partitions the space based on distance and
here we partition the points based on distance.
The algorithm keeps track of the centroids of the subsets, and proceeds in simple
iterations. The initial partitioning is randomly generated, that is, we randomly
initialize the centroids to some points in the region of the space. In each iteration
step, a new set of centroids is generated using the existing set of centroids following
two very simple steps. Let us denote the set of centroids after the i-th iteration by C(i). The following operations are performed in these steps:
(i) Partition the points based on the centroids C(i), that is, find the centroid to which each of the points in the dataset belongs. The points are partitioned based on the Euclidean distance from the centroids.

(ii) Set each new centroid c(i+1) ∈ C(i+1) to be the mean of all the points that are closest to c(i) ∈ C(i). The new location of the centroid in a particular partition is referred to as the new location of the old centroid.


The algorithm is said to have converged when recomputing the partitions does not result in a change in the partitioning. In the terminology that we are using, the algorithm has converged completely when C(i) and C(i+1) are identical. For configurations where no point is equidistant to more than one center, the above convergence condition can always be reached. This convergence property, along with its simplicity, adds to the attractiveness of the k-means algorithm.
The naïve k-means algorithm needs to perform a large number of "nearest-neighbor" queries for the points in the dataset. If the data is d-dimensional and there are N points in the dataset, the cost of a single iteration is O(kdN). As one would have to run several iterations, it is generally not feasible to run the naïve k-means algorithm for large numbers of points.
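The two steps above translate directly into code. The following minimal Python sketch of one naïve k-means iteration is our own illustration (the thesis implementation itself is not reproduced here); the function name kmeans_iteration is ours.

import numpy as np

def kmeans_iteration(points, centers):
    # points: (N, d) array, centers: (k, d) array; cost of one call is O(kdN).
    # (i) partition the points by Euclidean distance to the current centroids
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    # (ii) move each centroid to the mean of the points closest to it
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = points[labels == j]
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)
    return new_centers, labels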


Sometimes the convergence of the centroids (i.e. C(i) and C(i+1) being identical) takes several iterations. Also, in the last several iterations the centroids move very little. As running the expensive iterations so many more times might not be efficient, we need a measure of convergence of the centroids so that we can stop the iterations when the convergence criterion is met. Distortion is the most widely accepted measure. Clustering error measures the same criterion and is sometimes used instead of distortion. In fact the k-means algorithm is designed to optimize distortion: placing the cluster center at the mean of all the points minimizes the distortion for the points in the cluster, and when another cluster center is closer to a point than its current cluster center, moving the point from its current cluster to the other reduces the distortion further. These two steps are precisely the steps performed by k-means. Thus k-means reduces distortion locally in every step, and the algorithm terminates at a solution that is locally optimal for the distortion function. Hence, a natural choice as a convergence criterion is distortion. Among other measures of convergence used by other researchers, one can measure the sum of Euclidean distances of the new centroids from the old centroids, as in [9]. In this thesis we always use the clustering error/distortion as the convergence criterion for all variants of the k-means algorithm.
Definition 1: The clustering error is the sum of the squared Euclidean distances from the points to the centers of the partitions to which they belong.
Mathematically, given a clustering φ, we denote by φ(x) the centroid this clustering associates with an arbitrary point x (so for k-means, φ(x) is simply the center closest to x). We then define a measure of quality for φ:

$$ \text{distortion} = \frac{1}{N} \sum_{x} | x - \phi(x) |^2 \qquad (1) $$

where |a| is used to denote the norm of a vector a. The smaller the difference in distortion over successive iterations, the more the centroids have converged. Distortion is therefore used as a measure of goodness of the partitioning.
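A direct computation of equation (1), as a minimal Python sketch (the function name and array layout are our own choices):

import numpy as np

def distortion(points, centers, labels):
    # Equation (1): mean squared Euclidean distance of every point x
    # from the center phi(x) of the partition it belongs to.
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2)) / len(points)

Iterating kmeans_iteration from the previous sketch until the decrease in this value falls below a threshold gives the naïve k-means algorithm with the convergence criterion used throughout the thesis.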
In spite of its simplicity, k-means often converges to local optima. The quality of the
solution obtained depends heavily on the initial set of centroids, which is the only
non-deterministic step in the algorithm. Note that although the starting centers can
be selected arbitrarily, k-means is fully deterministic, given the starting centers. A
bad choice of initial centers can have a great impact on both performance and
distortion. Also a good choice of initial centroids would reduce the number of
iterations that are required for the solution to converge. Many algorithms have tried
to improve the quality of the k-means solution by suggesting different ways of sampling the initial centers, but none has been able to avoid the problem of the solution converging to a local optimum. For example, [9] gives a discussion on how to choose the initial centers; other techniques using stochastic global optimization methods (e.g. simulated annealing, genetic algorithms) have also been developed. None of these algorithms is widely accepted.


2.2 Kd-trees

A very important data structure that is used in our algorithm is the kd-tree. A kd-tree is a data structure for storing a finite set of points from a d-dimensional space. It was introduced and examined in detail in [12, 13].
Kd-trees are simple data structures with the following properties:
(i) They are binary trees;
(ii) The root node contains all the points;
(iii) A node is split along a splitting plane such that points to the left of the plane are part of the left sub-tree and points to the right are part of the right sub-tree;
(iv) The left and right sub-trees are recursively split until there is only one point in a leaf or a certain termination condition is satisfied.

This is the basic kd-tree structure. There exist several variants of the kd-tree, based on the way in which they choose the splitting plane, the termination criteria, etc. Originally designed to decrease the time spent in nearest-neighbor queries, kd-trees have found other applications as well. Omohundro has recommended them in a survey of possible techniques to increase the speed of neural network learning in [14]. In the context of k-means, kd-trees are used as a data structure to store the points in the dataset by the methods described in [10], [4] and [2, 3]. A variant of the kd-tree is also used in [1].
Though kd-trees give a substantial advantage in lower dimensions, their performance drops in higher dimensions. Other data structures like AD-trees [5] have been suggested for higher dimensions [2], but these have never been used for k-means.
After this brief introduction to kd-trees (the primary data structure used in our algorithm), we discuss the two main approaches that try to counter the shortcomings of the k-means algorithm.


2.3 The Greedy K-means Algorithm

A variant of the k-means algorithm called the global k-means algorithm is presented in [1]. The local convergence properties of k-means are improved in this algorithm. Also, it does not require an initial set of centroids to be chosen. The idea is that the global minimum can be reached through a series of local searches based on the global clustering with one cluster less.
Assumption: The assumption used in the algorithm is that the global optimum can be reached by running k-means with the (k-1) clusters placed at the optimal positions for the (k-1)-clustering problem and the k-th cluster placed at an appropriate position that is yet to be discovered.
Let us assume that the problem is to find K clusters and that k ≤ K. Using the above assumption, the global optimum for k = K clusters is computed as a series of local searches. Assuming that we have solved the k-means clustering problem for K - 1 clusters, we have to place a new cluster at an appropriate location. To discover the appropriate insertion location, which is not known, we run the k-means algorithm until convergence with each of the points in the dataset being added, one at a time, as the candidate new cluster to the K - 1 clusters. The converged K clusters that have the minimum distortion after the convergence of k-means in the above local searches are the clusters of the global k-means.
We know that for k = 1 the optimal clustering solution is the mean of all the points in the dataset. Using the above method we can compute the optimal positions for k = 2, 3, 4, ..., K clusters. Thus the process involves computing the optimal k-means centers for each of k = 1, 2, 3, ..., K. The algorithm is entirely deterministic.
Though the attractiveness of the global k-means lies in its finding the global solution, the method involves a heavy cost: k-means is run N times, where N is the number of points in the dataset, for every cluster to be inserted. The complexity can be reduced considerably by not running k-means with the new cluster inserted at each of the dataset points, but by finding another set of points that could act as an appropriate set of insertion locations for the new cluster. In [1] the use of the centers of regions formed by a variant of the kd-tree is suggested as the insertion points for new centroids. This variant of the kd-tree splits the points in a node using the plane that passes through the mean of the points in the node and is perpendicular to the principal component of the points in the node. A node is not split if it has fewer than a pre-specified number of points or if an upper bound on the number of leaf nodes is reached. The idea is that even if the kd-tree were not used for nearest-neighbor queries, merely the construction of the kd-tree based on this strategy would give a very good preliminary clustering of the data. We can thus use the kd-tree node centers as the candidate/initial insertion positions for the new clusters.
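A minimal sketch of this splitting rule, under the assumption that only a leaf-size criterion stops the recursion (the thesis also allows an upper bound on the number of leaves); the function name pca_split_leaves and the NumPy-based principal-component computation are our own illustration, not the thesis code.

import numpy as np

def pca_split_leaves(points, min_leaf_size=20):
    # Recursively split the points by the plane through their mean,
    # perpendicular to their principal component; return the centroid of
    # every leaf as a candidate insertion position for a new cluster.
    if len(points) <= min_leaf_size:
        return [points.mean(axis=0)]
    centered = points - points.mean(axis=0)
    # principal component = covariance eigenvector with the largest eigenvalue
    _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    principal = vecs[:, -1]
    side = centered @ principal <= 0.0
    left, right = points[side], points[~side]
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop here
        return [points.mean(axis=0)]
    return (pca_split_leaves(left, min_leaf_size) +
            pca_split_leaves(right, min_leaf_size))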
The time complexity of the algorithm can also be improved by taking a greedy approach. In this approach, running k-means for each possible insertion position is avoided. Instead, the reduction in distortion obtained when the new cluster is added is computed without actually running k-means. The point that gives the maximum decrease in distortion when added as a cluster center is taken to be the new insertion position. K-means is then run until convergence on the new list of clusters with this added point as the new cluster. The assumption is that the point that gives the maximum decrease in distortion is also the point for which the converged clusters would have the least distortion. This results in a substantial improvement in the running time of the algorithm, as it is unnecessary to run k-means for all the possible insertion positions. However, the solution may not be globally optimal but an approximate global solution.
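One way to compute this decrease without running k-means, following the greedy global k-means idea summarized above, is sketched below: a new center placed at a candidate location captures exactly those points that are closer to it than to their current center, so the total squared error drops by at least the sum of max(d_i^2 - |x_n - x_i|^2, 0) over all points x_i, where d_i is the distance of x_i to its closest current center. The function names are ours.

import numpy as np

def insertion_gain(points, current_sq_dists, candidate):
    # Guaranteed decrease in the total squared error when a new center is
    # placed at `candidate`, keeping all existing centers fixed.
    cand_sq = np.sum((points - candidate) ** 2, axis=1)
    return float(np.sum(np.maximum(current_sq_dists - cand_sq, 0.0)))

def best_insertion(points, current_sq_dists, candidates):
    # Pick the candidate insertion position with the largest guaranteed gain.
    gains = [insertion_gain(points, current_sq_dists, c) for c in candidates]
    return candidates[int(np.argmax(gains))]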


2.4 Accelerating the naïve K-Means Algorithm

The high time complexity of the k-means algorithm makes it impractical for use with large numbers of points in the dataset. Reducing the large number of nearest-neighbor queries in the algorithm can accelerate it. The kd-tree data structure helps in accelerating the nearest-neighbor queries and is a natural choice to reduce the complexity. There are at least three variants of the k-means algorithm using kd-trees, presented in [2], [4] and [10]. In all of these the kd-tree is built from the points in the dataset. Such usage of kd-trees in k-means was first done in a method detailed in [10].

2.4.1 The Blacklisting Algorithm

A method to speed up the execution of a single k-means iteration, thus making k-means tractable for large datasets, is presented in [2]. This algorithm is called the blacklisting algorithm.
In spite of its simplicity, the naïve k-means algorithm involves a very large number of nearest-neighbor queries. The blacklisting algorithm proceeds by reducing the number of nearest-neighbor queries that the k-means algorithm executes. Thus the speed is enhanced greatly and the time taken for a single iteration is greatly improved. The algorithm remains exact, and it would still take the same number of iterations to converge as the naïve k-means.
The blacklisting algorithm uses a variation of the kd-tree, called the mrkd-tree, described in [5]. In this variation of the kd-tree, the region of a node is a hyper-rectangle identified by two vectors, h_max and h_min. The node is split at the midpoint of the hyper-rectangle, perpendicular to its longest dimension. The root node represents the hyper-rectangle that encloses all the points in the dataset.
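A minimal sketch of such a node in Python (our own illustration): here the bounds h_min and h_max are taken as the tight bounding box of the points in the node, which anticipates the boundary modification described in section 3.2, and the node caches the point count, vector sum and sum of squared norms needed for the bulk updates described below.

import numpy as np

class MrkdNode:
    # A node covers the hyper-rectangle [h_min, h_max] and caches the
    # sufficient statistics needed to assign its points to a center in bulk.
    def __init__(self, points, leaf_size=32):
        self.points = points
        self.h_min = points.min(axis=0)
        self.h_max = points.max(axis=0)
        self.count = len(points)
        self.sum = points.sum(axis=0)
        self.sum_sq = float(np.sum(points ** 2))
        self.left = self.right = None
        if self.count > leaf_size:
            # split at the midpoint of the longest dimension of the rectangle
            dim = int(np.argmax(self.h_max - self.h_min))
            mid = 0.5 * (self.h_max[dim] + self.h_min[dim])
            mask = points[:, dim] <= mid
            if mask.any() and (~mask).any():
                self.left = MrkdNode(points[mask], leaf_size)
                self.right = MrkdNode(points[~mask], leaf_size)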
The algorithm exploits the fact that, instead of updating a particular cluster point by point from the points that belong to it, a more efficient approach is to update it in bulk, i.e. in groups. Naturally these groups correspond to hyper-rectangles in the kd-tree. Points can be assigned in bulk using the known centers of mass and sizes of groups of points, provided all the points in a group are being assigned to the same cluster. To ensure correctness we must first make sure that all of the points in a given rectangle indeed belong to a specific center before adding their statistics to it.
Before proceeding to give a description of the algorithm, we give some definitions
that are necessary for an understanding of the algorithm. For further details we refer
to [2] and [4].


Definition 2: Given a set of centers C and a hyper-rectangle h, we define owner_C(h) as a center c ∈ C such that any point in h is closer to c than to any other center in C, if such a center exists.

Theorem 1: Let C be a set of centers and h a hyper-rectangle, and let c ∈ C be owner_C(h). If d(c, h) denotes the distance of the center c from the hyper-rectangle h (the minimum distance from c to any point of the hyper-rectangle, so it is zero when the center lies inside h and is otherwise attained on one of the boundaries), then:

$$ d(c, h) \leq \min_{c' \in C} d(c', h) $$

The above theorem states that if the hyper-rectangle is dominated by a centroid, then that centroid is the one closest to the hyper-rectangle.


Figure 5: The dotted lines are the decision lines between C1, C2 and C1, C3. C1 is the closest center to each of the hyper-rectangles, and C1 dominates the hyper-rectangles U1, U2 and U3.
Definition 3: Given a hyper-rectangle h and two centers c1 and c2, we say that c1 dominates c2 with respect to h if every point in h is closer to c1 than it is to c2.
Thus owner_C(h) dominates all the centers in C except itself. That is, we say that a hyper-rectangle is dominated by a centroid if no other centroid is closer to any point of the hyper-rectangle than this centroid.
Lemma 1: Given two centers c1, c2 and a hyper-rectangle h such that d(c1, h) < d(c2, h), the decision problem "does c1 dominate c2 with respect to h?" can be answered in O(d) time, where d is the dimensionality of the dataset.


The test mentioned in Lemma 1 can be done by taking the extreme point p of the hyper-rectangle h in the direction v = c2 - c1 and verifying whether p is closer to c1 or to c2. If it is closer to c1, then c1 dominates c2.
Furthermore, finding the extreme point in the direction v = c2 - c1, that is, solving the linear program maximize ⟨v, p⟩ such that p ∈ h, can be done in time O(d). The extreme point is a corner of h: for each coordinate i, we choose p_i to be h_i^max if c2_i > c1_i, and h_i^min otherwise.
Using Theorem 1 and Lemma 1 we can find, in at most O(kd) time, where k is the number of cluster centers passed to the node and d the dimensionality of the data, whether any cluster center dominates the hyper-rectangle. If so, all the points in this hyper-rectangle can be assigned to that cluster.

Figure 6: This picture shows the vectors C2 - C1 and C3 - C1. E1, E2, E3 and E4 are the extreme points in the direction C2 - C1. E3, E4, E5 and E6 are the extreme points in the direction C3 - C1.
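The distance d(c, h) of Theorem 1 and the dominance test of Lemma 1 can be sketched as follows (our own helper names, with the rectangle given by its bounds h_min and h_max as in the mrkd-tree node sketch above):

import numpy as np

def dist_to_rect(c, h_min, h_max):
    # Minimum Euclidean distance from the center c to the hyper-rectangle
    # [h_min, h_max]; zero when c lies inside the rectangle.
    gap = np.maximum(h_min - c, 0.0) + np.maximum(c - h_max, 0.0)
    return float(np.linalg.norm(gap))

def dominates(c1, c2, h_min, h_max):
    # Lemma 1: take the corner p of the rectangle that is extreme in the
    # direction c2 - c1; c1 dominates c2 iff p is still closer to c1. O(d).
    p = np.where(c2 > c1, h_max, h_min)
    return float(np.sum((p - c1) ** 2)) < float(np.sum((p - c2) ** 2))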


All the cluster centers are initially passed to the root node. The node verifies whether a single cluster center dominates its region, using Theorem 1 and Lemma 1 as described above. That is, we find the centroid closest to the hyper-rectangle (in the case of the blacklisting algorithm) or the centroid closest to the center of the hyper-rectangle (filtering algorithm). Using Lemma 1, we can then verify whether this centroid dominates all the other centroids. If so, all the points in the node are assigned to that center. If the closest center does not dominate all the other centroids, then the closest centroid and the centroids not dominated by it are passed to the child nodes. The same process of filtering is repeated at the child nodes. When a node is completely assigned to a single centroid, the centroid's details are updated based on the sum of the points in the node, their squared norms and the number of points assigned, so that the distortion and the mean of the points assigned to a centroid can be computed.

2.4.2 Variants of k-means using kd-trees

At least two other variants of k-means using the kd-tree are known. These are described in [10] and [4].
The method in [10] is probably the first use of a kd-tree for k-means in a manner similar to the blacklisting algorithm. A kd-tree is used as a data structure to store the points, and points are assigned to the centers in bulk based on the regions of the kd-tree. However, in [10] the concept of domination of a node by a center, as in blacklisting, is not used. Instead, a center is eliminated from consideration for a node only if its minimum distance from the hyper-rectangle is greater than the maximum distance of some other center from it.
In [4] a method very similar to the blacklisting algorithm, called the filtering algorithm, is suggested for speeding up the execution of k-means clustering. This algorithm differs from the blacklisting algorithm in only two places. One difference is that, in order to find a potential owner, the distance from the center of the hyper-rectangle is used instead of the distance from the cell as in Theorem 1. This has the advantage that when more than one cluster center lies inside the hyper-rectangle, the cluster center that is closer to the center of the hyper-rectangle is chosen. The other difference is that during the kd-tree construction the filtering algorithm uses a sliding midpoint plane for splitting a node: if one of the child nodes would contain no points, the splitting plane is shifted so that at least one point is contained in every node. See [6] for further discussion of the sliding midpoint split in kd-trees.


All three algorithms, the one described in [10], the blacklisting algorithm and the filtering algorithm, proceed in a very similar fashion. The algorithms are recursive in nature and descend to the child nodes if an owner does not exist. All the cluster centers are initially passed to the root node. The node verifies whether a single cluster center is closer to its region than any other, using the respective criterion. If so, all the points in the node are assigned to that center. If the closest center does not dominate all the other centroids, then the closest centroid and the centroids not filtered out (again based on the respective criteria) are passed to the child nodes. The same process of filtering out centroids is repeated at the child nodes. When a node is completely assigned to a single centroid, the centroid's details are updated based on the sum of the points in the node, their squared norms, and other cached statistics, so that the new location of the centroid and the distortion can be computed.
In the filtering algorithm, nodes are classified, based on how the tree is traversed, as expanded if the children of that node were visited.
Definition 4: A node is expanded if its children are visited when the tree is traversed.
Though these algorithms work well in lower dimensions, their performance is not as good in higher dimensions, because kd-trees themselves are not very effective in higher dimensions. However, the time complexity of these algorithms does not increase linearly with the number of points or the number of clusters, which is the case for the naïve k-means algorithm. As a result, these algorithms are better suited for large numbers of points and centers. In [4] it is proven that when the cluster centers are close to the actual centers, the number of nodes expanded in a k-means iteration of the filtering algorithm is of the order O(nk²/ρ²), where ρ is the cluster separation. Thus the time complexity of k-means decreases as the cluster separation increases. This is in contrast to the O(nk) cost of a single naïve k-means iteration for all degrees of separation.
The filtering algorithm, which computes the closest centroid based on its distance from the center of the hyper-rectangle, performs better than the blacklisting algorithm. One advantage of using the sliding midpoint plane for splitting over the method in the blacklisting algorithm is that every node has at least one point, so any node with no points in its region is eliminated. In our algorithm this idea is subsumed by redefining the node boundaries so that regions with no points never arise, and the sliding midpoint split has therefore been discarded.
In this chapter we have covered all the basic concepts that are needed for understanding the algorithms discussed in the following chapter. We use all the features discussed in this chapter in our algorithms, and also introduce some new concepts that give the algorithm an additional advantage.


3 Accelerating the Global k-means

In this chapter we combine the three algorithms described in the previous chapter. That is, we combine the global k-means algorithm suggested in [1] and the fast k-means algorithms suggested in [2] and [4] with the naïve k-means algorithm, thereby making the global k-means tractable for large datasets. The kd-tree as in [1] is used for the insertion positions of the centroids. The kd-tree as in [2], as in the blacklisting algorithm, with minor modifications, is used for accelerating the k-means iterations. Certain other modifications are proposed that speed up the algorithm considerably. As for the concepts described in [1], the same assumption is also used here. We restate this assumption once more below:
Assumption: The assumption used in the algorithm is that the global optimum can be reached by running k-means with the (k-1) clusters placed at the optimal positions for the (k-1)-clustering problem and the k-th cluster placed at an appropriate position that is yet to be discovered.
This chapter is organized in three sections. In the first section we give an overview of the fast greedy k-means algorithm, with a brief description of the various methods that are part of the algorithm.
In the next section we discuss an algorithm that enhances the fast k-means algorithms, that is, an algorithm based on the blacklisting algorithm and the filtering algorithm. This is the algorithm that we use in the fast greedy k-means algorithm for the convergence of clusters. The new algorithm is a mixture of the naïve approach and the approach using kd-trees, and is faster than either of these. This fast algorithm is named the restricted filtering algorithm, as filtering is not done when certain criteria are satisfied. The restricted filtering algorithm also uses some new concepts not used in either the naïve k-means or the filtering algorithm.
The final section gives an algorithm that acts as the crucial link in combining the restricted filtering algorithm and the greedy global k-means algorithm. Efficient implementation of this link is thus very important for the fast greedy k-means algorithm. We then combine the restricted filtering algorithm and the greedy global k-means algorithm to get a fast greedy k-means algorithm.
The fast greedy k-means algorithm overcomes the major shortcomings of the k-means algorithm discussed in chapter 1, viz. possible convergence to local minima and large time complexity with respect to the number of points in the dataset. However, there is a limitation in that, though the algorithm can be used for large numbers of data points, it might still not be feasible to use the fast greedy k-means algorithm for large numbers of clusters and higher dimensions.


3.1 The overall algorithm

The fast greedy k-means algorithm uses the concepts of the greedy global k-means algorithm to approach a global solution. The intermediate convergence is done using the restricted filtering algorithm instead of the naïve k-means algorithm.
The following four steps outline the algorithm:

1. Construct an appropriate set of positions/locations which can act as good candidates for the insertion of new clusters;
2. Initialize the first cluster as the mean of all the points in the dataset;
3. In the K-th iteration, assuming K-1 converged clusters, find an appropriate position for the insertion of a new cluster from the set of points created in step 1, namely the one that gives minimum distortion;
4. Run k-means with K clusters until convergence. Go back to step 3 if the required number of clusters has not yet been reached.

Step 1: The first step of the algorithm is to find an appropriate set of points that can act as the insertion positions for the new clusters in the cluster-insertion steps of the algorithm. This is done using the kd-tree as in [1]. Thus, the fast greedy k-means algorithm starts by first building the kd-tree so that the centroids of the points contained in the leaf nodes can be used as the insertion positions for new clusters. This variant of the kd-tree splits the points in a node using the plane that passes through the mean of the points and is perpendicular to the principal component of the points in the node. The number of buckets (leaf nodes) is pre-specified, so a node is not split if the maximum number of buckets has been reached. Normally the number of buckets is taken to be the number of clusters that one wants in the final solution. For further discussion of the properties of this kd-tree, see [1].
Step 2: Once the cluster insertion locations are determined, the algorithm initializes the first cluster to be the mean of the points in the dataset. This is the optimal cluster center for K = 1 and the first step in our iterative process.
The algorithm proceeds based on the following concept. Let us assume that we have the optimal clustering for K - 1 clusters. Using the assumption in [1], also described in chapter 2, we have to find an appropriate cluster location for insertion so that the cluster locations of the (K-1)-means solution together with the new cluster location converge to the optimal cluster locations for K-means.
Step 3: One way to find the new insertion location is to run k-means until convergence with each of the candidate cluster locations computed in step 1 added to the (K-1)-means optimal solution. However, this is very time consuming. As a result we use a greedy approach that is an approximation: we only compute the new distortion when the new cluster is added at a candidate insertion position, keeping the existing K-1 centers at their own positions. The candidate that gives the minimum distortion is taken to be the cluster center that is most appropriate for obtaining the optimal K-means solution.
Step 4: The final step in the algorithm is to run k-means until convergence on the configuration of the K-1 clusters plus the new cluster added at the location determined in step 3. If we have the required number of clusters, the algorithm is terminated. Otherwise we go back to step 3, find another insertion location and run the algorithm for the new set of clusters.
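A minimal Python sketch of this outer loop is given below, using plain naïve k-means iterations for step 4; in the thesis, step 4 is carried out by the restricted filtering algorithm of the next section, and the candidate positions of step 1 can be obtained, for example, with the pca_split_leaves sketch from chapter 2. The function name and the tolerance parameter tol are our own choices.

import numpy as np

def fast_greedy_kmeans(points, K, candidates, tol=1e-6):
    # candidates: (m, d) array of insertion positions from step 1.
    centers = points.mean(axis=0)[None, :]          # step 2: optimal 1-means solution
    while len(centers) < K:
        # squared distance of every point to its closest current center
        sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = sq.min(axis=1)
        # step 3: greedy choice of the candidate with the largest guaranteed
        # reduction in distortion, without running k-means
        gains = [np.sum(np.maximum(d2 - ((points - c) ** 2).sum(axis=1), 0.0))
                 for c in candidates]
        centers = np.vstack([centers, candidates[int(np.argmax(gains))]])
        # step 4: run k-means until the decrease in distortion is below tol
        prev = np.inf
        while True:
            sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = sq.argmin(axis=1)
            for j in range(len(centers)):
                if np.any(labels == j):
                    centers[j] = points[labels == j].mean(axis=0)
            cur = sq.min(axis=1).mean()
            if prev - cur < tol:
                break
            prev = cur
    return centers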


3.2 Restricted filtering Algorithm

The restricted filtering algorithm concentrates on accelerating step 4 of the fast greedy k-means algorithm. As this step is executed several times, in a loop, accelerating the large number of nearest-neighbor queries that the algorithm executes in step 4 can significantly reduce the time complexity of the algorithm. For each iteration of k-means, one has to assign the data points to their nearest cluster centers and then update the cluster centers to be the means of the points assigned to them.
The restricted filtering algorithm is very similar to the filtering/blacklisting algorithm explained in chapter 2. Modifications and extensions have been made to improve efficiency in the construction of nodes, the space used and the number of nodes spanned. The modifications/extensions are:
1. Boundaries of the node: A modification is made at the stage when the kd-tree is created. In the methods of both [2] and [4], one of the boundaries of the region of a node is the plane perpendicular to the dimension along which the parent hyper-rectangle is split; the other extreme along this dimension is typically the extreme value along that dimension in the parent. As a result, the union of the regions of the children equals the region of the parent. However, in this scheme large parts of a node's region may contain no points, and these parts are still considered during the dominance check. An improvement can be made by redefining the boundaries of the node based on the points in the region: the bounds of the region are the extreme values that the points actually take. Thus if there is only one data point in the hyper-rectangle, its coordinates are both the upper and the lower extremes of the region considered by the algorithm. This gives a marginal improvement during filtering and also helps when the resulting region is not split by any of the Voronoi boundaries of the centroids. It can also be noted that if the boundaries are redefined this way, each of the child nodes, formed by splitting the node along its longest dimension, has at least one point, thus eliminating the need for the sliding midpoint split suggested in [4].
2. Threshold for direct computation of distances: Another change is in the condition under which we descend into the children of a node. The cost overheads of the blacklisting and filtering algorithms are high, not just when creating the tree but also when traversing it, as several statistics have to be maintained and computed for each node individually. Furthermore, the cost of traditional k-means is significantly lower for small numbers of points and centers. It is therefore more natural to use a combination of the accelerated and the naïve algorithm. When the cost of the naïve algorithm is low, i.e. when the product of the number of data points in the hyper-rectangle and the number of centroids to be considered is less than a given value, we directly compute the distances and assign the points in the hyper-rectangle to the centers instead of descending the tree. The threshold value is decided based on experimental results on random datasets of various dimensions and sizes. This change results in a speed-up of at least two to three times compared to the implementation without the threshold. It should be noted that even if the average height of the kd-tree is reduced by just one, the number of nodes to be expanded is almost halved. This reduces large overheads like recursion and the maintenance of statistics for each node. In certain languages like MATLAB, having to run a large number of statements, as opposed to the few statements required for direct computation, is also a big overhead.
3. Kd-tree construction at runtime: As a result of the modification in (2) above, not all nodes with more than one data point have children. The children of a node are therefore determined only at run time, i.e. only if the node is expanded in some iteration. Once the children of a node have been defined, they are stored so that they do not have to be redefined. This gives a considerable advantage, as the depth of the tree is reduced considerably. The space complexity is also drastically reduced: even if the average depth is reduced by just one, the space taken by the tree is almost halved compared to a full tree.

Figure 7: This figure shows a typical case in which using the threshold is advantageous. The decision line passes almost through the middle of one of the child hyper-rectangles, which contains only 7 points. Instead of splitting this node, we directly compute the distortion.


The restricted filtering algorithm is summarized in the following pseudocode:

Update (Node U, Centers C) {
    Do any initializations to U if this is the first call   // boundaries set to the extreme points
    If the threshold criterion for direct computation is satisfied {
        Directly compute the distances; update the statistics of the centers
        Return
    }
    Find the centroid c* closest to the center of the hyper-rectangle U
    If c* dominates all other centroids {
        All points in U belong to c*; update statistics
        Return
    } else {
        Find the centers in C, C_unfiltered, not dominated by c*
        Do initializations to the child nodes if this is the first call   // points in the node split among the children
        Update (U.left, C_unfiltered)
        Update (U.right, C_unfiltered)
        Return
    }
}

The main operations in k-means are to find the mean of the points in a cluster region and to compute the distortion. When the distances are computed point by point, the computation of these statistics is trivial. However, when using the restricted filtering algorithm, points are assigned in bulk, so the computation is not so straightforward. While the mean can be obtained by initially computing and storing the sum of all the points in a node, there are two ways in which the distortion can be computed. They are described below.


3.2.1 Computing Distortion for the current set of centers

For all points x that belong to a particular center φ(x), the contribution to the total distortion is

$$ \frac{1}{N}\sum_{x} | x - \phi(x) |^2 \;=\; \frac{1}{N}\Big( \sum_{x} |x|^2 \;-\; 2\,\phi(x)^T \sum_{x} x \;+\; N_c\, |\phi(x)|^2 \Big) \qquad (2) $$

where the sums run over the points assigned to that center, N_c is the number of such points and N is the total number of points. The above equation can be used directly to compute the distortion for the current set of centers if one has stored the statistics of the sum of the squared norms of the points, the sum of the points that belong to a single cluster and the number of points in each cluster.

3.2.2 Computing the distortion for the updated list of centers

The same statistics can be used to compute the distortion with respect to the updated centroids. This is a new method of computing the distortion devised by us. Let φ'(x) be the new centroid, computed as the mean of the points in each region:

$$ \phi'(x) = \frac{1}{N_c}\sum_{x} x \qquad (3) $$

Substituting this into equation (2), we get

$$ \sum_{x} | x - \phi'(x) |^2 \;=\; \sum_{x} |x|^2 \;-\; 2\,\phi'(x)^T \big( N_c\,\phi'(x) \big) \;+\; N_c\, |\phi'(x)|^2 \;=\; \sum_{x} |x|^2 \;-\; N_c\, |\phi'(x)|^2 \qquad (4) $$

The statistics that need to be stored are the same in either case, viz. the sum of the squared norms of all the points in a node, the sum of the points and their number. However, depending on which formula is used, we obtain either the distortion for the old set of centers or for the one-step updated centers. When we use the second formula, giving the distortion for the new cluster centers, the number of iterations required for convergence is reduced by one (based on the criterion that we use for convergence). This results in a marginal reduction of the time taken for convergence. The exact figures for the reduction in time taken can be seen in the experimental results, Section 5.1 (effect of the number of points, effect of dimensionality and effect of the number of centers) and Section 5.2 (variation of dimension, variation of the number of points).
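A minimal sketch of equations (2)-(4) from cached per-cluster statistics (the function names are ours; sum_sq, vec_sum and count are the sum of squared norms, the vector sum and the number of the points assigned to the cluster, and N is the total number of points):

import numpy as np

def distortion_from_stats(sum_sq, vec_sum, count, center, N):
    # Equation (2): a cluster's contribution to the distortion for the
    # current center, computed from cached statistics only.
    return (sum_sq - 2.0 * float(center @ vec_sum) + count * float(center @ center)) / N

def updated_center_and_distortion(sum_sq, vec_sum, count, N):
    # Equations (3) and (4): the one-step updated centroid and the cluster's
    # contribution to the distortion with respect to that updated centroid.
    new_center = vec_sum / count                                        # (3)
    contrib = (sum_sq - count * float(new_center @ new_center)) / N     # (4)
    return new_center, contrib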


The restricted filtering algorithm is highly efficient and faster than either the naïve k-means algorithm or the filtering algorithm for all types of problems except when the dimensionality is very large. In fact, the restricted filtering algorithm can be viewed as a combination of the naïve k-means and the filtering algorithm. When the threshold value is set very high, the algorithm reduces to the naïve k-means algorithm; when the threshold is set to zero, it reduces to the filtering algorithm. Thus the restricted filtering algorithm is at least as fast as either of these two algorithms, although the best threshold value for direct computation of the distortion may differ: for lower dimensions the threshold value is expected to be lower, and for higher dimensions it is expected to be higher.


4.3 Distortion computation for the set of candidate center locations

Distortion computation for the set of candidate centers is the key step in the fast greedy k-means algorithm. The distortion with each of the candidate centers inserted into the current set of global cluster centers needs to be computed. Thus, every time a cluster is inserted, the distortion for each of the candidate centroids is computed. As the number of candidate centroids is large, this step can become a bottleneck in the program; the number of such distortion computations can easily exceed the total number of distortion computations done by the restricted filtering algorithm. An efficient algorithm for this step is therefore very important for the fast greedy algorithm.
The computation of this distortion is very similar to the restricted filtering algorithm. We use the same kd-tree that was defined in the previous section for the restricted filtering algorithm. The outline of the algorithm that computes the distortion with one centroid added is as follows.
CalDistortion (Node U, Point p) {
    Do any initializations to U that are necessary
    If the threshold criterion for direct computation is met {
        Directly compute distances
        Return
    }
    If the centroid closest to the center of U dominates p {
        Distortion = same as computed in the last k-means iteration
        Return
    }
    If p is closer to the center of U than the closest centroid, and the closest centroid
    is not contained in the hyper-rectangle {
        If p dominates all the unfiltered centroids considered for node U by the
        last k-means iteration {
            // all points in the region belong to p
            Compute distortion    // all points assigned to the point p
            Return
        }
    }
    Do the necessary initializations to the child nodes
    Distortion = CalDistortion (U.left, p) + CalDistortion (U.right, p)
    Return
}


Figure 8: This figure shows 3 possible insertion positions. C1 and C2 are the centroids that are not filtered. (i) P1 is closer than the closest centroid and also dominates the unfiltered list of centroids; all points are assigned to P1. (ii) P2 is dominated by the closest centroid; the distortion is the same as last computed in the k-means iteration. (iii) P3 is not dominated by the closest centroid; calDistortion is called recursively for points of this type and for those not classified as cases (i) and (ii).
As mentioned previously, it is important that the above be implemented efficiently. We therefore save the statistics that have already been computed in the k-means iteration: the closest centroid in that iteration, the list of unfiltered centroids and the previously computed distortion. In the implementation we also use a version of calDistortion that does all the above computations for all the possible insertion points simultaneously, so that the vectorization strengths of the language can be exploited and common computations, such as the distances from all the points to the centers when the threshold is satisfied, can be shared.


The condition for direct computation is slightly different from that in the restricted k-means algorithm. In calDistortion we not only have the old k-means centers from which distances need to be computed, but also the insertion positions, which are often more numerous than the centers. As a result, the threshold condition is also modified: we not only check that the product of the number of points and the number of centers satisfies the criterion of the restricted k-means algorithm, but also introduce a new threshold that bounds the product of the number of insertion points and the number of points. We call these thresholds threshold1 and threshold2 respectively.
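A minimal sketch of this modified test, with illustrative names (nPoints, nCenters and nInsert are the assumed per-node counts of points, unfiltered centers and candidate insertion points):

% Sketch: combined test for direct computation in calDistortion
if (nPoints * nCenters <= threshold1) & (nPoints * nInsert <= threshold2)
    % compute all the required distances for this node directly
end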
It should be noted that the task accomplished in this step is very similar to the distortion computation in the restricted filtering algorithm, except for three things: (1) instead of keeping track of the mean and the sum of squared norms of the points assigned to each cluster, we maintain the distortion of each cluster from the previous iteration; (2) we already know which cluster each of the points belongs to; (3) the list of centroids that were unfiltered in the last k-means iteration and the final distortion that was obtained are already cached in the node. As in the restricted filtering algorithm, there exist two ways of computing the distortion. However, in calDistortion we compute the distortion for the original cluster centers only, and not for the updated list of centers. It would have been highly advantageous to compute the distortion for the updated list of centroids, as then we would be choosing the insertion point based on the distortion after the iteration, and so the choice would be a better approximation of the actual insertion center. However, doing so in MATLAB is not efficient for multiple sets of clusters. As a result we always compute the distortion based on the initial set of centers and not the updated centers. Further discussion on this is given in the chapter on Implementation, section 4.3.
In this chapter we have given a detailed description of the fast greedy k-means algorithm and its variant that does advanced distortion computation. All the new ideas, viz. the restricted filtering algorithm, advanced distortion computation and the distortion computation with a single added center, have been given due emphasis. The next chapter describes some of the issues that arise when implementing the fast greedy k-means algorithm.


5 Methodology

In this chapter we discuss the important issues that we had to handle in our implementation of the various algorithms discussed so far. We explain the mathematical formulae used, including the computation of the new centroids and the computation of the distortion for both the current and the updated list of centers. We also discuss how we computed each of the terms in these formulae in the actual implementation. Finally, we give the basic fields of a single node of the kd-tree and explain how one can check whether one of the centers dominates all the other centers. This information should also make it possible to replicate the experimental results elsewhere.

6 Thresholding

6.1 Basic Global Thresholding

Since only the histogram of the image is needed to segment it, segmenting images with the threshold technique does not involve the spatial information of the image. As a consequence, problems may be caused by noise, blurred edges or outliers in the image. This is why we say this method is the simplest concept for segmenting images.

When the intensity distributions of object and background pixels are sufficiently distinct, it is possible to use a single (global) threshold applicable over the entire image. The following iterative algorithm can be used for this purpose:
1. Select an initial estimate for the global threshold, T.
2. Segment the image using T as

    g(x, y) = 1 if f(x, y) > T,   g(x, y) = 0 if f(x, y) \le T

This produces two groups of pixels: G_1, consisting of all pixels with intensity values > T, and G_2, consisting of pixels with values \le T.
3. Compute the average (mean) intensity values m_1 and m_2 for the pixels in G_1 and G_2.
4. Compute a new threshold value:

    T = \frac{1}{2}(m_1 + m_2)

5. Repeat steps 2 through 4 until the difference between values of T in successive iterations is smaller than a predefined parameter.
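The following MATLAB fragment is a rough sketch of this iterative procedure under the stated assumptions; f is a grayscale image stored as a double matrix and all variable names are illustrative.

% Sketch: iterative global thresholding
T    = mean(f(:));                 % step 1: initial estimate of the threshold
dT   = 0.5;                        % stopping tolerance (a predefined parameter)
done = 0;
while ~done
    G1   = f(f >  T);              % step 2: pixels above the threshold
    G2   = f(f <= T);              %         pixels at or below the threshold
    m1   = mean(G1);               % step 3: mean intensity of each group
    m2   = mean(G2);
    Tnew = 0.5 * (m1 + m2);        % step 4: new threshold
    done = abs(Tnew - T) < dT;     % step 5: stop when T barely changes
    T    = Tnew;
end
g = f > T;                         % segmented (binary) image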


6.2 Optimum Global Thresholding Using Otsu's Method

Thresholding may be viewed as a statistical decision theory problem whose objective is to minimize the average error incurred in assigning pixels to two or more groups.
Let {0, 1, 2, ..., L-1} denote the L distinct intensity levels in a digital image of size M x N pixels, and let n_i denote the number of pixels with intensity i. The total number of pixels in the image is MN = n_0 + n_1 + n_2 + ... + n_{L-1}. The normalized histogram has components p_i = n_i / MN, from which it follows that

    \sum_{i=0}^{L-1} p_i = 1,   p_i \ge 0

Now we select a threshold T(k) = k, 0 < k < L-1, and use it to threshold the input image into two classes, C_1 and C_2, where C_1 consists of the pixels with intensity in the range [0, k] and C_2 of those in [k+1, L-1].
Using this threshold, the probability P_1(k) that a pixel is assigned to C_1 is given by the cumulative sum

    P_1(k) = \sum_{i=0}^{k} p_i

and similarly

    P_2(k) = \sum_{i=k+1}^{L-1} p_i = 1 - P_1(k)

Here m_1 and m_2 denote the mean intensities of C_1 and C_2, m_G the global mean intensity and m(k) the cumulative mean up to level k. The validity of the following two equations can be verified by direct substitution of the preceding results:

    P_1 m_1 + P_2 m_2 = m_G

    P_1 + P_2 = 1

In order to evaluate the goodness of the threshold at level k we use the normalized, dimensionless metric

    \eta(k) = \frac{\sigma_B^2(k)}{\sigma_G^2}

where \sigma_G^2 is the global variance,

    \sigma_G^2 = \sum_{i=0}^{L-1} (i - m_G)^2 p_i

and \sigma_B^2 is the between-class variance, defined as

    \sigma_B^2 = P_1 (m_1 - m_G)^2 + P_2 (m_2 - m_G)^2 = P_1 P_2 (m_1 - m_2)^2 = \frac{(m_G P_1(k) - m(k))^2}{P_1(k)(1 - P_1(k))}


This indicates that the between-class variance is a measure of the separability between the classes. The optimum threshold is then the value k* that maximizes \sigma_B^2(k):

    \sigma_B^2(k^*) = \max_{0 \le k \le L-1} \sigma_B^2(k)

In other words, to find k* we simply evaluate \sigma_B^2(k) for all integer values of k. Once k* has been obtained, the input image f(x, y) is segmented as before:

    g(x, y) = 1 if f(x, y) > k*,   g(x, y) = 0 if f(x, y) \le k*

for x = 0, 1, 2, ..., M-1 and y = 0, 1, 2, ..., N-1. The separability measure has values in the range

    0 \le \eta(k^*) \le 1
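As an illustration, the following MATLAB fragment is a rough sketch of Otsu's method for an 8-bit grayscale image f (a uint8 matrix); the histogram is built explicitly and all variable names are illustrative, not taken from any particular implementation.

% Sketch: Otsu's method on an 8-bit image f
counts = zeros(256, 1);
for i = 0:255
    counts(i+1) = sum(f(:) == i);            % histogram n_i
end
p   = counts / numel(f);                     % normalized histogram p_i
P1  = cumsum(p);                             % class probability P_1(k)
lev = (0:255)';
m   = cumsum(lev .* p);                      % cumulative mean m(k)
mG  = m(end);                                % global mean
sigmaB2 = (mG*P1 - m).^2 ./ (P1 .* (1 - P1) + eps);   % between-class variance
[maxval, idx] = max(sigmaB2);
kstar = idx - 1;                             % optimum threshold k*
g = f > kstar;                               % segmented image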

6.2.1 Using Image Smoothing/Edges to Improve Global Thresholding

Comparing the two kinds of preprocessing, smoothing and edge detection: smoothing is more suitable when the objects we are interested in are large, while edge detection is more suitable when the objects of interest are small.

6.3 Multiple Thresholds

For three classes consisting of three intensity intervals, the between-class variance is given by

    \sigma_B^2 = P_1 (m_1 - m_G)^2 + P_2 (m_2 - m_G)^2 + P_3 (m_3 - m_G)^2

The following relationships hold:

    P_1 m_1 + P_2 m_2 + P_3 m_3 = m_G

    P_1 + P_2 + P_3 = 1

The optimum threshold values k_1^* and k_2^* satisfy

    \sigma_B^2(k_1^*, k_2^*) = \max_{0 < k_1 < k_2 < L-1} \sigma_B^2(k_1, k_2)

Finally, we note that the separability measure defined in section 2.7.2 for one threshold extends directly to multiple thresholds:

    \eta(k_1^*, k_2^*) = \frac{\sigma_B^2(k_1^*, k_2^*)}{\sigma_G^2}
Advantages
1. Region growing methods can correctly separate regions that have the same properties we define.
2. Region growing methods can give good segmentation results for original images that have clear edges.
3. The concept is simple. We only need a small number of seed points to represent the property we want, and then grow the region.
4. We can choose multiple criteria at the same time.
5. The method performs well with respect to noise, which means the result has good shape matching.
Disadvantages
1. The computation is expensive, in both time and power.
2. The method may not distinguish the shading of real images.
In conclusion, the region growing method performs well, with good shape matching and connectivity. Its most serious problem is that it is time consuming.
The distance between a point x and a cluster C is defined as

    d(x, C) = the mean of the distances between x and every single point in cluster C.

Figure 4.9: The simple case of hierarchical division

6.4 Partitional clustering

Compared with hierarchical algorithms, partitional clustering cannot show the hierarchical characteristics of the database. However, it saves considerable computation time compared with the hierarchical algorithms. The most famous partitional clustering algorithm is the K-means method, whose algorithm we now examine.

Algorithm of K-means
1. Decide the number of clusters we want in the final classified result. We assume this number is N.
2. Randomly choose N data points (for an image, N pixels) from the whole database (for an image, the whole image) as the N centroids of N clusters.
3. For every single data point (for an image, every pixel), find the nearest centroid and classify the data point into the cluster in which that centroid is located. After step 3, all data points are classified into a specific cluster, and the total number of clusters is always N, as decided in step 1.
4. For every cluster, calculate its centroid from the data points in it. The calculation of the centroid can also be defined by the user: it can be the median of the database within the cluster, or it can be the true centre. Again, we obtain N centroids of N clusters, just as after step 2.
5. Repeat the iteration of steps 3 and 4 until there is no change in two successive iterations. (Steps 4 and 5 are the checking steps.)
We should mention that in step 3 we do not always have to decide the cluster assignment by the distance between the data points and the centroids. Distance is used because it is the simplest criterion; we can also use other criteria, depending on the characteristics of the database or the final classified result we want.
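The following MATLAB fragment is a rough sketch of the loop described in steps 1-5, using Euclidean distance; X is an n-by-d data matrix, N the desired number of clusters, and the variable names and iteration cap are illustrative.

% Sketch: basic K-means with random initialization
n = size(X, 1);
idx = randperm(n);
centers = X(idx(1:N), :);                    % step 2: random initial centroids
assign  = zeros(n, 1);
for iter = 1:100
    oldAssign = assign;
    for i = 1:n                              % step 3: nearest-centroid assignment
        d = sum((centers - repmat(X(i,:), N, 1)).^2, 2);
        [mind, assign(i)] = min(d);
    end
    for j = 1:N                              % step 4: recompute centroids
        if any(assign == j)
            centers(j,:) = mean(X(assign == j, :), 1);
        end
    end
    if all(assign == oldAssign)              % step 5: stop when nothing changes
        break;
    end
end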

Figure 4.10: (a) Original image. (b) Choice of 3 clusters and 3 initial points. (c) Classification of the other points using the minimum distance between each point and the centre of a cluster.

There are some disadvantages of the K-means algorithm.
1. The results are sensitive to the initial random centroids we choose; that is, different choices of initial centroids may lead to different results.
2. It cannot show the segmentation details in the way hierarchical clustering does.
3. The most serious problem is that the resulting clusters often have circular shapes, due to the distance-oriented nature of the algorithm.
4. It is important to clarify that 10 average values does not mean there are merely 10 regions in the segmentation result; see fig 4.4.

Figure 4.11: (a) The result of K-means. (b) The result that we want.

To solve the initialization problem

There is a way to overcome the initialization problem. We can choose just one initial point in step 2, use two points beside that initial point as centroids to split the data into two clusters, then use the four points beside those two points, and so on until the desired number of clusters is reached. The initial point is then effectively fixed, and we keep splitting the clusters until N clusters are obtained. With this kind of improvement, we avoid the problem caused by randomly chosen initial data points. Whatever names such schemes are given, their underlying concepts are all similar.

To determine the number of clusters

There has been much research on determining the number of clusters in K-means clustering. Siddheswar Ray and Rose H. Turi define a validity measure, the ratio of the intra-cluster distance to the inter-cluster distance, as a criterion. The validity measure tells us what the ideal value of K in the K-means algorithm is:

    validity = \frac{intra}{inter}

They define the two distances as

    intra = \frac{1}{N} \sum_{i=1}^{K} \sum_{x \in C_i} \|x - z_i\|^2

where N is the number of pixels in the image, K is the number of clusters and z_i is the cluster centre of cluster C_i. Since the goal of the k-means clustering algorithm is to minimize the sum of squared distances from all points to their cluster centers, we first define an intra-cluster distance that describes the distances of the points from their cluster centre (or centroid), and then minimize it.

    inter = \min(\|z_i - z_j\|^2),   i = 1, 2, 3, ..., K-1;   j = i+1, ..., K

On the other hand, the purpose of clustering a database (or an image) is to separate the clusters from one another. We therefore define the inter-cluster distance, which describes the difference between different clusters, and we want its value to be as large as possible. Obviously, if the distances between the cluster centroids are large, then we need only a few clusters to segment the data, but if the centroids are close to each other, then we need more clusters to classify the data clearly.
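A rough MATLAB sketch of this validity measure follows; X is n-by-d data, z is the K-by-d matrix of cluster centres, assign(i) gives the cluster index of point i, and all names are illustrative.

% Sketch: Ray and Turi validity measure
N = size(X, 1);
K = size(z, 1);
intra = 0;
for i = 1:K
    Xi    = X(assign == i, :);
    intra = intra + sum(sum((Xi - repmat(z(i,:), size(Xi,1), 1)).^2, 2));
end
intra = intra / N;                           % mean squared distance to own centre
inter = inf;
for i = 1:K-1
    for j = i+1:K
        inter = min(inter, sum((z(i,:) - z(j,:)).^2));   % closest pair of centres
    end
end
validity = intra / inter;                    % choose the K that minimizes this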


6.5 Variable Thresholding

Image partitioning

One of the simplest approaches to variable thresholding is to subdivide the image into nonoverlapping rectangles. This approach is used to compensate for non-uniformities in illumination and/or reflection.


6.6 Computing the new Centroids

The new centroid is the mean of all the data points that belong to that centroid. As we assign data points to a centroid in bulk, for each node in the tree we cache the sum of the points in its region. Whenever a node is assigned to a centroid, the corresponding sum is added to the running total of that centroid. The new location of the centroid is the total sum divided by the total number of points assigned to that centroid.
Thus the new centroids can be computed by storing/caching, for each node, the sum of its points and the number of its points. Whenever a node is assigned to a center, these statistics are added to that center's totals. When a direct computation is done because the threshold criterion is satisfied, we directly compute the distance of each of the points from all the centers in consideration.
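A rough MATLAB sketch of this bulk update follows; the names are ours. Suppose the traversal accumulates, for each centre j, centSum(j,:) (the sum of all points in nodes assigned to centre j) and centCount(j) (the number of those points), and that nodeSum and nodeCount are the cached statistics of a node assigned entirely to centre c:

% Sketch: bulk accumulation and centroid update
centSum(c,:) = centSum(c,:) + nodeSum;       % add the node's cached point sum
centCount(c) = centCount(c) + nodeCount;     % and its point count
% after the traversal, the new centre locations are the per-centre means
for j = 1:size(centSum, 1)
    if centCount(j) > 0
        centers(j,:) = centSum(j,:) / centCount(j);
    end
end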


6.7 Computing the distortion

As mentioned in chapter 3, there are two ways in which the distortion can be computed while maintaining the same statistics in a node. We restate the two formulae for reference:

    distortion = \sum_{x} |x|^2 - 2\,\mu(x)^T \sum_{x} x + N\,|\mu(x)|^2        (2)

    distortion = \sum_{x} |x|^2 - N\,|\mu'(x)|^2        (4)

The statistics that are required for the distortion computation are:
(i) the sum of squared norms of the points in the node;
(ii) the sum of the points in the node (also required for finding the new center location);
(iii) the number of points in the node (also required for finding the new center location).
The distortion with the updated centroids can be computed by maintaining the number of data points assigned to each center, the sum of those points and the sum of their squared norms.
Though it appears that equation (4) is always better to use, as it gives the distortion for the updated list of centroids, there is one drawback with this equation. The distortion can only be computed after the complete traversal of the tree is done, as the values of the updated centers are only known then. By contrast, equation (2) can be used to compute the distortion for each of the hyper-rectangles individually.

Two K-Means algorithms have been implemented. The first clusters the pixel
information from an input image based on the RGB color of each pixel, and
the second clusters based on pixel intensity. The algorithm begins with the
creation of initial partitions for the data. The clustering based on pixel color
will be considered first.

In order to try to improve the runtime and results of the algorithm, equally spaced partitions were chosen. Initially, eight partitions were used, representing red, green, blue, white, black, yellow, magenta and cyan, i.e. the corners of the color cube. With the initial partitions chosen, the algorithm can begin. Every pixel in the input image is compared against the initial partitions and the nearest partition is chosen and recorded. Then, the mean, in terms of RGB color, of all pixels within a given partition is determined. This mean is then used as the new value for the given partition. If a partition has no pixels associated with it, it remains unchanged. In some implementations a partition with no pixels associated with it would be removed; however, such partitions are simply ignored in this implementation. Once the new partition values have been determined, the algorithm returns to assigning each pixel to the nearest partition. The algorithm continues until pixels no longer change the partition they are associated with or, as is the case here, until none of the partition values changes by more than a set small amount. In order to gather more results, the initial partitions used were varied by adding and removing partitions.

In a nearly identical setup, a second K-means implementation was made using the grayscale intensity of each pixel, rather than the RGB color, to partition the image. The starting partitions were chosen as equally spaced values from 0 to 255. The number of partitions used was varied, as in the color segmenting setup, in order to increase the number of results available and to study the effect of this parameter.

The EM algorithm is very similar in setup to the K-means algorithm. Similarly, the first step is to choose the input partitions. For this case, the same initial partitions as used in the color segmentation with K-means were used, in order to make the comparison of results more meaningful. Here, RGB color was again chosen as the comparison parameter. The EM cycle begins with an Expectation step, which is defined by the following equation:

    E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{k} p(x = x_i \mid \mu = \mu_n)}
              = \frac{\exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right)}{\sum_{n=1}^{k} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu_n)^2\right)}

This equation states that the expectation, or weight, of pixel x_i with respect to partition j equals the probability that x is pixel x_i given that \mu is partition \mu_j, divided by the sum over all k partitions of the same probability. This leads to the lower expression for the weights. The \sigma^2 seen in the second expression represents the covariance of the pixel data. Once the E step has been performed and every pixel has a weight or expectation for each partition, the M step, or maximization step, begins. This step is defined by the following equation:

    \mu_j = \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}

This equation states that the partition value \mu_j is changed to the weighted average of the pixel values, where the weights are the weights from the E step for this particular partition. This EM cycle is repeated for each new set of partitions until, as in the K-means algorithm, the partition values no longer change by a significant amount.
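The MATLAB fragment below is a rough sketch of this EM cycle for scalar (grayscale) intensities, assuming a fixed shared variance sigma2, equal priors, and the weighted-average form of the M step; x is an m-by-1 vector of pixel values, mu a k-by-1 vector of initial partition values, and all names are illustrative.

% Sketch: EM cycle for grayscale pixel intensities
for iter = 1:50
    % E step: weight of every pixel with respect to every partition
    W = zeros(length(x), length(mu));
    for j = 1:length(mu)
        W(:,j) = exp(-(x - mu(j)).^2 / (2*sigma2));
    end
    W = W ./ repmat(sum(W, 2) + eps, 1, length(mu));   % normalize over partitions
    % M step: each partition becomes the weighted average of the pixels
    muOld = mu;
    for j = 1:length(mu)
        mu(j) = sum(W(:,j) .* x) / (sum(W(:,j)) + eps);
    end
    if max(abs(mu - muOld)) < 1e-3                     % stop when values barely change
        break;
    end
end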


6.8 Computing distortion for added center

Here we are referring to the calDistortion step, viz. step 3 of the fast greedy k-means algorithm. Efficiency is most important in this step: the number of distortion computations we have to do here is much larger than the number of distortion computations done during the convergence of k-means. Using the strength of MATLAB, we do the computation in matrices for all the possible insertion points simultaneously.
Using equation (4) in this step would actually be better than using equation (2), as we would then be finding the distortion after one run of the k-means iteration, which is a better approximation than computing the distortion without any iterations. However, to evaluate equation (4) we would first have to explicitly compute the sum of squared norms and the number of points in the region of each cluster, and, for each of the candidate centers, the sum of the points (which is itself a vector) for each of the clusters would need to be maintained. When we have a single partitioning this is simple, but maintaining these sums for each of the possible insertion points makes the matrix three-dimensional and difficult to implement. In contrast, the distortion in each region can be computed easily using equation (2), and only the distortion values need to be summed. As a result we use equation (2) for the distortion computation in the calDistortion step.
To use equation (2) for the computation of the distortion, we have to cache the sum of the points assigned to a centroid in a hyper-rectangle and the number of points that are closest to the corresponding centroids in a hyper-rectangle. As the list of centroids to be considered during the calculation cannot be known during the recursion of calDistortion, the list of unfiltered centroids last considered (by k-means) is also stored/cached in the nodes.
The implementation of calDistortion in MATLAB accepts a set of points instead of a single point and computes the distortion with each of the points added. The return value of calDistortion is a vector representing the distortion (in its region) with the respective points added to the list of centroids. The algorithm is depicted in the following:
CalDistortion (Node U, Points P) {
    Do any initializations to U that are necessary
    If the threshold1 and threshold2 criteria for direct computation are met {
        Directly compute distances
        Return
    }
    For the points in P that are dominated by the centroid closest to the center of U {
        Distortion = same as computed in the last k-means iteration
    }
    If the closest centroid is not contained in U {
        For the points in P that dominate all the unfiltered centroids of the last
        k-means iteration for node U {
            // all points in the region belong to these candidate centers in P
            Compute distortion    // all points assigned to the new center in P
        }
    }
    Do the necessary initializations to the child nodes
    Distortion = CalDistortion (U.left, PnotComputed) + CalDistortion (U.right, PnotComputed)
    Return
}
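The following MATLAB fragment is a rough sketch of the vectorized leaf case assumed in the pseudocode: the distortion contribution of one node is computed for every candidate insertion point at once. X is the p-by-d matrix of points in the node, P is the c-by-d matrix of candidate centres, and dOld is assumed to hold the squared distance of each point to its current nearest centre, cached from the last k-means iteration; all names are illustrative.

% Sketch: distortion of one node for all candidate centres simultaneously
XX = sum(X.^2, 2);                                   % p-by-1 squared norms of points
PP = sum(P.^2, 2)';                                  % 1-by-c squared norms of candidates
D  = repmat(XX, 1, size(P,1)) - 2*X*P' + repmat(PP, size(X,1), 1);   % p-by-c squared distances
% a point moves to a candidate only if the candidate is closer than its current centre
distortionPerCandidate = sum(min(repmat(dOld, 1, size(P,1)), D), 1); % 1-by-c result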


6.9 Maintaining the Tree

As the kd-tree is a binary tree, and keeping in mind the various statistics needed for k-means and for the distortion computation with an added center, the following fields are necessary in a node:
Hmax          // the upper bound of the hyper-rectangle
Hmin          // the lower bound of the hyper-rectangle
              // the list of points in the hyper-rectangle, necessary so that they can be assigned to the children
Leftchild
Rightchild
Sum           // sum of all the points in the hyper-rectangle
Normsq        // sum of the norm squared of all the points in the hyper-rectangle
C             // indices of centroids not filtered
Csum          // sum of points assigned to the various centroids
              // number of points assigned to the various centroids

Also, the node needs to be initialized only once to compute the mean and other
fields. After that it would only need to be traversed.
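As an illustration, such a node could be represented as a MATLAB struct; the sketch below is ours, and the field names (including those for the fields whose names are not shown above) are purely illustrative.

% Sketch: one kd-tree node built from its point matrix X (n-by-d)
node.hmax       = max(X, [], 1);     % upper bound of the hyper-rectangle
node.hmin       = min(X, [], 1);     % lower bound of the hyper-rectangle
node.points     = X;                 % points in the hyper-rectangle
node.sum        = sum(X, 1);         % sum of all the points
node.normsq     = sum(sum(X.^2));    % sum of the squared norms of all the points
node.leftchild  = [];                % filled in when the node is split
node.rightchild = [];
node.c          = [];                % indices of centroids not filtered
node.csum       = [];                % sum of points assigned to each centroid
node.cnum       = [];                % number of points assigned to each centroid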


7 Experimental results

We performed a series of carefully prepared tests on synthetic data generated using MATLAB, on synthetic data generated using a well-known algorithm as in [15], and on some well-known data sets. The experiments compare the effect of the various parameters of the algorithm and of its variant with advanced distortion computation suggested in this thesis. The experiments also compare the algorithm to the greedy global k-means algorithm of [1].
In all the tests we compare the distortion and the running time of the algorithms. The tests are designed to measure the effect of the various parameters of the algorithm: the number of points in the dataset, the dimensionality of the data, the number of centroids and the threshold values for direct computation. The distortion is considered when studying the effect of the number of leaves of the kd-tree on the quality of the solution.
All tests were conducted on a Celeron 600 MHz machine with 128 MB RAM, using MATLAB version 5.2 on Windows 98.


7.1 Dataset generated using MATLAB

We generated a set of datasets using MATLAB for testing the effect of the various parameters and of the problem size on the time taken by the algorithm. The datasets are created by choosing the required number of cluster centers randomly in the region [0, 100] and then generating the required number of points normally distributed around these centers.

7.1.1 Variation of number of insertion points

In this category of tests, the effect of the number of insertion points on the solution and on the computation time was evaluated. Having a smaller number of insertion points might affect the quality of the solution, as it might not be globally optimal, but having many insertion positions makes the calDistortion step take much longer. 50000 points were generated from 50 cluster centers, with 1000 points normally distributed around each center. The results are displayed in the figure below.


Figure 12: Variation in number of insertion points (50 centers)

As expected, the quality of the solution improves with an increasing number of insertion points, but only up to some value (150 in the figure); after that the distortion stays more or less the same. The time taken for the computation increases with the number of insertion points. Based on the results of these experiments, we set the number of insertion points in our other experiments equal to the number of centers.

7.1.2 Variation of threshold1

Threshold1 is the threshold for direct computation of the distances in the restricted filtering algorithm. A small threshold1 value makes the restricted filtering algorithm closer to the filtering algorithm of [4], while a large threshold makes the algorithm similar to the naïve k-means algorithm. We conducted the experiments on two different dimensionalities, to also see how the threshold varies with the dimension, as we know from previous results in [4] that the filtering algorithm does not perform very well for high dimensions. The number of insertion points was set to the number of centers. The results are shown below.

Figure 13: Threshold1 vs. Time taken. 2 D data

Figure 14: Threshold1 vs. time taken. 6 D data


The results show that there exists an optimum value of threshold1 for which the algorithm takes the least time. The optimal value of threshold1 probably depends on the language of implementation, as an implementation of the naïve k-means algorithm may be highly efficient compared to an implementation of the filtering algorithm. For example, in MATLAB the computation of the distances from all the points to all the centers can be implemented very efficiently in a vectorized fashion, while the filtering algorithm requires several separate computations.

7.1.3 Variation of threshold2

Experiments were also done to see the effect of varying threshold2, which bounds the product of the number of candidate centers in consideration and the number of points. The effect on the time taken for the algorithm to execute is considered. The data had 50000 points and was generated by creating an equal number of normally distributed points around each of 50 centers. The threshold1 value was set at 10000 in these experiments. The results are displayed in the following graphs.

Figure 15: Threshold2 vs. Time taken. 2 D data, threshold1 fixed as 10000


Figure 16: Threshold2 vs. Time taken. 6 D data, threshold1 fixed as 10000
The results show that, as with threshold1, there also exists an optimal value for threshold2. However, the variation is not very large, and is generally not more than 10 percent of the minimum time taken, provided threshold2 is not set much lower than threshold1.

7.1.4 Effect of number of points

In this category of tests, runs were performed to see how the various algorithms scale with an increasing number of points. The three algorithms tested were the greedy global algorithm of [1], the fast greedy algorithm discussed in this thesis, and the fast greedy algorithm with advanced distortion computation in the restricted filtering algorithm. The fast greedy algorithm with advanced distortion computation is expected to perform marginally better than the algorithm with direct computation, as it requires one iteration of the restricted filtering algorithm less during the convergence of k-means. The results are shown in the following graph.


Figure 17: Variation in number of points. 2 D

As expected, the time taken by the greedy global algorithm of [1] increases linearly with the number of points. The algorithm of [1] also requires a lot of memory and could not be run for more than 50000 points. The cost of the fast greedy algorithm does not increase as fast, and its rate of increase also appears to fall as the number of points grows. The fast greedy algorithm also fares better than the greedy global algorithm. Although one might expect the distortion of the fast greedy algorithm with advanced distortion computation to be slightly worse than that of the fast greedy algorithm, since the number of iterations in each step is one less, this was not the case: the distortion in all three cases was almost identical.

7.1.5 Effect of dimensionality

In this category of tests, the effect of the dimensionality of the data on the time taken by the algorithm is evaluated. The tests involve 50000 points and 50 clusters. The average time taken over 2 runs is plotted in the graph below.


Figure 18

Curiously, for normal data the variation in dimension does not seem to have much effect on the time taken by the fast greedy algorithms. Also, the time taken by the algorithm of [1] initially falls with increasing dimensionality. These results were consistent across successive runs on data generated in a similar fashion.
This might be because we are testing data of 50000 records and the algorithm of [1] requires a lot of memory, so it might be causing many page faults. By contrast, the fast greedy algorithm is highly memory efficient and works very well even for 16-dimensional data.

7.1.6 Effect of number of centers

The effect of the number of centers on the time complexity of the algorithm is examined in these tests. As before we generate normally distributed data. The total number of points is set at 50000 and the data is 2-dimensional. Tests were not conducted on the algorithm of [1], as it required too much memory and could not run even for 100 centers with 50000 points.


Figure 19

As expected, the number of centers affects the complexity of the algorithm more or less quadratically. This is also expected because, according to a result in [4], the number of nodes expanded in a k-means iteration of the filtering algorithm is not more than O(nk/\Delta^2). Thus, for each k-means iteration, the time complexity is expected to be of that order. As the number of clusters is increased one by one, the time complexity of the fast greedy algorithm is expected to be

    \sum_{k=1}^{k_{max}} O\!\left(\frac{nk}{\Delta^2}\right) = O\!\left(\frac{n\,k_{max}(k_{max}+1)}{2\Delta^2}\right) = O\!\left(\frac{n\,k_{max}^2}{\Delta^2}\right)

Thus, with an increasing number of centers to converge, the restricted k-means algorithm also takes more time. As a result, the fast greedy algorithm cannot be used for a very large number of cluster centers. However, our restricted k-means algorithm can still be used.


7.2 Data generated using Dasgupta [15]

In this set of experiments the datasets are generated using the algorithm in [15]. The algorithm generates a random Gaussian mixture given parameters such as eccentricity, separation degree, dimensionality and number of points. Tests are done to compare the greedy global k-means algorithm with the algorithms proposed in this thesis.

7.2.1 Variation in dimensionality

The effect of the dimensionality of the data on the time taken is evaluated. 50000 points around 30 centers are generated from a Gaussian mixture of eccentricity 2 and varying degrees of separation, along the lines of [15]. The results obtained are shown in the following diagrams.


Figure 20: Time Vs Dimension, 50000 points, 30 centers, eccentricity 2


Unlike the case of the uniformly generated normal data with 50000 points and 50 centers (the previous category of tests), the increase in dimensionality hardly affects the time complexity of the algorithm of [1], but drastically increases the time complexity of the fast greedy algorithm. The algorithm of [1] performs better than the fast greedy algorithm for higher dimensions. This is also expected, as kd-trees do not perform well for high-dimensional data.

7.2.2 Variation in number of points

The effect of the number of points is tested in this category. Random points are generated around 30 centers with eccentricity 2 and separation degree 2 in two dimensions. The number of points ranged from 20000 to 160000. The results obtained are illustrated in the diagram below.

Figure 21: Time vs. number of points, 30 centers, 2 D, eccentricity 2, separation degree 2

The results in this case are similar to those obtained with the synthetic data generated using MATLAB alone.

7.2.3 Variation in cluster separation

In this category of tests, we varied the separation of the clusters being generated to see its effect on the various algorithms. 50000 points were generated around 30 centers with eccentricity 1 and varying cluster separation. The results obtained are illustrated in the figure below.

Figure 22: Variation in separation, 30 centers, 50000 points, eccentricity 2

The decrease in time taken as the separation increases is expected for two reasons:
- With increasing separation of the clusters, fewer iterations are needed for k-means to converge.
- The cost of the fast greedy approaches is also likely to decrease because of the complexity result proved in [4], viz. that the maximum number of nodes expanded in a k-means iteration of the filtering algorithm is O(nk/\Delta^2), where \Delta is the separation degree of the clusters. As we have a restricted k-means approach, the number of nodes expanded is expected to be much smaller.
The reduction in the time taken by the algorithm of [1] is due only to the first reason, while the fast greedy approach benefits from both. In fact, the two effects appear to be of equal size: while the algorithm of [1] reduced its running time by a factor of just two, the running time of the fast greedy approach was reduced by a factor of four. The curves also appear to be proportional to (1/\Delta^2 + constant).


7.3 Dataset from the UCI ML Repository

Most publicly available real-world datasets have very few records. We evaluated our method on the letter-recognition dataset from the UCI (Irvine) repository. This dataset consists of 20000 data elements in 16 dimensions, derived from a large number of black-and-white rectangular pixel displays of the 26 capital letters of the English alphabet. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli (for a more detailed description of the data, see the Irvine ML Data repository at http://www.ics.uci.edu/~mlearn/MLRepository.html). This is the largest number of records of any of the datasets in the repository.
As it was known that the algorithm does not scale well for high-dimensional data, experiments were also conducted using only the first few dimensions of the data. We also experimented with varying values of the threshold. The results are shown in the following table.

Time taken (seconds) for the fast greedy algorithm with advanced distortion computation

Threshold1 value     All 16 dimensions   First 8 dimensions   First 4 dimensions   First 2 dimensions
5000                 304.9               224.6                102.4                41.0
15000                235.3               186.4                105.4                48.6
25000                227.2               180.9                111.7                57.1
35000                227.8               177.9                117.6                59.6
50000                225.6               174.4                123.9                64.4
75000                228.1               172.5                132.1                72.0
100000               227.6               171.9                136.2                76.5
150000               226.1               173.9                135.8                80.4
200000               228.2               172.3                139.4                83.3
Greedy global [1]    138.3               136.9                108.6                62.9

The results obtained are along expected lines, keeping in mind that the fast greedy algorithm is not efficient for large dimensions and for a small number of points. The results also show that with increasing dimension, the optimal value of threshold1 increases. For just 2 dimensions the best results are obtained with a threshold1 as low as 5000, while for higher dimensions (8 and 16) the results are better for a larger value of threshold1. If the number of points is small, 20000 for example, and the number of dimensions is also small, then whether the greedy global algorithm of [1] or the fast greedy algorithm is faster depends on the kind of data. Here, the fast greedy algorithm performs substantially better for 2 dimensions, whereas for data generated as described in [15] the fast greedy algorithm was not that much faster for so few points, even in the 2-dimensional case (refer to figure 8). However, when the number of points is large and the dimensionality low, it is always better to use the fast greedy algorithm.


8 Conclusion

The K-means algorithm worked as expected. It was fairly simple to implement, and the results for image segmentation are impressive. As can be seen from the results, the number of partitions used in the segmentation has a very large effect on the output. By using more partitions in the RGB setup, more possible colors are available to show up in the output. The same is true for the setup based on grayscale intensity: by adding more partitions, a greater number of intensities is available for the output image. The algorithm also runs quickly enough that real-time image segmentation could be done with the K-means algorithm.
The experimental results on both the synthetic and the real data suggest that the algorithm performs very well for low-dimensional data, but not so well for data exceeding 6 dimensions. Also, the algorithm suggested in [1] is faster for a small number of points. This is because the greedy global algorithm of [1] caches the distances from each of the possible insertion points to each of the data points. The amount of data cached is so large that the algorithm cannot run even for medium-sized problems beyond about 50000 points and 50 centers. By contrast, the fast greedy algorithm is memory efficient and scales very well with a large number of points. If the algorithm is to be used only for a small number of points, the distances from the insertion points to the data points can be cached once they are computed by a node, as in [1]. When this is done, the fast greedy algorithm with caching is likely to run faster than the algorithm of [1] if the thresholds are set appropriately.
However, all the algorithms discussed, including the one in [1], fare very badly with respect to the number of centers: the cost grows quadratically with the number of centers. As a result, the algorithm cannot be used for more than a few hundred centers when implemented in MATLAB. An implementation in C is likely to be up to 50 to 100 times faster, making the algorithm feasible even for numbers of centers of the order of a few thousand.
Some of the ideas introduced in this thesis can be used even where it is not practical to use the fast greedy algorithm because of the large number of centers involved. The restricted filtering algorithm introduced in chapter 3 scales very well for all numbers of centers and points. This method can be used when one is dealing with millions of records in low dimensions. The algorithm is also likely to fare well for higher dimensions if the threshold values are increased.


9 Further Research

Currently we use the centers of the partitions formed by a kd-tree, as in [1], as the insertion positions for the new centers. These centers are used because they give a good preliminary clustering of the data. Other ways could be employed to obtain a good preliminary clustering; for example, the centers suggested for initialization in [9] could be used as the insertion positions. The purpose of the insertion points is only to provide an appropriate set of insertion locations so that one of the locations always satisfies the appropriate-location assumption (refer to Chapter 3) used by the algorithm. The initialization points suggested in [9] would probably give a good preliminary clustering of the data to start with, instead of the kd-tree as in [1].
Another extension of this work would be to derive a formula for choosing appropriate threshold values, one that works for all dimensions and all numbers of points and centers. Here we have only been able to observe experimentally that an optimum threshold value exists, without arriving at its exact value. It is also likely that the thresholds depend on the type of data. More research is needed to derive such a formula.
The current algorithm performs badly as the dimensionality increases. This is because the most important data structure used in the algorithm, the kd-tree, does not scale well with increasing dimension. An alternative would be to use other data structures instead of a kd-tree, such as the AD-trees of [5], that scale better with increasing dimensionality.


10 References

[1] Aristidis Likas, Nikos Vlassis and Jacob J. Verbeek: The global k-means clustering
algorithm. In Pattern Recognition Vol 36, No 2, 2003.

[2] Dan Pelleg and Andrew Moore: Accelerating exact k-means algorithms with geometric reasoning. (Technical Report CMU-CS-00105), Carnegie Mellon University, Pittsburgh, PA.

[3] Dan Pelleg and Andrew Moore: X-means: Extending k-means with efficient
estimation of the number of clusters. In Proceedings of the Seventeenth
International Conference on Machine Learning, Palo Alto, CA, July 2000.

[4] T. Kanungo, D.M. Mount, N.S. Netanyahu, C. Piatko, R. Silverman and A.Y. Wu: An
Efficient k-means clustering algorithm: analysis and implementation. In IEEE
Transactions On Pattern Analysis And Machine Intelligence Vol. 24, No. 7, July 2002,
pp. 881-892

[5] Dan Pelleg and Andrew Moore: Cached sufficient statistics for efficient machine
learning with large datasets. In Journal of Artificial Intelligence Research, 8:67-91,
1998.

[6] Songrit Maneewongvatana and David M. Mount: Analysis of approximate nearest neighbor searching with clustered point sets. In Workshop on Algorithm Engineering and Experiments (ALENEX99), January 1999.

[7] Andrew W. Moore: An introductory tutorial on kd-trees. Extract from Andrew Moore's PhD thesis: Efficient Memory-based Learning for Robot Control, Technical Report No. 209, Computer Laboratory, University of Cambridge, 1991.

Page 71 of 72

[8] Kan Deng and Andrew Moore: Multiresolution instance-based learning, In The
Proceedings of IJCAI-95, pages 1233-1242. Morgan Kaufmann, 1995

[9] P.S. Bradley and Usama M. Fayyad: Refining initial points for k-means clustering.
In Proceedings Fifteenth International Conference on Machine Learning, pages 91-99,
San Francisco, CA, 1998, Morgan Kaufmann.

[10] Khaled Alsabti, Sanjay Ranka and Vineet Singh: An Efficient k-means clustering
algorithm. In Proceedings of the First Workshop on High Performance Data Mining,
Orlando, FL, March 1998.

[11] J. Matoušek: On the approximate geometric k-clustering. Discrete and Computational Geometry, 24:61-84, 2000.

[12] J.L. Bentley, Multidimensional Divide and Conquer. Communications of the ACM.
23(4):214-229, 1980

[13] J.H. Friedman, J.L. Bentley and R.A. Finkel. An Algorithm for Finding Best
Matches in Logarithmic Expected Time. ACM Trans on Mathematical Software,
3(3):209-226, September 1977.

[14] S.M. Omohundro, Efficient Algorithms with Neural Network Behaviour. Journal of
Complex Systems, 1(2):273-347, 1987.

[15] Sanjoy Dasgupta: Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (New York, NY, 1999), IEEE Comput. Soc. Press, Los Alamitos, CA, 1999, pp. 634-644.
[16] R. C. Gonzalez, R. E. Woods: Digital Image Processing, third edition. Prentice Hall, 2010.
[17] L. G. Roberts: Machine Perception of Three-Dimensional Solids. In Optical and Electro-Optical Information Processing, Tippet, J.T. (ed.), MIT Press, Cambridge, Mass., 1965.
