Abstract—This paper presents an algorithm to automatically determine the number of clusters in a given input data set, under a mixture of Gaussians assumption. Our algorithm extends the Expectation-Maximization clustering approach by starting with a single cluster assumption for the data, and recursively splitting one of the clusters in order to find a tighter fit. An Information Criterion parameter is used to make a selection between the current and previous model after each split. We build this approach upon prior work done on both the K-Means and Expectation-Maximization algorithms. We extend our algorithm using a cluster splitting approach based on Principal Direction Divisive Partitioning, which improves accuracy and efficiency.

Keywords-clustering, expectation-maximization, mixture of Gaussians, principal direction divisive partitioning

I. INTRODUCTION

The clustering problem is defined as follows: given a set of n-dimensional input vectors {x1, . . . , xm}, we want to group them into an appropriate number of clusters such that points in the same cluster are positionally coherent. Such algorithms are useful for image compression, document clustering, bio-informatics, data mining, astrophysics and many other fields. A common approach to the clustering problem is to assume a Gaussian Mixture Model. In this model, the input data is assumed to have been generated by selecting any one of k Gaussian distributions and drawing the input vector from the chosen distribution. Each cluster is thus represented by a single distribution. The Expectation-Maximization algorithm [4] is a well known method to estimate the set of parameters of such a mixture corresponding to maximum likelihood; however, it requires prior knowledge of the number of clusters in the data (k).

Determining this value is a fundamental problem in data clustering, and has been attempted using Information Theoretic [6], Silhouette based [8], [9] and Goodness-of-fit methods [5], [7]. The X-Means algorithm [11] is an Information Criterion based approach to this problem developed for use with the K-Means algorithm. X-Means works by alternately applying two operations: the K-Means algorithm (Improve-params) to optimally detect the clusters for a chosen value of k, and cluster splitting (Improve-structure) to optimize the value of k according to an Information Criterion.

One of the major problems with X-Means is that it assumes an identical spherical Gaussian for the data. Because of this, it tends to over-fit data in elliptical clusters [6], or in an input set with clusters of varying size. The G-Means and PG-Means algorithms try to solve this problem by projecting the data onto one dimension and running a statistical goodness-of-fit test. This approach leads to better performance for non-spherical distributions; however, projections may not work optimally for all data sets. A projection can collapse the data from many clusters together, neglecting the difference in density. This requires multiple projections for accuracy [5].

Our algorithm employs a divisive hierarchical approach using an Information Criterion, similar to X-Means; however, it differs by considering each cluster to be generated by a general multivariate Gaussian distribution. This allows each distribution to take a non-spherical shape, and permits accurate computation of the likelihood of the model. We use Expectation-Maximization instead of K-Means for greater accuracy in the detection of the parameters. Further, we also use an optimal method for cluster splitting based on Principal Direction Divisive Partitioning (PDDP) [3]. The PDDP algorithm uses Principal Component Analysis to identify the direction of maximum variance of the data. By using this information during cluster splitting, we can attain much higher accuracy and performance, as our results show.

II. CONCEPTS AND DEFINITIONS

A. Mixture of Gaussians and Expectation-Maximization

In the Gaussian Mixture model, the input data set I = {x1, . . . , xm}, where xi ∈ Rn, is assumed to be sampled from a set of distributions L = {A1, . . . , Ak} such that the probability density is given by

p(x_i) = \sum_{j=1}^{k} \Phi_j \, G(x_i \mid A_j)
where Aj denotes a Gaussian distribution characterized by mean μj and covariance matrix Σj, Φj denotes the normalized weight of the jth distribution, and k is the number of distributions.

The likelihood of the data set is given by

L = \prod_{i=1}^{m} p(x_i)

With w_ij denoting the posterior probability that sample xi was generated by distribution Aj, the Expectation-Maximization algorithm re-estimates the parameters at each iteration as

\Phi_j = \frac{1}{m} \sum_{i=1}^{m} w_{ij}, \qquad \mu_j = \frac{\sum_{i=1}^{m} w_{ij}\, x_i}{\sum_{i=1}^{m} w_{ij}}

An Information Criterion parameter is used for selection among models with different numbers of parameters. It seeks to balance the increase in likelihood due to additional parameters by introducing a penalty term for each parameter. Two commonly used Information Criteria are the Schwarz Criterion, or Bayesian Information Criterion (BIC) [13], and Akaike's Information Criterion (AIC) [1].

BIC is defined as

\mathrm{BIC} = \log L - \frac{f}{2} \log m

where the number of free parameters f for a mixture of k Gaussians over d-dimensional data is

f = (k-1) + kd + k\,\frac{d(d-1)}{2}

Other Information Criterion measures, like the Integrated Completed Likelihood [2], may also be used.
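To make the preceding definitions concrete, the sketch below runs one EM iteration for a mixture of full-covariance Gaussians and scores the fitted model with the criterion above. It is a minimal illustration under our own naming (em_step, num_free_params, bic_score), not the authors' implementation; the free-parameter count mirrors the formula given in the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, sigma):
    """One E/M pass. X: (m, d) data; phi: (k,) weights;
    mu: (k, d) means; sigma: (k, d, d) covariances."""
    m, d = X.shape
    k = len(phi)
    # E-step: responsibilities w[i, j] = P(cluster j | x_i)
    w = np.empty((m, k))
    for j in range(k):
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
    p_xi = w.sum(axis=1)                  # mixture density p(x_i) at each point
    w /= p_xi[:, None]
    # M-step: re-estimate parameters from the responsibilities
    nj = w.sum(axis=0)                    # effective cluster sizes
    phi = nj / m                          # Phi_j = (1/m) sum_i w_ij
    mu = (w.T @ X) / nj[:, None]          # mu_j = sum_i w_ij x_i / sum_i w_ij
    for j in range(k):
        diff = X - mu[j]
        sigma[j] = (w[:, j, None] * diff).T @ diff / nj[j]
    log_lik = np.log(p_xi).sum()          # log of L = prod_i p(x_i)
    return phi, mu, sigma, log_lik

def num_free_params(k, d):
    # f = (k - 1) + k*d + k*d*(d - 1)/2, as counted in the text
    return (k - 1) + k * d + k * d * (d - 1) // 2

def bic_score(log_lik, k, d, m):
    # Schwarz-style criterion: log-likelihood minus a per-parameter penalty
    return log_lik - 0.5 * num_free_params(k, d) * np.log(m)
```

In a split-and-select loop, bic_score would be evaluated for the models before and after a candidate split, keeping the model with the higher score.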
III. ALGORITHM
A. Description of the algorithm
Consider the two-dimensional data set pictured in Figure 3. The red spots indicate each sample. First, we compute the mean point O of the sample points. The first principal component is a line passing through O such that the lengths of the projections of the sample points on the line have maximum variance. Consider the two lines ABCD and PQRS given in the figure. The variance for the two cases is OA² + OB² + OC² + OD² and OP² + OQ² + OR² + OS², respectively. It is clearly apparent that the first case results in a larger variance.

Figure 3. Principal Component Analysis

In order to automatically select the direction of this axis, we use the following notation:

μx and μy are the coordinates of the mean point
xi and yi are the coordinates of the ith sample
θ is the angle the line makes with the horizontal axis
Li is the length of the projection of the ith point
m is the total number of points
V is the variance
σxy is the covariance between variables x and y

Maximizing the variance of the projections yields

\theta = \frac{1}{2} \tan^{-1}\left(\frac{2\sigma_{xy}}{\sigma_{xx} - \sigma_{yy}}\right)

This angle, along with the mean computed earlier, gives us the First Principal Component of the distribution. In the case of higher-dimensional data sets, this is equivalent to the principal eigenvector of the covariance matrix of the input data (normalized to zero mean). Determining the eigenvectors for a rectangular matrix requires computation of its Singular Value Decomposition (SVD). The paper on Principal Direction Divisive Partitioning [3] describes an efficient technique for this using fast construction of the partial SVD.

In two dimensions, we can represent each Gaussian distribution by an elliptical contour of the points which have equal probability density. This is achieved by setting the exponential term in the density function to a constant C (the actual value of C affects the size, but not the orientation, of the ellipse, and as such does not make a difference in the calculations):

(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j) = C

where

\Sigma_j = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy} \end{bmatrix}, \qquad
\Sigma_j^{-1} = \frac{1}{\sigma_{xx}\sigma_{yy} - \sigma_{xy}\sigma_{yx}} \begin{bmatrix} \sigma_{yy} & -\sigma_{yx} \\ -\sigma_{xy} & \sigma_{xx} \end{bmatrix}, \qquad
x_i - \mu_j = \begin{bmatrix} x \\ y \end{bmatrix}
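The two-dimensional construction above is easy to verify numerically. The following sketch (our own code, with illustrative names) computes θ from the covariance entries and checks it against the principal eigenvector of the covariance matrix:

```python
import numpy as np

def principal_angle(X):
    """Angle of the first principal component with the horizontal axis,
    via theta = 0.5 * arctan(2*sigma_xy / (sigma_xx - sigma_yy)).
    arctan2 is used so the degenerate case sigma_xx == sigma_yy is handled."""
    Xc = X - X.mean(axis=0)               # normalize to zero mean
    s = Xc.T @ Xc / len(X)                # 2x2 covariance matrix
    return 0.5 * np.arctan2(2.0 * s[0, 1], s[0, 0] - s[1, 1])

def principal_eigenvector(X):
    """Equivalent result via eigen-decomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    s = Xc.T @ Xc / len(X)
    vals, vecs = np.linalg.eigh(s)        # eigenvalues in ascending order
    return vecs[:, -1]                    # eigenvector of the largest eigenvalue

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=500)
theta = principal_angle(X)
v = principal_eigenvector(X)
print(np.array([np.cos(theta), np.sin(theta)]), v)  # same direction, up to sign
```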
Solving the equation, we get:

\sigma_{yy}\, x^2 - (\sigma_{yx} + \sigma_{xy})\, xy + \sigma_{xx}\, y^2 = \sigma_{xx}\sigma_{yy} - \sigma_{xy}\sigma_{yx}

Figure 5. Performance of the algorithm: number of iterations, with and without SPLIT, against the average number of clusters.
The angle made by the major axis of the ellipse with the x axis is given by

\theta = \frac{1}{2} \tan^{-1}\left(\frac{\sigma_{yx} + \sigma_{xy}}{\sigma_{xx} - \sigma_{yy}}\right)

As σxy = σyx, this is equivalent to the earlier result achieved using PCA. This shows that the major axis of the elliptical representation of the Gaussian distribution is equivalent to the First Principal Component.

Figure 4. Splitting along the major axis

Figure 6. Accuracy of the algorithm: % RMS error, with and without SPLIT, against the average number of clusters.

Figure 4 illustrates the process of splitting a cluster in an elliptical distribution. With D being half the length of the major axis, the centers of the new distributions can be computed as follows:

\mu_{new1,2} = \mu_j \pm D \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}
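A sketch of this split step in two dimensions follows. It is our own illustration; in particular, taking D as the standard deviation along the major axis (the square root of the largest eigenvalue of Σj) is an assumption on our part, since the text defines D relative to the elliptical contour.

```python
import numpy as np

def split_cluster(mu, sigma):
    """mu: (2,) mean; sigma: (2, 2) covariance of the cluster to split.
    Returns the two new means, placed at mu +/- D*(cos theta, sin theta)."""
    # angle of the major axis: theta = 0.5 * arctan((s_yx + s_xy)/(s_xx - s_yy))
    theta = 0.5 * np.arctan2(sigma[1, 0] + sigma[0, 1],
                             sigma[0, 0] - sigma[1, 1])
    # assumed half-length D: standard deviation along the major axis
    vals = np.linalg.eigvalsh(sigma)      # eigenvalues in ascending order
    D = np.sqrt(vals[-1])
    offset = D * np.array([np.cos(theta), np.sin(theta)])
    return mu + offset, mu - offset

mu1, mu2 = split_cluster(np.array([1.0, 2.0]),
                         np.array([[4.0, 1.5], [1.5, 1.0]]))
print(mu1, mu2)   # two new centers along the major axis of the ellipse
```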
V. ACKNOWLEDGEMENT