
Document Clustering in Correlation

Similarity Measure Space


Taiping Zhang, Member, IEEE, Yuan Yan Tang, Fellow, IEEE,
Bin Fang, Senior Member, IEEE, and Yong Xiang
Abstract—This paper presents a new spectral clustering method called correlation preserving indexing (CPI), which is performed in
the correlation similarity measure space. In this framework, the documents are projected into a low-dimensional semantic space in
which the correlations between the documents in the local patches are maximized while the correlations between the documents
outside these patches are minimized simultaneously. Since the intrinsic geometrical structure of the document space is often
embedded in the similarities between the documents, correlation as a similarity measure is more suitable for detecting the intrinsic
geometrical structure of the document space than euclidean distance. Consequently, the proposed CPI method can effectively
discover the intrinsic structures embedded in high-dimensional document space. The effectiveness of the new method is demonstrated
by extensive experiments conducted on various data sets and by comparison with existing document clustering methods.
Index Terms—Document clustering, correlation measure, correlation latent semantic indexing, dimensionality reduction.

1 INTRODUCTION
Document clustering aims to automatically group related documents into clusters. It is one of the most important
tasks in machine learning and artificial intelligence and has
received much attention in recent years [1], [2], [3]. Based on
various distance measures, a number of methods have been
proposed to handle document clustering [4], [5], [6], [7], [8],
[9], [10]. A typical and widely used distance measure is the
euclidean distance. The k-means method [4] is one of the
methods that use the euclidean distance, which minimizes
the sum of the squared euclidean distance between the data
points and their corresponding cluster centers. Since the
document space is always of high dimensionality, it is
preferable to find a low-dimensional representation of the
documents to reduce computation complexity.
Low computation cost is achieved in spectral clustering
methods, in which the documents are first projected into a
low-dimensional semantic space and then a traditional
clustering algorithm is applied to finding document
clusters. Latent semantic indexing (LSI) [7] is one of the
effective spectral clustering methods, aimed at finding the
best subspace approximation to the original document
space by minimizing the global reconstruction error
(euclidean distance).
However, because of the high dimensionality of the
document space, a certain representation of documents usually resides on a nonlinear manifold embedded in the
similarities between the data points [11]. Unfortunately, the
euclidean distance is a dissimilarity measure which de-
scribes the dissimilarities rather than similarities between
the documents. Thus, it is not able to effectively capture the
nonlinear manifold structure embedded in the similarities
between them [12]. An effective document clustering
method must be able to find a low-dimensional representa-
tion of the documents that can best preserve the similarities
between the data points. Locality preserving indexing (LPI)
method is a different spectral clustering method based on
graph partitioning theory [8]. The LPI method applies a
weighted function to each pairwise distance attempting to
focus on capturing the similarity structure, rather than the
dissimilarity structure, of the documents. However, it does
not overcome the essential limitation of euclidean distance.
Furthermore, the selection of the weighted functions is often
a difficult task.
In recent years, some studies [13], [14], [15] suggest that
correlation as a similarity measure can capture the intrinsic
structure embedded in high-dimensional data, especially
when the input data is sparse. In probability theory and
statistics, correlation indicates the strength and direction of a
linear relationship between two random variables which
reveals the nature of data represented by the classical
geometric concept of an angle. It is a scale-invariant
association measure usually used to calculate the similarity
between two vectors. In many cases, correlation can
effectively represent the distributional structure of the input
data which conventional euclidean distance cannot explain.
The usage of correlation as a similarity measure can be
found in the canonical correlation analysis (CCA) method
[16]. The CCA method is to find projections for paired data
sets such that the correlations between their low-dimen-
sional representatives in the projected spaces are mutually
maximized. Specifically, given a paired data set consisting
of matrices A fr
1
. r
2
. . . . . r
i
g and Y fy
1
. y
2
. . . . . y
i
g, we
would like to find directions $w_x$ for $X$ and $w_y$ for $Y$ that maximize the correlation between the projections of $X$ on $w_x$ and the projections of $Y$ on $w_y$. This can be expressed as
$$\max_{w_x, w_y} \frac{\langle X w_x, \, Y w_y \rangle}{\| X w_x \| \, \| Y w_y \|}, \qquad (1)$$
where $\langle \cdot, \cdot \rangle$ and $\| \cdot \|$ denote the operators of inner product
and norm, respectively. As a powerful statistical technique,
the CCA method has been applied in the field of pattern
recognition and machine learning [15], [16]. Rather than
finding a projection of one set of data, CCA finds
projections for two sets of corresponding data A and Y
into a single latent space that projects the corresponding
points in the two data sets to be as nearby as possible. In the
application of document clustering, while the document
matrix A is available, the cluster label Y is not. So the
CCA method cannot be directly used for clustering.
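For illustration, the correlation objective in (1) can be evaluated directly for a given pair of projection directions. The following minimal NumPy sketch (the function and variable names are ours, chosen only for illustration) computes the quantity that CCA seeks to maximize:

```python
import numpy as np

def cca_objective(X, Y, w_x, w_y):
    """Correlation between the projections X @ w_x and Y @ w_y, as in (1)."""
    a = X @ w_x   # projections of the rows of X onto w_x
    b = Y @ w_y   # projections of the rows of Y onto w_y
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy paired data: 100 samples with 5 and 3 features, respectively.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = rng.standard_normal((100, 3))
print(cca_objective(X, Y, rng.standard_normal(5), rng.standard_normal(3)))
```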
In this paper, we propose a new document clustering
method based on correlation preserving indexing (CPI),
which explicitly considers the manifold structure embedded
in the similarities between the documents. It aims to find an
optimal semantic subspace by simultaneously maximizing
the correlations between the documents in the local patches
and minimizing the correlations between the documents
outside these patches. This is different from LSI and LPI,
which are based on a dissimilarity measure (euclidean
distance), and are focused on detecting the intrinsic
structure between widely separated documents rather than
on detecting the intrinsic structure between nearby docu-
ments. The similarity-measure-based CPI method focuses on
detecting the intrinsic structure between nearby documents
rather than on detecting the intrinsic structure between
widely separated documents. Since the intrinsic semantic
structure of the document space is often embedded in the
similarities between the documents [11], CPI can effectively
detect the intrinsic semantic structure of the high-dimen-
sional document space. At this point, it is similar to Latent
Dirichlet Allocation (LDA) [17] which attempts to capture
significant intradocument statistical structure (intrinsic
semantic structure embedded in the similarities between
the documents) via the mixture distribution model.
The rest of the paper is organized as follows: the
proposed document clustering method is presented in
Section 2. In Section 3, experimental results are provided
to illustrate the performance of the CPI method. Finally,
conclusions are given in Section 4.
2 DOCUMENT CLUSTERING BASED ON
CORRELATION PRESERVING INDEXING
In high-dimensional document space, the semantic struc-
ture is usually implicit. It is desirable to find a low-
dimensional semantic subspace in which the semantic
structure can become clear. Hence, discovering the intrinsic
structure of the document space is often a primary concern
of document clustering. Since the manifold structure is
often embedded in the similarities between the documents,
correlation as a similarity measure is suitable for capturing
the manifold structure embedded in the high-dimensional
document space. Mathematically, the correlation between
two vectors (column vectors) $u$ and $v$ is defined as
$$\mathrm{Corr}(u, v) = \frac{u^T v}{\sqrt{u^T u} \, \sqrt{v^T v}} = \left\langle \frac{u}{\|u\|}, \, \frac{v}{\|v\|} \right\rangle. \qquad (2)$$
Note that the correlation corresponds to an angle $\theta$ such that $\cos \theta = \mathrm{Corr}(u, v)$. The larger the value of $\mathrm{Corr}(u, v)$, the stronger the association between the two vectors $u$ and $v$.
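As a quick numerical illustration of (2), the correlation and the corresponding angle can be computed as follows (a minimal NumPy sketch; the names are ours):

```python
import numpy as np

def corr(u, v):
    """Correlation between two column vectors u and v, as defined in (2)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 1.0, 1.0])
c = corr(u, v)
theta = np.arccos(c)   # the angle whose cosine equals Corr(u, v)
print(c, theta)
```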
Online document clustering aims to group documents
into clusters, which belongs to unsupervised learning. How-
ever, it can be transformed into semi-supervised learning by
using the following side information:
A1. If two documents are close to each other in the
original document space, then they tend to be
grouped into the same cluster [8].
A2. If two documents are far away from each other in the
original document space, they tend to be grouped
into different clusters.
Based on these assumptions, we can propose a spectral clustering method in the correlation similarity measure space through nearest neighbors graph learning.
2.1 Correlation-Based Clustering Criterion
Suppose $y_i \in Y$ is the low-dimensional representation of the $i$th document $x_i \in X$ in the semantic subspace, where $i = 1, 2, \ldots, n$. Then the above assumptions (A1) and (A2) can be expressed as
$$\max \sum_i \sum_{x_j \in N(x_i)} \mathrm{Corr}(y_i, y_j) \qquad (3)$$
and
$$\min \sum_i \sum_{x_j \notin N(x_i)} \mathrm{Corr}(y_i, y_j), \qquad (4)$$
respectively, where $N(x_i)$ denotes the set of nearest neighbors of $x_i$. The optimization of (3) and (4) is equivalent to the following metric learning:
$$d(x, y) = c \cdot \cos(x, y),$$
where $d(x, y)$ denotes the similarity between the documents $x$ and $y$, and $c$ corresponds to whether $x$ and $y$ are the nearest neighbors of each other.
The maximization problem (3) is an attempt to ensure that if $x_i$ and $x_j$ are close, then $y_i$ and $y_j$ are close as well. Similarly, the minimization problem (4) is an attempt to ensure that if $x_i$ and $x_j$ are far away, $y_i$ and $y_j$ are also far away. Since the following equality is always true,
$$\sum_i \sum_{y_j \in N(y_i)} \mathrm{Corr}(y_i, y_j) + \sum_i \sum_{y_j \notin N(y_i)} \mathrm{Corr}(y_i, y_j) = \sum_i \sum_j \mathrm{Corr}(y_i, y_j), \qquad (5)$$
the simultaneous optimization of (3) and (4) can be achieved by maximizing the following objective function:
$$\frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{Corr}(y_i, y_j)}{\sum_i \sum_j \mathrm{Corr}(y_i, y_j)}. \qquad (6)$$
Without loss of generality, we denote the mapping between the original document space and the low-dimensional semantic subspace by $W$, that is, $W^T x_i = y_i$. Following some algebraic manipulations, we have
$$\begin{aligned}
\frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{Corr}(y_i, y_j)}{\sum_i \sum_j \mathrm{Corr}(y_i, y_j)}
&= \frac{\sum_i \sum_{x_j \in N(x_i)} y_i^T y_j \big/ \sqrt{(y_i^T y_i)(y_j^T y_j)}}{\sum_i \sum_j y_i^T y_j \big/ \sqrt{(y_i^T y_i)(y_j^T y_j)}} \\
&= \frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{tr}(y_i y_j^T) \big/ \sqrt{\mathrm{tr}(y_i y_i^T)\,\mathrm{tr}(y_j y_j^T)}}{\sum_i \sum_j \mathrm{tr}(y_i y_j^T) \big/ \sqrt{\mathrm{tr}(y_i y_i^T)\,\mathrm{tr}(y_j y_j^T)}} \\
&= \frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{tr}(W^T x_i x_j^T W) \big/ \sqrt{\mathrm{tr}(W^T x_i x_i^T W)\,\mathrm{tr}(W^T x_j x_j^T W)}}{\sum_i \sum_j \mathrm{tr}(W^T x_i x_j^T W) \big/ \sqrt{\mathrm{tr}(W^T x_i x_i^T W)\,\mathrm{tr}(W^T x_j x_j^T W)}},
\end{aligned} \qquad (7)$$
where $\mathrm{tr}(\cdot)$ is the trace operator. Based on optimization
theory, the maximization of (7) can be written as
$$\begin{aligned}
\arg\max_{W} \; & \frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{tr}\big(W^T x_i x_j^T W\big)}{\sum_i \sum_j \mathrm{tr}\big(W^T x_i x_j^T W\big)} \\
= \arg\max_{W} \; & \frac{\mathrm{tr}\Big(W^T \big[\sum_i \sum_{x_j \in N(x_i)} x_i x_j^T\big] W\Big)}{\mathrm{tr}\Big(W^T \big[\sum_i \sum_j x_i x_j^T\big] W\Big)},
\end{aligned} \qquad (8)$$
with the constraints
$$\mathrm{tr}\big(W^T x_i x_i^T W\big) = 1 \quad \text{for all } i = 1, 2, \ldots, n. \qquad (9)$$
Consider a mapping $W \in \mathbb{R}^{m \times d}$, where $m$ and $d$ are the dimensions of the original document space and the semantic subspace, respectively. We need to solve the following constrained optimization:
$$\arg\max \frac{\sum_{i=1}^{d} u_i^T M_S u_i}{\sum_{i=1}^{d} u_i^T M_T u_i} \qquad (10)$$
subject to
$$\sum_{i=1}^{d} u_i^T x_j x_j^T u_i = 1, \quad j = 1, 2, \ldots, n. \qquad (11)$$
Here, the matrices $M_T$ and $M_S$ (see footnote 1) are defined as
$$M_T = \sum_i \sum_j x_i x_j^T, \qquad M_S = \sum_i \sum_{x_j \in N(x_i)} x_i x_j^T.$$
It is easy to validate that the matrix $M_T$ is semipositive definite. Since the documents are projected in the low-dimensional semantic subspace in which the correlations between the document points among the nearest neighbors are preserved, we call this criterion correlation preserving indexing.
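To make the construction concrete, the sketch below shows one way the matrices $M_S$ and $M_T$ might be assembled from a term-document matrix (a minimal NumPy sketch under our own assumptions: neighbors are chosen here by cosine similarity, the neighbor relation is symmetrized as in footnote 1, and all names are ours):

```python
import numpy as np

def neighborhood_matrices(X, k):
    """Build M_S (neighbor patches) and M_T (all pairs) from X (m x n, one document per column)."""
    n = X.shape[1]
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)   # unit-length documents
    sim = Xn.T @ Xn                                     # pairwise correlation/cosine similarity
    np.fill_diagonal(sim, -np.inf)                      # exclude each document itself
    nbrs = np.argsort(-sim, axis=1)[:, :k]              # k most similar documents per x_i

    A = np.zeros((n, n))                                # adjacency of the neighbor graph
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    A = np.maximum(A, A.T)                              # symmetrize (footnote 1)

    M_T = X @ np.ones((n, n)) @ X.T                     # sum_i sum_j x_i x_j^T
    M_S = X @ A @ X.T                                   # sum over neighbor pairs only
    return M_S, M_T
```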
Physically, this model may be interpreted as follows: all documents are projected onto the unit hypersphere (a circle in the 2D case). The global angles between the points within the local neighborhoods, $\theta_i$, are minimized and the global angles between the points outside the local neighborhoods, $\varphi_j$, are maximized simultaneously, as illustrated in Fig. 1. On the unit hypersphere, a global angle can be measured by the spherical arc, that is, the geodesic distance. The geodesic distance between $s$ and $s'$ on the unit hypersphere can be expressed as
$$d_G(s, s') = \arccos(s^T s') = \arccos\big(\mathrm{Corr}(s, s')\big). \qquad (12)$$
Since a strong correlation between $s$ and $s'$ means a small geodesic distance between $s$ and $s'$, CPI is equivalent to simultaneously minimizing the geodesic distances between the points in the local patches and maximizing the geodesic distances between the points outside these patches. The geodesic distance is superior to the traditional euclidean distance in capturing the latent manifold [18]. Based on this conclusion, CPI can effectively capture the intrinsic structures embedded in the high-dimensional document space.
It is worth noting that semi-supervised learning using
the nearest neighbors graph approach in the euclidean
distance space was originally proposed in [19] and [20], and LPI is also based on this idea. In contrast, CPI is a semi-supervised learning method using the nearest neighbors graph approach in the correlation measure space. Zhong
and Ghosh showed that euclidean distance is not appro-
priate for clustering high dimensional normalized data such
as text and a better metric for text clustering is the cosine
similarity [12]. In [21], Lebanon proposed a distance metric
for text documents, which was defined as
$$d_{F_\lambda}(x, y) = \arccos\!\left( \sum_{i=1}^{n+1} \lambda_i \frac{\sqrt{x_i\, y_i}}{\sqrt{\langle x, \lambda\rangle \langle y, \lambda\rangle}} \right).$$
If we use the notations $\bar{x} = (\sqrt{x_1}, \sqrt{x_2}, \ldots)$ and $\bar{y} = (\sqrt{y_1}, \sqrt{y_2}, \ldots)$, and set $\lambda_1 = \lambda_2 = \cdots = \lambda_n$, then the distance metric $d_{F_\lambda}(x, y)$ reduces to
$$d_{F_\lambda}(x, y) = \arccos\!\left( \sum_{i=1}^{n+1} \frac{\bar{x}_i\, \bar{y}_i}{\sqrt{\langle \bar{x}, \bar{x}\rangle \langle \bar{y}, \bar{y}\rangle}} \right) = \arccos\big(\mathrm{Corr}(\bar{x}, \bar{y})\big).$$
This distance is very similar to the distance defined by (12).
Since the distance $d_{F_\lambda}(x, y)$ is local (thus it captures local variations within the space) and is defined on the entire embedding space [21], correlation might be a suitable distance measure for capturing the intrinsic structure embedded in the document space. That is why the proposed CPI method is expected to outperform the LPI method. Note that the distance $d_{F_\lambda}(x, y)$ can be obtained based on the training data and it can be used for classification rather than clustering.
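The reduction above is easy to verify numerically: with equal weights $\lambda_i$, the distance equals the arccosine of the correlation of the square-rooted vectors. A quick check (our own sketch, not code from [21]):

```python
import numpy as np

def corr(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def lebanon_like_distance(x, y, lam):
    """arccos( sum_i lam_i * sqrt(x_i * y_i) / sqrt(<x, lam> <y, lam>) )."""
    return float(np.arccos(np.sum(lam * np.sqrt(x * y)) / np.sqrt((x @ lam) * (y @ lam))))

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.1, 0.6, 0.3])
lam = np.ones_like(x)                                # equal weights
d1 = lebanon_like_distance(x, y, lam)
d2 = float(np.arccos(corr(np.sqrt(x), np.sqrt(y))))  # arccos(Corr(x_bar, y_bar))
print(d1, d2)                                        # the two values coincide
```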
Fig. 1. 2D projections of CPI.
1. When computing the matrix $M_S$, if $x_j$ is among the nearest neighbors of $x_i$, then we consider that $x_i$ is also among the nearest neighbors of $x_j$. This is to ensure that $M_S$ is a symmetric matrix.
2.2 Algorithm Derivation
The optimization problem (10) with the constraints (11) can be solved by maximizing the objective function $\sum_{i=1}^{d} u_i^T M_S u_i$ under the constraints
$$\sum_{i=1}^{d} u_i^T M_T u_i = 1 \qquad (13)$$
and
$$\sum_{i=1}^{d} u_i^T x_j x_j^T u_i = 1, \quad j = 1, 2, \ldots, n. \qquad (14)$$
To do this, we introduce an additional Lagrangian function with multipliers $\lambda_i$ [22] as follows:
$$\begin{aligned}
J_1(W, \Lambda) = {} & \sum_{i=1}^{d} \frac{1}{\lambda}\, u_i^T M_S u_i - \lambda_0 \left( \sum_{i=1}^{d} u_i^T M_T u_i - 1 \right) \\
& - \lambda_1 \left( \sum_{i=1}^{d} u_i^T x_1 x_1^T u_i - 1 \right) - \cdots - \lambda_n \left( \sum_{i=1}^{d} u_i^T x_n x_n^T u_i - 1 \right),
\end{aligned} \qquad (15)$$
where $W = (u_1, \ldots, u_d)$ and $\Lambda = (\lambda, \lambda_0, \lambda_1, \ldots, \lambda_n)$. The additional Lagrange multiplier $1/\lambda$ ($\lambda \neq 0$) is introduced as a multiplicative factor for $\sum_{i=1}^{d} u_i^T M_S u_i$, which does not affect the solution (see [22] for detail). The additional Lagrangian function $J_1(W, \Lambda)$ will be maximized by setting the partial derivatives of $J_1(W, \Lambda)$ with respect to $u_i$ to zero, i.e., $\partial J_1(W, \Lambda) / \partial u_i = 0$. This yields
$$\frac{1}{\lambda} M_S u_i - \lambda_0 M_T u_i - \lambda_1 x_1 x_1^T u_i - \cdots - \lambda_n x_n x_n^T u_i = 0, \qquad (16)$$
or equivalently
$$M_S u_i - \lambda \big( \lambda_0 M_T + \lambda_1 x_1 x_1^T + \cdots + \lambda_n x_n x_n^T \big) u_i = 0. \qquad (17)$$
Equation (17) means that the $u_i$, $i = 1, 2, \ldots, d$, are the set of generalized eigenvectors of the matrix $M_S$ and the matrix
$$M = \lambda \big( \lambda_0 M_T + \lambda_1 x_1 x_1^T + \cdots + \lambda_n x_n x_n^T \big), \qquad (18)$$
corresponding to the d largest generalized eigenvalues.
In order to find the optimal solution, we first need to fix the values of the Lagrange multipliers $\lambda_i$. In (15), we suppose that the function
$$F(W, \Lambda) = \sum_{i=1}^{d} \frac{1}{\lambda}\, u_i^T M_S u_i$$
obtains a relative extremum $F^* = \sum_{i=1}^{d} (u_i^*)^T M_S u_i^*$ together with the optimal values $\lambda_i^*$ and $u_i^*$ under the constraints
$$\sum_{i=1}^{d} (u_i^*)^T M_T u_i^* = b_0 \quad \text{and} \quad \sum_{i=1}^{d} (u_i^*)^T x_j x_j^T u_i^* = b_j.$$
Based on the interpretation of the Lagrange multipliers [23], the values of $F^*$, $u_i^*$, and $\lambda_i^*$ depend on the values of $b_i$ on the right-hand sides of the constraint equations. Suppose that $\lambda_i^*$ and $u_i^*$ are continuously differentiable functions of $b_i$ in some $\varepsilon$-neighborhood of $b_i$. Then $F^*$ is also continuously differentiable with respect to $b_i$. The partial derivatives of $F^*$ with respect to $b_i$ are equal to the corresponding Lagrange multipliers $\lambda_i^*$, i.e.,
$$\lambda_i^* = \frac{\partial F^*}{\partial b_i}, \quad i = 0, 1, 2, \ldots, n.$$
Let $\hat{W} = W^* (W^*)^T$. It follows that
$$F^* = \sum_{i=1}^{d} (u_i^*)^T M_S u_i^* = \mathrm{tr}\big(M_S W^* (W^*)^T\big) = \mathrm{tr}(M_S \hat{W}),$$
$$b_0 = \sum_{i=1}^{d} (u_i^*)^T M_T u_i^* = \mathrm{tr}\big(M_T W^* (W^*)^T\big) = \mathrm{tr}(M_T \hat{W}),$$
$$b_j = \sum_{i=1}^{d} (u_i^*)^T x_j x_j^T u_i^* = \mathrm{tr}\big(x_j x_j^T W^* (W^*)^T\big) = \mathrm{tr}(x_j x_j^T \hat{W}),$$
for all $j = 1, 2, \ldots, n$.
Then, the values of $\lambda_i^*$ can be computed by
$$\lambda_0^* = \frac{\partial F^*}{\partial b_0} = \frac{\mathrm{tr}(M_S)}{\mathrm{tr}(M_T)}, \qquad (19)$$
$$\lambda_j^* = \frac{\partial F^*}{\partial b_j} = \frac{\mathrm{tr}(M_S)}{\mathrm{tr}(x_j x_j^T)}, \quad j = 1, 2, \ldots, n. \qquad (20)$$
Note that in document clustering, the matrix $M$ in (18) is
often singular as the dimension of the documents is
generally larger than the number of documents. To
circumvent the requirement of ` being nonsingular, we
may first project the document vectors into the SVD
subspace by throwing away the zero singular values.
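Putting (17)-(20) together, the CPI projection can be obtained from a generalized eigensolver. The sketch below (our own reading of the derivation, with the matrices assumed to be formed from the SVD-projected document vectors; the small ridge term is our addition for numerical stability and is not part of the paper) illustrates the computation:

```python
import numpy as np
from scipy.linalg import eigh

def cpi_projection(X_tilde, M_S, M_T, d, ridge=1e-8):
    """Return the d-dimensional CPI projection W_CPI for documents X_tilde (one per column)."""
    n = X_tilde.shape[1]
    # Lagrange multipliers from (19) and (20).
    lam0 = np.trace(M_S) / np.trace(M_T)
    lam = np.array([np.trace(M_S) / (X_tilde[:, j] @ X_tilde[:, j]) for j in range(n)])
    # Matrix M of (18): lam0 * M_T + sum_j lam_j * x_j x_j^T.
    M = lam0 * M_T + (X_tilde * lam) @ X_tilde.T
    # Symmetrize and regularize before the generalized eigenproblem M_S w = mu * M w.
    A = (M_S + M_S.T) / 2
    B = (M + M.T) / 2 + ridge * np.eye(M.shape[0])
    vals, vecs = eigh(A, B)
    return vecs[:, np.argsort(vals)[::-1][:d]]   # eigenvectors of the d largest eigenvalues
```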
2.3 Clustering Algorithm Based on CPI
Given a set of documents $x_1, x_2, \ldots, x_n \in \mathbb{R}^m$, let $X$ denote the document matrix. The algorithm for document clustering based on CPI can be summarized as follows:
1. Construct the local neighbor patch, and compute the matrices $M_S$ and $M_T$.
2. Project the document vectors into the SVD subspace by throwing away the zero singular values. The singular value decomposition of $X$ can be written as $X = U \Sigma V^T$. Here all zero singular values in $\Sigma$ have been removed. Accordingly, the vectors in $U$ and $V$ that correspond to these zero singular values have been removed as well. Thus the document vectors in the SVD subspace can be obtained by $\tilde{X} = U^T X$.
3. Compute the CPI projection. Based on the multipliers $\lambda_0^*, \lambda_1^*, \ldots, \lambda_n^*$ obtained from (19) and (20), one can compute the matrix $M = \lambda_0^* M_T + \lambda_1^* x_1 x_1^T + \cdots + \lambda_n^* x_n x_n^T$. Let $W_{CPI}$ be the solution of the generalized eigenvalue problem $M_S W = \Lambda M W$. Then, the low-dimensional representation of the documents can be computed by $Y = W_{CPI}^T \tilde{X} = W^T X$, where $W = U W_{CPI}$ is the transformation matrix.
4. Cluster the documents in the CPI semantic subspace. Since the documents were projected onto the unit hypersphere, the inner product is a natural measure of similarity. We seek a partitioning $\{\pi_j\}_{j=1}^{k}$ of the documents by maximizing the following objective function [24]:
$$Q\big(\{\pi_j\}_{j=1}^{k}\big) = \sum_{j=1}^{k} \sum_{x \in \pi_j} x^T c_j,$$
with $c_j = m_j / \|m_j\|$, where $m_j$ is the mean of the document vectors contained in the cluster $\pi_j$.
2.4 Complexity Analysis
The time complexity of the CPI clustering algorithm can be analyzed as follows. Consider $n$ documents in the $d$-dimensional space ($d \gg n$). In Step 1, we first need to compute the pairwise distances, which needs $O(n^2 d)$ operations. Second, we need to find the $k$ nearest neighbors for each data point, which needs $O(k n^2)$ operations. Third, computing the matrices $M_S$ and $M_T$ requires $O(n^2 d)$ and $O(n(n+k)d)$ operations, respectively. Thus, the computation cost of Step 1 is $O(2 n^2 d + k n^2 + n(n+k)d)$. In Step 2, the SVD decomposition of the matrix $X$ needs $O(d^3)$ operations, and projecting the documents into the SVD subspace takes $O(n^2 d)$ operations. As a result, Step 2 costs $O(d^3 + n^2 d)$. In Step 3, we need to solve the generalized eigenvalue problem $M_S u = \Lambda M u$ in order to find the $r$ generalized eigenvectors associated with the $r$ largest eigenvalues, which needs $O(n^3)$ operations. Then, transforming the documents into the $r$-dimensional semantic subspace requires $O(r n^2)$ operations. Consequently, the computation cost of Step 3 is $O(n^3 + r n^2)$. In Step 4, it takes $O(l c n r)$ operations to find the final document clusters, where $l$ is the number of iterations and $c$ is the number of clusters. Since $k \ll n$, $l \ll n$, and $r, n \ll d$ in document clustering applications, Step 2 will dominate the computation. To reduce the computation cost of Step 2, one can apply the iterative SVD algorithm [25] rather than a full matrix decomposition algorithm, or use a feature selection method to first reduce the dimension.
3 EXPERIMENTAL RESULTS
In this section, the performance of the proposed CPI
method is demonstrated by various experiments and
compared with that of other competing methods.
3.1 Evaluation Metrics
In this work, the accuracy (AC) metric and the normalized mutual information (NMI) metric are used to measure the clustering performance [9]. The AC metric is defined as follows:
$$AC = \frac{\sum_{i=1}^{n} \delta\big(s_i, \mathrm{map}(r_i)\big)}{n},$$
where $r_i$ is the cluster label obtained by our algorithm, $s_i$ is the label provided by the corpus, $n$ is the total number of documents, $\delta(x, y)$ is the delta function that equals one if $x = y$ and equals zero otherwise, and $\mathrm{map}(r_i)$ is the permutation mapping function that maps the cluster label $r_i$ to the equivalent label from the data corpus. The best mapping can be achieved by using the Kuhn-Munkres algorithm [27].
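In practice, the optimal permutation mapping can be obtained with the Hungarian (Kuhn-Munkres) algorithm; a small sketch using SciPy (names ours) is given below:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """AC = (1/n) * sum_i delta(s_i, map(r_i)), with map found by the Kuhn-Munkres algorithm."""
    true_ids, pred_ids = np.unique(true_labels), np.unique(pred_labels)
    # Contingency table: counts of documents per (predicted cluster, true class) pair.
    cont = np.array([[np.sum((pred_labels == r) & (true_labels == s)) for s in true_ids]
                     for r in pred_ids])
    row, col = linear_sum_assignment(-cont)   # maximize the total matched count
    return cont[row, col].sum() / len(true_labels)

print(clustering_accuracy(np.array([0, 0, 1, 1, 2]), np.array([2, 2, 0, 0, 1])))  # 1.0
```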
The normalized mutual information (NMI) is defined as
$$NMI(C, C') = \frac{MI(C, C')}{\max\big(H(C), H(C')\big)},$$
where $C$ is the set of clusters provided by the document corpus and $C'$ is the set of clusters obtained by our algorithm. $H(C)$ and $H(C')$ are the entropies of $C$ and $C'$, respectively. $MI(C, C')$ is the mutual information between $C$ and $C'$:
$$MI(C, C') = \sum_{c_i \in C, \, c'_j \in C'} p(c_i, c'_j) \log_2 \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)}.$$
Here, $p(c_i)$ (resp. $p(c'_j)$) is the probability that a document arbitrarily selected from the corpus belongs to the cluster $c_i$ (resp. $c'_j$), and $p(c_i, c'_j)$ is the joint probability that the arbitrarily selected document belongs to the clusters $c_i$ and $c'_j$ at the same time. It is easy to check that $NMI(C, C')$ takes zero when the two sets are independent, and takes one when the two sets are identical.
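The NMI can be computed directly from the cluster contingency table, for example as in the short NumPy sketch below (our own code; note that some library implementations normalize by a mean of the entropies rather than by the maximum used here):

```python
import numpy as np

def nmi(C, C_prime):
    """NMI(C, C') = MI(C, C') / max(H(C), H(C')), with probabilities estimated from counts."""
    C, C_prime = np.asarray(C), np.asarray(C_prime)
    n = len(C)
    cats, cats_p = np.unique(C), np.unique(C_prime)
    joint = np.array([[np.sum((C == a) & (C_prime == b)) for b in cats_p] for a in cats]) / n
    p, p_prime = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log2(joint[mask] / np.outer(p, p_prime)[mask]))
    entropy = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))
    return mi / max(entropy(p), entropy(p_prime))

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))   # identical partitions up to relabeling -> 1.0
```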
3.2 Document Representation
In all experiments, each document is represented as a term
frequency vector. The term frequency vector can be
computed as follows:
1. Transform the documents into a list of terms after word stemming operations.
2. Remove stop words. Stop words (see footnote 2) are common words that contain no semantic content.
3. Compute the term frequency vector using the TF/IDF weighting scheme. The TF/IDF weight assigned to the term $t_i$ in document $d_j$ is given by
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i.$$
Here, $\mathrm{tf}_{i,j} = n_{i,j} / \sum_k n_{k,j}$ is the term frequency of the term $t_i$ in document $d_j$, where $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in document $d_j$; $\mathrm{idf}_i = \log \big( |D| / |\{d : t_i \in d\}| \big)$ is the inverse document frequency, which is a measure of the general importance of the term $t_i$, where $|D|$ is the total number of documents in the corpus and $|\{d : t_i \in d\}|$ is the number of documents in which the term $t_i$ appears. Let $V = \{t_1, t_2, \ldots, t_m\}$ be the list of terms after the stop word removal and word stemming operations. The term frequency vector $X_j$ of document $d_j$ is defined as
$$X_j = [x_{1j}, x_{2j}, \ldots, x_{mj}], \qquad x_{ij} = \mathrm{tfidf}_{i,j}.$$
Using $n$ documents from the corpus, we construct an $m \times n$ term-document matrix $X$. The above process can be completed by using the text to matrix generator (TMG) [26] code.
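A comparable term-document matrix can also be produced with standard text tooling; the sketch below uses scikit-learn's TfidfVectorizer as a stand-in for TMG (note that its IDF definition includes smoothing, so the weights differ slightly from the formula above, and stemming is omitted here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "dogs and cats are common pets",
        "stock markets fell sharply today"]

vectorizer = TfidfVectorizer(stop_words="english")   # English stop-word removal
X = vectorizer.fit_transform(docs).T                 # m x n term-document matrix (documents as columns)
print(X.shape, vectorizer.get_feature_names_out()[:5])
```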
2. http://www.comp.hkbu.edu.hk/~yytang/stopwords.txt.
3.3 Clustering Results
Experiments were performed on 20 newsgroups (or NG20),
Reuters, and OHSUMED data sets. We compared the
proposed algorithm with other competing algorithms
under the same experimental setting. The experimental results
of LPI and CPI on NG20 data set are obtained when the
number of nearest neighbors is set to seven or eight. For
Reuters and OHSUMED data sets, the number of nearest
neighbors used for LPI and CPI varies from 3 to 24. In all
experiments, our algorithm performs better than or
competitively with other algorithms. The details of experi-
ments can be described as follows.
3.3.1 20 Newsgroups (or NG20)
The 20 newsgroups corpus (see footnote 3) consists of roughly 20,000 documents that come from 20 specific Usenet newsgroups. We
ments that come from 20 specific Usenet newsgroups. We
repeated the experiments in [28] and [29] to illustrate the
performance of the proposed CPI algorithm and other
competing algorithms. The first set of experiments involved
binary clustering. In each experiment, we randomly chose
50 documents from the two selected newsgroups and
100 runs were conducted for each algorithm to obtain
statistically reliable clustering results. The means and stan-
dard deviations of the test results were recorded in Table 1.
We also tested other competing methods under the same
experimental setting, including Kmeans, p-Kmeans [28],
p-QR [28], and Spectral [29]. It can be seen from Table 1 that
CPI achieves the best clustering accuracy on all six data sets.
LPI performs the second best, p-Kmeans and p-QR outper-
form Kmeans, and Kmeans performs the worst. Under
normalized mutual information metric, CPI also performs
the best. Kmeans performs better than p-Kmeans and p-QR,
and p-Kmeans or p-QR performs the worst. We also tested
the significance of the comparisons between various meth-
ods using paired t-test with a significance level of 0.05. We
can see that the CPI method outperforms the other competing methods with statistical significance in most of the data sets.
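The paired t-test used for these comparisons can be carried out, for example, with SciPy on the per-run scores of two methods (the score arrays below are placeholders, not results from the paper):

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-run accuracies of two methods over the same 100 random document sets.
rng = np.random.default_rng(1)
acc_method_a = rng.normal(0.90, 0.03, 100)
acc_method_b = rng.normal(0.87, 0.03, 100)

t_stat, p_value = ttest_rel(acc_method_a, acc_method_b)   # paired t-test
print(p_value < 0.05)   # significant at the 0.05 level?
```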
The second set of experiments involved c-way clustering with c = 5 and c = 10. Again, 100 runs were used for each test, and the means and standard deviations were recorded in
Table 2. The number in the parenthesis (50 or 100) indicates
the number of random documents chosen from the news-
groups sets. As can be seen, CPI achieves the best accuracy
and normalized mutual information in all six data sets.
TABLE 1
Performance Comparison of Different Clustering Methods Using NG20 Data Corpus
Note that ≪ (≫) indicates that schemes on the right are significantly better (worse) than the schemes on the left, and < (>) indicates that the relationship is not significant. In all results of statistical significance tests, the expression A < B ≪ C means that the relationship between A and C is A ≪ C.
3. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html.
p-Kmeans and p-QR perform better than Kmeans under the accuracy metric. Under the normalized mutual information metric, Kmeans outperforms the p-Kmeans and p-QR methods in two data sets. The results of the statistical significance
test show that CPI is more accurate than the other methods
with statistical significance for most of the data sets.
3.3.2 Reuters
The Reuters corpus (see footnote 4) contains 21,578 documents in
135 topics. Many documents have multiple category labels.
A subset of Reuters containing a total of 8,067 documents in 30 categories with unique category labels is used in this experiment. The proposed method was compared with five
methods, including
. Kmeans on original data used with cosine similarity
measure [4],
. Kmeans with cosine similarity measure after LSI [7],
. Kmeans with cosine similarity measure after LPI [8],
. von Mises-Fisher model (vMF) [10], [12], and
. nonnegative matrix factorization (NMF) [9].
The experiments were performed with the number of
clusters ranging from 2 to 8. For each given c (the number of clusters), 50 document sets with different clusters
were randomly selected from the corpus. Since all the
tested algorithms depend heavily on the initial partition,
we performed 100 runs for each set of documents. The
means and standard deviations were presented in Table 3.
One can see that the average accuracy for CPI, Kmeans,
vMF, NMF, LSI, and LPI are 67.60, 50.37, 63.43, 64.78, 55.44,
and 65.28 percent, respectively. The average normalized
mutual information for CPI, Kmeans, vMF, NMF, LSI, and
LPI are 53.08, 40.05, 49.25, 50.03, 44.44, and 51.01 percent,
respectively. Also, in all the cases (the number of clusters c = 2, ..., 8), our algorithm consistently outperforms the
other five algorithms. We also tested the significance of the
comparisons between various methods using paired t-test
with a significance level of 0.05. The results show that our
algorithm outperforms Kmeans, LSI, and vMF with statistical significance in most cases.
3.3.3 OHSUMED
The OHSUMED collection (see footnote 5) includes medical abstracts from
the MeSH categories of the year 1991. In the literature [30],
the first 20,000 documents were divided into two halves,
10,000 for training and 10,000 for testing. The specific task
was to categorize the 23 cardiovascular diseases categories.
In our experiment, we use this subset for document
clustering. After removing those documents appearing in
multiple categories, the testing data contains more than
14,000 documents with unique cluster labels.
Similarly, the experiments were performed with the
number of clusters ranging from 2 to 8. For each given c
TABLE 2
Performance Comparison of Different Clustering Methods Using NG20 Data Corpus
4. http://www.daviddlewis.com/resources/testcollections/reuters21578/.
5. ftp://medir.ohsu.edu/pub/ohsumed.
(i.e., the number of clusters), 50 document sets with
different clusters were randomly chosen from the corpus.
One hundred runs were performed for each set of documents.
Table 4 shows the means and standard deviations. It can
be seen from Table 4 that the average accuracy (resp.
average normalized mutual information) for CPI, Kmeans,
vMF, NMF, LSI, and LPI are 52.21 percent (30.01 percent),
44.06 percent (22.84 percent), 46.89 percent (20.94 percent),
46.75 percent (20.84 percent), 46.32 percent (23.72 percent),
and 49.29 percent (27.99 percent), respectively. In addi-
tion, for all the cases of c = 2, ..., 8, our algorithm consistently outperforms the other five algorithms. We can also see that the normalized mutual information values of vMF and NMF are less than that of Kmeans, although the accuracies of vMF and NMF are higher than the accuracy of Kmeans. This is because the results on
different clusters achieved by vMF and NMF are rather
uneven. Furthermore, the results of statistical significance
test show that the CPI is more accurate than the other
methods with statistical significance in most c-way
clustering cases.
In summary, several experiments have been performed
on different data sets. Our CPI algorithm outperforms other
competing algorithms, including the LPI algorithm, on all
data sets. Although both CPI and LPI attempt to preserve
the local structure, CPI is performed in the correlation
measure space while LPI is performed in euclidean metric
space. For document clustering applications, documents are of high dimensionality in general. In [12], Zhong and Ghosh
showed that euclidean distance is not appropriate for
clustering high-dimensional data such as text, and a better
metric for text clustering is the cosine measure or equiva-
lently the correlation similarity. Thus, CPI can have more
discriminating power than LPI.
Meanwhile, high-dimensional data (such as texts) often
lie on relatively low-dimensional manifold and the non-
linear geometry of that manifold is usually embedded in the
similarities between the data points [11]. Since CPI is based
on correlation similarity measure, it is capable of capturing
the geometry of the manifold structure. In contrast, LPI uses
euclidean metric to represent the dissimilarity between data
points. So it cannot effectively detect the geometry of the
TABLE 3
Performance Comparison of Different Clustering Methods Using Reuters Data Corpus
manifold structure. The experimental results also show that
CPI is more effective in discovering the latent manifold
structure than LPI.
3.4 Parameter Selection
Parameter selection is a crucial issue in most learning
problems. In some situations, the learning performance may
drastically vary with different choices of the parameters.
Our CPI algorithm has two essential parameters: the
dimension of optimal semantic subspace and the number
of nearest neighbors. Unfortunately, how to determine the
optimal dimension of the semantic subspace is still an open
problem. In typical spectral clustering, the dimension of
semantic subspace is set to the number of clusters [31].
Based on this observation, we transform the original
documents into a c-dimensional semantic subspace when
involving c-way clustering in our experiments.
Of great interest is the selection of the number of nearest neighbors ($k$), which defines the locality of the data manifold in CPI and is also a key parameter in LPI. Fig. 2 shows the performance of CPI and LPI as a function of the parameter $k$. As can be seen, the LPI and CPI methods achieve good performance when $k$ is in the range from 10 to 20. CPI is very stable with respect to the parameter $k$. Thus, the selection of the parameter $k$ is not a crucial problem in the CPI
algorithm. It should be pointed out that the selection of the
optimal number of nearest neighbors is very difficult under
unsupervised learning and is still an open problem.
3.5 Generalization Capability
The LSI, LPI, and CPI methods try to find a low-dimensional
semantic subspace by preserving the relational structure
among the documents, where the mapping between the
original document space and the low-dimensional semantic
subspace is explicit. In practical applications, we may use part of the documents to learn such a mapping and then transform the remaining documents into the low-dimensional semantic subspace, which can reduce the computing time. It is very
important for a clustering method to have the capability of
predicting the new data by using knowledge formerly
acquired from training data without learning again. The
performance on the new samples reflects the generalization
capability of the methods.
TABLE 4
Performance Comparison of Different Clustering Methods Using OHSUMED Data Corpus
In order to test the generalization capability of LSI, LPI,
and CPI, the following experiment was performed: in each of the previous experiments, we randomly selected 20 percent of the data for learning the mapping, and the remaining data were used for testing. Finally, the cluster label of the testing
documents was predicted by using knowledge formerly
acquired from training data. The generalization error
(averaged over two to eight classes) across 30 random runs
Fig. 2. The accuracy and normalized mutual information (averaged over 2 to 8 clusters) with respect to the number of nearest neighbors. (a) Reuters
corpus. (b) OHSUMED corpus.
Fig. 3. The generalization capability of the LSI, LPI, and CPI methods on the Reuters corpus (first row) and the OHSUMED corpus (second row) when using the correlation, L2, and L1 distance metrics (from left to right), respectively.
is shown in Fig. 3. Clearly, the CPI method has smaller
generalization error than the LSI and LPI methods do. This
means that the CPI method has better generalization
capability. Another interesting result lies in the generalization error of LSI and LPI with the euclidean distances (L2 and L1). We observed that LSI and LPI with the correlation distance performed much better than LSI and LPI with the euclidean distances, while CPI with the euclidean distances performed competitively with CPI with the correlation distance. CPI can
find a low-dimensional semantic subspace in which the
documents related to the same semantic are close to each
other. Thus, correlation is an appropriate metric for
measuring similarity between the documents.
4 CONCLUSIONS
In this paper, we present a new document clustering method
based on correlation preserving indexing. It simultaneously
maximizes the correlation between the documents in the
local patches and minimizes the correlation between the
documents outside these patches. Consequently, a low-
dimensional semantic subspace is derived where the
documents corresponding to the same semantics are close
to each other. Extensive experiments on NG20, Reuters, and
OHSUMED corpora show that the proposed CPI method
outperforms other classical clustering methods. Further-
more, the CPI method has good generalization capability
and thus it can effectively deal with data with very large size.
ACKNOWLEDGMENTS
The authors would like to thank the associate editor and the
anonymous reviewers for helpful comments that greatly
improved this manuscript. The authors also thank
C.L. Philip Chen for great help in revising the manuscript.
This work was supported in part by the National Natural
Science Foundations of China (61003120 and 60873092), and
in part by the Australian Research Council under grant
(DP1095498), and in part by the Fundamental Research
Funds for the Central Universities (CDJRC10180014).
REFERENCES
[1] R.T. Ng and J. Han, Efficient and Effective Clustering Methods
for Spatial Data Mining, Proc. 20th Intl Conf. Very Large Data
Bases (VLDB), pp. 144-155, 1994.
[2] A.K. Jain, M.N. Murty, and P.J. Flynn, Data Clustering: A
Review, ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[3] S. Kotsiantis and P. Pintelas, Recent Advances in Clustering: A
Brief Survey, WSEAS Trans. Information Science and Applications,
vol. 1, no. 1, pp. 73-81, 2004.
[4] J.B. MacQueen, Some Methods for Classification and Analysis of
Multivariate Observations, Proc. Fifth Berkeley Symp. Math.
Statistics and Probability, vol. 1, pp. 281-297, 1967.
[5] L.D. Baker and A.K. McCallum, Distributional Clustering of
Words for Text Classification, Proc. 21st Ann. Intl ACM SIGIR
Conf. Research and Development in Information Retrieval, pp. 96-103,
1998.
[6] X. Liu, Y. Gong, W. Xu, and S. Zhu, Document Clustering with
Cluster Refinement and Model Selection Capabilities, Proc. 25th
Ann. Intl ACM SIGIR Conf. Research and Development in Information
Retrieval (SIGIR 02), pp. 191-198, 2002.
[7] S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and
R.A. Harshman, Indexing by Latent Semantic Analysis, J. Am.
Soc. Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[8] D. Cai, X. He, and J. Han, Document Clustering Using
Locality Preserving Indexing, IEEE Trans. Knowledge and Data
Eng., vol. 17, no. 12, pp. 1624-1637, Dec. 2005.
[9] W. Xu, X. Liu, and Y. Gong, Document Clustering Based on Non-
Negative Matrix Factorization, Proc. 26th Ann. Intl ACM SIGIR
Conf. Research and Development in Informaion Retrieval (SIGIR 03),
pp. 267-273, 2003.
[10] S. Zhong and J. Ghosh, Generative Model-Based Document
Clustering: A Comparative Study, Knowledge and Information Systems, vol. 8, no. 3, pp. 374-384, 2005.
[11] D.K. Agrafiotis and H. Xu, A Self-Organizing Principle for
Learning Nonlinear Manifolds, Proc. Natl Academy of Sciences
USA, vol. 99, no. 25, pp. 15869-15872, 2002.
[12] S. Zhong and J. Ghosh, Scalable, Balanced Model-Based Cluster-
ing, Proc. Third SIAM Intl Conf. Data Mining, pp. 71-82, 2003.
[13] Y. Fu, S. Yan, and T.S. Huang, Correlation Metric for Generalized
Feature Extraction, IEEE Trans. Pattern Analysis and Machine
Intelligence, vol. 30, no. 12, pp. 2229-2235, Dec. 2008.
[14] Y. Ma, S. Lao, E. Takikawa, and M. Kawade, Discriminant
Analysis in Correlation Similarity Measure Space, Proc. 24th Intl
Conf. Machine Learning (ICML '07), pp. 577-584, 2007.
[15] R.D. Juday, B.V.K. Kumar, and A. Mahalanobis, Correlation Pattern
Recognition. Cambridge Univ. Press, 2005.
[16] D.R. Hardoon, S.R. Szedmak, and J.R. Shawe-taylor, Canonical
Correlation Analysis: An Overview with Application to Learning
Methods, J. Neural Computation, vol. 16, no. 12, pp. 2639-2664,
2004.
[17] D.M. Blei, A.Y. Ng, and M.I. Jordan, Latent Dirichlet Allocation,
J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[18] J.B. Tenenbaum, V. de Silva, and J.C. Langford, A Global
Geometric Framework for Nonlinear Dimensionality Reduction,
Science, vol. 290, no. 5500, pp. 2319-2323, Dec. 2000.
[19] X. Zhu, Z. Ghahramani, and J. Lafferty, Semi-Supervised
Learning Using Gaussian Fields and Harmonic Functions, Proc.
20th Intl Conf. Machine Learning (ICML 03), 2003.
[20] X. Zhu, Semi-Supervised Learning Literature Survey, technical
report, Computer Sciences, Univ. of Wisconsin-Madison, 2005.
[21] G. Lebanon, Metric Learning for Text Documents, IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 497-507,
Apr. 2006.
[22] F.Y.M. Wan, Introduction to Calculus of Variations and Its Applica-
tions. Chapman & Hall, 1995.
[23] Encyclopaedia of Mathematics. M. Hazewinkel, ed., Springer-Verlag,
http://eom.springer.de/L/l057190.htm, 2002.
[24] I.S. Dhillon and D.M. Modha, Concept Decompositions for Large
Sparse Text Data Using Clustering, Machine Learning, vol. 42, no. 1,
pp. 143-175, 2001.
[25] P. Strobach, Bi-Iteration SVD Subspace Tracking Algorithms,
IEEE Trans. Signal Processing, vol. 45, no. 5, pp. 1222-1240, May
1997.
[26] D. Zeimpekis and E. Gallopoulos, Design of a Matlab Toolbox for
Term-Document Matrix Generation, Proc. Workshop Clustering
High Dimensional Data and Its Applications at the Fifth SIAM Intl
Conf. Data Mining (SDM 05), pp. 38-48, 2005.
[27] L. Lovasz and M. Plummer, Matching Theory. Elsevier, 1986.
[28] H. Zha, C. Ding, M. Gu, X. He, and H. Simon, Spectral Relaxation
for k-Means, Neural Information Processing Systems, vol. 14 (NIPS
2001), pp. 1057-1064, MIT Press, 2001.
[29] D. Cheng, R. Kannan, S. Vempala, and G. Wang, A Divide-and-
Merge Methodology for Clustering, ACM Trans. Database
Systems, vol. 31, no. 4, pp. 1499-1525, 2006.
[30] T. Joachims, Text Categorization with Support Vector Machines:
Learning with Many Relevant Features, LS VIII-Report LS8-
Report 23, Universitat Dortmund, 1997.
[31] A. Ng, M. Jordan, and Y. Weiss, On Spectral Clustering: Analysis
and an Algorithm, Proc. Advances in Neural Information Processing
Systems, 2001.
1012 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 6, JUNE 2012
Taiping Zhang (M'10) received the BS and MS
degrees in computational mathematics, and the
PhD degree in computer science from Chongq-
ing University, China, in 1999, 2001, and 2010,
respectively. Currently, he is working as a
lecturer in the Department of Computer Science
at Chongqing University and a visiting research
fellow at the Faculty of Science and Technology,
University of Macau. His research interests
include pattern recognition, image processing,
machine learning, and computational mathematics. He has published
extensively in the IEEE Transactions on Image Processing, the IEEE
Transactions on Systems, Man, and Cybernetics, Part B (TSMC-B),
Pattern Recognition, Neurocomputing, etc. He is a member of the IEEE.
Yuan Yan Tang (F'04) received the BS degree
in electrical and computer engineering from
Chongqing University, China, the MEng degree
in electrical engineering from the Graduate
School of Post and Telecommunications, Beij-
ing, China, and the PhD degree in computer
science from Concordia University, Montreal,
Canada. Currently, he is working as a professor
in the Department of Computer Science at
Chongqing University, a chair professor of the
Faculty of Science and Technology at the University of Macau, and
adjunct professor in computer science at Concordia University. He is an
honorary professor at Hong Kong Baptist University, an advisory
professor at many institutes in China. His current interests include
wavelet theory and applications, pattern recognition, image processing,
document processing, artificial intelligence, parallel processing, Chinese
computing, and VLSI architecture. He has published more than
300 technical papers and is the author/coauthor of 23 books/book-
chapters on subjects ranging from electrical engineering to computer
science. He has serviced as general chair, program chair, and
committee member for many international conferences. He was the
general chair of the 18th International Conference on Pattern Recogni-
tion (ICPR06). He is the founder and editor-in-chief of International
Journal on Wavelets, Multiresolution, and Information Processing
(IJWMIP) and associate editors of several international journals related
to Pattern Recognition and Artificial Intelligence. He is the chair of
pattern recognition committee in IEEE SMC. He is a fellow of the IAPR
and the IEEE.
Bin Fang (SM'10) received the BS degree in
electrical engineering from Xian Jiaotong Uni-
versity, China, the MS degree in electrical
engineering from Sichuan University, Chengdu,
China, and the PhD degree in electrical en-
gineering from the University of Hong Kong,
China. Currently, he is working as a professor in
the Department of Computer Science, Chongq-
ing University, China. His research interests
include computer vision, pattern recognition,
medical image processing, biometrics applications, and document
analysis. He has published more than 100 technical papers and is an
associate editor of the International Journal of Pattern Recognition and
Artificial Intelligence. He is a senior member of the IEEE.
Yong Xiang received the BE and ME degrees
from the University of Electronic Science and
Technology of China, Chengdu, China, in 1983
and 1989, respectively, and the PhD degree
from The University of Melbourne, Australia, in
2003. He was with the Southwest Institute of
Electronic Equipment of China, Chengdu, from
1983 to 1986. In 1989, he joined the University of
Electronic Science and Technology of China,
where he was a lecturer from 1989 to 1992 and
an associate professor from 1992 to 1997. He was a senior commu-
nications engineer with Bandspeed Inc., Melbourne, Australia, from
2000 to 2002. Currently, he is working as an associate professor in the
School of Engineering, Deakin University, Victoria, Australia. His
research interests include blind signal/system estimation, communica-
tion signal processing, biomedical signal processing, information and
network security, speech and image processing, and pattern recognition.