A Communication Perspective on Automatic Text Categorization
Marta Capdevila and Oscar W. Márquez Flórez, Member, IEEE
Abstract—The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. Particularly, from a text categorization system point of view, the information encoded by a document is the topic or category it belongs to. Following this initial intuition, a theoretical framework is developed where Automatic Text Categorization (ATC) is studied under a Communication System perspective. Under this approach, the problematic dimensionality of the indexing feature space is tackled by a two-level supervised reduction scheme, implemented by a noisy terms filtering and a subsequent redundant terms compression. Gaussian probabilistic categorizers have been revisited and adapted to the sparsity that is concomitant with ATC. Experimental results pertaining to the 20 Newsgroups and Reuters-21578 collections validate the theoretical approaches. The noise filter and redundancy compressor allow an aggressive term vocabulary reduction (reduction factor greater than 0.99) with a minimum loss (lower than 3 percent) and, in some cases, gain (greater than 4 percent) of final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).
Index Terms—Data communications, text processing, data compaction and compression, clustering, classifier design and evaluation, feature evaluation and selection.
1 INTRODUCTION
A deep parallelism may be established between a Communication System and an Automatic Text Categorization (ATC) scheme, since both disciplines deal with the transmission of information and its reliable recovery. The establishment of this novel simile allows us to tackle, from a founded Communication theoretical point of view, the overdimensioned document representation space that is heavily redundant with respect to the classification task and typically turns problematic for many categorizers [1] in ATC.¹ The main objective of our research has been to investigate how and up to which extreme the document representation space can be compressed, and what the effects of this compression are on final classification. The idea behind this is to take a first step toward an optimal encoding of the category, carried by the document vectorial representation, in view of both limiting the greedy use of resources issued from the high-dimensionality feature space and reducing the effects of overfitting.²

Additionally, our research also aims at showing how the document decoding (or classification task) can take advantage of common Gaussian assumptions made in the Communication System discipline but fairly ignored in ATC.
This paper is structured as follows: In Section 2, ATC is
briefly reviewed, and in Section 3, the Communication
systems perspective is established. Sections 4 and 5 explain the theoretical basis of the proposed document sampling and document decoding. Sections 6, 7, 8, and 9 are dedicated to experimental results, and finally, Section 10 presents the conclusions.
2 AUTOMATIC TEXT CATEGORIZATION
ATC is the task of assigning a text document to one or more predefined categories or classes,³ based on its textual content. It corresponds to a supervised (not fully automated) process, where categories are predefined by some external mechanism (normally human) that establishes, at the same time, a set of already labeled examples that form the training set. Classifiers are generated from those training examples, by induction, in the so-called learning phase. This forms the machine learning paradigm (as opposed to the knowledge-engineering approach) of ATC that has been predominant since the exponential universalization of electronic textual information in the 1990s [1].
It is further generally assumed that categories are
exclusive (also known as nonoverlapping), meaning that a
document can only belong to a single category (single-label
categorization), as this scenario has been shown [1] to be
more general than the multilabel case.
. The authors are with the Signal and Communications Processing Department, Telecommunication Engineering School, University of Vigo, Rua Maxwell s/n, Campus Universitario Lagoas-Marcosende, E-36310 Vigo, Spain. E-mail: {martacap, omarquez}@gts.tsc.uvigo.es.
Manuscript received 30 July 2008; revised 21 Nov. 2008; accepted 18 Dec. 2008; published online 8 Jan. 2009.
Recommended for acceptance by S. Zhang.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2008-07-0394.
Digital Object Identifier no. 10.1109/TKDE.2009.22.
1. A sound exception can be made for the state-of-the-art SVM categorizer, which arguably [2] is well adapted to the typical high-dimensionality representation space of ATC and which benefits from the improved performance of recent training algorithms [3].
2. In general terms, the problem of overfitting results from the
characterization of reality with too many parameters, which makes the
modeling too specific and poorly generalizable.
3. The designations categorization and classification are used interchangeably in this text, both indicating the described supervised process.
2.1 Document Vectorial Representation
The first step toward any compact document representation
is the definition of the indexing features. The indexing features,
also called terms, are the minimal meaningful constitutive
units (a common choice is to use words). The set of different
terms that appear in the collection of training documents
forms the vocabulary or alphabet of terms $\mathcal{T}$. Once the alphabet is chosen, the text document can be represented in the term space. In this indexing process, the sequentiality or order of terms in the text is commonly lost. This is known as the bag-of-words approach [1].
The problem is that the indexing vocabulary typically reaches tens or hundreds of thousands of terms. Working in such a high-dimensionality space commonly turns problematic. This is why, before initiating any classification task, a
filtering designed to reduce the term space dimensionality
is usually applied. There are basically two approaches to
dimensionality reduction: 1) Term selection, where a subset of features is selected out of the original set, and 2) Term extraction, where the chosen features are obtained by combination of the original features. In the latter approach, Distributional Clustering [4], [5], [6], [7] is a supervised clustering technique that has been shown to be very effective at reducing the document indexing space with residual loss in categorization accuracy.
2.2 Common Categorizers
In the following, we shortly review two of the state-of-the-art classifiers used in ATC, which will be extensively referenced in our experiments.
2.2.1 Multinomial Naive Bayes (MNB)
MNB is a probabilistic categorizer that assumes a document is a sequence of terms, each of them randomly chosen from the term vocabulary, independently from the rest of the term events in the document. Despite its oversimplified Naive Bayes basis, MNB achieves good performance in practice [8].
2.2.2 Support Vector Machines (SVMs)
SVM is a binary classifier that attempts to find, among all
the surfaces that separate positive from negative training
examples, the decision surface that has the widest possible
margin (the margin being defined as the smallest distance
between positive and negative examples to the decision
surface). SVM is particularly well adapted to ATC [2] and stands as one of the best-performing categorizers [9].
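As a point of reference, both baselines can be reproduced in a few lines. The sketch below uses scikit-learn, which is our illustrative choice; the experiments reported later in this paper rely on Weka and LibSVM instead (see Section 6.3), and the preprocessing here is deliberately simplified:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# bydate split, as in the 20 Newsgroups experiments of this paper
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = CountVectorizer(stop_words="english")  # bag-of-words TF indexing
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X_train, train.target)
    print(type(clf).__name__, "accuracy:", round(clf.score(X_test, test.target), 3))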
3 A COMMUNICATION INTERPRETATION ON ATC
A communication system [10] has the basic function of
transferring information (i.e., a message) from a source to
a destination. There are three essential parts to any communication system: the encoder/transmitter, the transmission channel, and the receiver/decoder. The encoder/transmitter processes the source message into the encoded and transmitted messages. The transmission channel is the medium that bridges the distance from source to destination. Every channel introduces some degree of undesirable effects, such as attenuation, noise, interference, and distortion. The receiver/decoder processes the received message in order to deliver it to the destination. A classical digital communication system simplified model is represented in Fig. 1.
In its raw form, a text document is a string of characters.
Typically in ATC, a bag-of-words approach is adopted, which assumes that the document is an order-ignored sequence of words⁴ that can be represented vectorially. It is
further assumed that the vocabulary used by a given
document depends on the category or topic it belongs to.
The ATC scheme can be modeled by a communication
system, as shown in Fig. 2.
3.1 The Encoder/Transmitter Model
The generation of a document is determined by a Category encoder, which is a random selector of words, modulated by the category C (i.e., the selection of words is a random event, different and characteristic of each category). For each category input $c_i$, the Category encoder is characterized by a distinct alphabet $\mathcal{T}_i$ (which is the subset of $\mathcal{T}$ that contains the words used by the documents of $c_i$) and the conditional probabilities of each element of this alphabet, $\{p(t_1|c_i), p(t_2|c_i), \ldots, p(t_l|c_i), \ldots\}$. Fig. 3 illustrates different example alphabets $\mathcal{T}_i$.
Actually, the Category encoder generates a sequence of outcomes⁵ that are the actual words that (partially) form the document. In the communication nomenclature, each word
Fig. 1. A classical digital communication system simplified model.
Fig. 2. ATC modeled by a communication system.
4. Words and terms are used interchangeably in this text to designate the meaningful units of language.
5. The outcomes may be considered independent, as in the Naive Bayes
approaches.
could be a symbol. Note that the length of the sequence is random (i.e., not fixed, since some documents are short and others long, randomly) but presumably category independent. And, finally, let us indicate that the input $c_i$ is itself the value of the outcome of another random event, the category C, characterized by an alphabet $\mathcal{C} = \{c_1, c_2, \ldots, c_{|\mathcal{C}|}\}$⁶ with probabilities $p_C = \{p(c_1), p(c_2), \ldots, p(c_{|\mathcal{C}|})\}$.
The Document builder measures the degree of contribution⁷ of each word in the sequence generated by the Category encoder and establishes a $|\mathcal{T}|$-dimensional vector or codeword with all the obtained weights. This process is commonly known in ATC as the indexing of the document. The final vectorial representation of the document, noted $\mathbf{D}$ in our scheme, constitutes the vector signal that is transmitted over the channel.
The miscouplings between the ideally generated documents and the actual documents (i.e., introduction of lexicon borrowed from the vocabulary of other categories, word iterations, etc.) are modeled by undesirable effects introduced by the channel, namely Noise, Intersymbol interference, and Channel distortion. Noise refers to random and unpredictable variations superimposed on the transmitted signal. Channel distortion is a perturbation of the signal due to the distorting response of the channel. And, finally, Intersymbol interference is a form of distortion caused by the previously transmitted symbols.
The received signal d is the actual document that we
manipulate. The role of the receiver/decoder is to decode the
category out of the received document d (i.e., to perform the
document classification).
3.2 The Receiver/Decoder Model
Now, the problem is that, typically in ATC, the alphabet of symbols $\mathcal{T}$ has an extremely high dimensionality. Many words are semantically equivalent or category-related (i.e., the alphabets $\mathcal{T}_i$ are extensive) and a large amount of others are not discriminative of any category in particular (i.e., the intersection between sets $\mathcal{T}_i$ is large); see Fig. 3 to get a visual perception of this. The alphabet of symbols $\mathcal{T}$ may be said to carry a high degree of redundancy and noise.

From a communication perspective, working with such a redundant and noisy alphabet of symbols $\mathcal{T}$ implies a suboptimal encoding of the category C that generates over-dimensioned codewords. Apart from representing a waste of resources in terms of use of channel capacity⁸ and processing economy, the over-dimensionality problem is not insignificant, since it happens to affect the Category decisor task. From this point of view, we may wish to eliminate noise and redundancy by filtering and compressing (ideally, under a lossless compression) the alphabet of symbols $\mathcal{T}$.
This is what the blocks Prefilter, Noisy terms filter, and Redundant terms compressor in Fig. 4 basically aim at. More precisely, the Prefilter is a low-level filter which typically includes removal of stopwords (i.e., articles, conjunctions, etc.), infrequent words, nonalphabetical words, etc. The Noisy terms filter eliminates words that are noninformative (i.e., nondiscriminative) of the category variable. And, finally, the Redundant terms compressor clusters terms that convey similar information over the category.

The resulting alphabet of symbols $\mathcal{T}''$ is a lower dimensional set of less noisy and less redundant features (i.e., combinations of terms) that provides a more optimal sampling space for documents. In fact, the space of symbols is transformed with a view that the final documents, seen as codewords, are characterized by 1) being as short as possible, 2) having as little noise as possible, and 3) containing as much information as possible.
Back to Fig. 2, the document filter applies the prefiltering and noisy terms filtering just described upon the received document d. The document is, thus, finally represented in a lower dimensional space $\mathcal{T}' \subset \mathcal{T}$, resulting in $\mathbf{d}'$.

The document sampler projects the filtered document $\mathbf{d}'$ into the new space of features $\mathcal{T}''$, producing document representation $\mathbf{d}''$. This projection process implies a new document quantization. In the case of a TF indexing, quantization simply means adding up the weights of the original words of a same cluster.
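A minimal sketch of this quantization step, assuming TF weights and a precomputed term-to-cluster map (the function and its names are illustrative, not the actual implementation):

import numpy as np

def sample_document(d_prime, cluster_of):
    """Project a filtered TF vector d' onto the cluster space: with TF
    indexing, quantization is just summing the weights of the original
    terms that fall in the same cluster."""
    d2 = np.zeros(max(cluster_of) + 1)
    for term_idx, weight in enumerate(d_prime):
        d2[cluster_of[term_idx]] += weight
    return d2

# toy usage: 5 surviving terms mapped into 2 clusters
print(sample_document([3, 0, 1, 2, 1], cluster_of=[0, 1, 0, 1, 1]))  # [4. 3.]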
The category decisor has the task of decoding the
document. It is the actual supervised classifier that has
previously undergone a learning phase that, for simplicity
reasons, is not reflected in Fig. 2.
3.3 Further Remarks
Several interesting issues can be drawn from the communication analysis performed upon ATC. The first one is that the document may be seen as a vector signal that encodes the category source information. The signal space is conformed by the alphabet of terms $\mathcal{T}$ defined by the document collection. The main issue of this encoding scheme is its high suboptimization, both because of the high
[...]

$$f_{t_l}(c_i) = \frac{\#(t_l, \mathcal{D}^i_{train})}{\#(t_l, \mathcal{D}_{train})}, \qquad (2)$$

being $\#(t_l, \mathcal{D}^i_{train})$ the number of times the term $t_l$ appears in all training documents belonging to category $c_i$ and $\#(t_l, \mathcal{D}_{train})$ the number of times the term $t_l$ appears in the whole train collection.
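Equation (2) can be computed directly from a term-frequency matrix; a minimal NumPy sketch follows (the dense input and function name are our illustrative assumptions):

import numpy as np

def term_distributions(tf, labels, n_categories):
    """Eq. (2): f_{t_l}(c_i) = #(t_l, D_train^i) / #(t_l, D_train).
    tf is a dense (documents x terms) TF matrix; returns one PMF over
    the categories per term (one row per term)."""
    tf, labels = np.asarray(tf, float), np.asarray(labels)
    per_cat = np.stack([tf[labels == c].sum(axis=0)
                        for c in range(n_categories)], axis=1)
    totals = per_cat.sum(axis=1, keepdims=True)
    return per_cat / np.maximum(totals, 1.0)  # guard against unseen terms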
4.3 Alphabet of Symbols $\mathcal{T}$ Filtering and Compression
As announced in Section 3, we can envisage reducing the symbol alphabet $\mathcal{T}$ in two distinct directions:
1. Noisy terms filtering. Terms that have a flat function $f_{t_l}$, that is,

$$f_{t_l}(c_i) \approx f_{t_l}(c_k), \qquad \forall i, k, \qquad (3)$$

do not convey information on the target category. These terms, from a communication perspective, are noisy and should be eliminated. A noise filter should discriminate between informative terms and noninformative or noisy terms. A dispersion measure needs to be defined, as well as a selection threshold upon it. Note that the threshold will have to be set experimentally.
2. Redundant terms compression. Terms that convey similar information on the target category random event, that is,

$$f_{t_k}(c_i) \approx f_{t_l}(c_i), \qquad \forall i, \qquad (4)$$

are redundant. An (ideally) lossless data compression scheme reduces redundancy by clustering terms with similar distributional representation. A similarity measure needs to be defined.
4.4 Dispersion and Similarity Measures
As just seen, to perform the document compression, we need to establish measures of the information conveyed by a term and of the redundancy between terms. But which metrics should be used? The answer is not straightforward. We may use distinct dispersion and similarity measures depending on different interpretations of what the distributional term representation $f_{t_l}$ is.
4.4.1 The PMF Interpretation
Distributional functions $f_{t_l}$ are commonly PMFs. Information
Theory (IT) [11] provides useful measures to quantify
information conveyed by random events (e.g., entropy)
and similarity between PMFs (e.g., Kullback-Leibler and
Jensen-Shannon divergences).
The IT approach has been commonly adopted by other works on Distributional Clustering [4], [5], [6]. We have chosen to follow a new and unexplored direction, the discrete signal interpretation, which is rooted in Communication and Signal Processing concepts, coherently with the proposed general framework setup.
4.4.2 The Discrete Signal Interpretation
A discrete signal is a set of $|\mathcal{C}|$ measurements of an unknown but latent random variable (r.v.). The rationale behind a discrete signal interpretation of $f_{t_l}$ is that we are
is that we are
interested in analyzing the general shape of the distribu-
tions. By modeling those distributions with a latent random
variable, small differences between distributions are assimi-
lated by the random nature of the signal.
. A dispersion measure: Sample variance. The variance is a measure of the statistical dispersion of an r.v. For a given discrete r.v. X with PMF $p_X$ defined in $\mathcal{X}$, it is expressed as:

$$\sigma_X^2 \triangleq E\left[(X - \mu_X)^2\right] = E[X^2] - \mu_X^2, \qquad (5)$$

where $E[g(X)] = \sum_{x \in \mathcal{X}} g(x)\, p_X(x)$ denotes the expectation operation and $\mu_X = E[X]$ the expected mean of X.

Now, a discrete signal is a bunch of measurements on an r.v. The underlying PMF is unknown, and thus, the expectation operator $E[\cdot]$ cannot be computed. The variance of a discrete signal $f_{t_l}$, also called sample variance, is thus obtained by substituting the expectation operator in (5) by the arithmetic mean, as follows⁹:

$$\hat{\sigma}^2_{f_{t_l}} \triangleq \frac{1}{|\mathcal{C}|} \sum_{c_i \in \mathcal{C}} \left( f_{t_l}(c_i) - \bar{f}_{t_l} \right)^2, \qquad (6)$$

where $\bar{f}_{t_l} = \frac{1}{|\mathcal{C}|} \sum_{c_i \in \mathcal{C}} f_{t_l}(c_i)$ is the arithmetic mean of $f_{t_l}$.

Sample variance is bounded in the interval $\left[0, \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}\right]$. Noninformative terms are those with low dispersion among categories (i.e., $f_{t_l}$ with a flat distribution). They are thus characterized by their low variance.
. A similarity measure: Sample correlation coefficient. Correlation refers to the departure of two variables from independence. Pearson's correlation coefficient in (7) is the most widely used measure of relationship between two random variables X and Y. It evaluates the degree to which both functions are linearly associated (it equals 0 if they are statistically independent and, in the other extreme, 1 if they are linearly dependent):

$$\rho_{X,Y} \triangleq \frac{1}{\sigma_X \sigma_Y}\, E\left[(X - \mu_X)(Y - \mu_Y)\right]. \qquad (7)$$

As with the variance, in the case of two discrete signals $f_{t_l}$ and $f_{t_k}$, the correlation coefficient is expressed by its sample version:

$$\hat{\rho}_{f_{t_l} f_{t_k}} \triangleq \frac{\sum_{i=1}^{|\mathcal{C}|} \left( f_{t_l}(c_i) - \bar{f}_{t_l} \right) \left( f_{t_k}(c_i) - \bar{f}_{t_k} \right)}{|\mathcal{C}|\, \hat{\sigma}_{f_{t_l}}\, \hat{\sigma}_{f_{t_k}}}. \qquad (8)$$
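For illustration, (6) and (8) translate directly into NumPy (the function names are ours); each $f_{t_l}$ is treated as a vector of $|\mathcal{C}|$ measurements:

import numpy as np

def sample_variance(f_t):
    """Eq. (6): dispersion of one term distribution f_{t_l} over the
    categories; near zero for flat (noisy) terms."""
    return float(np.mean((f_t - f_t.mean()) ** 2))

def sample_correlation(f_tl, f_tk):
    """Eq. (8): sample correlation coefficient between two term
    distributions, used as the redundancy (similarity) measure."""
    dl, dk = f_tl - f_tl.mean(), f_tk - f_tk.mean()
    return float((dl * dk).sum()
                 / (len(f_tl) * np.sqrt(sample_variance(f_tl))
                              * np.sqrt(sample_variance(f_tk))))

# a one-hot distribution over 20 categories hits the variance upper bound
print(sample_variance(np.eye(20)[0]))  # 0.0475 = (|C|-1)/|C|^2, cf. Section 7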
4.5 Clustering Algorithms
Now let us turn to the clustering algorithms used for the redundancy compression task. Our main approach in this research has been to adopt an agglomerative term clustering approach, disregarding the efficiency gains apparently offered by divisive clustering methods, as pointed out by Dhillon [6]. The reason is that our interest has been focused on studying the influence of the number of clusters built and its optimal value, rather than on algorithm efficiency aspects.
4.5.1 Initial Agglomerative Approach
The first clustering implementation conveyed has been inspired by the agglomerative hard clustering¹⁰ algorithm proposed by Baker [4]. The algorithm is simple and scales well to large vocabulary sizes, since instead of comparing the similarity of all pairs of words, it restricts the comparison to a smaller subset of size M (M being the final number of clusters desired). After the data set preprocessing and noisy terms filtering, the word vocabulary is ordered in decreasing variance order. Then, the algorithm initializes the M clusters to the M first words of the sorted list. It follows on by iteratively comparing the M clusters and merging the closest ones. Empty clusters are filled with the next words in the sorted list.
When merging occurs, the distribution of the new cluster becomes the weighted average of the distributions of its constituent words. For instance, when merging terms $t_l$ and $t_k$ into a same cluster, the resulting distribution function is:

$$f_{t_l \vee t_k}(c) = \frac{p(t_l)}{p(t_l) + p(t_k)}\, f_{t_l}(c) + \frac{p(t_k)}{p(t_l) + p(t_k)}\, f_{t_k}(c). \qquad (9)$$
This algorithm has been named Static window Hard clustering (SH clustering). Static window refers to the fixed M-dimensional window it is based on, while Hard clustering denotes the nonoverlapping nature of the clustering.
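The following toy sketch renders the SH procedure under simplifying assumptions (uniform term priors in the merge of (9), sample correlation as the similarity, dense NumPy inputs, no attention to efficiency); it is illustrative, not the authors' implementation:

import numpy as np

def sh_clustering(F, variances, M):
    """Static window Hard clustering sketch: a window of M clusters is
    initialized with the M highest-variance terms; the two most correlated
    clusters are repeatedly merged (Eq. (9), uniform priors assumed) and
    the freed slot is refilled with the next term of the sorted list."""
    order = list(np.argsort(variances)[::-1])          # decreasing variance
    win = [[F[i].astype(float), 1.0, [i]] for i in order[:M]]
    for nxt in order[M:]:
        best, pair = -2.0, (0, 1)
        for a in range(M):                             # find most correlated pair
            for b in range(a + 1, M):
                r = np.corrcoef(win[a][0], win[b][0])[0, 1]
                if r > best:
                    best, pair = r, (a, b)
        a, b = pair
        fa, wa, ma = win[a]
        fb, wb, mb = win[b]
        win[a] = [(wa * fa + wb * fb) / (wa + wb), wa + wb, ma + mb]  # Eq. (9)
        win[b] = [F[nxt].astype(float), 1.0, [nxt]]    # refill the emptied slot
    return [members for _, _, members in win]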
4.5.2 Dynamic Window Approach
A further agglomerative clustering algorithm has been implemented where the fixed M-dimensional Static window has been replaced by a Dynamic window scheme. The rationale of this procedure is to avoid forcing the merging of distant clusters due to the M-dimensional fixed size of the working window, especially when M is low.
The algorithm proceeds as the former Hard clustering procedure, except that the initial window size is set to an input value $W \geq M$. The window is iteratively expanded whenever no pair of clusters with intercluster distance lower than a certain threshold exists. In a subsequent step, when all vocabulary terms have been assigned to a cluster, the window is progressively contracted until its dimension reaches the number M of desired clusters. Toward this objective, the intercluster distance threshold is progressively incremented following an arithmetic progression (whose common difference has to be set as an input parameter). At each step, the merging of close clusters is performed.
[...] $p(c_i|\mathbf{d}')$, that is, the conditional probability that $c_i$ was selected by the source given that the document $\mathbf{d}'$¹² is received. If

$$p(c_i|\mathbf{d}', H) > p(c_l|\mathbf{d}', H), \qquad \forall l \neq i \qquad (11)$$

(where H denotes the overall hypothesis space), then the decoder should decide that the transmitted symbol was the category $c_i$. This constitutes the basis of a maximum a posteriori (MAP) or probabilistic decoder that is expressed as:

$$\hat{c}(\mathbf{d}') = \arg\max_{c_i}\, p(c_i|\mathbf{d}', H). \qquad (12)$$
Now, the posterior probability $p(c_i|\mathbf{d}', H)$ can be straightforwardly estimated by Bayesian inference, $p(c_i|\mathbf{d}', H) = \frac{p(\mathbf{d}'|c_i, H)\, p(c_i|H)}{p(\mathbf{d}'|H)}$. Given that the evidence $p(\mathbf{d}'|H)$ does not depend on category $c_i$, the classification criterion simplifies to the following expression, which constitutes the discriminative function of MAP categorizers:

$$\hat{c}_{MAP}(\mathbf{d}') = \arg\max_{c_i}\, p(\mathbf{d}'|c_i, H)\, p(c_i|H). \qquad (13)$$
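In practice, (13) is evaluated in the logarithmic domain to avoid numerical underflow when many per-term likelihoods are multiplied; a minimal sketch, with the likelihood models left abstract and all names illustrative:

import numpy as np

def map_decode(d_prime, log_likelihoods, log_priors):
    """Eq. (13) in log space: argmax_i [ln p(d'|c_i,H) + ln p(c_i|H)].
    log_likelihoods[i] is a callable returning ln p(d'|c_i, H)."""
    scores = [ll(d_prime) + lp for ll, lp in zip(log_likelihoods, log_priors)]
    return int(np.argmax(scores))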
5.2 Gaussian Assumption (Discriminant Analysis)
The Gaussian assumption is a classical modeling assumption heavily used in areas such as Signal Processing and Communication Systems but rarely applied in the field of ATC (see Section 5.2.3 for a discussion of this assertion).
The Gaussian model assumes that each category encoding characterizes a multivariate Gaussian or Normal Probability Density Function (PDF). A document $\mathbf{d}$ is then assumed to be a realization of an n-dimensional random vector $\mathbf{D}$ that is dependent on the category output $c_i$, with the following Gaussian PDF $\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:

$$f_{\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}(\mathbf{d}) = \frac{1}{(2\pi)^{n/2}\, |\boldsymbol{\Sigma}_i|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{d} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{d} - \boldsymbol{\mu}_i)}, \qquad (14)$$

where the mean vector $\boldsymbol{\mu}_i = E[\mathbf{D}|c_i]$ is an n-dimensional vector and the covariance matrix $\boldsymbol{\Sigma}_i = E[(\mathbf{D} - \boldsymbol{\mu}_i)(\mathbf{D} - \boldsymbol{\mu}_i)^T | c_i]$ is an $n \times n$-dimensional positive-definite¹³ matrix with positive determinant $|\boldsymbol{\Sigma}_i|$.
Now, the likelihood $p(\mathbf{d}'|c_i)$ can be expressed in the following terms, where $F_{\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}$ denotes the corresponding probability distribution function:¹⁴

$$p(\mathbf{d}'|c_i) = \lim_{\Delta \to 0} p\left(\mathbf{d}' \leq \mathbf{D} \leq \mathbf{d}' + \Delta \,\middle|\, c_i\right) = \lim_{\Delta \to 0} \left[ F_{\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}(\mathbf{d}' + \Delta) - F_{\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}(\mathbf{d}') \right] \approx f_{\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}(\mathbf{d}')\, \Delta, \quad \Delta \to 0. \qquad (15)$$

The factor $\Delta$ appears in the numerator of (15) for each category. Consequently, it does not affect classification, and thus, the MAP criterion translates to the following discriminative function:
11. Soft or fuzzy clustering allows a term to belong to more than one
cluster. Clusters may then overlap.
12. Note that $\mathbf{d}'$ corresponds to the document signal vector after the filtering and compression processes described in Section 3 (and thus, properly corresponds to $\mathbf{d}''$).
13. Recall that a positive-definite matrix is a symmetric matrix with all its
eigenvalues positive. A positive-definite matrix is always invertible or
nonsingular. Its determinant is always positive.
14. By definition, the probability distribution $F_{\mathbf{D}}$ and density $f_{\mathbf{D}}$ functions are related: $f_{\mathbf{D}}(\mathbf{x}) \triangleq \frac{\partial^n F_{\mathbf{D}}(\mathbf{x})}{\partial x_1 \cdots \partial x_n}$.
$$\hat{c}(\mathbf{d}') = \arg\max_{c_i} \left\{ \frac{1}{|\boldsymbol{\Sigma}_i|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{d}' - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{d}' - \boldsymbol{\mu}_i)}\, p(c_i|H) \right\}. \qquad (16)$$
Given that the logarithmic function is a monotonically increasing function, (16) is normally expressed as:

$$\hat{c}(\mathbf{d}') = \arg\max_{c_i} \left\{ \ln p(c_i|H) - \frac{1}{2} \ln |\boldsymbol{\Sigma}_i| - \frac{1}{2} (\mathbf{d}' - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{d}' - \boldsymbol{\mu}_i) \right\}. \qquad (17)$$
5.2.1 Quadratic Discriminant Analysis (QDA)
ATC being a supervised classification scheme, both $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ can be estimated from the training set of documents $\mathcal{D}^i_{train}$ that belong to category $c_i$. The discriminative functions in (16) and (17) describe a quadratic shape for each category, and the decision frontiers are also quadratic.
5.2.2 Linear Discriminant Analysis (LDA)
Let us suppose that the covariance matrices for all categories are identical. This constitutes the homoscedastic simplifying assumption. The discriminant function (17) then simplifies to:

$$\hat{c}(\mathbf{d}') = \arg\max_{c_i} \left\{ \ln p(c_i|H) - \frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \mathbf{d}' \right\}. \qquad (18)$$

This corresponds to a linear separative surface (i.e., a hyperplane).
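Equation (18) transcribes directly into code; in the sketch below (illustrative names), mus holds the category mean vectors and sigma the shared covariance matrix:

import numpy as np

def lda_discriminants(d_prime, mus, sigma, log_priors):
    """Eq. (18): linear discriminant scores under the shared-covariance
    (homoscedastic) assumption; argmax over the scores decodes the category."""
    inv = np.linalg.inv(sigma)
    return np.array([lp - 0.5 * mu @ inv @ mu + mu @ inv @ d_prime
                     for mu, lp in zip(mus, log_priors)])

# decoded category index: int(np.argmax(lda_discriminants(d, mus, Sigma, log_priors)))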
5.2.3 Applying Discriminant Analysis to ATC
The first problem that appears when trying to apply discriminant analysis (DA) to ATC is the so-called $n \gg N_i$ problem.¹⁵ The number of variables (indexing terms or symbols) is typically extremely high (tens or hundreds of thousands) while the number of sample documents is moderately small. Moreover, note that even if this problem did not exist, the computation of extremely high-dimensional covariance matrices would not be feasible.

As a result of these limitations, DA can only be envisaged after a preliminary indexing terms space reduction phase.
Among the pioneers in using DA in ATC were Schütze et al. [12], who applied an LDA classifier to the routing task in 1995.
More recently, in 2006, Li et al. [13] used discriminant analysis for multiclass classification. Their experimental investigation showed that LDA reaches accurate performance, comparable to that offered by SVM, with a neat improvement in terms of simplicity and time efficiency, both in the learning and the classification phases.¹⁶
5.3 Independence Assumption
The independence or Naive Bayes assumption, which states that terms are stochastically independent, can be formulated over the Gaussian MAP categorizer. Under this statement, the covariance matrix is the diagonal matrix of variances:

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}. \qquad (19)$$
The covariance matrix determinant simply is $|\boldsymbol{\Sigma}| = \sigma_1^2 \sigma_2^2 \cdots \sigma_n^2$, and the inverse covariance matrix is straightforwardly deduced. The estimation of $\boldsymbol{\Sigma}$ is, computationally speaking, drastically simplified, since it reduces to the computation of the n variances.
It can be easily shown that, under the independence assumption, the discriminative function (16) simplifies to the product of the univariate Gaussian PDFs in each of the document indexing directions:
$$\hat{c}(\mathbf{d}') = \arg\max_{c_i} \left\{ f_{\mathcal{N}(\mu_{i1}, \sigma_{i1}^2)}(d'_1) \cdots f_{\mathcal{N}(\mu_{in}, \sigma_{in}^2)}(d'_n)\, p(c_i|H) \right\}. \qquad (20)$$
5.3.1 Quadratic GNB
The quadratic GNB obeys (20). The category-dependent parameters can be estimated by the arithmetic mean, $\hat{\mu}_{il} = \frac{1}{N_i} \sum_{\mathbf{d}' \in \mathcal{D}^i_{train}} d'_l$, and the sample variance, $\hat{\sigma}^2_{il} = \frac{1}{N_i} \sum_{\mathbf{d}' \in \mathcal{D}^i_{train}} \left( d'_l - \hat{\mu}_{il} \right)^2$.
5.3.2 Linear GNB
Under the homoscedastic hypothesis, the variance is considered to be category independent. It can be estimated as before by substituting $N_i$ by $N$ and $\mathcal{D}^i_{train}$ by $\mathcal{D}_{train}$. Under this simplification, (20) defines a linear GNB as expressed in (18).
5.3.3 White Noise GNB
A further simplifying assumption may consider the variance to be one and the same for all categories and variables. This generates the, by us baptized, White Noise GNB (or simply WN-GNB)¹⁷ categorizer. It can be easily shown that, after applying the logarithmic function, discriminative function (20) simplifies to:

$$\hat{c}(\mathbf{d}') = \arg\max_{c_i} \left\{ -\frac{1}{2\sigma^2} \sum_{l=1}^{n} \left( d'_l - \mu_{il} \right)^2 + \ln p(c_i|H) \right\}. \qquad (21)$$

In the case of equiprobable categories, the WN-GNB categorizer defined in (21) reduces to a (euclidean) minimum distance categorizer.
5.3.4 A New Hybrid GNB Categorizer Family
Document data sets are characterized by extremely sparse matrices (even after the resampling envisaged in Section 4). The majority of the variables (i.e., indexing terms) are not [...] $d'_l$, which (when summed up repeatedly) can end by setting to nil the overall probability.
With the idea of mitigating the effects of the sparsity existing in ATC, we have envisaged setting a variance lower bound.¹⁸
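A minimal sketch of a quadratic GNB with such a variance lower bound follows (the class, its dense-input assumption, and the floor value are our illustrative choices, not the paper's exact hybrid scheme):

import numpy as np

class QuadraticGNB:
    """Quadratic GNB of Section 5.3.1 with the variance lower bound of
    Section 5.3.4; dense input is assumed, floor value is illustrative."""
    def __init__(self, var_floor=1e-3):
        self.var_floor = var_floor

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_priors_ = np.log([np.mean(y == c) for c in self.classes_])
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        var = np.array([X[y == c].var(axis=0) for c in self.classes_])
        self.var_ = np.maximum(var, self.var_floor)  # mitigates sparsity
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        # log of Eq. (20): sum of univariate Gaussian log-densities + log prior
        ll = -0.5 * (np.log(2 * np.pi * self.var_)
                     + (X[:, None, :] - self.mu_) ** 2 / self.var_).sum(axis=2)
        return self.classes_[np.argmax(ll + self.log_priors_, axis=1)]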
6 EXPERIMENTAL SCENARIO
Before entering into this section, note that some aspects of the adopted experimental scenario, such as the handling of the overlapping issue in Reuters-21578 and the effectiveness measure used, are reviewed in some detail, since they depart from commonly followed solutions [1].
6.1 Standard Collections
Two of the most widely used standard collections, the
20 Newsgroups (NG) and the Reuters-21578 (RE) data sets,
form our experimental scenario. In our experiments, both
collections have been preprocessed by removing stopwords
(from the Weka [14] stop list) and nonalphabetical words.
Infrequent terms, occurring in less than four documents or appearing less than four times in the whole data set, have also been filtered out.
6.1.1 20 Newsgroups
The 20 Newsgroups data set is a collection of approximately
20,000 newsgroup documents, partitioned (nearly) evenly
across 20 different newsgroups, each corresponding to a
different topic. We used the bydate version of the data set
maintained by Jason Rennie,
19
which is sorted by date into
training (60 percent) and test (40 percent) sets.
6.1.2 Reuters-21578
The Reuters-21578, Distribution 1.0 test collection²⁰ is formed by 21,578 news stories that appeared in the Reuters newswire in 1987, classified (by human indexers) according to 135 thematic categories, mostly concerning business and economy. In our experiments, we have used the most common subset of Reuters-21578, the ModApte training/test partition, which only considers the set of 90 categories with at least one positive training example and one positive test example. It results in a partition of 7,769 training documents and 3,019 test documents.
Several factors characterize the Reuters-21578 data set [15], notably: categories are overlapping (i.e., a document may belong to more than one category) and the distribution across categories is highly skewed (i.e., some categories have very few labeled documents, even only one, while others have thousands).
6.1.3 Tackling the Category Overlapping Issue
Reuters-21578 data set classification is inserted in a multilabel categorization frame, which is, by nature, out of the single-label categorization scheme assumed in most ATC research works, including this one. The category overlapping issue can be tackled in three different ways:

. By deploying the K nonexclusive categories into all the possible $2^K$ category combinations.
. By assuming that categories are independent, and thus, reverting the K-category classification problem into K independent binary classification tasks.
. By ignoring multilabeled documents, which constitute approximately 29 percent of all documents in Reuters-21578.

Classically, the category independence alternative has been implicitly undertaken in most research [15]. Minority authors such as [16] have opted to ignore multilabeled documents. In our work, we have opted to deploy the K = 90 Reuters-21578 categories into all possible $2^K = 2^{90}$ combinations, which results in an impressive number of the order of $10^{27}$ (a sketch of this deployment follows the list below). The reasons for our decision are basically:

. Our conceptual framework, both for document sampling and decoding, is based on a multiclass (not binary) scheme.
. By deploying categories, we avoid assuming any independence hypothesis.
. 379 (out of more than $10^{27}$!) is the actual number of category combinations that have at least one document representative in the training set (out of these 379, only 126 category combinations are represented in the test set).
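The deployment itself amounts to mapping each document's label set to a single combined label; a minimal sketch (illustrative names):

from collections import Counter

def deploy_combinations(label_sets):
    """Turn each document's set of Reuters category labels into one
    'combination' category (a frozenset), reverting the multilabel
    problem to a single-label multiclass one."""
    combos = [frozenset(labels) for labels in label_sets]
    return combos, Counter(combos)

# toy usage: three documents, two distinct category combinations
combos, counts = deploy_combinations([{"grain", "wheat"}, {"earn"}, {"grain", "wheat"}])
print(len(counts))  # 2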
6.2 Effectiveness Measures
6.2.1 Confusion Matrix
In a single-label multiclass classification scheme, the classification process can be visualized by means of a confusion matrix A. Each column of the matrix represents the number of documents in a predicted category, while each row represents the documents in an actual category. In other words, referring to Table 1, $a_{11}$ would be the number of documents of category $c_1$ that are correctly classified under $c_1$, while $a_{12}$ corresponds to documents of
TABLE 1
An Example of Confusion Matrix
18. Note that in practice, the variance has to be set to a minimal value; otherwise, the univariate normal PDF expressions could not be computed.
19. http://people.csail.mit.edu/jrennie/20Newsgroups/.
20. The Reuters-21578 corpus is freely available for experimentation purposes from http://www.daviddlewis.com/resources/testcollections/reuters21578.
$c_1$ incorrectly classified into $c_2$, and $a_{21}$ corresponds to documents of $c_2$ incorrectly classified into $c_1$.
6.2.2 Precision and Recall Microaverages
Precision and recall²¹ are common measures of effectiveness in ATC [1]. Overall measures for all categories are usually obtained by microaveraging:²²

$$\hat{\pi} = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{|\mathcal{C}|} TP_i}{\sum_{i=1}^{|\mathcal{C}|} \left( TP_i + FP_i \right)}, \qquad (22)$$

$$\hat{\rho} = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{|\mathcal{C}|} TP_i}{\sum_{i=1}^{|\mathcal{C}|} \left( TP_i + FN_i \right)}. \qquad (23)$$
It can be easily seen that, in the single-label multiclass classification model, the former expressions are equivalent. They both result from the quotient of the sum of the diagonal components of matrix A and the sum of all elements of matrix A, which happens to be the overall classification accuracy: the quotient of correct classifications (numerator) and total correct and incorrect classifications (denominator).
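For illustration, this collapse can be verified in a few lines (names are ours):

import numpy as np

def micro_accuracy(y_true, y_pred, n_categories):
    """Builds the confusion matrix A (rows = actual, columns = predicted);
    in the single-label multiclass case, micro-averaged precision (22) and
    recall (23) both collapse to trace(A) / sum(A), i.e., accuracy."""
    A = np.zeros((n_categories, n_categories), dtype=int)
    for t, p in zip(y_true, y_pred):
        A[t, p] += 1
    return np.trace(A) / A.sum()

print(micro_accuracy([0, 1, 1, 2], [0, 1, 2, 2], 3))  # 0.75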
6.3 Technological Solutions
The implementations undertaken in this research are based on the Weka 3 Data Mining Software [14]. In particular, we have used: NaiveBayesMultinomial, which is the MNB categorizer implemented in Weka, and Weka LibSVM (WLSVM), which is the integration of LibSVM into the Weka environment.²³
7 EXPERIMENTAL RESULTS ON NOISY
TERMS FILTERING
This section provides empirical evidence that the novel
noisy terms filtering ensures a beneficial feature space
reduction.
7.1 20 Newsgroups
7.1.1 Experimental Scenario
In the noisy terms filtering scheme that was designed, the setting of the sample variance threshold $t_{\hat{\sigma}^2}$ (from this point on, simply noted as t) has to be heuristically tuned. The only thing we know a priori is that it is bounded in the interval $\left[0, \hat{\sigma}^2_{max}\right] = \left[0, \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}\right]$, which results in [0, 0.0475] for $|\mathcal{C}| = 20$.
We have graphically represented in Fig. 5 the effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 20, 100, and 5,000 clusters, respectively. The classifiers used have been the classic MNB and SVM. The clustering algorithm tested is Static window Hard clustering (SH clustering).²⁴ The similarity measure used is the sample correlation coefficient in both cases. The rest of the clustering parameters are the default ones (see Section 8).
7.1.2 Analysis of Results
The curves corresponding to MNB and SVM classification in Fig. 5 have parallel evolutions. They present a mountain shape with a maximum in the t range of [0.005, 0.02]. Note that this range assures a classification accuracy decrease lower than 5 percent for the 20-cluster curve (which happens to be our target clustering, as we will see in Section 8). Outside this range, the decrease in classification accuracy is considered to be too high.
7.2 Reuters-21578
7.2.1 Experimental Scenario
As in the case of 20 Newsgroups, the dispersion measure used has been the sample variance. A priori, we only know that the variance threshold is bounded in the interval $\left[0, \hat{\sigma}^2_{max}\right] = \left[0, \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}\right]$, which results in [0, 0.002631] for $|\mathcal{C}| = 379$.
We have graphically represented in Fig. 6 the effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 200, 500, and 1,000 clusters, respectively. The classifiers used have been the classic MNB and SVM, and the clustering algorithm tested, SH clustering. The similarity measure used is the sample correlation coefficient.