
A Communication Perspective on Automatic Text Categorization

Marta Capdevila and Oscar W. Márquez Flórez, Member, IEEE
Abstract: The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. In particular, from a text categorization system point of view, the information encoded by a document is the topic or category it belongs to. Following this initial intuition, a theoretical framework is developed where Automatic Text Categorization (ATC) is studied under a Communication System perspective. Under this approach, the problematic indexing feature space dimensionality reduction has been tackled by a two-level supervised scheme, implemented by a noisy terms filtering and a subsequent redundant terms compression. Gaussian probabilistic categorizers have been revisited and adapted to the concomitance of sparsity in ATC. Experimental results pertaining to the 20 Newsgroups and Reuters-21578 collections validate the theoretical approaches. The noise filter and redundancy compressor allow an aggressive term vocabulary reduction (reduction factor greater than 0.99) with a minimum loss (lower than 3 percent) and, in some cases, a gain (greater than 4 percent) in final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by the state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).

Index Terms: Data communications, text processing, data compaction and compression, clustering, classifier design and evaluation, feature evaluation and selection.

1 INTRODUCTION
A deep parallelism may be established between a Communication System and an Automatic Text Categorization (ATC) scheme, since both disciplines deal with the transmission of information and its reliable recovery. The establishment of this novel simile makes it possible to tackle, from a well-founded Communication theoretical point of view, the over-dimensioned document representation space that is heavily redundant with respect to the classification task and typically turns problematic for many categorizers [1] in ATC.^1 The main objective of our research has been to investigate how and to what extent the document representation space can be compressed, and what the effects of this compression are on the final classification. The underlying idea is to set a first step toward an optimal encoding of the category, carried by the document vectorial representation, in view of both limiting the greedy use of resources issued from the high-dimensionality feature space and reducing the effects of overfitting.^2

Additionally, our research also aims at showing how the document decoding (or classification task) can take advantage of common Gaussian assumptions made in the Communication System discipline but largely ignored in ATC.
This paper is structured as follows: In Section 2, ATC is
briefly reviewed, and in Section 3, the Communication
systems perspective is established. Sections 4 and 5 explain
the theoretical basis of the proposed Document Sampling
and Document decoding. Sections 6, 7, 8, and 9 are
dedicated to experimental results, and finally, Section 10
presents the conclusions.
2 AUTOMATIC TEXT CATEGORIZATION
ATC is the task of assigning a text document to one or more predefined categories or classes,^3 based on its textual content. It corresponds to a supervised (not fully automated) process, where categories are predefined by some external mechanism (normally human), which establishes, at the same time, a set of already labeled examples that form the training set. Classifiers are generated from those training examples, by induction, in the so-called learning phase. This forms the machine learning paradigm (as opposed to the knowledge-engineering approach) of ATC, which has been predominant since the exponential universalization of electronic textual information in the 1990s [1].
It is further generally assumed that categories are
exclusive (also known as nonoverlapping), meaning that a
document can only belong to a single category (single-label
categorization), as this scenario has been shown [1] to be
more general than the multilabel case.
1. A sound exception can be established for state-of-the-art SVM
categorizer, which arguably [2] is well adapted to the typical high-
dimensionality representation space of ATC and which benefits from the
improved performance of recent training algorithms [3].
2. In general terms, the problem of overfitting results from the
characterization of reality with too many parameters, which makes the
modeling too specific and poorly generalizable.
3. Categorization and classification designations are used indiscriminately
in this text, both indicating the described supervised process.
2.1 Document Vectorial Representation
The first step toward any compact document representation
is the definition of the indexing features. The indexing features,
also called terms, are the minimal meaningful constitutive
units (a common choice is to use words). The set of different
terms that appear in the collection of training documents
forms the vocabulary or alphabet of terms. Once the alphabet is chosen, the text document can be represented in
the terms space. In this indexing process, the sequentiality
or order of terms in the text is commonly lost. This is known
as the bag-of-words approach [1].
The problem is that the indexing vocabulary typically reaches tens or hundreds of thousands of terms. Working in such a high-dimensional space commonly turns problematic. This is why, before initiating any classification task, a
filtering designed to reduce the term space dimensionality
is usually applied. There are basically two approaches to
dimensionality reduction: 1) Term selection, where a subset of
features is selected out of the original set and 2) Term
extraction, where chosen features are obtained by combina-
tion of the original features. In the latter approach,
Distributional Clustering [4], [5], [6], [7] is a supervised
clustering technique that has been shown to be very
effective at reducing the document indexing space with
residual loss in categorization accuracy.
2.2 Common Categorizers
In the following, we will shortly review two of the state-of-
the-art classifiers used in ATC, which will be extensively
referenced in our experiments.
2.2.1 Multinomial Naive Bayes (MNB)
MNB is a probabilistic categorizer that assumes a document is a sequence of terms, each of them randomly chosen from the term vocabulary, independently from the rest of the term events in the document. Despite its oversimplified Naive Bayes basis, MNB achieves good performance in practice [8].
2.2.2 Support Vector Machines (SVMs)
SVM is a binary classifier that attempts to find, among all the surfaces that separate the positive from the negative training examples, the decision surface that has the widest possible margin (the margin being the smallest distance from any positive or negative example to the decision surface). SVM is particularly well adapted to ATC [2] and stands as one of the best-performing categorizers [9].
3 A COMMUNICATION INTERPRETATION ON ATC
A communication system [10] has the basic function of transferring information (i.e., a message) from a source to a destination. There are three essential parts in any communication system: the encoder/transmitter, the transmission channel, and the receiver/decoder. The encoder/transmitter processes the source message into the encoded and transmitted messages. The transmission channel is the medium that bridges the distance from source to destination. Every channel introduces some degree of undesirable effects such as attenuation, noise, interference, and distortion. The receiver/decoder processes the received message in order to deliver it to the destination. A simplified model of a classical digital communication system is represented in Fig. 1.
In its raw form, a text document is a string of characters. Typically in ATC, a bag-of-words approach is adopted, which assumes that the document is an order-ignored sequence of words^4 that can be represented vectorially. It is further assumed that the vocabulary used by a given document depends on the category or topic it belongs to. The ATC scheme can be modeled by a communication system, as shown in Fig. 2.
3.1 The Encoder/Transmitter Model
The generation of a document is determined by a Category encoder, which is a random selector of words, modulated by the category $C$ (i.e., the selection of words is a random event, different and characteristic of each category). For each category input $c_i$, the Category encoder is characterized by a distinct alphabet $\mathcal{T}_i$ (which is the subset of $\mathcal{T}$ that contains the words used by the documents of $c_i$) and the conditional probabilities of each element of this alphabet, $\{p(t_1|c_i), p(t_2|c_i), \ldots, p(t_j|c_i), \ldots\}$. Fig. 3 illustrates different example alphabets $\mathcal{T}_i$.
Actually, the Category encoder generates a sequence of outcomes^5 that are the actual words that (partially) form the document. In the communication nomenclature, each word could be a symbol. Note that the length of the sequence is random (i.e., not fixed, since some documents are short and others long, randomly) but presumably category independent. And, finally, let us indicate that the input $c_i$ is itself the value of the outcome of another random event, the category $C$, characterized by an alphabet $\mathcal{C} = \{c_1, c_2, \ldots, c_{|\mathcal{C}|}\}$^6 with probabilities $p_C = \{p(c_1), p(c_2), \ldots, p(c_{|\mathcal{C}|})\}$.

Fig. 1. A classical digital communication system simplified model.

Fig. 2. ATC modeled by a communication system.

4. Words and terms are used indiscriminately in this text to designate the meaningful units of language.
5. The outcomes may be considered independent, as in the Naive Bayes approaches.
The Document builder measures the degree of contribution^7 of each word in the sequence generated by the Category encoder and establishes a $|\mathcal{T}|$-dimensional vector or codeword with all the obtained weights. This process is commonly known in ATC as the indexing of the document. The final vectorial representation of the document, noted $\vec{D}$ in our scheme, constitutes the vector signal that is transmitted over the channel.
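To make this generative view concrete, the following minimal Python sketch pairs a category-conditioned random selector of words with a Document builder that outputs the TF codeword playing the role of the transmitted signal. The toy vocabulary, the conditional probabilities, and the function names are illustrative assumptions, not part of the actual system.

# Illustrative sketch of the encoder/transmitter model: a category modulates
# a random word selector, and the Document builder turns the generated word
# sequence into a TF codeword over the alphabet T.
import random
from collections import Counter

VOCAB = ["price", "market", "goal", "match", "election", "vote"]

# p(t | c): category-conditional word probabilities (toy values)
WORD_GIVEN_CAT = {
    "economy": {"price": 0.5, "market": 0.4, "vote": 0.1},
    "sports":  {"goal": 0.5, "match": 0.4, "price": 0.1},
}

def generate_document(category, length, rng=random):
    """Category encoder: draw `length` independent word outcomes from p(t|c)."""
    words, probs = zip(*WORD_GIVEN_CAT[category].items())
    return rng.choices(words, weights=probs, k=length)

def build_codeword(word_sequence):
    """Document builder: |T|-dimensional TF vector (the transmitted signal)."""
    counts = Counter(word_sequence)
    return [counts.get(t, 0) for t in VOCAB]

if __name__ == "__main__":
    doc = generate_document("sports", length=12)
    print(doc)
    print(build_codeword(doc))  # TF codeword over VOCAB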
The miscouplings between the ideally generated documents and the actual documents (i.e., the introduction of lexicon borrowed from the vocabulary of other categories, word repetitions, etc.) are modeled by undesirable effects introduced by the channel, namely Noise, Intersymbol interference, and Channel distortion. Noise refers to random and unpredictable variations superimposed on the transmitted signal. Channel distortion is a perturbation of the signal due to the distorting response of the channel. And, finally, Intersymbol interference is a form of distortion caused by the previously transmitted symbols.
The received signal d is the actual document that we
manipulate. The role of the receiver/decoder is to decode the
category out of the received document d (i.e., to perform the
document classification).
3.2 The Receiver/Decoder Model
Now, the problem is that, typically in ATC, the alphabet of symbols $\mathcal{T}$ has an extremely high dimensionality. Many words are semantically equivalent or category-related (i.e., the alphabets $\mathcal{T}_i$ are extensive) and a large number of others are not discriminative of any category in particular (i.e., the intersection between the sets $\mathcal{T}_i$ is large); see Fig. 3 for a visual perception of this. The alphabet of symbols $\mathcal{T}$ may be said to carry a high degree of redundancy and noise.

From a communication perspective, working with such a redundant and noisy alphabet of symbols $\mathcal{T}$ implies a suboptimal encoding of the category $C$ that generates over-dimensioned codewords. Apart from representing a waste of resources in terms of channel capacity^8 and processing economy, the over-dimensionality problem is not insignificant, since it affects the Category decisor task. From this point of view, we may wish to eliminate noise and redundancy by filtering and compressing (ideally, under a lossless compression) the alphabet of symbols $\mathcal{T}$.
This is what the blocks Prefilter, Noisy terms filter, and Redundant terms compressor in Fig. 4 basically aim at. More precisely, the Prefilter is a low-level filter which typically includes removal of stopwords (i.e., articles, conjunctions, etc.), infrequent words, nonalphabetical words, etc. The Noisy terms filter eliminates words that are noninformative (i.e., nondiscriminative) of the category variable. And, finally, the Redundant terms compressor clusters terms that convey similar information about the category.

The resulting alphabet of symbols is a lower dimensional set of less noisy and less redundant features (i.e., combinations of terms) that provides a more optimal sampling space for documents. In fact, the space of symbols is transformed so that the final documents, seen as codewords, are characterized by 1) being as short as possible, 2) having as little noise as possible, and 3) containing as much information as possible.
Back to Fig. 2, the document filter applies the prefiltering and noisy terms filtering just described to the received document $d$. The document is, thus, finally represented in a lower dimensional space $\mathcal{T}' \subset \mathcal{T}$, resulting in $d'$.

The document sampler projects the filtered document $d'$ into the new space of features (the cluster alphabet), producing the document representation $d''$. This projection process implies a new document quantization. In the case of a TF indexing, quantization simply means adding up the weights of the original words of a same cluster.

The category decisor has the task of decoding the document. It is the actual supervised classifier that has previously undergone a learning phase which, for simplicity reasons, is not reflected in Fig. 2.
3.3 Further Remarks
Several interesting issues can be drawn from the communication analysis performed upon ATC. The first one is that the document may be seen as a vector signal that encodes the category source information. The signal space is conformed by the alphabet of terms $\mathcal{T}$ defined by the document collection. The main issue of this encoding scheme is its high suboptimality, both because of the high dimensionality of the signal space ($|\mathcal{T}| \gg |\mathcal{C}|$) and because of the fact (directly related to the latter) that a same category may be encoded by many different codewords.

Fig. 3. Venn diagram of the alphabets of terms of different categories.

Fig. 4. Supervised term alphabet $\mathcal{T}$ reduction process.

6. $|\mathcal{C}|$ denotes the size or cardinality of $\mathcal{C}$.
7. Classically, the contribution of each word can be measured by either a binary, a term frequency (TF), or a term frequency-inverse document frequency (TFIDF) weighting scheme, among others [1].
8. The concept of channel capacity can be assimilated to the concept of storage capacity.
Our work directly tackles the document-space optimi-
zation. Inspired by the communication simile established,
we ideally pursue the goal of extracting (up to the extent it
may be practically possible) an orthonormal basis for the
signal space. Under this inspiration, we have designed a
term alphabet transform (i.e., filtering and compression)
that improves the optimality of the category encoding
conveyed by the resampled documents. This should result
in a better utilization of storage and processing resources
as well as, hopefully, facilitate the document decoding or
classification task.
Note that the framework of the problem, as it has been
initially established, is not to perform a coding design and
then try to adapt the document representations to it. Instead,
we have opted to optimize, to the extent it may be possible,
the signal space in the hope of improving classification.
The document resampling model is further developed
in Section 4, while Section 5 deepens the document
decoding aspects.
4 DOCUMENT SAMPLING
4.1 Document-Space Analysis
In their vectorial representation, documents are represented
in the space of terms, and thus, following the communica-
tion simile established in Section 3, terms have to be
thought as being the basis functions of the document space.
The crux of the document-space analysis is how to represent terms by means of a function. What information do we have about terms? How could we characterize them?
The answer to these questions resides in the fact that in a
supervised scheme such as ATC, we are given a set of
prelabeled documents. Based on the latter, terms can be
expressed in terms of the information they convey on
categories. Once terms have been properly characterized by
a function, the notion of the orthogonality and redundancy
between them (basis functions of the document space)
can be pursued.
4.2 Distributional Representation of Terms
As previously expressed in Section 3, the generation of a
text document may be modeled as a random selection of
terms dependent on category C. In other words, the
probability of appearance of a term in a document depends
on the category the document belongs to. Terms can be
understood as the outcomes of a r.e. T that is mutually
dependent with the category r.e. C. The term is the
observable data while the category is the unknown parameter.
Thereby, any term $t_j$ can be characterized as a distributional function $f_{t_j}$ over the space of categories $\mathcal{C}$:

$$f_{t_j} : \mathcal{C} \to [0, 1], \qquad c \mapsto f_{t_j}(c). \qquad (1)$$
An intuitive alternative (but not the only one) for this distributional function is the conditional probability mass function (PMF) $p(C|t_j)$ [4]. Note that the conditional probabilities $p(c_i|t_j)$ are not known, but in a supervised scheme such as ATC, they can be approximated from the training set of documents $D_{train}$:

$$\hat{p}(c_i|t_j) \approx \frac{\#(t_j, D_{train}^i)}{\#(t_j, D_{train})}, \qquad (2)$$

where $\#(t_j, D_{train}^i)$ is the number of times the term $t_j$ appears in all training documents belonging to category $c_i$, and $\#(t_j, D_{train})$ is the number of times the term $t_j$ appears in the whole training collection.
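As an illustration of how the estimate in (2) can be obtained in practice, the following minimal Python sketch computes the distributional representation of every term from a labeled training set by simple count ratios. The data layout and the function name are illustrative assumptions.

# Minimal sketch of estimating f_{t_j} = p(C | t_j) from labeled documents via (2).
from collections import Counter, defaultdict

def term_category_distributions(docs, labels, categories):
    """docs: list of token lists; labels: parallel list of category names.
    Returns {term: [p(c_1|t), ..., p(c_|C||t)]} estimated by count ratios."""
    counts = defaultdict(Counter)          # counts[term][category] = #(t, D_train^i)
    for tokens, cat in zip(docs, labels):
        for t in tokens:
            counts[t][cat] += 1
    dists = {}
    for t, per_cat in counts.items():
        total = sum(per_cat.values())      # #(t, D_train)
        dists[t] = [per_cat.get(c, 0) / total for c in categories]
    return dists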
4.3 Alphabet of Symbols $\mathcal{T}$ Filtering and Compression

As announced in Section 3, we can envisage reducing the symbol alphabet $\mathcal{T}$ in two distinct directions:
1. Noisy terms filtering: Terms that have a flat function $f_{t_j}$, that is,
$$f_{t_j}(c_i) \approx f_{t_j}(c_k), \qquad \forall i, k, \qquad (3)$$
do not convey information on the target category. These terms, from a communication perspective, are noisy and should be eliminated. A noise filter should discriminate between informative terms and noninformative or noisy terms. A dispersion measure needs to be defined, as well as a selection threshold upon it. Note that the threshold will have to be set experimentally.
2. Redundant terms compression: Terms that convey similar information on the target category random event, that is,
$$f_{t_j}(c_i) \approx f_{t_k}(c_i), \qquad \forall i, \qquad (4)$$
are redundant. An (ideally) lossless data compression scheme reduces redundancy by clustering terms with similar distributional representations. A similarity measure needs to be defined.
4.4 Dispersion and Similarity Measures
As just seen, to perform the document compression, we need to establish measures of the information conveyed by a term and of the redundancy between terms. But which metrics should we use? The answer is not straightforward. We may use distinct dispersion and similarity measures depending on different interpretations of what the distributional term representation $f_{t_j}$ is.
4.4.1 The PMF Interpretation
Distributional functions $f_{t_j}$ are commonly PMFs. Information Theory (IT) [11] provides useful measures to quantify the information conveyed by random events (e.g., entropy) and the similarity between PMFs (e.g., the Kullback-Leibler and Jensen-Shannon divergences).
The IT approach has been commonly adopted by other
works on Distributional Clustering [4], [5], [6]. We have
chosen to follow a new and unexplored direction, the
discrete signal interpretation, which is rooted in Communica-
tion and Signal Processing related concepts, coherently with
the proposed general framework setup.
4.4.2 The Discrete Signal Interpretation
A discrete signal is a set of $|\mathcal{C}|$ measurements of an unknown but latent random variable (r.v.). The rationale behind a discrete signal interpretation of $f_{t_j}$ is that we are interested in analyzing the general shape of the distributions. By modeling those distributions with a latent random variable, small differences between distributions are assimilated by the random nature of the signal.
. A dispersion measure: sample variance. The variance is a measure of the statistical dispersion of an r.v. For a given discrete r.v. $X$ with PMF $p_X$ defined on $\mathcal{X}$, it is expressed as:
$$\sigma_X^2 \triangleq E\big[(X - \mu_X)^2\big] = E[X^2] - \mu_X^2, \qquad (5)$$
where $E[g(X)] = \sum_{x \in \mathcal{X}} g(x)\, p_X(x)$ denotes the expectation operator and $\mu_X = E[X]$ the expected mean of $X$.
Now, a discrete signal is a set of measurements of an r.v. The underlying PMF is unknown, and thus, the expectation operator $E[\cdot]$ cannot be computed. The variance of a discrete signal $f_{t_j}$, also called the sample variance, is thus obtained by substituting the expectation operator in (5) by the arithmetic mean, as follows:^9
$$s^2(f_{t_j}) \triangleq \frac{1}{|\mathcal{C}|} \sum_{c_i \in \mathcal{C}} \big( f_{t_j}(c_i) - m(f_{t_j}) \big)^2, \qquad (6)$$
where $m(f_{t_j}) = \frac{1}{|\mathcal{C}|} \sum_{c_i \in \mathcal{C}} f_{t_j}(c_i)$ is the arithmetic mean of $f_{t_j}$.
The sample variance is bounded in the interval $\big[0, \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}\big]$. Noninformative terms are those with low dispersion among categories (i.e., $f_{t_j}$ with a flat distribution). They are thus characterized by their low variance.
. A similarity measure: sample correlation coefficient. Correlation refers to the departure of two variables from independence. Pearson's correlation coefficient in (7) is the most widely used measure of the relationship between two random variables $X$ and $Y$. It evaluates the degree to which both variables are linearly associated (it equals 0 if they are statistically independent and, at the other extreme, $\pm 1$ if they are linearly dependent):
$$\rho_{X,Y} \triangleq \frac{1}{\sigma_X \sigma_Y} E\big[(X - \mu_X)(Y - \mu_Y)\big]. \qquad (7)$$
As with the variance, in the case of two discrete signals $f_{t_j}$ and $f_{t_k}$, the correlation coefficient is expressed by its sample version (a numerical sketch of both measures is given after this list):
$$r(f_{t_j}, f_{t_k}) \triangleq \frac{\sum_{i=1}^{|\mathcal{C}|} \big( f_{t_j}(c_i) - m(f_{t_j}) \big)\big( f_{t_k}(c_i) - m(f_{t_k}) \big)}{|\mathcal{C}|\, s(f_{t_j})\, s(f_{t_k})}. \qquad (8)$$
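The following minimal Python sketch computes both measures, the biased sample variance of (6) and the sample correlation coefficient of (8), on term distributions represented as plain lists. The function names and the toy distributions are illustrative assumptions.

# Sample variance (6) and sample correlation coefficient (8) for term distributions.
import math

def sample_variance(f):
    """Biased sample variance of a term distribution f over |C| categories."""
    m = sum(f) / len(f)
    return sum((x - m) ** 2 for x in f) / len(f)

def sample_correlation(f, g):
    """Sample correlation coefficient between two term distributions."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    cov = sum((a - mf) * (b - mg) for a, b in zip(f, g)) / n
    return cov / (math.sqrt(sample_variance(f)) * math.sqrt(sample_variance(g)))

# Example: a flat (noisy) term vs. two similarly "peaked" (redundant) terms.
flat   = [0.25, 0.25, 0.25, 0.25]
peak_a = [0.70, 0.10, 0.10, 0.10]
peak_b = [0.65, 0.15, 0.10, 0.10]
print(sample_variance(flat))                 # 0 -> falls below any variance threshold
print(sample_correlation(peak_a, peak_b))    # close to 1 -> candidates for merging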
4.5 Clustering Algorithms
Now let us turn to the clustering algorithms used for the redundancy compression task. Our main approach in this research has been to adopt an agglomerative term clustering approach, disregarding efficiency aspects apparently improved by divisive clustering methods, as pointed out by Dhillon [6]. The reason is that our interest has been focused on studying the influence of the number of clusters built and its optimal number, rather than on algorithm efficiency aspects.
4.5.1 Initial Agglomerative Approach
The first clustering implementation was inspired by the agglomerative hard clustering^10 algorithm proposed by Baker [4]. The algorithm is simple and scales well to large vocabulary sizes since, instead of comparing the similarity of all pairs of words, it restricts the comparison to a smaller subset of size $M$ ($M$ being the final number of clusters desired). After the data set preprocessing and noisy terms filtering, the word vocabulary is ordered in decreasing variance order. Then, the algorithm initializes the $M$ clusters to the $M$ first words of the sorted list. It follows on by iteratively comparing the $M$ clusters and merging the closest ones. Empty clusters are filled with the next words in the sorted list.

When merging occurs, the distribution of the new cluster becomes the weighted average of the distributions of its constituent words. For instance, when merging terms $t_j$ and $t_k$ into a same cluster, the resulting distribution function is:
$$f_{t_j \vee t_k}(c) = \frac{p(t_j)}{p(t_j) + p(t_k)}\, f_{t_j}(c) + \frac{p(t_k)}{p(t_j) + p(t_k)}\, f_{t_k}(c). \qquad (9)$$

This algorithm has been named Static window Hard clustering (SH clustering). Static window refers to the fixed $M$-dimensional window it is based on, while Hard clustering denotes the nonoverlapping nature of the clustering.
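The following Python sketch reflects one possible reading of the SH procedure just described, not an exact reproduction of it: a window of $M$ clusters is maintained, the most correlated pair is repeatedly merged using the weighted average of (9), and the freed slot is refilled with the next word of the variance-sorted list. Term priors are approximated by raw term counts, and all names, including the local helper _corr, are illustrative assumptions.

# Sketch of a static-window hard clustering pass.
def _corr(f, g):
    """Sample correlation coefficient between two distributions (as in (8))."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    num = sum((a - mf) * (b - mg) for a, b in zip(f, g))
    den = (sum((a - mf) ** 2 for a in f) * sum((b - mg) ** 2 for b in g)) ** 0.5
    return num / den if den else 0.0

def sh_clustering(dists, counts, sorted_terms, M):
    """dists: {term: distribution list}; counts: {term: #occurrences};
    sorted_terms: terms in decreasing sample-variance order; M: #clusters."""
    window = [([t], list(dists[t]), counts[t]) for t in sorted_terms[:M]]
    pending = list(sorted_terms[M:])
    while pending:
        # find the most similar pair of clusters currently in the window
        best, best_pair = -2.0, None
        for i in range(M):
            for j in range(i + 1, M):
                r = _corr(window[i][1], window[j][1])
                if r > best:
                    best, best_pair = r, (i, j)
        i, j = best_pair
        ti, fi, wi = window[i]
        tj, fj, wj = window[j]
        merged_f = [(wi * a + wj * b) / (wi + wj) for a, b in zip(fi, fj)]  # eq. (9)
        window[i] = (ti + tj, merged_f, wi + wj)
        # refill the freed slot with the next word of the sorted list
        t = pending.pop(0)
        window[j] = ([t], list(dists[t]), counts[t])
    return [terms for terms, _, _ in window]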
4.5.2 Dynamic Window Approach
A further agglomerative clustering algorithm has been implemented where the fixed $M$-dimensional Static window has been replaced by a Dynamic window scheme. The rationale of this procedure is to avoid forcing the merging of distant clusters due to the fixed $M$-dimensional size of the working window, especially when $M$ is low.

The algorithm proceeds as the former Hard clustering procedure, except that the initial window size is set to an input value $W \neq M$. The window is iteratively expanded whenever no pair of clusters with an intercluster distance lower than a certain threshold exists. In a subsequent step, when all vocabulary terms have been assigned to a cluster, the window is progressively contracted until its dimension reaches the number $M$ of desired clusters. Toward this objective, the intercluster distance threshold is progressively incremented following an arithmetic progression (whose common difference has to be set as an input parameter). At each step, the merging of close clusters is performed.
9. In this document, biased estimates are adopted. Alternatively, unbiased estimates could be used by substituting $|\mathcal{C}|$ by $|\mathcal{C}| - 1$.
10. Hard clustering assumes that each term can only belong to one cluster. Clusters do not overlap; they produce a partition (disjoint subsets) of terms.
4.5.3 Soft Clustering Approach
A soft clustering^11 approach has been designed in order to accommodate different semantic contexts for a same term. The implementation of a soft clustering model is notably more computationally expensive than a hard clustering scheme, since it demands an iterative procedure where the degree of proximity of each pair of clusters is analyzed.
4.5.4 Clustering Algorithms Implemented
From the combination of the different approaches exposed,
four agglomerative clustering algorithms have been im-
plemented, namely: Static Window Hard clustering (SH),
Static Window Soft clustering (SS), Dynamic Window Hard
clustering (DH), and Dynamic Window Soft clustering (DS).
4.6 Document Quantization
Once the clustering algorithm has ended, we assume each of the resulting term clusters to be a symbol of the new alphabet (i.e., an indexing dimension for the document sampling). The indexing or quantization of the document can be simply done by a term frequency (TF) weighting scheme such as:
$$d_j^i = TF(j, i) = \#(w_j, d_i), \qquad (10)$$
where $\#(w_j, d_i)$ is the number of times the terms of cluster $w_j$ appear in document $d_i$. In this simple indexing, the classical inverse document frequency (IDF) factor is ignored because it has already been, to a certain extent, taken into account in the noise filtering phase (i.e., terms that appear uniformly in documents of all categories have been identified as noisy terms and, thus, eliminated).

We may ignore document length, as we assume it is independent of the category. In order to normalize all resulting document vectors, whenever necessary, we have adopted a cosine normalization.
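A minimal Python sketch of this quantization step, using the TF scheme of (10) with an optional cosine normalization, could look as follows; the names and the toy clusters are illustrative assumptions.

# Cluster-level TF quantization of a document, with optional cosine normalization.
import math
from collections import Counter

def quantize(tokens, clusters, normalize=True):
    counts = Counter(tokens)
    vec = [sum(counts.get(t, 0) for t in cluster) for cluster in clusters]
    if normalize:
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            vec = [v / norm for v in vec]
    return vec

# d'' for a toy document over two clusters
print(quantize(["goal", "match", "goal", "price"], [["goal", "match"], ["price", "market"]]))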
5 DOCUMENT DECODING
We cannot expect that the final alphabet of symbols obtained exactly maps a pure orthogonal set of basis functions, as eventually desired. Consequently, documents resampled in the new term-clusters space can be assumed to be (to a certain extent) corrupted codewords of the ideal category encoding. Adopting communication terminology, we may say they constitute the actually received messages, contaminated by the undesirable effects introduced by the transmission channel. To sum up, the decoding of a document sampled in the term-clusters space, which ideally would be a straightforward extraction of the category, cannot as such be directly implemented in practice due to the influence of channel noise, interference, and distortion. This brings us to an Optimum Detection problem.
5.1 MAP Decoder
The optimization criterion can be formulated in terms of $p(c_i|\vec{d'}, H)$, that is, the conditional probability that $c_i$ was selected by the source given that the document $\vec{d'}$^12 is received. If
$$p(c_i|\vec{d'}, H) > p(c_j|\vec{d'}, H), \qquad \forall j \neq i, \qquad (11)$$
(where $H$ denotes the overall hypothesis space), then the decoder should decide that the transmitted symbol was the category $c_i$. This constitutes the basis of a maximum a posteriori (MAP) or probabilistic decoder that is expressed as:
$$\hat{c}(\vec{d'}) = \arg\max_{c_i} \; p(c_i|\vec{d'}, H). \qquad (12)$$
Now, the posterior probability $p(c_i|\vec{d'}, H)$ can be straightforwardly estimated by Bayesian inference, $p(c_i|\vec{d'}, H) = \frac{p(\vec{d'}|c_i, H)\, p(c_i|H)}{p(\vec{d'}|H)}$. Given that the evidence $p(\vec{d'}|H)$ does not depend on the category $c_i$, the classification criterion simplifies to the following expression, which constitutes the discriminative function of MAP categorizers:
$$\hat{c}_{MAP}(\vec{d'}) = \arg\max_{c_i} \; p(\vec{d'}|c_i, H)\, p(c_i|H). \qquad (13)$$
5.2 Gaussian Assumption (Discriminant Analysis)
The Gaussian assumption is a classical modeling assumption heavily used in areas such as Signal Processing and Communication Systems but rarely applied in the field of ATC (see Section 5.2.3 for a discussion of this assertion).
The Gaussian model assumes that each category encoding characterizes a multivariate Gaussian or Normal Probability Density Function (PDF). A document $\vec{d}$ is then assumed to be a realization of an $n$-dimensional random vector $\vec{D}$ that is dependent on the category output $c_i$, with the following Gaussian PDF $\mathcal{N}(\vec{\mu}_i, \Sigma_i)$:
$$f_{\mathcal{N}(\vec{\mu}_i, \Sigma_i)}(\vec{d}) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{d} - \vec{\mu}_i)^T \Sigma_i^{-1} (\vec{d} - \vec{\mu}_i) \right), \qquad (14)$$
where the mean vector $\vec{\mu}_i = E[\vec{D} \mid c_i]$ is an $n$-dimensional vector and the covariance matrix $\Sigma_i = E\big[(\vec{D} - \vec{\mu}_i)(\vec{D} - \vec{\mu}_i)^T \mid c_i\big]$ is an $n \times n$ positive-definite^13 matrix with positive determinant $|\Sigma_i|$.
Now, the likelihood $p(\vec{d'}|c_i)$ can be expressed in the following terms, where $F_{\mathcal{N}(\vec{\mu}_i, \Sigma_i)}$ denotes the probability distribution function:^14
$$p(\vec{d'}|c_i) = \lim_{\Delta \to \vec{0}} p(\vec{d'} \le \vec{D} \le \vec{d'} + \Delta \mid c_i) = \lim_{\Delta \to \vec{0}} \left[ F_{\mathcal{N}(\vec{\mu}_i, \Sigma_i)}(\vec{d'} + \Delta) - F_{\mathcal{N}(\vec{\mu}_i, \Sigma_i)}(\vec{d'}) \right] \approx f_{\mathcal{N}(\vec{\mu}_i, \Sigma_i)}(\vec{d'})\, \Delta, \qquad \Delta \to \vec{0}. \qquad (15)$$

The factor $\Delta$ appears in the likelihood (15) for each category. Consequently, it does not affect classification, and thus, the MAP criterion translates to the following discriminative function:
$$\hat{c}(\vec{d'}) = \arg\max_{c_i} \left\{ \frac{1}{|\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\vec{d'} - \vec{\mu}_i)^T \Sigma_i^{-1} (\vec{d'} - \vec{\mu}_i) \right) p(c_i|H) \right\}. \qquad (16)$$

Given that the logarithmic function is a monotonically increasing function, (16) is normally expressed as:
$$\hat{c}(\vec{d'}) = \arg\max_{c_i} \left\{ \ln p(c_i|H) - \frac{1}{2} \ln|\Sigma_i| - \frac{1}{2} (\vec{d'} - \vec{\mu}_i)^T \Sigma_i^{-1} (\vec{d'} - \vec{\mu}_i) \right\}. \qquad (17)$$

11. Soft or fuzzy clustering allows a term to belong to more than one cluster. Clusters may then overlap.
12. Note that $\vec{d'}$ corresponds to the document signal vector after the filtering and compression processes described in Section 3 (and thus, properly corresponds to $d''$).
13. Recall that a positive-definite matrix is a symmetric matrix with all its eigenvalues positive. A positive-definite matrix is always invertible or nonsingular. Its determinant is always positive.
14. By definition, the probability distribution $F_{\vec{D}}$ and density $f_{\vec{D}}$ functions are related by $f_{\vec{D}}(\vec{x}) = \frac{\partial^n F_{\vec{D}}(\vec{x})}{\partial x_1 \cdots \partial x_n}$.
5.2.1 Quadratic Discriminant Analysis (QDA)
ATC being a supervised classification scheme, both $\vec{\mu}_i$ and $\Sigma_i$ can be estimated from the training set of documents $D_{train}^i$ that belong to category $c_i$. The discriminative functions in (16) and (17) describe a quadratic shape for each category, and the decision frontiers are also quadratic.
5.2.2 Linear Discriminant Analysis (LDA)
Let us suppose that the covariance matrices for all categories are identical ($\Sigma_i = \Sigma$). This constitutes the homoscedastic simplifying assumption. Dropping the terms that do not depend on the category, the discriminant function (17) simplifies to:
$$\hat{c}(\vec{d'}) = \arg\max_{c_i} \left\{ \ln p(c_i|H) - \frac{1}{2} \vec{\mu}_i^T \Sigma^{-1} \vec{\mu}_i + \vec{\mu}_i^T \Sigma^{-1} \vec{d'} \right\}. \qquad (18)$$
This corresponds to a linear separating surface (i.e., a hyperplane).
5.2.3 Applying Discriminant Analysis to ATC
The first problem that appears when trying to apply discriminant analysis (DA) to ATC is the so-called $n \gg N$ problem.^15 The number of variables (indexing terms or symbols) is typically extremely high (tens or hundreds of thousands), while the number of sample documents is moderately small. Moreover, note that even if this problem did not exist, the computation of extremely high-dimensional covariance matrices is not feasible.

As a result of these limitations, DA can only be envisaged after a preliminary indexing term space reduction phase.

One of the pioneers in using DA in ATC were Schutze et al. [12], who applied an LDA classifier to the routing task in 1995.

More recently, in 2006, Li et al. [13] used discriminant analysis for multiclass classification. Their experimental investigation showed that LDA reaches accurate performance comparable to that offered by SVM, with a neat improvement in terms of simplicity and time efficiency, both in the learning and the classification phases.^16
5.3 Independence Assumption
The independence or Naive Bayes assumption, which states that terms are stochastically independent, can be formulated over the Gaussian MAP categorizer. Under this assumption, the covariance matrix becomes the diagonal matrix of variances:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}. \qquad (19)$$
The covariance matrix determinant is simply $|\Sigma| = \sigma_1^2 \cdots \sigma_n^2$, and the inverse covariance matrix is straightforwardly deduced. The estimation of $\Sigma$ is, computationally speaking, drastically simplified, since it reduces to the computation of the $n$ variances.

It can be easily shown that, under the independence assumption, the discriminative function (16) simplifies to the product of the univariate Gaussian PDFs in each of the document indexing directions:
$$\hat{c}(\vec{d'}) = \arg\max_{c_i} \left\{ f_{\mathcal{N}(\mu_{i1}, \sigma_{i1}^2)}(d'_1) \cdots f_{\mathcal{N}(\mu_{in}, \sigma_{in}^2)}(d'_n)\; p(c_i|H) \right\}. \qquad (20)$$
5.3.1 Quadratic GNB
The quadratic GNB obeys (20). The category-dependent parameters can be estimated by the arithmetic mean, $\mu_{ij} \approx m_{ij} = \frac{1}{N_i} \sum_{\vec{d'} \in D_{train}^i} d'_j$, and the sample variance, $\sigma_{ij}^2 \approx s_{ij}^2 = \frac{1}{N_i} \sum_{\vec{d'} \in D_{train}^i} (d'_j - m_{ij})^2$, where $N_i$ is the number of training documents of category $c_i$.
5.3.2 Linear GNB
Under the homoscedastic hypothesis, the variance is considered to be category independent. It can be estimated as before by substituting $N_i$ by $N$ (the total number of training documents) and $D_{train}^i$ by $D_{train}$. Under this simplification, (20) defines a linear GNB as expressed in (18).
5.3.3 White Noise GNB
A further simplifying assumption may consider the variance to be one and the same for all categories and variables. This generates what we have baptized the White Noise GNB (or simply WN-GNB)^17 categorizer. It can be easily shown that, after applying the logarithmic function, the discriminative function (20) simplifies to:
$$\hat{c}_{MAP}(\vec{d'}) = \arg\max_{c_i} \left\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{n} (d'_j - \mu_{ij})^2 + \ln p(c_i|H) \right\}. \qquad (21)$$
In the case of equiprobable categories, the WN-GNB categorizer defined in (21) reduces to a (euclidean) minimum-distance categorizer.
5.3.4 A New Hybrid GNB Categorizers Family
Document data sets are characterized by extremely sparse matrices (even after the resampling envisaged in Section 4). The majority of the variables (i.e., indexing terms) are not representative of the target category. This means that, for a large part of the indexing attributes, the mean and variance are theoretically null.^18 This fact negatively affects the computation of the discriminative function in (20), since, for those attributes, a small deviation from zero of an indexing weight $d'_j$ results in a close-to-zero value for the term $f_{\mathcal{N}(\mu_{ij}, \sigma_{ij}^2)}(d'_j)$, which (when accumulated repeatedly) can end up setting the overall probability to nil.

With the idea of mitigating the effects of the sparsity existing in ATC, we have envisaged setting a variance lower bound.

15. When the number of dimensions $n$ of the random vector is greater than the quantity of sample data $N$, the estimated covariance matrix does not have full rank and, thus, cannot be inverted.
16. The advantage of adopting an LDA strategy is that it is a natural multiclass classifier, while SVM is a binary classifier that has to be adapted to the multiclass problem by reducing it to a binary classification scenario, which is a nontrivial task.
17. In Communications, white noise is a Gaussian-distributed noise that has a flat spectral density. It is called white noise by analogy to white light. By similitude, we use the terminology white noise to designate a Gaussian noise that affects all document vectors uniformly.
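As an illustration of this idea, the following Python sketch implements a Gaussian Naive Bayes decoder with a variance lower bound: per-category means and variances are estimated as in Section 5.3.1, and every variance is floored so that near-zero attributes cannot drive the log-likelihood toward minus infinity. The floor value and all names are illustrative assumptions, and the sketch is not claimed to reproduce the exact hybrid family evaluated here.

# Gaussian Naive Bayes with a variance lower bound (sketch).
import math

def train_gnb(X, y, var_floor=1e-3):
    """X: list of document vectors (already in the cluster space); y: labels."""
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        n, dim = len(rows), len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(dim)]
        varis = [max(sum((r[j] - means[j]) ** 2 for r in rows) / n, var_floor)
                 for j in range(dim)]
        model[c] = (math.log(n / len(X)), means, varis)   # log prior, mu, sigma^2
    return model

def classify(model, x):
    """MAP decision: argmax_c  log p(c) + sum_j log N(x_j; mu_cj, sigma_cj^2)."""
    def score(params):
        log_prior, mu, var = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
            for xj, m, v in zip(x, mu, var))
    return max(model, key=lambda c: score(model[c]))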
6 EXPERIMENTAL SCENARIO
Before entering this section, note that some aspects of the experimental scenario adopted, such as the tackling of the overlapping issue in Reuters-21578 and the effectiveness measure used, are reviewed in some detail, since they constitute a different treatment from commonly followed solutions [1].
6.1 Standard Collections
Two of the most widely used standard collections, the
20 Newsgroups (NG) and the Reuters-21578 (RE) data sets,
form our experimental scenario. In our experiments, both
collections have been preprocessed by removing stopwords
(from the Weka [14] stop list) and nonalphabetical words.
Infrequent terms, occurring in less than four documents or appearing less than four times in the whole data set, have also been filtered.
6.1.1 20 Newsgroups
The 20 Newsgroups data set is a collection of approximately
20,000 newsgroup documents, partitioned (nearly) evenly
across 20 different newsgroups, each corresponding to a
different topic. We used the bydate version of the data set
maintained by Jason Rennie,
19
which is sorted by date into
training (60 percent) and test (40 percent) sets.
6.1.2 Reuters-21578
The Reuters-21578, Distribution 1.0 test collection^20 is formed using 21,578 news stories that appeared in the Reuters
newswire in 1987, classified (by human indexers) according
to 135 thematic categories, mostly concerning business and
economy. In our experiments, we have used the most
common subset of Reuters-21578, ModApte training/test
partition which only considers the set of 90 categories with
at least one positive training example and one positive test
example. It results in a partition of 7,769 training documents
and 3,019 test documents.
Several factors characterize the Reuters-21578 data set [15], notably: categories are overlapping (i.e., a document may belong to more than one category) and the distribution across categories is highly skewed (i.e., some categories have very few labeled documents, even only one, while others have thousands).
6.1.3 Tackling the Category Overlapping Issue
Reuters-21578 data set classification is inserted in a multilabel categorization frame, which is, by nature, outside the single-label categorization scheme assumed in most ATC research works, including this one. The category overlapping issue can be tackled in three different ways:
. By deploying the $K$ nonexclusive categories into all the possible $2^K$ category combinations.
. By assuming that categories are independent and, thus, reverting the $K$-category classification problem into $K$ independent binary classification tasks.
. By ignoring multilabeled documents, which constitute approximately 29 percent of all documents in Reuters-21578.
Classically, the category independence alternative has been implicitly undertaken in most research [15]. A minority of authors, as in [16], have opted to ignore multilabeled documents. In our work, we have opted to deploy the $K = 90$ Reuters-21578 categories into all possible $2^K = 2^{90}$ combinations, which results in an impressive number on the order of $10^{27}$. The reasons for our decision are basically:
. Our conceptual framework, both for document sampling and decoding, is based on a multiclass (not binary) scheme.
. By deploying categories, we avoid assuming any independence hypothesis.
. 379 (out of more than $10^{27}$!) is the actual number of category combinations that have at least one document representative in the training set (out of these 379, only 126 category combinations are represented in the test set).
6.2 Effectiveness Measures
6.2.1 Confusion Matrix
In a single-label multiclass classification scheme, the classification process can be visualized by means of a confusion matrix. Each column of the matrix represents the number of documents in a predicted category, while each row represents the documents in an actual category. In other words, referring to Table 1, entry (1,1) would be the number of documents of category $c_1$ that are correctly classified under $c_1$, while entry (1,2) corresponds to documents of $c_1$ incorrectly classified into $c_2$, and entry (2,1) corresponds to documents of $c_2$ incorrectly classified into $c_1$.

TABLE 1
An Example of Confusion Matrix

18. Note that, in practice, the variance has to be set to a minimal value; otherwise, the univariate normal PDF expressions could not be computed.
19. http://people.csail.mit.edu/jrennie/20Newsgroups/.
20. The Reuters-21578 corpus is freely available for experimentation purposes from http://www.daviddlewis.com/resources/testcollections/reuters21578.
6.2.2 Precision and Recall Microaverages
Precision and recall^21 are common measures of effectiveness in ATC [1]. Overall measures for all categories are usually obtained by microaveraging:^22
$$\hat{\pi} = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{|\mathcal{C}|} TP_i}{\sum_{i=1}^{|\mathcal{C}|} (TP_i + FP_i)}, \qquad (22)$$
$$\hat{\rho} = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{|\mathcal{C}|} TP_i}{\sum_{i=1}^{|\mathcal{C}|} (TP_i + FN_i)}. \qquad (23)$$
It can easily be seen that, in the single-label multiclass classification model, the former expressions are equivalent. They both result from the quotient of the sum of the diagonal components of the confusion matrix and the sum of all its elements, which happens to be the overall classification accuracy, that is, the quotient of correct classifications (numerator) and total correct and incorrect classifications (denominator).
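The equivalence can be verified with a few lines of Python; the toy confusion matrix is an illustrative assumption.

# In single-label multiclass classification, micro-averaged precision,
# micro-averaged recall, and overall accuracy all equal
# (sum of the diagonal) / (sum of all entries) of the confusion matrix.
def micro_scores(cm):
    """cm[i][j]: documents of actual category i predicted as category j."""
    tp = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    fp = sum(sum(cm[i][j] for i in range(len(cm)) if i != j) for j in range(len(cm)))
    fn = total - tp  # every non-diagonal document is both a FP (for the
                     # predicted class) and a FN (for the actual class)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = tp / total
    return precision, recall, accuracy

print(micro_scores([[50, 3, 2],
                    [4, 40, 1],
                    [2, 2, 46]]))   # all three values coincide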
6.3 Technological Solutions
The implementations undertaken in this research are based
on the Weka 3 Data Mining Software [14]. In particular, we
have used: NaiveBayesMultinomial, which is the MNB
categorizer implemented in Weka and Weka LibSVM
(WLSVM), which is the integration of LibSVM into Weka
environment.
23
7 EXPERIMENTAL RESULTS ON NOISY
TERMS FILTERING
This section provides empirical evidence that the novel
noisy terms filtering ensures a beneficial feature space
reduction.
7.1 20 Newsgroups
7.1.1 Experimental Scenario
In the noisy terms filtering scheme that was designed, the setting of the sample variance threshold $\tau_{s^2}$ (from this point on, simply noted $\tau$) has to be heuristically tuned. The only thing we know a priori is that it is bounded in the interval $[0, s^2_{max}]$, with $s^2_{max} = \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}$, which results in $[0, 0.0475]$ for $|\mathcal{C}| = 20$.

We have graphically represented in Fig. 5 the effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 20, 100, and 5,000 clusters, respectively. The classifiers used have been the classic MNB and SVM. The clustering algorithm tested is static window hard clustering (SH clustering).^24 The similarity measure used is the sample correlation coefficient in both cases. The rest of the clustering parameters are the default ones (see Section 8).
7.1.2 Analysis of Results
The curves corresponding to MNB and SVM classification in Fig. 5 have parallel evolutions. They present a mountain shape with a maximum in the $\tau$ range of $[0.005, 0.02]$. Note that this range assures a classification accuracy decrease lower than 5 percent for the 20-clusters curve (which happens to be our target clustering, as we will see in Section 8). Outside this range, the classification accuracy decrease is considered to be too high.
7.2 Reuters-21578
7.2.1 Experimental Scenario
As in the case of 20 Newsgroups, the dispersion measure used has been the sample variance. A priori, we only know that the variance threshold is bounded in the interval $[0, s^2_{max}]$, with $s^2_{max} = \frac{|\mathcal{C}|-1}{|\mathcal{C}|^2}$, which results in $[0, 0.002631]$ for $|\mathcal{C}| = 379$.

We have graphically represented in Fig. 6 the effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 200, 500, and 1,000 clusters, respectively. The classifiers used have been the classic MNB and SVM, and the clustering algorithm tested is SH clustering. The similarity measure used is the sample correlation coefficient in both cases. The rest of the clustering parameters are the default ones (see Section 8).

Fig. 5. 20 Newsgroups: Effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 20, 100, and 5,000 clusters, respectively. The clustering algorithm used is the static window hard clustering version with the sample correlation coefficient as similarity measure. (a) MNB classifier. (b) SVM classifier.

21. Precision ($\pi_i$) is the probability that if a random document $d_j$ is classified under $c_i$, this decision is correct. Recall ($\rho_i$) is the probability that if a random document $d_j$ ought to be classified under $c_i$, this decision is taken.
22. TP stands for True Positive, FP for False Positive, TN for True Negative, and FN for False Negative.
23. Weka LibSVM is publicly available from http://www.cs.iastate.edu/~yasser/wlsvm/.
24. Similar results are obtained with dynamic window soft clustering (DS clustering) but, due to space limitations, they are not shown here.
7.2.2 Analysis of Results
The curves corresponding to MNB and SVM classification have slightly different evolutions. The MNB classifier seems to be more robust to the presence of noisy terms (low values of the variance threshold). In any case, both classifiers (or, more precisely, the combination of clustering/classifier) may be considered reasonably robust to the presence of noisy terms. The optimal range for $\tau$ is $[0.00005, 0.001]$. Note that this range assures a maximum decrease of accuracy of 5 percent for the 200-clusters curve (our target clustering, as we will see in Section 8). In both cases, the absolute maximum is obtained for a value of $\tau$ of 0.0003.

Similar results are obtained when clustering is performed by the DS algorithm, with the difference that the optimal range for $\tau$ is restricted to $[0.0002, 0.001]$.
7.3 Interpreting and Extrapolating Results
Both in 20 Newsgroups and Reuters-21578, we have seen that a range of values exists for the sample variance threshold where a maximum of classification accuracy is reached. Outside these bounds, the classification accuracy decreases:
. When $\tau$ is below the lower bound, the selection is too unrestrictive and noisy terms negatively affect the term clustering, which results in classification accuracy deterioration.
. When $\tau$ is above the upper bound, the selection is too restrictive and informative terms are eliminated, which also results in accuracy deterioration.
While in some cases (e.g., SH clustering on the Reuters collection) classification accuracy seems not to be affected by any variance threshold lower bound (i.e., robustness to noisy terms), a threshold upper bound always exists. This is reasonable, as one cannot limitlessly eliminate terms without reducing classification accuracy. The crux of the question here is to find an effective term reduction (which speeds up the term clustering process) while preserving classification accuracy. In other words, we have the following compromise: the higher $\tau$ is, the faster the clustering process becomes, but also the lower the classification accuracy becomes.

The problem is how to determine a priori (not experimentally) the optimal values for the variance threshold lower and upper bounds.

There is a clear correspondence of relative $\tau$ values in both cases. In the case of 20 Newsgroups, the optimum range for $\tau$ is $[0.1053\, s^2_{max}, 0.4211\, s^2_{max}]$, which corresponds to a range of 88-41 percent of terms selected. For Reuters, it is $[0.08\, s^2_{max}, 0.38\, s^2_{max}]$, which corresponds to a range of 94-33 percent of terms selected.

We may extrapolate these particular cases into a general scenario (that should, of course, be verified in future experiments with other collections). The optimal range for $\tau$ may be set to $[0.11\, s^2_{max}, 0.38\, s^2_{max}]$ (this would embrace the limits set by both collections).
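Under this extrapolated rule, the suggested threshold range follows directly from the number of categories, as in the small Python sketch below (the function name is an illustrative assumption).

# Recommended sample-variance threshold range from the number of categories:
# s^2_max = (|C|-1)/|C|^2 and range [0.11 * s^2_max, 0.38 * s^2_max].
def variance_threshold_range(num_categories):
    s2_max = (num_categories - 1) / num_categories ** 2
    return 0.11 * s2_max, 0.38 * s2_max

print(variance_threshold_range(20))    # ~(0.0052, 0.0181) for 20 Newsgroups
print(variance_threshold_range(379))   # ~(0.00029, 0.0010) for Reuters-21578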
8 EXPERIMENTAL RESULTS ON CLUSTERING
OF TERMS
This section provides empirical evidence that our redun-
dancy compression model drastically outperforms two of
the most effective feature selection functions [17], namely,
Information Gain (IG) and Chi-square (CHI), and allows an
aggressive dimensionality reduction with minimal loss (in
some cases, benefits) on final classification accuracy.
8.1 20 Newsgroups
8.1.1 Experimental Scenario
In the preliminary term selection phase of this experimental
scenario, noisy terms have been eliminated using the
sample variance measure and a threshold set to 0.02. As a
result of this, out of the initial 32,631 preselected terms of
the training set, 13,465 informative words have been
selected and served as the basis of the clustering phase.
We applied the four clustering variants with sample
correlation coefficient similarity function. In the dynamic
window approaches, the initial window size has been set to
100 and the contraction step was empirically set to 0.0001.^25
8.1.2 Discussion of Basic Performance
We have obtained the categorization accuracy of the MNB
categorizer upon the NG data set indexed in the resulting clusters space. This has been compared to the results issued from the classic IG and CHI selection functions applied to the raw data set with the same preprocessing (removal of stopwords and nonalphabetical words) and the same space reduction factor.

Fig. 6. Reuters-21578: Effect of the variation of the sample variance threshold on the classification accuracy for term clusterings of 200, 500, and 1,000 clusters, respectively. The clustering algorithm used is the static window hard clustering version with the sample correlation coefficient as the similarity measure. (a) MNB classifier. (b) SVM classifier.

25. Experimentally, small values for this parameter (lower than 0.0001) have shown better results than higher ones.
Fig. 7 shows parallel results to the ones obtained by the
previous research works on Distributional clustering [4], [5],
[6]. The clusters categorization accuracy curves are notably
better than those of classic IG and CHI term selection
functions. They present an abrupt initial increase up to
20 clusters (accuracy in the range of 74 percent to 76 percent,
maximum value 76.0489 percent), and from there, they
asymptotically approach the maximum accuracy of 78.2528 percent obtained by a full-feature MNB classifier.^26

We can say that clustering is good, with only a residual loss of classification accuracy of 2.82 percent, for 20 clusters or more, which is the number of categories defined in the 20 Newsgroups collection. It results in an indexing term-space reduction from the original 113,357 words to only 20 word clusters. The reduction factor is $\frac{113{,}357 - 20}{113{,}357} \approx 0.9998$.
8.1.3 NG Indexed by Clusters
Fig. 8a illustrates how the NG data set is indexed in the obtained clusters space.^27 The graphic represents the document vectors (on the y-axis) indexed in the term-cluster space (x-axis). Documents have been normalized and have also been arranged according to the category they belong to. The graphic thus has to be read following the 20 horizontal bands that can be identified in the figure. Each band corresponds to all documents belonging to the same category. Each single document is a line inside this band. Basically, each category is mainly identified by a single and distinctive cluster.^28
Finally, NG indexed by 100 term clusters, in Fig. 8b, gives a very graphical understanding of the power of clustering. Here again, basically 20 clusters are active, while a large part of the figure is void. This supports the idea that increasing the number of clusters beyond the number of categories is largely unnecessary.
8.2 Reuters-21578
8.2.1 Experimental Scenario
We applied our four clustering variants on the Reuters-
21578 preprocessed training set upon which noisy terms
have been filtered by a sample variance threshold of 0.0003
(the collection is finally indexed by 8,550 informative
terms). We have used sample correlation coefficient. The
step in the contraction phase of the dynamic window
approaches was empirically set to 0.00001 and the window
size to 100. Finally, the whole collection of documents has
been indexed in the resulting space of clusters.
8.2.2 Discussion of Basic Performance
Fig. 9 shows results parallel to the ones obtained with the 20 Newsgroups data set. The curves corresponding to the Distributional clustering algorithms present an abrupt initial increase up to 200 clusters (accuracy of 76-82 percent, depending on the clustering and the categorizer used), and from there, they asymptotically approach the maximum of 81.5953 percent obtained by the classical selection functions (IG or CHI) with SVM. Note that, in similar conditions, MNB obtains a maximum accuracy of 78.453 percent.
In other words, the indexing space can be reduced from the original 32,539 words to 200 clusters (i.e., by two orders of magnitude) with no loss (even a gain) of categorization accuracy. The reduction factor is $\frac{32{,}539 - 200}{32{,}539} = 0.9939$.
With this reduction, there is a gain in classification accuracy of 4.53 percent when using the MNB classifier (from the 78.453 percent obtained when indexing with 10,000 terms selected indiscriminately with the IG or CHI functions, to 82.1478 percent). When using SVM, the loss in classification accuracy is only 2.24 percent (from 81.5953 percent to 79.7676 percent). Most notably, an overall maximum accuracy of 82.1478 percent is reached when RE is indexed by the 200 clusters obtained by DS clustering and classified with MNB.
Qualitatively, our clustering approach essentially improves upon the deterministic annealing algorithm (a Distributional Clustering algorithm) proposed by Bekkerman et al. [7], who found that simple word indexing was a more efficient representation than word clusters for Reuters-21578 (10 largest categories).^29 The authors argue that the poor performance of their algorithm is due to the structural low complexity of the Reuters corpora (compared to 20 Newsgroups, for instance), for which classification accuracy was found not to improve significantly when document indexation was done using a great number of words. Our results show that classification accuracy over Reuters-21578 is clearly improved by using word clusters instead of words when indexing with a small number of features. One of the basic differences between Bekkerman's approach and ours that may explain our results, apart from the noisy terms filtering process, is that the former solves the overlapping issue in the Reuters collection by assuming the obviously violated assumption of category independence, while we avoid such a hypothesis (see Section 6.1.3).

Fig. 7. 20 Newsgroups: Classification accuracy of Distributional Clustering algorithms vs. Information Gain and Chi-square term selection functions with the Multinomial Naive Bayes categorizer, using the sample correlation coefficient as similarity measure.

26. Note that Baker and McCallum [4] and Dhillon et al. [6] categorization accuracy maximum values are (in absolute terms) slightly superior to ours, due to the fact that they have used a bigger train split (a 2/3 train-1/3 test random split against our 60 percent train-40 percent test, Rennie's publicly available bydate split). They report [6] achieving 78.05 percent accuracy with only 50 clusters, just 4.1 percent short of the accuracy achieved by a full-feature classifier. Our results with only 20 clusters minimize this loss to 2.82 percent, thus representing relatively improved results.
27. Produced by DS clustering using the sample correlation coefficient as similarity measure and a sample variance threshold of 0.015 (i.e., 18,046 terms selected).
28. This is the general idea. Nevertheless, when analyzed in more detail, Fig. 8 shows secondary clusters for some of the categories, which indicate subject relationships.
29. To our knowledge, Bekkerman et al.'s research is the only work that has applied Distributional Clustering to the Reuters-21578 data set.
30. Issued from the decomposition of the original 90 overlapped categories.
The 379 (nonoverlapped) categories (see footnote 30) of the Reuters-21578 collection can thus be optimally indexed with only
200 clusters. It should be noted that out of the total
379 categories of the training set, only 126 are effectively
represented in the test set. This fact, together with the
extremely insignificant representation of some categories,
may explain why the optimal number of clusters needed for
indexing is lower than the number of categories.
8.2.3 RE Indexed by Clusters
Fig. 10 illustrates how the Reuters data set is indexed in the cluster space produced by DS clustering (the figure shows a detailed view of the most representative clusters). To make the figure easier to read, the ten most populated categories have been labeled (horizontal grids delimit the corresponding bands). The figure clearly reflects the uneven category distribution of the data set. The clusters corresponding to the dominant categories earn and acq clearly mark two vertical lines in the figure. In general, these clusters are quite active in all documents. A possible explanation for this is that these clusters are so big (see footnote 31) that they tend to be generalist. Another way to see it is
that all categories are related to the general subject of
business and economy that these clusters (identifiers of
categories earn and acq) globally represent. In any case, when examined in detail, a singular and discriminative cluster can be identified for each category, except for category interest&moneyfx, which shows indexing similar to that of category moneyfx. Note that both of these categories are mainly indexed by a fairly big cluster.
9 EXPERIMENTAL RESULTS ON DOCUMENT
DECODING
9.1 20 Newsgroups
In Fig. 11a, it can be observed (see footnote 32) that hybrid Q-GNB
presents good classification results, surpassing MNB and
LDA in the range of 20-200 word clusters.

Fig. 8. 20 Newsgroups training set indexed by clusters. (a) Training set indexed by 20 clusters. (b) Training set indexed by 100 clusters.
Fig. 9. Reuters-21578: Categorization accuracy of Distributional Clustering versus Information Gain and Chi-square. The dispersion and similarity measures used are sample variance and sample correlation coefficient. (a) MNB categorizer. (b) SVM categorizer.

30. Issued from the decomposition of the original 90 overlapping categories.
31. Uneven cluster size has a direct correspondence with the uneven category distribution. The most populated categories (i.e., earn and acq), which together account for more than 57 percent of all documents, are related to two single clusters that together represent more than 52 percent of the total number of terms contained in clusters.
32. The NG data set used in these experiments is indexed in the space of the word clusters produced by DS clustering using the sample correlation coefficient as similarity measure and a sample variance threshold of 0.015 (i.e., 18,046 preselected terms), with the rest of the parameters set to default values.

These results should be interpreted under the perspective of the sparsity of the data set representation. As extensively commented in Section 8.1.1, when indexing the NG collection with 20 word
clusters, each word cluster (nearly) uniquely identifies a distinct category. This means that the remaining word-cluster indexes are set (on average) to zero, with a very narrow (practically null) variance. The effect of indexing attributes with null mean and null variance turns out to be extremely negative. For instance, a residual nonnull weight in one of those null indexing attributes may easily generate a zero-probability term in (20) that will eventually override the other nonnull components of the expression. The idea pursued by our hybrid GNB is to set a lower-bound value for the variance.
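To make the variance lower-bound idea concrete, the following minimal Python sketch (not the authors' implementation; the class name, the dense NumPy representation, and the default floor value are illustrative assumptions) estimates per-category Gaussian parameters and clamps each variance before taking the MAP decision:

    import numpy as np

    class HybridGNB:
        """Gaussian Naive Bayes with a lower bound (floor) on the per-feature variances."""

        def __init__(self, var_floor=0.015):
            self.var_floor = var_floor

        def fit(self, X, y):
            # X: (n_docs, n_features) cosine-normalized indexing matrix; y: category labels.
            X, y = np.asarray(X, dtype=float), np.asarray(y)
            self.classes_ = np.unique(y)
            self.log_priors_ = np.log([np.mean(y == c) for c in self.classes_])
            self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
            raw_vars = np.array([X[y == c].var(axis=0) for c in self.classes_])
            # Clamp the variances so that no class-conditional Gaussian becomes degenerate.
            self.vars_ = np.maximum(raw_vars, self.var_floor)
            return self

        def predict(self, X):
            X = np.asarray(X, dtype=float)
            scores = np.tile(self.log_priors_, (X.shape[0], 1))
            for i in range(len(self.classes_)):
                diff = X - self.means_[i]
                # Sum of per-feature Gaussian log-densities (naive independence assumption).
                scores[:, i] += -0.5 * np.sum(
                    np.log(2.0 * np.pi * self.vars_[i]) + diff ** 2 / self.vars_[i], axis=1)
            return self.classes_[np.argmax(scores, axis=1)]

For comparison, scikit-learn's GaussianNB exposes a related var_smoothing parameter, although it adds a fraction of the largest feature variance rather than imposing an absolute floor as done here.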
To get a more precise idea of what a variance lower bound of 0.015 means, we can make use of the 68-95-99.7 (empirical) rule that characterizes a normal or Gaussian distribution. As stated by this rule, 95 percent of the values lie in the interval μ ± 2σ. Setting a lower-bound variance of 0.015 means that the narrowest Gaussian distribution allowed concentrates 95 percent of its values in the interval μ ± 0.25. This seems to be a reasonable restriction to superimpose in a normalized scheme such as the one we work with, where documents are cosine-normalized, and thus, attribute weights vary in the interval [0, 1].
The effect of variance lower-bound tuning on hybrid Q-GNB classification has been further studied. Optimal results were obtained for values greater than 0.01 (for which 2σ ≈ 0.2). For values higher than this, Q-GNB shows a slight maximum at 0.015 (for which 2σ ≈ 0.25). Surprisingly, classification results for a variance lower bound of 0.1 (for which 2σ ≈ 0.63) remain consistently good. This may be justified by the robustness of the word-cluster indexing of NG.
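(As a check of these figures, 2σ = 2√0.01 = 0.2, 2√0.015 ≈ 0.245, and 2√0.1 ≈ 0.632, which round to the 0.2, 0.25, and 0.63 quoted above.)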
9.2 Reuters-21578
In Fig. 11b, it can be observed (see footnote 33) that hybrid Q-GNB
presents good classification accuracy, similar to the one
obtained by SVM and only surpassed by MNB. These
results should be interpreted in the same sense as those for NG. Introducing a variance lower bound contains the negative effect of the null mean and null variance indexes caused by the sparsity of the data set indexing matrix. The effect of variance lower-bound tuning on hybrid Q-GNB classification has also been analyzed. As in NG, it has been seen that small values of the variance lower bound degrade classification accuracy, which explains the bad performance of the native GNB categorizers. Optimal results are obtained for values in the range of 0.005-0.01 (for which 0.14 ≤ 2σ ≤ 0.2). For values higher than this, hybrid Q-GNB suffers a drastic drop in accuracy. Forcing the Gaussian distributions to widen too much introduces errors and confusion into the classification. This effect, which is to be expected, surprisingly did not occur with NG, possibly due to the neater separation between its categories.
10 CONCLUSIONS
The theoretical model we have proposed has led to an effective two-level term-space reduction scheme, implemented by noisy-term filtering followed by redundant-term compression.
In particular, the elimination of noisy terms based on the
sample variance of their category-conditional PMF has
experimentally proved to be an innovative and correct
procedure. In both the 20 Newsgroups and Reuters-21578 collections, we have seen that there exists a range of values of the sample variance threshold t where a maximum of classification accuracy is reached. Outside these bounds, the classification accuracy decreases due to interference from noisy terms (at the lower bound) and over-elimination of informative terms (at the upper bound).
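As an illustration of this filter (a sketch under our assumptions: the function name is hypothetical, and the category-conditional PMF of a term is estimated here simply by normalizing its per-category counts), the selection can be written as:

    import numpy as np

    def filter_noisy_terms(term_category_counts, var_threshold=0.015):
        """Return indices of terms whose category-conditional PMF has a sample
        variance at or above var_threshold (i.e., the informative terms)."""
        counts = np.asarray(term_category_counts, dtype=float)
        totals = counts.sum(axis=1, keepdims=True)
        totals[totals == 0.0] = 1.0            # guard against unseen terms
        pmf = counts / totals                  # estimated P(category | term), one row per term
        sample_var = pmf.var(axis=1, ddof=1)   # sample variance across categories
        return np.flatnonzero(sample_var >= var_threshold)

Terms with a nearly uniform PMF (low variance) carry little category information and are discarded as noise, which matches the behavior described above at the lower and upper threshold bounds.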
On the other hand, the results obtained by our signal-processing-inspired redundancy compressor allow an indexing term-space reduction factor of 0.9982 (from the original 113,357 words to only 20 word clusters) with a residual loss of classification accuracy of 2.82 percent (see footnote 34) with the MNB classifier.
We have extended our research to the challenging Reuters-21578 data set, which is a highly nonuniformly distributed text collection where categories are related and overlapping. The results obtained are extremely satisfactory and clearly outperform those formerly published by Bekkerman et al. [7] with other distributional clustering procedures. Our clustering method, with the sample correlation coefficient as similarity measure, allows an indexing term-space reduction factor of 0.9938 (from the original 32,539 words to only 200 word clusters) with a gain in classification accuracy of 4.53 percent (see footnote 35)
when using the MNB classifier. When
adopting SVM, the loss of classification accuracy is only
2.24 percent. In any case, the overall maximum of classification accuracy is reached when the collection is indexed by 200 clusters and classified with MNB. These results tend to
indicate that MNB significantly benefits from the compres-
sion of the term space (and its intrinsic overfitting reduc-
tion). SVM is, arguably [2], more robust to overfitting, and
thus, less prone to be positively affected by the compression.
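For readers who want a concrete picture of the compression step, the following naive agglomerative sketch (not the DS algorithm defined earlier in the paper, which also uses a dispersion measure and is far more efficient; the function name, the merge-by-summing-profiles choice, and the brute-force search are illustrative assumptions) groups terms by the sample correlation of their category profiles:

    import numpy as np

    def agglomerate_terms(term_profiles, n_clusters=200):
        """Greedy agglomerative grouping of terms, using the sample correlation
        coefficient between (merged) category profiles as the similarity measure."""
        profiles = [np.asarray(p, dtype=float) for p in term_profiles]
        clusters = [[i] for i in range(len(profiles))]
        while len(clusters) > n_clusters:
            best_sim, best_pair = -np.inf, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = np.corrcoef(profiles[i], profiles[j])[0, 1]
                    if np.isnan(sim):          # constant profiles have undefined correlation
                        sim = -1.0
                    if sim > best_sim:
                        best_sim, best_pair = sim, (i, j)
            i, j = best_pair
            profiles[i] = profiles[i] + profiles[j]   # profile of the merged cluster
            clusters[i].extend(clusters[j])
            del profiles[j], clusters[j]
        return clusters

This brute-force search is only meant to convey the idea; for vocabularies of tens of thousands of terms, an efficient implementation such as the DS algorithm is required.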
By reducing redundancy, feature extraction by term clustering tends to produce a set of orthogonal basis vectors for the document representation. Basically, each (main) category is identified by a singular and discriminative cluster, thus bringing the compressed documents close to an orthogonal coding of the category.

Fig. 10. Reuters-21578 training set indexed by 200 clusters. View restricted to the 50 most representative clusters.

33. Same experimental scenario as in Section 8.2.1.
34. With respect to an indexing of 5,000 word clusters.
35. With respect to an indexing of 10,000 terms selected with the classical IG or CHI functions.

In all, with both collections, there
seems to be a relationship between the entropy of the
category distribution and the actual optimal number of
clusters. 20 Newsgroups is a practically uniformly distributed collection; its normalized entropy (see footnote 36) is 0.9981 (almost 1). Reuters-21578 is an extremely unevenly distributed collection; its normalized entropy is 0.4881 (almost 1/2). Experimentally, 20 Newsgroups needs as many clusters as categories, while Reuters-21578 needs about half as many.
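Using the definition in footnote 36, the normalized entropy of the category distribution C can be written as

    \tilde{H}(C) = \frac{H(C)}{H_{\max}} = \frac{-\sum_{c \in C} P(c)\,\log_2 P(c)}{\log_2 |C|},

so a perfectly uniform distribution over the |C| categories gives a normalized entropy of 1; the almost uniform 20 Newsgroups collection is therefore close to 1, while the skewed Reuters-21578 distribution falls near 1/2.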
In ATC, MNB is one of the most popular statistical categorizers because of its simplicity and good accuracy results. The Gaussian assumption [12], [13] has seldom been applied to ATC due to intrinsic problems related to the high dimensionality of the typical document vectorial representation. Our purpose has been, once the representation space has been optimally reduced, to experimentally test how Gaussian MAP categorizers, especially under the Naive Bayes assumption, may be adapted to the concomitance of sparsity in ATC. By establishing a variance lower bound in the Gaussian PDFs, we have rescued the use of Gaussian MAP classifiers in the ATC arena.
Our hybrid approach reaches classification results com-
parable to those obtained by MNB and SVM and opens
the door for further research.
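In compact form (our notation; the exact expression used in (20) may differ in detail), the hybrid decision applied to a document with word-cluster weights w_1, ..., w_K can be sketched as

    \hat{c} = \arg\max_{c}\Big[\log P(c) + \sum_{k=1}^{K} \log \mathcal{N}\big(w_k;\ \mu_{c,k},\ \max(\sigma^2_{c,k},\ \sigma^2_{\min})\big)\Big],

where σ²_min is the variance lower bound (0.015 for 20 Newsgroups and 0.008 for Reuters-21578 in Fig. 11).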
We are currently pursuing our work with the design of a divisive clustering algorithm which, in view of the results obtained with the tested agglomerative clustering schemes, we think can yield interesting improvements in terms of both classification effectiveness and computational efficiency. We also envisage a thorough comparison and analysis of similarity measures. Future work is also foreseen on the communication-theoretic modeling side, with special emphasis on the synthesis of prototype documents via the proposed generative model, as well as a deeper study of the optimal design of document coding (and its subsequent decoding).
ACKNOWLEDGMENTS
The authors wish to thank the anonymous reviewers for
their fruitful comments and the Weka Machine Learning
Project for making their software open source under a GPL
license. For this research, Marta Capdevila was supported
in part by a predoctoral grant from the R&D General
Department of the Xunta de Galicia Regional Government
(Spain), awarded on 19 July 2005.
REFERENCES
[1] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[2] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. 10th European Conf. Machine Learning (ECML), pp. 137-142, 1998.
[3] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer/Springer, 2002.
[4] L.D. Baker and A.K. McCallum, "Distributional Clustering of Words for Text Classification," Proc. 21st ACM Int'l Conf. Research and Development in Information Retrieval (SIGIR '98), pp. 96-103, 1998.
[5] N. Slonim and N. Tishby, "The Power of Word Clusters for Text Classification," Proc. 23rd European Colloquium on Information Retrieval Research, 2001.
[6] I. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification," J. Machine Learning Research, special issue on variable and feature selection, vol. 3, pp. 1265-1287, 2003.
[7] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional Word Clusters vs. Words for Text Categorization," J. Machine Learning Research, vol. 3, pp. 1183-1208, 2003.
[8] A. McCallum and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification," Proc. AAAI '98 Workshop on Learning for Text Categorization, 1998.
[9] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods," Proc. 22nd Ann. Int'l ACM SIGIR Conf. (SIGIR '99), pp. 42-49, Aug. 1999.
[10] S. Haykin, Communication Systems. John Wiley & Sons, 2001.
[11] T.M. Cover and J.A. Thomas, Elements of Information Theory, second ed. John Wiley & Sons, 2006.
[12] H. Schütze, D. Hull, and J. Pedersen, "A Comparison of Classifiers and Document Representations for the Routing Problem," Proc. 18th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '95), pp. 229-237, 1995.
[13] T. Li, S. Zhu, and M. Ogihara, "Using Discriminant Analysis for Multi-Class Classification: An Experimental Investigation," Knowledge and Information Systems, vol. 10, no. 4, pp. 453-472, 2006.
[14] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, second ed. Morgan Kaufmann, 2005.
[15] F. Debole and F. Sebastiani, "An Analysis of the Relative Hardness of Reuters-21578 Subsets," Proc. Fourth Int'l Conf. Language Resources and Evaluation (LREC '04), pp. 971-974, 2004.
[16] K. Torkkola, "Linear Discriminant Analysis in Document Classification," Proc. IEEE Int'l Conf. Data Mining (ICDM 2001) Workshop on Text Mining (TextDM '01), 2001.
[17] Y. Yang and J.O. Pedersen, "A Comparison Study on Feature Selection in Text Categorization," Proc. Int'l Conf. Machine Learning, pp. 412-420, 1997.

36. Normalized entropy is the ratio of the entropy to the maximum entropy.

Fig. 11. Comparison of the classification accuracy of Gaussian Naive Bayes categorizers against state-of-the-art SVM and MNB performance. (a) 20 Newsgroups: hybrid GNB with a variance lower bound of 0.015. (b) Reuters-21578: hybrid GNB with a variance lower bound of 0.008.
Marta Capdevila received the engineering
degree in telecommunications from the Poly-
technic University of Catalonia (UPC), Barcelo-
na, Spain, in 1992. She is currently working
toward the PhD degree. During 1991, she
studied image processing at the École Nationale Supérieure des Télécommunications, Paris, France. From 1992 to mid-1993, she was
selected for a young graduate trainee contract
at the European Space Agency (ESA), Frascati,
Italy. After this traineeship, and until mid-1994, she held an application
engineering post at the pan-European research networking company
DANTE, Cambridge, United Kingdom. From 1995 to 2001, she was
appointed to several positions in the Spanish industry. Since 2001, she
has been involved in research activities at the Telecommunication
Engineering School, University of Vigo, Spain. Her research interests
include automatic text categorization and term-space compaction and
compression.
Oscar W. Márquez Flórez received the tele-
communication engineering degree in 1985 from
the Odessa Electrotechnical Institute of Com-
munication, Ukraine, and the doctorate degree
in telecommunications in 1991 from the Ruhr-
University Bochum, Germany. In 1992, he joined
the Telecommunication Engineering faculty of
the University of Vigo, where he is currently an
associate professor. In addition to teaching, he
is involved in research in the areas of signal
processing in digital communications, computer-based learning, statis-
tical pattern recognition, and image-based biometrics. He is a member
of the IEEE.
> For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.