Figure 2. MNPNB classifier: labeling GUI.

Figure 4. The Italian Google Directory.
altre culture" and "multimedia politica". The PDF filter, offered by the Google search engine, has been used to ensure that only PDF format files are retrieved. The random query process retrieved 14,037 documents, each associated with one or more gDir first-level topics.

4.2. Text Pre-processing

The document corpus, consisting of the 14,037 retrieved documents, has been submitted to the Text Pre-processor software component. PDF files have been transformed into plain text, then submitted to stopword removal and word stemming. Furthermore, size-based file selection has been applied to include only those PDF files whose size is between 2 and 400 KB.

The obtained document corpus consists of 10,056 documents (D = 10,056). The global vocabulary, formed by including only those words occurring in at least 10 and in no more than 450 documents, consists of 48,750 word tokens (W = 48,750). The document corpus is represented by a word-document matrix WD consisting of W × D (term frequency) elements (48,750 × 10,056). The word-document matrix is input to the Topic Extractor software component, as described in the next subsection.

4.3. Topic Extraction

The Topic Extractor software component has been invoked with the following learning parameters: 12 topics (T = 12); an alpha prior equal to 1.67 (α = 1.67), which implements the α = 20/T rule cited in [7]; and a beta prior equal to 0.0041 (β = 0.0041), which implements the β = 200/W rule cited in [7]. The Gibbs sampling procedure has been run 100 times with different initial conditions and different initialization seeds. Each run consisted of 500 sampling iterations. The topics extracted through the last 99 runs of the Gibbs sampling learning procedure have been re-ordered to correspond as closely as possible with the topics obtained through the first run. Correspondence was measured by means of the symmetrized Kullback-Leibler distance.

The 12 topics extracted by the Topic Extractor software component have been summarized through their corresponding 500 most frequent words. Among the extracted topics, the four most interesting ones have been manually labeled as follows:

• SALUTE (medicine and health),
• COMPUTER (information and communication technologies),
• TEMPO LIBERO (travels and holidays),
• AMMINISTRAZIONE (bureaucracy, public services).

The structure of the four topics is described in Table 1 and Table 2. Each topic is associated with an estimate of its prior probability:

• P(SALUTE) = 0.0787,
• P(COMPUTER) = 0.0696,
• P(TEMPO LIBERO) = 0.0884,
• P(AMMINISTRAZIONE) = 0.1021.

Furthermore, each topic is summarized by a word list, whose 20 most frequent words are reported in Table 1 and Table 2. It is worthwhile to mention that for each pair of word wi and topic j, an estimate of the conditional probability of the word wi given the topic j, i.e. P(wi | j) (Eq. 1), is provided.

Table 1. SALUTE and COMPUTER.

SALUTE (0.0787)          COMPUTER (0.0696)
cellule       0.0032     blog          0.0063
emissioni     0.0029     google        0.0035
nutrizione    0.0028     linux         0.0034
molecolare    0.0026     copyright     0.0033
proteine      0.0022     wireless      0.0030
dieta         0.0022     source        0.0029
climatici     0.0021     access        0.0028
foreste       0.0021     client        0.0027
cancro        0.0021     multimedia    0.0027
aids          0.0020     hacker        0.0026
disturbi      0.0020     password      0.0026
infermiere    0.0019     giornalismo   0.0025
cibi          0.0019     browser       0.0023
tumori        0.0019     provider      0.0022
veterinaria   0.0018     telecom       0.0022
obesità       0.0018     brand         0.0022
clinico       0.0018     book          0.0021
serra         0.0017     chat          0.0021
virus         0.0017     wiki          0.0021
infezioni     0.0017     piattaforme   0.0021

Table 2. TEMPO LIBERO and AMMINISTRAZIONE.

TEMPO LIBERO (0.0884)    AMMINISTRAZIONE (0.1021)
sconto        0.0077     locazione     0.0021
aeroporto     0.0038     federale      0.0021
salone        0.0035     direttivo     0.0021
spiaggia      0.0028     finanze       0.0020
lago          0.0028     versamento    0.0020
colazione     0.0026     lire          0.0019
albergo       0.0025     commi         0.0019
vacanza       0.0025     prescrizioni  0.0018
piscina       0.0024     vietato       0.0018
vini          0.0023     contrattuale  0.0018
bagni         0.0023     richiedente   0.0018
voli          0.0021     utilizzatore  0.0017
pensione      0.0021     agevolazioni  0.0017
biglietto     0.0020     contabile     0.0017
notti         0.0020     appalto       0.0017
escursioni    0.0020     affidamento   0.0017
agevolazioni  0.0020     redditi       0.0017
archeologico  0.0019     sanzione      0.0017
piatti        0.0019     somme         0.0016
bicicletta    0.0019     indennità     0.0016
4.4. Multi-Label Classification

The performance of the software system, as a whole, has been estimated by submitting a new document corpus to the Multi-label Classifier. This document corpus has been collected by using the same random querying procedure described in subsection 4.1. Its documents have been manually labeled, according to the 12 first-level gDir topics, independently by three humans.

The labeled document corpus consists of 1,012 documents. It is worthwhile to mention that each document can be associated with one or more labels, i.e. a document can mention one or more topics. In detail, 478 documents are singly labeled, 457 are doubly labeled, while only 77 are associated with three labels.

The Multi-label Classifier, queried by using the binary document representation and by setting a posterior threshold equal to 0.5, achieves an accuracy equal to 73%, which can be considered satisfactory. The estimates of precision and recall for the four selected topics, namely COMPUTER, SALUTE, AMMINISTRAZIONE and TEMPO LIBERO, are reported in Table 3.

Table 3. Precision/Recall for SALUTE (SAL.), COMPUTER (COM.), TEMPO LIBERO (TEL.) and AMMINISTRAZIONE (AMM.).

            SAL.   COM.   TEL.   AMM.
Precision    76     85     78     92
Recall       41     59     44     79
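The decision rule of subsection 4.4 (assign every topic whose posterior probability exceeds the 0.5 threshold, so a document may receive several labels) can be sketched as follows. This is an illustrative reconstruction, not the system's code: the topic names come from the paper, but the posterior values and the helper names are invented for the example.

```python
TOPICS = ["SALUTE", "COMPUTER", "TEMPO LIBERO", "AMMINISTRAZIONE"]

def multi_label(posteriors, threshold=0.5):
    """Return every topic whose posterior P(topic | document) exceeds
    the threshold; the result may hold zero, one, or several labels."""
    return [t for t, p in zip(TOPICS, posteriors) if p > threshold]

def precision_recall(pred_sets, true_sets, label):
    """Per-label precision and recall over a corpus, where each document
    is described by a set of predicted labels and a set of manual labels."""
    tp = sum(1 for p, t in zip(pred_sets, true_sets) if label in p and label in t)
    fp = sum(1 for p, t in zip(pred_sets, true_sets) if label in p and label not in t)
    fn = sum(1 for p, t in zip(pred_sets, true_sets) if label not in p and label in t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical posteriors for one document: both SALUTE and
# AMMINISTRAZIONE clear the 0.5 threshold, so the document gets two labels.
print(multi_label([0.91, 0.12, 0.03, 0.64]))  # ['SALUTE', 'AMMINISTRAZIONE']
```

The same per-label counting of true positives, false positives, and false negatives underlies the precision/recall figures of Table 3.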
The best result is achieved for the topic AMMINISTRAZIONE, where the precision equals 92%, i.e. if the Multi-label Classifier labels a document with the label AMMINISTRAZIONE, then the labeling is wrong with respect to the manual labeling with probability 0.08. Furthermore, the recall equals 79%, which means that the documents manually labeled with AMMINISTRAZIONE are correctly labeled by the Multi-label Classifier with probability 0.79. The topic COMPUTER achieves a precision value equal to 85%, which is slightly lower than that achieved for AMMINISTRAZIONE. However, the achieved recall value drops from the 79% of AMMINISTRAZIONE to 59%. The topic TEMPO LIBERO achieves a precision value equal to 78%, i.e. slightly lower than that achieved for COMPUTER, while the achieved recall value, equal to 44%, is significantly lower than that achieved for COMPUTER. Finally, the topic SALUTE achieves performances comparable to those of TEMPO LIBERO: the precision equals 76%, while the recall equals 41%. Around 57% of the documents manually labeled with TEMPO LIBERO and/or with SALUTE are not identified as such by the Multi-label Classifier.

It is worthwhile to notice that for each label the achieved recall value is consistently lower than the precision value. A possible explanation for this behavior is as follows: the manual labeling procedure is both complex and ambiguous, and it could label documents by using a broader meaning for each topic. Therefore, it is somewhat expected that automatic document classification cannot achieve excellent performance with respect to both precision and recall. It is expected that the 'purer' a topic is, the better the performance achieved for the automatic document classification task will be.

However, it is important to keep in mind the difficulty of the considered labeling task, together with the fact that human labeling of documents can result in ambiguous and contradictory label assignments.

5. Conclusions and Future Work

The overwhelming amount of textual data calls for efficient and effective methods to automatically summarize information and to extract valuable knowledge. Text Mining offers a rich set of computational models and algorithms to automatically extract valuable knowledge from huge amounts of semi-structured and unstructured data.

In this paper a software system for topic extraction and document classification has been described. The software system assists the user in correctly discovering which main topics are mentioned in a document collection. The discovered topic structure, after being user-validated, is used to implement an automatic document classifier. This model suggests labels to be used for each new document submitted by the user. It is important to mention that each document is not restricted to receiving a single label but can be labeled with more than one topic. Furthermore, each topic, for a given document, is associated with a probability value that informs the user about the fit of the topic to the considered document. This feature offers an important opportunity to the user, who can sort his/her document collection in descending order of probability for each topic.

However, it must be clearly stated that many improvements can be achieved by taking into account specific user requirements. This aspect is under investigation, and particular attention is being dedicated to non-parametric models for the discovery of hierarchical topic structures. Finally, the interplay between topic taxonomies and topic extraction algorithms offers an interesting research direction to explore.

References

[1] R. Feldman and J. Sanger, The Text Mining Handbook. New York: Cambridge University Press, 2007.

[2] E. Wigner, "The unreasonable effectiveness of mathematics in the natural sciences," Communications on Pure and Applied Mathematics, vol. 13, no. 1, pp. 1–14, February 1960.

[3] A. Halevy, P. Norvig, and F. Pereira, "The unreasonable effectiveness of data," IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.

[4] M. W. Berry and M. Castellanos, Survey of Text Mining II: Clustering, Classification and Retrieval. London: Springer, 2008.
[5] T. L. Griffiths and M. Steyvers, "A probabilistic approach to semantic representation," in Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, W. D. Gray and C. D. Schunn, Eds., 2002, pp. 381–386.

[6] T. Hofmann, "Probabilistic latent semantic analysis," in Proc. of Uncertainty in Artificial Intelligence, UAI '99, 1999, pp. 289–296.

[7] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, January 2003.

[8] T. L. Griffiths and M. Steyvers, "Probabilistic topic models," in Handbook of Latent Semantic Analysis, T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, Eds. Erlbaum, 2007.

[9] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, no. 1-2, pp. 177–196, 2001. [Online]. Available: http://portal.acm.org/citation.cfm?id=599631

[10] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proc. Natl. Acad. Sci. U S A, vol. 101, Suppl. 1, pp. 5228–5235, April 2004.

[11] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. New York: Springer, 2002.

[12] R. Cooley, "Classification of news stories using support vector machines," in Proc. 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999.

[13] H. Lodhi, C. Saunders, N. Cristianini, C. Watkins, and B. Scholkopf, "Text classification using string kernels," Journal of Machine Learning Research, vol. 2, pp. 563–569, 2002.

[14] L. Chen, J. Huang, and Z.-H. Gong, "An anti-noise text categorization method based on support vector machines," in AWIC, 2005, pp. 272–278.

[15] A. Lehmann and J. Shawe-Taylor, "A probabilistic model for text kernels," in ICML '06: Proceedings of the 23rd International Conference on Machine Learning. New York, NY, USA: ACM, 2006, pp. 537–544.

[20] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in AAAI-98 Workshop on Learning for Text Categorization. AAAI Press, 1998, pp. 41–48.

[21] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in ECML '98: Proceedings of the 10th European Conference on Machine Learning. London, UK: Springer-Verlag, 1998, pp. 4–15.

[22] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive learning algorithms and representations for text categorization," in CIKM '98: Proceedings of the Seventh International Conference on Information and Knowledge Management. New York, NY, USA: ACM Press, 1998, pp. 148–155.

[23] Y. Yang and X. Liu, "A re-examination of text categorization methods," in SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM Press, 1999, pp. 42–49.

[24] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Machine Learning, vol. 39, pp. 103–134, 2000.

[25] Y. Yang and C. G. Chute, "An example-based mapping method for text categorization and retrieval," ACM Trans. Inf. Syst., vol. 12, no. 3, pp. 252–277, 1994.

[26] R. E. Schapire and Y. Singer, "BoosTexter: a boosting-based system for text categorization," Machine Learning, vol. 39, pp. 135–168, 2000.

[27] S.-B. Kim, K.-S. Han, H.-C. Rim, and S. H. Myaeng, "Some effective techniques for naive Bayes text classification," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, pp. 1457–1466, 2006.

[28] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York, NY, USA: McGraw-Hill, Inc., 1986.