Keywords: semantic vector, word space model, hyperspace analogue to language, random projection
I. INTRODUCTION
Measuring the semantic similarity of a pair of words is an
important problem in many Natural Language Processing (NLP)
applications. Semantic similarity plays a crucial role in
human categorization and reasoning, and computational
similarity measures have been applied in many fields, such as
semantics-based information retrieval, information filtering,
and ontology engineering. Nevertheless, little research has
been done on semantic similarity for Vietnamese.
The word space model [1] is widely used in current research
on semantic similarity. Well-known approaches for representing
the context vectors of words include Hyperspace Analogue to
Language (HAL) [2] and Random Indexing (RI) [3], both of which
have proven useful for implementing the word space model. In
this paper, we implement the word space model to compute
semantic similarity. We considered a number of methods and
weighed their advantages and disadvantages in order to select
techniques suitable for Vietnamese text data. We then built a
complete system for finding synonyms in Vietnamese, called the
Semantic Similarity Finding System. The system combines
several processes to return the synonyms of a given word. Our
experimental results on the synonym-finding task are promising.
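As an illustration of how a word space can be built, the following is a minimal Python sketch of Random Indexing. It is not the implementation used in our system. It assumes a tiny toy corpus that is already segmented into tokens; the names `index_vector`, `build_context_vectors` and `cosine`, as well as the dimensionality, the number of non-zero entries and the window size, are hypothetical choices made only for this example.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector: a few random +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def build_context_vectors(sentences, dim=1000, nonzeros=6, window=2, seed=0):
    """Context vector of a word = sum of the index vectors of its neighbours."""
    rng = random.Random(seed)
    # Assign one fixed random index vector to every distinct token.
    index = {}
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(dim, nonzeros, rng)
    # Accumulate neighbours' index vectors within `window` tokens on each side.
    context = defaultdict(lambda: [0] * dim)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j == i:
                    continue
                vec = context[target]
                for k, value in enumerate(index[tokens[j]]):
                    vec[k] += value
    return dict(context)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus, already segmented into tokens.
toy_corpus = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["a", "cat", "chases", "mice"],
    ["a", "dog", "chases", "birds"],
]
vectors = build_context_vectors(toy_corpus)
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["milk"]))
```

In this toy corpus "cat" and "dog" share most of their neighbours, so their context vectors should be closer to each other than to that of "milk". A real system would use a much larger corpus and tuned parameters.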
The paper is organized as follows. Section II introduces
background on the word space model and reviews existing
approaches to implementing it. In section III, we
978-0-7695-4288-1/10 $26.00 2010 IEEE
DOI 10.1109/IALP.2010.78
A. Description
The Semantic Similarity Finding System contains three
components:
Word Segmentation: This component is used in the preprocessing
step. We used the WS4VN package [18] to segment all Vietnamese
words in the documents.
Lucene Indexing: Apache Lucene is a high-performance,
full-featured text search engine library written entirely in Java.
Semantic Vectors Package: The Semantic Vectors package is an
open-source package that efficiently builds semantic vectors
(context vectors) of words and documents from a raw text
corpus. In the next section, we present an overview of our
Semantic Similarity Finding System.
$$
P_n(\text{word}) =
\begin{cases}
1, \\
0,
\end{cases}
\qquad \text{where } n = 1, \dots, 19
$$
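One common reading of an indicator such as Pn(word) is a top-n hit: the score is 1 when at least one correct synonym of the target word appears among the first n returned candidates, and 0 otherwise. That reading is an assumption made for this sketch, not a definition stated here. The function names, the toy candidate lists and the gold synonyms below are likewise hypothetical.

```python
def p_at_n(ranked_candidates, gold_synonyms, n):
    """Assumed indicator: 1 if any of the top-n candidates is a gold synonym, else 0."""
    top_n = ranked_candidates[:n]
    return 1 if any(candidate in gold_synonyms for candidate in top_n) else 0


def mean_p_at_n(results, gold, n):
    """Average the indicator over all target words and express it as a percentage."""
    hits = [p_at_n(results[word], gold[word], n) for word in results]
    return 100.0 * sum(hits) / len(hits)


# Hypothetical system output: target word -> candidates ranked by similarity.
toy_results = {
    "big": ["small", "large", "huge"],
    "fast": ["quick", "slow", "rapid"],
}
# Hypothetical gold standard: target word -> set of accepted synonyms.
toy_gold = {
    "big": {"large", "huge"},
    "fast": {"quick", "rapid"},
}

print(mean_p_at_n(toy_results, toy_gold, 1))
print(mean_p_at_n(toy_results, toy_gold, 3))
```

Under this assumed reading, the percentage can only stay the same or grow as n increases, because a hit within the top n is also a hit within any larger cut-off.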
V. EVALUATIONS
A. Experiment 1: Two kinds of context vector
We ran both modes of the system on all target words of the
Test Corpus. In MODE 2, the context-size parameter varies
from 3 to 19 (see Context-size Evaluation).
P1 %     6.25     55
P19 %    13.75    92.5
65% for nouns. To the best of our knowledge, this is the best
result obtained so far for Vietnamese. We compare our results
with other methods for finding semantic similarity in English.
In their LSI study, Landauer and Dumais [14] report 64.4%
accuracy in identifying correct answers to multiple-choice
TOEFL questions. A study of Random Indexing [19] reported
an accuracy of 35-44% with unnormalized 1800-dimensional
vectors and 48-51% with normalized vectors.
In the future, we will improve our system to collect
all meanings of a word, and to build a directional
words-by-words co-occurrence matrix in which rows contain
left-context co-occurrences and columns contain right-context
co-occurrences, in order to gain more accuracy for adjectives
and pronouns.
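A directional words-by-words co-occurrence matrix of this kind can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a planned implementation: counts are unweighted (HAL weights them by distance), the window size is arbitrary, and the names `directional_cooccurrence`, `left_context` and `right_context` and the toy sentences are hypothetical.

```python
from collections import defaultdict


def directional_cooccurrence(sentences, window=2):
    """matrix[w][c] counts how often c occurs within `window` tokens BEFORE w."""
    matrix = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                matrix[word][tokens[j]] += 1
    return matrix


def left_context(matrix, word):
    """Row of `word`: the words seen to its left, with their counts."""
    return dict(matrix.get(word, {}))


def right_context(matrix, word):
    """Column of `word`: the words seen to its right, with their counts."""
    return {other: row[word] for other, row in matrix.items() if word in row}


# Hypothetical toy corpus, already segmented into tokens.
toy_sentences = [["a", "b", "c"], ["a", "c"]]
cooc = directional_cooccurrence(toy_sentences, window=2)

print(left_context(cooc, "c"))   # words that precede "c"
print(right_context(cooc, "a"))  # words that follow "a"
```

Keeping the two directions separate lets a word be described by the concatenation of its row and its column, so that words differing mainly in what precedes or follows them receive distinct vectors.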
REFERENCES
[1] G. Salton and M. J. McGill, Introduction to modern information
retrieval, McGraw-Hill, Inc. New York, NY, USA, 1986.
[2] K. Livesay and C. Burgess, Producing high-dimensional semantic
spaces from lexical co-occurrence, Behavior Research Methods,
Instruments, & Computers, 28, 203-208, 1997.
[3] M. Sahlgren, An introduction to random indexing, In Proceedings of
the Methods and Applications of Semantic Indexing Workshop at the 7th
International Conference on Terminology and Knowledge Engineering,
TKE 2005, Copenhagen, Denmark, 2005.
[4] S. Banerjee and T. Pedersen, Extended gloss overlaps as a measure of
semantic relatedness, In IJCAI, pages 805-810, 2003.
[5] J. J. Jiang and D. W. Conrath, Semantic similarity based on corpus
statistics and lexical taxonomy, In ROCLING'97, 1997.
[6] P. Resnik, Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural
language, JAIR 11:95-130, 1999.
[7] B. Roark and E. Charniak, Noun-phrase co-occurrence statistics
for semi-automatic semantic lexicon construction, In Proceedings of
COLING-ACL'98, 1998.
[8] M. Thelen and E. Riloff, A bootstrapping method for learning
semantic lexicons using extracting pattern contexts, In Proceedings of
EMNLP'02, 2002.
[9] W. Phillips and E. Riloff, Exploiting strong syntactic heuristics and
co-training to learn semantic lexicons, In Proceedings of EMNLP'02,
2002.
[10] D. Yarowsky, Unsupervised word sense disambiguation rivaling
supervised methods, In Proceedings of ACL'95, pages 189-196, 1995.
[11] L. van der Plas and J. Tiedemann, Finding synonyms using automatic
word alignment and measures of distributional similarity, Association
for Computational Linguistics, Morristown, NJ, USA, 2006.
[12] J. Curran and M. Moens, Improvements in automatic thesaurus
extraction, In Proceedings of the Workshop on Unsupervised Lexical
Acquisition, pages 59-67, 2002.
[13] L. van der Plas and G. Bouma, Syntactic contexts for finding
semantically similar words, In Proceedings of the Meeting of
Computational Linguistics in the Netherlands (CLIN), 2005.
[14] T. K. Landauer and S. T. Dumais, A solution to Plato's problem,
Psychological Review 104:211-240, 1997.
[15] H. Schuetze, Dimensions of meaning, In Proceedings of
Supercomputing'92, pages 787-796, 1992.
[16] R. K. Ando, Semantic lexicon construction: Learning from unlabeled
data via spectral analysis, In HLT-NAACL 2004 Workshop: Eighth
Conference on Computational Natural Language Learning (CoNLL-2004), 2004.
[17] T. K. Landauer, P. W. Foltz, and D. Laham, An introduction to Latent
Semantic Analysis, Discourse Processes, 25, 259-284, 1998.
[18] D. D. Pham, G. B. Tran, and S. B. Pham, A hybrid approach to
Vietnamese Word Segmentation using Part of Speech tags, International
Conference on Knowledge and Systems Engineering, 2009.
[19] P. Kanerva, J. Kristoferson, and A. Holst, Random indexing of text
samples for latent semantic analysis, In Erlbaum (editor), Proc. of the
22nd Annual Conference of the Cognitive Science Society, New Jersey,
USA, 2000.