
2010 International Conference on Asian Language Processing

Finding semantic similarity in Vietnamese


Dat Tien Nguyen
Information Technology Institute
Vietnam National University, Hanoi
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
datnt88@gmail.com

Son Bao Pham


Information Technology Institute
Vietnam National University, Hanoi
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
sonpb@vnu.edu.vn


Abstract—Finding semantic similarity is an important task in many natural language processing applications. Despite numerous works for popular languages, research on Vietnamese is still limited. In this paper, we tackle the problem of finding semantic similarity for Vietnamese using Random Indexing and Hyperspace Analogue to Language to represent the semantics of words and documents. We build a system to find synonyms in Vietnamese. Experimental results show that our system achieves accuracies of 75% for finding synonyms of verbs and 65% for synonyms of nouns.

II. RELATED WORK


There are many approaches to automatically finding semantic similarity from text corpora, and to representing the semantics of words.
One approach uses lexical resources such as WordNet or Roget's Thesaurus. WordNet is a large lexical database for English. It defines general concepts through synonym sets and records the semantic relations between these synonym sets. Approaches in this family find semantic similarity by combining the structure and content of WordNet with co-occurrence information from raw text. Many metrics calculate relatedness using various properties of the hierarchical structure of WordNet [4], [5], [6]. These methods use co-occurrence information in WordNet definitions to build context vectors or gloss vectors corresponding to each concept in WordNet.
Some common methods for constructing a semantic lexicon of synonymous words use bootstrapping. Given a small set of words, called seeds, various methods exploiting appositives, compound nouns, and ISA-clauses are used to grow the seed set. Roark and Charniak [7] grow the seed set using co-occurrences in lists of words, conjunctions between words, and appositives. The best performing method at the time, by Thelen and Riloff [8], repeatedly alternates feature selection and word selection for each class of words. Others use a co-training combination of three bootstrapping methods that exploit appositives, compound nouns, and ISA-clauses [9], or word sense disambiguation [10].
There have been many proposals to extract the semantics of words using measures of distributional similarity, but these cannot distinguish synonyms from other types of related words such as antonyms and hyponyms. Lonneke van der Plas and Tiedemann [11] present a method based on automatic word alignment of parallel corpora consisting of documents translated into multiple languages. Their results are better than those of a monolingual syntax-based method.

Keywords—semantic vector, word space model, hyperspace analogue to language, random projection
I. INTRODUCTION
Finding semantic similarity is an interesting problem in Natural Language Processing (NLP). Determining the semantic similarity of a pair of words is important in many NLP applications. There is not much research on semantic similarity for Vietnamese, even though semantic similarity plays a crucial role in human categorization and reasoning, and computational similarity measures have been applied in many fields such as semantics-based information retrieval, information filtering, and ontology engineering.
Nowadays, the word space model [1] is widely used in research on semantic similarity. In particular, there are well-known approaches for representing the context vectors of words, such as Hyperspace Analogue to Language (HAL) [2] and Random Indexing (RI) [3], which have proven useful in implementing the word space model. In this paper, we implement the word space model to compute semantic similarity. We considered a number of methods and investigated their advantages and disadvantages in order to select techniques suitable for Vietnamese text data. We then built a complete system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System, which combines several processing steps to return the synonyms of a given word. Our experimental results on the task of finding synonyms are promising.
Our paper is organized as follows. In section II, we introduce background knowledge about the word space model and review some existing approaches that have been proposed for its implementation. In section III, we describe our Semantic Similarity Finding System for finding synonyms in Vietnamese. Section IV presents our experimental setup. In section V we carry out experiments to evaluate the quality of our system. We discuss some future directions to improve our system and conclude in section VI.
978-0-7695-4288-1/10 $26.00 2010 IEEE
DOI 10.1109/IALP.2010.78


Monolingual syntax-based distributional similarity is used in many proposals to find related words [12], [13].
Our approach is similar to spectral analysis approaches. These methods use factor analysis techniques to reduce the high-dimensional semantic vectors built from a text corpus. LSI is another such method; it uses the SVD matrix technique to reduce high-dimensional vectors. In studies of LSI, Landauer and Dumais [14] report 64.4% accuracy in identifying the correct answers to multiple-choice TOEFL questions. Schuetze [15] uses spectral analysis for vector dimensionality reduction and shows that the results can be affected either positively or negatively.
A promising method is learning from unlabeled data via spectral analysis [16], using a small number of labeled examples (seed words). That work shows that spectral analysis is useful for compensating for the paucity of labeled examples by learning from unlabeled data, employing techniques such as EM and co-training.
III. SEMANTIC SIMILARITY FINDING SYSTEM
We built a complete system to find synonyms in Vietnamese. We use Random Indexing and HAL to represent the semantics of words as vectors and to measure document or word similarities. Our system implements two methods: Random Indexing (RI) and Hyperspace Analogue to Language (HAL). RI is similar to LSA [17] in that it calculates a term-document matrix from a sparse corpus. The HAL model creates a context vector using only the words that immediately surround a target word. HAL builds a word-by-word co-occurrence matrix and, in contrast to LSA, was developed specifically for semantic vector research.
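As a rough illustration of the Random Indexing idea (this is a minimal sketch, not the actual Semantic Vectors implementation; the dimensions, sparsity, and function names are our own assumptions), each document is assigned a sparse random index vector, and a term's context vector is accumulated as the sum of the index vectors of the documents it occurs in:

```python
import numpy as np

def random_index_vector(dim, nonzero, rng):
    """Sparse ternary index vector: a few randomly placed +1/-1 entries."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def build_term_vectors(docs, dim=100, nonzero=10, seed=0):
    """Document-as-context Random Indexing: a term's context vector is
    the sum of the random index vectors of the documents it occurs in."""
    rng = np.random.default_rng(seed)
    doc_vecs = [random_index_vector(dim, nonzero, rng) for _ in docs]
    term_vecs = {}
    for dv, doc in zip(doc_vecs, docs):
        for term in set(doc.split()):
            term_vecs.setdefault(term, np.zeros(dim))
            term_vecs[term] += dv
    return term_vecs

def cosine(a, b):
    """Cosine similarity between two context vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Terms that occur in exactly the same documents receive identical context vectors, so their cosine similarity is 1; terms with overlapping document sets receive correlated vectors, approximating distributional similarity without ever building the full term-document matrix.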

Figure 1: System process flow

A. Description
The Semantic Similarity Finding System contains three components:
Word Segmentation: this component is used in the preprocessing step. We use the WS4VN package [18] to segment all Vietnamese lexical units in documents.
Lucene Indexing: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java (http://lucene.apache.org).
Semantic Vector package: the Semantic Vector package is an open-source package that efficiently builds semantic vectors or context vectors of words and documents from a raw text corpus.
The next subsection presents the overall process flow of our Semantic Similarity Finding System.

Secondly, we use WS4VN [18] to segment all terms or words in the free text corpus. We obtain a new corpus called the Segmented Corpus.
Thirdly, we build a Lucene index for all documents and terms. There are two kinds of Lucene index in our system, corresponding to two modes (MODE 1 and MODE 2).
The final step is creating the document vectors and term vectors. The Semantic Vector package makes this step fast. From the Lucene index files, we build the context vectors of all words or terms in the corpus in the two modes (MODE 1 and MODE 2), which differ in how the context vector of a term is built.

MODE 1 creates the context vector of a term by collecting information on which words co-occur together in a document, implementing Random Indexing. The context of a word is the whole document in which it occurs.
MODE 2 captures semantic relations by collecting information on which words co-occur with other similar words. It creates semantic vector indexes using the sliding context window approach. MODE 2 implements a combination of two approaches: Random Indexing and Hyperspace Analogue to Language. First, the system allocates a random vector for each document, and then it indexes all terms in the corpus with positional offset data. The context of a word is the set of words that immediately surround it. MODE 2 differs from MODE 1 by providing an extra windowradius argument that specifies the size of the context window, including the focus term at its center. For example, windowradius 1 uses as context a window of radius 1, i.e., the center term and the terms immediately to its left and right.
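The sliding-window context of MODE 2 can be sketched as follows. This is a simplified count-based version for illustration only; the real system builds reduced-dimensional vectors via Random Indexing, and the function and parameter names here are ours:

```python
from collections import defaultdict
import numpy as np

def window_context_vectors(tokens, vocab, window_radius=1):
    """HAL-like context: count co-occurrences only within a window of
    the given radius around each focus term."""
    index = {w: i for i, w in enumerate(vocab)}
    vecs = defaultdict(lambda: np.zeros(len(vocab), dtype=int))
    for i, focus in enumerate(tokens):
        lo = max(0, i - window_radius)
        hi = min(len(tokens), i + window_radius + 1)
        for j in range(lo, hi):
            if j != i:  # skip the focus term itself
                vecs[focus][index[tokens[j]]] += 1
    return dict(vecs)
```

With window_radius=1, only the immediate left and right neighbors of each focus term contribute to its context vector, which is exactly the granularity contrast with the document-sized contexts of MODE 1.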

B. System process flow
Figure 1 shows the process flow of the system. Our system consists of the following steps:
First, we collect data for training. The data used in this system consist of a large number of free texts, which are documents available from many sources (articles and news). This set of documents is called the free text corpus.



IV. EXPERIMENTAL SETUP


Data Setup
There are many possible resources, but we mostly collected data from newspapers and articles on the Internet. We have a large free text corpus that contains 15736 news articles from various fields such as sport, economy, and science. After the segmentation step, there are 159226 terms in the segmented corpus.
Test Corpus
To evaluate the accuracy of our system, we built a test corpus. We randomly selected target Vietnamese words of four common parts of speech (POS): 100 nouns, 20 pronouns, 100 verbs, and 80 adjectives. We then found all of their synonyms in a synonym dictionary to create the Test Corpus, which contains the target words and their synonyms.
Experimental metric
To evaluate our Semantic Similarity Finding System, we define the following metric for each target word, with n = 1..19:
Pn(word) = 1 if there is at least one synonym in the top n + 1 outputs, and 0 otherwise.

B. Experiment 2: Context-size Evaluation
This evaluation examines how the outputs for each word depend on the size of the context, i.e., the sliding window. It leads to a suitable context size within which words are considered to co-occur.

Figure 2: Context Size

We created a simple experiment for our system. In each case, the system runs with a different context size: 3, 5, 7, 9, 11, ..., 19. The test randomly picks 4 instances for each context window and calculates the average of the correct results, which are presented in Figure 3.


For example, suppose that among the 19 output synonyms of the word ô tô (car), the word xe hơi (automobile) is the third output synonym. Then
P1(ô tô) = P2(ô tô) = 0
and P3(ô tô) = P4(ô tô) = ... = P19(ô tô) = 1.
A set S = (word1, word2, ..., wordk) contains k words. The accuracy of the system on S is defined as:
Pn(S) = (1/k) * (Pn(word1) + Pn(word2) + ... + Pn(wordk))
The correctness of the system on S is P1(S).
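The metric can be sketched in a few lines of Python. Following the worked example above (a synonym at rank three gives P3 = 1), we check the top n output synonyms; the function names and data layout are our own illustration, not part of the original system:

```python
def p_n(output_synonyms, gold_synonyms, n):
    """P_n(word): 1 if at least one dictionary synonym appears among
    the top n output synonyms (matching the worked example), else 0."""
    return 1 if set(output_synonyms[:n]) & set(gold_synonyms) else 0

def accuracy(outputs, gold, n):
    """P_n(S) = (1/k) * sum of P_n(word_i) over the k words of S.
    `outputs` and `gold` map each target word to its ranked system
    outputs and its dictionary synonyms, respectively."""
    return sum(p_n(outputs[w], gold[w], n) for w in gold) / len(gold)
```

The correctness P1(S) is then simply accuracy(outputs, gold, 1), i.e., the fraction of target words whose first returned synonym is correct.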

Figure 3: Results when context-size changes

To sum up, the best average is obtained with a minimal window size of 3 words, which leads to a score of 55% on the Test Corpus. The results show that the smallest context size (3 words) provides the most accurate results on this test.

V. EVALUATIONS
A. Experiment 1: Two kinds of context vector
We run both modes of the system on all target words of the Test Corpus. In MODE 2, the context-size parameter varies from 3 to 19 (see the Context-size Evaluation).

C. Experiment 3: Performance of system


This experiment investigates the overall quality of the system. We evaluate the output words returned by the system and investigate how deep in the output ranking we must look to collect accurate results for each kind of word: nouns, pronouns, verbs, and adjectives. Our Semantic Similarity Finding System runs only in MODE 2 with the context size equal to 3.

Table I: Results of MODE 1 and MODE 2 on the Test Corpus

            P1 (%)    P19 (%)
MODE 1       6.25     13.75
MODE 2      55        92.5

The synonym results of MODE 1 are lower than those of MODE 2. This emphasizes the effect of granularity in computing the context vector of a word: both whole documents and sliding windows can be used to compute context vectors. In the document case, the output contains words whose meanings are related to the target word; they are not synonyms, but they often co-occur in documents. Results of this kind are used as latent semantics in many search engine applications [10]. In contrast, when the context vector is computed from the words that immediately surround the target word, the outputs are true synonyms.

Figure 4: Pn , n=1...19 for each kind of word


We compare the synonyms among the four kinds of target words. Pronouns have the smallest number of synonyms, while verbs and nouns have many. Within the top 6 outputs, our system found all the synonyms of the verbs; for nouns, all synonyms are found within the top 12. The accuracy of finding synonyms is 75% for verbs and 65% for nouns. The synonyms of nouns are more accurate than those of adjectives: for all target words in the Test Corpus, all synonymous nouns are found in the top 12 outputs, while all synonymous adjectives are in the top 19 outputs.
Figure 5 shows the average estimation for the target words in the Test Corpus. On average, within the top 6 output words, the results for finding synonyms are promising; the accuracy of the system is more than 80%.

To the best of our knowledge, this is the best result obtained so far for Vietnamese. We compare our results with other methods used for finding semantic similarity in English. In studies of LSI, Landauer and Dumais [14] report 64.4% accuracy in identifying the correct answers to multiple-choice TOEFL questions. A study of Random Indexing [19] gave an accuracy of 35-44% with unnormalized 1800-dimensional vectors and 48-51% with normalized vectors.
In the future, we will improve our system to collect all meanings of a word, and build a directional word-by-word co-occurrence matrix, in which rows contain left-context co-occurrences and columns contain right-context co-occurrences, to gain more accuracy for adjectives and pronouns.
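The directional matrix proposed above can be sketched as follows. This is a minimal count-based illustration under our own naming; the paper does not specify an implementation:

```python
import numpy as np

def directional_cooccurrence(tokens, vocab, window_radius=1):
    """Directional word-by-word co-occurrence matrix: M[i, j] counts how
    often vocab[j] occurs to the LEFT of vocab[i] within the window, so
    rows hold left-context co-occurrences and columns hold right-context
    co-occurrences."""
    idx = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)), dtype=int)
    for q, w in enumerate(tokens):
        # every token within window_radius positions before the focus
        for p in range(max(0, q - window_radius), q):
            m[idx[w], idx[tokens[p]]] += 1
    return m
```

Keeping left and right contexts separate preserves word-order information that a symmetric co-occurrence matrix discards, which is the motivation for expecting better accuracy on adjectives and pronouns.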
REFERENCES
[1] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, USA, 1986.
[2] K. Livesay and C. Burgess, Producing high-dimensional semantic spaces from lexical co-occurrence, Behavior Research Methods, Instruments, & Computers, 28, 203-208, 1997.
[3] M. Sahlgren, An introduction to random indexing, In Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, Copenhagen, Denmark, 2005.
[4] S. Banerjee and T. Pedersen, Extended gloss overlaps as a measure of semantic relatedness, In IJCAI, pages 805-810, 2003.
[5] J. J. Jiang and D. W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, In ROCLING'97, 1997.
[6] P. Resnik, Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language, JAIR 11:95-130, 1999.
[7] B. Roark and E. Charniak, Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction, In Proceedings of COLING-ACL'98, 1998.
[8] M. Thelen and E. Riloff, A bootstrapping method for learning semantic lexicons using extraction pattern contexts, In Proceedings of EMNLP'02, 2002.
[9] W. Phillips and E. Riloff, Exploiting strong syntactic heuristics and co-training to learn semantic lexicons, In Proceedings of EMNLP'02, 2002.
[10] D. Yarowsky, Unsupervised word sense disambiguation rivaling supervised methods, In Proceedings of ACL'95, pages 189-196, 1995.
[11] L. van der Plas and J. Tiedemann, Finding synonyms using automatic word alignment and measures of distributional similarity, Association for Computational Linguistics, Morristown, NJ, USA, 2006.
[12] J. Curran and M. Moens, Improvements in automatic thesaurus extraction, In Proceedings of the Workshop on Unsupervised Lexical Acquisition, pages 59-67, 2002.
[13] L. van der Plas and G. Bouma, Syntactic contexts for finding semantically similar words, Proceedings of the Meeting of Computational Linguistics in the Netherlands (CLIN), 2005.
[14] T. K. Landauer and S. T. Dumais, A solution to Plato's problem, Psychological Review 104:211-240, 1997.
[15] H. Schuetze, Dimensions of meaning, In Proceedings of Supercomputing'92, pages 787-796, 1992.
[16] R. K. Ando, Semantic lexicon construction: Learning from unlabeled data via spectral analysis, in HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), 2004.
[17] T. K. Landauer, P. W. Foltz, and D. Laham, An introduction to latent semantic analysis, Discourse Processes, 25, 259-284, 1998.
[18] D. D. Pham, G. B. Tran, and S. B. Pham, A hybrid approach to Vietnamese word segmentation using part of speech tags, International Conference on Knowledge and Systems Engineering, 2009.
[19] P. Kanerva, J. Kristoferson, and A. Holst, Random indexing of text samples for latent semantic analysis, In Proc. of the 22nd Annual Conference of the Cognitive Science Society, New Jersey, USA, 2000.

Figure 5: Average of found synonyms on the Test Corpus

At this level, all synonyms of the verbs and nouns have been found, and the adjectives show reasonably good results, while the pronouns do not. Some verbs and nouns returned many synonyms, which demonstrates the effectiveness of our system.
Antonyms
Some target words that describe the status of a person, such as adjectives or social relationships (father, mother, you, friends, ...), produce many antonyms in our experiments. Although they are antonyms, they often co-occur in many contexts:
con trai (boy) - 0.89182067: con gái (girl)
sung sướng (happy) - 0.8050421: đau đớn (painful)
yêu (love) - 0.8493682: ghét (hate)
VI. CONCLUSION
In our paper, we have researched various techniques and investigated their advantages as well as disadvantages in order to select suitable techniques for the task of finding semantic similarity in Vietnamese. We built a complete system to find synonyms in Vietnamese. Our system applies the RI and HAL approaches to compute the context vectors of words.
Our experiments show the effect of granularity on the creation of context vectors. For identifying synonyms in Vietnamese, the performance of the system is good for nouns and verbs, and the results for adjectives and pronouns are promising. In our experiments, the accuracy of our system on verbs and nouns is higher than on adjectives and pronouns. The average correctness is 55%, while the accuracy is 75% for verbs and 65% for nouns.

