International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 272
Abstract: Natural Language Processing (NLP) is an area which is concerned with the computational aspects of the human language. The main difficulty in natural language processing tasks is perhaps its ambiguity. Ambiguity in natural language pervades virtually all aspects of language analysis. Sentence analysis in particular exhibits a large number of ambiguities that demand adequate resolution before the sentence can be understood. Most of the language processing applications like Information Retrieval (IR), Information Extraction (IE), Question-Answering systems, Text Summarization and Machine Translation (MT) are affected by the highly ambiguous nature of natural language. Ambiguities in sentence analysis are generally categorized into two types: lexical and structural ambiguities. The present work describes the methodology to resolve lexical ambiguity in a Kannada sentence. Lexical ambiguity arises when a lexical item has alternate meanings and different Part-Of- Speech (POS) tags. The paper describes the decision list based algorithm to disambiguate Kannada polysemous words. We built Kannada corpora using web resources. It is further divided in to training and testing corpora. The decision list required for disambiguation task is created using training corpora. The example sentences needs to be disambiguated are stored in testing corpora. The proposed algorithm attempts to disambiguate all the content words such as nouns, verbs, adverbs, adjectives in an unrestricted Kannada text sentence. The algorithm is based on one powerful assumption that words tend to have one sense per collocation. i.e. the nearby words provide strong and consistent clues to the sense of the ambiguous word.
Keywords: Decision list, Lexical ambiguity, Corpus, Machine Translation, Word Sense Disambiguation. 1. INTRODUCTION One of the fundamental tasks in Natural Language Processing is Word Sense Disambiguation. It is the problem of determining in which sense a word having a number of distinct senses is used in a given sentence. Consider the following sentence. O V I^ Io . [raamanu pandithara baLI hoogi yantra haakisikonDu bandanu] 'Ramu received an astrological device from the panDith' O -O Iv^C. [raamana maneya niirettuva yantra haaLaagide] 'Rama's house water lifting machine is got corrupted'.
In the above sentence the word [yantra] is ambiguous. It has two distinct meanings such as an astrological device (Sense A) and a Type of machinery (Sense B). To a human, it is obvious that the first sentence is using the word [yantra] in an astrological device sense and in the second sentence, it is being used in a type of machinery sense. The process of identifying the correct sense of the word [yantra] in a given context is called Word Sense Disambiguation (WSD). As a human being, we can easily disambiguate the word [yantra] using the external world knowledge. But developing algorithms to replicate this human ability is a difficult task. There are a number of ways to approach this problem. One simple way is to determine which sense occurs most commonly (Most Frequent Sense) and always choose that sense. This technique acts as a base line for the remaining WSD algorithms. Initial research on WSD focused on disambiguating a few selected target words in a given sentence. Until recently, research in WSD did not focus on disambiguating all the content words in a text in one go. Disambiguating all the content words is called unrestricted WSD. In the present work, we propose a machine learning algorithm based on Decision List for unrestricted Kannada text WSD. The algorithm considers multiple types of evidence in the context of the ambiguous word, exploiting the differences in the collocation distribution as measured by log-likelihood. The algorithm exploits the one sense per collocation property of the human language. i.e. nearby words provide strong and consistent clues to the sense of a target word. In order to achieve high quality translation output in machine translation, word sense disambiguation is one of the most important problems to be solved. This is the motivation behind the present work. The WSD is necessary not only in Machine Translation but also in almost every application of language technology including information retrieval or extraction, knowledge mining or acquisition, lexicography, semantic interpretation etc. The rest of the paper is organized as follows. Section 2 explores previous work done in word sense disambiguation and presents the current state of the word sense disambiguation. Section 3 introduces linguistic preliminaries of the Kannada language and the basic Kannada Word Sense Disambiguation Using Decision List
Parameswarappa S and Narayana V.N
Department of Computer Science & Engineering, Malnad College of Engineering, Hassan, Karnataka, 573 202, India
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 273
infrastructure requirement of the present work. Section 4 discuses the basics of decision list and its construction using the Kannada corpus. Section 5 describes the methodology and the proposed algorithm used for disambiguation task. Section 6 provides the detailed information required to implement the proposed algorithm. Section 7 discuses the algorithm testing, evaluation and discuss the observations made during the process. Section 8 concludes the paper and provides the pointer for future research in this direction. 2. RELATED WORK Word Sense Disambiguation has been a key task in the field of Natural Language Processing since late 1940s. Based on how the disambiguation information is acquired by the WSD system, they are classified as knowledge- based, hybrid and corpus-based systems [1]. Knowledge-based approaches encompass systems that rely on information from an explicit lexicon such as Machine Readable Dictionaries, thesauri, computational lexicons such as Wordnet [2] or hand crafted knowledge bases. Some of the examples for Knowledge based approaches are Lesks algorithm [3], Walker's algorithm [4]. Hybrid approaches like WSD using Structural Semantic Interconnections [5] use combination of more than one knowledge sources such as Wordnet as well as a small amount of tagged corpora. Unsupervised algorithms work directly from un-annotated raw corpora. They have the potential to overcome the new knowledge acquisition bottleneck and they have achieved good result. [6] and [7] are some of the examples of unsupervised approaches. Supervised and semi-supervised methods make use of annotated corpora to train from or as seed data in a bootstrapping process. Some of the examples for supervised learning algorithms are WSD using SVM [8], Exemplar based WSD [9]. An example for both supervised and semi-supervised algorithm is Decision list algorithm. Decision list algorithm is an accurate algorithm. Its performance is better than other algorithms. The formal model of decision lists was presented in [10]. The statistical decision procedure for lexical ambiguity resolution was presented by [11]. The algorithm exploits both local syntactic patterns and more distinct collocation evidence, generating an efficient, effective and highly perspicuous recipe for resolving a given ambiguity. Yarowsky proposed an algorithm based on two powerful constraints namely one sense per collocation and one sense per discourse for sense disambiguation [12]. Initial WSD research focused on disambiguating a few selected target words in a given sentence. But the ultimate aim of WSD is to disambiguate all the content words in a sentence. Rada Mihalcea [13] proposed a method based on semantic density to do unrestricted WSD. 3. KANNADA LANGUAGE This section introduces the linguistic preliminaries of the Kannada language and the basic infrastructure requirement of the present work. 3.1 Linguistic preliminaries of Kannada The world languages are classified in to two categories. Namely, fixed word order and free word order. In the former case, the words constituting a sentence can be positioned in a sentence according to grammatical rules in some standard ways. On the other hand, in the later case, no fixed ordering is imposed on the sequence of words in a sentence. An example for fixed word order language is English and that of pure free word order language is Sanskrit. Generally Kannada is a free word order language. But, it lost free word ordering partially in the course of evolution. Here word groups are free order but the internal structure of word groups is fixed order [14]. Kannada is an agglutinating language of the suffixing type. Nouns are marked for number and case and verbs are marked, in most cases, for agreement with the subject in number, gender and person. This makes Kannada, a relatively free word order language. Kannada language exhibits a very rich system of morphology. Morphology includes inflection, derivation, conflation (sandhi) and compounding [15]. Indian languages come from four different language families - the Indo-Aryan, The Tibeto-Burman, The Austro-Asiatic and the Dravidian. Kannada belongs to Dravidian family. Kannada is one of the technologically least developed languages in India today. This is ironical since Kannada has a very old and rich literary tradition, it is currently spoken by a very large number of people, and Karnataka is in the centre stage of IT (Information Technology) revolution in the country. As of today, the only corpus we have is the roughly 3 Million word corpus developed by CIIL (Central Institute of Indian Languages) Mysore long ago. Lack of basic resources such as corpora is one of the major reasons for our lagging behind in language technology. Several languages in India today have 30 to 50 Million word corpora. There are hardly any electronic dictionaries, morphological analyzers, POS tagger and Computational Grammars or Parsing systems for Kannada worth taking seriously. Naturally, we are lagging behind in many areas of linguistics as also in language technologies [16]. 3.2 Kannada Corpus Corpus is a large and representative collection of language material stored in a computer processable form [17]. It provides realistic, interesting and insightful examples of language use for theory building and for verifying hypothesis [18]. Corpus provides the basic International Journal of EmergingTrends & Technology in Computer Science(IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 274
language data from which lexical resources, such as dictionaries, thesauri, Wordnet etc can be generated [19]. Language technologies and applications are greatly benefited from language corpus. For the proposed algorithm testing, we have used randomly selected set of sentences from the Kannada web corpus developed using corpus building tool [20]. The corpora include wide variety of subjects such as, Kannada news papers, Wikipedia articles, blogs, books, novels etc. The selected set of sentences from the Kannada web corpora are grouped into two categories, namely training and testing corpora. Both the categories are parsed with Kannada Shallow Parser. Given a sentence, the parser assigns to it a syntactic structure. This information is used for creating decision list. 3.3 Kannada Dictionary Knowledge of language is essential for meaningful communication through language. Words of a language and the phonological, morphological, syntactic and semantic information associated with them, forms a very important part of the knowledge of language. Knowing the words is an extremely important part of knowing a language. Dictionaries are storehouse of such information and therefore they have key role to play in Natural Language Processing (NLP). As in [21], a dictionary may be regarded as a lexicographical product that is characterized by three significant features: (1) it has been prepared for one or more functions; (2) it contains data that have been selected for the purpose of fulfilling those functions; and (3) its lexicographic structures link and establish relationships between the data so that they can meet the needs of users and fulfill the functions of the dictionary. By keeping these features in mind, we created a Kannada electronic dictionary containing around 50000 entries for our work using corpora in a semi-automated way. 3.4 Kannada Shallow Parser The lexical and syntactic structure of the sentence in training and testing corpora are extracted by the Kannada Shallow Parser. It is developed by IIIT Hyderabad [22]. For each parsed sentence, the shallow parser produces eight intermediate stage output. Namely, Tokenization, Morphological analysis, POS tagger, chunker, pruning, pick one Morph, Head computation and Vibakti computation. These outputs help us to extract Lexical and Syntactic structure of the sentence for further processing. 4. DECISION LIST This Section discus the basics of decision list as well as its construction using the Kannada corpus. 4.1 One Sense per Collocation The nearby words provide strong and consistent clues as to the sense of the target word is referred as the One Sense per Collocation property. This effect varies depending upon the type of collocation. It is strongest for immediately adjacent collocation, and weakens with distance. It is much stronger for words in an argument- predicate relationship than for arbitrary associations at equivalent distance. It is very much stronger for collocations with content words than those with function words. In general high reliability of this behavior makes it an extremely useful property for sense disambiguation. 4.2 Context Features A context feature is a way of keeping track of the words that occur surrounding an ambiguous word. The context features can also be viewed as Uni-grams, Bi-grams, Tri- grams or include all words in the context as unsorted n- grams. When part-of-speech tags or case tags were provided, the surrounding part-of-speech or case tag n- grams were also used with each respective feature set. Table 1 shows the initial set of context features considered. Table 1: Initial set of Context features Context features Word found in +/ k word window Word immediately to the right (+1 W) Word immediately to the left (-1 W) Pair of words at offsets -2 and -1 Pair of words at offsets -1 and +1 Pair of words at offsets +1 and +2
4.3 Decision List Construction A decision list is a classifier that can best be described as an extended if-then-else statement. For each matched condition, there is a single classification. Table 3 shows an example decision list for the ambiguous Kannada word [yantra] constructed using the part of the training corpora as shown in table 2. If the context of the test sentence does not have any of the features in the decision list, the classifier simply chooses the most frequent sense with a confidence of the probability of that sense. The decision list classifier uses the log-likelihood of correspondence between each context feature and each sense, using additive smoothing. The decision list was created by ordering the correspondences from strongest to weakest. Instances that did not match any rule in the decision list were assigned the most frequent sense, as calculated from the training data. The log-likelihood of correspondence is used as a confidence for classifier combination. International Journal of EmergingTrends & Technology in Computer Science(IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 275
Table 2: Training Corpus Sense Training sentence (Key Word In Context) A CO [dehaliya yantra mantra] 'Yantra Mantra Delhi' B ecC O [tyalavillade calisuva yantra] 'Without oil moving machinery' B 1 o nV [kruShi kelasa sugamagoLisida yantra] 'The agricultural work streamlined machinery' A O V I^ Io . [raamanu pandithara baLI hoogi yantra haakisikonDu bandanu] 'Ramu received an astrological device from the panDith' B O -O Iv^C. [raamana maneya niirettuva yantra haaLaagide] 'Rama's house water lifting machine is got corrupted'. Table 3: Decision List. Deduced Sense Collocation Sentence Number A [mantra] 'mantra' 1 B O [calisuva] 'moving' 2 B 1 [kruShi] 'agricultural' 3 A [panDita] 'astrologer' 4 B -O [niirettuva] 'water lifting' 5
5. EXPERIMENTATION This section describes the methodology used to do unrestricted WSD and proposes an algorithm to disambiguate all the content words in a given sentence. 5.1 Methodology Our proposed algorithm uses decision list constructed using training corpora to disambiguate all the content words in the testing corpora. All ambiguous content words in a sentence are disambiguated using the one sense per collocation property of the human language. 5.2 Algorithm Decision List based Kannada WSD .1. Collect a large set of collocation for the ambiguous word. 2. Calculate word sense probability distribution for all such collocation. 3. Calculate the log-likelihood ratio using the formula log(pr(Sense A/Collocationi)/pr(Sense B/Collocationi)) 4. Higher log-likelihood =more predictive evidence. 5. Collocations are ordered in a decision list with most predictive collocations ranked highest. 6. IMPLEMENTATION The algorithm is implemented using Perl. The System architecture, required files and the modules for implementing the algorithm are discussed here. Figure 1 shows the architecture of the proposed system. It consists of different modules based on their functionality.
Figure 1 Proposed System Architecture 6.1 Implementation Modules The program uses the following modules for disambiguation task. a) Sentence extractor: This module extracts the sentence from corpora for disambiguation task. And writes them into training as well as testing corpora files depending on the sentence category. b) Kannada Shallow Parser: It splits the input sentence across the space and extracts only valid Kannada words from a sentence and does the morphological analysis of the valid Kannada words. c) AmbiWord Extractor: It extracts all the ambiguous words present in a sentence. d) Disambiguator: This module will identify and extract correct sense of an ambiguous word. 6.2 Files The following files are used during program execution. a) Training corpora: This file contains randomly selected set of sentences from Kannada web corpora. The decision list required for content word sense disambiguation is Kannada Corpora
Sentence Extractor Kannada Shallow Parser
Disambiguator
AmbiWord Extractor
Disambiguated Sentence Dictionary
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 276
constructed using this file. b) Testing Corpora: This file contains randomly selected set of sentences from a Kannada Web Corpora. These are different from the set of sentences present in training Corpora. This file acts as input file. The sentences in the file are given as an input for an algorithm. The algorithm disambiguates all the content words in a given sentence. 7. EVALUATION We used the sentences from testing corpora as a test bed for testing the program. 7.1 Test Document Table 4 shows the partial list of sentences used to test the program. Table 4: Sentences extracted from test Corpora.
Table 5 shows the transliteration and English translation of the sentences shown in table 4. Table 5: Transliteration & Translation. Test Sentence [paakistaana nere santhrastharige neravu] Help for Pakisthan flood victims [siite neredaru] 'Seethe matured biologically' [vyavastheya eduru iijuva heNNumakkaLinda naanu aatmavishvaasada paaTha kalitiddeene]. 'I learnt a lesson of confidence from females who swim against the system' [raamanige anna iDu] serve rice to rama [raajabaag savaarana urusu khyaatavaagide]. 'raajabaag savaara's fair is popular' [baTTegaLu oNagive] 'Cloths are dried'. [ondu niirettuva yantra tanni] 'bring water lifting machine' [panDitaru yantra koTTaru] 'Astrologer gave an astrological device'
7.2 Result The results obtained for the test sentences are shown in Table 6. Table 6: The program execution result. Example Sentence Comment ck O n ,. Correct e O. Wrong k IV _co c OcC Wrong O-n . Correct Ocn c c^C. Correct L: ^c. Correct -O -. Correct o. Correct
7.3 Discussion During the disambiguation process, the following observations were made.
a) The accuracy of the algorithm is entirely depends on the decision list constructed using the training examples present in the training corpora. b) The classifier is word specific. c) A new classifier needs to be trained for every word that you want to disambiguate. d) The decision list algorithm does not need large tagged corpus. It is simple to implement. e) It is a simple semi-supervised algorithm which builds on a supervised algorithm. f) Understanding the decision list is easy. g) The system assigns incorrect sense 'gather' for a sentence e O [siite neredaru] instead of 'biologically matured' sense. This is because of the insufficient context information. This kind of problems can be easily addressed at discourse level analysis but it is behind the scope of the present work. h) It is possible to capture the clues provided by the proper nouns from the corpus to disambiguate the word. In a sentence Occn c c^C [raajaabaag savaarana urusu khyaatavaagide], Occn c [raajaabaag savaarana] is a proper noun. With the help of this clue it is possible to disambiguate the word [savaara] easily. i) In a sentence k IV _co c OcC [vyavastheya eduru iijuva heNNumakkaLinda naanu aatmavishvaasada paaTha International Journal of EmergingTrends & Technology in Computer Science(IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 277
kalitiddeene], IV [heNNumakkaLinda] and _co [aatmavishvaasada] are compound words, because both the words are formed using two word combination, according to linguistic principles the meaning of the compound words must be deduced using the constituent words in it. But these words are not available in lexical database, hence our system interprets the constituent parts of the compound word as a separate words, it leads to wrong result during the disambiguation process.
8. CONCLUSION AND FUTURE WORK In this paper, we proposed a Kannada Word Sense Disambiguation system. It is a valuable resource for resource poor Kannada Language. we constructed reasonable size Kannada web corpora using a corpus builder tool. The corpus is divided in to training and testing corpus. The decision list is constructed using training corpora and then used them for disambiguating the sentences in the testing corpora. Experiments are conducted and the results obtained are described. The performance of the system with respect to applicability and precision are encouraging. In future, we are planning to build a robust Kannada Word Sense Disambiguation system by addressing the compound words and discourse level issues.
References [1] Eneko Agirre, Philip Edmonds, Word Sense Disambiguation: Algorithms and Applications, Text, Speech and Language Technology, XXXIII, Springer, 2007. [2] Fellbaum Christiane,WordNet: An electronic Lexical database, MIT Press, 1998. [3] M. Lesk, "Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone," In Proceeding of the SIGDOC, 1986. [4] D. Walker, R. Amsler, "The Use of Machine Readable Dictionaries in Sublanguage Analysis," In Analyzing Language in Restricted Domains, Grishman and Kittredge (eds), pp. 69 - 83, LEA Press, 1986. [5] Roberto Navigli, Paolo Velardi, "Structural Semantic Interconnections: A Knowledge-Based Approach to Word Sense Disambiguation," IEEE Transactions On Pattern Analysis and Machine Intelligence, 1992. [6] Vronis Jean, "HyperLex: Lexical cartography for information retrieval," Computer Speech & Language, XVIII (3), pp. 223-252, 2004. [7] Schtze, Hinrich. "Automatic word sense discrimination," Computational Linguistics, XXIV (1), pp. 97123, 1998. [8] K. Lee Yoong, T. Ng Hwee, T Tee chia, "Supervised word sense disambiguation with support vector machines and multiple knowledge sources," In Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 137-140, Barcelona, Spain, 2004. [9] T. Ng Hwee, Hian B. Lee, "Integrating multiple knowledge sources to disambiguate word senses: An exemplar-based approach," In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 40-47, Santa Cruz, U.S.A.,1996. [10] R.L. Rivest, "Learning decision lists," Machine learning, pp. 229 - 246, II, 1987. [11] Yarowsky David, "Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French," In Proceedings of the 32nd Annual Meeting of the association for Computational Linguistics (ACL), pp. 88-95, Las Cruces. U.S.A., 1994. [12] Yarowsky David, "Unsupervised word sense disambiguation rivaling supervised methods," In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 189-196, Cambridge, MA, 1995. [13] R. Mihalcea, D. Moldovan, "A method for word sense disambiguation of unrestricted text," In Proceedings of 37th Annual Meeting of the ACL, pp. 152-158, Maryland, 1999. [14] A. Bharathi, V. Chaitanya, R. Sangal, Natural Language Processing: A paninian Prespective, PHI, 1995. [15] S.N. Sridhar, Modern Kannada Grammar, Manohar Publications & Distributors, 2007. [16] K. Narayana Murthy, Computer processing of Kannada Language, In Proceedings of the KUWH, Hampi, 2001. [17] J. Sinclair, Corpus, Concordance, Collocation, Oxford University Press, Oxford, 1991. [18] M. Barlow, "Corpora for theory and Practice," Journal of Corpus Linguistics. I (1), pp, 1-38, 1996. [19] Niladri Sekhar Dash, Bidyut Baran Chaudhuri, "Relevance of Corpus in Language research and applications," Journal of Dravidian Linguistics, XXXII (2), pp. 101-122, 2002. [20] S. Parameswarappa, V.N. Narayana, G.N. Bharathi, A Novel Approach to Build Kannada Web Corpus, In the proceedings of IEEE co-sponsored International Conference on Computer, Communication and Informatics (ICCCI-2012), pages 259-264, IEEE Digital Library, SIET, Coimbatore, Jan 2012. [21] Sandro Nielsen, "The effect of Lexicographical Information cost on Dictionary making and use," In Proceedings of Lexikos 18, pp 170-189, 2008. [22] "Kannada Shallow Parser," International Institute of Information Technology, Hyderabad, [Online]. Available: http://ltrc.iiit.ac.in/analyzer/Kannada.
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS) Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856
Volume 2, Issue 3 May June 2013 Page 278
AUTHOR
Parameswarappa S received the B.E. and M.E degrees in CSE from Bangalore University in 1995 and 1999 respectively. Since 1995, he is working as a faculty in Computer Science & Engineering decipline. Currently he is pursuing Ph.D. in Computer Science & Engineering from Visvesvaraya Technological University, Belguam, Karnataka, India. His areas of interest include Natural Language Processing, Machine Translation and Word Sense Disambiguation. He has 12 International publications to his credit.
Dr. V.N. Narayana received the B.E. from Mysore University in 1982. M.E. from IISc Bangalore in 1986 and Ph.D. from IIT Kanpur in 1995. He is serving MCE, Hassan since 1983. His areas of interest are A.I., NLP, Machine Translation, Compilers and Operating Systems. He has 18 national and 12 international publications to his credit.