
Verb Detection in Persian Corpus Iranpour Mobarakeh, Majid, Minaei-Bidgoli, Behrouz

Verb Detection in Persian Corpus


Iranpour Mobarakeh, Majid (*Corresponding author)
Division of Computer Engineering, Iran University of Science and Technology, Narmak, Tehran, 1684613114 Iran
Majid.iranpour@gmail.com

Minaei-Bidgoli, Behrouz
Division of Computer Engineering, Iran University of Science and Technology, Narmak, Tehran, 1684613114 Iran
minaeibi@cse.msu.edu
doi: 10.4156/jdcta.vol3.issue1.mobarakeh

Abstract
A novel technique is introduced for verb and inflection detection in Persian texts. This recognition is useful in the preprocessing phase of natural language processing (NLP) and text mining tasks such as part-of-speech (POS) tagging and sentence boundary detection (SBD) in Persian texts. In its first phase, our technique employs the structural information of the Persian verb; in its second phase, it uses an n-gram approach for homograph disambiguation in order to increase performance. Experimental results show that our technique achieves high accuracy (99%), which makes it a suitable solution for the Persian SBD and POS tagging problem domain.

Keywords
N-gram, Persian, Stemming, Verb detection

1. Introduction
Persian, also known as Farsi or Parsi, is an Indo-European language spoken and written primarily in Iran, Tajikistan and parts of Afghanistan. The Persian alphabet contains 32 letters, and Persian is written from right to left. Some other languages, such as Arabic, Kurdish, and Urdu, use the same form of script as Persian but have their own specifications. Persian also has its own characteristics, such as not writing accents (except in special cases) and polymorphism in writing. Persian morphology is an affixal system consisting mainly of suffixes and a few prefixes. The nominal paradigm consists of a relatively small number of affixes. The verbal inflectional system is quite regular and can be obtained by combining prefixes, stems, inflections and auxiliaries [3].

Persian syntax is quite ambiguous in written form, which raises certain difficulties in the automatic parsing of written text. Several factors contribute to the ambiguity. Although Persian is a verb-final language, it does not adhere to a strict word order, and the sentential constituents may occur in various positions in the clause; this is especially the case for prepositional phrases and adverbials. In addition, there are no overt markers, such as case morphology, to indicate the function of a noun phrase or its boundary; in Persian, only specific direct objects receive an overt marker. Although the ezafe morpheme is used in spoken language to link the elements within the noun phrase, this morpheme, being a short vowel, is absent in written text. Furthermore, subjects are optional in Persian, and subject-verb agreement is not always present for inanimate subjects. Since short vowels are not transcribed, lexical ambiguity is another problem in the automatic parsing of Persian text [1, 5, 6].

Persian is an SOV language: sentences appear in the word order Subject-Object-Verb. The verb is marked for tense and aspect and usually agrees with the subject in person and number. Persian is a pro-drop language, so the subject is optional. The object marker ra is used to indicate specific direct objects in simple sentences [4].

2. Related Studies
Owing to the regular derivational word structure of the Persian language, it has the potential to be stemmed systematically. However, an extensive literature review reveals that very few studies have been performed on Persian stemmers. A Persian stemmer algorithm was proposed by Taghva et al. in 2003. Like the Porter stemmer [9], this Persian stemmer works on the basis of the morphology of its language. Afterwards, [10] modified the Krovetz algorithm for Persian stemming, using POS tagging to increase performance and reduce errors by 60%. [11] used a simple rule-based system for stemming Persian words. [12] presented a statistical stemmer for Persian text, and [13] studied the stemming challenges for Persian verbs and presented an algorithm for stemming them.

International Journal of Digital Content Technology and its Applications Volume 3, Number 1, March 2009

3. Stemming
To stem a word is to reduce it to a more general form, possibly its root. For example, stemming the term interesting may produce the term interest. Though the stem of a word might not be its root, we want all words that have the same stem to have the same root. The effect of stemming on search engine performance over English document collections has been tested extensively. Stemmers such as Lovins and Porter improve precision/recall scores in many information retrieval contexts [8]. However, these stemmers are language specific, so achieving similar results on a Persian collection requires a Persian stemmer. Like English, Persian has affixal morphology: suffixes and prefixes are concatenated to Persian words to modify their meaning. Since Persian is read from right to left, what appears to be the end of a word to an English reader is actually the beginning, and prefixes might at first appear to be suffixes. Like English nouns, Persian nouns are affixed to signify possession and plurality. Persian verbs, on the other hand, are modified more extensively than English verbs. They vary in form according to tense, person, negation, and mood, so a given verb may have scores of variations. In fact, one of the motivations behind our stemmer was the high number of variations of Persian verbs.

3.1 Persian stemming

In the Persian language, words are usually built up from the imperative forms of the verbs. Hence, from a linguistic point of view, the first step in extracting the root is to find the imperative mood of the word. Given the imperative form of a verb, one can follow the grammar rules of Persian to generate tenses such as the present tense. The present tense rules are generally used to generate other tenses. Another group of tenses, such as the past tense, is generated from the infinitives by removing the character ( ) and adding the same suffixes.

The inflectional system of Persian verbs consists of simple forms and compound forms; the latter are forms that require an auxiliary verb. The simple forms are divided into two groups based on the stem used in their formation: the tenses that use the Present Stem and those formed on the Past (or Aorist) Stem. The Present Stem needs to be specified in the lexicon since it cannot be derived, while the Past Stem is easily derivable from the infinitival form of the verb. The citation form of the verb is the infinitive [5]. In addition to the verb stems, the following elements participate in the formation of the verbal inflectional system in Persian [5]:

Prefixes: the imperfective prefix mee, and the morpheme be or bee, which characterizes the subjunctive and the imperative. Negation is marked by the prefix na, ma or nee.
Personal inflections: present, past and imperative personal inflections are used in conjugating the Persian verb. All verb forms are marked for person and number.
Suffixes: the suffix -ande marks the present participle ending, and -e (written h) is used to form the past participle.
Causation morpheme: causatives are obtained by adding the affix -an or -ani to the end of the Present Stem of the verb. Personal inflections and suffixes can then be attached to the Causative Present Stem to derive all verbal forms of the causative construction.
Auxiliaries: Persian conjugation uses a number of auxiliaries in the compound forms. The enclitic form of the auxiliary budan (be) is used in the formation of the perfect forms of all verbs. The verb khastan (want) is used as an auxiliary in forming the future tenses. The auxiliary shodan (become) forms the passive constructions.

The complete inflectional system can be obtained by the various combinations of these elements, as shown in Table 1.
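As a rough illustration of how stems, prefixes and personal inflections combine, the following Python sketch generates a few simple forms from a past and a present stem. It works on Latin transliterations rather than Persian script, and the example verb (khordan, present stem khor), the endings, the tense formulas, the short prefix spellings (mi, be, na) and the function names are all illustrative assumptions that follow common descriptions of Persian grammar, not the exact contents of Table 1.

```python
# Toy generator for a few simple (non-compound) Persian verb forms.
# All strings are Latin transliterations chosen for illustration (assumption);
# a real system would operate on Persian script.

PAST_ENDINGS = ["am", "i", "", "im", "id", "and"]        # 1sg, 2sg, 3sg, 1pl, 2pl, 3pl
PRESENT_ENDINGS = ["am", "i", "ad", "im", "id", "and"]


def past_stem(infinitive: str) -> str:
    """Derive the past stem by dropping the infinitive ending ('-an' in transliteration)."""
    if not infinitive.endswith("an"):
        raise ValueError(f"not an infinitive: {infinitive!r}")
    return infinitive[:-2]


def conjugate(infinitive: str, present_stem: str) -> dict:
    """Return the person/number forms of a handful of tenses, keyed by tense name."""
    past = past_stem(infinitive)
    return {
        "simple past": [past + e for e in PAST_ENDINGS],
        "past continuous": ["mi" + past + e for e in PAST_ENDINGS],
        "simple present": [present_stem + e for e in PRESENT_ENDINGS],
        "present continuous": ["mi" + present_stem + e for e in PRESENT_ENDINGS],
        "imperative": ["be" + present_stem],
        "negative imperative": ["na" + present_stem],
    }


forms = conjugate("khordan", present_stem="khor")
print(forms["simple past"][0])          # khordam
print(forms["present continuous"][5])   # mikhorand
```

The present stem is passed in explicitly because, as stated above, it cannot be derived and has to come from the lexicon, whereas the past stem is computed from the infinitive.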

4. Challenges of Persian verb stemming and detection

4.1 Persian Homograph

Homograph literally means "written the same"; here, homographs are words with the same spelling but a different meaning, a different pronunciation, or both. For example, one written form is pronounced mardom as a noun meaning "people" and mordam as a verb meaning "I died"; the two readings share a spelling but differ in meaning and pronunciation [14]. Homographs are therefore difficult to stem, because their stems and grammatical types may differ: the stem of ( ) meaning "people" is ( ), whereas the stem of ( ) meaning "I died" is ( ).

4.2 Exchange of letters when a prefix is appended

In Persian there are stems whose spelling changes when certain prefixes are attached. When a prefix such as ( ) or ( ) is added to a stem whose first letter is ( ), that letter is changed into ( ), or ( ) is added to the stem. For example, the present stem of ( ) is ( ). To construct the imperative of this stem according to Table 1, the prefix ( ) must be added to the present stem, so the expected result is ( ); the actual form, however, is ( ), with a changed spelling. Similarly, the present stem of ( ) is ( ); adding the prefix ( ) to it according to Table 1 should yield ( ), but the result is changed into ( ).

4.3 Light verb detection

The Persian language uses one-word verbs and compound verbs to head VPs. Compound verbs generally consist of a light verb, which is morphologically like other verbs but semantically unlike them: light verbs apparently do not assign theta roles (Mohammad and Karimi, 1992; Vahedi-Langrudi, 1996; Karimi-Doostan, 1997). Compound verbs also contain a non-verbal element, usually a noun, an adjective, a preposition, or a prepositional phrase. No phrasal node other than a prepositional phrase may be used within the non-verbal element, and normal morphological suffixes such as the comparative, superlative, and plural may not be attached to it. The verbal element of a compound verb is easy to detect when it is written separately from the non-verbal element; a prepositional non-verbal element, however, is attached to the beginning of the verbal element and creates an anomaly in the verb structure.

5. Verb and inflection detection algorithm

Persian is an SOV language: sentences appear in the word order Subject-Object-Verb. The verb is marked for tense and aspect and usually agrees with the subject in person and number. Verb detection can therefore be useful for sentence boundary detection in Persian text, and verb and inflection detection can increase the accuracy of POS taggers. The presented algorithm employs rules for verb stemming and uses a lexicon to increase the stemming accuracy. It also uses n-grams for homograph disambiguation. The lexicon contains the stems of all Persian verbs. Every simple verb in Persian has two stems, a present stem and a past stem; the lexicon therefore includes both stems, from which all simple verb forms are constructed as shown in Table 1. In Persian, an object pronoun is sometimes attached to the end of the verb, where it is known as a connected objective pronoun. The Persian connected objective pronouns are ( ); they must be annotated in the verb stemming phase. The algorithm is shown in Figure 1.


Table 1. Persian inflection
Columns: Symbol, Structure, Type of Verb. Each Structure entry combines the symbol with the past stem, the present stem, or the past participle.
Types of verb: simple past, past continuous, past participle, present perfect, present perfect continuous, past perfect, implicit past, simple present, present continuous, implicit present, future, imperative.
Symbol groups: past, present, future, imperative, negative imperative, negative, negative present, other negative verbs.


Start

Is_Verb()

  Is_simple_past()
    remove(word, symbol of simple past)
    if exist(word, past_stem_lexicon)
      stem = word
      type = simple past

  Is_past_continue()
    remove(word, "( )")
    remove(word, symbol of simple past)
    if exist(word, past_stem_lexicon)
      stem = word
      type = past continuous

  Is_past_perfect()
    remove(word, symbol of past perfect)
    remove(word, "( )")
    if exist(word, past_stem_lexicon)
      stem = word
      type = past perfect

  Is_present_perfect()
    remove(word, symbol of present perfect)
    remove(word, "( )")
    if exist(word, past_stem_lexicon)
      stem = word
      type = present perfect

  Is_present_perfect_continue()
    remove(word, "( )")
    remove(word, symbol of present perfect)
    remove(word, "( )")
    if exist(word, past_stem_lexicon)
      stem = word
      type = present perfect continuous

  Is_present_implicit()
    remove(word, "( )")
    remove(word, symbol of simple present)
    if exist(word, present_stem_lexicon)
      stem = word
      type = implicit present

  Is_future()
    if next_word(symbol of future)
      if exist(word, past_stem_lexicon)
        stem = word
        type = future

  Is_implicit_past()
    remove(word, symbol of implicit past)
    remove(word, "( )")
    if exist(word, past_stem_lexicon)
      stem = word
      type = implicit past

  Is_simple_present()
    remove(word, symbol of simple present)
    if exist(word, present_stem_lexicon)
      stem = word
      type = simple present

  Is_negative_imperative()
    remove(word, "( )")
    remove(word, symbol of imperative)
    if exist(word, present_stem_lexicon)
      stem = word
      type = negative imperative

  Is_negative_present()
    remove(word, "( )")
    remove(word, symbol of negative present)
    if exist(word, present_stem_lexicon)
      stem = word
      type = negative present

  Is_present_continue()
    remove(word, "( )")
    remove(word, symbol of simple present)
    if exist(word, present_stem_lexicon)
      stem = word
      type = present continuous

  Is_imperative()
    remove(word, "( )")
    remove(word, symbol of imperative)
    if exist(word, present_stem_lexicon)
      stem = word
      type = imperative

  Is_negative_verbs()
    remove(word, "( )")
    if is_verb(word)
      stem = word
      type = negative verb type

If Is_Verb()
  display(word, is_verb, stem, type)

Figure 1. Verb stemming and inflection detection algorithm
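The rule-and-lexicon idea behind Figure 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses Latin transliterations, a three-verb toy lexicon, only six of the verb types, and hypothetical names (`detect_verb`, `strip_ending`, `PAST_STEMS`, `PRESENT_STEMS`). Each rule removes a prefix and a personal ending and then checks whether what remains is a stem in the corresponding lexicon.

```python
from typing import Iterator, Optional, Tuple

# Toy stem lexicons in Latin transliteration (illustrative assumption).
PAST_STEMS = {"khord", "raft", "did"}
PRESENT_STEMS = {"khor", "rav", "bin"}

PAST_ENDINGS = ("and", "am", "im", "id", "i", "")      # "" is the unmarked 3sg
PRESENT_ENDINGS = ("and", "am", "ad", "im", "id", "i")


def strip_ending(word: str, endings: Tuple[str, ...]) -> Iterator[str]:
    """Yield every candidate stem obtained by removing one allowed ending."""
    for ending in endings:
        if ending == "":
            yield word
        elif word.endswith(ending):
            yield word[: -len(ending)]


def detect_verb(word: str) -> Optional[Tuple[str, str]]:
    """Return (stem, verb type) if some rule reduces the word to a lexicon stem."""
    rules = [
        # (verb type, prefix to remove, allowed endings, lexicon to check)
        ("past continuous", "mi", PAST_ENDINGS, PAST_STEMS),
        ("present continuous", "mi", PRESENT_ENDINGS, PRESENT_STEMS),
        ("imperative", "be", ("",), PRESENT_STEMS),
        ("negative imperative", "na", ("",), PRESENT_STEMS),
        ("simple past", "", PAST_ENDINGS, PAST_STEMS),
        ("simple present", "", PRESENT_ENDINGS, PRESENT_STEMS),
    ]
    for verb_type, prefix, endings, lexicon in rules:
        if not word.startswith(prefix):
            continue
        body = word[len(prefix):]
        for candidate in strip_ending(body, endings):
            if candidate in lexicon:
                return candidate, verb_type
    return None


for token in ["mikhordam", "miravand", "bekhor", "ketab"]:
    print(token, detect_verb(token))
```

Checking the remainder against a stem lexicon is what keeps an ordinary word that merely ends in a verb-like suffix from being accepted: in this toy setup `ketab` is rejected because no rule reduces it to a known stem.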

5.1 Disambiguation
To resolve the three challenges elaborated above, three techniques are proposed in this study. For the changed letter, we add a rule to the algorithm. The rule applies when the prefix ( ) or ( ) is added to a stem that starts with the letter ( ); this first letter is then changed to ( ) or ( ). The rule is shown in Figure 2.


Figure 2. Adding rules for exchange letter

For disambiguating homograph words, a 1-gram is used to detect verbs. We use a tagged corpus [29] to create the n-gram of the words in the corpus. The first step looks the word up in the created n-gram: if the word exists there and its probability of being a verb is above 50%, it is accepted as a verb; otherwise it is not accepted. The conjunction prefix is then removed, and the resulting word is retested against the verb stemming rules. If the word is recognized as a non-verb and starts with a preposition such as ( ), which is attached to the verb in order to create a light verb, the system removes the preposition and retests the word. This rule must be added to the algorithm as shown in Figure 3.

Figure 3. Adding rules to remove the light-verb preposition
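One possible reading of the 1-gram check and the preposition retest is sketched below in Python. The tag names ("N", "V"), the function names, the placeholder word form `mrdm` (standing in for an unvowelled homograph), the transliterated prefix `bar`, and the exact order of the checks are assumptions made for illustration; the `rule_based` argument stands in for the stemming rules of Figure 1.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_tokens):
    """Count how often each word form carries each tag in a tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return counts


def verb_probability(counts, word):
    """Relative frequency of the verb tag for `word`, or None if the word is unseen."""
    tags = counts.get(word)
    if not tags:
        return None
    return tags["V"] / sum(tags.values())


def is_verb(word, counts, rule_based, prefixes=(), threshold=0.5):
    """Decide verb/non-verb: 1-gram evidence first, then rules, then prefix removal."""
    p = verb_probability(counts, word)
    if p is not None:
        return p > threshold              # seen in the corpus: trust the 1-gram
    if rule_based(word):
        return True                       # unseen: fall back to the stemming rules
    for prefix in prefixes:               # retest after removing an attached preposition
        if word.startswith(prefix) and rule_based(word[len(prefix):]):
            return True
    return False


# Hypothetical counts: the written form is tagged as a noun 9 times out of 10.
counts = train_unigram([("mrdm", "N")] * 9 + [("mrdm", "V")])
print(verb_probability(counts, "mrdm"))                        # 0.1
print(is_verb("mrdm", counts, rule_based=lambda w: True))      # False
```

In this reading a homograph that the stemming rules would accept as a verb is still rejected when the tagged corpus shows it is usually not one, which is how the 1-gram reduces false positives.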

6. Experimental Results and Discussion

The algorithm is evaluated and tested on the corpus developed by RCISP [15], which is called ( ). So far, this corpus has not been reported anywhere. We use the annotated part of the corpus, which consists of approximately 7.5 million annotated tokens including 10 million words. The corpus consists of several text genres, e.g. politics, social affairs, economics, culture, art, religion, and sport. The tagset of the corpus includes 168 single tags, 25 of which are major categories; the others describe the morphology of Persian words. The n-gram is trained on some parts of this corpus.

To assess the efficiency and performance of the proposed remedial techniques, the verb detection model is applied to a corpus comprising 1,073,023 words. The model has been tested for accuracy with four different combinations of the presented techniques: verb stemming alone (Case 1); verb stemming and exchange letter (Case 2); verb stemming, exchange letter and light verb (Case 3); and verb stemming with all three techniques (Case 4). The errors resulting from applying the model fall into two major categories: (i) "false positive", the cases in which our method erroneously labels a non-verb as a verb, and (ii) "false negative", the cases in which the method erroneously labels a verb as a non-verb. The accuracy evaluated for each case is listed in Table 2. As observed, the accuracy increases as more techniques are used to detect the verbs. For instance, the accuracy in detecting verbs as verbs and non-verbs as non-verbs increases by 1.7 and 0.2 percentage points, respectively, when the exchange letter technique is added to the model (Case 2); consequently, total accuracy increases by 0.4 percentage points. With Case 3, the two accuracies increase by 3.5 and 0.9 points, respectively, and total accuracy increases by 1.3 points. With Case 4, they increase by 7.7 and 1.9 points, respectively, and total accuracy increases by 2.5 points. It is noteworthy that the proposed techniques decrease the false negative error much more than the false positive error. In conclusion, these results demonstrate the promising performance of adding the proposed remedial techniques to the basic model.
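The false positive and false negative categories used in this evaluation can be made concrete with a small Python sketch. The counts passed in are invented for illustration, and the normalization (false negatives relative to all verbs, false positives relative to all non-verbs) is an assumption chosen to match how the reported rates pair up, each error rate being the complement of the corresponding accuracy.

```python
def detection_rates(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute the six quantities reported for each configuration.

    tp: verbs labeled as verbs           fn: verbs labeled as non-verbs
    tn: non-verbs labeled as non-verbs   fp: non-verbs labeled as verbs
    """
    verbs = tp + fn
    non_verbs = tn + fp
    total = verbs + non_verbs
    return {
        "verb_accuracy": tp / verbs,
        "false_negative": fn / verbs,
        "non_verb_accuracy": tn / non_verbs,
        "false_positive": fp / non_verbs,
        "total_error": (fn + fp) / total,
        "total_accuracy": (tp + tn) / total,
    }


# Hypothetical counts for a 1000-word sample, 100 of which are verbs.
rates = detection_rates(tp=90, fn=10, tn=880, fp=20)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Because non-verbs usually far outnumber verbs in running text, total accuracy is dominated by the non-verb accuracy, which is why it moves less than the verb accuracy does.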


Table 2. Algorithm performance

Type of detection | Verbs as verbs | False negative | Non-verbs as non-verbs | False positive | Total error | Total accuracy
Verb stemming | 88.1% | 11.9% | 97.4% | 2.6% | 3.6% | 96.4%
Verb stemming + exchange letter | 89.8% | 10.2% | 97.6% | 2.4% | 3.2% | 96.8%
Verb stemming + exchange letter + light verb | 91.6% | 8.4% | 98.3% | 1.7% | 2.3% | 97.7%
Verb stemming + exchange letter + light verb + homograph disambiguation (algorithm) | 95.8% | 4.2% | 99.3% | 0.7% | 1.1% | 98.9%
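The gains over the basic model can be recomputed directly from the reported percentages. The short Python check below copies the three accuracy series from the table; the case labels follow the numbering used in the discussion, and the function and variable names are illustrative.

```python
# (verbs as verbs, non-verbs as non-verbs, total accuracy), in percent.
TABLE_2 = {
    "Case 1": (88.1, 97.4, 96.4),   # verb stemming
    "Case 2": (89.8, 97.6, 96.8),   # + exchange letter
    "Case 3": (91.6, 98.3, 97.7),   # + light verb
    "Case 4": (95.8, 99.3, 98.9),   # + homograph disambiguation
}


def gains_over_baseline(table, baseline="Case 1"):
    """Percentage-point gain of every case relative to the baseline row."""
    base = table[baseline]
    return {
        case: tuple(round(value - ref, 1) for value, ref in zip(values, base))
        for case, values in table.items()
        if case != baseline
    }


gains = gains_over_baseline(TABLE_2)
for case, (verb, non_verb, total) in gains.items():
    print(f"{case}: verbs +{verb}, non-verbs +{non_verb}, total +{total}")
```

For the full algorithm this gives gains of 7.7, 1.9 and 2.5 percentage points; the verb accuracy improves by several times as much as the non-verb accuracy in every case.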

9. Acknowledgement
This work is partially supported by the Data and Text Mining Research group at the Computer Research Center of Islamic Sciences (CRCIS), NOOR Co., P.O. Box 37185-3857, Qom, Iran.

10. References
[1] Megerdoomian, Karine, 2003. Text Mining, Corpus Building and Testing. In A Handbook for Language Engineers; edited by Ali Farghaly. CSLI Publications: Stanford, CA.
[2] Megerdoomian, Karine, 2004. Finite-State Morphological Analysis of Persian. In Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, COLING 2004. University of Geneva, Switzerland.
[3] Megerdoomian, Karine, 2000. Unification-Based Persian Morphology. In Proceedings of CICLing 2000. Alexander Gelbukh (ed.). Centro de Investigacion en Computacion-IPN, Mexico.
[4] Megerdoomian, Karine, 2004. Developing a Persian Part-of-Speech Tagger. In Proceedings of the First Workshop on Persian Language and Computers. Tehran University, Iran. May 25-26.
[5] Megerdoomian, Karine, 2004. A Semantic Template for Light Verb Constructions. In Proceedings of the First Workshop on Persian Language and Computers. Tehran University, Iran. May 25-26.
[6] Farsi Grammar, Mohammadreza Bateni, AmirKabir, 2003.
[7] Bateni, Mohammad-Reza, 1995. Towsif-e Sakhteman-e Dastury-e Zaban-e Farsi [Description of the Linguistic Structure of Persian Language]. Amir Kabir Publishers, Tehran, Iran.
[8] Agosti, M., Bacchin, M., Ferro, N., & Melucci, M. Improving the automatic retrieval of text documents. In C. Peters, M. Braschler, J. Gonzalo, & M. Kluck (Eds.), Lecture notes in computer science (LNCS): Vol. 2785, (2003).
[9] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[10] Reza Hessami Fard, Gholamreza Ghasem Sani, Stemmer Algorithm Design for Persian Language, 11th International CSI Computer Conference (CSICC2006), School of Computer Science, IPM, Jan. 24-26, 2006, Tehran, Iran.
[11] Alireza Mokhtaripour, Saber Jahanpour, Introduction to a New Farsi Stemmer, CIKM06, November 5-11, 2006, Arlington, Virginia, USA. ACM 1-59593-433-2/06/0011.
[12] Mojtaba Mohammad Nasiri, Kiyomars Sheikh Esmaeili, Hassan Abolhassani, A statistical stemmer for Persian language, 11th International CSI Computer Conference (CSICC2006), School of Computer Science, IPM, Jan. 24-26, 2006, Tehran, Iran.
[13] Ahmad Usefan, Somayeh Salehi, Behrouz Minaei Bidgoli, Stemming Challenges and Stemming Algorithm for Farsi Verbs, In Proceedings of the First Workshop on Persian Language and Computers. Tehran University, Iran. May 25-26.
[14] M. Bijan khan, Sh. Moradzadeh, "Homographs in Persian Morphology". In Proceedings of the First Workshop on Persian Language and Computers. Tehran University, Iran. May 25-26.
[15] www.rcisp.com
[16] A. Gharib, M. Bahar, B. Fooroozanfar, J. Homaii, and R. Yasami. Farsi Grammar. Jahane Danesh, 2nd edition, 2001.
[17] David A. Hull. Stemming algorithms: a case study for detailed evaluation. Technical report, Rank Xerox Research Centre, Meylan, France, June 1995.
[18] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[19] Kazem Taghva, Russell Beckley, and Mohammad Sadeh. A list of Farsi stopwords. Technical Report 2003-01, Information Science Research Institute, University of Nevada, Las Vegas, July 2003.
[20] R. Krovetz, 1993: Viewing morphology as an inference process, in R. Korfhage et al., Proc. 16th ACM SIGIR Conference, Pittsburgh, June 27-July 1, 1993; pp. 191-202.
[21] J. P. Callan, M. Connell, A. Du. Automatic discovery of language models for text databases, ACM SIGMOD International Conference on Management of Data, pages 479-490, 1999.
[22] M. Bacchin, N. Ferro, M. Melucci. A probabilistic model for stemmer generation. Information Processing and Management, 2005.
[23] J. B. Lovins. Development of a stemming algorithm. Cambridge, MIT Information Processing Group, 1968.
[24] M. Tashakori, M. R. Meybodi, F. Oroumchian. Bon: The Persian Stemmer. Proceedings of the First EurAsian Conference on Information, 2003.
[25] Shariat, M. J. Simple Farsi Grammar (Second impression). Asaatir, 2000, Iran.
[26] Bateni, Mohammad-Reza, 1995. Towsif-e Sakhteman-e Dastury-e Zaban-e Farsi [Description of the Linguistic Structure of Persian Language]. Amir Kabir Publishers, Tehran, Iran.
[27] Amtrup, J.W., Mansouri Rad, H., Megerdoomian K., and Zajac, R., "Persian-English Machine Translation, An Overview of the Shiraz Project", Memoranda in Computer and Cognitive Science MCCS-00-319, Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico, April 2000.
[28] Riazati, D. (1997), "Computational Analysis of Persian Morphology", PhD thesis, Department of Computer Science, RMIT, UK.
[29] Bijan khan M., 2004. The Role of The Corpus in Writing a Grammar: An Introduction to a Software, Iranian Journal of Linguistics, vol. 19, no. 2.
[30] Tabatabaie A. 1994. Fel-e Basit-e farsi va vajheh sazi [Persian Simple Verb and Making Word], Iran University Press, Tehran, Iran.
