
Part of Speech Tagging

Faculty of Computers and Information, Computer Science Department

Presented to: Prof. Mohammed Hagag

By Pre-PhD Students: Eman Monir Mohammed Ezz Eldin


1. Introduction

Part of speech tagging is generally considered one of the basic and indispensable tasks in natural language processing. Taggers are used for parsing, for recognition in message-extraction systems, for text-based information retrieval, for speech recognition, for generating intonation in speech-production systems, and as a component in many other applications.

Part of speech tagging involves assigning to a word its disambiguated part of speech in the sentential context in which the word is used. Many words are ambiguous in their part of speech; even the word "tag" can feature as a noun or a verb. However, when the word appears in the context of other words, the ambiguity is often reduced: in "a tag is a part of speech label", the word "tag" can only be a noun. Many sequences of tags are theoretically possible for a given sequence of words, and narrowing all possibilities down to the most probable sequence can dramatically improve both the speed and the accuracy of later phases of processing.

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech, based on both the word's definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS taggers, employs rule-based algorithms.

1.1 History

The Brown Corpus

Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kucera and Nelson Francis in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).

The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article then verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 1970s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).

This corpus has been used for innumerable studies of word frequency and of part of speech, and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. By now (2005), however, it has been superseded by larger corpora such as the 100-million-word British National Corpus.

For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.

1.2 Aims

The fundamental aim was to build a probabilistic part-of-speech tagger that could quickly and efficiently handle large amounts of unprocessed text. Once that was done, a number of modifications and improvements could be implemented. These included:

- Using statistics from a number of different corpora to drive disambiguation, with results converted to a fixed, more parser-centric representation. Statistics from the Penn Treebank and the Susanne Corpus could be collected and each made available to the disambiguation engine.
- Implementing an evaluation module to compare pre-tagged text with a re-tagged result, in order to evaluate design changes.
- Designing a web interface.
- The basic tagger returns just the best tagging. Adjustments could be made to allow the user to specify how many possible tags should be returned.
- The basic tagger considers only one preceding tag. Greater depth of context could be considered, along with approaches to interpolating depth-0, depth-1 and depth-2 statistics.
- There is considerable scope for experimentation in processing unknowns, that is, words that never occurred in the training corpus. This is important to do well, as a 20% rate of unknowns is not uncommon, even relative to large lexical resources.

- There is also scope for experimentation with tokenization strategies, that is, deciding where one word ends and another begins, and with sentence-segmentation strategies.
- There is scope for experimentation with the basic statistical model used (e.g. the probability of a word given a tag vs. the probability of a tag given a word).
- There is a technique known as Baum-Welch parameter re-estimation, which automatically adjusts the probabilities given untagged text. A module implementing this could also be attempted.

There is major scope for expansion and improvement of the tagger. Time limitations meant that, in addition to implementing the tagger itself, only a small fraction of these modifications could be attempted. In the time allowed, the aim was to deal with two of them: segmentation and tokenization issues, and unknown-word handling.
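To illustrate the tokenization and segmentation issue, here is a minimal sketch. The regular expression and the sentence-boundary rule are deliberately naive assumptions for illustration, not the tagger's actual strategy; a real system needs extra rules for abbreviations, numbers, clitics and so on.

```python
import re

# Words (optionally with an internal apostrophe) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

def segment_sentences(tokens):
    """Group tokens into sentences, splitting after ., ! or ?."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:                      # trailing material without a terminator
        sentences.append(current)
    return sentences

text = "A tag is a label. The sailor dogs the barmaid!"
print(segment_sentences(tokenize(text)))
# [['A', 'tag', 'is', 'a', 'label', '.'],
#  ['The', 'sailor', 'dogs', 'the', 'barmaid', '!']]
```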

1.3 Principle
Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare: in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb: "The sailor dogs the barmaid." Performing grammatical tagging will indicate that "dogs" is a verb and not the more common plural noun, since one of the words must be the main verb and the noun reading is less likely following "sailor". Semantic analysis can then extrapolate that "sailor" and "barmaid" implicate "dogs" as 1) being in the nautical context and 2) an action applied to the object "barmaid". In this context, "dogs" is a nautical term meaning "fastens (a watertight door) securely; applies a dog to". "Dogged", on the other hand, can be either an adjective or a past-tense verb. Just which parts of speech a word can represent varies greatly.

Native speakers of a language perform grammatical and semantic analysis innately, and trained linguists can identify the grammatical parts of speech to various fine degrees depending on the tagging system. Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their "case" (role as subject, object, etc.) and grammatical gender, while verbs are marked for tense, aspect, and other things.

In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English: for example, NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus). Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous there as in English. In morphologically rich languages, a morphosyntactic descriptor can be expressed compactly, for example Ncmsan, meaning Category = noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no.
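Such positional descriptors can be unpacked mechanically. The sketch below decodes a noun descriptor of this shape; the attribute tables are illustrative assumptions in the spirit of MULTEXT-East-style tagsets, not an official specification.

```python
# Minimal sketch: decoding a positional morphosyntactic descriptor such as
# "Ncmsan". Each character position after the category letter encodes one
# attribute; the value tables here are illustrative, not exhaustive.
NOUN_ATTRS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("Case", {"n": "nominative", "g": "genitive", "a": "accusative"}),
    ("Animate", {"y": "yes", "n": "no"}),
]

def decode_noun_msd(msd):
    """Expand a noun descriptor like 'Ncmsan' into named features."""
    assert msd[0] == "N", "this sketch only handles noun descriptors"
    features = {"Category": "noun"}
    for code, (name, values) in zip(msd[1:], NOUN_ATTRS):
        features[name] = values.get(code, "unknown")
    return features

print(decode_noun_msd("Ncmsan"))
# {'Category': 'noun', 'Type': 'common', 'Gender': 'masculine',
#  'Number': 'singular', 'Case': 'accusative', 'Animate': 'no'}
```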


2. What is part of speech tagging?

Part of speech tagging strives to provide lexical descriptions for words in different situations and contexts. A crucial feature of lexical description is ambiguity: a single word form may relate to various mutually exclusive portions of linguistic information. The English word form "rust", for instance, is either a verb or a noun. In the verb sense, it relates to a finite or a non-finite form, and the finite reading can be interpreted as present, imperative or indicative. In highly inflectional languages, such as German, inflection introduces a need for greater disambiguation of verb forms. The following diagram (Figure 2.1) shows how a simple sentence can contain a number of potentially ambiguous words: under each word some of its possible parts of speech are shown, in order of decreasing likelihood, with the correct tag in bold font.

2.1 General Algorithms and Methods

Several different approaches have been used for building text taggers. An overview of the main approaches is given in Figure 2.


Figure 2: Various approaches to automatic POS tagging. (This is a simplified description; in reality many tagging systems combine aspects of some or all of these approaches.)

2.1.1 Supervised vs. Unsupervised

Supervised taggers rely on pre-tagged corpora to help in the process of disambiguation. Among the things that can be extracted from such a corpus are the lexicon, word/tag frequencies, tag-sequence probabilities and sets of rules (a toy sketch of extracting such statistics appears below). Unsupervised taggers, on the other hand, use sophisticated computational methods to automatically induce word groupings (thus devising their own tagset) from raw, untagged text, and based on this tagset calculate the probabilistic information needed by stochastic taggers or induce the rules needed by rule-based systems (outlined in 2.1.2). There are pros and cons to each method: unsupervised taggers are extremely portable and can be applied to any corpus of raw text, which matters because reliable pre-tagged text is not easily available for some languages or genres of writing. On the other hand, the word clusterings (tagsets) devised by unsupervised taggers tend to be very coarse, i.e. one loses the fine distinctions found in the carefully designed tagsets used in the supervised methods.

2.1.2 Stochastic vs. Rule-Based

The term stochastic tagging can refer to any number of different approaches to the problem of part of speech tagging. Any model that somehow incorporates frequency or probability, i.e. statistics, may properly be labelled stochastic. If based on a corpus, such a tagger extrapolates various probabilities from observations in the corpus and uses them in its disambiguation process. In contrast, typical rule-based approaches use contextual and morphological information to assign tags to unknown or ambiguous words.
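To make the supervised setting of section 2.1.1 concrete, the following toy sketch extracts a lexicon, word/tag frequencies and tag-bigram statistics from a pre-tagged corpus. The two-sentence corpus and the tag names are invented for illustration.

```python
from collections import Counter

# A tiny pre-tagged corpus: sentences as lists of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("tag", "NOUN"), ("is", "VERB"), ("a", "DET"), ("label", "NOUN")],
    [("they", "PRON"), ("tag", "VERB"), ("the", "DET"), ("words", "NOUN")],
]

word_tag = Counter()    # word/tag frequencies
tag_bigram = Counter()  # tag-sequence (bigram) frequencies
tag_count = Counter()   # unigram tag frequencies

for sentence in corpus:
    prev = "<s>"                      # sentence-start pseudo-tag
    for word, tag in sentence:
        word_tag[(word, tag)] += 1
        tag_bigram[(prev, tag)] += 1
        tag_count[tag] += 1
        prev = tag

# The lexicon entry for the ambiguous word "tag":
lexicon_tag = {t: word_tag[("tag", t)] for t in tag_count if word_tag[("tag", t)]}
print(lexicon_tag)  # {'NOUN': 1, 'VERB': 1}

# Relative frequencies estimate the probabilities used by stochastic taggers:
print(tag_bigram[("DET", "NOUN")] / tag_count["DET"])
# 1.0 -- in this toy corpus every determiner is followed by a noun
```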


These rules are often known as context-frame rules. As an example, a context-frame rule might say: if an ambiguous or unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective.
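The quoted rule translates almost directly into code. In the sketch below, the tag names (DET, NOUN, ADJ, UNKNOWN) are illustrative assumptions, and for brevity only unknown words are handled.

```python
# Sketch of the context-frame rule above: if an unknown word is preceded by
# a determiner and followed by a noun, retag it as an adjective.
def apply_context_frame_rule(tagged):
    """tagged: list of (word, tag) pairs; returns a corrected copy."""
    out = list(tagged)
    for i in range(1, len(out) - 1):
        word, tag = out[i]
        if tag == "UNKNOWN" and out[i - 1][1] == "DET" and out[i + 1][1] == "NOUN":
            out[i] = (word, "ADJ")
    return out

sentence = [("the", "DET"), ("blue", "UNKNOWN"), ("tag", "NOUN")]
print(apply_context_frame_rule(sentence))
# [('the', 'DET'), ('blue', 'ADJ'), ('tag', 'NOUN')]
```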

These rules can be either automatically induced by the tagger or encoded by the designer. Eric Brill designed the best-known rule-based part of speech tagger, and it was the first to achieve an accuracy level comparable to that of stochastic taggers [Brill 95a]. Rule-based taggers most commonly require supervised training, but recently there has been a great deal of interest in automatic induction of rules. One approach to automatic rule induction is to run untagged text through a tagger and see how it performs. A human then goes through the output of this first phase and corrects any erroneously tagged words. The properly tagged text is then submitted to the tagger, which learns correction rules by comparing the two sets of data; several iterations of this process are sometimes necessary (a rough sketch of a single correction-rule learning step is given after this subsection).

Stochastic taggers have a number of advantages over rule-based ones: they obviate the need for laborious manual rule construction and can capture useful information that might not be noticed by a human engineer. On the other hand, in a probabilistic tagger linguistic information is only captured indirectly, in large tables of statistics, whereas automatically induced rule-based taggers have much smaller storage requirements and are more portable. The fact remains, though, that human language is irregular and difficult to capture in a finite number of prescribed rules. The probabilistic method of tagging means that such prescriptions are not necessary, and it appears to be a more competent and robust method for tagging unprocessed, unrestricted text.
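The correction-rule learning step mentioned above can be sketched roughly as follows. The single rule template ("retag A as B when the previous tag is P") and the scoring scheme are simplified assumptions, not Brill's actual templates or implementation.

```python
# Score a candidate rule by how many errors it would fix minus how many
# correct tags it would break, comparing tagger output against the reference.
def score_rule(rule, tagged, reference):
    a, b, prev_tag = rule
    score = 0
    for i in range(1, len(tagged)):
        if tagged[i][1] == a and tagged[i - 1][1] == prev_tag:
            if reference[i][1] == b:
                score += 1      # the rule would fix this error
            elif reference[i][1] == a:
                score -= 1      # the rule would break a correct tag
    return score

def best_correction_rule(tagged, reference, tags):
    """Greedily pick the highest-scoring rule (one learning iteration)."""
    candidates = [(a, b, p) for a in tags for b in tags for p in tags if a != b]
    return max(candidates, key=lambda r: score_rule(r, tagged, reference))

tagger_output = [("the", "DET"), ("tag", "VERB"), ("is", "VERB")]
reference     = [("the", "DET"), ("tag", "NOUN"), ("is", "VERB")]
print(best_correction_rule(tagger_output, reference, ["DET", "NOUN", "VERB"]))
# ('VERB', 'NOUN', 'DET'): retag VERB as NOUN after a determiner
```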


2.1.3 Hidden Markov Models

In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences. For example, once you have seen an article such as "the", perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words.

More advanced ("higher-order") HMMs learn the probabilities not only of pairs, but of triples or even larger sequences. So, for example, if you have just seen an article and a verb, the next item may very likely be a preposition, article, or noun, but much less likely another verb. When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with the highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range.

It is worth remembering, as Eugene Charniak points out in "Statistical techniques for natural language parsing" [1], that merely assigning the most common tag to each known word, and the tag "proper noun" to all unknowns, will approach 90% accuracy, because many words are unambiguous.

CLAWS pioneered the field of HMM-based part-of-speech tagging, but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech). HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.

2.1.4 Dynamic Programming Methods

In 1987, Steven DeRose and Ken Church independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm, known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and an ingenious method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus). Both methods achieved accuracy of over 95%. DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.

These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part-of-speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. The CLAWS, DeRose and Church methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis and encouraged researchers to find ways to separate out other pieces as well. Markov models are now the standard method for part-of-speech assignment.
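The dynamic-programming idea can be made concrete with a small worked example. The sketch below is a generic bigram Viterbi decoder, not DeRose's or Church's actual implementation, and the tiny probability tables are hand-made assumptions rather than corpus statistics.

```python
# Bigram HMM tagging with Viterbi decoding over toy probability tables.
transitions = {            # P(tag | previous tag); "<s>" marks sentence start
    ("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.2,
    ("DET", "NOUN"): 0.6, ("DET", "ADJ"): 0.3, ("DET", "DET"): 0.1,
    ("ADJ", "NOUN"): 0.8, ("ADJ", "ADJ"): 0.2,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.4,
    ("VERB", "DET"): 0.6, ("VERB", "NOUN"): 0.4,
}
emissions = {              # P(word | tag)
    ("DET", "the"): 0.7,
    ("NOUN", "can"): 0.3, ("VERB", "can"): 0.1,
    ("NOUN", "rust"): 0.2, ("VERB", "rust"): 0.2,
}
TAGS = ["DET", "ADJ", "NOUN", "VERB"]
SMOOTH = 1e-6              # crude floor for unseen events

def viterbi(words):
    """Return the most probable tag sequence for `words`."""
    # best[i][t] = (prob of the best path ending in tag t at word i, backpointer)
    best = [{} for _ in words]
    for t in TAGS:
        best[0][t] = (transitions.get(("<s>", t), SMOOTH) *
                      emissions.get((t, words[0]), SMOOTH), None)
    for i in range(1, len(words)):
        for t in TAGS:
            prob, prev = max(
                (best[i - 1][p][0] * transitions.get((p, t), SMOOTH) *
                 emissions.get((t, words[i]), SMOOTH), p)
                for p in TAGS
            )
            best[i][t] = (prob, prev)
    tag = max(TAGS, key=lambda t: best[-1][t][0])   # best final tag
    path = [tag]
    for i in range(len(words) - 1, 0, -1):          # follow backpointers
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["the", "can", "rust"]))  # ['DET', 'NOUN', 'VERB']
```

Note that dropping the transition table and taking the best emission for each word in isolation is exactly the most-frequent-tag baseline that Charniak's observation above refers to; the transitions are what let the decoder prefer the verb reading of "rust" after a noun.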


2.1.5 Unsupervised taggers

The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably close to those human linguists would expect, and the differences themselves sometimes suggest valuable new insights.

Both the supervised and the unsupervised category can be further subdivided into rule-based, stochastic, and neural approaches. Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, the Brill tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm.

3. Traditional Arabic grammar

Traditional Arabic grammar defines a detailed part-of-speech hierarchy which applies to both words and morphological segments. Fundamentally, a word may be classified as a nominal (ism), a verb (fiʿl) or a particle (ḥarf). The set of nominals includes nouns, pronouns, adjectives and adverbs. The particles include prepositions, conjunctions and interrogatives, as well as many others. Morphological annotation in the Quranic Arabic corpus divides words into multiple segments, and each segment is assigned a part-of-speech tag. These tags are detailed in the following sections. In addition to part-of-speech tags, each segment is annotated with a set of morphological features.



Nominals

The first of the three basic parts of speech are the nominals, ism (literally "names" in Arabic). The tags for nominals in the Quranic corpus are shown in Fig 1 below.

Fig 1. Part-of-speech tagset for nominals:

    Nouns:             N      Noun
                       PN     Proper noun
    Derived nominals:  ADJ    Adjective
                       IMPN   Imperative verbal noun
    Pronouns:          PRON   Personal pronoun
                       DEM    Demonstrative pronoun
                       REL    Relative pronoun
    Adverbs:           T      Time adverb
                       LOC    Location adverb

Proper Nouns

Proper nouns are annotated using the PN tag in the Quranic corpus. In Arabic orthography there is no written distinction between a proper noun and a common noun, whereas in English proper nouns are written with the first letter capitalized. Proper nouns in Arabic are known by convention and through the fact that they have the grammatical property of being definite even though they do not carry the al- determiner prefix. The set of proper nouns includes personal names such as that of the prophet ibrāhīm.



Pronouns

Three types of pronoun are identified in the corpus, using the tags PRON, DEM and REL. The personal pronouns (PRON) are those which are found in English ("I", "we", "you", "them", "us"), together with pronouns found only in Quranic Arabic, such as those inflected for the dual or feminine (for example antumā, "you two"). When segmenting words for morphological annotation, the PRON tag is also used to identify attached pronoun segments, which are suffixes that appear at the end of words. In the case of nouns these are possessive pronouns; for example "his book" is fused into a single Arabic word-form (kitābuhu). Suffixed pronouns attached to verbs are either subject pronouns or object pronouns.

The DEM tag is used to identify demonstrative pronouns ("this", "that", "these", "those"). In Quranic Arabic, these are termed ism ishāra (literally, "the names of pointing"). The REL tag is used to identify relative pronouns, which connect a relative clause to its main clause (for example "the book that you bought"). In Arabic grammar, relative pronouns are known as ism mawṣūl ("the names of connection").

Adjectives

Adjectives are closely related to nouns in Quranic Arabic, and it is sometimes not straightforward to distinguish between the two, as both carry the same morphological features; for example, both nouns and adjectives accept the determiner al- ("the"). The rule followed in the Quranic corpus is to mark a word as an adjective if it is considered to be one according to its syntactic role in the sentence. A nominal tagged as an adjective will directly follow the noun that it describes.
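As a concrete illustration of segment-level annotation, the sketch below shows one plausible way to represent such a segmented word in code. The field names, feature values and plain-ASCII transliteration are assumptions for illustration, not the Quranic Arabic corpus's actual storage format.

```python
# "his book" (kitabuhu): one Arabic word-form, two annotated segments.
kitabuhu = {
    "form": "kitabuhu",
    "segments": [
        {"form": "kitabu", "tag": "N",
         "features": {"case": "nominative"}},
        {"form": "hu", "tag": "PRON",   # attached possessive pronoun suffix
         "features": {"person": 3, "gender": "m", "number": "singular"}},
    ],
}

# e.g. collecting the attached pronoun segments of a word:
pronoun_suffixes = [seg["form"]
                    for seg in kitabuhu["segments"]
                    if seg["tag"] == "PRON"]
print(pronoun_suffixes)  # ['hu']
```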



Verbs

The second of the three basic parts of speech is the verb. All verbs in the Quranic corpus are tagged using the V (verb) tag; Fig 2 shows this single-entry tagset. Each verb is also annotated using multiple morphological features to specify its conjugation. In Quranic Arabic, verbs can be conjugated according to three different grammatical aspects (perfect, imperfect and imperative), as well as moods of the imperfect (indicative, subjunctive and jussive). Nouns derived from verbs, such as active and passive participles, are tagged as N (noun) and are annotated using the "derivation" feature.

Fig 2. Verb part-of-speech tag:

    Verbs:  V   Verb

Particles

The third of the three basic parts of speech is the particle. Particles include prepositions, lām prefixes, conjunctions and others. Interrogative particles are tagged using INTG, which covers the independent particle hal and the prefixed interrogative alif. Negative particles in the Quranic Arabic corpus are tagged as NEG; certain negative particles may place a following imperfect verb into the subjunctive or jussive mood. The VOC tag is used to identify vocative particles and prefixes, such as in yā-rabbi. In English this would be roughly translated using the archaic vocative particle "O", as in "O my Lord". Part-of-speech tags for particles and the Quranic initials are shown in Fig 3 below.

Fig 3. Part-of-speech tagset for particles and the Quranic initials:

    Prepositions:          P      Preposition
    lām prefixes:          EMPH   Emphatic lām prefix
                           IMPV   Imperative lām prefix
                           PRP    Purpose lām prefix
    Conjunctions:          CONJ   Coordinating conjunction
                           SUB    Subordinating conjunction
    Particles:             ACC    Accusative particle
                           AMD    Amendment particle
                           ANS    Answer particle
                           AVR    Aversion particle
                           CAUS   Particle of cause
                           CERT   Particle of certainty
                           CIRC   Circumstantial particle
                           COM    Comitative particle
                           COND   Conditional particle
                           EQ     Equalization particle
                           EXH    Exhortation particle
                           EXL    Explanation particle
                           EXP    Exceptive particle
                           FUT    Future particle
                           INC    Inceptive particle
                           INT    Particle of interpretation
                           INTG   Interrogative particle
                           NEG    Negative particle
                           PREV   Preventive particle
                           PRO    Prohibition particle
                           REM    Resumption particle
                           RES    Restriction particle
                           RET    Retraction particle
                           RSLT   Result particle
                           SUP    Supplemental particle
                           SUR    Surprise particle
                           VOC    Vocative particle
    Disconnected letters:  INL    Quranic initials

Prepositions

In the case of attached prefixes, the prepositions are straightforward. Certain single-letter prepositions may be fused to a word as a prefix. These include bi, ka, ta, wa and one of the senses of lām. The prefixed prepositions ta and wa occur in Quranic Arabic but are not typically found in the modern standard form of the language; they are used in the Quran as particles of oath, for example ta-allāhi, "by Allah". In addition, independent words may be prepositions. In the Quranic Arabic corpus, prepositions are identified by the P tag. A word is tagged as a preposition if and only if it is considered to be a genitive preposition (ḥarf jarr).

Quranic Initials (Disconnected Letters)

The Quranic initials (muqaṭṭaʿāt in Arabic) are sequences of mysterious letters, such as alif lām mīm, which make up the first verses of several chapters in the Holy Quran but do not combine to form words. These are also known as the disconnected letters. There are numerous suggestions as to the meaning of these letters, and so they are given their own part-of-speech tag (INL). This avoids any assumption as to their meaning, which would be implied if they were tagged as proper nouns or as abbreviations.

In the Quran, 30 verses in 29 chapters begin with initials. All initials occur at the first verse, except for those in chapter 42, which has a pair of initials at verses (42:1) and (42:2). The full meaning behind the Quranic initials is still not clearly understood, and there are differing opinions as to their interpretation. One observation is that the initials are almost always followed by a description of the Quranic revelation itself. The only occurrence of double initials, at the first two verses of chapter 42, is followed by two mentions of revelation, at verse (42:3).

References

1. Brent A. Olde, James Hoener, Patrick Chipman, Arthur C. Graesser, and the Tutoring Research Group (1999). A Connectionist Model for Part of Speech Tagging. In Proceedings of the 12th International Florida Artificial Intelligence Research Society Conference, Menlo Park, CA, pp. 172-176.

2. Shereen Khoja (2002). APT: Arabic Part-of-Speech Tagger. Computing Department, Lancaster University, Lancaster LA1 4YR, UK.

3. Dipanjan Das and Slav Petrov (2010). Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections. Carnegie Mellon University, Pittsburgh, PA 15213, USA.

5. Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein (2010). Painless Unsupervised Learning with Features. In Proceedings of NAACL-HLT.

6. Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay (2009). Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches. JAIR, 36.

7. Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson (2010). Using Universal Linguistic Knowledge to Guide Grammar Induction. In Proceedings of EMNLP.

8. Emad Mohamed and Sandra Kübler. Arabic Part of Speech Tagging. Department of Linguistics, Indiana University, Memorial Hall 322, Bloomington, IN 47405, USA.
