
WORD FORMATION IN NATURAL LANGUAGE PROCESSING SYSTEMS

Roy J. Byrd
IBM Thomas J. Watson Research Center
Yorktown Heights, New York 10598

ABSTRACT

Systems which process natural language require a reliable source of information about words. Not only must their lexical subsystems handle a large number of known words; they must also cope with coinages. The morphological principles underlying the notion "possible word" are under active study by linguists, and are articulated in the theory of word formation. This paper presents a technique for building lexical subsystems which embody these principles by emulating the behavior of word formation rules. These subsystems combine totally idiosyncratic lexical information, stored in a dictionary, with systematic information derived from word structure. Applications for lexical subsystems built along the lines described here will be discussed.

1. Introduction.

In many published descriptions of computer systems that process natural language, little or no information is given about the sources of lexical information. Instead, fully analyzed words, bearing all required categorial, syntactic, and semantic features, appear in the data structures presented as illustrations of the systems' operation. The dictionary structures which store these words, and the processes by which feature information is derived, are not usually revealed.

When system descriptions do shed some light on their lexical processing, two general approaches can be discerned. In the first, typified by Sager(1981), the words in the input text serve as keys for accessing dictionary entries where information about the words is stored. Different inflected forms of a word serve as independent keys into the dictionaries' data structures, although morphological regularities are sometimes captured by allowing different dictionary entries for inflectionally related words to share sublists of common properties.

The second category of lexical subsystems uses "affix stripping" to economize on word storage, or, equivalently, to increase the apparent size of their word lists. In the simpler versions of the procedure, the systems will merely test if a valid ending is attached to a valid word (as in Peterson(1980)). In some cases adjustments are made to the spelling of the base after stripping the affix and before looking up the base in the dictionary (as in Winograd(1972)). These systems usually handle only words with inflectional suffixes, reducing them to their uninflected forms. Words with derivational affixes again have separate dictionary entries.

A more complex form of the affix stripping procedure is presented in Cercone(1974). In that system, prefixes as well as suffixes are (recursively) removed from an incoming word, and the resulting bases are looked up in the dictionary, as above. Only if the base is of an appropriate category for the affix, will the input be accepted, however. Furthermore, information that is redundantly associated with the affix--such as the fact that words ending in -ness are nouns--is asserted for the input word. Cercone's system handles derivational as well as inflectional affixes.

The system of morphological rules described in this paper is superior to the affix stripping systems in several respects. First, it allows for the features associated with a word to be a combination of totally idiosyncratic information from a dictionary, together with systematic information derived by the recursive application of morphological rules. Second, it provides for the observation of a complex set of restrictions on morphological rule application. The nature of these restrictions is known from linguistic research, and their observation allows for much tighter control over the attachment of affixes to bases.
This enhanced control provides for the third type of improvement over previous systems: this system can exhibit much more reliable behavior in the presence of coinages--the creation of new words by human users.

Section 2 of this paper describes word formation rules and the restrictions that govern their application. Section 3 presents a system of morphological rules based on the linguistic results of section 2. The final section lists applications of the techniques developed in this paper.

2. Word Formation Rules.

Linguistic studies of morphology have settled on the word formation rule as the best means of accounting for the structure of words. Extensive descriptions of two varying views of this model can be found in Aronoff(1976) and Selkirk(1982). Briefly, word formation rules can be represented as in (1).

Rule (1a) will apply to transitive verbs that select for animate objects, and will produce animate nouns. Using this rule, we can derive draftee from draft, but we can't get *singee (sing doesn't take animate objects) or *abdicatee (abdicate is intransitive).

R. Byrd

705

(2)  +ity:  [[+latinate]A __ ]N

Rule (1b) derives latinate adjectives from transitive verbs. For example, we can get interruptible and singable, but not *abdicatable. Rule (1c) says that the inflectional affix -ed can be applied to verbs bearing the morphological feature [-ablaut], yielding a past tense verb. It will apply to walk to produce walked, but not to sing which is [+ablaut].
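The behavior described for rules (1a) through (1c) can be illustrated with a small Python sketch. This is not the notation of (1); it is a reading of the prose above. The class name, field names, feature spellings, and the toy lexicon are all hypothetical, and the lexicon values are chosen only to reproduce the judgments reported in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordFormationRule:
    affix: str                # affix with its boundary symbol, e.g. "+ee"
    base_category: str        # category the base must have
    required: frozenset       # features the base must bear
    out_category: str         # category of the derived word
    asserted: frozenset       # features asserted for the derived word


# Rules as paraphrased from the prose (feature names are assumptions).
RULE_1A = WordFormationRule("+ee", "V",
                            frozenset({"+transitive", "+animate_object"}),
                            "N", frozenset({"+animate"}))
RULE_1B = WordFormationRule("#abl", "V", frozenset({"+transitive"}),
                            "A", frozenset({"+latinate"}))
RULE_1C = WordFormationRule("-ed", "V", frozenset({"-ablaut"}),
                            "V", frozenset({"+past"}))

# Toy lexicon: only the features the text mentions for each verb.
LEXICON = {
    "draft": ("V", frozenset({"+transitive", "+animate_object"})),
    "sing": ("V", frozenset({"+transitive", "+ablaut"})),
    "abdicate": ("V", frozenset({"-transitive"})),
    "interrupt": ("V", frozenset({"+transitive"})),
    "walk": ("V", frozenset({"-ablaut"})),
}


def applies(rule, base):
    """True if the base has the category and all features the rule requires."""
    category, features = LEXICON[base]
    return category == rule.base_category and rule.required <= features
```

Under these assumptions, `applies(RULE_1A, "draft")` holds while `applies(RULE_1A, "sing")` and `applies(RULE_1A, "abdicate")` do not, mirroring draftee versus *singee and *abdicatee.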

Because +ity only attaches to latinate adjectives (we get falsity but not *wrongity), its ability to cooccur with singable in singability must mean that singable is [+latinate]. But since sing is not a latinate word, this must be so because of the suffix +abl in singable. Thus, the derivational history of a word, in addition to its ultimate base, determines what features will be associated with it.²

3. The Morphological Rule System.

The linguistic theory of word formation rules, outlined above, can be advantageously exploited for natural language processing systems. A useful way of organizing the lexical subcomponent of such systems is to incorporate a dictionary, where truly idiosyncratic information about words is stored, and an interpretive mechanism for applying morphological rules derived from word formation rules.

The information contained in the dictionary should include not only the usual categorial, syntactic, and semantic features. It should also include morphological, etymological, and phonological information relevant to word formation processes. This information can be efficiently encoded, and the payoff--as this paper attempts to show--is well worth the additional cost.

The morphological rules themselves consist of five parts: a) an affix name, b) a boundary specification, c) a pattern, d) a condition, and e) an assertion.
The parts of a rule combine to emulate the behavior of a word formation rule in linguistic theory. The pattern specifies the base for the rule by describing the affix to be removed and further adjustments to be made. The condition embodies the subcategorization and selectional restrictions on the base. The assertion allows statement of the categorial and diacritic features for the output of a rule application. The boundary specification insures the observation of boundary restrictions.

An example of a morphological rule is given in (3). The five parts of the rule are labeled a) through e).

(3)  +ion:   ation5*e*   (verb)   (noun +sg +abstr)
     b)a)    c)          d)       e)
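As an illustration of the five-part structure, the following Python sketch reads a rule written in the style of (3) into a record. It is a sketch under assumptions, not the paper's implementation: the names `Pattern`, `MorphRule`, `parse_pattern`, and `parse_rule` are hypothetical, and the reading of the pattern (an ending to match, a count of characters to remove, then a sequence of "*" look-up steps and letters to append) is inferred from the examples in this paper.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    ending: str     # characters that must end the input word
    strip: int      # number of characters to remove from the end
    steps: tuple    # "*" means "look up"; other items are letters to append


@dataclass(frozen=True)
class MorphRule:
    name: str        # a) affix name
    boundary: str    # b) boundary specification: "+", "#" or "-"
    pattern: Pattern  # c) pattern
    condition: tuple  # d) condition on the base
    assertion: tuple  # e) features asserted for the result


RULE_RE = re.compile(r"\s*([+#-])(\w+):\s*(\S+)\s*\(([^)]*)\)\s*\(([^)]*)\)\s*")


def parse_pattern(text):
    """Split a pattern such as 'ation5*e*' into ending, count and steps."""
    match = re.fullmatch(r"([a-z]+)(\d+)(.*)", text)
    if match is None:
        raise ValueError(f"unrecognized pattern: {text!r}")
    ending, count, tail = match.groups()
    return Pattern(ending, int(count), tuple(re.findall(r"\*|[a-z]+", tail)))


def parse_rule(text):
    """Parse one rule line into its five parts."""
    match = RULE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized rule: {text!r}")
    boundary, name, pattern, condition, assertion = match.groups()
    return MorphRule(name, boundary, parse_pattern(pattern),
                     tuple(condition.split()), tuple(assertion.split()))


rule_3 = parse_rule("+ion: ation5*e* (verb) (noun +sg +abstr)")
```

Here `rule_3.boundary` is `"+"`, `rule_3.name` is `"ion"`, and `rule_3.pattern.steps` is `("*", "e", "*")`. Keeping the pattern as data rather than code lets one interpreter serve every rule.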

Three types of restrictions apply to word formation rules: boundary restrictions, subcategorization restrictions, and selectional restrictions.

Boundary restrictions refer to the fact that some types of affixes may only occur "outside" of other types. Selkirk(1982) distinguishes three types of boundary.

a) The "innermost" are those associated with non-neutral affixes which can alter the phonology of the base to which they apply. Such an affix is +ee in (1a) which alters the stress pattern of assign when forming assignee. These boundaries are denoted by "+".

b) Boundaries at the next level, denoted by "#", are associated with neutral affixes, such as #abl, which do not alter the phonology of their base. Note that the stress of the base is not shifted in assignable.

c) The "outermost" boundaries are associated with inflectional affixes, and are denoted by "-". An example is the past tense suffix -ed. An important fact about inflectional affixes is that they do not "pile up"; at most one is ever found on a single word.

Subcategorization restrictions limit the occurrence of affixes to environments containing only certain word categories. Thus, in example (1b), #abl may only cooccur with verbs. It cannot be affixed to nouns (*elementable) or adjectives (*temporaryable). Cercone's morphological analysis system captures subcategorization restrictions when it tests that the string remaining after affix stripping belongs to a certain category.
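The boundary restrictions above can be read as an ordering check on the affixes of a word. The following Python sketch is one possible reading, not the paper's mechanism. It assumes that a word's affixes are listed from innermost to outermost, that boundary levels may repeat but never decrease moving outward, and that at most one inflectional boundary may appear. The names `LEVEL` and `boundaries_ok` are hypothetical.

```python
# Boundary strength from innermost to outermost, following the three types above.
LEVEL = {"+": 0, "#": 1, "-": 2}


def boundaries_ok(boundaries):
    """Check a list of boundary symbols ordered from innermost to outermost.

    Assumed reading: levels never decrease moving outward, and inflectional
    affixes do not pile up (at most one "-").
    """
    levels = [LEVEL[b] for b in boundaries]
    in_order = all(inner <= outer for inner, outer in zip(levels, levels[1:]))
    return in_order and list(boundaries).count("-") <= 1
```

For example, `boundaries_ok(["+", "+"])` accepts two non-neutral suffixes in sequence, while `boundaries_ok(["#", "+"])` rejects a non-neutral affix outside a neutral one and `boundaries_ok(["-", "-"])` rejects stacked inflections.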
Selectional restrictions specify constraints on non-categorial features of an affix's base. Thus, the base to which the affix +ee can apply must not simply be a verb, it must also bear the features [+transitive] and [+animate object]. Since the word formation process is recursive, and since word formation rules themselves can assert new features for the words that they form, the selectional restriction mechanism can be a powerful tool. Consider the rule in (2).

1. [+latinate] is an abstract morphological feature which roughly indicates that words so marked are of Greek or Latin origin. A particular use of this abstract feature will be given below.
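The interaction between recursion and selectional restrictions can be sketched in Python using +ity, which selects for [+latinate] adjectives, and +abl, which asserts [+latinate]. The sketch makes simplifying assumptions: the tables `LEXICON` and `RULES` and the function `derive` are hypothetical, the lexical markings for false and wrong are chosen to match falsity and *wrongity, and a derived word carries only the features asserted by the last rule applied (inheritance from the base is ignored here).

```python
# word -> (category, morphological features); entries are illustrative.
LEXICON = {
    "sing": ("V", frozenset({"+transitive"})),
    "false": ("A", frozenset({"+latinate"})),
    "wrong": ("A", frozenset()),
}

# affix -> (base category, required features, output category, asserted features)
RULES = {
    "+abl": ("V", frozenset({"+transitive"}), "A", frozenset({"+latinate"})),
    "+ity": ("A", frozenset({"+latinate"}), "N", frozenset()),
}


def derive(base, affixes):
    """Apply affixes innermost first; return (category, features) or None."""
    category, features = LEXICON[base]
    for affix in affixes:
        need_category, need_features, out_category, asserted = RULES[affix]
        if category != need_category or not need_features <= features:
            return None  # subcategorization or selectional restriction violated
        # Simplification: the derived word keeps only the asserted features.
        category, features = out_category, asserted
    return category, features
```

With these tables, `derive("sing", ["+abl", "+ity"])` succeeds because +abl supplies the [+latinate] feature that +ity requires, whereas `derive("wrong", ["+ity"])` returns `None`.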

Rule (3) would operate as follows while analyzing the word realization. The position of the boundary marker indicates that this is a suffix rule. Hence, the right end of the word is checked for the pattern characters "a-t-i-o-n". This check succeeds, so 5 characters are removed and the "*" causes the result ("realiz") to be looked up either in the dictionary or via a recursive invocation of the morphological rule processor. This look up fails--it would have succeeded had the original word been, say, relaxation--so "e" is added, yielding "realize" which is successfully looked up. The categorial and diacritic features of the base are checked against the condition. In this case, the fact that "realize" is a verb suffices. Furthermore, the boundary specification is not violated, since "realize" contains no neutral or inflectional affixes. So the assertion is applied, associating the categorial feature "noun" and the diacritic features "+singular +abstract" with the word "realization".

2. In spite of appearances, the word singability is not a violation of the boundary restrictions mentioned earlier. See Aronoff(1976) for a justification of the existence of two suffixes, #abl and +abl. +abl is the one in singability.

These rules actually manipulate morphographemes (i.e., the written form of words) rather than morphemes. While this is perfectly acceptable for computer applications, we must be clear about the correspondence between the abstract word formation rules, and their realization as morphological rules. Thus, the three morphological rules in (4) represent different orthographical facets of a single word formation rule which handles the suffix #abl.

(4) a. #abl:  able4*e*   (v +trans)  (a +lat -sg -pl)
    b. #abl:  sible5d*   (v +trans)  (a +lat -sg -pl)
    c. #abl:  able3te*   (v +trans)  (a +lat -sg -pl)
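The analysis procedure walked through for rule (3) can be sketched as a small Python interpreter that also handles the rules in (4). This is a minimal sketch under stated assumptions, not the implemented system. The dictionary entries are invented for illustration, category names are normalized to "verb", "noun" and "adj", the pattern steps are interpreted as inferred from the realization example, and the boundary specification check is omitted. The names `DICTIONARY`, `RULES`, `find_base`, and `analyze` are hypothetical.

```python
# word -> (category, features); illustrative entries only.
DICTIONARY = {
    "realize": ("verb", {"+trans"}),
    "relax": ("verb", set()),
    "reach": ("verb", {"+trans"}),
    "like": ("verb", {"+trans"}),
    "defend": ("verb", {"+trans"}),
    "delegate": ("verb", {"+trans"}),
}

ADJ = ("adj", {"+lat", "-sg", "-pl"})
TRANS_VERB = ("verb", {"+trans"})

# (affix, ending, strip count, steps, condition, assertion)
RULES = [
    ("+ion", "ation", 5, ("*", "e", "*"), ("verb", set()), ("noun", {"+sg", "+abstr"})),
    ("#abl", "able", 4, ("*", "e", "*"), TRANS_VERB, ADJ),   # (4a)
    ("#abl", "sible", 5, ("d", "*"), TRANS_VERB, ADJ),       # (4b)
    ("#abl", "able", 3, ("te", "*"), TRANS_VERB, ADJ),       # (4c)
]


def find_base(word, ending, strip, steps, lookup):
    """Follow one pattern and return the first candidate base that is found."""
    if not word.endswith(ending) or len(word) <= strip:
        return None
    candidate = word[:-strip]
    for step in steps:
        if step == "*":                 # look up the current candidate
            if lookup(candidate) is not None:
                return candidate
        else:                           # append adjustment letters
            candidate += step
    return None


def analyze(word):
    """Return (category, features, base, affix), or None if no analysis exists."""
    if word in DICTIONARY:
        category, features = DICTIONARY[word]
        return category, set(features), word, None
    for affix, ending, strip, steps, condition, assertion in RULES:
        base = find_base(word, ending, strip, steps, analyze)  # recursive look up
        if base is None:
            continue
        category, features, _, _ = analyze(base)
        need_category, need_features = condition
        if category == need_category and need_features <= features:
            out_category, out_features = assertion
            return out_category, set(out_features), base, affix
    return None
```

With this toy dictionary, `analyze("realization")` strips "ation", fails to find "realiz", appends "e", and finds "realize"; `analyze("delegable")` fails under (4a) and succeeds under (4c). Because every pattern removes more characters than it adds, each recursive look up works on a shorter string and the recursion terminates.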

Rule (4a) will derive reachable from reach and likable from like. (4b) derives defensible from defend. (4c) derives delegable from delegate. All will check that the base is a transitive verb, and assert that the result is a latinate adjective. Similar clusters of rules will exist for cases of allomorphy, as in in+continent, im+practical, ir+reversible, and il+logical. In fact, the patterns of morphological rules can be used to capture the allomorphy and truncation phenomena discussed by Aronoff(1976). In addition, they can encode purely orthographical phenomena, such as the spelling rules for silent e suppression, consonant doubling, c-k alternations, etc.

In a lexical subcomponent constituted as suggested here, the information known about a given word is a combination of the inherent features found in the lexical entry for the ultimate base and the systematic features associated with the word's structure. Thus, if one form of the verb realize is known to be transitive and to require an animate subject (as in "the sculptor realized his masterpiece in bronze"), then we also know that realization is likewise transitive and requires an animate subject (as in "the sculptor's realization of his masterpiece"). Furthermore, we know that the semantic properties of realize and realization are closely related. This means of attaching inherent information to many related words captures the lexical redundancy relations introduced by Chomsky(1970) and elaborated by Jackendoff(1975). It goes far beyond Sager's(1981) scheme of having multiple dictionary entries for inflectionally related words point to shared sublists of common properties.

4. Applications.

A lexical subcomponent based on the ideas presented here has been implemented and is being used in a system that produces syntactic and stylistic critiques of English language texts in a word-processing environment. The system is described in Heidorn, et al.(1982). As part of the development of that system's dictionary, a set of morphological rules for analysing inflectional affixes was used to automatically identify redundant words in a source dictionary. From among more than 80,000 original entries, it was possible to omit approximately 10,000 which were found to have completely predictable analyses.

The morphological rule interpreter itself is being used as part of a linguistic study of restrictions on word formation rules, described in Byrd(1983). This study should yield a highly detailed set of word formation rules which, in turn, will provide the basis for morphological rules to be used in various computer applications.

In automatic text-to-speech synthesis systems, accurate morphological decomposition as well as reliable phonological information, such as can be stored in our dictionary, is essential. A version of the morphological rule interpreter has been implemented which can decompose an input word into a base, for which a pronunciation is known, plus an ordered list of affixes. See Allen(1976) for another approach to this problem.

Finally, with a highly articulated set of morphological rules, it seems likely that this technology could be used to generate words as reliably as it analyses them. Such a capability would be invaluable in text generation applications.

References.

Allen, J. (1976) "Synthesis of Speech from Unrestricted Text," Proceedings of the IEEE 64, 433-442.

Aronoff, M. (1976) Word Formation in Generative Grammar, Linguistic Inquiry Monograph 1, MIT Press, Cambridge, Massachusetts.

Byrd, R. J. (1983) "On Restricting Word Formation Rules," unpublished paper, New York University.

Cercone, N. (1974) "Computer Analysis of English Word Formation," Technical Report TR74-6, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada.

Chomsky, N. (1970) "Remarks on Nominalizations," in R. Jacobs and P. S. Rosenbaum, eds. Readings in English Transformational Grammar, Ginn, Waltham, Massachusetts.

Heidorn, G. E., K. Jensen, L. A. Miller, R. J. Byrd, and M. S. Chodorow (1982) "The EPISTLE Text-Critiquing System," IBM Systems Journal 21, 305-326.

Jackendoff, R. S. (1975) "Morphological and Semantic Regularities in the Lexicon," Language 51, 639-671.

Peterson, J. L. (1980) "Computer Programs for Detecting and Correcting Spelling Errors," Communications of the ACM, 23, 676-687.

Sager, N. (1981) Natural Language Information Processing: A Grammar of English and Its Applications, Addison-Wesley, Reading, Massachusetts.

Selkirk, E. O. (1982) The Syntax of Words, Linguistic Inquiry Monograph 7, MIT Press, Cambridge, Massachusetts.

Winograd, T. (1972) Understanding Natural Language, Academic Press, New York.
