
Linearization of Nonlinear Lexical Representations

George Anton Kiraz


Bell Laboratories
700 Mountain Ave.
Murray Hill, NJ 07974
Email: gkiraz@research.bell-labs.com

Abstract

This paper presents a new schema for handling nonlinear morphology. The schema argues for linearizing nonlinear representations before applying phonological and morphological rules.

1 Introduction and Problem Statement

Languages which exhibit templatic morphology have been lately treated using multi-tape finite state transducers, with one tape representing surface forms and the remaining tapes representing lexical forms (see (Kay, 1987; Kiraz, Forthcoming)). There are a number of advantages for using this multi-tape model. Not only does it accurately represent the linguistic insights behind the templatic nonlinear nature of these languages, it also allows the computational linguist to compile efficient, relatively small morphological lexica as opposed to lexica containing millions of entries.

However, maintaining a nonlinear lexical representation has its own inconveniences and computational complexities. Firstly, the writer of multi-tape rules must keep track of multiple representations (four in the case of Semitic as opposed to two for English), which makes writing grammars an arduous task. Secondly, rules which describe one phonological/orthographic phenomenon must be duplicated in order to account for the nonlinear nature of the stem, but the linear nature of segments present in prefixes and suffixes. Thirdly, in systems which require multiple sets of rules (say a text-to-phoneme system with two sets of rules: surface-to-lexical and lexical-to-phoneme), the above complexities multiply. Finally, there is the issue of space complexity: although the space complexity for transitions of an automaton with respect to the number of tapes is linear, space can become costly for huge machines, especially those whose number of transitions exceeds by far the number of states, a typical situation in natural language problems.

This paper provides a finite-state schema with which one can maintain the nonlinear lexical representation in templatic morphology, yet allow for a linear model for representing phonological/orthographic and other script related rules. Such rules are in fact linear and need not be made complex on the account of the nonlinear templatic phenomenon of morphology.

2 Problems in Templatic Morphology

2.1 Nonlinearity vs. Linearity

Consider the infamous Arabic stem /katab/ 'to write - PERFECT ACTIVE'. It is derived from the root morpheme {ktb} 'notion of writing', the vocalism morpheme {a} 'PERFECT ACTIVE' and the rather abstract pattern morpheme {CVCVC} 'VERB.' The latter describes the interdigitation of the root and vocalism. Substituting the Cs with the root consonants and the Vs with the vocalism vowels results in the surface form /katab/. This process is illustrated along the lines of (McCarthy, 1981) - based on autosegmental phonology (Goldsmith, 1976) - as follows:

        a
       / \
    C V C V C    = /katab/
    |   |   |
    k   t   b

Similarly, applying the same process on the root {ṣdq} 'notion of truth' results in the verb /ṣadaq/ 'to speak the truth - PERFECT ACTIVE'.

The stems /katab/ and /ṣadaq/ may be prefixed and suffixed to form other words. Prefixation and suffixation, however, are linear operations in Semitic. In other words, the lexical representation of the prefixes and suffixes does not require multiple tapes. Hence, the prefix {wa} 'and' is applied to the above stems to form /wakatab/ and /waṣadaq/, respectively.

2.2 Phonological and Orthographic Rules

Surface-to-lexical mappings must account for phonological and orthographic processes. In fact, for many languages, the phonological and orthographic rules tend to be more numerous than the morphological rules. This is the case in Semitic. For example, the Syriac grammar reported in (Kiraz, 1996) contains 48 rules. Only six rules (a mere 12.5%)¹ are motivated by templatic morphology. The rest are phonological and orthographic.

    ¹ Had the grammar been more exhaustive, the percentage would be much less since most additions to the rules would be in the domain of phonology/orthography, rather than templatic morphology.

Consider the above derivation of /katab/, but for Syriac rather than Arabic (both languages share the same morphemes in this case). Syriac has the Vowel Deletion Rule

    V → ε / __ CV

where ε is the empty string. The rule states that short vowels in open syllables are deleted. Hence, */katab/ → /ktab/. The rule applies right-to-left; hence, when adding the object pronominal suffix {eh} 'MASCULINE 3RD SINGULAR', the second vowel is deleted, */katabeh/ → /katbeh/.

Similarly, prefixing the above {wa} morpheme (which is also shared by Syriac and Arabic), results in */wakatab/ → /waktab/ (first stem vowel is deleted), and */wakatabeh/ → /wkatbeh/ (prefix vowel and second stem vowel are deleted).

It is worth noting that such phonological rules do not depend on the nonlinear lexical structure of the stem. They actually apply on the morphologically derived stem. Semitic, then, maintains at least the following strata: lexical-morphological (where the lexical representation is nonlinear) and morphological-surface (where both representations are linear).

2.3 Other Linguistic Representations

So far we have looked at two linguistic representations: lexical and surface (≈ orthographic). Now consider a text-to-speech system which requires a phonological representation as well.

In the Arabic example above, the first phoneme of /ṣadaq/ is emphatic (denoted by the sublinear dot). This emphasis is spread at the phonological level resulting in [ṣạḍaq] ([q] is already an emphatic phoneme).² In this case, emphasis can be determined from the surface (≈ orthographic) form. However, this is not always the case. Syriac spirantization requires lexical information as the following example illustrates: Synchronically speaking, the six plosives [b], [g], [d], [k], [p] and [t] undergo spirantization when in postvocalic position with respect to the lexical form,³ resulting in [v], [ɣ], [ð], [x], [f] and [θ], respectively. Hence, */katab/ → [kθav], and */wakatab/ → [waxθav] (in both cases the first stem vowel is deleted as described above).

    ² The scope of emphasis is another challenging problem. Sometimes emphasis spreads till the end of the current syllable, and sometimes till the end of the word.

    ³ Diachronically speaking, early Aramaic idioms, of which Syriac is one, did not apply the above vowel deletion rule; hence, in the New Testament the first [a] in sabachthani (Mt 27:46) is retained. Later, however, the vowel deletion rule took effect, but spirantized consonants remained as if the deletion did not take place.

3 Multi-Tape Grammar

This section provides a grammar for the above data using a multi-tape model and illustrates some of the complexities involved in maintaining multiple lexical tapes throughout. The multi-tape model (originally proposed by (Kay, 1987)) is an extension to the commonly used regular rewrite rules. In the multi-tape version, more than one lexical tape is allowed. Here, we shall use the following formalism - which derives from the one reported by (Pulman and Hepple, 1993) - to express regular rewrite rules:

    LLC - LEX  - RLC    {⇒, ⇔}
    LSC - SURF - RSC

where LLC is the left lexical context, LEX is the lexical form, RLC is the right lexical context, LSC is the left surface context, SURF is the surface form, and RSC is the right surface context. The operators ⇒ and ⇔ indicate optional and obligatory rules, respectively. In the multi-tape version, lexical expressions are n-tuples of regular expressions of the form (x1, x2, ..., xn), with the ith expression referring to symbols on the ith lexical tape. When n = 1, the parentheses can be ignored; hence, (x) and x are equivalent.⁴

    ⁴ For compiling such rules into automata, see (Grimley-Evans, Kiraz, and Pulman, 1996).

The grammars presented here assume a lexicon with the morpheme entries presented above. The pattern morpheme is {cvcvc} (in small letters); capitals in rules denote variables drawn from a finite set of symbols.

Lexical expressions make use of three tapes: pattern, root and vocalism, respectively. Hence, the lexical expression (c,k,ε) denotes a [c] on the first (pattern) tape, a [k] on the second (root) tape and the empty string on the third (vocalism) tape. Prefixation and suffixation, which for the most part fall out of the domain of templatic morphology, are represented as a sequence of segments as in any other language and are placed on the first (pattern) lexical tape.⁵

    ⁵ Having the prefixes share a tape of the patterns is a matter of convenience since the number of segments in a pattern, more or less, corresponds to that on the surface more than segments of roots and vocalisms.

Grammar 1: Grammar for Arabic /(wa)katab/ and /(wa)ṣadaq/

    R1    * - (c,C,ε) - *    ⇒
          * -    C    - *

    R2    * - (v,ε,a) - *    ⇒
          * -    a    - *

    R3    * - X - *    ⇒
          * - X - *

    where X is any segment, C is a consonant, and * is any context.

3.1 Nonlinearity vs. Linearity

Rules R1 and R2 in Grammar 1 take care of consonants and vowels, respectively. The rules derive the Arabic forms /katab/ and /ṣadaq/. R3 is the default rule for prefixes and suffixes. It simply maps every segment on the first lexical tape to the surface. Grammar 1 derives the forms /wakatab/ and /waṣadaq/ as well. The former is illustrated below.

    |   |   |   | a |   | a |   |  Vocalism
    |   |   | k |   | t |   | b |  Root
    | w | a | c | v | c | v | c |  Pattern and Affixes
      3   3   1   2   1   2   1
    | w | a | k | a | t | a | b |  Surface

The numbers between the tapes refer to the rules in the grammar. Note that the prefix shares a tape with the pattern.

3.2 Phonological and Orthographic Rules

Grammar 2: Grammar for Syriac Vowel Deletion Rule

    R4    * - (v,ε,a) - (cv,C,a)    ⇔
          * -         - *

    R5    * - (v,ε,a) - (cV,C,ε)    ⇔
          * -         - *

    R6    * - a - (cv,C,a)    ⇔
          * -   - *

    R7    * - a - CV    ⇔
          * -   - *

    where C is a consonant and V is a vowel.

The Syriac vowel deletion rule, V → ε / __ CV, is given in the notation of our formalism in Grammar 2. Note that by virtue of its right-lexical context (cv,C,a), R4 can only apply to the first stem vowel as illustrated in the derivation of /ktab/ from the underlying */katab/ by the deletion of the first vowel:

    |   | a |   | a |   |  Vocalism
    | k |   | t |   | b |  Root
    | c | v | c | v | c |  Pattern
      1   4   1   2   1
    | k |   | t | a | b |  Surface

Another rule (R5 in Grammar 2) is required for deriving */katab/ + {eh} → /katbeh/, where the second stem vowel is deleted by the same phonological phenomenon. The difference here lies in the right-lexical context expression (cV,C,ε), where the suffix vowel appears on the first lexical tape. The derivation is illustrated below:

    |   | a |   | a |   |   |   |  Vocalism
    | k |   | t |   | b |   |   |  Root
    | c | v | c | v | c | e | h |  Pattern and Affixes
      1   2   1   5   1   3   3
    | k | a | t |   | b | e | h |  Surface

R4 and R5 fail when the deleted vowel itself appears in the prefix, e.g. {wa} + /katbeh/ → /wkatbeh/. R6 handles this case; here, the right context (cv,C,a) belongs to the nonlinear stem as shown below:

    |   |   |   | a |   | a |   |   |   |  Vocalism
    |   |   | k |   | t |   | b |   |   |  Root
    | w | a | c | v | c | v | c | e | h |  Pattern and Affixes
      3   6   1   2   1   5   1   3   3
    | w |   | k | a | t |   | b | e | h |  Surface

In addition, R7 deletes prefix vowels when the right context belongs to a (possibly another) linear prefix, e.g., {wa} + {la} + {da} + /katab/ → /waldaktab/ (the [a] of {la} and the first stem vowel are deleted), as illustrated below:

    |   |   |   |   |   |   |   | a |   | a |   |  Vocalism
    |   |   |   |   |   |   | k |   | t |   | b |  Root
    | w | a | l | a | d | a | c | v | c | v | c |  Pattern and Affixes
      3   3   3   7   3   3   1   4   1   2   1
    | w | a | l |   | d | a | k |   | t | a | b |  Surface
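The way Grammar 1 reads its three lexical tapes can be mimicked procedurally. The following Python sketch is only an illustration and not the compiled multi-tape automaton: the function name `read_tapes` is hypothetical, the vocalism is assumed to be a single vowel that spreads over every v slot, and the lowercase letters c and v on the first tape are assumed to be pattern slots (so affix segments must not use them).

```python
def read_tapes(pattern_tape, root, vocalism):
    """Walk the first (pattern and affix) tape left to right.

    pattern_tape: affix segments plus the slots 'c' and 'v' (assumed convention)
    root:         root consonants, consumed one per 'c' slot
    vocalism:     a single vowel, reused for every 'v' slot
    """
    root_segments = iter(root)
    surface = []
    for segment in pattern_tape:
        if segment == "c":
            # Counterpart of R1: (c, C, empty) surfaces as the root consonant C.
            surface.append(next(root_segments))
        elif segment == "v":
            # Counterpart of R2: (v, empty, a) surfaces as the vocalism vowel.
            surface.append(vocalism)
        else:
            # Counterpart of R3: an affix segment surfaces unchanged.
            surface.append(segment)
    return "".join(surface)


print(read_tapes("cvcvc", "ktb", "a"))    # katab
print(read_tapes("wacvcvc", "ktb", "a"))  # wakatab
print(read_tapes("wacvcvc", "ṣdq", "a"))  # waṣadaq
```

The sketch consults the root and vocalism tapes only when the first tape holds a slot, which mirrors the derivation tables: affix columns leave the other two tapes empty.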
The above examples clearly illustrate the complexity of maintaining large nonlinear grammars.

4 Using a Linearized Lexical Representation

This section argues that a better framework for solving Semitic morphology divides the lexical-surface mappings into two separate problems. The first handles the templatic nature of morphology, mapping the multiple lexical representation into a linearized lexical form. This linearized form maintains the same linguistic information of the original lexical representation, and somewhat corresponds to McCarthy's notion of tier conflation (McCarthy, 1986).

The second takes care of phonological/orthographic/graphemic mappings between the linearized lexical form and the actual surface. The combined machine is mathematically taken as the composition of the two machines representing the two sets of rules. This brings us to the question of composing multi-tape automata.

4.1 Composition of Multi-Tape Machines

The composition of two binary transducers A and B is straightforward since one tape is taken for input and the other for output. The composition of the two machines is a generalization of the intersection of the same two automata in that each state in the resulting machine is a pair drawn from one state in A and the other from B, and each transition corresponds to a pair of transitions, one from A and the other from B, with compatible labels.

The composition of multi-tape transducers, however, is ambiguous. Which tapes are input and which are output? Consider the machine which accepts the regular relation⁶ a*:b*:b* and a second machine which accepts the regular relation b*:b*:c*. The composition of the two machines can be either the machine accepting a*:c* or the machine accepting a*:b*:b*:c*. However, if tapes can be marked as belonging to the domain or range of the transduction, the ambiguity will be resolved.

    ⁶ For regular relations, see (Kaplan and Kay, 1994).

Formally, an n-tape finite-state automaton is a 5-tuple $M = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is a finite input alphabet (a set of n-tuples of symbols), $\delta$ is a transition function mapping $Q \times \Sigma^n$ to $Q$, $q_0 \in Q$ is an initial state, and $F \subseteq Q$ is a set of final states. An n-tape FSA accepts an n-tuple of strings if and only if starting from the initial state $q_0$, it can scan all the symbols on every tape $i$, $1 \le i \le n$, and end up in a final state $q \in F$.

An n-tape finite-state transducer is a 6-tuple $M = (Q, \Sigma, \delta, q_0, F, d)$, where $Q$, $\Sigma$, $\delta$, $q_0$ and $F$ are like before and $d$, $n > d \ge 1$, is the number of domain tapes. The number of range tapes is simply $n - d$.

Let $A = (Q_1, \Sigma_1, \delta_1, q_1, F_1, d_1)$ and $B = (Q_2, \Sigma_2, \delta_2, q_2, F_2, d_2)$ be two multi-tape transducers over $n_1$ and $n_2$ tapes, respectively. Further, let $s_i$ denote the symbol on the ith tape. There is a composition of A and B, denoted by C, if and only if

$$d_2 = n_1 - d_1$$

with

$$C = (Q_1 \times Q_2,\ \Sigma_1 \cup \Sigma_2,\ \delta,\ [q_1, q_2],\ F_1 \times F_2,\ d_1)$$

where for all $p_1 \in Q_1$ and $p_2 \in Q_2$,

$$\delta([p_1, p_2],\ s_1 : \cdots : s_{d_1} : s'_{d_2+1} : \cdots : s'_{n_2}) = [\delta_1(p_1,\ s_1 : \cdots : s_{d_1} : s_{d_1+1} : \cdots : s_{n_1}),\ \delta_2(p_2,\ s'_1 : \cdots : s'_{d_2} : s'_{d_2+1} : \cdots : s'_{n_2})]$$

if and only if

$$s_{d_1+1} = s'_1,\ \ldots,\ s_{n_1} = s'_{d_2}$$

The resulting machine is a k-tape machine, where $k = d_1 - d_2 + n_2$. In the example above, taking $d_1 = 1$ and $d_2 = 2$ gives $k = 2$ (the machine accepting a*:c*), while $d_1 = 2$ and $d_2 = 1$ gives $k = 4$ (the machine accepting a*:b*:b*:c*).

Implementational Note

We found that it is best not to indicate d, the number of domain tapes, in the data structure representing the automata, but to have it as an argument to the composition function. This enables the user to change the value of d per operation if the need arises.

4.2 A Mixed Grammar

Now we illustrate the advantage of having a linearized lexical form by developing a mixed grammar. We make use of two grammars for the data presented above: G1 for templatic nonlinear problems and G2 for linear issues. For the current data, our G1 would be similar to the rules in Grammar 1. G2 takes as input the output of G1, i.e., the linearized lexical form such as Syriac */katab/, */waladakatab/, etc. Since R4-R7 in Grammar 2 represent the one phonological phenomenon, viz., the deletion of a short vowel in an open syllable, they can be combined into one rule:

    R8    * - a - CV    ⇔
          * -   - *

    where C is a consonant and V is a vowel.
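The effect of a single linear deletion rule on linearized lexical forms can be checked with a short procedural sketch. This is an illustration under stated assumptions rather than the two-level rule compiled into an automaton: the name `delete_open_syllable_vowels` is hypothetical, every vowel in the examples is assumed to be short, and the right-to-left application is modelled by scanning from the end of the string so that each vowel is tested against what remains to its right.

```python
VOWELS = frozenset("aeiou")  # assumed vowel inventory for these examples


def delete_open_syllable_vowels(linearized, vowels=VOWELS):
    """Apply V -> empty / __ CV to a linear string, scanning right to left."""
    segments = list(linearized)
    # Start at the last position that still has two segments to its right.
    for i in range(len(segments) - 3, -1, -1):
        is_vowel = segments[i] in vowels
        next_is_consonant = segments[i + 1] not in vowels
        then_vowel = segments[i + 2] in vowels
        if is_vowel and next_is_consonant and then_vowel:
            del segments[i]  # the vowel sits in an open syllable
    return "".join(segments)


print(delete_open_syllable_vowels("katab"))        # ktab
print(delete_open_syllable_vowels("katabeh"))      # katbeh
print(delete_open_syllable_vowels("wakatabeh"))    # wkatbeh
print(delete_open_syllable_vowels("waladakatab"))  # waldaktab
```

Because the input is already linear, the function never needs to know whether a vowel came from a prefix, the stem or a suffix, which is the distinction that forces four separate rules in the multi-tape version.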

An identity rule (similar to R3) is also required. Applying R8 and the identity rule on the input of G2 is illustrated below:

    | w | a | l | a | d | a | k | a | t | a | b |  Linearized Lex Form
      3   3   3   8   3   3   3   8   3   3   3
    | w | a | l |   | d | a | k |   | t | a | b |  Surface

Recall that the rule applies right-to-left.

It might not be clear from this example how advantageous this solution is. After all, only three rules were saved. However, note that almost all of the rules in a real grammar do not belong to the templatic morphology domain, but to the linear phonological/orthographic domain. Consider the case of Syriac spirantization mentioned above, viz.,

    [+ plosive] → [+ fricative] / V __

Each of the six Syriac plosives requires a set of rules of the form in Grammar 3: R9 applies when the center and context belong to prefixes and suffixes, R10 applies when the center belongs to the stem and the context belongs to a prefix, and R11 applies when the center and context belong to the stem. (Since Syriac stems invariably end in consonants, there is no rule for the case when the center belongs to a suffix and the context to the stem.) To cover all six plosives, 18 rules are required. If, however, the rules are to apply on the linearized lexical form, each plosive requires only one rule similar to R9 (a total of six rules).

Grammar 3: Grammar for Spirantization, case for [b] → [v]

    R9     V - b - *    ⇔
           * - v - *

    R10    V - (c,b,ε) - *    ⇔
           * -    v    - *

    R11    (v,ε,V) - (c,b,ε) - *    ⇔
              *    -    v    - *

    where V is a vowel.

5 Conclusion

Using a linearized form provides a pragmatic solution to the problems discussed above. While the templatic morphology issues are resolved using a multi-tape grammar, the linear-in-nature phonological/graphemic issues are dealt with using a two-tape grammar as in any other Western language. As illustrated with the vowel deletion rule above, this makes the task of the grammar writer easier by far. In addition, the size of the intermediate automata is substantially decreased in terms of space complexity.

There is another advantage of this model if used in a multi-lingual Semitic environment system. We noted above how the derivation of /katab/ in Arabic and Syriac is similar. The only difference is that in the latter a vowel deletion rule takes place. It is then possible to generalize the lexical-to-linearized-form module for more than one Semitic language.

At the abstract finite-state level, our solution may have some similarities with the proposal of (Kornai, 1991) which aims at modeling autosegmental phonology by coding nonlinear autosegmental representations as linear strings. Kornai's approach linearizes the lexical nonlinear representation from the outset using a number of coding mechanisms.

References

Goldsmith, J. 1976. Autosegmental Phonology. Ph.D. thesis, MIT. Published as Autosegmental and Metrical Phonology, Oxford 1990.

Grimley-Evans, E., G. Kiraz, and S. Pulman. 1996. Compiling a partition-based two-level formalism. In COLING-96: Papers Presented to the 16th International Conference on Computational Linguistics.

Kaplan, R. and M. Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331-78.

Kay, M. 1987. Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, pages 2-10.

Kiraz, G. 1996. SEMHE: A generalised two-level system. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.

Kiraz, G. [Forthcoming]. Computational Approach to Nonlinear Morphology: with emphasis on Semitic languages. Cambridge University Press.

Kornai, A. 1991. Formal Phonology. Ph.D. thesis, Stanford University.

McCarthy, J. 1981. A prosodic theory of nonconcatenative morphology. Linguistic Inquiry, 12(3):373-418.

McCarthy, J. 1986. OCP effects: gemination and antigemination. Linguistic Inquiry, 17.

Pulman, S. and M. Hepple. 1993. A feature-based
formalism for two-level phonology: a description
and implementation. Computer Speech and Lan-
guage, 7:333-58.
