DEEPA GUPTA
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF TECHNOLOGY DELHI
HAUZ KHAS, NEW DELHI-110016, INDIA
JANUARY, 2005
by
DEEPA GUPTA
Department of Mathematics

Submitted in fulfilment of the requirement of the degree of
Doctor of Philosophy
to the
Indian Institute of Technology Delhi
Dedicated to
My Parents,
My Brother Ashish and
My Thesis Supervisor...
Certificate
Acknowledgement
If I were to say that this is my thesis alone, it would be totally untrue. It is like a
dream come true. There are people in this world, some of them so wonderful, who
helped in turning this dream into the product that you are holding in your hand. I
would like to thank all of them, and in particular:
Dr. Niladri Chatterjee, mentor, guru and friend, taught me the basics of research
and stayed with me right till the end. His efforts, comments, advice and ideas
developed my thinking and improved my way of presentation. Without his constant
encouragement, keen interest, inspiring criticism and invaluable guidance, I would
not have accomplished my work. I admit that his efforts deserve much more
acknowledgement than is expressed here.
I acknowledge and thank the Indian Institute of Technology Delhi and Tata Infotech
Research Lab, who funded this research. I sincerely thank all the faculty members of
the Department of Mathematics; especially, I express my gratitude to Prof. B. Chandra
and Dr. R. K. Sharma for providing me continuous moral support and help. I
thank my SRC members, Prof. Saroj Kaushik and Prof. B. R. Handa, for their time
and efforts. I also thank the department administrative staff for their assistance. I
extend my thanks to Prof. R. B. Nair and Dr. Wagish Shukla of IIT Delhi, and
Prof. Vaishna Narang, Prof. P. K. Pandey, Prof. G. V. Singh, Dr. D. K. Lobiyal,
and Dr. Girish Nath Jha of Jawaharlal Nehru University Delhi, for the enlightening
discussions on the basics of languages.
I would like to express my sincere thanks to my friends Priya and Dharmendra
for many fruitful discussions regarding my research problem. I thank Mr. Gaurav
Deepa Gupta
Abstract
We feel that the overall scheme proposed in this research will pave the way for
developing an efficient EBMT system for translating from English to Hindi. We
hope that this research will also help the development of MT systems from English
to other languages of the Indian subcontinent.
Contents

1 Introduction
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.4 Adaptation Rules for Variations in the Morpho Tags of Pre-modifier Adjective @AN> . . . 64
2.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3.1 Structural Divergence . . . . . . . . . . . . . . . . . . . . . . . 97
3.3.2 Categorial Divergence . . . . . . . . . . . . . . . . . . . . . . . 100
3.3.3 Nominal Divergence . . . . . . . . . . . . . . . . . . . . . . . . 104
3.3.7 Possessional Divergence . . . . . . . . . . . . . . . . . . . . . . 121
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.4.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.4.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.4.3 Illustration 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.9.1 Splitting Rule for the Connectives when, where, whenever and wherever . . . 231
5.10.2 Adaptation Procedure for Connective who . . . . . . . . . . . 256
5.11 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
5.11.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
5.11.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
5.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
6 Discussions and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.4 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
A.1 English and Hindi Language Variations . . . . . . . . . . . . . . . . . 281
A.2 Verb Morphological and Structure Variations . . . . . . . . . . . . . . 285
A.2.1 Conjugation of Root Verb . . . . . . . . . . . . . . . . . . . . . 286
B.1 Functional Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
B.2 Morpho Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures . . . 299
D.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
E.1 Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . . 305
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
List of Figures
List of Tables

2.6 Different Possible Morpho Tags for Each of the Functional Tags under the Functional Slot <S> or <O> . . . 58
2.13 Functional & Morpho Tags Corresponding to Each Interrogative Sentence Pattern . . . 78
2.14 Adaptability Rules for Group G5 Sentence Patterns . . . 83
2.15 Adaptation Rules for Variation in Kind of Sentences . . . 84
4.7 Some Illustrations . . . 163
5.3 Adaptation Operations of Verb Morphological Variation Present Indefinite to Past Indefinite . . . 192
5.6 Best Five Matches by Using Semantic Similarity for the Input Sentence I work. . . . 200
5.7 Best Five Matches by Using Semantic Similarity for the Input Sentence Sita sings ghazals. . . . 201
5.10 Best Five Matches by Syntactic Similarity for the Input Sentence Sita sings ghazals. . . . 203
5.11 Functional-morpho Tags for the Input English Sentence (IE) and the Retrieved English Sentence (RE) . . . 204
5.12 Retrieval on the Basis of Cost of Adaptation Based Scheme for the Input Sentence I work. . . . 207
5.13 Retrieval on the Basis of Cost of Adaptation Based Similarity for the Input Sentence Sita sings ghazals. . . . 207
5.14 Cost of Adaptation for Retrieved Best Five Matches for the Input Sentence I work. by Using Semantic and Syntactic Based Similarity Schemes . . . 209
5.15 Cost of Adaptation for Retrieved Best Five Matches for the Input Sentence Sita sings ghazals. by Using Semantic and Syntactic Based Similarity Schemes . . . 210
5.16 Weights Used for Characteristic Features . . . 220
5.17 Notation Used in the Complexity Analysis . . . 222
5.19 Typical Examples of Complex Sentences with Connective when, where, whenever or wherever Handled by Module 2 . . . 235
5.20 Typical Examples of Complex Sentences with Connective when, where, whenever or wherever Handled by Module 3 . . . 239
5.21 Typical Complex Sentences with Relative Adverb who Handled by Module 2 . . . 242
5.22 Typical Complex Sentences with Relative Adverb who Handled by Module 3 . . . 242
5.23 Typical Complex Sentences with Relative Adverb who Handled by Module 4 . . . 243
5.24 Hindi Translation of Relative Adverbs . . . 254
5.25 Patterns of Complex Sentences with Connectives when, where, whenever and wherever . . . 255
5.26 Patterns of Complex Sentences with Connective who . . . 257
5.27 Five Most Similar Sentences for RC You go to India. Using Cost of Adaptation Based Scheme . . . 261
5.28 Five Most Similar Sentences for MC You should speak Hindi. Using Cost of Adaptation Based Scheme . . . 261
5.29 Five Most Similar Sentences for RC He wants to learn Hindi. Using Cost of Adaptation Based Scheme . . . 263
5.30 Five Most Similar Sentences for MC The student should study this book. Using Cost of Adaptation Based Scheme . . . 263
A.2 Different Case Endings in Hindi . . . 283
A.3 Suffixes and Morpho-Words for Hindi Verb Conjugations . . . 286
A.4 Verb Morphological Changes from English to Hindi Translation . . . 288
E.1 Costs Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . . 307
Chapter 1
Introduction
Machine Translation (MT) is the process of translating text units of one language
(the source language) into a second language (the target language) by using computers.
The need for MT is greatly felt in the modern age due to the globalization of
information, where the global information base needs to be accessed from different
parts of the world. Although most of this information is available online, the major
difficulty in dealing with it is that its language is primarily English. From science,
technology and education to gadget manuals and commercial advertisements, the
predominant presence of English as the medium of communication can be observed
everywhere. This world, however, is multi-lingual, with different languages spoken
in different regions. This necessitates the development of good MT systems for
translating these works into other languages so that a larger population can access,
retrieve and understand them. Consequently, in a country like India, where English
is understood by less than 3% of the population (Sinha and Jain, 2003), the need
for developing MT systems for translating from English into native Indian
languages is very acute. In this work we looked into different aspects of designing an
English to Hindi MT system using the Example-Based (Nagao, 1984) technique. Two
fundamental questions that we feel we should answer at this point are:
edge (Sumita and Iida, 1991). Various MT paradigms have so far evolved, depending
upon how the translation knowledge is acquired and used. For example:
1. Rule-Based Machine Translation (RBMT): here rules are used for the analysis
and representation of the meaning of the source language texts, and for the
generation of equivalent target language texts (Grishman and Kosaka, 1992),
(Thurmair, 1990), (Arnold and Sadler, 1990).
2. Statistical- (or Corpus-) Based Machine Translation (SBMT): statistical translation
models are trained on a sentence-aligned translation corpus, based on n-gram
modelling and the probability distribution of the occurrence of source-target
language pairs in a very large corpus. This technique was proposed by IBM in
the early 1990s (Brown, 1990), (Brown et al., 1992), (Brown et al., 1993),
(Germann, 2001).
However, these techniques have their own drawbacks. The main drawback of
RBMT systems is that sentences in any natural language may assume a large variety
of structures. Also, machine translation often suffers from ambiguities of various
types (Dorr et al., 1998). As a consequence, translation from one natural language
into another requires enormous knowledge about the syntax and semantics of
both the source and target languages. Capturing all this knowledge in rule form is
a daunting task, if not an impossible one. On the other hand, SBMT techniques depend on
how accurately various probabilities are measured. Realistic measurements of these
probabilities can be made only if a large volume of parallel corpus is made available,
and such huge data is not easy to obtain. Consequently, this scheme is
viable only for a small number of language pairs.
Example-Based Machine Translation (Nagao, 1984), (Carl and Way, 2003) makes
use of past translation examples to generate the translation of a given input. An
EBMT system stores in its example base translation examples between two languages,
the source language (SL) and the target language (TL). These examples are
subsequently used as guidance for future translation tasks. In order to translate a
new input sentence in SL, a similar SL sentence is retrieved from the example base,
along with its translation in TL. This example is then adapted suitably to generate a
translation of the given input. It has been found that EBMT has several advantages
in comparison with other MT paradigms (Sumita and Iida, 1991):
Other researchers too (e.g. (Somers, 1999), (Kit et al., 2002)) have considered
EBMT to be a major and effective approach among the different MT paradigms,
primarily because it exploits the linguistic knowledge stored in an aligned text in a
more efficient way.
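The retrieve-then-adapt cycle described above can be sketched as follows. This is a minimal illustration, not the system developed in this thesis: the tiny example base, the crude word-overlap similarity and the `adapt` hook are hypothetical stand-ins for the components developed in later chapters.

```python
# Minimal sketch of the EBMT cycle: retrieve the most similar past
# example, then adapt its translation to fit the new input.
# All entries and names here are illustrative.

example_base = [
    ("The boy eats rice.", "ladkaa chaawal khaataa hai"),
    ("The boy plays cricket everyday.", "ladkaa roz cricket kheltaa hai"),
]

def similarity(a, b):
    """Crude word-overlap score; the thesis develops richer measures."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def retrieve(input_sl):
    """Return the (SL, TL) example whose SL side best matches the input."""
    return max(example_base, key=lambda ex: similarity(ex[0], input_sl))

def translate(input_sl, adapt):
    sl, tl = retrieve(input_sl)
    return adapt(input_sl, sl, tl)   # adaptation modifies tl to fit input_sl

# With an identity "adaptation" the system simply replays the closest example:
print(translate("The boy eats rice everyday.", lambda i, s, t: t))
```

With a real adaptation step, the returned translation would additionally be modified (here, by adding the Hindi of *everyday*) rather than replayed verbatim.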
We infer from the above observation that for the development of MT systems
from English to Indian languages, EBMT should be one of the preferred approaches.
This is because a significant volume of parallel corpus is available between English
and different Indian languages in the form of government notices, translation books,
advertisement material, etc. Although this data is generally not yet available in
electronic form, converting it into machine-readable form is much easier than
formulating explicit translation rules as required by an RBMT system. In fact, some
parallel data in electronic form has been made available through some projects (e.g.
EMILLE: http://www.emille.lancs.ac.uk/home.html). Also, there has been
concerted effort from various government organizations, such as TDIL2, CIIL Mysore3
and C-DAC Noida4 (Vikas, 2001), and various institutes, e.g. IIT Bombay5, IIT
Kanpur6 and LTRC (IIIT Hyderabad)7, to develop linguistic resources. At the same
time, this data is not large enough to design an English to Hindi SBMT system, which
typically requires several hundred thousand sentences. These resources, we hope,
will be fruitfully utilized for developing different EBMT systems involving Indian
languages.
Of the different Indian languages8, Hindi has some major advantages over the
others as far as working on MT is concerned. Not only is Hindi the national language
of India, it is also the most popular among all Indian languages. With respect to
Indian languages, all the major works that have been reported so far (e.g. ANGLAHINDI
(Sinha et al., 2002), SHIVA (http://shiva.iiit.net/), SHAKTI (Sangal, 2004),
MaTra (human-aided MT)9) are primarily concerned with English and Hindi as their
preferred languages. In 2003, Hindi was designated the surprise language
(Oard, 2003) by DARPA. As a consequence, different universities (e.g. CMU, Johns
Hopkins, USC-ISI) have invested efforts in developing MT systems involving Hindi.
2. http://tdil.mit.gov.in/
3. http://www.ciil.org/
4. http://www.cdacnoida.com/
5. http://www.cfilt.iitb.ac.in
6. http://www.cse.iitk.ac.in/users/isciig/
7. http://ltrc.iiit.net/
8. India has 17 official languages, and more than 1000 dialects
(http://azaz.essortment.com/languagesindian rsbo.htm).
9. http://www.ncst.ernet.in/matra/about.shtml
This world-wide popularity of the language makes the study of English to Hindi
machine translation all the more meaningful in today's context.
One major advantage of having the above-mentioned English to Hindi translation
systems available on-line is that it enabled us to work with the systems and examine
the quality of their outputs. In this respect, we find that the outputs given by the
above systems are not always correct translations of the inputs. Table 1.1 illustrates
this with respect to the systems AnglaHindi and Shakti. In this table we show the
translations produced by the two systems for different inputs, along with the correct
translations of these sentences.
Table 1.1: Translations produced by AnglaHindi and Shakti for some input sentences, together with the actual translations

Input Sentences | Output of AnglaHindi     | Output of Shakti         | Actual Translation
It is raining.  | yah varshaa ho rahii hai | yah varshaa ho rahii hai | varshaa ho rahii hai
We have found many such instances where the outputs produced by the systems
may not be considered correct Hindi translations of the respective inputs. This
observation prompts us to study different aspects of English to Hindi translation in
order to understand the difficulties of machine translation, particularly with respect
to English to Hindi translation, and how these shortcomings can be dealt with
under an EBMT framework. This research is concerned with the above studies.
1.1
The success of an EBMT system lies in two different modules: (i) similarity
measurement and retrieval, and (ii) adaptation. Retrieval is the procedure by which a
suitable translation example is retrieved from a system's example base. Adaptation
is the procedure by which a retrieved translation is modified to generate the
translation of the given input. Various retrieval strategies have been developed (e.g.
(Nagao, 1984), (Sato, 1992), (Collins and Cunningham, 1996)). All these retrieval
strategies aim at retrieving an example from the example base that is similar to the
input sentence. This is due to the fact that the fundamental intuition behind EBMT
is that translations of similar sentences of the source language will be similar in the
target language as well. Thus the concept of retrieval is intricately related to the
concept of similarity measurement between sentences.
But the main difficulty with this assumption is that there is no straightforward
way to measure similarity between sentences. Different works have defined different
approaches for measuring similarity between sentences, for example: word-based
metrics (e.g. (Nirenburg, 1993), (Nagao, 1984)), character-based metrics (e.g.
(Sato, 1992)), syntactic/semantic based matching (e.g. (Manning and Schutze, 1999)),
DP-matching between word sequences (e.g. (Sumita, 2001)), and hybrid retrieval
schemes (e.g. (Collins, 1998)).
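By way of illustration, the DP-matching idea mentioned above amounts to an edit distance computed over word sequences rather than characters. The following is a generic sketch of that idea, not the specific metric of (Sumita, 2001):

```python
def dp_match_distance(s1, s2):
    """Word-sequence edit distance (DP-matching): the minimum number of
    word insertions, deletions and substitutions turning s1 into s2."""
    w1, w2 = s1.split(), s2.split()
    m, n = len(w1), len(w2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all words of a prefix
    for j in range(n + 1):
        d[0][j] = j                      # insert all words of a prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if w1[i - 1] == w2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a word
                          d[i][j - 1] + 1,        # insert a word
                          d[i - 1][j - 1] + cost) # substitute or copy
    return d[m][n]

print(dp_match_distance("the boy eats rice", "the boy eats rice everyday"))  # 1
```

A smaller distance indicates a more similar past example, so retrieval under this metric simply returns the example minimizing the distance to the input.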
In all these works similarity measurement and adaptation are considered
in isolation. This, we feel, is the major hindrance with respect to EBMT. In this
work we therefore propose a novel approach for measuring similarity: we look at
similarity from the point of view of adaptation. We suggest that a past example
be considered the most similar to an input sentence if its adaptation towards
generating the desired translation is the simplest. The work carried out in this
research is aimed at achieving this goal. Our studies therefore start in the following
way. We first look at adaptation in detail. An efficient adaptation scheme is very
important for an EBMT system because even a very large example base cannot, in
general, guarantee an exact match for a given input sentence. As a consequence, an
efficient and systematic adaptation scheme is needed for modifying a retrieved
example, and thereby generating the required translation. Various adaptation
schemes have been proposed in the literature, e.g. (Veale and Way, 1997), (Shiri et
al., 1997), (Collins, 1998) and (McTait, 2001). A scrutiny of these schemes suggests
that there are primarily four basic adaptation operations: word addition, word
deletion, word replacement and copy.
In our approach we started with these basic operations: word addition, word
deletion, word replacement and copy. However, in this respect we notice the following:
1. Both English and Hindi rely heavily on suffixes for morphological changes.
There are a number of suffixes for achieving declension of verbs and nouns.
Further, in Hindi there are situations when morphological changes in the
adjectives are also required, depending upon the number and gender of the
corresponding noun/pronoun. Since the number of suffixes is limited, we feel that
if adaptation operations are focused on the suffixes instead of being purely
word-based, then in many situations a significant amount of computational
effort may be saved.
2. A further observation with respect to Hindi is that there are situations when,
instead of suffixes, whole words are used for bringing in morphological variations.
For example, the present continuous form of Hindi verbs is: <root form of the
verb> + <rahaa/rahii/rahe> + <hai/hain/ho>. Here the words rahaa,
rahii or rahe are used to achieve the morphological variation; which of
them is used depends upon the number and gender of the subject. Similarly,
hai, hain or ho is used depending upon the number and person of the
subject. We term such words morpho-words. Appendix A gives details of
different Hindi morpho-words and their usages.
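The pattern just described can be sketched as a small conjugator. The rules below are deliberately simplified for illustration, restricted to the third person (the choice among hai, hain and ho also depends on the person of the subject, which is not modelled here); Appendix A gives the full picture.

```python
# Simplified third-person Hindi present continuous:
#   <verb root> + <rahaa/rahii/rahe> + <hai/hain>
# gender: "m"/"f"; number: "sg"/"pl". Illustration only.

def present_continuous(root, gender, number):
    if gender == "f":
        aux1 = "rahii"                          # feminine, any number
    else:
        aux1 = "rahaa" if number == "sg" else "rahe"
    aux2 = "hai" if number == "sg" else "hain"  # also varies with person
    return f"{root} {aux1} {aux2}"

print(present_continuous("khaa", "m", "sg"))  # khaa rahaa hai  (he is eating)
print(present_continuous("khaa", "f", "sg"))  # khaa rahii hai  (she is eating)
print(present_continuous("khaa", "m", "pl"))  # khaa rahe hain  (they are eating)
```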
A major fallout of the above observation is that in some situations adaptation
may be carried out by dealing with the morpho-words instead of whole words, which
is computationally much less expensive than dealing with constituent words as a
whole. Thus we propose an adaptation scheme consisting of ten operations: addition,
deletion and replacement of constituent words; addition, deletion and replacement
of morpho-words; addition, deletion and replacement of suffixes; and copy. Chapter
2 of the thesis discusses these adaptation operations in detail.
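The relative expense of these ten operations can be sketched with a toy cost model. The numeric costs below are illustrative placeholders, not values from the thesis; they only encode the observation above that suffix and morpho-word operations are computationally cheaper than whole-word operations:

```python
# The ten adaptation operations with illustrative relative costs.
# Word addition/replacement involve a dictionary lookup, so they are
# assumed costlier; the numbers are placeholders, not measurements.

OPERATION_COSTS = {
    "word_add": 10, "word_delete": 2, "word_replace": 12,
    "morphoword_add": 1, "morphoword_delete": 1, "morphoword_replace": 1,
    "suffix_add": 1, "suffix_delete": 1, "suffix_replace": 1,
    "copy": 0,
}

def adaptation_cost(operations):
    """Total cost of a sequence of adaptation operations."""
    return sum(OPERATION_COSTS[op] for op in operations)

# Adapting "khaataa hai" (eats) to "khaa rahaa hai" (is eating) needs only
# a suffix change and a morpho-word addition, with no dictionary access:
print(adaptation_cost(["suffix_replace", "morphoword_add", "copy"]))  # 2
```

Under such a model, a candidate example requiring two word replacements would score far worse than one requiring only suffix-level repairs, which is the intuition behind the cost-of-adaptation similarity developed in Chapter 5.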
One point we notice with respect to the above-mentioned operations, however,
is that they cannot deal with translation divergences in an efficient way. Divergence
occurs when structurally similar sentences of the source language do not translate
into sentences that are similar in structure in the target language (Dorr, 1993). We
therefore felt that the study of divergence is an important aspect for any MT system.
With respect to an EBMT system the need arises for two reasons:
1. The past example that is retrieved for carrying out the task of adaptation
has a normal translation, but the translation of the input sentence should involve
divergence.
2. The translation of the retrieved example involves divergence, whereas the input
sentence should have a normal translation.
In this work we made an in-depth study of divergence with respect to English to
Hindi translation. In this regard one may note that divergence is a highly
language-dependent phenomenon: its nature may change with the source and target
languages under consideration. Although divergence has been studied extensively
with respect to translation between European languages (e.g. (Dorr et al., 2002),
(Watanabe et al., 2000)), very few studies on divergence may be found regarding
translations in Indian languages. The only work that came to our notice is (Dave
et al., 2002). In that work the authors followed the classification given in (Dorr,
1993) and tried to find examples of each class with respect to English to Hindi
translation. In this regard it may be noted that Dorr described seven different
divergence types with respect to translations between European languages: structural,
categorial, conflational, promotional, demotional, thematic and lexical.
However, we find that not all the divergence types explained in Dorr's work
apply to Indian languages. In fact, we found very few (if not no) examples of
thematic and promotional divergence with respect to English to Hindi translation.
On the other hand, we identified three new types of divergence that have not so
far been cited in any other work on divergence. We named these divergences
nominal, pronominal and possessional, respectively. We have further observed
that all the divergence types (barring structural) for which we found instances in
English to Hindi translation may be further divided into several sub-categories.
Chapter 3 explains in detail the different divergence types and sub-types that we
have observed with respect to English to Hindi translation, and illustrates them with
suitable examples. Some of these results have already been presented in (Gupta and
Chatterjee, 2003a) and (Gupta and Chatterjee, 2003b).
The presence of divergence examples in the example base makes straightforward
application of the above-mentioned adaptation scheme difficult. As mentioned earlier,
application of the operations discussed in Chapter 2 will not be able to generate
the correct translation if the input sentence requires normal translation whereas
the translation of the retrieved example involves divergence, or vice versa. To overcome
this difficulty we suggest that the example base be partitioned into two parts,
one containing examples of normal translation and the other containing examples
of divergence, so that given an input sentence an EBMT system may retrieve an
example from the appropriate part of the example base. However, implementation
of the above scheme requires the design of algorithms for:
1) Partitioning the example base sentences.
2) Designing an efficient retrieval policy.
We attempt to answer the first by designing algorithms for the identification of
translation divergence, i.e. if an English sentence and its Hindi translation are given
as input, these algorithms will detect whether this translation involves any of the said
algorithm seeks evidence from the example base and the WordNet. In this work we
have used WordNet 2.013 to measure the semantic similarity of the constituent words
of the input sentence with various words present in the example base sentences, to
arrive at a decision in this regard. The scheme works in the following way. We first
identified the roles of different Functional Tags (FT) towards causing divergence.
We observe with respect to the different divergence types and sub-types that each FT
may have one of the three following roles:
This knowledge is stored in the form of a table (Table 4.2) in Chapter 4. Given
an input sentence the scheme first determines its constituent FTs. We have used the
ENGCG parser14 for parsing an input sentence and obtaining its FTs. This finding
is then compared with the above-mentioned knowledge base (Table 4.2) to identify
the set (D) of divergence types that may possibly occur in the translation of the
sentence. Further investigation is carried out to discard elements from the set D, so
that the divergence that may actually occur can be pin-pointed. In this respect we
proceed in the following way. Corresponding to each divergence type we identify the
functional tag that is at the root of causing the divergence. We call it the problematic
FT corresponding to that particular divergence. Table 4.3 presents our findings
in this regard. Corresponding to each possible divergence (as found in D) the scheme
13. http://www.cogsci.princeton.edu/cgi-bin/webwn
14. http://www.lingsoft.fi/cgi-bin/engcg
works as follows. It first retrieves from the input sentence the constituent word
corresponding to the problematic FT of the divergence type under consideration. Then
the semantic similarity of this word with other words is computed, and proximity in
this semantic distance is used as a yardstick for similarity measurement. Chapter
4 discusses this scheme in detail.
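The semantic-distance yardstick can be illustrated with a toy path-based similarity of the kind commonly computed over WordNet hierarchies: similarity decays with the number of edges separating two words via a common ancestor. The miniature taxonomy below is invented for illustration; the thesis itself consults WordNet 2.0.

```python
# Toy path-based semantic similarity over a hand-built hypernym tree.
# The entries are illustrative, not taken from WordNet.

HYPERNYM = {                     # word -> its hypernym (parent)
    "ghazal": "song", "song": "music", "music": "art",
    "cricket": "sport", "sport": "activity", "art": "activity",
}

def path_to_root(word):
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def path_similarity(w1, w2):
    """1 / (1 + edges between w1 and w2 via their lowest common ancestor)."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    for i, node in enumerate(p1):
        if node in p2:
            return 1.0 / (1 + i + p2.index(node))
    return 0.0                   # no common ancestor in this toy taxonomy

print(path_similarity("ghazal", "song"))     # 0.5 (one edge apart)
print(path_similarity("ghazal", "cricket"))  # distant pair, small score
```

In the divergence-identification scheme, a score of this kind between the word filling the problematic FT and the words of stored examples supplies the proximity evidence mentioned above.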
Finally, in Chapter 5 we look at how the cost of adaptation may be used as a
similarity measurement scheme. It has been stated that no unique definition of similarity
exists for comparing sentences; similarity between sentences may be viewed from
different perspectives. In this work, we have first considered the two most general
similarity schemes: syntactic similarity and semantic similarity. The ideas have
been borrowed from the domain of Information Technology (Manning and Schutze,
1999). According to the definition given therein, semantic similarity is measured on
the basis of the commonality of words: the more words two sentences have in common,
the more similar the two sentences are said to be. However, it has been shown in
(Chatterjee, 2001) that this measurement of similarity is not always helpful from the
EBMT point of view. For example, it has been shown there that although the sentences
"The horse had a good run." and "The horse is good to run on." have most of their
key words in common, the structures of their Hindi translations are very different.
Consequently, adapting the translation of one of them to generate the translation of
the other is computationally demanding. On the other hand, syntactic similarity
between two sentences is measured on the basis of the commonality of
morpho-functional tags between them. In this case, adaptation may require a large
number of constituent word replacement (WR) operations. Each of these WR operations
involves reference to some dictionary for picking up the appropriate words in the
target language. Typically the dictionary access will involve accessing external
storage, and will thereby incur significant computational cost. Thus a purely
syntax-based similarity measurement scheme may not be suitable for an EBMT system.
In this work we therefore propose that, from the EBMT perspective, retrieval and
adaptation should be looked at in a unified way. In this chapter (i.e. Chapter
5) we investigate the feasibility of the above proposal in depth. In this respect we
first look deeply into the overall adaptation operations. We have already observed
that these operations are invoked successively to remove the discrepancies between
the input sentence and the retrieved example. These discrepancies, as we observe,
may be in the actual words, or in the overall structure of the sentences. For illustration,
suppose the input sentence is "The boy eats rice everyday.", whose Hindi translation
ladkaa har roz chaawal khaataa hai has to be generated. The nature of the adaptation
varies depending upon which example is retrieved from the example base. For
illustration:
a) If the retrieved example is "The boy eats rice.", the adaptation procedure needs
to apply a constituent word addition (WA) operation to take care of the adverb
everyday.
b) However, if the retrieved sentence is "The boy plays cricket everyday." (ladkaa
roz cricket kheltaa hai), then the adaptation procedure needs to invoke two
constituent word replacement (WR) operations: to replace the Hindi of play,
i.e. khel, with the Hindi of eat, i.e. khaa, and cricket (cricket) with
chaawal (rice).
c) In case the retrieved example is "The boy is eating rice.", one constituent word
addition (WA) operation is required for the adverb everyday.
Obviously, the greater the discrepancy between the retrieved example and the
input sentence, the more adaptation operations are needed to generate the desired
translation. The above illustrations make certain points evident:
a) Adaptation operations are required for performing two general tasks: dealing
with the constituent words (along with their suffixes and morpho-words), and
dealing with the overall structure of the sentence.
b) Each invocation of an adaptation operation pertains to a particular part of
speech, such as noun, verb, adverb, etc.
c) Of the ten adaptation operations (described in detail in Chapter 2), only the
WA and WR operations require dictionary15 searches. Since a dictionary search
typically involves accessing an external device (e.g. a hard disk), it is
computationally more expensive than the other operations (e.g. constituent word
deletion, morpho-word operations), which are purely RAM16-based and hence
computationally cheaper.
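The observations above suggest a simple computational sketch of cost of adaptation as a similarity measure. The numeric costs below are hypothetical placeholders (the thesis estimates such costs empirically); the per-candidate operation lists follow the illustration with the input The boy eats rice everyday.:

```python
# Sketch: adaptation cost as a similarity measure.  The costs are
# hypothetical placeholders: dictionary-based operations (WA, WR) are
# priced higher than the purely RAM-based ones.
OP_COST = {
    "CP": 0.0,                          # copy
    "WD": 1.0,                          # constituent word deletion
    "MA": 1.0, "MD": 1.0, "MR": 1.0,    # morpho-word operations
    "SA": 1.0, "SD": 1.0, "SR": 1.0,    # suffix operations
    "WA": 5.0,                          # word addition (dictionary search)
    "WR": 5.0,                          # word replacement (dictionary search)
}

def adaptation_cost(ops):
    """Total estimated cost of a sequence of adaptation operations."""
    return sum(OP_COST[op] for op in ops)

def most_adaptable(candidates):
    """Return the candidate example whose adaptation is cheapest."""
    return min(candidates, key=lambda ex: adaptation_cost(candidates[ex]))

# Candidates for the input 'The boy eats rice everyday.':
candidates = {
    "The boy eats rice": ["WA"],                     # add 'everyday'
    "The boy plays cricket everyday": ["WR", "WR"],  # play->eat, cricket->rice
}
best = most_adaptable(candidates)
```

Under these placeholder costs the first candidate wins: one dictionary-based addition is cheaper than two replacements.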
The above observations help us proceed towards the intended goal
of using the cost of adaptation as a measure of similarity. As a first step towards
achieving this goal, we suggest dividing the dictionary into several parts
based on the part-of-speech (POS) of the words. Dividing the dictionary
according to POS reduces the search time for each invocation of
the WA and WR procedures. The cost-of-adaptation based similarity measurement
approach then proceeds along the following lines:
a) We first estimate the average cost of each of the ten adaptation operations.
We observe that these costs depend on two major types of parameters. On
one hand they depend on certain linguistic aspects, such as the average length
of the sentences in both source and target languages, the number of suffixes
(used with different POS), the number of morpho-words etc. On the other
hand, these costs are related to the machine on which the EBMT system is
running. Since we aim at analyzing the costs in a general way, we treated
these machine-dependent costs as variables in all our analysis. For the linguistic parameters, we used values obtained by analyzing about
15 By dictionary we mean a source language to target language word dictionary available online.
16 Random Access Memory.
30,000 examples of English to Hindi translations. These examples were collected from various sources, such as translation books, advertisement materials, children's story books and government notices, which are freely available
in non-electronic form.
b) At the second step, we estimated the costs incurred in adapting various functional tags17 . In particular, we have considered costs of adaptation due to variations in active and passive verb morphology, subject/object, pre-modifying
adjectives, genitive case and wh-family words. These costs are stored in various
tables in Section 5.4.
c) At the third step we have considered costs of adaptation due to differences in
sentence structure. Here, we have considered four different sentence structures:
affirmative, negative, interrogative and negative-interrogative. These adaptation
costs too are stored in tabular form. Section 5.4 gives the details of this analysis.
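The POS-based division of the dictionary described above can be sketched as follows; the entries and the sorted, binary-searched layout are illustrative assumptions, not the thesis's actual dictionary:

```python
# Sketch: a bilingual dictionary partitioned by part of speech, so each
# WA/WR lookup searches only the sub-dictionary for the word's POS.
from bisect import bisect_left

DICTIONARY = {
    "noun": sorted([("boy", "ladkaa"), ("cricket", "cricket"),
                    ("rice", "chaawal")]),
    "verb": sorted([("eat", "khaa"), ("play", "khel")]),
    "adverb": sorted([("everyday", "har roz")]),
}

def lookup(word, pos):
    """Binary-search the POS partition; return None if the word is absent."""
    part = DICTIONARY[pos]
    i = bisect_left(part, (word, ""))
    if i < len(part) and part[i][0] == word:
        return part[i][1]
    return None
```

Because each partition is much smaller than the whole dictionary, every WA or WR invocation touches fewer entries, which is exactly the search-time saving argued for above.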
Once these basic costs are modelled, we are in a position to experiment on cost
of adaptation as a similarity measure vis-à-vis the semantics and syntax based similarity
measurement schemes discussed above. Our experiments have clearly established the
efficiency of the proposed scheme over the others. Part of this work is also presented
in Gupta and Chatterjee (2003c). Two apparent drawbacks of this scheme are:
1) It may end up comparing a given input with all the example-base sentences
to ascertain the least cost of adaptation.
2) Another major question that may arise is whether the cost of adaptation
scheme is efficient enough to handle sentences that are structurally more com-

17 In fact we worked on Functional Slots, which are more general than Functional Tags. This is discussed in detail in Section 2.2.

tation with respect to a selected set of sentence structures, and for a selected set of
Functional Slots. Certainly many more variations are available with respect to these
parameters. Consequently, more work has to be done to form rules for handling
these variations. However, we feel that the work described in this research provides a
suitable guideline for its further continuation.
1.2
1) The aim of this research is not to construct an English to Hindi EBMT system.
Rather, our intention is to analyze the requirements that help in building an
effective EBMT system. The motivation behind this research came from two
major observations:

Although some MT systems for translation from English to Hindi already
exist, the quality of their translation is often not up to the mark. This
prompted us to look into the process of MT to ascertain the inherent
difficulties.

We have chosen EBMT as our preferred paradigm because of certain
advantages it has over other MT paradigms, such as RBMT and SBMT. One major
advantage of EBMT is that it requires neither a huge parallel corpus as
required by SBMT, nor the framing of a large rule base as required by
RBMT. Study of EBMT is therefore feasible for us, as we did not have
access to such linguistic resources.
2) In order to design our scheme we have studied about 30,000 English to Hindi
translation examples available off-line. Although large volumes of English
English sentence: The horses have been running for one hour.
Tagged form: @DN> ART the, @SUBJ N PL horse %ghodaa%,
@+FAUXV V PRES have, @-FAUXV V PCP2 be, @-FMAINV V PCP1
run %daudaa%, @ADVL PREP for, @QN> NUM CARD one %ek%,
@<P N SG hour %ghantaa%.
Hindi sentence: ghode ek ghante se daudaa rahen hain
Figure 1.1: An Example Sentence with Its Morpho-Functional Tags
to Hindi parallel text is available on-line (EMILLE: http://www.emille.lancs.ac.
uk/home.htm), at the time this work was started no such parallel corpus
was available to us. For our work we prepared an online parallel example base
of 4,500 sentences. These example pairs were chosen carefully so that different
sentence structures as well as translation variations (divergences) are taken care of
as far as possible.
3) Each translation example record in our example base contains morpho-functional
tag18 information for each of the constituent words of the source language (English) sentence, along with the sentence, its Hindi translation, and the root
word correspondence. Figure 1.1 provides an example of the records stored in
our example base.

The morpho-functional tags of a word indicate its syntactic function within
the sentence. The tags are helpful in identifying the root words, their roles
in the sentence, and the roles of the different suffixes (used for declensions) in the
overall sentence construction.
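A minimal record structure for an example-base entry like the one in Figure 1.1 might look as follows; the class and function names are our own, while the tag strings follow the ENGCG-style tags shown in the figure:

```python
# Sketch: one example-base record with per-word morpho-functional tags
# and Hindi root correspondences (empty string where none is stored).
from dataclasses import dataclass

@dataclass
class TaggedWord:
    tags: str          # morpho-functional tags, e.g. "@SUBJ N PL"
    word: str          # the English surface word
    hindi_root: str    # Hindi root correspondence, if any

@dataclass
class ExampleRecord:
    english: str
    hindi: str
    words: list

record = ExampleRecord(
    english="The horses have been running for one hour.",
    hindi="ghode ek ghante se daudaa rahen hain",
    words=[
        TaggedWord("@DN> ART", "the", ""),
        TaggedWord("@SUBJ N PL", "horse", "ghodaa"),
        TaggedWord("@+FAUXV V PRES", "have", ""),
        TaggedWord("@-FAUXV V PCP2", "be", ""),
        TaggedWord("@-FMAINV V PCP1", "run", "daudaa"),
        TaggedWord("@ADVL PREP", "for", ""),
        TaggedWord("@QN> NUM CARD", "one", "ek"),
        TaggedWord("@<P N SG", "hour", "ghantaa"),
    ],
)

def subject_root(rec):
    """Return the Hindi root of the word carrying the @SUBJ tag."""
    for w in rec.words:
        if "@SUBJ" in w.tags:
            return w.hindi_root
    return None
```

Keeping the tags alongside each word lets the retrieval and adaptation steps ask role-based questions (find the subject, find the main verb) without re-parsing the sentence.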
In this work we have studied the two major pillars of EBMT: Retrieval and
Adaptation. We feel that the studies made as well as the techniques developed by
18 Appendix B provides the different morpho tags and functional tags that have been used in this work. These tags are obtained by editing the sentence tagging given by the ENGCG parser (http://www.lingsoft.fi/cgi-bin/engcg) for English sentences.
this research will be helpful for developing MT systems not only for Hindi but also for
other Indian languages (e.g. Bangla, Gujrati, Panjabi). All these languages suffer
from the same drawback: unavailability of linguistic resources. However, the demand
for developing MT systems from English to these languages is increasing with time,
not only because these are prominent regional languages of India, but also because they
are important minority languages in other countries such as the U.K. (Somers, 1997).
The studies made in this research should pave the way for developing EBMT systems
involving these languages as well.
Chapter 2
Adaptation in English to Hindi
Translation: A Systematic
Approach
2.1 Introduction
The need for an efficient and systematic adaptation scheme arises when modifying a
retrieved example to generate the required translation. This chapter is
devoted to the study of a systematic adaptation approach. Various approaches have
been pursued in dealing with the adaptation aspect of an EBMT system. Some of the
major approaches are described below.
1. Adaptation in Gaijin (Veale and Way, 1997) is modelled via two categories:
high-level grafting and keyhole surgery. High-level grafting deals with phrases.
Here an entire phrasal segment of the target sentence is replaced with another
phrasal segment from a different example. On the other hand, keyhole surgery
deals with individual words in an existing target segment of an example. Under
this operation words are replaced or morphologically fine-tuned to suit the
current translation task. For instance, suppose the input sentence is The girl
is playing in the park., and in the example base we have the following examples:
(a) The boy is playing.
(b) Rita knows that girl.
(c) It is a big park.
(d) Ram studies in the school.
For the high level grafting the sentences (a) and (d) will be used. Then keyhole
surgery will be applied for putting in the translations of the words park and
girl. These translations will be extracted from (b) and (c).
2. Shiri et al. (1997) have proposed another adaptation procedure. It is based on
three steps: finding the difference, replacing the difference, and smoothing the
output. The differing segments of the input sentence and the source template
are identified. Translations of these different segments in the input sentence
are produced by rule-based methods, and these translated segments are fitted
into a translation template. The resulting sentence is then smoothed over by
checking for person and number agreement, and inflection mismatches. For
example, assume the input sentence and selected template are:
SI
St
Tt
The parsing process, however, shows that The very efficient lady doctor is a
noun phrase, and so matches it with The lady doctor - ek mahilaa chikitsak . The very efficient lady doctor is translated as ek bahut yogya mahilaa
chikitsak by the rule-based noun phrase translation system. This is inserted
into Tt, giving the following: Tt : ek bahut yogya mahilaa chikitsak vyasta hai.
3. The ReVerb system (Collins, 1998) proposed the following adaptation scheme. Here
two different cases are considered: full-case adaptation and partial-case adaptation. Full-case adaptation is employed when a problem is fully covered by the
retrieved example. Here the desired translation is created by substitution alone.
No addition or deletion is required for adapting TL0 to generate the translation of SL. Here TL0 and SL denote the example-base target language sentence
and the input source language sentence, respectively. In this case five scenarios
are possible: SAME, ADAPT, IGNORE, ADAPTZERO and IGNOREZERO.
Partial-case adaptation is used when a single unifying example does not exist.
Here three more operations are required on top of the above five. These
three operations are ADD, DELETE and DELETZERO.
are:
I : That old woman has died.
R : That old man has died. wah boodhaa aadmii mar gayaa
To generate the desired translation, the word aadmii (the translation of man) is first replaced with aurat (the translation of woman) in R. This operation is called
reinstantiation. At this stage an intermediate translation wah boodhaa aurat
mar gayaa is obtained. To obtain the final translation wah boodhii aurat
mar gayii , the system must also change the adjective boodhaa to boodhii
and the word gayaa to gayii . This is called parameter adjustment.
5. The adaptation scheme proposed by McTait (2001) works in the following way.
Translation patterns that share lexical items with the input and partially cover
it are retrieved in a pattern matching procedure. From these, the patterns
whose SL side covers the SL input to the greatest extent (longest cover) are
selected. They are termed base patterns, as they provide sentential context in
the translation process. Intuitively, the greater the extent of the cover
provided by the base patterns, the more the context, and the lesser the
ambiguity and complexity in the translation process. If the SL side of the base
pattern does not fully cover the SL input, any unmatched segments are bound
to the variables on the SL side of the base pattern. The translations of the SL
segments bound to the SL variables of the base pattern are retrieved from the
remaining set of translation patterns, and the text fragments and variables on
the TL side of the base pattern form translation strings.
The following is a simple example: given the source language input is I: AIDS
control programme for Ethiopia, suppose the longest covering base pattern is:
D1: AIDS control programme for (....) ke liye AIDS contral smahaaroo (...).
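The replace-and-fill idea in the template-based scheme of item 2 above (Shiri et al.) can be sketched as a slot-filling step; the '<X>' slot marker and the function name are our own illustrative assumptions:

```python
# Sketch: inserting the rule-based translation of the differing segment
# into a translation template (the smoothing step is omitted here).
def fill_template(target_template, slot_translation):
    """Replace the template's slot marker with the translated segment."""
    return target_template.replace("<X>", slot_translation)

# Target template for the lady-doctor example, with the noun phrase
# slotted out:
Tt = "<X> vyasta hai"
# Rule-based translation of 'The very efficient lady doctor':
np_translation = "ek bahut yogya mahilaa chikitsak"
filled = fill_template(Tt, np_translation)
```

A real system would follow this with the smoothing pass the authors describe, checking person and number agreement and inflection mismatches across the slot boundary.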
Many other EBMT systems are found in the literature, e.g. GEBMT (Brown, 1996,
1999, 2000, 2001), EDGAR (Carl and Hansen, 1999) and TTL (Güvenir and Cicekli,
1998). But overall, in our view, the adaptation procedures employed in different
EBMT systems primarily consist of four operations:
Copy, where the same chunk of the retrieved translation example is used in
the generated translation;
Add, where a new chunk is added to the retrieved translation example;
Delete, where some chunk of the retrieved example is deleted; and
Replace, where some chunk of the retrieved example is replaced with a new
one to meet the requirements of the current input.
The operations prescribed in different systems vary in the chunks they deal with.
Depending upon the case it may be a phrase, a word or a sub-word (e.g. declensional
suffix).
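These four generic operations can be viewed as an edit script applied over the chunks of a retrieved translation. A minimal sketch, using word-level chunks of the Hindi sentence pakshii aur pashu pyaas se mar rahe the (an example discussed later in this chapter):

```python
# Sketch: Copy/Add/Delete/Replace as an edit script over target chunks.
def adapt(chunks, script):
    """chunks: target-language chunks of the retrieved example.
    script: (operation, argument) pairs applied left to right."""
    out, i = [], 0
    for op, arg in script:
        if op == "copy":
            out.append(chunks[i]); i += 1
        elif op == "add":
            out.append(arg)          # new chunk; consumes no source chunk
        elif op == "delete":
            i += 1                   # drop the source chunk
        elif op == "replace":
            out.append(arg); i += 1  # substitute the source chunk
    return " ".join(out)

# Delete 'pakshii aur' (birds and) to obtain the translation of
# 'Animals were dying of thirst.':
retrieved = ["pakshii", "aur", "pashu", "pyaas", "se", "mar", "rahe", "the"]
script = [("delete", None), ("delete", None)] + [("copy", None)] * 6
result = adapt(retrieved, script)
```

Different systems differ only in what a chunk is (phrase, word or sub-word) and in how the script is diagnosed, not in this basic operation inventory.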
2.2
1. Constituent Word Replacement (WR): Suppose the input sentence is: The squirrel was eating groundnuts., and the
most similar example retrieved by the system (along with its Hindi translation)
is: The elephant was eating fruits. haathii phal khaa rahaa thaa. The
desired translation may be generated by replacing haathii with the Hindi of
squirrel, i.e. gilharii , and replacing phal with the Hindi of groundnuts,
i.e. moongphalii . These are examples of the constituent word replacement
operation.
2. Constituent Word Deletion (WD): In some cases one may have to delete some
words from the translation example to generate the required translation. For
example, suppose the input sentence is: Animals were dying of thirst. If the
retrieved translation example is: Birds and Animals were dying of thirst.
pakshii aur pashu pyaas se mar rahe the, then the desired translation can
be obtained by deleting pakshii aur (i.e. the Hindi of birds and) from the
retrieved translation. Thus the adaptation here requires two constituent word
deletions.
3. Constituent Word Addition (WA): This operation is the opposite of constituent
word deletion. Here some additional words must be added to the retrieved translation example to generate the translation. For illustration, one
may consider the example given above with the roles of the input and retrieved
sentences reversed.
4. Morpho-word Replacement (MR): In this case one morpho-word is replaced by
another morpho-word in the retrieved translation example. Consider a case
when the input sentence is: The squirrel was eating groundnuts., and the
retrieved example is: The squirrel is eating groundnuts. gilharii moongfalii

Of course the final translation will be obtained by adding the suffix taa to the word khaa.
(Figure: word alignment of the retrieved example We are playing football with its Hindi translation ... football khel rahe hain)
The translation to be generated is: wah roz football kheltaa hai . When adaptation is carried
out using both word and suffix operations, the adaptation steps look as
given in Figure 2.2. In this respect one may note that Hindi is a free word order language,
and consequently the position of the adverb is not fixed. Hence the above input sentence
may have different Hindi translations:
wah roz football kheltaa hai
wah football roz kheltaa hai
roz wah football kheltaa hai
While implementing an EBMT system one has to stick to some specific format.
The adverb will be added according to the format adopted by the system.
Input      Operation   Output
we         WR          wah
-          WA          roz
football   CP          football
khel       SA          kheltaa
rahe       MD          -
hain       MR          hai

Figure 2.2: Adaptation steps using word and suffix operations
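The operation sequence of Figure 2.2 can be replayed programmatically. A minimal sketch, assuming the retrieved translation's words are ham football khel rahe hain (ham is our assumption for the Hindi counterpart of the figure's first input word we):

```python
# Sketch replaying the adaptation steps of Figure 2.2.
def apply_steps(words, steps):
    """words: chunks of the retrieved Hindi translation.
    steps: (operation code, argument) pairs as in Figure 2.2."""
    out, i = [], 0
    for op, arg in steps:
        if op == "CP":                  # copy unchanged
            out.append(words[i]); i += 1
        elif op in ("WR", "MR"):        # replace word / morpho-word
            out.append(arg); i += 1
        elif op == "WA":                # add a constituent word
            out.append(arg)
        elif op == "SA":                # add a suffix to the current word
            out.append(words[i] + arg); i += 1
        elif op == "MD":                # delete a morpho-word
            i += 1
    return " ".join(out)

steps = [("WR", "wah"), ("WA", "roz"), ("CP", None),
         ("SA", "taa"), ("MD", None), ("MR", "hai")]
result = apply_steps(["ham", "football", "khel", "rahe", "hain"], steps)
```

Note that the word added by WA is placed at the position the script dictates; as discussed above, a real system must fix one of the legal adverb positions as its output format.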
The following example illustrates the difference between functional slots and functional tags.
Consider the sentence The old man is weak.. The subject of this sentence is the noun phrase The
old man. It consists of three functional tags, viz. @DN>, @AN> and @SUBJ, stating that the
is a determiner, old is an adjective, and man is the subject. But, as mentioned above, the entire
noun phrase plays the role of subject of the sentence. Thus the functional slot for this phrase is
<S>, i.e. the subject slot. Note that a particular functional slot may have a variable number of words.
The sequence of functional slots in a sentence provides the sentence pattern. The difference between
various tags (e.g. POS tag, functional tag) is explained in detail in Appendix B.
Role of operators
<> : For a functional slot or part of speech and its transformation, e.g. <S>, <V> etc.
&  : Both functional slots or parts of speech and their transformations
or : (description not recovered)
{} : (description not recovered)
[] : (description not recovered)
Functional Slot : Description
<LV> : Linking verb, e.g. seem etc.; the Hindi counterparts are hai, hain, ho, thaa, the etc.
<V> : Auxiliary verb (if any) and main verb of the sentence
<AuxV> : Auxiliary verb
<MainV> : Main verb
<S> : Subject
<O> : Object
<O1> : First object
<O2> : Second object
<SC> : Subjective complement
<PCP1 form> : -ing verb form other than the main verb
<PCP2 form> : -ed or -en verb forms other than the main verb
<to-infinitive> : to-infinitive verb form
<Adverb> : Adverb
<AdjP> : Adjective phrase
<PP> : Preposition phrase
<preposition> : Preposition
The following sections describe how many such operations are required in different cases. In particular we consider the following functional slots and sentence
kinds:
1. Tense and form of the verb. Since there are three tenses (viz. Present,
Past and Future) and four forms (Indefinite, Continuous, Perfect, and Perfect
Continuous), in all one can have 12 different verb structures, along with the
corresponding passive verb structures.
2. Subject/object functional slot. Variations in the subject/object functional slot
may happen in many different ways, such as Proper Noun, Common Noun
(Singular or Plural), Pronoun, PCP1 form and PCP2 form. We also study variation in pre-modifier adjectives, genitive case, quantifier and determiner tags.
3. Study of wh-family interrogative sentences.
4. Kind of sentence, i.e. whether the sentence is affirmative, negative, interrogative
or negative-interrogative.
Systematic study of these patterns and their components helps in estimating
the adaptation costs between them.
2.3
Hindi verb morphological variations depend on four aspects: the gender, number and
person of the subject, and the tense (and form) of the sentence. All these variations affect
example, respectively. The row and column headers are of the form gender, person
and number of the subject, where gender can be one of M or F, person can be one of
1, 2 or 3 specifying first, second or third person, and the number is either
S or P suggesting a singular or plural subject. Note that here the gender of the English
sentence subject is assigned according to Hindi grammar rules. The content of
the (i, j)th cell suggests the adaptation operations that need to be carried out when the
subject of the input sentence matches the specification of the j th column header,
and that of the retrieved example matches the specification of the ith row header.
2.3.1
Here the input sentence and the retrieved example both have the same tense and
form. Yet, verb morphological variations may occur in the translation depending
upon variations in the number, gender and person of the subject.
For illustration, we consider the case when both the input and the retrieved sentences have the main verb in the present indefinite form. Table 2.3 lists the adaptation
operations involved for verb morphological variations. In general, in this situation
the verb adaptation requires at most one suffix replacement and one morpho-word
replacement. Suffix replacement is confined to the set {taa, te, tii } (call it S1 ),
while the morpho-word replacement is associated with the set {hain, hai, ho, hoon}
(call it M1 ) (refer to Table A.3). Note that if the person, the number and the gender of
the subject in both input and retrieved sentences are the same, then only copy operations
will be performed.
We illustrate with an example how Table 2.3 is to be used for adaptation of verb
morphological variations. Suppose the input sentence is She eats rice., and
[Table 2.3: Adaptation operations for verb morphological variation (present indefinite to present indefinite). Rows (Retd) give the subject category of the retrieved example and columns (Input) that of the input sentence, in the order M1S, F1S, M1P, F1P, M2S, F2S, M3S, M3P, F3S, F3P. Each cell holds CP, SR, MR or SR+MR; diagonal cells (identical categories) are CP.]
the retrieved example is We eat rice. ham chaawal khaate hain. In the input
sentence the subject is 3rd person, feminine and singular, whereas in the retrieved
sentence the subject is 1st person, masculine and plural.
The cell (3, 9), i.e. corresponding to (M1P, F3S), suggests that two adaptation operations are required: suffix replacement (SR) and morpho-word replacement
(MR). The suffix te is replaced with tii in the main verb khaate as a suffix
replacement operation, and the morpho-word hain is replaced with hai in the retrieved Hindi sentence to get the Hindi translation of the input sentence. Although
the subject ham also needs to be replaced with wah to get the appropriate
Hindi translation of the input sentence, this is not considered in the discussion
on verbs. The translation of the input sentence is: wah chaawal khaatii hai .
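In an implementation, Table 2.3 naturally becomes a lookup keyed by the (retrieved, input) subject categories; only a few illustrative cells are filled in below, taken from the discussion above:

```python
# Sketch: a few cells of Table 2.3 as a rule lookup.  Keys are
# (retrieved-subject category, input-subject category); values are the
# verb adaptation operations for the present indefinite case.
TABLE_2_3 = {
    ("M1P", "F3S"): ["SR", "MR"],   # the worked example above
    ("M1S", "F1S"): ["SR"],         # gender change only: suffix swap
}

def verb_ops(retrieved_gnp, input_gnp):
    """Adaptation operations for the verb; identical categories need only CP."""
    if retrieved_gnp == input_gnp:
        return ["CP"]
    return TABLE_2_3.get((retrieved_gnp, input_gnp), [])
```

Because the table is symmetric in structure (though not always in content), each of the nine tense-form combinations discussed below can reuse the same lookup shape with its own cell values.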
Under this group nine combinations are possible, taking into account three
tenses and three forms for each of them. Adaptation rule tables for the other
eight possibilities have been developed in a similar way. Salient features of these
verb morpho variations are discussed below:
1. Past indefinite to past indefinite: Here the verb morpho variation is handled in
a way similar to the present indefinite case discussed just above. However, for
morpho-word replacement, the set to be considered is {thaa, the, thii } (call it
M2 ) instead of the set M1 .
2. Future indefinite to future indefinite: In this case either a copy operation or
a suffix replacement operation is used to handle the verb morphological variations. Accordingly, from Table 2.3 all the morpho-word replacement operations
have to be removed in order to handle the future indefinite case. Further, it
has to be taken into account that for the suffix replacement (SR) operations
40
The morpho-words and suffixes for the adaptation operations in all the above discussed
cases can be found in Table A.3.
In the case of present perfect, past indefinite and past perfect, sometimes there is a
case ending ne with the subject (see Section A.1). In that case, the verb morphology variation will change according to the gender and number of the object,
instead of the gender, number and person of the subject. For past indefinite to past
indefinite transformation, the adaptation operation will either be a copy operation or
a suffix replacement, whereas in the other two cases the adaptation operations can
be either a copy operation, or a suffix replacement and a morpho-word replacement. All
possible suffix variations and morpho-word variations are listed in Section A.2.
2.3.2
In this group the verb morphological variation depends on the gender, number and person of the subject, and also on the variation in the tenses of the input and the
retrieved example. This group comprises eighteen combinations of verb morphology variations. These 18 possibilities occur due to three different tenses (present,
past, future), and three verb forms (indefinite, continuous, perfect). Some members
Example 1 : Suppose the input sentence is She drinks water., and the retrieved sentence is She drank water. with the Hindi translation as wah paanii piitii thii . The
subjects of both the input and the retrieved example are feminine, 3rd person and
singular. In this situation, only one adaptation operation is required, i.e. morphoword replacement. The morpho-word thii is to be replaced with the morpho-word
hai to convey the sense of present indefinite form of the input sentence. Therefore,
the desired translation is wah paanii piitii hai .
Example 2 : Here the input sentence is She reads books, and the retrieved sentence
is He read books. with the Hindi translation as wah kitaabe padhtaa thaa. The
subject of the input is feminine, 3rd person and singular whereas in the retrieved
sentence the subject is masculine, 3rd person and singular. In this situation two
adaptation operations are required:
1. One suffix replacement: the suffix taa is to be replaced with tii ; and
2. One morpho-word replacement: the morpho-word thaa is replaced with
hai .
[Table 2.4: Adaptation operations for verb morphological variation across different tenses. Rows (Retd) and columns (Input) range over the subject categories M1S, F1S, M1P, F1P, M2S, F2S, M3S, M3P, F3S, F3P; each cell holds MR or SR+MR (diagonal cells are MR, since the tense change alone requires a morpho-word replacement).]
As an illustration, suppose the input sentence is I will eat rice., and the
retrieved sentence is I eat rice. with Hindi translation main chaawal khaataa
hoon. Here the suffix taa is replaced with the suffix oongaa, and the morpho-word hoon is deleted from the retrieved Hindi translation. Therefore, the
Hindi translation of the input sentence is main chaawal khaaoongaa.
2. Similarly, to adapt from future indefinite to present or past indefinite, one
suffix replacement and one morpho-word addition have to be carried out. The
suffix replacement is just the opposite of the one discussed above. The
morpho-word addition is also to be done in the same spirit. This becomes clear
from the example discussed above if the roles of the input and
retrieved example are reversed.
3. If the verb form is continuous or perfect then, regardless of the tense of the
sentence, the same Table 2.4 will work. This combination of verb form and tense
may occur in the input sentence or in the retrieved sentence. The only change
to be incorporated is that a morpho-word replacement (MR) has to be carried
out instead of the suffix replacement (SR) operation in the adaptation rule Table
2.4.
For these tenses and verb forms, the suffixes for the suffix replacement and the
morpho-words for the morpho-word replacement can be found in Table
A.3.
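Item 1 above (adapting present indefinite to future indefinite for a first-person masculine singular subject) can be sketched as one suffix replacement plus one morpho-word deletion; the function below is an illustrative assumption about how the two operations might be coded, using the main chaawal khaataa hoon example:

```python
# Sketch: present indefinite -> future indefinite for an M1S subject:
# SR (suffix taa -> oongaa on the main verb) and MD (drop 'hoon').
def to_future_indefinite(words, main_verb_index):
    words = list(words)                   # do not mutate the caller's list
    verb = words[main_verb_index]
    if verb.endswith("taa"):              # suffix replacement (SR)
        words[main_verb_index] = verb[:-len("taa")] + "oongaa"
    if words and words[-1] == "hoon":     # morpho-word deletion (MD)
        words.pop()
    return " ".join(words)

retrieved = ["main", "chaawal", "khaataa", "hoon"]
future = to_future_indefinite(retrieved, 2)
```

The reverse adaptation would apply the opposite suffix replacement and a morpho-word addition, exactly as item 2 above describes.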
2.3.3
Here the input sentence and the retrieved example have the same tense but
different verb forms. Here too, eighteen combinations of verb morphological
is She will have eaten rice. wah chaawal khaa chukii hogii . In order to adapt
the verb morphology, the morpho-words chukii and hogii are to be deleted, and the
suffix oongii is to be added to the verb khaa. And thus one gets the required
verb morphology.
If the roles of the input and the retrieved sentence are reversed in the above
cases, then, in place of suffix addition, suffix deletion has to be carried out. Further,
the two morpho-word deletions are to be replaced with corresponding morpho-word
additions.
Adaptation rules for dealing with other verb morphological variations belonging
to this group have been developed in a similar way. One may refer to Section A.2
of Appendix A to figure out the appropriate suffixes and morpho-words that will be
involved in the necessary addition/deletion/replacement operation.
2.3.4
The remaining thirty-six possibilities out of the total eighty-one combinations of verb
morphological variations belong to this group. Since it is not possible to discuss all
of them in this report, some typical ones are considered for the present discussion.
In particular, we discuss the case where the input sentence is in the present indefinite
form. For illustration, we consider retrieved examples of the following types: (i)
past continuous (ii) past perfect (iii) future continuous (iv) future perfect. It will
be shown that a single set of adaptation operations is sufficient for all the four cases
mentioned above. These adaptation operations are one suffix addition (SA), one
morpho-word replacement (MR) and one morpho-word deletion (MD). The purpose
of these three operations is as follows:
For illustration, suppose the input sentence is She eats rice., and the retrieved
example is one of the following:

(A) She was eating rice. wah chaawal khaa rahii thii
(B) She had eaten rice. wah chaawal khaa chukii thii
(C) She will be eating rice. wah chaawal khaa rahii hogii
(D) She will have eaten rice. wah chaawal khaa chukii hogii
Evidently, the sentences (A), (B), (C) and (D) are in past continuous, past
perfect, future continuous and future perfect form, respectively. The modifications
in the retrieved Hindi translations are as follows:
In the translations of all the retrieved examples, the suffix tii will be added to
khaa (the Hindi of eat).
The morpho-word rahii or chukii (depending upon the case) will be deleted.
The morpho-word thii will be replaced with hai if the retrieved example
is either (A) or (B).
The morpho-word hogii is replaced with hai in the Hindi translation of example (C) or (D).
Therefore, the required translation of the input sentence, obtained by incorporating all
these modifications in the respective translations of the retrieved examples, is wah chaawal
khaatii hai .
In a similar way one can identify that the same set of three adaptation operations
will be required if the input is in the past indefinite form, and the retrieved example is in one of
present continuous, present perfect, future continuous or future perfect. However,
in order to carry out the morpho-word replacement one has to confine the selection
to the set {thaa, the, thii }. It will replace the relevant morpho-word, which is one
2.4
The above discussion of adaptation procedures for verb morphological variation has
been limited to the active form of the verb. Similar adaptation procedures have also
been studied when the verb is in the passive form. Ideally, the passive form should
exist for all the three tenses and all the four verb forms. However, the passive forms
of verbs for the present perfect continuous, past perfect continuous, future continuous,
and future perfect continuous tenses are cumbersome, and are rarely used (Ansell,
2000). We, therefore, restrict our discussion to the other eight more commonly used
forms of the passive voice only. Since adaptation may take place from an active voice
sentence to a passive one, and vice versa, we classified these adaptation procedures
into three broad groups:
1. Passive verb form to passive verb form (8×8, i.e. 64 cases)
2. Passive verb form to active verb form (8×9, i.e. 72 cases)
3. Active verb form to passive verb form (9×8, i.e. 72 cases)
For each of the above mentioned three groups, we discuss a few cases in detail:
Passive verb form to passive verb form
If the input sentence is in the past indefinite passive verb form, and the retrieved
example is in the present continuous passive verb form or the past continuous passive verb
form, then a single set of adaptation operations is sufficient. These adaptation
operations are one morpho-word replacement, two morpho-word deletions and one
suffix replacement. The suffix replacement depends upon the particular Hindi verb
under consideration, and also upon the gender and number of the subject; so this
operation is required only in some examples of such cases. However, the other three
adaptation operations are mandatory. The purpose of these four operations is as
follows:
In the past indefinite passive verb form, one of the morpho-words
{gayaa, gayii, gaye} has to be used depending upon the number and gender
of the subject. However, if the retrieved example is in the continuous passive form
and hain are to be deleted, the morpho-word jaa is to be replaced with the
morpho-word gayii . Note that there is no change to the main verb khaayii .
Example 2 : Here we consider the same input sentence but the retrieved example is
The apple was being eaten by the squirrel. gilharii ke dwaaraa seb khaayaa jaa
rahaa thaa. Evidently, to generate the required translation gilharii ke dwaaraa
moongphalii khaayii gayii all the operations given in Example 1 are to be carried
out. Further, due to the change in the gender of the subject, the suffix yaa is
replaced with the suffix yii . To generate the final translation of the input sentence
one more adaptation operation is needed, the constituent word seb is replaced with
moongphalii , but that is not a part of the set of adaptation operations mentioned
above.
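The Example 2 adaptation above can be sketched as successive operations on the token list. This is only an illustrative Python sketch (the helper names are ours, not the thesis implementation):

```python
def replace_word(tokens, old, new):
    """Constituent-word or morpho-word replacement (WR / MR)."""
    return [new if t == old else t for t in tokens]

def delete_word(tokens, word):
    """Morpho-word deletion (MD)."""
    return [t for t in tokens if t != word]

def replace_suffix(tokens, stem, old, new):
    """Suffix replacement (SR) on the word stem+old."""
    return [stem + new if t == stem + old else t for t in tokens]

# retrieved Hindi: "gilharii ke dwaaraa seb khaayaa jaa rahaa thaa"
# target Hindi:    "gilharii ke dwaaraa moongphalii khaayii gayii"
retrieved = "gilharii ke dwaaraa seb khaayaa jaa rahaa thaa".split()
adapted = replace_word(retrieved, "jaa", "gayii")        # MR: jaa -> gayii
adapted = delete_word(adapted, "rahaa")                  # MD
adapted = delete_word(adapted, "thaa")                   # MD
adapted = replace_suffix(adapted, "khaa", "yaa", "yii")  # SR: yaa -> yii
adapted = replace_word(adapted, "seb", "moongphalii")    # WR (outside the verb-morphology set)
print(" ".join(adapted))  # gilharii ke dwaaraa moongphalii khaayii gayii
```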
Passive verb form to active verb form
Here too we illustrate the verb morphology adaptation with the help of a specific
case: the input sentence is in the present indefinite passive verb form, and the
retrieved sentence is in active verb form in the same tense and form. Here one can
identify that one suffix replacement, one morpho-word addition and one morpho-word replacement (depending upon the situation) are required to carry out the verb
morphology adaptation task. The significance of these three operations is as follows:
The suffix {taa, te, tii} in the main verb of the active retrieved sentence is
replaced with an appropriate suffix according to the rules of PCP form of verb
given in Section A.2 of Appendix A.
A morpho-word from the set {jaataa, jaatii, jaate}, whose elements are essentially declensions of the verb jaa, is to be added after the main verb.
Retrieved example
The Hindi translation of the input sentence is yah khaanaa siitaa ke dwaaraa
banaayaa jaataa hai . Evidently, to deal with the verb morphology in the generated
translation, two adaptation operations have to be performed. The suffix tii of the
main verb of the retrieved sentence is to be replaced with yaa, and the morpho-word jaataa is to be added after the main verb.
As the input sentence is the passive form of the retrieved sentence, ke dwaaraa
is added after the subject siitaa. This is necessary to generate the appropriate
translation of the input sentence but is not a part of the set of adaptation operations
mentioned above.
Active verb form to passive verb form
If the roles of the above mentioned input and the retrieved example are reversed,
one suffix replacement, one morpho-word replacement and one morpho-word deletion
will be required for adapting the verb morphology. One can easily figure out the
relevant sets of morpho-words and suffixes keeping in view the above discussions.
The adaptation rules for all other possible variations mentioned earlier have been
formulated in a similar way. However, the similar nature of the discussions prevents us
from describing all of them in detail.
2.5
Subject (<S>) and Object (<O>) functional slots can be sub-divided into a number
of functional tags. These tags act as pre-modifier and post-modifier of the subject
(@SUBJ) and/or object (@OBJ) functional tag. The maximum possible structure
of the <S> or <O> functional slot using different tags is:
Table 2.5: Different Functional Tags Under the Functional Slot <S> or <O>
Table 2.6: Functional tags and their morpho tag variations

@DN>: ART, DEM
@GN>:
@QN>:
@AN>: A ABS, A PCP1, A PCP2
@SUBJ, @OBJ or @<P:
@<NOM-OF: PREP
@<NOM: PREP
We explain Table 2.5 and Table 2.6 with an example. Consider the sentence This
old man is sitting in Ram's office. Its parsed version, obtained using the ENGCG
parser, is:
@DN> DEM this,
@AN> A ABS old,
@SUBJ N SG man,
@+FAUXV V PRES be,
@-FMAINV V PCP1 sit,
@ADVL PREP in,
@GN> <Proper> GEN SG Ram,
Here, the tags that start with @ are called functional tags, e.g. @DN> - determiner,
@GN> - genitive case, @AN> - pre-modifier adjective, etc. In Table 2.6 these tags
are succeeded by morpho tags, such as SG - singular, PERS - personal pronoun,
GEN - genitive. Appendix B provides more details on these tags.
In the following discussion, the adaptation rules for functional tags due to the
variation of morpho tags are given.
2.5.1
The morpho tags ART and DEM are associated with the functional tag @DN>
(see Table 2.6). The morpho tag ART is associated with the English words the,
an and a, and DEM is associated with this, these, that, etc. The word the
does not have any Hindi equivalent, hence it is absent in all Hindi translations.
Corresponding to the articles a and an, often no Hindi word is used in the translation.
However, in some cases the word ek (meaning one) is used, depending upon
the context. No morphological changes take place in the adaptation of these words.
Therefore, if @DN> ART is present in the parsed version of either the input
or the retrieved sentence, and it corresponds to the word the, then no
adaptation operation will be performed. With respect to determiners (words having
the DEM morpho tag, such as this (yah), these (ye), that (wah), etc.), the adaptation
procedure is straightforward.
For illustration, consider the input sentence This man is kind., and the retrieved
example The man is kind. aadmii dayaalu hai . Note that no Hindi word
exists in the retrieved Hindi sentence corresponding to the word the. But the input
sentence contains the determiner this. Therefore, its Hindi translation yah is
required to be added before the subject aadmii in the generated translation. Hence
the translation of the input sentence is yah aadmii dayaalu hai .
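The determiner adaptation above amounts to a single word addition before the subject; a minimal sketch (the helper name is ours):

```python
def add_determiner(tokens, subject, det):
    """Word addition: insert the determiner's Hindi translation before the subject."""
    i = tokens.index(subject)
    return tokens[:i] + [det] + tokens[i:]

retrieved = "aadmii dayaalu hai".split()             # The man is kind.
adapted = add_determiner(retrieved, "aadmii", "yah") # this -> yah
print(" ".join(adapted))  # yah aadmii dayaalu hai
```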
2.5.2
The functional tag @GN> is used for a genitive (i.e. possessive) case. Eight possible
morpho tag variations are listed in Table 2.6. These variations occur due to the
variations in gender, number and person of three different POS, which are N, PRON
and <Proper>. When the part of speech of the genitive word is N or <Proper>,
then the genitive case in Hindi is indicated with one of the case endings from the set
{kaa, ke, kii } as a morpho-word. Its usage depends upon the gender and number
of the noun following the word corresponding to the tag @GN>. When the genitive
word is a pronoun (PRON) the case endings are transformed into suffixes. The following
examples illustrate different genitive case structures in Hindi.
kaa is used when the noun following it is masculine singular. For example:
the washerman's son
- malii ke bete
- in aadmiyoon ke ghodhe
There are occasions when morphological changes occur to the genitive word (when it
is a noun) due to the case endings kaa, ke and kii . These rules are listed in
Appendix A. For example: the boy's horse - ladke kaa ghodha. Although the
Hindi for boy is ladkaa, its oblique form ladke has been used in the above
example. This happens because of the case ending kaa.
If the POS of the genitive word is a proper noun, then too, the same case endings {kaa,
ke, kii } are used as morpho-words according to the gender and number of the noun
following it. In this case no morphological changes occur in the genitive word due to the
case ending. For example: Parul's home - paarul kaa ghar , Ram's book - raam kii
kitaab, in Ram's home - raam ke ghar mein.
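The kaa/ke/kii selection described above can be written as a small lookup. A sketch, assuming a simple gender/number encoding of our own (the feminine rule follows the raam kii kitaab example):

```python
def genitive_morpho_word(noun_gender, noun_number):
    """Pick kaa/ke/kii from the gender and number of the noun that FOLLOWS
    the genitive word."""
    if noun_gender == "f":
        return "kii"                  # raam kii kitaab (kitaab is feminine)
    return "kaa" if noun_number == "sg" else "ke"  # paarul kaa ghar; malii ke bete

print(genitive_morpho_word("m", "sg"))  # kaa
print(genitive_morpho_word("m", "pl"))  # ke
print(genitive_morpho_word("f", "sg"))  # kii
```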
As mentioned above, when the POS of the genitive word is a pronoun the case
ending is attached to it in the form of a suffix. In the case of first and second person
pronouns the suffix comes from the set {aa, e, ii }. However, in the case of third person
pronouns the entire morpho-word is used as a suffix. The following examples illustrate
the genitive case with respect to pronouns.
my son - meraa betaa
my sons - mere bete
my daughter - merii betii
my daughters - merii betiyaan
his son - uskaa betaa
his sons - uske bete
his daughter - uskii betii
his daughters - uskii betiyaan
on my son
on our son
in my book
Once these structures are known, adaptation rules for different variations of
the genitive case may be formulated by referring to Table 2.6. Table 2.8 has been
designed to indicate the adaptation procedure for different genitive cases. The headers
of the rows and columns of this table correspond to three POS: <Proper>, N and
PRON.
Table 2.8: Adaptation operations for the genitive case (rows: POS in the retrieved example; columns: POS in the input)

Retd \ Input    <Proper>                N                                   PRON
<Proper>        (CP or ({WR}+{MR}))     WR+{MR}                             WR+MD+SA
N               WR+{MR}                 (CP or ({WR}+{(SR or SA or SD)}))   WR+MD+SA
PRON            WR+MA                   WR+MA+{SR or SA}                    (CP or SR or (WR+SA))
2.5.3
The functional tag @QN> is a quantifier tag. It is of two types: numeral (NUM) (e.g.
two - do, fourth - choothaa, one-third - ek-tihaaii , two-thirds - do-tihaaii ),
and quantitative (<Quant>) (e.g. some - kuchh, all - sab, many - bahut). More
details of this functional tag and its morpho tags are given in Appendix B. Seven
variations in total (see Table 2.6) are possible due to changes in number (SG,
PL, SG/PL) and numeral properties, i.e. ordinal, cardinal, etc. As far as
Hindi translation is concerned, however, these seven variations play no role.
Therefore, no suffix operations or morpho-word operations are relevant
in this case. Only a single word operation, i.e. deletion/addition/replacement/copy,
is required depending upon the tags in the input and the retrieved sentences.
For illustration, to adapt the translation of the retrieved example Two men
are coming here. do aadmii yahaan aa rahe hain to generate the translation of
the input sentence Some men are coming here., the adaptation procedure should
replace do (i.e. two) with kuchh (i.e. some). The Hindi translation of the
input sentence is, therefore, kuchh aadmii yahaan aa rahe hain.
Other cases may be dealt with in a similar way.
2.5.4
Adjectives fall into two classes, viz., uninflected and inflected (Kellogg and Bailey,
1965). Uninflected adjectives, as the term implies, remain unchanged before all
nouns, while inflected adjectives vary with the gender and number of the noun they
qualify. For example:

good boy - achchhaa ladkaa
honest boy - iimaandaar ladkaa
good girl - achchhii ladkii
honest girl - iimaandaar ladkii
good boys - achchhe ladke
honest boys - iimaandaar ladke
good girls - achchhii ladkiyaan
honest girls - iimaandaar ladkiyaan
Adjectives are of two types: basic adjectives, and participle forms, i.e. those that
are derived from verbs (Kachru, 1980). The inflection rules of these two types are
discussed below.
Basic adjectives: These adjectives are those which are adjectives themselves, such as
sundar (beautiful) and achchhaa (good). The ENGCG parser denotes them as ABS.
The rules of inflection for these adjectives are as follows.
1. If an adjective in Hindi ends with aa, then it changes into e for plural.
For example, buraa ladkaa (bad boy) and bure ladke (bad boys).
2. An adjective ending with aa changes into ii for feminine, e.g. burii
ladkii (bad girl) and burii ladkiyaan (bad girls).
3. If an adjective in Hindi ends with any other vowel, it does not change in any
case.
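The three inflection rules can be sketched directly (a toy illustration; the gender/number encoding is ours):

```python
def inflect_basic_adjective(adj, gender, number):
    """Rules 1-3 above: only adjectives ending in 'aa' inflect."""
    if not adj.endswith("aa"):
        return adj                 # rule 3: other endings never change
    if gender == "f":
        return adj[:-2] + "ii"     # rule 2: feminine -> ii (singular and plural)
    if number == "pl":
        return adj[:-2] + "e"      # rule 1: masculine plural -> e
    return adj                     # masculine singular keeps aa

print(inflect_basic_adjective("buraa", "m", "pl"))        # bure
print(inflect_basic_adjective("buraa", "f", "sg"))        # burii
print(inflect_basic_adjective("iimaandaar", "f", "pl"))   # iimaandaar (uninflected)
```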
1. In order to attain the A(PCP1) form of adjective, a suffix from the set {taa, te,
tii } is added to the root form of the verb. But in the case of the past participle form
A(PCP2) an appropriate suffix is attached according to the rules of the PCP
form of verb (see Section A.2).
2. Further, in most cases a morpho-word (from the set {huaa, huye, huii }) also
needs to be added after the modified verb.
3. Participle forms of adjectives are also inflected according to the gender and
number of the corresponding noun.
A dancing girl
Running horses
A broken chair
Rotten apples
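Rules 1 and 2 for the A(PCP1) form can be sketched as follows (an illustrative sketch; the agreement encoding is ours, and only the present-participle case is shown):

```python
def pcp1_adjective(verb_root, gender, number):
    """Present-participle (A(PCP1)) pre-modifier: root + taa/te/tii,
    followed by the matching form of huaa (rule 2)."""
    key = "f" if gender == "f" else ("m_pl" if number == "pl" else "m_sg")
    suffix = {"m_sg": "taa", "m_pl": "te", "f": "tii"}[key]
    morpho = {"m_sg": "huaa", "m_pl": "huye", "f": "huii"}[key]
    return verb_root + suffix + " " + morpho

print(pcp1_adjective("murjhaa", "m", "pl"))  # murjhaate huye (as in "fading flowers")
```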
Table 2.10: Adaptation operations for pre-modifier adjectives (rows: retrieved example; columns: input)

Retd \ Input   ABS                 A(PCP1)                          A(PCP2)
ABS            (CP or (WR+{SR}))
A(PCP1)        WR+{SR}+MD                                           ((SR+{SR}) or (WR+SR+{SR}))
A(PCP2)        WR+{SR}+MD                                           (CP or ({WR}+{2SR}))
The pre-modifier adjective sundar is replaced with murjhaa, and the suffix yaa
is added to that word. The morpho-word huaa will be added after this modified
verb in the retrieved Hindi sentence as the subject (phool ) is singular masculine.
Hence three adaptation operations (viz. one constituent word replacement, one
morpho-word addition and one suffix addition) are required for carrying out the
adaptation task. However, there may be situations when the suffix replacement may
have to be carried out in place of suffix addition.
As not is present in the input, its translation (in Hindi, nahiin) is
added to the retrieved Hindi sentence to generate the appropriate translation of the
input sentence. But this modification is not a part of the adaptation operations listed
in Table 2.10. Hence, the Hindi translation of the input sentence is murjhaayaa
huaa phool acchaa nahiin dikhtaa hai .
Example 2 : Another retrieved example is:
Fading flowers do not look good.
murjhate huye phool achchhe nahiin dikhte hain
Cell(2, 3), i.e. (A(PCP1), A(PCP2)) of the Table 2.10 lists the necessary adaptation
operations. The two possible sets of operations are ((SR+ {SR}) or (WR+ SR+
{SR})), i.e. one suffix operation is optional in both the sets.
In this example only two suffix replacements are required, i.e. the first set of operations. The suffix te is replaced with yaa in the pre-modifying adjective
murjhaate, and the suffix ye is replaced with aa in the morpho-word huye.
In some situations, the second suffix operation may not be needed if the gender and
person of the qualified nouns are the same in both the input and the retrieved example. If the input and the retrieved example have different verbs in the participle
2.5.5
The subject tag @SUBJ is the main and obligatory tag under the subject slot.
As listed in Table 2.6, nine possible morpho tag variations have been observed for
the subject functional tag. Within these nine possible morpho tags, there are in total
four parts of speech: noun (N), proper noun (<Proper>), pronoun (PRON) and
gerund (PCP1). The variations in these parts of speech may occur due to either a
case ending or number. In this respect the following may be observed.
The only case ending that may occur with respect to subject is ne. If the
POS of the subject is noun or pronoun then morphological changes may occur
due to this case ending. For example,
ladkaa + ne - ladke ne (boy)
More details of this case ending are given in Appendix A. It may be noted that
no morphological changes occur to the subject due to this case ending if the
POS of the subject is proper noun or PCP1.
Morphological changes may occur in nouns due to variations in number (singular or plural) also. For example,
Singular - Plural
boy - ladkaa; boys - ladke
house - ghar
cloth - kapadaa; clothes - kapade
girl - ladkii; girls - ladkiyaan
class - kakshaa; classes - kakshaayen
In the PCP1 form, a suffix naa is always added to the root form of the verb.
For example, Swimming is a good exercise. tairnaa achchaa wyaayaam
hai .
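The PCP1 (gerund) rule is a single suffix addition; as a one-line sketch:

```python
def gerund(verb_root):
    """PCP1 (gerund): the suffix 'naa' is always added to the verb root."""
    return verb_root + "naa"

print(gerund("tair"))  # tairnaa (swimming)
```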
Table 2.11: Adaptation operations for the subject (rows: POS in the retrieved example; columns: POS in the input)

Retd \ Input   N                                 <Proper>     PRON         PCP1
N              (CP or ({WR}+{SR or SA or SD}))   WR           WR           WR+SA
<Proper>       WR+{SR or SA or SD}               (CP or WR)   WR           WR+SA
PRON           WR+{SR or SA or SD}               WR           (CP or WR)   WR+SA
PCP1           WR+{SR or SA or SD}               WR           WR           (CP or WR)
Table 2.11 presents the relevant adaptation operations for the different
variations in the subject discussed above. The following examples illustrate some of
these rules.
Example 1 : Suppose the input sentence is The boy is playing., and the retrieved
example is Boys are playing. ladke khel rahe hain. Since the subject of the
input sentence is boy, to generate its Hindi translation ladkaa only the suffix e
is replaced with aa from the subject ladke in the retrieved Hindi sentence. This
is because the root word of the subject in both the input and the retrieved sentence
is the same, that is boy. However, if the root words of the subjects differ, then a
word replacement will definitely be needed. Additionally, some suffix replacement
or addition may be needed to take care of the number of the subject. For example,
to adapt boy to sister only one constituent word replacement is needed: ladkaa
is to be replaced with bahan. On the other hand, if boy is to be adapted to
sisters (i.e. plural form) then a word replacement (ladkaa → bahan) followed
by a suffix addition (bahan → bahanen) will be required.
Therefore, cell(1,1), i.e. (N, N), corresponds to the above discussed operation set,
i.e. (CP or ({WR} + {SR or SA or SD})).
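The boy-to-sisters adaptation just discussed, i.e. a word replacement followed by a suffix addition, can be sketched on the one-word subject (an illustration only):

```python
# WR then SA on the subject, as in the boy -> sisters example above.
subject = ["ladkaa"]                                              # boy
after_wr = ["bahan" if t == "ladkaa" else t for t in subject]     # WR: ladkaa -> bahan
after_sa = [t + "en" if t == "bahan" else t for t in after_wr]    # SA: bahan -> bahanen
print(after_sa)  # ['bahanen']  (sisters)
```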
Example 2 : Consider the input sentence He is a good man.. Let the corresponding retrieved example be Walking is a good exercise. sair karnaa ek achchhaa
vyaayaam hai . The subject of the input sentence is he, and its POS is pronoun
(PRON), while the subject of the retrieved example is walking, and its POS is the -ing verb form, i.e. gerund (PCP1). In this case, the adaptation operation mentioned
in cell(4, 3), i.e. (PCP1, PRON), is required for making the changes in the retrieved
translation. Here the adaptation operation is word replacement. The word
sair karnaa is to be replaced with wah.
For the functional tags @OBJ and @<P, the same adaptation rule table can be
used because the morpho variations for these functional tags are the same as those for
@SUBJ, as given in Table 2.6.
For the last two functional tags (@<NOM-OF, @<NOM) there is only one possible morpho tag variation, where the POS is preposition in both cases. The
functional tag @<NOM-OF corresponds to the preposition of, and its translation in Hindi is either kaa, ke or kii , based on the gender, number
and person of the word corresponding to the @<P tag. In the case of @<NOM there is
no particular postposition as in @<NOM-OF, so a fixed Hindi translation cannot be
mentioned, and the translation takes place according to the prepositions in the input
sentence.
2.6
This section discusses sentences that start with interrogative words, which are of
two types: interrogative pronoun (such as, who, what, whom, which, whose) and
interrogative adverbs (such as, when, where, how, why). This study has been done on
a selected set of representative sentences from the example base. This study focuses
on finding the usages of different interrogative words and corresponding translation
patterns. The major findings of this study are as follows.
The above findings are important from EBMT point of view because commonality
of the interrogative words may not lead to the most useful retrieval. In order to
retrieve the most similar translation example, one may have to look into sentences
involving some other interrogative words. Table 2.12 shows the examples and their
patterns. The interrogative sentence patterns are denoted as INi , i=1,2,...,26. These
examples have been taken from the example base. The patterns of the sentences are
decided from the parsed versions of various examples given by the ENGCG parser.
P. No.
IN1:
IN01:
IN2: <Adverb>)}?
Who has played tunes on guitar well? gitaar par dhun kisne achchhii bajaaii hai?
IN3: Who &<AuxV> &<S> &<MainV> {<Adverb>}?
IN4: Who &<AuxV> &<S> &<MainV> &<Preposition>?
IN5:
IN6: What &<AuxV> &<S> &<MainV>?
IN7: What &<N> &<SC/S>?
tum kyaa
IN8:
IN9:
IN10: Which
IN11: &<AuxV> &<S> &<MainV> &<Adverb>?
IN12: &<MainV> {<Adverb>}?
IN13:
IN14: or <SC>)?
kitaab hai?
IN15: Whose &<N> &<AuxV>
IN16: Whose &<N> &<V> &{(<PP>) or (<Adverb>) or (<PP> <Adverb>)}?
IN17:
IN18: <Preposition> &whom
IN19: Whom &<AuxV> &<S> &<MainV>?
IN20: Why &<LV> &<S> &(<Adverb> or <AdjP>)?
IN21: Why &<AuxV> &<S> &<MainV> {(<O>) or (<Adverb>) or (<O> & <Adverb>)}?
Why are you weeping? tum kyon ro rahe ho?
IN22: Where &<LV> &<S> &{<preposition> or <Adverb>}?
IN23: Where &<AuxV> &<S> &<MainV>?
IN24: When &<AuxV> &<S>
IN25: How &<LV> &<S> &{<Adverb>}?
How is she? wah kaisii hai?
How is he? wah kaisaa hai?
IN26: How &<AuxV> &<S>
Table 2.13: Interrogative words and their observed sentence patterns

Who: IN1, IN01, IN2, IN3, IN4
What: IN5, IN6, IN7, IN8
Which: IN9, IN10, IN11, IN12, IN13
Whose: IN14, IN15, IN16, IN17
Whom: IN18, IN19
Why: IN20, IN21 (@ADVL ADV WH)
Where: IN22, IN23 (@ADVL ADV WH)
When: IN24 (@ADVL ADV WH)
How: IN25, IN26 (@ADVL ADV WH)
Note that Table 2.12 by no means provides an exhaustive list of English sentence
patterns involving interrogative words. However, these are the sentence patterns that
are predominantly present in our example base. By examining the structures of
the corresponding Hindi sentences one can easily see that it is the role of the word
concerned that is most important in determining the Hindi sentence structure. One
may easily find that an interrogative word may have more than one functional tag, and
hence different translations in Hindi, which certainly implies different translation
structures. These variations corresponding to each interrogative word are explained
below:
Variation in Translation of who: Table 2.13 shows four different functional tags
for this word. The observed translation patterns for these are as follows:
Variation in translation of the word what:
1. In translation patterns IN5 and IN6 the interrogative word what is used as
subject (@SUBJ) and object (@OBJ), respectively. In both the cases what is
translated as kyaa.
2. In the case of sentence patterns IN7 and IN8 the word what is used as a determiner and its functional tag is @DN>. However, due to the variations in the
overall sentence patterns in these two cases, different translations for the
word what have been observed. In both the cases, the Hindi translation is
of the form kaun followed by one of {saa, se, sii } depending upon the number
and gender of the noun following the word what. However, in both the cases
of IN7 and IN8 one more translation of what has been observed, i.e. kis or
kin according to the number of the noun following the word what. Further, the morpho-word ki is added after the noun in the case of the IN7 sentence
pattern, while in the case of IN8 the morpho-word ko is added after the noun.
Variation in translation of the word which: As shown in Table 2.12, five different
sentence patterns, viz. IN9 to IN13 , have been observed corresponding to the word
which. In all these cases, although the functional tag for the word which varies,
its translation to Hindi is done in the same way, using the word kaun followed by
one of the morpho-words from the set {saa, sii, se} depending upon the number and
gender of the noun following the word which.
However, in both the cases IN12 and IN13 , one more translation of which has
been observed, i.e. kis or kin according to the number of the noun following the
word which. Further, the morpho-word ko is added after the noun.
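The kaun + saa/se/sii selection for what and which can be sketched as a lookup; the exact agreement mapping below is our assumption, based on the gender/number rule stated above:

```python
def kaun_phrase(noun_gender, noun_number):
    """'what'/'which' as determiner: kaun + saa/se/sii, agreeing with the
    noun that follows (the agreement encoding is ours)."""
    if noun_gender == "f":
        return "kaun sii"
    return "kaun saa" if noun_number == "sg" else "kaun se"

print(kaun_phrase("m", "sg"))  # kaun saa
print(kaun_phrase("f", "sg"))  # kaun sii
```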
Variation in translation of the word whose: Although four different sentence patterns (i.e. IN14 , IN15 , IN16 and IN17 ) have been observed for English sentences
involving the word whose, in all of them the functional tag of this word has been
found to be @GN>. Consequently, its translation into Hindi is also found to be the
same, i.e. one from the set {kiskaa, kiske, kiskii, kinkii, kinkaa, kinke}. The actual
usage depends upon the gender and number of the noun following the word whose.
Variation in translation of the word whom: Two possibilities have been observed
in this case:
1. @<P: Under this functional tag the word whom is used as a complement of
the preposition, as in sentence pattern IN18 . In this case the Hindi translation
of this word is kis.
Variation in translation of interrogative adverb words: Under this case four words
have been studied: why, where, when and how. Their Hindi translations
are as follows:
Irrespective of the sentence patterns (i.e. IN20 , IN21 , IN22 , IN23 and IN24 ) the
first three of the above four interrogative adverbs have unique translations in
Hindi. The Hindi translation of why is kyon, that of where is kahaan,
while when is translated as kab.
In both the sentence patterns IN25 and IN26 , the translation of the word how
into Hindi is one from the set {kaisaa, kaisii, kaise}. This variation in the
translation is governed by the gender and number of the subject of the underlying sentence pattern.
The above study of interrogative words suggests that sentences having different
interrogative words may have the same translation patterns. The following examples
illustrate this point.
(A)
(B)
(C)
Suppose one has to generate the translation of the English sentence How are you
going today?. Its Hindi translation is tum aaj kaise jaa rahe ho? . Obviously, this
translation can be generated easily if one of the above three examples is considered
as a retrieved sentence.
Based on the above observations we cluster the above sentence patterns into
several groups as given below.
Retd \ Input   IN8          IN12         IN15         IN16
IN8            (CP or WR)   (CP or WR)   WR+SA+WD     WR+SA+WD
IN12           (CP or WR)   (CP or WR)   WR+SA+WD     WR+SA+WD
IN15           WR+WA        WR+WA        (CP or SR)   (CP or SR)
IN16           WR+WA        WR+WA        (CP or SR)   (CP or SR)
2.7
One may notice that in Hindi the negative and interrogative structures are obtained by the addition of the words nahiin and kyaa, respectively. Also note that
the position of kyaa is always at the beginning of the sentence, hence its addition or deletion needs no traversal of the sentence. Typically, nahiin
occurs before the main verb of the Hindi sentence. However, since Hindi has a relatively
free word order, it may occur at some other position also. The adaptation operations are,
therefore, as follows:
Table 2.15 gives the required operations for all types of variation in the kind of
sentences. The expressions are obtained by deciding upon which of the words are
being added and/or deleted for the adaptation.
Retd \ Input   AFF      NEG      INT      NINT
AFF            CP       WA       MA       WA+MA
NEG            WD       CP       MA+WD    MA
INT            MD       WA+MD    CP       WA
NINT           WD+MD    MD       WD       CP
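The operations of Table 2.15 reduce to adding or deleting nahiin and kyaa; a toy sketch (the function, its kind-encoding and the example sentence are ours):

```python
# (has nahiin, has kyaa) for each kind of sentence.
NEEDS = {"AFF": (False, False), "NEG": (True, False),
         "INT": (False, True), "NINT": (True, True)}

def adapt_kind(tokens, retrieved_kind, input_kind, main_verb):
    """Adapt the retrieved Hindi sentence's kind to the input's kind by
    adding/deleting kyaa (sentence-initial) and nahiin (before the main verb)."""
    has_neg, has_int = NEEDS[retrieved_kind]
    want_neg, want_int = NEEDS[input_kind]
    out = list(tokens)
    if has_int and not want_int:
        out.remove("kyaa")                         # MD
    if want_int and not has_int:
        out.insert(0, "kyaa")                      # MA: always sentence-initial
    if has_neg and not want_neg:
        out.remove("nahiin")                       # WD
    if want_neg and not has_neg:
        out.insert(out.index(main_verb), "nahiin") # WA: before the main verb
    return out

# AFF -> NINT on an invented sentence:
print(" ".join(adapt_kind("wah jaataa hai".split(), "AFF", "NINT", "jaataa")))
# kyaa wah nahiin jaataa hai
```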
2.8
Concluding Remarks
In this chapter we have described different adaptation operations that may be used
for adapting a retrieved Hindi translation example to generate the translation of a
given input. The novelty of the scheme is that not only does it work at the word
level, it deals with suffixes as well. The advantage of the above scheme is that,
since the number of suffixes is very limited, it reduces the overall cost of adaptation.
Chapter 5 discusses how the cost for each of the operations is evaluated.
This chapter looks into the process of adaptation itself. The adaptation operations described in this chapter are to be used in succession in order to generate the
required translation. The overall adaptation scheme will first have to look into the
discrepancies in the input sentence and the retrieved example. Discrepancies may
occur in different functional slots of the sentences, and also in the kind of sentences.
Once the discrepancies are identified, appropriate adaptation operations have to
be applied to remove them. Thus successive applications of these operations will
generate the required translation in an incremental way.
In this chapter we have considered variations in the different tense and verb forms
both in active and passive, variations in subject/object functional slot, variation in
wh-family word (e.g. what, who, where, when) and their sentence patterns.
We have also worked on modal verbs (e.g. should, might, can, could, may),
and their respective sentence patterns. However, due to the similar nature of the discussion
we do not elaborate on them in this report.
Of the different sentence kinds, we have discussed four (viz. affirmative, negative,
interrogative and negative interrogative) in this chapter. Evidently one may find
many other kinds of sentences (e.g. Imperative, Exclamatory). We have not dealt
with them in this work, however, we feel that they can be treated in similar fashion.
With respect to each of the variations we have identified the minimum number
of operations that are required for the overall adaptation of the retrieved example.
We presented these required operations in the form of various tables. The advantage of these tables is that they can be used as yardsticks for measuring the total
adaptation cost, which in turn may be used as a measurement of similarity between
an input sentence and the sentences of the example base. These issues are discussed
in Chapter 5.
The above-mentioned scheme of adaptation works well under the implicit assumption that translations of similar source language sentences are similar in the
target language as well. However, in reality one may find examples when the above
assumption does not hold good. For example, consider the two English sentences
It is running. and It is raining.. Although these two sentences are structurally
very similar, their Hindi translations are structurally very different. The first sentence is translated as wah (it) bhaag (run) rahaa (..ing) hai (is). But the second
one is translated as baarish (rain) ho (be) rahii (..ing) hai (is). Hence in order
to translate the first sentence if the second one is retrieved from the example base,
then the translation generated through the above-mentioned adaptation procedure
will not be able to produce the correct translation of the said input. Such instances
are primarily due to some inherent characteristics of the source and target languages,
which are termed translation divergences (Dorr, 1993). The existence of translation divergences makes the straightforward transfer from source structures into
target structures difficult. Study of adaptation therefore needs a careful study of
divergence as well. The following chapter discusses divergences in English to Hindi
translation in detail.
Chapter 3
An FT and SPAC Based
Divergence Identification
Technique From Example Base
3.1
Introduction
(A) : She is in shock.
wah (she) sadme (shock) mein (in) hai (is)

(B) : She is in trouble.
wah (she) pareshaanii (trouble) mein (in) hai (is)

(C) : She is in panic.
wah (she) ghabraa (panic) rahii (..ing) hai (is)
Items (A) and (B) above are examples of the normal translation pattern. The prepositional phrases (PP) of the English sentences are realized as PPs in Hindi, and the fact that
the prepositions occur after the corresponding noun is in accordance with Hindi
syntax. However, in example (C) one may notice a huge structural variation. Here,
the sense of the prepositional phrase in panic is realized by the verb ghabraa rahii
hai (is panicking). Hence this is an instance of a translation divergence.
Assuming that the English sentence in (A) is given as the input to an English to
Hindi EBMT system, two scenarios may be considered:
1. The retrieved example is (B) i.e. She is in trouble. In this case, the correct
Hindi translation may be generated in a straightforward way by using word
Thus the output of the system will depend entirely on the sentence ((B) or (C))
which will be retrieved to generate the translation of the input (A). Given a very
similar structure of the three sentences, the retrieval may eventually depend on the
semantic similarity of the prepositional phrase (PP) of the input with the PPs of the
stored examples. With respect to the above illustration, this implies that similarity
between the sentences may be measured by the semantic similarity between shock
and trouble in case (1), and the semantic similarity between shock and panic
in case (2). Table 3.1 gives these similarity values under different schemes given in the
WordNet::Similarity web interface (http://www.d.umn.edu/mich0212/cgi-bin/
similarity/similarity.cgi), considering the words as nouns, and taking their sense
number 1 as given by WordNet 2.0.
Table 3.1: Similarity between shock & trouble and between shock & panic under different measures

Similarity measure     shock & trouble     shock & panic
Lin                    0.2989              0.5172
Leacock & Chodorow     1.3863              1.6376
Resnik                 2.734               5.2654
                       0.078               0.1017
                       0.3333              0.5
Path lengths           0.1111              0.1429
Adapted Lesk           1                   2
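The path-lengths measure used above can be sketched as the inverse of the shortest path length between two senses in a hypernym graph. The mini graph below is invented purely for illustration and is not real WordNet data:

```python
from collections import deque

# Hypothetical mini hypernym graph: each sense maps to its hypernym(s).
EDGES = {
    "shock": ["fear"], "panic": ["fear"], "fear": ["state"],
    "trouble": ["difficulty"], "difficulty": ["state"],
}

def path_similarity(a, b):
    """1 / (number of nodes on the shortest path between the two senses),
    the idea behind the 'path lengths' row of Table 3.1."""
    graph = {}
    for child, parents in EDGES.items():
        for p in parents:
            graph.setdefault(child, set()).add(p)
            graph.setdefault(p, set()).add(child)
    seen, queue = {a}, deque([(a, 1)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0

print(path_similarity("shock", "panic"))    # 1/3: shock - fear - panic
print(path_similarity("shock", "trouble"))  # 1/5: shock - fear - state - difficulty - trouble
```

In this toy graph, as in the table, shock comes out more similar to panic than to trouble, which is exactly why the wrong example may be retrieved.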
3.2
Various approaches have been pursued in dealing with translation divergence. These
may be classified into four categories:
1. Transfer approach. Here transfer rules are used for transforming a source
language (SL) sentence into target language (TL) by performing lexical and
structural manipulations. These rules may be formed in several ways: by
manual encoding (Han et al., 2000), by analysis of parsed aligned bilingual
corpora (Watanabe et al., 2000), etc.
2. Interlingua approach. Here, identification and resolution of divergence are
based on two mappings, the Generalized Linking Routine (GLR) and the
Canonical Syntactic Realization (CSR), and a set of Lexical Conceptual Structure
(LCS) parameters. In general, translation divergence occurs when there is an
exception either to the GLR or to the CSR (or to both) in one language but
not in the other. This premise allows one to formally define a classification
of all possible lexical-semantic divergences that could arise during translation.
This approach has been pursued in the UNITRAN (Dorr, 1993) system that
deals with translation from English to Spanish and English to German.
3. Generation-Heavy Machine Translation (GHMT) approach. This scheme works
in two steps. In the first step, rich target language resources, such as word lexical semantics, categorical variations and sub-categorization frames, are used
for generating multiple structural variations from a target-glossed syntactic dependency representation of SL sentences. This is the symbolic overgeneration
step. This step is constrained by a statistical TL model that accounts for
possible translation divergences. Finally, a statistical extractor is used for
extracting a preferred sentence from the word lattice of possibilities. Evidently, this scheme bypasses explicit identification of divergence, and generates
translations (which may include divergence sentences) otherwise. MATADOR
(Habash, 2003), a system for translation between Spanish and English uses
this approach.
4. Universal Networking Language (UNL) based approach. UNL has been developed …
Although NLP involving Indian languages has been enthusiastically pursued, and has been sponsored by the government and several educational institutes over the last few years (http://tdil.mit.gov.in/tdilsept2001.pdf), it will take some time before various linguistic resources become easily available. This motivates us to develop a simpler algorithm that requires as few linguistic resources as possible. The usefulness of such techniques will be twofold:
The proposed approach uses only the functional tags (FT) and the syntactic
phrasal annotated chunk (SPAC) structures of the source language (SL) and target
language (TL) sentences for identification of divergence in a translation example. A
translation divergence occurs when some particular FT upon translation is realized
with the help of some other FT in the target language. Hence the occurrence of divergence may be identified by comparing the roles of different constituent words in the source and target language sentences. The proposed approach therefore aims at designing an algorithm that uses as few linguistic resources as possible.
The most fundamental work before developing any such algorithm is to determine
the different types of divergence that may be found in English to Hindi translation.
Since divergence is a language-dependent phenomenon, it is not expected that the same set of divergences will occur across all languages. In this respect one may refer to
(Dorr, 1993) which provides the most detailed categorization of Lexical-semantic
divergences for translation among the European languages. There divergence has been classified into seven broad categories.
1. A light verb construction involves a single verb in one language being translated
using a combination of a semantically light verb and another meaning unit
(a noun, generally) to convey the appropriate meaning. In English to Hindi
(and perhaps in many other Indian languages) context such happenings are
very common. Hence this is not considered as a divergence for English to Hindi
translation. Later, this point will be discussed in detail under the conflational
divergence.
2. Head swapping essentially combines both promotional and demotional divergences under one heading.
3. Lexical divergence, which is a mixture of more than one divergence, has not
been considered.
4. All other divergence categories remain as they are under the new scheme.
These observations are based on the analysis of translation examples obtained from different bilingual sources (such as storybooks, translation books and recipe books). This analysis suggests that English to Hindi translation divergence is in many cases somewhat different in its characteristics, and therefore needs to be
redefined. In the following subsections we describe the various types of divergence
that may be found in the context of English to Hindi translation, and their subtypes. We also discuss the algorithm to identify each type of divergence, and its
characteristics in more detail.
It may be noted that Dave et al. (2002) also studied English to Hindi divergence
in detail. However, they have restricted their discussions to the above-mentioned
seven categories only. Our studies of English to Hindi translation divergences reveal
the following:
1. Not all of the above-mentioned seven categories apply to English to Hindi translation.
2. Instances of thematic and promotional divergence have not been found in English to Hindi translation.
3. Structural divergence, in the English to Hindi context, occurs in the same way
as in European languages.
4. Some variations from the definitions given in (Dorr, 1993) may be noticed in
the occurrence of categorial, conflational, demotional divergences.
5. Three new types of divergence may be found with respect to English to Hindi
translation. These are named as nominal, pronominal and possessional.
6. Most of the divergence types may be further subdivided into several sub-types.
The functional tags (FT) considered are: subject (S), object (O), verb (V), subjective complement (SC), adjectival complement by preposition (SC C), subjective predicative adjunct (PA), verb complement (VC) and adjunct (A).
Categories in the SPAC structure are as follows:
POS tags: noun (N), adjective (Adj), verb (V), auxiliary verb (AuxV), preposition (P), adverb (Adv), determiner (DT), personal pronoun (PRP), possessive case of personal pronoun (PRP$) and cardinal number (CD).
Phrases: N, Adj, V, ADV and P are called the lexical heads of the phrases.
For each category a suffix P is used to denote a phrase.
In Appendix B and Appendix C, definitions of these FTs and SPAC are discussed
in detail. With this background we proceed to define the divergence types/sub-types
and their identification scheme.
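For concreteness, a SPAC can be encoded as a nested structure. The following is one hypothetical Python encoding (not the thesis's own data format), using the object SPACs from the structural-divergence example of Section 3.3.1:

```python
# Hypothetical encoding of a SPAC as a nested tuple:
# a phrase is (LABEL, child, ...), a leaf is (word, POS).

# SPAC of the object "Steffi" in the English sentence: [NP [Steffi/N]]
SPAC_E_OBJ = ("NP", ("Steffi", "N"))
# SPAC of the object "steffi se" in Hindi: [PP [NP [steffi/N]] [se/P]]
SPAC_H_OBJ = ("PP", ("NP", ("steffi", "N")), ("se", "P"))

def phrase_label(spac):
    """Root phrase category of a SPAC (e.g. 'NP' or 'PP')."""
    return spac[0]
```

Comparing root phrase labels of corresponding FTs is exactly the kind of check the identification algorithms below rely on.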
3.3
We order the different divergence types on the basis of the FTs of the source language
sentence with which they are concerned. Accordingly, we observe the following:
In the following subsections we provide the different divergence types and their
identification schemes. In the description of all the algorithms the following convention for representation will be followed:
a) The input to the algorithms will be an English sentence and its Hindi translation.
These two will be denoted as E and H, respectively.
b) All these identification algorithms will return 0 if a particular divergence is absent in the sentence pair. Otherwise they return a value n, indicating that the divergence is present with sub-type n.
3.3.1
Structural Divergence
A structural divergence is said to have occurred if the object of the English sentence
is realized as a noun phrase (NP) but upon translation in Hindi is realized as a
prepositional phrase (PP). The following examples illustrate this. One may note
that different Hindi prepositions (e.g. se, par, ko, kaa) have been used in different
contexts leading to structural divergence.
Analysis of various translation examples reveals the following points with respect
to structural divergence, which we use to design the algorithm for identification of
structural divergence:
If the main verb of an English sentence is a declension of be verb, then the
structural divergence cannot occur.
Structural divergence deals with the objects of both the English sentence and
its Hindi sentence. Therefore, if any one of the two sentences has no objects
then structural divergence cannot occur.
If both the sentences have objects, and their SPAC structures are same then
also structural divergence does not occur.
In other situations structural divergence may occur only if the SPAC of the
object of the English sentence is an NP, and the SPAC of the object of the
Hindi sentence is a PP.
The algorithm for identification of structural divergence has been designed to
take care of the above conditions. Figure 3.1 gives the corresponding algorithm. For
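The four conditions above can be sketched with flat features standing in for the FT/SPAC comparison. The dict representation and key names are hypothetical simplifications; the actual algorithm of Figure 3.1 works on full parse structures.

```python
def identify_structural(e, h):
    """Return 1 if structural divergence is detected, else 0.
    e, h: dicts with 'main_verb_root', 'object' (None if absent) and
    'object_spac' (root phrase label of the object's SPAC).
    Hypothetical flat representation, for illustration only."""
    if e["main_verb_root"] == "be":
        return 0                                  # be-verb rules it out
    if e["object"] is None or h["object"] is None:
        return 0                                  # both objects required
    if e["object_spac"] == h["object_spac"]:
        return 0                                  # same SPACs: no divergence
    if e["object_spac"] == "NP" and h["object_spac"] == "PP":
        return 1                                  # NP realized as PP
    return 0

# The marry/vivaah example from the illustration below:
e = {"main_verb_root": "marry", "object": "Steffi", "object_spac": "NP"}
h = {"main_verb_root": "kar", "object": "steffi se", "object_spac": "PP"}
```

On this pair the sketch returns 1, agreeing with the worked illustration that follows.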
Illustration
Consider for illustration the following sentence pair:
E: Andre will marry Steffi.
H: andre (andre) steffi (steffi) se (from) vivaah karegaa (will marry)
The SPACs of these two sentences and their correspondences are given in Figure 3.2. Here bold arrows represent correspondence, and dotted lines indicate no
correspondence. Note that, the objects of E and H are not null; in E the object
is Steffi, whereas in H the object is steffi se. But their SPACs are [NP [Steffi/N]] and [PP [NP [steffi/N]] [se/P]], respectively, which are not equal. Therefore,
structural divergence is identified.
3.3.2
Categorial Divergence
… sher (lion) se dartaa hai (fears)
The adjective of the English sentence afraid is realized in Hindi by the verb darnaa meaning to fear, and dartaa hai is its conjugation for the present indefinite tense.
Here the focus is on the word user which is a noun, and has been used as
an SC in the above English sentence. This provides the main verb istemaal
karnaa (meaning to use) of the Hindi sentence. Its conjugation for present
indefinite tense is istemaal kartaa hai, when the subject is third person, singular and masculine. The adjective regular of the noun user is realized as
the adverb baraabar .
3. Categorial sub-type 3 : In the event of this divergence an adverbial PA of an
English sentence is realized as the main verb of the Hindi sentence. Consider
for illustration the following translation:
E: The fan is on.
H: paankhaa (fan) chal (move) rahaa (-ing) hai (is)
The main verb of the Hindi sentence is chalnaa i.e to move. Its sense
comes from the adverbial PA on of the English sentence. The present continuous form of this verb is chal rahaa hai , when the subject is third person,
singular and masculine. It may be noted that in Hindi grammar neuter gender
does not exist. Inanimate objects are treated as masculine or feminine, and
this categorization follows some systematic rules but occasionally with some
exception (See Appendix A).
4. Categorial sub-type 4: This sub-type concerns predicative adjuncts that are realized in English as a PP, but are realized in Hindi as the main verb. For
example, one may consider the following pair:
E: The train is in motion.
H: railgaadii (train) chal rahii hai (is moving)
The identification algorithm has been designed taking care of the above observations.
The algorithm returns 0 if the translation does not involve any categorial divergence.
Otherwise, depending upon the case it returns 1, 2, 3, or 4. Figure 3.3 provides the
schematic view of the proposed algorithm.
Illustration
Let E be the sentence She is in tears. Its Hindi translation H is wah (she) ro rahii hai (is crying). As the sentences are parsed and their SPACs are obtained, the algorithm proceeds as follows.

In Step 1, it finds that the root form of the main verb of the English sentence is be, hence it proceeds to Step 2. In Step 2 the root form of the main verb of the Hindi sentence is determined. In this case it is ronaa (i.e. to cry), which is not the be verb. The algorithm, therefore, proceeds to Step 3, where it detects that the Hindi sentence does not have a PA. Thus this is a case of categorial divergence.
The algorithm now checks the SPAC of the PA in tears which is a prepositional
phrase comprising a preposition and a noun. The algorithm, therefore, detects
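The checks and the sub-type mapping described for categorial divergence can be sketched with flat features. This is a hypothetical simplification of Figure 3.3; the keys sc_pos and pa_spac are illustrative, not the thesis's notation.

```python
def identify_categorial(e, h):
    """Return 0 (no categorial divergence) or the sub-type 1-4.
    e: 'main_verb_root', 'sc_pos' (POS of the SC, or None),
       'pa_spac' ('AdvP'/'PP' for a predicative adjunct, or None);
    h: 'main_verb_root', 'has_pa'. Hypothetical flat features."""
    if e["main_verb_root"] != "be":
        return 0                 # the English side starts from a be-verb
    if h["main_verb_root"] == "be":
        return 0                 # Hindi main verb must carry new content
    if h["has_pa"]:
        return 0                 # H keeps a PA: no categorial divergence
    if e.get("sc_pos") == "Adj":
        return 1                 # adjective SC -> Hindi main verb
    if e.get("sc_pos") == "N":
        return 2                 # noun SC -> Hindi main verb
    if e.get("pa_spac") == "AdvP":
        return 3                 # adverbial PA -> Hindi main verb
    if e.get("pa_spac") == "PP":
        return 4                 # PP PA -> Hindi main verb
    return 0
```

For "The fan is on" / "paankhaa chal rahaa hai" the sketch returns 3, and for "She is in tears" / "wah ro rahii hai" it returns 4, matching the sub-types discussed above.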
3.3.3
Nominal Divergence
Nominal divergence is concerned with the subject of the English sentence. In the
event of nominal divergence, upon translation the subject of the English sentence
becomes the object or verb complement. In this respect this divergence is somewhat
similar to the thematic divergence as defined in (Dorr, 1993). However, in case of
thematic divergence the object of the source language sentence becomes the subject
upon translation, whereas, in case of nominal divergence the subject of the Hindi
translation is derived from the adjectival complement of the English sentence. Thus,
characteristically nominal divergence differs from thematic divergence.
The subject of the English sentence is realized in Hindi with the help of a prepositional phrase. In particular, with respect to nominal divergence use of two prepositions: ko and se can be observed, which are typically used for an object or
ablative case, respectively (Kachru, 1980). Hence the latter one is called a verb complement.
1. Nominal sub-type 1: Here the subject of the English sentence becomes object
upon translation. For illustration the following example may be considered:
E: Ram is feeling hungry.
H: ram ko (to Ram) bhukh (hunger) lag rahii hai
Here, the adjective hungry is an SC. Its sense is realized in Hindi by the
word bhukh, meaning hunger that acts as the subject of the Hindi sentence.
The subject Ram of the English sentence becomes the object ram ko of
the Hindi translation. Sometimes such an object is also termed a dative subject (Kachru, 1980). However, because of the use of the postposition ko we feel that calling it the object of the sentence is more appropriate.
2. Nominal sub-type 2: In this case the subject of the English sentence provides
a verb complement (VC) in the Hindi translation. The following example
illustrates this point.
E: This gutter smells foul.
H: iss naale se (from this gutter) badboo (foul smell) aatii hai (comes)
Note that, the subject of the English sentence This gutter is realized as the
modifier iss naale se of the verb aatii hai.
1. Nominal divergence cannot occur if the main verb of the English sentence is a declension of the be verb. This is because in that case the English sentence does not have an SC, which is essential for a nominal divergence to occur.
2. Otherwise, even if the root word of the main verb of the English sentence is
not be, nominal divergence cannot occur if the English sentence does not
have an SC.
3. Otherwise, if the SC is null, and the object is not null in H, then it is an instance of nominal divergence of sub-type 1. In place of the object, if a verb complement (VC) is present in H then it is nominal divergence of sub-type 2.
The algorithm has been designed by taking care of the above observations. Figure
3.5 provides a schematic view of the proposed algorithm.
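The three observations can be sketched as follows, with hypothetical flat features in place of the full FT structures used by the algorithm of Figure 3.5:

```python
def identify_nominal(e, h):
    """Return 0, 1 (E subject -> H object) or 2 (E subject -> H verb
    complement), following observations 1-3 above. The dict keys are
    illustrative assumptions, not the thesis's notation."""
    if e["main_verb_root"] == "be":
        return 0                          # observation 1
    if e["sc"] is None:
        return 0                          # observation 2: an SC is essential
    if h["sc"] is None and h["object"] is not None:
        return 1                          # sub-type 1
    if h["sc"] is None and h["vc"] is not None:
        return 2                          # sub-type 2
    return 0
```

The "Ram is feeling hungry" pair comes out as sub-type 1 and the "This gutter smells foul" pair as sub-type 2 under this sketch.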
Illustration
Let E be the sentence I am feeling sleepy, and H be its translation mujhe (to me) niind (sleep) aa rahii hai (is coming). The root form of the main verb of E is not be.
3.3.4
Pronominal Divergence
(a) subaha ho gayii hai (it is morning)
(b) andherii raat thii (it was a dark night)
In example (a) the word morning, a noun, acts as an SC. Upon translation
it provides the subject subaha of the Hindi sentence. In example (b) the SC
is still a noun but it is preceded by an adjective. Upon translation the whole
noun phrase andherii raat becomes the subject of the corresponding Hindi
sentence.
2. Pronominal sub-type 2: In this case, the adjectival complement of the subject it becomes the subject of the Hindi translation. For illustration:
E: It is very humid today.
H: aaj (today) bahut (very) umas (humidity) hai (is)
Illustration
Let the English sentence (E) be It is raining, and its Hindi translation (H) barsaat ho rahii hai. The syntactic phrase annotated chunk (SPAC) structures of the example pair, and their correspondences, are given in Figure 3.8. Here the subject of E is it and the subject of H is barsaat, not yah or wah; so the condition of step 2 is not satisfied. In step 3, the algorithm finds that the root form of the main verb of the English sentence is rain, which is not be. Therefore, the condition of step 3 is
also not satisfied. Hence step 4 detects pronominal divergence of sub-type 4.
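The checks walked through in this illustration can be sketched as a simplified decision chain. This is a hypothetical flattening covering only the two cases illustrated in this section; the remaining sub-types need the full algorithm in the figure.

```python
def identify_pronominal(e, h):
    """Simplified sketch of the pronominal checks (hypothetical flat
    features; only the two cases illustrated in the text are mapped)."""
    if e["subject"] != "it":
        return 0              # pronominal divergence starts from subject 'it'
    if h["subject"] in ("yah", "wah"):
        return 0              # pronoun translated literally: no divergence
    if e["main_verb_root"] == "be":
        return 1              # SC of E promoted to subject of H, sub-type 1
    return 4                  # non-be verb, e.g. 'It is raining' -> barsaat
```

The raining example returns 4, and "It is morning" / "subaha ho gayii hai" returns 1, as in the discussion above.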
3.3.5
Demotional Divergence
The characteristic feature of demotional divergence is that here the role of the main
verb of the source language sentence is demoted upon translation. In case of European languages this implies that the main verb of the target language is realized
from the object of the source language, and the main verb of the source language
upon translation becomes the adverbial modifier. However, with respect to English
to Hindi translations a subtle variation may be noticed. We observed several examples where the main verb of the English sentence upon translation is demoted
to the subjective complement or predicative adjunct of the Hindi sentence, but not
to adverbial modifier, which we call an adjunct. Hence in the event of demotional
divergence, the main verb of the Hindi translation is realized as a be verb. Thus
for English to Hindi translation, demotional divergence needs to be redefined accordingly. Depending upon how the roles of different constituent words change, four
different sub-types of demotional divergence may be obtained. The four sub-types
are defined as follows:
1. Demotional sub-type 1: This divergence occurs when the main verb and the
object of the English sentence are realized as predicative adjunct in the Hindi
sentence. However, the subject of the English sentence remains the subject
after translation to Hindi. For illustration, we consider the following example:
E: This dish feeds four people.
H: yah pakvaan (this dish) chaar logon ke liye (for four people) hai (is)
In this example the main verb feeds and the object four people of the
English sentence together give the predicative adjunct, which is the PP, chaar
logon ke liye (in English for four people) of the Hindi sentence. The subject
this dish remains subject after translation.
2. Demotional sub-type 2: Unlike the above sub-type, here the main verb and its
complement (instead of the object) of the English sentence are realized as the
predicative adjunct of the Hindi sentence. The following example illustrates
this point:
E: This house belongs to a doctor.
H: yah (this) ghar (house) ek (a) … (of) (is)
3. Demotional sub-type 3: Here the main verb of the English sentence is realized as the SC of the Hindi sentence, and its object becomes an SC C. For illustration:
E: These two sofas face each other.
H: do (two) sofa (sofas) ek dusre ke (each other's) saamne (opposite) hain (are)
In this example, the main verb of the English sentence face is realized as
the SC saamne in the Hindi sentence. Also, the object each other of the
English sentence becomes an SC C, i.e. ek dusre ke. Thus, this translation
belongs to demotional divergence of sub-type 3. The literal meaning of this
translation is These two sofas are opposite to each other.
4. Demotional sub-type 4 : Here also, the main verb of source is realized as SC
(adjective) of the target language. But the object and subject of the English
sentence become the subject of the translation and the post modifier of the
SC of the target language sentence, respectively. We illustrate this with the
following example:
E: This soup lacks salt.
H: iss soop mein (in this soup) namak (salt) kam (less) hai (is)
In the above example, the main verb lack of the English sentence is realized as
kam, the SC of the Hindi sentence. The object salt (namak) becomes
the subject of the target language, and the sense of the soup is realized
as soop mein, the post-modifier of the SC. In particular, this is an adjective complementation, and is expressed through the said PP. The literal meaning of the translation is Salt is less in this soup.
Analysis of the translation examples that involve demotional divergence highlights the following points:
1. In all the instances of demotional divergence we find that the main verb of
the English sentence is different from be or have. Thus if the main verb
of an input sentence is either be or have, the possibility of demotional
divergence in its Hindi translation may be ruled out.
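Observation 1, together with the earlier remark that in demotional divergence the Hindi main verb is realized as a be-verb, gives a cheap pre-filter. The sketch below uses hypothetical flat features; sub-type resolution requires the further SC/SC C/PA tests of the full algorithm.

```python
def demotional_possible(e, h):
    """Pre-filter for demotional divergence: ruled out when E's main
    verb is 'be' or 'have'; requires H's main verb to be a form of
    honaa ('be'). Hypothetical flat feature dicts."""
    return (e["main_verb_root"] not in ("be", "have")
            and h["main_verb_root"] == "be")
```

For "The soup lacks salt" / "soop mein namak kam hai" the pre-filter passes; for any English sentence headed by have it rejects immediately.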
Illustration 1.
Consider the English sentence (E) The soup lacks salt, and its Hindi translation
(H) soop mein namak kam hai . The SPACs of these sentences and their term
correspondences are given in Figure 3.10.
Here the root form of the main verb of H is ho (i.e. be), and for E it is lack. Hence, the conditions of steps 1 and 2 are not satisfied, and therefore, computation proceeds to step 3. The condition of step 3 also fails, as the subjects of E and H are not the same. Steps 4 and 5 check that both SC and SC C are present in the Hindi sentence. Hence, step 6 is considered. Since E has no object, the algorithm returns 4, indicating that the above sentence pair has a demotional divergence of sub-type 4.
Illustration 2.
Consider another example, where E is This dish feeds four people., and H is yah
pakvaan chaar logon ke liye hai . The SPACs of these two sentences and their
correspondences are given in Figure 3.11.
3.3.6
Conflational Divergence
Conflational divergence pertains to the main verb of the source language sentence.
Typically, as characterized in (Dorr, 1993), conflational divergence occurs when some
new words are required to be incorporated in the target language sentence in order to convey the proper sense of the verb of the input. However, with respect to
English to Hindi translation we need to deviate from this definition because of the
following reason. Many English verbs do not have a single-word equivalent in Hindi.
In fact, a large number of English verbs are expressed in Hindi with the help of a
noun followed by a simple verb. Such a combination is called a Verb Part (Singh,
2003), where the verb used in the Verb Part is some basic verb such as honaa (to
become), karnaa (to do) etc. Some examples of Hindi Verb Parts are given below.
For illustration, consider the verb to begin. Its Hindi equivalent is aarambh karnaa. In Hindi, aarambh is the abstract noun meaning the beginning, whereas karnaa means to do. Thus the verb is realized in Hindi as a combination of a noun and a verb. In a similar vein, the verbs denaa (meaning to give) and honaa (meaning to become) are used as the basic verbs along with appropriate nouns to provide the meanings of the English verbs cited above. There are also examples of Verb Parts involving other basic verbs, such as maarnaa, as well.
1. Conflational sub-type 1 : Divergence of this type occurs when the new words
are added as adjunct to the verb. Typically, this adjunct is realized as a
prepositional phrase. For illustration, consider the following English sentences
and their Hindi translations:
E: Ram stabbed John.
H: ram ne (Ram) john ko (to John) chaaku (knife) se (by) maaraa (hit)
The sense of the verb stab is conveyed through the introduction of the prepositional phrase chaaku se. There are cases when the adjunct appears in the
form of an adverbial phrase instead of a prepositional phrase.
E: Mary hurried to market.
H: mary jaldi se (hurriedly) bazaar (market) gayii (went)
To convey the proper sense of the verb hurry, the adverbial phrase jaldi
se is used along with the main verb jaanaa meaning to go.

2. Conflational sub-type 2: Here the sense of the English verb is realized through the subject of the Hindi sentence. For illustration:
E: He resembles his mother.
H: uskii (his) shakal (face) uskii maa se (to his mother) miltii hai (matches)
The literal meaning of the translation is: His face is similar to his mother. The subject of the Hindi sentence, viz. uskii shakal (meaning his face), is realized from the source language verb to resemble. Here uskii (his) is the possessive pronoun of the original subject (he) of the English sentence.
iss (this) pakvaan (dish) kaa swaad … hai (is)
In this example too, the subject of the Hindi sentence iss pakvaan kaa swaad
(the taste of this dish) is realized from the verb to taste.
Figure 3.12 provides a schematic representation of the proposed algorithm, keeping in view the following observations:
1. If the English sentence E has declension of be/have verb at the main verb
position, then conflational divergence cannot occur.
2. If H has more adjuncts than E, then it is the case of conflational divergence
sub-type 1.
3. If the number of nouns in the SPAC of the subject of E is less than the number of nouns in the SPAC of the subject of H, and the SPAC of the subject of H further contains a possessive personal pronoun (PRP$), or a possessive case (POSS), or a preposition (P), then conflational divergence of sub-type 2 occurs.
The algorithm returns 0 if the translation does not involve any conflational divergence. Otherwise, depending upon the case it returns 1 or 2.
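The three observations translate into a short check. The flat counts below are hypothetical stand-ins for the SPAC inspection performed by the actual algorithm.

```python
def identify_conflational(e, h):
    """Return 0, 1 or 2 per the three observations above (hypothetical
    flat features standing in for the SPAC inspection)."""
    if e["main_verb_root"] in ("be", "have"):
        return 0
    if h["num_adjuncts"] > e["num_adjuncts"]:
        return 1                      # added adjunct carries the verb sense
    if (h["subj_num_nouns"] > e["subj_num_nouns"]
            and h["subj_has_poss_or_p"]):
        return 2                      # verb sense pulled into H's subject
    return 0
```

"Ram stabbed John" gains the adjunct chaaku se, so the sketch returns 1; "He resembles his mother" gains a noun and a possessive pronoun in the Hindi subject, so it returns 2.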
3.3.7
Possessional Divergence
Possessional divergence deals with English sentences in which the declension of verb
have is used as the main verb. An interesting feature of Hindi is that it has
no possessive verb, i.e. one equivalent to the have verb of English. The normal
translation pattern of English sentences with declensions of have as main verb is
illustrated below:
E: Ram has many enemies.
H: ram ke (Ram's) bahut (many) shatru (enemies) hai

E: …
H: aaj (today) chhuttii (holiday) hai (is)

E: …
H: ram ke paas (with Ram) davaat (inkpot) hai (is)
The above examples demonstrate that the normal translation pattern of these
sentences is one of the following:
1. The main verb of the translated sentence is honaa which means to be.
2. The verb is used along with some genitive prepositions (viz. kaa, ke or kii ),
or the locative prepositional phrase, viz. ke paas, to convey the meaning of
possession (Kachru, 1980).
3. Which one of the three genitive prepositions will be used depends upon the
number and gender of the object. It is kaa if the object is masculine singular,
kii if the object is feminine singular, and ke for plural both masculine and
feminine.
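Rule 3 can be captured in a few lines. The string feature values are assumptions made for illustration.

```python
def genitive_marker(number, gender):
    """Select the genitive marker per the rule above: kaa for masculine
    singular, kii for feminine singular, ke for any plural."""
    if number == "plural":
        return "ke"
    return "kaa" if gender == "masculine" else "kii"
```

For example, a masculine singular object selects kaa and any plural object selects ke, regardless of gender.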
However, there are many examples where the translation structure deviates from this normal pattern, giving rise to divergence. We call this the possessional divergence.

1. Possessional sub-type 1: Here the roles of the subject and the object are interchanged upon translation. For illustration:
(a) He has a bad headache.
use (to him) tez sirdard (bad headache) hai (is)
In sentence (a), he and a bad headache are the subject and the object,
respectively. In the Hindi translation the subject is tez sirdard , i.e. bad
headache, and the object of the Hindi sentence is use which is the accusative
case of he. Thus the roles of subject and object are reversed upon translation.
Similarly, in (b) upon translation the roles of the subject Ram and object
fever are reversed.
2. Possessional sub-type 2 : In this case the object and its premodifying adjective
in the English sentence are realized as the subject and SC, respectively, in the
Hindi sentence. The subject of the English sentence is realized as possessive
case of the subject of the target language sentence. The following example
illustrates this.
E: These birds have a sweet voice.
H: in chidiyon kii (these birds') aawaaz (voice) miithii (sweet) hai (is)
The object voice and its premodifying adjective sweet of the English sentence are realized in Hindi as the subject aawaaz and its adjectival complement miithii. Note that the subject these birds of the English sentence is realized as a possessive case (in chidiyon kii) in the Hindi translation.
3. Possessional sub-type 3 : Here, the object and its post modifier (normally, a
PP) in the English sentence are realized as the subject and the predicative
adjunct, respectively, in the Hindi translation. The subject of the English
sentence also contributes as the possessive case to the predicative adjunct. For
illustration, consider the following:
E: Boys have books in their satchels.
H: ladkon ke (boys) baston (satchels) mein (in) kitaaben (books) hain (are)

E: …
H: … zeb (pocket) mein (in) do rupaye (two rupees) hain (are)
In the first example, the object (books) provides the subject (kitaaben) of
the Hindi translation. The post-modifier in their satchels of the object of the
English sentence is realized as a predicative adjunct ladkon ke baston mein
of the Hindi sentence. One may notice that the subject boys is present as
the possessive case in the predicative adjunct.
5. Possessional sub-type 5: Here the main verb of the Hindi sentence is realized from the object of the English sentence. For illustration:
(a) … apne chaachaa kii izzat kartii hai (respects her uncle)
(b) … baal baal bache the (had a narrow escape)
In example (a) the main verb of the Hindi sentence (izzat kartii hai) is
realized from the object regards of the English one. Similarly, in example
(b), the object escape of the English sentence is realized as the main verb
(bache the) of the Hindi sentence. Further, the premodifying adjective of
the object (narrow) is realized as an adjunct (baal baal ) in the translated
sentence.
6. Possessional sub-type 6 : Here, the main verb of the translated sentence is not
ho. Moreover, this verb does not come from any of the functional tags of
the English sentence. Consider for example the following translations:
(a) Radha had a good time here.
raadhaa ne (Radha) yahaan (here) acchaa (good) samay (time) bitaayaa (spent)

(b) …
ram ne (Ram) bhaarii (heavy) naashtaa (breakfast) kiyaa
In example (a), the main verb of the Hindi sentence is bitaayaa, which is different from the verb ho and does not come from any FT of the English sentence.
We have designed our algorithm taking care of the above observations. Figure 3.14 provides a schematic view of the proposed algorithm. We illustrate the
algorithm with the help of the following examples.
Illustration 1.
Consider the English sentence (E) Suresh has fever. Its Hindi translation (H) is suresh ko (suresh) bukhaar (fever) hai (is). The SPACs of these sentences and their term correspondences are given in Figure 3.15.
The root forms of the main verbs of E and H are have and ho, respectively. This implies that they do not satisfy the conditions of steps 1 and 2. In step 3, the algorithm checks the postposition condition on the subject of H; it finds that none of the relevant postpositions is present for the subject of H. In step 4, the algorithm finds that the subject of E and the object of H are suresh and suresh ko, respectively, which are translations of each other. Further, it finds that the object …
Illustration 2.
Consider the English sentence (E) This city has a museum. Its Hindi translation
H is iss (this) shahar (city) mein (in) ek (one) sangrahaalaya (museum) hai (is).
The SPACs of these sentences and their term correspondences are given in Figure
3.16.
The root forms of the main verbs of E and H are have and ho, respectively. Therefore, the algorithm arrives at step 3; here the subject of H does not have any of the postpositions kaa, ke or kii, and hence the algorithm proceeds further. Since the conditions of step 4 are not met, the algorithm arrives at step 5. Here it finds that in H there is no object, but a PA (iss shahar mein) is present. Also, since there is no postmodifier of the object of E, the algorithm returns 4. Thus, the
algorithm diagnoses possessional divergence of sub-type 4 in the above translation
example.
3.3.8
In this chapter we have discussed the various types of divergences that have been
observed in English to Hindi translation. By analyzing the characteristics of various
examples, we have been able to identify different sub-types under each divergence
type. These observations helped us to design algorithms for their identification.
However, we still have some examples of divergence which do not fall under any of the above-mentioned types. At the same time, we do not have a sufficient number of examples of these types to classify them under some new type or sub-type.
The efficiency of the algorithm, however, depends on the availability of the following:
A cleaned and aligned parallel corpus of both the source and the target languages.
An on-line bi-lingual dictionary. For this work, we have used Shabdanjali,
an English-Hindi on-line dictionary.1
1 http://www.iiit.net/ltrc/Dictionaries/Dict Frame.html
Appropriate parsers have to be designed for source language and target language. The parsers should be able to provide the FT and SPAC information
for both the languages. Note that, presently no such parser is available for
Hindi. For our experiments we have used manually annotated Hindi corpora.
3.4
Concluding Remarks
This chapter deals with the characterization and identification of different types
of divergence that may occur in English to Hindi translation. We observed that
identification of divergence can be made without going into the semantic details of
the two sentences. This can be achieved by comparing the Functional Tags (FT)
and Syntactic Phrase Annotated Chunks (SPAC) of the source language sentence
and its translation.
The work described here may be broadly classified into two parts:
An obvious question that arises at this point is how an EBMT system is expected
to handle divergences. In this regard our suggestions are as follows. Once divergences
are identified, the focus of a system designer should be on the following:
To split the system's example base into two parts: normal and divergence example bases. The translation examples are to be put in the appropriate part of the example base.
To design appropriate retrieval policy, so that for a given input sentence, an
EBMT system can heuristically judge whether its translation may involve any
divergence, and retrieval may be made accordingly.
To design appropriate adaptation strategies for modifying retrieved translation
examples. Since translations having divergence do not follow any standard
patterns, their adaptations may need specialized handling that may vary with
the type/sub-type of divergence.
Chapter 4
A Corpus-Evidence Based
Approach for Prior Determination
of Divergence
4.1
Introduction
This chapter presents a corpus-evidence based scheme for deciding whether the translation of an English sentence into Hindi will involve divergence. Surely, occurrence
of divergence poses a great hindrance in efficient adaptation of retrieved sentences.
A possible solution may lie in separating the example base (EB) into two parts:
Divergence EB and Normal EB, so that given an input sentence, retrieval can be
made from the appropriate part of the example base. However, this scheme can
work successfully only if the EBMT system has the capability to judge from the
input sentence itself whether its translation will involve any divergence. However,
making such a decision is not straightforward, since occurrence of divergence does
not follow any patterns or rules. In fact, a divergence may be induced by various
factors, such as, structure of the input sentence, semantics of its constituent words
etc. In this chapter we propose a corpus-evidence based approach to deal with this
difficulty. Under this scheme, upon receiving an input sentence, a system looks into
its example base to glean evidences in support as well as against any possible type of
divergence that may occur in the translation of the input sentence. Based on these
evidences the system decides whether the retrieval has to be made from the normal
EB, or from the divergence EB.
The algorithm proposed here works for structural, categorial, conflational, demotional, pronominal and nominal types of divergence¹. For convenience of presentation we denote them as d1, d2, d3, d4, d5 and d6, respectively. Barring structural
divergence (d1), all of the other five types of divergence (i.e. d2, ..., d6) have further
¹ Prior identification of possessional divergence has been kept out of the discussion here. This is
because possessional divergence depends upon several factors, such as the subject, the object, and even
the sense in which the verb have is used. Our work (Goyal et al., 2004) discusses these issues
in detail.
been classified into several sub-types depending upon the variations in the role of
different functional tags upon translation to Hindi.
In this chapter, we have identified the necessary FT-features that the source
language (English) sentences should have in order that a particular type/sub-type
of divergence may occur. This, however, does not mean that any sentence having
those FT-features will necessarily produce a divergence upon translation. As a
consequence, mere examination of the FTs of an input sentence cannot ascertain
whether its translation will induce any divergence or not. Hence more evidences
need to be considered.
This chapter describes all these evidences and how they are to be used for making an a priori decision regarding whether the input English sentence will involve any
divergence upon translation to Hindi.
4.2
The proposed scheme makes use of three different types of evidence to decide whether
a given input sentence will have a normal translation, or whether it will involve one
(or more ) type(s) of divergence when translated into Hindi. These evidences are
used in succession to obtain the overall evidence in support of divergence(s)/nondivergence in the translation of the input sentence. These three steps are explained
below:
Step 1: Here the Functional Tags (FTs) of the constituent words of the input sentence
are used to determine the divergence types that certainly cannot occur in the
4.2.1
Analysis of the divergence examples suggests that for each divergence type to occur
the underlying sentence needs to have some specific functional tags (FT) and/or some
specific attributes of these FTs. We call them, together, the FT-features of a sentence.
Considering all the divergences together we found that ten different FT-features are,
in particular, useful for identification of divergence. Table 4.1 provides a list of these
features, which we label as f1, f2, ..., f10.
Table 4.1: FT-features useful for identification of divergence (f1, ..., f10); for example, f1: Subject is present, and f2: Object is present. Each feature entry is marked to indicate either that its presence in the input sentence is necessary for the corresponding divergence type to occur, or that it should necessarily be absent in the input sentence if the corresponding divergence is to occur.
Table 4.2: The Relevance Table. Rows correspond to the divergence types d1, ..., d6 and their sub-types (d1 has no sub-types; d2 and d4 have four sub-types each, d5 has three, and d3 and d6 have two each), and columns to the FT-features f1, ..., f10; each entry specifies the required presence or absence of the feature for the corresponding sub-type.
Each row of the Relevance Table provides the necessary conditions on the FT-features of an input sentence in order that the corresponding divergence may occur.
The advantage of this evidence is that it helps in quick discarding of those types of
divergence that cannot occur in the translation of the given input sentence.
The information given in Table 4.2 may be used in the following way. Given an
input sentence, the algorithm first extracts the values for the ten FT-features, fj ,
j = 1, 2, ..., 10, from the sentence. These values are then compared with the row
entries of the Relevance Table. If the FT-features of the sentence conform with the
entries of some particular row, then evidence is obtained towards occurrence of that
particular divergence for which this row corresponds to one of the sub-types. If a
particular sentence has evidence supporting more than one divergence then all these
possible divergence types are to be considered for step 2 of the algorithm. This set
of possible divergence types for a given input is denoted as D.
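The matching procedure just described may be sketched as follows. The row entries below are invented placeholders, since the actual contents of Table 4.2 are not reproduced here; the real table specifies, for every sub-type, which of the features f1, ..., f10 must be present (True), absent (False), or are irrelevant.

```python
# Sketch of matching an input sentence's FT-feature vector against the
# Relevance Table. The rows below are hypothetical stand-ins for the real
# entries of Table 4.2.

RELEVANCE_TABLE = {
    ("d3", "sub-type 1"): {1: True, 2: False},   # hypothetical entries
    ("d3", "sub-type 2"): {1: True, 5: True},
    ("d6", "sub-type 2"): {1: True, 2: False, 5: True},
}

def possible_divergences(features):
    """features: dict mapping feature index (1..10) -> bool."""
    D = set()
    for (div, _sub), required in RELEVANCE_TABLE.items():
        # A divergence is possible if some sub-type row is fully satisfied.
        if all(features.get(i, False) == want for i, want in required.items()):
            D.add(div)
    return D

feats = {1: True, 2: False, 5: True}
print(sorted(possible_divergences(feats)))
```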
For illustration, consider the following input sentence: Ram is friendly to me.
As the sentence is parsed (with some unnecessary components edited) one may get
the following:
@SUBJ <Proper> N SG Ram, @+FMAINV V PRES be, @PCOMPL-S A ABS
friendly , @ADVL PREP to, @<P PRON PERS SG1 i < $. >
The notations used here are from the ENGCG parser and are explained in Appendix B.
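A minimal reading of such ENGCG-style output into FT-features may look as follows. Only the subject and object checks (f1, f2) follow Table 4.1, and the assumption that tokens arrive as comma-separated tag groups is ours.

```python
# A minimal reading of ENGCG-style output into FT-features. Treating the
# parse as comma-separated "tags word" groups is an assumption about the
# format used for illustration only.

def extract_ft_features(parse):
    tokens = [t.strip() for t in parse.split(",")]
    features = {}
    features[1] = any(t.startswith("@SUBJ") for t in tokens)  # f1: subject present
    features[2] = any(t.startswith("@OBJ") for t in tokens)   # f2: object present
    return features

parse = ("@SUBJ <Proper> N SG Ram, @+FMAINV V PRES be, "
         "@PCOMPL-S A ABS friendly, @ADVL PREP to, @<P PRON PERS SG1 i")
print(extract_ft_features(parse))
```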
We can summarize the parsed version as follows. Of the ten FT-features discussed
above (see Table 4.1), only four are present in the above sentence. These are:
d2 (check the rows for sub-types 3 and 4) for both the sentences. But of the two
sentences, the translation of the first one is normal. It is only the second
sentence that involves categorial divergence upon translation to Hindi. Thus, to
determine the possible divergence type(s) in a sentence, the FT-features alone cannot
be taken as the sole evidence, and more evidences need to be sought.
From the above example, it can be surmised that it is the prepositional phrase
in tears that is instrumental in causing the categorial divergence in the second
sentence. In general, corresponding to each divergence type one can associate some
functional tags that are instrumental in causing the divergence. We call it the
Problematic FT of the corresponding divergence type. Table 4.3 provides the Problematic FT corresponding to all the six divergence types relevant in the context
of English to Hindi translation. This table has been obtained by examining the
sentences in our example base.
Divergence Type      Problematic FT
Structural           Main Verb
Categorial           Predicative Adjunct
Conflational         Main Verb
Demotional           Main Verb
Pronominal           SC
Nominal              SC (adjective)

Table 4.3: FT of the Problematic Words for Each Divergence Type
(school) (in) (drum) (beat)
(Becker) (beat)
Here, the first one is an example of normal translation, while the second one is a
case of structural divergence because of the introduction of the postposition ko in
the object of the Hindi sentence. A careful examination suggests that although the
main verb of both the sentences is beat, its translation causes divergence when
used in a particular sense, but not when used in some other sense. By referring
to WordNet 2.0² one may find that the first sentence has the 6th sense of the word
beat, which is to make a rhythmic sound; while the second sentence has the
1st sense of the word beat, which is to come out better in a competition, race,
or conflict. Therefore, while dealing with words one needs to pay attention to
² http://www.cogsci.princeton.edu/cgi-bin/webwn
the particular sense in which a word is being used: in some senses it may cause
divergence, and in some other senses it may not induce any divergence at all.
Since an exhaustive list of words (along with their relevant senses) that lead
to divergence is impossible to make, the proposed algorithm tries to gather more
evidences by using the semantic similarity of the constituent words to the word
senses that are already known to cause divergence, or known to deliver a normal
translation. In order to achieve this two dictionaries have been created: Problematic
Sense Dictionary (PSD) and Normal Sense Dictionary (NSD). The PSD contains the
words along with their senses that have been found to cause divergence. Similarly,
the NSD contains the words along with their senses for which normal translation
has been observed.
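The two dictionaries may be represented as follows. The sense-tagged entries shown are consistent with Tables 4.5 and 4.7; the container layout itself is an assumption made for illustration.

```python
# Sketch of the PSD/NSD storage, with entries in the word#pos#k format
# (e.g. Beat#v#6). Parsing into (lemma, pos, sense) triples makes
# sense-level lookup straightforward.

def parse_entry(entry):
    word, pos, k = entry.split("#")
    return word.lower(), pos, int(k)

PSD = {
    "d1": {parse_entry(e) for e in ["Beat#v#1", "Stab#v#1", "Attend#v#1"]},
    "d2": {parse_entry(e) for e in ["Pain#n#1", "Tear#n#1"]},
}
NSD = {
    "d1": {parse_entry(e) for e in ["Beat#v#6", "Resolve#v#6"]},
}

def in_psd(div, entry):
    # Lookup is case-insensitive on the lemma, exact on pos and sense.
    return parse_entry(entry) in PSD.get(div, set())

print(in_psd("d1", "beat#v#1"), in_psd("d1", "beat#v#6"))
```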
Divergence type (di)   No. of words in PSDi   No. of words in NSDi
Structural (d1)        163                    1078
Categorial (d2)        57                     167
Conflational (d3)      43                     997
Demotional (d4)        66                     1422
Pronominal (d5)        75                     170
Nominal (d6)           12                     97
Total                  416                    3931
These dictionaries are further grouped into six sections, one section corresponding to each divergence type. Section PSDi contains the problematic words occurring in
sentences whose translations involve divergence of type di. Similarly, section NSDi
contains the problematic words of sentences having the FT-features required for divergence type di (as specified in the Relevance Table), but actually having a normal
Table 4.5: Sample entries of each PSDi and NSDi, i = 1, 2, ..., 6, in the format word#pos#k. The entries shown for PSD1/NSD1/PSD2/NSD2 include Attend#v#1, Beat#v#6, Afraid#a#1, Brave#a#1, Beat#v#1, Love#v#3, Do#v#13, Eat#v#4, Friendly#a#4, On#r#2, Good#a#1, Illusion#n#2, Marry#v#1, Occupy#v#4, Purchase#v#1, See#v#1, Pain#n#1, Tear#n#1, Monitor#n#2 and Trouble#n#1; for PSD3/NSD3/PSD4/NSD4: Face#v#3, Agree#v#4, Belong#v#1, Continue#v#9, Look#v#5, Resemble#v#1, Feel#v#4, Go#v#10, Front#v#1, Ride#v#9, Sell#v#2, Rush#v#4, Stab#v#1, Look#v#3, Solve#v#1, Smell#v#2, Suffice#v#1 and Walk#v#6; for PSD5/NSD5/PSD6/NSD6: Freeze#v#6, Humid#a#1, Morning#n#3, Bright#a#10, Light#a#1, Plain#a#2, Cold#a#1, Hot#a#1, Hungry#a#1, Dull#a#4, Good#a#1, Happy#a#2, Rain#v#1, Winter#n#1, Shiny#a#3, Wrong#a#1, Sleepy#a#1, Thirsty#a#2, Helpful#a#1 and Innocent#a#4.
Each PSD/NSD entry contains, along with the relevant word, its part of speech
and appropriate sense number (as given by WordNet 2.0). Table 4.5 shows some
entries corresponding to each PSDi and NSDi, i = 1, 2, ..., 6. The entries are stored
in the format word#pos#k, where pos stands for the particular Part of Speech,
which can be one of n, v, a or r (corresponding to noun, verb, adjective and adverb,
respectively), and k is the sense number.
1. sim(ai , wi ) gives the maximum similarity score between ai and the words in
PSDi , where sim(x, y) denotes the semantic similarity between two words x
and y (see Appendix C).
2. The quantity s(di), corresponding to divergence type di, is defined as follows:

       s(di) = 0                            if xi = 0;
       s(di) = (1/2) (xi/ci + ci/S)         otherwise.          ...(1)
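For illustration, the similarity measure and the score of formula (1) may be sketched as follows. The hypernym tree is a toy stand-in for the WordNet-based measure of Appendix C, and the reading of formula (1), together with the meanings of xi, ci and S, is our inference from the surrounding text and Table 4.4, not a verbatim reproduction.

```python
# Toy stand-ins for two quantities used by the scheme. sim(x, y) is
# imitated by a path measure over a tiny hand-coded hypernym tree, and
# formula (1) is read as s(di) = 0 if x_i = 0, else (x_i/c_i + c_i/S)/2,
# where x_i = |W_i|, c_i = |PSD_i| and S = total PSD size.

HYPERNYM = {  # child -> parent (hypothetical fragment)
    "stab#v#1": "injure#v#1",
    "hurt#v#2": "injure#v#1",
    "injure#v#1": "act#v#1",
}

def ancestors(w):
    chain = [w]
    while w in HYPERNYM:
        w = HYPERNYM[w]
        chain.append(w)
    return chain

def sim(x, y):
    if x == y:
        return 1.0
    ax, ay = ancestors(x), ancestors(y)
    common = next((a for a in ax if a in ay), None)
    if common is None:
        return 0.0
    # Shorter combined path to the nearest common ancestor -> higher score.
    return 1.0 / (1 + ax.index(common) + ay.index(common))

def s(x_i, c_i, S):
    return 0.0 if x_i == 0 else 0.5 * (x_i / c_i + c_i / S)

# |PSD1| = 163 and S = 416 are the sizes given in Table 4.4.
print(sim("stab#v#1", "hurt#v#2"), round(s(10, 163, 416), 3))
```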
These four quantities are used to determine the possibility of occurrence of divergence di in the translation of the given input sentence.
4.3
In order to determine whether a given input sentence, say e, may involve some
divergence upon translation, the evidences mentioned in previous section are used
in the following way. First the input sentence e is parsed, and then using the
Relevance Table a set D is determined that contains the divergence types that may
possibly occur in the translation of e. For each possible divergence type di ∈ D the
problematic word ai is extracted from the sentence e. From PSDi , the word wi is
retrieved that is semantically most similar to ai . The subsequent steps depend upon
the value of sim(ai , wi ). If the value is 1, that implies that ai is present in PSDi .
On the other hand, a small value of sim(ai , wi ) implies that there is not enough
evidence in support of divergence di . Hence it may be concluded that divergence di
will not occur in the translation of e. Note that, whether the value of sim(ai , wi ) is
sufficiently small is determined by comparing it with a threshold t, which is to be
determined experimentally from the corpus. If the value of sim(ai , wi ) is between t
and 1, then some evidence in support of divergence di is obtained. In order to make
a conclusion from this point, the algorithm now refers to NSDi to obtain the word
wi′ that is semantically most similar to ai. Depending upon the values of sim(ai,
wi) and sim(ai, wi′), a decision is taken regarding whether the translation of e will
involve divergence di or not. Based on this decision, the retrieval is to be made from
the appropriate part of the example base, i.e. the Divergence EB or the Normal EB.
The overall scheme, which involves four major steps, is explained below:
Step 1: At this stage, the input sentence e is parsed, and its FT-features are
obtained. From these FT-features, using Table 4.2, the set D of possible divergence
types is determined.
The main objective now is to determine the divergence types, out of all the di ∈ D,
that have positive evidence of occurring in the translation of e.
Steps 2 and 3 are designed for this purpose. A set of flags, Flagi, corresponding to
each di ∈ D is used to store this information. Initially each of these flags is set to
-1. Step 2 and Step 3 are now carried out for each di ∈ D in order to reassign the
value of Flagi. At each iteration the next di with the minimum index i is chosen
such that Flagi is -1.
Step 2: Here the problematic word ai corresponding to
divergence di (see Section 4.2) is determined. The set Wi, comprising the words belonging to PSDi and having a positive semantic similarity score with ai, is determined.
Thus Wi = {b : b ∈ PSDi and sim(ai, b) > 0}. From Wi the word wi is obtained such
that sim(ai, wi) = max{sim(ai, b) : b ∈ Wi}.
Step 3: Here, first the set Wi′ = {b : b ∈ NSDi and sim(ai, b) > 0} is computed.
From this set the word wi′ is picked such that sim(ai, wi′) = max{sim(ai, b) : b ∈ Wi′}.
If Wi′ is empty then sim(ai, wi′) is considered to be 0. Depending on sim(ai, wi′), one
of the following cases is executed.
Case 3a: If sim(ai, wi′) = 0 then there is no evidence that the word will lead to a
normal translation. Consequently, Flagi is set to 1, indicating that divergence di has
a positive chance of happening.
Case 3b: If sim(ai, wi′) = 1 then the evidence suggests that the word ai should
provide a normal translation of the sentence, and there is no possibility of divergence
di occurring in the translation of this sentence. Consequently, Flagi is set to 0.
Case 3c: Decision making becomes most difficult when 0 < sim(ai, wi′) < 1. This
implies that no word sufficiently similar to ai exists in either the PSD or the NSD.
Thus, no decision about divergence/non-divergence can be taken yet.
In this case the scheme looks into how many words similar to ai are
available in PSDi and NSDi. This evidence is given by the scores s(di) and s(ni), computed
using formula (1) (given in Section 4.2). Finally, the similarity scores sim(ai, wi) and
sim(ai, wi′) are combined with s(di) and s(ni), respectively, so that both evidences are
taken into consideration: m(di) = (1/2)(sim(ai, wi) + s(di)) and m(ni) = (1/2)(sim(ai, wi′) + s(ni)).
If the evidence supporting divergence di is the stronger one, i.e. if m(di) > m(ni),
then Flagi is set to 1; otherwise it is set to 0.
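The case analysis of Step 3 may be sketched as follows. The combined score m(di) = (1/2)(sim(ai, wi) + s(di)) is inferred from the worked values in Tables 4.6 to 4.8; the function and parameter names are illustrative.

```python
# Sketch of Step 3 (Cases 3a-3c). sim_psd and sim_nsd are sim(a_i, w_i)
# and sim(a_i, w_i'); s_d and s_n are the scores of formula (1).

def step3_flag(sim_psd, sim_nsd, s_d, s_n):
    """Return 1 if divergence d_i is judged possible, else 0."""
    if sim_nsd == 0.0:
        return 1            # Case 3a: no evidence of a normal translation
    if sim_nsd == 1.0:
        return 0            # Case 3b: word known to translate normally
    # Case 3c: combine similarity with the dictionary-population scores.
    m_d = 0.5 * (sim_psd + s_d)
    m_n = 0.5 * (sim_nsd + s_n)
    return 1 if m_d > m_n else 0

# Values consistent with sentence 3 of Table 4.7 (divergence d1):
print(step3_flag(0.96, 0.96, 0.241, 0.142))
```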
If |C| = 1, it implies that there is evidence for only one permissible combination.
Let it be {dk, dl}. The algorithm suggests that the input sentence e will involve
both divergences dk and dl upon translation to Hindi.
If |C| > 1, that is, if evidences are obtained in support of more than
one permissible combination of divergences, then the scheme needs to select
the most likely of them. It therefore determines the quantity
(1/2)(m(di) + m(dj)) for each permissible combination {di, dj} in C, and selects
the combination for which this quantity is maximum.
The flowchart of the proposed scheme is given in Figures 4.1 and 4.2.
4.4
In this section we first illustrate with examples how the above algorithm works
towards prior identification of divergence, if any, in translation from English to Hindi.
The examples considered are increasingly difficult in nature. Later in Subsection
4.4.4 a consolidated result of several experiments is presented, and certain limitations
of the said algorithm are discussed.
4.4.1
Illustration 1
Note that the FT-features of the given input sentence conform with both the sub-types of d3 and only sub-type 2 of d6 (see Table 4.2). Hence the set D of possible
divergence types is obtained as D = {d3, d6}, which are the conflational and nominal types
4.4.2
Illustration 2
Using the Relevance Table the set D of possible divergence types is obtained as
{d2 }.
The algorithm now collects evidences in support of categorial divergence (d2):
Table 4.3 suggests that the problematic FT for d2 is the predicative adjunct, i.e. in
dilemma. Thus the problematic word is dilemma. WordNet 2.0 provides only one
sense for dilemma: state of uncertainty or perplexity especially as requiring a choice
between equally unfavorable options. Thus the problematic word a2 is dilemma#n#1.
A search in PSD2 for the word that is semantically most similar to a2 retrieves the
4.4.3
Illustration 3
+ s(n1)) = 0.520. Since m(d1) > m(n1), the algorithm sets Flag1 to 1.
Divergence type      s(di)    m(di)
structural (d1)      0.444    0.552
conflational (d3)    0.086    0.543
demotional (d4)      0.204    0.602

Table 4.6: Values of s(di) and m(di) for Illustration 3
In Step 3, the set D′ = {d1, d3, d4} is constructed. The set of possible combinations C (see Case 4c) is found to be {{d1, d3}, {d3, d4}}. For a final decision
the algorithm now computes the values of s(di) and m(di) (see Case 3c). These
values are given in Table 4.6. Using the values given therein the algorithm computes
(1/2)(m(d1) + m(d3)) = 0.548 and (1/2)(m(d3) + m(d4)) = 0.573.
Since the latter one is maximum, the algorithm suggests that the above input
sentence will involve divergences d3 and d4 upon translation to Hindi. This decision of the algorithm is also correct.
Tables 4.7 and 4.8 provide a few more examples with brief explanations. The overall
Table 4.7: Evidences gathered for some example sentences. Columns: possible divergence type; problematic word ai; most similar word wi; sim(ai, wi); whether wi is a coordinate term; wi′; sim(ai, wi′).

1. She will resolve this issue.
   d1: resolve#v#6 | calculate#v#1 | 0.984 | No | resolve#v#6 | 1.0
   d3: resolve#v#6 | Nil | 0.0 | NA | NA | NA
   d4: resolve#v#6 | Nil | 0.0 | NA | NA | NA
2. I will attend this meeting.
   d1: attend#v#1 | attend#v#1 | 1.0 | NA | NA | NA
   d3: attend#v#1 | look#v#5 | 0.66 | No | ride#v#9 | 0.66
   d4: attend#v#1 | face#v#4 | 0.75 | Yes | NA | NA
3. This exercise will hurt your back.
   d1: hurt#v#2 | trample#v#2 | 0.96 | No | twist#v#9 | 0.96
   d3: hurt#v#2 | knife#v#1 | 0.96 | No | twist#v#9 | 0.96
   d4: hurt#v#2 | Nil | 0.0 | NA | NA | NA
4. John stabbed Mary.
   d1: stab#v#1 | stab#v#1 | 1.0 | NA | NA | NA
   d3: stab#v#1 | stab#v#1 | 1.0 | NA | NA | NA
   d4: stab#v#1 | Nil | 0.0 | NA | NA | NA
5. This dish tastes good.
   d3: taste#v#1 | taste#v#1 | 1.0 | NA | NA | NA
   d6: good#a#1 | Nil | 0.0 | NA | NA | NA
6. This table weighs 100kg.
   d1: weigh#v#1 | encounter#v#3 | 0.660 | No | stay#v#1 | 0.660
   d3: weigh#v#1 | measure#v#3 | 0.972 | No | look#v#3 | 0.660
   d4: weigh#v#1 | suffer#v#6 | 0.660 | No | look#v#3 | 0.660
7. It is windy today.
   d2: windy#a#1 | stormy#a#1 | 0.75 | No | Nil | 0.0
   d5: windy#a#1 | stormy#a#1 | 0.75 | No | Nil | 0.0
8. It will be morning soon.
   d2: morning#n#3 | pain#n#2 | 0.406 | No | NA | NA
   d5: morning#n#3 | morning#n#3 | 1.0 | NA | NA | NA
9. She is in pain.
   d2: pain#n#1 | pain#n#2 | 0.438 | No | NA | NA
10. It suffices.
   d3: suffice#v#1 | resemble#v#1 | 0.782 | No | meet#v#5 | 0.96
   d4: suffice#v#1 | suffice#v#1 | 1.0 | NA | NA | NA
   d5: suffice#v#1 | Nil | 0.0 | NA | NA | NA
Table 4.8: Decisions for the sentences of Table 4.7 (columns: D; s(di); s(ni); m(di); m(ni); Flagi; D′; (1/2)(m(di) + m(dj)); Result). Among the results: sentence 1 is judged Normal; sentence 2 yields d1; sentence 7 yields the combination {d2, d5}; sentence 8 yields d5; sentence 9 is judged Normal; and sentence 10 yields d4.
4.4.4
Experimental Results
In order to evaluate the performance we have used the above algorithm on 300
randomly selected sentences that are not present in our example base. Manual
analysis of the translations of these 300 sentences revealed that 32 of them
involve some type of divergence when translated from English to Hindi. The remaining
268 sentences have normal translations.
The output of the algorithm is as follows: It recognized 36 of the sentences
to have divergence upon translation, and 261 to have normal translation. For 3
sentences the algorithm could not make any decision. Table 4.9 summarizes the
overall outcome.
Parameters             Divergence   Normal
Number of examples     32           268
Experimental results   36           261
Correct results        30           260
Recall %               83.33%       99.62%
Precision %            93.75%       97.39%

Table 4.9: Overall outcome of the experiment
The very high value (above 90%) for precision establishes the efficiency of the
algorithm in detecting possible occurrence of divergence even before the actual translation is carried out.
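The divergence-column figures of Table 4.9 follow directly from the raw counts, as the following check shows; the variable names are ours.

```python
# Reproducing the divergence-column percentages of Table 4.9 from the raw
# counts: 32 actual divergence cases, 36 predicted, 30 correctly predicted.

actual, predicted, correct = 32, 36, 30
ratio_over_predicted = 100 * correct / predicted   # reported as 83.33% in Table 4.9
ratio_over_actual = 100 * correct / actual         # reported as 93.75% in Table 4.9
print(round(ratio_over_predicted, 2), round(ratio_over_actual, 2))
```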
There are a few examples where the algorithm failed to produce the correct decision.
These may be put into three categories:
The algorithm is not able to give the correct result in the first two cases. We feel that
the possible reasons behind the incorrect decisions taken by the algorithm are the
following:
Lack of robust PSD and NSD. The present sizes of the PSD and NSD are
416 and 3931 entries, respectively. Evidently, these numbers are not large enough
to deal with all different sentences. As more examples (particularly, those
involving divergence) are collected, both the PSD and NSD may be enriched
with additional entries. This will in turn enable the algorithm to measure
semantic similarity in a more direct way. As a consequence, the number of
erroneous decisions will be reduced.
The value of the threshold. For our experiments we have used 0.5 as the value
of the threshold t. This value has been obtained by carrying out a number
of experiments on our example base. However, with more examples this value
of t may have to be reassigned, which may in turn improve the quality of the
results. Further experiments with more examples need to be carried out to
arrive at an optimal value of the threshold t.
4.5
Concluding Remarks
The experiments carried out by us resulted in very high values of precision and
recall. However, more experiments need to be done to establish this scheme as a key
technique for dealing with divergences for an EBMT system.
The following points may be noted with respect to the scheme presented here:
Chapter 5
A Cost of Adaptation Based
Scheme for Efficient Retrieval of
Translation Examples
5.1
Introduction
5.2
Various similarity metrics reported in the literature can be characterized depending on the text units to which they are applied. These units may be words, characters,
score depends to a significant extent on the word weights (word frequency and
sentence frequency)², which in turn depend on the sentences in the example
base. Thus the schemes become highly subjective. In particular, sentences
having a similar structure (in terms of tense, subject, number of objects, etc.)
have higher similarity measurement values for a given input sentence. Different weights have been assigned to the similarity of different syntactic tags. For
example, a score of 20 is given to verb or auxiliary verb matching, a score of
5 is given to adjective or adverb matching, etc.
5. Hybrid retrieval scheme: This scheme has been used in the ReVerb system (Collins
and Cunningham, 1996; Collins, 1998), which utilizes two different levels of
case retrieval: string matching retrieval (Phase 1), and activation passing for
syntactic retrieval (Phase 2). In Phase 1, only exact words are matched, and
near morphological neighbours (such as variations due to number or tense) are
not considered. The highest score is allocated to those cases that
have been activated the greatest number of times. In Phase 2, for structural
retrieval, the input sentence is first pre-chunked, such that each chunk has an
explicit head-word. The algorithm initiates activation from each word in the
chunk, giving the head word an increased weight to reflect its pivotal role in
the chunk. The final score is evaluated by summing the above two scores.
6. DP-matching between word sequence (Sumita, 2001): This scheme scans the
source parts of all example sentences in a bilingual corpus. By measuring
the semantic distance between the word sequences of the input and example
sentences, it retrieves the examples with the minimum distance, provided the
distance is smaller than a given threshold. Otherwise, the whole translation
dist = (I + D + 2 Σ SEMDIST) / (Linput + Lexample)

where I is the number of insertions, D is the number of deletions, SEMDIST = K/N,
and N is the height of the abstraction level. The value of SEMDIST ranges from
0 to 1. The denominator of the above expression is the sum of the lengths of
the input and the example sentences.
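The DP-matching distance described above may be sketched as follows. The toy SEMDIST table is an assumption made for illustration; in the original scheme the semantic distance between words comes from a thesaurus.

```python
# Sketch of the DP-matching distance of (Sumita, 2001): an edit distance
# over word sequences where a substitution costs 2 * SEMDIST, normalized
# by the summed lengths of input and example.

SEMDIST = {("car", "vehicle"): 0.25}  # illustrative thesaurus distances

def semdist(a, b):
    if a == b:
        return 0.0
    return SEMDIST.get((a, b), SEMDIST.get((b, a), 1.0))

def dp_dist(inp, ex):
    n, m = len(inp), len(ex)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,    # deletion
                          d[i][j - 1] + 1,    # insertion
                          d[i - 1][j - 1] + 2 * semdist(inp[i - 1], ex[j - 1]))
    return d[n][m] / (n + m)                  # normalize by summed lengths

print(dp_dist("the car runs".split(), "the vehicle runs".split()))
```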
7. Semantic matching procedure (Jain, 1995): This scheme first looks at the verb-part of the input sentence, and on the basis of the type of verb-part it chooses
an appropriate partition of the input sentence. The syntactic units of the
input sentence are counted for entering into the next level of partition. After
reaching the correct sub-partition, exact pattern matching is performed. For
all such examples, the distance from the input sentence is found using a distance
formula. The distance d between I (input sentence) and E (example sentence)
is defined as follows:
d(I, E) = Σ (p = 1 to n) (…)
where n is the number of noun syntactic groups in the source language sentence.
IG and EG are the input sentence and example sentence noun syntactic groups,
respectively. Similarly, IV and EV are the input and
the example sentence verb groups, respectively.
The above distance d is calculated on the basis of a weighted average of the attribute difference, status difference, gender difference, number difference, person difference, additional semantic difference, and verb category difference between the example sentence E and the input sentence I. Pre-assigned values in the
range of 0 to 1 have been used as the weighting factors for the above parameters.
8. Retrieving meaning-equivalent sentences (Shimohata et al., 2003): Retrieval
of meaning-equivalent sentences is based on content words (e.g. noun, adjective, verb), modality (request, desire, question) and tense. This method
does not rely on functional word (e.g. conjunction, preposition, auxiliary
verb) information. A thesaurus is utilized to extend the coverage of the example base. Two types of content words, identical and synonymous, have
been used. Sentences that satisfy the following conditions are recognized as
meaning-equivalent sentences.
The retrieved sentence should have the same modality and tense as the
input sentence.
All content words (identical or synonymous) are included in the input sentence. This means that the set of content words of a meaning-equivalent
sentence is a subset of the input.
At least one identical content word is included in the input sentence.
If more than one sentence is retrieved, the algorithm ranks them by introducing
a focus area to select the most similar one. The focus area has been defined
as the last N words of the word list of an input sentence. The value of N
varies according to the length of the input sentence.
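The three conditions above may be checked as follows. The synonym table and the modality/tense representation are illustrative stand-ins for the thesaurus and the analyser that the method assumes.

```python
# Sketch of the three retrieval conditions of (Shimohata et al., 2003).

SYNONYMS = {"begin": {"start"}}  # illustrative thesaurus fragment

def synonymous(a, b):
    return a == b or b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set())

def meaning_equivalent(inp, cand):
    """inp/cand: dicts with 'content' (word list), 'modality', 'tense'."""
    # Condition 1: same modality and tense as the input sentence.
    if (cand["modality"], cand["tense"]) != (inp["modality"], inp["tense"]):
        return False
    # Condition 2: every candidate content word is identical or synonymous
    # to some input content word (candidate content is a subset of input's).
    if not all(any(synonymous(c, w) for w in inp["content"]) for c in cand["content"]):
        return False
    # Condition 3: at least one identical content word is shared.
    return any(c in inp["content"] for c in cand["content"])

inp = {"content": ["meeting", "start"], "modality": "statement", "tense": "future"}
cand = {"content": ["meeting", "begin"], "modality": "statement", "tense": "future"}
print(meaning_equivalent(inp, cand))
```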
Character based metrics are highly script dependent. Hence a scheme which
is designed for a specific language may not be pursued for another language.
Word based metrics are generally dependent on the size of the database. If the
database does not contain sentences having words common to a given sentence,
then these methods may fail to retrieve any similar sentence from the example
base.
Most importantly, in almost all the schemes³ described above, adaptation and
retrieval have been dealt with independently. However, we feel that adaptation
and retrieval should go hand in hand. A retrieval scheme should be considered
efficient (for an EBMT system) if the adaptation of the retrieved sentence is
computationally less expensive.
³ The only metrics that we found to have considered the concept of adaptation while measuring
similarity are (Sumita, 2001) and (Collins, 1998). These schemes rely only on counting the number of adaptation operations, and a fixed penalty is assigned to these operations. However, this
assumption is not very realistic.
5.3
As discussed in Chapter 2, the cost of adaptation depends on the number of operations required for adapting a retrieved example. The total cost may then be computed as the sum of the individual costs of the operations used for the adaptation.
An important point to be noted in this respect is that some adaptation operations
(e.g. constituent word addition and constituent word replacement) require a search
of an English to Hindi dictionary. Typically, this dictionary will not be stored in the
RAM of the system, and its access requires retrieval from external storage. This cost
is much more than the cost of any operation that can be accomplished in the main memory. Resorting
to morpho-word or suffix operations reduces the number of dictionary searches, since
the number of morpho-tags and suffixes is much smaller in comparison with the total content of a dictionary. However, since complete avoidance of constituent word
additions and replacements is impossible, we had to take into account the search
time due to different operations in our analysis of computational cost. To deal with
dictionary search, we make the following assumptions/observations:
POS           Size    Search time
Noun
Adjective     5449    log2 5449 ≈ 12.41
Adverb        1027    log2 1027 ≈ 10.00
Preposition   87      log2 87 ≈ 6.44
Pronoun       72      log2 72 ≈ 6.17
Verb          4330    log2 4330 ≈ 12.08
… L/2.
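The Search time column above is simply log2(size), the depth of a binary search over a sorted word list, as the following sketch shows; the preposition list is illustrative.

```python
# Binary search over a sorted dictionary section costs about log2(size)
# comparisons, matching the Search time column above.

import bisect
import math

prepositions = sorted(["at", "by", "for", "from", "in", "of", "on", "to", "with"])

def lookup(word):
    i = bisect.bisect_left(prepositions, word)
    return i < len(prepositions) and prepositions[i] == word

# Expected search depth for the 87-entry preposition section: log2(87).
print(lookup("from"), round(math.log2(87), 2))
```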
7. Since dictionary search required for constituent word addition (WA) and constituent word replacement (WR) operations is computationally expensive, we
introduce the following step before referring to the dictionary. We suggest that
… Lp/2 …

(A) gaadii diijal par chaltii hai
    (car) (diesel) (on) (runs)

(B) diijal iss gaadii ke liye upyukta iindhan hai
    (diesel) (this) (car) (for) (suitable) (fuel) (is)
(is)
Note that in sentence (A) the words car and diesel are the subject and the complement of a preposition, respectively. On the other hand, in sentence (B) their
roles are reversed. In order to generate the translation of (A), if the sentence
(B) is retrieved then for these two positions constituent word replacement operations are required. Typically, this operation demands a dictionary search
to get the Hindi equivalents of these words. However, in the above example
the dictionary search may be avoided, since the Hindi equivalents of the desired
words (car and diesel) may be obtained from the retrieved example itself.
Hence the computational cost can be minimized.
8. Morpho-word operations or suffix operations do not require any dictionary
search. Only a set of fixed rules (which may be in a tabular form) is needed
… (m · M/2). … (k · K/2), where K is the total number of suffixes.
5.3.1
Based on the above observations, the costs of the ten different adaptation operations
(discussed in Chapter 2) are estimated in the following way:
1. Constituent Word Deletion (WD): To delete a word from a retrieved example,
first the word is located in the sentence, and then it is deleted. Thus the average
cost is (l1 · L/2) + ε, where L is the length of the retrieved Hindi sentence,
l1 is the constant of proportionality, and ε is a small positive quantity reflecting
the cost of the actual deletion operation (e.g. adjustment of pointers if sentences
are stored in a linked-list structure of words).
2. Constituent Word Addition (WA): Constituent word addition is done in three
steps:
First, the Hindi equivalent of the word to be added has to be found in
the dictionary. This involves the cost {(d · log2 D) + (c · 10^5)}, where
D is the size of the relevant dictionary, and c and d are the constants of
proportionality. The two terms correspond to searching the binary tree
182
Lp
).
2
Here
Lp
)
2
M
2
the constituent word deletion to get the cost of morpho-word deletion. Here
183
M
)
2
+ .
5. Morpho-word Addition (MA): For morpho-word addition, the cost for dictionary search and access in constituent word addition (by referring to item 2 above) is replaced with the average cost (m·M/2). Moreover, the component (l2·Lp/2) of constituent word addition is not considered for morpho-word addition, as these morpho-words are not present in the tagged version. Therefore, the average cost of morpho-word addition is (l1·L/2) + (m·M/2) + β + α.
6. Morpho-word Replacement (MR): To compute the cost of morpho-word replacement one may refer to the morpho-word addition cost as explained just above. However, two of its components, viz. β and α, need not be considered in the cost for morpho-word replacement. This is because the grammar rules need not be used to find the location of the new word, and consequently no extra space needs to be created. Further, an additional cost (m·M1/2) is to be added for finding out the morpho-word to be replaced, where M1 is the size of the set from which the new morpho-word is to be picked. Therefore, the average cost for morpho-word replacement is (l1·L/2) + (m·M/2) + (m·M1/2). It may be noted that M1 and M can be equal if the word is replaced with some morpho-word from the same set.
7. Suffix Deletion (SD): Here the work involved is first to identify the right suffix, and then to do the stripping. So the cost is (l1·L/2) + (k·K/2), where k is the constant of proportionality, and K is the total number of suffixes (as explained in item 8 of Section 5.3).
8. Suffix Addition (SA): Suffix addition is done in two steps. First the position where the suffix is to be added is located, and then the appropriate suffix is chosen from the suffix set. Hence the average cost is (l1·L/2) + (k·K/2).

9. Suffix Replacement (SR): The average cost of suffix replacement is (l1·L/2) + (k·K/2) + (k·K1/2). It is costlier than SA because here, on top of adding the suffix, some extra computational effort (k·K1/2) is required for identifying the suffix that needs stripping from the word. It may be noted that K1 and K can be equal if the suffix is replaced with some suffix from the same set.
10. For the Copy operation no computational cost is taken into account.

These individual costs may be used for determining the overall cost of adaptation. Section 5.4 discusses how the cost may be calculated for adaptation between different functional slots and kinds of sentences.
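To make the bookkeeping concrete, the per-operation averages above can be sketched as small cost functions. This is an illustrative sketch only: the constant values are hypothetical placeholders, not measurements from this work.

```python
import math

# Hypothetical values for the proportionality constants, the small
# deletion/space cost (alpha) and the large positioning cost (beta).
L1 = L2 = D_CONST = M_CONST = K_CONST = 1.0
ALPHA, BETA, C = 0.01, 5.0, 1.0

def wd_cost(L):
    """Constituent word deletion: locate the word, then delete it."""
    return L1 * L / 2 + ALPHA

def wa_cost(L, Lp, D):
    """Constituent word addition: dictionary search and disc access,
    positioning via grammar rules, and creation of extra space."""
    return (L1 * L / 2 + L2 * Lp / 2
            + D_CONST * math.log2(D) + C * 1e5
            + BETA + ALPHA)

def md_cost(L, M):
    """Morpho-word deletion: deletion plus search in the morpho-word set."""
    return L1 * L / 2 + M_CONST * M / 2 + ALPHA

def sd_cost(L, K):
    """Suffix deletion: locate the suffix, then strip it."""
    return L1 * L / 2 + K_CONST * K / 2
```

The disc-access term C * 1e5 dominates whenever a dictionary is consulted, which is what motivates the filtration scheme of Section 5.6.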
5.4
In this section we discuss the cost of adaptation corresponding to the features, by referring to the adaptation rules presented in the rule tables given in Sections 2.3, 2.4, 2.5, 2.6 and 2.7.
5.4.1
The adaptation rule Table 2.15 suggests that adapting a particular kind of sentence into another kind requires one or more of the following operations: either addition or deletion of the word adverb nahiin, or addition or deletion of the morpho-word kyaa. Hence the cost table can be generated by computing the costs with respect to the above four adaptation operations only.
By referring to the notation of the adaptation cost operations given in Section 5.3.1, the costs of these operations are:

The cost (k1) of WA for the word adverb nahiin is (l1·L/2) + β + α. Here the dictionary search is not required, as the translation nahiin may be stored in some readily accessible location.

The cost (k2) of WD for the word adverb nahiin is (l1·L/2) + α.

The cost of MA (for the morpho-word kyaa) is α: as kyaa always comes at the beginning of the sentence, no search is required to find the correct position of the word in the retrieved Hindi sentence. We call this cost k3.

Similarly, the cost of MD of the morpho-word kyaa may be computed as k4 = α.
Table 5.1 gives the adaptation cost due to kind of sentences for different combinations of input and retrieved sentence. The cost of adaptation due to variations in kind of sentence can now be calculated by referring to the required set of adaptation operations for the different cases as given in Table 2.15.
Retd \ Input    AFF        NEG        INT        NINT
AFF             0          k1         k3         k1 + k3
NEG             k2         0          k3 + k2    k3
INT             k4         k1 + k4    0          k1
NINT            k2 + k4    k4         k2         0

Table 5.1: Adaptation Cost due to Kind of Sentences
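The lookup above can be sketched directly from the four operation costs k1 to k4; the numeric values below are hypothetical placeholders, not values from this work.

```python
# Hypothetical placeholder values for the four operation costs:
# WA/WD of nahiin (k1, k2) and MA/MD of kyaa (k3, k4).
k1, k2, k3, k4 = 10.0, 8.0, 0.5, 0.5

# cost[retrieved kind][input kind]; the diagonal is 0 (Copy only)
KIND_COST = {
    "AFF":  {"AFF": 0,       "NEG": k1,      "INT": k3,      "NINT": k1 + k3},
    "NEG":  {"AFF": k2,      "NEG": 0,       "INT": k3 + k2, "NINT": k3},
    "INT":  {"AFF": k4,      "NEG": k1 + k4, "INT": 0,       "NINT": k1},
    "NINT": {"AFF": k2 + k4, "NEG": k4,      "INT": k2,      "NINT": 0},
}

def kind_adaptation_cost(retrieved, inp):
    """Cost of adapting a retrieved sentence kind to the input kind."""
    return KIND_COST[retrieved][inp]
```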
5.4.2
Below we discuss the cost of adaptation for certain types of verb morphological variations. In particular, we discuss two groups:

(1) the input and the retrieved sentence have the same tense and the same verb form;

(2) the input and the retrieved sentence have the same tense but different verb forms.
Table 5.2: Adaptation Cost for Verb Morphological Variation from Present Indefinite to Present Indefinite. The rows (Retd) and columns (Input) range over the forms M1S, F1S, M1P, F1P, M2S, F2S, M3S, F3S, M3P, F3P; the diagonal entries are 0 (only a Copy is required) and the off-diagonal entries are s + n, where s denotes the cost of the required suffix replacement and n that of the required morpho-word replacement.
The same Table 5.2 works for adaptation from past indefinite to past indefinite with a slight modification. In this case morpho-word replacement is done from the morpho-word set {thaa, the, thii} instead of the morpho-word set {hain, ho, hoon, hai}. Hence if the value of n is replaced by (l1·L/2) + (m·3/2) + (m·3/2) in the cost Table 5.2, one gets the cost table for past indefinite to past indefinite.

In case of adaptation from future indefinite to future indefinite, the cost depends upon two operations, CP and SR. Hence the cost n due to morpho-word replacement (MR) is to be removed from the entries of Table 5.2. The cost s of SR in this case is (l1·L/2) + (k·8/2) + (k·8/2), which is obtained by considering the relevant set of suffixes, viz. {oongaa, oongii, oge, ogii, egaa, egii, enge, engii}.
For all other combinations of verb morphological variations of the same group, one more morpho-word replacement is to be added to the cost Table 5.2 in place of the suffix replacement cost s (as discussed in items 3 to 6 of Section 2.3.1). Here the costs of these two morpho-word replacements will vary according to the tense and verb form. For example, in case of present continuous to present continuous the relevant morpho-word sets are {hain, ho, hoon, hai} and {rahaa, rahii, rahe}. The average costs of these morpho-word replacements are (l1·L/2) + (m·4/2) + (m·4/2) and (l1·L/2) + (m·3/2) + (m·3/2), respectively. The costs of morpho-word replacement for the remaining 5 cases (e.g. future continuous to future continuous, past perfect to past perfect etc.) can be computed in a similar way by referring to the appropriate morpho-word sets.
We do not present all the other cases in this report. However, we present below the adaptation cost table (Table 5.3) for verb morphological variation from present indefinite to past indefinite, which belongs to the group different tense, same verb form. These values are obtained by referring to the adaptation rule Table 2.4.
Table 5.3: Adaptation Cost for Verb Morphological Variation from Present Indefinite to Past Indefinite. The rows (Retd) and columns (Input) range over the forms M1S, F1S, M1P, F1P, M2S, F2S, M3S, F3S, M3P, F3P, and every entry is s + w.
Here the cost s denotes the cost of suffix replacement within the set {taa, te, tii}, which is (l1·L/2) + (k·3/2) + (k·3/2), and w denotes the cost of morpho-word replacement from the morpho-word set {hoon, hai, ho, hain} to the set {thaa, thii, the}, which is (l1·L/2) + (m·4/2) + (m·3/2).
5.4.3
In this subsection we discuss the adaptation cost mainly for three functional tags under the subject/object functional slot. These tags are genitive case (@GEN), pre…
1. The average cost of constituent word replacement from the set P with a proper noun. We denote this by w1. Note that in this case no dictionary search is required, as proper nouns are not stored in any dictionary. Hence w1 is computed as (l1·L/2) + (l2·Lp/2).

2. The average cost of morpho-word replacement (MR) from {kaa, ke, kii} with itself. We denote this cost by w2. Since the number of morpho-words is 3, w2 may be formulated as (l1·L/2) + (m·3/2) + (m·3/2).

3. The average cost of WR from the set P with a noun. This cost is denoted by w3. Note that in this case a noun dictionary search is necessary, for which the search time is 13.77 (see item 4 of Section 5.3). Further, to access the dictionary a cost (c·10⁵) is required. Hence the total cost is (l1·L/2) + (l2·Lp/2) + {(d·13.77) + (c·10⁵)}.
4. The average cost of WR from the set P with a pronoun. This is denoted by w4. Imitating the case just mentioned above, the cost here may be formulated as (l1·L/2) + (l2·Lp/2) + {(d·6.17) + (c·10⁵)}, where 6.17 is the pronoun dictionary search time.
5. The average cost of morpho-word deletion from the set {kaa, ke, kii}. This cost is denoted by w5, which may be formulated simply as (l1·L/2) + (m·3/2) + α.

6. The average cost of morpho-word addition from the set {kaa, ke, kii}. We denote this cost by w6, which is formulated as (l1·L/2) + (m·3/2) + β + α.

7. The average cost of suffix replacement for converting a noun into either an oblique noun form or a plural form (refer to Section 2.5.2 and Appendix A). We denote this cost by s1. Since the number of relevant suffixes is four, s1 may be computed as (l1·L/2) + (k·4/2) + (k·4/2).

8. The average cost of suffix addition for converting a noun into either an oblique noun form or a plural form. This cost of suffix addition can be formulated in a way similar to item 7 above. Here this cost is (l1·L/2) + (k·5/2) + (k·5/2), which we denote as s2.

9. The average cost of suffix addition from the set {kaa, ke, kii} is (l1·L/2) + (k·3/2). We denote it as s3.

10. The average cost of suffix deletion for converting an oblique noun form to a noun, or plural to singular (see Appendix A and Section 2.5.2). This cost is (l1·L/2) + (k·5/2) + α. We denote it as s4.

11. The average cost of suffix replacement from the set {kaa, ke, kii}. We denote this cost by s5, which is formulated as (l1·L/2) + (k·3/2) + (k·3/2).
The cost table corresponding to genitive case to genitive case is given in Table 5.4. It has been formulated in accordance with the adaptation rule Table 2.8.
Table 5.4: Adaptation Cost for Genitive Case to Genitive Case. The rows (Retd) and columns (Input) range over N, <proper> and PRON, and the entries combine the costs w1 to w6 and s1 to s5 defined above; typical entries include (0 or ({w1} + {w2})), w1 + w6, w3 + {s1 or s2} + w6 and (0 or s5 or (w4 + s3)).
1. The average cost of constituent word replacement of the set Q by a noun. This cost is denoted by w1. In this case a noun dictionary search is required, and its search time is 13.77. Hence, w1 is computed as (l1·L/2) + (l2·Lp/2) + {(d·13.77) + (c·10⁵)}.

3. The average cost of constituent word replacement from the set Q to a pronoun. This cost is denoted as w3, which is formulated as (l1·L/2) + (l2·Lp/2) + {(d·6.17) + (c·10⁵)} (same form as given in item 1 above).
4. The average cost of constituent word replacement from the set Q to a gerund (PCP1) is (l1·L/2) + (l2·Lp/2) + {(d·12.08) + (c·10⁵)} (same form as in item 1). Note that here a verb dictionary search is required, and its search time is 12.08. We denote this cost as w4.
5. For converting the singular form of a noun to the plural form, or vice versa (see Appendix A), any one of three different suffix operations is required: suffix replacement (SR), suffix addition (SA) or suffix deletion (SD). The average costs of these operations are:

The cost of SR is (l1·L/2) + (k·4/2) + (k·4/2). We denote it as s1.

The cost of SA is (l1·L/2) + (k·3/2). We denote it as s2.

The cost of SD is (l1·L/2) + (k·3/2). We denote it as s3.

6. The average cost of adding the suffix na to the verb of PCP1 form is (l1·L/2). Note that here only one suffix is applicable in any of the cases, therefore no search is required for deciding about the suffix. This cost is denoted by s4.
Table 5.5: Adaptation Cost under the Subject/Object Functional Slot. The rows (Retd) and columns (Input) range over N, <proper>, PRON and PCP1, and the entries combine the costs wi and si defined above; typical entries include (0 or ({w1} + {s1 or s2 or s3})), w1 + {s1 or s2 or s3}, w3, (0 or w3), w4 + s4 and (0 or w4).
5.4.4
The input sentence may be compared with the example base sentences in terms of functional-morpho tags, their discrepancies may be measured, and the adaptation cost may be estimated using the formulae given above. The example base sentence having the minimum cost of adaptation may then be considered the most similar to the input sentence, and may be retrieved for generating the translation of the given input sentence. Below we compare the proposed scheme with some other similarity measurement schemes. In particular, we consider semantic similarity and syntactic similarity in a way similar to what Manning and Schutze (1999) prescribed for information retrieval.
5.5
5.5.1
Semantic Similarity
Semantic similarity depends on the similarity of the words occurring in the two sentences under consideration. Here we used a purely word-based metric, and developed a vector space model as suggested in (Manning and Schutze, 1999). However, the weighting scheme has been modified (Gupta and Chatterjee, 2002) so that the scheme can be applied to sentences in a meaningful way. Here, each of the example base sentences and the input sentence is represented in a high-dimensional space, in which each dimension corresponds to a distinct word in the example base. Similarity is calculated as the normalized dot product of the vectors. The method is explained below.
Let Ej : j = 1, 2, . . ., N be the English sentences in the example base, and let E0 be the input sentence in English. We represent E0 and each Ej as n-dimensional vectors in a real-valued space, whose dimensions correspond to the distinct words W1, W2, . . ., Wn of the example base. The similarity measure is

m(E0, Ej) = Σ_{i=1}^{n} e0i·eji        (5.1)
This scheme computes how well the occurrences of word Wi (measured by e0i and eji) correlate in the input and the example base sentences. The coordinates eji are called word weights in the vector space model. The basic information used for word weighting is the word frequency (wji) and the sentence frequency (si).

For a word Wi the word frequency wji and the sentence frequency si are combined into a single word weight as

eji = wji·((N/si) − 1),  if wji ≥ 1;
eji = 0,                 if wji = 0.        (5.2)
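A minimal sketch of this scheme follows. The weight formula follows equation (5.2) and the normalised dot product follows equation (5.1); the word-list representation of sentences and the tiny corpus in the usage note are illustrative assumptions.

```python
import math
from collections import Counter

def word_weights(words, vocab, sent_freq, N):
    """e_ji per equation (5.2): w_ji * (N/s_i - 1) when the word occurs."""
    freq = Counter(words)
    return [freq[w] * (N / sent_freq[w] - 1) if freq[w] >= 1 else 0.0
            for w in vocab]

def semantic_similarity(s0, sj, example_base):
    """Normalised dot product of the two weight vectors (equation 5.1)."""
    N = len(example_base)
    vocab = sorted({w for e in example_base for w in e})
    sent_freq = {w: sum(w in e for e in example_base) for w in vocab}
    v0 = word_weights(s0, vocab, sent_freq, N)
    vj = word_weights(sj, vocab, sent_freq, N)
    dot = sum(a * b for a, b in zip(v0, vj))
    n0 = math.sqrt(sum(a * a for a in v0))
    nj = math.sqrt(sum(b * b for b in vj))
    return dot / (n0 * nj) if n0 and nj else 0.0
```

Words of the input that do not occur in the example base fall outside the vocabulary and simply contribute nothing, which matches the definition of the space.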
Given an input sentence, this similarity may be used for retrieving an appropriate past example from the example base. In order to achieve that, the similarity of the input sentence is measured with each of the example base sentences. The one with the highest similarity score may be considered for retrieval.

We have experimented with two input sentences: I work. and Sita sings ghazals. Tables 5.6 and 5.7 provide the best five matches for them, respectively.
Retrieved Sentence          Semantic Score
I do this work.             0.9852
…                           0.9746
…                           0.9543
…                           0.7954
…                           0.6834

Table 5.6: Best Five Matches by Using Semantic Similarity for the Input Sentence I work.
Example Sentence            Semantic Score
Sita sings ghazals.         1.00
…                           0.775
…                           0.733
…                           0.731
…                           0.701

Table 5.7: Best Five Matches by Using Semantic Similarity for the Input Sentence Sita sings ghazals.
One may note that the drawback of this scheme is that the outcome depends significantly on the content words, the size of the example base and the occurrence of the words in the sentences.
5.5.2
Syntactic Similarity
Syntactic similarity pertains to the similarity of the structure of the two sentences under consideration. Let Tj be the tagged version of the English sentence Ej of the example base, and let T0 be the tagged version of the input sentence E0. Here too, every sentence Tj in the example base is expressed as a vector generated from the structure of the sentence. A matching technique similar to that used for semantic similarity has been applied on Tj and T0 (instead of Ej and E0, as discussed in the earlier subsection). As a consequence the similarity measures are computed at the structural level, and not at the word level.

The key question is whether all the components determining the structural similarity are of equal importance. We feel that the contributions of the constituent words to the formation of the sentence are not all the same; in particular, sentences having a similar structure (in terms of verb, auxiliary, adverb etc.) should have a higher similarity measurement value for a given input sentence. Having tried different weighting schemes, we found that the one given in Table 5.8 provides the best result.
POS/syntactic role        Multiplier
Auxv/verb                 20
Preposition               10
Adjective/adverb          …
Subject/object            …
Determiner/negative       0.1

Table 5.8: Weighting Scheme for Different POS and Syntactic Role
Table 5.9 and Table 5.10 give the similarity measures obtained for the example
base corresponding to the input sentences I work and Sita sings ghazals when
the above weighted syntactic similarity scheme has been used.
Retrieved Sentence          Syntactic Score
I walk.                     1.00
I do this work.             0.971
…                           0.968
They walk.                  0.942
…                           0.928

Table 5.9: Best Five Matches by Using Syntactic Similarity for the Input Sentence I work.
Example Sentence            Syntactic Score
…                           1.000
…                           1.000
…                           1.000
…                           0.918
He reads history.           0.907

Table 5.10: Best Five Matches by Using Syntactic Similarity for the Input Sentence Sita sings ghazals.
Note that here similarity of words is completely ignored, as the main emphasis
is laid on the similarity of tense. By resorting to a different weighting scheme one
can change the similarity measures to some extent.
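The weighted matching can be sketched as below. The Counter-based tag vectors are our own simplification, and the multipliers for adjective/adverb and subject/object (not recoverable from Table 5.8) are placeholders.

```python
import math
from collections import Counter

# Multipliers per Table 5.8; only Auxv/verb (20), preposition (10) and
# determiner/negative (0.1) come from the table, the rest are placeholders.
MULTIPLIER = {
    "VERB": 20, "AUXV": 20,
    "PREP": 10,
    "ADJ": 5, "ADV": 5,        # placeholder values
    "SUBJ": 1, "OBJ": 1,       # placeholder values
    "DET": 0.1, "NEG": 0.1,
}

def syntactic_similarity(tags0, tagsj):
    """Normalised dot product over weighted tag-count vectors."""
    c0, cj = Counter(tags0), Counter(tagsj)
    dims = sorted(set(c0) | set(cj))
    v0 = [c0[t] * MULTIPLIER.get(t, 1) for t in dims]
    vj = [cj[t] * MULTIPLIER.get(t, 1) for t in dims]
    dot = sum(a * b for a, b in zip(v0, vj))
    n0 = math.sqrt(sum(a * a for a in v0))
    nj = math.sqrt(sum(b * b for b in vj))
    return dot / (n0 * nj) if n0 and nj else 0.0
```

Because the verb-related dimensions dominate, two sentences agreeing on verb and auxiliary structure score high even when their other tags differ, which is the behaviour described above.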
5.5.3
The above studies reveal that neither the semantic measure nor the syntactic measure provides an effective scheme for calculating the similarity between two sentences. In both cases the measurement score depends to a significant extent on the word weights, which in turn depend on the sentences in the example base. Thus the schemes become highly subjective. We, therefore, look for a method that provides a more objective measurement of similarity. We consider the cost of adaptation for this purpose, which is viewed as the number of operations required for transforming a retrieved translation example into the translation of a given input sentence. We continue with the adaptation operations discussed in Section 5.3.1. The following example illustrates how the functional-morpho tags of an input (IE) and a retrieved example base sentence (RE) can be used for determining the appropriate adaptation operations.
Input sentence (IE): …

Table 5.11 gives the functional-morpho tags of the IE and the RE. To generate the translation ram bahut tezii se gaadii chalaa rahaa hai of the input sentence, the following adaptation operations are required.
IE:                                      RE:
Ram      @SUBJ <Proper> N SG  Ram        he       @SUBJ …
is       @+FAUXV V PRES  be              is       @+FAUXV V PRES  be
driving  …                               sitting  …
the      …                               the      …
car      @OBJ N SG  car                  …
at       @ADVL PREP  at                  on       @ADVL PREP  on
a        @DN> ART  a                     …
high     …                               …
speed    @<P N SG  speed                 chair    @<P N SG  chair

Table 5.11: Functional-morpho Tags for the Input English Sentence (IE) and the Retrieved English Sentence (RE)
(a) Whenever a functional tag along with its morpho tags matches in both sentences but the corresponding words are different, a constituent word replacement operation is required. …

… The Hindi translation of the word drive is to be taken from the verb dictionary.

… the corresponding word in Hindi has to be deleted from the Hindi sentence RH.

(e) Other necessary adaptation operations may be WA (bahut) and SA (tez → tezii). Similarly, these adaptation operations can also be identified.
After the identification of all the adaptation operations required for adapting RE to IE, we have calculated the cost of each of the adaptation operations. For this purpose we have referred to the costs of the different operations listed in Section 5.3.1 and Section 5.4.

In order to apply the cost of adaptation to design an appropriate retrieval scheme one needs to have a measurement of the different constants of proportionality (i.e. l1, l2, d etc.) as described in Section 5.3.1. Evidently, these constants depend upon the underlying computing system. Hence in our discussion we want to keep them independent of any particular platform. We further make a few assumptions in order to keep the calculations relatively simple:
We assume that the linear search operations in the RAM are equally costly irrespective of the size of each data record. Hence we assume that the constants l1, l2, d, m, and k are all equal. Let them have a common value γ.

It has already been discussed in Section 5.3 that hard disc operations are costlier than RAM operations by an order of 10⁵. Hence we denote the constant associated with retrieval from the external storage as c·10⁵, where c is of the same order as γ.

The α and β costs are treated as independent quantities. Here α is a very small quantity and α ≪ γ. On the other hand, β is considered as a large quantity, as discussed in item 5 of Section 5.3.
Retrieved Sentence                          Adaptation Cost
I have been working for four hours.         23γ + 4α
I have not been working for four hours.     30γ + 5α
This works.                                 13.17γ + 10⁵c
…                                           13.67γ + 10⁵c
I walk.                                     16.27γ + 10⁵c

Table 5.12: Retrieval on the Basis of Cost of Adaptation Based Similarity for the Input Sentence I work.
Retrieved Sentence          Adaptation Cost
…                           13γ + α
…                           22γ + 10⁵c + 2α
…                           32.85γ + 2(10⁵c)

Table 5.13: Retrieval on the Basis of Cost of Adaptation Based Similarity for the Input Sentence Sita sings ghazals.
To generate the translation main kaam kartaa hoon of the input sentence I work., the adaptation operations required for adapting each of the sentences given in Table 5.12 above are as follows:

For adapting I have been working for four hours. → main chaar ghante se kaam kar rahaa hoon to the input sentence, five operations are required. These operations are: SA (kar → kartaa), WD (chaar), WD (ghante), WD (se) and MD (rahaa). The total adaptation cost is therefore (5.5γ) + (4γ + α) + (4γ + α) + (4γ + α) + (5.5γ + α) = 23γ + 4α. One may refer to Section 5.3.1 for this computation.
In case of adapting the second retrieved sentence I have not been working for four hours. → main chaar ghante se kaam nahiin kar rahaa hoon to the input sentence, at most six operations are required. These operations are: SA (kar → kartaa), WD (chaar), WD (ghante), WD (se), WD (nahiin) and MD (rahaa). Hence the total adaptation cost is (6γ) + (4.5γ + α) + (4.5γ + α) + (4.5γ + α) + (4.5γ + α) + (6γ + α) = 30γ + 5α.
For adapting This works. → yah kaam kartaa hai to the input sentence, at most two operations, WR (yah → main) and MR (hai → hoon), are required. The total adaptation cost is therefore (9.17γ + c·10⁵) + (2γ + 2γ) = 13.17γ + c·10⁵.
Similarly, one can identify the appropriate adaptation operations for adapting the last two sentences to the input sentence.
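Since the constants are kept symbolic, such totals can be accumulated as coefficient pairs. The sketch below reproduces the first computation above; the tuple representation is our own device, not part of the proposed scheme.

```python
def add(*costs):
    """Sum (gamma, alpha) coefficient pairs componentwise."""
    return tuple(sum(t) for t in zip(*costs))

# Operation costs for adapting "I have been working for four hours."
# expressed as (gamma, alpha) coefficients, per Section 5.3.1.
SA = (5.5, 0)   # SA (kar -> kartaa)
WD = (4.0, 1)   # each of WD (chaar), WD (ghante), WD (se)
MD = (5.5, 1)   # MD (rahaa)

total = add(SA, WD, WD, WD, MD)
print(total)  # (23.0, 4), i.e. 23*gamma + 4*alpha
```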
We now consider the costs of adaptation for the best five sentences that are
retrieved using semantic and syntactic similarity schemes as given in Sections 5.5.1
and 5.5.2.
Retrieved by Semantic Similarity     Adaptation Cost
I do this work.                      23.77γ + 10⁵c + 2α
…                                    22.9γ + 10⁵c + β + 2α
…                                    21.17γ + 10⁵c + α
…                                    21.67γ + 10⁵c + α

Retrieved by Syntactic Similarity    Adaptation Cost
I walk.                              16.27γ + 10⁵c
I do this work.                      23.27γ + 10⁵c + 2α
…                                    19.77γ + 2(10⁵c)
They walk.                           28.94γ + 2(10⁵c)

Table 5.14: Costs of Adaptation of the Best Five Matches Retrieved by the Semantic and Syntactic Similarity Based Schemes for the Input Sentence I work.
First we consider the input sentence I work. Table 5.14 provides the costs of adaptation of the best five matches under the semantic similarity and syntactic similarity based measurement schemes. An examination of the adaptation costs suggests that all five sentences retrieved by the semantic similarity based scheme are costlier to adapt than all the sentences retrieved by the cost of adaptation scheme (see Table 5.12). On the other hand, the sentence I walk., which is retrieved as the best matching sentence under the syntactic similarity based scheme, actually requires more computational effort than the best four sentences given by the cost of adaptation based scheme (see Table 5.12).
Retrieved by Semantic Similarity     Adaptation Cost
Sita sings ghazals.                  0
Ghazals were nice.                   29.08γ + 10⁵c + β + α
…                                    32.85γ + 2(10⁵c)
…                                    13γ + α

Retrieved by Syntactic Similarity    Adaptation Cost
…                                    32.85γ + 2(10⁵c)
…                                    39.85γ + 2(10⁵c)
…                                    39.85γ + 2(10⁵c)
He reads history.                    39.85γ + 2(10⁵c)

Table 5.15: Costs of Adaptation of the Best Five Matches Retrieved by the Semantic and Syntactic Similarity Based Schemes for the Input Sentence Sita sings ghazals.
In a similar way Table 5.15 provides the costs of adaptation for the best five matches retrieved by the semantic and syntactic similarity based schemes for the input sentence Sita sings ghazals. One may note the following by comparing Table 5.13 and Table 5.15.

Sita sings ghazals. is retrieved as the best match by all three schemes because it is already there in the example base.

The second best match Ghazals were nice. under the semantic similarity based scheme is actually very expensive to adapt, as its cost contains a term β. This term occurs since the sentence concerned has a structure that is different from that of the input sentence.
5.5.4
One major drawback of the proposed scheme is that for each input sentence, the scheme essentially boils down to evaluating the cost of adaptation for all the sentences in the example base. This makes retrieval from a large example base computationally very expensive. On the other hand, the use of cost of adaptation as a potential yardstick for measuring similarity is too strong an argument to be ignored with respect to Example Based Machine Translation. This, therefore, necessitates the development of some filtration technique so that, given an input sentence, the example base sentences that are difficult to adapt are discarded. The adaptation scheme can then be applied only on the remaining sentences of the example base. We have designed a systematic two-level filtration scheme for this purpose.
It is clear from the costs of the adaptation operations given in Section 5.3.1 that constituent word addition and constituent word replacement are the costliest adaptation operations in terms of computational cost, with the former being costlier than the latter. Hence the filters are designed to retrieve those example base sentences for which the adaptation of the given input sentence will require fewer constituent word addition and constituent word replacement operations.
The two-level scheme works as follows.
In the first level, the algorithm retrieves sentences that are structurally similar to the input sentence, thereby reducing the number of constituent word additions in the adaptation of the retrieved example. Here functional tags (FTs) are used to determine the structural similarity. We call this step measurement of structural similarity.

In the second level only the sentences passed by the first filter are considered for further processing. Here the dissimilarity of each of these sentences with the input sentence is measured. The lower the dissimilarity score of an example, the lower will be its adaptation cost for generating the required translation. The dissimilarity is measured on the basis of tense and POS tags along with their root words. Henceforth, for notational convenience, we shall call these features the characteristic features of a sentence. This step is denoted as measurement of characteristic feature dissimilarity.
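In outline, the two levels can be sketched as a single retrieval function. Everything here is schematic: `functional_tags` and `dissimilarity` stand in for the ENGCG-based tagging and the measure of Section 5.6.2, and plain sets are used where the text uses bags.

```python
def retrieve(input_sentence, example_base, functional_tags, dissimilarity):
    """Two-level filtration: keep structurally similar examples, then
    rank the survivors by characteristic-feature dissimilarity."""
    fts_in = functional_tags(input_sentence)
    # Level 1: m is the best FT overlap; keep examples with >= ceil(m/2)
    m = max(len(fts_in & functional_tags(d)) for d in example_base)
    survivors = [d for d in example_base
                 if len(fts_in & functional_tags(d)) >= -(-m // 2)]
    # Level 2: sort the survivors by dissimilarity with the input
    return sorted(survivors, key=lambda d: dissimilarity(input_sentence, d))
```

The expression -(-m // 2) is integer ceiling division, matching the ⌈m/2⌉ threshold of the first filter.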
The following examples illustrate the necessity of the two levels of the filtration
scheme. Let us consider the following two sentences:

(A) ek sundar ladkii apne ghar jaa rahii hai
    (a beautiful girl is going to her home)

(B) yeh ghar bahut sundar hai
    (this home is very beautiful)

Even though there are two common words (beautiful and home) between these two sentences, adapting the translation of sentence B to generate the translation of A is not an easy task because of their structural difference. Now consider the sentence:
(C) yeh ladkii office jaa rahii hai
    (this girl is going to office)

This sentence also has two words (girl and going) in common with sentence A. But its adaptation for generating the translation of A is computationally less expensive in comparison with the adaptation of B. In order to adapt C to A, only four adaptation operations are required: WR (yeh → ek), WA (sundar), WR (office → ghar) and WA (apne). The total cost of adaptation for adapting C to A is 46.08γ + 2α + 3(10⁵c) + 2β. Evidently, this cost is much less than the cost of adapting B to A as computed above. This happens because of the structural similarity and the commonality of some characteristic features of sentence C with A.
The above discussion suggests that one of the filters alone is not sufficient; for appropriate filtration both levels are required. The next section discusses the proposed filtration scheme in detail.

5.6

We have used the following notation to describe the filtration scheme. Let L denote the natural language under consideration (here it is English), and let e ∈ L denote an input sentence. S denotes the example base, which is a finite subset of L, and d ∈ S is an example base sentence. The following subsections discuss the above-mentioned levels of the filter.
5.6.1
In this step, the aim is to filter the example base S to produce a subset of S whose sentences are structurally similar to e. The example base is partitioned into equivalence classes of sentences that have the same functional tags (e.g. subject, object, verb etc.). This set of equivalence classes is filtered, and the classes that are similar in structure to the equivalence class of the input sentence are identified. Here too we have used the ENGCG parser for finding the functional tags (FTs).

Given a sentence x ∈ L, let φ(x) be the bag of functional tags present in x. We have used the term bag in place of set, as in a bag repetition of elements is allowed. For example, if x is My brother helps me in my studies. then φ(x) = {@GN>, @SUBJ, @+FMAINV, @OBJ, @ADVL, @GN>, @<P}. Let F be the set of the possible bags of functional tags for the language L.
Note that φ induces an equivalence relation on L. Two sentences e ∈ L and e′ ∈ L are said to be equivalent (notationally, e E e′) if they have the same bag of functional tags. Let [e] denote the equivalence class corresponding to the sentence e, i.e. [e] = {e′ ∈ L | e E e′}. For example, the sentences He drank milk., Sita eats mangoes., They are playing football. and Will Ram marry Sita? are members of the same equivalence class because all these sentences have the same functional tag representation φ(.) = {@SUBJ, @+FMAINV, @OBJ}.
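For illustration, this partition can be sketched using a sorted tuple as the bag key (repetitions kept, order ignored); the `tag` function abstracts the ENGCG parser and is assumed, not implemented.

```python
from collections import defaultdict

def partition(example_base, tag):
    """Group sentences whose bags of functional tags coincide.

    tag(sentence) -> list of functional tags, e.g. ['@SUBJ', '@+FMAINV'].
    """
    classes = defaultdict(list)
    for d in example_base:
        # sorting discards order but preserves repeated tags (a bag key)
        classes[tuple(sorted(tag(d)))].append(d)
    return dict(classes)
```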
Since our concentration is on the example base S, the function φ and the equivalence relation E are restricted to S. Let S′ denote the set of equivalence classes of S, and let m be the maximum number of FTs that an equivalence class of S′ has in common with the input sentence e. The first filter selects those equivalence classes for which the number of common FTs is between ⌈m/2⌉ and m; we denote the set of selected classes by Se′. Thus, by this step, all the equivalence classes having fewer than ⌈m/2⌉ common FTs are discarded. We claim that the sentences having a higher cost of adaptation have been discarded from the set Se′. The proof is given below:
Let n = |φ(e)|, i.e. the number of functional tags present in e is n; evidently, n ≥ m. Let us now consider the examples left out of consideration in Se′ (i.e. the set (S′ − Se′)). Of all the examples belonging to this set, the one that will have the least cost of adaptation should have the following properties:

(a) It should have the maximum number of common FTs with e. We assume that there exists a sentence with (⌈m/2⌉ − 1) FTs common with e.

(b) We further assume that for all these common FTs, the underlying words are also the same as in e.

This means (n − (⌈m/2⌉ − 1)) constituent word additions are required. Therefore, the cost of adaptation of any such sentence will be approximately⁶ (n − ⌈m/2⌉ + 1)·WA, where WA is the cost of constituent word addition, and is ((l1·L/2) + (l2·Lp/2) + {(d·log₂D) + (c·10⁵)} + β + α). For details of this cost, check item 2 of Section 5.3.1. Let us denote this cost by C1. This cost will certainly be more than the cost of adaptation for the sentences having m common FTs with e, i.e. the sentences belonging to the equivalence classes of the set Se′ selected by the first filter. The argument supporting this fact is as follows:
If all the words corresponding to the m common FTs are different from the input sentence constituent words, then the cost of adaptation will be approximately the sum of m constituent word replacements and (n − m) constituent word additions, i.e. m·WR + (n − m)·WA, where WR is the cost of constituent word replacement, and is ((l1·L/2) + (l2·Lp/2) + {(d·log₂D) + (c·10⁵)}). This is the worst case, in which the sentence having m common FTs has all different words; the approximate cost is still less than C1, since WR < WA and m ≥ ⌈m/2⌉.
⁶ We have not added other costs, such as those of the suffix operations and the morpho-word operations.
5.6.2
This filter arranges the sentences of the set Se′ on the basis of the characteristic features (see Section 5.5.4) of a sentence. We have considered the following characteristic features: POS with its root word, namely main verb (V), noun (N), adverb (ADV), adjective (A), pronoun (PRON), determiner (DET), preposition (P), gerund (PCP1) and participles (PCP1, PCP2); and the tense and form of the sentence. Note that we have considered those main verbs whose root forms are other than be or have. We stick to the notation provided by the ENGCG parser (see Appendix B). For convenience of presentation, we denote the above-mentioned ten characteristic features as p1(y), p2(y), ..., p10(y), where y is the root word of the corresponding characteristic feature pi. For example, let us consider the sentence I am sitting on the old chair. According to this example, there are six characteristic features, i.e. p5(I), p1(sit), p7(on), p4(old), p2(chair) and p10(present continuous). Here the verb sit is the root form of the verb sitting.
In the following, we define a dissimilarity measure so that the sentences belonging
to S′e can be arranged in increasing order of dissimilarity score. Note that the
smaller the dissimilarity score, the lower the cost of adapting the corresponding
sentence (to generate the translation of the input sentence e).
Let M be the set of all possible bags of characteristic features. We define a
mapping Φ : L → M such that Φ(x) = {pi(y) : pi(y) is a characteristic feature
of sentence x}. Let Se = ∪_{[d] ∈ S′e} [d] be the set of sentences of all equivalence classes
of S′e. The dissimilarity score of d ∈ Se with respect to e is defined as

    dise(d) = ( Σ_{pi(y) ∈ Δ(e,d)} wi ) + Λ · | |FT(e)| − |FT(d)| |        (5.3)

where Δ(e,d) denotes the symmetric difference Φ(e) Δ Φ(d) of the two bags of
characteristic features, and FT(x) denotes the bag of functional tags of sentence x.
dise(d) gives the dissimilarity score of d ∈ Se with respect to e. Here, Λ is the
cost of finding the location of a new word, which has already been explained in item
5 of Section 5.3, and wi is the weight assigned to the characteristic feature pi(·).
The significance of this dissimilarity function dise(d) and of the weights wi is explained below.
As mentioned earlier, two of the costliest operations from the adaptation point
of view are constituent word addition and constituent word replacement. Thus, the
dissimilarity measure is designed to focus on these two operations. The second term
in the above-mentioned measure corresponds to the approximate cost involved in
constituent word addition (to find the appropriate position). Further, it should be
noted that the cost of adaptation varies with the POS of the word to be added/replaced.
This is because this cost depends on the dictionary size of the concerned POS. The
bigger the dictionary, the longer the search time, and hence the costlier the
required operation. Thus, for the characteristic features pi(y), i = 1, 2, ...,
9, a weight wi is assigned depending on the respective dictionary size. Table 5.16
lists the weights of these characteristic features according to the search time of the
respective dictionaries.
Note that for tense and form (p10), identification cannot be done through dictionary search. Appropriate rules should be developed for this purpose. In our
implementation, we have used 65 rules to take care of the sentences in our example
base. Therefore, the weight 6.02 (log2(65) = 6.02) is assigned to the characteristic
feature p10.
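The quoted weights (13.77 for the noun dictionary of size 13953, 6.17 for the pronoun dictionary of size 72, 6.02 for the 65 tense/form rules) are all consistent with taking wi = log2 of the size of the corresponding search space, i.e. the number of comparisons of a binary search. A minimal sketch under that assumption (the log2 formula is our reading of the quoted values, not a statement from the implementation):

```python
import math

# Search-space sizes: dictionary sizes from Table 5.16; p10 uses the 65
# tense/form rules instead of a dictionary.
sizes = {"p1": 4330, "p2": 13953, "p3": 1027, "p4": 5449, "p5": 72,
         "p6": 72, "p7": 87, "p8": 4330, "p9": 4330, "p10": 65}

# Weight of a feature = log2 of its search space, i.e. the approximate
# number of comparisons of a binary search over the dictionary.
weights = {p: round(math.log2(n), 2) for p, n in sizes.items()}

print(weights["p2"])   # noun dictionary: 13.77
print(weights["p5"])   # pronoun dictionary: 6.17
print(weights["p10"])  # tense/form rules: 6.02
```

Any monotone function of the dictionary size would serve the same purpose; log2 matches the binary-search interpretation of the search time.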
Table 5.16: Weights of the characteristic features (wi = log2 of the respective dictionary size)

POS                        pi    Dictionary size   Weight, wi
Verb (V)                   p1    4330              12.08
Noun (N)                   p2    13953             13.77
Adverb (A)                 p3    1027              10.00
Adjective (ADJ)            p4    5449              12.41
Pronoun (PRON)             p5    72                6.17
Determiner (DET)           p6    72                6.17
Preposition (P)            p7    87                6.44
Gerund (PCP2)              p8    4330              12.08
Participles (PCP1, PCP2)   p9    4330              12.08
For illustration, suppose d1 and d2 are two sentences of Se that have the same bag
of four FTs as e. Then dise(dj) = ( Σ_{pi(y)∈Δ(e,dj)} wi ) + Λ·|4 − 4| = Σ_{pi(y)∈Δ(e,dj)} wi
for j = 1, 2. It is to be noted that the contribution of the second term is zero
for both d1 and d2, since both these sentences have the same FTs as that of e.
Let us now consider two cases:
1. The weights corresponding to all the features are the same, say wi = 1 for i = 1, 2, ...,
10. In this case, dise(dj) = 2 for j = 1, 2.
2. The weights are taken as given in Table 5.16. In this case, dise(d1) = w2 + w2
= 13.77 + 13.77 = 27.54 and dise(d2) = w5 + w6 = 6.17 + 6.17 = 12.34.
Note that in the first case, the dissimilarity scores are the same. But from the adaptation
point of view, the cost involved in adapting d2 to e is much less than that of d1.
This is because d1 has a determiner and a pronoun characteristic feature in
common with e, while d2 has two noun characteristic features in common with e.
Since the search and access time for a dictionary depends upon its size, one has
to look at the sizes of the dictionaries concerned. It is a general observation that
the noun dictionary is much larger than the pronoun and determiner dictionaries.
For example, in our case the sizes are 14000, 70 and 72, respectively (see item 4
of Section 5.3). Consequently, retrieval from the noun dictionary is computationally
costlier than retrieval from the pronoun or determiner dictionaries. This fact is
not reflected if equal weights are assigned to each POS. Hence, in order to assign
priorities to the POS features in such a way that the dissimilarity score reflects the
approximate cost of adaptation, weights are assigned to each POS as given in Table
5.16.
The dissimilarity metric is thus designed so that the dissimilarity score is directly
proportional to the approximate cost of adaptation. Finally, the sentences in Se are
arranged in increasing order of dissimilarity score. A few best sentences are then
considered for the cost of adaptation based scheme, and the best one is retrieved as the
most similar to the given input sentence. In our experiments we have considered the
five best sentences produced by this two-level filtration scheme for evaluation of their
costs of adaptation.
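The two-level retrieval just described can be sketched as follows (the function names and toy scores are illustrative stand-ins; the real scheme uses the FT-overlap filter, the dissimilarity score of Eq. 5.3, and the full cost of adaptation):

```python
def two_level_retrieve(candidates, dis_score, adaptation_cost, k=5):
    """candidates: sentences surviving the first (FT-overlap) filter."""
    # Second filter: rank by dissimilarity score, ascending, keep the k best.
    shortlist = sorted(candidates, key=dis_score)[:k]
    # Apply the expensive cost-of-adaptation scheme only to the shortlist.
    return min(shortlist, key=adaptation_cost)

# Toy scores: (dissimilarity, cost of adaptation) per candidate sentence.
scores = {"s1": (12.3, 40.0), "s2": (27.5, 25.0), "s3": (13.8, 18.0),
          "s4": (30.1, 10.0), "s5": (14.0, 33.0), "s6": (99.0, 1.0)}
best = two_level_retrieve(list(scores),
                          lambda s: scores[s][0],
                          lambda s: scores[s][1], k=5)
print(best)  # s4 -- s6 never reaches the shortlist despite the lowest cost
```

The example also shows the trade-off of the filtration: a sentence outside the five best by dissimilarity is never examined, even if its cost of adaptation happens to be the lowest.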
5.7 Complexity Analysis
The above-discussed filtration scheme aims at improving the efficiency of the cost of
adaptation based scheme. This improvement can be observed by comparing the
worst-case complexities of the two algorithms: the cost of adaptation based scheme
without the two-level filtration, and the cost of adaptation based scheme after the
two-level filtration. These two similarity measurement schemes are denoted as A1 and
A2, respectively. Table 5.17 gives the notations for the different parameters used in the
analysis, and their maximum sizes with respect to our example base.
Table 5.17: Parameters used in the complexity analysis

Parameter                                           Notation   Maximum size
Number of sentences in the example base             N          4000
Length of the input sentence e                      Le         10
Length of an example sentence d                     Ld         10
Number of FTs in a sentence                         LF         5
Number of equivalence classes                       |S′|       162
Number of classes selected by the first filter      |S′e|      |S′|
Number of characteristic features of e              |Φ(e)|     10
Number of characteristic features of d              |Φ(d)|     10
Recall that the set of sentences of all the equivalence classes of S′e is Se. At most two
comparisons are required for finding a characteristic feature: one for POS matching
and the other for root word matching between d and e. The number of POS tags and
root words can be at most equal to the length of a sentence. Thus the total number
of comparisons required is computed as follows:
- The POS of d and e are compared first, which makes the number of comparisons
Le · Ld.
- Then the root words of d and e having the same POS are compared. This requires
Le comparisons.
Hence, the total number of comparisons required for POS and root word matching
between e and d is Le(Ld + 1). Summing over all the sentences of Se, we get
the total complexity as A22 = ( Σ_{i=1}^{|S′e|} Pi ) · (Le(Ld + 1)) = |Se| · (Le(Ld + 1)),
where Σ_{i=1}^{|S′e|} Pi ≤ N.
Finally, the cost of adaptation based scheme is applied on the top few sentences
of Se having the minimum dissimilarity scores. We have considered a set of the first five
sentences in our experiments. This makes the number of comparisons 5·C1.
Hence, the total complexity of the algorithm A2 is given by TA2 = A21 + A22 + 5C1.
Comparing the time complexities of the algorithms A1 and A2,

    TA2/TA1 = (|S′|/N) · [Ld/(Ld + LF)] + (|Se|/N) · [(Ld + 1)/(Ld + LF)] + 5/N
            ≤ (|S′|/N) · [Ld/(Ld + LF)] + (N/N) · [(Ld + 1)/(Ld + LF)] + 5/N
            = c, where c = 0.762.
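The constant c can be checked numerically from the worst-case values of Table 5.17:

```python
# Worst-case parameters from Table 5.17.
N, L_d, L_F, S_prime = 4000, 10, 5, 162

# Upper bound on T_A2 / T_A1, with |S_e| bounded above by N.
c = (S_prime / N) * (L_d / (L_d + L_F)) + (L_d + 1) / (L_d + L_F) + 5 / N
print(round(c, 3))  # 0.762
```

The middle term (Ld + 1)/(Ld + LF) ≈ 0.733 clearly dominates, which is why the bound is sensitive to how much smaller |Se| is than N in practice.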
The above ratio shows that in the worst case the improvement of the algorithm
A2 is about 25%, i.e. the cost of adaptation based scheme needs to be applied on
only 75% of the sentences of the example base. Experimentally, however, we found that for 500
different examples, which are not present in our example base, the improvement is
of the order of about 75%, which is quite a significant improvement. This variation
is mainly due to the fact that during our experiments the cardinality of Se has been
observed to be much less than N. The small ratio |Se|/N substantially reduces the term
(|Se|/N) · [(Ld + 1)/(Ld + LF)], which is the main contributory term towards c.
The retrieval scheme has been developed with respect to simple sentences. However, if the input sentence is complex then its adaptation is not straightforward
(Dorr et al., 1998), (Hutchins, 2003), (Sumita, 2001), (Shimohata et al., 2003)
and (Rao et al., 2000).
Typically, complex sentences are characterized as sentences having more than
one clause, of which one is the main clause and the rest are subordinate clauses
(Wren, 1989), (Ansell, 2000). A relative clause is a type of subordinate clause in
which a relative adjective (who, which etc.) or relative adverb (when, where etc.)
is used as a connective. The clauses may be joined by some connectives, but their
presence is not mandatory. However, in this work we consider complex sentences
with exactly one relative clause, and we further assume that the presence of the connective is
mandatory. Even with this simplifying assumption, we find that translating complex
sentences under an EBMT framework is relatively difficult. This is because English
complex sentences having the same connectives are often translated in different ways
in Hindi. Consequently, for a given complex English sentence, finding its suitable
match from the example base is difficult. And even when it is found, its adaptation
may not be straightforward. The following section illustrates these points.
5.8 Translation Patterns of Complex Sentences
Even for complex sentences having the same connective (e.g. who, when, where,
which) the structure of the translations may vary. For illustration, consider the
four examples given below. Each of these English sentences may have at least
four possible variations depending on the positions in which the Hindi connectives
are used. It may further be noticed that although the keywords of all these
four sentences are the same7 (subject to morphological variations), their translation
patterns vary according to the role of the connective, and the role of the noun
modified by the relative clause. If the relative adjective "who" plays the role
of subject in the relative clause, then the Hindi relative adjective may be one
of "jo", "jis" or "jin", depending upon the tense and form (i.e. for present
perfect, past indefinite and past perfect) of the main verb of the relative clause.
The items (A), (B), (C) and (D) below show the four sentences and their Hindi
translations.
(A) The policeman who chased the thief was tall.
wah sipaahii jo chor kaa piichhaa kartaa thaa, lambaa thaa
wah sipaahii, jis ne chor kaa piichhaa kiyaa lambaa thaa
jo sipaahii chor kaa piichhaa kartaa thaa wah lambaa thaa
7 policeman - sipaahii, thief - chor, to chase - piichhaa karnaa, I - main, tall - lambaa, to know - jaannaa
The above discussion suggests that the retrieval and adaptation strategies for
complex sentences may need to take care of a large number of variations for
each connective word and its usage. Creating the adaptation rules and implementing such a number of possibilities is not an easy task. To overcome this
problem, we propose a split and translate scheme for handling complex sentences
in an EBMT framework. The proposed scheme works as follows:
1. First it checks whether the input sentence is complex. If yes, then it executes
the following steps:
2. It splits the input sentence into two simple sentences RC and MC, corresponding to the relative clause and the main clause of the complex sentence.
3. Using the cost of adaptation based scheme, it retrieves the sentences most similar to RC and MC. Let these retrieved sentences be denoted as R1 and R2,
respectively.
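The control flow of the split and translate scheme can be sketched as follows (all the helper functions are illustrative stand-ins for the actual splitting, retrieval and adaptation procedures):

```python
def split_and_translate(e, is_complex, split, retrieve, adapt, combine):
    if not is_complex(e):                   # step 1: simple sentences go
        return adapt(retrieve(e), e)        # through the usual EBMT pipeline
    rc, mc = split(e)                       # step 2: relative / main clause
    r1, r2 = retrieve(rc), retrieve(mc)     # step 3: most similar examples
    return combine(adapt(r1, rc), adapt(r2, mc))

# Toy stand-ins, just to show the control flow:
is_complex = lambda e: " when " in e
split = lambda e: tuple(reversed(e.split(" when ")))   # (RC, MC)
retrieve = lambda s: s                                 # pretend exact match
adapt = lambda r, s: f"T({s})"                         # pretend translation
combine = lambda t_rc, t_mc: f"jab {t_rc} tab {t_mc}"

out = split_and_translate("you should speak hindi when you go to india",
                          is_complex, split, retrieve, adapt, combine)
print(out)  # jab T(you go to india) tab T(you should speak hindi)
```

The combination step shown here is the one used for "when"-type connectives; other connectives require different morpho-word placements, as discussed in Section 5.10.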
5.9 Splitting Rules for Converting Complex Sentences into Simple Sentences
Various approaches have been suggested in the literature for the splitting of complex sentences. For example:
1. Furuse et al. (1998, 2001) proposed a technique where a sentence is split
according to sub-trees and partly constructed parse trees.
2. Takezawa (1999) recommended a technique based on word-sequence characteristics.
3. Doi and Sumita (2003) proposed two methods: Method-T and Method-N.
Method-T uses three criteria, viz. fault-length, partial-translation-count and
combined-reliability. On the other hand, Method-N uses a pre-process splitting
method based on N-grams of POS subcategories.
Many approaches exist for splitting complex sentences (typically for English),
e.g. (Orasan, 2000), (Sang and Dejean, 2001) and (Clough, 2001). The technique
used by us is similar in nature to those proposed in (Leffa, 1998) and (Puscasu, 2004).
They suggest three ways in which a sentence can be segmented at the clause level:
(1) starting with the first word in the sentence, and processing it from left to
right, word by word, until all the clauses are identified;
(2) starting with formal indicators of subordination/coordination, and proceeding
until the end of the clause is found;
(3) starting with the verb phrase, identifying the verb type, and locating its subject and complements.
In our approach, we have used the first two methods. We have developed heuristics
to split a complex sentence into two simple sentences: one related to the main clause
and the other to the relative clause. The advantage is that both simple
sentences can then be translated independently using the retrieval and adaptation
procedures developed for dealing with simple sentences. For this work we made the
following assumptions about the input sentence:
- The sentence has only one relative clause, and a connective must be present.
- The connectives that we have considered are when, where, whenever, wherever, who, which, whose, whom, whoever, whichever, that, whomever, what and
whatever.
The algorithm makes use of the delimiter of the input sentence as well. We
illustrate this technique with respect to the delimiters "." and "?".
In the following subsections, we discuss the splitting rules for complex sentences
having any of the following connectives: when, where, whenever, wherever
and who. Since the splitting rules for some of these connectives are the same,
the following subsection considers the connectives when, where, whenever and
wherever together. The subsequent Subsection 5.9.2 discusses the splitting rule
for complex sentences having the connective who.
5.9.1 Splitting Rules for the Connectives when, where, whenever and wherever
Module 1
Module 1 identifies whether a given input sentence e is a complex sentence or not.
If e is complex, then the module identifies the position of the relative adverb, which
can be one of when, whenever, where or wherever. The algorithm considers
the two possible positions of the relative clause: i.e. the relative clause is present
before the main clause, or after it. Depending upon the position of the relative
clause, the algorithm proceeds to Module 2 or Module 3.
Figure 5.1 provides a schematic view of this module.
Let the input sentence be e = e1, e2, . . ., en; where f1, f2, . . ., fn are the
corresponding functional-morpho tags;
IF(f1 = @ADVL)
THEN {
    IF(Roote(e1) = "where" OR "when" OR "wherever" OR "whenever")
    THEN {PRINT "Complex sentence"; GO TO Module 2;}
    ELSE
    {
        PRINT "Simple sentence";
        EXIT;
    }
}
ELSE
{
    j = 0;
    For(i = 2, 3, . . ., n)
    {
        IF(Roote(ei) = "where" OR "when" OR "wherever" OR "whenever")
            j = i;
    }
    IF(j = 0)
    THEN {Print "Simple sentence";}
    ELSE {Print "Complex sentence"; GO TO Module 3;}
}
Figure 5.1: Schematic View of Module 1 for Identification of Complex Sentence with
Connective any of when, where, whenever, or wherever
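Module 1 can be sketched in Python as follows (the exact branch conditions follow the description above and Example 2; `Roote` is modelled by a list of root words, and the tag/root lists are simplified renderings of the ENGCG output):

```python
CONNECTIVES = {"when", "where", "whenever", "wherever"}

def module1(tags, roots):
    """Classify e and choose the splitting module (sketch of Figure 5.1)."""
    # Relative clause first: the sentence starts with an @ADVL connective.
    if tags[0] == "@ADVL" and roots[0] in CONNECTIVES:
        return "complex", "Module 2"
    # Otherwise scan the rest of the sentence for a connective (j = position).
    j = 0
    for i in range(1, len(roots)):
        if roots[i] in CONNECTIVES:
            j = i + 1                      # 1-based position, as in the thesis
    if j == 0:
        return "simple", None
    return "complex", "Module 3"

# Example 2: "Will you bring anyone along when you return from town?"
tags = ["@+FAUXV", "@SUBJ", "@-FMAINV", "@OBJ", "@ADVL",
        "@ADVL", "@SUBJ", "@+FMAINV", "@ADVL", "@<P"]
roots = ["will", "you", "bring", "anyone", "along", "when",
         "you", "return", "from", "town"]
print(module1(tags, roots))  # ('complex', 'Module 3')
```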
Example 2:
Consider another input sentence e: "Will you bring anyone along when you return
from town?". Its parsed version f is:
@+FAUXV V AUXMOD VFIN will, @SUBJ PRON PERS SG2/PL2
you, @-FMAINV V INF bring, @OBJ PRON SG anyone, @ADVL
ADV ADVL along, @ADVL ADV WH when, @SUBJ PRON PERS
SG2/PL2 you, @+FMAINV V PRES VFIN return, @ADVL PREP
from, @<P N NOM SG town <$?>
The length of e is 10, and the bag of functional tags is {@+FAUXV, @SUBJ, @-FMAINV, @OBJ, @ADVL, @ADVL, @SUBJ, @+FMAINV, @ADVL, @<P}. Since
f1 is not @ADVL, the module checks for the presence of any of the connectives when,
whenever, where or wherever in e. The connective "when" is present at the
6th position, i.e. j = 6. Hence Module 1 concludes that the given input sentence e is
complex, and for the splitting of e, the algorithm should proceed to Module 3.
Module 2
If the relative adverb is the first word of the given input sentence e, then the sentence
is split in Module 2. Figure 5.2 gives a schematic view of this module. Table
5.19 gives the typical sentence structures that can be handled by this module:
the relative clause is present at the beginning, followed by the main clause. In this
module, along with the position of the relative clause, the position(s) of the subject(s)
is used to split the complex sentence. In the following, we assume the length of e to
be n. The sub-steps of this module are as follows:
8 The ENGCG parser always assigns either the @+FMAINV or @+FAUXV tag to the first
occurrence of a verb, whether it is a main verb or an auxiliary verb. All other verbs (main or
auxiliary) in the sentence are denoted with either the @-FMAINV or @-FAUXV tag.
Let K be the number of subjects in e;
IF((delimiter of e is ".") OR (K = 1))
THEN {
    l = 0;
    For(i = 2, 3, . . ., n)
    {
        IF(fi = @+FMAINV OR @+FAUXV)
            IF(l = 0)
            THEN l++;
            ELSE {l = i; Break;}
    }
    - The string e2, e3, . . ., el−1 constitutes a simple
      sentence (say RC), which is the relative clause;
    - The string el, el+1, . . ., en constitutes a simple
      sentence (say MC), which is the main clause;
    - The functional-morpho tags of RC are f2, f3, . . ., fl−1;
    - The functional-morpho tags of MC are fl, fl+1, . . ., fn;
    - Delimiter of RC is ".";
    IF(delimiter of e is "?")
    THEN {delimiter of MC is "?";}
    ELSE {delimiter of MC is ".";}
}
ELSE
{
    IF(delimiter of e is "?")
    {
        m = 0;
        For(i = 2 to n)
        {
            IF(fi = @SUBJ OR @F-SUBJ)
                IF(m = 0)
                THEN m++;
                ELSE {m = i; Break;}
        }
    }
    \ Now the algorithm finds the attributes (pre-modifier adjective,
      determiner etc.) of the second subject \
    k = m − 1;
    WHILE((k > 2) AND (fk = @N OR @DN> OR @NN> OR @GN>
          OR @AN> OR @QN> OR @AD-A>))
        k−−;
    - The string ek, ek+1, . . ., en constitutes the simple
      sentence (say MC), which is the main clause;
    - The string e2, e3, . . ., ek−1 constitutes the simple
      sentence (say RC), which is the relative clause;
    - The functional-morpho tags of MC are fk, fk+1, . . ., fn;
    - The functional-morpho tags of RC are f2, f3, . . ., fk−1;
    - Delimiter of RC is ".";
    - Delimiter of MC is "?";
}
Figure 5.2: Schematic View of Module 2
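The first branch of Module 2 (one subject) can be sketched as follows, using the example discussed next ("Whenever you go to India, speak Hindi."); the tag list is a simplified rendering of the ENGCG output:

```python
def module2(words, tags):
    """Split e (connective first, one subject) at the second finite verb."""
    finite = [i for i, f in enumerate(tags)
              if f in ("@+FMAINV", "@+FAUXV")]
    l = finite[1]              # 0-based index of the second finite verb
    rc = words[1:l]            # e2 .. e(l-1): drop the connective itself
    mc = words[l:]             # el .. en
    return rc, mc

# "Whenever you go to India, speak Hindi."
words = ["whenever", "you", "go", "to", "india", "speak", "hindi"]
tags = ["@ADVL", "@SUBJ", "@+FMAINV", "@ADVL", "@<P", "@+FMAINV", "@OBJ"]
print(module2(words, tags))
# (['you', 'go', 'to', 'india'], ['speak', 'hindi'])
```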
Our discussion of Module 1 concluded with the remark that the complex sentence
given in Example 1 should be split using Module 2. We now continue with the
same example, "Whenever you go to India, speak Hindi.", to show how Module 2 splits
this sentence into two simple ones. In this example, the number of subjects is one,
i.e. K = 1, and the delimiter is ".". The module now determines the
position of the second occurrence of the @+FMAINV or @+FAUXV tag, which is found
to be at the 6th position, i.e. l = 6. Hence the input complex sentence is split
into simple sentences as follows:
- The 2nd to 5th words constitute a simple sentence RC, i.e. "You go to India.", with
delimiter ".". This is the part corresponding to the relative clause, and its functional-morpho
tags are:
@SUBJ PRON PERS SG2/PL2 you, @+FMAINV V PRES VFIN go,
@ADVL PREP to, @<P <Proper> N SG India <$.>.
- The 6th and 7th words constitute the simple sentence MC, i.e. "Speak Hindi.", with
delimiter ".". This is the part corresponding to the main clause, and its functional-morpho
tags are: @+FMAINV V IMP speak, @OBJ <Proper> N SG Hindi <$.>.
Module 3
If the relative adverb (or connective) is not the first word of the given input sentence,
then the sentence is split by this module. In this case, the relative clause is present
after the main clause, i.e. the relative clause is located towards the end of the sentence.
Let the position of the relative adverb (as identified in Module 1) be j. In this case,
According to Module 1, the rule given in Module 3 splits the complex sentence
discussed in Example 2, i.e. "Will you bring anyone along when you return from town?".
As discussed in Module 1, for this input sentence e the value of j is 6. Hence
the splitting in this case is as follows:
- Words e1 to ej−1 constitute the main clause MC, with functional-morpho tags
f1, f2, . . ., fj−1;
- Words ej+1 to en constitute the relative clause RC, with functional-morpho tags
fj+1, fj+2, . . ., fn.
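The Module 3 split at the 1-based connective position j can be sketched as:

```python
def module3(words, tags, j):
    """Relative clause follows the main clause; the connective is at the
    1-based position j. MC = e1..e(j-1), RC = e(j+1)..en."""
    mc, mc_tags = words[:j - 1], tags[:j - 1]
    rc, rc_tags = words[j:], tags[j:]
    return (mc, mc_tags), (rc, rc_tags)

# Example 2, j = 6: "Will you bring anyone along when you return from town?"
words = ["will", "you", "bring", "anyone", "along", "when",
         "you", "return", "from", "town"]
tags = ["@+FAUXV", "@SUBJ", "@-FMAINV", "@OBJ", "@ADVL",
        "@ADVL", "@SUBJ", "@+FMAINV", "@ADVL", "@<P"]
(mc, _), (rc, _) = module3(words, tags, 6)
print(mc)  # ['will', 'you', 'bring', 'anyone', 'along']
print(rc)  # ['you', 'return', 'from', 'town']
```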
5.9.2 Splitting Rules for the Connective who
Here we discuss the algorithm for splitting complex sentences when the connective
is who. It should be noted that in this case the relative clause can occur either
embedded within the main clause, or after the main clause. In both cases,
there are two possible functional tags for the connective word who, i.e. @SUBJ and
@OBJ. The algorithm takes care of all these possibilities. The algorithm is divided
into four modules, which are given in Figures 5.4, 5.6, 5.7 and 5.8. Along with these
four modules, there is a subroutine SPLIT, given in Figure 5.5. A brief outline of
these modules and the subroutine SPLIT is as follows:
- Module 1 checks whether the given input sentence is complex or not. If
the sentence is complex with the connective who, then depending on the
position of the clause and the delimiter of the sentence it routes the algorithm
to the appropriate module.
- Module 2 splits those complex sentences in which the relative clause is embedded in the main clause, and the delimiter of the sentence is ".". Table 5.21
provides the typical sentence structures considered in this module.
- The complex sentences in which the relative clause follows the main clause are
split in Module 3. Here also the delimiter of the sentence under consideration should be
".". The sentence structures considered in this module are exemplified in Table 5.22.
- Irrespective of the position of the relative clause, Module 4 splits those complex
sentences for which the delimiter is "?". Examples given in Table 5.23
demonstrate the sentence structures considered in this module.
- The algorithm for splitting those complex sentences in which the relative clause
is embedded in the main clause is given in the subroutine SPLIT. This subroutine
accepts two arguments: an integer x and a character y. x gives the position of the
splitting point, and y provides the delimiter of the simple sentence that is
a part of the main clause.
Let the input sentence be e = e1, e2, . . ., en, where f1, f2, . . ., fn are the
corresponding functional-morpho tags;
j = 0;
For(i = 1 to n)
    IF(Roote(ei) = "who" AND <Rel> is a morpho-tag of fi)
    THEN {Print "Complex sentence"; j = i;}
IF(j = 0)
THEN {Print "Simple sentence"; Exit;}
Flag = 0;
p = 0;
For(i = j − 1 to 1)
    IF(fi = @+FAUXV OR fi = @+FMAINV)
        {Flag = 1; p = i; break;}
IF(Flag = 0)
THEN GO TO Module 2;
ELSE IF(delimiter of sentence = ".")
THEN GO TO Module 3;
ELSE IF(fp = @+FAUXV)
THEN GO TO Module 4;
ELSE Print "Sentence cannot be split";
Figure 5.4: Schematic View of Module 1 for Identification of Complex Sentence with
Connective who
ELSE
\ Here fj = @OBJ \
{
    c = 0;
    For(i = j + 1 to x − 1)
        IF(fi = @-FMAINV OR fi = @+FMAINV)
            {c = i; break;}
    - Words ex to en constitute the main clause, with functional-morpho
      tags fx to fn;
    Modification:
    - A new word, either "him", "her" or "them", is placed
      after the cth word. The functional-morpho tag of this
      new word will be @OBJ PRON PERS MASC SG3 "he",
      @OBJ PRON PERS FEM SG3 "she" or
      @OBJ PRON PERS PL3 "they".
      The choice of this new word depends on the gender and
      number of el, where l is such that fl = @SUBJ, 1 ≤ l ≤ j − 1.
}
Delimiter of RC is ".";
Delimiter of MC is y;
Exit;
Figure 5.5: Schematic View of the SUBROUTINE SPLIT
k = 0;
count = 0;
For(i = j + 1 to n)
{
    IF(fi = @+FAUXV OR fi = @+FMAINV)
        IF(count = 0)
        THEN count++;
        ELSE {k = i; break;}
}
CALL SUBROUTINE SPLIT(k, ".");
\ This module splits sentences in which the relative clause succeeds the
main clause. \
IF(fj ≠ @OBJ)
THEN {
    - Words e1 to ej−1 of the given input sentence e form
      the main clause, which is a simple sentence, say MC;
    - Functional-morpho tags f1 to fj−1 of the parsed
      version of e give the parsed version of MC;
    - Words ej+1 to en form the relative clause, which is a
      simple sentence, say RC;
    \ The gender and number of the first occurrence of a word having any
      of the @<P, @OBJ, @PCOMPL-O, @I-OBJ, @PCOMPL-S tags is determined
      below. This word is searched for from the (j − 1)th word back to the
      first word of e. \
    NUMBER = ∅;
    GENDER = ∅;
    For(i = j − 1 to 1)
    {
        IF(fi = @<P OR @OBJ OR @I-OBJ OR @PCOMPL-S OR
           @PCOMPL-O)
        {
            GENDER = gender of the ith word;
            NUMBER = number of the ith word;
        }
        IF(NUMBER ≠ ∅) break;
    }
    IF(GENDER = ∅) GENDER = MASC;
}
ELSE
{
    - Words e1 to ej−1 form the main clause, which
      is a simple sentence, say MC;
    - Functional-morpho tags f1 to fj−1 give the parsed
      version of MC;
    - Words ej+1 to en form the relative clause;
    c = 0;
    For(i = j + 1 to n)
        IF(fi = @+FMAINV OR fi = @-FMAINV)
            {c = i; break;}
    \ The gender and number of the first occurrence of a word having any
      of the @<P, @OBJ, @PCOMPL-O, @I-OBJ, @PCOMPL-S tags is determined
      below. This word is searched for from the (j − 1)th word back to the
      first word of e. \
    NUMBER = ∅;
    GENDER = ∅;
    For(i = j − 1 to 1)
    {
        IF(fi = @<P OR @OBJ OR @I-OBJ OR @PCOMPL-S OR
           @PCOMPL-O)
        {
            GENDER = gender of the ith word;
            NUMBER = number of the ith word;
        }
        IF(NUMBER ≠ ∅) break;
    }
    IF(GENDER = ∅) GENDER = MASC;
    - A new word w, which can be either "him", "her" or "them",
      is placed after the cth word. The functional-morpho
      tag of w will be @OBJ PRON PERS MASC SG3 "he",
      @OBJ PRON PERS FEM SG3 "she" or
      @OBJ PRON PERS PL3 "they".
      The choice of w depends on GENDER and NUMBER.
    - Words ej+1 to en (with the new word w) constitute the
      relative clause; call it RC;
    - Except for the functional-morpho tag of the (c + 1)th word,
      the functional-morpho tags of the constituent words of RC
      are obtained from the functional-morpho tags
      of the corresponding words of e.
}
Delimiter of MC is ".";
Delimiter of RC is ".";
Exit;
r = 0;
For(i = p + 1 to j − 1)
{
    IF(fi = @-FMAINV AND fi−1 = @INFMARK>) r = i;
    IF(r ≠ 0) break;
}
IF(r ≠ 0)
\ This implies that the relative clause follows the main clause \
{CALL Module 3a, where Module 3a is the same as Module 3 with one
modification, i.e. the delimiter of MC is "?" instead of ".";}
5.10 Adaptation Procedures for Complex Sentences
The predicative of a sentence whose main verb has the root form "be" can be any one (or
a combination) of a subjective complement, a prepositional phrase or an adverb.
5.10.1 Adaptation for the Connectives when, where, whenever and wherever
Since the Hindi translation patterns of complex sentences having any of the connectives
when, where, whenever or wherever are the same, the adaptation procedure
for such complex sentences is discussed collectively. Table 5.24 gives the translations
of the above-mentioned connectives, and Table 5.25 provides the possible structures
of English sentences and their Hindi translations for these connectives (refer to Tables 5.19 and 5.20
for examples of these sentence patterns). Since the correlative adverb is frequently
not expressed in the Hindi translation of complex sentences having any of the above-mentioned
connectives (Bender, 1961), (Kachru, 1980), the correlative adverb is
given in { }.
Table 5.24: Hindi translations of the relative adverbs

English Relative Adverb   Hindi Relative Adverb   Hindi Correlative Adverb
when                      jab                     {tab}
where                     jahaan                  {vahaan}
whenever                  jab bhii                {tab}
wherever                  jahaan bhii             {vahaan}
The adaptation procedure for generating the translation of the complex sentences
under consideration is discussed below. Suppose R1 and R2 are the sentences with the
least cost of adaptation that have been retrieved from the example base corresponding
to RC and MC, respectively. The steps of this procedure are as follows:
1. Adapt the translation of R1 to the translation of RC.
2. Adapt the translation of R2 to the translation of MC.
It may be noted that the total cost involved in generating the translation of a
given complex sentence depends on the cost of adapting the translations of R1 and
R2 to the translations of RC and MC, respectively. This is because the cost involved
in the one (or two) morpho-word additions (required in step 3) is fixed: the relative
and correlative adverbs always occur at the beginning of the
Hindi translations of the RC and MC sentences (refer to Table 5.25), respectively, so no
search is required to find the correct position for the morpho-word in the Hindi sentence.
Further, the cost of concatenating these two translations is also fixed.
Assuming that the costs of adapting the translations of R1 and R2 to the translations
of RC and MC are c1 and c2, respectively, the total cost involved in
generating the translation of the given complex sentence is c1 + c2 plus these fixed
morpho-word addition and concatenation costs.
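For these connectives, the combination step reduces to a lookup in Table 5.24 and two fixed-position insertions; a sketch (the function name `combine` is illustrative):

```python
RELATIVE = {"when": "jab", "where": "jahaan",
            "whenever": "jab bhii", "wherever": "jahaan bhii"}
CORRELATIVE = {"when": "tab", "where": "vahaan",
               "whenever": "tab", "wherever": "vahaan"}

def combine(connective, rc_translation, mc_translation):
    # Both morpho-words go to fixed (initial) positions, so no position
    # search is needed: the addition cost is a constant.
    return (f"{RELATIVE[connective]} {rc_translation} "
            f"{CORRELATIVE[connective]} {mc_translation}")

print(combine("when", "tum india jaate ho", "tum ko hindi bolnii chaahiye"))
# jab tum india jaate ho tab tum ko hindi bolnii chaahiye
```

The output reproduces the translation obtained in Illustration 1 of Section 5.11.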
5.10.2 Adaptation for the Connective who
This section discusses the adaptation procedure for complex sentences having the
connective who. It may be noted that there are many variations in sentence
structure with this connective (refer to Table 5.19). For illustration, we consider
the sentence pattern given in Table 5.26. In this pattern the connective who
plays the role of subject in the relative clause of the English sentence.
Two different Hindi translation patterns may be noted corresponding to the
above-mentioned English sentence pattern. In the first pattern, the relative adjective
"jo" occurs at the beginning of the relative clause, whereas in the other pattern "jo"
occurs before the subject slot of the main clause. The noun in the main clause
which the relative clause modifies14 is represented by "wah" or "we", depending
upon the number of the noun (Bender, 1961), (Kachru, 1980).
14 For the sentences under consideration, this noun is the subject of the main clause.
The adaptation procedure for generating the translation of the complex sentences
under consideration is discussed below. Suppose R1 and R2 are the sentences with the
least cost of adaptation that have been retrieved from the example base corresponding
to RC and MC, respectively. The steps of this procedure are as follows:
the set {wah, we}, and the other morpho-word is "jo". The positions of the
morpho-words in the two patterns are given below:
- For the first pattern, the morpho-word "jo" is added at the beginning of
the translation of RC, and depending on the number of the subject of
MC, the morpho-word "wah" or "we" is added at the beginning of
the translation of MC.
- For the second pattern, the morpho-word "jo" is added at the beginning
of the translation of MC, and depending on the number of the subject
of MC, a morpho-word (either "wah" or "we") is added after the subject
slot of the translation of MC.
4. Combine the (modified) translations of RC and MC. For both translation patterns,
the translation of RC is embedded in the translation of MC after the
subject slot.
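For the first translation pattern, steps 3 and 4 can be sketched as follows (`adapt_who` and its argument names are illustrative; the data reproduce Illustration 2 of Section 5.11):

```python
def adapt_who(rc_tr, mc_words, subj_end, number="SG"):
    """First pattern: jo heads RC, wah/we heads MC, and RC is embedded
    into MC just after the subject slot (mc_words[0:subj_end])."""
    noun = "wah" if number == "SG" else "we"
    mc = [noun] + mc_words          # wah/we at the beginning of MC
    rc = ["jo"] + rc_tr             # jo at the beginning of RC
    k = subj_end + 1                # subject slot shifted by the new word
    return " ".join(mc[:k] + rc + mc[k:])

# Illustration 2: RC translation (subject slot deleted) and MC translation.
out = adapt_who(["hindi", "siikhanaa", "chaahtaa", "hai"],
                ["vidyarthii", "ko", "yah", "kitaab", "padhnii", "chaahiye"],
                subj_end=2)
print(out)
# wah vidyarthii ko jo hindi siikhanaa chaahtaa hai yah kitaab padhnii chaahiye
```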
The total cost involved comprises the following components:
1. The cost of adapting the translation of R1 to the translation of RC. Let this cost
be c1. In this case, the cost involved in adapting the translation of the subject
slot is not included.
2. The cost of deleting the subject slot from the translation of RC. Let us denote this
cost by w.
3. The cost of adapting the translation of R2 to the translation of MC. Let this cost
be denoted as c2.
Thus the total cost involved for the two translation patterns is the sum of all the
above-mentioned costs. The two simple sentences R1 and R2 are retrieved from
the example base so that the translation of the given complex sentence is generated
with the minimum total cost of adaptation.
We have formulated the adaptation procedures for the other complex sentence structures
having the connective who in a similar way. However, due to the similar nature of the
discussion we do not elaborate on them in this report.
The above-discussed adaptation procedures are illustrated in the following subsection.
In particular, we show, for a given complex sentence, how the scheme retrieves
two similar simple sentences from the example base that can be used to generate
the translation of the input complex sentence.
5.11 Illustrations
The adaptation procedures for complex sentences are explained using two illustrations.
5.11.1 Illustration 1
Suppose the input sentence is "You should speak Hindi when you go to India.". After
splitting, the two simple sentences are:
RC: You go to India.
MC: You should speak Hindi.
The five most similar sentences for RC and MC, obtained by applying the cost of
adaptation based scheme, are given in Table 5.27 and Table 5.28, respectively.
Table 5.27: Retrieved sentences for RC with their costs of adaptation.
Table 5.28: Retrieved sentences for MC with their costs of adaptation.
To obtain the translations of RC and MC, we consider the first sentences of Table
5.27 and Table 5.28, respectively. Thus, R1 is "You go to school." and R2 is "He
should speak English.". The Hindi translations of these sentences are adapted to
the translations of RC and MC.
The morpho-words "jab" and "tab" are then added at the beginning of the Hindi
translations of RC and MC, respectively. After this modification, the two sentences
are concatenated. Hence the desired translation of the given input sentence is "jab
tum india jaate ho tab tum ko hindi bolnii chaahiye".
5.11.2
Illustration 2
Let us consider another input sentence The student who wants to learn Hindi should
study this book. and its parsed version:
@DN> ART the, @SUBJ N SG student, @SUBJ <Rel> PRON
WH SG/PL who, @+FMAINV V PRES want, @INFMARK>
INFMARK> to, @-FMAINV V INF learn, @OBJ <proper> N SG
Hindi, @+FAUXV V AUXMOD should, @-FMAINV V INF study,
@DN> DEM this , @OBJ N SG book < $. >
After applying the algorithm for splitting a complex sentence (refer to Figure 5.4 and Figure 5.6), two simple sentences RC and MC are obtained. These are as follows:
RC : The student wants to learn Hindi.
MC :
The student should study this book. @DN> ART the, @SUBJ N
SG student, @+FAUXV V AUXMOD should, @-FMAINV V INF
study, @DN> DEM this , @OBJ N SG book < $. >
The five most similar sentences for RC and MC are given in Table 5.29 and Table 5.30, respectively. One point to note here is that the cost of obtaining the translation of the subject slot of RC is not considered in Table 5.29.
Table 5.29 (retrieved sentences for RC) and Table 5.30 (retrieved sentences for MC) list the candidate sentences together with their costs of adaptation, such as 17.58 + c·10^5, 23.08 + c·10^5 and 37.85 + 2(c·10^5); the retrieved candidates include He is learning Hindi.
For all the possible combinations of sentences given in Tables 5.29 and 5.30, the cost of adaptation involved in generating the translation of the input sentence is calculated in the way explained in Section 5.10. The minimum cost of adaptation, (17.58 + c·10^5) + (3 + …) + (24 + …) + (0.5 + …) + (3 + …) = 48.08 + c·10^5 + …, is obtained for the following sentences:
R1: He likes to learn Hindi.
R2: The student wants to study this book.
Hence the translation of the given input sentence, after generating the translation of RC (i.e. wah (he) hindi (Hindi) siikhanaa (to learn) chaahtaa hai (likes)) and MC (i.e. vidyarthii (student) ko yah (this) kitaab (book) padhnii (study) chaahiye (should)), and appending the relative adjective jo and the appropriate personal pronoun from the set {we, wah} at the beginning of the translations of RC and MC, respectively, is wah vidyarthii ko jo hindi siikhanaa chaahtaa hai yah kitaab padhnii chaahiye.
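The selection of the minimum-cost pair of retrieved sentences can be sketched as below; the numeric costs stand in for the symbolic cost expressions of Tables 5.29 and 5.30 and are illustrative only.

```python
from itertools import product

def best_pair(rc_candidates, mc_candidates, combination_cost):
    """Choose the (RC, MC) candidate pair minimizing the total adaptation
    cost. Candidates are (sentence, cost) tuples; combination_cost is the
    fixed cost of joining the two adapted clause translations."""
    return min(
        ((rc, mc, rc_cost + mc_cost + combination_cost)
         for (rc, rc_cost), (mc, mc_cost) in product(rc_candidates, mc_candidates)),
        key=lambda t: t[2],
    )

# Hypothetical numeric costs standing in for the symbolic terms:
rc = [("He likes to learn Hindi.", 17.58), ("He is learning Hindi.", 23.08)]
mc = [("The student wants to study this book.", 20.17), ("He should study it.", 37.85)]
sentence_rc, sentence_mc, cost = best_pair(rc, mc, combination_cost=10.33)
print(sentence_rc, "|", sentence_mc, "|", round(cost, 2))
```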
5.12 Concluding Remarks
The lower the dissimilarity score of an example, the lower its adaptation cost for generating the required translation. Finally, the cost-of-adaptation-based scheme is applied on the selected set of sentences provided by the filtration scheme.
The advantage of this filtration scheme is that it reduces the number of example
base sentences that are to be analysed for evaluating the cost of adaptation. In the
worst case it reduced this number by 25%. However, when we repeated our experiments with 500 different sentences, we found that the average reduction in the number of sentences subjected to evaluation of the cost of adaptation is about 75%. The proposed
scheme, however, cannot be applied to complex sentences straight away. This is because adaptation with respect to complex sentences is more difficult, owing to the more complicated structure of complex sentences in both English and Hindi. Consequently, we suggest that complex sentences may first be split into simple sentences.
Then the adaptation cost based scheme may be applied to retrieve best matches
for each of the simple sentences. These retrieved translations may then be adapted
to generate the translation of these simple sentences. These translations may then
be combined using linguistic rules to generate the translation of the input complex
sentence. The novelty of the scheme is that it gives an algorithmic way for handling
complex sentences under EBMT as well.
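The split-and-translate strategy outlined above can be summarized as a small pipeline sketch; the function arguments are placeholders for the thesis's splitting, retrieval, adaptation and combination procedures, not their actual implementations.

```python
def translate_complex(sentence, split, retrieve_best, adapt, combine):
    """Split-and-translate strategy for complex sentences:
    1. split the input into simple clauses (e.g. RC and MC),
    2. retrieve the minimum-adaptation-cost example for each clause,
    3. adapt each retrieved translation to the corresponding clause,
    4. combine the clause translations using connective-specific rules."""
    clauses = split(sentence)                       # e.g. [RC, MC]
    examples = [retrieve_best(c) for c in clauses]  # cost-based retrieval
    parts = [adapt(c, ex) for c, ex in zip(clauses, examples)]
    return combine(sentence, parts)
```

With stub functions substituted for each stage, the pipeline simply threads the clauses through retrieval, adaptation and combination.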
In this work we have dealt with complex sentences having a main clause and one
relative clause. We have developed heuristics to first determine whether a sentence
is complex. We use observations from our example base of complex sentences to
validate the heuristics used in the sentence splitting algorithm. In this report, we
have discussed algorithms for splitting complex sentences for five connectives: who,
when, where, wherever and whenever. We have also developed the splitting
rules for other connectives (e.g. which, whom, whose, whoever, whichever), but to avoid repeating a similar discussion, we have not presented all of them in this report. Finally, we have shown the adaptation procedure for adapting given complex sentences with any of the above-mentioned connectives. In particular, we showed, for a given complex sentence, how the scheme retrieves two similar simple sentences from the example base that can be used to generate the translation of the input complex sentence.
Chapter 6
Discussions and Conclusions
6.1
The primary goal of this research is to study various aspects of designing an EBMT
system for translation from English to Hindi. It may be observed that in today's world a lot of information is being generated around the world in various fields. However, since most of this information is in English, it remains out of reach of people at
large for whom English is not the language of communication. This is particularly
true for a country like India, where the population size is more than a billion, yet
only about 3% of the population can understand English. As a consequence, an increasing demand for developing machine translation systems from English to various
languages of Indian subcontinent is being felt very strongly. However, development
of MT systems typically demands availability of a large volume of computational
resources, which is currently not available for these languages, in general. Moreover,
generating such a large volume of computational resources (which may comprise an
extensive rule base, a large volume of parallel corpora etc.) is not an easy task.
EBMT scheme, on the other hand, is less demanding on computational resources
making it more feasible to implement in respect of these languages.
In this respect, we further observed that although a few English to
Hindi MT systems are available online, quality of the translations produced by them
is not always up to a satisfactory level. This prompted us to investigate in detail the
various difficulties that one may face while developing an MT system from English
to Hindi. We feel that the studies made in this research will be helpful not only for
Hindi, but also for other languages that are major regional languages of Indian subcontinent, and at the same time prominent minority languages of other countries
(e.g. U.K.). Although an increasing demand for MT systems from English to these
languages is clearly evident, development of necessary computational resources is
6.2
well. In a similar way, we further observed that declensions of Hindi verbs, nouns and adjectives often depend on some auxiliary words, called morpho-words. Adaptation using morpho-words also makes the process efficient and computationally cheaper. The above observations motivated us to design an adaptation scheme comprising nine different basic operations, besides copy, to perform addition, deletion or replacement of constituent words, morpho-words and suffixes. Successive application of these operations helps in adapting the translation of a retrieved sentence to generate the translation of a given input. Another advantage of using these operations is that their algorithmic nature enables one to estimate the computational cost of each of them. We used this estimation to design a novel similarity measurement scheme, as explained below.
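A minimal sketch of such a repertoire of basic operations, with illustrative placeholder costs (the actual cost models are developed in Chapter 5), might look as follows:

```python
from enum import Enum

class Op(Enum):
    """The nine basic adaptation operations (besides copy): addition,
    deletion and replacement of constituent words, morpho-words and
    suffixes."""
    WORD_ADD = "word_add"
    WORD_DEL = "word_del"
    WORD_REP = "word_rep"
    MW_ADD = "mw_add"
    MW_DEL = "mw_del"
    MW_REP = "mw_rep"
    SUF_ADD = "suf_add"
    SUF_DEL = "suf_del"
    SUF_REP = "suf_rep"

# Hypothetical per-operation cost estimates; replacements would also
# include dictionary-search terms in the actual models.
COST = {Op.WORD_ADD: 3.0, Op.WORD_DEL: 1.0, Op.WORD_REP: 12.41,
        Op.MW_ADD: 1.5, Op.MW_DEL: 0.5, Op.MW_REP: 2.0,
        Op.SUF_ADD: 1.0, Op.SUF_DEL: 0.5, Op.SUF_REP: 1.5}

def adaptation_cost(ops):
    """Total estimated cost of a sequence of adaptation operations."""
    return sum(COST[op] for op in ops)
```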
Retrieval and Adaptation. However good an adaptation scheme is, its performance is hindered seriously if the example that it attempts to adapt is not quite similar to the input sentence. But there is no unique way of defining similarity between sentences. Depending upon the application, the definition of similarity may vary. In this work we proposed a scheme for defining similarity from the adaptation perspective. We say that a sentence S1 should be called similar to another sentence S2 if adapting the translation of S1 to generate the translation of S2 is computationally inexpensive. The lower the cost, the greater the similarity.
In this work we have provided appropriate models for prior estimation of cost of
adaptation. This cost depends not only on the number of basic operations to be performed, but also on the functional slots on which the operation is applied. Thorough
analysis of adaptation costs for different phrasal structures within various functional
slots (e.g., subject, object, verb), and also for different sentence types (e.g., affirmative, negative, interrogative) has been carried out, and models for estimating these
to use these resources effectively, proper alignment techniques for English to Hindi
alignment should be developed.
6.3 Possible extensions
The work carried out in this research may be extended in several directions:
- In this work, various adaptation procedures are studied and developed for the sentence structures (and their components) that are predominantly found among the sentences present in our example base. Many other variations in sentence structure are possible which have not been discussed in this work. Adaptation rules may be developed for such sentence patterns.
- Although we have dealt with complex sentences, we imposed certain restrictions in this work on the structures of complex sentences. The splitting and adaptation rules have been developed for these restricted structures only. The proposed split-and-translate technique needs to be extended to complex sentences having more complicated structures. Further, we have left compound sentences out of our discussion. Strategies are to be developed for dealing with compound sentences as well.
- With respect to English to Hindi translation, seven types of divergence are identified. Although this finding is based on our analysis of about 30,000 sentences, the possibility of some other divergence types cannot be completely ruled out. A more detailed study of an English-Hindi parallel corpus is required to identify other divergence types, if any.
- The robustness of the scheme proposed for taking a prior decision about the possible divergence types in the translation of a given input sentence (refer to Chapter 4) depends on the PSD/NSD. These dictionaries contain the proper sense of the word, and are created manually. For automating the construction of the PSD/NSD, an appropriate Word Sense Disambiguation technique should be developed and applied.
- Prior identification of Possessional Divergence is not discussed in Chapter 4. This is because possessional divergence may be associated with a large number of variations in the properties of the subject, object, pre-modifier of the object etc., which are not governed by a simple set of rules. The hypernyms (according to WordNet 2.0) of these words need to be analyzed and compared to arrive at any conclusion regarding prior identification of this divergence type.
- The divergence identification algorithms (discussed in Chapter 3) depend on the FT and SPAC of the English sentence and its Hindi translation. For the English sentence, this knowledge is extracted from the parsed version, which is obtained from the parsers available online. But no such resource is available for Hindi. Thus, in our work we parsed and obtained the FT and SPAC of Hindi sentences manually. For practical applications of the proposed algorithms, a Hindi parser is needed to obtain the required information from the Hindi sentence.
6.4 Epilogue
There are many issues pertaining to MT that have not been dealt with in this work. Arguably, the two most important of them are:
6.4.1
Pre-editing is the process of identifying and editing, where necessary, the source text prior to translation, so that any sentences (segments) of the text that the machine will have problems with are highlighted and removed. In other words, pre-editing builds new text data, which the MT system is able to handle, from the existing version of a given text (e.g. in paragraph form). The pre-editing required varies according to the requirements of the MT system.
In our study of an EBMT system, we have also done some pre-editing according to the requirements of the problems of retrieval, adaptation and identification of divergence. Firstly, we have assumed that our original data is aligned sententially, i.e. one source language sentence corresponds to one target language sentence. For the retrieval and the adaptation procedures, we have added the parsed version of the source sentence, which is based on morpho-functional tags, along with word alignment information at the root level. This minimum information is stored in our example base for carrying out adaptation, converting complex sentences into simple sentences, and measuring similarity for effective retrieval. In Chapter 1, we have provided Figure 1.1 showing an example record of the example base. However, the
algorithms for divergence identification require both FTs and SPAC (see Appendix
B) information for the parallel corpus. Pre-editing has to be done accordingly.
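The minimal information stored per example can be sketched as a record structure; the field names and the sample alignment below are illustrative assumptions, not the thesis's actual storage format.

```python
from dataclasses import dataclass, field

@dataclass
class ExampleRecord:
    """One sententially aligned record of the example base: the English
    sentence, its parsed version (morpho-functional tags), the Hindi
    translation, and root-level word alignment."""
    english: str
    parsed: list[str]          # e.g. ["@SUBJ N SG student", ...]
    hindi: str
    alignment: dict[str, str] = field(default_factory=dict)  # English root -> Hindi root

# Illustrative record (transliterated Hindi):
rec = ExampleRecord(
    english="He reads.",
    parsed=["@SUBJ PRON PERS MASC SG3 he", "@+FMAINV V PRES read"],
    hindi="wah padhtaa hai",
    alignment={"he": "wah", "read": "padh"},
)
```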
The task of post-editing is to edit, modify and/or correct the output that has been produced by a machine translation system from a source language into a target language. In other words, post-editing corrects the output of the MT system to an agreed standard, e.g. amending the style of the output sentences, or making any minimal amendments that render the text more readable.
As such we have not developed any MT system, so post-editing is not directly relevant to this work. Still, we point out some situations where post-editing can be useful while designing an EBMT system.
In the case of an EBMT system, post-editing may be required in the following form. The desired translation of the input sentence is generated by adapting the translation of the most similar example. Sometimes it may happen that even after adaptation the system does not produce a translation that is grammatically correct in the target language. This may be because of insufficient morpho-syntactic information or grammar rules used by the system while carrying out the adaptation task. In this situation one has to correct the translation according to the requirement. Another situation where post-editing can be useful is when the system does not have a sufficient number of words in its dictionary. Typically, in these cases the MT system provides a transliteration of these words in the target language. Post-editing is useful in these cases too. The amount of post-editing required on the output provides a good yardstick for measuring the output quality of an MT system.
6.4.2
Implementation of an MT system can be considered successful only if the translation produced by the system is of acceptable quality. This automatically raises the issue of how to evaluate the quality of the output produced by a system. In recent years, various methods have been proposed to evaluate machine translation quality automatically. Typically, these methods take the help of a reference translation of some pre-selected test data. A reference translation is also known as a gold-standard translation. By comparing the output produced by the system under consideration (with respect to the pre-selected test data) with the reference translation, an estimate of the possible discrepancy is arrived at. This in turn gives a measure of the translation quality of the said system. Examples of such methods are Word Error Rate (WER), Position-independent word Error Rate (PER) (Tillmann et al., 1997), and multi-reference Word Error Rate (mWER) (Nießen et al., 2000). Below we describe the above-named methods.
- WER: The word error rate is based on the Levenshtein distance. It is computed as the minimum number of substitution, insertion and deletion operations that have to be performed to convert the generated sentence into the reference sentence.
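A minimal sketch of this computation, assuming whitespace tokenization:

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word Error Rate: Levenshtein distance between the two word
    sequences, normalized by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(hyp)][len(ref)] / len(ref)

print(wer("he reads a book", "he reads the book"))  # 0.25: one substitution
```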
- PER: A major shortcoming of the WER is the fact that it requires a perfect word order. In order to overcome this problem, the position-independent word error rate (PER) was introduced as an additional measure. It compares the words in the two sentences without taking the word order into account. Words that have no matching counterparts are counted as substitution errors, missing words are deletion errors and additional words are insertion errors. Evidently, the PER is always less than or equal to the WER.
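One common formulation of the PER can be sketched as follows; the exact error-counting convention is an assumption here, as several variants exist in the literature.

```python
from collections import Counter

def per(hypothesis: str, reference: str) -> float:
    """Position-independent word Error Rate: compares the two sentences
    as bags of words, ignoring word order."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    matches = sum((hyp & ref).values())        # multiset intersection
    n_hyp, n_ref = sum(hyp.values()), sum(ref.values())
    errors = max(n_hyp, n_ref) - matches       # unmatched words
    return errors / n_ref
```

Reordering the words of a sentence leaves its PER at zero, whereas the WER of the same pair can be large.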
In later years, n-gram based schemes have been proposed to evaluate translation quality. The most prominent among them are the BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2001) and NIST (National Institute of Standards and Technology) (Doddington, 2002) scores. All these criteria try to approximate human assessment, and often achieve an astonishing degree of correlation to human subjective evaluation of fluency1 and adequacy2. These methods work as follows.
- BLEU: This scheme was proposed by IBM (Papineni et al., 2001). It is based on the notion of modified n-gram precision, for which all candidate n-gram counts are collected. The geometric mean of the n-gram precisions of various lengths between a hypothesis and a set of reference translations is computed. This score is multiplied by a brevity penalty (BP) factor to penalize overly short translations. Therefore,
1. A fluent sentence is one that is well-formed grammatically, contains correct spellings, adheres to common use of terms, is intuitively acceptable and can be sensibly interpreted by a native speaker.
2. The judge is presented with the gold-standard translation, and should evaluate how much of the meaning expressed in the gold-standard translation is also expressed in the target translation.
BLEU = BP · exp( (1/N) Σ_{n=1}^{N} log p_n )

Here p_n denotes the precision of n-grams of length n in the hypothesis translation, and N denotes the total number of n-gram lengths considered, usually n ∈ {1, 2, 3, 4}. Papineni et al. (2001) state that BLEU captures adequacy as well as fluency. BLEU is an accuracy measure, while the above-mentioned measures are error measures. A disadvantage of the BLEU score is that longer n-grams dominate over shorter n-grams, and it cannot match corresponding (sub)parts of the hypothesis to the reference.
- NIST: This score was proposed by the National Institute of Standards and Technology in 2002. It reduces the effect of longer n-grams. This criterion computes an arithmetic average over n-gram counts instead of a geometric mean, multiplied by a factor BP that penalizes short sentences. Both NIST and BLEU are accuracy measures, and thus larger values reflect better translation quality.
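A simplified single-reference sketch of the BLEU computation described above (the full metric of Papineni et al. supports multiple references and other refinements not shown here):

```python
import math
from collections import Counter

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of modified n-gram precisions (n = 1..max_n),
    multiplied by a brevity penalty for short hypotheses.
    Single-reference simplification of Papineni et al. (2001)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total)   # avoid log(0)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec / max_n)

print(round(bleu("he reads the book", "he reads the book"), 3))  # 1.0: exact match
```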
for EBMT, while some others, such as the study of divergences, are relevant for other paradigms as well. We hope that the contributions made in this thesis will be useful for designing English to Hindi MT systems, and also for many other language pairs at large.
Appendix A
A.1
English and Hindi are languages of two different origins, so a study of their general structural properties is necessary. In this discussion, some of the basic concepts of translation from English to Hindi are briefly outlined. Some of the general structural properties of English and Hindi (Kachru, 1980; Kellogg and Bailey, 1965; Singh, 2003; Quirk and Greenbaum, 1976) are described below.
Sentence Pattern: The basic sentence pattern in English is Subject (S) Verb (V) Object (O), whereas it is SOV in Hindi. Consider for example Radha eats mango: here Radha is the subject and eats is the verb, while mango is the object. So the words occur in the order SVO. But in Hindi it becomes

radha (Radha) [S] … (mango) [O] … (eats) [V]
For example:

radha (Radha) chidiyaan (sparrows) dekhatii hai (watches)

radha (Radha) dekhatii hai (watches)
The above-mentioned differences are structural differences between English and Hindi. Some differences lie in the part-of-speech properties of the English and Hindi languages. These discrepancies are as follows:
Noun: Hindi nouns are affected by gender, number and case-ending (Kellogg and Bailey, 1965). These are as follows:
1. Gender: English has four genders (masculine (MASC), feminine (FEM), common and neuter), whereas Hindi has only two: masculine and feminine. The neuter gender of Sanskrit (the origin of Indian languages) has vanished in Hindi as well as in the closely related languages.
2. Number: As in English, Hindi also has two numbers: singular and plural. There are some possible suffixes for singular-to-plural conversion in Hindi, which are as follows (Kellogg and Bailey, 1965). For example:

   Singular                Plural
   ladkaa (boy)            ladke (boys)
   kapadaa (cloth)         kapade (clothes)
   ladkii (girl)           ladkiyaan (girls)
   kakshaa (class, FEM)    kakshayen (classes)
3. Case-ending: There are eight case-endings in Hindi, which are given below in Table A.2. All these are appended to the oblique form of the noun, where such a form exists. There are some rules for making oblique nouns. Some of them are as follows:
   (a) Masculine singular nouns ending in aa change it into e when some case-ending is added, e.g. ladkaa + ne → ladke ne. Nouns ending in other vowels do not undergo such changes, e.g. ghar ko, daaku kaa.
   (b) If a noun (masculine or feminine) ends in a (silent), it is changed into on in the plural when a case-ending is added. For example: in the house → ghar mein, while in the houses → gharon mein.
Case               Case-ending
Nominative case    ne
Accusative case    ko
Agent case         …
Dative case        …
Ablative case      se (from, since)
Possessive case    kaa, ke, kii
Locative case      …
Vocative case      …
No postposition is used with the nominative and vocative cases. Here we will discuss three cases: nominative, accusative and possessive. The other cases work in the same way as English case-endings. These cases are as follows:
1. Nominative case: The subject of a sentence takes the nominative sign ne only when its predicate is a transitive verb in the past tense (past indefinite, present perfect or past perfect). The use of this case is to make a noun or pronoun act as the subject of a verb. In that case, the verb agrees with the object in gender and number. For example:

   Ram narrated a story.
   ram ne (Ram) kahaanii (story) sunaayii (narrated)

   kisaan ne (farmer) biij (seeds) boyee hain (has sown)

   Here, in these two examples, the objects of the translated sentences are kahaanii and biij. The number and gender of these nouns are singular feminine and plural masculine, respectively.
2. Accusative case: ko is the sign of this case and it is generally added only to animate objects. Sometimes it is also added to inanimate objects, either to intensify the effect or to express a special significance. For example:

   The boy beats the dog.
   ladkaa (boy) kutte ko (dog) martaa hai (beats)
3. Possessive case: The signs of this case are kaa, ke and kii. These words are used with a noun according to the gender, number and case-ending of the following noun. This case-ending has already been discussed in detail in Section 2.5.2 of Chapter 2.
A.2
Every language has its own grammar rules. In other words, the same sentence may follow different grammatical aspects in each of the languages concerned. For example, consider the English sentence He will be sleeping at the moment. Its translation in Hindi is wah iss samay so rahaa hogaa. As per English grammar rules, the verb phrase follows the future tense and progressive (or continuous) aspect, but at the same time the verb phrase of the Hindi sentence comes under the definite potential type of mood according to Hindi grammar. For the translation work, we have followed the English grammar categorization for verb phrase structure (Quirk and Greenbaum, 1976), which involves different combinations of tense, aspect and mood.
To understand the English to Hindi verb structure, the conjugation of the root verb in Hindi is presented in the following subsection.
A.2.1
Verb morphological variations in Hindi depend on four aspects: the tense and form of the sentence, and the gender, person and number of the subject. All these variations affect the root verb of a sentence. Since there are three tenses (i.e. Present, Past and Future) and four forms (i.e. Indefinite, Continuous, Perfect, and Perfect Continuous), in all one can have 12 different conjugations. In Hindi, these conjugations are realized using suffixes attached to the root verbs, and/or by adding some auxiliary verbs, which we call Morpho-Words (MW). Table A.3 gives the morphological words and suffixes in Hindi for all the tenses and their forms.
Table A.3 lists, for each tense (Present, Past, Future) and form (Indefinite, Continuous, Perfect, Perfect Continuous), the suffixes involved (e.g. oongaa, oongii, oge, ogii, hogaa, hogii, hoge) and the morpho-words involved (e.g. ho, hain, the, chukii, chuke, hoongaa, hoongii, hoonge); the Perfect Continuous entries are the same as the Continuous ones.
The above suffixes and morphological words in the present perfect, past indefinite and past perfect are used for the literal translation of a sentence. Actually, the conjugation of the root verb uses aa, e and ii. It has been observed that, according to Table A.3, the suffixes {taa, te, tii} are added to the root form in the past indefinite tense form. According to the tense forms, the morpho-words {thaa, the, thii}, {chukaa, chukii, chuke} and {hoon, hai, ho, hain} are added after the main verb of the sentence. Another possible way of expressing these three tenses and forms in Hindi is that, in place of the above-mentioned suffixes, a different conjugation of the verb is used from that of the tenses and forms discussed earlier. The morpho-words {thaa, the, thii} or {hoon, hai, ho, hain} are then added, depending upon the tense, towards the end of the sentence.
Some rules for these conjugations of verbs are as follows (Sastri and Apte, 1968):

- If the root of the verb ends in a (silent), lengthen it to aa in the masculine singular and change it into e for the masculine plural; in the feminine singular it becomes ii and in the feminine plural iin. For example, the verb play (khel) becomes in Hindi khelaa (masculine singular), khelii (feminine singular), khele (masculine plural) and kheliin (feminine plural).

- If the root ends in aa or oo, yaa is added, which changes according to the aa, e and ii rule1. Sometimes e is used in place of ye; and ii and iin in place of yii and yiin, respectively. For example, the verb come (aa) becomes in the masculine aayaa (singular) and aaye or aae (plural), and in the feminine aayii or aaii (singular) and aayiin or aaiin (plural).

1. The aa, e, ii Rule (Sastri and Apte, 1968): masculine words ending in aa form their plurals by changing the aa into e, and their feminine by changing aa into ii.
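The two rules above can be sketched as a small function over transliterated roots; the argument encoding ('m'/'f', 'sg'/'pl') is an illustrative convention, not part of the source description.

```python
def past_participle(root: str, gender: str, number: str) -> str:
    """Sketch of the two conjugation rules (Sastri and Apte, 1968):
    gender in {'m', 'f'}, number in {'sg', 'pl'}. Transliterations only."""
    if root.endswith(("aa", "oo")):
        # Rule 2: roots ending in aa/oo take 'yaa', inflected by the aa/e/ii rule.
        suffix = {("m", "sg"): "yaa", ("m", "pl"): "ye",
                  ("f", "sg"): "yii", ("f", "pl"): "yiin"}[(gender, number)]
    else:
        # Rule 1: lengthen the silent 'a' of the root to aa/e/ii/iin.
        suffix = {("m", "sg"): "aa", ("m", "pl"): "e",
                  ("f", "sg"): "ii", ("f", "pl"): "iin"}[(gender, number)]
    return root + suffix

print(past_participle("khel", "m", "sg"))  # khelaa
print(past_participle("aa", "f", "pl"))    # aayiin
```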
A table of examples illustrates these conjugations, listing for each English sentence its gender (M/F), its tense form (e.g. Present Continuous, Present Indefinite, Past Continuous, Future Indefinite, Past Indefinite) and the corresponding Hindi sentence; the examples include I am writing a letter., I write a letter., We will write a letter. and Sita wrote a letter.
A similar discussion can be carried out for the passive verb form also. The passive form can be formulated for transitive verbs only.
The morphological variation depends on the gender and number of the object of the active form of the sentence, which is basically the subject in the passive form (Sastri and Apte, 1968). The subject of the active form occurs in the passive form as the instrumental case followed by by, whose Hindi is either se, ke duwaraa or duwaraa. In the passive form, the changes in the main verb follow the rules of the PCP form of the verb as discussed in Section A.1. Moreover, an extra verb jaa is introduced after the main verb, and the suffixes given in Table A.3 are added to this additional verb instead of the main verb of the sentence. The morpho-words are added after the conjugation of the verb jaa. Consider the following pair of examples:
We add sugar to milk.
ham (we) dudh (milk) mein (in) shakkar (sugar) daalte hain (add)

… mein (in) shakkar (sugar) hamare (us) duwaraa (by) … (is added)
The first example is in the active form and the second example is in the passive form. The verb morphological changes are according to the above discussion.
Appendix B
B.1 Functional Tags
In this work we have used the ENGCG parser1 for parsing the English sentences. Most of the FTs that are relevant for this work are obtained directly from the parser. Descriptions of these FTs are given below:
@+FAUXV    Finite Auxiliary Verb
@-FAUXV    Non-finite Auxiliary Verb
@+FMAINV   Finite Main Verb
@-FMAINV   Non-finite Main Verb
@SUBJ      Subject                   (e.g. He reads.)
@F-SUBJ    Formal Subject            (e.g. There was some argument about that. It is raining.)
@N         Title                     (e.g. King George and Mr. Smith)
@DN>       Determiner                (e.g. He read the book.)
@NN>       Premodifying Noun         (e.g. The car park was full.)
@AN>       Premodifying Adjective    (e.g. The blue car is mine.)

1. http://www.lingsoft.fi/cgi-bin/engcg
@QN>            Premodifying Quantifier      (e.g. He had two sandwiches and some coffee.)
@GN>            Premodifying Genitive        (e.g. My car and Bill's bike are blue.)
@AD-A>          Premodifying Ad-Adjective    (e.g. She is very intelligent.)
@OBJ            Object                       (e.g. She read a book.)
@PCOMPL-S       Subject Complement           (e.g. He is a fool.)
@I-OBJ          Indirect Object              (e.g. He gave Mary a book.)
@ADVL           Adverbial                    (e.g. She came home late. She is in the car.)
@<NOM-OF        Postmodifying of             (e.g. Five of you will pass.)
@<NOM-FMAINV
@<AD-A          Postmodifying Ad-Adjective   (e.g. This is good enough.)
@INFMARK>       Infinitive Marker            (e.g. John wants to read.)
@<P-FMAINV
@CC             Coordinator
@CS     Subordinator         (e.g. If John is there, we shall go, too.)
@NEG    Negative Particle    (e.g. It is not funny.)
@<P
B.2 Morpho Tags
A           adjective (small)
ADV         adverb (soon)
CC          coordinating conjunction
CS          subordinating conjunction
DET         determiner (any)
INFMARK>    infinitive marker (to)
INTERJ      interjection (hooray)
N           noun (house)
NEG-PART    negative particle (not)
NUM         numeral (two)
PCP1        present participle
PCP2        past participle
PREP        preposition (in)
PRON        pronoun (this)
V           verb (write)
CMP         comparative
SUP         superlative
ABS
CMP
SUP
WH
wh-adverb (when)
ADVL
<Def>
definite (the)
<Indef>
indefinite (an)
<Quant>
quantifier (some)
ABS
ART
article (the)
CENTRAL
CMP
DEM
GEN
genitive (whose)
NEG
PL
plural (few)
POST
postdeterminer (much)
PRE
predeterminer (all)
SG
singular (much)
SG/PL
SUP
WH
wh-determiner (whose)
proper (Jones)
GEN
PL
plural (cars)
SG
singular (car)
SG/PL
fraction (two-thirds)
CARD
ORD
SG
singular (one-eighth)
PL
plural (three-eighths)
<Comp-Pron>
<Interr>
interrogative (who)
<Quant>
<Refl>
<Rel>
ABS
CMP
DEM
FEM
feminine (she)
GEN
genitive (our)
MASC
masculine (he)
NEG
PERS
PL
plural (fewer)
PL1
PL2
PL3
RECIPR
SG
singular (much)
SG/PL
SG1
SG2
SG2/PL2
SG3
SUP
WH
wh-pronoun (who)
SUBJ
<SV>
intransitive (go)
<SVO>
monotransitive (open)
<SVOO>
ditransitive (give)
<SVC/A>
<SVC/N>
AUXMOD    modal auxiliary (should)
IMP       imperative (go)
INF       infinitive (be)
PAST      past tense
PRES      present tense
Appendix C
C.1
The definitions of some non-typical functional tags that we have used in our algorithms are given below.
1. Adjunct (A): An adjunct is a type of adverbial indicating the circumstances of the action. Adjuncts may be obligatory or optional. They express such relations as time, place, manner, reason and condition, i.e. they are answers to the questions where, when, how and why.
For example:
He lives in Brazil.
She was walking slowly.
Here, in Brazil is the adjunct as it gives the answer to where.
2. Predicative Adjunct (PA): If the copula (linking verb) is present, and it allows
an adverbial as complementation, then the complementation is called predicative adjunct.
For example:
The children are at the zoo.
The party will be at nine o'clock.
The two eggs are for you.
The party will be tonight.
Here, the underlined prepositional phrases and adverbs are examples of predicative adjuncts.
http://www.lingsoft.fi/cgi-bin/engcg
http://pi0657.kub.nl/cgi-bin/tstchunk/demo.pl
Appendix D
D.1 Semantic Similarity
Semantic similarity between two words is computed on the basis of their semantic distance (sd) (Stetina et al., 1998), as follows:

sim(a, b) = 1 - (sd(a, b))^2

The semantic similarity score lies between 0 and 1. The semantic distance (Stetina et al., 1998) between two words, say a and b, is computed as:

sd(a, b) = (1/2) [ (Ha - H)/Ha + (Hb - H)/Hb ]
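Given the relevant hierarchy depths, the two formulas can be computed directly; the parameter names below (depths of the two senses and of their nearest common ancestor in the WordNet hierarchy) are an illustrative reading of Ha, Hb and H.

```python
def semantic_distance(h_a: int, h_b: int, h_common: int) -> float:
    """Semantic distance in the style of Stetina et al. (1998): h_a and
    h_b are the hierarchy depths of the two word senses, h_common the
    depth of their nearest common ancestor."""
    return 0.5 * ((h_a - h_common) / h_a + (h_b - h_common) / h_b)

def semantic_similarity(h_a: int, h_b: int, h_common: int) -> float:
    """sim(a, b) = 1 - sd(a, b)^2, giving a score in [0, 1]."""
    return 1 - semantic_distance(h_a, h_b, h_common) ** 2

print(semantic_similarity(4, 4, 4))  # 1.0: identical senses
```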
Appendix E
E.1
Here, transformation from genitive case to genitive case requires fourteen adaptation operations. Below we describe the cost of each of them. Note that the pre-modifier word can be either ABS or A(PCP1) or A(PCP2) (see Section 2.5.4). We denote this set by R.

1. The average cost of word replacement from the set R to ABS. This cost is denoted by w1. Note that in this case an adjective dictionary search is necessary, for which the search time is 12.41 (see item 4 of Section 5.3). Hence, the total average cost may be computed as (l1 · L2) + (l2 · Lp/2) + {(d · 12.41) + (c · 10^5)}.
2. The average cost of word replacement from the set R to either A(PCP1) or A(PCP2). We denote this cost as w2. Here also, a dictionary search is required. Note that the adjective forms A(PCP1) and A(PCP2) are derived from the verb part of speech; therefore, in this case the dictionary search time is 12.08. Hence, the total average cost is (l1 · L2) + (l2 · Lp/2) + {(d · 12.08) + (c · 105)}.
3. The average cost of morpho-word addition from the set {huaa, huye, huii}. This cost is denoted as w3. Since there are three morpho-words in total, the average cost may be formulated as (l1 · L2) + (m · 3/2).
4. The average cost of morpho-word deletion from the set {huaa, huye, huii}. We denote it as w4. This average cost is evaluated as (l1 · L2) + (m · 3/2).
5. The average cost of suffix replacement from the set {aa, e, ii} is (l1 · L2) + (k · 3/2) + (k · 3/2). We denote it as s1.
6. The average cost of suffix addition from the set {taa, tai, tii}. This cost is denoted as s2, which is computed to be (l1 · L2) + (k · 3/2).
7. The average cost of suffix replacement in the verb form of A(PCP2) by using the PCP form of the verb (see Appendix A). We denote it as s3. Hence, the total average cost is (l1 · L2) + (k · 6/2) + (k · 6/2).
8. The average cost of suffix addition in the verb form of A(PCP2) by using the PCP form of the verb is (l1 · L2) + (k · 8/2). We denote this cost as s4.
9. We denote the average cost of suffix replacement from the set {taa, te, tii} as s5, which is formulated as (l1 · L2) + (k · 3/2) + (k · 3/2).
10. The average cost of suffix replacement from the set {aa, ye, ii} is (l1 · L2) + (k · 3/2) + (k · 3/2). We denote it as s6.
11. The average cost of suffix replacement from the suffix set {taa, te, tii} to any of the suffixes required for the verb form of A(PCP2) (using the PCP verb form rule, see Appendix A). We denote it as s7. Since the number of suffixes required for the verb form of A(PCP2) is fourteen, the average cost of this operation may be formulated as (l1 · L2) + (k · 14/2) + (k · 3/2).
12. The average cost of suffix replacement from any of the suffixes required for the verb form of A(PCP2) to {taa, te, tii} is (l1 · L2) + (k · 3/2) + (k · 14/2).
13. The average cost of suffix replacement for verb form of A(PCP2) to verb form
ABS: (0 or s1) or (w1 + {s1})
A(PCP1): or w2 + w2 + s2
A(PCP2): w2 + w3 + (s3 or s4)
A(PCP1): w2 + {s1} + w4
A(PCP2): w2 + {s1} + w4
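The per-operation costs above share a common shape: a fixed matching term (l1 · L2), an average search term over a candidate set of size n (coefficient · n/2), and, for word replacement, a dictionary-search term. The sketch below only illustrates that shape; all parameter values are hypothetical stand-ins for the constants l1, L2, l2, Lp, k, m, d and c of Section 5.3, which are not fixed here.

```python
def avg_set_cost(coeff, set_size):
    # Average cost of picking an item from a set of size n: coeff * n/2,
    # as in the (k * 3/2), (k * 14/2) and (m * 3/2) terms above.
    return coeff * set_size / 2


def suffix_replacement_cost(l1, L2, k, n_old, n_new):
    # Replacement = removal from the old suffix set plus addition from
    # the new one, e.g. s1 with n_old = n_new = 3, or s7 with the
    # 14-element target set.
    return l1 * L2 + avg_set_cost(k, n_old) + avg_set_cost(k, n_new)


def word_replacement_cost(l1, L2, l2, Lp, d, search_time, extra):
    # w1/w2-style cost: matching + average word term + dictionary search
    # (search_time = 12.41 for adjectives, 12.08 for verb-derived forms);
    # 'extra' stands for the remaining bracketed term, left abstract here.
    return l1 * L2 + l2 * Lp / 2 + d * search_time + extra
```

For instance, the s7 cost (l1 · L2) + (k · 14/2) + (k · 3/2) corresponds to suffix_replacement_cost(l1, L2, k, 3, 14), since addition is commutative.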
Bibliography
Ansell, M.: 2000, English Grammar: Explanations and Exercises, Second edn,
http://www.fortunecity.com/bally/durrus/153/gramdex.html.
Arnold, D. and Sadler, L.: 1990, Theoretical basis of MiMo, Machine Translation
5(3), 195-222.
Bender, E.: 1961, HINDI Grammar and Reader, University of Pennsylvania Press,
University of Pennsylvania South Asia Regional Studies, Philadelphia, Pennsylvania.
Bennett, W. S.: 1990, How much semantics is necessary for MT systems?, Proceedings of the Third International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Linguistics Research Center, The University of Texas, Austin, TX, pp. 261-269.
Bharati, A., Sriram, V., Krishna, A. V., Sangal, R. and Bendre, S.: 2002, An
algorithm for aligning sentences in bilingual corpora using lexical information,
International Conference on Natural Language Processing, Mumbai.
Brown, P. F.: 1990, A statistical approach to Machine Translation, Computational
Linguistics 16(2), 79-85.
Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., Lafferty, J. D. and Mercer, R. L.:
1992, Analysis, statistical transfer, and synthesis in machine translation, Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Montreal, Canada,
pp. 83-100.
Brown, P. F., Pietra, S. A. D., Pietra, V. J. D. and Mercer, R. L.: 1993, The mathematics of statistical Machine Translation: parameter estimation, Computational Linguistics 19(2), 263-311.
Brown, P., Lai, J. C. and Mercer, R. L.: 1991, Aligning sentences in parallel corpora, Proc. of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, pp. 169-176.
Brown, R. D.: 1996, Example-Based Machine Translation in the Pangloss system, Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, Denmark, pp. 169-174.
Brown, R. D.: 1999, Adding linguistic knowledge to a lexical Example-Based
Translation System, Proceedings of the Eighth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99), Chester, UK,
pp. 22-32.
Brown, R. D.: 2000, Automated generalization of translation examples, Proceedings
of the Eighteenth International Conference on Computational Linguistics, pp. 125-131.
Brown, R. D.: 2001, Transfer-rule induction for Example-Based Translation, Proceedings of the MT Summit VIII Workshop on Example-Based Machine Translation, Santiago de Compostela, Spain, pp. 1-11.
Buchholz, S.: 2002, Memory-Based Grammatical Relation Finding, PhD thesis,
Tilburg University, Netherlands.
Carl, M. and Hansen, S.: 1999, Linking translation memories with Example-Based
Machine Translation, Proceedings of Machine Translation Summit VII, Singapore,
pp. 617-624.
Carl, M. and Way, A.: 2003, Advances in Example-Based Machine Translation, Series: Text, Speech and Language Technology, Vol. 21, Kluwer Academic Publishers, Netherlands.
Chatterjee, N.: 2001, A statistical approach to similarity measurement for EBMT,
Proceedings of STRANS-2001, IIT Kanpur, pp. 122-131.
Choueka, Y., Conley, E. S. and Dagan, I.: 2000, A comprehensive bilingual word alignment system: Accommodating disparate languages: Hebrew and English, in J. Véronis (ed.), Parallel Text Processing, Kluwer Academic Publishers, Dordrecht.
Clough, P.: 2001, www.ayre.ca/library/cl/files/sentenceSplitting.ps.
Collins, B.: 1998, Example-Based Machine Translation: an Adaptation-Guided
Retrieval Approach, PhD thesis, University of Dublin, Trinity College.
Collins, B. and Cunningham, P.: 1996, Adaptation guided retrieval in EBMT: A case-based approach to Machine Translation, EWCBR, pp. 91-104.
Daelemans, W., Zavrel, J., Berck, P. and Gillis, S.: 1996, MBT: A memory-based
part of speech tagger-generator, Proceedings of the Fourth Workshop on Very
Large Corpora, E. Ejerhed and I. Dagan (eds.), Copenhagen, Denmark, pp. 14-27.
Dave, S., Parikh, J. and Bhattacharya, P.: 2002, Interlingua Based English-Hindi
Machine Translation and language divergence, Journal of Machine Translation
(JMT) 17.
Doi, T. and Sumita, E.: 2003, Input sentence splitting and translating, HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, pp. 104-110.
Dorr, B. J.: 1993, Machine Translation: A View from the Lexicon, MIT Press,
Cambridge, MA.
Dorr, B. J., Jordan, P. W. and Benoit, J. W.: 1998, A survey of current paradigms
in Machine Translation, Technical Report LAMP-TR-027, UMIACS-TR-98-72,
CS-TR-3961, University of Maryland, College Park, USA.
Dorr, B. J., Pearl, L., Hwa, R. and Habash, N. Y. A.: 2002, DUSTer: A method for
unraveling cross-language divergences for statistical word level alignment, Proceedings of the Fifth Conference of the Association for Machine Translation in the
Americas, AMTA-2002, Tiburon, CA.
Fung, P. and McKeown, K.: 1996, A technical word- and term-translation aid using noisy parallel corpora across language groups, The Machine Translation Journal, Special Issue on New Tools for Human Translators, pp. 53-87.
Furuse, O., Yamada, S. and Yamamoto, K.: 1998, Splitting long or ill-formed input
for robust spoken-language translation, Proceedings of the Thirty-Sixth Annual
Meeting of the ACL and Seventeenth International Conference on Computational
Linguistics, pp. 421-427.
Gale, W. A. and Church, K. W.: 1991a, Identifying word correspondences in parallel texts, Proceedings of the Fourth DARPA Workshop on Speech and Natural
Language, Morgan Kaufmann Publishers, Inc., pp. 152-157.
Gale, W. A. and Church, K. W.: 1991b, A program for aligning sentences in bilingual
corpora, ACL-91, Berkeley, CA, pp. 177-184.
Gale, W. and Church, K.: 1993, A program for aligning sentences in bilingual corpora, Computational Linguistics 19(1), 75-102.
George, D.: 2002, Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, Proceedings of the ARPA Workshop on Human Language Technology.
Germann, U.: 2001, Building a Statistical Machine Translation system from scratch:
How much bang for the buck can we expect?, ACL 2001 Workshop on Data-Driven
Machine Translation, Toulouse, France.
Goyal, S., Gupta, D. and Chatterjee, N.: 2004, A study of Hindi translation patterns for English sentences with have as the main verb, Proceedings of the International Symposium on MT, NLP and Translation Support Systems: iSTRANS-2004, CDEC and IIT Kanpur, Tata McGraw-Hill, New Delhi, pp. 46-51.
Grishman, R. and Kosaka, M.: 1992, Combining rationalist and empiricist approaches to Machine Translation, Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Montreal, Canada, pp. 263-274.
Gupta, D. and Chatterjee, N.: 2002, Study of similarity and its measurement for
English to Hindi EBMT, Proceedings of STRANS-2002, IIT Kanpur.
Gupta, D. and Chatterjee, N.: 2003a, Divergence in English to Hindi Translation:
Some studies, International Journal of Translation 15, 5-24.
Gupta, D. and Chatterjee, N.: 2003b, Identification of divergence for English to
Hindi EBMT, Proceedings of the MT SUMMIT IX, New Orleans, LA, pp. 141-148.
Gupta, D. and Chatterjee, N.: 2003c, A morpho-syntax based adaptation and retrieval scheme for English to Hindi EBMT, Proceedings of the Workshop on Computational Linguistics for the Languages of South Asia: Expanding Synergies with Europe, Budapest, Hungary, pp. 23-30.
Güvenir, H. A. and Cicekli, I.: 1998, Learning translation templates from examples, Information Systems 23, 353-363.
Habash, N.: 2003, Generation-Heavy Hybrid Machine Translation, PhD thesis,
University of Maryland, College Park.
Habash, N. and Dorr, B. J.: 2002, Handling translation divergences: Combining statistical and symbolic techniques in generation-heavy Machine Translation, Proceedings of the Fifth Conference of the Association for Machine Translation in the
Americas, AMTA-2002, Tiburon, CA.
Han, C.-h., Benoit, L., Martha, P., Owen, R., Kittredge, R., Korelsky, T., Kim,
N. and Kim, M.: 2000, Handling structural divergences and recovering dropped
arguments in a Korean/English machine translation system, Proceedings of the
Fourth Conference of the Association for Machine Translation in the Americas,
AMTA-2000, Cuernavaca, Mexico.
Hutchins, J.: 2003, The Oxford Handbook of
Translation (Design and Implementation for Hindi to English), PhD thesis, I.I.T.
Kanpur.
Kachru, Y.: 1980, Aspects of Hindi Grammar, Manohar Publications, New Delhi.
Kellogg, R. S. and Bailey, T. G.: 1965, A Grammar of the Hindi Language, Routledge and Kegan Paul Ltd., London.
Kit, C., Pan, H. and Webster, J.: 2002, Example-Based Machine Translation: A
New Paradigm, Translation and Information Technology, Chinese U of HK Press,
pp. 5778.
Leffa, V. J.: 1998, Clause processing in complex sentences, Proceedings of the First
International Conference on Language Resources and Evaluation, Vol. 1, pp. 937-943.
Loomis, M. E. S.: 1997, Data Management and File Structures, second edn, Prentice
Hall of India Private Limited, New Delhi-110001.
Manning, C. and Schütze, H.: 1999, Foundations of Statistical Natural Language
Processing, The MIT Press, MA.
McEnery, A. M., Oakes, M. P. and Garside, R.: 1994, The use of approximate
string matching techniques in the alignment of sentences in parallel corpora, in
A. Vella (ed.), The Proceedings of Machine Translation: 10 Years On, University
of Cranfield.
McTait, K.: 2001, Translation Pattern Extraction and Recombination for Example-Based Machine Translation, PhD thesis, Centre for Computational Linguistics, Department of Language Engineering, UMIST.
Nagao, M.: 1984, Artificial and Human Intelligence, North-Holland, chapter A
Framework of a Mechanical Translation Between Japanese and English by Analogy Principle, pp. 173-180.
Nießen, S., Och, F. J., Leusch, G. and Ney, H.: 2000, An evaluation tool for machine translation: Fast evaluation for machine translation research, Proceedings of the Second Int. Conf. on Language Resources and Evaluation (LREC), Athens, Greece, pp. 39-45.
Nirenburg, S.: 1993, Example-Based Machine Translation, Proceedings of the Bar
Ilan Symposium on Foundations of Artificial Intelligence, Bar Ilan University,
Israel.
Nirenburg, S., Grannes, D. and Domashnev, K.: 1993, Two approaches to matching in Example-Based Machine Translation, Proceedings of TMI-93, Kyoto, Japan.
Oard, D. W.: 2003, The surprise language exercises, ACM Transactions on Asian Language Information Processing 2(2), 79-84.
Orasan, C.: 2000, A hybrid method for clause splitting in unrestricted English texts,
Proceedings of ACIDCA 2000, Monastir, Tunisia.
Papineni, K. A., Roukos, S., Ward, T. and Zhu, W.-J.: 2001, Bleu: a method
for automatic evaluation of machine translation, Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research
Center, Yorktown Heights, NY.
Piperidis, S., Boutsis, S. and Papageorgiou, H.: 2000, From sentences to words and
clauses, Parallel text processing, Kluwer Academic Publishers, Dordrecht.
Puscasu, G.: 2004, A multilingual method for clause splitting, Proceedings of CLUK
2004, Birmingham, UK, pp. 199-206.
Quirk, R. and Greenbaum, S.: 1976, A University Grammar of English, English Language Book Society, Longman.
Rao, D.: 2001, Human aided Machine Translation from English to Hindi: The
MaTra project at NCST, Proceedings Symposium on Translation Support Systems, STRANS-2001, I.I.T. Kanpur.
Rao, D., Mohanraj, K., Hegde, J., Mehta, V. and Mahadane, P.: 2000, A practical framework for syntactic transfer of compound-complex sentences for English-Hindi Machine Translation, Proceedings of the Conf. on Knowledge Based Computer Systems, National Centre for Software Technology, Mumbai, pp. 343-354.
Resnik, P. and Yarowsky, D.: 2000, Distinguishing systems and distinguishing senses:
New evaluation methods for word sense disambiguation, Natural Language Engineering 5(2), 113-133.
Sang, E. F. T. K. and Dejean, H.: 2001, Introduction to the CoNLL-2001 shared
task: Clause identification, Proceedings of CoNLL-2001, Toulouse, France, pp. 53-57.
Sangal, R.: 2004, Shakti: IIIT-Hyderabad machine translation system (experimental), http://shakti.iiit.net/shakti/.
Sastri, S. and Apte, B.: 1968, Hindi Grammar, Dakshina Bharat Hindi Prachar
Sabha, Madras, India.
Sato, S.: 2003, AnglaHindi: Translation system, Proceedings of the MT SUMMIT IX, New Orleans, LA, pp. 23-27.
Sinha, R. M. K., Jain, R. and Jain, A.: 2002, An English to Hindi machine aided
translation system based on ANGLABHARTI technology ANGLA HINDI,
I.I.T. Kanpur, http://anglahindi.iitk.ac.in/translation.htm.
Somers, H.: 1997, Machine Translation and minority languages, Translating and the
Computer 19: Papers from the Aslib conference, London.
Somers, H.: 1998, Further experiments in bilingual text alignment, International
Journal of Corpus Linguistics 3, 115-150.
Somers, H.: 1999, Review article: Example-Based Machine Translation, Machine
Translation 14, 113-158.
Somers, H.: 2001, EBMT seen as case-based reasoning, MT Summit VIII Workshop
on Example-Based Machine Translation, Santiago de Compostela, Spain, pp. 56-65.
Stetina, J., Kurohashi, S. and Nagao, M.: 1998, General word sense disambiguation method based on a full sentential context, Proceedings of COLING-ACL
Workshop, Usage of WordNet [http://www.cogsci.princeton.edu/cgi-bin/webwn]
in Natural Language Processing, Montreal, Canada.
Sumita, E.: 2001, Example-Based Machine Translation using DP-matching between word sequences, Proc. of the ACL 2001 Workshop on Data-Driven Methods
in Machine Translation, pp. 1-8.
Sumita, E. and Iida, H.: 1991, Experiments and prospects of Example-Based
Machine Translation, Proceedings of the 29th Annual Meeting of the Association
for Computational Linguistics, Berkeley, California, USA, pp. 185-192.
Sumita, E., Iida, H. and Kohyama, H.: 1990, Translating with examples: A new
approach to Machine Translation, TMI-1990, pp. 203-212.
Sumita, E. and Tsutsumi, Y.: 1988, A translation aid system using flexible text
retrieval based on syntax matching, Proceedings of TMI-88, CMU, Pittsburgh.
Takezawa, T.: 1999, Transformation into meaningful chunks by dividing or connecting utterance units, Journal of Natural Language Processing 6(2).
Thurmair, G.: 1990, Complex lexical transfer in METAL, Proceedings of the Third
International Conference on Theoretical and Methodological Issues in Machine
Translation of Natural Languages, Linguistics Research Center, The University of Texas, Austin, TX, pp. 91-107.
Tillmann, C., Vogel, S., Ney, H., Zubiaga, A. and Sawaf, H.: 1997, Accelerated DP-based search for statistical translation, European Conf. on Speech Communication and Technology, Rhodes, Greece, pp. 2667-2670.
Uchida, H. and Zhu, M.: 1998, The Universal Networking Language (UNL) specifications, version 3.0, Technical report, United Nations University, Tokyo, http://www.unl.unu.edu/unlsys/unl/unls30.doc.
Veale, T. and Way, A.: 1997, Gaijin: A template-driven bootstrapping approach to
Example-Based Machine Translation, International Conference, Recent Advances
in Natural Language Processing, Tzigov Chark, Bulgaria, pp. 239-244.
Vikas, O.: 2001, Technology development for Indian languages, Proceedings of the Symposium on Translation Support Systems STRANS-2001, IIT Kanpur.
Watanabe, H.: 1992, A similarity-driven transfer system, Proceedings of the 14th COLING, pp. 770-776.
Watanabe, H., Kurohashi, S. and Aramaki, E.: 2000, Finding structural correspondences from bilingual parsed corpus for Corpus-Based Translation, Proceedings
of COLING-2000, Saarbrücken, Germany.
Wiederhold, G.: 1987, File Organization for Database Design, McGraw-Hill Inc., New York, USA.
Wren, P., Martin, H. and Rao, N.: 1989, High School English Grammar, S. Chand
& Co. Ltd., New Delhi.
Wu, D.: 1995, Large-scale automatic extraction of an English-Chinese translation
lexicon, Machine Translation 9(3-4), 285-313.
List of Publications
Published Paper(s) in Journal
1. Gupta D. and Chatterjee N. 2003. Divergence in English to Hindi Translation: Some Studies. International Journal of Translation, Vol. 15, pp. 5-24.