
CONTRIBUTIONS TO ENGLISH TO HINDI

MACHINE TRANSLATION USING


EXAMPLE-BASED APPROACH

DEEPA GUPTA

DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF TECHNOLOGY DELHI
HAUZ KHAS, NEW DELHI-110016, INDIA
JANUARY, 2005
CONTRIBUTIONS TO ENGLISH TO HINDI
MACHINE TRANSLATION USING
EXAMPLE-BASED APPROACH

by
DEEPA GUPTA

Department of Mathematics

Submitted

in fulfilment of the requirements of


the degree of

Doctor of Philosophy
to the

Indian Institute of Technology Delhi


Hauz Khas, New Delhi-110016, India
January, 2005
Dedicated to

My Parents,

My Brother Ashish and

My Thesis Supervisor...
Certificate

This is to certify that the thesis entitled “Contributions to English to Hindi


Machine Translation Using Example-Based Approach” submitted by Ms.

Deepa Gupta to the Department of Mathematics, Indian Institute of Technology


Delhi, for the award of the degree of Doctor of Philosophy, is a record of bona fide
research work carried out by her under my guidance and supervision.

The thesis has reached the standards fulfilling the requirements of the regulations
relating to the degree. The work contained in this thesis has not been submitted to
any other university or institute for the award of any degree or diploma.

Dr. Niladri Chatterjee


Assistant Professor

Department of Mathematics
Indian Institute of Technology Delhi
Delhi (INDIA)
Acknowledgement

If I were to say that this thesis is mine alone, it would be untrue. It is a dream
come true, and some wonderful people helped turn this dream into the product that
you are holding in your hand. I would like to thank all of them, and in particular:

Dr. Niladri Chatterjee - mentor, guru and friend - taught me the basics of research
and stayed with me right till the end. His efforts, comments, advice and ideas
developed my thinking and improved my presentation. Without his constant
encouragement, keen interest, inspiring criticism and invaluable guidance, I
would not have accomplished my work. I admit that his efforts deserve much more
acknowledgement than is expressed here.

I acknowledge and thank the Indian Institute of Technology Delhi and the Tata
Infotech Research Lab, which funded this research. I sincerely thank all the faculty
members of the Department of Mathematics; in particular, I express my gratitude to
Prof. B. Chandra and Dr. R. K. Sharma for their continuous moral support and help. I
thank my SRC members, Prof. Saroj Kaushik and Prof. B. R. Handa, for their time
and efforts. I also thank the department administrative staff for their assistance. I
extend my thanks to Prof. R. B. Nair and Dr. Wagish Shukla of IIT Delhi, and
Prof. Vaishna Narang, Prof. P. K. Pandey, Prof. G. V. Singh, Dr. D. K. Lobiyal,
and Dr. Girish Nath Jha of Jawaharlal Nehru University Delhi, for the enlightening
discussions on the basics of languages.

I would like to express my sincere thanks to my friends Priya and Dharmendra


for many fruitful discussions regarding my research problem. I thank Mr. Gaurav
Kashyap for helping me in the implementation of the algorithms. In particular, I
would like to thank Inderdeep Singh, for his help in writing some part of the thesis.
I want to give special thanks to my friends, Sonia, Pranita and Nutan, for helping
me in both good and bad times. I would like to thank Prabhakhar for his brotherly

support. I extend my thanks to Manju, Anita, Sarita, Subhashini and Anju for
cheering me, always.

Shailly and Geeta - amazing friends who read the manuscript and gave honest
comments. Both of them also stayed with me through the process, and handled me,
and sometimes my out-of-control emotions, so well. I especially wish to thank Geeta
for letting me stay in her hostel room, and for her wonderful help when I fractured
my leg, though we had known each other for only a month. I wish to acknowledge
Krishna for his constant help, both academic and non-academic, and his continuous
encouragement.

I convey my sincere regards to my parents and brothers for the sacrifices they have
made, the patience they have shown, and the love and blessings they have showered
on me. I thank Arun for his moral support. Most important of all, I would like
to express my profound sense of gratitude and appreciation to my sister Neetu. Her
irrational and unbreakable belief in me bordered on craziness at times.

I cannot avoid mentioning my friend Sharad, who deserves more than a little
acknowledgement. His constant inspiration and untiring support have sustained my
confidence throughout this work.

Finally, I thank GOD for everything.

Deepa Gupta
Abstract

This research focuses on the development of an Example-Based Machine Translation
(EBMT) system for English to Hindi. Development of a machine translation (MT)
system typically demands a large volume of computational resources. For example,
rule-based MT systems require extraction of syntactic and semantic knowledge in
the form of rules, while statistics-based MT systems require a huge parallel corpus
containing sentences in the source language and their translations in the target
language. The requirement for such computational resources is much lower for EBMT.
This makes development of EBMT systems feasible for English to Hindi translation,
where availability of large-scale computational resources is still scarce. The
primary motivation for this work comes from the following:

a) Although a small number of English to Hindi MT systems are already available,
the outputs they produce are not always of high quality. Through this work we
intend to analyze the difficulties that lead to this below-par performance,
and try to provide some solutions for them.

b) There are several other major languages (e.g., Bengali, Punjabi, Gujarati) in
the Indian subcontinent. Demand for developing MT systems from English to
these languages is increasing rapidly, but at the same time, development of
computational resources in these languages is still in its infancy. Since many
of these languages are similar to Hindi, both syntactically and lexically,
the research carried out here should also help in developing MT systems from
English to these languages.

The major contributions of this research may be described as follows:

1) Development of a systematic adaptation scheme. We propose an adaptation
scheme consisting of ten basic operations. These operations work not only at
the word level, but at the suffix level as well. This makes adaptation less
expensive in many situations.

2) Study of Divergence. We observe that the occurrence of divergence causes major
difficulty for any MT system. In this work we make an in-depth study of the
different types of divergence, and categorize them.

3) Development of a Retrieval scheme. We propose a novel approach for measuring
similarity between sentences. We suggest that a retrieval strategy for an EBMT
system is most efficient when it measures similarity on the basis of the cost
of adaptation. In this work we provide a complete framework for an efficient
retrieval scheme on the basis of our studies on “divergence” and “cost of
adaptation”.

4) Dealing with Complex sentences. Handling complex sentences is generally
considered difficult for an MT system. In this work we propose a “split
and translate” technique for translating complex sentences within an EBMT
framework.
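To make the idea of contribution 3 concrete, ranking stored examples by how cheaply they can be adapted can be sketched as follows. This is a toy illustration only: it uses plain word-level edit distance as a stand-in cost, whereas the scheme developed in this thesis assigns differentiated costs to the ten word- and suffix-level adaptation operations of contribution 1. All function names and example sentences are illustrative, not taken from the thesis.

```python
# Toy sketch of cost-of-adaptation based retrieval. The real scheme derives
# costs from word- and suffix-level adaptation operations; here a plain
# word-level edit distance stands in for the adaptation cost.
def adaptation_cost(input_tokens, example_tokens):
    """Edit distance over word tokens: one unit per insertion, deletion
    or substitution."""
    m, n = len(input_tokens), len(example_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if input_tokens[i - 1] == example_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a word
                          d[i][j - 1] + 1,        # insert a word
                          d[i - 1][j - 1] + sub)  # substitute / match
    return d[m][n]

def retrieve(input_sentence, example_base):
    """Return the (SL, TL) example whose estimated adaptation cost with
    respect to the input sentence is lowest."""
    inp = input_sentence.lower().split()
    return min(example_base,
               key=lambda ex: adaptation_cost(inp, ex[0].lower().split()))
```

For instance, with a base containing “He reads books” and “Sita sings ghazals”, the input “He reads novels” retrieves “He reads books” (one substitution), which the adaptation stage would then repair.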

We feel that the overall scheme proposed in this research will pave the way for
developing an efficient EBMT system for translating from English to Hindi. We
hope that this research will also help the development of MT systems from English
to other languages of the Indian subcontinent.

Contents

1 Introduction 1

1.1 Description of the Work Done and Summary of the Chapters . . . . . 6

1.2 Some Critical Points . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 Adaptation in English to Hindi Translation: A Systematic Approach 23

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2 Description of the Adaptation Operations . . . . . . . . . . . . . . . 29

2.3 Study of Adaptation Procedure for Morphological Variation of Active

Verbs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.3.1 Same Tense Same Verb Form . . . . . . . . . . . . . . . . . . 38

2.3.2 Different Tenses Same Verb Form . . . . . . . . . . . . . . . . 42

2.3.3 Same Tense Different Verb Forms . . . . . . . . . . . . . . . . 46

2.3.4 Different Tenses Different Verb Forms . . . . . . . . . . . . . . 48

2.4 Adaptation Procedure for Morphological Variation of Passive Verbs . 51

2.5 Study of Adaptation Procedures for Subject/ Object Functional Slot 56

2.5.1 Adaptation Rules for Variations in the Morpho Tags of @DN> 59



2.5.2 Adaptation Rules for Variations in the Morpho Tags of @GN> 60

2.5.3 Adaptation Rules for Variations in the Morpho Tags of @QN . 64

2.5.4 Adaptation Rules for Variations in the Morpho Tags of Pre-modifier Adjective @AN> . . . . . . . . . . . . . . . . . . . . 64

2.5.5 Adaptation Rules for Variations in the Morpho Tags of @SUB 69

2.6 Adaptation of Interrogative Words . . . . . . . . . . . . . . . . . . . 73

2.7 Adaptation Rules for Variation in Kind of Sentences . . . . . . . . . . 83

2.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

3 An FT and SPAC Based Divergence Identification Technique From Example Base 87

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

3.2 Divergence and Its Identification: Some Relevant Past Work . . . . . 89

3.3 Divergences and Their Identification in English to Hindi Translation . 96

3.3.1 Structural Divergence . . . . . . . . . . . . . . . . . . . . . . . 97

3.3.2 Categorial Divergence . . . . . . . . . . . . . . . . . . . . . . 100

3.3.3 Nominal Divergence . . . . . . . . . . . . . . . . . . . . . . . 104

3.3.4 Pronominal Divergence . . . . . . . . . . . . . . . . . . . . . . 107

3.3.5 Demotional Divergence . . . . . . . . . . . . . . . . . . . . . . 111

3.3.6 Conflational Divergence . . . . . . . . . . . . . . . . . . . . . 117

3.3.7 Possessional Divergence . . . . . . . . . . . . . . . . . . . . . 121

3.3.8 Some Critical Comments . . . . . . . . . . . . . . . . . . . . . 131


3.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

4 A Corpus-Evidence Based Approach for Prior Determination of Divergence 135

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

4.2 Corpus-Based Evidences and Their Use in Divergence Identification . 136

4.2.1 Roles of Different Functional Tags . . . . . . . . . . . . . . . . 138

4.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . 147

4.4 Illustrations and Experimental Results . . . . . . . . . . . . . . . . . 155

4.4.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

4.4.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

4.4.3 Illustration 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

4.4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 166

4.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

5 A Cost of Adaptation Based Scheme for Efficient Retrieval of Translation Examples 171

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

5.2 Brief Review of Related Past Work . . . . . . . . . . . . . . . . . . . 171

5.3 Evaluation of Cost of Adaptation . . . . . . . . . . . . . . . . . . . . 178

5.3.1 Cost of Different Adaptation Operations . . . . . . . . . . . . 182

5.4 Cost Due to Different Functional Slots and Kind of Sentences . . . . 185


5.4.1 Costs Due to Variation in Kind of Sentences . . . . . . . . . . 186

5.4.2 Cost Due to Active Verb Morphological Variation . . . . . . . 187

5.4.3 Cost Due to Subject/Object Functional Slot . . . . . . . . . . 192

5.4.4 Use of Adaptation Cost as a Measure of Similarity . . . . . . . 197

5.5 The Proposed Approach vis-à-vis Some Similarity Measurement Schemes . . . 198

5.5.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 198

5.5.2 Syntactic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 201

5.5.3 A Proposed Approach: Cost of Adaptation Based Similarity . 203

5.5.4 Drawbacks of the Proposed Scheme . . . . . . . . . . . . . . . 211

5.6 Two-level Filtration Scheme . . . . . . . . . . . . . . . . . . . . . . . 213

5.6.1 Measurement of Structural Similarity . . . . . . . . . . . . . . 214

5.6.2 Measurement of Characteristic Feature Dissimilarity . . . . . . 217

5.7 Complexity Analysis of the Proposed Scheme . . . . . . . . . . . . . 222

5.8 Difficulties in Handling Complex Sentences . . . . . . . . . . . . . . . 226

5.9 Splitting Rules for Converting Complex Sentence into Simple Sentences . . . 229

5.9.1 Splitting Rule for the Connectives “when”, “where”, “whenever” and “wherever” . . . . . . . . . . . . . . . . . . . . . . 231

5.9.2 Splitting Rule for the Connective “who” . . . . . . . . . . . . 241

5.10 Adaptation Procedure for Complex Sentence . . . . . . . . . . . . . . 253

5.10.1 Adaptation Procedure for Connectives “when”, “where”, “whenever” and “wherever” . . . . . . . . . . . . . . . . . . . . . . 254


5.10.2 Adaptation Procedure for Connective “who” . . . . . . . . . . 256

5.11 Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

5.11.1 Illustration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

5.11.2 Illustration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

5.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

6 Discussions and Conclusions 267

6.1 Goals and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

6.2 Contributions Made by This Research . . . . . . . . . . . . . . . . . . 268

6.3 Possible extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

6.4 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

6.4.1 Pre-editing and Post-editing . . . . . . . . . . . . . . . . . . . 274

6.4.2 Evaluation Measures of Machine Translation . . . . . . . . . . 276

Appendices 280

A 281

A.1 English and Hindi Language Variations . . . . . . . . . . . . . . . . . 281

A.2 Verb Morphological and Structure Variations . . . . . . . . . . . . . . 285

A.2.1 Conjugation of Root Verb . . . . . . . . . . . . . . . . . . . . 286

B 291

B.1 Functional Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291

B.2 Morpho Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294


C 299

C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures . . . 299

D 303

D.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

E 305

E.1 Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . . 305

Bibliography 308

List of Figures

1.1 An Example Sentence with Its Morpho-Functional Tags . . . . . . . . 20

2.1 The five possible scenarios in the SL → SL’ → TL’ interface of partial
case matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2 Example of Different Adaptation Operations . . . . . . . . . . . . . . 34

2.3 Some Typical Sentence Structures . . . . . . . . . . . . . . . . . . . . 83

3.1 Algorithm for Identification of Structural Divergence . . . . . . . . . 99

3.2 Correspondence of SPACs of E and H for Identification of Structural


Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

3.3 Algorithm for Identification of Categorial Divergence . . . . . 103

3.4 Correspondence of SPACs for the Categorial Divergence Example of

Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

3.5 Algorithm for Identification of Nominal Divergence . . . . . . 106

3.6 Correspondence of SPAC E and SPAC H of Nominal Divergence of


Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

3.7 Algorithm for Identification of Pronominal Divergence . . . . 110



3.8 Correspondence of SPAC E and SPAC H of Pronominal Divergence


of Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

3.9 Algorithm for Identification of Demotional Divergence . . . . . . . . . 114

3.10 Correspondence of SPAC E and SPAC H for Demotional Sub-type 4 115

3.11 SPAC Correspondence for Demotional Divergence of Sub-type 1 . . . 116

3.12 Algorithm for Identification of Conflational Divergence . . . . . . . . 120

3.13 Correspondence of SPAC E and SPAC H for Conflational Divergence


of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

3.14 Algorithm for Identification of Possessional Divergence . . . . . . . . 129

3.15 Correspondence of SPAC E and SPAC H for Possessional Divergence

of Sub-type 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

3.16 Correspondence of SPAC E and SPAC H for Possessional Divergence


of Sub-type 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

4.1 Schematic Diagram of the Proposed Algorithm . . . . . . . . . . . . . 153

4.2 Continuation of the Figure 4.1 . . . . . . . . . . . . . . . . . . . . . . 154

5.1 Schematic View of Module 1 for Identification of Complex Sentence


with Connective any of “when”, “where”, “whenever”, or “wherever” . 232

5.2 Schematic View of Module 2 . . . . . . . . . . . . . . . . . . . . . . . 237

5.3 Schematic View of Module 3 . . . . . . . . . . . . . . . . . . . . . . . 240

5.4 Schematic View of Module 1 for Identification of Complex Sentence

with Connective “who” . . . . . . . . . . . . . . . . . . . . . . . . . . 244


5.5 Schematic View of the SUBROUTINE SPLIT . . . . . . . . . . . . . 246

5.6 Schematic View of Module 2 . . . . . . . . . . . . . . . . . . . . . . . 247

5.7 Schematic View of Module 3 . . . . . . . . . . . . . . . . . . . . . . . 249

5.8 Schematic View of Module 4 . . . . . . . . . . . . . . . . . . . . . . . 250

List of Tables

1.1 Output of “AnglaHindi” and “Shakti” MT System . . . . . . . . . . 5

2.2 Notations Used in Sentence Patterns . . . . . . . . . . . . . . . . . . 35

2.3 Adaptation Operations of Verb Morphological Variations in Present


Indefinite to Present Indefinite . . . . . . . . . . . . . . . . . . . . . . 39

2.4 Adaptation Operations of Verb Morphological Variations in Present

Indefinite to Past Indefinite . . . . . . . . . . . . . . . . . . . . . . . 44

2.5 Different Functional Tags Under the Functional Slot <S> or <O> . . 56

2.6 Different Possible Morpho Tags for Each of the Functional Tag under
the Functional Slot <S> or <O> . . . . . . . . . . . . . . . . . . . . 58

2.8 Adaptation Operations for Genitive Case to Genitive Case . . . . . . 62

2.10 Adaptation Operations for Pre-modifier Adjective to Pre-modifier Adjective . . . 67

2.11 Adaptation Operations for Subject to Subject Variations . . . . . . . 71

2.12 Different Sentence Patterns of Interrogative Words . . . . . . . . . . . 77



2.13 Functional & Morpho Tags Corresponding to Each Interrogative Sentence Pattern . . . 78

2.14 Adaptability Rules for Group G5 Sentence Patterns . . . . . . . . . . 83

2.15 Adaptation Rules for Variation in Kind of Sentences . . . . . . . . . . 84

3.1 Different Semantic Similarity Scores of “shock” with “trouble” and “panic” . . . 88

4.1 FT-features Instrumental for Creating Divergence . . . . . . . . . . . 138

4.2 Relevance of FT-features in Different Divergence Types . . . . . . . . 139

4.3 FT of the Problematic Words for Each Divergence Type . . . . . . . 142

4.4 Frequency of Words in Different Sections . . . . . . . . . . . . . . . . 144

4.5 PSD/NSD Schematic Representations . . . . . . . . . . . . . . . . . . 145

4.6 Values of s(di ) and m(di ) for Illustration 3 . . . . . . . . . . . . . . . 160

4.7 Some Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

4.8 Continuation of Table 4.7 . . . . . . . . . . . . . . . . . . . . . . . . 165

4.9 Results of Our Experiments . . . . . . . . . . . . . . . . . . . . . . . 166

5.1 Cost Due to Variation in Kind of Sentences . . . . . . . . . . . . . . . 187

5.2 Cost Due to Verb Morphological Variation Present Indefinite to Present

Indefinite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

5.3 Adaptation Operations of Verb Morphological Variation Present Indefinite to Past Indefinite . . . 192

5.4 Costs Due to Adapting Genitive Case to Genitive Case . . . . . . . . 195


5.5 Cost of Adaptation Due to Subject/Object to Subject/Object . . . . 197

5.6 Best Five Matches by Using Semantic Similarity for the Input Sen-
tence “I work.” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

5.7 Best Five Matches by Using Semantic Similarity for the Input Sen-
tence “Sita sings ghazals.” . . . . . . . . . . . . . . . . . . . . . . . . 201

5.8 Weighting Scheme for Different POS and Syntactic Role . . . . . . . 202

5.9 Best Five Matches by Syntactic Similarity for the Input Sentence “I

work.” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

5.10 Best Five Matches by Syntactic Similarity for the Input Sentence “Sita
sings ghazals.” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

5.11 Functional-morpho Tags for the Input English Sentence (IE) and the
Retrieved English Sentence (RE) . . . . . . . . . . . . . . . . . . . . 204

5.12 Retrieval on the Basis of Cost of Adaptation Based Scheme for the
Input Sentence “I work.” . . . . . . . . . . . . . . . . . . . . . . . . . 207

5.13 Retrieval on the Basis of Cost of Adaptation Based Similarity for the

Input Sentence “Sita sings ghazals.” . . . . . . . . . . . . . . . . . . . 207

5.14 Cost of Adaptation for Retrieved Best Five Matches for the Input
Sentence “I work.” by Using Semantic and Syntactic Based Similarity
Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

5.15 Cost of Adaptation for Retrieved Best Five Matches for the Input
Sentence “Sita sings ghazals” by Using Semantic and Syntactic based

Similarity Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

5.16 Weights Used for Characteristic Features . . . . . . . . . . . . . . . . 220


5.17 Notation Used in the Complexity Analysis . . . . . . . . . . . . . . . 222

5.19 Typical Examples of Complex Sentence with Connective “when”, “where”, “whenever” or “wherever” Handled by Module 2 . . . 235

5.20 Typical Examples of Complex Sentence with Connective “when”, “where”,

“whenever” or “wherever” Handled by Module 3 . . . . . . . . . . . . 239

5.21 Typical Complex Sentences with Relative Adverb “who” Handled by


Module 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

5.22 Typical Complex Sentences with Relative Adverb “who” Handled by


Module 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

5.23 Typical Complex Sentences with Relative Adverb “who” Handled by


Module 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

5.24 Hindi Translation of Relative Adverbs . . . . . . . . . . . . . . . . . . 254

5.25 Patterns of Complex Sentence with Connective “when”, “where”,


“whenever” and “wherever” . . . . . . . . . . . . . . . . . . . . . . . . 255

5.26 Patterns of Complex Sentence with Connective “who” . . . . . . . . . 257

5.27 Five Most Similar Sentences for RC “You go to India.” Using Cost of Adaptation Based Scheme . . . 261

5.28 Five Most Similar Sentences for MC “You should speak Hindi.” Using Cost of Adaptation Based Scheme . . . 261

5.29 Five Most Similar Sentences for RC “He wants to learn Hindi.” Using Cost of Adaptation Based Scheme . . . 263

5.30 Five Most Similar Sentences for MC “The student should study this book.” Using Cost of Adaptation Based Scheme . . . 263


A.2 Different Case Ending in Hindi . . . . . . . . . . . . . . . . . . . . . 283

A.3 Suffixes and Morpho-Words for Hindi Verb Conjugations . . . . . . . 286

A.4 Verb Morphological Changes From English to Hindi Translation . . . 288

E.1 Costs Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective . . . 307

Chapter 1

Introduction

Machine Translation (MT) is the process of translating text units of one language
(source language) into a second language (target language) using computers. The
need for MT is felt acutely in the modern age owing to the globalization of
information, where a global information base needs to be accessed from different
parts of the world. Although most of this information is available online, the
major difficulty in dealing with it is that its language is primarily English.
From science, technology and education to gadget manuals and commercial
advertisements, the predominant presence of English as the medium of communication
is easily observed. The world, however, is multilingual, with different languages
spoken in different regions. This necessitates the development of good MT systems
for translating these works into other languages, so that a larger population can
access, retrieve and understand them. Consequently, in a country like India, where
English is understood by less than 3% of the population (Sinha and Jain, 2003),
the need for MT systems that translate from English into native Indian languages
is very acute. In this work we look into different aspects of designing an
English to Hindi MT system using the Example-Based (Nagao, 1984) technique. Two
fundamental questions that we feel we should answer at this point are:

• The rationale behind choosing Example-Based Machine Translation (EBMT)

as the paradigm of interest;

• The reason behind selecting Hindi as the preferred language.

Below we provide justifications behind these choices.

Development of MT systems has taken a big leap in the last two decades. Typically,
machine translation requires handcrafted, complicated and large-scale knowledge
(Sumita and Iida, 1991). Various MT paradigms have evolved depending on how the
translation knowledge is acquired and used. For example:

1. Rule-Based Machine Translation (RBMT): Here, rules are used for the analysis
and representation of the “meaning” of source language texts, and for the
generation of equivalent target language texts (Grishman and Kosaka, 1992),
(Thurmair, 1990), (Arnold and Sadler, 1990).

2. Statistical- (or Corpus-) Based Machine Translation (SBMT): Statistical
translation models are trained on a sentence-aligned translation corpus, using
n-gram modelling and the probability distribution of the occurrence of
source-target sentence pairs in a very large corpus. This technique was
proposed by IBM in the early 1990s (Brown, 1990), (Brown et al., 1992),
(Brown et al., 1993), (Germann, 2001).
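The statistical approach cited above is commonly formulated as a noisy-channel model. The following is the standard textbook rendering (not quoted verbatim from the cited papers), for a source sentence $s$ and candidate translations $t$:

```latex
\hat{t} \;=\; \operatorname*{arg\,max}_{t} P(t \mid s)
        \;=\; \operatorname*{arg\,max}_{t} \, P(s \mid t)\, P(t),
```

where the language model $P(t)$ is typically an n-gram model estimated from target-language text, and the translation model $P(s \mid t)$ is estimated from the sentence-aligned parallel corpus.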

However, these techniques have their own drawbacks. The main drawback of
RBMT systems is that sentences in any natural language may assume a large variety
of structures, and machine translation often suffers from ambiguities of various
types (Dorr et al., 1998). As a consequence, translation from one natural language
into another requires enormous knowledge about the syntax and semantics of
both the source and target languages, and capturing all this knowledge in rule form
is a daunting, if not impossible, task. SBMT techniques, on the other hand, depend
on how accurately the various probabilities are estimated. Realistic estimates of
these probabilities can be made only if a large volume of parallel corpus is
available, and such huge data is not easy to obtain. Consequently, this scheme is
viable for only a small number of language pairs.


Example-Based Machine Translation (Nagao, 1984), (Carl and Way, 2003) makes
use of past translation examples to generate the translation of a given input. An
EBMT system stores in its example base translation examples between two languages,
the source language (SL) and the target language (TL). These examples are
subsequently used as guidance for future translation tasks. In order to translate a
new input sentence in SL, a similar SL sentence (sometimes more than one) is
retrieved from the example base, along with its translation in TL. This example is
then adapted suitably to generate a translation of the given input. EBMT has been
found to have several advantages over other MT paradigms (Sumita and Iida, 1991):

1. It can be upgraded easily by adding more examples to the example base;

2. It utilizes translators’ expertise, and adds a reliability factor to the translation;

3. It can be accelerated easily by indexing and parallel computing;

4. It is robust because of best-match reasoning.
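The retrieve-then-adapt workflow described above can be summarized in a short sketch. The similarity measure, function names and example pairs here are illustrative placeholders only; the thesis later argues for replacing word-overlap similarity with a cost-of-adaptation based measure.

```python
# Toy sketch of the EBMT workflow: retrieve the most similar stored example,
# then hand its target side to an adaptation stage.
def similarity(a, b):
    """Jaccard overlap of the word sets of two sentences (placeholder
    measure; not the similarity scheme proposed in this thesis)."""
    wa = set(a.lower().replace(".", "").split())
    wb = set(b.lower().replace(".", "").split())
    return len(wa & wb) / max(len(wa | wb), 1)

def ebmt_translate(input_sentence, example_base):
    """Retrieve the closest SL example; its TL side is the candidate
    translation that the adaptation stage would then modify."""
    src, tgt = max(example_base, key=lambda ex: similarity(input_sentence, ex[0]))
    # A full system would now adapt `tgt` (word/suffix substitutions) to
    # account for the differences between `input_sentence` and `src`.
    return src, tgt
```

The example base grows simply by appending new (SL, TL) pairs, which is exactly the easy upgradability listed as advantage 1 above.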

Other researchers too (e.g., (Somers, 1999), (Kit et al., 2002)) have considered
EBMT to be one of the major and effective approaches among the different MT
paradigms, primarily because it exploits the linguistic knowledge stored in an
aligned text more efficiently.

We infer from the above observations that EBMT should be one of the preferred
approaches for developing MT systems from English to Indian languages. This is
because a significant volume of parallel corpus is available between English
and different Indian languages in the form of government notices, translated books,
advertisement material, etc. Although this data is generally not yet available in
electronic form, converting it into machine-readable form is much easier than
formulating explicit translation rules as required by an RBMT system. In fact, some
parallel data in electronic form has already been made available through projects
such as EMILLE (http://www.emille.lancs.ac.uk/home.html). Also, there has been
concerted effort from various government organizations, such as TDIL [2], CIIL
Mysore [3] and C-DAC Noida [4] (Vikas, 2001), and from various institutes, e.g.,
IIT Bombay [5], IIT Kanpur [6] and LTRC (IIIT Hyderabad) [7], to develop linguistic
resources. At the same time,
this data is not large enough for designing an English to Hindi SBMT system, which
typically requires several hundred thousand sentences. These resources, we hope,
will be fruitfully utilized for developing different EBMT systems involving Indian
languages.

Among the different Indian languages [8], Hindi has some major advantages over the
others as far as working on MT is concerned. Not only is Hindi the national language
of India, it is also the most widely used of all Indian languages. All the major
works reported so far for Indian languages, e.g., ANGLAHINDI (Sinha et al., 2002),
SHIVA (http://shiva.iiit.net/), SHAKTI (Sangal, 2004) and MaTra (human-aided
MT) [9], have English and Hindi as their preferred languages. In 2003, Hindi was
chosen as the “surprise language” (Oard, 2003) by DARPA. As a consequence,
different universities (e.g., CMU, Johns Hopkins, USC-ISI) have invested effort in
developing MT systems involving Hindi.
[2] http://tdil.mit.gov.in/
[3] http://www.ciil.org/
[4] http://www.cdacnoida.com/
[5] http://www.cfilt.iitb.ac.in
[6] http://www.cse.iitk.ac.in/users/isciig/
[7] http://ltrc.iiit.net/
[8] India has 17 official languages and more than 1000 dialects (http://azaz.essortment.com/languagesindian rsbo.htm)
[9] http://www.ncst.ernet.in/matra/about.shtml

Chapter 1. Introduction

This world-wide popularity of the language makes the study of English to Hindi machine translation more meaningful in today's context.

One major advantage of having the above-mentioned English to Hindi translation systems available on-line is that it enabled us to work with the systems and examine the quality of their outputs. In this respect, we find that the outputs given by the above systems are not always correct translations of the inputs. Table 1.1 illustrates this with respect to the systems “AnglaHindi” and “Shakti”. In this table we show the translations produced by the two systems for different inputs, together with the correct translations of these sentences.

Input Sentence         | Output of “AnglaHindi”          | Output of “Shakti”                  | Actual Translation
-----------------------|---------------------------------|-------------------------------------|--------------------------------
Ram married Sita.      | raam ne siita vivahaa kiyaa     | raam ne siitaa vivaaha kiyaa        | raam ne siitaa se vivaaha kiyaa
Fan is on.             | pankhaa ho par                  | pankhaa lagaataar hai               | pankhaa chal rahaa hai
This dish tastes good. | yaha vyanjan achchhaa hotaa hai | yah thalii achchhaa swaad letii hai | iss vyanjan kaa swaad achchhaa hai
The soup lacks salt.   | soop namak kam hotaa hai        | shorbaa namak kamii hai             | soop mein namak kam hai
It is raining.         | yah varshaa ho rahii hai        | yah varshaa ho rahii hai            | varshaa ho rahii hai
They have a big fight. | unke paas eka badhii ladaae hai | unke badhii ladaaiyaan hain         | unkii ghamasan ladaii huii

Table 1.1: Output of “AnglaHindi” and “Shakti” MT Systems


We have found many such instances where the outputs produced by the systems may not be considered correct Hindi translations of the respective inputs. This observation prompts us to study different aspects of English to Hindi translation in order to understand the difficulties in machine translation, particularly with respect to English to Hindi, and how these shortcomings can be dealt with under an EBMT framework. This research is concerned with the above studies.

1.1 Description of the Work Done and Summary

of the Chapters

The success of an EBMT system rests on two different modules: (i) similarity measurement and retrieval, and (ii) adaptation. Retrieval is the procedure by which a suitable translation example is retrieved from a system's example base. Adaptation is the procedure by which a retrieved translation is modified to generate the translation of the given input. Various retrieval strategies have been developed (e.g. (Nagao, 1984), (Sato, 1992), (Collins and Cunningham, 1996)). All these retrieval strategies aim at retrieving an example from the example base such that the retrieved example is similar to the input sentence. This is due to the fact that the fundamental intuition behind EBMT is that translations of similar sentences of the source language will be similar in the target language as well. Thus the concept of retrieval is intricately related to the concept of “similarity measurement” between sentences.

But the main difficulty with respect to this assumption is that there is no straightforward way to measure similarity between sentences. Different works have defined different approaches for measuring similarity between sentences, for example, word-based metrics (e.g. (Nirenburg, 1993), (Nagao, 1984)), character-based metrics (e.g. (Sato, 1992)), syntactic/semantic-based matching (e.g. (Manning and Schutze, 1999)), DP-matching between word sequences (e.g. (Sumita, 2001)) and hybrid retrieval schemes (e.g. (Collins, 1998)).

In all these works “similarity measurement” and “adaptation” are considered in isolation. This, we feel, is a major hindrance with respect to EBMT. In this work we therefore propose a novel approach for measuring similarity: we look at similarity from the point of view of adaptation. We suggest that a past example be considered the most similar to an input sentence if its adaptation towards generating the desired translation is the simplest. The work carried out in this research is aimed at achieving this goal. Our studies therefore start in the following way. We first look at adaptation in detail. An efficient adaptation scheme is very important for an EBMT system because even a very large example base cannot, in general, guarantee an exact match for a given input sentence. As a consequence, the need arises for an efficient and systematic adaptation scheme for modifying a retrieved example, and thereby generating the required translation. Various adaptation schemes have been proposed in the literature, e.g. (Veale and Way, 1997), (Shiri et. al., 1997), (Collins, 1998) and (McTait, 2001). A scrutiny of these schemes suggests that primarily there are four basic adaptation operations: word addition, word deletion, word replacement and copy.

In our approach we started with these basic operations: word addition, word deletion, word replacement and copy. However, in this respect we notice the following:

1. Both English and Hindi rely heavily on suffixes for morphological changes. There are a number of suffixes for achieving declension of verbs and nouns. Further, in Hindi there are situations when morphological changes in the adjectives are also required, depending upon the number and gender of the corresponding noun/pronoun. Since the number of suffixes is limited, we feel that if adaptation operations are focused on the suffixes instead of being purely word-based, then in many situations a significant amount of computational effort may be saved.

2. A further observation with respect to Hindi is that there are situations when, instead of suffixes, whole words are used for bringing in morphological variations. For example, the present continuous form of Hindi verbs is: <Root form of the verb> + <rahaa/rahii/rahe> + <hai/hain/ho>. Here the words “rahaa”, “rahii” or “rahe” are used to achieve the morphological variation; which of these is used depends upon the number and gender of the subject. Similarly, “hai”, “hain” and “ho” are used depending upon the number and person of the subject. We term these words “morpho-words”. Appendix A gives details of different Hindi morpho-words and their usages.
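The mechanical nature of this choice can be sketched as a small lookup. The following is a deliberately simplified illustration (not the thesis implementation; honorific and second-person plural forms are ignored):

```python
# Hedged sketch: selecting the Hindi present-continuous morpho-words,
# <verb root> + <rahaa/rahii/rahe> + <hai/hain/ho>, from the number,
# gender and person of the subject. The table is a simplification.
ASPECT = {("m", "sg"): "rahaa", ("m", "pl"): "rahe",
          ("f", "sg"): "rahii", ("f", "pl"): "rahii"}

def present_continuous(root, gender, number, person=3):
    aspect = ASPECT[(gender, number)]
    aux = "ho" if person == 2 else ("hai" if number == "sg" else "hain")
    return f"{root} {aspect} {aux}"

print(present_continuous("khaa", "m", "sg"))   # -> khaa rahaa hai
print(present_continuous("khaa", "f", "pl"))   # -> khaa rahii hain
```

Because the morpho-words form a small closed class, an edit over them never needs a dictionary lookup, which is what makes them attractive targets for adaptation operations.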

A major fallout of the above observation is that in some situations adaptation may be carried out by dealing with the morpho-words, which is computationally much less expensive than dealing with constituent words as a whole. Thus we propose an adaptation scheme consisting of ten operations: addition, deletion and replacement of constituent words; addition, deletion and replacement of morpho-words; addition, deletion and replacement of suffixes; and copy. Chapter 2 of the thesis discusses these adaptation operations in detail.

One point we notice, however, is that the above-mentioned operations cannot deal with translation divergences in an efficient way. Divergence occurs “when structurally similar sentences of the source language do not translate into sentences that are similar in structures in the target language” (Dorr, 1993). We therefore feel that the study of divergence is an important aspect of any MT system. With respect to an EBMT system the need arises for two reasons:

• The past example that is retrieved for carrying out the task of adaptation has a normal translation, but the translation of the input sentence involves divergence.

• The translation of the retrieved example involves divergence, whereas the input sentence requires a normal translation.

In this work we have made an in-depth study of divergence with respect to English to Hindi translation. In this regard one may note that divergence is a highly language-dependent phenomenon. Its nature may change with the source and target languages under consideration. Although divergence has been studied extensively with respect to translation between European languages (e.g. (Dorr et. al., 2002), (Watanabe et. al., 2000)), very few studies on divergence may be found regarding translations involving Indian languages. The only work that came to our notice is (Dave et. al., 2002). In this work the authors have followed the classifications given in (Dorr, 1993) and tried to find examples of each of them with respect to English to Hindi translation. In this regard it may be noted that Dorr has described seven different divergence types with respect to translations between European languages: structural, categorical, conflational, promotional, demotional, thematic and lexical.

However, we find that not all the divergence types explained in Dorr's work apply to Indian languages. In fact, we found very few (if any) examples of “thematic” and “promotional” divergence with respect to English to Hindi translation. On the other hand, we identified three new types of divergence that have not so far been cited in any other work on divergence. We have named these divergences “nominal”, “pronominal” and “possessional”, respectively. We have further observed that all the divergence types (barring “structural”) for which we found instances in English to Hindi translation may be further divided into several sub-categories. Chapter 3 explains in detail the different divergence types and their sub-types that we have observed with respect to English to Hindi translation, and illustrates them with suitable examples. Some of these results have already been presented in (Gupta and Chatterjee, 2003a) and (Gupta and Chatterjee, 2003b).

The presence of divergence examples in the example base makes straightforward application of the above-mentioned adaptation scheme difficult. As mentioned earlier, application of the operations discussed in Chapter 2 will not be able to generate the correct translation if the input sentence requires a normal translation whereas the translation of the retrieved example involves divergence, or vice versa. To overcome this difficulty we suggest that the example base be partitioned into two parts: one containing examples of normal translation, the other containing the examples of divergence, so that, given an input sentence, an EBMT system may retrieve an example from the appropriate part of the example base. However, implementation of the above scheme requires the design of algorithms for:

1) Partitioning the example base sentences.

2) Designing an efficient retrieval policy.

We attempt to answer the first one by designing algorithms for identification of translation divergence, i.e. if an English sentence and its Hindi translation are given as input, these algorithms will detect whether the translation involves any of the said types of divergence. The remaining part of Chapter 3 discusses the different algorithms that we developed for identification of divergence from a given English-Hindi pair of sentences. The identification algorithms designed by us consider the Functional Tags (FTs10) of the constituent words and the Syntactic Phrasal Annotated Chunks (SPACs11) of the SL and TL sentences. When these two do not match for a source language sentence and its translation in the TL, a divergence can be identified. With respect to each divergence category and its sub-categories we have identified the appropriate FTs and SPACs whose presence/absence indicates the possibility of a certain divergence. By systematically analyzing the FTs and SPACs of the English sentence and its Hindi translation the algorithms arrive at a decision on whether the translation involves any divergence. Thus the algorithm partitions the example base into two parts: Normal Example Base and Divergence Example Base. Some of these algorithms have already been presented in (Gupta and Chatterjee, 2003b).
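The partitioning step itself is simple once a divergence detector is available. A minimal sketch follows, in which the detector is only a placeholder standing in for the FT/SPAC-based identification algorithms of Chapter 3:

```python
# Sketch: split an example base into a Normal Example Base and a
# Divergence Example Base. identify_divergence() is a placeholder
# predicate for the FT/SPAC-based algorithms.
def partition_example_base(example_base, identify_divergence):
    normal, divergent = [], []
    for pair in example_base:                 # pair = (English, Hindi)
        (divergent if identify_divergence(pair) else normal).append(pair)
    return normal, divergent

# Toy usage with a dummy detector:
eb = [("The boy eats rice.", "ladkaa chaawal khaataa hai"),
      ("Ram married Sita.", "raam ne siitaa se vivaaha kiyaa")]
normal, divergent = partition_example_base(eb, lambda p: "vivaaha" in p[1])
print(len(normal), len(divergent))   # -> 1 1
```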

To answer the second question, we feel that, given an input sentence, if it can be decided a priori whether its translation will involve divergence, then the retrieval can be made accordingly. To handle the situation when the translation of the input sentence does not involve any divergence, we devise a cost-of-adaptation based two-level filtration scheme that enables quick retrieval from the normal example base12. Chapter 4 describes our scheme of retrieval from the divergence example base in situations involving divergence. Here our primary attempt is to develop a procedure that, given an input English sentence, can decide whether its Hindi translation will involve any type of divergence. Obviously, this decision has to be made before resorting to the actual translation. Hence we call it “prior identification” of divergence. The


10 Appendix B provides details on the FTs.
11 SPAC structure is discussed in detail in Appendix C.
12 This scheme is discussed in Chapter 5.


algorithm seeks evidence from the example base and WordNet. In this work we have used WordNet 2.013 to measure the semantic similarity of the constituent words of the input sentence with various words present in the example base sentences, in order to arrive at a decision in this regard. The scheme works in the following way. We first identified the roles of different Functional Tags (FTs) in causing divergence. We observe, with respect to the different divergence types and sub-types, that each FT may have one of the following three roles:

1) Its presence is mandatory for the corresponding divergence (sub-)type to occur;

2) Its absence is mandatory for the corresponding divergence (sub-)type to occur;

3) Occurrence/non-occurrence of the divergence (sub-)type is not influenced by the FT under consideration.

This knowledge is stored in the form of a table (Table 4.2) in Chapter 4. Given an input sentence, the scheme first determines its constituent FTs. We have used the ENGCG parser14 for parsing an input sentence and obtaining its FTs. This finding is then compared with the above-mentioned knowledge base (Table 4.2) to identify the set (D) of divergence types that may possibly occur in the translation of the sentence. Further investigation is carried out to discard elements from the set D, so that the divergence that may actually occur can be pin-pointed. In this respect we proceed in the following way. Corresponding to each divergence type we identify the functional tag that is at the root of causing the divergence. We call it the “problematic FT” corresponding to that particular divergence. Table 4.3 presents our findings in this regard. Corresponding to each possible divergence (as found in D) the scheme
in this regard. Corresponding to each possible divergence (as found in D) the scheme
13 http://www.cogsci.princeton.edu/cgi-bin/webwn
14 http://www.lingsoft.fi/cgi-bin/engcg


works as follows. It first retrieves from the input sentence the constituent word corresponding to the problematic FT of the divergence type under consideration. The semantic similarity of this word with the relevant words of the example base sentences is then computed, and proximity in this semantic distance is used as a yardstick for similarity measurement. Chapter 4 discusses this scheme in detail.
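The first filtering step can be pictured as a set comparison against the knowledge table. In the sketch below, the table entries and tag names are purely illustrative stand-ins for Table 4.2, not its actual contents:

```python
# Sketch: derive the candidate divergence set D for an input sentence
# from its functional tags. For each divergence type we record FTs
# whose presence is mandatory and FTs whose absence is mandatory.
# The entries here are invented for illustration only.
KNOWLEDGE = {
    "conflational": {"present": {"@-FMAINV"}, "absent": {"@OBJ"}},
    "categorical":  {"present": {"@PCOMPL-S"}, "absent": set()},
}

def candidate_divergences(input_fts):
    D = set()
    for div, cond in KNOWLEDGE.items():
        if cond["present"] <= input_fts and not (cond["absent"] & input_fts):
            D.add(div)
    return D

print(candidate_divergences({"@SUBJ", "@-FMAINV"}))   # -> {'conflational'}
```

The subsequent WordNet-based step would then try to discard members of D by comparing the word under the problematic FT with words seen in the divergence example base.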

Finally, in Chapter 5 we look at how the cost of adaptation may be used as a similarity measurement scheme. It has been stated that no unique definition of similarity exists for comparing sentences; similarity between sentences may be viewed from different perspectives. In this work we have first considered the two most general similarity schemes: “syntactic similarity” and “semantic similarity”. The ideas have been borrowed from the domain of Information Retrieval (Manning and Schutze, 1999). According to the definition given therein, semantic similarity is measured on the basis of the commonality of words: the more words two sentences have in common, the more similar the two sentences are said to be. However, it has been shown in (Chatterjee, 2001) that this measurement of similarity is not always helpful from the EBMT point of view. For example, it has been shown there that although the sentences “The horse had a good run.” and “The horse is good to run on.” have most of their key words in common, the structures of their Hindi translations are very different. Consequently, adapting the translation of one of them to generate the translation of the other is computationally demanding. On the other hand, syntactic similarity between two sentences is measured on the basis of the commonality of “morpho-functional tags” between them. In this case, adaptation may require a large number of constituent word replacement (WR) operations. Each of these WR operations involves reference to some dictionary for picking up the appropriate words in the target language. Typically the


dictionary access will involve accessing external storage, and will thereby incur significant computational cost. Thus a purely syntax-based similarity measurement scheme may not be suitable for an EBMT system.
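The word-overlap measure underlying the “semantic similarity” baseline can be sketched in a few lines. The Dice coefficient used here is one common choice, not necessarily the exact formula of the cited works:

```python
# Sketch: word-overlap similarity scored with the Dice coefficient.
# The same function gives a "syntactic" similarity score when fed
# sequences of morpho-functional tags instead of words.
def dice(a, b):
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

s1 = "the horse had a good run".split()
s2 = "the horse is good to run on".split()
print(round(dice(s1, s2), 2))   # -> 0.62: high word overlap, yet the
                                # Hindi translations differ structurally
```

The example pair above is exactly the one from (Chatterjee, 2001): a high overlap score that says nothing about how expensive the adaptation will be.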

In this work we therefore propose that, from the EBMT perspective, “retrieval” and “adaptation” should be looked at in a unified way. In this chapter (i.e. Chapter 5) we investigate the feasibility of the above proposal in depth. In this respect we first look closely into the overall adaptation operations. We have already observed that these operations are invoked successively to remove the discrepancies between the input sentence and the retrieved example. These discrepancies, as we observe, may be in the actual words, or in the overall structure of the sentences. For illustration, suppose the input sentence is “The boy eats rice everyday.”, whose Hindi translation “ladkaa har roz chaawal khaataa hai” has to be generated. The nature of the adaptation varies depending upon which example is retrieved from the example base. For illustration:

a) If the retrieved example is “The boy eats rice.”, the adaptation procedure needs to apply a constituent word addition (WA) operation to take care of the adverb “everyday”.

b) However, if the retrieved example is “The boy plays cricket everyday.” ∼ “ladkaa roz cricket kheltaa hai”, then the adaptation procedure needs to invoke two constituent word replacement (WR) operations: to replace the Hindi of “play”, i.e. “khel”, with the Hindi of “eat”, i.e. “khaa”, and “cricket” (“cricket”) with “chaawal” (“rice”).

c) In case the retrieved example is “The boy is eating rice.”, one constituent word addition (WA) operation is required for the adverb “everyday”. Further, to take care of verb conjugation, some morpho-word and suffix operations need to be carried out. This is because the Hindi translation of “The boy is eating rice.” is “ladkaa (boy) chaawal (rice) khaa (eat) rahaa (..ing) hai (is)”, whereas the translation of the input sentence “The boy eats rice everyday.” should be “ladkaa har roz chaawal khaataa hai”. Thus the morpho-word “rahaa”, which is required for the present continuous tense of the retrieved sentence, needs to be deleted. Further, the suffix “taa” is to be added to the root main verb to get the required present indefinite verb form of the input.

d) However, if the retrieved example is “Does the boy eat rice?”, then the adaptation procedure needs to take care of the structural variation between the “interrogative” form of the retrieved example and the affirmative form of the input sentence.
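Case (c) above can be traced through a small sketch of the adaptation edits. The helper names are invented; this only illustrates the sequence of operations:

```python
# Sketch: the three kinds of edits from case (c) applied to the
# retrieved translation "ladkaa chaawal khaa rahaa hai".
def word_add(toks, i, w):     return toks[:i] + [w] + toks[i:]
def word_delete(toks, i):     return toks[:i] + toks[i+1:]

def suffix_add(toks, i, suf):          # attach a suffix to the root at i
    return toks[:i] + [toks[i] + suf] + toks[i+1:]

t = "ladkaa chaawal khaa rahaa hai".split()
t = word_delete(t, 3)          # morpho-word deletion: drop "rahaa"
t = suffix_add(t, 2, "taa")    # suffix addition: khaa -> khaataa
t = word_add(t, 1, "har")      # constituent word additions for the
t = word_add(t, 2, "roz")      #   adverb "everyday" = "har roz"
print(" ".join(t))             # -> ladkaa har roz chaawal khaataa hai
```

Only the two `word_add` calls would need a dictionary lookup; the morpho-word deletion and suffix addition operate on closed classes entirely in memory.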

Obviously, the greater the discrepancy between the retrieved example and the input sentence, the greater the number of adaptation operations needed for generating the desired translation. The above illustrations make certain points evident:

a) Adaptation operations are required for performing two general tasks: dealing with the constituent words (along with their suffixes and morpho-words), and dealing with the overall structure of the sentence.

b) Each invocation of an adaptation operation pertains to a particular part of speech, such as noun, verb, adverb etc.

c) Of the ten adaptation operations (described earlier with respect to Chapter 2) only the WA and WR operations require dictionary15 searches. Since a dictionary search typically involves accessing an external device (e.g. hard disk), it is computationally more expensive than the other operations (e.g. constituent word deletion, morpho-word operations), which are purely RAM16-based and hence computationally cheaper.
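Observation (c) suggests a rough cost model in which only WA and WR carry a dictionary-access term. The constants below are symbolic placeholders for illustration, not the measured values of Chapter 5:

```python
# Sketch: per-operation cost model distinguishing RAM-based edits from
# edits that need an external dictionary lookup (WA, WR).
C_RAM, C_DICT = 1, 100        # placeholder relative costs

OP_COST = {
    "word_add": C_RAM + C_DICT, "word_replace": C_RAM + C_DICT,
    "word_delete": C_RAM,
    "morpho_add": C_RAM, "morpho_delete": C_RAM, "morpho_replace": C_RAM,
    "suffix_add": C_RAM, "suffix_delete": C_RAM, "suffix_replace": C_RAM,
    "copy": 0,
}

def adaptation_cost(ops):
    return sum(OP_COST[op] for op in ops)

# Case (c) of the earlier illustration: one WA, one morpho-word
# deletion and one suffix addition.
print(adaptation_cost(["word_add", "morpho_delete", "suffix_add"]))  # -> 103
```

Under such a model the retrieval goal becomes minimizing the total cost, which is exactly the unified view of retrieval and adaptation proposed here.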

The above observations help us proceed towards the intended goal of using the cost of adaptation as a measure of similarity. As a first step, we suggest dividing the dictionary into several parts based on the part of speech (POS) of the words. Division of the dictionary according to POS reduces the search space for each invocation of the above operations and, as a consequence, the search time. The cost-of-adaptation based similarity measurement approach then proceeds along the following lines:

a) We first estimate the average cost for each of the ten adaptation operations. We observe that these costs depend on two major types of parameters. On the one hand, they depend on certain linguistic aspects, such as the average length of the sentences in both source and target languages, the number of suffixes (used with different POS), the number of morpho-words etc. On the other hand, these costs are related to the machine on which the EBMT system is running. Since we aim at analyzing the costs in a general way, we have treated these machine-dependent costs as variables in all our analyses. For the linguistic parameters, we used values obtained by analyzing about 30,000 examples of English to Hindi translation. These examples were collected from various sources, namely translation books, advertisement materials, children's story books and government notices, which are freely available in non-electronic form.

15 By “dictionary” we mean a source language to target language word dictionary available on-line.
16 Random Access Memory

b) At the second step, we estimated the costs incurred in adapting various functional tags17. In particular, we have considered the cost of adaptation due to variations in active and passive verb morphology, subject/object, pre-modifying adjectives, the genitive case and wh-family words. These costs are stored in various tables in Section 5.4.

c) At the third step we have considered the costs of adaptation due to differences in sentence structure. Here, we have considered four different sentence structures: affirmative, negative, interrogative and negative-interrogative. These adaptation costs too are stored in tabular form. Section 5.4 gives the details of this analysis.

Once these basic costs are modelled, we are in a position to experiment with the cost of adaptation as a similarity measure vis-à-vis the semantics- and syntax-based similarity measurement schemes discussed above. Our experiments have clearly established the efficiency of the proposed scheme over the others. Part of this work has also been presented in (Gupta and Chatterjee, 2003c). Two apparent drawbacks of this scheme are:

1) It may end up comparing a given input with all the example base sentences to ascertain the least cost of adaptation.

2) Another major question is whether the cost-of-adaptation scheme is efficient enough to handle sentences that are structurally more complicated, e.g. complex or compound sentences. It is a generally accepted fact that complex sentences are difficult to handle in an MT system (Dorr et. al., 1998), (Hutchins, 2003), (Sumita, 2001), (Shimohata et. al., 2003).

17 In fact we worked on “Functional Slots”, which are more general than “Functional Tags”. This is discussed in detail in Section 2.2.

In order to deal with the first difficulty we have proposed a two-level filtration scheme. This scheme helps in selecting a smaller number of examples from the example base, which may subsequently be subjected to the rigorous treatment for determining their costs of adaptation with respect to the given input. We have also justified that this scheme does not leave out the sentences whose translations are easier to adapt for the given input.
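A two-level filtration pipeline of this kind can be sketched as follows; both filter predicates and the cost function are placeholders for the actual criteria of Chapter 5:

```python
# Sketch: a coarse filter and a fine filter prune the example base
# before the expensive cost-of-adaptation scoring of the survivors.
def retrieve(inp, example_base, coarse, fine, cost):
    stage1 = [e for e in example_base if coarse(inp, e)]
    stage2 = [e for e in stage1 if fine(inp, e)]
    return min(stage2, key=lambda e: cost(inp, e)) if stage2 else None

# Toy usage: filter by shared first word, then by minimum length, then
# pick the candidate with the smallest length difference.
eb = ["the boy eats rice", "the boy is eating rice", "a girl sings"]
best = retrieve("the boy eats rice everyday", eb,
                coarse=lambda q, e: e.split()[0] == q.split()[0],
                fine=lambda q, e: len(e.split()) >= 4,
                cost=lambda q, e: abs(len(q.split()) - len(e.split())))
print(best)   # -> the boy is eating rice
```

The point of the design is that `cost` (the expensive, dictionary-aware computation) runs only on the small second-stage set, while the cheap filters see every example.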

In this work we have provided a solution for the second problem too. We have given rules for splitting a complex sentence into more than one simple sentence. Translations of these simple sentences may then be generated by the EBMT system, and the individual translations combined to obtain the translation of the given complex input sentence. If the cost-of-adaptation based similarity measurement scheme is applied for translating the simple sentences, then the cost of adaptation of the complex sentence too can be estimated, by adding the individual costs to the cost of combining the individual translations. Since the last operation is purely algorithmic, its computational complexity can be easily computed, and hence the overall cost of adaptation can be estimated. With respect to dealing with complex sentences, we have, however, imposed certain restrictions. We considered sentences with only one subordinate clause. Further, the presence of a connecting word is also mandatory. Evidently, more complicated complex sentence structures exist, and further investigations are required for developing techniques for handling them in an EBMT framework.
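Under the stated restrictions (one subordinate clause, an explicit connecting word), the splitting rule can be sketched as a search for the connective. The connective list here is a small illustrative subset, not the thesis's rule set:

```python
# Sketch: split a complex sentence with one subordinate clause at an
# explicit connecting word; each part can then be translated
# separately and the results combined.
CONNECTIVES = {"because", "when", "although", "if", "since"}

def split_complex(sentence):
    toks = sentence.split()
    for i, w in enumerate(toks):
        if w.lower() in CONNECTIVES and i > 0:
            return " ".join(toks[:i]), w, " ".join(toks[i + 1:])
    return sentence, None, None            # already simple

main, conn, sub = split_complex("I stayed home because it was raining")
print(main, "|", conn, "|", sub)
```

The estimated cost of the complex sentence is then the sum of the costs of the two simple translations plus the (purely algorithmic) cost of recombining them.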

In this connection we would like to mention that we have explained the cost of adaptation with respect to a selected set of sentence structures, and for a selected set of functional slots. Definitely many more variations are available with respect to these parameters. Consequently, more work has to be done to form rules for handling these variations. However, we feel that the work described in this research provides suitable guidelines for further continuation of the research.

1.2 Some Critical Points

1) The aim of this research is not to construct an English to Hindi EBMT system. Rather, our intention is to analyze the requirements that help in building an effective EBMT system. The motivation behind this research came from two major observations:

– Although some MT systems for translation from English to Hindi already exist, the quality of their translation is often not up to the mark. This prompted us to look into the process of MT to ascertain the inherent difficulties.

– We have chosen EBMT as our preferred paradigm because of certain advantages it has over other MT paradigms such as RBMT and SBMT. One major advantage of EBMT is that it requires neither the huge parallel corpus required by SBMT, nor the large rule base required by RBMT. The study of EBMT was therefore feasible for us, as we did not have access to such linguistic resources.

2) In order to design our scheme we have studied about 30,000 English to Hindi translation examples available off-line. Although large volumes of English to Hindi parallel text are now available on-line (EMILLE: http://www.emille.lancs.ac.uk/home.html), at the time when this work was started no such parallel corpus was available to us. For our work we prepared an on-line parallel example base of 4,500 sentences. These example pairs were chosen carefully so that different sentence structures as well as translation variations (divergences) are taken care of as much as possible.

English sentence: The horses have been running for one hour.
Tagged form: @DN> ART “the”, @SUBJ N PL “horse” %ghodaa%, @+FAUXV V PRES “have”, @-FAUXV V PCP2 “be”, @-FMAINV V PCP1 “run” %daudaa%, @ADVL PREP “for”, @QN> NUM CARD “one” %ek%, @<P N SG “hour” %ghantaa%.
Hindi sentence: ghode ek ghante se daudaa rahen hain

Figure 1.1: An Example Sentence with Its Morpho-Functional Tags

3) Each translation example record in our example base contains morpho-functional tag18 information for each of the constituent words of the source language (English) sentence, along with the sentence, its Hindi translation, and the root word correspondences. Figure 1.1 provides an example of the records stored in our example base.

The morpho-functional tags of a word indicate its syntactic function within the sentence. The tags are helpful in identifying the root words, their roles in the sentence, and the roles of the different suffixes (used for declensions) in the overall sentence construction.

In this work we have studied the two major pillars of EBMT: retrieval and adaptation. We feel that the studies made, as well as the techniques developed in this research, will be helpful for developing MT systems not only for Hindi but also for other Indian languages (e.g. Bangla, Gujarati, Punjabi). All these languages suffer from the same drawback - unavailability of linguistic resources. However, the demand for developing MT systems from English to these languages is increasing with time, not only because these are prominent regional languages of India, but also because they are important minority languages in other countries such as the U.K. (Somers, 1997). The studies made in this research should pave the way for developing EBMT systems involving these languages as well.

18 Appendix B provides the different morpho tags and functional tags that have been used in this work. These tags are obtained by editing the sentence tagging given by the ENGCG parser (http://www.lingsoft.fi/cgi-bin/engcg) for English sentences.

Chapter 2

Adaptation in English to Hindi Translation: A Systematic Approach

2.1 Introduction

The need for an efficient and systematic adaptation scheme arises for modifying a retrieved example, and thereby generating the required translation. This chapter is devoted to the study of a systematic adaptation approach. Various approaches have been pursued in dealing with the adaptation aspect of an EBMT system. Some of the major approaches are described below.

1. Adaptation in Gaijin (Veale and Way, 1997) is modelled via two categories: high-level grafting and keyhole surgery. High-level grafting deals with phrases: an entire phrasal segment of the target sentence is replaced with another phrasal segment from a different example. Keyhole surgery, on the other hand, deals with individual words in an existing target segment of an example. Under this operation words are replaced or morphologically fine-tuned to suit the current translation task. For instance, suppose the input sentence is “The girl is playing in the park.”, and the example base contains the following examples:

(a) The boy is playing.

(b) Rita knows that girl.

(c) It is a big park.

(d) Ram studies in the school.

For high-level grafting the sentences (a) and (d) will be used. Then keyhole surgery will be applied for putting in the translations of the words “park” and “girl”. These translations will be extracted from (b) and (c).
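The two operations can be caricatured in a few lines of code. The segmentation, slot name and word alignments below are hand-coded purely for illustration and are not how the actual system represents examples:

```python
# Toy sketch of Gaijin-style adaptation: high-level grafting splices a
# phrasal segment into a frame; keyhole surgery patches single words.
def graft(frame, slot, segment):
    out = []
    for tok in frame:
        out.extend(segment if tok == slot else [tok])
    return out

def keyhole(tokens, replacements):
    return [replacements.get(tok, tok) for tok in tokens]

# Frame from (a) "The boy is playing." with a grafted PP slot taken
# from (d) "Ram studies in the school.", adapted to "in the park":
sent = graft(["The", "boy", "is", "playing", "<PP>"],
             "<PP>", ["in", "the", "park"])
sent = keyhole(sent, {"boy": "girl"})      # word patch, as from (b)
print(" ".join(sent))                      # -> The girl is playing in the park
```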

2. Shiri et. al. (1997) have proposed another adaptation procedure. It is based on
three steps: finding the difference, replacing the difference, and smoothing the

23
2.1. Introduction

output. The differing segments of the input sentence and the source template
are identified. Translations of these different segments in the input sentence
are produced by rule-based methods, and these translated segments are fitted
into a translation template. The resulting sentence is then smoothed over by

checking for person and number agreement, and inflection mismatches. For
example, assume the input sentence and selected template are:

SI : A very efficient lady doctor is busy.

St : A lady doctor is busy.

Tt : mahilaa chikitsak vyasta hai

The parsing process however shows that “A very efficient lady doctor” is a
noun phrase, and so matches it with “A lady doctor” ∼ “ek mahilaa chikit-
sak”. “A very efficient lady doctor” is translated as “ek bahut yogya mahilaa
chikitsak” by the rule-based noun phrase translation system. This is inserted
into Tt, giving the following: Tt : ek bahut yogya mahilaa chikitsak vyasta hai.
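The three steps above can be sketched in a few lines of Python. This is an illustrative simplification, not Shiri et al.'s implementation: the rule-based noun-phrase translator is stubbed out with a lookup table, the difference is found by a crude word-level diff, and the insertion point into the target template is hard-wired for this one example.

```python
# Sketch of the three-step template adaptation: find the difference,
# translate it (stubbed), fit it into the target template.

def find_difference(input_sent, source_template):
    """Return the input segment not covered by the source template."""
    inp, tmpl = input_sent.split(), source_template.split()
    extra = [w for w in inp if w not in tmpl]   # crude word-level diff
    return " ".join(extra)

def adapt(input_sent, source_template, target_template, rule_based_np):
    diff = find_difference(input_sent, source_template)   # step 1
    translated = rule_based_np[diff]                      # step 2 (stubbed)
    # Step 3: insert the translated modifier before the head noun phrase.
    return target_template.replace("mahilaa", translated + " mahilaa", 1)

np_translations = {"very efficient": "bahut yogya"}  # stand-in for rule-based MT
out = adapt("A very efficient lady doctor is busy.",
            "A lady doctor is busy.",
            "ek mahilaa chikitsak vyasta hai",
            np_translations)
```

The smoothing step (person/number agreement) is a no-op for this example and is omitted.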

3. The ReVerb system (Collins, 1998) proposed the following adaptation scheme. Here
two different cases are considered: full-case adaptation and partial-case adap-
tation. Full-case adaptation is employed when a problem is fully covered by the
retrieved example. Here the desired translation is created by substitution alone.
No addition or deletion is required for adapting TL′ to generate the trans-
lation of SL. Here TL′ and SL denote the example-base target language sentence
and the input source language sentence, respectively. In this case five scenarios
are possible: SAME, ADAPT, IGNORE, ADAPTZERO and IGNOREZERO.

Partial-case adaptation is used when a single unifying example does not exist.
Here three more operations are required on top of the above five. These
three operations are ADD, DELETE and DELETZERO.


Figure 2.1: The five possible scenarios in the SL → SL’ → TL’ interface of partial
case matching

Note that there is a subtle difference between ADAPT and ADAPTZERO.
For ADAPT as well as for ADAPTZERO, both SL and SL′ have the same links
but different chunks. If TL′ has words corresponding to the chunk which
differs between SL and SL′, then the words in TL′ should be modified; this is
the case of ADAPT. On the other hand, if no corresponding chunk is present
in TL′ then it is the case of ADAPTZERO, and no work is needed for
adaptation. Similar subtleties may be observed between DELETE
and DELETZERO, and also between IGNORE and IGNOREZERO. The other
operations (such as SAME and ADD) have obvious interpretations. Figure 2.1
provides the conceptual view of partial case matching.

4. Somers (2001) proposes adaptation from a case-based reasoning (CBR) point of view.
The simplest of the CBR adaptation methods is null adaptation, where no
changes are recommended. In a more general situation various substitution
methods (e.g. reinstantiation, parameter adjustment) or transformation methods
(e.g. commonsense transformation and model-guided repair) may be applied.


For example, suppose the input sentence (I) and the retrieved example (R)


are:
I : That old woman has died.
R : That old man has died. ∼ wah boodhaa aadmii mar gayaa

To generate the desired translation, the translation of the word “man”, i.e.
“aadmii”, is first replaced with the translation of “woman”, i.e. “aurat”, in R.
This operation is called reinstantiation. At this stage an intermediate translation
“wah boodhaa aurat mar gayaa” is obtained. To obtain the final translation “wah
boodhii aurat mar gayii”, the system must also change the adjective “boodhaa”
to “boodhii” and the word “gayaa” to “gayii”. This is called parameter adjustment.
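The two CBR steps of this example can be sketched as follows. The bilingual lexicon and the table of feminine forms are illustrative stand-ins for the system's real resources, and the function names are our own.

```python
# Sketch of Somers' two-step CBR adaptation on the example above.

lexicon = {"man": "aadmii", "woman": "aurat"}
# Feminine forms of the words that must agree with the new noun
feminine = {"boodhaa": "boodhii", "gayaa": "gayii"}

def reinstantiate(target, old_src, new_src):
    """Swap the translation of the differing word (reinstantiation)."""
    return target.replace(lexicon[old_src], lexicon[new_src])

def adjust_parameters(words):
    """Fix gender agreement on adjective and verb (parameter adjustment)."""
    return [feminine.get(w, w) for w in words]

step1 = reinstantiate("wah boodhaa aadmii mar gayaa", "man", "woman")
# step1 is the intermediate translation "wah boodhaa aurat mar gayaa"
final = " ".join(adjust_parameters(step1.split()))
# final is the smoothed translation "wah boodhii aurat mar gayii"
```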

5. The adaptation scheme proposed by McTait (2001) works in the following way.

Translation patterns that share lexical items with the input and partially cover
it are retrieved in a pattern-matching procedure. From these, the patterns
whose SL side covers the SL input to the greatest extent (longest cover) are
selected. They are termed base patterns, as they provide sentential context in
the translation process. Intuitively, the greater the extent of the cover
provided by the base patterns, the more the context, and the lesser the
ambiguity and complexity in the translation process. If the SL side of the base
pattern does not fully cover the SL input, any unmatched segments are bound
to the variables on the SL side of the base pattern. The translations of the SL
segments bound to the SL variables of the base pattern are retrieved from the
remaining set of translation patterns, while the text fragments and variables on
the TL side of the base pattern form the translation strings.

The following is a simple example: given the source language input I: “AIDS
control programme for Ethiopia”, suppose the longest covering base pattern is:

D1: AIDS control programme for (...) ∼ (...) ke liye AIDS contral smahaaroo.


To complete the match between I and the source language side of D1, a translation
pattern containing the text fragment “Ethiopia” is required, i.e.

D2: (...) Ethiopia (...) ∼ Ethiopia (...).

The TL translation T: “ethiopia ke liye AIDS contral smahaaroo” is generated
by recombining the text fragments. Since “Ethiopia” and “ethiopia” are aligned
on a 1:1 basis in D2, and so are the variables in the base pattern D1, the
TL text fragment “ethiopia” is bound to the variable on the TL side of D1 to
produce T.
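The recombination step can be sketched as below. The pattern representation (text fragments interleaved with a variable written “(...)”) is our own simplification, and we place the TL variable before “ke liye”, which is where the generated output T puts the bound fragment.

```python
# Sketch of McTait-style recombination: bind the unmatched SL segment to
# the base-pattern variable and emit the TL string.

def recombine(sl_input, base_sl, base_tl, fragment_tl):
    """Bind the SL segment not covered by base_sl; fill the TL variable."""
    prefix = base_sl.replace("(....)", "").strip()
    assert sl_input.startswith(prefix)      # base pattern must cover a prefix
    # The fragment translation (from D2) fills the TL-side variable.
    return base_tl.replace("(...)", fragment_tl)

t = recombine("AIDS control programme for Ethiopia",
              "AIDS control programme for (....)",
              "(...) ke liye AIDS contral smahaaroo",
              "ethiopia")
# t is the recombined TL string
```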

6. In HEBMT (Jain, 1995) examples are stored in an abstracted form for deter-
mining the structural similarity between the input sentence and the example
sentences. The target language sentence is generated using the target pattern
of the example sentence that has the least distance from the input sentence. The
system substitutes the corresponding translations of syntactic units, identified
by a finite state machine, in the target pattern. Variation in the tense of the verb,
and variations due to number, gender etc. are taken care of at this stage for
generating the appropriate translation. Since this system translates from Hindi to
English, we explain its adaptation process with an example of Hindi
to English translation.

For example, suppose the input sentence is “merii somavara ko jaa rahii hai”
and it matches with the example sentence R: “meraa dosta itavaar ko aayegaa”.
Steps (a) to (f) below show the process of translation.

(a) merii somavara ko jaa rahii hai (input sentence)


(b) <snp>1 <npk2>2 <mv>3 (syntactic grouping)

(c) [Mary] [Monday] [go] (English translation of syntactic groups)

(d) <snp> <mv> {on} <npk2> (target pattern of example R)

(e) [Mary] [is going] on [Monday] (Translation after substitution)

(f) Mary is going on Monday (Final translated output)
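Steps (b) through (f) can be sketched as a pattern-substitution routine. This is an illustrative encoding: the group labels follow the footnoted notation (snp, npk2, mv), and the group translations are given here already tense-adjusted as in step (e).

```python
# Sketch of HEBMT generation: translated syntactic groups are substituted
# into the target pattern of the retrieved example.

groups = {"snp": "Mary", "npk2": "Monday", "mv": "is going"}  # steps (c), (e)
target_pattern = ["<snp>", "<mv>", "{on}", "<npk2>"]          # step (d)

def generate(pattern, translations):
    out = []
    for slot in pattern:
        if slot.startswith("<"):            # substitute a syntactic group
            out.append(translations[slot.strip("<>")])
        else:                               # literal word, e.g. {on}
            out.append(slot.strip("{}"))
    return " ".join(out)

sentence = generate(target_pattern, groups)   # step (f)
```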

Many other EBMT systems are found in the literature, e.g. GEBMT (Brown, 1996,
1999, 2000, 2001), EDGAR (Carl and Hansen, 1999) and TTL (Güvenir and Cicekli,
1998). However, in our view the adaptation procedures employed in different
EBMT systems primarily consist of four operations:

• Copy, where the same chunk of the retrieved translation example is used in
the generated translation;

• Add, where a new chunk is added in the retrieved translation example;

• Delete, when some chunk of the retrieved example is deleted; and

• Replace, where some chunk of the retrieved example is replaced with a new
one to meet the requirements of the current input.

The operations prescribed in different systems vary in the chunks they deal with.
Depending upon the case, a chunk may be a phrase, a word or a sub-word (e.g. a
declensional suffix).
1 snp: noun, adj+noun, noun+kaa+noun
2 npk2: noun+ko
3 mv: verb-part
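Viewed this way, the four operations are exactly the opcodes of an edit script between the retrieved translation and the required one. A word-level sketch using Python's difflib, run on the squirrel/elephant example of Section 2.2 (the mapping of opcode names to operation names is ours):

```python
# Classify differing spans of two translations as Copy/Add/Delete/Replace.
from difflib import SequenceMatcher

OPCODE_TO_OPERATION = {"equal": "Copy", "insert": "Add",
                       "delete": "Delete", "replace": "Replace"}

def adaptation_script(retrieved, required):
    """Return (operation, retrieved chunk, required chunk) triples."""
    a, b = retrieved.split(), required.split()
    sm = SequenceMatcher(a=a, b=b)
    return [(OPCODE_TO_OPERATION[tag], a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

script = adaptation_script("haathii phal khaa rahaa thaa",
                           "gilharii moongphalii khaa rahaa thaa")
# one Replace span, then a Copy of "khaa rahaa thaa"
```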


With respect to English and Hindi, we find that both languages depend
heavily on suffixes for verb morphology, changing numbers from singular to plu-
ral and vice versa, case endings, etc. Appendix A provides detailed descriptions
of various Hindi suffixes. Keeping the above in view, we differentiate the adap-
tation operations into two groups: word based and suffix based. The word based
operations are further subdivided into two categories: constituent word based and
morpho-word based. Thus the adaptation scheme proposed here consists of ten op-
erations: Copy (CP), Constituent word deletion (WD), Constituent word addition
(WA), Constituent word replacement (WR), Morpho-word deletion (MD), Morpho-
word addition (MA), Morpho-word replacement (MR), Suffix addition (SA), Suffix
deletion (SD) and Suffix replacement (SR). Section 2.2 illustrates the roles of
these operations in adapting a retrieved translation example.

The advantage of the above classification of adaptation operations is twofold.

Firstly, it helps in identifying the specific task that has to be carried out in the step-
by-step adaptation for a given input. Secondly, it helps in measuring the average
cost of each of the above operations in a meaningful way, which in turn helps in
estimating the total adaptation cost for a given sentence. This estimate can be used

as a tool for similarity measurement between an input and the stored examples.
These issues are discussed in Chapter 5.
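With average per-operation costs, the total adaptation cost becomes a simple weighted count of operations, e.g. for a sequence such as WR, WA, CP, SA, MD, MR. The sketch below uses placeholder numeric values; the actual cost estimation is the subject of Chapter 5.

```python
# Additive cost model over the ten adaptation operations.
# The numeric values are placeholders, not the thesis's estimated costs.
AVG_COST = {"CP": 0.0, "WR": 1.0, "WA": 1.2, "WD": 0.8,
            "MR": 0.6, "MA": 0.7, "MD": 0.5,
            "SR": 0.4, "SA": 0.5, "SD": 0.3}

def adaptation_cost(operations):
    """Estimated total cost of a sequence of adaptation operations."""
    return sum(AVG_COST[op] for op in operations)

cost = adaptation_cost(["WR", "WA", "CP", "SA", "MD", "MR"])
```

Minimising this quantity over the example base is what turns adaptation cost into a similarity measure for retrieval.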

2.2 Description of the Adaptation Operations

The ten adaptation operations mentioned above are described below.

1. Constituent Word Replacement (WR): One may get the translation of the
input sentence by replacing some words in the retrieved translation example.


Suppose the input sentence is: “The squirrel was eating groundnuts.”, and the
most similar example retrieved by the system (along with its Hindi translation)
is: “The elephant was eating fruits.” ∼ “haathii phal khaa rahaa thaa”. The
desired translation may be generated by replacing “haathii ” with the Hindi of

“squirrel”, i.e. “gilharii ” and replacing “phal ” with the Hindi of “groundnuts”,
i.e. “moongphalii ”. These are examples of the operation of constituent word
replacement.

2. Constituent Word Deletion (WD): In some cases one may have to delete some

words from the translation example to generate the required translation. For
example, suppose the input sentence is: “Animals were dying of thirst”. If the
retrieved translation example is : “Birds and Animals were dying of thirst.” ∼
“pakshii aur pashu pyaas se mar rahe the”, then the desired translation can
be obtained by deleting “pakshii aur ” (i.e. the Hindi of “birds and”) from the

retrieved translation. Thus the adaptation here requires two constituent word
deletions.

3. Constituent Word Addition (WA): This operation is the opposite of constituent
word deletion. Here some additional words must be added to the retrieved trans-
lation example for generating the translation. For illustration, one
may consider the example given above with the roles of the input and retrieved
sentences being reversed.

4. Morpho-word Replacement (MR): In this case one morpho-word is replaced by

another morpho-word in the retrieved translation example. Consider a case


when the input sentence is: “The squirrel was eating groundnuts.”, and the
retrieved example is: “The squirrel is eating groundnuts.” ∼ “gilharii moongfalii


khaa rahii hai ”. In order to take care of the variation in tense the morpho-
word “hai ” is to be replaced with “thaa”. This is an example of Morpho-word
replacement.

5. Morpho-word Deletion (MD): Here some morpho-word(s) are deleted from the

retrieved translation example. For illustration, if the input sentence is “He


eats rice.”, and the retrieved example is: “He is eating rice.” ∼ “wah chaawal
khaa rahaa hai ”, then to obtain the desired translation4 first the morpho-word
“rahaa” is to be deleted from the retrieved translation example.

6. Morpho-word Addition (MA): This is the opposite case of morpho-word dele-


tion. Here some morpho-words need to be added in the retrieved example in
order to generate the required translation.

7. Suffix Replacement (SR): Here the suffix attached to some constituent word
of the retrieved sentence is replaced with a different suffix to meet the current

translation requirements. This may happen with respect to noun, adjective,
verb, or case ending. For illustration:

(a) To change the number of nouns

Boy (ladkaa) → Boys (ladke)

The suffix “aa” is replaced with “e” in order to get its plural form in
Hindi.

(b) Change of Adjectives

Bad boy (buraa ladkaa) → Bad girl (burii ladkii )

The suffix “aa” is replaced with “ii ” to get the adjective “burii ”.
4 Of course the final translation will be obtained by adding the suffix “taa” to the word
“khaa”.


(c) Morphological changes in verb

He reads. (wah padtaa hai ) → She reads. (wah padtii hai )

The suffix “taa” is replaced with “tii ” to get the verb “padtii ”, which is
required to indicate that the subject is feminine.

(d) Morphological changes due to case ending

boy (ladkaa) → from boy (ladke se)

room (kamraa) → in room (kamre mein)

The suffix “aa” is replaced with “e” to get the nouns “ladke” and “kamre”.
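The four illustrations of suffix replacement above can be captured by one small helper. This is a minimal sketch covering only the suffixes shown; a real system would consult the fuller suffix tables of Appendix A.

```python
# Minimal suffix-replacement (SR) helper for the illustrations above.

def replace_suffix(word, old, new):
    """Replace a trailing suffix, e.g. ladkaa -> ladke."""
    assert word.endswith(old), f"{word} does not end in -{old}"
    return word[: -len(old)] + new

examples = [
    ("ladkaa", "aa", "e"),     # (a) boy -> boys
    ("buraa", "aa", "ii"),     # (b) bad (masc.) -> bad (fem.)
    ("padtaa", "taa", "tii"),  # (c) he reads -> she reads
    ("kamraa", "aa", "e"),     # (d) room -> in room (kamre mein)
]
results = [replace_suffix(w, old, new) for w, old, new in examples]
```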

8. Suffix Deletion (SD): By this operation the suffix attached to some constituent
word may be removed, and thereby the root word may be obtained. This
operation is illustrated in the following examples:

(a) To change the number of nouns

women (aauraten) → woman (aaurat),

The suffix “en” is deleted from “aauraten” to get the Hindi translation
of “woman”.

(b) Morphological changes in verb

He reads. (wah padtaa hai ) → He is reading. (wah pad rahaa hai )

The suffix “taa” is deleted from “padtaa” to get the root form “pad ” of
the English verb “read”.

(c) Morphological changes due to case ending

in the houses (gharon mein) → houses (ghar )

in words (shabdon mein) → words (shabd )

The suffix “on” is deleted from “gharon” and “shabdon” to get the Hindi
translation of nouns “houses” and “words”, respectively.


9. Suffix Addition (SA): Here a suffix is added to some constituent word in the
retrieved example. Note that here the word concerned is in its root form in
the retrieved example. One may consider the examples given above with the
roles of input and retrieved sentences reversed as suitable examples for suffix

addition operation.

10. Copy (CP): When some word (with or without suffix) of the retrieved example
is retained in toto in the required translation then it is called a copy operation.

Figure 2.2 provides an example of adaptation using the above operations. In this

example the input sentence is “He plays football daily.”, and the retrieved translation
example is:

They are playing football. ∼ we f ootball khel rahe hain


(They) (football) (play) (...ing) (are)

The translation to be generated is: “wah roz football kheltaa hai”. When adaptation
is carried out using both word and suffix operations, the adaptation steps look as
given in Figure 2.2. In this respect one may note that Hindi is a free word order
language, and consequently the position of the adverb is not fixed. Hence the above
input sentence may have different Hindi translations:

• wah roz football kheltaa hai

• wah football roz kheltaa hai

• roz wah football kheltaa hai

While implementing an EBMT system one has to stick to some specific format.
The adverb will be added according to the format adopted by the system.


Input        we     φ      football   khel      rahe   hain
             ↓      ↓      ↓          ↓         ↓      ↓
Operations   WR     WA     CP         SA        MD     MR
             ↓      ↓      ↓          ↓         ↓      ↓
Output       wah    roz    football   kheltaa   φ      hai

Figure 2.2: Example of Different Adaptation Operations
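The Figure 2.2 trace can be replayed slot by slot. The triple-based encoding below (token, operation, argument) is our own illustrative representation, not the system's internal one.

```python
# Replay the Figure 2.2 adaptation: one operation per aligned slot.

def apply_ops(tokens_and_ops):
    out = []
    for token, op, arg in tokens_and_ops:
        if op == "CP":
            out.append(token)           # copy the token as is
        elif op in ("WR", "MR", "WA"):
            out.append(arg)             # replacement word, or added word
        elif op == "SA":
            out.append(token + arg)     # attach the suffix to the root
        elif op == "MD":
            pass                        # morpho-word deleted
    return " ".join(out)

trace = [("we", "WR", "wah"), (None, "WA", "roz"),
         ("football", "CP", None), ("khel", "SA", "taa"),
         ("rahe", "MD", None), ("hain", "MR", "hai")]
result = apply_ops(trace)
```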

Which adaptation operations will be required to translate a given input sentence

depends upon the translation example retrieved from the example base. A variety
of examples may be adapted to generate the desired translation, but obviously with
varying computational costs. For efficient performance, an EBMT system, therefore,
needs to retrieve an example that can be adapted to the desired translation with

least cost. This brings in the notion of “similarity” among sentences. The proposed
adaptation procedure has the advantage that it provides a systematic way of evalu-
ating the overall adaptation cost. This estimated cost may then be used as a good
measure of similarity for appropriate retrieval from the example base. How cost of

adaptation may be used as a yardstick to measure similarity between sentences will


be described in Chapter 5.

Here our aim is to count the number of adaptation operations required in adapt-
ing a retrieved example to generate the translation of a given input. Obviously, de-
pending upon the situation one has to apply some adaptation operations for changing
different functional slots5 (Singh, 2003), such as subject (<S>), object (<O>) and verb
(<V>). Also certain operations are required for changing the kind of sentence, e.g.
5 The following example illustrates the difference between functional slots and functional tags.
Consider the sentence “The old man is weak.”. The subject of this sentence is the noun phrase “The
old man”. It consists of three functional tags, viz. @DN>, @AN> and @SUBJ, stating that “the”
is a determiner, “old” is an adjective, and “man” is the subject. But, as mentioned above, the entire
noun phrase plays the role of subject of the sentence. Thus the functional slot for this phrase is
<S>, i.e. the subject slot. Note that a particular functional slot may have a variable number of words.
The sequence of functional slots in a sentence provides the sentence pattern. The difference between
various tags (e.g. POS tag, functional tag) is explained in detail in Appendix B.


affirmative to negative, negative to interrogative, etc. Table 2.2 contains the nota-
tions for the roles of different functional slots and operators, which are required for the
subsequent discussion.

Operators          Role of operators

<>                 For a functional slot or part of speech and its transforma-
                   tion, e.g. <S>, <V> etc.
&                  Both functional slots or parts of speech and their transfor-
                   mations should be present.
or                 Either the first slot/tag, the second slot/tag, or both.
{}                 For a non-obligatory functional tag/slot or for an optional
                   adaptation operation.
[]                 For the property of a functional slot/tag.

Functional Slot    Role of functional slot

<LV>               Linking verbs; in English: are, am, was, were, become,
                   seem etc., and in Hindi: hai, hain, ho, thaa, the etc.
<V>                Auxiliary verb (if any) and main verb of the sentence
<AuxV>             Auxiliary verb
<MainV>            Main verb
<S>                Subject
<O>                Object
<O1>               First object
<O2>               Second object
<SC>               Subjective complement
<PCP1 form>        -ing verb form other than the main verb
<PCP2 form>        -ed or -en verb forms other than the main verb
<to-infinitive>    to-infinitive form of verb
<Adverb>           Adverb
<AdjP>             Adjective phrase
<PP>               Preposition phrase
<preposition>      Preposition

Table 2.2: Notations Used in Sentence Patterns


The following sections describe how many such operations are required in dif-
ferent cases. In particular we consider the following functional slots and sentence
kinds:

1. Tense and form of the verb. Since there are three tenses (viz. Present,
Past and Future) and four forms (Indefinite, Continuous, Perfect, and Perfect
Continuous), in all one can have 12 different verb structures, along with the
corresponding passive verb structures.

2. Subject/object functional slot. Variations in the subject/object functional slot
may happen in many different ways, such as Proper Noun, Common Noun
(Singular or Plural), Pronoun, PCP1 form6 and PCP2 form7, as well as varia-
tions in pre-modifier adjectives, genitive case, quantifier and determiner tags.

3. Study of wh-family interrogative sentences.

4. Kind of sentence: whether the sentence is affirmative, negative, interrogative
or negative interrogative.

Systematic study of these patterns and their components helps in estimating
the adaptation costs between them.

2.3 Study of Adaptation Procedure for Morphological Variation of Active Verbs

Hindi verb morphological variations depend on four aspects: the gender, number and
person of the subject, and the tense (and form) of the sentence. All these variations affect
6 -ing verb form other than the main verb
7 -ed or -en verb forms other than the main verb


the adaptation procedure. In Hindi, these conjugations are realized by using suffixes
attached to the root verbs, and/or by adding some auxiliary verbs (see Table A.3 of
Appendix A). Since there are 12 different structures (depending upon the tense and
form), the adaptation scheme should have the capability to adapt any one of them
for any input type. Hence altogether 12×12, i.e. 144, different combinations are
possible. However, Table A.3 (Appendix A) shows that in Hindi the perfect continuous
form of any tense has the same verb structure as the continuous form of that
tense. Therefore we exclude the perfect continuous form from our discussion. Thus our
work concentrates on 9×9, i.e. 81, possibilities.

These 9×9 possible combinations of verb morphology variations are divided into

four groups. These are:

1. Same tense same verb form

2. Different tenses same verb form

3. Same tense different verb forms

4. Different tenses different verb forms

In the following subsections we explain in detail how adaptation is carried out in


those four groups. One may note that in both English and Hindi verb morphological

variations depend not only on the tense and form of the verb, but also on the gender,
number and person of the subject of the sentence. However, since Hindi grammar
does not support neutral gender, every noun is considered as masculine or feminine.
Therefore, adaptation rules have been developed keeping in view the above. In

general, these rules have been represented in the form of tables where the column and
row headers specify the nature of the subject of the input sentence and the retrieved


example, respectively. The row and column headers are of the form gender, person
and number of the subject, where the gender can be M or F, the person can be
1, 2 or 3, specifying first, second or third person, and the number is either
S or P, suggesting singular or plural. Note that here the gender of the English
sentence subject is assigned according to Hindi grammar rules. The content of
the (i, j)th cell suggests the adaptation operations that need to be carried out when the
subject of the input sentence matches the specification of the jth column header,
and that of the retrieved example matches the specification of the ith row header.

2.3.1 Same Tense Same Verb Form

Here the input sentence and the retrieved example both have the same tense and
form. Yet, verb morphological variations may occur in the translation depending

upon variations in the number, gender and person of the subject.

For illustration, we consider the case when both the input and the retrieved sen-
tences have the main verb in present indefinite form. Table 2.3 lists the adaptation
operations involved for verb morphological variations. In general, in this situation
the verb adaptation requires at most one suffix replacement and one morpho-word
replacement. Suffix replacement will be confined to the set {taa, te, tii} (call it S1),
while the morpho-word replacement is associated with the set {hain, hai, ho, hoon}
(call it M1) (refer Table A.3). Note that if the person, the number and the gender of
the subject in both the input and the retrieved sentences are the same then only copy
operations will be performed.

Input→  M1S    F1S    M1P    F1P    M2S    F2S    M3S    M3P    F3S    F3P
Ret'd ↓

M1S     CP     SR     SR+MR  SR+MR  SR+MR  SR+MR  MR     SR+MR  SR+MR  SR+MR
F1S     SR     CP     SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     MR
M1P     SR+MR  SR+MR  CP     SR     MR     SR+MR  SR+MR  CP     SR+MR  SR
F1P     SR+MR  MR     SR     CP     SR+MR  MR     SR+MR  SR     MR     CP
M2S     SR+MR  SR+MR  MR     SR+MR  CP     SR     SR+MR  MR     SR+MR  SR+MR
F2S     SR+MR  MR     SR+MR  MR     SR     CP     SR+MR  SR+MR  MR     MR
M3S     MR     SR+MR  SR+MR  SR+MR  SR+MR  SR+MR  CP     SR+MR  SR     SR+MR
M3P     SR+MR  SR+MR  CP     SR     MR     SR+MR  SR+MR  CP     SR+MR  SR
F3S     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR     SR+MR  CP     MR
F3P     SR+MR  MR     SR     CP     SR+MR  MR     SR+MR  SR     MR     CP

Table 2.3: Adaptation Operations of Verb Morphological Variations in Present
Indefinite to Present Indefinite

We illustrate with an example how Table 2.3 is to be used for adaptation of verb
morphological variations. Suppose the input sentence is “She eats rice.”, and
the retrieved example is “We eat rice.” ∼ “ham chaawal khaate hain”. In the input
sentence the subject is 3rd person, feminine and singular, whereas in the retrieved
sentence the subject is 3rd person, masculine and plural.

The cell (3, 9), i.e. the cell corresponding to (M1P, F3S), suggests that two adapta-
tion operations are required: suffix replacement (SR) and morpho-word replacement
(MR). The suffix “te” is replaced with “tii” in the main verb “khaate” as a suffix
replacement operation, and the morpho-word “hain” is replaced with “hai” in the re-
trieved Hindi sentence to get the Hindi translation of the input sentence. Although
the subject “ham” also needs to be replaced with “wah” to get the appropriate
Hindi translation of the input sentence, this is not considered in the discussion
on verbs. The translation of the input sentence is: “wah chaawal khaatii hai”.
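Applying the (M1P, F3S) cell to the retrieved translation can be sketched as follows, restricted to the verb group (the subject substitution “ham” → “wah” is handled separately). The function encodes just the string surgery, with the relevant members of S1 and M1 passed in directly.

```python
# Apply one SR (within S1 = {taa, te, tii}) and one MR (within
# M1 = {hain, hai, ho, hoon}) to the retrieved Hindi sentence.

def adapt_verb(sentence, sr=None, mr=None):
    words = sentence.split()
    if sr:                              # SR: swap the suffix on the main verb
        old, new = sr
        words = [w[: -len(old)] + new if w.endswith(old) else w
                 for w in words]
    if mr:                              # MR: swap the auxiliary morpho-word
        old, new = mr
        words = [new if w == old else w for w in words]
    return " ".join(words)

adapted = adapt_verb("ham chaawal khaate hain",
                     sr=("te", "tii"), mr=("hain", "hai"))
# adapted verb group: "... khaatii hai"; replacing the subject "ham"
# with "wah" then yields the full translation "wah chaawal khaatii hai".
```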

Under this group nine combinations are possible, taking into account three
tenses and three forms for each of them. Adaptation rule tables for the other
eight possibilities have been developed in a similar way. Salient features of these
verb morphological variations are discussed below:

1. Past indefinite to past indefinite: Here the verb morphological variation is handled
in a way similar to the present indefinite case discussed just above. However, for
morpho-word replacement, the set to be considered is {thaa, the, thii} (call it
M2) instead of the set M1.

2. Future indefinite to future indefinite: In this case either a copy operation or

a suffix replacement operation is used to handle the verb morphological varia-


tions. Accordingly, from Table 2.3 all the morpho-word replacement operations
have to be removed in order to handle the future indefinite case. Further, it
has to be taken into account that for the suffix replacement (SR) operations


the set {oongaa, oongii, oge, ogii, egaa, egii, enge, engii } (call it S2 ) is to be
considered (See Table A.3 of Appendix A) instead of the set S1 , i.e. {taa, te,
tii } used for present indefinite.

3. Present continuous to present continuous: In this case either a copy opera-

tion or one/two morpho-word replacements are required to deal with the verb
morphological variations depending upon the variations in the gender, number
and person of the subjects concerned (see Section A.2 of Appendix A). Thus
the rule table for handling this case may be obtained by modifying Table 2.3

in the following way. Each suffix replacement operation is to be replaced with


an additional morpho-word replacement to take care of number, gender and
person of the subject. This new morpho-word replacement operation will be
restricted to the set {rahaa, rahii, rahe} (call it M3 ).

4. Past continuous to past continuous: Here the verb morpho variation is done

in a way similar to the present continuous case discussed above. Hence, here
too, one may have two morpho-word replacements. For one of them the set
M2 , i.e. {thaa, thii, the} is to be considered instead of the set M1 , i.e. {hai,
hain, ho, hoon}. The set required for other morpho-word replacement is M3 ,

i.e. {rahaa, rahii, rahe}.

5. Future continuous to future continuous: In this case too either a copy op-
eration or one/two morpho-word replacements are required. If morpho-word
replacement operations are carried out then the relevant sets are as follows:

first set is M3 as discussed above. The other morpho-word replacement will


take care of the sense of the future tense, and therefore, instead of the set M1
the set {hoongaa, hoongii, honge, hogaa, hogii, hoge} (call it M4 ) has to be


used in Table 2.3.

6. Present perfect, Past perfect, Future perfect: If the input and the retrieved
example both have any one of these three then the verb morphology and adap-
tation operations imitate the rules of continuous form of the respective tense.

The only relevant change is that instead of the set M3 in all the three cases
the set {chukaa, chukii, chuke} (call it M5 ) is to be considered.

The morpho-words and suffixes for the adaptation operations in all the cases
discussed above can be found in Table A.3.

In the case of present perfect, past indefinite and past perfect, sometimes there is a
case ending “ne” with the subject (see Section A.1). In that case, the verb mor-
phological variation will change according to the gender and number of the object,
instead of the gender, number and person of the subject. For the past indefinite to past
indefinite transformation, the adaptation operation will either be a copy operation or
a suffix replacement, whereas in the other two cases the adaptation operations can
be either a copy operation, or a suffix replacement and a morpho-word replacement. All
possible suffix variations and morpho-word variations are listed in Section A.2.

2.3.2 Different Tenses Same Verb Form

In this group the verb morphological variation depends on gender, number and per-
son of the subject, and also on the variation in the tenses of the input and the
retrieved example. This group comprises eighteen combinations of verb morphol-
ogy variations. These 18 possibilities occur due to three different tenses (present,

past, future), and three verb forms (indefinite, continuous, perfect). Some members


of this group are present indefinite to past indefinite, present indefinite to future
indefinite, present continuous to past continuous, etc.

For illustration, we consider the following examples where the input sentence is in
present indefinite and the retrieved example is in past indefinite.

Example 1 : Suppose the input sentence is “She drinks water.”, and the retrieved sen-

tence is “She drank water.” with the Hindi translation as “wah paanii piitii thii ”. The
subjects of both the input and the retrieved example are feminine, 3rd person and
singular. In this situation, only one adaptation operation is required, i.e. morpho-
word replacement. The morpho-word “thii ” is to be replaced with the morpho-word
“hai ” to convey the sense of present indefinite form of the input sentence. Therefore,

the desired translation is “wah paanii piitii hai ”.

Example 2 : Here the input sentence is “She reads books”, and the retrieved sentence
is “He read books.” with the Hindi translation as “wah kitaabe padhtaa thaa”. The
subject of the input is feminine, 3rd person and singular whereas in the retrieved

sentence the subject is masculine, 3rd person and singular. In this situation two
adaptation operations are required:

1. One suffix replacement: the suffix “taa” is to be replaced with “tii ”; and

2. One morpho-word replacement: the morpho-word “thaa” is replaced with


“hai ”.

Hence the appropriate translation of the input sentence in Hindi is generated as


“wah kitaabe padhtii hai ”.
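The two operations of Example 2 expressed as data, using the same kind of simple string surgery as in the earlier illustrations; a self-contained sketch rather than the system's implementation.

```python
# Example 2: SR (taa -> tii) restores gender agreement, MR (thaa -> hai)
# moves the sentence from past indefinite to present indefinite.

def apply_verb_ops(sentence, ops):
    words = sentence.split()
    for kind, old, new in ops:
        if kind == "SR":    # replace the suffix on the word carrying it
            words = [w[: -len(old)] + new if w.endswith(old) else w
                     for w in words]
        elif kind == "MR":  # replace a whole morpho-word
            words = [new if w == old else w for w in words]
    return " ".join(words)

result = apply_verb_ops("wah kitaabe padhtaa thaa",
                        [("SR", "taa", "tii"), ("MR", "thaa", "hai")])
```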

Input→  M1S    F1S    M1P    F1P    M2S    F2S    M3S    M3P    F3S    F3P
Ret'd ↓

M1S     MR     SR+MR  SR+MR  SR+MR  SR+MR  SR+MR  MR     SR+MR  SR+MR  SR+MR
F1S     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR
M1P     SR+MR  SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  SR+MR  MR     SR+MR
F1P     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR
M2S     SR+MR  SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  SR+MR  MR     SR+MR
F2S     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR
M3S     MR     SR+MR  SR+MR  SR+MR  SR+MR  SR+MR  MR     SR+MR  SR+MR  SR+MR
F3S     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR
M3P     SR+MR  SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  SR+MR  MR     SR+MR
F3P     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  MR

Table 2.4: Adaptation Operations of Verb Morphological Variations in Present
Indefinite to Past Indefinite

The above two examples summarize the relevant adaptation operations for dealing
with verb morphology variations while adapting a past indefinite sentence to a
present indefinite one. Adaptation is carried out either by one MR (morpho-word
replacement) operation alone, or by one SR (suffix replacement) and one MR
operation together. Typically, one morpho-word from the set
{thaa, the, thii } is replaced with one from the set {hain, hai, ho, hoon} under the
MR operation. Under suffix replacement, when necessary, one element of the set
{taa, tee, tii } is replaced with another from the same set.

Table 2.4 provides all possible adaptation operations which occur due to the
variation in the gender, number and person of the subject.

Some important points regarding the adaptation rules for the remaining 17

combinations of verb morphological variation are discussed below.

1. If the input sentence is in future indefinite form, and the retrieved example is
in either past indefinite or present indefinite form, then a single set of adaptation
operations suffices for dealing with the verb morphological variations arising
from differences in the gender, number and person of the subject.

The adaptation operations for this set are suffix replacement and morpho-word
deletion.

• In suffix replacement, a suffix from the set {taa, tee, tii } is replaced by the
appropriate one from the set {oongaa, oongii, oge, ogii, egaa, egii, enge, engii }.

• Also, since the future indefinite in Hindi requires no additional morpho-word,
the morpho-word that comes with the present indefinite (i.e. one of {hoon,
hai, ho, hain}) or with the past indefinite (i.e. one of {thaa, the, thii }) has
to be deleted.


As an illustration, suppose the input sentence is “I will eat rice.”, and the
retrieved sentence is “I eat rice.” with Hindi translation “main chaawal khaataa
hoon”. Here the suffix “taa” is replaced with suffix “oongaa”, and the morpho-
word “hoon” is deleted from the retrieved Hindi translation. Therefore, the

Hindi translation of the input sentence is “main chaawal khaaoongaa”.

2. Similarly, to adapt from a future indefinite retrieved example to a present or
past indefinite input, one suffix replacement and one morpho-word addition
have to be carried out. The suffix replacement is just the opposite of the one
discussed above, and the morpho-word addition is done in the same spirit.
This becomes clear from the example discussed above if the roles of the input
and the retrieved example are reversed.

3. If the verb form is continuous or perfect then, regardless of the tense, Table 2.4
still applies, whether these verb forms and tenses occur in the input sentence
or in the retrieved sentence. The only change to be incorporated is that a
morpho-word replacement (MR) has to be carried out instead of the suffix
replacement (SR) operation given in Table 2.4.

For these tenses and verb forms, the suffixes for suffix replacement and the
morpho-words for morpho-word replacement can be found in Table A.3.
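The procedure of point 1 above (one suffix replacement plus one morpho-word deletion when adapting a present or past indefinite retrieved translation to a future indefinite input) can be sketched as follows. The function name is hypothetical, and the suffix pair is supplied by the caller according to the subject.

```python
# tense morpho-words that the future indefinite does not need
TENSE_MORPHO_WORDS = {"hoon", "hai", "ho", "hain",   # present indefinite
                      "thaa", "thii", "the"}          # past indefinite

def adapt_to_future_indefinite_from_indef(translation, old_suffix, new_suffix):
    """One SR on the main verb (e.g. 'taa' -> 'oongaa') and one MD of the
    tense morpho-word, since the Hindi future indefinite needs none."""
    out = []
    for w in translation.split():
        if w in TENSE_MORPHO_WORDS:
            continue                                   # MD
        if w.endswith(old_suffix):
            w = w[:-len(old_suffix)] + new_suffix      # SR
        out.append(w)
    return " ".join(out)

print(adapt_to_future_indefinite_from_indef("main chaawal khaataa hoon",
                                            "taa", "oongaa"))
# -> main chaawal khaaoongaa
```

This reproduces the “I will eat rice.” / “I eat rice.” illustration given in point 1.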

2.3.3 Same Tense Different Verb Forms

Here the input sentence and the retrieved example have the same tense but they
have different verb forms. Here too eighteen combinations of verb morphological


variation are possible. For example: present indefinite to present continuous, past
indefinite to past perfect, future perfect to future indefinite, etc. Different cases are
discussed below.

Suppose the verb of the input sentence is in future indefinite form, and the verb in

the retrieved example is in future continuous form. Three adaptation operations are
required to take care of all the possible variations in gender, number and person of the
subject. These operations are one suffix addition and two morpho-word deletions.
In suffix addition, one item from the suffix set {oongaa, oongii, oge, ogii, egaa, egii,
enge, engii } is to be added to the root form of the main verb of the retrieved Hindi
translation. Note that, since the retrieved example is in future continuous form,
the main verb will be in its root form only, and, therefore, no suffix deletion or
replacement is required. The two morpho-word deletions will be restricted to the sets

{rahaa, rahii, rahe} and {hoongaa, hoongii, honge, hogaa, hogii, hoge}, respectively.
The following example illustrates this adaptation procedure:

Let the input sentence be “She will eat rice.”, and the retrieved example be “She
will be eating rice.” ∼ “wah chaawal khaa rahii hogii ”. In this case, the suffix
“oongii ” will be added to “khaa”, and the last two words of the retrieved Hindi
sentence, “rahii ” and “hogii ”, will be deleted. The suffix “oongii ” is chosen
because the subject of the input sentence is feminine, 3rd person and singular.

If the retrieved example is in future perfect form, instead of future continuous
form as above, the adaptation operations remain the same. The only modification
is that one of the morpho-word deletions will be from the set {chukaa, chukii, chuke}
instead of the set {rahaa, rahii, rahe}.

For illustration, suppose for the same input as given above, the retrieved example


is “She will have eaten rice.” ∼ “wah chaawal khaa chukii hogii ”. In order to adapt
the verb morphology, the morpho-words “chukii ” and “hogii ” are to be deleted, and
the suffix “oongii ” is to be added to the verb “khaa”. One thus obtains the required
verb morphology.
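A minimal sketch of this suffix addition plus double morpho-word deletion, covering both the future continuous and the future perfect retrieved examples, is given below. The names are ours, and the suffix is passed in, chosen as in the text according to the subject.

```python
ASPECT_MARKERS = {"rahaa", "rahii", "rahe",        # continuous
                  "chukaa", "chukii", "chuke"}     # perfect
FUTURE_AUX = {"hoongaa", "hoongii", "honge", "hogaa", "hogii", "hoge"}

def adapt_to_future_indefinite(translation, suffix):
    """Two MDs (aspect marker and future auxiliary) followed by one SA on
    the main verb, which the deletions leave as the last remaining word."""
    words = [w for w in translation.split()
             if w not in ASPECT_MARKERS | FUTURE_AUX]
    words[-1] += suffix                            # SA on the root verb
    return " ".join(words)

# the future continuous and future perfect retrieved examples from the text;
# the suffix follows the text's choice for this subject
print(adapt_to_future_indefinite("wah chaawal khaa rahii hogii", "oongii"))
print(adapt_to_future_indefinite("wah chaawal khaa chukii hogii", "oongii"))
# both -> wah chaawal khaaoongii
```

The positional assumption (the root verb is the last word once the markers are removed) holds for these sentence patterns; a full system would locate the main verb from the parse.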

If the roles of the input and the retrieved sentence are reversed in the above

cases, then, in place of suffix addition, suffix deletion has to be carried out. Further,
the two morpho-word deletions are to be replaced with corresponding morpho-word
additions.

Adaptation rules for dealing with other verb morphological variations belonging
to this group have been developed in a similar way. One may refer to Section A.2

of Appendix A to figure out the appropriate suffixes and morpho-words that will be
involved in the necessary addition/deletion/replacement operation.

2.3.4 Different Tenses Different Verb Forms

The remaining thirty-six possibilities out of the total eighty-one combinations of verb
morphological variations belong to this group. Since it is not possible to discuss all
of them in this report, some typical ones are considered for the present discussion.
In particular, we discuss the case where the input sentence is in present indefinite

form. For illustration, we consider the retrieved examples of the following types: (i)
past continuous (ii) past perfect (iii) future continuous (iv) future perfect. It will
be shown that a single set of adaptation operations is sufficient for all the four cases
mentioned above. These adaptation operations are one suffix addition (SA), one

morpho-word replacement (MR) and one morpho-word deletion (MD). The purposes
of these three operations are as follows:


• For the present indefinite tense the relevant suffix for the main verb is one of
{taa, tii, te} depending upon the gender, person and number of the subject.
However, if the retrieved sentence is one of the four types mentioned above
then the root verb in Hindi translation is in its root form. Consequently, the

suffix addition is mandatory.

• In the present indefinite form, one of the following morpho-words { hoon, hai,
ho, hain} has to be used depending upon the number, gender and person of
the subject. However, if the retrieved example is in past tense (irrespective of

continuous or perfect verb form), then the relevant morpho-word set is {thaa,
thii, the}. On the other hand, if the retrieved sentence is in future tense,
whether continuous or perfect verb form, then the relevant morpho-word set is
{hoongaa, honge, hogii, hoge, hogaa, hongii }. The morpho-word replacement
is required to have the right morpho-word in the generated translation.

• The morpho-word deletion operation is required to take care of indefinite verb


form of the input sentence. In this case no other morpho-word is necessary.
However, in order to indicate continuous form of the verb (irrespective of past
or future tense) one of the following morpho-words {rahaa, rahii, rahe} is re-

quired. Similarly, for perfect verb form an additional morpho-word is required


from the set {chukaa, chuke, chukii }. For adaptation of the verb morphology this
additional morpho-word has to be deleted from the retrieved example.

For illustration, suppose the input sentence is “She eats rice.”, and the retrieved
example is one of the following:


(A) She was eating rice. ∼ wah chaawal khaa rahii thii

(B) She had eaten rice. ∼ wah chaawal khaa chukii thii
(C) She will be eating rice. ∼ wah chaawal khaa rahii hogii
(D) She will have eaten rice. ∼ wah chaawal khaa chukii hogii

Evidently, the sentences (A), (B), (C) and (D) are in past continuous, past
perfect, future continuous and future perfect form, respectively. The modifications
in the retrieved Hindi translations are as follows:

• In the translations of all the retrieved examples the suffix “tii ” will be added
to “khaa” (Hindi for “eat”).

• The morpho-word “rahii ” or “chukii ” (depending upon the case) will be deleted.

• The morpho-word “thii ” will be replaced with “hai ” if the retrieved example
is either (A) or (B).

• The morpho-word “hogii ” is replaced with “hai ” in Hindi translation of exam-


ple (C) or (D).

Therefore, incorporating all these modifications in the respective translations of
the retrieved examples, the required translation of the input sentence is “wah chaawal
khaatii hai ”.
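The single set of operations covering cases (A)-(D) can be sketched as below. The function and set names are ours, and the main verb is located positionally (it precedes the tense marker), which suffices for these examples.

```python
ASPECT_MARKERS = {"rahaa", "rahii", "rahe", "chukaa", "chukii", "chuke"}
TENSE_MARKERS = {"thaa", "thii", "the",                               # past
                 "hoongaa", "honge", "hogii", "hoge", "hogaa", "hongii"}  # future

def adapt_to_present_indefinite(translation, suffix, morpho_word):
    """One SA on the root verb, one MD of the aspect marker and one MR of
    the tense marker, as described for cases (A)-(D)."""
    words = []
    for w in translation.split():
        if w in ASPECT_MARKERS:
            continue                      # MD: drop 'rahii' / 'chukii'
        if w in TENSE_MARKERS:
            w = morpho_word               # MR: 'thii' / 'hogii' -> 'hai'
        words.append(w)
    words[-2] += suffix                   # SA: root verb precedes the marker
    return " ".join(words)

for hindi in ("wah chaawal khaa rahii thii",     # (A) past continuous
              "wah chaawal khaa chukii thii",    # (B) past perfect
              "wah chaawal khaa rahii hogii",    # (C) future continuous
              "wah chaawal khaa chukii hogii"):  # (D) future perfect
    print(adapt_to_present_indefinite(hindi, "tii", "hai"))
# each line prints: wah chaawal khaatii hai
```

The suffix and morpho-word arguments ("tii", "hai") follow from the subject being feminine, 3rd person and singular.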

In a similar way one can identify that the set of three adaptation operations

will be required if the input is in past indefinite form, and the retrieved is one of
present continuous, present perfect, future continuous or future perfect. However,
in order to carry out the morpho-word replacement one has to confine the selection
to the set {thaa, the, thii }. It will replace the relevant morpho-word, which is one


of {hoon, hai, ho, hain} for present tense, or one of {hoongaa, honge, hogii, hoge,
hogaa, hongii } for future tense, of the retrieved Hindi example. The suffix addition
and morpho-word deletion operations are restricted to the same set as mentioned
above.

Similarly, one can identify adaptation operations when the roles of the input
and the retrieved sentence are reversed in the cases discussed above. Evidently, in
these cases one suffix deletion, one morpho-word replacement and one morpho-word
addition will be required for adapting the verb morphology variations. One can

easily figure out the relevant sets of morpho-words and suffixes keeping in view the
above discussions.

The above discussion takes care of sixteen possible combinations of different


verb morphological variations. Other verb morphological variations of this group

have been identified. However, owing to the repetitive nature of the discussion, we do
not present all the other cases in this report. One may identify the relevant suffixes
and the morpho-words by referring to Section A.2 of Appendix A.

2.4 Adaptation Procedure for Morphological Variation of Passive Verbs

The above discussion of adaptation procedures for verb morphological variation has
been limited to the active form of verb. Similar adaptation procedures have also
been studied when the verb is in the passive form. Ideally, the passive form should

exist for all the three tenses and all the four verb forms. However, the passive forms
of verbs for present perfect continuous, past perfect continuous, future continuous,


and future perfect continuous tenses are cumbersome, and are rarely used (Ansell,
2000). We, therefore, restrict our discussion to the other eight more commonly used
forms of the passive voice only. Since adaptation may take place from an active voice
sentence to a passive one, and vice versa, we classified these adaptation procedures

into three broad groups:

1. Passive verb form to passive verb form (8×8, i.e. 64 cases)

2. Passive verb form to active verb form (8×9, i.e. 72 cases)

3. Active verb form to passive verb form (9×8, i.e. 72 cases)

For each of the three groups mentioned above, we discuss a few cases in detail:

Passive verb form to passive verb form

If the input sentence is in past indefinite passive verb form, and the retrieved

example is in present continuous passive verb form or past continuous passive verb
form, then a single set of adaptation operations is sufficient. These adaptation
operations are one morpho-word replacement, two morpho-word deletions and one
suffix replacement. The suffix replacement depends upon the particular Hindi verb

under consideration, and also upon the gender and number of the subject, so this
operation is required only in some such cases. The other three adaptation operations,
however, are mandatory. The purposes of these four operations are as follows:

• In the past indefinite passive verb form, one of the following morpho-words

{gayaa, gayii, gaye} has to be used depending upon the number and gender
of the subject. However, if the retrieved example is in continuous passive form


(irrespective of present or past tense), then the relevant morpho-word is “jaa”.


This necessitates the operation of morpho-word replacement.

• The morpho-word deletion operations are required to take care of indefinite


passive verb form of the input sentence. In this case no other morpho-word
is necessary. However, in order to indicate the continuous form of the verb
(irrespective of past or present tense), one morpho-word from the set
{rahaa, rahii, rahe} is required, and in order to indicate the present or past
tense one more morpho-word is required. This morpho-word comes from the
set {hain, ho, hoon, hai } if the retrieved sentence is in present tense; or it

comes from the set {thaa, the, thii } in case of past tense. For adaptation of
verb morphology these two morpho-words are to be deleted from the retrieved
example.

• In case of the optional suffix replacement, the appropriate suffix has to be
decided according to the rules for the PCP form of the verb given in Section
A.2 of Appendix A.

Consider the following examples.

Example 1 : The input sentence is in past indefinite passive verb form, and the
retrieved example is in present continuous passive verb form.

Input sentence : The nut was eaten by the squirrel.


Retrieved example : The nuts are being eaten by the squirrel. ∼
gilharii ke dwaaraa moongphaliyaan khaayii
jaa rahii hain

The Hindi translation of the input sentence is “gilharii ke dwaaraa moongphalii


khaayii gayii ”. Evidently, to generate this translation the morpho-words “rahii ”


and “hain” are to be deleted, and the morpho-word “jaa” is to be replaced with the
morpho-word “gayii ”. Note that there is no change to the main verb “khaayii ”.
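The two morpho-word deletions and the morpho-word replacement of Example 1 can be sketched as follows; the function name and the deletion set are ours.

```python
DELETABLE = {"rahaa", "rahii", "rahe",            # continuous marker
             "hai", "hain", "ho", "hoon",         # present tense marker
             "thaa", "thii", "the"}               # past tense marker

def adapt_passive_cont_to_past_indef(translation, gayaa_form):
    """Two MDs (aspect and tense markers) and one MR of 'jaa' to a member
    of {gayaa, gayii, gaye} agreeing with the subject."""
    out = []
    for w in translation.split():
        if w in DELETABLE:
            continue                                  # MD
        out.append(gayaa_form if w == "jaa" else w)   # MR
    return " ".join(out)

print(adapt_passive_cont_to_past_indef(
    "gilharii ke dwaaraa moongphaliyaan khaayii jaa rahii hain", "gayii"))
# -> gilharii ke dwaaraa moongphaliyaan khaayii gayii
```

The remaining change, “moongphaliyaan” to “moongphalii ”, is the separate constituent word replacement the text mentions, not a verb-morphology operation.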

Example 2 : Here we consider the same input sentence but the retrieved example is
“The apple was being eaten by the squirrel.” ∼ “gilharii ke dwaaraa seb khaayaa jaa
rahaa thaa”. Evidently, to generate the required translation “gilharii ke dwaaraa
moongphalii khaayii gayii ” all the operations given in Example 1 are to be carried
out. Further, due to the change in gender of the subject8 , the suffix “yaa” is
replaced with the suffix “yii ”. To generate the final translation of the input sentence
one more adaptation operation is needed: the constituent word “seb” is replaced with
“moongphalii ”. However, this replacement is not a part of the set of adaptation
operations mentioned above.

Passive verb form to active verb form

Here too we illustrate the verb morphology adaptation with the help of a specific
case: the input sentence is in the present indefinite passive verb form, and the
retrieved sentence is in active verb form in the same tense and form. Here one can

identify that one suffix replacement, one morpho-word addition and one morpho-
word replacement (depending upon the situation) are required to carry out the verb
morphology adaptation task. The significance of these three operations is as follows:

• The suffix {taa, te, tii} in the main verb of the active retrieved sentence is
replaced with an appropriate suffix according to the rules of PCP form of verb
given in Section A.2 of Appendix A.

• A morpho-word from the set {jaataa, jaatii, jaate}, whose elements are es-
sentially declensions of the verb “jaa”, is to be added after the main verb.
8
“seb” (“apple”) is masculine but “moongphalii” (“nut”) is feminine


The appropriate morpho-word is dependent on the gender and number of the


subject.

• Since the retrieved example is in present tense, it must contain one of the
morpho-words {hai, hain, ho, hoon}. Since the input sentence is also
in present tense, its Hindi translation will also have one morpho-word from

the same set. Hence, depending upon the gender, number and person of the
respective subjects, the same morpho-word may be retained, or it may have
to be replaced with another morpho-word from the same set.

The above is explained with the help of the following example.

Input sentence : This food is cooked by Sita.


Retrieved example : Sita cooks this food.
∼ sitaa yah khaanaa banaatii hai

The Hindi translation of the input sentence is “yah khaanaa sitaa ke dwaaraa
banaayaa jaataa hai ”. Evidently, to deal with the verb morphology in the generated

translation, two adaptation operations have to be performed. The suffix “tii ” of the
main verb of the retrieved sentence is to be replaced with “yaa”, and the morpho-
word “jaataa” is to be added after the main verb.

As the input sentence is the passive form of the retrieved sentence, “ke dwaaraa”
is added before the subject “siitaa”. This is necessary to generate the appropriate
translation of the input sentence but is not a part of the set of adaptation operations

mentioned above.
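The suffix replacement and morpho-word addition of this example can be sketched as below; the function name is ours, and the main verb is taken to be the word preceding the present-tense morpho-word, which holds for this sentence pattern.

```python
PRESENT_MORPHO_WORDS = {"hai", "hain", "ho", "hoon"}

def passivize_present_indefinite(translation, old_suffix, pcp_suffix, jaa_form):
    """One SR (active suffix -> PCP form) on the main verb and one MA of a
    'jaa' declension immediately after it."""
    words = translation.split()
    for i, w in enumerate(words):
        if w in PRESENT_MORPHO_WORDS:
            verb = words[i - 1]
            if verb.endswith(old_suffix):
                verb = verb[:-len(old_suffix)] + pcp_suffix   # SR: 'tii' -> 'yaa'
            words[i - 1:i] = [verb, jaa_form]                 # MA: add 'jaataa'
            break
    return " ".join(words)

print(passivize_present_indefinite("sitaa yah khaanaa banaatii hai",
                                   "tii", "yaa", "jaataa"))
# -> sitaa yah khaanaa banaayaa jaataa hai
```

The reordering of the subject and the insertion of “ke dwaaraa” are the separate operations described in the text, outside the verb-morphology adaptation.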

Active verb form to passive verb form

If the roles of the above mentioned input and the retrieved example are reversed,
one suffix replacement, one morpho-word replacement and one morpho-word deletion


will be required for adapting the verb morphology. One can easily figure out the
relevant sets of morpho-words and suffixes keeping in view the above discussions.

The adaptation rules for all the other possible variations mentioned earlier have been
formulated in a similar way. However, the similar nature of the discussions prevents
us from describing all of them in detail.

2.5 Study of Adaptation Procedures for Subject/Object Functional Slot

Subject (<S>) and Object (<O>) functional slots can be sub-divided into a number
of functional tags. These tags act as pre-modifier and post-modifier of the subject
(@SUBJ) and/or object (@OBJ) functional tag. The maximum possible structure
of the <S> or <O> functional slot using different tags is:

Functional Slot   Functional Tag Patterns

<S> or <O>:       {@DN> or @GN> or @QN> or @AN> } & (@SUBJ or @OBJ)
                  & {@<NOM-OF & {@DN> or @GN> or @QN> or @AN> } & @<P }

<S> or <O>:       {@DN> or @GN> or @QN> or @AN> } & (@SUBJ or @OBJ)
                  & {@<NOM & {@DN> or @GN> or @QN> or @AN> } & @<P }

Table 2.5: Different Functional Tags Under the Functional Slot <S> or <O>


Table 2.5 lists only those structures which are present in our example base, and
which have been studied in the course of the present research work. Here, {} is used
for showing non-obligatory (see Table 2.2) functional tags/slots. The definitions of
the functional tags are given in detail in Appendix B. The parts of speech, and their
transformations under the morpho tags, for the <S> or <O> functional slots are
noun (N), pronoun (PRON), proper noun (<Proper>), adjective (A) with
transformations ABS, PCP1 (“-ing” participle form) and PCP2 (“-ed” participle
form), adverb (ADV) and gerund (PCP1 form). All possible variations in the morpho
tags of the functional tags under the <S> and <O> functional slots are listed in
Table 2.6.

Functional tags           Functional tags and their morpho tags

@DN>:                     @DN> ART
                          @DN> DEM
@GN>:                     @GN> PRON PERS GEN SG1
                          @GN> PRON PERS GEN PL1
                          @GN> PRON PERS GEN SG2/PL2
                          @GN> PRON PERS GEN PL3
                          @GN> PRON PERS GEN SG3
                          @GN> <Proper> GEN SG
                          @GN> N GEN SG
                          @GN> N GEN PL
@QN>:                     @QN> NUM CARD
                          @QN> NUM ORD
                          @QN> NUM <Fraction> SG
                          @QN> NUM <Fraction> PL
                          @QN> <Quant> DET SG
                          @QN> <Quant> DET PL
                          @QN> <Quant> DET SG/PL
@AN>:                     @AN> A ABS
                          @AN> A PCP1
                          @AN> A PCP2
@SUBJ or @OBJ or @<P :    (@SUBJ or @OBJ or @<P) PRON PERS SG1
                          (@SUBJ or @OBJ or @<P) PRON PERS PL1
                          (@SUBJ or @OBJ or @<P) PRON PERS SG2/PL2
                          (@SUBJ or @OBJ or @<P) PRON PERS PL3
                          (@SUBJ or @OBJ or @<P) PRON PERS SG3
                          (@SUBJ or @OBJ or @<P) <Proper> N SG
                          (@SUBJ or @OBJ or @<P) N SG
                          (@SUBJ or @OBJ or @<P) N PL
                          (@SUBJ or @OBJ or @<P) PCP1
@<NOM-OF:                 @<NOM-OF PREP
@<NOM:                    @<NOM PREP

Table 2.6: Different Possible Morpho Tags for Each of the
Functional Tags under the Functional Slot <S> or <O>

We explain Table 2.5 and Table 2.6 with an example. Consider the sentence “This
old man is sitting in Ram’s office”. Its parsed version, obtained using the ENGCG

parser, is:

• @DN> DEM “this”,

• @AN> A ABS “old”,

• @SUBJ N SG “man”,

• @+FAUXV V PRES “be”,

• @-FMAINV V PCP1 “sit”,

• @ADVL PREP “in”,

• @GN> <Proper> GEN SG “Ram”,


• @OBJ N SG “office”

• < $. >.

Here, the tags that start with ‘@’ are called functional tags, e.g. @DN> - determiner,
@GN> - genitive case, @AN> - pre-modifier adjective etc. In Table 2.6 these tags
are succeeded by morpho tags, such as SG - singular, PERS - personal pronoun,
and GEN - genitive. Appendix B provides more details on these tags.

In the following discussion, the adaptation rules for functional tags due to the
variation of morpho tags are given.

2.5.1 Adaptation Rules for Variations in the Morpho Tags of @DN>

The morpho tags ART and DEM are associated with the functional tag @DN>
(see Table 2.6). The morpho tag ART is associated with the English words “the”,
“an” and “a”, and DEM is associated with “this”, “these”, “that” etc. The word “the”
does not have any Hindi equivalent, hence it is absent in all Hindi translations.
Corresponding to the articles “a” and “an”, often no Hindi word is used in the
translation. However, in some cases the word “ek ” (meaning “one”) is used depending
upon the context. For the adaptation of these words no morphological changes take
place. Therefore, if “@DN> ART” is present in the parsed version of either the input
or the retrieved sentence, and it corresponds to the word “the”, then no adaptation
operation is performed. With respect to determiners (words having the DEM morpho
tag, such as this (“yah”), these (“ye”), that (“wah”) etc.), the adaptation procedure
is straightforward.


For illustration, consider the input sentence “This man is kind.”, and the retrieved
example is “The man is kind.” ∼ “aadmii dayaalu hai ”. Note that no Hindi word
exists in the retrieved Hindi sentence corresponding to the word “the”. But the input
sentence contains the determiner “this”. Therefore, its Hindi translation “yah” is

required to be added before the subject “aadmii ” in the generated translation. Hence
the translation of the input sentence is “yah aadmii dayaalu hai ”.

2.5.2 Adaptation Rules for Variations in the Morpho Tags of @GN>

The functional tag @GN> is used for a genitive (i.e. possessive) case. Eight possible

morpho tag variations are listed in Table 2.6. These variations occur due to the
variations in gender, number and person of three different POS, which are N, PRON
and <Proper>. When the part of speech of the genitive word is N or <Proper>,
then the genitive case in Hindi is indicated with one of the case endings from the set

{kaa, ke, kii } as a morpho-word. Its usage depends upon the gender and number
of the noun following the word corresponding to the tag @GN>. When the genitive
word is a pronoun (PRON), the case endings are transformed into suffixes. The
following examples illustrate different genitive case structures in Hindi.

• “kaa” is used when the noun following it is masculine singular. For example:

the washerman’s son - dhobii kaa betaa


the pundits’ house - panditon kaa ghar

• “ke” is used when the noun following it is masculine plural. For example:

the gardener’s sons - malii ke bete


these men’s horses - in aadmiyoon ke ghodhe


• “ke” is also used when the noun following it is masculine singular with a case
ending. For example:

on the doctor’s child - daaktar ke bachche par


in the king’s villages - raja ke gaon mein

• “kii ” is used when the noun following it is feminine irrespective of whether it


is singular or plural with/without any case ending. For example:

the Brahmin’s book - brahmin kii pothii

on the king’s command - raja kii agya par


on the mountains’ peaks - pahadon kii chotiion par

There are occasions when morpho changes occur in the genitive word (when it
is a noun) due to the case endings “kaa”, “ke” and “kii ”. These rules are listed in
Appendix A. For example: “the boy’s horse” ∼ “ladke kaa ghodha”. Although the
Hindi for “boy” is “ladkaa”, its oblique form “ladke” has been used in the above
example. This happens because of the case ending “kaa”.

If POS of the genitive case is proper noun, then too, the same case ending {kaa,
ke, kii } is used as a morpho-word according to the gender and number of the noun
following it. In this case no morpho changes occur in the genitive word due to the

case ending. For example: “Parul’s home - paarul kaa ghar ”, “Ram’s book - raam kii
kitaab”, “in Ram’s home - raam ke ghar mein”.

As mentioned above, when the POS of the genitive word is pronoun the case
ending is attached with it in a form of the suffix. In case of first and second person
pronoun the suffix comes from the set {aa, e, ii }. However, in case of third person

pronoun the entire morpho-word is used as a suffix. The following examples illustrate
the genitive case with respect to pronoun.


my son - meraa betaa my sons - mere bete

my daughter - merii betii my daughters - merii betiyaan


his son - uskaa betaa his sons - uske bete
on my son - mere bete par on our son - hamaare bete par
his daughter - uskii betii his daughters - uskii betiyaan

in my book - merii kitaab mein in their villages - unke gaon mein

Once these structures are known, adaptation rules for different variations of
genitive case may be formulated by referring to Table 2.6. Table 2.8 has been
designed to indicate the adaptation procedures for the different genitive cases. The
headers of the rows and columns of this table correspond to three POS: <Proper>, N and

PRON.

Input→      <Proper>               N                             PRON
Ret'd ↓
<Proper>    (CP or ({WR}+ {MR}))   WR+ {MR}+ {SR or SA}          WR+ MD+ SA
N           WR+ {MR}               (CP or ({WR}+ {MR}+           WR+ MD+ SA
                                   {(SR or SA or SD)}))
PRON        WR+ MA                 WR+ MA+ {SR or SA}            (CP or SR or
                                                                 (WR+ SA))

Table 2.8: Adaptation Operations for Genitive Case to Genitive Case


We explain these adaptation rules with the help of the following example. Sup-
pose the input sentence is “The boy’s uniform is new.”, and the retrieved example
is “Parul’s toy is new.” ∼ “paarul kaa khiloonaa nayaa hai ”. The translation of the
input sentence is “ladke kii wardii naii hai ”. In order to generate this translation

from the retrieved example the following adaptation operations need to be carried
out.

The word “boy” corresponds to the genitive case in the input sentence, and its part
of speech is noun (N), while in the retrieved sentence the part of speech of “paarul ” is
proper noun (<Proper>). Hence, according to cell (1, 2), i.e. (<Proper>, N), the
set of adaptation operations is WR+ {MR}+ {SR or SA}. This indicates that one
word replacement is mandatory, and the other two operations are carried out
depending upon the particular example under consideration.

Here, the nouns that follow the genitive cases are “uniform” and “toy”, respec-
tively. Their Hindi translations are “wardii ” (which is feminine and singular) and
“khiloonaa” (which is masculine and singular), respectively. The possessive case

ending, therefore, will not be the same. One morpho-word replacement is needed
to adapt the genitive case ending: the morpho-word “kaa” is to be replaced with
“kii ”. Thus the optional morpho-word replacement is required in this example. The
genitive word “paarul ” in the retrieved Hindi sentence is to be replaced with
“ladkaa”. Further, a suffix replacement is necessary in this genitive word, viz.
“ladkaa” becomes “ladke”.

Thus, in this example all the three adaptation operations are needed to adapt
the genitive case. In some situations all these operations may not be required. For
illustration, to adapt the genitive case “Parul’s uniform” ∼ “paarul kii wardii ” to
“boy’s uniform”, no change is required in the case ending “kii ”.
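The cell (<Proper>, N) of Table 2.8 can be sketched programmatically; the function name and argument layout are ours, with the optional operations exposed as parameters.

```python
CASE_ENDINGS = {"kaa", "ke", "kii"}

def adapt_genitive(phrase, old_word, new_word, case_marker=None, oblique=None):
    """Obligatory WR of the genitive word, optional MR of the case ending,
    and optional SR to the oblique form, as in cell (<Proper>, N)."""
    out = []
    for w in phrase.split():
        if w == old_word:
            w = new_word                              # WR: 'paarul' -> 'ladkaa'
            if oblique and w.endswith(oblique[0]):    # SR: 'ladkaa' -> 'ladke'
                w = w[:-len(oblique[0])] + oblique[1]
        elif w in CASE_ENDINGS and case_marker:
            w = case_marker                           # MR: 'kaa' -> 'kii'
        out.append(w)
    return " ".join(out)

# all three operations fire here ...
print(adapt_genitive("paarul kaa khiloonaa", "paarul", "ladkaa",
                     case_marker="kii", oblique=("aa", "e")))
# -> ladke kii khiloonaa
# ... whereas "Parul's uniform" -> "boy's uniform" needs no MR on 'kii'
print(adapt_genitive("paarul kii wardii", "paarul", "ladkaa",
                     oblique=("aa", "e")))
# -> ladke kii wardii
```

Replacing “khiloonaa” with “wardii ” in the first case is the separate constituent word replacement the text notes, outside the genitive adaptation proper.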


2.5.3 Adaptation Rules for Variations in the Morpho Tags of @QN

The functional tag @QN is a quantifier tag. It is of two types: numeral (NUM) (e.g.

“two - do”, “fourth - choothaa”, “one-third - ke-tihaaii ”, “two-thirds - do-tihaaii ”),


and quantitative (<Quant>) (e.g. “some- kuchh”, “all - sab”, “many - bahut”). More
details of this functional tag and its morpho tags are given in Appendix B. Seven
variations in total (see Table 2.6) are possible due to the changes in the number (SG,
PL, SG/PL) and numeral properties, i.e. ordinal, cardinal number etc. But as far as

Hindi translation is concerned these seven variations do not play any role in the Hindi
translation. Therefore, no suffix operations or morpho-word operations are relevant
in this case. Only a single word operation, i.e deletion/addition/replacement/copy,
is required depending upon the tags in the input and the retrieved sentences.

For illustration, to adapt the translation of the retrieved example “Two men
are coming here.” ∼ “do aadmii yahaan aa rahe hain” to generate the translation of
the input sentence “Some men are coming here.”, the adaptation procedure should
replace “do” (i.e. “two”) with “kuchh” (i.e. “some”). The Hindi translation of the
input sentence is, therefore, “kuchh aadmii yahaan aa rahe hain”.

Other cases may be dealt with in a similar way.

2.5.4 Adaptation Rules for Variations in the Morpho Tags of Pre-modifier Adjective @AN>

Adjectives fall into two classes, viz., uninflected and inflected (Kellogg and Bailey,
1965). Uninflected adjectives, as the term implies, remain unchanged before all

64
Adaptation in English to Hindi Translation: A Systematic Approach

nouns and under all circumstances. English adjectives are necessarily uninflected -
they undergo no morphological changes with the variation in the nouns they qualify.
But Hindi adjectives may fall under both categories. For example, “achchhaa”
(“good”) is an inflected adjective, while “iimaandaar” (“honest”) is uninflected. For
illustration:

good boy - achchhaa ladkaa honest boy - iimaandaar ladkaa


good girl - achchhii ladkii honest girl - iimaandaar ladkii
good boys - achchhe ladke honest boys - iimaandaar ladke
good girls - achchhii ladkiyaan honest girls - iimaandaar ladkiyaan

Adjectives are of two types: basic adjectives, and participle forms, i.e. those that
are derived from verbs (Kachru, 1980). The inflection rules of these two types are

discussed below.

Basic adjectives: These adjectives are those which are adjective themselves such as
“sundar ∼ beautiful”, “achchhaa ∼ good”. ENGCG parser denotes them as “ABS”.
The rules of inflection for these adjectives are as follows.

1. If an adjective in Hindi ends with “aa”, then it changes into “e” for plural.

For example, “buraa ladkaa” (bad boy) and “bure ladke” (bad boys).

2. An adjective ending with “aa” changes into “ii” for feminine, e.g. “burii
ladkii” (bad girl) and “burii ladkiyaan” (bad girls).

3. If an adjective in Hindi ends with any other vowel, it does not change in any

case.
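The three inflection rules for basic adjectives can be sketched as follows. This is a minimal illustration using romanized forms; the function is ours and covers only the rules stated above.

```python
def inflect_basic_adjective(adj, gender, number):
    """Inflect a romanized Hindi basic adjective per the rules above:
    1. 'aa'-ending adjectives take 'e' for masculine plural;
    2. 'aa'-ending adjectives take 'ii' for feminine (any number);
    3. adjectives ending in any other vowel remain unchanged."""
    if adj.endswith("aa"):
        stem = adj[:-2]
        if gender == "F":
            return stem + "ii"     # burii ladkii / burii ladkiyaan
        if number == "PL":
            return stem + "e"      # bure ladke
    return adj                     # e.g. iimaandaar in all cases

assert inflect_basic_adjective("buraa", "M", "PL") == "bure"
assert inflect_basic_adjective("buraa", "F", "SG") == "burii"
assert inflect_basic_adjective("iimaandaar", "F", "PL") == "iimaandaar"
```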


Participle form of adjectives: Participle forms are of two types:

• present participle form of adjective (“-ing” form), denoted as A(PCP1);

• past participle form of adjective (“-ed” form), denoted as A(PCP2).

In Hindi the following rules govern the structures of these adjective forms.

1. In order to attain the A(PCP1) form of the adjective, a suffix from the set {taa, te,
tii} is added to the root form of the verb. In the case of the past participle form
A(PCP2), an appropriate suffix is attached according to the rules of the PCP
form of the verb (see Section A.2).

2. Further, in most cases a morpho-word (from the set {huaa, huye, huii }) also
needs to be added after the modified verb.

3. Participle forms of adjectives are also inflected according to the gender and

number of the corresponding noun.

The following examples illustrate the above points.

A(PCP1) form of adjective:

A falling stone ∼ girtaa huaa patthar
A dancing girl ∼ naachtii huii ladkii
Running horses ∼ daudte huye ghode

A(PCP2) form of adjective:

A tired man ∼ thakaa huaa aadmii
A broken chair ∼ tutii huii kursii
Rotten apples ∼ sade huye seb
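The A(PCP1) pattern above — verb root plus a suffix from {taa, te, tii} followed by a morpho-word from {huaa, huye, huii} — can be sketched as below. This is an illustrative fragment; the A(PCP2) case would additionally need the verb-specific suffix rules of Section A.2.

```python
PCP1_SUFFIX = {("M", "SG"): "taa", ("M", "PL"): "te",
               ("F", "SG"): "tii", ("F", "PL"): "tii"}
MORPHO_WORD = {("M", "SG"): "huaa", ("M", "PL"): "huye",
               ("F", "SG"): "huii", ("F", "PL"): "huii"}

def pcp1_adjective(verb_root, gender, number):
    """Present-participle adjective: root + {taa,te,tii} + {huaa,huye,huii},
    both agreeing with the qualified noun's gender and number."""
    key = (gender, number)
    return verb_root + PCP1_SUFFIX[key] + " " + MORPHO_WORD[key]

assert pcp1_adjective("gir", "M", "SG") == "girtaa huaa"     # falling (stone)
assert pcp1_adjective("naach", "F", "SG") == "naachtii huii" # dancing (girl)
assert pcp1_adjective("daud", "M", "PL") == "daudte huye"    # running (horses)
```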


The pre-modifier adjective tag @AN>, therefore, has three possible morpho tag
variations (see Table 2.6): “@AN> A ABS”, “@AN> A PCP1”, and “@AN> A
PCP2”. Adaptation rules for adjectives have been formulated keeping in view all
the morpho-transformations discussed above. The following Table 2.10 presents

these rules.

The following examples illustrate the usage of the rule Table 2.10. Suppose
the input sentence is “Faded flower does not look good.”. We consider two different
retrieved examples to describe the adaptation procedure.

Example 1 : Here the retrieved example is:

Beautiful flower looks good. ∼ sundar phool acchaa dikhtaa hai

The adaptation operations for this example should follow the cell(1, 3), i.e. (ABS,
A(PCP2)) of the above table as the pre-modifier adjective of the subject is of the

form “ABS” in the retrieved sentence, and of the form “A(PCP2)” in the input
sentence.

Input→    ABS                        A(PCP1)                          A(PCP2)
Ret’d ↓
ABS       (CP or SR or (WR+{SR}))    WR+ MA+ SA                       WR+ MA+ (SR or SA)
A(PCP1)   WR+ {SR} + MD              (CP or ({WR} + {2SR}))           ((SR+ {SR}) or (WR+ SR+ {SR}))
A(PCP2)   WR+ {SR} + MD              ((SR+ {SR}) or (WR+ SR+ {SR}))   (CP or ({WR}+ {SR} + {SR or SA}))

Table 2.10: Adaptation Operations for Pre-modifier Adjective to Pre-modifier Adjective


The pre-modifier adjective “sundar” is replaced with “murjhaa”, and the suffix “yaa”
is added to that word. The morpho-word “huaa” is added after this modified
verb in the retrieved Hindi sentence as the subject (“phool”) is singular masculine.

Hence three adaptation operations (viz. one constituent word replacement, one
morpho-word addition and one suffix addition) are required for carrying out the
adaptation task. However, there may be situations when a suffix replacement may
have to be carried out in place of the suffix addition.

Since “not” is present in the input, its Hindi translation “nahiin” is added to the
retrieved Hindi sentence to generate the appropriate translation of the input
sentence. However, this modification is not a part of the adaptation operations listed
in Table 2.10. Hence, the Hindi translation of the input sentence is “murjhaayaa
huaa phool acchaa nahiin dikhtaa hai”.

Example 2 : Another retrieved example is:

Fading flowers do not look good.

∼ murjhate huye phool achchhe nahiin dikhte hain


Cell(2, 3), i.e. (A(PCP1), A(PCP2)) of the Table 2.10 lists the necessary adaptation
operations. The two possible sets of operations are ((SR+ {SR}) or (WR+ SR+

{SR})), i.e. one suffix operation is optional in both the sets.

In this example only two suffix replacements are required, i.e. the first set of op-
erations. The suffix “te” is replaced with “yaa” in the pre-modifying adjective
“murjhaate”, and the suffix “ye” is replaced with “aa” in the morpho-word “huye”.
In some situations, the second suffix operation may not be needed if the gender and
person of the qualified nouns are the same in both the input and the retrieved
example. If the input and the retrieved example have different verbs in the participle


form then, accordingly, one word replacement operation has to be invoked. This
operation will take care of the variation in the participle verb. One can realize this
if “blooming flowers” ∼ “khilte huye phool ” has to be adapted to translate “fading
flowers” (“murjhaate huye phool ”).
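The suffix replacement (SR) operation used throughout this example can be sketched as a small helper; the function is illustrative only.

```python
def replace_suffix(word, old_suffix, new_suffix):
    """Suffix replacement (SR): swap a trailing suffix on a romanized word."""
    if not word.endswith(old_suffix):
        raise ValueError(f"{word!r} does not end with {old_suffix!r}")
    return word[: len(word) - len(old_suffix)] + new_suffix

# The two SRs of Example 2: murjhaate -> murjhaayaa, huye -> huaa
assert replace_suffix("murjhaate", "te", "yaa") == "murjhaayaa"
assert replace_suffix("huye", "ye", "aa") == "huaa"
```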

Note that the present discussion is limited to the adaptation procedure for the pre-
modifier adjective; in order to generate the Hindi translation of the input
sentence, it has therefore been assumed that the other modifications in the sentence
have already been incorporated. The Hindi translation is thus “murjhaayaa huaa phool
achchhaa nahiin dikhtaa hai”.

The present discussion concentrates on the variations in the pre-modifier ad-
jective form. The adaptation rule table developed therein corresponds to nouns
belonging to the subject and object functional slots, for which the adjective works as an
attributive one. The same rule Table 2.10 works for an attributive adjective corre-
sponding to a noun belonging to any functional slot/tag other than subject and object.
Another usage of the adjective may be noticed, in both English and
Hindi, namely the predicative one. In Hindi, a predicative adjective (subjective com-
plement) agrees with its subject in number and gender. For example, “He is good ∼
wah achchhaa hai” and “She is good ∼ wah achchhii hai”. The rules given in Table
2.10 work for predicative adjectives as well.

2.5.5 Adaptation Rules for Variations in the Morpho Tags of @SUBJ

The subject tag “@SUBJ” is the main and obligatory tag under the subject slot.
As listed in Table 2.6, nine possible morpho tag variations have been observed for


the subject functional tag. Within these nine possible morpho tags, there are four
parts of speech in total: noun (N), proper noun (<Proper>), pronoun (PRON) and
gerund (PCP1). The variations in these parts of speech may occur due to either a
case ending or number. In this respect the following may be observed.

• The only case ending that may occur with respect to subject is “ne”. If the

POS of the subject is noun or pronoun then morphological changes may occur
due to this case ending. For example,

“ladkaa + ne” - “ladke ne” (boy) “bachchaa + ne” - “bachche ne” (child)

“wah + ne” - “usne” (he/ she) “ham + ne” - “hamne” (we)

More details of this case ending are given in Appendix A. It may be noted that
no morphological changes occur to the subject due to this case ending if the
POS of the subject is proper noun or PCP1.

• Morphological changes may occur in nouns due to variations in number (sin-


gular or plural) also. For example,

Singular Plural
boy - ladkaa boys - ladke
house - ghar houses - ghar (No change)
cloth - kapadaa clothes - kapade
girl - ladkii girls - ladkiyaan

class - kakshaa classes - kakshaayen

• In the PCP1 form, a suffix “naa” is always added to the root form of the verb.

For example, “Swimming is a good exercise.” ∼ “tairnaa achchaa wyaayaam
hai”.
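These subject-level variations — the “ne” case ending, number inflection, and the “naa” gerund suffix — can be sketched with a small illustrative lexicon. The entries and function names below are ours, chosen only to mirror the examples above.

```python
PLURAL = {"ladkaa": "ladke", "ghar": "ghar", "kapadaa": "kapade",
          "ladkii": "ladkiyaan", "kakshaa": "kakshaayen"}   # tiny sample lexicon
OBLIQUE_NE = {"ladkaa": "ladke ne", "bachchaa": "bachche ne",
              "wah": "usne", "ham": "hamne"}                # irregular 'ne' forms

def pluralize(noun):
    """Number inflection via the sample lexicon (identity if unlisted)."""
    return PLURAL.get(noun, noun)

def add_ne(subject):
    """Attach the 'ne' case ending, using the oblique form where required."""
    return OBLIQUE_NE.get(subject, subject + " ne")

def gerund(verb_root):
    """PCP1 (gerund) subject: root + 'naa', e.g. tair -> tairnaa."""
    return verb_root + "naa"

assert pluralize("ladkii") == "ladkiyaan" and pluralize("ghar") == "ghar"
assert add_ne("wah") == "usne" and add_ne("ladkaa") == "ladke ne"
assert gerund("tair") == "tairnaa"
```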


Input→     N                                  <Proper>    PRON        PCP1
Ret’d ↓
N          (CP or ({WR} + {SR or SA or SD}))  WR          WR          WR+WR+SA
<Proper>   WR + {SR or SA or SD}              (CP or WR)  WR          WR+SA
PRON       WR + {SR or SA or SD}              WR          (CP or WR)  WR+SA
PCP1       WR + {SR or SA or SD}              WR          WR          (CP or WR)

Table 2.11: Adaptation Operations for Subject to Subject Variations

The rule Table 2.11 presents the relevant adaptation operations for different
variations in the subject discussed above. The following examples illustrate some of

these rules.

Example 1 : Suppose the input sentence is “The boy is playing.”, and the retrieved
example is “Boys are playing.” ∼ “ladke khel rahe hain”. Since the subject of the

input sentence is “boy”, to generate its Hindi translation “ladkaa” only the suffix “e”
is replaced with “aa” in the subject “ladke” of the retrieved Hindi sentence. This
is because the root word of the subject in both the input and the retrieved sentence
is the same, that is “boy”. However, if the root words of the subjects differ, then a
word replacement will definitely be needed. Additionally, some suffix replacement
or addition may be needed to take care of the number of the subject. For example,
to adapt “boy” to “sister” it needs only one constituent word replacement: “ladkaa”
is to be replaced with “bahan”. On the other hand, if “boy” is to be adapted to
“sisters” (i.e. plural form) then a word replacement (“ladkaa” → “bahan”) followed
by a suffix addition (“bahan” → “bahanen”) will be required.


Therefore, the cell(1,1), i.e. (N, N), corresponds to the above-discussed operation set,
i.e. (CP or ({WR} + {SR or SA or SD})).

Example 2 : Consider the input sentence “He is a good man.”. Let the correspond-
ing retrieved example be “Walking is a good exercise.” ∼ “sair karnaa ek achchhaa
vyaayaam hai ”. The subject of the input sentence is “he”, and its POS is pronoun

(PRON), while the subject of the retrieved example is “walking”, and its POS is “-
ing” verb form, i.e. gerund (PCP1). In this case, adaptation operation as mentioned
in cell(4, 3), i.e. (PCP1, PRON) is required for doing the changes in the retrieved
translation. In this case the adaptation operation is word replacement. The word

“sair karnaa” is to be replaced with “wah”.
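Rule tables such as Table 2.11 translate naturally into a lookup keyed by the (retrieved, input) POS pair. A partial sketch, with operation sets written as the strings used in the table:

```python
# Cells of Table 2.11: (retrieved POS, input POS) -> required operations.
SUBJ_RULES = {
    ("N", "N"): "CP or ({WR} + {SR or SA or SD})",
    ("N", "PRON"): "WR",
    ("PRON", "PCP1"): "WR+SA",
    ("PCP1", "PRON"): "WR",
    ("PCP1", "PCP1"): "CP or WR",
    # ... the remaining cells are filled in the same way
}

def subject_adaptation_ops(retrieved_pos, input_pos):
    return SUBJ_RULES[(retrieved_pos, input_pos)]

# Example 2 above: retrieved gerund subject ("walking"), input pronoun ("he")
assert subject_adaptation_ops("PCP1", "PRON") == "WR"
```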

For the functional tags @OBJ and @>P, the same adaptation rule table can be
used because the morpho variations for these functional tags are the same as those for
@SUBJ, as given in Table 2.6.

For the last two functional tags (@<NOM-OF, @<NOM) there is only one pos-
sible morpho tag variation, where the POS is preposition in both cases. The
functional tag @<NOM-OF corresponds to the preposition “of”, and its trans-
lation in Hindi is “kaa”, “ke” or “kii”, based on the gender, number
and person of the word corresponding to the @<P tag. In the case of @<NOM, there is
no particular postposition as in @<NOM-OF, so a fixed Hindi translation cannot be
specified, and the translation takes place according to the preposition in the input
sentence.


2.6 Adaptation of Interrogative Words

This section discusses sentences that start with interrogative words, which are of
two types: interrogative pronouns (such as who, what, whom, which, whose) and
interrogative adverbs (such as when, where, how, why). This study has been done on
a selected set of representative sentences from the example base, and focuses
on finding the usages of different interrogative words and the corresponding translation
patterns. The major findings of this study are as follows.

• An interrogative word may have many different structures in English.

• The same interrogative word may have different Hindi translations in different
contexts, and consequently, the structures of the corresponding Hindi transla-
tions may also vary.

• Different interrogative words may generate Hindi translations of the same

structure.

The above findings are important from EBMT point of view because commonality
of the interrogative words may not lead to the most useful retrieval. In order to
retrieve the most similar translation example, one may have to look into sentences
involving some other interrogative words. Table 2.12 shows the examples and their

patterns. The interrogative sentence patterns are denoted as INi , i=1,2,...,26. These
examples have been taken from the example base. The patterns of the sentences are
decided from the parsed versions of various examples given by the ENGCG parser.


IN1: “Who” &<LV> &<S>?
    Who are you? ∼ tum kaun ho?
IN01: “Who” &<LV> &<PP>?
    Who is at the door? ∼ darwaaje par kaun hai?
IN2: “Who” &<V> &<O> {<PP> or <Adverb> or (<PP> <Adverb>)}?
    Who knows music? ∼ sangiit kaun jaantaa hai?
    Who has played tunes on guitar well? ∼ gitaar par dhun kisne achchhii bajaaii hai?
IN3: “Who” &<AuxV> &<S> &<MainV> {<Adverb>}?
    Who do you like most? ∼ tum kisko sab se zyaadaa pasand karte ho?
IN4: “Who” &<AuxV> &<S> &<MainV> &<Preposition>?
    Who are you laughing at? ∼ tum kis par hans rahe ho?
IN5: “What” &<LV> &<SC>?
    What is this? ∼ yah kyaa hai?
IN6: “What” &<AuxV> &<S> &<MainV>?
    What do you like? ∼ tum kyaa pasand karte ho?
IN7: “What” &<N> &<LV> &<SC/S>?
    What color is the cap? ∼ topii kaun se rang kii hai? OR topii kis rang kii hai?
    What shapes are these balls? ∼ ye gendhen kaun se aakaaro kii hain? OR ye gendhen kin aakaaro kii hain?
IN8: “What” &<N> &<AuxV> &<S> &<MainV> {<Adverb>}?
    What book have you read recently? ∼ tum ne kaun sii kitaab abhii padhii hai? OR tum ne kis kitaab ko abhii padhaa hai?
IN9: “Which” &<LV> &<SC>?
    Which is the best book? ∼ kaun sii kitaab sab se achchhii hai?

IN10: “Which” &<AuxV> &<S> &<MainV> &<Adverb>?
    Which do you feel better? ∼ tumhe kaun saa zyaadaa sahii lagtaa hai?
IN11: “Which” &<N> &<LV> &(<SC> or <S>) {<PP>}?
    Which fruit is good for his health? ∼ kaun saa phal us kii sehat ke liye achchhaa hai?
    Which fruit is this? ∼ yah kaun saa phal hai?
IN12: “Which” &<N> &<AuxV> &<S> &<MainV> {<Adverb>}?
    Which book will you take? ∼ tum kaun sii kitaab logii? OR tum kis kitaab ko logii?
    Which student will you call? ∼ tum kis chhaatra ko bulaaoge? OR tum kaun saa chhaatra bulaaoge?
IN13: <Preposition> &“which” &<N> &<AuxV> &<S> &<MainV> {<O>}?
    In which hotel will you stay? ∼ tum kis hotal main rukoge? OR tum kaun se hotal main rukoge?
    To which boy did you give the book? ∼ tum ne kis ladke ko kitaab dii? OR tum ne kaun se ladke ko kitaab dii?
    To which boys did you give the books? ∼ tum ne kin ladkon ko kitaaben dii? OR tum ne kaun se ladkon ko kitaaben dii?
IN14: “Whose” &<N> &<LV> &(<S> or <SC>)?
    Whose book is this? ∼ yah kiskii kitaab hai?


IN15: “Whose” &<N> &<AuxV> &<S> &<MainV>?
    Whose book are you reading? ∼ tum kiskii kitaab padh rahe ho?
IN16: “Whose” &<N> &<V> &{(<PP>) or (<Adverb>) or (<PP> <Adverb>)}?
    Whose pen is lying on the table? ∼ kiskii kalam mez par padii hai?
IN17: <Preposition> &“whose” &<N> &<AuxV> &<S> &<MainV>?
    In whose name will you transfer this home? ∼ tum kiske naam par yah ghar hastaantaran karoge?
IN18: <Preposition> &“whom” &<AuxV> &<S> &<MainV>?
    To whom are you listening? ∼ tum kis ko sun rahe ho?
IN19: “Whom” &<AuxV> &<S> &<MainV>?
    Whom do you prefer? ∼ tum kisko pasand karte ho?
IN20: “Why” &<LV> &<S> &(<Adverb> or <AdjP>)?
    Why are you here? ∼ tum yahaan kyon ho?
    Why are you so stupid? ∼ tum itne murkh kyon ho?
IN21: “Why” &<AuxV> &<S> &<MainV> {(<O>) or (<Adverb>) or (<O> <Adverb>)}?
    Why did you abuse the old man? ∼ tum budhe aadmii ko gaalii kyon de rahe ho?
    Why are you weeping? ∼ tum kyon ro rahe ho?
IN22: “Where” &<LV> &<S> &{<Preposition> or <Adverb>}?
    Where are you? ∼ tum kahaan ho?
IN23: “Where” &<AuxV> &<S> &<MainV>?
    Where do you live? ∼ tum kahaan rahte ho?
IN24: “When” &<AuxV> &<S> &<MainV> &{<O> or <Adverb> or <PP> or (<O> <Adverb>)}?
    When did the British quit India? ∼ british bhaarat ko kab choda kar gaye?
    When does he go to bed? ∼ wah bistar par kab jataa hai?


IN25: “How” &<LV> &<S> &{<Adverb>}?
    How are you? ∼ tum kaise ho?
    How is she? ∼ wah kaisii hai?
    How is he? ∼ wah kaisaa hai?
IN26: “How” &<AuxV> &<S> &<MainV> {<O> or <Adverb> or (<O> <Adverb>)}?
    How are you feeling now? ∼ tumhe ab kaisaa lag rahaa hai?
    How is she looking today? ∼ wah aaj kaisii dikh rahii hai?

Table 2.12: Different Sentence Patterns of Interrogative Words

Each interrogative word has a particular role in a sentence. According to their
roles, the parser assigns different functional tags and corresponding morpho tags to these
interrogative words. The functional tags and corresponding morpho tags assigned
by the parser to the interrogative words of the above sentences are given in Table
2.13.

Interrogative word   Sentence pattern No.     Functional & Morpho tags
Who:                 IN1                      @PCOMPL-S PRON WH SG/PL
                     IN01, IN2                @SUB PRON WH SG/PL
                     IN3                      @OBJ PRON WH SG/PL
                     IN4                      @<P PRON WH SG/PL
What:                IN5                      @SUB PRON WH SG/PL
                     IN6                      @OBJ PRON WH SG/PL
                     IN7, IN8                 @DN> DET WH SG/PL
Which:               IN9                      @SUB PRON WH SG/PL
                     IN10                     @OBJ PRON WH SG/PL
                     IN11, IN12, IN13         @DN> DET WH SG/PL


Whose:               IN14, IN15, IN16, IN17   @GN> DET WH GEN SG/PL
Whom:                IN18                     @<P PRON WH SG/PL
                     IN19                     @OBJ PRON WH SG/PL
Why:                 IN20, IN21               @ADVL ADV WH
Where:               IN22, IN23               @ADVL ADV WH
When:                IN24                     @ADVL ADV WH
How:                 IN25, IN26               @ADVL ADV WH

Table 2.13: Functional & Morpho Tags Corresponding to Each Interrogative Sentence Pattern

Note that Table 2.12 by no means provides an exhaustive list of English sentence
patterns involving interrogative words. However, these are the sentence patterns that
are predominantly present in our example base. By examining the structures of
the corresponding Hindi sentences one can easily see that it is the role of the word
concerned that is most important in determining the Hindi sentence structures. One
may easily find that an interrogative word having more than one functional tag has
different translations in Hindi, which certainly implies different translation
structures. These variations corresponding to each interrogative word are explained
below:

Variation in Translation of “who”: Table 2.13 shows four different functional tags
for this word. The observed translation patterns for these are as follows:

1. @PCOMPL-S: If “who” is used as the subjective complement, as in pattern


IN1 , then the only way it is translated into Hindi is “kaun”.

2. @SUB: When “who” is used as a subject, as in patterns IN01 and IN2, its trans-
lation into Hindi may have two possibilities depending upon the tense of the
sentence in the case of the IN2 sentence pattern. If the tense and verb form are present
perfect, past indefinite or past perfect, the translation of “who” in Hindi is
“kisne”. In all other tenses and verb forms “who” is translated as “kaun”.

However, the translation of “who” in the case of the IN01 sentence pattern is “kaun”.

3. @OBJ: If the functional tag assigned to “who” is @OBJ (as in pattern IN3),
then its Hindi translation is “kisko”.

4. @<P: This tag implies that here “who” is used as a complement of preposition

(as in pattern IN4 ). In this case the Hindi translation is “kis”.

Variation in translation of word “what”: The four translation patterns of sentences


involving “what” as the interrogative word show only three functional tags for the
word “what” (see Table 2.13). Their translations have the following variations.

1. In translation patterns IN5 and IN6 the interrogative word “what” is used as

subject (@SUB) and object (@OBJ), respectively. In both the cases “what” is
translated as “kyaa”.

2. In the case of sentence patterns IN7 and IN8 the word “what” is used as a de-
terminer and its functional tag is @DN>. However, due to the variations in the
overall sentence patterns in these two cases, different translations for the
word “what” have been observed. In both cases, the Hindi translation is
of the form kaun followed by one of {saa, se, sii} depending upon the number
and gender of the noun following the word “what”. However, in both IN7 and
IN8 one more translation of “what” has been observed, viz. “kis” or
“kin” according to the number of the noun following the word “what”. Fur-
ther, the morpho-word “kii” is added after the noun in the case of the IN7 sentence
pattern, while in the case of IN8 the morpho-word “ko” is added after the noun.

Variation in translation of the word “which”: As shown in Table 2.12, five different
sentence patterns, viz. IN9 to IN13, have been observed corresponding to the word
“which”. In all these cases, although the functional tag for the word “which” varies,
its translation into Hindi is done in the same way, using the word “kaun” followed by
one of the morpho-words from the set {saa, sii, se} depending upon the number and
gender of the noun following the word “which”.

However, in both the cases IN12 and IN13, one more translation of “which” has
been observed, viz. “kis” or “kin” according to the number of the noun following the
word “which”. Further, the morpho-word “ko” is added after the noun.

Variation in translation of the word “whose”: Although four different sentence pat-
terns (i.e. IN14 , IN15 , IN16 and IN17 ) have been observed for English sentences

involving the word “whose”, in all of them the functional tag of this word has been
found to be @GN>. Consequently, its translation into Hindi is also found to be the
same, i.e. one from the set {kiskaa, kiske, kiskii, kinkii, kinkaa, kinke}. The actual
usage depends upon the gender and number of the noun following the word “whose”.

Variation in translation of the word “whom”: Two possibilities have been observed
in this case:

1. @<P: Under this functional tag the word “whom” is used as a complement of

the preposition, as in sentence pattern IN18 . In this case the Hindi translation
of this word is “kis”.


2. @OBJ: In this case the functional tag corresponding to “whom” is object, as


in sentence pattern IN19 . The corresponding Hindi translation of this word is
“kisko”.

Variation in translation of interrogative adverb words: Under this case four words
have been studied: “why”, “where”, “when” and “how”. Their Hindi translations
are as follows:

• Irrespective of the sentence patterns (i.e. IN20 , IN21 , IN22 , IN23 and IN24 ) the
first three of the above four interrogative adverbs have unique translations in
Hindi. The Hindi translation of “why” is “kyon”, that of “where” is “kahaan”,
while “when” is translated as “kab”.

• In both the sentence patterns IN25 and IN26 , the translation of the word “how”

into Hindi is one from the set {kaisaa, kaisii, kaise}. This variation in the
translation is governed by the gender and number of the subject of the under-
lying sentence pattern.
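The context-dependent translations catalogued above amount to a lookup on the interrogative word and its functional tag, with a few cells needing further context such as tense. A partial sketch; the data structure and function names are ours.

```python
# (interrogative word, functional tag) -> fixed Hindi translation.
WH_TRANSLATION = {
    ("who", "@PCOMPL-S"): "kaun", ("who", "@OBJ"): "kisko", ("who", "@<P"): "kis",
    ("whom", "@<P"): "kis", ("whom", "@OBJ"): "kisko",
    ("why", "@ADVL"): "kyon", ("where", "@ADVL"): "kahaan", ("when", "@ADVL"): "kab",
}

def translate_who_subject(tense):
    """@SUB 'who': 'kisne' in present perfect / past indefinite / past
    perfect contexts (IN2), 'kaun' otherwise."""
    return "kisne" if tense in {"present perfect", "past indefinite",
                                "past perfect"} else "kaun"

assert WH_TRANSLATION[("where", "@ADVL")] == "kahaan"
assert translate_who_subject("past indefinite") == "kisne"
assert translate_who_subject("present indefinite") == "kaun"
```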

The above study of interrogative words suggests that sentences having different
interrogative words may have the same translation patterns. The following examples
illustrate this point.

(A) Why are you going today?


∼ tum aaj kyon jaa rahe ho?
(B) Where are you going today?
∼ tum aaj kahaan jaa rahe ho?

(C) When are you going today?


∼ tum aaj kab jaa rahe ho?


Suppose one has to generate the translation of the English sentence “How are you
going today?”. Its Hindi translation is “tum aaj kaise jaa rahe ho?”. Obviously, this
translation can be generated easily if one of the above three examples is considered
as a retrieved sentence.

Based on the above observations we cluster the above sentence patterns into

several groups as given below.

G1: IN21, IN23, IN24 and IN26

G2: IN1, IN01, IN20, IN22 and IN25

G3: IN2, IN3, IN6, IN10 and IN19

G4: IN7, IN11 and IN14

G5: IN8, IN12, IN15 and IN16

G6: IN5 and IN9

G7: IN4 and IN18

G8: IN13 and IN17
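Deciding whether a retrieved pattern falls in the same group as the input — the case where simple operations suffice — is then a set-membership test. A minimal sketch:

```python
GROUPS = {
    "G1": {"IN21", "IN23", "IN24", "IN26"},
    "G2": {"IN1", "IN01", "IN20", "IN22", "IN25"},
    "G3": {"IN2", "IN3", "IN6", "IN10", "IN19"},
    "G4": {"IN7", "IN11", "IN14"},
    "G5": {"IN8", "IN12", "IN15", "IN16"},
    "G6": {"IN5", "IN9"},
    "G7": {"IN4", "IN18"},
    "G8": {"IN13", "IN17"},
}

def same_group(pattern1, pattern2):
    """True if both interrogative sentence patterns fall in one group,
    i.e. the interrogative word can be adapted with simple operations."""
    return any(pattern1 in g and pattern2 in g for g in GROUPS.values())

assert same_group("IN22", "IN25")        # 'where' and 'how': within G2
assert not same_group("IN8", "IN22")     # across groups: structural change needed
```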

Adaptation of the interrogative word within each group of sentences is relatively
easy, and typically can be done using simple operations. Table 2.14 shows the
operations required for the adaptation within the group G5. However, the adaptation
between two different groups may not be so simple because the remaining part of the
sentences also needs to be taken into consideration, and therefore more structural
transformation of the retrieved examples will be needed.


Input→   IN8         IN12        IN15        IN16
Ret’d ↓
IN8      (CP or WR)  (CP or WR)  WR+SA+WD    WR+SA+WD
IN12     (CP or WR)  (CP or WR)  WR+SA+WD    WR+SA+WD
IN15     WR+ WA      WR+ WA      (CP or SR)  (CP or SR)
IN16     WR+ WA      WR+ WA      (CP or SR)  (CP or SR)

Table 2.14: Adaptability Rules for Group G5 Sentence Patterns

2.7 Adaptation Rules for Variation in Kind of Sentences

Here we consider four kinds of sentences: Affirmative (AFF), Interrogative (INT),

Negative (NEG) and Negative-Interrogative (NINT). Typical sentence structures of


these four types are given in Figure 2.3.

Ram eats rice. ∼ ram chaawal khaataa hai

Ram does not eat rice. ∼ ram chaawal nahiin khaataa hai

Does Ram eat rice? ∼ kyaa ram chaawal khaataa hai?

Does Ram not eat rice? ∼ kyaa ram chaawal nahiin khaataa hai?

Figure 2.3: Some Typical Sentence Structures


One may notice that in Hindi the negative and interrogative structures are ob-
tained by addition of the words “nahiin” and “kyaa”, respectively. Also note that
the position of “kyaa” is always at the beginning of the sentence - hence its ad-
dition or deletion needs no traversing through the sentence. Typically, “nahiin”
occurs before the main verb of the Hindi sentence. However, since Hindi has a relatively
free word order, it may occur at some other position also. The adaptation operations are,
therefore, as follows:

• Word addition (WA) (for “nahiin”);

• Word deletion (WD) (for “nahiin”);

• Morpho-word addition (MA) (for “kyaa”); and

• Morpho-word deletion (MD) (for “kyaa”)

Table 2.15 gives the required operations for all types of variation in the kind of
sentences. The expressions are obtained by deciding upon which of the words are
being added and/or deleted for the adaptation.

Input → AFF NEG INT NINT


Ret’d ↓

AFF CP WA MA WA + MA

NEG WD CP MA + WD MA

INT MD WA + MD CP WA

NINT WD + MD MD WD CP

Table 2.15: Adaptation Rules for Variation in Kind of Sentences
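Since “kyaa” is always sentence-initial and “nahiin” typically precedes the main verb, the rules of Table 2.15 reduce to inserting or deleting these two words. A sketch under the simplifying assumption that the main verb's position is supplied rather than detected:

```python
def adapt_sentence_kind(hindi_tokens, retrieved_kind, input_kind, main_verb_idx):
    """Apply Table 2.15: add/delete sentence-initial 'kyaa' (MA/MD) and
    pre-verbal 'nahiin' (WA/WD) to change the kind of sentence."""
    negative = {"NEG", "NINT"}
    interrogative = {"INT", "NINT"}
    tokens = list(hindi_tokens)
    if retrieved_kind in interrogative and input_kind not in interrogative:
        tokens.remove("kyaa")                      # MD
    elif input_kind in interrogative and retrieved_kind not in interrogative:
        tokens.insert(0, "kyaa")                   # MA
    if retrieved_kind in negative and input_kind not in negative:
        tokens.remove("nahiin")                    # WD
    elif input_kind in negative and retrieved_kind not in negative:
        offset = 1 if input_kind in interrogative else 0   # 'kyaa' shifts the verb
        tokens.insert(main_verb_idx + offset, "nahiin")    # WA
    return " ".join(tokens)

aff = "ram chaawal khaataa hai".split()      # main verb 'khaataa' at index 2
assert adapt_sentence_kind(aff, "AFF", "NEG", 2) == "ram chaawal nahiin khaataa hai"
assert adapt_sentence_kind(aff, "AFF", "NINT", 2) == "kyaa ram chaawal nahiin khaataa hai"
```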


2.8 Concluding Remarks

In this chapter we have described different adaptation operations that may be used
for adapting a retrieved Hindi translation example to generate the translation of a
given input. The novelty of the scheme is that not only does it work at the word
level, it deals with suffixes as well. The advantage of the above scheme is that,
since the number of suffixes is very limited, it reduces the overall cost of adaptation.
Chapter 5 discusses how the cost for each of the operations is evaluated.

This chapter looks into the process of adaptation itself. The adaptation opera-
tions described in this chapter are to be used in succession in order to generate the
required translation. The overall adaptation scheme will first have to look into the
discrepancies in the input sentence and the retrieved example. Discrepancies may
occur in different functional slots of the sentences, and also in the kind of sentences.

Once the discrepancies are identified, appropriate adaptation operations have to


be applied to remove them. Thus successive applications of these operations will
generate the required translation in an incremental way.

In this chapter we have considered variations in the different tense and verb forms
both in active and passive voice, variations in the subject/object functional slot, and
variations in wh-family words (e.g. “what”, “who”, “where”, “when”) and their sentence
patterns. We have also worked on modal verbs (e.g. “should”, “might”, “can”, “could”,
“may”) and their respective sentence patterns. However, due to the similar nature of the
discussion we do not elaborate on them in this report.

Of the different sentence kinds, we have discussed four (viz. affirmative, negative,

interrogative and negative interrogative) in this chapter. Evidently one may find
many other kinds of sentences (e.g. Imperative, Exclamatory). We have not dealt


with them in this work; however, we feel that they can be treated in a similar fashion.

With respect to each of the variations we have identified the minimum number
of operations that are required for the overall adaptation of the retrieved example.
We presented these required operations in the form of various tables. The advan-
tage of these tables is that they can be used as yardsticks for measuring the total

adaptation cost, which in turn may be used as a measurement of similarity between


an input sentence and the sentences of the example base. These issues are discussed
in Chapter 5.

The above-mentioned scheme of adaptation works well under the implicit as-
sumption that translations of similar source language sentences are similar in the

target language as well. However, in reality one may find examples when the above
assumption does not hold good. For example, consider the two English sentences
“It is running.” and “It is raining.”. Although these two sentences are structurally
very similar, their Hindi translations are structurally very different. The first sen-
tence is translated as “wah (it) bhaag (run) rahaa (..ing) hai (is)”. But the second

one is translated as “baarish (rain) ho (be) rahii (..ing) hai (is)”. Hence in order
to translate the first sentence if the second one is retrieved from the example base,
then the translation generated through the above-mentioned adaptation procedure
will not be able to produce the correct translation of the said input. Such instances

are primarily due to some inherent characteristics of the source and target language,
which is termed as translation “divergence” (Dorr, 1993). The existence of trans-
lation divergences makes the straightforward transfer from source structures into
target structures difficult. Study of adaptation therefore needs a careful study of

divergence as well. The following chapter discusses divergences in English to Hindi


translation in detail.

Chapter 3

An FT and SPAC Based Divergence Identification Technique From Example Base

3.1 Introduction

Divergence is a common phenomenon in translation between two natural languages.

Typically, translation divergence “occurs when structurally similar sentences of the


source language do not translate into sentences that are similar in structure in the
target language” (Dorr, 1993). As a consequence, dealing with divergence assumes
special significance in the domain of EBMT.

For illustration, consider the following English sentences and their Hindi trans-

lations:
(A) : She is in a shock. ∼ wah sadme mein hai
(she) (shock) (in) (is)

(B) : She is in trouble. ∼ wah pareshaanii mein hai


(she) (trouble) (in) (is)
(C) : She is in panic. ∼ wah ghabraa rahii hai
(she) (panic) (..ing) (is)
Items (A) and (B) above are examples of the normal translation pattern. The prepositional phrases (PP) of the English sentences are realized as PPs in Hindi, and the prepositions occur after the corresponding nouns, in accordance with Hindi syntax. However, in example (C) one may notice a substantial structural variation. Here, the sense of the prepositional phrase “in panic” is realized by the verb “ghabraa rahii hai” (“is panicking”). Hence this is an instance of a translation divergence.

Assuming that the English sentence in (A) is given as the input to an English to
Hindi EBMT system, two scenarios may be considered:

1. The retrieved example is (B), i.e. “She is in trouble”. In this case, the correct Hindi translation may be generated in a straightforward way by using a word


replacement operation to replace “pareshaanii ” with “sadme”.

2. If example (C) is retrieved for adaptation, the generated translation may be


“wah (she) sadmaa (shock) rahii (. . . ing) hai (is)”, which is not a syntactically
correct Hindi sentence.

Thus the output of the system will depend entirely on which sentence ((B) or (C)) is retrieved to generate the translation of the input (A). Given the very similar structure of the three sentences, the retrieval may eventually depend on the
semantic similarity of the prepositional phrase (PP) of the input with the PPs of the
stored examples. With respect to the above illustration, this implies that similarity
between the sentences may be measured by the semantic similarity between “shock”

and “trouble” in case (1), and the semantic similarity between “shock” and “panic”
in case (2). Table 3.1 gives these similarity values under the different schemes offered by the WordNet::Similarity web interface (http://www.d.umn.edu/∼mich0212/cgi-bin/similarity/similarity.cgi), considering the words as nouns, and taking their sense number 1 as given by WordNet 2.0.

Similarity measure       “shock” and “trouble”   “shock” and “panic”
Lin                      0.2989                  0.5172
Leacock & Chodorow       1.3863                  1.6376
Resnik                   2.734                   5.2654
Jiang & Conrath          0.078                   0.1017
Wu & Palmer              0.3333                  0.5
Path lengths             0.1111                  0.1429
Adapted Lesk             1                       2

Table 3.1: Different Semantic Similarity Scores between
“shock” and “trouble”, and between “shock” and “panic”
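The Wu & Palmer score in Table 3.1, for instance, is defined over the WordNet noun hierarchy as 2·depth(LCS)/(depth(c1)+depth(c2)), where LCS is the lowest common subsumer of the two concepts. The following sketch computes this measure over a small hypothetical taxonomy; the is-a links below are illustrative only, not the actual WordNet structure, so the resulting scores differ from those in the table:

```python
# Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b)),
# where lcs is the lowest common subsumer of a and b in an is-a taxonomy.
# The toy parent links below are illustrative only, not real WordNet data.
PARENT = {
    "entity": None,
    "state": "entity",
    "feeling": "state",
    "fear": "feeling",
    "panic": "fear",
    "shock": "fear",
    "difficulty": "state",
    "trouble": "difficulty",
}

def depth(concept):
    """Depth of a node in the taxonomy; the root has depth 1."""
    d = 1
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        d += 1
    return d

def ancestors(concept):
    """Path from the concept up to the root, inclusive."""
    path = [concept]
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        path.append(concept)
    return path

def wu_palmer(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b))."""
    a_anc = ancestors(a)
    lcs = next(c for c in ancestors(b) if c in a_anc)  # lowest common subsumer
    return 2 * depth(lcs) / (depth(a) + depth(b))

# "panic" and "shock" share the close subsumer "fear", so they score higher
# than "trouble" and "shock", mirroring the ordering observed in Table 3.1.
print(wu_palmer("shock", "panic"))    # closer pair
print(wu_palmer("shock", "trouble"))  # more distant pair
```

The point of the sketch is only the ordering: any purely taxonomic measure will rate the “panic”–“shock” pair closer, which is exactly what misleads retrieval in the presence of divergence.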


The above values show that under all these measures “panic” is more similar to “shock”. From the translation point of view, however, example (B) proves to be more useful in producing the appropriate translation. This happens because of the presence of divergence in the translation of example (C).

Identification of divergence may therefore be considered paramount for an EBMT system. A divergence identification algorithm may be used to partition the example base into two classes: divergence and normal. This, in turn, helps in efficient retrieval of past examples, which enhances the performance of an EBMT system. The present work aims at designing algorithms for the identification of divergence examples in an example base of translations.

This chapter is organized as follows. Section 3.2 discusses some related past
work on divergence and its identification. Section 3.3 presents a detailed study of
divergence categories for English to Hindi translation along with their identification
algorithms.

3.2 Divergence and Its Identification: Some Relevant Past Work

Various approaches have been pursued in dealing with translation divergence. These

may be classified into four categories:

1. Transfer approach. Here transfer rules are used for transforming a source language (SL) sentence into a target language (TL) sentence by performing lexical and structural manipulations. These rules may be formed in several ways: by


manual encoding (Han et al., 2000), by analysis of parsed aligned bilingual corpora (Watanabe et al., 2000), etc.

2. Interlingua approach. Here, identification and resolution of divergence are based on two mappings, the Generalized Linking Routine (GLR) and the Canonical Syntactic Realization (CSR), and a set of Lexical Conceptual Structure (LCS) parameters. In general, translation divergence occurs when there is an
exception either to the GLR or to the CSR (or to both) in one language but
not in the other. This premise allows one to formally define a classification

of all possible lexical-semantic divergences that could arise during translation.


This approach has been pursued in the UNITRAN (Dorr, 1993) system that
deals with translation from English to Spanish and English to German.

3. Generation-Heavy Machine Translation (GHMT) approach. This scheme works in two steps. In the first step, rich target language resources, such as word lexical semantics, categorial variations and sub-categorization frames, are used for generating multiple structural variations from a target-glossed syntactic dependency representation of SL sentences. This is the “symbolic overgeneration” step. This step is constrained by a statistical TL model that accounts for

possible translation divergences. Finally, a statistical extractor is used for


extracting a preferred sentence from the word lattice of possibilities. Evidently, this scheme bypasses explicit identification of divergence, yet generates translations (which may include divergence sentences). MATADOR (Habash, 2003), a system for translation between Spanish and English, uses this approach.

4. Universal Networking Language (UNL) based approach. UNL has been developed to play the role of an Interlingua to access, transfer and process information
on the internet (Uchida and Zhu, 1998). In UNL, sentences are represented
using hypergraphs with concepts as nodes and relations as directed arcs. A
dictionary of concepts (each termed a Universal Word or UW) is maintained. A divergence is said to occur if the UNL expressions generated by the source and target language analyzers differ in structure. This approach has been proposed for English to Hindi machine translation in (Dave et al., 2002).

Each of the above schemes, however, has its own shortcomings when applied in

English to Hindi context. For example, the Generation-Heavy approach requires rich resources for the target language. Creation of such heavy resources requires a significant amount of effort, and they are not currently available for Hindi. The Interlingua approach requires deep semantic analysis of the sentences, but it has been observed elsewhere that an MT system can work even without such semantic details (Dorr et al., 1998). Similarly, creation of an exhaustive set of rules to capture all the lexical and structural variations that may be witnessed in English to Hindi translation is a formidable task. Even in the case of the UNL based approach, each UW of the dictionary
contains deep syntactic, semantic and morphological knowledge about the word.

Creation of such a dictionary even for a restricted domain is difficult, and needs
deep semantic analysis of each word.

With respect to Hindi, the major problem in applying the above techniques is that such linguistic resources are not freely available. As a consequence, application of these techniques in the English-Hindi context is, at least presently, severely constrained. Although Hindi

is one of the major languages of the present world, research in NLP on Hindi
(and other Indian languages too) is still in its infancy. Even though research in


NLP involving Indian languages has been enthusiastically pursued, and is spon-
sored by the government and several educational institutes over the last few years
(http://tdil.mit.gov.in/tdilsept2001.pdf), it will take some time before various linguistic resources are easily available. This motivates us to develop a simpler algorithm that requires as few linguistic resources as possible. The usefulness of such techniques will be twofold:

1. Study of EBMT for Hindi can be pursued successfully.

2. The methods can be used for other languages too, where linguistic resources are scarce.

The proposed approach uses only the functional tags (FT) and the syntactic phrase annotated chunk (SPAC) structures of the source language (SL) and target language (TL) sentences for identification of divergence in a translation example. A translation divergence occurs when some particular FT is realized upon translation with the help of some other FT in the target language. The occurrence of divergence may therefore be identified by comparing the roles of the different constituent words in the source and target language sentences. Thus the proposed approach aims at designing an algorithm that uses as few linguistic resources as possible.

The most fundamental work before developing any such algorithm is to determine
the different types of divergence that may be found in English to Hindi translation.
Since divergence is a language-dependent phenomenon, it is not expected that the same set of divergences will occur across all languages. In this respect one may refer to (Dorr, 1993), which provides the most detailed categorization of lexical-semantic divergences for translation among the European languages. There, divergences have been classified into seven broad types: structural, conflational, categorial, promotional, demotional, thematic and lexical. Section 3.3 discusses these divergence types in detail. In more recent work, (Dorr et al., 2002) and (Habash and Dorr, 2002) have redefined the divergence categories in the following way. Under the new scheme six

different types of divergence have been considered: light verb construction, manner
conflation, head swapping, thematic, categorial, and structural. The differences in
the two categorizations may be summarized as follows:

1. A light verb construction involves a single verb in one language being translated
using a combination of a semantically “light” verb and another meaning unit
(a noun, generally) to convey the appropriate meaning. In the English to Hindi context (and perhaps for many other Indian languages) such constructions are very common. Hence this is not considered a divergence for English to Hindi translation. This point will be discussed in detail later under conflational divergence.

2. Head swapping essentially combines both promotional and demotional diver-


gences under one heading.

3. Lexical divergence, which is a mixture of more than one divergence, has not
been considered.

4. All other divergence categories remain as they are under the new scheme.

Thus, the new categorization is essentially a regrouping of some of the above


types. The basic motivation behind the present work is to study the relevance of
the above-mentioned seven types of divergence in the context of English to Hindi
translation. For this work we have analyzed more than 4500 translation examples


obtained from different bilingual sources (such as storybooks, translation books and recipe books). This analysis suggests that English to Hindi translation divergence is in many cases somewhat different in its characteristics, and therefore needs to be redefined. In the following subsections we describe the various types of divergence

that may be found in the context of English to Hindi translation, and their sub-
types. We also discuss the algorithm to identify each type of divergence, and its
characteristics in more detail.

It may be noted that Dave et al. (2002) also studied English to Hindi divergence
in detail. However, they have restricted their discussions to the above-mentioned

seven categories only. Our studies of English to Hindi translation divergences reveal
the following:

1. Not all of the above-mentioned seven categories apply with respect to English to Hindi translation.

2. Instances of thematic and promotional divergence have not been found in English to Hindi translation.

3. Structural divergence, in the English to Hindi context, occurs in the same way
as in European languages.

4. Some variations from the definitions given in (Dorr, 1993) may be noticed in
the occurrence of categorial, conflational, demotional divergences.

5. Three new types of divergence may be found with respect to English to Hindi
translation. These are named as nominal, pronominal and possessional.

6. Most of the divergence types may be further subdivided into several sub-types.


In Section 3.3 we discuss all the relevant divergence types and their sub-types
that we have observed in English to Hindi translation, and provide algorithms for
their identification. As mentioned earlier, the identification technique uses functional
tags (FT) and syntactic phrase annotated chunk (SPAC) of both the source language

sentence and its translation. For each divergence type we identify the FTs that are
instrumental in causing the divergence. Each divergence type is defined on the basis
of which FTs of the English sentence it is concerned with, and also to which FTs it
is mapped upon translation.

The proposed algorithm requires the following FTs and SPAC categories for both languages:

• Subject (S), object (O), verb (V), subjective complement (SC), adjectival complement by preposition (SC C), subjective predicative adjunct (PA), verb complement (VC) and adjunct (A).

• Categories in the SPAC structure are as follows:

POS tags: noun (N), adjective (Adj), verb (V), auxiliary verb (AuxV), preposition (P), adverb (Adv), determiner (DT), personal pronoun (PRP), possessive case of personal pronoun (PRP$) and cardinal number (CD).

Phrases: N, Adj, V, Adv and P are called the “lexical heads” of the phrases. For each category a suffix “P” is used to denote a phrase (e.g. NP, AdjP).

In Appendix B and Appendix C, definitions of these FTs and SPACs are discussed in detail. With this background we proceed to define the divergence types and sub-types and their identification schemes.


3.3 Divergences and Their Identification in English to Hindi Translation

We order the different divergence types on the basis of the FTs of the source language
sentence with which they are concerned. Accordingly, we observe the following:

• Structural divergence is concerned with the object of the English sentence.

• Categorial divergence is characterized by how the subjective complement (SC)


and predicative adjunct (PA) of the English sentence are realized upon trans-

lation.

• Nominal divergence is concerned with the SC of the English sentence.

• Pronominal divergence is related to both SC and verb of the English sentence.

• Demotional divergence, conflational divergence, and possessional divergence

may be identified by studying how the main verb of the English sentence is
realized upon translation.

In the following subsections we provide the different divergence types and their
identification schemes. In the description of all the algorithms the following conven-
tion for representation will be followed:

a) The input to the algorithms will be an English sentence and its Hindi translation. These two will be denoted as E and H, respectively.

b) Each identification algorithm will return 0 if the particular divergence is absent in the translation. Otherwise it will return a value n, indicating that the


corresponding divergence is present in the translation, and its sub-type is n.


It may be noted that the number of possible sub-types may be different for
different divergence types.
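Conventions (a) and (b) can be made concrete with a minimal data representation. In the sketch below, a sentence is a record mapping each FT to the SPAC category of its filler; all field names and values are our own illustrative assumptions, to be filled from the parses described in Appendices B and C:

```python
# A translation pair represented as FT -> SPAC mappings. The field names
# are hypothetical; in practice the values would come from the parser
# output described in Appendices B and C.
FTS = ("S", "O", "V", "SC", "SC_C", "PA", "VC", "A")

def make_sentence(main_verb_root, **spac_of_ft):
    """Build a sentence record; FTs not supplied are absent (None)."""
    record = {ft: None for ft in FTS}
    record["main_verb_root"] = main_verb_root
    record.update(spac_of_ft)
    return record

# E: "Andre will marry Steffi."   H: "andre steffi se vivaah karegaa"
E = make_sentence("marry", S="NP", O="NP")
H = make_sentence("vivaah kar", S="NP", O="PP")

# Each identification algorithm inspects such records and returns 0
# (divergence absent) or a sub-type number n (divergence present).
print(E["O"], "->", H["O"])  # NP -> PP
```

Each identification algorithm in the following subsections reads only a handful of these fields, which is what keeps the linguistic resource requirement low.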

3.3.1 Structural Divergence

A structural divergence is said to have occurred if the object of the English sentence
is realized as a noun phrase (NP) but upon translation in Hindi is realized as a
prepositional phrase (PP). The following examples illustrate this. One may note

that different Hindi prepositions (e.g. se, par, ko, kaa) have been used in different
contexts leading to structural divergence.

• Ram will attend this meeting.


∼ ram iss sabhaa mein jaayegaa
(Ram) (this) (meeting) (in) (will go)

• Ram married Sita.


∼ ram ne sita se vivah kiyaa
(Ram) (Sita) (with) (married)

• Ram will answer this question.


∼ ram iss prashn kaa uttar degaa

(Ram) (this) (question) (will answer)


• Ram will beat Mohan.


∼ ram mohan ko maregaa
(Ram) (Mohan) (will beat)

• Ram has signed the paper.


∼ ram ne kagaj par hastaakshar kar diyaa hai
(Ram) (paper) (on) (signature) (has done)

Analysis of various translation examples reveals the following points with respect
to structural divergence, which we use to design the algorithm for identification of

structural divergence:

• If the main verb of an English sentence is a declension of the “be” verb, then structural divergence cannot occur.

• Structural divergence deals with the objects of both the English sentence and its Hindi translation. Therefore, if either of the two sentences has no object, then structural divergence cannot occur.

• If both the sentences have objects, and their SPAC structures are the same, then structural divergence does not occur either.

• In other situations structural divergence may occur only if the SPAC of the
object of the English sentence is an NP, and the SPAC of the object of the
Hindi sentence is a PP.

The algorithm for identification of structural divergence has been designed to take care of the above conditions. Figure 3.1 gives the corresponding algorithm. For structural divergence, as discussed above, there is only one possible sub-type. Thus, depending upon the case, the algorithm given in Figure 3.1 returns either 0 or 1.

Step1. IF(root word of the main verb of E is "be")THEN RETURN(0)
Step2. IF((the object of E is null) OR (the object of H is null))
       THEN RETURN(0)
Step3. IF(the SPAC of the object of E EQUALS
       the SPAC of the object of H)THEN RETURN(0)
Step4. IF((the SPAC of the object of E is NP)AND
       (the SPAC of the object of H is PP))THEN RETURN(1)

Figure 3.1: Algorithm for Identification of Structural Divergence

Figure 3.2: Correspondence of SPACs of E and H for Identification of Structural Divergence

Illustration

Consider for illustration the following sentence pair:

E: Andre will marry Steffi.

H: andre (andre) steffi (steffi) se (with) vivaah karegaa (will marry)

The SPACs of these two sentences and their correspondences are given in Fig-
ure 3.2. Here bold arrows represent correspondence, and dotted lines indicate no

correspondence. Note that, the objects of E and H are not null; in E the object


is “Steffi”, whereas in H the object is “steffi se”. But their SPACs are [NP [Steffi / N]] and [PP [NP [steffi / N]] [se / P]], respectively, which are not equal. Therefore,
structural divergence is identified.
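Assuming each parsed sentence is available as a record holding the root of its main verb and the SPAC category of its object (the field names are our own illustrative choices, not notation fixed by the thesis), the four steps of Figure 3.1 may be sketched as:

```python
def identify_structural(e, h):
    """Return 1 if structural divergence is present, else 0 (Figure 3.1).

    e and h are dicts holding the main-verb root and the object's SPAC
    category (None when the sentence has no object); the field names are
    illustrative assumptions, not fixed by the thesis.
    """
    if e["main_verb_root"] == "be":                            # Step 1
        return 0
    if e["object_spac"] is None or h["object_spac"] is None:   # Step 2
        return 0
    if e["object_spac"] == h["object_spac"]:                   # Step 3
        return 0
    if e["object_spac"] == "NP" and h["object_spac"] == "PP":  # Step 4
        return 1
    return 0  # no other configuration counts as structural divergence

# "Andre will marry Steffi." ~ "andre steffi se vivaah karegaa"
e = {"main_verb_root": "marry", "object_spac": "NP"}
h = {"main_verb_root": "vivaah kar", "object_spac": "PP"}
print(identify_structural(e, h))  # 1: structural divergence identified
```

Note that the pseudocode of Figure 3.1 leaves the fall-through case implicit; the sketch makes it explicit with a final return of 0.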

3.3.2 Categorial Divergence

Categorial divergence is concerned with the subjective complement (SC) or predicative


adjunct (PA), if any, of the English sentence. In the event of categorial divergence,
the SC or PA, upon translation, is realized as the main verb of the Hindi sentence.

This happens irrespective of whether the SC is an NP or AdjP or whether the PA


is a PP or adverb, in the underlying English sentence. Thus categorial divergence
in English to Hindi translation context is different from its definition given by Dorr
(1993) in the context of European languages. There, categorial divergence is con-

cerned with adjectival SCs which upon translation map into noun, verb or PP. This
subtle difference allows us to redefine categorial divergence in English to Hindi con-
text. In particular, depending upon the nature of the SC or PA, four sub-types
have been identified.

The definitions and characteristics of the four sub-types are given below.

1. Categorial sub-type 1: When the SC of the English sentence is used as an ad-


jective, but upon translation is realized as the main verb of the Hindi sentence,
then this divergence occurs. For illustration, consider

Ram is afraid of lion. ∼ ram sher se dartaa hai


(Ram) (lion) (of) (fears)

The adjective of the English sentence “afraid” is realized in Hindi by the verb
“darnaa” meaning “to fear”, and “dartaa hai ” is its conjugation for present


indefinite tense, when the subject is third person, singular and masculine.

2. Categorial sub-type 2 : Here the SC is an NP in the English sentence. Upon


translation the noun part gives the verb of the corresponding Hindi sentence.
The adjective part is realized as an adverb upon translation.

Consider for illustration the following:

Ram is a regular user of the library. ∼


ram pustakaalay ko baraabar istemaal kartaa hai
(Ram) (library) (of) (regularly) (uses)

Here the focus is on the word “user” which is a noun, and has been used as
an SC in the above English sentence. This provides the main verb “istemaal

karnaa” (meaning “to use”) of the Hindi sentence. Its conjugation for the present indefinite tense is “istemaal kartaa hai”, when the subject is third person, singular and masculine. The adjective “regular” of the noun “user” is realized as the adverb “baraabar”.

3. Categorial sub-type 3 : In the event of this divergence an adverbial PA of an


English sentence is realized as the main verb of the Hindi sentence. Consider
for illustration the following translation:

The fan is on. ∼

paankhaa (fan) chal (move) rahaa (..ing) hai (is)

The main verb of the Hindi sentence is “chalnaa”, i.e. “to move”. Its sense

comes from the adverbial PA “on” of the English sentence. The present con-
tinuous form of this verb is “chal rahaa hai ”, when the subject is third person,
singular and masculine. It may be noted that in Hindi grammar neuter gender


does not exist. Inanimate objects are treated as masculine or feminine, and
this categorization follows some systematic rules, though occasionally with some exceptions (see Appendix A).

4. Categorial sub-type 4 : This sub-type is concerned with predicative adjuncts that are realized in English as a PP, but in Hindi as the main verb. For example, one may consider the following pair:

The train is in motion. ∼ railgaadii chal rahii hai

(train) (move) (..ing) (is)


Here, the PA “in motion” is a prepositional phrase whose sense is realized by

the verb “chalnaa”. One may notice that here in the Hindi translation the
auxiliary verb is “rahii ” in order to convey that the subject of the sentence is
feminine and singular.

Our analysis of a large number of translation examples reveals the following:

• Categorial divergence occurs if the main verb of the English sentence is a declension of “be”, but the main verb of the Hindi translation is not the “be” verb, i.e. “ho”.

• We further notice that for categorial divergence to occur, the Hindi translation
should not have any subjective complement or predicative adjunct.

• If the SPAC structure of the subjective complement (SC) is an AdjP or an NP, then it is a case of categorial divergence of sub-type 1 or 2, respectively.

• Otherwise, if the SPAC structure of the predicative adjunct (PA) is an AdvP or a PP, then it is a case of categorial divergence of sub-type 3 or 4, respectively.


Step1. IF(root word of the main verb of E is not "be")THEN RETURN(0)
Step2. IF(root word of the main verb of H is "ho")THEN RETURN(0)
Step3. IF((the SC of H is not null) OR
       (the PA of H is not null))THEN RETURN(0)
Step4. IF(the SPAC of the SC of E is AdjP)THEN RETURN(1)
Step5. IF(the SPAC of the SC of E is NP)THEN RETURN(2)
Step6. IF(the SPAC of the PA of E is AdvP)THEN RETURN(3)
Step7. IF(the SPAC of the PA of E is PP)THEN RETURN(4)

Figure 3.3: Algorithm for Identification of Categorial Divergence

The identification algorithm has been designed taking care of the above observations.
The algorithm returns 0 if the translation does not involve any categorial divergence.
Otherwise, depending upon the case it returns 1, 2, 3, or 4. Figure 3.3 provides the

schematic view of the proposed algorithm.

Illustration

Let E be the sentence “She is in tears.”, and let its Hindi translation H be “wah (she) ro rahii hai (is crying)”. Once the sentences are parsed and their SPACs are obtained, the algorithm proceeds as follows.

In Step 1, the algorithm finds that the root of the main verb of the English sentence is “be”, hence it proceeds to Step 2. In Step 2 the root of the main verb of the Hindi sentence is determined. In this case it is “ronaa” (i.e. “to cry”), which is not the “be” verb. The algorithm, therefore, proceeds to Step 3, where it detects that the Hindi sentence has neither an SC nor a PA. Thus this is a case of categorial divergence.

The algorithm now checks the SPAC of the PA “in tears”, which is a prepositional phrase comprising a preposition and a noun. The algorithm, therefore, detects categorial divergence of sub-type 4. Figure 3.4 shows the correspondence of the SPACs of the English sentence and its translation.

Figure 3.4: Correspondence of SPACs for the Categorial Divergence Example of Sub-type 4

Other sub-types may be detected in a similar way.
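The walk-through above, and the steps of Figure 3.3 in general, may be sketched in the same spirit; the record fields below are illustrative assumptions, not notation from the thesis:

```python
def identify_categorial(e, h):
    """Return the categorial sub-type 1-4, or 0 if absent (Figure 3.3).

    e and h are dicts carrying the main-verb roots and the SPAC categories
    of the SC and PA (None when absent); field names are our own
    illustrative choices.
    """
    if e["main_verb_root"] != "be":                           # Step 1
        return 0
    if h["main_verb_root"] == "ho":                           # Step 2
        return 0
    if h["sc_spac"] is not None or h["pa_spac"] is not None:  # Step 3
        return 0
    if e["sc_spac"] == "AdjP":                                # Step 4
        return 1
    if e["sc_spac"] == "NP":                                  # Step 5
        return 2
    if e["pa_spac"] == "AdvP":                                # Step 6
        return 3
    if e["pa_spac"] == "PP":                                  # Step 7
        return 4
    return 0  # no qualifying SC or PA in E

# "She is in tears." ~ "wah ro rahii hai": the PA "in tears" is a PP
e = {"main_verb_root": "be", "sc_spac": None, "pa_spac": "PP"}
h = {"main_verb_root": "ro", "sc_spac": None, "pa_spac": None}
print(identify_categorial(e, h))  # 4: categorial divergence, sub-type 4
```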

3.3.3 Nominal Divergence

Nominal divergence is concerned with the subject of the English sentence. In the

event of nominal divergence, upon translation the subject of the English sentence
becomes the object or verb complement. In this respect this divergence is somewhat
similar to the thematic divergence as defined in (Dorr, 1993). However, in case of
thematic divergence the object of the source language sentence becomes the subject

upon translation, whereas, in case of nominal divergence the subject of the Hindi
translation is derived from the adjectival complement of the English sentence. Thus,
characteristically nominal divergence differs from thematic divergence.

The subject of the English sentence is realized in Hindi with the help of a prepositional phrase. In particular, with respect to nominal divergence the use of two prepositions, “ko” and “se”, can be observed; these are typically used for an object or ablative case, respectively (Kachru, 1980). Hence the latter is called a “verb complement”.

In the light of the above discussion, we define two sub-types of nominal divergence:

1. Nominal sub-type 1: Here the subject of the English sentence becomes object
upon translation. For illustration the following example may be considered:

Ram is feeling hungry. ∼ ram ko bhukh lag rahii hai


(to Ram) (hunger) (feel) (..ing) (is)

Here, the adjective “hungry” is an SC. Its sense is realized in Hindi by the word “bhukh”, meaning “hunger”, which acts as the subject of the Hindi sentence. The subject “Ram” of the English sentence becomes the object “ram ko” of

the Hindi translation. Such an object is sometimes also termed a dative subject (Kachru, 1980). However, because of the use of the postposition ko we feel that calling it the object of the sentence is more appropriate.

2. Nominal sub-type 2: In this case the subject of the English sentence provides

a verb complement (VC) in the Hindi translation. The following example


illustrates this point.

This gutter smells foul. ∼


iss naale se badboo aatii hai
(this) (gutter) (from) (bad smell) (comes)

Note that the subject of the English sentence “This gutter” is realized as the modifier “iss naale se” of the verb “aatii hai”.


Step1. IF(root word of the main verb of E is "be")THEN RETURN(0)
Step2. IF(the SC of E is null)THEN RETURN(0)
Step3. IF(the SPAC of the SC of E is not AdjP)THEN RETURN(0)
Step4. IF(the SC of H is not null)THEN RETURN(0)
Step5. IF(the object of H is not null)THEN RETURN(1)
Step6. IF(the VC of H is not null)THEN RETURN(2)

Figure 3.5: Algorithm for Identification of Nominal Divergence

The analysis of different examples of nominal divergence establishes the following

points:

1. Nominal divergence cannot occur if the main verb of the English sentence is a
declension of the “be” verb. This is because in that case, the English sentence
does not have an SC, which is essential for a nominal divergence to occur.

2. Otherwise, even if the root word of the main verb of the English sentence is

not “be”, nominal divergence cannot occur if the English sentence does not
have an SC.

3. Otherwise, if the SC of H is null and the object of H is not null, then it is an instance of nominal divergence of sub-type 1. If, in place of the object, a verb complement (VC) is present in H, then it is nominal divergence of sub-type 2.

The algorithm has been designed by taking care of the above observations. Figure

3.5 provides a schematic view of the proposed algorithm.

Illustration

Let E be the sentence “I am feeling sleepy”, and let H be its translation “mujhe (to me) niind (sleep) aa rahii hai (is coming)”. The root form of the main verb of E is not “be”; therefore, the condition of Step 1 is not satisfied, and the algorithm proceeds to the subsequent steps. Here, the SC of E is “sleepy”, which is an adjective (Adj).

Figure 3.6: Correspondence of SPAC E and SPAC H of Nominal Divergence of


Sub-type 1

Hence Steps 2 and 3 do not apply. In Step 4 the SC of H is checked and found to be null, so that condition is not satisfied either. In Step 5 the object of H is identified, which implies that the given example pair has nominal divergence of sub-type 1. Figure 3.6 gives the correspondence of the SPACs of the example

discussed above.
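Figure 3.5 admits a similar sketch; as before, the record fields are our own illustrative assumptions rather than notation from the thesis:

```python
def identify_nominal(e, h):
    """Return nominal sub-type 1 or 2, or 0 if absent (Figure 3.5).

    e and h are dicts holding the main-verb root and the SPAC categories
    of the SC, object and VC (None when absent); field names are
    illustrative only.
    """
    if e["main_verb_root"] == "be":    # Step 1
        return 0
    if e["sc_spac"] is None:           # Step 2
        return 0
    if e["sc_spac"] != "AdjP":         # Step 3
        return 0
    if h["sc_spac"] is not None:       # Step 4
        return 0
    if h["object_spac"] is not None:   # Step 5: E's subject became an object
        return 1
    if h["vc_spac"] is not None:       # Step 6: E's subject became a VC
        return 2
    return 0  # neither object nor VC present in H

# "I am feeling sleepy." ~ "mujhe niind aa rahii hai"
# (only the presence of the object matters here, not its exact SPAC)
e = {"main_verb_root": "feel", "sc_spac": "AdjP"}
h = {"main_verb_root": "aa", "sc_spac": None, "object_spac": "NP", "vc_spac": None}
print(identify_nominal(e, h))  # 1: nominal divergence, sub-type 1
```

For the gutter example of sub-type 2, the same record for H would instead carry a VC and no object, and the function would return 2.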

3.3.4 Pronominal Divergence

Pronominal divergence pertains to English sentences in which the pronoun “it” is

used as the subject. The Hindi equivalent of “it” is “wah” or “yah”. Thus, typically
the Hindi translation of such a sentence should have one of these two words as the
subject of the sentence. For example, the following translations may be considered:

• It is crying. ∼ wah (it) ro (cry) rahaa (. . . ing) hai (is)

• It is small. ∼ yah (it) chhotaa (small) hai (is)


However, sentences of similar structure with an impersonal pronoun as subject are sometimes translated into Hindi in different ways. One may observe different variations here depending upon which part of speech/FT of the English sentence becomes the subject upon translation. This observation helps in defining four different sub-types

of pronominal divergence that are illustrated below.

1. Pronominal sub-type 1 : Here, a subjective complement, which may be a noun with or without a qualifying adjective, becomes the subject of the Hindi translation.

For illustration, consider the following sentences:

(a) It is morning. ∼ subaha ho gayii hai

(morning) (become) (has)


(b) It was a dark night. ∼ ek andherii raat thii
(one) (dark) (night) (was)

In example (a) the word “morning”, a noun, acts as an SC. Upon translation
it provides the subject “subaha” of the Hindi sentence. In example (b) the SC
is still a noun but it is preceded by an adjective. Upon translation the whole
noun phrase “andherii raat” becomes the subject of the corresponding Hindi

sentence.

2. Pronominal sub-type 2 : In this case, the adjectival complement of the subject


“it” becomes the subject of the Hindi translation. For illustration:

It is very humid today. ∼ aaj bahut umas hai


(today) (very) (humidity) (is)


In this example, the adjectival complement “humid”, and its adverb “very” of
the English sentence are together realized with the help of the noun phrase
“bahut umas”, which acts as the subject of the Hindi sentence. As a consequence,
pronominal divergence occurs.

3. Pronominal sub-type 3 : Under this sub-type of pronominal divergence the


subject of the Hindi translation comes from the infinitive form of a verb. For
illustration, one may consider the English sentence “It is difficult to run in
the Sun”. The Hindi translation of the sentence is: “dhoop (sun shine) mein

(in) daudhnaa (to run) kathin (difficult) hai (is)”. The subject of the Hindi
translation has become “daudhnaa”, which means “to run”. One may note
that the adjunct “in the Sun” of the infinitive form “to run” translates to
“dhoop mein” that becomes a post modifier for the subject “daudhnaa”.

4. Pronominal sub-type 4 : Here the subject of the Hindi translation is realized

from the main verb of the source language sentence. Consider, for example,
the following translation:

It is raining ∼ barsaat (rain) ho (be) rahii (. . . ing) hai (is)

The main verb “to rain” of the English sentence provides the subject “barsaat”

of the Hindi translation. One may notice the difference between this trans-
lation, and the translation of the sentence “It is crying” given earlier in this
section to appreciate the divergence.

Thus we find four different sub-types of the pronominal divergence each having
its own characteristics. If the subject of the English sentence is not “it”, then the

possibility of pronominal divergence can be ruled out. Further, even if the English
sentence has “it” at the subject position, if the subject of the Hindi sentence is


Step1. IF(subject of E is not "It")THEN RETURN(0)


Step2. IF(subject of H is "wah" or "yah")THEN RETURN(0)
Step3. IF(root form of the main verb of E is "be")THEN
IF( SC of E is null)THEN RETURN(0)
ELSE
IF(the SC of H is not null)THEN
IF(SPAC of the SC of E is NP)THEN RETURN(1)
IF(SPAC of the SC of E is AdjP)THEN RETURN(2)
ELSEIF(E contains infinitive form of verb)
THEN RETURN(3)
ELSE RETURN(0)
Step4. IF(root form of the main verb of H is "ho ")
THEN RETURN(4)
ELSE RETURN(0)

Figure 3.7: Algorithm for Identification of Pronominal Divergence

Figure 3.8: Correspondence of SPAC E and SPAC H of Pronominal Divergence of


Sub-type 4


one of “wah” or “yah”, then too pronominal divergence cannot occur. Otherwise,
depending upon the SC or main verb of the English sentence the sub-type of the
pronominal divergence is identified. Figure 3.7 gives the corresponding algorithm.

Illustration

Consider the English sentence (E) “It is raining”, and its Hindi translation (H)
“barsaat ho rahii hai ”. The syntactic phrase annotated chunk (SPAC) structures of the
example pair, and their correspondences, are given in Figure 3.8. Here the subject

of E is “it”, and the subject of H is “barsaat”, which is neither “yah” nor “wah”. Hence
the condition of step 2 is not satisfied. In step 3, the algorithm finds that the root
form of the main verb of the English sentence is “rain”, which is not “be”. Therefore,
the condition of step 3 is also not satisfied. Hence step 4 detects pronominal divergence of sub-type 4.
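The decision procedure of Figure 3.7 can be sketched in Python. The dictionary-based sentence representation used below (fields such as `subject`, `main_verb_root`, `sc`) is a hypothetical simplification of the FT and SPAC information, introduced only for illustration; it is not part of the actual system.

```python
def pronominal_divergence(e, h):
    """Sketch of Figure 3.7: return the pronominal sub-type (1-4),
    or 0 if no pronominal divergence is present."""
    if e["subject"].lower() != "it":              # Step 1
        return 0
    if h["subject"] in ("wah", "yah"):            # Step 2
        return 0
    if e["main_verb_root"] == "be":               # Step 3
        if e["sc"] is None:
            return 0
        if h["sc"] is not None:
            if e["sc_spac"] == "NP":
                return 1
            if e["sc_spac"] == "AdjP":
                return 2
        elif e["has_infinitive"]:
            return 3
        return 0
    if h["main_verb_root"] == "ho":               # Step 4
        return 4
    return 0

# "It is raining" ~ "barsaat ho rahii hai": sub-type 4
e = {"subject": "it", "main_verb_root": "rain", "sc": None,
     "sc_spac": None, "has_infinitive": False}
h = {"subject": "barsaat", "main_verb_root": "ho", "sc": None}
print(pronominal_divergence(e, h))  # 4
```

Run on the “It is raining” pair of the illustration above, the sketch returns 4, matching the detected sub-type.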

3.3.5 Demotional Divergence

The characteristic feature of demotional divergence is that here the role of the main
verb of the source language sentence is demoted upon translation. In case of Eu-

ropean languages this implies that the main verb of the target language is realized
from the object of the source language, and the main verb of the source language
upon translation becomes the adverbial modifier. However, with respect to English
to Hindi translations a subtle variation may be noticed. We observed several ex-

amples where the main verb of the English sentence upon translation is demoted
to the subjective complement or predicative adjunct of the Hindi sentence, but not
to adverbial modifier, which we call an adjunct. Hence in the event of demotional
divergence, the main verb of the Hindi translation is realized as a “be” verb. Thus


for English to Hindi translation, demotional divergence needs to be redefined ac-


cordingly. Depending upon how the roles of different constituent words change, four
different sub-types of demotional divergence may be obtained. The four sub-types
are defined as follows:

1. Demotional sub-type 1: This divergence occurs when the main verb and the

object of the English sentence are realized as predicative adjunct in the Hindi
sentence. However, the subject of the English sentence remains the subject
after translation to Hindi. For illustration, we consider the following example:

This dish feeds four people. ∼


yah pakvaan chaar logon ke liye hai
(this) (dish) (four) (people) (for) (is)

In this example the main verb “feeds” and the object “four people” of the
English sentence together give the predicative adjunct, which is the PP, “chaar

logon ke liye” (in English “for four people”) of the Hindi sentence. The subject
“this dish” remains subject after translation.

2. Demotional sub-type 2: Unlike the above sub-type, here the main verb and its
complement (instead of the object) of the English sentence are realized as the

predicative adjunct of the Hindi sentence. The following example illustrates


this point:

This house belongs to a doctor. ∼

yah ghar ek daaktaar kaa hai


(this) (house) (one) (doctor) (of) (is)


In this example, “belong to” and “a doctor” are the main verb and its comple-
ment of the English sentence, respectively. They jointly provide the predicative
adjunct (daaktaar kaa) of the Hindi sentence.

3. Demotional sub-type 3 : Under this sub-type the main verb and the object

of the English sentence are realized as the adjectival SC and the adjectival
complementation by preposition (SC C), respectively, in the Hindi translation.
Here also, the subject of the English sentence remains the subject of the Hindi
sentence. The following example explains this sub-type:

These two sofas face each other. ∼


yeh do sofa ek dusre ke saamne hain
(these) (two) (sofa) (one) (of other) (opposite) (are)

In this example, the main verb of the English sentence “face” is realized as

the SC “saamne” in the Hindi sentence. Also, the object “each other” of the
English sentence becomes an SC C, i.e. “ek dusre ke”. Thus, this translation
belongs to demotional divergence of sub-type 3. The literal meaning of this
translation is “These two sofas are opposite to each other”.

4. Demotional sub-type 4 : Here also, the main verb of source is realized as SC


(adjective) of the target language. But the object and subject of the English
sentence become the subject of the translation and the post modifier of the
SC of the target language sentence, respectively. We illustrate this with the
following example:

This soup lacks salt. ∼ iss soop mein namak kam hai
(this) (soup) (in) (salt) (less) (is)


Step1. IF(root word of the main verb of E is "be/have")


THEN RETURN(0)
Step2. IF(root word of the main verb of H is not "ho")
THEN RETURN(0)
Step3. IF( (the subject of E )EQUAL (the subject of H ))THEN
IF(the PA of H is not null)THEN
IF(the object of E is not null)THEN RETURN(1)
ELSEIF(the VC of E is not null)THEN RETURN(2)
ELSE RETURN(0)
ELSE
IF(the SC C of H is not null) THEN
IF(the object of E is not null)THEN RETURN(3)
ELSE RETURN(0)
Step4. IF(the SC of H is null)THEN RETURN(0)
Step5. IF (the SC C of H is null)THEN RETURN (0)
Step6. IF(the object of E is not null)THEN RETURN(4)
ELSE RETURN(0)

Figure 3.9: Algorithm for Identification of Demotional Divergence

In the above example, the main verb “lack” of the English sentence is realized as
“kam”, the SC of the Hindi sentence. The object “salt” (“namak ”) becomes
the subject of the target language, and the sense of “the soup” is realized

as “soop mein”, post modifier of the SC. In particular, this is an adjective


complementation, and is expressed through the said PP. The literal meaning
of the translation is “Salt is less in this soup.”.

Analysis of the translation examples that involve demotional divergence high-


lights the following points:

1. In all the instances of demotional divergence we find that the main verb of
the English sentence is different from “be” or “have”. Thus if the main verb

of an input sentence is either “be” or “have”, the possibility of demotional


divergence in its Hindi translation may be ruled out.


2. On the other hand, if the main verb of the Hindi translation is not the “ho”
verb (i.e. in English “be”), then demotional divergence cannot occur.

3. If the Hindi equivalent of the subject of the English sentence E is the same as the
subject of the Hindi sentence H, then occurrence of demotional divergence is

decided as follows. Since the English main verb is realized upon translation as
SC or PA, if the Hindi translation has no SC or PA, then here also demotional
divergence cannot occur.

4. Otherwise, depending upon whether the PA or the SC is present in H, the

method returns sub-type 1, 2, 3 or 4, accordingly, indicating occurrence of the


corresponding sub-type of demotional divergence.

Figure 3.9 provides a schematic view of the proposed algorithm.

Illustration 1.

Consider the English sentence (E) “The soup lacks salt”, and its Hindi translation
(H) “soop mein namak kam hai ”. The SPACs of these sentences and their term
correspondences are given in Figure 3.10.

Figure 3.10: Correspondence of SPAC E and SPAC H for Demotional Sub-type 4


Here the root form of the main verb of H is “ho” (i.e. “be”), and for E it is
“lack”. Hence, the conditions of steps 1 and 2 are not satisfied, and the computation
proceeds to step 3. The condition of step 3 fails, as the subjects of E and H
are not the same. Steps 4 and 5 check that both the SC and the SC C are present in the Hindi

sentence. Hence, step 6 is considered. Since E has an object (“salt”), the algorithm returns
4, indicating that the above sentence pair has a demotional divergence of sub-type 4.

Illustration 2.

Consider another example, where E is “This dish feeds four people.”, and H is “yah
pakvaan chaar logon ke liye hai ”. The SPACs of these two sentences and their
correspondences are given in Figure 3.11.

Figure 3.11: SPAC Correspondence for Demotional Divergence of Sub-type 1

The root forms of the main verbs of E and H are “to feed” and “ho”, respectively.
Therefore, neither the condition of step 1 nor that of step 2 is satisfied. Further, the
algorithm checks the other steps to determine the sub-type of demotional divergence.
The subjects of

E and H are the same. The PA is present in H, and the object of E is not null.
This implies that the conditions of step 3 are satisfied. The algorithm, therefore,


returns 1, i.e. the demotional divergence of sub-type 1 is present.
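The steps of Figure 3.9 can likewise be sketched in Python. The field names used below (`subject_hindi` for the dictionary translation of the English subject, `pa` for the predicative adjunct, `sc_c` for the adjectival complementation, `vc` for the verb complement) are hypothetical simplifications introduced for illustration.

```python
def demotional_divergence(e, h):
    """Sketch of Figure 3.9: return the demotional sub-type (1-4),
    or 0 if no demotional divergence is present."""
    if e["main_verb_root"] in ("be", "have"):        # Step 1
        return 0
    if h["main_verb_root"] != "ho":                  # Step 2
        return 0
    if e["subject_hindi"] == h["subject"]:           # Step 3
        if h["pa"] is not None:
            if e["object"] is not None:
                return 1
            if e["vc"] is not None:
                return 2
            return 0
        if h["sc_c"] is not None and e["object"] is not None:
            return 3
        return 0
    if h["sc"] is None:                              # Step 4
        return 0
    if h["sc_c"] is None:                            # Step 5
        return 0
    return 4 if e["object"] is not None else 0       # Step 6

# "This dish feeds four people." ~ "yah pakvaan chaar logon ke liye hai"
e = {"main_verb_root": "feed", "subject_hindi": "yah pakvaan",
     "object": "four people", "vc": None}
h = {"main_verb_root": "ho", "subject": "yah pakvaan",
     "pa": "chaar logon ke liye", "sc": None, "sc_c": None}
print(demotional_divergence(e, h))  # 1
```

On the “This dish feeds four people” pair the sketch returns 1, and on the “This soup lacks salt” pair it returns 4, in line with Illustrations 2 and 1 above.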

3.3.6 Conflational Divergence

Conflational divergence pertains to the main verb of the source language sentence.
Typically, as characterized in (Dorr, 1993), conflational divergence occurs when some
new words are required to be incorporated in the target language sentence in or-
der to convey the proper sense of the verb of the input. However, with respect to
English to Hindi translation we need to deviate from this definition because of the

following reason. Many English verbs do not have a single-word equivalent in Hindi.
In fact, a large number of English verbs are expressed in Hindi with the help of a
noun followed by a simple verb. Such a combination is called a “Verb Part” (Singh,
2003), where the verb used in the Verb Part is some basic verb such as “honaa” (to

become), “karnaa” (to do) etc. Some examples of Hindi Verb Parts are given below.

Begin - aarambh karnaa
Answer - uttar denaa
Fail - asafal honaa
Allow - aagyaa denaa
Ride - savaarii karnaa
Wonder - hairaan honaa

For illustration, consider the verb “to begin”. Its Hindi equivalent is “aarambh
karnaa”. In Hindi, “aarambh” is the abstract noun meaning the “beginning”;
whereas, “karnaa” means “to do”. Thus the verb is realized in Hindi as a com-

bination of noun and verb. In a similar vein, the verbs “denaa” (meaning “to give”)
and “honaa” (meaning “to become”) are used as the basic verbs along with appro-
priate nouns to provide the meanings of the English verbs cited above. There are
also examples of Verb Parts involving other basic verbs, such as “maarna” as well.
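The noun-plus-basic-verb structure of such Verb Parts can be represented as a simple lookup table; the entries below are exactly the six pairs listed above, and the table is of course only a toy fragment of a real lexicon.

```python
# Verb Parts as (noun, basic verb) pairs, taken from the examples above.
VERB_PARTS = {
    "begin":  ("aarambh",  "karnaa"),
    "answer": ("uttar",    "denaa"),
    "fail":   ("asafal",   "honaa"),
    "allow":  ("aagyaa",   "denaa"),
    "ride":   ("savaarii", "karnaa"),
    "wonder": ("hairaan",  "honaa"),
}

def verb_part(english_verb):
    """Return the Hindi Verb Part for an English verb in the table."""
    noun, basic_verb = VERB_PARTS[english_verb]
    return noun + " " + basic_verb

print(verb_part("begin"))  # aarambh karnaa
```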


Thus, if Dorr’s definition is adopted in English to Hindi translation, there will be


a large number of instances of conflational divergence. Calling such a large set of
translation examples “divergence” makes little sense. Hence we propose that the
introduction of a noun to convey the sense of a verb should not be called a

divergence for English to Hindi translation.

However, there are situations when the action, suggested by the main verb of
an English sentence, needs the help of a prepositional phrase or adverbial phrase to
convey the proper sense of the verb. These cases are encountered occasionally, and
therefore deviate from the normal Hindi verb structure. We call these variations
divergences in English to Hindi translation. Below we provide two sub-types of
divergence from English to Hindi translation. Below we provide two sub-types of


this divergence.

1. Conflational sub-type 1 : Divergence of this type occurs when the new words
are added as adjunct to the verb. Typically, this adjunct is realized as a
prepositional phrase. For illustration, consider the following English sentences
and their Hindi translations:
Ram stabbed John. ∼ ram ne john ko chaaku se maaraa
(Ram) (to John) (knife) (by) (hit)
The sense of the verb “stab” is conveyed through the introduction of the prepo-
sitional phrase “chaaku se”. There are cases when the adjunct appears in the
form of an adverbial phrase instead of a prepositional phrase.

Mary hurried to market. ∼ mary jaldi se bazaar gayii


(mary) (hurriedly) (market) (went)

To convey the proper sense of the verb “hurry”, the adverbial phrase “jaldi
se” is used along with the main verb “jaanaa” meaning “to go”. Note that


“gayii ” is the past form (with feminine singular subject) of “jaanaa”.

Although the new words required by a conflational verb are normally added in an
adjunct, in English to Hindi translation we have found some examples in which
they are added in the subject of the target language sentence. This we call
sub-type 2 of conflational divergence.

2. Conflational sub-type 2 : Under this sub-type the new word added acts as

the subject of the Hindi translation, and the original subject of the English
sentence becomes the post modifier or possessive case of the subject of the
Hindi sentence.
Example 1. He resembles his mother. ∼
uskii shakal uskii maa se miltii hai

(his) (face) (his) (mother) (with) (matches)

The literal meaning of the translation is: “His face is similar to his mother.”.
The subject of the Hindi sentence, viz. “uskii shakal ” (meaning “his face”)

is realized from the source language verb “to resemble”. Here “uskii ” (“his”)
is the possessive pronoun of the original subject (“he”) of the English sentence.

Example 2. This dish tastes good. ∼


iss pakvaan kaa swaad achachaa hai

(this) (dish) (of) (taste) (good) (is)

In this example too, the subject of the Hindi sentence “iss pakvaan kaa swaad ”

(the taste of this dish) is realized from the verb “to taste”.

Figure 3.12 provides a schematic representation of the proposed algorithm, keeping
in view the following points.

Step1. IF(root word of the main verb of E is "be/have")
       THEN RETURN(0)
Step2. IF(# adjunct(s) of E < # adjunct(s) of H )THEN RETURN(1)
Step3. S1 = number of nouns in the SPAC of the subject of E
       S2 = number of nouns in the SPAC of the subject of H
       IF(S1 < S2)THEN
       IF(((SPAC of the subject of E has "PRP")
       AND(SPAC of the subject of H has "PRP$"))
       OR(SPAC of the subject of H has "POSS")
       OR(SPAC of the subject of H has "P"))
       THEN RETURN(2)
       ELSE RETURN(0)
       ELSE RETURN(0)

Figure 3.12: Algorithm for Identification of Conflational Divergence

1. If the English sentence E has a declension of the “be”/“have” verb at the main
verb position, then conflational divergence cannot occur.

2. If H has more adjuncts than E, then it is the case of conflational divergence


sub-type 1.

3. If the number of nouns in the SPAC of the subject of E is less than the number
of nouns in the SPAC of the subject of H, and the SPAC of the subject of
H further contains a possessive personal pronoun (PRP$), or a possessive case

(POSS), or a preposition (P), then conflational divergence of sub-type 2 occurs.

The algorithm returns 0 if the translation does not involve any conflational di-
vergence. Otherwise, depending upon the case it returns 1 or 2.


Illustration 1.

Let E be the sentence “I stabbed John.”, and let its translation H be “main ne john
ko chaaku se maaraa”. The corresponding SPAC for both the sentences and the
term correspondences are given in Figure 3.13.

Figure 3.13: Correspondence of SPAC E and SPAC H for Conflational Divergence


of Sub-type 1

In step 1, the algorithm finds that the root form of the main verb of the English
sentence is “stab”, which is neither “be” nor “have”; hence the else condition of
step 1 applies, and the algorithm proceeds to check the further steps to determine
the sub-type of the conflational divergence.

In step 2, the numbers of adjuncts of E and H are 0 and 1, respectively, as E
does not have any adjunct while H has one (“chaaku se”). This implies that the
given translation pair has conflational divergence of sub-type 1.
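The algorithm of Figure 3.12 can be sketched in Python. SPACs are represented here as flat lists of tags and adjuncts as lists of phrases; this flat representation is a hypothetical simplification for illustration only.

```python
def conflational_divergence(e, h):
    """Sketch of Figure 3.12: return 1 or 2 for the conflational
    sub-type, or 0 if no conflational divergence is present."""
    if e["main_verb_root"] in ("be", "have"):        # Step 1
        return 0
    if len(e["adjuncts"]) < len(h["adjuncts"]):      # Step 2
        return 1
    s1 = e["subject_spac"].count("N")                # Step 3
    s2 = h["subject_spac"].count("N")
    if s1 < s2 and (("PRP" in e["subject_spac"] and
                     "PRP$" in h["subject_spac"])
                    or "POSS" in h["subject_spac"]
                    or "P" in h["subject_spac"]):
        return 2
    return 0

# "I stabbed John." ~ "main ne john ko chaaku se maaraa": sub-type 1
e = {"main_verb_root": "stab", "adjuncts": [], "subject_spac": ["PRP"]}
h = {"adjuncts": ["chaaku se"], "subject_spac": ["PRP"]}
print(conflational_divergence(e, h))  # 1
```

For the “He resembles his mother” pair, with `["PRP"]` as the English subject SPAC and `["PRP$", "N"]` as the Hindi one, the same function returns 2.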

3.3.7 Possessional Divergence

Possessional divergence deals with English sentences in which a declension of the verb
“have” is used as the main verb. An interesting feature of Hindi is that it has


no possessive verb, i.e. one equivalent to the “have” verb of English. The normal
translation pattern of English sentences with declensions of “have” as main verb is
illustrated below:

• Ram has many enemies. ∼ ram ke bahut shatru hai

(ram’s) (many) (enemies) (is)


• Ram has a holiday today. ∼ ram kii aaj chhuttii hai
(ram’s) (today) (holiday) (is)
• Ram has an inkpot. ∼ ram ke paas davaat hai

(with ram) (inkpot) (is)

The above examples demonstrate that the normal translation pattern of these
sentences has the following characteristics:

1. The main verb of the translated sentence is “honaa” which means “to be”.

2. The verb is used along with some genitive prepositions (viz. kaa, ke or kii ),
or the locative prepositional phrase, viz. “ke paas”, to convey the meaning of
possession (Kachru, 1980).

3. Which one of the three genitive prepositions will be used depends upon the
number and gender of the object. It is “kaa” if the object is masculine singular,

“kii ” if the object is feminine singular, and “ke” for the plural, both masculine
and feminine.
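The choice described in point 3 is a simple function of the number and gender of the object, and can be sketched as follows (the locative “ke paas” pattern of point 2 is separate and is not covered here):

```python
def genitive_preposition(number, gender):
    """Pick kaa/kii/ke from the number and gender of the object,
    per the pattern described above."""
    if number == "plural":          # plural: "ke", masculine or feminine
        return "ke"
    if gender == "masculine":       # masculine singular: "kaa"
        return "kaa"
    return "kii"                    # feminine singular: "kii"

print(genitive_preposition("singular", "feminine"))  # kii
```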

However, there are many examples where the translation structure deviates from
this normal pattern, giving rise to divergence. We call this the “possessional


divergence”. Depending upon how the roles of different FTs change, six different
sub-types are identified. These sub-types are explained below.

1. Possessional sub-type 1 : Here the roles of the subject and the object are
reversed upon translation. Thus this sub-type is akin to thematic divergence.
But in Hindi this pattern is observed only when the main verb of the English

sentence is “have” or its declensions. Hence we categorize this as possessional


divergence. For illustration, consider the following examples:

(a) He has a bad headache. ∼ use tez sirdard hai

(to him) (bad) (headache) (is)


(b) Ram has fever. ∼ ram ko bukhaar hai
(to ram) (fever) (is)

In sentence (a), “he” and “a bad headache” are the subject and the object,
respectively. In the Hindi translation the subject is “tez sirdard ”, i.e. “bad
headache”, and the object of the Hindi sentence is “use” which is the accusative
case of “he”. Thus the roles of subject and object are reversed upon translation.

Similarly, in (b) upon translation the roles of the subject “Ram” and object
“fever” are reversed.

2. Possessional sub-type 2 : In this case the object and its premodifying adjective
in the English sentence are realized as the subject and SC, respectively, in the

Hindi sentence. The subject of the English sentence is realized as possessive


case of the subject of the target language sentence. The following example
illustrates this.


These birds have sweet voice. ∼

in chidiyon kii aawaaz miithii hain


(these) (birds’) (voice) (sweet) (is)

The object “voice” and its premodifying adjective “sweet” of the English sen-
tence are realized in Hindi as the subject “aawaaz ” and its adjectival comple-
ment “miithii ”. Note that the subject “these birds” of the English sentence is

realized as a possessive case (“in chidiyon kii”) in the Hindi translation.

3. Possessional sub-type 3 : Here, the object and its post modifier (normally, a
PP) in the English sentence are realized as the subject and the predicative
adjunct, respectively, in the Hindi translation. The subject of the English

sentence also contributes as the possessive case to the predicative adjunct. For
illustration, consider the following:

Boys have books in their satchels. ∼

ladkon ke baston mein kitaaben hain


(boys’) (satchels) (in) (books) (are)
Ram has two rupees in his pocket. ∼
ram kii zeb mein do rupaye hain

(ram’s ) (pocket) (in) (two) (rupees) (are)

In the first example, the object (“books”) provides the subject (“kitaaben”) of
the Hindi translation. The post-modifier “in their satchels” of the object of the

English sentence is realized as a predicative adjunct “ladkon ke baston mein”


of the Hindi sentence. One may notice that the subject “boys” is present as
the possessive case in the predicative adjunct.


Similar transformation takes place in the second example. The object (“two
rupees”) and the post modifier of the object (“in his pocket”) are realized upon
translation as the subject (“do rupaye”) and the predicative adjunct (“raam
kii zeb mein”), respectively. Thus, the literal meaning of the Hindi sentence

is “Two rupees are in Ram’s pocket”.

4. Possessional sub-type 4 : In this case the subject of the Hindi translation is


derived from the object of the English sentence; and the subject of the English
sentence becomes a predicative adjunct upon translation. For illustration,

consider the following:

This city has a museum. ∼


iss shahar mein ek sangrahaalay hai

(this) (city) (in) (one) (museum) (is)

The subject of the Hindi sentence is “ek sangrahaalay ”, which comes from
the object of the English sentence. The subject “this city” translates to “iss
shahar ”, which becomes the predicative adjunct “iss shahar mein” in Hindi.

5. Possessional sub-type 5 : Here, the object, which is a noun with/without any
premodifier, becomes the main verb of the Hindi sentence. The premodifier,
which may be an adjective or a noun, becomes an adjunct of the translated
sentence. Consider, for illustration, the following translations:


(a) Mary has regards for her uncle. ∼

mary apne chaachaa kii izzat kartii hai


(Mary) (her) (uncle) (of) (respect) (does)
(b) They had a narrow escape. ∼
woye baal baal bache the

(they) (marginally) (escaped)

In example (a) the main verb of the Hindi sentence (“izzat kartii hai”) is
realized from the object “regards” of the English one. Similarly, in example

(b), the object “escape” of the English sentence is realized as the main verb
(“bache the”) of the Hindi sentence. Further, the premodifying adjective of
the object (“narrow”) is realized as an adjunct (“baal baal ”) in the translated
sentence.

6. Possessional sub-type 6 : Here, the main verb of the translated sentence is not
“ho”. Moreover, this verb does not come from any of the functional tags of
the English sentence. Consider for example the following translations:

(a) Radha had a good time here. ∼


raadhaa ne yahaan acchaa samay bitaayaa
(radha ) (here) (good) (time) (spent)
(b) Ram had heavy breakfast. ∼

ram ne bhaarii naashtaa kiyaa


(ram) (heavy) (breakfast) (did)

In example (a), the main verb of the Hindi sentence is “bitaayaa”, which is

different from the verb “ho” and does not come from any FT of the English


sentence. The literal meaning of the Hindi translation of (a) is “Radha spent a
good time here.”. Similarly, in (b) introduction of a new verb “kiyaa” (means
“did”) may be noticed.

We have the following observations on the translation examples that involve


possessional divergence:

1. Possessional divergence cannot occur if the main verb of the English sentence
is not a declension of “have”.

2. Possessional divergence cannot occur if the subject of H has a postposition


(“ke”, “kaa” or “kii ”).

3. If the root form of the main verb of H is not “ho”, then divergence of sub-type
6 is identified if the object is present in H, and divergence of sub-type 5
otherwise.

4. If the root form of the main verb of H is “ho”, the object of H is not present,
and the predicative adjunct is present in H, then the decision between

divergence sub-types 3 and 4 is taken on the basis of the postmodifier of the
object of E.

5. To check the precondition of sub-type 1, one has to first find out the trans-
lation of the subject and the object of the English sentence E with the help
of a bilingual dictionary. If these act as the object and subject of the Hindi
translation (i.e. their roles are reversed) then possessional divergence sub-type

1 occurs.

6. If the postmodifier of the object of E is not present then it gives divergence


of sub-type 4, otherwise divergence sub-type 3. Under sub-type 3, the subject


of E becomes a possessive case of the subject of H. This implies that if

the subject of E is either a personal pronoun (PRP) or an NP, then the subject
of E becomes a possessive personal pronoun (PRP$) or a possessive noun phrase,
respectively. This can be identified if the SPAC of the subject contains one of

POSS, PRP$ or P.

7. For sub-type 2 to occur, the following three conditions are necessary. The root

form of the main verb of H should be “ho”, the SPAC of the object of H
should contain an Adj (i.e. adjective), and also the SC of H should not be
null. When all the three conditions are met, possessional divergence of
sub-type 2 is identified.

We have designed our algorithm taking care of the above observations. Fig-
ure 3.14 provides a schematic view of the proposed algorithm. We illustrate the
algorithm with the help of the following examples.

Illustration 1.

Consider the English sentence (E) “Suresh has fever.”. Its Hindi translation (H)
is “suresh ko (suresh) bukhaar (fever) hai (is)”. The SPACs of these sentences and
their term correspondences are given in Figure 3.15.

The root forms of the main verbs of E and H are “have” and “ho”, respectively.
This implies that they do not satisfy the conditions of steps 1 and 2. In step 3, the
algorithm checks the postposition condition on the subject of H, and finds that none
of the relevant postpositions is present for the subject of H. In step 4, the algorithm

finds that the subject of E and the object of H are “suresh” and “suresh ko”,
respectively, which are translations of each other. Further, it finds that the object


Step1. IF(root word of the main verb of E is not “have”)


THEN RETURN(0)
Step2. IF(root word of the main verb of H is not “ho”) THEN
IF(the object of H is null) THEN RETURN(5)
ELSE RETURN(6)
Step3. IF((postposition of the subject of H) EQUAL
(“ke paas” OR “ke” OR “kaa ” OR “kii”))
THEN RETURN(0)
Step4. IF(((the object of E) EQUAL (the subject of H))
AND ((the subject of E) EQUAL (the object of H)))
THEN RETURN(1)
Step5. IF(the object of H is not null)THEN RETURN(0)
{
IF(the PA of H is not null) THEN
IF(the post modifier of object in E is not null)THEN
IF(((SPAC of the subject of E has “PRP”)AND
(SPAC of the subject of H has “PRP$”))OR
(SPAC of the subject of H has “POSS”)OR
(SPAC of the subject of H has “P”)) RETURN(3)
ELSE RETURN(4)
ELSE
IF(SPAC of the object of E has “Adj”)THEN
IF(the SC of H is not null)THEN
IF(((SPAC of the subject of E has “PRP”)AND
(SPAC of the subject of H has “PRP$”))OR
(SPAC of the subject of H has “POSS”)OR
(SPAC of the subject of H has “P”))RETURN(2)
ELSE RETURN(0)
ELSE RETURN(0)
}

Figure 3.14: Algorithm for Identification of Possessional Divergence


Figure 3.15: Correspondence of SPAC E and SPAC H for Possessional Divergence


of Sub-type 1

“fever” of E became the subject “bukhaar ” of the Hindi sentence H. Therefore, step
4 returns 1 indicating the occurrence of possessional divergence of sub-type 1 in the
above translation.

Illustration 2.

Consider the English sentence (E) “This city has a museum.”. Its Hindi translation
H is “iss (this) shahar (city) mein (in) ek (one) sangrahaalaya (museum) hai (is)”.

The SPACs of these sentences and their term correspondences are given in Figure
3.16.

The root forms of the main verbs of E and H are “have” and “ho”, respectively.
Therefore, the algorithm arrives at step 3, where it finds that the subject of H has no
postposition “kaa”, “ke” or “kii ”. Hence the algorithm proceeds further. Since the

conditions of step 4 are not met, the algorithm arrives at step 5. Here it
finds that in H there is no object, but a PA (“iss shahar mein”) is present. Also,
since there is no postmodifier of the object of E, the algorithm returns 4. Thus, the
algorithm diagnoses possessional divergence of sub-type 4 in the above translation

example.
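Finally, the steps of Figure 3.14 can be sketched in Python. The small DICT below is a hypothetical stand-in for the bilingual-dictionary lookup (Shabdanjali) used in step 4, and the field names are again illustrative simplifications of the FT/SPAC information rather than the actual system's representation.

```python
# Hypothetical stand-in for the bilingual dictionary used in step 4.
DICT = {"fever": "bukhaar", "museum": "sangrahaalay", "city": "shahar"}

def translate(word):
    """Toy English-to-Hindi lookup; falls back to the word itself."""
    return None if word is None else DICT.get(word.lower(), word)

def possessional_divergence(e, h):
    """Sketch of Figure 3.14: return the possessional sub-type (1-6),
    or 0 if no possessional divergence is present."""
    if e["main_verb_root"] != "have":                          # Step 1
        return 0
    if h["main_verb_root"] != "ho":                            # Step 2
        return 5 if h["object"] is None else 6
    if h["subject_postposition"] in ("ke paas", "ke", "kaa", "kii"):  # Step 3
        return 0
    if (translate(e["object"]) == h["subject"] and             # Step 4
            translate(e["subject"]) == h["object"]):
        return 1
    if h["object"] is not None:                                # Step 5
        return 0
    possessive = (("PRP" in e["subject_spac"] and "PRP$" in h["subject_spac"])
                  or "POSS" in h["subject_spac"]
                  or "P" in h["subject_spac"])
    if h["pa"] is not None:
        if e["object_postmodifier"] is not None:
            return 3 if possessive else 0
        return 4
    if "Adj" in e["object_spac"] and h["sc"] is not None:
        return 2 if possessive else 0
    return 0

# "Suresh has fever." ~ "suresh ko bukhaar hai": sub-type 1
e = {"main_verb_root": "have", "subject": "suresh", "object": "fever",
     "subject_spac": ["N"], "object_spac": ["N"], "object_postmodifier": None}
h = {"main_verb_root": "ho", "subject": "bukhaar", "object": "suresh",
     "subject_postposition": None, "pa": None, "sc": None, "subject_spac": ["N"]}
print(possessional_divergence(e, h))  # 1
```

On the “Suresh has fever” pair the sketch returns 1, and on the “This city has a museum” pair it returns 4, matching the two illustrations above.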


Figure 3.16: Correspondence of SPAC E and SPAC H for Possessional Divergence


of Sub-type 4

3.3.8 Some Critical Comments

In this chapter we have discussed the various types of divergences that have been
observed in English to Hindi translation. By analyzing the characteristics of various

examples, we have been able to identify different sub-types under each divergence
type. These observations helped us to design algorithms for their identification.
However, we still have some examples of divergence which do not fall under any of
the above-mentioned types. At the same time, we do not have a sufficient number of
examples of these to classify them under some new type or sub-type.

The efficiency of the algorithm, however, depends on the availability of the following:

• A cleaned and aligned parallel corpus of the source and the target languages.

• An on-line bilingual dictionary. For this work, we have used “Shabdanjali”,
  an English-Hindi on-line dictionary (http://www.iiit.net/ltrc/Dictionaries/Dict Frame.html).


• Appropriate parsers for the source and the target languages, able to provide
the FT and SPAC information for both. Note that presently no such parser is
available for Hindi; for our experiments we have used manually annotated Hindi
corpora.

3.4 Concluding Remarks

This chapter deals with the characterization and identification of different types
of divergence that may occur in English to Hindi translation. We observed that

identification of divergence can be made without going into the semantic details of
the two sentences. This can be achieved by comparing the Functional Tags (FT)
and Syntactic Phrase Annotated Chunks (SPAC) of the source language sentence
and its translation.

The work described here may be broadly classified into two parts:

1. Characterization of English to Hindi divergence. Divergence is essentially a

language-dependent phenomenon. Depending upon the semantic and syntactic
properties of the source and target languages, the nature of divergence may
change. Although divergence has been studied in great detail for European
languages, not much has been done with respect to Indian languages in this

regard. This work describes in detail the various types (and sub-types) of
divergence that may occur in English to Hindi translation. The work also
identifies three new types of divergence that have hitherto not been reported in
translation between any other language pair.

2. Identification of divergence. This chapter makes a meticulous study of the


structural changes in the sentences that occur due to various types of di-
vergence. Seven different types of divergence have been studied, and all of
them have a number of sub-types. The necessary preconditions in the English
sentence corresponding to each of these sub-types have been identified, and

consequent variations in the translated Hindi sentences have been observed.


These observations enabled us to form rules on the basis of the FTs and SPACs
of both the English sentences and their Hindi translations to identify the type
and sub-type of divergence, if any has occurred.

An obvious question that arises at this point is how an EBMT system is expected
to handle divergences. In this regard our suggestions are as follows. Once divergences

are identified, the focus of a system designer should be on the following:

• To split the system’s example base into two parts: normal and divergence
example base. The translation examples are to be put in the appropriate part
of the example base.

• To design an appropriate retrieval policy, so that for a given input sentence, an

EBMT system can heuristically judge whether its translation may involve any
divergence, and retrieval may be made accordingly.

• To design appropriate adaptation strategies for modifying retrieved translation


examples. Since translations having divergence do not follow any standard

patterns, their adaptations may need specialized handling that may vary with
the type/sub-type of divergence.

The following chapter discusses these issues in detail.

Chapter 4

A Corpus-Evidence Based
Approach for Prior Determination
of Divergence

4.1 Introduction

This chapter presents a corpus-evidence based scheme for deciding whether the
translation of an English sentence into Hindi will involve divergence. Certainly, the
occurrence of divergence poses a serious hindrance to efficient adaptation of retrieved sentences.
A possible solution may lie in separating the example base (EB) into two parts:
Divergence EB and Normal EB, so that given an input sentence, retrieval can be
made from the appropriate part of the example base. This scheme, however, can
work successfully only if the EBMT system has the capability to judge from the
input sentence itself whether its translation will involve any divergence. Making
such a decision is not straightforward, since the occurrence of divergence does
not follow any fixed patterns or rules. In fact, a divergence may be induced by various

factors, such as the structure of the input sentence or the semantics of its constituent
words. In this chapter we propose a corpus-evidence based approach to deal with this
difficulty. Under this scheme, upon receiving an input sentence, a system looks into
its example base to glean evidences in support as well as against any possible type of

divergence that may occur in the translation of the input sentence. Based on these
evidences the system decides whether the retrieval has to be made from the normal
EB, or from the divergence EB.

The algorithm proposed here works for structural, categorial, conflational, demotional,
pronominal and nominal types of divergence.1 For convenience of presentation
we denote them as d1 , d2 , d3 , d4 , d5 and d6 , respectively. Barring structural
divergence (d1 ), all of the other five types of divergence (i.e. d2 , . . . , d6 ) have further
been classified into several sub-types depending upon the variations in the role of
different functional tags upon translation to Hindi.

1 Prior identification of “possessional divergence” has been kept out of the discussion here. This is
because possessional divergence depends upon several factors, such as the subject, the object, and
even the sense in which the verb “have” is used. Our work (Goyal et al., 2004) discusses these issues
in detail.

In this chapter, we have identified the necessary FT-features that the source
language (English) sentences should have in order that a particular type/sub-type
of divergence may occur. This, however, does not mean that any sentence having
those FT-features will necessarily produce a divergence upon translation. As a

consequence, mere examination of the FTs of an input sentence cannot ascertain


whether its translation will induce any divergence or not. Hence more evidences
need to be considered.

This chapter describes all these evidences and how they are to be used for making
an a priori decision regarding whether an input English sentence will involve any
divergence upon translation to Hindi.

4.2 Corpus-Based Evidences and Their Use in Divergence Identification

The proposed scheme makes use of three different types of evidence to decide whether

a given input sentence will have a normal translation, or whether it will involve one
(or more) type(s) of divergence when translated into Hindi. These evidences are
used in succession to obtain the overall evidence in support of divergence(s)/non-divergence
in the translation of the input sentence. The three corresponding steps are explained

below:

Step 1: Here the Functional Tags (FTs) of the constituent words of the input sentence
are used to rule out the divergence types that certainly cannot occur in the

136
A Corpus-Evidence Based Approach

translation of that sentence. The output of this step is a set D of divergence


types that may possibly occur in the translation of a given input sentence.

Step 2: Here the semantic similarities of the constituent word(s) of the input sentence
with the constituent words of sentences in the divergence EB and the normal EB
are determined. Depending on the occurrence of similar words in the divergence
and/or normal EB, the scheme decides whether the input sentence may induce any
divergence upon translation.

Step 3: Sometimes the above two steps may suggest more than one type of divergence.
In such a situation the algorithm consults its knowledge base to ascertain
which combinations of divergence types are possible in the translation of
a single sentence. A scrutiny of our example base, and an examination of the
syntactic rules of Hindi grammar, suggest that only the following combinations
of divergence are possible with respect to English to Hindi translation:

1. structural (d1 ) and conflational (d3 )

2. conflational (d3 ) and demotional (d4 )

3. categorial (d2 ) and pronominal (d5 )

This knowledge is stored in a set CD := {{d1 , d3 }, {d3 , d4 }, {d2 , d5 }}. The
possible combinations of divergence can be used as evidence to rule out any
suggestion given by the earlier two steps that does not conform with the knowledge
stored in CD.

The following subsections elaborate the above steps.


4.2.1 Roles of Different Functional Tags

Analysis of the divergence examples suggests that for each divergence type to occur,
the underlying sentence needs to have some specific functional tags (FT) and/or some
specific attributes of these FTs. Together we call these the FT-features of a sentence.

Considering all the divergences together we found that ten different FT-features are,
in particular, useful for identification of divergence. Table 4.1 provides a list of these
features, which we label as f1 , f2 , . . . , f10 .

FT-feature   Property of feature

f1           Root form of the main verb is “be”
f2           To-infinitive form of a verb is present
f3           Root form of the main verb is not “be”/“have”
f4           Subject is present
f5           Object is present
f6           Subjective complement (SC) is present
f7           Subjective complement is an adjective
f8           Subject of the sentence is “it”
f9           Verb complement (VC) is present and is a PP
f10          Predicative adjunct (PA) is present

Table 4.1: FT-features Instrumental for Creating Divergence

With respect to a particular type of divergence, an FT-feature may have one of


the following three roles:

• Its presence in the input sentence is necessary for the corresponding divergence
type to occur;

• It should necessarily be absent in the input sentence if the corresponding
divergence is to occur;


• The FT-feature has no role in occurrence of the corresponding divergence.

We denote the above three possibilities as P (present), A (absent), and X (don’t


care). Table 4.2 gives the roles of the 10 FT-features discussed above in occurrence of
the different types of divergence and their sub-types. We call the table as “Relevance
Table”.

di   sub-type    f1 f2 f3 f4 f5 f6 f7 f8 f9 f10

d1   -           X  X  P  X  P  A  A  X  A  A

d2   sub-type 1  P  X  A  P  A  P  X  X  A  A
     sub-type 2  P  X  A  P  A  P  X  X  A  A
     sub-type 3  P  X  A  P  A  A  X  X  A  P
     sub-type 4  P  X  A  P  A  A  X  X  A  P

d3   sub-type 1  A  X  P  X  X  X  X  X  X  A
     sub-type 2  A  X  P  P  X  X  X  X  X  A

d4   sub-type 1  A  X  P  P  P  A  A  X  A  A
     sub-type 2  A  X  P  P  A  A  A  X  P  A
     sub-type 3  A  X  P  P  P  A  A  X  A  A
     sub-type 4  A  X  P  P  P  A  A  X  A  A

d5   sub-type 1  P  X  A  P  A  P  X  P  X  A
     sub-type 2  A  X  P  P  A  X  X  P  X  A
     sub-type 3  P  P  A  P  A  P  X  P  A  A

d6   sub-type 1  P  X  A  P  A  P  P  X  A  A
     sub-type 2  A  X  P  P  A  P  P  X  A  A

Table 4.2: Relevance of FT-features in Different Divergence Types


Each row of the Relevance Table provides the necessary conditions on the FT-
features of an input sentence in order that the corresponding divergence may occur.
The advantage of this evidence is that it helps in quickly discarding those types of
divergence that cannot occur in the translation of the given input sentence.

The information given in Table 4.2 may be used in the following way. Given an

input sentence, the algorithm first extracts the values for the ten FT-features, fj ,
j = 1, 2, ..., 10, from the sentence. These values are then compared with the row
entries of the Relevance Table. If the FT-features of the sentence conform with the
entries of some particular row, then evidence is obtained towards occurrence of that

particular divergence for which this row corresponds to one of the sub-types. If a
particular sentence has evidence supporting more than one divergence then all these
possible divergence types are to be considered for step 2 of the algorithm. This set
of possible divergence types for a given input is denoted as D.
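The Step 1 matching procedure just described can be sketched as follows (an illustrative sketch in Python; the row strings follow Table 4.2, while the representation and function names are ours, and only a few rows are shown):

```python
# Sketch: matching a sentence's FT-features against Relevance Table rows.
# Each row is a divergence label plus one 'P'/'A'/'X' entry per f1..f10
# (values copied from Table 4.2; only three illustrative rows are listed).
RELEVANCE_ROWS = [
    ("d2", "PXAPAPXXAA"),   # d2, sub-type 1
    ("d2", "PXAPAPXXAA"),   # d2, sub-type 2
    ("d6", "PXAPAPPXAA"),   # d6, sub-type 1
]

def row_matches(features, row):
    """features: set of feature indices (1-10) present in the sentence.
    A row matches when every 'P' feature is present and every 'A'
    feature is absent; 'X' (don't care) entries are ignored."""
    for i, entry in enumerate(row, start=1):
        if entry == "P" and i not in features:
            return False
        if entry == "A" and i in features:
            return False
    return True

def possible_divergences(features):
    """Set D: divergence types with at least one matching sub-type row."""
    return {d for d, row in RELEVANCE_ROWS if row_matches(features, row)}

# "Ram is friendly to me." has FT-features f1, f4, f6 and f7:
print(sorted(possible_divergences({1, 4, 6, 7})))   # -> ['d2', 'd6']
```

With the full 17-row table loaded, the same function reproduces the D = {d2 , d6 } result derived in the text.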

For illustration, consider the following input sentence: “Ram is friendly to me.”.
When the sentence is parsed (with some unnecessary components edited out) one may
get the following:

@SUBJ <Proper> N SG “Ram”, @+FMAINV V PRES “be”, @PCOMPL-S A ABS


“friendly” , @ADVL PREP “to”, @<P PRON PERS SG1 “i” “< $. >”

The notations used here are from the ENGCG parser and are explained in Appendix B.
We can summarize the parsed version as follows: of the ten FT-features discussed
above (see Table 4.1), only four are present in the sentence. These are:

• f1 – because the main verb of the sentence is “be”.

• f4 – since the sentence has a subject, viz. “Ram”.


• f6 – as an SC “friendly” is present in the sentence.

• f7 – since the SC of this sentence is an adjective.

Thus in the Hindi translation of this sentence only those divergence sub-types

can occur for which the entries corresponding to FT-features f1 , f4 , f6 , and f7 are
either “P” or “X”. For the other FT-features the entries have to be either “A”
or “X”. This algorithm assumes that occurrence of a particular divergence type is
possible only if at least one of its sub-types satisfies the above conditions. Thus for

the above input sentence the possible divergences are:

• Categorial (d2 ), since sub-types 1 and 2 conform with the above requirements.

• Nominal(d6 ), since sub-type 1 satisfies the above requirements.

Also note that, sub-type 1 of d5 has values either “P” or “X” for the FT-features
f1 , f4 , f6 , and f7 . But divergence d5 cannot occur in this case as the sub-type has
an extra requirement that FT-feature f8 should also be present, which is not true

for this sentence. Therefore, the output of this step is the set D = {d2 , d6 }.

It should be noted, however, that the FT-features specified in the Relevance
Table do not provide conclusive evidence towards the presence of some particular
divergence type. For example, consider the following two sentences.

Example (A):

She is in trouble. ∼ wah (she) musiibat (trouble) mein (in) hai (is)

She is in tears. ∼ wah (she) ro (cry) rahii (..ing) hai (is)

Since both the sentences given in Example (A) have the same FT-features, i.e.
f1 , f4 and f10 , the Relevance Table gives evidence supporting categorial divergence


d2 (check the rows for sub-types 3 and 4) for both the sentences. But of the two
sentences the translation of the first one is a normal one. It is only the second
sentence that involves categorial divergence upon translation to Hindi. Thus, to
determine the possible divergence type(s) in a sentence, the FT-features alone cannot
be taken as conclusive evidence, and more evidences need to be sought.

From the above example, it can be surmised that it is the prepositional phrase
“in tears” that is instrumental in causing the categorial divergence in the second
sentence. In general, corresponding to each divergence type one can associate some
functional tags that are instrumental in causing the divergence. We call these the
Problematic FTs of the corresponding divergence type. Table 4.3 provides the
Problematic FT corresponding to each of the six divergence types relevant in the context
of English to Hindi translation. This table has been obtained by examining the
sentences in our example base.

Divergence Type   Problematic FT

Structural        Main Verb
Categorial        Subjective Complement (SC: adjective, noun)
                  or Predicative Adjunct (adverb, PP)
Conflational      Main Verb
Demotional        Main Verb
Pronominal        Main Verb or Subjective Complement (adjective, noun)
Nominal           SC (adjective)

Table 4.3: FT of the Problematic Words for Each Divergence Type

Table 4.3 is to be used in the following way. If the FT-features of a given


input conform with the requirements of a particular divergence type (as given in


the Relevance Table) then the corresponding problematic FT in the sentence needs
to be examined more carefully. Since both the sentences of Example (A) have the
structures required for categorial divergence, Table 4.3 suggests that to gather more
evidence the scheme should concentrate on the SC or PA of the sentences.
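The content of Table 4.3 can be kept as a simple lookup (an illustrative sketch; the dictionary representation and function name are ours):

```python
# Sketch: problematic FT(s) per divergence type, following Table 4.3.
PROBLEMATIC_FT = {
    "structural":   ["main verb"],
    "categorial":   ["subjective complement", "predicative adjunct"],
    "conflational": ["main verb"],
    "demotional":   ["main verb"],
    "pronominal":   ["main verb", "subjective complement"],
    "nominal":      ["subjective complement"],
}

def problematic_fts(divergence):
    """Which functional tag(s) of the sentence to inspect for a given
    divergence type."""
    return PROBLEMATIC_FT[divergence]

print(problematic_fts("categorial"))
# -> ['subjective complement', 'predicative adjunct']
```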

In this respect one major difficulty is that a particular word may convey different
senses in different contexts, even under the same FT. For example, consider
the two sentences and their Hindi translations given in Example (B) below:

Example (B):

Mohan beat the drum in the school. ∼
Mohan (Mohan) ne widyaalay (school) mein (in) drum (drum) bajaayaa (beat)

Agassi beat Becker in the final. ∼
Agassi (Agassi) ne final (final) mein (in) Becker (Becker) ko haraayaa (beat)

Here, the first one is an example of normal translation, while the second one is a
case of structural divergence because of the introduction of the postposition “ko” in
the object of the Hindi sentence. A careful examination suggests that although the

main verb of both the sentences is “beat”, its translation causes divergence when
used in a particular sense, but not when used in some other sense. By referring
to WordNet 2.0 (http://www.cogsci.princeton.edu/cgi-bin/webwn) one may find
that the first sentence has the 6th sense of the word “beat”, which is “to make a
rhythmic sound”; while the second sentence has the 1st sense of the word “beat”,
which is “to come out better in a competition, race, or conflict”. Therefore, while
dealing with words one needs to pay attention to


the particular sense in which a word is being used – in some senses it may cause
divergence, and in some other senses it may not induce any divergence at all.

Since an exhaustive list of words (along with their relevant senses) that lead
to divergence is impossible to make, the proposed algorithm tries to gather more
evidences by using the semantic similarity of the constituent words to the word

senses that are already known to cause divergence, or known to deliver a normal
translation. To achieve this, two dictionaries have been created: the Problematic
Sense Dictionary (PSD) and the Normal Sense Dictionary (NSD). The PSD contains the
words along with their senses that have been found to cause divergence. Similarly,

the NSD contains the words along with their senses for which normal translation
has been observed.

Divergence type (di )   No. of words in PSDi   No. of words in NSDi

Structural (d1 )        163                    1078
Categorial (d2 )        57                     167
Conflational (d3 )      43                     997
Demotional (d4 )        66                     1422
Pronominal (d5 )        75                     170
Nominal (d6 )           12                     97
Total                   416                    3931

Table 4.4: Frequency of Words in Different Sections

These dictionaries are further grouped into six sections, one corresponding to each
divergence type. Section PSDi contains the problematic words occurring in
sentences whose translations involve divergence of type di . Similarly, section NSDi
contains the problematic words of sentences having the FT-features required for
divergence type di (as specified in the Relevance Table), but actually having a normal


translation. Table 4.4 gives the number of words currently present in each section
of the PSD and the NSD in our example base.

PSD1            NSD1            PSD2            NSD2

Attend#v#1      Beat#v#6        Afraid#a#1      Brave#a#1
Beat#v#1        Do#v#13         Friendly#a#4    Good#a#1
Love#v#3        Eat#v#4         On#r#2          Illusion#n#2
Marry#v#1       Purchase#v#1    Pain#n#1        Monitor#n#2
Occupy#v#4      See#v#1         Tear#n#1        Trouble#n#1
...             ...             ...             ...

PSD3            NSD3            PSD4            NSD4

Face#v#3        Agree#v#4       Belong#v#1      Continue#v#9
Look#v#5        Feel#v#4        Face#v#3        Ride#v#9
Resemble#v#1    Go#v#10         Front#v#1       Sell#v#2
Rush#v#4        Look#v#3        Smell#v#2       Solve#v#1
Stab#v#1        Solve#v#1       Suffice#v#1     Walk#v#6
...             ...             ...             ...

PSD5            NSD5            PSD6            NSD6

Freeze#v#6      Bright#a#10     Cold#a#1        Dull#a#4
Humid#a#1       Light#a#1       Hot#a#1         Good#a#1
Morning#n#3     Plain#a#2       Hungry#a#1      Happy#a#2
Rain#v#1        Shiny#a#3       Sleepy#a#1      Helpful#a#1
Winter#n#1      Wrong#a#1       Thirsty#a#2     Innocent#a#4
...             ...             ...             ...

Table 4.5: PSD/NSD Schematic Representations

Each PSD/NSD entry contains, along with the relevant word, its part of speech
and the appropriate sense number (as given by WordNet 2.0). Table 4.5 shows some
entries corresponding to each PSDi and NSDi , i = 1, 2, . . . , 6. The entries are stored

in the format word#pos#k, where pos stands for the particular Part of Speech,
which can be one of n, v, a or r (corresponding to noun, verb, adjective and adverb


respectively), and k stands for the sense number.
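A small helper suffices to handle the word#pos#k entry format (an illustrative sketch; parse_entry is our name, and the sample entries are taken from Table 4.5):

```python
# Sketch: parsing PSD/NSD entries of the form word#pos#k, where pos is
# one of n, v, a, r and k is the WordNet 2.0 sense number.

def parse_entry(entry):
    """Split an entry string into (word, part of speech, sense number)."""
    word, pos, k = entry.split("#")
    assert pos in ("n", "v", "a", "r")
    return word, pos, int(k)

# A fragment of PSD1 / NSD1 from Table 4.5:
PSD1 = [parse_entry(e) for e in ("beat#v#1", "marry#v#1", "occupy#v#4")]
NSD1 = [parse_entry(e) for e in ("beat#v#6", "do#v#13", "see#v#1")]

print(parse_entry("beat#v#1"))   # -> ('beat', 'v', 1)
```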

For illustration, consider the two sentences given in Example (A). Both of them
have the structure required for categorial divergence, i.e. d2 . The problematic FT for
this divergence type is the predicative adjunct (PA), which is a prepositional phrase.
Hence, in PSD2 and NSD2 we store tear#n#1 and trouble#n#1, respectively.
Similarly, corresponding to Example (B), where the relevant divergence is structural, i.e.
d1 , the entries in PSD1 and NSD1 are beat#v#1 and beat#v#6, respectively.

In order to ascertain whether a given input sentence may have a divergence di ,
the proposed scheme proceeds as follows. It first identifies the problematic word ai
of the sentence corresponding to the divergence di . The evidence is then collected on
the basis of four parameters, viz. sim(ai , wi ), s(di ), sim(ai , wi′ ) and s(ni ), as described
below:

1. sim(ai , wi ) gives the maximum similarity score between ai and the words in
PSDi , where sim(x, y) denotes the semantic similarity between two words x
and y (see Appendix C).

2. The quantity s(di ), corresponding to divergence type di , is defined as follows:

       s(di ) = 0                            if xi = 0,
       s(di ) = (1/2)(xi /ci + ci /S)        otherwise.        ...(1)

where ci , xi and S are as follows:

(a) ci is the total number of entries in PSDi (given in Table 4.4);

(b) xi is the number of words in PSDi that are semantically similar to ai ;


(c) S is the total number of words in the PSD. Note that, currently the
total number of words in PSD is 416 (see Table 4.4). This number will
increase as more divergence examples are obtained, and corresponding
problematic words are added to the dictionary.

3. The quantity sim(ai , wi′ ) is similar to sim(ai , wi ). While computing sim(ai , wi′ ),
the scheme uses NSDi and the NSD instead of PSDi and the PSD.

4. The quantity s(ni ) is similar to s(di ), and is calculated using NSDi and the NSD.
The value used for S here is the cardinality of the NSD, which is at present
3931 (see Table 4.4).

These four quantities are used to determine the possibility of occurrence of di-
vergence di in the translation of the given input sentence.
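Formula (1) translates directly into code (a sketch; the function name is ours, and the example uses the Table 4.4 counts for categorial divergence with a hypothetical value of xi ):

```python
# Sketch of formula (1): s(di) = 0 if xi = 0, else (1/2)(xi/ci + ci/S).

def s_score(x_i, c_i, S):
    """x_i: number of words in PSDi semantically similar to a_i;
    c_i: total entries in PSDi; S: total entries in the PSD."""
    if x_i == 0:
        return 0.0
    return 0.5 * (x_i / c_i + c_i / S)

# Table 4.4 counts for categorial divergence: c2 = 57, S = 416.
# x2 = 3 is a hypothetical example value.
print(round(s_score(3, 57, 416), 4))   # -> 0.0948
```

The same function computes s(ni ) when called with the NSDi /NSD counts instead.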

4.3 The Proposed Approach

In order to determine whether a given input sentence, say e, may involve some
divergence upon translation, the evidences mentioned in the previous section are used
in the following way. First the input sentence e is parsed; then, using the
Relevance Table, a set D is determined that contains the divergence types that may
possibly occur in the translation of e. For each possible divergence type di ∈ D the
problematic word ai is extracted from the sentence e. From PSDi , the word wi that
is semantically most similar to ai is retrieved. The subsequent steps depend upon
the value of sim(ai , wi ). If this value is 1, then ai is present in PSDi .

On the other hand, a small value of sim(ai , wi ) implies that there is not enough
evidence in support of divergence di . Hence it may be concluded that divergence di


will not occur in the translation of e. Note that, whether the value of sim(ai , wi ) is
sufficiently small is determined by comparing it with a threshold t, which is to be
determined experimentally from the corpus. If the value of sim(ai , wi ) is between t
and 1, then some evidence in support of divergence di is obtained. In order to make

a conclusion from this point, the algorithm now refers to NSDi to obtain the word
wi′ that is semantically most similar to ai . Depending upon the values of sim(ai ,
wi ) and sim(ai , wi′ ), a decision is taken regarding whether the translation of e will
involve divergence di or not. Based on this decision, the retrieval is to be made from
the appropriate part of the example base, i.e. the Divergence EB or the Normal EB.

The overall scheme involves four major steps, as explained below:

Step 1: At this stage, the input sentence e is parsed, and its FT-features are
obtained. From these FT-features, using Table 4.2, the set D of possible divergence
types is determined.

The main objective now is to determine which divergence types, out of all the di ∈
D, have positive evidence supporting their occurrence in the translation of e.
Steps 2 and 3 are designed for this purpose. A set of flags, Flagi , corresponding to
each di ∈ D is used to store this information. Initially each of these flags is set to
−1. Steps 2 and 3 are then carried out for each di ∈ D in order to reassign the
value of Flagi . At each iteration the next di with the minimum index i such that
Flagi is −1 is chosen.

Step 2: From the input sentence e, the problematic word ai corresponding to
divergence di (see Section 4.2) is determined. The set Wi , comprising the words
belonging to PSDi that have a positive semantic similarity score with ai , is then
computed: Wi = {b : b ∈ PSDi and sim(ai , b) > 0}. From Wi the word wi is obtained
such that sim(ai , wi ) = max{sim(ai , b) : b ∈ Wi }. If Wi is empty then sim(ai , wi ) is
taken to be 0. Depending on the similarity score sim(ai , wi ), a decision is taken
regarding di , as follows.

Case 2a: If sim(ai , wi ) = 1, then the word ai is present in PSDi , and the sentence
will certainly have divergence di upon translation. Therefore Flagi is set to 1.

Case 2b: This case occurs when ai ∉ PSD. If ai is a noun or a verb, and further
ai is a coordinate term of wi (i.e., in WordNet terminology, ai and wi have
the same hypernym), then it can be decided that ai will not create divergence of
type di upon translation. This is because all those coordinate terms of wi that may
cause divergence are already stored in the PSD. Therefore Flagi is set to 0.

Case 2c: If sim(ai , wi ) < t, where t is some pre-defined threshold, then too it
may be decided that ai will not cause divergence di . Consequently, Flagi is set to 0.

The main difficulty here is to decide upon the right value for the threshold t. After
a sequence of experiments with different values for t, we found that the best results
are obtained for t = 0.5. However, since this value is corpus dependent, for other
corpora the value of t should be determined experimentally.
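The three cases of Step 2 amount to the following decision skeleton (an illustrative sketch; the function name is ours, t = 0.5 is the experimentally found threshold mentioned above, and the similarity score is assumed to come from a WordNet-based measure as in Appendix C):

```python
# Sketch of Step 2: decide Flag_i from sim(a_i, w_i) and threshold t.
T = 0.5   # corpus-dependent threshold, found experimentally

def step2_flag(sim_ai_wi, same_hypernym_as_wi=False):
    """Returns 1 (divergence certain), 0 (divergence ruled out), or
    None (inconclusive -- proceed to Step 3 and consult the NSD)."""
    if sim_ai_wi == 1.0:                 # Case 2a: a_i is in PSDi
        return 1
    if same_hypernym_as_wi:              # Case 2b: coordinate term of w_i
        return 0
    if sim_ai_wi < T:                    # Case 2c: too dissimilar
        return 0
    return None                          # t <= sim < 1: go to Step 3

print(step2_flag(1.0))    # -> 1
print(step2_flag(0.3))    # -> 0
print(step2_flag(0.8))    # -> None
```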

Since in all three cases above the scheme arrives at a decision regarding the
divergence type di , computation may skip Step 3 and go to Step 4 directly. But
there may be cases when the similarity score sim(ai , wi ) lies between t and 1. In
these cases, as mentioned above, the NSD has to be consulted. Hence Step 3 is
executed.

Step 3: Here, first the set Wi′ = {b : b ∈ NSDi and sim(ai , b) > 0} is computed.
From this set the word wi′ is picked such that sim(ai , wi′ ) = max{sim(ai , b) : b ∈ Wi′ }.
If Wi′ is empty then sim(ai , wi′ ) is considered to be 0. Depending on sim(ai , wi′ ), one
of the following cases is executed.

Case 3a: If sim(ai , wi′ ) = 0 then there is no evidence that the word will lead to a
normal translation. Consequently, Flagi is set to 1, indicating that divergence di has
a positive chance of occurring.

Case 3b: If sim(ai , wi′ ) = 1 then the evidence suggests that the word ai should
lead to a normal translation of the sentence, and there is no possibility of divergence
di occurring in the translation. Consequently, Flagi is set to 0.

Case 3c: Decision making becomes most difficult when 0 < sim(ai , wi′ ) < 1. This
implies that words sufficiently similar to ai exist neither in the PSD nor in the NSD.
Thus, no decision about divergence/non-divergence can be taken yet.

In this case the scheme looks into how many words similar to ai are
available in PSDi and NSDi . This evidence is given by the scores s(di ) and s(ni ),
computed using formula (1) (given in Section 4.2). Finally, the similarity scores
sim(ai , wi ) and sim(ai , wi′ ) are combined with s(di ) and s(ni ), respectively, to take
both evidences into consideration. If the evidence supporting divergence di is greater,
the value of Flagi is set to 1; otherwise it is set to 0. Thus, in this case, the following
computations are performed:

• Compute s(di ) and s(ni ).

• Determine m(di ) := (1/2)(s(di ) + sim(ai , wi )), and
  m(ni ) := (1/2)(s(ni ) + sim(ai , wi′ )).


• If m(di ) > m(ni ) Then
      Set Flagi = 1; GO TO Step 4.
  Else If m(di ) < m(ni ) Then
      Set Flagi = 0; GO TO Step 4.
  Else
      Set Flagi = 1/2; Break.

The last case refers to the rare situation when m(di ) and m(ni ) are equal. Here
the algorithm cannot recommend whether the translation will involve divergence di
or whether it will be normal. In such a situation the system can at best pick the
most similar examples from both the normal EB and the divergence EB, and leave
the final decision to the user. Therefore, in such cases, Flagi is set to 1/2.
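The Case 3c computations above can be sketched as follows (an illustrative sketch; the function name and the example scores are ours):

```python
# Sketch of Case 3c: combine dictionary-level scores s(di), s(ni) with
# the word-level similarities sim(a_i, w_i) and sim(a_i, w_i').

def case3c_flag(s_di, sim_ai_wi, s_ni, sim_ai_wi_prime):
    m_d = 0.5 * (s_di + sim_ai_wi)        # evidence for divergence d_i
    m_n = 0.5 * (s_ni + sim_ai_wi_prime)  # evidence for normal translation
    if m_d > m_n:
        return 1          # divergence d_i is likely
    if m_d < m_n:
        return 0          # normal translation is likely
    return 0.5            # tie: leave the final decision to the user

print(case3c_flag(0.09, 0.8, 0.02, 0.6))   # -> 1
```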

Once the evidences for/against all divergence types di ∈ D are obtained, that is,
the value of Flagi is determined for every di ∈ D, Step 4 is performed to make a final
decision regarding the possible divergence types in the translation of the given input
e. Here it should be noted that Flagi = 0 implies that e cannot have divergence di ,
while Flagi = 1 implies that upon translation e may have divergence di . A set D′
is constituted such that D′ = {di ∈ D : Flagi = 1}, i.e. D′ stores all those di ’s for
which positive evidence is obtained.

Step 4: The final decision is computed in the following way.

Case 4a: If D′ = ∅, then sufficient evidence has not been obtained for any of the
divergence types. Hence, the decision is that the translation of the input sentence e
will not involve any divergence.

Case 4b: If |D′ | = 1, i.e. D′ = {dk }, then evidence is obtained in support of just
one divergence type dk . The algorithm therefore decides that the translation of the
input sentence will have divergence dk .

Case 4c: If |D′ | > 1, there is a possibility of more than one type of divergence, and
the algorithm seeks further evidence before making any decision. The evidence
provided by CD (Section 4.2) may be used here. A set C = {{di , dj } ∈ CD : di , dj ∈ D′ }
is constructed. Depending upon |C|, a further decision is taken in the following way.

• If |C| = 0, no permissible combination has been found. In this case, the
algorithm computes s(di ) and m(di ) for all di ∈ D′ , as in Case 3c, and concludes
that the translation of the input sentence will have divergence dk , where k is
such that m(dk ) = max{m(di ) : di ∈ D′ }.

• If |C| = 1, there is evidence for only one permissible combination, say {dk , dl }.
The algorithm suggests that the input sentence e will involve both divergences
dk and dl upon translation to Hindi.

• If |C| > 1, that is, if evidence is obtained in support of more than one permissible combination of divergences, the scheme needs to select the most likely combination among them. It therefore determines the quantity ½(m(di) + m(dj)) for each combination {di, dj} ∈ C, and recommends the combination of divergences for which this quantity is maximum.
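Step 4 can be summarized as a short decision procedure. The sketch below is our illustrative rendering of Cases 4a–4c, not code from the thesis; the names flags, m and cd are ours. Here flags maps each candidate divergence type to its Flag value, m to its m(di) score, and cd holds the permissible combinations CD.

```python
def final_decision(flags, m, cd):
    """Return the list of divergence types predicted for the input sentence.

    flags: dict mapping divergence type -> Flag value (0, 1 or 0.5)
    m:     dict mapping divergence type -> m(d_i) score (used for tie-breaking)
    cd:    set of frozensets, the permissible combinations CD
    """
    d_prime = [d for d, f in flags.items() if f == 1]
    if not d_prime:                         # Case 4a: no positive evidence
        return []                           # -> normal translation
    if len(d_prime) == 1:                   # Case 4b: exactly one candidate
        return d_prime
    # Case 4c: several candidates; consult the permissible combinations
    combos = [c for c in cd if c <= set(d_prime)]
    if not combos:                          # |C| = 0: pick the largest m(d_i)
        return [max(d_prime, key=lambda d: m[d])]
    # |C| >= 1: pick the combination maximizing (m(d_i) + m(d_j)) / 2
    best = max(combos, key=lambda c: sum(m[d] for d in c) / len(c))
    return sorted(best)
```

For instance, flags = {'d1': 1, 'd3': 1, 'd4': 1} together with the m-values of Illustration 3 (Table 4.6) yields ['d3', 'd4'], matching the decision derived by hand there.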

The flowchart of the proposed scheme is given in Figures 4.1 and 4.2.

Figure 4.1: Schematic Diagram of the Proposed Algorithm

Figure 4.2: Continuation of Figure 4.1

4.4 Illustrations and Experimental Results

In this section we first illustrate with examples how the above algorithm works towards a priori identification of divergence, if any, in translation from English to Hindi. The examples considered are of increasing difficulty. Later, in Subsection 4.4.4, a consolidated result of several experiments is presented, and certain limitations of the algorithm are discussed.

4.4.1 Illustration 1

Consider the input sentence: “I am feeling hungry”.

The parsed version of the above sentence is: @SUBJ PRON PERS SG1 “i”, @+FAUXV
V PRES “be”, @-FMAINV V PCP1 “feel”, @PCOMPL-S A ABS “hungry” < $.>.

Of the ten FT-features (see Table 4.1) only four are present in the above sentence.
These are:

• f3 – since the main verb (feel) of the sentence is not “be” or “have”.

• f4 – as the sentence has a subject, viz. “I”.

• f6 – because the sentence has an SC.

• f7 – since the SC of this sentence is an adjective (hungry).

Note that the FT-features of the given input sentence conform to both sub-types of d3 and to only sub-type 2 of d6 (see Table 4.2). Hence the set D of possible divergence types is obtained as D = {d3, d6}, which are the conflational and nominal types of divergence, respectively. Therefore, evidence needs to be collected for both divergence types.

Evidence for conflational divergence (d3):

Table 4.3 suggests that the problematic word for d3 is the main verb, i.e. “feel”.
WordNet 2.0 provides thirteen different senses for the word “feel” when used as a
verb, such as:

• sense1 : feel, experience – undergo an emotional sensation

• sense2 : find, feel – come to believe on the basis of emotion, intuitions, or

indefinite grounds

• sense3 : feel, sense – perceive by a physical sensation, e.g., coming from the

skin or muscles

For the given input sentence the appropriate sense is sense1. Thus a3 is feel#v#1. A scrutiny of PSD3 reveals that it contains no word w with similarity sim(w, a3) > 0. Thus W3 = φ, and therefore Flag3 is set to 0.

Evidence for nominal divergence (d6):

The problematic FT for d6 is “Subjectival complement (Adjective)”. Hence the problematic word of the input sentence is “hungry”. WordNet 2.0 provides two senses for “hungry”, of which the first one, “feeling hunger”, is appropriate in this case. Thus the problematic word a6 is hungry#a#1. PSD6 is then scrutinized to find the word semantically most similar to a6. It is found that PSD6 already contains hungry#a#1. Therefore w6 is the same as a6, and hence the similarity score is 1. Thus Flag6 is set to 1.


Now the set D′ is constructed as D′ = {di ∈ D : Flagi = 1}. Evidently, for the given input sentence D′ contains the single element d6. Thus the algorithm suggests that the above input sentence will cause nominal divergence upon translation to Hindi, which is the correct decision.

4.4.2 Illustration 2

Consider the input sentence: “She is in a dilemma”.

Its parsed version is @SUBJ PRON PERS FEM SG3 “she”, @+FMAINV V PRES

“be”, @ADVL PREP “in”, @<P N SG “dilemma” <$.>.

The FT-features present in this sentence are:

• f1 – as the root form of the main verb is “be”.

• f4 – because the sentence has a subject, viz. “she”.

• f10 – since the sentence has a PA, viz. “in dilemma”.

Using the Relevance Table the set D of possible divergence types is obtained as
{d2 }.

The algorithm now collects evidence in support of categorial divergence (d2):

Table 4.3 suggests that the problematic FT for d2 is the predicative adjunct, i.e. “in dilemma”. Thus the problematic word is “dilemma”. WordNet 2.0 provides only one sense for dilemma: “state of uncertainty or perplexity especially as requiring a choice between equally unfavorable options”. Thus the problematic word a2 is dilemma#n#1.
A search in PSD2 for the word that is semantically most similar to a2 retrieves the


entry motion#n#4 as w2 , and the similarity score sim(a2 , w2 ) is computed to be


0.578.

It may be noted that the similarity between “dilemma” and “motion” is not apparent at the surface level. However, since the algorithm uses the hypernyms of the words concerned for computing the similarity value, a positive semantic score has been obtained: the last abstraction level in the hypernym hierarchies of “dilemma” and “motion” is the same, viz. “⇒ state”.
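The effect of climbing the hypernym hierarchy can be sketched with toy data. The chains and the scoring formula below are simplified stand-ins of our own, not the actual WordNet 2.0 hierarchies or the exact measure used in this work:

```python
# Hand-made hypernym chains, most specific synset first; a real system
# would read these from WordNet 2.0. The intermediate entries here are
# illustrative only.
hypernym_chain = {
    'dilemma#n#1': ['dilemma#n#1', 'perplexity#n#1', 'cognitive_state#n#1', 'state#n#2'],
    'motion#n#4':  ['motion#n#4', 'change#n#3', 'action#n#1', 'state#n#2'],
    'table#n#2':   ['table#n#2', 'furniture#n#1', 'artifact#n#1', 'object#n#1'],
}

def toy_sim(a, b):
    """Positive score iff the two chains meet at some shared abstraction."""
    chain_a, chain_b = hypernym_chain[a], hypernym_chain[b]
    shared = [s for s in chain_a if s in chain_b]
    if not shared:
        return 0.0                     # hierarchies never meet
    # penalize by how far up the chains the nearest shared node sits
    steps = chain_a.index(shared[0]) + chain_b.index(shared[0])
    return 1.0 / (1 + steps)
```

Under this sketch “dilemma” and “motion” obtain a small positive score because both chains terminate in “state”, whereas unrelated words such as “dilemma” and “table” score 0.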

Since 0.5 ≤ sim(a2, w2) < 1, Step 2 of the algorithm suggests that NSD2 has to be checked for further evidence. From NSD2, the word w2′ most similar to a2 is determined, and it is found to be confusion#n#2 with sim(a2, w2′) = 0.960. The algorithm therefore determines s(d2), s(n2), m(d2) and m(n2) (see Case 3c). These values are found to be 0.086, 0.035, 0.332 and 0.497, respectively. Since m(n2) > m(d2), Flag2 is set to 0.

Using Step 3, the algorithm now constructs the set D′ consisting of the divergence types di for which the flags have been set to 1. Evidently, D′ is found to be empty. Thus the algorithm suggests that the above input sentence does not involve any divergence upon translation to Hindi.

It may be noted that the above decision made by the algorithm is a correct one.

4.4.3 Illustration 3

Now consider the sentence: “My house faces east.”

Its parsed version is: @GN> PRON PERS GEN SG1 “i” , @SUBJ N SG “house”,
@+FMAINV V PRES “face”, @OBJ N SG “east” <$.>


Note that the main verb of the input sentence is “face” which is not “be” or
“have”. Further, the sentence has a subject “my house” and an object “east”. Thus
the FT-features of the given input sentence are: f3 , f4 and f5 .

According to the Relevance Table the set D is constructed, and it has three
elements:

• d1 , i.e. structural divergence

• d3 , i.e. conflational divergence because of sub-types 1 and 2.

• d4 , i.e. demotional divergence due to sub-types 1, 3 and 4.

Evidence for structural divergence (d1):

The problematic FT for d1 is the main verb which is “face”. Nine senses are
provided by WordNet 2.0 for the verb “face” of which sense3 (be oriented in a
certain direction, often with respect to another reference point; be opposite to) is the

relevant one in this case. Thus the problematic word a1 is “face#v#3”. From PSD1 the word w1 that is most similar to a1 is retrieved. Note that w1 is obtained as attend#v#1, and the similarity score sim(a1, w1) is calculated to be 0.660. Since 0.5 ≤ sim(a1, w1) < 1, the algorithm now checks NSD1. From NSD1, W1′ is constructed, and w1′ is found to be cap#v#1 with sim(a1, w1′) = 0.889. In this case, the algorithm has to determine s(d1) and s(n1). These are found to be 0.444 and 0.151, respectively. Thus m(d1) = ½(sim(a1, w1) + s(d1)) = 0.552 and m(n1) = ½(sim(a1, w1′) + s(n1)) = 0.520. Since m(d1) > m(n1), the algorithm sets Flag1 to 1.


Evidence for conflational divergence (d3):

The problematic FT for d3 is also main verb (See Table 4.3), and therefore the
problematic word (a3 ) here too is “face#v#3 ”. From PSD3 the word w3 that is
most similar to a3 is retrieved. In this case the same word face#v#3 exists in PSD3 ,
and therefore sim(a3 , w3 ) = 1.0. Therefore, due to Case 2a Flag3 is set to 1.

Evidence for demotional divergence (d4):

Problematic word a4 for d4 is also “face#v#3 ”, which too exists in PSD4 . Hence

Flag4 is also set to 1.

Divergence type (di)   s(di)   m(di)
structural (d1)        0.444   0.552
conflational (d3)      0.086   0.543
demotional (d4)        0.204   0.602

Table 4.6: Values of s(di) and m(di) for Illustration 3

In Step 3, the set D′ = {d1, d3, d4} is constructed. The set of possible combinations C (see Case 4c) is found to be {{d1, d3}, {d3, d4}}. For the final decision the algorithm now computes the values of s(di) and m(di) (see Case 3c); these values are given in Table 4.6. Using them, the algorithm computes ½(m(d1) + m(d3)) = 0.548 and ½(m(d3) + m(d4)) = 0.573.

Since the latter is maximum, the algorithm suggests that the above input sentence will have divergences d3 and d4 upon translation to Hindi. This decision of the algorithm is also correct.

Tables 4.7 and 4.8 provide a few more examples with brief explanations. The overall analysis of each example sentence requires 17 columns. Table 4.7 contains columns (i) to (viii), and Table 4.8 contains columns (ix) to (xvii). For ease of understanding, the serial number (S. No.) column and column (ii) are given in both tables. In these tables, “NA” is used when a particular condition is not applicable, and “Nil” implies that no word having a semantic similarity score greater than 0 has been found in the PSD/NSD.

S.No. | Sentence (i) | D (ii) | Problematic word, ai (iii) | Most similar word, wi (iv) | sim(ai, wi) (v) | Is wi a coordinate term? (vi) | Most similar word, wi′ (vii) | sim(ai, wi′) (viii)
1. | She will resolve this issue. | d1 | resolve#v#6 | calculate#v#1 | 0.984 | No | resolve#v#6 | 1.0
 | | d3 | resolve#v#6 | Nil | 0.0 | NA | NA | NA
 | | d4 | resolve#v#6 | Nil | 0.0 | NA | NA | NA
2. | I will attend this meeting. | d1 | attend#v#1 | attend#v#1 | 1.0 | NA | NA | NA
 | | d3 | attend#v#1 | look#v#5 | 0.66 | No | ride#v#9 | 0.66
 | | d4 | attend#v#1 | face#v#4 | 0.75 | Yes | NA | NA
3. | This exercise will hurt your back. | d1 | hurt#v#2 | trample#v#2 | 0.96 | No | twist#v#9 | 0.96
 | | d3 | hurt#v#2 | knife#v#1 | 0.96 | No | twist#v#9 | 0.96
 | | d4 | hurt#v#2 | Nil | 0.0 | NA | NA | NA
4. | John stabbed Mary. | d1 | stab#v#1 | stab#v#1 | 1.0 | NA | NA | NA
 | | d3 | stab#v#1 | stab#v#1 | 1.0 | NA | NA | NA
 | | d4 | stab#v#1 | Nil | 0.0 | NA | NA | NA
5. | This dish tastes good. | d3 | taste#v#1 | taste#v#1 | 1.0 | NA | NA | NA
 | | d6 | good#a#1 | Nil | 0.0 | NA | NA | NA
6. | This table weighs 100kg. | d1 | weigh#v#1 | encounter#v#3 | 0.660 | No | stay#v#1 | 0.660
 | | d3 | weigh#v#1 | measure#v#3 | 0.972 | No | look#v#3 | 0.660
 | | d4 | weigh#v#1 | suffer#v#6 | 0.660 | No | look#v#3 | 0.660
7. | It is windy today. | d2 | windy#a#1 | stormy#a#1 | 0.75 | No | Nil | 0.0
 | | d5 | windy#a#1 | stormy#a#1 | 0.75 | No | Nil | 0.0
8. | It will be morning soon. | d2 | morning#n#3 | pain#n#2 | 0.406 | No | NA | NA
 | | d5 | morning#n#3 | morning#n#3 | 1.0 | NA | NA | NA
9. | She is in pain. | d2 | pain#n#1 | pain#n#2 | 0.438 | No | NA | NA
10. | It suffices. | d3 | suffice#v#1 | resemble#v#1 | 0.782 | No | meet#v#5 | 0.96
 | | d4 | suffice#v#1 | suffice#v#1 | 1.0 | NA | NA | NA
 | | d5 | suffice#v#1 | Nil | 0.0 | NA | NA | NA

Table 4.7: Some Illustrations
S.No. | D (ii) | s(di) (ix) | s(ni) (x) | m(di) (xi) | m(ni) (xii) | Flagi (xiii) | D′ (xiv) | C (xv) | ½(m(di) + m(dj)) (xvi) | Result (xvii)
1. | d1 | NA | NA | NA | NA | 0 | φ | NA | NA | Normal
 | d3 | NA | NA | NA | NA | 0
 | d4 | NA | NA | NA | NA | 0
2. | d1 | NA | NA | NA | NA | 1 | d1 | NA | NA | d1
 | d3 | 0.075 | 0.141 | 0.368 | 0.401 | 0
 | d4 | NA | NA | NA | NA | 0
3. | d1 | 0.241 | 0.142 | 0.601 | 0.551 | 1 | d1 | NA | NA | d1
 | d3 | 0.08 | 0.145 | 0.52 | 0.553 | 0
 | d4 | NA | NA | NA | NA | 0
4. | d1 | NA | NA | NA | NA | 1 | d1, d3 | {d1, d3} | NA | d1, d3
 | d3 | NA | NA | NA | NA | 1
 | d4 | NA | NA | NA | NA | 0
5. | d3 | NA | NA | NA | NA | 1 | d3 | NA | NA | d3
 | d6 | NA | NA | NA | NA | 0
6. | d1 | 0.224 | 0.231 | 0.442 | 0.495 | 0 | d3 | NA | NA | d3; no decision about d4
 | d3 | 0.186 | 0.219 | 0.579 | 0.439 | 1
 | d4 | 0.287 | 0.287 | 0.473 | 0.473 | ½
7. | d2 | NA | NA | NA | NA | 1 | d2, d5 | {d2, d5} | NA | d2, d5
 | d5 | NA | NA | NA | NA | 1
8. | d2 | NA | NA | NA | NA | 0 | d5 | NA | NA | d5
 | d5 | NA | NA | NA | NA | 1
9. | d2 | NA | NA | NA | NA | NA | NA | NA | NA | Normal, as sim(a2, w2) < 0.5; wrong decision
10. | d4 | NA | NA | NA | NA | 1 | d4 | NA | NA | d4
 | d5 | NA | NA | NA | NA | 0

Table 4.8: Continuation of Table 4.7

4.4.4 Experimental Results

In order to evaluate its performance, we have applied the above algorithm to 300 randomly selected sentences that are not present in our example base. Manual analysis of the translations of these 300 sentences revealed that 32 of them involve some type of divergence when translated from English to Hindi. The remaining 268 sentences have normal translations.

The output of the algorithm is as follows: It recognized 36 of the sentences


to have divergence upon translation, and 261 to have normal translation. For 3
sentences the algorithm could not make any decision. Table 4.9 summarizes the

overall outcome.

Parameters             Divergence   Normal
Number of examples     32           268
Experimental results   36           261
Correct results        30           260
Recall %               83.33%       99.62%
Precision %            93.75%       97.39%

Table 4.9: Results of Our Experiments

The very high value (above 90%) for precision establishes the efficiency of the

algorithm in detecting possible occurrence of divergence even before the actual trans-
lation is carried out.

There are a few examples where the algorithm failed to produce the correct decision. These may be put into three categories:


1. Translation of the input sentence actually involves divergence but the algo-
rithm predicts normal translation. Table 4.9 indicates that there is one such
case in our experiments. Although the algorithm suggests that 261 sentences
will be translated normally it has been found that actually 260 of them are

correct decisions.

2. The input sentence actually has normal translation but the algorithm predicts
divergence. In the experiments carried out by us, we found six such exam-
ples. While the algorithm suggests that 36 sentences will involve some type of

divergence, only 30 of them are correct decisions (see Table 4.9).

3. The algorithm is unable to decide the nature of the translation of the input
sentence. Out of 300 examples tried the algorithm could provide decisions for
only 297 (36+261) sentences. For the remaining three sentences the algorithm
could not arrive at any decision regarding whether they will be translated

normally, or their translations will involve some type of divergence. These are
the situations that fall under Case 3c of the algorithm.

Table 4.7 provides one example of this type. Here the input sentence and its translation are: “This table weighs 100kg” ∼ “iss (this) mez kaa vajan (weight of this table) 100 kilo (100 kg) hai (is)”. This example has demotional divergence, i.e. d4. However, the algorithm could not give any decision regarding the occurrence/non-occurrence of d4, since the values of both m(d4) and m(n4) are computed to be 0.473.

The algorithm is not able to give the correct result in the first two cases. We feel that the possible reasons behind these incorrect decisions are the following:


• Lack of robust PSD and NSD. The present sizes of the PSD and NSD are 416 and 3931 entries, respectively. Evidently, these numbers are not large enough to deal with all possible sentences. As more examples (particularly those involving divergence) are collected, both the PSD and NSD may be enriched with additional entries. This will in turn enable the algorithm to measure semantic similarity in a more direct way, and consequently the number of erroneous decisions will reduce.

• The value of the threshold. For our experiments we have used 0.5 as the value of the threshold t. This value has been obtained by carrying out a number of experiments on our example base. However, with more examples this value of t may have to be reassigned, which may in turn improve the quality of the results. Further experiments with more examples need to be carried out to arrive at an optimal value of the threshold t.

4.5 Concluding Remarks

Occurrence of divergence poses a great hindrance to the efficient adaptation of retrieved sentences in an EBMT system. It can be dealt with efficiently provided the EBMT system is capable of making an a priori decision regarding whether an input sentence will cause any divergence upon translation. This enables the EBMT system to retrieve a past example more judiciously. However, the primary difficulty in handling divergences is that their occurrences are not governed by any linguistic rules. Hence no straightforward method exists for determining whether a source language sentence will involve any divergence upon translation. In this work we attempted to bridge this gap. We developed a scheme by which an a priori decision may be made by seeking evidence from the existing example base. In order to achieve the above goal we first
analyzed different divergence examples to ascertain the root cause behind occurrence
of a divergence. We found that each divergence type can be associated with some
Functional Tag (FT) that is instrumental for causing this type of divergence. We

call it the “problematic FT” corresponding to that particular divergence. In fact, a


detailed analysis of a large number of translation examples revealed that occurrence
of each type of divergence invariably demands certain patterns in the structure of
the input sentence. While the presence of certain FTs (including the problematic
FT) in the input sentence is mandatory, some other FT features should necessarily

be absent in order that the particular divergence type can occur.

Since divergence is an occasional phenomenon, it is not true that any sentence having the structure required by a particular divergence will certainly involve divergence upon translation. Occurrence of divergence also depends upon the semantics of some constituent words. To measure the semantic similarity between words, two dictionaries, viz. the “problematic sense dictionary” (PSD) and the “normal sense dictionary” (NSD), have been created.

Given an input, these knowledge bases are consulted to seek evidence for or against divergence. The types of evidence used are the following:

(a) The Functional Tags of the constituent words of a given input;

(b) Semantic similarity of these constituent words with words in the PSD and

NSD;

(c) Frequency of occurrence of different divergence types in the example base; and

(d) Which divergence types may co-occur in the translation of an input sentence.


The experiments carried out by us resulted in very high values of precision and
recall. However, more experiments need to be done to establish this scheme as a key
technique for dealing with divergences for an EBMT system.

The following points may be noted with respect to the scheme presented here:

1) Creation of the sense dictionaries is an important background work required


for implementation of the proposed scheme. The sense dictionaries (PSD and

NSD) used in this work have been created manually. Some suitable Word Sense
Disambiguation techniques may have to be developed/used to accomplish this
task.

2) The decisions made by the scheme concern divergence types only. We feel that the scheme may be further extended to deal with the various sub-types associated with each divergence type. Our present example base does not have a sufficient number of examples for each sub-type. More examples involving each of these sub-types need to be obtained and analyzed for any such extension, and also to improve the performance of the present scheme.

Chapter 5

A Cost of Adaptation Based Scheme for Efficient Retrieval of Translation Examples

5.1 Introduction

Similarity measurement is an essential part of any EBMT system, as it leads to the development of an effective retrieval scheme. The closer the retrieved sentence is to the input one, the easier is its adaptation towards generating the required translation. However, no standard technique has been developed for measuring sentential similarity. Typically, similarity between sentences is measured using syntax and semantics (Manning and Schutze, 1999). In this chapter, however, we show that if adaptation is the main concern, neither of these is adequate for measuring similarity.

In this work we look at similarity from the “adaptation” point of view. A new algorithm is proposed that considers the cost of adaptation as the key concept for measuring similarity. That is, the algorithm measures the computational cost involved in adapting the translation of a sentence E1 to generate the translation of another sentence E2. The lower the cost, the more similar the two sentences are considered to be. The algorithm has been tested on our normal example base. For convenience, in subsequent discussions the “normal example base” (see Chapter 4) is referred to simply as the “example base”. The results obtained are compared with those of two other algorithms based on syntactic and semantic metrics. It is shown that the algorithm proposed in this chapter performs better than the other two.

5.2 Brief Review of Related Past Work

Various similarity metrics reported in the literature can be characterized depending on the text units to which they are applied. These units may be words, characters, sentences or chunks1. Some of these metrics are discussed below:

1. Word-based metrics: Word-based metrics are among the most basic similarity metrics, suggested by Nagao (1984) and used in many early EBMT systems. The metric uses a thesaurus or similar means for identifying word similarity on the basis of meaning or usage. According to Nirenburg (1993), individual words of the two sentences are compared in terms of their morphological paradigms, synonyms, hypernyms, hyponyms and antonyms. On the other hand, Sumita et al. (1990) used a semantic distance d (0 ≤ d ≤ 1) determined by the Most Specific Common Abstraction (MSCA), obtained from a thesaurus abstraction hierarchy. A similar technique was used by Sumita and Iida (1991) for translating Japanese adnominal particle constructions (Noun1 preposition Noun2). This shows that the technique works for measuring sub-sentence level similarity as well.

2. Character-based metrics: Another approach, based on a character-level metric, has been proposed in (Sato, 1992). This is a highly language-dependent approach that requires analysis of the characteristics of the language under consideration. Sato's work has been applied to Japanese, taking advantage of certain characteristics of the language:

(a) The character-based method does not need morphological analysis.

(b) It can retrieve some synonyms without a thesaurus, because synonyms often share the same Kanji character in Japanese.

The character-based best match can be determined by defining a distance or similarity measure between two strings. Considering character order constraints, the simplest measure of similarity between two strings is the number of matching characters.

1 A chunk is a segment or substring of words from a sentence or text.
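With the order constraint, the “number of matching characters” amounts to the length of the longest common subsequence of the two strings. The following dynamic-programming sketch is ours, not Sato's implementation:

```python
def matching_characters(s, t):
    """Length of the longest common subsequence of two strings, i.e. the
    number of matching characters when character order must be preserved."""
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s, 1):
        for j, b in enumerate(t, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(s)][len(t)]
```

For Japanese, s and t would be Kanji/kana strings, so shared Kanji contribute directly to the score without any morphological analysis.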

3. Chunk/Substring-based matching scheme: This approach was proposed by Nirenburg et al. (1993). Here, the search for matching candidates proceeds as follows. A sentence is broken into segments at punctuation marks or at unknown words, and thus a list of all contiguous substrings (“chunks”) of a segment is produced. For every input chunk the algorithm looks for sentences in the corpus that contain a matching substring. The algorithm considers a relaxed definition of matching that allows not only complete matches but also matches in which (i) there are gaps between the words, or (ii) the word order is different. It also considers matching on the basis of a subset of the words in the input chunk, and takes care of word inflections.

For each of the inexact matches, a penalty is calculated. This penalty is based on fixed numbers on a scale of 1 to 15 reflecting the degree of inexactness. For example, the penalty for unmatched words is set to 10, and the penalty for disordering is set to 15. Match scores are first calculated separately for each incomplete match, and then a cumulative score is produced. The candidate-finding procedure retains only those matches whose match scores are above a threshold, which is set at 10 for the best matches.

4. Syntactic/Semantic-based matching (Gupta and Chatterjee, 2002): This idea has been borrowed from the domain of information retrieval, as proposed in (Manning and Schutze, 1999). Here, each of the example base sentences and the input are represented in a high-dimensional space, in which each dimension corresponds to a distinct word in the database. The similarity is calculated as the dot product of the vectors. In both cases the measurement score depends to a significant extent on the word weights (word frequency and sentence frequency)2, which in turn depend on the sentences in the example base. Thus the schemes become highly subjective. In particular, sentences having a similar structure (in terms of tense, subject, number of objects, etc.) have higher similarity measurement values for a given input sentence. Different weights have been assigned to the similarity of different syntactic tags. For example, a score of 20 is given to verb or auxiliary verb matching, and a score of 5 to adjective or adverb matching.
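A bare-bones version of such a vector-space measure can be sketched as follows. This is our illustration only; plain term frequencies stand in for the word/sentence-frequency weights and tag-specific scores described above:

```python
from collections import Counter
from math import sqrt

def vector_similarity(sent1, sent2):
    """Represent each sentence as a word-frequency vector and return the
    normalized dot product (cosine) of the two vectors."""
    v1 = Counter(sent1.lower().split())
    v2 = Counter(sent2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0
```

Note the subjectivity mentioned above: two sentences sharing no words at all score 0 regardless of how cheaply one could be adapted into the other.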

5. Hybrid retrieval scheme: This scheme has been used in the ReVerb system (Collins and Cunningham, 1996; Collins, 1998), which utilizes two different levels of case retrieval: string-matching retrieval (Phase 1) and activation passing for syntactic retrieval (Phase 2). In Phase 1, only exact words are matched, and near morphological neighbours (such as variations due to number or tense) are not considered. The highest score is allocated to those cases that have been activated the greatest number of times. In Phase 2, for structural retrieval, the input sentence is first pre-chunked, such that each chunk has an explicit head-word. The algorithm initiates activation from each word in the chunk, giving the head word an increased weight to reflect its pivotal role in the chunk. The final score is evaluated by summing the above two scores.

6. DP-matching between word sequences (Sumita, 2001): This scheme scans the source parts of all example sentences in a bilingual corpus. By measuring the semantic distance between the word sequences of the input and example sentences, it retrieves the examples with the minimum distance, provided that distance is smaller than a given threshold; otherwise, the whole translation fails with no output. The semantic distance (dist) is calculated in the following way:

dist = (I + D + 2 Σ SEMDIST) / (Linput + Lexample)

where I is the number of insertions, D is the number of deletions, and SEMDIST = K/N, where K is the level of the least common abstraction in the thesaurus and N is the height of the abstraction hierarchy. The value of SEMDIST ranges from 0 to 1. The denominator of the above expression is the sum of the lengths of the input and the example sentence.

2 These terms are explained in detail in Section 5.5.1.
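This distance can be realized with a standard edit-distance DP in which a substitution costs 2·SEMDIST and each insertion or deletion costs 1. The sketch below is ours; the toy SEMDIST (0 for identical words, 1 otherwise) stands in for the thesaurus-based value:

```python
def dp_distance(input_words, example_words, semdist):
    """dist = (I + D + 2*sum(SEMDIST)) / (L_input + L_example), via DP."""
    n, m = len(input_words), len(example_words)
    # dp[i][j]: minimum cost of aligning the first i input words
    # with the first j example words
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                                  # deletion
                dp[i][j - 1] + 1,                                  # insertion
                dp[i - 1][j - 1] + 2 * semdist(input_words[i - 1],
                                               example_words[j - 1]),
            )
    return dp[n][m] / (n + m)

# Toy SEMDIST: 0 for an exact match, 1 otherwise.
toy_semdist = lambda a, b: 0.0 if a == b else 1.0
```

A thesaurus-backed semdist would return K/N for the least common abstraction of the two words instead of the 0/1 toy values.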

7. Semantic matching procedure (Jain, 1995): This scheme first looks at the verb-part of the input sentence, and on the basis of the type of verb-part it chooses an appropriate partition of the input sentence. The syntactic units of the input sentence are counted for entering the next level of partition. After reaching the correct sub-partition, exact pattern matching is performed. For all such examples, the distance to the input sentence is computed using a distance formula. The distance d between I (the input sentence) and E (the example sentence) is defined as follows:

d(I, E) = Σ_{p=1}^{n} dp(IG, EG) + dv(IV, EV)

where n is the number of noun syntactic groups in the source language sentence. IG and EG are the “input sentence noun syntactic group” and the “example sentence noun syntactic group”, respectively. Similarly, IV and EV are the input and the example sentence verb groups, respectively.


The above distance d is calculated as a weighted average of the attribute, status, gender, number, person, additional semantic, and verb category differences between the example sentence E and the input sentence I. Pre-assigned values in the range 0 to 1 have been used as the weighting factors for these parameters.

8. Retrieving Meaning-equivalent Sentences (Shimohata et al., 2003): Retrieval of meaning-equivalent sentences is based on content words (e.g. noun, adjective, verb), modality (request, desire, question) and tense. The method does not rely on functional-word (e.g. conjunction, preposition, auxiliary verb) information. A thesaurus is utilized to extend the coverage of the example base, and two types of content words, “identical” and “synonymous”, are used. Sentences that satisfy the following conditions are recognized as meaning-equivalent sentences:
meaning-equivalent sentences.

• The retrieved sentence should have the same modality and tense as the
input sentence.

• All content words (identical or synonymous) are included in the input sen-
tence. This means that the set of content words of a meaning-equivalent
sentence is a subset of the input.

• At least one identical content word is included in the input sentence.

If more than one sentence is retrieved, the algorithm ranks them by introducing a “focus area” to select the most similar one. The focus area is defined as the last N words of the word list of an input sentence, where the value of N varies according to the length of the input sentence.
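The three conditions translate almost directly into code. The predicate below is our sketch of them, not the authors' implementation; the synonym table is a hypothetical stand-in for the thesaurus:

```python
# Hypothetical synonym table standing in for the thesaurus.
SYNONYMS = {'begin': {'start', 'commence'}}

def is_meaning_equivalent(example, query):
    """example/query: dicts with 'modality', 'tense' and 'content' (word lists)."""
    # Condition 1: same modality and tense as the input sentence
    if example['modality'] != query['modality'] or example['tense'] != query['tense']:
        return False
    query_words = set(query['content'])
    identical = 0
    for w in example['content']:
        if w in query_words:
            identical += 1                          # identical content word
        elif not (SYNONYMS.get(w, set()) & query_words):
            return False                            # Condition 2: word not covered
    # Condition 3: at least one identical content word
    return identical >= 1
```

For example, “begin the meeting” is accepted against the query words {start, meeting} (via the synonym “start” plus the identical word “meeting”), but fails if the modalities differ or if no identical word remains.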


Many other similarity measurement schemes are found in the literature, and every metric has its own advantages and disadvantages. The demerits mentioned below motivate us to define a new metric.

• Character-based metrics are highly script dependent. Hence a scheme designed for a specific language may not be usable for another language.

• Word-based metrics are generally dependent on the size of the database. If the database does not contain sentences having words in common with a given sentence, then these methods may fail to retrieve any similar sentence from the example base.

• Most importantly, in almost all the schemes3 described above, adaptation and retrieval have been dealt with independently. However, we feel that adaptation and retrieval should go hand in hand. A retrieval scheme should be considered efficient (for an EBMT system) if the adaptation of the retrieved sentence is computationally less expensive.

In order to avoid the above difficulties, we propose the cost of adaptation as the major yardstick for measuring similarity between sentences in an EBMT system. The following section describes the proposed metric. The cost of adaptation is based on the constituent word, morpho-word and suffix operations already discussed in Chapter 2.

3 The only metrics that we found to have considered the concept of adaptation while measuring similarity are (Sumita, 2001) and (Collins, 1998). These schemes rely only on counting the number of adaptation operations, with a fixed penalty assigned to each operation. However, this assumption is not very realistic.


5.3 Evaluation of Cost of Adaptation

As discussed in Chapter 2, the cost of adaptation depends on the number of operations required for adapting a retrieved example. The total cost may then be computed as the sum of the individual costs of the operations used for the adaptation. An important point to be noted in this respect is that some adaptation operations (e.g. constituent word addition and constituent word replacement) require a search of an English to Hindi dictionary. Typically, this dictionary will not be stored in the RAM of the system, and its access requires retrieval from external storage. This cost is much higher than the cost of any operation that can be accomplished in RAM alone. Resorting to morpho-word or suffix operations reduces the number of dictionary searches, since the number of morpho-tags and suffixes is much smaller in comparison with the total content of a dictionary.

tal content of a dictionary. However, since complete avoidance of constituent word


additions and replacements is impossible, we had to take into account the search
time due to different operations in our analysis of computational cost. To deal with
dictionary search, we make the following assumptions/observations:

1. We assume that the dictionary is stored on a hard drive. We also assume that the search will be done using a binary search algorithm. One may also consider some multi-way search trees (e.g. B+-tree) (Loomis, 1997). But since a successful search in a dictionary of size D takes log2 D comparisons for a binary tree, and logm D for an m-way tree, the difference between the search times in these two cases is due to a constant factor only (note that logm D = log2 D × logm 2).

2. We further assume that the index tree that is used to facilitate the dictionary search is already in RAM. Typically, the index tree is designed with the help


of a set of keys. In this case, we assume that the keys are actually the English
words, which are used for the search operation. The record corresponding to
each key contains all other relevant information, e.g. the Hindi meaning of the
word, the POS and other information. These records are stored in the external

storage.

3. The search procedure refers to the index tree for identifying the location of the word in the dictionary. This operation is carried out by accessing the RAM only. For actual retrieval the external storage is accessed, and this has its associated factors, e.g. latency, seek time (Weiderhold, 1987). However, in our analysis we do not consider all these factors. We make a simple but realistic assumption following the studies on temporal requirements as given in http://www.kingston.com/tools/umg/umg01a.asp. Accordingly, we assume that the access time of RAM for the CPU is 200 ns (nanoseconds), while the access time for the hard disk is 12,000,000 ns. Thus the time requirements differ by a constant of the order of 10^5.

4. In order to reduce the search time, instead of using one dictionary, we recommend using different dictionaries for different POS. The dictionaries used for this work are of the following sizes: Noun - 13953, Adjective - 5449, Adverb - 1027, Preposition - 87, Pronoun - 72 and Verb - 4330. This database has been taken from the “Shabdanjali” English to Hindi dictionary (http://ltrc.iiit.net/onlineServices/Dictionaries/Shabdanjali/data source.html). Thus the approximate search times for all these parts of speech are as follows:


POS           Size    Search time
Noun          13953   log2 13953 ≈ 13.77
Adjective      5449   log2 5449  ≈ 12.41
Adverb         1027   log2 1027  ≈ 10.00
Preposition      87   log2 87    ≈  6.44
Pronoun          72   log2 72    ≈  6.17
Verb           4330   log2 4330  ≈ 12.08

5. Constituent word addition operations require another element in an EBMT system. If a new word needs to be added into a retrieved translation, the addition should be according to the syntax rules of the target language (here Hindi). Since one should not expect that all possible examples are available in the knowledge base of an EBMT system, a pure example-based approach may not be able to determine the right place for the new word to be added. In the absence of suitable examples the system needs to wade through a large set of syntax rules to determine the appropriate position where the new word has to be added. We denote this cost by ψ, and assume that this cost is much higher than the cost of finding the Hindi equivalent of a given English word in the dictionary.

6. For applying any constituent word, morpho-word or suffix operation, one first needs to find the appropriate word position in a retrieved example. If the retrieved Hindi sentence length is L, we consider the average search time for finding the appropriate word position to be proportional to L/2.

7. Since the dictionary search required for constituent word addition (WA) and constituent word replacement (WR) operations is computationally expensive, we introduce the following step before referring to the dictionary. We suggest that the scheme should first check whether the word to be added is already present in the sentence (maybe with a different functional tag). In that case the Hindi equivalent of the word may be taken directly from the retrieved sentence, and thereby the dictionary search may be avoided. The cost of this step is proportional to Lp/2, which should be added to the overall cost of constituent word addition and replacement. Here Lp is the length of the parsed version of the retrieved English sentence.

For illustration, consider two examples:

(A) The car runs on diesel. ∼ gaadii diijal par chaltii hai
    (car) (diesel) (on) (runs) (is)

(B) Diesel is a suitable fuel for this car. ∼ diijal iss gaadii ke liye upyukta iindhan hai
    (diesel) (this) (car) (for) (suitable) (fuel) (is)


Note that in sentence (A) the words “car” and “diesel” are the subject and the complement of a preposition, respectively. On the other hand, in sentence (B) their roles are reversed. If sentence (B) is retrieved in order to generate the translation of (A), then constituent word replacement operations are required for these two positions. Typically, this operation demands a dictionary search to get the Hindi equivalents of these words. However, in the above example the dictionary search may be avoided since the Hindi equivalents of the desired words (“car” and “diesel”) may be obtained from the retrieved example itself. Hence the computational cost can be minimized.

8. Morpho-word operations or suffix operations do not require any dictionary search. Only a set of fixed rules (which may be in a tabular form) is needed for finding the appropriate morpho-word or suffix for addition, deletion or replacement. If the total number of morpho-words is M, then the average cost to find the relevant morpho-word is proportional to M/2. In a similar way, the average cost of finding the appropriate suffix is proportional to K/2, where K is the total number of suffixes.
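The assumptions above are easy to sanity-check numerically; for instance, the per-POS search times of item 4 are just log2 of the dictionary sizes. A small sketch (our own illustration):

```python
import math

# Sizes of the POS-wise dictionaries quoted in the text (Shabdanjali).
dict_sizes = {'Noun': 13953, 'Adjective': 5449, 'Adverb': 1027,
              'Preposition': 87, 'Pronoun': 72, 'Verb': 4330}

# Expected comparisons for a successful binary search: log2 of the size.
search_time = {pos: math.log2(size) for pos, size in dict_sizes.items()}

for pos, t in sorted(search_time.items()):
    print(f'{pos:<12} {dict_sizes[pos]:>6}  log2 ≈ {t:.2f}')
```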

Section 5.3.1 describes how the computational cost of each of the adaptation

operations is computed in view of the above assumptions.

5.3.1 Cost of Different Adaptation Operations

Based on the above observations, the costs of the ten different adaptation operations (discussed in Chapter 2) are estimated in the following way:

1. Constituent Word Deletion (WD): To delete a word from a retrieved example, first the word is located in the sentence, and then it is deleted. Thus the average cost is (l1 × L/2) + ε, where L is the length of the retrieved Hindi sentence, and l1 is the constant of proportionality. ε is a small positive quantity reflecting the cost of the actual deletion operation (e.g. adjustment of pointers if sentences are stored in a linked-list structure of words).

2. Constituent Word Addition (WA): Constituent word addition is done in three steps:

   • First, the Hindi equivalent of the word to be added has to be found in the dictionary. This involves the cost {(d × log2 D) + (c × 10^5)}, where D is the size of the relevant dictionary, and c and d are constants of proportionality. The two terms correspond to searching the binary tree of keys and then retrieving the related record from the external storage, respectively.

   • In the second step the position (in the sentence) where the new word has to be added is located. This requires referring to the syntactic rules of the target language grammar to find the proper position of the word. The cost of this operation has been considered ψ (see item 5 of Section 5.3). Thus the overall cost of this step is ψ + (l1 × L/2) + (l2 × Lp/2). Here l1 and L are the same as in the case of WD discussed above. Lp is the length of the parsed version of the retrieved English sentence, and l2 is the corresponding constant of proportionality.

   • Finally, the actual addition is done. The cost involved for this is ε, indicating the cost of adding the new word in the retrieved translation.

   Therefore, the average time requirement for a WA operation is (l1 × L/2) + (l2 × Lp/2) + {(d × log2 D) + (c × 10^5)} + ψ + ε.

3. Constituent Word Replacement (WR): The work here is similar to what needs to be done in WA, except that here no grammar rules need to be referred to for finding the proper position of the new word. Consequently, no space is required to be created for the new word. The cost, therefore, is reduced by ε and ψ in comparison with constituent word addition. Hence the average cost is (l1 × L/2) + (l2 × Lp/2) + {(d × log2 D) + (c × 10^5)}.

4. Morpho-word Deletion (MD): As discussed in item 8 of Section 5.3, to delete a morpho-word from a retrieved example first the relevant morpho-word has to be identified. Hence an additional cost (m × M/2) is to be added to the cost of the constituent word deletion to get the cost of morpho-word deletion. Here m is the constant of proportionality. Therefore, the average cost is (l1 × L/2) + (m × M/2) + ε.

5. Morpho-word Addition (MA): For morpho-word addition, the cost for dictionary search and access in constituent word addition (by referring to item 2 above) is replaced with the average cost (m × M/2). Moreover, the cost (l2 × Lp/2) in constituent word addition is not considered for morpho-word addition, as these morpho-words are not present in the tagged version. Therefore, the average cost of morpho-word addition is (l1 × L/2) + (m × M/2) + ψ + ε.

6. Morpho-word Replacement (MR): To compute the cost of morpho-word replacement one may refer to the morpho-word addition cost as explained just above. However, two of its components, viz. ψ and ε, need not be considered in the cost for morpho-word replacement. This is because the grammar rules need not be used to find the location of the new word, and consequently no extra space needs to be created. Further, an additional cost (m × M1/2) is to be added for finding out the morpho-word to be replaced, where M1 is the size of the set from which the new morpho-word is to be picked. Therefore, the average cost for morpho-word replacement is (l1 × L/2) + (m × M/2) + (m × M1/2). It may be noted that M1 and M can be equal if the word is replaced with some morpho-word from the same set.

7. Suffix Deletion (SD): Here the work involved is first to identify the right suffix, and then to do the stripping. So the cost is (l1 × L/2) + (k × K/2), where k is the constant of proportionality, and K is the total number of suffixes (as explained in item 8 of Section 5.3).

8. Suffix Addition (SA): Suffix addition is done in two steps. First the position of the word where the suffix has to be added is determined. The average cost for this operation is (l1 × L/2) (as explained above). Next the suffix database is searched for obtaining the appropriate suffix. The average cost therefore is (l1 × L/2) + (k × K/2).

9. Suffix Replacement (SR): In a similar manner, here the average cost may be computed as (l1 × L/2) + (k × K/2) + (k × K1/2). This operation is costlier than SA because here, on top of adding the suffix, some extra computational effort (k × K1/2) is made in identifying the suffix to be replaced, and then in stripping it from the word. It may be noted that K1 and K can be equal if the suffix is replaced with some suffix from the same set.

10. For the Copy operation no computational cost is taken into account.
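Under illustrative, made-up values for the constants (l1, l2, m, k, d, c, ψ and ε are all left unspecified in the text), the ten operation costs can be written down directly. This is only a numerical sketch of the formulae above:

```python
import math

# Illustrative values -- the constants of proportionality are unspecified
# in the text; the numbers below are made up for the sketch.
l1 = l2 = m = k = d = c = 1.0
psi, eps = 50.0, 0.1      # grammar-rule search cost, fixed edit cost
DISK = 1e5                # hard-disk vs RAM access-time factor

def dict_lookup(D):       # binary key search plus one external-storage access
    return d * math.log2(D) + c * DISK

def wd(L):          return l1 * L / 2 + eps                                        # word deletion
def wa(L, Lp, D):   return l1 * L / 2 + l2 * Lp / 2 + dict_lookup(D) + psi + eps   # word addition
def wr(L, Lp, D):   return l1 * L / 2 + l2 * Lp / 2 + dict_lookup(D)               # word replacement
def md(L, M):       return l1 * L / 2 + m * M / 2 + eps                            # morpho-word deletion
def ma(L, M):       return l1 * L / 2 + m * M / 2 + psi + eps                      # morpho-word addition
def mr(L, M, M1):   return l1 * L / 2 + m * M / 2 + m * M1 / 2                     # morpho-word replacement
def sd(L, K):       return l1 * L / 2 + k * K / 2                                  # suffix deletion
def sa(L, K):       return l1 * L / 2 + k * K / 2                                  # suffix addition
def sr(L, K, K1):   return l1 * L / 2 + k * K / 2 + k * K1 / 2                     # suffix replacement
def cp():           return 0.0                                                     # copy
```

Note how the disk factor makes any dictionary-touching operation (wa, wr) dominate the purely in-RAM morpho-word and suffix operations, which is the point of the preceding analysis.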

These individual costs may be used for determining the overall cost of adaptation. Section 5.4 discusses how the cost may be calculated for adaptation between different functional slots and kinds of sentences.

5.4 Cost Due to Different Functional Slots and Kind of Sentences

In this section we discuss the cost of adaptation corresponding to these features by referring to the adaptation rules presented in the various rule tables given in Sections 2.3, 2.4, 2.5, 2.6 and 2.7.


5.4.1 Costs Due to Variation in Kind of Sentences

The adaptation rule Table 2.15 suggests that adapting a particular kind of sentence into another kind requires one or more of the following operations: either addition or deletion of the adverb “nahiin”, or addition or deletion of the morpho-word “kyaa”. Hence the cost table can be generated by computing the costs with respect to the above four adaptation operations only.

By referring to the notation of the adaptation operation costs given in Section 5.3.1, the costs of these operations are:

• The cost (k1) of WA for the adverb “nahiin” is (l1 × L/2) + ψ + ε. Here a dictionary search is not required, as the translation “nahiin” may be stored in some readily accessible location.

• The cost (k2) of WD for the adverb “nahiin” is (l1 × L/2) + ε.

• The cost of MA (for the morpho-word “kyaa”) is ε. As “kyaa” always comes at the beginning of the sentence, no search is required to find the correct position of the word in the retrieved Hindi sentence. We call this cost k3.

• Similarly, the cost of MD of the morpho-word “kyaa” may be computed as k4 = ε.

Table 5.1 gives the adaptation cost due to kind of sentences for different combinations of input and retrieved sentences. The cost of adaptation due to variations in kind of sentences can now be calculated by referring to the required set of adaptation operations for the different cases as given in Table 2.15.


Ret'd ↓ Input →   AFF       NEG       INT       NINT
AFF               0         k1        k3        k1 + k3
NEG               k2        0         k3 + k2   k3
INT               k4        k1 + k4   0         k1
NINT              k2 + k4   k4        k2        0

Table 5.1: Cost Due to Variation in Kind of Sentences
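Table 5.1 can be read off mechanically: each cell is the sum of the operations needed to add or remove “nahiin” and “kyaa”. A sketch of this reading (our own illustration, with symbolic cost names k1..k4):

```python
# Which of "nahiin" (negation) and "kyaa" (interrogation marker) each
# kind of sentence carries.
features = {'AFF': set(),
            'NEG': {'nahiin'},
            'INT': {'kyaa'},
            'NINT': {'nahiin', 'kyaa'}}

def kind_ops(retrieved, inp):
    """Symbolic costs (k1..k4) to turn a retrieved kind into the input kind."""
    have, want = features[retrieved], features[inp]
    ops = []
    if 'nahiin' in want - have: ops.append('k1')   # WA of "nahiin"
    if 'nahiin' in have - want: ops.append('k2')   # WD of "nahiin"
    if 'kyaa' in want - have:   ops.append('k3')   # MA of "kyaa"
    if 'kyaa' in have - want:   ops.append('k4')   # MD of "kyaa"
    return ops
```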

5.4.2 Cost Due to Active Verb Morphological Variation

Below we discuss the cost of adaptation for certain types of verb morphological
variations. In particular, we discuss two groups:

(1) the input and the retrieved sentence have the same tense and the same verb form;

(2) the input and the retrieved sentence have the same tense but different verb forms.

Cost due to same tense same verb form

In Section 2.3.1 different cases of this group have been discussed. Further, in Table

2.3 different adaptation rules for present indefinite to present indefinite have been
illustrated in detail. It has also been argued that all other cases belonging to this
group can be dealt with in a similar way. Below we discuss the adaptation cost for
present indefinite to present indefinite by referring to the corresponding adaptation

rule Table 2.3.

The above-mentioned table suggests that the relevant adaptation operations are copy (CP), suffix replacement (SR) and morpho-word replacement (MR). The costs of these basic operations may be computed in the following way.


• The cost of CP is considered to be 0.

• The cost of SR is (l1 × L/2) + (k × 3/2) + (k × 3/2). Note that, as discussed in item 9 of Section 5.3.1, the term (k × 3/2) occurs twice in determining the average cost for SR. This is because the algorithm has to decide which suffix in the verb of the retrieved sentence needs replacement. This is followed by identification of the appropriate suffix which will replace the present suffix. With respect to the present indefinite case the relevant suffix set is {taa, te, tii}. Hence the above expression is obtained. We shall denote the overall cost of SR by s.

• In a similar way the average cost of morpho-word replacement may be computed to be (l1 × L/2) + (m × 4/2) + (m × 4/2). Note that here the relevant set of morpho-words is {ho, hain, hai, hoon}. Hence the cost factor (m × 4/2) has been considered twice in the overall expression. The overall cost of MR is denoted by n.

Ret'd ↓ Input →  M1S   F1S   M1P   F1P   M2S   F2S   M3S   M3P   F3S   F3P
M1S              0     s     s+n   s+n   s+n   s+n   n     s+n   s+n   s+n
F1S              s     0     s+n   n     s+n   n     s+n   s+n   n     n
M1P              s+n   s+n   0     s     n     s+n   s+n   0     s+n   s
F1P              s+n   n     s     0     n+s   n     s+n   s     n     0
M2S              s+n   s+n   n     s+n   0     s     s+n   n     s+n   s+n
F2S              s+n   n     s+n   n     s     0     s+n   s+n   n     n
M3S              n     s+n   s+n   s+n   s+n   s+n   0     s+n   s     s+n
M3P              s+n   s+n   0     s     n     s+n   s+n   0     s+n   s
F3S              s+n   n     s+n   n     s+n   n     s     s+n   0     n
F3P              s+n   n     s     0     s+n   n     s+n   s     n     0

Table 5.2: Cost Due to Verb Morphological Variation: Present Indefinite to Present Indefinite


The cost table corresponding to present indefinite to present indefinite is given in Table 5.2. It has been formulated in accordance with the adaptation rule Table 2.3. Here the cost of adapting present indefinite to present indefinite is picked according to the gender, number and person of the subjects of the input and the retrieved sentence.

The cost tables for other verb morphological variations under same tense same verb form can be formulated in a similar way. Some relevant points in this regard are discussed below.

• The same Table 5.2 works for adaptation from past indefinite to past indefinite with a slight modification. In this case morpho-word replacement is done from the morpho-word set {thaa, the, thii} instead of the morpho-word set {hain, ho, hoon, hai}. Hence if the value of n is replaced by (l1 × L/2) + (m × 3/2) + (m × 3/2) in the cost Table 5.2, one gets the cost table for past indefinite to past indefinite.

• In case of adaptation from future indefinite to future indefinite, the cost depends upon two operations, CP and SR. Hence the cost n due to morpho-word replacement (MR) is to be removed from the entries of Table 5.2. The cost s of SR in this case is (l1 × L/2) + (k × 8/2) + (k × 8/2), which is obtained by considering the relevant set of suffixes, viz. {oongaa, oongii, oge, ogii, egaa, egii, enge, engii}.

• For all other combinations of verb morphological variations of the same group one more morpho-word replacement is to be added to the cost Table 5.2 in place of the suffix replacement cost s (as discussed in items 3 to 6 of Section 2.3.1). Here the costs of these two morpho-word replacements will vary according to the tense and verb form. For example, in case of present continuous to present continuous the relevant morpho-word sets are {hain, ho, hoon, hai} and {rahaa, rahii, rahe}. The average costs of these morpho-word replacements are (l1 × L/2) + (m × 4/2) + (m × 4/2) and (l1 × L/2) + (m × 3/2) + (m × 3/2), respectively. The costs of morpho-word replacements for the remaining 5 cases (e.g. future continuous to future continuous, past perfect to past perfect etc.) can be computed in a similar way by referring to the appropriate morpho-word sets.
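The entries of Table 5.2 follow mechanically from the present-indefinite paradigm: an SR (cost s) is needed whenever the verb suffix (taa/te/tii) differs, and an MR (cost n) whenever the auxiliary (hoon/hai/ho/hain) differs. A sketch of this reading; the paradigm tables here are our own reconstruction, and note that second person (“tum”) takes plural-style verb agreement, which is why e.g. the (M1P, M2S) cell is n rather than s+n:

```python
# Present-indefinite verb suffix, chosen by gender and agreement number.
suffix = {('M', 'S'): 'taa', ('M', 'P'): 'te',
          ('F', 'S'): 'tii', ('F', 'P'): 'tii'}
# Present-tense auxiliary, chosen by person and number.
aux = {('1', 'S'): 'hoon', ('2', 'S'): 'ho', ('3', 'S'): 'hai',
       ('1', 'P'): 'hain', ('2', 'P'): 'ho', ('3', 'P'): 'hain'}

def agr_number(person, number):
    # Second person ("tum") takes plural-style suffix agreement.
    return 'P' if person == '2' else number

def cell(retrieved, inp):
    """Cost expression of one Table 5.2 cell, e.g. cell('M1S', 'F3P') -> 's+n'."""
    (g1, p1, n1), (g2, p2, n2) = retrieved, inp
    ops = []
    if suffix[(g1, agr_number(p1, n1))] != suffix[(g2, agr_number(p2, n2))]:
        ops.append('s')   # suffix replacement (SR) needed
    if aux[(p1, n1)] != aux[(p2, n2)]:
        ops.append('n')   # auxiliary morpho-word replacement (MR) needed
    return '+'.join(ops) or '0'
```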

Cost due to same tense different verb forms

There are in total 18 verb morphological variations (see Section 2.3.3). To keep our discussion simple, we explain the adaptation cost calculations with the case explained in Section 2.3.3 under the heading “same tense different verb forms”. In particular, we discuss the case where the input sentence is in future indefinite, and the retrieved sentence is either in future continuous or future perfect.

Here the cost of verb morphological variations depends on three adaptation operations. One is suffix addition, and the other two are morpho-word deletions. The costs of these operations are as follows:

• In the case of future indefinite the appropriate suffix set is {oongaa, oongii, oge, ogii, egaa, egii, enge, engii}. Hence the average cost of suffix addition is (l1 × L/2) + (k × 8/2). We denote it as s.

• In case of future continuous, the two morpho-word deletions will be restricted to the sets {rahaa, rahii, rahe} and {hoongaa, hoongii, honge, hogaa, hogii, hoge}, respectively. However, if the retrieved sentence is in future perfect, the two morpho-word deletions are restricted to {chukaa, chukii, chuke} and {hoongaa, hoongii, honge, hogaa, hogii, hoge}, respectively.

• The cost of morpho-word deletion corresponding to {rahaa, rahii, rahe} is (l1 × L/2) + (m × 3/2) + ε. We denote it as m1. The cost of morpho-word deletion is the same for the morpho-word set {chukaa, chukii, chuke}.

• The cost of morpho-word deletion of {hoongaa, hoongii, honge, hogaa, hogii, hoge} is (l1 × L/2) + (m × 6/2) + ε, which we denote by m2.

Therefore, the total cost involved in adaptation from future continuous or future perfect to future indefinite is (s + m1 + m2). The cost will be the same irrespective of the variation in number, gender and person of the subject of the input as well as of the retrieved sentence.

For the reverse case (i.e. the input sentence is either future continuous or future perfect, and the retrieved sentence is future indefinite) the cost will be the sum of two morpho-word additions and one suffix deletion. For these adaptation operations the suffix set and the morpho-word sets are the same as in the above case. The individual costs of these adaptation operations may be calculated in the way explained in Section 5.3.1. Here the total cost will be the sum of the costs of these three adaptation operations, that is: ((l1 × L/2) + (m × 6/2) + ψ + ε) + ((l1 × L/2) + (m × 3/2) + ψ + ε) + ((l1 × L/2) + (k × 8/2)).
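Plugging in the set sizes, the two directions of this adaptation differ exactly by the two ψ terms of the morpho-word additions. A numeric sketch with made-up constants:

```python
l1 = m = k = 1.0               # illustrative constants of proportionality
psi, eps, L = 50.0, 0.1, 6.0   # grammar-search cost, edit cost, sentence length

# Future continuous/perfect -> future indefinite:
# one suffix addition (8 suffixes) + two morpho-word deletions (sets of 3 and 6).
s  = l1 * L / 2 + k * 8 / 2
m1 = l1 * L / 2 + m * 3 / 2 + eps
m2 = l1 * L / 2 + m * 6 / 2 + eps
to_indefinite = s + m1 + m2

# Reverse direction: two morpho-word additions + one suffix deletion.
from_indefinite = (l1 * L / 2 + m * 6 / 2 + psi + eps) \
                + (l1 * L / 2 + m * 3 / 2 + psi + eps) \
                + (l1 * L / 2 + k * 8 / 2)
```

The reverse direction is costlier by 2ψ, since each morpho-word addition must consult the grammar rules to place the new morpho-word.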

In a similar way, the cost can be evaluated for the rest of the cases of verb morphological variations of same tense different verb forms. One may refer to Sections 2.3.3 and 5.3.1 to get the relevant adaptation operations and their costs, respectively.

The adaptation cost with respect to the other two groups (i.e. “different tenses same verb form”, and “different tenses different verb forms”) can be evaluated in a similar way with the help of the rule tables and the sets of adaptation operations as discussed in Section 2.3.2 and Section 2.3.4. To avoid the stereotyped nature of the discussion, we do not present all the other different cases in this report. However, we present below the adaptation cost table (Table 5.3) for verb morphological variation from present indefinite to past indefinite, which belongs to the group “different tenses same verb form”. These values are obtained by referring to the adaptation rule Table 2.4.

Ret'd ↓ Input →  M1S   F1S   M1P   F1P   M2S   F2S   M3S   M3P   F3S   F3P
M1S              w     s+w   s+w   s+w   s+w   s+w   w     s+w   s+w   s+w
F1S              s+w   w     s+w   w     s+w   w     s+w   w     s+w   w
M1P              s+w   s+w   w     s+w   w     s+w   s+w   s+w   w     s+w
F1P              s+w   w     s+w   w     s+w   w     s+w   w     s+w   w
M2S              s+w   s+w   w     s+w   w     s+w   s+w   s+w   w     s+w
F2S              s+w   w     s+w   w     s+w   w     s+w   w     s+w   w
M3S              w     s+w   s+w   s+w   s+w   s+w   w     s+w   s+w   s+w
F3S              s+w   w     s+w   w     s+w   w     s+w   w     s+w   w
M3P              s+w   s+w   w     s+w   w     s+w   s+w   s+w   w     s+w
F3P              s+w   w     s+w   w     s+w   w     s+w   w     s+w   w

Table 5.3: Adaptation Operations of Verb Morphological Variation: Present Indefinite to Past Indefinite

Here the cost s denotes the cost of suffix replacement within {taa, te, tii}, which is (l1 × L/2) + (k × 3/2) + (k × 3/2), and w denotes the cost of morpho-word replacement from the morpho-word set {hoon, hai, ho, hain} to the morpho-word set {thaa, thii, the}, which is (l1 × L/2) + (m × 4/2) + (m × 3/2).

5.4.3 Cost Due to Subject/Object Functional Slot

In this subsection we discuss the adaptation cost mainly for three functional tags under the subject/object functional slot. These tags are genitive case (@GEN), pre-modifying adjective (@AN), and subject/object (@SUB/@OBJ). The relevant adaptation rules have been discussed in Section 2.5.

Cost due to adapting genitive case to genitive case

A transformation from genitive case to genitive case requires eleven adaptation operations, as given in Table 2.8. Below we describe the cost of each of them. Note that the genitive word can be a proper noun, a noun, or a pronoun. We denote this set by P.

1. The average cost of constituent word replacement from the set P by a proper noun. We denote this by w1. Note that in this case no dictionary search is required, as proper nouns are not stored in any dictionary. Hence w1 is computed as (l1 × L/2) + (l2 × Lp/2).

2. The average cost of morpho-word replacement (MR) from {kaa, ke, kii} with itself. We denote this cost by w2. Since the number of morpho-words is 3, w2 may be formulated as (l1 × L/2) + (m × 3/2) + (m × 3/2).

3. The average cost of WR from the set P by a noun. This cost is denoted by w3. Note that in this case a noun dictionary search is necessary, for which the search time is 13.77 (see item 4 of Section 5.3). Further, to access the dictionary a cost (c × 10^5) is required. Hence the total cost is (l1 × L/2) + (l2 × Lp/2) + {(d × 13.77) + (c × 10^5)}.

4. The average cost of WR from the set P by a pronoun. This is denoted by w4. Imitating the case just mentioned above, the cost here may be formulated as (l1 × L/2) + (l2 × Lp/2) + {(d × 6.17) + (c × 10^5)}.


5. The average cost of morpho-word deletion from the set {kaa, ke, kii}. This cost is denoted by w5, which may be formulated simply as (l1 × L/2) + (m × 3/2) + ε.

6. The average cost of morpho-word addition from the set {kaa, ke, kii}. We denote this cost by w6, which is formulated as (l1 × L/2) + (m × 3/2) + ψ + ε.

7. The average cost of suffix replacement for converting a noun into either an oblique noun form or a plural form (refer to Section 2.5.2 and Appendix A). We denote this cost by s1. Since the number of relevant suffixes is four, s1 may be computed as (l1 × L/2) + (k × 4/2) + (k × 4/2).

8. The average cost of suffix addition for converting a noun into either an oblique noun form or a plural form. This cost of suffix addition can be formulated in a way similar to item 7 above. Here this cost is (l1 × L/2) + (k × 5/2) + (k × 5/2), which we denote as s2.

9. The average cost of suffix addition from the set {kaa, ke, kii} is (l1 × L/2) + (k × 3/2). We denote it as s3.

10. The average cost of suffix deletion for converting an oblique noun form to a noun, or plural to singular (see Appendix A and Section 2.5.2). This cost is (l1 × L/2) + (k × 5/2) + ε. We denote it as s4.

11. The average cost of suffix replacement within the set {kaa, ke, kii}. We denote this cost by s5, which is formulated as (l1 × L/2) + (k × 3/2) + (k × 3/2).

The cost table corresponding to genitive case to genitive case is given in Table 5.4. It has been formulated in accordance with the adaptation rule Table 2.8.

Ret'd ↓ Input →   <proper>               N                                           PRON
<proper>          (0 or ({w1} + {w2}))   w3 + {w2} + {s1 or s2}                      w4 + w5 + s3
N                 w1 + {w1}              (0 or ({w3} + {w2} + {(s1 or s2 or s4)}))   w4 + w5 + s3
PRON              w1 + w6                w3 + {s1 or s2} + w6                        (0 or s5 or (w4 + s3))

Table 5.4: Costs Due to Adapting Genitive Case to Genitive Case
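With illustrative, made-up values for the constants, the building blocks of Table 5.4 become concrete numbers. For example, adapting a retrieved genitive pronoun to a proper-noun genitive (cell (PRON, <proper>)) costs w1 + w6; everything numeric below is an assumption, not from the text:

```python
import math

# Illustrative constants (made up); L and Lp are sentence lengths.
l1 = l2 = m = d = c = 1.0
psi, eps, DISK = 50.0, 0.1, 1e5
L, Lp = 6.0, 8.0

w1 = l1 * L / 2 + l2 * Lp / 2                 # WR by a proper noun (no dictionary)
w3 = w1 + d * math.log2(13953) + c * DISK     # WR by a noun (noun dictionary)
w4 = w1 + d * math.log2(72) + c * DISK        # WR by a pronoun
w5 = l1 * L / 2 + m * 3 / 2 + eps             # MD from {kaa, ke, kii}
w6 = l1 * L / 2 + m * 3 / 2 + psi + eps       # MA from {kaa, ke, kii}

cost_pron_to_proper = w1 + w6                 # cell (PRON, <proper>) of Table 5.4
```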

Cost due to adapting subject/object to subject/object

We have considered four possible cases, a noun, a proper noun, a pronoun and a gerund form (PCP1), at the subject or object position (refer to Section 2.5.5). We denote this set by Q. All possible adaptation operations required in this case are listed in Table 2.11, which has been referred to for evaluating the cost of adapting subject/object to subject/object for the possible variations.

1. The average cost of constituent word replacement of the set Q by a noun. This cost is denoted by w1. In this case a noun dictionary search is required, and its search time is 13.77. Hence, w1 is computed as (l1 × L/2) + (l2 × Lp/2) + {(d × 13.77) + (c × 10^5)}.

2. The average cost of constituent word replacement from the set Q to a proper noun. This cost is denoted by w2. Note that in this case no dictionary search is required, as proper nouns are not stored in any dictionary. Hence the cost w2 is computed as (l1 × L/2) + (l2 × Lp/2).


3. The average cost of constituent word replacement from the set Q to a pronoun. This cost is denoted as w3, which is formulated as (l1 × L/2) + (l2 × Lp/2) + {(d × 6.17) + (c × 10^5)} (as in item 1 above).

4. The average cost of constituent word replacement from the set Q to a gerund (PCP1) is (l1 × L/2) + (l2 × Lp/2) + {(d × 12.08) + (c × 10^5)} (as in item 1). Note that here a verb dictionary search is required, and its search time is 12.08. We denote this cost as w4.

5. For converting the singular form of a noun to the plural form, or vice versa (see Appendix A), any one of three different suffix operations is required: suffix replacement (SR), suffix addition (SA) or suffix deletion (SD). The average costs of these operations are:

   • The cost of SR is (l1 × L/2) + (k × 4/2) + (k × 4/2). We denote it as s1.

   • The cost of SA is (l1 × L/2) + (k × 3/2). We denote it as s2.

   • The cost of SD is (l1 × L/2) + (k × 3/2). We denote it as s3.

6. The average cost of adding the suffix “na” to the verb in PCP1 form is (l1 × L/2). Note that here only one suffix is required in any of the cases; therefore, no search is required for deciding about the suffix. This cost is denoted by s4.

Table 5.5 gives the cost due to subject/object to subject/object morpho changes pairwise.


Ret'd ↓ Input →   N                                  <proper>   PRON        PCP1
N                 (0 or ({w1} + {s1 or s2 or s3}))   w2         w3          w4 + s4
<proper>          w1 + {s1 or s2 or s3}              w2         w3          w4 + s4
PRON              w1 + {s1 or s2 or s3}              w2         (0 or w3)   w4 + s4
PCP1              w1 + {s1 or s2 or s3}              w2         w3          (0 or w4)

Table 5.5: Cost of Adaptation Due to Subject/Object to Subject/Object
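Table 5.5 can likewise be read off mechanically: off the diagonal, the input category alone fixes the replacement needed, and only the noun column may add a number-changing suffix operation. A sketch with symbolic cost names (our own illustration; the diagonal is simplified to the free copy case):

```python
def subj_obj_ops(retrieved, inp):
    """Operations for one cell; categories: 'N', 'proper', 'PRON', 'PCP1'."""
    by_input = {'N': ['w1', '{s1 or s2 or s3}'],  # noun replacement + number change
                'proper': ['w2'],                 # proper-noun replacement
                'PRON': ['w3'],                   # pronoun replacement
                'PCP1': ['w4', 's4']}             # gerund replacement + suffix "na"
    if retrieved == inp:
        return ['0']        # diagonal: copy (Table 5.5 also allows the same ops)
    return by_input[inp]
```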

Similarly, we have formulated the cost of adaptation for pre-modifying adjectives. In order to avoid a repetitive description here, we put the content in Appendix E.

Similarly, the cost of adaptation for the other sentence-pattern components which have been discussed in Chapter 2 can be formulated. However, to avoid the stereotyped nature of the discussion, we do not present all the other different cases in this report. The primary advantage of the above analysis is that it paves the way for using adaptation cost as a good yardstick for similarity measurement that may lead to efficient retrieval from an EBMT perspective. The following subsection describes how adaptation cost can be used for this purpose.

5.4.4 Use of Adaptation Cost as a Measure of Similarity

The input sentence may be compared with the example base sentences in terms of functional-morpho tags, their discrepancies may be measured, and the adaptation cost may be estimated using the formulae given above. The example base sentence with the minimum cost of adaptation may then be considered the most similar to the input sentence, and may be retrieved for generating the translation of the given input sentence. Below we compare the proposed scheme with some other similarity measurement schemes. In particular, we consider semantic similarity and syntactic similarity in a way similar to what Manning and Schutze (1999) prescribed for information retrieval.
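The retrieval rule just described (score every stored example by its estimated adaptation cost and pick the cheapest) can be sketched as follows. The names are illustrative, and `adaptation_cost` stands in for the cost formulae of Section 5.4 rather than reproducing them.

```python
def retrieve_best(input_features, example_base, adaptation_cost):
    """Return the (features, translation) pair whose translation is
    cheapest to adapt to the input sentence."""
    return min(example_base, key=lambda ex: adaptation_cost(input_features, ex[0]))
```

Any concrete cost function with the shape of Section 5.4 can be plugged in; the retrieval itself is just an argmin over the example base.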

5.5 The Proposed Approach vis-à-vis Some Similarity Measurement Schemes

5.5.1 Semantic Similarity

Semantic similarity depends on the similarity of the words occurring in the two sentences under consideration. Here we use a purely word-based metric, and develop a vector space model as suggested in (Manning and Schutze, 1999). However, the weighting scheme has been modified (Gupta and Chatterjee, 2002) so that the scheme can be applied to sentences in a meaningful way. Each of the example base sentences and the input sentence are represented in a high-dimensional space, in which each dimension corresponds to a distinct word in the example base. Similarity is calculated as the normalized dot product of the vectors. The method is explained below.

Let Ej : j = 1, 2, . . ., N be the English sentences in the example base, and let E0 be the input sentence in English. We represent E0 and each Ej as an n-dimensional vector in a real-valued space, whose dimensions correspond to the n distinct words W1, W2, . . ., Wn in the example base. Thus Ej = (ej1, ej2, . . ., ejn), for j = 0, 1, 2, . . ., N. The similarity measure between E0 and an example base sentence Ej is defined here as:

    m(E0, Ej) = Σ_{i=1}^{n} e0i × eji        (5.1)

This scheme computes how well the occurrences of word Wi (measured by e0i and eji) correlate in the input and the example base sentences. The coordinates eji are called word weights in the vector space model. The basic information used for word weighting is the word frequency (wji) and the sentence frequency (si).

For a word Wi, the word frequency wji and the sentence frequency si are combined into a single word weight as

    eji = wji × ((N/si) − 1),   if wji ≥ 1;
    eji = 0,                    if wji = 0.        (5.2)

where i = 1, 2, . . ., n and j = 0, 1, 2, . . ., N. Here N is the total number of sentences in the example base. The term ((N/si) − 1) gives maximum weight to words that occur in only one sentence, whereas a word occurring in all the sentences gets zero weight (N/N − 1 = 1 − 1 = 0). The word frequency wji and the sentence frequency si imply the following:

• Word frequency wji is the number of times the word Wi occurs in the jth sentence Ej. It indicates how salient a word is within a given sentence. The higher the word frequency, the more likely it is that the word is a good description of the content of the sentence.

• Sentence frequency si is the number of sentences of the example base in which the word Wi occurs. The sentence frequency is interpreted as an indicator of how informative the word is in the example base. In this respect we distinguish between two types of words: “semantically focused words” and “semantically unfocused words”. A semantically focused word is one that gives meaning to a sentence; such words enrich the vocabulary of the language (for example, verbs, nouns, adjectives and adverbs). Semantically unfocused words introduce structurally standard behaviour; these words are limited in number, predetermined by the grammar of the language (for example, prepositions, conjunctions, pronouns, auxiliary verbs etc.).

Given an input sentence, this similarity may be used for retrieving an appropriate past example from the example base. To achieve this, the similarity of the input sentence is measured against each example base sentence, and the one with the highest similarity score may be retrieved.
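A compact sketch of this word-weighting and matching scheme follows; the function names are ours, not from the thesis implementation. Eq. (5.1) is shown here with the normalization mentioned above, and words absent from the example base are simply skipped.

```python
import math
from collections import Counter

def sentence_frequencies(example_base):
    """s_i: number of example base sentences containing word W_i."""
    sf = Counter()
    for words in example_base:
        sf.update(set(words))
    return sf

def weight_vector(words, sf, N):
    """e_ji = w_ji * (N/s_i - 1), as in Eq. (5.2); unseen words are skipped."""
    wf = Counter(words)
    return {w: wf[w] * (N / sf[w] - 1.0) for w in wf if sf.get(w)}

def semantic_similarity(v0, vj):
    """Normalized dot product of two word-weight vectors."""
    dot = sum(v0[w] * vj.get(w, 0.0) for w in v0)
    n0 = math.sqrt(sum(x * x for x in v0.values()))
    nj = math.sqrt(sum(x * x for x in vj.values()))
    return dot / (n0 * nj) if n0 and nj else 0.0
```

On a toy example base, a sentence sharing a rare word with the input scores highest, and a sentence with no words in common scores zero, mirroring the behaviour of Tables 5.6 and 5.7.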

We have experimented with two input sentences, “I work.” and “Sita sings ghazals.”. Tables 5.6 and 5.7 provide the best five matches for them, respectively.

Retrieved Sentences      Semantic Score
I do this work.          0.9852
I have this work.        0.9746
I will do this work.     0.9543
They work there.         0.7954
The hungry man work.     0.6834
Table 5.6: Best Five Matches by Using Semantic Similarity for the Input Sentence “I work.”


Example Sentence               Semantic Score
Sita sings ghazals.            1.00
Ghazals were nice.             0.775
Sita reads books.              0.733
Sita is eating rice.           0.731
He has been singing ghazals.   0.701
Table 5.7: Best Five Matches by Using Semantic Similarity for the Input Sentence “Sita sings ghazals.”

One may note that the drawback of this scheme is that the outcome depends significantly on the content words, the size of the example base, and the occurrence of the words in the sentences.

5.5.2 Syntactic Similarity

Syntactic similarity pertains to the similarity of the structure of the two sentences under consideration. Let Tj be the tagged version of the English sentence Ej of the example base, and let T0 be the tagged version of the input sentence E0. Here too, every sentence Tj in the example base is expressed as a vector generated from the structure of the sentence. A matching technique similar to that used for semantic similarity is applied to Tj and T0 (instead of Ej and E0, as discussed in the earlier subsection). As a consequence, similarity measures are computed at the structural level, and not at the word level.

The key question is whether all the components are of equal importance in determining the structural similarity. We feel that the contributions of the constituent words to the formation of the sentence are not all the same; in particular, sentences having a similar structure (in terms of verb, auxiliary, adverb etc.) should have a higher similarity value for a given input sentence. Having tried different weighting schemes, we found that the one given in Table 5.8 provides the best result.

POS/syntactic role    Multiplier
Auxv/verb             20
Preposition           10
Adjective/adverb       5
Subject/object         1
Determiner/negative    0.1
Table 5.8: Weighting Scheme for Different POS and Syntactic Roles

Table 5.9 and Table 5.10 give the similarity measures obtained for the example base corresponding to the input sentences “I work.” and “Sita sings ghazals.” when the above weighted syntactic similarity scheme is used.

Retrieved Sentences    Syntactic Score
I walk.                1.000
I do this work.        0.971
I hear the parrot.     0.968
They walk.             0.942
They work there.       0.928
Table 5.9: Best Five Matches by Syntactic Similarity for the Input Sentence “I work.”


Example sentences      Syntactic Score
Sita sings ghazals.    1.000
Sita reads books.      1.000
Mohan eats mangoes.    1.000
Babies drink milk.     0.918
He reads history.      0.907
Table 5.10: Best Five Matches by Syntactic Similarity for the Input Sentence “Sita sings ghazals.”

Note that here the similarity of words is completely ignored, as the main emphasis is laid on the similarity of tense. By resorting to a different weighting scheme one can change the similarity measures to some extent.
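The same vector-space machinery, applied to tag sequences with the Table 5.8 multipliers, can be sketched as follows. The tag names are illustrative stand-ins for the ENGCG labels, and the weight table is copied from Table 5.8.

```python
import math

# Table 5.8 multipliers; tag names are illustrative stand-ins
MULT = {"AUXV": 20, "V": 20, "PREP": 10, "ADJ": 5, "ADV": 5,
        "SUBJ": 1, "OBJ": 1, "DET": 0.1, "NEG": 0.1}

def tag_vector(tags):
    """Weighted vector over syntactic tags; repeated tags accumulate weight."""
    v = {}
    for t in tags:
        v[t] = v.get(t, 0.0) + MULT.get(t, 1.0)
    return v

def syntactic_similarity(tags0, tagsj):
    """Normalized dot product of weighted tag vectors."""
    v0, vj = tag_vector(tags0), tag_vector(tagsj)
    dot = sum(v0[t] * vj.get(t, 0.0) for t in v0)
    n0 = math.sqrt(sum(x * x for x in v0.values()))
    nj = math.sqrt(sum(x * x for x in vj.values()))
    return dot / (n0 * nj) if n0 and nj else 0.0
```

Because the verb weight dominates, two sentences agreeing on the verb slot score close to 1 even when a light element such as a determiner differs, which is exactly the behaviour shown in Tables 5.9 and 5.10.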

5.5.3 A Proposed Approach: Cost of Adaptation Based Similarity

The above studies reveal that neither the semantic measure nor the syntactic measure provides an effective scheme for calculating the similarity between two sentences. In both cases the measurement score depends to a significant extent on the word weights, which in turn depend on the sentences in the example base. Thus the schemes become highly subjective. We, therefore, look for a method that provides a more objective measurement of similarity. We consider the cost of adaptation for this purpose, which is seen as the number of operations required for transforming a retrieved translation example into the translation of a given input sentence. We continue with the adaptation operations discussed in Section 5.3.1. The following example illustrates how the functional-morpho tags of an input (IE) and a retrieved example base sentence (RE) can be used for determining the appropriate adaptation operations.

Input sentence (IE): Ram is driving the car at a high speed.
Retrieved English sentence (RE): He is sitting on the chair.
Retrieved Hindi sentence (RH): wah kursii par baith rahaa hai
                               (he) (chair) (on) (sit) (..ing) (is)

Table 5.11 gives the functional-morpho tags of the IE and the RE. To generate the translation “ram bahut tezii se gaadii chalaa rahaa hai” of the input sentence, the following adaptation operations are required.

IE: Ram is driving the car at a high speed.    RE: He is sitting on the chair.
Ram      @SUBJ <Proper> N SG "Ram"             he        @SUBJ PRON MASC SG3 "he"
is       @+FAUXV V PRES "be"                   is        @+FAUXV V PRES "be"
driving  @-FMAINV V PCP1 "drive"               sitting   @-FMAINV V PCP1 "sit"
the      @DN> ART "the"                        the       @DN> ART "the"
car      @OBJ N SG "car"                       ...
at       @ADVL PREP "at"                       on        @ADVL PREP "on"
a        @DN> ART "a"                          ...
high     @AN> A ABS "high"                     ...
speed    @<P N SG "speed"                      chair     @<P N SG "chair"
Table 5.11: Functional-morpho Tags for the Input English Sentence (IE) and the Retrieved English Sentence (RE)

(a) Whenever a functional tag along with the morpho tags matches in both sentences but the corresponding words are different, a constituent word replacement needs to be done in the retrieved Hindi translation. For example, “driving” and “sitting” are both verbs in their present continuous form “@-FMAINV V PCP1”. But since the root verbs “drive” and “sit” are different, a constituent word replacement is required in the retrieved Hindi translation RH. Therefore, the scheme replaces “baith” with “chalaa”[5]. In a similar way “chair” and “speed” have the same functional-morpho tag “@<P N SG”. Hence, here too, the scheme replaces “kursii” with “tez”.

(b) If the functional tags match, but the corresponding morpho tags do not, then either a constituent word replacement or some suffix modification (or both) needs to be done to modify the retrieved Hindi translation. For example, “Ram” and “he” are both subjects, but “Ram” is a proper noun while “he” is a pronoun. Hence a constituent word replacement is required, and the scheme replaces “wah” with “ram”.

(c) Whenever a functional tag is present in IE but not in RE, the corresponding word in Hindi has to be retrieved from an appropriate word dictionary and added at the appropriate position in the Hindi sentence RH. For example, the object “car”, which comes before the preposition “at” in the IE sentence, does not complement the preposition, whereas the object “chair”, which comes after the preposition “on” in the RE sentence, does complement it. Thus, the two objects “car - @OBJ N SG” and “chair - @<P N SG” differ in their functional tags. Therefore a constituent word addition is required in the Hindi sentence RH: the word “gaadii” has to be added, which serves as the object before the Hindi verb “chalaa”.

(d) Whenever a functional tag is present in RE, but is not present in IE, the corresponding word in Hindi has to be deleted from the Hindi sentence RH.

[5] The Hindi translation of the word “drive” is to be taken from the verb dictionary.

(e) Other necessary adaptation operations are WA (“bahut”) and SA (“tez” → “tezii”). These operations can be identified in a similar way.

After identifying all the adaptation operations required for adapting RE to IE, we calculate the cost of each operation, referring to the costs of the different operations listed in Section 5.3.1 and Section 5.4.
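Rules (a) to (d) above can be phrased as a simple comparison over the functional-morpho tags of IE and RE. The sketch below is our own rendering, not the thesis implementation: each tagged sentence is a mapping from a functional tag to its (morpho tags, root word) pair, and the output labels name the operations of Section 5.3.1.

```python
def adaptation_ops(ie_tags, re_tags):
    """Classify the adaptation operations needed to adapt RE's translation to IE.

    ie_tags / re_tags map a functional tag to a (morpho-tags, root-word) pair.
    The rules mirror cases (a)-(d) above; names are illustrative only.
    """
    ops = []
    for ft, (morpho, word) in ie_tags.items():
        if ft in re_tags:
            r_morpho, r_word = re_tags[ft]
            if morpho == r_morpho and word != r_word:
                ops.append(("WR", ft))       # (a) same tags, different word
            elif morpho != r_morpho:
                ops.append(("WR/SA", ft))    # (b) replacement and/or suffix change
        else:
            ops.append(("WA", ft))           # (c) word must be added
    for ft in re_tags:
        if ft not in ie_tags:
            ops.append(("WD", ft))           # (d) word must be deleted
    return ops
```

Running it on a pared-down version of Table 5.11 reproduces the four cases discussed above, one operation per differing slot.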

In order to apply the cost of adaptation to design an appropriate retrieval scheme, one needs a measurement of the different constants of proportionality (i.e. l1, l2, d etc.) as described in Section 5.3.1. Evidently, these constants depend upon the underlying computing system. Hence in our discussion we keep them independent of any particular platform. We further make a few assumptions in order to keep the calculations relatively simple:

• We assume that linear search operations in the RAM are equally costly irrespective of the size of each data record. Hence we assume that the constants l1, l2, d, m, and k are all equal; let them have a common value α.

• It has already been discussed in Section 5.3 that hard disc operations are costlier than RAM operations by an order of 10^5. Hence we denote the constant associated with retrieval from the external storage as c × 10^5, where c ≈ α.

• The costs ψ and ε are treated as independent quantities. Here ε is a very small quantity with ε ≤ α. On the other hand, ψ is considered a large quantity, as discussed in item 5 of Section 5.3.


Table 5.12 and Table 5.13 give the best five matches when the retrieval is made by the cost of adaptation based scheme, using the same input sentences and the same example base. Cost values here are measured according to the scheme given in Section 5.4.

Retrieved sentences                        Adaptation cost
I have been working for four hours.        23α + 4ε
I have not been working for four hours.    30α + 5ε
This works.                                13.17α + 10^5 c
The man works.                             13.67α + 10^5 c
I walk.                                    16.27α + 10^5 c
Table 5.12: Retrieval on the Basis of the Cost of Adaptation Based Scheme for the Input Sentence “I work.”

Retrieved sentence                Adaptation cost
Sita sings ghazals.               0
Sita sang ghazal.                 9α
He has been singing ghazals.      13α + ε
Sita is singing melodious song.   22α + 10^5 c + 2ε
Sita reads books.                 32.85α + 2×(10^5 c)
Table 5.13: Retrieval on the Basis of Cost of Adaptation Based Similarity for the Input Sentence “Sita sings ghazals.”

To generate the translation “main kaam kartaa hoon” of the input sentence “I work.”, the adaptation operations required for adapting each of the sentences given in Table 5.12 above are as follows:

• For adapting “I have been working for four hours.” ∼ “main chaar ghante se kaam kar rahaa hoon” to the input sentence, five operations are required. These operations are: SA (kar → kartaa), WD (chaar), WD (ghante), WD (se) and MD (rahaa). The total adaptation cost is therefore (5.5α) + (4α + ε) + (4α + ε) + (4α + ε) + (5.5α + ε) = 23α + 4ε. One may refer to Section 5.3.1 for this computation.

• In case of adapting the second retrieved sentence “I have not been working for four hours.” ∼ “main chaar ghante se kaam nahi kar rahaa hoon” to the input sentence, six operations are required. These operations are: SA (kar → kartaa), WD (chaar), WD (ghante), WD (se), WD (nahiin) and MD (rahaa). Hence the total adaptation cost is (6α) + (4.5α + ε) + (4.5α + ε) + (4.5α + ε) + (4.5α + ε) + (6α + ε) = 30α + 5ε.

• For adapting “This works.” ∼ “yah kaam kartaa hai” to the input sentence, two operations WR (yah → main) and MR (hai → hoon) are required. The total adaptation cost is therefore (9.17α + c × 10^5) + (2α + 2α) = 13.17α + c × 10^5.

• Similarly, one can identify the appropriate adaptation operations for adapting the last two sentences to the input sentence.
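The α and ε bookkeeping in the bullet computations above can be checked mechanically by representing each operation's cost as a pair of coefficients; the numeric values below are copied from those bullets, not derived here.

```python
# Each operation's cost as (alpha coefficient, epsilon coefficient),
# taken from the two bullet computations above.
first = [(5.5, 0), (4, 1), (4, 1), (4, 1), (5.5, 1)]               # SA, WD, WD, WD, MD
second = [(6, 0), (4.5, 1), (4.5, 1), (4.5, 1), (4.5, 1), (6, 1)]  # SA, WD x4, MD

def total(ops):
    """Sum the alpha and epsilon coefficients separately."""
    return (sum(a for a, _ in ops), sum(e for _, e in ops))

assert total(first) == (23.0, 4)   # 23*alpha + 4*epsilon
assert total(second) == (30.0, 5)  # 30*alpha + 5*epsilon
```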

We now consider the costs of adaptation for the best five sentences that are
retrieved using semantic and syntactic similarity schemes as given in Sections 5.5.1
and 5.5.2.


Retrieved Sentences Based on Semantic Similarity     Adaptation Cost
I do this work.                                      23.77α + 10^5 c + 2ε
I have this work.                                    25.4α + 10^5 c + ψ + 3ε
I will do this work.                                 22.9α + 10^5 c + ψ + 2ε
They work there.                                     21.17α + 10^5 c + ε
The hungry man work.                                 21.67α + 10^5 c + ε

Retrieved Sentences Based on Syntactic Similarity    Adaptation Cost
I walk.                                              16.27α + 10^5 c
I do this work.                                      23.27α + 10^5 c + 2ε
I hear the parrot.                                   19.77α + 2(10^5 c)
They walk.                                           28.94α + 2(10^5 c)
They work there.                                     21.17α + (10^5 c) + ε
Table 5.14: Cost of Adaptation for the Retrieved Best Five Matches for the Input Sentence “I work.” by Using Semantic and Syntactic Based Similarity Schemes

First we consider the input sentence “I work.”. Table 5.14 provides the costs of adaptation of the best five matches under the semantic similarity and syntactic similarity based measurement schemes. An examination of the adaptation costs suggests that all five sentences retrieved by the semantic similarity based scheme are costlier to adapt than all the sentences retrieved by the cost of adaptation based scheme (see Table 5.12). On the other hand, the sentence “I walk.”, which is retrieved as the best matching sentence under the syntactic similarity based scheme, actually requires more computational effort than the best four sentences given by the cost of adaptation based scheme (see Table 5.12).


Retrieved Sentences Based on Semantic Similarity     Adaptation Cost
Sita sings ghazals.                                  0
Ghazals were nice.                                   29.08α + 10^5 c + ψ + ε
Sita reads books.                                    32.85α + 2(10^5 c)
Sita is eating rice.                                 46.85α + 2(10^5 c) + ε
He has been singing ghazals.                         13α + ε

Retrieved Sentences Based on Syntactic Similarity    Adaptation Cost
Sita sings ghazals.                                  0
Sita reads books.                                    32.85α + 2(10^5 c)
Mohan eats mangoes.                                  39.85α + 2(10^5 c)
Babies drink milk.                                   39.85α + 2(10^5 c)
He reads history.                                    39.85α + 2(10^5 c)
Table 5.15: Cost of Adaptation for the Retrieved Best Five Matches for the Input Sentence “Sita sings ghazals.” by Using Semantic and Syntactic Based Similarity Schemes

In a similar way, Table 5.15 provides the costs of adaptation for the best five matches retrieved by the semantic and syntactic based schemes for the input sentence “Sita sings ghazals.”. One may note the following by comparing Table 5.13 and Table 5.15.

• “Sita sings ghazals.” is retrieved as the best match by all three schemes because it is already there in the example base.

• The second best match “Ghazals were nice.” under the semantic similarity based scheme is actually very expensive to adapt, as its cost contains a ψ term. This term occurs since the sentence concerned has a structure different from that of the input sentence.

• The sentences retrieved by the syntactic similarity based scheme are costlier

to adapt than the sentences retrieved by the cost of adaptation based scheme.

The above results clearly demonstrate the superiority of the proposed scheme

over the semantic and syntactic similarity based schemes.

5.5.4 Drawbacks of the Proposed Scheme

One major drawback of the proposed scheme is that, for each input sentence, the scheme essentially boils down to evaluating the cost of adaptation for all the sentences in the example base. This makes retrieval from a large example base computationally very expensive. On the other hand, the use of the cost of adaptation as a potential yardstick for measuring similarity is too strong an argument to be ignored with respect to Example Based Machine Translation. This, therefore, necessitates the development of a filtration technique so that, given an input sentence, the example base sentences that are difficult to adapt are discarded. The adaptation scheme can then be applied only to the remaining sentences of the example base. We have designed a systematic two-level filtration scheme for this purpose.

It is clear from the costs of the adaptation operations mentioned in Section 5.3.1 that constituent word addition and constituent word replacement are the costliest adaptation operations in terms of computational cost, with the former being costlier than the latter. Hence the filters are designed to retrieve those example base sentences whose adaptation to the given input sentence requires fewer constituent word addition and constituent word replacement operations.

The two-level scheme works as follows.


• In the first level, the algorithm retrieves sentences that are structurally similar to the input sentence, thereby reducing the number of constituent word additions in the adaptation of the retrieved example. Here functional tags (FTs) are used to determine the structural similarity. We call this step “measurement of structural similarity”.

• In the second level, only the sentences passed by the first filter are considered for further processing. Here the dissimilarity of each of these sentences with the input sentence is measured. The lower the dissimilarity score of an example, the lower the cost of adapting it to generate the required translation. The dissimilarity is measured on the basis of tense and POS tags along with their root words. Henceforth, for notational convenience, we shall refer to these features as the characteristic features of a sentence. This step is denoted as “measurement of characteristic feature dissimilarity”.

The following examples illustrate the necessity of the two levels of the filtration

scheme. Let us consider the following two sentences:

A: A beautiful girl is going to her home.
   ∼ ek sundar ladkii apne ghar jaa rahii hai
     (a) (beautiful) (girl) (own) (home) (go) (...ing) (is)

B: This home is very beautiful.
   ∼ yeh ghar bahut sundar hai
     (this) (home) (very) (beautiful) (is)

Even though there are two common words (beautiful and home) between these two sentences, adapting the translation of sentence B to generate the translation of A is not an easy task because of their structural difference. Adapting the translation of B to generate the translation of A requires eight adaptation operations: WR (yeh → ek), WA (sundar), WR (ghar → ladkii), WA (ghar), WA (apne), WA (jaa), MA (rahii), and WD (bahut). Hence the total cost of adaptation for adapting B to A is 84.79α + 4ψ + 4(10^5 c) + 7ε, by referring to Section 5.3.1.

Let us now consider another sentence:

C: This girl is going to office.
   ∼ yeh ladkii office jaa rahii hai
     (this) (girl) (office) (go) (...ing) (is)

This sentence also has two words (girl and going) in common with sentence A. But its adaptation for generating the translation of A is computationally less expensive than the adaptation of B. In order to adapt C to A, only four adaptation operations are required: WR (yeh → ek), WA (sundar), WR (office → ghar) and WA (apne). The total cost of adaptation for adapting C to A is 46.08α + 2ψ + 3(10^5 c) + 2ε. Evidently, this cost is much less than the cost of adapting B to A as computed above. This happens because of the structural similarity and the commonality of some characteristic features of sentence C with A.
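The two totals above can be compared numerically under any choice of constants consistent with Section 5.3 (ψ dominant, ε tiny, one external access on the order of 10^5). The constant values below are assumptions for illustration only, not measured quantities.

```python
# Illustrative constants only (assumed): psi dominant, epsilon tiny,
# one disc/dictionary access on the order of 1e5.
alpha, eps, psi, c = 1.0, 0.5, 2.0e5, 1.0

cost_B_to_A = 84.79 * alpha + 4 * psi + 4 * (1e5 * c) + 7 * eps
cost_C_to_A = 46.08 * alpha + 2 * psi + 3 * (1e5 * c) + 2 * eps

# C shares structure and characteristic features with A, so it is far cheaper to adapt.
assert cost_C_to_A < cost_B_to_A
```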

The above discussion suggests that one of the filters alone is not sufficient; for appropriate filtration both levels are required. The next section discusses the proposed filtration scheme in detail.

5.6 Two-level Filtration Scheme

We use the following notations to describe the filtration scheme. Let L denote a natural language (here English), and let e ∈ L denote an input sentence. S denotes the example base, which is a finite subset of L, and d ∈ S is an example base sentence. The following subsections discuss the above-mentioned levels of the filter.

5.6.1 Measurement of Structural Similarity

In this step, the aim is to filter the example base S to produce a subset of S whose sentences are structurally similar to e. The example base is partitioned into equivalence classes of sentences that have the same functional tags (e.g. subject, object, verb etc.). This partition is filtered, and the classes that are similar in structure to the equivalence class of the input sentence are identified. Here too we have used the ENGCG parser for finding the functional tags (FTs).

Given a sentence x ∈ L, let φ(x) be the bag of functional tags present in x. We use the term “bag” in place of “set” because in a bag repetition of elements is allowed. For example, if x is “My brother helps me in my studies.” then φ(x) = {@GN>, @SUBJ, @+FMAINV, @OBJ, @ADVL, @GN>, @<P}. Let F be the set of possible bags of functional tags for the language L.

Note that φ induces an equivalence relation on L. Two sentences e ∈ L and e′ ∈ L are said to be equivalent (notationally, eEe′) if they have the same bag of functional tags. Let [e] denote the equivalence class corresponding to the sentence e, i.e. [e] = {e′ ∈ L | eEe′}. For example, the sentences “He drank milk.”, “Sita eats mangoes.”, “They are playing football.” and “Will Ram marry Sita?” are members of the same equivalence class because all four sentences have the same functional tag representation φ(.) = {@SUBJ, @+FMAINV, @OBJ}.

Since our focus is on the example base S, the function φ and the equivalence classes are restricted to the set S. The restriction of φ to S is also denoted by φ, and it induces an equivalence relation on S, whose equivalence classes are also denoted by [d] for d ∈ S. Let S′ = {[d] | d ∈ S} be the partition of S into equivalence classes.
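The partition S′ can be computed by keying each sentence on a hashable encoding of its FT bag. This is a minimal sketch under our own naming, not the thesis implementation.

```python
from collections import Counter, defaultdict

def ft_bag(tags):
    """Hashable encoding of the bag phi(x): each tag paired with its multiplicity."""
    return frozenset(Counter(tags).items())

def partition_example_base(example_base):
    """Group (sentence, tags) pairs into the equivalence classes of S'."""
    classes = defaultdict(list)
    for sentence, tags in example_base:
        classes[ft_bag(tags)].append(sentence)
    return classes
```

Sentences such as “He drank milk.” and “Sita eats mangoes.” land in the same class because their FT bags coincide.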

For a given input sentence e and an example base sentence d ∈ S, |φ(e) ∩ φ(d)| denotes the number of common FTs between [e] and [d]. Let m = max_{d∈S}(|φ(e) ∩ φ(d)|), i.e. the maximum number of common FTs between φ(e) and φ(d). From the partitioned example base S′ a new set S′_e is constructed such that S′_e = {[d] : |φ(e) ∩ φ(d)| ≥ ⌈m/2⌉}. Here ⌈m/2⌉ denotes the smallest integer greater than or equal to m/2. Thus S′_e is constructed in such a manner that it contains all those equivalence classes for which the number of common FTs is between ⌈m/2⌉ and m; all the equivalence classes having fewer than ⌈m/2⌉ common FTs are discarded. We claim that the sentences thus discarded (those not in S′_e) have a higher cost of adaptation. The proof is given below.

Let n = |φ(e)|, i.e. the number of functional tags present in e; evidently, n ≥ m. Let us now consider the examples left out of S′_e (i.e. the set (S′ − S′_e)). Of all the examples belonging to this set, the one with the least cost of adaptation should have the following properties:

(a) It should have the maximum number of common FTs with e. We assume that there exists a sentence with (⌈m/2⌉ − 1) FTs common with e.

(b) We further assume that for all these common FTs, the underlying words are the same as in e.

Therefore, to adapt any such sentence to generate a translation of e, the words corresponding to all the other functional tags are to be added from dictionaries. This means (n − (⌈m/2⌉ − 1)) constituent word additions are required. Therefore, the cost of adaptation of any such sentence will be approximately[6] (n − ⌈m/2⌉ + 1) × WA, where WA is the cost of constituent word addition, and is ((l1 × L/2) + (l2 × Lp/2) + {(d × log2 D) + (c × 10^5)} + ψ + ε). For details of this cost, see item 2 of Section 5.3.1. Let us denote this cost as C1. This cost will certainly be more than the cost of adaptation for a sentence having m common FTs with e, i.e. a sentence belonging to the equivalence classes of the set S′_e selected by the first filter. The argument supporting this fact is as follows.

If all the words corresponding to the m common FTs are different from the constituent words of the input sentence, then the cost of adaptation will be approximately the sum of m constituent word replacements and (n − m) constituent word additions, i.e. m × WR + (n − m) × WA, where WR is the cost of constituent word replacement, and is ((l1 × L/2) + (l2 × Lp/2) + {(d × log2 D) + (c × 10^5)}) (see items 2 and 3 of Section 5.3.1). Therefore, WA = WR + ψ + ε. We denote this cost as C2. Thus, the value of C2 is n × WR + (n − m) × (ψ + ε). Now let us consider the difference C1 − C2.

C1 − C2 = (n − ⌈m/2⌉ + 1) × WA − (n × WR + (n − m) × (ψ + ε))
        = (n − ⌈m/2⌉ + 1) × (WR + ψ + ε) − (n × WR + (n − m) × (ψ + ε))
        ≈ (m/2 + 1) × (ψ + ε) − (m/2 − 1) × WR > 0,

since ψ is greater than the cost of dictionary search, i.e. {(d × log2 D) + (c × 10^5)} (see Section 5.3).
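The inequality can also be checked numerically. The constants below are assumptions chosen only to respect the stated orderings (ψ larger than one dictionary search, ε ≤ α); they are not measured values.

```python
import math

# Assumed constants, respecting the orderings of Section 5.3
alpha = 1.0
eps = 0.5
l1 = l2 = d = c = alpha
L, Lp, D = 10, 8, 14000   # sentence length, phrase length, dictionary size (assumed)
psi = 2.0e5               # psi exceeds the dictionary-search cost d*log2(D) + c*1e5

WR = l1 * L / 2 + l2 * Lp / 2 + d * math.log2(D) + c * 1e5  # word replacement cost
WA = WR + psi + eps                                          # word addition cost

def C1(n, m):
    """Cost of adapting a sentence with ceil(m/2)-1 common FTs (all words shared)."""
    return (n - math.ceil(m / 2) + 1) * WA

def C2(n, m):
    """Cost of adapting a sentence with m common FTs (all words different)."""
    return n * WR + (n - m) * (psi + eps)

# C1 - C2 > 0 for every feasible pair (m <= n)
assert all(C1(n, m) > C2(n, m) for n in range(2, 9) for m in range(1, n + 1))
```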

It can also be noted that the sentence having m common FTs does not necessarily have the minimum cost of adaptation. For this, consider the following cases:

• The sentence having m common FTs has all different words. The approximate cost is the sum of the cost of m constituent word replacements and the cost of (n − m) constituent word additions, i.e. m × WR + (n − m) × WA.

• The sentence having ⌈m/2⌉ common FTs has all the same words. In this case the approximate cost is that of (n − ⌈m/2⌉) constituent word additions, i.e. (n − ⌈m/2⌉) × WA.

[6] We have not added other costs, such as the costs of suffix operations and morpho-word operations.

By a similar argument to the one given above, it can be shown that the cost of the latter case may be less than that of the former. Hence the sentences having ⌈m/2⌉ common FTs cannot be discarded at this level.

The significance of this filtration step is that, corresponding to any sentence x that has been discarded by this filter, there is at least one sentence x′ that will be considered for the next level of filtration, and the cost of adaptation of x′ is much less than that of x.

The equivalence classes passed by the first filter are subjected to further analysis in the second level of the filter, as mentioned below.
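The first-level filter thus reduces to computing FT-bag overlaps and keeping every class at or above the ⌈m/2⌉ threshold. A sketch under our own naming:

```python
from collections import Counter
from math import ceil

def common_fts(bag1, bag2):
    """|phi(e) ∩ phi(d)|, counting repeated tags with their multiplicity."""
    c1, c2 = Counter(bag1), Counter(bag2)
    return sum(min(n, c2[t]) for t, n in c1.items())

def structural_filter(input_fts, class_fts):
    """class_fts: one FT bag per equivalence class; returns indices of survivors."""
    overlaps = [common_fts(input_fts, fts) for fts in class_fts]
    m = max(overlaps, default=0)
    return [i for i, ov in enumerate(overlaps) if ov >= ceil(m / 2)]
```

A class with no FT overlap at all is discarded, while any class within half of the best overlap survives to the second level.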

5.6.2 Measurement of Characteristic Feature Dissimilarity

This filter arranges the sentences of the set S′_e on the basis of the characteristic features (see Section 5.5.4) of a sentence. We have considered the following characteristic features: POS with its root word – main verb (V), noun (N), adverb (ADV), adjective (A), pronoun (PRON), determiner (DET), preposition (P), gerund (PCP1), participles (PCP1, PCP2) – and the tense and form of the sentence. Note that we have considered those main verbs whose root forms are other than “be” or “have”. We stick to the notations provided by the ENGCG parser (see Appendix B). For convenience of presentation, we denote the above-mentioned ten characteristic features as p1(y), p2(y), . . ., p10(y), where y is the root word of the corresponding characteristic feature pi. For example, consider the sentence “I am sitting on the old chair.”. This sentence has six characteristic features: p5(“I”), p1(“sit”), p7(“on”), p4(“old”), p2(“chair”) and p10(“present continuous”). Here the verb “sit” is the root form of the verb “sitting”.

In the following, we define a dissimilarity measure so that the sentences belonging to S′_e can be arranged in increasing order of dissimilarity score. Note that the smaller the dissimilarity score, the lower the cost of adapting the corresponding sentence to generate the translation of the input sentence e.

Let M be the set of possible bags of characteristic features. We define a mapping θ : L → M such that θ(x) = {pi(y) : pi(y) is a characteristic feature of sentence x}. Let S_e = ∪_{[d]∈S′_e} [d], the set of sentences of all the equivalence classes of S′_e, and let the restriction of θ to S_e also be denoted by θ. Further, a mapping η : L × S_e → M is defined such that η(a, b) = {pi(y) : pi(y) ∈ θ(a) and pi(y) ∉ θ(b)}, i.e. η(a, b) contains those characteristic features that are present in θ(a) but not in θ(b), where a ∈ L and b ∈ S_e.

For example, let the input sentence e be “The old man is sitting on the old chair.”, and let the sentence x from the example base be “He is sitting on my bed.”. Here the characteristic feature bags are:

θ(e) = {p4(“old”), p2(“man”), p1(“sit”), p7(“on”), p4(“old”), p2(“chair”), p10(“present continuous”)}
θ(x) = {p5(“he”), p1(“sit”), p7(“on”), p5(“I”), p2(“bed”), p10(“present continuous”)}

Therefore, η(e, x) = {p4(“old”), p2(“man”), p4(“old”), p2(“chair”)}.
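Since θ(·) produces bags, η(e, x) in the worked example keeps both copies of p4(“old”). One way to reproduce this behaviour is as a multiset difference; this is our reading of the definition, since the thesis states it with set membership.

```python
from collections import Counter

def eta(theta_a, theta_b):
    """Features of theta(a) left unmatched by theta(b), kept with multiplicity."""
    ca, cb = Counter(theta_a), Counter(theta_b)
    out = []
    for feat, n in ca.items():
        out.extend([feat] * max(0, n - cb.get(feat, 0)))
    return out
```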

For e ∈ L, we define a dissimilarity function dis_e : S_e → R by

    dis_e(d) = ( Σ_{pi(y) ∈ η(e,d)} wi ) + (ψ × (|φ(e)| − |φ(e) ∩ φ(d)|))        (5.3)

dis_e(d) gives the dissimilarity score of d ∈ S_e with respect to e. Here ψ is the cost of finding the location of a new word, which has already been explained in item 5 of Section 5.3, and wi is the weight assigned to the characteristic feature pi(.). The significance of the dissimilarity function dis_e(d) and of the weights wi is explained below.

As mentioned earlier, two of the costliest operations from the adaptation point of view are constituent word addition and constituent word replacement. Thus, the dissimilarity measure is designed to focus on these two operations. The second term in the above measure corresponds to the approximate cost involved in constituent word addition (finding the appropriate position). Further, it should be noted that the cost of adaptation varies with the POS of the word to be added/replaced, because this cost depends on the dictionary size of the concerned POS: the bigger the dictionary, the longer the search time, and hence the costlier the required operation. Thus, for the characteristic features pi(y), i = 1, 2, ..., 9, a weight wi is assigned depending on the respective dictionary size. Table 5.16 lists the weights of these characteristic features according to the search times of the respective dictionaries.

Note that tense and form (p10) identification cannot be done through dictionary search; appropriate rules have to be developed for this purpose. In our implementation, we have used 65 rules to take care of the sentences in our example base. Therefore, the weight 6.02 (log2(65) ≈ 6.02) is assigned to the characteristic feature p10.

5.6. Two-level Filtration Scheme

POS                        pi    Dictionary size    Weights, wi
Verb (V)                   p1    4330               log2(4330) = 12.08
Noun (N)                   p2    13953              log2(13953) = 13.77
Adverb (A)                 p3    1027               log2(1027) = 10.00
Adjective (ADJ)            p4    5449               log2(5449) = 12.41
Pronoun (PRON)             p5    72                 log2(72) = 6.17
Determiner (DET)           p6    72                 log2(72) = 6.17
Preposition (P)            p7    87                 log2(87) = 6.44
Gerund (PCP2)              p8    4330               log2(4330) = 12.08
Participles (PCP1, PCP2)   p9    4330               log2(4330) = 12.08

Table 5.16: Weights Used for Characteristic Features

Below we illustrate the significance of these weights in computing the dissimilarity score between the input e and a sentence of the set Se. Let the input sentence e be “This girl is my sister.”, and let two sentences d1 and d2 from the example base be:
d1 : This boy is my brother.
d2 : That girl is her sister.

For these sentences the characteristic feature bags are:

θ(e) = {p6 (“this”), p2 (“girl”), p5 (“I”), p2 (“sister”), p10 (“simple present tense”)}

θ(d1 ) = {p6 (“this”), p2 (“boy”), p5 (“I”), p2 (“brother”), p10 (“simple present tense”)}

θ(d2 ) = {p6 (“that”), p2 (“girl”), p5 (“she”), p2 (“sister”), p10 (“simple present tense”)}

Note that instead of the words “my” and “her” their root words “I” and “she”,

respectively have been considered above.

Therefore, η(e, d1) = {p2(“girl”), p2(“sister”)} and η(e, d2) = {p6(“this”), p5(“I”)}


Thus, dise(dj) = ( Σpi(y)∈η(e,dj) wi ) + (ψ × |4 − 4|) = Σpi(y)∈η(e,dj) wi for j = 1, 2. It is to be noted that the contribution of the second term is zero for both d1 and d2, since both these sentences have the same FTs as e.

Let us now consider two cases:

1. The weights corresponding to all the features are the same, say wi = 1 for all i = 1, 2, ..., 10. In this case, dise(dj) = 2 for j = 1, 2.

2. The weights are taken as given in Table 5.16. In this case, dise(d1) = w2 + w2 = 13.77 + 13.77 = 27.54 and dise(d2) = w5 + w6 = 6.17 + 6.17 = 12.34.

Note that in the first case the dissimilarity scores are the same. But from the adaptation point of view, the cost involved in adapting d2 to e is much less than that for d1. This is because d1 has a determiner and a pronoun characteristic feature in common with e, so its two noun features must be replaced, while d2 has two noun characteristic features in common with e, so only a determiner and a pronoun feature need replacement.

Since the search and access time for a dictionary depends upon the size of the dictionary under consideration, one has to look at the sizes of the dictionaries concerned. It is a general observation that the noun dictionary is much larger than the pronoun and determiner dictionaries. For example, in our case the sizes are 14000, 70 and 72, respectively (see item 4 of Section 5.3). Consequently, retrieval from the noun dictionary is computationally costlier than retrieval from the pronoun or determiner dictionaries. This fact is not reflected if equal weights are assigned to each POS. Hence, in order to assign priorities to the POS features in such a way that the dissimilarity score reflects the approximate cost of adaptation, weights are assigned to each POS as given in Table 5.16.
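Equation (5.3) and the weights of Table 5.16 can be sketched as follows. This is an illustrative computation only: the feature bags are written by hand, and the functional-tag term is passed in directly (it is zero here, since e, d1 and d2 share the same FTs).

```python
import math
from collections import Counter

# Weights from Table 5.16: w_i = log2(dictionary size of the POS).
w = {"p1": math.log2(4330), "p2": math.log2(13953), "p3": math.log2(1027),
     "p4": math.log2(5449), "p5": math.log2(72), "p6": math.log2(72),
     "p7": math.log2(87), "p10": math.log2(65)}

def dis(theta_e, theta_d, psi=0.0, ft_diff=0):
    """Eq. (5.3): sum of weights over eta(e, d), plus psi times the
    functional-tag difference |phi(e)| - |phi(e) & phi(d)| (ft_diff)."""
    eta = theta_e - theta_d                      # multiset difference
    return sum(w[p] * n for (p, _), n in eta.items()) + psi * ft_diff

e  = Counter([("p6", "this"), ("p2", "girl"), ("p5", "I"),
              ("p2", "sister"), ("p10", "simple present tense")])
d1 = Counter([("p6", "this"), ("p2", "boy"), ("p5", "I"),
              ("p2", "brother"), ("p10", "simple present tense")])
d2 = Counter([("p6", "that"), ("p2", "girl"), ("p5", "she"),
              ("p2", "sister"), ("p10", "simple present tense")])

print(round(dis(e, d1), 2))   # two noun replacements: 2 * 13.77 = 27.54
print(round(dis(e, d2), 2))   # pronoun + determiner: 2 * 6.17 = 12.34
```

The sketch reproduces the worked numbers above: d2 scores far lower than d1, because replacing a pronoun and a determiner is much cheaper than replacing two nouns.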


Here the dissimilarity metric is so designed that the dissimilarity score is directly proportional to the approximate cost of adaptation. Finally, the sentences in Se are arranged in increasing order of dissimilarity score. A few best sentences are then considered for the cost of adaptation based scheme, and the best one is retrieved as the most similar to the given input sentence. In our experiments we have considered the five best sentences from this two-level filtration scheme for evaluation of their costs of adaptation.

5.7 Complexity Analysis of the Proposed Scheme

The above-discussed filtration scheme aims at improving the efficiency of the cost of adaptation based scheme. This improvement can be observed by comparing the worst-case complexities of the two algorithms: the cost of adaptation based scheme without the two-level filtration, and the cost of adaptation based scheme after the two-level filtration. These two similarity measurement schemes are denoted A1 and A2, respectively. Table 5.17 gives the notation for the different parameters used in the analysis, and their maximum sizes with respect to our example base.

Parameter                                               Notation   Maximum size
Example base size                                       N          4000
Input sentence length                                   Le         10
Example base sentence length                            Ld         10
Morpho-tag length                                       LF         5
No. of equivalence classes of the example base          |S′|       162
No. of equivalence classes retrieved by the 1st filter  |Se′|      |S′|
No. of functional tags in the input e                   |φ(e)|     10
No. of functional tags in d                             |φ(d)|     10

Table 5.17: Notation Used in the Complexity Analysis


In the algorithm A1, for each example base sentence d, the maximum effort required to adapt the translation of d to the translation of the input sentence e is the number of comparisons required to identify the adaptation operation(s) in the worst case. For an example base sentence d, the numbers of comparisons required are as follows:

• First, the appropriate functional tag in d corresponding to each functional tag


in e is identified. In this step, a total of |φ(e)|×|φ(d)| comparisons are required.

• Then, the morpho tags of all matching functional tags are compared; hence the maximum number of comparisons required is Le × LF.

Therefore, the total number of comparisons for adapting d to e is given by C1 = (|φ(e)| × |φ(d)|) + (Le × LF) = Le × (Ld + LF). It may be noted that the size of the functional tag set is the same as the length of the sentence (i.e. |φ(e)| = Le and |φ(d)| = Ld). Hence, the complexity of A1 over all example base sentences is given by TA1 = N × C1.

The complexity computation of the algorithm A2 requires a detailed analysis of the two filters. In the first filter, the complexity depends on the number of comparisons between the functional tags of the equivalence class [e] of the input and [d] of the example base, where e ∈ L and d ∈ S. This value, in the worst case, is given by C21 = |φ(e)| × |φ(d)| = Le × Ld. So, over all the |S′| equivalence classes, the complexity of the first filter is given by A21 = |S′| × C21.

For the second filter, we need to work on the sentences of the equivalence classes retrieved (Se′) from the first filter. Suppose there are Pi sentences in the ith equivalence class, i = 1, 2, ..., |Se′|. The total number of sentences in all the equivalence classes of Se′ is |Se|. At most two comparisons are required to find a characteristic feature: one for POS matching between d and e, and the other for matching the corresponding root words. The number of POS tags and root words can be at most equal to the length of a sentence. Thus the total number of comparisons required is computed as follows:

• The POS tags of d and e are compared first, which makes the number of comparisons Le × Ld.

• Then the root words of d and e having the same POS are compared. This requires Le comparisons.

Hence, the total number of comparisons required for POS and root word matching between e and d is Le × (Ld + 1). Summing over all the sentences of Se, we get the total complexity as A22 = ( Σi=1..|Se′| Pi ) × (Le × (Ld + 1)) = |Se| × (Le × (Ld + 1)), where Σi=1..|Se′| Pi ≤ N.

Finally, the cost of adaptation based scheme is applied on the top few sentences of Se having the minimum dissimilarity score. We have considered the first five sentences in our experiments. This makes the number of comparisons 5 × C1. Hence, the total complexity of the algorithm A2 is given by TA2 = A21 + A22 + 5 × C1.

Comparing the time complexities of the algorithms A1 and A2:

TA2/TA1 = (A21 + A22 + 5 × C1) / (N × C1)

        = [ (|S′| × (Le × Ld)) + (|Se| × (Le × (Ld + 1))) + (5 × Le × (Ld + LF)) ] / [ N × (Le × (Ld + LF)) ]

        = (|S′|/N) × [Ld /(Ld + LF)] + (|Se|/N) × [(Ld + 1)/(Ld + LF)] + 5/N        (as |Se| ≤ |S| = N)

In the worst case, we assume that |Se| = |S| = N. Thus,

TA2/TA1 = (|S′|/N) × [Ld /(Ld + LF)] + [(Ld + 1)/(Ld + LF)] + 5/N

Putting in the values of N, |S′|, Ld and LF, the ratio becomes TA2/TA1 = c, where c = 0.762.
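The worst-case ratio can be checked numerically by plugging in the maximum sizes of Table 5.17:

```python
# Worst-case parameter values from Table 5.17.
N, S_prime, Ld, LF = 4000, 162, 10, 5
Se = N  # worst case: |Se| = |S| = N

ratio = (S_prime / N) * (Ld / (Ld + LF)) \
      + (Se / N) * ((Ld + 1) / (Ld + LF)) \
      + 5 / N

print(round(ratio, 3))   # -> 0.762
```

The middle term, (|Se|/N) × [(Ld + 1)/(Ld + LF)], contributes about 0.733 of the total, which is why the ratio drops sharply once |Se| is much smaller than N in practice.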

The above ratio shows that in the worst case the improvement due to the algorithm A2 is about 25%, i.e. the cost of adaptation based scheme will need to be applied on only about 75% of the sentences of the example base. Experimentally, however, we found that for 500 different examples, which are not present in our example base, the improvement is of the order of about 75%, which is quite a significant improvement. This variation is mainly due to the fact that during our experiments the cardinality of Se turned out to be much less than N, and thus the ratio |Se|/N reduces the contribution of (|Se|/N) × [(Ld + 1)/(Ld + LF)], which is the main contributory term towards c.

The retrieval scheme has been developed with respect to simple sentences. However, if the input sentence is complex, then its adaptation is not straightforward (Dorr et al., 1998), (Hutchins, 2003), (Sumita, 2001), (Shimohata et al., 2003) and (Rao et al., 2000).

Typically, complex sentences are characterized as sentences having more than one clause, of which one is the main clause and the rest are subordinate clauses (Wren, 1989), (Ansell, 2000). A relative clause is a type of subordinate clause in which a relative adjective (who, which etc.) or relative adverb (when, where etc.) is used as a connective. The clauses may be joined by some connectives, but their presence is not mandatory. However, in this work we consider complex sentences with exactly one relative clause, and we further assume that the presence of the connective is mandatory. Even with this simplifying assumption, we find that translating complex sentences under an EBMT framework is relatively difficult. This is because English complex sentences having the same connective are often translated in different ways in Hindi. Consequently, for a given complex English sentence, finding its suitable

match from the example base is difficult. And even when a match is found, its adaptation may not be straightforward. The following section illustrates these points.

5.8 Difficulties in Handling Complex Sentences

Here we first observe the following two points:

• Even for complex sentences having the same connective (e.g. who, when, where, which), the structure of the translations may vary. For illustration, consider the four examples given below. Each of these English sentences has at least four possible Hindi translations, depending on the position in which the Hindi connectives are used. It may further be noticed that although the keywords of all these four sentences are the same7 (subject to morphological variations), their translation patterns vary according to the role of the connective and the role of the noun modified by the relative clause. If the relative adjective “who” plays the role of subject in the relative clause, then the Hindi relative adjective may be one of “jo”, “jis” or “jin”, depending upon the tense and form (i.e. present perfect, past indefinite or past perfect) of the main verb of the relative clause. Items (A), (B), (C) and (D) below show the four sentences and their Hindi translations.

(A) The policeman who chased the thief was tall.


∼ wah sipaahii jo chor kaa piichhaa kartaa thaa, lambaa thaa
∼ wah sipaahii, jis ne chor kaa piichhaa kiyaa lambaa thaa
∼ jo sipaahii chor kaa piichhaa kartaa thaa wah lambaa thaa

7
policeman - sipaahii, thief - chor, to chase - piichaa karnaa, I - main, tall - lambaa, to know -
jaannaa


∼ jis sipaahii ne chor kaa piichhaa kiyaa wah lambaa thaa

(B) The thieves who the policeman chased were tall.


∼ we chor, sipaahii jis kaa piichhaa kartaa thaa, lambe the
∼ we chor, sipaahii ne jis kaa piichhaa kiyaa, lambe the

∼ sipaahii jin choron kaa piichhaa karte the we lambe the


∼ sipaahii ne jin choron kaa piichhaa kiyaa we lambe the

(C) I know the policemen who chased the thief.

∼ main un sipaahiyoan ko jaantii hoon, jo chor kaa piichhaa karte the


∼ main un sipaahiyoan ko jaantii hoon, jin ne chor kaa piichhaa kiyaa
∼ jo sipaahii chor kaa piichhaa karte the main us ko jaantii hoon
∼ jin sipaahiyoan ne chor kaa piichhaa kiyaa main un ko jaantii hoon

(D) I know the thief who the policemen chased.


∼ main us chor ko jaantii hoon jis kaa sipaahii piichhaa karte the
∼ main us chor ko jaantii hoon, jis kaa sipaahiyoan ne piichhaa kiyaa
∼ sipaahii jis chor kaa piichhaa karte the main us ko jaantii hoon

∼ jin sipaahiyoan ne chor kaa piichhaa kiyaa main us ko jaantii hoon

• Although, in general, the structures of the Hindi translations of two complex sentences having different connectives are different, certain parts of them may still be similar. Hence an EBMT system may use this similarity in an effective way. For example, consider the following two sentences and their translations, which involve the following keywords: man - aadmii, is working - kaam kar rahaa hai, in - mein, farmer - kisaan, said - kahaa, he - wah.

The man who is working in the field is a farmer.

∼ jo aadmii khet mein kaam kar rahaa hai wah kisaan hai
This man said that he is a farmer.
∼ iss aadmii ne kahaa ki wah kisaan hai
Despite the dissimilarity in their structures, one may notice that the part “wah kisaan hai” is common to both translations. Typically this can happen when the two complex sentences have some similar clauses. The above observation also implies that sometimes a simple sentence may be helpful in generating the translation of a complex sentence, or some of its parts.

The above discussion suggests that the retrieval and adaptation strategies for complex sentences may need to take care of a large number of variations for each connective word and its usage. Creating the adaptation rules and implementing such a number of possibilities is not an easy task. To overcome this problem, we propose a “split and translate” scheme for handling complex sentences in an EBMT framework. The proposed scheme works as follows:

1. First it checks whether the input sentence is complex. If “yes” then it executes
the following:

2. It splits the input sentence into two simple sentences RC and MC, correspond-
ing to the Relative clause and the main clause of the complex sentence.

3. By using cost of adaptation based scheme it retrieves sentences most simi-


lar to RC and MC. Let these retrieved sentences be denoted as R1 and R2,
respectively.


4. It generates translations of RC and MC from the retrieved examples R1 and


R2.

5. The translation of given complex sentence is generated by using the transla-


tions of RC and MC.
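The five steps above can be sketched as a small driver. Everything below is an illustrative reduction: the helper functions stand in for the actual splitting, retrieval and adaptation modules, and the word-based splitting mirrors only the easiest case (a connective in mid-sentence).

```python
# Illustrative skeleton of the split-and-translate scheme. The helpers
# below are stand-ins for the real modules (splitting rules, cost of
# adaptation based retrieval, adaptation), reduced here to stubs.

CONNECTIVES = {"when", "where", "whenever", "wherever", "who", "which",
               "whose", "whom", "whoever", "whichever", "that",
               "whomever", "what", "whatever"}

def is_complex(words):
    # Step 1 (simplified): a connective anywhere in the sentence.
    return any(w.lower() in CONNECTIVES for w in words)

def split_clauses(words):
    # Step 2 (simplified): split around the first connective into MC and RC.
    i = next(k for k, w in enumerate(words) if w.lower() in CONNECTIVES)
    return words[:i], words[i + 1:]

def translate_simple(words):
    # Steps 3-4 stub: retrieval and adaptation of a simple sentence.
    return "<translation of: %s>" % " ".join(words)

def translate(sentence):
    words = sentence.rstrip(".?").split()
    if not is_complex(words):
        return translate_simple(words)
    mc, rc = split_clauses(words)
    # Step 5 stub: combine the clause translations; the Hindi examples
    # in Tables 5.19-5.20 typically place the relative clause first.
    return translate_simple(rc) + " " + translate_simple(mc)

print(translate("Visit us whenever you come here."))
```

The real system replaces each stub with the tag-based modules of this chapter; the driver only shows how the five steps fit together.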

In the following subsections we discuss some of the splitting rules required to convert a complex sentence into simple sentences, and the adaptation procedure to obtain the Hindi translation of the given complex sentence using the translations of the split sentences.

5.9 Splitting Rules for Converting Complex Sen-

tence into Simple Sentences

Various approaches have been suggested in the literature for splitting complex sentences. For example:

1. Furuse et al. (1998, 2001) proposed a technique where a sentence is split according to sub-trees and partly constructed parse trees.

2. Takezawa (1999) recommended a technique based on word-sequence characteristics.

3. Doi and Sumita (2003) proposed two methods: Method-T and Method-N. Method-T uses three criteria, viz. fault-length, partial-translation-count and combined-reliability. Method-N, on the other hand, uses a pre-process splitting method based on N-grams of POS subcategories.


Many approaches exist for splitting complex sentences (typically for English), e.g. (Orăsan, 2000), (Sang and Déjean, 2001) and (Clough, 2001). The technique used by us is similar in nature to that proposed in (Leffa, 1998) and (Puscasu, 2004), which suggest three ways in which a sentence can be segmented to the clause level:

(1) Starting with the first word in the sentence, and processing it from left to

right, word by word until all the clauses are identified;

(2) Starting with formal indicators of subordination/coordination, and proceeding


until the end of the clause is found;

(3) Starting with the verb phrase, identifying the verb type, and locating its sub-
ject and complements.

In our approach, we have used the first two methods. We have developed heuristics to split a complex sentence into two simple sentences, one related to the main clause and the other to the relative clause. The advantage here is that both simple sentences can now be translated independently using the retrieval and adaptation procedures developed for dealing with simple sentences. For this work we made the following assumptions about the input sentence:

• The sentence has only one relative clause, and a connective must be present.

• The connectives that we have considered are when, where, whenever, wher-
ever, who, which, whose, whom, whoever, whichever, that, whomever, what and

whatever.

• The algorithm makes use of the delimiter of the input sentence as well. We
illustrate this technique with respect to the delimiters “.” and “?”.


• No wh-family word (e.g. who, which, when, where) should be present in the
main clause.

In the following subsections, we discuss the splitting rules for complex sentences having any of the following connectives: “when”, “where”, “whenever”, “wherever” and “who”. Since the splitting rules for some of these connectives are the same, the next subsection considers the connectives “when”, “where”, “whenever” and “wherever” together. The subsequent Subsection 5.9.2 discusses the splitting rule for complex sentences having the connective “who”.

5.9.1 Splitting Rule for the Connectives “when”, “where”,

“whenever” and “wherever”

This rule is explained using three modules.

Module 1

Module 1 identifies whether a given input sentence e is a complex sentence or not. If e is complex, then the module identifies the position of the relative adverb, which can be one of “when”, “whenever”, “where” or “wherever”. The algorithm considers the two possible positions of the relative clause: either the relative clause is present before the main clause, or it is present after the main clause. Depending upon the position of the relative clause, the algorithm proceeds to Module 2 or Module 3. Figure 5.1 provides a schematic view of this module.


- Let the input sentence be e, and let e be e1, e2, ..., en, where
  n is the length of the English sentence.
- Let the parsed version of e be denoted by f, and its bag of
  functional tags by {f1, f2, ..., fn}, where fi is the
  functional-morpho tag corresponding to ei.
- For all ei ∈ e, let Roote(ei) denote the root word corresponding
  to ei.

IF ((f1 = @ADVL) AND (Roote(e1) = "where" OR "when" OR "wherever"
    OR "whenever"))
THEN {
    IF (((f2 = @+FAUXV) AND (Roote(e2) = "be" OR "do" OR
         "have" OR "can" OR "may" OR "shall" OR "will")) OR
        ((f2 = @+FMAINV) AND (Roote(e2) = "be" OR "have")))
    THEN {
        PRINT "Simple sentence";
        EXIT;
    }
    ELSE { PRINT "Complex sentence"; GO TO Module 2; }
}
ELSE {
    j = 0;
    FOR (i = 2, 3, ..., n) {
        IF (Roote(ei) = "where" OR "when" OR "wherever"
            OR "whenever")
            j = i;
    }
    IF (j = 0)
    THEN { PRINT "Simple sentence"; }
    ELSE { PRINT "Complex sentence"; GO TO Module 3; }
}

Figure 5.1: Schematic View of Module 1 for Identification of a Complex Sentence with
Connective “when”, “where”, “whenever” or “wherever”


The following two examples illustrate this module.

Example 1:

Let the input sentence e be “Whenever you go to India, speak Hindi.”. Its parsed
version f obtained using the ENGCG parser, is:

@ADVL ADV WH “whenever”, @SUBJ PRON PERS SG2/PL2 “you”,


@+FMAINV V PRES VFIN “go”, @ADVL PREP “to”, @<P <Proper>
N SG “India”, @+FMAINV V IMP “speak”, @OBJ <proper> N SG

“Hindi” <$.>
The length of the input sentence e is 7, and the bag of functional tags is {@ADVL,
@SUBJ, @+FMAINV, @ADVL, @<P, @+FMAINV, @OBJ}. Since f1 is @ADVL,

Roote (e1 )8 is “whenever”, and f2 is not @+FAUXV (f2 is @SUBJ), it is concluded


that the given input sentence e is complex, and the algorithm should proceed to
Module 2.

Example 2:

Consider the another input sentence e “Will you bring anyone along when you return
from town?”. Its parsed version f is:

@+FAUXV V AUXMOD VFIN “will”, @SUBJ PRON PERS SG2/PL2


“you”, @-FMAINV V INF “bring”, @OBJ PRON SG “anyone”, @ADVL
ADV ADVL “along”, @ADVL ADV WH “when”, @SUBJ PRON PERS
SG2/PL2 “you”, @+FMAINV V PRES VFIN “return”, @ADVL PREP

“from” , @<P N NOM SG ”town” <$?>


The length of e is 10, and the bag of functional tags is {@+FAUXV, @SUBJ, @-

FMAINV, @OBJ, @ADVL, @ADVL, @SUBJ, @+FMAINV, @ADVL, @<P}. Since


8 Roote(e1) denotes the root word corresponding to e1.


f1 is not @ADVL, the module checks for the presence of any of the connectives “when”, “whenever”, “where” or “wherever” in e. The connective “when” is present at the 6th position, i.e. j = 6. Hence Module 1 concludes that the given input sentence e is complex, and for splitting e, the algorithm should proceed to Module 3.
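The routing logic of Module 1 can be approximated at the word level as below. This sketch deliberately ignores the ENGCG functional tags that the real module consults (@ADVL, @+FAUXV etc.) and instead checks surface words, so it only illustrates the control flow.

```python
# A word-level approximation of Module 1 (Figure 5.1). The real module
# inspects ENGCG functional-morpho tags; here only surface words are
# checked, which is an illustrative simplification.

RELATIVE_ADVERBS = {"when", "whenever", "where", "wherever"}
AUXILIARIES = {"be", "is", "are", "was", "were", "do", "does", "did",
               "have", "has", "had", "can", "may", "shall", "will"}

def route(words):
    """Return 'simple', 'module 2' or 'module 3' for a tokenized sentence."""
    first = words[0].lower()
    if first in RELATIVE_ADVERBS:
        # "When is he coming?"-style questions are simple sentences.
        if words[1].lower() in AUXILIARIES:
            return "simple"
        return "module 2"      # relative clause precedes the main clause
    for i, w in enumerate(words[1:], start=2):
        if w.lower() in RELATIVE_ADVERBS:
            return "module 3"  # relative adverb found at position j = i
    return "simple"

print(route("Whenever you go to India , speak Hindi".split()))
# -> module 2
print(route("Will you bring anyone along when you return from town".split()))
# -> module 3
```

On the two worked examples, the sketch agrees with Module 1: Example 1 is routed to Module 2 and Example 2 to Module 3.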

Module 2

If the relative adverb is the first word of the given input sentence e then the sentence
is splitted in Module 2. Figure 5.2 gives a schematic view of this module. Table
5.19 gives the typical sentence structures that can be handled by this module. The
structure of the sentence handled by this module is characterized, the relative clause

will be present in the beginning of the main clause. In this module, along with
the position of the relative clause, position(s) of the subject(s) is used to split the
complex sentence. In the following, we assume the length of e to be n. Sub-steps of
this module are as follows:

• If the delimiter of the input sentence e is “?”, or if the input sentence has only one subject (the possible subject tags are @SUBJ and @F-SUBJ) and the delimiter of the sentence is “.”, then the main verb (i.e. the @+FMAINV tag) or the main auxiliary verb (i.e. the @+FAUXV tag) decides the splitting point. The module looks for the second occurrence of the @+FMAINV or @+FAUXV tag9. Let l be the word position where one of these two tags occurs. If one of the above two cases is true, then the second to (l − 1)th words and the lth to nth words of e constitute the two simple sentences, which are parts of the relative clause and the main clause, respectively. We call these two simple sentences RC and MC, respectively.

9 The ENGCG parser always assigns either the @+FMAINV or the @+FAUXV tag to the first occurrence of a verb, whether it is a main verb or an auxiliary verb. All other verbs (main or auxiliary) in the sentence are marked with either the @-FMAINV or the @-FAUXV tag.

• If the delimiter of the input sentence e is “.” and it has two subjects10, then the position of the second subject decides the splitting point. For this purpose, the pre-modifiers (i.e. determiner, article, pre-modifying adjective, adverb etc.) of the second subject are identified. If the position of the first pre-modifier of the second subject is k, then the second to (k − 1)th words of e and the kth to nth words of e constitute the two simple sentences. The first simple sentence (RC) is a part of the relative clause, and the second simple sentence (MC) is the main clause.

When I saw the oxen they were pulling the plow.


∼ jab maine bail dekhe, tab we hal khiinch rahe the

Whenever the woman eats too much, she gets sick.


∼ jab bhii wah aurat bahut zyaadaa khaatii hai, bimaar ho jaatii hai

Whenever you go to India, speak Hindi.


∼ jab bhii tum india jaate ho, hindi bolo

Where there is a cat, there is a dog.


∼ jahaan billii hotii hai, wahaan kuttaa hotaa hai

Wherever I run, the little dog will follow me.


∼ jahaan bhii main dauddataa hoon, chhothaa kuttaa mere piichhe jaaegaa
Table 5.19: Typical Examples of Complex Sentence with
Connective ‘when”, “where”, “whenever” or “wherever”
Handled by Module 2

10
The algorithm works for at most two clauses in a complex sentence, therefore, the maximum
number of subjects in the sentence is taken to be two.

Let K = number of @SUBJ or @F-SUBJ tags in the sentence e

\∗ K = 1 or K = 2 ∗\

IF ((delimiter of e = “?") OR (delimiter of e = “." AND K =1))


THEN {
l = 0;
For (i = 2, 3, . . ., n)
{
IF(fi = @+FMAINV OR @+FAUXV)
IF(l = 0)
THEN l++ ;
ELSE { l = i; Break;}
}
- The string e2 , e3 , . . . , el−1 constitutes a simple
sentence (say RC), which is the relative clause;

- The string el , el+1 , . . . , en constitutes a simple


sentence (say MC), which is the main clause;

- The functional-morpho tags of RC are f2 , f3 , . . .,fl−1 ;

- The functional-morpho tags of MC are fl , fl+1 , . . .,fn ;

- Delimiter of RC is “.";

IF(delimiter of e is “?")
THEN {delimiter of MC is “?"}
ELSE {delimiter of MC is “."}


ELSE {
IF(delimiter of e = “." AND K =2)
{
m = 0;
For(i = 2 to n)
{
IF(fi = @SUBJ or @F-SUBJ)
IF(m = 0)
THEN m++ ;
ELSE {m = i; Break;}
}
}

\∗ Now this algorithm finds the attributes (pre-modifier adjective,


determiner etc.) of second subject ∗\

k = m − 1;
WHILE((k>2) AND (fk = @N OR @DN> OR @NN> OR @GN>
OR @AN> OR @QN> OR @AD-A>))
k − −;

-The string ek , ek+1 , . . . , en constitutes the simple


sentence (say MC), which is the main clause;

-The string e2 , e3 , . . . , ek−1 constitutes the simple


sentence (say RC), which is the relative clause;

- The functional-morpho tags of MC is fk , fk+1 , . . ., fn ;

-The functional-morpho tags of RC is f2 , f3 , . . ., fk−1 ;

- Delimiter of RC is “.";
- Delimiter of MC is “.";
}

Figure 5.2: Schematic View of Module 2


Our discussion of Module 1 concluded with the remark that the complex sentence given in Example 1 should be split using Module 2. We now continue with the same example, “Whenever you go to India, speak Hindi.”, to show how Module 2 splits this sentence into two simple ones. In this example, the number of subjects is one, i.e. K = 1, and the delimiter is “.”. The module now determines the position of the second occurrence of the @+FMAINV or @+FAUXV tag, which is found at the 6th position11, i.e. l = 6. Hence the input complex sentence is split into simple sentences as follows:

• 2nd to 5th words constitute a simple sentence RC, i.e. “You go to India”, its
delimiter is “.”. This is a part of relative clause, and its morpho functional

tags are:

@SUBJ PRON PERS SG2/PL2 “you”, @+FMAINV V PRES VFIN “go”,

@ADVL PREP “to”, @<P <Proper> N SG “India” <$.>.

• 6th and 7th words constitute the simple sentence MC, i.e. “Speak Hindi”, its
delimiter is also “.”. This is a part of main clause, and its morpho functional
tags are: @+FMAINV V IMP “speak” , @OBJ <proper> N SG “Hindi” <$.>.
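The one-subject branch of Module 2 can be sketched over hand-tagged input, with the (word, tag) pairs below hand-built to mirror the ENGCG output quoted above:

```python
# A sketch of the one-subject branch of Module 2: drop the leading
# relative adverb and split at the second finite verb tag
# (@+FMAINV / @+FAUXV). The tagged input is hand-made for illustration.

def split_module2(tagged):
    """tagged: list of (word, functional_tag); returns (RC, MC) word lists."""
    seen = 0
    for i, (_, tag) in enumerate(tagged):
        if tag in ("@+FMAINV", "@+FAUXV"):
            seen += 1
            if seen == 2:                     # position l of the second tag
                return ([w for w, _ in tagged[1:i]],   # words 2 .. l-1
                        [w for w, _ in tagged[i:]])    # words l .. n
    raise ValueError("no second finite verb found")

sentence = [("Whenever", "@ADVL"), ("you", "@SUBJ"), ("go", "@+FMAINV"),
            ("to", "@ADVL"), ("India", "@<P"),
            ("speak", "@+FMAINV"), ("Hindi", "@OBJ")]

rc, mc = split_module2(sentence)
print(rc)   # -> ['you', 'go', 'to', 'India']
print(mc)   # -> ['speak', 'Hindi']
```

This reproduces the split in the example above: RC “You go to India” and MC “Speak Hindi”.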

Module 3

If the relative adverb (or connective) is not the first word of the given input sentence,
then the sentence is splitted by this module. In this case, the relative clause is present
after the main clause, i.e. relative clause is located towards the end of the sentence.
Let the position of the relative adverb (as identified in Module 1) be j. In this case,
11
The second main verb in the given input sentence is “speak”.


the first j − 1 words of e will constitute the first simple sentence MC (which is the main clause), and the (j + 1)th to nth words will constitute the second simple sentence RC (which is a part of the relative clause). Module 3 is given in Figure 5.3. Table 5.20 gives the typical sentence structures that can be handled by this module.

Please do not talk to him when the carpenter is working.


∼ jab barhaii kaam kar rahaa hai tab usse na boliye

Should you speak English when you go to India?


∼ jab tum india jaate ho kyaa tumhe english bolnii chaahiye?

Visit us whenever you come here.


∼ jab bhii tum yanhaa aate ho ham se milo

I will stop, where there are interesting spots in my journey.


∼ jahaan bhii mere safar mein dilchasp jaghe hohii main rukungaa

Will you bring anyone along when you return from town?
∼ jab tum shahar se waapis aate ho tab kyaa tum kisii ko saath laaoge?

Do you want to go wherever I go?


∼ jahaan bhii main jaatii hoon wahaan kyaa tum jaanaa chaahte ho?
Table 5.20: Typical Examples of Complex Sentence with
Connective “when”, “where”, “whenever” or “wherever”
Handled by Module 3

According to Module 1, the rule given in Module 3 will split the complex sentence discussed in Example 2, i.e. “Will you bring anyone along when you return from town?”. As discussed in Module 1, for this input sentence e the value of j12 is 6. Hence the splitting in this case is as follows:


12
j denotes the position of relative adverb “when”.


- The string e1 , e2 , . . . , ej−1 constitutes the simple sentence


MC which is the main clause;

\∗j is the position of relative adverb as obtained in Module 1 ∗\

- The string ej+1 , ej+2 , . . . , en constitutes the


simple sentence RC, which is the part of the relative
clause;

- The functional-morpho tags of MC is f1 , f2 , . . ., fj−1 ;

- The functional-morpho tags of RC is fj+1 , fj+2 , . . ., fn ;

- Delimiter for RC is always “.";

IF(delimiter of e is “.")
THEN {delimiter of MC is “."};
ELSE {delimiter of MC is “?"};

Figure 5.3: Schematic View of Module 3

• The first five words of e constitute a simple sentence, which is the main clause. That is, the first simple sentence (denoted by MC) is “Will you bring anyone along”. Since the delimiter of e is “?”, the delimiter of MC is “?”. Its functional morpho tags are:

@+FAUXV V AUXMOD VFIN “will”, @SUBJ PRON PERS SG2/PL2 “you”,


@-FMAINV V INF “bring”, @OBJ PRON SG “anyone”, @ADVL ADV ADVL
“along” <$?>.

• The 7th to 10th words constitute the other simple sentence RC, i.e. “You return

from town”, which is a part of the relative clause. The delimiter of RC is “.”.
Its functional morpho tags are:

@SUBJ PRON PERS SG2/PL2 “you”, @+FMAINV V PRES VFIN “return”,


@ADVL PREP “from”, @<P N NOM SG “town” <$.>.
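The splitting performed by Module 3 on this example can be sketched as follows. This is a simplified Python sketch over a tokenised sentence; the function name and data layout are illustrative assumptions, not taken from the thesis:

```python
def split_at_connective(words, j, delimiter):
    """Split a complex sentence at the relative adverb found at
    1-indexed position j into the main clause MC (words e1..e(j-1))
    and the relative clause RC (words e(j+1)..en)."""
    mc = " ".join(words[:j - 1])
    rc = " ".join(words[j:])
    mc_delim = "." if delimiter == "." else "?"   # MC keeps e's delimiter
    return (mc, mc_delim), (rc, ".")              # RC always ends with "."

words = "Will you bring anyone along when you return from town".split()
mc, rc = split_at_connective(words, 6, "?")
# mc is ("Will you bring anyone along", "?")
# rc is ("you return from town", ".")
```

With j = 6 (the position of “when”), this reproduces the MC and RC of the walkthrough above.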

A Cost of Adaptation Based Scheme

5.9.2 Splitting Rule for the Connective “who”

Here we discuss the algorithm for splitting complex sentences when the connective
is “who”. It should be noted that in this case, the relative clause can occur either
embedded within the main clause or after it. In both cases,

there are two possible functional tags of the connective word “who”, i.e. @SUBJ and
@OBJ. The algorithm takes care of all these possibilities. The algorithm is divided
into four modules, which are given in Figures 5.4, 5.6, 5.7 and 5.8. Along with these
four modules, there is a subroutine SPLIT given in Figure 5.5. The brief outline of

these modules and subroutine SPLIT is as follows:

• Module 1 checks whether the given input sentence is complex or not. If
the sentence is complex with the connective “who” then depending on the
position of the clause and the delimiter of a sentence it routes the algorithm
to the appropriate module.

• Module 2 splits those complex sentences in which the relative clause is embed-
ded in the main clause, and the delimiter of the sentence is “.”. Table 5.21
provides the typical sentence structures considered in this module.

• The complex sentences in which the relative clause follows the main clause are

split in Module 3. Here also the delimiter of the sentence under consid-
eration should be “.”. The sentence structures considered in this module are
exemplified in Table 5.22.

• Irrespective of the position of the relative clause, Module 4 splits those com-
plex sentences for which the delimiter is “?”. Examples given in Table 5.23

demonstrate the sentence structures considered in this module.


• The algorithm for splitting those complex sentences in which the relative clause
is embedded in the main clause is given in subroutine SPLIT. This subroutine
accepts two arguments: integer-type x, and character-type y. x gives a split-
ting point position, and y provides the delimiter of the simple sentence that is

a part of the main clause.
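The routing performed by Module 1 for the connective “who” can be sketched in Python as below. This is a hedged sketch: the triple layout of parsed tokens and the returned labels are illustrative assumptions:

```python
def route_who(tokens, delimiter):
    """Decide which module should split a sentence containing the
    connective "who" (a sketch of Module 1). tokens is a list of
    (root_word, functional_tag, morpho_tags) triples."""
    # j: 1-indexed position of the relative "who", 0 if absent
    j = next((i for i, (root, _, morpho) in enumerate(tokens, 1)
              if root == "who" and "<Rel>" in morpho), 0)
    if j == 0:
        return "simple sentence"
    # p: position of a finite-verb tag occurring before the connective
    p = next((i for i in range(j - 1, 0, -1)
              if tokens[i - 1][1] in ("@+FAUXV", "@+FMAINV")), 0)
    if p == 0:
        return "Module 2"    # relative clause embedded in the main clause
    if delimiter == ".":
        return "Module 3"    # relative clause follows the main clause
    if tokens[p - 1][1] == "@+FAUXV":
        return "Module 4"    # interrogative complex sentence
    return "cannot split"
```

For “Those students, who want to learn Hindi, ...” no finite verb precedes “who”, so the sketch routes to Module 2; for “I met the person who called me yesterday.” it routes to Module 3.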

Those students, who want to learn Hindi, should study a lot.


∼ jo vidyaarthii Hindii siikhnaa chaahte hain, unko bahut parhnaa chaahiye
∼ un vidyaarthiyoan ko jo hindii siikhnaa chaahte hain bahut parhnaa
chaahiye

The old man, who is working in the field, is a farmer.


∼ jo aadmii khet mein kaam kar rahaa hai, wah kisaan hai
∼ wah aadmii jo khet mein kaam kar rahaa hai, kisaan hai

The dog who I chased was black.


∼ jis kutte kaa maine piichhaa kiyaa, wah kaalaa hai
∼ wah kuttaa, jis kaa maine piichhaa kiyaa, kaalaa hai
Table 5.21: Typical Complex Sentences with Relative
Adverb “who” Handled by Module 2

I met the person who called me yesterday.


∼ jis insaan ne mujhe kal pukaaraa, main us ko milaa
∼ wah insaan, jis ne mujhe kal pukaaraa, main use milaa

She met the person who I called yesterday.


∼ maine jis insaan ko kal pukaaraa, wah usko milii
∼ wah insaan, maine jis ko kal pukaaraa, wah use milii
Table 5.22: Typical Complex Sentences with Relative
Adverb “who” Handled by Module 3


Do you know the boy who chased the dog?


∼ jo ladkaa kutte kaa piichhaa kartaa thaa kyaa tum usko jaante ho?

Do you know the boy who I chased?


∼ kyaa tum us ladke ko jaante ho, jis kaa maine piichhaa kiyaa?
∼ jis ladke kaa maine piichhaa kiyaa, kyaa tum us ko jaante ho?

Did not the man, who read the book, like it?
∼ jis aadmi ne kitaab padhii kyaa usne yah pasand nahiin kii?
∼ kyaa wah aadmii, jis ne kitaab padhii, yah pasand nahiin kii?

Did the man, who I know, like this book?


∼ jis aadmii ko main jaantii hoon, kyaa wah yah kitaab pasand kartaa thaa?
∼ kyaa wah aadmii jis ko main jaantii hoon, yah kitaab pasand kartaa thaa?

Table 5.23: Typical Complex Sentences with Relative


Adverb “who” Handled by Module 4

The following illustration explains how the above modules work.

Let the input sentence e be “Those students, who want to learn Hindi, should study a

lot.”. The parsed version f of this sentence is:

@DN> DEM “that”, @SUBJ N PL “student”, @SUBJ <Rel> PRON


WH SG/PL “who”, @+FMAINV V PRES VFIN “want”, @INFMARK>
INFMARK> “to”, @-FMAINV V INF “learn”, @OBJ <proper> N SG
“Hindi”, @+FAUXV V AUXMOD “should”, @-FMAINV V INF “study”,
@DN> ART “a”, @OBJ N SG “lot” <$.>


- Let the input sentence be e and let e be e1 , e2 ,. . ., en where


n is the length of the English sentence.

- Let the parsed version of e be denoted by f , and its bag of


functional tags is denoted as {f1 , f2 ,. . ., fn }, where fi is
the functional morpho tag corresponding to ei .

- For all ei ∈e, let Roote (ei ) denote the root word corresponding
to ei .

j = 0; \∗ j will store the position of the connective “who”∗\

For(i = 1 to n)
   IF(Roote (ei ) = "who" AND <Rel> ∈ Morpho-tag of fi )
      THEN {Print “Complex sentence"; j = i; break;}
IF(j = 0) THEN {Print "Simple sentence"; Exit;}

Flag = 0; p = 0; \∗ p stores the position of @+FAUXV or


@+FMAINV, if any one of them occurs before the
connective “who” ∗\
For(i = j − 1 to 1)
IF(fi = @+FAUXV OR fi = @+FMAINV)
{Flag=1; p = i; break;}

IF(Flag = 0)
THEN GO TO Module 2;
ELSE IF(delimiter of sentence = ".")
THEN GO TO Module 3;
ELSE IF(fp = @+FAUXV)
THEN GO TO Module 4;
ELSE Print "Sentence cannot be split";

Figure 5.4: Schematic View of Module 1 for Identification of Complex Sentence with
Connective “who”


SUBROUTINE SPLIT(int x, char y)


{
IF(fj ≠ @OBJ) \∗j stores the position of “who” as obtained
in Module 1. The connective “who” has
@SUBJ tag∗\
THEN {
- The constituent words e1 to ej−1 concatenated with
ex to en forms the main clause, which is a simple
sentence, denoted as MC;

- The functional tags f1 to fj−1 concatenated with


fx to fn forms the parsed output of MC;

- ej to ex−1 forms the relative clause;

- Replace ej with either “he", “she" or “they"


depending on the gender and number of el , where l is
such that fl = @SUBJ; and 1 ≤ l ≤ j − 1, and also change
the morpho functional tag fj with the corresponding
tag of either “he", “she" or “they".

- The morpho functional tag of “he", “she" and “they"


are @SUBJ PRON PERS MASC SG3 “he",
@SUBJ PRON PERS FEM SG3 “she"
and @SUBJ PRON PERS PL3 “they", respectively;

\∗If parser does not specify the gender, then


the gender of el is considered to be masculine. ∗\

- After this modification, ej to ex−1 forms the simple


sentence denoted by RC;
}


ELSE { \∗Here fj = @OBJ ∗\


c = 0;
For(i = j + 1 to x − 1)
IF(fi = @-FMAINV OR fi = @+FMAINV)
{c = i; break;}

- Words e1 to ej−1 concatenated with word ex to en forms


the main clause, which is a simple sentence, say MC;

- The functional-morpho tags f1 to fj−1 and fx to fn


together form the parsed output of MC;

- ej to ex−1 forms the relative clause;

- (j + 1)th to (x − 1)th words will form the simple sentence


with the following modification;

Modification:
- A new word, either "him", "her" or "them" is placed
after the cth word. The functional-morpho tag of this
new word will be @OBJ PRON PERS MASC SG3 “he",
@OBJ PRON PERS FEM SG3 “she" or
@OBJ PRON PERS PL3 “they";
The choice of this new word depends on the gender and
number of el , where l is such that fl =@SUBJ, 1 ≤ l ≤ j−1.

\∗If parser does not specify the gender, then


the gender of el is considered to be masculine.∗\

- After this modification, ej to ex−1 forms the


simple sentence and we denote it as RC;
}

Delimiter of RC is ".";
Delimiter of MC is y ;
Exit;
}

Figure 5.5: Schematic View of the SUBROUTINE SPLIT
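For the case where the connective carries the @SUBJ tag, the core of SUBROUTINE SPLIT can be sketched as below. This is an illustrative Python sketch; the pronoun table and argument names are assumptions, and commas are ignored as in the walkthrough that follows:

```python
# Replacement pronoun for "who", keyed by (gender, number); the gender
# defaults to masculine when the parser does not specify it.
PRONOUN = {("MASC", "SG"): "he", ("FEM", "SG"): "she",
           ("MASC", "PL"): "they", ("FEM", "PL"): "they"}

def split_embedded(words, j, x, gender, number, mc_delim):
    """Split a sentence whose relative clause is embedded in the main
    clause: j is the 1-indexed position of "who", x the splitting
    point supplied by the calling module."""
    mc = " ".join(words[:j - 1] + words[x - 1:])   # e1..e(j-1) + ex..en
    pronoun = PRONOUN[(gender or "MASC", number)]  # replaces "who"
    rc = " ".join([pronoun] + words[j:x - 1])      # pronoun + e(j+1)..e(x-1)
    return (mc, mc_delim), (rc, ".")
```

Applied to “Those students who want to learn Hindi should study a lot” with j = 3 and x = 8, the sketch yields the main clause “Those students should study a lot” and the relative clause “they want to learn Hindi”.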


\∗ This module will be executed when neither @+FAUXV nor @+FMAINV


tag is present before “who”∗\

Let count = 0, k = 0; \∗k stores the position of the second occurrence


of the @+FAUXV or @+FMAINV tag∗\
For(i = j + 1 to n)
{
IF(fi = @+FAUXV OR fi = @+FMAINV)
{
count++;
IF(count = 2)
{k = i; break;}
}
}
CALL SUBROUTINE SPLIT(k , ".")

Figure 5.6: Schematic View of Module 2
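Module 2's search for the second finite-verb position after the connective can be sketched as follows (a Python sketch; positions are 1-indexed as in the figure):

```python
def find_second_finite(tags, j):
    """Return the 1-indexed position k of the second @+FAUXV or
    @+FMAINV tag occurring after position j, or 0 if no such tag
    exists (a sketch of the search in Module 2)."""
    count = 0
    for i in range(j + 1, len(tags) + 1):
        if tags[i - 1] in ("@+FAUXV", "@+FMAINV"):
            count += 1
            if count == 2:
                return i
    return 0

# Tags of "Those students who want to learn Hindi should study a lot."
tags = ["@DN>", "@SUBJ", "@SUBJ", "@+FMAINV", "@INFMARK>", "@-FMAINV",
        "@OBJ", "@+FAUXV", "@-FMAINV", "@DN>", "@OBJ"]
```

With j = 3, the sketch returns 8, matching the value of k computed in the illustration of this section.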

\∗This module splits sentences in which the relative clause succeeds the
main clause.∗\

IF(fj ≠ @OBJ)
THEN {
- Words e1 to ej−1 of the given input sentence e forms
the main clause, which is a simple sentence, say MC;

- Functional-morpho tags f1 to fj−1 of the parsed


version of e gives the parsed version of MC

- Words ej to en constitute the relative clause;

\∗ Gender and number of the first occurrence of the word having either
of @<P, @OBJ, @PCOMPL-O, @I-OBJ, @PCOMPL-S tag is determined
below. This word is searched from the (j − 1)th word to the first word in e.∗\

NUMBER = ’ ’;
GENDER = ’ ’;


For(i = j − 1 to 1)
{
IF(fi = @<P OR @OBJ OR @I-OBJ OR @PCOMPL-S OR
@PCOMPL-O)
{
GENDER = gender of ith word;
NUMBER = number of ith word;
}
IF(NUMBER ≠ ’ ’) break;
}
IF (GENDER = ’ ’) GENDER = ’MASC’;

- ej is replaced with either “he", “she" or “they"


depending on GENDER and NUMBER. Morpho-functional
tag fj of new ej is @SUBJ PRON PERS MASC SG3 “he",
@SUBJ PRON PERS FEM SG3 “she" or
@SUBJ PRON PERS PL3 “they".

- Words ej to en constitute the other simple sentence,


say RC;
- The functional-morpho tags fj to fn form the parsed
version of RC;
}

ELSE {
- Words e1 to ej−1 form the main clause, which
is a simple sentence, say MC;

- f1 to fj−1 functional-morpho tags form the parsed


version of MC

- Words ej to en form the relative clause;

c=0; \∗c stores the position of first occurrence of the word having
@+FMAINV or @-FMAINV tag.∗\

For(i=j + 1 to n)
IF(fi = @+FMAINV OR fi = @-FMAINV)
{c = i; break;}


\∗ Gender and number of the first occurrence of the word having either
of @<P, @OBJ, @PCOMPL-O, @I-OBJ, @PCOMPL-S tag is determined
below. This word is searched from the (j − 1)th word to the first word in e.∗\

NUMBER = ’ ’;
GENDER = ’ ’;
For(i = j − 1 to 1)
{
IF(fi = @<P OR @OBJ OR @I-OBJ OR @PCOMPL-S OR
@PCOMPL-O)
{
GENDER = gender of ith word;
NUMBER = number of ith word;
}
IF(NUMBER ≠ ’ ’) break;
}
IF (GENDER = ’ ’) GENDER = ’MASC’;

- A new word w which can be either "him", "her" or "them"


is placed after the cth word. The functional- morpho
tag of w will be @OBJ PRON PERS MASC SG3 “he",
@OBJ PRON PERS FEM SG3 “she" or
@OBJ PRON PERS PL3 “they";
The choice of w depends on GENDER and NUMBER.

- The (j + 1)th to cth words, followed by w, concatenated with the


(c+1)th to nth words of e form the other simple sentence,
call it RC;

- Except for the functional-morpho tag of (c + 1)th word,


the functional-morpho tag of constituent words of RC
are obtained from the functional-morpho tags
of the corresponding words of e.
}
Delimiter of MC is ".";
Delimiter of RC is ".";
Exit;

Figure 5.7: Schematic View of Module 3
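The gender/number search used in Module 3 (scan leftwards from the connective for the first word tagged @<P, @OBJ, @I-OBJ, @PCOMPL-S or @PCOMPL-O, defaulting the gender to masculine) can be sketched as follows. This is an illustrative Python sketch; the token layout is an assumption:

```python
REFERENT_TAGS = ("@<P", "@OBJ", "@I-OBJ", "@PCOMPL-S", "@PCOMPL-O")

def resolve_pronoun(tokens, j, subject=True):
    """Choose the pronoun that stands in for "who": tokens is a list
    of (functional_tag, gender, number) triples and j is the
    1-indexed position of the connective."""
    gender, number = "", ""
    for i in range(j - 1, 0, -1):        # scan from word j-1 down to 1
        tag, g, n = tokens[i - 1]
        if tag in REFERENT_TAGS:
            gender, number = g, n
        if number:
            break
    gender = gender or "MASC"            # parser gave no gender
    if number == "PL":
        pronoun = "they"
    else:
        pronoun = "he" if gender == "MASC" else "she"
    if not subject:                      # the @OBJ case uses object forms
        pronoun = {"he": "him", "she": "her", "they": "them"}[pronoun]
    return pronoun
```

For “I met the person who called me yesterday.”, the scan finds the @OBJ word “person” (singular, gender unspecified) and so selects “he” (or “him” in the object case).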


\∗ This module splits interrogative complex sentences∗\


r=0;
For(i = p + 1 to j − 1) \∗ p is as obtained from Module 1 ∗\
{
IF(fi = @-FMAINV AND fi−1 = @INFMARK>) r = i;
IF(r ≠ 0) break;
}

IF(r ≠ 0) \∗This implies that relative clause follows main clause∗\


{CALL Module3a, where Module3a is same as Module 3 with one
modification, i.e. the delimiter of MC is "?" instead of
".";}

\∗ The following will be performed when the relative clause is embedded


between the main clause∗\
c = 0; c1 = 0; c2 = 0;
For(i = j + 1 to n)
{
IF((fi = @+FMAINV) OR (fi = @-FMAINV AND fi−1 = @INFMARK>))
{ c++;
IF(c = 1)
THEN c1 = i; \∗ c1 stores the position of the first
occurrence of the main verb after ”who”∗\
ELSE c2 = i; \∗c2 stores the position of the second
occurrence of the main verb after ”who”∗\
}
IF(c2 ≠ 0) break;
}

\∗Position of the auxiliary verb preceding the second main verb, if any, is
determined below.∗\
s = 0;
For(i = c1 + 1 to c2 − 1)
{
IF(fi = @-FAUXV) s = i;
IF(s ≠ 0) break;
}
IF(s ≠ 0)
THEN CALL SUBROUTINE SPLIT(s, "?");
ELSE CALL SUBROUTINE SPLIT(c2 , "?");

Figure 5.8: Schematic View of Module 4



The length of the input sentence is 11, and the bag of functional tags is {@DN>,
@SUBJ, @SUBJ, @+FMAINV, @INFMARK>, @-FMAINV, @OBJ, @+FAUXV,
@-FMAINV, @DN>, @OBJ}. Since Roote(e3 ) is “who” and its morpho tags contain
<Rel>, the module decides that the given input sentence e is a complex sentence with

the connective “who”. To identify the position of the relative clause in the given
input sentence e, presence of @+FAUXV or @+FMAINV tag is checked in the first
two words. It is found that none of these two functional tags is present in the first
two words (which are “those” and “students”). Thus Flag is set to 0, indicating that the
relative clause is embedded within the main clause. Hence Module 1 concludes

that the given sentence e will be split by Module 2.

For separating the main and the relative clause, Module 2 first locates the position
k of the second occurrence of @+FAUXV or @+FMAINV tag in the parsed version
(f ) of e. It should be noted that since neither the @+FAUXV nor the @+FMAINV tag is
present in the first two words, and the third word is “who”, the algorithm checks the

tags of the 4th word to the 11th word to determine the value of k. The value of k is
found to be 8: the @+FMAINV tag occurs at the 4th position (first occurrence), and
the @+FAUXV tag is present at the 8th position (second occurrence) of the sentence e.

Since the functional tag of the connective “who” is @SUBJ, the module gives the
following output:

• The first two words concatenated with 8th to 11th words constitute the simple

sentence (MC) which is also the main clause. Thus, the first simple sentence is
“Those students should study a lot.”. The delimiter of MC is “.”. The parsed
version of MC is obtained from the FTs of the corresponding words in the
parsed version f of e. Thus the parsed version of MC is


@DN> DEM “that”, @SUBJ N PL “student”, @+FAUXV V AUXMOD “should”,

@-FMAINV V INF “study”, @DN> ART “a”, @OBJ N SG “lot” <$.>

• The words from the 3rd position to the 7th position of the input sentence e form the
relative clause, i.e. “who want to learn Hindi”. Now the 3rd word is replaced with
“they”, as the 2nd word (i.e. “students”) carries the functional tag @SUBJ and its
number is plural (PL). Also, since the gender of “students”

is not specified, the gender of “they” is assumed to be masculine. Thus the


other simple sentence RC is “They want to learn Hindi” and its delimiter is “.”.
The parsed version of RC is:

@SUBJ PRON PERS PL3 “they”, @+FMAINV V PRES VFIN “want”,


@INFMARK> INFMARK> “to”, @-FMAINV V INF “learn”, @OBJ
<proper> N SG “Hindi” <$.>

Similarly, splitting of other structures of complex sentences with connective


“who” can be carried out using the above-mentioned modules. It should be noted

that the algorithm cannot deal with those interrogative complex sentences for which
the root form of the main verb of the main clause is “be”. In such sentences
the identification of the main clause and the relative clause is relatively more compli-
cated. For example, consider the complex sentence “Is the man who was reading the

book in the library upstairs?”. Its parsed version is:

@+FMAINV V PRES “be”, @DN> ART “the”, @SUBJ N SG “man”,


@SUBJ <Rel> PRON WH SG/PL “who”, @+FAUXV V PAST “be”, @-
FMAINV PCP1 “read”, @DN> ART “the”, @OBJ N SG “book”, @ADVL

PREP “in”, @DN> ART “the”, @<P N SG “library”, @ADVL ADV


ADVL “upstairs” <$?>


In the above sentence, “in the library” is a prepositional phrase (<PP>) and
“upstairs” is an adverb. Since the root form of the main verb of the main clause is “be”,
it can take either “upstairs” alone, or “in the library” along with “upstairs”, as its
predicative(s)13 . Thus, the main clause can be “Is the man in the library upstairs?” or “Is

the man upstairs?”. The relative clause will also vary accordingly. Hence in this
situation formulating the splitting rules is not achievable using this parsing scheme.
The same problem occurs for other variations of this type of sentence (e.g. “Is this
the man who saw you with the binoculars?”). Thus these types of sentences are not
handled in this report.

We have developed algorithms for splitting complex sentences using other con-
nectives also. But these rules are not discussed in this report in order to avoid
the repetitive nature of discussion. The following subsection discusses the adapta-
tion procedure for obtaining translations of the input complex sentences using the
split simple sentences RC and MC.

5.10 Adaptation Procedure for Complex Sentence

In the following subsections, we discuss the adaptation procedure to obtain the

translation of complex sentences having any of the following connectives: “when”,


“where”, “whenever”,“wherever” and “who”.
13
The predicative of a sentence whose main verb has the root form “be” can be any one (or a
combination) of the subjective complement, a prepositional phrase or an adverb.


5.10.1 Adaptation Procedure for Connectives “when”, “where”,

“whenever” and “wherever”

Since the Hindi translation patterns of complex sentences having one of the connectives

“when”, “where”, “whenever” or “wherever” are the same, the adaptation procedure


for such complex sentences is discussed collectively. Table 5.24 gives the translation
of the above-mentioned connectives, and Table 5.25 provides the possible structures
of English and its Hindi translation for these connectives (Refer Tables 5.19 and 5.20
for examples of these sentence patterns). Since the correlative adverb is frequently

not indicated in the Hindi translation of complex sentences having any of above-
mentioned connectives (Bender, 1961), (Kachru, 1980), the correlative adverb is
given in {}.

English            Hindi              Hindi
Relative Adverb    Relative Adverb    Correlative Adverb
when               jab                tab
where              jahaan             vahaan
whenever           jab bhii           tab
wherever           jahaan bhii        vahaan
Table 5.24: Hindi Translation of Relative Adverbs

The adaptation procedure for generating the translation of the complex sentences

under consideration is discussed below. Suppose R1 and R2 are the sentences having the

least cost of adaptation that have been retrieved from the example base corresponding
to RC and MC, respectively. The steps of this procedure are as follows:

1. Adapt the translation of R1 to the translation of RC.

2. Adapt the translation of R2 to the translation of MC.


3. Add one morpho-word (i.e. the corresponding Hindi relative adverb, refer Table
5.24) at the beginning of the translation of RC. The other morpho-word (i.e. the
corresponding Hindi correlative adverb, refer Table 5.24) may be added at the
beginning of the translation of MC.

4. Concatenate the (modified) translations of RC and MC.

Complex English Sentence Pattern


<Relative clause with connective [“when”, “where”, “whenever” or
“wherever”]> &<Main clause>.
OR
<Main clause> &<Relative clause with connective [“when”, “where”,
“whenever” or “wherever”]>.

Complex Hindi Sentence Pattern


Connective when: ∼ “jab” &<Hindi translation of RC> &{“tab”}
&<Hindi translation of MC>

Connective where: ∼ “jahaan” &<Hindi translation of RC> &{“vahaan”}


&<Hindi translation of MC>

Connective whenever: ∼ “jab bhii ” &<Hindi translation of RC> &{“tab”}


&<Hindi translation of MC>

Connective wherever: ∼ “jahaan bhii ” &<Hindi translation of RC>


&{“vahaan”} &<Hindi translation of MC>
Table 5.25: Patterns of Complex Sentence with Connec-
tive “when”, “where”, “whenever” and “wherever”
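Steps 3 and 4 of this adaptation procedure, applied to the patterns of Table 5.25, can be sketched as below. This is a Python sketch; the optional correlative argument reflects that the correlative adverb is frequently omitted:

```python
# Hindi relative and correlative adverbs from Table 5.24
ADVERBS = {"when": ("jab", "tab"), "where": ("jahaan", "vahaan"),
           "whenever": ("jab bhii", "tab"),
           "wherever": ("jahaan bhii", "vahaan")}

def adapt_connective(connective, rc_hindi, mc_hindi, correlative=True):
    """Prefix the relative adverb to the RC translation, optionally
    prefix the correlative adverb to the MC translation, and
    concatenate the two (steps 3-4 of the adaptation procedure)."""
    rel, corr = ADVERBS[connective]
    parts = [rel, rc_hindi]
    if correlative:
        parts.append(corr)
    parts.append(mc_hindi)
    return " ".join(parts)
```

With the RC and MC translations of Illustration 1 below, this sketch produces “jab tum india jaate ho tab tum ko hindi bolnii chaahiye”.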

It may be noted that the total cost involved in generating the translation of

given complex sentence depends on the cost of adapting the translation of R1 and


R2 to the translation of RC and MC, respectively. This is due to the fact that the
cost involved in one (or two) morpho-word addition(s) (required in step 3) is fixed, i.e.
ε (or 2ε), as relative and correlative adverbs always occur at the beginning of the
Hindi translation of the RC and MC sentences (refer Table 5.25), respectively. Hence no

search is required to find the correct position for the morpho-word in the Hindi sentence.
Further, the cost of concatenating these two translations is also fixed, which is ε.

Assume that the costs of adapting the translations of R1 and R2 to the trans-
lations of RC and MC are c1 and c2 , respectively. Then the total cost involved in
generating the translation of the given complex sentence is c1 + c2 + 2ε + ε.

5.10.2 Adaptation Procedure for Connective “who”

This section discusses the adaptation procedure for the complex sentence having
connective “who”. It may be noted that there are many variations in the sentence

structure having this connective (refer Tables 5.21–5.23). For illustration, we consider
the sentence pattern as given in Table 5.26. In this pattern the connective “who”
plays the role of subject in the relative clause of English sentence.

Two different Hindi translation patterns may be noted corresponding to the


above mentioned English sentence pattern. In the first pattern, the relative adjective

“jo” occurs in the beginning of the relative clause whereas in the other pattern “jo”
occurs before the subject slot of the main clause. The noun in the main clause
which the relative clause modifies14 is represented by “wah” or “we”, depending
upon the number of the noun (Bender, 1961), (Kachru, 1980).

14
For the sentences under consideration, this noun is the subject of the main clause.


Complex English Sentence Pattern


<Subject slot of Main clause>, &(Relative clause with connective [“who”])
&(Main clause, without subject slot).
(e.g. The man who is reading a book is nice.)

Complex Hindi Sentence Pattern


(“wah” or “we”) &<translation of subject slot of MC> & “jo” &(translation of
RC, without subject slot) &(Translation of MC, without subject slot)
(e.g. wah aadmii jo kitaab padh rahaa hai achchhaa hai )
OR
“jo” &<Translation of subject slot of MC> &(Translation of RC, without subject
slot) &(“wah” or “we”) &(Translation of MC, without subject slot)
(e.g. jo aadmii kitaab padh rahaa hai wah achchhaa hai )
Table 5.26: Patterns of Complex Sentence with Connec-
tive “who”

The adaptation procedure for generating the translation of the complex sentences
under consideration is discussed below. Suppose R1 and R2 are sentences having the
least cost of adaptation that have been retrieved from the example base corresponding

to RC and MC, respectively. The steps of this procedure are as follows:

1. Adapt the translation of R1 to the translation of RC. The subject slot of R1


is not adapted for RC as it is to be deleted while formulating the translation
of the given complex sentence.

2. Adapt the translation of R2 to the translation of MC.

3. Depending upon the required translation pattern, add two appropriate morpho-
words in RC and/or MC. The first morpho-word to be added is taken from


the set {wah, we}, and the other morpho-word is “jo”. The positions of the
morpho-words in the two patterns are given below:

• For the first pattern, the morpho-word “jo” is added in the beginning of
the translation of RC, and depending on the number of the subject of
MC, the morpho-word either “wah” or “we” is added in the beginning of

the translation of MC.

• For the second pattern, the morpho-word “jo” is added in the beginning

of the translation of MC, and depending on the number of the subject


of MC, a morpho-word (either “wah” or “we”) is added after the subject
slot of the translation of MC.

4. Combine the (modified) translations of RC and MC. For both translation pat-

terns, the translation of RC is embedded in the translation of MC after the


subject slot.
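The two Hindi patterns of Table 5.26 can then be assembled as follows (an illustrative Python sketch; the arguments carry the already-adapted partial translations, and the names are assumptions):

```python
def assemble_who(subject_hindi, rc_rest_hindi, mc_rest_hindi,
                 plural=False, pattern=1):
    """Build the translation for the "who" patterns of Table 5.26:
    "wah"/"we" is chosen by the number of the subject of MC, "jo"
    marks the relative clause, and the RC translation is embedded
    after the subject slot of MC."""
    dem = "we" if plural else "wah"
    if pattern == 1:
        # (wah | we) <subject> jo <RC rest> <MC rest>
        pieces = [dem, subject_hindi, "jo", rc_rest_hindi, mc_rest_hindi]
    else:
        # jo <subject> <RC rest> (wah | we) <MC rest>
        pieces = ["jo", subject_hindi, rc_rest_hindi, dem, mc_rest_hindi]
    return " ".join(pieces)
```

With the slots of the Table 5.26 example (“aadmii”, “kitaab padh rahaa hai”, “achchhaa hai”), pattern 1 yields “wah aadmii jo kitaab padh rahaa hai achchhaa hai” and pattern 2 yields “jo aadmii kitaab padh rahaa hai wah achchhaa hai”.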

The cost involved in generating the translation of complex sentences discussed


above, is as follows:

1. Cost of adapting the translation of R1 to the translation of RC. Let this cost
be c1 . In this case, the cost involved for adapting the translation of the subject
slot is not included.

2. Cost of deletion of subject slot from the translation of RC. Let us denote this
cost by w.

3. Cost of adapting the translation of R2 to the translation of MC. Let this cost
be denoted as c2 .


4. Cost of two morpho-word additions which are given below for both the Hindi
translation patterns.

• For the first translation pattern, the cost of adding these two morpho-
words is (ε) + (0.5α + ε) = 0.5α + 2ε (refer Section 5.3). Here dictionary

search is not required, as the morpho-words may be stored in some readily

accessible location. Since these morpho-words are always added at the
beginning of the translation of RC and MC, no search is required to
determine the correct position for the morpho-word addition.

• For the second Hindi translation pattern, the cost of adding these two
morpho-words is (ε) + (L/2 × α + 0.5α + ψ + ε) = (L/2 + 0.5) × α + ψ +
2ε, where L is the length of the translation of MC.

5. Cost of combining the translations of RC and MC, which for both translation
patterns is L/2 × α + ψ + ε. Here too, L is the length of the translated sentence
of MC.

Thus the total cost involved for the two translation patterns is the sum of all the

above-mentioned costs. The two simple sentences R1 and R2 are retrieved from
the example base for generating the translation of given complex sentence so as to
minimize the total cost of adaptation.
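The cost breakdown above can be collected into a single expression (a Python sketch; α, ψ and ε denote the operation costs of the cost model and are passed in here as plain numbers for illustration):

```python
def who_total_cost(c1, w, c2, L, alpha, psi, eps, pattern=1):
    """Total cost of translating a "who" sentence: RC adaptation c1,
    subject-slot deletion w, MC adaptation c2, the morpho-word
    additions of the chosen pattern, and the combination cost."""
    if pattern == 1:
        additions = eps + (0.5 * alpha + eps)          # = 0.5α + 2ε
    else:
        additions = eps + (L / 2 * alpha + 0.5 * alpha + psi + eps)
    combination = L / 2 * alpha + psi + eps            # both patterns
    return c1 + w + c2 + additions + combination
```

For example, with c1 = w = c2 = 0, L = 4 and α = ψ = ε = 1, pattern 1 costs 0.5 + 2 + 2 + 1 + 1 = 6.5 and pattern 2 costs 9.5.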

We have formulated the adaptation procedure for other complex sentence struc-
tures having the connective “who” in a similar way. However, due to the similar nature
of the discussion we do not elaborate on them in this report.

The above discussed adaptation procedures are illustrated in the following sub-
section. In particular, we show, for a given complex sentence, how the scheme retrieves

two similar simple sentences from the example base that can be used to generate
the translation of the input complex sentence.


5.11 Illustrations

The adaptation procedures for complex sentences are explained using two illustra-
tions.

5.11.1 Illustration 1

Suppose the input sentence is: “You should speak Hindi when you go to India.”. Its
parsed version is:

@SUBJ PRON PERS SG2/PL2 “you” , @+FAUXV V AUXMOD “should”


, @-FMAINV V INF “speak” , @OBJ <Proper> N SG “Hindi” , @ADVL
ADV WH “when” , @SUBJ PRON PERS SG2/PL2 “you” , @+FMAINV

V PRES “go” , @ADVL PREP “to” , @<P <Proper> N SG “India” < $. >

The algorithm for splitting complex sentences (see Figures 5.1 and 5.2)
results in two simple sentences RC and MC as given below.

RC : You go to India. @SUBJ PRON PERS SG2/PL2 “you”, @+FMAINV V


PRES “go”, @ADVL PREP “to”, @<P <Proper> N SG “India” < $. >

MC : You should speak Hindi. @SUBJ PRON PERS SG2/PL2 “you”,


@+FAUXV V AUXMOD “should” , @-FMAINV V INF “speak”, @OBJ
<Proper> N SG “Hindi” < $. >

Five most similar sentences for RC and MC, obtained by applying cost of adap-
tation based scheme, are given in Table 5.27 and Table 5.28, respectively.


Retrieved Sentences for RC       Cost of Adaptation

You go to school.                4α
Ram has gone to school.          25.67α + c105
You are coming from India.       29.58α + c105 + 2ε
You will not go to India.        9.5α + ψ + 2ε
They will go to Kanpur.          24.67α + ψ + c105 + ε
Table 5.27: Five Most Similar Sentences for RC “You go
to India.” Using Cost of Adaptation Based Scheme

Retrieved Sentences for MC       Cost of Adaptation

He should speak English.         10.67α + c105
The boy should study Hindi.      28.25α + 2c105
You should speak.                8.5α + ψ + ε
You can speak Hindi.             15.5α + ψ + ε
He can speak English.            30.67α + c105 + ψ + ε
Table 5.28: Five Most Similar Sentences for MC “You
should speak Hindi.” Using Cost of Adaptation Based
Scheme

To obtain the translation of RC and MC, we consider the first sentence of Table
5.27 and Table 5.28, respectively. Thus, R1 is “You go to school.”, and R2 is “He

should speak English.”. The Hindi translations of these sentences are:

Translation of R1 : tum vidyaalay jaate ho


(you) (school) (go)
Translation of R2 : us ko english bolnii chaahiye
(he) (English) (speak) (should)


The translations of R1 and R2 are adapted to generate the translations of RC and

MC, respectively. Thus, the translations of RC and MC are:

Translation of RC : tum india jaate ho


(you) (India) (go)
Translation of MC : tum ko hindi bolnii chaahiye

(you) (Hindi) (speak) (should)

The morpho-words “jab” and “tab” are to be added in the beginning of the Hindi
translation of RC and MC, respectively. After this modification, these two sentences
are concatenated. Hence the desired translation of the given input sentence is “jab

tum india jaate ho tab tum ko hindi bolnii chaahiye”.

5.11.2 Illustration 2

Let us consider another input sentence “The student who wants to learn Hindi should

study this book.” and its parsed version:

@DN> ART “the”, @SUBJ N SG “student”, @SUBJ <Rel> PRON


WH SG/PL “who”, @+FMAINV V PRES “want”, @INFMARK>
INFMARK> “to”, @-FMAINV V INF “learn”, @OBJ <proper> N SG
“Hindi”, @+FAUXV V AUXMOD “should”, @-FMAINV V INF “study”,

@DN> DEM “this” , @OBJ N SG “book” < $. >

After applying the algorithm for splitting complex sentences (refer Figure 5.4 and

Figure 5.6), two simple sentences RC and MC are obtained. These are as follows:


RC : He wants to learn Hindi. @SUBJ PRON PERS SG3 “he”, @+FMAINV

V PRES “want”, @INFMARK> INFMARK> “to”, @-FMAINV V INF


“learn”, @OBJ <proper> N SG “Hindi” < $. >

MC : The student should study this book. @DN> ART “the”, @SUBJ N

SG “student”, @+FAUXV V AUXMOD “should”, @-FMAINV V INF


“study”, @DN> DEM “this” , @OBJ N SG “book” < $. >
Five most similar sentences for RC and MC are given in Table 5.29 and Table
5.30, respectively. One point to be noted here is that the cost of obtaining the
translation of the subject slot of RC is not considered in Table 5.29.

Retrieved Sentences for RC               Cost of Adaptation

He likes to learn Hindi.                 17.58α + c105
Ram wants to teach a student.            23.08α + c105
The student wants to play football.      23.08α + c105
The student wants to study this book.    28.08α + c105 + ε
He is learning Hindi.                    33.08α + c105 + ψ + 2ε
Table 5.29: Five Most Similar Sentences for RC “He wants
to learn Hindi.” Using Cost of Adaptation Based Scheme

Retrieved Sentences for MC               Cost of Adaptation

The student wants to study this book.    24α + ψ + 2ε
The student should listen this poem.     37.85α + 2×(c105 )
The student studies books.               20.17α + c105 + 2ψ + ε
The boy should study Hindi.              48.17α + 3×(c105 ) + ψ + ε
The student wants to play football.      56.52α + 3×(c105 ) + 2ψ + 3ε
Table 5.30: Five Most Similar Sentences for MC “The
student should study this book.” Using Cost of Adaptation
Based Scheme


For all the possible combinations of sentences given in Tables 5.29 and 5.30,
the cost of adaptation involved in generating the translation of the input sentence
is calculated in the way explained in Section 5.10. The minimum cost of
adaptation, which is (17.58α + c105 ) + (3α + ε) + (24α + ψ + 2ε) + (0.5α + 2ε) + (3α

+ ψ + ε) = 48.08α + c105 + 2ψ + 6ε, is obtained for the following sentences:

R1: He likes to learn Hindi.

R2: The student wants to study this book.

Hence, after generating the translation of RC (i.e. wah (he) hindi (Hindi) siikhanaa
(to learn) chaahtaa hai (likes)) and MC (i.e. vidyarthii (student) ko yah (this) kitaab
(book) padhnii (study) chaahiye (should)), and appending the relative adjective “jo”
and the appropriate personal pronoun from the set {we, wah} at the beginning of the
translations of RC and MC, respectively, the translation of the given input sentence is
“wah vidyarthii ko jo hindi siikhanaa chaahtaa hai yah kitaab padhnii chaahiye”.

5.12 Concluding Remarks

In this chapter we have examined a technique for evaluating similarity between sentences. This is required for effective retrieval of past examples in order to facilitate efficient EBMT. However, we observed that with respect to EBMT, similarity may have to be defined in a different way. Since the key focus of EBMT is adaptation, we define “cost of adaptation” as a measure of similarity between sentences. According to this definition, a sentence d is said to be similar to a given input sentence e if the adaptation of d to generate the translation of e is computationally inexpensive.


We showed the results obtained by applying some of the existing retrieval techniques, based on syntax and semantics, used in text retrieval (Manning and Schutze, 1999; Gupta and Chatterjee, 2002). These results have been compared with the results of the cost of adaptation based scheme, and they show the superiority of the proposed scheme over the syntax- and semantics-based schemes. The proposed scheme works on simple sentences, and for measuring the cost of adaptation, the adaptation operations described in Chapter 2 have been used.

One apparent drawback of this scheme is that it needs to compare the input
sentence with all the sentences of the example base. This makes the process com-
putationally very expensive. Hence one needs to filter out irrelevant sentences, and

then evaluate the cost of adaptation on a smaller set of sentences.

In this respect, we have proposed a two-level filtration scheme for measuring dissimilarity. The filtration scheme works in the following two steps:

1. Measuring surface similarity, which is based on functional tags (FTs).

2. Measuring characteristic feature dissimilarity, which is measured on the basis of tense and the POS tag along with its root word.
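The two filtration steps can be sketched as follows. This is a minimal illustration, assuming a simplified sentence representation (a set of functional tags, a tense label, and (POS, root) pairs) and illustrative thresholds; the actual scoring used in this work is more detailed.

```python
# Sketch of the two-level filtration scheme; representation and
# thresholds are illustrative assumptions, not the thesis's exact model.

def surface_similarity(input_fts, example_fts):
    """Level 1: fraction of the input's functional tags (e.g. @SUBJ,
    @OBJ, @-FMAINV) that the example sentence also contains."""
    if not input_fts:
        return 0.0
    return len(set(input_fts) & set(example_fts)) / len(set(input_fts))

def feature_dissimilarity(inp, ex):
    """Level 2: count mismatches in tense and in (POS, root) pairs."""
    score = 0 if inp["tense"] == ex["tense"] else 1
    score += len(set(inp["pos_roots"]) ^ set(ex["pos_roots"]))
    return score

def filter_examples(inp, example_base, min_surface=0.5, max_dissim=4):
    """Keep only examples that pass both filtration levels; the full
    cost-of-adaptation measure is then run on this reduced set."""
    survivors = []
    for ex in example_base:
        if surface_similarity(inp["fts"], ex["fts"]) < min_surface:
            continue                      # rejected at level 1
        if feature_dissimilarity(inp, ex) > max_dissim:
            continue                      # rejected at level 2
        survivors.append(ex)
    return survivors
```

Only the sentences returned by `filter_examples` are subjected to the expensive cost-of-adaptation evaluation.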

The lower the dissimilarity score of an example, the lower its adaptation cost for generating the required translation. Finally, the cost of adaptation based scheme is applied on the selected set of sentences provided by the filtration scheme. The advantage of this filtration scheme is that it reduces the number of example base sentences that are to be analysed for evaluating the cost of adaptation. In the worst case it reduced this number by 25%. However, as we repeated our experiments with 500 different sentences, we found that the average reduction in the number of sentences subjected to evaluation of the cost of adaptation is about 75%. The proposed


scheme, however, cannot be applied to complex sentences straightaway. This is because adaptation is more difficult for complex sentences, owing to their more complicated structure in both English and Hindi. Consequently, we suggest that complex sentences may first be split into simple sentences. The adaptation cost based scheme may then be applied to retrieve the best matches for each of the simple sentences. The retrieved translations may then be adapted to generate the translations of the simple sentences, which in turn may be combined using linguistic rules to generate the translation of the input complex sentence. The novelty of the scheme is that it gives an algorithmic way of handling complex sentences under EBMT as well.
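As an illustration, the splitting step for a “who”-type complex sentence can be sketched as below. The token indices for the connective and the clause boundary are assumed to come from prior parsing, and the restored pronoun is fixed to “He” for this toy case; the actual heuristics are connective-specific.

```python
# Toy sketch of the "split" step of the split-and-translate strategy.
# Clause boundaries and the restored pronoun are assumed inputs.

def split_complex(tokens, conn_idx, rc_end):
    """Split a complex sentence (word list) with one relative clause
    into two simple sentences.  conn_idx is the index of the connective
    ('who'); rc_end is one past the last word of the relative clause."""
    subject = tokens[:conn_idx]                 # e.g. ['The', 'student']
    # Relative clause: a pronoun replaces the shared subject.
    rc = ["He"] + tokens[conn_idx + 1:rc_end]
    # Main clause: subject plus the remainder of the sentence.
    mc = subject + tokens[rc_end:]
    return " ".join(rc) + ".", " ".join(mc) + "."
```

For “The student who wants to learn Hindi should study this book”, this yields the RC “He wants to learn Hindi.” and the MC “The student should study this book.”; each is then translated by retrieval and adaptation, and the results are combined with the relative marker “jo” as shown earlier.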

In this work we have dealt with complex sentences consisting of a main clause and one relative clause. We have developed heuristics to first determine whether a sentence is complex, and we use observations from our example base of complex sentences to validate the heuristics used in the sentence splitting algorithm. In this report, we have discussed algorithms for splitting complex sentences for five connectives: “who”, “when”, “where”, “wherever” and “whenever”. We have also developed splitting rules for other connectives (e.g. which, whom, whose, whoever, whichever), but to avoid repetitive discussion we have not presented all of them in this report. Finally, we have shown the adaptation procedure for a given complex sentence with any of the above-mentioned connectives. In particular, we showed how, for a given complex sentence, the scheme retrieves two similar simple sentences from the example base that can be used to generate the translation of the input complex sentence.

Chapter 6

Discussions and Conclusions



6.1 Goals and Motivation

The primary goal of this research is to study various aspects of designing an EBMT system for translation from English to Hindi. A lot of information is being generated around the world in various fields. However, since most of this information is in English, it remains out of reach of the people at large for whom English is not the language of communication. This is particularly true for a country like India, where the population is more than a billion, yet only about 3% of the population can understand English. As a consequence, an increasing demand for developing machine translation systems from English to the various languages of the Indian subcontinent is being felt very strongly. However, development of MT systems typically demands the availability of a large volume of computational resources, which is currently not available for these languages in general. Moreover, generating such a large volume of computational resources (which may comprise an extensive rule base, a large volume of parallel corpora, etc.) is not an easy task. The EBMT scheme, on the other hand, is less demanding on computational resources, making it more feasible to implement for these languages.

In this respect, we further observed that although a few English to Hindi MT systems are available online, the quality of the translations they produce is not always satisfactory. This prompted us to investigate in detail the various difficulties that one may face while developing an MT system from English to Hindi. We feel that the studies made in this research will be helpful not only for Hindi, but also for other languages that are major regional languages of the Indian subcontinent and, at the same time, prominent “minority” languages of other countries (e.g. the U.K.). Although an increasing demand for MT systems from English to these languages is clearly evident, development of the necessary computational resources is


still at a very rudimentary stage.

In this research we studied different aspects of designing an English to Hindi EBMT system in great detail. In particular, we concentrated on finding suitable solutions for the following aspects:

a) Development of efficient retrieval and adaptation schemes.

b) Study of divergence for English to Hindi translation, and how translation divergence can be effectively handled within an EBMT framework.

c) How to handle complex sentences, which are in general considered difficult to deal with in an MT system.

6.2 Contributions Made by This Research

Development of an efficient adaptation scheme. Efficient adaptation of past examples is a major aspect of an EBMT system. Even an efficient similarity measurement scheme and a fairly large example base cannot, in general, guarantee an exact match for a given input sentence. As a consequence, the need arises for an efficient and systematic adaptation scheme for modifying a retrieved example, and thereby generating the required translation. In this work we developed an adaptation strategy consisting of ten different adaptation operations. A study of the adaptation techniques suggested in different EBMT systems indicates that these techniques work primarily at the word level. However, with respect to English and Hindi, we observe that both languages depend heavily on suffixes for carrying out morphological variations. We further observed that adaptation may often be simpler and computationally less expensive if the adaptation scheme focuses on suffixes as


well. In a similar way, we observed that the declensions of Hindi verbs, nouns and adjectives often depend on some auxiliary words, called morpho-words. Adaptation using morpho-words also makes the process efficient and computationally cheaper. These observations motivated us to design an adaptation scheme comprising nine different basic operations, besides copy, to perform addition, deletion or replacement of constituent words, morpho-words and suffixes. Successive application of these operations helps in adapting the translation of a retrieved sentence to generate the translation of a given input. Another advantage of using these operations is that their algorithmic nature enables one to estimate the computational cost of each of them. We used this estimation to design a novel similarity measurement scheme, as explained below.
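The operation set can be sketched as follows. The encoding of operations and the cost weights are illustrative assumptions, not the actual constants of the thesis; the sketch only shows how word-, morpho-word- and suffix-level operations compose into an adaptation with an accumulated cost.

```python
# Sketch of the basic adaptation operations: addition, deletion and
# replacement at three granularities (word, morpho-word, suffix) plus
# copy.  The cost table below is a placeholder, not the thesis's model.

COSTS = {
    ("add", "word"): 3, ("delete", "word"): 2, ("replace", "word"): 4,
    ("add", "morpho"): 2, ("delete", "morpho"): 1, ("replace", "morpho"): 2,
    ("add", "suffix"): 1, ("delete", "suffix"): 1, ("replace", "suffix"): 1,
}

def apply_op(words, op):
    """Apply one operation to a list of Hindi tokens; suffix operations
    edit inside a token instead of inserting/removing whole tokens."""
    kind, unit, idx, arg = op
    words = list(words)
    if unit == "suffix":
        if kind == "add":
            words[idx] += arg
        elif kind == "delete":
            words[idx] = words[idx].removesuffix(arg)
        else:                          # replace: arg is (old, new)
            old, new = arg
            words[idx] = words[idx].removesuffix(old) + new
    else:                              # word / morpho-word operations
        if kind == "add":
            words.insert(idx, arg)
        elif kind == "delete":
            del words[idx]
        else:
            words[idx] = arg
    return words

def adapt(words, ops):
    """Apply a sequence of operations, accumulating their total cost."""
    total = 0
    for op in ops:
        total += COSTS[(op[0], op[1])]
        words = apply_op(words, op)
    return words, total
```

For instance, adapting “ladkaa so rahaa hai” (the boy is sleeping) to “ladkii so rahii hai” (the girl is sleeping) takes one word replacement and one suffix replacement.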

Retrieval and Adaptation. However good an adaptation scheme is, its performance is seriously hindered if the example that it attempts to adapt is not similar enough to the input sentence. But there is no unique way of defining similarity between sentences; depending upon the application, the definition of similarity may vary. In this work we proposed a scheme for defining similarity from the “adaptation” perspective. We say that a sentence S1 should be called “similar” to another sentence S2 if adaptation of the translation of S1 to generate the translation of S2 is computationally inexpensive. The lower the cost, the greater the similarity.

In this work we have provided appropriate models for prior estimation of the cost of adaptation. This cost depends not only on the number of basic operations to be performed, but also on the functional slots on which the operations are applied. A thorough analysis of adaptation costs for different phrasal structures within various functional slots (e.g., subject, object, verb), and also for different sentence types (e.g., affirmative, negative, interrogative), has been carried out, and models for estimating these


computational costs have been designed.

We have carried out experiments on retrieval using the proposed scheme, and also
with other major similarity measurement schemes, namely schemes based on com-
monality of words and similarity of syntax. These experiments clearly established

the supremacy of the proposed scheme over the others.

Study of Divergence. Divergence is a major hindrance for EBMT. As per Dorr, divergence occurs “when structurally similar sentences of the source language do not translate into sentences that are similar in structure in the target language”. In this work we have studied the different types of divergence that may occur in the context of English to Hindi translation. Our findings are compared with the divergence types observed in translations among European languages, for which divergence has been extensively studied. Through this research, we have been able to discover three new types of divergence that have so far not been reported with respect to European languages. Altogether we have been able to characterize seven divergence types that are prevalent in the English to Hindi context.

In order to deal with divergence within an EBMT system, we have proposed the following. We developed algorithms for determining, from an English sentence and its Hindi translation, whether the translation involves any of the seven types of divergence. These algorithms help one to partition the example base into different parts depending upon whether a translation is “normal” or involves any of the divergences. This partitioning of the example base is essential for designing an appropriate retrieval scheme that deals with divergences efficiently.

We have further developed a corpus-evidence based scheme that enables the system to take a prior decision on whether the translation of a given input sentence will involve divergence. Depending on the decision of the scheme, similar

sentences are retrieved from an appropriate part of the example base.

Dealing with Complex Sentences. One of the major difficulties in developing any MT system is designing an appropriate scheme for handling complex sentences. With respect to an EBMT system, we have observed that formulating appropriate adaptation and retrieval policies for complex sentences is not straightforward. In order to resolve this problem we proposed a “split and translate” technique to handle complex sentences. We have developed heuristics to identify whether a sentence is complex, and we use observations from our example base of complex sentences to validate the heuristics used in the sentence splitting algorithm. This work is restricted to sentences that have at most one subordinate clause. The “split” algorithm generates two simple sentences out of a given complex sentence based on its main and relative clauses. These simple sentences are then translated individually using the proposed approach. We have developed further heuristics that combine these individual translations to generate the translation of the given complex sentence.

One major difficulty that we faced while doing this research is that no suitable English-Hindi parallel corpus was available online at the time. The data required for this work should be properly aligned at both the sentence level and the word level. The example base of about 4,500 sentences used for this work has been created and aligned manually. These sentences have been collected by scrutinizing about 30,000 translation pairs gathered from various sources such as story books, recipes, government notices, etc. For efficient working of the proposed EBMT system, the size of the example base should be increased. Although a huge amount of parallel data is now available online, e.g. the EMILLE data and various web-sites having both English and Hindi versions (e.g., www.iitd.ac.in, www.statebankofindia.com), this data is not aligned. In order


to use these resources effectively, proper English to Hindi alignment techniques need to be developed.

6.3 Possible Extensions

The work carried out in this research may be extended in several directions:

• In this work, various adaptation procedures have been studied and developed for the sentence structures (and their components) that are predominantly found among the sentences in our example base. Many other variations in sentence structure are possible which have not been discussed in this work. Adaptation rules may be developed for such sentence patterns.

• Although we have dealt with complex sentences, we imposed certain restrictions on the structures of complex sentences in this work. The splitting and adaptation rules have been developed for these restricted structures only. The proposed “split and translate” technique needs to be extended to complex sentences having more complicated structures. Further, we have left compound sentences out of our discussion; strategies need to be developed for dealing with compound sentences as well.

• With respect to English to Hindi translation, seven types of divergence have been identified. Although this finding is based on our analysis of about 30,000 sentences, the possibility of other divergence types cannot be completely ruled out. A more detailed study of English-Hindi parallel corpora is required to identify other divergence types, if any.


• Robustness of the scheme proposed for taking a prior decision about the possible divergence types in the translation of a given input sentence (see Chapter 4) depends on the PSD/NSD. These dictionaries contain the proper senses of words, and are created manually. For automating the construction of the PSD/NSD, an appropriate Word Sense Disambiguation technique should be developed and applied.

• Prior identification of Possessional Divergence is not discussed in Chapter 4. This is because possessional divergence may be associated with a large number of variations in the properties of the subject, object, pre-modifier of the object, etc., which are not governed by a simple set of rules. The hypernyms (according to WordNet 2.0) of these words need to be analyzed and compared to arrive at any conclusion regarding prior identification of this divergence type.

• The divergence identification algorithms (discussed in Chapter 3) depend on the FTs and SPAC of the English sentence and its Hindi translation. For the English sentence, this knowledge is extracted from the parsed version, which is obtained from parsers available online. No such resource is available for Hindi; thus, in our work, we parsed and obtained the FTs and SPAC of Hindi sentences manually. For practical applications of the proposed algorithms, a Hindi parser is needed to obtain the required information for Hindi sentences.

6.4 Epilogue

There are many issues pertaining to MT that have not been dealt with in this work. Arguably, the two most important of them are:

• Study of pre-editing and post-editing requirements

• How to evaluate the quality of the translation given by an MT system

Both of them pertain to a full-fledged working MT system, and hence do not fall within the purview of the work reported here. We include a brief description of these two topics so that future work on English-Hindi EBMT may be directed to take care of these issues too.

6.4.1 Pre-editing and Post-editing

Pre-editing is the process of identifying and, where necessary, editing the source text prior to translation, so that any sentences (segments) of text that the machine will have problems with are highlighted and removed. In other words, pre-editing builds, from an existing version of the given text (e.g. in paragraph form), new text data that the MT system is able to handle. The pre-editing required varies according to the requirements of the MT system.

In the case of our study on EBMT, we have also done some pre-editing according to the requirements of retrieval, adaptation and identification of divergence. Firstly, we have assumed that our original data is aligned sententially, i.e. one source language sentence corresponds to one target language sentence. For the retrieval and adaptation procedures, we have added the parsed version of the source sentence, which is based on morpho-functional tags, along with word alignment information at the root level. This minimum information is stored in our example base for carrying out adaptation, converting complex sentences into simple sentences, and measuring similarity for effective retrieval. In Chapter 1, we provided Figure 1.1 as an example record of the example base. However, the


algorithms for divergence identification require both FTs and SPAC (see Appendix
B) information for the parallel corpus. Pre-editing has to be done accordingly.

The task of post-editing is to edit, modify and/or correct the output that a machine translation system has produced from a source language into a target language. In other words, post-editing corrects the output of the MT system to an agreed standard, e.g. amending the style of the output sentences, or making any minimal amendments that will make the text more readable.

As such, we have not developed a full MT system, so post-editing is not directly relevant to this work. Still, we point out some situations where post-editing may be needed while designing an EBMT system.

In the case of an EBMT system, post-editing may be required in the following form. The desired translation of the input sentence is generated by adapting the translation of the most similar example. Sometimes it may happen that even after adaptation the system does not produce a translation that is grammatically correct in the target language. This may be because of insufficient morpho-syntactic information or grammar rules used by the system while carrying out the adaptation task. In this situation one has to correct the translation as required. Another situation where post-editing can be useful is when the system does not have a sufficient number of words in the dictionary. Typically, in these cases the MT system provides transliterations of these words in the target language; post-editing is useful in these cases too. The amount of post-editing required on the output provides a good yardstick for measuring the output quality of an MT system.


6.4.2 Evaluation Measures of Machine Translation

Implementation of an MT system can be considered successful only if the translation produced by the system is of acceptable quality. This automatically raises the issue of how to evaluate the quality of the output produced by a system. In recent years, various methods have been proposed to evaluate machine translation quality automatically. Typically, these methods take the help of some “reference” translation of some pre-selected test data. A reference translation is also known as a “gold-standard translation”. By comparing the output produced by the system under consideration (with respect to the pre-selected test data) with the reference translation, an estimate of the possible discrepancy is arrived at. This in turn gives a measure of the translation quality of the said system. Examples of such methods are Word Error Rate (WER), Position-independent word Error Rate (PER) (Tillmann et al., 1997), and multi-reference Word Error Rate (mWER) (Nießen et al., 2000). Below we describe these methods.

• WER: The word error rate is based on the Levenshtein distance. It is com-
puted as the minimum number of substitution, insertion and deletion oper-
ations that have to be performed to convert the generated sentence into the
reference sentence.

• PER: A major shortcoming of the WER is the fact that it requires a perfect word order. In order to overcome this problem, the position-independent word error rate (PER) was introduced as an additional measure. It compares the words in the two sentences without taking the word order into account. Words that have no matching counterparts are counted as substitution errors, missing words as deletion errors, and additional words as insertion errors. Evidently,


the PER provides a lower bound for the WER.

• mWER: Often there exist many possible correct translations of a sentence. The WER and the PER compare the produced translation to only one given reference, which might be insufficient due to variation in syntax. Thus, a set of reference translations is built for each test sentence. For each translation hypothesis, the Levenshtein distance to the most similar reference sentence in this set is calculated. This yields a more reliable error measure, and is a lower bound for the WER.
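The three error measures described above can be sketched as follows. This is a minimal illustration with uniform edit costs, normalised by the reference length, and a simplified PER formulation; production evaluations use more careful tokenisation and normalisation.

```python
# Sketch implementations of WER, PER and mWER as described above.
from collections import Counter

def wer(hyp, ref):
    """Word error rate: Levenshtein distance over words / |ref|."""
    h, r = hyp.split(), ref.split()
    d = list(range(len(r) + 1))          # single-row DP table
    for i in range(1, len(h) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(r) + 1):
            cur = min(d[j] + 1,                      # deletion
                      d[j - 1] + 1,                  # insertion
                      prev + (h[i - 1] != r[j - 1])) # substitution/copy
            prev, d[j] = d[j], cur
    return d[len(r)] / len(r)

def per(hyp, ref):
    """Position-independent error rate: bag-of-words mismatch / |ref|."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matched = sum((h & r).values())
    errors = max(len(hyp.split()), len(ref.split())) - matched
    return errors / sum(r.values())

def mwer(hyp, refs):
    """Multi-reference WER: WER against the closest reference."""
    return min(wer(hyp, r) for r in refs)
```

Since PER ignores order, a reordered but otherwise correct output scores 0 under PER while being penalised by WER.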

In later years, n-gram based schemes have been proposed to evaluate translation quality. The most prominent among them are the BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2001) and NIST (National Institute of Standards and Technology) (Doddington, 2002) scores. These criteria try to approximate human assessment, and often achieve an astonishing degree of correlation with human subjective evaluation of fluency¹ and adequacy². These methods work as follows.

• BLEU: This scheme was proposed by IBM (Papineni et al., 2001). It is based on the notion of modified n-gram precision, for which all candidate n-gram counts are collected. The geometric mean of the n-gram precisions of various lengths between a hypothesis and a set of reference translations is computed. This score is multiplied by a brevity penalty (BP) factor to penalize too-short translations. Therefore,
¹A fluent sentence is one that is well-formed grammatically, contains correct spellings, adheres to common use of terms, is intuitively acceptable, and can be sensibly interpreted by a native speaker.
²The judge is presented with the gold-standard translation, and should evaluate how much of the meaning expressed in the gold-standard translation is also expressed in the target translation.


\[ \mathrm{BLEU} = BP \times \exp\!\left( \sum_{n=1}^{N} \frac{\log p_n}{N} \right) \]

Here p_n denotes the precision of n-grams in the hypothesis translation, and N denotes the number of n-gram lengths considered, usually N = 4 (i.e. n ∈ {1, 2, 3, 4}). Papineni et al. (2001) state that BLEU captures adequacy³ as well as fluency⁴. BLEU is an accuracy measure, while the above-mentioned measures are error measures. A disadvantage of the BLEU score is that longer n-grams dominate over shorter n-grams, and it cannot match corresponding (sub)parts of the hypothesis to the reference.

• NIST: This score was proposed by the National Institute of Standards and Technology in 2002. It reduces the effect of longer n-grams by computing an arithmetic average over n-gram counts instead of a geometric mean, again multiplied by a factor BP that penalizes short sentences. Both NIST and BLEU are accuracy measures, and thus larger values reflect better translation quality.
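A toy single-reference BLEU along the lines of the formula above might look like this. It uses no smoothing and one reference sentence; the official metric is computed corpus-level over a set of references.

```python
# Toy BLEU: geometric mean of modified n-gram precisions times a
# brevity penalty.  Pedagogical sketch, not the official implementation.
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(hyp, ref, max_n=4):
    h, r = hyp.split(), ref.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        h_counts, r_counts = ngrams(h, n), ngrams(r, n)
        clipped = sum((h_counts & r_counts).values())  # modified precision
        total = max(sum(h_counts.values()), 1)
        if clipped == 0:
            return 0.0           # any zero precision zeroes the score
        log_sum += math.log(clipped / total) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(h) >= len(r) else math.exp(1 - len(r) / len(h))
    return bp * math.exp(log_sum)
```

A perfect hypothesis scores 1.0; any hypothesis sharing no n-grams with the reference scores 0.0.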

Each of the above schemes focuses on certain aspects of translation. No single one of them can be considered the best from all perspectives. Hence translation quality is typically expressed in terms of four scores, viz. WER, PER, BLEU and NIST.

Designing a full-fledged MT system is an enormous task. Many approaches have been proposed, and many techniques have been pursued. However, no consensus has yet been reached regarding the best way of designing a system. In this work we have made contributions on various aspects of MT. Some of them are specific

³Matches of shorter n-grams (n = 1, 2, ...) capture adequacy.
⁴Matches of longer n-grams (n = 3, 4, ...) capture fluency.


for EBMT, while others, such as the study of divergences, are relevant for other paradigms as well. We hope that the contributions made in this thesis will be useful for designing English to Hindi MT systems, and also for many other language pairs at large.

Appendix A

A.1 English and Hindi Language Variations

The English and Hindi languages are of two different origins, so a study of their general structural properties is necessary. In this discussion, some of the basic concepts of translation from English to Hindi are briefly outlined. Some of the general structural properties of English and Hindi (Kachru, 1980; Kellogg and Bailey, 1965; Singh, 2003; Quirk and Greenbaum, 1976) are described below.

• Sentence Pattern: The basic sentence pattern in English is Subject (S) Verb (V) Object (O), whereas it is SOV in Hindi. Consider for example “Radha eats mango”: here “Radha” is the subject, “eats” is the verb and “mango” is the object, so the words occur in the order SVO. But in Hindi it becomes

  radha (S) aama (O) khaatii hai (V)
  (Radha)  (mango)  (eats)

• Order of Words in a Sentence: English is a positional language and therefore has a (relatively) fixed word order. Relations between the various components of a sentence are mainly shown by the relative positions of the components. For example,

  Radha watches the sparrows.

is very different from

  The sparrows watch Radha.

Hindi, in contrast, has (relatively) free word order. Relations between the various components of a sentence are mainly shown by inflecting the components. Changing the position of components normally changes the emphasis of an utterance, not its basic meaning.


For example,

  radha chidiyaan dekhatii hai
  (Radha) (sparrows) (watches)

has the same meaning as

  chidiyaan radha dekhatii hai
  (sparrows) (Radha) (watches)
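The reordering can be illustrated with a toy word-level gloss. The dictionary and the pre-identified (subject, verb, object) segmentation below are assumptions for this single sentence pattern, not a general transfer component.

```python
# Toy illustration of the SVO -> SOV reordering step with glossed words.
GLOSS = {"Radha": "radha", "eats": "khaatii hai",
         "mango": "aama", "sparrows": "chidiyaan",
         "watches": "dekhatii hai"}

def svo_to_sov(subject, verb, obj):
    """Map an English S-V-O triple to Hindi S-O-V word order."""
    return " ".join([GLOSS[subject], GLOSS[obj], GLOSS[verb]])
```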

The above-mentioned differences are structural differences between English and Hindi. Other differences lie in the part-of-speech properties of the two languages. These discrepancies are as follows:

• Noun: Hindi nouns are affected by gender, number and case ending (Kellogg and Bailey, 1965), as follows:

1. Gender: English has four genders, masculine (MASC), feminine (FEM), common and neuter, whereas Hindi has only two, masculine and feminine. The neuter gender of Sanskrit (the origin of Indian languages) has vanished in Hindi as well as in the closely related languages.

2. Number: Like English, Hindi has two numbers, singular and plural. Some of the possible suffix patterns for singular-to-plural conversion in Hindi are as follows (Kellogg and Bailey, 1965):

  Singular                    Plural
  ladkaa - boy (MASC)         ladke - boys
  ghar - house (MASC)         ghar - houses (no change)
  kapadaa - cloth (MASC)      kapade - clothes
  ladkii - girl (FEM)         ladkiyaan - girls
  kakshaa - class (FEM)       kakshayen - classes
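The suffix patterns in the table above can be written as a small rule function. This is a simplification covering only the patterns shown, applied to romanised forms; real Hindi morphology has further classes and exceptions.

```python
# Toy singular -> plural rules for the suffix patterns listed above.
def hindi_plural(noun, gender):
    """Direct-case plural of a romanised Hindi noun (toy rule set)."""
    if gender == "MASC":
        if noun.endswith("aa"):
            return noun[:-2] + "e"       # ladkaa -> ladke
        return noun                      # ghar -> ghar (no change)
    if noun.endswith("ii"):
        return noun[:-2] + "iyaan"       # ladkii -> ladkiyaan
    if noun.endswith("aa"):
        return noun[:-2] + "ayen"        # kakshaa -> kakshayen
    return noun
```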



3. Case ending: There are eight case endings in Hindi, which are given in Table A.2. All of them are appended to the oblique form of the noun, where such a form exists. There are rules for forming oblique nouns; some of them are as follows:

(a) Masculine singular nouns ending in “aa” change this into “e” when a case ending is added, e.g. ladkaa + ne ∼ “ladke ne”. Nouns ending in other vowels do not undergo such changes, e.g. “ghar ko”, “daaku kaa”.

(b) If a noun (masculine or feminine) ends in “a”, it changes into “aon” in the plural when a case ending is added. For example, “in the house” ∼ “ghar mein”, while “in the houses” ∼ “gharon mein”. Note that normally the plural of “ghar” is “ghar”, but it changes to “gharon” in the above example because of the case ending.

A particle added to the oblique form of a noun in this way is commonly called a postposition.

Case              Case-endings
Nominative case   ne
Accusative case   ko
Agent case        se (by, with, through)
Dative case       ko (to), ke liye (for), ke waaste
Ablative case     se (from, since)
Possessive case   kaa, ke, kii
Locative case     mein, par (in, on)
Vocative case     he, ajii, are

Table A.2: Different Case Endings in Hindi


No postposition is used with the nominative and the vocative. Here we discuss three cases: nominative, accusative and possessive. The other cases work in the same way as the corresponding English case endings. These cases are as follows:

1. Nominative case: The subject of a sentence takes the nominative sign “ne” only when its predicate is a transitive verb in a past tense (past indefinite, present perfect or past perfect). The use of this case is to make a noun or pronoun act as the subject of a verb. In that case, the verb agrees with the object in gender and number. For example,

  Ram narrated a story. ∼ ram ne kahaanii sunaayii
                            (Ram) (story) (narrated)

  The farmer has sowed the seeds. ∼ kisaan ne biij boyee hain
                                      (farmer) (seeds) (sowed has)

Here the objects of the translated sentences are “kahaanii” and “biij”, whose number and gender are singular feminine and plural masculine, respectively.
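The “ne” rule above can be sketched as a small predicate. The tense and transitivity flags are assumed to come from earlier morphological analysis; this covers only the attachment decision, not the verb-object agreement.

```python
# Sketch of the rule for the nominative marker 'ne': it attaches to the
# subject only for transitive verbs in the three past tenses listed above.
NE_TENSES = {"past_indefinite", "present_perfect", "past_perfect"}

def subject_with_ne(subject, transitive, tense):
    """Append the case marker 'ne' to the subject when required."""
    if transitive and tense in NE_TENSES:
        return subject + " ne"
    return subject
```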

2. Accusative case: “ko” is the sign of this case, and it is generally added only to animate objects. Sometimes it is also added to inanimate objects, either to intensify the effect or to express a special significance. For example:

  The boy beats the dog. ∼ ladkaa kutte ko martaa hai
                             (boy) (dog) (beats)

3. Possessive case: The signs of this case are “kaa”, “ke” and “kii”. These words are used according to the gender, number and case-ending of the following noun. This case ending has already been discussed in detail in Section 2.5.2 of Chapter 2.


• Preposition-Postposition: In English, a preposition occurs before the noun, e.g. “on the table”, “in the box”. In Hindi it occurs after the noun (e.g. “meja (table) par (on)” for “on the table”), and hence it may be called a postposition rather than a preposition. However, for ease of understanding we shall call them “prepositions” only.

• Article: As Hindi has no articles, the distinction indicated in English by the definite and indefinite articles cannot always be expressed in Hindi. Thus “ghodha” may be either “a horse” or “the horse”, and “istriyaan” may be “women” or “the women”. The indefinite article may sometimes be rendered by the numeral “ek” (“one”), or by an indefinite pronoun: “any” ∼ “koyii”, “some” ∼ “kuchh”.

A.2 Verb Morphological and Structure Variations

Every language has its own grammar rules; the same sentence may fall under different grammatical categories depending on the language concerned. For example, consider the English sentence “He will be sleeping at the moment”. Its translation in Hindi is “wah iss samay so rahaa hogaa”. As per English grammar rules, the verb phrase is in the future tense and progressive (continuous) aspect, but according to Hindi grammar the verb phrase of the Hindi sentence comes under the definite potential type of mood. For the translation work, we have followed the English grammar categorization of verb phrase structure (Quirk and Greenbaum, 1976), which involves different combinations of tense, aspect and mood.

To understand the English to Hindi verb structure, the conjugation of the root verb in Hindi is presented in the following subsection.


A.2.1 Conjugation of Root Verb

Verb morphological variations in Hindi depend on four aspects: the tense and form of the sentence, and the gender, person and number of the subject. All these variations affect the root verb of a sentence. Since there are three tenses (Present, Past and Future) and four forms (Indefinite, Continuous, Perfect, and Perfect Continuous), in all one can have 12 different conjugations. In Hindi, these conjugations are realized using suffixes attached to the root verb and/or by adding some auxiliary verbs, which we call “Morpho-Words” (MW). Table A.3 gives the morpho-words and suffixes in Hindi for all the tenses and their forms.

Tense form   Present Tense              Past Tense                 Future Tense
Indefinite   Suffix: taa, tii, te       Suffix: taa, tii, te       Suffix: oongaa, oongii,
             MW: hoon, hai, ho, hain    MW: thaa, thii, the        oge, ogii, egaa, egii,
                                                                   enge, engii
Continuous   MW: rahaa, rahe, rahii,    MW: rahaa, rahe, rahii,    MW: rahaa, rahe, rahii,
             hoon, hai, ho, hain        thaa, the, thii            hoongaa, hoongii, hoonge,
                                                                   hogaa, hogii, hoge
Perfect      MW: hoon, hai, hain, ho,   MW: thaa, the, thii,       MW: hoongaa, hoongii,
             chukaa, chukii, chuke      chukaa, chukii, chuke      hoonge, hogaa, hogii, hoge,
                                                                   hongee, chukaa, chukii, chuke
Perfect      Same as Continuous         Same as Continuous         Same as Continuous
Continuous

Table A.3: Suffixes and Morpho-Words for Hindi Verb Conjugations


The suffixes and morpho-words listed above for the present perfect, past indefinite and past perfect are used in the literal translation of a sentence. For these forms, the actual conjugation of the root verb uses “aa”, “e” and “ii”. According to Table A.3, the suffixes {taa, te, tii} are added to the root form in the past indefinite tense. Depending on the tense form, the morpho-words {thaa, the, thii}, {chukaa, chukii, chuke} and {hoon, hai, ho, hain} are added after the main verb of the sentence.

Another possible way of expressing these three tenses and forms in Hindi is to use, in place of the above suffixes, a conjugation of the verb different from the one discussed earlier. The morpho-words {thaa, the, thii} or {hoon, hai, ho, hain} are then added, depending upon the tense, towards the end of the sentence.
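The tense/form combinations of Table A.3 lend themselves to a table-driven lookup. The fragment below is a hypothetical sketch covering three of the twelve cells; the names and structure are our assumptions, not thesis code.

```python
# Partial rendering of Table A.3 (three of the twelve tense/form cells);
# an illustrative assumption, not code from the thesis.
CONJUGATION = {
    ("present", "indefinite"): {"suffixes": ["taa", "tii", "te"],
                                "mw": ["hoon", "hai", "ho", "hain"]},
    ("past", "indefinite"):    {"suffixes": ["taa", "tii", "te"],
                                "mw": ["thaa", "thii", "the"]},
    ("past", "continuous"):    {"suffixes": [],
                                "mw": ["rahaa", "rahe", "rahii",
                                       "thaa", "the", "thii"]},
}

def morph_elements(tense, form):
    """Suffixes and morpho-words (MWs) for a (tense, form) pair."""
    return CONJUGATION.get((tense, form), {"suffixes": [], "mw": []})

print(morph_elements("past", "indefinite")["mw"])  # ['thaa', 'thii', 'the']
```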

Some rules for these conjugations of verbs are as follows (Sastri and Apte, 1968):

• If the root of the verb ends in “a” (silent), lengthen it to “aa” in the masculine singular and change it into “e” for the masculine plural; in the feminine singular it becomes “ii” and in the feminine plural “iin”. For example, for the verb “play” ∼ “khel”, the forms are khelaa (masculine singular), khelii (feminine singular), khele (masculine plural) and kheliin (feminine plural).

• If the root ends in “aa” or “oo”, “yaa” is added, which changes according to the “aa”, “ai” and “ii” rule1. Sometimes “e” is used in place of “ye”, and “ii” and “iin” in place of “yii” and “yiin”, respectively. For example, for the verb “come” ∼ “aa”, the masculine forms are “aayaa” (singular) and “aaye” or “aae” (plural), and the feminine forms “aayii” or “aaii” (singular) and “aayiin” or “aaiin” (plural).

1 The “aa”, “ai”, “ii” Rule (Sastri and Apte, 1968): Masculine words ending in “aa” form their plurals by changing the “aa” into “e”, and their feminine by changing “aa” into “ii”.


• If the verb-root ends in “uu”, change it into “u” and add “aa” and “e” in the masculine, and “ii” and “iin” in the feminine. For example, for the verb “touch”, the masculine forms are “chhuaa” and “chhue”, and the feminine forms “chhuii” and “chhuiin”.

These rules are defined as the PCP verb form rules.
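The three PCP verb form rules above can be sketched as a small inflection function. This is a hedged illustration (the function name and interface are assumed, not from the thesis), and it covers only the root classes described here.

```python
# Hedged sketch of the PCP verb form rules (assumed helper, not thesis code).
def pcp_form(root, gender, number):
    """Inflect a Hindi verb root for gender ('m'/'f') and number ('sg'/'pl')."""
    key = (gender, number)
    if root.endswith("uu"):
        # Rule 3: "uu" -> "u", then add aa/e (masculine) or ii/iin (feminine).
        return root[:-2] + "u" + {("m", "sg"): "aa", ("m", "pl"): "e",
                                  ("f", "sg"): "ii", ("f", "pl"): "iin"}[key]
    if root.endswith("aa") or root.endswith("oo"):
        # Rule 2: add "yaa", inflected per the "aa", "ai", "ii" rule.
        return root + {("m", "sg"): "yaa", ("m", "pl"): "ye",
                       ("f", "sg"): "yii", ("f", "pl"): "yiin"}[key]
    # Rule 1: root with a silent "a" (e.g. "khel", play).
    return root + {("m", "sg"): "aa", ("m", "pl"): "e",
                   ("f", "sg"): "ii", ("f", "pl"): "iin"}[key]

print(pcp_form("khel", "f", "pl"))   # kheliin
print(pcp_form("chhuu", "m", "sg"))  # chhuaa
```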

For an English verb group, the above-mentioned morpho-words and suffixes constitute the morphological transformations in Hindi. Table A.4 provides some conjugations of the verb “write”, illustrating the scheme given in Table A.3.

English Sentence                Gender  Tense                 Hindi Sentence
I am writing a letter.          M/F     Present Continuous    main patr likh rahaa hoon
                                                              main patr likh rahii hoon
You write a letter.             M/F     Present Indefinite    tum patr likhte ho
                                                              tum patr likhtii ho
I write a letter.               M/F     Present Indefinite    main patr likhtaa hoon
                                                              main patr likhtii hoon
He (She) was writing a letter.  M/F     Past Continuous       wah patr likh rahaa thaa
                                                              wah patr likh rahii thii
We will write a letter.         M/F     Future Indefinite     hum patr likhenge
                                                              hum patr likhengii
Sita wrote a letter.            F       Past Indefinite       Sita ne patr likhaa

Table A.4: Verb Morphological Changes from English to Hindi Translation

A similar discussion applies to the passive verb form. The passive form can be formulated for transitive verbs only.

The morphological variation depends on the gender and number of the object of the active form of the sentence, which becomes the subject in the passive form (Sastri and Apte, 1968). The subject of the active form occurs in the passive form in the instrumental case, marked in English by “by”, whose Hindi counterparts are “se”, “ke duwaraa” and “duwaraa”. In the passive form, the changes in the main verb follow the PCP verb form rules discussed in Section A.2.1. Moreover, an extra verb “jaa” is introduced after the main verb, and the suffixes given in Table A.3 are added to this additional verb instead of to the main verb of the sentence. The morpho-words are added after the conjugated form of “jaa”. Consider the following pair of examples:

We add sugar to milk. ∼ ham dudh mein shakkar daalte hain
(we) (milk) (in) (sugar) (add)

Sugar is added to milk by us. ∼ dudh mein shakkar hamare duwaraa daalii jaatii hai
(milk) (in) (sugar) (us) (by) (added) (is)

The first example is in the active form and the second in the passive form. The verb morphological changes follow the above discussion.
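The passive construction just described, with the main verb in PCP form, “jaa” carrying the Table A.3 suffix, and the morpho-word last, can be sketched as below. The helper is an assumed illustration, not the thesis's implementation.

```python
# Assumed helper sketching the passive verb group: main verb in PCP form,
# then "jaa" carrying the tense suffix of Table A.3, then the morpho-word.
def passive_group(main_pcp, jaa_suffix, morpho_word):
    return " ".join([main_pcp, "jaa" + jaa_suffix, morpho_word])

# Reproduces "daalii jaatii hai" from the example above.
print(passive_group("daalii", "tii", "hai"))  # daalii jaatii hai
```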

Appendix B

B.1 Functional Tags

In this work we have used the ENGCG parser1 for parsing the English sentences. Most of the FTs that are relevant for this work are obtained directly from the parser. Descriptions of these FTs are given below:

@+FAUXV – Finite Auxiliary Predicator


(e.g. He can read.)

@-FAUXV – Nonfinite Auxiliary Predicator


(e.g. She may have read.)
@+FMAINV – Finite Main Predicator
(e.g. He reads.)

@-FMAINV – Nonfinite Main Predicator


(e.g. She has read.)
@SUBJ – Subject
(e.g. He reads.)

@F-SUBJ – Formal Subject


(e.g. There was some argument about that. It is raining.)
@N – Title
(King George and Mr. Smith)
@DN> – Determiner

(He read the book.)


@NN> – Premodifying Noun
(The car park was full.)
@AN> – Premodifying Adjective

(The blue car is mine.)


1
http://www.lingsoft.fi/cgi-bin/engcg


@QN> – Premodifying Quantifier

(He had two sandwiches and some coffee.)


@GN> – Premodifying Genitive
(My car and Bill’s bike are blue.)
@AD-A> – Premodifying Ad-Adjective

(e.g. She is very intelligent.)


@OBJ – Object
(e.g. She read a book.)
@PCOMPL-S – Subject Complement

(e.g. He is a fool.)
@I-OBJ – Indirect Object
(e.g. He gave Mary a book.)
@ADVL – Adverbial
(e.g. She came home late. She is in the car.)

@<NOM-OF – Postmodifying of
(e.g. Five of you will pass.)
@<NOM-FMAINV – Postmodifying Nonfinite Verb
(e.g. He has the licence to kill. John is easy to please.)

@<AD-A – Postmodifying Ad-Adjective


(e.g. This is good enough.)
@INFMARK> – Infinitive Marker
(e.g. John wants to read.)

@<P-FMAINV – Nonfinite Verb as Complement of Preposition


(e.g. This is a brush for cleaning.)
@CC – Coordinator


(e.g. John and Bill are friends.)

@CS – Subordinator
(e.g. If John is there, we shall go, too.)
@NEG – Negative Particle
(e.g. It is not funny.)

@<P – Other Complement of Preposition

(e.g. He is in the car.)

Each FT is prefixed by “@”, in contradistinction to other types of tags. Some tags include an angle bracket, “<” or “>”. The angle bracket indicates the direction in which the head of the word is to be found.

Some of the functional tags that are required for the divergence identification algorithms are not directly given by the available parsers. These FTs are Adjunct (A), Predicative Adjunct (PA) and Verb Complement (VC) (refer to Appendix C for their definitions). We have formulated rules for obtaining these FTs by using information available in the morpho tags of the underlying sentence.
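Since each FT is prefixed with “@”, the tags can be pulled out of an ENGCG-style output line with a simple pattern. The sketch below is illustrative only; the input line format shown is a simplified assumption about the parser's token lines.

```python
import re

# Illustrative extraction of "@"-prefixed functional tags from a parser
# output line (the line format here is a simplified assumption).
def functional_tags(line):
    return re.findall(r"@[\w<>/+-]+", line)

print(functional_tags("read V PRES @+FMAINV"))  # ['@+FMAINV']
print(functional_tags("music N NOM SG @<P"))    # ['@<P']
```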


B.2 Morpho Tags

Each morpho tag is followed by a short description and an example.

Part-of-speech tags
A adjective (small)
ADV adverb (soon)
CC coordinating conjunction (and)

CS subordinating conjunction (that)


DET determiner (any)
INFMARK> infinitive marker (to)
INTERJ interjection (hooray)

N noun (house)
NEG-PART negative particle (not)
NUM numeral (two)
PCP1 -ing form (writing)

PCP2 -ed/-en form (written, decided)


PREP preposition (in)
PRON pronoun (this)
V verb (write)

Features for adjectives


ABS absolute form (good)
CMP comparative form (better)
SUP superlative form (best)

Features for adverbs


ABS absolute form (much)

CMP comparative form (sooner)


SUP superlative form (fastest)
WH wh-adverb (when)
ADVL adverb always used as an adverbial (in)

Features for determiners


<**CLB> clause boundary (which)
<Def> definite (the)

<Indef> indefinite (an)


<Quant> quantifier (some)
ABS absolute form (much)
ART article (the)
CENTRAL central determiner (this)

CMP comparative form (more)


DEM demonstrative determiner (that)
GEN genitive (whose)
NEG negative form (neither)

PL plural (few)
POST postdeterminer (much)
PRE predeterminer (all)
SG singular (much)

SG/PL singular or plural (some)


SUP superlative form (most)
WH wh-determiner (whose)


Features for nouns


<Proper> proper (Jones)
GEN genitive case (people’s)
PL plural (cars)

SG singular (car)
SG/PL singular or plural (means)

Features for numerals

<Fraction> fraction (two-thirds)


CARD cardinal numeral (four)
ORD ordinal numeral (third)
SG singular (one-eighth)
PL plural (three-eighths)

Features for pronouns


<**CLB> clause boundary (who)
<Comp-Pron> compound pronoun (something)

<Interr> interrogative (who)


<Quant> quantitative pronoun (some)
<Refl> reflexive pronoun (themselves)
<Rel> relative pronoun (which)

ABS absolute form (much)


CMP comparative form (more)
DEM demonstrative pronoun (those)


FEM feminine (she)

GEN genitive (our)


MASC masculine (he)
NEG negative form (none)
PERS personal pronoun (you)

PL plural (fewer)
PL1 1st person plural (us)
PL2 2nd person plural (yourselves)
PL3 3rd person plural (them)

RECIPR reciprocal pronoun (each=other)


SG singular (much)
SG/PL singular or plural (some)
SG1 1st person singular (me)
SG2 2nd person singular (yourself)

SG2/PL2 2nd person singular or plural (you)


SG3 3rd person singular (it)
SUP superlative form (most)
WH wh-pronoun (who)

SUBJ a pronoun in the nominative that is always used


as a subject (he)

Features for prepositions

<CompPP> multi-word preposition (in=spite=of)


Features for verbs

<SV> intransitive (go)


<SVO> monotransitive (open)
<SVOO> ditransitive (give)
<SVC/A> copular with adjective complement (plead)

<SVC/N> copular with noun complement (become)


AUXMOD modal auxiliary (can)
IMP imperative (go)
INF infinitive (be)

PAST past tense (wrote)


PRES present tense (sings)

Appendix C

C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures

The definitions of some non-typical functional tags that we have used in our algo-
rithms are given below.

1. Adjunct (A): An adjunct is a type of adverbial indicating the circumstances


of the action. Adjuncts may be obligatory or optional. They express such
relations as time, place, manner, reason, condition, i.e. they are answers to

the questions where, when, how and why.

For example:

He lives in Brazil.

She was walking slowly.

Here, “in Brazil” is the adjunct as it answers “where”, and “slowly” answers “how”.

2. Predicative Adjunct (PA): If the copula (linking verb) is present and it allows an adverbial as complementation, then that complementation is called a predicative adjunct.

For example:

The children are at the zoo.

The party will be at nine o’clock.

The two eggs are for you.

The party will be tonight.

Here, the prepositional phrases “at the zoo”, “at nine o’clock” and “for you”, and the adverb “tonight”, are examples of predicative adjuncts.


3. Adjective Complementation by Prepositional Phrase (SC C): Some predicative adjectives require complementation by a prepositional phrase. Such a prepositional phrase is called the postmodifier of the adjective complement. Note that the preposition may be specific to a particular adjective.

For example:

Mary is fond of music. (Here, “of music” is the SC C.)

He was angry (with Mary). (Here, the parentheses indicate that the complement is optional.)

4. Verb Complement (VC): Sometimes prepositional phrases may act as the complementation of a verb. We use the generalized term “verb complement” to denote them. This happens when the main verb of the sentence is intransitive. In the case of transitive and ditransitive verbs, direct and/or indirect objects are used to complete the sense of the sentence; these are not considered verb complements here. We have used their actual names, object and indirect object.

For example:

We depend on him. (Here, “on him” is the verb complement.)

We want a treat. (“a treat” is the direct object.)

He gave a pen (direct object) to me (indirect object).

Although many English parsers are available online, none of them gives full information on both FT and SPAC. In order to glean both kinds of information, we had to combine the outputs of two parsers. In particular, we have used the following two parsers:


1. ENGCG parser1 for FTs

2. Memory-Based Shallow Parser (MBSP)2 (Buchholz, 2002; Daelemans et al., 1996) for SPAC

Obtaining the FTs in a given sentence is necessary for successful implementation of the algorithms. Some of the functional tags required for the divergence identification algorithms are not directly given by the available parsers. These FTs are Adjunct (A), Predicative Adjunct (PA), Postmodifier of the Subjective Complement (SC C), and Verb Complement (VC). We have formulated rules for obtaining these FTs by using information available in the morpho tags of the underlying sentence.

The SPAC structure has been taken from the MBSP. No specific rules are required to capture the SPAC structure of a sentence. However, we made small structural changes so that the structures can be manipulated easily by a program. For example, consider the sentence “The student is weak in his studies”. The MBSP gives the following output:

[NP the/DT student/NN NP] [VP is/VBZ VP] [ADJP weak/JJR ADJP] {PNP [Prep in/IN Prep] [NP his/PRP$ studies/NNS NP] PNP}.

Since all this information is not required by our identification algorithm, we have somewhat simplified the representation. Thus, in our notation the above tag information is represented as:

[NP [the/DT student/N]] [VP [is/V]] [ADJP [weak/Adj]] [PP in/IN [NP his/PRP$ studies/N]]
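The kind of simplification applied to the chunker output can be sketched as a small rewrite step. This is an assumed reconstruction, not the actual program, and it shows only one of the structural changes: dropping the repeated closing chunk label.

```python
import re

# Assumed reconstruction of one cleanup step: drop the repeated closing
# chunk label, so "[NP the/DT student/NN NP]" becomes "[NP the/DT student/NN]".
def drop_closing_labels(chunked):
    return re.sub(r"\s+[A-Z]+\]", "]", chunked)

s = "[NP the/DT student/NN NP] [VP is/VBZ VP]"
print(drop_closing_labels(s))  # [NP the/DT student/NN] [VP is/VBZ]
```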

For Hindi, no parser is available online. Following the English parser information, we have tagged the Hindi sentences for our work, i.e. we have used the same FTs and the same SPAC structure for Hindi.

1 http://www.lingsoft.fi/cgi-bin/engcg
2 http://pi0657.kub.nl/cgi-bin/tstchunk/demo.pl

Appendix D

D.1 Semantic Similarity

Semantic similarity between two words is computed on the basis of their semantic distance (sd) (Stetina et al., 1998), as follows:

sim(a, b) = 1 − (sd(a, b))^2

The semantic similarity score lies between 0 and 1. The semantic distance (Stetina et al., 1998) between two words, say a and b, is computed as:

• Semantic Distance for Nouns and Verbs

sd(a, b) = (1/2) [ (Ha − H)/Ha + (Hb − H)/Hb ]

Here, Ha is the depth of the hypernyms of a, Hb is the depth of the hypernyms of b, and H is the depth of their nearest common ancestor.

• Semantic Distance for Adjectives and Adverbs

sd(a, b) = 0 for the same adjectival synsets (including synonymy)

sd(a, b) = 0 for synsets in the antonym relation ant(a, b)

sd(a, b) = 0.5 for synsets in the same similarity cluster and in the antonym relation ant(a, b)

sd(a, b) = 1 for all other synsets.
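The two formulas above translate directly into code. In the sketch below the hypernym depths are supplied as plain integers (in practice they would be read from the WordNet hierarchy), and the function names are our own, not the thesis's.

```python
# The formulas above in code (illustrative; depths would come from WordNet).
def sd_noun_verb(h_a, h_b, h_common):
    """Semantic distance from hypernym depths Ha, Hb and the depth H of
    the nearest common ancestor."""
    return 0.5 * ((h_a - h_common) / h_a + (h_b - h_common) / h_b)

def similarity(sd):
    """sim(a, b) = 1 - sd(a, b)^2, giving a score in [0, 1]."""
    return 1.0 - sd ** 2

d = sd_noun_verb(6, 8, 4)       # hypothetical depths
print(round(similarity(d), 4))  # 0.8264
```

Identical synsets (sd = 0) give similarity 1, and maximally distant ones (sd = 1) give 0, matching the stated range.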

Appendix E

E.1 Cost Due to Adapting Pre-modifier Adjective

to Pre-modifier Adjective

Here the transformation from pre-modifier adjective to pre-modifier adjective requires fourteen adaptation operations. Below we describe the cost of each of them. Note that the pre-modifier word can be either ABS or A(PCP1) or A(PCP2) (see Section 2.5.4). We denote this set by R.

1. The average cost of word replacement from the set R to ABS. This cost is denoted by w1. Note that in this case an adjective dictionary search is necessary, for which the search time is 12.41 (see item 4 of Section 5.3). Hence, the total average cost may be computed as (l1 × L/2) + (l2 × Lp/2) + {(d × 12.41) + (c × 10^5)}.

2. The average cost of word replacement from the set R to either A(PCP1) or A(PCP2). We denote this cost as w2. Here also, a dictionary search is required. Note that the adjective forms A(PCP1) and A(PCP2) are derived from the verb part of speech; therefore, in this case the dictionary search time is 12.08. Hence the total average cost is (l1 × L/2) + (l2 × Lp/2) + {(d × 12.08) + (c × 10^5)}.

3. The average cost of morpho-word addition from the set {huaa, huye, huii}. This cost is denoted as w3. Since there are three morpho-words in all, the average cost may be formulated as (l1 × L/2) + (m × 3/2) + ψ.

4. The average cost of morpho-word deletion from the set {huaa, huye, huii}. We denote it as w4. This average cost is evaluated as (l1 × L/2) + (m × 3/2).


5. The average cost of suffix replacement from the set {aa, e, ii} is (l1 × L/2) + (k × 3/2) + (k × 3/2). We denote it as s1.

6. The average cost of suffix addition from the set {taa, te, tii}. This cost is denoted as s2, which is computed to be (l1 × L/2) + (k × 3/2).

7. The average cost of suffix replacement in the verb form of A(PCP2) by using the PCP form of the verb (see Appendix A). We denote it as s3. The total average cost is (l1 × L/2) + (k × 6/2) + (k × 6/2).

8. The average cost of suffix addition in the verb form of A(PCP2) by using the PCP form of the verb is (l1 × L/2) + (k × 8/2). We denote this cost as s4.

9. We denote the average cost of suffix replacement from the set {taa, te, tii} as s5, which is formulated as (l1 × L/2) + (k × 3/2) + (k × 3/2).

10. The average cost of suffix replacement from the set {aa, ye, ii} is (l1 × L/2) + (k × 3/2) + (k × 3/2). We denote it as s6.

11. The average cost of suffix replacement from the suffix set {taa, te, tii} to any of the suffixes required for the verb form of A(PCP2) (using the PCP verb form rules, see Appendix A). We denote it as s7. Since the number of suffixes required for the verb form of A(PCP2) is fourteen, the average cost of this operation may be formulated as (l1 × L/2) + (k × 14/2) + (k × 3/2).

12. The average cost of suffix replacement from any of the suffixes required for the verb form of A(PCP2) to {taa, te, tii} is (l1 × L/2) + (k × 3/2) + (k × 14/2) (as in item 11 above). We denote it as s8.

13. The average cost of suffix replacement from the verb form of A(PCP2) to the verb form of A(PCP2) (using the PCP verb form rules, see Appendix A) is (l1 × L/2) + (k × 8/2) + (k × 8/2). This cost is denoted by s9.

14. The average cost of suffix addition from the verb form of A(PCP2) to the verb form of A(PCP2). We denote it as s10, which may be formulated as (l1 × L/2) + (k × 6/2) (in a similar manner as item 13 above).
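The recurring shape of these cost expressions, a sentence-length term plus one half-of-set-size term per dictionary or suffix-set search, can be made explicit in code. The sketch below is a hypothetical rendering: the function names are our own, the constants l1, l2, k, d, c, L and Lp are defined in Chapter 5 of the thesis, and the values in the demo call are placeholders.

```python
# Hypothetical rendering of the recurring cost shapes; parameter values
# below are placeholders, not the thesis's calibrated constants.
def word_replacement_cost(l1, l2, d, c, L, Lp, search_time):
    """Items 1-2: (l1 * L/2) + (l2 * Lp/2) + (d * search_time + c * 1e5)."""
    return l1 * L / 2 + l2 * Lp / 2 + d * search_time + c * 1e5

def suffix_op_cost(l1, k, L, set_sizes):
    """Items 5-14: (l1 * L/2) plus one (k * n/2) term per suffix-set search."""
    return l1 * L / 2 + sum(k * n / 2 for n in set_sizes)

# e.g. the shape of s1, replacement within the 3-element set {aa, e, ii}:
print(suffix_op_cost(1.0, 2.0, 10.0, [3, 3]))  # 11.0
```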

Table E.1 gives the cost of pairwise modification from pre-modifier adjective to pre-modifier adjective, with reference to the adaptation rules in Table 2.10.

Input →    ABS                        A(PCP1)                            A(PCP2)
Ret’d ↓
ABS        (0 or s1 or (w1 + {s1}))   w2 + s2                            w2 + w3 + (s3 or s4)
A(PCP1)    w2 + {s1} + w4             (0 or ({w2} + {s5} + {s6}))        ((s7 + {s6}) or (w2 + s7 + {s6}))
A(PCP2)    w2 + {s1} + w4             ((s8 + {s6}) or (w2 + s8 + {s6}))  (0 or ({w2} + {s6} + {s9 or s10}))

Table E.1: Costs Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective

Bibliography

Ansell, M.: 2000, English Grammar: Explanations and Exercises, Second edn,
http://www.fortunecity.com/bally/durrus/153/gramdex.html.

Arnold, D. and Sadler, L.: 1990, Theoretical basis of MiMo, Machine Translation
5(3), 195–222.

Bender, E.: 1961, HINDI Grammar and Reader, University of Pennsylvania Press,
University of Pennsylvania South Asia Regional Studies, Philadelphia, Pennsyl-

vania.

Bennett, W. S.: 1990, How much semantics is necessary for MT systems?, Pro-
ceedings of the Third International Conference on Theoretical and Methodological
Issues in Machine Translation of Natural Languages, Vol. TX, Linguistics Re-
search Center, The University of Texas, Austin, pp. 261–269.

Bharati, A., Sriram, V., Krishna, A. V., Sangal, R. and Bendre, S.: 2002, An
algorithm for aligning sentences in bilingual corpora using lexical information,
International Conference on Natural Language Processing, Mumbai.

Brown, P. F.: 1990, A statistical approach to Machine Translation, Computational

Linguistics 16(2), 79–85.



Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., Lafferty, J. D. and Mercer, R. L.:
1992, Analysis, statistical transfer, and synthesis in machine translation, Pro-
ceedings of the Fourth International Conference on Theoretical and Methodolog-
ical Issues in Machine Translation of Natural Languages, Montreal, Canada,

pp. 83–100.

Brown, P. F., Pietra, S. A., Pietra, V. J. D., Pietra, D. and Mercer, R. L.: 1993, The
mathematics of statistical Machine Translation: parameter estimation, Compu-
tational Linguistics 19(2), 263–311.

Brown, P., Lai, J. C. and Mercer, R. L.: 1991, Aligning sentences in parallel cor-
pora, Proc. of 29th Annual Meeting of Association for Computational Linguistic,
Berkeley, pp. 169–176.

Brown, R. D.: 1996, Example-Based Machine Translation in the pangloss system,


Proceedings of the 16th International Conference on Computational Linguistics

(COLING-96), Copenhagen, Denmark, pp. 169–174.

Brown, R. D.: 1999, Adding linguistic knowledge to a lexical Example-Based


Translation System, Proceedings of the Eighth International Conference on Theo-
retical and Methodological Issues in Machine Translation (TMI-99), Chester, UK,

pp. 22–32.

Brown, R. D.: 2000, Automated generalization of translation examples, Proceedings


of the Eighteenth International Conference on Computational Linguistics, pp. 125–
131.

Brown, R. D.: 2001, Transfer-rule induction for Example-Based Translation, Proceedings of the MT Summit VIII Workshop on Example-Based Machine Translation, Santiago de Compostela, Spain, pp. 1–11.

Buchholz, S.: 2002, Memory-Based Grammatical Relation Finding, PhD thesis,


Tilburg University, Netherlands.

Carl, M. and Hansen, S.: 1999, Linking translation memories with Example-Based
Machine Translation, Proceedings of Machine Translation Summit VII, Singapore,
pp. 617–624.

Carl, M. and Way, A.: 2003, Advances in Example-Based Machine Translation

Series: Text, Speech and Language Technology, Vol. 21, Kluwer Academic Pub-
lishers, Netherlands.

Chatterjee, N.: 2001, A statistical approach to similarity measurement for EBMT,


Proceedings of STRANS-2001, IIT Kanpur, pp. 122–131.

Choueka, Y., Conley, E. S. and Dagan, I.: 2000, A comprehensive bilingual word
alignment system: Accommodating disparate languages: Hebrew and English, J.
Vronis (ed.): Parallel text processing, Kluwer Academic Publishers, Dordrecht.

Clough, P.: 2001, A Perl program for sentence splitting using rules,
www.ayre.ca/library/cl/files/sentenceSplitting.ps.

Collins, B.: 1998, Example-Based Machine Translation: an Adaptation-Guided


Retrieval Approach, PhD thesis, University of Dublin, Trinity College.

Collins, B. and Cunningham, P.: 1996, Adaptation guided retrieval in ebmt: A


case-based approach to Machine Translation, EWCBR, pp. 91–104.


Daelemans, W., Zavrel, J., Berck, P. and Gillis, S.: 1996, MBT: A memory-based
part of speech tagger-generator, Proceedings of the Fourth Workshop on Very
Large Corpora, E. Ejerhed and I. Dagan (eds.), Copenhagen, Denmark, pp. 14–
27.

Dave, S., Parikh, J. and Bhattacharya, P.: 2002, Interlingua Based English-Hindi
Machine Translation and language divergence, Journal of Machine Translation
(JMT) 17.

Doi, T. and Sumita, E.: 2003, Input sentence splitting and translating, HLT-NAACL

2003 Workshop: Building and Using Parallel Texts Data Driven machine Trans-
lation and Beyond, Edmonton, pp. 104–110.

Dorr, B. J.: 1993, Machine Translation: A View from the Lexicon, MIT Press,
Cambridge, MA.

Dorr, B. J., Jordan, P. W. and Benoit, J. W.: 1998, A survey of current paradigms

in Machine Translation, Technical Report LAMP-TR-027, UMIACS-TR-98-72,


CS-TR-3961, University of Maryland, College Park, USA.

Dorr, B. J., Pearl, L., Hwa, R. and Habash, N. Y. A.: 2002, DUSTer: A method for
unraveling cross-language divergences for statistical word level alignment., Pro-

ceedings of the Fifth Conference of the Association for Machine Translation in the
Americas, AMTA-2002, Tiburon, CA.

Fung, P. and McKeown, K.: 1996, A technical word-and term-translation aid using
noisy parallel corpora across language groups, The Machine Translation Journal,

Special Issue on New Tools for Human Translators, pp. 53–87.


Furuse, O., Yamada, S. and Yamamoto, K.: 1998, Splitting long or ill-formed input
for robust spoken-language translation, Proceedings of the Thirty-Sixth Annual
Meeting of the ACL and Seventeenth International Conference on Computational
Linguistics, pp. 421–427.

Gale, W. A. and Church, K. W.: 1991a, Identifying word correspondences in par-


allel texts, Proceedings of the Fourth DARPA Workshop on Speech and Natural
Language, Morgan Kaufmann Publishers, Inc., pp. 152–157.

Gale, W. A. and Church, K. W.: 1991b, A program for aligning sentences in bilingual

corpora, ACL ’91, Berkeley CA, pp. 177–184.

Gale, W. and Church, K.: 1993, A program for aligning sentences in bilingual cor-
pora, Computational Linguistics 19(1), 75–102.

George, D.: 2002, Automatic evaluation of machine translation quality using n-


gram co-occurrence statistics, Proceedings ARPA Workshop on Human Language

Technology.

Germann, U.: 2001, Building a Statistical Machine Translation system from scratch:
How much bang for the buck can we expect?, ACL 2001 Workshop on Data-Driven
Machine Translation, Toulouse, France.

Goyal, S., Gupta, D. and Chatterjee, N.: 2004, A study of Hindi translation pat-
terns for English sentences with have as the main verb, Proceedings of Interna-
tional Symposium on MT, NLP and Translation Support Systems: iSTRANS-
2004, CDEC and IIT Kanpur, Tata McGraw-Hill, New Delhi, pp. 46–51.

Grishman, R. and Kosaka, M.: 1992, Combining rationalist and empiricist approaches to Machine Translation, Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Montreal, Canada, pp. 263–274.

Gupta, D. and Chatterjee, N.: 2002, Study of similarity and its measurement for
English to Hindi EBMT, Proceedings of STRANS-2002, IIT Kanpur.

Gupta, D. and Chatterjee, N.: 2003a, Divergence in English to Hindi Translation:


Some studies, International Journal of Translation 15, 5–24.

Gupta, D. and Chatterjee, N.: 2003b, Identification of divergence for English to


Hindi EBMT, Proceedings of the MT SUMMIT IX, Orleans, LA, pp. 141–148.

Gupta, D. and Chatterjee, N.: 2003c, A morpho-syntax based adaptation and re-
trieval scheme for English to Hindi EBMT, Proceedings of Workshop on Com-
putational Linguistic for the Languages of South Asia: Expanding Synergies with
Europe, Budapest, Hungary, pp. 23–30.

Güvenir, H. A. and Cicekli, I.: 1998, Learning translation templates from examples, Information Systems 23, 353–363.

Habash, N.: 2003, Generation-Heavy Hybrid Machine Translation, PhD thesis,


University of Maryland, College Park.

Habash, N. and Dorr, B. J.: 2002, Handling translation divergences: Combining sta-

tistical and symbolic techniques in generation-heavy Machine Translation, Pro-


ceedings of the Fifth Conference of the Association for Machine Translation in the
Americas, AMTA-2002, Tiburon, CA.

Han, C.-h., Benoit, L., Martha, P., Owen, R., Kittredge, R., Korelsky, T., Kim,

N. and Kim, M.: 2000, Handling structural divergences and recovering dropped


arguments in a Korean/English machine translation system, Proceedings of the


Fourth Conference of the Association for Machine Translation in the Americas,
AMTA-2000, Cuernavaca, Mexico.

Hutchins, J.: 2003, The Oxford Handbook of Computational Linguistics, Oxford

University Press, chapter Machine translation: general overview, pp. 501–511.

Jain, R.: 1995, HEBMT: A Hybrid Example-Based Approach for Machine


Translation (Design and Implementation for Hindi to English), PhD thesis, I.I.T.
Kanpur.

Kachru, Y.: 1980, Aspects of Hindi Grammar, Manohar Publications, New Delhi.

Kellogg, R. S. and Bailey, T. G.: 1965, A Grammar of the Hindi Language, Rout-
ledge and Kegan Paul Ltd., London.

Kit, C., Pan, H. and Webster, J.: 2002, Example-Based Machine Translation: A

New Paradigm, Translation and Information Technology, Chinese U of HK Press,


pp. 57–78.

Leffa, V. J.: 1998, Clause processing in complex sentences, Proceedings of the First
International Conference on Language Resources and Evaluation, Vol. 1, pp. 937–
943.

Loomis, M. E. S.: 1997, Data Management and File Structures, second edn, Prentice
Hall of India Private Limited, New Delhi-110001.

Manning, C. and Schutze, H.: 1999, Foundations of Statistical Natural Language


Processing, The MIT Press, MA.


McEnery, A. M., Oakes, M. P. and Garside, R.: 1994, The use of approximate
string matching techniques in the alignment of sentences in parallel corpora, in
A. Vella (ed.), The Proceedings of Machine Translation: 10 Years On, University
of Cranfield.

McTait, K.: 2001, Translation Pattern Extraction and Recombination for Example-
Based Machine Translation, PhD thesis, Centre for Computational Linguistics
Department of Language Engineering, UMIST.

Nagao, M.: 1984, Artificial and Human Intelligence, North-Holland, chapter A Framework of a Mechanical Translation Between Japanese and English by Analogy Principle, pp. 173–180.

Nießen, S., Och, F. J., Leusch, G. and Ney, H.: 2000, An evaluation tool for machine translation: Fast evaluation for machine translation research, Proceedings of the Second Int. Conf. on Language Resources and Evaluation (LREC), Athens, Greece, pp. 39–45.

Nirenburg, S.: 1993, Example-Based Machine Translation, Proceedings of the Bar Ilan Symposium on Foundations of Artificial Intelligence, Bar Ilan University, Israel.

Nirenburg, S., Grannes, D. and Domashnev, C.: 1993, Two approaches to matching in Example-Based Machine Translation, Proceedings of TMI-93, Kyoto, Japan.

Oard, D. W.: 2003, The surprise language exercises, ACM Transactions on Asian Language Information Processing 2(2), 79–84.


Orăsan, C.: 2000, A hybrid method for clause splitting in unrestricted English texts,
Proceedings of ACIDCA 2000, Monastir, Tunisia.

Papineni, K. A., Roukos, S., Ward, T. and Zhu, W.-J.: 2001, BLEU: a method for automatic evaluation of machine translation, Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY.

Piperidis, S., Boutsis, S. and Papageorgiou, H.: 2000, From sentences to words and
clauses, Parallel text processing, Kluwer Academic Publishers, Dordrecht.

Puscasu, G.: 2004, A multilingual method for clause splitting, Proceedings of CLUK 2004, Birmingham, UK, pp. 199–206.

Quirk, R. and Greenbaum, S.: 1976, A University Grammar of English, English Language Book Society, Longman.

Rao, D.: 2001, Human aided Machine Translation from English to Hindi: The
MaTra project at NCST, Proceedings Symposium on Translation Support Sys-
tems, STRANS-2001, I.I.T. Kanpur.

Rao, D., Mohanraj, K., Hegde, J., Mehta, V. and Mahadane, P.: 2000, A practical framework for syntactic transfer of compound-complex sentences for English-Hindi Machine Translation, Proceedings of the Conf. on Knowledge Based Computer Systems, National Centre for Software Technology, Mumbai, pp. 343–354.

Resnik, P. and Yarowsky, D.: 2000, Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation, Natural Language Engineering 5(2), 113–133.


Sang, E. F. T. K. and Déjean, H.: 2001, Introduction to the CoNLL-2001 shared task: Clause identification, Proceedings of CoNLL-2001, Toulouse, France, pp. 53–57.

Sangal, R.: 2004, Shakti: IIIT-Hyderabad machine translation system (experimental), http://shakti.iiit.net/shakti/.

Sastri, S. and Apte, B.: 1968, Hindi Grammar, Dakshina Bharat Hindi Prachar
Sabha, Madras, India.

Sato, S.: 1992, CTM: An Example-Based Translation aid system, Proc. of COLING-1992, pp. 1259–1263.

Shimohata, M., Sumita, E. and Matsumoto, Y.: 2003, Retrieving meaning-equivalent sentences for Example-Based Rough Translation, HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, pp. 50–56.

Shirai, S., Bond, F. and Takahashi, Y.: 1997, A Hybrid Rule and Example-Based Method for Machine Translation, Proceedings of the 4th Natural Language Processing Pacific Rim Symposium: NLPRS-97, Phuket, Thailand, pp. 49–54.

Singh, S. B.: 2003, English-Hindi Translation Grammar, first edn, Prabhat Prakashan, 4/19 Asaf Ali Road, New Delhi-110002.

Sinha, R. and Jain, A.: 2003, AnglaHindi: An English to Hindi Machine Translation system, Proceedings of the MT SUMMIT IX, New Orleans, LA, pp. 23–27.

Sinha, R. M. K., Jain, R. and Jain, A.: 2002, An English to Hindi machine aided translation system based on ANGLABHARTI technology “ANGLA HINDI”, I.I.T. Kanpur, http://anglahindi.iitk.ac.in/translation.htm.


Somers, H.: 1997, Machine Translation and minority languages, Translating and the
Computer 19: Papers from the Aslib conference, London.

Somers, H.: 1998, Further experiments in bilingual text alignment, International Journal of Corpus Linguistics 3, 115–150.

Somers, H.: 1999, Review article: Example-Based Machine Translation, Machine Translation 14, 113–158.

Somers, H.: 2001, EBMT seen as case-based reasoning, MT Summit VIII Workshop
on Example-Based Machine Translation, Santiago de Compostela, Spain, pp. 56–
65.

Stetina, J., Kurohashi, S. and Nagao, M.: 1998, General word sense disambiguation method based on a full sentential context, Proceedings of COLING-ACL Workshop, Usage of WordNet [http://www.cogsci.princeton.edu/cgi-bin/webwn] in Natural Language Processing, Montreal, Canada.

Sumita, E.: 2001, Example-Based Machine Translation using DP-matching between word sequences, Proc. of the ACL 2001 Workshop on Data-Driven Methods in Machine Translation, pp. 1–8.

Sumita, E. and Iida, H.: 1991, Experiments and prospects of Example-Based Machine Translation, Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California, USA, pp. 185–192.

Sumita, E., Iida, H. and Kohyama, H.: 1990, Translating with examples: A new
approach to Machine Translation, TMI-1990, pp. 203–212.

Sumita, E. and Tsutsumi, Y.: 1988, A translation aid system using flexible text
retrieval based on syntax matching, Proceedings of TMI-88, CMU, Pittsburgh.


Takezawa, T.: 1999, Transformation into meaningful chunks by dividing or connecting utterance units, Journal of Natural Language Processing 6(2).

Thurmair, G.: 1990, Complex lexical transfer in METAL, Proceedings of the Third International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Linguistics Research Center, The University of Texas, Austin, TX, pp. 91–107.

Tillmann, C., Vogel, S., Ney, H., Zubiaga, A. and Sawaf, H.: 1997, Accelerated DP based search for statistical translation, European Conf. on Speech Communication and Technology, Rhodes, Greece, pp. 2667–2670.

Uchida, H. and Zhu, M.: 1998, The Universal Networking Language (UNL) specifications version 3.0, Technical report, United Nations University, Tokyo, http://www.unl.unu.edu/unlsys/unl/unls30.doc.

Veale, T. and Way, A.: 1997, Gaijin: A template-driven bootstrapping approach to Example-Based Machine Translation, International Conference, Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, pp. 239–244.

Vikas, O.: 2001, Technology development for Indian languages, Proceedings of Symposium on Translation Support Systems STRANS-2001, IIT Kanpur.

Watanabe, H.: 1992, A similarity-driven transfer system, Proceedings of the 14th COLING, pp. 770–776.

Watanabe, H., Kurohashi, S. and Aramaki, E.: 2000, Finding structural correspondences from bilingual parsed corpus for Corpus-Based Translation, Proceedings of COLING-2000, Saarbrücken, Germany.


Wiederhold, G.: 1987, File Organization for Database Design, McGraw-Hill Inc., New York, USA.

Wren, P., Martin, H. and Rao, N.: 1989, High School English Grammar, S. Chand & Co. Ltd., New Delhi.

Wu, D.: 1995, Large-scale automatic extraction of an English-Chinese translation lexicon, Machine Translation 9(3–4), 285–313.

About the Author

Ms. Deepa Gupta was born on July 5th, 1977. She obtained a Bachelor's Degree in Mathematics (honors) from L.B. College, University of Delhi, in 1997 with an overall score of 73.00%. She completed her Post Graduation in Mathematics in 1999 from Indian Institute of Technology Delhi with a C.G.P.A. of 7.70. She joined the Ph.D. programme of the Department of Mathematics at IIT Delhi in July 1999 as a Junior Research Fellow. Thereafter, in July 2001, she was awarded a Senior Research Fellowship. During her research tenure she presented various research articles and published seven research papers in national/international journals and conferences. She can be contacted at gupta_deepa@rediffmail.com or deepag_iitd@yahoo.com.

List of Publications

Published Paper(s) in Journal

1. Gupta D. and Chatterjee N. 2003. “Divergence in English to Hindi Translation: Some Studies”, International Journal of Translation, Vol. 15, pp. 5–24.

Published Papers in Conference

1. Gupta D. and Chatterjee N. 2001. “Study of Divergence for Example Based English-Hindi Machine Translation”, In the Proc. of STRANS 2001, IIT Kanpur, pp. 132–139.
2. Gupta D. and Chatterjee N. 2002. “Study of Similarity and its Measurement
for English to Hindi EBMT”, In the Proc. of STRANS-2002, IIT Kanpur.

3. Gupta D. and Chatterjee N. 2002. “A Systematic Adaptation Scheme for English-Hindi Example-Based Machine Translation”, In the Proc. of STRANS-2002, IIT Kanpur.

4. Gupta D. and Chatterjee N. 2003. “A Morpho-Syntax based Adaptation and Retrieval Scheme for English to Hindi EBMT”, In the Proc. of Workshop on Computational Linguistics for the Languages of South Asia: Expanding Synergies with Europe, Budapest, Hungary, pp. 23–30.

5. Gupta D. and Chatterjee N. 2003. “Identification of Divergence for English to Hindi EBMT”, In Proc. MT SUMMIT IX, New Orleans, LA, pp. 141–148.

6. Goyal S., Gupta D. and Chatterjee N. 2004. “A Study of Hindi Translation Patterns for English Sentences with “Have” as the Main Verb”, In the Proc. of International Symposium on MT, NLP and Translation Support Systems: iSTRANS-2004, New Delhi, pp. 46–51.

Communicated Paper(s)

For the international journal Machine Translation (Kluwer Academic Publishers):

1. FT and SPAC Based Algorithm for Identification of Divergence from Parallel Aligned Corpus for English to Hindi EBMT (Co-author Dr. Niladri Chatterjee)

2. Will Sentences have Divergence upon Translation?: A Corpus-Evidence based Solution for Example Based Approach (Co-authors Dr. Niladri Chatterjee and Shailly Goyal)
Honours and Awards

• Tata Infotech Research Fellowship of Rs. 12,000, from July 2002 to April 2004.

• Vice-President of the Mathematics Society, Indian Institute of Technology Delhi, from April 2001 to April 2003.

• Secured 85.04 percentile in the Graduate Aptitude Test in Engineering (GATE). GATE is a national-level entrance test for higher education in Technology and Basic Sciences, conducted by India's premier research institutes, IISc Bangalore, and the IITs.

• College topper in B.A. (Hons), IInd year, 1995-1996.
