
Engineering Machine Translation for

Deployment on Cloud

by

Rashid Ahmad
200807014
rashid.ahmed@research.iiit.ac.in

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE


REQUIREMENTS FOR THE DEGREE OF

Master of Science (by Research)


in
Computer Science & Engineering

Language Technologies Research Centre


International Institute of Information Technology
Hyderabad, India
March 2013
Copyright Rashid Ahmad, 2013
All Rights Reserved
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled Engineering Machine
Translation for Deployment on Cloud by Rashid Ahmad (200807014) submitted in
partial fulfillment for the award of the degree of Master of Science (by Research) in
Computer Science & Engineering, has been carried out under my supervision and has not been
submitted elsewhere for a degree.

Date: Advisor: Prof. Rajeev Sangal


Dedicated to my dearest ones for supporting
me at every stage of life
Acknowledgements
In my wonderful journey towards this thesis, I came to the most beautiful part
that gives me an opportunity to thank all the people who have contributed to this
thesis.

I would like to thank my supervisor Prof. Rajeev Sangal and Dr. Mukul Sinha, without whom this
thesis would not have been possible. Dr. Mukul Sinha is a Director of my company, Expert Software
Consultant Ltd., New Delhi, and was a visiting Professor at IIIT-H for spring 2011-12. I admire
Prof. Sangal as a good adviser, a leader, and a great human being; he is the one who taught us how
to give a wonderful presentation. Both of them always inspired me and had faith in my ability to
rise to the occasion and deliver the best work. I am grateful to the Technology Development for
Indian Languages (TDIL) Group, Dept. of Information Technology, Govt. of India, for
conceptualizing such a challenging Indian Language to Indian Language Machine Translation
(IL-ILMT) Project, and to the whole consortium for the valuable meetings and workshops which
taught me many professional things. I learnt a lot in terms of research, punctuality, and personal
conduct from Dr. Ramanan, the PRSG chairman of the ILMT Project.

It was a dream for me to do research on the topic of Software Engineering in the field of Natural
Language Processing (NLP), and a divine gift to have Dr. Mukul Sinha. I can definitely say that he
has put the utmost effort into driving this thesis, correcting and questioning me until it reached
the required level of quality. I am sure that this effort has materialized in my thesis; without
his support, perhaps I would have reached some other end. My heartfelt thanks to him for allowing
me to think at an abstract level, and for pushing me to the extreme of getting into the minutest
of details and delivering research in practice. He has awakened the researcher in me by
continuously questioning my ideas for correctness and teaching me that one should be proud of
being an engineer.

My special thanks to Mr. Pawan Kumar (Sirji) for reviewing my thesis a number of times,
discussing the various viewpoints of my thesis, and analyzing and commenting on it and beyond it.
Your motivation and support during the period of this thesis were extraordinary, and your witty
humor and scolding helped me stay happy and encouraged.

I thank the LTRC staff, who made my research journey at LTRC most comfortable: Mr. Srinivas Ji
for the wonderful lab, Mr. Rambabu Ji for administrative issues, Mr. Satish for accommodation and
tickets, Mr. Kumara Swamy for the most important work, invoice clearance, and Mr. Lakshmi Narayan
for general and all other issues. Thanks to Appaji sir and Kishore sir for making my life easier
at IIIT. I would like to thank all the members of Expert for providing their valuable suggestions
throughout the thesis: Mr. Arun Kumar, B. Rambabu, Kumar Avinash Singh, Phani Sajja, Sanket Kumar
Pathak, and all others. Mr. Sanket Kumar (Pathak Ji) needs a special mention for being so nice to
me and supporting me in all circumstances.

My special thanks to Mr. Vinod Singh ji from IBM Bangalore for reviewing my thesis, especially
the distributed part of my research (MapReduce, Virtual Machines, and Cloud).

Finally and most importantly, I would like to thank my parents, brothers, wife, and loving
daughter Maryam (Maria), one and a half years old, for their unprecedented love and support
throughout the thesis (I cannot express my thanks in words here).
Abstract

We describe the Symbiotic Software Engineering approach to produce field-deployable and
maintainable software systems using diverse modules developed out of a loosely tied distributed
effort. These systems are heterogeneous, made of modules written in different programming
languages and at different levels of software engineering. The approach has been successfully
applied to Machine Translation (MT) systems for nine major Indic languages in eighteen
directions, developed by a consortium of eleven academic/research institutions. As a result, we
have engineered and deployed the Sampark MT systems at http://sampark.org.in.

Further, a tool called Dashboard has been developed, based on a pipelined blackboard
architecture, for the integration and testing of Natural Language Processing (NLP) applications.
The Dashboard helps in testing a module in isolation, as well as in the integration and testing
of the complete integrated system. Functional module developers can avail of and configure, if
they so desire, the dynamic cache facility offered by the tool; this caching facility improves
module performance by up to 20%. The tool is also equipped with a user-friendly visualization
facility to build, test, and integrate a system (or a subsystem) and to view its component-wise
performance and step-wise processing.

We also describe an approach to make the MapReduce framework applicable to computationally
intensive applications like MT systems. This framework helps enhance the throughput of an MT
system uniformly by deploying it on a large cluster of physical or virtual machines. The cloud
gives us on-demand computing resources, so with the availability of elastic computing resources
through a cloud environment, any MT job, irrespective of its size, can be completed within a
short time.
Publication
Portions of this thesis are based on the following papers:

Re-engineering Machine Translation Systems through Symbiotic Approach. Pawan Kumar, Rashid
Ahmad, Arun Kumar, Mukul K Sinha, Rajeev Sangal. In 3rd International Conference on Contemporary
Computing (IC3 2010), August 2010. (Report no: IIIT/TR/2010/85)

Dashboard: An Integration and Testing Platform based on Blackboard Architecture for NLP
Applications. Pawan Kumar, Arun Kumar, Rashid Ahmad, Mukul K Sinha, Rajeev Sangal. In 6th
International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE
2010), August 2010. (Report no: IIIT/TR/2010/84)

Provision of a Cache by a System Integration and Deployment Platform to Enhance the Performance
of Compute-Intensive NLP Applications. Rashid Ahmad, Arun Kumar, Rambabu B, Pawan Kumar, Mukul K
Sinha, Rajeev Sangal. In African Conference on Software Engineering & Applied Computing (ACSEAC
2011), September 2011. (Report no: IIIT/TR/2011/103)

Enhancing Throughput of a Machine Translation System using MapReduce Framework: An Engineering
Approach. Rashid Ahmad, Pawan Kumar, Rambabu B, Phani Sajja, Mukul K Sinha, Rajeev Sangal. In 9th
International Conference on Natural Language Processing (ICON-2011), December 2011. (Report no:
IIIT/TR/2011/102)

Machine Translation System as Virtual Appliance: For Scalable Service Deployment on Cloud. Pawan
Kumar, Rashid Ahmad, B. D. Chaudhary, Rajeev Sangal. In 7th IEEE International Symposium on
Service-Oriented System Engineering (IEEE SOSE 2013), San Francisco Bay, USA, March 25-28, 2013.
(Report no: IIIT/TR/2013/ )
Contents
Chapters Page
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Approach & Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Engineering Machine Translation (MT) Systems . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Field Deployable, Maintainable and Software Engineering Attributes . . 8
2.3.1 Field Deployable & Maintainable . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Software Engineering Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Our Symbiotic Software Engineering Approach . . . . . . . . . . . . . . . . . . 10
2.4.1 Symbiotic Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.2 Module Engineering Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.3 System Integration to Deliver MT System . . . . . . . . . . . . . . . . . . 13
2.4.4 Dashboard Development Infrastructure . . . . . . . . . . . . . . . . . . . 14
2.5 Practical Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Experiences of Symbiotic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Dashboard Tool for MT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Infrastructural Approaches & Objective Requirement . . . . . . . . . . . . . 18

3.3 Dashboard and Sampark MT Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 Specificity of NLP/AI Applications . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.2 The Blackboard Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Dashboard and its Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.1 Requirement Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5.2 Dashboard: An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.3 Implementation Details of Dashboard . . . . . . . . . . . . . . . . . . . . . 27
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.7 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4 Caching for MT Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Problem Domain Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Middleware and Dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.1 Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.2 AI Applications and Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.3 Dashboard as Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Caching Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4.1 Cache and NLP Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4.2 Cache in Dashboard Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5 Enhancing Throughput of MT System using MapReduce . . . . . . . . . . . . . . . . 39
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 MapReduce Framework: Strengths and Limitations . . . . . . . . . . . . . . . 40
5.2.1 MapReduce Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.2 Adaptations of MapReduce Framework . . . . . . . . . . . . . . . . . . . . 41
5.2.3 Statistical MT & MapReduce Framework . . . . . . . . . . . . . . . . . . . 41
5.2.4 Limitations of MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Transfer Based MT System and MapReduce Framework . . . . . . . . . . . 42
5.3.1 Finer Granularity of Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3.2 MT as a List Homomorphism & Compute-intensive . . . . . . . . . . 42
5.3.3 MT as a Dedicated Web Application . . . . . . . . . . . . . . . . . . . . . . . 42
5.3.4 Our Approach to run MT under Hadoop . . . . . . . . . . . . . . . . . . . 42
5.4 Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6 MT on the Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 Background: Sampark MT System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.4 Deploying MT System on the Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.5 Our Approach to run MT under Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.6 Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8 Appendix A: Software Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
9 Appendix B: Eucalyptus Installation and User Guide . . . . . . . . . . . . . . . . . . . 74
10 Appendix C: Hadoop Installation and User Guide . . . . . . . . . . . . . . . . . . . . . 88
11 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

List of Figures

2.1 Sampark Machine Translation Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Symbiotic Approach Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 Engine-Language (E-L) Separations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Dashboard Showing Translation of Hindi text into Urdu . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Dashboard Showing Input and Output of Hindi Morph . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3 Dashboard Showing Input and Output of Hindi POS Tagger . . . . . . . . . . . . . . . . . . . . . . . 26

4.1 Cache Miss Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.2 Cache Hit Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.1 MapReduce Approach for MT System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.2 Total Time vs. Number of Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

6.1 MapReduce Approach for MT System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6.2 Total Time vs. Number of Virtual Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

List of Tables

2.1 Software Engineering Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Engineering Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.1 Lexical Transfer Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.2 Word Generator Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.3 Full Translation System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.1 Throughput Data Set and Result of Stand-alone PC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.2 Running Time for Job (input data set of 100 sentences) . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.3 Running Time for Job (input data set of 200 sentences) . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.4 Running Time for Job (input data set of 500 sentences) . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.5 Running Time for Job (input data set of 1000 sentences) . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.1 Throughput Results on Stand-alone PC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.2 Throughput Results on the Cloud with Increasing No. of VM . . . . . . . . . . . . . . . . . . . . . . 55

Chapter 1
Introduction
In this chapter, we present an introduction to our work on engineering machine translation (MT)
systems. We describe the background in Section 1.1 and present the problem statement in Section
1.2. Section 1.3 summarizes the main contributions of this thesis, and Section 1.4 describes its
organization.

1.1 Background and Motivation


India is a country of more than 1.3 billion people, with 22 official languages and more than 12
scripts. With so much language diversity, there was an imperative social need for the government
to make digital content available in one language accessible in other languages to its citizens.
Consequently, the Technology Development for Indian Languages (TDIL) Group, Dept. of Information
Technology, Govt. of India formed a consortium1 of eleven academic/research institutions for
delivering eighteen Sampark2 MT systems as part of the Indian Language to Indian Language Machine
Translation (ILMT) Project, covering nine Indic language pairs, viz.: Hindi <> {Punjabi, Bangla,
Marathi, Urdu, Tamil, Telugu, Kannada}; Malayalam <> Tamil; and Tamil <> Telugu; in two domains,
tourism and health.

Development of a natural language processing (NLP) application, such as Sampark, is usually a
team effort of NLP experts (who design and tune the software architecture of the application),
computational linguists (who take the decisions related to the language resources, coverage,
rules, corpora, etc.), and software engineers (who implement the various algorithms). Unlike in a
generic software development process [1], the knowledge cannot be handed over to software
vendors, so continuous involvement of NLP experts and computational linguists in the development
process is imperative. Indeed, the success of an NLP application depends largely on them, as it
is essentially their design and visualization. The distinguishing features of NLP applications
[2] are listed below:

1. The consortium institutions are: IIIT Hyderabad, University of Hyderabad, CDAC (Noida, Pune),
Anna University KBC Chennai, IIT Bombay, Jadavpur University, IIT Kharagpur, Tamil University,
IIIT Allahabad, and IISc Bangalore.
2. Sampark is an MT system among Indian languages. It can be accessed at http://sampark.org.in.

No Concept of Correct Input or Correct Output: An NLP application does not have a strictly
correct input or correct output. A partial or grammatically wrong sentence must still be handled
by an MT system; similarly, the output need not be grammatical.

Absence of a Single Right Answer: There is no concept of a single correct translation as the
output. Many different translations might be acceptable. Whether an output is acceptable or not
is validated for accuracy by human evaluators, who give it a score for its comprehensibility.

Continuous Improvement: Unlike a generic software application, an NLP application requires
continuous improvement in its accuracy as well as its performance. Accuracy is improved by
improving the corpora, tuning the rules, improving the algorithms of the modules, or any
combination thereof. This requires continued participation of NLP experts, computational
linguists, and language and software engineers during development, and for a long period after
the first release.

Architectural Resilience and Robustness: Some modules of an application may have bugs and get
into an indefinite loop. This requires a facility for recovering from such situations, so that
the modules that follow continue to work. A degraded output is better than no output.

Module Composition: Most of the core NLP modules have an Engine part and a corpora (or rules)
part. While the former is language independent, the latter is language dependent or language-pair
dependent (E-L separation). Thus, an engine can be made applicable to another natural language by
incorporating the corpora (rules) of the new language. For modules with E-L separation, the
development of the engine part is a generic horizontal task done by one institution, while the
language-specific corpus/rules are a language vertical task to be done by the institutions
responsible for that language.

Against this background, the consortium visualized the need for the participation of a software
development company, called the Software Engineering Group (SEG), to bring in professional
software engineering practices [3] and to work with the consortium, not as a vendor, but in a
symbiotic mode [4]: facilitating the engineering of the Sampark systems at each step, deploying
the Sampark systems on the web, and maintaining them thereafter.

1.2 Problem Statement


The consortium, led by a coordinator, is given the responsibility to deliver and deploy all
Sampark systems for public usage. The consortium (each member having its own team of NLP experts,
computational linguists, and software developers) has subdivided its tasks into horizontal tasks,
comprising generic language-independent MT modules, and vertical tasks, comprising the
language-specific parts of different modules. For each language, one distinct institution is
assigned the responsibility to provide all language-specific corpora and rules for the various
modules.

For the last two decades, the Government has funded all the participating institutions to develop
NLP components and language resources for a majority of Indic languages. It was proposed that all
these available components, though written in different programming languages, must be re-used
and re-engineered to get robust systems with high accuracy, in less time and at less cost.

As all the chosen Indic languages have evolved from the classical language Sanskrit, the MT
systems can be built on Panini's framework [5]. Correspondingly, the consortium developed a
common software architectural framework, re-used for all eighteen language pairs.

The mandate of the ILMT Project is that every Sampark MT system should be a field-deployable and
maintainable product.

In addition to the above requirements, researchers [2, 6] have highlighted that for any generic
NLP application, some distinctive software engineering issues need to be tackled as well.
Therefore, a symbiotic software engineering approach has been adopted for the development of the
proposed Sampark MT systems.

To summarize, the research objectives of this thesis are formulated as follows:

- Engineer NLP modules previously developed in a laboratory environment into Sampark MT systems
covering nine pairs of Indic languages in the health and tourism domains
- Improve the performance (speed) of the system without tinkering with the modules
- Improve the throughput of the system using the Hadoop MapReduce framework
- Deploy the system on a large cluster of physical or virtual machines in a cloud environment

1.3 Approach and Contributions of this Thesis


Various approaches are used to achieve the research objectives.

1.3.1 Approach

Symbiotic Software Engineering Approach


We use a symbiotic software engineering approach to engineer NLP modules previously developed in
a laboratory environment into the Sampark MT systems. This approach is discussed in detail in
Chapter 2.

Cache Mechanism
We use a caching mechanism to improve the performance of NLP modules; with it, we achieve a
15-20% improvement in speed. This approach is discussed in detail in Chapter 4; a brief sketch of
the idea follows.
Hadoop MapReduce Framework
We use a MapReduce approach to run the MT system under the Hadoop framework to improve the
throughput of the system. With this approach, the system's throughput increases proportionally to
the capacity added, provided computing resources are available; there is a minimum completion
time for a given task granularity. This approach is discussed in detail in Chapter 5; a sketch of
the arrangement follows.
Virtual Appliance
We have a mechanism to build a virtual appliance of the MT system for deployment on the cloud.
The MT virtual appliance can be deployed on virtual machines in the cloud as well, with a
significant reduction in deployment time: from hours down to a few minutes. This approach is
discussed in detail in Chapter 6.

1.3.2 Contribution

The contributions of this thesis are listed below.

First, we developed a methodology to take natural language modules previously developed in a
laboratory environment and produce field-deployable and maintainable MT systems (and modules)
that can be reused as customizable components.

Second, we developed a software integration and deployment platform for NLP applications, called
Dashboard, which has been enhanced to provide functional module developers with mechanisms to
avail of and configure, if they so desire, the dynamic cache facility offered by the platform to
improve the performance of their modules.

Third, we use the Hadoop MapReduce framework to enhance the throughput of the MT system without
tinkering with the modules.

Finally, we built the MT system as a virtual appliance for deployment on a standalone virtual
machine or on virtual machines available in the cloud.

1.4 Organization of the Thesis
This thesis is organized as follows:

In Chapter 2, we describe the Symbiotic Software Engineering approach to produce field-deployable
and maintainable software systems using diverse modules developed out of a loosely tied
distributed effort. These systems are heterogeneous, made of modules written in different
programming languages and at different levels of software engineering. The approach has been
successfully applied to Machine Translation (MT) systems for nine major Indic languages in
eighteen directions, developed by a consortium of eleven academic/research institutions.

In Chapter 3, we present a tool called Dashboard, based on a pipelined blackboard architecture,
for the integration and testing of Natural Language Processing (NLP) applications. The Dashboard
helps in testing a module in isolation, as well as in the integration and testing of the complete
integrated system.

In Chapter 4, we explain the caching mechanism for computationally intensive modules. Any
functional module developer can avail of and configure, if they so desire, the dynamic cache
facility offered by the Dashboard tool.

In Chapter 5, we describe an approach to make the MapReduce framework applicable to
computationally intensive applications like MT systems. This framework helps enhance the
throughput of an MT system uniformly by deploying it on a large cluster of physical or virtual
machines.

In Chapter 6, we describe how to package the MT system, along with Hadoop as middleware, as a
virtual appliance for deployment on the cloud. As the cloud gives us on-demand computing
resources, with the availability of elastic computing resources the completion time of any MT
job, irrespective of its size, is approximately known for a given hardware configuration (virtual
machine). Simply put, with elastic resources, large translation jobs can be completed in a short
time.

In Chapter 7, we present the conclusions on the symbiotic software engineering methodology, the
development of a tool that facilitates the symbiotic approach, and the use of MapReduce and the
cloud for a compute-intensive application. We then summarize the contributions of this thesis and
provide some perspectives on future directions.

Chapter 2
Engineering Machine Translation Systems
In this chapter, we present the symbiotic approach to engineering computationally intensive
applications like MT systems. We introduce the Sampark MT system in Section 2.1 and survey prior
work in Section 2.2. Section 2.3 describes the mandate of the Sampark systems, and Section 2.4
the symbiotic engineering approach. Section 2.5 describes practical measures, and Section 2.6
presents our experience with the symbiotic approach.

2.1 Introduction
For the Sampark system, the proposed system architecture [7] is a pipeline architecture
comprising fourteen major modules and some optional language-specific modules. Sampark is a
hybrid system consisting of traditional rule-based algorithms and dictionaries along with newer
statistical machine-learning techniques. It consists of three major parts and twelve modules
arranged in a pipeline.

Figure 2.1 Sampark Machine Translation Architecture


A) Source Analysis
Tokenizer: Converts text into a sequence of tokens (words, punctuation marks, etc.) in Shakti
Standard Format [10].
Morphological analyzer: Uses rules to identify the root and grammatical features of a word. It
splits the word into its root and grammatical suffixes.
Part of speech tagger: Based on statistical techniques, assigns a part of speech, such as noun,
verb or adjective, to each word.
Chunker: Uses statistical methods to identify parts of a sentence, such as noun phrases, verb
groups, and adjectival phrases, and a rule base to give each a suitable chunk tag.
Named entity recognizer: Identifies and tags entities such as names of persons and organizations.
Simple parser: Identifies and names relations between a verb and its participants in the sentence,
based on the Computational Paninian Grammar framework.
Word sense disambiguation: Identifies the correct sense of a word, such as whether "bank" refers
to a financial institution or the side of a river.

B) Transfer
Syntax transfer: Converts the parse structure in the source language to the structure in the target
language that gives the correct word order, as well as a change in structure, if any.
Lexical transfer: Root words identified by the morphological analyzer are looked up in a bilingual
dictionary for the target language equivalent.
Transliteration: Allows a source word to be rendered in the script of the target language. It is
useful in cases where translation fails for a word or a chunk.

C) Target Generation
Agreement: Performs gender-number-person agreement between related words in the target
sentence.
Insertion of Vibhakti: Adds postpositions and other markers that indicate the meanings of words
in the sentence.
Word generator: Takes root words and their associated grammatical features, generates the
appropriate suffixes and concatenates them. It combines the generated words into a sentence.
To start with, the consortium had multiple copies of most of the modules of the ILMT Project,
developed by different institutions, with no metrics to choose one over another. All the modules,
having been developed as research and development efforts, have only skeletal manuals. Further,
many of the modules which ideally should have been structured with engine-language (E-L)
separation might not be so, as they were developed for a specific-language MT system. Hence,
re-engineering is required to disentangle the engine portion from the language portions.
As NLP applications are very complex, with a large number of modules, it is imperative to develop
the complete system under a software development infrastructure [8, 9] which facilitates the
software development paradigm being applied. It is assumed that most of the available modules are
neither comprehensively tested, nor maintainable, nor appropriately packaged. The consortium
decided that the available version of a module would be taken as the initial version. Each module
should be engineered step-wise, by the development team of the module owner itself, but in a
symbiotic relation with the software engineering group (SEG) associated with the project, to make
it a field-deployable and maintainable product.

2.2 Prior Work


In conventional software development, the software architecture and module specifications result
from analysis and design; for an NLP application, the architecture is decided a priori. For the
Sampark system, the consortium decided to adopt the blackboard architecture [9, 17], as it had to
re-use a previously developed heterogeneous set of modules. In this architecture, the
heterogeneity of modules does not affect their operation, as all of them work on a common
in-memory data structure. In a blackboard architecture there is no fixed order of module
execution, but the consortium restricted the control to a pre-specified pipeline architecture.

The blackboard architecture gives a flexible environment for module integration, where each
module is a pluggable component and one module can easily be replaced by another version of the
same module. Another advantage is robustness: the system continues to function even after an
intermediate module has failed. The following module in the pipe can still work on the existing
in-memory data structure, and the system can proceed further, providing graceful degradation.

The consortium has adopted the Shakti Standard Format (SSF) [10] for the in-memory data structure
of the blackboard. SSF has a text notation to provide human readability. SSF is extensible, and
can be used in other NLP applications as a common representation for inter-module data exchange.
A sketch of the pipelined blackboard control appears below.
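In the sketch, modules run in a fixed order and all read and write one shared representation
(here simply a String holding SSF text); a module that fails has its output discarded, and the
next module works on the last good state of the blackboard. All names are illustrative.

    import java.util.List;
    import java.util.function.UnaryOperator;

    /* A sketch of pipelined blackboard control with graceful degradation. */
    public class Pipeline {
        public static String run(List<UnaryOperator<String>> modules, String ssf) {
            for (UnaryOperator<String> module : modules) {
                try {
                    ssf = module.apply(ssf);    // module updates the blackboard
                } catch (RuntimeException e) {
                    // graceful degradation: keep the last good blackboard state
                    System.err.println("module failed, continuing: " + e);
                }
            }
            return ssf;
        }

        public static void main(String[] args) {
            List<UnaryOperator<String>> mt = List.of(
                    s -> s + "\t; tokenized",
                    s -> { throw new RuntimeException("tagger bug"); },
                    s -> s + "\t; chunked");
            // the failed middle stage is skipped; the chunker still runs
            System.out.println(run(mt, "raama ghara gayaa"));
        }
    }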

2.3 Field Deployable, Maintainable and Software Engineering Attributes
To make the Sampark system a field-deployable software product, it should first be well
engineered; then tested, first module-wise and then as an integrated system; and finally packaged
for field deployment.

2.3.1 Field Deployable and Maintainable
This means that the software product has strictly followed standard software engineering
processes during all development phases [3], with the final product having the following
components: System Requirement Specification (SRS), Software Design Document (SDD), complete
source code (following standard coding practice), block-level documentation and in-line comments,
build and configure procedures for each module, validation suites for each module, the integrated
product, a distribution package of the product, and user and installation manuals for the
software product.

A well-engineered software product assures quality, with its modules appropriately abstracted and
well encapsulated, assuring re-usability.

A software product is maintainable if it is correctable, adaptable, perfectable, and sustainable
[11]. A software product is correctable and adaptable if it is well engineered, but it is
sustainable only when it can be perfected without any structural change to the software
architecture. For the Sampark system, its pre-specified architecture assures sustainability.
Perfective maintenance of the Sampark system requires continuous performance improvement, not by
changing code, but by improving/tuning language-specific (or language-pair-specific) data and
rules, and domain-specific data/corpora; thus, engine-language separation becomes an imperative
requirement for the maintainability of the Sampark system.

2.3.2 Software Engineering Attributes


Given the initial state of the components of the Sampark system, the proposed system can be
assured to be a field-deployable and maintainable software product if each module of the final
system satisfies the following software engineering attributes:
Table 2.1 Software Engineering Attributes

S.No. Software Engineering Attributes


1. Software Functional Specification (SFS)
2. No hard-coded data and no magic numbers
3. In-line and block level comments
4. Command line parsing using getopt(), Unix filter like interface
5. Error/exception handling and logging for bug traceability
6. Test suites
7. Engine-language (E-L) separation
8. Common representation for inter-module data exchange
9. Makefile for build and configure
10. Distribution/packaging using rpm

The above software engineering attributes are given in detail in Appendix A. In addition, the
Sampark system must have all the documents specified in Section 2.3.1.

2.4 Symbiotic Software Engineering Approach
At the initial stage of the Sampark system, we had a good number of modules available for reuse,
yet none of them qualified as a component [12]. The engineering activities are to be done by the
same institution that developed the earlier version, and with the continuity of the main
researcher, no reverse engineering [13] task is required prior to engineering any module.
Therefore, we (the SEG) associated mainly to facilitate these institutions, in symbiotic mode, in
engineering their horizontal (engine/common program) and vertical (language-specific) task
components.

First, we collated all the candidate modules and language resources, along with their available
documentation, and did an initial code walkthrough with the team of the module/resource owner.
This was followed by a series of workshops, on both software engineering and NLP issues, with the
participation of all relevant researchers. It was strongly felt that the availability of a
software development infrastructure tool was essential for the consortium members, so that their
independently developed heterogeneous modules could communicate with each other, mainly to build
the final system, and also to follow a common software engineering approach.

Thus, we conceptualized and propose the symbiotic software engineering paradigm for engineering
NLP/AI systems.

2.4.1 Symbiotic Process

We gathered the input and output specifications of each module with the help of NLP experts and
computational linguists. To validate that a module works on our system exactly as on the
developer's home machine, we ran the input/output test data supplied by the developer, and
identified the program code that relates to functional requirements, non-functional requirements,
and error/exception handling. We arrived at the following symbiotic processes for carrying out
the engineering tasks on each module with its actual developers:

1. Identify → Explain → Build
2. Identify → Explain → Suggest → Build
3. Identify → Explain → Suggest → Guide → Build
4. Identify → Explain → Suggest → Guide → Co-create

Figure 2.2 Symbiotic Approach Process

2.4.2 Module Engineering Tasks


In the symbiotic software engineering approach, there are six engineering tasks, listed in Table
2.2, that need to be done on each of the candidate modules. Each of these six tasks has to be
done by the owner of the respective module, with symbiotic support from us. Once a task is done,
it is audited by us before being accepted for further engineering. The consortium has given us
the right to reject a module if it does not clear the audit process. The task of testing and
integrating the complete MT system for any language pair rests with us.

Table 2.2 Engineering Task

S.No. Engineering Task


1. Enhance Usability
2. Improve Documentation
3. Improve Robustness and Traceability
4. Engine and Language(E-L) Separation
5. Module Level Inter-operability
6. Program Restructuring

These engineering tasks are elaborated below.

1. Enhance Usability: To enhance usability, a module must be portable to any other raw machine
and must parse its command line using getopt(). In addition, it must possess file-level
input/output interoperability by having a Unix-filter-like interface.

The module owner must read the source code, mainly to identify the file/data access points, the
hard-coded data access points, inappropriate string sizes, any magic numbers, etc. The code has
to be modified, replacing hard-coded data with configurable parameters read from the module's
configuration file. After completion of this task, a module has the following attributes: no
hard-coded data, no magic numbers, command-line parsing using getopt(), and a Unix-filter-like
interface.

To audit enhanced usability, we run the module on a raw platform first. It may run, or it may
fail with errors such as a segmentation fault; in case of error, we inform the owner, along with
suggestions. The sketch below shows the target shape of such a module.
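A minimal sketch of the target shape in Java is given below (the C modules achieve the flag
parsing with getopt(3); the JDK has no getopt, so flags are parsed by hand here). The
configuration file name and property key are illustrative.

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    /* A Unix-style filter (stdin to stdout) whose formerly hard-coded values
     * come from a configuration file instead. */
    public class FilterModule {
        public static void main(String[] args) throws IOException {
            String configPath = "module.conf";        // default, overridable
            for (int i = 0; i < args.length; i++) {
                if ("-c".equals(args[i]) && i + 1 < args.length) {
                    configPath = args[++i];
                }
            }
            Properties conf = new Properties();
            try (Reader r = new FileReader(configPath, StandardCharsets.UTF_8)) {
                conf.load(r);                 // e.g. dictionary=/path/to/dict
            }
            String dict = conf.getProperty("dictionary", "default.dict");

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8));
            PrintWriter out = new PrintWriter(new OutputStreamWriter(
                    System.out, StandardCharsets.UTF_8), true);
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);    // a real module would analyze the line
            }                         // using the resources named in dict
        }
    }

Because the module reads stdin and writes stdout, it can be tested in isolation from the shell
and chained with other modules, exactly as Unix filters are.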
2. Improve Documentation: To improve documentation, for modules written in C and Perl, we request
the developers to provide a first-level DFD and flow chart, which we then refine. For modules in
Java, the developers are requested to provide class and sequence diagrams.

To audit, we try to read the source code of each module, mainly the control flow, the data
variables (global/local), and the data input and output; we deliberately do not try to understand
the program logic. At the completion of this task, the following artifacts are produced (for
modules written in C and Perl): a limited data flow diagram (DFD), a control flow diagram,
extracted clusters of data items showing data flow levels and module structure, and an improved
SRS version 0.1. These documents are additionally authenticated by the NLP experts/computational
linguists of the institutions. After these documents are available, the software engineering
attribute gained by the module is: module-level analyzability.
3. Improve Robustness and Bug Traceability: As part of this engineering task, developers are
expected to cover all errors and exceptions in their modules to improve robustness. In addition,
they are requested to write log statements at various points in each module to enhance bug
traceability. After completing this task, the module has error and exception handling statements
covering all possible cases, and log statements inserted at various points for bug traceability.
4. Engine and Language (E-L) Separation: For this engineering task, the developers of such
modules have to disengage the engine portion of the code from the language resources/rules. For a
module, the E-L separation is complete only when the engine can be tested independently with at
least two different language resources. After completion of this re-engineering, we have
language-dependent modules with E-L separation, and the corresponding software engineering
attribute: E-L separation for language-dependent modules.

Figure 2.3 Engine-Language (E-L) Separations

5. Module-Level Inter-operability: The module developer has to ensure that their module complies
with the SSF interface, i.e., all modules must accept SSF as input and produce SSF as output. In
those cases where the SSF data representation does not suit a module's computation model, the
module may run on a data representation suitable to it; to provide inter-operability of such
modules in the pipelined blackboard architecture, we provide SSF wrappers that make them
SSF-interface compliant (a sketch follows). After completion of this task, all modules are
provided with a Makefile for build and configure. Once a module reaches this stage, it is ready
as a component for independent testing and validation.

Consequently, modules are validated by us against module-specific validation data, and validation
reports are provided. After successful validation, we have the following artifacts:
SSF-interface-compliant modules, modules with Makefiles for build and configure, module
validation data and validation reports, and an improved SRS version 0.2. The software engineering
attributes after this stage are: common representation for inter-module data exchange, and
Makefile for build and configure.
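The sketch below shows the shape of such a wrapper, assuming the wrapped module keeps a native
one-token-per-line format. The SSF handling is deliberately simplified (tab-separated
address/token/tag columns); real SSF also carries tree structure and feature structures.

    import java.util.function.UnaryOperator;

    /* An SSF wrapper: converts SSF to the module's native format, runs the
     * module, and converts the result back, making the module pluggable. */
    public class SsfWrapper implements UnaryOperator<String> {
        private final UnaryOperator<String> nativeModule;  // plain-token module

        public SsfWrapper(UnaryOperator<String> nativeModule) {
            this.nativeModule = nativeModule;
        }

        @Override
        public String apply(String ssf) {
            String nativeIn = fromSsf(ssf);          // SSF -> native format
            String nativeOut = nativeModule.apply(nativeIn);
            return toSsf(nativeOut);                 // native format -> SSF
        }

        private String fromSsf(String ssf) {
            StringBuilder sb = new StringBuilder();
            for (String row : ssf.split("\n")) {
                String[] cols = row.split("\t");     // addr, token, tag, ...
                if (cols.length >= 2) sb.append(cols[1]).append('\n');
            }
            return sb.toString();
        }

        private String toSsf(String nativeOut) {
            StringBuilder sb = new StringBuilder();
            int addr = 1;
            for (String token : nativeOut.split("\n")) {
                sb.append(addr++).append('\t').append(token).append('\n');
            }
            return sb.toString();
        }
    }

A wrapped module is then a drop-in stage for the pipeline control sketched in Section 2.2.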
6. Program Restructuring: After a module has become an independent component, its source code can
be restructured to make its embedded program structure and DFD explicit. This results in a module
structured as multiple internal sub-modules/subroutines. Block-level documentation and in-line
comments are written, and the group also ensures that the module is maintainable with respect to
the adaptability requirement. The control flow diagram and DFDs of the restructured module are
developed from the earlier versions, and the restructured module is tested against the
module-specific validation tests conducted earlier.

This task can be taken up once the module has completely stabilized and become robust. After
program restructuring, a module becomes a stable, robust, re-usable component that can be used in
the final integrated system. At the end of this stage, we have the following artifacts: a revised
Software Design Document (SDD), restructured and maintainable source code of the
module/sub-modules, module/block-level comments and in-line comments in the program code, and
revised requirement specifications (SRS version 0.3). After the program restructuring task, the
software engineering attributes covered are: modularity at the subroutine level, and block-level
documentation with in-line comments.
2.4.3 Subsystem/System Integration
Once all modules have completed the engineering tasks, they can be integrated to build the final
system. By integrating the relevant modules, we build a Sampark MT system for the respective
language pair, subject it to system-level validation, and generate validation reports. After
this, each Sampark system has system-level validation data and reports.

We then have the first release of the integrated system, well engineered and maintainable. This
release is made available to the NLP experts and computational linguists for evaluation and
continuous improvement. Each release goes through validation and regression tests, and once it
crosses the threshold of comprehensibility, it is packaged as a field-deployable system.

2.4.4 Dashboard Development Infrastructure
There is an imperative need for a tool which facilitates testing and integration of the modules,
and provides the input and output of each module/system to the NLP experts/computational
linguists for debugging and improvement. A basic, simple blackboard tool, called Dashboard, was
available with one of the participating institutes [14]. We redesigned the Dashboard to have the
following features:

1. Configurability to set up modules in a pipeline
2. Display of intermediate I/O at module/subsystem level
3. Provisions for step-by-step running
4. Time profiles at system, module, and sentence level
5. Timeouts for failure handling
6. Preparation and storage of reference (manual/gold) data at module/subsystem/system level
7. Module/subsystem/system-level validation and regression testing

We redesigned the Dashboard with all the features envisaged above; it also provides additional
tools, such as a character encoding and data format conversion tool, an I/O data validation tool,
etc. The Dashboard sets up the control flow among modules irrespective of their heterogeneity,
and provides reader/writer primitives on the common in-memory (SSF) data structure for the
programming languages C, Java, and Perl. A module reads from the SSF data structure, converts the
data into its own internal data structure for processing, and, after completion, converts the
data back into the SSF data structure; the sketch below illustrates this pattern. Chapter 3
describes the Dashboard in detail.
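The read-convert-process-reconvert pattern is sketched below. The simplified SSF text form
(tab-separated address, token, tag columns) and the class names are illustrative; the real
primitives are the Dashboard's C, Java, and Perl bindings.

    import java.util.ArrayList;
    import java.util.List;

    public class SsfRoundTrip {
        static class Node {                    // module's internal structure
            String addr, token, tag;
            Node(String a, String t, String g) { addr = a; token = t; tag = g; }
        }

        static List<Node> read(String ssf) {   // blackboard -> internal form
            List<Node> nodes = new ArrayList<>();
            for (String row : ssf.split("\n")) {
                String[] c = row.split("\t", -1);
                nodes.add(new Node(c[0], c.length > 1 ? c[1] : "",
                                   c.length > 2 ? c[2] : ""));
            }
            return nodes;
        }

        static String write(List<Node> nodes) {  // internal form -> blackboard
            StringBuilder sb = new StringBuilder();
            for (Node n : nodes)
                sb.append(n.addr).append('\t').append(n.token)
                  .append('\t').append(n.tag).append('\n');
            return sb.toString();
        }

        public static void main(String[] args) {
            List<Node> nodes = read("1\traama\t\n2\tghara\t\n3\tgayaa\t");
            for (Node n : nodes) n.tag = "TAG";  // stand-in for real processing
            System.out.print(write(nodes));  // updated data back on the blackboard
        }
    }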

2.5 Practical Measures


Ideally, modules would be ready for integration only after reaching module-level
inter-operability. In practice, however, developers want to integrate the complete Sampark system
and test the performance of their modules even when some modules have not reached that state.
This scenario was visualized a priori, and we provided the following measures to satisfy the
intermediate needs of all developers:
1. Module-Specific Wrappers: For a module, irrespective of the stage it is in, a module-specific
wrapper program (made available by us) can wrap the existing module in such a way that it can be
integrated into the pipeline architecture of the Sampark system.

2. Concurrence of Engineering Tasks: Once a module has reached a state where it can be integrated
into the pipeline architecture for intermediate testing, it is released to us. Its engineering
tasks of improving documentation and program restructuring can be done concurrently. Usually, a
new version of a module is released by its owner institution at least once in six months.

3. Module/System Testing and Validation Data: To facilitate system and module testing and
validation for each language pair of the Sampark systems, the language experts have created 100
gold-standard sentences for each source language, along with their expected translations in the
target language, done manually. For the same 100 sentences, the expected output after each module
of the pipeline is also generated manually. This set serves as the basic testing and validation
data; the sketch below shows the kind of regression check it enables.
4. Internal MT Evaluation Teams for the Sampark Systems: Once a Sampark system is validated by
us, it is sent for internal evaluation [15] for accuracy measurement. The team tests it for
comprehensibility, and if it is not up to the mark, language-related feedback is given to the
concerned institution for improvement.

2.6 Experiences of Symbiotic Approach


Presently, four Sampark MT systems (Punjabi to Hindi, Urdu to Hindi, Hindi to Punjabi, and Telugu
to Tamil) have been released for public usage at http://sampark.org.in, and the remaining MT
systems are expected to be released soon.

The technology section of Communications of the ACM (CACM) [16] has compared these Sampark
systems, based on a transfer-based hybrid approach, to the purely statistical approaches of
Google and Microsoft.

Experience was elicited from all eleven institutions on their participation in consortium mode to
produce the Sampark MT systems by engineering their existing laboratory modules through the
symbiotic approach. A questionnaire was used for the survey, which was done only after almost all
modules had gone through the engineering tasks and had been submitted to us for validation. The
experience has been consolidated along the following two dimensions:

2.6.1 Experience of Module Engineering Tasks: As the module engineering tasks were executed in
symbiotic relation with us, the experiences of the two groups provide different perspectives:
- All institutions appreciated the tasks of enhancing usability and improving robustness and bug
traceability, as these made their modules reusable, robust, and portable. Such engineering
practices had not been followed earlier!
- Most researchers took time to appreciate the E-L separation task, as their focus was limited to
their own language pair, without realizing the re-usability of their modules for other languages
as well. For this task, the roles of the chief investigators of each participating institute, and
of the coordinator, were crucial.

- Most developers used module-specific wrappers extensively to integrate their modules for
testing, as they concentrated more on improving the accuracy of their respective Sampark systems
than on the documentation and program restructuring tasks.
- The relation between the participating institutions and the software engineering group was weak
to start with, but became better when the Dashboard development infrastructure started being used
extensively. It helped institutions to continuously improve their modules' accuracy and
performance, and to show their results to a wider audience. In addition, they appreciated the
hand-holding support provided by us in making their modules into components, and the Sampark
systems web deployable.
- We felt satisfied in making the NLP researchers aware of the software engineering issues of
usability, traceability, robustness, and modularity, but had very limited success in influencing
them towards the ideas of validation, readability, and re-usability.

2.6.2 Experience of Work in Consortium Mode: It was the first experience for each participant to
work in a consortium. Although the project was in an intense phase of its delivery activities,
the responses indicate the following:
- Human/group inter-operability is essential for a consortium-mode project to succeed. A series
of workshops (on language and software engineering issues) to develop common guidelines, coupled
with weekly conference calls among all participating institutions, helped to smoothen group and
human inter-operability.
- The availability of a common development environment in the Dashboard, and of a software
engineering group, facilitated the individual engineering efforts, and helped the institutions
focus on project deliverables.
- As academicians are involved in multiple activities, the institutions where the chief
investigators were not over-loaded could contribute more.
- It was imperative for us to have sufficient knowledge of NLP issues to develop a symbiotic
relation with the participating NLP teams. It took us around six months to acquire the requisite
domain knowledge in MT systems, and only then did the inter-communication become smooth. Our
visits to most of the participating institutions helped develop direct professional bonds with
all the developers.
- The role of the Project Review and Steering Group (PRSG) set up by TDIL was extremely helpful
in keeping the project focused, and gave each participating institution the crucial impetus to
maintain delivery quality as well as schedule.

Chapter 3
Dashboard Tool for MT
In this chapter, we present an introduction to the Dashboard tool. Section 3.2 introduces the
infrastructural approaches and the objective requirements of the tool, and Section 3.3 presents
the Dashboard and the Sampark MT systems. Section 3.4 presents the strengths of the Dashboard,
Section 3.5 describes its implementation, Section 3.6 its evaluation, and Section 3.7 the
discussion and future work.

3.1 Introduction
As natural language processing (NLP) applications are knowledge intensive, complex, and are
normally developed by a co-operating team consisting of NLP experts, computational linguists,
software engineers and language engineers, a large application like machine translation (MT)
systems cannot be developed by a third party software vendor following classical software
development paradigms [3]. For the development of NLP applications, the conventional software
development tools are not very suitable as the application specifications are inherently imprecise,
i.e., the output is not tested against correct output, but is validated against criteria specifying
correctness. For example for the overall MT system, we use comprehensibility and fluency [15].
In addition, an NLP application, unlike a conventional software application, goes through
continuous accuracy improvement, for considerable long duration, after their release. Therefore,
for the development of NLP applications, it is advisable to develop a software development
infrastructure [8] corresponding to the NLP software development paradigm being applied. The
Dashboard development infrastructure, presented in this chapter, is based on Blackboard
architecture [17], an attractive option for building an NLP system, which facilitates integration of
a set of heterogeneous modules (i.e., written in different programming languages) collaborating
among themselves through a common in-memory data structure, referred to as blackboard.
The need for creating a development infrastructure like Dashboard arose when the
Technology Development for Indian Languages (TDIL) Group of Dept. of Information Technology
(DIT), Govt. of India formed a consortium of eleven academic and research institutions [7] for
developing 18 pairs of Sampark MT systems. These spanned nine Indic languages, and the pairs
(viz., Hindi↔{Bangla, Kannada, Marathi, Punjabi, Tamil, Telugu, Urdu}; Tamil↔Malayalam; and
Tamil↔Telugu) were built by re-using/re-engineering most of the NLP components, and
language resources that were available with the participating institutions through their prior
work. As eighteen Sampark systems were to be developed for public use in a limited time frame
of the project, we considered various infrastructural approaches to facilitate the speedy
implementation of these systems.

3.2 Infrastructural Approaches and Objective Requirement
Infrastructural Approaches: To facilitate the development team, comprising NLP experts,
computational linguists and software engineers, in building large NLP applications, there have
been three infrastructural approaches, viz., (1) frameworks, (2) architectures, and (3)
development environments [12]. Frameworks facilitate component-based development by
providing a common and powerful platform with a number of mechanisms (e.g., ActiveX, Java
Beans, etc.) that can be used, or adapted, by the developers to build their systems. A framework
greatly reduces development as well as maintenance effort, and it presumes that the system is
going to be developed on a common OS platform. Architecture defines a system in terms of its
components, and the type of inter-relations among the components (e.g., client-server
architecture, pipe-line architecture, etc.). It is the inter-relationships among the components that
represent the distinguishing power of a specific architecture. Application developers have
realized that a specific architecture is far more suitable for a family of software applications,
giving rise to a concept of reference architecture or domain specific software architecture [19].
For NLP applications where heterogeneity of components is an essential characteristic,
blackboard architecture gets widely used [20]. Furthermore, when an implementation of an
infrastructure based on a specific architecture provides additional tools for
development/validation of a component, integrating and testing a set of components, and
building and testing the complete system, it is usually called development environment [21].

Objective Requirement: We felt a strong need to build a development environment as all
participating institutions are geographically distributed, and multiple institutions have to
collaborate and coordinate to deliver each Sampark system. As most of the chosen Indic
languages have evolved from the classical language Sanskrit, the Sampark systems have been
built on a common paradigm based on the Paninian framework [5].

Further, we decided to adopt the blackboard architecture [17] for building the Sampark
systems as they had to re-use a previously developed set of heterogeneous modules. In blackboard
architecture, the heterogeneity of modules does not affect their operation, as all of them operate on
a common in-memory data structure. Further, we restricted the control to a pre-specified lattice
architecture, as all the machine translation systems are being built following the transfer-based
approach, using a representation based on the Paninian framework. The lattice is currently
implemented as a pipeline comprising 14 major modules for each language pair.
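As a concrete illustration of this control regime, the following minimal sketch (with illustrative names; this is not Dashboard's actual code) shows a moderator running a fixed pipeline of modules over a shared in-memory blackboard:

    import java.util.List;

    // Stand-in for the common in-memory data structure (SSF) on which
    // every module operates: trees with bags and feature structures.
    class Blackboard { }

    // Every pipeline module reads from and writes to the shared blackboard.
    interface Module {
        void process(Blackboard ssf);
    }

    // The moderator restricted to pipe-line control: modules run in a
    // pre-specified order, e.g., morph, postagger, chunker, ... (14 in all).
    class Moderator {
        private final List<Module> pipeline;

        Moderator(List<Module> pipeline) {
            this.pipeline = pipeline;
        }

        void run(Blackboard ssf) {
            for (Module m : pipeline) {
                m.process(ssf);
            }
        }
    }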

All the major modules in the system consist of language-independent engines and
language-specific parts. Among the eleven geographically distributed academic institutions, each
institution was responsible for one or more specific engines, while the responsibility for each
language rested with a single institution. Thus, the language-specific parts of all the modules for
a specific language were the responsibility of a single institution. Therefore, we needed a
development tool that can be used by each institution independently, at different granularity
levels of a system, i.e., either in re-engineering/testing of a module in isolation, or in integration
and testing of a subsystem, or in building and testing of the complete system.
Correspondingly, we visualized Dashboard, a common blackboard-based pipe-lined architectural
framework for building the translation systems, which could utilize all available modules in spite
of their heterogeneity. The same framework could also be re-used, as much as possible, so
that all the eighteen Sampark systems have the same architecture.

The Dashboard, apart from implementing the features of blackboard architecture, has
many additional features for use as a development infrastructure for a family of NLP
applications, such as machine translation systems, speech-to-speech translation systems,
information extraction systems, etc. It is also equipped with a user-friendly visualization tool to
build, test, validate, and integrate a system (or a subsystem), and to view its step-wise processing
and module-wise time profiling, facilitating the development team in improving the system's
accuracy and (speed) performance respectively.

3.3 Dashboard and Sampark MT Systems


As all the participating institutions were academic and research organizations, the consortium
visualized the need for a Software Engineering Group (SEG) to bring in professional
software engineering practices. A company was engaged to work with the consortium in a
symbiotic [4] mode to facilitate it in engineering the Sampark systems as field-deployable and
maintainable products. The SEG, by symbiotically interacting with the participants, evolved the
specification of the Dashboard development environment that would facilitate them in
engineering the available modules to build the Sampark systems as stable products.

The first version of the Dashboard development platform was released after one year, and for
the last three years it has been successfully used by each of the participating institutions
independently. Out of the total eighteen MT systems being developed using Dashboard, the
accuracy of four, viz., Punjabi→Hindi, Urdu→Hindi, Telugu→Tamil, and Hindi→Telugu, has
improved considerably, and they have been deployed at the website
http://sampark.iiit.ac.in. It is expected that the remaining MT systems will be released at regular
intervals over the next few months.

Another consortium of seven academic institutions [22], which is developing a Sanskrit-to-Hindi
machine translation system, is also using Dashboard as its system development environment.

3.4 Strengths
NLP application software development is usually a team effort where the software architecture
(its selection or tuning) is done by NLP experts. All the decisions related with language resources,
such as coverage, rules, corpora, etc., are taken by NLP experts or computational linguists, and
algorithms are implemented by language/software engineers. The continuous involvement of
NLP experts and computational linguists in the development process is essential as their
knowledge cannot be passed easily to the software developers located with vendors. In addition,
the success of the NLP application solely depends upon them as it is essentially their visualization
and creation, requiring continuous molding by the experts during development, and also
continuous efforts to improve accuracy after release.

Unlike conventional software development tools, a development environment
suitable for NLP applications has to cater not only to the specificity of NLP applications [2], but also
to the needs of each member type of the development team, viz., the NLP experts, the
computational linguists and the software engineers.

3.4.1 Specificity of NLP/AI Applications

This subsection describes the salient requirements, emerging due to specificities of NLP
applications [2] (distinct from conventional software), that must be covered by the proposed
Dashboard development environment, and they are:

Heterogeneity of Modules: The complexity of an NLP application is generally high because different
component modules work on different levels of language, i.e., some work at the paragraph level,
some at the sentence level, some at the chunk level, and some at the word level. Accordingly, each
component module may need a different data representation model for its efficient
implementation. Also, different modules get developed in different programming languages, each
chosen because its data representation model suits the module's computation model, making the
implementation far more natural, elegant and efficient; sometimes this happens for reasons of
legacy software as well. The development environment must therefore facilitate the development
team in incorporating a heterogeneous set of modules in the system.

Modularity: A set of generic NLP modules has been identified and developed, and such modules
are sometimes available off-the-shelf from multiple sources. These generic modules are generally
re-used across various applications and domains through proper adaptation and enhancement.
This requires that the development environment view a system as built by the integration of a set
of independent modules.

Transparency at module interface level: The specifications for most NLP applications are
approximate. Usually there is no concept of a correct input and a single correct output [18]. For
example, in an MT system, grammatically wrong sentences are valid input as well as valid output.
Hence, unlike conventional software, the output is not tested against a correct output, but is
validated against criteria specifying a threshold of accuracy, mostly by human evaluators [15].
Therefore, a development environment should provide transparency at the module interface level
(i.e., the ability to view the input and the output of any module), facilitating the development
team in easily isolating the module containing the trouble spot and independently modifying it
to improve the system.

Module Level Flexibility: In a conventional software application, after it is developed and released,
the development team comes into the picture if and only if there are residual bugs, and not
otherwise. For an NLP application, by contrast, continuous accuracy and performance improvement
is a generic requirement, and the system needs to be continuously improved by NLP experts and
computational linguists. Further, the accuracy improvement is done not by changing code, but by
improving/tuning language-specific (or language-pair-specific) data and rules, as well as domain-
specific data/corpora. Hence, the development environment must provide a mechanism for easy
replacement of a module by its new version, without any repercussions on the remaining modules
of the system.

Time Profiling of Modules: As the complexity of an NLP application is very high, performance is a
major issue. Hence, time profiling of each component module is an imperative need which the
development environment must provide, so that the development team can concentrate on
improving those modules that are time intensive.

Robustness of System: Some NLP components may, at times, take quite a long time, get into an
indefinite loop, or even fail. Therefore, robustness at the system level becomes another imperative
need for any NLP application, and hence the development environment should provide a module-
level timeout facility to terminate an indefinite loop.
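A minimal sketch of how such a module-level timeout could be realized, assuming (as in Dashboard) that a module runs as an external process; the class and method names are illustrative, not Dashboard's actual implementation:

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch: run one pipeline module as an external process
    // and terminate it forcibly if it exceeds its time budget.
    public class TimedModuleRunner {
        public static boolean runWithTimeout(String[] command, long timeoutSeconds)
                throws IOException, InterruptedException {
            Process module = new ProcessBuilder(command).inheritIO().start();
            if (!module.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                module.destroyForcibly();  // kill the looping/stuck module
                return false;              // caller may continue with degraded output
            }
            return module.exitValue() == 0;
        }
    }

Returning a failure indication, rather than aborting the whole run, lets the system proceed to the next module, which ties in with the resiliency requirement below.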

Resiliency against Module Failure: A less precise output is more acceptable than no output.
Therefore, the design of the system should be resilient enough to recover from a module failure
(or its forced termination), and to proceed further to give at least a degraded output.

3.4.2 The Blackboard Architecture

The project document [7] proposing to develop the eighteen Sampark systems stressed the need
to develop all the systems based on a pipe-lined blackboard architecture. The blackboard in its
classical form [17] can be viewed as a central repository of all information shared among multiple
heterogeneous problem-solving agents/experts. The name blackboard architecture was chosen to
evoke a metaphor in which a group of experts gathers around a blackboard to collaboratively
solve a complex problem. The information on the blackboard represents assumptions, facts, events,
and deductions made by the system during the course of problem solving. Each expert continuously
watches the information on the blackboard and, at any instant, depending on the information
content, tries to write on (and update the information content of) the blackboard if it feels it can
contribute to the solution. If, at some stage, multiple experts compete to write on the blackboard,
the moderator or facilitator mediates among the competing experts.

In the case of the Sampark system, the experts are the modules, and the central repository is
represented by the common in-memory data structure. As the Sampark system is built using the
transfer approach based on the Paninian framework, the moderator currently limits the
execution of modules to a pre-assigned order (i.e., pipe-line architecture) specified by the
development team at the time of configuring the system.

Justifications of Blackboard Architecture for the Sampark MT Systems: The project document
[7] has given objective and technical reasons, enumerated briefly below for the completeness of
this chapter:

Reuse of NLP Components and Language Resources: Since the project requires the reuse of
Natural Language Processing (NLP) components and language resources (for most of the nine
Indic languages) written in different programming languages (viz., Java, C, Perl and Python) and
available with the participating institutions, the blackboard infrastructure is the most suitable for
building systems from independent heterogeneous modules.

Reuse of Architectural Framework: We have a common software architectural framework for the
NLP applications, viz., blackboard architecture, so that all the eighteen MT systems have the
same architecture. This experience would also be used in future to develop MT systems
covering other Indic language pairs on the same architectural framework.

Transfer Based Approach and Pipe-lined Blackboard Architecture: As all the MT systems are
being built following transfer based approach [16], we restricted the control to pre-defined pipe-
line architecture. Pipe-line architecture also reduces the complexity of configuring and testing
systems under a development environment.

Graceful Degradation and Common In-memory Data Structure: In a pipe-lined blackboard
architecture, modules communicate among themselves through the common in-memory data
structure. This pipe-lined feature provides resiliency against the failure of any in-between module
of the system, as the following modules can still continue to work on the available in-memory
data, assuring graceful degradation of the system.

SSF for In-memory Data Structure: A blackboard architecture is qualified by its in-memory data
structure format. We have adopted the Shakti Standard Format (SSF) for the in-memory data
structure as it is based on a tree structure with bags and associated feature structures [10], and
has a text notation for representing it unambiguously while also providing human readability.
Human readability of a module's input/output is an imperative need as it helps extensively in
understanding the working of the module and in improving it.

3.5 Dashboard and its Implementation


Dashboard, as a framework for setting up blackboards, was available with one of the participating
academic institutes [10, 14]. We were assigned the task of redesigning Dashboard afresh, as a
development environment with enhanced functionalities satisfying the Sampark system needs
discussed above. The new Dashboard was required to be built on the Linux platform, and we
concretized the requirement specifications of the first version of the proposed Dashboard
Development Environment, given in the next subsection.

3.5.1 Requirement Specifications of Dashboard

1. The Basic Characteristics of Dashboard Development Environment

a. The common in-memory data structure should be the Shakti Standard Format (SSF) [10]
b. The system being built should be composed of a set of modules, where each module,
independent of the others, operates on the common in-memory data structure
c. The inter-module data exchange can be either through the common in-memory data
structure, or alternatively through an I/O stream
d. The set of modules should execute in a pre-specified sequence, i.e., following pipe-line
architecture
e. It should provide application program interfaces (APIs), viz., reader/writer primitives
on the common in-memory data structure (SSF), for the programming languages C, Java,
and Perl, as most of the available modules are written in these languages.

2. Mechanisms to Configure a System to Run under Dashboard Development Environment

a. Provision to define a system specification file to configure a system composed of a set of
modules in a pipe-line; the granularity of the pipe can be a single module, a set of modules
comprising a sub-system, or the complete system
b. Provision to define a module's runtime properties through a module specification file, viz.,
its programming language, i/o data format, i/o data encoding, i/o in stream or in-memory,
and level of module operation (an illustrative specification is sketched after this list),
c. Provision to edit the system specification file for reconfiguring the system, mainly to
facilitate replacing a module with its new version,
d. Provision to compile a given system specification file, and generate the executable file
of the Moderator, which would eventually be executed to run the configured system
under the Dashboard Development Environment.

3. Dashboard Development Environment as a Visualization Tool

a. Display of intermediate i/o at module/subsystem level, and final i/o at system level,
mainly to analyze the accuracy of the output at module/subsystem/system level,
b. Step-by-step interactive run at the module level,
c. A here-and-now debugging tool, to change the intermediate output and run the system
further; this helps to find multiple bugs in a single execution, and also helps in
integration and system testing,
d. Provision to save a session, i.e., to save the system level and each module level
input/output, for post analysis by the development team,
e. Provision for transliterating the input text (in source language script), or the output
text (in target language script) of Sampark system, in either source language script or
target language script, facilitating the single script readers to read both the input text
as well as the output text.

4. Provision to get a Time Profile for each sentence at the module level.

5. *Provision to specify Module Level Timeout for Failure Handling.

6. Provision of additional support tools for smooth inter-module interfacing, such as:

a. A character encoding conversion tool, for converting data from UNICODE to WX and
vice versa,
b. A data format conversion tool, for those modules whose input/output may be in a data
format different from SSF; it provides tools to convert from SSF into the format
acceptable to the module, and back (e.g., SSF to TNT, etc.),
c. An I/O data validation tool, to verify whether the input/output data is in the correct format or
not.
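For concreteness, an illustrative module specification of the kind referred to in item 2b might look as follows; the key names are hypothetical, not Dashboard's actual file syntax:

    module          = postagger
    language        = perl          # implementation language of the module
    input.format    = SSF           # or a module-specific format needing conversion
    input.encoding  = UTF-8         # or WX
    io.mode         = in-memory     # or: stream
    level           = sentence      # level at which the module operates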

As of today, Dashboard has been implemented with all the features envisaged above, except the
starred specification, i.e., the module-level timeout. It is being used by all participating institutes
independently, to integrate and test their modules and subsystems, as well as the specific Sampark
system(s) in which they are involved.

A module is expected to operate on the common in-memory data structure. In case a module
uses its own data structure, it has to first take the requisite information from the common SSF
blackboard and build its own data structure; after completing its task, it writes the new
information back into the common SSF blackboard.
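The sketch below illustrates this read/compute/write-back discipline for a hypothetical chunking module; the field and method names are stand-ins, not Dashboard's actual SSF reader/writer primitives:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class ChunkerSketch {
        // Stand-in for the common SSF blackboard's view of one sentence.
        static class SsfSentence {
            String tokensAsText;                           // read side
            List<List<String>> chunks = new ArrayList<>(); // write-back side
        }

        void process(SsfSentence ssf) {
            // 1. Take the requisite information from the common SSF blackboard.
            List<String> tokens = Arrays.asList(ssf.tokensAsText.split("\\s+"));
            // 2. Build the module's own data structure and compute on it
            //    (a trivial two-token grouping, purely for illustration).
            List<List<String>> chunks = new ArrayList<>();
            for (int i = 0; i < tokens.size(); i += 2) {
                chunks.add(new ArrayList<>(
                        tokens.subList(i, Math.min(i + 2, tokens.size()))));
            }
            // 3. Write the new information back into the common SSF blackboard.
            ssf.chunks = chunks;
        }
    }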

3.5.2 Dashboard: An Example

The power of the Dashboard Development Environment is illustrated through a set of screen
shots given in Figures 3.1, 3.2, and 3.3. All these figures show the various screen shots of Hindi to
Urdu Sampark system running under Dashboard. The vertical column in the middle provides the
names of all eleven modules (and the last is always system) comprising the pipe-line

architecture of the present system running under the Dashboard. The user of the Dashboard can
provide input to the system either through a file, or by typing directly into the left pane. Further, he
can choose to run the complete system by clicking the system choice in the vertical column.
Alternatively, he can run the system step by step by clicking the modules, one by one, from top
to bottom.

Figure 3.1 shows the complete translation of the Hindi text into Urdu when the user has chosen
system from the vertical column. The left pane shows the input text (composed of five
sentences) written in the source language Hindi, and the right pane shows the translated output
text, produced by the Sampark system, written in the target language Urdu. The total system
execution time of 40.11 seconds is also shown at the right corner of the Tool bar.

Once the system has executed completely, the session can be saved for future analysis. In
case the developer wants to analyze the intermediate output of any specific module, for any
specific sentence, he can do so through Dashboard. He has to first choose the sentence of his
choice, by scrolling through the sentence numbers shown in the Tool bar adjacent to the
execution time, and then click the module of his choice in the vertical pipeline. In Figure 3.2, the
left pane shows the input (the first sentence in text form) to the module morph (the first module of
the system pipe), and the right pane shows the output produced by the morph module in SSF.
The execution time consumed by the morph module, i.e., 0.77 seconds, is also shown above.

Figure 3.1: Dashboard showing Translation of Hindi text into Urdu

Figure 3.2: Dashboard showing Input and Output of Hindi Morph

Similarly, in Figure 3.3, the left pane shows the input (in SSF) to the module postagger and the right
pane shows the output produced by it (in SSF). The postagger execution time of 1.06 seconds is
also shown. In this way, the development team can post-analyze the complete paragraph
translation, by scrolling through each sentence-level and each module-level execution.

For each specific inappropriate output, the fault can easily be localized to a module. Similarly,
the module-level time profile, with sentence-level granularity, easily guides the development
team in isolating those modules that need performance improvement.

Figure 3.3: Dashboard showing Input and Output of Hindi POS Tagger

3.5.3 Implementation of Dashboard

The first version of the Dashboard Development Environment was released in the latter part of
2008, and through continuous interaction with the participating institutions of the consortium,
we have released an enhanced version of Dashboard every six months. The latest version of
Dashboard is fully stable, and it satisfies the major needs of the consortium as the main
development environment for productizing the proposed eighteen Sampark systems.

The back-end source code of Dashboard is written in Perl, and the component
handling the visualization interface is written in Java. The total size of the complete Dashboard
Development Environment is 27.7 MB, of which the back-end component and the visualization
interface account for 4.3 MB and 23.4 MB respectively.

3.6 Evaluation
The consolidation of the experience of all eleven institutions (and the SEG) in their usage of the
Dashboard Development Environment to produce the eighteen Sampark systems, through
re-using/re-engineering their existing laboratory modules, was done through a set of questionnaires
[**] distributed to them. This survey was done only after almost all eighteen Sampark systems had
been built and were going through continuous accuracy and performance improvement by the
participating institutions. Four systems have been released on the web after their accuracy
crossed the threshold of the acceptable limit. The major experiences are:
1. As an Engineering Tool - A module written by an institution always runs satisfactorily in its own
laboratory environment, yet invariably any other institution finds it extremely difficult to integrate
the same module in its own environment. This usually happens because context-dependent
hard-coded data (e.g., path names of files, etc.) creeps into the code without the
researcher/developer being aware of it, making the module environment-sensitive. Dashboard
helps developers engineer their modules to be portable, as it does not permit any module with
environment-sensitive hard-coded data to run under it.

2. As a Testing and Integration Tool - As Dashboard was made available to each module
developer, each of them maintained their own testing environment by locally integrating the
current version of their module with all the modules developed by the other groups. In other
words, it provided multiple replicated integration and testing environments, one with each of the
participating institutes. Whenever a developer released the next version of a module, all other
developers upgraded their testing environments and tested their systems with the newly released
version of the module.

3. As a Group Testing and Coordination Tool - The multiple replicated integration and testing
environments not only helped the groups to test their modules against real modules (and not
stubs), but also helped them give feedback to the owners of the preceding module (i.e., the module
whose output they take as input) to improve the accuracy of the system. Further, Dashboard
has become extremely helpful as a group coordination tool as well, because the members of the
group are geographically scattered.

4. As a Profiling Tool - The Sampark system needs continuous improvement in its performance as
well. Dashboard provides time profile data for each module, which facilitates improving the
design of the various time-intensive modules. In case the response of the system falls short of
user expectations, this data also guides the enhancement of hardware resources so that the
response to the user is brought within acceptable limits.

3.7 Discussion and Future Work


Based on the feedback given by the consortium, and the interest shown by the development teams
of other NLP applications, we have visualized a few essential enhancements to Dashboard to make
it more useful for (i) generic NLP application development, (ii) large applications running on
multiple machines, and (iii) system validation.
3.7.1 Dashboard and Generic NLP Applications
Though the requirement specifications of Dashboard were derived from the specificity of generic
NLP applications, the pressing needs of the ILMT project constrained its specifications, making it
more useful for the development of Sampark systems than as a generic NLP application
development environment. Presently, Dashboard is tied to SSF, and to its own way of writing the
system specification file. To make it a generic NLP application development environment, it should
have the following features:
System-specific in-memory data structure: It should allow system developers to define the
system-specific in-memory data structure in other formats, including XML.
System specification: The system specification file should be in XML, where the module
dependencies are specified for the system being set up; an illustrative shape is sketched below.
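An illustrative shape for such an XML system specification (element and attribute names are hypothetical, not a committed format):

    <system name="Punjabi-Hindi-MT">
      <module name="morph"/>
      <module name="postagger" depends-on="morph"/>
      <module name="chunker"   depends-on="postagger"/>
      <!-- ... remaining modules, each declaring the module(s) it depends on -->
    </system>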

3.7.2 Applications running on Multiple Machines


Partitioning: Dashboard can be distributed into multiple partitions (each partition comprising a
set of modules), and one Dashboard partition can be configured to run on a specific machine.
Subsystem specification: It should have provision to group the modules into subsystems.

3.7.3 System Validation


Validation Data Preparation: It will provide infrastructure to prepare and store validation data for
each module as well as for the system being set up.
Module Validation: It will have provision to perform module-level validation.
Subsystem and System Validation: It will have provision to perform validation at the subsystem as
well as the system level.
Regression Testing: It will provide a mechanism to perform regression testing.

Chapter 4
Caching for MT Systems
In this Chapter, we present the provision of a cache mechanism by the system integration and
deployment platform (Dashboard) to enhance the performance of a computationally intensive
NLP application. We present the introduction in Section 4.1, and then explore the problem domain
in Section 4.2. We explain middleware in Section 4.3. Section 4.4 explains the caching mechanism
in detail. Section 4.5 presents the experiments and results. We present the conclusions in
Section 4.6.

4.1 Introduction
After the public release of the first set of four Indic machine translation systems at the website
http://sampark.org.in [7, 16] for interactive as well as batch usage, the following two types of
demands emerged: to improve the accuracy of the MT systems; and to improve their response
time and throughput. While the former task lies completely in the NLP domain, the latter task
presents a challenge to the application development teams as well as to the software engineering
group (SEG) associated with the project, which has designed and built Dashboard, a testing,
integration and deployment environment [23] over which these MT systems are built and
deployed. Dashboard not only facilitates NLP developers in development, testing and integration
of the MT systems but also shields them from the complexity of the underlying deployment
platform on which these systems are supposed to run.

For a large and complex web application, the improvement of its response time as well as
its throughput depends not only upon its intra-application architecture and programming
complexity, but equally, and sometimes more, on the complexity of the computational
environment on which it is deployed. In the present case, the tasks of improving performance
related to application architecture and program complexity come under the purview of the
application designers and developers, i.e., the NLP experts, whereas those related to the
deployment environment, viz., Dashboard, come under the SEG.

4.2 Problem Domain Exploration


Dashboard was designed mainly as a development environment [9] based on blackboard
architecture [17], a very attractive option for NLP applications, which facilitates integration of a
set of heterogeneous modules (i.e., written in different programming languages) collaborating
among themselves through a common in-memory data structure, referred to as blackboard [20].
It provides additional tools for development and validation of a module, integrating and testing a

set of modules, and for building and testing the complete system. The implementation of
Dashboard is presented in [23]. While designing Dashboard, it was realized that an NLP
application such as an MT system, even after release, needs continuous accuracy and
performance improvements, and hence mechanisms for integrating heterogeneous modules, for
module-level flexibility, and for time profiling of the modules were provided. Yet, it did not
focus on providing mechanisms that developers could use to improve module performance.

Expecting an increase in the use of the systems in the near future, experiments are being
done to estimate the performance of the MT systems in terms of interactive response time and
throughput. It is found that the performance of the MT systems on the web deteriorates very
fast as the load increases. Before coming to a decision on an appropriate up-gradation of the
hardware platform for the expected load increase, the SEG considered the additional software
engineering mechanisms that can be made available to the application developers by enhancing
Dashboard so that performance of the MT systems can be improved. Provision of an optional
dynamic cache [24] was a possible choice.

Caching is mostly used to decrease data access time whether it is used in hardware,
operating system, middleware, or at application level. Usually for a web application, a cache is
most commonly used for improving the performance of data intensive information retrieval,
query and search applications such as web search engines, large business applications, etc.,
where the data volumes are not only very high but also widely distributed [24, 25]. It reduces the
input/output cost in locating a query item deep down in the storage hierarchy as well as the high
network communication cost. Again, caching helps improve the performance of an application
only when there is a high query frequency [26], i.e., when a small set of queries is raised
frequently, and distinct types of queries are less likely.

In contrast, an MT system is a very compute-intensive application where the various
intra-module data corpora are either small, or at most of moderate size. Further, the documents
submitted for translation are composed of sets of distinct sentences, but in the same
(natural) language. While an MT system is neither data-intensive nor has a high query frequency,
we still found that the provision of a cache can improve its performance, owing to the following
three distinguishing features:

(i) From empirical observations, it was found that in a given corpus of natural language sentences,
the frequency of any word is inversely proportional to its rank in the frequency table. This was first
proposed by G. K. Zipf [27] and is also known as Zipf's law. In other words, the most frequent word
occurs approximately twice as often as the second most frequent word, three times as often as the
third, and so on.
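A back-of-the-envelope illustration of why this matters for caching: under an idealized pure Zipf distribution f(r) ∝ 1/r, the fraction of all word occurrences covered by the k most frequent words is H(k)/H(V), where H is the harmonic sum and V the vocabulary size. The sketch below, with hypothetical sizes, computes this coverage; for V = 10,000 and k = 1,000 it comes to roughly 76%, i.e., caching one tenth of the vocabulary could serve about three quarters of all word lookups:

    import java.util.stream.IntStream;

    // Estimated cache hit rate if only the k most frequent words are cached,
    // assuming an idealized pure Zipf distribution (real corpora deviate).
    public class ZipfCoverage {
        static double harmonic(int n) {
            return IntStream.rangeClosed(1, n).mapToDouble(r -> 1.0 / r).sum();
        }

        public static void main(String[] args) {
            int vocabulary = 10_000;  // hypothetical vocabulary size V
            int cached = 1_000;       // hypothetical number of cached words k
            System.out.printf("Estimated hit rate: %.1f%%%n",
                    100.0 * harmonic(cached) / harmonic(vocabulary));
        }
    }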

(ii) Any document that is submitted for translation does contain a large number of content
words (i.e., nouns), which are mostly language independent. From another empirical observation,
it is found that there is a relationship between the size of a collection and the size of the
vocabulary used in it. If T is the number of words in a text and M is the number of unique words
used, then this relationship is given by Heaps' law [28]:

M = kT^b

Here k and b are constants that depend upon the type of collection. For English text, it has been
found that

10 ≤ k ≤ 100
and
0.4 ≤ b ≤ 0.6

It means that with an increase in the size of T there will be diminishing returns in the discovery of
unique vocabulary items M.
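For instance, taking illustrative values k = 10 and b = 0.5 (both within the stated ranges), a text of T = 10,000 words would be expected to contain only about M = 10 × 10,000^0.5 = 1,000 unique words, and quadrupling the text to 40,000 words would merely double M to about 2,000.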

(iii) In a rule-based MT system such as the Sampark system [5], most of the compute-intensive
modules operate on words, chunks (viz., sequences of words) or sentences. Further, all these
modules are functional in nature. The modules that operate at the word level alone account for
about 35% of the computation load of any sentence translation. This proportion is approximately
the same across all the Indic languages for which MT systems are being developed.

It needs to be noted that both Zipf's and Heaps' laws hold when the volume of the collection is
high. Combining these three distinguishing characteristics, we may assert that in any MT system,
where a small set of words (whether content words, verbs, connectors, etc.) has a very high
probability of repeated appearance in any document, a cache is a very attractive option for
performance improvement. A dynamically growing cache makes the repeated running of a
compute-intensive, word-level functional module redundant once it has executed for a given word.
Realizing these novel features of an MT system, we decided to provide, as a part of Dashboard, a
cache mechanism that module developers can opt for while integrating their modules to build the
MT system.

Provision of a cache is made optional for the Dashboard as management of the cache
infrastructure does carry some performance penalty. Each module developer has to choose
whether to use a cache or not, after weighing the performance improvement due to the sharp
decrease in the number of times the word level modules need to run against the performance
penalty associated with the increase in computation load for accessing the cache and for cache
infrastructure management.

After enhancing Dashboard to provide the optional dynamic cache facility to module
developers, we are now experimenting with, and recording, the performance improvement in one
of the four deployed MT systems, viz., the Punjabi→Hindi MT system. This chapter reports these
initial findings, and we expect to complete the recording of the performance improvement due to
the provisioning of a cache for all four deployed MT systems.

4.3 Dashboard as Middleware


Dashboard can be thought of as middleware, as discussed below.

4.3.1 Middleware
Middleware software [29, 30] is meant to shield application designers, and in some cases the
application support persons as well, from having to monitor and manage the complexity of the
underlying, ever-evolving, large-scale distributed computation platform. It helps increase the
productivity of large-scale application developers working in complex, heterogeneous and high-
performance computing environments.

Usually, middleware provides higher-level language abstractions, libraries that hide the
complexities of the underlying platforms, and services that are common to many applications.
Different middleware is developed to cater for different classes of applications.

Researchers [31] have broadly identified two classes of applications relying on distributed
systems: commercial network/enterprise services, and computational science and business
services.

Typical commercial network/enterprise services applications are product browsing and search,
email management, web retailers and service providers, auction platforms such as eBay, search
engines, news and blogs, and others.

Commercial network/enterprise services applications are:
- Interactive and on-line, with time constraints on response time and throughput
- Homogeneous in terms of software architecture
- Typically parts of web servers, servlets, and database management systems
- Architecturally complex, due to having to contend with a large community of users
requiring the service.

Computational science and business services applications are the increasingly more demanding
applications designed to allow the study and processing of large quantities of simulation or
sensor data in order to better predict scientific phenomena, economic variables, etc. Examples of
such applications are physical systems modeling (e.g., weather, chemistry, biological systems),
business modeling (e.g., demand forecast, logistics, data mining), problem solving environments,
distributed visualization toolkits, etc.
32
The characteristics of computational science and business services applications are:
- Large data size that needs to be processed (of the order of terabytes or more)
- Compute-intensive, algorithmically 2-3 orders more complex than commercial applications
- A community of users that is not necessarily as large as for the other class above
- A need for a substantial amount of tuning and configuration in order to effectively obtain
the desired performance; these applications generally run as batch jobs.

However, the key features of middleware systems catering for the two application classes are
intrinsically different.

The middleware systems for commercial network/enterprise services applications address
issues such as:
- Replication (with different consistency models)
- Fault tolerance and fault recovery
- Load balancing and self-tuning
- Caching

On the other hand, the middleware systems for computational science and business services
applications focus on methods for:
- Work load partitioning
- Parallel input/output
- Work load scheduling and balancing
- Caching

4.3.2 AI Applications and Middleware

AI applications, such as MT systems, speech recognition systems, game systems, etc., lie between
the two classes of applications discussed above. They are compute-intensive, have a large data
set to operate on, have an intrinsic need for interactiveness, and need to be available on-line for
best usage.

Such applications are:
- Heterogeneous in terms of software architecture
- Interactive and on-line, requiring acceptable response time and throughput
- Required to cater to a large number of on-line users
- Compute-intensive
- In need of caching of results

When we contrast the characteristics of AI applications listed above with the two classes of
applications discussed earlier, we come to the conclusion that a middleware suitable for AI
applications should have the following features:
- Fault tolerance and fault recovery
- Work load partitioning
- Replication of compute-intensive modules
- Caching

For NLP applications, Dashboard plays a dual role: that of a development environment for the
Sampark systems, and that of middleware, shielding application developers from deployment
complexity.

4.3.3 Dashboard as Middleware

Dashboard already provides resiliency and robustness, which cater for the requirements of fault
tolerance and fault recovery.

Further, for running large applications on multiple machines, Dashboard is being
enhanced [23] so that an application can be distributed into multiple partitions (each partition
comprising a set of modules), and a specific partition can be run on a specific machine. This
partitioning feature addresses the issue of work load partitioning.

In addition, in the next version of Dashboard, replication of compute-intensive modules
will also be provided, at the time of configuring the system prior to deployment.

Presently, Dashboard is already equipped with the provision of optional dynamic caching
at the module level.

In brief, the next version of Dashboard would also be able to fulfill the role of middleware
for NLP applications.

4.4 Caching Mechanism
4.4.1 Cache and NLP Applications

A cache is the most commonly used mechanism for performance improvement where the locality
of reference is high [26]; that is, when there is high data locality or high query frequency [32].
NLP applications operate on natural language text and documents, which show a high frequency
of occurrence of a small number of words. With the provision of a cache, the compute-intensive
modules run less often, improving the performance of the system substantially.

Provision of a cache infrastructure, in any application environment, does introduce the
following performance penalties as well:
- Accessing the cache adds to the computational load on the application, for checking
whether the word exists in the cache or not;
- The cache is built from scratch in every run and grows dynamically, resulting in an
increase in the computational load for accessing the cache as its size increases; and
- The cache cannot be allowed to grow indefinitely, which results in a need for cache
infrastructure management [33], adding to the computational load.

There are two types of caches, static and dynamic. In the case of a static cache, the application is
provided with a cache of a fixed size, filled with data items a priori. This is usually done where
enough study has been done to ascertain the stability and size of the high-frequency data items.
Where the stability and locality of data cannot be ascertained in advance, it is appropriate to start
the cache from scratch and let it grow dynamically. As sufficient study has not yet been done on
documents in any of the Indic languages, it is advisable to provide the facility of a dynamic cache.

Further, Zipf's law and Heaps' law are both valid for high volumes of content. The
present size of documents submitted to the Sampark systems is likely to be in the range of 50 to 100
sentences, each sentence having ten to fifteen words on average. It cannot be asserted that
the provision of a cache, considered together with its associated computational load, would always
give better performance. Hence, it is better to make the provision of the cache optional.

In the Sampark system we have used this cache mechanism; the figures below show the cache-hit
and cache-miss flows.

Figure 4.1 Cache Miss Flow Figure 4.2 Cache Hit Flow

4.4.2 Cache in Dashboard Tool

Dashboard provides mechanisms [23] to define a system specification file, to configure a system
composed of a set of modules in a pipe-line, and a module specification file, to define the
runtime properties of the modules. The module specification file specifies the programming
language, input parameters, level of module operation, and many additional parameters for the
module's operation. We have now provided an additional parameter by which developers can opt
for a cache.

In case the module developer opts for a cache, a dynamic cache is created for that
module. The input parameters of the module are converted into a composite key, and this
composite key is used as the key in a hash map, as sketched below.
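A minimal sketch of this memoization scheme follows; the names are illustrative, not Dashboard's actual API:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Per-module dynamic cache: the module's input parameters are joined
    // into a composite key, and the module's output is memoized in a hash map.
    public class ModuleCache {
        private final Map<String, String> cache = new HashMap<>();

        public String lookup(Function<String[], String> module, String... params) {
            // Separator character assumed not to occur in the parameter data.
            String compositeKey = String.join("\u0001", params);
            // Run the module only on a cache miss; otherwise return the memoized result.
            return cache.computeIfAbsent(compositeKey, k -> module.apply(params));
        }
    }

A word-level module such as the word generator would then be invoked through the cache, e.g., cache.lookup(wordGenerator::run, root, features), so that every repetition of a word after its first occurrence is served from the map instead of re-running the module.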

4.5 Experiments and Results
Presently, the four Sampark systems are running on a hardware platform comprising: an Intel
Core 2 Quad CPU @ 2.5 GHz with 2 MB L2 cache and 4 GB RAM.

The experiment was done on the Punjabi→Hindi machine translation system. Though
there are four word-level functional modules in the Sampark system, the experiment was done
by opting for the cache in the lexical substitution and the word generator modules only. The
experiment used documents with 25, 50 and 75 sentences, having 360, 878 and 1258 words, and
unique vocabulary counts of 193, 368 and 519, respectively.

As the lexical substitution module and the word generator module lie well down
the translation pipe-line, the numbers of tokens presented to them differ from the document
counts, but show a similar proportion.

The experiments were done to measure the performance of the Sampark system with and
without the cache enabled. The results are presented in the tables below; all times are in seconds.

Table 4.1 Lexical Transfer Module

No. of Sentences   Total words   Unique words   Time, Cache OFF (s)   Time, Cache ON (s)   % Gain
25                 476           152            11.51                 8.37                 27.28
50                 1230          283            25.43                 18.96                25.44
75                 1682          401            35.51                 28.96                18.45

Table 4.2 Word Generator Module

No. of Sentences   Total words   Unique words   Time, Cache OFF (s)   Time, Cache ON (s)   % Gain
25                 569           306            6.24                  4.97                 20.35
50                 1475          693            16.16                 11.85                26.67
75                 2096          955            23.48                 16.87                28.15

Table 4.3 Full Translation System

No. of Sentences   Total words   Unique words   Time, Cache OFF (s)   Time, Cache ON (s)   % Gain
25                 360           193            108.69                97.34                10.44
50                 878           368            234.39                198.24               15.42
75                 1258          519            340.87                284.29               16.60

4.6 Conclusion
The preliminary results show a module-level performance improvement of around 20% or more,
and around 10% to 15% performance improvement at the full-system level, using the cache in just
these two modules. It is expected that once we provide a cache in the other word-level modules
in the pipe-line, such as the morphological analyzer, we may see an even larger improvement
in performance.

The three documents were chosen randomly from our test suite of documents, and
hence may not have captured the average characteristics of the Punjabi language, as we find an
anomaly in the lexical substitution module, viz., the increase in size does not give a corresponding
growth in performance improvement. This may be due to the specificities of these three
documents.

The implementation of the cache facility in Dashboard allows the option of making the
cache persistent, i.e., the dynamic cache built in previous runs can continue as a static cache
for the next run. Since a Sampark application is specific to a natural language, we strongly feel that
the provision of a persistent cache at the module level will increase the performance of the
Sampark systems.

Provision of a persistent, dynamic cache will also require an upper size limit on the cache.
In the Sampark systems, the static cache should comprise only words/tokens related to the
natural language, and not content words (i.e., nouns), as there is no objective basis for their
re-appearance in different documents. Thus, the persistent cache must save only the
words/tokens related to the natural language.

Chapter 5
Enhancing Throughput of MT System using
MapReduce
In this Chapter we describe in detail how the throughput of a computationally intensive
application like an MT system can be improved by using the MapReduce framework. We present
the introduction in Section 5.1. We then present the MapReduce framework's strengths and
limitations in Section 5.2. Section 5.3 characterizes the MT system and describes the approach to
running it under Hadoop. Section 5.4 describes the experiments and results. Section 5.5 presents
the discussion and future work.

5.1 Introduction
Parallel processing applications can be broadly classified as compute-intensive or data-intensive.
Computing applications that devote most of their execution time to the processing requirements
of their problem are classified as compute-intensive applications. These applications typically
have small input data to process, unlike data-intensive applications where the input data that
needs to be processed is generally terabytes/petabytes in size. Machine Translation (MT) is one
such compute-intensive application: the input data is small, but the MT application itself is highly
compute-intensive. The Sampark Indian-language-to-Indian-language machine translation system
is one such application.
It was found that the performance of the Sampark MT system on the present platform
(CentOS Linux running on a 4-core CPU of 2.50 GHz with 4 GB RAM) deteriorates very fast as the
workload increases. Therefore, in the second phase of the Indian Language to Indian Language
Machine Translation (ILMT) project, one of the goals set for the Software Engineering Group (SEG)
associated with the project [4] is to improve the response time and the throughput of the
systems by porting them to an appropriate deployment environment that can be incrementally
scaled up as and when the average volume of workload increases. Our initial approach is to treat
the enhancement of the throughput of Sampark as the primary task, keeping the task of
improving response time in abeyance.
In the recent past, MapReduce [35] has emerged as the most attractive programming model
for compute-light but large-scale data-intensive applications, offering scalable performance on
clusters of shared-memory systems, such as multi-core chips and symmetric multi-processor
systems. MapReduce is mostly suited to functional applications, and its two functions, Map and
Reduce, are inspired by LISP, the functional programming language [36]. The Map function helps
the job to be partitioned into smaller tasks by splitting the data input file, and each task is
distributed to an individual available node for parallel execution. The Reduce function receives
the results of all partitioned tasks and collates them to give the final result. In this model,
programmers without any experience of parallel and distributed systems can utilize the resources
of a large distributed system, as the runtime system takes care of the partitioning of a job into
multiple tasks, the scheduling of all tasks comprising the job, the inter-task communication, and
also recovery from failure of computing resources. With the wide acceptance of the MapReduce
paradigm, various types of applications are being written in this model to utilize the available
resources of large distributed systems, resulting in enhanced performance and better throughput
[37]. As Hadoop adds considerable operational overhead, in the range of 20-30 seconds and
sometimes even more, to the application system running under it, the real benefit of parallel and
distributed computing offered by the framework is visible only for larger applications.
We felt that the MapReduce framework was applicable to our rule-based/transfer-based MT
system for the following four reasons:
(i) Any document file that is required to be translated, i.e., the data input file to the MT system,
can always be abstracted as a list of sentences, which can be further split into a sequence
of lists of words;
(ii) A transfer-based machine translation system like Sampark is a functional application
and a list homomorphism [38], and hence can be easily parallelized and executed on a
large cluster of machines [39];
(iii) Incremental, on-demand scaling up of computing resources is a basic, inherent property
of the MapReduce framework; and
(iv) Hadoop (from The Apache Software Foundation), the open-source implementation of
MapReduce, is available and can be modified to circumvent the difficulties posed by the
contradictory features of our application.

We felt that, to run the MT system on the MapReduce framework, its list homomorphism
characteristic could be utilized to run multiple instances of it in parallel on multiple nodes,
enhancing throughput. Furthermore, as an MT system is a complex and compute-intensive
application, the enhanced throughput would be felt squarely.

5.2 MapReduce Framework: Strength and Limitation


5.2.1 MapReduce Framework

MapReduce has been designed for data-intensive but compute-light applications, to run on a
hardware platform composed of a cluster of nodes. In a Hadoop cluster, data gets distributed over
all nodes of the cluster while it is being loaded. The Hadoop Distributed File System (HDFS) splits
data into chunks, and each chunk is loaded on a different node of the cluster, well before the
application gets initiated. The Hadoop MapReduce programming framework is implemented as a
client-server architecture having a single master, called the jobtracker, and many slaves, called
tasktrackers, running one per node in the cluster. The jobtracker is the point of interaction
with all users, receiving the users' map/reduce tasks. The jobtracker puts all submitted tasks in the
pending job queue, and schedules each map task to a different tasktracker, but running only on
those nodes where the application's data chunk has been preloaded. On completion of the map
tasks by the different tasktrackers, a set of intermediate outputs is produced. The jobtracker then
schedules the reduce task on those free tasktracker nodes that have easy access to the
intermediate task outputs. The reduce task combines the intermediate outputs received to
produce the final output. In the MapReduce framework, as a job gets parallelized across the
various available tasktrackers in the cluster, the completion time of the job gets tremendously
reduced, enhancing the throughput of the system.
In this framework, the users' map and reduce tasks are submitted by a client to the
jobtracker. The jobtracker transfers the map/reduce tasks to all those tasktracker nodes where
the various instances of the map/reduce tasks have been scheduled to run. It is presumed that
map/reduce tasks are compute-light with a small code footprint, so they do not strain the inter-
node communication network within the cluster while being transferred from the jobtracker to
the various tasktrackers.

5.2.2 Adaptations of MapReduce Framework

The MapReduce framework originally proposed by Google is being utilized by it to
process more than 10 petabytes of data per day [35]. After the release of the Hadoop
implementation of MapReduce, more than a hundred organizations, including large companies
and academia, are using it for various types of applications. This has also resulted in intense
research and development activities in various directions [37]. Some researchers have developed
distinct MapReduce algorithms for the processing of different types of massive data [40, 41],
some have simulated well-known parallel processing algorithms in the MapReduce framework
[42], while others are involved in developing schemes for implementing the MapReduce framework
on distinct types of physical platforms [43, 44, 45], and in optimizing the scheduling problem in
its context [46].

5.2.3 Statistical MT and MapReduce framework

The output quality of Statistical Machine Translation (SMT) systems increases with the amount of
training data [47, 48]. To get good output quality, typical SMT systems train their engines on 5 to
10 million sentence pairs, and training an engine on such a massive volume of data, even on
good processing platforms, takes a couple of days to a week. Hence, many efforts are
being pursued to use the MapReduce framework to execute such training modules over large
corpora on large distributed systems, bringing the training time down to a couple of hours [49, 50].
Open-source toolkits capable of training phrase-based MT models on a Hadoop cluster [51] and
grammar-based statistical MT on a Hadoop cluster [52] have been reported.

5.2.4 Limitations of MapReduce Framework

In the present implementation of MapReduce in Hadoop, the compute task gets transmitted across
the nodes of the cluster, and for those applications that have a large code footprint, transferring it
across the nodes of the cluster would be totally antithetical to the basic goal of throughput
enhancement. Hence, if we run a compute-intensive application with a large code size under
Hadoop, the time spent in transferring the compute task across the nodes would completely
drain the expected throughput enhancement due to parallel processing of its multiple instances.
This is the main limitation of Hadoop, making it, as-is, unsuitable for compute-heavy
jobs. To utilize the benefit of parallelism provided by Hadoop, computationally heavy jobs require
a distinct approach that overcomes this difficulty without paying any time penalty.

5.3 Transfer Based MT and MapReduce Framework


Distinguishing Features of Transfer Based Machine Translation System

5.3.1 Finer Granularity of Input Data

As discussed earlier, a text document file that is submitted for translation, i.e., the data input file
to the MT system, can always be abstracted as a List, which can be further split into a sequence of
sub-Lists. While a document may contain a large number of sentences, the size of a sub-list may go
down to the lowest granularity of a single text sentence. Hence, the application has wide latitude to
right-size the granularity of a sub-list (considering the time overhead of parallelization on the
deployment platform) to get the best throughput; a sketch of one way to control this granularity under Hadoop follows.
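As an illustration (our sketch, not code from the thesis), Hadoop 0.20's NLineInputFormat in the old (mapred) API lets a job fix the number of lines handed to each map task, which, for input holding one sentence per line, directly right-sizes the sub-list granularity; the identity map/reduce defaults of the old API are used here only to keep the sketch self-contained:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class GranularitySketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(GranularitySketch.class);
        conf.setJobName("granularity-sketch");
        conf.setInputFormat(NLineInputFormat.class);         // one input-split = N lines
        conf.setInt("mapred.line.input.format.linespermap",  // N = sentences per sub-list
                    Integer.parseInt(args[2]));
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);  // defaults to identity map/reduce in the old API
    }
}

Varying the third argument is how one would right-size the sub-list, trading the per-task parallelization overhead against the number of parallel tasks.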

5.3.2 MT as a List Homomorphism and Compute-intensive

As the transfer based machine translation system is a List Homomorphism Functional Application,
it is a natural candidate for the MapReduce framework: it can be easily parallelized [39, 52] and
executed on a cluster of a large number of physical machines.
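Formally (our restatement of the standard definition, not notation from the thesis), a function h on lists is a list homomorphism if there is a combining operator ⊕ such that

    h(xs ++ ys) = h(xs) ⊕ h(ys)

where ++ denotes list concatenation. For sentence-wise MT, h translates a sub-list of sentences and ⊕ is itself list concatenation, which is exactly the division of labour between the map step (translate the sub-lists independently) and the reduce step (concatenate their outputs).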

As discussed earlier, a machine translation system is a complex, compute-intensive application
with a large code size. In the MapReduce framework, these characteristics of an MT system create
a large communication load, completely draining the throughput advantage achieved by parallel
processing.

5.3.3 MT as a Dedicated Web Application

Any machine translation application is generic in nature, and hence it is usually offered to users
as a dedicated web application. This feature of the application can be utilized to avoid the
communication overhead through an innovative engineering approach.

5.3.4 Our Approach to run MT under Hadoop

Compute-intensive Web Application under Hadoop: To circumvent this problem for the MT system,
we have taken the following three steps (a sketch of the map and reduce tasks follows the list):

(i) We have developed a program, called MT Invoker, which calls the MT system; this
program has been defined as the map task,
(ii) The reduce task is a program, called ConCat, that simply concatenates the intermediate
outputs produced by the map tasks, and
(iii) We preload the MT system on all tasktracker nodes, presuming the dedicated web
application scenario.
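The following is a minimal sketch of these two tasks in Java under the Hadoop 0.20 API. It is our illustration: the thesis names the programs MT Invoker and ConCat but does not list their code, and the entry point /opt/sampark/translate.sh of the preloaded MT system, reading source text on stdin and writing target text on stdout, is an assumption.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/* Map task: invoke the MT system already preloaded on this tasktracker
   node; only this small class travels over the network, not the MT code. */
class MTInvoker extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text source, Context ctx)
            throws IOException, InterruptedException {
        // Hypothetical entry point of the preloaded MT system.
        Process mt = new ProcessBuilder("/opt/sampark/translate.sh").start();
        mt.getOutputStream().write(source.toString().getBytes("UTF-8"));
        mt.getOutputStream().close();
        BufferedReader out = new BufferedReader(
                new InputStreamReader(mt.getInputStream(), "UTF-8"));
        StringBuilder translated = new StringBuilder();
        for (String line; (line = out.readLine()) != null; )
            translated.append(line).append('\n');
        // Keyed by input offset so the reduce step can restore document order.
        ctx.write(offset, new Text(translated.toString()));
    }
}

/* Reduce task: concatenate the intermediate outputs of the map tasks;
   the keys arrive sorted, so writing them out restores the original order. */
class ConCat extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable offset, Iterable<Text> parts, Context ctx)
            throws IOException, InterruptedException {
        for (Text part : parts)
            ctx.write(offset, part);
    }
}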

Figure 5.1 MapReduce Approach for MT System

In this setup, it is MT Invoker and ConCat that are transferred by the jobtracker to the
tasktrackers running on the various nodes of the cluster. Both tasks, being compute-light, fit
well with the present implementation of Hadoop. When an MT Invoker starts running
on any tasktracker node, it in turn calls the MT system, which was preloaded on the node at
cluster setup time. In this way, multiple MT systems run in parallel on multiple tasktracker
nodes of the cluster, each translating a distinct input-split (viz., a sub-list of the input document). When
the translation outputs of all input-splits are available, the jobtracker runs ConCat on a
tasktracker node to concatenate all these sub-lists and generate the final translated output
document.
In this way, we have deluded Hadoop into running a compute-intensive job like the MT system as a
map task under the dedicated web application scenario.

5.4 Experiments and Results
5.4.1 Experimental Setup

The experiment has been done on the Hindi to Punjabi machine translation system to measure the
throughput of the system for four different data sets, viz., documents comprising 100 sentences,
200 sentences, 500 sentences, and 1000 sentences. All sentences are of similar sizes. The set of
experiments was carried out on a Hadoop cluster of physical machines (the PM Cluster),
having at most 10 slave machines available for parallel task allocation.

The configuration of the PM Cluster is: 4 * 4 Core Processor of 2.50GHz, RAM: 4GB, HDD: 320GB.

First, the CentOS operating system, viz., CentOS-5.4 (64-bit), is loaded on all machines of the
cluster. On this PM cluster, Hadoop version 0.20.2 has been installed, giving 12 slave machines of
2.50GHz with 4GB RAM. This includes setting up the jobtracker on the designated master node, and a
tasktracker on each of the remaining server nodes of the PM cluster. Then, using the rsync Unix
command, the MT system is loaded onto all the server nodes.

5.4.2 Results
For each data set, say the document of 100 sentences, we measured the
throughput by executing it at five levels of parallelism, i.e., a job composed of 1, 2, 4, 5, and 10
parallel tasks (achieved by splitting the data input into 1, 2, 4, 5, and 10 input-splits respectively).
For example, when an MT job containing an input document of 100 sentences is split into 5
parallel tasks, each task would have 20 sentences to translate. In other words, the same job
is split into multiple tasks by splitting the input, and the tasks execute on different computing
resources in parallel, giving better throughput. So, five experiments were done per data set. In any of the
experiments, we generated at most 10 parallel tasks, as the platform can offer at most 10
slave computing nodes to execute tasks in parallel. So, no task needs to wait, and all are executed
in parallel concurrently, giving the best throughput.

Table 5.1: Running time of each data set on the stand-alone PC

Sentences              100    200    500    1000
Total Time (seconds)   155    253    602    1295

In total, 5 experiments were done for each of the 4 data sets on the physical
platform, giving 20 (5*4) recordings in all. These measurements are tabulated in four
throughput-versus-tasks tables, one table per data set, given in Tables 5.2 through 5.5. Each table
shows the running time of the data set with increasing parallelism in the physical
environment. For comparison, the running time of each data set on the
current stand-alone computing system is given in Table 5.1.
Table 5.2: Running time for Job (input data set of 100 sentences)

No. of Tasks Sentences per Task Total Time (Seconds) for Job
1 100 160
2 50 91
4 25 66
5 20 59
10 10 34

Table 5.3: Running time for Job (input data set of 200 sentences)

No. of Tasks Sentences per Task Total Time (Seconds) for Job
1 200 274
2 100 131
4 50 80
5 40 72
10 20 54

Table 5.4: Running time for Job (input data set of 500 sentences)

No. of Tasks Sentences per Task Total Time (Seconds) for Job
1 500 336
2 250 251
4 125 138
5 100 134
10 50 87

Table 5.5: Running time for Job (input data set of 1000 sentences)

No. of Tasks Sentences per Task Total Time (Seconds) for Job
1 1000 968
2 500 334
4 250 230
5 200 226
10 100 184

Throughput-versus-tasks curves for the four data sets are shown in Figure 5.2, one curve per data set.

Figure 5.2: Total time vs. the number of tasks

The calculated value of the overhead associated with parallel distribution of tasks across the available
computing resources is 20-30 seconds for the physical machine cluster.

5.5 Conclusions and Future Work


5.5.1 Conclusions
Looking at the four tables and the graph, we come to the following conclusions:

Throughput of the MT application increases linearly with the amount of parallelism.

The best throughput extrapolated from the curves of the graph indicates the existence of
a universal throughput minimum value for each specific physical environment.

The throughput minimum of each physical environment is slightly higher than the
overhead associated with parallel distribution of tasks across the various nodes; hence, there is no
extra gain in throughput by increasing parallelism beyond a point.

By providing a sufficient amount of parallelism, the completion time of a document of any size can be
brought close to the universal throughput minimum, which is less than a minute!

5.5.2 Future Work
1. Additional Experiments with Symmetrical Computing Resources in Cloud Environment
We would like to repeat the same experiments on a cluster of virtual machines (with a symmetrical
computing platform and the same version of Hadoop) provided by a public cloud service
provider. As cloud infrastructure can provide computing resources on demand, the completion
time of a document of any size can be reduced to the time minimum, constrained only by the overheads of
MapReduce and the cloud infrastructure. These additional experiments would give the universal
throughput minimum with better precision.

2. Wider Applicability of Our Approach


Apart from rule-based/transfer-based machine translation systems, there are many NLP
applications which are list homomorphism functional applications (such as text to speech,
automatic speech recognition, transliteration systems, etc.), are generic in nature, and can be
provided as dedicated web applications. All such applications, following our engineering
approach, can be run under the MapReduce framework to enhance their throughput.

3. Need for Special Version of Hadoop for Compute-intensive NLP Applications


Due to the wide applicability of this approach for enhancing the throughput of compute-intensive NLP
applications, we feel there is a need to adapt Hadoop for such applications and develop a
special version that can be used by others with ease.

Chapter 6
MT on Cloud
In this chapter we describe in detail how we deploy an MT system on the cloud and package it as a
virtual appliance. We present the introduction in Section 6.1. We then present the background of
the MT system and why we need the virtual appliance in Section 6.2. Section 6.3 describes
work related to virtual appliances. Section 6.4 describes how we deploy the MT system in the cloud
and make it a virtual appliance for ease of deployment. Section 6.5 describes our approach to
translating with the help of MT as a virtual appliance, Section 6.6 reports the experiments and
results, and Section 6.7 presents the conclusions and future work.

6.1 Introduction
Machine Translation (MT) systems in general are composed of a large number of modules that
are heterogeneous in nature, and these heterogeneous modules in turn depend upon a complex
set of environmental dependencies to perform a given task. Resolving such complex
dependencies at deployment time is a hard, technically intensive, and time consuming task;
additionally, it is undesirable too.

Software deployment is defined as the process between the acquisition and execution of
the software. This process is performed as a post-development activity that also takes care of the
user-centric customization and configuration of the software. At times this process can be
quite complex and may need the involvement and expertise of the developers and the system
administrator quite extensively. Apart from deployment complexity, the deployment task may be
time consuming (of the order of hours). It is found [54, 55] that, in general, 19% of the total cost of
operation (TCO) of a software system goes into deployment. As an MT system is far more
complex and technically intensive, it is fair to expect that its TCO would be far higher.

Unlike generic applications, an NLP application like an MT system goes through frequent and
regular updates, mainly to improve its accuracy and performance, and also to increase the
coverage of its domain. Every new release of the system requires a fresh deployment of the new
version from scratch, aggravating the technical administration burden and, in turn, inflating the
total cost of operation even higher.

Furthermore, the response of an MT system slows sharply with growing load.

Scaling up computation resources with growing load, mainly to provide users with a better
response time, requires additional financial commitment from the service provider, which cannot
be done on-the-fly. Cloud infrastructure offered by a third party, where computation resources can
be scaled on-the-fly, seems to be the most appropriate platform for offering the service of such types
of applications. Deploying an application on cloud infrastructure, compared to a stand-alone
system, is far more complex, technically intensive, and time consuming. Hence, the total cost
of deployment for such applications becomes even more significant when they are to be deployed on
the cloud.

For a complex and technically intensive application like an MT system, which is to be
distributed to lay users and which may have frequent and regular version releases, the need to
minimize the deployment time and to diminish its technical complexity becomes imperative.

To satisfy the above needs, this chapter proposes that complex NLP applications like MT
systems should be packaged and released as virtual appliances [54, 70]. An application packaged as a virtual
appliance can just be taken out of the box by a lay user and deployed easily on his machine, and with
very little setup time the application becomes ready to use from that instant onwards.
Though packaging an application as a virtual appliance does take time and is also technically
intricate, once it is built its deployment can be done even by a lay user, and takes very little
time, as the application's complexity and its technical intricacy have been made transparent in its
virtual appliance incarnation. A virtual appliance packaged either for a stand-alone system or for the
cloud is expected to give proportional deployment advantages.

This chapter reports the experimental time measurement results for software deployment
of MT virtual appliance in relation to MT application. The results are reported for standalone
system as well as for cloud platform.

6.2 Background : Sampark MT System


Sampark MT [16] is a machine translation system for translating written text from one Indian
language to another (developed for 9 bidirectional language pairs). Each MT
system is typically composed of 15-20 modules, depending on the specificity of the language pair.
These modules in turn depend upon a complex set of environmental dependencies. These
environmental dependencies arise from the large number of language libraries used by the
modules at runtime, the operating system (OS), and other associated third-party tools. All these
dependencies are linked at runtime. MT systems are comparatively more complex due to their
inherent heterogeneous nature. This heterogeneity [56, 4, 23] is due to variations (in
computation model, programming language, module interfaces, etc.) across the various MT modules.
For the deployment of such a complex MT system, apart from resolving the dependencies,
each system needs to be configured for a specific language pair and a designated translation
domain. A typical deployment process for the Sampark MT system (composed of building, configuring,
and verification tasks) takes somewhere between two and three hours, depending on the
language pair. Many a time, on the release of a new MT version, this process can be quite complex
and may need the involvement of the developers/computational linguists to resolve its
deployment issues. It usually requires expertise in the MT application as well as in the
underlying systems used by the MT application.

The Sampark system has been built by a consortium [4] of eleven research institutions. Each
group is continuously working on its module to improve the accuracy and performance of the
system, so updates to the MT systems are very frequent. Handling these updates in a large
number of deployed systems, which is a routine task, is highly technically intensive and usually
very time consuming.

Sampark MT systems are also deployed on the web, and people use them regularly. For
the last two years consortium members have been improving the accuracy and performance of the
system regularly. It was found that as the web traffic increased, the performance of the system
(response time per translation) deteriorated sharply. Various options available for improving
the response time were experimented with, viz.:

Refactoring the modules to keep the file I/O minimum in production system [4]

Incorporation of cache in compute intensive modules [57]

Distributing the translation tasks among the cluster of machines running MT systems;
work load partitioning using Hadoop framework as middleware [58]

With the above experiments, it was possible to considerably improve the response time of the
Sampark MT system. With provisioning of additional hardware resources, it was possible to keep
the response time within optimum limits as the load increased. The Sampark MT system was able
to scale with provisioning of additional resources [58].

To translate a book of seventy pages, the Sampark MT system on a stand alone system
took 71 minutes and 25 seconds, while on Eucalyptus Cloud environment with twelve (12) virtual
machines (VMs) it took only 5 minutes and 27 seconds.

The objective of our initial experiment of deploying the Sampark MT system on a cloud
environment with provisioning of larger computation resources was to verify its scalability and
see the improvement in response time. But for cloud deployment, optimum resource utilization is
possible only when an application is able to scale up and scale down rapidly, i.e., in real time.

In our case, the MT system was unable to scale up (or scale down) rapidly, as it was taking a
very long time to deploy new instances of MT on additional virtual machines acquired
dynamically on the cloud. Once the additional resources were provisioned, the system remained
underutilized if the load decreased rapidly. While the MT system was in principle scalable, it
remained either over-provisioned (i.e., resources remained underutilized) or under-
provisioned (i.e., starved for resources). The large deployment time of the MT system was the
main reason for its inability to scale up/down rapidly, making it completely unsuitable for cloud
deployment.

In the early stage of improving the deployment process, we built a deployment script that
would set up and configure the MT system with little intervention from the user. This script
improved the speed of deployment, which came down from hours to minutes (somewhere between
20 and 30 minutes depending upon the expertise of the user), but it still required the user to
provide complex details of the environment, such as OS details, the paths of various language libraries,
and the other third-party tools required to run the various modules of the MT system. The complexity
of the deployment process remained unsolved. Moreover, in spite of the deployment script, the
deployment process largely remained manual.

This chapter mainly discusses how to build MT as a virtual appliance that can be
deployed on a stand-alone VM. The same MT virtual appliance can be deployed on the cloud, and can
be scaled up or scaled down in time.

6.3 Related Work


A virtual appliance [54, 59, 60, 61, 62, 63, 70] is a full application stack containing a Just
enough Operating System (JeOS), the application software, its required dependencies, and the
configuration and data files required to run the system. Everything is pre-integrated, pre-
installed, and pre-configured to run with minimum or no manual intervention [56]. Virtual
appliances come in the form of a data file that can be easily deployed in the cloud. Virtualization
increases the mobility of a software system, and can greatly simplify server provisioning
because each VM is contained in a single file. As a result, virtual machines can be easily cloned
(copied to create additional images) and deployed. In 2004, Dell engineers
found that virtualization can reduce deployment time considerably [64].

In the past, many tools were designed to ease system/service deployment. Dearle [65]
studied six cases of software deployment technologies and gave some future directions
on the impact of virtualization. The benefits of virtualization and virtual machines are
discussed in [60, 61, 62]. Sapuntzakis et al. [66] proposed Collective, a compute utility using
virtual appliances to manage systems. In the past, software vendors have assembled software
applications, operating systems, and middleware (if any) as virtual appliances (or software
appliances) and distributed them as ready-to-run application stacks [67] that boot into a setup
wizard. Most of the time, virtual appliances have been built to ease the distribution of software.

In the past, people have used virtual appliances to improve software manageability and
automate provisioning [68]. Deploying applications in the cloud allows scaling on demand, and
provides the benefits of elasticity and transference of risks, especially the risks of overprovisioning
and underprovisioning [69]. The key benefit of enabling an application for cloud deployment as a
virtual appliance is the ability to add or remove computational resources with fine granularity and with a
lead time of minutes rather than hours.

In contrast, we have experimented with building the MT system as a virtual appliance to facilitate
its deployment on virtual machines.

6.4 Deploying MT system on the cloud


A virtual appliance can run on a standalone VM or on a VM available in a cloud. To build Sampark MT into
a virtual appliance and deploy it into a cloud we used the following tools:
- CentOS 5.3, a variant of Linux, as the guest operating system,

- Xen [71] for virtualization of the hardware,

- CentOS 5.7, a variant of Linux, as the Host Operating System for virtualization,

- Hadoop [73] Version 0.20.2 as the middleware for work load partitioning, and

- Eucalyptus [72] Version 2.0.3 for setting up the cloud infrastructure.

The engineering steps taken to set up Eucalyptus on a cluster of physical machines and Hadoop
on a cluster of VMs are given in detail in Appendix-B and Appendix-C respectively.
The various terms used in building and deploying the virtual appliance in the Eucalyptus cloud are
explained below.
Eucalyptus Cloud:
image: An image is a snapshot of a system's root file system; it provides the basis for
instances
baseline-image: A baseline image usually includes the optimum operating system, any required
service packs, a standard set of applications (that is to be virtualized), other underlying tools
required by the application, and the necessary patches, if any; loosely speaking, a snapshot of the
root file system with the running application
euca2ools: Image management commands for Eucalyptus Cloud
euca-bundle-vol: Bundle the local file system of a running instance along with kernel and ramdisk

52
euca-upload-bundle: Upload the bundled image to the cloud
euca-register: Register an image for use with the cloud
euca-describe-instances: To check the status of the image instance in the cloud
euca-run-instances: Start a new instance of an image in the cloud
euca-terminate-instances: Terminate an instance of an image in the cloud

Hadoop: MapReduce Terms


NameNode: the Hadoop master node, the machine where the computation task is actually
submitted (in our setup it also runs the jobtracker); the master partitions a workload among the
available DataNodes and schedules the computation tasks on them
DataNode: the Hadoop worker nodes (each also running a tasktracker), on which the tasks are actually computed
hadoop: the Hadoop command interpreter that actually executes all the commands for the Hadoop
cluster
In our experiment, we had
- Each PC - a Quad-core 2.5 GHz Intel processor, 4 GB RAM, and 2MB of L2 cache,
- Each Virtual Machine in the cloud had: 2 CPU with 1GB RAM each,
- In the Eucalyptus Cloud a total of 12 Xen Virtual Machines.
The Eucalyptus cloud is set up using Xen as the virtualization platform. An instance of the default
CentOS 5.3 image is instantiated on an available VM. It takes approximately 60 seconds to
boot the virtual image, whose size is 1GB, on the hardware mentioned above.

Once the virtual image is running, the Sampark MT system with all its required dependencies,
and the middleware Hadoop, is installed on it. The current running image is then rebased:
it is bundled along with the kernel and ramdisk by the command euca-bundle-vol,
the bundled image is uploaded to the base machine by the command euca-upload-bundle, and lastly the
image is registered by the command euca-register to make sure it is available for launch. Now the
rebased image is ready for deployment as a virtual appliance on a Xen VM.

Deployment of the Sampark MT virtual appliance on a virtual machine takes approximately
150 seconds (on our hardware), which is well within the desirable limit. The image size of the
Sampark MT virtual appliance is 4GB.

6.5 Our Approach to run MT under Cloud
To run the Sampark MT system in the Eucalyptus cloud, we launch the baseline image of the Sampark MT
system for the respective language pair. Once all the available
instances are ready for use, we designate one as the master node and the rest as slaves, and change
the configuration accordingly to run the Sampark MT system.

We have taken the following three steps (a sketch of the upload step follows the list):

(i) We have developed a program, called paragraph extractor, which extracts the paragraphs
from files in .doc and .txt format. We copy all the extracted files into Hadoop HDFS
manually,
(ii) We have developed a program, called MT Invoker, which in turn calls the MT system;
this program has been defined as the map task, and
(iii) The reduce task is a program, called ConCat, that simply concatenates the intermediate
outputs produced by the map tasks.
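As an illustration of step (i), the following is a minimal sketch (ours, not thesis code) of copying the extracted paragraph files into HDFS with the Hadoop FileSystem API; the local and HDFS paths are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads the cluster's core-site.xml
        FileSystem hdfs = FileSystem.get(conf);
        // Copy every extracted paragraph file into the job's HDFS input directory.
        hdfs.copyFromLocalFile(new Path("/tmp/paragraphs"),   // local output of the extractor
                               new Path("/user/mt/input"));   // HDFS input path of the MT job
    }
}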

Figure 6.1 MapReduce Approach for MT System

In this setup, it is MT Invoker and ConCat that are transferred by the jobtracker to the
tasktrackers running on the various nodes of the cluster. Both tasks, being compute-light, fit
well with the present implementation of Hadoop. When an MT Invoker starts running
on any tasktracker node, it in turn calls the MT system. In this way, multiple MT systems
run in parallel on multiple tasktracker nodes of the cluster, each translating a distinct input-split (viz.,
a sub-list of the input document). When the translation outputs of all input-splits are available, the
jobtracker runs ConCat on a tasktracker node to concatenate all these sub-lists and generate
the final translated output of the input document.

6.6 Experiments and Results
The experiment has been done on the Hindi to Punjabi machine translation system to measure the
throughput of the system for the book Nirmala by Premchand, comprising 1359 paragraphs,
5,049 sentences, and 74,951 words, with about 14 words per sentence on average. The counts of
paragraphs, sentences, and words are approximate, because a program extracts the paragraphs, a
tokenizer splits the paragraphs into sentences, and the wc command counts the words.

6.6.1 Results

We measured the throughput by executing the whole book on 2, 4, 8, and 12 virtual
machines, each with a dual-core CPU and 1 GB RAM. The job (the book) is composed of 1359 paragraphs,
each of which is a parallel task (achieved by splitting the whole book with the paragraph extractor program). In other
words, the same job is split into multiple tasks executing on different computing resources, in
parallel, giving better throughput. With 2 virtual machines, 4 tasks (paragraphs) run in
parallel and the others wait in the queue, and so on; with 12 VMs, 24 tasks (paragraphs) are in parallel
execution while the rest are queued.

Table 6.1: Throughput result on a stand-alone PC

Book          Nirmala by Premchand
Total Time    198 minutes 13 seconds

The measurements of our experiment are tabulated in Table 6.2. For comparison, the
throughput of the same data set running on the current stand-alone computing system is given in Table 6.1.

Table 6.2: Throughput results on the Cloud with increasing no. of VM

No. of Virtual Machines Total Time


2 112 m + 19 sec
4 54 m + 6 sec
8 26 m + 26 sec
12 16 m + 10 sec

The graph of total time versus the number of virtual machines is shown in Figure 6.2.

Figure 6.2: Total time vs. the number of virtual machine

The calculated value of the overhead associated with parallel distribution of tasks across the available
computing resources is 20-30 seconds for the virtual machine cluster.

6.7 Conclusion and Future Work

6.7.1 Conclusion

In this chapter we have shown experimentally that a complex MT system can be built into a
virtual appliance that can be deployed in the cloud.

6.7.2 Future Work

1. Wider Applicability of this Approach


Apart from rule-based/transfer-based machine translation systems, there are many NLP
applications which are list homomorphism functional applications (such as text to speech,
automatic speech recognition, transliteration systems, etc.), are generic in nature, and can
likewise be provided as virtual appliances.

2. Virtual Appliance with Repositories


We also intend to make this virtual appliance available from repositories, from where it can be
deployed in the cloud; this would enable the MT system to handle its frequent updates.

3. Provisioning Policy for Auto scaling


Deploying an application in the cloud allows scaling on demand and provides the benefits of elasticity. We
will implement auto-scaling by building a script to initiate/shut down additional VM instances as
required.

Chapter 7
Conclusion and Future Work
In this chapter, we present the overall conclusions of this thesis. We summarize the conclusions
of this thesis in Section-7.1. A few possible future directions of our work are presented in Section-
7.2.

7.1 Conclusions
A new methodology, called the Symbiotic Software Engineering Approach, was applied to natural
language modules previously developed in a laboratory environment to produce field-deployable and
maintainable MT systems (and modules). As a result, the Sampark MT systems were successfully
engineered and deployed at http://sampark.org.in.

We have also developed a software integration and deployment platform for NLP applications,
called Dashboard, which has been enhanced to provide functional-module developers with
mechanisms to avail of and configure, if they so desire, the dynamic cache facility offered by the
system integration platform to improve the performance of the modules. By using the cache
mechanism in some of the modules we achieved a 15-20% improvement in speed.

We used the Hadoop MapReduce framework to enhance the throughput of the MT system without
tinkering with the modules. To enhance the throughput, we developed an approach that makes the
MapReduce framework applicable to a computationally intensive application like an MT system.

Finally, we have built the MT system as a virtual appliance for deployment on the cloud.

7.2 Future Research Direction


We will try the symbiotic software engineering approach with other knowledge-intensive
applications.

In the future, experiments similar to Heaps' law can be done on phrases to exploit the
caching facility offered by the system integration platform and improve the
performance of the modules. On distributed platforms, a distributed cache can also be tried.

The wider applicability of our engineering approach for enhancing the throughput of computationally
intensive applications and deploying them on the cloud can also be tried with other applications.

Appendix-A

Software Engineering Guidelines

1. Coding Guidelines
2. Inline Comments
3. Code Review Checklist
4. Module Packaging Guidelines
5. System Integration Guidelines
6. System Deployment Directory Structure
7. Software Requirement Specification Guidelines
8. References

1. Coding Guidelines

1.1 Naming Conventions for Variables & Functions

1. In programs, the names of global variables and functions themselves serve as comments.
So, as far as possible, intuitive names should be chosen for variables as well as functions.
Such names convey useful information about the variables and functions, and drastically
improve the readability of the code.
2. Function names should begin with verbs, e.g., find_index(), build_tree(),
calculate_emit_prob(). Similarly information entities should be chosen as noun words, e.g.,
letter_tree
3. Underscores should be used to separate words in variable & function names e.g.,
input_file(), train_lambda().
4. Use enums when you want to define names with constant integer values.
5. Avoid using magic numbers for array size definitions, rather use hash (#) defined macros in
CAPITAL letters to define Array sizes.
6. Local variable names can be shorter, because they are used only within a local context;
inline comments explain their purpose.

1.2 Command Line Arguments

1. All programs should provide Unix like Filter3 interface.


2. All programs should provide -h or --help command line options, so that one can refer to the
usage of the program.
3. Command line arguments parsing should be done using getopt. It will do away with the
limitation of the sequence of arguments on the command line.

3. As is generally the case with command line (i.e., all-text mode) programs in Unix-like operating systems, filters read
data from standard input and write to standard output. Standard input is the source of data for a program, and by
default it is text typed in at the keyboard. However, it can be redirected to come from a file or from the output of
another program. Standard output is the destination of output from a program, and by default it is the display
screen. This means that if the output of a command is not redirected to a file or another device (such as a printer) or
piped to another filter for further processing, it will be sent to the monitor where it will be displayed.

4. Command line arguments parsing should be done at the beginning of the program. It should
not be scattered all around the program code.
5. There must be a provision for giving input arguments by specifying command line options.
6. There must be a provision to redirect input from standard input.
7. There must be a provision to redirect output to standard output.
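As an illustration of the filter interface asked for above (our sketch, not part of the guidelines; Java, one of the consortium's languages, has no built-in getopt, so a simple manual option check stands in for it):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class FilterSketch {
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            if (arg.equals("-h") || arg.equals("--help")) {
                System.out.println("usage: filter_sketch [-h] < input > output");
                return;                        // print usage, then exit
            }
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; )
            System.out.println(line);          // identity filter: copy stdin to stdout
    }
}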

1.3 Application Error Logs & Debug Logs

1. Application Error Logs and Debug Logs should be generated (if possible) using macros in C
programs, or log4perl like mechanism in Perl programs. Log4java can be used for Java
programs.
2. It should be implemented so that logging is configurable at runtime, with the help of
command line options.
3. Applications should provide logging messages at multiple levels, like info, warn, and fatal.
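A minimal sketch of such runtime-configurable, multi-level logging in Java, assuming log4j 1.x (referred to above as Log4java); taking the configuration-file name from the command line is our illustration of runtime configurability:

import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;

public class LoggingSketch {
    private static final Logger LOG = Logger.getLogger(LoggingSketch.class);

    public static void main(String[] args) {
        // Levels and appenders come from a config file named on the command
        // line, so logging verbosity is configurable at runtime.
        PropertyConfigurator.configure(args.length > 0 ? args[0] : "log4j.properties");
        LOG.info("module started");            // info level
        LOG.warn("dictionary entry missing");  // warn level
        LOG.fatal("cannot open input file");   // fatal level
    }
}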

2. Inline Comments

2.1 Every Program should start with a brief comment explaining what it is for. This comment
should be at the top of the program file. Additionally it should mention the name of the
original and subsequent authors of the program. For example,

/* wc - prints the number of bytes, words, and lines in the files.
   Copyright (C) 85, 91, 1995-2002 Free Software Foundation, Inc.
   Written by Paul Rubin, phr@ocf.berkeley.edu
   and David MacKenzie, djm@gnu.ai.mit.edu. */

2.2 Put a comment on each function, saying what the function does, and what sorts of
arguments it gets. What are the possible values of arguments and for what they are used.
For example,

/* concat returns a newly-allocated string whose contents concatenate


those of string S1, string S2 and, string S3. */
static char *
concat (const char *s1, const char *s2, const char *s3)
{
int len1 = strlen (s1), len2 = strlen (s2), len3 = strlen (s3);
char *result = (char *) xmalloc (len1 + len2 + len3 + 1);
strcpy (result, s1);
strcpy (result + len1, s2);
strcpy (result + len1 + len2, s3);
result[len1 + len2 + len3] = 0;
return result;
}

2.3 Explain the significance of return value, if there is one.
/* different returns zero if two strings OLD and NEW match, nonzero
if not. OLD and NEW point not to the beginnings of the lines but rather to
the beginnings of the fields to compare.
OLDLEN and NEWLEN are their lengths. */
static int
different (char *old, char *new, size_t oldlen, size_t newlen)
{
if (check_chars < oldlen)
oldlen = check_chars;
if (check_chars < newlen)
newlen = check_chars;
if (ignore_case)
{
/* FIXME: This should invoke strcoll somehow. */
return oldlen != newlen || memcasecmp (old, new, oldlen);
}
else if (HAVE_SETLOCALE && hard_LC_COLLATE)
return xmemcoll (old, oldlen, new, newlen);
else
return oldlen != newlen || memcmp (old, new, oldlen);
}

2.4 Comments on a function are much clearer if you use argument name to speak about the
argument values. Variable names itself should be lowercase, but write it in UPPER case
when you are speaking (referring) about the values.

Example 1:
/* right_justify Copy the SRC_LEN bytes of data beginning at SRC
into the DST_LEN-byte buffer, DST, so that the last source byte is
at the end of the destination buffer. If SRC_LEN is longer than

DST_LEN, then set *TRUNCATED to nonzero.
Set *RESULT to point to the beginning of (the portion of) the
source data in DST. Return the number of bytes remaining in the
destination buffer. */
static size_t
right_justify (char *dst, size_t dst_len, const char *src,
size_t src_len, char **result, int *truncated)
{
const char *sp;
char *dp;
if (src_len <= dst_len)
{
sp = src;
dp = dst + (dst_len - src_len);
*truncated = 0;
}
else
{
sp = src + (src_len - dst_len);
dp = dst;
src_len = dst_len;
*truncated = 1;
}
*result = memcpy (dp, sp, src_len);
return dst_len - src_len;
}

Example 2:
/* concat returns a newly-allocated string whose contents concatenate
those of string S1, string S2 and, string S3. */
static char *
concat (const char *s1, const char *s2, const char *s3)
{
int len1 = strlen (s1), len2 = strlen (s2), len3 = strlen (s3);
char *result = (char *) xmalloc (len1 + len2 + len3 + 1);
strcpy (result, s1);
strcpy (result + len1, s2);
strcpy (result + len1 + len2, s3);
result[len1 + len2 + len3] = 0;
return result;
}
2.5 If there are global variables inside the programs, each should have a comment
preceding it. Generally global variables can have longer names.

Examples:

/* The name this program was run with. */


char *program_name;

/* Number of fields to skip on each line when doing comparisons. */


static size_t skip_fields;

/* Number of chars to skip after skipping any fields. */


static size_t skip_chars;
For more detailed coding standards one should refer to GNU Coding Standards.

3. Code Review Checklist

3.1 All the program codes and the training & testing data should be put into a tar distribution,
like <module_name-ver.tgz>, e.g., postagger-0.1.tgz
3.2 There should be a README file inside the tar distribution, which should explain the
directories and other files inside the tar distribution.
3.3 There should be an INSTALL file inside the tar distribution that should explain how to
install and test that module.
3.4 There should be a Makefile file inside the tar distribution that should contain instructions
for build, install, clean, and uninstall the module.
3.5 There should be a brief description of the program at the beginning of each program files.
Normally the program names should be intuitive.
3.6 Brief descriptions of the functions/methods at the beginning should be given. Description
of the function arguments and their types must also be given at the beginning.
3.7 Significance of the return values of functions and programs, if any, must be explained.
3.8 All the tar distributions must have a docs directory which should contain SRS document of
the module as discussed.
3.9 The SRS document should have sections like, Flow Chart, Data Flow Diagram, and
Function Descriptions.

A review checklist Template (proposed) is attached on the next page.

S.N. Description Yes/No
1 Module code available in tar distribution with version number, like,
module_name-0.2.tgz
2 README file available in the tar distribution
3 INSTALL file available in the tar distribution
4 docs directory available in the tar distribution
5 First level SRS as per the templates provided.
6 Flow Chart4 for program flow available in the SRS document
7 Data Flow Diagram5 available in the SRS document
8 Class Diagram and/or Sequence Diagram6 available in the SRS document
9 Program description available at the beginning of the program. List each of
the program files and mention if description for each of them is available.
10 Brief description of the functions/methods available. List all the functions
available in a program.
11 Description of function arguments and types available
12 Significance of return values explained

4. The Flow Chart should explain the basic flow of the application. It must show the decision points.
5. If the module is designed with the Structured Design approach, then a DFD should be provided for that module.
6. If the module is designed with Object Oriented Design, then a Class Diagram & Sequence Diagram should be available.

4. Module Packaging Guidelines

4.1 All the Programs Files (source code, binaries, and scripts), Data Files (training data & other
lexical resources like dictionary), and other documents must be packaged into a single tar
distribution.

4.2 It should be named as <module_name-version>.tgz, e.g., postagger-0.2.tgz

4.3 The tar distribution must have parent directory named as module_name-version, like
postagger-2.0. All the other files and directories should be inside this parent directory.

4.4 A docs directory must be available inside parent directory which has README, INSTALL,
Copyright, and other documents. The parent directory must have src, bin, and data directory
for corresponding files.7

4.5 All the modules must be accompanied with Makefile for the module. This Makefile must
contain instructions for build, install, uninstall, and clean. A Template Makefile will be
provided as sample.

4.6 Makefile should build & install the application in the directory chosen by the user, like
/usr/local/sampark/

4.7 Any user on a Linux machine should be able to run the application from his home directory,
as long as application binaries/scripts are in the PATH environment variable.

4.8 A small shell script, <module_name>_run.sh should be made available to run and test the
application with the test/validation data made available by the developer of the application.

4.9 A symbolic link of this <module_name>_run.sh can be provided in /usr/bin (by the install
script) so that it can be accessed from anywhere to test the module.

4.10 Validation Data (in SSF Format) must be comprehensive enough to test the functionality of
the module.

4.11 Validation Data (in SSF Format) should be comprehensive enough to test the accuracy of
the module (desirable).

7. Refer to the section System Deployment Directory Structure of the Sampark system.

5. System Integration Guidelines

5.1 The system comprises various modules. Each module in turn comprises sub-modules
(programs). Programs of each module may be written in different languages, like C, CPP, Perl,
Java, Python, and Lex.

5.2 Module should run as standalone application, given a valid test-data in SSF Format

5.3 Dashboard application shall be used to integrate the various modules of Sampark system.

5.4 All the modules will be put in a sequential pipeline using a Dashboard specifications file.

5.5 In the specifications file we define the following for each module:

Dependencies for each of the modules/programs, e.g., the names of the modules whose output
this module needs.

Language for each of the programs

Input/Output specifications for each program, whether in-memory or stream.

The runtime parameters for each program, e.g.,

postagger n2 d5 test=test_data input_lex=train_lex input_123=train_123

6. System Deployment Directory Structure
6.1 Directory Structure for Sampark System Deployment
Requirements: There would be several program files and data files in the Sampark system.
They will be organized on the basis of modules, versions, and their dependence on the source or
target language. The other files will be organized in the conventional style.

Directory Structure

1. <prefix_dir>/sampark/src/ can contain the program source code for Sampark


system

2. <prefix_dir>/sampark/data_src/ can contain the data files for Sampark system

3. <prefix_dir>/sampark/docs/ will contain the all the documents of the Sampark


system

4. <prefix_dir>/sampark/bin/<dep>/<module_name>/<lang|lang_pair|sys>/ will
contain the program binaries and scripts of the Sampark system

5. <prefix_dir>/sampark/data_bin/<dep>/<module_name>/<lang|lang_pair>/ will
contain the data_binaries (dictionary, trained parameters, etc.) of the Sampark
system

6.2 Explanation of the Directory Structure mentioned above


<prefix_dir> Root directory where the Sampark system is being deployed, e.g., /usr/local

<dep> possible values for dep could be sl, tl, sl_tl, sys, or common.
sl contains analysis modules that are source language dependent
tl contains generation modules that are target language dependent
sl_tl contains transfer modules that are source-target language pair dependent
sys contains the modules that are independent of languages, like format converter
programs, e.g., text2ssf, wx2utf, ssf2tnt, tnt2ssf, ssf2bio
common contains the modules/programs that are common to all languages.
<module_name> name of a Sampark system module; it could be tokenizer, morph,
postagger, chunker, lwg, lex_transfer
<lang|lang_pair|common> possible values are tel, hin, tel_hin, tam, tam_hin, etc.
The above deployment directory structure of Sampark MT system can be used for other system
as well.

6.3 Directory Structure for Sampark Validation Data

Requirements: Validation data of the Sampark system should be stored and made available for
future reference, for analysis, comparison, and improving the accuracy of the Sampark system.
Identification of the version should be possible at the module level as well as the system level.
Identification of the specs file and other run-time parameters should be possible when the data is
referred to in the future, so that regression testing can easily be performed.

Directory Structure for System Validation


<prefix_dir>/sampark/val_data/system/<lang_pair>/ver_<x.y.[z]>/<nameof_validation_data
>/<nameof_specs_file>/
<prefix_dir> Root Directory where validation data will be stored, e.g., /var
<lang_pair> source target language pair, e.g., tel_hin, hin_tel, tam_tel, tel_tam, etc.
<x.y.[z]> Version number of the Sampark system, e.g., 0.1, 0.1.1, etc.
<nameof_validation_data> Name of validation data, e.g., samachar, parichay, mhatma,
gandhi, shanti, sandesh, bangladesh war, etc.
<nameof_specs_file> Name of the specs file which was used to run the validation data,
e.g., sampark_specs-3

Example: When we run Sampark system (version 2.2) on a validation data (file named, gandhi)
for telugu to hindi with specs file (sampark_specs-2), and <prefix_dir> is /var, then we will have
following directory and files.

/var/sampark/val_data/system/tel_hin/ver_2.2/gandhi/sampark_specs-2/
gandhi.rin_rout
gandhi.rin
gandhi.rout
sampark_specs-2
sampark_specs-2.exe
tokenizer.in
tokenizer.out
ssplit.in
ssplit.out
morph.in
morph.out
postagger.in
postagger.out
chunker.in
chunker.out
lwg.in
lwg.out
lex_trans.in
lex_trans.out

All these files will be present in the above mentioned directory.

6.4 Directory Structure for standalone Module Validation
<prefix_dir>/sampark/val_data/modules/<dep>/<module_name>/ver_<x.y.[z]>/<nameof_validatio
n_data>/<nameof_specs_file>/

<prefix_dir> Root directory where validation data will be stored, e.g., /var
<dep> possible values for dep could be sl, tl, sl_tl, or common. It tells us whether
the module is dependent on source language, target language; or is dependent on
language pair (source & target); or the module is language independent.
<module_name> name of the module that is being tested
<x.y.[z]> Version number of the module, e.g., 0.1,
<nameof_validation_data> Name of validation data, e.g., samachar, parichay,
mahatma, gandhi, shanti, sandesh, bangladesh war,
<nameof_specs_file> Name of the specs file which was used to run the validation
data, e.g., sampark_specs-3

Example: When we run postagger (version 2.1) on a validation data (file named, gandhi) for
telugu with specs file (sampark_specs-2), and <prefix_dir> is /var, then we will have following
directory and files.

/var/sampark/val_data/modules/sl/tel/postagger/ver_2.1/gandhi/sampark_specs-2/
gandhi.rin_rout
gandhi.rin
gandhi.rout
sampark_specs-2
sampark_specs-2.exe
postagger.in
postagger.out

All these files will be present in the above mentioned directory.

The above validation directory structure of Sampark MT system can be used for other system as
well.

7. SRS Guidelines
SRS for each module of Sampark system should have the following sections depending upon
the design approach followed by the module developers.

Input Output Specifications


Flow Chart
Process Descriptions & Data Flow Diagram
Class Diagram & Sequence Diagram

7.1 Input Output Specifications


The SRS for a module should specify a valid input SSF for the module. It should enumerate the
mandatory fields in the input SSF and mention the possible domain of each field. It
should specify a valid output SSF for the module, and mention the fields and the
corresponding values that are modified by the module in the input SSF.

7.2 Flow Chart


Flow Chart should show the sequence of activity performed by a module.

7.3 Process Descriptions


If the module is developed with structured design approach, then it must provide the process
description for the module, which includes a list of steps taken by the module to complete its
functionality.

7.4 Data Flow Diagram


If the module is developed with structured design approach, then it should provide Data Flow
Diagram to show the information flow inside the module.

7.5 Class Diagram & Sequence Diagram


If the module is developed with the object-oriented approach, the SRS must provide a Class Diagram
as well as a Sequence Diagram for the module in UML notation.

References
1. GNU Coding Standards

2. GNU Make

3. IEEE Recommended Practice for Software Requirements Specifications, IEEE Std. 830-1998.

Appendix-B
Setting up Eucalyptus on a Cluster of Machines
This guide provides instructions for a single-cluster deployment of Eucalyptus Open Source
release 2.0.3 on a two-machine system running CentOS 5.7.

Eucalyptus Overview
Eucalyptus is a Linux-based software architecture that implements scalable private and hybrid
clouds within your existing IT infrastructure. Eucalyptus allows you to provision your own
collections of resources (hardware, storage, and network) using a self-service interface on an as-
needed basis.

Eucalyptus Components
Eucalyptus consists of the following distributed components:
- CLC: Cloud Controller (EC2 functionality)
- Walrus (S3 functionality)
- CC: Cluster Controller (middle-tier cluster management service)
- SC: Storage Controller (EBS functionality)
- NC: Node Controller (controls VM instances)
Hardware Requirements
You will need a minimum of two machines to host the Eucalyptus components. All the node machines
should have multicore processors with Virtualization Technology (VT) enabled.

Frontend/Base (CLC, Walrus, CC, SC): Minimum 100GB HDD and 4-8GB RAM
Node (NC): Minimum 100GB HDD and 4-8GB RAM
Be sure that all machines are running the latest release of CentOS 5.7. Test that all systems allow
ssh login, and that root access is available (sudo is OK).

Network Requirements
1. You must have a DHCP server available (installed but not running).
2. You must have a range of available public IP addresses. These will be assigned to Virtual
Machines (instances).
3. You must have a large range of available private IP addresses. These will be used by a virtual
subnet. They cannot overlap or contain any part of a physical network IP address space.
Software Requirements
You must have access to the following:
- CentOS 5.7 install CD
- Eucalyptus Fast Start media (the same can be downloaded from the Eucalyptus web site
http://open.eucalyptus.com)

A) Operating System (OS) Installation

B) Node Configuration

C) Frontend Configuration

D) Getting Started Using Eucalyptus 2.0.3


1- SSH Key Management
2- Image Management
3- Virtual Machine (VM) Instance Management
4- Block Storage Management
5- How to rebuild or rebase the instance?

E) FAQ

For details visit the below URL:


http://open.eucalyptus.com
Last accessed: 21-March-2012

A) Install CentOS 5.7 on all machines

1- At the boot: prompt, press ENTER for the graphical installer. You can skip the media
verification if you like, then accept the defaults, with the following exceptions

1.1 For network interface configuration, select Edit and manually configure
I) IP address
II) Netmask
III) Hostname
IV) Gateway
V) DNS

For example:
IP address: 192.168.1.65
netmask: 255.255.255.0
hostname: node1.in
gateway: leave it blank
DNS: leave it blank

1.2 Package options


- Deselect Desktop - Gnome
- Select Server
- Select Customize later
- Click next

1.3 Choose Install CentOS and click Next; select "Remove linux partitions on selected drives &
create default layout" and click Next.

1.4 Choose region -> Asia/Kolkata


- Deselect "System Clock uses UTC"
After the OS installation completes, the machine will be rebooted.

Tip: We recommend that you install your node controllers first so that you can register them with
the frontend/base server as part of that install.

B) Node Configuration

1- Now log in as root on the node machine, insert the Eucalyptus Fast Start media (CD/DVD or
pen drive), and mount it to copy the cloud setup:

mkdir /media/usb/
mount -t vfat /dev/sdb1 /media/usb

Note: the USB drive might appear as a device other than /dev/sdb1 depending on your hardware
configuration (e.g., /dev/sdc1, etc.). Use the command fdisk -l to see the devices.

2- Stop the firewall


/etc/init.d/iptables save
/etc/init.d/iptables stop
chkconfig iptables off

3- Install ntp to update the date and time of the other machines on the network.

yum install ntp

Uncomment the line below in /etc/ntp.conf:
broadcast 192.168.1.255 key 42

Edit the above line to read broadcast 192.168.1.255, then save and exit.


/etc/init.d/ntpd start

4- Configuring the NC

cd /root/faststartusb
chmod +x *.sh
./fastinstall.sh
Answer n to the first question (this is not the front-end server).

Note: After the install completes, the machine will reboot automatically and be ready to use.

C) Frontend Configuration

1- Now log in as root on the front-end machine, insert the Eucalyptus Fast Start media (CD/DVD or
pen drive), and mount it to copy the cloud setup:

mkdir /media/usb/
mount -t vfat /dev/sdb1 /media/usb

Note: The USB drive might appear as a device other than /dev/sdb1 depending on your hardware
configuration (e.g., /dev/sdc1, etc.)

2- Stop the firewall


/etc/init.d/iptables save
/etc/init.d/iptables stop
chkconfig iptables off

3- Synchronize the machine with the NTP server running on any machine in the cluster: install ntp
first, then update the date and time using the ntpdate <NODE-IP> command (NODE-IP is the IP
address of the machine where the NTP server is running).

yum install ntp


ntpdate 192.168.1.65

4- Ensure that from your front-end machine the node(s) are accessible: ping NODE-IP or ssh
root@NODE-IP (do not actually log in; only test whether it asks for a password).

5- Configuring the frontend

cd /root/faststartusb
chmod +x *.sh
./fastinstall.sh
Answer y to the first question (this is the front-end server)

Note: In the install process, you will be asked for

a. Public and private network interfaces (defaults should be fine)


b. Private subnet (address used by instances internal to the cloud)
c. Private netmask
d. DNS server address (used for name resolution in the cloud)
e. Available IP addresses for the instances (public address on your network that instances can
use)
For example:
The public ethernet interface [eth0] Press ENTER
The private ethernet interface [eth0] Press ENTER
Eucalyptus-only dedicated subnet [192.168.0.0] 192.168.1.0
Eucalyptus subnet netmask [255.255.255.0] 255.255.255.0
The DNS server address [?.?.?.?] Press ENTER
Based on the size of your private subnet, we recommend the next value be set to 16
How many addresses per net? [32] 16
The range of public IPs [?.?.?.?-?.?.?.?] 192.168.1.100-192.168.1.120

Services will now start and the script will wait for the CLC to run. The script registers components
(you will be asked about connecting to the server over ssh and to provide the root password).
As the last step on the frontend, you need to enter the IP address of each NC (you can have more
than one). Press ENTER (at the prompt) when you are done.

In a web browser, go to the URL provided. It will be in this format


https://<your_frontend_IP_address>:8443/

6- Login as admin/admin

7- Enter a new admin password and your email address


e.g. passwd: admin123
email : admin@localhost

8- Configure the dhcpd server to assign IPs to the virtual machines: copy the lines below, paste
them into dhcpd.conf, save, and exit. Then start the dhcpd server.
-------------------------------------------
option domain-name "lan.ltrc.org";
ddns-update-style none;
option subnet-mask 255.255.255.0;
default-lease-time 600;
max-lease-time 7200;

subnet 192.168.1.0 netmask 255.255.255.0 {


authoritative;
range 192.168.1.100 192.168.1.120;
option routers 192.168.1.60;
}
-------------------------------------------
vi /etc/dhcpd.conf
Copy and paste the lines above, save the file, and then start (or restart) the dhcpd server:
/etc/init.d/dhcpd start
Once your dhcpd server is running, your cloud cluster is ready to use.
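Optionally, confirm that the server is up and make it start automatically on reboot (standard CentOS service commands):

/etc/init.d/dhcpd status
chkconfig dhcpd on   # optional: start dhcpd automatically at boot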
D) Getting Started Using Eucalyptus 2.0.3
Euca2ools are command-line tools for interacting with Eucalyptus; they are also compatible with the Amazon EC2 and S3 services.
Log in to the front-end machine and export the environment variables by sourcing the eucarc file, which is required before euca2ools can be used:
source /root/eucarc

Is Eucalyptus running properly?

Run the euca-describe-availability-zones verbose command to check that Eucalyptus is running:
euca-describe-availability-zones verbose
The command outputs the available cloud resources. Proceed only if the available resources are listed correctly.
For a quick start, follow these steps.

Create a keypair for ssh authentication


euca-add-keypair mykey | tee mykey.private
chmod 0600 mykey.private

Launch the instance


To launch an instance, first list the available images in your cloud cluster by running the euca-describe-images command:
euca-describe-images
"euca-run-instances" will allow you to deploy VM instances of images that have been previously uploaded to the cloud.
For instance, to run an instance of the image with id "emi-53444344", the kernel "eki-34323333", the ramdisk "eri-33344234", and the keypair "mykey", you can use the following command:
euca-run-instances -k mykey --kernel eki-34323333 --ramdisk eri-33344234 emi-53444344

It will take some time to boot the instance; in the meantime, run the euca-describe-instances command to see the status.
Once the status is "running", you can log in and work on the newly launched instance.
For more help, try,
euca-run-instances --help

Login to instance
ssh -i mykey.private root@<accessible-instance-ip>
1- SSH Key Management

1.1 Adding a keypair

euca-add-keypair <keypair-name>

e.g. euca-add-keypair mykey | tee mykey.private

A pair of keys is created: the public key is stored in Eucalyptus, and the private key is stored in the file mykey.private and printed to standard output. The ssh client requires strict permissions on private keys:
chmod 0600 mykey.private

For more options, type


euca-add-keypair --help

1.2 Listing the key pairs


You may use euca-describe-keypairs, which displays a list of the available key pairs:
euca-describe-keypairs

For more options, type


euca-describe-keypairs --help

1.3 Delete a keypair


You may use euca-delete-keypair; for example, euca-delete-keypair testkey deletes the keypair testkey.

euca-delete-keypair <keypair-name>

For more options, type


euca-delete-keypair --help

2- Image Management

In order to run instances from images that you have created (or downloaded), you need to bundle the images with your cloud credentials, upload them, and register them with the cloud.

2.1 Adding images into the cloud


The examples here assume that you have sourced the eucarc config file obtained when you
downloaded user credentials.
Add the kernel to Walrus, and register it with Eucalyptus (WARNING: your bucket names must
not end with a slash!)
euca-bundle-image -i <kernel file> --kernel true
euca-upload-bundle -b <kernel bucket> -m /tmp/<kernel file>.manifest.xml
euca-register <kernel-bucket>/<kernel file>.manifest.xml

Next, add the root file system image to Walrus

euca-bundle-image -i <vm image file>


euca-upload-bundle -b <image bucket> -m /tmp/<vm image file>.manifest.xml
euca-register <image bucket>/<vm image file>.manifest.xml
Then, add the ramdisk to Walrus:
euca-bundle-image -i <initrd file> --ramdisk true
euca-upload-bundle -b <initrd bucket> -m /tmp/<initrd file>.manifest.xml
euca-register <initrd bucket>/<initrd file>.manifest.xml

Examples
Following is an example using the Ubuntu pre-packaged image that we provide using the included
KVM compatible kernel/ramdisk (a Xen compatible kernel/ramdisk is also included).

tar zxvf euca-ubuntu-9.04-x86_64.tar.gz


euca-bundle-image -i euca-ubuntu-9.04-x86_64/kvm-kernel/vmlinuz-2.6.28-11-generic --kernel
true
euca-upload-bundle -b ubuntu-kernel-bucket -m /tmp/vmlinuz-2.6.28-11-generic.manifest.xml
euca-register ubuntu-kernel-bucket/vmlinuz-2.6.28-11-generic.manifest.xml
(set the printed eki to $EKI)

euca-bundle-image -i euca-ubuntu-9.04-x86_64/kvm-kernel/initrd.img-2.6.28-11-generic --
ramdisk true
euca-upload-bundle -b ubuntu-ramdisk-bucket -m /tmp/initrd.img-2.6.28-11-
generic.manifest.xml
euca-register ubuntu-ramdisk-bucket/initrd.img-2.6.28-11-generic.manifest.xml
(set the printed eri to $ERI)

euca-bundle-image -i euca-ubuntu-9.04-x86_64/ubuntu.9-04.x86-64.img --kernel $EKI --ramdisk
$ERI
euca-upload-bundle -b ubuntu-image-bucket -m /tmp/ubuntu.9-04.x86-64.img.manifest.xml
euca-register ubuntu-image-bucket/ubuntu.9-04.x86-64.img.manifest.xml

2.2 Listing the images


You may use euca-describe-images, which displays a list of the available images:
euca-describe-images

For more options, type


euca-describe-images --help

2.3 Deleting images

In order to delete an image, you must first de-register the image

euca-deregister <emi-XXXXXXXX>

Then, you can remove the files stored in your bucket. Assuming you have sourced your 'eucarc' to
set up EC2 client tools

euca-delete-bundle -a $EC2_ACCESS_KEY -s $EC2_SECRET_KEY --url $S3_URL -b <bucket> -p <file prefix>

If you would like to remove the image and the bucket, add the '--clear' option

euca-delete-bundle -a $EC2_ACCESS_KEY -s $EC2_SECRET_KEY --url $S3_URL -b <bucket> -p <file prefix> --clear

2.4 Downloading an image

Bundled images that have been uploaded may also be downloaded or deleted from the cloud.

For instance, to download the image(s) that have been uploaded to the bucket "image-bucket"
you may use the following command

euca-download-bundle -b image-bucket

For more options, type,

euca-download-bundle --help

3- Virtual Machine (VM) Instance Management

3.1 Running instances


"euca-run-instances" will allow you to deploy VM instances of images that have been previously
uploaded to the cloud.

For instance, to run an instance of the image with id "emi-53444344" with the kernel "eki-
34323333" the ramdisk "eri-33344234" and the keypair "mykey" you can use the following
command,

euca-run-instances -k mykey --kernel eki-34323333 --ramdisk eri-33344234 emi-53444344

To run more than one instance, you may use the "-n" or "--instance-count" option, as shown below.
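For example, to launch three instances of the same image (a sketch reusing the illustrative ids from above):

euca-run-instances -k mykey -n 3 --kernel eki-34323333 --ramdisk eri-33344234 emi-53444344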

For more help, try,


euca-run-instances --help

3.2 Displaying instances currently running


You may use "euca-describe-instances" which will display a list of currently running instances.
euca-describe-instances

To get information about a specific instance, you can use the instance id as an argument to euca-
describe-instances. For example,

euca-describe-instances i-43035890

For more options, type,


euca-describe-instances --help

3.3 Logging into a VM Instance


You can now log into it with the SSH key that you created
ssh -i mykey.private root@<accessible-instance-ip>

3.4 Shutting down or terminating instances


You may shutdown running instances using the "euca-terminate-instances" command. For
example, to terminate an instance "i-34523332"
euca-terminate-instances i-34523332
For more options, type,
euca-terminate-instances --help
4 Block Storage Management

You can create dynamic block volumes, attach volumes to instances, detach volumes, delete volumes, create snapshots from volumes, and create volumes from snapshots with your cloud.
Volumes are raw block devices. You can create a file system on top of an attached volume and
mount the volume inside a VM instance as a block device. You can also create instantaneous
snapshots from volumes and create volumes from snapshots.

4.1 Creating a volume


To create a dynamic block volume, use "euca-create-volume"
For instance, to create a volume that is 1 GB in size in the availability zone "myzone", you may use the following command:
euca-create-volume --size 1 -z myzone

To list availability zones, you may use "euca-describe-availability-zones"

You may also create a volume from an existing snapshot. For example, to create a volume from the snapshot "snap-33453345" in the zone "myzone", try the following command:

euca-create-volume --snapshot snap-33453345 -z myzone

For more options, type,

euca-create-volume --help

4.2 Attaching a volume to an instance

You may attach block volumes to instances using "euca-attach-volume." You will need to specify the local block device name (this will be used inside the instance) and the instance identifier. For instance, to attach the volume "vol-33534456" to the instance "i-99838888" at "/dev/sdb", use the following command:

euca-attach-volume -i i-99838888 -d /dev/sdb vol-33534456

You can attach a volume to only one instance at a given time.
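Once attached, you can create a file system on the volume and mount it from inside the instance, for example (a sketch assuming the device name /dev/sdb used above; ext3 is just an example choice of file system):

mkfs.ext3 /dev/sdb             # create a file system on the raw block device
mkdir -p /mnt/myvolume
mount /dev/sdb /mnt/myvolume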

4.3 Detaching a volume

To detach a previously attached volume, use "euca-detach-volume" For example, to detach the
volume "vol-33534456"
euca-detach-volume vol-33534456
You must detach a volume before terminating an instance or deleting a volume. If you fail to
detach a volume, it may leave the volume in an inconsistent state and you risk losing data.

4.4 Delete a volume

To delete a volume, use "euca-delete-volume." For example, to delete the volume "vol-
33534456" use the following command

euca-delete-volume vol-33534456

You may only delete volumes that are not currently attached to instances.

4.5 Creating a snapshot

You may create an instantaneous snapshot of a volume. A volume could be attached and in use
during a snapshot operation. For example, to create a snapshot of the volume "vol-33534456"
use the following command

euca-create-snapshot vol-33534456

4.6 Deleting a snapshot

To delete a snapshot, use "euca-delete-snapshot". For example, to delete the snapshot snap-33453345, use the following command:

euca-delete-snapshot snap-33453345

5 How to rebuild or rebase the instance?

Log in to the virtual machine that you want to rebuild/rebase for future use. Ensure that euca2ools is installed on the instance, and synchronize its clock with the NTP server running in your private network.

Copy the cloud credentials to the instance and source the eucarc file. Before doing the following, ensure that the euca-describe-availability-zones verbose command runs properly.

5.1 Bundle the instance


euca-bundle-vol -c ${EC2_CERT} -k ${EC2_PRIVATE_KEY} -u ${EC2_USER_ID} --ec2cert
${EUCALYPTUS_CERT} --no-inherit --kernel eki-8D9B16EA --ramdisk eri-87F516CA -d /mnt -p
demo-image -s 2048 -e /mnt,/root/.ssh/authorized_keys

5.2 Upload the image


euca-upload-bundle -b centos_sampark_dependency -m /mnt/sampark-
dependency.manifest.xml

5.3 Register the image


euca-register centos_sampark_dependency/sampark-dependency.manifest.xml
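Once registered, the rebuilt image appears in the image list and can be launched like any other (a sketch; use the emi id that euca-register prints in place of the placeholder):

euca-describe-images
euca-run-instances -k mykey <emi-id>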

Appendix-C
Setting up Hadoop Cluster on a Cluster of Physical/Virtual Machines

Hadoop is an open source framework implementing MapReduce for the distributed processing of large data sets on a cluster built of commodity hardware.

Features in Hadoop
- Programming model: Map/Reduce
- Data handling through the Hadoop Distributed File System (HDFS)
- Scheduling: dynamic task scheduling using a global queue (data locality, rack aware)
- Failure handling by re-execution of failed tasks and duplicate execution of slow tasks
- High-level language support: Pig Latin
- Environment support: Linux clusters and Amazon Elastic MapReduce on EC2
- Intermediate data transfer by file and HTTP
Required Software
1. Java™ 1.6.x, preferably from Sun, must be installed.
2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage
remote Hadoop daemons.

Architecture of MapReduce in Hadoop

The MapReduce engine also has a master/slave architecture, consisting of a single JobTracker as the master and a number of TaskTrackers as the slaves (workers). The JobTracker manages MapReduce jobs over the cluster and is responsible for monitoring jobs and assigning tasks to TaskTrackers. A TaskTracker manages the execution of the map and/or reduce tasks on a single computation node in the cluster.
A) Operating System CentOS-5.5 installation on all the physical machines in the
cluster

B) JAVA installation on all the physical/virtual machines in the cluster

C) Hadoop cluster installation

D) Getting Started with Hadoop

E) Hadoop Distributed File System (HDFS)

For more details


http://hadoop.apache.org

Last Accessed: 27-March-2012 at 8:00 PM

A) Operating System CentOS-5.5 installation on all the physical machines in the
cluster

1- At the boot: prompt, press ENTER for the graphical installer. You can skip the media verification, then accept the defaults, with the following exceptions:

2- Choose Install CentOS and click Next. Select "Remove linux partitions on selected drives and create default layout" and click Next.

3- For network interface configuration, select Edit and manually configure


3.1 IP address
3.2 Netmask
3.3 Hostname
3.4 Gateway
3.5 DNS

For example, suppose you have 3 physical machines for the Hadoop cluster setup: select one machine as the master node and the other two as slave nodes.

For master node Configuration


IP address : 192.168.1.100
netmask : 255.255.255.0
hostname : master.in
gateway : leave it blank
DNS : leave it blank

For slave1 node Configuration


IP address : 192.168.1.101
netmask : 255.255.255.0
hostname : slave1.in
gateway : leave it blank
DNS : leave it blank

For slave2 node Configuration


IP address : 192.168.1.102
netmask : 255.255.255.0
hostname : slave2.in
gateway : leave it blank
DNS : leave it blank

4- Package options
- Select Desktop - Gnome

- Select Server
- Select Customize later
- Click next

5- Choose region -> Asia/Kolkata


- Deselect "System Clock uses UTC"

After the OS installation completes, the machine will reboot and ask for the post-installation configuration.

6- For firewall configuration -> Select disabled

7- For SE Linux configuration -> Select disabled

8- For Date and Time Setting -> Change date and time if differ with actual date and time.

9- For User Creation -> add a user named hadoop with password hadoop

Note: In case of virtual machines, installation of the OS is not needed if the virtual machines
already have an OS.

B) JAVA installation on all the physical/virtual machines in the cluster
Do the steps below on all the physical/virtual machines on which you want to set up the Hadoop cluster.
1- Log in as the root user.

2- Copy the hadoopstartusb directory from the media (CD/DVD/USB) to /root/ in your machine.

3- Delete the default java link from /usr/bin/


rm /usr/bin/java
Note: hadoopstartusb contains hadoop-0.20.2.tar.gz and jdk1.6.0_27.tar.gz

4- Install java jdk1.6.0_27 in /usr/local/ by using the below command


tar -xvzf /root/hadoopstartusb/jdk1.6.0_27.tar.gz -C /usr/local

5- Add the JAVA_HOME environment variable to the /etc/profile file, export it, and then add it to PATH. Open /etc/profile, add the 3 lines below at the end of the file, then save and exit.
JAVA_HOME=/usr/local/jdk1.6.0_27
export JAVA_HOME
PATH=/usr/local/jdk1.6.0_27/bin:$PATH

6- Source the file /etc/profile to make the changes effective.


source /etc/profile
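You can verify that the variables took effect:

echo $JAVA_HOME   # should print /usr/local/jdk1.6.0_27
java -version     # should report version 1.6.0_27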

7- Edit the file /etc/hosts. By default it contains some lines; remove them all, add the 3 lines below, then save and exit.
192.168.1.100 master.in master
192.168.1.101 slave1.in slave1
192.168.1.102 slave2.in slave2

8- Restart the network to make the changes effective.


/etc/init.d/network restart

9- Create the user hadoop and set its password to hadoop for the Hadoop framework installation, if this was not done during the post-installation step of the CentOS install.
adduser hadoop
passwd hadoop

10- Now copy the file /root/hadoopstartusb/hadoop-0.20.2.tar.gz to the hadoop user on the master machine:
scp /root/hadoopstartusb/hadoop-0.20.2.tar.gz hadoop@<master-ip>:
For example
scp /root/hadoopstartusb/hadoop-0.20.2.tar.gz hadoop@192.168.1.100:
11- Now reboot the machine using the reboot command.
C) Hadoop cluster installation

Installing a Hadoop cluster involves unpacking the software on all the machines in the cluster and changing the configuration files as per your requirements. Typically, one machine in the cluster is designated as the master. The rest of the machines in the cluster act as slaves.

Do the following steps in order on the master machine. We have indicated whenever a step must be run on the slaves.
1) Log in as the hadoop user.

2) Set up passwordless connectivity from the master to all the slaves.

2a) Generate a public key on the master and copy it to all the slaves. It is used for passwordless connectivity from the master to the slaves.

ssh-keygen -t rsa
The above command asks for the name of the file in which to save the key; press ENTER for the default. It then asks twice for a passphrase; press ENTER again for no passphrase.

scp $HOME/.ssh/id_rsa.pub hadoop@<slave-ip>:


Repeat the above command for each slave, where <slave-ip> is the IP address of the slave.

Do this step only on the slave machines.


2b) Copy the content of the public key into the file $HOME/.ssh/authorized_keys and then set its permissions:

mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >>$HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
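From the master, you can verify passwordless connectivity to each slave (it should log in and print the hostname without prompting for a password), e.g.:

ssh hadoop@slave1.in hostname
ssh hadoop@slave2.in hostname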

3) Ensure that Java 1.6 or a higher version is installed:

java -version
echo $JAVA_HOME

4) Untar the file hadoop-0.20.2.tar.gz into the hadoop user's home directory.


tar -xvzf $HOME/hadoop-0.20.2.tar.gz -C $HOME

5) Open the file $HOME/hadoop-0.20.2/conf/hadoop-env.sh, uncomment line no. 9, and change the JAVA_HOME variable path to /usr/local/jdk1.6.0_27:
#export JAVA_HOME=</usr/lib/j2sdk1.5-sun>
For example
export JAVA_HOME=/usr/local/jdk1.6.0_27
6) Open the file $HOME/hadoop-0.20.2/conf/masters. By default it shows localhost; remove it and add the master machine's domain name:
master.in

7) Open the file $HOME/hadoop-0.20.2/conf/slaves. By default it shows localhost; remove it and add all the slaves' domain names:
slave1.in
slave2.in

8) Open the file $HOME/hadoop-0.20.2/conf/core-site.xml and add the lines below between the <configuration> tags:

<property>
<name>fs.default.name</name>
<value>hdfs://master.in:54310</value>
<description>The name and URI of the default FS.</description>
</property>

9) Open the file $HOME/hadoop-0.20.2/conf/mapred-site.xml and add the lines below between the <configuration> tags:

<property>
<name>mapred.job.tracker</name>
<value>master.in:54311</value>
<description>Map Reduce jobtracker</description>
</property>

10) Open the file $HOME/hadoop-0.20.2/conf/hdfs-site.xml and add the lines below between the <configuration> tags:

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication</description>
</property>
Note: the replication value should equal the number of data nodes; if there are two slaves then it is <value>2</value>.

11) Copy the modified hadoop-0.20.2 directory to the hadoop user's $HOME directory on each slave.
rsync -r $HOME/hadoop-0.20.2 hadoop@<node-ip>:
For example
rsync -r $HOME/hadoop-0.20.2 hadoop@192.168.1.101:
rsync -r $HOME/hadoop-0.20.2 hadoop@192.168.1.102:
Note: we presume the JAVA_HOME path is the same on each slave.

12) Change directory to hadoop-0.20.2


cd $HOME/hadoop-0.20.2

13) Format the name node


./bin/hadoop namenode -format

14) Start the Namenode and Datanodes


./bin/start-dfs.sh

This should start the namenode on the master and the datanodes on the slaves.

15a) Run the jps command and check its output.


jps

The output of the above command should look like:


28615 NameNode
30570 Jps
28870 SecondaryNameNode

Do this step only on the slave machines.


15b) Run the jps command and check its output.
jps
The output of the above command should look like:
28610 DataNode
30570 Jps

Tip: if the jps output does NOT match the above on the master and slaves, manually kill the process ids of the java programs with the kill -9 <process-id> command, clean the logs created under /tmp/hadoop-hadoop*, and repeat from step 13. A sketch of the cleanup follows.
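On an affected machine (the process ids are whatever jps reported):

jps                          # note the pids of the stray java processes
kill -9 <process-id>         # repeat for each stray pid
rm -rf /tmp/hadoop-hadoop*   # remove the stale logs and data
Then repeat from step 13.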

16) Start the JobTracker and TaskTrackers


./bin/start-mapred.sh

This should start the JobTracker on the master and the TaskTrackers on the slaves.

16a) Run the jps command and check its output.


jps

The output of the above command should look like:


28982 JobTracker
28615 NameNode
30570 Jps
28870 SecondaryNameNode

Do this step only on the slave machines.


16b) Run the jps command and check its output.
jps

The output of the above command should look like:


28610 DataNode
29109 TaskTracker
30570 Jps

Now the Hadoop cluster setup is ready for use.

17) Stop the whole Hadoop Cluster


./bin/stop-all.sh

18) Start the whole Hadoop cluster


./bin/start-all.sh

D) Getting Started with Hadoop

Do the following steps on the master machine to check that your Hadoop cluster is working properly.

1- Log in as the hadoop user.

2- Change directory to hadoop-0.20.2


cd $HOME/hadoop-0.20.2

3- Create a directory in HDFS


./bin/hadoop fs -mkdir input

4- List the HDFS files


./bin/hadoop fs -ls

5- Copy the local input data files to HDFS


./bin/hadoop fs -put <input-file-name.txt> input

where <input-file-name.txt> is the name of a file containing some data.
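For example, you could create a small test file and upload it (sample.txt is a hypothetical file name):

echo "hello hadoop hello world" > sample.txt
./bin/hadoop fs -put sample.txt input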

6- Run a wordcount example already provided in Hadoop Framework


./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output

Once the above command has run successfully, you can analyze the job at the following URL:

http://<mastermachineip>:50030

For example
http://192.168.1.100:50030

7- See the output on the terminal


./bin/hadoop fs -cat output/part-*

