Engineering Machine Translation for Deployment on Cloud
by
Rashid Ahmad
200807014
rashid.ahmed@research.iiit.ac.in
CERTIFICATE
It is certified that the work contained in this thesis, titled Engineering Machine
Translation for Deployment on Cloud by Rashid Ahmad (200807014) submitted in
partial fulfillment for the award of the degree of Master of Science (by Research) in
Computer Science & Engineering, has been carried out under my supervision and has not
been submitted elsewhere for a degree.
I would like to thank my supervisor, Prof. Rajeev Sangal, and Dr. Mukul Sinha,
without whom this thesis would not have been possible. Dr. Mukul Sinha is a
Director of my company, Expert Software Consultant Ltd., New Delhi, and was a
visiting Professor at IIITH for spring 2011-12. I admire Prof. Sangal as a good
adviser, a leader, and a great human being; he is the one who taught us how to give
a wonderful presentation. They always inspired me and had faith in my ability to
rise to the occasion and deliver my best work. I am grateful to the TDIL (Technology
Development for Indian Languages) group, Dept. of Information Technology, Govt.
of India, for conceptualizing such a challenging Indian Language to Indian Language
Machine Translation (IL-ILMT) Project, and to the whole consortium for the valuable
meetings and workshops which taught me many professional things. I learnt a lot
in terms of research, punctuality, and personal growth from Dr. Ramanan, the
PRSG chairman of the ILMT Project.
My special thanks to Mr. Pawan Kumar (Sirji) for reviewing my thesis a number of
times, discussing its various viewpoints, and analyzing and commenting on it and
beyond it. Your motivation and support during the period of this thesis were
extraordinary. Your witty humor and scolding helped me keep myself happy and
encouraged.
I thank the LTRC staff, who made my research journey at LTRC most comfortable:
Mr. Srinivas Ji for a wonderful lab, Mr. Rambabu Ji for administrative issues,
Mr. Satish for accommodation and tickets, Mr. Kumara Swamy for the most important
work of invoice clearance, and Mr. Lakshmi Narayan for general and other issues.
Thanks to Appaji sir and Kishore sir for making my life easier at IIIT. I would like
to thank all the members of Expert for providing their valuable suggestions
throughout the thesis: Mr. Arun Kumar, B. Rambabu, Kumar Avinash Singh,
Phani Sajja, Sanket Kumar Pathak, and all the others. Mr. Sanket Kumar (Pathak Ji)
needs a special mention for being so nice to me and supporting me in all
circumstances.
My special thanks to Mr. Vinod Singh ji from IBM Bangalore for reviewing my
thesis, especially the distributed part of my research (MapReduce, Virtual Machines,
and Cloud).
Finally, and most importantly, I would like to thank my parents, brothers, wife,
and loving daughter Maryam (Maria), aged one and a half years, for their
unconditional love and support throughout the thesis (I cannot express my
thanks in words here).
Abstract
Further, a tool called Dashboard has been developed, based on a pipelined
blackboard architecture, for the integration and testing of Natural Language
Processing (NLP) applications. The Dashboard helps in testing a module in
isolation, as well as in the integration and testing of complete integrated systems.
Functional module developers can avail of and configure, if they so desire, the
dynamic cache facility offered by the tool. It is also equipped with a user-friendly
visualization tool to build, test, and integrate a system (or a subsystem), and to
view its component-wise performance as well as its step-wise processing. The caching
facility improves the performance of modules by up to 20%.
3.3 Dashboard and Sampark MT Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 Specificity of NLP/AI Applications . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.2 The Blackboard Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Dashboard and its Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.1 Requirement Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5.2 Dashboard: An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.3 Implementation Details of Dashboard . . . . . . . . . . . . . . . . . . . . . 27
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.7 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4 Caching for MT Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Problem Domain Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Middleware and Dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.1 Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.2 AI Applications and Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.3 Dashboard as Middleware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Caching Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4.1 Cache and NLP Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4.2 Cache in Dashboard Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Enhancing Throughput of MT System using MapReduce . . . . . . . . . . . . . . . . 39
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 MapReduce Framework: Strengths and Limitations . . . . . . . . . . . . . . . 40
5.2.1 MapReduce Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.2 Adaptations of MapReduce Framework . . . . . . . . . . . . . . . . . . . . 41
5.2.3 Statistical MT & MapReduce Framework . . . . . . . . . . . . . . . . . . . 41
5.2.4 Limitations of MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Transfer Based MT System and MapReduce Framework . . . . . . . . . . . 42
5.3.1 Finer Granularity of Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3.2 MT as a List Homomorphism & Compute-intensive . . . . . . . . . . 42
5.3.3 MT as a Dedicated Web Application . . . . . . . . . . . . . . . . . . . . . . . 42
5.3.4 Our Approach to run MT under Hadoop . . . . . . . . . . . . . . . . . . . 42
5.4 Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6 MT on the Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 Background: Sampark MT System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.4 Deploying MT System on the Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.5 Our Approach to run MT under Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.6 Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8 Appendix A: Software Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
9 Appendix B: Eucalyptus Installation and User Guide . . . . . . . . . . . . . . . . . . . 74
10 Appendix C: Hadoop Installation and User Guide . . . . . . . . . . . . . . . . . . . . . 88
11 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
List of Figures
List of Tables
5.2 Running Time for Job (input data set of 100 sentences) . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Running Time for Job (input data set of 200 sentences) . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4 Running Time for Job (input data set of 500 sentences) . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5 Running Time for Job (input data set of 1000 sentences) . . . . . . . . . . . . . . . . . . . . . . . . . 45
Chapter 1
Introduction
In this chapter, we present an introduction to our work on engineering machine translation
(MT) systems. We introduce our background work in Section 1.1. We then present the problem
statement in Section 1.2. Section 1.3 summarizes the main contributions of this thesis. Section
1.4 describes the organization of the thesis.
The development of a natural language processing (NLP) application, such as Sampark, is usually
a team effort of NLP experts (who design and tune the software architecture of the application),
computational linguists (who take the decisions related to language resources, coverage,
rules, corpora, etc.), and software engineers (who implement the various algorithms). Unlike in
the generic software development process [1], the knowledge cannot be handed over to software
vendors, so the continuous involvement of NLP experts and computational linguists in the
development process is imperative. In addition, the success of an NLP application depends solely
on them, as it is essentially their design and vision. The distinguishing features of NLP
applications [2] are listed below:
1 The consortium institutions are: IIIT Hyderabad, University of Hyderabad, CDAC (Noida, Pune), Anna University KBC
Chennai, IIT Bombay, Jadavpur University, IIT Kharagpur, Tamil University, IIIT Allahabad, and IISc Bangalore.
2 Sampark is an MT system among Indian languages. It can be accessed at http://sampark.org.in
No Concept of Correct Input or Correct Output: An NLP application has no strictly correct
input or correct output. A partial or grammatically wrong sentence therefore needs to be
handled by an MT system; similarly, the output need not be grammatical.
Absence of a Single Right Answer: There is no concept of a single correct translation as the
output; many different translations may be acceptable. Whether an output is acceptable is
validated for accuracy by human evaluators, who give it a score for its comprehensibility.
Architectural Resilience and Robustness: Some modules of an application may have bugs and get
stuck in an infinite loop. This requires a facility for recovering from such situations, so that
the modules that follow continue to work; a degraded output is better than no output.
Module Composition: Most core NLP modules have an engine part and a corpora (or rules) part.
While the former is language independent, the latter is language dependent or
language-pair dependent (E-L separation). An engine can thus be made applicable to another
natural language by incorporating the corpora (rules) of the new language. For modules with E-L
separation, the development of the engine part is a generic horizontal task done by one
institution, while the language-specific corpora/rules are a vertical task to be done by the
institutions responsible for that language.
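As an illustration of E-L separation, the sketch below (in Python, with invented rule tables; the actual Sampark engines are written in C, Perl, and Java and load their resources from corpora/rule files) shows a single language-independent engine whose behaviour is driven entirely by pluggable language resources:

```python
# Hypothetical rule tables standing in for language-specific resources.
HINDI_RULES = {"plural_suffix": "on"}
TELUGU_RULES = {"plural_suffix": "lu"}

def pluralize(root, rules):
    """Language-independent engine: all language knowledge lives in `rules`."""
    return root + rules["plural_suffix"]

# The same engine serves two languages by swapping only the resource part.
hindi_form = pluralize("ladk", HINDI_RULES)
telugu_form = pluralize("pustaka", TELUGU_RULES)
```

The engine never changes; porting to a new language means supplying a new rules table.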
Against this background, the consortium visualized the need for the participation of a
software development company, called the Software Engineering Group (SEG), to
supplement its work with professional software engineering practices [3], working with the
consortium not as a vendor but in a symbiotic mode [4]: facilitating the engineering of the
Sampark systems at each step, deploying the Sampark systems on the web, and maintaining
them thereafter.
Each participating institution is assigned the responsibility of providing all language-specific
corpora and rules for the various modules.
For the last two decades, the Government of India has funded all the participating institutions
to develop NLP components and language resources for the majority of Indic languages. It was
proposed that all these available components, though written in different programming languages,
be re-used and re-engineered to obtain robust systems with high accuracy, in less time and at
lower cost.
As all the chosen Indic languages have evolved from the classical language Sanskrit, the MT
systems can be built on Panini's framework [5]. Correspondingly, the consortium developed a
common software architectural framework, re-used for all eighteen language pairs.
The mandate of the ILMT Project is that every Sampark MT system should be a field-deployable
and maintainable product.
In addition to the above requirements, researchers [2, 6] have highlighted that for any generic
NLP application, some distinctive software engineering issues need to be tackled as well.
Therefore, a symbiotic software engineering approach was adopted for the development of the
proposed Sampark MT system.
1.3.1 Approach
Cache Mechanism
We use a cache mechanism to improve the performance of NLP modules. With this mechanism,
we achieve a 15-20% improvement in speed. This approach is discussed in detail in Chapter 4.
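A minimal sketch of such a cache, assuming module outputs are keyed on a digest of the input text (the class and method names here are illustrative, not the Dashboard API):

```python
import hashlib

class ModuleCache:
    """In-memory cache keyed on (module name, digest of input text)."""

    def __init__(self):
        self._store = {}

    def _key(self, module_name, text):
        return (module_name, hashlib.sha1(text.encode("utf-8")).hexdigest())

    def run(self, module_name, process, text):
        """Return the cached output for `text`, or compute and cache it."""
        key = self._key(module_name, text)
        if key not in self._store:
            self._store[key] = process(text)
        return self._store[key]

# Toy "morph analyzer" standing in for a real pipeline module.
cache = ModuleCache()
analyze = lambda s: s.lower().split()

out1 = cache.run("morph", analyze, "Sampark MT System")
out2 = cache.run("morph", analyze, "Sampark MT System")  # served from cache
```

Repeated inputs (common in batch translation of documents) skip the expensive module call entirely, which is where the speedup comes from.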
Hadoop MapReduce Framework
We use a MapReduce approach to run the MT system under the Hadoop framework to improve
the throughput of the system. With this approach, the system's throughput increases in
proportion to the capacity added, provided computing resources are available. There is a time
minimum for a given granularity of the task. This approach is discussed in detail in Chapter 5.
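The idea can be sketched in a Hadoop-Streaming-like style, where the mapper translates (line number, sentence) records independently and the reducer restores document order; the `translate` stand-in below is a toy, not the Sampark pipeline:

```python
def mapper(records, translate):
    """Emit (line_number, translation) for each (line_number, sentence)."""
    for line_no, sentence in records:
        yield line_no, translate(sentence)

def reducer(pairs):
    """Sort by line number so the output keeps the document's sentence order."""
    return [translation for _, translation in sorted(pairs)]

# Toy "translation": reverse the word order (stand-in for the MT pipeline).
translate = lambda s: " ".join(reversed(s.split()))

records = list(enumerate(["a b c", "d e", "f g h i"]))
translated = reducer(mapper(records, translate))
```

Because each record is independent, adding mappers scales throughput until the per-record (granularity) overhead dominates, which is the "time minimum" noted above.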
Virtual Appliance
We have a mechanism for building a virtual appliance of the MT system for deployment on the
cloud. The MT virtual appliance can also be deployed on virtual machines in the cloud, with a
significant reduction in deployment time: deployment came down from hours to a few minutes.
This approach is discussed in detail in Chapter 6.
1.3.2 Contribution
Second, we developed a software integration and deployment platform for NLP applications,
called Dashboard, which has been enhanced to provide functional module developers with
mechanisms to avail of and configure, if they so desire, the dynamic cache facility offered by
the system integration platform, improving the performance of their modules.
Finally, we built the MT system as a virtual appliance for deployment on a standalone virtual
machine or on virtual machines available in the cloud.
1.4 Organization of the Thesis
This thesis is organized as follows:
In Chapter 3, we present a tool called Dashboard, based on a pipelined blackboard architecture,
for the integration and testing of Natural Language Processing (NLP) applications. The
Dashboard helps in testing a module in isolation, as well as in the integration and testing of
complete integrated systems.
Chapter 2
Engineering Machine Translation Systems
In this chapter, we present the symbiotic approach to engineering computationally intensive
applications such as MT systems. We introduce the Sampark MT system in Section 2.1. We then
review prior work in Section 2.2. Section 2.3 describes the mandate of the Sampark systems,
Section 2.4 the symbiotic engineering approach, and Section 2.5 the practical measures taken;
Section 2.6 presents our experience with the symbiotic approach.
2.1 Introduction
For the Sampark system, the proposed system architecture [7] is a pipeline architecture
comprising fourteen major modules and some optional language-specific modules. Sampark is a
hybrid system, combining traditional rule-based algorithms and dictionaries with newer
statistical machine-learning techniques. It consists of three major parts and 12 modules
arranged in a pipeline.
B) Transfer
Syntax transfer: Converts the parse structure in the source language to the structure in the target
language that gives the correct word order, as well as a change in structure, if any.
Lexical transfer: Root words identified by the morphological analyzer are looked up in a bilingual
dictionary for the target language equivalent.
Transliteration: Allows a source word to be rendered in the script of the target language. It is
useful in cases where translation fails for a word or a chunk.
C) Target Generation
Agreement: Performs gender-number-person agreement between related words in the target
sentence.
Insertion of Vibhakti: Adds postpositions and other markers that indicate the meanings of words
in the sentence.
Word generator: Takes root words and their associated grammatical features, generates the
appropriate suffixes and concatenates them. It combines the generated words into a sentence.
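A toy sketch of such a word generator, with an invented suffix table (real Sampark generators are driven by language-specific paradigm tables, not this hard-coded dictionary):

```python
# Hypothetical suffix table: (gender, number) -> suffix.
SUFFIXES = {
    ("m", "sg"): "a",
    ("m", "pl"): "e",
    ("f", "sg"): "i",
    ("f", "pl"): "iyan",
}

def generate_word(root, gender, number):
    """Attach the suffix selected by the word's grammatical features."""
    return root + SUFFIXES[(gender, number)]

def generate_sentence(words):
    """words: list of (root, gender, number) triples from the pipeline."""
    return " ".join(generate_word(*w) for w in words)
```

Each triple carries the root and features produced by earlier modules; the generator is a pure table lookup plus concatenation.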
To start with, the consortium had multiple copies of most of the modules of the ILMT Project,
developed by different institutions, with no metrics to choose one over another. All modules,
having been developed as research and development efforts, have only skeletal manuals.
Further, many of the modules that ideally should have been structured with engine-language
(E-L) separation were not, as they were developed for a specific-language MT system. Hence,
re-engineering is required to disentangle the engine portion from the language portions.
As NLP applications are very complex, with a large number of modules, it is imperative to
develop the complete system under a software development infrastructure [8, 9] that facilitates
the software development paradigm being applied. It is assumed that most of the available
modules are neither comprehensively tested, nor maintainable, nor appropriately packaged. The
consortium decided that the available version of a module would be taken as the initial version.
Each module would be engineered step-wise, by the development team of the module owner
itself, but in a symbiotic relationship with the software engineering group (SEG) associated
with the project, to make it a field-deployable and maintainable product.
2.3.1 Field Deployable and Maintainable
This means that the software product has strictly followed the standard software engineering
process during all development phases [3], with the final software product having the following
components: System Requirement Specification (SRS), Software Design Document (SDD),
complete source code (following standard coding practice), block-level documentation and
in-line comments, build and configure procedures for each module, validation suites for each
module, the integrated product, a distribution package of the product, and the user manual and
installation manual of the software product.
A well-engineered software product assures quality, and its modules, appropriately abstracted
and well encapsulated, assure re-usability.
A software product is maintainable if it is correctable, adaptable, perfectable, and sustainable
[11]. A software product is correctable and adaptable if it is well engineered, but it is
sustainable only when it can be perfected without any structural change to the software
architecture.
For the Sampark system, its pre-specified architecture assures its sustainability. Perfective
maintenance of the Sampark system requires continuous performance improvement, not by
changing code, but by improving and tuning language-specific (or language-pair-specific) data
and rules, and domain-specific data/corpora; engine-language separation thus becomes an
imperative requirement for the maintainability of the Sampark system.
2.4 Symbiotic Software Engineering Approach
At the initial stage of the Sampark system, a good number of modules were available for reuse,
yet none of them qualified as components [12]. Since the engineering activities were to be done
by the same institution that developed the earlier version, with the continuity of the main
researcher, no reverse engineering [13] task was required prior to engineering any module.
Therefore, we (the SEG) worked mainly to facilitate these institutions, in symbiotic mode, in
engineering their horizontal (engine/common program) and vertical (language-specific) task
components.
First, we collated all the candidate modules and language resources, along with their
available documentation, and did an initial code walkthrough together with the team of the
module/resource owner. This was followed by a series of workshops, both on software engineering
and on NLP issues, with the participation of all relevant researchers. It was strongly felt that
a software development infrastructure tool was essential for the consortium members, so that
their independently developed heterogeneous modules could communicate with each other,
mainly to build the final system, and so that they could follow a common software engineering
approach.
Thus, we conceptualized and proposed the symbiotic software engineering paradigm for
engineering NLP/AI systems.
We gathered the input and output specifications of each module with the help of NLP experts
and computational linguists. To validate that a module works on our system the same way it does
on the developer's home machine, we ran the input/output test data supplied by the developer.
We then identified the program code relating to functional requirements, non-functional
requirements, and error/exception handling. We arrived at the following symbiotic process for
carrying out the engineering task on each module with its actual developers:
Figure 2.2 Symbiotic Approach Process
1. Enhance Usability: To enhance usability, a module must be portable to any other raw machine,
with command-line parsing using getopt(). In addition, it must possess file-level input/output
interoperability by having a Unix-filter-like interface.
The module owner must read the source code, mainly to identify the file/data access
points, the hard-coded data access points, inappropriate string sizes, any magic numbers, etc.
The code has to be modified to replace the hard-coded data with configurable parameters read
from the module's configuration file. After completion of this task, a module has the following
attributes: no hard-coded data, no magic numbers, command-line parsing using getopt(), and a
Unix-filter-like interface for programs.
To audit enhanced usability, we would run the module on a raw platform first. It may run,
or it may fail with errors such as a segmentation fault; we would inform the owner of any
errors, along with suggestions.
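The usability attributes above can be sketched as follows (in Python for brevity; the actual modules are in C, Perl, and Java, and the option names here are illustrative): command-line parsing via getopt() and a Unix-filter-style stdin-to-stdout loop, with hard-coded values replaced by overridable settings.

```python
import getopt
import sys

def parse_args(argv):
    """Command-line parsing with getopt, as the usability task prescribes.
    Defaults stand in for values read from the module's configuration file."""
    opts, _ = getopt.getopt(argv, "c:l:", ["config=", "lang="])
    settings = {"config": "module.conf", "lang": "hin"}
    for flag, value in opts:
        if flag in ("-c", "--config"):
            settings["config"] = value
        elif flag in ("-l", "--lang"):
            settings["lang"] = value
    return settings

def run_filter(process, stdin=sys.stdin, stdout=sys.stdout):
    """Unix-filter interface: read stdin line by line, write results to stdout."""
    for line in stdin:
        stdout.write(process(line.rstrip("\n")) + "\n")
```

A module structured this way can be dropped into any shell pipeline (`cat in.txt | module -c hindi.conf > out.txt`) on a raw machine.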
2. Improve Documentation: To improve documentation, for modules written in C and Perl, we
requested the developers to provide a first-level DFD and flow chart, which we then refined.
For modules in Java, the developers were requested to provide class diagrams and sequence
diagrams.
To audit, we read the source code of each module, mainly the control flow, the data
variables (global/local), and the data input and output; we deliberately did not try to
understand the program logic. At the completion of this task, the following artifacts are
produced (for modules written in C and Perl): a limited data flow diagram (DFD), a control flow
diagram, extracted clusters of data items showing data flow levels and module structure, and an
improved SRS, version 0.1.
These documents are additionally authenticated by the NLP experts/computational linguists
of the institutions. After these documents are made available, the software engineering
attribute of the module is: module-level analyzability.
3. Improve Robustness and Bug Traceability: As part of this engineering task, developers are
expected to cover all errors and exceptions in their modules to improve robustness. In
addition, they are requested to write log statements at various points in each module to enhance
bug traceability. After completing this task, the module has error and exception handling
statements covering all possible cases, and log statements inserted at various points for bug
traceability.
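A minimal sketch of this robustness pattern, assuming a pipeline runner (the names are invented) that logs each step and, on failure, logs the exception and passes the input through unchanged so downstream modules keep working, in line with the "degraded output is better than no output" principle:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sampark.module")

def run_module(name, process, text):
    """Run one pipeline module; on failure, log the traceback and pass the
    input through so the modules that follow continue to work."""
    try:
        log.info("%s: processing %d chars", name, len(text))
        return process(text)
    except Exception:
        log.exception("%s failed; passing input through", name)
        return text

ok = run_module("pos-tagger", str.upper, "chota sa vakya")
degraded = run_module("chunker", lambda s: s[10_000_000], "chota sa vakya")
```

The second call raises an IndexError internally; the log records where it happened (bug traceability) while the pipeline still produces output.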
4. Engine and Language (E-L) Separation: For this engineering task, the developers of such
modules have to disentangle the engine portion of the code from the language resources/rules.
For a module, the E-L separation is complete only when the engine can be tested independently
with at least two different language resources. After completion of this re-engineering, we have
the following artifact: language-dependent modules with E-L separation. The software engineering
attribute after this stage is: E-L separation for language-dependent modules.
5. Module-Level Interoperability: The module developer has to ensure that their module complies
with the SSF interface, i.e., all modules must accept SSF as input and produce SSF as output. In
cases where the SSF data representation model does not suit a module's computation model, the
module may run on an internal data representation suitable to it; to provide interoperability of
such modules within the pipelined blackboard architecture, we provide SSF wrappers, making them
SSF-interface compliant. After completion of this task, all modules are provided with a Makefile
for build and configure. Once a module reaches this stage, it is ready as a component for
independent testing and validation.
Consequently, modules are validated by us against module-specific validation data, and
validation reports are provided. After successful validation, we have the following artifacts:
SSF-interface-compliant modules, modules with Makefiles for build and configure, module
validation data and validation reports, and an improved SRS, version 0.2. The software
engineering attributes after this stage are: a common representation for inter-module data
exchange, and a Makefile for build and configure.
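The wrapper idea can be sketched as follows, with SSF reduced to bare (address, token, tag) rows for illustration (real SSF also carries chunk structure and feature structures):

```python
def ssf_to_internal(rows):
    """Unwrap SSF rows into the plain token list a legacy module expects."""
    return [token for _, token, _ in rows]

def internal_to_ssf(tokens, tags):
    """Re-wrap the module's output as SSF rows, restoring the addresses."""
    return [(i + 1, tok, tag) for i, (tok, tag) in enumerate(zip(tokens, tags))]

def ssf_wrapper(module, rows):
    """Make a non-SSF module SSF-compliant: SSF in, SSF out."""
    tokens = ssf_to_internal(rows)
    tags = module(tokens)
    return internal_to_ssf(tokens, tags)

# Toy tagger: proper noun if capitalized, otherwise unknown.
tagger = lambda toks: ["NNP" if t[0].isupper() else "UNK" for t in toks]
rows = [(1, "Sampark", "UNK"), (2, "translates", "UNK")]
tagged = ssf_wrapper(tagger, rows)
```

The wrapped module needs no knowledge of SSF; the wrapper alone satisfies the pipeline's interface contract.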
6. Program Restructuring: After a module has become an independent component, its source
code can be restructured to make its embedded program structure and DFD explicit. This results
in a module structured as multiple internal sub-modules/subroutines. Block-level documentation
and in-line comments are added. The group also ensures that the module is maintainable with
respect to the adaptability requirement. The control flow diagram and DFDs of the restructured
module are developed from the earlier version, and the restructured module is tested against the
module-specific validation tests conducted earlier.
This task can be taken up once the module has completely stabilized and become robust.
After program restructuring, a module becomes a stable, robust, re-usable component that can be
used in the final integrated system. At the end of this stage, we have the following artifacts:
a revised Software Design Document (SDD), restructured and maintainable source code of the
module/sub-modules, module/block-level comments and in-line comments in the program code, and
revised requirement specifications (SRS, version 0.3). After the program restructuring task, the
software engineering attributes covered are: modularity at the subroutine level, block-level
documentation, and in-line comments.
2.4.3 Subsystem/System Integration
Once all modules have completed the engineering tasks, they can be integrated to build the final
system. By integrating the relevant modules, we build a Sampark MT system for the respective
language pair and subject it to system-level validation, generating validation reports. After
this, each Sampark system has system-level validation data and reports.
We then have the first release of the integrated system, which is well engineered and
maintainable. This release is made available to the NLP experts and computational linguists for
evaluation and continuous improvement.
Each release goes through validation and regression tests, and once it crosses the
threshold of comprehensibility, it is packaged as a field-deployable system.
2.4.4 Dashboard Development Infrastructure
There was an imperative need for a tool that facilitates the testing and integration of modules
and provides the input and output of each module (and of the system) to NLP experts and
computational linguists for debugging and improvement. A basic, simple blackboard tool, called
Dashboard, was available with one of the participating institutes [14]. We redesigned the
Dashboard with the features envisaged by the consortium, and it provides additional tools such
as a character encoding and data format conversion tool, an I/O data validation tool, etc.
The Dashboard sets up control flow among modules irrespective of their heterogeneity. It
provides reader/writer primitives on the common in-memory (SSF) data structure for the
programming languages C, Java, and Perl. A module reads from the SSF data structure, converts
the data into its own internal data structure for processing, and, after completion, converts
the data back into the SSF data structure. Chapter 3 describes the Dashboard in detail.
2. Concurrence of Engineering Tasks: Once a module has reached a state where it can be
integrated into the pipeline architecture for intermediate testing, it is released to us. The
engineering tasks of improving documentation and of program restructuring can be done
concurrently. Usually, a new version of a module is released by its owner institution at least
once in six months.
3. Module/System Testing and Validation Data: To facilitate system and module testing and
validation for each language pair of the Sampark systems, the language experts have manually
created 100 gold-standard sentences for each source language, along with their expected
translations in the target language. For the same 100 sentences, the expected output after each
module of the pipeline execution has also been generated manually. This set serves as the basic
testing and validation data.
4. Internal MT Evaluation Teams for Sampark Systems: Once a Sampark system is validated by
us, it is sent for internal evaluation [15] for accuracy measurement. The team tests it for
comprehensibility, and if it is not up to the mark, language-related feedback is given to the
concerned institution for improvement.
The technology section of Communications of the ACM (CACM) [16] has compared these Sampark
systems, based on a transfer-based hybrid approach, with the purely statistical approaches of
Google and Microsoft.
Experience was elicited from all eleven institutions on their participation in consortium mode
to produce the Sampark MT systems by engineering their existing laboratory modules through the
symbiotic approach. A questionnaire was used for the survey, which was conducted only after
almost all modules had gone through the engineering tasks and had been submitted to us for
validation. The experience has been consolidated along the following two dimensions:
2.6.1 Experience of the Module Engineering Tasks: As the module engineering tasks were executed
in symbiotic relation with us, the experiences of the two groups provide different perspectives:
- All institutions appreciated the tasks of enhancing usability and of improving robustness and
bug traceability, as these made their modules reusable, robust, and portable. Such engineering
practices had not been followed earlier.
- Most researchers took time to appreciate E-L separation task as their focus was limited to their
own language pair, and not realizing the re-usability characteristics of their module for other
languages as well. For this task, the role of chief investigators of each participating institutes, and
that of the coordinator was crucial.
- Most developers used module-specific wrappers extensively to integrate their modules for
testing, as they concentrated more on improving the accuracy of their respective Sampark
system than on the documentation and program restructuring tasks.
- The relation between the participating institutions and the software engineering group was
weak to start with, but became better once the Dashboard development infrastructure started
being used extensively. It helped the institutions to continuously improve their modules'
accuracy and performance, and to show their results to a wider audience. In addition, they
appreciated the hand-holding support provided by us in making their modules into components,
and in making each Sampark system web deployable.
- We felt satisfied in making the NLP researchers aware of the software engineering issues of
usability, traceability, robustness, and modularity, but had very limited success in winning them
over to the ideas of validation, readability and re-usability.
2.6.2 Experience of Work in Consortium Mode: It was the first experience for each participant of
working in a consortium. Although the project was in an intense phase of its delivery activities,
the responses indicate the following:
- Human/group inter-operability is essential for a consortium-mode project to succeed. A series
of workshops (on language and software engineering issues) to develop common guidelines, coupled
with weekly conference calls among all participating institutions, helped to smoothen the group
and human inter-operability.
- The availability of a common development environment in Dashboard and of a software
engineering group facilitated the institutions in their individual engineering work. It helped
them to focus on the project deliverables.
- As academicians are involved in multiple activities, those institutions where the chief
investigators were not over-loaded could contribute more.
- It is imperative for us to have sufficient knowledge of NLP issues to develop a symbiotic
relation with the participating NLP teams. It took us around six months to acquire the requisite
domain knowledge in MT systems, and only then did the inter-communication become smooth. Our
visits to most of the participating institutions helped us to develop direct professional bonds
with all the developers.
- The role of the Project Review and Steering Group (PRSG) set up by TDIL was extremely helpful
in keeping the project focused, and it gave the crucial impetus to each of the participating
institutions to maintain the delivery quality as well as the schedule.
Chapter 3
Dashboard Tool for MT
In this chapter, we present an introduction to the Dashboard tool. We introduce the
infrastructural approaches and the objective requirements of the tool in Section 3.2. We then
present the Dashboard and the Sampark MT system in Section 3.3. Section 3.4 presents the
strengths of Dashboard. Section 3.5 describes the implementation of Dashboard, Section 3.6 its
evaluation, and Section 3.7 the discussion and future work.
3.1 Introduction
As natural language processing (NLP) applications are knowledge intensive, complex, and
normally developed by a co-operating team consisting of NLP experts, computational linguists,
software engineers and language engineers, a large application like a machine translation (MT)
system cannot be developed by a third-party software vendor following classical software
development paradigms [3]. For the development of NLP applications, conventional software
development tools are not very suitable, as the application specifications are inherently
imprecise; i.e., the output is not tested against a correct output, but is validated against
criteria specifying correctness. For example, for the overall MT system, we use comprehensibility
and fluency [15].
In addition, an NLP application, unlike a conventional software application, goes through
continuous accuracy improvement, for a considerably long duration, after its release. Therefore,
for the development of NLP applications, it is advisable to develop a software development
infrastructure [8] corresponding to the NLP software development paradigm being applied. The
Dashboard development infrastructure, presented in this chapter, is based on the blackboard
architecture [17], an attractive option for building an NLP system, which facilitates the
integration of a set of heterogeneous modules (i.e., modules written in different programming
languages) collaborating among themselves through a common in-memory data structure, referred to
as the blackboard.
The need for creating a development infrastructure like Dashboard arose when the
Technology Development for Indian Languages (TDIL) Group of the Dept. of Information Technology
(DIT), Govt. of India formed a consortium of eleven academic and research institutions [7] for
developing 18 pairs of Sampark MT systems. These covered nine Indic languages, and the pairs
(viz., Hindi↔{Bangla, Kannada, Marathi, Punjabi, Tamil, Telugu, Urdu}; Tamil↔Malayalam; and
Tamil↔Telugu) were built by re-using/re-engineering most of the NLP components and
language resources that were available with the participating institutions through their prior
work. As the eighteen Sampark systems were to be developed for public use in the limited time
frame of the project, we considered various infrastructural approaches to facilitate the speedy
implementation of these systems.
3.2 Infrastructural Approaches and Objective Requirement
Infrastructural Approaches: To facilitate a development team comprising NLP experts,
computational linguists and software engineers in building large NLP applications, there have
been three infrastructural approaches, viz., (1) frameworks, (2) architectures, and (3)
development environments [12]. Frameworks facilitate component-based development by
providing a common and powerful platform with a number of mechanisms (e.g., ActiveX, Java
Beans, etc.) that can be used, or adapted, by the developers to build their systems. A framework
greatly reduces development as well as maintenance effort, but it presumes that the system is
going to be developed on a common OS platform. An architecture defines a system in terms of its
components and the type of inter-relations among the components (e.g., client-server
architecture, pipe-line architecture, etc.). It is the inter-relationships among the components
that represent the distinguishing power of a specific architecture. Application developers have
realized that a specific architecture is often far more suitable for a family of software
applications, giving rise to the concept of a reference architecture or domain-specific software
architecture [19].
For NLP applications, where heterogeneity of components is an essential characteristic, the
blackboard architecture is widely used [20]. Furthermore, when an implementation of an
infrastructure based on a specific architecture provides additional tools for the
development/validation of a component, for integrating and testing a set of components, and for
building and testing the complete system, it is usually called a development environment [21].
We decided to adopt the blackboard architecture [17] for building the Sampark
systems, as they had to re-use a previously developed set of heterogeneous modules. In the
blackboard architecture, the heterogeneity of the modules does not affect their operation, as all
of them operate on a common in-memory data structure. Further, we restricted the control to a
pre-specified lattice architecture, as all the machine translation systems are being built
following the transfer-based approach, using a representation based on the Paninian framework.
The lattice is currently implemented as a pipeline comprising 14 major modules for each language
pair.
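The control scheme just described, heterogeneous modules executing in a pre-specified order over one shared in-memory structure, can be sketched minimally as follows. This is a hedged illustration only: the dict-based "blackboard" and the two toy modules are assumptions, while the real Sampark pipeline chains 14 modules written in several languages.

```python
# Illustrative sketch (not the actual Dashboard code): modules co-operate
# through one shared in-memory structure, run in a fixed pipeline order.

def morph(blackboard):
    # a real morph analyser would attach morphological analyses per word
    blackboard["analyses"] = [{"word": w} for w in blackboard["text"].split()]
    return blackboard

def pos_tagger(blackboard):
    for analysis in blackboard["analyses"]:
        analysis["pos"] = "NN"   # placeholder tag for illustration
    return blackboard

PIPELINE = [morph, pos_tagger]   # pre-specified (pipe-line) execution order

def translate(text):
    blackboard = {"text": text}  # the common in-memory data structure
    for module in PIPELINE:
        blackboard = module(blackboard)
    return blackboard

print(translate("ghar gaya")["analyses"])
```

Because every module reads from and writes to the same blackboard, modules written in different languages can, in the real system, interoperate through language-specific reader/writer APIs on that structure.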
All the major modules in the system consist of language-independent engines and
language-specific parts. Among the eleven geographically spread academic institutions, each
institution was responsible for one or more specific engines, while the responsibility for each
language rested with a single institution. Thus, the language-specific parts of all the modules
for a specific language were the responsibility of a single institution. Therefore, we needed a
development tool that could be used by each institution independently, at different granularity
levels of a system, i.e., either in re-engineering/testing a module in isolation, or in
integrating and testing a subsystem, or in building and testing the complete system.
Correspondingly, we visualized Dashboard, a common blackboard-based, pipe-lined architectural
framework for building the translation systems, which could utilize all the available modules in
spite of their heterogeneity. The same framework could also be re-used, as much as possible, so
that all eighteen Sampark systems have the same architecture.
The Dashboard, apart from implementing the features of the blackboard architecture, has
many additional features for use as a development infrastructure for a family of NLP
applications, such as machine translation systems, speech-to-speech translation systems,
information extraction systems, etc. It is also equipped with a user-friendly visualization tool
to build, test, validate, and integrate a system (or a subsystem), and to view its step-wise
processing and module-wise time profiling, facilitating the development team in improving the
system's accuracy and (speed) performance respectively.
The first version of the Dashboard development platform was released after one year, and for
the last three years it has been successfully used by each of the participating institutions
independently. Out of the total of eighteen MT systems being developed using Dashboard, the
accuracy of four MT systems, viz., Punjabi→Hindi, Urdu→Hindi, Telugu→Tamil, and
Hindi→Telugu, has improved considerably, and they have been deployed at the website
http://sampark.iiit.ac.in. It is expected that the remaining MT systems will be released at
regular intervals over the next few months.
3.4 Strengths
NLP application software development is usually a team effort where the software architecture
(its selection or tuning) is decided by NLP experts. All the decisions related to language
resources, such as coverage, rules, corpora, etc., are taken by NLP experts or computational
linguists, and the algorithms are implemented by language/software engineers. The continuous
involvement of NLP experts and computational linguists in the development process is essential,
as their knowledge cannot be passed easily to software developers located with vendors. In
addition, the success of an NLP application depends solely upon them, as it is essentially their
visualization and creation, requiring continuous molding by the experts during development, and
also continuous effort to improve accuracy after release.
This subsection describes the salient requirements, emerging from the specificities of NLP
applications [2] (as distinct from conventional software), that must be covered by the proposed
Dashboard development environment. They are:
Heterogeneity of Modules: The complexity of an NLP application is generally high because the
different component modules work at different levels of language, i.e., some work at the
paragraph level, some at the sentence level, some at the chunk level, and some at the word level.
Accordingly, each component module needs a different data representation model for its efficient
implementation. Also, different modules get developed in different programming languages, each
in the language whose data representation model suits its computation model, making the
implementation far more natural, elegant and efficient. Sometimes also for reasons of legacy
software, the development environment must facilitate the development team in incorporating a
heterogeneous set of modules in the system.
Modularity: A set of generic NLP modules has been identified and developed, and sometimes they
are available off-the-shelf from multiple sources. These generic modules are generally re-used
for various applications and domains through proper adaptation and enhancements. This requires
that the development environment views a system as built by the integration of a set of
independent modules.
Transparency at the module interface level: The specifications for most NLP applications are
approximate. Usually there is no concept of a correct input and a single correct output [18]. For
example, in an MT system grammatically wrong sentences are valid input as well as valid output.
Hence, unlike conventional software, the output is not tested against a correct output, but is
validated against criteria specifying a threshold of accuracy, mostly by human evaluators [15].
Therefore, a development environment should provide transparency at the module interface level
(i.e., the ability to view the input and the output of any module), facilitating the development
team in easily isolating the module containing a trouble spot, and independently modifying it to
improve the system.
Module-Level Flexibility: For a conventional software application, after it is developed and
released, the development team comes into the picture if and only if there are some residual
bugs, and not otherwise. For an NLP application, on the other hand, continuous accuracy and
performance improvement is a generic requirement, and the application needs to be continuously
improved by NLP experts and computational linguists. Further, the accuracy improvement is done
not by changing code, but by improving/tuning language-specific (or language-pair-specific) data
and rules, and also domain-specific data/corpora. Hence, the development environment must
provide a mechanism for the easy replacement of a module by its new version, without any
repercussions for the remaining modules of the system.
Time Profiling of Modules: As the complexity of an NLP application is very high, performance is
a major issue. Hence, time profiling of each component module is an imperative need that a
development environment must meet, so that the development team can concentrate on improving
those modules that are time intensive.
Robustness of the System: Some NLP components, many a time, take quite a long time, get into an
indefinite loop, or even fail. Therefore, robustness at the system level becomes another
imperative need for any NLP application, and hence the development environment should provide a
module-level timeout facility to terminate an indefinite loop.
Resiliency against Module Failure: A less precise output is more acceptable than no output.
Therefore, the design of the system should be resilient enough to recover from a module failure
(or its forced termination), and proceed further to give at least a degraded output.
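The timeout-and-degrade behaviour named in the last two requirements could be sketched as below. This is an assumption about how such a facility might look, not the Dashboard implementation; a production version would run each module in a separate process so a hung module can actually be killed, since a thread cannot be forcibly terminated.

```python
# Sketch of module-level timeout with graceful degradation (illustrative).
from concurrent.futures import ThreadPoolExecutor

def run_module(module, data, timeout_sec=2.0):
    """Run one module; on timeout or failure, pass the input through
    unchanged, so the system still yields a (degraded) output."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(module, data)
        try:
            return future.result(timeout=timeout_sec)
        except Exception:        # timeout, crash, or any module error
            return data          # resiliency: degraded output, not no output

def broken_module(data):
    raise RuntimeError("simulated module failure")

def ok_module(data):
    return data + " processed"

print(run_module(broken_module, "input text"))   # input text
print(run_module(ok_module, "input text"))       # input text processed
```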
The project document [7] proposing to develop the eighteen Sampark systems stressed the need
to develop all the systems based on a pipe-lined blackboard architecture. The blackboard in its
classical form [17] can be viewed as a central repository of all the information shared among
multiple heterogeneous problem-solving agents/experts. The name blackboard architecture was
chosen to evoke a metaphor in which a group of experts gathers around a blackboard to
collaboratively solve a complex problem. The information on the blackboard represents
assumptions, facts, events, and deductions made by the system during the course of problem
solving. Each expert continuously watches the information on the blackboard and, at some instant
of time, depending on the information content, if it feels it can contribute to the solution,
tries to write on (and update the information content of) the blackboard. If, at some stage,
multiple experts compete to write on the blackboard, a moderator or facilitator mediates among
the competing experts.
In the case of the Sampark system, the experts are the modules, and the central repository is
represented by the common in-memory data structure. As the Sampark system is built following the
transfer approach under the Paninian framework, the moderator currently limits the execution of
the modules to a pre-assigned order (i.e., a pipe-line architecture) specified by the
development team at the time of configuring the system.
Justification of the Blackboard Architecture for the Sampark MT Systems: The project document
[7] gives objective and technical reasons, which are enumerated briefly below for the
comprehensiveness of this chapter:
Reuse of NLP Components and Language Resources: Since the project requires reusing the
natural language processing (NLP) components and language resources (for most of the nine
Indic languages) written in different programming languages (viz., Java, C, Perl and Python) and
available with the participating institutions, the blackboard infrastructure is most suitable
for building systems from independent heterogeneous modules.
Transfer-Based Approach and Pipe-lined Blackboard Architecture: As all the MT systems are
being built following the transfer-based approach [16], we restricted the control to a
pre-defined pipe-line architecture. The pipe-line architecture also reduces the complexity of
configuring and testing systems under a development environment.
SSF for the In-memory Data Structure: A blackboard architecture is qualified by its in-memory
data structure format. We adopted the Shakti Standard Format (SSF) for the in-memory data
structure, as it is based on a tree structure with bags and associated feature structures [10],
and has a text notation for representing it unambiguously while providing human readability as
well. Human readability of a module's input/output is an imperative need, as it helps
extensively in understanding the working of the module and in improving it.
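The idea of a tree with attached feature structures and a human-readable text dump can be loosely illustrated as below. This is only a sketch of the concept; the node fields and the dump layout are assumptions, not the actual SSF specification, which uses its own column-oriented text notation.

```python
# A loose, illustrative stand-in for an SSF-like in-memory structure:
# nodes carry a token, a tag, and a feature structure, and can be dumped
# as indented human-readable text.

class Node:
    def __init__(self, token, tag, features=None, children=None):
        self.token, self.tag = token, tag
        self.features = features or {}
        self.children = children or []

    def to_text(self, depth=0):
        # indented dump; real SSF uses a tabular, column-oriented notation
        fs = ",".join(f"{k}={v}" for k, v in sorted(self.features.items()))
        line = "\t" * depth + f"{self.token}\t{self.tag}\t<fs {fs}>"
        return "\n".join([line] + [c.to_text(depth + 1) for c in self.children])

chunk = Node("((", "NP", {"head": "ghar"},
             [Node("ghar", "NN", {"num": "sg"})])
print(chunk.to_text())
```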
concretized the requirement specifications of the first version of the proposed Dashboard
Development Environment, which are given in the next subsection.
a. The common in-memory data structure should be in the Shakti Standard Format (SSF) [10],
b. The system being built should be composed of a set of modules, where each module,
independent of the others, operates on the common in-memory data structure,
c. The inter-module data exchange can be either through the common in-memory data
structure, or alternatively through an I/O stream,
d. The set of modules should execute in a pre-specified sequence, i.e., following the
pipe-line architecture,
e. It should provide application program interfaces (APIs), viz., reader/writer primitives
on the common in-memory data structure (SSF), for the programming languages C, Java,
and Perl, as most of the available modules are written in these languages.
a. Display of intermediate I/O at the module/subsystem level, and of final I/O at the
system level, mainly to analyze the accuracy of the output at the
module/subsystem/system level,
b. Step-by-step interactive runs at the module level,
c. A here-and-now debugging tool, to change the intermediate output and run the system
further. It helps to find multiple bugs in a single execution, and it also helps
in integration and system testing,
d. Provision to save a session, i.e., to save the system-level and each module-level
input/output, for post analysis by the development team,
e. Provision for transliterating the input text (in the source language script), or the
output text (in the target language script) of a Sampark system, into either the source
language script or the target language script, facilitating single-script readers in
reading both the input text and the output text.
6. Provision of additional support tools for smooth inter-module interfacing, such as:
a. A character encoding conversion tool, for converting data from Unicode to WX notation
and vice versa,
b. A data format conversion tool, for those modules whose input/output is in a data
format different from SSF. It provides tools to convert into SSF, and from SSF to the
format acceptable to the module (e.g., SSF to TNT, etc.),
c. An I/O data validation tool, to verify whether the input/output data is in the correct
format or not.
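The flavour of the Unicode-to-WX conversion mentioned in item 6a can be sketched as follows. Only an illustrative subset of the mapping is shown, and the handling is deliberately naive: a full converter must cover the complete alphabet, vowel matras, conjuncts and nasalisation, so treat the tables and the schwa rule here as assumptions for illustration.

```python
# Hedged sketch of Devanagari -> WX transliteration for a few letters only.

C2WX = {"क": "k", "ख": "K", "ग": "g", "म": "m", "ल": "l",
        "त": "w", "द": "x", "न": "n", "र": "r", "स": "s"}   # consonants
V2WX = {"अ": "a", "आ": "A", "इ": "i", "ई": "I", "उ": "u"}   # independent vowels
VIRAMA = "\u094d"   # halant: suppresses the inherent 'a'

def to_wx(text):
    out, chars = [], list(text)
    for i, ch in enumerate(chars):
        if ch in C2WX:
            out.append(C2WX[ch])
            nxt = chars[i + 1] if i + 1 < len(chars) else ""
            if nxt != VIRAMA:          # inherent schwa unless halant follows
                out.append("a")
        elif ch in V2WX:
            out.append(V2WX[ch])
        # matras, punctuation, etc. are omitted in this sketch
    return "".join(out)

print(to_wx("कमल"))   # kamala
```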
As of today, Dashboard has been implemented with all the features envisaged above (except the
starred specification, i.e., the module-level timeout). It is being used by all the participating
institutes independently, to integrate and test their modules and subsystems, as well as the
specific Sampark system(s) in which they are involved.
A module is expected to operate on the common in-memory data structure. In case a module
uses its own data structure, it has to first take the requisite information from the common SSF
blackboard and build its own data structure; after completing its task, it writes back the new
information into the common SSF blackboard.
The power of the Dashboard Development Environment is illustrated through a set of
screenshots given in Figures 3.1, 3.2, and 3.3. All these figures show screenshots of the
Hindi-to-Urdu Sampark system running under Dashboard. The vertical column in the middle lists
the names of all eleven modules (the last entry is always system) comprising the pipe-line
architecture of the system currently running under Dashboard. The user of Dashboard can provide
input to the system either through a file, or by typing directly in the left pane. Further, he
can choose to run the complete system by clicking the system choice in the vertical column.
Alternatively, he can run the system step by step by clicking the modules, one by one, from top
to bottom.
Figure 3.1 shows the complete translation of a Hindi text into Urdu, the user having chosen
system from the vertical column. The left pane shows the input text (composed of five
sentences) written in the source language, Hindi, and the right pane shows the translated output
text of the Sampark system, written in the target language, Urdu. The total system execution
time of 40.11 seconds is also shown at the right corner of the tool bar.
Once the system has executed completely, the session can be saved for future analysis. In
case the developer wants to analyze the intermediate output of any specific module, for any
specific sentence, he can do so through Dashboard. He has to first choose the sentence of his
choice by scrolling through the sentence numbers shown in the tool bar adjacent to the execution
time, and then click the module of his choice in the vertical pipeline. In Figure 3.2, the left
pane shows the input (the first sentence, in text form) to the module morph (the first module of
the system pipe), and the right pane shows the output produced by the morph module in SSF. The
execution time consumed by the morph module, i.e., 0.77 seconds, is also shown above.
Figure 3.2: Dashboard showing Input and Output of Hindi Morph
Similarly, in Figure 3.3, the left pane shows the input (in SSF) to the module postagger, and
the right pane shows the output produced by it (in SSF). The postagger execution time of 1.06
seconds is also shown. In this way, the development team can post-analyze the complete paragraph
translation by scrolling through each sentence-level and each module-level execution.
For each specific inappropriate output, the fault can easily be localized at the module level.
Similarly, the module-level time profile, with sentence-level granularity, easily guides the
development team in isolating those modules needing performance improvement.
Figure 3.3: Dashboard showing Input and Output of Hindi POS Tagger
3.5.3 Implementation of Dashboard
The first version of the Dashboard Development Environment was released in the latter part of
2008, and through continuous interactions with the participating institutions of the consortium,
we have released an enhanced version of Dashboard every six months. The latest version of
Dashboard is fully stable, and it satisfies the major needs of the consortium as the main
development environment for productizing the proposed eighteen Sampark systems.
The back-end source code of Dashboard is written in Perl, and the component handling the
visualization interface is written in Java. The total size of the complete Dashboard Development
Environment is 27.7 MB, out of which the program sizes of the back-end component and its
visualization interface are 4.3 MB and 23.4 MB respectively.
3.6 Evaluation
The consolidation of the experience of all eleven institutions (and the SEG) on their usage of
the Dashboard Development Environment to produce the eighteen Sampark systems, through
re-using/re-engineering their existing laboratory modules, was done through a questionnaire [**]
distributed to them. This survey was done only after almost all eighteen Sampark systems had
been built by us and were going through continuous accuracy and performance improvement by the
participating institutions. Four systems have been released on the web after their accuracy
crossed the acceptable threshold. The major experiences are:
1. As an Engineering Tool - A module written by an institution always runs satisfactorily in its
own laboratory environment. Invariably, any other institution finds it extremely difficult to
integrate the same module in its own laboratory environment. This usually happens because
context-dependent hard-coded data (e.g., path names of files, etc.) creeps into the code without
the researcher/developer being aware of it, making the module environment sensitive. Dashboard
helps developers to engineer their modules to make them portable, as it does not permit any
module with environment-sensitive hard-coded data to run under it.
2. As a Testing and Integration Tool - As Dashboard was made available to each module
developer, each of them maintained their own testing environment by locally integrating the
current version of their module with all the other modules developed by the other groups. In
other words, it provided a multiply replicated integration and testing environment, available
with each of the participating institutes. Whenever a developer released the next version of his
module, all the other developers upgraded their testing environments and tested their systems
with the newly released version of the module.
3. As a Group Testing and Coordination Tool - The multiple replicated integration and testing
environments not only helped the groups to test their modules with real modules (and not with
stubs), but also helped them to give feedback to the owners of the preceding module (i.e., the
module whose output they take as input) to improve the accuracy of the system. Further,
Dashboard has become extremely helpful as a group coordination tool as well, because the members
of the group are geographically scattered.
4. As a Profiling Tool - A Sampark system needs continuous improvement in its performance as
well. Dashboard provides time-profile data for each module, which helps in improving the design
of the various time-intensive modules. In case the response of the system falls short of user
expectations, this also gives a guideline for enhancing the hardware resources, so that the
response to the user is brought within the acceptable limit.
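The per-module time profiling described above can be sketched as a thin wrapper around the pipeline. This is an illustrative assumption about the mechanism, not Dashboard's Perl implementation; the toy modules exist only for the demonstration.

```python
# Sketch of per-module wall-clock profiling over a pipeline (illustrative).
import time

def profile_pipeline(modules, data):
    """Run modules in order, recording wall-clock time for each one."""
    timings = {}
    for module in modules:
        start = time.perf_counter()
        data = module(data)
        timings[module.__name__] = time.perf_counter() - start
    return data, timings

def morph(d): return d + ["morph"]
def postagger(d): return d + ["pos"]

out, timings = profile_pipeline([morph, postagger], [])
slowest = max(timings, key=timings.get)   # module to optimise first
print(out)            # ['morph', 'pos']
```

A developer would read the `timings` table per sentence to decide which module deserves performance work, which is exactly the use Dashboard's profile data is put to.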
Chapter 4
Caching for MT Systems
In this chapter, we present the provision of a cache mechanism by the system integration and
deployment platform (Dashboard) to enhance the performance of computationally intensive NLP
applications. We present an introduction in Section 4.1, and then explore the problem domain in
Section 4.2. We explain the middleware in Section 4.3. Section 4.4 explains the caching
mechanism in detail, and Section 4.5 presents the experiments and results. We present the
conclusions in Section 4.6.
4.1 Introduction
After the public release of the first set of four Indic machine translation systems at the
website http://sampark.org.in [7, 16], for interactive as well as batch usage, two types of
demand emerged: to improve the accuracy of the MT systems, and to improve their response time
and throughput. While the former task lies completely in the NLP domain, the latter presents a
challenge to the application development teams as well as to the software engineering group
(SEG) associated with the project, which has designed and built Dashboard, the testing,
integration and deployment environment [23] over which these MT systems are built and deployed.
Dashboard not only facilitates the NLP developers in the development, testing and integration of
the MT systems but also shields them from the complexity of the underlying deployment platform
on which these systems run.
For a large and complex web application, the improvement of its response time as well as its
throughput depends not only upon its intra-application architecture and programming complexity,
but equally, and sometimes more, on the complexity of the computational environment on which it
is deployed. In the present case, the tasks to improve performance related to the application
architecture and program complexity come under the purview of the application designers and
developers, i.e., the NLP experts, whereas those related to the deployment environment, viz.,
Dashboard, come under the SEG.
set of modules, and for building and testing the complete system. The implementation of
Dashboard is presented in [23]. While designing Dashboard, it was realized that an NLP
application such as an MT system needs continuous accuracy and performance improvements even
after release, and hence mechanisms for integrating heterogeneous modules, for module-level
flexibility, and also for time profiling of the modules were provided. Yet, it did not focus
attention on providing mechanisms that could be used by developers to improve module
performance.
Expecting an increase in the use of the systems in the near future, experiments are being
done to estimate the performance of the MT systems in terms of interactive response time and
throughput. It has been found that the performance of the MT systems on the web deteriorates
very fast as the load increases. Before coming to a decision on an appropriate upgrade of the
hardware platform for the expected load increase, the SEG considered the additional software
engineering mechanisms that could be made available to the application developers by enhancing
Dashboard, so that the performance of the MT systems could be improved. The provision of an
optional dynamic cache [24] was a possible choice.
Caching is mostly used to decrease data access time, whether it is applied in hardware, the
operating system, middleware, or at the application level. For a web application, a cache is
most commonly used for improving the performance of data-intensive information retrieval, query
and search applications, such as web search engines, large business applications, etc., where
the data volumes are not only very high but also widely distributed [24, 25]. It reduces the
input/output cost of locating a query item deep down in the storage hierarchy, as well as the
high network communication cost. However, caching improves the performance of an application if
and only if there is a high query frequency [26], i.e., a small set of queries is raised
frequently, and distinct types of queries are less likely.
(i) From empirical observations, it has been found that in a given corpus of natural language
sentences, the frequency of any word is inversely proportional to its rank in the frequency
table. This was first proposed by G. K. Zipf [27] and is known as Zipf's law. In other words,
the most frequent word occurs approximately twice as often as the second most frequent word,
three times as often as the third most frequent word, and so on.
(ii) Any document that is submitted for translation contains a large number of content
words (i.e., nouns), which are mostly language independent. From another empirical observation,
it is found that there is a relationship between the size of a collection and the size of the vocabulary
used in it. If T is the number of words in a text and M is the number of unique words used, then this
relationship is given by Heaps' law [28]:
M = kT^b
Here k and b are constants that depend upon the type of collection. For English text, it has been
found that
10 ≤ k ≤ 100
and
0.4 ≤ b ≤ 0.6
This means that with an increase in the size of T there are diminishing returns in the discovery of unique
vocabulary items M.
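For illustration, the relationship above can be evaluated directly; the constants here are illustrative values within the English ranges, not measurements from any Sampark corpus.

```python
def heaps_vocab(T, k=30.0, b=0.5):
    """Heaps' law estimate of unique vocabulary size: M = k * T**b.
    k and b are illustrative constants within the English ranges."""
    return k * T ** b

# Doubling the text size from 10,000 to 20,000 words grows the estimated
# vocabulary by only a factor of sqrt(2) ~ 1.414: diminishing returns.
m1 = heaps_vocab(10_000)
m2 = heaps_vocab(20_000)
print(m1, round(m2 / m1, 3))
```

This diminishing growth of new vocabulary is what makes repeated words, and hence a cache, increasingly effective as documents get larger.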
(iii) In a rule-based MT system, such as the Sampark system [5], most of the compute-intensive
modules operate on words, chunks (viz., sequences of words), or sentences. Further, all these
modules are functional in nature. The modules that operate at the word level alone account for
approximately 35% of the computation load of any sentence translation. This proportion is roughly the
same across all the Indic languages for which MT systems are being developed.
It needs to be noted that both Zipf's and Heaps' laws hold when the volume of the
collection is high. By combining these three distinguishing characteristics, we may assert that in
any MT system, where a small set of words (whether content words, verbs,
connectors, etc.) has a very high probability of repeated appearance in any document, a cache is
a very attractive option for performance improvement. A dynamically growing cache makes
the repeated running of a compute-intensive, word-level functional module redundant once it has
executed for a given word. Realizing these novel features of an MT system, we decided to
provide, as a part of Dashboard, a cache mechanism that module developers can opt for while
integrating their modules to build the MT system.
Provision of a cache is made optional in Dashboard, as management of the cache
infrastructure carries some performance penalty. Each module developer has to choose
whether to use a cache after weighing the performance improvement, due to the sharp
decrease in the number of times the word-level module needs to run, against the performance
penalty of the additional computation for cache access and cache
infrastructure management.
After enhancing Dashboard to provide the optional dynamic cache facility to module
developers, we are now experimenting with and recording the performance improvement in one of
the four deployed MT systems, viz., the Punjabi-Hindi MT system. This chapter reports these initial
findings; we expect to complete the recording of the performance improvement due to the
provisioning of a cache for all four deployed MT systems.
4.3.1 Middleware
Middleware software [29, 30] is meant to shield application designers, and in some cases
application support personnel as well, from having to monitor and manage the complexity of the
underlying, ever-evolving, large-scale distributed computation platform. It helps increase the
productivity of large-scale application developers working in complex, heterogeneous, high-
performance computing environments.
Usually, middleware provides higher level language abstractions, libraries that hide
complexities of underlying platforms, and also services that are common to many applications.
Different middleware is developed to cater for different classes of applications.
Researchers [31] have broadly identified two classes of applications relying on distributed
systems: commercial network/enterprise services, and computational science and business
services.
The characteristics of commercial network/enterprise services applications are:
Interactive and on-line, with constraints on response time and throughput
Homogeneous in terms of software architecture
Typically composed of web servers, servlets, and database management systems
Architecturally complex due to having to contend with a large community of users
requiring the service
Computational science and business services applications are increasingly demanding
applications designed to allow the study and processing of large quantities of simulation or
sensor data in order to better predict scientific phenomena, economic variables, etc. Examples of
such applications are physical-systems modeling (e.g., weather, chemistry, biological systems),
business modeling (e.g., demand forecasting, logistics, data mining), problem-solving environments,
distributed visualization toolkits, etc.
The characteristics of computational science and business services applications are:
Large data size that needs to be processed (of the order of terabytes or more)
Compute-intensive, algorithmically 2-3 orders of magnitude more complex than commercial applications
Community of users is not necessarily as large as for the other class above
The applications themselves require a substantial amount of tuning and configuration in
order to effectively obtain the desired performance, and they generally run as batch jobs.
However, the key features of middleware systems catering to the two application classes are
intrinsically different.
The middleware systems for the commercial network/enterprise services applications address
issues such as:
Replication (with different consistency models)
Fault tolerance, and fault recovery
Load balance and self-tuning
Caching
On the other hand, the middleware systems for computational science and business services
applications focus on methods for:
Work load partitioning
Parallel input/output
Work load scheduling and balancing
Caching
AI applications, such as MT systems, speech recognition systems, game systems, etc., lie between
the two classes of applications discussed above. They are compute-intensive, have a large data
set to operate on, have an intrinsic need for interactivity, and need to be available on-line for
best usage.
When we contrast the characteristics of AI applications listed above with those of the two classes of
applications discussed earlier, we come to the conclusion that a middleware suitable for AI
applications should have the following features:
For NLP applications, Dashboard plays a dual role: that of a development environment for the
Sampark systems, and that of middleware, shielding application developers from deployment
complexity.
Dashboard already provides resiliency and robustness, which cater to the requirements of fault
tolerance and fault recovery.
Dashboard has now also been equipped to provide optional dynamic
caching at the module level.
In brief, the next version of Dashboard will also be able to fulfill the role of middleware
for NLP applications.
4.4 Caching Mechanism
4.4.1 Cache and NLP Applications
A cache is the most commonly used mechanism for performance improvement where the locality
of reference is high [26]; that is, when there is high data locality or high query frequency [32].
NLP applications operate on natural language text and documents that show a high
frequency of occurrence of a small number of words. With the provision of a cache, the compute-
intensive modules run less often, improving the performance of the system substantially.
There are two types of caches: static and dynamic. In the case of a static cache, the application is
provided with a cache of a fixed size, filled with data items a priori. This is appropriate where
enough study has been carried out to ascertain the stability and size of the high-frequency data items.
Where the stability and locality of data cannot be ascertained in advance, it is appropriate to start a
cache from scratch and let it grow dynamically. As sufficient study has not yet been done on
documents in any of the Indic languages, it is advisable to provide the facility of a dynamic cache.
Further, Zipf's law and Heaps' law are both valid only for high volumes of content. The
present size of documents submitted to the Sampark systems is likely to be in the range of 50 to 100
sentences, each sentence having ten to fifteen words on average. It cannot be asserted that
the provision of a cache, considered together with its associated computation load, will always
give better performance. Hence, it is better to make the provision of the cache optional.
In the Sampark system we have used this cache mechanism; the figures below show the
cache-miss and cache-hit flows.
Figure 4.1 Cache Miss Flow
Figure 4.2 Cache Hit Flow
Dashboard provides mechanisms [23] to define a system specification file, to configure a system
composed of a set of modules in a pipeline, and also a module specification file to define the
runtime properties of the modules. The module specification file specifies the programming
language, input parameters, level of module operation, and many additional parameters for the
module's operation. We have now provided an additional parameter for developers to opt
for a cache.
In case the module developer opts for a cache, a dynamic cache is created for that
module. The input parameters of the module are combined to form a Composite Key, and this
Composite Key is used as a key into the Hash Map.
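The mechanism can be sketched as follows. This is a simplified illustration, not the actual Dashboard implementation (whose interfaces are not shown in this thesis): the module's input parameters are joined into a composite key, which indexes a hash map holding earlier results.

```python
def cached_module(module_fn):
    """Wrap a word-level functional module with a dynamic cache.

    All input parameters form a Composite Key into a hash map; on a
    hit, the module is not re-executed for the same inputs."""
    cache = {}
    def wrapper(*params):
        key = "\x1f".join(map(str, params))  # composite key from all inputs
        if key not in cache:                 # cache miss: run the module once
            cache[key] = module_fn(*params)
        return cache[key]                    # cache hit: reuse stored result
    wrapper.cache = cache
    return wrapper

# Hypothetical word-level module standing in for, e.g., the word generator.
@cached_module
def word_generator(root, features):
    return root + ":" + features  # placeholder for the real computation

word_generator("ghar", "n.sg")
word_generator("ghar", "n.sg")    # second call is a cache hit
print(len(word_generator.cache))  # prints 1: only one entry was computed
```

Because the modules are functional (the same inputs always produce the same output), reusing a cached result is always safe.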
4.5 Experiments and Results
Presently, the four Sampark systems are running on a hardware platform comprising an Intel
Core 2 Quad CPU @ 2.5 GHz with 2 MB L2 cache and 4 GB RAM.
The experiment has been done on the Punjabi-Hindi Machine Translation system. Though
there are four word-level functional modules in the Sampark system, the experiment was
done by opting for the cache in the lexical substitution and word generator modules only. The
experiment used documents with 25, 50, or 75 sentences, having 360, 878, or 1258 words and a
unique vocabulary count of 193, 368, or 519, respectively.
As the lexical substitution module and the word generator module sit far down
the translation pipeline, the number of tokens presented to them will differ from these counts, but
should show a similar proportion.
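From the document statistics above, an upper bound on the cache-hit ratio of a word-level module can be estimated: every repetition of an already-seen token is a potential hit. This is a rough sketch; actual hit rates depend on the tokenization inside the pipeline.

```python
def max_hit_ratio(total_words, unique_words):
    """Upper bound on the cache-hit ratio of a word-level module:
    the fraction of tokens that repeat an already-seen token."""
    return (total_words - unique_words) / total_words

# Word and unique-vocabulary counts of the three test documents above.
for total, unique in [(360, 193), (878, 368), (1258, 519)]:
    print(total, unique, round(max_hit_ratio(total, unique), 2))
```

The bound grows with document size (from roughly 0.46 for the smallest document to roughly 0.59 for the largest), consistent with Heaps' law.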
The experiments were done to measure the performance of the Sampark system with and
without the cache enabled. The results are presented in the tables below.
4.6 Conclusion
The preliminary results show a module-level performance improvement of around 20% or more,
and around 10% to 15% performance improvement at the sentence level, using the cache in just
these two modules. Once we provide a cache in the other word-level modules
in the pipeline, such as the morphological analyzer, we may expect an even larger improvement
in performance.
These three documents were chosen randomly from our test suite of documents and
hence may not have captured the average characteristics of the Punjabi language, as we find an
anomaly in the lexical substitution module, viz., the increase in document size does not give a
corresponding growth in performance improvement. This may be due to the specificities of these three
documents.
The implementation of the cache facility in Dashboard allows the option of making the
cache persistent, i.e., the dynamic cache built by the previous runs can continue as a static cache
for the next run. Since a Sampark application is natural language specific, we strongly feel that
the provision of a persistent cache at the module level will increase the performance of the
Sampark systems.
Provision of a persistent, dynamic cache will also require an upper size limit on the cache.
In the Sampark systems, such a persistent cache should comprise only words/tokens belonging to the
natural language, and not content words (i.e., nouns), as there is no objective basis for
expecting their re-appearance in different documents.
Chapter 5
Enhancing Throughput of MT System using
MapReduce
In this chapter we describe in detail how the throughput of a computationally intensive
application like an MT system can be improved using the MapReduce framework. We present the
introduction in Section 5.1. We then present the strengths and limitations of the MapReduce
framework in Section 5.2. Section 5.3 characterizes the MT system and describes the approach to
running it under Hadoop. Section 5.4 describes the experiments and results. Section 5.5 presents
the discussion and future work.
5.1 Introduction
Parallel processing applications can be broadly classified as compute-intensive or data-
intensive. Computing applications that devote most of their execution time to the
processing requirements of their problem are classified as compute-intensive applications. These
applications typically have small input data to be processed, unlike data-intensive applications,
where the input data that needs to be processed is generally terabytes or petabytes in size.
Machine Translation (MT) is one such application where the input data is small
but the computation is heavy; the Sampark Indian language to Indian language
machine translation system is one such application.
It was found that the performance of the Sampark MT system on the present platform
(CentOS Linux running on a 4-core CPU of 2.50 GHz with 4 GB RAM) deteriorates very fast as the
workload increases. Therefore, in the second phase of the Indian Language to Indian Language
Machine Translation (ILMT) project, one of the goals set for the Software Engineering Group (SEG)
associated with the project [4] is to improve the response time and the throughput of the
systems by porting them to an appropriate deployment environment that can be incrementally scaled
up as and when the average volume of workload increases. Our initial approach is to treat
enhancing the throughput of Sampark as the primary task, keeping the task of
improving response time in abeyance.
In the recent past, MapReduce [35] has emerged as the most attractive programming model
for compute-light but large-scale data-intensive applications, offering scalable performance on
clusters of shared-memory systems, such as multi-core chips and symmetric multi-
processor systems. MapReduce is best suited to functional applications, and its two functions,
Map and Reduce, are inspired by LISP, the functional programming language [36]. The Map
function allows the job to be partitioned into smaller tasks by splitting the input data file, and each
task is distributed to an available node for parallel execution. The Reduce function
receives the results of all partitioned tasks and collates them to give the final result. In this model,
programmers without any experience of parallel and distributed systems can utilize the resources
of a large distributed system, as the runtime system takes care of partitioning the job into multiple
tasks, scheduling all the tasks comprising the job, inter-task communication, and also recovery
from failure of computing resources. With the wide acceptance of the MapReduce paradigm, various
types of applications are being written in this model to utilize the available resources of large
distributed systems, resulting in enhanced performance and better throughput [37]. As Hadoop
adds considerable operational overhead, in the range of 20-30 seconds and sometimes even
more, to the application system running under it, the real benefit of the parallel and distributed
computing offered by the framework is visible only for larger applications.
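The Map/Reduce model just described can be sketched in miniature. This is an illustration only, not Hadoop code: the map step turns each input split into an intermediate result, and the reduce step collates them into the final output.

```python
def run_mapreduce(splits, map_fn, reduce_fn):
    """Toy MapReduce driver: apply map_fn to every input split (in
    Hadoop these run in parallel on different nodes), then collate
    the intermediate outputs with reduce_fn."""
    intermediate = [map_fn(split) for split in splits]
    return reduce_fn(intermediate)

# Word count, the canonical MapReduce example.
def count_words(text):
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge_counts(dicts):
    merged = {}
    for d in dicts:
        for word, n in d.items():
            merged[word] = merged.get(word, 0) + n
    return merged

result = run_mapreduce(["to be or", "not to be"], count_words, merge_counts)
print(result)
```

The same shape carries over to translation: map translates an input split, reduce concatenates the translated splits.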
We felt that the MapReduce framework is applicable to our rule-based/transfer-based MT
system for the following four reasons:
(i) Any document file that is to be translated, i.e., the data input file to the MT system,
can always be abstracted as a List of sentences, which can be further split into a sequence
of Lists of words.
(ii) A transfer-based machine translation system like Sampark is a functional application,
and is a List Homomorphism [38]; hence, it can be easily parallelized and executed on a
large cluster of machines [39].
(iii) Incremental scaling up of computing resources on demand is a basic, inherent
feature of the MapReduce framework.
(iv) Hadoop (The Apache Software Foundation), the open-source implementation of
MapReduce, is available and can be modified to circumvent the difficulties posed by the
contradictory features of our application.
We felt that to run the MT system on the MapReduce framework, its list-homomorphism
characteristic could be utilized to run multiple instances of it in parallel on multiple nodes, enhancing
throughput. Furthermore, as the MT system is a complex and compute-intensive application, the
enhancement in throughput would be felt squarely.
MapReduce has been designed for data-intensive but compute-light applications running on a
hardware platform composed of a cluster of nodes. In a Hadoop cluster, data gets distributed over all
nodes of the cluster while it is being loaded. The Hadoop Distributed File System (HDFS) splits the data
into chunks, and each chunk is loaded on a different node of the cluster, well before the
application is initiated. The Hadoop MapReduce programming framework is implemented as a
client-server architecture with a single master, called the jobtracker, and many slaves,
called tasktrackers, running one per node in the cluster. The jobtracker is the point of interaction
with all users, receiving users' map/reduce tasks. The jobtracker puts all submitted tasks in the
pending job queue, and schedules each map task to a different tasktracker, but only on
those nodes where the application's data chunk has been preloaded. On completion of the map tasks
by the different tasktrackers, a set of intermediate outputs is produced. The jobtracker then
schedules the reduce task on free tasktracker nodes that have easy access to the
intermediate outputs. The reduce task combines the intermediate outputs received to
produce the final output. In the MapReduce framework, as the job gets parallelized across the
available tasktrackers in the cluster, the completion time of any job is greatly reduced,
enhancing the throughput of the system.
In this framework, the map and reduce tasks of users are submitted by a client to the
jobtracker. The jobtracker transfers the map/reduce tasks to all those tasktracker nodes where
instances of the map/reduce tasks have been scheduled to run. It is presumed that
map/reduce tasks are compute-light with a small code footprint, so that they do not strain the inter-
node communication network within the cluster while being transferred from the
jobtracker to the various tasktrackers.
The MapReduce framework, originally proposed by Google, is used by it to
process more than 10 petabytes of data per day [35]. Since the release of the Hadoop implementation of
MapReduce, more than a hundred organizations, including large companies and academia, have been using
it for various types of applications. This has also resulted in intense research and development
activity in various directions [37]. Some researchers have developed distinct MapReduce
algorithms for processing different types of massive data [40, 41], some have simulated well-
known parallel processing algorithms in the MapReduce framework [42], while others are
involved in developing schemes for implementing the MapReduce framework on distinct types of
physical platforms [43, 44, 45], and in optimizing scheduling problems in its context [46].
The output quality of Statistical Machine Translation (SMT) systems increases with the amount of
training data [47, 48]. To get good-quality output, typical SMT systems train their engines on 5
million to 10 million sentence pairs, and training an engine on such a massive volume of data, even on
good processing platforms, takes a couple of days to a week. Hence, many efforts are
being pursued to use the MapReduce framework to execute such training modules over large corpora
on large distributed systems, bringing the training time down to a couple of hours [49, 50].
Open-source toolkits capable of training phrase-based MT models on a Hadoop cluster [51] and
grammar-based statistical MT on a Hadoop cluster [52] have been reported.
In the present implementation of MapReduce in Hadoop, the compute task gets transmitted across the
nodes of the cluster, and for applications that have a large code footprint, transferring it
across the nodes would be totally antithetical to the basic goal of throughput
enhancement. Hence, if we run a compute-intensive application with a large code size
under Hadoop, the time spent in transferring the compute task across the nodes would completely
drain the expected throughput enhancement from the parallel processing of its multiple instances.
This is the main limitation of Hadoop, making it, as-is, unsuitable for compute-heavy
jobs. To utilize the benefit of the parallelism provided by Hadoop, computationally heavy jobs require
a distinct approach that overcomes this difficulty without paying any time penalty.
As discussed earlier, a text document file that is submitted for translation, i.e., the data input file
to the MT system, can always be abstracted as a List which can be further split into a sequence of
sub-Lists. While a document may contain a large number of sentences, the size of a sub-list may go
down to the lowest granularity of a single text sentence. Hence, the application has wide latitude to
right-size the granularity of a sub-list (considering the time overhead of parallelization on the
deployment platform) to get the best throughput.
As the transfer-based machine translation system is a list-homomorphic functional application,
it is a natural candidate for the MapReduce framework: it can be easily parallelized [39, 52] and
executed on a cluster of a large number of physical machines.
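The list-homomorphism property can be stated concretely. The sketch below uses a stand-in `translate` function (the real Sampark pipeline is far more complex): translating the concatenation of two sub-lists equals concatenating their independent translations, and this is exactly what licenses parallel execution of the parts.

```python
def translate(sentences):
    """Stand-in, sentence-wise 'translation'; any per-sentence
    function has the same homomorphism property."""
    return [s.upper() for s in sentences]

doc = ["sentence one", "sentence two", "sentence three", "sentence four"]
left, right = doc[:2], doc[2:]  # split the document into two sub-lists

# List homomorphism: h(xs ++ ys) == h(xs) ++ h(ys), so the two halves
# can be translated on different nodes and the results concatenated.
assert translate(left + right) == translate(left) + translate(right)
print("homomorphism holds for this split")
```

The same identity holds for any split point, so the granularity of a sub-list can be chosen freely, down to a single sentence.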
Any machine translation application is generic in nature, and hence it is usually offered to users
as a dedicated web application. This feature can be utilized to avoid the
communication overhead through an innovative engineering approach.
Compute-intensive Web Application under Hadoop
To circumvent this problem for the MT system, we have taken the following three steps:
(i) We have developed a program, called MT Invoker, which calls the MT system; this
program is defined as the map task.
(ii) The reduce task is a program, called ConCat, that simply concatenates the intermediate
outputs produced by the map tasks.
(iii) We preload the MT system on all tasktracker nodes, in line with the dedicated web
application scenario.
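The three steps above can be sketched in outline. The names and the command line here are hypothetical (the thesis does not list the actual code): the map task only invokes the MT system already present on the node, and the reduce task only concatenates.

```python
import subprocess

def mt_invoker(input_split_path, output_path):
    """Map task: invoke the MT system pre-loaded on this tasktracker
    node. The command line is hypothetical; the real invocation of
    the Sampark system is not shown in the thesis."""
    with open(input_split_path) as fin, open(output_path, "w") as fout:
        subprocess.run(["sampark-mt", "--src", "hin", "--tgt", "pan"],
                       stdin=fin, stdout=fout, check=True)

def concat(intermediate_paths, final_path):
    """Reduce task (ConCat): concatenate the translated input-splits,
    in document order, to form the final output document."""
    with open(final_path, "w") as fout:
        for path in intermediate_paths:
            with open(path) as fin:
                fout.write(fin.read())
```

Both functions are compute-light and tiny in code size, so transferring them from the jobtracker to the tasktrackers costs almost nothing; the heavy MT binary never moves.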
In this setup, it is MT Invoker and ConCat that are transferred by the jobtracker to the
tasktrackers running on the various nodes of the cluster. Both tasks, being compute-light, suit
the present implementation of Hadoop. When an MT Invoker starts running
on a tasktracker node, it in turn calls the MT system, which was pre-loaded on the node at
cluster setup time. In this way, multiple MT systems run in parallel on multiple tasktracker
nodes of the cluster, each translating a distinct input-split (viz., a sub-list of the input document). When
the translation outputs of all input-splits are available, the jobtracker runs ConCat on a
tasktracker node to concatenate all these sub-lists and generate the final translated output
document.
In this way, we have tricked Hadoop into running a compute-intensive job like an MT system as a
map task in a dedicated web application scenario.
5.4 Experiments and Results
5.4.1 Experimental Setup
The experiment has been done on the Hindi-Punjabi Machine Translation system to measure the
throughput of the system for four different data sets, viz., documents comprising 100 sentences,
200 sentences, 500 sentences, and 1000 sentences. All sentences are of similar size. The set of
experiments was carried out on Hadoop running on a Physical Machine (PM)
cluster, with at most 10 slave machines available for parallel task allocation.
The configuration of the PM cluster is: 4 * 4-core processors of 2.50 GHz, RAM: 4 GB, HDD: 320 GB.
First, the CentOS operating system, viz., CentOS 5.4 (64-bit), was loaded on all machines of the
cluster. On this PM cluster, Hadoop version 0.20.2 was installed, giving 12 slave machines of
2.50 GHz, 4 GB RAM. This includes setting up the jobtracker on the designated master node, and a
tasktracker on each of the remaining server nodes of the PM cluster. Then, by running the rsync Unix
command, the MT system was loaded on all the server nodes.
5.4.2 Results
On each platform, for each data set, say the document of 100 sentences, we measured the
throughput by executing it at four levels of parallelism, i.e., a job composed of 1, 2, 5, and 10
parallel tasks (achieved by splitting the data input into 1, 2, 5, and 10 input-splits, respectively).
For example, when an MT job containing an input document of 100 sentences is split into 5
parallel tasks, each task has 20 sentences to translate. In other words, the same job
is split into multiple tasks by splitting the input, with the tasks executing on different computing
resources in parallel, giving better throughput. So, four experiments were done per data set. For any of the
experiments, we generated at most 10 parallel tasks, as the platform can offer at most 10
slave computing nodes to execute tasks in parallel. So, no task needs to wait, and all are executed
concurrently, giving the best throughput.
In total, 4 experiments were done for each of the 4 data sets, running on the physical
platform, giving a total of 16 (4*4) recordings. These measurements are tabulated in four throughput-
tasks tables, one table for each data set, given in Tables 5.2 through 5.5. Each of the tables
shows the running time for the data set, with increasing parallelism, in the physical
environment. For comparison, we have given the running time of each data set on the
current stand-alone computing system in Table 5.1.
44
Table 5.2: Running time for Job (input data set of 100 sentences)
No. of Tasks Sentences per Task Total Time (Seconds) for Job
1 100 160
2 50 91
4 25 66
5 20 59
10 10 34
Table 5.3: Running time for Job (input data set of 200 sentences)
No. of Tasks Sentences per Task Total Time (Seconds) for Job
1 200 274
2 100 131
4 50 80
5 40 72
10 20 54
Table 5.4: Running time for Job (input data set of 500 sentences)
No. of Tasks Sentences per Task Total Time (Seconds) for Job
1 500 336
2 250 251
4 125 138
5 100 134
10 50 87
Table 5.5: Running time for Job (input data set of 1000 sentences)
No. of Tasks Sentences per Task Total Time (Seconds) for Job
1 1000 968
2 500 334
4 250 230
5 200 226
10 100 184
Graphs of the four throughput-tasks curves are shown in Figure 5.2, one curve for each data set.
The calculated value of the associated overhead for parallel distribution of tasks across the available
computing resources is 20-30 seconds for the physical machine cluster.
The best throughput that can be extrapolated from all the curves of the graph indicates the existence
of a universal throughput minimum value for each specific physical environment.
The throughput minimum of each physical environment is slightly higher than the
associated overhead of distributing tasks in parallel across the various nodes; hence, there is no
extra gain in throughput from increasing parallelism beyond a point.
By providing a sufficient amount of parallelism, the completion time of a document of any size can be
brought close to this universal throughput minimum, which is less than a minute!
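The shape of these curves can be captured with a simple back-of-the-envelope model (an illustration, not from the thesis): job time is roughly the serial work divided by the number of parallel tasks, plus the fixed parallelization overhead c of 20-30 seconds.

```python
def job_time(serial_seconds, n_tasks, overhead=20.0):
    """Model T(n) = serial/n + c: per-job serial work divided among
    n parallel tasks, plus a fixed task-distribution overhead c."""
    return serial_seconds / n_tasks + overhead

# Fit to Table 5.2 (100 sentences): the 160 s single-task run implies
# ~140 s of serial work once a ~20 s overhead is assumed.
for n in (1, 2, 5, 10):
    print(n, job_time(140, n))
```

With these assumed values the model reproduces the measured 160 s and 34 s endpoints of Table 5.2 (and is only rough in between), and it makes the limit visible: once serial/n falls below c, additional parallelism buys almost nothing, which is exactly the universal throughput minimum observed above.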
5.5.2 Future Work
1. Additional Experiments with Symmetrical Computing Resources in a Cloud Environment
We would like to repeat the same experiments on a cluster of virtual machines (with a symmetrical
computing platform and the same version of Hadoop) provided by a public cloud service
provider. As cloud infrastructure can provide computing resources on demand, the completion
time of a document of any size can be reduced to the time minimum, constrained only by the overheads of
MapReduce and the cloud infrastructure. These additional experiments would give the Universal
Throughput Minimum with better precision.
Chapter 6
MT on Cloud
In this chapter we describe in detail how we deploy an MT system on the cloud and package it as a
virtual appliance. We present the introduction in Section 6.1. We then present the background of
the MT system and why we need the virtual appliance in Section 6.2. Section 6.3 describes
work related to virtual appliances. Section 6.4 describes how we deploy the MT system in the cloud
and package it as a virtual appliance for ease of deployment. Section 6.5 describes how
we translate with the help of MT as a virtual appliance. Section 6.6 presents the conclusion and
future work.
6.1 Introduction
Machine Translation (MT) systems in general are composed of a large number of modules that
are heterogeneous in nature, and these heterogeneous modules in turn depend upon a complex
set of environmental dependencies to perform a given task. Resolving such complex
dependencies at deployment time is a hard, technically intensive, time-consuming task;
additionally, it is undesirable.
Software deployment is defined as the process between the acquisition and the execution of
software. This process is performed as a post-development activity that also takes care of the
user-centric customization and configuration of the software. At times this process can be
quite complex and may need quite extensive involvement and expertise of the developers and the system
administrator. Apart from its complexity, the deployment task may be
time-consuming (of the order of hours). It is found [54, 55] that, in general, 19% of the total cost of
operation (TCO) of a software system goes into deployment. As an MT system is far more
complex and technically intensive, it is fair to expect that its TCO would be far higher.
Unlike generic applications, an NLP application like an MT system goes through frequent and
regular updates, mainly to improve its accuracy and performance, and also to increase the
coverage of its domain. Every new release of the system requires fresh deployment of the new
version from scratch, aggravating the technical administration distress and, in turn, inflating the
total cost of operation.
To satisfy the above needs, this chapter proposes that complex NLP applications like MT
systems be packaged and released as virtual appliances [54, 70]. An application packaged as a virtual
appliance can simply be taken out of the box by a lay user and deployed easily on his machine, and with
very little setup time the application becomes ready to use from that instant onwards.
Though packaging an application as a virtual appliance does take time and is technically
intricate, once it is built, its deployment can be done even by a lay user and takes very little
time, as the application's complexity and technical intricacy have been made transparent in its
virtual-appliance incarnation. A virtual appliance packaged either for a stand-alone system or for the
cloud is expected to give proportional deployment advantages.
This chapter reports experimental time measurements for the software deployment
of the MT virtual appliance in relation to the MT application. The results are reported for a standalone
system as well as for a cloud platform.
domain. A typical deployment process of the Sampark MT system (composed of building, configuring,
and verification tasks) takes somewhere between two and three hours, depending on the
language pair. Many a time, on release of a new MT version, this process can be quite complex
and may need the involvement of the developers/computational linguists to resolve its
deployment issues. It usually requires expertise in the MT application as well as in the
underlying systems used by it.
The Sampark system has been built by a consortium [4] of eleven research institutions. Each
group is continuously working on its module to improve the accuracy and performance of the
system, so updates to the MT systems are very frequent. Handling these updates across a large
number of deployed systems, which is a routine task, is highly technically intensive and usually
very time-consuming.
Sampark MT systems are also deployed on the web, and people use them regularly. For
the last two years, consortium members have been improving the accuracy and performance of the
system regularly. It was found that as the web traffic increased, the performance of the system
(response time) for each translation deteriorated sharply. Various options for improving
the response time were experimented with, viz.:
- Refactoring the modules to keep file I/O to a minimum in the production system [4]
- Distributing the translation tasks among a cluster of machines running MT systems;
work load partitioning using the Hadoop framework as middleware [58]
With the above experiments, it was possible to improve the response time of the
Sampark MT system considerably. By provisioning additional hardware resources, it was possible to keep
the response time within acceptable limits as the load increased; the Sampark MT system was able
to scale with the provisioning of additional resources [58].
To translate a book of seventy pages, the Sampark MT system on a stand-alone system
took 71 minutes and 25 seconds, while on a Eucalyptus cloud environment with twelve (12) virtual
machines (VMs) it took only 5 minutes and 27 seconds.
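The speedup implied by these timings can be checked with a quick back-of-the-envelope calculation (this is derived from the reported times above, not a new measurement):

```shell
# 71 min 25 s standalone vs. 5 min 27 s on the 12-VM cloud setup.
standalone=$((71 * 60 + 25))   # 4285 seconds
cloud=$((5 * 60 + 27))         # 327 seconds
echo "speedup: $((standalone / cloud))x (approx.)"   # prints: speedup: 13x (approx.)
```

That is, the cloud deployment ran roughly thirteen times faster on this workload.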
In our case, the MT system was unable to scale up (or scale down) rapidly, because it took a
very long time to deploy new instances of MT on additional virtual machines acquired
dynamically on the cloud. Once the additional resources were provisioned, the system remained
underutilized if the load decreased rapidly. While the MT system was in principle scalable, it
remained either over-provisioned (i.e., resources remained underutilized) or under-provisioned
(i.e., starved for resources). The large deployment time of the MT system was the
main reason for its inability to scale up or down rapidly, making it unsuitable for cloud
deployment.
In the early stage of improving the deployment process, we built a deployment script that
would set up and configure the MT system with little intervention from the user. This script
sped up deployment from hours to minutes (somewhere between 20 and 30 minutes, depending
upon the expertise of the user), but it still required the user to provide complex details of the
environment, such as OS details, paths of various language libraries, and other third-party
tools required to run the various modules of the MT system. The complexity of the deployment
process remained unsolved. Moreover, in spite of the deployment script, the deployment
process largely remained manual.
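The flavor of such a semi-automatic deployment script can be sketched as follows. This is a hedged illustration only: the tool list and the default install path are assumptions for the example, not the actual Sampark script.

```shell
# Illustrative sketch of a deployment helper: check prerequisites,
# then report where the MT system would be installed.
MT_HOME=${MT_HOME:-/usr/local/sampark}   # assumed default install path
missing=0
for tool in sh cat; do                   # placeholder prerequisite list
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
    missing=1
  fi
done
if [ "$missing" -eq 0 ]; then
  echo "prerequisites ok; would install under $MT_HOME"
fi
```

Even with such checks automated, the user still had to supply the environment details by hand, which is the gap the virtual appliance approach closes.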
This chapter mainly discusses how to build MT as a virtual appliance that can be
deployed on a stand-alone VM. The same MT virtual appliance can be deployed on the cloud, and can
be scaled up or scaled down in time.
In the past many tools were designed to ease system/service deployment. Dearle [65]
studied six cases of software deployment technologies and gave some future directions
about the impact of virtualization. The benefits of virtualization and virtual machines are
discussed in [60, 61, 62]. Sapuntzakis et al. [66] proposed Collective, a compute utility using
virtual appliances to manage systems. Software vendors have assembled software
applications, operating systems, and middleware (if any) as virtual appliances (or software
appliances) and distributed them as ready-to-run application stacks [67] that boot into a setup
wizard. Most of the time virtual appliances have been built to ease the distribution of software.
Virtual appliances have also been used to improve software manageability and
automate provisioning [68]. Deploying applications in the cloud allows scaling on demand, and
provides the benefits of elasticity and transference of risks, especially the risks of overprovisioning
and underprovisioning [69]. The key benefit of enabling an application for cloud deployment as a
virtual appliance is the ability to add or remove computational resources with fine granularity and with a
lead time of minutes rather than hours.
- CentOS 5.7, a variant of Linux, as the host operating system for virtualization,
- Hadoop [73] version 0.20.2 as the middleware for work load partitioning, and
The engineering steps taken to set up Eucalyptus on a cluster of physical machines and Hadoop
on a cluster of VMs are given in detail in Appendix-B and Appendix-C respectively.
Various terms used in building and deploying the virtual appliance in the Eucalyptus cloud are
explained below.
Eucalyptus Cloud:
image: An image is a snapshot of a system's root file system; it provides the basis for
instances.
baseline-image: A baseline image usually includes the optimum operating system, any required
service packs, a standard set of applications (that are to be virtualized), other underlying tools
required by the application, and the necessary patches, if any; loosely speaking, a snapshot of the
root file system with the running application.
euca2ools: Image management commands for the Eucalyptus cloud.
euca-bundle-vol: Bundle the local file system of a running instance along with the kernel and ramdisk.
euca-upload-bundle: Upload the bundled image to the cloud.
euca-register: Register an image for use with the cloud.
euca-describe-instances: Check the status of image instances in the cloud.
euca-run-instances: Start a new instance of an image in the cloud.
euca-terminate-instances: Terminate an instance of an image in the cloud.
Once the virtual image is running, the Sampark MT system with all its required dependencies,
and the middleware Hadoop, is installed in the system. The current running image is then rebased:
the image is bundled along with the kernel and ramdisk by the command euca-bundle-vol,
the bundled image is uploaded to the base machine by the command euca-upload-bundle, and lastly
the image is registered by the command euca-register to make it available for launch. The
rebased image is now ready for deployment as a virtual appliance on a Xen VM.
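Consolidated, the sequence above amounts to the following command recipe. It is a sketch, not the exact commands used: the bundle prefix, size, bucket name, and key name are placeholders, and the commands assume the cloud credentials (eucarc) have been sourced.

```shell
euca-bundle-vol -d /mnt -p sampark-mt -s 4096          # bundle root fs with kernel and ramdisk
euca-upload-bundle -b sampark-bucket -m /mnt/sampark-mt.manifest.xml
euca-register sampark-bucket/sampark-mt.manifest.xml   # prints an emi-XXXXXXXX id
euca-run-instances emi-XXXXXXXX -k mykey               # launch the appliance
```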
6.5 Our Approach to run MT under Cloud
To run the Sampark MT system in the Eucalyptus cloud, we launch the baseline image of the Sampark MT
system of the respective language pair to conduct the experiment. Once all the available
instances are ready for use, we designate one as the master node and the rest as slaves, and change
the configuration accordingly to run the Sampark MT system.
In this setup, it is the MT Invoker and ConCat that are transferred by the jobtracker to the
tasktrackers running on various nodes of the cluster. Both these tasks are compute-light and thus
suit the present implementation of Hadoop. When an MT Invoker starts running
on a tasktracker node, it in turn calls the MT system. In this way, multiple MT systems
run in parallel on multiple tasktracker nodes of the cluster, each translating a distinct Input-split (viz.,
a sub-list of the input document). When the translation outputs of all Input-splits are available, the
jobtracker runs ConCat on a tasktracker node to concatenate all these sub-lists and generate
the final translated output of the input document.
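The Input-split / MT Invoker / ConCat flow just described can be mimicked sequentially with ordinary shell tools. In this toy sketch, tr stands in for the real MT system, and all file names are illustrative:

```shell
# One "paragraph" per line of the input "book".
printf 'para one\npara two\npara three\n' > book.txt
split -l 1 book.txt part_                  # one Input-split per paragraph
for f in part_[a-z][a-z]; do               # MT Invoker stand-in on each split
  tr 'a-z' 'A-Z' < "$f" > "$f.out"         # "translate" = uppercase here
done
cat part_[a-z][a-z].out > book.translated  # ConCat: merge sub-lists in order
cat book.translated
```

In the real setup the for loop is what Hadoop parallelizes: each split is shipped to a tasktracker, and only the final concatenation is serialized.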
6.6 Experiments and Results
The experiment has been done on the Hindi-to-Punjabi machine translation system to measure the
throughput of the system on the book Nirmala by Premchand, comprising 1,359 paragraphs,
5,049 sentences, and 74,951 words. The average number of words per sentence is 14. The numbers of
paragraphs, sentences, and words are approximate because we use a program to extract the
paragraphs, a tokenizer to split paragraphs into sentences, and the wc command to count words.
6.6.1 Results
We measured the throughput by translating the whole book using 2, 4, 8, and 12 virtual
machines, each with a dual-core CPU and 1 GB RAM. The job (the book) is composed of 1,359 paragraphs,
which are parallel tasks (obtained by splitting the whole book with the paragraph extractor program).
In other words, the same job is split into multiple tasks that execute on different computing resources
in parallel, giving better throughput. With 2 virtual machines, 4 tasks (paragraphs) run in
parallel while the others wait in the queue, and so on; with 12 VMs, 24 tasks (paragraphs) are in
parallel execution while the rest are queued.
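The parallelism figures above follow from two tasks per dual-core VM (a common Hadoop default of 2 map slots per tasktracker, assumed here), which can be tabulated directly:

```shell
# For each cluster size, 2*N paragraphs run in parallel;
# the remaining paragraphs of the 1,359-task job wait in the queue.
total=1359
for vms in 2 4 8 12; do
  slots=$((2 * vms))
  echo "VMs=$vms parallel=$slots queued=$((total - slots))"
done
```

So even at 12 VMs only 24 of the 1,359 tasks are in flight at once, and the scheduler keeps refilling the slots from the queue until the book is done.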
The measurements from our experiment are tabulated in Table 6.2. For comparison, the
throughput of the same data set on the current stand-alone computing system is given in Table 6.1.
Figure 6.2: Total time vs. the number of virtual machines
The calculated overhead associated with the parallel distribution of tasks across the available
computing resources is 20-30 seconds for the virtual machine cluster.
6.7.1 Conclusion
In this chapter we have shown experimentally that a complex MT system can be built into a
virtual appliance that can be deployed in the cloud.
Chapter 7
Conclusion and Future Work
In this chapter, we present the overall conclusions of this thesis. We summarize the conclusions
of this thesis in Section-7.1. A few possible future directions of our work are presented in Section-
7.2.
7.1 Conclusions
A new methodology, called the Symbiotic Software Engineering Approach, was applied to
natural language modules previously developed in a laboratory environment to produce field-deployable
and maintainable MT systems (and modules). As a result, we successfully engineered and deployed the
Sampark MT systems at http://sampark.org.in.
We have also developed a software integration and deployment platform for NLP applications,
called Dashboard. It has been enhanced to give functional module developers mechanisms
to avail of and configure, if they so desire, the dynamic cache facility offered by the
system integration platform to improve the performance of the modules. By using the cache
facility in some of the modules, we achieved a 15-20% improvement in speed.
Finally, we have built the MT system as a virtual appliance for deployment on the cloud.
7.2 Future Work
In the future, experiments similar to Heaps' law can be done on phrases, to avail of the
caching facility offered by the system integration platform and improve the
performance of the modules. On a distributed platform, a distributed cache can also be tried.
Appendix-A
1. Coding Guidelines
2. Inline Comments
3. Code Review Checklist
4. Module Packaging Guidelines
5. System Integration Guidelines
6. System Deployment Directory Structure
7. Software Requirement Specification Guidelines
8. References
1. Coding Guidelines
1. In programs, the names of global variables and functions themselves serve as comments,
so as far as possible intuitive names should be chosen for variables as well as functions.
Such names convey useful information about the variables and functions and drastically
improve the readability of the code.
2. Function names should begin with verbs, e.g., find_index(), build_tree(),
calculate_emit_prob(). Similarly, information entities should be named with nouns, e.g.,
letter_tree.
3. Underscores should be used to separate words in variable and function names, e.g.,
input_file(), train_lambda().
4. Use enums when you want to define names with constant integer values.
5. Avoid using magic numbers for array size definitions; rather, use #define macros in
CAPITAL letters to define array sizes.
6. Local variable names can be shorter because they are used only within a local context,
where inline comments explain their purpose.
Footnote 3: As is generally the case with command line (i.e., all-text mode) programs in Unix-like operating systems, filters read
data from standard input and write to standard output. Standard input is the source of data for a program, and by
default it is text typed in at the keyboard. However, it can be redirected to come from a file or from the output of
another program. Standard output is the destination of output from a program, and by default it is the display
screen. This means that if the output of a command is not redirected to a file or another device (such as a printer) or
piped to another filter for further processing, it will be sent to the monitor where it will be displayed.
4. Command line arguments parsing should be done at the beginning of the program. It should
not be scattered all around the program code.
5. There must be a provision for giving input arguments by specifying command line options.
6. There must be a provision to redirect input from standard input.
7. There must be a provision to redirect output to standard output.
1. Application error logs and debug logs should be generated (if possible) using macros in C
programs, or a log4perl-like mechanism in Perl programs; log4j can be used for Java
programs.
2. Logging should be implemented so that it is configurable at runtime, with the help of
command line options.
3. Applications should provide logging messages at multiple levels, like info, warn, and fatal.
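Guidelines 2 and 3 can be illustrated with a tiny shell wrapper whose log level is chosen at runtime. The variable name LOG_LEVEL and the level set are assumptions for this example, not part of any Sampark module:

```shell
# Print a message only when its level is at or above the configured
# threshold (info < warn < fatal).
LOG_LEVEL=${LOG_LEVEL:-info}
log() { # usage: log LEVEL MESSAGE
  case "$LOG_LEVEL:$1" in
    info:info|info:warn|info:fatal|warn:warn|warn:fatal|fatal:fatal)
      echo "[$1] $2" ;;
  esac
}
log info "starting translation"
log fatal "out of disk space"
```

Running with LOG_LEVEL=fatal would suppress the info message while keeping the fatal one, which is exactly the runtime configurability the guideline asks for.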
2. Inline Comments
2.1 Every program should start with a brief comment explaining what it is for. This comment
should be at the top of the program file. Additionally, it should mention the names of the
original and subsequent authors of the program. For example,
2.2 Put a comment on each function, saying what the function does and what sorts of
arguments it gets, what the possible values of the arguments are, and what they are used for.
For example,
2.3 Explain the significance of return value, if there is one.
/* different returns zero if two strings OLD and NEW match, nonzero
if not. OLD and NEW point not to the beginnings of the lines but rather to
the beginnings of the fields to compare.
OLDLEN and NEWLEN are their lengths. */
static int
different (char *old, char *new, size_t oldlen, size_t newlen)
{
if (check_chars < oldlen)
oldlen = check_chars;
if (check_chars < newlen)
newlen = check_chars;
if (ignore_case)
{
/* FIXME: This should invoke strcoll somehow. */
return oldlen != newlen || memcasecmp (old, new, oldlen);
}
else if (HAVE_SETLOCALE && hard_LC_COLLATE)
return xmemcoll (old, oldlen, new, newlen);
else
return oldlen != newlen || memcmp (old, new, oldlen);
}
2.4 Comments on a function are much clearer if you use the argument names to speak about the
argument values. Variable names themselves should be lowercase, but write them in UPPER
case when you are speaking (referring) about their values.
Example 1:
/* right_justify Copy the SRC_LEN bytes of data beginning at SRC
into the DST_LEN-byte buffer, DST, so that the last source byte is
at the end of the destination buffer. If SRC_LEN is longer than
DST_LEN, then set *TRUNCATED to nonzero.
Set *RESULT to point to the beginning of (the portion of) the
source data in DST. Return the number of bytes remaining in the
destination buffer. */
static size_t
right_justify (char *dst, size_t dst_len, const char *src,
size_t src_len, char **result, int *truncated)
{
const char *sp;
char *dp;
if (src_len <= dst_len)
{
sp = src;
dp = dst + (dst_len - src_len);
*truncated = 0;
}
else
{
sp = src + (src_len - dst_len);
dp = dst;
src_len = dst_len;
*truncated = 1;
}
*result = memcpy (dp, sp, src_len);
return dst_len - src_len;
}
Example 2:
/* concat return a newly-allocated string whose contents concatenate
those of string S1, string S2 and, string S3. */
static char *
concat (const char *s1, const char *s2, const char *s3)
{
int len1 = strlen (s1), len2 = strlen (s2), len3 = strlen (s3);
char *result = (char *) xmalloc (len1 + len2 + len3 + 1);
strcpy (result, s1);
strcpy (result + len1, s2);
strcpy (result + len1 + len2, s3);
result[len1 + len2 + len3] = 0;
return result;
}
2.5 If there are global variables inside the programs, each should have a comment
preceding it. Generally global variables can have longer names.
Examples:
3. Code Review Checklist
3.1 All the program code and the training and testing data should be put into a tar distribution,
named module_name-ver.tgz, e.g., postagger-0.1.tgz.
3.2 There should be a README file inside the tar distribution, which should explain the
directories and other files inside the tar distribution.
3.3 There should be an INSTALL file inside the tar distribution that should explain how to
install and test that module.
3.4 There should be a Makefile file inside the tar distribution that should contain instructions
for build, install, clean, and uninstall the module.
3.5 There should be a brief description of the program at the beginning of each program file.
Normally the program names should be intuitive.
3.6 Brief descriptions of the functions/methods should be given at their beginning. Descriptions
of the function arguments and their types must also be given.
3.7 Significance of the return values of functions and programs, if any, must be explained.
3.8 All the tar distributions must have a docs directory which should contain SRS document of
the module as discussed.
3.9 The SRS document should have sections like, Flow Chart, Data Flow Diagram, and
Function Descriptions.
S.N. Description Yes/No
1 Module code available in tar distribution with version number, like,
module_name-0.2.tgz
2 README file available in the tar distribution
3 INSTALL file available in the tar distribution
4 docs directory available in the tar distribution
5 First level SRS as per the templates provided.
6 Flow Chart (footnote 4) for program flow available in the SRS document
7 Data Flow Diagram (footnote 5) available in the SRS document
8 Class Diagram and/or Sequence Diagram (footnote 6) available in the SRS document
9 Program description available at the beginning of the program. List each of
the program files and mention if description for each of them is available.
10 Brief description of the functions/methods available. List all the functions
available in a program.
11 Description of function arguments and types available
12 Significance of return values explained
Footnote 4: The Flow Chart should explain the basic flow of the application. It must show the decision points.
Footnote 5: If the module is designed with the Structured Design approach, then a DFD should be provided for that module.
Footnote 6: If the module is designed with Object Oriented Design, then a Class Diagram and Sequence Diagram should be available.
4. Module Packaging Guidelines
4.1 All the Programs Files (source code, binaries, and scripts), Data Files (training data & other
lexical resources like dictionary), and other documents must be packaged into a single tar
distribution.
4.3 The tar distribution must have a parent directory named module_name-version, like
postagger-2.0. All the other files and directories should be inside this parent directory.
4.4 A docs directory must be available inside parent directory which has README, INSTALL,
Copyright, and other documents. The parent directory must have src, bin, and data directory
for corresponding files.7
4.5 All the modules must be accompanied by a Makefile. This Makefile must contain
instructions for build, install, uninstall, and clean. A template Makefile will be
provided as a sample.
4.6 Makefile should build & install the application in the directory chosen by the user, like
/usr/local/sampark/
4.7 Any user on a Linux machine should be able to run the application from his home directory,
as long as application binaries/scripts are in the PATH environment variable.
4.8 A small shell script, <module_name>_run.sh should be made available to run and test the
application with the test/validation data made available by the developer of the application.
4.9 A symbolic link to this <module_name>_run.sh can be provided in /usr/bin (by the install
script) so that it can be accessed from anywhere to test the module.
4.10 Validation Data (in SSF Format) must be comprehensive enough to test the functionality of
the module.
4.11 Validation Data (in SSF Format) should be comprehensive enough to test the accuracy of
the module (desirable).
Footnote 7: Refer to the section System Deployment Directory Structure of the Sampark system.
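The prescribed package layout can be sketched with a few commands, using the hypothetical module postagger-2.0 from the examples above (file contents are omitted; only the structure is shown):

```shell
# Create the parent directory with src, bin, data, and docs, add the
# required top-level files, and roll everything into one tar distribution.
mkdir -p postagger-2.0/src postagger-2.0/bin postagger-2.0/data postagger-2.0/docs
touch postagger-2.0/Makefile postagger-2.0/docs/README postagger-2.0/docs/INSTALL
tar -czf postagger-2.0.tgz postagger-2.0
tar -tzf postagger-2.0.tgz | sort   # list the archive to verify the layout
```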
5. System Integration Guidelines
5.1 The system comprises various modules. Each module in turn comprises sub-modules
(programs). The programs of each module may be written in different languages, like C, CPP, Perl,
Java, Python, and Lex.
5.2 Each module should run as a standalone application, given valid test data in SSF format.
5.3 The Dashboard application shall be used to integrate the various modules of the Sampark system.
5.4 All the modules will be put in a sequential pipeline using a Dashboard specifications file.
5.5 In the specifications file we define the following for each module:
6. System Deployment Directory Structure
6.1 Directory Structure for Sampark System Deployment
Requirements: There would be several program files and data files in the Sampark system.
They will be organized on the basis of modules, versions, and their dependence on the source or
target language. The other files will also be organized in conventional style.
Directory Structure
4. <prefix_dir>/sampark/bin/<dep>/<module_name>/<lang|lang_pair|sys>/ will
contain the program binaries and scripts of the Sampark system
5. <prefix_dir>/sampark/data_bin/<dep>/<module_name>/<lang|lang_pair>/ will
contain the data_binaries (dictionary, trained parameters, etc.) of the Sampark
system
<dep> possible values for dep could be sl, tl, sl_tl, sys, or common.
sl contains analysis modules that are source language dependent
tl contains generation modules that are target language dependent
sl_tl contains transfer modules that are source-target language pair dependent
sys contains the modules that are independent of languages, like the format converter
programs text2ssf, wx2utf, ssf2tnt, tnt2ssf, and ssf2bio
common contains the modules/programs that are common to all languages.
<module_name> name of a Sampark system module; it could be tokenizer, morph,
postagger, chunker, lwg, or lex_transfer
<lang|lang_pair|common> possible values are tel, hin, tel_hin, tam, tam_hin, etc.
The above deployment directory structure of the Sampark MT system can be used for other systems
as well.
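An illustrative instantiation of this structure, assuming <prefix_dir> is /tmp/demo, the tel_hin pair, and the lex_transfer and morph modules (all choices here are examples, not a required configuration):

```shell
# Create a bin path for a language-pair-dependent transfer module and a
# data_bin path for a source-language-dependent module, then list them.
prefix=/tmp/demo
mkdir -p "$prefix/sampark/bin/sl_tl/lex_transfer/tel_hin" \
         "$prefix/sampark/data_bin/sl/morph/tel"
find "$prefix/sampark" -type d | sort
```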
6.3 Directory Structure for Sampark Validation Data
Requirements: Validation data of the Sampark system should be stored and made available for
future reference, for analysis, comparison, and improving the accuracy of the Sampark system.
Identification of the version should be possible at module level as well as system level. Identification
of the specs file and other run-time parameters should be possible when the data is referred to in
the future. Regression testing can then easily be performed in the future.
Example: When we run the Sampark system (version 2.2) on validation data (a file named gandhi)
for Telugu to Hindi with the specs file sampark_specs-2, and <prefix_dir> is /var, then we will have the
following directory and files.
/var/sampark/val_data/system/tel_hin/ver_2.2/gandhi/sampark_specs-2/
gandhi.rin_rout
gandhi.rin
gandhi.rout
sampark_specs-2
sampark_specs-2.exe
tokenizer.in
tokenizer.out
ssplit.in
ssplit.out
morph.in
morph.out
postagger.in
postagger.out
chunker.in
chunker.out
lwg.in
lwg.out
lex_trans.in
lex_trans.out
6.4 Directory Structure for standalone Module Validation
<prefix_dir>/sampark/val_data/modules/<dep>/<module_name>/ver_<x.y.[z]>/<nameof_validation_data>/<nameof_specs_file>/
<prefix_dir> Root directory where validation data will be stored, e.g., /var
<dep> possible values for dep could be sl, tl, sl_tl, or common. It tells us whether
the module is dependent on the source language or the target language, is dependent on
the language pair (source and target), or is language independent.
<module_name> name of the module that is being tested
<x.y.[z]> Version number of the module, e.g., 0.1,
<nameof_validation_data> Name of validation data, e.g., samachar, parichay,
mahatma, gandhi, shanti, sandesh, bangladesh war,
<nameof_specs_file> Name of the specs file which was used to run the validation
data, e.g., sampark_specs-3
Example: When we run the postagger (version 2.1) on validation data (a file named gandhi) for
Telugu with the specs file sampark_specs-2, and <prefix_dir> is /var, then we will have the following
directory and files.
/var/sampark/val_data/modules/sl/tel/postagger/ver_2.1/gandhi/sampark_specs-2/
gandhi.rin_rout
gandhi.rin
gandhi.rout
sampark_specs-2
sampark_specs-2.exe
postagger.in
postagger.out
The above validation directory structure of the Sampark MT system can be used for other systems as
well.
7. SRS Guidelines
The SRS for each module of the Sampark system should have the following sections, depending upon
the design approach followed by the module developers.
References
1. GNU Coding Standards
2. GNU Make
Appendix-B
Setting up Eucalyptus on a Cluster of Machines
This guide provides instructions for a single-cluster deployment of Eucalyptus Open Source
release 2.0.3 on a two-machine system running CentOS 5.7.
Eucalyptus Overview
Eucalyptus is a Linux-based software architecture that implements scalable private and hybrid
clouds within your existing IT infrastructure. Eucalyptus allows you to provision your own
collections of resources (hardware, storage, and network) using a self-service interface on an
as-needed basis.
Eucalyptus Components
Eucalyptus consists of the following distributed components:
- CLC: Cloud Controller (EC2 functionality)
- Walrus (S3 functionality)
- CC: Cluster Controller (middle-tier cluster management service)
- SC: Storage Controller (EBS functionality)
- NC: Node Controller (controls VM instances)
Hardware Requirements
You will need a minimum of two machines to host the Eucalyptus components. All the node machines
should have multicore processors with Virtualization Technology (VT) enabled.
Frontend/Base (CLC, Walrus, CC, SC): Minimum 100GB HDD and 4-8GB RAM
Node (NC): Minimum 100GB HDD and 4-8GB RAM
Be sure that all machines are running the latest release of CentOS 5.7. Test that all systems allow
ssh login, and that root access is available (sudo is OK).
Network Requirements
1. You must have a DHCP server available (installed but not running).
2. You must have a range of available public IP addresses. These will be assigned to Virtual
Machines (instances).
3. You must have a large range of available private IP addresses. These will be used by a virtual
subnet. They cannot overlap or contain any part of a physical network IP address space.
Software Requirements
You must have access to the following:
- CentOS 5.7 install CD
- Eucalyptus Fast Start media (the same can be downloaded from the Eucalyptus web site
http://open.eucalyptus.com)
B) Node Configuration
C) Frontend Configuration
E) FAQ
A) Install CentOS 5.7 on all machines
1- At the boot: prompt, press ENTER for the graphical installer. You can skip the media
verification if you like, then accept the defaults, with the following exceptions:
1.1 For network interface configuration, select Edit and manually configure
I) IP address
II) Netmask
III) Hostname
IV) Gateway
V) DNS
For example:
IP address: 192.168.1.65
netmask: 255.255.255.0
hostname: node1.in
gateway: leave it blank
DNS: leave it blank
1.3 Choose Install CentOS and click Next; choose 'Remove linux partitions on selected drives &
create default layout' and click Next.
Tip: We recommend that you install your node controllers first so that you can register them with
the frontend/base server as part of that install.
B) Node Configuration
1- Now log in as root on the node machine, insert the Eucalyptus Fast Start media (CD/DVD or
pen drive) and mount it to copy the cloud setup:
mkdir /media/usb/
mount -t vfat /dev/sdb1 /media/usb
Note: The USB drive might appear as a device other than /dev/sdb1 depending on your hardware
configuration (e.g., /dev/sdc1). Use the command fdisk -l to see the devices.
3- Install ntp to update the date and time of the other machines on the network.
4- For NC configuration:
cd /root/faststartusb
chmod +x *.sh
./fastinstall.sh
Answer n to the first question (this is not the front-end server).
Note: After the install completes, the machine will reboot automatically and be ready to use.
C) Frontend Configuration
1- Now log in as root on the frontend machine, insert the Eucalyptus Fast Start media (CD/DVD or
pen drive) and mount it to copy the cloud setup:
mkdir /media/usb/
mount -t vfat /dev/sdb1 /media/usb
Note: The USB drive might appear as a device other than /dev/sdb1 depending on your hardware
configuration (e.g., /dev/sdc1).
3- Synchronize the machine with the NTP server running on any machine in the cluster: install ntp
first and update the date and time using the ntpdate <NODE-IP> command (NODE-IP is the
IP address of the machine where the NTP server is running).
4- Ensure that the node(s) are accessible from your frontend machine: ping NODE-IP or ssh
root@NODE-IP (do not actually log in; only test whether it asks for a password).
cd /root/faststartusb
chmod +x *.sh
./fastinstall.sh
Answer y to the first question (this is the front-end server).
Services will now start and the script will wait for the CLC to run. The script registers components
(you will be asked about connecting to the server over ssh and to provide the root password).
As the last step on the frontend, you need to enter the IP address of each NC (you can have more
than one). Press ENTER (at the prompt) when you are done.
6- Log in as admin/admin.
8- Configure the dhcpd server to assign IPs to the virtual machines: copy the lines below into
dhcpd.conf, save and exit, and start the dhcpd server.
-------------------------------------------
option domain-name "lan.ltrc.org";
ddns-update-style none;
option subnet-mask 255.255.255.0;
default-lease-time 600;
max-lease-time 7200;
It will take some time to boot the instance; in the meantime, run the euca-describe-instances
command to see the status.
Once the status is 'running', you can log in and work on the newly launched instance.
For more help, try,
euca-run-instances --help
Login to instance
ssh -i mykey.private root@<accessible-instance-ip>
1- SSH Key Management
euca-add-keypair <keypair-name>
A pair of keys is created: the public key is stored in Eucalyptus, and the private key is stored in the
file mykey.private and printed to standard output. The ssh client requires strict permissions on
private keys:
chmod 0600 mykey.private
euca-delete-keypair <keypair-name>
2- Image Management
In order to run instances from images that you have created (or downloaded), you need to
bundle the images with your cloud credentials, upload them, and register them with the cloud.
Examples
Following is an example using the Ubuntu pre-packaged image that we provide using the included
KVM compatible kernel/ramdisk (a Xen compatible kernel/ramdisk is also included).
euca-bundle-image -i euca-ubuntu-9.04-x86_64/kvm-kernel/initrd.img-2.6.28-11-generic --ramdisk true
euca-upload-bundle -b ubuntu-ramdisk-bucket -m /tmp/initrd.img-2.6.28-11-generic.manifest.xml
euca-register ubuntu-ramdisk-bucket/initrd.img-2.6.28-11-generic.manifest.xml
(set the printed eri to $ERI)
euca-bundle-image -i euca-ubuntu-9.04-x86_64/ubuntu.9-04.x86-64.img --kernel $EKI --ramdisk $ERI
euca-upload-bundle -b ubuntu-image-bucket -m /tmp/ubuntu.9-04.x86-64.img.manifest.xml
euca-register ubuntu-image-bucket/ubuntu.9-04.x86-64.img.manifest.xml
euca-deregister <emi-XXXXXXXX>
Then, you can remove the files stored in your bucket, assuming you have sourced your 'eucarc' to
set up the EC2 client tools.
If you would like to remove the image and the bucket, add the '--clear' option.
Bundled images that have been uploaded may also be downloaded or deleted from the cloud.
For instance, to download the image(s) that have been uploaded to the bucket "image-bucket"
you may use the following command
euca-download-bundle -b image-bucket
euca-download-bundle --help
3- Virtual Machine (VM) Instance Management
For instance, to run an instance of the image with id "emi-53444344", the kernel "eki-34323333",
the ramdisk "eri-33344234", and the keypair "mykey", you can use the following
command,
To run more than one instance, you may use the "-n" or "--instance-count" option.
To get information about a specific instance, you can use the instance id as an argument to
euca-describe-instances. For example,
euca-describe-instances i-43035890
You can create dynamic block volumes, attach volumes to instances, detach volumes, delete
volumes, create snapshots from volumes, and create volumes from snapshots with your cloud.
Volumes are raw block devices. You can create a file system on top of an attached volume and
mount the volume inside a VM instance as a block device, and you can create instantaneous
snapshots from volumes and volumes from snapshots.
You may also create a volume from an existing snapshot. For example, to create a volume from
the snapshot "snap-33453345" in the zone "myzone", try the following command:
euca-create-volume --help
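The create commands themselves do not appear in the text; assuming the standard euca2ools options ('-s' size in GB, '-z' availability zone, '--snapshot' source snapshot), they would look like:

```shell
# Create a new 10 GB volume in the zone "myzone"
euca-create-volume -s 10 -z myzone

# Create a volume from the existing snapshot "snap-33453345" in "myzone"
euca-create-volume --snapshot snap-33453345 -z myzone
```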
You may attach block volumes to instances using "euca-attach-volume". You will need to specify the local block device name (this will be used inside the instance) and the instance identifier. For instance, to attach a volume "vol-33534456" to the instance "i-99838888" at "/dev/sdb", use the following command:
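The attach command itself is missing from the text; with the usual euca2ools flags ('-i' instance id, '-d' device name) it would be:

```shell
# Attach volume vol-33534456 to instance i-99838888 as /dev/sdb
euca-attach-volume -i i-99838888 -d /dev/sdb vol-33534456
```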
To detach a previously attached volume, use "euca-detach-volume". For example, to detach the volume "vol-33534456":
euca-detach-volume vol-33534456
You must detach a volume before terminating an instance or deleting a volume. If you fail to
detach a volume, it may leave the volume in an inconsistent state and you risk losing data.
To delete a volume, use "euca-delete-volume". For example, to delete the volume "vol-33534456", use the following command:
euca-delete-volume vol-33534456
You may only delete volumes that are not currently attached to instances.
You may create an instantaneous snapshot of a volume. A volume could be attached and in use
during a snapshot operation. For example, to create a snapshot of the volume "vol-33534456"
use the following command
euca-create-snapshot vol-33534456
To delete a snapshot, use "euca-delete-snapshot". For example, to delete the snapshot "snap-33453345", use the following command:
euca-delete-snapshot snap-33453345
5- How to rebuild or rebase an instance?
Log in to the virtual machine that you want to rebuild or rebase for future use. Ensure that euca2ools is installed on the instance, and synchronize its clock with the NTP server running in your private network.
Copy the cloud credentials to the instance and source the eucarc file. Before doing the following, ensure that the "euca-describe-availability-zones verbose" command runs properly.
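The rebundling commands themselves are not shown; a sketch of the usual euca2ools workflow from inside the instance (the prefix and bucket names here are illustrative, not from the original) is:

```shell
# Bundle the running instance's root filesystem (size in MB, staging directory /mnt)
euca-bundle-vol -p rebuilt-image -s 2048 -d /mnt

# Upload the bundle and register it as a new image
euca-upload-bundle -b rebuilt-image-bucket -m /mnt/rebuilt-image.manifest.xml
euca-register rebuilt-image-bucket/rebuilt-image.manifest.xml
```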
Appendix-C
Setting up Hadoop Cluster on a Cluster of Physical/Virtual Machines
Features in Hadoop
- Programming model: Map/Reduce.
- Data handling through the Hadoop Distributed File System (HDFS).
- Scheduling: dynamic task scheduling using a global queue (data locality, rack aware).
- Failure handling by re-execution of failed tasks and duplicate execution of slow tasks.
- HLL support: Pig Latin.
- Environment support: Linux clusters and Amazon Elastic MapReduce on EC2.
- Intermediate data transfer by file and HTTP.
Required Software
1. Java 1.6.x, preferably from Sun, must be installed.
2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage
remote Hadoop daemons.
The MapReduce engine also has a master/slave architecture, consisting of a single JobTracker as the master and a number of TaskTrackers as the slaves (workers). The JobTracker manages MapReduce jobs over the cluster and is responsible for monitoring jobs and assigning tasks to TaskTrackers. Each TaskTracker manages the execution of map and/or reduce tasks on a single computation node in the cluster.
A) Operating System CentOS-5.5 installation on all the physical machines in the
cluster
1- At the boot: prompt, press ENTER for the graphical installer. You can skip the media verification, then accept the defaults, with the following exceptions.
2- Choose Install CentOS and click Next. Select "Remove linux partitions on selected drives and create default layout" and click Next.
For example: if you have 3 physical machines on which to set up the Hadoop cluster, select one machine as the master node and the other two as slave nodes.
4- Package options
- Select Desktop - Gnome
- Select Server
- Select Customize later
- Click next
After the OS installation completes, the machine will reboot and ask for post-installation configuration.
8- Date and Time Setting -> change the date and time if they differ from the actual date and time.
9- User Creation -> add a user named hadoop with password hadoop.
Note: In case of virtual machines, installation of the OS is not needed if the virtual machines
already have an OS.
B) JAVA installation on all the physical/virtual machines in the cluster
Do the below steps on all the physical/virtual machines on which you want to setup the Hadoop
cluster.
1- Log in as the root user.
2- Copy the hadoopstartusb directory from the media (CD/DVD/USB) to /root/ in your machine.
5- Add the JAVA_HOME environment variable in the /etc/profile file, export it, and then add it to PATH. Open /etc/profile, add the below 3 lines at the end of the file, then save and exit.
JAVA_HOME=/usr/local/jdk1.6.0_27
export JAVA_HOME
PATH=/usr/local/jdk1.6.0_27/bin:$PATH
7- Edit the file /etc/hosts. By default it contains some lines; remove all of them, add the below 3 lines, then save and exit.
192.168.1.100 master.in master
192.168.1.101 slave1.in slave1
192.168.1.102 slave2.in slave2
9) Create the user hadoop and set its password to hadoop for the Hadoop framework installation, if this was not already done during the post-installation step of the CentOS installation.
adduser hadoop
passwd hadoop
Installing a Hadoop cluster means unpacking the software on all the machines in the cluster and changing the configuration files as per your requirements. Typically one machine in the cluster is designated as the master; the rest of the machines act as slaves.
Do the following steps in order on the master machine. We have indicated whenever a step is required to run on the slaves.
1) Log in as the hadoop user.
2a) Generate a public key on the master and copy it to all the slaves. It is used for passwordless connectivity from the master to the slaves.
ssh-keygen -t rsa
The above command asks you for the name of the file in which to save the key; press ENTER for the default. It then asks twice for a passphrase; press ENTER again for no passphrase.
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >>$HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
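The copy-to-slaves step itself is not shown; one way to push the master's key to the slaves named in /etc/hosts (you will be prompted for the hadoop password once per slave) is:

```shell
# Append the master's public key to each slave's authorized_keys
for slave in slave1.in slave2.in; do
  ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@$slave
done
```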
8) Open the file $HOME/hadoop-0.20.2/conf/core-site.xml and add the below lines in between
<configuration> element
<property>
<name>fs.default.name</name>
<value>hdfs://master.in:54310</value>
<description>The name and URI of the default FS.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>master.in:54311</value>
<description>Map Reduce jobtracker</description>
</property>
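A note of caution: in a standard Hadoop 0.20.x layout, the mapred.job.tracker property is read from conf/mapred-site.xml rather than core-site.xml; if jobs cannot find the JobTracker, try placing the property there instead, between the <configuration> tags:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>master.in:54311</value>
  <description>Map Reduce jobtracker host and port</description>
</property>
```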
10) Open the file $HOME/hadoop-0.20.2/conf/hdfs-site.xml and add the below lines in
between <configuration> element
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication</description>
</property>
Note: the replication value is set equal to the number of data nodes; if there are two slaves, then use <value>2</value>
11) Copy the modified hadoop-0.20.2 directory to each slave into hadoop user $HOME directory.
rsync -r $HOME/hadoop-0.20.2 hadoop@<node-ip>:
For example
rsync -r $HOME/hadoop-0.20.2 hadoop@192.168.1.101:
rsync -r $HOME/hadoop-0.20.2 hadoop@192.168.1.102:
Note: we presume the JAVA_HOME path is the same on each slave.
Tips: if the jps command output does not match what is stated above on the master and slaves, manually kill the process ids of the java programs with the kill -9 <process-id> command, clean the logs created under /tmp/hadoop-hadoop*, and repeat from step 13.
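The expected jps output referred to above does not appear in the text; on a typical Hadoop 0.20.x cluster it looks roughly like this (process ids will differ):

```shell
# On the master:
jps
# NameNode, SecondaryNameNode, JobTracker, and Jps should be listed

# On each slave:
jps
# DataNode, TaskTracker, and Jps should be listed
```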
D) Getting Started with Hadoop
Do the following steps on the master machine to check that your Hadoop cluster is working properly.
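The check command itself is not shown; a common smoke test, assuming a local text file input.txt in the hadoop user's home directory (an illustrative name, not from the original), is the bundled wordcount example:

```shell
# Start the HDFS and MapReduce daemons
$HOME/hadoop-0.20.2/bin/start-all.sh

# Copy a local file into HDFS and run the bundled wordcount job
$HOME/hadoop-0.20.2/bin/hadoop fs -put $HOME/input.txt input
$HOME/hadoop-0.20.2/bin/hadoop jar $HOME/hadoop-0.20.2/hadoop-0.20.2-examples.jar wordcount input output

# Inspect the result
$HOME/hadoop-0.20.2/bin/hadoop fs -cat output/part-r-00000
```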
Once the above command runs successfully, you can analyze the job at the following URL:
http://<mastermachineip>:50030
For example
http://192.168.1.100:50030
Bibliography
[12] Brown, A.W., & Wallnau, K.C.,: Engineering of Component-Based Systems in
Component Based Software Engineering, IEEE Computer Society Press, (1996).
[13] Chikofsky, E., & Cross, J.,: Reverse Engineering & Design Recovery: A Taxonomy,
IEEE Software, 7(1), (1990).
[14] Sangal R,: Dashboard: A Framework for Setting Blackboards, IIIT Hyderabad, (2005).
[15] Moona, R., Singh, S., Sangal, R., Sharma, D.M.,: MTeval: An Evaluation Methodology
for Machine Translation Systems, Language Technology Research Center, IIIT
Hyderabad, (2004).
[16] Anthes, G.,: Automated Translation of Indian Languages, CACM, Vol. 53 (1), (2010).
[17] Hayes-Roth, B.,: A Blackboard Architecture for Control, Art. Intelligence, (1985)
[18] EAGLES, Expert Advisory Group on Language Engineering: Evaluation of Natural
Language Processing Systems (Final Report), DG XIII of European Commission,
(1996).
[19] L. Bass, P. Clements & R. Kazman: Software Architecture in Practice, 2nd Edition,
Addison Wesley, 2003.
[20] D.D. Corkill: Blackboard Systems, AI Expert, Vol. 6, No. 9, pp: 40-47, Sept. 1991.
[21] H. Cunningham, and J. Scott: Software Architecture for Language Engineering, Journal
of Natural Language Engineering 10 (3/4), (2004).
[22] Amba Kulkarni: Development of Sanskrit Tool Kit and Sanskrit to Hindi Machine
Translation System (SHMT), Department of Sanskrit Studies, University of Hyderabad,
Hyderabad, http://sanskrit.uohyd.ernet.in/shmt/login.php, accessed on May 28, 2010.
[23] Kumar Pawan, Rathaur AK, Ahmad Rashid, Sinha Mukul K, Sangal Rajeev,
Dashboard: An Integration & Testing Platform based on Black Board Architecture for
NLP Applications, in Proceedings of Natural language Processing and Knowledge
Engineering (NLPKE) 2010, Beijing, China.
[24] Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Plachouras, V., and Silvestri,
F. The impact of caching on search engines. In Proc. of ACM SIGIR 2007, 183-190.
[25] Fagni, T., Perego, R., Silvestri, F., and Orlando, S. Boosting the performance of Web
search engines: Caching and prefetching query results by exploiting historical usage
data. ACM TOIS 24, 1 (Jan. 2006), 51-78.
[26] Markatos, E. P. On caching search engine query results. Computing Communications
24 (2001), 137-143.
[27] Zipf, George K., The Psychobiology of Language. Houghton-Mifflin. 1935, (see
citations at http://citeseer.ist.psu.edu/context/64879/0).
[28] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM
Press, 1999.
[29] Bernstein, P.: Middleware: A Model for Distributed System Services. Communications
of the ACM, 39:2, February 1996, 86-98.
[30] Campbell, A., Coulson, G., and Kounavis, M.: Managing Complexity: Middleware
Explained. IT Professional, IEEE Computer Society, 1:5, September/October 1999,
22-28.
[31] Kim, Jik-Soo, Andrade, H., Sussman, A.: Principles for Designing Data-/Compute-
intensive Distributed Applications and Middleware Systems for Heterogeneous
Environments, Journal of Parallel and Distributed Computing 67 (2007), pp: 755-771.
[32] Rifat Ozcan, Ismail Sengor Altingovde, Özgür Ulusoy: Static Query Result Caching
Revisited, WWW 2008, Beijing, China, April 2008.
[33] Andrade, H., Kurc, T., Sussman, A., Borovikov, E., Saltz, J., On cache replacement
policies for servicing mixed data intensive query workloads, in: Proceedings of the
Second Workshop on Caching, Coherence, and Consistency, held in conjunction with
the 16th ACM International Conference on Supercomputing, New York, NY, June
2002.
[34] J. Clerk Maxwell: A Treatise on Electricity and Magnetism, 3rd ed., vol. 2. Oxford:
Clarendon, 1892, pp. 68-73.
[35] J. Dean & S. Ghemawat, 2004: MapReduce: Simplified Data Processing on Large
Clusters, in OSDI 2004, Proc. of the 6th Symposium on Operating System Design &
Implementation, in co-operation with ACM SIGOPS, pp: 137-149.
[36] Rajeev Sangal, 1991: Programming Paradigms in Lisp.
[37] J. Lin & Chris Dyer, Data-Intensive Text Processing with MapReduce, University of
Maryland, USA, April 2010.
[38] Richard S. Bird, 1986: An Introduction to the Theory of Lists, Oxford University
Technical Monograph PRG-56.
[39] M. Cole, 1995: Parallel Programming with List Homomorphisms, Parallel Processing
Letters, Vol. 5, No. 2, pp. 191-203.
[40] B. He, W. Fang, Q. Luo, N.K. Govindaraju & T. Wang, 2008: Mars: A MapReduce
Framework on Graphics Processors, in PACT 08, Proc. of the 17th Conf. on Parallel
Architecture & Compilation Techniques, pp. 260-268.
[41] M. de Kruijf & K. Sankaralingam, 2007: MapReduce for the Cell B.E. Architecture,
University of Wisconsin, Comp. Sc. Tech. Report CS-TR-2007-1625.
[42] H. Karloff, S. Suri & S. Vassilvitskii, 2010: A Model of Computation for MapReduce, in
Proc. of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, Austin,
Texas.
[43] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, E. Harris,
2010: Reining in the Outliers in Map-Reduce Clusters using Mantri, in USENIX OSDI.
[44] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski & C. Kozyrakis, 2007
Evaluating MapReduce for Multi-core & Multi-processor systems in HPCA 07. Proc.
of the 2007 IEEE 13th Intl. Sym. on High Performance Computer Architecture, pp. 13-
24. Phoenix, Arizona.
[45] R.M.Yoo, A. Romano, C. Kozyrakis, Phoenix Rebirth: Scalable MapReduce on a
Large Scale Shared Memory System, Stanford University, Computer System
Laboratory, CA, USA.
[46] J. Ekanayake, Shrideep Pallickara and G. Fox, 2008: MapReduce for Data Intensive
Scientific Analyses, Dept. of Computer Science, Indiana University Bloomington, USA.
[47] M. Banko and E. Brill, 2001: Scaling to Very Very Large Corpora for Natural Language
Disambiguation, in Proc. of the 39th Annual Meeting of the Assoc. for Computational
Linguistics (ACL 2001), pp.: 26-33, Toulouse, France.
[48] C. Callison-Burch, C. Bannard & J. Schroeder, 2005: Scaling Phrase-based Statistical
Translation to Larger Corpora and Larger Phrases, in Proc. of the 43rd Annual Meeting
of the Assoc. for Computational Linguistics (ACL), pp.: 255-262, Ann Arbor, Michigan, USA.
[49] CT Chu, SM Kim, YA Lin, Y Yu, GR Bradski, AY Ng and K Olukotun, MapReduce for
Machine Learning on Multi-core, In Advances in Neural Information Processing
Systems 19 (NIPS 2006), pp.: 281-288, Vancouver, Canada.
[50] C. Dyer, Aaron Cordova, A. Mont, J. Lin, 2008: Fast, Easy & Cheap: Construction of
Statistical Machine Translation Models with MapReduce, in Proc. of the 3rd Workshop
on Statistical MT at ACL, Columbus, Ohio.
[51] Qin Gao & Stephan Vogel, 2010: Training Phrase-based Machine Translation Models
on the Cloud: Open Source Machine Translation Toolkit Chaski, The Prague Bulletin
of Mathematical Linguistics, 93: 37-46.
[52] Z.N. Grant-Duff & P. Harrison, 1996: Parallelism via Homomorphisms, Parallel
Processing Letters, Vol. 6, No. 2, pp. 279-295.
[53] Venugopal Ashish & Andreas Zollmann, 2009: Grammar Based Statistical MT on
Hadoop: An end-to-end Toolkit for Large Scale PSCFG based MT, The Prague
Bulletin of Mathematical Linguistics (91), 67-78.
[54] IDC Paper: Virtual Appliance vs Software Appliance
[55] J. S. David, D. Schuff, and R. S. Louis, Managing your total IT cost of ownership,
Communications of the ACM, vol. 45, no. 1, January 2002.
[56] Changhua Sun, Le He, Qingbo Wang and Ruth Willenborg, Simplifying Service
Deployment with Virtual Appliances, 2008 IEEE International Conference on Services
Computing.
[57] Rashid Ahmad, AK Rathaur, B Rambabu, Pawan Kumar, Mukul K Sinha, and Rajeev
Sangal, Provision of a Cache by a System Integration and Deployment Platform to
Enhance the Performance of Compute-Intensive NLP Applications,
[58] Rashid Ahmad, Pawan Kumar, B Rambabu, Phani Sajja, Mukul K Sinha, and Rajeev
Sangal, Enhancing Throughput of a Machine Translation System using MapReduce
Framework: An Engineering Approach, ICON-2011, Chennai, INDIA.
[59] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt,
and A. Warfield, Xen and the art of virtualization, in Proceedings of ACM symposium
on Operating systems principles, 2003.
[60] P. Chen and B. Noble: When virtual is better than real, in Proceedings of the Workshop
on Hot Topics in Operating Systems (HotOS), 2001, pp. 133-138.
[61] J. J. Wlodarz, Virtualization: A double-edged sword, 2007. [Online]. Available:
http://www.citebase.org/abstract?id=oai:arXiv.org:0705.2786
[62] R. Willenborg: Virtual appliances: panacea or problems? Oct 2007. [Online].
Available: http://www.ibm.com/developerworks/websphere/techjournal/0710_col_willenborg/0710_col_willenborg.html
[63] Vmware white paper : Virtual Appliances: A New Paradigm for Software Delivery
[64] S Shumate, Implications of Virtualization for Image Deployment, Dell Power
Solutions, October 2004
[65] A. Dearle: Software deployment, past, present and future, in International
Conference on Software Engineering (Future of Software Engineering), 2007.
[66] C. Sapuntzakis and M. S. Lam, Virtual appliances in the collective: A road to hassle-
free computing, in Proceedings of Workshop on Hot Topics in Operating Systems,
2003.
[67] What is a virtual appliance? July 2012, [Online]. Available: http://www.turnkeylinux.org
[68] Virtual Appliances: Improve Manageability & Automate Provisioning, White Paper
Published By: Swsoft
[69] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy
Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia:
Above the Clouds: A Berkeley View of Cloud Computing, Technical Report No.
UCB/EECS-2009-28, http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html,
February 10, 2009.
[70] Osterman Research White Paper, "Why You Should Consider Deploying Software
Appliances", Published December 2008
[71] Xen Virtualization, http://www.xen.org/, last accessed 30-Aug-2012
[72] Eucalyptus: http://open.eucalyptus.com/wiki, last accessed on 30-Aug-2012
[73] Apache Hadoop, http://hadoop.apache.org/, last accessed on 30-Aug-2012