
NLP Research at Amrita

Overview

Our Natural Language Processing (NLP) group was formed in 2006 with a project
funded by the Department of Information Technology (DIT) called "English to Indian
Languages: Machine Translation System". Our group's long-term focus is to develop
a robust Machine Translation (MT) system for regional languages such as Tamil and
Malayalam. Towards this goal, we have developed a prototype Machine
Translation system using Lexicalized Tree Adjoining Grammar (LTAG) parsing and
target-language generation. Prior to working on the transfer-rule-based approach, we
tested an English-Tamil Statistical Machine Translation (SMT) system with comparable
training corpora of 5000 sentences. The results were not promising; one obvious
reason was the small size of the training corpus. The other major factors that
affected the system's performance were "sandhi" and the "morphologically complex
nature" of the language. In Tamil, verbs and nouns undergo morphological changes,
and a minimum of 200 word forms can be constructed from a single noun or verb,
which makes it very difficult for an SMT system to learn the translation
probabilities even if the training corpus is very large. Another plausible approach
to this problem within the SMT framework would be to incorporate linguistic
information such as Part of Speech and morphological information in the training
data by developing Morphological Analyzers (MA) and Part of Speech (PoS) taggers
for the target language. Though recent advancements in SMT allow the features we
mentioned through Factored Translation Models, these techniques are yet to be
tested, as no off-the-shelf MA and PoS taggers were readily available for Tamil.
This led us to put more emphasis on developing linguistic
resources and rule-based systems as a means of solving the MT problem rather than
opting for corpus-based approaches. Other NLP-related tools we have developed
include a Morphological Analyzer/Synthesizer for Tamil using Finite State
Transducers, a Part of Speech (PoS) tagger for Tamil, Named Entity Recognition
(NER) and Transliteration (in progress), a Sentence Aligner and font converters. We
are also active in other natural language areas such as Speech Processing and Text
Summarization.
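
The sparsity argument above can be illustrated with a minimal sketch. It assumes a pipe-separated "surface|lemma|pos|morph" token layout of the kind commonly used for factored SMT training data, and it uses hand-written English stand-ins rather than output from our Tamil analyzer or tagger. The names ANALYSES, to_factored and type_counts are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only: the analyses below are hand-written English
# stand-ins, not output from any Tamil morphological analyzer or PoS tagger.
ANALYSES = [
    # (surface form, lemma, PoS tag, morphological tag)
    ("walks", "walk", "V", "3SG"),
    ("walked", "walk", "V", "PAST"),
    ("walking", "walk", "V", "PROG"),
    ("houses", "house", "N", "PL"),
    ("house", "house", "N", "SG"),
]


def to_factored(analysis, sep="|"):
    """Join the factors of one token into a 'surface|lemma|pos|morph' string."""
    return sep.join(analysis)


def type_counts(analyses):
    """Return (distinct surface forms, distinct lemmas) for the analyses."""
    surface_types = {a[0] for a in analyses}
    lemma_types = {a[1] for a in analyses}
    return len(surface_types), len(lemma_types)


if __name__ == "__main__":
    # One factored "sentence" as it might appear in a training file.
    print(" ".join(to_factored(a) for a in ANALYSES))
    # Surface-form types versus lemma types.
    print(type_counts(ANALYSES))
```

In this toy sample, five surface forms collapse to two lemmas. With roughly 200 word forms per Tamil noun or verb, the gap would be far larger, which is why statistics gathered over lemma and morphological factors may be easier to estimate than statistics over surface forms alone.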


People

S.No Name Academic Experience
1. Dr. K.P. Soman
   B.Sc. (Engg.), P.M.Dip. (SQC&OR, ISI-Calcutta), M.Tech. (IIT-Kgp.), Ph.D. (IIT-Kgp.)
   Professor and Head, CEN
   16 years of research and teaching experience at IIT-Kharagpur and Amrita. Has
   around 60 publications in national and international journals and conference
   proceedings. Has organized a series of workshops and summer schools in
   e-commerce, fractals, chaos, neural networks, and wavelets for industry and
   academia. Has authored the books "Insight into Wavelets", "Insight into Data
   Mining" and "Support Vector Machines and Other Kernel Methods" (to be
   published), published by Prentice Hall, New Delhi. Project Investigator of the
   ongoing consortia project for Automated Machine Translation.

2. Dr. A.G. Menon
   Visiting Professor
   Senior Linguist. Has over 30 years of teaching and research experience at the
   University of Leiden, Netherlands. Has various publications in popular
   linguistic journals.

3. C. Santhosh Kumar
   Assistant Professor
   Graduated from the University of Cambridge in 1989. Worked with the Institute of
   Systems Science, National University of Singapore, and Kent Ridge Digital
   Labs, Singapore, on natural language related projects. Has published articles
   in journals and conferences. Primary research interests include Speech
   Recognition, NLP and Text to Speech synthesis.

4. Mr. Vijay Krishna Menon
   Research Associate
   Obtained an MSc (Applied Science, Software Engineering) degree from
   Bharathiar University. Currently pursuing an MS by Research (Computational
   Engineering), specializing in Machine Translation using Synchronous TAG,
   Probabilistic Grammars and POS Taggers.

5. Mr. Saravanan
   Research Associate
   Completed a BTech in Information Technology from Anna University. He is
   working on models for Morphological Synthesis, Probabilistic Models and Verb
   Classification for Tamil.

6. Mr. R. Loganathan
   Research Associate
   Currently pursuing an MS by Research in Computational Informatics. His primary
   research interests include statistical approaches to Natural Language
   Processing and Machine Learning.

7. Mr. Shivapratap Gopakumar
   Research Associate
   Obtained an MTech in Computational Engineering and Networking from Amrita
   Vishwa Vidyapeetham. His primary research interests include Information
   Retrieval (IR) and analysis of algorithms.

8. Ms. Dhanalakshmi
   PhD Scholar (Tamil Linguist)
   Currently pursuing a PhD in linguistics at Tamil University, Tanjore. Her
   primary interests include Syntax and Grammar development for Tamil. Currently
   developing a PoS tagger for Tamil.

9. Ms. Kaveri
   Linguist
   Obtained a PhD in Tamil from Madurai Kamaraj University. Currently dedicated to
   building Parallel Corpora. Works in corpus management and alignment.

10. Ms. Meena
    Programmer
    Completed a BE in Electronics and Communication; currently working as a
    programmer.

Financial requirements for supporting the following:
1. Five Research Scholars
2. Scholarships for a few MTech students in Computational Engineering
3. Linguistic resource building
4. Travel

Deliverables
1. Morphological analyzer for Tamil.
2. Morphological analyzer for Malayalam.
3. Translation Engine for bilingual pair English-to-Tamil.
4. Translation Engine for bilingual pair English-to-Malayalam.
5. Translation Engine for bilingual pair Tamil-to-Malayalam.
6. Translation Engine for bilingual pair Malayalam-to-Tamil.
7. Tool for Grammar Development.
8. Tool for Sentence Alignment.
9. SVM-based POS Tagger for Tamil and Malayalam.


Curriculum Vitae


Name : Dr. Soman K. P.
Designation : Professor & Head, CEN,
Centre for Excellence in Computational Engg. and
Networking, Amrita Vishwa Vidyapeetham.

Degrees conferred

Degree          Institution Conferring the Degree          Field(s)                                  Year
B.Sc. (Engg)    Regional Engineering College, Calicut      Electrical Engineering                    1984
P.M. Diploma    Indian Statistical Institute, Calcutta     Quality Control and Operations Research   1986
M.Tech.         Indian Institute of Technology, Kharagpur  Reliability Engineering                   1989
Ph.D.           Indian Institute of Technology, Kharagpur  Systems Engineering                       1994



Research Work / Training Experience :

Institution                                 Nature of Work Done                 Duration
Indian Institute of Technology, Kharagpur   Estimation of System Reliability    4 years (1990 - 94)
Keltron Counters, Trivandrum                Quality Control                     3 months (1995)
GKW, Howrah                                 Quality Control                     6 months (1996)


Current fields of research : Interior Point Methods for Optimisation Problems,
High Performance Computing, Machine learning using Support Vector
Machines, Signal Processing, Wavelets and Fractals.

Research Experience

Sponsored Research

"Analysis of PIV and PLIF Images for Fluid Flow Characterization and
Visualization", for the Indian Space Research Organization (ISRO), Government of
India.

"Study for Development of Methodology Detection Tools for Digital Contents
Plagiarism and other Piracy Software", for the DIT, Ministry of
Communications and Information Technology, Government of India.

"Under Water Sonic Data Classification using Massively Parallel Support
Vector Machines", for the National Physical Oceanographic Laboratory,
Government of India.

"Study of Image Fusion Techniques and Quality Metrics", for the Defence
Research and Development Organisation, Government of India.

"Machine Translation System - From English to Tamil", for the DIT, Ministry of
Communications and Information Technology, Government of India.

"Video Summarization for Content Based Video Retrieval", for the Indian Space
Research Organization (ISRO), Government of India.

List of publications : (partial list of publications)

Books

"Insight into Wavelets", Soman K.P., K.I. Ramachandran, Prentice Hall
India, 2004.
"Insight into Data Mining", Soman K.P., Shyam Diwakar, Ajay V., Prentice
Hall India, 2006.
"Support Vector Machines and Kernel Methods", Soman K.P., Ajay V.,
Loganathan R., Prentice Hall India, 2007.

Papers:

1. K. P. Soman and K. B. Misra, Multistate Fault Tree Analysis using
Fuzzy Probability Vectors and Resolution Identity, contributed as a
chapter of the book "Fuzzy sets and probability theory in reliability
and safety related problem", edited by Takehisa Onisawa and Janusz
Kacprzyk.
2. K. P. Soman and K. B. Misra, Moments of order statistics using orthogonal
inverse expansion method and its application in reliability,
Microelectronics and Reliability Vol.32, pp. 469 - 473, 1992.
3. K. P. Soman and K. B. Misra, Fuzzy Fault Tree analysis using resolution
identity, International Journal of Fuzzy Sets and Mathematics, Vol. I,
No.1, pp. 193 - 212, 1993.
4. S. Patra, K. P. Soman and K. B. Misra, Fuzzy Reliability of a
Communication Network, International Journal of Fuzzy Sets and
Mathematics, Vol. I, No. 3, pp. 58 - 64, 1993.
5. K. P. Soman and K. B. Misra, A Simple Method of Computing Variance
Sensibility Coefficients of Top Events in a Fault Tree, Microelectronics and
Reliability, Vol. 34, No. 5, pp. 929 - 934, 1994.
6. K. P. Soman and K. B. Misra, Bayesian Sequential Estimation of
Parameters of a Weibull Distribution, Microelectronics and Reliability, Vol.
35, pp. 1010 - 1015, 1994.
7. K. P. Soman and K. B. Misra, Bayesian Estimation of System Reliability,
Microelectronics and Reliability, Vol. 33, No. 10, pp. 1455 - 1459, 1993.
8. K. P. Soman and K. B. Misra, A Least Square Estimation of Three
Parameters of a Weibull Distribution, Microelectronics and Reliability, Vol.
32, pp. 303 - 305, 1992.
9. K. P. Soman and K. B. Misra, A Simple Method for Determining the
Moments of a Top Event, International Journal of Quality and Reliability
Management, Vol. 13, No. 5, pp. 50 - 60, 1996.
10. S. Patra, K. P. Soman and K. B. Misra, Fuzzy Event Tree Analysis of a
Power System using Bayesian and Fuzzy Set Approach, Institution of
Engineers, India, pp. 60 - 65.
11. K. P. Soman, A General Algorithm for Uncertainty Propagation in Fault
Trees, 7th International M.I.R.C.E. Symposium on System Operational
Effectiveness, 1998, University of Exeter, United Kingdom.
12. K. P. Soman and M. V. Asokan, Prediction of Reliability Based on Early
Failures - A Case Study, 7th International M.I.R.C.E. Symposium on
System Operational Effectiveness, 1998, University of Exeter, United
Kingdom.
13. K. P. Soman, Sequential Estimation of Weibull Parameters using Kalman
Filter and its application in Quality and Reliability Engineering, 7th
International M.I.R.C.E. Symposium on System Operational Effectiveness,
1998, University of Exeter, United Kingdom.
14. K. P. Soman, Left Shoulder Detection using Support Vector Machines,
ICFAI Journal of Applied Finance, Vol. 8, pp. 5 - 13, May 2002.
15. K. P. Soman, M. Shyam Diwakar, P. Madhavadas, Efficient classification
and analysis of Ischemic Heart Disease Using Proximal Support Vector
Machine based Decision Trees, IEEE Region 10 Conference, Indian
Institute of Science, Bangalore, October 2003.



Curriculum Vitae

Name: A.G. Menon
Date of Birth: 03-03-1940

After a B.Sc. (1962) with Chemistry and Physics as the two major subjects, I completed
an M.A. (1964) in Tamil Language and Literature and Linguistics (First Class). I attended
summer schools in Linguistics conducted by Indian and American faculty
members. I took my Ph.D. in Linguistics in 1972. The dissertation consists of a little
more than three thousand pages and follows the Dutch linguistic theories, with
emphasis on statistics and the notion of "Centre and Periphery" in linguistic
analysis. All these degrees are from Kerala University, though a major part of
the linguistic analysis was carried out at Leiden University, The Netherlands. The
major fields of research and publication cover Dravidian Linguistics with emphasis
on Tamil and Malayalam, Colonial Linguistics and the use of computers in linguistics.
As early as the early 1970s, a computer was used for the analysis of Malayalam at
Leiden University, with the enthusiastic cooperation of the Leiden University
computer centre.

I went to Leiden University, The Netherlands, in 1968 at their invitation to
continue my linguistics research in collaboration with one of the greatest Dutch
specialists in this field, Prof. Dr. F.B.J. Kuiper. At their insistence I continued at the
same university till my retirement in 2005 as professor of Dravidian languages and
linguistics, comparative Dravidian and Dravidian cultural history. I still continue
here as an honorary member of the faculty.

Since 2007 I have worked as Visiting Professor of Linguistics at Amrita University,
Ettimadai, Coimbatore, India, and function as one of the members of a team of
enthusiastic experts working in the field of NLP and, in particular, on Machine
Translation.

In connection with my academic research and international conferences, I have
visited the relevant universities in Europe, America (including South America), Japan,
Malaysia, Singapore, Mauritius and Thailand.

I also serve as an external examiner for Indian universities for the evaluation of
Ph.D. dissertations in Linguistics.

Currently President of the International Association of Dravidian Linguists (for the
academic year 2007-2008).

Though my publications cover all the research fields mentioned earlier, I am
furnishing below the bibliographic information of a few of them:

1. "Some Problems of Lexicography" (in Tamil), Centamil 61, Madurai (1966),
pp. 207-214.
2. "Meaning - its nature and analysis" (in Tamil), Centamil 63, Madurai (1967),
pp. 35-44.
3. "Tamil Numeral Constructions", Linguistic Circle, University of Kerala
(1967), pp. 1-19.
4. "Vocative", Four Papers on Literature and Language, University of Kerala
(1968), pp. 41-48.
5. "Voicing and Voicelessness in Tolkaappiyam", Indo-Iranian Journal, Vol. XI,
No. 1 (1968), pp. 24-28. (International journal published from The
Netherlands.)
6. "From Proto-Tamil-Malayalam to West Coast Dialects", Indo-Iranian Journal,
Vol. XIV, No. 1/2 (1972), pp. 52-60.
7. A Grammatical Study of Kamparaamaayanam, dissertation 1972, about 3000
pages.
8. "Tamil Verb Classification", Proceedings of the 29th International Congress of
Orientalists, Paris (1973), pp. 136-148.
9. "Computer and Dravidian Linguistics", Bulletin of the Association for Literary
and Linguistic Computing, Vol. 1, Nr. 3, Michaelmas Term, England (1973).
10. Raamacaritam - Text, Frequency and Indices, Computer output, (1974),
about 900 pages, Leiden University, The Netherlands.
11. "Adjectives in Tamil", Fourth International Conference-Seminar of Tamil
Studies, Jaffna, Sri Lanka (1974), pp. 1-11.
12. "The Relative Participle in -iya of Modern Malayalam", Indo-Iranian Journal
18 (1977), pp. 211-225.
13. Review of the Tirunelveli Tamil Dialect, Indo-Iranian Journal 18 (1976), pp.
327-331.
14. "A Critical Reconsideration of R. Thapar's Dravidian hypothesis for the
location of Meluhha, Dilmun and Makan", Journal of the Economic and Social
History of the Orient, Vol. XXI, Part II (1978), pp. 113-145. Co-author: Dr.
E.C.L. During Caspers (The Netherlands).
15. An Outline of Tamil (Literature), in Japanese, Journal of Naritasan Institute of
Buddhist Studies, No. 4 (1979), Tokyo, pp. 1-50. Co-author: Dr. Shokei
Matsumoto.
16. "Tamil Literary Traditions and the Oldest Malayalam Ramayana", Proceedings
of the Vth International Conference-Seminar of Tamil Studies, Vol. II, Section
8, Madurai, India (1981), pp. 73-82.
17. "Some Observations on seventeenth century Malayalam", Indo-Iranian
Journal 25 (1983), pp. 241-273.
18. "The case of the unknown sound change in Malayalam", VIth South Asian
Language Analysis Roundtable: Literacy and Linguistic Change,
Austin, University of Texas, U.S.A. (1984), pp. 1-20.
19. "Review of Proto-Elamo-Dravidian: the evidence and its implications",
Bibliotheca Orientalis XLIV, Nrs. 3/4 (1987), pp. 495-499 (The Netherlands).
20. "Centre and Periphery: the composition of Kamban's language", International
Journal of Dravidian Linguistics, Vol. XVII, Nr. 1 (1988), pp. 72-78.
21. "Phonetic Change and Phoneme Substitution in Malayalam: c- > t- and s- >
t-", Proceedings of the First International Seminar on Dravidian Linguistics and
the 14th All India Conference of Dravidian Linguists, Dravidian Linguistics
Association, Trivandrum (1989), pp. 28-66.
22. "Some Observations of the Sub-group Tamil-Malayalam: differential
realizations of the cluster *nt", Bulletin of the School of Oriental and African
Studies, University of London, Vol. LIII, Part 1 (1990), pp. 87-99.
23. "Linguistic Convergence: the Tamil-Hindi Auxiliaries", Bulletin of the School
of Oriental and African Studies, University of London, Vol. LIII, Part 2 (1990),
pp. 266-282. Co-author: Dr.G.H. Schokker.
24. Review of William Bright: Language Variation in South Asia, Oxford: Oxford
University Press, 1990 in the Bulletin of the School of Oriental and African
Studies, University of London, Vol. LVI, Part 1 (1993), pp. 161-162.
25. "Tamil Verb Stem Formation", International Journal of Dravidian Linguistics,
Vol. XXVII, Nr. 1 (1998), pp. 13-40.
26. "Copper Plates to Silver Plates: Cholas, Dutch and Buddhism", International
Journal of Dravidian Linguistics, Vol. XXIX, Nr. 2 (2000), pp. 83-106.
27. "Colonial Linguistics and the Spoken Language", International Journal of
Dravidian Linguistics, Vol. XXXII, Nr. 1 (2003), pp. 73-86.

Prof.Dr.A.G.Menon
Leiden
