International Journal in Foundations of Computer Science & Technology (IJFCST), Vol. 3, No.4, July 2013
ABSTRACT
Named Entity Recognition (NER) is the task of recognizing Named Entities or Proper Nouns in a document and classifying them into different Named Entity classes. In this paper we introduce our modified tool, which not only performs NER in any natural language and supports Corpus Development (i.e. assists in developing Training and Testing documents), but also addresses the unknown words problem in NER, handles spurious words, and automatically computes the Performance Metrics of an NER based system, i.e. Recall, Precision and F-Measure.
KEYWORDS
NER, Transliteration, Unknown words, Performance Metrics
1. INTRODUCTION
Named Entity Recognition (NER) is one of the application areas of Natural Language Processing, in which Named Entities are identified and then categorised into different classes of Named Entities. The classes of Named Entities can include the name of a person, location, organization, state, sport, river, city or country, as well as percentage, time, quantity etc. Applications of NER include Information Extraction, Machine Translation, Question Answering Systems, Information Retrieval, Automatic Summarization etc.
For example, consider the following Training Sentences:
Ram/PER is/OTHER a/OTHER intelligent/OTHER boy/OTHER
Deepa/PER lives/OTHER in/OTHER Nagpur/CITY
Ankit/PER is/OTHER a/OTHER football/SPORT player/OTHER
Aabhas/PER plays/OTHER cricket/SPORT
In the tagged English training text above, PER denotes that Ram, Deepa, Ankit and Aabhas are Names of Persons. Nagpur is tagged with the CITY tag since it is the Name of a City. Similarly, football and cricket are names of Sports, so they are tagged with the SPORT tag. The entities tagged with the OTHER tag are not Named Entities.
The tagged sentences are input to the HMM Train module, which computes the HMM Parameters, i.e. Start Probability, Transition Probability and Emission Probability. The HMM Parameters and the Testing sentences are input to the HMM Test module, and the Named Entities are derived using the Viterbi Algorithm. If the testing sentence in NER is given as:
DOI:10.5121/ijfcst.2013.3408
Aabhas lives in Nagpur
then the output of the NER based system for this testing sentence is the list of Named Entities along with their tags, i.e. Aabhas/PER and Nagpur/CITY.
We have developed NERHMM, a language independent NER tool based on the Hidden Markov Model technique [1][2]. In this paper, we discuss our modified tool.
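The training and decoding steps of the example above can be sketched in a few lines of Python. This is only an illustrative sketch, not the NERHMM implementation: the function names (parse, normalise, train_hmm, viterbi) are hypothetical, the probabilities are assumed to be plain relative frequencies without smoothing, and the four training sentences are the ones listed in the Introduction.

```python
from collections import Counter, defaultdict

# The four tagged training sentences from the Introduction.
TRAINING = [
    "Ram/PER is/OTHER a/OTHER intelligent/OTHER boy/OTHER",
    "Deepa/PER lives/OTHER in/OTHER Nagpur/CITY",
    "Ankit/PER is/OTHER a/OTHER football/SPORT player/OTHER",
    "Aabhas/PER plays/OTHER cricket/SPORT",
]


def parse(sentence):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in sentence.split()]


def normalise(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def train_hmm(sentences):
    """Estimate start, transition and emission probabilities by counting."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in sentences:
        pairs = parse(sentence)
        start[pairs[0][1]] += 1                      # tag of the first word
        for word, tag in pairs:
            emit[tag][word] += 1                     # tag emits word
        for (_, prev_tag), (_, next_tag) in zip(pairs, pairs[1:]):
            trans[prev_tag][next_tag] += 1           # tag follows tag
    return (
        normalise(start),
        {tag: normalise(c) for tag, c in trans.items()},
        {tag: normalise(c) for tag, c in emit.items()},
    )


def viterbi(words, start, trans, emit):
    """Return the most probable tag sequence for a non-empty list of words."""
    states = list(emit)
    # best[t][s]: probability of the best tag sequence ending in s at position t
    best = [{s: start.get(s, 0.0) * emit[s].get(words[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans.get(p, {}).get(s, 0.0))
            score = best[t - 1][prev] * trans.get(prev, {}).get(s, 0.0)
            best[t][s] = score * emit[s].get(words[t], 0.0)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    tag = max(states, key=lambda s: best[-1][s])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = back[t][tag]
        path.append(tag)
    return path[::-1]


start_p, trans_p, emit_p = train_hmm(TRAINING)
words = "Aabhas lives in Nagpur".split()
tags = viterbi(words, start_p, trans_p, emit_p)
entities = [(w, t) for w, t in zip(words, tags) if t != "OTHER"]
print(entities)  # [('Aabhas', 'PER'), ('Nagpur', 'CITY')]
```

Because this sketch uses no smoothing, a word that never occurred in training receives zero emission probability for every tag, so every path scores zero. That is the unknown words problem that the tool addresses with its transliteration module.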
Similarly, we can also develop a Testing document using our tool. Our tool is therefore capable of performing Corpus Development for both training and testing. After obtaining the annotated corpus, we click on the TRAIN HMM button and choose the file to be used for training by clicking on the Browse button. The HMM parameters (Start Probability, Transition Probability and Emission Probability) are calculated and can be viewed by clicking on the View Parameters button. This is shown in Fig 2.
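As a rough illustration of what reading an annotated corpus involves before training, the sketch below parses lines in the word/TAG format used in the Introduction. The function name read_annotated_corpus and its default_tag parameter are hypothetical. Giving untagged tokens the OTHER tag mirrors the way the paper describes spurious words being treated, but the exact behaviour of the tool is an assumption here.

```python
def read_annotated_corpus(lines, default_tag="OTHER"):
    """Parse 'word/TAG' lines into lists of (word, tag) pairs.

    Tokens without a usable '/TAG' suffix are kept and given default_tag
    (an assumption about how untagged words are treated).
    """
    corpus = []
    for line in lines:
        sentence = []
        for token in line.split():
            word, sep, tag = token.rpartition("/")
            if sep and word and tag:
                sentence.append((word, tag))
            else:
                sentence.append((token, default_tag))
        if sentence:  # skip blank lines
            corpus.append(sentence)
    return corpus


sample = ["Deepa/PER lives in/OTHER Nagpur/CITY", ""]
corpus = read_annotated_corpus(sample)
print(corpus)
```

The function accepts any iterable of lines, so an open text file can be passed to it directly.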
Now, when we click on the TEST HMM button, we can either click on the Browse button to select an existing file for testing, or build a testing file by clicking on the button named Develop a new testing Corpus. Once a testing file has been selected, the Viterbi algorithm is run; it accepts all the HMM parameters computed by the tool and displays the optimal state sequence, as shown in Fig 3.
If any unknown word appears in the testing file, the transliteration module is run so that the unknown word can be handled. Our system can perform training and testing in any language when dealing with known words. When dealing with unknown words, our system can handle only those words that appear in one of the following languages: Hindi, Punjabi, Marathi, Bengali, Telugu, Urdu, English and French. We have tried to solve the problem of unknown words using a Transliteration approach.
Our system is also capable of handling spurious words. Spurious words are those that are found to be untagged in the training file. Such words are tagged as OTHER, i.e. Not-a-Named-Entity, by our system.
When we click on the SAVE OUTPUT button, the output of the NER based system can be saved in a file. When we click on the NER EVALUATION button, the Performance Metrics of the NER based system are calculated automatically and displayed in a new window, as shown in Fig 4.
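The Performance Metrics mentioned above can be illustrated with a short Python sketch. The formulas are not spelled out in this part of the paper, so the sketch assumes the usual token-level definitions: Precision is the fraction of predicted Named Entity tags that are correct, Recall is the fraction of gold Named Entity tags that are recovered, and F-Measure is their harmonic mean. The function name ner_metrics is hypothetical.

```python
def ner_metrics(gold, predicted, non_entity="OTHER"):
    """Token-level Precision, Recall and F-Measure over Named Entity tags."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted tag sequences differ in length")
    # A token counts as correct when it is a Named Entity and both tags agree.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p and g != non_entity)
    n_predicted = sum(1 for p in predicted if p != non_entity)
    n_gold = sum(1 for g in gold if g != non_entity)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Gold tags for "Aabhas lives in Nagpur" and a system output that misses Nagpur.
gold = ["PER", "OTHER", "OTHER", "CITY"]
predicted = ["PER", "OTHER", "OTHER", "OTHER"]
precision, recall, f_measure = ner_metrics(gold, predicted)
print(precision, recall, round(f_measure, 3))  # 1.0 0.5 0.667
```

In this example the system finds one of the two Named Entities and makes no false predictions, which gives Precision 1.0, Recall 0.5 and F-Measure 2/3.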
6. CONCLUSION
We have performed Named Entity Recognition using a Hidden Markov Model in natural languages such as Hindi, Marathi, Punjabi, Telugu, Urdu, Bengali, English and French.
Existing tools for Named Entity Recognition are highly language dependent and domain specific. A need was therefore felt for a tool that is language independent and can work in any domain, so we developed a tool that uses a Hidden Markov Model to perform NER in natural languages and in any domain. We have also tried to solve the problem of unknown words in Named Entity Recognition using a Transliteration approach. Our system is also capable of performing NER on multilingual data: if the Named Entities in the training file are in one language and the same Named Entities appear in the testing file in another language, then these Named Entities can be identified using the Transliteration approach.
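The multilingual case can be illustrated with a toy Python sketch. It is not the tool's transliteration module: the word-level table TRANSLITERATE is a hypothetical stand-in for a real transliterator, a plain dictionary lookup stands in for the HMM, and falling back to OTHER when nothing matches is an assumption. The English tags are taken from the training sentences in the Introduction.

```python
# Tags seen in an English training file (from the Introduction's example).
TRAINED_TAGS = {
    "Ram": "PER", "Deepa": "PER", "Ankit": "PER", "Aabhas": "PER",
    "Nagpur": "CITY", "football": "SPORT", "cricket": "SPORT",
}

# Hypothetical stand-in for a transliteration module:
# a tiny Hindi (Devanagari) to Latin word table.
TRANSLITERATE = {"राम": "Ram", "नागपुर": "Nagpur"}


def tag_word(word):
    """Tag a word, transliterating it first if it is unknown."""
    if word in TRAINED_TAGS:
        return TRAINED_TAGS[word]
    candidate = TRANSLITERATE.get(word)   # None when no transliteration exists
    if candidate in TRAINED_TAGS:
        return TRAINED_TAGS[candidate]
    return "OTHER"                        # assumed fallback for unmatched words


print(tag_word("नागपुर"))  # CITY
```

Here the Hindi form of Nagpur is unknown to the English training data, but its transliteration matches a trained Named Entity and so receives the CITY tag.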
ACKNOWLEDGEMENT
We would like to thank all those who helped us in accomplishing this task.
REFERENCES
[1] Sudha Morwal and Deepti Chopra, "NERHMM: A Tool For Named Entity Recognition based on Hidden Markov Model", International Journal on Natural Language Computing (IJNLC), Vol.2, No.2, April 2013, pp. 43-49. DOI:10.5121/ijnlc.2013.2204. Available at: http://airccse.org/journal/ijnlc/papers/2213ijnlc04.pdf
[2] Sudha Morwal and Deepti Chopra, "Identification and Classification of Named Entities in Indian Languages", International Journal on Natural Language Computing (IJNLC), Vol.2, No.1, February 2013, pp. 37-43. DOI:10.5121/ijnlc.2013.210. Available at: http://airccse.org/journal/ijnlc/papers/1412ijnlc02.pdf
[3] Sudha Morwal, Nusrat Jahan and Deepti Chopra, "Named Entity Recognition using Hidden Markov Model (HMM)", International Journal on Natural Language Computing (IJNLC), Vol.1, No.4, December 2012, pp. 15-23. DOI:10.5121/ijnlc.2012.1402. Available at: http://airccse.org/journal/ijnlc/papers/1412ijnlc02.pdf
[4] Deepti Chopra, Nusrat Jahan and Sudha Morwal, "Hindi Named Entity Recognition By Using Rule Based Heuristics And Hidden Markov Model", International Journal of Information Sciences and Techniques (IJIST), Vol.2, No.6, November 2012. DOI:10.5121/ijist.2012.2604. Available at: http://airccse.org/journal/IS/papers/2612ijist04.pdf
[5] G.V.S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju and K.S.M.V. Kumar, "Named Entity Recognition for Telugu Using Maximum Entropy Model".
[6] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu and Dr. A. Govardhan, "A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu", IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
[7] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay, "Language Independent Named Entity Recognition in Indian Languages", in Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 33-40, Hyderabad, India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[8] Darvinder Kaur and Vishal Gupta, "A survey of Named Entity Recognition in English and other Indian Languages", IJCSI International Journal of Computer Science Issues, Vol.7, Issue 6, November 2010.
[9] Shilpi Srivastava, Mukund Sanglikar and D.C Kothari, "Named Entity Recognition System for Hindi Language: A Hybrid Approach", International Journal of Computational Linguistics (IJCL), Volume (2): Issue (1): 2011. Available at: http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf
Authors
Deepti Chopra is working as Assistant Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She received her B.Tech degree in Computer Science and Engineering from Rajasthan College of Engineering for Women, Jaipur, Rajasthan in 2011, and her M.Tech in Computer Science and Engineering from Banasthali University, Rajasthan in 2013. Her research interests include Artificial Intelligence, Natural Language Processing, and Information Retrieval. She has published many papers in International journals and conferences.
Sudha Morwal is an active researcher in the field of Natural Language Processing. She is currently working as Associate Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She holds M.Tech (Computer Science), NET and M.Sc (Computer Science) qualifications, and her PhD is in progress at Banasthali University (Rajasthan), India. She has published many papers in International Conferences and Journals.
Dr. G. N. Purohit is a Professor in the Department of Mathematics & Statistics at Banasthali University (Rajasthan). Before joining Banasthali University, he was Professor and Head of the Department of Mathematics, University of Rajasthan, Jaipur. He has been Chief-editor of a research journal and a regular reviewer for many journals. His present interests are in O.R., Discrete Mathematics and Communication networks. He has published around 40 research papers in various journals.