Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields

Machine transliteration has received significant research attention in recent years. In most cases, the source language has been English and the target language an Asian language. This paper focuses on Hindi to English machine transliteration of Indian named entities such as proper nouns, place names and organization names using conditional random fields (CRF). Hindi is an official language of India and is spoken by more than 500 million Indians. Hindi is the world's fourth most commonly used language after Chinese, English and Spanish. The system takes an Indian place name as input in Hindi, written in Devanagari script, and transliterates it into English. The input is provided to the system in syllabified form so that n-gram techniques can be applied. As more than 50% of named entities are formed as a combination of two or three syllabic units, an n-gram approach with unigrams, bigrams and trigrams of Hindi syllabic units is used to train the system on the corpus. The system performs satisfactorily with trigrams as compared to unigrams and bigrams.

General Terms: Machine Transliteration
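The n-gram setup described in the abstract can be illustrated with a minimal sketch. It assumes the input is already syllabified and shows one plausible way to derive unigram, bigram and trigram features per syllabic unit for a CRF-style sequence labeller. The function names, the boundary markers `<s>` and `</s>`, the choice of a previous/current/next window for the trigram, and the example syllabification of मुंबई are all illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List

# Assumed boundary markers for positions outside the syllable sequence.
START, END = "<s>", "</s>"


def ngram_features(syllables: List[str], i: int) -> Dict[str, str]:
    """Return unigram, bigram and trigram features for the syllable at index i.

    The trigram here is (previous, current, next); the paper may define
    its context window differently.
    """
    prev_syl = syllables[i - 1] if i > 0 else START
    next_syl = syllables[i + 1] if i < len(syllables) - 1 else END
    current = syllables[i]
    return {
        "unigram": current,
        "bigram": prev_syl + "|" + current,
        "trigram": prev_syl + "|" + current + "|" + next_syl,
    }


def sequence_features(syllables: List[str]) -> List[Dict[str, str]]:
    """Build one feature dictionary per syllabic unit in the name."""
    return [ngram_features(syllables, i) for i in range(len(syllables))]


# Hand-made, hypothetical syllabification of a place name (not from the paper's corpus).
source = ["मुं", "ब", "ई"]
# Hypothetical target labels, one English unit per Hindi syllabic unit.
target = ["mum", "ba", "i"]

features = sequence_features(source)
for feats, label in zip(features, target):
    print(feats, "->", label)
```

Feature dictionaries of this shape, paired with the per-syllable target labels, are the kind of input a CRF toolkit typically accepts for training. Restricting the dictionary to the `unigram` key, or to `unigram` and `bigram`, would give the smaller models that the abstract compares against the trigram one.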
