Improved Named Entity Recognition using Machine Translation-based Cross-lingual Information

In this paper, we describe a technique to improve named entity recognition in a resource-poor language (Hindi) by using cross-lingual information. We use an online machine translation system to translate each Hindi sentence into English, followed by a separate word alignment phase that finds the projection of each Hindi word in the translated English sentence. We estimate cross-lingual features from the output of an English named entity recognizer and the alignment information, and use these features in a support vector machine-based classifier. The cross-lingual features improve the F1 score by 2.1 points absolute (2.9% relative) over a well-performing baseline model.
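
The projection step described above can be illustrated with a minimal sketch. It assumes the translation, word alignment, and English named entity tags are already available as plain Python data; the function names `project_tags` and `cross_lingual_features`, the majority-vote rule for multiply aligned words, the tag labels, and the feature name `en_tag` are all hypothetical choices made for illustration, not the feature design used in the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def project_tags(
    n_src: int,
    en_tags: Sequence[str],
    alignment: Sequence[Tuple[int, int]],
    default: str = "O",
) -> List[str]:
    """Project English NE tags onto source-language (e.g. Hindi) tokens.

    alignment holds (source_index, english_index) pairs. A source token
    aligned to several English tokens receives the most frequent
    non-default tag among them (an assumed tie-breaking rule); an
    unaligned token receives the default tag.
    """
    aligned: Dict[int, List[str]] = {i: [] for i in range(n_src)}
    for src_i, en_j in alignment:
        # Skip alignment links that point outside either sentence.
        if 0 <= src_i < n_src and 0 <= en_j < len(en_tags):
            aligned[src_i].append(en_tags[en_j])

    projected = []
    for i in range(n_src):
        entity_tags = [t for t in aligned[i] if t != default]
        if entity_tags:
            projected.append(Counter(entity_tags).most_common(1)[0][0])
        else:
            projected.append(default)
    return projected


def cross_lingual_features(
    src_tokens: Sequence[str], projected: Sequence[str]
) -> List[Dict[str, str]]:
    """Build one feature dict per source token, ready for vectorization."""
    return [
        {"word": tok, "en_tag": tag}
        for tok, tag in zip(src_tokens, projected)
    ]


# Toy example with a transliterated Hindi sentence and its translation
# "Sachin lives in Mumbai"; the tags and alignment are made up by hand.
hindi = ["Sachin", "Mumbai", "mein", "rehte", "hain"]
english_tags = ["PERSON", "O", "O", "LOCATION"]
links = [(0, 0), (1, 3), (2, 2), (3, 1), (4, 1)]

projected_tags = project_tags(len(hindi), english_tags, links)
features = cross_lingual_features(hindi, projected_tags)
```

Feature dictionaries of this shape could then be vectorized and combined with monolingual features before training a support vector machine classifier; how the cross-lingual features are actually encoded and weighted is not specified in the abstract.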
