Joint bilingual name tagging for parallel corpora

Traditional isolated monolingual name taggers tend to yield inconsistent results across two languages. In this paper, we propose two novel approaches to jointly and consistently extract names from parallel corpora. The first approach uses standard linear-chain Conditional Random Fields (CRFs) as the learning framework, incorporating cross-lingual features propagated between the two languages. The second approach is based on a joint CRF model that decodes sentence pairs together, incorporating bilingual factors based on word alignment. Experiments on Chinese-English parallel corpora demonstrate that the proposed methods significantly outperform monolingual name taggers, are robust to automatic alignment noise, and achieve state-of-the-art performance. With only 20% of the training data, the proposed methods already outperform the baseline trained on the whole training set.
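As a rough illustration of the first approach, the sketch below shows one way cross-lingual features could be propagated along word alignments before training a linear-chain CRF. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `propagate_features`, the feature keys, the label scheme, and the toy sentence pair are all hypothetical, and the abstract does not specify the actual feature set.

```python
from collections import defaultdict


def propagate_features(src_tokens, tgt_tokens, tgt_labels, alignment):
    """Build one feature dict per source token.

    Each source token receives the labels and words of the target tokens
    it is aligned to. All feature names here are hypothetical.
    alignment is a list of (source_index, target_index) pairs.
    """
    aligned = defaultdict(list)
    for s, t in alignment:
        aligned[s].append(t)

    features = []
    for i, tok in enumerate(src_tokens):
        feats = {"word": tok}
        tgt_idx = sorted(aligned.get(i, []))
        if tgt_idx:
            # Labels predicted by a tagger on the other language,
            # projected onto this token through the alignment.
            feats["aligned_labels"] = "|".join(tgt_labels[t] for t in tgt_idx)
            feats["aligned_words"] = "|".join(tgt_tokens[t].lower() for t in tgt_idx)
        else:
            # Mark tokens with no alignment so a model can treat them differently.
            feats["unaligned"] = True
        features.append(feats)
    return features


# Toy sentence pair with a one-to-one alignment (illustrative data only).
zh = ["比尔", "盖茨", "访问", "北京"]
en = ["Bill", "Gates", "visited", "Beijing"]
en_labels = ["B-PER", "I-PER", "O", "B-GPE"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]

zh_features = propagate_features(zh, en, en_labels, alignment)
```

The resulting list of feature dicts has the shape that common CRF toolkits accept as per-token input. The same function could be applied in the opposite direction to give the English tagger features derived from Chinese labels.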
