Multilingual named entity extraction and translation from text and speech

Named entities (NEs), the nouns or noun phrases referring to persons, locations and organizations, are among the most information-bearing linguistic structures. Extracting and translating named entities benefits many natural language processing tasks, such as cross-lingual information retrieval, cross-lingual question answering and machine translation. In this thesis we propose an efficient and effective framework to extract and translate NEs from text and speech. We adopt the hidden Markov model (HMM) as a baseline NE extraction system and investigate its performance in multiple language pairs with varying amounts of training data. We extend the baseline text NE tagger with a context-based NE extraction model, which aims to detect and correct NE recognition errors in automatic speech recognition hypotheses. We also adapt the NE tagger trained on broadcast news to meeting transcripts. We develop several language-independent features that capture phonetic and semantic similarity between source and target NE pairs. We incorporate these features to solve various NE translation problems arising in different language pairs (Chinese to English, Arabic to English and Hindi to English), with varying resources (parallel and non-parallel corpora as well as the World Wide Web) and different input data streams (text and speech). We also propose a cluster-specific name transliteration framework. By grouping names of similar origin into one cluster and training cluster-specific transliteration and language models, we dramatically reduce name transliteration error rates.
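
As a rough illustration of what an HMM-based NE tagger does, the sketch below decodes the most likely tag sequence for a sentence with the Viterbi algorithm. It is a toy, not the tagger developed in the thesis: the tag set, the probability tables (`START`, `TRANS`, `EMIT`), the unseen-word floor `UNK` and the function name `viterbi` are all hypothetical values chosen by hand for the example, whereas a real system would estimate its parameters from annotated training data.

```python
import math

# Toy tag set: person, location, and "other" (not a named entity).
STATES = ["PER", "LOC", "O"]

# Hypothetical, hand-set probabilities for illustration only.
START = {"PER": 0.2, "LOC": 0.2, "O": 0.6}
TRANS = {
    "PER": {"PER": 0.5, "LOC": 0.05, "O": 0.45},
    "LOC": {"PER": 0.05, "LOC": 0.4, "O": 0.55},
    "O": {"PER": 0.15, "LOC": 0.15, "O": 0.7},
}
EMIT = {
    "PER": {"john": 0.5, "smith": 0.4},
    "LOC": {"paris": 0.6, "london": 0.3},
    "O": {"visited": 0.3, "in": 0.3, "and": 0.2},
}
UNK = 1e-4  # assumed floor probability for words unseen under a tag


def emit_logp(state, word):
    """Log emission probability of a word under a tag, with a floor."""
    return math.log(EMIT[state].get(word.lower(), UNK))


def viterbi(words):
    """Return the most likely tag sequence for a list of tokens."""
    if not words:
        return []
    # Scores of the best path ending in each state at the first token.
    score = {s: math.log(START[s]) + emit_logp(s, words[0]) for s in STATES}
    backpointers = []
    for word in words[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            # Best previous state for reaching state s at this token.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + emit_logp(s, word)
            ptr[s] = prev
        score = new_score
        backpointers.append(ptr)
    # Follow the backpointers from the best final state.
    tags = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        tags.append(ptr[tags[-1]])
    return tags[::-1]


if __name__ == "__main__":
    print(viterbi(["John", "Smith", "visited", "Paris"]))
```

Working in log space avoids numerical underflow on long sentences, and the floor for unseen words keeps every path scoreable when a token is missing from the emission table.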
