Paper information - Multilingual named entity extraction and translation from text and speech

Named entities (NEs), the nouns or noun phrases referring to persons, locations and organizations, are among the most information-bearing linguistic structures. Extracting and translating named entities benefits many natural language processing problems such as cross-lingual information retrieval, cross-lingual question answering and machine translation. In this thesis we propose an efficient and effective framework to extract and translate NEs from text and speech. We adopt the hidden Markov model (HMM) as a baseline NE extraction system, and investigate its performance in multiple language pairs with varying amounts of training data. We expand the baseline text NE tagger with a context-based NE extraction model, which aims to detect and correct NE recognition errors in automatic speech recognition hypotheses. We also adapt the NE tagger trained on broadcast news to meeting transcripts. We develop several language-independent features to capture phonetic and semantic similarity between source and target NE pairs. We incorporate these features to solve various NE translation problems arising in different language pairs (Chinese to English, Arabic to English and Hindi to English), with varying resources (parallel and non-parallel corpora as well as the World Wide Web) and different input data streams (text and speech). We also propose a cluster-specific name transliteration framework. By grouping names of similar origin into one cluster and training cluster-specific transliteration and language models, we dramatically reduce name transliteration error rates.
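To make the HMM baseline mentioned in the abstract more concrete, the following is a minimal sketch of Viterbi decoding for NE tagging. It is not the thesis's system: the tag set (`O`, `PER`, `LOC`), the probability tables, the unknown-word floor `unk` and the function name `viterbi` are all illustrative assumptions chosen so that the example is small enough to check by hand.

```python
import math

# Hypothetical tag set: O = not a named entity, PER = person, LOC = location.
STATES = ["O", "PER", "LOC"]

# Toy parameters, invented for illustration only (not taken from the thesis).
start_p = {"O": 0.8, "PER": 0.1, "LOC": 0.1}
trans_p = {
    "O": {"O": 0.8, "PER": 0.1, "LOC": 0.1},
    "PER": {"O": 0.5, "PER": 0.45, "LOC": 0.05},
    "LOC": {"O": 0.6, "PER": 0.05, "LOC": 0.35},
}
emit_p = {
    "O": {"visited": 0.3, "in": 0.3, "the": 0.3},
    "PER": {"fei": 0.5, "huang": 0.5},
    "LOC": {"pittsburgh": 0.9},
}


def viterbi(tokens, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most likely tag sequence for `tokens` under the toy HMM."""
    # best[i][s] = (log-probability of the best path ending in s at i, previous state)
    best = [{}]
    for s in STATES:
        emission = emit_p[s].get(tokens[0], unk)  # unseen words get a small floor
        best[0][s] = (math.log(start_p[s]) + math.log(emission), None)

    for i in range(1, len(tokens)):
        best.append({})
        for s in STATES:
            emission = math.log(emit_p[s].get(tokens[i], unk))
            prev, score = max(
                ((p, best[i - 1][p][0] + math.log(trans_p[p][s])) for p in STATES),
                key=lambda pair: pair[1],
            )
            best[i][s] = (score + emission, prev)

    # Follow back-pointers from the best final state.
    path = [max(STATES, key=lambda s: best[-1][s][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


tokens = ["fei", "huang", "visited", "pittsburgh"]
print(viterbi(tokens, start_p, trans_p, emit_p))  # ['PER', 'PER', 'O', 'LOC']
```

In a real tagger the transition and emission tables would be estimated from annotated training data rather than written by hand, and the state inventory would typically be richer than the three tags assumed here.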

A. Waibel | Fei Huang
