Toward integrating word sense and entity disambiguation into statistical machine translation

We describe a machine translation approach being designed at HKUST to integrate semantic processing into statistical machine translation, beginning with entity and word sense disambiguation. We show how integrating the semantic modules consistently improves translation quality across several data sets. We report results on five different IWSLT 2006 speech translation tasks, representing HKUST’s first participation in the IWSLT spoken language translation evaluation campaign. We translated both read and spontaneous speech transcriptions from Chinese to English, achieving reasonable performance even though our system is essentially text-based and therefore not designed or tuned to tackle the challenges of speech translation. We also find that the system achieves reasonable results across several source languages, by evaluating on read speech transcriptions from Arabic, Italian, and Japanese into English.
