Statistical computer-assisted translation

In recent years, statistical machine translation (MT) has improved significantly, but even the best MT technology is still far from replacing, or even competing with, human translators. An MT system can nevertheless increase the productivity of human translators. Usually, human translators edit the MT output to correct its errors, or they edit the source text beforehand to limit its vocabulary. One way to increase the productivity of the whole translation process (MT plus human work) is to incorporate the human corrections into the translation process itself, thereby shifting the MT paradigm to that of computer-assisted translation (CAT). In a CAT system, the human translator begins to type the translation of a given source text; with each character typed, the MT system interactively offers and refines a completion of the translation. The translator may continue typing or accept all or part of the completion. Here, we use a fully fledged translation system, phrase-based MT, to develop CAT systems. An important factor in a CAT system is the response time of the MT system. We describe an efficient representation of the search space using word hypothesis graphs, so as to guarantee a fast response time. The experiments are carried out on a small and a large standard task.

Skilled human translators are faster at dictating than at typing their translations, so a desirable feature of a CAT system is the integration of human speech. In a CAT system with integrated speech, two sources of information are available for recognizing the speech input: the target-language speech and the given source-language text. The target-language speech is a human-produced translation of the source-language text. The main challenge in integrating the automatic speech recognition (ASR) and MT models in a CAT system is the search.
The search in both the MT and the ASR systems is already very complex, so a full single search that combines the ASR and MT models would increase the complexity considerably. A full single search is further complicated by the lack of any specific model or appropriate training data. In this work, we study different methods for integrating the ASR and MT models. We propose several new integration methods based on N-best list and word graph rescoring strategies. We study the integration of both single-word-based MT and phrase-based MT with ASR models. The experiments are performed on a standard large task, namely the European Parliament Plenary Sessions.

A CAT system might be equipped with a memory-based module that does not actually translate but instead retrieves the translation from a large database of exact or similar matches among sentences or phrases that are already known. Such databases, known as bilingual corpora, are also essential for training statistical MT models. Therefore, a larger database means a more accurate and faster translation system. In this thesis, we also investigate efficient ways to compile bilingual sentence-aligned corpora from the Internet. We propose two new methods for sentence alignment. The first is an extension of existing methods in the field of sentence alignment for parallel texts. We show how sentence-length-based models, word-to-word translation models, cognates, bilingual lexica, and any other features can be employed efficiently. The second is a new method that aligns sentences by bipartite graph matching. We show that this new algorithm performs competitively with other methods on parallel corpora and, at the same time, is very useful for handling differences in sentence order between a source text and its translation.
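To make the bipartite matching idea concrete, the following is a minimal sketch, not the algorithm of the thesis. It treats source sentences and target sentences as the two sides of a bipartite graph and picks the one-to-one assignment with the lowest total cost. The cost function (relative difference in character length), the function names, and the example sentences are all assumptions made for illustration; the abstract does not specify the actual features or solver. The assignment is found by brute force over permutations, which is only feasible for a handful of sentences; a polynomial-time solver such as the Hungarian method would take its place in practice.

```python
from itertools import permutations


def length_cost(src: str, tgt: str) -> float:
    """Toy cost (an assumption): relative difference in character length."""
    ls, lt = len(src), len(tgt)
    return abs(ls - lt) / max(ls, lt, 1)


def align_by_matching(src_sents, tgt_sents, cost=length_cost):
    """Return the minimum-cost one-to-one assignment as (src_index, tgt_index) pairs.

    Brute force over all permutations, so only suitable for tiny inputs.
    """
    if len(src_sents) != len(tgt_sents):
        raise ValueError("this sketch assumes equally many sentences on both sides")
    n = len(src_sents)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        # perm[i] is the target sentence assigned to source sentence i.
        total = sum(cost(src_sents[i], tgt_sents[j]) for i, j in enumerate(perm))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(enumerate(best_perm))


if __name__ == "__main__":
    source = [
        "Hello.",
        "This is a much longer sentence about translation.",
        "Short one here.",
    ]
    # The translation presents the first two sentences in swapped order.
    target = [
        "Ceci est une phrase beaucoup plus longue sur la traduction.",
        "Bonjour.",
        "Une courte ici.",
    ]
    for i, j in align_by_matching(source, target):
        print(f"{source[i]!r} <-> {target[j]!r}")
```

Because the matching places no monotonicity constraint on the pairs, a sentence that moved to a different position in the translation can still be paired with its counterpart, which is the property highlighted above for texts whose sentence order differs.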
Further, we propose an efficient way to recognize and filter out wrong sentence pairs from the bilingual corpora.
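The N-best rescoring strategy mentioned above for combining ASR and MT models can be illustrated with a small sketch. It assumes that the speech recognizer returns a list of candidate transcriptions with log scores, that an MT model can score each candidate as a translation of the source text, and that the two scores are combined as a weighted sum. The class name, the weights, and all score values are hypothetical; the abstract does not give the actual combination formula.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    asr_logprob: float  # log score from the speech recognizer (made-up value)
    mt_logprob: float   # log score as a translation of the source (made-up value)


def rescore(nbest, asr_weight=1.0, mt_weight=1.0):
    """Return the hypotheses sorted by weighted sum of log scores, best first."""
    return sorted(
        nbest,
        key=lambda h: asr_weight * h.asr_logprob + mt_weight * h.mt_logprob,
        reverse=True,
    )


if __name__ == "__main__":
    nbest = [
        Hypothesis("the house is small", asr_logprob=-4.0, mt_logprob=-2.0),
        Hypothesis("the mouse is small", asr_logprob=-3.5, mt_logprob=-6.0),
        Hypothesis("the house is mall", asr_logprob=-3.8, mt_logprob=-7.5),
    ]
    print("ASR only:", rescore(nbest, mt_weight=0.0)[0].text)
    print("ASR + MT:", rescore(nbest)[0].text)
```

In this toy list the recognizer alone slightly prefers an acoustically similar but wrong sentence, and the MT score, which knows the source text, moves the intended translation to the top. Rescoring a fixed list avoids a joint search over both models, at the price of only being able to recover hypotheses that the recognizer already proposed.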
