Parallel sentence generation from comparable corpora for improved SMT

Parallel corpora are an essential resource for statistical machine translation (SMT), but they are often not available in sufficient quantity for every domain and language pair. We present an approach that produces parallel corpora from available comparable corpora. An SMT system translates the source-language side of a comparable corpus, and the translations are used as queries for information retrieval (IR) over the target-language side. Simple filters then score each SMT output against the sentence returned by IR, and the filter score defines the degree of similarity between the two. Using SMT output also makes it possible to attempt to correct one of the common errors by removing sentence tails. The approach was applied to Arabic–English and French–English systems using comparable news corpora, and it yielded considerable improvements in BLEU score. We show that the approach is independent of the quality of the SMT system used to generate the queries, which strengthens the claim that it is applicable to languages and domains for which only limited parallel corpora are initially available. We also compare our approach with an earlier one and show that ours is easier to implement and gives equally good improvements.
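The translate, retrieve, and filter loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names (`retrieve`, `word_error_rate`, `mine_pairs`), the bag-of-words cosine retrieval, the use of word error rate as the "simple filter", the `max_wer` threshold, and the dictionary lookup standing in for an SMT system. Sentence tail removal is not covered.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase whitespace tokenization (an illustrative simplification)."""
    return sentence.lower().split()


def retrieve(query, candidates):
    """Toy IR step: return the candidate with the highest bag-of-words cosine."""
    q = Counter(tokenize(query))

    def cosine(candidate):
        d = Counter(tokenize(candidate))
        dot = sum(q[w] * d[w] for w in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * \
            math.sqrt(sum(v * v for v in d.values()))
        return dot / norm if norm else 0.0

    return max(candidates, key=cosine)


def word_error_rate(hypothesis, reference):
    """Token-level edit distance divided by the reference length."""
    h, r = tokenize(hypothesis), tokenize(reference)
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                # drop a hypothesis word
                           cur[j - 1] + 1,             # add a reference word
                           prev[j - 1] + (hw != rw)))  # match or substitute
        prev = cur
    return prev[-1] / max(len(r), 1)


def mine_pairs(source_sentences, translate, target_sentences, max_wer=0.5):
    """Keep (source, retrieved target, score) triples that pass the filter."""
    pairs = []
    for src in source_sentences:
        query = translate(src)                    # SMT output used as the query
        best = retrieve(query, target_sentences)  # IR over the target side
        score = word_error_rate(query, best)      # filter score
        if score <= max_wer:
            pairs.append((src, best, score))
    return pairs


# Hypothetical toy data; the dictionary lookup stands in for an SMT system.
toy_translations = {"le chat dort": "the cat sleeps", "il pleut": "it rains"}
targets = ["the cat sleeps on the mat",
           "stock markets fell sharply",
           "the cat sleeps"]
mined = mine_pairs(list(toy_translations), toy_translations.get, targets)
print(mined)
```

In this toy run, only the first source sentence finds a target sentence that is close enough to its translation. The second is rejected by the filter, which is the behaviour one would want for source sentences that have no counterpart in the comparable corpus.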
