Bilingual Data Cleaning for SMT using Graph-based Random Walk

The quality of bilingual data is a key factor in Statistical Machine Translation (SMT). Low-quality bilingual data tends to produce incorrect translation knowledge and also degrades translation modeling performance. Previous work often used supervised learning methods to filter lowquality data, but a fair amount of human labeled examples are needed which are not easy to obtain. To reduce the reliance on labeled examples, we propose an unsupervised method to clean bilingual data. The method leverages the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. End-to-end experiments show that the proposed method substantially improves the performance in largescale Chinese-to-English translation tasks.

[1]  Dekai Wu,et al.  Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora , 1997, CL.

[2]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[3]  Dragos Stefan Munteanu,et al.  Improving Machine Translation Performance by Exploiting Non-Parallel Corpora , 2005, CL.

[4]  Qingsheng Zhu,et al.  Mining Bilingual Data from the Web with Adaptively Learnt Patterns , 2009, ACL/IJCNLP.

[5]  Hermann Ney,et al.  Training Phrase Translation Models with Leaving-One-Out , 2010, ACL.

[6]  Lei Shi,et al.  A DOM Tree Alignment Model for Mining Parallel Data from the Web , 2006, ACL.

[7]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[8]  Hermann Ney,et al.  The Alignment Template Approach to Statistical Machine Translation , 2004, CL.

[9]  Xiaojun Wan,et al.  Towards an Iterative Reinforcement Approach for Simultaneous Document Summarization and Keyword Extraction , 2007, ACL.

[10]  George F. Foster,et al.  The Impact of Sentence Alignment Errors on Phrase-Based Machine Translation Performance , 2012, AMTA.

[11]  Noah A. Smith,et al.  The Web as a Parallel Corpus , 2003, CL.

[12]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[13]  Qun Liu,et al.  Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation , 2006, ACL.

[14]  Roland Kuhn,et al.  Phrasetable Smoothing for Statistical Machine Translation , 2006, EMNLP.

[15]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[16]  Philipp Koehn,et al.  Statistical Significance Tests for Machine Translation Evaluation , 2004, EMNLP.

[17]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[18]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[19]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.