THE EFFECT OF PARALLEL CORPUS QUALITY VS SIZE IN ENGLISH-TO-TURKISH SMT

A parallel corpus plays an important role in statistical machine translation (SMT) systems. In this study, we investigate the effects of parallel corpus size and quality on SMT performance. We develop a machine-learning-based classifier that labels parallel sentence pairs as high-quality or poor-quality. Applying this classifier to a corpus of 1 million English-Turkish parallel sentence pairs yields 600K high-quality pairs. We train multiple SMT systems on subsets of various sizes drawn from the entire raw parallel corpus and from the filtered high-quality corpus, and evaluate their performance. As expected, our experiments show that the size of the parallel corpus is a major factor in translation performance. However, instead of extending the corpus with all available “so-called” parallel data, better translation performance and reduced time complexity can be achieved with a smaller high-quality corpus obtained using a quality filter.
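To make the filtering step concrete, the following is a minimal sketch of how a pair-level quality filter could be applied to a parallel corpus. It is not the classifier used in this study: the features (token-length ratio, agreement of numeric tokens), the 0.4 threshold, and the names `pair_features`, `toy_is_high_quality`, and `filter_corpus` are all hypothetical, and the hand-set rule merely stands in for a trained classifier.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (English sentence, Turkish sentence)


def pair_features(src: str, tgt: str) -> Dict[str, float]:
    """Compute illustrative features for one sentence pair (assumed, not from the paper)."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    longest = max(len(src_tokens), len(tgt_tokens))
    # Ratio of the shorter to the longer sentence, in whitespace tokens.
    length_ratio = min(len(src_tokens), len(tgt_tokens)) / longest if longest else 0.0
    # Tokens containing digits should usually survive translation unchanged.
    src_nums = {t for t in src_tokens if any(c.isdigit() for c in t)}
    tgt_nums = {t for t in tgt_tokens if any(c.isdigit() for c in t)}
    return {
        "length_ratio": length_ratio,
        "numbers_match": 1.0 if src_nums == tgt_nums else 0.0,
    }


def toy_is_high_quality(features: Dict[str, float]) -> bool:
    """Hand-set stand-in for a trained classifier; the threshold is an assumption."""
    return features["length_ratio"] >= 0.4 and features["numbers_match"] == 1.0


def filter_corpus(
    pairs: Iterable[Pair],
    is_high_quality: Callable[[Dict[str, float]], bool] = toy_is_high_quality,
) -> List[Pair]:
    """Keep only the pairs that the quality predicate accepts."""
    return [(s, t) for s, t in pairs if is_high_quality(pair_features(s, t))]


if __name__ == "__main__":
    corpus = [
        ("The house is big .", "Ev büyük ."),
        ("Click here to subscribe to our newsletter today", "Tamam"),
        ("I have 3 cats .", "3 kedim var ."),
    ]
    for src, tgt in filter_corpus(corpus):
        print(src, "|||", tgt)
```

In practice the predicate passed to `filter_corpus` would be a model trained on pairs labeled as high-quality or poor-quality, and the filtered output would then serve as SMT training data.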
