Ensembles of Classifiers for Cleaning Web Parallel Corpora and Translation Memories

Recent years have witnessed growing interest in automatic methods for spotting false translation units in translation memories. The problem is of great interest to industry, as many translation memories contain errors. A closely related line of research deals with identifying sentences that do not align in parallel corpora mined from the web. The task of spotting false translations is modeled as a binary classification problem. It is known that, under certain conditions, ensembles of classifiers improve over the performance of their individual members. In this paper we benchmark the most popular ensembles of classifiers (Majority Voting, Bagging, Stacking, and AdaBoost) on the task of spotting false translation units in translation memories and parallel web corpora. We want to know whether, for this specific problem, any ensemble technique improves on the performance of the individual classifiers, and whether translation memories and parallel web corpora differ with respect to this task.
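To illustrate why ensembles can help only under certain conditions, here is a minimal sketch of majority voting over binary predictions. It is not the paper's implementation. The function names, the label encoding (1 = false translation unit, 0 = correct unit), and the assumption that the classifiers err independently are all hypothetical choices made for this example.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the label chosen by most classifiers.

    `predictions` holds one binary label per classifier
    (assumed encoding: 1 = false translation unit, 0 = correct unit).
    Use an odd number of classifiers so that no tie can occur.
    """
    return Counter(predictions).most_common(1)[0][0]


def majority_accuracy(p, n):
    """Accuracy of a majority vote over n classifiers.

    Assumes (idealization) that each classifier is correct with the
    same probability p and that their errors are independent.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    # Three hypothetical classifiers vote on one translation unit.
    print(majority_vote([1, 0, 1]))
    # Members better than chance: the ensemble beats each member.
    print(round(majority_accuracy(0.7, 3), 3))
    # Members worse than chance: the ensemble is worse than each member.
    print(round(majority_accuracy(0.4, 3), 3))
```

Under these assumptions, three classifiers that are each 70% accurate reach 78.4% accuracy when voting, while three classifiers that are each 40% accurate drop to 35.2%. Real classifiers trained on the same data tend to make correlated errors, so any gain has to be measured empirically.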
