A Taxonomy of Weeds: A Field Guide for Corpus Curators to Winnowing the Parallel Text Harvest

Modern machine translation techniques rely heavily on parallel corpora, which are commonly harvested from the web. Such harvested corpora often exhibit problems in encoding, language identification, sentence alignment, and transliteration. Just as agricultural harvests must be threshed and winnowed to separate grain from chaff, electronic harvests should be carefully processed to ensure the quality and usability of the resulting corpora. In this work, we catalog a taxonomy of problems commonly found in harvested parallel corpora and outline approaches for detecting and correcting them. This work is motivated by the lack of a standardized field guide outlining best practices for curating parallel corpora, especially those harvested from the web. Even the most carefully curated parallel corpus is likely to contain some problems: Europarl (Koehn, 2005), arguably the most widely examined parallel corpus, has undergone eight distinct revisions since its release in 2005. While this work does not attempt to cover every problem that arises in corpus creation and curation, we believe that a practical taxonomic field guide laying out the likely pitfalls awaiting corpus curators will be an important contribution to our community.

[1]  Jörg Tiedemann, et al.  A Report on the DSL Shared Task 2014, 2014, VarDial@COLING.

[2]  Timothy Baldwin, et al.  Exploring Methods and Resources for Discriminating Similar Languages, 2014, VarDial@COLING.

[3]  Lane Schwartz, et al.  Machine Translation and Monolingual Postediting: The AFRL WMT-14 System, 2014, WMT@ACL.

[4]  Philipp Koehn, et al.  Findings of the 2014 Workshop on Statistical Machine Translation, 2014, WMT@ACL.

[5]  Nadir Durrani, et al.  Integrating an Unsupervised Transliteration Model into Statistical Machine Translation, 2014, EACL.

[6]  Jan Niehues, et al.  Tight Integration of Speech Disfluency Removal into SMT, 2014, EACL.

[7]  Michel Simard.  Clean data for training statistical MT: the case of MT contamination, 2014, AMTA.

[8]  Ahmed Abdelali, et al.  The AMARA Corpus: Building Parallel Language Resources for the Educational Domain, 2014, LREC.

[9]  Shuly Wintner, et al.  Improving Statistical Machine Translation by Adapting Translation Models to Translationese, 2013, CL.

[10]  Philipp Koehn, et al.  Dirt Cheap Web-Scale Parallel Text from the Common Crawl, 2013, ACL.

[11]  Katherine Young.  Reversing the Palladius Mapping of Chinese Names in Russian Text, 2012, AMTA.

[12]  George F. Foster, et al.  The Impact of Sentence Alignment Errors on Phrase-Based Machine Translation Performance, 2012, AMTA.

[13]  Mauro Cettolo, et al.  WIT3: Web Inventory of Transcribed and Translated Talks, 2012, EAMT.

[14]  Chris Quirk, et al.  MT Detection in Web-Scraped Parallel Corpora, 2011, MTSUMMIT.

[15]  Juri Ganitkevitch, et al.  Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation, 2011, EMNLP.

[16]  Moshe Koppel, et al.  Translationese and Its Dialects, 2011, ACL.

[17]  Satoshi Sekine, et al.  Latent Class Transliteration based on Source Language Origin, 2011, ACL.

[18]  Yandex, et al.  Building a Web-based parallel corpus and filtering out machine-translated text, 2011.

[19]  Andy Way, et al.  Lattice Score Based Data Cleaning for Phrase-Based Statistical Machine Translation, 2010, EAMT.

[20]  François Masselot, et al.  A Productivity Test of Statistical Machine Translation Post-Editing in a Typical Localisation Context, 2010, Prague Bull. Math. Linguistics.

[21]  Tsuyoshi Okita, et al.  Data Cleaning for Word Alignment, 2009, ACL.

[22]  Philipp Koehn, et al.  Findings of the 2009 Workshop on Statistical Machine Translation, 2009, WMT@EACL.

[23]  Cyril Goutte, et al.  Automatic Detection of Translated Text and its Impact on Machine Translation, 2009, MTSUMMIT.

[24]  Coskun Mermer, et al.  The TÜBİTAK-UEKAE statistical machine translation system for IWSLT 2008, 2007, IWSLT.

[25]  Haizhou Li, et al.  Semantic Transliteration of Personal Names, 2007, ACL.

[26]  Hermann Ney, et al.  Automatic Filtering of Bilingual Corpora for Statistical Machine Translation, 2005, NLDB.

[27]  Michael Gamon, et al.  Sentence-level MT evaluation without reference translations: beyond language modeling, 2005, EAMT.

[28]  Philipp Koehn, et al.  Europarl: A Parallel Corpus for Statistical Machine Translation, 2005, MTSUMMIT.

[29]  Eiichiro Sumita, et al.  Bilingual corpus cleaning focusing on translation literality, 2002, INTERSPEECH.

[30]  Michael Gamon, et al.  A Machine Learning Approach to the Automatic Evaluation of Machine Translation, 2001, ACL.

[31]  Philip Resnik, et al.  Parallel strands: a preliminary investigation into mining the Web for bilingual text, 1998, AMTA.

[32]  J. House, et al.  Shifts of Cohesion and Coherence in Translation, 1996.

[33]  Dekai Wu, et al.  Aligning a Parallel English-Chinese Corpus Statistically With Lexical Criteria, 1994, ACL.

[34]  C. Myers-Scotton.  Social Motivations For Codeswitching: Evidence from Africa, 1994.

[35]  Michel Simard, et al.  Using cognates to align sentences in bilingual corpora, 1993, TMI.

[36]  Kenneth Ward Church, et al.  A Program for Aligning Sentences in Bilingual Corpora, 1993, CL.

[37]  Maria Antonietta Fusco.  Quality in Conference Interpreting between Cognate Languages: A Preliminary Approach to the Spanish-Italian Case, 1990.