Building a German/Simple German Parallel Corpus for Automatic Text Simplification

In this paper, we report on our experiments in creating a parallel corpus from German/Simple German documents collected from the web. We require parallel data to build a statistical machine translation (SMT) system that translates from German into Simple German. Parallel data for SMT systems must be aligned at the sentence level, so we applied an existing monolingual sentence alignment algorithm. We show the limits of this algorithm with respect to the language and domain of our data and suggest ways of circumventing them.
