PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification

In this paper we present PaCCSS-IT, a Parallel Corpus of Complex-Simple Sentences for ITalian. To build the resource, we develop a new method for automatically acquiring a corpus of complex-simple sentence pairs that captures structural transformations and is particularly suitable for text simplification. The method requires a large amount of text, which can be easily extracted from the web, making it suitable for less-resourced languages as well. We test it on Italian and make available the largest Italian corpus for automatic text simplification.
