An Open Corpus of Everyday Documents for Simplification Tasks

In recent years, interest in creating statistical automated text simplification systems has increased. Many of these systems have used parallel corpora of articles taken from Wikipedia and Simple Wikipedia, or from Simple Wikipedia revision histories, and generate Simple Wikipedia articles. In this work, we motivate the need to construct a large, accessible corpus of everyday documents, along with their simplifications, for the development and evaluation of simplification systems that make everyday documents more accessible. We present a detailed description of what this corpus will look like and of the basic corpus of everyday documents we have already collected. The latter contains everyday documents from many domains, including driver’s licensing, government aid, and banking, and totals over 120,000 sentences. We describe our preliminary work evaluating the feasibility of using crowdsourcing to generate simplifications for these documents. This work is the basis for our future extended corpus, which will be available to the community of researchers interested in the simplification of everyday documents.
