New data-driven approaches to text simplification

Many texts we encounter in our everyday lives are lexically and syntactically very complex. This makes them difficult to understand for people with intellectual or reading impairments, and difficult for various natural language processing systems to process. This has motivated the need for text simplification (TS), which transforms texts into simpler variants. Given that this is still a relatively new research area, many challenges remain. The focus of this thesis is on better understanding the current problems in automatic text simplification (ATS) and proposing new data-driven approaches to solving them. We propose methods for learning sentence splitting and deletion decisions, built upon parallel corpora of original and manually simplified Spanish texts, which outperform existing similar systems. Our experiments in adapting those methods to different text genres and target populations show promising results, thus offering one possible solution to the scarcity of parallel corpora for text simplification aimed at specific target populations, which is currently one of the main issues in ATS. The results of our extensive analysis of the phrase-based statistical machine translation (PB-SMT) approach to ATS reject the widespread assumption that the success of that approach largely depends on the size of the training and development datasets. They point to more influential factors for the success of the PB-SMT approach to ATS, and reveal some important differences between cross-lingual MT and the monolingual MT used in ATS. Our event-based system for simplifying news stories in English (EventSimplify) overcomes some of the main problems in ATS. It requires neither a large number of handcrafted simplification rules nor parallel data, and it performs significant content reduction. The automatic and human evaluations conducted show that it produces grammatical text and increases readability, preserving and simplifying relevant content and reducing irrelevant content. Finally, this thesis addresses another important issue in TS: how to automatically evaluate the performance of TS systems, given that access to the target users might be difficult. Our experiments indicate that existing readability metrics can successfully be used for this task when complemented with human evaluation of grammaticality and preservation of meaning.
