Improving lexical coverage of text simplification systems for Spanish

Abstract The current bottleneck of all data-driven lexical simplification (LS) systems is the scarcity and small size of the parallel corpora (original sentences paired with their manually simplified versions) used for training. The problem is especially pronounced for languages other than English. We address it, taking Spanish as an example of such a language, by building new simplification-specific datasets of synonyms and paraphrases from freely available resources. We test their usefulness in the LS task by adding them, in various combinations, to the existing text simplification (TS) training dataset in a phrase-based statistical machine translation (PBSMT) approach. Our best systems significantly outperform the state-of-the-art LS systems for Spanish in the number of transformations performed and in the grammaticality, simplicity, and meaning preservation of the output sentences. A detailed manual analysis shows that some of the newly built TS resources, although they have good lexical coverage and lead to a high number of transformations, often change the original meaning and do not generate simpler output when used in this PBSMT setup. In contrast, well-chosen combinations of these additional resources with the TS training dataset, together with a well-chosen language model, improve lexical coverage and produce sentences that are grammatical, simpler than the original, and faithful to the original meaning.
