Corpus-based Sentence Deletion and Split Decisions for Spanish Text Simplification

This study addresses the automatic simplification of Spanish texts to make them more accessible to people with cognitive disabilities. A corpus of original and manually simplified news articles was analysed to identify and quantify the operations that a text simplification system should implement. The articles were then compared at sentence and text level using automatic feature extraction and several machine learning classification algorithms. Three groups of features were used (POS frequencies, syntactic information, and text complexity measures), with the aim of identifying those that help separate original documents from their simplified equivalents. Finally, the study investigated whether these features can be used to decide which simplification operations (split, delete, and reduce) to carry out at the sentence level. Automatic classification of original sentences into those to be kept and those to be eliminated outperformed the classification previously conducted on the same corpus. Kept sentences were further classified into those to be split or significantly reduced in length and those to be left largely unchanged, with an overall F-measure of up to 0.92. Both experiments were conducted and compared on two feature sets: all features, and the best subset returned by an attribute selection algorithm.
