Supervised and Unsupervised Neural Approaches to Text Readability

We present a set of novel supervised and unsupervised neural approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models; in the supervised setting, we test three different neural architectures as classifiers. We show that the proposed unsupervised approach on average produces better results than traditional readability formulas and is transferable across languages. With neural classifiers, we outperform current state-of-the-art classification approaches to readability, which rely on standard machine learning classifiers and extensive feature engineering. We also test several properties of the proposed approaches and show their strengths and possibilities for improvement.
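For readers unfamiliar with the "traditional readability formulas" used as a baseline above, the following is a minimal sketch of one widely known formula, the Automated Readability Index (ARI). It is an illustration only: the abstract does not state which formulas were compared, and the function name, the regex-based tokenization, and the sentence splitting are assumptions made for this example rather than the authors' implementation.

```python
import re


def automated_readability_index(text: str) -> float:
    """Approximate ARI: 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43.

    Tokenization is deliberately naive (an assumption for this sketch):
    sentences are split on '.', '!' and '?', so decimals such as "3.14"
    or abbreviations such as "e.g." will be miscounted.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


if __name__ == "__main__":
    easy = "The cat sat. The dog ran."
    hard = ("Notwithstanding considerable methodological heterogeneity, the "
            "investigators established statistically significant correlations.")
    # A higher score indicates text that the formula considers harder to read.
    print(round(automated_readability_index(easy), 2))
    print(round(automated_readability_index(hard), 2))
```

Formulas of this kind depend only on surface counts (characters, words, sentences), which is the limitation that language-model-based and classifier-based approaches aim to address.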
