Revisiting the Readability Assessment of Texts in Portuguese

The Web Content Accessibility Guidelines (WCAG) 2.0 include, under their principle of comprehensibility, an accessibility requirement related to the level of writing. This requirement states that websites whose texts demand reading skills beyond those of individuals with lower secondary education (fifth to ninth grades in Brazil) should offer these readers an alternative version of the same content. Natural Language Processing technology and research in Psycholinguistics can help automate the task of classifying a text according to its reading difficulty. In this paper, we present experiments to build a readability checker that classifies texts in Portuguese, considering different text genres, domains and reader ages, using naturally occurring texts. More precisely, we classify texts as simple (for 7 to 14-year-olds) or complex (for adults), and address three key research questions: (1) Which machine-learning algorithm produces the best results? (2) Which features are relevant? (3) Do different text genres have an impact on readability assessment?
