Do NLP and machine learning improve traditional readability formulas?

Readability formulas are methods used to match texts to readers' reading level. Several methodological paradigms have been investigated in the field. The most popular one dates back several decades and gave rise to well-known readability formulas such as the Flesch formula (among several others). This paper compares this approach (henceforth "classic") with an emerging paradigm that uses sophisticated NLP-enabled features and machine learning techniques. Our experiments, carried out on a corpus of texts for French as a foreign language, yield four main results: (1) the new readability formula performed better than the "classic" formula; (2) "non-classic" features were slightly more informative than "classic" features; (3) modern machine learning algorithms did not improve the explanatory power of our readability model, but did allow new observations to be classified more accurately; and (4) combining "classic" and "non-classic" features resulted in a significant gain in performance.
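To make the "classic" paradigm concrete, the following minimal sketch computes the Flesch Reading Ease score, which combines average sentence length and average word length in syllables. It is an illustration only: it implements Flesch's formula for English, not the formulas evaluated in this paper for French, and the function names, the regex-based tokenization, and the vowel-group syllable counter are simplifying assumptions rather than part of the original work.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels.

    Assumption: a crude heuristic for illustration, not a real syllabifier.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    # Naive sentence and word segmentation (assumption for this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = "Readability formulas approximate comprehension difficulty."
    print(flesch_reading_ease(easy))
    print(flesch_reading_ease(hard))
```

The score depends on only two surface variables, which is what separates this paradigm from approaches that feed many NLP-derived features to a machine learning model.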
