An analysis of a French as a Foreign Language Corpus for Readability Assessment

Readability assessment aims to estimate the difficulty of texts from various linguistic predictors (the lexicon used, the complexity of sentences, the coherence of the text, etc.). It is an active field with applications in many NLP domains, including machine translation, text simplification, text summarisation, and CALL (Computer-Assisted Language Learning). In CALL, readability tools could help retrieve educational materials or make CALL platforms more adaptive. However, developing a readability formula is a costly process that requires a large number of texts annotated for difficulty. The current mainstream method for gathering such a corpus is to draw the texts from educational resources such as textbooks or simplified readers. In this paper, we describe the collection of an annotated corpus of French-as-a-foreign-language texts intended for training a readability model. We follow the mainstream approach and take the texts from textbooks, but we are concerned with the limitations of such an “annotation” approach, in particular regarding the homogeneity of the difficulty annotations across textbook series. The reliability of these annotations is assessed through both a qualitative and a quantitative analysis. It appears that, for some educational levels, the hypothesis of annotation homogeneity must be rejected. Various reasons for these findings are discussed, and the paper concludes with recommendations for future similar attempts.
