Automatic Text Difficulty Classifier - Assisting the Selection Of Adequate Reading Materials For European Portuguese Teaching

This paper describes a system that assists in selecting adequate reading materials to support the teaching of European Portuguese, especially as a second language, and highlights the key challenges in selecting linguistic features for text difficulty (readability) classification. The system uses existing Natural Language Processing (NLP) tools to extract linguistic features from texts, which are then used by an automatic readability classifier. It currently extracts 52 features, grouped as parts of speech (POS), syllables, words, chunks and phrases, averages and frequencies, and some extra features. A classifier was trained on these features using a corpus previously annotated by readability level according to an official five-level language classification standard for Portuguese as a Second Language. In the five-level scenario (A1 to C1), the best-performing learning algorithm (LogitBoost) achieved an accuracy of 75.11% with a root mean square error (RMSE) of 0.269. In the three-level scenario (A, B and C), the best-performing learning algorithm (C4.5 grafted) achieved an accuracy of 81.44% with an RMSE of 0.346.
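The relation between the five-level and three-level scenarios can be illustrated with a minimal sketch. It assumes that the three coarse levels are obtained by merging A1 and A2 into A, B1 and B2 into B, and keeping C1 as C; the abstract does not state the mapping, so this is an assumption. The labels below are invented toy data, not results from the paper, and the function names are hypothetical. In the paper each scenario has its own trained classifier, whereas this sketch only re-scores one set of predictions at both granularities.

```python
# Assumed (hypothetical) mapping from the five levels to three coarse levels.
COARSE_LEVEL = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}


def collapse(labels):
    """Map five-level labels (A1..C1) onto the coarse levels A, B and C."""
    return [COARSE_LEVEL[label] for label in labels]


def accuracy(gold, predicted):
    """Fraction of texts whose predicted level equals the annotated level."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data for illustration only.
gold = ["A1", "A2", "B1", "B2", "C1", "B1"]
pred = ["A2", "A2", "B2", "B2", "C1", "A2"]

five_level_acc = accuracy(gold, pred)
three_level_acc = accuracy(collapse(gold), collapse(pred))
print(f"five-level accuracy:  {five_level_acc:.3f}")
print(f"three-level accuracy: {three_level_acc:.3f}")
```

On this toy data, confusions between adjacent sublevels (for example A1 versus A2) count as errors in the five-level scoring but disappear in the three-level scoring, which is one reason a coarser scheme tends to yield higher accuracy.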
