Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories

I investigate Russian second language readability assessment using a machine-learning approach with a range of lexical, morphological, syntactic, and discourse features. Tested on a new collection of Russian L2 readability corpora, the model achieves an F-score of 0.671 and an adjacent accuracy of 0.919 on a six-level classification task. Information gain and feature subset evaluation show that morphological features are collectively the most informative. Learning curves for binary classifiers reveal that less training data is needed to distinguish between beginning reading levels than between intermediate reading levels.
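
The abstract reports adjacent accuracy without defining it. A minimal sketch follows, assuming the common definition: a prediction counts as correct when it falls within one level of the gold label. The function name, the integer encoding of the six levels as 1 to 6, and the sample labels are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def adjacent_accuracy(gold: Sequence[int],
                      predicted: Sequence[int],
                      tolerance: int = 1) -> float:
    """Share of predictions within `tolerance` levels of the gold label.

    Assumes reading levels are encoded as consecutive integers
    (hypothetical encoding: 1 = easiest, 6 = hardest).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one labelled document is required")
    hits = sum(1 for g, p in zip(gold, predicted) if abs(g - p) <= tolerance)
    return hits / len(gold)


# Made-up labels for six documents, one per level.
gold = [1, 2, 3, 4, 5, 6]
predicted = [1, 3, 3, 6, 5, 5]

exact = adjacent_accuracy(gold, predicted, tolerance=0)  # plain accuracy
adjacent = adjacent_accuracy(gold, predicted)            # off-by-one allowed
print(f"exact={exact:.3f} adjacent={adjacent:.3f}")
```

With `tolerance=0` the same function gives plain accuracy, which shows why adjacent accuracy is always at least as high as exact accuracy on an ordinal task like this one.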
