Automatic CEFR Level Prediction for Estonian Learner Text

This paper reports on approaches for automatically predicting a learner's proficiency in Estonian on the Common European Framework of Reference for Languages (CEFR) scale. We use morphological and part-of-speech (POS) tag information extracted from texts written by learners, and we compare classification and regression models for the task. Our models achieve an accuracy of 79% when the task is modeled as classification and a correlation of 0.85 when it is modeled as regression. Comparing the two, we conclude that classification is more effective than regression in terms of both exact error and direction of error. We also investigate the most predictive features for multiclass classification and for binary classification between groups, and we explore the nature of the correlations between highly predictive features. Our results show a considerable improvement in classification accuracy over previously reported results and take us a step closer to the automated assessment of Estonian learner text.
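To make the comparison between the two modeling choices concrete, the following is a minimal sketch of how exact-level accuracy, Pearson correlation, and direction of error could be computed on a common footing. It is not the evaluation code used in this work: the integer encoding of CEFR levels, the rounding of regression outputs to the nearest level, the function names, and the toy predictions are all assumptions introduced for illustration.

```python
from math import sqrt

# Assumed ordinal encoding of CEFR levels (illustrative; the label set
# actually used in the experiments may differ).
CEFR_TO_INT = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def accuracy(gold, predicted):
    """Fraction of texts whose predicted level equals the gold level."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def to_level(score):
    """Round a continuous score to the nearest level, clamped to the scale."""
    return min(6, max(1, int(round(score))))


def error_direction(gold, predicted):
    """Count predictions above (over) and below (under) the gold level."""
    over = sum(p > g for g, p in zip(gold, predicted))
    under = sum(p < g for g, p in zip(gold, predicted))
    return {"over": over, "under": under}


# Toy data, invented for illustration only (not results from the paper).
gold = [CEFR_TO_INT[level] for level in ["A2", "B1", "B1", "B2", "C1"]]
class_pred = [CEFR_TO_INT[level] for level in ["A2", "B1", "B2", "B2", "C1"]]
reg_pred = [2.2, 2.9, 3.6, 3.8, 4.4]  # continuous regression outputs

class_accuracy = accuracy(gold, class_pred)
reg_correlation = pearson(gold, reg_pred)
reg_levels = [to_level(score) for score in reg_pred]
reg_accuracy = accuracy(gold, reg_levels)
reg_direction = error_direction(gold, reg_levels)
```

Rounding the regression output to a level is one simple way to put both model types on the same exact-match footing. A high correlation can coexist with a lower exact-match rate, which is why the two measures are reported separately.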
