Coursebook Texts as a Helping Hand for Classifying Linguistic Complexity in Language Learners’ Writings

We bring together knowledge from two different types of language learning data, texts learners read and texts they write, to improve linguistic complexity classification in the latter. Linguistic complexity in the foreign and second language learning context can be expressed in terms of proficiency levels. We show that incorporating features that capture lexical complexity information from reading passages can significantly boost the machine-learning-based classification of learner-written texts into proficiency levels. With an F1 score of 0.8, our system rivals state-of-the-art results reported for this task in other languages. Finally, we present a freely available web-based tool for proficiency level classification and lexical complexity visualization for both learner writings and reading texts.
