Using Multilingual Resources to Evaluate CEFRLex for Learner Applications

The Common European Framework of Reference for Languages (CEFR) defines six levels of learner proficiency and links them to particular communicative abilities. The CEFRLex project aims to compile lexical resources that link single words and multi-word expressions to particular CEFR levels. These resources are thought to reflect the needs of second language learners, as they are compiled from CEFR-graded textbooks and other learner-directed texts. In this work, we investigate the applicability of CEFRLex resources for building language learning applications. Our main concerns were that vocabulary in language learning materials might be sparse, i.e. that not all vocabulary items belonging to a particular level would actually occur in materials for that level, and, conversely, that higher-level vocabulary items might be used in lower-level materials if required by the topic (e.g. accompanied by a simpler paraphrase or a translation). Our results indicate that the English CEFRLex resource is in accordance with the external resources that we jointly employ as a gold standard. Using these together with other values obtained from monolingual and parallel corpora, we can indicate which entries need to be adjusted to obtain values that are even more in line with this gold standard. We expect that this finding also holds for the other languages.
