SW4ALL: a CEFR Classified and Aligned Corpus for Language Learning

Learning a second language requires considerable time and dedication. Part of the process involves reading and writing texts in the target language, and to facilitate reading in particular, teachers tend to search for texts that match the interests and capabilities of their learners. Searching for such texts, however, is itself time-consuming. To address this need for texts suited to different language learners, we present SW4ALL, a corpus of documents classified by language proficiency level (based on the CEFR recommendations) that allows learners to observe how the same topic or content can be described using strategies from different proficiency levels. The corpus uses the alignments between the English Wikipedia and the Simple English Wikipedia to ensure that the two texts in each pair cover similar content or topics, and an annotation of language levels to ensure that the texts in each pair differ in proficiency level. Given the size of the corpus, we annotated it automatically and then analyzed the result to sort out annotation errors. SW4ALL contains 8,669 pairs of documents that present different levels of language proficiency.
