You Get What You Annotate: A Pedagogically Annotated Corpus of Coursebooks for Swedish as a Second Language

We present the COCTAILL corpus, containing over 700,000 tokens of Swedish text from 12 coursebooks aimed at second/foreign language (L2) learning. Each text in the corpus is labelled with a proficiency level according to the CEFR proficiency scale. Genres, topics, associated activities, vocabulary lists and other types of information are annotated in the coursebooks to facilitate Second Language Acquisition (SLA)-aware studies and experiments aimed at Intelligent Computer-Assisted Language Learning (ICALL). Linguistic annotation in the form of parts of speech (POS; e.g. nouns, verbs), base forms (lemmas) and syntactic relations (e.g. subject, object) has also been added to the corpus. In the article we describe our annotation scheme and the editor we have developed for the content mark-up of the coursebooks, including the taxonomy of pedagogical activities and linguistic skills. Inter-annotator agreement has been computed on a subset of the corpus and is reported here. Surprisingly, we have not found any other pedagogically marked-up corpora based on L2 coursebooks whose experience we could draw on. Hence, our work may be viewed as “groping in the darkness” and, eventually, as a starting point for others. The paper also presents our first quantitative exploration of the corpus, in which we focus on textually and pedagogically annotated features of the coursebooks to exemplify the types of studies that can be performed using the presented annotation scheme. We explore trends in the use of topics and genres over proficiency levels and compare the pedagogical focus of exercises across levels. The final section of the paper summarises the potential this corpus holds for research within SLA and for various ICALL tasks.
