The SweLL Language Learner Corpus

The article presents SweLL, a new language learner corpus for Swedish, and the methodology behind it, from collection and pseudonymisation (to protect learners' personal information) to annotation adapted to second language learning. The main aim is to deliver a well-annotated corpus of essays written by second language learners of Swedish and to make it available for research through a browsable environment. To that end, a new annotation tool and a new project management tool have been implemented, both with the main purpose of ensuring the reliability and quality of the final corpus. In the article we discuss the reasoning behind metadata selection and the principles of gold corpus compilation, and argue for separating normalization from correction annotation.
