Multi-level error annotation in learner corpora

Learner corpora (principled collections of learner language) provide insights into the mechanisms by which a foreign language is acquired. For overviews of the current state of learner corpus research see Granger (2002, to appear), Nesselhauf (2004), and Pravec (2002). Learner corpora are used to test hypotheses in acquisition theory in two main ways. First, they can be used for so-called contrastive interlanguage analysis (CIA), i.e. the quantitative comparison of learner language and native language to find patterns of overuse or underuse. For CIA, a corpus does not have to be tagged. In this paper we are concerned with the second main area of learner corpus research: error tagging. While error tagging is problematic in many theoretical respects, it is probably no longer controversial that error-tagged learner corpora can be useful for a number of research questions if the tagging follows certain guidelines. We do not argue for the need for error annotation (see Granger, to appear, for a motivation) or discuss the theoretical problems involved; we are concerned only with issues of error tagging and corpus architecture. We argue for a multi-level standoff architecture (rather than a flat token-tag architecture) for error-tagged learner corpora. Using the German learner corpus Falko as an example, we show how multi-level approaches to learner corpora can help solve some of the problems that occur in error tagging when flat annotation models are used.
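The contrast between a flat token-tag model and a multi-level standoff model can be illustrated with a minimal Python sketch. It is not the Falko implementation: the class names, the layer names, the tag values, and the learner sentence are all hypothetical and chosen only for illustration. The point it shows is that standoff layers refer to token positions instead of being attached to single tokens, so several annotations of different extent can cover the same token.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """One annotation over the half-open token range [start, end)."""
    start: int
    end: int
    value: str


@dataclass
class StandoffCorpus:
    """Tokens plus independent annotation layers that point at token ranges."""
    tokens: list[str]
    layers: dict[str, list[Span]] = field(default_factory=dict)

    def annotate(self, layer: str, start: int, end: int, value: str) -> None:
        # Reject spans that do not fall inside the token sequence.
        if not (0 <= start < end <= len(self.tokens)):
            raise ValueError("span out of range")
        self.layers.setdefault(layer, []).append(Span(start, end, value))

    def covering(self, token_index: int) -> dict[str, list[str]]:
        # Collect, per layer, every annotation whose span includes the token.
        return {
            layer: [s.value for s in spans if s.start <= token_index < s.end]
            for layer, spans in self.layers.items()
        }


# Hypothetical learner sentence (not taken from Falko).
corpus = StandoffCorpus(["Gestern", "ich", "habe", "gegangen", "nach", "Hause"])

# Hypothetical layers and tags: each layer has its own spans, and spans overlap.
corpus.annotate("word_order", 1, 3, "verb-second")
corpus.annotate("auxiliary", 2, 3, "haben-for-sein")
corpus.annotate("target", 0, 6, "Gestern bin ich nach Hause gegangen")

print(corpus.covering(2))
```

In a flat model, the token "habe" would carry a single tag slot, so the annotator would have to choose between the word-order error and the auxiliary error, and a sentence-level target hypothesis would have no natural place to live. Here each is stored in its own layer and can be queried together or separately.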

[1] Jack C. Richards et al. A non-contrastive approach to error analysis, 1970.

[2] Mark Liberman et al. A formal framework for linguistic annotation, 1999, Speech Commun.

[3] J. H. Wilkinson et al. Error analysis, 2003.

[4] Nadja Nesselhauf et al. Learner Corpora and their Potential for Language Teaching, 2004.

[5] Heidi Byrnes et al. The role of task and task-based assessment in a content-oriented collegiate foreign language curriculum, 2002.

[6] Jean Carletta et al. The NITE Object Model Library for Handling Structured Linguistic Annotation on Multimodal Data Sets, 2002.

[7] Thomas C. Schmidt. EXMARaLDA und Datenbank ‚Mehrsprachigkeit‘ - Konzepte und praktische Erfahrungen, 2005.

[8] Thomas Schmidt. EXMARaLDA - ein System zur computergestützten Diskurstranskription, 2004.

[9] Gordon Taylor. Errors and Explanations, 1986.

[10] Rod Ellis et al. The Study of Second Language Acquisition, 1994.

[11] S. Lukas. Challenges in Modelling a Richly Annotated Diachronic Corpus of German, 2004.

[12] S. P. Corder. The Significance of Learner's Errors, 1967.

[13] Hagen Hirschmann. Platzhalterphrasen bei fortgeschrittenen Lernern des Deutschen als Fremdsprache, 2005.

[14] Julie A. Belz et al. Learner corpus analysis and the development of foreign language proficiency, 2004.

[15] Sylviane Granger et al. Error-tagged learner corpora and CALL: a promising synergy, 2003.

[16] Norma A. Pravec. Survey of learner corpora, 2002.

[17] Helmut Schmid et al. Probabilistic part-of-speech tagging using decision trees, 1994.

[18] Oliver Christ et al. A Modular and Flexible Architecture for an Integrated Corpus Query System, 1994, ArXiv.

[19] Sylviane Granger et al. A Bird’s-eye view of learner corpus research, 2002.