Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCamDat)

Naturalistic learner productions are an important empirical resource for SLA research. Several pioneering projects have produced valuable second language (L2) resources supporting SLA research.1 One common limitation of these resources is the absence of individual longitudinal data for large numbers of speakers with different backgrounds across the proficiency spectrum, which is vital for understanding the nature of individual variation in longitudinal development.2 A second limitation is the relatively restricted amount of data annotated with linguistic information (e.g., lexical, morphosyntactic, or semantic features) to support the investigation of SLA hypotheses and reveal developmental patterns for different linguistic phenomena. Where available, annotations tend to be obtained manually, which immediately limits the quantity of data that can be annotated with reasonable human resources in reasonable time.

Natural Language Processing (NLP) tools can provide automatic annotations for parts of speech (POS) and syntactic structure, and they are increasingly applied to learner language in various contexts. Systems in computer-assisted language learning (CALL) have used parsers and other NLP tools to automatically detect learner errors and provide feedback accordingly.3 Other work has adapted the annotations produced by parsing tools to describe learner syntax accurately (Dickinson & Lee, 2009) or has evaluated parser performance on learner language and the effect of learner errors on the parser. Krivanek and Meurers (2011) compared two parsing methods, one using a hand-crafted lexicon and one trained on a corpus. They found that the former is more successful in recovering the main grammatical dependency relations, whereas the latter is more successful in recovering optional, adjunct relations.
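To make the kind of annotation under discussion concrete, the sketch below shows how a POS-tagged dependency analysis of a learner sentence is typically represented in CoNLL-style tabular form, the format emitted by data-driven parsers such as MaltParser. The tags and head/label assignments here are hand-assigned for illustration, not the output of any real parser, and the `Token` class and `to_conll` helper are hypothetical names introduced only for this example.

```python
# Illustrative sketch: CoNLL-style dependency annotation of a learner
# sentence. All analyses below are hand-assigned for illustration,
# not produced by an actual parser.

from dataclasses import dataclass


@dataclass
class Token:
    idx: int      # 1-based token position in the sentence
    form: str     # surface form exactly as the learner wrote it
    pos: str      # part-of-speech tag (Penn Treebank style)
    head: int     # index of the syntactic head (0 = sentence root)
    deprel: str   # dependency relation label


# A typical L2 sentence with a subject-verb agreement error ("He go");
# the dependency structure can still be annotated robustly.
sent = [
    Token(1, "He", "PRP", 2, "nsubj"),
    Token(2, "go", "VB", 0, "root"),
    Token(3, "to", "TO", 2, "prep"),
    Token(4, "school", "NN", 3, "pobj"),
    Token(5, "yesterday", "NN", 2, "tmod"),
]


def to_conll(tokens):
    """Render the tokens as tab-separated CoNLL-style lines."""
    return "\n".join(
        f"{t.idx}\t{t.form}\t{t.pos}\t{t.head}\t{t.deprel}" for t in tokens
    )


print(to_conll(sent))
```

A tabular representation like this is what makes large-scale querying of annotated learner data practical: searching for, say, all `nsubj` relations headed by a base-form verb retrieves agreement errors of the kind shown above.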
Ott and Ziai (2010) evaluated the performance of a dependency parser trained on native German (MaltParser; Nivre et al., 2007) on 106 learner answers to a comprehension task in L2 German. Their study indicates that while some errors can be problematic for the parser (e.g., omission of finite verbs), many others (e.g., wrong word order) can be parsed robustly, resulting in overall high performance scores.

In this paper we have two goals. First, we introduce a new English L2 database, the EF-Cambridge Open Language Database, henceforth EFCAMDAT. EFCAMDAT was developed by the Department of Theoretical and Applied Linguistics at the University of Cambridge in collaboration with EF Education First, an international educational organization. It contains writings submitted to Englishtown, the

[1] K. Jensen, et al. Parse Fitting and Prose Fixing, 1983.

[2] Lance A. Miller, et al. Parse Fitting and Prose Fixing: Getting a Hold on Ill-Formedness, 1983, Am. J. Comput. Linguistics.

[3] Wiktor Marck. Executing Temporal Logic Programs by Ben Moszkowski, 1986, SGAR.

[4] Robert Bley-Vroman, et al. The Logical Problem of Foreign Language Learning, 1989.

[5] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[6] Lance A. Miller, et al. Parse Fitting and Prose Fixing, 1993, Natural Language Processing.

[7] Marianne Starren, et al. The European Science Foundation's Second Language Database, 1996.

[8] D. Nicholls, et al. The Cambridge Learner Corpus: Error Coding and Analysis, 1999.

[9] Wolfgang Menzel, et al. Error Diagnosis for Language Learning Systems, 1999.

[10] Sylviane Granger, et al. Error-Tagged Learner Corpora and CALL: A Promising Synergy, 2003.

[11] Dan Klein, et al. Accurate Unlexicalized Parsing, 2003, ACL.

[12] Eugene Charniak, et al. Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking, 2005, ACL.

[13] Joakim Nivre, et al. MaltParser: A Language-Independent System for Data-Driven Dependency Parsing, 2007, Natural Language Engineering.

[14] Anke Lüdeling, et al. Multi-Level Error Annotation in Learner Corpora, 2005.

[15] Dan Klein, et al. Learning Accurate, Compact, and Interpretable Tree Annotation, 2006, ACL.

[16] Jun'ichi Tsujii, et al. Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles, 2007, EMNLP.

[17] Virginie Zampa, et al. Integrating Learner Corpora and Natural Language Processing: A Crucial Step Towards Reconciling Technological Sophistication and Pedagogical Effectiveness, 2007, ReCALL.

[18] Christopher D. Manning, et al. The Stanford Typed Dependencies Representation, 2008, CF+CDPE@COLING.

[19] Markus Dickinson, et al. Dependency Annotation for Learner Corpora, 2009.

[20] Walt Detmar Meurers, et al. Towards Interlanguage POS Annotation for Effective Learner Corpora in SLA and FLT, 2009.

[21] Joachim Wagner, et al. The Effect of Correcting Grammatical Errors on Parse Probabilities, 2009, IWPT.

[22] Niels Ott, et al. Evaluating Dependency Parsing Performance on German Learner Language, 2010.

[23] Daniel Jurafsky, et al. Parsing to Stanford Dependencies: Trade-offs Between Speed and Accuracy, 2010, LREC.

[24] Walt Detmar Meurers, et al. On Using Intelligent Computer-Assisted Language Learning in Real-Life Foreign Language Teaching and Learning, 2011, ReCALL.

[25] Walt Detmar Meurers, et al. Comparing Rule-Based and Data-Driven Dependency Parsing of Learner Language, 2011, DepLing.

[26] Markus Dickinson, et al. Avoiding the Comparative Fallacy in the Annotation of Learner Corpora, 2011.

[27] Walt Detmar Meurers, et al. On the Automatic Analysis of Learner Language: Introduction to the Special Issue, 2013.

[28] Markus Dickinson, et al. Modifying Corpus Annotation to Support the Analysis of Learner Language, 2013.

[29] Markus Dickinson, et al. Dependency Annotation of Coordination for Learner Language, 2014.