Error annotation in a Learner Corpus of Portuguese

We present the error tagging system of the COPLE2 corpus and the first results of its implementation. The system takes advantage of the corpus architecture and the possibilities of the TEITOK environment to reduce manual effort and to produce a final standoff, multilevel annotation with position-based tags that account for the main error types observed in the corpus. The first step of the tagging process is the manual annotation of errors at the token level. We have already annotated 47% of the corpus using this approach. In a further step, the token-based annotations will be automatically transformed (fully or partially) into position-based error tags. COPLE2 is the first Portuguese learner corpus with error annotation. We expect that this work will support new research in different fields connected with Portuguese as a second/foreign language, such as Second Language Acquisition/Teaching or Computer Assisted
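The second step described above, turning token-level error marks into position-based standoff tags, can be illustrated with a minimal sketch. It is not the COPLE2 or TEITOK implementation: the `Token` and `ErrorSpan` classes, the `to_standoff` function, the token identifiers, and the error codes `AGR` and `SPELL` are all hypothetical, and the merging rule (adjacent tokens with the same code form one span) is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    tok_id: str                   # hypothetical token identifier, e.g. "w-2"
    form: str                     # the form as written by the learner
    error: Optional[str] = None   # hypothetical token-level error code


@dataclass
class ErrorSpan:
    code: str             # error code carried over from the tokens
    start: int            # index of the first token in the span
    end: int              # index of the last token in the span (inclusive)
    token_ids: List[str]  # identifiers of the tokens the span points to


def to_standoff(tokens: List[Token]) -> List[ErrorSpan]:
    """Group adjacent tokens that share an error code into standoff spans.

    Assumption: a run of neighbouring tokens with the same code is one error.
    """
    spans: List[ErrorSpan] = []
    for i, tok in enumerate(tokens):
        if tok.error is None:
            continue
        last = spans[-1] if spans else None
        if last is not None and last.code == tok.error and last.end == i - 1:
            # Extend the current span to cover this token as well.
            last.end = i
            last.token_ids.append(tok.tok_id)
        else:
            # Start a new span at this token.
            spans.append(ErrorSpan(tok.error, i, i, [tok.tok_id]))
    return spans


# Invented example sentence with invented error codes.
tokens = [
    Token("w-1", "As"),
    Token("w-2", "menino", "AGR"),
    Token("w-3", "bonito", "AGR"),
    Token("w-4", "chegaram"),
    Token("w-5", "ontem", "SPELL"),
]
spans = to_standoff(tokens)
```

Because the spans only reference token identifiers and positions, they could be stored apart from the text itself, which is the general idea behind standoff annotation; how COPLE2 actually serializes them is not specified here.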
