The COPLE2 corpus: a learner corpus for Portuguese

We present the COPLE2 corpus, a learner corpus of Portuguese comprising written and spoken texts produced by learners of Portuguese as a second or foreign language. The corpus currently contains 978 texts totalling 182,474 tokens, classified according to the CEFR scales. The original handwritten productions are transcribed in TEI-compliant XML and keep a record of all the original information, such as reformulations, insertions and corrections made by the teacher, while the recordings are transcribed and aligned with EXMARaLDA. The TEITOK environment provides different views of the same document (XML, student version, corrected version), a CQP-based search interface, and POS tagging, lemmatization and normalization of the tokens; it will soon also be used for error annotation in stand-off format. The corpus has already served as a source of data for phonological, lexical and syntactic interlanguage studies and will be used for a data-informed selection of language features for each proficiency level.
