Towards an Italian Learner Treebank in Universal Dependencies

In this paper we describe preliminary work on a novel treebank of texts written by learners of Italian, drawn from the VALICO corpus. Data processing mainly involved applying the Universal Dependencies (UD) formalism and annotating errors. We first parsed the texts with UDPipe trained on the existing Italian UD treebanks, then manually corrected the output. This paper focuses on a one-hundred-sentence sample of the collection, used as a case study to define an annotation scheme for identifying the linguistic phenomena that characterize learners’ interlanguage.
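The parse-then-correct workflow can be illustrated with a minimal sketch that compares automatic parser output against a manually corrected version of the same sentences, both in CoNLL-U format. This is not the authors' tooling: the helper names (`row`, `read_conllu`, `attachment_scores`), the toy sentence, and its annotation are hypothetical and chosen only for illustration. The sketch also assumes that both versions share the same tokenization.

```python
def row(idx, form, upos, head, deprel):
    """Build one 10-column CoNLL-U token line (unused columns set to '_')."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


def read_conllu(text):
    """Read CoNLL-U text into sentences of (id, form, head, deprel) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(parsed, corrected):
    """Return (UAS, LAS) of the parsed sentences against the corrected ones."""
    total = head_ok = label_ok = 0
    for p_sent, c_sent in zip(parsed, corrected):
        for (_, _, p_head, p_rel), (_, _, c_head, c_rel) in zip(p_sent, c_sent):
            total += 1
            if p_head == c_head:
                head_ok += 1
                if p_rel == c_rel:
                    label_ok += 1
    return head_ok / total, label_ok / total


# Hypothetical learner sentence with a subject-verb agreement error.
# In the "parsed" version the last token carries a wrong relation label.
PARSED = "\n".join([
    "# text = Io ha fame",
    row(1, "Io", "PRON", 2, "nsubj"),
    row(2, "ha", "VERB", 0, "root"),
    row(3, "fame", "NOUN", 2, "nsubj"),
    "",
])
CORRECTED = "\n".join([
    "# text = Io ha fame",
    row(1, "Io", "PRON", 2, "nsubj"),
    row(2, "ha", "VERB", 0, "root"),
    row(3, "fame", "NOUN", 2, "obj"),
    "",
])

uas, las = attachment_scores(read_conllu(PARSED), read_conllu(CORRECTED))
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Scores of this kind only indicate how much manual correction the parser output needed; they say nothing about the error annotation layer, which would need its own representation.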
