A Portuguese Native Language Identification Dataset

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language (L1) based on their second-language writing. The dataset includes 1,868 student essays written by learners of European Portuguese whose L1s are Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. Beyond NLI, the dataset can support research on several topics in Second Language Acquisition and educational NLP. We discuss possible applications of this dataset and present the results obtained for the first lexical baseline system for Portuguese NLI.
