The Power of Character N-grams in Native Language Identification

In this paper, we explore the performance of a linear SVM trained on language-independent character features for the NLI Shared Task 2017. Our basic system (GRONINGEN) achieves the best performance (87.56 F1-score) on the evaluation set using only character n-grams of lengths 1 to 9 as features. We compare this system against several ensemble and meta-classifiers to examine how the linear system fares when combined with other classifiers, especially non-linear ones. We place special emphasis on the topic bias introduced by the distribution of assessment essay prompts.
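
To make the feature set concrete, the following is a minimal sketch of extracting character n-grams of lengths 1 to 9 from a text. It is an illustration under stated assumptions, not the system's actual pipeline: the function name `char_ngrams` is hypothetical, features are raw counts with no weighting or normalization, and whitespace is treated like any other character.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=9):
    """Count every character n-gram of length n_min..n_max in `text`.

    Assumption: plain raw counts; the weighting and preprocessing used
    by the system described above are not specified in this section.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text; if the text is
        # shorter than n, the range is empty and nothing is added.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Example: features for a short string, restricted to lengths 1 and 2.
features = char_ngrams("abab", n_min=1, n_max=2)
```

A sparse count mapping like this would then be vectorized and passed to a linear classifier; how that step is configured is outside what the abstract states.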
