The Power of Character N-grams in Native Language Identification

In this paper, we explore the performance of a linear SVM trained on language-independent character features for the NLI Shared Task 2017. Our basic system (GRONINGEN) achieves the best performance on the evaluation set (F1-score of 87.56) using only character n-grams of lengths 1 to 9 as features. We compare this system against several ensemble and meta-classifiers to examine how the linear system fares when combined with other classifiers, especially non-linear ones. We place special emphasis on the topic bias that arises from the distribution of assessment essay prompts.

Malvina Nissim | Barbara Plank | Gertjan van Noord | Johannes Bjerva | Martijn Wieling | Artur Kulmizev | Bo Blankers
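
The feature set named in the abstract, character n-grams of lengths 1 to 9, can be illustrated with a minimal sketch. It assumes raw counts over overlapping character windows. The function name `char_ngrams` and the toy input are hypothetical, and the sketch does not show the feature weighting or the SVM training that the system itself uses.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=9):
    """Count every character n-gram of length n_min..n_max in text.

    Illustrative helper only (not taken from the paper): spaces and
    punctuation are kept as ordinary characters.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text. When the text is
        # shorter than n, the range is empty and nothing is added.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Toy input: a 7-character string yields n-grams of lengths 1 to 7 only.
features = char_ngrams("the cat")
```

In a classifier, each distinct n-gram would become one dimension of a sparse feature vector, which is why a wide range of lengths produces a very large but language-independent feature space.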
