Fewer features perform well at Native Language Identification task

This paper describes our results at the NLI shared task 2017. We participated in essays, speech, and fusion task that uses text, speech, and i-vectors for the task of identifying the native language of the given input. In the essay track, a linear SVM system using word bigrams and character 7-grams performed the best. In the speech track, an LDA classifier based only on i-vectors performed better than a combination system using text features from speech transcriptions and i-vectors. In the fusion task, we experimented with systems that used combination of i-vectors with higher order n-grams features, combination of i-vectors with word unigrams, a mean probability ensemble, and a stacked ensemble system. Our finding is that word unigrams in combination with i-vectors achieve higher score than systems trained with larger number of n-gram features. Our best-performing systems achieved F1-scores of 87.16%, 83.33% and 91.75% on the essay track, the speech track and the fusion track respectively.

[1]  Preslav Nakov,et al.  Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task , 2016, VarDial@COLING.

[2]  Çagri Çöltekin,et al.  Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing , 2017, VarDial.

[3]  Aoife Cahill,et al.  Can characters reveal your native language? A language-independent approach to native language identification , 2014, EMNLP.

[4]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[5]  Çağrı Çöltekin,et al.  Discriminating Similar Languages with Linear SVMs and Neural Networks , 2016, VarDial@COLING.

[6]  Joel R. Tetreault,et al.  A Report on the 2017 Native Language Identification Shared Task , 2017, BEA@EMNLP.

[7]  Scott Jarvis,et al.  Maximizing Classification Accuracy in Native Language Identification , 2013, BEA@NAACL-HLT.

[8]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[9]  Shervin Malmasi,et al.  Native Language Identification using Stacked Generalization , 2017, ArXiv.

[10]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[11]  Walt Detmar Meurers,et al.  Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization , 2014, COLING.

[12]  Preslav Nakov,et al.  Findings of the VarDial Evaluation Campaign 2017 , 2017, VarDial.

[13]  Eduardo Coutinho,et al.  The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language , 2016, INTERSPEECH.

[14]  Shervin Malmasi,et al.  Multilingual native language identification , 2015, Natural Language Engineering.

[15]  S. Malmasi Native language identification: explorations and applications , 2016 .

[16]  Joel R. Tetreault,et al.  A Report on the First Native Language Identification Shared Task , 2013, BEA@NAACL-HLT.

[17]  Marine Carpuat,et al.  Feature Space Selection and Combination for Native Language Identification , 2013, BEA@NAACL-HLT.

[18]  Ashutosh Kumar Singh,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2010 .