Maximizing Classification Accuracy in Native Language Identification

This paper reports our contribution to the 2013 NLI Shared Task. The purpose of the task was to train a machine-learning system to identify the native-language affiliations of 1,100 texts written in English by nonnative speakers as part of a high-stakes test of general academic English proficiency. We trained our system on the new TOEFL11 corpus, which includes 11,000 essays written by nonnative speakers from 11 native-language backgrounds. Our final system used an SVM classifier with over 400,000 unique features consisting of lexical and POS n-grams occurring in at least two texts in the training set. Our system identified the correct native-language affiliations of 83.6% of the texts in the test set. This was the highest classification accuracy achieved in the 2013 NLI Shared Task.
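To make the feature-selection criterion concrete, the following is a minimal sketch of keeping only n-grams that occur in at least two training texts. It is an illustration under stated assumptions, not the system described above: the function names (`ngrams`, `build_vocabulary`, `vectorize`), the whitespace tokenization, the lowercasing, the maximum n-gram length, and the toy sentences are all hypothetical, and the sketch covers lexical n-grams only. POS n-grams and the SVM training step are omitted.

```python
from collections import Counter

def ngrams(tokens, n_max=3):
    """Yield every contiguous n-gram of length 1..n_max as a tuple."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def build_vocabulary(texts, n_max=3, min_df=2):
    """Return the sorted n-grams that occur in at least min_df texts."""
    doc_freq = Counter()
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        # A set ensures each n-gram is counted once per text.
        doc_freq.update(set(ngrams(tokens, n_max)))
    return sorted(g for g, count in doc_freq.items() if count >= min_df)

def vectorize(text, vocabulary, n_max=3):
    """Map a text to raw n-gram counts over the retained vocabulary."""
    counts = Counter(ngrams(text.lower().split(), n_max))
    return [counts[g] for g in vocabulary]

# Invented toy data, not taken from TOEFL11.
toy_texts = [
    "i am agree with this statement",
    "i am agree that students learn more",
    "students learn more from books",
]
vocab = build_vocabulary(toy_texts, n_max=2)
vector = vectorize("i am agree", vocab, n_max=2)
```

The resulting count vectors could then be passed to a linear SVM implementation of the reader's choice. The document-frequency threshold discards n-grams that appear in only one text, which keeps the feature space from being dominated by idiosyncratic strings.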
