Exploring Optimal Voting in Native Language Identification

We describe the submissions entered by the National Research Council Canada in the NLI-2017 evaluation. We mainly explored voting, along with several ways to optimize the choice and number of voting systems. We also explored features that require no linguistic preprocessing. Long character n-grams extracted from raw text yielded the best performance on all textual input (written essays and speech transcripts). Voting ensembles produced small performance gains, with little difference among the optimization strategies we tried. Our top systems achieved accuracies of 87% on the essay track, 84% on the speech track, and close to 92% by combining essays, speech, and i-vectors in the fusion track.
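
To make the two main ingredients concrete, the following is a minimal Python sketch of character n-gram extraction from raw text and of plurality voting over the labels predicted by several systems. It is an illustration under stated assumptions, not the submitted systems: the function names, the n-gram length range, and the tie-breaking rule are all hypothetical choices that the abstract does not specify.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """Count every character n-gram of length n_min..n_max in raw text.

    No tokenization or other linguistic preprocessing is applied.
    The default length range is an assumption made for illustration.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def plurality_vote(predictions):
    """Return the label predicted by the largest number of systems.

    Ties go to the tied label that appears first in `predictions`
    (an assumed rule), so ordering systems from strongest to weakest
    gives the strongest system the casting vote.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    tally = Counter(predictions)
    best = max(tally.values())
    for label in predictions:
        if tally[label] == best:
            return label


# Hypothetical usage: three systems label the same essay.
features = char_ngrams("I am agree with this", n_min=3, n_max=5)
winner = plurality_vote(["FRE", "GER", "FRE"])
```

In this sketch the n-gram counts would feed whatever classifier each voting system uses; choosing which systems vote, and how many, is the optimization question the abstract refers to and is not modeled here.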
