Paper information - Feature Space Selection and Combination for Native Language Identification

Feature Space Selection and Combination for Native Language Identification

We describe the submissions made by the National Research Council Canada to the Native Language Identification (NLI) shared task. Our submissions rely on a Support Vector Machine classifier, on feature spaces built from a variety of lexical, spelling, and syntactic features, and on a simple model combination strategy based on a majority vote between classifiers. Somewhat surprisingly, a classifier relying on purely lexical features performed very well and proved difficult to outperform significantly using various combinations of feature spaces. However, combining multiple predictors allowed us to exploit their different strengths and provided a significant boost in performance.
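The majority-vote combination mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `majority_vote`, the classifier names, the three-letter language labels, and the tie-breaking rule (prefer the label from the earliest-listed classifier) are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine label predictions from several classifiers by plurality vote.

    predictions[i][j] is the label that classifier i assigns to document j.
    Assumption: ties go to the tied label that appears first in classifier
    order, so a stronger classifier should be listed first.
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_docs = len(predictions[0])
    if any(len(p) != n_docs for p in predictions):
        raise ValueError("all classifiers must label the same documents")

    combined = []
    for votes in zip(*predictions):  # the votes cast for one document
        counts = Counter(votes)
        top = max(counts.values())
        # Scan in classifier order so ties resolve deterministically.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Hypothetical outputs of three classifiers, each trained on one feature space.
lexical = ["FRE", "GER", "SPA"]
spelling = ["FRE", "ITA", "ITA"]
syntactic = ["ARA", "GER", "ITA"]

print(majority_vote([lexical, spelling, syntactic]))
```

In this toy run the lexical classifier is outvoted on the third document, which shows how a vote can correct an individual predictor's error when the other predictors agree.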
