Using N-gram and Word Network Features for Native Language Identification

We report on the performance of two feature sets in the Native Language Identification Shared Task (Tetreault et al., 2013). The feature sets draw on existing literature on native language identification and on word networks. In our experiments, the word network features perform competitively with the baseline feature set, which is a promising result. We also discuss a feature analysis based on information gain and give an overview of how different word network features perform in the Native Language Identification task.
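As a rough illustration of the information-gain analysis mentioned above, the sketch below scores a discrete feature by how much it reduces the entropy of the class labels, IG = H(labels) - H(labels | feature). The function names, the L1 codes, and the toy feature vectors are hypothetical and chosen only for this example; they are not taken from the paper, and the paper's own analysis may well have used a toolkit such as WEKA rather than hand-written code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Information gain of a discrete feature: H(labels) - H(labels | feature)."""
    total = len(labels)
    if total == 0:
        return 0.0
    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy is the size-weighted average entropy of each group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four essays, two native languages.
labels = ["DEU", "DEU", "FRA", "FRA"]
separating_feature = [1, 1, 0, 0]     # present only in the DEU essays
uninformative_feature = [1, 0, 1, 0]  # spread evenly across both classes

ig_separating = information_gain(separating_feature, labels)
ig_uninformative = information_gain(uninformative_feature, labels)
```

A feature that splits the essays cleanly by native language receives the full entropy of the label distribution as its gain, while a feature distributed evenly across classes receives none. Ranking features by this score is one common way to see which ones carry signal about the writer's L1.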

[1] David Yarowsky, et al. Stylometric Analysis of Scientific Articles, 2012, NAACL.

[2] Mitsuru Ishizuka, et al. A Document as a Small World, 2001, JSAI Workshops.

[3] Ian H. Witten, et al. The WEKA data mining software: an update, 2009, SKDD.

[4] Martin Chodorow, et al. TOEFL11: A Corpus of Non-Native English, 2013.

[5] Ramon Ferrer i Cancho, et al. The small world of human language, 2001, Proceedings of the Royal Society of London. Series B: Biological Sciences.

[6] Fuchun Peng, et al. N-gram-based Author Profiles for Authorship Attribution, 2003.

[7] Walt Detmar Meurers, et al. Native Language Identification using Recurring n-grams – Investigating Abstraction and Domain Dependence, 2012, COLING.

[8] Marc Reznicek, et al. Stylometry and the interplay of topic and L1 in the different annotation layers in the FALKO corpus, 2011.

[9] Graeme Hirst, et al. Native language detection with 'cheap' learner corpora, 2013.

[10] Mark Dras, et al. Exploring Adaptor Grammars for Native Language Identification, 2012, EMNLP.

[11] Rada Mihalcea, et al. TextRank: Bringing Order into Text, 2004, EMNLP.

[12] Benjamin Swanson, et al. Native Language Detection with Tree Substitution Grammars, 2012, ACL.

[13] Dominique Estival, et al. TAT: An Author Profiling Tool with Application to Arabic Emails, 2007, ALTA.

[14] Joel R. Tetreault, et al. A Report on the First Native Language Identification Shared Task, 2013, BEA@NAACL-HLT.

[15] Efstathios Stamatatos. A survey of modern authorship attribution methods, 2009.

[16] Hans van Halteren, et al. Linguistic profiling of texts for the purpose of language verification, 2004, COLING.

[17] Scott Jarvis, et al. Approaching language transfer through text classification: explorations in the detection-based approach, 2012.

[18] Mark Dras, et al. Exploiting Parse Structures for Native Language Identification, 2011, EMNLP.

[19] Ekaterina Kochmar, et al. Identification of a Writer's Native Language by Error Analysis, 2011.

[20] I.N. Bozkurt, et al. Authorship attribution, 2007, 2007 22nd International Symposium on Computer and Information Sciences.

[21] Danielle S. McNamara, et al. The comparative and combined contributions of n-grams, Coh-Metrix indices, and error types in the L1 classification of learner texts, 2012.

[22] Gábor Csárdi, et al. The igraph software package for complex network research, 2006.

[23] Shlomo Argamon, et al. Computational methods in authorship attribution, 2009, J. Assoc. Inf. Sci. Technol.

[24] Vladimir Batagelj, et al. An O(m) Algorithm for Cores Decomposition of Networks, 2003, ArXiv.

[25] Sylviane Granger, et al. Error patterns and automatic L1 identification, 2012.

[26] Mark Dras, et al. Topic Modeling for Native Language Identification, 2011, ALTA.

[27] Danielle S. McNamara, et al. Detecting the first language of second language writers using automated indices of cohesion, lexical sophistication, syntactic complexity and conceptual knowledge, 2012.

[28] Graeme Hirst, et al. Robust, Lexicalized Native Language Identification, 2012, COLING.

[29] Mark Dras, et al. Contrastive Analysis and Native Language Identification, 2009, ALTA.

[30] Graeme Hirst, et al. Measuring Interlanguage: Native Language Identification with L1-influence Metrics, 2012, LREC.

[31] John Yearwood, et al. Using psycholinguistic features for profiling first language of authors, 2012, J. Assoc. Inf. Sci. Technol.

[32] Moshe Koppel, et al. Determining an author's native language by mining a text for errors, 2005, KDD '05.