Measuring Interlanguage: Native Language Identification with L1-influence Metrics

Native language (L1) identification suffers from a relative paucity of useful training corpora, and standard within-corpus evaluation is often problematic because of topic bias. In this paper, we introduce a method for L1 identification in second language (L2) texts that relies only on the much more plentiful L1 data, rather than on the L2 texts traditionally used for training. Specifically, we translate large L1 blog corpora word by word to create a mapping to L2 forms that could plausibly result from language transfer, and then use that information for unsupervised classification. We show that this method is effective on several different learner corpora, with bigram features proving particularly useful.
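The general idea can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicons, the example sentences, the relative-frequency scoring, and all function names (`word_by_word`, `build_profile`, `identify_l1`) are assumptions made for illustration, since the abstract does not specify the actual metrics. The sketch translates L1 sentences word by word into English, collects the resulting bigrams into one profile per L1, and assigns an L2 text to the L1 whose profile it overlaps most.

```python
from collections import Counter

def word_by_word(tokens, lexicon):
    """Translate each L1 token independently, keeping L1 word order.
    Tokens missing from the (hypothetical) lexicon are skipped."""
    return [lexicon[t] for t in tokens if t in lexicon]

def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

def build_profile(l1_sentences, lexicon):
    """Relative frequency of each English bigram produced by
    word-by-word translation of the L1 sentences."""
    counts = Counter()
    for sentence in l1_sentences:
        translated = word_by_word(sentence.lower().split(), lexicon)
        counts.update(bigrams(translated))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()} if total else {}

def score(l2_text, profile):
    """Sum the profile weights of the bigrams found in the L2 text
    (an assumed, simplistic L1-influence score)."""
    tokens = l2_text.lower().split()
    return sum(profile.get(bg, 0.0) for bg in bigrams(tokens))

def identify_l1(l2_text, profiles):
    """Pick the L1 with the highest score; no L2 training data is used."""
    scores = {l1: score(l2_text, p) for l1, p in profiles.items()}
    return max(scores, key=scores.get), scores

# Toy data standing in for large L1 blog corpora and bilingual lexicons.
LEXICONS = {
    "fr": {"elle": "she", "a": "has", "faim": "hunger",
           "nous": "we", "avons": "have"},
    "de": {"ich": "i", "bin": "am", "seit": "since", "drei": "three",
           "jahren": "years", "hier": "here", "wir": "we", "sind": "are"},
}
L1_CORPORA = {
    "fr": ["elle a faim", "nous avons faim"],
    "de": ["ich bin seit drei jahren hier", "wir sind seit drei jahren hier"],
}
PROFILES = {l1: build_profile(L1_CORPORA[l1], LEXICONS[l1]) for l1 in LEXICONS}

if __name__ == "__main__":
    for text in ["I am since three years here", "She has hunger"]:
        print(text, "->", identify_l1(text, PROFILES)[0])
```

In this toy setting, a learner sentence that preserves German word order shares bigrams such as "since three" with the translated German profile, so it scores higher for German than for French. A realistic system would need far larger corpora, proper tokenization, and some way to discount bigrams that are also common in native English.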
