Finnish Native Language Identification

We present the first application of Native Language Identification (NLI) to Finnish learner data. NLI is the task of predicting an author’s first language (L1) from their writing in an acquired language. Using data from a new learner corpus of Finnish, a language that is typologically quite different from those previously investigated and whose morphological richness could pose difficulties, we show that a combination of three feature types is useful for this task. Our system predicts an author’s L1 with an accuracy of 70%, against a baseline of 20%. Using the same features, we can also distinguish non-native from native writing with an accuracy of 97%. This methodology can be useful for studying language transfer effects, for developing teaching materials tailored to students’ native languages, and for forensic linguistics.
