The Story of the Characters, the DNA and the Native Language

This paper presents our approach to the 2013 Native Language Identification (NLI) Shared Task, which relies on machine learning methods that work at the character level. More precisely, we used several string kernels and a kernel based on Local Rank Distance (LRD). Our best system was a kernel combination of a string kernel and the LRD kernel. While string kernels have been used before in text analysis tasks, LRD is a distance measure originally designed to work on DNA sequences. In this work, LRD is applied successfully to native language identification. The Unibuc team ranked third in the closed NLI Shared Task, a result that is all the more notable given that our approach is language independent and linguistic theory neutral.
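To make the idea of a character-level string kernel concrete, the following is a minimal sketch of a p-spectrum kernel, which scores two texts by the character n-grams they share. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which string kernels were used, and the function names, the default n-gram length `p=3`, and the normalization step are all hypothetical choices made for this example. The LRD kernel is not shown.

```python
import math
from collections import Counter


def char_ngrams(text, p):
    """Count every contiguous character n-gram of length p in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(s, t, p=3):
    """Assumed p-spectrum kernel: sum over shared n-grams of count products."""
    cs, ct = char_ngrams(s, p), char_ngrams(t, p)
    # Iterate over the smaller table; the result is symmetric either way.
    if len(cs) > len(ct):
        cs, ct = ct, cs
    return sum(count * ct[gram] for gram, count in cs.items())


def normalized_spectrum_kernel(s, t, p=3):
    """Scale the kernel to [0, 1] so that text length does not dominate."""
    kss = spectrum_kernel(s, s, p)
    ktt = spectrum_kernel(t, t, p)
    if kss == 0 or ktt == 0:
        # At least one text is shorter than p characters.
        return 0.0
    return spectrum_kernel(s, t, p) / math.sqrt(kss * ktt)
```

Because the kernel only looks at raw characters, it needs no tokenizer, tagger, or parser, which is what makes an approach of this kind language independent. A matrix of such kernel values over all pairs of essays could then be passed to any kernel-based classifier.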
