Identification of a Writer's Native Language by Error Analysis

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the regulation length of 15,000 words, including tables and footnotes.

Summary

In this project, we investigate the task of native language identification. We study a set of Indo-European languages and demonstrate how machine learning techniques can be used to identify the native language of a text's author. A number of different features are extracted and applied to this task, and their contribution to overall performance is investigated and reported. We explore two hypotheses: that the choice of words in a free text is influenced by the writer's native language, and that the errors a writer commits stem from differences between the writer's native language system and that of English. We identify the error types typical of speakers of different native languages and show how features based on these discriminative error types can improve classification.

Acknowledgments

I would like to thank my supervisor, Prof. Ted Briscoe, for his guidance and constant support. I am grateful for his encouragement and valuable suggestions throughout the course of this work. I would also like to thank Helen Yannakoudakis and Øistein Andersen for their much-appreciated help and for their ability to identify the weak spots in my work and suggest improvements.
