Determining an author's native language by mining a text for errors

In this paper, we show that stylistic text features can be exploited to determine an anonymous author's native language with high accuracy. Specifically, we first use automatic tools to measure the frequencies of various stylistic idiosyncrasies in a text. These frequencies then serve as features for support vector machines, which learn to classify texts according to the author's native language.
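
The feature-extraction step can be illustrated with a minimal sketch. The three regular-expression detectors below (`repeated_word`, `a_before_vowel`, `lowercase_i`) and the function name `idiosyncrasy_frequencies` are hypothetical stand-ins chosen for illustration; they are not the automatic tools or the feature set used in the paper, which this abstract does not specify. The sketch only shows how a text might be mapped to a vector of per-token idiosyncrasy frequencies.

```python
import re

# Hypothetical idiosyncrasy detectors, for illustration only.
# The paper's actual tools and feature set are not described in the abstract.
PATTERNS = {
    # The same word typed twice in a row, e.g. "the the".
    "repeated_word": re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE),
    # The article "a" used before a word starting with a vowel letter.
    "a_before_vowel": re.compile(r"\ba\s+[aeiou]\w*", re.IGNORECASE),
    # The pronoun "I" written in lowercase as a standalone token.
    "lowercase_i": re.compile(r"(?<![\w'])i(?![\w'])"),
}


def idiosyncrasy_frequencies(text):
    """Return one frequency per pattern: match count divided by token count."""
    tokens = re.findall(r"[A-Za-z']+", text)
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty input
    return [len(p.findall(text)) / n_tokens for p in PATTERNS.values()]


if __name__ == "__main__":
    print(idiosyncrasy_frequencies("i think the the answer is a apple"))
```

Vectors of this kind, computed for texts whose authors' native languages are known, could then be used to train a support vector machine; an off-the-shelf implementation such as scikit-learn's `LinearSVC` is one possible choice, though the abstract does not say which implementation the authors used.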
