Automatic authorship attribution based on character n-grams in Swiss German

Automatic authorship attribution aims to train computers to identify the author of a disputed text from idiolectal language features. For non-standard data, in the present study Swiss German instant messages, language-specific NLP toolkits are often unavailable, which limits the features available for classifying texts. The approach I propose for Swiss German is therefore based on character n-grams. Character n-grams require no language-specific NLP tools, are a proven feature for authorship attribution, and capture orthographic idiosyncrasies. The approach thus exploits Swiss German’s lack of standardised spelling rules, turning the challenge that Swiss German poses as non-standard data into an advantage. N-grams of different lengths were tested as features of a Naïve Bayes classifier, combined with training and test corpora of varying sizes; 6- and 7-grams identified authors without error for all combinations considered. The number of distinctive n-grams in an author’s data set was found to be a determining factor in the classifier’s success, highlighting the benefits of exploiting Swiss German’s non-standard nature for authorship identification.
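
To make the idea concrete, the following is a minimal sketch of a character n-gram Naïve Bayes classifier. It is not the implementation used in the study: the class name `CharNgramNaiveBayes`, the add-one (Laplace) smoothing, the choice of n = 3, and the four toy messages are all assumptions made for illustration. The messages are invented and merely imitate the kind of spelling variation (for example "nöd" versus "nid") that the approach exploits.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative sketch)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                          # n-gram length
        self.alpha = alpha                  # additive smoothing constant (assumed)
        self.counts = defaultdict(Counter)  # author -> n-gram counts
        self.docs = Counter()               # author -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, authors):
        for text, author in zip(texts, authors):
            grams = char_ngrams(text, self.n)
            self.counts[author].update(grams)
            self.docs[author] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        # One extra vocabulary slot reserves probability mass for unseen n-grams.
        vocab_size = len(self.vocab) + 1
        best_author, best_score = None, -math.inf
        for author, counts in self.counts.items():
            total = sum(counts.values())
            # Log prior plus summed log likelihoods of the observed n-grams.
            score = math.log(self.docs[author] / total_docs)
            for gram in grams:
                score += math.log(
                    (counts[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Invented toy data: two hypothetical authors with different spelling habits.
train_texts = [
    "ich chume hüt nöd, bin no am schaffe",
    "das isch nöd so schlimm, gäll",
    "i chume hüt nid, bi no am schaffe",
    "das isch nid so schlimm, gell",
]
train_authors = ["A", "A", "B", "B"]

clf = CharNgramNaiveBayes(n=3).fit(train_texts, train_authors)
print(clf.predict("das gaht nöd"))
print(clf.predict("das gaht nid"))
```

The sketch uses trigrams only because the toy corpus is tiny; with realistic amounts of data the n-gram length would be varied, as in the experiments summarised above.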
