Blogs and Twitter Feeds: A Stylometric Environmental Impact Study

Stylometry is the study of determining the author of a document from the linguistic features it contains. Previous work in this area has yielded impressive results, but it assumes that the training and testing documents share key attributes, namely the domain and setting in which they are written. This paper focuses on the scenario where this assumption cannot be made. We find that standard stylometric methods do not perform well when the training and suspect documents differ in this way. For example, when working exclusively with blogs we obtain an average accuracy of 93.30%, and when working exclusively with Twitter feeds we obtain an average accuracy of over 98.99%. However, when we apply the same method to identify the author of a Twitter feed from blog writing, accuracy falls drastically. We provide a method that improves this cross-domain accuracy to 88.89%. Being able to identify authors across domains facilitates linking identities across the Internet, which makes this a key privacy concern.
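
To make the in-domain versus cross-domain distinction concrete, the following is a minimal sketch of a cross-domain evaluation: author profiles are built from one domain (blogs) and tested on another (tweets). It is not the method of this paper; the abstract does not specify the feature set or classifier, so the sketch assumes character trigram counts and a nearest-profile rule under cosine similarity. All function names and the toy texts are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram Counters (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(docs_by_author):
    """Build one aggregate n-gram profile per author from training documents."""
    profiles = {}
    for author, docs in docs_by_author.items():
        profile = Counter()
        for doc in docs:
            profile.update(char_ngrams(doc))
        profiles[author] = profile
    return profiles


def attribute(profiles, text):
    """Return the author whose profile is most similar to the text."""
    grams = char_ngrams(text)
    return max(profiles, key=lambda author: cosine(profiles[author], grams))


def accuracy(profiles, labelled_docs):
    """Fraction of (author, text) pairs attributed to the correct author."""
    correct = sum(attribute(profiles, text) == author for author, text in labelled_docs)
    return correct / len(labelled_docs)


if __name__ == "__main__":
    # Hypothetical toy data: train on blog-style text, test on tweet-style text.
    blogs = {
        "alice": ["I have been thinking a great deal about gardening lately."],
        "bob": ["Honestly, the match last night was nothing short of a disaster."],
    }
    tweets = [
        ("alice", "thinking about the garden again"),
        ("bob", "what a disaster of a match"),
    ]
    profiles = train(blogs)
    print("cross-domain accuracy:", accuracy(profiles, tweets))
```

Running the same `accuracy` call on held-out blog posts instead of tweets gives the in-domain figure, and comparing the two numbers is the kind of gap the abstract describes. A toy example like this says nothing about the magnitudes reported above.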
