Blogs, Twitter Feeds, and Reddit Comments: Cross-domain Authorship Attribution

Abstract: Stylometry is a form of authorship attribution that relies on linguistic information to attribute documents of unknown authorship based on the writing styles of a suspect set of authors. This paper focuses on the cross-domain subproblem, where the known and suspect documents differ in the setting in which they were created. Three distinct domains are explored in this work: Twitter feeds, blog entries, and Reddit comments. We determine that state-of-the-art methods in stylometry do not perform as well in cross-domain situations (34.3% accuracy) as they do in in-domain situations (83.5% accuracy), and we propose feature-level and classification-level techniques that improve cross-domain performance, increasing accuracy to as much as 70%. In addition to testing these approaches on a large real-world dataset, we also examine real-world adversarial cases where an author is actively attempting to hide their identity. Being able to identify authors across domains facilitates linking identities across the Internet, making this a key security and privacy concern; users can take other measures to protect their anonymity, but because of their unique writing style, they may not be as anonymous as they believe.
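To make the cross-domain setting concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes character trigram counts as style features and nearest-profile cosine similarity as the classifier; the function names (`char_ngrams`, `cosine`, `attribute`) and the toy data are hypothetical and chosen only for illustration. The known texts would come from one domain (for example, blog entries) and the unknown text from another (for example, a tweet).

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text (assumed feature set)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def attribute(unknown_text, known_texts):
    """Return the author whose merged profile is most similar to the unknown text.

    known_texts maps each suspect author to a list of documents from the
    source domain; unknown_text may come from a different domain.
    """
    target = char_ngrams(unknown_text)
    profiles = {
        author: sum((char_ngrams(t) for t in texts), Counter())
        for author, texts in known_texts.items()
    }
    return max(profiles, key=lambda author: cosine(target, profiles[author]))


# Toy data only: known texts stand in for one domain, the query for another.
known = {
    "alice": ["zzzz zzzz zzzz"],
    "bob": ["qqqq qqqq"],
}
predicted = attribute("zzzz", known)
```

A baseline like this learns domain-specific habits (hashtags, message length, quoting conventions) together with author style, which is one plausible reason accuracy can fall when the training and test domains differ.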
