Authorship Attribution for Twitter in 140 Characters or Less

Authorship attribution is a growing field, moving from its beginnings in linguistics to recent advances in text mining. With this change has come an increase in the capability of authorship attribution methods, both in their accuracy and in their ability to handle more difficult problems. Research into authorship attribution in the 19th century considered it difficult to determine the authorship of a document of fewer than 1000 words. By the 1990s this value had decreased to fewer than 500 words, and in the early 21st century it was considered possible to determine the authorship of a document of 250 words. The need for this ever-decreasing limit is exemplified by the trend towards many shorter communications rather than fewer longer ones, such as the move from traditional multi-page handwritten letters to shorter, more focused emails. The same trend appears in online crime, where many attacks, such as phishing or bullying, are performed using very concise language. Cybercrime messages have long been hosted on Internet Relay Chat (IRC) channels, which allow members to hide behind screen names and connect anonymously. More recently, Twitter and other short-message web services have been used as a hosting ground for online crimes. This paper presents evaluations of current techniques and identifies new preprocessing methods that enable authorship to be determined at rates significantly better than chance for documents of 140 characters or less, a format popularised by the micro-blogging website Twitter. We show that the SCAP methodology performs extremely well on Twitter messages and, even with restrictions on the types of information allowed, such as the recipient of directed messages, still performs significantly better than chance. Further to this, we show that 120 tweets per user is an important threshold, beyond which adding more tweets per user gives only a small, non-significant increase in accuracy.
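As a rough illustration of the approach named above, the following is a minimal Python sketch of SCAP-style attribution as the method is commonly described: each author is represented by the set of their most frequent byte-level n-grams, and a message is assigned to the author whose profile shares the most n-grams with it. The function names, the n-gram length `n=3`, the profile size `profile_size=500`, and the `@user` placeholder used to hide the recipient of directed messages are all assumptions made for this example, not values or preprocessing steps taken from the paper.

```python
import re
from collections import Counter


def mask_mentions(text, placeholder="@user"):
    """Replace every @username with a fixed token (hypothetical preprocessing)."""
    return re.sub(r"@\w+", placeholder, text)


def build_profile(texts, n=3, profile_size=500):
    """Return the set of the most frequent byte-level n-grams in the texts."""
    counts = Counter()
    for text in texts:
        data = text.encode("utf-8")
        # Slide a window of n bytes across the encoded text.
        counts.update(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(profile_size)}


def attribute(text, author_profiles, n=3, profile_size=500):
    """Pick the author whose profile shares the most n-grams with the text."""
    text_profile = build_profile([text], n, profile_size)
    return max(
        author_profiles,
        key=lambda author: len(author_profiles[author] & text_profile),
    )


# Toy data, invented purely for illustration.
training = {
    "alice": ["lol omg that is sooo funny!!!", "omg lol cant wait!!!"],
    "bob": ["The quarterly report is attached.",
            "Please review the attached report."],
}
profiles = {
    author: build_profile([mask_mentions(t) for t in tweets])
    for author, tweets in training.items()
}
predicted = attribute(mask_mentions("omg that was sooo good lol!!!"), profiles)
```

Byte-level n-grams keep punctuation, spacing, and capitalisation, which may matter more than vocabulary when each message is only 140 characters long.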
