Stylometric Analysis for Authorship Attribution on Twitter

Authorship Attribution (AA), the science of inferring the author of a given piece of text from its characteristics, is a problem with a long history. In this paper, we study authorship attribution for forensic purposes and present machine learning techniques and stylometric features that enable authorship to be determined at rates significantly better than chance for texts of 140 characters or less. The analysis targets the micro-blogging site Twitter, where people share their interests and thoughts in the form of short messages called "tweets". Millions of tweets are posted daily via this service, and the possibility that some of them carry sensitive or illegitimate text cannot be ruled out. The technique discussed in this paper is a two-stage process: in the first stage, stylometric information is extracted from the collected dataset, and in the second stage, different classification algorithms are trained to predict the authors of unseen text. The aim is to maximize prediction accuracy with an optimal amount of data and number of users under consideration.
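The two-stage process described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract names neither the specific stylometric features nor the classifiers, so the eight features below and the nearest-centroid classifier are stand-ins, and the function names `extract_features`, `train_centroids`, and `predict_author` are hypothetical.

```python
import math
import re
from collections import defaultdict


def extract_features(tweet):
    """Stage 1: map a tweet to a small vector of stylometric features.

    The feature set is an illustrative assumption, not the paper's.
    """
    words = tweet.split()
    n_chars = max(len(tweet), 1)   # avoid division by zero on empty input
    n_words = max(len(words), 1)
    return [
        len(tweet) / 140.0,                                   # relative length
        sum(len(w) for w in words) / n_words / 10.0,          # scaled mean word length
        sum(c.isupper() for c in tweet) / n_chars,            # uppercase ratio
        sum(c.isdigit() for c in tweet) / n_chars,            # digit ratio
        sum(c in "!?.,;:" for c in tweet) / n_chars,          # punctuation ratio
        tweet.count("#") / n_words,                           # hashtags per word
        tweet.count("@") / n_words,                           # mentions per word
        len(re.findall(r"[:;]-?[()DPp]", tweet)) / n_words,   # emoticons per word
    ]


def train_centroids(labelled_tweets):
    """Stage 2 (training): average the feature vectors of each author.

    `labelled_tweets` is an iterable of (author, tweet) pairs. A
    nearest-centroid model is used here only as a simple stand-in for
    the classification algorithms the paper refers to.
    """
    sums = {}
    counts = defaultdict(int)
    for author, tweet in labelled_tweets:
        feats = extract_features(tweet)
        if author not in sums:
            sums[author] = [0.0] * len(feats)
        sums[author] = [s + f for s, f in zip(sums[author], feats)]
        counts[author] += 1
    return {a: [s / counts[a] for s in total] for a, total in sums.items()}


def predict_author(centroids, tweet):
    """Stage 2 (prediction): return the author with the closest centroid."""
    feats = extract_features(tweet)
    return min(centroids, key=lambda a: math.dist(centroids[a], feats))
```

To use the sketch, pass a list of `(author, tweet)` pairs to `train_centroids` and then call `predict_author` on an unseen tweet. In practice the centroid model would likely be replaced by a stronger classifier, and accuracy would be measured on held-out tweets as the number of authors and tweets per author varies.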
