A Framework for Stylometric Similarity Detection in Online Settings

Online marketplaces and communication media such as email, websites, forums, and chat rooms have become ubiquitous in everyday life. Unfortunately, the anonymity of these channels makes them an ideal avenue for fraud, hacking, and other cybercrime. Anonymity and the sheer volume of online content make tracing online identities an essential yet arduous task for Internet users and human analysts. To address these challenges, we propose a framework for online stylometric analysis that helps distinguish authorship in online communities based on writing style. The framework combines a scalable identity-level similarity detection technique with an extensive stylistic feature set and an identity database. It is intended to support stylometric authentication for Internet users as well as forensic investigations. The proposed technique and extended feature set were evaluated on a test bed encompassing thousands of feedback comments posted by 100 electronic market traders. The method outperformed benchmark stylometric techniques, achieving an accuracy of approximately 95% when differentiating between 200 trader identities. The results indicate that the proposed stylometric analysis approach may help mitigate the abuse of online anonymity.
