Verifying Online User Identity using Stylometric Analysis for Short Messages

Stylometry is the analysis of authors' linguistic style and writing characteristics for identification, characterization, or verification purposes. In this paper, we investigate authorship verification for the purpose of user authentication. In this setting, authentication consists of comparing a writing sample from an individual against the model or profile associated with the identity claimed by that individual at login time (i.e., 1-to-1 identity matching). In addition, the authentication process must be completed in a short period of time, which means analyzing short messages. Although a significant body of literature reports high accuracy rates for long documents, it remains challenging to accurately identify the authors of short, unstructured documents, in particular when dealing with large author populations. In this paper, we take some steps toward that goal by proposing a supervised learning technique combined with n-gram analysis for authorship verification of short texts. We introduce a new n-gram metric and study several n-gram sizes using a relatively large dataset. The experimental evaluation shows that our approach is more effective than existing approaches published in the literature.
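
To make the 1-to-1 matching setting concrete, the following is a minimal sketch of n-gram-based verification, not the method or the new n-gram metric proposed in this paper. It assumes character n-grams, a plain cosine similarity between frequency profiles, and a fixed decision threshold; the function names (`char_ngrams`, `build_profile`, `verify`) and the default threshold value are hypothetical. In a supervised setting the threshold would be tuned on labeled samples rather than set by hand.

```python
from collections import Counter
from math import sqrt
from typing import Iterable


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in a lowercased message."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_profile(messages: Iterable[str], n: int) -> Counter:
    """Aggregate n-gram counts over a user's known messages (enrollment)."""
    profile = Counter()
    for message in messages:
        profile.update(char_ngrams(message, n))
    return profile


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(profile: Counter, message: str, n: int = 3,
           threshold: float = 0.3) -> bool:
    """Accept the claimed identity if the message resembles the profile.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return cosine_similarity(profile, char_ngrams(message, n)) >= threshold
```

A profile is built once per enrolled user from that user's known messages; at login time, `verify` compares a new short message only against the profile of the claimed identity. Varying `n` corresponds to studying several n-gram sizes.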
