A feature selection method for author identification in interactive communications based on supervised learning and language typicality

Abstract: Authorship attribution, understood as identifying which of several candidate authors wrote a text, has been a very active research area, supported mainly by advances in Natural Language Processing (NLP), machine learning and Computational Intelligence. The problem has mostly been addressed from a literary perspective, with the aim of identifying the stylometric features and writeprints that unequivocally characterize a writer's patterns and allow the writer to be uniquely identified. Meanwhile, the upsurge of social networking platforms and interactive messaging has made the anonymous expression of feelings, the sharing of experiences and the building of social relationships much easier than in traditional communication media. Unfortunately, the popularity of such communities and the virtual identities of their users provide fertile ground for cybercrimes against unsuspecting victims and for other illegal uses of social networks, which call for tracing the activity of accounts. In the context of one-to-one communications, this manuscript proposes identifying the sender of a message as a useful approach to detecting impersonation attacks in interactive communication scenarios. In particular, this work proposes selecting linguistic features, extracted from messages via NLP techniques, by means of a novel feature selection algorithm based on separating the essential traits of the sender from the influence of the receiver. The performance and computational efficiency of different supervised learning models incorporating the proposed feature selection method are shown to be promising on real SMS data in terms of identification accuracy, paving the way for future research on applying the concept of language typicality in the field of discourse analysis.
