A comparison of classifiers and features for authorship authentication of social networking messages

This paper develops algorithms and reports experiments with several classifiers for determining the authenticity of short social network postings from Facebook, which average 20.6 words. The goal of this research is to determine the degree to which such postings can be authenticated as coming from the purported user rather than from an intruder. Various sets of stylometric and ad hoc social networking features were developed to categorize 9259 posts from 30 Facebook authors as authentic or non-authentic. An algorithm that applies machine-learning classifiers to this problem is described, and an additional voting algorithm that combines three classifiers is investigated. This research is one of the first works to focus on authorship authentication in short messages, such as postings on social network sites. The challenges of applying traditional stylometry techniques to short messages are discussed. Experimental results demonstrate an average accuracy of 79.6% across the 30 users. Further empirical analyses evaluate the effect of sample size, feature selection, user writing style, and classification method on authorship authentication, indicating varying degrees of success compared with previous studies. Copyright © 2016 John Wiley & Sons, Ltd.
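
The abstract does not specify how the voting algorithm combines its three classifiers. One common reading is simple majority voting over the authentic/non-authentic label, which can be sketched as follows. Everything in this sketch is an assumption for illustration: the function name `majority_vote` and the three toy rule-based classifiers are hypothetical stand-ins, not the paper's classifiers or features.

```python
from collections import Counter
from typing import Callable, Sequence

Label = str
Classifier = Callable[[str], Label]


def majority_vote(classifiers: Sequence[Classifier], post: str) -> Label:
    """Return the label predicted by most classifiers for one post.

    With three classifiers and two labels a strict majority always exists.
    """
    votes = Counter(clf(post) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical toy classifiers (assumptions, not the paper's models).
# Each one encodes a single made-up habit of the purported user.
def short_post_rule(post: str) -> Label:
    # Assumes the user rarely writes more than 25 words.
    return "authentic" if len(post.split()) <= 25 else "non-authentic"


def emoticon_rule(post: str) -> Label:
    # Assumes the user habitually ends posts with ":)".
    return "authentic" if ":)" in post else "non-authentic"


def lowercase_rule(post: str) -> Label:
    # Assumes the user writes entirely in lowercase.
    return "authentic" if post == post.lower() else "non-authentic"


ensemble = [short_post_rule, emoticon_rule, lowercase_rule]
print(majority_vote(ensemble, "see you tonight :)"))
print(majority_vote(ensemble, "Click HERE to claim your prize"))
```

In practice each callable would wrap a trained model applied to a feature vector extracted from the post; the voting step itself would stay the same.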
