Twitter spam account detection based on clustering and classification methods

The Twitter social network has grown in popularity as the social activity of its registered users has increased. Twitter serves two functions as an online social network (OSN): it acts as a microblogging OSN and, at the same time, as a news update platform. Recently, the growth in Twitter social interactions has attracted the attention of cybercriminals. Spammers have used Twitter to spread malicious messages, post phishing links, flood the network with fake accounts, and engage in other malicious activities. Detecting the network of spammers who engage in these activities is an important step toward identifying individual spam accounts. Researchers have proposed a number of approaches to identify groups of spammers. However, each of these approaches addresses a specific category of spammer. This paper proposes a different approach to detecting spammers on Twitter, based on the similarities that exist among spam accounts. A number of features were introduced to improve the performance of the three classification algorithms selected in this study. The proposed approach applied principal component analysis and a tuned K-means algorithm to cluster over 200,000 accounts, randomly selected from more than 2 million tweets, in order to detect clusters of spammers. Experimental results show that Random Forest achieved the highest accuracy, 96.30%, followed by the multilayer perceptron with 96.00% and the support vector machine with 95.60%. An evaluation of the selected classifiers under class imbalance also showed that Random Forest achieved the highest accuracy, precision, recall, and F-measure.
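
The abstract reports accuracy together with precision, recall, and F-measure under class imbalance. The following minimal sketch illustrates why the additional metrics matter when spam accounts are rare. It is not the paper's evaluation code: the function name `binary_metrics`, the label encoding (1 for spam, 0 for legitimate), and the toy label counts are all assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F-measure for one positive class.

    Hypothetical helper for illustration; labels are assumed to be
    1 (spam) and 0 (legitimate).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Toy imbalanced sample (made-up numbers): 95 legitimate accounts, 5 spam.
y_true = [0] * 95 + [1] * 5

# A trivial classifier that labels every account as legitimate.
all_legit = [0] * 100

# A classifier that flags 2 legitimate accounts by mistake and
# catches 4 of the 5 spam accounts.
better = [1] * 2 + [0] * 93 + [1] * 4 + [0] * 1

trivial_scores = binary_metrics(y_true, all_legit)
better_scores = binary_metrics(y_true, better)
print(trivial_scores)
print(better_scores)
```

On this toy sample the trivial classifier reaches 95% accuracy while detecting no spam at all, so its precision, recall, and F-measure are all zero. The second classifier is only slightly more accurate, but its recall and F-measure show that it actually finds spam accounts.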
