Statistical Twitter Spam Detection Demystified: Performance, Stability and Scalability

As the Internet becomes more accessible and devices more mobile, people are spending an increasing amount of time on social networks. The popularity of online social networks has, however, attracted cyber criminals, who spam these platforms in search of potential victims. Spam lures users to external phishing sites or malware downloads, which has become a serious issue for online safety and undermines the user experience. Nevertheless, current solutions fail to detect Twitter spam precisely and effectively. In this paper, we compare the performance of a wide range of mainstream machine learning algorithms, aiming to identify those that offer satisfactory detection performance and stability on a large amount of ground truth data. With the goal of achieving real-time Twitter spam detection, we further evaluate the algorithms in terms of scalability. The performance study evaluates detection accuracy, the true/false positive rates and the F-measure; the stability study examines how consistently the algorithms perform when trained on randomly selected samples of different sizes. The scalability study aims to better understand how a parallel computing environment reduces the training and testing time of machine learning algorithms.
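The evaluation quantities named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of accuracy, true/false positive rate, F-measure (taken here as the balanced F1 score) and parallel speedup; the names `ConfusionCounts`, `detection_metrics` and `speedup` are hypothetical and chosen for illustration, and the numbers in the usage lines are made up rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing predicted labels with ground truth (spam = positive)."""
    tp: int  # spam tweets correctly flagged as spam
    fp: int  # legitimate tweets wrongly flagged as spam
    tn: int  # legitimate tweets correctly passed
    fn: int  # spam tweets that were missed


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(c: ConfusionCounts) -> dict:
    """Compute accuracy, TPR, FPR and F-measure (F1) from confusion counts."""
    tpr = _ratio(c.tp, c.tp + c.fn)        # true positive rate (recall)
    fpr = _ratio(c.fp, c.fp + c.tn)        # false positive rate
    precision = _ratio(c.tp, c.tp + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "tpr": tpr,
        "fpr": fpr,
        "f_measure": _ratio(2 * precision * tpr, precision + tpr),
    }


def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Speedup of a parallel run relative to the serial run of the same task."""
    return serial_seconds / parallel_seconds


# Illustrative values only (assumed, not results from the paper).
metrics = detection_metrics(ConfusionCounts(tp=80, fp=10, tn=90, fn=20))
training_speedup = speedup(serial_seconds=120.0, parallel_seconds=30.0)
```

A speedup greater than the number of workers would be superlinear; whether and when that occurs depends on the algorithm and the hardware, which this sketch does not model.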
