A Social Spam Detection Framework via Semi-supervised Learning

With the increasing popularity of social networking websites such as Twitter, Facebook, Sina Weibo and MySpace, spammers on these platforms have become increasingly rampant. Social spammers typically create or compromise large numbers of accounts to deceive users and lead them to malicious websites that contain illegal, pornographic or dangerous content. Most studies on social spam detection rely on supervised machine learning, which requires large annotated datasets. Unfortunately, labeling large amounts of data manually is a complex, error-prone and tedious task that can cost considerable human effort and time. In this paper, we propose a novel semi-supervised classification framework for social spam detection that combines co-training with k-medoids. First, we use the k-medoids clustering algorithm to select informative and representative samples for labeling as our initial seed set. Then we use the content features and the behavior features of users as the two views of our co-training classification framework. To illustrate the effectiveness of k-medoids, we compare its performance with that of a random selection strategy. Finally, we evaluate the effectiveness of the proposed detection framework against several classical supervised algorithms.
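The two-stage pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the distance measure, the number of seeds, the base classifiers or the co-training schedule, so the Euclidean distance, the toy nearest-centroid classifier (`CentroidClassifier`), the parameters `rounds` and `per_round`, and the synthetic two-view data in `demo` are all assumptions made purely for illustration.

```python
import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    """Plain alternating k-medoids; returns the row indices of the k medoids."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances (an assumed choice of metric).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        assign = dist[:, medoids].argmin(axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if members.size:
                # New medoid: member with the smallest total distance to its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new[c] = members[within.argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids

class CentroidClassifier:
    """Toy nearest-centroid classifier standing in for any base learner."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        w = np.exp(-d)  # closer centroid -> higher pseudo-probability
        return w / w.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]

def co_train(Xa, Xb, labels, rounds=10, per_round=5):
    """Co-training on two views; labels uses -1 for unlabeled samples."""
    labels = labels.copy()
    clf_a, clf_b = CentroidClassifier(), CentroidClassifier()
    for _ in range(rounds):
        known = labels != -1
        clf_a.fit(Xa[known], labels[known])
        clf_b.fit(Xb[known], labels[known])
        # Each view pseudo-labels its most confident unlabeled samples,
        # which enlarges the training set seen by the other view next round.
        for clf, X in ((clf_a, Xa), (clf_b, Xb)):
            pool = np.flatnonzero(labels == -1)
            if pool.size == 0:
                break
            proba = clf.predict_proba(X[pool])
            top = proba.max(axis=1).argsort()[-per_round:]
            labels[pool[top]] = clf.classes_[proba[top].argmax(axis=1)]
    known = labels != -1
    clf_a.fit(Xa[known], labels[known])
    clf_b.fit(Xb[known], labels[known])
    return clf_a, clf_b, labels

def demo(n=100, k=6, seed=0):
    """Synthetic data only: two well-separated classes, two 2-D views."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n)
    Xa = rng.normal(size=(2 * n, 2)) + 10.0 * y[:, None]  # stand-in for content features
    Xb = rng.normal(size=(2 * n, 2)) + 10.0 * y[:, None]  # stand-in for behavior features
    seeds = k_medoids(np.hstack([Xa, Xb]), k)
    labels = np.full(2 * n, -1)
    labels[seeds] = y[seeds]  # true labels stand in for manual annotation of the seeds
    clf_a, clf_b, _ = co_train(Xa, Xb, labels)
    return seeds, (clf_a.predict(Xa) == y).mean(), (clf_b.predict(Xb) == y).mean()
```

Calling `demo()` returns the seed indices and the accuracy of each view's classifier on the synthetic data. Replacing the `k_medoids` call with a uniform random draw of `k` indices gives the random-selection baseline that the abstract compares against.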
