Semi-Supervised Spam Detection in Twitter Stream

Most existing techniques for spam detection on Twitter aim to identify and block users who post spam tweets. In this paper, we propose a semi-supervised spam detection (S3D) framework that operates at the tweet level. The proposed framework consists of two main modules: a spam detection module operating in real-time mode and a model update module operating in batch mode. The spam detection module consists of four lightweight detectors: 1) a blacklisted domain detector to label tweets containing blacklisted URLs; 2) a near-duplicate detector to label tweets that are near-duplicates of confidently prelabeled tweets; 3) a reliable ham detector to label tweets that are posted by trusted users and that do not contain spammy words; and 4) a multiclassifier-based detector to label the remaining tweets. The information required by the detection module is updated in batch mode based on the tweets that are labeled in the previous time window. Experiments on a large-scale data set show that the framework adaptively learns patterns of new spam activities and maintains good accuracy for spam detection in a tweet stream.
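
The four-detector cascade described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. All names (`Tweet`, `DetectorState`, `classify`), the Jaccard threshold of 0.8, and the majority-vote rule are assumptions made for illustration; the abstract does not say how near-duplicates are measured or how the classifier outputs are combined. The batch-mode update module is not shown; here it would simply replace the contents of `DetectorState` after each time window.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple
from urllib.parse import urlparse

SPAM, HAM = "spam", "ham"
URL_RE = re.compile(r"https?://\S+")


@dataclass
class Tweet:
    user: str
    text: str


def tokens(text: str) -> FrozenSet[str]:
    """Lowercase the text, drop URLs, and return the set of word tokens."""
    return frozenset(re.findall(r"[a-z0-9#@']+", URL_RE.sub(" ", text.lower())))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class DetectorState:
    """Information the detectors need; hypothetically refreshed in batch mode."""
    blacklisted_domains: Set[str]
    labeled_tweets: Dict[FrozenSet[str], str]  # token set -> confident label
    trusted_users: Set[str]
    spammy_words: Set[str]
    classifiers: List[Callable[[Tweet], str]]
    dup_threshold: float = 0.8  # assumed value, not taken from the paper


def blacklisted_domain_detector(t: Tweet, s: DetectorState) -> Optional[str]:
    for url in URL_RE.findall(t.text):
        if urlparse(url).hostname in s.blacklisted_domains:
            return SPAM
    return None


def near_duplicate_detector(t: Tweet, s: DetectorState) -> Optional[str]:
    # Linear scan for clarity; a real system would need an index.
    sig = tokens(t.text)
    for known, label in s.labeled_tweets.items():
        if jaccard(sig, known) >= s.dup_threshold:
            return label
    return None


def reliable_ham_detector(t: Tweet, s: DetectorState) -> Optional[str]:
    if t.user in s.trusted_users and not (tokens(t.text) & s.spammy_words):
        return HAM
    return None


def multiclassifier_detector(t: Tweet, s: DetectorState) -> str:
    # Assumption: strict majority vote, with ties resolved as ham.
    votes = [clf(t) for clf in s.classifiers]
    return SPAM if votes.count(SPAM) * 2 > len(votes) else HAM


def classify(t: Tweet, s: DetectorState) -> Tuple[str, str]:
    """Run the detectors in order; return (label, name of deciding detector)."""
    cascade = [
        ("blacklisted_domain", blacklisted_domain_detector),
        ("near_duplicate", near_duplicate_detector),
        ("reliable_ham", reliable_ham_detector),
    ]
    for name, detector in cascade:
        label = detector(t, s)
        if label is not None:
            return label, name
    return multiclassifier_detector(t, s), "multiclassifier"
```

Each of the first three detectors either returns a label or abstains with `None`, so only tweets that none of them can decide reach the classifier stage.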
