Statistical Detection of Online Drifting Twitter Spam: Invited Paper

Spam has become a critical problem in online social networks. This paper focuses on Twitter spam detection. Recent research applies machine learning techniques to Twitter spam detection, making use of the statistical features of tweets. We observe that existing machine learning based detection methods suffer from the problem of Twitter spam drift, i.e., the statistical properties of spam tweets vary over time. An effective way to counter this problem is to train a new Twitter spam classifier every day. However, this approach is hampered by small and imbalanced training sets, because labelling spam samples is time-consuming. This paper proposes a new method to address this challenge. The new method employs two new techniques: fuzzy-based redistribution and asymmetric sampling. We develop a fuzzy-based information decomposition technique to redistribute the spam class and generate more spam samples. In addition, we propose an asymmetric sampling technique to rebalance the numbers of spam and non-spam samples in the training data. Finally, we apply an ensemble technique to combine spam classifiers trained on two different training sets. A number of experiments are performed on a real-world 10-day ground-truth dataset to evaluate the new method. Experimental results show that the new method can significantly improve detection performance for drifting Twitter spam.
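
The rebalancing and ensemble steps can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify the fuzzy-based information decomposition, so plain random resampling stands in for it here, and the function names `rebalance` and `ensemble_score`, the `target_per_class` parameter, and the score-averaging rule are all assumptions made for illustration.

```python
import random


def rebalance(spam, ham, target_per_class, seed=0):
    """Resample both classes toward target_per_class samples each.

    Assumption: plain random resampling stands in for the paper's
    fuzzy-based redistribution, which the abstract does not specify.
    The spam list must be non-empty.
    """
    rng = random.Random(seed)
    spam_out = list(spam)
    if len(spam_out) < target_per_class:
        # Minority class: oversample with replacement to fill the gap.
        spam_out += rng.choices(spam, k=target_per_class - len(spam_out))
    else:
        spam_out = rng.sample(spam_out, target_per_class)
    # Majority class: undersample without replacement. If there are fewer
    # non-spam samples than the target, all of them are kept.
    ham_out = rng.sample(list(ham), min(target_per_class, len(ham)))
    return spam_out, ham_out


def ensemble_score(scores_a, scores_b):
    """Average per-tweet spam scores from two classifiers (hypothetical rule)."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


if __name__ == "__main__":
    # Toy feature vectors: 5 labelled spam tweets versus 100 non-spam tweets.
    spam = [(1.0, float(i)) for i in range(5)]
    ham = [(0.0, float(i)) for i in range(100)]
    spam_bal, ham_bal = rebalance(spam, ham, target_per_class=20)
    print(len(spam_bal), len(ham_bal))
```

Treating the two classes differently (sampling spam with replacement, non-spam without) is what makes the sampling asymmetric in this sketch; the actual sampling ratios and the way the two training sets are built are described in the full paper, not in this abstract.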
