A comparative study of the class imbalance problem in Twitter spam detection

Online social networks (OSNs) such as Twitter have recently become an important and popular source for real-time information and news dissemination, which makes Twitter a prime target of spammers. It has been shown that the security threats caused by Twitter spam can reach far beyond the social media platform itself. To mitigate the damage caused by Twitter spam, researchers and communities have employed machine learning classification algorithms to detect it. However, most of these studies have overlooked the class imbalance problem in Twitter spam detection. In this paper, we study that problem. First, we conduct a comparative study of popular methods for handling class imbalance in order to identify the most effective approach. Then, we conduct a second comparative study of Twitter spam detection based on several classic techniques. Experimental results demonstrate that fuzzy-based ensemble learning can significantly improve classification performance on imbalanced ground-truth Twitter data.
