Using Improved Conditional Generative Adversarial Networks to Detect Social Bots on Twitter

The detection and removal of malicious social bots in social networks has become an area of interest in both industry and academia. Widely used bot detection methods based on machine learning suffer from an imbalance in the number of samples across categories, and the resulting classifier bias leads to a low detection rate for minority samples. We therefore propose an improved conditional generative adversarial network (improved CGAN) that extends imbalanced data sets before classifiers are trained, in order to improve the detection accuracy of social bots. To generate the auxiliary condition, we propose a modified clustering algorithm, the Gaussian kernel density peak clustering algorithm (GKDPCA), which avoids generating noise during data augmentation and eliminates imbalances between and within social bot class distributions. Furthermore, we improve the CGAN convergence criterion by introducing the Wasserstein distance with a gradient penalty, which addresses the mode collapse and vanishing gradients of the traditional CGAN. We also study how the imbalance degree and the expansion ratio of the original data affect oversampling. Experimental comparisons with three common oversampling algorithms show that the improved CGAN achieves higher F1-score, G-mean, and AUC.
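The abstract reports F1-score and G-mean without defining them. As a minimal sketch, assuming the usual binary definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity) and treating bots as the positive minority class, the following shows why these scores are preferred over plain accuracy on imbalanced data. The function names and the toy labels are illustrative assumptions, not taken from the paper; AUC is omitted because it needs classifier scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem with the given positive label."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity (recall on bots) and specificity (recall on humans)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced sample (assumed data): 2 bots (label 1) among 10 accounts.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A biased classifier that labels every account as human.
y_all_human = [0] * 10
tp, fp, tn, fn = confusion_counts(y_true, y_all_human)
accuracy = (tp + tn) / len(y_true)
print(accuracy, f1_score(tp, fp, fn), g_mean(tp, fp, tn, fn))

# A classifier that finds one of the two bots and raises one false alarm.
y_mixed = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
tp2, fp2, tn2, fn2 = confusion_counts(y_true, y_mixed)
print(f1_score(tp2, fp2, fn2), g_mean(tp2, fp2, tn2, fn2))
```

In the first case the classifier reaches 80% accuracy while detecting no bots, and both F1-score and G-mean drop to zero, which is the minority-class failure that oversampling is meant to reduce.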
