Malicious Web Domain Identification using Online Credibility and Performance Data by Considering the Class Imbalance Issue

Purpose: Malicious web domain identification is of significant importance to the security protection of internet users. Using online credibility and performance data, this paper investigates the use of machine learning techniques for malicious web domain identification while accounting for the class imbalance issue (i.e. there are more benign web domains than malicious ones).

Design/methodology/approach: The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. SMOTE is used for oversampling and PSO for undersampling.

Findings: The proposed integrated resampling approach is comprehensively examined by applying eight well-known machine learning classifiers to several imbalanced web domain data sets with different imbalance ratios. Experimental results confirm that the proposed approach is highly effective compared with five other well-known resampling approaches.

Practical implications: This study not only encourages the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in malicious web domain identification.

Originality/value: Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed on real-world data sets with different imbalance ratios.
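The combination of SMOTE oversampling and PSO undersampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote` and `pso_undersample`, the PSO coefficients, the toy data and the placeholder fitness function (which only rewards a balanced class count) are all assumptions made for illustration. The abstract does not state the fitness function; in practice one would presumably score each candidate subset by the validation performance of a classifier.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the base itself is excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def pso_undersample(majority, fitness, n_particles=10, n_iter=20, rng=None):
    """Binary PSO sketch: each particle is a 0/1 mask over the majority samples."""
    rng = rng or random.Random(0)
    dim = len(majority)
    w, c1, c2, v_max = 0.7, 1.5, 1.5, 4.0  # assumed, untuned coefficients
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (best_pos[i][d] - pos[i][d])
                     + c2 * rng.random() * (g_pos[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                # the sigmoid of the velocity is the probability of keeping sample d
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > best_fit[i]:
                best_pos[i], best_fit[i] = pos[i][:], f
                if f > g_fit:
                    g_pos, g_fit = pos[i][:], f
    # keep only the majority samples selected by the best mask found
    return [x for x, keep in zip(majority, g_pos) if keep]


# Toy data (hypothetical): 4 malicious (minority) and 20 benign (majority) points.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority = [[float(i), float(i % 5)] for i in range(20)]

synthetic = smote(minority, n_new=6)
target = len(minority) + len(synthetic)
# Placeholder fitness: prefer masks that keep about as many benign samples
# as there are (original + synthetic) malicious samples.
kept = pso_undersample(majority, lambda mask: -abs(sum(mask) - target))
```

Here `synthetic` holds the new minority points and `kept` the retained majority points. Swapping the placeholder fitness for a classifier-based score would move the sketch closer to a usable resampling step.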
