A Comparison of Performance Metrics with Severely Imbalanced Network Security Big Data

Severe class imbalance between the majority and minority classes in large datasets can prejudice Machine Learning classifiers toward the majority class. Our work uniquely consolidates two case studies, each utilizing three learners implemented within an Apache Spark framework, six sampling methods, and five sampling distribution ratios to analyze the effect of severe class imbalance on big data analytics. We use three performance metrics to evaluate this study: Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, and Geometric Mean. In the first case study, models were trained on one dataset (POST) and tested on another (SlowlorisBig). In the second case study, the training and testing dataset roles were switched. Our comparison of performance metrics shows that Area Under the Precision-Recall Curve and Geometric Mean are sensitive to changes in the sampling distribution ratio, whereas Area Under the Receiver Operating Characteristic Curve is relatively unaffected. In addition, we demonstrate that when comparing sampling methods, borderline-SMOTE2 outperforms the other methods in the first case study, and Random Undersampling is the top performer in the second case study.

[1]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[2]  Nitesh V. Chawla,et al.  SPECIAL ISSUE ON LEARNING FROM IMBALANCED DATA SETS , 2004 .

[3]  J. Alberto Espinosa,et al.  Big Data: Issues and Challenges Moving Forward , 2013, 2013 46th Hawaii International Conference on System Sciences.

[4]  Taghi M. Khoshgoftaar,et al.  Deep learning applications and challenges in big data analytics , 2015, Journal of Big Data.

[5]  Taghi M. Khoshgoftaar,et al.  A survey of open source tools for machine learning with big data in the Hadoop ecosystem , 2015, Journal of Big Data.

[6]  Seetha Hari,et al.  Learning From Imbalanced Data , 2019, Advances in Computer and Electrical Engineering.

[7]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[8]  Ameet Talwalkar,et al.  MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[9]  Taghi M. Khoshgoftaar,et al.  Experimental perspectives on learning from imbalanced data , 2007, ICML '07.

[10]  S. R,et al.  Data Mining with Big Data , 2017, 2017 11th International Conference on Intelligent Systems and Control (ISCO).

[11]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[12]  Taghi M. Khoshgoftaar,et al.  Using Random Undersampling to Alleviate Class Imbalance on Tweet Sentiment Data , 2015, 2015 IEEE International Conference on Information Reuse and Integration.

[13]  Oksana Yevsieieva,et al.  Analysis of the impact of the slow HTTP DOS and DDOS attacks on the cloud environment , 2017, 2017 4th International Scientific-Practical Conference Problems of Infocommunications. Science and Technology (PIC S&T).

[14]  Taghi M. Khoshgoftaar,et al.  A review of statistical and machine learning methods for modeling cancer risk using structured clinical data , 2018, Artif. Intell. Medicine.

[15]  Takaya Saito,et al.  The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets , 2015, PloS one.

[16]  Angappa Gunasekaran,et al.  Big Data in Healthcare Management: A Review of Literature , 2018 .

[17]  Siti Mariyam Shamsuddin,et al.  Classification with class imbalance problem: A review , 2015, SOCO 2015.

[18]  Seong-hun Park,et al.  Highway traffic accident prediction using VDS big data analysis , 2016, The Journal of Supercomputing.

[19]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[20]  J. Hess,et al.  Analysis of variance , 2018, Transfusion.

[21]  Yu-hua Liu,et al.  A DoS attack situation assessment method based on QoS , 2011, Proceedings of 2011 International Conference on Computer Science and Network Technology.

[22]  Taghi M. Khoshgoftaar,et al.  An Empirical Study of Learning from Imbalanced Data Using Random Forest , 2007, 19th IEEE International Conference on Tools with Artificial Intelligence(ICTAI 2007).

[23]  J. Galindo,et al.  Credit Risk Assessment Using Statistical and Machine Learning: Basic Methodology and Risk Modeling Applications , 2000 .

[24]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[25]  Taghi M. Khoshgoftaar,et al.  Survey on deep learning with class imbalance , 2019, J. Big Data.

[26]  Francisco Herrera,et al.  On the use of MapReduce for imbalanced big data using Random Forest , 2014, Inf. Sci..

[27]  Taghi M. Khoshgoftaar,et al.  Detecting Slow HTTP POST DoS Attacks Using Netflow Features , 2019, FLAIRS.

[28]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[29]  Chad Calvert,et al.  Detection of Slowloris Attacks Using Netflow Traffic , 2018 .

[30]  Francisco Herrera,et al.  An insight into imbalanced Big Data classification: outcomes and challenges , 2017 .

[31]  Farah Magrabi,et al.  Using statistical text classification to identify health information technology incidents , 2013, J. Am. Medical Informatics Assoc..

[32]  R. Real,et al.  AUC: a misleading measure of the performance of predictive distribution models , 2008 .

[33]  Fernando Nogueira,et al.  Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning , 2016, J. Mach. Learn. Res..

[34]  Julian D Olden,et al.  Machine Learning Methods Without Tears: A Primer for Ecologists , 2008, The Quarterly Review of Biology.

[35]  Toyoo Takata,et al.  A Defense Method against Distributed Slow HTTP DoS Attack , 2016, 2016 19th International Conference on Network-Based Information Systems (NBiS).

[36]  Taghi M. Khoshgoftaar,et al.  A Multi-dimensional Comparison of Toolkits for Machine Learning with Big Data , 2015, 2015 IEEE International Conference on Information Reuse and Integration.

[37]  J. Tukey Comparing individual means in the analysis of variance. , 1949, Biometrics.

[38]  Nitesh V. Chawla,et al.  Data Mining for Imbalanced Datasets: An Overview , 2005, The Data Mining and Knowledge Discovery Handbook.

[39]  Francisco Herrera,et al.  ROSEFW-RF: The winner algorithm for the ECBDL'14 big data competition: An extremely imbalanced big data bioinformatics problem , 2015, Knowl. Based Syst..

[40]  Taghi M. Khoshgoftaar,et al.  A Study on the Relationships of Classifier Performance Metrics , 2009, 2009 21st IEEE International Conference on Tools with Artificial Intelligence.

[41]  Taghi M. Khoshgoftaar,et al.  A survey on addressing high-class imbalance in big data , 2018, Journal of Big Data.

[42]  Seong-hun Park,et al.  Large Imbalance Data Classification Based on MapReduce for Traffic Accident Prediction , 2014, 2014 Eighth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing.