Severely imbalanced Big Data challenges: investigating data sampling approaches

Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority class, and false negatives incur a greater penalty than false positives, this bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, draws its training data from one source (the SlowlorisBig dataset) and its test data from a separate source (the POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC) and Geometric Mean (GM) performance metrics; Random Undersampling nevertheless performs adequately in this case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique (SMOTE), SMOTE-borderline1, SMOTE-borderline2, and ADAptive SYNthetic (ADASYN)) on both AUC and GM. Based on its classification performance in both case studies, Random Undersampling is the best choice: it also trains models on a significantly smaller number of samples, thus reducing computational burden and training time.
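As a rough illustration of the two ideas the abstract leans on most, the sketch below shows Random Undersampling to a chosen class distribution and the Geometric Mean metric, taken here in its usual form as the square root of the product of the true positive rate and the true negative rate. It is plain Python rather than the Apache Spark implementation used in the study, and the function names, the `minority_fraction` parameter, and the toy labels are assumptions made only for this example; the paper's actual sampled ratios are not reproduced here.

```python
import math
import random


def random_undersample(labels, minority_fraction, seed=0):
    """Return sorted indices of the instances kept after undersampling.

    Hypothetical helper: keeps every minority (label 1) instance and draws
    majority (label 0) instances at random, without replacement, so that the
    minority class makes up roughly `minority_fraction` of the result.
    """
    if not 0.0 < minority_fraction < 1.0:
        raise ValueError("minority_fraction must be strictly between 0 and 1")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve pos / (pos + n_neg) = minority_fraction for n_neg.
    n_neg = round(len(pos) * (1.0 - minority_fraction) / minority_fraction)
    n_neg = min(n_neg, len(neg))  # cannot keep more negatives than exist
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, n_neg))


def geometric_mean(y_true, y_pred):
    """Geometric Mean of true positive rate and true negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


if __name__ == "__main__":
    # Toy data (an assumption): 10 positives among 1,000 instances.
    labels = [1] * 10 + [0] * 990
    kept = random_undersample(labels, minority_fraction=0.5)
    print(len(kept), sum(labels[i] for i in kept))
    print(geometric_mean([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because all minority instances are retained and only majority instances are discarded, the training set shrinks sharply under severe imbalance, which is the source of the reduced training time the abstract mentions.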
