Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem

The class imbalance problem has been a hot topic in the machine learning community in recent years, and it remains relevant in the era of big data and deep learning. Much work has been done to deal with it, with random sampling methods (over- and under-sampling) being the most widely used approaches. More sophisticated sampling methods have also been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), which has in turn been combined with cleaning techniques such as Edited Nearest Neighbor (ENN) or Tomek links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, the class imbalance problem has mostly been addressed by adapting traditional techniques, with comparatively little attention given to intelligent approaches. This work therefore analyzes the capabilities and possibilities of heuristic sampling methods for deep learning neural networks in the big data domain, with particular attention to the cleaning strategies. The study uses big data, multi-class imbalanced datasets obtained from hyperspectral remote sensing images. The effectiveness of a hybrid approach on these datasets is analyzed: the dataset is first oversampled with SMOTE and an Artificial Neural Network (ANN) is trained on the resulting data; the ANN output is then processed with ENN to eliminate output noise, and the ANN is trained again on the resultant dataset. The results suggest that the best classification outcome is achieved when the cleaning strategies are applied to the ANN output rather than to the input feature space only. Consequently, the classifier's nature clearly needs to be considered when classical class imbalance approaches are adapted to deep learning and big data scenarios.
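The hybrid approach described above (SMOTE on the input space, a first training pass, ENN on the network output, then a second training pass) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the helper names (`nearest`, `smote`, `enn_keep`, `hybrid_resample`), the choice of k = 3, the toy two-dimensional data, and especially `fit_centroids`, a nearest-centroid scorer standing in for the ANN, are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    order = sorted((j for j in range(len(points)) if j != idx),
                   key=lambda j: math.dist(points[idx], points[j]))
    return order[:k]


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()
        synthetic.append([a + gap * (b - a)
                          for a, b in zip(minority[i], minority[j])])
    return synthetic


def enn_keep(points, labels, k=3):
    """Edited Nearest Neighbor: return the indices of samples whose label
    agrees with the majority label of their k nearest neighbours."""
    keep = []
    for i in range(len(points)):
        votes = Counter(labels[j] for j in nearest(i, points, k))
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return keep


def hybrid_resample(X, y, fit, k=3, rng=None):
    """SMOTE in the input space, then ENN in the model's output space.
    `fit(X, y)` is a hypothetical callable that trains a model and returns
    a function mapping one sample to its output vector."""
    rng = rng or random.Random(0)
    X, y = list(X), list(y)
    counts = Counter(y)
    target = max(counts.values())
    for label, count in counts.items():
        if 1 < count < target:
            minority = [x for x, l in zip(X, y) if l == label]
            new = smote(minority, target - count, min(k, count - 1), rng)
            X += new
            y += [label] * len(new)
    model = fit(X, y)                       # first training pass
    outputs = [model(x) for x in X]         # project samples to output space
    keep = enn_keep(outputs, y, k)          # clean noise in the output space
    X_clean = [X[i] for i in keep]
    y_clean = [y[i] for i in keep]
    return X_clean, y_clean, fit(X_clean, y_clean)  # second training pass


def fit_centroids(X, y):
    """Stand-in for the ANN (an assumption for this demo): the output vector
    is the negative distance from a sample to each class centroid."""
    centroids = []
    for c in sorted(set(y)):
        members = [x for x, l in zip(X, y) if l == c]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    return lambda x: [-math.dist(x, c) for c in centroids]


# Toy multi-class imbalanced data: three Gaussian blobs of sizes 40, 10, 5.
data_rng = random.Random(42)


def blob(cx, cy, n):
    return [[data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5)] for _ in range(n)]


X = blob(0, 0, 40) + blob(3, 3, 10) + blob(0, 4, 5)
y = [0] * 40 + [1] * 10 + [2] * 5
X_clean, y_clean, model = hybrid_resample(X, y, fit_centroids)
print(Counter(y_clean))
```

In a realistic setting, `fit` would train the neural network and return a function giving its output-layer activations, so that ENN removes samples that look noisy from the classifier's point of view rather than in the raw feature space. The brute-force neighbour search here is quadratic and would need a distributed or approximate replacement at big data scale.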
