Addressing the Big Data Multi-class Imbalance Problem with Oversampling and Deep Learning Neural Networks

The class imbalance problem is a challenging situation in machine learning, and it appears frequently in recent Big Data applications. The most studied techniques for dealing with it are Random Over Sampling (ROS), Random Under Sampling (RUS) and the Synthetic Minority Over-sampling Technique (SMOTE), especially in two-class scenarios. At Big Data scale, however, multi-class imbalance scenarios have not yet been studied extensively, and only a few investigations have been performed. In this work, the effectiveness of ROS and SMOTE is analyzed in the Big Data multi-class imbalance context. The KDD99 dataset, a popular multi-class imbalanced Big Data set, was used to test these oversampling techniques prior to the application of a Deep Learning Multi-Layer Perceptron. Results show that ROS and SMOTE are not always enough to improve classifier performance on the minority classes. However, they slightly increase the overall performance of the classifier in comparison to the unsampled data.
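As a rough illustration of the two oversampling techniques named above, the following NumPy sketch shows one way ROS and a SMOTE-style interpolation could be written. It is not the implementation used in this work: the function names `random_oversample` and `smote_like`, the choice of balancing every class up to the size of the largest one, and the default of k = 5 neighbours are all assumptions made for the example.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """ROS sketch: duplicate randomly chosen samples of each class
    until every class has as many samples as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing samples with replacement (none for the largest class).
        extra = rng.choice(idx, size=target - count, replace=True)
        parts.append(np.concatenate([idx, extra]))
    order = np.concatenate(parts)
    return X[order], y[order]


def smote_like(X_min, n_new, k=5, seed=0):
    """SMOTE-style sketch: each synthetic point lies on the segment between
    a minority sample and one of its k nearest minority neighbours.
    Assumes X_min holds at least two samples of a single minority class."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Brute-force pairwise Euclidean distances (fine for a small example only).
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Toy three-class data with class sizes 50, 5 and 2 (made up for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(57, 4))
y = np.array([0] * 50 + [1] * 5 + [2] * 2)

X_ros, y_ros = random_oversample(X, y)
X_syn = smote_like(X[y == 1], n_new=45)
```

ROS only repeats existing minority samples, whereas the SMOTE-style step creates new points between neighbouring minority samples. The brute-force distance matrix used here would not scale to a data set the size of KDD99; a distributed or approximate neighbour search would be needed in that setting.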
