Benchmarking framework for class imbalance problem using novel sampling approach for big data

Traditional machine learning techniques must be adapted to cope with the vast scale of big data if learning is to remain systematic and methodical. An unbalanced distribution of classes in big data, commonly known as imbalanced big data, makes the learning problem considerably harder. Conventional methods are being progressively modified to mitigate the problem of learning from imbalanced datasets in the big data context, both at the data level and at the algorithmic level. In the current study, a cluster-head-based data-level sampling solution that combines the strengths of the K-Means and Fuzzy C-Means clustering approaches is applied. The proposed approach is evaluated with three different classifiers, namely Support Vector Machines, Decision Tree and k-Nearest Neighbor, and compared with the conventional SMOTE algorithm. The experiments show promising results, with increments of 8.09% in accuracy and 35.71% in AUC across all imbalanced datasets. This work provides a baseline comparison of data-level solutions for imbalanced classification in the big data scenario and proposes an efficient clustering-based solution for the same problem.
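The general idea of sampling by cluster heads can be illustrated with a minimal sketch. The abstract does not specify the procedure, so the code below rests on several assumptions: it uses plain K-Means only (no Fuzzy C-Means component), it undersamples the majority class by replacing it with as many cluster centroids as there are minority samples, and every function name is hypothetical. It should be read as one possible instance of clustering-based sampling, not as the authors' method.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_heads(points, k, iterations=20, seed=0):
    """Plain Lloyd-style K-Means; returns k centroids ("cluster heads")."""
    rng = random.Random(seed)
    heads = rng.sample(points, k)  # initialise heads from the data itself
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, heads[i]))
            clusters[nearest].append(p)
        # Recompute each head; keep the old head if its cluster is empty.
        heads = [mean_point(c) if c else heads[i] for i, c in enumerate(clusters)]
    return heads


def balance_with_cluster_heads(majority, minority, seed=0):
    """Assumed scheme: shrink the majority class to len(minority) cluster heads."""
    k = min(len(minority), len(majority))
    heads = kmeans_heads(list(majority), k, seed=seed)
    samples = heads + list(minority)
    labels = [0] * len(heads) + [1] * len(minority)  # 0 = majority, 1 = minority
    return samples, labels


if __name__ == "__main__":
    majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
                (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
    minority = [(2.0, 2.0), (3.0, 3.0)]
    samples, labels = balance_with_cluster_heads(majority, minority)
    print(len(samples), labels.count(0), labels.count(1))
```

In this sketch the balanced set contains equal numbers of majority heads and minority samples, and it could then be passed to any classifier. A Fuzzy C-Means variant would replace the hard nearest-head assignment with membership weights; that part is left out here because the abstract gives no detail on how the two clusterings are combined.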
