Imbalanced Classification for Big Data

New developments in computation have enabled an explosion in both data generation and storage. The high value hidden within this large volume of data has attracted a growing number of researchers to the topic of Big Data analytics. The main difference between addressing Big Data applications and carrying out traditional data mining (DM) tasks is scalability. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. In essence, it applies a “divide-and-conquer” distributed procedure in a fault-tolerant way (supported by a distributed file system) so that it can run on commodity hardware. Apart from the difficulties of the Big Data problem itself, we must take into account that the events of interest might occur infrequently. Mining rare classes is already challenging in standard classification tasks; combining it with the need to process high volumes of data imposes a strong constraint on the development of solutions that are both accurate and scalable. This chapter is organized as follows. First, Sect. 13.1 provides a quick overview of Big Data analytics in the context of imbalanced classification. Then, Sect. 13.2 presents the topic of Big Data in detail, focusing on the MapReduce programming model, the Spark framework, and those software libraries that include Big Data implementations of ML algorithms. Section 13.3 gives an overview of the works that address imbalanced classification for Big Data problems. Then, Sect. 13.4 discusses the challenges and open problems in imbalanced Big Data classification. Finally, Sect. 13.5 summarizes and concludes the chapter.
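The “divide-and-conquer” idea behind MapReduce can be illustrated with a minimal single-machine sketch. It is not taken from the chapter: the function names (`map_partition`, `reduce_counts`, `class_distribution`) and the toy partitions are hypothetical, and a real deployment would run the map step in parallel on a cluster rather than in a Python list comprehension. The sketch counts class labels per partition (map), merges the partial counts (reduce), and derives the imbalance ratio.

```python
from collections import Counter
from functools import reduce


def map_partition(labels):
    """Map step: count the class labels seen in one data partition."""
    return Counter(labels)


def reduce_counts(left, right):
    """Reduce step: merge two partial class counts into one."""
    return left + right


def class_distribution(partitions):
    """Apply the map step to every partition, then fold the results."""
    # On a cluster each partition would be processed by a separate worker;
    # here the map step simply runs sequentially for illustration.
    partial_counts = [map_partition(p) for p in partitions]
    return reduce(reduce_counts, partial_counts, Counter())


# Hypothetical toy data: three partitions of a binary, imbalanced problem.
partitions = [
    ["neg"] * 48 + ["pos"] * 2,
    ["neg"] * 50,
    ["neg"] * 47 + ["pos"] * 3,
]

counts = class_distribution(partitions)
imbalance_ratio = counts["neg"] / counts["pos"]
print(counts, imbalance_ratio)
```

Note that the second toy partition contains no minority examples at all, so a local model trained only on that partition would never see the rare class. This is one way in which partitioning the data can interact with class imbalance.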
