On the use of MapReduce for imbalanced big data using Random Forest

In this age, big data applications are increasingly becoming the main focus of attention because of the enormous increment of data generation and storage that has taken place in the last years. This situation becomes a challenge when huge amounts of data are processed to extract knowledge because the data mining techniques are not adapted to the new space and time requirements. Furthermore, real-world data applications usually present a class distribution where the samples that belong to one class, which is precisely the main interest, are hugely outnumbered by the samples of the other classes. This circumstance, known as the class imbalance problem, complicates the learning process as the standard learning techniques do not correctly address this situation.In this work, we analyse the performance of several techniques used to deal with imbalanced datasets in the big data scenario using the Random Forest classifier. Specifically, oversampling, undersampling and cost-sensitive learning have been adapted to big data using MapReduce so that these techniques are able to manage datasets as large as needed providing the necessary support to correctly identify the underrepresented class. The Random Forest classifier provides a solid basis for the comparison because of its performance, robustness and versatility.An experimental study is carried out to evaluate the performance of the diverse algorithms considered. The results obtained show that there is not an approach to imbalanced big data classification that outperforms the others for all the data considered when using Random Forest. Moreover, even for the same type of problem, the best performing method is dependent on the number of mappers selected to run the experiments. In most of the cases, when the number of splits is increased, an improvement in the running times can be observed, however, this progress in times is obtained at the expense of a slight drop in the accuracy performance obtained. This decrement in the performance is related to the lack of density problem, which is evaluated in this work from the imbalanced data point of view, as this issue degrades the performance of classifiers in the imbalanced scenario more severely than in standard learning.

[1]  Sung-Kwun Oh,et al.  The design of polynomial function-based neural network predictors for detection of software defects , 2013, Inf. Sci..

[2]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[3]  María José del Jesús,et al.  A hierarchical genetic fuzzy system based on genetic programming for addressing classification with highly imbalanced and borderline data-sets , 2013, Knowl. Based Syst..

[4]  Francisco Herrera,et al.  A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[5]  Anil K. Jain,et al.  Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Alekh Jindal,et al.  Hadoop++ , 2010 .

[7]  John Langford,et al.  Cost-sensitive learning by cost-proportionate example weighting , 2003, Third IEEE International Conference on Data Mining.

[8]  Mikel Galar,et al.  Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches , 2013, Knowl. Based Syst..

[9]  R. Barandelaa,et al.  Strategies for learning in class imbalance problems , 2003, Pattern Recognit..

[10]  Helmut Krcmar,et al.  Big Data , 2014, Wirtschaftsinf..

[11]  Zheng Shao,et al.  Data warehousing and analytics infrastructure at facebook , 2010, SIGMOD Conference.

[12]  Taghi M. Khoshgoftaar,et al.  An empirical study of the classification performance of learners on imbalanced and noisy software quality data , 2014, Inf. Sci..

[13]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[14]  Francisco Herrera,et al.  A unifying view on dataset shift in classification , 2012, Pattern Recognit..

[15]  Xue-wen Chen,et al.  Combating the Small Sample Class Imbalance Problem Using Feature Selection , 2010, IEEE Transactions on Knowledge and Data Engineering.

[16]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[17]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[18]  Antanas Verikas,et al.  Mining data with random forests: A survey and results of new tests , 2011, Pattern Recognit..

[19]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[20]  Vasile Palade,et al.  Adjusted Geometric-mean: a Novel Performance Measure for Imbalanced Bioinformatics Datasets Learning , 2012, J. Bioinform. Comput. Biol..

[21]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[22]  I. Tomek,et al.  Two Modifications of CNN , 1976 .

[23]  Mukesh K. Mohania,et al.  Cloud Computing and Big Data Analytics: What Is New from Databases Perspective? , 2012, BDA.

[24]  David A. Cieslak,et al.  Automatically countering imbalance and its empirical relationship to cost , 2008, Data Mining and Knowledge Discovery.

[25]  Ester Bernadó-Mansilla,et al.  Evolutionary rule-based systems for imbalanced data sets , 2008, Soft Comput..

[26]  Jing Zhao,et al.  ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data , 2013, Neurocomputing.

[27]  Ronen Feldman,et al.  The Data Mining and Knowledge Discovery Handbook , 2005 .

[28]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[29]  Michael Minelli,et al.  Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses , 2012 .

[30]  Szymon Wilk,et al.  Learning from Imbalanced Data in Presence of Noisy and Borderline Examples , 2010, RSCTC.

[31]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[32]  Gary M. Weiss The Impact of Small Disjuncts on Classifier Learning , 2010, Data Mining.

[33]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[34]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[35]  José Salvador Sánchez,et al.  On the k-NN performance in a challenging scenario of imbalance and overlapping , 2008, Pattern Analysis and Applications.

[36]  Andrew K. C. Wong,et al.  Classification of Imbalanced Data: a Review , 2009, Int. J. Pattern Recognit. Artif. Intell..

[37]  Misha Denil,et al.  Overlap versus Imbalance , 2010, Canadian Conference on AI.

[38]  Francisco Herrera,et al.  Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics , 2012, Expert Syst. Appl..

[39]  Bianca Zadrozny,et al.  Learning and making decisions when costs and probabilities are both unknown , 2001, KDD '01.

[40]  Francisco Herrera,et al.  Evolutionary-based selection of generalized instances for imbalanced classification , 2012, Knowl. Based Syst..

[42]  Alex Alves Freitas,et al.  Coping with Unbalanced Class Data Sets in Oral Absorption Models , 2013, J. Chem. Inf. Model..

[43]  Gary M. Weiss Mining with Rare Cases , 2010, Data Mining and Knowledge Discovery Handbook.

[44]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[45]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[46]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[47]  Yi-Ping Phoebe Chen,et al.  Computational intelligence for heart disease diagnosis: A medical knowledge driven approach , 2013, Expert Syst. Appl..

[48]  Donald. Miner,et al.  MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems , 2012 .

[49]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[50]  Ligang Zhou,et al.  Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods , 2013, Knowl. Based Syst..

[51]  Sean Owen,et al.  Mahout in Action , 2011 .

[52]  Francisco Herrera,et al.  An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics , 2013, Inf. Sci..

[53]  B. C. Brookes,et al.  Information Sciences , 2020, Cognitive Skills You Need for the 21st Century.

[54]  Francisco Herrera,et al.  Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data , 2015, Fuzzy Sets Syst..

[55]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[56]  Chao Chen,et al.  Using Random Forest to Learn Imbalanced Data , 2004 .