SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification

Working with large amounts of data is now common, since our capacity to collect and store information has increased significantly. Extracting knowledge from data at this scale is commonly known as “Big Data,” and it is performed on large clusters with MapReduce platforms. Imbalanced classification poses a problem in both traditional and Big Data learning scenarios. Data sampling is one way to improve performance on imbalanced problems. A method for Big Data problems that runs on commodity hardware can offload these computations from the expensive and highly demanded hardware that MapReduce platforms require. The characteristics of some sampling methods make them suitable for adaptation to commodity hardware, taking advantage of the parallel computation capabilities of graphics processing units (GPUs). SMOTE, one of the most popular oversampling methods, is based on the nearest neighbor rule. The proposed SMOTE-GPU efficiently handles large datasets (several million instances) on a wide variety of commodity hardware, including a laptop computer.
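To make the nearest neighbor basis of SMOTE concrete, the following is a minimal CPU sketch of the classic algorithm: each synthetic instance is placed at a random point on the segment between a minority instance and one of its k nearest minority neighbors. It is not the SMOTE-GPU implementation; the function names `k_nearest` and `smote`, their parameters, and the brute-force neighbor search are assumptions made for illustration only.

```python
import math
import random


def k_nearest(points, index, k):
    """Indices of the k nearest neighbors of points[index] (brute force)."""
    target = points[index]
    distances = [
        (math.dist(target, other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()  # ties are broken by the lower index
    return [j for _, j in distances[:k]]


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic instances by interpolating minority instances.

    Hypothetical helper for illustration: `minority` is a list of
    equal-length numeric tuples holding only the minority class.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    # Neighbor lists are computed once and reused for every synthetic instance.
    neighbors = [k_nearest(minority, i, k) for i in range(len(minority))]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # pick a minority instance
        j = rng.choice(neighbors[i])       # pick one of its k neighbors
        gap = rng.random()                 # interpolation factor in [0, 1)
        base, other = minority[i], minority[j]
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(minority, 3, k=2):
        print(point)
```

In this sketch the brute-force neighbor search costs a number of distance computations quadratic in the number of minority instances, and each instance's search is independent of the others, which is the kind of workload that parallel hardware such as a GPU tends to handle well.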
