Imbalanced big data classification: a distributed implementation of SMOTE

In the domain of machine learning, quality of data is most critical component for building good models. Predictive analytics is an AI stream used to predict future events based on historical learnings and is used in diverse fields like predicting online frauds, oil slicks, intrusion attacks, credit defaults, prognosis of disease cells etc. Unfortunately, in most of these cases, traditional learning models fail to generate required results due to imbalanced nature of data. Here imbalance denotes small number of instances belonging to the class under prediction like fraud instances in the total online transactions. The prediction in imbalanced classification gets further limited due to factors like small disjuncts which get accentuated during the partitioning of data when learning at scale. Synthetic generation of minority class data (SMOTE [1]) is one pioneering approach by Chawla [1] to offset said limitations and generate more balanced datasets. Although there exists a standard implementation of SMOTE in python, it is unavailable for distributed computing environments for large datasets. Bringing SMOTE to distributed environment under spark is the key motivation for our research. In this paper we present our algorithm, observations and results for synthetic generation of minority class data under spark using Locality Sensitivity Hashing [LSH]. We were able to successfully demonstrate a distributed version of Spark SMOTE which generated quality artificial samples preserving spatial distribution1.

[1]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[2]  Xibei Yang,et al.  Recognition of Multiple Imbalanced Cancer Types Based on DNA Microarray Data Using Ensemble Classifiers , 2013, BioMed research international.

[3]  Bartosz Krawczyk,et al.  Learning from imbalanced data: open challenges and future directions , 2016, Progress in Artificial Intelligence.

[4]  Jesús Alcalá-Fdez,et al.  KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework , 2011, J. Multiple Valued Log. Soft Comput..

[5]  Andrew K. C. Wong,et al.  Classification of Imbalanced Data: a Review , 2009, Int. J. Pattern Recognit. Artif. Intell..

[6]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[7]  Francisco Herrera,et al.  IFROWANN: Imbalanced Fuzzy-Rough Ordered Weighted Average Nearest Neighbor Classification , 2015, IEEE Transactions on Fuzzy Systems.

[8]  Gustavo E. A. P. A. Batista,et al.  Learning with Class Skews and Small Disjuncts , 2004, SBIA.

[9]  Beng Chin Ooi,et al.  Efficient Processing of k Nearest Neighbor Joins using MapReduce , 2012, Proc. VLDB Endow..

[10]  Roberto Alejo,et al.  A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios , 2013, Pattern Recognit. Lett..

[11]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[12]  Francisco Herrera,et al.  On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on Intrusion Detection Systems , 2015, Expert Syst. Appl..

[13]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[14]  Charles X. Ling,et al.  Using AUC and accuracy in evaluating learning algorithms , 2005, IEEE Transactions on Knowledge and Data Engineering.

[15]  Michael A. Casey,et al.  Locality-Sensitive Hashing for Finding Nearest Neighbors , 2008 .

[16]  李航,et al.  A Parallel Oversampling Algorithm Based on NRSBoundary-SMOTE , 2014 .

[17]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[18]  Jon Louis Bentley,et al.  K-d trees for semidynamic point sets , 1990, SCG '90.

[19]  Pradeep Dubey,et al.  Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing , 2013, Proc. VLDB Endow..

[20]  Sergei Vassilvitskii,et al.  Scalable K-Means++ , 2012, Proc. VLDB Endow..

[21]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[22]  Nilanjan Dey,et al.  A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset , 2016, Comput. Methods Programs Biomed..

[23]  M. Slaney,et al.  Locality-Sensitive Hashing for Finding Nearest Neighbors [Lecture Notes] , 2008, IEEE Signal Processing Magazine.

[24]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[25]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[26]  Seetha Hari,et al.  Learning From Imbalanced Data , 2019, Advances in Computer and Electrical Engineering.

[27]  Francisco Herrera,et al.  An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics , 2013, Inf. Sci..

[28]  Din J. Wasem Mining of Massive Datasets , 2014 .

[29]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.