Radial-Based Undersampling for Imbalanced Data Classification

Data imbalance remains one of the most pervasive problems affecting contemporary machine learning. Its negative effect on traditional learning algorithms is most severe in combination with other dataset difficulty factors, such as small disjuncts, the presence of outliers, and an insufficient number of training observations. These difficulty factors can also limit the applicability of some methods of dealing with data imbalance, in particular the neighborhood-based oversampling algorithms derived from SMOTE. Radial-Based Oversampling (RBO) was previously proposed to mitigate some of the limitations of the neighborhood-based methods. In this paper, we examine the possibility of utilizing the concept of mutual class potential, which guides the oversampling process in RBO, in the undersampling procedure. The conducted computational complexity analysis indicates a significantly reduced time complexity of the proposed Radial-Based Undersampling algorithm compared to RBO, and the results of the performed experimental study indicate its usefulness, especially on difficult datasets.
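
To make the notion of mutual class potential concrete, the sketch below gives one plausible Python rendering of a radial-based undersampling procedure. It assumes the Gaussian RBF form of the potential used in the RBO line of work, a hypothetical spread parameter `gamma`, and a simple greedy rule that discards the majority observation with the highest potential (the one sitting deepest inside a majority-dominated region) until the classes are balanced. The function names and the removal rule are illustrative choices, not the exact procedure from the paper.

```python
import numpy as np


def mutual_class_potential(point, majority, minority, gamma=0.25):
    """Mutual class potential at `point`: a sum of Gaussian RBF
    contributions, positive for majority observations and negative
    for minority ones (following the RBO formulation; `gamma` is an
    assumed spread parameter, not a value taken from the paper)."""
    pos = np.exp(-(np.linalg.norm(majority - point, axis=1) / gamma) ** 2).sum()
    neg = np.exp(-(np.linalg.norm(minority - point, axis=1) / gamma) ** 2).sum()
    return pos - neg


def radial_based_undersampling(X, y, majority_label, gamma=0.25):
    """Illustrative greedy undersampler: repeatedly drop the majority
    observation with the highest mutual class potential until both
    classes are the same size. A sketch, not the authors' exact method."""
    majority = X[y == majority_label].copy()
    minority = X[y != majority_label]
    minority_labels = y[y != majority_label]
    while len(majority) > len(minority):
        potentials = np.array([
            mutual_class_potential(p, majority, minority, gamma)
            for p in majority
        ])
        # The point with the highest potential lies deepest inside a
        # majority-dominated region, so removing it is assumed to be safe.
        majority = np.delete(majority, np.argmax(potentials), axis=0)
    X_res = np.vstack([majority, minority])
    y_res = np.concatenate(
        [np.full(len(majority), majority_label, dtype=y.dtype), minority_labels]
    )
    return X_res, y_res
```

Usage follows the familiar resampler pattern, e.g. `X_res, y_res = radial_based_undersampling(X, y, majority_label=0)` for a binary problem with features `X` and labels `y`. Note that each potential evaluation costs O(|majority| + |minority|) distance computations and the greedy loop repeats it for every removed observation, so this sketch illustrates the concept rather than reproducing the complexity gains reported for the actual algorithm.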
