An “Outside the Box” Solution for Imbalanced Data Classification

Class imbalance is a common problem in real-world data sets and can significantly degrade classifier performance. Numerous methods have been proposed to cope with this problem; however, even state-of-the-art methods offer limited improvement (if any) for data sets with critically under-represented minority classes. Such problematic cases require an “outside the box” solution. We therefore propose a novel technique, called enrichment, which uses information (observations) from external data sets. We present three approaches to implementing the enrichment technique: (1) selecting observations randomly, (2) iteratively choosing observations that improve the classification result, and (3) adding observations that help the classifier better determine the border between classes. We then thoroughly analyze the developed solutions on ten real-world data sets to validate their usefulness experimentally. On average, our best approach improves classification quality by 27%, and in the best case by 66%. We also compare our technique with state-of-the-art methods and find that it surpasses them, performing 21% better on average. The advantage is especially noticeable for the smallest data sets, on which existing methods failed while our solutions achieved the best results. Additionally, the enrichment technique applies to both multi-class and binary classification tasks, and it can be combined with other techniques that deal with the class imbalance problem.
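To make the first two approaches concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy nearest-centroid classifier as a stand-in for any real model, balanced accuracy as the quality measure, and a held-out validation set for deciding whether an external observation helps; the function names (`enrich_random`, `enrich_greedy`) and the toy data are hypothetical. The third, border-based approach is not sketched, because the abstract does not specify how border observations are identified.

```python
import random
from collections import defaultdict

def centroid_fit(X, y):
    """Fit a toy nearest-centroid model: one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] += 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def centroid_predict(model, row):
    """Return the class whose centroid is closest (squared Euclidean)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, model[c])))

def balanced_accuracy(model, X, y):
    """Mean of per-class recalls, so the minority class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row, label in zip(X, y):
        totals[label] += 1
        hits[label] += centroid_predict(model, row) == label
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def enrich_random(X, y, ext_X, ext_y, k, seed=0):
    """Approach (1), assumed form: append k randomly chosen external observations."""
    idx = random.Random(seed).sample(range(len(ext_X)), k)
    return X + [ext_X[i] for i in idx], y + [ext_y[i] for i in idx]

def enrich_greedy(X, y, ext_X, ext_y, val_X, val_y):
    """Approach (2), assumed form: keep an external observation only if it
    strictly improves the validation score of the retrained model."""
    X, y = list(X), list(y)
    best = balanced_accuracy(centroid_fit(X, y), val_X, val_y)
    for row, label in zip(ext_X, ext_y):
        score = balanced_accuracy(centroid_fit(X + [row], y + [label]), val_X, val_y)
        if score > best:
            X.append(row)
            y.append(label)
            best = score
    return X, y, best

# Toy data (hypothetical): four majority points and one minority point.
train_X = [[0, 0], [0, 1], [1, 0], [1, 1], [1.2, 1.2]]
train_y = [0, 0, 0, 0, 1]
val_X = [[0, 0], [1, 1], [4, 4], [5, 5]]
val_y = [0, 0, 1, 1]
# External data set: one misleading observation and two useful minority ones.
ext_X = [[10, 10], [5, 5], [4, 4]]
ext_y = [0, 1, 1]

baseline = balanced_accuracy(centroid_fit(train_X, train_y), val_X, val_y)
greedy_X, greedy_y, greedy_score = enrich_greedy(
    train_X, train_y, ext_X, ext_y, val_X, val_y
)
random_X, random_y = enrich_random(train_X, train_y, ext_X, ext_y, k=2)
print(baseline, greedy_score, len(greedy_X), len(random_X))
```

In this sketch the greedy variant rejects external observations that lower or merely match the current score, so the enriched training set grows only when the validation result improves. Because the same validation set drives both selection and scoring, the reported score is optimistic; a separate test set would be needed for an unbiased estimate.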
