A Review of Fuzzy and Pattern-Based Approaches for Class Imbalance Problems

Class imbalance is a recurrent problem in real-world data from domains such as medical diagnosis, fraud detection, and pattern recognition. In class imbalance problems, classifiers are commonly biased toward the class with more objects (the majority class) and tend to ignore the class with fewer objects (the minority class). There are different ways to address the class imbalance problem, and there has been a trend towards pattern-based and fuzzy approaches because of their favorable results. In this paper, we provide an in-depth review of popular pattern-based and fuzzy methods for imbalanced databases. The reviewed papers cover classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used in each. Finally, we suggest further research directions based on our analysis of the reviewed papers and on trends in the state of the art.
