Coupling different methods for overcoming the class imbalance problem

Many classification problems must deal with imbalanced datasets where one class - the majority class - outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical.Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches.To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature.Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357.

[1]  HerreraFrancisco,et al.  Addressing imbalanced classification with instance generation techniques , 2014 .

[2]  Jack Y. Yang,et al.  Asymmetric bagging and feature selection for activities prediction of drug molecules , 2008, Second International Multi-Symposiums on Computer and Computational Sciences (IMSCCS 2007).

[3]  Xin Yao,et al.  Multiclass Imbalance Problems: Analysis and Potential Solutions , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[4]  Xin Yao,et al.  Diversity analysis on imbalanced data sets by using ensemble models , 2009, 2009 IEEE Symposium on Computational Intelligence and Data Mining.

[5]  Jiri Matas,et al.  On Combining Classifiers , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Gang Chen,et al.  Dynamic class imbalance learning for incremental LPSVM , 2013, Neural Networks.

[7]  M. Barker,et al.  Partial least squares for discrimination , 2003 .

[8]  Francisco Herrera,et al.  EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling , 2013, Pattern Recognit..

[9]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[10]  Yanqing Zhang,et al.  SVMs Modeling for Highly Imbalanced Classification , 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[11]  David Mease,et al.  Boosted Classification Trees and Class Probability/Quantile Estimation , 2007, J. Mach. Learn. Res..

[12]  Haibo He,et al.  RAMOBoost: Ranked Minority Oversampling in Boosting , 2010, IEEE Transactions on Neural Networks.

[13]  Robert C. Holte,et al.  C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , 2003 .

[14]  Yiqiang Chen,et al.  Weighted extreme learning machine for imbalance learning , 2013, Neurocomputing.

[15]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[16]  Zhenjiang Miao,et al.  Bootstrap FDA for counting positives accurately in imprecise environments , 2007, Pattern Recognit..

[17]  Ethem Alpaydin,et al.  Eigenclassifiers for combining correlated classifiers , 2012, Inf. Sci..

[18]  Szymon Wilk,et al.  IIvotes ensemble for imbalanced data , 2012, Intell. Data Anal..

[19]  Xuelong Li,et al.  Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Nojun Kwak,et al.  Feature extraction for classification problems and its application to face recognition , 2008, Pattern Recognit..

[21]  Loris Nanni,et al.  Reduced Reward-punishment editing for building ensembles of classifiers , 2011, Expert Syst. Appl..

[22]  Zhi-Hua Zhou,et al.  Exploratory Under-Sampling for Class-Imbalance Learning , 2006, Sixth International Conference on Data Mining (ICDM'06).

[23]  Ting Yu,et al.  A Hierarchical VQSVM for Imbalanced Data Sets , 2007, 2007 International Joint Conference on Neural Networks.

[24]  Zhengding Qiu,et al.  The effect of imbalanced data sets on LDA: A theoretical and empirical analysis , 2007, Pattern Recognit..

[25]  Xudong Jiang,et al.  Asymmetric Principal Component and Discriminant Analyses for Pattern Classification , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Szymon Wilk,et al.  Integrating Selective Pre-processing of Imbalanced Data with Ivotes Ensemble , 2010, RSCTC.

[27]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[28]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[29]  David G. Stork,et al.  Pattern Classification , 1973 .

[30]  Loris Nanni,et al.  A genetic encoding approach for learning methods for combining classifiers , 2009, Expert Syst. Appl..

[31]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[32]  Guo-Zheng Li,et al.  An asymmetric classifier based on partial least squares , 2010, Pattern Recognit..

[33]  David G. Stork,et al.  Pattern classification, 2nd Edition , 2000 .

[34]  Cheng-Chew Lim,et al.  Dual /spl nu/-support vector machine with error rate and training size biasing , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[35]  Edward Y. Chang,et al.  KBA: kernel boundary alignment considering imbalanced data distribution , 2005, IEEE Transactions on Knowledge and Data Engineering.

[36]  David J. Hand,et al.  A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems , 2001, Machine Learning.

[37]  Francisco Herrera,et al.  A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[38]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Researchers , 2007 .

[39]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[40]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[41]  Kenneth Strzepek,et al.  A technique for generating regional climate scenarios using a nearest‐neighbor algorithm , 2003 .

[42]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[43]  Zhi-Hua Zhou,et al.  Ensemble Methods: Foundations and Algorithms , 2012 .

[44]  Zhi-Hua Zhou,et al.  Ieee Transactions on Knowledge and Data Engineering 1 Training Cost-sensitive Neural Networks with Methods Addressing the Class Imbalance Problem , 2022 .

[45]  Yuhua Li,et al.  Selecting Critical Patterns Based on Local Geometrical and Statistical Information , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Suhaimi Ibrahim,et al.  Outlier Detection in Stream Data by Machine Learning and Feature Selection Methods , 2014 .

[47]  Taeho Jo,et al.  A Multiple Resampling Method for Learning from Imbalanced Data Sets , 2004, Comput. Intell..

[48]  Ludmila I. Kuncheva,et al.  Combining Pattern Classifiers: Methods and Algorithms , 2004 .

[49]  Jaakko Astola,et al.  A novel strategy for microarray quality control using Bayesian networks , 2003, Bioinform..

[50]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[51]  Li Zhang,et al.  Sparse ensembles using weighted combination methods based on linear programming , 2011, Pattern Recognit..

[52]  Francisco Herrera,et al.  Addressing imbalanced classification with instance generation techniques: IPADE-ID , 2014, Neurocomputing.

[53]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[54]  Jack Y. Yang,et al.  Asymmetric Bagging and Feature Selection for Activities Prediction of Drug Molecules , 2007, IMSCCS.

[55]  Yang Wang,et al.  Cost-sensitive boosting for classification of imbalanced data , 2007, Pattern Recognit..

[56]  et al,et al.  Matching of catalogues by probabilistic pattern classification , 2006 .

[57]  D. M. Titterington,et al.  Do unbalanced data have a negative effect on LDA? , 2008, Pattern Recognit..

[58]  Ton J Cleophas Machine Learning in Therapeutic Research: The Hard Work of Outlier Detection in Large Data. , 2016, American journal of therapeutics.

[59]  Robert Sabourin,et al.  From dynamic classifier selection to dynamic ensemble selection , 2008, Pattern Recognit..

[60]  Taghi M. Khoshgoftaar,et al.  RUSBoost: A Hybrid Approach to Alleviating Class Imbalance , 2010, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[61]  Shuiping Gou,et al.  Greedy optimization classifiers ensemble based on diversity , 2011, Pattern Recognit..

[62]  C. J. van Rijsbergen,et al.  Information Retrieval , 1979, Encyclopedia of GIS.

[63]  Jakub M. Tomczak,et al.  Boosted SVM with active learning strategy for imbalanced data , 2014, Soft Computing.

[64]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[65]  Ying He,et al.  MSMOTE: Improving Classification Performance When Training Data is Imbalanced , 2009, 2009 Second International Workshop on Computer Science and Engineering.

[66]  Hyun-Chul Kim,et al.  Constructing support vector machine ensemble , 2003, Pattern Recognit..

[67]  Subhash C. Bagui,et al.  Combining Pattern Classifiers: Methods and Algorithms , 2005, Technometrics.

[68]  Josef Kittler,et al.  Inverse random under sampling for class imbalance problem and its application to multi-label classification , 2012, Pattern Recognit..