To combat multi-class imbalanced problems by means of over-sampling and boosting techniques

Imbalanced problems are pervasive in many real-world applications. In an imbalanced distribution, one or more classes, called the minority class(es), are under-represented compared to the others. This skewness in the underlying data distribution causes difficulties for typical machine learning algorithms, and the problem becomes even more complicated when multiple imbalanced classes are involved. Existing solutions for tackling imbalanced distributions generally fall into two main categories: data-oriented methods and model-based algorithms. Focusing on the latter, this paper proposes a blend of the boosting and over-sampling paradigms, called MDOBoost, to improve learning on multi-class imbalanced data sets. The over-sampling technique introduced and adopted here, the Mahalanobis distance-based over-sampling technique (MDO for short), is incorporated into the boosting algorithm: the minority classes are over-sampled via MDO in such a way that the original minority class characteristics are largely preserved. Compared with SMOTE, the popular method in this field, MDO generates minority class examples that are more similar to the original class samples. Moreover, MDO provides a broader representation of the minority class examples, which in turn leads the classifier to build larger decision regions. MDOBoost also increases the generalization ability of a classifier: it yields better results with the pruned version of the C4.5 classifier, unlike other over-sampling/boosting procedures, which have difficulties with pruning. MDOBoost is applied to real-world multi-class imbalanced benchmarks and its performance is compared with several data-level and model-based algorithms. The empirical results and theoretical analyses reveal that MDOBoost outperforms popular class decomposition and over-sampling techniques in terms of MAUC, G-mean, and minority class recall.
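The MDO idea lends itself to a compact illustration. The sketch below is a minimal, hypothetical reading of Mahalanobis distance-based over-sampling: each synthetic minority point is generated so that it keeps the Mahalanobis distance of a randomly chosen seed sample from the class mean, so the synthetic points lie on the same probability contours and roughly preserve the class covariance structure. The function name `mdo_oversample` and its parameters are illustrative, not the paper's reference implementation.

```python
import numpy as np

def mdo_oversample(X_min, n_new, rng=None):
    """Hypothetical sketch of Mahalanobis distance-based over-sampling (MDO).

    Generates n_new synthetic samples for one minority class so that each
    synthetic point has the same Mahalanobis distance from the class mean
    as a randomly chosen seed sample.
    """
    rng = np.random.default_rng(rng)
    mu = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False)
    # Eigendecomposition of the class covariance (principal axes).
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, 1e-12, None)          # guard against degeneracy
    Z = (X_min - mu) @ vecs                    # seeds in the eigenbasis
    seeds = Z[rng.integers(0, len(Z), n_new)]  # pick random seed samples
    # Squared Mahalanobis distance of each seed from the class mean.
    d2 = ((seeds ** 2) / vals).sum(axis=1)
    # Draw random directions and rescale each so the synthetic point keeps
    # its seed's Mahalanobis distance (stays on the same contour).
    u = rng.standard_normal(seeds.shape)
    scale = np.sqrt(d2 / ((u ** 2) / vals).sum(axis=1))
    Z_new = u * scale[:, None]
    return Z_new @ vecs.T + mu                 # map back to input space
```

Because the synthetic points are constrained to the class's own Mahalanobis contours rather than to line segments between neighbors (as in SMOTE), they follow the elliptical spread of the minority class, which is the intuition behind the "broader representation" claim above.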
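On the boosting side, a hedged sketch of how such over-sampling could be embedded in a boosting loop follows, in the spirit of SMOTEBoost but with MDO in place of SMOTE. The weight update shown is a generic multi-class AdaBoost (SAMME-style) rule assumed purely for illustration; the paper's exact scheme may differ, and `mdoboost_fit` is a hypothetical name.

```python
from sklearn.tree import DecisionTreeClassifier
import numpy as np

def mdoboost_fit(X, y, n_rounds=10, rng=None):
    """Hypothetical MDOBoost-style loop: at every boosting round the
    minority classes are first rebalanced with MDO, then a weak learner
    is trained on the augmented set; weights follow a SAMME-style update."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    w = np.full(len(X), 1.0 / len(X))
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Rebalance this round's training set with MDO samples.
        X_r, y_r, w_r = [X], [y], [w]
        for c, n_c in zip(classes, counts):
            if n_c < n_max:
                X_syn = mdo_oversample(X[y == c], n_max - n_c, rng)
                X_r.append(X_syn)
                y_r.append(np.full(len(X_syn), c))
                w_r.append(np.full(len(X_syn), w[y == c].mean()))
        X_r, y_r, w_r = map(np.concatenate, (X_r, y_r, w_r))
        stump = DecisionTreeClassifier(max_depth=2)
        stump.fit(X_r, y_r, sample_weight=w_r)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()
        if err >= 1 - 1.0 / len(classes):      # no better than chance
            break
        alpha = np.log((1 - err) / max(err, 1e-12)) + np.log(len(classes) - 1)
        w *= np.exp(alpha * (pred != y))       # up-weight misclassified points
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```

Prediction would then be a weighted vote over the stored learners, as in standard multi-class AdaBoost.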
