ACO Resampling: Enhancing the performance of oversampling methods for class imbalance classification

Abstract Many sampling-based preprocessing methods have been proposed to solve the problem of unbalanced dataset classification. The fundamental principle of these methods is rebalancing an unbalanced dataset by a concrete strategy. Herein, we introduce a novel hybrid proposal named ant colony optimization resampling (ACOR) to overcome class imbalance classification. ACOR primarily includes two steps: first, it rebalances an imbalanced dataset by a specific oversampling algorithm; next, it finds an (sub)optimal subset from the balanced dataset by ant colony optimization. Unlike other oversampling techniques, ACOR does not focus on the mechanics of generating new samples. The main advantage of ACOR is that existing oversampling algorithms can be fully utilized and an ideal training set can be obtained by ant colony optimization. Therefore, ACOR can enhance the performance of existing oversampling algorithms. Experimental results on 18 real imbalanced datasets prove that ACOR yields significantly better results compared with four popular oversampling methods in terms of various assessment metrics, such as AUC, G-mean, and BACC.

[1]  Francisco Herrera,et al.  ROSEFW-RF: The winner algorithm for the ECBDL'14 big data competition: An extremely imbalanced big data bioinformatics problem , 2015, Knowl. Based Syst..

[2]  Johan A. K. Suykens,et al.  Least Squares Support Vector Machine Classifiers , 1999, Neural Processing Letters.

[3]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[4]  Haibo He,et al.  RAMOBoost: Ranked Minority Oversampling in Boosting , 2010, IEEE Transactions on Neural Networks.

[5]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[7]  Xindong Wu,et al.  Online feature selection for high-dimensional class-imbalanced data , 2017, Knowl. Based Syst..

[8]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[9]  Karim Faez,et al.  An improved feature selection method based on ant colony optimization (ACO) evaluated on face recognition system , 2008, Appl. Math. Comput..

[10]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[11]  Li Li,et al.  A Density-based Under-sampling Algorithm for Imbalance Classification , 2019, Journal of Physics: Conference Series.

[12]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[13]  José Hernández-Orallo,et al.  An experimental comparison of performance measures for classification , 2009, Pattern Recognit. Lett..

[14]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[15]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[16]  Myong Kee Jeong,et al.  Class dependent feature scaling method using naive Bayes classifier for text datamining , 2009, Pattern Recognit. Lett..

[17]  Bartosz Krawczyk,et al.  Learning from imbalanced data: open challenges and future directions , 2016, Progress in Artificial Intelligence.

[18]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[19]  Emilio Corchado,et al.  A survey of multiple classifier systems as hybrid systems , 2014, Inf. Fusion.

[20]  David A. Cieslak,et al.  Automatically countering imbalance and its empirical relationship to cost , 2008, Data Mining and Knowledge Discovery.

[21]  Jing Zhao,et al.  ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data , 2013, Neurocomputing.

[22]  Sin-Jin Lin,et al.  Multiple extreme learning machines for a two-class imbalance corporate life cycle prediction , 2013, Knowl. Based Syst..

[23]  Hewijin Christine Jiau,et al.  Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem , 2006 .

[24]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Data Mining Researchers , 2003 .

[25]  Yunqian Ma,et al.  Imbalanced Learning: Foundations, Algorithms, and Applications , 2013 .

[26]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[27]  Andrew K. C. Wong,et al.  Classification of Imbalanced Data: a Review , 2009, Int. J. Pattern Recognit. Artif. Intell..

[28]  Ligang Zhou,et al.  Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods , 2013, Knowl. Based Syst..

[29]  Bojan Cukic,et al.  Caveats , 2020, The African Continental Free Trade Area: Economic and Distributional Effects.

[30]  Jesús Ariel Carrasco-Ochoa,et al.  PBC4cip: A new contrast pattern-based classifier for class imbalance problems , 2017, Knowl. Based Syst..

[31]  L. Javier García-Villalba,et al.  Hybrid ACO Routing Protocol for Mobile Ad Hoc Networks , 2013, Int. J. Distributed Sens. Networks.

[32]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[33]  Hongyuan Zha,et al.  Entropy-based fuzzy support vector machine for imbalanced datasets , 2017, Knowl. Based Syst..

[34]  Luca Maria Gambardella,et al.  Ant colony system: a cooperative learning approach to the traveling salesman problem , 1997, IEEE Trans. Evol. Comput..

[35]  Gregory W. Corder,et al.  Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach , 2009 .

[36]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[37]  David Mease,et al.  Boosted Classification Trees and Class Probability/Quantile Estimation , 2007, J. Mach. Learn. Res..

[38]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[39]  M. Tyers,et al.  Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. , 2002, Cancer research.

[40]  Taeho Jo,et al.  A Multiple Resampling Method for Learning from Imbalanced Data Sets , 2004, Comput. Intell..

[41]  Francisco Herrera,et al.  SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory , 2012, Knowledge and Information Systems.

[42]  Zhi-Hua Zhou,et al.  ON MULTI‐CLASS COST‐SENSITIVE LEARNING , 2006, Comput. Intell..

[43]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[44]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[45]  Sotiris B. Kotsiantis,et al.  Uncertainty Based Under-Sampling for Learning Naive Bayes Classifiers Under Imbalanced Data Sets , 2020, IEEE Access.

[46]  Tom Fawcett,et al.  Combining Data Mining and Machine Learning for Effective User Profiling , 1996, KDD.