BPSO-Adaboost-KNN ensemble learning algorithm for multi-class imbalanced data classification

This paper proposes an ensemble algorithm, BPSO-Adaboost-KNN, for multi-class imbalanced data classification. The main idea of the algorithm is to integrate feature selection and boosting into a single ensemble. In addition, we adopt a novel evaluation metric, AUCarea, designed specifically for multi-class classification. In our model, binary particle swarm optimization (BPSO) performs the feature selection, with AUCarea serving as its fitness function; for classification, we build a boosting classifier that uses KNN as the base learner. To verify the effectiveness of the method, 19 benchmark data sets are used in our experiments. The results show that carrying out feature selection improves both the stability and the accuracy of boosting, and that the performance of our algorithm is comparable with that of other state-of-the-art algorithms. In the statistical analyses, we apply Bland-Altman analysis to show the consistency between AUCarea and other popular metrics such as average G-mean and average F-value, and we use linear regression to uncover deeper correlations between AUCarea and these metrics, which helps explain why AUCarea works well for this problem. We also conduct a series of statistical tests to examine whether feature selection and boosting bring significant improvements. Finally, the proposed algorithm is applied to oil-bearing reservoir recognition. Its classification precision reaches 99% on the oilsk81-oilsk85 well-logging data from the Jianghan oilfield in China, which is 20% higher than that of the KNN classifier alone; in particular, the proposed algorithm shows a clear advantage in distinguishing the oil layer from the other layers.
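
The sketch below is a minimal illustration of the classifier side of the pipeline described above, not the authors' implementation: it boosts KNN base learners with SAMME-style weighting, using weighted bootstrap resampling because scikit-learn's KNeighborsClassifier cannot consume sample weights directly, and it reports the average pairwise one-vs-one AUC as a rough stand-in for AUCarea. The BPSO feature-selection stage and the exact AUCarea aggregation are omitted; the data set, function names, and parameters are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): SAMME-style AdaBoost over KNN
# base learners, with weighted bootstrap resampling in place of sample weights.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def boosted_knn(X, y, n_rounds=10, n_neighbors=5, seed=0):
    """Train a SAMME-style boosted ensemble of KNN classifiers."""
    rng = np.random.default_rng(seed)
    n, classes = len(X), np.unique(y)
    k = len(classes)
    w = np.full(n, 1.0 / n)                                # per-sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, replace=True, p=w)     # weighted resample
        clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X[idx], y[idx])
        miss = clf.predict(X) != y
        err = max(np.dot(w, miss), 1e-10)
        if err >= 1.0 - 1.0 / k:                           # no better than chance
            continue
        alpha = np.log((1.0 - err) / err) + np.log(k - 1.0)  # SAMME learner weight
        w *= np.exp(alpha * miss)                          # up-weight mistakes
        w /= w.sum()
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas, classes


def ensemble_scores(learners, alphas, classes, X):
    """Normalized weighted votes; each row sums to one and acts as a score vector."""
    votes = np.zeros((len(X), len(classes)))
    for clf, alpha in zip(learners, alphas):
        pred = clf.predict(X)
        for j, c in enumerate(classes):
            votes[pred == c, j] += alpha
    return votes / votes.sum(axis=1, keepdims=True)


X, y = load_iris(return_X_y=True)                          # stand-in multi-class data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
learners, alphas, classes = boosted_knn(X_tr, y_tr)
scores = ensemble_scores(learners, alphas, classes, X_te)
y_pred = classes[scores.argmax(axis=1)]
print("accuracy:", np.mean(y_pred == y_te))

# Rough stand-in for AUCarea: the mean pairwise one-vs-one AUC. The paper
# aggregates pairwise AUCs into an area, so this is only a related summary.
print("mean pairwise OVO AUC:", roc_auc_score(y_te, scores, multi_class="ovo"))
```

In practice, a wrapper like boosted_knn would sit inside the BPSO loop, being retrained on each candidate feature subset and scored by the multi-class AUC summary used as the swarm's fitness.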
