Hybrid Classifier Ensemble for Imbalanced Data

The class imbalance problem remains a central challenge in classification. Conventional imbalance learning methods have been proposed to tackle it, but they have notable limitations: 1) undersampling methods risk discarding important information and 2) cost-sensitive methods are sensitive to outliers and noise. To address these issues, we propose a hybrid optimal ensemble classifier framework that combines density-based undersampling with cost-sensitive methods, searching for strong trade-off solutions via a multi-objective optimization algorithm. Specifically, we first develop a density-based undersampling method that selects informative samples from the original training data using a probability-based data transformation, yielding multiple subsets with a balanced distribution across classes. Second, we apply a cost-sensitive classification method to compensate for the information lost during undersampling, increasing the weights of misclassified minority samples rather than those of the majority class. Finally, we introduce a multi-objective optimization procedure and exploit relationships between samples to refine the classification results within an ensemble classifier framework. Extensive comparative experiments on real-world data sets demonstrate that our method outperforms the majority of imbalance and ensemble classification approaches.
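To make the first step concrete, here is a minimal illustrative sketch of the general idea behind density-based undersampling with probability-based selection. This is not the authors' exact procedure: the k-nearest-neighbor density estimate, the exponential-key weighted sampling, and all function names (`density_undersample`, `balanced_subsets`) are assumptions chosen for illustration. Majority-class points in sparse regions (often near the decision boundary) receive higher selection probability, and repeated draws yield several balanced majority/minority training subsets for an ensemble.

```python
import math
import random

def density_undersample(X_maj, n_keep, k=5, seed=0):
    """Keep n_keep majority points, favoring sparse (informative) regions.

    Density is approximated by the mean distance to the k nearest
    majority neighbors; points in sparser regions get larger weights.
    """
    rnd = random.Random(seed)
    weights = []
    for i, x in enumerate(X_maj):
        # distances to all other majority points, ascending
        d = sorted(math.dist(x, y) for j, y in enumerate(X_maj) if j != i)
        weights.append(sum(d[:k]) / k)  # sparser -> larger mean kNN distance
    # weighted sampling without replacement (exponential-key trick):
    # drawing the n_keep smallest Exp(1)/w_i keys selects index i with
    # probability proportional to w_i
    keyed = sorted(range(len(X_maj)),
                   key=lambda i: rnd.expovariate(1.0) / weights[i])
    return [X_maj[i] for i in keyed[:n_keep]]

def balanced_subsets(X_maj, X_min, n_subsets=3, k=5, seed=0):
    """Build several balanced (majority-subset, minority) training pairs."""
    return [(density_undersample(X_maj, len(X_min), k=k, seed=seed + i),
             X_min)
            for i in range(n_subsets)]
```

Each returned pair has as many majority as minority samples, so a base classifier trained on it sees a balanced distribution; training one base learner per subset gives the ensemble its diversity.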
