Imputation-Based Ensemble Techniques for Class Imbalance Learning

Correctly classifying rare samples is a vital data mining task and of paramount importance in many research domains. This paper focuses on the development of novel class-imbalance learning techniques that integrate oversampling methods with bagging and boosting ensembles. Two novel oversampling strategies are proposed, one based on single imputation and one on multiple imputation. The proposed techniques create useful synthetic minority-class samples, similar to the original minority-class samples, by estimating missing values that are first induced in the minority-class samples. The rebalanced datasets are then used to train the base learners of the ensemble algorithms. The proposed techniques are compared with commonly used class-imbalance learning methods on three performance metrics (AUC, F-measure, and G-mean) over several synthetic binary-class datasets. The empirical results show that the proposed multiple-imputation-based oversampling combined with bagging significantly outperforms the other competitors.
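The oversampling idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say which imputation models are used, so the per-feature mean (for single imputation) and a Gaussian draw around that mean (as a simple stand-in for the stochastic draws of multiple imputation) are assumptions. The names `impute_oversample`, `rebalance`, and `missing_rate` are hypothetical.

```python
import random
import statistics


def impute_oversample(minority, n_new, missing_rate=0.3, multiple=False, seed=0):
    """Create n_new synthetic rows by copying minority rows, masking some
    features, and filling the masked features by imputation.

    Assumption: mean imputation stands in for the single-imputation model,
    and a Gaussian draw stands in for the multiple-imputation model.
    """
    rng = random.Random(seed)
    n_features = len(minority[0])
    # Per-feature mean and standard deviation of the observed minority class.
    means = [statistics.fmean(row[j] for row in minority) for j in range(n_features)]
    stds = [statistics.pstdev([row[j] for row in minority]) for j in range(n_features)]

    synthetic = []
    for _ in range(n_new):
        row = list(rng.choice(minority))  # start from a real minority sample
        for j in range(n_features):
            if rng.random() < missing_rate:  # induce a missing value here
                if multiple:
                    # Stochastic fill: a different plausible value on each draw.
                    row[j] = rng.gauss(means[j], stds[j])
                else:
                    # Deterministic fill: one best estimate.
                    row[j] = means[j]
        synthetic.append(row)
    return synthetic


def rebalance(majority, minority, **kwargs):
    """Return the minority class topped up with synthetic rows until it
    matches the majority class in size."""
    n_new = max(0, len(majority) - len(minority))
    return minority + impute_oversample(minority, n_new, **kwargs)


if __name__ == "__main__":
    majority = [[float(i), float(i % 3)] for i in range(10)]
    minority = [[1.0, 2.0], [3.0, 4.0]]
    balanced_minority = rebalance(majority, minority, multiple=True, seed=42)
    print(len(majority), len(balanced_minority))
```

In an ensemble setting, each base learner would be trained on its own rebalanced sample, for example by passing a different `seed` per bagging round so that the synthetic samples differ between learners.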
