Generative learning for imbalanced data using the Gaussian mixed model

Abstract: Imbalanced data classification, an important type of classification task, is challenging for standard learning algorithms. Several strategies exist to handle the problem; among them, data-level methods are popular and have attracted considerable attention from researchers in recent years. However, most data-level approaches generate new instances linearly from local neighbor information rather than from the overall data distribution. In contrast to these algorithms, in this study we develop a new data-level method, generative learning (GL), to deal with imbalanced problems. In GL, we fit the distribution of the original data with the Gaussian mixed model and generate new data from the fitted distribution. The generated data, comprising synthetic minority and majority classes, are used to train learning models. The proposed method is validated through experiments on real-world data sets. Results show that our approach is competitive with other methods, such as SMOTE, SMOTE-ENN, SMOTE-TomekLinks, Borderline-SMOTE, and Safe-Level-SMOTE. The Wilcoxon signed-rank test is applied, and its results again show the significant superiority of our proposal.
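The abstract describes GL only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes one-dimensional features, a fixed number of mixture components, and a plain EM fit written with the Python standard library. The function names (`fit_gmm_1d`, `sample_gmm_1d`, `generate_balanced`), the component count `k`, and the sample sizes are all hypothetical choices made for illustration.

```python
import math
import random


def fit_gmm_1d(xs, k=2, iters=50, seed=0):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only).

    Returns a list of (weight, mean, variance) tuples.
    """
    rng = random.Random(seed)
    means = rng.sample(xs, k)  # initialise the means from the data
    mean_all = sum(xs) / len(xs)
    var_all = sum((x - mean_all) ** 2 for x in xs) / len(xs) + 1e-6
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = (
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
            )
    return list(zip(weights, means, variances))


def sample_gmm_1d(params, n, rng):
    """Draw n points: pick a component by weight, then sample its Gaussian."""
    weights = [w for w, _, _ in params]
    out = []
    for _ in range(n):
        _, m, v = rng.choices(params, weights=weights, k=1)[0]
        out.append(rng.gauss(m, math.sqrt(v)))
    return out


def generate_balanced(data_by_class, n_per_class, k=2, seed=0):
    """Fit one mixture per class and draw the same number of synthetic points
    from each, so that both minority and majority data are generated."""
    rng = random.Random(seed)
    synthetic = {}
    for label, xs in data_by_class.items():
        params = fit_gmm_1d(xs, k=k, seed=seed)
        synthetic[label] = sample_gmm_1d(params, n_per_class, rng)
    return synthetic


if __name__ == "__main__":
    rng = random.Random(42)
    majority = [rng.gauss(0.0, 1.0) for _ in range(500)]  # toy majority class
    minority = [rng.gauss(4.0, 0.5) for _ in range(25)]   # toy minority class
    balanced = generate_balanced(
        {"majority": majority, "minority": minority}, n_per_class=300
    )
    print({label: len(xs) for label, xs in balanced.items()})
```

Unlike neighbor-based interpolation, every synthetic point here is drawn from a density fitted to the whole class, which is the contrast the abstract draws with SMOTE-style methods. A classifier would then be trained on the synthetic sets; how the original and generated data are combined in the actual method is not stated in the abstract.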
