KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning

In imbalanced learning, most standard classification algorithms fail to properly represent the data distribution and therefore deliver unfavorable classification performance. More specifically, the decision rule learned for the minority class is usually weaker than that for the majority class, leading to frequent misclassification of costly minority-class data. Motivated by our previous work ADASYN [24], this paper presents a novel kernel-based adaptive synthetic over-sampling approach, named KernelADASYN, for imbalanced data classification problems. The idea is to construct an adaptive over-sampling distribution from which synthetic minority-class data are generated. This distribution is first estimated with kernel density estimation methods and is then weighted by the difficulty level of each minority-class example. The classification performance of the proposed adaptive over-sampling approach is evaluated on several real-life benchmarks, with an emphasis on medical and healthcare applications. The experimental results show competitive classification performance on many real-life imbalanced data classification problems.
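To make the idea concrete, here is a minimal Python sketch of such a procedure (not the authors' implementation; the function name, the parameters k and bandwidth, and the use of scikit-learn's KernelDensity are illustrative assumptions). It scores each minority sample's difficulty by the fraction of majority-class points among its k nearest neighbors, as in ADASYN [24], turns these scores into sampling weights, and draws synthetic minority data from a weighted Gaussian kernel density estimate:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, KernelDensity

def kernel_adasyn_sample(X, y, minority_label=1, k=5, bandwidth=0.5,
                         random_state=0):
    """Draw synthetic minority samples until both classes are balanced."""
    X_min = X[y == minority_label]
    X_maj = X[y != minority_label]
    n_syn = len(X_maj) - len(X_min)      # number of synthetic samples needed
    if n_syn <= 0:
        return np.empty((0, X.shape[1]))

    # Difficulty of each minority point: fraction of majority-class points
    # among its k nearest neighbours in the full data set.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)        # first neighbour is the point itself
    difficulty = (y[idx[:, 1:]] != minority_label).mean(axis=1)

    # Turn difficulties into strictly positive sampling weights.
    w = difficulty + 1e-6
    w /= w.sum()

    # The weighted Gaussian KDE over the minority class plays the role of
    # the adaptive over-sampling distribution; sample synthetic points from it.
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
    kde.fit(X_min, sample_weight=w)
    return kde.sample(n_samples=n_syn, random_state=random_state)
```

In a sketch like this, the kernel bandwidth controls how far synthetic samples can stray from observed minority data, so it would typically be tuned, for example by cross-validation on the minority class.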

[1] M. Elter, et al. The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process, 2007, Medical Physics.

[2] Piotr Indyk, et al. Approximate nearest neighbors: towards removing the curse of dimensionality, 1998, STOC '98.

[3] Sheng Chen, et al. PDFOS: PDF estimation based over-sampling for imbalanced two-class problems, 2014, Neurocomputing.

[4] David J. Hand, et al. Measuring classifier performance: a coherent alternative to the area under the ROC curve, 2009, Machine Learning.

[5] Zhi-Hua Zhou, et al. Exploratory Undersampling for Class-Imbalance Learning, 2009, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[6] Max A. Little, et al. Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection, 2007.

[7] Sheng Chen, et al. A Kernel-Based Two-Class Classifier for Imbalanced Data Sets, 2007, IEEE Transactions on Neural Networks.

[8] David Mease, et al. Boosted Classification Trees and Class Probability/Quantile Estimation, 2007, J. Mach. Learn. Res.

[9] J. Jossinet. Variability of impedivity in normal and pathological breast tissue, 1996, Medical and Biological Engineering and Computing.

[10] Zhi-Hua Zhou, et al. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem, 2006, IEEE Transactions on Knowledge and Data Engineering.

[11] Haibo He, et al. Learning from Imbalanced Data, 2009, IEEE Transactions on Knowledge and Data Engineering.

[12] Pedro M. Domingos. MetaCost: a general method for making classifiers cost-sensitive, 1999, KDD '99.

[13] Yee Whye Teh, et al. A Fast Learning Algorithm for Deep Belief Nets, 2006, Neural Computation.

[14] H. Labelle, et al. Analysis of the Sagittal Balance of the Spine and Pelvis Using Shape and Orientation Parameters, 2005, Journal of Spinal Disorders & Techniques.

[15] Edward Y. Chang, et al. KBA: kernel boundary alignment considering imbalanced data distribution, 2005, IEEE Transactions on Knowledge and Data Engineering.

[16] Tom Fawcett, et al. ROC Graphs: Notes and Practical Considerations for Researchers, 2007.

[17] Ian T. Jolliffe, et al. Principal Component Analysis, 2002, International Encyclopedia of Statistical Science.

[18] Richard S. Johannes, et al. Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus, 1988.

[19] N. B. Venkateswarlu, et al. A Critical Comparative Study of Liver Patients from USA and INDIA: An Exploratory Analysis, 2012.

[20] Nitesh V. Chawla, et al. SMOTEBoost: Improving Prediction of the Minority Class in Boosting, 2003, PKDD.

[21] H. Tong, et al. Article: 2, 2002, European Financial Services Law.

[22] Lukasz A. Kurgan, et al. Knowledge discovery approach to automated cardiac SPECT diagnosis, 2001, Artif. Intell. Medicine.

[23] Bo Tang, et al. Hybrid classification with partial models, 2014, 2014 International Joint Conference on Neural Networks (IJCNN).

[24] Haibo He, et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning, 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[25] Sungzoon Cho, et al. EUS SVMs: Ensemble of Under-Sampled SVMs for Data Imbalance Problems, 2006, ICONIP.

[26] Nanda Kambhatla, et al. Dimension Reduction by Local Principal Component Analysis, 1997, Neural Computation.

[27] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[28] Yunqian Ma, et al. Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

[29] Guang Yang, et al. L1 Graph Based on Sparse Coding for Feature Selection, 2013, ISNN.

[30] Bo Tang, et al. ENN: Extended Nearest Neighbor Method for Pattern Recognition [Research Frontier], 2015, IEEE Computational Intelligence Magazine.

[31] Guang Yang, et al. Sparse-Representation-Based Classification with Structure-Preserving Dimension Reduction, 2014, Cognitive Computation.

[32] Nitesh V. Chawla, et al. SMOTE: Synthetic Minority Over-sampling Technique, 2002, J. Artif. Intell. Res.

[33] Nada Lavrac, et al. The Multi-Purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains, 1986, AAAI.

[34] Haibo He, et al. RAMOBoost: Ranked Minority Oversampling in Boosting, 2010, IEEE Transactions on Neural Networks.

[35] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[36] Robert C. Holte, et al. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling, 2003.

[37] Piotr Indyk, et al. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality, 2012, Theory Comput.

[38] Xiangji Huang, et al. Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles, 2006, PAKDD.

[39] Haibo He, et al. Feature selection based on sparse imputation, 2012, The 2012 International Joint Conference on Neural Networks (IJCNN).

[40] Bo Tang, et al. A Parametric Classification Rule Based on the Exponentially Embedded Family, 2015, IEEE Transactions on Neural Networks and Learning Systems.

[41] Fernando Vilariño, et al. Experiments with SVM and Stratified Sampling with an Imbalanced Problem: Detection of Intestinal Contractions, 2005, ICAPR.