KNN-Based Overlapping Samples Filter Approach for Classification of Imbalanced Data

Classifying imbalanced data is a challenging problem that arises in many real-world data sets. The class distribution of an imbalanced data set strongly affects the classification rate of learning classifiers. If the class distribution problem is not addressed before the learning algorithm is applied, the classifier's predictions tend to favor the class with many samples (the majority class) and ignore the class with few samples (the minority class). In addition, class overlap makes it harder to classify minority class samples correctly. In this paper, we propose an effective under-sampling method for classifying imbalanced and overlapping data using a KNN-based overlapping samples filter. The paper also analyzes the performance of three ensemble-based learning classifiers with the proposed method. Experimental results on fifteen imbalanced data sets indicate that the proposed under-sampling method can effectively improve the five representative algorithms in terms of three popular metrics: area under the curve (AUC), G-mean, and F-measure.
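The abstract does not spell out the filtering rule, so the following is only a minimal sketch of what a KNN-based overlap filter for under-sampling might look like, not the authors' exact method. It assumes that a majority class sample counts as "overlapping" when at least `threshold` of its `k` nearest neighbors belong to another class, and that such samples are removed. The function name `knn_overlap_filter` and the parameters `k` and `threshold` are hypothetical, chosen for illustration.

```python
import numpy as np


def knn_overlap_filter(X, y, majority_label, k=5, threshold=1):
    """Remove majority samples that sit in a region shared with other classes.

    Assumption (not taken from the paper): a majority sample is treated as
    overlapping when at least `threshold` of its k nearest neighbors carry a
    label different from `majority_label`.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise squared Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest neighbors of every sample.
    k = min(k, len(X) - 1)
    nn_idx = np.argsort(dist, axis=1)[:, :k]

    # Count neighbors that do not belong to the majority class.
    other_class_neighbors = (y[nn_idx] != majority_label).sum(axis=1)

    # Drop only majority samples; minority samples are always kept.
    drop = (y == majority_label) & (other_class_neighbors >= threshold)
    return X[~drop], y[~drop]


# Toy data: five majority points (label 0) in one cluster, one majority
# point placed inside a cluster of three minority points (label 1).
X_demo = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5],
    [10.0, 10.0],
    [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
])
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

X_filtered, y_filtered = knn_overlap_filter(X_demo, y_demo, majority_label=0, k=3)
```

In this toy example, only the majority point located inside the minority cluster is removed, while every minority sample is retained. The brute-force distance matrix keeps the sketch dependency-free but scales quadratically with the number of samples; a tree-based neighbor search would be the usual choice for larger data sets.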
