Under-sampling class imbalanced datasets by combining clustering analysis and instance selection

Abstract Class-imbalanced datasets, i.e., those with the number of data samples in one class being much larger than that in another class, occur in many real-world problems. Using these datasets, it is very difficult to construct effective classifiers based on the current classification algorithms, especially for distinguishing small or minority classes from the majority class. To solve the class imbalance problem, the under/oversampling techniques have been widely used to reduce and enlarge the numbers of data samples in the majority and minority classes, respectively. Moreover, the combinations of certain sampling approaches with ensemble classifiers have shown reasonably good performance. In this paper, a novel undersampling approach called cluster-based instance selection (CBIS) that combines clustering analysis and instance selection is introduced. The clustering analysis component groups similar data samples of the majority class dataset into ‘subclasses’, while the instance selection component filters out unrepresentative data samples from each of the ‘subclasses’. The experimental results based on the KEEL dataset repository show that the CBIS approach can make bagging and boosting-based MLP ensemble classifiers perform significantly better than six state-of-the-art approaches, regardless of what kinds of clustering (affinity propagation and k -means) and instance selection (IB3, DROP3 and GA) algorithms are used.

[1]  Stefanos Zafeiriou,et al.  A survey on face detection in the wild: Past, present and future , 2015, Comput. Vis. Image Underst..

[2]  Rosa Maria Valdovinos,et al.  New Applications of Ensembles of Classifiers , 2003, Pattern Analysis & Applications.

[3]  Xin Yao,et al.  Diversity analysis on imbalanced data sets by using ensemble models , 2009, 2009 IEEE Symposium on Computational Intelligence and Data Mining.

[4]  Chih-Fong Tsai,et al.  Clustering-based undersampling in class-imbalanced data , 2017, Inf. Sci..

[5]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[6]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[7]  Francisco Herrera,et al.  Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study , 2003, IEEE Trans. Evol. Comput..

[8]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[9]  Mikel Galar,et al.  Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy , 2016, Appl. Soft Comput..

[10]  Tony R. Martinez,et al.  Reduction Techniques for Instance-Based Learning Algorithms , 2000, Machine Learning.

[11]  Francisco Herrera,et al.  Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data , 2015, Fuzzy Sets Syst..

[12]  Francisco Herrera,et al.  A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[13]  Francisco Herrera,et al.  Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Zahir Tari,et al.  KRNN: k Rare-class Nearest Neighbour classification , 2017, Pattern Recognit..

[15]  Vincent Vigneron,et al.  A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting , 2015, Pattern Analysis and Applications.

[16]  Sattar Hashemi,et al.  To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques , 2016, IEEE Transactions on Knowledge and Data Engineering.

[17]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[18]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[19]  D. Liberati,et al.  Training digital circuits with Hamming clustering , 2000 .

[20]  Ruchika Malhotra,et al.  A systematic review of machine learning techniques for software fault prediction , 2015, Appl. Soft Comput..

[21]  David W. Aha,et al.  Instance-Based Learning Algorithms , 1991, Machine Learning.

[22]  Changyin Sun,et al.  Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data , 2015, Knowl. Based Syst..

[23]  Delbert Dueck,et al.  Clustering by Passing Messages Between Data Points , 2007, Science.

[24]  Dimitrios I. Fotiadis,et al.  Machine learning applications in cancer prognosis and prediction , 2014, Computational and structural biotechnology journal.

[25]  Andrew K. C. Wong,et al.  Classification of Imbalanced Data: a Review , 2009, Int. J. Pattern Recognit. Artif. Intell..

[26]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[27]  Vishal Mahajan,et al.  Review of Data Mining Techniques for Churn Prediction in Telecom , 2015 .

[28]  Wei-Yang Lin,et al.  Machine Learning in Financial Crisis Prediction: A Survey , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[29]  Taghi M. Khoshgoftaar,et al.  RUSBoost: A Hybrid Approach to Alleviating Class Imbalance , 2010, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[30]  Yue-Shi Lee,et al.  Cluster-based under-sampling approaches for imbalanced data distributions , 2009, Expert Syst. Appl..

[31]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[32]  Marco Muselli,et al.  Binary Rule Generation via Hamming Clustering , 2002, IEEE Trans. Knowl. Data Eng..

[33]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[34]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[35]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[36]  Maumita Bhattacharya,et al.  Intelligent Financial Fraud Detection: A Comprehensive Review , 2015 .